For a while, we tried to solve the universal prediction problem by forcing
an analogy to the universal coding problem. However,
although we felt that optimal Bayes
prediction with respect to the conditional
probabilities induced by the coding algorithm should work, we could not
prove it. So we decided to concentrate first on a simpler problem:
Find a non-anticipating predictor that outperforms, for any sequence, the best ``fixed''
(single-state) predictor, that predicts either constantly ``0'' if
or constantly ``1'' otherwise, where
and
are the number of zeros and
ones along the sequence.
At first, it was not clear how to achieve even
this modest goal for an arbitrary sequence.
The best single-state predictor
makes the optimal Bayesian prediction
with respect to the empirical first-order letter probabilities of the data.
Thus, an intuitive suggestion for non-anticipating predictor is to
count the number of zeros and ones,
and
,
along the sequence
, calculate the empirical probabilities
, and
then predict
if
,
if
, and flip a fair coin otherwise.
This obvious ``majority'' predictor seems to work well in many examples,
and certainly in the case where the sequence comes from a Bernoulli source, where with
high probability it will converge, after a while, to the optimal Bayesian choice.
However, this predictor may be bad for some
sequences. For example, the sequence
:
While the fraction of errors made by the best single state predictor
(which can be either ``0'' or ``1'' in this example) is
, the sequential
majority predictor makes errors 75% of the time on the average.
Studying this example further, we realized that the difficulty
lies in the fact that
for this sequence converges to
which is a discontinuity point of the optimal Bayesian predictor,
that predicts ``0'' when
and ``1'' when
.
Well, if this is the case, what happens if we force the decision
of the universal predictor to be continuous
with respect to the empirical probability estimate? How about the predictor:
where
is the continuous, piecewise linear, function depicted in Figure 1?
The randomized predictor (1) provides
a decision rule that is continuous with respect to the estimated empirical
probability. Note that the randomization occurs only when the empirical
probability estimate is
close to
where there is a danger of making
a ``sizeable'' error in comparison with the optimal Bayes decision.
Forcing the decision rule to be continuous was indeed a good idea
as we were able to prove that the predictor (1) attains
asymptotically the single-state predictability for any sequence.
)
on the log-loss problem played a special role since it was easier to obtain, and so
it served as a vehicle to prove similar
results for other loss functions.