MDL is a statistical inference principle motivated by IT
(Rissanen 1978, 1989). It says that among various possible
stochastic models (or model classes) for a data sequence
, one should select that yielding
the shortest code for
, taking into account also
the bits needed to describe the model (model class)
that has been used for the encoding.
MDL has naturally lead to a strong interplay
with statistics of the theory of universal data
compression in IT. This would deserve detailed coverage,
but because of limited space we will
consider just one example, the MDL approach
to Markov order estimation.
For binary sequences
, consider the model classes
``i.i.d.,'' ``first order Markov,'' ``second order Markov,''
..., order
Markov. A (prefix condition) code
may be regarded optimal for a model class if the
maximum over the model class of either its
mean-redundancy or its max-redundancy is
the smallest possible. It is known from IT that a code
optimal
in either sense for the ``order k Markov'' model class
(k=0 meaning i.i.d.), has codeword length
Here
is the maximum of the probability of
for order k Markov sources. Hence, disregarding
the O(1) term, the order k yielding minimum
codelength for
will be
This
is taken as the MDL estimate
of the Markov order k.
Notice that now the description length for the model
class is constant over the considered (finite
number of) classes, hence it does not enter the above
comparison.
Eq. (3.12) is an instance of a more general result
that to chose among model classes involving different
number of parameters, the criterion given by MDL
is maximized log-likelihood minus the number of
parameters times
. In statistics,
this is known as the BIC criterion, and it enjoys
desirable properties.
It is interesting to note that a previous ``penalized maximum likelihood criterion'' known as AIC had also been derived using IT considerations (Akaike 1973).