next up previous
Next: Mutual information in statistics Up: Information theoretic methods in Previous: Iterative scalingEM algorithm

Minimum description length (MDL)

MDL is a statistical inference principle motivated by IT (Rissanen 1978, 1989). It says that among various possible stochastic models (or model classes) for a data sequence tex2html_wrap_inline1230 , one should select that yielding the shortest code for tex2html_wrap_inline1232 , taking into account also the bits needed to describe the model (model class) that has been used for the encoding. MDL has naturally lead to a strong interplay with statistics of the theory of universal data compression in IT. This would deserve detailed coverage, but because of limited space we will consider just one example, the MDL approach to Markov order estimation.

For binary sequences tex2html_wrap_inline1232 , consider the model classes ``i.i.d.,'' ``first order Markov,'' ``second order Markov,'' ..., order tex2html_wrap_inline1236 Markov. A (prefix condition) code may be regarded optimal for a model class if the maximum over the model class of either its mean-redundancy or its max-redundancy is the smallest possible. It is known from IT that a code tex2html_wrap_inline1238 optimal in either sense for the ``order k Markov'' model class (k=0 meaning i.i.d.), has codeword length

  equation323

Here tex2html_wrap_inline1244 is the maximum of the probability of tex2html_wrap_inline1232 for order k Markov sources. Hence, disregarding the O(1) term, the order k yielding minimum codelength for tex2html_wrap_inline1232 will be

  equation329

This tex2html_wrap_inline1256 is taken as the MDL estimate of the Markov order k. Notice that now the description length for the model class is constant over the considered (finite number of) classes, hence it does not enter the above comparison.

Eq. (3.12) is an instance of a more general result that to chose among model classes involving different number of parameters, the criterion given by MDL is maximized log-likelihood minus the number of parameters times tex2html_wrap_inline1260 . In statistics, this is known as the BIC criterion, and it enjoys desirable properties.

It is interesting to note that a previous ``penalized maximum likelihood criterion'' known as AIC had also been derived using IT considerations (Akaike 1973).


next up previous
Next: Mutual information in statistics Up: Information theoretic methods in Previous: Iterative scalingEM algorithm

Ramesh Rao
Mon Apr 6 16:41:42 PDT 1998