
Mutual information in statistics

Using IT ideas in statistics comes most naturally when adopting the Bayesian approach. Indeed, suppose the joint distribution of tex2html_wrap_inline650 depends on an unknown parameter tex2html_wrap_inline1266, say tex2html_wrap_inline786 are i.i.d. with distribution tex2html_wrap_inline1270. Then, for the amount of information provided by the observation tex2html_wrap_inline764 about the parameter tex2html_wrap_inline1266, viz. the mutual information tex2html_wrap_inline1276, to be defined, tex2html_wrap_inline1266 must be assigned a distribution (called the prior distribution). The prior plays the role of the input distribution for the channel defined by the possible distributions of tex2html_wrap_inline764, corresponding to the possible values of tex2html_wrap_inline1266. Of course, just as an input distribution is often assigned in IT merely as a technical tool, a prior can always be so assigned. For this reason, the statistical applicability of mutual information and related IT tools, such as Fano's inequality, is by no means restricted to Bayesian statistics.

In the sixties, Rényi studied in several papers the asymptotics of tex2html_wrap_inline1276 when the set tex2html_wrap_inline1286 of possible values of tex2html_wrap_inline1266 is finite, cf. Rényi 1969. He showed that tex2html_wrap_inline1290 exponentially fast, and related this to the asymptotic behavior of the error probability of the Bayesian (maximum a posteriori probability) estimate of tex2html_wrap_inline1266.
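The finite-parameter case can be checked numerically. The sketch below is an illustration only, not Rényi's construction: it assumes Bernoulli observations, a hypothetical two-point parameter set {0.3, 0.7} with a uniform prior, and the function names are made up for this example. It computes the mutual information between the parameter and n observations exactly, together with the error probability of the maximum a posteriori estimate, so that the gap between the prior entropy and the mutual information can be watched shrinking alongside the error probability.

```python
import math

def binom_pmf(n, k, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def info_and_map_error(n, params, prior):
    """Return (I(theta; X^n) in nats, MAP error probability).

    X_1..X_n are i.i.d. Bernoulli(theta); theta takes the values in
    `params` with probabilities `prior`. The number of ones K is a
    sufficient statistic, so I(theta; X^n) = I(theta; K).
    """
    mi = 0.0
    p_correct = 0.0
    for k in range(n + 1):
        # Joint probabilities P(theta = params[i], K = k).
        joint = [w * binom_pmf(n, k, p) for w, p in zip(prior, params)]
        marginal = sum(joint)
        for w, j in zip(prior, joint):
            if j > 0.0:
                mi += j * math.log(j / (w * marginal))
        # The MAP estimate picks the most probable theta given K = k.
        p_correct += max(joint)
    return mi, 1.0 - p_correct

params = [0.3, 0.7]   # hypothetical two-point parameter set (an assumption)
prior = [0.5, 0.5]    # uniform prior
h_theta = -sum(w * math.log(w) for w in prior)
for n in (1, 10, 40):
    mi, err = info_and_map_error(n, params, prior)
    # Columns: n, H(theta) - I(theta; X^n), MAP error probability.
    print(n, h_theta - mi, err)
```

Both printed columns decrease with n; the code does not estimate the exponential rate itself.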

When tex2html_wrap_inline1286 is a subset of tex2html_wrap_inline1296 of positive Lebesgue measure, the mutual information tex2html_wrap_inline1276 typically goes to infinity as tex2html_wrap_inline670. Its asymptotics were studied in the seventies by Russian researchers (Pinsker 1972, Ibragimov and Hasminskii 1973, and others). Recently, Clarke and Barron 1994 obtained sharp results. In Bayesian statistics, Bernardo 1979 suggested using a so-called reference prior, selected by the IT criterion of yielding maximum tex2html_wrap_inline1276 in the limit tex2html_wrap_inline670. He argued that this criterion leads to the familiar Jeffreys prior; Clarke and Barron 1994 provide a rigorous proof, under not too restrictive hypotheses on the family tex2html_wrap_inline1306.
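A small numerical check may make the growth of the mutual information concrete. The sketch below rests on assumptions that are not in the text above: a Bernoulli family with a uniform prior on (0, 1), and the asymptotic form usually attributed to Clarke and Barron for a one-dimensional parameter, namely (1/2) log(n/(2 pi e)) plus a constant equal to the prior-weighted integral of log(sqrt(Fisher information)/prior density). Regularity conditions are not verified here, and the function names are illustrative. The exact value uses the fact that, under the uniform prior, the number of ones K is uniform on {0, ..., n}.

```python
import math

def exact_mutual_information(n):
    """I(theta; X^n) in nats for i.i.d. Bernoulli(theta), theta ~ Uniform(0, 1).

    Uses I = H(K) - H(K | theta), with K the number of ones. Given K = k,
    theta is Beta(k+1, n-k+1), so E[log theta | K = k] = H_k - H_{n+1},
    where H_m denotes the m-th harmonic number.
    """
    harmonic = [0.0] * (n + 2)
    for m in range(1, n + 2):
        harmonic[m] = harmonic[m - 1] + 1.0 / m
    cond_entropy = 0.0
    for k in range(n + 1):
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                     - math.lgamma(n - k + 1))
        # (n + 1) times the integral of P(k | theta) log P(k | theta) d theta.
        expected_log_lik = (log_binom
                            + k * (harmonic[k] - harmonic[n + 1])
                            + (n - k) * (harmonic[n - k] - harmonic[n + 1]))
        cond_entropy -= expected_log_lik / (n + 1)
    return math.log(n + 1) - cond_entropy

def asymptotic_mutual_information(n, prior_constant):
    """(1/2) log(n / (2 pi e)) + prior_constant, for a one-dimensional parameter."""
    return 0.5 * math.log(n / (2.0 * math.pi * math.e)) + prior_constant

# Constant term for the uniform prior: integral over (0, 1) of
# log sqrt(1 / (t (1 - t))) dt, which equals 1.
UNIFORM_CONSTANT = 1.0
# Constant term for the Jeffreys prior (density proportional to the square
# root of the Fisher information): log of the integral of
# 1 / sqrt(t (1 - t)) over (0, 1), which equals log(pi).
JEFFREYS_CONSTANT = math.log(math.pi)

for n in (10, 100, 1000):
    print(n, exact_mutual_information(n),
          asymptotic_mutual_information(n, UNIFORM_CONSTANT))
```

In this toy family the constant term is larger for the Jeffreys prior (log pi) than for the uniform prior (1), which is consistent with the reference-prior criterion described above.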

In many statistical problems, the parameter set tex2html_wrap_inline1286 is infinite dimensional, e.g., it may be the set of all probability densities on tex2html_wrap_inline1310 or tex2html_wrap_inline1312, or the class of densities satisfying some smoothness conditions. In this context, it is useful to consider tex2html_wrap_inline1276 for tex2html_wrap_inline1266 restricted to, and uniformly distributed on, a suitable finite subset tex2html_wrap_inline1318 of tex2html_wrap_inline1286. Suppose the problem is to estimate tex2html_wrap_inline1322 from the observations tex2html_wrap_inline764, an estimator tex2html_wrap_inline1326 being evaluated by the supremum over tex2html_wrap_inline1286 of the expected loss tex2html_wrap_inline1330, for a given loss function d. Subject to suitable assumptions, tex2html_wrap_inline1330 may be bounded below, for tex2html_wrap_inline1336, in terms of tex2html_wrap_inline1338, where tex2html_wrap_inline1340 is a tex2html_wrap_inline1318-valued approximation of the estimator T. Then the sup expected loss may be bounded below in terms of

displaymath1262

Here the right-hand side equals tex2html_wrap_inline1346, for tex2html_wrap_inline1266 uniformly distributed on tex2html_wrap_inline1318. Hence, by Fano's inequality, it is bounded below by

eqnarray339

If here tex2html_wrap_inline1352, for some c<1, one arrives at a useful lower bound on tex2html_wrap_inline1356, valid for any estimator tex2html_wrap_inline1326.
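The arithmetic of this argument can be sketched in a few lines. The code below uses Fano's inequality in its common form, which is an assumption about the lost displays rather than a reconstruction of them: for a parameter uniformly distributed on M >= 2 points, any estimator errs with probability at least 1 - (I + log 2)/log M, where I is the mutual information in nats. The step from error probability to expected loss assumes that d is a metric and that the M points are pairwise at least 2*delta apart; this is one standard choice of "suitable assumptions", not necessarily the one intended above. All function and parameter names are hypothetical.

```python
import math

def fano_error_lower_bound(mutual_info_nats, num_hypotheses):
    """Lower bound on the error probability of any estimator of a parameter
    uniformly distributed on `num_hypotheses` points (Fano's inequality)."""
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    bound = 1.0 - (mutual_info_nats + math.log(2)) / math.log(num_hypotheses)
    return max(0.0, bound)  # a negative value carries no information

def minimax_risk_lower_bound(mutual_info_nats, num_hypotheses, delta):
    """Lower bound on the sup expected loss, ASSUMING d is a metric and the
    hypotheses are pairwise at least 2 * delta apart: a loss below delta
    identifies the true hypothesis, so risk >= delta * P(error)."""
    return delta * fano_error_lower_bound(mutual_info_nats, num_hypotheses)

# If I <= c * log M with c < 1, the error bound is 1 - c - log 2 / log M.
M = 1024
c = 0.5
print(fano_error_lower_bound(c * math.log(M), M))
print(minimax_risk_lower_bound(c * math.log(M), M, delta=0.25))
```

With M = 2 the bound is vacuous, since log 2 / log M = 1; the finite subset therefore has to be reasonably large while the mutual information stays below a fraction c of log M.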

Ideas of the kind hinted at above have been used to derive risk bounds tight up to a constant factor in non-parametric density estimation. Works in this direction include Hasminskii 1978, Ibragimov and Hasminskii 1982, Efroimovich and Pinsker 1982, and recently Yu 1995 and Yang and Barron 1997; the results of the latter are particularly impressive.



Ramesh Rao
Mon Apr 6 16:41:42 PDT 1998