Using IT ideas in statistics comes most naturally when
adopting the Bayesian approach. Indeed, suppose the
joint distribution of $X^n = (X_1, \dots, X_n)$ depends
on an unknown parameter $\theta$, say $X_1, \dots, X_n$
are i.i.d. with distribution $P_\theta$. Then, in order
for the amount of information provided by the
observation $X^n$ about the parameter $\theta$,
viz.\ the mutual information $I(\theta \wedge X^n)$, to be
defined, it is necessary that $\theta$ be assigned
a distribution (called the prior distribution). The latter
plays the role of the input distribution for the channel
defined by the possible distributions of $X^n$
corresponding to the possible values of $\theta$.
Of course, just as an input distribution is often assigned
in IT merely as a technical tool, a prior can always be
assigned in the same spirit. For this reason, the statistical
applicability of mutual information and related IT
tools, such as Fano's inequality, is by no means
restricted to Bayesian statistics.
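As a minimal numerical sketch of this channel picture, the following Python fragment computes the mutual information between the parameter and the sample for a toy model that is not taken from the text: a finite set of parameter values, each inducing i.i.d. Bernoulli observations. The function names `binom_pmf` and `mutual_information`, and the choice of the Bernoulli family, are illustrative assumptions. Since the number of ones is a sufficient statistic, the sum runs over counts rather than over all binary sequences.

```python
import math


def binom_pmf(n, k, p):
    """Probability of k ones among n i.i.d. Bernoulli(p) draws."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def mutual_information(prior, ps, n):
    """Mutual information (in nats) between theta and the sample X^n.

    prior[i] is the prior weight of the i-th parameter value and
    ps[i] is the Bernoulli success probability it induces (toy model).
    """
    total = 0.0
    for k in range(n + 1):
        # Channel output probabilities of the count k, one per parameter value.
        cond = [binom_pmf(n, k, p) for p in ps]
        # Marginal probability of the count k under the prior (the mixture).
        marg = sum(w * c for w, c in zip(prior, cond))
        for w, c in zip(prior, cond):
            if w > 0 and c > 0:
                total += w * c * math.log(c / marg)
    return total


if __name__ == "__main__":
    # Uniform prior on two hypothetical parameter values.
    for n in (1, 10, 100):
        print(n, mutual_information([0.5, 0.5], [0.3, 0.7], n))
```

With a uniform prior on two parameter values, the computed quantity lies between 0 and log 2 nats, and it does not decrease as n grows.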
In the sixties, Rényi studied in several papers
the asymptotics of $I(\theta \wedge X^n)$ when the set
$\Theta$ of possible values of $\theta$ was finite,
cf. Rényi 1969. He showed that
$H(\theta \mid X^n) \to 0$ exponentially fast, and related
this to the asymptotic behavior of the error
probability of the Bayesian (maximum a posteriori
probability) estimate of $\theta$.
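The behavior in the finite case can be observed numerically. The sketch below assumes a toy setting that is not from the text (two parameter values, i.i.d. Bernoulli observations, uniform prior) and computes exactly both the remaining uncertainty about the parameter given the sample, i.e. the conditional entropy of the parameter given the observations, and the error probability of the maximum a posteriori estimate. The names `binom_pmf` and `posterior_entropy_and_map_error` are hypothetical.

```python
import math


def binom_pmf(n, k, p):
    """Probability of k ones among n i.i.d. Bernoulli(p) draws."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_entropy_and_map_error(prior, ps, n):
    """Return (conditional entropy of theta given X^n in nats, MAP error).

    prior[i] is the prior weight of the i-th parameter value and
    ps[i] is the Bernoulli success probability it induces (toy model).
    """
    entropy = 0.0
    error = 0.0
    for k in range(n + 1):
        # Joint probability of (theta = i, count = k) for each i.
        joint = [w * binom_pmf(n, k, p) for w, p in zip(prior, ps)]
        marg = sum(joint)
        if marg == 0:
            continue
        # The MAP rule picks the largest joint term; the rest is error mass.
        error += marg - max(joint)
        entropy -= sum(j * math.log(j / marg) for j in joint if j > 0)
    return entropy, error


if __name__ == "__main__":
    for n in (0, 10, 20, 40, 80):
        h, e = posterior_entropy_and_map_error([0.5, 0.5], [0.3, 0.7], n)
        print(n, h, e)
```

Printing the two quantities for increasing n shows them shrinking together, which is the qualitative link between conditional entropy and error probability described above.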
When the parameter set $\Theta$ is a subset of
$\mathbb{R}^k$ of positive Lebesgue measure, the mutual
information $I(\theta \wedge X^n)$
typically goes to infinity as
$n \to \infty$. Its asymptotics were studied in the seventies
by Russian researchers (Pinsker 1972, Ibragimov and
Hasminskii 1973, and others). Recently, Clarke and Barron
1994 obtained sharp results. In Bayesian statistics,
Bernardo 1979 suggested using a so-called reference
prior, selected by the IT criterion of yielding maximum
$I(\theta \wedge X^n)$ in the limit
$n \to \infty$. He
argued that this criterion leads to the familiar
Jeffreys prior; Clarke and Barron 1994 provide a
rigorous proof, under not too restrictive
hypotheses on the family
$\{P_\theta\}$.
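For orientation, the sharp asymptotics can be written out as follows. This is the form in which the result of Clarke and Barron is commonly quoted, given here as a sketch with the regularity conditions left unstated. The symbols are notation introduced for this illustration: $w$ is the prior density on a $k$-dimensional parameter set $\Theta$, $J(\theta)$ is the Fisher information matrix, and $I(\theta \wedge X^n)$ denotes the mutual information between the parameter and the sample.

```latex
\[
  I(\theta \wedge X^n)
  = \frac{k}{2} \log \frac{n}{2 \pi e}
  + \int_\Theta w(\theta)
    \log \frac{\sqrt{\det J(\theta)}}{w(\theta)} \, d\theta
  + o(1), \qquad n \to \infty .
\]
% Assume c = \int_\Theta \sqrt{\det J(\theta)} \, d\theta is finite and put
% w_J(\theta) = \sqrt{\det J(\theta)} / c  (the Jeffreys prior). Then
\[
  \int_\Theta w(\theta)
    \log \frac{\sqrt{\det J(\theta)}}{w(\theta)} \, d\theta
  = \log c - D(w \,\|\, w_J) \le \log c ,
\]
% with equality if and only if w = w_J.
```

Only the second term depends on the prior, and it is largest when the divergence $D(w \,\|\, w_J)$ vanishes, which is how the Jeffreys prior emerges from the maximum mutual information criterion.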
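In many statistical problems, the parameter set $\Theta$
is infinite dimensional, e.g., it may be the set of all
probability densities on $\mathbb{R}$ or $\mathbb{R}^k$,
or the class of densities satisfying some smoothness
conditions. In this context, it is a good idea to consider
$I(\theta \wedge X^n)$ for $\theta$ restricted to, and
having uniform distribution on, a suitable finite subset
$\Theta_n$ of $\Theta$. Suppose the problem is to
estimate $\theta$ from the observations $X^n$, an estimator
$T = T(X^n)$ being evaluated by the
supremum over $\Theta$ of the expected loss
$E_\theta\, d(\theta, T)$, for a given loss
function $d$. Subject to suitable assumptions,
$E_\theta\, d(\theta, T)$ may be bounded
below, for $\theta \in \Theta_n$, in terms of
$P_\theta(\tilde{T} \neq \theta)$, where $\tilde{T}$ is a
$\Theta_n$-valued approximation of the estimator $T$.
Then the sup expected loss may be bounded below in terms
of
\[
  \sup_{\theta \in \Theta_n} P_\theta(\tilde{T} \neq \theta)
  \ge \frac{1}{|\Theta_n|} \sum_{\theta \in \Theta_n}
      P_\theta(\tilde{T} \neq \theta) .
\]
Here the right hand side equals
$\Pr\{\tilde{T} \neq \theta\}$, for $\theta$ uniformly
distributed on $\Theta_n$. Hence by Fano's inequality,
it is bounded below by
\[
  1 - \frac{I(\theta \wedge X^n) + \log 2}{\log |\Theta_n|} .
\]
If here $I(\theta \wedge X^n) \le c \log |\Theta_n|$, for
some $c < 1$, one arrives at a useful lower bound to
$\sup_{\theta \in \Theta} E_\theta\, d(\theta, T)$,
valid for any estimator $T$.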
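The final step of this argument is simple arithmetic, and a small helper makes the role of the constant c visible. The sketch below assumes the standard form of Fano's inequality for a parameter uniformly distributed on a finite set of m points, with the mutual information measured in nats. The function name `fano_error_lower_bound` and its arguments are hypothetical.

```python
import math


def fano_error_lower_bound(mutual_info, m):
    """Fano lower bound on the error probability of any m-ary decision.

    mutual_info: an upper bound (in nats) on the mutual information
                 between the uniformly distributed parameter and the data.
    m:           number of points in the finite parameter subset (m >= 2).
    """
    if m < 2:
        raise ValueError("the finite subset must contain at least two points")
    bound = 1.0 - (mutual_info + math.log(2)) / math.log(m)
    # A negative value carries no information, so clip at zero.
    return max(0.0, bound)


if __name__ == "__main__":
    m = 1024
    c = 0.5  # assumed ratio: mutual information <= c * log(m)
    print(fano_error_lower_bound(c * math.log(m), m))
```

For m = 1024 and c = 0.5 the bound evaluates to 0.4 up to rounding, since log 2 / log 1024 = 0.1. The error probability stays bounded away from zero as long as c < 1 and m is large, which is what makes the resulting risk bound useful.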
Ideas such as those hinted at above have been used to derive risk bounds that are tight up to a constant factor in non-parametric density estimation. Works in this direction include Hasminskii 1978, Ibragimov and Hasminskii 1982, Efroimovich and Pinsker 1982, and, more recently, Yu 1995 and Yang and Barron 1997; the results of the latter are particularly impressive.