Let
be i.i.d. random variables with values
in an arbitrary set
(equipped with a
-algebra
), with common distribution Q.
The empirical distribution
of
is the random probability measure on
defined by
. Sanov's theorem (Sanov 1957) says,
intuitively, that for any PD
,
If
is a finite set, we have more exactly
whenever P is a possible type for block-length n (a basic fact of the method of types, cf. Csiszár and Körner 1981, p.32). Clearly, (2.2) implies that
for any set
of PD's on
in which n-types become dense as
.
For arbitrary
, a general form of Sanov's theorem
says that for any set
of PD's on
for which
the probabilities
are
defined, we have
Here
and
denote the interior
and closure of
in the
-topology, i.e.,
the topology of setwise convergence of probability measures
(the
-topology is weaker than the topology of variation
distance). Of course, the equality of the right hand sides
of (2.4) and (2.5) is a sufficient condition for the limit
relation (2.3). In particular, (2.3) always holds if
is convex and
for some
.
Of particular interest is the choice
where f is a given (possibly vector valued) function on
and C is a given subset of the range of f; then
means that
.
In the parlance of large deviations theory (cf. Dembo and
Zeitouni 1993), the asymptotic bounds (2.4), (2.5),
together with the easily checked
compactness (in
-topology) of the divergence balls
mean that
satisfies the large deviation
principle, with good rate function
.
The simplest available proof of this important result is inherently information theoretic (Groeneboom, Oosterhoff and Ruymgaart 1979; they acknowledge using a suggestion of this author in proving (2.8) below). We sketch the proof of the more difficult part (2.5).
For any (measurable) partition
of
,
is bounded above by the sum of probabilities
for all PD's P' on
such that
for some
and P' is an n-type. This implies by (2.2) that
This gives
and (2.5) follows by (1.2) if one checks that
The latter is non-trivial but not too hard.
Suppose next that
is a convex set of PD's on
and the conditional
distribution of
on the condition
belongs to
. This is always the case for
given
by (2.6) if C is a convex set and
exists.
Under the above hypotheses, the asymptotic bound (2.5)
may be sharpened to a
non-asymptotic bound. More importantly,
under mild additional hypotheses,
a strong form of Gibbs' conditioning principle
(cf. Dembo and Zeitouni 1993) may be established
by a simple IT reasoning (Csiszár 1984).
The following identity
holds for any PD
on
whose marginals on
are equal to P, and for any PD Q on
. Take here
(the conditional distribution
of
on the condition
), then
by assumption. It follows that
proving the claimed non-asymptotic bound. If
is the I-projection of Q onto
, from (2.10) and
(1.6) we further obtain
where the last step follows from (2.9) with Q replaced
by
.
Supposing finally that the given
satisfies (2.3),
it follows from (2.11) - recalling
- that
This says, intuitively, that conditionally
on
,
the random variables
``almost behave''
as independent ones with common distribution
.
In particular, (2.12) implies that for any
fixed k
Thus the conditional joint distribution of
on the condition
converges
to the product distribution
in a stronger sense
than in variation distance, cf. (1.4). This is the promised
strong version of Gibbs' conditioning principle.
The above result may be extended to the case when
the I-projection of Q onto
does not exist.
Namely, a unique
(not necessarily in
)
always exists such that
with
implies
; then (2.12)
holds with this ``generalized I-projection''
(Csiszár 1984).