Iterative scaling is a familiar procedure to infer a
matrix with non-negative entries
when the
row and column sums
are known, say
,
, and a prior guess
is available. The procedure consists
in iteratively adjusting the row and column sums,
setting
and taking the limit
To this author's knowledge, iterative scaling was first
used (Kruithof 1937) to estimate telephone traffic, viz.\
the number
of calls from exchange i
to exchange j on a given day, from the counts
and
of outgoing and incoming calls, and
from the exact knowledge of the traffic
on some
previous day. In statistics, the same procedure was first
proposed by Deming and Stephan 1943 to infer a two
dimensional distribution
with known
marginals
,
, using the empirical distribution of the
observed sample (``contingency table'') as prior guess Q.
The IT nature of this procedure has been recognized
by Ireland and Kullback 1968. Suppose w.l.o.g. that
and Q are PD's,
let
and
denote the set of (two-dimensional)
PD's with first marginal
, respectively
with second marginal
. Then
belongs to
or
according as k is odd
or even, and
for each
(k odd) or
(k even).
Due to this Pythagorean identity,
is the I-projection
of
onto
or
, respectively.
We show (following Csiszár 1975, filling a gap in
the proof of Ireland and Kullback 1968) that if
, the limits
(3.6) exist and
equals the I-projection
of Q onto
.
By the uniqueness of I-projection, it suffices to show
that for any convergent subsequence
,
say, we have
and for each
Since (3.7) holds for each k if
, summing these identities from k=1 to
and letting
gives
If
, (3.9) implies that
, consequently
, establishing
. Thus (3.9)
applies to P=P', yielding that the sum in (3.9)
equals
. This completes the proof.
A similar convergence result had been claimed (Kullback 1968) also for iterative scaling of densities. That case, however, is much harder; the above proof essentially relies upon the continuity of I-divergence as a function of its second variable, and this no longer holds if the underlying set is infinite. A convergence proof for iterative scaling of densities appears available under additional assumptions only (Rüschendorf 1995).
There is a large variety of problems requiring to infer
a PD or, more generally, a non-negative valued function,
when the available information consists in certain
linear constraints. The popular ``maximum entropy''
method suggests to take the feasible P closest
in I-divergence to a default model Q, i.e.,
the I-projection of Q onto the feasible set of P's
satisfying the given constraints. The I-divergence
of non-negative valued functions P and Q, not
necessarily PD's, on a finite set
, is defined
by the following extension of eq. (
):
Often, the feasible set can be represented as
such that I-projection onto each
is easy to compute. Then, as in iterative
scaling, the desired I-projection (``maxent
solution'') can be computed by iterating successive
I-projections onto
starting
with
. Convergence to the I-projection
of Q onto
follows in the same way as above,
using the Pythagorean theorem for I-projections
(eq. (1.6) with equality), provided that
.
Some other iterative algorithms often used in statistics and elsewhere can also be given intuitive IT interpretations. One such algorithm, designed to compute I-projection onto an arbitrary feasible set defined by linear constraints (when the underlying set is finite) is generalized iterative scaling (Darroch and Ratcliff 1972), also known as SMART algorithm (Byrne 1993). This has been shown (Csiszár 1989) equivalent to iterative I-projection performed in a suitable product space.
The so-called EM algorithm (Dempster, Laird and Rubin
1977),
designed to compute maximum likelihood estimates from
incomplete data, has been shown (Csiszár and
Tusnády 1984) equivalent to an alternating minimization
of
for
and
with
suitably constructed
and
. Convergence of
the latter was proved under some technical conditions,
the most important being convexity of
and
. The general result implies convergence
of the EM algorithm in the particular case
of decomposition of mixtures. Remarkably, the same general
result implies also the convergence of familiar
algorithms for computing channel capacity (Arimoto 1972,
Blahut 1972), rate-distortion functions (Blahut 1972)
and optimum portfolios (Cover 1984).
Recent works related to the topics in this Subsection include Byrne 1993, 1996, Della Pietra, Della Pietra and Lafferty 1997, Matus 1997.