next up previous
Next: Minimum description length (MDL) Up: Information theoretic methods in Previous: Hypothesis testing: exponential rate

Iterative scaling, EM algorithm

Iterative scaling is a familiar procedure to infer a matrix with non-negative entries tex2html_wrap_inline1086 when the row and column sums

  equation252

are known, say tex2html_wrap_inline1088 , tex2html_wrap_inline1090 , and a prior guess tex2html_wrap_inline1092 is available. The procedure consists in iteratively adjusting the row and column sums, setting

  equation264

and taking the limit

  equation281

To this author's knowledge, iterative scaling was first used (Kruithof 1937) to estimate telephone traffic, viz.\ the number tex2html_wrap_inline1086 of calls from exchange i to exchange j on a given day, from the counts tex2html_wrap_inline1100 and tex2html_wrap_inline1102 of outgoing and incoming calls, and from the exact knowledge of the traffic tex2html_wrap_inline1104 on some previous day. In statistics, the same procedure was first proposed by Deming and Stephan 1943 to infer a two dimensional distribution tex2html_wrap_inline1106 with known marginals tex2html_wrap_inline1108 , tex2html_wrap_inline1110 , using the empirical distribution of the observed sample (``contingency table'') as prior guess Q.

The IT nature of this procedure has been recognized by Ireland and Kullback 1968. Suppose w.l.o.g. that tex2html_wrap_inline1114 and Q are PD's, let tex2html_wrap_inline1118 and tex2html_wrap_inline1120 denote the set of (two-dimensional) PD's with first marginal tex2html_wrap_inline1122 , respectively with second marginal tex2html_wrap_inline1124 . Then tex2html_wrap_inline1126 belongs to tex2html_wrap_inline1118 or tex2html_wrap_inline1120 according as k is odd or even, and

  equation297

for each tex2html_wrap_inline1134 (k odd) or tex2html_wrap_inline1138 (k even). Due to this Pythagorean identity, tex2html_wrap_inline1126 is the I-projection of tex2html_wrap_inline1144 onto tex2html_wrap_inline1118 or tex2html_wrap_inline1120 , respectively.

We show (following Csiszár 1975, filling a gap in the proof of Ireland and Kullback 1968) that if tex2html_wrap_inline1150 , the limits (3.6) exist and tex2html_wrap_inline1152 equals the I-projection of Q onto tex2html_wrap_inline1156 .

By the uniqueness of I-projection, it suffices to show that for any convergent subsequence tex2html_wrap_inline1158 , say, we have tex2html_wrap_inline1160 and for each tex2html_wrap_inline1162

  equation305

Since (3.7) holds for each k if tex2html_wrap_inline1162 , summing these identities from k=1 to tex2html_wrap_inline1170 and letting tex2html_wrap_inline1172 gives

  equation308

If tex2html_wrap_inline692 , (3.9) implies that tex2html_wrap_inline1176 , consequently tex2html_wrap_inline1178 , establishing tex2html_wrap_inline1160 . Thus (3.9) applies to P=P', yielding that the sum in (3.9) equals tex2html_wrap_inline1184 . This completes the proof.

A similar convergence result had been claimed (Kullback 1968) also for iterative scaling of densities. That case, however, is much harder; the above proof essentially relies upon the continuity of I-divergence as a function of its second variable, and this no longer holds if the underlying set is infinite. A convergence proof for iterative scaling of densities appears available under additional assumptions only (Rüschendorf 1995).

There is a large variety of problems requiring to infer a PD or, more generally, a non-negative valued function, when the available information consists in certain linear constraints. The popular ``maximum entropy'' method suggests to take the feasible P closest in I-divergence to a default model Q, i.e., the I-projection of Q onto the feasible set of P's satisfying the given constraints. The I-divergence of non-negative valued functions P and Q, not necessarily PD's, on a finite set tex2html_wrap_inline510 , is defined by the following extension of eq. ( gif):

  equation317

Often, the feasible set can be represented as tex2html_wrap_inline1200 such that I-projection onto each tex2html_wrap_inline1202 is easy to compute. Then, as in iterative scaling, the desired I-projection (``maxent solution'') can be computed by iterating successive I-projections onto tex2html_wrap_inline1204 starting with tex2html_wrap_inline1206 . Convergence to the I-projection tex2html_wrap_inline564 of Q onto tex2html_wrap_inline1200 follows in the same way as above, using the Pythagorean theorem for I-projections (eq. (1.6) with equality), provided that tex2html_wrap_inline1214 .

Some other iterative algorithms often used in statistics and elsewhere can also be given intuitive IT interpretations. One such algorithm, designed to compute I-projection onto an arbitrary feasible set defined by linear constraints (when the underlying set is finite) is generalized iterative scaling (Darroch and Ratcliff 1972), also known as SMART algorithm (Byrne 1993). This has been shown (Csiszár 1989) equivalent to iterative I-projection performed in a suitable product space.

The so-called EM algorithm (Dempster, Laird and Rubin 1977), designed to compute maximum likelihood estimates from incomplete data, has been shown (Csiszár and Tusnády 1984) equivalent to an alternating minimization of tex2html_wrap_inline534 for tex2html_wrap_inline1134 and tex2html_wrap_inline1220 with suitably constructed tex2html_wrap_inline1118 and tex2html_wrap_inline1120 . Convergence of the latter was proved under some technical conditions, the most important being convexity of tex2html_wrap_inline1118 and tex2html_wrap_inline1120 . The general result implies convergence of the EM algorithm in the particular case of decomposition of mixtures. Remarkably, the same general result implies also the convergence of familiar algorithms for computing channel capacity (Arimoto 1972, Blahut 1972), rate-distortion functions (Blahut 1972) and optimum portfolios (Cover 1984).

Recent works related to the topics in this Subsection include Byrne 1993, 1996, Della Pietra, Della Pietra and Lafferty 1997, Matus 1997.


next up previous
Next: Minimum description length (MDL) Up: Information theoretic methods in Previous: Hypothesis testing: exponential rate

Ramesh Rao
Mon Apr 6 16:41:42 PDT 1998