From Two to Infinity: Information Theory and Statistics for Large Alphabets

Presenter(s)

Alon Orlitsky

UC San Diego

2008 ISIT Plenary Lecture
From Two to Infinity: Information Theory and Statistics for Large Alphabets
Professor Alon Orlitsky
University of California, San Diego

Abstract

In most compression applications the underlying source distribution is unknown. Universal compression results show that if the source alphabet is small, e.g. binary, it can still be compressed to essentially its entropy. Yet practical sources such as text, audio and video have large, often infinite, support. These sources cannot be compressed to their entropy without knowing their distribution, and the excess number of bits grows to infinity linearly with the alphabet size.

Compressing such sources requires estimating distributions over large alphabets and approximating probabilities of rare, even unseen, events. We review good but suboptimal probability estimators derived by Laplace, Good, Turing, and Fisher, and show that seminal results by Hardy and Ramanujan on the number of integer partitions can be used to construct asymptotically optimal estimators that yield unbiased population estimates and compress data patterns to their entropy uniformly over all alphabet sizes.

This talk is based on work with P. Santhanam, K. Viswanathan, and J. Zhang.

Biography

Alon Orlitsky received B.Sc. degrees in Mathematics and Electrical Engineering from Ben Gurion University in 1980 and 1981, and M.Sc. and Ph.D. degrees in Electrical Engineering from Stanford University in 1982 and 1986.  From 1986 to 1996 he was with the Communications Analysis Research Department of Bell Laboratories. He spent the following year as a quantitative analyst at D.E. Shaw and Company, an investment firm in New York city. In 1997 he joined the University of California, San Diego, where he is currently a professor of Electrical and Computer Engineering and of Computer Science and Engineering, and directs the Information Theory and Applications Center.  Alon’s research concerns information theory, statistical modeling, machine learning, and speech recognition. He is a recipient of the 1981 ITT International Fellowship and the 1992 IEEE W.R.G. Baker Paper Award, a co-recipient of the 2006 Information Theory Society Paper Award, a fellow of the IEEE, and holds the Qualcomm Chair for Information Theory and its Applications at UCSD.