2008 ISIT Plenary Lecture
From Two to Infinity: Information Theory and Statistics for Large Alphabets
Professor Alon Orlitsky
University of California, San Diego
In most compression applications the underlying source distribution is unknown. Universal compression results show that if the source alphabet is small, e.g., binary, it can still be compressed to essentially its entropy. Yet practical sources such as text, audio, and video have large, often infinite, support. These sources cannot be compressed to their entropy without knowing their distribution, and the excess number of bits grows to infinity linearly with the alphabet size.
Compressing such sources requires estimating distributions over large alphabets and approximating probabilities of rare, even unseen, events. We review good but suboptimal probability estimators derived by Laplace, Good, Turing, and Fisher, and show that seminal results by Hardy and Ramanujan on the number of integer partitions can be used to construct asymptotically optimal estimators that yield unbiased population estimates and compress data patterns to their entropy uniformly over all alphabet sizes.
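As a rough illustration of the classical estimators mentioned above, the following Python sketch shows the Laplace add-one rule, the Good-Turing estimate of the total probability of unseen symbols, and the pattern of a sequence (each symbol replaced by the order of its first appearance). It is not the asymptotically optimal estimator described in the talk; the function names and the assumed alphabet size of 26 are hypothetical and chosen only for the example.

```python
from collections import Counter
from fractions import Fraction


def laplace_estimate(sample, alphabet_size):
    """Return a function giving the add-one (Laplace) probability of a symbol.

    Assumes the alphabet size is known in advance, which is exactly what
    fails for large or infinite alphabets.
    """
    counts = Counter(sample)
    n = len(sample)

    def prob(symbol):
        # Counter returns 0 for symbols that never appeared.
        return Fraction(counts[symbol] + 1, n + alphabet_size)

    return prob


def good_turing_missing_mass(sample):
    """Good-Turing estimate of the total probability of unseen symbols:
    the fraction of the sample made up of symbols seen exactly once."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return Fraction(singletons, len(sample))


def pattern(sample):
    """Replace each symbol by the index of its first appearance (1-based)."""
    index = {}
    return [index.setdefault(s, len(index) + 1) for s in sample]


if __name__ == "__main__":
    text = "abracadabra"
    p = laplace_estimate(text, alphabet_size=26)  # 26 is an assumption
    print(p("a"), p("z"))                  # seen symbol vs. unseen symbol
    print(good_turing_missing_mass(text))  # mass assigned to unseen symbols
    print(pattern(text))
```

The pattern discards the identity of the symbols and keeps only their order of appearance, so it can be described without reference to the alphabet size.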
This talk is based on work with P. Santhanam, K. Viswanathan, and J. Zhang.