next up previous
Next: Conclusion Up: No Title Previous: Complexity

Text compression

Application of the binary CTW method for compression of ASCII sequences is possible after decomposing the ASCII symbols into 7 binary digits. Moreover we use 7 context trees. Tree tex2html_wrap_inline3303 , where tex2html_wrap_inline3305 , is used for coding all binary digits occurring at position j in an ASCII. The (longest) context of digit j are the digits tex2html_wrap_inline3311 of the current ASCII preceded by the 7 digits in the most recent ASCII, preceded by the 7 digits in the second most recent ASCII, up to the M'th most recent ASCII. Digit 1 in an ASCII is the most significant one, and is encoded first, etc. By decomposing we allow different tree models for all 7 digits. This may reduce the total number of parameters and thus the redundancy.

The next problem that has to be solved is that the parameter redundancy depends on the number of parameters, however, after decomposition many of these parameters are 0 or 1, possibly because of non-occurring ASCII symbols. Therefore we use, instead of the KT-estimator, the ``zero-redundancy'' estimator, which is defined as

equation2255

where tex2html_wrap_inline2885 is the KT-estimator and tex2html_wrap_inline3317 and tex2html_wrap_inline3319 . This leads to a redundancy of not more than 2 bits for deterministic subsequences, i.e. sequences with a=0 or b=0, and tex2html_wrap_inline3325 for non-deterministic ones having length tex2html_wrap_inline3327 , where tex2html_wrap_inline3329 .

Our final problem deals with model redundancy. Note that context tree tex2html_wrap_inline3303 , for some tex2html_wrap_inline3333 , ``fits'' a model to the data (digits occurring at position j) which is a binary tree. Such a tree can have leaves also at positions inside an ASCII symbol. This seems not very useful. By allowing leaves only at ASCII borders we decrease the model redundancy. This is accomplished by weighting only at ASCII borders, i.e. taking tex2html_wrap_inline3337 in nodes inside an ASCII and tex2html_wrap_inline3339 in nodes on ASCII borders.

ASCII CTW for M=12 achieves 1.825 bit/sym on the file bib, 2.180 bit/sym on book1, 1.875 bit/sym on book2, 2.346 bit/sym on news, 2.290 bit/sym on paper1, 2.232 bit/sym on paper2, and 2.336 bit/sym on progc of the Calgary corpus.


next up previous
Next: Conclusion Up: No Title Previous: Complexity

Ramesh Rao
Fri May 2 08:36:00 PDT 1997