Application of the binary CTW method for compression of ASCII
sequences is possible after decomposing the ASCII symbols into 7
binary digits.
Moreover we use 7 context trees.
Tree
, where
, is used for coding all binary digits
occurring at position j in an ASCII.
The (longest) context of digit j are the digits
of
the current ASCII preceded by the 7 digits in the most recent ASCII,
preceded by the 7 digits in the second most recent ASCII, up to the
M'th most recent ASCII.
Digit 1 in an ASCII is
the most significant one, and is encoded first, etc.
By decomposing we allow different tree models for all 7 digits.
This may reduce the total number of parameters and thus the redundancy.
The next problem that has to be solved is that the parameter redundancy depends on the number of parameters, however, after decomposition many of these parameters are 0 or 1, possibly because of non-occurring ASCII symbols. Therefore we use, instead of the KT-estimator, the ``zero-redundancy'' estimator, which is defined as
where
is the KT-estimator and
and
. This leads to a redundancy of not
more than 2 bits for deterministic subsequences, i.e. sequences with
a=0 or b=0, and
for non-deterministic ones
having length
, where
.
Our final problem deals with model redundancy.
Note that context tree
, for some
, ``fits'' a
model to the data (digits occurring at position j) which is a binary
tree.
Such a tree can have leaves also at positions inside an ASCII symbol.
This seems not very useful. By allowing leaves only at ASCII borders
we decrease the model redundancy. This is accomplished by
weighting only at ASCII borders, i.e. taking
in nodes inside an ASCII and
in nodes on ASCII
borders.
ASCII CTW for M=12 achieves 1.825 bit/sym on the
file bib, 2.180 bit/sym on book1, 1.875 bit/sym on
book2, 2.346 bit/sym on news, 2.290 bit/sym on
paper1, 2.232 bit/sym on paper2, and 2.336 bit/sym on
progc of the Calgary corpus.