next up previous contents
Next: Comparison of Different Compression Up: The Fast Implementation Builds Previous: Experimental Time Complexity vs.

Example Book Encoding

The goal of this experiment is to illustrate the practical usage of the Hu-Tucker algorithm for encoding electronic texts.

``The Complete Works of William Shakespeare'', by the World Library, Inc., provided by the project Gutenberg Etext of Illinois Benedictine College is selected for this study. To appropriately measure the frequencies of the words in the text the following changes to the file were necessary:

The - and ' signs, when appearing at the beginning or at the end of the word, are removed4.2. Thus all the words appear in capitals, and the punctuation signs appear as separate lower case strings. The resulting version of the text was ran through awkfilter script and the number of the words appearances in the text are counted. The words list was sorted in alphabetic order. The input file fed to the FILH implementation contains the total number of words, followed by the pairs: number of word appearances, word string sorted in alphabetic order according to the word string. In total 29338 words were encountered. They are repeated between 1 and 83067 times. The top winners are:

weight word weight word
5479 THOU 9556 IS
5878 HAVE 10475 $question\_mark$
5916 AS 10939 IN
6222 BUT 11078 THAT
6230 HE 12468 MY
6807 THIS 13591 YOU
6850 HIS 14545 A
6866 YOUR 18113 OF
7064 BE 19120 TO
7650 IT 20608 I
7744 ME 26636 AND
7980 WITH 27544 THE
8186 FOR 77926 period
8687 NOT 83067 comma



$Total\_Execution\_Time = Time_{Initialization} + Time_{Combination} +\\
\inden...
...me_{Recombination} = \\
\indent \indent \indent \indent \indent \indent
= 2.64$ seconds

OABST Cost = $\sum_{i=1}^{29338} w_il_i = 10468466$

Averaged Word Encoding Length = $\frac {\sum_{i=1}^{29338} w_il_i}{\sum_{i=1}^{29338}w_i} = \frac{10468466}{1102905}$

= 9.4917205 bits per word

''The Complete Works of William Shakespeare``:

$File\_Size$ = 6977448 Bytes
$Compressed\_File$ = 1308558.2 Bytes


  
Figure 4.12: Results of encoding ''The Complete Works of William Shakespeare``. Statistics of the numbers of the words at a given level.

\includegraphics{level_stat.eps}


  
Figure 4.13: ''The Complete Works of William Shakespeare`` - Distribution of the levels of the words in the input sequence according to their position.

\includegraphics{book_levels.eps}



Footnotes

... removed4.2
Otherwise 'THIS, -THIS, and THIS' are considered erroneously as different words.


 
next up previous contents
Next: Comparison of Different Compression Up: The Fast Implementation Builds Previous: Experimental Time Complexity vs.
Sashka Davis;961;icsg6;
1999-01-14