The goal of this experiment is to illustrate the practical usage of the Hu-Tucker algorithm for encoding electronic texts.
"The Complete Works of William Shakespeare" by World Library, Inc., as provided by the Project Gutenberg Etext of Illinois Benedictine College, was selected for this study. To measure the word frequencies in the text appropriately, the following changes to the file were necessary:
The - and ' signs were removed when they appeared at the beginning or at the end of a word. After these changes all words appear in capitals, and the punctuation signs appear as separate lower-case strings. The resulting version of the text was run through the awkfilter script, which counted the number of appearances of each word. The word list was then sorted in alphabetic order. The input file fed to the FILH implementation contains the total number of words, followed by one pair per word (number of appearances, word string), sorted alphabetically by word string. In total, 29338 distinct words were encountered, each appearing between 1 and 83067 times. The most frequent ones are:
| weight | word | weight | word |
|---|---|---|---|
| 5479 | THOU | 9556 | IS |
| 5878 | HAVE | 10475 | |
| 5916 | AS | 10939 | IN |
| 6222 | BUT | 11078 | THAT |
| 6230 | HE | 12468 | MY |
| 6807 | THIS | 13591 | YOU |
| 6850 | HIS | 14545 | A |
| 6866 | YOUR | 18113 | OF |
| 7064 | BE | 19120 | TO |
| 7650 | IT | 20608 | I |
| 7744 | ME | 26636 | AND |
| 7980 | WITH | 27544 | THE |
| 8186 | FOR | 77926 | period |
| 8687 | NOT | 83067 | comma |