|
EE4253 Digital Communications
Department of Electrical and Computer Engineering - University of New Brunswick, Fredericton, NB, Canada |
Successful data compression depends on the presence of structure or patterns in the file to be compressed. Files containing English text as ASCII characters possess many features which allow compression. Moreover, an alphabetical listing of English words (such as might be used by a spell checking program) has structure which begs for compression. In this section, it will be shown that a dictionary listing English words can easily be compressed to less than 1/3 its original file size.
English text is distinguishable from random characters by the fact that
certain letters occur much more frequently than others. (This is reflected
in the points assigned to various letters in some board games.) This observation
can form the basis for a compression scheme in which letters are represented
by variable length codewords, where the shortest codewords are used for
the most frequently occurring characters.
This problem was identified by Samuel
Morse in defining a very early digital coding scheme. In the Morse
Code, telegraph messages are sent using combinations of 'dots' and 'dashes'.
Morse reasoned that telegraph operators could work more efficiently if
the common letters could be sent more quickly than the others. In the table
below showing the International Morse Code, the letters 'E' and
'T' are sent as single symbols, while 'Z' and 'Q' and 'J' each require
four symbols to transmit. This code can be compared to a more recent character
code by noting that in ASCII the same number of bits are used for every
character.
A: .- B: -... C: -.-. D: -.. E: . F: ..-. G: --. H: .... I: .. J: ---. K: -.- L: .-.. M: -- N: -. O: --- P: .--. Q: --.- R: ._. S: ... T: - U: .._ V: ...- W: .-- X: -..- Y: -.-- Z: --.. |
| 1-BIT | 2-BITS | 3-BITS | 4-BITS |
| E,T | I,A,N,M | S,U,R,W,D,O,G,K | H,V,F,L,P,J,B,X,C,Z,Q,Y |
The above scheme might be useful in a modern digital communications system, except that is difficult to identify the beginning and end of characters in a binary stream. In a telegraph system, pauses are inserted between characters to separate the codewords.
S E N D M O R E M O N E Y ... . -. _.. -- --- -.- . -- --- -. . -.-- 000 0 10 100 11 111 101 0 11 111 10 0 1011 |
In a binary data compression scheme, variable length codewords must be separated from each other. In the next example, the alphabet is sorted and a practical data compression scheme is described.
Note that the patterns present in English text go much deeper than single characters, and certain letter pairs ('TH', 'QU', 'CK' ) occur often enough that codes which recognized these patterns could be very efficient. Indeed, some words are so common that a few bits could encode entire strings of characters. In this example, only single letters will be coded.
English text can be compressed because of the structure inherent in words and sentences. Certain characters occur much more frequently than others {such as E,T}, while some characters are used only rarely {Q,Z}.
Divide the alphabet into the 13 most common and the 13 least common letters. Using the Morse distribution as a guide:
| HEX (4-bits) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
| ROW 1 | E | T | I | A | N | M | S | U | R | W | D | O | G | spc | up | end |
| ROW 2 | K | H | V | F | L | P | J | B | X | C | Z | Q | Y | .. | up | .. |
S E N D M O R E M O N E Y 6 0 4 A D 5 B 8 0 D 5 B 4 0 EC |
where it will (later) be easy to extract the message from the binary stream by counting 4-bit groups. The final message has 64 bits, compared to fifteen 8-bit (ASCII) characters = 120 bits (including spaces). As expected, the compressed file will be reduced to almost one half the original message size.
The above method illustrates one way in which text can be compressed. It would not be as effective for other languages unless the table was redefined. As a compression scheme, it is based purely on character frequency, and does not exploit the repeated string patterns found in a dictionary. A simple enhancement will greatly improve this technique.
A dictionary of the English language contains perhaps 100,000 words
arranged in alphabetical order. With an average word length of 9 letters,
this dictionary would require almost 1 MB of storage using ASCII codes.
Using only the above method for compressing text, this file would still
be more than one half its original size. On the other hand, in an alphabetical
word list, consecutive entries generally share many letters. Consider the
following alphabetically listed words:
chromate chromatic chromatin chromatogram chromatograph chromatography chrome chromic chromium chromosome |
When applying the letter compression scheme to this dictionary listing,
the space word 'D' will not be used. Rather, the unused code 'F' can be
employed to mean end-of-word, where 'F' will always be followed by a
single hex value denoting how many letters to keep from the previous
word. For example, the code 'F6' says "this word has ended, and keep
six letters for the next word". A single hex values limits to 15 the maximum
letters retained..
The remaining letters may now be coded using the original table:
123456789-123456789-123456789-123456789-123456789-1234 E9E18B5310F72E9F84F7BC835FBE5E1FDECF50F52E9F675F5B6B50 c hromateF7i cF8nF7ogramFB p hFD yF5eF5i cF6umF5osome |
This method resembles techniques used for video compression. In a video or motion picture film, consecutive frames generally change very little (sometimes not at all) and by sending only the changes from one frame to the next (as in the dictionary words) a digitized video image may be compressed significantly. Such compression techniques make possible the digital transmission of television by satellite and the storage of full length movies on DVD media.
One cost of using this compression method is that random access to dictionary words is not possible; the file must be read from the beginning to reconstruct all the words. Also note that single bit errors could have a disastrous effect on decompression. Since compression, by definition, exploits and removes redundancies, the problem of error control is common to all compression schemes.
The lessons learned in this example can be applied wherever data compression is required.
|
Sun May 19 05:43:03 ADT 2013
Last Updated: 09 NOV 1998 |
Richard Tervo [ tervo@unb.ca ] | Back to the course homepage... |