Suboptimal Huffman code table (Figure huffman.poor)

    Letter   Frequency   Huffman Code   Fixed-length Code
    ------   ---------   ------------   -----------------
      A         3/4           0                 0
      B         1/4           1                 1

Using Huffman coding, there is no way to assign theoretically optimal codes to characters with such a frequency distribution.3 Since the shortest possible Huffman code is one bit, the best we can do is to assign a one-bit code to each character, although this does not reflect the difference in their frequencies. In fact, Huffman coding can never provide any compression at all with a two-character alphabet. However, such a situation is handled very well indeed by arithmetic compression. To see how, we will start by asking a fundamental question: what is a bit?

Half a Bit Is Better than One

A bit is the amount of information required to specify an alternative that has a frequency of 50%; two bits can specify alternatives that have a frequency of 25%, and so forth. For example, the toss of a coin can result in either heads or tails, each of which can be optimally represented by a one-bit code; similarly, if a chance event has four equally likely outcomes, we can express each possible result most economically with two bits. On the other hand, as we have seen in our discussion of Huffman codes, we can have a number of alternatives that are not equally likely; in that case, we assign longer codes to those alternatives that are less likely. However, the shortest possible code in Huffman coding is one bit, which is assigned to an outcome with a frequency of one-half.

The general formula for the optimal length of a code specifying a particular outcome with frequency f is log2(1/f). In our previous examples, an outcome with a frequency of .5 should have a code length of log2(1/(1/2)), or one bit. Similarly, if an outcome has a frequency of .25, it should have a code length of log2(1/(1/4)), or two bits.

But what if one of the possible outcomes has a frequency greater than one-half? Logically, we should use less than one bit to specify it. For example, if we have a data file in which 84% of the characters are spaces, we can calculate the appropriate number of bits in the code for that character as log2(1/.84), or approximately .25 bits. If the remaining 255 characters all have equal frequencies, each of these frequencies is .0627% (that is, .000627), so that our formula reduces to log2(1/.000627), or approximately 10.63 bits each. This would result in an average code length of (.84)*(.25) + (.16)*(10.63), or 1.91 bits per character. By contrast, a Huffman code would require (.84)*(1) + (.16)*(9), or 2.28 bits per character. If we were compressing a 250-Kbyte file with such characteristics, using Huffman codes would produce a 71-Kbyte file, whereas arithmetic coding would result in a 60-Kbyte file, about a 20% difference between these two approaches.4
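To make that arithmetic easy to check, here is a minimal C sketch that evaluates the log2(1/f) formula for the 84%-spaces example and compares the resulting average code length with the Huffman figure. It is illustrative only and not part of the compression program itself.

    /* A minimal sketch of the log2(1/f) calculation for the "84% spaces"
       example above; illustrative only. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double f_space = 0.84;            /* frequency of the space character */
        double f_other = 0.16 / 255.0;    /* each of the other 255 characters */

        double bits_space = log2(1.0 / f_space);   /* about .25 bits   */
        double bits_other = log2(1.0 / f_other);   /* about 10.63 bits */

        /* average bits per character if fractional-bit codes were possible */
        double avg_fractional = 0.84 * bits_space + 0.16 * bits_other;

        /* best Huffman can do: 1 bit for the space, about 9 bits for the rest */
        double avg_huffman = 0.84 * 1.0 + 0.16 * 9.0;

        printf("fractional-bit coding: %.2f bits/char\n", avg_fractional);
        printf("Huffman coding:        %.2f bits/char\n", avg_huffman);
        return 0;
    }

Running this prints approximately 1.91 and 2.28 bits per character, the figures used in the text.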
Getting a Bit Excited

Of course, we can't output a code of less than one bit. However, we can use a set of one or more bits to represent more than one character in the same message. This is why the statement that Huffman coding is the most efficient way to represent characters is true, but misleading; if our messages contain more than one character each, we may be able to combine the codes for a number of characters while consuming a fractional number of bits for some characters.

To make this clearer, let's go through the encoding of a three-character message, "ABA", from a two-character alphabet in which the letter 'A' accounts for three-fourths of all the characters and the letter 'B' for the remaining one-fourth. The situation after we see the first character is shown in Figure arithb1; the asterisked row in each figure marks the message actually being encoded.

The first character (Figure arithb1)

    Message   Freq.   Cum.    Codes                   Previous   Current   Output
    So Far            Freq.                           Output     Output    So Far
    -------   -----   -----   ---------------------   --------   -------   ------
    * A         48      48    000000( 0)-101111(47)   None       None      None
      B         16      64    110000(48)-111111(63)   None       11        11

If the first character were a 'B', then we could output "11", because all possible messages starting with 'B' have codes starting with those two bits. However, what can we output when the first character is an 'A'? Nothing! We don't know whether the first bit of our encoded message will be a 0 or a 1; that depends on what happens next. Remember, messages starting with the letter 'A' can have codes starting with 00 through 10, or three-fourths of all possible codes. An 'A' gives us somewhat less than 1/2 bit of information, not nearly enough to produce any output by itself. Now let's look at Figure arithb2 for the information needed to encode the next character.

The second character (Figure arithb2)

    Message   Freq.   Cum.    Codes                   Previous   Current   Output
    So Far            Freq.                           Output     Output    So Far
    -------   -----   -----   ---------------------   --------   -------   ------
      AA        36      36    000000( 0)-100011(35)   None       None      None
    * AB        12      48    100100(36)-101111(47)   None       10        10
      BA        12      60    110000(48)-111011(59)   11         None      11
      BB         4      64    111100(60)-111111(63)   11         11        1111

We have two bits of output, since all the codes for messages starting with "AB" have the initial bits "10".5 Let's continue with Figure arithb3 for the third (and last) character of our message.

The third character (Figure arithb3)

    Message   Freq.   Cum.    Codes                   Previous   Current   Output
    So Far            Freq.                           Output     Output    So Far
    -------   -----   -----   ---------------------   --------   -------   ------
      AAA       27      27    000000( 0)-011010(26)   None       0         0
      AAB        9      36    011011(27)-100011(35)   None       None      None
    * ABA        9      45    100100(36)-101100(44)   10         None      10
      ABB        3      48    101101(45)-101111(47)   10         11        1011
      BAA        9      57    110000(48)-111000(56)   11         None      11
      BAB        3      60    111001(57)-111011(59)   11         10        1110
      BBA        3      63    111100(60)-111110(62)   1111       None      1111
      BBB        1      64    111111(63)-111111(63)   1111       11        111111

We have still produced only two bits of output from our three-character message, "ABA". The best we could do with Huffman coding is three bits. However, this is not the extreme case; if we were encoding the most frequent message, "AAA", we would have output only a single "0" bit.6
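The figures above can be reproduced by narrowing an interval of six-bit codes as each character arrives. The short C sketch below is a reconstruction for illustration, not the book's implementation; it prints the surviving code range and the output bits that are already settled after each character of "ABA".

    /* A sketch of the interval narrowing behind Figures arithb1-arithb3.
       The 64 possible 6-bit codes form the interval [low, high); 'A' keeps
       the lower 3/4 of whatever interval remains and 'B' keeps the upper
       1/4.  Illustrative only; names and structure are not the book's. */
    #include <stdio.h>

    int main(void)
    {
        const char *message = "ABA";
        unsigned low = 0, high = 64;          /* current interval [low, high) */

        for (const char *p = message; *p != '\0'; p++) {
            unsigned width = high - low;
            if (*p == 'A')
                high = low + width * 3 / 4;   /* 'A': keep the lower 3/4 */
            else
                low += width * 3 / 4;         /* 'B': keep the upper 1/4 */

            printf("after '%c': codes %2u..%2u, output so far: ",
                   *p, low, high - 1);

            /* the output bits already settled are the leading bits shared
               by every code still remaining in the interval */
            for (int bit = 5; bit >= 0; bit--) {
                unsigned lo_bit = (low >> bit) & 1u;
                unsigned hi_bit = ((high - 1) >> bit) & 1u;
                if (lo_bit != hi_bit)
                    break;
                printf("%u", lo_bit);
            }
            printf("\n");
        }
        return 0;
    }

Running it prints the same code ranges and output bits shown in the asterisked rows of the three figures: 0..47 with no output, 36..47 with "10", and finally 36..44 with "10".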
A Character Study

This algorithm will work nicely if we happen to know in advance what the frequencies of all the possible characters will be. But how do we acquire this information? If our program reads an input file and writes an output file, we can go through the input file once, counting the number of times a given character appears, build the table of frequencies, and then make a second pass using this table to do the actual encoding. However, this is not possible when we are compressing data as it is being generated, as there is then no input file to be analyzed. This might seem to be an insurmountable obstacle. Luckily, it is not; calculating the character frequencies as we encode yields about the same compression efficiency as precalculation, and in some cases even better efficiency!

The reason for this surprising result is that most files (or sets of data) are not uniform throughout, but rather exhibit local variations in the frequency of a given character or set of characters. Calculating the frequencies on the fly often produces a better match for these changing characteristics.

Therefore, our approach is as follows. Every character is initially assumed to have the same frequency of occurrence. After each character is read and encoded, its entry in the table of frequencies is increased to reflect our having seen it. The next time we encode it, its encoding will account for the fact that we have seen it before.

That may seem quite simple for the sender, but what about the receiver? If the encoding table is being modified as we go, how does the receiver keep in step? The receiver has the same initial table as the transmitter; each character starts with the same expected frequency of occurrence. Then, as the receiver decodes each character, it updates its copy of the frequency table. This also explains why we increase a character's frequency after we encode it, rather than before; until the receiver decodes the character, it cannot update the frequency of that character's occurrence.7
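As a concrete illustration of this bookkeeping, here is a small C sketch of an adaptive frequency table of the kind just described. The names and the flat starting counts are assumptions for illustration, not the book's actual data structures; the essential point is that both sides update only after a character has been coded.

    /* A sketch of the adaptive frequency table described above.  The
       encoder and decoder each start from an identical "flat" table and
       bump a character's count only AFTER coding it, so their tables stay
       in step.  Names and initial counts are illustrative, not the book's. */
    #define NUM_CHARS 256

    typedef struct {
        unsigned short frequency[NUM_CHARS]; /* count for each character */
        unsigned long  total;                /* sum of all the counts    */
    } FrequencyTable;

    void init_table(FrequencyTable *table)
    {
        int i;
        for (i = 0; i < NUM_CHARS; i++)
            table->frequency[i] = 1;         /* every character starts equal */
        table->total = NUM_CHARS;
    }

    /* Called by the sender after encoding ch and by the receiver after
       decoding ch -- never before, or the two tables would disagree. */
    void update_table(FrequencyTable *table, unsigned char ch)
    {
        table->frequency[ch]++;
        table->total++;
    }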
Keeping It in Context

In our implementation, we achieve a large improvement in compression efficiency by using the context in which a character occurs when estimating its frequency of occurrence. It should be apparent that the frequency of a given character appearing in a message at a given point depends quite strongly on the previous context. For example, in English text, a "Q" is almost certain to be followed by a "u", at least if we exclude articles about software such as DESQview! Ideally, therefore, the amount of information needed to encode a "u" following a "q" should be very small. The same principle, of course, applies to the encoding of a line feed following a carriage return in text files, where this is a virtual certainty.

On the other hand, the amount of storage required to keep track of a large amount of previous context can become excessive.8 Even one character of previous context requires the construction of 256 tables of frequencies, one for each possible previous character. A direct extension of the approach given in the reference in footnote would require over 300 KB of storage for these tables. We will apply a number of space-saving methods to reduce this storage requirement by about 90%, to approximately 35 KB, while still achieving good data-compression performance in most cases.

Conspicuous Nonconsumption

In order to achieve such a reduction in memory consumption, we must avoid storing anything that can be recalculated, such as the cumulative frequencies of all possible characters.9 We must also dispense with the use of a self-organizing table that attempts to speed up encoding and decoding by moving more frequently used characters toward the front of the table, as is done in the reference in footnote. However, we must provide as large a dynamic range of frequencies as possible: the larger the ratio between the highest frequency and the lowest, the greater the possible compression efficiency. The greatest dynamic range is needed when one character always occurs in a particular context, such as a line feed after a carriage return. Assuming that we must be able to encode any of the 256 possible one-byte values, the algorithm limits the possible dynamic range to approximately one-fourth of the range of an unsigned value.10

For maximum compression efficiency we would therefore need 256 tables of 256 two-byte entries each, consuming a total of 128 KB.11 When I first implemented this algorithm, I saw no way to reduce this significantly. Then, early one morning, I woke up with the answer. We need some very large frequency values and some very small ones, but surely not every value in between. Why not use a code to represent one of a limited number of frequency values? These values would be spaced out properly to get as close as possible to optimal compression efficiency, but each would be represented by a small index rather than stored literally. How small an index? First I considered using an eight-bit index, but that would still require 64 KB for a complete 256x256 table. Maybe a smaller index would help. But wouldn't that impose an unreasonable processing-time penalty? Amazingly enough, we can use a four-bit index with no time penalty: in fact, processing is actually faster than with a byte index! This seemingly magical feat is accomplished by one of my favorite time-saving methods: the lookup table.
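To make the idea concrete, here is one way such a scheme might look in C. The particular frequency values, names, and the choice of lookup table are assumptions for illustration, not the book's actual tables; the point is only that each frequency is stored as a four-bit index, two to a byte, and that a 256-entry lookup table can process a whole packed byte at once.

    /* A sketch of the four-bit frequency-index idea.  Each character's
       frequency is stored as a 4-bit index into a small table of
       representative frequency values, so two characters fit in one byte;
       a 256-entry lookup table translates a packed byte directly into the
       sum of the two frequencies it holds.  The values and names here are
       illustrative assumptions, not the book's actual tables. */
    #define NUM_FREQ_CODES 16

    /* sixteen representative frequencies, spaced from rare to very common */
    static const unsigned short freq_value[NUM_FREQ_CODES] = {
        1, 2, 3, 4, 6, 9, 14, 22, 35, 56, 90, 144, 230, 368, 589, 942
    };

    /* pair_total[b] is the sum of the two frequencies packed into byte b;
       building it once lets us total a table two entries at a time */
    static unsigned short pair_total[256];

    void build_pair_totals(void)
    {
        int b;
        for (b = 0; b < 256; b++)
            pair_total[b] = freq_value[b >> 4] + freq_value[b & 0x0F];
    }

    /* one context's 256 character frequencies packed into 128 bytes, so
       the 256 possible contexts together occupy about 32 KB, not 128 KB */
    typedef struct {
        unsigned char packed_freq[128];
    } ContextFrequencies;

    /* total frequency for one context, summed two characters at a time */
    unsigned long context_total(const ContextFrequencies *c)
    {
        unsigned long total = 0;
        int i;
        for (i = 0; i < 128; i++)
            total += pair_total[c->packed_freq[i]];
        return total;
    }

With a layout like this, frequency totals are accumulated a byte (two characters) at a time through the lookup table, which is one way a four-bit index can end up faster in practice than a literal byte-per-frequency table.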