You see, a significant amount of the CPU time needed to encode a symbol is spent in a loop that computes the necessary portion of the cumulative frequency table for the current context. If each frequency value occupied one byte, we would have to execute that loop once for every index in the table up to the ASCII value of the character whose cumulative frequency we are trying to compute. However, since we have packed two indexes into one byte, we can accumulate two indexes' worth of frequency values with one addition, reducing the number of times the loop has to be executed by approximately 50%. Each execution of the loop translates a pair of indexes contained in one byte into a total frequency value that is the sum of their individual frequency values, by using the byte as an index into a 256-word table that has one entry for each possible combination of two indexes. This table (the both_weights table) is calculated once at the start of the program. Once we have determined the frequency of occurrence assigned to the character to be encoded, we can decide how many bits we can send and what their values are.

Receiving the Message

The receiver uses almost the same code as the sender, with two main exceptions. First, the receiver reads the input one bit at a time and outputs each character as soon as it is decoded, just the reverse of the sender. Second, rather than knowing in advance how many frequency entries we have to accumulate, as the sender does, we have to accumulate them until we find the one that corresponds to the range we are trying to interpret. The latter situation reduces the efficiency of our loop control, which accounts for much of the difference in speed between encoding and decoding.

The Code

Let's start at the beginning, with the main function in encode.cpp (Figure encode.00).
Main program for encoding (compress\encode.cpp) (Figure encode.00)
codelist/encode.00

This function opens the input and output files and sets up buffers to speed up input and output, then calls start_model (Figure adapt.00).

start_model function (from compress\adapt.cpp) (Figure adapt.00)
codelist/adapt.00

This function starts out by initializing the upgrade_threshold array, which is used to determine when to promote a character to the next higher frequency index value. As noted above, these values are not consecutive, so that we can use a four-bit index rather than literal values; this means that we have to promote a character only once in a while, rather than every time we see it, as we would with literal frequency values. How do we decide when to do this? A pseudorandom approach seems best: we can't use a genuine random number to tell us when to increment the index, because the receiver would have no way of reproducing our decisions when decoding the message. My solution is to keep a one-byte hash total of the ASCII codes for all the characters that have been sent (char_total) and to increment the index in question whenever char_total is greater than the corresponding value stored in the upgrade_threshold array. That threshold value is calculated so that the probability of incrementing the index is inversely proportional to the gap between frequency values in the translation table.[12] If, for example, each frequency value in the translation table were twice the previous value, there would be a 1/2 probability of incrementing the index each time a character was encountered. After we finish initializing the upgrade_threshold array, we set char_total to 0, in preparation for accumulating our hash total.

The next operation in start_model is to generate the both_weights table. As we discussed above, this table allows us to translate from a pair of frequency values (or weights) to the total frequency to which they correspond.
We calculate it by generating every possible pair and adding them together to fill in the corresponding table entry. The values in the translate table are defined in Figure model.00, in the line that starts with #define TRANSLATE_TABLE. How did I generate these values?

Header file for translation table constants and variables (compress\model.h) (Figure model.00)
codelist/model.00

A Loose Translation

It wasn't easy. I knew that I wanted to allow the largest possible dynamic range, which means that the lowest value has to be 1 and the highest value has to be close to the maximum that can be accommodated by the algorithm (16,128). The reason I chose a top value lower than that maximum is that if the highest value were 16,128, the occurrence of any character other than the preferred one would cause the total frequency to exceed the allowable maximum, with the result that the table of frequencies would be recalculated to reduce the top value to the next lower step. This would greatly reduce the efficiency of the compression in this case. That accounts for the lowest and highest values. What about the ones in between? Initially, I decided to use a geometric progression, much like the tuning of a piano; in such a progression, each value is a fixed multiple of the one before it. However, I found that I achieved better compression on a fairly large sample of files by starting the progression with the second value at 16 and ending with the next-to-last at 1024. Why is this so? The reason for leaving a big gap between the lowest frequency and the second-lowest one is that many characters never occur in a particular situation. If they occur once, they are likely to recur later. Therefore, setting the next-to-the-lowest frequency to approximately one-tenth of 1% of the maximum value improves the efficiency of the compression.
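As a sketch of the initialization steps described above — the translate table, the upgrade_threshold array, and the both_weights pair table — the following is a minimal illustration. The intermediate translate values here are my own geometric interpolation between 16 and 1024 (only the endpoints, the 16, and the 1024 come from the text), and the threshold formula is my reconstruction of "probability inversely proportional to the gap," not necessarily the book's exact computation:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for model.h's TRANSLATE_TABLE: endpoints 1 and a top
// value below 16,128, with a roughly geometric run from 16 up to 1024.
static const unsigned short translate[16] = {
      1,  16,  22,  30,  42,  58,  79, 109,
    150, 207, 285, 392, 540, 743, 1024, 16000
};

// Promote index i when char_total exceeds upgrade_threshold[i]. If char_total
// (a one-byte hash of all characters sent) is roughly uniform over 0..255,
// then P(char_total > t) = (255 - t)/256, so t = 255 - 256/gap makes the
// promotion probability roughly inversely proportional to the gap.
static int upgrade_threshold[16];
static unsigned char char_total = 0;

void init_upgrade_threshold() {
    for (int i = 0; i < 15; i++) {
        int gap = translate[i + 1] - translate[i];
        upgrade_threshold[i] = 255 - 256 / gap;
    }
    upgrade_threshold[15] = 255;   // the top index is never promoted
}

// Both sender and receiver call this for every character processed, so their
// promotion decisions stay in lockstep without any shared random numbers.
bool should_promote(unsigned char ch, int index) {
    char_total = (unsigned char)(char_total + ch);   // wraps mod 256
    return index < 15 && char_total > upgrade_threshold[index];
}

// both_weights has one entry per possible byte, i.e., per packed pair of
// 4-bit frequency indexes; each entry is the sum of the two translated
// frequencies, so the cumulative-frequency loop handles two indexes per pass.
static unsigned long both_weights[256];

void build_both_weights() {
    for (int first = 0; first < 16; first++)
        for (int second = 0; second < 16; second++)
            both_weights[(first << 4) | second] =
                (unsigned long)translate[first] + translate[second];
}

// Accumulate the frequencies of 2 * n_bytes indexes, one byte per pass --
// the halved loop count discussed earlier.
unsigned long accumulate_freq(const unsigned char *packed, size_t n_bytes) {
    unsigned long total = 0;
    for (size_t i = 0; i < n_bytes; i++)
        total += both_weights[packed[i]];
    return total;
}
```

Note how the pair table trades 256 words of memory, filled once at startup, for halving the iteration count of the hottest loop in the encoder.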
I have also found through experimentation that the compression ratio is improved if the first time a character is seen, it is given a frequency of about six-tenths of 1%, which requires an initial index of 7. Lower initial frequencies retard adaptation to popular new characters, and higher ones overemphasize new characters that turn out to be unpopular in the long run. The reason to leave a large gap between the next-to-highest frequency and the highest one is that most of the time, a very skewed distribution has exactly one extremely frequent character. It is rare to find several very high-frequency characters that have about the same frequency. Therefore, allowing the highest frequency to approach the theoretical maximum produces the best results. Of course, these are only empirical findings. If you have samples that closely resemble the data you will be compressing, you can try modifying these frequency values to improve the compression.

Getting in on the Ground Floor

Continuing in start_model (Figure adapt.00), we initialize NO_OF_CHARS frequency_info tables, one for every possible character. Each of these tables will store the frequencies for characters that follow a particular character. If we start to encode the string "This is a test", the first character, 'T', will be encoded using the table for character 0 (a null); this is arbitrary, since we haven't seen any characters before this one. Then the 'h' will be encoded using the table for 'T'; the 'i' will be encoded with the table for 'h', and so forth. This approach takes advantage of the context dependence of English text; as we noted before, after we see a 'q', the next character is almost certain to be a 'u', so we should use a very short code to represent a 'u' in that position. However, our initialization code contains no information about the characteristics of English text (or any other kind of data, for that matter).
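The context-selection scheme just described can be sketched in a few lines: the previous character is the only state, and it simply names the frequency table for the next one. The helper below (my own illustration, not the book's code) lists the (context, character) pairs the encoder would use for a given string:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// For each character of s, record the context it would be encoded in:
// the first character uses context 0 (a null, chosen arbitrarily), and
// every later character uses its predecessor as the context.
std::vector<std::pair<unsigned char, char>> context_pairs(const std::string &s) {
    std::vector<std::pair<unsigned char, char>> pairs;
    unsigned char oldch = 0;           // initial context: the null character
    for (char ch : s) {
        pairs.push_back({oldch, ch});  // ch is coded with the table for oldch
        oldch = (unsigned char)ch;     // ch becomes the next context
    }
    return pairs;
}
```

For "This", the pairs come out as (0,'T'), ('T','h'), ('h','i'), ('i','s'), matching the walkthrough above.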
We assign the same frequency (the lowest possible) to every character in every frequency table. As discussed above, the encoding program will learn the appropriate frequencies for each character in each table as it processes the input data. At the end of the initialization of each table, we set its total_freq value to the translation of the lowest frequency, multiplied by the number of index pairs in the table. This value is needed to calculate the code that corresponds to each character, and recalculating it every time we access the table would be time-consuming.

The last operation in start_model initializes the char_translation array, which is used to translate the internal representation of the characters being encoded and decoded to their ASCII codes.[13] Then we return to encode.cpp.

Gentlemen, Start Your Output

The next operation in main is to call start_outputing_bits (Figure arenc.00).

start_outputing_bits function (from compress\arenc.cpp) (Figure arenc.00)
codelist/arenc.00

Of course, we can't send individual bits to the output file; we have to accumulate at least one byte's worth. Whenever we have some bits to be output, we will store them in buffer; since we haven't encoded any characters yet, we set it to 0. To keep track of how many more bits we need before we can send buffer to the output file, we set bits_to_go to eight; then we return to main.

The next thing we do in main is to call start_encoding (Figure arenc.01).

start_encoding function (from compress\arenc.cpp) (Figure arenc.01)
codelist/arenc.01

This is a very short function, but it initializes some of the most important variables in the program: low, high, and bits_to_follow. The first two of these keep track of the range of codes for the current message; at this point we know nothing about the message, so they are set to indicate the widest possible range of messages, from 0 to TOP_VALUE, which is defined in arith.h (Figure arith.00).
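Before looking at arith.h, here is a minimal sketch of the buffer/bits_to_go scheme just described for start_outputing_bits and the bit output routine. The std::vector stands in for the real output file, and the bit ordering (new bits entering at the top of the byte) is an assumption borrowed from the classic arithmetic-coding implementations, not necessarily the book's exact layout:

```cpp
#include <cassert>
#include <vector>

// Bits accumulate in `buffer`; when eight are ready, a whole byte is emitted.
struct BitWriter {
    int buffer = 0;        // bits waiting to be written (start_outputing_bits: 0)
    int bits_to_go = 8;    // bits still needed to fill a byte (initially eight)
    std::vector<unsigned char> out;   // stand-in for the output file

    void output_bit(int bit) {
        buffer >>= 1;                 // make room at the top of the byte
        if (bit) buffer |= 0x80;      // the new bit enters at the high end
        if (--bits_to_go == 0) {      // a full byte: send it and reset
            out.push_back((unsigned char)buffer);
            buffer = 0;
            bits_to_go = 8;
        }
    }
};
```

Any bits left in buffer when the message ends would have to be flushed with padding, which is why the real coder has a separate done_outputing_bits step.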
Header file for arithmetic coding constants (compress\arith.h) (Figure arith.00)
codelist/arith.00

In our current implementation, with 16-bit code values, TOP_VALUE evaluates to 65535.[14] The third variable, bits_to_follow, keeps track of the number of bits that have been deferred for later output. This is used when the range of possible codes for the current message includes some codes that start with 0 and some that start with 1; as we have seen already, in such a situation we're not ready to send out any bits yet. After initializing these variables, we return to main.

The Main Loop

Upon returning to main, we need to do one more initialization before we enter the main loop of our program, which is executed once for each character in the input file: namely, setting oldch to 0. This variable controls the context in which we will encode each character. Since we haven't seen any characters yet, it doesn't really matter which frequency table we use, as long as the decoding program selects the same initial table.

The first operation in the main loop is to get the character to be encoded, via getc. If we have reached the end of the file, we break out of the loop. Otherwise, we call encode_symbol (Figure arenc.02) to place the representation of the current character in the output stream.

encode_symbol function (from compress\arenc.cpp) (Figure arenc.02)
codelist/arenc.02

This function takes two parameters: ch, the character to be encoded, and oldch, which determines the frequency table to be used for this encoding. As we have noted above, selecting a frequency table based upon the previous character encoded provides far better compression efficiency, as the frequency of occurrence of a particular character is greatly affected by the preceding character.

encode_symbol

Although encode_symbol is a short function, it is the subtlest function in this chapter; fortunately, you can use this algorithm effectively without going through the explanation below.
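As a rough preview, the heart of encode_symbol is a range-narrowing step followed by renormalization, in the style of the classic arithmetic coders. The sketch below is my reconstruction under those conventions, not the book's exact code: cum_below and cum_above are the cumulative frequencies bounding the symbol, total is the table's total_freq, and the std::vector of bits stands in for the real bit output:

```cpp
#include <cassert>
#include <vector>

const unsigned long TOP_VALUE  = 65535;             // 16-bit code values
const unsigned long HALF       = TOP_VALUE / 2 + 1; // 0x8000
const unsigned long FIRST_QTR  = HALF / 2;          // 0x4000
const unsigned long THIRD_QTR  = 3 * FIRST_QTR;     // 0xC000

struct Encoder {
    unsigned long low = 0, high = TOP_VALUE;   // widest possible range
    long bits_to_follow = 0;                   // deferred opposite bits
    std::vector<int> bits;                     // stand-in for the bit output

    // Emit a bit, plus any deferred bits of the opposite value.
    void bit_plus_follow(int bit) {
        bits.push_back(bit);
        while (bits_to_follow > 0) {
            bits.push_back(!bit);
            bits_to_follow--;
        }
    }

    // Narrow [low, high] to the symbol's slice of the range, then shift
    // out every bit that has become determined.
    void encode(unsigned long cum_below, unsigned long cum_above,
                unsigned long total) {
        unsigned long range = high - low + 1;
        high = low + range * cum_above / total - 1;
        low  = low + range * cum_below / total;
        for (;;) {
            if (high < HALF) {
                bit_plus_follow(0);            // both ends in the lower half
            } else if (low >= HALF) {
                bit_plus_follow(1);            // both ends in the upper half
                low -= HALF; high -= HALF;
            } else if (low >= FIRST_QTR && high < THIRD_QTR) {
                bits_to_follow++;              // straddling the middle: defer
                low -= FIRST_QTR; high -= FIRST_QTR;
            } else {
                break;                         // no more determined bits
            }
            low = 2 * low;                     // renormalize the range
            high = 2 * high + 1;
        }
    }
};
```

Encoding a symbol that occupies the lower half of the range (cumulative bounds 0 and 1 out of a total of 2) emits a single 0 bit and restores the range to its full width, which is the behavior the deferred-bit machinery generalizes.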
The version here closely follows the reference in the footnote.