if you are really interested in the details of operation of this function, I strongly advise you to study that reference very carefully in conjunction with the explanation here.

To clarify this algorithm, we will work through a somewhat simplified example as we examine the code. For ease in calculation, we will set TOP_VALUE to 255, or 11111111 binary, rather than 65535 as in the actual program; as a result, high will start out at 255 as well. We will also use a single, constant frequency table (Figure sampfreq) containing only four entries rather than selecting a 256-entry table according to the previous character and modifying it as we see each character, so our translate and both_weights tables (Figures samptrans and sampboth, respectively) will be adjusted correspondingly. Instead of ASCII, we will use the codes from 0 (for 'A') to 3 (for 'D') in our example message.

Sample frequency table (Figure sampfreq)

    Character   Frequency code
    A,B         00010011 (1,3)
    C,D         00000010 (0,2)

Sample translate table (Figure samptrans)

    Index   Value
      0       1
      1       2
      2       8
      3      52

Sample both_weights table (Figure sampboth)

    First            Second Index
    Index      0      1      2      3
            +------+------+------+------+
      0     |   2  |   3  |   9  |  53  |
            +------+------+------+------+
      1     |   3  |   4  |  10  |  54  |
            +------+------+------+------+
      2     |   9  |  10  |  16  |  60  |
            +------+------+------+------+
      3     |  53  |  54  |  60  | 104  |
            +------+------+------+------+

As we begin encode_symbol, we establish temp_freq_info as a pointer to the structure containing the current frequency table. Next, we set freq_ptr to the address of the frequency table itself and total_freq to the stored total frequency in the frequency structure; as we will see shortly, total_freq is used to determine what fraction of the frequency range is accounted for by the particular character being encoded. The final operation before entering the frequency accumulation loop is to set prev_cum to 0.
This variable is used to keep track of the cumulative frequency of all characters up to but not including the one being encoded; this is used to determine the position of the character being encoded as a part of the entire range of possibilities. Now we are ready to enter the frequency accumulation loop, which is shown in Figure arenc.03.

The frequency accumulation loop (from compress\arenc.cpp) (Figure arenc.03)

codelist/arenc.03

The reason we need this loop is that, as we saw before, we cannot afford to keep the cumulative frequency tables in memory; they would occupy hundreds of kilobytes of memory. Instead, we calculate the cumulative frequency for each character being encoded, as we need it. The total_freq variable, however, we do maintain from one encoding to the next; recalculating it would require us to go all the way through the frequency table for each character, even if we are encoding a character with a low ASCII code. Since we are saving the total frequency with the table, we have to accumulate the frequencies only up to the ASCII code of the character we are encoding.

Let's see how this loop accumulates the frequencies of all the characters up to the character we want to encode. The first interesting item is that the loop executes only half as many times as the value of the character being encoded, since we are packing two 4-bit indexes into each byte of the frequency table. So the first statement in the loop retrieves one of these pairs of indexes from the frequency table and increments the pointer to point to the next pair. Then we index into the both_weights table with the index pair we just retrieved and set total_pair_weight to that entry in the table. The both_weights table is the key to translating two 4-bit indexes into a total frequency. Each entry in the table is the sum of the frequency values that correspond to the two indexes that make up the byte we use to index into the table.
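The relationship between the translate and both_weights tables can be sketched in a few lines. This is only an illustration built on the four-entry sample tables from our example, not the code from arenc.cpp: the names SAMPLE_SIZE, WeightTable, build_both_weights, pair_weight, and check_sample_entries are all hypothetical, and the two-dimensional layout is an assumption made for readability, whereas the text describes indexing the table with the packed byte itself.

```cpp
#include <array>
#include <cassert>

// Hypothetical sketch: the four-entry sample translate table (Figure samptrans).
constexpr int SAMPLE_SIZE = 4;
constexpr std::array<unsigned, SAMPLE_SIZE> translate = {1, 2, 8, 52};

using WeightTable = std::array<std::array<unsigned, SAMPLE_SIZE>, SAMPLE_SIZE>;

// Each both_weights entry is the sum of the two translated frequency values.
WeightTable build_both_weights()
{
    WeightTable both_weights{};
    for (int first = 0; first < SAMPLE_SIZE; ++first)
        for (int second = 0; second < SAMPLE_SIZE; ++second)
            both_weights[first][second] = translate[first] + translate[second];
    return both_weights;
}

// Split a packed frequency byte into its two 4-bit indexes
// (first index in the high half) and look up the weight of the pair.
unsigned pair_weight(const WeightTable& both_weights, unsigned char packed)
{
    unsigned first_index = packed >> 4;
    unsigned second_index = packed & 0x0F;
    return both_weights[first_index][second_index];
}

// Spot-check a few entries against Figure sampboth.
void check_sample_entries()
{
    WeightTable both_weights = build_both_weights();
    assert(pair_weight(both_weights, 0x13) == 54);  // pair (1,3) for A,B
    assert(pair_weight(both_weights, 0x02) == 9);   // pair (0,2) for C,D
    assert(both_weights[3][3] == 104);
}
```

Calling check_sample_entries from a test program confirms that the sums agree with the sample both_weights table. Precomputing the sums trades a small amount of table space for one lookup per byte instead of two lookups and an addition.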
The last step of each loop iteration adds total_pair_weight to prev_cum, which is accumulating all of the frequencies.

In our example, the first letter of our message is 'D', which has a symbol value of 3. Using the frequency table in Figure sampfreq, we execute the statements in the loop once. First, we set current_pair to the first entry in the frequency table, 00010011, which indicates that the frequency code for 'A' is 1 (0001 binary) and the frequency code for 'B' is 3 (0011 binary). Then we set total_pair_weight to entry (1,3) from the both_weights table, the sum of the frequencies to which this pair of indexes corresponds; its value is 54. The last statement in the loop adds this value to prev_cum, which was set to 0 before the loop was started.

The next section of code, shown in Figure arenc.04, finishes the accumulation of the cumulative frequencies of the character to be encoded and the previous character, for both of the possible alignments of these two characters within a frequency table byte.

Finishing the accumulation (from compress\arenc.cpp) (Figure arenc.04)

codelist/arenc.04

If the target character has an even ASCII code, we already have the correct value for prev_cum; to calculate cum, the frequency accumulation for the current character, we need the first index from the next byte, which is stored in the high half of the byte. So we pick up current_pair, shift its high index down, and use it to retrieve the corresponding weight from the translate table. Then we add that frequency value to prev_cum to calculate cum.

On the other hand, if the target character has an odd ASCII code, we need to update both prev_cum and cum. First, we add the total weights for the last two characters to prev_cum, which results in cum. Then we translate the high half of the current_pair and add that to prev_cum.

In our example, the code of the first character is 3, so we need to update both prev_cum and cum. The value of current_pair is (0,2).
Looking in the both_weights table, we set total_pair_weight to the translation of that pair, which is 9. Then cum is calculated as the sum of prev_cum and total_pair_weight, or 63. Then we extract the high part of current_pair, 0, which translates to 1; we add this amount to prev_cum, setting it to 55. This means that the code range associated with character "D" starts at 55 and ends slightly before 63, a range of 8 positions out of the total of 63. This will allow us to send out slightly less than three bits for this character, as we will narrow the range by a factor of more than 7.

Now that we have calculated the cumulative frequencies of the current character and the previous character, we are ready to narrow the range of frequencies that correspond to our message, as shown in Figure arenc.05.

Narrowing the range of frequencies (from compress\arenc.cpp) (Figure arenc.05)

codelist/arenc.05

The first line of this section of code calculates the previous range of frequencies for our message. Then the other lines calculate the new range of the message.

The purpose of low and high is to delimit the frequency interval in which the current message falls; range is just the size of the frequency interval extending from the old value of low to the old value of high, inclusive. In our example, the old value of low is 0 and the old value of high is 255. Therefore, the formula for range, (long)(high-low)+1, produces 256, which is its maximum value. This makes sense, as we have not yet used the information from the first character of our message.

The new values of low and high represent the narrowing of the previous range due to the frequency of the new character. We calculate the new value of high, which represents the high end of the new range of frequencies for our message after the most recent character is added to the message.
Similarly, the new value of low represents the low end of the range of frequencies for our message after the most recent character is added to the message.

In our example, low is still 0, range is 256, cum is 63, and total_freq is 63. The expression to calculate high is low + (range * cum) / total_freq - 1. Therefore, high is calculated as 0 + (256 * 63) / 63 - 1, or 255. This means that the new range of frequencies for our message ends slightly below 256.

Next, we recalculate low. Its value is still 0 so far, range is 256, prev_cum is 55, and total_freq is 63. The expression to calculate low is low + (range * prev_cum) / total_freq. Therefore, we calculate the new value of low as 0 + (256 * 55) / 63, or 223. This means that the new range of frequencies for our message begins at 223.

Now we are finally ready to start sending bits out. The loop shown in Figure arenc.06 extracts as many bits as possible from the encoding of the message so far, and widens the range correspondingly to allow the encoding of more characters.

The bit output loop (from compress\arenc.cpp) (Figure arenc.06)

codelist/arenc.06

In order to understand this code, we need to look at the table of possible initial bits of high and low, which is given in Figure initcode.

The entries in this table could use some explanation. The first two columns contain all the possible combinations of the first two bits of low and high; we know that low cannot be greater than high, since these two values delimit the range of codes for the message being compressed. If low ever became greater than high, it would be impossible to encode any further characters. The "Action" column indicates what, if anything, we can output now. Clearly, if low and high have the same first bit, we can output that bit.
The entries labeled "Next" indicate that since the separation between the values of low and high is at least one-fourth of the total range of values (0-TOP_VALUE), we can encode at least one more character now; this is the reason for the limit on the total frequency of all characters.

Possible initial code bits (Figure initcode)

      Low    High    Action
    +------+------+--------+
    |  00  |  00  |   0    |
    |  00  |  01  |   0    |
    |  00  |  10  |  Next  |
    |  00  |  11  |  Next  |
    |  01  |  01  |   0    |
    |  01  |  10  | Defer  |
    |  01  |  11  |  Next  |
    |  10  |  10  |   1    |
    |  10  |  11  |   1    |
    |  11  |  11  |   1    |
    +------+------+--------+

The entry "Defer" means that we can't do any output now; however, when we do have output, we will be able to emit at least the first two bits of the result, since we know already that these bits are either "01" or "10". As we will see shortly, this condition is indicated by a nonzero value for bits_to_follow.
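To tie the steps of the walkthrough together, here is a minimal sketch that repeats the calculations for the first character of our example message, 'D', using the simplified values (TOP_VALUE of 255, the two-byte sample frequency table, and total_freq of 63). It is an illustration under those assumptions rather than the code in arenc.cpp: the types CumPair, Interval, and Action and the functions accumulate, narrow, and classify are hypothetical names, the pair weight is computed by adding two translate entries instead of using a both_weights lookup, and the actual bit output loop in Figure arenc.06 may test the conditions of Figure initcode in a different way.

```cpp
#include <cassert>

// Hypothetical sketch using the simplified example values from the text.
constexpr unsigned TOP_VALUE = 255;
constexpr unsigned translate[4] = {1, 2, 8, 52};
constexpr unsigned char freq_table[2] = {0x13, 0x02};  // (A,B) = (1,3), (C,D) = (0,2)
constexpr unsigned total_freq = 63;

struct CumPair { unsigned prev_cum; unsigned cum; };
struct Interval { unsigned low; unsigned high; };
enum class Action { Output0, Output1, Defer, Next };

// Accumulate frequencies up to (prev_cum) and through (cum) the given symbol.
CumPair accumulate(unsigned symbol)
{
    const unsigned char* freq_ptr = freq_table;
    unsigned prev_cum = 0;
    for (unsigned i = 0; i < symbol / 2; ++i) {      // one pass per full byte
        unsigned char current_pair = *freq_ptr++;
        prev_cum += translate[current_pair >> 4] + translate[current_pair & 0x0F];
    }
    unsigned char current_pair = *freq_ptr;
    unsigned high_weight = translate[current_pair >> 4];
    unsigned cum;
    if (symbol % 2 == 0) {                           // symbol is first in its byte
        cum = prev_cum + high_weight;
    } else {                                         // symbol is second in its byte
        cum = prev_cum + high_weight + translate[current_pair & 0x0F];
        prev_cum += high_weight;
    }
    return {prev_cum, cum};
}

// Narrow the interval with the expressions quoted in the text.
Interval narrow(Interval in, CumPair c)
{
    unsigned long range = static_cast<unsigned long>(in.high - in.low) + 1;
    Interval out;
    out.high = in.low + static_cast<unsigned>((range * c.cum) / total_freq) - 1;
    out.low = in.low + static_cast<unsigned>((range * c.prev_cum) / total_freq);
    return out;
}

// Classify an 8-bit interval by the first two bits of low and high,
// following the rows of Figure initcode (assumes low <= high).
Action classify(Interval iv)
{
    unsigned low_bits = iv.low >> 6;
    unsigned high_bits = iv.high >> 6;
    if ((high_bits >> 1) == 0) return Action::Output0;  // both start with 0
    if ((low_bits >> 1) == 1) return Action::Output1;   // both start with 1
    if (low_bits == 1 && high_bits == 2) return Action::Defer;
    return Action::Next;
}
```

With these definitions, accumulate(3) yields prev_cum of 55 and cum of 63, and narrow applied to the initial interval 0 to 255 yields low of 223 and high of 255, matching the figures worked out above. Since both 223 (11011111 binary) and 255 begin with a 1 bit, classify selects one of the rows of Figure initcode whose action is to output a 1.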