1. Make only one pass through the data for each character position, rather than two as at present?

2. Add the ability to handle integer or floating-point data rather than character data?

3. Add the ability to handle variable-length strings as keys?

(You can find suggested approaches to these problems in Chapter artopt.htm.)

Footnotes

1. 250 blocks of 1000 records, at 16 bytes of overhead per block, yields an overhead of 250*16, or 4000 bytes.

2. A description of this sort can be found in Donald Knuth's book The Art of Computer Programming, vol. 3. Reading, Massachusetts: Addison-Wesley, 1968. Every programmer should be aware of this book, since it describes a great number of generally applicable algorithms. As I write this, a new edition of this classic work on algorithms is about to be published.

3. However, distribution counting is not suitable for use where the data will be kept in virtual memory. See my article "Galloping Algorithms", in Windows Tech Journal, 2 (February 1993), 40-43, for details on this limitation.

4. As is standard in C and C++ library implementations, this version of qsort requires the user to supply the address of a function that will compare the items to be sorted. While this additional overhead biases the comparison slightly against qsort, this small disadvantage is not of the same order of magnitude as the difference in inherent efficiency of the two algorithms.

5. Of course, we could just as well have used the keys that start with any other character.

6. There are some situations where this is not strictly true. For example, suppose we want to read a large fraction of the records in a file in physical order, and the records are only a few hundred bytes or less. In that case, it is almost certainly faster to read them all with a big buffer and skip the ones we aren't interested in. The gain from reading large chunks of data at once is likely to outweigh the time lost in reading some unwanted records.

7. Please note that there is a capacity constraint in this program relating to the total number of ZIP entries that can be processed. See Figure mail.00a for details on this constraint.

Cn U Rd Ths (Qkly)? A Data Compression Utility

Introduction

In this chapter we will examine the Huffman coding and arithmetic coding methods of data compression and develop an implementation of the latter algorithm. The arithmetic coding algorithm allows a tradeoff between memory consumption and compression ratio; our emphasis will be on minimum memory consumption, on the assumption that the result would eventually be used as embedded code within a larger program.

Algorithms Discussed

Huffman Coding, Arithmetic Coding, Lookup Tables

Huffman Coding

Huffman coding is widely recognized as the most efficient method of encoding characters for data compression. This algorithm is a way of encoding different characters in different numbers of bits, with the most common characters encoded in the fewest bits. For example, suppose we have a message made up of the letters 'A', 'B', and 'C', which can occur in any combination. Figure huffman.freq shows the relative frequency of each of these letters and the Huffman code assigned to each one.

Huffman code table (Figure huffman.freq)

                        Huffman   Fixed-length
 Letter    Frequency    Code      Code
+---------+------------+---------+--------------+
|   A     |    1/4     |   00    |      00      |
|   B     |    1/4     |   01    |      01      |
|   C     |    1/2     |   1     |      10      |
+---------+------------+---------+--------------+

The codes are determined by the frequencies: as mentioned above, the letter with the greatest frequency, 'C', has the shortest code, of one bit. The other two letters, 'A' and 'B', have longer codes. On the other hand, the simplest code to represent any of three characters would use two bits for each character. How would the length of an encoded message be affected by using the Huffman code rather than the fixed-length one? Let's encode the message "CABCCABC" using both codes. The results are shown in Figure huffman.fixed.

Huffman vs. fixed-length coding (Figure huffman.fixed)

            Huffman   Fixed-length
 Letter     Code      Code
+----------+---------+--------------+
|   C      |   1     |      10      |
|   A      |   00    |      00      |
|   B      |   01    |      01      |
|   C      |   1     |      10      |
|   C      |   1     |      10      |
|   A      |   00    |      00      |
|   B      |   01    |      01      |
|   C      |   1     |      10      |
+----------+---------+--------------+
 Total bits used 12        16

Here we have saved one-fourth of the bits required to encode this message; often, the compression can be much greater. Since we ordinarily use an eight-bit ASCII code to represent characters, if one of those characters (such as carriage return or line feed) accounts for a large fraction of the characters in a file, giving it a short code of two or three bits can reduce the size of the file noticeably.
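To make the bookkeeping concrete, here is a minimal C++ sketch, for illustration only, that encodes "CABCCABC" with the code table of Figure huffman.freq and counts the bits. It is not part of the compression utility we will develop in this chapter, and the names are invented for the example.

#include <iostream>
#include <map>
#include <string>

int main()
{
    // Variable-length codes from Figure huffman.freq: the most frequent
    // letter, 'C', gets the shortest code.
    std::map<char, std::string> code = {
        {'A', "00"}, {'B', "01"}, {'C', "1"}};

    std::string message = "CABCCABC";
    std::string encoded;
    for (char c : message)
        encoded += code[c];

    // Prints 100011100011 (12 bits), versus 16 bits for the two-bit
    // fixed-length code.
    std::cout << encoded << " (" << encoded.size() << " bits)\n";
    return 0;
}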
Let's see how arithmetic coding1 would encode the first three characters of the same message, "CAB".

One-character messages (Figure aritha1)

                 Cum.                           Previous   Current   Output
 Message   Freq. Freq.  Codes                   Output     Output    So Far
+---------+-----+-----+-----------------------+----------+---------+--------+
|   A     | 16  | 16  | 000000( 0)-001111(15) |   None   |   00    |  00    |
|   B     | 16  | 32  | 010000(16)-011111(31) |   None   |   01    |  01    |
| * C     | 32  | 64  | 100000(32)-111111(63) |   None   |   1     |  1     |
+---------+-----+-----+-----------------------+----------+---------+--------+

Figure aritha1 is the first of several figures which contain the information needed to determine how arithmetic coding would encode messages of up to three characters from an alphabet consisting of the letters 'A', 'B', and 'C', with frequencies of 1/4, 1/4, and 1/2, respectively. The frequency of a message composed of three characters chosen independently is the product of the frequencies of those characters. Since the lowest common denominator of these three fractions is 1/4, the frequency of any three-character message will be a multiple of (1/4)^3, or 1/64. For example, the frequency of the message "CAB" will be (1/2)*(1/4)*(1/4), or 1/32 (= 2/64). For this reason, we will express all of the frequency values in terms of 1/64ths.

Thus, the "Freq." column signifies the expected frequency of occurrence of each message, in units of 1/64; the "Cum. Freq." column accumulates the values in the first column; the "Codes" column indicates the range of codes that can represent each message2; the "Previous Output" column shows the bits that have been output before the current character was encoded; the "Current Output" column indicates what output we can produce at this point in the encoding process; and the "Output So Far" column shows the cumulative output for that message, starting with the first character encoded.

As the table indicates, since the first character happens to be a 'C', we can output "1", because all possible messages starting with 'C' have codes starting with a "1".
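The "Codes" column can be reproduced mechanically by narrowing a range of 64 possible codes in proportion to the cumulative frequencies. The following C++ sketch illustrates that bookkeeping; it is not the encoder we will develop later in this chapter, and names such as cum_freq are invented for the example.

#include <iostream>
#include <string>

int main()
{
    // Cumulative frequency boundaries in units of 1/64, as in Figure
    // aritha1: 'A' occupies 0-15, 'B' occupies 16-31, 'C' occupies 32-63.
    const std::string alphabet = "ABC";
    const int cum_freq[4] = {0, 16, 32, 64};

    std::string message = "CA";        // try "C", "CA", or "CAB"
    int low = 0, high = 64;            // current code range, in 1/64ths

    for (char c : message)
    {
        int i = (int)alphabet.find(c);
        int range = high - low;
        high = low + range * cum_freq[i + 1] / 64;
        low  = low + range * cum_freq[i]     / 64;
        std::cout << "after '" << c << "': codes " << low
                  << " through " << high - 1 << "\n";
    }
    return 0;
}

For the message "CA" this prints codes 32 through 63 after the 'C' and codes 32 through 39 after the 'A', matching the starred rows of Figures aritha1 and aritha2.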
Let's continue with Figure aritha2 to see the encoding for a two-character message.

Two-character messages (Figure aritha2)

                 Cum.                           Previous   Current   Output
 Message   Freq. Freq.  Codes                   Output     Output    So Far
+---------+-----+-----+-----------------------+----------+---------+--------+
|   AA    |  4  |  4  | 000000(00)-000011(03) |    00    |   00    |  0000  |
|   AB    |  4  |  8  | 000100(04)-000111(07) |    00    |   01    |  0001  |
|   AC    |  8  | 16  | 001000(08)-001111(15) |    00    |   1     |  001   |
|   BA    |  4  | 20  | 010000(16)-010011(19) |    01    |   00    |  0100  |
|   BB    |  4  | 24  | 010100(20)-010111(23) |    01    |   01    |  0101  |
|   BC    |  8  | 32  | 011000(24)-011111(31) |    01    |   1     |  011   |
| * CA    |  8  | 40  | 100000(32)-100111(39) |    1     |   00    |  100   |
|   CB    |  8  | 48  | 101000(40)-101111(47) |    1     |   01    |  101   |
|   CC    | 16  | 64  | 110000(48)-111111(63) |    1     |   1     |  11    |
+---------+-----+-----+-----------------------+----------+---------+--------+

After encoding the first two characters of our message, "CA", our cumulative output is "100", since the range of codes for messages starting with "CA" is from "100000" to "100111"; all these codes start with "100". The whole three-character message is encoded as shown in Figure aritha3. We have generated exactly the same output from the same input as we did with Huffman coding. So far, this seems to be an exercise in futility; is arithmetic coding just another name for Huffman coding? These two algorithms provide the same compression efficiency only when the frequencies of the characters to be encoded happen to be representable as integral powers of 1/2, as was the case in our examples so far; however, consider the frequency table shown in Figure huffman.poor.

Three-character messages (Figure aritha3)

                 Cum.                           Previous   Current   Output
 Message   Freq. Freq.  Codes                   Output     Output    So Far
+---------+-----+-----+-----------------------+----------+---------+--------+
|   AAA   |  1  |  1  | 000000(00)-000000(00) |   0000   |   00    | 000000 |
|   AAB   |  1  |  2  | 000001(01)-000001(01) |   0000   |   01    | 000001 |
|   AAC   |  2  |  4  | 000010(02)-000011(03) |   0000   |   1     | 00001  |
|   ABA   |  1  |  5  | 000100(04)-000100(04) |   0001   |   00    | 000100 |
|   ABB   |  1  |  6  | 000101(05)-000101(05) |   0001   |   01    | 000101 |
|   ABC   |  2  |  8  | 000110(06)-000111(07) |   0001   |   1     | 00011  |
|   ACA   |  2  | 10  | 001000(08)-001001(09) |   001    |   00    | 00100  |
|   ACB   |  2  | 12  | 001010(10)-001011(11) |   001    |   01    | 00101  |
|   ACC   |  4  | 16  | 001100(12)-001111(15) |   001    |   1     | 0011   |
|   BAA   |  1  | 17  | 010000(16)-010000(16) |   0100   |   00    | 010000 |
|   BAB   |  1  | 18  | 010001(17)-010001(17) |   0100   |   01    | 010001 |
|   BAC   |  2  | 20  | 010010(18)-010011(19) |   0100   |   1     | 01001  |
|   BBA   |  1  | 21  | 010100(20)-010100(20) |   0101   |   00    | 010100 |
|   BBB   |  1  | 22  | 010101(21)-010101(21) |   0101   |   01    | 010101 |
|   BBC   |  2  | 24  | 010110(22)-010111(23) |   0101   |   1     | 01011  |
|   BCA   |  2  | 26  | 011000(24)-011001(25) |   011    |   00    | 01100  |
|   BCB   |  2  | 28  | 011010(26)-011011(27) |   011    |   01    | 01101  |
|   BCC   |  4  | 32  | 011100(28)-011111(31) |   011    |   1     | 0111   |
|   CAA   |  2  | 34  | 100000(32)-100001(33) |   100    |   00    | 10000  |
| * CAB   |  2  | 36  | 100010(34)-100011(35) |   100    |   01    | 10001  |
|   CAC   |  4  | 40  | 100100(36)-100111(39) |   100    |   1     | 1001   |
|   CBA   |  2  | 42  | 101000(40)-101001(41) |   101    |   00    | 10100  |
|   CBB   |  2  | 44  | 101010(42)-101011(43) |   101    |   01    | 10101  |
|   CBC   |  4  | 48  | 101100(44)-101111(47) |   101    |   1     | 1011   |
|   CCA   |  4  | 52  | 110000(48)-110011(51) |   11     |   00    | 1100   |
|   CCB   |  4  | 56  | 110100(52)-110111(55) |   11     |   01    | 1101   |
|   CCC   |  8  | 64  | 111000(56)-111111(63) |   11     |   1     | 111    |
+---------+-----+-----+-----------------------+----------+---------+--------+
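The "Output So Far" column is simply the leading bits shared by every code still in the range. Continuing the sketch above (again, an illustration with invented names, not the encoder developed later in this chapter), we can reproduce the "10001" produced for "CAB":

#include <iostream>
#include <string>

int main()
{
    const std::string alphabet = "ABC";
    const int cum_freq[4] = {0, 16, 32, 64};   // in units of 1/64

    std::string message = "CAB";
    int low = 0, high = 64;                    // half-open code range [low, high)

    // Narrow the range for each character, exactly as in the previous sketch.
    for (char c : message)
    {
        int i = (int)alphabet.find(c);
        int range = high - low;
        high = low + range * cum_freq[i + 1] / 64;
        low  = low + range * cum_freq[i]     / 64;
    }

    // The codes for "CAB" run from 34 (100010) through 35 (100011); the
    // output so far is the leading bits those six-bit codes have in common.
    std::string output;
    for (int bit = 5; bit >= 0; bit--)
    {
        int low_bit  = (low        >> bit) & 1;
        int high_bit = ((high - 1) >> bit) & 1;
        if (low_bit != high_bit)
            break;
        output += (char)('0' + low_bit);
    }
    std::cout << '"' << message << "\" encodes to \"" << output
              << "\" so far\n";               // prints 10001
    return 0;
}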