decoding. Then we call done_encoding (Figure arenc.08) and done_outputing_bits (Figure arenc.09) to flush any remaining information to the output file. Finally, we close the input and output files and exit.

The done_encoding function (from compress\arenc.cpp) (Figure arenc.08)

codelist/arenc.08

The done_outputing_bits function (from compress\arenc.cpp) (Figure arenc.09)

codelist/arenc.09

Finding the Bottlenecks

Now that we have developed a working version of our program, let's see how much more performance we can get by judicious optimization in the right places. The first step is to find out where those places are. Therefore, I ran the program under Microsoft's profiler on a file of approximately 11 KB, compressing it to a little less than 5 KB (a compression ratio of 2.3 to 1). This didn't take very long, as you can see from the profile in Figure encode.pro1.

Profile of encode.exe before optimization (Figure encode.pro1)

Total time: 167.100 milliseconds

Func Time (msec)    %     Function
      80.010      49.0    encode_symbol(unsigned int,unsigned int) (arenc.obj)
      29.139      17.9    _main (encode.obj)
      18.717      11.5    output_1(void) (arenc.obj)
      17.141      10.5    output_0(void) (arenc.obj)
      15.957       9.8    update_model(int,unsigned char) (adapt.obj)
       2.234       1.4    start_model(void) (adapt.obj)

The majority of the CPU time is spent in the encode_symbol function, so that's where we'll look first. As I mentioned in the discussion of the algorithm, a significant amount of the CPU time needed to encode a symbol is spent in the loop shown in Figure arenc.03. Accordingly, we should be able to increase the speed of our implementation by using a more efficient representation of the characters to be encoded, so that the accumulation loop will be executed fewer times in total. While we cannot spare the space for self-organizing tables of character translations, a fixed translation of characters according to some reasonable estimate of their likelihood in the text is certainly possible. Let's see how that would affect our performance.

A Bunch of Real Characters

After analyzing a number of (hopefully representative) text files, I used the results to produce a list of the characters in descending order of number of occurrences. This information is stored in the symbol_translation table (Figure adapt.02), which is used during decoding to translate from the internal code into ASCII.

The symbol_translation table (from compress\adapt.cpp) (Figure adapt.02)

codelist/adapt.02

The inverse transformation, used by the encoder, is supplied by the char_translation table, which is initialized near the end of the start_model function (Figure adapt.00). To use these new values for each character when encoding and decoding, we have to make some small changes to the encoding and decoding main programs. Specifically, in main, we have to use the translated character, rather than the ASCII code as read from the input file, when we call the encode_symbol function. The new version of the main program for encoding is shown in Figure encode1.00.

The new main encoding program (compress\encode1.cpp) (Figure encode1.00)

codelist/encode1.00

Of course, corresponding changes have to be made to the main function in the decoding program, which then looks like Figure decode1.00.

The new main decoding program (compress\decode1.cpp) (Figure decode1.00)

codelist/decode1.00
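The book's actual tables and initialization code are in Figures adapt.00 and adapt.02, which are not reproduced here. Purely to illustrate the idea, here is a minimal sketch, in present-day C++, of how such a pair of tables could be built from a character histogram; the type and function names below are my own, not the ones used in compress\adapt.cpp.

#include <algorithm>
#include <array>
#include <numeric>

// Sketch only: build a translation pair from a byte-frequency histogram.
// symbol_translation maps internal code -> ASCII (used by the decoder);
// char_translation maps ASCII -> internal code (used by the encoder).
struct TranslationTables {
    std::array<unsigned char, 256> symbol_translation;
    std::array<unsigned char, 256> char_translation;
};

TranslationTables build_tables(const std::array<unsigned long, 256>& counts)
{
    // Start from the identity ordering 0, 1, ..., 255.
    std::array<int, 256> order;
    std::iota(order.begin(), order.end(), 0);

    // Sort the byte values by descending frequency, so that the most common
    // characters receive the smallest internal codes and therefore the fewest
    // trips through the accumulation loop in encode_symbol.
    std::sort(order.begin(), order.end(),
              [&counts](int a, int b) { return counts[a] > counts[b]; });

    TranslationTables t;
    for (int internal = 0; internal < 256; ++internal) {
        t.symbol_translation[internal] = static_cast<unsigned char>(order[internal]);
        t.char_translation[order[internal]] = static_cast<unsigned char>(internal);
    }
    return t;
}

With tables like these in hand, the change to the encoder's main loop amounts to passing the char_translation entry for each input character to encode_symbol instead of the raw ASCII code, and the decoder applies symbol_translation to each value it decodes before writing it to the output file; the book's actual versions of these main programs are the ones shown in Figures encode1.00 and decode1.00.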
Figure char.trans shows the speed improvement resulting from this modification.

Profile of encode1.exe, after character translation (Figure char.trans)

Total time: 145.061 milliseconds

Func Time (msec)    %     Function
      58.980      41.8    encode_symbol(unsigned int,unsigned int) (arenc.obj)
      25.811      18.3    _main (encode1.obj)
      18.964      13.4    output_0(void) (arenc.obj)
      18.586      13.2    output_1(void) (arenc.obj)
      16.565      11.7    update_model(int,unsigned char) (adapt.obj)
       2.213       1.6    start_model(void) (adapt.obj)

This speed improvement of about 15% took a total of about an hour to achieve, most of which went into writing and debugging the character-counting program and running it on a number of sample files. However, this change isn't helpful in all cases. It will speed up the handling of files that resemble those used to calculate the translation tables, but it will have much less impact on files with a significantly different mix of characters. In the worst case, it might even slow the program down noticeably; for example, a file containing many null (0) characters might be compressed faster without translation, since encoding the most common character would then require no executions of the accumulation loop. This is an example of the general rule that knowing the makeup of the data is important in deciding how to optimize the program.
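The character-counting program mentioned above is not listed in the chapter. A minimal sketch of that kind of utility might look like the following; the file handling and output format here are my own assumptions rather than the book's code.

#include <cstdio>

// Count how often each byte value occurs in the files named on the command
// line, then print the byte values in descending order of frequency. An
// ordering like this is what goes into the symbol_translation table.
int main(int argc, char* argv[])
{
    unsigned long counts[256] = { 0 };

    for (int i = 1; i < argc; ++i) {
        std::FILE* in = std::fopen(argv[i], "rb");
        if (in == nullptr)
            continue;                    // skip files that cannot be opened
        int ch;
        while ((ch = std::getc(in)) != EOF)
            ++counts[ch];
        std::fclose(in);
    }

    // Repeatedly pick the most frequent byte value not yet printed.
    bool used[256] = { false };
    for (int rank = 0; rank < 256; ++rank) {
        int best = -1;
        for (int c = 0; c < 256; ++c)
            if (!used[c] && (best < 0 || counts[c] > counts[best]))
                best = c;
        used[best] = true;
        std::printf("%3d  %lu\n", best, counts[best]);
    }
    return 0;
}

Running such a utility over a set of files representative of the data you expect to compress is exactly the kind of analysis that tells you whether a fixed translation table will pay off.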
Summary

In this chapter, we have seen an example of how intelligent encoding of information can pack a lot of data into a relatively small amount of memory. In the next chapter, we will see how quantum files allow us to gain efficient random access to a large volume of variable-length textual data.

Problems

1. How could arithmetic coding be used where each of a number of blocks of text must be decompressed without reference to the other blocks?

2. If we knew that the data to be compressed consisted entirely of characters with ASCII codes 0 through 127, could we reduce the memory requirements of this algorithm further? If so, by how much, and how?

(You can find suggested approaches to problems in Chapter artopt.htm.)

Footnotes

1. I. H. Witten, R. M. Neal, and J. G. Cleary. "Arithmetic Coding for Data Compression". Commun. ACM 30(6), 520-540 (June 1987).

2. This column is derived from the "Cum. Freq." column; it represents the range of codes that has been allocated to messages that begin with the characters in the "Message So Far" column. For example, messages beginning with the character 'A' have a cumulative frequency of 16/64. This means that messages that start with that letter have codes between the cumulative frequency of the previous message (in this case 0, since this is the first message) and 15/64. Similarly, since messages beginning with the letter 'B' have a cumulative frequency of 32/64, codes for such messages must lie between 16/64, which is the cumulative frequency of the previous entry, and 31/64.

3. For that matter, a similar problem occurs when we have three characters of equal frequency; the characters have to end up with different code lengths in a Huffman code, even though their frequencies are equal.

4. As we will see later, there are commonly occurring situations with even more lopsided distributions than this, once the context in which each character appears is taken into account; for example, following a carriage return in a DOS-formatted text file, the next character is almost certain to be a line feed. In such cases, we can use much less than one-fourth of a bit to represent a line feed character following a carriage return, and thereby gain far more efficiency compared to Huffman coding.

5. Notice that if we were encoding "AA", we still wouldn't have even the first bit!

6. The second most frequent message, "AAB", is a special case, which we will see again later. Although it occupies only nine of the 64 possible message positions, some of those nine start with a 0 and some with a 1; therefore, we don't know the first bit to be output. However, we do know that the code for messages beginning with "AAB" starts with either "011" or "100"; as we continue encoding characters, eventually we will have enough information to decide which. At that point, we will be able to output at least those first three bits.

7. A consequence of this approach is that we cannot decompress data in any order other than the one in which it was compressed, which prevents direct access to records in compressed files. This limitation is the subject of one of the problems at the end of this chapter.

8. A. Moffat. "Word-Based Text Compression". Software Practice and Experience, 19(2), 185-198 (February 1989).

9. This is the column labeled "Cum. Freq." in Figures aritha1-aritha3 and arithb1-arithb3.

10. For compilers that produce two-byte unsigneds, the maximum cumulative frequency is 16,383 if we wish to avoid the use of long arithmetic. This limits the dynamic range to 16,128 for the one common character and 1 for each of the other characters.

11. Theoretically, we could use 14 bits per entry, but the CPU time required to encode or decode would be greatly increased by packing and unpacking frequencies stored in that way.

12. The last entry in the upgrade_threshold array is set to 255, which prevents the index from being increased beyond 15, the maximum value that can be stored in a four-bit field; char_total, being an unsigned char variable, cannot exceed 255.

13. Although this array is not used in the initial version of our program, we will see near the end of this chapter how it contributes to our optimization effort.

14. To fully understand the function of low and high, we must wait until we examine encode_symbol (Figure arenc.02). For now, let's just say that they indicate the current state of our knowledge about the message being encoded; the closer together they are, the more bits we are ready to output.

15. Actually, the top end of the range of frequencies for the message is just below high+1; even the authors of the reference in footnote 1, which is the source of the original version of this algorithm, admit that this is confusing!

16. In our example, we don't execute this particular adjustment.

17. The proof of this is left to the reader as an exercise; see the problems at the end of the chapter.

18. You may be wondering what happened to the remaining information contained in low and high. Rest assured that it will not be left twisting in the wind; there is a function called done_encoding that makes sure that all of the remaining information is encoded in the output stream.
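To put a number on the claim in footnote 4: arithmetic coding spends roughly -log2(p) bits on a symbol to which the model assigns probability p. The probabilities below are illustrative assumptions of mine, not measurements from the book:

\[
-\log_2 0.99 \approx 0.0145 \text{ bits}, \qquad -\log_2 0.85 \approx 0.234 \text{ bits}.
\]

Both figures come in under the quarter of a bit mentioned in the footnote, while a Huffman code must still spend at least one full bit on every line feed it encodes.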
