Originally, I wasn't planning to make any changes to the program from the previous version to this one other than increasing the buffer size. However, when I ran tests with the 64 MB memory configuration, I discovered that making just that one change caused the program to fail with an out-of-memory message. This was hard to understand at first, because I was allocating only 16 MB at any one time; surely a 64 MB machine, even one running Windows 95, should be able to handle that without difficulty! However, the program was crashing at the same place every time I ran it with the same data, after a number of passes through the main loop, so I had to figure out what the cause might be.

At first, I didn't see anything questionable about the program. On further examination, however, I did notice something that was cause for concern: I was allocating and freeing those memory buffers every time through the main loop. While it seems reasonable to me that allocating a number of buffers and then freeing all of them should return the memory allocation map to its original state, apparently this was not the case. At least, that's the only explanation I can find for why the available memory displayed by the debugger dropped suddenly after a number of passes through the main loop in which it had remained nearly constant.

Actually, even if allocating and freeing the buffers on every pass through the loop did work properly, it isn't the right way to handle the memory allocation task. It is much more efficient to allocate one large buffer and just keep pointers to the places in that buffer where our smaller, logically distinct buffers reside. Once I made those changes to the program, the crashes went away, so I had apparently identified the problem correctly. The new, improved version is shown in Figure zen03.cpp.

Zensort version 3 (Zensort\zen03.cpp) (Figure zen03.cpp)

I think the changes in the program are relatively self-explanatory. Basically, the only changes are the addition of a new variable called BigBuffer, which holds all the data for the records being sorted, and the change of the previously existing Buffer variable from an array of char to an array of char*. Rather than allocating and deleting the individual buffers on every pass through the main loop, we merely recalculate the position in the large buffer where the logical buffer for each character begins (a short sketch of this layout appears below). The performance results for this version of the program are shown in Figure timings.03.

Performance of Zensort version 3 (Figure timings.03)

While we didn't get as much of an increase in performance from making more room available for the buffers as we did from improving the algorithm in the previous stage, we did get about a 13 percent increase in throughput on the largest file with the 64 MB system, and about 17 percent with the 192 MB system, which isn't negligible. Now let's take a look at another way of speeding up this algorithm that will have considerably more effect: sorting on two characters at a time.
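Before moving on, here is a minimal sketch of the single-large-buffer layout just described. The names BigBuffer and Buffer come from the discussion above, but the sizes, the equal-sized slices, the pass structure, and everything else in this fragment are illustrative assumptions on my part, not the actual code of zen03.cpp.

    #include <cstddef>
    #include <vector>

    int main()
    {
        const std::size_t BUFFER_COUNT = 256;                      // one logical buffer per key character
        const std::size_t TOTAL_BUFFER_SPACE = 16 * 1024 * 1024;   // e.g., 16 MB in all
        const std::size_t BUFFER_SIZE = TOTAL_BUFFER_SPACE / BUFFER_COUNT;

        // One allocation for the whole run, instead of 256 new/delete pairs per pass.
        std::vector<char> BigBuffer(TOTAL_BUFFER_SPACE);

        // Buffer is now an array of char* rather than an array of char.
        char* Buffer[BUFFER_COUNT];

        for (int pass = 0; pass < 4; ++pass)   // however many passes the sort makes
        {
            // On each pass we merely recompute where each logical buffer begins;
            // no allocation or deallocation takes place inside the loop.
            // (Equal-sized slices here for simplicity; the real program could give
            // each logical buffer its own size.)
            for (std::size_t i = 0; i < BUFFER_COUNT; ++i)
                Buffer[i] = &BigBuffer[0] + i * BUFFER_SIZE;

            // ... the distribution pass for this character position goes here ...
        }
        return 0;
    }

The key point is that the heap is touched exactly once, no matter how many passes the sort makes.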
The Fourth Version

Every pass we make through the file requires a significant amount of disk activity, both reading and writing. Therefore, anything that reduces the number of passes should speed the program up noticeably. The simplest general way of accomplishing this goal is to sort on two characters at a time rather than one, as we have been doing until now. This requires a number of changes to the program, none of which is particularly complicated. The new version is shown in Figure zen04.cpp.

Zensort version 4 (Zensort\zen04.cpp) (Figure zen04.cpp)

We'll start by examining a new function called CalculateKeySegment, which, as its name suggests, calculates the segment of the key that we're going to use for sorting. Because we're sorting on two characters at a time, this function combines two characters of the input key into a single value, with the more significant character contributing more to the result than the less significant one. A simple way to think of this optimization is that we're now sorting on an alphabet of 65536 characters, each of which is composed of two characters from the regular ASCII set. Because a character can take on 256 possible values, we can calculate the buffer in which we will store a particular record by multiplying the first of its two key characters by 256 and adding the second. This value can never be more than 65535 or less than 0, so we allocate 65536 buffers, one for each possible combination of two characters.

Besides the substitution of the key segment for the individual character of the key, the other major change to the program is in the handling of a full buffer. In the old program, whenever a new record would have caused an output buffer to overflow, we wrote out the previous contents of the buffer and then stored the new record at the beginning of the buffer. This approach has the drawback that a record can be larger than the allocated buffer, in which case the program will fail when it attempts to store the record in that buffer. That wasn't much of a problem in the previous version of the program, because with only 256 buffers, each of them was big enough to hold any reasonably sized record. Now that we have 65536 buffers, however, it is a real possibility. With the current implementation, the program will work correctly as long as no record is more than twice the size of a buffer. If we're worried about records larger than that, we can change the code to handle a record in any number of segments, using a while loop that keeps storing segments of the record in the buffer and writing them out until the remaining segment fits in the buffer.

So how does this fourth version of the program actually perform? Figure timings.04 answers that question.

Performance of Zensort version 4 (Figure timings.04)

If you compare this fourth version of the program with the previous version on the small file, you'll notice that throughput has nearly doubled on both memory configurations. As usual, however, what matters more is how well the program performs when we have a lot of data to sort. As the performance results show, throughput when sorting the large file has improved by over 50 percent on both the small and large memory configurations.

We've just about reached the end of the line with incremental changes to the implementation. To get any further significant increase in performance, we'll need a radically different approach, and that's what the next version of this program provides.
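Before moving on to the fifth version, here is a rough sketch of the two ideas just described: combining two key characters into a single value in the range 0 to 65535, and flushing an oversized record in pieces. CalculateKeySegment is the name used above, but its signature, the StoreRecord helper, and all the parameter names are my own illustrative assumptions, not the code from zen04.cpp.

    #include <cstddef>
    #include <cstring>

    // Combine two adjacent key characters into one "character" of a 65536-symbol
    // alphabet: the more significant character is weighted by 256, so the result
    // is always between 0 and 65535 and selects one of 65536 buffers.
    unsigned CalculateKeySegment(const unsigned char* key, std::size_t offset)
    {
        return key[offset] * 256u + key[offset + 1];
    }

    // Optional refinement described in the text: if a record may be larger than
    // its buffer, store it in as many pieces as necessary, writing the buffer out
    // each time it fills, until the remaining piece fits.
    void StoreRecord(char* buffer, std::size_t buffer_size, std::size_t& used,
                     const char* record, std::size_t record_size,
                     void (*write_buffer)(const char* data, std::size_t size))
    {
        while (record_size > buffer_size - used)
        {
            std::size_t chunk = buffer_size - used;   // fill whatever space remains
            std::memcpy(buffer + used, record, chunk);
            write_buffer(buffer, buffer_size);        // buffer is full: write it out
            used = 0;                                 // and start refilling it
            record += chunk;
            record_size -= chunk;
        }
        std::memcpy(buffer + used, record, record_size);
        used += record_size;
    }

In the common case the while loop never executes, and this behaves just like storing the record directly.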
The Fifth Version

Before we get to this change in the program, though, it might be instructive if I explain how I arrived at the conclusion that such a change was either necessary or even possible.

Unlike some well-known authors who shall remain nameless, I take technical writing very seriously. I test my code before publishing it, I typeset my books myself to reduce the likelihood of typesetting errors, and I even create my master CDs myself, to minimize the chance that errors will creep in somewhere between my development system and your computer. Of course, this doesn't guarantee that there aren't any bugs in my programs; if major software development companies can't guarantee that, my one-man development and quality assurance organization certainly can't! However, I do a pretty good job, and when I miss something, I usually hear from readers right away and can get the correction into the next printing of the book in question.

You may be wondering what this has to do with the changes to the implementation of this sorting algorithm. The answer is that, although I thought I had discovered something important when I broke through the limited-memory problem with distribution sorting, I decided it would be a good idea to see how its performance compared with other available sorts. Therefore, I asked a friend if he knew of any resources on sorting performance. He found a page about sorting on the Internet, and following up from there, I found a page referring to a sorting contest.

Before I could tell how my implementation would compare with those in the contest, I had to generate some performance figures. Although the page about the contest was somewhat out of date, it gave me enough information to generate a test file similar to the one described in the contest. The description was "one million 100-byte records, with a 10-byte random key". I wasn't sure what they meant by "random": was it a string of ten random digits, ten random binary bytes, or ten random ASCII values? I decided that ten random decimal digits would be close enough to start with, so that's how I created an initial version of a test file.

When I ran my latest, greatest version on this file, I was pretty happy to discover that I could sort about 500,000 records in a minute. The figures on the contest page indicated that this was quite cost-competitive in the "minute sort" category, which was based on the number of records sorted in a minute; although I was certainly not breaking any speed records as such, my system was much cheaper than the one that had set the record, so on a cost-performance basis I was doing quite well. However, I did need some more recent information to see how the latest competition was going. So I contacted Jim Gray, who was listed on that page as a member of the contest committee, and heard back from him the next day.

Imagine my surprise when I discovered that my "fast" sorting algorithm wasn't even in the ballpark. My best throughput of approximately 800 KB/sec (500,000 100-byte records per minute works out to a bit over 800 KB per second) was less than one third that of the leading competitors. Obviously, I had a lot more work to do if I wanted to compete in any serious way.

The first priority was to find out exactly why these other programs were so much faster than mine. My discussion with Jim Gray gave me the clue when he told me that all of the best programs were limited by their disk I/O throughput. Obviously, if we have to make five passes through the file (one for each two-character segment of the 10-byte key), reading and writing all of the data on each pass, we aren't going to be competitive with programs that do much less disk I/O, if that is the limiting factor on sorting speed.
Obviously, any possible sorting algorithm must read the entire input file at least once and write an output file of the same size. Is there any way to reduce the amount of I/O that our sorting algorithm performs so that it approaches that ideal? Although we can't reach that limiting case, it is possible to do much better than we have done so far. However, doing so requires more attention to the makeup of the keys that we are sorting.

Until now, we haven't cared very much about the distribution of the keys, except that we would get larger buffers if there were fewer different characters in the keys, which would reduce the number of disk write operations needed to create the output file and thereby improve performance. However, if the keys were composed of reasonably uniformly distributed characters (or sets of characters) that we could use to divide up the input file into a number of segments of similar size based on their key values, then we could use a "divide and conquer" approach to sorting that can improve performance significantly. That's what the next version of this program, shown in Figure zen05.cpp, does.

Zensort version 5 (Zensort\zen05.cpp) (Figure zen05.cpp)

This new version of the algorithm works in a different way from the ones we've seen before. Instead of moving from right to left through the keys, sorting on the less significant positions first to prepare the way for the more significant positions, we start with the most significant portion of the key, using it to divide the input file into a number of segments of similar size, each of which can then be sorted separately.
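To make the divide-and-conquer idea concrete, here is a rough sketch of one way a key-based partitioning pass might look. This is only an illustration of the concept described above, not the algorithm actually used in zen05.cpp; the record layout, the partition count, the use of temporary files, and the in-memory sort are all assumptions on my part.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <string>
    #include <vector>

    const std::size_t RECORD_SIZE = 100;   // e.g., the contest's 100-byte records
    const std::size_t KEY_SIZE = 10;       // with the 10-byte key at the front
    const int PARTITION_COUNT = 256;       // one partition per leading key character

    void SortByPartitions(std::FILE* in, std::FILE* out)
    {
        // Pass 1: distribute each input record to a temporary partition file
        // chosen by the most significant character of its key.
        // (Error checking omitted for brevity.)
        std::vector<std::FILE*> part(PARTITION_COUNT);
        for (int p = 0; p < PARTITION_COUNT; ++p)
            part[p] = std::tmpfile();

        char record[RECORD_SIZE];
        while (std::fread(record, 1, RECORD_SIZE, in) == RECORD_SIZE)
            std::fwrite(record, 1, RECORD_SIZE, part[(unsigned char)record[0]]);

        // Pass 2: each partition is now a small fraction of the whole file
        // (assuming reasonably uniform keys), so sort it in memory and append it
        // to the output; writing the partitions in order of their leading key
        // character yields a fully sorted output file.
        for (int p = 0; p < PARTITION_COUNT; ++p)
        {
            std::rewind(part[p]);
            std::vector<std::string> records;
            while (std::fread(record, 1, RECORD_SIZE, part[p]) == RECORD_SIZE)
                records.push_back(std::string(record, RECORD_SIZE));

            std::sort(records.begin(), records.end(),
                      [](const std::string& a, const std::string& b)
                      { return a.compare(0, KEY_SIZE, b, 0, KEY_SIZE) < 0; });

            for (const std::string& r : records)
                std::fwrite(r.data(), 1, RECORD_SIZE, out);
            std::fclose(part[p]);
        }
    }

The payoff, in this sketch at least, is that every record is read twice and written twice in total, regardless of the key length, rather than being read and written once per pass over the key.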