use the leftmost part of the key to decide which logical buffer the record will be put in. Once we have decided which logical buffer the record will go into, we use an insertion sort to stick it into the appropriate place in the buffer, i.e., after any record in that buffer with a lower key and ahead of any record in that buffer with a higher key.

Our use of the insertion sort to arrange the records in each buffer is the reason that we need a reasonably uniform distribution of keys to get good performance. As we have seen in previous versions of the program, if the keys are very non-uniform in their distribution, then each logical buffer will be very large. However, in contrast to our previous experience, with this version of the algorithm big logical buffers are a major impediment to good performance. The problem is that insertion sort is extremely time-consuming when used with large numbers of keys. Therefore, if we have a few big logical buffers, the insertion sort will dominate the time required to execute the algorithm. However, if the keys are reasonably uniformly distributed, then each logical buffer will be small and the insertion sort will not take very long to execute.

If we had enough memory to hold the entire input file, this version of the algorithm would be very simple to implement: we would examine the key for each record, decide which buffer it goes into, then put it into that buffer in the appropriate place via the insertion sort. However, we may not have enough memory to hold a 100 MB file, and even if we do, using that much memory for our buffer is likely to be counterproductive because it will prevent that memory from being used for disk caching by the operating system, thus reducing the speed of our disk I/O. In addition, this solution is not scalable to larger files. Therefore, we need another approach.

Luckily, there is one that is not too difficult to implement: we make several passes through the input file, selecting the records that will fit in the buffer according to their keys. At the end of each pass, we write the entire buffer out to disk with one disk I/O operation. On each pass, we select successively higher key values, until we have handled all the records in the input file.

Let's see exactly how this works with our usual variable-length record input file. One fairly obvious change to the program is that we are no longer copying data from the input file to the output file and back again, which means we don't have to switch the filenames for the input and output files.

Another change I've made, to reduce the amount of memory the program requires, is to eliminate several of the auxiliary arrays: Buffer, BufferSize, and TotalDisplacement. Instead of the Buffer array, I'm using a new array called BufferOffset that keeps track of the position in the one large buffer where each logical buffer starts. The TotalDisplacement array isn't really necessary, because it was only used during the calculations of the positions of the logical buffers and of the total data for all the records in the file; both of these functions have been replaced in the new implementation by a loop that calculates exactly how many records will fit in the big buffer and where they will go. As for the BufferSize array, that turns out to be unnecessary in the new implementation of the algorithm. Because we will be precalculating the exact amount of space that will be occupied by the records in that buffer, we don't have to worry about whether any particular record will fit in the buffer: if it belongs in that buffer, it will fit.
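To make the overall shape of this multi-pass approach concrete, here is a minimal sketch of the structure just described. It is not the actual code of zen05.cpp: CalculateKeySegment and InsertIntoLogicalBuffer stand in for the routines discussed in the text, and the Split, SplitTotalSize, and PassCount values are assumed to have been computed in advance (a sketch of that calculation follows later in this section).

#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Placeholders for the routines discussed in the text; their bodies are not
// shown here, so these names and signatures are assumptions of this sketch.
long CalculateKeySegment(const std::string &Line);
void InsertIntoLogicalBuffer(char *BigBuffer, long Segment, const std::string &Record);

// A minimal sketch of the multi-pass structure: reread the input file once per
// pass, keep only the records whose key segments belong to this pass, insert
// them into their logical buffers inside the one big buffer, and write the
// whole buffer out with a single write at the end of the pass.
void DistributeRecords(const std::string &InputName, const std::string &OutputName,
                       char *BigBuffer, long BigBufferSize,
                       const std::vector<long> &Split,
                       const std::vector<long> &SplitTotalSize, int PassCount)
{
    std::ofstream Output(OutputName.c_str(), std::ios::binary);
    for (int Pass = 0; Pass < PassCount; ++Pass) {
        std::ifstream Input(InputName.c_str());      // reread the input on every pass
        std::memset(BigBuffer, 0, BigBufferSize);    // clear the big buffer for this pass
        std::string Line;
        while (std::getline(Input, Line)) {
            long Segment = CalculateKeySegment(Line);
            // Only records whose key segments fall in this pass's range are
            // handled now; everything else waits for a later pass.
            if (Segment >= Split[Pass] && Segment < Split[Pass + 1])
                InsertIntoLogicalBuffer(BigBuffer, Segment, Line + '\n');
        }
        // One big write per pass: exactly the number of bytes this pass accounts for.
        Output.write(BigBuffer, SplitTotalSize[Pass]);
    }
}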
Before we get to the major changes in the algorithm, I should explain the reason for the new version of CalculateKeySegment. As I've already mentioned, with this new version of the algorithm, it is extremely important to make each logical buffer as small as possible, to reduce the time needed to insert each record into its logical buffer. Therefore, because we want to be able to handle a one million record input file in an efficient manner, we will allocate one million logical buffers within our large physical buffer. But how do we decide which logical buffer each record should be stored in? Because the keys in this file are composed of numeric digits, we can use the first six digits of the key to determine the appropriate logical buffer for each record. Although this is not theoretically optimal, because the keys in the file are not random, the performance results indicate that it is sufficient to give us a significant increase in speed over the previous version of the program, at least for large files.

Now let's get to the more complicated changes in this version of the algorithm, which are in the calculation of which data we will actually store in the big buffer. That's the responsibility of the code that starts with the line int Split[MAXPASSCOUNT] and ends with the line PassCount = j;. Let's go over this in some detail.

First, we have the declaration of the variable Split, which is used to keep track of which keys we are handling on this pass. This variable is an array of a number of elements sufficient to handle any file of reasonable size: to be exact, we need one element in this array for every possible pass through the input file, so 100 elements would suffice for a file of about 1.6 GB if we're using a 16 MB buffer for each pass. Next, we have an array called SplitTotalSize, of the same number of elements, which is used to keep track of the total amount of data to be handled in each pass through the input file. We need this information to determine exactly how many bytes we're going to write from the big buffer at the end of each pass.

After declaring a couple of auxiliary variables, we get to the actual code. First, we initialize the value of the first element of the Split array to 0, because we know that on the first pass through the file, we will start handling records from the lowest possible key value, which of course is 0.

Now we're ready to start counting the keys and records that will be stored in the big buffer during each pass through the input file. To calculate the splits, we start by initializing the data count for each split to 0 and the offset of the first logical buffer to zero as well. Then we step through the array of buffer capacities, adding up the capacity of all the logical buffers that we encounter. As long as we've not yet exceeded the size of the big buffer, we add the size of each logical buffer to the previous total size and set the offset of the next logical buffer to that total size. Once we get to a buffer whose size would cause an overflow of the big buffer capacity, we break out of the loop without updating the total size or the next logical buffer's offset.

Once we've exited from that inner loop, we know what segment of keys we're going to handle on this pass, so we set the next element of the Split array to the number of the last buffer that will fit in the big buffer on this pass. We also know the total size of the data that will be handled on this pass, so we set the value of the SplitTotalSize array element for this pass to that value. Next, we add the amount of data for this pass to the total data in the file so that we can report this information to the user. Finally, if we've reached the end of all the buffers, we break out of the outer loop, delete the BufferCapacity array, and set the PassCount variable to the number of passes that we will need to handle all the data.
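Here is a rough sketch of that splitting calculation. The names are modeled on the ones mentioned in the text (Split, SplitTotalSize, BufferCapacity, BufferOffset, PassCount), but the signature, the use of std::vector, and the choice to store each pass's upper bound as an exclusive limit are assumptions of this sketch rather than details of the book's actual code.

#include <vector>

// Sketch of the pass-splitting loop: BufferCapacity[i] is the number of bytes
// needed by logical buffer i (counted on an earlier pass over the input), and
// BigBufferSize is the size of the one large physical buffer. Each pass takes
// as many consecutive logical buffers as will fit, records each buffer's offset
// within the big buffer, and remembers the total bytes to write for that pass.
int CalculateSplits(const std::vector<long> &BufferCapacity, long BigBufferSize,
                    std::vector<long> &Split, std::vector<long> &SplitTotalSize,
                    std::vector<long> &BufferOffset)
{
    Split.assign(1, 0);                          // pass 0 starts at the lowest key segment
    SplitTotalSize.clear();
    BufferOffset.assign(BufferCapacity.size(), 0);

    size_t i = 0;                                // next logical buffer waiting to be placed
    int PassCount = 0;
    while (i < BufferCapacity.size()) {
        long TotalSize = 0;
        // Keep adding logical buffers to this pass until the next one would
        // overflow the big buffer; each buffer's offset is the running total
        // of the buffers placed before it in this pass.
        while (i < BufferCapacity.size() &&
               TotalSize + BufferCapacity[i] <= BigBufferSize) {
            BufferOffset[i] = TotalSize;
            TotalSize += BufferCapacity[i];
            ++i;
        }
        // Note: as the text points out later, this scheme breaks down if a single
        // logical buffer is larger than the big buffer; this sketch, like the
        // program described, makes no attempt to recover from that case.
        Split.push_back((long)i);                // this pass covers segments [Split[p], Split[p+1])
        SplitTotalSize.push_back(TotalSize);     // bytes to write at the end of this pass
        ++PassCount;
    }
    return PassCount;
}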
Once we know how we are going to split up the input file, the rest of the changes are pretty simple. After opening the output file, we allocate the big buffer, then start the main loop that will execute once for each pass through the input file. On each pass through the main loop, we handle the records whose keys fall in the range allocated to that pass. This requires changes in several areas of the code.

First, at the beginning of the main loop, we clear the buffer character counts for each of the buffers that will be active during this pass. Then we reset the status of the input file, reposition it to its beginning, and clear the big buffer to all zeros.

The next set of changes occurs when we have found the key segment for the current record from the input file. If that key segment falls within the current pass, then we have to put the record into the appropriate buffer. However, unlike our previous implementations, we have to be careful exactly where we put the record in the buffer, rather than just appending it at the end. This is because we're going to be writing the records into their final position in the file rather than merely copying them to an output file in partial order by a segment of the key. Therefore, we have to insert each record into the buffer in the precise relative position where it should go in the file.

To do this, we compare the key of this record to the key of each record already in the buffer. When we find a record whose key is greater than the key of the record that we're trying to put in the buffer, we shift that record and all the following records toward the end of the buffer to make room for the new record. If the key of the record that we're looking at in the buffer is less than or equal to the key of the record that we want to put in the buffer, we have to locate the next record in the buffer to see whether its key is greater than the new record's. To do this, we increment our pointer into the buffer until we find a new-line character, which signals the end of the record whose key we have just examined, then continue our key comparison with the next record's key, which follows immediately after. Of course, if we don't find a record with a greater key by the time we get to the end of the buffer, then the new record goes at the end of the buffer.

Each time we reach the end of the outer loop, we have a big buffer filled with data that needs to be written to the file. Therefore, we call the write function to write the exact number of bytes that that pass has handled, which is stored in the SplitTotalSize element for the particular pass that we are executing.
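The following is a minimal sketch of that in-buffer insertion, assuming newline-terminated records whose keys are compared as a fixed-length prefix. InsertRecord, KeyLength, and the separate UsedBytes count are illustrative names introduced for this sketch; the actual code in zen05.cpp works directly on the big buffer through BufferOffset and the per-buffer character counts rather than through a helper like this.

#include <cstring>
#include <string>

// Insert one record into its logical buffer, keeping the buffer in key order.
// LogicalBuffer points at the start of one logical buffer inside the big buffer,
// UsedBytes is how many bytes of it are already occupied, and Record is a complete
// input line including its trailing newline. The precalculation described earlier
// guarantees the buffer has room for every record that belongs to it, so no
// overflow check is needed here.
void InsertRecord(char *LogicalBuffer, long &UsedBytes,
                  const std::string &Record, int KeyLength)
{
    long Pos = 0;
    // Scan the records already in the buffer, one newline-terminated line at a
    // time, until we find one whose key is greater than the new record's key.
    while (Pos < UsedBytes) {
        if (std::strncmp(LogicalBuffer + Pos, Record.c_str(), KeyLength) > 0)
            break;                              // insert in front of this record
        while (Pos < UsedBytes && LogicalBuffer[Pos] != '\n')
            ++Pos;                              // skip to the end of this record
        if (Pos < UsedBytes)
            ++Pos;                              // step past the newline to the next key
    }
    // Shift everything from Pos onward toward the end of the buffer, then copy the
    // new record into the hole; if no greater key was found, Pos equals UsedBytes
    // and the record simply goes at the end.
    std::memmove(LogicalBuffer + Pos + Record.size(), LogicalBuffer + Pos,
                 UsedBytes - Pos);
    std::memcpy(LogicalBuffer + Pos, Record.data(), Record.size());
    UsedBytes += (long)Record.size();
}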
So how does this fifth version of the program actually perform? Figure timings.05 answers that question.

Performance of Zensort version 5 (Figure timings.05)

zensort/timings.05

As in our previous comparison, we have improved performance radically on the smallest file, producing a 2.5 to 1 increase in speed with the smaller memory configuration and more than doubling it with the larger configuration. As usual, however, what is more important is how well it performs when we have a lot of data to sort. As you can see from the performance results, the throughput when sorting the large file has more than doubled on both the small and large memory configurations.

The Key Requirement

However, unlike our previous improvements, this one carries a fairly hefty price tag: if the keys are not reasonably evenly distributed, the program will run extremely slowly, and in some extreme cases may fail to work at all. The former problem results from the fact that as the number of keys in each logical buffer increases, the time taken to insert a key in that logical buffer increases as well. If the file is too big to fit in memory and a very large proportion of the keys are identical, then the program as currently written may fail to execute: this will happen if the size of one logical buffer exceeds the amount of memory allocated for all buffers.

Does this mean that this new version of the program is useless? Not at all: it means that we have to be prepared for the eventuality that this version of the algorithm may behave badly on certain data and handle that eventuality should it occur. This is the subject of a problem at the end of this chapter.

The Sixth Version

The previous version marks the end of our performance improvements on the original input file, mostly because I couldn't think of any other improvements to implement. However, even after increasing the performance significantly on that file, I was still very interested in seeing how well an adaptation of the same algorithm could be made to perform on an input file like the ones specified in the sorting contest. It turned out that the changes needed to sort one of those files reasonably efficiently were not terribly complex, as you can see by looking at the next version of the program, shown in Figure zen06.cpp.

Zensort version 6 (Zensort\zen06.cpp) (Figure zen06.cpp)

zensort/zen06.cpp

In fact, all I had to do was change the calculation of the key segment to combine values from the first three ASCII characters in the input line rather than the first six digits of the key.
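A sketch of what that changed key-segment calculation might look like is shown below. The exact way zen06.cpp combines the three characters is not shown in this excerpt, so treating each character as a base-256 digit is an assumption of this sketch; the real code may scale or mask the values differently to control the number of logical buffers.

#include <string>

// Hypothetical version 6 key-segment calculation: combine the first three ASCII
// characters of the line into a single logical-buffer number. The radix of 256
// per character is an assumption; zen06.cpp may combine the characters differently.
long CalculateKeySegment(const std::string &Record)
{
    long Segment = 0;
    for (int i = 0; i < 3 && i < (int)Record.size(); ++i)
        Segment = Segment * 256 + (unsigned char)Record[i];
    return Segment;    // selects one logical buffer, as the six-digit version did
}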
