HandBooks Professional Java-C-Scrip-SQL part 161 doc

pass through the file, because until then we do not know the total sizes of all records with a given key character. 1 But we are getting a little ahead of ourselves here. Before we can calculate the total size of all the records for a given key character, we have to read all the records in the file, so let's continue by looking at the loop that does that.

This "endless" loop starts by reading a line from the input file and checking whether the read was successful. If the status of the input file indicates that the read did not work, we break out of the loop. Otherwise, we increment the total number of keys read (for statistical purposes), calculate the length of the record, and increment the total amount of data read (also for statistical purposes). Next, we determine whether the record is long enough for us to extract the character of the key that we need; if it is, we do so, and otherwise we treat it as though it were 0 so that such a record will sort to the beginning of the file. 2 Once we have found (or substituted for) the character on which we are sorting, we add the length of the line (plus one for the new-line character that the getline function discards) to the displacement for that character. Then we continue with the next iteration of the loop.

Once we get to the end of the input file, we close it. Then we compute the total displacement value for each character by adding the total displacement value for the previous character to the displacement value for the previous character. At this point, having read all of the data from the input file, we can display the statistics on the total number of keys and the total amount of data in the file, if this is the first pass. This is also the point where we display the time taken to do the counting pass.

Now we're ready for the second, distribution, pass for this character position. This is another "endless" loop, very similar to the previous one. As before, we read a line from the input file and break out of the loop if the read fails. Next, we concatenate a new-line character to the end of the input line. This is necessary because the getline function discards that character from the lines it reads; if we did not take this step, our output file would have no new-line characters in it, which would undoubtedly be disconcerting to our users. Next, we extract the current key character from the line, or substitute a null byte for it if it is not present.

The next operation is to calculate the current amount of data in the buffer used to store data for this key character. Then we add the length of the current line to the amount of existing data in the buffer. If adding the new line to the buffer would cause it to overflow its bounds, we have to write out the current data and clear the buffer before storing our new data in it. To do this, we seek to the position in the output file corresponding to the current value of the total displacement array for the current character value. As we have already seen, the initial value of the total displacement array entry for each character is equal to the number of characters in the file for all records whose key character precedes this character. For example, if the current key character is a capital 'A', then element 65 in the total displacement array starts out, at the beginning of the distribution loop, with the offset into the output file where we want to write the first record whose key character is a capital 'A'.
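Here is a minimal sketch of the counting pass and the displacement computation described above, written from the description rather than taken from the book's listing; the names BUFFER_COUNT, displacement, total_displacement, and counting_pass are my own, and the statistics-gathering and timing code mentioned in the text is omitted. The distribution pass is sketched separately below.

    // Illustrative sketch of the counting pass (not the book's actual listing)
    #include <fstream>
    #include <string>

    const int BUFFER_COUNT = 256;                 // one slot per possible key character value

    long displacement[BUFFER_COUNT];              // total bytes of records per key character
    long total_displacement[BUFFER_COUNT];        // starting offset of each character's records

    void counting_pass(const char *input_name, size_t char_position)
    {
        for (int i = 0; i < BUFFER_COUNT; i++)
            displacement[i] = 0;

        std::ifstream in(input_name);
        std::string line;
        while (std::getline(in, line))            // the "endless" loop ends when a read fails
        {
            unsigned char key_char = 0;           // records too short to have this key character sort first
            if (line.length() > char_position)
                key_char = (unsigned char)line[char_position];
            // the +1 accounts for the new-line character that getline discards
            displacement[key_char] += (long)line.length() + 1;
        }
        in.close();

        // Each character's starting offset is the previous character's starting
        // offset plus the previous character's total record size.
        total_displacement[0] = 0;
        for (int i = 1; i < BUFFER_COUNT; i++)
            total_displacement[i] = total_displacement[i - 1] + displacement[i - 1];
    }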
Back in the distribution pass: if this is the first time that we are writing the buffer corresponding to the letter 'A', we need to position the output file at the first place where records whose keys contain the key character 'A' should be written, so the initial value of the total displacement array element is what we need in this situation. However, once we have written that first batch of records whose keys contain the letter 'A', we have to update the total displacement element for that character so that the next batch of records whose keys contain the letter 'A' will be written immediately after the first batch. That's the purpose of the next statement in the source code.

Now that we have positioned the file properly and have updated the next output position, we write the data in the buffer to the file. Then we update the total number of writes for statistical purposes, and clear the buffer in preparation for its next use to hold more records with the corresponding key character. At this point, we are ready to rejoin the regular flow of the program, where we append the input line we have just read to the buffer that corresponds to its key character. That's the end of the second "endless" loop, so we return to the top of that loop to continue processing the rest of the lines in the file.

Once we've processed all the lines in the input file, there's one more task we have to handle before finishing this pass through the data: writing out whatever remains in the various buffers. This is the task of the for loop that follows the second "endless" loop. Of course, there's no reason to write out data from a buffer that doesn't contain anything, so we check whether the current length of the buffer is greater than 0 before writing it out to the file. After displaying a message telling the user how long it took to do this distribution pass, we return to the top of the outer loop and begin again with the next pass through the file to handle the next character position in the key. When we have gone through all the characters in the key, we are finished with the processing, so we display a final message indicating the total number of writes that we have performed, free the memory for the buffers, terminate the timing routines, and exit.

Performance: Baseline

So how does this initial version of the program actually perform? While I was working on the answer to this question, it occurred to me that perhaps it would be a good idea to run tests on machines with various amounts of physical memory. After all, even if this algorithm works well with limited physical memory, that doesn't mean that having additional memory wouldn't help its performance. In particular, when we are reading and writing a lot of data, the availability of memory to use as a disk cache can make a lot of difference. Therefore, I ran the tests twice, once with 64 MB of RAM in my machine and once with 192 MB. The amount of available memory did make a substantial difference, as you'll see when we discuss the various performance results.

Figure timings.01 illustrates how this initial version works with files of various sizes, starting with 100,000 records of approximately 60 bytes apiece and ending with one million similar records. 3

Performance of Zensort version 1 (Figure timings.01)
zensort/timings.01

According to these figures, this is in fact a linear sort, or close enough to make no difference, at least on the larger machine. An n log n sort would take exactly 1.2 times as long per element when sorting one million records as when sorting 100,000 records (since log 1,000,000 / log 100,000 = 6/5 = 1.2).
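Before we look more closely at these numbers, here is the matching sketch of the distribution pass walked through above, continuing the counting-pass sketch from earlier. It is again reconstructed from the description rather than copied from the book's listing: the buffers are shown as std::string objects for brevity, and buffer and buffer_capacity are illustrative names.

    // Illustrative sketch of the distribution pass (not the book's actual listing)
    void distribution_pass(const char *input_name, std::fstream &out, size_t char_position,
                           std::string buffer[], size_t buffer_capacity)
    {
        std::ifstream in(input_name);
        std::string line;
        while (std::getline(in, line))
        {
            unsigned char key_char = 0;            // substitute a null byte for a missing key character
            if (line.length() > char_position)
                key_char = (unsigned char)line[char_position];
            line += '\n';                          // restore the new-line that getline discarded

            // If this line would overflow the buffer, write the buffer to its place
            // in the output file and clear it before storing the new line.
            if (buffer[key_char].length() + line.length() > buffer_capacity)
            {
                out.seekp(total_displacement[key_char]);            // where this batch belongs
                out.write(buffer[key_char].data(), buffer[key_char].length());
                total_displacement[key_char] += (long)buffer[key_char].length();  // next batch goes right after
                buffer[key_char].clear();
            }
            buffer[key_char] += line;              // append the line to its key character's buffer
        }
        in.close();

        // Write out whatever remains in each non-empty buffer.
        for (int i = 0; i < BUFFER_COUNT; i++)
            if (buffer[i].length() > 0)
            {
                out.seekp(total_displacement[i]);
                out.write(buffer[i].data(), buffer[i].length());
                total_displacement[i] += (long)buffer[i].length();
                buffer[i].clear();
            }
    }

Note how the total displacement array does double duty: it begins as each character's first output offset and then advances past every batch written, which is what keeps successive batches for the same key character contiguous in the output file.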
On the smaller machine, this sort does take almost exactly 1.2 times as long per element for the one million record file, but the difference is only three percent on the larger machine, so the algorithm itself is obviously capable of achieving linear scaling. But linear performance only matters if the performance is good enough in the region in which we are interested. Since this is a book on optimization, let's see if we can speed this up significantly.

The Initial Improvements

One of the most obvious areas where we could improve the efficiency of this algorithm is in the use of the buffer space. The particular input file that we are sorting has keys that consist entirely of digits, which means that allocating 256 buffers of equal size, one for each possible ASCII character, is extremely wasteful, because only 10 of those buffers will ever be used. Although not all keys consist only of digits, that is a very common key composition; similarly, many keys consist solely of alphabetic characters, and of course there are keys that combine both. In any of these cases, we would do much better to allocate more memory to the buffers that are actually going to be used; in fact, we should not bother to allocate any memory for buffers that are not used at all. Luckily, we can determine this on the counting pass with very little additional effort, as you can see in Figure zen02.cpp.

Zensort version 2 (Zensort\zen02.cpp) (Figure zen02.cpp)
zensort/zen02.cpp

The first changes of any significance in this program are the addition of two new arrays that we will use to keep track of the buffer size for each possible key character and the total number of characters stored in each buffer. Of course, because we are assigning memory to buffers in a dynamic fashion, we can't allocate those buffers until we know how much memory we want to devote to each one. Therefore, the allocation has to be inside the main loop rather than preceding it. By the same token, we have to delete each buffer before the end of the main loop so that they can be re-allocated for the next pass.

The next question, of course, is how we decide how much space to devote to each buffer. It seemed to me that the best way to approach this would be to calculate the proportion of the entire file that the records for each key character account for, and allocate that proportion of the entire buffer space to the buffer for that key character, so that's how I did it. First, we add up all of the record length totals; then we compute the ratio of the total amount of space available for buffers to the total amount of data in the file. Then we step through all the different key characters and compute the appropriate size of the buffer for each one. If the result comes out to be zero, we don't allocate any space for that buffer; instead, we assign a null pointer to that buffer address, as that is much more efficient than allocating a zero-length buffer. However, if the buffer size comes out to be greater than zero, we allocate the computed amount of space for that buffer, then clear it to zeros. Finally, we clear the buffer character count for that buffer, as we haven't stored anything in it yet.

When it's time to store some data in a buffer, we use the buffer character count array rather than calling strlen to find out how much data is currently in the buffer. I decided to track the count myself because when I first changed the buffer allocation strategy from fixed to variable, the program ran much more slowly than it had previously.
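A rough sketch of this allocation strategy follows, continuing to use the displacement totals from the counting-pass sketch. TOTAL_BUFFER_SPACE stands in for the total space devoted to buffers (4 MB at this point, as noted below), and the array names are assumptions corresponding to the buffer size and buffer character count arrays described above, not the identifiers in the book's listing.

    // Illustrative sketch of version 2's proportional buffer allocation (not the book's actual listing)
    #include <cstring>   // for memset

    const long TOTAL_BUFFER_SPACE = 4L * 1024 * 1024;  // total space shared by all the buffers

    char *buffer[BUFFER_COUNT];              // one dynamically sized buffer per key character
    long  buffer_size[BUFFER_COUNT];         // capacity allotted to each key character
    long  buffer_char_count[BUFFER_COUNT];   // bytes currently stored, tracked instead of calling strlen

    void allocate_buffers()
    {
        long total_data = 0;
        for (int i = 0; i < BUFFER_COUNT; i++)
            total_data += displacement[i];               // record-length totals from the counting pass

        double ratio = (double)TOTAL_BUFFER_SPACE / (double)total_data;

        for (int i = 0; i < BUFFER_COUNT; i++)
        {
            buffer_size[i] = (long)(displacement[i] * ratio);  // this character's share of the space
            if (buffer_size[i] == 0)
                buffer[i] = 0;                           // no allocation at all for an unused character
            else
            {
                buffer[i] = new char[buffer_size[i]];
                memset(buffer[i], 0, buffer_size[i]);    // clear the newly allocated buffer
            }
            buffer_char_count[i] = 0;                    // nothing stored in this buffer yet
        }
    }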
This slowdown didn't make much sense to me at first, but upon reflection I realized that the longer the buffers are, the longer it takes strlen to find the end of each buffer. To prevent this undesirable effect, I decided to keep track of the size of the buffers myself rather than relying on strlen to do it. Of course, that means we have to add the length of each record to the total count for its buffer as we add the record to the buffer, so I added a line to handle this task.

The Second Version

So how does this second version of the program actually perform? Figure timings.02 illustrates how it works with files of various sizes. 4

Performance of Zensort version 2 (Zensort\timings.02) (Figure timings.02)
zensort/timings.02

If you compare the performance of this second version of the program to the previous version on the small file, you'll notice that it is almost 7.5 times as fast as the previous version on both the 64 MB machine and the 192 MB machine. However, what is more important is how well it performs when we have a lot of data to sort. While we haven't achieved quite as much of a speed-up on larger files as on the smallest one, we have still sped up the one million record sort by a factor of almost 4.5 to 1 when running the sort on a machine with "only" 64 MB of RAM, and by more than 6.5 to 1 on the more generously equipped machine with 192 MB of RAM, which is not an insignificant improvement. 5

Is this the best we can do? Not at all, as you'll see in the analysis of the other versions of the program. Let's continue with a very simple change that provided some improvement without any particular effort.

The Third Version

At this point in the development of the sorting algorithm, I decided that although saving memory is nice, we don't have to go overboard. On the assumption that anyone who wants to sort gigantic files has a reasonably capable computer, I decided to increase the amount of memory allocated to the buffers from 4 MB to 16 MB. As you might imagine, this improved performance significantly on larger files, although not nearly as much proportionally as our previous change did.
