HandBooks Professional Java-C-Scrip-SQL part 160 pdf

count to 10,000 despite the fact that the quantum number field in the item reference class can handle a file with far more quanta. However, if our application needs so much data that a 160-MB maximum file size is too small, the extra space taken up by a larger free space list probably isn't an obstacle.

Summary

In this chapter, we have used the quantum file access method as the base for a disk-based variant of Larson's dynamic hashing. This algorithm provides efficient hash-coded access by key to a very large amount of variable-length textual data, while eliminating the traditional drawbacks of hashing, especially the need to specify the maximum size of the file in advance. In the final chapter, we will summarize the algorithms we have covered in this book and discuss some other resources we can use to improve the efficiency of our programs.

Problems

1. What modifications to the dynamic hashing implementation would be needed to add the following capabilities?
   1. Shrinking the file when deleting records;
   2. Storing and retrieving records with duplicate keys.
2. How could the PersistentArrayUlong class be generalized to other data types?

(You can find suggested approaches to problems in Chapter artopt.htm.)

Footnotes

1. William G. Griswold and Gregg M. Townsend, "The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon", Software Practice and Experience, 23 (April 1993), 351-367.
2. In case you're thinking that the availability of virtual memory will solve this problem by allowing us to pretend that we have enough memory to hold the lists no matter how large they get, you may be surprised to discover that the performance of such an algorithm in virtual memory is likely to be extremely poor. I provide an example of this phenomenon in my previously cited article "Galloping Algorithms".
3. P.-A. Larson, "Dynamic Hash Tables", Communications of the ACM, 31 (1988).
4. Of course, thinking them up in the first place is a little harder!
5. This step alone will probably not provide a good distribution of records when the number of slots is a power of two; in such cases, it effectively discards all the high-order bits of the key. In our actual test program, in order to improve the distribution, we precede this step by one that uses all of the bits in the input value. However, the principle is the same.
6. This may be obvious, but if it's not, the following demonstration might make it so (a short verification program also appears after these footnotes):
   1. Let's call the value to be hashed x.
   2. y = x >> 4 [divide by 16]
   3. z = y << 4 [multiply the result by 16; the result must have its low four bits equal to 0]
   4. x % 16 = x - z [definition of "remainder"; the result is the low four bits of x]
   Obviously, an analogous result holds true for the remainder after dividing by 8; therefore, the low three bits of these two values must be the same.
7. As it turns out, there's no reason to actually allocate storage for all the new slots when we double the current maximum count, because the only one that's immediately needed is the first new one created. Accordingly, Larson's algorithm allocates them in blocks of a fixed size as they're needed, which is effectively the way my adaptation works as well.
8. Given the quantum file access method, that is!
9. This is also from James Coplien's book, mentioned earlier.
10. As an illustration of the immense improvements in computer speeds in a few years, creating a new file containing 250,000 records took about four hours with the computer I had in 1995 and about 15 minutes with the computer I have in 1998.
11. You may wonder where I got such a large number of records to test with. They are extracted from a telephone directory on CD-ROM; unfortunately, I can't distribute them for your use, since I don't have permission to do so.
12. In order to hide this implementation detail from the rest of the dynamic hashing algorithm, the GetString and PutString functions of the DynamicHashArray class increment the element index before using it. Thus, a request to read or update element 0 is translated into an operation on element 1, and so forth. To access element 0, the index is specified as -1.
13. In case you're wondering why we need two functions here instead of one, it's because sometimes we want to create a dynamic hash array before we know what file it's going to be attached to. In such a case, we can use the default constructor and then call the Open function once we know the actual parameters; without separating construction and initialization, we would not have that capability.
14. This is actually somewhat of an oversimplification, as it ignores the possibility of overflow of a storage element; if that occurs, then the whole process of looking up the storage element by key has to be repeated, doubling the number of block accesses. However, if we've chosen the parameters properly, overflows will be rare and therefore will not affect this analysis significantly.
15. Of course, having a larger block size also makes the algorithm more suitable to other applications with larger data items.
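Footnote 6's claim is easy to check mechanically. The following tiny program is my own sketch, not taken from the book; it verifies the shift-based remainder and the agreement of the low three bits over a range of values:

    #include <cassert>

    int main()
    {
        for (unsigned x = 0; x < 100000; ++x) {
            unsigned y = x >> 4;            // divide by 16
            unsigned z = y << 4;            // multiply by 16; low four bits are now 0
            assert(x % 16 == x - z);        // the remainder is the low four bits of x
            assert((x % 16 & 7) == x % 8);  // low three bits match the remainder mod 8
        }
        return 0;
    }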
Zensort: A Sorting Algorithm for Limited Memory

Introduction

This chapter will explain how to get around the major limitation of the otherwise very efficient distribution counting sort algorithm: its poor performance with limited available memory.

Algorithms Discussed

Zensort: A version of the distribution counting sort for use with limited available memory

Virtual Impossibility

For many years, I've been a very big fan of the "distribution counting sort", which is described in Chapter mail.htm. However, it does have one fairly serious drawback: it doesn't work very well in virtual memory. The problem with the distribution counting sort in virtual memory is that it has very poor locality of reference: that is, rather than stepping through memory in a predictable and linear way, it jumps all over the place. Although this is not optimal even when we are dealing with programs that access data solely in memory, because it makes poor use of the processor cache, it is disastrous when we are dealing with virtual memory. The difficulty is that random accesses to various areas of the disk are much, much slower than sequential accesses: in some cases, the difference may be a factor of 1,000 or more. I discovered this fact (not that I should have had to discover it by experimentation) when I ran some comparison tests between Quicksort and the distribution counting sort for very large files, where the data would not even remotely fit in memory. However, I didn't believe that this was an insuperable obstacle, and I have made a number of attempts to do something about it. Finally, after some years of on-and-off experimentation, I have found the solution.
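To make the locality problem concrete, here is a minimal sketch of a single counting-sort pass over fixed-size records, keyed on one byte. The names are mine and this is not the book's code; the point is the last loop, where consecutive input records are written to widely separated output positions, so that in virtual memory nearly every write can touch a different page.

    #include <cstddef>
    #include <vector>

    // One pass of a distribution counting sort, keyed on the byte at key_pos.
    void counting_pass(const std::vector<std::vector<unsigned char> >& input,
                       std::vector<std::vector<unsigned char> >& output,
                       std::size_t key_pos)
    {
        std::size_t counts[256] = {0};
        for (std::size_t i = 0; i < input.size(); ++i)   // count each key value
            ++counts[input[i][key_pos]];

        std::size_t dest[256];                           // first output slot per key value
        std::size_t total = 0;
        for (int c = 0; c < 256; ++c) {
            dest[c] = total;
            total += counts[c];
        }

        output.resize(input.size());
        for (std::size_t i = 0; i < input.size(); ++i)
            output[dest[input[i][key_pos]]++] = input[i]; // scattered write: poor locality
    }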
The Urge to Merge

The solution to this problem is really very simple, when you look at it the right way. If the problem is that we're moving data in a random fashion all over the disk, and thereby incurring massive numbers of positioning operations, perhaps we would do much better if we were to write the data out to intermediate buffers in memory and write those buffers out to disk only when they became full. In this way, we would reduce the number of random disk accesses by a large factor.

Up to this point, I hadn't invented anything new. It has been known for many years that it is possible to sort very large files by dividing them into blocks, each of which will fit into memory, sorting each block, and then merging the results into the final output file. The difficulty with this method of sorting is the merge phase, which requires a great deal of disk I/O in itself and can be a major contributor to the entire time taken to sort the file. However, I discovered that there was a way to avoid this merge phase. The key (no pun intended) is that by accumulating the data for the keys over the entire file, we could determine exactly where each output record should go in the entire file, even though we were buffering only a small percentage of the data.

The Initial Implementation

This is probably easier to show in code than it is to explain in English, although of course I'll do both before we are finished. However, let's start with the code, which is shown in Figure zen01.cpp.

Initial implementation of Zensort (Zensort\zen01.cpp) (Figure zen01.cpp)

We start out reasonably enough by extracting the size of the keys, the input file name, and the output file name from the command line. After initializing the timing routine, we allocate space for the buffers that will be used to hold the data on its way to the disk. As you may recall from Chapter mail.htm, the distribution counting sort moves data from the input file to the output file based on the value of the current character of the key being sorted. Therefore, we need one buffer for each possible character in the key being sorted, so that we can keep the data for each character value separate from the data for every other character value.

Now we are ready to start the main body of the algorithm. For each pass through the data, we have to read from one file and write to the other. On the first pass, of course, we will read from the original input file and write to the original output file. However, on the second pass we'll read from the original output file and write back to the original input file, because after the first pass, the output file is sorted on only the last character of the key. In the distribution counting sort algorithm, we sort on each character of the key separately, in reverse order of their positions in the key. After the first pass, the original output file contains exactly the information we need as input for the second pass. That's the reason we use the modulus operator in the if statement that determines which filenames to use for which files: on every odd-numbered pass, we use the original input file name for the input file, and on every even-numbered pass, we use the original output file name for the input file. Of course, the reverse is true for the output file.

Once we figure out which file will be used for input and which will be used for output, we open them. Then we initialize all of the buffers by clearing them to 0, initialize the displacement values for each buffer to 0, and initialize the total displacement values for each buffer to 0.
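The pass structure just described can be outlined as follows. This is only a sketch under my own naming assumptions (BUFFER_SIZE, buffer, displacement, and so on are illustrative, not the actual variables in zen01.cpp), with the record-processing body elided:

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    const std::size_t BUFFER_SIZE = 4096;   // assumed buffer size; zen01.cpp may differ

    int main(int argc, char* argv[])
    {
        if (argc < 4) return 1;
        std::size_t key_length = std::stoul(argv[1]);
        std::string original_input = argv[2];
        std::string original_output = argv[3];

        // One buffer per possible key character, so records with different
        // character values are kept apart on their way to disk.
        std::vector<std::string> buffer(256);
        for (std::size_t i = 0; i < buffer.size(); ++i)
            buffer[i].reserve(BUFFER_SIZE);

        // One pass per key character, working from the last character back to
        // the first. Odd-numbered passes read the original input file and write
        // the original output file; even-numbered passes reverse the roles.
        for (std::size_t pass = 1; pass <= key_length; ++pass) {
            std::string in_name, out_name;
            if (pass % 2 == 1) {
                in_name = original_input;
                out_name = original_output;
            } else {
                in_name = original_output;
                out_name = original_input;
            }
            std::ifstream in(in_name.c_str(), std::ios::binary);
            std::ofstream out(out_name.c_str(), std::ios::binary);

            std::size_t key_pos = key_length - pass;  // character sorted on in this pass

            std::vector<std::size_t> displacement(256, 0);        // bytes seen per character
            std::vector<std::size_t> total_displacement(256, 0);  // filled in later
            for (std::size_t i = 0; i < buffer.size(); ++i)
                buffer[i].clear();

            // ... read records keyed on key_pos, accumulate displacements,
            //     flush each buffer to its place in the output file when full ...
        }
        return 0;
    }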
The displacement array keeps track of the amount of data in the input file for each possible value of the current key character. That is, the entry in the displacement array that has the index 65 (the ASCII value for 'A') represents the total size of all the records seen so far in the current pass that have the letter A in the current position in their key.

The total displacement array, on the other hand, is used to accumulate the total amount of data in the input file that has a key character less than the ASCII value of the index in the total displacement array. For example, the entry in the total displacement array that has the index 65 represents the total size of all the records whose key character was less than the letter A, which is the same as the displacement into the output file of the first record whose key character is A. Of course, we cannot fill in the values of this array until we are finished with the ...
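The total displacement array is just an exclusive running total of the displacement array. Here is a minimal sketch, assuming the displacement counts for the whole input have already been accumulated; the names follow the text above but are not taken from zen01.cpp.

    #include <cstddef>

    // Given displacement[c] = total bytes of records whose current key character
    // is c, compute total_displacement[c] = total bytes of records whose key
    // character is less than c, i.e., the output-file offset of the first record
    // keyed on c.
    void compute_total_displacement(const std::size_t displacement[256],
                                    std::size_t total_displacement[256])
    {
        std::size_t running_total = 0;
        for (int c = 0; c < 256; ++c) {
            total_displacement[c] = running_total;  // bytes for all characters below c
            running_total += displacement[c];
        }
    }

With this, total_displacement[65] ends up holding the combined size of all records whose key character falls below 'A', which is exactly the output-file displacement described above for the first record whose key character is A.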
