The DynamicHashArray::StoreElement function (from quantum\dynhash.cpp) (Figure dynhash.04)

codelist/dynhash.04

The basic idea is that whenever a particular element that we are trying to store in the hash array is about to exceed the size of a quantum, we create a new element in which we'll store the records that wouldn't fit in the previous element. The question, of course, is how we find the records stored in this new element, since they will not be stored in the location in the hash array where we would otherwise expect to find them. The answer is that we modify the keys of the stored records in a reproducible way, so that we can apply the same modification when we're trying to find a particular record. Then we set a "chain" flag in the original element in the hash array indicating that it has overflowed, so that we don't mistakenly tell the user that the record in question is not in the file.

Exactly how do we modify the key of the record? By appending a character sequence to it that consists of a form feed character followed by an ASCII representation of the number of times that we've had an overflow in this particular chain. For example, if the original key was "ABC", the first overflow key would be "ABC\f0", where "\f" represents the form feed character and "0" is the ASCII digit 0.

Of course, it's also possible that even the second element in which we're storing records with a particular hash key will overflow. However, the same algorithm works in that case as well: the second element will have its "chain" flag set to indicate that the program should continue looking for the record in question if it is not found in the current element, and the key will be modified again to make it unique. To continue our previous example, if the original key was "ABC", the second overflow key would be "ABC\f1", where "\f" represents the form feed character and "1" is the ASCII digit 1.

How did I choose this particular way of modifying the keys? Because it would not limit the users' choice of keys in a way that they would object to. Of course, I did have to inform the users that they should use only printable ASCII characters in their keys, but they did not consider that a serious limitation. If your users object to this limitation, then you'll have to come up with another way of constructing unique keys that won't collide with any keys that the users actually want to use.

As this suggests, there were a few tricky parts to this solution, but overall it really wasn't that difficult to implement. I haven't done any serious performance testing of its effects, but I don't expect them to be significant; after all, assuming that we select our original parameters properly, overflows should be a rare event.
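To make the key-modification scheme concrete, here is a minimal sketch of the idea in isolation. The function name MakeOverflowKey and the use of std::string are my own illustration, not the book's actual code; the real StoreElement function (Figure dynhash.04) embeds this logic directly.

#include <string>

// Sketch of the overflow-key scheme described above: append a form feed
// ('\f') followed by the ASCII representation of the overflow count to the
// user's original key. MakeOverflowKey is a hypothetical helper name.
std::string MakeOverflowKey(const std::string& original_key, int overflow_count)
{
    // Because users are restricted to printable ASCII characters in their
    // keys, a key containing a form feed can never collide with a
    // user-supplied key.
    return original_key + '\f' + std::to_string(overflow_count);
}

// Example: MakeOverflowKey("ABC", 0) yields "ABC\f0", the key under which
// records that overflowed the "ABC" element are stored; a second overflow
// yields "ABC\f1", and so on down the chain.

Because the same function can be applied during lookup, the search code can regenerate each successive overflow key until it either finds the record or reaches an element whose chain flag is off.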
I should also explain how we find a record that has overflowed. That is the job of DynamicHashArray::FindElement (Figure dynhash.05).

The DynamicHashArray::FindElement function (from quantum\dynhash.cpp) (Figure dynhash.05)

codelist/dynhash.05

As you can see, the code to look up a record is not complicated very much by the possibility of an overflow. First, we calculate a slot number based on the key that we are given by the user. Then we check whether that key is found in the element for that slot. If it is, we are done, so we break out of the loop. However, if we haven't found the record we're looking for yet, we check to see whether the particular element that we are looking in has its chain flag set. If not, the record must not be in the file, so we break out of the loop. On the other hand, if the chain flag is set in the element that we were looking at, then we have to keep looking. Therefore, we calculate what the key for that record would be if it were in the next element and continue processing at the top of the loop. On the next pass through the loop, we'll retrieve the next element that the record might be in, based on its modified key. We'll continue through this loop until we either find the record we're looking for or get to an element whose chain flag is not set; the latter situation, of course, means that the desired record is not in the file.

Settling Our Hash

I've recently had the opportunity to use this algorithm in a commercial setting, and have discovered (once again) that how the hash code is calculated is critical to its performance. In this case, the keys were very poorly suited to the simple (if not simple-minded) calculation used in the version of DynamicHashArray::CalculateHash in Figure oldhash.00.

An early implementation of the DynamicHashArray::CalculateHash function (Figure oldhash.00)

codelist/oldhash.00

The problem was that the keys used in the application were all very similar to one another. In particular, the last seven or eight characters in each key were likely to differ from the other keys in only one or two places. This caused tremendous overloading of the hash slots devoted to those keys, with corresponding performance deterioration. Luckily, I was able to provide a solution to this problem in short order by using an algorithm that produced much better hash codes.

Interestingly enough, the algorithm that I substituted for my poorly performing hash code wasn't even designed to be a hash code algorithm at all. Instead, it was a cyclic redundancy check (CRC) function whose purpose was to calculate a value based on a block of data, so that when the data was later read again, the reading program could determine whether it had been corrupted. On second thought, perhaps it wasn't so strange that a CRC algorithm would serve as a good hash code. After all, for it to do its job properly, any change to the data should be very likely to produce a different CRC. Therefore, even if two keys differed by only one or two bytes, their CRCs would almost certainly be different, which of course is exactly what we want for a hash code.

As it happens, this substitution greatly improved the performance of the program, so apparently my choice of a new hash algorithm was appropriate. Fortunately, I came up with this solution just in time to include the new code on the CD-ROM in the back of this book. It is shown in Figure dynhash.06.

The DynamicHashArray::CalculateHash function (from quantum\dynhash.cpp) (Figure dynhash.06)

codelist/dynhash.06

As for the CRC class, its interface is shown in Figure crc32.00.

The interface for the CRC32 class (from quantum\crc32.h) (Figure crc32.00)

codelist/crc32.00

And its implementation, which also includes a brief description of the algorithm, is shown in Figure crc32.01.

The implementation for the CRC32 class (from quantum\crc32.cpp) (Figure crc32.01)

codelist/crc32.01
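For readers without the CD-ROM at hand, here is a minimal sketch of a table-driven CRC-32 (using the standard reflected polynomial 0xEDB88320) pressed into service as a hash function. This is my own illustration of the technique; the book's actual CRC32 class (Figure crc32.01) may differ in detail, and slot_count stands in for whatever member the real CalculateHash uses to hold the current number of slots.

#include <cstdint>
#include <string>

// A standard table-driven CRC-32. Any small change to the key is very
// likely to change the resulting 32-bit value, which is exactly the
// property a good hash code needs.
static std::uint32_t Crc32(const std::string& key)
{
    static std::uint32_t table[256];
    static bool initialized = false;
    if (!initialized)
    {
        for (std::uint32_t i = 0; i < 256; ++i)
        {
            std::uint32_t c = i;
            for (int bit = 0; bit < 8; ++bit)
                c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : (c >> 1);
            table[i] = c;
        }
        initialized = true;
    }
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char ch : key)
        crc = table[(crc ^ ch) & 0xFFu] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}

// Reduce the CRC to a hash slot number, as a CalculateHash-style routine
// would; slot_count is a hypothetical stand-in for the class's own member.
static unsigned CalculateSlot(const std::string& key, unsigned slot_count)
{
    return static_cast<unsigned>(Crc32(key) % slot_count);
}

Keys that differ in only one or two characters, which overloaded the original hash function, are scattered across slots nearly uniformly by the CRC, since a single changed byte perturbs all 32 bits of the result.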
Bigger and Better

What are the limits on the maximum capacity of a dynamic hashing array? As it happens, there are two capacity parameters of the quantum file access method that we can adjust without affecting the implementation very much: BlockSize, which specifies how large each quantum is, and the maximum number of blocks allowed in the file, set by the MaxFileQuantumCount const in blocki.h.

In the current implementation, BlockSize is set to 16K, which means we need 14 bits to specify the location of an item in a quantum. Since an ItemIndex (Figure blocki.08) uses a 16-bit word to hold the offset, we could increase the BlockSize to as much as 64K bytes. Let's take a look at the advantages and disadvantages of increasing the quantum size.

Suppose we increase the size of each quantum from 16K bytes to, for example, 32K bytes. It's easy to see that this would double the maximum capacity of the file. What may not be so obvious is that this change would also decrease the memory requirements for efficient file access via dynamic hashing, for a given number of records in the file. To see why this is so, we have to look at the typical usage of disk buffers when looking up a string by key in the dynamic hashing algorithm.

Suppose we want to find the record that has the key 609643342. The algorithm calculates which hash slot points to the storage element in which a record with that key would be stored. It then calls the quantum file access routine GetModifiableElement to retrieve that storage element. GetModifiableElement retrieves the big pointer array block for the array that the storage elements are kept in; then it retrieves the little pointer array block for the correct storage element; and finally it gets the block where the storage element is stored, retrieves it from the block, and returns it. The dynamic hashing algorithm then searches the storage element to find the key we specified and, if it is found, extracts the record we want.

So a total of three blocks are accessed for each retrieval of a string: the big pointer array block, a little pointer array block, and the final "leaf" block. The first of these blocks is referenced on every string retrieval, so it is almost certain to be in memory. The "leaf" block, on the other hand, is not very likely to be in memory, since the purpose of the hashing algorithm is to distribute the data as evenly as possible over the file: with a reasonably large file, most "leaf" accesses aren't going to be to one of the relatively few blocks we can keep in memory.

Fortunately, this pessimistic outlook does not apply to the second block retrieved, the little pointer array block. If we have 500,000 strings in our file, at an average of six strings per storage element there are approximately 83,333 storage elements that the quantum file algorithm has to deal with. With a 16K-byte block, approximately 4000 little pointer elements fit in each little pointer block, so the little pointer blocks would occupy 21 buffers.

Now let's look at the situation if we go to 32K-byte blocks. If we double the number of strings in the average storage element to 12, so that the likelihood of an overflow remains approximately the same as before, then the number of little pointer array elements needed to access a given number of records is halved; in addition, the number of little pointer array elements that fit in each little pointer block is doubled. This means that to be fairly sure that the little pointer block we need will be present in memory, instead of 21 blocks taking up about 340K of memory (21 x 16K), we need only 6 blocks taking up about 200K of memory (6 x 32K). In the case of dynamic hashing, this effect greatly alleviates the primary drawback of increasing the block size, which is that the number of blocks that can be held in a given amount of memory is inversely proportional to the size of each block.
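The arithmetic above is easy to verify. The following small program, my own illustration rather than anything from the book's code, reproduces the buffer counts for both block sizes; the four-byte little pointer element size matches the size of an ItemReference given below, and the strings-per-element figures (6 and 12) are taken from the text.

#include <cstdio>

int main()
{
    const long strings = 500000L;      // number of strings in the file
    const long element_size = 4;       // bytes per little pointer element

    for (int pass = 0; pass < 2; ++pass)
    {
        long block_size = (pass == 0) ? 16L * 1024 : 32L * 1024;
        long strings_per_element = (pass == 0) ? 6 : 12;

        // Storage elements needed to hold all the strings.
        long storage_elements = strings / strings_per_element;
        // Little pointer elements that fit in one block.
        long pointers_per_block = block_size / element_size;
        // Little pointer blocks needed to cover all storage elements,
        // rounded up.
        long little_blocks =
            (storage_elements + pointers_per_block - 1) / pointers_per_block;

        std::printf("%ldK blocks: %ld elements, %ld little pointer blocks, "
                    "%ldK of buffers\n",
                    block_size / 1024, storage_elements, little_blocks,
                    little_blocks * block_size / 1024);
    }
    return 0;
}

// Prints approximately:
//   16K blocks: 83333 elements, 21 little pointer blocks, 336K of buffers
//   32K blocks: 41666 elements, 6 little pointer blocks, 192K of buffers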
In the general case, however, this reduction in the number of buffers can hurt the performance of the program; the system can "thrash" if the working set of data needed at any given time requires more buffers than are available. The only other apparent drawback of increasing the size of the quanta is that the free space codes become less accurate, since the code remains fixed in size at one byte; with a 32K block, each increment in the size code represents 128 bytes. However, I doubt that this will cause any significant problems with space utilization.

The More, the Merrier

The other fairly simple way to increase the capacity of the system is to increase the number of blocks that can be addressed. The ItemReference class (Figure blocki.12) defines objects that take up four bytes each: 20 bits for the m_QuantumNumber field and 12 bits for the m_RelativeItemNumber field.

The ItemReference class (from quantum\blocki.h) (Figure blocki.12)

codelist/blocki.12

If we wanted to increase the number of quanta from its current maximum of 10,000 (set by the MaxFileQuantumCount const in blocki.h), we could increase it to as much as 1024K without changing the size of the m_QuantumNumber field. The main drawback of increasing the maximum block count is that the free space list gets bigger; in fact, that's the reason that I've set the maximum file quantum count so far below the limit imposed by the field size.
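To make the capacity limits concrete, here is a sketch of the four-byte layout described above. The field names follow Figure blocki.12, but the bit-field packing shown here is my own illustration; the book's actual ItemReference class may pack the bits differently.

#include <cstdint>

// Sketch of the ItemReference layout: a 32-bit object holding a 20-bit
// quantum number and a 12-bit relative item number. ItemReferenceSketch
// is a hypothetical name, not the class from quantum\blocki.h.
class ItemReferenceSketch
{
public:
    ItemReferenceSketch(std::uint32_t quantum, std::uint32_t item)
        : m_QuantumNumber(quantum), m_RelativeItemNumber(item) {}

    std::uint32_t QuantumNumber() const { return m_QuantumNumber; }
    std::uint32_t RelativeItemNumber() const { return m_RelativeItemNumber; }

private:
    std::uint32_t m_QuantumNumber : 20;       // up to 1,048,576 (1024K) quanta
    std::uint32_t m_RelativeItemNumber : 12;  // up to 4096 items per quantum
};

With 20 bits available, MaxFileQuantumCount could grow from 10,000 to 1024K without enlarging the four-byte ItemReference; only the free space list, at one byte per quantum, would grow in proportion.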