1. In order to reduce the size of the free space list entries, I've also reduced the number of objects in the main object list to 256, so that an object number will fit in one byte.

2. Another possible optimization would be of use when we are loading a large number of items into a quantum file; in that case, we could just start a new quantum whenever we run out of space in the "last added to" quantum. This would eliminate the search of the free space list entirely and might be quite effective in decreasing the time needed to do such a mass load. Of course, this solution would require some way for the class user to inform the quantum file system that a mass load is in progress, as well as when it is over, so that the default behavior could be reestablished.

3. And they may even work properly after you compile them. Unfortunately, I discovered shortly before sending this book to the publisher that the mailing list program in Chapter mail.htm doesn't work when compiled under DJGPP 2.8.0, although it runs perfectly when compiled under VC++ 5.0. Apparently the similarity among compilers isn't quite as great as I thought it was.

Heavenly Hash: A Dynamic Hashing Algorithm

Introduction

It's time to revisit the topic of hash coding, or "hashing" for short, which we previously discussed in Chapter superm.htm.

Algorithm Discussed

Dynamic Hashing

Problems with Standard Hashing

Hashing is a way to store and retrieve large numbers of records rapidly, by calculating an address close to the one where the record is actually stored. However, traditional disk-based methods of hashing have several deficiencies:

1. Variable-length records cannot be stored in the hash table.

2. Deleting records and reusing the space is difficult.

3. The time needed to store and retrieve a record in a file increases greatly as the file fills up, due to overflows between previously separated areas of the file.

4. Most serious of all, the maximum capacity of the file cannot be changed after its creation.

The first of these problems can be handled by having fixed-length hash records pointing to variable-length records in another file, although this increases the access time. The second problem can be overcome with some extra work in the implementation to mark deleted records for later reuse. The third problem can be alleviated significantly by the use of "last-come, first-served" hashing, as I mentioned in a footnote in Chapter superm.htm. However, until relatively recently, the fourth problem, a fixed maximum capacity for a given file, seemed intractable.

Of course, nothing stops us from creating a new, bigger, hash table and copying the records from the old file into it, rehashing them as we go along. In fact, one of the questions at the end of Chapter superm.htm asks how we could handle a file that becomes full, as an off-line process, and the "suggested approaches" section in Chapter artopt.htm proposes that exact answer. However, this solution to the problem of a full hash table has two serious drawbacks. For one thing, we have to have two copies of the file (one old and one new) while we're doing this update, which can take up a large amount of disk space temporarily. More important in many applications, this "big bang" file maintenance approach won't work for a file that has to be available all of the time. Without doing a lot of extra work, we can't update the data during this operation, which might take a long time (perhaps several hours or even days) if the file is big.
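To see why, here is a minimal in-memory sketch of what such a "big bang" rebuild amounts to; the names (Record, rebuild) are invented for illustration, and the real problem is of course disk-based, which only makes matters worse. The point is that every record must be rehashed into a second table that coexists with the first until the copy is complete:

    #include <cstddef>
    #include <list>
    #include <string>
    #include <vector>

    // A stand-in for whatever record type the file holds.
    struct Record {
        unsigned key;
        std::string data;
    };

    // "Big bang" maintenance: allocate a bigger table and rehash every record.
    // Both tables exist at once, and no lookups or updates can be serviced
    // until the loop finishes -- the two drawbacks described above.
    std::vector<std::list<Record> >
    rebuild(const std::vector<std::list<Record> >& old_table, std::size_t new_slot_count)
    {
        std::vector<std::list<Record> > new_table(new_slot_count);
        for (std::size_t i = 0; i < old_table.size(); i++)
            for (std::list<Record>::const_iterator r = old_table[i].begin();
                 r != old_table[i].end(); ++r)
                new_table[r->key % new_slot_count].push_front(*r);
        return new_table;
    }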
Is it possible to increase the size of the file after its creation without incurring these penalties, and still gain the traditional advantages of hashing, i.e., extremely fast access to large amounts of data by key? Until shortly before I wrote the second edition of this book, I would have said that it wasn't possible. Therefore, I was quite intrigued when I ran across an article describing a hashing algorithm that can dynamically resize the hash table as needed during execution, without any long pauses in execution.[1] Unfortunately, the implementation described in the article relies on the use of linked lists to hold the records pointed to by the hash slots. While this is acceptable for in-memory hash tables, such an implementation is not appropriate for disk-based hash tables, where minimizing the number of disk seeks is critical to achieving good performance.[2] Nonetheless, it was obviously a fascinating algorithm, so I decided to investigate it to see whether it could be adapted to disk-based uses.

Just Another Link in the Chain

According to the Griswold and Townsend article, Per-Ake Larson adapted an algorithm called "linear dynamic hashing", previously limited to accessing external files, to make it practical for use with in-memory tables.[3] The result, appropriately enough, is called "Larson's algorithm". How does it work?

First, let's look at the "slot" method of handling overflow used by many in-memory hash table algorithms, including Larson's algorithm. To store an element in the hash table, we might proceed as follows, assuming the element to be stored contains a "next" pointer to point to another such element:

1. The key of each element to be stored is translated into a slot number to which the element should be attached, via the hashing algorithm.

2. The "next" pointer of the element to be attached to the slot is set to the current contents of the slot's "first" pointer, which is initialized to a null pointer when the table is created.

3. The slot's "first" pointer is set to the address of the element being attached to the slot.

The result of this procedure is that the new element is linked into the chain attached to the slot, as its first element. To retrieve an element, given its key, we simply use the hashing algorithm to get the slot number, and then follow the chain from that slot until we find the element. If the element isn't found by the time we get to a null pointer, it's not in the table. (A code sketch of this scheme appears below.)

This algorithm is easy to implement, but as written it has a performance problem: namely, with a fixed number of slots, the chains attached to each slot get longer in proportion to the number of elements in the table, which means that the retrieval time increases proportionately to the number of elements added. However, if we could increase the number of slots in proportion to the number of elements added, then the average time to find an element wouldn't change, no matter how many elements were added to the table.
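Here is a minimal in-memory sketch of the fixed-slot chaining scheme just described. The names (Element, store, find, SLOT_COUNT) are invented for illustration, and this is not the disk-based implementation we'll eventually want; it simply shows the three storage steps and the chain-following lookup in outline:

    #include <cstddef>

    // Each element carries its own "next" pointer, as described above.
    struct Element {
        unsigned key;
        Element* next;
        // ... the element's data would go here ...
    };

    const std::size_t SLOT_COUNT = 8;    // fixed number of slots
    Element* first[SLOT_COUNT] = {0};    // every slot's "first" pointer starts out null

    // Step 1: translate the key into a slot number (remainder method, for illustration).
    std::size_t slot_of(unsigned key) { return key % SLOT_COUNT; }

    // Steps 2 and 3: link the new element in as the first element of its slot's chain.
    void store(Element* e)
    {
        std::size_t slot = slot_of(e->key);
        e->next = first[slot];
        first[slot] = e;
    }

    // Retrieval: follow the chain; reaching a null pointer means the key isn't in the table.
    Element* find(unsigned key)
    {
        for (Element* e = first[slot_of(key)]; e != 0; e = e->next)
            if (e->key == key)
                return e;
        return 0;    // not found
    }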
The problem is how we can find elements we've already added to the table if we increase the number of slots. It may not be obvious why this is a problem; for example, can't we just leave them where they are and add the new elements into the new slots? Unfortunately, this won't work. The reason is that when we want to look up an element, we don't generally know when the element was added, so as to tell what "section" of the hash file it resides in. The only piece of information we're guaranteed to have available is the key of the element, so we had better be able to find an element given only its key. Of course, we could always rehash every element in the table to fit the new number of slots, but we've already seen that this can cause serious delays in file availability. Larson's contribution was to find an efficient way to increase the number of slots in the table without having to relocate every element when we do it.

Relocation Assistance

The basic idea is reasonably simple, as are most great ideas, once you understand them.[4] For our example, let's assume that the key is an unsigned value; we start out with 8 slots, all marked "active"; and we want to limit the average number of elements per slot to 6. This means that in order to store an element in the table, we have to generate a number between 0 and 7, representing the number of the slot to which we will chain the new element being stored. For purposes of illustration, we'll simply divide the key by the number of slots and take the remainder.[5] For reasons which will become apparent shortly, the actual algorithm looks like this:

1. Divide the key by the number of slots (8), taking the remainder. This will result in a number from 0 to the number of slots - 1, i.e., 0-7 in this case.

2. If the remainder is greater than or equal to the number of "active" slots (8), subtract one-half the number of slots (4).

You will notice that the way the parameters are currently set, the second rule will never be activated, since the remainder can't be greater than or equal to 8; we'll see the reason for the second rule shortly. Now we're set up to handle up to 48 elements. On the 49th element, we should add another slot to keep the average number per slot from exceeding 6. What do we do to the hash function to allow us to access those elements we have already stored? We double the number of slots, making it 16, recording the new number of slots allocated; then we increase the number of "active" slots, bringing that number to 9. The hash calculation now acts like this:

1. Divide the key by the number of slots (16), taking the remainder; the result will be between 0 and 15 inclusive.

2. If the remainder is greater than or equal to the number of active slots (9), subtract one-half the number of slots (8).

This modified hashing algorithm, which is sketched in code below, will spread newly added elements over nine slots rather than eight, since the possible slot values now range from 0 to 8 rather than 0 to 7. This keeps the number of elements per slot from increasing as we add new elements, but how do we retrieve elements that were already in the table? Here is where we see the stroke of genius in this new algorithm. Only one slot has elements that might have to be moved under the new hashing algorithm, and that is the first slot in the table!
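To make the two-rule calculation concrete, here is a minimal sketch; the names are invented for illustration. Here allocated_slots is the number of slots we currently have room for (8 before the expansion, 16 after it), and active_slots is the number of slots actually in use (8, then 9):

    // Larson-style slot calculation, following the two rules above.
    unsigned slot_number(unsigned key, unsigned allocated_slots, unsigned active_slots)
    {
        unsigned slot = key % allocated_slots;   // rule 1: remainder of key / allocated slots
        if (slot >= active_slots)                // rule 2: fold unused slot numbers back down
            slot -= allocated_slots / 2;
        return slot;
    }

For example, with 16 slots allocated and 9 active, a key of 25 gives a remainder of 9; since 9 is greater than or equal to 9, we subtract 8 and get slot 1, which is exactly where that key hashed before the expansion (25 % 8 == 1), so elements stored earlier can still be found.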
Why is this so? The analysis isn't really very difficult. Let's look at the elements in the second slot (i.e., slot number 1). We know that their hash codes under the old algorithm were all equal to 1, or they wouldn't be in that slot. That is, the remainder after dividing their keys by 8 was always 1, which means that the binary value of the remainder always ended in 001. Now let's see what the possible hash values are under the new algorithm. We divide the key by 16 and take the remainder. This is equivalent to taking the last four bits of the key. However, the last three bits of the remainder after dividing by 16 and the last three bits of the remainder after dividing by 8 have to be the same.[6] Therefore, we have to examine only the fourth bit (the 8's place) of the remainder to calculate the new hash code. There are only two possible values of that bit, 0 and 1. If the value of that bit is 0, we have the same hash code as we had previously, so we obviously can look up the element successfully even with the additional slot in the table. However, if the value is 1, the second rule tells us to compare the hash code, which is 1001 binary or 9 decimal, against the number of active slots (9). Since the first is greater than or equal to the second, we subtract 8, getting 1 again; therefore, the lookup will always succeed. The same considerations apply to all the other slots previously occupied in the range 2-7.

There's only one slot where this analysis breaks down, and that's slot 0. Why is this one different? Because slot 8 is actually available; therefore, the second rule won't apply. In this case, we do have to know whether an element is in slot 0 or slot 8. While this will be taken care of automatically for newly added elements using the new hash code parameters, we still have the old elements in slot 0 to contend with; on the average, half of them will need to be moved to slot 8. This is easy to fix: all we have to do is retrieve each element from slot 0 and recalculate its hash value with the new parameters. If the result is 0, we put it back in slot 0; if the result is 8, we move it to slot 8.

Now we're okay until we have to add the 55th element to the hash table, at which point we need to add another slot to keep the average chain length from exceeding 6. Happily,