The compute_hash function (from superm\suplook.cpp) (Figure suplook.06)

codelist/suplook.06

This may look mysterious, but it's actually pretty simple. After clearing the hash code we are going to calculate, it enters a loop that first shifts the old hash code one (hexadecimal) place to the left, end around, then adds the low four bits of the next character from the key to the result. When it finishes this loop, it returns to the caller, in this case compute_cache_hash. How did I come up with this algorithm?

Making a Hash of Things

Well, as you will recall from our example of looking up a telephone number, the idea of a hash code is to make the most of variations in the input data, so that there will be a wide distribution of "starting places" for the records in the file. If all the input values produced the same hash code, we would end up with a linear search again, which would be terribly slow. In this case, our key is a UPC code, which is composed of decimal digits. If each of those digits contributes equally to the hash code, we should be able to produce a fairly even distribution of hash codes, which are the starting points for searching through the file for each record.

As we noted earlier, this is one of the main drawbacks of hashing: the difficulty of coming up with a good hashing algorithm. After analyzing the nature of the data, you may have to try a few different algorithms with some test data, until you get a good distribution of hash codes. However, the effort is usually worthwhile, since you can often achieve an average of slightly over one disk access per lookup (assuming that several records fit in one physical disk record).

Meanwhile, back at compute_cache_hash, we convert the result of compute_hash, which is an unsigned value, into an index into the cache. This is then returned to lookup_record_and_number as the starting cache index.
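The loop described for compute_hash can be sketched roughly as follows. This is not the listing from suplook.cpp, which is not reproduced here: the names compute_hash_sketch and hash_to_index_sketch are hypothetical, the 32-bit width of the hash code is an assumption, and "one place" is taken to mean one four-bit digit, to match the four bits added per character. The text says only that the hash code is converted into an index, so reducing it with a remainder is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for compute_hash (not the book's listing).
// Assumes a 32-bit hash code; the width of "unsigned" in the real code may differ.
std::uint32_t compute_hash_sketch(const std::string& key) {
    std::uint32_t hash_code = 0;  // clear the hash code first
    for (unsigned char ch : key) {
        // Shift one four-bit place to the left, end around:
        // the top four bits re-enter at the bottom.
        hash_code = (hash_code << 4) | (hash_code >> 28);
        // Add the low four bits of the next character of the key.
        hash_code += ch & 0x0Fu;
    }
    return hash_code;
}

// Assumption: the hash code is reduced to a table index by taking a remainder.
std::size_t hash_to_index_sketch(std::uint32_t hash_code, std::size_t table_size) {
    assert(table_size > 0);
    return hash_code % table_size;
}
```

Because the low four bits of an ASCII digit are the digit's value, a key made of decimal digits ends up packed one digit per four-bit place, so every digit of the key influences the result.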
As mentioned above, we are using an eight-way associative cache, in which each key can be stored in any of eight entries in a cache line. This means that we need to know where the line starts, which is computed by compute_starting_cache_hash (Figure suplook.07) and where it ends, which is computed by compute_ending_cache_hash (Figure suplook.08). 9

The compute_starting_cache_hash function (from superm\suplook.cpp) (Figure suplook.07)

codelist/suplook.07

The compute_ending_cache_hash function (from superm\suplook.cpp) (Figure suplook.08)

codelist/suplook.08

After determining the starting and ending positions where the key might be found in the cache, we compare the key in each entry to the key that we are looking for, and if they are equal, we have found the record in the cache. In this event, we set the value of the record_number argument to the file record number for this cache entry, and return with the status set to FOUND.

Otherwise, the record isn't in the cache, so we will have to look for it in the file; if we find it, we will need a place to store it in the cache. So we pick a "random" entry in the line (cache_replace_index) by calculating the remainder after dividing the number of accesses we have made by the MAPPING_FACTOR. This will generate an entry index between 0 and the highest entry number, cycling through all the possibilities on each successive access, thus not favoring a particular entry number. However, if the line has an invalid entry (where the key is INVALID_BCD_VALUE), we should use that one, rather than throwing out a real record that might be needed later. Therefore, we search the line for such an empty entry, and if we are successful, we set cache_replace_index to its index.
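The search of one cache line and the choice of a replacement entry might look something like the sketch below. Everything in it is illustrative rather than taken from suplook.cpp: the types CacheEntry, CacheLine, and LineLookup, the function search_line, and the constant kInvalidKey (standing in for INVALID_BCD_VALUE) are hypothetical, keys are modeled as 64-bit integers rather than BCD values, and indices are relative to the start of a single line. The text does not say which invalid entry is chosen when a line has several; this sketch assumes the first one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMappingFactor = 8;                 // entries per cache line
constexpr std::uint64_t kInvalidKey = ~std::uint64_t{0};  // hypothetical "invalid" marker

struct CacheEntry {
    std::uint64_t key;
    std::size_t record_number;  // file record number held in this entry
};

using CacheLine = std::array<CacheEntry, kMappingFactor>;

// Build a line in which every entry is marked invalid.
CacheLine make_empty_line() {
    CacheLine line;
    line.fill(CacheEntry{kInvalidKey, 0});
    return line;
}

struct LineLookup {
    bool found;
    std::size_t record_number;  // meaningful only when found is true
    std::size_t replace_index;  // meaningful only when found is false
};

LineLookup search_line(const CacheLine& line, std::uint64_t key,
                       std::size_t access_count) {
    assert(key != kInvalidKey);
    // Compare the key in each entry of the line with the key we want.
    for (const CacheEntry& entry : line) {
        if (entry.key == key) {
            return LineLookup{true, entry.record_number, 0};
        }
    }
    // Not cached: cycle through the entries on successive accesses.
    std::size_t replace_index = access_count % kMappingFactor;
    // Prefer an invalid entry so that no real record is thrown out.
    // Assumption: the first invalid entry in the line is the one used.
    for (std::size_t i = 0; i < line.size(); ++i) {
        if (line[i].key == kInvalidKey) {
            replace_index = i;
            break;
        }
    }
    return LineLookup{false, 0, replace_index};
}
```

Using the access count as the source of the "random" index costs nothing beyond a counter, and the remainder guarantees that every entry in a full line is eventually chosen.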
Next, we calculate the place to start looking in the file, via compute_file_hash (Figure suplook.09), which is very similar to compute_cache_hash except that it uses the FILE_SIZE constant in superm.h (Figure superm.00a) to calculate the index rather than the CACHE_SIZE constant, as we want a starting index in the file rather than in the cache.

The compute_file_hash function (from superm\suplook.cpp) (Figure suplook.09)

codelist/suplook.09

As we noted above, this is another of the few drawbacks of this hashing method: the size of the file must be decided in advance, rather than being adjustable as data is entered. The reason is that to find a record in the file, we must be able to calculate its approximate position in the file in the same manner as it was calculated when the record was stored. The calculation of the hash code is designed to distribute the records evenly throughout a file of known size; if we changed the size of the file, we wouldn't be able to find records previously stored. Of course, different files can have different sizes, as long as we know the size of the file we are operating on currently: the size doesn't have to be an actual constant as it is in our example, but it does have to be known in advance for each file.

Searching the File

Now we're ready to start looking for our record in the file at the position specified by starting_file_index. Therefore, we enter a loop that searches from this starting position toward the end of the file, looking for a record with the correct key. First we set the file pointer to the first position to be read, using position_record (Figure suplook.10), then read the record at that position.

The position_record function (from superm\suplook.cpp) (Figure suplook.10)

codelist/suplook.10

If the key in that record is the one we are looking for, our search is successful.
On the other hand, if the record is invalid, then the record we are looking for is not in the file; when we add records to the file, we start at the position given by starting_file_index and store our new record in the first invalid record we find. 10 Therefore, no record can overflow past an invalid record, as the invalid record would have been used to store the overflow record. In either of these cases, we are through with the search, so we break out of the loop.

On the other hand, if the entry is neither invalid nor the record we are looking for, we keep looking through the file until either we have found the record we want, we discover that it isn't in the file by encountering an invalid record, or we run off the end of the file. In the last case we start over at the beginning of the file.

If we have found the record, we copy it to the cache entry we've previously selected and copy its record number into the list of record numbers in the cache so that we'll know which record we have stored in that cache position. Then we return to the calling function, write_record, with the record we have found. If we have determined that the record is not in the file, then we obviously can't read it into the cache, but we do want to keep track of the record number where we stopped, since that is the record number that will be used for the record if we write it to the file.

To clarify this whole process, let's make a file with room for only nine records by changing FILE_SIZE to 6 in superm.h (Figure superm.00a). After adding a few records, a dump looks like Figure initcondition.

Initial condition (Figure initcondition)

Position  Key          Data
0.        INVALID
1.        INVALID
2.        0000098765:  MINESTRONE:245
3.        0000121212:  OATMEAL, 1 LB.:300
4.        INVALID
5.        INVALID
6.        0000012345:  JELLY BEANS:150
7.        INVALID
8.        0000099887:  POPCORN:99

Let's add a record with the key "23232" to the file. Its hash code turns out to be 3, so we look at position 3 in the file.
That position is occupied by a record with key "121212", so we can't store our new record there. The next position we examine, number 4, is invalid, so we know that the record we are planning to add is not in the file. (Note that this is the exact sequence we follow to look up a record in the file as well.) We use this position to hold our new record. The file now looks like Figure aftermilk.

After adding "milk" record (Figure aftermilk)

Position  Key          Data
0.        INVALID
1.        INVALID
2.        0000098765:  MINESTRONE:245
3.        0000121212:  OATMEAL, 1 LB.:300
4.        0000023232:  MILK:128
5.        INVALID
6.        0000012345:  JELLY BEANS:150
7.        INVALID
8.        0000099887:  POPCORN:99

Looking up our newly added record follows the same algorithm. The hash code is still 3, so we examine position 3, which has the key "121212". That's not the desired record, and it's not invalid, so we continue. Position 4 does match, so we have found our record.

Now let's try to find some records that aren't in the file. If we try to find a record with key "98789", it turns out to have a hash code of 8. Since that position in the file is in use, but with a different key, we haven't found our record. However, we have encountered the end of the file. What next?

Wrapping Around at End-of-File

In order to continue looking for this record, we must start over at the beginning of the file. That is, position 0 is the next logical position after the last one in the file. As it happens, position 0 contains an invalid record, so we know that the record we want isn't in the file. 11

In any event, we are now finished with lookup_record_and_number. Therefore, we return to lookup_record_number, which returns the record number to be used to write_record (Figure suplook.01), along with a status value of FILE_FULL, FOUND, or NOT_IN_FILE (which is the status we want). FILE_FULL is an error, as we cannot add a record to a file that has reached its capacity.
So is FOUND, in this situation, as we are trying to add a new record, not find one that already exists. In either of these cases, we simply return the status to the calling function, process (Figure superm.02), which gives an appropriate error message and continues execution. However, if the status is NOT_IN_FILE, write_record continues by positioning the file to the record number returned by lookup_record_number, writing the record to the file, and returning the status NOT_IN_FILE to process, which continues execution normally.

That concludes our examination of the input mode in process. The lookup mode is very similar, except that it uses the lookup_record function (Figure suplook.03) rather than lookup_record_number, since it wants the record to be returned, not just the record number. The lookup mode, of course, also differs from the entry ...
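To make the file search and the wraparound at end-of-file concrete, here is a minimal in-memory sketch of the probing sequence. It is not the code from suplook.cpp: the function probe, the Status enumeration, and the ProbeResult type are hypothetical, the "file" is a vector of key strings in which an empty string stands for an invalid record, and the starting position is passed in directly rather than computed from a hash code. Detecting a full file by counting the positions examined is also an assumption about how FILE_FULL could be reported.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Status { Found, NotInFile, FileFull };

struct ProbeResult {
    Status status;
    std::size_t record_number;  // where the key is, or where it would be stored
};

// Hypothetical sketch: an empty string stands for an invalid record.
ProbeResult probe(const std::vector<std::string>& file, std::size_t start,
                  const std::string& key) {
    assert(!file.empty() && start < file.size());
    std::size_t position = start;
    for (std::size_t examined = 0; examined < file.size(); ++examined) {
        if (file[position].empty()) {
            // An invalid record ends the search: the key cannot lie beyond it.
            return ProbeResult{Status::NotInFile, position};
        }
        if (file[position] == key) {
            return ProbeResult{Status::Found, position};
        }
        // Move to the next record, wrapping to position 0 after the last one.
        position = (position + 1) % file.size();
    }
    // Every position holds some other key (assumed meaning of FILE_FULL).
    return ProbeResult{Status::FileFull, start};
}
```

With the nine positions of Figure initcondition loaded into the vector, probe(file, 3, "0000023232") stops at position 4 with Status::NotInFile, which is where the new record is then stored. After that record is added, the same call returns Status::Found at position 4, and probe(file, 8, "0000098789") wraps around to position 0 and reports Status::NotInFile.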