+---+-----------+-----------+-----------+-----------+
| 4 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+---+-----------+-----------+-----------+-----------+
| 5 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+---+-----------+-----------+-----------+-----------+
| 6 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+---+-----------+-----------+-----------+-----------+
| 7 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+---+-----------+-----------+-----------+-----------+
| 8 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+---+-----------+-----------+-----------+-----------+
| 9 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+---+-----------+-----------+-----------+-----------+

So the answer is that we can't stop before the beginning of the next subfile, because the record we are looking for might have overflowed to the next subfile, as "1234321" just did. Where can we stop? Well, we know that the record we are looking for is somewhere between the beginning of the subfile it belongs in and the next invalid record, since we go to the next subfile only when the one we are trying to use is filled up. Therefore, if we get to an invalid record, we know that the record could not be stored later in the file, since the current subfile is not filled up yet; this means that we can stop looking when we get to the first invalid record.

Some Drawbacks of Hashing

This points out one of the drawbacks of standard disk-based hashing (SDBH). We cannot delete a record from the file simply by setting the invalid flag in that record; if we did, any record which had overflowed past the one we deleted would become inaccessible, as we would stop looking when we got to the invalid record. For this reason, invalid records must be only those that have never been used and therefore can serve as "end-of-subfile" markers.

While we're on the topic of drawbacks of SDBH, we ought to note that as the file gets filled up, the maximum length of a search increases, especially for an unsuccessful search. That is because adding a record to the end of a subfile that has only one invalid entry left results in "merging" that subfile and the next one, so that searches for entries in the first subfile have to continue into the next one. In our example, when we added "1212121" to the file, the maximum length of a search for an entry in subfile 1 increased from four to seven, even though we had added only one record. With a reasonably even distribution of hash codes (which we don't have in our example), this problem is usually not serious until the file gets to be about 80% full.

While this problem would be alleviated by increasing the capacity of the file as items are added, unfortunately SDBH does not allow such incremental expansion, since there would be no way to use the extra space; the subfile starting positions can't be changed after we start writing records, or we won't be able to find records we have already stored. Of course, one way to overcome this problem is to create a new file with larger (or more) subfiles, then read each record from the old file and write it to the new one. What we will do in our example is simply to make the file 25% bigger than needed to contain the records we are planning to store, which means the file won't get more than 80% full.

Another problem with SDBH methods is that they are not well suited to the storage of variable-length records; the address calculation used to find a "slot" for a record relies on the fact that the records are of fixed length.

Finally, there's the ever-present problem of coming up with a good hash code. Unfortunately, unless we know the characteristics of the input data precisely, it's theoretically possible that all of the records will generate the same hash code, thus negating all of the performance advantages of hashing. Although this is unlikely, it makes SDBH an inappropriate algorithm for use in situations where a maximum time limit on access must be maintained.
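To make the search-termination rule concrete, here is a minimal sketch of such a lookup, using an in-memory array as a stand-in for the disk file; the sizes, the toy hash function, and names such as lookup_record are illustrative assumptions rather than the book's actual code. The probe starts at the beginning of the key's home subfile, may run past the end of that subfile, and stops at the first never-used slot:

    #include <cstring>

    const int SUBFILE_COUNT = 10;   // illustrative sizes, not the book's values
    const int SUBFILE_SIZE  = 4;    // entries per subfile
    const int FILE_SIZE     = SUBFILE_COUNT * SUBFILE_SIZE;

    struct Record
    {
        bool valid;                 // false = never used: an "end-of-subfile" marker
        char key[8];                // e.g. "1234321"
        // ... description, price, etc. ...
    };

    unsigned hash_key(const char *key)   // toy hash: treat the key as a decimal number
    {
        unsigned h = 0;
        for (; *key != '\0'; key++)
            h = h * 10 + unsigned(*key - '0');
        return h;
    }

    // Probe forward from the start of the key's home subfile.  The record may
    // have overflowed into a later subfile, so we can't stop at the subfile
    // boundary; but we can stop at the first never-used slot, because the
    // record cannot have been stored beyond it.  This is also why we can't
    // "delete" a record by clearing its valid flag: any record that had
    // overflowed past it would become unreachable.
    int lookup_record(const Record file[], const char *key)
    {
        int start = int(hash_key(key) % SUBFILE_COUNT) * SUBFILE_SIZE;
        for (int i = start; i < FILE_SIZE; i++)
        {
            if (!file[i].valid)                     // never-used slot: give up
                return -1;
            if (std::strcmp(file[i].key, key) == 0) // found the record
                return i;
        }
        return -1;                                  // ran into the end of the file
    }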
When considering these less-than-ideal characteristics, we should remember that other search methods have their own disadvantages, particularly in speed of lookup. All in all, the disadvantages of such a simply implemented access method seem rather minor in comparison with its benefits, at least for the current application.

In addition, recent innovations in hashing have made it possible to improve the flexibility of SDBH methods by fairly simple changes. For example, the use of "last-come, first-served" hashing, which stores a newly added record exactly where the hash code indicates, greatly reduces the maximum time needed to find any record; it also makes it possible to determine for any given file what that maximum is, thus removing one of the barriers to using SDBH in a time-critical application.[3] Even more recently, a spectacular advance in the state of the art has made it possible to increase the file capacity incrementally, as well as to delete records efficiently. Chapter dynhash.htm provides a C++ implementation of this remarkable innovation in hashing.

Caching out Our Winnings

Of course, even one random disk access takes a significant amount of time, from the computer's point of view. Wouldn't it be better to avoid accessing the disk at all? While that would result in the fastest possible lookup, it would require us to have the entire database in memory, which is usually not feasible. However, if we had some of the database in memory, perhaps we could eliminate some of the disk accesses.

A cache is a portion of a large database that we want to access, kept in a type of storage that can be accessed more rapidly than the type used to store the whole database. For example, if your system has an optical disk which is considerably slower than its hard disk, a portion of the database contained on the optical disk may be kept on the hard disk. Why don't we just use the hard disk for the whole database? The optical disk may be cheaper per megabyte or may have more capacity, or the removability and long projected life span of the optical disks may be the major reason for using them. Of course, our example of a cache uses memory to hold a portion of the database that is stored on the hard disk, but the principle is the same: memory is more expensive than disk storage, and with certain computers, it may not be possible to install enough memory to hold a copy of the entire database even if price were no object.

If a few items account for most of the transactions, the use of a cache speeds up access to those items, since they are likely to be among the most recently used records. This means that if we keep a number of the most recently accessed records in memory, we can reduce the number of disk accesses significantly. However, we have to have a way to locate items in the cache quickly; this problem is very similar to a hash-coded file lookup, except that we have more freedom in deciding how to handle overflows, where the entry we wish to use is already being used by another item. Since the cache is only a copy of data that is also stored elsewhere, we don't have to be so concerned about what happens to overflowing entries; we can discard them if we wish to, since we can always get fresh copies from the disk.

The simplest caching lookup algorithm is a direct-mapped cache. That means each key corresponds to one and only one entry in the cache; in other words, overflow is handled by overwriting the previous record in that space with the new entry. The trouble with this method is that if two (or more) commonly used records happen to map to the same cache entry, only one of them can be in the cache at one time. This requires us to go back to the disk repeatedly for these conflicting records.
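To illustrate, here is a minimal sketch of a direct-mapped lookup. The cache size, the toy hash, and the names cache_lookup and read_record_from_disk are hypothetical stand-ins rather than the book's routines, and the 35-byte record size anticipates the unencoded record layout described later in this chapter:

    #include <cstring>

    const int CACHE_SLOTS = 1024;    // illustrative cache size
    const int RECORD_SIZE = 35;      // unencoded record: UPC + description + price

    struct CacheEntry
    {
        bool occupied;               // has this slot ever been filled?
        char key[11];                // 10-digit UPC plus terminating null
        char data[RECORD_SIZE];      // copy of the record from the disk file
    };

    static CacheEntry cache[CACHE_SLOTS];

    // Stand-in for the real disk access; the actual program would seek to the
    // hashed position in the file and read the record from there.
    void read_record_from_disk(const char *key, char *data)
    {
        (void)key;
        std::memset(data, 0, RECORD_SIZE);
    }

    unsigned slot_for(const char *key)   // toy hash: key as a number, mod table size
    {
        unsigned h = 0;
        for (; *key != '\0'; key++)
            h = h * 10 + unsigned(*key - '0');
        return h % CACHE_SLOTS;
    }

    // Direct-mapped lookup: each key selects exactly one cache slot.  On a miss
    // we read the record from disk and overwrite whatever was in that slot;
    // discarding the previous occupant is safe because the cache holds only
    // copies of data that still exists on the disk.
    void cache_lookup(const char *key, char *data)
    {
        CacheEntry &entry = cache[slot_for(key)];
        if (entry.occupied && std::strcmp(entry.key, key) == 0)
        {
            std::memcpy(data, entry.data, RECORD_SIZE);   // cache hit
            return;
        }
        read_record_from_disk(key, data);                 // cache miss
        entry.occupied = true;
        std::strncpy(entry.key, key, sizeof entry.key - 1);
        entry.key[sizeof entry.key - 1] = '\0';
        std::memcpy(entry.data, data, RECORD_SIZE);
    }

As described next, making each slot a line of several entries turns this into a multiway associative cache, so that a collision evicts only one entry within the line rather than the line's sole occupant.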
The solution to this problem is to use a multiway associative cache. In this algorithm, each key corresponds to a cache line, which contains more than one entry; in our example, it contains eight entries. Therefore, if a number of our records have keys that map to the same line, up to eight of them can reside in the cache simultaneously. How did I decide on an eight-way associative cache? By trying various line sizes until I found the one that yielded the greatest performance.

The performance of a disk caching system is defined by its hit ratio, or the proportion of accesses to the disk that are avoided by using the cache. In order to estimate the performance of this caching algorithm, we have to quantify our assumptions. I have written a program called stestgen.cpp to generate test keys in which 20% of the items account for 80% of the database accesses (Figure stestgen.00), along with another called supinit.cpp that initializes the database (Figure supinit.00), and a third called supert.cpp to read the test keys and look them up (Figure supert.00). The results of this simulation indicated that, using an eight-way associative cache, approximately 44% of the disk accesses that would be needed in a noncached system could be eliminated.

Test program for caching (superm\stestgen.cpp) (Figure stestgen.00)

codelist/stestgen.00

Database initialization program for caching (superm\supinit.cpp) (Figure supinit.00)

codelist/supinit.00

Retrieval program for caching (superm\supert.cpp) (Figure supert.00)

codelist/supert.00

Figure line.size shows the results of my experimentation with various line sizes.

Line size effects (Figure line.size)

    Line size    Hit ratio
        1           .34
        2           .38
        4           .42
        8           .44
       16           .43

The line size is defined by the constant MAPPING_FACTOR in superm.h (Figure superm.00a).

Header file for supermarket lookup system (superm\superm.h) (Figure superm.00a)

codelist/superm.00a

Heading for The Final Lookup

Now that we have added a cache to our optimization arsenal, only three more changes are necessary to reach the final lookup algorithm that we will implement. The first is to shrink each of the subfiles to one entry. That is, we will calculate a record address rather than a subfile address when we start trying to add or look up a record. This tends to reduce the length of the search needed to find a given record, as each record (on average) will have a different starting position, rather than a number of records having the same starting position, as is the case with longer subfiles.

The second change is that, rather than having a separate flag to indicate whether a record position in the file is in use, we will create an "impossible" key value to mean that the record is available. Since our key will consist only of decimal digits (compressed to two digits per byte), we can set the first digit to 0xf (the hex representation for 15), which cannot occur in a genuine decimal number. This will take the place of our "invalid" flag, without requiring extra storage in the record.

Finally, we have to deal with the possibility that the search for a record will encounter the end of the file because the last position in the file is occupied by another record. In this case, we will wrap around to the beginning of the file and keep looking; in other words, position 0 in the file will be considered to follow immediately after the last position in the file.
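A minimal sketch of the resulting probe loop, putting these three changes together, might look like the following. The capacity, the toy hash, and names such as final_lookup are illustrative rather than taken from the book's code, and treating "first digit is 0xf" as the high nibble of the first packed byte is an assumption about the BCD layout:

    #include <cstring>

    const int FILE_CAPACITY = 12500;      // e.g. 10,000 items plus 25% extra space

    struct FileRecord
    {
        unsigned char key[5];             // 10-digit key, BCD-packed: two digits per byte
        // ... description and price fields ...
    };

    // A slot whose first packed digit is 0xf has never been used; no genuine
    // decimal key can contain a digit value of 15, so no separate flag is needed.
    bool slot_is_free(const FileRecord &r)
    {
        return (r.key[0] & 0xf0) == 0xf0;
    }

    unsigned hash_digits(const unsigned char key[5])   // toy hash over the packed digits
    {
        unsigned h = 0;
        for (int i = 0; i < 5; i++)
            h = h * 251 + key[i];
        return h;
    }

    // Each "subfile" is now a single entry, so the hash gives a record position
    // directly.  We probe forward, wrapping around so that position 0 follows
    // the last position in the file, and stop at the first never-used slot.
    int final_lookup(const FileRecord file[], const unsigned char key[5])
    {
        int start = int(hash_digits(key) % FILE_CAPACITY);
        for (int probe = 0; probe < FILE_CAPACITY; probe++)
        {
            int i = (start + probe) % FILE_CAPACITY;
            if (slot_is_free(file[i]))              // never used: the key isn't here
                return -1;
            if (std::memcmp(file[i].key, key, 5) == 0)
                return i;                           // found the record
        }
        return -1;                                  // every slot occupied, key absent
    }

When the file is created, the first key byte of every record would be initialized to 0xff, so that each slot starts out marked as available.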
Saving Storage

Now that we have decided on our lookup algorithm, we can shift our attention to reducing the amount of storage required for each record in our supermarket price lookup program. Without any special encoding, the disk storage requirements for one record would be 35 bytes (10 for the UPC, 21 for the description, and 4 for the price). For a file of 10,000 items, this would require 350 Kbytes; allowing 25% extra space so that the file doesn't get more than 80% full brings the total to approximately 437 Kbytes.
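For reference, a packed structure along these lines captures that unencoded layout and the file-size arithmetic; the names are hypothetical, storing the price as a 4-byte count of cents is an assumption, and the program's actual definitions are in superm.h:

    #include <cstdint>

    #pragma pack(push, 1)            // keep the on-disk record at exactly 35 bytes
    struct UnencodedRecord
    {
        char upc[10];                // 10-digit UPC, one digit per byte (no compression)
        char description[21];        // item description
        std::uint32_t price;         // price in cents, 4 bytes
    };
    #pragma pack(pop)

    static_assert(sizeof(UnencodedRecord) == 35, "record should occupy 35 bytes");

    // 10,000 records * 35 bytes = 350,000 bytes, about 350 Kbytes;
    // adding 25% extra space for hashing brings the file to roughly 437 Kbytes.
    const long FILE_BYTES = 10000L * sizeof(UnencodedRecord) * 5 / 4;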