Some Random Musings

Before we try to optimize our search, let us define some terms. There are two basic categories of storage devices, distinguished by the access they allow to individual records. The first type is sequential access; in order to read record 1000 from a sequential device, we must read records 1 through 999 first, or at least skip over them. The second type is direct access; on a direct access device, we can read record 1000 without going past all of the previous records. However, only some direct access devices allow nonsequential accesses without a significant time penalty; these are called random access devices. Unfortunately, disk drives are direct access devices, but not random access ones. The amount of time it takes to get to a particular data record depends on how close the read/write head is to the desired position; in fact, sequential reading of data may be more than ten times as fast as random access.

Is there a way to find a record in a large file with an average of about one nonsequential access? Yes; in fact, there are several such methods, varying in complexity. They are all variations on hash coding, or address calculation; as you will see, such methods actually can be implemented quite simply, although for some reason they have acquired a reputation for mystery.

Hashing It Out

Let's start by considering a linear or sequential search. That is, we start at the beginning of the file and read each record in the file until we find the one we want (because its key is the same as the key we are looking for). If we get to the end of the file without finding a record with the key we are looking for, the record isn't in the file. This is certainly a simple method, and indeed is perfectly acceptable for a very small file, but it has one major drawback: the average time it takes to find a given record increases every time we add another record. If the file gets twice as big, it takes twice as long to find a record, on the average.
So a linear search is of little use for a large file.

Divide and Conquer

But what if, instead of having one big file, we had many little files, each with only a few records in it? Of course, we would need to know which of the little files to look in, or we wouldn't have gained anything. Is there any way to know that? Let's see if we can find a way.

Suppose that we have 1000 records to search through, keyed by telephone number. To speed up the lookup, we have divided the records into 100 subfiles, averaging 10 numbers each. We can use the last two digits of the telephone number to decide which subfile to look in (or to put a new record in), and then we have to search through only the records in that subfile. If we get to the end of the subfile without finding the record we are looking for, it's not in the file. That's the basic idea of hash coding.

But why did we use the last two digits, rather than the first two? Because they will probably be more evenly distributed than the first two digits. Most of the telephone numbers on your list probably fall within a few telephone exchanges near where you live (or work). For example, suppose my local telephone book contained a lot of 758 and 985 numbers and very few numbers from other exchanges. In that case, if I were to use the first two digits for this hash coding scheme, I would end up with two big subfiles (numbers 75 and 98) and 98 smaller ones, thus negating most of the benefit of dividing the file. You see, even though the average subfile size would still be 10, about 90% of the records would be in the two big subfiles, which would have perhaps 450 records each. Therefore, the average search for 90% of the records would require reading 225 records, rather than the five we were planning on. That is why it is so important to get a reasonably even distribution of the data records in a hash-coded file.
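To make the effect of the digit choice concrete, here is a minimal sketch in Python. The function names and the synthetic list of 1000 numbers are assumptions made purely for illustration: 450 numbers each are placed in the 758 and 985 exchanges from the example above, and the remaining 100 are spread over hypothetical exchanges 200 through 299.

```python
from collections import Counter

def subfile_by_last_two(number):
    """Hash code: the last two digits of a 7-digit telephone number."""
    return int(number[-2:])

def subfile_by_first_two(number):
    """Hash code: the first two digits (a poor choice when exchanges cluster)."""
    return int(number[:2])

def make_numbers():
    """Build 1000 synthetic numbers: 450 each in 758 and 985, 100 elsewhere."""
    numbers = []
    for exchange in ("758", "985"):
        for line in range(450):
            numbers.append(f"{exchange}{line:04d}")
    for i in range(100):
        exchange = 200 + i  # hypothetical "other" exchanges, 200 through 299
        numbers.append(f"{exchange}{i:04d}")
    return numbers

def subfile_sizes(numbers, hash_fn):
    """Count how many numbers land in each subfile under the given hash code."""
    return Counter(hash_fn(n) for n in numbers)

numbers = make_numbers()
by_first = subfile_sizes(numbers, subfile_by_first_two)
by_last = subfile_sizes(numbers, subfile_by_last_two)
largest_first = max(by_first.values())
largest_last = max(by_last.values())
print("largest subfile, first two digits:", largest_first)
print("largest subfile, last two digits: ", largest_last)
```

With this synthetic data, hashing on the first two digits puts 450 records in its largest subfile, while hashing on the last two digits never puts more than 11 in one. The sequential line numbers used here are artificially even; real telephone numbers would probably be spread somewhat less evenly.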
Unite and Rule

It is inconvenient to have 100 little files lying around, and the time required to open and close each one as we need it makes this implementation inefficient. But there's no reason we couldn't combine all of these little files into one big one and use the hash code to tell us where we should start looking in the big file. That is, if we have a capacity of 1000 records, we could use the last two digits of the telephone number to tell us which "subfile" we need of the 100 "subfiles" in the big file (records 0-9, 10-19, ..., 980-989, 990-999). To help visualize this, let's look at a smaller example: 10 subfiles having a capacity of four telephone numbers each and a hash code consisting of just the last digit of the telephone number (Figure initfile).

Hashing with subfiles, initialized file (Figure initfile)

Subfile #
    +-----------+-----------+-----------+-----------+
  0 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  1 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  2 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  3 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  4 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  5 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  6 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  7 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  8 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  9 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+

In order to use a big file rather than a number of small ones, we have to make some changes to our algorithm. When using many small files, we had the end-of-file indicator to tell us where to add records and where to stop looking for a record; with one big file subdivided into small subfiles, we have to find another way to handle these tasks.
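The step from a hash code to a starting position in the big file is simple arithmetic, and a short sketch in Python may help. The function name `subfile_range` and its parameters are assumptions chosen for illustration; records are numbered from 0, as in the ranges quoted above.

```python
def subfile_range(number, hash_digits, capacity):
    """Return (first, last) record indexes of the subfile for this number.

    hash_digits: how many trailing digits of the number form the hash code.
    capacity:    how many records each subfile can hold.
    """
    subfile = int(number[-hash_digits:])
    first = subfile * capacity
    return first, first + capacity - 1

# 100 subfiles of 10 records each: the last two digits select the subfile.
print(subfile_range("7581234", hash_digits=2, capacity=10))  # (340, 349)

# 10 subfiles of 4 records each, as in the smaller example:
# the last digit selects the subfile.
print(subfile_range("9876541", hash_digits=1, capacity=4))   # (4, 7)
```

The search for a number then starts at the first index of the range rather than at the beginning of the file.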
Knowing When to Stop

One way is to add a "valid-data" flag to every entry in the file. The flag is initialized to "I" (for invalid), as in the entries in Figure initfile, and we set it to "V" (for valid) as we store data in an entry. Then if we get to an invalid record while looking up a record in the file, we know that we are at the end of the subfile and therefore the record is not in the file (Figure distinct).

Hashing with distinct subfiles (Figure distinct)

Subfile #
    +-----------+-----------+-----------+-----------+
  0 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  1 | V 9876541 | V 2323231 | V 9898981 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  2 | V 2345432 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  3 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  4 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  5 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  6 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  7 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  8 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  9 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+

For example, if we are looking for the number "9898981", we start at the beginning of subfile 1 in Figure distinct (because the number ends in 1), and examine each record from there on. The first two entries have the numbers "9876541" and "2323231", which don't match, so we continue with the third one, which is the one we are looking for. But what if we were looking for "9898971"? Then we would go through the first three entries without finding a match. The fourth entry is "I 0000000", which is an invalid entry. This is the marker for the end of this subfile, so we know the number we are looking for isn't in the file.

Now let's add another record to the file with the phone number "1212121". As before, we start at the beginning of subfile 1, since the number ends in 1.
Although the first three records are already in use, the fourth (and last) record in the subfile is available, so we store our new record there, resulting in the situation in Figure merged.

Hashing with merged subfiles (Figure merged)

Subfile #
    +-----------+-----------+-----------+-----------+
  0 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  1 | V 9876541 | V 2323231 | V 9898981 | V 1212121 |
    +-----------+-----------+-----------+-----------+
  2 | V 2345432 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  3 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  4 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  5 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  6 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  7 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  8 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  9 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+

However, what happens if we look for "9898971" in the above situation? We start out the same way, looking at records with phone numbers "9876541", "2323231", "9898981", and "1212121". But we haven't gotten to an invalid record yet. Can we stop before we get to the first record of the next subfile?

Handling Subfile Overflow

To answer that question, we have to see what would happen if we had added another record that belonged in subfile 1. There are a number of possible ways to handle this situation, but most of them are appropriate only for memory-resident files. As I mentioned above, reading the next record in a disk file is much faster than reading a record at a different place in the file. Therefore, for disk-based data, the most efficient place to put "overflow" records is in the next open place in the file, which means that adding a record with phone number "1234321" to this file would result in the arrangement in Figure overflow.
Hashing with overflow between subfiles (Figure overflow)

Subfile #
    +-----------+-----------+-----------+-----------+
  0 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  1 | V 9876541 | V 2323231 | V 9898981 | V 1212121 |
    +-----------+-----------+-----------+-----------+
  2 | V 2345432 | V 1234321 | I 0000000 | I 0000000 |
    +-----------+-----------+-----------+-----------+
  3 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
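The storing and searching steps walked through in the figures can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation the text goes on to develop: the names `new_file`, `store`, and `find` are hypothetical, the "file" is an in-memory list, and running off the end of the file is simply reported as an error. It also assumes that a lookup keeps going past the subfile boundary until it reaches an invalid entry, which is what putting overflow records in the next open place implies.

```python
SUBFILES = 10
CAPACITY = 4                      # entries per subfile, as in the figures
INVALID, VALID = "I", "V"

def new_file():
    """Every entry starts out invalid, as in the initialized file."""
    return [[INVALID, "0000000"] for _ in range(SUBFILES * CAPACITY)]

def start_of_subfile(number):
    """First entry of the subfile chosen by the last digit of the number."""
    return int(number[-1]) * CAPACITY

def store(entries, number):
    """Put the number in the first invalid entry at or after its subfile start."""
    for position in range(start_of_subfile(number), len(entries)):
        if entries[position][0] == INVALID:
            entries[position] = [VALID, number]
            return position
    # Assumption: wrap-around at the end of the file is not handled here.
    raise RuntimeError("no open entry before the end of the file")

def find(entries, number):
    """Return the position of the number, or None once an invalid entry is hit."""
    for position in range(start_of_subfile(number), len(entries)):
        flag, stored = entries[position]
        if flag == INVALID:
            return None           # end marker: the number is not in the file
        if stored == number:
            return position
    return None

entries = new_file()
for number in ("9876541", "2323231", "9898981", "2345432", "1212121", "1234321"):
    store(entries, number)

print(find(entries, "1234321"))   # 9: the second entry of subfile 2
print(find(entries, "9898971"))   # None
```

After these six insertions the list matches Figure overflow: "1234321" hashes to subfile 1 but ends up at position 9, in subfile 2. Finding it again requires reading past the four full entries of subfile 1 and the first entry of subfile 2, and the search for "9898971" ends only at the first invalid entry, at position 10.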