Algorithms (Part 22): Hashing

An implementation of separate chaining begins with the following declarations:

    type link = ^node;
         node = record key, info: integer; next: link end;
    var heads: array [0..M] of link;
        t, z: link;

    procedure initialize;
      var i: integer;
      begin
      new(z); z^.next := z;
      for i := 0 to M-1 do
        begin new(heads[i]); heads[i]^.next := z end
      end;

Now the procedures from Chapter 14 can be used as is, with a hash function used to choose among the lists. For example, listinsert(v, heads[v mod M]) can be used to add something to the table, t:=listsearch(v, heads[v mod M]) to find the first record with key v, and successively setting t:=listsearch(v, t) until t=z finds subsequent records with key v.

For example, if the ith letter in the alphabet is represented by the number i and we use the hash function h(k) = k mod M, then we get the following hash values for our sample set of keys with M = 11:

    Key:  A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
    Hash: 1  8  5  1  7  3  8  9  3  7  5  2  1  2  5  1  5

If these keys are successively inserted into an initially empty table, the following set of sorted lists results:

    0:  (empty)
    1:  A A A L
    2:  M X
    3:  C N
    4:  (empty)
    5:  E E E P
    6:  (empty)
    7:  G R
    8:  H S
    9:  I
    10: (empty)

Obviously, the amount of time required for a search depends on the length of the lists (and the relative positions of the keys in them). The lists could be left unordered: maintaining sorted lists is not as important for this application as it was for the elementary sequential search, because the lists are quite short.

For an "unsuccessful search" (for a record with a key not in the table) we can assume that the hash function scrambles things enough that each of the M lists is equally likely to be searched and, as in sequential list searching, that the list searched is traversed only halfway (on the average). The average length of the list examined (not counting z) in this example for an unsuccessful search is (0+4+2+2+0+4+0+2+2+1+0)/11 ≈ 1.545. This would be the average time for an unsuccessful search were the lists unordered; by keeping them ordered we cut the time in half.

For a "successful search" (for one of the records in the table), we assume that each record is equally likely to be sought: seven of the keys would be found as the first list item examined, six as the second item examined, etc., so the average is (7·1 + 6·2 + 2·3 + 2·4)/17 ≈ 1.941. (This count assumes that equal keys are distinguished with a unique identifier or some other mechanism, and that the search routine is modified appropriately to be able to search for each individual key.)

If N, the number of keys in the table, is much larger than M, then a good approximation to the average length of the lists is N/M. As in Chapter 14, unsuccessful and successful searches would be expected to go about halfway down some list. Thus, hashing provides an easy way to cut down the time required for sequential search by a factor of M, on the average.

In a separate chaining implementation, M is typically chosen to be relatively small so as not to use up a large area of contiguous memory. But it's probably best to choose M sufficiently large that the lists are short enough to make sequential search the most efficient method for them: "hybrid" methods (such as using binary trees instead of linked lists) are probably not worth the trouble.

The implementation given above uses a hash table of links to headers of the lists containing the actual keys. Maintaining M list header nodes is somewhat wasteful of space; it is probably worthwhile to eliminate them and make heads a table of links to the first keys in the lists. This leads to some complication in the algorithm.
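The Chapter 14 routines themselves are not reproduced in this excerpt. A minimal sketch of what listinsert and listsearch might look like for sorted lists with the sentinel node z, reconstructed for illustration rather than copied from the book:

    function listinsert(v: integer; t: link): link;
      var x: link;
      begin
      z^.key := v;                      { sentinel stops the scan }
      while t^.next^.key < v do t := t^.next;
      new(x);
      x^.next := t^.next; t^.next := x;
      x^.key := v;
      listinsert := x
      end;

    function listsearch(v: integer; t: link): link;
      begin
      z^.key := v;                      { sentinel guarantees termination }
      repeat t := t^.next until v <= t^.key;
      if t^.key = v then listsearch := t else listsearch := z
      end;

Because listsearch advances past its starting node before comparing, the calling convention t:=listsearch(v, t) described above naturally steps through successive records with equal keys, and ordering the lists lets an unsuccessful search stop as soon as a larger key is seen.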
For example, adding a new record to the beginning of a list becomes a different operation from adding a new record anywhere else in a list, because it involves modifying an entry in the table of links, not a field of a record. An alternate implementation is to put the first key within the table. If space is at a premium, it is necessary to analyze carefully the tradeoff between wasting space for a table of links and wasting space for a key and a link for each empty list. If N is much bigger than M, then the alternate method is probably better, though M is usually small enough that the extra convenience of using list header nodes is probably justified.

Open Addressing

If the number of elements to be put in the hash table can be estimated in advance, then it is probably not worthwhile to use any links at all in the hash table. Several methods have been devised which store N records in a table of size M > N, relying on empty places in the table to help with collision resolution. Such methods are called open-addressing hashing methods.

The simplest open-addressing method is called linear probing: when there is a collision (when we hash to a place in the table which is already occupied and whose key is not the same as the search key), just probe the next position in the table: that is, compare the key in the record there against the search key. There are three possible outcomes of the probe: if the keys match, then the search terminates successfully; if there's no record there, then the search terminates unsuccessfully; otherwise probe the next position, continuing until either the search key or an empty table position is found. If a record containing the search key is to be inserted following an unsuccessful search, then it can simply be put into the empty table space which terminated the search. This method is easily implemented as follows:

    type node = record key, info: integer end;
    var a: array [0..M] of node;

    function h(k: integer): integer;
      begin h := k mod M end;

    procedure hashinitialize;
      var i: integer;
      begin
      for i := 0 to M do a[i].key := maxint
      end;

    function hashinsert(v: integer): integer;
      var x: integer;
      begin
      x := h(v);
      while a[x].key <> maxint do x := (x+1) mod M;
      a[x].key := v;
      hashinsert := x
      end;

Linear probing requires a special key value to signal an empty spot in the table: this program uses maxint for that purpose. The computation x:=(x+1) mod M corresponds to examining the next position (wrapping back to the beginning when the end of the table is reached). Note that this program does not check for the table being filled to capacity. (What would happen in this case?)

The implementation of hashsearch is similar to hashinsert: simply add the condition "a[x].key <> v" to the while loop, and delete the following instruction which stores v. This leaves the calling routine with the task of checking whether the search was unsuccessful, by testing whether the table position returned actually contains v (successful) or maxint (unsuccessful). Other conventions could be used; for example, hashsearch could return M for an unsuccessful search. For reasons that will become obvious below, open addressing is not appropriate if large numbers of records with duplicate keys are to be processed, but hashsearch can easily be adapted to handle equal keys in the case where each record has a unique identifier.
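Following that description literally, a sketch of hashsearch might look like this (built from hashinsert above for illustration; it is not reproduced from the book):

    function hashsearch(v: integer): integer;
      var x: integer;
      begin
      x := h(v);
      { probe until we find either the key or an empty position }
      while (a[x].key <> maxint) and (a[x].key <> v) do
        x := (x+1) mod M;
      hashsearch := x
      end;

The caller then tests a[hashsearch(v)].key against v to distinguish a successful from an unsuccessful search, as described above.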
For our example set of keys with M = 19, we might get the hash values:

    Key:  A  S  E  A   R  C  H  I   N  G  E  X  A   M   P   L  E
    Hash: 1  0  5  1  18  3  8  9  14  7  5  5  1  13  16  12  5

Successively inserting these keys into an initially empty hash table leaves the table in the following final state (- denotes an empty position):

    Position: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
    Key:      S  A  A  C  A  E  E  G  H  I  X  E  L  M  N  -  P  -  R

The table size is greater than before, since we must have M > N, but the total amount of memory space used is less, since no links are used. The average number of items that must be examined for a successful search in this example is 33/17 ≈ 1.941.

Linear probing (indeed, any hashing method) works because it guarantees that, when searching for a particular key, we look at every key that hashes to the same table address (in particular, the key itself, if it's in the table). Unfortunately, in linear probing, other keys are also examined, especially when the table begins to fill up: in the example above, the search for X involved looking at G, H, and I, which did not have the same hash value. What's worse, insertion of a key with one hash value can drastically increase the search times for keys with other hash values: in the example, an insertion at position 17 would cause greatly increased search times for position 16. This phenomenon, called clustering, can make linear probing run very slowly for nearly full tables.

Fortunately, there is an easy way to virtually eliminate the clustering problem: double hashing. The basic strategy is the same; the only difference is that, instead of examining each successive entry following a collided position, we use a second hash function to get a fixed increment to use for the "probe" sequence. This is easily implemented by inserting u:=h2(v) at the beginning of the procedure and changing x:=(x+1) mod M to x:=(x+u) mod M within the while loop.

The second hash function h2 must be chosen with some care, otherwise the program might not work at all. First, we obviously don't want u=0, since that would lead to an infinite loop on collision. Second, it is important that M and u be relatively prime, since otherwise some of the probe sequences could be very short (for example, consider the case M = 2u). This is easily enforced by making M prime and u < M. Third, the second hash function should be "different" from the first, otherwise a slightly more complicated clustering could occur. A function such as h2(k) = M - 2 - (k mod (M - 2)) will produce a good range of "second" hash values. For our sample keys, we get the following hash values:

    Key:     A   S   E   A   R   C  H  I   N   G   E   X   A  M  P  L   E
    Hash 1:  1   0   5   1  18   3  8  9  14   7   5   5   1 13 16 12   5
    Hash 2: 16  15  12  16  16  14  9  8   3  10  12  10  16  4  1  5  12

Successively inserting our sample keys into an initially empty table using double hashing with these values leaves the table in the following final state:

    Position: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
    Key:      S  A  -  C  -  E  E  G  H  I  E  A  L  M  N  X  P  A  R

This technique uses the same amount of space as linear probing, but the average number of items examined for a successful search is slightly smaller: 32/17 ≈ 1.882.
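Spelled out in code, the double-hashing version of hashinsert described above might read as follows (a sketch assuming the declarations from the linear-probing program):

    function h2(k: integer): integer;
      begin h2 := M - 2 - (k mod (M - 2)) end;

    function hashinsert(v: integer): integer;
      var x, u: integer;
      begin
      x := h(v); u := h2(v);        { fixed probe increment for this key }
      while a[x].key <> maxint do x := (x+u) mod M;
      a[x].key := v;
      hashinsert := x
      end;

Note that h2 always returns a value strictly between 0 and M-1; since M is prime, every such increment is relatively prime to M, so the probe sequence is guaranteed to visit every table position.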
Still, such a full table is to be avoided, as shown by the 9 probes used to insert the last E in this example.

Open addressing methods can be inconvenient in a dynamic situation, when an unpredictable number of insertions and deletions might have to be processed. First, how big should the table be? Some estimate must be made of how many insertions are expected, but performance degrades drastically as the table starts to get full. A common solution to this problem is to rehash everything into a larger table on a (very) infrequent basis. Second, a word of caution is necessary about deletion: a record can't simply be removed from a table built with linear probing or double hashing. The reason is that later insertions into the table might have skipped over that record, and searches for those records will terminate at the hole left by the deleted record. A way to solve this problem is to have another special key which can serve as a placeholder for searches but can be identified and remembered as an empty position for insertions. Note that neither table size nor deletion is a particular problem with separate chaining.

Analytic Results

The methods discussed above have been analyzed completely, and it is possible to compare their performance in some detail. The following formulas, summarized from detailed analyses described by D. E. Knuth in his book on sorting and searching, give the average number of items examined (probes) for unsuccessful and successful searching using the methods we've studied. The formulas are most conveniently expressed in terms of the "load factor" of the hash table, α = N/M. Note that for separate chaining we can have α > 1, but for the other methods we must have α < 1.

                        Unsuccessful             Successful
    Separate Chaining:  1 + α/2                  (α + 1)/2
    Linear Probing:     1/2 + 1/(2(1 - α)²)      1/2 + 1/(2(1 - α))
    Double Hashing:     1/(1 - α)                -ln(1 - α)/α

For small α, it turns out that all the formulas reduce to the basic result that an unsuccessful search takes about 1 + N/M probes and a successful search takes about 1 + N/2M probes. (Except, as we've noted, the cost of an unsuccessful search for separate chaining is reduced by about half by ordering the lists.) The formulas indicate how badly performance degrades for open addressing as α gets close to 1. For large M and N, with a table about 90% full, linear probing will take about 50 probes for an unsuccessful search, compared to 10 for double hashing.

Comparing linear probing and double hashing against separate chaining is more complicated, because there is more memory available in the open addressing methods (since there are no links). The value of α used should be modified to take this into account, based on the relative size of keys and links. This means that it is not normally justified to choose separate chaining over double hashing on the basis of performance.

The choice of the very best hashing method for a particular application can be very difficult. However, the very best method is rarely needed for a given situation, and the various methods do have similar performance characteristics as long as the memory resource is not being severely strained. Generally, the best course of action is to use the simple separate chaining method to reduce search times drastically when the number of records to be processed is not known in advance (and a good storage allocator is available) and to use double hashing to search among a set of keys whose size can be roughly predicted ahead of time.
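As a quick check, plugging α = 0.9 into the unsuccessful-search formulas above reproduces the 50-versus-10 comparison cited for a table that is 90% full:

    Linear probing:  1/2 + 1/(2(1 - 0.9)²) = 1/2 + 1/0.02 = 50.5 probes
    Double hashing:  1/(1 - 0.9) = 10 probes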
Many other hashing methods have been developed which have application in some special situations. Although we can't go into details, we'll briefly consider two examples to illustrate the nature of specially adapted hashing methods. These and many other methods are fully described in Knuth's book.

The first, called ordered hashing, is a method for making use of ordering within an open addressing table: in standard linear probing, we stop the search when we find an empty table position or a record with a key equal to the search key; in ordered hashing, we stop the search when we find a record with a key greater than or equal to the search key (the table must be cleverly constructed to make this work). This method turns out to reduce the time for unsuccessful search to approximately that for successful search. (This is the same kind of improvement that comes in separate chaining.) This method is useful for applications where unsuccessful searching is frequent. For example, a text processing system might have an algorithm for hyphenating words that works well for most words but not for bizarre cases (such as "bizarre"). The situation could be handled by looking up all words in a relatively small exception dictionary of words which must be handled in a special way, with most searches likely to be unsuccessful.

Similarly, there are methods for moving some records around during unsuccessful search to make successful searching more efficient. In fact, R. P. Brent developed a method for which the average time for a successful search can be bounded by a constant, giving a very useful method for applications with frequent successful searching in very large tables such as dictionaries.

These are only two examples of a large number of algorithmic improvements which have been suggested for hashing. Many of these improvements are interesting and have important applications. However, our usual cautions must be raised against premature use of advanced methods except by experts with serious searching applications, because separate chaining and double hashing are simple, efficient, and quite acceptable for most applications.

Hashing is preferred to the binary tree structures of the previous two chapters for many applications because it is somewhat simpler and can provide very fast (constant) searching times, if space is available for a large enough table. Binary tree structures have the advantages that they are dynamic (no advance information on the number of insertions is needed), they can provide guaranteed worst-case performance (everything could hash to the same place even in the best hashing method), and they support a wider range of operations (most important, the sort function). When these factors are not important, hashing is certainly the searching method of choice.

Exercises

1. Describe how you might implement a hash function by making use of a good random number generator. Would it make sense to implement a random number generator by making use of a hash function?

2. How long could it take in the worst case to insert N keys into an initially empty table, using separate chaining with unordered lists? Answer the same question for sorted lists.

3. Give the contents of the hash table that results when the keys E A S Y Q U E S T I O N are inserted in that order into an initially empty table of size 13 using linear probing. (Use h1(k) = k mod 13 as the hash function, where the kth letter of the alphabet is represented by the number k.)
4. Give the contents of the hash table that results when the keys E A S Y Q U E S T I O N are inserted in that order into an initially empty table of size 13 using double hashing. (Use h1(k) from the previous question and h2(k) = 1 + (k mod 11) for the second hash function.)

5. About how many probes are involved when double hashing is used to build a table consisting of N equal keys?

6. Which hashing method would you use for an application in which many equal keys are likely to be present?

7. Suppose that the number of items to be put into a hash table is known in advance. Under what condition will separate chaining be preferable to double hashing?

8. Suppose a programmer has a bug in his double hashing code so that one of the hash functions always returns the same value (not 0). Describe what happens in each situation (when the first one is wrong and when the second one is wrong).

9. What hash function should be used if it is known in advance that the key values fall into a relatively small range?

10. Criticize the following algorithm for deletion from a hash table built with linear probing. Scan right from the element to be deleted (wrapping as necessary) to find an empty position, then scan left to find an element with the same hash value. Then replace the element to be deleted with that element, leaving its table position empty.
