type link=^node;
     node=record key, info: integer; next: link end;
var heads: array [0..M-1] of link;
    t, z: link;
procedure initialize;
  var i: integer;
  begin
  new(z); z^.next:=z;
  for i:=0 to M-1 do
    begin new(heads[i]); heads[i]^.next:=z end;
  end;
Now the procedures from Chapter 14 can be used as is, with a hash function
used to choose among the lists. For example, listinsert(v, heads[v mod M])
can be used to add something to the table, t:=listsearch(v, heads[v mod M])
to find the first record with key v, and t:=listsearch(v, t) can be applied
successively until t=z to find subsequent records with key v.
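For concreteness, the sorted-list routines might look like the following
sketch. This is in the spirit of the Chapter 14 programs (using the key
field of z as a sentinel), though the details given there may differ:

function listsearch(v: integer; t: link): link;
  begin
  z^.key:=v;  { sentinel: the scan always stops }
  repeat t:=t^.next until v<=t^.key;
  if t^.key=v then listsearch:=t else listsearch:=z
  end;
procedure listinsert(v: integer; t: link);
  var x: link;
  begin
  z^.key:=v;  { sentinel for the position scan }
  while t^.next^.key<v do t:=t^.next;
  new(x); x^.key:=v;
  x^.next:=t^.next; t^.next:=x
  end;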
For example, if the ith letter in the alphabet is represented with the
number i and we use the hash function h(k) = k mod M, then we get the
following hash values for our sample set of keys with M = 11:

Key:   A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
Hash:  1  8  5  1  7  3  8  9  3  7  5  2  1  2  5  1  5

If these keys are successively inserted into an initially empty table, the
following set of lists would result:
 0  1  2  3  4  5  6  7  8  9 10
    A  M  C     E     G  H  I
    A  X  N     E     R  S
    A           E
    L           P
Obviously, the amount of time required for a search depends on the length
of the lists (and the relative positions of the keys in them). The lists could be
left unordered: maintaining sorted lists is not as important for this application
as it was for the elementary sequential search because the lists are quite short.
For an “unsuccessful search” (for a record with a key not in the table) we
can assume that the hash function scrambles things enough so that each of
the M lists is equally likely to be searched and, as in sequential list searching,
that the list searched is only traversed halfway (on the average). The average
length of the list examined (not counting z) in this example for unsuccessful
search is (0+4+2+2+0+4+0+2+2+1+0)/11 ≈ 1.545. This would be the
average time for an unsuccessful search were the lists unordered; by keeping
them ordered we cut the time in half. For a “successful search” (for one of the
records in the table), we assume that each record is equally likely to be sought:
seven of the keys would be found as the first list item examined, six would be
found as the second item examined, etc., so the average is
(7·1 + 6·2 + 2·3 + 2·4)/17 ≈ 1.941. (This count assumes that equal keys are distinguished with
a unique identifier or some other mechanism, and the search routine modified
appropriately to be able to search for each individual key.)
If N, the number of keys in the table, is much larger than M then a good
approximation to the average length of the lists is N/M. As in Chapter 14,
unsuccessful and successful searches would be expected to go about halfway
down some list. Thus, hashing provides an easy way to cut down the time
required for sequential search by a factor of M, on the average.
In a separate chaining implementation, M is typically chosen to be rela-
tively small so as not to use up a large area of contiguous memory. But it’s
probably best to choose M sufficiently large that the lists are short enough to
make sequential search the most efficient method for them: “hybrid” methods
(such as using binary trees instead of linked lists) are probably not worth the
trouble.
The implementation given above uses a hash table of links to headers
of the lists containing the actual keys. Maintaining M list header nodes is
somewhat wasteful of space; it is probably worthwhile to eliminate them and
make heads be a table of links to the first keys in the lists. This leads to
some complication in the algorithm. For example, adding a new record to the
beginning of a list becomes a different operation than adding a new record
anywhere else in a list, because it involves modifying an entry in the table of
links, not a field of a record. An alternate implementation is to put the first
key within the table. If space is at a premium, it is necessary to carefully
analyze the tradeoff between wasting space for a table of links and wasting
space for a key and a link for each empty list. If N is much bigger than M then
the alternate method is probably better, though M is usually small enough
that the extra convenience of using list header nodes is probably justified.
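To make the difference concrete, here is a sketch (ours, with the
hypothetical name chaininsert) of unordered insertion at the front of a
list when heads contains direct links to the first records, each entry
initialized to z:

procedure chaininsert(v: integer);
  var i: integer; x: link;
  begin
  i:=v mod M;
  new(x); x^.key:=v;
  x^.next:=heads[i];  { old first record, or z for an empty list }
  heads[i]:=x         { a table entry, not a record's link field, is changed }
  end;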
Open Addressing
If the number of elements to be put in the hash table can be estimated in
advance, then it is probably not worthwhile to use any links at all in the hash
table. Several methods have been devised which store N records in a table
of size M > N, relying on empty places in the table to help with collision
resolution. Such methods are called open-addressing hashing methods.
The simplest open-addressing method is called linear probing: when there
is a collision (when we hash to a place in the table which is already occupied
and whose key is not the same as the search key), then just probe the next
position in the table: that is, compare the key in the record there against
the search key. There are three possible outcomes of the probe: if the keys
match, then the search terminates successfully; if there’s no record there,
then the search terminates unsuccessfully; otherwise probe the next position,
continuing until either the search key or an empty table position is found. If
a record containing the search key is to be inserted following an unsuccessful
search, then it can simply be put into the empty table space which terminated
the search. This method is easily implemented as follows:
type node=record key, info: integer end;
var a: array [0..M] of node;
function h(k: integer): integer;
  begin h:=k mod M end;
procedure hashinitialize;
  var i: integer;
  begin
  for i:=0 to M do a[i].key:=maxint;
  end;
function hashinsert(v: integer): integer;
  var x: integer;
  begin
  x:=h(v);
  while a[x].key<>maxint do x:=(x+1) mod M;
  a[x].key:=v;
  hashinsert:=x;
  end;
Linear probing requires a special key value to signal an empty spot in the
table: this program uses maxint for that purpose. The computation
x:=(x+1)
mod M corresponds to examining the next position (wrapping back to the
beginning when the end of the table is reached). Note that this program does
not check for the table being filled to capacity. (What would happen in this
case?)
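(It would loop forever, since the while loop could never find an empty
position. A defensive variant, not part of the program above, bounds the
number of probes; returning M as a failure code is our convention:)

function hashinsertguarded(v: integer): integer;
  var x, probes: integer;
  begin
  x:=h(v); probes:=0;
  while (a[x].key<>maxint) and (probes<M) do
    begin x:=(x+1) mod M; probes:=probes+1 end;
  if a[x].key<>maxint then hashinsertguarded:=M  { table full }
  else begin a[x].key:=v; hashinsertguarded:=x end
  end;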
The implementation of hashsearch is similar to hashinsert: simply add
the condition
“a
[x] .key< >v” to the while loop, and delete the following
instruction which stores v. This leaves the calling routine with the task
of checking if the search was unsuccessful, by testing whether the table
position returned actually contains v (successful) or maxint (unsuccessful).
Other conventions could be used, for example hashsearch could return M
for unsuccessful search. For reasons that will become obvious below, open
addressing is not appropriate if large numbers of records with duplicate keys
are to be processed, but hashsearch can easily be adapted to handle equal
keys in the case where each record has a unique identifier.
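Following that recipe, the search function might read as follows (a sketch;
the caller tests the returned position as just described):

function hashsearch(v: integer): integer;
  var x: integer;
  begin
  x:=h(v);
  while (a[x].key<>maxint) and (a[x].key<>v) do x:=(x+1) mod M;
  hashsearch:=x
  end;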
For our example set of keys with M = 19, we might get the hash values:

Key:    A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
Hash:   1  0  5  1 18  3  8  9 14  7  5  5  1 13 16 12  5
The following table shows the steps of successively inserting these into an
initially empty hash table:
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18

A:        A
S:     S  A
E:     S  A           E
A:     S  A  A        E
R:     S  A  A        E                                      R
C:     S  A  A  C     E                                      R
H:     S  A  A  C     E        H                             R
I:     S  A  A  C     E        H  I                          R
N:     S  A  A  C     E        H  I              N           R
G:     S  A  A  C     E     G  H  I              N           R
E:     S  A  A  C     E  E  G  H  I              N           R
X:     S  A  A  C     E  E  G  H  I  X           N           R
A:     S  A  A  C  A  E  E  G  H  I  X           N           R
M:     S  A  A  C  A  E  E  G  H  I  X        M  N           R
P:     S  A  A  C  A  E  E  G  H  I  X        M  N     P     R
L:     S  A  A  C  A  E  E  G  H  I  X     L  M  N     P     R
E:     S  A  A  C  A  E  E  G  H  I  X  E  L  M  N     P     R
The table size is greater than before, since we must have M > N, but the
total amount of memory space used is less, since no links are used. The
average number of items that must be examined for a successful search for
this example is 33/17 ≈ 1.941.
Linear probing (indeed, any hashing method) works because it guarantees
that, when searching for a particular key, we look at every key that hashes
to the same table address (in particular, the key itself if it’s in the table).
Unfortunately, in linear probing, other keys are also examined, especially
when the table begins to fill up: in the example above, the search for X
involved looking at G, H, and I which did not have the same hash value.
What’s worse, insertion of a key with one hash value can drastically increase
the search times for keys with other hash values: in the example, an insertion
at position 17 would cause greatly increased search times for position 16. This
phenomenon, called clustering, can make linear probing run very slowly for
nearly full tables.
Fortunately, there is an easy way to virtually eliminate the clustering
problem: double hashing. The basic strategy is the same; the only difference is
that, instead of examining each successive entry following a collided position,
we use a second hash function to get a fixed increment to use for the “probe”
sequence. This is easily implemented by inserting u:=h2(v) at the beginning
of the procedure and changing x:=(x+1) mod M to
x:=(x+u)
mod M within
the while loop. The second hash function h2 must be chosen with some care,
otherwise the program might not work at all.
First, we obviously don't want to have u = 0, since that would lead to an
infinite loop on collision. Second, it is important that M and u be relatively
prime here, since otherwise some of the probe sequences could be very short
(for example, consider the case M=2u). This is easily enforced by making
M prime and u<M. Third, the second hash function should be “different”
from the first, otherwise a slightly more complicated clustering could occur.
A function such as h2(k) = M - 2 - (k mod (M - 2)) will produce a good range
of “second” hash values.
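Putting the pieces together, the double hashing version of hashinsert might
read as follows, a sketch combining the modifications described above with
the suggested second hash function:

function h2(k: integer): integer;
  begin h2:=M-2-(k mod (M-2)) end;
function hashinsert(v: integer): integer;
  var x, u: integer;
  begin
  u:=h2(v); x:=h(v);
  while a[x].key<>maxint do x:=(x+u) mod M;  { probe with fixed increment u }
  a[x].key:=v;
  hashinsert:=x
  end;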
For our sample keys, we get the following hash values:
Key:     A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
Hash 1:  1  0  5  1 18  3  8  9 14  7  5  5  1 13 16 12  5
Hash 2: 16 15 12 16 16 14  9  8  3 10 12 10 16  4  1  5 12
The following table is produced by successively inserting our sample keys
into an initially empty table using double hashing with these values.
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18

A:        A
S:     S  A
E:     S  A           E
A:     S  A           E                                   A
R:     S  A           E                                   A  R
C:     S  A     C     E                                   A  R
H:     S  A     C     E        H                          A  R
I:     S  A     C     E        H  I                       A  R
N:     S  A     C     E        H  I              N        A  R
G:     S  A     C     E     G  H  I              N        A  R
E:     S  A     C     E     G  H  I  E           N        A  R
X:     S  A     C     E     G  H  I  E           N  X     A  R
A:     S  A     C     E     G  H  I  E  A        N  X     A  R
M:     S  A     C     E     G  H  I  E  A     M  N  X     A  R
P:     S  A     C     E     G  H  I  E  A     M  N  X  P  A  R
L:     S  A     C     E     G  H  I  E  A  L  M  N  X  P  A  R
E:     S  A     C     E  E  G  H  I  E  A  L  M  N  X  P  A  R
This technique uses the same amount of space as linear probing but the
average number of items examined for successful search is slightly smaller:
32/17 ≈ 1.882. Still, such a full table is to be avoided, as shown by the 9
probes used to insert the last E in this example.
Open addressing methods can be inconvenient in a dynamic situation,
when an unpredictable number of insertions and deletions might have to be
processed. First, how big should the table be? Some estimate must be made
of how many insertions are expected but performance degrades drastically as
the table starts to get full. A common solution to this problem is to rehash
everything into a larger table on a (very) infrequent basis. Second, a word of
caution is necessary about deletion: a record can’t simply be removed from
a table built with linear probing or double hashing. The reason is that later
insertions into the table might have skipped over that record, and searches
for those records will terminate at the hole left by the deleted record. A
way to solve this problem is to have another special key which can serve
as a placeholder for searches but can be identified and remembered as an
empty position for insertions. Note that neither table size nor deletion is a
particular problem with separate chaining.
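One way to realize the placeholder idea is sketched below; the sentinel
deleted is our own convention (it assumes, as in our examples, that real
keys are nonnegative):

const deleted=-1;  { placeholder: occupied for searches, free for insertions }
procedure hashdelete(x: integer);
  begin a[x].key:=deleted end;
function hashinsert(v: integer): integer;
  var x: integer;
  begin
  x:=h(v);
  { a slot marked deleted may be reused; hashsearch needs no change,
    since deleted<>maxint keeps its scan moving past placeholders }
  while (a[x].key<>maxint) and (a[x].key<>deleted) do x:=(x+1) mod M;
  a[x].key:=v;
  hashinsert:=x
  end;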
Analytic Results
The methods discussed above have been analyzed completely and it is pos-
sible to compare their performance in some detail. The following formulas,
summarized from detailed analyses described by D. E. Knuth in his book on
sorting and searching, give the average number of items examined (probes) for
unsuccessful and successful searching using the methods we’ve studied. The
formulas are most conveniently expressed in terms of the “load factor” of the
hash table, α = N/M. Note that for separate chaining we can have α > 1,
but for the other methods we must have α < 1.
                     Unsuccessful             Successful

Separate Chaining:   1 + α/2                  (α + 1)/2
Linear Probing:      1/2 + 1/(2(1 - α)²)      1/2 + 1/(2(1 - α))
Double Hashing:      1/(1 - α)                -ln(1 - α)/α
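To illustrate, for a table that is two-thirds full (α = 2/3), the formulas
give about 1/2 + 1/(2(1/3)²) = 5 probes for an unsuccessful search with
linear probing, 1/(1 - 2/3) = 3 probes with double hashing, but only
1 + (2/3)/2 ≈ 1.33 with separate chaining.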
For small α,
it turns out that all the formulas reduce to the basic result that
unsuccessful search takes about 1 + N/M probes and successful search takes
about 1 + N/2M probes. (Except, as we’ve noted, the cost of an unsuccessful
search for separate chaining is reduced by about half by ordering the lists.)
The formulas indicate how badly performance degrades for open addressing
as α
gets close to 1. For large M and N, with a table about 90% full, linear
probing will take about 50 probes for an unsuccessful search, compared to
10 for double hashing. Comparing linear probing and double hashing against
separate chaining is more complicated, because there is more memory available
in the open addressing methods (since there are no links). The value of α used
should be modified to take this into account, based on the relative size of keys
and links. This means that it is not normally justified to choose separate
chaining over double hashing on the basis of performance.
The choice of the very best hashing method for a particular applica-
tion can be very difficult. However, the very best method is rarely needed
for a given situation, and the various methods do have similar performance
characteristics as long as the memory resource is not being severely strained.
Generally, the best course of action is to use the simple separate chaining
method to reduce search times drastically when the number of records to be
processed is not known in advance (and a good storage allocator is available)
and to use double hashing to search among a set of keys whose size can be
roughly predicted ahead of time.
Many other hashing methods have been developed which have application
in some special situations. Although we can’t go into details, we’ll briefly
consider two examples to illustrate the nature of specially adapted hashing
methods. These and many other methods are fully described in Knuth’s book.
The first, called ordered hashing, is a method for making use of ordering
within an open addressing table: in standard linear probing, we stop the
search when we find an empty table position or a record with a key equal
to the search key; in ordered hashing, we stop the search when we find a
record with a key greater than or equal to the search key (the table must be
cleverly constructed to make this work). This method turns out to reduce
the time for unsuccessful search to approximately that for successful search.
(This is the same kind of improvement that comes in separate chaining.) This
method is useful for applications where unsuccessful searching is frequently
used. For example, a text processing system might have an algorithm for
hyphenating words that works well for most words, but not for bizarre cases
(such as “bizarre”). The situation could be handled by looking up all words
in a relatively small exception dictionary of words which must be handled in
a special way, with most searches likely to be unsuccessful.
Similarly, there are methods for moving some records around during
unsuccessful search to make successful searching more efficient. In fact, R. P.
Brent developed a method for which the average time for a successful search
can be bounded by a constant, giving a very useful method for applications
with frequent successful searching in very large tables such as dictionaries.
These are only two examples of a large number of algorithmic improve-
ments which have been suggested for hashing. Many of these improvements
are interesting and have important applications. However, our usual cautions
must be raised against premature use of advanced methods except by experts
with serious searching applications, because separate chaining and double
hashing are simple, efficient, and quite acceptable for most applications.
Hashing is preferred to the binary tree structures of the previous two
chapters for many applications because it is somewhat simpler and it can
provide very fast (constant) searching times, if space is available for a large
enough table. Binary tree structures have the advantages that they are
dynamic (no advance information on the number of insertions is needed), they
can provide guaranteed worst-case performance (everything could hash to the
same place even in the best hashing method), and they support a wider range
of operations (most important, the sort function). When these factors are not
important, hashing is certainly the searching method of choice.
Exercises
1. Describe how you might implement a hash function by making use of a
good random number generator. Would it make sense to implement a
random number generator by making use of a hash function?
2. How long could it take in the worst case to insert N keys into an initially
empty table, using separate chaining with unordered lists? Answer the
same question for sorted lists.
3. Give the contents of the hash table that results when the keys E A S Y Q
U E S T I O N are inserted in that order into an initially empty table of
size 13 using linear probing. (Use h1(k) = k mod 13 for the hash function
for the kth letter of the alphabet.)
4. Give the contents of the hash table that results when the keys E A S Y
Q U E S T I O N are inserted in that order into an initially empty table
of size 13 using double hashing. (Use h1(k) from the previous question,
h2(k) = 1 + (k mod 11) for the second hash function.)
5. About how many probes are involved when double hashing is used to
build a table consisting of N equal keys?
6. Which hashing method would you use for an application in which many
equal keys are likely to be present?
7. Suppose that the number of items to be put into a hash table is known
in advance. Under what condition will separate chaining be preferable to
double hashing?
8. Suppose a programmer has a bug in his double hashing code so that one
of the hash functions always returns the same value (not 0). Describe
what happens in each situation (when the first one is wrong and when
the second one is wrong).
9. What hash function should be used if it is known in advance that the key
values fall into a relatively small range?
10. Criticize the following algorithm for deletion from a hash table built with
linear probing. Scan right from the element to be deleted (wrapping as
necessary) to find an empty position, then scan left to find an element
with the same hash value. Then replace the element to be deleted with
that element, leaving its table position empty.