EXTERNAL SEARCHING
the record. Continuing a little further, we can add S=10011 and E=00101
before another split is necessary to add A=00001. This split also requires
doubling the directory, leaving the structure:
Disk 1: 20 21 22 22 30 30 30 30
Disk 2: A A   E E E   L N
Disk 3: R S T X
In general, the structure built by extendible hashing consists of a directory
of 2^d words (one for each d-bit pattern) and a set of leaf pages which contain
all records with keys beginning with a specific bit pattern (of less than or
equal to d bits). A search entails using the leading d bits of the key to index
into the directory, which contains pointers to leaf pages. Then the referenced
leaf page is accessed and searched (using any strategy) for the proper record.
A leaf page can be pointed to by more than one directory entry: to be precise,
if a leaf page contains all the records with keys that begin with a specific k
bits (those marked with a vertical line in the pages on the diagram above),
then it will have 2^(d-k) directory entries pointing to it. In the example above,
we have d = 3, and page 0 of disk 3 contains all the records with keys that
begin with a 1 bit, so there are four directory entries pointing to it.
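To make the indexing step concrete, here is a minimal Python sketch, assuming (as in the text's example) five-bit keys given by alphabetic rank and d = 3; the function names are ours, for illustration only.

```python
# Directory lookup in extendible hashing, assuming (as in the text) that each
# key is the five-bit binary rank of a letter: A = 00001, B = 00010, ...

d = 3                                   # the directory has 2^d = 8 entries

def key_bits(letter):
    """Five-bit key for a letter, per the text's convention."""
    return ord(letter) - ord('A') + 1   # A -> 1 = 00001

def directory_index(letter):
    """Leading d bits of the key, used to index the directory."""
    return key_bits(letter) >> (5 - d)

print(directory_index('A'))             # 0: A = 00001 begins 000
print(directory_index('S'))             # 4: S = 10011 begins 100

# Every key beginning with a 1 bit (k = 1) lands in entries 4..7, so a page
# holding all such keys has 2^(d-k) = 4 directory entries pointing to it.
assert all(directory_index(c) >= 4 for c in "RSTX")
```
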
The directory contains only pointers to pages. These are likely to be
smaller than keys or records, so more directory entries will fit on each page.
For our example, we’ll assume that we can fit twice as many directory entries
as records per page, though this ratio is likely to be much higher in practice.
When the directory spans more than one page, we keep a “root node” in
memory which tells where the directory pages are, using the same indexing
scheme. For example, if the directory spans two pages, the root node might
contain the two entries “10 11,” indicating that the directory for all the records
with keys beginning with 0 is on page 0 of disk 1, and the directory for all
keys beginning with 1 is on page 1 of disk 1. For our example, this split
occurs when the E is inserted. Continuing up until the last E (see below), we
get the following disk storage structure:
Disk 1: 20 20 21 22 30 30 31 32 40 40 41 41 42 42 42 42
Disk 2: A A A C   E E E E   G
Disk 3: H I   L L M   N N
Disk 4: P R R S   T   X X
As illustrated in the above example, insertion into an extendible hashing
structure can involve one of three operations, after the leaf page which could
contain the search key is accessed. If there’s room in the leaf page, the new
record is simply inserted there; otherwise the leaf page is split in two (half the
records are moved to a new page). If the directory has more than one entry
pointing to that leaf page, then the directory entries can be split as the page
is. If not, the size of the directory must be doubled.
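These three cases can be sketched as a short Python routine. This is an illustrative toy under the same assumptions as the example (five-bit integer keys, four records per leaf page), not the book's implementation, and it ignores the equal-key breakdown discussed below.

```python
# Minimal sketch of extendible-hashing insertion, covering the three cases:
# (1) room in the leaf, (2) split the leaf, (3) double the directory first.

PAGE_CAPACITY = 4
BITS = 5                      # keys are 5-bit integers, A = 00001 = 1

class Page:
    def __init__(self, depth):
        self.depth = depth    # number of leading bits shared by its keys
        self.keys = []

def make_table():
    return {'d': 0, 'dir': [Page(depth=0)]}   # directory of 2^d entries

def insert(table, key):
    while True:
        d = table['d']
        i = key >> (BITS - d) if d else 0
        page = table['dir'][i]
        if len(page.keys) < PAGE_CAPACITY:     # case 1: room in the leaf
            page.keys.append(key)
            return
        if page.depth == d:                    # case 3: double the directory
            table['dir'] = [p for p in table['dir'] for _ in (0, 1)]
            d += 1
            table['d'] = d
        # case 2: split the full leaf on the next bit of its prefix
        page.depth += 1
        sibling = Page(page.depth)
        bit = BITS - page.depth
        stay = [k for k in page.keys if not (k >> bit) & 1]
        move = [k for k in page.keys if (k >> bit) & 1]
        page.keys, sibling.keys = stay, move
        for j, p in enumerate(table['dir']):   # redirect half the entries
            if p is page and (j >> (d - page.depth)) & 1:
                table['dir'][j] = sibling
```

Note that the directory-doubling step simply duplicates every pointer, so a page whose keys are distinguished by fewer than d bits keeps several entries pointing at it, exactly as in the diagrams above.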
As described so far, this algorithm is very susceptible to a bad input
key distribution: the value of d is the largest number of bits required to
separate the keys into sets small enough to fit on leaf pages, and thus if
a large number of keys agree in a large number of leading bits, then the
directory could get unacceptably large. For actual large-scale applications,
this problem can be headed off by hashing the keys to make the leading
bits (pseudo-)random. To search for a record, we hash its key to get a bit
sequence which we use to access the directory, which tells us which page to
search for a record with the same key. From a hashing standpoint, we can
think of the algorithm as splitting nodes to take care of hash value collisions:
hence the name “extendible hashing.” This method presents a very attractive
alternative to B-trees and indexed sequential access because it always uses
exactly two disk accesses for each search (like indexed sequential), while still
retaining the capability for efficient insertion (like B-trees).
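This hashed indexing step can be sketched in a few lines of Python; the multiplicative hash below is an illustrative assumption (Fibonacci hashing with Knuth's golden-ratio constant), not a function prescribed by the text.

```python
# Sketch: hash the key first, then use the leading d bits of the hash value
# to index the directory. hash32 is an illustrative choice, not the book's.

W = 32                                       # bits in the hash value

def hash32(key):
    """Multiplicative (Fibonacci) hash: spreads keys over 32-bit values."""
    return (key * 2654435769) & 0xFFFFFFFF   # 2654435769 ~ 2^32 / phi

def directory_index(key, d):
    """Leading d bits of the hashed key index the 2^d-entry directory."""
    return hash32(key) >> (W - d)
```

Because the leading bits of the hash are (pseudo-)random, clustered keys no longer force the directory to grow, while the search keeps the same two-access pattern: one directory access, one leaf-page access.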
Even with hashing, extraordinary steps must be taken if large numbers
of equal keys are present. They can make the directory artificially large; and
the algorithm breaks down entirely if there are more equal keys than fit in
one leaf page. (This actually occurs in our example, since we have five
E’s.) If
many equal keys are present then we could (for example) assume distinct keys
in the data structure and put pointers to linked lists of records containing
equal keys in the leaf pages. To see the complication involved, consider the
insertion of the last E into the structure above.
Virtual Memory
The “easier way” discussed at the end of Chapter 13 for external sorting
applies directly and trivially to the searching problem. A virtual memory
is actually nothing more than a general-purpose external searching method:
given an address (key), return the information associated with that address.
However, direct use of the virtual memory is not recommended as an easy
searching application. As mentioned in Chapter 13, virtual memories perform
best when most accesses are relatively close to previous accesses. Sorting
algorithms can be adapted to this, but the very nature of searching is that
requests are for information from arbitrary parts of the database.
Exercises
1. Give the contents of the B-tree that results when the keys E A S Y Q U
E S T I O N are inserted in that order into an initially empty tree, with
M = 5.
2. Give the contents of the B-tree that results when the keys E A S Y Q U
E S T I O N are inserted in that order into an initially empty tree, with
M = 6, using the variant of the method where all the records are kept in
external nodes.
3. Draw the B-tree that is built when sixteen equal keys are inserted into an
initially empty tree, with M = 5.
4. Suppose that one page from the database is destroyed. Describe how you
would handle this event for each of the B-tree structures described in the
text.
5. Give the contents of the extendible hashing table that results when the
keys E A S Y Q U E S T I O N are inserted in that order into an initially
empty table, with a page capacity of four records. (Following the example
in the text, don’t hash, but use the five-bit binary representation of i as
the key for the ith letter.)
6. Give a sequence of as few distinct keys as possible which makes an
extendible hashing directory grow to size 16, from an initially empty table,
with a page capacity of three records.
7. Outline a method for deleting an item from an extendible hashing table.
8. Why are “top-down” B-trees better than “bottom-up” B-trees for
concurrent access to data? (For example, suppose two programs are trying
to insert a new node at the same time.)
9. Implement search and insert for internal searching using the extendible
hashing method.
10. Discuss how the program of the previous exercise compares with double
hashing and radix trie searching for internal searching applications.
SOURCES for Searching
Again, the primary reference for this section is Knuth’s volume three.
Most of the algorithms that we’ve studied are treated in great detail in
that book, including mathematical analyses and suggestions for practical
applications.
The material in Chapter 15 comes from Guibas and Sedgewick’s 1978
paper, which shows how to fit many classical balanced tree algorithms into
the “red-black” framework, and which gives several other implementations.
There is actually quite a large literature on balanced trees. Comer’s 1979
survey gives many references on the subject of B-trees.
The extendible hashing algorithm presented in Chapter 18 comes from
Fagin, Nievergelt, Pippenger, and Strong’s 1979 paper. This paper is a must
for anyone wishing further information on external searching methods: it ties
together material from our Chapters 16 and 17 to bring out the algorithm in
Chapter 18.
Trees and binary trees as purely mathematical objects have been studied
extensively, quite apart from computer science. A great deal is known about
the combinatorial properties of these objects. A reader interested in studying
this type of material might begin with Knuth’s volume 1.
Many practical applications of the methods discussed here, especially
Chapter 18, arise within the context of database systems. An introduction to
this field is given in Ullman’s 1980 book.
D. Comer, “The ubiquitous B-tree,” Computing Surveys, 11 (1979).
R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong, “Extendible Hashing -
a fast access method for dynamic files,” ACM Transactions on Database
Systems, 4, 3 (September, 1979).
L. Guibas and R. Sedgewick, “A dichromatic framework for balanced trees,”
in 19th Annual Symposium on Foundations of Computer Science, IEEE, 1978.
Also in A Decade of Progress 1970-1980, Xerox PARC, Palo Alto, CA.
D. E. Knuth, The Art of Computer Programming. Volume 1: Fundamental
Algorithms, Addison-Wesley, Reading, MA, 1968.
D. E. Knuth, The Art of Computer Programming. Volume 3: Sorting and
Searching, Addison-Wesley, Reading, MA, second printing, 1975.
J. D. Ullman, Principles of Database Systems, Computer Science Press,
Rockville, MD, 1982.