EXTERNAL SEARCHING
the record. Continuing a little further, we can add S=10011 and E=00101
before another split is necessary to add A=00001. This split also requires
doubling the directory, leaving the structure:
Disk 1: 20 21 22 22 30 30 30 30
Disk 2: A A   E E E   L N
Disk 3: R S T X
In general, the structure built by extendible hashing consists of a directory
of 2^d words (one for each d-bit pattern) and a set of leaf pages which contain
all records with keys beginning with a specific bit pattern (of less than or
equal to d bits). A search entails using the leading d bits of the key to index
into the directory, which contains pointers to leaf pages. Then the referenced
leaf page is accessed and searched (using any strategy) for the proper record.
A leaf page can be pointed to by more than one directory entry: to be precise,
if a leaf page contains all the records with keys that begin with a specific k
bits (those marked with a vertical line in the pages on the diagram above),
then it will have 2^(d-k) directory entries pointing to it. In the example above,
we have d = 3, and page 0 of disk 3 contains all the records with keys that
begin with a 1 bit, so there are four directory entries pointing to it.
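To make the indexing step concrete, here is a minimal Python sketch, assuming (as in the text's example) five-bit keys given by alphabetic rank and d = 3; the function names are ours, for illustration only.

```python
# Directory lookup in extendible hashing, assuming (as in the text) that each
# key is the five-bit binary rank of a letter: A = 00001, B = 00010, ...

d = 3                                   # the directory has 2^d = 8 entries

def key_bits(letter):
    """Five-bit key for a letter, per the text's convention."""
    return ord(letter) - ord('A') + 1   # A -> 1 = 00001

def directory_index(letter):
    """Leading d bits of the key, used to index the directory."""
    return key_bits(letter) >> (5 - d)

print(directory_index('A'))             # 0: A = 00001 begins 000
print(directory_index('S'))             # 4: S = 10011 begins 100

# Every key beginning with a 1 bit (k = 1) lands in entries 4..7, so a page
# holding all such keys has 2^(d-k) = 4 directory entries pointing to it.
assert all(directory_index(c) >= 4 for c in "RSTX")
```
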
The directory contains only pointers to pages. These are likely to be
smaller than keys or records, so more directory entries will fit on each page.
For our example, we’ll assume that we can fit twice as many directory entries
as records per page, though this ratio is likely to be much higher in practice.
When the directory spans more than one page, we keep a “root node” in
memory which tells where the directory pages are, using the same indexing
scheme. For example, if the directory spans two pages, the root node might
contain the two entries “10 11,” indicating that the directory for all the records
with keys beginning with 0 is on page 0 of disk 1, and the directory for all
keys beginning with 1 is on page 1 of disk 1. For our example, this split
occurs when the E is inserted. Continuing up until the last E (see below), we
get the following disk storage structure:
Disk 1: 20 20 21 22 30 30 31 32 40 40 41 41 42 42 42 42
Disk 2: A A A C   E E E E   G
Disk 3: H I   L L M   N N
Disk 4: P R R S   T   X X
As illustrated in the above example, insertion into an extendible hashing
structure can involve one of three operations, after the leaf page which could
contain the search key is accessed. If there’s room in the leaf page, the new
record is simply inserted there; otherwise the leaf page is split in two (half the
records are moved to a new page). If the directory has more than one entry
pointing to that leaf page, then the directory entries can be split as the page
is. If not, the size of the directory must be doubled.
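These three cases can be sketched as a short Python routine. This is an illustrative toy under the same assumptions as the example (five-bit integer keys, four records per leaf page), not the book's implementation, and it ignores the equal-key breakdown discussed below.

```python
# Minimal sketch of extendible-hashing insertion, covering the three cases:
# (1) room in the leaf, (2) split the leaf, (3) double the directory first.

PAGE_CAPACITY = 4
BITS = 5                      # keys are 5-bit integers, A = 00001 = 1

class Page:
    def __init__(self, depth):
        self.depth = depth    # number of leading bits shared by its keys
        self.keys = []

def make_table():
    return {'d': 0, 'dir': [Page(depth=0)]}   # directory of 2^d entries

def insert(table, key):
    while True:
        d = table['d']
        i = key >> (BITS - d) if d else 0
        page = table['dir'][i]
        if len(page.keys) < PAGE_CAPACITY:     # case 1: room in the leaf
            page.keys.append(key)
            return
        if page.depth == d:                    # case 3: double the directory
            table['dir'] = [p for p in table['dir'] for _ in (0, 1)]
            d += 1
            table['d'] = d
        # case 2: split the full leaf on the next bit of its prefix
        page.depth += 1
        sibling = Page(page.depth)
        bit = BITS - page.depth
        stay = [k for k in page.keys if not (k >> bit) & 1]
        move = [k for k in page.keys if (k >> bit) & 1]
        page.keys, sibling.keys = stay, move
        for j, p in enumerate(table['dir']):   # redirect half the entries
            if p is page and (j >> (d - page.depth)) & 1:
                table['dir'][j] = sibling
```

Note that the directory-doubling step simply duplicates every pointer, so a page whose keys are distinguished by fewer than d bits keeps several entries pointing at it, exactly as in the diagrams above.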
As described so far, this algorithm is very susceptible to a bad input
key distribution: the value of d is the largest number of bits required to
separate the keys into sets small enough to fit on leaf pages, and thus if
a large number of keys agree in a large number of leading bits, then the
directory could get unacceptably large. For actual large-scale applications,
this problem can be headed off by hashing the keys to make the leading
bits (pseudo-)random. To search for a record, we hash its key to get a bit
sequence which we use to access the directory, which tells us which page to
search for a record with the same key. From a hashing standpoint, we can
think of the algorithm as splitting nodes to take care of hash value collisions:
hence the name “extendible hashing.” This method presents a very attractive
alternative to B-trees and indexed sequential access because it always uses
exactly two disk accesses for each search (like indexed sequential), while still
retaining the capability for efficient insertion (like B-trees).
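This hashed indexing step can be sketched in a few lines of Python; the multiplicative hash below is an illustrative assumption (Fibonacci hashing with Knuth's golden-ratio constant), not a function prescribed by the text.

```python
# Sketch: hash the key first, then use the leading d bits of the hash value
# to index the directory. hash32 is an illustrative choice, not the book's.

W = 32                                       # bits in the hash value

def hash32(key):
    """Multiplicative (Fibonacci) hash: spreads keys over 32-bit values."""
    return (key * 2654435769) & 0xFFFFFFFF   # 2654435769 ~ 2^32 / phi

def directory_index(key, d):
    """Leading d bits of the hashed key index the 2^d-entry directory."""
    return hash32(key) >> (W - d)
```

Because the leading bits of the hash are (pseudo-)random, clustered keys no longer force the directory to grow, while the search keeps the same two-access pattern: one directory access, one leaf-page access.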
Even with hashing, extraordinary steps must be taken if large numbers
of equal keys are present. They can make the directory artificially large; and
the algorithm breaks down entirely if there are more equal keys than fit in
one leaf page. (This actually occurs in our example, since we have five
E’s.) If
many equal keys are present then we could (for example) assume distinct keys
in the data structure and put pointers to linked lists of records containing
equal keys in the leaf pages. To see the complication involved, consider the
insertion of the last E into the structure above.
Virtual Memory
The “easier way” discussed at the end of Chapter 13 for external sorting
applies directly and trivially to the searching problem. A virtual memory
is actually nothing more than a general-purpose external searching method:
given an address (key), return the information associated with that address.
However, direct use of the virtual memory is not recommended as an easy
searching application. As mentioned in Chapter 13, virtual memories perform
best when most accesses are relatively close to previous accesses. Sorting
algorithms can be adapted to this, but the very nature of searching is that
requests are for information from arbitrary parts of the database.
Exercises
1. Give the contents of the B-tree that results when the keys E A S Y Q U
E S T I O N are inserted in that order into an initially empty tree, with
M = 5.
2. Give the contents of the B-tree that results when the keys E A S Y Q U
E S T I O N are inserted in that order into an initially empty tree, with
M = 6, using the variant of the method where all the records are kept in
external nodes.
3. Draw the B-tree that is built when sixteen equal keys are inserted into an
initially empty tree, with M = 5.
4. Suppose that one page from the database is destroyed. Describe how you
would handle this event for each of the B-tree structures described in the
text.
5. Give the contents of the extendible hashing table that results when the
keys E A S Y Q U E S T I O N are inserted in that order into an initially
empty table, with a page capacity of four records. (Following the example
in the text, don’t hash, but use the five-bit binary representation of i as
the key for the ith letter.)
6. Give a sequence of as few distinct keys as possible which makes an
extendible hashing directory grow to size 16, from an initially empty table,
with a page capacity of three records.
7. Outline a method for deleting an item from an extendible hashing table.
8. Why are “top-down” B-trees better than “bottom-up” B-trees for
concurrent access to data? (For example, suppose two programs are trying
to insert a new node at the same time.)
9. Implement search and insert for internal searching using the extendible
hashing method.
10. Discuss how the program of the previous exercise compares with double
hashing and radix trie searching for internal searching applications.
SOURCES for Searching
Again, the primary reference for this section is Knuth’s volume three.
Most of the algorithms that we’ve studied are treated in great detail in
that book, including mathematical analyses and suggestions for practical
applications.
The material in Chapter 15 comes from Guibas and Sedgewick’s 1978
paper, which shows how to fit many classical balanced tree algorithms into
the “red-black” framework, and which gives several other implementations.
There is actually quite a large literature on balanced trees. Comer’s 1979
survey gives many references on the subject of B-trees.
The extendible hashing algorithm presented in Chapter 18 comes from
Fagin, Nievergelt, Pippenger, and Strong’s 1979 paper. This paper is a must
for anyone wishing further information on external searching methods: it ties
together material from our Chapters 16 and 17 to bring out the algorithm in
Chapter 18.
Trees and binary trees as purely mathematical objects have been studied
extensively, quite apart from computer science. A great deal is known about
the combinatorial properties of these objects. A reader interested in studying
this type of material might begin with Knuth’s volume 1.
Many practical applications of the methods discussed here, especially
Chapter 18, arise within the context of database systems. An introduction to
this field is given in Ullman’s 1980 book.
D. Comer, “The ubiquitous B-tree,” Computing Surveys, 11 (1979).
R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong, “Extendible Hashing -
a fast access method for dynamic files,” ACM Transactions on Database
Systems, 4, 3 (September, 1979).
L. Guibas and R. Sedgewick, “A dichromatic framework for balanced trees,”
in 19th Annual Symposium on Foundations of Computer Science, IEEE, 1978.
Also in A Decade of Progress 1970-1980, Xerox PARC, Palo Alto, CA.
D. E. Knuth, The Art of Computer Programming. Volume 1: Fundamental
Algorithms, Addison-Wesley, Reading, MA, 1968.
D. E. Knuth, The Art of Computer Programming. Volume 3: Sorting and
Searching, Addison-Wesley, Reading, MA, second printing, 1975.
J. D. Ullman, Principles of Database Systems, Computer Science Press,
Rockville, MD, 1982.