Partition the set S into ⌈n/5⌉ groups of size 5 each (except possibly for one group). Sort each little set and identify the median element in this set. From this set of ⌈n/5⌉ "baby" medians, apply the selection algorithm recursively to find the median of the baby medians. Use this element as the pivot and proceed as in the quick-select algorithm. Show that this deterministic method runs in O(n) time by answering the following questions (please ignore floor and ceiling functions if that simplifies the mathematics, for the asymptotics are the same either way):

a. How many baby medians are less than or equal to the chosen pivot? How many are greater than or equal to the pivot?
b. For each baby median less than or equal to the pivot, how many other elements are less than or equal to the pivot? Is the same true for those greater than or equal to the pivot?
c. Argue why the method for finding the deterministic pivot and using it to partition S takes O(n) time.
d. Based on these estimates, write a recurrence equation to bound the worst-case running time t(n) for this selection algorithm (note that in the worst case there are two recursive calls: one to find the median of the baby medians and one to recur on the larger of L and G).
e. Using this recurrence equation, show by induction that t(n) is O(n).

Projects

P-11.1 Experimentally compare the performance of in-place quick-sort and a version of quick-sort that is not in-place.

P-11.2 Design and implement a stable version of the bucket-sort algorithm for sorting a sequence of n elements with integer keys taken from the range [0, N − 1], for N ≥ 2. The algorithm should run in O(n + N) time.

P-11.3 Implement merge-sort and deterministic quick-sort and perform a series of benchmarking tests to see which one is faster. Your tests should include sequences that are "random" as well as "almost" sorted.
P-11.4 Implement deterministic and randomized versions of the quick-sort algorithm and perform a series of benchmarking tests to see which one is faster. Your tests should include sequences that are very "random" looking as well as ones that are "almost" sorted.

P-11.5 Implement an in-place version of insertion-sort and an in-place version of quick-sort. Perform benchmarking tests to determine the range of values of n where quick-sort is on average better than insertion-sort.

P-11.6 Design and implement an animation for one of the sorting algorithms described in this chapter. Your animation should illustrate the key properties of this algorithm in an intuitive manner.

P-11.7 Implement the randomized quick-sort and quick-select algorithms, and design a series of experiments to test their relative speeds.

P-11.8 Implement an extended set ADT that includes the methods union(B), intersect(B), subtract(B), size(), isEmpty(), plus the methods equals(B), contains(e), insert(e), and remove(e) with obvious meaning.

P-11.9 Implement the tree-based union/find partition data structure with both the union-by-size and path-compression heuristics.

Chapter Notes

Knuth's classic text on Sorting and Searching [63] contains an extensive history of the sorting problem and algorithms for solving it. Huang and Langston [52] describe how to merge two sorted lists in-place in linear time. Our set ADT is derived from the set ADT of Aho, Hopcroft, and Ullman [5]. The standard quick-sort algorithm is due to Hoare [49]. More information about randomization, including Chernoff bounds, can be found in the appendix and the book by Motwani and Raghavan [79]. The quick-sort analysis given in this chapter is a combination of an analysis given in a previous edition of this book and the analysis of Kleinberg and Tardos [59]. The quick-sort analysis of Exercise C-11.7 is due to Littman.
Gonnet and Baeza-Yates [41] provide experimental comparisons and theoretical analyses of a number of different sorting algorithms. The term "prune-and-search" comes originally from the computational geometry literature (such as in the work of Clarkson [22] and Megiddo [72, 73]). The term "decrease-and-conquer" is from Levitin [68].

Chapter 12
Text Processing

Contents

12.1 String Operations
12.1.1 The Java String Class
12.1.2 The Java StringBuffer Class
12.2 Pattern Matching Algorithms
12.2.1 Brute Force
12.2.2 The Boyer-Moore Algorithm
12.2.3 The Knuth-Morris-Pratt Algorithm
12.3 Tries
12.3.1 Standard Tries
12.3.2 Compressed Tries
12.3.3 Suffix Tries
12.3.4 Search Engines
12.4 Text Compression
12.4.1 The Huffman Coding Algorithm
12.4.2 The Greedy Method
12.5 Text Similarity Testing
12.5.1 The Longest Common Subsequence Problem
12.5.2 Dynamic Programming
12.5.3 Applying Dynamic Programming to the LCS Problem
12.6 Exercises

java.datastructures.net

12.1 String Operations

Document processing is rapidly becoming one of the dominant functions of computers. Computers are used to edit documents, to search documents, to transport documents over the Internet, and to display documents on printers and computer screens. For example, the Internet document formats HTML and XML are primarily text formats, with added tags for multimedia content. Making sense of the many terabytes of information on the Internet requires a considerable amount of text processing. In addition to having interesting applications, text processing algorithms also highlight some important algorithmic design patterns. In particular, the pattern matching problem gives rise to the brute-force method, which is often inefficient but has wide applicability.
For text compression, we can apply the greedy method, which often allows us to approximate solutions to hard problems, and for some problems (such as in text compression) actually gives rise to optimal algorithms. Finally, in discussing text similarity, we introduce the dynamic programming design pattern, which can be applied in some special instances to solve a problem in polynomial time that appears at first to require exponential time to solve.

At the heart of algorithms for processing text are methods for dealing with character strings. Character strings can come from a wide variety of sources, including scientific, linguistic, and Internet applications. Indeed, the following are examples of such strings:

    P = "CGTAAACTGCTTTAATCAAACGC"
    S = "http://java.datastructures.net"

The first string, P, comes from DNA applications, and the second string, S, is the Internet address (URL) for the Web site that accompanies this book.

Several of the typical string processing operations involve breaking large strings into smaller strings. In order to be able to speak about the pieces that result from such operations, we use the term substring of an m-character string P to refer to a string of the form P[i]P[i + 1]P[i + 2] … P[j], for some 0 ≤ i ≤ j ≤ m − 1, that is, the string formed by the characters in P from index i to index j, inclusive. Technically, this means that a string is actually a substring of itself (taking i = 0 and j = m − 1), so if we want to rule this out as a possibility, we must restrict the definition to proper substrings, which require that either i > 0 or j < m − 1.

To simplify the notation for referring to substrings, let us use P[i..j] to denote the substring of P from index i to index j, inclusive. That is, P[i..j] = P[i]P[i + 1] … P[j]. We use the convention that if i > j, then P[i..j] is equal to the null string, which has length 0.
In addition, in order to distinguish some special kinds of substrings, let us refer to any substring of the form P[0..i], for 0 ≤ i ≤ m − 1, as a prefix of P, and any substring of the form P[i..m − 1], for 0 ≤ i ≤ m − 1, as a suffix of P. For example, if we again take P to be the string of DNA given above, then "CGTAA" is a prefix of P, "CGC" is a suffix of P, and "TTAATC" is a (proper) substring of P. Note that the null string is a prefix and a suffix of any other string.

To allow for fairly general notions of a character string, we typically do not restrict the characters in T and P to explicitly come from a well-known character set, like the Unicode character set. Instead, we typically use the symbol σ to denote the character set, or alphabet, from which characters can come. Since most document processing algorithms are used in applications where the underlying character set is finite, we usually assume that the size of the alphabet σ, denoted with |σ|, is a fixed constant.

String operations come in two flavors: those that modify the string they act on and those that simply return information about the string without actually modifying it. Java makes this distinction precise by defining the String class to represent immutable strings, which cannot be modified, and the StringBuffer class to represent mutable strings, which can be modified.

12.1.1 The Java String Class

The main operations of the Java String class are listed below:

length(): Return the length, n, of S.
charAt(i): Return the character at index i in S.
startsWith(Q): Determine if Q is a prefix of S.
endsWith(Q): Determine if Q is a suffix of S.
substring(i,j): Return the substring S[i,j].
concat(Q): Return the concatenation of S and Q, that is, S+Q.
equals(Q): Determine if Q is equal to S.
indexOf(Q): If Q is a substring of S, return the index of the beginning of the first occurrence of Q in S, else return −1.

This collection forms the typical operations for immutable strings.
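To make the notation concrete, here is a minimal Java sketch of several of the String operations listed above, together with the immutable/mutable contrast between String and StringBuffer. The class name StringOpsDemo and the helpers inclusiveSubstring, isPrefix, and isSuffix are hypothetical names introduced only for illustration. One point worth checking against the Java API documentation: java.lang.String.substring(i, j) treats j as an exclusive end index, so the inclusive substring S[i..j] used in the text's notation corresponds to s.substring(i, j + 1) in actual Java code.

```java
final class StringOpsDemo {
    /** Hypothetical helper: returns S[i..j] with both indices inclusive, as in the text's notation. */
    static String inclusiveSubstring(String s, int i, int j) {
        if (i > j) {
            return "";                  // convention from the text: the null string when i > j
        }
        return s.substring(i, j + 1);   // Java's end index is exclusive, hence j + 1
    }

    /** Prefix and suffix tests map directly onto startsWith and endsWith. */
    static boolean isPrefix(String q, String s) { return s.startsWith(q); }
    static boolean isSuffix(String q, String s) { return s.endsWith(q); }

    public static void main(String[] args) {
        String s = "abcdefghijklmnop";
        System.out.println(s.length());
        System.out.println(s.charAt(5));
        System.out.println(s.indexOf("ghi"));
        System.out.println(inclusiveSubstring(s, 4, 9));

        // String is immutable: concat returns a new string and leaves s unchanged.
        String t = s.concat("qrs");
        System.out.println(s.equals("abcdefghijklmnop") && t.length() == 19);

        // StringBuffer is mutable: append changes the buffer itself.
        StringBuffer sb = new StringBuffer(s);
        sb.append("qrs");
        System.out.println(sb.toString().equals(t));
    }
}
```

The helper makes the off-by-one explicit in one place, so the rest of a program can keep using the inclusive P[i..j] convention of the text.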
Example 12.1: Consider the following set of operations, which are performed on the string S = "abcdefghijklmnop":

Operation              Output
length()               16
charAt(5)              'f'
concat("qrs")          "abcdefghijklmnopqrs"
endsWith("javapop")    false
indexOf("ghi")         6
startsWith("abcd")     true
substring(4,9)         "efghij"

With the exception of the indexOf(Q) method, which we discuss in Section 12.2, all the methods above are easily implemented simply by representing the string as an array of characters, which is the standard String implementation in Java.

12.1.2 The Java StringBuffer Class

The main methods of the Java StringBuffer class are listed below:

append(Q): Return S+Q, replacing S with S + Q.
insert(i, Q): Return and update S to be the string obtained by inserting Q inside S starting at index i.
reverse(): Reverse and return the string S.
setCharAt(i,ch): Set the character at index i in S to be ch.
charAt(i): Return the character at index i in S.

Error conditions occur when the index i is out of the bounds of the indices of the string. With the exception of the charAt method, most of the methods of the String class are not immediately available to a StringBuffer object S in Java. Fortunately, the Java StringBuffer class provides a toString() method that returns a String version of S, which can be used to access String methods.

Example 12.2: Consider the following sequence of operations, which are performed on the mutable string that is initially S = "abcdefghijklmnop":

Operation            S
append("qrs")        "abcdefghijklmnopqrs"
insert(3,"xyz")      "abcxyzdefghijklmnopqrs"
reverse()            "srqponmlkjihgfedzyxcba"
setCharAt(7,'W')     "srqponmWkjihgfedzyxcba"

12.2 Pattern Matching Algorithms

[...]
matching algorithm could either be some indication that the pattern P does not exist in T or an integer indicating the starting index in T of a substring matching P. This is exactly the computation performed by the indexOf method of the Java String interface. Alternatively, one may want to find all the indices where a substring of T matching P begins. In this section, we present three pattern matching algorithms...

continue checking P against T. Otherwise (there was a mismatch and we are at the beginning of P), we simply increment the index for T (and keep the index variable for P at its beginning). We repeat this process until we find a match of P in T or the index for T reaches n, the length of T (indicating that we did not find the pattern P in T).

Code Fragment 12.4: The KMP pattern matching algorithm. The main...

can use an incremental algorithm that inserts the strings one at a time. Recall the assumption that no string of S is a prefix of another string. To insert a string X into the current trie T, we first try to trace the path associated with X in T. Since X is not already in T and no string in S is a prefix of another string, we will stop tracing the path at an internal node v of T before reaching the end...

retrieval. Indeed, the name "trie" comes from the word "retrieval." In an information retrieval application, such as a search for a certain DNA sequence in a genomic database, we are given a collection S of strings, all defined using the same alphabet. The primary query operations that tries support are pattern matching and prefix matching. The latter operation involves being given a string X, and looking for...
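The KMP matching loop described in this excerpt (advance both indices on a match, consult the failure function after a mismatch when progress has been made in P, otherwise advance only the index for T) can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not a reproduction of Code Fragment 12.4: the names KmpSketch, computeFailure, and find are hypothetical, and the failure function is assumed to be the standard one, where f[j] is the length of the longest proper prefix of P[0..j] that is also a suffix of P[0..j].

```java
final class KmpSketch {
    /** Assumed standard failure function: f[j] = length of the longest proper
     *  prefix of p[0..j] that is also a suffix of p[0..j]. */
    static int[] computeFailure(String p) {
        int m = p.length();
        int[] f = new int[m];            // f[0] = 0 by default
        int j = 1;                       // position in p whose f-value is being computed
        int k = 0;                       // length of the currently matched prefix
        while (j < m) {
            if (p.charAt(j) == p.charAt(k)) {
                f[j] = k + 1;            // the matched prefix grows by one character
                j++;
                k++;
            } else if (k > 0) {
                k = f[k - 1];            // fall back to a shorter candidate prefix
            } else {
                j++;                     // no prefix matches here; f[j] stays 0
            }
        }
        return f;
    }

    /** Returns the starting index of the first occurrence of p in t, or -1 if there is none. */
    static int find(String t, String p) {
        int n = t.length();
        int m = p.length();
        if (m == 0) {
            return 0;                    // the empty pattern matches at index 0
        }
        int[] f = computeFailure(p);
        int i = 0;                       // index into the text t
        int j = 0;                       // index into the pattern p
        while (i < n) {
            if (t.charAt(i) == p.charAt(j)) {
                if (j == m - 1) {
                    return i - m + 1;    // the whole pattern has been matched
                }
                i++;                     // match: advance both indices
                j++;
            } else if (j > 0) {
                j = f[j - 1];            // mismatch after progress: reuse earlier comparisons
            } else {
                i++;                     // mismatch at the start of p: advance only in t
            }
        }
        return -1;                       // i reached n without a match
    }
}
```

Because the index into the text never moves backward, the matching loop performs O(n) comparisons once the failure function has been computed in O(m) time.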
string P is as shown in the following table:

The KMP pattern matching algorithm, shown in Code Fragment 12.4, incrementally processes the text string T comparing it to the pattern string P. Each time there is a match, we increment the current indices. On the other hand, if there is a mismatch and we have previously made progress in P, then we consult the failure function to determine the new index in...

outer loop indexing through all possible starting indices of the pattern in the text, and the inner loop indexing through each character of the pattern, comparing it to its potentially corresponding character in the text. Thus, the correctness of the brute-force pattern matching algorithm follows immediately from this exhaustive search approach. The running time of brute-force pattern matching in the worst...

characters
Output: Starting index of the first substring of T matching P, or an indication that P is not a substring of T
    for i ← 0 to n − m {for each candidate index in T} do
        j ← 0
        while (j < m and T[i + j] = P[j]) do
            j ← j + 1
        if j = m then
            return i
    return "There is no substring of T matching P."

Code Fragment 12.1: Brute-force pattern matching

Performance

The brute-force pattern matching algorithm could...

all internal nodes have one child. A trie T for a set S of strings can be used to implement a dictionary whose keys are the strings of S. Namely, we perform a search in T for a string X by tracing down from the root the path indicated by the characters in X. If this path can be traced and terminates at an external node, then we know X is in the dictionary. For example, in the trie in Figure 12.6, tracing...
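The dictionary search just described, together with the incremental insertion mentioned earlier in this excerpt, can be sketched as a small standard trie in Java. This is a sketch under stated assumptions rather than the book's implementation: the class StandardTrieSketch, its nested Node class, and the methods insert and contains are hypothetical names, children are kept in a HashMap for brevity, and the code relies on the text's assumption that no string of S is a prefix of another, so that every stored string ends at an external node.

```java
import java.util.HashMap;
import java.util.Map;

final class StandardTrieSketch {
    /** One trie node; each outgoing edge is labeled with a single character. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();

        boolean isExternal() {
            return children.isEmpty();
        }
    }

    private final Node root = new Node();

    /** Inserts x by tracing its path from the root and creating the missing nodes.
     *  Assumes x is neither a prefix of a stored string nor extends one. */
    void insert(String x) {
        Node v = root;
        for (int i = 0; i < x.length(); i++) {
            v = v.children.computeIfAbsent(x.charAt(i), c -> new Node());
        }
    }

    /** Returns true if the path for x can be traced and ends at an external node. */
    boolean contains(String x) {
        Node v = root;
        for (int i = 0; i < x.length(); i++) {
            v = v.children.get(x.charAt(i));
            if (v == null) {
                return false;            // the path cannot be traced
            }
        }
        return v != root && v.isExternal();   // a path ending at an internal node is not a key
    }
}
```

With an illustrative word set such as bear, bell, bid, bull, buy, sell, stock, and stop (chosen here only as an example in which no word is a prefix of another), contains("bull") succeeds, while "bet" fails because its path cannot be traced and "be" fails because its path ends at an internal node. Each step compares a single character, so a search visits at most one node per character of the query string.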
terminates at an internal node, then X is not in the dictionary. In the example in Figure 12.6, the path for "bet" cannot be traced and the path for "be" ends at an internal node. Neither such word is in the dictionary. Note that in this implementation of a dictionary, single characters are compared instead of the entire string (key). It is easy to see that the running time of the search for a string of...

that the initial cost of preprocessing the text is compensated by a speedup in each subsequent query (for example, a Web site that offers pattern matching in Shakespeare's Hamlet or a search engine that offers Web pages on the Hamlet topic). A trie (pronounced "try") is a tree-based data structure for storing strings in order to support fast pattern matching. The main application for tries is in information...

In the classic pattern matching problem on strings, we are given a text string T of length n and a pattern string P of length m, and want to find whether P is a substring...
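For the pattern matching problem as stated here, the brute-force pseudocode of Code Fragment 12.1, quoted earlier in this excerpt, translates almost line for line into Java. The sketch below is illustrative only: the class name BruteForceMatch and the method name find are hypothetical, and returning -1 (as indexOf does) is an assumed way to signal that P is not a substring of T.

```java
final class BruteForceMatch {
    /** Returns the starting index of the first substring of t matching p, or -1 if there is none. */
    static int find(String t, String p) {
        int n = t.length();
        int m = p.length();
        for (int i = 0; i <= n - m; i++) {       // each candidate starting index in t
            int j = 0;
            while (j < m && t.charAt(i + j) == p.charAt(j)) {
                j++;                             // characters agree; extend the match
            }
            if (j == m) {
                return i;                        // all m characters of p matched at index i
            }
        }
        return -1;                               // no substring of t matches p
    }

    public static void main(String[] args) {
        String t = "abcdefghijklmnop";
        System.out.println(find(t, "ghi"));
        System.out.println(find(t, "xyz"));
    }
}
```

The two nested loops make the cost visible: up to n − m + 1 candidate positions, each needing up to m comparisons, which gives O(nm) comparisons in the worst case.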