Acknowledgments

I am extremely grateful to those who have helped me in different ways to materialize this thesis. First of all, I wish to thank my supervisor, Dr. Stéphane Bressan, for providing me with his extremely valuable guidance and for teaching me what research is all about. His constant motivation and deep insight have enabled me to develop as a researcher. I sincerely thank Dr. Anirban Mondal and Mr. Vinsensius Vega for the tremendous help and support that they have extended to me. I would also like to acknowledge Dr. Ng Wee Siong, Mr. Anand Ramchand, Mr. Li Shiau Cheng, Mr. Ajay Hemnani, Mr. Liau Chu Yee, Mr. Tok Wee Hyong, Mr. Li Yingguang, Mr. Ong Twee Hee and all the members of the Database and Electronic Commerce Laboratories for their friendship and willingness to help me in various ways. I ardently wish to thank my family for their tremendous support. Last but not least, I sincerely thank the National University of Singapore for providing me with the opportunity to complete my postgraduate studies.

CONTENTS

Summary
1 Introduction
  1.1 Summary of Contributions
  1.2 Organization of the Thesis
2 Related Work and Background Information
  2.1 Approximate Matching in Strings
    2.1.1 Definition
    2.1.2 Problem Definition
    2.1.3 Algorithms
    2.1.4 Timeline
  2.2 Approximate Matching in Trees and Graphs
    2.2.1 Definition
    2.2.2 Problem Definition
    2.2.3 Algorithms
    2.2.4 Timeline
  2.3 Approximate Matching of Strings and Regular Expressions
    2.3.1 Definition
    2.3.2 Problem Definition
    2.3.3 Algorithms
    2.3.4 Timeline
3 Approximate matching of trees and graphs under the degree-1 constraint
  3.1 The degree-1 Constraint
    3.1.1 Edit Distance
    3.1.2 The degree-1 concept
  3.2 The Algorithms
    3.2.1 Preliminaries
    3.2.2 Ordered Tree Algorithm
    3.2.3 Unordered Tree Algorithm
    3.2.4 Acyclic Graph Algorithm
  3.3 Complexity Analysis
  3.4 Example
  3.5 Summary
4 Approximate matching of strings and regular expressions
  4.1 Problem Definition
  4.2 Background Information
    4.2.1 Grammars and Languages
    4.2.2 Regular Expressions
    4.2.3 Finite State Automata
    4.2.4 Chomsky Hierarchy
    4.2.5 Types of Regular Expressions
  4.3 A Simple String to Regular Expression Pattern Matching Machine
  4.4 Existing Algorithm - Myers and Miller
    4.4.1 Discussion
  4.5 Our Algorithm - RPM
    4.5.1 Background
    4.5.2 The Idea
    4.5.3 The Algorithm
    4.5.4 Examples
  4.6 Performance Evaluation
    4.6.1 Experimental Setup
    4.6.2 Results
  4.7 Summary
5 Conclusion and Future Work
Appendix A : Myers Example

LIST OF FIGURES

1.1 HTML Code and its tree representation
2.1 An alignment for S = abcaacaca and T = acaacacda
2.2 Edit Distance Matrix for S1 = abba and S2 = aabaca
2.3 Ordered Tree Example
2.4 Unordered Trees Example
2.5 Delete and Insert Edit operations on Trees
3.1 Ordered Tree Algorithm under the degree-1 constraint
3.2 Bipartite Graph G_B = (V1, V2, E)
3.3 (a) Matching M_B1 (b) Matching M_B2
3.4 Unordered Tree Algorithm under the degree-1 constraint
3.5 Acyclic Graph Algorithm under the degree-1 constraint
3.6 Two Example Trees
3.7 Ordered Edit Distance Matrix
3.8 Unordered Edit Distance Matrix
4.1 DFA Example
4.2 NFA Example
4.3 Class 1 Regular Expression
4.4 Class 2 Regular Expression
4.5 Class 3 Regular Expression
4.6 Class 4 Regular Expression
4.7 Class 5 Regular Expression
4.8 Special Arcs
4.9 Transition graph for R = a*b
4.10 Myers representation : F_a
4.11 Myers representation : F_{R|S}
4.12 Myers representation : F_{RS}
4.13 Myers representation : F_{R*}
4.14 Myers algorithm for approximate matching of regular expression
4.15 Our algorithm for approximate matching of regular expression
4.16 RPM Example 1 : R = ab*cb and S = bbbaab
4.17 RPM Example 2 : R = ca*ab* and S = baaabb
4.18 RPM Example 3 : R = abb*a*bc and S = ccbbbc
4.19 Performance Analysis : Varying Length of Regular Expression, |R|
4.20 Performance Analysis : Varying Length of String, |S|
4.21 Performance Analysis : Varying size of alphabet Σ, |Σ|
4.22 Performance Analysis : Varying number of Kleene closures, |*|
4.23 Performance Analysis : Special Cases
5.1 Myers Example : R = a* and S = aaa
5.2 Myers Edit Distance Matrix
Summary

Approximate pattern matching techniques in various structures such as strings, trees, graphs and regular expressions form the basis of many commercial applications available today in important fields such as bio-informatics and information extraction. This thesis presents a detailed review of some of the basic and important algorithms and ideas developed over the past 40 years in the area of approximate pattern matching. In particular, we address the approximate matching of structures that are more complicated than strings, namely trees and regular expressions. Approximate pattern matching of complex structures such as trees, graphs and regular expressions is a primitive operation essential to applications in information retrieval, information integration and mediation, and in many other domains that require evaluating or characterizing the similarity between structured and complex objects such as HTML documents, molecular compounds and XML data.

The main contributions of our work are two-fold. First, we present new algorithms for the approximate matching of trees (ordered and unordered) and acyclic graphs based on edit distance measures under the degree-1 constraint, the implication being that the relevant information is located at the leaves of a tree or at the periphery of a graph. Under the degree-1 constraint, edit operations can be performed only on vertices with degree ≤ 1. The ordered and unordered tree algorithms have a worst-case execution time of O(|T1|·|T2|·k² log k) and the algorithm for acyclic graphs has a worst-case execution time of O(|T1|²·|T2|²·k² log k).

Second, we consider the problem of approximate matching of a string with a special type of regular expression where the Kleene closure (*) is only allowed to be bound to a single character.
In this regard, we present a new algorithm which exploits the special properties of such a regular expression, thereby enabling us to reduce the approximate regular expression matching problem to an approximate string matching problem. Our algorithm for approximate matching of a string S with a class 2 regular expression R, which we designate as RPM, runs in O(|S|³) time and space in the worst case. Our performance evaluation indicates that our proposed techniques outperform an existing well-known algorithm for approximate regular expression matching in terms of execution time. This may be primarily attributed to the approximate string matching nature of our algorithm, which makes use of simple arithmetic operations.

We plan to extend the work done in this thesis in the near future by addressing more complex issues in the field of approximate pattern matching. This research effort has laid the foundation for considering the problem of approximate matching of two regular expressions, but we believe that there are still many open research issues in this field. In addition, we aim to employ multiple sequence alignment techniques in order to derive a valid or an optimal schedule in a client-server architecture under delay constraints.

CHAPTER 1
Introduction

Approximate pattern matching techniques in various structures such as strings, trees, graphs and regular expressions form the basis of many important and diverse commercial applications, ranging from traditional applications associated with information extraction to more specialized applications involving bio-informatics.

The World Wide Web (WWW) is growing at an exponential rate, with new websites emerging every day.
The WWW hosts and serves large amounts of documents containing data (primarily in textual form) pertaining to essentially all domains of human activity, e.g., art, education, travel, science, politics and business, thereby making it a very large-scale distributed global information resource residing on the Internet. Notably, the information on the WWW is potentially useful for both individuals and businesses.

There is an increasing need for the convergence of database and information retrieval support to new application domains such as information interchange over the Internet with XML, bio-computing, distributed directory servers with LDAP, or the management of hypermedia over the World Wide Web. At the heart of this convergence is the possibility of evaluating the similarity of the objects in question according to appropriate metrics. Objects in the applications mentioned above have in common a complex structure that is often that of a tree or a graph. Approximate matching algorithms for trees and graphs provide a variety of similarity measures for these objects.

The HyperText Markup Language (HTML) is the lingua franca for publishing data on the Web. Unfortunately, HTML has been designed for display purposes, the implication being that it is primarily meant for human consumption as opposed to machine consumption. For the WWW to reach its full potential, however, the data should be defined and linked in such a way that it can be used for more effective discovery, automation, integration, and reuse across various applications. Just as people need to agree on the meanings of the words which they employ in their communication, computers need mechanisms for agreeing on the semantics of the data in order to communicate effectively.
The World Wide Web Consortium (W3C), in collaboration with a large number of researchers and industrial partners, is now exploring the possibility of creating a Semantic Web, in which the meaning is made explicit (through the use of meta-data), thereby allowing machines to process and integrate Web resources intelligently. Intuitively, documents in a markup language can generally be represented as a tree, as illustrated in Figure 1.1. Being able to evaluate the similarity between two documents is the basis for search and integration.

[Figure 1.1: HTML Code and its tree representation]

Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing and semantic query processing. Automatic schema matching has become one of the key areas of research in the field of computer science due to the rapidly increasing number of web data sources and E-businesses to integrate. As systems become able to handle more complex databases and applications, their schemas become larger, further increasing the number of matches to be performed. Since data in an HTML or XML document is primarily stored at the leaves or on the periphery of the tree or graph that describes it, we devise an algorithm that takes advantage of this property when comparing two different structures.

Although data may be represented in various ways, text still remains the primary medium for exchanging information on the WWW. This is particularly evident in the domains of literature or linguistics, where data are composed of huge corpora and dictionaries. This is also applicable to computer science, where a large amount of data is stored in linear files. Given the predominantly textual nature of the data residing in the WWW, efficient pattern matching algorithms and compression techniques become necessary to address semantic issues associated with the data.
While the problem of pattern matching pertains to locating a specific pattern inside raw data (a pattern is usually a collection of strings described in some formal language), the aim of data compression is to provide a representation of data in a reduced form in order to save both storage space and transmission time such that there is no loss of information (the compression processes are reversible). Incidentally, both pattern matching and compression techniques apply to the manipulation of texts (word editors), the storage of textual data (text compression) and data retrieval systems (full text search). Additionally, they are basic components used in the implementations of practical software existing under most operating systems. Moreover, they emphasize programming methods that serve as paradigms in other fields of computer science (system or software design). Finally, they also play an important role in theoretical computer science by providing challenging problems. In this thesis, we specifically focus on the problem of pattern matching.

Pattern matching of textual data arises in several important commercial applications. Sequential pattern mining, i.e., the mining of frequent subsequences as patterns in a sequence database, is an important data mining task with broad applications. Interestingly, this is also the case in molecular biology, because biological molecules can often be approximated as sequences of nucleotides or amino acids. Furthermore, the amount of available data in these fields tends to double every eighteen months, thereby underlining the necessity of efficient pattern matching algorithms even if the speed and storage capacity of computers increase regularly. Moreover, pattern matching is also a key part of various other applications, including analysis of consumer behavior, web access patterns, process analysis of scientific experiments and prediction of natural disasters, to mention just a few.
Incidentally, ordered, labeled trees are often deployed in pattern matching. Ordered, labeled trees are trees in which each vertex has a label and the left-to-right order of its children (if any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology, programming compilation, and natural language processing. Many of the applications involve comparing trees or retrieving/extracting information from a repository of trees. Examples include classification of unknown patterns, analysis of newly sequenced RNA structures, semantic taxonomy for dictionary definitions, generation of interpreters for nonprocedural programming languages, and automatic error recovery and correction for programming languages.

Multiple sequence alignment can be seen as a generalization of pairwise sequence alignment: instead of aligning two sequences, k sequences are aligned simultaneously, where k is any number greater than 2. Multiple sequence alignment is particularly useful in the field of bioinformatics because it allows biologists to extract and represent biologically important but faintly/widely dispersed sequence similarities, giving them hints about the evolutionary history of certain sequences. The problem of multiple sequence alignment has been shown to be NP-Complete in general and is therefore not likely to be solved in polynomial time. For NP-Complete problems, there is (almost) no hope that there is an algorithm that is not exponential in its complexity. The dynamic programming algorithm for aligning N sequences of length L has a time complexity of Θ(2^N · L^N) and a space complexity of Θ(L^N). It turns out that not all cells of the cube (for the 3-sequence case) and, in general, of the N-dimensional matrix need to be computed, and the order of computation can also be heuristically optimized.
In addition to its uses in bio-informatics, multiple sequence alignment could probably also be used to generate an optimal, or in most cases a valid, schedule given certain delay constraints in a client-server architecture. We touch on this possible application of multiple sequence alignment in the future work section in Chapter 5.

Approximate matching of regular expressions forms the basis of many search procedures in various applications. Searching for a pattern in a text file is a very common operation in many applications, ranging from conventional applications such as text editors to more sophisticated applications in molecular biology. Incidentally, schema matching is also a basic problem in many database application domains such as data integration, E-business, data warehousing and semantic query processing. Schema matching is usually performed manually through some form of graphical interface. Such methods have the disadvantages of being cumbersome, time-consuming and error-prone. By combining existing schema matching techniques with customized approximate pattern matching algorithms for regular expressions, one can automate an otherwise largely manual operation. Rahm et al. [19] survey a number of approaches to automatic schema matching.

1.1 Summary of Contributions

This work focusses on problems of approximate pattern matching in complex structures such as trees, graphs and regular expressions. The contributions of this thesis are two-fold.

• We present new algorithms for the approximate matching of trees (ordered and unordered) and acyclic graphs based on edit distance measures under the degree-1 constraint, the implication being that the relevant information is located at the leaves of a tree or at the periphery of a graph. Under the degree-1 constraint, edit operations can be performed only on vertices whose degree is less than or equal to 1.
  The ordered and unordered tree algorithms have a worst-case execution time of O(|T1|·|T2|·k² log k) and the algorithm for acyclic graphs has a worst-case execution time of O(|T1|²·|T2|²·k² log k). Our work on approximate matching of trees and acyclic graphs under the degree-1 constraint has been submitted for publication to a well-known journal and we are currently awaiting the results of the review.

• We consider the problem of approximate matching of a string with a special type of regular expression where the Kleene closure (*) is only allowed to be bound to a single character. In this regard, we present a new algorithm which exploits the special properties of such a regular expression, thereby enabling us to reduce the approximate regular expression matching problem to an approximate string matching problem. Our algorithm RPM for approximate matching of a string S with a class 2 regular expression R runs in O(|S|³) time and space in the worst case. Our performance evaluation indicates that our proposed techniques outperform an existing well-known algorithm, Myers [45], for approximate regular expression matching in terms of execution time. This may be primarily attributed to the approximate string matching nature of our algorithm, which makes use of simple arithmetic operations, whereas Myers needs to first construct a regular expression edit graph before it starts traversing the edges of this graph to provide the desired result.

1.2 Organization of the Thesis

The remainder of the thesis is organized as follows:

• Chapter 2 provides an overview of existing works in the field of approximate matching of strings, trees, graphs and regular expressions.

• Chapter 3 presents our algorithms for approximate matching of ordered and unordered trees and acyclic graphs under the degree-1 constraint.
• Chapter 4 discusses the work we have done in the area of string to regular expression matching in a special case and compares our algorithm with an existing string to regular expression approximate matching algorithm.

• Finally, we conclude in Chapter 5 with directions for future work.

CHAPTER 2
Related Work and Background Information

This chapter describes existing works in the field of approximate matching of strings, trees, graphs and regular expressions. We begin our discussion by first looking at the problem of approximate matching of strings in Section 2.1. We present the notion of edit operations on strings and of edit distance with respect to strings, and introduce the edit distance matrix, a visualization which is central to many exact and approximate matching algorithms. We then present a brief survey of some of the key exact and approximate matching algorithms on strings. In Section 2.2, we discuss the issue of approximate matching in more complex structures like trees and graphs. This is a relatively new area of research in approximate matching when compared to strings. We then discuss some of the existing work done on different types of tree and graph structures. In Section 2.3, we give a brief overview of the work done on approximate matching of strings with regular expressions.

2.1 Approximate Matching in Strings

The string matching problem is the most studied problem in algorithmics on words, and there are many algorithms for solving this problem efficiently. In practical pattern-matching applications, exact matching is not always relevant. It is often more important to find objects that match a given pattern in a reasonably approximate manner, i.e., allowing some errors. Approximate string matching consists of finding all approximate occurrences of a pattern x in a text y. There exist a number of methods to compare two strings or sequences.
One of the most common ways is the notion of similarity between two strings. A similarity measure is a function that associates a numeric value with a pair of sequences, with the idea that a higher value indicates greater similarity. A similarity measure can have both positive and negative values depending on the properties of the scoring function used. The notion of distance is somewhat dual to similarity. It treats sequences as points in a metric space. A distance measure is a function that also associates a numeric value with a pair of sequences, but with the idea that the larger the distance, the smaller the similarity, and vice versa. Distance measures usually satisfy the mathematical axioms of a metric. In particular, distance values are never negative.

Approximate occurrences of x are segments of y that are close to x according to a specific distance: their distance to x must not be greater than a given integer k. There are two standard distances: the Hamming distance and the edit distance.

The Hamming distance, H, is defined only for strings of the same length. For two strings S and T, H(S, T) is the number of places in which the two strings differ, i.e., the places where they have different characters. The Hamming distance is related to the number of mismatches between the pattern and its approximate occurrences. This problem is also called approximate string matching with k mismatches. For example, if S = aababb and T = bbbaba, then H(S, T) = 3, corresponding to the mismatches in positions 1, 2 and 6 of the strings.

The edit distance between two strings is the minimal number of edit operations (insert, delete and change) needed to transform one string into the other. It is defined for strings of arbitrary length. For example, if S = aab and T = accab, the minimum number of edit operations required to transform S to T is 2, corresponding to the insertion of the two c's (equivalently, transforming T to S requires deleting them).
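To make the Hamming distance and the k-mismatches problem described above concrete, the following is a minimal sketch in Python. The function names (`hamming_distance`, `mismatch_positions`, `k_mismatch_occurrences`) are ours and purely illustrative; the search simply slides the pattern over the text and is not one of the optimized algorithms from the literature.

```python
def hamming_distance(s: str, t: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance is defined only for strings of equal length")
    return sum(a != b for a, b in zip(s, t))


def mismatch_positions(s: str, t: str) -> list[int]:
    """1-based positions of the mismatches, following the numbering used in the text."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return [i for i, (a, b) in enumerate(zip(s, t), start=1) if a != b]


def k_mismatch_occurrences(x: str, y: str, k: int) -> list[int]:
    """0-based start positions of segments of y within Hamming distance k of x.

    Naive sliding window, O(|x| * |y|) character comparisons.
    """
    m = len(x)
    return [
        start
        for start in range(len(y) - m + 1)
        if hamming_distance(x, y[start:start + m]) <= k
    ]


if __name__ == "__main__":
    print(hamming_distance("aababb", "bbbaba"))
    print(mismatch_positions("aababb", "bbbaba"))
    print(k_mismatch_occurrences("ab", "aabb", 1))
```

With k = 0 the last function degenerates to naive exact matching, which is one way to see exact matching as a special case of the approximate problem.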
In general, more than one sequence of edit operations can transform one string into the other. If each operation is assigned the same cost, the edit distance is also known as the Levenshtein distance. We shall describe in detail the properties of the edit distance between strings later in this section.

The longest common subsequence (LCS) problem is a particular case of the edit distance problem in strings. Given two strings S and T of length n and m respectively, if l is the length of a longest common subsequence LCS(S, T), then one can transform S to T by first deleting n − l characters of S (all but those of a longest common subsequence) and then inserting m − l symbols to get T. For example, if S = abbcc and T = acb with n = 5 and m = 3, then LCS(S, T) = ab or ac with l = 2. Taking LCS(S, T) = ab, we first delete the 3 (5 − 2) non-LCS characters bcc from S and then insert the 1 (3 − 2) non-LCS character c from T in the correct position to obtain T from S.

S: a b c a a c a c _ a
T: a _ c a a c a c d a

Figure 2.1: An alignment for S = abcaacaca and T = acaacacda

Another way of representing the differences (or similarity) between two strings (or sequences), which is one of the central concepts in bio-informatics, is the notion of alignment. An alignment is a mutual arrangement of two sequences that exhibits where the two sequences are similar and where they differ. An optimal alignment is understandably one that exhibits the most correspondences and the fewest differences. The alignment problem is another interesting variation of the edit distance problem, where gaps or 'empty' strings are inserted in each of the two strings such that common characters are matched, whereas 'unmatched' characters can be inserted or deleted. Figure 2.1 depicts a possible alignment for strings S = abcaacaca and T = acaacacda, which corresponds to a deletion of b in S and an insertion of d in T.

Before we move on to the major algorithms on string matching, we shall first present a definition of a string proposed by Crochemore and Rytter (1994) [18].
2.1.1 Definition

Let Σ be an input alphabet, i.e., a finite set of symbols. Elements of Σ are called characters. A string over Σ is defined as a finite sequence of elements of Σ. The length of a string S, |S|, is defined as the number of elements (with repetitions) in the string S. Therefore the length of abbab is 5. The ith element of the string S is denoted by S[i], and i is its position in S. For example, the 5th character in the string pattern is S[5] = 'e'. A substring of S, denoted by S[i . . . j], is the sequence of elements S[i]S[i + 1] . . . S[j] in S. For example, pat is a substring of pattern. A string Sseq is a subsequence of S if Sseq can be obtained from S by removing zero or more (not necessarily adjacent) letters from it. For example, pen is a valid subsequence of pattern. Intuitively, Sseq is a subsequence of S if Sseq = S[i1]S[i2] . . . S[im], where i1, i2, . . . , im is an increasing sequence of indices in S.

2.1.2 Problem Definition

Given two strings S1 and S2 of length m and n (m ≤ n) respectively, the most basic form of the exact string matching problem is a membership decision problem, i.e., verify whether S1 occurs in S2. The output is a boolean value: S1 either occurs in S2 or it does not. As mentioned earlier in Section 2.1, very often this is not very helpful. A more interesting scheme would be to see how far S1 is from S2. The Approximate String Matching problem is defined as the problem of transforming S1 to S2 via a series of edit operations. We now discuss edit operations on strings.

Edit Operations in Strings

When two strings S1 and S2 do not exactly match, there exist errors corresponding to the differences between the two strings. Let ∅ represent an empty structure. An edit operation can be represented as a pair (u, v) ≠ (∅, ∅), sometimes written u → v. There are three kinds of edit operations:

1. change: symbols at corresponding positions are different. A change operation is represented by (u, v) where u ≠ ∅ and v ≠ ∅.
2. insert: a symbol of S2 is missing in S1 at a corresponding position. An insert operation is represented by (u, v) where u = ∅ and v ≠ ∅.
3. delete: a symbol of S1 is missing in S2 at a corresponding position. A delete operation is represented by (u, v) where u ≠ ∅ and v = ∅.

Edit Distance between Strings

One would always look for the best way to transform S1 to S2, i.e., the minimum number of differences between S1 and S2. This can be translated as the smallest number of edit operations (change, insertion and deletion) needed to transform S1 to S2. This is called the edit distance between S1 and S2 and is denoted by δ(S1, S2). Three properties are satisfied at all times. They are as follows:

• δ(S1, S2) = 0 iff S1 = S2: if both strings are the same, the minimum distance between them is 0, as no edit operation is required to transform one into the other.

• δ(S1, S2) = δ(S2, S1): a fundamental property of the edit distance is that it is symmetric. This comes from the duality between the deletion and the insertion operations. A deletion of a character a in S1 in order to get S2 corresponds to an insertion of a in S2 to get S1.

• δ(S1, S2) ≤ δ(S1, S3) + δ(S3, S2) (triangle inequality).

Central to many of the algorithms based on the edit distance scheme is the edit distance matrix. Assume the strings S1 and S2 are of fixed length m and n such that n ≥ m. Each cell in the edit distance matrix, EDIT, has a value EDIT[i, j] equal to δ(S1[1 . . . i], S2[1 . . . j]), with 0 ≤ i ≤ m and 0 ≤ j ≤ n. With unit costs, the boundary values of EDIT are defined as follows: EDIT[0, j] = j for 0 ≤ j ≤ n, and EDIT[i, 0] = i for 0 ≤ i ≤ m. The rest of the elements of EDIT, for 1 ≤ i ≤ m and 1 ≤ j ≤ n, can be computed using the simple formula

$$
EDIT[i, j] = \min \begin{cases}
EDIT[i-1, j] + cost(delete), \\
EDIT[i, j-1] + cost(insert), \\
EDIT[i-1, j-1] + cost(change)
\end{cases}
$$

where cost(change) is taken to be 0 when S1[i] = S2[j]. The formula reflects the three operations of deletion, insertion and change, in that order.
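The recurrence above translates almost directly into a dynamic program. The following Python sketch is one possible rendering under stated assumptions: the function names and the keyword parameters `cost_delete`, `cost_insert` and `cost_change` are ours, the default costs are all 1, and a change between identical characters is assumed to cost 0. Python strings are 0-indexed, so `s1[i - 1]` plays the role of S1[i].

```python
def edit_distance_matrix(s1: str, s2: str,
                         cost_delete: int = 1,
                         cost_insert: int = 1,
                         cost_change: int = 1) -> list[list[int]]:
    """Return the (m+1) x (n+1) matrix with edit[i][j] = distance(s1[:i], s2[:j])."""
    m, n = len(s1), len(s2)
    edit = [[0] * (n + 1) for _ in range(m + 1)]

    # Boundary values: transform a prefix into, or from, the empty string.
    for i in range(1, m + 1):
        edit[i][0] = edit[i - 1][0] + cost_delete
    for j in range(1, n + 1):
        edit[0][j] = edit[0][j - 1] + cost_insert

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Assumption: matching characters incur no change cost.
            change = 0 if s1[i - 1] == s2[j - 1] else cost_change
            edit[i][j] = min(
                edit[i - 1][j] + cost_delete,   # delete s1[i]
                edit[i][j - 1] + cost_insert,   # insert s2[j]
                edit[i - 1][j - 1] + change,    # change s1[i] into s2[j]
            )
    return edit


def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance, read from the final cell of the matrix."""
    return edit_distance_matrix(s1, s2)[len(s1)][len(s2)]


if __name__ == "__main__":
    print(edit_distance("abba", "aabaca"))
    print(edit_distance("aab", "accab"))
```

The sketch fills the whole matrix, so it uses O(mn) time and space. Keeping the full matrix is a deliberate choice here, since it allows a sequence of edit operations to be recovered by tracing back from the final cell.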
If we consider the matrix as a grid graph in which each vertex (i − 1, j − 1) is connected to three other vertices, (i − 1, j), (i, j − 1) and (i, j), when i ≤ m and j ≤ n, then the edit distance between strings S1 and S2 equals the length of a least-weight path in this graph from the source (0, 0) to the sink (m, n). An edge from (i − 1, j − 1) to (i, j − 1) represents a deletion edge and has a discrete integer cost assigned to it, represented by cost(delete). An edge from (i − 1, j − 1) to (i − 1, j) represents an insertion edge and has a discrete integer cost assigned to it, represented by cost(insert). An edge from (i − 1, j − 1) to (i, j) represents a replacement edge and has cost cost(change) = δ(S1[i], S2[j]).

              S1:  ∅   a   b   b   a
    S2:  ∅         0   1   2   3   4
         a         1   0   1   2   3
         a         2   1   1   2   2
         b         3   2   1   1   2
         a         4   3   2   2   1
         c         5   4   3   3   2
         a         6   5   4   4   3

Figure 2.2: Edit Distance Matrix for S1 = abba and S2 = aabaca

Figure 2.2 shows the edit distance matrix for strings S1 = abba and S2 = aabaca; columns correspond to prefixes of S1 and rows to prefixes of S2. We assume each edit operation (delete, insert and change) has a cost of 1. The minimum edit distance δ(S1, S2) is given by EDIT[4][6] = 3. This reflects the change operation b → a and the insertion of c and a. There are several paths from source to sink in this case. One possible path is EDIT[0][0] → EDIT[1][1] → EDIT[2][2] → EDIT[3][3] → EDIT[4][4] → EDIT[4][5] → EDIT[4][6].

2.1.3 Algorithms

The naive exact matching algorithm for strings locates all occurrences in time O(nm). Hashing, however, provides a simple method that avoids the quadratic number of symbol comparisons in most practical situations and that runs in linear time under reasonable probabilistic assumptions (Harrison (1971)[25] and Karp and Rabin (1987)[31]). The two most famous exact string matching algorithms were devised by Morris and Pratt (1970)[30] and Boyer and Moore (1977)[9]. The first linear-time string-matching algorithm was developed by Morris and Pratt (1970). It was improved by Knuth, Morris, and Pratt (1976)[34].
The algorithm’s preprocessing phase runs in O(m) time and space, and its searching phase runs in O(n + m) time (independent of the alphabet size). Boyer and Moore’s algorithm (1977) is considered the most efficient string-matching algorithm in usual applications. It consists of a preprocessing phase that runs in O(m + σ) time and space, where σ is the size of the alphabet, and a searching phase that runs in O(mn) time. A simplified version of it (or the entire algorithm) is often implemented in text editors for the “search” and “substitute” commands. Several variants of Boyer and Moore’s algorithm avoid the quadratic behavior when searching for all occurrences of the pattern. The most efficient solutions in terms of the number of symbol comparisons have been designed by Apostolico and Giancarlo (1986)[5] and Colussi (1994)[15].

Although the idea of approximate pattern matching is ubiquitous in information processing, it first clearly appeared in the early work on approximate string matching. Wagner and Fischer (1974)[65] define an edit distance between two strings and an algorithm for its computation. The distance between two strings is given by the minimum number of operations (insertion of a letter, deletion of a letter or change of a letter) required to transform one string into the other. The authors present an algorithm that runs in O(nm) time, in which the value of each cell of the matrix is the minimum over three candidates: a delete operation applied to the cell above it, an insert operation applied to the cell to its left, and a change operation applied to the cell diagonally above and to its left. The edit distance between the two strings is found in the final cell of the two-dimensional matrix. This is the unit edit distance. Different operations can be assigned different weights.

The notion of a longest common subsequence (LCS) of two strings is widely used to compare files.
The diff command of the UNIX system implements an algorithm based on this notion, in which the lines of the files are considered as symbols. Informally, the result of a comparison gives the minimum number of operations (insert a symbol, or delete a symbol) needed to transform one string into the other. The comparison of molecular sequences is basically done with a closely related concept, the alignment of strings, which consists of aligning their symbols on vertical lines. This is related to an edit distance, called the Levenshtein distance, with the additional operation of substitution and with weights associated with the operations. Hirschberg (1975)[27] presents the computation of the LCS in linear space. Aho, Hirschberg and Ullman (1976)[3] show that unless a bound on the total number of distinct symbols is assumed, every solution to the problem can consume an amount of time that is proportional to the product of the lengths of the two strings.

2.1.4 Timeline

In this section we present a timeline of some of the major exact and approximate string matching algorithms.

[30] 1970 → [25] 1971 → [65] 1974 → [27] 1975 → [3] 1976 → [9, 34] 1977 → [5] 1986 → [31] 1987 → [15] 1994

2.2 Approximate Matching in Trees and Graphs

Approximate pattern matching of complex structures such as trees and graphs is a primitive operation essential to applications in information retrieval, information integration and mediation, and in many such domains that require evaluating or characterizing the similarity between structured and complex objects such as HTML documents, molecular compounds and XML data.

2.2.1 Definition

An undirected graph G is a pair (V, E), where V is a finite set of vertices and E, the set of edges, is a binary relation on V that consists of unordered pairs of vertices rather than ordered pairs as in the case of a directed graph. General graphs can have cycles, i.e., paths in the structure that lead from a vertex back to itself.
However, we consider a special set of undirected graphs, called acyclic graphs, which as the name suggests do not have any cycles. A tree is a special kind of graph: a connected graph which contains no cycles. We can visualize a tree by drawing it with a root at the top and the vertices below it leading down to the leaves at the lowest level. If the vertices are placed on levels, higher-level vertices are referred to as the parents of the vertices directly below them, while the lower vertices are similarly referred to as their children. A tree with n vertices has n − 1 edges. Although it may not be part of the widest definition of a tree, a common constraint is that no vertex can have more than one parent. Moreover, for some applications, it is necessary to consider a vertex’s child vertices to be an ordered list instead of merely a set. As a data structure in computer programs, trees are used in everything from B-trees in databases and file systems, to game trees in game theory, to syntax trees in human or computer languages.

An ordered tree is a tree in which the relative order of the subtrees meeting at each vertex must be preserved, i.e., the left-to-right order of the children of every vertex matters.

[Figure 2.3: Ordered Tree Example]

Figure 2.3 shows an ordered tree where each vertex is labeled with a character and the vertex’s postorder number is shown adjacent to the vertex.

An unordered tree is a tree in which the relative order of the subtrees meeting at each vertex need not be preserved, i.e., the left-to-right order of the children of every vertex does not matter.

[Figure 2.4: Unordered Trees Example, showing trees T1 and T2]

Figure 2.4 shows two vertex-labeled trees which are exactly the same if they are considered to be unordered. The left-to-right ordering of the children does not matter in the unordered case.

Given a tree, it is usually convenient to use a numbering scheme to refer to the vertices of the tree.
For an ordered tree T, the left-to-right postorder numbering or the left-to-right preorder numbering is often used to number the vertices of T from 1 to |T|, the size of the tree T. For an unordered tree, we can fix an arbitrary order for the children of each vertex in the tree and then use left-to-right postorder numbering or left-to-right preorder numbering. Suppose that we have a numbering for each tree. Let t[i] be the ith vertex of tree T in the given numbering. We use T[i] to denote the subtree rooted at t[i]. Let t[i_1], t[i_2], . . . , t[i_{n_i}] be the children of t[i]. The interesting property of the postorder numbering scheme is that the children of a parent are always assigned a number lower than that of the parent. This fits in perfectly, as in many dynamic programming algorithms it is crucial for the children to be processed before the parent.

Edit Operations in Trees and Graphs

There are three kinds of edit operations in trees and graphs:

1. relabel: Relabeling a vertex x means changing the label on x.
2. delete: Deleting a vertex x means making the neighbors of x (except an arbitrarily specified neighbor x′) become the neighbors of x′ and then removing x.
3. insert: Insert is the inverse of delete. This means that inserting x as a neighbor of x′ makes a subset of the current neighbors of x′ become the neighbors of x.

Following [72], let ∅ represent an empty structure. An edit operation can be represented as a pair (u, v) ≠ (∅, ∅), sometimes written u → v. u → v is a relabeling operation if u ≠ ∅ and v ≠ ∅; a delete operation if u ≠ ∅ and v = ∅; an insert operation if u = ∅ and v ≠ ∅.

[Figure 2.5: Delete and Insert Edit operations on Trees. Panels: (a) Delete vertex, (b) Insert vertex]

The neighbors of a character s in a string are the characters s′ and s″ on either side of s, where s′ and/or s″ may be empty. A neighbor of a vertex x in a tree or a graph is any vertex x′ that is directly connected to x by a single edge.
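As an illustration of the delete operation defined above, the following Python sketch stores an undirected acyclic graph as a dictionary mapping each vertex to the set of its neighbors. The representation, the function name delete_vertex and the parameter name keep (the arbitrarily specified neighbor that inherits the other neighbors) are assumptions made for this example only.

```python
def delete_vertex(adj: dict, x, keep=None) -> None:
    """Delete x in place; the other neighbors of x become neighbors of `keep`."""
    neighbors = adj[x]
    if keep is None:
        # Only an isolated vertex can be deleted without naming a neighbor.
        if neighbors:
            raise ValueError("a neighbor to keep must be specified")
    elif keep not in neighbors:
        raise ValueError("keep must be a neighbor of x")

    for y in neighbors:
        adj[y].discard(x)          # remove the edge between y and x
        if y != keep:
            adj[y].add(keep)       # reattach y to the specified neighbor
            adj[keep].add(y)
    del adj[x]
```

For the path a - b - c, deleting b with keep="a" leaves the single edge a - c. Inserting a vertex would reverse these steps, moving a chosen subset of the neighbors of one vertex onto the new vertex.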
Although we are only concerned with inserts and deletes for our algorithms, we describe the relabeling operation for the sake of completeness.

Edit Distance between Trees/Graphs

Given two graphs G1 and G2, there are several methods of performing approximate pattern matching between the two structures. One way is to measure the edit distance, i.e., the minimum cost of transforming one structure into the other, quite often through a series of edit operations: deletion of a vertex in G1, insertion of a vertex in G2, and relabeling of a vertex in G2 with the label of a vertex in G1. The edit distance between any two graphs G1 and G2 is denoted by δ(G1, G2). Each edit operation can be assigned a numeric cost (not necessarily distinct). The edit distance is in fact a distance metric, as it satisfies the basic rules of symmetry and triangle inequality.

2.2.2 Problem Definition

Approximate tree matching is a generalization of approximate string matching. Given two trees, we view one tree as the pattern tree and the other as the data tree. The idea is to match the pattern tree approximately to the data tree. Given two trees T1 and T2, the task of transforming T1 to T2, or T2 to T1, via a sequence of edit operations is termed the problem of approximate pattern matching in trees.

2.2.3 Algorithms

Several definitions and algorithms have been given for the approximate matching of graphs and trees. They correspond to different data structures (ordered trees, unordered trees, graphs, etc.), different notions of similarity or distance, and different constraints. Tai (1979) [60] was one of the first authors to work on the topic of approximate pattern matching of trees. He gave the definition of the edit distance between ordered, labeled trees and the first non-exponential algorithm to compute it. Tai used a pre-order numbering scheme to number the trees.
The convenient aspect of this notation is that for any i, 1 ≤ i ≤ |T|, the vertices from T[1] to T[i] form a tree rooted at T[1]. He followed the same approach as in sequence editing and came up with an algorithm that runs in O(|T1| · |T2| · depth(T1)² · depth(T2)²) time and space.

Lu (1979) [40] presents another algorithm on ordered trees based on the edit operations presented by Tai. Let t1[i_1], t1[i_2], . . . , t1[i_{n_i}] be the children of t1[i] and t2[j_1], t2[j_2], . . . , t2[j_{n_j}] be the children of t2[j]. Lu considers the following three cases:

(1) t1[i] is deleted: in this case the distance is obtained by matching T2[j] to one of the subtrees of t1[i] and then deleting all the rest of the subtrees;
(2) t2[j] is inserted: in this case the distance is obtained by matching T1[i] to one of the subtrees of t2[j] and then inserting all the rest of the subtrees;
(3) t1[i] matches t2[j]: in this case, consider the subtrees t1[i_1], t1[i_2], . . . , t1[i_{n_i}] and t2[j_1], t2[j_2], . . . , t2[j_{n_j}] as two sequences and each individual subtree as a whole entity.

He then uses the sequence edit distance to determine the distance between t1[i_1], t1[i_2], . . . , t1[i_{n_i}] and t2[j_1], t2[j_2], . . . , t2[j_{n_j}]. This algorithm considers each subtree as a whole entity. It does not allow one subtree of T1 to map to more than one subtree of T2.

Kilpelainen and Mannila (1995)[32] introduced the tree inclusion problem: given a pattern tree P and a target tree T, tree inclusion asks whether it is possible to obtain P by strictly deleting vertices of T. Both ordered and unordered trees are considered. Since there may be exponentially many ordered embeddings of P to T, they assume that P and T have the same label; their algorithm then tries to embed P into T by embedding the subtrees of P as deeply and as far to the left as possible.
The time complexity of their algorithm is O(|T1| · |T2|), and they showed that the unordered inclusion problem is NP-Complete.

Shasha and Zhang (1990, 1991) [57, 71] define an edit distance for ordered labeled trees and propose algorithms for its computation. Jiang, Wang and Zhang (1994)[29] address the problem of approximate matching in ordered labeled trees by inserting empty vertices to align the structure of ordered labeled trees. The distance is defined as the sum of the scores of the opposing labels after the structurally similar graphs are overlaid. Shasha et al. (1994) [56] propose several enumerative algorithms for the approximate matching of unordered labeled trees. The algorithms are based on probabilistic hill climbing and bipartite graph matching [10, 22]. Luccio and Pagli (1991) [41] present algorithms for the approximate matching of H-ary trees and arbitrary ordered trees. Wang, Zhang and Chirn (1994) [67] define an edit distance for labeled graphs and describe an algorithm, which they call inexact graph matching, in terms of finding a minimal transformation cost between the graphs. Vilares, Ribadas and Grana (2001) [63] present a proposal intended to demonstrate the applicability of tabulation techniques for detecting approximately common patterns when dealing with structures sharing some common parts, based on approximate pattern matching of two ordered labeled trees. Finally, Zhang, Wang and Shasha (1995) [72] introduce the notion of the degree-2 constraint to define an edit distance between undirected acyclic graphs and propose algorithms for its computation.

2.2.4 Timeline

In this section, we present a timeline of some of the major exact and approximate tree and acyclic graph matching algorithms.
[60, 40] 1979 → [22] 1987 → [57] 1990 → [71, 41] 1991 → [29, 56, 67] 1994 → [32, 10, 72] 1995 → [63] 2001

2.3 Approximate Matching of Strings and Regular Expressions

Regular expressions and finite automata are central concepts in automata and formal language theory. These concepts are also crucial in the study of string pattern matching algorithms. Approximate matching of regular expressions forms the basis of many search procedures in various applications. Searching for a pattern in a text file is a very common operation in many applications, ranging from text editors and databases to applications in molecular biology.

2.3.1 Definition

Following [1], regular expressions and the strings they match can be defined recursively as follows:

1. The characters |, (, ) and * are metacharacters.
2. A non-metacharacter a is a regular expression that matches the string a.
3. If r1 and r2 are regular expressions, then (r1|r2) is a regular expression that matches any string matched by either r1 or r2.
4. If r1 and r2 are regular expressions, then (r1)(r2) is a regular expression that matches any string of the form xy, where r1 matches x and r2 matches y.
5. If r is a regular expression, then (r)* is a regular expression that matches any string of the form x1x2 . . . xn, n ≥ 0, where r matches xi for 1 ≤ i ≤ n. (r)* also matches the empty string, represented by ε. * is also known as the Kleene Closure operator.
6. If r is a regular expression, then (r) is a regular expression that matches the same strings as r.

The notation of regular expressions arises naturally from the mathematical result of Kleene [33] that characterizes the regular sets as the smallest class of sets of strings which contains all finite sets of strings and which is closed under the operations of union, concatenation and “Kleene Closure”.
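The constructs defined above can be tried out with Python's standard re module, whose syntax includes these metacharacters among others. The snippet is only an illustration: the helper name matches is an assumption made for the example, and re.fullmatch is used so that the expression must match the entire string rather than a part of it.

```python
import re


def matches(regex: str, s: str) -> bool:
    """Return True if the whole string s is matched by the regular expression."""
    return re.fullmatch(regex, s) is not None


# Union: (r1|r2) matches any string matched by r1 or by r2.
print(matches("(a|b)", "b"))
# Concatenation: (r1)(r2) matches xy where r1 matches x and r2 matches y.
print(matches("(a)(b)", "ab"))
# Kleene Closure: (r)* matches zero or more repetitions, including the empty string.
print(matches("(a|b)*", "abba"), matches("(a|b)*", ""))
# A string outside the language is rejected.
print(matches("(a)(b)", "ba"))
```

Exact membership of this kind is the boolean question that the approximate matching problem generalizes.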
2.3.2 Problem Definition

Given a string S and a regular expression R, the problem of approximate matching of a string and a regular expression is to find a string SR ∈ L(R), where L(R) is the language defined by the regular expression R, such that the difference (edit distance) between S and SR is the least. Formally,

\[
\Delta(R, S) = \min_{S_R \in L(R)} \delta(S, S_R)
\]

2.3.3 Algorithms

The bulk of the research on regular expressions has been done on checking whether an input string belongs to the language of a regular expression and, in some cases, how “far” the string is from being a member of it. Waterman (1984)[68] reviews several mathematical methods for the comparison of nucleic acid sequences. He discusses the problem of comparing several sequences, which is a slight simplification of the regular expression matching problem. Thompson (1968) [61] describes a regular expression recognition technique in which each character of the text to be searched is examined in sequence against a list of possible current characters. During the examination, a new list of all possible next characters is built. When the end of the current list is reached, the new list becomes the current list, the next character is obtained and the process continues. Wagner (1974) [64] presents an error correction algorithm that acts as a preprocessor of sorts: it accepts a possibly illegal source string and translates it into a guaranteed syntactically legal string by finding the string B in a given regular language L that is “nearest” (in number of edit operations) to the given input string α. Knight et al. (1995) [32] delve into the problem of approximate pattern matching of regular expressions with concave gap penalties and present an O(MP(log M + log² P)) algorithm for its computation, where M and P are the sizes of the input string and the regular expression respectively.
The concave gap penalty scheme is a symbol-independent gap-cost model in which the cost of a gap is solely a function of its length. Myers (1992) [44] presents an O(PN / log N) algorithm, where P is the length of a regular expression R and N is the length of the word A, to determine whether A is in the language denoted by R. The algorithm is based on a log N speedup of the standard O(PN) time simulation of R’s NDFA on A, using a combination of the node-listing and “Four-Russians” [6] paradigms. Eppstein et al. (1993) [21] look into the problems of sequence alignment and the prediction of RNA secondary structure and present a common solution based on a common structure which can be expressed as a system of dynamic programming recurrence equations. Myers et al. (1989)[45] present an algorithm to find a sequence matching a regular expression R whose optimal alignment with A is the highest scoring of all such sequences, in O(MN) time, where M and N are the lengths of A and R respectively.

2.3.4 Timeline

In this section, we present a timeline of some of the popular algorithms for approximate matching of regular expressions.

[61] 1968 → [64] 1974 → [68] 1984 → [45] 1989 → [44] 1992 → [21] 1993 → [32] 1995

CHAPTER 3
Approximate matching of trees and graphs under the degree-1 constraint

Approximate pattern matching of complex structures such as trees and graphs is a primitive operation essential to applications in information retrieval, information integration and mediation, and in many such domains that require evaluating or characterizing the similarity between structured and complex objects such as HTML documents, molecular compounds and XML data. For example, as more and more autonomous organizations produce and exchange XML data, the XML documents interchanged would be prone to spelling errors, syntactic or structural discrepancies as well as other syntactic or semantic differences. RDF [50] descriptions can be represented as acyclic graphs.
The ability to evaluate the similarity between two documents is the basis for search and integration. Approximate pattern matching can also be used in the area of schema matching. Automating schema matching remains one of the challenging tasks in semi-structured data research. Schema matching is a key operation for many applications including data integration, schema integration and semantic query processing.

Approximate pattern matching in trees plays a very important part in Information Extraction techniques. The traditional approach for extracting data from Web sources is to write specialized programs, called wrappers, that identify data of interest and map them to some suitable format, for instance, XML or relational tables. There are several existing approaches to web data extraction. One of the first initiatives for addressing the problem of wrapper generation was the development of languages specially designed to assist users in constructing wrappers. Some of the best known tools to adopt this approach are Minerva [17] and TSIMMIS [24]. Some tools, like W4F[53] and XWRAP[38], rely on the inherent structural features of HTML documents for accomplishing data extraction by converting an HTML document into a parsing tree, a representation that reflects its tag hierarchy. There exist tools, like RAPIER [11] and WHISK[59], which take advantage of Natural Language Processing (NLP) techniques such as filtering, part-of-speech tagging and lexical semantic tagging to build relationships between phrases and sentence elements so that extraction rules can be derived. Other tools rely solely on formatting features that implicitly depict the structure of the pieces of data found, which makes them more suitable for HTML documents. Examples of such tools are WIEN[36] and STALKER[43]. More information on the variety of wrappers available today can be found in the survey of web data extraction tools by Laender et al. [37].
Several tools [8, 46, 26] are available to assist users in tracking when web pages have changed. Liu, Pu and Tang (2000) [39] present WebCQ, a prototype system for large-scale web information monitoring and delivery. The WebCQ system consists of four main components: a change detection robot that discovers and detects changes, a proxy cache service that reduces communication traffic to the original information servers, a personalized presentation tool that highlights changes detected by WebCQ sentinels, and a change notification service that delivers fresh information to the right users at the right time. The change detection and summarization phases make use of a scheme which merges the two documents (before and after the change) by summarizing all the common, new and deleted material in one document, as is done in HTMLDiff and the UNIX diff command [62].

Automatic schema matching has become one of the key areas of research in the field of computer science due to the rapidly increasing number of web data sources and E-businesses to integrate. Most work on schema matching has been motivated by schema integration [47, 58]: given a set of independently developed schemas, construct a global view. Schema matching is also useful in applications being considered for the semantic web [7], such as mapping messages between autonomous agents. A somewhat different scenario is semantic query processing [66, 52], a run-time scenario where a user specifies the output of a query and the system figures out how to produce that output.

A significant amount of work has been done on the comparison of conceptual graphs representing knowledge elements. In [69, 70], the authors address the task of approximate matching of knowledge elements and present an algorithm for their comparison that measures the similarity between two texts represented as conceptual graphs.
Change detection and monitoring techniques for web pages on the internet have been around for some time now and are constantly evolving based on different constraints. There exist commercial tools [8] which inform users when web pages are changed. In [39], the authors present WebCQ, a prototype system for large-scale Web information monitoring and delivery. It is designed to discover and detect changes to web pages efficiently and to provide a personalized notification of what has changed in the web pages of interest and how. In [49] the authors show the feasibility of automatically extracting data from web pages by using approximate matching techniques. This can be applied to generate automatic wrappers or to notify/display web page differences, monitor web page changes, etc. In [51] the authors present an approach which collects a couple of example objects from the user and uses this information to extract new objects from semi-structured data from web sources.

In each of the technologies mentioned above, approximate matching of complex structures such as trees and graphs forms an integral part. There are several methods of performing approximate pattern matching between two or more structures. One way is to measure the edit distance, i.e., the cost of transforming one structure into the other, quite often through a series of edit operations. Depending on the requirements of the application and the type of distance measure required, various constraints can be placed on the calculation of the edit distance. For instance, HTML or XML documents share the property that the actual values carrying the information are most often at the leaves of the tree, while inner vertices represent the structural component of the document. Therefore one could require that they only be modified at the leaves. This is the concept of the degree-1 constraint presented in this chapter.
In this regard, we shall focus on finding the edit distance between two complex structures such as ordered and unordered trees and acyclic graphs, under the degree-1 constraint. Under this constraint, edit operations can only be performed at the leaf level of the tree or at the periphery of a graph.

The work in [72] addresses the problem of comparing connected, undirected, acyclic and labeled graphs (CUAL graphs). In view of the challenge associated with the problem of finding the edit distance between two CUAL graphs, proven to be NP-Complete, they propose a constrained distance metric, called the degree-2 distance, which requires that any node to be inserted or deleted have no more than two neighbors. Their algorithm runs in time O(N1 N2 D²) and in O(N1 N2 D √D log D), where D = min{d1, d2} and di is the maximum degree of Gi. The degree-1 constraint we describe in this text also serves to simplify the problem of finding the edit distance between two CUAL graphs. We argue the relevance, in practical situations, of such a constraint, which requires that edit operations only be performed at the leaf level of a tree or at the periphery of a graph.

We describe the concept of edit distance under the degree-1 constraint in Section 3.1. In Section 3.2, we present three algorithms to evaluate the edit distance between complex structures under the degree-1 constraint. In Section 3.3 we analyze the time complexity of the three algorithms. A simple example is presented in Section 3.4. We then conclude this chapter with a summary in Section 3.5.

3.1 The degree-1 Constraint

There are not only different notions of similarity, distance, and approximate matching corresponding to different data structures, but also different such notions for the same data structure corresponding to different needs.
The respective efficiencies of the algorithms computing the unit weight edit distance, the edit distances defined under the degree-2 constraint, the edit distance we propose under the degree-1 constraint, and others are not comparable, since the notions correspond to different needs for different applications. Their effectiveness can only be discussed in light of the requirements of the application.

The degree of a vertex x in a CUAL structure is defined to be the number of vertices directly connected to x by means of an edge. Since the algorithms presented in this text are primarily concerned with trees and acyclic graphs, the definition of degree does not allow for self loops or multiple edges between two vertices. Before describing the degree-1 constraint we first clarify the notions of edit distance and edit operations.

3.1.1 Edit Distance

Edit distance is defined to be the minimum number of edit operations required to transform one structure into another, be it a string, a tree or a graph. There are three kinds of edit operations in trees and graphs:

1. relabel: Relabeling a vertex x means changing the label on x.
2. delete: Deleting a vertex x means making the neighbors of x (except an arbitrarily specified neighbor x′) become the neighbors of x′ and then removing x.
3. insert: Insert is the inverse of delete. This means that inserting x as a neighbor of x′ makes a subset of the current neighbors of x′ become the neighbors of x.

Following [72], let ∅ represent an empty structure. An edit operation can be represented as a pair (u, v) ≠ (∅, ∅), sometimes written u → v. u → v is a relabeling operation if u ≠ ∅ and v ≠ ∅; a delete operation if u ≠ ∅ and v = ∅; an insert operation if u = ∅ and v ≠ ∅.

The neighbors of a character s in a string are the characters s′ and s″ on either side of s, where s′ and/or s″ may be empty. A neighbor of a vertex x in a tree or a graph is any vertex x′ that is directly connected to x by a single edge.
Although we are only concerned with inserts and deletes for our algorithms, we describe the relabeling operation for the sake of completeness.

Let ST2 be the structure that results from the application of an edit operation u → v to structure ST1; this is written as ST1 ⇒ ST2 via u → v. Let Seq be a sequence seq1, seq2, . . . , seqk of edit operations. Seq transforms a structure ST to ST′ if there is a sequence of structures ST0, ST1, . . . , STk such that ST = ST0, ST′ = STk and STi−1 ⇒ STi via seqi for 1 ≤ i ≤ k. Let γ be the cost function that assigns to each edit operation u → v a non-negative real number γ(u → v). By extension, the cost of the sequence Seq, denoted γ(Seq), is simply the sum of the costs of the constituent edit operations. The distance from ST to ST′, denoted ∆(ST, ST′), is the minimum cost of all sequences of edit operations taking ST to ST′.

3.1.2 The degree-1 concept

In view of the hardness of the problem of finding the minimum edit distance between two graphs, we propose the following degree-1 constraint on the edit operations: a vertex n can be deleted (inserted) only when degree(n) ≤ 1. Intuitively, one can delete (insert) a vertex only at a leaf. This constraint is similar to the constraint presented in [72], which introduces the degree-2 constraint on edit operations, where a vertex n can be deleted (inserted) only when degree(n) ≤ 2; i.e., a vertex n can be deleted only if it is a leaf or has two neighbors. We define the degree-1 distance between ST and ST′, denoted δ1(ST, ST′), to be the minimum cost of all sequences of degree-1 edit operations transforming ST to ST′. Clearly δ1 is a metric.

When comparing two undirected tree structures T1 and T2 under the degree-1 constraint, any one of five possible scenarios can occur:

1. T1 = ∅ AND T2 = ∅: This is the base case where both the trees to be compared are empty, i.e., do not consist of any vertices.
The minimum number of edit operations needed to transform an empty tree T1 into another empty tree T2 is 0. Hence, δ1(T1, T2) = 0.

2. T2 = ∅: When T2 is an empty tree, the cost of transforming T1 to T2 is simply the size of T1, represented by |T1|. Under the degree-1 constraint, inserts and deletes can occur only at vertices that are connected to no more than one vertex. As a result, we start deleting the vertices at the leaf level of the tree T1 and continue until an empty tree is formed. The number of deletions made is the total number of vertices in T1, |T1|. We assume the cost of a delete or insert edit operation is 1. Hence δ1(T1, T2) = |T1|.

3. T1 = ∅: This is the inverse of case 2. In this case we have to insert all the vertices of T2 into the empty tree T1. Hence δ1(T1, T2) = |T2|.

4. label(T1) ≠ label(T2): This case handles the situation where the two trees T1 and T2 are non-empty and the labels of their roots do not match. Since our degree-1 constraint does not allow for replacement of labels, when two labels do not match we have to delete the entire tree T1 and insert the entire tree T2. Hence δ1(T1, T2) = |T1| + |T2|.

5. label(T1) = label(T2): When the two trees T1 and T2 are non-empty and the labels of their roots match, the edit distance under the degree-1 constraint becomes the minimum cost ordered/unordered bipartite matching between the children of T1 and T2. Since the degree-1 algorithm proceeds in a post-order fashion, we can ensure that the children have been processed beforehand. The ordered and unordered bipartite matching problems are discussed further in Section 3.2.2 and Section 3.2.3 respectively. Hence

\[ \delta_1(T_1, T_2) = \min_{i=0}^{|MatchSet|} cost(M_i), \]

where MatchSet is the set of all valid ordered/unordered matchings and Mi ∈ MatchSet.
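Case 2 above can be checked with a short Python sketch that empties a tree using only degree-1 deletions and counts them. The adjacency-dictionary representation and the function name delete_leaf_by_leaf are assumptions for illustration; the input is assumed to be an undirected tree.

```python
def delete_leaf_by_leaf(adjacency):
    """Empty a tree using only degree-1 deletions; return the number of deletions.

    adjacency maps each vertex to the set of its neighbors (undirected tree).
    """
    adj = {v: set(ns) for v, ns in adjacency.items()}  # work on a copy
    deletions = 0
    while adj:
        # Under the degree-1 constraint only vertices with degree <= 1 may be deleted.
        leaf = next(v for v, ns in adj.items() if len(ns) <= 1)
        for neighbor in adj.pop(leaf):
            adj[neighbor].discard(leaf)
        deletions += 1
    return deletions


# A star with centre 2 and leaves 1, 3, 4: four vertices, hence four deletions.
star = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2}}
print(delete_leaf_by_leaf(star))
```

With unit costs, the count returned equals the number of vertices, in line with δ1(T1, ∅) = |T1|.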
In essence, the distance measure under the degree-1 constraint, δ1, now becomes

\[
\delta_1(T_1, T_2) =
\begin{cases}
0 & T_1 = \emptyset \wedge T_2 = \emptyset \\
|T_1| & T_2 = \emptyset \\
|T_2| & T_1 = \emptyset \\
|T_1| + |T_2| & label(T_1) \neq label(T_2) \\
\min_{i=0}^{|MatchSet|} cost(M_i) & label(T_1) = label(T_2)
\end{cases}
\]

where ∅ is the empty tree, |T1| represents the size of the subtree rooted at vertex T1 and |T2| represents the size of the subtree rooted at vertex T2. cost(Mi) is the cost of a bipartite matching Mi ∈ MatchSet between the children of T1 and T2 (see Section 3.2.1). |MatchSet| represents the number of matchings in the set MatchSet.

Given any two labeled, unordered, and acyclic structures ST1 and ST2, we can always transform ST1 to ST2 by applying a sequence of degree-1 edit operations. We first delete all the leaves of ST1. By doing so, new leaves are generated in ST1. We continue deleting the new leaves until no vertex is left in ST1, i.e. until we reach the empty graph ∅. We then insert a vertex x of ST2 into the empty graph. Then we insert the neighbors of x. Next, we insert new neighbors of those neighbors. Each such insertion is a valid degree-1 edit operation. We continue inserting new neighbors until ST2 is formed. The degree-1 metric is thus complete.

3.2 The Algorithms

3.2.1 Preliminaries

Central to the algorithms presented in this section is the concept of maximum weighted (or minimum cost) bipartite matching. The proposal in [54] discusses several matching algorithms for bipartite graphs.

A bipartite graph is a graph GB = (V, E) that has two disjoint sets of vertices L and R such that L ∩ R = ∅, L ∪ R = V (a partition: mutually exclusive and exhaustive), and for all (u, v) ∈ E, u ∈ L and v ∈ R. A matching M in a bipartite graph GB = (V, E) is a subset of E such that there is no u ∈ V, v1 ∈ V, v2 ∈ V with v1 ≠ v2 and either (u, v1) ∈ M and (u, v2) ∈ M, or (v1, u) ∈ M and (v2, u) ∈ M. In other words, no vertex is linked to two other vertices.
An integer cost can be assigned to each edge in a matching, and the sum of the costs of all edges in a matching M is defined to be the cost of the matching, cost(M).

3.2.2 Ordered Tree Algorithm

We define an ordered tree as a tree in which the relative order of the subtrees meeting at each vertex must be preserved, i.e. the left to right order of the children of every vertex matters.

Algorithm

The algorithm, presented in Figure 3.1, takes in two ordered labeled trees T1 and T2 and calculates the minimum edit distance under the degree-1 constraint between any two vertices in T1 and T2, represented by the ordered minimum edit distance matrix MED^O. Line 1 initializes the most basic distance, i.e. the distance between two empty trees. size(Tx[y]) returns the number of vertices in the subtree of Tx rooted at y. Lines 2–3 calculate the cost of deleting the entire subtree rooted at T1[i]. Lines 4–5 calculate the cost of inserting the entire subtree rooted at T2[j]. Next we proceed to calculate the edit distance between two vertices in the two ordered labeled trees. label(Tx[y]) returns the label of the vertex Tx[y]. Lines 8–9 check for the condition when the labels do not match. If the labels of the two vertices being compared, say T1[i] and T2[j], are not the same, the edit distance is simply that of deleting the entire subtree rooted at T1[i] and inserting the entire subtree rooted at T2[j], since under the degree-1 constraint, as mentioned earlier in Section 3.1, edit operations are only permitted on those vertices which have at most one neighbor. As a result we have to start deleting all the vertices at the lowest level of the subtree, which results in new leaf vertices. This goes on recursively until the root has been deleted. We then insert the vertices of T2 in a similar fashion until the whole subtree rooted at T2[j] has been created. Lines 10–11 depict the case when the vertex labels match. If the vertex labels match, the edit distance is the minimum sum of edit distances between the children, i.e. the ordered forest distance. This calculation is given by OFD(i, j), described below. Due to the post-order traversal of the tree we are guaranteed that the children are processed before the parents.

Algorithm OTD
Input: Ordered Labeled Trees T1 and T2
Output: Minimum edit distance matrix, MED^O
 1. MED^O[0][0] = 0
 2. for(i=1; i ≤ SIZE_T1; i++)
 3.     MED^O[i][0] = size(T1[i]);
 4. for(j=1; j ≤ SIZE_T2; j++)
 5.     MED^O[0][j] = size(T2[j]);
 6. for(i=1; i ≤ SIZE_T1; i++)
 7.     for(j=1; j ≤ SIZE_T2; j++)
 8.         if(label(T1[i]) ≠ label(T2[j]))
 9.             MED^O[i][j] = MED^O[i][0] + MED^O[0][j];
10.         else /* Labels match */
11.             MED^O[i][j] = OFD(i,j);
12. return MED^O

Figure 3.1: Ordered Tree Algorithm under the degree-1 constraint

Ordered Minimum Cost Bipartite Matching

Suppose n represents the larger of the number of children of vertex T1[i] and that of vertex T2[j], and m the smaller. As pointed out earlier, a bipartite graph GB = (V1, V2, E) can be thought of as a graph G = (V, E) where the vertices V are partitioned into two disjoint sets V1 and V2 such that the vertices in one set V1 are adjacent only to the vertices in the other set V2. Two vertices being adjacent means that they are connected by an edge. Figure 3.2 shows an example of a bipartite graph GB = (V1, V2, E) with |V1| = 3 and |V2| = 2.

Figure 3.2: Bipartite Graph GB = (V1, V2, E)

A matching M on a graph G = (V, E) is a subset of the edges of G such that each vertex in G is incident with no more than one edge in M. A bipartite matching MB on a bipartite graph GB = (V1, V2, E) is a subset of the edges of GB such that each vertex in V1 and each vertex in V2 is incident with no more than one edge in MB.
Figure 3.3 shows two possible matchings from the bipartite graph in Figure 3.2, MB1 and MB2.

Figure 3.3: (a) Matching MB1 (b) Matching MB2

The numbers assigned to each edge could represent either the cost or the weight of the edge. More often than not we are interested in finding the minimum cost or the maximum weight matching in a graph. If the numbers on the edges are costs then we would choose MB2 (cost = 4) over MB1 (cost = 6), and if the numbers were to represent weights we would choose MB1 (weight = 6) over MB2 (weight = 4).

We first create a bipartite graph with the immediate children of T1[i] on one side and those of T2[j] on the other. The cost assigned to each edge in the bipartite graph is the value of the edit distance between the trees rooted at the two participating vertices. Due to the post-order traversal of the algorithm, we can guarantee that the children are processed before the parent, hence ensuring an accurate value for the edit distance. In view of the ordered nature of the main algorithm, we employ a restricted version of the bipartite matching algorithm. A matching M^O is considered to be a valid Ordered Bipartite Matching iff, for all distinct pairs,

\[ \forall (i_1, j_1) \in M^O,\ (i_2, j_2) \in M^O:\quad i_1 \neq i_2,\ j_1 \neq j_2,\ i_1 < i_2 \leftrightarrow j_1 < j_2 \]

That is, two lines in a matching M^O cannot cross each other. For example, matching MB1 is an invalid ordered bipartite matching and MB2 is a valid ordered bipartite matching. The edit distance is now the minimum total cost over valid ordered bipartite matchings:

\[ \delta_1(T_1[i], T_2[j]) = OFD(i, j) = \min_{i=0}^{|MatchSet^O|} cost(M_i^O) \]

\[ \delta_1(T_1, T_2) = MED^O[SIZE_{T_1}][SIZE_{T_2}] \]

where MED^O is the Minimum Edit Distance matrix, which stores the minimum edit distance between the two subtrees rooted at any two vertices in T1 and T2, and SIZE_T1 and SIZE_T2 represent the number of vertices in the trees rooted at T1 and T2 respectively.
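The ordered case can be sketched in Python as follows. This is not the thesis' implementation: instead of enumerating the set of valid ordered matchings explicitly, it swaps in a sequence-alignment style dynamic program over the two child lists, which under unit insert/delete costs should range over the same non-crossing matchings (a child left unmatched costs the size of its subtree). The Node class and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def size(t):
    """Number of vertices in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def ordered_distance(t1, t2):
    """Degree-1 distance between two non-empty ordered trees (unit costs)."""
    if t1.label != t2.label:
        # Labels differ: delete all of t1, insert all of t2.
        return size(t1) + size(t2)
    a, b = t1.children, t2.children
    # dp[i][j]: cheapest non-crossing matching of the first i children of t1
    # with the first j children of t2.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = dp[i - 1][0] + size(a[i - 1])          # delete child subtree
    for j in range(1, len(b) + 1):
        dp[0][j] = dp[0][j - 1] + size(b[j - 1])          # insert child subtree
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a[i - 1]),            # leave a[i-1] unmatched
                dp[i][j - 1] + size(b[j - 1]),            # leave b[j-1] unmatched
                dp[i - 1][j - 1] + ordered_distance(a[i - 1], b[j - 1]),
            )
    return dp[len(a)][len(b)]


# Two small made-up trees: a(b, c) versus a(c); only the leaf b must go.
print(ordered_distance(Node("a", (Node("b"), Node("c"))), Node("a", (Node("c"),))))
```

The recursion visits children before their parents, which plays the role of the post-order traversal in Algorithm OTD.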
MatchSet^O represents the set of all valid ordered bipartite matchings (M_i^O) possible between T1[i] and T2[j], and |MatchSet^O| represents the number of matchings in MatchSet^O.

An alternative solution is to insert (n − m) empty trees, ∅, in all possible combinations (C(n, m)) as children of T2. We then add the edit distance between every ordered child pair T1[i] and T2[i] where 1 ≤ i ≤ n. The post-order traversal of the tree guarantees that the child vertices are processed before the parent. The sum of the edit distances obtained is the edit distance for that combination. The minimum value of the edit distance over all possible combinations is the desired value of the edit distance between T1[i] and T2[i], δ1(T1[i], T2[i]).

3.2.3 Unordered Tree Algorithm

We define an unordered tree as a tree in which the relative order of the subtrees meeting at each vertex need not be preserved, i.e. the left to right order of the children of every vertex does not matter.

Algorithm

The initial part of the unordered tree algorithm, Figure 3.4, is very similar to that of the ordered tree algorithm. The algorithm takes in two unordered labeled trees T1 and T2 and returns the minimum edit distance matrix, MED^U. Lines 2–3 calculate the cost of deleting the subtrees of T1 and lines 4–5 calculate the cost of inserting the subtrees of T2. Lines 8–9 consider the case when the labels do not match. As in the ordered case, the cost is just that of deleting the subtree rooted at T1[i] and inserting the subtree rooted at T2[j]. Since the trees are unordered, in the case when the labels of the vertices match, the edit distance is now the unordered forest distance, UFD(i, j), which is the maximum weighted (minimum cost) bipartite matching between the two sets of children. In this case, since the tree is unordered, we proceed to find the minimum cost bipartite matching, M^U, between the children of the two matching vertices, where the cost of an edge between two vertices x in T1 and y in T2 is simply δ1(T1[x], T2[y]), which has been calculated at an earlier stage due to the post-order traversal of the tree. Hence, in this case

\[ \delta_1(T_1[i], T_2[j]) = UFD(i, j) = \min_{i=0}^{|MatchSet^U|} cost(M_i^U) \]

\[ \delta_1(T_1, T_2) = MED^U[SIZE_{T_1}][SIZE_{T_2}] \]

where MED^U is the Minimum Edit Distance matrix, which stores the minimum edit distance between the two subtrees rooted at any two vertices in T1 and T2, and SIZE_T1 and SIZE_T2 represent the number of vertices in the trees rooted at T1 and T2 respectively. MatchSet^U represents the set of all valid bipartite matchings (M_i^U) possible between T1[i] and T2[j], and |MatchSet^U| represents the number of matchings in MatchSet^U.

Algorithm UTD
Input: Unordered Labeled Trees T1 and T2
Output: Minimum edit distance matrix, MED^U
 1. MED^U[0][0] = 0
 2. for(i=1; i ≤ SIZE_T1; i++)
 3.     MED^U[i][0] = size(T1[i]);
 4. for(j=1; j ≤ SIZE_T2; j++)
 5.     MED^U[0][j] = size(T2[j]);
 6. for(i=1; i ≤ SIZE_T1; i++)
 7.     for(j=1; j ≤ SIZE_T2; j++)
 8.         if(label(T1[i]) ≠ label(T2[j]))
 9.             MED^U[i][j] = MED^U[i][0] + MED^U[0][j];
10.         else /* Labels match */
11.             MED^U[i][j] = UFD(i,j);
12. return MED^U

Figure 3.4: Unordered Tree Algorithm under the degree-1 constraint

3.2.4 Acyclic Graph Algorithm

An undirected graph G is a pair (V, E), where V is a finite set of vertices and E, the set of edges, is a binary relation on V which consists of unordered pairs of vertices rather than ordered pairs as in the case of a directed graph. General graphs can have cycles, i.e. paths in the structure that lead from a vertex back to itself. However, we consider a special set of undirected graphs, called acyclic graphs, which as the name suggests do not have any cycles.

Algorithm

The algorithm for acyclic graphs, Figure 3.5, makes use of the unordered tree algorithm, as an unordered tree can be considered to be an acyclic graph rooted at a particular vertex. The algorithm takes in two connected, undirected, acyclic and labeled graphs G1 and G2 and returns the minimum edit distance between the two structures. The idea is simple. We first choose a vertex in graph G1 and create an unordered tree by 'pulling' it up to the top. We do the same in G2. This is done in lines 2–5. CreateTreeWithRoot(G, x) reorganizes G and returns an unordered tree with the vertex x in G as its root. We keep track of the minimum edit distance γmin, which is initialized to infinity in line 1. We calculate the edit distance γ between the two most recently created rooted unordered trees in line 6. Lines 7–8 keep track of the minimum edit distance across all possible structures. At the end of processing, the minimum edit distance γmin is returned.

Algorithm AGD
Input: Undirected Acyclic Labeled Graphs G1 and G2
Output: Minimum edit distance, γmin
 1. γmin = ∞; γ = 0
 2. for(i=1; i ≤ SIZE_G1; i++)
 3.     T_G1 = CreateTreeWithRoot(G1, i);
 4.     for(j=1; j ≤ SIZE_G2; j++)
 5.         T_G2 = CreateTreeWithRoot(G2, j);
 6.         γ = UTD(T_G1, T_G2).MED^U[SIZE_G1][SIZE_G2]
 7.         if(γ ≤ γmin)
 8.             γmin = γ
 9. return γmin

Figure 3.5: Acyclic Graph Algorithm under the degree-1 constraint

3.3 Complexity Analysis

We choose to employ the maximum weighted (minimum cost) bipartite matching algorithm presented in [22, 20], which executes in O(mn + m² log m) time, where n and m are the sizes of the two sets of vertices in a bipartite graph. Assuming T1 always has a larger number of child vertices than T2, n represents the number of children of T1 and m represents the number of children of T2. For the sake of simplicity, assume the number of children of any particular vertex in T1 and T2 is bounded by some number k, so that a single matching costs O(k² log k). Since one matching is computed for each of the |T1|·|T2| vertex pairs, the worst-case execution time of finding the minimum edit distance under the degree-1 constraint with the ordered tree algorithm, OTD, is O(|T1|·|T2|·k² log k), where |T1| and |T2| represent the number of vertices, or the size, of T1 and T2 respectively.
The worst-case execution time of finding the minimum edit distance under the degree-1 constraint with the unordered tree algorithm, UTD, is also O(|T1|·|T2|·k² log k). Since the unordered tree algorithm, UTD, is executed for all possible combinations of roots in G1 and G2, i.e. |T1|·|T2| iterations, the algorithm for finding the edit distance under the degree-1 constraint between two acyclic graphs G1 and G2, AGD, has a worst-case execution time of O(|T1|²·|T2|²·k² log k).

3.4 Example

Consider the two trees T1 and T2 in Figure 3.6. Figure 3.7 shows the distance matrix when the trees are considered to be ordered and Figure 3.8 presents the distance matrix when the trees are considered to be unordered.

Figure 3.6: Two Example Trees (T1 rooted at vertex 4, labeled a; T2 rooted at vertex 7, labeled a)

In both matrices, rows are indexed by the vertices of the seven-vertex tree and columns by the vertices of the four-vertex tree; row 0 and column 0 stand for the empty tree ∅.

              ∅    a    c    d    a
              0    1    2    3    4
    ∅   0     0    1    2    1    4
    c   1     1    2    1    2    5
    a   2     1    0    3    2    3
    d   3     1    2    3    0    5
    a   4     1    0    3    2    3
    b   5     1    2    3    2    5
    c   6     3    4    1    4    7
    a   7     7    6    9    8    5

Figure 3.7: Ordered Edit Distance Matrix

              ∅    a    c    d    a
              0    1    2    3    4
    ∅   0     0    1    2    1    4
    c   1     1    2    1    2    5
    a   2     1    0    3    2    3
    d   3     1    2    3    0    5
    a   4     1    0    3    2    3
    b   5     1    2    3    2    5
    c   6     3    4    1    4    7
    a   7     7    6    9    8    3

Figure 3.8: Unordered Edit Distance Matrix

The essential calculations for the ordered and unordered cases are the same, e.g., when the labels are different, the distance is simply the cost of deleting the subtree in T1 and inserting the subtree in T2 in both cases. As mentioned, the difference in calculation between the ordered and unordered cases arises when the labels in contention match. Consider the trees rooted at T1[4] and T2[7]. In the ordered case the best possible ordered matching is M^O = {(2,1),(0,2),(3,3),(0,6)} with cost(M^O) = 5. However, in the unordered case there are two possible sets of matchings to be considered: M1^U = {(2,1),(0,2),(3,3),(0,6)}, cost(M1^U) = 5 and M2^U = {(2,6),(0,1),(3,3),(0,2)}, cost(M2^U) = 3.
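The unordered comparison of the two example roots can be reproduced with a minimal Python sketch. Two things here are assumptions rather than part of the thesis: the tree shapes, which are inferred from the distance matrices and the matchings quoted above, and the brute-force enumeration of pairings, which stands in for the weighted bipartite matching algorithm of [22, 20] and is only practical for a small number of children. The names Node, size and unordered_distance are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def size(t):
    """Number of vertices in the subtree rooted at t (0 for the empty tree)."""
    return 0 if t is None else 1 + sum(size(c) for c in t.children)


def unordered_distance(t1, t2):
    """Degree-1 distance between unordered trees; None is the empty tree."""
    if t1 is None or t2 is None or t1.label != t2.label:
        # Delete everything in t1 and insert everything in t2.
        return size(t1) + size(t2)
    # Pad the shorter child list with empty trees so every child has a partner.
    n = max(len(t1.children), len(t2.children))
    left = list(t1.children) + [None] * (n - len(t1.children))
    right = list(t2.children) + [None] * (n - len(t2.children))
    # Try every pairing of children and keep the cheapest one.
    return min(
        sum(unordered_distance(a, b) for a, b in zip(left, perm))
        for perm in permutations(right)
    )


# Assumed shapes: T1 = a(c(a), d) and T2 = a(c, a, d, c(a, b)).
T1 = Node("a", (Node("c", (Node("a"),)), Node("d")))
T2 = Node("a", (Node("c"), Node("a"), Node("d"), Node("c", (Node("a"), Node("b")))))
print(unordered_distance(T1, T2))  # 3 under the assumed shapes
```

Under these assumed shapes the cheapest pairing matches the subtree c(a) of T1 with the subtree c(a, b) of T2, which corresponds to the pair (2,6) in the second matching.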
Since we are interested in the least possible edit distance between the two structures under the degree-1 constraint, we choose M2^U.

3.5 Summary

In this chapter, we have presented algorithms for comparing two complex structures, such as ordered and unordered trees and acyclic graphs, based on an edit distance metric under the degree-1 constraint, which states that edit operations can be performed only on vertices with degree less than or equal to 1. Simply put, edit operations are only performed at the leaf level of the structure. The ordered and unordered tree algorithms have a worst-case execution time of O(|T1|·|T2|·k² log k) and the algorithm for acyclic graphs has a worst-case execution time of O(|T1|²·|T2|²·k² log k).

CHAPTER 4

Approximate matching of strings and regular expressions

Approximate matching of regular expressions forms the basis of many search procedures in various applications. Several tools in the industry make extensive use of regular expressions. awk [42, 2] is a program that can be used to select particular records in a file and perform operations upon them. sed [55] is a stream editor which can be used to make changes to files or pipes. grep [23] searches the input files for lines containing a match to a given pattern list. As mentioned earlier in the introduction, we can combine existing schema matching techniques with customized approximate pattern matching algorithms for regular expressions to effectively speed up the matching process.

With the rising popularity of bio-informatics and the gargantuan amounts of genetic data that remain to be analyzed, the need for fast and efficient sequence analysis algorithms is paramount. The two most popular sequence analysis algorithms in bio-informatics and bio-medical sciences today are FastA (1985, 1988) [48] and BLAST (1990) [4]. FastA looks at local alignments, i.e. rather than finding the best alignment between two sequences it tries to find paths of regional similarity.
Its alignment may contain gaps, and it uses a strategy which is expected to find the most matches, sacrificing complete sensitivity in order to gain speed. BLAST, short for Basic Local Alignment Search Tool, directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and the analysis of multiple regions of similarity in long DNA sequences. Eppstein et al. (1993) [21] discuss several other algorithms for sequence analysis. Clarke and Cormack (1997) [14] look at the problem of the limited usefulness of current searching techniques (leftmost longest match) when searching structured text formatted with SGML or similar markup languages.

Algorithms for approximate pattern matching of regular expressions are also widely used in information retrieval applications, where typically the pattern or the regular expression is small and the sequence to be matched is large. Regular expression pattern matching algorithms have also proved useful in solving misuse intrusion problems, as highlighted by Kumar and Spafford (1994) [35]. Intrusion signatures are usually specified as a sequence of events and conditions that lead to a break-in. By comparing access sequences with a basic signature of an intrusion, one can determine whether the access sequence is that of an intrusion. Furthermore, based on the type of intrusion, the necessary action can be taken to prevent and report the intrusion. Another interesting use of approximate pattern matching of regular expressions and strings is error correction in text. By specifying some maximum difference measure k, one can find all possible words in the language of the regular expression at a distance of at most k from the original input.
Such techniques are commonly found in online dictionaries and online search engines. By considering XML DTDs as regular trees, we can extend the approximate matching problem of regular expressions and strings to trees. Such algorithms can be used to check whether XML documents conform to their respective DTDs. Further possible applications and algorithms are discussed in Chapter 5.

We briefly touch on the theory of Regular Expressions and Finite State Machines (FSM) in Section 4.2. In Section 4.3 we present a simple string to regular expression pattern matching machine. Next, in Section 4.4 we review an existing algorithm proposed by Myers and Miller for the approximate regular expression matching problem. We then present our algorithm in Section 4.5 and evaluate the performance of our algorithm for class-2 regular expressions against the algorithm by Myers and Miller.

4.1 Problem Definition

Given a sequence S and a regular expression R, the approximate regular expression matching problem is to find a sequence matching R whose optimal alignment with S is the highest scoring of all such sequences, or whose distance from S is the least. In terms of the edit distance, the task is to find a string SR ∈ L(R) such that the distance from S to SR is the least. The edit distance between a string and a regular expression, represented by ∆, can be formally defined as

\[ \Delta(S, R) = \min_{S_R \in L(R)} \{ \delta(S, S_R) \} \]

where δ(S1, S2) gives the edit distance between two strings. E.g. if R = a*b and S = aaabaab, then ∆(S, R) = 1. A closest matching string SR ∈ L(R) is SR = aaaaaab.

4.2 Background Information

4.2.1 Grammars and Languages

Grammars provide a way to define languages by giving a finite set of rules that describe how the valid strings may be constructed. A grammar G consists of: an alphabet Σ of terminal symbols or terminals, a finite set of variables V, a set of rewrite rules or productions P of the form α → β, and a start symbol S (a variable): G = (V, Σ, P, S).
The grammar generates strings in Σ* by applying rewrite rules to the start symbol S until no variables are left. Each time a rule is applied, a new sentential form (a string of variables from V and terminals from Σ) is produced. The language generated by the grammar, L(G), is the set of all strings that may be generated in that way.

4.2.2 Regular Expressions

Regular expressions and finite automata are central concepts in automata and formal language theory. These concepts are also crucial in the study of string pattern matching algorithms. Following [1], regular expressions and the strings they match can be defined recursively as follows:

1. The characters |, (, ) and * are metacharacters.

2. A non-metacharacter a is a regular expression that matches the string a.

3. If r1 and r2 are regular expressions, then (r1|r2) is a regular expression that matches any string matched by either r1 or r2.

4. If r1 and r2 are regular expressions, then (r1)(r2) is a regular expression that matches any string of the form xy, where r1 matches x and r2 matches y.

5. If r is a regular expression, then (r)* is a regular expression that matches any string of the form x1x2...xn, n ≥ 0, where r matches xi for 1 ≤ i ≤ n. (r)* also matches the empty string, represented by ε. * is also known as the Kleene Closure operator.

6. If r is a regular expression, then (r) is a regular expression that matches the same strings as r.

The notation of regular expressions arises naturally from the mathematical result of Kleene [33] that characterizes the regular sets as the smallest class of sets of strings which contains all finite sets of strings and which is closed under the operations of union, concatenation and "Kleene Closure".

Regular expressions arise in practically all kinds of text-manipulation tasks. Searching and search-and-replace are among the more common uses, but regular expressions can also be used to test for certain conditions in a text file or data stream.
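The constructs in the recursive definition above can be tried out with Python's standard re module, used here purely as an illustrative stand-in for a regular expression matcher. The patterns, the test strings and the helper name matches are made-up examples.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None


# Alternation (r1|r2): matches whatever either operand matches.
print(matches(r"(a|b)", "a"), matches(r"(a|b)", "c"))
# Concatenation (r1)(r2): a match of r1 followed by a match of r2.
print(matches(r"(a|b)(c)", "bc"))
# Kleene closure (r)*: zero or more repetitions, including the empty string.
print(matches(r"(ab)*", ""), matches(r"(ab)*", "abab"), matches(r"(ab)*", "aba"))
```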
One might use regular expressions, for example, as the basis for a short program that separates incoming mail from incoming spam.

Matching ordinary regular expressions with strings can be done in polynomial time, proportional to MN, where M is the length of the regular expression and N is the length of the string to be matched. The usual method is to parse the regular expression and construct an equivalent finite automaton (FA), which will have O(M) states, and then simulate the action of the FA on the input, which takes O(MN) time.

4.2.3 Finite State Automata

Kleene [33] proved that the languages specified by regular expressions and the languages accepted by finite automata constitute the same class, which is a basic result often referred to in the literature as Kleene's Theorem.

A finite state machine is an abstract machine consisting of a set of states (including the initial state), a set of input events, a set of output events, and a state transition function. The function takes the current state and an input event and returns the new set of output events and the next state. Some states may be designated as "terminal states". The state machine can also be viewed as a function which maps an ordered sequence of input events into a corresponding sequence of (sets of) output events. A finite state machine can be classified as either a deterministic finite state machine or a nondeterministic finite state machine.

Deterministic finite automaton

A deterministic FSM (DFA) is one where the next state is uniquely determined by the current state and a single input event. Formally, a deterministic finite automaton (DFA) is defined by the quintuple M = (Q, Σ, λ, q0, F) where

Q is a finite set of internal states.
Σ is a finite set of symbols called the input alphabet.
λ : Q × Σ → Q is a total function called the transition function.
q0 ∈ Q is the initial state.
F ⊆ Q is a set of final states.

The machine starts in the start state and reads in a string of symbols from its alphabet.
It uses the transition function λ to determine the next state from the current state and the symbol just read. If, when it has finished reading, it is in an accepting state, it is said to accept the string; otherwise it is said to reject the string. The language accepted by a DFA M = (Q, Σ, λ, q0, F) is the set of all strings on Σ accepted by M. Formally,

L(M) = {w ∈ Σ* : λ*(q0, w) ∈ F}

where λ* denotes the extension of λ from single symbols to strings.

One of the preferred notations for describing automata is the transition diagram. Following [28], a transition diagram for a DFA M = (Q, Σ, λ, q0, F) is a graph defined as follows:

a. For each state in Q there is a node.

b. For each state q in Q and each input symbol a in Σ, let λ(q, a) = p. Then the transition diagram has an arc from node q to node p, labeled a. If there are several input symbols that cause transitions from q to p, then the transition diagram can have one arc, labeled by the list of symbols.

c. There is an arrow into the start state q0, labeled Start. This arrow does not originate at any node.

d. Nodes corresponding to accepting states (those in F) are marked by a double circle. States not in F have a single circle.

Figure 4.1: DFA Example

Figure 4.1 shows the transition diagram for the DFA that accepts all and only the strings of 0's and 1's that have the sequence 01 somewhere in the string.

Non-deterministic finite automaton

A nondeterministic FSM (NFA) has the power to be in several states at once. The next state of an NFA depends not only on the current input event, but also on an arbitrary number of subsequent input events. Until these subsequent events occur, it is not possible to determine which state the machine is in. Formally, an NFA is defined by the quintuple M = (Q, Σ, λ, q0, F) where

Q is a finite set of internal states.
Σ is a finite set of symbols called the input alphabet.
λ : Q × (Σ ∪ {Λ}) → 2^Q is a total function called the transition function.
q0 ∈ Q is the initial state.
F ⊆ Q is a set of final states.
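The two quintuple definitions above translate fairly directly into a simulation. The Python sketch below runs a DFA by following its single current state and an NFA by tracking the set of states it could be in. It assumes an NFA without Λ-transitions, and the concrete state names and transition tables (a DFA for strings containing 01 and an NFA for strings ending in 01) are assumptions chosen for illustration, since the figures themselves are not reproduced here.

```python
def dfa_accepts(delta, start, finals, word):
    """Run a DFA whose total transition function is the dict delta[(state, symbol)]."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in finals


def nfa_accepts(delta, start, finals, word):
    """Run an NFA without empty moves by tracking the set of reachable states."""
    current = {start}
    for symbol in word:
        # A missing entry means no transition is defined for that pair.
        current = {nxt for s in current for nxt in delta.get((s, symbol), ())}
    return bool(current & finals)


# DFA over {0, 1} accepting strings that contain 01 (state names assumed).
CONTAINS_01 = {
    ("q0", "1"): "q0", ("q0", "0"): "q2",
    ("q2", "0"): "q2", ("q2", "1"): "q1",
    ("q1", "0"): "q1", ("q1", "1"): "q1",
}
# NFA over {0, 1} accepting strings that end in 01 (state names assumed).
ENDS_IN_01 = {
    ("q0", "0"): {"q0", "q1"}, ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},
}

print(dfa_accepts(CONTAINS_01, "q0", {"q1"}, "1101"))
print(nfa_accepts(ENDS_IN_01, "q0", {"q2"}, "1101"))
```

Tracking a set of states is also the idea behind converting an NFA into an equivalent DFA, where each DFA state stands for one such set.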
There are three major differences between an NFA and a DFA based on the definitions.

a. In an NFA, the range of λ is the powerset of Q, which implies that its value is not a single element of Q but a subset which defines the set of all possible states that can be reached by the transition.

b. An NFA allows Λ-transitions as the second argument of λ. This means that an NFA can make a transition without consuming an input symbol.

c. In an NFA, there could be a situation where a transition is not defined for a particular pair of current state and input symbol.

Figure 4.2: NFA Example

Figure 4.2 shows the transition diagram for the NFA that accepts all and only the strings of 0's and 1's that end in 01.

The language L accepted by an NFA M = (Q, Σ, λ, q0, F) consists of all strings w for which there is a walk labeled w from the initial vertex of the transition graph to some final vertex. Formally,

L(M) = {w ∈ Σ* : λ*(q0, w) ∩ F ≠ ∅}

It is possible to automatically translate any nondeterministic FSM into a deterministic one which will produce the same output given the same input. Each state in the DFA represents the set of states the NFA might be in at a given time. However, the DFA may have exponentially more states than the NFA.

4.2.4 Chomsky Hierarchy

For the sake of completeness, in this section we describe the Chomsky hierarchy [12, 13], which is a containment hierarchy of classes of formal grammars that generate formal languages. This hierarchy was described by Noam Chomsky in 1956. Regular expressions correspond to the type 3 grammars (regular grammars) of the Chomsky hierarchy, and may be used to describe a regular language. The Chomsky hierarchy describes four levels of formal grammar.

Grammar is the study of the rules that govern the use of a language. That set of rules is also called the language's grammar, and each language has its own, distinct grammar.
Programming languages have grammars, but do not resemble human languages. These are called formal grammars. In particular, they conform precisely to a grammar generated by a push down finite state automaton, with arbitrarily complex commands. They usually lack questions, exclamations, simile, metaphor and other features of human languages.

The Chomsky hierarchy comprises the following levels:

• Type-0 grammars (unrestricted grammars) include all formal grammars. They generate exactly all languages that can be recognized by a Turing machine. The language that is recognized by a Turing machine is defined as all the strings on which it halts. These languages are also known as the recursively enumerable languages. Note that this is different from the recursive languages, which can be decided by an always halting Turing machine.

• Type-1 grammars (context-sensitive grammars) generate the context-sensitive languages. These grammars have rules of the form αAβ → αγβ with A a nonterminal and α, β and γ strings of terminals and nonterminals. The strings α and β may be empty, but γ must be nonempty. The rule S → ε is allowed if S does not appear on the right side of any rule. These languages are exactly all languages that can be recognized by a non-deterministic Turing machine whose tape is bounded by a constant times the length of the input.

• Type-2 grammars (context-free grammars) generate the context-free languages. These are defined by rules of the form A → γ with A a nonterminal and γ a string of terminals and nonterminals. These languages are exactly all languages that can be recognized by a non-deterministic pushdown automaton. Context-free languages are the theoretical basis for the syntax of most programming languages.

• Type-3 grammars (regular grammars) generate the regular languages. Such a grammar restricts its rules to a single nonterminal on the left-hand side and a right-hand side consisting of a single terminal, possibly followed by a single nonterminal.
The rule S → ε is also allowed here if S does not appear on the right side of any rule. These languages are exactly all languages that can be decided by a finite state automaton. Additionally, this family of formal languages can be obtained by regular expressions. Regular languages are commonly used to define search patterns and the lexical structure of programming languages.

Every regular language is context-free, every context-free language is context-sensitive and every context-sensitive language is recursively enumerable. These are all proper inclusions, meaning that there exist recursively enumerable languages which are not context-sensitive, context-sensitive languages which are not context-free and context-free languages which are not regular.

4.2.5 Types of Regular Expressions

We classify the individual elements that make up a regular expression into five different classes. These are as follows:
• Class 1 - a: The regular expression is a simple string.
[Figure 4.3: Class 1 Regular Expression]
• Class 2 - a∗: The Kleene closure (*) in a regular expression is bound to a single character. E.g. R = a∗bab∗a
[Figure 4.4: Class 2 Regular Expression]
• Class 3 - a|b: The choice operator (|) is present in R. E.g. R = (a|b)c consists of the sequences S1 = ac and S2 = bc
[Figure 4.5: Class 3 Regular Expression]
• Class 4 - (ab)∗: The Kleene closure (*) in a regular expression is bound to multiple characters. E.g. R = (ab)∗bab∗
[Figure 4.6: Class 4 Regular Expression]
• Class 5 - (a|b)∗: The Kleene closure (*) in a regular expression is bound to a choice operator. E.g. R = (a|b)∗ produces any string of a's and b's.
[Figure 4.7: Class 5 Regular Expression]

4.3 A Simple String to Regular Expression Pattern Matching Machine

The problem of approximate matching of a string S and a regular expression R is essentially that of finding a string SR, SR ∈ L(R), such that the difference (edit distance) between S and SR is the absolute minimum. The first step towards solving the approximate matching problem is to generate the transition graph for the given regular expression R. The transition graph for any given regular expression R usually contains a set of arcs originating from every state (except, in some cases, the final state) denoting transitions from one state to another after consuming a certain character. We designate such arcs as original arcs since these are part of the original FSM. Traversal of the original arcs incurs no cost, as they are legal transitions in the transition graph representing the regular expression R. However, in addition to the existing arcs in the graph, two new types of arcs with a cost of 1 are added in order to facilitate easy calculation of the minimum edit distance between a string S and a regular expression R.

[Figure 4.8: Special Arcs. (a) Delete Arc; (b) Weighted Empty Transition Arc]

• Delete Arc (-): a weighted arc added to every state in the transition graph, representing the deletion of a character c, c ∈ Σ(R) − LA, where LA is the original self loop arc originating from that particular state. LA could be empty, ∅. Each delete arc is assigned a non-negative integer cost. We assume this cost to be a unit cost. Every traversal of this arc results in a cost of 1 being added to the edit distance measure. An example of a delete arc is shown in Figure 4.8(a).
• Weighted Empty Transition Arc (+): a weighted arc from each state to the next possible state, representing the addition of a character c, c ∈ Σ(R) − OA, where OA is the original set of arcs leaving the current state. Each weighted transition arc is assigned a non-negative integer cost.
Every traversal of this arc results in a cost of 1 being added to the edit distance measure. An example of a weighted empty transition arc is shown in Figure 4.8(b).

In order to calculate the edit distance between a regular expression R and a string S, each character in S (in left to right order) is consumed by the FSM representing R. Based on the input, an arc is traversed, transferring control to a 'new' state. If an original arc is traversed, no cost is incurred. Every special arc traversed results in the edit distance being increased by 1. In the case of a delete arc, this could be because the next character in S has no original transition associated with it. Even where an original transition exists, i.e. a zero-cost transition, it may still be worthwhile to traverse a delete arc, as this could (and in many cases does) reduce the overall edit distance measure.

In some cases, there is a need to move on to the next state even when there is no valid input character that enables this traversal. By traversing a weighted empty transition arc, the next state in the FSM becomes the current state. This is a non-zero cost traversal. Since we assume a unit cost, the edit distance measure is increased by 1. Such a traversal is of particular importance when a final state cannot be reached, possibly because the string has been completely consumed.

[Figure 4.9: Transition graph for R = a∗b]

Consider the regular expression R = a∗b, Σ(R) = {a, b}. Its corresponding transition graph is shown in Figure 4.9. Suppose S = aaabbaaaaa. After consuming the first three a's at zero cost (loop), our first instinct would be to traverse the original arc out of q0 to q1 by consuming b at a cost of 0. The edit distance at this stage is 0. However, after this traversal the edit distance increases by 1 for each of the remaining characters in S, corresponding to 1 delete of b and 5 deletes of a. The final value of the edit distance using this short-sighted approach is 6.
On the other hand, if we were to delete the 2 b's by traversing the delete arc in state q0 and then traverse the weighted empty transition arc after consuming the entire remaining string of a's, the edit distance would now be 3, which corresponds to 2 deletes of b and 1 weighted traversal (similar to adding a b). This is the minimum cost in this case.

4.4 Existing Algorithm - Myers and Miller

Before we describe our algorithm in the next section, we briefly outline an existing algorithm for the approximate regular expression matching problem. In their paper, Myers and Miller [45] address the approximate regular expression matching problem by making use of an edge-labeled finite automaton accepting the language of all alignments between a sequence S = s1 s2 . . . sM of length M and sequences in the language of the regular expression R of size N. Once the graph is created, the total time spent to calculate the similarity cost is O(MN).

4.4.1 Discussion

The authors first create an edge-labeled directed graph called a regular expression edit graph. It is designed so that paths between two designated vertices correspond to alignments between a sequence S of length M and sequences in R. The authors visualize the finite automaton, F = <V, E, λ, St, En>, as a vertex-labeled graph with distinguished source and end vertices. The regular expression edit graph can be constructed using the following rules:
a. Given a regular expression R = a, a ∈ Σ ∪ {ε}, the finite automaton, FR, that accepts exactly the language denoted by R is
[Figure 4.10: Myers representation : Fa]
b. Given a regular expression R with source state StR and end state EnR and a regular expression S with source state StS and end state EnS, the finite automaton, FR|S, that accepts exactly the language denoted by R|S is
[Figure 4.11: Myers representation : FR|S]
c.
Given two regular expressions R and S, the finite automaton, FRS, that accepts exactly the language denoted by RS is
[Figure 4.12: Myers representation : FRS]
d. Given a regular expression R, the finite automaton, FR∗, that accepts exactly the language denoted by R∗ is
[Figure 4.13: Myers representation : FR∗]

The vertices of the regular expression edit graph, GS,R, are the pairs (i, s) where i ∈ [0, M] and s ∈ V. The graph in effect contains M + 1 copies of F, where (i, s) is in row i. Only the following edges are allowed in GS,R:
a. If i ∈ [1, M] and either s = θ or λ(s) ≠ ε, then there is a deletion edge (i − 1, s) → (i, s) labeled d(si).
b. If i ∈ [0, M] and t → s ∈ E, then there is an insertion edge (i, t) → (i, s) labeled a(λ(s)).
c. If i ∈ [1, M], t → s ∈ E and λ(s) ≠ ε, then there is a substitution edge (i − 1, t) → (i, s) labeled s(si, λ(s)).

Algorithm Myers
Input: Regular expression edit graph GS,R
Output: Distance score, d(M, φ)
1.  d(0, θ) = 0
2.  for s ∈ V − θ in topological order do
3.      d(0, s) = min{d(0, t) : t → s ∈ D} + σ(a(λ(s)))
4.  for i = 1 to M do {
5.      d(i, θ) = d(i − 1, θ) + σ(d(si))
6.      for s ∈ V − θ in topological order do {
7.          d(i, s) = min{d(i, t) : t → s ∈ D} + σ(a(λ(s)))
8.          if λ(s) ≠ ε then
9.              d(i, s) = min{d(i, s), d(i − 1, s) + σ(d(si)), min{d(i − 1, t) : t → s ∈ E} + σ(s(si, λ(s)))}
10.     }
11.     for s ∈ V − θ in topological order do {
12.         d(i, s) = min{d(i, s), min{d(i, t) : t → s ∈ E, t ≠ θ} + σ(a(λ(s)))} }
13. }
14. return d(M, φ)
Figure 4.14: Myers algorithm for approximate matching of regular expression

Figure 4.14 depicts the algorithm for approximate matching of a string S with a regular expression R. Line 1 initializes the base case, i.e. the distance cost between two empty sequences. Lines 3 and 5 calculate the values for the boundaries, i.e. row (0, s) and column (i, θ).
For every state/vertex in row i, Line 7 gives the minimum value for the distance considering only the insertion edges. We refer to this as step a. For every non-ε state in row i we take into account the effect of deletion edges and substitution edges coming in from the vertices in the previous row. This is shown in Lines 8–9. We refer to this as step b. Until now, only the acyclic edges of the graph were considered when looking at the insertion edges. After one pass of the row, every state in the row is revisited in Lines 11–12 to evaluate the effect of the cyclic insertion edges, if any, on the distance value. We refer to this as step c. We present a detailed worked-out example for a pair of a regular expression and a string using the Myers algorithm in Appendix A.

4.5 Our Algorithm - RPM

Now let us discuss the problem of approximate matching of strings with regular expressions for a particular class of regular expressions. We are primarily concerned with class 1 and class 2 types of regular expressions. (Please refer to Section 4.2.5 for a detailed discussion concerning the different types of regular expressions.) We apply the approximate string matching problem under a dynamic cost function to solve the problem of approximately matching a string with a class 2 regular expression.

4.5.1 Background

A string S is defined to be a finite sequence of characters. A string is of the form S = a0 a1 . . . ai . . . an−1. The length of a string S, |S|, is the number of characters in S, so here |S| = n. The character in the ith position of the string S, S[i], is ai. A character is a member of a string, ∈, if it exists in the string. A character is not a member of a string, ∉, if it does not occur in the string. For example, a ∈ aaba, but c ∉ aaba.

An edit operation on a string is defined as the operation of deletion, insertion or substitution performed on a single character in the string. A delete edit operation, del(S, i), corresponds to deleting the ith character in a string S.
An insert edit operation, ins(S, i, c), corresponds to inserting a character c at position i in string S. A substitution operation, sub(S, i, c1, c2), corresponds to substituting a character c1 with another character c2 at position i in S. Each operation can be assigned a cost. We are concerned with only inserts and deletes and assume each has a unit cost.

Any string S1 can be transformed into another string S2 through a series of edit operations. The easiest way is to first delete all the characters in S1. This results in the empty string, ε. We then insert one by one all the characters in S2. The end result is the string S2. The total cost of this sequence of edit operations is the sum total of the individual edit operation costs. There are a number of ways in which S1 can be transformed to S2. The edit distance between any two strings S1 and S2, δ(S1, S2), is defined as the minimum cost of a sequence of edit operations required to transform S1 to S2. E.g. if S1 = aaba and S2 = aaa, then δ(S1, S2) = 1, corresponding to a single del(S1, 2). If S1 = aaba and S2 = aaaa, then δ(S1, S2) = 2, corresponding to del(S1, 2) followed by ins(S1, 2, a), since we are not concerned with substitutions.

We define Seqmin as a minimum cost sequence of edit operations. |Seqmin| represents the number of edit operations in Seqmin. Since we assume each edit operation carries a unit cost, |Seqmin| = δ(S1, S2). The application of an edit operation on a string results in a new string. Consider two strings S1 and S2. We represent the initial string S1 as S1^[0]. Generalizing, S1^[x], where 0 ≤ x ≤ δ(S1, S2), represents the string obtained by the application of the xth edit operation in Seqmin on the string S1^[x−1]. Clearly S1^[δ(S1,S2)] = S2.

Theorem 1 For any pair of strings S1 and S2 such that δ(S1, S2) = d, if a string S3 is such that ∀j < k, S3[j] = S1[j], S3[k] = l, l ∉ S2, ∀j > k, S3[j] = S1[j − 1], then δ(S3, S2) = d + 1.
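Before the proof, the definitions above and the statement of Theorem 1 can be checked on small inputs. The following is a minimal sketch, assuming unit costs and only insertions and deletions as stated in this section; the function names indel_distance and insert_foreign are hypothetical and are not part of the thesis implementation.

```python
def indel_distance(s1: str, s2: str) -> int:
    """Minimum number of unit-cost insertions and deletions turning s1 into s2."""
    n = len(s2)
    prev = list(range(n + 1))          # row for the empty prefix of s1
    for i in range(1, len(s1) + 1):
        curr = [i] + [0] * n           # deleting i characters reaches the empty string
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1]  # equal characters are kept at no cost
            else:
                # either delete s1[i-1] or insert s2[j-1]
                curr[j] = 1 + min(prev[j], curr[j - 1])
        prev = curr
    return prev[n]


def insert_foreign(s1: str, k: int, l: str) -> str:
    """Build S3 from S1 by inserting the character l at position k (Theorem 1)."""
    return s1[:k] + l + s1[k:]


if __name__ == "__main__":
    d = indel_distance("aaba", "aaa")
    for k in range(len("aaba") + 1):
        s3 = insert_foreign("aaba", k, "c")   # 'c' does not occur in S2 = aaa
        print(s3, indel_distance(s3, "aaa"), d + 1)
```

The sketch only exercises the theorem on examples; it is not a substitute for the proof that follows.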
Proof:
Case 1 - δ(S3, S2) ≤ d + 1: To go from S2 to S3 we can go from S2 to S1 with a minimum distance of d, and from S1 to S3 with a minimum distance of 1, corresponding to the insertion of the extra character l, since S3 results from inserting a character l somewhere in S1. Therefore, δ(S3, S2) is at most d + 1.
Case 2 - δ(S3, S2) ≥ d + 1: There exists a sequence of edit operations from S3 to S2 whose length is δ(S3, S2). Since l ∉ S2 and substitutions are not considered, that sequence contains a deletion of the extra character l in S3. The sequence of edit operations without the deletion of l is a sequence of edit operations from S1 to S2. Therefore, its length is necessarily greater than or equal to δ(S1, S2):
δ(S3, S2) − 1 ≥ δ(S1, S2)
We know that δ(S1, S2) = d. Therefore, δ(S3, S2) ≥ d + 1. □

4.5.2 The Idea

Expanding on the definition of the edit distance between two strings, we define the edit distance between a string S and a regular expression R, ∆(S, R), to be the minimum cost of a sequence of edit operations required to transform S to a string SR in the language defined by R, such that δ(S, SR) is the least. Formally,

$$\Delta(S, R) = \min_{S_R \in L(R)} \{\delta(S, S_R)\} \qquad (4.1)$$

Since we are only concerned with class 2 types of regular expressions, i.e. regular expressions where the Kleene closure (*) is allowed to be bound to only a single character, we can exploit special properties of the strings defined by these regular expressions when considering the approximate matching problem. We claim that there is a string in the language of the regular expression of a certain length beyond which the edit distance between the original string and the regular expression will not decrease. Since, by the definition of the approximate matching problem, we are only concerned with the smallest value of the edit distance, we need not consider strings in the language of the regular expression larger than this maximum string. We now present a series of definitions and proofs in order to prove the existence of such a maximum string.
Definition 1 Given a regular expression R and an integer n ∈ N, we define exp(R, n) to be the string Se where each character bound by a Kleene closure (*) in the original R is repeated exactly n times. E.g. if R = ba∗bb∗ and n = 3, then Se = baaabbbb, where the a and the b bound by a * in R are each repeated 3 times.

Definition 2 Given a regular expression R and an integer n ∈ N, we define Exp(R, n) to be the set of all strings Si ∈ L(R) up to exp(R, n), i.e. those in which each character bound by a Kleene closure (*) is repeated at most n times. E.g. if R = a∗b∗, Exp(R, n) = {a^x b^y | 0 ≤ x ≤ n, 0 ≤ y ≤ n}.

Theorem 2 ∀R, ∀S, ∃n ∈ N s.t. ∆(S, R) = ∆(S, Exp(R, n))

Proof: By definition of ∆(S, R), ∀R, ∀S, ∃SR ∈ L(R) s.t. ∆(S, R) = δ(S, SR). Let i be the maximum number of repetitions of a character bound by the Kleene closure (*) in R needed to obtain SR. By definition and construction, SR is a subsequence of exp(R, i). By definition, SR ∈ Exp(R, i) ⊂ L(R). Therefore, δ(S, SR) = ∆(S, Exp(R, i)) = ∆(S, R). □

Now that we have proved that there does exist a string of a certain size beyond which the edit distance between the original string S and the regular expression R does not decrease, we set about trying to find a suitable bound for the size of this string. We claim that it is only necessary to consider all strings in the language of the regular expression up to a string where each character bound by the Kleene closure (*) is repeated a maximum of |S| times, where |S| is the length of the string S. The idea behind this is that if we were to consider larger strings, the edit distance would only increase in most cases, due to the additional deletes required with the addition of an extra character in the constructed string. We now set about proving our claim.

Lemma 1 Given a regular expression R and a string S, in order to obtain the edit distance between S and R, it is sufficient to find the least edit distance between S and a string in Exp(R, |S|).
∀S, ∆(S, R) = ∆(S, Exp(R, |S|))

Proof: By definition of ∆(S, R), ∀R, ∀S, ∃SR ∈ L(R) s.t. ∆(S, R) = δ(S, SR).
Let z be the minimum number of repetitions of a character bound by the Kleene closure (*) in R such that SR is a subsequence of exp(R, z), hence SR ∈ Exp(R, z). Thus ∆(S, Exp(R, z)) = ∆(S, R).
Case 1 - z ≤ |S|: By definition and construction, Exp(R, z) ⊂ Exp(R, |S|). Therefore, ∆(S, Exp(R, |S|)) = ∆(S, Exp(R, z)) = ∆(S, R). Hence, ∆(S, R) = ∆(S, Exp(R, |S|)).
Case 2 - z > |S|:
(i) |SR| ≤ |S|: Hence SR is a subsequence of exp(R, |S|). Thus, SR ∈ Exp(R, |S|). Since |S| < z, exp(R, z) can no longer be the minimum string. Hence this case does not hold.
(ii) |SR| > |S|: If |SR| > |S|, then we would have to delete some characters from SR. If this is the case when z > |S|, some of these deletions can be saved by setting z to a lesser value z′, i.e. z′ = z − 1 or z′ = |S|. Then z is no longer the minimum. This is a contradiction. Hence this case does not hold. □

We further refine our bound. Instead of being satisfied with a bound of |S| for the repetition of characters bound by a Kleene closure (*) in R, we claim that it is sufficient to repeat each character bound by a Kleene closure in R the number of times that particular character appears in the original string S. It is clear that the size of the maximum string and the size of the set of strings to be considered can be reduced greatly in this case.

Definition 3 Given a regular expression R and a string S, we define expmin(R, S) to be the string Sme where each character x bound by a Kleene closure (*) is repeated the number of times that character x appears in S. E.g. if R = a∗b∗c and S = abbaccbba, then Sme = aaabbbbc, since the number of a's in S is 3 and the number of b's in S is 4.

Definition 4 Given a regular expression R and a string S, we define Expmin(R, S) to be the set of all strings Si ∈ L(R) up to expmin(R, S). E.g. if R = a∗b∗ and S = aba, then Sme = aab, since the number of a's in S is 2 and the number of b's in S is 1. Expmin(R, S) = {ε, a, aa, b, ab, aab}.
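Definitions 3 and 4 can be illustrated with a short sketch. It assumes that a class 1 or class 2 regular expression is written as a plain string in which a * always follows a single literal character; the helper names parse_class2, exp_min and exp_min_set are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from itertools import product


def parse_class2(regex: str) -> list[tuple[str, bool]]:
    """Split a class 1/2 regular expression into (character, starred) pairs."""
    items: list[tuple[str, bool]] = []
    for ch in regex:
        if ch == "*":
            if not items or items[-1][1]:
                raise ValueError("'*' must follow a single unstarred character")
            items[-1] = (items[-1][0], True)
        else:
            items.append((ch, False))
    return items


def exp_min(regex: str, s: str) -> str:
    """S_me: each starred character x is repeated as often as x occurs in s."""
    counts = Counter(s)
    return "".join(ch * counts[ch] if starred else ch
                   for ch, starred in parse_class2(regex))


def exp_min_set(regex: str, s: str) -> set[str]:
    """All strings of L(R) in which each starred x is repeated at most count(x, s) times."""
    counts = Counter(s)
    items = parse_class2(regex)
    choices = [range(counts[ch] + 1) if starred else (1,) for ch, starred in items]
    return {"".join(ch * k for (ch, _), k in zip(items, reps))
            for reps in product(*choices)}


if __name__ == "__main__":
    print(exp_min("a*b*c", "abbaccbba"))
    print(sorted(exp_min_set("a*b*", "aba")))
```

Enumerating Expmin(R, S) explicitly is exponential in the number of starred characters, so the set is useful only for checking small cases; the algorithm in Section 4.5.3 works with the single string Sme instead.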
Lemma 2 Given a regular expression R and a string S, in order to obtain the edit distance between S and R, it is sufficient to find the least edit distance between S and a string in Expmin(R, S).

$$\Delta(S, R) = \Delta(S, Exp_{min}(R, S)) = \min_{S_E \in Exp_{min}(R, S)} \{\delta(S, S_E)\}$$

Proof: In order to prove ∆(S, R) = ∆(S, Expmin(R, S)), we prove that the following cases cannot arise.
Case 1: ∆(S, R) > ∆(S, Expmin(R, S))
Recall that, by definition, ∆(S, R) is the minimum edit distance between the string S and a string SR ∈ L(R). Since Expmin(R, S) ⊆ L(R), it follows that ∆(S, Expmin(R, S)) ≥ ∆(S, R). Hence case 1 does not hold.
Case 2: ∆(S, R) < ∆(S, Expmin(R, S))
If case 2 is true, then there must exist a string Sk ∈ L(R) such that Sk ∉ Expmin(R, S) and ∆(S, R) = δ(S, Sk). Sk here is a string with an extra set of characters (characters bound by a * in R). In order to obtain S from Sk we need to perform one or more delete operations on those extra characters in Sk. We can find a shorter string Sm ∈ Expmin(R, S) such that δ(S, Sm) = δ(S, Sk), i.e. in order to obtain S we need to perform the same number of insert operations in Sm as the number of delete operations in Sk. Since Sm ∈ Expmin(R, S), δ(S, Sm) ≥ ∆(S, Expmin(R, S)). This contradicts case 2 and hence case 2 does not hold.
Hence from Case 1 and Case 2,

$$\Delta(S, R) = \Delta(S, Exp_{min}(R, S)) = \min_{S_E \in Exp_{min}(R, S)} \{\delta(S, S_E)\}$$

□

Now that we have proved that it is not necessary to consider every string in the language of the regular expression R when trying to find the edit distance between the string S and R, we turn our attention to the central idea of the algorithm. Recall that, in the approximate string matching problem, given two strings S1 and S2 of fixed lengths m and n such that n ≥ m, each cell in the edit distance matrix, EDIT, has a value equal to δ(S1[1 . . . i], S2[1 . . . j]), with 0 ≤ i ≤ m and 0 ≤ j ≤ n.
The boundary values of EDIT are defined as follows (for 0 ≤ i ≤ m, 0 ≤ j ≤ n):
EDIT[0, j] = j; EDIT[i, 0] = i;
The rest of the elements of EDIT can be computed using the simple formula

$$EDIT[i, j] = \min \begin{cases} EDIT[i-1, j] + cost(delete) \\ EDIT[i, j-1] + cost(insert) \\ EDIT[i-1, j-1] + cost(change) \end{cases}$$

In order to find the edit distance ∆(S, R) between a string S and a regular expression R, it is sufficient to approximately match S with Sme = expmin(R, S) under a special scoring scheme. Since a * represents zero or more occurrences of a character, special attention is given to those characters in Sme that are created as a result of the expansion of the Kleene closure (*). When calculating the values of each of the cells in the edit distance matrix, the cost of inserting those characters that were created by the expansion of the Kleene closure (*) is now 0.

4.5.3 The Algorithm

Algorithm RPM
Input: Strings S and Sme
Output: Edit Distance Matrix, EDIT
1. EDIT[0][0] = 0;
2. for (i = 1; i ≤ |S|; i++)
3.     EDIT[i][0] = EDIT[i − 1][0] + cost(delete);
4. for (j = 1; j ≤ |Sme|; j++)
5.     EDIT[0][j] = EDIT[0][j − 1] + star(Sme[j]);
6. for (i = 1; i ≤ |S|; i++)
7.     for (j = 1; j ≤ |Sme|; j++)
8.         EDIT[i][j] = min{ EDIT[i − 1][j − 1] + match(S[i], Sme[j]), EDIT[i − 1][j] + cost(delete), EDIT[i][j − 1] + star(Sme[j]) };
9. return EDIT
Figure 4.15: Our algorithm for approximate matching of regular expression

Line 1 initializes the base case, comparing two empty strings. The boundaries of the matrix are initialized in Lines 2–5, with the column (i, 0) being processed in Lines 2–3 and the row (0, j) being processed in Lines 4–5. star(Sme[j]) returns 0 if Sme[j] is a character inserted due to the expansion of a * and cost(insert) otherwise. Once the base and boundary cells are set, we process the rest of the cells one row at a time in Lines 6–8.
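The recurrence of Figure 4.15 can be sketched in a few lines. This is an illustrative sketch rather than the implementation evaluated in Section 4.6: it assumes unit costs for delete, insert and substitute, assumes that a * always follows a single literal character, and the names expand_min and rpm_distance are hypothetical.

```python
from collections import Counter


def expand_min(regex: str, s: str) -> list[tuple[str, bool]]:
    """Return S_me as (character, from_star) pairs for a class 1/2 regex."""
    counts = Counter(s)
    tokens: list[tuple[str, bool]] = []
    i = 0
    while i < len(regex):
        ch = regex[i]
        if i + 1 < len(regex) and regex[i + 1] == "*":
            # a starred character is repeated as often as it occurs in s
            tokens.extend([(ch, True)] * counts[ch])
            i += 2
        else:
            tokens.append((ch, False))
            i += 1
    return tokens


def rpm_distance(s: str, regex: str, cost: int = 1) -> int:
    """Fill EDIT as in Figure 4.15 and return EDIT[|S|][|S_me|]."""
    sme = expand_min(regex, s)
    rows, cols = len(s), len(sme)
    edit = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):                      # column (i, 0): deletions only
        edit[i][0] = edit[i - 1][0] + cost
    for j in range(1, cols + 1):                      # row (0, j): star characters are free
        edit[0][j] = edit[0][j - 1] + (0 if sme[j - 1][1] else cost)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            ch, from_star = sme[j - 1]
            match = 0 if s[i - 1] == ch else cost     # substitution when characters differ
            star = 0 if from_star else cost           # insertion of S_me[j]
            edit[i][j] = min(edit[i - 1][j - 1] + match,
                             edit[i - 1][j] + cost,
                             edit[i][j - 1] + star)
    return edit[rows][cols]


if __name__ == "__main__":
    for regex, s in [("ab*cb", "bbbaab"), ("ca*ab*", "baaabb"), ("abb*a*bc", "ccbbbc")]:
        print(regex, s, rpm_distance(s, regex))
```

Keeping a flag next to each character of Sme is one simple way to realize star(Sme[j]); only the previous row of EDIT is needed if just the final distance is required.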
Three cases are considered: the deletion of a character S[i], the insertion of a character Sme[j], and the substitution of character S[i] with Sme[j] if they do not match. match(S[i], Sme[j]) returns 0 if the characters are the same and cost(substitute) otherwise. The edit distance between a string S and a regular expression R, ∆(S, R), is the value stored in the cell EDIT[|S|][|Sme|].

Given a string S of length |S| and a regular expression of length |R|, RPM has a worst case running time of O(|S|²·|R|/2), or roughly O(|S|³). The worst case arises when the regular expression R is of the form R = xa∗y, where x and y are other regular expressions, and the string S is just a series of a's, e.g. S = aa . . . a. Its space requirements are also in the order of the cube of the length of the string S, O(|S|³). However, if only the edit distance between the string and the regular expression is required, RPM functions in O(|S|²) space.

4.5.4 Examples

Example 1: R = ab∗cb, S = bbbaab, Sme = abbbbcb

        ε  a  b  b  b  b  c  b
    ε   0  1  1  1  1  1  2  3
    b   1  1  1  1  1  1  2  2
    b   2  2  1  1  1  1  2  2
    b   3  3  2  1  1  1  2  2
    a   4  3  3  2  2  2  2  3
    a   5  4  4  3  3  3  3  3
    b   6  5  4  4  3  3  4  3

Figure 4.16: RPM Example 1 : R = ab∗cb and S = bbbaab

Example 2: R = ca∗ab∗, S = baaabb, Sme = caaaabbb

        ε  c  a  a  a  a  b  b  b
    ε   0  1  1  1  1  2  2  2  2
    b   1  1  1  1  1  2  2  2  2
    a   2  2  1  1  1  1  1  1  1
    a   3  3  2  1  1  1  1  1  1
    a   4  4  3  2  1  1  1  1  1
    b   5  5  4  3  2  2  1  1  1
    b   6  6  5  4  3  3  2  1  1

Figure 4.17: RPM Example 2 : R = ca∗ab∗ and S = baaabb

Example 3: R = abb∗a∗bc, S = ccbbbc, Sme = abbbbbc

        ε  a  b  b  b  b  b  c
    ε   0  1  2  2  2  2  3  4
    c   1  1  2  2  2  2  3  3
    c   2  2  2  2  2  2  3  3
    b   3  3  2  2  2  2  2  3
    b   4  4  3  2  2  2  2  3
    b   5  5  4  3  2  2  2  3
    c   6  6  5  4  3  3  3  2

Figure 4.18: RPM Example 3 : R = abb∗a∗bc and S = ccbbbc

4.6 Performance Evaluation

4.6.1 Experimental Setup

Both the algorithms, namely Myers and RPM, were implemented using Java SDK v1.4.1 and performance evaluation tests were carried out on XENA1.
The XENA1 server is a dual-CPU Sun Sparc machine with a clock speed of 480MHz. It has 4 GB of RAM and runs Solaris 7 as the local operating system. In order to evaluate the algorithms, we implemented a random string and regular expression generator for a specified length of string and regular expression. In addition to this, we can also vary the size of the alphabet Σ, i.e. the number of characters in the alphabet, as well as the number of Kleene closures (*) in the regular expression.

4.6.2 Results

We consider four different cases in the performance study of our algorithm, RPM, with respect to the existing algorithm by Myers and Miller for approximate matching of regular expressions with strings. We evaluate the effect on the time taken by each algorithm of (1) varying the length of the regular expression, keeping everything else constant, (2) varying the length of the string, keeping everything else constant, (3) varying the size of the alphabet, keeping everything else constant, and (4) varying the number of Kleene closures (*) in the regular expression, keeping everything else constant. Each point on the graph is the result of averaging 1000 different string and regular expression combinations of similar lengths.

[Figure 4.19: Performance Analysis : Varying Length of Regular Expression, |R|. Plot of time (milliseconds) against length of R, |R|, for RPM and MYERS.]

Figure 4.19 depicts the results of the experiment varying the length of the regular expression while keeping the length of the string (500), the size of the alphabet (3) and the number of Kleene closures (*) (4) the same. The graph is plotted along two axes, with the size of the regular expression R, |R|, along the x-axis and the time taken (in milliseconds) to find the distance between the regular expression R and the string S for both the algorithms along the y-axis.
For every extra character in R, the number of states in the regular expression edit graph in the Myers implementation increases by 3|S| and the number of edges in the worst case increases by 8|S|. For our algorithm, however, every addition of a character in R results in the minimum string being increased by |S| in the worst case and no change in the best case. As a result, we can see that the increase in the size of the regular expression R does not have a major effect on the performance of our algorithm, whereas the performance of Myers deteriorates with the increase in the size of R.

[Figure 4.20: Performance Analysis : Varying Length of String, |S|. Plot of time (milliseconds) against length of S, |S|, for RPM and MYERS.]

Figure 4.20 indicates the results of the experiment varying the length of the string while keeping the length of the regular expression (30), the size of the alphabet (3) and the number of Kleene closures (*) (4) the same. The graph is plotted along two axes, with the size of the string S, |S|, along the x-axis and the time taken (in milliseconds) to find the distance between the regular expression R and the string S for both the algorithms along the y-axis. As we can see, increasing the length of the string S has a similar effect on both algorithms, namely increased running times. Increasing the size of the string in most cases results in more occurrences of a character from the alphabet Σ, and this in turn affects the size of the maximum string Sme from R. The Myers algorithm is also greatly affected, due to the increase in the number of states and edges in the regular expression edit graph.
[Figure 4.21: Performance Analysis : Varying size of alphabet Σ, |Σ|. Plot of time (milliseconds) against the number of characters in the alphabet for RPM and MYERS.]

Figure 4.21 displays the results of the experiment varying the size of the alphabet while keeping the length of the regular expression (30), the length of the string (500) and the number of Kleene closures (*) (4) the same. The graph is plotted along two axes, with the size of the alphabet Σ, |Σ|, along the x-axis and the time taken (in milliseconds) to find the distance between the regular expression R and the string S for both the algorithms along the y-axis. We see a gradual improvement in performance as we increase the size of the alphabet since, in the average case, this results in shorter maximum strings of R, because the original string S and R are populated by more characters from the alphabet Σ.

[Figure 4.22: Performance Analysis : Varying number of kleene closures, |∗|. Plot of time (milliseconds) against the number of '*' in R, |∗|, for RPM and MYERS.]

Figure 4.22 shows the results of the experiment varying the number of Kleene closures while keeping the length of the regular expression (30), the length of the string (500) and the size of the alphabet Σ (3) the same. The graph is plotted along two axes, with the number of Kleene closures (*) in R, |∗|, along the x-axis and the time taken (in milliseconds) to find the distance between the regular expression R and the string S for both the algorithms along the y-axis. Increasing the number of Kleene closures (*) in the regular expression results in increased times, as an extra Kleene closure (*) in the regular expression R normally results in longer maximum strings of R.
In the Myers case, the addition of an extra Kleene closure (*) results in 3|S| extra edges being added to the regular expression edit graph, which equates to more paths needing to be evaluated.

Special cases

[Figure 4.23: Performance Analysis : Special Cases. (a) Varying number of a in S; (b) Varying number of a*s in R. Plots of time (milliseconds) for RPM and MYERS.]

In Figure 4.23(a) we vary the number of a's present in a string of length 600 and see the effects of matching it with the same regular expression (with 2 a∗). As we can see, increasing the number of a's while keeping the regular expression constant results in increased times. This is due to the fact that for every x a's in the string S, the size of the maximum string of R increases by at least 2x. In Figure 4.23(b) we vary the number of a∗'s in the regular expression R while keeping constant the string S, which has a minimum number of a's. For every additional a∗ in R, the maximum string of R increases in size by |a|S, which is the number of a's in S. With the increase in size of the maximum string of R, an increase in the time for approximately matching the regular expression R and the string S is expected.

4.7 Summary

In this chapter, we have presented an algorithm to solve the problem of approximate matching of a string with a special type of regular expression. We consider those regular expressions where a * is allowed on only a single character. The algorithm takes advantage of certain special properties of this type of regular expression and successfully employs a modified version of the approximate string matching problem to address the issue of approximate matching of a regular expression.
The RPM algorithm for approximate matching of a string with a class 2 regular expression runs in $O(|S|^3)$ time and space in the worst case. We compared the performance of RPM with an existing algorithm by Myers and Miller and found that RPM performs better when matching strings with class 2 regular expressions. This is probably due to the speedup obtained by reducing the regular expression matching problem to the approximate string matching problem, as opposed to the inherently graph-based nature of the Myers algorithm.

CHAPTER 5
Conclusion and Future Work

Approximate pattern matching techniques for structures such as strings, trees, graphs and regular expressions form the basis of many different kinds of commercial applications, such as information extraction and bio-informatics. This thesis has addressed specialized problems in the area of approximate matching of complex structures such as trees, acyclic graphs and regular expressions. The contributions of this thesis are two-fold.

• We present new algorithms for the approximate matching of trees (ordered and unordered) and acyclic graphs based on edit distance measures under the degree-1 constraint, the implication being that the relevant information is located at the leaves of a tree or at the periphery of a graph. Under the degree-1 constraint, edit operations can be performed only on vertices with degree less than or equal to 1. Our work on approximate matching of trees and acyclic graphs under the degree-1 constraint has been submitted for publication.

• We consider the problem of approximate matching of a string with a special type of regular expression in which the Kleene closure (*) may only be bound to a single character. In this regard, we present a new algorithm which exploits the special properties of such a regular expression, thereby enabling us to reduce the approximate regular expression matching problem to an approximate string matching problem.
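The approximate string matching problem to which the regular expression problem is reduced is, in its basic form, the unit-cost edit distance computed over an edit distance matrix [65]. As a reminder of that underlying recurrence, here is a minimal Python sketch. It is the textbook dynamic program, not the RPM algorithm itself, and the function name `edit_distance` is ours.

```python
def edit_distance(s, t):
    """Unit-cost edit distance between s and t (insert, delete, substitute)."""
    # prev[j] holds the distance between the prefix of s read so far and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]  # deleting the first i characters of s yields the empty string
        for j, ct in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # substitute, free when cs == ct
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Only two rows of the matrix are kept, so the sketch uses space linear in |t| while still taking time proportional to the product of the two lengths.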
The ordered and unordered tree algorithms have a worst-case execution time of $O(|T_1| \cdot |T_2| \cdot k^2 \log k)$, and the algorithm for acyclic graphs has a worst-case execution time of $O(|T_1|^2 \cdot |T_2|^2 \cdot k^2 \log k)$. Moreover, the RPM algorithm for approximate matching of a string S with a class-2 regular expression R runs in $O(|S|^3)$ time and space in the worst case. In order to evaluate the performance of our regular expression matching algorithm RPM, we performed experiments on a dual 480 MHz Sun SPARC CPU machine with 4 GB of main memory running Solaris 7 as the local operating system. Our performance evaluation indicates that RPM outperforms the existing well-known algorithm Myers [45] for approximate regular expression matching in terms of execution time. This may be primarily attributed to the approximate string matching nature of our algorithm, which makes use of simple arithmetic operations, whereas Myers first needs to construct a regular expression edit graph before it starts traversing the edges of this graph to produce the desired result.

We believe that our work has addressed some of the important issues in the field of approximate matching of trees and acyclic graphs, as well as in the area of approximate matching of strings with regular expressions. However, several open research issues remain in this field. We now list some possible extensions of our research and our directions for future work.

Future Scope of Work

The research carried out in this thesis has laid the foundation for further research on the approximate matching of more complicated structures. To our knowledge, the problem of approximate matching of two regular expressions has not received adequate attention. The problem of approximate matching of two regular expressions is essentially to determine how closely the two regular expressions are related.
A possible method could be to define the edit distance between two regular expressions, say $R_1$ and $R_2$, by finding a pair of strings $S_1 \in L(R_1)$ and $S_2 \in L(R_2)$ such that $\delta(S_1, S_2)$ is minimum. This is one possible interpretation of the problem. Since every regular expression can be represented as a finite state machine, another possible method could be to look for a suitable way to compare the graph structures of the two finite state machines in order to establish some kind of similarity (or dissimilarity). Definitive rules need to be followed to convert each regular expression to its corresponding finite state machine.

A formal language, L, is a set of finite-length words (or "strings") over some finite alphabet. A typical alphabet would be {a, b}, a typical string over that alphabet would be "ababba", and a typical language over that alphabet containing that string would be the set of all strings which contain the same number of a's as b's. It could be useful to determine whether a word, w, belongs to a particular language L and, if not, how closely it is associated with L. Conceptually, this is similar to the approximate matching problem of strings and regular expressions.

Another interesting problem to consider would be that of approximately comparing two different languages defined over the same finite alphabet. As in the case of approximate matching of two regular expressions, a possible method is to find a pair of words $w_1 \in L_1$ and $w_2 \in L_2$, where $L_1$ and $L_2$ are languages defined over a finite alphabet Σ, such that $\delta(w_1, w_2)$ is minimum.

With the increasing popularity of XML, there is a growing need to be able to check efficiently and effectively whether an XML document belongs to a particular DTD. As mentioned earlier, an XML document can be represented as a tree. A DTD can be represented as a regular tree [16], the tree version of a regular expression for strings.
The matching problem in this case would be to determine whether an XML document adheres to the DTD's specifications and, if it does not, how far the XML document is from the DTD. This is in some ways similar to the problem of approximate matching of strings and regular expressions. However, in this case we would probably have to look for another XML document that adheres to the specific DTD such that the difference between the two XML documents is the least.

As a next step we can look into the problem of comparing two DTDs. The matching problem in this case would be to find out how similar the DTDs are to each other and what changes need to be made to convert one DTD into the other. Since each DTD can be represented by a regular tree, the issue is similar to that of matching two regular expressions. We could either compare the graph representations of the DTDs, or we could look at the problem from an edit distance point of view, where we are concerned with finding a pair of XML documents from the respective DTDs such that the edit distance between their structures is the least.

Given a schedule (a sequence of page numbers) from each of the n clients in a centralized architecture, a centralized content server must come up with the best possible combined global schedule under a special delay constraint before it begins to broadcast the requested pages. Under this delay constraint, two consecutive pages in a local schedule cannot be more than a certain number x of pages apart in the global schedule. By making use of existing multiple sequence alignment algorithms and modifying them to take this delay constraint into account, we aim to devise an algorithm to solve this problem.

BIBLIOGRAPHY

[1] A. V. Aho. Algorithms for finding patterns in strings. In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A, chapter 5, pages 255–300. The MIT Press, 1990.
[2] A. V. Aho, B. W. Kernighan, and P. J. Weinberger. The AWK Programming Language.
Addison-Wesley, Reading, MA, 1988.
[3] A. V. Aho, D. S. Hirschberg, and J. D. Ullman. Bounds on the complexity of the longest common subsequence problem. In Journal of the Association for Computing Machinery, 23, pages 1–12, 1976.
[4] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. In Journal of Molecular Biology, volume 215, pages 403–410, 1990.
[5] A. Apostolico and R. Giancarlo. The Boyer-Moore-Galil string searching strategies revisited. In SIAM Journal on Computing, volume 15, pages 98–105, 1986.
[6] V. L. Arlazarov, E. A. Dinic, M. A. Kronrod, and I. A. Faradzev. On economic construction of the transitive closure of a directed graph. In Dokl. Acad. Nauk SSSR, volume 194, pages 487–488, 1970.
[7] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, May 2001.
[8] BotSpot. http://www.internet.com. 1999.
[9] R. Boyer and J. Moore. A fast string searching algorithm. In Communications of the ACM, volume 20, pages 762–772, 1977.
[10] Samuel R. Buss and Peter N. Yianilos. A bipartite matching approach to approximate string comparison and search. Technical report, NEC Research Institute, 1995.
[11] M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. In Working Notes of AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, pages 6–11, Menlo Park, CA, 1998. AAAI Press.
[12] N. Chomsky. Three models for the description of language. In IRE Transactions on Information Theory, volume 2, pages 113–124, September 1956.
[13] N. Chomsky. On certain formal properties of grammars. In Information and Control, volume 1, pages 91–112, June 1959.
[14] C. L. A. Clarke and G. V. Cormack. On the use of regular expressions for searching text. In ACM Transactions on Programming Languages and Systems, volume 19, pages 413–426, 1997.
[15] L. Colussi. Fastest pattern matching in strings. In Journal of Algorithms, volume 16, pages 163–189, 1994.
[16] H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, Denis Lugiez, Sophie Tison, and Marc Tommasi. Tree automata techniques and applications.
[17] V. Crescenzi and G. Mecca. Grammars have exceptions. Information Systems, 23(8):539–565, 1998.
[18] M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994.
[19] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. In The VLDB Journal, volume 10, pages 334–350, 2001.
[20] J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM (JACM), 19(2):248–264, 1972.
[21] D. Eppstein, Z. Galil, R. Giancarlo, and G. F. Italiano. Efficient algorithms for sequence analysis. In R. M. Capocelli, A. De Santis, and U. Vaccaro, editors, Sequences II: Communication, Security, and Computer Science, pages 225–244. Springer-Verlag, 1993.
[22] M. L. Fredman and R. E. Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM (JACM), 34(3):596–615, 1987.
[23] grep - Searching for a pattern. http://www.gnu.org/manual/grep-2.4/html_mono/grep.html. November 1999.
[24] J. Hammer, J. McHugh, and H. Garcia-Molina. Semistructured data: The TSIMMIS experience. In Advances in Databases and Information Systems, pages 1–8, 1997.
[25] M. C. Harrison. Implementation of the substring test by hashing. In CACM, volume 14, pages 777–779, December 1971.
[26] Tierra Highlights2. http://www.tierra.com.
[27] D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6):341–343, 1975.
[28] J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 2001.
[29] T. Jiang, L. Wang, and K. Zhang. Alignment of trees - an alternative to tree edit. In Proceedings of Combinatorial Pattern Matching, pages 75–86, 1994.
[30] J. H. Morris (Jr) and V. R. Pratt. A linear pattern-matching algorithm. In Technical Report 40, University of California, Berkeley, 1970.
[31] R. M. Karp and M. O. Rabin. Efficient randomized pattern-matching algorithms. In IBM J. Res. Dev., volume 31, pages 249–260, 1987.
[32] P. Kilpeläinen and H. Mannila. Ordered and unordered tree inclusion. In SIAM Journal of Computing, volume 24, pages 340–356, 1995.
[33] S. C. Kleene. Representation of events in nerve nets and finite automata. In C. E. Shannon and J. McCarthy, eds., Automata Studies, Annals of Mathematics Studies, volume 34, pages 3–42. Princeton University Press, 1956.
[34] D. E. Knuth, J. Morris, and V. R. Pratt. Fast pattern matching in strings. In SIAM Journal on Computing, volume 6, pages 323–360, 1977.
[35] S. Kumar and E. Spafford. An Application of Pattern Matching in Intrusion Detection. Technical Report 94-013, Department of Computer Sciences, 1994.
[36] N. Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15–68, 2000.
[37] A. Laender, B. Ribeiro-Neto, A. Silva, and J. Teixeira. A brief survey of web data extraction tools. In SIGMOD Record, volume 31, June 2002.
[38] L. Liu, C. Pu, and W. Han. XWRAP: An XML-enabled wrapper construction system for web information sources. In ICDE, pages 611–621, 2000.
[39] L. Liu, C. Pu, and W. Tang. WebCQ - detecting and delivering information changes on the web. In Proceedings of the ninth international conference on Information and knowledge management, pages 512–519, 2000.
[40] S. Lu. A tree-to-tree distance and its application to cluster analysis. In IEEE Trans. Pattern Analysis and Machine Intelligence, volume PAMI-1, pages 219–224, 1979.
[41] F. Luccio and L. Pagli. Simple solutions for approximate tree matching problems. In Proceedings of the international joint conference on theory and practice of software development on Colloquium on trees in algebra and programming (CAAP '91): vol 1, pages 193–201. Springer-Verlag New York, Inc., 1991.
[42] The AWK Manual. http://www.cs.uu.nl/docs/vakken/st/nawk/nawk_toc.html. December 1995.
[43] I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1/2):93–114, 2001.
[44] E. Myers. A four-Russian algorithm for regular expression pattern matching. In Journal of the ACM, volume 39, pages 430–448, 1992.
[45] E. W. Myers and W. Miller. Approximate matching of regular expressions. In Bulletin of Mathematical Biology, volume 51, pages 5–37, 1989.
[46] Netmind. http://www.netmind.com.
[47] C. Parent and S. Spaccapietra. Issues and approaches of database integration. Communications of the ACM, 41(5es):166–178, 1998.
[48] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence analysis. In Proc. Natl. Acad. Sci. USA, volume 85, pages 2444–2448, 1988.
[49] B. Rahardjo and R. H. C. Yap. Automatic information extraction from web pages. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 430–431. ACM Press, 2001.
[50] Resource Description Framework (RDF). http://www.w3.org/RDF/.
[51] B. A. Ribeiro-Neto, A. H. F. Laender, and A. S. da Silva. Extracting semistructured data through examples. In Proceedings of the 1999 ACM CIKM International Conference on Information and Knowledge Management, Kansas City, Missouri, USA, November 2-6, 1999, pages 94–101. ACM.
[52] N. Rishe, J. Yuan, R. Athauda, S. C. Chen, X. Lu, X. Ma, A. Vaschillo, A. Shaposhnikov, and D. Vasilevsky. Semantic access: Semantic interface for querying databases. In The VLDB Journal, pages 591–594, 2000.
[53] A. Sahuguet and F. Azavant. WYSIWYG web wrapper factory (W4F).
[54] H. Saip and C. Lucchesi. Matching algorithms for bipartite graph. Technical report, Departamento de Ciência da Computação, Universidade Estadual de Campinas, March 1993.
[55] sed - A Stream Editor. http://www.gnu.org/manual/sed/html_mono/sed.html. June 1998.
[56] D. Shasha, J. T. L. Wang, K. Zhang, and F. Y. Shih. Exact and approximate algorithms for unordered tree matching. In IEEE Transactions On Systems, Man and Cybernetics, 24(4), April 1994.
[57] D. Shasha and K. Zhang. Fast algorithms for the unit cost editing distance between trees. In Journal of Algorithms, 11, pages 581–621, 1990.
[58] A. P. Sheth and J. A. Larson. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys (CSUR), 22(3):183–236, 1990.
[59] S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.
[60] K. Tai. The tree-to-tree correction problem. Journal of the ACM (JACM), 26(3):422–433, 1979.
[61] K. Thompson. Regular expression search algorithm. In CACM, volume 11, pages 419–422, 1968.
[62] UNIX diff. http://www.rt.com/man/diff1.html.
[63] M. Vilares, F. J. Ribadas, and J. Graña. Approximately common patterns in shared-forests. In Proceedings of the tenth international conference on Information and knowledge management, pages 73–80. ACM Press, 2001.
[64] R. A. Wagner. Order-n correction for regular languages. In Communications of the ACM (CACM), volume 17, pages 265–268, 1974.
[65] R. A. Wagner and M. J. Fischer. The string-to-string correction problem. In Journal of the ACM, 21, pages 168–173, 1974.
[66] J. A. Wald and P. G. Sorenson. Explaining ambiguity in a formal query language. ACM Transactions on Database Systems (TODS), 15(2):125–161, 1990.
[67] J. T. L. Wang, K. Zhang, and G.-W. Chirn. The approximate graph matching problem. In Proceedings of the International Conference on Pattern Recognition, Vol. 2, pages 284–288, 1994.
[68] M. S. Waterman. General methods of sequence comparison. In Bulletin of Mathematical Biology, volume 46, pages 473–500, 1984.
[69] M. Montes y Gómez, A. Gelbukh, A. López-López, and R. Baeza-Yates.
Flexible comparison of conceptual graphs. In Proceedings of the 12th International Conference and Workshop on Database and Expert Systems Applications, pages 102–111, Munich, Germany, 2001. Springer-Verlag, Berlin.
[70] M. Montes y Gómez, A. López-López, and A. Gelbukh. Information retrieval with conceptual graph matching. In Proceedings of the 11th International Conference and Workshop on Database and Expert Systems Applications, pages 312–321, Greenwich, England, 2000. Springer-Verlag, Berlin.
[71] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. In SIAM Journal on Computing, 18(6), pages 1245–1262, 1989.
[72] K. Zhang, J. T. L. Wang, and D. Shasha. On the editing distance between undirected acyclic graphs and related problems. In Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching, pages 395–407, Espoo, Finland, 1995. Springer-Verlag, Berlin.

Appendix A - Myers Example

[Figure 5.1: Myers Example: R = a* and S = aaa. Regular expression edit graph with ε and a vertices for string positions 0 to 3 and edges labelled a(a), d(a) and s(a,a).]

We now present a detailed worked example of approximate matching of a regular expression with a string using the algorithm by Myers and Miller. Figure 5.1 depicts the regular expression edit graph for the regular expression R = a* and the string S = aaa, and the edit distance matrix for the pair is shown in Figure 5.2.
            ε    a    a    a
            0    1    2    3
ε     0     0    1    2    3
ε     1     0    1    2    3
a*    2     1    0    0    0
ε     3     0    0    0    0

Figure 5.2: Myers Edit Distance Matrix

A step by step (cell by cell) calculation process is as follows, where E(i,j) denotes the entry for string position i (column) and regular expression state j (row):

• (0,0): 0
• (0,1): min(E(0,0)) + a(ε) = 0
• (0,2): min(E(0,1)) + a(a) = 1
• (0,3): min(E(0,1), E(0,2)) + a(ε) = 0
• (1,0): E(0,0) + d(a) = 1
• (1,1)
  – step a: min(E(1,0)) + a(ε) = 1 + 0 = 1
• (1,2)
  – step a: min(E(1,1)) + a(a) = 1 + 1 = 2
  – step b: min(2, E(0,2) + d(a), min(E(0,1), E(0,2)) + s(a,a)) = min(2, 2, min(0,1) + 0) = 0
  – step c: min(0, E(1,2) + a(a)) = min(0, 1) = 0
• (1,3)
  – step a: min(E(1,1), E(1,2)) + a(ε) = 0 + 0 = 0
• (2,0): E(1,0) + d(a) = 2
• (2,1)
  – step a: min(E(2,0)) + a(ε) = 2 + 0 = 2
• (2,2)
  – step a: min(E(2,1)) + a(a) = 2 + 1 = 3
  – step b: min(3, E(1,2) + d(a), min(E(1,1), E(1,2)) + s(a,a)) = min(3, 1, min(1,0) + 0) = 0
  – step c: min(0, E(2,2) + a(a)) = min(0, 1) = 0
• (2,3)
  – step a: min(E(2,1), E(2,2)) + a(ε) = 0 + 0 = 0
• (3,0): E(2,0) + d(a) = 3
• (3,1)
  – step a: min(E(3,0)) + a(ε) = 3 + 0 = 3
• (3,2)
  – step a: min(E(3,1)) + a(a) = 3 + 1 = 4
  – step b: min(4, E(2,2) + d(a), min(E(2,1), E(2,2)) + s(a,a)) = min(4, 1, min(2,0) + 0) = 0
  – step c: min(0, E(3,2) + a(a)) = min(0, 1) = 0
• (3,3)
  – step a: min(E(3,1), E(3,2)) + a(ε) = 0 + 0 = 0
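The values in Figure 5.2 can be cross-checked with a short Python sketch. It is not the sweep procedure of Myers and Miller traced above: it treats each pair (string position, state) as a vertex of the edit graph and runs Dijkstra's shortest path algorithm over it. The four-state automaton for a*, with a self-loop on the state labelled a, and the unit costs are assumptions read off the worked example, and the name `regex_edit_table` is ours.

```python
import heapq

EPS = None  # label carried by an epsilon state


def regex_edit_table(s, labels, edges):
    """E[i][j]: cheapest alignment of s[:i] with a path from state 0 to state j."""
    n, m = len(s), len(labels)
    succ = [[] for _ in range(m)]
    for p, q in edges:
        succ[p].append(q)
    inf = float("inf")
    E = [[inf] * m for _ in range(n + 1)]
    E[0][0] = 0
    heap = [(0, 0, 0)]  # (cost, string position, state)
    while heap:
        cost, i, j = heapq.heappop(heap)
        if cost > E[i][j]:
            continue  # stale queue entry
        moves = []
        if i < n:
            moves.append((i + 1, j, 1))  # d(s[i]): delete a string character
        for q in succ[j]:
            if labels[q] is EPS:
                moves.append((i, q, 0))  # a(eps): entering an epsilon state is free
            else:
                moves.append((i, q, 1))  # a(c): insert the state's character
                if i < n:                # s(s[i], c): substitute or match
                    moves.append((i + 1, q, 0 if s[i] == labels[q] else 1))
        for ni, nq, w in moves:
            if cost + w < E[ni][nq]:
                E[ni][nq] = cost + w
                heapq.heappush(heap, (cost + w, ni, nq))
    return E


if __name__ == "__main__":
    # Assumed automaton for a*: states 0 (eps), 1 (eps), 2 (a), 3 (eps).
    labels = [EPS, EPS, "a", EPS]
    edges = [(0, 1), (1, 2), (2, 2), (1, 3), (2, 3)]
    for row in regex_edit_table("aaa", labels, edges):
        print(row)
```

Each printed row corresponds to one string position, that is, to one column of Figure 5.2, and the last entry of the last row is the distance between S = aaa and R = a*. Dijkstra's algorithm is used here only because it copes with the cycle introduced by the closure without a hand-written second sweep.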