Chapter 8

A SURVEY OF ALGORITHMS FOR KEYWORD SEARCH ON GRAPH DATA

Haixun Wang
Microsoft Research Asia
Beijing, China 100190
haixunw@microsoft.com

Charu C. Aggarwal
IBM T. J. Watson Research Center
Hawthorne, NY 10532
charu@us.ibm.com

Abstract In this chapter, we survey methods that perform keyword search on graph data. Keyword search provides a simple and user-friendly interface for retrieving information from complicated data structures. Since many real-life datasets are represented by trees and graphs, keyword search has become an attractive mechanism for data of a variety of types. In this survey, we discuss methods of keyword search on schema graphs, which are abstract representations of XML data and relational data, and methods of keyword search on schema-free graphs. In our discussion, we focus on three major challenges of keyword search on graphs. First, what are the semantics of keyword search on graphs, that is, what qualifies as an answer to a keyword search? Second, what constitutes a good answer, that is, how should the answers be ranked? Third, how can keyword search be performed efficiently? We also discuss some unresolved challenges and propose some new research directions on this topic.

Keywords: Keyword Search, Information Retrieval, Graph Structured Data, Semi-Structured Data

C.C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_8, © Springer Science+Business Media, LLC 2010

1. Introduction

Keyword search is the de facto information retrieval mechanism for data on the World Wide Web. It also proves to be an effective mechanism for querying semi-structured and structured data, because of its user-friendly query interface. In this survey, we focus on keyword search problems for XML documents (semi-structured data), relational databases (structured data), and all kinds of schema-free graph data.

Recently, query processing over graph-structured data has attracted increasing attention, as myriad applications are driven by, and produce, graph-structured data [14]. For example, in the Semantic Web, two major W3C standards, RDF and OWL, conform to node-labeled and edge-labeled graph models. In bioinformatics, many well-known projects, e.g., BioCyc (http://biocyc.org), build graph-structured databases. In social network analysis, much interest centers around all kinds of personal interconnections. In other applications, raw data might not be graph-structured at first glance, but there are many implicit connections among data items; restoring these connections often allows more effective and intuitive querying. For example, a number of projects [1, 18, 3, 26, 8] enable keyword search over relational databases. In personal information management (PIM) systems [10, 5], objects such as emails, documents, and photos are interwoven into a graph using manually or automatically established connections among them. The list of examples of graph-structured data goes on.

For data with relational or XML schemas, specific query languages, such as SQL and XQuery, have been developed for information retrieval. In order to query such data, the user must master a complex query language and understand the underlying data schema.
In relational databases, information about an object is often scattered across multiple tables due to normalization considerations, and in XML datasets, the schemas are often complicated, and embedded XML structures make it difficult to express queries that must traverse tree structures. Furthermore, many applications work on graph-structured data with no obvious, well-structured schema, so information retrieval based on query languages is not applicable.

Both relational databases and XML databases can be viewed as graphs. Specifically, XML datasets can be regarded as graphs when IDREF/ID links are taken into consideration, and a relational database can be regarded as a data graph that has tuples and keywords as nodes. In the data graph, for example, two tuples are connected by an edge if they can be joined using a foreign key; a tuple and a keyword are connected if the tuple contains the keyword. Thus, traditional graph search algorithms, which extract features (e.g., paths [27], frequent patterns [30], sequences [20]) from graph data and convert queries into searches over feature spaces, can be used for such data.

However, traditional graph search methods usually focus more on the structure of the graph than on its semantic content. In XML and relational data graphs, nodes contain keywords, and sometimes nodes and edges are labeled. The problem of keyword search requires us to determine a group of densely linked nodes in the graph that satisfies a particular keyword-based query. Thus, the keyword search problem makes use of both the content and the linkage structure. These two sources of information reinforce each other and improve the overall quality of the results. This makes keyword search a preferred information retrieval method.
Keyword search allows users to query the databases quickly, with no need to know the schema of the respective databases. In addition, keyword search can help discover unexpected answers that are often difficult to obtain via rigid-format SQL queries. It is for these reasons that keyword search over tree- and graph-structured data has attracted much attention [1, 18, 3, 6, 13, 16, 2, 28, 21, 26, 24, 8].

Keyword search over graph data presents many challenges. The first question we must answer is what constitutes an answer to a keyword query. For information retrieval on the Web, answers are simply Web documents that contain the keywords. In our case, the entire dataset is considered as a single graph, so the algorithms must work at a finer granularity and decide which subgraphs qualify as answers. Furthermore, since many subgraphs may satisfy a query, we must design ranking strategies to find the top answers. The definition of answers and the design of their ranking strategies must match users' intentions. For example, several papers [16, 2, 12, 26] adopt IR-style answer-tree ranking strategies to enhance the semantics of answers. Finally, a major challenge for keyword search over graph data is query efficiency, which to a large extent hinges on the semantics of the query and the ranking strategy. For instance, some ranking strategies score an answer by the sum of its edge weights. In this case, finding the top-ranked answer is equivalent to the group Steiner tree problem [9], which is NP-hard. Thus, finding the exact top-$k$ answers is inherently difficult. To improve search efficiency, many systems, such as BANKS [3], propose ways to reduce the search space. As another example, BLINKS [14] avoids the inherent difficulty of the group Steiner tree problem by proposing an alternative scoring mechanism, which lowers complexity and enables effective indexing and pruning.
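To make the cost of edge-weight scoring concrete, the following is a minimal sketch, not taken from any of the cited systems, of an exhaustive search for the top-ranked answer under the assumption that a lower total edge weight means a better answer. The toy graph, the node names, and the helper names `tree_nodes` and `best_answer` are all hypothetical. The search enumerates every subset of edges, which is exponential in the number of edges and is workable only for very small graphs.

```python
from itertools import combinations


def tree_nodes(edge_subset):
    """Return the node set if the edges form a single tree, else None."""
    nodes = {n for u, v, _ in edge_subset for n in (u, v)}
    if len(edge_subset) != len(nodes) - 1:  # a tree on m nodes has m - 1 edges
        return None
    adj = {n: [] for n in nodes}
    for u, v, _ in edge_subset:
        adj[u].append(v)
        adj[v].append(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:  # depth-first connectivity check
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return nodes if seen == nodes else None


def best_answer(edges, keyword_groups):
    """Exhaustively find a minimum-weight tree touching every keyword group."""
    all_nodes = {n for u, v, _ in edges for n in (u, v)}
    for n in sorted(all_nodes):  # one node may already contain all keywords
        if all(n in group for group in keyword_groups):
            return 0, {n}
    best = None
    for r in range(1, len(edges) + 1):
        for subset in combinations(edges, r):  # 2^|E| subsets overall
            nodes = tree_nodes(subset)
            if nodes is None or not all(nodes & g for g in keyword_groups):
                continue
            weight = sum(w for _, _, w in subset)
            if best is None or weight < best[0]:
                best = (weight, nodes)
    return best


# Hypothetical data graph: (node, node, edge weight).
edges = [("A", "B", 1), ("B", "C", 1), ("A", "D", 5), ("D", "C", 1), ("C", "E", 2)]
# One group per keyword: the nodes that contain that keyword.
keyword_groups = [{"A"}, {"C", "E"}]

weight, nodes = best_answer(edges, keyword_groups)
print(weight, sorted(nodes))
```

On this toy graph the cheapest answer connects A to C through B. Doubling the number of edges squares the number of subsets to examine, which is the practical face of the NP-hardness noted above and the reason systems restrict the search space or change the scoring function.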
Before we delve into the details of various keyword search problems for graph data, we briefly summarize the scope of this survey chapter. We classify the algorithms we survey into three categories based on the schema constraints in the underlying graph data.

Keyword Search on XML Data: Keyword search on XML data [11, 6, 13, 23, 25] is a simpler problem than keyword search on schema-free graphs. XML documents are essentially constrained to tree structures, where each node has only a single incoming path. This property provides great optimization opportunities [28]. Connectivity information can also be efficiently encoded and indexed. For example, in XRank [13], the Dewey inverted list is used to index paths so that a keyword query can be evaluated without tree traversal.

Keyword Search over Relational Databases: Keyword search on relational databases [1, 3, 18, 16, 26] has attracted much interest. Conceptually, a database is viewed as a labeled graph where tuples in different tables are treated as nodes connected via foreign-key relationships. Note that a graph constructed this way usually has a regular structure because the schema restricts node connections. In contrast to the graph-search approach in BANKS [3], DBXplorer [1] and DISCOVER [18] construct join expressions and evaluate them, relying heavily on the database schema and on query processing techniques in the RDBMS.

Keyword Search on Graphs: A great deal of work on keyword querying of structured and semi-structured data has been proposed in recent years. Well-known algorithms include backward expanding search [3], bidirectional search [21], the dynamic programming technique DPBF [8], and BLINKS [14]. Recently, work that extends keyword search to graphs in external memory has been proposed [7].

The rest of the chapter is organized as follows. We first discuss keyword search methods for schema graphs.
In Section 2, we focus on keyword search for XML data, and in Section 3, we focus on keyword search for relational data. In Section 4, we introduce several algorithms for keyword search on schema-free graphs. Section 5 contains a discussion of future directions and the conclusion.

2. Keyword Search on XML Data

Sophisticated query languages such as XQuery have been developed for querying XML documents. Although XQuery can express many queries precisely and effectively, it is by no means a user-friendly interface for accessing XML data: users must master a complex query language, and in order to use it, they must have a full understanding of the schema of the underlying XML data. Keyword search, on the other hand, offers a simple and user-friendly interface. Furthermore, the tree structure of XML data gives nice semantics to the query and enables efficient query processing.

2.1 Query Semantics

In the most basic form, as in XRank [13] and many other systems, a keyword search query consists of $n$ keywords: $Q = \{k_1, \cdots, k_n\}$. XSEarch [6] extends the syntax to allow users to specify which keywords must appear in a satisfying document, and which may or may not appear (although the appearance of such keywords is desirable, as indicated by the ranking function).

Syntax aside, one important question is, what qualifies as an answer to a keyword search query? In information retrieval, we simply return documents that contain all the keywords. For keyword search on an XML document, we want to return meaningful snippets of the document that contain the keywords. One interpretation of meaningful is to find the smallest subtrees that contain all the keywords.

[Figure 8.1. Query Semantics for Keyword Search $Q = \{x, y\}$ on XML Data. Tree with nodes A, B, C, D and keyword occurrences x, y, x, y; legend: exclusive LCA node, minimal LCA node.]

Specifically, for each keyword $k_i$, let $L_i$ be the list of nodes in the XML document that contain keyword $k_i$.
Clearly, subtrees formed by at least one node from each $L_i$, $i = 1, \cdots, n$, contain all the keywords. Thus, an answer to the query can be represented by $lca(n_1, \cdots, n_n)$, the lowest common ancestor (LCA) of nodes $n_1, \cdots, n_n$, where $n_i \in L_i$. In other words, answering the query is equivalent to finding:

$$LCA(k_1, \cdots, k_n) = \{\, lca(n_1, \cdots, n_n) \mid n_1 \in L_1, \cdots, n_n \in L_n \,\}$$

Moreover, we are only interested in the "smallest" answer, that is,

$$SLCA(k_1, \cdots, k_n) = \{\, v \mid v \in LCA(k_1, \cdots, k_n) \wedge \forall v' \in LCA(k_1, \cdots, k_n),\ v \nprec v' \,\} \tag{8.1}$$

where $\prec$ denotes the ancestor relationship between two nodes in an XML document. As an example, in Figure 8.1, we assume the keyword query is $Q = \{x, y\}$. We have $C \in SLCA(x, y)$, while $A \in LCA(x, y)$ but $A \notin SLCA(x, y)$.

Several algorithms, including [28, 17, 29], are based on the SLCA semantics. However, SLCA is by no means the only meaningful semantics for keyword search.
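The two definitions above can be checked on a small example with a brute-force sketch. It assumes, purely for illustration, that each node is identified by a Dewey-style tuple giving the path from the root, so that the lowest common ancestor is the longest common prefix. The tree below is hypothetical and is not the exact layout of Figure 8.1; it is chosen only so that the root A belongs to the LCA set but not to the SLCA set, while its child C belongs to both. The function names are likewise assumptions.

```python
from itertools import product


def lca(*nodes):
    """Lowest common ancestor = longest common prefix of the Dewey IDs."""
    prefix = []
    for parts in zip(*nodes):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def is_ancestor(v, w):
    """True if v is a proper ancestor of w, i.e. v precedes w in the tree."""
    return len(v) < len(w) and w[:len(v)] == v


def lca_set(keyword_lists):
    """LCA(k_1..k_n): the lca of every combination with one node per list."""
    return {lca(*combo) for combo in product(*keyword_lists)}


def slca_set(keyword_lists):
    """SLCA(k_1..k_n): members of the LCA set with no descendant in that set."""
    candidates = lca_set(keyword_lists)
    return {v for v in candidates
            if not any(is_ancestor(v, w) for w in candidates)}


# Hypothetical tree: A = (0,), C = (0, 1).
L_x = [(0, 0), (0, 1, 0)]   # nodes containing keyword x
L_y = [(0, 1, 1)]           # nodes containing keyword y

print(sorted(lca_set([L_x, L_y])))   # A and C
print(sorted(slca_set([L_x, L_y])))  # C only
```

Enumerating every combination costs time proportional to the product of the list sizes, so this sketch serves only to verify the definitions on small inputs; it is not how the SLCA-based algorithms cited above are implemented.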