SEMANTICS ANALYSIS FOR XML KEYWORD SEARCH

LE THUY NGOC
(M.Sc, Ho Chi Minh University of Science)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2014

DECLARATION

I hereby declare that the thesis is my original work and that it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis. This thesis has also not been submitted for any degree in any university previously.

Le Thuy Ngoc
18 August 2014

Acknowledgements

This thesis would not have been completed without the guidance, support and encouragement of many people. I would like to reserve this section to express my gratitude to them.

First and foremost, my sincerest gratitude goes to my supervisor, Professor Ling Tok Wang, who is very understanding and supportive. He has always supported me, not only in research but also in all the issues I may have had. During my PhD study, he gave me insightful advice on my research work. He has taught me how to think critically, how to identify research problems, and how to write research papers. His advice and help are invaluable to me and I will remember them all my life.

I would like to thank Professor Tamer Özsu, Professor Lee Mong Li and Professor Chan Chee Yong for serving as my thesis examiners and providing valuable advice on my work. I also gratefully acknowledge Professor H.V. Jagadish, Professor Gillian Dobbie and Professor Lu Jiaheng, with whom I had the chance to collaborate on my papers, for giving me useful advice on my research work.

I greatly appreciate my senior, Dr. Wu Huayu, for his selfless help from the beginning of my PhD journey, and for always being there to answer my questions. I would also like to thank Zeng Yong and my co-authors (Dr. Wu Huayu, Dr. Bao Zhifeng, Li Luochen and Zeng Zhong), who worked with me in a group to discuss problems and work on interesting research topics.
Many thanks go to my friends in the School of Computing for the open discussions, valuable assistance, and enjoyable hours we spent together in our leisure time. These will become beautiful memories in my mind.

Last but not least, my deepest love is reserved for my family for their continuous love, support and understanding. They gave me the courage and strength to overcome difficulties during my PhD study.

Contents

1 Introduction
  1.1 Background on XML and XML Keyword Search
  1.2 Contributions of the Thesis
  1.3 Our Publications and Relationships among Our Contributions
  1.4 Thesis Outline
2 Related Work
  2.1 Tree-based XML Keyword Search
    2.1.1 LCA Semantics
    2.1.2 SLCA Semantics
    2.1.3 ELCA Semantics
    2.1.4 VLCA Semantics
    2.1.5 MLCA Semantics
    2.1.6 Other Semantics
    2.1.7 Relationship and Comparison on the LCA-based semantics
    2.1.8 Common Problems of the LCA-based Semantics
  2.2 Graph-based XML Keyword Search
    2.2.1 Subtree based Semantics for Directed Graphs
    2.2.2 Subgraph based Semantics for Undirected Graphs
    2.2.3 Bi-directed Tree based Semantics for Directed Graphs
    2.2.4 Other Methods based on Graph
    2.2.5 Relationship and Comparison on the Semantics of Existing Graph-based Approaches
    2.2.6 Common Problems of the Graph-based Approaches
    2.2.7 Inefficiency Problem of Graph-based Approaches
  2.3 Other Topics Related to XML Keyword Search
    2.3.1 Using semantics in existing XML Keyword Search
    2.3.2 Group-by and Aggregate Functions in XML keyword Search
    2.3.3 Output Presentation and Post-processing
    2.3.4 Ranking Answers in XML Keyword Search
    2.3.5 Storing XML Documents Using RDBMS
    2.3.6 Keyword Search over Relational Database
3 Preliminary
  3.1 ORA-semantics (Object-Relationship-Attribute-semantics)
    3.1.1 Definition of ORA-Semantics in XML
    3.1.2 Discovering ORA-semantics
  3.2 Our Labeling and Matching
  3.3 Handling Relationship Attribute
4 Using ORA-Semantics in Keyword Search over XML Tree
  4.1 Introduction
    4.1.1 Limitations of the LCA semantics
    4.1.2 Our novel semantics
    4.1.3 Our approach and contributions
  4.2 Our Nearest Common Object Node (NCON) semantics
  4.3 Overview of our approach
    4.3.1 Object orientation
    4.3.2 Reversal mechanism
    4.3.3 Overview of the process
  4.4 Detailed techniques of our approach
    4.4.1 Generating the reversed O-tree
    4.4.2 Indexes
    4.4.3 Basic query processing
    4.4.4 Handling multiple object class paths
    4.4.5 Removing duplicated answers
    4.4.6 Handling relationship attribute
  4.5 Optimization
    4.5.1 Query mappings
    4.5.2 Classification of query mappings
    4.5.3 The optimized algorithm
  4.6 Experiment
    4.6.1 Experimental setup
    4.6.2 Effectiveness evaluation
    4.6.3 Efficiency evaluation
    4.6.4 Quality of the extracted and reversed O-trees
  4.7 Conclusion
5 Using ORA-Semantics for Keyword Search over XML Graph
  5.1 Introduction
    5.1.1 The problem of missing answers due to object duplication
    5.1.2 Our approach and contributions
  5.2 Data and answer model
    5.2.1 Data model
    5.2.2 Answer model
  5.3 Our approach
    5.3.1 Overview of the approach
    5.3.2 Labeling and indexing
    5.3.3 Runtime processing
  5.4 Experiment
    5.4.1 Experimental Settings
    5.4.2 Methodology of doing experiment
    5.4.3 Effectiveness Evaluation
    5.4.4 Efficiency Evaluation
  5.5 Conclusion
6 Schema-independent XML Keyword Search
  6.1 Introduction
  6.2 Preliminary
  6.3 The CR (Common Relative) semantics
    6.3.1 Intuitive analysis
    6.3.2 The CR semantics
  6.4 Our schema-independent approach
    6.4.1 Identifying relatives of a node
    6.4.2 Labeling and indexing
    6.4.3 Processing
    6.4.4 Output presentation
  6.5 Experiment
    6.5.1 Experimental setup
    6.5.2 Completeness
    6.5.3 Soundness
    6.5.4 Schema-independence
    6.5.5 Comparing with SLCA and ELCA
    6.5.6 Efficiency evaluation

Table 7.3: Results of queries on the Basketball dataset

QB1
  XPower results: count player = 215; count coach = 13.
  Results if duplication is not filtered: count player = 2795; count coach = 13.
  Reasons for duplication: Team Celtics has been coached by 13 coaches, thus its players are duplicated 13 times. Coaches are not duplicated.
  Results if query interpretations are not differentiated: same results as XPower.

QB2
  XPower results: 15 answers for 15 persons (2 coaches, 13 players), each with the number of teams they have worked for. The sum of these numbers is 69.
  Results if duplication is not filtered: count team = 298.
  Reasons for duplication: a player can work with the same team (duplicated) under different coaches.
  Results if query interpretations are not differentiated: answer: count team = 69.

QB4
  XPower results: No answer for Johnson as players. Johnson as coaches: answers for coaches, each with a number of players. Total number is 136.
  Results if duplication is not filtered: count player = 219.
  Reasons for duplication: a player can work for more than one team (duplicated) in different years under the same coach Johnson.
  Results if query interpretations are not differentiated: answer: count player = 136.

QB5
  XPower results: answers for players Edwards in team Hawks, with max year 1997, 2004 resp.
  Results if duplication is not filtered: same results as XPower.
  Reasons for duplication: duplication does not affect the aggregate function max.
  Results if query interpretations are not differentiated: answer: max year = 2004.

QB6
  XPower results: answers w.r.t. year for players Edwards: 1993, 1995, 1981, 1977 resp.
  Results if duplication is not filtered: same results as XPower.
  Reasons for duplication: duplication does not affect the aggregate function.
  Results if query interpretations are not differentiated: answer: year = 1977.

QB7
  XPower results: Michael as players: No answer. Michael as coach 1: teams (count players = 153, 256, 82 resp.). Michael as coach 2: teams (count [truncated in source]
  Results if duplication is not filtered: same results as XPower.
  Reasons for duplication: although players are duplicated in documents, they are not duplicated under the pair of coach and team.
  Results if query interpretations are not differentiated: answers: teams (count player = 153, 236, 512, 164 resp.).

QB8
  XPower results: provide the number of players and those of coaches for each team.
  Results if duplication is not filtered: count player: different; count team: same.
  Reasons for duplication: if a team is duplicated, all of its players are duplicated.
  Results if query interpretations are not differentiated: same results as XPower.

Explanation for the results when query interpretations are not differentiated: the results of all interpretations are mixed.

7.6.2 Impact of query interpretation due to keyword ambiguity

Table 7.3 shows the results for each query on Basketball in three different scenarios: (1) XPower, considering both query interpretation and duplication, (2) considering only query interpretation but not duplication, and (3) considering only duplication but not differentiating query interpretations. We show the results and explanations for Basketball only, because DBLP has similar results. We also describe whether duplication and keyword ambiguity affect the results of each query in Table 7.1. As can be seen, keyword ambiguity affects the correctness of the results of all queries in DBLP, and of five out of seven queries in Basketball (not considering queries with no answer). This verifies the importance of differentiating query interpretations.
Otherwise, the results of all query interpretations are mixed together.

7.6.3 Impact of duplication

As we can see in Table 7.1, duplication affects the correctness of the results for four out of eight queries in both DBLP and Basketball. This is fewer than the number of queries affected by ambiguity, but it is still significant. This supports our argument about the importance of detecting duplication; otherwise, the results of aggregate functions would not be correct. Fewer queries are affected by duplication than by ambiguity because, in Basketball, coaches are not duplicated; only teams and players can be duplicated. In DBLP, papers are not duplicated either. Therefore, there is no impact on the functions count coach in Basketball and count paper in DBLP. Moreover, duplication does not affect the max and min functions, as in QB5 and QB6.

7.6.4 Efficiency Evaluation

Figure 7.8 shows the response time of XPower (abbreviated as XP) and XKSearch (abbreviated as XK) for the tested queries, except those (QB3 and QD2) for which XPower does not provide any answer. Since XKSearch does not support group-by and aggregate functions, we dropped the reserved words of the tested queries when running XKSearch. Although XPower has the overhead of processing group-by and aggregate functions, its response times are similar to those of XKSearch. This is because XPower does not find all SLCAs, since many SLCAs do not correspond to any intermediate answer. For queries with complicated group-by and aggregate functions (e.g., QB1, QB7, QB8, QD3, QD4 and QD7), the overhead of processing those functions makes XPower run slightly slower than XKSearch. The response time of XPower is dominated by that of the Answer Finder. The Aggregate Calculator costs more than the Group-by Classifier because it needs to detect duplication.
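
As a rough illustration of why duplication distorts count but not max (Section 7.6.3), the following Python sketch mimics the QB1 situation reported in Table 7.3, where a team coached by 13 coaches has each of its 215 players repeated 13 times. It is not XPower's implementation: the flat tuple layout, the identifiers, the year values and the helper names are all hypothetical, and it simply assumes that every object carries an identifier by which duplicates can be recognized (the thesis derives such identifiers from the ORA-semantics).

```python
# Hypothetical miniature of QB1 in Table 7.3: one team appears once under
# each of its 13 coaches, so every player of that team occurs 13 times.
NUM_COACHES = 13
NUM_PLAYERS = 215

# Each occurrence is (coach_id, player_id, year); the years are made up.
occurrences = [
    (f"coach{c}", f"player{p}", 1980 + p % 25)
    for c in range(NUM_COACHES)
    for p in range(NUM_PLAYERS)
]


def count_players(rows, deduplicate):
    """Count player occurrences, optionally collapsing duplicated objects."""
    ids = [player_id for _, player_id, _ in rows]
    return len(set(ids)) if deduplicate else len(ids)


def max_year(rows, deduplicate):
    """Return the maximum year; repeated objects cannot change a maximum."""
    if deduplicate:
        unique = {(player_id, year) for _, player_id, year in rows}
        return max(year for _, year in unique)
    return max(year for _, _, year in rows)


print(count_players(occurrences, deduplicate=False))  # 2795
print(count_players(occurrences, deduplicate=True))   # 215
print(max_year(occurrences, False), max_year(occurrences, True))  # 2004 2004
```

The naive count multiplies the true count by the number of coaches, whereas max is unchanged, which matches the observation that only some aggregate functions are sensitive to duplication.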
Figure 7.8: Efficiency comparison of XPower (XP) and XKSearch (XK) on (a) Basketball and (b) DBLP, dropping reserved words of tested queries when running XKSearch. The bars show response time (ms), broken down into Answer Finder, Group-by Classifier and Aggregate Calculator.

7.7 Conclusion

We proposed an approach to support queries with group-by and aggregate functions, including sum, max, min, avg and count, for querying a data-centric XML document through a simple keyword interface. We processed query interpretations separately so as not to mix the results of different query interpretations. To perform aggregate functions correctly, we detected duplication of objects and relationships; otherwise, the results of aggregate functions may be wrong. We exploited the ORA-semantics to generate interpretations of a query and to detect duplicated objects and duplicated relationships. Without the ORA-semantics, group-by and aggregate functions cannot be answered correctly. Experimental results on real datasets showed the improvement offered by our approach and the importance of detecting duplication and differentiating query interpretations for the correctness of aggregate functions. These results also showed that, thanks to our optimization techniques, our approach is almost as efficient as the LCA-based approaches although it has some overhead.

Chapter 8
Conclusion

8.1 Conclusion

As XML has become more and more popular, keyword search in XML data has attracted a lot of research interest. Besides structure, XML data also contains the semantics of objects, relationships between or among objects, and their attributes (referred to as the ORA-semantics). However, existing works rely only on the structure of XML and ignore such semantics.
This causes many problems in XML keyword search, including the problems of meaningless answers, missing answers, duplicated answers, incomplete answers and schema-dependent answers. Moreover, existing approaches cannot handle queries with group-by or aggregate functions.

In this thesis, we have exploited the ORA-semantics in XML keyword search to solve the problems of existing approaches and to improve both effectiveness and efficiency on XML trees and XML graphs. Moreover, the ORA-semantics enables us to support more expressive queries with group-by and aggregate functions, and to be independent of schema designs. Without the ORA-semantics, we cannot achieve these results. We summarize the existing XML keyword search (KWS) and our XML KWS in Figure 8.1 and Figure 8.2 respectively.

Figure 8.1: Existing XML keyword search

Figure 8.2: Our XML keyword search

In summary, we have made four main contributions. The relationship of our contributions is described in Figure 8.3. We have studied keyword queries ranging from simple queries with no group-by or aggregate functions to expressive queries with group-by and aggregate functions; from a single XML document to multiple XML documents sharing the same content; and a family of problems for both XML trees and XML graphs. Our four main contributions can be briefly summarized as follows.
Figure 8.3: The relationship of our contributions. The figure relates the problems of XML keyword queries (XkwQ), namely meaningless, missing, duplicated, incomplete and schema-dependent answers, to our four contributions: Chapter 4 (XML tree; NCON, reversed O-tree), Chapter 5 (XML graph; XML IDREF graph, hierarchy), Chapter 6 (multiple XML documents for the same data content; Common Relative, schema-independent answers) and Chapter 7 (expressive XkwQ with group-by and aggregate functions; query interpretation, object and relationship duplication).

In Chapter 4, we worked with data-centric XML documents with no IDREF. We exploited the ORA-semantics to solve the problems of meaningless answers, missing answers, incomplete answers and duplicated answers of the tree-based approaches. We proposed a new semantics, called Nearest Common Object Node (NCON), which considers not only common ancestors but also common descendants in order to find missing answers. The new semantics also returns only object nodes to avoid meaningless answers. We then proposed an approach to find NCONs by using the O-tree and the reversed O-tree.

In Chapter 5, we generalized the semantics and techniques of Chapter 4 to work with data-centric XML documents that contain IDREFs and may have duplicated objects as well. Specifically, we expanded the NCON semantics to deal with reference edges and exploited the hierarchical structure of the XML IDREF graph, our data model, to facilitate the search. The hierarchical structure of the XML IDREF graph enables us to have an algorithm almost as efficient as the algorithms of the LCA-based approaches. In contrast, without awareness of the hierarchy of the XML IDREF graph, the search process is generally NP-hard.

In Chapter 6, we considered the case where several XML documents represent the same content.
We proposed a new semantics and techniques to provide the same query answers for different schema designs of the same data content. The new semantics, called Common Relative (CR), takes into account not only common ancestors and common descendants to answer a keyword query, but also common relatives, which are common ancestors in some equivalent schemas. The semantics proposed in Chapter 4 and Chapter 5 become parts of the CR semantics.

In Chapter 7, we supported expressive XML keyword queries with group-by and aggregate functions. We exploited the ORA-semantics to detect duplicated objects and duplicated relationships, and to generate interpretations of a query. We currently only support group-by and aggregate functions for XML keyword search over documents with no IDREF. We leave the case of documents with IDREFs for future study.

8.2 Future work

In the future, we will extend our support for group-by and aggregate functions in the following respects. First, we will handle predicates, comparison functions and range search efficiently. Second, we will handle more cases where group-by parameters and aggregate function parameters do not have ancestor-descendant relationships. Third, we will deal with XML documents with IDREFs.

Handling predicates, comparison functions and range search. Many queries aim to find objects matched by a tag name rather than by a value. For example, the query {student John, count course} aims to count the courses taken by student John. Generally, a tag name keyword denotes either a predicate/description name or an output name. Though we can associate a tag name with all objects belonging to the corresponding object class to process a keyword query, it is much more efficient to have techniques to handle the two cases separately. Moreover, users may want to issue comparison and range searches as well. For example, users may want to count students with a GPA of at least 4.5 (comparison) or with a GPA between 3.4 and 4.5 (range search).
These queries can be expressed as {count student, GPA ≥ 4.5} and {count student, 3.4 ≤ GPA ≤ 4.5}. Such queries will be handled in my future work.

Handling the cases where group-by parameters and aggregate function parameters do not have ancestor-descendant relationships. We currently handle group-by functions based only on the ancestor-descendant relationships between the group-by parameters and the aggregate function parameters. As such, if users want to count the co-authors of Yi Chen in DBLP, where Yi Chen is an author, our current techniques cannot handle the query because Yi Chen (an author) cannot have an ancestor-descendant relationship with other authors. This is similar to the LCA-based approaches, in the sense that the relationship between the returned node and the matching nodes in those approaches can only be an ancestor-descendant relationship. Therefore, to answer such a query, we need to inherit some techniques from our third contribution on schema-independent XML keyword search, in which such relationships can go beyond the ancestor-descendant relationship.

Dealing with XML documents with IDREFs. Our fourth contribution, which supports group-by and aggregate functions for XML keyword search, can currently only deal with XML documents with no IDREF. Therefore, extending the techniques to XML documents with IDREFs is another future direction.

Bibliography

[1] S. Agrawal, S. Chaudhuri, and G. Das. DBXPlorer: A system for keyword-based search over relational databases. In ICDE, 2002.
[2] D. Alin, F. Mary, and S. Dan. Storing semistructured data with STORED. In SIGMOD, 1999.
[3] B. Andrey and P. Yannis. Storing and querying XML data using denormalized relational databases. VLDB J., 2005.
[4] Z. Bao, T. W. Ling, B. Chen, and J. Lu. Efficient XML keyword search with relevance oriented ranking. In ICDE, 2009.
[5] Z. Bao, J. Lu, T. W. Ling, L. Xu, and H. Wu. An effective object-level XML keyword search. In DASFAA, 2010.
[6] Z. Bao, J. Lu, T. W. Ling, L. Xu, and H. Wu. An effective object-level XML keyword search. In DASFAA, 2010.
[7] K. Benny and S. Yehoshua. Finding and approximating top-k answers in keyword proximity search. In PODS, 2006.
[8] A. Berglund, D. Chamberlin, M. F. Fernandez, M. Kay, J. Robie, and J. Simeon. XML path language (XPath) 2.0. W3C Working Draft, 2003.
[9] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, 2002.
[10] S. Boag, D. Chamberlin, M. F. Fernandez, D. Florescu, J. Robie, and J. Simeon. XQuery 1.0: An XML query language. W3C Working Draft, 2003.
[11] P. Bohannon, J. Freire, P. Roy, and J. Siméon. From XML schema to relations: A cost-based approach to XML storage. In ICDE, 2002.
[12] J. Camacho-Rodriguez, D. Colazzo, and I. Manolescu. Building large XML stores in the amazon cloud. In ICDEW, 2012.
[13] L. J. Chen and Y. Papakonstantinou. Supporting top-k keyword search in XML databases. In ICDE, 2010.
[14] S. Chong, C.-Y. Chan, and G. A. K. Multiway SLCA-based keyword search in XML data. In WWW, 2007.
[15] E. Chu, A. Baid, X. Chai, A. Doan, and J. Naughton. Combining keyword search and forms for ad hoc querying of databases. In SIGMOD, 2009.
[16] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSEarch: A semantic search engine for XML. In VLDB, 2003.
[17] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding top-k min-cost connected trees in database. In ICDE, 2007.
[18] S. E. Dreyfus and R. A. Wagner. The Steiner problem in graphs. Networks, 1971.
[19] J. Fong, H. K. Wong, and Z. Cheng. Converting relational database into XML documents with DOM. Information & Software Technology, 2003.
[20] C. Gokhale, N. Gupta, P. Kumar, L. V. S. Lakshmanan, R. Ng, and B. A. Prakash. Complex group-by queries for XML. In ICDE, 2007.
[21] K. Golenberg, B. Kimelfeld, and Y. Sagiv. Keyword proximity search in complex data graphs. In SIGMOD, 2008.
[22] K. Golenberg, B. Kimelfeld, and Y. Sagiv. Keyword proximity search in complex data graphs. In SIGMOD, 2008.
[23] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked keyword search over XML documents. In SIGMOD, 2003.
[24] H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: ranked keyword searches on graphs. In SIGMOD, 2007.
[25] J. Hegewald, F. Naumann, and M. Weis. XStruct: Efficient schema extraction from multiple and large XML documents. In ICDE Workshops, 2006.
[26] V. Hristidis, N. Koudas, Y. Papakonstantinou, and D. Srivastava. Keyword proximity search in XML trees. TKDE, 2006.
[27] V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases. In PVLDB, 2002.
[28] V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword proximity search on XML graphs. In ICDE, 2003.
[29] Y. Huang, Z. Liu, and Y. Chen. Query biased snippet generation in XML search. In SIGMOD, 2008.
[30] D. F. Inria, D. Florescu, and D. Kossmann. Storing and querying XML data using an rdmbs. IEEE Data Engineering Bulletin, 1999.
[31] H. V. Jagadish and S. AL-Khalifa. Timber: A native XML database. Technical report, University of Michigan, 2002.
[32] M. Jayapandian and H. V. Jagadish. Automated creation of a forms-based database query interface. Proc. VLDB Endow.
[33] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, and R. D. Hrishikesh Karambelkar. Bidirectional expansion for keyword search on graph databases. In VLDB, 2005.
[34] M. Kargar and A. An. Keyword search in graphs: finding r-cliques. PVLDB, 2011.
[35] J. Kim, D. Jeong, and D.-K. Baik. A translation algorithm for effective RDB-to-XML schema conversion considering referential integrity information. Journal Inf. Sci. Eng., 2009.
[36] B. Kimelfeld and Y. Sagiv. Finding and approximating top-k answers in keyword proximity search. In PODS, 2006.
[37] L. Kong, R. Gilleron, and A. L. Mostrare. Retrieving meaningful relaxed tightest fragments for XML keyword search. In EDBT, 2009.
[38] G. Konstantin, K. Benny, and S. Yehoshua. Keyword proximity search in complex data graphs. In SIGMOD, 2008.
[39] T. N. Le, Z. Bao, and T. W. Ling. Schema-independent XML keyword search. In ER, 2014.
[40] T. N. Le, Z. Bao, T. W. Ling, and G. Dobbie. Group-by and aggregate functions in XML keyword search. In DEXA, 2014.
[41] T. N. Le, T. W. Ling, H. V. Jagadish, and J. Lu. Object semantics for XML keyword search. In DASFAA, 2014.
[42] T. N. Le, H. Wu, T. W. Ling, L. Li, and J. Lu. From structure-based to semantics-based: Effective XML keyword search. In ER, 2013.
[43] T. N. Le, Z. Zeng, and T. W. Ling. Finding missing answers due to object duplication in XML keyword search. In DEXA, 2014.
[44] G. Li, J. Feng, J. Wang, and L. Zhou. Effective keyword search for valuable LCAs over XML documents. In CIKM, 2007.
[45] G. Li, B. C. Ooi, J. Feng, J. Wang, and L. Zhou. EASE: Efficient and adaptive keyword search on unstructured, semi-structured and structured data. In SIGMOD, 2008.
[46] J. Li, C. Liu, R. Zhou, and W. Wang. Suggestion of promising result types for XML keyword search. In EDBT, 2010.
[47] L. Li, T. N. Le, H. Wu, T. W. Ling, and S. Bressan. Discovering semantics from data-centric XML. In DEXA, 2013.
[48] Y. Li, C. Yu, and H. V. Jagadish. Schema-free XQuery. In VLDB, 2004.
[49] T. W. Ling, T. N. Le, and Z. Zeng. Semantics-based keyword search over XML and relational databases. In SoICT, 2013.
[50] T. W. Ling, T. N. Le, and Z. Zeng. Towards an intelligent keyword search over XML and relational databases. In BigComp, 2014.
[51] T. W. Ling, M. L. Lee, and G. Dobbie. Semistructured Database Design. Springer-Verlag, 2004.
[52] X. Liu, C. Wan, and L. Chen. Returning clustered results for keyword search on XML documents. TKDE, 2011.
[53] Z. Liu and Y. Chen. Identifying meaningful return information for XML keyword search. In SIGMOD, 2007.
[54] Z. Liu and Y. Chen. Reasoning and identifying relevant matches for XML keyword search. In PVLDB, 2008.
[55] Z. Liu and Y. Chen. Return specification inference and result clustering for keyword search on XML. In TODS, 2010.
[56] Z. Liu, P. Sun, and Y. Chen. Structured search result differentiation. VLDB, 2009.
[57] J. Lu, P. Senellart, C. Lin, X. Du, S. Wang, and X. Chen. Optimal top-k generation of attribute combinations based on ranked lists. In SIGMOD, 2012.
[58] I. Manolescu, D. Florescu, D. Kossmann, F. Xhumari, and D. Olteanu. Agora: Living with XML and relational. In VLDB, 2000.
[59] A. Nandi and H. V. Jagadish. Qunits: queried units in database search. In CIDR, 2009.
[60] K. Nguyen and J. Cao. Exploit keyword query semantics and structure of data for effective XML keyword search. In ADC, 2010.
[61] Z. Peng, J. Zhang, S. Wang, and L. Qin. Treecluster: Clustering results of keyword search over databases. In WAIM, 2006.
[62] J. Qin, S. Zhao, S. Yang, and W. Dou. Efficient storing well-formed XML documents using RDBMS. Int'l Conference on Services Systems and Services Management, 2005.
[63] L. Qin, J. X. Yu, L. Chang, and Y. Tao. Querying communities in relational databases. In ICDE, 2009.
[64] D. Raggett, A. L. Hors, and I. Jacobs. HTML 4.01 specification. Technical report, W3C - World Wide Web Consortium, 1999.
[65] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In VLDB, 1999.
[66] N. Tang, J. X. Yu, M. T. Özsu, B. Choi, and K.-F. Wong. Multiple materialized view selection for XPath query rewriting. In ICDE, 2008.
[67] Y. Tao, S. Papadopoulos, C. Sheng, and K. Stefanidis. Nearest keyword search in XML documents. In SIGMOD, 2011.
[68] Y. Tao and J. X. Yu. Finding frequent co-occurring terms in relational keyword search. In EDBT, 2009.
[69] S. Tata and G. M. Lohman. SQAK: doing more with keywords. In SIGMOD, 2008.
[70] A. Termehchy and M. Winslett. Effective, design-independent XML keyword search. In CIKM, 2009.
[71] A. Termehchy and M. Winslett. EXTRUCT: using deep structural information in XML keyword search. PVLDB, 2010.
[72] A. Termehchy and M. Winslett. Using structural information in XML keyword search effectively. TODS, 2011.
[73] B. Tim, P. Jean, S.-M. C. M., M. Eve, and Y. François. Extensible markup language (XML) 1.0. Technical report, W3C, 2008.
[74] B. Q. Truong, S. S. Bhowmick, C. E. Dyreson, and A. Sun. MESSIAH: missing element-conscious SLCA nodes search in XML data. In SIGMOD, 2013.
[75] H. Wu and Z. Bao. Object-oriented XML keyword search. In ER, 2011.
[76] H. Wu, T. W. Ling, L. Xu, and Z. Bao. Performing grouping and aggregate functions in XML queries. In WWW, 2009.
[77] P. Wu, Y. Sismanis, and B. Reinwald. Towards keyword-driven analytical processing. In SIGMOD, 2007.
[78] Y. Xu and Y. Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In SIGMOD, 2005.
[79] Y. Xu and Y. Papakonstantinou. Efficient LCA based keyword search in XML data. In EDBT, 2008.
[80] Y. Zeng, Z. Bao, H. V. Jagadish, T. W. Ling, and G. Li. Breaking out of the mismatch trap. In ICDE, 2014.
[81] Z. Zeng, Z. Bao, T. N. Le, M.-L. Lee, and T. W. Ling. Expressq: Enhancing the expressive power and evaluation for relational queries. In CIKM, 2014.
[82] Z. Zeng, Z. Bao, M.-L. Lee, and T. W. Ling. A semantic approach to keyword search over relational databases. In ER, 2013.
[83] N. Zhang, V. Kacholia, and M. T. Özsu. A succinct physical storage scheme for efficient evaluation of path queries in XML. In ICDE, 2004.
[84] J. Zhou, Z. Bao, W. Wang, T. W. Ling, Z. Chen, X. Lin, and J. Guo. Fast SLCA and ELCA computation for XML keyword queries based on set intersection. In ICDE, 2012.
[85] R. Zhou, C. Liu, and J. Li. Fast ELCA computation for keyword queries on XML data. In EDBT, 2010.
[86] L. Zou, L. Chen, and M. T. Özsu. Distancejoin: Pattern match query in a large graph database. PVLDB.
... issue a structured query. XML keyword search can eliminate these limitations. Given a set of keywords in a keyword query, XML keyword search aims to find the information in the corresponding XML document that is most relevant to the input keywords. Owing to the flexibility and simplicity of keyword queries, XML keyword search has gained substantial interest. Approaches to XML keyword search can be classified into ...

... tree. Therefore, the common ancestors from both the original and reversed data tree provide the set of NCONs for a keyword query.

Contribution 2: Using ORA-semantics in Keyword Search over XML Graph. When an XML document contains IDREFs, it can no longer be modeled as a tree and is modeled as a graph instead. Applying the NCON semantics for keyword search over an XML graph is challenging because searching ...

... min, sum, count, avg for XML keyword search.
• Chapter 8 presents future directions and concludes the thesis.

Chapter 2 Related Work
In this chapter, we review the related work. We mainly focus on the topics of defining semantics for XML keyword search and the corresponding algorithms to find answers based on these semantics. We classify existing works for XML keyword search into two main ...

... provide information. As such, an XML document contains more meaningful structural and semantic information than an HTML document. This property of XML helps searches over XML documents give more accurate answers. Thus, XML has become a standard format for data representation and exchange over the Internet, and it has wide applications such as electronic business (http://www.ebxml.org), ...
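To make the idea of keyword search over an XML tree concrete, here is a minimal sketch of one LCA-based semantics, the smallest LCA (SLCA) of [78]. It is an illustration only, not the ORA-semantics or NCON algorithm of the thesis. The function name `slca`, the sample document `DOC`, and its element names are hypothetical. Matching keywords against whitespace-separated tokens of each element's tag and direct text is also a simplifying assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = """
<league>
  <team><name>Lakers</name><player><name>Bryant</name></player></team>
  <team><name>Celtics</name><player><name>Pierce</name></player></team>
</league>
"""

def slca(root, keywords):
    """Return the elements whose subtree contains every keyword while
    no child subtree does (smallest LCA nodes), in post-order."""
    wanted = {k.lower() for k in keywords}
    results = []

    def visit(node):
        # Tokens contributed by this node itself: its tag and direct text.
        own = (node.tag + " " + (node.text or "")).lower().split()
        found = wanted.intersection(own)
        child_has_all = False
        for child in node:
            sub_found, sub_all = visit(child)
            found |= sub_found
            child_has_all = child_has_all or sub_all
        has_all = found == wanted
        # Smallest: covers all keywords, but no child subtree already does.
        if has_all and not child_has_all:
            results.append(node)
        return found, has_all

    visit(root)
    return results

root = ET.fromstring(DOC)
print([n.tag for n in slca(root, ["Lakers", "Bryant"])])   # ['team']
print([n.tag for n in slca(root, ["Bryant", "Pierce"])])   # ['league']
```

The second query shows why purely structural semantics can be unsatisfying: the only node covering both player names is the document root, which returns the whole document rather than a meaningful object.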
... demonstrates that using ORA-semantics to process XML keyword queries yields substantial benefits in both effectiveness and efficiency. This result is useful for future research and applications in XML keyword search.

List of Tables
2.1 Our summary on the LCA-based semantics
2.2 Summary of the discussed XML keyword queries
3.1 Concepts of the ORA-semantics ...

... in XML Keyword Search”, International Conference on Database and Expert Systems Applications (DEXA), full research paper, 2014 [43]
• [DEXA14 2]: Thuy Ngoc Le, Zhifeng Bao, Tok Wang Ling, Gillian Dobbie, “Group-by and Aggregate Functions in XML Keyword Search”, DEXA, full research paper, 2014 [40]
• [DASFAA14]: Thuy Ngoc Le, Tok Wang Ling, H. V. Jagadish, Jiaheng Lu, “Object Semantics for XML keyword ...

... comparison of XPower and XKSearch on Basketball and DBLP (dropping reversed words of tested queries when running XKSearch)
8.1 Existing XML keyword search
8.2 Our XML keyword search
8.3 The relationship of our contributions

Chapter 1 Introduction
1.1 Background on XML and XML Keyword Search
Since the World ...

... what they want to search for. Thus, for a query, they expect the same answers from different XML documents sharing the same content. However, with existing approaches, different schema designs for the same data content may produce different answers for the same query. Finally, we study how to support group-by and aggregate functions in XML keyword search. It goes beyond the simple keyword query, and ...

... represented as XML. XML keyword search has attracted a lot of interest because it provides a simple and user-friendly interface to query XML documents. Existing approaches for XML keyword search can be classified into two types, tree-based approaches and graph-based approaches, based on whether the considered XML document is modeled as a tree or a graph. Commonly, the tree-based approaches are for XML documents ...
... even in the cloud [12]. As a result, XML has attracted a huge amount of interest in both research and industry, with a wide range of topics such as XML storage, twig pattern query processing, query optimization, XML view, and XML keyword search. There have been several XML database systems such as Timber [31], Oracle XML DB, MarkLogic Server, and the Toronto XML Engine. XML permits a node to refer to an ...

... Topics Related to XML Keyword Search
2.3.1 Using semantics in existing XML Keyword Search
2.3.2 Group-by and Aggregate Functions in XML keyword Search ...

... LCA-based Semantics
2.2 Graph-based XML Keyword Search
2.2.1 Subtree based Semantics for Directed Graphs
2.2.2 Subgraph based Semantics for ...