Querying and updating XML data based on node labeling schemes

QUERYING AND UPDATING XML DATA BASED ON NODE LABELING SCHEMES

LI CHANGQING
(Master of Engineering, Peking University, China)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2005

Acknowledgements

First of all, I gratefully acknowledge the persistent support and encouragement of my supervisor, Professor Ling Tok Wang. Prof Ling patiently guided and advised me throughout the various phases of my research. His meticulousness greatly impressed me and taught me to think thoroughly and carefully. Not only did Prof Ling provide constant academic guidance for my research, he also gave me suggestions on how to overcome the difficulties I met in my life. There is a famous Chinese saying: “One day's teacher is your father for your whole life”. To me, Prof Ling is a great supervisor and a second father.

I wish to express my deep gratitude to Dr Ang Chuan Heng and Dr Chan Chee Yong for serving on my thesis evaluation committee. I thank them for going through such a long document and giving me valuable feedback. Their comments on my thesis are precious. Great thanks also go to all the reviewers who have read or will read this thesis.

It is also my pleasure to thank Dr Lee Mong Li and Dr Wynne Hsu, who gave me the chance to do research together with them. Their guidance and suggestions are important to my future research. I also thank Dr Gary Tan Soon Huat, who gave me valuable suggestions on my research. The several months that I worked together with him gave me an unforgettable research experience.

I also want to thank all the academic and administrative staff in the School of Computing, the Registrar's Office, and the Office of Student Affairs of the National University of Singapore for their help in different areas of my life in these years.
In my lab, I must acknowledge the support and friendship I received from so many friends: Wu Xiaodong, Lu Jiaheng, Chen Ting, Ni Wei, He Qi, Chen Zhuo, Chen Yabing, Yang Xia, Jiao Enhua, Yu Tian, Zhang Wei, Xia Chenyi, Xiang Shili, Li Yingguang, Ni Yuan, Cheng Weiwei, Hu Jing, and many others not named here.

On a personal note, it is important for me to thank my wife, Hu, for her love and support during my Ph.D. study and for her bravery in giving birth to our baby in July 2005, which has made our life happy. I am also grateful to my parents for their efforts to bring me up and provide me with the best possible education, and to my parents-in-law for their help in taking care of my wife.

Summary

The method of assigning labels to the nodes of an XML tree is called a node labeling (or numbering) scheme. Based on the labels alone, both ordered and un-ordered queries can be processed without accessing the original XML file. The core issue for XML queries is to efficiently determine four basic relationships: ancestor-descendant (A-D), parent-child (P-C), sibling, and ordering. The existing node labeling schemes, i.e. the containment, prefix and prime number schemes, cannot determine all four relationships efficiently. For instance, the containment scheme is very inefficient at determining the sibling relationship: it must first search for the parent of a node and then decide whether another node is a child of this parent, and the search for the parent requires many expensive parent-child determinations. The prefix scheme determines all four basic relationships efficiently if the XML tree is shallow; however, as the XML tree becomes deeper, the labels of the prefix scheme become longer, comparisons of node labels become expensive, and the scheme becomes inefficient.
The prime number scheme has a very large label size, and it relies on modular and division operations to determine the relationships, which is expensive. Thus in this thesis, we first propose the P-Containment scheme, which can determine all four basic relationships efficiently regardless of the XML structure. In addition, P-Containment is used to process internal node updates efficiently and to completely avoid re-labeling.

Another important issue for a labeling scheme is how to process updates when nodes are inserted into or deleted from the XML tree. All the existing node labeling schemes, i.e. the containment, prefix and prime number schemes, have high update costs. Therefore in this thesis we propose a novel Compact Dynamic Binary String (CDBS) encoding to encode the labels of different labeling schemes; based on CDBS encoding, updates can be processed efficiently. CDBS encoding has two important properties which form the foundations of this thesis: (1) CDBS compares codes in lexicographical order, and a new code can be inserted between any two consecutive CDBS codes with the order kept and without re-encoding the existing codes; (2) CDBS is orthogonal to specific labeling schemes, e.g. the containment, prefix and prime number schemes, so it can be applied broadly to different labeling schemes or other applications to process updates efficiently. Moreover, because the fixed-size length field of CDBS will encounter the overflow problem, we improve CDBS to the Compact Dynamic Quaternary String (CDQS) encoding. Though the label size and the update cost of CDQS are larger, it completely avoids re-labeling in XML updates no matter which labeling scheme the XML data employs. We report experimental results showing that the CDBS and CDQS encodings are superior to previous approaches to processing updates in terms of the number of nodes to re-label (none for CDQS) and the time for updating.
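Property (1) above can be illustrated with a short Python sketch. It assumes that codes are strings of '0' and '1' that end in '1' and are compared as ordinary strings, which gives the lexicographical order. The function name insert_between and the length-based rule it applies are assumptions chosen for illustration; this text does not confirm that they match the thesis's own CDBS insertion algorithm.

```python
def insert_between(left: str, right: str) -> str:
    """Return a binary string that sorts strictly between left and right.

    Assumption: both codes end in '1' and left < right lexicographically.
    An empty string means "no neighbour on that side".
    """
    if len(left) >= len(right):
        # left is a prefix of the result, so the result sorts after left;
        # left and right already differ at an earlier bit, so it stays before right.
        return left + "1"
    # Rewrite the final '1' of right as '01'; the result sorts just before right.
    return right[:-1] + "01"


codes = ["01", "1"]
codes.insert(1, insert_between(codes[0], codes[1]))
print(codes)                   # ['01', '011', '1']
print(codes == sorted(codes))  # True: the existing codes were not changed
```

In this sketch the two neighbouring codes are left untouched, and the price is that the new code is one bit longer than the neighbour it was derived from.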
When the P-Containment scheme is combined with the CDBS encoding (for intermittent updates and uniformly frequent updates) or the CDQS encoding (to completely avoid re-labeling), both queries and updates can be processed efficiently.

Table of Contents

Acknowledgements
Summary
Chapter 1 Introduction
    1.1 Background
        1.1.1 XML
        1.1.2 XML Technologies
        1.1.3 XML Query
        1.1.4 XML Update
    1.2 Problem Statement and Motivation
    1.3 Overview of Contributions
    1.4 Organization of Thesis
Chapter 2 Background and Related Works
    2.1 Node Labeling Schemes
        2.1.1 Containment Labeling Scheme
        2.1.2 Prefix Labeling Scheme
        2.1.3 Prime Labeling Scheme
    2.2 Encoding Approaches to Store the Labels of Labeling Schemes
        2.2.1 Binary Number Encodings
        2.2.2 UTF8 Encoding
        2.2.3 OrdPath Encodings
        2.2.4 Binary String and Quaternary String Encodings
    2.3 Summary
Chapter 3 P-Containment Scheme
    3.1 A Node Labeling Scheme: P-Containment Scheme
    3.2 Summary
Chapter 4 CDBS Encoding of Node Labels to Efficiently Process XML Updates
    4.1 Lexicographical Order for Binary Strings
    4.2 The Compact Dynamic Binary String Encoding (CDBS)
        4.2.1 CDBS Encoding Algorithm
        4.2.2 Size Analysis
    4.3 Applying CDBS to Different Labeling Schemes
    4.4 Processing of XML Updates Based on Different Labeling Schemes Encoded with CDBS
        4.4.1 Leaf Node Updates
        4.4.2 Internal Node Updates
        4.4.3 Subtree Updates
        4.4.4 Uniformly and Skewed Frequent Updates
    4.5 Experimental Evaluation and Comparisons
        4.5.1 Experimental Setup
        4.5.2 Performance Study on Static XML Data
        4.5.3 Performance Study on Intermittent Updates in Dynamic XML Data
        4.5.4 Summary of Experimental Results
    4.6 Summary
Chapter 5 CDQS Encoding of Node Labels to Completely Avoid Re-labeling
    5.1 The Compact Dynamic Quaternary String Encoding (CDQS) for Node Labels
        5.1.1 CDQS Encoding Algorithm
        5.1.2 Size Analysis
    5.2 Applying CDQS to Different Labeling Schemes
    5.3 Completely Avoiding Re-labeling in XML Updates
    5.4 Extensions of CDQS
    5.5 Experimental Evaluation and Comparisons
        5.5.1 Performance Study on Static XML Data
        5.5.2 Performance Study on Frequent Updates in Dynamic XML Data
        5.5.3 Performance Study on CDOS and CDHS
    5.6 Summary
Chapter 6 Controlling the Increase in Label Size
    6.1 Finding the Codes with the Smallest Size between Two Codes
    6.2 Handling Insertion Skew
    6.3 Experimental Evaluation
        6.3.1 Comparisons of Algorithm 4.1 and Algorithm 6.1
        6.3.2 Processing the Skewed Insertion
    6.4 Summary
Chapter 7 Conclusion
    7.1 Summary of Contributions
    7.2 Future Works
Appendices
    Appendix A: Meanings of Abbreviations
    Appendix B: Calculation of the SC Value for Prime Scheme
    Appendix C: Size Calculations for V-CDBS and CDQS
        C1: Calculation of the Total Code Size for V-CDBS
        C2: Calculation of the Total Code Size for CDQS
    Appendix D: Calculation of the Positions Based on V-CDBS
    Appendix E: Publications During Ph.D. Period
Bibliography

Appendix D: Calculation of the Positions Based on V-CDBS

In this appendix, we show how to calculate positions from V-CDBS codes, using the following example.

Example A2. The V-CDBS code “01001” in Table 4.1 corresponds to the 6th number. We show how to calculate this from the code “01001” and the total number 18 (see Table 4.1). The search starts from the interval between 0 and 19, i.e. one position beyond each end of the 18 numbers, and round() rounds halves up, e.g. round(2.5) = 3.

The first bit “0” indicates that “01001” belongs to the first half, i.e. between 0 and 10 (10 = 0 + round((19-0)/2)).
The second bit “1” indicates that “01001” belongs to the second half of 0 and 10, i.e. between 5 (5 = 0 + round((10-0)/2)) and 10.
The third bit “0” indicates that “01001” belongs to the first half of 5 and 10, i.e. between 5 and 8 (8 = 5 + round((10-5)/2)).
The fourth bit “0” indicates that “01001” belongs to the first half of 5 and 8, i.e. between 5 and 7 (7 = 5 + round((8-5)/2)).
The fifth bit is the last bit, and the last bit is always “1”. The only number between 5 and 7 is 6; therefore “01001” corresponds to number 6.

In this way, the position of each V-CDBS code can be calculated from the code itself and the total number. The position calculation based on CDQS is similar.

Appendix E: Publications During Ph.D. Period

1. Changqing Li, Tok Wang Ling, and Min Hu. Efficient updates in dynamic XML: From Binary String to Quaternary String. Accepted by VLDB Journal, 2006.
2. Changqing Li, Tok Wang Ling, and Min Hu. Efficient Processing of Updates in Dynamic XML Data. In Proc. of the 22nd International Conference on Data Engineering (ICDE), Apr. 2006. Best (Student) Paper Award List (one of the best two student papers; one of the best six papers).
3. Changqing Li, Tok Wang Ling, and Min Hu. Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes. In Proc. of the 11th International Conference on Database Systems for Advanced Applications (DASFAA), Apr. 2006.
4. Changqing Li and Tok Wang Ling. QED: A Novel Quaternary Encoding to Completely Avoid Re-labeling in XML Updates. In Proc. of the 14th International Conference on Information and Knowledge Management (CIKM), Oct. 2005. Student Travel Award.
5. Changqing Li, Tok Wang Ling, Jiaheng Lu, and Tian Yu. On Reducing Redundancy and Improving Efficiency of XML Labeling Schemes. In Proc. of the 14th International Conference on Information and Knowledge Management (CIKM), Oct. 2005. (Poster paper).
6. Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, and Wei Ni. Efficient Processing of Ordered XML Twig Pattern. In Proc. of the 16th International Conference on Database and Expert Systems Applications (DEXA), Aug. 2005.
7. Changqing Li and Tok Wang Ling. An Improved Prefix Labeling Scheme: A Binary String Approach for Dynamic Ordered XML. In Proc. of the 10th International Conference on Database Systems for Advanced Applications (DASFAA), Apr. 2005.
8. Changqing Li and Tok Wang Ling. From XML to Semantic Web. In Proc. of the 10th International Conference on Database Systems for Advanced Applications (DASFAA), Apr. 2005. (Short paper).
9. Changqing Li and Tok Wang Ling. OWL-Based Semantic Conflicts Detection and Resolution for Data Interoperability. In Proc. of the 23rd Int. Conf. on Conceptual Modeling (ER) Workshop LNCS3289, Nov. 2004.
10. Changqing Li and Tok Wang Ling. A Basis for Semantic Web and e-Business: Efficient Organization of Ontology Languages and Ontologies. To appear as a chapter of the book Semantic Web Technologies and eBusiness. Publisher: IDEA GROUP INC., 701 E. Chocolate Avenue, Suite 200, Hershey PA 17033-1240, USA.

The following publications date from my master's degree studies at Peking University, China:

11. Changqing Li, Shiwei Tang, and Hongyan Li. Using Associations to Mine the Thick-Scale E-commerce Personalize Service Information. Journal of Computer Science, Jan. 2002.
12. Changqing Li, Shiwei Tang, and Hongyan Li. The Design of a Whole E-commerce System. Journal of Computer Science, Jun. 2001.
13. Changqing Li, Wenbing Zhao, and Shiwei Tang. A Personalized Service Protocol Based on HTTP. 11th Conference of Computer Networks and Data Communication in China, Oct. 2000.

Bibliography

[1] S. Abiteboul, H. Kaplan, and T. Milo. Compact labeling schemes for ancestor queries. In Proc. of the 12th annual ACM-SIAM Symp. on Discrete Algorithms (SODA’01), pages 547-556, 2001.
[2] S. Abiteboul and V. Vianu. Regular path queries with constraints. In Proc. of the 16th ACM Symp. on Principles of Database Systems (PODS’97), pages 122-133, 1997.
[3] R. Agrawal, A. Borgida, and H.V. Jagadish. Efficient Management of Transitive Relationships in Large Data and Knowledge Bases. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’89), pages 253-262, 1989.
[4] H. Ait-Kaci, R. Boyer, P. Lincoln, and R. Nasr. Efficient implementation of lattice operations. ACM Trans. on Progr. Languages and Systems, 11(1):115-146, 1989.
[5] S. Al-Khalifa, H.V. Jagadish, J.M. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In Proc. of the 18th Int. Conf. on Data Engineering (ICDE’02), pages 141-152, 2002.
[6] T. Amagasa, M. Yoshikawa, and S. Uemura. QRS: A Robust Numbering Scheme for XML Documents. In Proc. of the 19th Int. Conf. on Data Engineering (ICDE’03), pages 705-707, 2003.
[7] J.A. Anderson and J.M. Bell. Number Theory with Application. Prentice-Hall, New Jersey, 1997.
[8] A. Berglund, S. Boag, D. Chamberlin, M.F. Fernandez, M. Kay, J. Robie, and J. Siméon. XML path language (XPath) 2.0. W3C working draft 04, Apr. 2005.
[9] S. Boag, D. Chamberlin, M.F. Fernandez, D. Florescu, J. Robie, and J. Siméon. XQuery 1.0: An XML Query Language. W3C working draft 04, Apr. 2005.
[10] T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler, F. Yergeau, and J. Cowan. Extensible markup language (XML) 1.1. W3C recommendation, Feb. 2004.
[11] D. Brickley and R.V. Guha. Resource Description Framework Schema (RDFS) Specification 1.0. W3C Recommendation, Feb. 2004.
[12] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’02), pages 310-321, 2002.
[13] B. Catania, W.Q. Wang, B.C. Ooi, and X. Wang. Lazy XML Updates: Laziness as a Virtue of Update and Structural Join Efficiency. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’05), 2005.
[14] S. Ceri, S. Comai, E. Damiani, P. Fraternali, S. Paraboschi, and L. Tanca. XML-GL: A graphical language for querying and restructuring XML documents. In Proc. of the 8th Int. World Wide Web Conf. (WWW’99), pages 93-109, 1999.
[15] D. Chamberlin, J. Robie, and D. Florescu. Quilt: An XML query language for heterogeneous data sources. In Int. Workshop on the Web and Databases (WebDB’00), pages 53-62, 2000.
[16] E.C. Chang. Design and Analysis of Algorithms. Module CS3230 of Department of Computer Science, National University of Singapore.
[17] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’05), 2005.
[18] Ting Chen, Tok Wang Ling, and Chee Yong Chan. Prefix Path Streaming: A New Clustering Method for Optimal Holistic XML Twig Pattern Matching. In Proc. of the 15th Int. Conf. on Database and Expert Systems Applications (DEXA’04), pages 801-810, 2004.
[19] Zhuo Chen, Tok Wang Ling, Mengchi Liu, and Gillian Dobbie. XTree for Declarative XML Querying. In Proc. of the 9th Int. Conf. on Database Systems for Advanced Applications (DASFAA’04), pages 100-112, 2004.
[20] S.-Y. Chien, Z. Vagena, D. Zhang, V.J. Tsotras, and C. Zaniolo. Efficient Structural Joins on Indexed XML Documents. In Proc. of the 28th Int. Conf. on Very Large Data Bases (VLDB’02), pages 263-274, 2002.
[21] V. Christophides, D. Plexousakis, M. Scholl, and S. Tourtounis. On labeling schemes for the semantic web. In Proc. of the 12th Int. World Wide Web Conf. (WWW’03), pages 544-555, 2003.
[22] C. Chung, J. Min, and K. Shim. APEX: an adaptive path index for XML data. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’02), pages 121-132, 2002.
[23] E. Cohen, H. Kaplan, and T. Milo. Labeling Dynamic XML Trees. In Proc. of the 21st ACM Symp. on Principles of Database Systems (PODS’02), pages 271-281, 2002.
[24] B. Cooper, N. Sample, M.J. Franklin, G.R. Hjaltason, and M. Shadmon. A fast index for semistructured data. In Proc. of the 27th Int. Conf. on Very Large Data Bases (VLDB’01), pages 341-350, 2001.
[25] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for XML. In Proc. of the 8th Int. World Wide Web Conf. (WWW’99), pages 77-91, 1999.
[26] P.F. Dietz. Maintaining order in a linked list. In Proc. of the 14th Annual ACM Symp. on Theory of Computing (STOC’82), pages 122-127, 1982.
[27] P.F. Dietz and D.D. Sleator. Two algorithms for maintaining order in a list. In Proc. of the 19th Annual ACM Symp. on Theory of Computing (STOC’87), pages 365-372, 1987.
[28] M. Duong and Y. Zhang. A New Labeling Scheme for Dynamically Updating XML Data. In Proc. of the 16th Australasian Database Conference (ADC’05), pages 185-193, 2005.
[29] M. Fernandez and D. Suciu. Optimizing Regular Path Expressions Using Graph Schemas. In Proc. of the 14th Int. Conf. on Data Engineering (ICDE’98), pages 14-23, 1998.
[30] C. Gavoille and D. Peleg. Compact and localized distributed data structures. Journal of Distributed Computing, Special Issue for the Twenty Years of Distributed Computing Research, 2003.
[31] R. Goldman and J. Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In Proc. of the 23rd Int. Conf. on Very Large Data Bases (VLDB’97), pages 436-445, 1997.
[32] T. Grust. Accelerating XPath Location Steps. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’02), pages 109-120, 2002.
[33] F.V. Harmelen, J. Hendler, I. Horrocks, D.L. McGuinness, P.F. Patel-Schneider, and L.A. Stein. OWL Web Ontology Language Reference. W3C Recommendation, 2004.
[34] H. He, J. Xie, J. Yang, and H. Yu. Asymmetric Batch Incremental View Maintenance. In Proc. of the 21st Int. Conf. on Data Engineering (ICDE’05), pages 106-117, 2005.
[35] H. He and J. Yang. Multiresolution Indexing of XML for Frequent Queries. In Proc. of the 20th Int. Conf. on Data Engineering (ICDE’04), pages 683-694, 2004.
[36] http://www.saxproject.org/
[37] http://www.w3.org/DOM/
[38] http://www.w3.org/XML/Schema
[39] H. Jiang, H. Lu, W. Wang, and B.C. Ooi. XR-Tree: Indexing XML Data for Efficient Structural Joins. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’03), pages 253-263, 2003.
[40] Enhua Jiao, Tok Wang Ling, and Chee Yong Chan. PathStack¬: A Holistic Path Join Algorithm for Path Query with not-predicates on XML Data. In Proc. of the 10th Int. Conf. on Database Systems for Advanced Applications (DASFAA’05), pages 113-124, 2005.
[41] H. Kaplan, T. Milo, and R. Shabo. A comparison of labeling schemes for ancestor queries. In Proc. of the 13th annual ACM-SIAM Symp. on Discrete Algorithms (SODA’02), pages 954-963, 2002.
[42] R. Kaushik, P. Bohannon, J.F. Naughton, and H.F. Korth. Covering indexes for branching path queries. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’02), pages 133-144, 2002.
[43] R. Kaushik, P. Bohannon, J.F. Naughton, and P. Shenoy. Updates for Structure Indexes. In Proc. of the 28th Int. Conf. on Very Large Data Bases (VLDB’02), pages 239-250, 2002.
[44] R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting Local Similarity for Indexing Paths in Graph-Structured Data. In Proc. of the 18th Int. Conf. on Data Engineering (ICDE’02), pages 129-140, 2002.
[45] D.D. Kha, M. Yoshikawa, and S. Uemura. A Structural Numbering Scheme for XML Data. In Proc. of the 8th Int. Conf. on Extending Database Technology (EDBT’02) Workshop LNCS2490, pages 91-108, 2002.
[46] D.D. Kha, M. Yoshikawa, and S. Uemura. An XML Indexing Structure with Relative Region Coordinate. In Proc. of the 17th Int. Conf. on Data Engineering (ICDE’01), pages 313-320, 2001.
[47] O. Lassila and R. Swick. Resource Description Framework (RDF) Model and Syntax Specification. W3C Recommendation, Feb. 2004.
[48] Changqing Li, Tok Wang Ling, and Min Hu. Efficient Processing of Updates in Dynamic XML Data. In Proc. of the 22nd Int. Conf. on Data Engineering (ICDE’06), 2006. Best Paper Award List.
[49] Changqing Li, Tok Wang Ling, and Min Hu. Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes. In Proc. of the 11th Int. Conf. on Database Systems for Advanced Applications (DASFAA’06), pages 659-673, 2006.
[50] Changqing Li and Tok Wang Ling. QED: A Novel Quaternary Encoding to Completely Avoid Re-labeling in XML Updates. In Proc. of the 14th Int. Conf. on Information and Knowledge Management (CIKM’05), pages 501-508, 2005. Student Travel Award.
[51] Changqing Li, Tok Wang Ling, Jiaheng Lu, and Tian Yu. On Reducing Redundancy and Improving Efficiency of XML Labeling Schemes. In Proc. of the 14th Int. Conf. on Information and Knowledge Management (CIKM’05), pages 225-226, 2005. (Poster paper).
[52] Changqing Li and Tok Wang Ling. An Improved Prefix Labeling Scheme: A Binary String Approach for Dynamic Ordered XML. In Proc. of the 10th Int. Conf. on Database Systems for Advanced Applications (DASFAA’05), pages 125-137, 2005.
[53] Changqing Li and Tok Wang Ling. From XML to Semantic Web. In Proc. of the 10th Int. Conf. on Database Systems for Advanced Applications (DASFAA’05), pages 582-587, 2005. (Short paper).
[54] Changqing Li and Tok Wang Ling. OWL-Based Semantic Conflicts Detection and Resolution for Data Interoperability. In Proc. of the 23rd Int. Conf. on Conceptual Modeling (ER’04) Workshop LNCS3289, pages 266-277, Nov. 2004.
[55] Changqing Li, Tok Wang Ling, and Min Hu. Efficient updates in dynamic XML: From Binary String to Quaternary String. Accepted by VLDB Journal, 2006.
[56] Q. Li and B. Moon. Indexing and Querying XML Data for Regular Path Expressions. In Proc. of the 27th Int. Conf. on Very Large Data Bases (VLDB’01), pages 361-370, 2001.
[57] Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, and Ting Chen. From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. In Proc. of the 31st Int. Conf. on Very Large Data Bases (VLDB’05), pages 193-204, 2005.
[58] Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, and Wei Ni. Efficient Processing of Ordered XML Twig Pattern. To appear in Proc. of the 16th Int. Conf. on Database and Expert Systems Applications (DEXA’05), pages 300-309, 2005.
[59] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A Database Management System for Semistructured Data. SIGMOD Record, 26(3):54-66, 1997.
[60] J. McHugh and J. Widom. Query optimization for XML. In Proc. of the 25th Int. Conf. on Very Large Data Bases (VLDB’99), pages 315-326, 1999.
[61] T. Milo and D. Suciu. Index Structures for Path Expressions. In Proc. of the 7th Int. Conf. on Database Theory (ICDT’99), pages 277-295, 1999.
[62] S. Nestorov, J.D. Ullman, J.L. Wiener, and S.S. Chawathe. Representative Objects: Concise Representations of Semistructured, Hierarchial Data. In Proc. of the 13th Int. Conf. on Data Engineering (ICDE’97), pages 79-90, 1997.
[63] NIAGARA Experimental Data. Available at: http://www.cs.wisc.edu/niagara/data.html
[64] P.E. O'Neil, E.J. O'Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. ORDPATHs: Insert-Friendly XML Node Labels. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’04), pages 903-908, 2004.
[65] C. Qun, A. Lim, and K.W. Ong. D(k)-Index: An Adaptive Structural Summary for Graph-Structured Data. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’03), pages 134-144, 2003.
[66] P. Rao and B. Moon. PRIX: Indexing And Querying XML Using Prüfer Sequences. In Proc. of the 20th Int. Conf. on Data Engineering (ICDE’04), pages 288-300, 2004.
[67] N. Santoro and R. Khatib. Labeling and implicit routing in networks. The Computer J., 28:5-8, 1985.
[68] A. Silberstein, H. He, K. Yi, and J. Yang. BOXes: Efficient Maintenance of Order-Based Labeling for Dynamic XML Data. In Proc. of the 21st Int. Conf. on Data Engineering (ICDE’05), pages 285-296, 2005.
[69] I. Tatarinov, Z.G. Ives, A.Y. Halevy, and D.S. Weld. Updating XML. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’01), 2001.
[70] I. Tatarinov, S. Viglas, K.S. Beyer, J. Shanmugasundaram, E.J. Shekita, and C. Zhang. Storing and querying ordered XML using a relational database system. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’02), pages 204-215, 2002.
[71] University of Washington XML Repository. Available at: http://www.cs.washington.edu/research/xmldatasets/
[72] H. Wang, S. Park, W. Fan, and P.S. Yu. ViST: A Dynamic Index Method for Querying XML Data by Tree Structures. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’03), pages 110-121, 2003.
[73] N. Wirth. Type extensions. ACM Trans. on Progr. Languages and Systems, 10(2):204-214, 1988.
[74] X. Wu, M.L. Lee, and W. Hsu. A Prime Number Labeling Scheme for Dynamic Ordered XML Trees. In Proc. of the 20th Int. Conf. on Data Engineering (ICDE’04), pages 66-78, 2004.
[75] G. Xing and B. Tseng. Extendible range-based numbering scheme for xml document. In Proc. of the Int. Conf. on Information Technology: Coding and Computing (ITCC’04), pages 140-141, 2004.
[76] XMark: An XML Benchmark Project. Available at: http://monetdb.cwi.nl/xml/downloads.html
[77] B.B. Yao, M.T. Özsu, and N. Khandelwal. XBench Benchmark and Performance Testing of XML DBMSs. In Proc. of the 20th Int. Conf. on Data Engineering (ICDE’04), pages 621-633, 2004.
[78] F. Yergeau. UTF-8: A Transformation Format of ISO 10646. Request for Comments (RFC) 2279, January 1998.
[79] K. Yi, H. He, I. Stanoi, and J. Yang. Incremental Maintenance of XML Structural Indexes. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’04), pages 491-502, 2004.
[80] M. Yoshikawa, T. Amagasa, T. Shimura, and S. Uemura. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Trans. Internet Techn., 1(1):110-141, 2001.
[81] N. Zhang, S. Agrawal, and M.T. Özsu. BlossomTree: Evaluating XPaths in FLWOR Expressions. In Proc. of the 21st Int. Conf. on Data Engineering (ICDE’05), pages 388-389, 2005.
[82] N. Zhang, V. Kacholia, and M.T. Özsu. A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML. In Proc. of the 20th Int. Conf. on Data Engineering (ICDE’04), pages 54-65, 2004.
[83] C. Zhang, J.F. Naughton, D.J. DeWitt, Q. Luo, and G. Lohman. On Supporting Containment Queries in Relational Database Management Systems. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’01), pages 425-436, 2001.

...

... existing containment scheme and P-Containment scheme
Figure 4.1: V-CDBS-Containment scheme
Figure 4.2: V-CDBS-Prefix scheme (for Figure 2.4)
Figure 4.3: Leaf node insertions based on V-CDBS-Prefix scheme
Figure 4.4: Leaf node insertions based on V-CDBS-Containment scheme
Figure 4.5: Leaf node insertions based on the existing prefix scheme
Figure 4.6: Leaf node insertions based on the existing containment scheme
Figure 4.7: V-CDBS-P-Containment scheme
Figure 4.8: Internal node insertions based on V-CDBS-P-Containment scheme
Figure 4.9: Internal node insertions based on the prime number scheme
Figure 4.10: Subtree insertion based on V-CDBS-Prefix scheme
Figure 4.11: Subtree insertion based on V-CDBS-P-Containment scheme
...

Chapter 1 Introduction

... indexing, querying and updating XML documents have been among the major issues of database research. In this thesis, we mainly study how to improve the query efficiency of the existing labeling schemes for XML data and, more importantly, we propose novel techniques to update XML data efficiently. In this chapter, we first introduce the background of XML-related technologies in Section 1.1. Next in Section ...

... than the context of the current node being processed in memory. The most popular event-driven API is the Simple API for XML (SAX) [36]. This thesis focuses on how to efficiently query and update XML data, whether the XML data are schema-oblivious or schema-conscious. SAX will be used in the implementation to parse XML files in XML query and update processing.

1.1.3 XML Query

In the definition of XML, one element ...

... references in XML, and all queries are based on the ordered tree-structured representation of XML data. Figure 1.3 shows an ordered XML tree.

[Figure 1.2: An ordered XML tree. Node names: book, title, author, first_name, preface, last_name, chapter, section, section.]

The growing number of XML documents on the Web has motivated the development of languages and index techniques to query XML data efficiently ...

... /book[/title]//section[2]/preceding-sibling::section finds all the section nodes that are siblings of section[2] (section[2] means the second section) and that come before section[2] (“preceding-sibling”) ...

... etc. relationships between any two elements based on the labels only. Both ordered and un-ordered queries can be processed without accessing the original XML file. In addition, the labeling schemes can be used to query XML whether the XML is schema-oblivious or schema-conscious. In this thesis, we focus on the labeling schemes.

1.1.4 XML Update

In this section, we discuss XML updates based on the structural ...

Chapter 2 Background and Related Works

... updates based on labeling schemes. After updating, the labeling schemes can still efficiently support both path queries and twig pattern queries. Different encoding approaches have also been proposed to store the labels of the labeling schemes. The rest of this chapter is organized as follows. In Section 2.1, we introduce different labeling schemes to process XML queries. In Section 2.2, ... used to encode the labels of labeling schemes for storage. We summarize this chapter in Section 2.3.

2.1 Node Labeling Schemes

A labeling scheme is used to label the nodes of an XML tree, and based on the labeling scheme, XML queries can be processed without accessing the original XML document. In this section, we survey three families of labeling (numbering) schemes, viz. containment [3, 26, 45, 46, 56, ... 50, 64, 70], and prime [74].

2.1.1 Containment Labeling Scheme

The containment labeling scheme was first suggested by Santoro and Khatib [67]. Yoshikawa and Amagasa [80] also proposed a variant of the containment labeling scheme. To label an XML tree based on the containment scheme, different tree traversal methods (e.g. pre-and-postorder [26], extended preorder [56]) are used.

(1) Dietz’s containment labeling scheme ...

... relationship; it needs to search for the parent of one node and determine whether another node is a child of this parent, which requires many parent-child determinations and is very costly.

2.1.1.2 Deficiencies of the Containment Schemes on Updates

Although the ancestor-descendant relationship can be determined in constant time by the containment scheme, the insertion of a node will lead to a re-labeling ...
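
The excerpts above say that a containment scheme decides the ancestor-descendant relationship in constant time, but must search for a node's parent before it can decide the sibling relationship. A minimal Python sketch of those checks follows. It assumes each node carries a hypothetical (start, end, level) label in the style of interval-based containment schemes; the label layout, the function names and the five-node sample tree are illustrative assumptions, not taken from the thesis.

```python
from typing import NamedTuple, Optional, Sequence


class Label(NamedTuple):
    start: int  # position at which the node opens, in document order
    end: int    # position at which the node closes
    level: int  # depth in the tree, with the root at level 1


def is_ancestor(a: Label, d: Label) -> bool:
    # Constant time: d's interval lies strictly inside a's interval.
    return a.start < d.start and d.end < a.end


def is_parent(p: Label, c: Label) -> bool:
    return is_ancestor(p, c) and c.level == p.level + 1


def find_parent(node: Label, labels: Sequence[Label]) -> Optional[Label]:
    # The label does not name the parent, so candidates are tested one by one.
    for candidate in labels:
        if is_parent(candidate, node):
            return candidate
    return None


def are_siblings(u: Label, v: Label, labels: Sequence[Label]) -> bool:
    if u == v:
        return False
    parent = find_parent(u, labels)  # the costly step
    return parent is not None and is_parent(parent, v)


def precedes(u: Label, v: Label) -> bool:
    # Document order follows the start positions.
    return u.start < v.start


# Hypothetical tree: book -> (title, chapter), chapter -> (section1, section2)
book, title, chapter = Label(1, 10, 1), Label(2, 3, 2), Label(4, 9, 2)
section1, section2 = Label(5, 6, 3), Label(7, 8, 3)
labels = [book, title, chapter, section1, section2]

print(is_ancestor(book, section1))  # True
print(is_parent(book, section1))    # False
preceding = [n for n in labels
             if are_siblings(n, section2, labels) and precedes(n, section2)]
print(preceding)                    # [Label(start=5, end=6, level=3)]
```

The last query mirrors a preceding-sibling step: each candidate needs a parent search over the label set, which is where the many parent-child determinations come from.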
