Semantic Based Similarity Searches in Database Systems: Multidimensional Access Methods, Similarity Search Algorithms

EIDESSTATTLICHE ERKLÄRUNG (STATUTORY DECLARATION)

I declare under oath that I have written this dissertation independently and without outside help, that I have used no sources or aids other than those indicated, and that I have marked as such all passages taken verbatim or in substance from other works.

Linz, April 2003
Dang Tran Khanh

KURZFASSUNG (SUMMARY)

A weakness of traditional query answering in today's database systems is the lack of flexibility in interpreting user queries. This led to the development of so-called “Flexible Query Answering Systems” (FQASs), which extend existing databases with similarity query functionality. Realizing such functionality for the database management systems (DBMSs) currently on the market is, however, not trivial, because the semantics to which similarity queries must refer are missing. On the other hand, the concept of similarity/relevance of documents plays a central role in Information Retrieval (IR) systems. These aim at satisfying a user's information need rather than the data need that a DBMS user has. A suitable integration of IR concepts into existing database systems to close this semantic gap could lead to semantic based FQASs. Nevertheless, developing such systems is by no means simple and brings further challenges, such as (1) modeling such a system, (2) developing its flexibility, and (3) creating a transparent and secure system.

This thesis analyzes the challenges listed above and designs, implements, and evaluates methods and techniques with the aim of supporting the design and construction of efficient semantic based FQASs in general. In particular, the work concentrates on methods for similarity search, complex similarity search, similarity joins, and approximate similarity queries, as well as on the optimization and integration of all these approaches into conventional DBMSs. In addition, the experience gained in developing semantic based FQASs is discussed. The results can be used not only in traditional DBMSs but also in a wider setting (e.g., modern IR systems, data mining).

Among the results achieved, the following deserve particular mention: (1) the invention of a novel multidimensional index structure, the SH-tree (Super Hybrid tree), which also scales in high-dimensional spaces; (2) a new approach, called the Incremental hyper-Sphere Approach (ISA), which efficiently processes complex multi-feature nearest neighbor (M-FNN) queries; (3) an innovative approach, ε-ISA, for quickly executing approximate complex M-FNN queries (ε-ISA is one of the first solutions to this problem); (4) efficient approaches for complex similarity joins and approximate complex similarity joins; (5) a discussion and resolution of critical issues for integrating the SH-tree into DBMSs.

ABSTRACT

A critical weakness of the traditional query processing model in existing database systems is the lack of flexibility in interpreting and answering users’ queries. This has initiated a new research trend into flexible query answering systems (FQASs), which extend existing database systems with similarity retrieval capabilities in order to fulfill a user’s requirements, expressed in a formal query language such as SQL, more “intelligently” and effectively. However, realizing such similarity retrieval capabilities for state-of-the-art conventional database management systems (DBMSs) is non-trivial, because these systems lack the semantics that form the basis for similarity searches. On the other hand, the concept of similarity/relevance search is at the center of Information Retrieval (IR) systems, which mainly aim at satisfying
user information needs rather than user data needs, as in conventional DBMSs. A suitable integration of this concept with an existing database system, in order to overcome the lack of semantics, could lead to the desired semantic based FQAS. Even then, developing such an FQAS is far from simple, and it introduces many new challenges that relate to (1) modeling the system, (2) developing flexibilities for the system, and (3) building transparency and guarantees into the system.

In this thesis, we analyze the above challenges, and design, implement, and evaluate techniques to facilitate the design and construction of efficient semantic based FQASs in general. In particular, we concentrate on similarity search techniques, complex similarity query/join processing and optimization, approximate similarity queries of various types, and the integration of all these facilities into DBMSs. Moreover, although we present the research results in the context of developing semantic based FQASs, our achievements are also applicable to other modern database applications (e.g., modern IR systems, data mining), not only to traditional DBMSs.

Among the achieved results, the most important ones include: (1) inventing a multidimensional index structure, named the SH-tree (Super Hybrid tree), that can scale to high-dimensional data spaces; (2) introducing a novel approach called the Incremental hyper-Sphere Approach (ISA) to efficiently address complex multi-feature nearest neighbor (M-FNN) queries; (3) introducing an innovative approach named the ε-ISA to efficiently solve approximate complex M-FNN queries (the ε-ISA is one of a few vanguard solutions to the problem of approximate complex similarity query answering); (4) proposing efficient approaches to deal with complex and approximate complex similarity joins; (5) discussing and solving critical issues towards integrating the SH-tree into DBMSs.

ACKNOWLEDGEMENTS

First of all, I must
thank my parents, who gave me life and have supported me throughout it in every kind of need.

I thank my direct supervisor, Prof Küng Josef, for his great support and many fruitful discussions. I am also grateful to his family members, who helped me get to know Austria and Austrians. In particular, I must thank Gabriela for the dinners and jokes. I would like to express my great gratitude for all the help of my supervisor, Prof Wagner Roland. He helped with many important decisions during my research and study in Austria.

I thank my younger sister, Dang Viet Ha, and my youngest brother, Dang Tran Khanh. They and my parents are indispensable people in my life. They all gave me the spirit to finish this thesis. My younger sister and brother helped me to forget the work whenever it was at a standstill and I needed some time to relax.

For the cost of living and part of the funding for my research work, I thank everyone at the Austrian Exchange Service - OeAD. In particular, I thank Mr Szelegowitz Andreas, who manages the OeAD branch in Linz, for all of his help during my stay in Austria.

I thank my old teacher, Dr Nguyen Thanh Son, for his support and useful advice. I am also thankful to all of my friends and colleagues for their support and encouragement.

Last but not least, I give my special thanks to my fiancée, Huyen Trang, who is always with me and gives me great encouragement. Her support has been a lever for me in accomplishing the thesis.

Linz, April 2003
Dang Tran Khanh

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
  1.1 Motivation
  1.2 Challenges
  1.3 Contributions and Structure of Thesis
CHAPTER 2 FLEXIBLE QUERY ANSWERING SYSTEMS
  2.1 Introduction
  2.2 Supporting Flexible Retrieval Capabilities in DBMSs
    2.2.1 ARES
    2.2.2 VAGUE
    2.2.3 FLEX
    2.2.4 CoBase
    2.2.5 VQS
  2.3 Modern Information Retrieval Systems
    2.3.1 QBIC
    2.3.2 Photobook
    2.3.3 MARS
  2.4 Discussion
  2.5 Conclusions
CHAPTER 3 VQS - VAGUE QUERY SYSTEM: APPROACH AND ISSUES
  3.1 Introduction
  3.2 Semantic Based Similarity Searches
  3.3 Basic Ideas and Overall Architecture of VQS
    3.3.1 The Basic Ideas
    3.3.2 Overall Architecture
  3.4 Vague Join Processing in VQS: A Good Match Join
  3.5 Complex Vague Query Processing in VQS: An Incremental Hypercube Approach
  3.6 Issues and Discussion
  3.7 Conclusions
CHAPTER 4 AN OVERVIEW OF MULTIDIMENSIONAL ACCESS METHODS
  4.1 Preliminary
  4.2 Introduction and Fundamental Definitions
    4.2.1 Basic Operations Related to Building MAMs
    4.2.2 Basic and Advanced Query Types Related to MAMs
  4.3 Prevailing Search Algorithms
  4.4 Developing a Taxonomy for Multidimensional Access Methods
  4.5 A Survey of Some Typical Multidimensional Access Methods
    4.5.1 KD-Tree Based Indexing Techniques
      4.5.1.1 The KD-tree
      4.5.1.2 The VAMSplit-tree
      4.5.1.3 The LSDh-tree
    4.5.2 R-Tree Based Indexing Techniques
      4.5.2.1 The R-tree
      4.5.2.2 The TV-tree
      4.5.2.3 The SS-tree
      4.5.2.4 The X-tree
      4.5.2.5 The SR-tree
      4.5.2.6 The M-tree and MAMs for Metric Spaces
      4.5.2.7 The A-tree
    4.5.3 Hybrid Techniques of both KD-tree and R-tree
      4.5.3.1 The Hybrid tree
      4.5.3.2 The SH-tree
    4.5.4 Other Techniques
      4.5.4.1 The Pyramid Technique and the iMinMax
      4.5.4.2 The VA-file and its Variants
      4.5.4.3 The UB-tree
  4.6 The Generalized Search Tree (GiST): A Framework for Search Trees in Database Systems
  4.7 Comparative Studies
  4.8 Remarks and Conclusions
CHAPTER 5 THE SH-TREE: A SUPER HYBRID INDEX STRUCTURE FOR MULTIDIMENSIONAL DATA
  5.1 Preliminary Remarks
  5.2 Introduction
  5.3 Motivations
  5.4 The SH-tree Structure
    5.4.1 Multidimensional Space Partitioning and Basic Structure of the SH-tree
    5.4.2 Splitting Nodes in the SH-tree
      5.4.2.1 Leaf Node Splitting
      5.4.2.2 Balanced Node Splitting
      5.4.2.3 Internal Node Splitting
    5.4.3 The Extended Balanced SH-tree
  5.5 The SH-tree’s Basic Operations
    5.5.1 Insertion
    5.5.2 Deletion
    5.5.3 Search
  5.6 Evaluating Performance of the SH-tree
    5.6.1 Implementation Details
    5.6.2 The SH-tree Performance with k-Nearest Neighbor Queries
  5.7 Towards Integrating the SH-tree into DBMSs
    5.7.1 An Approach to Cost Estimation for Searching on the SH-tree
    5.7.2 Local Dynamic Bulk-Loading of the SH-tree
    5.7.3 Concurrency Control and Recovery Techniques for the SH-tree: A Preliminary Theoretical Study
  5.8 Conclusions and Future Work
CHAPTER 6 SOLVING COMPLEX MULTI-FEATURE NEAREST NEIGHBOR QUERIES
  6.1 Introduction
  6.2 Addressing Complex Vague Queries
  6.3 Incremental Hyper-Sphere Approach (ISA)
    6.3.1 Basic Incremental hyper-Sphere Approach Version
    6.3.2 An Incremental Algorithm Adapted for hyper-Sphere Range Queries
    6.3.3 Enhanced Incremental hyper-Sphere Approach Version
    6.3.4 Discussion
  6.4 Finding k Nearest Neighbors for Complex Vague Queries
  6.5 Experimental Results
  6.6 Remarks and Conclusions
CHAPTER 7 SOLVING APPROXIMATE SIMILARITY QUERIES
  7.1 A Preliminary Declaration of Problems
  7.2 Solving Approximate Complex Multi-Feature Nearest Neighbor Queries
    7.2.1 Introduction and Related Research
    7.2.2 The ISA: An Efficient Approach for Solving Complex Vague Queries
    7.2.3 The ε-ISA: Towards Efficiently Solving Approximate M-FNN Problem
      7.2.3.1 Finding Approximate Nearest Neighbor
      7.2.3.2 A Generalized Algorithm for Finding Approximate k-Nearest Neighbors
    7.2.4 Experimental Results for Approximate M-FNN Queries
  7.3 Solving Approximate Single-Feature Nearest Neighbor Queries
  7.4 Solving Approximate Range Queries
  7.5 Conclusions and Future Work
CHAPTER 8 EFFICIENT PROCESSING OF COMPLEX AND APPROXIMATE COMPLEX SIMILARITY JOINS
  8.1 Introduction and A Classification Scheme of Similarity Joins
  8.2 Efficient Complex Similarity Join Processing in the VQS
    8.2.1 Complex Similarity Join Processing: An Approach to the Best Match Joins
    8.2.2 Approximate Complex Similarity Join Processing
    8.2.3 Evaluation Results and Discussions
  8.3 A Generalization of Complex and Approximate Complex Similarity Join Processing Problem
  8.4 Remarks and Conclusions
CHAPTER 9 CONCLUSIONS AND FUTURE WORK
  9.1 Concluding Summary
  9.2 Future Work
REFERENCES

LIST OF FIGURES

Figure 2.1 Overall architecture of FLEX
Figure 2.2 An example runway length TAH
Figure 2.3 Overall architecture of QBIC
Figure 2.4 A classification scheme for flexible query answering systems
Figure 3.1 Normalization using the effective diameter
Figure 3.2 Formal description of the Vague Query Language - VQL
Figure 3.3 Overall architecture of primitive VQS
Figure 3.4 Formal description of the extended Vague Query Language
Figure 3.5 An example of complex vague query processing with the Incremental hyper-Cube Approach
Figure 3.6 New architecture of VQS
Figure 4.1 A schema of the query type classification with MAMs
Figure 4.2 The dilation (R+) and the erosion (R-) of an example range R
Figure 4.3 MINDIST, MINMAXDIST, and MAXDIST
Figure 4.4 Evolution schema of MAMs in recent years
Figure 5.1 Some problems with coding actual data region
Figure 5.2 A possible partition of an example data space and the corresponding mapping to the SH-tree
Figure 5.3 Problem with leaf node splitting in the Hybrid tree: No suitable split position satisfies the storage utilization constraint
Figure 5.4 Steps of splitting a leaf node in the SH-tree
Figure 5.5 Choose an internal node to insert a new data object into
Figure 5.6 Split propagation in the SH-tree
Figure 5.7 An algorithm for answering range queries in the SH-tree
Figure 5.8 Pseudo-code of the adapted k-NN algorithm
...
University of Linz, Austria, April 2003

... These concepts should enable the database systems to supply semantic based query processing capabilities ...

Multidimensional Index Structures: Multidimensional data are often encountered in modern database applications such as data mining/OLAP, ...

Pages	322
File size	2.06 MB