APPROACHES FOR FAST SIMILARITY SEARCH WITH MAPREDUCE

Doctoral Thesis to confer the academic degree of Doktor der technischen Wissenschaften in the Doctoral Program Engineering Sciences

Author: Trong Nhan Phan
Submission: Institute of Application Oriented Knowledge Processing (FAW)
Supervisor and First Examiner: A.Univ.-Prof. Dr. Josef Küng
Second Examiner: Ao.Univ.-Prof. Dr. Andreas Rauber
05.2016

JOHANNES KEPLER UNIVERSITY LINZ
Altenberger Str. 69, 4040 Linz, Austria
www.jku.at
DVR 0093696

STATUTORY DECLARATION (EIDESSTATTLICHE ERKLÄRUNG)

I declare in lieu of oath that I have written this dissertation independently and without outside help, that I have not used any sources or aids other than those indicated, and that I have marked all passages taken verbatim or in substance from other sources as such. This dissertation is identical to the electronically submitted text document.

Linz, 2016
Trong Nhan Phan

SWORN DECLARATION

I hereby declare under oath that the submitted Doctoral Thesis has been written solely by me without any third-party assistance, that information other than the provided sources or aids has not been used, and that those used have been fully documented. Sources for literal, paraphrased and cited quotes have been accurately credited. The submitted document here present is identical to the electronically submitted text document.

Linz, 2016
Trong Nhan Phan

ACKNOWLEDGEMENT

I would like to express my deepest gratitude to my advisor, Professor Josef Küng, who has always given me invaluable advice and his best support during my PhD journey. Moreover, he has given me encouragement and opportunities not only in my research but also in my PhD life. Words cannot express how happy and lucky I am to have him as my advisor, and I wholeheartedly wish him all the best, no matter what tomorrow will bring.

I am very grateful to Professor Roland Wagner and Mr. Knud Steiner for giving me good conditions and assistance while doing my PhD at the Institute for Application
Oriented Knowledge Processing (FAW) in Linz and in Hagenberg, Austria. I would also like to extend my sincere thanks to the European Commission via GATE (knowledGe mAnagement Technology transfer and Education programme), an Erasmus Mundus mobility project, and especially to Ms. Christine Hinterleitner and Ms. Emma Huss for their support during my mobility in Austria.

My special thanks go to Mrs. Gabriela Wagner, Mrs. Gabriela Küng, Richard Küng, Eric Küng, and Felix Küng. Their kindness, generosity, and sense of humor have made my PhD life much more colorful and beautiful.

My sincere thanks go to Mr. Faruk Kujundžić, Scientific Computing, Information Management team, Johannes Kepler University Linz, for kindly supporting us with the Alex Cluster.

My acknowledgement would not be complete without my colleagues: Markus Jäger, Stefan Nadschläger, Pablo Gómez-Pérez, and Christian Huber. It is my great pleasure to work with you. I would like to thank Ms. Monika Neubauer and Mr. Andreas Dreiling for their administrative and technical help. I will never forget the warm welcome from Ms. Dagmar Auer, Hilda Kosorus, Jan Kubovy, Peter Regner, and all colleagues at FAW.

I am thankful to the reviewers who have spent their time on our research work and provided us with their invaluable feedback.

I appreciate Professor Tran Khanh Dang and my colleagues at the Faculty of Computer Science and Engineering, HCMC University of Technology, Vietnam, for their encouragement and support while I have been doing my PhD in Austria. I also appreciate my lecturers and teachers, who have guided me with their knowledge.

My family has always been by my side, no matter how harsh and hard things have been. Though I cannot list all the names in this section, I feel very lucky to have had them, my friends, and my companions throughout my life. The people I have met have come into my life for a reason. I thank you for everything, and I never stop hoping and believing that I will see you, the Angels, again and again.

KURZFASSUNG (SUMMARY)

Similarity search is one of the central operations not only in databases but also in other major areas of data processing such as information retrieval, machine learning, or data mining. In addition, it is used in various applications, for example duplicate detection, data cleaning, or data clustering. Despite its wide adoption, similarity search is expensive because of the high cost of similarity computations. It also becomes very time-consuming when it accesses irrelevant objects during execution and computes their similarity values unnecessarily. What is more, similarity search has to cope with the challenges of big data, the greatest of which is managing large amounts of data. These challenges make similarity search expensive but leave us with strong motivation. As a paradigm for large-scale computations on parallel and distributed systems built from commodity computers, MapReduce quickly shows its ability to process large amounts of data with high fault tolerance.

In this thesis we analyze the performance of similarity search in combination with MapReduce. In addition, we examine the problems of scalability, redundancy, and load balancing, which are always an issue in similarity search. Naturally, an exact similarity search without loss of accuracy is preferred. Using various approaches, we aim to improve the performance of similarity search processes with the help of MapReduce. Specifically, we investigate and use three different approaches for fast similarity search: instant, build-in, and hybrid.

First, we apply MapReduce to gain experience with similarity search under particular similarity measures, whose typical and popular representatives are the Jaccard coefficient and cosine similarity. The underlying idea is to obtain an inverted index, a well-known tool for indexing in fast full-text search. Our strategy is to index the data of a given query rather than all of the original data. As a result, only a small part of the data is processed for the similarity search. At the same time, the huge dataset is freed from unimportant data by intervening in both the indexing and the search process. Because the process depends on a particular query, it is suitable for one-time queries, since less data has to be processed in MapReduce.

In the second step, we want to separate indexing from similarity search so that similarity queries can be executed multiple times without reprocessing the original dataset. This approach belongs to the class of build-in approaches, in which the indexed data are already well prepared and can then be used for similarity queries. We found that using inverted indices can lead to drawbacks that make them inadequate for similarity search with MapReduce. Consequently, we propose document-based indices instead to compensate for these drawbacks. In addition, data objects are clustered, which narrows the possible search space.

In the third step, we deal with the hybrid approaches, which combine the advantages of both the instant and the build-in approaches. Our goal is to achieve fast indexing and fast similarity search. Furthermore, the data must be organized properly when the indices are built. On the one hand, objects that are clustered too tightly or too loosely are not helpful for the full-scan mode of MapReduce; on the other hand, the data objects should be organized so that unnecessary accesses are reduced to a minimum, especially because the document-based indices form the skeleton of our indexing phase. In addition, we address good work distribution and propose a method to improve load balancing and thus the runtime. We also equip our proposed methods with filtering and pruning strategies. Additionally, we propose a hybrid MapReduce-based architecture whose main task is to handle the “three Vs” of big data (volume, velocity, variety). Moreover, we are mindful of using a minimal number of MapReduce jobs. Because running a MapReduce job is expensive, using fewer jobs without repeated accesses to the original data leads to less overhead.

Finally, we intensively conducted a series of experiments on real datasets. The results show that our proposed methods achieve better performance than the baseline method and relevant comparable work on similarity search.

Keywords: similarity search, fast query processing, scalability, clustering, filtering, pruning, redundancy-free processing, load balancing, big data, MapReduce, Hadoop

ABSTRACT

Similarity search is a principal operation not only in databases but also in major disciplines such as information retrieval, machine learning, or data mining. In addition, it has been widely used in various applications like duplicate detection, data cleaning, or data clustering. Nevertheless, a similarity search process is expensive due to the cost of similarity computations. Moreover, similarity search becomes time-consuming when it has to access irrelevant objects and then evaluate their similarity unnecessarily. Furthermore, it has to deal with the challenges of big data, first and foremost the large amounts of data. Such challenges make similarity search
costly but leave us strongly motivated. Emerging as a paradigm for large-scale processing in the fashion of parallel and distributed computing on a cluster of commodity machines, MapReduce rapidly shows its capability as a candidate for processing massive datasets with high fault tolerance.

In our dissertation, we study the performance of similarity search with MapReduce. We also study the problems of scalability, redundancy, and load balance when doing similarity search. We prefer an exact similarity search, with no loss of accuracy. Among various approaches, we choose to improve performance through similarity search schemes that use MapReduce. More specifically, we propose three different kinds of approaches towards fast similarity search: the instant approaches, the build-in approaches, and the hybrid approaches.

Firstly, we employ MapReduce to perform similarity search with particular measures, whose typical and popular representatives are the Cosine and Jaccard measures. The idea is to utilize an inverted index, which is a well-known index data structure for fast full-text search. Our strategy, however, is to index only those data that occur in a given query rather than indexing all of the original data. As a consequence, only a small portion of the data is processed for similarity search. At the same time, we minimize the large amount of inessential data involved in both the indexing and the search processes. Because it depends on a particular query, this approach belongs to the category of instant approaches and is considered suitable for one-time querying while carrying less data through the MapReduce jobs.

Secondly, we want to separate the indexing phase from the similarity search phase so that similarity queries can run multiple times on the indexed data without re-accessing the original data. This approach belongs to the build-in approaches in that the indexed data are prepared in advance and ready for similarity
queries. Moreover, we observe that using an inverted index leads to some drawbacks that make it inappropriate for similarity search using MapReduce. Consequently, we propose using a document-based index instead in order to overcome these drawbacks. Furthermore, we cluster data objects into different compartments so that we reduce the search space for the search task.

Thirdly, we move towards the hybrid approaches, which take the advantages of both the instant approaches and the build-in approaches. Our goal is to achieve fast index building as well as fast similarity search. Moreover, we pay attention to how data are organized when building the indices. On the one hand, clustering objects too tightly or too loosely is not useful for the full-scan fashion of MapReduce. On the other hand, although the document-based index serves as the skeleton of our indexing phase, data objects should be organized so that we are able to minimize unnecessary data accesses. Furthermore, we address load imbalance and propose a straggler mitigation method that improves load balance and at the same time improves the runtime at reducers. In addition, we equip our proposed methods with filtering and pruning strategies. We also propose a hybrid MapReduce-based architecture whose main aim is to deal with the “three Vs” (Volume, Velocity, Variety) of big data. Moreover, we are conscious of employing a minimal number of MapReduce jobs. Because a single MapReduce job is expensive, using fewer MapReduce jobs without re-accessing the original data results in fewer penalties.

Finally, we intensively conduct a series of empirical experiments on real datasets. The results demonstrate that our proposed methods perform better than the baseline method and some related work.

Key words: Similarity search, fast query processing, scalability, clustering, filtering, pruning, redundancy-free capability, load balance, big data, MapReduce, Hadoop

LIST OF FIGURES

Figure 1-1: Examples of typical similarity queries; (a) Range query; (b) k-Nearest Neighbor query; (c) Self-join query
Figure 1-2: Data revolution since 2005 (Letouzé, 2012)
Figure 1-3: MapReduce paradigm (Phan et al., 2015c)
Figure 1-4: Data redundancy throughout MapReduce processes
2013)
Figure 2-10: Example of pair-wise similarity computation using a 2-pass blocking of objects (Kolb et al., 2013)
Figure 2-11: Example of redundancy-free MR-based pair-wise similarity computation
Figure 2-12: A hybrid MapReduce-based architecture (Phan et al., 2016)
Figure 3-1: An overview scheme of an instant approach
Figure 3-2: The overview scheme of Cosine-based method (Phan et al., 2014a)
Figure 3-3: MapReduce-1 from the Cosine-based method for pairwise similarity
Figure 3-4: MapReduce-2 from the Cosine-based method for pairwise similarity
Figure 3-5: MapReduce-3 from the Cosine-based method for pairwise similarity
Figure 3-6: MapReduce-4 from the Cosine-based method for pairwise similarity
Figure 3-7: MapReduce-1 from the Cosine-based method when given a pivot
Figure 3-8: MapReduce-2 from the Cosine-based method when given a pivot
Figure 3-9: MapReduce-4 from the Cosine-based method with Pre-pruning-2
Figure 3-10: MapReduce-3 from the Cosine-based method with Pre-pruning-1
Figure 3-11: Performance with DBLP Datasets (Phan et al., 2014a)
Figure 3-12: Similarity queries with DBLP Datasets (Phan et al., 2014a)
Figure 3-13: Performance with Gutenberg Datasets (Phan et al., 2015c)
Figure 3-14: Similarity queries with Gutenberg Datasets (Phan et al., 2015c)
Figure 3-15: Performance with shingles and Gutenberg Datasets (Phan et al., 2015c)
Figure 3-16: The overview scheme of Jaccard-based method (Phan et al., 2014b)

SUMMARY

thus, left behind. As we have seen, the load fed to a reducer depends on the number of keys it aggregates from the intermediate key-value pairs emitted by mappers. A skewed key distribution leads to different loads among reducers. Recall that a MapReduce job is complete only once the last reducer has finished. As a consequence, the slowest reducer delays the end of the whole MapReduce job, and the load-balancing problem
highly affects the performance of a MapReduce job. Our future work is to investigate load imbalance further and to find an approach that optimizes the load balance among reducers at little cost in return. Moreover, we want to deal with it from the point of view of both mappers and reducers. Another possible approach is to modify the partition function that distributes keys to reducers, together with the data processing strategies in mappers, so that each reducer receives an approximately uniform share of the data. Alternatively, we can intervene in the key distribution process before the keys are delivered to reducers. Last but not least, the optimization problem also arises from the implementation point of view as well as from parameter tuning.

7.3.2 More Efficient Data Organization

How data are organized in key-value pairs influences the overall performance of similarity search. In addition, sorting data inside mappers or reducers adds further cost. Moreover, organizing data in either a centralized or a nested way does not make the best use of the full-scan fashion of the MapReduce paradigm. A good data organization causes mappers and reducers to produce less data, and thus lowers the cost of data storage and exchange. Furthermore, it minimizes data redundancy throughout MapReduce jobs. Our future work is to look for a better way of representing documents so that the data representations become smaller. We also want to improve the data structures as well as the way data are indexed so that we are able to achieve more efficiency with MapReduce.

7.3.3 Query Grouping Strategies

Query grouping is useful when we have more than one query to process at a time. Sequentially processing each query in a query batch is inefficient because of the redundant searches caused by their shared contents. For instance, if we have a query batch in the form [(Qj, NOSj)] such that [(Q1, 3024), (Q2, 2705), (Q3, 3132),
(Q4, 2640), (Q5, 2746), (Q6, 2720), (Q7, 2592), (Q8, 3080)]

Contact Information
Trong Nhan Phan
Institute for Application Oriented Knowledge Processing (FAW)
Johannes Kepler University Linz, Austria
Field of Study
Information Systems

Research Interests
- Similarity Search
- Big Data
- Privacy in Location-Based Services (LBS)
- Security in Information Systems

Education
PhD Student, Institute for Application Oriented Knowledge Processing (FAW), Johannes Kepler University Linz, Austria, from 2013 to present
Master Student, Master by research programme, HCMC University of Technology, National University of HCMC, from 2010 to 2013. Thesis: Trajectory privacy in location-based applications
Undergraduate Student, HCMC University of Technology, National University of HCMC, from 2005 to 2010. Thesis: Developing a secure e-assessment supporting system

Recent Publications
[9] Phan T.N., Küng J., & Dang T.K. (2016). eHSim: An Efficient Hybrid Similarity Search with MapReduce. In Proceedings of the 30th IEEE International Conference on Advanced Information Networking and Applications (pp. 422-429). IEEE Computer Society.
[8] Phan T.N., Küng J., & Dang T.K. (2015). An Adaptive Similarity Search in Massive Datasets. Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIII, 9480 (Selected Papers from FDSE 2014), 45-74, 2015.
[7] Phan T.N., Jäger M., Nadschläger S., Küng J., & Dang T.K. (2015). An Efficient Document Indexing-based Similarity Search in Large Datasets. In Proceedings of the 2nd International Conference on Future Data and Security Engineering (FDSE 2015), LNCS 9446, Springer Verlag, Ho Chi Minh City, Vietnam, November 23-25, 2015, pp. 16-31.
[6] Jäger M., Phan T.N., & Nadschläger S. (2015). Towards the Trustworthiness of Data, Information, Knowledge and Knowledge Processing Systems in Smart Homes. In Proceedings of the 23rd Interdisciplinary Information Management Talks (IDIMT), September 9-11, 2015, Poděbrady, Czech Republic, pp. 413-419, ISBN 978-3-99033-395-2.
[5] Jäger M., Nadschläger S., Phan T.N., & Küng J. (2015). Data, Information & Knowledge Sources in the Agricultural Domain. In Proceedings of the 26th International Conference on Database and Expert Systems Applications Workshop, pp. 115-119, 2015.
[4] Phan T.N., Jäger M., Nadschläger S., & Küng J. (2015). Range-based Clustering Supporting Similarity Search in Big Data. In Proceedings of the 26th International Conference on Database and Expert Systems Applications Workshop, pp. 120-124, 2015.
[3] Phan T.N., Küng J., & Dang T.K. (2015). KUR-Algorithm: from Position to Trajectory Privacy Protection in Location-based Applications. In Proceedings of the 26th International Conference on Database and Expert Systems Applications (DEXA 2015), LNCS 9262, Springer-Verlag, September 1-4, 2015, Valencia, Spain, pp. 82-89.
[2] Phan T.N., Küng J., & Dang T.K. (2014). An Efficient Similarity Search in Large Data Collections with MapReduce. In Proceedings of the 1st International Conference on Future Data and Security Engineering, LNCS 8860, Springer Verlag, 2014, pp. 44-57.
[1] Phan T.N., Küng J., & Dang T.K. (2014). An Elastic Approximate Similarity Search in Very Large Datasets with Mapreduce. In Proceedings of the 7th International Conference on Data Management in Cloud, Grid and P2P Systems, LNCS 8648, Springer Verlag, 2014, pp. 49-60.

Languages
Vietnamese (Mother tongue)
English

Computer Skills
Programming Languages: C/C++, C#, Visual Basic
Operating Systems: Microsoft Windows, Unix
Databases: Microsoft SQL Server, Microsoft Access, Oracle, PostgreSQL/PostGIS
Others: MapReduce with Hadoop, Microsoft SQL Server Integration Services, Talend Open Studio

Personal Attributes
- Strong capability to research
- Working well with other people
- Honest, modest, patient, and sensitive

Other Information
Nationality: Vietnamese
Hobbies: music, computer games, arts, movies, etc.

