Approaches for fast similarity search with MapReduce

APPROACHES FOR FAST SIMILARITY SEARCH WITH MAPREDUCE

Doctoral Thesis to confer the academic degree of Doktor der technischen Wissenschaften in the Doctoral Program Engineering Sciences

Author: Trong Nhan Phan
Submission: Institute of Application Oriented Knowledge Processing (FAW)
Supervisor and First Examiner: A.Univ.-Prof. Dr. Josef Küng
Second Examiner: Ao.Univ.-Prof. Dr. Andreas Rauber
05.2016

JOHANNES KEPLER UNIVERSITY LINZ
Altenberger Str. 69, 4040 Linz, Austria
www.jku.at
DVR 0093696

EIDESSTATTLICHE ERKLÄRUNG

I declare in lieu of an oath that I have written this dissertation independently and without outside help, that I have not used any sources or aids other than those indicated, and that I have marked all passages taken verbatim or in substance from other sources as such. This dissertation is identical to the electronically submitted text document.

Linz, 2016
Trong Nhan Phan

SWORN DECLARATION

I hereby declare under oath that the submitted Doctoral Thesis has been written solely by me without any third-party assistance, that information other than the provided sources or aids has not been used, and that those used have been fully documented. Sources for literal, paraphrased and cited quotes have been accurately credited. The submitted document here present is identical to the electronically submitted text document.

Linz, 2016
Trong Nhan Phan

ACKNOWLEDGEMENT

I would like to express my deepest gratitude to my advisor, Professor Josef Küng, who has always given me invaluable advice and the best support during my PhD journey. Moreover, he has given me encouragement and opportunities not only in my research but also in my PhD life. Words cannot express how happy and lucky I am to have him as my advisor; I wholeheartedly wish him all the best, no matter what tomorrow will bring.

I am very grateful to Professor Roland Wagner and Mr. Knud Steiner for providing good working conditions and assistance during my PhD at the Institute for Application
Oriented Knowledge Processing (FAW) in Linz and in Hagenberg, Austria. I would also like to extend my sincere thanks to the European Commission via GATE (knowledGe mAnagement Technology transfer and Education programme), an Erasmus Mundus mobility project, and especially to Ms. Christine Hinterleitner and Ms. Emma Huss for their support during my mobility in Austria.

My special thanks go to Mrs. Gabriela Wagner, Mrs. Gabriela Küng, Richard Küng, Eric Küng, and Felix Küng. Their kindness, generosity, and sense of humor made my PhD life much more colorful and beautiful.

My sincere thanks go to Mr. Faruk Kujundžić, Scientific Computing, Information Management team, Johannes Kepler University Linz, for kindly supporting us with the Alex cluster.

My acknowledgement would not be complete without my colleagues: Markus Jäger, Stefan Nadschläger, Pablo Gómez-Pérez, and Christian Huber. It has been my great pleasure to work with you. I would like to thank Ms. Monika Neubauer and Mr. Andreas Dreiling for their administrative and technical help. I will never forget the warm welcome from Ms. Dagmar Auer, Hilda Kosorus, Jan Kubovy, Peter Regner, and all colleagues at FAW.

I am thankful to the reviewers who spent their time on our research work and provided invaluable feedback. I appreciate Professor Tran Khanh Dang and my colleagues at the Faculty of Computer Science and Engineering, HCMC University of Technology, Vietnam, for their encouragement and support during my PhD in Austria. I also appreciate my lecturers and teachers, who have guided me with their knowledge.

My family has always been by my side, no matter how hard things were. Though I cannot list all the names in this section, I feel very lucky to have had them, my friends, and my companions throughout my life. The people I have met came into my life for a reason. I thank you for everything, and I never stop hoping and believing that I will see you, the Angels, again and again.

KURZFASSUNG (SUMMARY)

Similarity search is one of the central operations not only in databases but also in other major areas of data processing such as information retrieval, machine learning, and data mining. It is also used in various applications, for example duplicate detection, data cleaning, and data clustering. Despite its wide adoption, similarity search is expensive because of the high cost of similarity computations. It also becomes very time-consuming when it accesses irrelevant objects during execution and computes their similarity values unnecessarily. Moreover, similarity search has to cope with the challenges of big data, the greatest of which is managing large amounts of data. These challenges make similarity search costly but also give us strong motivation.

As a paradigm for large-scale computation on parallel and distributed systems built from commodity computers, MapReduce has quickly shown its ability to process large amounts of data with high fault tolerance. In this thesis, we analyze the performance of similarity search in combination with MapReduce. We also investigate the problems of scalability, redundancy, and load balancing, which are always a concern in similarity search. Naturally, an exact similarity search without loss of accuracy is preferred. Using different approaches, we aim to improve the performance of similarity search with the help of MapReduce. Specifically, we investigate three different kinds of approaches to fast similarity search: instant, build-in, and hybrid.

First, we apply MapReduce to gain experience with similarity search under particular similarity measures, of which typical and popular representatives are the Jaccard coefficient and cosine similarity. The underlying idea is to use an inverted index, a well-known index structure for fast full-text search. Our strategy is to index only the data relevant to a given query rather than all of the original data. As a result, only a small part of the data is processed for similarity search. At the same time, the large dataset is stripped of irrelevant data in both the indexing and the search processes. Because the process depends on a specific query, it is suitable for one-time queries, since less data has to be processed in MapReduce.

Second, we separate indexing from similarity search so that similarity queries can be run repeatedly without reprocessing the original data. This approach belongs to the class of build-in approaches, in which the indexed data are prepared in advance and can then be used for similarity queries. We found that inverted indices have drawbacks that make them ill-suited to similarity search with MapReduce. We therefore propose document-based indices instead to compensate for these drawbacks. In addition, data objects are clustered, which narrows the search space.

Third, we address hybrid approaches, which combine the advantages of the instant and build-in approaches. Our goal is to achieve both fast indexing and fast similarity search. It must also be kept in mind that the data have to be organized properly when the indices are built. On the one hand, objects that are clustered too tightly or too loosely are not helpful for the full-scan mode of MapReduce; on the other hand, data objects should be organized so that unnecessary accesses are reduced to a minimum, especially because document-based indices form the backbone of our indexing phase. Furthermore, we attend to a good distribution of work and propose a method for improving load balance and thereby the runtime. We also equip our proposed methods with filtering and pruning strategies. In addition, we propose a hybrid MapReduce-based architecture whose main task is to handle the “three Vs” of big data (volume, velocity, variety). We are also mindful of using as few MapReduce jobs as possible: because running a MapReduce job is expensive, using fewer jobs without repeated access to the original data leads to less overhead.

Finally, we have carried out an extensive series of experiments on real datasets. The results show that our proposed methods perform better than the baseline method and relevant comparable work on similarity search.

Keywords: similarity search, fast query processing, scalability, clustering, filtering, pruning, redundancy-free processing, load balancing, big data, MapReduce, Hadoop

ABSTRACT

Similarity search is a principal operation not only in databases but also in disciplines such as information retrieval, machine learning, and data mining. In addition, it has been widely used in various applications such as duplicate detection, data cleaning, and data clustering. Nevertheless, similarity search is expensive because of the cost of similarity computations. Moreover, it becomes time-consuming when it has to access irrelevant objects and evaluate their similarity unnecessarily. Furthermore, it has to deal with the challenges of big data, first and foremost the large amounts of data. Such challenges make similarity search
costly but also motivate our work. Emerging as a paradigm for large-scale parallel and distributed processing on clusters of commodity machines, MapReduce has rapidly shown its capability for processing massive datasets with high fault tolerance. In this dissertation, we study the performance of similarity search with MapReduce, together with the problems of scalability, redundancy, and load balance that arise when doing similarity search. We prefer an exact similarity search, without loss of accuracy. Among various approaches, we choose to improve the performance of similarity search schemes using MapReduce. More specifically, we propose three kinds of approaches to fast similarity search: the instant approaches, the build-in approaches, and the hybrid approaches.

Firstly, we employ MapReduce for similarity search with particular measures, whose typical and popular representatives are the Cosine and Jaccard measures. The idea is to use an inverted index, a well-known index data structure for fast full-text search. Our strategy, however, is to index only the data that occur in a given query rather than all of the original data. As a consequence, only a small portion of the data is processed for similarity search. At the same time, we minimize the large amount of inessential data involved in both the indexing and the search processes. Because it depends on a particular query, this approach belongs to the category of instant approaches and is suitable for one-time querying while keeping less data throughout the MapReduce jobs.

Secondly, we separate the indexing phase from the similarity search phase so that similarity queries can run multiple times on the indexed data without re-accessing the original data. This approach belongs to the build-in approaches in that the indexed data are prepared in advance and are ready for similarity queries. Moreover, we observe that an inverted index has drawbacks that make it inappropriate for similarity search using MapReduce. Consequently, we propose using a document-based index instead to overcome these drawbacks. Furthermore, we cluster data objects into different compartments to reduce the search space.

Thirdly, we move towards the hybrid approaches, which take advantage of both the instant and the build-in approaches. Our goal is to achieve fast index building as well as fast similarity search. Moreover, we pay attention to how data are organized when the indices are built. On the one hand, clustering objects too tightly or too loosely is not useful for the full-scan fashion of MapReduce. On the other hand, although the document-based index serves as the skeleton of our indexing phase, data objects should be organized so that unnecessary data accesses are minimized. Furthermore, we address load imbalance and propose a straggler-mitigating method that improves load balance and, at the same time, the runtime at the reducers. In addition, we equip our proposed methods with filtering and pruning strategies, and we propose a hybrid MapReduce-based architecture whose main aim is to deal with the “three Vs” (Volume, Velocity, Variety) of big data. We are also careful to employ as few MapReduce jobs as possible: because a single MapReduce job is expensive, using fewer MapReduce jobs without re-accessing the original data incurs fewer penalties. Finally, we conduct an extensive series of empirical experiments on real datasets. The results demonstrate that our proposed methods perform better than the baseline method and some related work.

Key words: Similarity search, fast query processing, scalability, clustering, filtering, pruning, redundancy-free capability, load balance, big data, MapReduce, Hadoop

LIST OF FIGURES

Figure 1-1: Examples of typical similarity queries; (a) Range query; (b) k-Nearest Neighbor query; (c) Self-join query
Figure 1-2: Data revolution since 2005 (Letouzé, 2012)
Figure 1-3: MapReduce paradigm (Phan et al., 2015c)
Figure 1-4: Data redundancy throughout MapReduce processes
Figure 2-1: Performance-improving approaches with MapReduce
Figure 2-2: The architectures of Hadoop MapReduce and Hadoop YARN
Figure 2-3: Pruned document pair (Baraglia et al., 2010; De Francisci et al., 2010)
Figure 2-4: The generation of word frequency dictionary (Li et al., 2011)
Figure 2-5: The generation of text vector (Li et al., 2011)
Figure 2-6: The generation of PLT inverted file (Li et al., 2011)
Figure 2-7: Query text search (Li et al., 2011)
Figure 2-8: Computing pairwise similarity of a toy collection of documents
Figure 2-9: Example of task computation in partition-based similarity search with hybrid indexing (Alabduljalil et al., 2013)
Figure 2-10: Example of pair-wise similarity computation using a 2-pass blocking of objects (Kolb et al., 2013)
Figure 2-11: Example of redundancy-free MR-based pair-wise similarity computation
Figure 2-12: A hybrid MapReduce-based architecture (Phan et al., 2016)
Figure 3-1: An overview scheme of an instant approach
Figure 3-2: The overview scheme of Cosine-based method (Phan et al., 2014a)
Figure 3-3: MapReduce-1 from the Cosine-based method for pairwise similarity
Figure 3-4: MapReduce-2 from the Cosine-based method for pairwise similarity
Figure 3-5: MapReduce-3 from the Cosine-based method for pairwise similarity
Figure 3-6: MapReduce-4 from the Cosine-based method for pairwise similarity
Figure 3-7: MapReduce-1 from the Cosine-based method when given a pivot
Figure 3-8: MapReduce-2 from the Cosine-based method when given a pivot
Figure 3-9: MapReduce-4 from the Cosine-based method with Pre-pruning-2
Figure 3-10: MapReduce-3 from the Cosine-based method with Pre-pruning-1
Figure 3-11: Performance with DBLP Datasets (Phan et al., 2014a)
Figure 3-12: Similarity queries with DBLP Datasets (Phan et al., 2014a)
Figure 3-13: Performance with Gutenberg Datasets (Phan et al., 2015c)
Figure 3-14: Similarity queries with Gutenberg Datasets (Phan et al., 2015c)
Figure 3-15: Performance with shingles and Gutenberg Datasets (Phan et al., 2015c)
Figure 3-16: The overview scheme of Jaccard-based method (Phan et al., 2014b)

SUMMARY

... thus, left behind. As we have seen, the load fed to a reducer depends on the number of keys it aggregates from the intermediate key-value pairs emitted by the mappers. A skewed key distribution therefore leads to different loads among reducers. Recall that a MapReduce job is complete only once the last reducer has finished. As a consequence, the slowest reducer delays the end of the whole MapReduce job, and the load-balancing problem strongly affects the performance of a MapReduce job. Our future work is to investigate load imbalance further and to find an approach that optimizes the load balance among reducers at little cost. Moreover, we want to address it from the point of view of both mappers and reducers. One possible approach is to modify the partition function that distributes keys to reducers, together with the data processing strategies of the mappers, so that each reducer receives an approximately uniform share of the data. Alternatively, we can intervene in the key distribution process before the keys are delivered to the reducers. Last but not least, optimization problems also arise from the implementation point of view and from parameter tuning.

7.3.2 More Efficient Data Organization

Organizing data as key-value pairs influences the overall performance of similarity search. Sorting data inside mappers or reducers also adds cost. Moreover, organizing data in either a centralized or a nested way does not make the best use of the full-scan fashion of the MapReduce paradigm. Good data organization means that mappers and reducers produce less data, which lowers the cost of data storage and exchange. Furthermore, it minimizes data redundancy throughout MapReduce jobs. Our future work is to look for a better way of representing documents so that data representations become smaller. We also want to improve the data structures and the way data are indexed so that we achieve more efficiency with MapReduce.

7.3.3 Query Grouping Strategies

Query grouping is useful when more than one query has to be processed at a time. Processing each query in a batch sequentially is inefficient because of redundant searches over the contents the queries share. For instance, suppose we have a query batch of the form [(Qj, NOSj)], where NOSj denotes the number of shingles of query Qj: [(Q1, 3024), (Q2, 2705), (Q3, 3132), (Q4, 2640), (Q5, 2746), (Q6, 2720), (Q7, 2592), (Q8, 3080)]. A query-by-query checking process then needs 22639 shingles in total, the sum of the NOSj values. Simply assume that 745 of these shingles are shared among the queries. The number of shingles we actually need to check is then 22639 - 745 = 21894. Our future work is to devise a query grouping mechanism so that a batch of queries is processed as a group rather than query by query.

BIBLIOGRAPHY

Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., & Rasin, A. (2009). HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Proceedings of the VLDB Endowment, 2(1), 922–933. http://doi.org/10.14778/1687627.1687731

Adomavicius, G., & Tuzhilin, A. (2005). Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734–749. http://doi.org/10.1109/TKDE.2005.99

Ailamaki, A., DeWitt, D. J., Hill, M. D., & Skounakis, M. (2001). Weaving Relations for Cache
Performance. In Proceedings of the 27th International Conference on Very Large Data Bases (pp. 169–180). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. Retrieved from http://dl.acm.org/citation.cfm?id=645927.672367

Alabduljalil, M. A., Tang, X., & Yang, T. (2013). Optimizing Parallel Algorithms for All Pairs Similarity Search. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining (pp. 203–212). New York, NY, USA: ACM. http://doi.org/10.1145/2433396.2433422

Andoni, A., & Indyk, P. (2008). Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. Communications of the ACM, 51(1), 117–122. http://doi.org/10.1145/1327452.1327494

Arasu, A., Ganti, V., & Kaushik, R. (2006). Efficient Exact Set-similarity Joins. In Proceedings of the 32nd International Conference on Very Large Data Bases (pp. 918–929). VLDB Endowment. Retrieved from http://dl.acm.org/citation.cfm?id=1182635.1164206

Bange, C., Grosser, T., & Janoschek, N. (2013). Big Data Survey Europe: Usage, technology and budgets in European best-practice companies. Wuerzburg. Retrieved from https://www.pmone.com/fileadmin/user_upload/doc/study/BARC_BIG_DATA_SURVEY_EN_final.pdf

Baraglia, R., De Francisci Morales, G., & Lucchese, C. (2010). Document Similarity Self-Join with MapReduce. In Proceedings of the 2010 IEEE International Conference on Data Mining (pp. 731–736). Washington, DC, USA: IEEE Computer Society. http://doi.org/10.1109/ICDM.2010.70

Bayardo, R. J., Ma, Y., & Srikant, R. (2007). Scaling Up All Pairs Similarity Search. In Proceedings of the 16th International Conference on World Wide Web (pp. 131–140). New York, NY, USA: ACM. http://doi.org/10.1145/1242572.1242591

Beeferman, D., & Berger, A. (2000). Agglomerative Clustering of a Search Engine Query Log. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 407–416). New York, NY, USA: ACM. http://doi.org/10.1145/347090.347176

Bizer, C., Boncz, P., Brodie, M. L., & Erling, O. (2012). The Meaningful Use of Big Data: Four Perspectives - Four Challenges. SIGMOD Record, 40(4), 56–60. http://doi.org/10.1145/2094114.2094129

Bonifacio, A. S., Menolli, A., & Silva, F. (2014). Hadoop MapReduce Configuration Parameters and System Performance: a Systematic Review. In The 2014 International Conference on Parallel and Distributed Processing Techniques and Applications (pp. 1–7). Retrieved from http://worldcomp-proceedings.com/proc/p2014/PDP.html

Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic Clustering of the Web. Computer Networks and ISDN Systems, 29(8-13), 1157–1166. http://doi.org/10.1016/S0169-7552(97)00031-7

Bryant, R. E., Katz, R. H., & Lazowska, E. D. (2008). Big-data computing: Creating revolutionary breakthroughs in commerce, science, and society. Computing Research Initiatives for the 21st Century, Computing Research Association, pp. 1–7. Retrieved from http://cra.org/ccc/wp-content/uploads/sites/2/2015/05/Big_Data.pdf

Chaudhuri, S., Ganti, V., & Kaushik, R. (2006). A Primitive Operator for Similarity Joins in Data Cleaning. In Proceedings of the 22nd International Conference on Data Engineering (p. 5–). Washington, DC, USA: IEEE Computer Society. http://doi.org/10.1109/ICDE.2006.9

Chen, W.-Y., Song, Y., Bai, H., Lin, C.-J., & Chang, E. Y. (2011). Parallel Spectral Clustering in Distributed Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3), 568–586. http://doi.org/10.1109/TPAMI.2010.88

Choudhary, B. (1992). The Elements of Complex Analysis. J. Wiley. Retrieved from https://books.google.co.uk/books?id=5K9i2YwgTjYC

Cloud Security Alliance. (2013). Expanded Top Ten Big Data Security and Privacy Challenges. Retrieved from https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Expanded_Top_Ten_Big_Data_Security_and_Privacy_Challenges.pdf

Cormen, T. H., Stein, C., Rivest, R. L., & Leiserson, C. E. (2001). Introduction to Algorithms (2nd ed.). McGraw-Hill Higher Education.

De Francisci, G., Lucchese, C., & Baraglia, R. (2010). Scaling Out All Pairs Similarity Search with MapReduce. In 8th Workshop on Large-Scale Distributed Systems for Information Retrieval (pp. 25–30). Retrieved from http://scholar.google.de/scholar.bib?q=info:qx1tF3xQSs0J:scholar.google.com/&output=citation&hl=de&as_sdt=0&as_vis=1&ct=citation&cd=0

Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107–113. http://doi.org/10.1145/1327452.1327492

Demchenko, Y., de Laat, C., & Membrey, P. (2014). Defining architecture components of the Big Data Ecosystem. In 2014 International Conference on Collaboration Technologies and Systems (pp. 104–112). http://doi.org/10.1109/CTS.2014.6867550

Deng, D., Li, G., Hao, S., Wang, J., & Feng, J. (2014). MassJoin: A mapreduce-based method for scalable string similarity joins. In 30th IEEE International Conference on Data Engineering (pp. 340–351). http://doi.org/10.1109/ICDE.2014.6816663

DeWitt, D., & Gray, J. (1992). Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM, 35(6), 85–98. http://doi.org/10.1145/129888.129894

Dice, L. R. (1945). Measures of the Amount of Ecologic Association Between Species. Ecology, 26(3), 297–302. Retrieved from http://www.jstor.org/stable/1932409

Ding, H., Takigawa, I., Mamitsuka, H., & Zhu, S. (2014). Similarity-based machine learning methods for predicting drug-target interactions: a brief review. In Briefings in Bioinformatics (pp. 734–747).

Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., & Schad, J. (2010). Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). Proceedings of the VLDB Endowment, 3(1-2), 515–529. http://doi.org/10.14778/1920841.1920908

Dittrich, J., Quiané-Ruiz, J.-A., Richter, S., Schuh, S., Jindal, A., & Schad, J. (2012). Only Aggressive Elephants Are Fast Elephants. Proceedings of the VLDB Endowment, 5(11), 1591–1602.
http://doi.org/10.14778/2350229.2350272

Dittrich, J., Richter, S., & Schuh, S. (2013a). Efficient OR Hadoop: Why Not Both? Datenbank-Spektrum, 13(1), 17–22.

Dittrich, J., Richter, S., Schuh, S., & Quiané-Ruiz, J.-A. (2013b). Efficient or Hadoop: Why not both? IEEE Data Engineering Bulletin, 36(1), 15–23. Retrieved from http://sites.computer.org/debull/A13mar/jens.pdf

Dong, X. L., & Srivastava, D. (2013). Big Data Integration. Proceedings of the VLDB Endowment, 6(11), 1188–1189. http://doi.org/10.14778/2536222.2536253

Dorneles, C. F., Goncalves, R., & dos Santos Mello, R. (2011). Approximate Data Instance Matching: A Survey. Knowledge and Information Systems, 27(1), 1–21. http://doi.org/10.1007/s10115-010-0285-0

Drew, J., & Hahsler, M. (2014). Strand: Fast Sequence Comparison Using Mapreduce and Locality Sensitive Hashing. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 506–513). New York, NY, USA: ACM. http://doi.org/10.1145/2649387.2649436

Elsayed, T., Lin, J., & Oard, D. W. (2008). Pairwise Document Similarity in Large Collections with MapReduce. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers (pp. 265–268). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=1557690.1557767

Fenz, D., Lange, D., Rheinländer, A., Naumann, F., & Leser, U. (2012). Efficient Similarity Search in Very Large String Sets. In Proceedings of the 24th International Conference on Scientific and Statistical Database Management (pp. 262–279). Berlin, Heidelberg: Springer-Verlag. http://doi.org/10.1007/978-3-642-31235-9_18

Fox, G., Bae, S., Ekanayake, J., & Qiu, X. (2008). Parallel Data Mining from Multicore to Cloudy Grids. Science, 2008. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Parallel+Data+Mining+from+Multicore+to+Cloudy+Grids#0

Fréchet, M. (1906). Sur quelques points du calcul fonctionnel. Rendic. Circ. Mat. Palermo, 22, 1–74.

Gantz, J. F., Chute, C., Manfrediz, A., Minton, S., Reinsel, D., Schlichting, W., & Toncheva, A. (2008). The diverse and exploding digital universe: an updated forecast of worldwide information growth through 2011. Retrieved from http://www.ifap.ru/library/book268.pdf

Gantz, J., & Reinsel, D. (2011). Extracting Value from Chaos. Retrieved from https://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf

Gartner. (2013). Big Data definition. Retrieved from http://www.gartner.com/it-glossary/big-data/

Ghazi, M. R., & Gangodkar, D. (2015). Hadoop, MapReduce and HDFS: A Developers Perspective. Procedia Computer Science, 48, 45–50. http://doi.org/10.1016/j.procs.2015.04.108

Gigaspaces. (2012). Big Data Survey: Real-Time Stream Processing and Cloud-Based Big Data Increasing in Today’s Enterprises. Retrieved from http://www.gigaspaces.com/sites/default/files/product/BigDataSurvey_Report.pdf

Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity Search in High Dimensions via Hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (pp. 518–529). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. Retrieved from http://dl.acm.org/citation.cfm?id=645925.671516

Gomaa, W. H., & Fahmy, A. A. (2013). A Survey of Text Similarity Approaches. International Journal of Computer Applications, 68(13), 13–18.

Hajishirzi, H., Yih, W., & Kolcz, A. (2010). Adaptive Near-duplicate Detection via Similarity Learning. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 419–426). New York, NY, USA: ACM. http://doi.org/10.1145/1835449.1835520

Halevi, G., & Moed, H. (2012, September). The Evolution of Big Data as a Research and Scientific Topic (H. Moed, J. Kamalski, I. Kisjes, A. Plume, S. Huggett, M. Richardson, … D. van Weijen, Eds.). Research Trends - Special Issue on Big Data. Retrieved from http://www.researchtrends.com/wp-content/uploads/2012/09/Research_Trends_Issue30.pdf

Hamming, R. (1950). Error Detecting and Error Correcting Codes. Bell System Technical Journal, 29(2), 147–160.

Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques (3rd ed.). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Harman, D., Baeza-Yates, R., Fox, E., & Lee, W. (1992). Information Retrieval. In W. B. Frakes & R. Baeza-Yates (Eds.), (pp. 28–43). Upper Saddle River, NJ, USA: Prentice-Hall, Inc. Retrieved from http://dl.acm.org/citation.cfm?id=129687.129690

Heger, D. (2013). Hadoop Performance Tuning - A Pragmatic & Iterative Approach. CMG Journal of Computer Resource Management, 1–16. Retrieved from http://www.cmg.org/wp-content/uploads/2013/04/m_97_3.pdf

Hey, T., Tansley, S., & Tolle, K. M. (2009). Jim Gray on eScience: a transformed scientific method. In T. Hey, S. Tansley, & K. M. Tolle (Eds.), The Fourth Paradigm. Microsoft Research. Retrieved from http://dblp.uni-trier.de/db/books/collections/4paradigm2009.html#HeyTT09

Hey, T., & Trefethen, A. (2003). The Data Deluge: An e-Science Perspective. Grid Computing, 809–824. http://doi.org/10.1002/0470867167.ch36

Hjaltason, G. R., & Samet, H. (2003). Index-driven Similarity Search in Metric Spaces (Survey Article). ACM Transactions on Database Systems, 28(4), 517–580. http://doi.org/10.1145/958942.958948

Hoad, T. C., & Zobel, J. (2003). Methods for Identifying Versioned and Plagiarized Documents. Journal of the American Society for Information Science and Technology, 54(3), 203–215. http://doi.org/10.1002/asi.10170

IBM, Zikopoulos, P., & Eaton, C. (2011). Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data (1st ed.). McGraw-Hill Osborne Media.

Impetus. (2009). Hadoop Performance Tuning. White Paper, pp. 1–13. Retrieved from https://hadoop-toolkit.googlecode.com/files/White paper-HadoopPerformanceTuning.pdf

Intel Corporation. (n.d.).
Optimizing Hadoop Deployments. White Paper, pp. 1–9. Retrieved from https://software.intel.com/sites/default/files/Optimizing Hadoop Deployments.pdf

Intel Corporation. (2013). Extract, Transform, and Load Big Data with Apache Hadoop. White Paper, pp. 1–9. Retrieved from https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf

Jaccard, P. (1912). The Distribution of the Flora in the Alpine Zone. New Phytologist, 11(2), 37–50. Retrieved from http://www.jstor.org/stable/2427226?seq=3

Jäger, M., Nadschläger, S., Phan, T. N., & Küng, J. (2015). Data, Information & Knowledge Sources in the Agricultural Domain. In 26th International Workshop on Database and Expert Systems Applications, DEXA 2015, Valencia, Spain, September 1-4, 2015 (pp. 115–119). http://doi.org/10.1109/DEXA.2015.40

Jenkyns, T., & Stephenson, B. (2012). Fundamentals of Discrete Math for Computer Science: A Problem-Solving Primer. Springer Publishing Company, Incorporated.

Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big Data: Issues and Challenges Moving Forward. In 46th Hawaii International Conference on System Sciences (pp. 995–1004). http://doi.org/10.1109/HICSS.2013.645

Kang, S. J., Lee, S. Y., & Lee, K. M. (2015). Performance Comparison of OpenMP, MPI, and MapReduce in Practical Problems. Advances in Multimedia, 2015. http://doi.org/10.1155/2015/575687

Knuth, D. E. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Redwood City, CA, USA: Addison Wesley Longman Publishing Co., Inc.

Kolb, L., Thor, A., & Rahm, E. (2013). Don’t Match Twice: Redundancy-free Similarity Computation with MapReduce. In Proceedings of the 2nd Workshop on Data Analytics in the Cloud (pp. 1–5). New York, NY, USA: ACM. http://doi.org/10.1145/2486767.2486768

Kolcz, A., Chowdhury, A., & Alspector, J. (2004). Improved Robustness of Signature-based Near-replica Detection via Lexicon Randomization. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 605–610). New York, NY, USA: ACM. http://doi.org/10.1145/1014052.1014127

Kumar, K. A., Gluck, J., Deshpande, A., & Lin, J. (2014). Optimization Techniques for “Scaling Down” Hadoop on Multi-Core, Shared-Memory Systems. In Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014, Athens, Greece, March 24-28, 2014 (pp. 13–24). http://doi.org/10.5441/002/edbt.2014.03

Labrinidis, A., & Jagadish, H. V. (2012). Challenges and Opportunities with Big Data. Proceedings of the VLDB Endowment, 5(12), 2032–2033. http://doi.org/10.14778/2367502.2367572

Laney, D. (2001). 3D data management: Controlling data volume, velocity, and variety. Retrieved from http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-DataManagement-Controlling-Data-Volume-Velocity-and-Variety.pdf

Larkey, L. B., & Markman, A. B. (2005). Processes of Similarity Judgment. Cognitive Science, 29(6), 1061–1076. http://doi.org/10.1207/s15516709cog0000_30

Larson, R. (2009). Principles of Information Retrieval. School of Information, University of California, Berkeley. Retrieved from http://courses.ischool.berkeley.edu/i240/s09/Lectures/Lecture_07.ppt

Letouzé, E. (2012). Big data for development: challenges & opportunities. In A. R. Tatevossian & R. Kirkpatrick (Eds.), (pp. 1–47). UN Global Pulse.

Levenshtein, V. (1965). Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, 1, 8–17.

Li, R., Ju, L., Peng, Z., Yu, Z., & Wang, C. (2011). Batch Text Similarity Search with MapReduce. In Proceedings of the 13th Asia-Pacific Web Conference on Web Technologies and Applications (pp. 412–423). Berlin, Heidelberg: Springer-Verlag. Retrieved from http://dl.acm.org/citation.cfm?id=1996794.1996851

Lim, H., Herodotou, H., & Babu, S. (2012). Stubby: A Transformation-based Optimizer for MapReduce Workflows. Proceedings of the VLDB Endowment, 5(11), 1196–1207. http://doi.org/10.14778/2350229.2350239

Lin, F., & Cohen, W. W. (2010). A Very Fast Method for Clustering Big Text Datasets. In Proceedings of the 2010 Conference on ECAI 2010: 19th European Conference on Artificial Intelligence (pp. 303–308). Amsterdam, The Netherlands: IOS Press. Retrieved from http://dl.acm.org/citation.cfm?id=1860967.1861028

Lin, J. (2009). Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 155–162). New York, NY, USA: ACM. http://doi.org/10.1145/1571941.1571970

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.

Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity.

McCreadie, R., Macdonald, C., & Ounis, I. (2012). MapReduce Indexing Strategies: Studying Scalability and Efficiency. Information Processing & Management, 48(5), 873–888. http://doi.org/10.1016/j.ipm.2010.12.003

McKendrick, J. (2012). Big Data, Big Challenges, Big Opportunities: 2012 IOUG Big Data Strategies Survey. Retrieved from http://www.oracle.com/us/corporate/analystreports/infrastructure/ioug-big-data-survey1912835.pdf

Megler, V. M., & Maier, D. (2012).
When Big Data Leads to Lost Data In Proceedings of the 5th Ph.D Workshop on Information and Knowledge (pp 1–8) New York, NY, USA: ACM http://doi.org/10.1145/2389686.2389688 Message Passing Interface Forum (1994) MPI: A message-passing interface standard International Journal of Supercompurer Applications, 8(3-4), 159–416 Metwally, A., & Faloutsos, C (2012) V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors Proceedings of the VLDB Endowment, 5(8), 704–715 Retrieved from http://arxiv.org/abs/1204.6077 Microsoft IT SES Enterprise Data Architect Team (2013) Overview of Hadoop Performance Tuning White Paper, pp 1–59 Retrieved from https://www.google.com.vn/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uac t=8&ved=0ahUKEwj6ssvh7qvJAhVmcHIKHRqBC8YQFggaMAA&url=http://download microsoft.com/download/1/C/6/1C66D134-1FD5-4493-90BD-98F94A881626/Hadoop Job Optim Mika, P (2010) Distributed Indexing for Semantic Search In Proceedings of the 3rd International Semantic Search Workshop (pp 3:1–3:4) New York, NY, USA: ACM http://doi.org/10.1145/1863879.1863882 Minkowski, H (1953) Geometrie der Zahlen Chelsea Mitchell, I., Locke, Lm., Wilson, Lm., & Fuller, La (2012) The white book of Big Data: The definitive guide to the revolution in business analytics Retrieved from http://www.fujitsu.com/hr/Images/WhiteBookofBigData.pdf Moffat, A., Sacks-Davis, R., Wilkinson, R., & Zobel, J (1993) Retrieval of Partial Documents In Proceedings of the Second Text REtrieval Conference (pp 181–190) Retrieved from http://trec.nist.gov/pubs/trec2/papers/ps/citri.ps Murthy, A C., Vavilapalli, V K., Eadline, D., Niemiec, J., & Markham, J (2014) Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop (1st ed.) 
Addison-Wesley Professional Nessi (2012) Big Data: A New World of Opportunities White Paper, pp 1–25 Retrieved from http://www.nessi-europe.com/Files/Private/NESSI_WhitePaper_BigData.pdf Onizuka, M., Kato, H., Hidaka, S., Nakano, K., & Hu, Z (2013) Optimization for Iterative Queries on MapReduce Proceedings of the VLDB Endowment, 7(4), 241–252 http://doi.org/10.14778/2732240.2732243 164 Patella, M., & Ciaccia, P (2009) Approximate similarity search: A multi-faceted problem Journal of Discrete Algorithms, 7(1), 36–48 http://doi.org/http://dx.doi.org/10.1016/j.jda.2008.09.014 Patil, M S., Kamdar, J K., & Khatri, C B (2014) BIG DATA – An Overview International Journal of Engineering Research & Technology, 3(7), 1–4 Retrieved from http://www.ijert.org/view-pdf/10431/big-data-an-overview Pavlo, A., Paulson, E., Rasin, A., Abadi, D J., DeWitt, D J., Madden, S., & Stonebraker, M (2009) A Comparison of Approaches to Large-scale Data Analysis In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (pp 165–178) New York, NY, USA: ACM http://doi.org/10.1145/1559845.1559865 Phan, T N., Jäger, M., Nadschläger, S., & Küng, J (2015a) Range-based Clustering Supporting Similarity Search in Big Data In Proceedings of the 26th International Conference on Database and Expert Systems Applications Workshop (pp 120–124) Phan, T N., Jäger, M., Nadschläger, S., Küng, J., & Dang, T K (2015b) An Efficient Document Indexing-Based Similarity Search in Large Datasets In Proceedings of the 2nd International Conference on Future Data and Security Engineering (pp 16–31) http://doi.org/10.1007/978-3-319-26135-5_2 Phan, T N., Küng, J., & Dang, T K (2014a) An Efficient Similarity Search in Large Data Collections with MapReduce In T K Dang, R Wagner, E Neuhold, M Takizawa, J Küng, & N Thoai (Eds.), Proceedings of the 1st International Conference on Future Data and Security Engineering (Vol 8860, pp 44–57) Springer http://doi.org/10.1007/978-3319-12778-1_4 Phan, T N., Küng, 
J., & Dang, T K (2014b) An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce In A Hameurlain, T K Dang, & F Morvan (Eds.), Proceedings of the 7th International Conference on Data Management in Cloud, Grid and P2P Systems (Vol 8648, pp 49–60) Springer http://doi.org/10.1007/978-3-319-10067-8_5 Phan, T N., Küng, J., & Dang, T K (2015c) An Adaptive Similarity Search in Massive Datasets Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIII, 9480 (Selected Papers from FDSE 2014), 45–74 Phan, T N., Küng, J., & Dang, T K (2016) eHSim: An Efficient Hybrid Similarity Search with MapReduce In Proceedings of the 30th IEEE International Conference on Advanced Information Networking and Applications (pp 422–429) IEEE Computer Society Radia, S., & Srinivas, S (2014) HADOOP 2: What’s New ;login: The Usenix Magazine, 39(1), 1–4 Retrieved from https://www.usenix.org/system/files/login/articles/03_radia.pdf Rajaraman, A., & Ullman, J D (2011) Chapter 3: Finding similar items In Mining of Massive Datasets (1st ed., pp 71–127) Cambridge University Press Richter, S., Quiané-Ruiz, J.-A., Schuh, S., & Dittrich, J (2014) Towards Zero-overhead Static and Adaptive Indexing in Hadoop The VLDB Journal, 23(3), 469–494 http://doi.org/10.1007/s00778-013-0332-z 165 Rong, C., Lu, W., Wang, X., Du, X., Chen, Y., & Tung, A K H (2013) Efficient and Scalable Processing of String Similarity Join IEEE Transactions on Knowledge and Data Engineering, 25(10), 2217–2230 http://doi.org/10.1109/TKDE.2012.195 Russom, P (2011) Big Data Analytics Retrieved ftp://ftp.software.ibm.com/software/tw/Defining_Big_Data_through_3V_v.pdf from Satuluri, V., & Parthasarathy, S (2012) Bayesian Locality Sensitive Hashing for Fast Similarity Search Proceedings of the VLDB Endowment, 5(5), 430–441 http://doi.org/10.14778/2140436.2140440 Sicular, S (2013) Gartner’s Big Data Definition Consists of Three Parts, Not to Be Confused with Three “V”s Retrieved from 
http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consistsof-three-parts-not-to-be-confused-with-three-vs/ Singh, S (n.d.) What is difference between these two parallel programming paradigms: MPI and MapReduce? Retrieved October 14, 2015, from http://www.researchgate.net/post/What_is_difference_between_these_two_parallel_progra mming_paradigms_MPI_and_MapReduce Singhal, A (2001) Modern Information Retrieval: A Brief Overview IEEE Data Engineering Bulletin, 24(4), 35–43 Retrieved from http://dblp.unitrier.de/db/journals/debu/debu24.html#Singhal01 Sørensen, T J (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons Kongelige Danske Videnskabernes Selskab, 5(4), 1–34 Stein, B., & zu Eissen, S M (2006) Near Similarity Search and Plagiarism Analysis From Data and Information Analysis to Knowledge Engineering http://doi.org/10.1007/3-54031314-1_52 Stonebraker, M., Abadi, D., DeWitt, D J., Madden, S., Paulson, E., Pavlo, A., & Rasin, A (2010) MapReduce and Parallel DBMSs: Friends or Foes? 
Communications of the ACM, 53(1), 64–71 http://doi.org/10.1145/1629175.1629197 Su, X., & Swart, G (2012) Oracle In-database Hadoop: When Mapreduce Meets RDBMS In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (pp 779–790) New York, NY, USA: ACM http://doi.org/10.1145/2213836.2213955 Syncsort (2013) The European Big picture on Big Data and Hadoop in 2013 Retrieved from http://www.bitpipe.com/detail/RES/1377019868_718.html Tang, M., Yu, Y., Aref, W G., Malluhi, Q M., & Ouzzani, M (2015) Efficient Processing of Hamming-Distance-Based Similarity-Search Queries Over MapReduce In Proceedings of the 18th International Conference on Extending Database Technology (pp 361–372) http://doi.org/10.5441/002/edbt.2015.32 Tannir, K (2014) Optimizing Hadoop for MapReduce Packt Publishing Tao, Y., Lin, W., & Xiao, X (2013) Minimal MapReduce Algorithms In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (pp 529–540) New York, NY, USA: ACM http://doi.org/10.1145/2463676.2463719 166 Theobald, M., Siddharth, J., & Paepcke, A (2008) SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp 563–570) New York, NY, USA: ACM http://doi.org/10.1145/1390334.1390431 Thusoo, A., Sarma, J Sen, Jain, N., Shao, Z., Chakka, P., Zhang, N., … Murthy, R (2010) Hive - a petabyte scale data warehouse using Hadoop In F Li, M M Moro, S Ghandeharizadeh, J R Haritsa, G Weikum, M J Carey, … V J Tsotras (Eds.), 26th IEEE International Conference on Data Engineering (pp 996–1005) IEEE Retrieved from http://infolab.stanford.edu/~ragho/hive-icde2010.pdf Trelles, O., Prins, P., Snir, M., & Jansen, R C (2011) Big data, but are we ready? 
Nature Reviews Genetics, 12(3), 224 http://doi.org/10.1038/nrg2857-c1 Vernica, R., Carey, M J., & Li, C (2010) Efficient Parallel Set-similarity Joins Using MapReduce In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (pp 495–506) New York, NY, USA: ACM http://doi.org/10.1145/1807167.1807222 Villars, R L., Olofson, C W., & Eastwood, M (2011) Big Data: What it is and why you should care White Paper, pp 1–14 Retrieved from http://www.adminmagazine.com/HPC/content/download/5604/49345/file/IDC_Big Data_whitepaper_final.pdf Wang, J., Li, G., Deng, D., Zhang, Y., & Feng, J (2015) Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search In J Gehrke, W Lehner, K Shim, S K Cha, & G M Lohman (Eds.), 31st IEEE International Conference on Data Engineering (pp 519–530) IEEE http://doi.org/10.1109/ICDE.2015.7113311 Weber, R., Schek, H.-J., & Blott, S (1998) A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces In Proceedings of the 24th International Conference on Very Large Data Bases (pp 194–205) San Francisco, CA, USA: Morgan Kaufmann Publishers Inc Retrieved from http://dl.acm.org/citation.cfm?id=645924.671192 Weinberger, K Q., & Saul, L K (2009) Distance Metric Learning for Large Margin Nearest Neighbor Classification Journal of Machine Learning Research, 10, 207–244 Retrieved from http://dl.acm.org/citation.cfm?id=1577069.1577078 Weisstein, E W (n.d.) 
Binary Search Retrieved http://mathworld.wolfram.com/BinarySearch.html January 1, 2015, from Witten, I H., Moffat, A., & Bell, T C (1999) Managing Gigabytes (2Nd Ed.): Compressing and Indexing Documents and Images San Francisco, CA, USA: Morgan Kaufmann Publishers Inc Wong, R C.-W (2012) Big Data Privacy Information Technology & Software Engineering, 2(5) WordNet, P U (2010) “About http://wordnet.princeton.edu WordNet.” 167 Retrieved October 12, 2015, from Xiao, C., Wang, W., & Lin, X (2008a) Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints Proceedings of the VLDB Endowment, 1(1), 933–944 http://doi.org/10.14778/1453856.1453957 Xiao, C., Wang, W., Lin, X., & Yu, J X (2008b) Efficient Similarity Joins for Near Duplicate Detection In Proceedings of the 17th International Conference on World Wide Web (pp 131–140) New York, NY, USA: ACM http://doi.org/10.1145/1367497.1367516 Xiao, C., Wang, W., Lin, X., Yu, J X., & Wang, G (2011) Efficient Similarity Joins for Nearduplicate Detection ACM Transactions on Database Systems, 36(3), 15:1–15:41 http://doi.org/10.1145/2000824.2000825 Zadeh, R B., & Goel, A (2013) Dimension Independent Similarity Computation Journal of Machine Learning Research, 14(1), 1605–1626 Retrieved from http://dl.acm.org/citation.cfm?id=2567709.2567715 Zezula, P (2012) Future Trends in Similarity Searching In Proceedings of the 5th International Conference on Similarity Search and Applications (pp 8–24) Berlin, Heidelberg: SpringerVerlag http://doi.org/10.1007/978-3-642-32153-5_2 Zezula, P., Amato, G., Dohnal, V., & Batko, M (2010) Similarity Search: The Metric Space Approach (1st ed.) 
CURRICULUM VITAE

Contact Information
Trong Nhan Phan
Institute for Application Oriented Knowledge Processing (FAW)
Johannes Kepler University Linz, Austria
Phone: +43 (0) 68181873919
E-mail: nphan@faw.jku.at

Field of Study
Information Systems

Research Interests
• Similarity Search
• Big Data
• Privacy in Location-Based Services (LBS)
• Security in Information Systems

Education
PhD Student, Institute for Application Oriented Knowledge Processing (FAW), Johannes Kepler University Linz, Austria, from 2013 to present.
Master Student, Master by research programme, HCMC University of Technology, National University of HCMC, from 2010 to 2013. Thesis: Trajectory privacy in location-based applications.
Undergraduate Student, HCMC University of Technology, National University of HCMC, from 2005 to 2010. Thesis: Developing a secure e-assessment supporting system.

Recent Publications
[9] Phan T.N., Küng J., & Dang T.K. (2016). eHSim: An Efficient Hybrid Similarity Search with MapReduce. In Proceedings of the 30th IEEE International Conference on Advanced Information Networking and Applications (pp. 422–429). IEEE Computer Society.
[8] Phan T.N., Küng J., & Dang T.K. (2015). An Adaptive Similarity Search in Massive Datasets. Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIII, 9480 (Selected Papers from FDSE 2014), pp. 45–74.
[7] Phan T.N., Jäger M., Nadschläger S., Küng J., & Dang T.K. (2015). An Efficient Document Indexing-based Similarity Search in Large Datasets. In Proceedings of the 2nd International Conference on Future Data and Security Engineering (FDSE 2015), LNCS 9446, Springer Verlag, Ho Chi Minh City, Vietnam, November 23-25, 2015, pp. 16–31.
[6] Jäger M., Phan T.N., & Nadschläger S. (2015). Towards the Trustworthiness of Data, Information, Knowledge and Knowledge Processing Systems in Smart Homes. In Proceedings of the 23rd Interdisciplinary Information Management Talks (IDIMT), September 9-11, 2015, Poděbrady, Czech Republic, pp. 413–419, ISBN 978-3-99033-395-2.
[5] Jäger M., Nadschläger S., Phan T.N., & Küng J. (2015). Data, Information & Knowledge Sources in the Agricultural Domain. In Proceedings of the 26th International Conference on Database and Expert Systems Applications Workshop, pp. 115–119.
[4] Phan T.N., Jäger M., Nadschläger S., & Küng J. (2015). Range-based Clustering Supporting Similarity Search in Big Data. In Proceedings of the 26th International Conference on Database and Expert Systems Applications Workshop, pp. 120–124.
[3] Phan T.N., Küng J., & Dang T.K. (2015). KUR-Algorithm: from Position to Trajectory Privacy Protection in Location-based Applications. In Proceedings of the 26th International Conference on Database and Expert Systems Applications (DEXA 2015), LNCS 9262, Springer-Verlag, September 1-4, 2015, Valencia, Spain, pp. 82–89.
[2] Phan T.N., Küng J., & Dang T.K. (2014). An Efficient Similarity Search in Large Data Collections with MapReduce. In Proceedings of the 1st International Conference on Future Data and Security Engineering, LNCS 8860, Springer Verlag, pp. 44–57.
[1] Phan T.N., Küng J., & Dang T.K. (2014). An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce. In Proceedings of the 7th International Conference on Data Management in Cloud, Grid and P2P Systems, LNCS 8648, Springer Verlag, pp. 49–60.

Languages
Vietnamese (mother tongue), English

Computer Skills
Programming Languages: C/C++, C#, Visual Basic
Operating Systems: Microsoft Windows, Unix
Databases: Microsoft SQL Server, Microsoft Access, Oracle, PostgreSQL/PostGIS
Others: MapReduce with Hadoop, Microsoft SQL Server Integration Services, Talend Open Studio

Personal Attributes
• Strong capability to research
• Working well with other people
• Honest, modest, patient, and sensitive

Other Information
Nationality: Vietnamese
Hobbies: music, computer games, arts, movies, etc.
