A Thesis entitled

ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce

by Mahalakshmi Lakshminarayanan

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Engineering

Dr. Vijay Devabhaktuni, Committee Chair
Dr. William F. Acosta, Committee Member
Dr. Robert C. Green II, Committee Member
Dr. Mansoor Alam, Committee Member
Dr. Patricia R. Komuniecki, Dean, College of Graduate Studies

The University of Toledo
December 2013

Copyright 2013, Mahalakshmi Lakshminarayanan. This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.

An Abstract of
ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce
by Mahalakshmi Lakshminarayanan
Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Engineering
The University of Toledo, December 2013

Similarity Join is an important operation for data mining, with a diverse range of real-world applications. Three efficient MapReduce algorithms for performing Similarity Joins between multisets are proposed in this thesis. Filtering techniques for similarity joins minimize the number of pairs of entities joined and hence are vital for improving the efficiency of the algorithm. Multisets represent real-world data better by considering the frequency of their elements. Prior serial algorithms incorporate filtering techniques only for sets, not multisets, while prior MapReduce algorithms either do not incorporate any filtering technique or incorporate prefix filtering inefficiently, with poor scalability. This work extends the filtering techniques, namely the prefix, size, positional and suffix filters, to multisets, and also achieves the challenging task of efficiently incorporating them in the shared-nothing MapReduce model. Incorporating the filtering techniques in a strategic sequence minimizes the pairs generated and joined, resulting in I/O, network and computational efficiency.

In the SSS algorithm, prefix, size and positional filtering are incorporated in the MapReduce framework. The pairs that survive filtering are joined in the third Similarity Join Stage, utilizing a Multiset File generated in the second stage. We also developed a technique to enhance the scalability of the algorithm as a contingency measure.

In the ESSJ algorithm, all the filtering techniques, namely prefix, size, positional as well as suffix filtering, are incorporated in the MapReduce framework. It is designed with a seamless and scalable Similarity Join Stage, where the similarity joins are performed without dependency on a file.

In the EASE algorithm, all the filtering techniques, namely prefix, size, positional and suffix, are incorporated in the MapReduce framework. However, it is tailored as a hybrid algorithm that exploits the strategies of both SSS and ESSJ for performing the joins. Some multiset pairs are joined utilizing the Multiset File, similar to SSS, and some are joined without utilizing it, similar to ESSJ. The algorithm harvests the benefits of both strategies.

The SSS and ESSJ algorithms were developed using Hadoop and tested using real-world Twitter data. For both SSS and ESSJ, experimental results demonstrate performance gains of over 70% in comparison to the competing state-of-the-art algorithm.

I dedicate this work to the Almighty!
Acknowledgments

It is a pleasure beyond measure to acknowledge the people who have helped and supported me to complete the Master's program. I thank Dr. Acosta for his kind, patient, intelligent, meticulous and thorough guidance and support. He made my study at the University of Toledo a very pleasant and memorable one! I thank Dr. Rob for his timely, creative, elegant, prudent and thorough guidance and support. It was wonderful and comfortable working under him! Without the guidance of Dr. Acosta and Dr. Rob, this work would not have been possible. I thank Dr. Alam for his wise, kind and gracious support throughout my Master's program! Special thanks to Dr. Vijay for his benevolent, erudite and gracious guidance and support! I thank Dr. Acosta, Dr. Alam, and the EECS and ET Departments for the financial support. I thank the EECS and ET faculty members and staff members who have helped me. I thank my parents, grandparents, brothers, relatives and friends for their support, with special thanks to my mom! Ultimately, I thank God for showering His grace on us!

Contents

Abstract
Acknowledgments
Contents
List of Tables
List of Figures
1 Introduction
2 Background
  2.1 MapReduce Model
  2.2 Hadoop Features
  2.3 Multisets and Similarity Measures
  2.4 Literature Review
    2.4.1 Serial Algorithms
    2.4.2 Parallel Algorithms
3 Strategic and Suave Processing for Similarity Joins Using MapReduce
  3.1 Stage I - Map Phase
  3.2 Stage I - Reduce Phase
  3.3 Stage II - Map Phase
  3.4 Stage II - Reduce Phase
    3.4.1 Positional Filtering
    3.4.2 Positional Filtering in Stage II - Reduce Phase
  3.5 Stage III - Map Phase
  3.6 Stage III - Reduce Phase
  3.7 Preprocessing
  3.8 Comparison of SSS with SSJ-2R
  3.9 Experimental Results
  3.10 Enhancing the Scalability of the Algorithm
  3.11 Summary
4 Adept and Agile Processing for Efficient and Scalable Similarity Joins Using MapReduce
  4.1 Stage I - Map Phase
  4.2 Stage I - Reduce Phase
  4.3 Stage II - Map Phase
  4.4 Stage II - Reduce Phase
    4.4.1 Suffix Filtering
    4.4.2 Optimizing the Minimum Prefix Hamming Distance, Hpmin
    4.4.3 Suffix Filtering in Stage II - Reduce Phase
  4.5 Stage III - Map Phase
  4.6 Stage III - Reduce Phase
  4.7 Comparison of ESSJ with SSJ-2R
  4.8 Experimental Results
  4.9 Summary
5 Efficient, Adaptable and Scalable MapReduce Algorithm for Similarity Joins Using Hybrid Strategies
  5.1 Stage II - Reduce Phase
  5.2 Stage III - Map Phase
  5.3 Stage III - Reduce Phase
  5.4 Discussion
6 Conclusion
References

List of Tables

2.1 Multiset Similarity Measures and their formulae
3.1 The number of pairs for which similarity joins are performed in the SSJ-2R and SSS algorithms
3.2 Running times of the Stages of the SSS algorithm, for 16,000 records and a similarity threshold of 0.8
3.3 Running times of the Stages of the SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.8
3.4 Running times of the SSS and SSJ-2R algorithms, for varying numbers of input records, and the corresponding performance improvement, for a similarity threshold of 0.7
3.5 Running times of the SSS and SSJ-2R algorithms, for varying numbers of input records, and the corresponding performance improvement, for a similarity threshold of 0.8
3.6 Running times of the Waves of Stage III of the SSS-SE algorithm, for 16,000 records and a similarity threshold of 0.8
4.1 The number of pairs for which similarity joins are performed in the SSJ-2R and ESSJ algorithms
4.2 Running times of the Stages of the ESSJ algorithm, for 16,000 records and a similarity threshold of 0.7
5.2 Stage III - Map Phase

The Mappers of Type I read the candidate pair records output by Stage II; the elements of the multisets in each pair are needed for similarity computation. To serve this purpose, the candidate pair is reversed and sent out with <Mj, Mi> as the key and the elements of the multiset Mi as the value. The Mappers of Type II read the preprocessed input file of Stage I, where each record consists of the MID Mi and its corresponding elements, and send them out with <Mi, m> as the key and the elements of Mi as the value. As the Type II Mappers function in the same way as the Type II Mappers of Stage II of SSS, their pseudo-code is the same and is not repeated here. The mapper instances of Stage III can be represented by Figure 5-1.

Figure 5-1: Mapper Instances of Stage III.

Algorithm 16: Stage III - Map Phase (Type I)
Input: Two types of input records. A Type I input record has the value valin = <Mi, Mj>. A Type II input record has the value valin = <Mi, Mj, Elements of Mi>.
Output: For a Type I record, the output record has the key keyout = <Mj, Mi> and the value valout = <Mi>. For a Type II record, the output record has the key keyout = <Mj, Mi> and the value valout = <Mi, Elements of Mi>.

    keyout = <Mj, Mi>;
    if valin == <Mi, Mj> then
        valout = <Mi>;
    end
    if valin == <Mi, Mj, Elements of Mi> then
        valout = <Mi, Elements of Mi>;
    end
    output(keyout, valout);

Partitioning, Grouping and Sorting are done in the same way as in Stage II.
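To make the contingent emit logic of Algorithm 16 and the Type II Mapper concrete, the following is a minimal Python sketch. It is an illustration only, not the thesis's Hadoop code: the record layouts, the function names, and the assumption that the marker m simply orders the multiset record ahead of the pair records at the reducer are all hypothetical.

```python
# Illustrative sketch of the Stage III mapper logic (not the thesis implementation).
# Record layouts, function names, and the role of the marker "m" are assumptions.

def type1_map(record):
    """Candidate-pair record from Stage II: (Mi, Mj) or (Mi, Mj, elements_of_Mi)."""
    mi, mj = record[0], record[1]
    if len(record) == 2:
        # Pair without elements attached: the reducer will consult the Multiset File.
        yield (mj, mi), (mi,)
    else:
        # Pair with the elements of Mi appended (ESSJ-style): ship them along.
        elements_of_mi = record[2]
        yield (mj, mi), (mi, elements_of_mi)

def type2_map(record):
    """Preprocessed multiset record from Stage I: (Mi, elements_of_Mi)."""
    mi, elements_of_mi = record
    # "m" is assumed here to be a marker that makes the multiset record arrive
    # ahead of the pair records grouped under the same MID at the reducer.
    yield (mi, "m"), elements_of_mi
```

For example, under these assumptions a bare pair record ('M3', 'M7') would be emitted as key ('M7', 'M3') with value ('M3',), signalling the reducer to consult the Multiset File, whereas a pair carrying the elements of M3 would ship those elements along in the value instead.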
5.3 Stage III - Reduce Phase

At every reduce instance, the records pertaining to the same MID, Mi, at the first position in the key reach it. For every MID pair record with key <Mi, Mj> in the instance, the elements of Mi are obtained from the multiset record that arrives at the same instance. The elements of Mj are already attached for some records, and since the elements of both Mi and Mj are present, the similarity is computed. For other records, the elements of Mj are not attached, and hence they are retrieved from the Multiset File before the similarity between the pair is computed.

The scalability of the EASE algorithm can be enhanced by simply adapting the technique used for enhancing the scalability of SSS for dealing with the Multiset File. The reducer instance of Stage III can be represented by Figure 5-2.

Figure 5-2: Reducer Instance of Stage III. Some pairs look up the Multiset File, while others have the elements of Mj attached to them.

The pseudo-code of Stage III - Reduce Phase is shown in Algorithm 17.

Algorithm 17: Stage III - Reduce Phase
Input: Each reduce instance has as input the key keyin = <Mi, x> and the values (value)* containing the Elements of Mi, (Mj)*, and (<Mj, Elements of Mj>)*.
Output: The output consists of records with the value valout = <Mi, Mj, Sim(Mi, Mj)>.

    for every Mj ∈ (Mj)* do
        Look up the Multiset File to get the elements of Mj;
        Compute the similarity between <Mi, Mj>, denoted by Sim(Mi, Mj);
        if Sim(Mi, Mj) > t then
            valout = <Mi, Mj, Sim(Mi, Mj)>;
            output(valout);
        end
    end
    for every Mj ∈ (<Mj, Elements of Mj>)* do
        /* Both the elements of Mi and Mj are available for similarity computation,
           as the elements of Mj are appended to Mj in the entry. */
        Compute the similarity between <Mi, Mj>, denoted by Sim(Mi, Mj);
        if Sim(Mi, Mj) > t then
            valout = <Mi, Mj, Sim(Mi, Mj)>;
            output(valout);
        end
    end
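The two joining paths of Algorithm 17 can likewise be sketched in Python. This is a hedged illustration rather than the thesis implementation: the grouped values are passed in as explicit arguments, an in-memory dict stands in for the Multiset File, and generalized (multiset) Jaccard is used only as a stand-in for Sim(Mi, Mj); the thesis considers several multiset similarity measures (Table 2.1).

```python
from collections import Counter

def multiset_similarity(a, b):
    """Generalized Jaccard between two multisets: sum of minimum counts over
    sum of maximum counts. Used here only as a stand-in for Sim(Mi, Mj)."""
    a, b = Counter(a), Counter(b)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0

def stage3_reduce(mi, elements_of_mi, bare_mjs, attached_pairs, multiset_file, t):
    """One reduce call for the key group <Mi, *> (argument names are illustrative).

    bare_mjs       -- MIDs whose elements must be fetched from the Multiset File
    attached_pairs -- (Mj, elements_of_Mj) records that already carry the elements
    multiset_file  -- stand-in for the Multiset File, here a dict MID -> elements
    t              -- similarity threshold
    """
    # Path 1: partner's elements not attached, so look them up in the Multiset File.
    for mj in bare_mjs:
        sim = multiset_similarity(elements_of_mi, multiset_file[mj])
        if sim > t:
            yield (mi, mj, sim)
    # Path 2: partner's elements are appended to the pair record, join directly.
    for mj, elements_of_mj in attached_pairs:
        sim = multiset_similarity(elements_of_mi, elements_of_mj)
        if sim > t:
            yield (mi, mj, sim)

# Tiny example: M1 is joined with M2 via the Multiset File and with M3 directly.
if __name__ == "__main__":
    mfile = {"M2": ["a", "a", "b"]}
    out = stage3_reduce("M1", ["a", "a", "b", "c"],
                        bare_mjs=["M2"],
                        attached_pairs=[("M3", ["a", "b", "c", "c"])],
                        multiset_file=mfile, t=0.5)
    print(list(out))  # [('M1', 'M2', 0.75), ('M1', 'M3', 0.6)]
```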
5.4 Discussion

The EASE algorithm applies all the filtering techniques, namely prefix, size, positional and suffix filtering. To join the surviving pairs, it follows the strategies of both the SSS and ESSJ algorithms. Some reduce instances of Stage II write to the Multiset File, like SSS, while other reduce instances of Stage II append the elements of the multiset to the surviving pairs, like ESSJ. As a result, the Multiset File in EASE will be smaller than in SSS, and hence the instances will take less time to load it into memory. In ESSJ, the elements of the multiset are replicated and appended to all the surviving pairs. This does not hinder performance, as the number of pairs surviving all the filtering is small, as observed in Table 4.1 of the previous chapter. However, depending on the nature of the dataset, and as the dataset grows, reducing this number, as in the case of EASE, by replicating and appending only for certain instances, can improve I/O and network efficiency. Thus, EASE can harvest the benefits of both SSS and ESSJ. The EASE algorithm can even be adapted to perform more like SSS, or more like ESSJ, depending on what suits the nature and size of the dataset.

Chapter 6
Conclusion

Three efficient MapReduce algorithms for performing Similarity Joins between multisets, namely SSS, ESSJ and EASE, are proposed in this thesis.

In Chapter 3, it was shown how to extend the prefix, size and positional filtering techniques to multisets and formulate the SSS algorithm, which consists of three MapReduce stages. The prefix, size and positional filtering techniques are efficiently incorporated in a strategic sequence in the MapReduce framework in the first two stages. In the third stage, the pairs surviving filtering are joined in the MapReduce framework utilizing a Multiset File generated in the second stage. A technique for enhancing the scalability of the algorithm was presented as a contingency measure.

In Chapter 4, it was shown how to extend the suffix filtering technique to multisets and formulate the ESSJ algorithm. ESSJ also consists of three MapReduce stages. All the filtering techniques, namely prefix, size, positional and suffix filtering, were incorporated in the MapReduce framework during the first two stages. An added specialty of the proposed algorithm is the seamless and scalable similarity join stage, where the need for dependency on a file is completely avoided.

In Chapter 5, the EASE algorithm, which utilizes hybrid strategies for performing the joins, viz., the strategies of both SSS, which utilizes the Multiset File, and ESSJ, which does not use any file for performing the joins, was presented and discussed. It was hypothesized that EASE, by utilizing both strategies, will lead to improved I/O, network and computational efficiency.

In Chapter 3, a versatile technique to preprocess the input data, which can be applied to SSS, ESSJ and EASE as well as to other similarity join algorithms, was also presented.

For all three algorithms, SSS, ESSJ and EASE, fewer candidate multiset pairs are generated as a result of applying the filtering techniques, thereby minimizing I/O access and network bottlenecks. As only the surviving pairs are joined, computational efficiency results. The SSS and ESSJ algorithms were compared in detail with the competing algorithm, experimentally analyzed by testing them on the Twitter data, and shown to demonstrate performance gains of around 70%.

It is worthwhile to note that Suffix Filtering can be easily incorporated in SSS in the same way as in ESSJ. As future work, the performance of the EASE algorithm can be experimentally analyzed and compared by testing it on various real-world datasets.
