Data fusion in managing crowdsourcing data analytics systems

Data Fusion in Managing Crowdsourcing Data Analytics Systems

LIU XUAN
Bachelor of Engineering
Tsinghua University, China

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2013

ACKNOWLEDGEMENT

I thank the many people who offered me their guidance and help during my Ph.D. study at the National University of Singapore.

First and foremost, my sincere gratitude goes to my supervisor, Professor Beng Chin Ooi, who has supported me throughout my five years of study with his extensive knowledge, his forward-looking thinking, his inspiring moral guidance and his magnanimous patience. Professor Ooi selflessly shared with me his valuable experience in both research and life, and offered me opportunities to take internships at research labs.

I would like to thank Dr. Divesh Srivastava and Dr. Xin (Luna) Dong, my mentors during my internships at the AT&T research lab in the summers of 2009, 2010 and 2011. They showed me broad knowledge, care and patience throughout the many discussions we had. Both of them also helped me a great deal in daily life during my internships in the US. I would also like to thank their family members, who provided a lot of assistance during my internships.

I would like to thank Professors Kian-Lee Tan and Chee-Yong Chan, and the external reviewer, for their valuable comments on this dissertation.

I would like to thank the following current and former colleagues: Dr. Zhenjie Zhang, A/Prof. Sai Wu, Meiyu Lu, Meihui Zhang, Wei Wang, Jinyang Gao and others, for their assistance and collaboration in helping me solve research problems. I would like to thank all my colleagues in the database lab. The common interests shared among us have always been my source of inspiration.

I would like to thank Dr. Zhifeng Bao for his guidance in daily life during my internship in 2009.
I would like to thank Fang Yu, Dr. He Yan, Dr. Yun Mao, Dr. Yu Jin, Dr. Feng Qian, Dr. Changbin Liu, Zhaoguang Wang and Tianhui Xu for their assistance during my internships. I would like to thank my friends Dr. Rong Ge and Dr. Hongyu Liang for helping me solve several sophisticated problems.

I owe my deepest gratitude to my parents for their support and encouragement throughout my life.

CONTENTS

Acknowledgement
Abstract

1 Introduction
  1.1 Online Data Fusion of Categorical Data Problem
  1.2 Data Fusion of Continuous Data Problem
  1.3 Applications of Data Fusion Methods in Crowdsourcing
  1.4 The Limitation of Existing Methods
    1.4.1 Gaps of the online data fusion problem of categorical data
    1.4.2 Gaps of the data fusion problem of continuous data
    1.4.3 Gaps of the application of data fusion techniques in managing crowdsourcing data analytics systems
  1.5 Research Objectives
  1.6 Thesis Organization
2 Literature Review
  2.1 Data Integration
  2.2 Categorical Data Fusion
  2.3 Online Aggregation
  2.4 Multi-Sensor Data Fusion
  2.5 Crowdsourcing Data Analytics Management
    2.5.1 Crowdsourcing Systems and Applications
    2.5.2 Crowdsourcing Database
    2.5.3 Quality Control in Crowdsourcing Systems
3 Data Fusion of Categorical Data
  3.1 Motivation
  3.2 Background for Data Fusion
  3.3 Framework of Online Fusion
    3.3.1 Probability computation for independent sources
  3.4 Considering Copying in Online Fusion
    3.4.1 Vote counting
    3.4.2 Probability computation
    3.4.3 Source ordering
  3.5 Extensions
  3.6 Experimental Results
    3.6.1 Experiment setup
    3.6.2 Overall Experimental results
    3.6.3 Detailed Experimental Results of Pragmatic Algorithm
  3.7 Summary
4 Data Fusion of Continuous Values
  4.1 Motivation
  4.2 Data Model
    4.2.1 Data Model
  4.3 Data Fusion Method
    4.3.1 Estimation of the Drift of the Source
    4.3.2 Supervised Learning Method
  4.4 Experiments
    4.4.1 Experiments setup
    4.4.2 Varying the Number of Sources
    4.4.3 Varying the Number of Objects
    4.4.4 Varying the Drift of the Sources
    4.4.5 Varying the Random Error of the Sources
    4.4.6 Varying the True Values of the Sources
  4.5 Summary
5 Resolving Data Conflicts in Crowdsourcing
  5.1 Motivation
  5.2 Overview
    5.2.1 Architecture of the Framework
    5.2.2 Deploying Applications using our framework
  5.3 Prediction Model
    5.3.1 Economic Model in AMT
    5.3.2 Voting-based Prediction
    5.3.3 Sampling-based Accuracy Estimation
  5.4 Verification Model
    5.4.1 Probability-based Verification
    5.4.2 Online Processing
    5.4.3 Result Presentation
  5.5 Performance Evaluation
    5.5.1 Application 1: TSA
    5.5.2 Application 2: IT
  5.6 Summary
6 Conclusion
  6.1 Online Data Fusion of Categorical Values
  6.2 Data Fusion of Continuous Values
  6.3 Applications of Data Fusion in Crowdsourcing
  6.4 Future Work
Bibliography

ABSTRACT

The rapid growth of Web data has attracted a great deal of research interest, including work on storing, indexing and query processing over Web data. However, much of this data is dirty and erroneous, and such data can be propagated further through copying. Hence, there can be multiple conflicting values representing the same object, and it is crucially important to distinguish the correct value from the conflicting ones.

Traditional data integration techniques allow querying structured data on the Web.
They take the union of the answers retrieved from different sources and can thus return conflicting information. Recently proposed data fusion techniques, on the other hand, aim to find the true values, but are mainly designed for offline aggregation of categorical data and are time-consuming. In this thesis, we present three techniques to solve the data fusion problem: an online data fusion method for categorical data, a data fusion method for continuous data, and a data fusion method used in designing crowdsourcing-based data analytics systems.

First, we address online data fusion of categorical data in order to improve efficiency. Our method starts by returning answers from the first probed source, and refreshes the answers as it probes more sources and applies fusion techniques to the retrieved data. For each returned answer, it shows the likelihood that the answer is correct, and stops retrieving data for it once it is confident enough that data from the unprocessed sources are unlikely to change the answer. We address key problems in building such an online data fusion system and show empirically that the system can start returning correct answers quickly and terminate fast without sacrificing the quality of the answers.

Second, we design a novel data fusion method to resolve conflicts among continuous data. Specifically, our method models the drift and the random error of each data source. By maximizing the likelihood of the observed conflicting data, our method finds the true values by solving linear equations. Furthermore, we design an iterative algorithm that resolves the conflicts without requiring prior knowledge of the continuous data. We address key problems in fusing continuous data and conduct extensive experimental studies showing that our proposed method can efficiently reduce the error in the fusion results.
Finally, we adapt and apply the proposed data fusion methods to design a framework for managing crowdsourcing data analytics systems. Our framework is designed to support the deployment of various crowdsourcing applications. In this thesis, we discuss two key problems in designing the framework: the quality-sensitive answering model, which guides the crowdsourcing engine in processing and monitoring the human tasks, and the data fusion-based answer verification model, which integrates the answers and returns the results to the user. We conduct extensive experiments to validate that our proposed framework handles crowdsourcing-based data analytics jobs effectively and efficiently with minimum cost.

The work presented in this thesis has had a significant impact on both the data fusion area and the crowdsourcing data management area. The online data fusion method introduces a novel way of efficiently resolving conflicting data by proposing computation methods for source ordering, vote counting, truth finding and termination justification. The data fusion method for continuous data provides a novel way to improve the quality of continuous data (e.g. scientific data) by proposing a supervised learning method. Our proposed framework for managing crowdsourcing data analytics systems presents a new way to quantitatively analyze the relationship between the quality of the results and the cost. These new ideas are all generic and could be used to solve many other problems.

LIST OF TABLES

3.1 Output at each time point in the motivating example. The time is made up for the purpose of illustration.
3.2 Vote count of each source in the motivating example.
3.3 Example 3.3. Vote count of NY and NJ as we probe S1-S3 in the order of S3, S2, S1.
3.4 Example 3.7: Vote counts computed in source ordering. The maximum vote count in each round of the pragmatic approach is in bold font.
4.1 Continuous Observed Values
5.1 Users’ Opinion on iPhone4S
5.2 Table of Notations
5.3 An Example of Workers’ Answers
5.4 Results of Verification Models

LIST OF FIGURES

1.1 An Example of Conflicting Weather Data Provided by Several Weather Forecasting Websites
1.2 Crowdsourcing Application
3.1 Sources for the motivating example. For each source we show the answer it provides for query “Where is AT&T Shannon Labs” in parenthesis and its accuracy in a circle. An arrow from S to S′ means that S copies some data from S′.
3.2 Observations of output values by Pragmatic.
3.3 Observations of output probabilities by Pragmatic.
3.4 Stable correct values of different methods.
3.5 Precision of various methods.
3.6 Fusion CPU time.
3.7 Method scalability.
3.8 Comparison of different source ordering strategies.
3.9 Comparison of different source ordering strategies.
3.10 Comparison of different source ordering strategies.
3.11 Comparison of different vote counting strategies.
3.12 Comparison of different vote counting strategies.
3.13 Comparison of different vote counting strategies.
3.14 Comparison of different termination conditions.
3.15 Comparison of different termination conditions.
3.16 Comparison of different termination conditions.
CHAPTER 6
CONCLUSION

In this thesis, we have presented data fusion methods that effectively and efficiently integrate conflicting data, covering both categorical values and continuous values. Furthermore, we have applied the proposed data fusion methods to manage crowdsourcing data analytics systems. To achieve this goal, we have proposed techniques to solve three sub-problems: the online data fusion problem for categorical values, the data fusion problem for continuous values, and the application of data fusion in crowdsourcing. The following sections summarize our contributions to each of the sub-problems.

6.1 Online Data Fusion of Categorical Values

We have proposed an online data fusion method to resolve data conflicts effectively and efficiently. Our method borrows the idea of online aggregation [44], which also refreshes answers as more data are processed and outputs the confidence of the answers. To the best of our knowledge, our method is the first data fusion method that resolves data conflicts online. The novelty of our work has three aspects. First, we probe data from multiple sources and describe source ordering techniques that enable quick return of the correct answers and quick termination. Second, data fusion techniques are very different from statistics computation, leading to different ways of computing expected probabilities and probability ranges. Finally, we consider copying between sources, which raises new challenges such as vote counting when a copier is probed before the copied source.

To solve this problem, we have

• proposed an online data fusion method that returns answers, together with the likelihood of each answer being correct, as it probes new sources, and terminates when the unprocessed sources are unlikely to change the answers.
• provided, for each returned answer, its expected probability, maximum probability and minimum probability, based on our observation of the retrieved data and our knowledge of source quality.
• proposed source ordering algorithms that can lead to early return of correct answers and quick convergence.
• tested our method on both real-world data and synthetic data, showing that our methods can often return correct answers very quickly, terminate fast without sacrificing the quality of the final answers, and are scalable.

Based on the experimental results, we have found that our proposed method terminates fast while still providing very accurate results.

6.2 Data Fusion of Continuous Values

We have proposed a data fusion method for a new domain of data, namely continuous values, to resolve the conflicts among them. The novelty of our method includes:

• Our method models the continuous data provided by multiple data sources well, using a systematic error and random error model.
• Our method is able to effectively and efficiently identify the systematic error as well as the random error from the observed values.

To solve this problem, we have

• modeled the systematic error and the random error of each data source that provides continuous values using a Gaussian model.
• proved that the problem of identifying the systematic error and random error by maximizing the likelihood of the observation has infinitely many solutions.
• proposed a supervised learning method that obtains a unique solution to the likelihood maximization problem using very little training data.
• conducted extensive experiments to validate the performance of our proposed method, measuring the absolute error and the running time.

The experimental results show that our proposed method can significantly reduce the error in the data fusion results.
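As a minimal sketch of the Gaussian error model summarized in this section, assume that source i reports x_i = v + d_i + e_i for an object with true value v, where d_i is the systematic error (drift) of the source and e_i is a random error drawn from N(0, s_i^2). Under this assumption, and once d_i and s_i^2 have been estimated from a few training objects with known true values, the value that maximizes the likelihood of the observations is the inverse-variance weighted mean of the drift-corrected observations. The function names `estimate_source` and `fuse` and all numbers below are hypothetical and are chosen for illustration only; they are not taken from the implementation evaluated in this thesis.

```python
from statistics import mean, pvariance


def estimate_source(observed, truths):
    """Estimate one source's drift and random-error variance.

    `observed` holds the values the source reported for training objects
    and `truths` holds the known true values of the same objects.
    Returns (drift, variance), the maximum-likelihood estimates under
    the assumed Gaussian model.
    """
    residuals = [o - t for o, t in zip(observed, truths)]
    drift = mean(residuals)
    variance = pvariance(residuals, mu=drift)
    return drift, variance


def fuse(observations, drifts, variances):
    """Fuse the conflicting observations of a single object.

    Each observation is corrected by its source's drift and weighted by
    the inverse of its source's variance (variances must be positive).
    """
    weights = [1.0 / v for v in variances]
    corrected = [x - d for x, d in zip(observations, drifts)]
    return sum(w * c for w, c in zip(weights, corrected)) / sum(weights)


if __name__ == "__main__":
    # Two hypothetical sources observed on three training objects.
    truths = [10.0, 20.0, 30.0]
    drift_a, var_a = estimate_source([11.5, 20.5, 31.0], truths)
    drift_b, var_b = estimate_source([8.0, 20.0, 26.0], truths)
    # Fuse their conflicting reports for a new object.
    print(fuse([16.2, 12.5], [drift_a, drift_b], [var_a, var_b]))
```

In this sketch a source with a small random-error variance dominates the fused value, while a large but stable drift does little harm once it has been estimated and subtracted.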
6.3 Applications of Data Fusion in Crowdsourcing

We have designed a novel framework, based on our data fusion methods, to manage crowdsourcing data analytics systems. We have proposed the quality-sensitive answering model for our framework. To the best of our knowledge, this quality-sensitive model is the first model that considers the relationship between the quality and the cost of crowdsourcing platforms. The model guides the query engine to generate proper query plans based on the accuracy requirement.

To solve this problem, we have

• proposed the prediction model, which predicts the number of human workers that need to be hired in the crowdsourcing system.
• adapted and applied the data fusion methods as the verification model to find the correct answer among the conflicting answers given by human workers.
• deployed application systems using our proposed framework to evaluate the performance of our proposed models on real crowdsourcing data, showing that our methods can guarantee the quality of the answers of the crowdsourcing platform given a fixed cost.

To evaluate the performance of our proposed method, we used real Twitter data and Flickr data as our queries. Amazon Mechanical Turk was employed as our crowdsourcing platform. The results show that our proposed model can provide high-quality answers while keeping the total cost low. The experimental results show that

• using our proposed framework, the accuracy of the crowdsourcing-based method is better than that of the machine learning methods.
• our framework requires the fewest workers, which reduces the cost the most.
• the accuracy of our framework satisfies the accuracy constraint.

To sum up, we have designed a data fusion-based framework to manage crowdsourcing data analytics systems. Our proposed framework achieves high accuracy and runs efficiently while spending only as much as needed.
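The trade-off between accuracy and cost described above can be illustrated with a simple calculation. The sketch below is not necessarily the prediction or verification model used in this thesis; it assumes a binary question, independent workers who are each correct with the same probability p > 0.5, and plain majority voting. Under these assumptions the probability that the majority is correct follows from the binomial distribution, and the smallest odd number of workers that meets a required accuracy can be found by search. The names `majority_accuracy`, `workers_needed` and `majority_vote` are hypothetical.

```python
from collections import Counter
from math import comb


def majority_accuracy(n, p):
    """Probability that more than half of n independent workers, each
    correct with probability p, give the correct answer."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


def workers_needed(p, required, max_workers=999):
    """Smallest odd number of workers whose majority vote reaches the
    required accuracy, or None if max_workers is not enough."""
    if p <= 0.5:
        raise ValueError("majority voting needs worker accuracy above 0.5")
    for n in range(1, max_workers + 1, 2):
        if majority_accuracy(n, p) >= required:
            return n
    return None


def majority_vote(answers):
    """Return the most frequent answer among the workers' answers."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    print(workers_needed(0.7, 0.9))
    print(majority_vote(["positive", "negative", "positive"]))
```

For example, with workers who are correct 70% of the time, seven workers give a majority accuracy of about 0.874 and nine workers give about 0.901, so `workers_needed(0.7, 0.9)` returns 9. Each extra worker adds cost, which is why predicting the required number in advance matters.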
This framework can be extended to deploy a variety of crowdsourcing applications.

6.4 Future Work

Future work may include:

• Combining our online data fusion techniques with those that consider overlap between sources. In our online categorical data fusion, we only consider the dependency between a pair of data sources through the copying probability. We could also treat the overlap between a pair of data sources as a dependency. The overlap is defined as the percentage of objects for which the two data sources provide the same value. Note that two sources can provide the same value without one copying from the other. Overlap information is usually easier to obtain than the copying probability. One possible research direction is to exploit the overlap information to improve the accuracy of the fusion results.
• Exploring other quality measures, such as freshness of data, in the online data fusion method to improve the accuracy of results. In our work, we model the truth of each object as a fixed value in both categorical data fusion and continuous data fusion. However, in the real world, the true values of many objects vary over time. For example, the price of a stock may change sharply within one day. Therefore, it is crucially important to fuse conflicting data with timestamps efficiently and effectively. One possible research direction is to adapt our online data fusion method to this kind of data fusion problem.
• Considering the copying relationship between sources providing continuous values. Identifying this relationship is quite complex. There may be two types of copying. The first type is copying as categorical data, where the copied value is exactly the same as the value being copied. The second type is copying with disturbance, i.e., the copied value could differ from the original value.
The detection of copying as categorical data is simple and can be solved by adapting our method. Identifying copying with disturbance could be a new research direction.
• Considering the coverage of the sources when sorting them. Our method supports the fusion of data sources without full coverage, i.e., sources that do not provide values for some objects. However, our source ordering methods do not take coverage into account. By considering the coverage information, we could sort the sources better in order to terminate earlier.
• Designing online algorithms to fuse continuous data more efficiently. Our continuous data fusion method works well in the offline scenario. It is important to design an online continuous data fusion algorithm to solve the problem efficiently.

BIBLIOGRAPHY

[1] Serge Abiteboul. Querying semi-structured data. Springer, 1997.
[2] Serge Abiteboul and Oliver M Duschka. Complexity of answering queries using materialized views. In Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 254–263. ACM, 1998.
[3] Omar Alonso, Daniel E. Rose, and Benjamin Stewart. Crowdsourcing for relevance evaluation. SIGIR Forum, 42:9–15, 2008.
[4] Yigal Arens, Chin Y Chee, Chun-Nan Hsu, and Craig A Knoblock. Retrieving and integrating data from multiple information sources. International Journal of Intelligent and Cooperative Information Systems, 2(02):127–158, 1993.
[5] Yigal Arens, Craig A Knoblock, and Wei-Min Shen. Query reformulation for dynamic information integration. Springer, 1996.
[6] Yaakov Bar-Shalom and Edison Tse. Tracking in a cluttered environment with probabilistic data association. Automatica, 11(5):451–460, 1975.
[7] Laure Berti-Equille. Quality Awareness for Managing and Mining Data. PhD thesis, Universite de Rennes 1, 2007.
[8] Gerald J Bierman. Factorization methods for discrete sequential estimation. Courier Dover Publications, 2006.
[9] Samuel S Blackman. Multiple-target tracking with radar applications. Artech House, Dedham, MA, 1986.
[10] WD Blair and T Bar-Shalom. Tracking maneuvering targets with multiple sensors: Does more data always mean better estimates? Aerospace and Electronic Systems, IEEE Transactions on, 32(1):450–456, 1996.
[11] Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, and Paolo Papotti. Probabilistic models to reconcile complex data from inaccurate data sources. In CAiSE, pages 83–97, 2010.
[12] Jens Bleiholder, Samir Khuller, Felix Naumann, Louiqa Raschid, and Yao Wu. Query planning in the presence of overlapping sources. In EDBT, pages 811–828, 2006.
[13] Johan Bollen, Alberto Pepe, and Huina Mao. Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. In CoRR, 2009.
[14] Richard P Brent. Algorithms for minimization without derivatives. Courier Dover Publications, 1973.
[15] Michael J. Cafarella, Alon Y. Halevy, and Jayant Madhavan. Structured data on the web. Commun. ACM, 54(2):72–79, 2011.
[16] Chris Callison-Burch and Mark Dredze. Creating speech and language data with amazon’s mechanical turk. In Proc. of NAACL HLT Workshop, pages 1–12, 2010.
[17] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Daniele Nardi, and Riccardo Rosati. Information integration: Conceptual modeling and reasoning support. In Cooperative Information Systems, 1998. Proceedings. 3rd IFCIS International Conference on, pages 280–289. IEEE, 1998.
[18] Sudarshan Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey Ullman, and Jennifer Widom. The tsimmis project: Integration of heterogenous information sources. 1994.
[19] Chandra Chekuri and Anand Rajaraman. Conjunctive query containment revisited. In Database Theory - ICDT’97, pages 56–70. Springer, 1997.
[20] Xin Dong, Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Global detection of complex copying relationships between sources. PVLDB, 3(1):1358–1369, 2010.
[21] Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Integrating conflicting data: The role of source dependence. PVLDB, 2(1):550–561, 2009.
[22] Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562–573, 2009.
[23] Xin Luna Dong and Felix Naumann. Data fusion - resolving data conflicts for integration. PVLDB, 2(2):1654–1655, 2009.
[24] Oliver M Duschka and Michael R Genesereth. Answering recursive queries using views. In Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 109–116. ACM, 1997.
[25] Oliver M Duschka and Michael R Genesereth. Query planning in infomaster. In Proceedings of the 1997 ACM symposium on Applied computing, pages 109–111. ACM, 1997.
[26] James S Dyer, Peter C Fishburn, Ralph E Steuer, Jyrki Wallenius, and Stanley Zionts. Multiple criteria decision making, multiattribute utility theory: the next ten years. Management science, 38(5):645–654, 1992.
[27] Oren Etzioni, Keith Golden, and Daniel S Weld. Sound and efficient closed-world reasoning for planning. Artificial Intelligence, 89(1):113–148, 1997.
[28] R.A. Fisher. Statistical methods for research workers. Oliver and Boyd, 1954.
[29] Daniela Florescu, Alon Levy, Ioana Manolescu, and Dan Suciu. Query optimization in the presence of limited access patterns. In ACM SIGMOD Record, volume 28, pages 311–322. ACM, 1999.
[30] Daniela Florescu, Louiqa Raschid, and Patrick Valduriez. Using heterogeneous equivalences for query rewriting in multidatabase systems. In CoopIS, pages 158–169. Citeseer, 1995.
[31] Thomas E Fortmann, Yaakov Bar-Shalom, and Molly Scheffe.
Multitarget tracking using joint probabilistic data association. In Decision and Control including the Symposium on Adaptive Processes, 1980 19th IEEE Conference on, volume 19, pages 807–812. IEEE, 1980.
[32] Michael J. Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, and Reynold Xin. Crowddb: answering queries with crowdsourcing. In SIGMOD, pages 61–72, 2011.
[33] Keinosuke Fukunaga. Introduction to statistical pattern recognition. Access Online via Elsevier, 1990.
[34] Alban Galland, Serge Abiteboul, Amélie Marian, and Pierre Senellart. Corroborating information from disagreeing views. In WSDM, pages 131–140, 2010.
[35] Jinyang Gao, Xuan Liu, Beng Chin Ooi, Haixun Wang, and Gang Chen. An online cost sensitive decision-making method in crowdsourcing systems. In SIGMOD Conference, 2013.
[36] Arthur Gelb. Applied optimal estimation. The MIT press, 1974.
[37] Arpita Ghosh, Satyen Kale, and Preston McAfee. Who moderates the moderators?: crowdsourcing abuse detection in user-generated content. In Proc. of ACM Conference on Electronic Commerce, pages 167–176, 2011.
[38] Catherine Grady and Matthew Lease. Crowdsourcing document relevance assessment with mechanical turk. In Proc. of NAACL HLT Workshop, pages 172–179, 2010.
[39] Laura Haas, Donald Kossmann, Edward Wimmers, and Jun Yang. Optimizing queries across diverse data sources. 1997.
[40] David Lee Hall and Sonya Anne Hall McMullen. Mathematical techniques in multisensor data fusion. Artech House, 2004.
[41] David L Hall and Robert J Linn. Survey of commercial software for multisensor data fusion. In Optical Engineering and Photonics in Aerospace Sensing, pages 98–109. International Society for Optics and Photonics, 1993.
[42] David L Hall and James Llinas. An introduction to multisensor data fusion. Proceedings of the IEEE, 85(1):6–23, 1997.
[43] DL Hall and J Llinas.
A challenge for the data fusion community i: research imperatives for improved processing. In Proc. 7th Nat. Symp. on Sensor Fusion, 1994.
[44] Joseph M. Hellerstein, Peter J. Haas, and Helen J. Wang. Online aggregation. In SIGMOD Conference, pages 171–182, 1997.
[45] Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. Quality management on amazon mechanical turk. In Proc. of ACM SIGKDD Workshop, pages 64–67, 2010.
[46] Gabriella Kazai, Jaap Kamps, Marijn Koolen, and Natasa Milic-Frayling. Crowdsourcing for book search evaluation: impact of hit design on comparative system ranking. In SIGIR, pages 205–214, 2011.
[47] O Kessler, K Askin, N Beck, J Lynch, F White, D Buede, D Hall, and J Llinas. Functional description of the data fusion process. Office of Naval Technology, Naval Air Development Center, Warminster, PA, 1992.
[48] Josef Kittler. Mathematical methods of feature selection in pattern recognition. International Journal of Man-Machine Studies, 7(5):609–637, 1975.
[49] Aniket Kittur, Ed H. Chi, and Bongwon Suh. Crowdsourcing user studies with mechanical turk. In SIGCHI, pages 453–456, 2008.
[50] L.A. Klein. Sensor and data fusion: a tool for information assessment and decision making. Press Monographs. SPIE Press, 2004.
[51] Lawrence A Klein. Sensor and data fusion concepts and applications. Society of Photo-Optical Instrumentation Engineers (SPIE), 1993.
[52] Chung T Kwok, Daniel S Weld, et al. Planning to gather information. In Proceedings of the National Conference on Artificial Intelligence, pages 32–39. Citeseer, 1996.
[53] Eric Lambrecht, Subbarao Kambhampati, and Senthil Gnanaprakasam. Optimizing recursive information gathering plans. In IJCAI, volume 99, pages 1204–1211, 1999.
[54] Jonathan Ledlie, Billy Odero, Einat Minkov, Imre Kiss, and Joseph Polifroni. Crowd translator: on building localized speech recognizers through micropayments. SIGOPS Oper. Syst. Rev., 43:84–89, 2010.
[55] Chulhee Lee and David A Landgrebe. Decision boundary feature extraction for nonparametric classification. Systems, Man and Cybernetics, IEEE Transactions on, 23(2):433–444, 1993.
[56] Maurizio Lenzerini. Data integration: A theoretical perspective. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 233–246. ACM, 2002.
[57] Alon Levy, Anand Rajaraman, and Joann Ordille. Querying heterogeneous information sources using source descriptions. 1996.
[58] Alon Y Levy. Obtaining complete answers from incomplete databases. In VLDB, volume 96, pages 402–412. Citeseer, 1996.
[59] Alon Y Levy. Logic-based techniques in data integration. In Logic-based artificial intelligence, pages 575–595. Springer, 2000.
[60] Alon Y Levy, Anand Rajaraman, and Jeffrey D Ullman. Answering queries using limited external query processors. In Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 227–237. ACM, 1996.
[61] Data Fusion Lexicon. Data fusion subpanel of the joint directors of laboratories technical panel for 3. FE White, Code, 4202.
[62] Jia Li and James Z. Wang. Real-time computerized annotation of pictures. IEEE Trans. Pattern Anal. Mach. Intell., 30:985–1002, June 2008.
[63] Richard Lippmann. An introduction to computing with neural nets. ASSP Magazine, IEEE, 4(2):4–22, 1987.
[64] Xuan Liu, Xin Luna Dong, Beng Chin Ooi, and Divesh Srivastava. Online data fusion. In PVLDB, pages 932–943, 2011.
[65] Xuan Liu, Meiyu Lu, Beng Chin Ooi, Yanyan Shen, Sai Wu, and Meihui Zhang. Cdas: a crowdsourcing data analytics system. Proceedings of the VLDB Endowment, 5(10):1040–1051, 2012.
[66] J Llinas and Richard T Antony. Blackboard concepts for data fusion applications. International journal of pattern recognition and artificial intelligence, 7(02):285–308, 1993.
[67] J Llinas and DL Hall.
A challenge for the data fusion community ii: Infrastructure imperatives. In Proc. 7th Natl. Symp. on Sensor Fusion, 1994. 16 [68] Adam Marcus, Eugene Wu, David R. Karger, Samuel Madden, and Robert C. Miller. Demonstration of qurk: a query processor for humanoperators. In SIGMOD, pages 1315–1318, 2011. 6, 17 [69] Adam Marcus, Eugene Wu, Samuel Madden, and Robert C. Miller. Crowdsourced databases: Query processing with people. In CIDR, pages 211–214, 2011. 6, 17 [70] George A. Mihaila, Louiqa Raschid, and Maria-Esther Vidal. Using quality of data metadata for source selection and ranking. In WebDB (Informal Proceedings), pages 93–98, 2000. 14 [71] Robert Munro, Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai, Robin Melnick, Christopher Potts, Tyler Schnoebelen, and Harry Tily. Crowdsourcing and language studies: the new generation of linguistic data. In Proc. of NAACL HLT Workshop, pages 122–130, 2010. 6, 17, 18 [72] D Mush and B Horne. Progress in supervised neural networks: whats new since lippman. IEEE Signal Processing Magazine, pages 8–39, 1993. 16 126 BIBLIOGRAPHY [73] Felix Naumann. Quality-Driven Query Answering for Integrated Information Systems. Springer, 2002. 14 [74] Felix Naumann, Alexander Bilke, Jens Bleiholder, and Melanie Weis. Data fusion in three steps: Resolving schema, tuple, and value inconsistencies. IEEE Data Eng. Bull, 29(2):21–31, 2006. 13 [75] DF Noble. Template-based data fusion for situation assessment. In Proc. 1987 Tri-Service Data Fusion Symp, volume 1, pages 152–162, 1987. 16 [76] Stefanie Nowak and Stefan R¨ uger. How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In MIR, pages 557–566, 2010. 6, 17 [77] Aditya Parameswaran, Anish Das Sarma, Hector Garcia-Molina, Neoklis Polyzotis, and Jennifer Widom. Human-assisted graph search: it’s okay to ask questions. VLDB, pages 267–278, 2011. 6, 18, 82 [78] Aditya G. 
Parameswaran, Hector Garcia-Molina, Hyunjung Park, Neoklis Polyzotis, Aditya Ramesh, and Jennifer Widom. Crowdscreen: algorithms for filtering data with humans. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pages 361–372, New York, NY, USA, 2012. ACM. 18 [79] Aditya G. Parameswaran and Neoklis Polyzotis. Answering queries using humans, algorithms and databases. In CIDR, pages 160–166, 2011. 6, 18 [80] Elijah Polak. Computational methods in optimization: a unified approach, volume 77. Access Online via Elsevier, 1971. 16 [81] Aubrey B Poore. Multidimensional assignment formulation of data association problems arising from multitarget and multisensor tracking. Computational Optimization and Applications, 3(1):27–57, 1994. 16 [82] Aubrey B Poore and Nenad Rijavec. A numerical study of some data association problems arising in multitarget tracking, volume 339. Kluwer Academic Publishers BV, Boston, MA, 1994. 16 127 BIBLIOGRAPHY [83] Aubrey B Poore and Alexander J Robertson III. A new lagrangian relaxation based algorithm for a class of multidimensional assignment problems. Computational Optimization and Applications, 8(2):129–150, 1997. 16 [84] Rachel Pottinger and Alon Halevy. Minicon: A scalable algorithm for answering queries using views. The VLDB JournalThe International Journal on Very Large Data Bases, 10(2-3):182–198, 2001. 13 [85] Anand Rajaraman, Yehoshua Sagiv, and Jeffrey D Ullman. Answering queries using templates with binding patterns. In Proceedings of the fourteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 105–112. ACM, 1995. 13 [86] Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. Collecting image annotations using amazon’s mechanical turk. In Proc. of NAACL HLT Workshop, pages 139–147, 2010. 6, 17 [87] Anish Das Sarma, Xin Luna Dong, and Alon Y. Halevy. Data integration with dependent sources. In EDBT, pages 401–412, 2011. 
14, 26 [88] Sara A Solla, Todd K Leen, and Klaus-Robert M¨ uller. Advances in neural information processing systems. The MIT Press, 2000. 16 [89] RL Streit and TE Luginbuhl. A probabilistic multi-hypothesis tracking algorithm without enumeration and pruning. In Proc. 6th Joint Sevice Data Fusion Symp, 1993. 16 [90] Roy L Streit, Stephen G Greineder, and Tod E Luginbuhl. Procrustes: A feature set reduction technique, 1994. 16 [91] Roy L Streit and Tod E Luginbuhl. Maximum likelihood method for probabilistic multihypothesis tracking. In SPIE’s International Symposium on Optical Engineering and Photonics in Aerospace Sensing, pages 394–405. International Society for Optics and Photonics, 1994. 16 [92] VSp d Subrahmanian, Sibel Adali, Anne Brink, Ross Emery, James J Lu, Adil Rajput, Timothy J Rogers, Robert Ross, and Charles Ward. Hermes: A heterogeneous reasoning and mediator system, 1995. 13 128 BIBLIOGRAPHY [93] Maggy Anastasia Suryanto, Ee-Peng Lim, Aixin Sun, and Roger H. L. Chiang. Quality-aware collaborative question answering: methods and evaluation. In WSDM, pages 142–151, 2009. 14 [94] Val Tannen and Lucian Popa. An equational chase for path-conjunctive queries, constraints, and views. Database Research Group (CIS), page 40, 1999. 13 [95] Jan Van den Bussche. Two remarks on the complexity of answering queries using views. Information Processing Letters, 2000. 13 [96] Vasilis Vassalos and Yannis Papakonstantinou. Describing and using query capabilities of heterogeneous sources. 1997. 13 [97] E Waltz. Data fusion for c3i: A tutorial. Command, Control, Communications Intelligence (C3I) Handbook, pages 217–226, 1986. 16 [98] Edward Waltz, James Llinas, et al. Multisensor data fusion, volume 685. Artech House Norwood, 1990. 16, 17 [99] Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. Crowder: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, 2012. 
[100] Ruben Van Wanzeele, Katja Verbeeck, Annemie Vorstermans, Tom Tourwe, and Elena Tsiporkova. Extracting emotions out of Twitter's microblogs. In BNAIC, 2011.
[101] Bernard Widrow and Rodney Winter. Neural nets for adaptive filtering and adaptive pattern recognition. Computer, 21(3):25–39, 1988.
[102] F Wright. The fusion of multi-source data. Signal, pages 39–43, 1980.
[103] Minji Wu and Amelie Marian. A framework for corroborating answers from multiple web sources. Inf. Syst., 36(2):431–449, 2011.
[104] Tingxin Yan, Vikas Kumar, and Deepak Ganesan. Crowdsearch: exploiting crowds for accurate real-time image search on mobile phones. In Proc. of Conference on Mobile systems, applications, and services, pages 77–90, 2010.
[105] N. K. Yeganeh, S. Sadiq, K. Deng, and X. Zhou. Data quality aware queries in collaborative information systems. Lecture Notes in Computer Science, 5446:39–50, 2009.
[106] Xiaoxin Yin, Jiawei Han, and Philip S. Yu. Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng., 20:796–808, 2008.
[107] Lotfi A Zadeh. Fuzzy algorithms. Information and control, 12(2):94–102, 1968.
[108] Bo Zhao, Benjamin I. P. Rubinstein, Jim Gemmell, and Jiawei Han. A bayesian approach to discovering truth from conflicting sources for data integration. PVLDB, 5(6):550–561, 2012.

[...]

1 INTRODUCTION

Nowadays, the Internet contains a significant volume of data in various domains such as finance, technology, entertainment, and travel. These data exist in a variety of data sources, including deep web databases, HTML tables, HTML lists, etc. Managing these deep web data has attracted a lot of research interest, including the storing, indexing and query processing of these data from multiple data...
crowdsourcing data analytics system. Note that our crowdsourcing data analytics system also supports the fusion of continuous data by adapting the method proposed in Chapter 4. As has been explained, traditional data fusion research focuses on fusing the data stored in different data sources that are directly retrieved through structured or unstructured queries. However, in the crowdsourcing systems, the data...

... techniques in managing crowdsourcing data analytics systems. To the best of our knowledge, no approach has been proposed to adapt and apply the data fusion methods in managing the crowdsourcing data. The specific gaps are summarized as follows:
• The current crowdsourcing data analytics methods often provide arbitrarily wrong answers, due to malicious workers or very hard questions.
• The current crowdsourcing data...

... conflicting data provided by multiple data sources. We also review the existing work on employing crowdsourcing platforms to solve problems in the data management domain, including research on the properties of crowdsourcing platforms and on their applications. In Chapter 3 we propose a novel online data fusion method to fuse conflicting categorical data from various data sources...

... drift computation algorithm and fusion algorithm to find the true values. In Chapter 5 we apply the data fusion methods to manage the crowdsourcing data analytics systems. The data fusion methods are implemented in the crowdsourcing data analytics systems as the verification part. We also propose a novel prediction model to estimate the minimum cost of the crowdsourcing data analytics system that still outputs...
existing solutions to the data integration and data fusion problems. Second, we review the online aggregation method and compare it with our online data fusion method. Third, we report the existing work on the multi-sensor data fusion problem, which is related to our continuous data fusion problem. Finally, we discuss the research related to crowdsourcing.

2.1 Data Integration

Data integration... integration includes combining data provided by different sources and providing users with a unified view of these data [56]. The major difference between data integration and data fusion is that data integration combines the data and returns the combination to the user, while data fusion is data integration followed by a reduction process [50]. Therefore, data integration...

... can be applied to other fusion techniques.

1.2 Data Fusion of Continuous Data Problem

Traditional data fusion methods only consider resolving the conflicts of categorical data. However, in the real world, a large portion of the data are continuous data, i.e., real values. For example, most scientific data are continuous and cannot be processed as categorical data in data analytics such as aggregation...

Applications of Data Fusion Methods in Crowdsourcing

Data fusion techniques form the basis for solving many other problems related to data uncertainty and conflicts. We extend the proposal made earlier to solve a related real-world problem, namely crowdsourcing data analytics.

Recently, instead of relying on the deep-web data sources stored on several computer servers, the crowdsourcing platform...
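
The distinction drawn in Section 2.1, where data integration combines the data and data fusion adds a reduction step, can be illustrated with a minimal sketch. It assumes plain majority voting as the reduction rule, which is a common baseline and is not claimed to be the method proposed in this thesis; the function names, source names and values are all hypothetical.

```python
from collections import Counter


def integrate(claims):
    """Data integration: combine the values from all sources into one unified view.

    `claims` maps a (hypothetical) source name to its {item: value} records.
    Conflicting values are kept side by side, not resolved.
    """
    view = {}
    for source, records in claims.items():
        for item, value in records.items():
            view.setdefault(item, []).append((source, value))
    return view


def fuse(view):
    """Data fusion: integration followed by a reduction to one value per item.

    Assumption: the reduction is a simple majority vote over the sources.
    """
    fused = {}
    for item, pairs in view.items():
        counts = Counter(value for _, value in pairs)
        fused[item] = counts.most_common(1)[0][0]
    return fused


# Hypothetical categorical claims from three sources about one data item.
claims = {
    "source_a": {"capital_of_australia": "Canberra"},
    "source_b": {"capital_of_australia": "Sydney"},
    "source_c": {"capital_of_australia": "Canberra"},
}

view = integrate(claims)   # all three claims, conflict still present
result = fuse(view)        # one value per item after the reduction
print(view)
print(result)
```

Here `integrate` returns every claim, while `fuse` keeps a single value per item. Majority voting treats all sources as equally reliable, so it is only a starting point when some sources are known to be inaccurate.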
such as image tagging, information retrieval and natural language processing. A job is partitioned into two parts: the computer job and the crowdsourcing job. In crowdsourcing systems like AMT, the crowdsourcing job is broadcast in the system with a fixed payment set by the owner of the crowdsourcing job. Later, when the workers registered on the crowdsourcing platform receive the crowdsourcing job, they decide...

... continuous data and the data fusion method used in designing crowdsourcing-based data analytics systems. First of all, we aim to solve the problem of online data fusion of categorical data, in order...

... the online data fusion problem of categorical data
1.4.2 Gaps of the data fusion problem of continuous data
1.4.3 Gaps of the application of data fusion techniques in managing crowdsourcing
