Best probability queries on probabilistic databases

Trieu Minh Nhut Le

A thesis submitted in total fulfilment of the requirements for the degree of Doctor of Philosophy

School of Engineering and Mathematical Sciences
Faculty of Science, Technology and Engineering
La Trobe University
Bundoora, Victoria 3086
Australia

January 2014

Acknowledgements

I have been given invaluable support by several people during my PhD candidature. I would like to take this opportunity to express my gratitude to them.

First and foremost, this thesis could not have come into existence without the tremendous support and patient supervision from my supervisor. I would like to thank Dr Jinli Cao for her endless support. I sincerely appreciate her valuable advice and guidance to my research.

Second, I would like to thank Dr Zhen He at La Trobe University for providing insightful ideas, guidance, and comments on my research. I have been fortunate to collaborate with him on various work and have learnt precious skills in research paper writing. He has treated me more like a friend than a student and has always offered sound advice whenever I needed it.

Third, I would like to thank my co-supervisor Prof Wenny Rahayu for her role as a member of my Research Committee and for providing helpful feedback at every stage of my Ph.D. I thank Ms Michele Mooney for her careful proofreading of my research papers and the final draft of this thesis.

Last but not least, I would like to express my gratitude and love to my family for always being there when I needed them most, and for supporting me throughout my life. Especially, I would like to thank my mum, Tiet Thi Phan, for her continuous love, care, and support. I would like to thank my lover, Tuyen Mong Do, for her love, care and encouragement.

Abstract

This thesis focuses on answering probabilistic top-k and skyline queries on probabilistic data using the possible worlds semantics model. These are two of the most important queries for decision support systems.
Almost all other existing methods for answering queries on probabilistic data require the user to set a probability threshold. However, it is difficult to set a threshold because if it is set too high, important results may be lost, but if it is set too low, a lot of low quality results may be returned. In this thesis, novel approaches for answering probabilistic top-k and skyline queries are proposed using the dominance principle as natural and effective methods to select results of queries with an acceptable number of answers, ensuring all important answers are captured without the need to set a threshold.

There are three challenges to answering both probabilistic top-k and skyline queries. The first challenge is to develop novel probabilistic top-k and skyline queries using the dominance principle to return only the most interesting results. The second challenge is to develop formulas based on probabilistic theory to directly calculate the probabilities of the results without considering any possible worlds and to also develop algorithms to effectively prune the search space. The third challenge is to ensure that all the semantic properties of the probabilistic queries are covered.

The evaluations of the performance of the proposed approaches show that, firstly, the results of the queries are not only very reasonable in size but also capture all the important answers. Secondly, the proposed algorithms outperform the current algorithms by accelerating the pruning of the search space, thereby reducing execution time. Lastly, all the semantic properties of probabilistic queries are covered.

Statement of Authorship

Except where reference is made in the text of this thesis, this thesis contains no material published elsewhere or extracted in whole or in part from a thesis submitted for the award of any other degree or diploma. No other person’s work has been used without due acknowledgment in the main text of the thesis. This thesis has not been submitted for the award of any degree or diploma in any other tertiary institution.

Trieu Minh Nhut Le
Date:

External Refereed Publications

Trieu Minh Nhut Le, Jinli Cao, Top-k best probability queries on probabilistic data, Proceedings of the 17th International Conference on Database Systems for Advanced Applications, Volume Part II, DASFAA’12, Springer-Verlag, Berlin, Heidelberg, 2012, pp. 1–16.

Trieu Minh Nhut Le, Jinli Cao, and Zhen He, Top-k best probability queries and semantics ranking properties on probabilistic databases, Data & Knowledge Engineering, 2013.

Trieu Minh Nhut Le, Jinli Cao, and Zhen He, Answering skyline queries on probabilistic data using the dominance of probabilistic skyline tuples, The ACM Transactions on Database Journal, under review, August 2013.

Contents

Acknowledgements
Statement of Authorship
External Refereed Publications
List of Figures

1 Introduction
  1.1 Uncertain data
  1.2 Querying probabilistic data
    1.2.1 Answering probabilistic top-k queries
    1.2.2 Answering probabilistic skyline queries
  1.3 Contributions of this thesis
    1.3.1 Contribution on answering probabilistic top-k queries
    1.3.2 Contribution on answering probabilistic skyline queries
  1.4 Thesis organization

2 Background
  2.1 Probabilistic database models
    2.1.1 The uncertain object model
    2.1.2 The possible worlds semantics model
  2.2 Queries on databases
    2.2.1 The top-k queries on data
    2.2.2 Skyline queries on data
  2.3 Summary of chapter

3 Existing work on probabilistic queries
  3.1 Answering top-k queries on probabilistic data
    3.1.1 The uncertain-top-k queries
    3.1.2 The uncertain-k-rank
    3.1.3 The probabilistic threshold top-k query
    3.1.4 The global-top-k query
    3.1.5 The expected-score query
    3.1.6 The expected-rank
    3.1.7 The robust rank
  3.2 Evaluating probabilistic top-k queries on semantic properties
    3.2.1 Semantic properties for top-k probabilistic queries
    3.2.2 Analysing the answers to probabilistic top-k queries
  3.3 Answering skyline queries on uncertain data
    3.3.1 Answering skyline queries on incomplete data
    3.3.2 Answering probabilistic skyline queries using the uncertain object model
    3.3.3 Computing all skyline probabilities for uncertain data
  3.4 Summary of chapter

4 Answering top-k best probability queries
  4.1 Motivation and our proposal
    4.1.1 Problem definition
    4.1.2 Contributions
    4.1.3 Calculation of top-k probability
    4.1.4 Calculation of top-k probability with a generation rule
  4.2 The top-k best probability query
    4.2.1 Definition of the top-k best probability query
    4.2.2 Significance of top-k best probability query
    4.2.3 Finding top-k best probability and pruning rules
    4.2.4 The top-k best probability algorithm
  4.3 Semantics of top-k best probability query and other top-k queries
    4.3.1 Semantics of ranking properties
    4.3.2 Top-k queries satisfying semantic properties
  4.4 Experimental study
    4.4.1 Real data
    4.4.2 Synthetic data

ANSWERING THE BEST PROBABILISTIC SKYLINE QUERIES

... quickly.

Varying maximum tuple probabilities

This experiment shows the number of iterations that it takes for the algorithms to prune 50% of the tuples when varying the maximum tuple probabilities. Figure 5.24 shows that between a maximum tuple probability of 0.1 and 0.3, there is a big drop in the number of iterations needed to prune 50% of the data set for both algorithms. This is because it is much easier to find effective pruning pivot tuples when the tuples have higher probabilities. In contrast, when all the tuples have very low probabilities, it is very hard to find a pivot tuple that can prune a large percentage of the search space. From 0.1 to 0.5 maximum tuple probability, NN-BPS outperforms SKY-BPS. However, above a maximum tuple probability of 0.5, both algorithms perform almost the same. This is consistent with the results observed in the previous experiments.
Therefore, the reasons for this are the same as previously explained.

[Figure 5.24: Results of varying the maximum tuple probability. x-axis: maximum tuple probability (0.0 to 1.0); y-axis: number of iterations to prune 50%; series: NN-BPS and SKY-BPS.]

5.8 Summary of chapter

In this chapter, we defined our bestpro-skyline query using the possible worlds semantics model, which is able to find the interesting skyline tuples without the need to set any threshold. We developed an efficient approach for answering the bestpro-skyline query with low computational complexity. Our method was able both to compute the skyline probability of tuples quickly and to prune the number of tuples that need skyline probabilities computed. We used a low complexity formula to directly calculate the skyline probabilities of tuples without the need to enumerate all possible worlds. We then defined the two bestpro-skyline algorithms for pruning the number of tuples considered. The first algorithm, called NN-BPS, uses a nearest neighbor-based heuristic to find the pivot tuples for pruning. The second, called SKY-BPS, uses skyline results to heuristically select the pivot tuples for pruning.

Extensive experiments using both real and synthetic data sets show the effectiveness of our approach. In particular, our bestpro-skyline algorithms outperform a Naïve solution by up to three orders of magnitude for computational time on the real data set. We also find the bestpro-skyline query produces more desirable results compared to the Naïve-Threshold-based query on the real data set. We studied in detail the pruning ability of our two bestpro-skyline algorithms, NN-BPS and SKY-BPS. We found both algorithms are very effective at pruning the search space quickly. When comparing the relative pruning ability of the two algorithms, we found NN-BPS performs the same or better than SKY-BPS for all situations tested.

Chapter 6 Conclusion

In this chapter, we first present a summary of the key findings of each chapter of the thesis. Second, we conclude the thesis by highlighting the key findings of the entire thesis. Finally, we envision some directions for future research on querying uncertain data.

6.1 Thesis summary

In many real life domains, probabilistic top-k and skyline queries can be used to discover interesting and useful facts about uncertain data. This thesis addressed the problem of answering probabilistic top-k and skyline queries, which deal with the semantics of uncertainty and the efficient processing of top-k and skyline queries using the possible worlds semantics model.

Chapter 2 provided two models of uncertain data, which are the uncertain object model and the possible worlds semantics model. We also presented several basic concepts relating to top-k and skyline queries on certain data. These ways of modeling uncertain data, and the definitions of top-k and skyline queries on certain data, were extended and modified for answering the probabilistic top-k and the probabilistic skyline queries on probabilistic data in subsequent chapters.

Chapter 3 surveyed all of the previous work on answering probabilistic top-k queries and probabilistic skyline queries. Firstly, we studied and analysed each definition of probabilistic top-k queries on probabilistic data, and the five properties “Exact-k”, “Containment”, “Unique ranking”, “Value invariance”, and “Stability”. Secondly, we presented and discussed some definitions from the previous work on answering probabilistic skyline queries on uncertain data. The definitions and formulas for calculating skyline probability are presented and discussed, and their shortcomings are identified.

In Chapter 4, we proposed a new method for answering top-k queries on probabilistic data.
• Novel answers to the top-k best probability query are proposed to select the probabilistic tuples which not only have the best top-k scores but also have
the best top-k probabilities.
• We developed an efficient algorithm for the top-k best probability query without the need to set a threshold value. In addition, we introduced a new semantic ranking property called “k-best ranking score” based on the traditional top-k definition.
• The top-k best probability query was proven to satisfy the semantic ranking properties, which shows our proposal is better than other existing approaches in terms of semantic ranking properties.
• An extensive experimental study using both real and synthetic data sets was conducted to verify the effectiveness of answering the top-k best probability query against the PT-k query. The top-k best probability approach was shown to outperform the algorithm designed for the PT-k query in both efficiency and effectiveness.

In Chapter 5, we defined the bestpro-skyline query using the possible worlds semantics model without using a threshold. We developed an efficient algorithm with effective formulas for answering the bestpro-skyline query with low computational complexity. There are two main challenges in answering probabilistic skyline queries. The first challenge is defining the interesting probabilistic skyline tuples for the bestpro-skyline query without the use of a threshold. The second challenge is efficiently finding these tuples without enumerating all possible worlds.
• We proposed the bestpro-skyline query to overcome the first challenge, in which the dominance principle is extended to include the skyline probability and the skyline tuples.
• We introduced formulas to directly calculate the skyline probabilities using probabilistic theory. The NN-BPS and SKY-BPS algorithms are introduced to heuristically select the pivot tuples for pruning. The most interesting probabilistic skyline tuples used for calculating skyline probability are presented, discussed and their shortcomings are identified selecting the pivot tuples. These techniques are used to overcome the
second challenge.
• Our experiments used both real and synthetic data sets. The experiments yielded a number of important findings. First, our solution is able to find the 22 interesting probabilistic skyline tuples from 10,000 tuples within 41 seconds in a real data set. Second, our bestpro-skyline algorithms outperform a Naïve solution by up to three orders of magnitude for computational time on the real data set. Finally, we found both algorithms are very effective at pruning the search space compared to the Naïve-Threshold-based query, and the NN-BPS algorithm performs the same or better than the SKY-BPS algorithm in terms of pruning the search space.

6.2 Conclusion and Key Findings

This thesis is the first work to identify the fact that the dominance concept can be used for answering both probabilistic top-k and probabilistic skyline queries to remove the need to set a threshold. This is an important result since it gives users only the most sought after set of tuples without the burden of setting a non-intuitive threshold. The primary goal of top-k and skyline query answering is to use some meaningful way of pruning the uninteresting tuples. We contend that our solution best achieves this goal. Our experiments for both top-k and skyline queries show that we are indeed able to distill just the most interesting tuples to the users. Furthermore, our optimization techniques yielded very fast query processing times compared to Naïve counterparts.

6.3 Future work

We outline some directions for future work:
• Discovering semantic answers to queries is a critical issue in relation to uncertain data by identifying the optimization chances in practical applications. Exploiting the semantics of probabilistic queries is an important area of research.
• Optimising uncertain data query processing for big data is an important problem. In new applications, the increasing occurrence of big data and uncertainty is motivating the need to redesign existing algorithms to handle
such large data. The integration of the techniques proposed in this thesis into Hadoop will be an interesting direction for future research.
• Finally, privacy is an important issue to many people nowadays. Developing probabilistic skyline and top-k queries that preserve privacy is an important area for future research.

Bibliography

[1] Efficient indexing methods for probabilistic threshold queries over uncertain data. VLDB ’04. VLDB Endowment, 2004.
[2] S. Abiteboul, P. Kanellakis, and G. Grahne. On the representation and querying of sets of possible worlds. SIGMOD Rec., 16(3):34–48, Dec. 1987.
[3] P. K. Agarwal, S.-W. Cheng, and K. Yi. Range searching on uncertain data. ACM Trans. Algorithms, 8(4):43:1–43:17, Oct. 2012.
[4] C. C. Aggarwal and P. S. Yu. A survey of uncertain data algorithms and applications. IEEE Transactions on Knowledge and Data Engineering, 21(5):609–623, May 2009.
[5] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In Proceedings of the 22nd International Conference on Data Engineering, ICDE ’06, page 30, Washington, DC, USA, 2006. IEEE Computer Society.
[6] M. J. Atallah and Y. Qi. Computing all skyline probabilities for uncertain data. In Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’09, pages 279–287, New York, NY, USA, 2009. ACM.
[7] O. Benjelloun, A. D. Sarma, A. Halevy, and J. Widom. ULDBs: databases with uncertainty and lineage. In Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB ’06, pages 953–964. VLDB Endowment, 2006.
[8] C. Böhm, F. Fiedler, A. Oswald, C. Plant, and B. Wackersreuther. Probabilistic skyline queries. In Proceeding of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, pages 651–660, New York, NY, USA, 2009. ACM.
[9] S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In Proceedings of the 17th International Conference on Data Engineering, pages 421–430, Washington, DC, USA, 2001. IEEE Computer Society.
[10] M. A. Cheema, X. Lin, W. Wang, W. Zhang, and J. Pei. Probabilistic reverse nearest neighbor queries on uncertain data. IEEE Transactions on Knowledge and Data Engineering, 22:550–564, 2010.
[11] R. Cheng, L. Chen, J. Chen, and X. Xie. Evaluating probability threshold k-nearest-neighbor queries over uncertain data. In Proceedings of the 12th International Conference on Extending Database Technology, pages 672–683, New York, NY, USA, 2009. ACM.
[12] R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD ’03, pages 551–562, New York, NY, USA, 2003. ACM.
[13] R. Cheng and S. Prabhakar. Managing uncertainty in sensor database. SIGMOD Rec., 32(4):41–46, Dec. 2003.
[14] J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting: Theory and optimizations. In M. A. Klopotek, S. T. Wierzchon, and K. Trojanowski, editors, Intelligent Information Systems, Advances in Soft Computing, pages 595–604. Springer, 2005.
[15] G. Cormode, F. Li, and K. Yi. Semantics of ranking queries for probabilistic data and expected ranks. In ICDE ’09, IEEE 25th International Conference on Data Engineering, pages 305–316, April 2009.
[16] N. Dalvi and D. Suciu. Management of probabilistic data: foundations and challenges. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS ’07, pages 1–12, New York, NY, USA, 2007. ACM.
[17] A. Doan, R. Ramakrishnan, F. Chen,
P. DeRose, Y. Lee, R. McCann, M. Sayyadian, and W. Shen. Community information management. IEEE Data Eng. Bull., 29(1):64–72, 2006.
[18] M. Franklin, A. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Rec., 34(4):27–33, Dec. 2005.
[19] A. Fuxman, E. Fazli, and R. J. Miller. Conquer: efficient management of inconsistent databases. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, SIGMOD ’05, pages 155–166, New York, NY, USA, 2005. ACM.
[20] T. Ge, S. Zdonik, and S. Madden. Top-k queries on uncertain data: on score distribution and typical answers. In Proceedings of the 35th SIGMOD International Conference on Management of Data, SIGMOD ’09, pages 375–388, New York, NY, USA, 2009. ACM.
[21] A. Halevy, A. Rajaraman, and J. Ordille. Data integration: the teenage years. In Proceedings of the 32nd international conference on Very large data bases, VLDB ’06, pages 9–16. VLDB Endowment, 2006.
[22] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM Trans. Database Syst., 24(2):265–318, June 1999.
[23] M. Hua, J. Pei, and X. Lin. Ranking queries on uncertain data. The VLDB Journal, 20:129–153, February 2011.
[24] M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking queries on uncertain data: a probabilistic threshold approach. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 673–686, New York, NY, USA, 2008. ACM.
[25] I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv., 40(4):11:1–11:58, Oct. 2008.
[26] C. Jin, K. Yi, L. Chen, J. X. Yu, and X. Lin. Sliding-window top-k queries on uncertain streams.
Proc. VLDB Endowment, 1:301–312, August 2008.
[27] M. E. Khalefa, M. F. Mokbel, and J. J. Levandoski. Skyline query processing for incomplete data. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 556–565, Washington, DC, USA, 2008. IEEE Computer Society.
[28] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: an online algorithm for skyline queries. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02, pages 275–286. VLDB Endowment, 2002.
[29] H.-P. Kriegel, P. Kunath, and M. Renz. Probabilistic nearest-neighbor query on uncertain objects. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications, DASFAA’07, pages 337–348, Berlin, Heidelberg, 2007. Springer-Verlag.
[30] K. Lange. Numerical analysis for statisticians. Springer, 1999.
[31] T. M. N. Le and J. Cao. Top-k best probability queries on probabilistic data. In Database Systems for Advanced Applications, volume 7239 of Lecture Notes in Computer Science, pages 1–16. Springer Berlin Heidelberg, 2012.
[32] T. M. N. Le, J. Cao, and Z. He. Answering skyline queries on probabilistic data using the dominance of probabilistic skyline tuples. Submitted, (0), 2013.
[33] T. M. N. Le, J. Cao, and Z. He. Top-k best probability queries and semantics ranking properties on probabilistic databases. Data & Knowledge Engineering, (0):248–266, 2013.
[34] A. Leon-Garcia and I. Widjaja. Communication Networks. McGraw-Hill, Inc., New York, NY, USA, edition, 2004.
[35] F. Li, K. Yi, and J. Jestes. Ranking distributed probabilistic data. In Proceedings of the 35th SIGMOD International Conference on Management of Data, SIGMOD ’09, pages 361–374, New York, NY, USA, 2009. ACM.
[36] J. Li, B. Saha, and A. Deshpande. A unified approach to ranking in probabilistic
databases. Proc. VLDB Endowment, 2:502–513, August 2009.
[37] J.-j. Li, S.-l. Sun, and Y.-y. Zhu. Efficient maintaining of skyline over probabilistic data stream. In Proceedings of the 2008 Fourth International Conference on Natural Computation, pages 378–382, Washington, DC, USA, 2008. IEEE Computer Society.
[38] X. Lian and L. Chen. Probabilistic ranked queries in uncertain databases. In Proceedings of the 11th international conference on Extending database technology: Advances in database technology, EDBT ’08, pages 511–522, New York, NY, USA, 2008. ACM.
[39] X. Lian and L. Chen. Shooting top-k stars in uncertain databases. The VLDB Journal, 20(6):819–840, Dec. 2011.
[40] M. Magnani and D. Montesi. A survey on uncertainty management in data integration. J. Data and Information Quality, 2(1):5:1–5:33, July 2010.
[41] S. McClean, B. Scotney, P. Morrow, and K. Greer. Knowledge discovery by probabilistic clustering of distributed databases. Data Knowl. Eng., 54(2):189–210, Aug. 2005.
[42] T. Pang-Ning, S. Michael, and K. Vipin. Introduction to data mining. Library of Congress, 2006.
[43] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal and progressive algorithm for skyline queries. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD ’03, pages 467–478, New York, NY, USA, 2003. ACM.
[44] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao. Query processing in spatial network databases. In Proceedings of the 29th international conference on Very large data bases - Volume 29, VLDB ’03, pages 802–813. VLDB Endowment, 2003.
[45] J. Pei, M. Hua, Y. Tao, and X. Lin. Query answering techniques on uncertain and probabilistic data: tutorial summary. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1357–1364, New
York, NY, USA, 2008. ACM.
[46] J. Pei, B. Jiang, X. Lin, and Y. Yuan. Probabilistic skylines on uncertain data. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pages 15–26. VLDB Endowment, 2007.
[47] C. Re, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In IEEE 23rd International Conference on Data Engineering, pages 886–895, April 2007.
[48] R. Ross, V. S. Subrahmanian, and J. Grant. Aggregate operators in probabilistic databases. J. ACM, 52(1):54–101, Jan. 2005.
[49] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. SIGMOD Rec., 24:71–79, May 1995.
[50] A. D. Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In Proceedings of the 22nd International Conference on Data Engineering, ICDE ’06, Washington, DC, USA, 2006. IEEE Computer Society.
[51] M. A. Soliman and I. F. Ilyas. Top-k query processing in uncertain databases. In IEEE 23rd International Conference on Data Engineering, pages 896–905, Istanbul, 2007.
[52] M. A. Soliman, I. F. Ilyas, and K. C.-C. Chang. Probabilistic top-k and ranking-aggregate queries. ACM Trans. Database Syst., 33:13:1–13:54, 2008.
[53] L. Sun, R. Cheng, D. W. Cheung, and J. Cheng. Mining uncertain data with probabilistic guarantees. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’10, pages 273–282, New York, NY, USA, 2010. ACM.
[54] K.-L. Tan, P.-K. Eng, and B. C. Ooi. Efficient progressive skyline computation. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pages 301–310, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[55] Y. Tao, R. Cheng, X.
Xiao, W. K. Ngai, B. Kao, and S. Prabhakar. Indexing multi-dimensional uncertain data with arbitrary probability density functions. In Proceedings of the 31st international conference on Very large data bases, VLDB ’05, pages 922–933. VLDB Endowment, 2005.
[56] A. Vlachou and M. Vazirgiannis. Ranking the sky: Discovering the importance of skyline points through subspace dominance relationships. Data Knowl. Eng., 69(9):943–964, Sept. 2010.
[57] D. Yan and W. Ng. Robust ranking of uncertain data. In J. Yu, M. Kim, and R. Unland, editors, Database Systems for Advanced Applications, volume 6587 of Lecture Notes in Computer Science, pages 254–268. Springer Berlin / Heidelberg, 2011.
[58] K. Yi, F. Li, G. Kollios, and D. Srivastava. Efficient processing of top-k queries in uncertain databases with x-relations. IEEE Trans. on Knowl. and Data Eng., 20(12):1669–1682, Dec. 2008.
[59] M. L. Yiu, N. Mamoulis, X. Dai, Y. Tao, and M. Vaitis. Efficient evaluation of probabilistic advanced spatial queries on existentially uncertain data. IEEE Transactions on Knowledge and Data Engineering, 21(1):108–122, 2009.
[60] Y. Yuan, X. Lin, Q. Liu, W. Wang, J. X. Yu, and Q. Zhang. Efficient computation of the skyline cube. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB ’05, pages 241–252. VLDB Endowment, 2005.
[61] Q. Zhang, F. Li, and K. Yi. Finding frequent items in probabilistic data. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 819–832, New York, NY, USA, 2008. ACM.
[62] S. Zhang and C. Zhang. A probabilistic data model and its semantics. Journal of Research and Practice in Information Technology, 35(4):237–256, November 2003.
[63] W. Zhang, X. Lin, J. Pei, and Y. Zhang. Managing uncertain data: Probabilistic approaches. In International Conference on
Web-Age Information Management, pages 405–412, Los Alamitos, CA, USA, 2008. IEEE Computer Society.
[64] W. Zhang, X. Lin, Y. Zhang, J. Pei, and W. Wang. Threshold-based probabilistic top-k dominating queries. The VLDB Journal, 19(2):283–305, Apr. 2010.
[65] X. Zhang and J. Chomicki. Semantics and evaluation of top-k queries in probabilistic databases. Distrib. Parallel Databases, 26(1):67–126, Aug. 2009.
[66] L. Zhu, C. Li, and H. Chen. Efficient computation of reverse skyline on data stream. In Computational Sciences and Optimization, 2009. CSO 2009. International Joint Conference on, volume 1, pages 735–739, April 2009.

[...]

... top-k best probability query on probabilistic databases and proposed the semantic ranking property for probabilistic top-k queries. Secondly, we introduced a novel approach which answers probabilistic skyline queries on probabilistic data using the dominance of probabilistic skyline tuples concept.
1.3.1 Contribution on answering probabilistic top-k queries
Our work on answering probabilistic top-k queries ...

... definition of top-k queries on data will be the foundation, which will be investigated to develop several definitions for answering probabilistic top-k queries on probabilistic data in Chapter 3 and Chapter 4.
2.2.2 Skyline queries on data
This subsection revisits several basic concepts of skyline queries on certain data, in which key definitions and algorithms in the existing literature on skyline queries on ...

... concepts on queries relating to top-k and skyline queries. These two queries will be formally defined in this section, which is the foundation for this research. In later chapters, we will extend the definitions to probabilistic data.
2.2.1 The top-k queries on data
The traditional top-k queries are useful in data exploration and decision making [51] [52] [39]. In relational databases, answering top-k queries ...
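The two queries on certain data that these excerpts introduce can be illustrated with a short sketch. It is not taken from the thesis: the function names, the hotel data, and the convention that smaller values are preferred in every dimension are assumptions made for the example, and the quadratic skyline scan is chosen for clarity rather than speed.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def top_k(points: Sequence[Point], score: Callable[[Point], float], k: int) -> List[Point]:
    """Return the k points with the highest score (ties keep input order)."""
    return sorted(points, key=score, reverse=True)[:k]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if a is no worse in every dimension and strictly better in one.

    Assumption for this sketch: smaller values are preferred in all dimensions.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: Sequence[Point]) -> List[Point]:
    """Return the points that no other point dominates (simple O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical hotels described by (price, distance to the beach).
    hotels = [(50, 8), (80, 2), (60, 9), (90, 3), (40, 10)]
    print(top_k(hotels, score=lambda h: -h[0], k=2))  # the two cheapest hotels
    print(skyline(hotels))                            # hotels not dominated on both criteria
```

A top-k query needs a single scoring function, whereas the skyline keeps every point that is a best trade-off under some preference, so its size is not fixed in advance.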
... definitions and concepts related to answering top-k queries and skyline queries on certain data are given. These definitions form a foundation for defining the concepts used in answering queries on uncertain data. Chapter 3 reviews the current research on answering probabilistic top-k queries and probabilistic skyline queries. Firstly, various definitions which have been introduced on answering top-k queries ...

... makes the following contributions:
• We introduce a new definition of the top-k best probability query on probabilistic databases, based on traditional top-k answers and the dominance concept, in which the dominance concept takes both the ranking score and top-k probability into account for selecting the top-k best probability tuples.
• We develop formulas to calculate the top-k probability and handle ...

... of a tuple in relation to the appearance of other tuples in possible worlds. Several basic concepts of top-k and skyline queries on certain data were provided. The top-k queries on certain data return k answers with the k best function score. This definition of top-k queries will be studied and extended for answering probabilistic top-k queries in Chapters 3 and 4. The skyline queries on uncertain data return ...

... decision making [46] [57], and querying [24] [51] [37] [10] [11]. The possible worlds semantics model consists of probabilistic data and generation rules. The probabilistic data contain a set of probabilistic tuples, each of which contains a set of attribute values and a probability for the existence of the tuple. Generation rules are existence constraints on the tuples. An example of a generation rule ...
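The possible worlds semantics described in the last excerpt can be made concrete with a brute-force sketch. It rests on assumptions that the excerpt does not state: a generation rule is read here as a set of mutually exclusive tuples (at most one of them exists in any world), tuples outside any rule are treated as independent, and the tuple ids, scores and probabilities are invented. Enumerating worlds takes exponential time, which is precisely the cost that directly computed formulas are meant to avoid, so this is only a reference point for small inputs.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple

# Hypothetical data: tuple id -> (score, existence probability).
TUPLES: Dict[str, Tuple[float, float]] = {
    "t1": (100.0, 0.5),
    "t2": (90.0, 0.4),
    "t3": (80.0, 0.5),
    "t4": (70.0, 1.0),
}
# Assumed rule semantics: tuples listed in one rule are mutually exclusive.
RULES: List[List[str]] = [["t1", "t3"]]


def possible_worlds(tuples: Dict[str, Tuple[float, float]],
                    rules: List[List[str]]) -> Iterator[Tuple[List[str], float]]:
    """Yield (world, probability) for every world with non-zero probability."""
    in_rule = {tid for rule in rules for tid in rule}
    groups = [list(rule) for rule in rules]
    groups += [[tid] for tid in tuples if tid not in in_rule]

    options = []
    for group in groups:
        # Each group contributes exactly one member, or no member at all.
        choices = [((tid,), tuples[tid][1]) for tid in group]
        none_prob = 1.0 - sum(p for _, p in choices)
        if none_prob > 1e-12:
            choices.append(((), none_prob))
        options.append(choices)

    for combo in product(*options):
        world = [tid for members, _ in combo for tid in members]
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob


def top_k_probabilities(tuples: Dict[str, Tuple[float, float]],
                        rules: List[List[str]], k: int) -> Dict[str, float]:
    """Top-k probability of a tuple: total probability of the worlds in which
    it is among the k highest-scored tuples that exist in that world."""
    result = {tid: 0.0 for tid in tuples}
    for world, prob in possible_worlds(tuples, rules):
        ranked = sorted(world, key=lambda tid: tuples[tid][0], reverse=True)
        for tid in ranked[:k]:
            result[tid] += prob
    return result


if __name__ == "__main__":
    for world, prob in possible_worlds(TUPLES, RULES):
        print(world, round(prob, 3))
    print(top_k_probabilities(TUPLES, RULES, k=2))
```

On this toy input there are four possible worlds, and the top-2 probabilities are approximately 0.5, 0.4, 0.5 and 0.6 for t1 to t4. The lowest-scored tuple t4 ends up with the highest top-2 probability, which shows why score and probability have to be considered together.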
... semantics model
This section provides all the definitions and concepts of the possible worlds semantics model based on papers [45] [63] [15] [25] [57] [65] [50] [51]. The possible worlds semantics model consists of probabilistic data and a set of generation rules. The probabilistic data contains a set of probabilistic tuples, each probabilistic tuple having one or multiple attributes and a probability, as shown ...

... skyline queries are important tools for data exploration on data mining, decision making, and market analysis applications [27] [53] [9] [46] [24] [35]. In a relational database system, probabilistic queries such as skyline or top-k queries select interesting and reliable results from the various alternatives within the probabilistic data. Therefore, this thesis focuses on answering probabilistic top-k queries ...

... top-k queries on probabilistic databases. Therefore, the set {t1, t2, t3} is the answer to the top-2 query on Table 1.1. The top-2 best probability query returns not only the tuples with the best top-2 ranking scores but also the top-2 highest probabilities to the users.
1.2.2 Answering probabilistic skyline queries
In many increasingly important fields such as decision-making, market analysis, and personalized
