Local bounding technique and its applications to uncertain clustering


Local Bounding Technique and Its Applications to Uncertain Clustering

Zhang Zhenjie
Bachelor of Science, Fudan University, China

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2010

Abstract

Clustering analysis is a well-studied topic in computer science, with a variety of applications in data mining, information retrieval and electronic commerce. Traditional clustering methods, however, can only be applied to data sets with exact information. With the emergence of web-based applications in the last decade, such as distributed relational databases, traffic monitoring systems and sensor networks, there is a pressing need to handle uncertain data in these analysis tasks, yet no trivial solution to the clustering problem on such data can be obtained by extending conventional methods. This dissertation presents a new clustering framework for uncertain data, the Worst Case Analysis (WCA) framework, which estimates clustering uncertainty as the maximal deviation in the worst case. Several clustering models under the WCA framework are presented, satisfying the requirements of different applications and all independent of the underlying clustering criterion and clustering algorithm. Solutions to these models with respect to the k-means algorithm and the EM algorithm are proposed, on the basis of the Local Bounding Technique, a powerful tool for analyzing the impact of uncertain data on the local optima reached by these algorithms. Extensive experiments on data collected from real applications evaluate the effectiveness and efficiency of the technique in these models.

Acknowledgements

I would like to thank my PhD thesis committee members, Prof. Anthony K. H. Tung, Prof. Mohan Kankanhalli, Prof. David Hsu, and external reviewer Prof. Xuemin Lin, for their valuable reviews, suggestions and comments on my thesis. My thesis advisor Anthony K. H. Tung deserves special appreciation; he has taught me a great deal about research, work and even life over the last half decade. My other project supervisor, Beng Chin Ooi, is another great figure in my life, who has empowered my growth as a scientist and as a person. In the fledgling years of my research, Zhihong Chong, Jeffery Xu Yu and Aoying Zhou gave me enormous help with career choices and priceless academic knowledge. Full credit also goes to another of my research teachers, Dimitris Papadias, whose valuable experience and patient guidance greatly strengthened my research abilities. During my visit to the AT&T Shannon Lab, I learnt a lot from Divesh Srivastava and Marios Hadjieleftheriou, which helped me start new research areas. I appreciate the efforts of all the professors who have coauthored papers with me, including Chee-Yong Chan, Reynold Cheng, Zhiyong Huang, H. V. Jagadish, Christian S. Jensen, Laks V. S. Lakshmanan, Hongjun Lu, and Srinivasan Parthasarathy.

The last six years at the National University of Singapore have been an exciting and wonderful journey in my life. It has been my great pleasure to work with the strong team in the Database group, including Zhifeng Bao, Ruichu Cai, Yu Cao, Xia Cao, Yueguo Chen, Gao Cong, Bingtian Dai, Mei Hui, Hanyu Li, Dan Lin, Yuting Lin, Xuan Liu, Hua Lu, Jiaheng Lu, Meiyu Lu, Chang Sheng, Yanfeng Shu, Zhenqiang Tan, Nan Wang, Wenqiang Wang, Xiaoli Wang, Ji Wu, Sai Wu, Shili Xiang, Jia Xu, Linhao Xu, Zhen Yao, Shanshan Ying, Meihui Zhang, Rui Zhang, Xuan Zhou, Yongluan Zhou, and Yuan Zhou.
I also draw strength from the Fudan University alumni in the School of Computing, including Feng Cao, Su Chen, Yicheng Huang, Chengliang Liu, Xianjun Wang, Ying Yan, Xiaoyan Yang, Jie Yu, Ni Yuan, and Dongxiang Zhang. I am also grateful to my friends in Hong Kong, including Ilaria Bartolini, Alexander Markowetz, Stavros Papadopoulos, Dimitris Sacharidis, Yufei Tao, and Ying Yang. I am always indebted to the powerful and faithful support of my parents, Jianhua Zhang and Guiying Song. Their unconditional love and nurturing brought me into this world and made me a person of deep and endless strength. Finally, my deepest love is always reserved for my girl, Shuqiao Guo, for accompanying me over the last four years.

Contents

1 Introduction
  1.1 A Brief Revisit to Clustering Problems
  1.2 Certainty vs. Uncertainty
  1.3 Worst Case Analysis Framework
  1.4 Models under WCA Framework
    1.4.1 Zero Uncertainty Model (ZUM)
    1.4.2 Static Uncertainty Model (SUM)
    1.4.3 Dissolvable Uncertainty Model (DUM)
    1.4.4 Reverse Uncertainty Model (RUM)
  1.5 Local Bounding Technique
  1.6 Summary of the Contributions

2 Literature Review
  2.1 Clustering Techniques on Certain Data
    2.1.1 K-Means Algorithm and Distance-based Clustering
    2.1.2 EM Algorithm and Model-Based Clustering
  2.2 Management of Uncertain and Probabilistic Database
  2.3 Continuous Query Processing

3 Local Bounding Technique
  3.1 Notations and Data Models
  3.2 K-Means Clustering
  3.3 EM on Gaussian Mixture Model

4 Zero Uncertain Model
  4.1 Problem Definition
  4.2 Algorithms with K-Means Clustering
  4.3 Algorithm with Gaussian Mixture Model
  4.4 Experiments with K-Means Clustering
    4.4.1 Experimental Setup
    4.4.2 Results on Synthetic Data Sets
    4.4.3 Results on Real Data Sets
  4.5 Experiments with Gaussian Mixture Model
    4.5.1 Results on Synthetic Data
    4.5.2 Results on Real Data

5 Static Uncertain Model
  5.1 Problem Definitions
  5.2 Solution to SUM
    5.2.1 Intra Cluster Uncertainty
    5.2.2 Inter Cluster Uncertainty
    5.2.3 Early Termination
  5.3 Experimental Results

6 Dissolvable Uncertain Model
  6.1 Problem Definition
  6.2 Solutions to DUM
    6.2.1 Hardness of DUM
    6.2.2 Simple Heuristics
  6.3 Better Heuristics for D-SUM
    6.3.1 Candidates Expansion
    6.3.2 Better Reduction Estimation
    6.3.3 Block Dissolution
  6.4 Experimental Results

7 Reverse Uncertain Model
  7.1 Framework and Problem Definition
  7.2 Threshold Assignment
    7.2.1 Mathematical Foundation of Thresholds
    7.2.2 Computation of Threshold
    7.2.3 Utilizing the Change Rate
  7.3 Experimental Results

8 Conclusion and Future Work
  8.1 Summarization
  8.2 Potential Applications
    8.2.1 Change Detection on Data Stream
    8.2.2 Privacy Preserving Data Publication
  8.3 Possible Research Directions
    8.3.1 Graph Clustering
    8.3.2 New Uncertainty Clustering Framework

List of Tables

1.1 Three major classes of data mining problems
1.2 Characteristics and applications of the models
3.1 Table of notations
4.1 Local optimums in KDD99 data set
4.2 Test results on KDD98 data set
7.1 Experimental parameters
7.2 k-means cost versus data cardinality on Spatial
7.3 k-means cost versus data cardinality on Road
7.4 k-means cost versus k on Spatial
7.5 k-means cost versus k on Road
7.6 k-means cost versus ∆ on Spatial
7.7 k-means cost versus ∆ on Road

List of Figures

1.1 How to apply clustering in real systems
1.2 Why uncertain clustering instead of traditional clustering?
1.3 An uncertain data set
1.4 The certain data set corresponding to Figure 1.3
1.5 Models based on the radii
1.6 Forward inference and backward inference
1.7 Categories of uncertain clustering models in WCA framework
2.1 Example of safe regions
3.1 Example of k-means clustering
3.2 Center movement in one iteration
3.3 Example of maximal regions
4.1 Update events on the configuration
4.2 Example of the clustering running on a real data set
4.3 Tests on varying dimensionality on synthetic data set
4.4 Tests on varying k on synthetic data set
4.5 Tests on varying procedure number on synthetic data set
4.6 Tests on varying k on KDD99 data set
4.7 Tests on varying procedure number on KDD99 data set
4.8 Performance comparison with varying dimensionality
4.9 Performance comparison with varying component number
4.10 Performance comparison with varying data size
4.11 Performance comparison with varying component number on Spam data
4.12 Performance comparison with varying component number on Cloud data
4.13 Likelihood comparison with fixed CPU time
5.1 Tests on varying data size
5.2 Tests on varying dimensionality
5.3 Tests on varying cluster number k
5.4 Tests on varying expected uncertainty
5.5 Tests on varying k on KDD99 data set
5.6 Tests on varying uncertainty expectation on KDD99 data set
6.1 Example of dissolvable uncertain model
6.2 Reduction example
6.3 Tests on varying data size
6.4 Tests on varying dimensionality
6.5 Tests on varying cluster number k
6.6 Tests on varying dissolution block size
6.7 Tests on varying uncertainty expectation
6.8 Tests on varying k on KDD99 data set
6.9 Tests on varying uncertainty expectation on KDD99 data set
7.1 Example updates
7.2 CPU time versus data cardinality

Chapter 8
Conclusion and Future Work

In this chapter, we conclude the dissertation and outline future work building on the proposed models and methods. In particular, Section 8.1 gives a brief summary of the contributions of the dissertation. Section 8.2 discusses possible applications of the proposed uncertain clustering algorithms in two different domains, data streams and privacy. Finally, Section 8.3 formalizes a few promising research directions that extend our current studies.

8.1 Summarization

This dissertation focuses on the analysis of clustering uncertainty over highly dynamic or uncertain objects in multi-dimensional space. Traditional clustering methods are applicable only to certain data, with exact information on every attribute of every object; all of them encounter great difficulty when objects provide only approximate values, especially in measuring the robustness (or uncertainty) of the clustering results. These problems prohibit the use of clustering algorithms in the optimization of complex systems over quickly evolving underlying data.
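As a point of reference for the summary below, the following is a minimal sketch of the classical Lloyd iteration for k-means (in Python; the function and variable names are illustrative choices of this sketch, not code from the dissertation). Every step reads exact coordinates, which is precisely the assumption that uncertain objects violate.

```python
import numpy as np

def lloyd_kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means on exact points (an n x d array).

    Both steps below consume exact coordinates -- the assumption
    that uncertain data breaks.
    """
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j]                     # keep empty clusters in place
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break                               # reached a local optimum
        centers = new_centers
    return centers, labels
```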
The Worst Case Analysis (WCA) framework provides a solid foundation for uncertainty clustering models, offering a new measure of clustering uncertainty. Specifically, given an uncertain data set, a clustering on the data and a universal clustering algorithm, the WCA framework estimates the uncertainty of the clustering as the maximal cost difference between the current clustering and any other clustering that the universal algorithm could compute on some exact data consistent with the uncertain data.

The WCA framework facilitates the development of different uncertainty clustering models. In this dissertation, four models are proposed: the Zero Uncertain Model (ZUM), the Static Uncertain Model (SUM), the Dissolvable Uncertain Model (DUM) and the Reverse Uncertain Model (RUM). These models suit different applications with different system requirements.

As one of the most popular clustering algorithms, the k-means algorithm is used as the running example for all of the models in this dissertation. Based on the concept of the Maximal Region, it is shown that clustering uncertainty can be easily calculated and manipulated with respect to the k-means algorithm. This dissertation covers the complete details of how to implement the k-means algorithm under these models in an efficient and effective manner.

Besides the k-means algorithm, the Gaussian Mixture Model is another important direction investigated in this dissertation. Unlike k-means clustering, the Gaussian Mixture Model assigns positive cluster probabilities to all point-cluster pairs, showing a strong ability to distinguish clusters with large overlap. This dissertation shows that the Expectation-Maximization (EM) algorithm is also consistent with the WCA framework, leading to an efficient algorithm for ZUM on the Gaussian Mixture Model.

There are also several applications of the uncertain clustering techniques. By utilizing ZUM, for example, it is possible to dramatically accelerate multi-run clustering algorithms, including both k-means and EM. RUM, as another example, turns out to be effective in reducing the communication cost of cluster analysis in a vehicle monitoring system.

8.2 Potential Applications

While this dissertation covers only a small fraction of the possible applications of the uncertainty clustering models, it is worth discussing further directions for employing these techniques.

8.2.1 Change Detection on Data Stream

Data streams are one of the hottest research areas in computer science. Given a fast and unbounded multi-dimensional object stream, it is only possible to scan the data once, due to constraints on both memory consumption and processing speed. One interesting topic on data streams is how to detect changes in the underlying distribution effectively and efficiently. In scenarios such as network monitoring, it is important for the analyst to discover changes in the underlying distribution quickly, so as to become aware of potential problems in the network infrastructure. In [56], Song et al. presented statistics-based solutions to change detection on data streams. While statistics provide a strong guarantee on detection accuracy, efficiency remains an open issue. Given that a clustering renders a concise summary of the distribution of the underlying data, it is straightforward to use the difference between clusterings to measure the change in distribution, as sketched below.
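As a naive baseline (an illustrative sketch only, not the dissertation's method: the window representation, the greedy center matching and the threshold are all assumptions of this sketch), one can recluster every window from scratch and raise a flag whenever the matched centers drift beyond a threshold. The point of ZUM is to replace this full per-window reclustering with cheap bounds on how far the local optimum can move.

```python
import numpy as np

def kmeans_centers(points, k, iters=50, seed=0):
    """Tiny Lloyd's k-means returning only the centers (see the sketch above)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = ((points[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([points[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers

def max_center_drift(old, new):
    """Greedily match old centers to new ones; return the largest displacement."""
    drift, used = 0.0, set()
    for c in old:
        dists = np.linalg.norm(new - c, axis=1)
        dists[list(used)] = np.inf          # each new center is matched once
        j = int(dists.argmin())
        used.add(j)
        drift = max(drift, float(np.linalg.norm(new[j] - c)))
    return drift

def monitor_stream(windows, k=2, threshold=1.0):
    """Flag windows whose k-means centers drift beyond the threshold."""
    prev = None
    for t, window in enumerate(windows):    # window: an n x d array of points
        centers = kmeans_centers(window, k, seed=t)
        if prev is not None and max_center_drift(prev, centers) > threshold:
            print(f"window {t}: possible distribution change")
        prev = centers
```

The window size and threshold here are arbitrary illustrative parameters; choosing them in a principled way is exactly where uncertainty bounds on the clustering would come in.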
With the algorithms of the Zero Uncertain Model, we can directly estimate how much the k-means clustering or the Gaussian Mixture Model has changed when points are inserted and deleted. This implies that a more efficient solution can be derived by applying ZUM to data streams.

Figure 8.1: Detecting distribution change on data stream

In Figure 8.1, we present an example illustrating the connection between distribution change and clustering. In the left panel, the data objects form two clusters in 2-dimensional space. When some old objects are removed and new objects are inserted, the distribution changes; correspondingly, the centers of the optimal 2-means clustering also move. Thus, monitoring the clustering with appropriate parameters provides an effective way to estimate distribution change.

8.2.2 Privacy Preserving Data Publication

Data publication with personal information protection is considered one of the most important problems in privacy preservation over large databases. A common solution is to artificially add uncertainty to exact personal records, as in k-anonymity [54, 57], l-diversity [43] and ANATOMY [64]. However, all of these techniques aim to reduce the distortion of individual records, without taking the data distribution into consideration. Introducing the Reverse Uncertain Model (RUM) into the privacy data publication problem opens new opportunities to maintain a rough data distribution while hiding sensitive personal information. In particular, each data record is represented by a circle covering the true value. To prevent the adversary from detecting the true identities of the records, circle overlaps are purposefully generated to reduce the possibility of accurate identification.

Figure 8.2: Protecting sensitive personal records without affecting the global distribution

Consider the example shown in Figure 8.2. Given the personal records marked with red points in 2-dimensional space, each record is transformed into a circle that covers the original record. To guarantee the safety of personal identities, we require that each record be covered by at least two circles. Under this constraint, even if the adversary happens to know the exact information of one record, he remains unable to identify it, since there are at least two uncertain points consistent with it. On the other hand, the global distribution is well preserved, with two clusters at the bottom-left and top-right corners respectively. A naive construction in this spirit is sketched below.
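The dissertation leaves the concrete construction open, so the following sketch shows only one naive way to produce such circles; the midpoint construction, the margin parameter and the repair pass are assumptions of this illustration, not the RUM mechanism itself. Each circle covers its own record and that record's nearest neighbor, and a repair pass enforces that every record lies inside at least two circles.

```python
import numpy as np

def anonymize_to_circles(points, margin=0.1, seed=0):
    """Replace each exact record with a circle (center, radius) that covers it,
    ensuring every record lies inside at least two circles. Needs n >= 2.

    Illustrative construction only -- not the dissertation's RUM mechanism.
    """
    rng = np.random.default_rng(seed)
    n, dim = points.shape
    # Nearest neighbour of every record (brute force; fine for a sketch).
    d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    circles = []
    for i in range(n):
        j = nn[i]
        center = (points[i] + points[j]) / 2      # midpoint of i and nn(i)
        radius = d[i, j] / 2 + margin             # covers both records
        # Random jitter (L2 norm < margin) so the true record is not
        # always at a predictable spot inside its circle.
        direction = rng.normal(size=dim)
        direction /= np.linalg.norm(direction)
        center = center + direction * rng.uniform(0, margin / 2)
        radius = radius + margin                  # both stay covered after jitter
        circles.append((center, radius))
    # Repair pass: guarantee at least two covering circles per record.
    for i in range(n):
        covered = sum(np.linalg.norm(points[i] - c) <= r for c, r in circles)
        if covered < 2:
            c, r = circles[nn[i]]
            circles[nn[i]] = (c, max(r, np.linalg.norm(points[i] - c) + margin))
    return circles
```

A real RUM-based mechanism would additionally tune the radii to trade identification risk against distortion of the global distribution.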
8.3 Possible Research Directions

There are a few interesting research directions to explore in the future. In this section, we briefly introduce two of them, concerning uncertain clustering on graph data and more robust uncertainty clustering frameworks, respectively.

8.3.1 Graph Clustering

In this dissertation, we discuss uncertain clustering on multi-dimensional data, where each record is represented by a vector of fixed dimensionality. Graph data is another mainstream representation in many applications, such as social networks. The clustering problem on graph data is also a well-studied topic in both the algorithms community and the data mining community. However, as with uncertain clustering in multi-dimensional space, the study of robust uncertain clustering on graph data remains in its infancy.

Figure 8.3: Uncertain clustering on probabilistic graph data

A typical uncertain graph is presented in Figure 8.3. Different from a traditional certain graph, a probability is associated with each edge, indicating the likelihood of the edge's existence. In the data in this figure, for example, there are obviously two clusters, which form dense probabilistic subgraphs on the left and right sides. On larger and more complicated social network graphs, the problem becomes much more challenging, because the number of possible graphs grows exponentially. To overcome such difficulties, more robust uncertain clustering models and methods are necessary.

8.3.2 New Uncertainty Clustering Framework

All the models and methods introduced in this dissertation are based on the Worst Case Analysis (WCA) framework. While this framework is robust and general enough to handle many problems in real systems, it also has some disadvantages. The WCA framework requires no distribution information for analysis or computation. This improves the adaptivity of our models, since it is usually difficult to retrieve the exact distribution of objects in the real world. Sometimes, however, a rough approximation of the object distribution is available, and the WCA framework must discard such information, simply transforming all objects into circles in the corresponding space. An open question is how to fully utilize such approximate distributions without sacrificing the robustness of the existing models.

Another problem with the WCA framework is its restriction to continuous spaces. The values of the objects on all dimensions must lie in continuous ranges, while much real data has categorical values on many attributes, such as gender and marital status. Our current models are not strong enough to handle such categorical data. In the WCA framework, the uncertainty of an object is measured by the radius of its uncertain sphere; given categorical values, we are forced to find some other uncertainty measure to replace the current one. This raises new challenges for future frameworks, which are expected to be consistent with both numerical and categorical attributes.

Bibliography

[1] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, 1994.
[2] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In SoCG, pages 144–153, 2006.
[3] David Arthur and Sergei Vassilvitskii. k-means++: The advantage of careful seeding. In SODA, 2007.
[4] Brian Babcock and Chris Olston. Distributed top-k monitoring. In SIGMOD, 2003.
[5] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. In SDM, 2004.
[6] Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, and Jennifer Widom. ULDBs: Databases with uncertainty and lineage. In VLDB, pages 953–964, 2006.
[7] Leon Bottou and Yoshua Bengio. Convergence properties of the k-means algorithms. In NIPS, pages 585–592, 1995.
[8] Paul S. Bradley and Usama M. Fayyad. Refining initial points for k-means clustering. In ICML, pages 91–99, 1998.
[9] Thomas Brinkhoff. A framework for generating network-based moving objects. GeoInformatica, 6(2):153–180, 2002.
[10] Reynold Cheng, Dmitri V. Kalashnikov, and Sunil Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD Conference, pages 551–562, 2003.
[11] Reynold Cheng, Ben Kao, Sunil Prabhakar, Alan Kwan, and Yi-Cheng Tu. Adaptive stream filters for entity-based queries with non-value tolerance. In VLDB, 2005.
[12] Reynold Cheng, Yuni Xia, Sunil Prabhakar, Rahul Shah, and Jeffrey Scott Vitter. Efficient indexing methods for probabilistic threshold queries over uncertain data. In VLDB, pages 876–887, 2004.
[13] Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, and Xin Xu. Mining top-k covering rule groups for gene expression data. In SIGMOD Conference, pages 670–681, 2005.
[14] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, 2nd edition. The MIT Press, 2001.
[15] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
[16] Amol Deshpande, Carlos Guestrin, Samuel Madden, Joseph M. Hellerstein, and Wei Hong. Model-driven data acquisition in sensor networks. In VLDB, pages 588–599, 2004.
[17] Chris H. Q. Ding and Xiaofeng He. K-means clustering via principal component analysis. In ICML, 2004.
[18] Harris Drucker, Donghui Wu, and Vladimir Vapnik. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048–1054, 1999.
[19] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd ed.). Wiley-Interscience, 2001.
[20] Charles Elkan. Using the triangle inequality to accelerate k-means. In ICML, pages 147–153, 2003.
[21] T. Feder and C. Sohler. Optimal algorithms for approximate clustering. In STOC, pages 434–444, 1988.
[22] Tomás Feder, Rajeev Motwani, Rina Panigrahy, Chris Olston, and Jennifer Widom. Computing the median with uncertainty. In STOC, pages 602–607, 2000.
[23] Ashish Goel, Sudipto Guha, and Kamesh Munagala. Asking the right questions: model-driven optimization using probes. In PODS, pages 203–212, 2006.
[24] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38(2-3):293–306, 1985.
[25] Sudipto Guha and Kamesh Munagala. Model-driven optimization using adaptive probes. In SODA, 2007.
[26] Gunjan Gupta and Joydeep Ghosh. Bregman bubble clustering: A robust, scalable framework for locating multiple, dense regions in data. In ICDM, 2006.
[27] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Academic Press, 2000.
[28] David Hand, Heikki Mannila, and Padhraic Smyth. Principles of Data Mining. The MIT Press, 2001.
[29] Sariel Har-Peled and Bardia Sadri. How fast is the k-means method? In SODA, pages 877–885, 2005.
[30] Haibo Hu, Jianliang Xu, and Dik Lun Lee. A generic framework for monitoring continuous spatial queries over moving objects. In SIGMOD, 2005.
[31] Zan Huang, Hsinchun Chen, Chia-Jung Hsu, Wun-Hwa Chen, and Soushan Wu. Credit rating analysis with support vector machines and neural networks: a market comparative study. Decision Support Systems, 37(4):543–558, 2004.
[32] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In Symposium on Computational Geometry, pages 332–339, 1994.
[33] Ankur Jain, Edward Y. Chang, and Yuan-Fang Wang. Adaptive stream resource management using Kalman filters. In SIGMOD, 2004.
[34] Michael I. Jordan and Lei Xu. Convergence results for the EM approach to mixtures of experts architectures. Neural Networks, 8(9):1409–1431, 1995.
[35] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell., 24(7):881–892, 2002.
[36] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004.
[37] Hans Kellerer, Ulrich Pferschy, and David Pisinger. Knapsack Problems. Springer, 2005.
[38] Sanjeev Khanna and Wang Chiew Tan. On computing functions with uncertainty. In PODS, 2001.
[39] Hans-Peter Kriegel and Martin Pfeifle. Density-based clustering of uncertain data. In KDD, pages 672–677, 2005.
[40] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In FOCS, pages 454–462, 2004.
[41] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, 1982.
[42] Helmut Lutkepohl. Handbook of Matrices. John Wiley & Sons Ltd., 1996.
[43] Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. ℓ-diversity: Privacy beyond k-anonymity. TKDD, 1(1), 2007.
[44] J. MacQueen. Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symposium on Mathematics, Statistics and Probability, pages 281–298, 1967.
[45] G. McLachlan and D. Peel. Finite Mixture Models. Wiley-Interscience, 2000.
[46] G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley-Interscience, 1996.
[47] Kyriakos Mouratidis, Dimitris Papadias, Spiridon Bakiras, and Yufei Tao. A threshold-based algorithm for continuous monitoring of k nearest neighbors. TKDE, 17(11), 2005.
[48] Kyriakos Mouratidis, Man Lung Yiu, Dimitris Papadias, and Nikos Mamoulis. Continuous nearest neighbor monitoring in road networks. In VLDB, pages 43–54, 2006.
[49] Wang Kay Ngai, Ben Kao, Chun Kit Chui, Reynold Cheng, Michael Chau, and Kevin Y. Yip. Efficient clustering of uncertain data. In ICDM, 2006.
[50] Chris Olston, Jing Jiang, and Jennifer Widom. Adaptive filters for continuous queries over distributed data streams. In SIGMOD, 2003.
[51] Chris Olston and Jennifer Widom. Offering a precision-performance tradeoff for aggregation queries over replicated data. In VLDB, pages 144–155, 2000.
[52] Dan Pelleg and Andrew Moore. Accelerating exact k-means algorithms with geometric reasoning. In Knowledge Discovery and Data Mining, pages 277–281, 1999.
[53] Sunil Prabhakar, Yuni Xia, Dmitri V. Kalashnikov, Walid G. Aref, and Susanne E. Hambrusch. Query indexing and velocity constrained indexing: Scalable techniques for continuous queries on moving objects. Trans. Computers, 51(10), 2002.
[54] Pierangela Samarati and Latanya Sweeney. Generalizing data to provide anonymity when disclosing information (abstract). In PODS, page 188, 1998.
[55] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, 2000.
[56] Xiuyao Song, Mingxi Wu, Christopher M. Jermaine, and Sanjay Ranka. Statistical change detection for multi-dimensional data. In KDD, pages 667–676, 2007.
[57] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
[58] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison Wesley, 2005.
[59] Yufei Tao, Reynold Cheng, Xiaokui Xiao, Wang Kay Ngai, Ben Kao, and Sunil Prabhakar. Indexing multi-dimensional uncertain data with arbitrary probability density functions. In VLDB, pages 922–933, 2005.
[60] Yufei Tao, Christos Faloutsos, Dimitris Papadias, and Bin Liu. Prediction and indexing of moving objects with unknown motion patterns. In SIGMOD Conference, pages 611–622, 2004.
[61] Vijay V. Vazirani. Approximation Algorithms. Springer, 2003.
[62] Jennifer Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, pages 262–276, 2005.
[63] Ouri Wolfson, A. Prasad Sistla, Sam Chamberlain, and Yelena Yesha. Updating and querying databases that track mobile units. Distributed and Parallel Databases, 7(3):257–387, 1999.
[64] Xiaokui Xiao and Yufei Tao. Anatomy: Simple and effective privacy preservation. In VLDB, pages 139–150, 2006.
[65] Xin Xu, Ying Lu, Anthony K. H. Tung, and Wei Wang. Mining shifting-and-scaling co-regulation patterns on gene expression profiles. In ICDE, page 89, 2006.
[66] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In SIGMOD Conference, pages 103–114, 1996.
[67] Zhenjie Zhang, Reynold Cheng, Dimitris Papadias, and Anthony K. H. Tung. Minimizing the communication cost for continuous skyline maintenance. In SIGMOD Conference, pages 495–508, 2009.
[68] Zhenjie Zhang, Bing Tian Dai, and Anthony K. H. Tung. Estimating local optimums in EM algorithm over Gaussian mixture model. In ICML, pages 1240–1247, 2008.
[69] Zhenjie Zhang, Bing Tian Dai, and Anthony K. H. Tung. On the lower bound of local optimums in k-means algorithm. In ICDM, 2006.
[70] Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, and Anthony K. H. Tung. Similarity search on Bregman divergence: Towards non-metric indexing. In VLDB, 2009.
[71] Zhenjie Zhang, Yin Yang, Anthony K. H. Tung, and Dimitris Papadias. Continuous k-means monitoring over moving objects. IEEE Trans. Knowl. Data Eng., 20(9):1205–1216, 2008.
