Machine Learning
Clustering
Nguyen Thi Thu Ha
Email: hantt@epu.edu.vn

What is clustering
• Clustering can be considered the most important unsupervised learning problem.
• Another definition of clustering is “the process of organizing objects into groups whose members are similar”.
• A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
• In this case we identify the 4 clusters into which the data can be divided.
• The similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance. This is called distance-based clustering.
• Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if that cluster defines a concept common to all of those objects.
• In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.

Why?
• To determine the intrinsic grouping in a set of unlabeled data.
• But what constitutes a good clustering?

Application
• Marketing: finding groups of customers with similar behavior, given a large database of customer data containing their properties and past buying records.
• Biology: classification of plants and animals given their features.
• Libraries: book ordering.
• City-planning: identifying groups of houses according to their house type, value and geographical location.
• Earthquake studies: clustering observed earthquakes to identify dangerous zones.
• WWW: document classification; clustering weblog data to discover groups of similar access patterns.

Problems
• Dealing with a large number of dimensions and a large number of data items.
• The effectiveness of the method depends on the definition of “distance” (for distance-based clustering).

Classification of clustering algorithms
• Exclusive Clustering
• Overlapping Clustering
• Hierarchical Clustering
• Probabilistic Clustering

[...]

Hierarchical Clustering
• min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
  L(BA/NA/RM) = 255, m = 3

            BA/NA/RM    FI   MI/TO
BA/NA/RM           0   268     564
FI               268     0     295
MI/TO            564   295       0

• min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
  L(BA/FI/NA/RM) = 268, m = 4

...

... = 138 and the new sequence number is m = 1

Hierarchical Clustering

          BA    FI  MI/TO    NA    RM
BA         0   662    877   255   412
FI       662     0    295   468   268
MI/TO    877   295      0   754   564
NA       255   468    754     0   219
RM       412   268    564   219     0

• min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
  L(NA/RM) = 219, m = 2

          BA    FI  MI/TO  NA/RM
BA         0   662    877    255
FI       662     0    295    268
...

Classification of clustering algorithms
• Four of the most used clustering algorithms:
  – K-means
  – Fuzzy C-means
  – Hierarchical clustering
  – Mixture of Gaussians

K-Means
• K-Means Algorithm Properties
  – There are always K clusters
  – There is always at least one item in each...

[Figure: plots with axes from 0 to 10; image not preserved]

Hierarchical Clustering
[Figure: dendrogram of items a, b, c, d, e. Agglomerative clustering runs from Step 0 to Step 4, forming the clusters ab, de, cde and abcde; divisive clustering runs the same steps in reverse, from Step 4 back to Step 0.]
• Start by assigning each item to a cluster, so that if you have N items
• Find the closest (most similar) pair of clusters...
... steps 2 and 3 until all items are clustered into a single cluster of size N (*)

Hierarchical Clustering
• Input distance matrix

       BA    FI    MI    NA    RM    TO
BA      0   662   877   255   412   996
FI    662     0   295   468   268   400
MI    877   295     0   754   564   138
NA    255   468   754     0   219   869
RM    412   268   564   219     0   669
TO    996   400   138   869   669     0

• The nearest pair of cities is MI and TO, at distance 138. These are merged into...

... norm: $L_1(x, y) = \sum_{i=1}^{m} |x_i - y_i|$
• Cosine Similarity: $1 - \frac{x \cdot y}{\|x\| \, \|y\|}$

K-Means
Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges or other stopping criterion:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster cj...

...

Hierarchical Clustering
Distance matrix after merging BA/NA/RM and FI into BA/FI/NA/RM:

               BA/FI/NA/RM   MI/TO
BA/FI/NA/RM              0     295
MI/TO                  295       0

• Finally, we merge the last two clusters at level 295.
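The merge sequence worked through in the Hierarchical Clustering slides can be reproduced with a short sketch. It assumes single linkage (the distance between two clusters is the smallest distance between any of their members), which is consistent with the intermediate matrices shown but is not stated explicitly in the excerpt. The names `CITIES`, `DIST`, and `single_linkage` are illustrative, not from the slides.

```python
from itertools import combinations

# Input distance matrix from the slides (cities BA, FI, MI, NA, RM, TO).
CITIES = ["BA", "FI", "MI", "NA", "RM", "TO"]
DIST = [
    [0, 662, 877, 255, 412, 996],
    [662, 0, 295, 468, 268, 400],
    [877, 295, 0, 754, 564, 138],
    [255, 468, 754, 0, 219, 869],
    [412, 268, 564, 219, 0, 669],
    [996, 400, 138, 869, 669, 0],
]


def single_linkage(names, dist):
    """Agglomerative clustering; returns one (cluster_label, level) per merge.

    Assumption: single linkage, i.e. the distance between two clusters is
    the minimum distance between any pair of their items.
    """

    def cluster_distance(a, b):
        return min(dist[i][j] for i in a for j in b)

    # Start with every item in its own cluster (a frozenset of indices).
    clusters = [frozenset([i]) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        level = cluster_distance(a, b)
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        label = "/".join(sorted(names[i] for i in merged))
        merges.append((label, level))
    return merges


if __name__ == "__main__":
    for m, (label, level) in enumerate(single_linkage(CITIES, DIST), start=1):
        print(f"m = {m}: L({label}) = {level}")
```

On the input matrix above, the merge levels come out as 138, 219, 255, 268 and 295, in the same order as the slides (MI/TO, NA/RM, BA/NA/RM, BA/FI/NA/RM, then the final merge).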
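The K-Means loop given in pseudocode on the K-Means slide could look roughly like the following in Python. This is a minimal sketch under stated assumptions: the distance d defaults to Euclidean (`math.dist`), the elided "For each cluster cj..." step is taken to recompute the seed as the mean of the cluster's members, and an empty cluster simply keeps its previous seed. The function name `k_means` and the parameters `max_iter` and `seed` are illustrative choices, not from the slides.

```python
import math
import random


def k_means(points, k, d=math.dist, max_iter=100, seed=0):
    """Return (assignment, seeds) for a list of equal-length tuples.

    Assumptions: Euclidean distance by default; centroid = coordinate-wise
    mean; stop when assignments no longer change or after max_iter rounds.
    """
    rng = random.Random(seed)
    seeds = rng.sample(points, k)  # select k random instances as seeds
    assignment = None
    for _ in range(max_iter):
        # Assign each instance to the cluster whose seed is nearest.
        new_assignment = [
            min(range(k), key=lambda j: d(x, seeds[j])) for x in points
        ]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        # Update each seed to the centroid of its cluster.
        for j in range(k):
            members = [x for x, c in zip(points, assignment) if c == j]
            if members:  # an empty cluster keeps its previous seed
                seeds[j] = tuple(sum(col) / len(members)
                                 for col in zip(*members))
    return assignment, seeds


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centroids = k_means(data, k=2)
    print(labels, centroids)
```

Passing a different `d` (for example an L1 or cosine distance function) changes only the assignment step; the centroid update stays a plain mean, which is an assumption that fits Euclidean distance best.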