TRƯỜNG ĐẠI HỌC KINH TẾ - LUẬT
Supervisors: Dr. Phạm Hoàng Uyên, MSc. Lương Thanh Quỳnh
Ho Chi Minh City, May 9, 2023
Introduction
The rapid growth of digital data has created the conditions for data mining algorithms to explode in development.
Among data mining algorithms, clustering algorithms are well known for their ability to analyze and select the necessary data from digital data sources. Among these clustering algorithms, the K-means algorithm is considered a basic algorithm that initiated the approach of data mining by clustering. Although it has limitations, the K-means algorithm is a foundation that points out a direction for data mining by clustering and achieves high efficiency. Moreover, K-means is also the starting point for many subsequent clustering algorithms that aim to achieve optimal efficiency.
In this report, we would like to present K-means, its advantages and disadvantages, and propose solutions to overcome its limitations.
I Clustering
1 Unsupervised learning.
Unsupervised learning occurs when an algorithm learns from plain examples without any associated response, leaving it to the algorithm to determine the data patterns on its own. This type of algorithm tends to restructure the data into something else, such as new features that may represent a class or a new series of uncorrelated values. They are quite useful in providing humans with insights into the meaning of data and new useful inputs to supervised machine learning algorithms. As a kind of learning, it resembles the methods humans use to figure out that certain objects or events are from the same class, such as by observing the degree of similarity between objects. Some recommendation systems that you find on the web in the form of marketing automation are based on this type of learning. The marketing automation algorithm derives its suggestions from what you’ve bought in the past. The recommendations are based on an estimation of what group of customers you resemble the most and then inferring your likely preferences based on that group.
2 Definition.
Clustering is one of the most widely used techniques for exploratory data analysis. Across all disciplines, from social sciences to biology to computer science, people try to get a first intuition about their data by identifying meaningful groups among the data points. For example, computational biologists cluster genes on the basis of similarities in their expression in different experiments; retailers cluster customers, on the basis of their customer profiles, for the purpose of targeted marketing; and astronomers cluster stars on the basis of their spatial proximity.
The first point that one should clarify is, naturally, what is clustering? Intuitively, clustering is the task of grouping a set of objects such that similar objects end up in the same group and dissimilar objects are separated into different groups. Clearly, this description is quite imprecise and possibly ambiguous. Quite surprisingly, it is not at all clear how to come up with a more rigorous definition.
There are several sources for this difficulty. One basic problem is that the two objectives mentioned in the earlier statement may in many cases contradict each other. Mathematically speaking, similarity (or proximity) is not a transitive relation, while cluster sharing is an equivalence relation and, in particular, a transitive relation. More concretely, it may be the case that there is a long sequence of objects, x1, ..., xm, such that each xi is very similar to its two neighbors, xi−1 and xi+1, but x1 and xm are very dissimilar. If we wish to make sure that whenever two elements are similar they share the same cluster, then we must put all of the elements of the sequence in the same cluster. However, in that case, we end up with dissimilar elements (x1 and xm) sharing a cluster, thus violating the second requirement.
To illustrate this point further, suppose that we would like to cluster the points in the following picture into two clusters.
Another basic problem is the lack of “ground truth” for clustering, which is a common problem in unsupervised learning. So far, we have mainly dealt with supervised learning (e.g., the problem of learning a classifier from labeled training data). The goal of supervised learning is clear – we wish to learn a classifier which will predict the labels of future examples as accurately as possible. Furthermore, a supervised learner can estimate the success, or the risk, of its hypotheses using the labeled training data by computing the empirical loss. In contrast, clustering is an unsupervised learning problem; namely, there are no labels that we try to predict. Instead, we wish to organize the data in some meaningful way. As a result, there is no clear success evaluation procedure for clustering. In fact, even on the basis of full knowledge of the underlying data distribution, it is not clear what is the “correct” clustering for that data or how to evaluate a proposed clustering.
Consider, for example, the following set of points in R^2:
and suppose we are required to cluster them into two clusters. We have two highly justifiable solutions:
This phenomenon is not just artificial but occurs in real applications. A given set of objects can be clustered in various different meaningful ways. This may be due to having different implicit notions of distance (or similarity) between objects; for example, clustering recordings of speech by the accent of the speaker versus clustering them by content, clustering movie reviews by movie topic versus clustering them by the review sentiment, clustering paintings by topic versus clustering them by style, and so on.
To summarize, there may be several very different conceivable clustering solutions for a given data set. As a result, there is a wide variety of clustering algorithms that, on some input data, will output very different clusterings.
A Clustering Model:
Clustering tasks can vary in terms of both the type of input they have and the type of outcome they are expected to compute. For concreteness, we shall focus on the following common setup:
Input: a set of elements, 𝒳, and a distance function over it. That is, a function d : 𝒳 × 𝒳 → R+ that is symmetric, satisfies d(x, x) = 0 for all x ∈ 𝒳, and often also satisfies the triangle inequality. Alternatively, the function could be a similarity function s : 𝒳 × 𝒳 → [0, 1] that is symmetric and satisfies s(x, x) = 1 for all x ∈ 𝒳. Additionally, some clustering algorithms also require an input parameter k (determining the number of required clusters).
Output
A partition of the domain set 𝒳 into subsets. That is, C = (C1, ..., Ck) where ∪_{i=1}^{k} Ci = 𝒳 and, for all i ≠ j, Ci ∩ Cj = ∅. In some situations the clustering is “soft,” namely, the partition of 𝒳 into the different clusters is probabilistic, where the output is a function assigning to each domain point, x ∈ 𝒳, a vector (p1(x), ..., pk(x)), where pi(x) = P[x ∈ Ci] is the probability that x belongs to cluster Ci. Another possible output is a clustering
dendrogram (from Greek dendron = tree, gramma = drawing), which is a hierarchical tree of domain subsets, having the singleton sets in its leaves, and the full domain as its root.
3 Clustering algorithms.
There are many different clustering algorithms, each with its own strengths and weaknesses. Some popular clustering algorithms include k-means clustering, hierarchical clustering, DBSCAN, …
3.1 K-means clustering
K-means is a popular algorithm for clustering data into k distinct groups. The algorithm works by iteratively assigning each data point to the nearest cluster center (centroid) and then recalculating the centroids based on the new assignments. K-means is computationally efficient and works well on large datasets.
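As a quick illustration, the following sketch runs K-means with scikit-learn on a small synthetic dataset; the dataset, parameter values, and variable names are illustrative choices, not part of the original report.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy 2-D data with three natural groups (all parameter values are arbitrary)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)        # cluster index assigned to each data point
centroids = kmeans.cluster_centers_   # the k centroids after convergence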
3.2 Hierarchical clustering
Hierarchical clustering is a clustering method that creates a tree-like structure of nested clusters. There are two types of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down). Agglomerative clustering starts with each data point as its own cluster and then recursively merges pairs of clusters based on a similarity metric, such as distance. Divisive clustering starts with all data points in a single cluster and then recursively splits the cluster into smaller subclusters based on a similarity metric.
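A minimal agglomerative-clustering sketch using scikit-learn is shown below; the synthetic data and the linkage choice are illustrative assumptions.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Bottom-up clustering: start from singletons and repeatedly merge the pair of
# clusters whose merge increases the within-cluster variance the least ("ward").
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)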
3.3 Density-based clustering
Density-based clustering algorithms, such as DBSCAN, group together data points that are in high-density regions. These algorithms are especially useful for identifying clusters of irregular shapes and sizes. Density-based clustering algorithms require the specification of a density threshold and a minimum number of points required to form a cluster.
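For instance, a sketch of DBSCAN in scikit-learn might look as follows; eps plays the role of the density threshold and min_samples the minimum number of points, and the specific values chosen here are arbitrary.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: irregularly shaped clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius (density threshold); min_samples is the
# minimum number of neighbours needed to form a dense region.
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)            # points labelled -1 are treated as noise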
3.4 Model-based clustering
Model-based clustering algorithms, such as Gaussian mixture models (GMMs), use probabilistic models to identify clusters in the data. These algorithms assume that the data is generated from a mixture of Gaussian distributions, and the goal is to estimate the parameters of the distributions and assign each data point to the cluster with the highest probability.
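A small illustrative sketch with scikit-learn's GaussianMixture (parameter values are arbitrary assumptions) shows both the hard and the soft cluster assignments.

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)
hard_labels = gmm.predict(X)          # most probable Gaussian component per point
soft_labels = gmm.predict_proba(X)    # probability of each point under each component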
3.5 Spectral clustering
Spectral clustering is a graph-based clustering algorithm that uses the eigenvalues and eigenvectors of a similarity matrix to identify clusters. Spectral clustering works well on datasets with nonlinear or complex structure and can be used to identify clusters of different sizes and shapes.
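As an illustration, the following sketch applies scikit-learn's SpectralClustering to the classic two-moons dataset, a non-convex structure that plain K-means does not separate well; the parameter values are assumptions for this toy example.

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Build a nearest-neighbour similarity graph and cluster the points using the
# eigenvectors (spectral embedding) of that graph.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)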
II K-means algorithm
1 Definition
K-means clustering is an unsupervised machine learning algorithm, which groups the unlabeled dataset into different clusters. ‘K’ in the name of the algorithm represents the number of groups/clusters we want to classify our items into.
The algorithm works by iteratively assigning each data point to the cluster with the nearest centroid (mean) and then re-estimating the centroids of each cluster based on the new assignments This process is repeated until the centroids no longer move.
The algorithm works as follows:
- First, randomly initialize k points, called means or cluster centroids.
- Categorize each item to its closest mean and update the mean’s coordinates, which are the averages of the items categorized in that cluster so far.
- Repeat the process for a given number of iterations; at the end, we obtain the clusters of the data.
With the one-hot encoding technique, each data point xi is assigned a label vector yi = [yi1, yi2, ..., yiK]. The constraint on yi can be written as follows:

y_{ij} \in \{0, 1\}, \qquad \sum_{j=1}^{K} y_{ij} = 1
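As a small illustration (not part of the original derivation), one-hot label vectors satisfying this constraint can be built in NumPy as follows; the cluster indices here are hypothetical.

import numpy as np

K = 3
assignments = np.array([0, 2, 1, 2])   # hypothetical cluster index of each of N = 4 points
Y = np.eye(K)[assignments]             # one-hot label vectors y_i, shape (N, K)

# Each row contains exactly one 1, so the constraint sum_j y_ij = 1 holds.
assert np.all(Y.sum(axis=1) == 1)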
2 Loss function
A loss function is a mathematical function that measures the difference between the predicted output of a model and the actual output. The goal is to minimize this difference, which represents the cost of the model.
Consider the center mk as the representative of each cluster and approximate every data point assigned to this cluster by mk; then a data point xi assigned to cluster k has an error of (xi − mk). We want this error to be as small as possible, so we minimize its squared norm, which, because yik = 1 and yij = 0 for all j ≠ k, can be written as:

\|x_i - m_k\|_2^2 = \sum_{j=1}^{K} y_{ij} \|x_i - m_j\|_2^2
The cost over the whole data set is therefore:

L(Y, M) = \sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \|x_i - m_j\|_2^2
Here Y = [y1, y2, ..., yN] and M = [m1, m2, ..., mK] are the matrices formed by the label vector of each data point and the center of each cluster, respectively. The loss function in our K-means clustering problem is the function L(Y, M).
In summary, we need to solve the following optimization problem:

Y, M = \arg\min_{Y, M} \sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \|x_i - m_j\|_2^2
\quad \text{subject to } y_{ij} \in \{0, 1\} \ \forall i, j \ \text{ and } \ \sum_{j=1}^{K} y_{ij} = 1 \ \forall i
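The loss L(Y, M) above can be computed directly. The following NumPy sketch is illustrative; it assumes X has shape N×d, Y is the N×K one-hot label matrix, and M has shape K×d.

import numpy as np

def kmeans_loss(X, Y, M):
    # Squared Euclidean distance from every point to every center, shape (N, K)
    sq_dist = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    # L(Y, M) = sum_i sum_j y_ij * ||x_i - m_j||^2 (Y is one-hot, so only the
    # distance to the assigned center contributes for each point)
    return float(np.sum(Y * sq_dist))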
3 Optimization algorithm
- Repeat until convergence:
+ Assign each point to the centroid closest to it
+ Update each centroid to be the mean of all points assigned to it
A simple way to solve the above problem is to alternate between solving for Y and M with the other variable fixed. This is an iterative algorithm and a common technique in optimization problems. We will solve the following two problems in turn:
a Find the label vectors that minimize the loss function given the centroids.
When the centers are fixed, the problem of finding the label vectors for all data points can be broken down into the problem of finding the label vector for each data point xi separately:

y_i = \arg\min_{y_i} \sum_{j=1}^{K} y_{ij} \|x_i - m_j\|_2^2 \quad \text{subject to } y_{ij} \in \{0, 1\}, \ \sum_{j=1}^{K} y_{ij} = 1
Since exactly one element of the label vector yi is equal to 1, the above problem can be further simplified to:

j = \arg\min_{j} \|x_i - m_j\|_2^2
We conclude that each data point xi belongs to the cluster whose center is closest to it.
From this, we can easily deduce the label vector for each data point.
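A possible NumPy sketch of this assignment step (illustrative; X has shape N×d and M has shape K×d) is:

import numpy as np

def assign_labels(X, M):
    # With the centers fixed, each point takes the index of its nearest center.
    sq_dist = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
    return np.argmin(sq_dist, axis=1)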
b Find the new centroids for each cluster that minimize the loss function, given the cluster assignments of each data point.
Once the label vector of each data point has been determined, the problem of finding the new centroid for each cluster can be broken down as follows:

m_j = \arg\min_{m_j} \sum_{i=1}^{N} y_{ij} \|x_i - m_j\|_2^2
This function is differentiable, so we solve it by setting its derivative with respect to mj equal to 0:

2 \sum_{i=1}^{N} y_{ij} (m_j - x_i) = 0 \ \Longrightarrow \ m_j = \frac{\sum_{i=1}^{N} y_{ij} x_i}{\sum_{i=1}^{N} y_{ij}}
Note that the denominator is the number of data points assigned to cluster j.
The numerator is the sum of all data points in cluster j. Therefore, mj is the arithmetic mean of the data points in cluster j.
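A corresponding NumPy sketch of the update step (illustrative; empty clusters are not handled) is:

import numpy as np

def update_centers(X, labels, K):
    # With the assignments fixed, each center becomes the arithmetic mean of
    # the points assigned to it (a cluster left empty is not handled here).
    return np.array([X[labels == j].mean(axis=0) for j in range(K)])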
There are several other optimization variants that can be used in K-means-style clustering, such as the K-medoids algorithm (which uses medoids instead of the mean to update the cluster centers) or the Mini-batch K-means algorithm (which uses small batches of data to update the cluster centers, making it faster and more scalable for large datasets). The choice of optimization method depends on the specific problem and the characteristics of the data; some methods may be more suitable for certain types of data or may converge faster than others.
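For example, scikit-learn provides MiniBatchKMeans; the following sketch is illustrative and the dataset and batch size are arbitrary choices.

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)

# Centers are updated from small random batches instead of the full dataset,
# trading a little accuracy for speed on large data.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=10, random_state=0)
labels = mbk.fit_predict(X)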
4 Summary
Input: Data X and the number of clusters K.
Output: Centers M and label vectors Y for each data point.
Step 1: Choose K arbitrary points as the initial centers.
Step 2: Assign each data point to the cluster with the closest center.
Step 3: If the data point assignments in step 2 do not change compared to the previous iteration, stop the algorithm.
Step 4: Update the center for each cluster by taking the average of all data points assigned to that cluster in step 2.
Step 5: Go back to step 2.
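Putting the five steps together, a minimal NumPy sketch of the summarized procedure might look as follows; it is illustrative only, and the random initialization and the handling of empty clusters are simplified assumptions.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose K distinct data points at random as the initial centers
    M = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest center
        sq_dist = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        new_labels = np.argmin(sq_dist, axis=1)
        # Step 3: stop if the assignments did not change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: update each center as the mean of its assigned points
        # (a cluster left empty is not handled in this sketch)
        M = np.array([X[labels == j].mean(axis=0) for j in range(K)])
        # Step 5: go back to step 2 (next loop iteration)
    return labels, M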
This algorithm stops after a finite number of iterations. Indeed, the loss function is non-negative, and after each step 2 and step 4 its value does not increase. A sequence that is non-increasing and bounded below converges. Moreover, the number of ways to partition the whole data set into K clusters is finite, so at some point the loss function can no longer change, and the algorithm stops at this point.
5 Advantages
Simple and easy to understand: K-means is based on a straightforward iterative process that minimizes the sum of squared distances between data points and cluster centroids This makes it easy to understand and implement, even for those with limited knowledge of machine learning.
Fast processing speed: K-means is a relatively fast algorithm, even for large datasets, because each iteration is linear in the number of data points and it typically converges after a small number of iterations. This makes it useful for real-time applications or situations where quick decision-making is required.
Adaptive to many types of data: K-means can be used with a variety of data types, including continuous, discrete, binary, and even mixed data, provided the data can be encoded numerically so that distances between points are meaningful. This flexibility makes it a versatile clustering algorithm that can be used in many different contexts.
Extensible to non-linear clustering: Although standard K-means separates clusters with linear (Voronoi) boundaries, kernel-based extensions of K-means can cluster data points that are not linearly separable in the original feature space.