HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
MACHINE LEARNING IN HEALTHCARE ASSIGNMENT REPORT
Supervisor: Nguyễn Việt Dũng
Full name: Phạm Trương Hà Phương
Student ID: 20210693
Hà Nội, 2024
Table of Contents
A K-Means Clustering
I Introduction
1 What is K-Means Clustering
2 How K-Means Clustering works
II How to select the number of clusters in K-Means
1 Elbow Curve Method
2 Silhouette Analysis
B Other clustering techniques
I Introduction
1 What are Clustering Algorithms
2 When to use Clustering
3 The top 8 Clustering Algorithms
II Gaussian Mixture Models (GMM)
C Comparison between K-Means and Gaussian Mixture Models
I Similarities
1 Selection of the starting points of centroids
2 Tendency towards local minima rather than global minimum
II Differences
1 Deterministic vs Probabilistic Approach
2 Parameters
3 Updating the centroids
4 Feature Standardization
A K-Means Clustering
I Introduction
1 What is K-Means Clustering
K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into a pre-defined number of clusters. The goal is to group similar data points together and discover underlying patterns or structures within the data. Recall the first property of clusters: the points within a cluster should be similar to each other. So, our aim here is to minimize the distance between the points within a cluster.
There is an algorithm that tries to minimize the distance of the points in a cluster with their centroid: the k-means clustering technique.
K-means is a centroid-based (distance-based) algorithm, in which we calculate distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.
The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroid.
Optimization plays a crucial role in the k-means clustering algorithm. The goal of the optimization process is to find the set of centroids that minimizes the sum of squared distances between each data point and its closest centroid.
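To make this objective concrete, here is a minimal NumPy sketch that computes the within-cluster sum of squares (the quantity K-means minimizes) on a toy data set; the arrays X, centroids, and labels are illustrative placeholders:

    import numpy as np

    # Toy data: six 2-D points and two hypothetical centroids.
    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
    centroids = np.array([[1.0, 1.0], [8.0, 8.0]])
    labels = np.array([0, 0, 0, 1, 1, 1])  # cluster assigned to each point

    # Sum of squared distances of each point to its own cluster centroid.
    wcss = sum(np.sum((X[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))
    print(wcss)  # smaller is better; K-means searches for centroids minimizing this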
2 How K-Means Clustering works
Here’s how it works:
Initialization: Start by randomly selecting K points from the dataset. These points will act as the initial cluster centroids.
Assignment: For each data point in the dataset, calculate the distance between that point and each of the K centroids. Assign the data point to the cluster whose centroid is closest to it. This step effectively forms K clusters.
Update centroids: Once all data points have been assigned to clusters, recalculate the centroids of the clusters by taking the mean of all data points assigned to each cluster.
Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the centroids no longer change significantly or when a specified number of iterations is reached.
Final Result: Once convergence is achieved, the algorithm outputs the final cluster centroids and the assignment of each data point to a cluster.
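These steps are implemented, for example, by scikit-learn's KMeans class; a minimal sketch on synthetic data (the blob parameters and K = 3 are illustrative choices):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic data: 300 points drawn around 3 centers.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Fit K-means with K = 3; fit() runs the assignment/update loop to convergence.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

    print(kmeans.cluster_centers_)  # final centroids
    print(kmeans.labels_[:10])      # cluster assignment of the first 10 points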
II How to select the number of clusters in K-Means
1 Elbow Curve Method
Recall that the basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation [or total within-cluster sum of squares (WSS)] is minimized. The total WSS measures the compactness of the clustering, and we want it to be as small as possible. The elbow method runs k-means clustering on the dataset for a range of values of k (say 1 to 10). For each k, we calculate the total within-cluster sum of squares, plot it against k, and look for the "elbow" point where the rate of decrease shifts. This elbow point can be used to determine K.
Perform K-means clustering with all these different values of K. For each of the K values, we calculate the average distance to the centroid across all data points. Plot these points and find the point where the average distance from the centroid falls suddenly (the "elbow").
At first, additional clusters will add a lot of information (explaining a lot of variance), but at some point the marginal gain drops, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion". This "elbow" cannot always be unambiguously identified.
Inertia: Sum of squared distances of samples to their closest cluster center.
In practice, we do not always have clearly clustered data, which means the elbow may not be clear and sharp.
The curve looks like an elbow. In the example plot, the elbow is at k = 3 (i.e., the sum of squared distances falls suddenly), indicating that the optimal k for this dataset is 3.
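A minimal sketch of the elbow method using scikit-learn, which exposes the total within-cluster sum of squares as the inertia_ attribute (the data set and range of k are illustrative):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Total within-cluster sum of squares (inertia) for k = 1..10.
    ks = range(1, 11)
    wss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
           for k in ks]

    plt.plot(ks, wss, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Total within-cluster sum of squares")
    plt.show()  # look for the "elbow" where the curve flattens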
2 Silhouette Analysis
The silhouette coefficient, or silhouette score, is a measure of how similar a data point is to its own cluster (cohesion) compared to other clusters (separation). The silhouette score can be easily calculated in Python using the metrics module of the scikit-learn (sklearn) library.
Select a range of values of k (say 1 to 10)
Plot the silhouette coefficient for each value of K
The equation for calculating the silhouette coefficient for a particular data point:
S(i) = (b(i) − a(i)) / max{a(i), b(i)}
S(i) is the silhouette coefficient of the data point i
a(i) is the average distance between i and all the other data points in the cluster to which i belongs
b(i) is the smallest average distance from i to the points in each cluster to which i does not belong (i.e., the distance to the nearest neighboring cluster)
We will then calculate the average silhouette for every k:
Average Silhouette = mean{S(i)}
Then plot the graph between the average silhouette and K.
Points to Remember While Calculating Silhouette Coefficient:
The value of the silhouette coefficient lies in the range [-1, 1]
A score of 1 is the best: it means that the data point i is very compact within the cluster to which it belongs and far away from the other clusters. The worst value is -1. Values near 0 denote overlapping clusters
We see that the silhouette score is maximized at k = 3. So, we will take 3 clusters.
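A minimal sketch of silhouette analysis with scikit-learn's silhouette_score (data set and k range are illustrative; the score needs at least 2 clusters, so the loop starts at k = 2):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Average silhouette coefficient for k = 2..10; pick the k with the highest score.
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        print(k, silhouette_score(X, labels))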
B Other clustering techniques
I Introduction
1 What are Clustering Algorithms
Clustering is an unsupervised machine learning task. You might also hear this referred to as cluster analysis because of the way this method works.
Using a clustering algorithm means you're going to give the algorithm a lot of input data with no labels and let it find any groupings in the data it can.
Those groupings are called clusters. A cluster is a group of data points that are similar to each other based on their relation to surrounding data points. Clustering is used for things like feature engineering or pattern discovery.
When you're starting with data you know nothing about, clustering might be a good place
to get some insight
2 When to use Clustering
When you have a set of unlabeled data, it's very likely that you'll be using some kind of unsupervised learning algorithm
There are a lot of different techniques for working with unlabeled data, like neural networks, reinforcement learning, and clustering. The specific type of algorithm you want to use is going to depend on what your data looks like.
You might want to use clustering when you're trying to do anomaly detection to try and find outliers in your data It helps by finding those groups of clusters and showing the boundaries that would determine whether a data point is an outlier or not
If you aren't sure of what features to use for your machine learning model, clustering discovers patterns you can use to figure out what stands out in the data
Clustering is especially useful for exploring data you know nothing about. It might take some time to figure out which type of clustering algorithm works best, but when you do, you'll get invaluable insight into your data. You might find connections you never would have thought of.
Some real-world applications of clustering include fraud detection in insurance, categorizing books in a library, and customer segmentation in marketing. It can also be used in larger problems, like earthquake analysis or city planning.
3 The top 8 Clustering Algorithms
K-means clustering algorithm
DBSCAN clustering algorithm
Gaussian Mixture Model algorithm
BIRCH algorithm
Affinity Propagation clustering algorithm
Mean-Shift clustering algorithm
OPTICS algorithm
Agglomerative Hierarchy clustering algorithm
II Gaussian Mixture Models (GMM)
One of the problems with k-means is that the data needs to follow a roughly circular (spherical) shape. The way k-means calculates the distance between data points draws a circular boundary around each centroid, so non-circular clusters aren't grouped correctly.
This is an issue that Gaussian mixture models fix. You don't need circular-shaped data for them to work well.
The Gaussian mixture model uses multiple Gaussian distributions to fit arbitrarily shaped data.
There are several single Gaussian models that act as latent components in this mixture model. So the model calculates the probability that a data point belongs to a specific Gaussian distribution, and that's the cluster it will fall under.
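A minimal sketch of soft clustering with scikit-learn's GaussianMixture class (the data parameters and number of components are illustrative):

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Fit a mixture of 3 Gaussians; each component has its own mean and covariance.
    gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

    print(gmm.means_)                # component means (analogous to centroids)
    print(gmm.predict_proba(X[:3]))  # soft assignments: P(component | point)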
C Comparison between K-Means and Gaussian Mixture Models
I Similarities
1 Selection of the starting points of centroids
Choosing the starting point for each centroid is very important as it can have a big effect
on the outcome
Various methods have been proposed to initialize the cluster centers Common methods are:
Forgy (chooses K data points randomly and uses those as the initial centroid locations)
Random partition (randomly assigns a cluster to each data point and then calculates a centroid based on this clustering; this generally creates initial centroids close to the middle of the data set)
K-Means++ (chooses initial centroids that are as far apart from each other as possible, effectively around the edges of the data set)
For GMM, it is usual to set the initial spread (variance) for each cluster to the population variance and to set the weights equally.
Both scikit-learn implementations (KMeans and GaussianMixture) give the option to run the analysis a number of times from different starting points and return the result that is the best fit of all the runs. The number of tries can be set by the user, as shown in the sketch below.
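In scikit-learn this is controlled by the n_init parameter (a sketch; the values of K and n_init are illustrative):

    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    # Run 10 different initializations and keep the run with the best objective.
    kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10)
    gmm = GaussianMixture(n_components=3, n_init=10)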
2 Tendency towards local minima rather than global minimum
Because the initialization of the centroids is essentially a guess, they can start far away from the true cluster centers in the data. The two methods always converge (meaning they will always find a solution), but not necessarily to the optimal one. On the way, they may fall into a local minimum and get stuck there.
When the global minimum is not reached, the result will not be optimal and a different grouping of clusters will be obtained.
II Differences
1 Deterministic vs Probabilistic Approach
K-Means: uses a deterministic approach and assigns each data point to one unique cluster. This is referred to as a hard clustering method.
GMM: uses a probabilistic approach and gives the probability of each data point belonging to any of the clusters. This is referred to as a soft clustering method.
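A short self-contained sketch contrasting the two kinds of output (synthetic data as in the earlier sketches):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Hard assignment: each point gets exactly one cluster label.
    print(KMeans(n_clusters=3, n_init=10).fit_predict(X)[:3])

    # Soft assignment: each point gets a probability for every cluster (rows sum to 1).
    print(GaussianMixture(n_components=3).fit(X).predict_proba(X)[:3])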
2 Parameters
K-Means: uses only two parameters: the number of clusters K and the centroid locations.
GMM: uses three parameters: the number of clusters K, the cluster means, and the cluster covariances.
3 Updating the centroids
K-Means: uses only the data points assigned to each specific cluster to update the centroid mean.
GMM: uses all data points within the data set to update the component weights, means, and covariances.
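A numerical sketch of this difference in the mean update (NumPy; the responsibilities array resp is an illustrative placeholder for the probabilities produced by GMM's E-step):

    import numpy as np

    X = np.array([[0.0], [1.0], [10.0]])  # three 1-D points

    # K-means update: mean of only the points hard-assigned to cluster 0.
    labels = np.array([0, 0, 1])
    mu_kmeans = X[labels == 0].mean(axis=0)  # -> [0.5]

    # GMM update: weighted mean over ALL points, weighted by responsibilities.
    resp = np.array([0.9, 0.8, 0.1])  # P(cluster 0 | point) for each point
    mu_gmm = (resp[:, None] * X).sum(axis=0) / resp.sum()  # -> [1.0]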
4 Feature Standardization
K-Means: the ‘circular’ distance measure around a centroid can make feature standardization necessary if one or more of the dimensions would otherwise dominate the calculation. A sketch of this preprocessing step follows below.
GMM: automatically takes the issue into account by its calculation and use of the covariance matrix.
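A minimal sketch of standardizing features before K-means with scikit-learn (the pipeline is an illustrative choice, not the only way to scale):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Scale each feature to zero mean and unit variance so that no single
    # dimension dominates the Euclidean distance, then run K-means.
    model = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10))
    labels = model.fit_predict(X)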