HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
MACHINE LEARNING IN HEALTHCARE ASSIGNMENT REPORT
Supervisor: Nguyễn Việt Dũng
Full name: Phạm Trương Hà Phương
Student ID: 20210693
Hà Nội, 2024
Table of Contents
A K-Means Clustering
I Introduction
1 What is K-Means Clustering
2 How K-Means Clustering works
II How to select the number of clusters in K-Means
1 Elbow Curve Method
2 Silhouette Analysis
B Other clustering techniques
I Introduction
1 What are Clustering Algorithms
2 When to use Clustering
3 The top 8 Clustering Algorithms
II Gaussian Mixture Models (GMM)
C Comparison between K-Means and Gaussian Mixture Models
I Similarities
1 Selection of the starting points of centroids
2 Tendency towards local minima rather than global minimum
II Differences
1 Deterministic vs Probabilistic Approach
2 Parameters
3 Updating the centroids
4 Feature Standardization
A K-Means Clustering
I Introduction
1 What is K-Means Clustering
K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into a pre-defined number of clusters. The goal is to group similar data points together and discover underlying patterns or structures within the data. Recall the first property of clusters: the points within a cluster should be similar to each other. So, our aim here is to minimize the distance between the points within a cluster.
There is an algorithm that tries to minimize the distance of the points in a cluster with their centroid: the k-means clustering technique.
K-means is a centroid-based (distance-based) algorithm, in which we calculate distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.
The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroid.
Optimization plays a crucial role in the k-means clustering algorithm. The goal of the optimization process is to find the set of centroids that minimizes the sum of squared distances between each data point and its closest centroid.
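To make this objective concrete, here is a minimal NumPy sketch that computes the within-cluster sum of squares (the quantity K-means minimizes) on a toy data set; the arrays X, centroids, and labels are illustrative placeholders:

    import numpy as np

    # Toy data: six 2-D points and two hypothetical centroids.
    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
    centroids = np.array([[1.0, 1.0], [8.0, 8.0]])
    labels = np.array([0, 0, 0, 1, 1, 1])  # cluster assigned to each point

    # Sum of squared distances of each point to its own cluster centroid.
    wcss = sum(np.sum((X[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))
    print(wcss)  # smaller is better; K-means searches for centroids minimizing this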
2 How K-Means Clustering works
Here’s how it works:
Initialization: Start by randomly selecting K points from the dataset. These points will act as the initial cluster centroids.
Assignment: For each data point in the dataset, calculate the distance between that point and each of the K centroids. Assign the data point to the cluster whose centroid is closest to it. This step effectively forms K clusters.
Update centroids: Once all data points have been assigned to clusters, recalculate the centroids of the clusters by taking the mean of all data points assigned to each cluster.
Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the centroids no longer change significantly or when a specified number of iterations is reached.
Final Result: Once convergence is achieved, the algorithm outputs the final cluster centroids and the assignment of each data point to a cluster.
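These steps are implemented, for example, by scikit-learn's KMeans class; a minimal sketch on synthetic data (the blob parameters and K = 3 are illustrative choices):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic data: 300 points drawn around 3 centers.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Fit K-means with K = 3; fit() runs the assignment/update loop to convergence.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

    print(kmeans.cluster_centers_)  # final centroids
    print(kmeans.labels_[:10])      # cluster assignment of the first 10 points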
II How to select the number of clusters in K-Means
1 Elbow Curve Method
Recall that the basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation [or total within-cluster sum of squares (WSS)] is minimized. The total WSS measures the compactness of the clustering, and we want it to be as small as possible. The elbow method runs k-means clustering on the dataset for a range of values of k (say 1 to 10). For each k, we calculate the total within-cluster sum of squares, plot it against k, and look for the "elbow" point where the rate of decrease shifts. This elbow point can be used to determine K.
Perform K-means clustering with all these different values of K. For each of the K values, we calculate the average distance to the centroid across all data points. Plot these points and find the point where the average distance from the centroid falls suddenly (the "elbow").
At first, additional clusters will add a lot of information (explaining a lot of variance), but at some point the marginal gain drops, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion". This "elbow" cannot always be unambiguously identified.
Inertia: Sum of squared distances of samples to their closest cluster center.
In practice, we do not always have clearly clustered data, which means the elbow may not be clear and sharp.
The curve looks like an elbow. In the example plot, the elbow is at k = 3 (i.e., the sum of squared distances falls suddenly), indicating that the optimal k for this dataset is 3.
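A minimal sketch of the elbow method using scikit-learn, which exposes the total within-cluster sum of squares as the inertia_ attribute (the data set and range of k are illustrative):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Total within-cluster sum of squares (inertia) for k = 1..10.
    ks = range(1, 11)
    wss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
           for k in ks]

    plt.plot(ks, wss, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Total within-cluster sum of squares")
    plt.show()  # look for the "elbow" where the curve flattens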
2 Silhouette Analysis
The silhouette coefficient, or silhouette score, is a measure of how similar a data point is to its own cluster (cohesion) compared to other clusters (separation). The silhouette score can be easily calculated in Python using the metrics module of the scikit-learn (sklearn) library.
Select a range of values of k (say 1 to 10)
Plot the silhouette coefficient for each value of K
The equation for calculating the silhouette coefficient for a particular data point:
S(i) = (b(i) − a(i)) / max{a(i), b(i)}
S(i) is the silhouette coefficient of the data point i
a(i) is the average distance between i and all the other data points in the cluster to which i belongs
b(i) is the smallest average distance from i to the points in each cluster to which i does not belong (i.e., the distance to the nearest neighboring cluster)
We will then calculate the average silhouette for every k:
Average Silhouette = mean{S(i)}
Then plot the graph between the average silhouette and K.
Points to Remember While Calculating Silhouette Coefficient:
The value of the silhouette coefficient lies in the range [-1, 1]
A score of 1 is the best: it means that the data point i is very compact within the cluster to which it belongs and far away from the other clusters. The worst value is -1. Values near 0 denote overlapping clusters
We see that the silhouette score is maximized at k = 3. So, we will take 3 clusters.
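A minimal sketch of silhouette analysis with scikit-learn's silhouette_score (data set and k range are illustrative; the score needs at least 2 clusters, so the loop starts at k = 2):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Average silhouette coefficient for k = 2..10; pick the k with the highest score.
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        print(k, silhouette_score(X, labels))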
B Other clustering techniques
I Introduction
1 What are Clustering Algorithms
Clustering is an unsupervised machine learning task. You might also hear this referred to as cluster analysis because of the way this method works.
Using a clustering algorithm means you're going to give the algorithm a lot of input data with no labels and let it find any groupings in the data it can.
Those groupings are called clusters. A cluster is a group of data points that are similar to each other based on their relation to surrounding data points. Clustering is used for things like feature engineering or pattern discovery.
When you're starting with data you know nothing about, clustering might be a good place
to get some insight
2 When to use Clustering
When you have a set of unlabeled data, it's very likely that you'll be using some kind of unsupervised learning algorithm
There are a lot of different techniques for working with unlabeled data, like neural networks, reinforcement learning, and clustering. The specific type of algorithm you want to use is going to depend on what your data looks like.
You might want to use clustering when you're trying to do anomaly detection to try and find outliers in your data It helps by finding those groups of clusters and showing the boundaries that would determine whether a data point is an outlier or not
If you aren't sure of what features to use for your machine learning model, clustering discovers patterns you can use to figure out what stands out in the data
Clustering is especially useful for exploring data you know nothing about. It might take some time to figure out which type of clustering algorithm works best, but when you do, you'll get invaluable insight into your data. You might find connections you never would have thought of.
Some real-world applications of clustering include fraud detection in insurance, categorizing books in a library, and customer segmentation in marketing. It can also be used in larger problems, like earthquake analysis or city planning.
3 The top 8 Clustering Algorithms
K-means clustering algorithm
DBSCAN clustering algorithm
Gaussian Mixture Model algorithm
BIRCH algorithm
Affinity Propagation clustering algorithm
Mean-Shift clustering algorithm
OPTICS algorithm
Agglomerative Hierarchy clustering algorithm
II Gaussian Mixture Models (GMM)
One of the problems with k-means is that the data needs to follow a roughly circular (spherical) shape. The way k-means calculates the distance between data points draws a circular boundary around each centroid, so non-circular clusters aren't grouped correctly.
This is an issue that Gaussian mixture models fix. You don't need circular-shaped data for them to work well.
The Gaussian mixture model uses multiple Gaussian distributions to fit arbitrarily shaped data.
There are several single Gaussian models that act as latent components in this mixture model. So the model calculates the probability that a data point belongs to a specific Gaussian distribution, and that's the cluster it will fall under.
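A minimal sketch of soft clustering with scikit-learn's GaussianMixture class (the data parameters and number of components are illustrative):

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Fit a mixture of 3 Gaussians; each component has its own mean and covariance.
    gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

    print(gmm.means_)                # component means (analogous to centroids)
    print(gmm.predict_proba(X[:3]))  # soft assignments: P(component | point)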
C Comparison between K-Means and Gaussian Mixture Models
I Similarities
1 Selection of the starting points of centroids
Choosing the starting point for each centroid is very important as it can have a big effect
on the outcome
Various methods have been proposed to initialize the cluster centers Common methods are:
Forgy (chooses K data points randomly and uses those as the initial centroid locations)
Random partition (randomly assigns a cluster to each data point and then calculates a centroid based on this clustering; this generally creates initial centroids close to the middle of the data set)
K-Means++ (chooses initial centroids that are as far apart from each other as possible, effectively around the edges of the data set)
For GMM, it is usual to set the initial spread (variance) for each cluster to the population variance and to set the weights equally.
Both scikit-learn implementations (KMeans and GaussianMixture) give the option to run the analysis a number of times from different starting points and return the result that is the best fit of all the runs. The number of tries can be set by the user, as shown in the sketch below.
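In scikit-learn this is controlled by the n_init parameter (a sketch; the values of K and n_init are illustrative):

    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    # Run 10 different initializations and keep the run with the best objective.
    kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10)
    gmm = GaussianMixture(n_components=3, n_init=10)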
2 Tendency towards local minima rather than global minimum
Because the initialization of the centroids is essentially a guess, they can start far away from the true cluster centers in the data. The two methods always converge (meaning they will always find a solution), but not necessarily to the optimal one. On the way, they may fall into a local minimum and get stuck there.
When the global minimum is not reached, the result will not be optimal and a different grouping of clusters will be obtained.
II Differences
1 Deterministic vs Probabilistic Approach
K-Means: uses a deterministic approach and assigns each data point to one unique cluster. This is referred to as a hard clustering method.
GMM: uses a probabilistic approach and gives the probability of each data point belonging to any of the clusters. This is referred to as a soft clustering method.
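A short self-contained sketch contrasting the two kinds of output (synthetic data as in the earlier sketches):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Hard assignment: each point gets exactly one cluster label.
    print(KMeans(n_clusters=3, n_init=10).fit_predict(X)[:3])

    # Soft assignment: each point gets a probability for every cluster (rows sum to 1).
    print(GaussianMixture(n_components=3).fit(X).predict_proba(X)[:3])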
2 Parameters
K-Means: uses only two parameters: the number of clusters K and the centroid locations.
GMM: uses three parameters: the number of clusters K, the cluster means, and the cluster covariances.
3 Updating the centroids
K-Means: uses only the data points assigned to each specific cluster to update the centroid mean.
GMM: uses all data points within the data set to update the component weights, means, and covariances.
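A numerical sketch of this difference in the mean update (NumPy; the responsibilities array resp is an illustrative placeholder for the probabilities produced by GMM's E-step):

    import numpy as np

    X = np.array([[0.0], [1.0], [10.0]])  # three 1-D points

    # K-means update: mean of only the points hard-assigned to cluster 0.
    labels = np.array([0, 0, 1])
    mu_kmeans = X[labels == 0].mean(axis=0)  # -> [0.5]

    # GMM update: weighted mean over ALL points, weighted by responsibilities.
    resp = np.array([0.9, 0.8, 0.1])  # P(cluster 0 | point) for each point
    mu_gmm = (resp[:, None] * X).sum(axis=0) / resp.sum()  # -> [1.0]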
4 Feature Standardization
K-Means: the ‘circular’ distance measure around a centroid can make feature standardization necessary if one or more of the dimensions would otherwise dominate the calculation. A sketch of this preprocessing step follows below.
GMM: automatically takes the issue into account by its calculation and use of the covariance matrix.
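A minimal sketch of standardizing features before K-means with scikit-learn (the pipeline is an illustrative choice, not the only way to scale):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Scale each feature to zero mean and unit variance so that no single
    # dimension dominates the Euclidean distance, then run K-means.
    model = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10))
    labels = model.fit_predict(X)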