Topic: K-means clustering



TRƯỜNG ĐẠI HỌC KINH TẾ - LUẬT

Supervisors: Dr. Phạm Hoàng Uyên, MSc. Lương Thanh Quỳnh

Ho Chi Minh City, May 9, 2023

Introduction


for data mining algorithms to explode in development.

Among data mining algorithms, clustering algorithms are well known for their ability to analyze and select the necessary data from digital data sources. Among these clustering algorithms, the K-means algorithm is considered a basic algorithm that initiated the approach of data mining by clustering. Although still limited, the K-means algorithm is a foundation that points to a direction for data mining by clustering and achieves high efficiency. Moreover, K-means is also the starting point for many subsequent clustering algorithms aiming at optimal efficiency.

In this report, we would like to present K-means, its advantages and disadvantages, and propose solutions to overcome its limitations.

I Clustering

1 Unsupervised learning.


Unsupervised learning occurs when an algorithm learns from plain examples without any associated response, leaving it to the algorithm to determine the data patterns on its own. This type of algorithm tends to restructure the data into something else, such as new features that may represent a class or a new series of uncorrelated values. They are quite useful in providing humans with insights into the meaning of data and new useful inputs to supervised machine learning algorithms. As a kind of learning, it resembles the methods humans use to figure out that certain objects or events are from the same class, such as by observing the degree of similarity between objects. Some recommendation systems that you find on the web in the form of marketing automation are based on this type of learning. The marketing automation algorithm derives its suggestions from what you have bought in the past. The recommendations are based on an estimation of what group of customers you resemble the most and then inferring your likely preferences based on that group.

2 Definition.

Clustering is one of the most widely used techniques for exploratory data analysis. Across all disciplines, from social sciences to biology to computer science, people try to get a first intuition about their data by identifying meaningful groups among the data points. For example, computational biologists cluster genes on the basis of similarities in their expression in different experiments; retailers cluster customers, on the basis of their customer profiles, for the purpose of targeted marketing; and astronomers cluster stars on the basis of their spatial proximity.

The first point that one should clarify is, naturally, what is clustering? Intuitively, clustering is the task of grouping a set of objects such that similar objects end up in the same group and dissimilar objects are separated into different groups. Clearly, this description is quite imprecise and possibly ambiguous. Quite surprisingly, it is not at all clear how to come up with a more rigorous definition.

There are several sources for this difficulty. One basic problem is that the two objectives mentioned in the earlier statement may in many cases contradict each other. Mathematically speaking, similarity (or proximity) is not a transitive relation, while cluster sharing is an equivalence relation and, in particular, a transitive relation. More concretely, it may be the case that there is a long sequence of objects, x_1, …, x_m, such that each x_i is very similar to its two neighbors, x_{i−1} and x_{i+1}, but x_1 and x_m are very dissimilar. If we wish to make sure that whenever two elements are similar they share the same cluster, then we must put all of the elements of the sequence in the same cluster. However, in that case, we end up with dissimilar elements (x_1 and x_m) sharing a cluster, thus violating the second requirement.

To illustrate this point further, suppose that we would like to cluster the points in the following picture into two clusters


Another basic problem is the lack of "ground truth" for clustering, which is a common problem in unsupervised learning. So far, we have mainly dealt with supervised learning (e.g., the problem of learning a classifier from labeled training data). The goal of supervised learning is clear: we wish to learn a classifier which will predict the labels of future examples as accurately as possible. Furthermore, a supervised learner can estimate the success, or the risk, of its hypotheses using the labeled training data by computing the empirical loss. In contrast, clustering is an unsupervised learning problem; namely, there are no labels that we try to predict. Instead, we wish to organize the data in some meaningful way. As a result, there is no clear success evaluation procedure for clustering. In fact, even on the basis of full knowledge of the underlying data distribution, it is not clear what is the "correct" clustering for that data or how to evaluate a proposed clustering.

Consider, for example, the following set of points in R²:

and suppose we are required to cluster them into two clusters. We have two highly justifiable solutions:


This phenomenon is not just artificial but occurs in real applications. A given set of objects can be clustered in various different meaningful ways. This may be due to having different implicit notions of distance (or similarity) between objects; for example, clustering recordings of speech by the accent of the speaker versus clustering them by content, clustering movie reviews by movie topic versus clustering them by the review sentiment, clustering paintings by topic versus clustering them by style, and so on.

To summarize, there may be several very different conceivable clustering solutions for a given data set. As a result, there is a wide variety of clustering algorithms that, on some input data, will output very different clusterings.

A Clustering Model:

Clustering tasks can vary in terms of both the type of input they have and the type of outcome they are expected to compute. For concreteness, we shall focus on the following common setup:

Input: a set of elements, 𝒳, and a distance function over it. That is, a function d : 𝒳 × 𝒳 → R+ that is symmetric, satisfies d(x, x) = 0 for all x ∈ 𝒳, and often also satisfies the triangle inequality. Alternatively, the function could be a similarity function s : 𝒳 × 𝒳 → [0, 1] that is symmetric and satisfies s(x, x) = 1 for all x ∈ 𝒳. Additionally, some clustering algorithms also require an input parameter k (determining the number of required clusters).

Output: a partition of the domain set 𝒳 into subsets. That is, C = (C_1, …, C_k) where ∪_{i=1}^{k} C_i = 𝒳 and, for all i ≠ j, C_i ∩ C_j = ∅. In some situations the clustering is "soft," namely, the partition of 𝒳 into the different clusters is probabilistic: the output is a function assigning to each domain point, x ∈ 𝒳, a vector (p_1(x), …, p_k(x)), where p_i(x) = P[x ∈ C_i] is the probability that x belongs to cluster C_i. Another possible output is a clustering dendrogram (from Greek dendron = tree, gramma = drawing), which is a hierarchical tree of domain subsets, having the singleton sets in its leaves and the full domain as its root.

3 Clustering algorithm.

There are many different clustering algorithms, each with its own strengths and weaknesses. Some popular clustering algorithms include k-means clustering, hierarchical clustering, DBSCAN, …

3.1 K-means clustering

K-means is a popular algorithm for clustering data into k distinct groups. The algorithm works by iteratively assigning each data point to the nearest cluster center (centroid) and then recalculating the centroids based on the new assignments. K-means is computationally efficient and works well on large datasets.
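As a minimal illustration, a usage sketch with scikit-learn's KMeans (assuming scikit-learn is available; the data here is synthetic and the parameter values are only examples):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points drawn around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit K-means with k = 3; each point is assigned to its nearest centroid,
# and the centroids are recomputed until the assignments stabilize.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroids
print(km.labels_[:10])       # cluster index of the first 10 points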


3.2 Hierarchical clustering

Hierarchical clustering is a clustering method that creates a tree-like structure of nested clusters. There are two types of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down). Agglomerative clustering starts with each data point as its own cluster and then recursively merges pairs of clusters based on a similarity metric, such as distance. Divisive clustering starts with all data points in a single cluster and then recursively splits the cluster into smaller subclusters based on a similarity metric.
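For illustration, a sketch of agglomerative clustering with SciPy (assuming SciPy is available; the data and the choice of Ward linkage are only examples):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Agglomerative (bottom-up) clustering: start from singleton clusters and
# repeatedly merge the two closest clusters (Ward linkage on Euclidean distance).
Z = linkage(X, method="ward")

# Cut the resulting tree (dendrogram) to obtain a flat partition into 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)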

3.3 Density-based clustering

Density-based clustering algorithms, such as DBSCAN, group together data points that are in high-density regions. These algorithms are especially useful for identifying clusters of irregular shapes and sizes. Density-based clustering algorithms require the specification of a density threshold and a minimum number of points required to form a cluster.
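A sketch with scikit-learn's DBSCAN (the eps and min_samples values below are illustrative and correspond to the density threshold and minimum point count mentioned above):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: clusters with irregular, non-convex shapes.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

# eps is the neighborhood radius (density threshold) and min_samples is the
# minimum number of points required to form a dense region.
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)
print(set(labels))  # cluster ids; -1 marks points labeled as noise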

3.4 Model-based clustering

Model-based clustering algorithms, such as Gaussian mixture models (GMMs), use probabilistic models to identify clusters in the data. These algorithms assume that the data is generated from a mixture of Gaussian distributions, and the goal is to estimate the parameters of the distributions and assign each data point to the cluster with the highest probability.
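A sketch with scikit-learn's GaussianMixture, showing both the hard assignment and the soft (probabilistic) assignment described above (synthetic data, illustrative settings):

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Fit a mixture of 3 Gaussians; the parameters are estimated with EM.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard = gmm.predict(X)          # hard assignment: most probable component
soft = gmm.predict_proba(X)    # soft assignment: probability of each component
print(hard[:5], soft[:5].round(2))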

3.5 Spectral clustering

Spectral clustering is a graph-based clustering algorithm that uses the eigenvalues and eigenvectors of a similarity matrix to identify clusters. Spectral clustering works well on datasets with nonlinear or complex structure and can be used to identify clusters of different sizes and shapes.
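A sketch with scikit-learn's SpectralClustering on nonlinearly separable data (the affinity and parameter choices are illustrative):

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Nonlinearly separable data where plain K-means tends to fail.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

# Build a nearest-neighbor similarity graph, embed the points using the
# leading eigenvectors of its Laplacian, then cluster the embedding.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(X)
print(labels[:10])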


II K-means algorithm

1 Definition

K-means clustering is an unsupervised machine learning algorithm, which groups an unlabeled dataset into different clusters. 'K' in the name of the algorithm represents the number of groups/clusters we want to classify our items into.

The algorithm works by iteratively assigning each data point to the cluster with the nearest centroid (mean) and then re-estimating the centroid of each cluster based on the new assignments. This process is repeated until the centroids no longer move.

The algorithm works as follows:

- First, randomly initialize k points, called means or cluster centroids.

- Categorize each item to its closest mean and update the mean’s coordinates, which are the averages of the items categorized in that cluster so far.


- Repeat the process for a given number of iterations (or until the assignments stop changing); the resulting groups are the final clusters.

With the one-hot encoding technique, each data point x_i is assigned a label vector y_i = [y_{i1}, y_{i2}, …, y_{iK}]. The constraint on y_i can be written as follows:
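The constraint referred to above is not reproduced in this extract; in its standard one-hot form it reads:

$$ y_{ij} \in \{0, 1\}, \qquad \sum_{j=1}^{K} y_{ij} = 1 \quad \text{for every } i. $$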

2 Loss function

A loss function is a mathematical function that measures the difference between the predicted output of a model and the actual output. The goal is to minimize this difference, which represents the cost of the model.

Consider the center m_k as the representative of each cluster, so that all data points assigned to this cluster are approximated by m_k; a data point x_i assigned to cluster k then has an error of (x_i − m_k). This error should be as small as possible in magnitude, and so should the following quantity:
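The quantity in question is not shown in this extract; using the one-hot labels, it is commonly written as the squared error of point x_i:

$$ \|x_i - m_k\|_2^2 \;=\; \sum_{j=1}^{K} y_{ij}\,\|x_i - m_j\|_2^2 . $$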

The cost formula for all data is:
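The original formula is not reproduced here; the standard form consistent with the description above is:

$$ \mathcal{L}(Y, M) \;=\; \sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij}\,\|x_i - m_j\|_2^2 . $$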

Here Y = [y_1, y_2, …, y_N] and M = [m_1, m_2, …, m_K] are the matrices formed by the label vectors of the data points and by the centers of the clusters, respectively. The loss function in our K-means clustering problem is the function L(Y, M).

In summary, we need to optimize the following problem:
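Stated explicitly (reconstructed in the standard form, since the original formula is not shown in this extract):

$$ Y^{*}, M^{*} \;=\; \arg\min_{Y,\,M} \sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij}\,\|x_i - m_j\|_2^2 \quad \text{s.t. } y_{ij} \in \{0,1\},\; \sum_{j=1}^{K} y_{ij} = 1. $$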

3 Optimization algorithm


- Repeat until convergence:

+ Assign each point to the closest centroid

+ Update each centroid to be the mean of all points assigned to it

A simple way to solve the above problem is to alternate between solving for Y and M, with the other variable fixed. This is an iterative algorithm and a common technique in optimization problems. We will solve the following two problems in turn:

a Find the label vectors that minimize the loss function given the centroids.

When the centers are fixed, the problem of finding the label vectors for all data points can be broken down into the problem of finding the label vector for each data point x_i as follows:
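The per-point subproblem (not reproduced in this extract) is, in the standard form:

$$ y_i \;=\; \arg\min_{y_i} \sum_{j=1}^{K} y_{ij}\,\|x_i - m_j\|_2^2 \quad \text{s.t. } y_{ij} \in \{0,1\},\; \sum_{j=1}^{K} y_{ij} = 1. $$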

Since only one element of the label vector y_i is equal to 1, the above problem can be further simplified:
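In the simplified form this amounts to choosing the index of the nearest center:

$$ j^{*} \;=\; \arg\min_{j} \|x_i - m_j\|_2^2, \qquad y_{ij^{*}} = 1, \;\; y_{ij} = 0 \text{ for } j \neq j^{*}. $$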

Conclude that each data point x_i belongs to the cluster whose center is closest to it.

From this, we can easily deduce the label vector for each data point.

b Find the new centroids for each cluster that minimize the loss function, given the cluster assignment of each data point.


When the label vector of each data point has been determined, the problem of finding the new centroid for each cluster can be broken down as follows:
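With Y fixed, the subproblem for each center m_j (reconstructed in the standard form) is:

$$ m_j \;=\; \arg\min_{m_j} \sum_{i=1}^{N} y_{ij}\,\|x_i - m_j\|_2^2 . $$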

Setting the derivative of the above objective with respect to m_j equal to 0 and solving:
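The derivative condition and its solution (standard form, matching the discussion of numerator and denominator below):

$$ 2 \sum_{i=1}^{N} y_{ij}\,(m_j - x_i) = 0 \;\;\Longrightarrow\;\; m_j = \frac{\sum_{i=1}^{N} y_{ij}\, x_i}{\sum_{i=1}^{N} y_{ij}} . $$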

Note that the denominator is the number of data points in cluster j, and the numerator is the sum of all data points in cluster j. Therefore, m_j is the arithmetic mean of the data points in cluster j.

There are several other optimization variants that can be used in k-means clustering, such as the K-medoids algorithm (which uses medoids instead of the mean to update the cluster centers) or the Mini-batch K-means algorithm (which uses batches of data to update the cluster centers and is faster and more scalable for large datasets). The choice of optimization variant depends on the specific problem and the characteristics of the data. Some variants may be more suitable for certain types of data or may converge faster than others.
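As a quick illustration of the mini-batch variant mentioned above, scikit-learn provides a MiniBatchKMeans estimator (a sketch; the parameter values are illustrative):

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)

# Mini-batch K-means updates the centers from random batches of the data,
# trading a small amount of accuracy for lower memory use and runtime.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0)
labels = mbk.fit_predict(X)
print(mbk.cluster_centers_.shape)  # (5, n_features)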

4 Summary

Input: Data X and the number of clusters K.

Output: Centers M and label vectors Y for each data point.

Step 1: Choose K arbitrary points as the initial centers.

Step 2: Assign each data point to the cluster with the closest center.

Step 3: If the data point assignments in step 2 do not change compared to the previous iteration, stop the algorithm.

Step 4: Update the center for each cluster by taking the average of all data points assigned to that cluster in step 2.

Step 5: Go back to step 2.
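A plain NumPy sketch that follows Steps 1-5 above (a minimal implementation for illustration, with synthetic data; real applications would also handle empty clusters and multiple restarts):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Plain K-means following Steps 1-5 above."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: choose K arbitrary data points as the initial centers.
    M = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # Step 2: assign each point to the cluster with the closest center.
        dists = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop if the assignments did not change since the last iteration.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: update each center to the mean of the points assigned to it.
        for j in range(K):
            if np.any(labels == j):
                M[j] = X[labels == j].mean(axis=0)
        # Step 5: go back to step 2 (the next loop iteration).
    return M, labels

# Example usage on synthetic 2-D data with two well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centers, labels = kmeans(X, K=2)
print(centers)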


This algorithm will stop after a finite number of iterations. Indeed, the loss function is non-negative, and its value decreases after each assignment step and each center update; a decreasing sequence that is bounded below converges. Moreover, the number of ways to partition the entire data set is finite, so at some point the loss function can no longer change, and the algorithm stops at that point.

5 Advantages

Simple and easy to understand: K-means is based on a straightforward iterative process that minimizes the sum of squared distances between data points and cluster centroids. This makes it easy to understand and implement, even for those with limited knowledge of machine learning.

Fast processing speed: K-means is a relatively fast algorithm, especially for large datasets, because it typically requires only a few iterations to converge to a solution. This makes it useful for real-time applications or situations where quick decision-making is required.

Adaptive to many types of data: K-means can be used with a variety of data types, including continuous, discrete, binary, and even mixed data. This flexibility makes it a versatile clustering algorithm that can be used in many different contexts.

Allows for non-linear clustering: Unlike some other clustering algorithms, K-means can handle non-linearly separable data, meaning that data points that are not linearly separable in the feature space can still be clustered together.
