
VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

UNIVERSITY OF ECONOMICS AND LAW

FINAL PROJECT REPORT: FUNDAMENTAL DATA ANALYSIS

TOPIC: IMPROVING CHURN MANAGEMENT BASED ON CUSTOMER SEGMENTATION USING K-MEANS, K-MEANS++ AND K-MEDOIDS

Lecturers: 1. Assoc. Prof. Ho Trung Thanh, Ph.D.; 2. Le Thi Kim Hien, Ph.D.; 3. Nguyen Phat Dat, M.A.

Group 07: 1. Le Nguyen Minh Tai; 2. Le Trong Tan Dung; 3. La Nam Khanh; 4. Chung Bao Phu; 5. Le Hoang Giang

Ho Chi Minh City, December 21, 2023


Members of Group 7

2. Nguyễn Trọng Tấn Dũng (K224161808)


Acknowledgements

First and foremost, Group 07 would like to express its deepest gratitude for all the encouragement and support received throughout the duration of this project. In addition, we extend our sincerest thanks for the insightful lectures, devoted counsel, and enthusiastic guidance of Mr. Ho Trung Thanh, our study supervisor. His substantial investment of time and effort in guiding and instructing our class, including the oversight of our smaller assignments and culminating in the final project, has been invaluable.

Furthermore, we would like to express our appreciation to Mr. Nguyen Phat Dat, who represented Mr. Thanh in directly supporting our class. Without his passionate involvement and contributions, our performance and the successful completion of our project would not have been possible.

Last but certainly not least, our heartfelt thanks go to our families, who have consistently provided encouragement, support, and assistance, enabling us to fulfill our project objectives.

Group 07


Commitment

We solemnly declare that the contents of this report are the product of our own scholarly research, and we affirm that the findings presented herein have not been submitted for any other peer-reviewed report.

Ho Chi Minh City, December 2023
Group 07


Chapter 1 THEORETICAL BACKGROUND 17

1.1 Customer segmentation with RFM 17

1.2 How to compare other methods? 20

1.3 Customer Behavior and Customer Experience 22

1.4 Customer Lifetime Value (CLV) 23

Chapter 2 DATA PREPARATION 39

2.1 Exploratory data analysis 39

2.1.1 Data Description 39

2.1.2 Sales_data EDA steps 49


2.3.2 Customer Segmentation with RFM 61

Chapter 3 CUSTOMER SEGMENTATION WITH K-MEANS, K-MEANS++ AND K-MEDOIDS 63

3.1 Find K 63

3.1.1 Selecting the optimal number of clusters by Elbow 63

3.1.2 Measure the optimal number of clusters by Silhouette index 64

3.2 K-Means 66

3.3 K-Means++ 69

3.4 K-Medoids 74

3.5 Compare methods 80

3.6 Compare the best one with traditional 84

Chapter 4 DATA VISUALIZATION & ANALYSIS 87

4.1 Revenue by customer segments 87

List of Tables

Table 2.2 Sales Territory_data sheet's variables description 42
Table 2.3 Sales_data sheet's variables description 44
Table 2.4 Reseller_data sheet's variables description 45
Table 2.5 Date_data sheet's variables description 46
Table 2.6 Product_data sheet's variables description 47
Table 2.7 Customer_data sheet's variables description 48
Table 2.8 The list of customer segments 60
Table 2.9 Customer segmentation by the RFM method 61
Table 3.1 The comparison of evaluation methods among algorithms 80

List of Figures

Figure 1 Gantt Chart
Figure 2 Research methods and procedures 14
Figure 1.1 The results of the RFM model presented in a tree map chart 19
Figure 1.2 An example of K-means clustering algorithm results 26
Figure 1.3 Flowchart of K-means clustering 27
Figure 1.4 A chart displaying the number of clusters k 29
Figure 1.5 Some graphical illustrations of silhouette score 31
Figure 1.6 Input data points for K-means++ 33
Figure 1.7 Output data points for K-means++ 33
Figure 1.8 Comparison between K-Means and K-Medoids 34
Figure 1.9 Randomly generated data 35
Figure 1.10 Cluster assignments and medoids 35
Figure 1.11 Cohort retention rate table 37
Figure 2.1 Contribution of Recency values 56
Figure 2.2 Contribution of Frequency values 56
Figure 2.3 Contribution of Monetary values 57
Figure 2.4 Customer purchase behavior analysis 58
Figure 2.5 Data after min-max scaling 59
Figure 3.1 SSE curve results in the Elbow method 64
Figure 3.2 Silhouette score for n_clusters = 2: 0.683 65
Figure 3.3 Silhouette score for n_clusters = 3: 0.629 65
Figure 3.4 Silhouette score for n_clusters = 4: 0.582 66
Figure 3.5 Silhouette score for n_clusters = 6: 0.445 66
Figure 3.6 The 3D scatter plot showing the clusters 68
Figure 3.7 The 3D scatter plot showing the clusters 70
Figure 3.8 The dispersion of K-Medoids clustering 75
Figure 3.9 The centroids of K-Medoids clustering 76
Figure 3.10 Customer segmentation by RFM using K-Medoids 77
Figure 3.11 ARI comparison across different clustering methods 82
Figure 3.12 AMI comparison across different clustering methods 82
Figure 3.13 DBI comparison across different clustering methods 83
Figure 3.14 Customer segmentation by the RFM method 84
Figure 3.15 Customer segmentation by RFM using K-Medoids 85
Figure 4.1 The average revenue of each customer by traditional RFM 87
Figure 4.2 The average revenue of each customer by K-Medoids 88
Figure 4.3 Stacked bar chart of product category revenue share by region 89
Figure 4.4 Area chart of cumulative revenue by customer quantity 90
Figure 4.5 Bar chart of average number of orders per customer by region 91
Figure 4.6 Cohort analysis heatmap for top-tier customer retention rates 92
Figure 4.7 Cohort analysis heatmap for mid-tier customer retention rates 93
Figure 4.8 Cohort analysis heatmap for low-tier customer retention rates 94
Figure A.1 SOM algorithm 105
Figure A.2 Competitive learning process 107
Figure A.3 Customer segmentation by the SOM algorithm 109
Figure A.4 Average revenue of each customer 110


List of Acronyms

AMI: Average Memory Intensity
AOV: Average Order Value
APF: Average Purchase Frequency
ART: Average Response Time
CLV: Customer Lifetime Value
CRR: Customer Retention Rate
CX: Customer Experience
DBI: Davies-Bouldin Index
EDA: Exploratory Data Analysis
KPI: Key Performance Indicator
RAM: Random Access Memory
ROI: Return on Investment
SO: Sales Order
SOM: Self-Organizing Map
SQL: Structured Query Language
SSE: Sum of Squared Errors


GANTT CHART

Figure 1 Gantt Chart


Project Overview

Introduction

The topic of retaining customers, building loyalty, and reducing churn is gaining significant traction across various industries. This holds particular significance within the framework of customer lifetime value: it allows companies to gauge the true extent of losses attributed to customer churn and determine the appropriate scale of retention efforts required. In today's diverse consumer landscape, a one-size-fits-all mass marketing approach is unlikely to thrive. Instead, leveraging customer value analysis and predictive churn modeling can enable marketing initiatives to better target specific customer segments (Mutanen, 2006).

Dunford et al. (2014) asserted that 20% of the causes lead to 80% of the outcomes. In the business domain, this principle can be illustrated as 80% of the revenue being generated by 20% of the customers. Therefore, effectively segmenting customers allows businesses to identify the specific 20% of customer segments that contribute the most revenue compared to the rest. Customer segmentation is carried out to support personalized customer engagement, marketing campaigns, product development, and efficient distribution strategies. Most attributes used for segmentation relate to geographical, demographic, and behavioral characteristics of customers (Christy et al., 2018).

This research focuses on customer segmentation using the K-Means, K-Means++, and K-Medoids algorithms. The clustering results obtained from these three algorithms are then compared based on cluster characteristics and relevant metrics. In addition, this study addresses potential directions for leveraging the predictive insights to enhance churn management within the enterprise.

Objectives

In order to enhance customer lifetime value (CLV) and customer relationship management (CRM), businesses strive for effective customer service. This study aims to employ unsupervised learning algorithms for customer segmentation. Subsequently, distinct customer groups will receive improved service from the business, as opposed to applying a one-size-fits-all approach.

Currently, a variety of algorithms exist for customer segmentation. This underscores the rapid development of technology and machine learning, which can be perplexing for businesses embarking on this process. Therefore, this research also assesses the performance of the applied clustering methods in customer segmentation.

Hung (2006) points out that churn management involves addressing two key factors: identifying those who are likely to leave the company, and proposing new strategies to enhance the retention rate. The study utilizes various algorithms and data mining techniques to improve the churn management of businesses.

From the research objectives we aim to achieve, several business questions are proposed:

1. How many current customer segments exist in the business's data?
2. What is the most effective algorithm for identifying distinct customer groups?
3. What actions should be taken to enhance the churn management of the business?

Objects and Scope

The research works with customer behavioral data. This data undergoes thorough cleaning and preprocessing to ensure its suitability for clustering algorithms, encompassing steps like handling missing values, detecting and managing outliers, and normalizing the data for consistency.

Our research delves into the specifics of three clustering algorithms. K-Means is explored for its foundational principles and adaptability in segmenting customers, K-Means++ for its advanced initialization process, which potentially leads to better clustering outcomes, and K-Medoids for its distinct approach and applicability in the context of our customer data. To assess the effectiveness of the clustering results, we employ various cluster validity indices, including the Silhouette Coefficient, the Davies-Bouldin Index, and the Elbow Method, providing a comprehensive evaluation of clustering performance.

A comparative analysis is conducted to scrutinize the results obtained from K-Means, K-Means++, and K-Medoids, focusing on the distinctiveness and business relevance of the derived customer segments. This study also ventures into predictive analysis, leveraging clustering insights to forecast customer churn and formulating strategies to enhance customer retention, including personalized marketing initiatives, product development insights, and improvements in customer service. The performance of each algorithm is evaluated based on accuracy, efficiency, and business applicability, leading to informed recommendations for businesses on selecting the most suitable algorithm for customer segmentation and churn prediction.

The methodology is implemented in Python, with libraries such as Pandas for data manipulation, Scikit-Learn for algorithm implementation, and Matplotlib/Seaborn for visualization. Statistical tests are employed in the comparative analysis to ascertain the significance of the findings. This research prioritizes the reproducibility of results, maintains data integrity and confidentiality, and adheres to ethical standards in data handling and analysis.


Process

Figure 2 Research methods and procedures

Establish the RFM model: First, we explore and analyze the data. Then, based on what has been discovered, we determine the values of Recency, Frequency, and Monetary to obtain all three factors R, F, and M, thereby establishing the RFM model. Next, we divide those factors on a scale of 1 to 5. From that result, we can evaluate and analyze them.

Clustering with the K-means algorithm: The K-means algorithm is a popular clustering technique in unsupervised machine learning. It starts with the random initialization of K points as the initial cluster centers, where K is the desired number of clusters. Each data point is then assigned to the cluster with the nearest center, typically based on Euclidean distance. Next, the center of each cluster is updated by calculating the average coordinates of all the points in that cluster. This process of cluster assignment and center updating repeats until there is no further change in cluster assignments or a predefined maximum number of iterations is reached. The outcome of the algorithm is evaluated based on criteria such as the total squared distance within clusters, determining the effectiveness of the clustering. K-means is widely applied in various fields, from market segmentation to biological data analysis.
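As an illustration of this pipeline, the sketch below builds RFM values from a tiny, invented transaction table and clusters them with scikit-learn's KMeans; the column names and figures are assumptions for the example, not the AdventureWorks schema.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Hypothetical transaction-level data (invented for illustration).
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "days_since_purchase": [10, 40, 5, 20, 90, 200],
    "amount": [100.0, 50.0, 300.0, 80.0, 60.0, 20.0],
})

# Recency = days since the most recent purchase, Frequency = order count,
# Monetary = total spend per customer.
rfm = tx.groupby("customer_id").agg(
    recency=("days_since_purchase", "min"),
    frequency=("amount", "size"),
    monetary=("amount", "sum"),
)

# Scale to [0, 1] before clustering so no single factor dominates.
X = MinMaxScaler().fit_transform(rfm)

# k is fixed at 2 here only to keep the toy example small.
rfm["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
```

In the report's setting, the 1-to-5 scoring of R, F, and M would be applied to the same table of per-customer values before evaluation.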

Initialize the center points of the K-means++ algorithm: First, we randomly select the first centroid from the data points. For each data point, we then calculate its distance from the nearest previously selected centroid. Next, we select the next centroid from the data points such that the probability of selecting a point is proportional to its squared distance from the nearest previously selected centroid. Finally, this step iterates until K centroids have been sampled.
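A minimal NumPy sketch of this seeding procedure (in the standard k-means++ formulation the selection probability is proportional to the squared distance, often called D² sampling):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: first centroid uniform at random; each further
    centroid is drawn with probability proportional to its squared
    distance (D^2) from the nearest already-chosen centroid."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Toy usage: two well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = kmeans_pp_init(X, 2, np.random.default_rng(0))
```

Because far-away points receive high selection probability, the initial centers tend to spread across the data, which is the motivation for this initialization.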

Partitioning around medoids: The starting step is initialization: select k random points among the n data points as medoids. Then, link each data point to the nearest medoid using any common distance measure. Go through all the data points in each cluster and try replacing the current medoid with another data point in the same cluster. Next, calculate the total distance from all data points in the cluster to the new medoid, and choose the new medoid so that this total distance is minimal. Finally, repeat the assignment and update steps until there is no significant change in medoid selection and data assignment to clusters.
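The swap procedure just described can be sketched as follows; this is an illustrative, small-data implementation assuming Euclidean distance, not the report's code.

```python
import numpy as np

def assign(X, medoids):
    # label each point with the index of its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.argmin(axis=1)

def k_medoids(X, k, rng, max_iter=100):
    """Initialize k random medoids, then repeatedly replace each medoid
    by the cluster member that minimizes the summed in-cluster distance."""
    medoids = list(rng.choice(len(X), size=k, replace=False))
    for _ in range(max_iter):
        labels = assign(X, medoids)
        changed = False
        for i in range(k):
            members = np.flatnonzero(labels == i)
            costs = [np.linalg.norm(X[members] - X[m], axis=1).sum()
                     for m in members]
            best = members[int(np.argmin(costs))]
            if best != medoids[i]:
                medoids[i], changed = best, True
        if not changed:          # medoids stable: converged
            break
    return np.array(medoids), assign(X, medoids)
```

Unlike K-means, each cluster representative is an actual data point, which makes the method less sensitive to outliers.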

Tools and Programming Languages

Our group offers two distinct methods for customer segmentation: one that makes use of the Python programming language and another that uses Excel. Excel is utilized for its intuitive design and robust calculation functions, particularly effective in calculating RFM (Recency, Frequency, Monetary) models, catering to those who prefer a visual, spreadsheet-based analysis. In contrast, the Python method involves advanced machine learning algorithms to automatically segment customer data using the RFM model, incorporating K-means clustering for automation. This advanced method, ideal for a more automated and data-intensive analysis, displays results on an accessible dashboard. Supporting these methods, Google Colab is employed for developing and sharing Python code, essential for querying and displaying data from the AdventureWorks Sales company database.

Structure of project

The project is divided into five chapters. The first chapter, "Theoretical Background", gives background material and reviews other studies on the subject.

The second chapter, "Data Preparation", describes how we meticulously collect and clean the AdventureWorks Sales data, conducting exploratory analysis and feature engineering to optimize it for the algorithms in the next chapter. This ensures a robust foundation for effective customer segmentation in the subsequent stages of the project.

The third chapter, "Customer Segmentation with K-Means, K-Means++, and K-Medoids", explores the real-world application of those algorithms for customer segmentation. It compares the efficiency of these algorithms in improving churn management through relevant customer segmentation and offers insights into the clustering results using evaluation metrics.

The fourth chapter, "Data Visualization & Analysis", presents segmented customer clusters and their attributes using charts and dashboards in order to improve accessibility and support efficient churn management strategies.

The final chapter, "Conclusion and Future Works", summarizes the main aims initially stated in the study, offering a concise overview of the research goals and how they were achieved.


Chapter 1 THEORETICAL BACKGROUND

This chapter gives background material and reviews other studies on the subject

1.1 Customer segmentation with RFM

1.1.1 Theory

The RFM analysis is a model used for customer segmentation based on three criteria: R (Recency), F (Frequency), and M (Monetary), derived from the customer transaction history

● Recency (R): This represents the time elapsed since the customer's last transaction in relation to the analysis date based on the RFM model

If the recency is large, the likelihood of customer churn (churn rate) is higher, and the business needs more resources to retain these customers. Conversely, if the recency is short, upselling and cross-selling become easier.

In the context of e-commerce, this criterion could be the last time a customer accessed specific products, product categories, or the website itself

● Frequency (F): This measures how often a customer conducts transactions. Higher frequency indicates a higher likelihood of customer responsiveness to campaigns and new products. Transaction frequency also quantifies the customer's engagement with the business.

In e-commerce, this could refer to how often a customer accesses a particular product, product category, or performs transactions

● Monetary (M): This criterion represents the total amount of money spent by a customer on transactions

It provides insights into the customer's spending capacity and can help calculate the average spending per transaction. Consequently, businesses can tailor marketing campaigns or introduce new products to customers with appropriate average spending levels.

1.1.2 Evaluation

R, F, and M are each scored on a 5-point scale, so there are 125 different RFM value combinations.

Loyal Customers: 543, 444, 435, 355, 354, 345, 344, 335

Potential Loyalist: 553, 551, 552, 541, 542, 533, 532, 531, 452, 451, 442, 441, 431, 453, 433, 432, 423, 353, 352, 351, 342, 341, 333, 323

Can't Lose Them: 155, 154, 144, 214, 215, 115, 114, 113


The tree map chart helps readers get an overview and easily identify the customer group that occupies the largest proportion. The use of different colors or shades of color helps to quickly identify the customer groups.

Figure 1.1 The results of the RFM model presented in a tree map chart

1.1.3 Why the RFM model?

Classifying customers based on the RFM model helps businesses monitor each step of the customer journey. This model assists businesses in answering the following questions:

● Which customers bring the most benefit to the company?

● Which customers might leave the company and how can the churn rate be improved?

● Which customers are likely to spend more?

● Which customers are likely to respond positively to campaigns?

RFM analysis helps marketers build campaigns, messages, and incentive programs tailored to each analyzed customer group. Based on this, marketing activities will increase response rates and improve customer retention, customer satisfaction, and customer lifetime value (CLV).

Some benefits of the RFM model include:

● Increasing the success rate of remarketing campaigns:
- Increase revenue from new products (related to WOM)
- Help businesses create more loyal customers

● Reducing marketing operation costs:
- Segment customer files for effective email marketing
- Optimize the costs of marketing campaigns
- Improve ROI

● Improving customer lifetime value (CLV):
- Reduce the churn rate
- Make upselling and cross-selling more successful
- Increase the likelihood of customer retention

1.2 How to compare the methods?

1.2.1 Davies-Bouldin Index (DBI)

DBI is calculated using the following formula:

DB = (1/k) · Σᵢ max_{j≠i} (σᵢ + σⱼ) / d(cᵢ, cⱼ)

where k is the number of clusters, σᵢ is the average distance from the points in cluster i to its centroid cᵢ, and d(cᵢ, cⱼ) is the distance between centroids i and j. The goal when optimizing DBI is to have its value as low as possible, indicating that clusters are well separated and that data points are highly concentrated within each cluster.
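In practice the index does not need to be computed by hand: scikit-learn exposes it as davies_bouldin_score. A quick sanity check on invented, well-separated blobs:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
# Two tight, well-separated synthetic clusters (data invented here).
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])
labels = np.array([0] * 20 + [1] * 20)

dbi = davies_bouldin_score(X, labels)  # lower is better; near zero here
```

Because within-cluster spread is tiny relative to the distance between the two centroids, the score comes out close to zero, illustrating the "lower is better" reading of the index.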

1.2.2 Average response time (ART)

Average response time is simply the average amount of time that elapses between a customer contacting your business and an agent replying to them

This metric is calculated by dividing the total time taken to respond during a particular time period by the number of replies sent during that same period. Your average response time can vary across different digital support channels, but whatever the channel, a lower average response time is always better.


Processing Time: Processing time refers to the duration that a system requires to handle a specific request or task. This component of ART encompasses activities such as data processing, computations, and executing relevant algorithms.

Waiting Time: Waiting time is the period during which a request must wait in a queue before it is processed. In multi-tasking or multi-user environments, requests may arrive faster than they can be processed, leading to queuing.

Data Transmission Time: This factor considers the latency introduced by data transmission protocols, network congestion, and the physical distance between communicating devices.

Response Time Variability: The variability in response times is a measure of how consistent or erratic the system's responses are. Stable systems exhibit minimal fluctuations in ART, ensuring a predictable user experience.

1.2.3 Average Memory Intensity (AMI)

Average Memory Intensity (AMI) is a performance metric that quantifies the average amount of computer memory (typically RAM) actively utilized by processes and applications during a given timeframe.

Memory Usage: AMI primarily evaluates how much of a computer system's memory is actively used by running processes and applications. It quantifies the intensity of memory consumption during a specific time frame.

Memory Allocation: Effective memory allocation is crucial for maintaining an optimal AMI. It involves allocating and deallocating memory resources to various processes and tasks as needed.

Memory Access Patterns: AMI considers how memory is accessed and utilized by programs. It takes into account factors such as the frequency of memory accesses, whether they are sequential or random, and how efficiently the system manages cache memory.

Peak Memory Usage: In addition to average memory intensity, monitoring peak memory usage is essential. Peak memory usage represents the highest amount of memory allocated or used during a specific operation or workload.

Memory Fragmentation: Memory fragmentation can lead to inefficient memory usage. AMI may assess both external fragmentation (gaps between memory allocations) and internal fragmentation (wasted space within allocated memory blocks).

Resource Allocation: AMI is often considered in the broader context of resource allocation. It involves deciding how memory resources are allocated to different processes, threads, or virtual machines.

1.3 Customer Behavior and Customer Experience

Customer Behavior: Customer behavior refers to the actions, decisions, and interactions that customers engage in when they interact with a business, its products or services, and its marketing efforts. Understanding customer behavior is crucial for businesses, as it helps them tailor their marketing strategies, product offerings, and customer service to meet customer needs and preferences effectively.

Customer Experience: Customer experience is the overall perception and feeling a customer has throughout their entire journey with a brand or business, from initial contact to post-purchase support. A positive customer experience is crucial for building customer loyalty, positive word-of-mouth, and repeat business. Businesses that prioritize CX are more likely to retain customers and stand out in a competitive market.

The association between customer behavior and customer experience: According to Kim et al. (2013), customer behavior and customer experience are intricately connected and highly influential on each other. To enhance customer experience, businesses often use insights from customer behavior data to make informed decisions, implement improvements, and create personalized marketing strategies. This synergy between customer behavior analysis and customer experience management is key to success in today's business landscape.

In summary, customer behavior data is the foundation for understanding and optimizing the customer experience. It helps businesses personalize interactions, improve processes, and anticipate customer needs. A positive customer experience, in turn, enhances customer loyalty and advocacy and influences future customer behavior, creating a cyclical relationship that is central to the success of any business.


1.4 Customer Lifetime Value (CLV)

Customer lifetime value (CLV) is the total economic value of a customer to a business over the whole period of their relationship with the brand, from first interaction to final purchase. This value computes the total revenue from a customer by accounting for all possible transactions over the course of the customer relationship, as opposed to focusing only on the value of individual transactions.

Customer lifetime value can be viewed in two ways: historically, it shows how much each current customer has already spent with your brand; prospectively, it shows how much customers could spend with you. Both measurements are helpful for monitoring the success of a business.

CLV models differ in complexity and application. The Traditional CLV Model determines value using acquisition cost and average purchase metrics. The Historical CLV Model estimates future value based on past behavior, while the Predictive CLV Model integrates historical data with predictive analytics to produce more precise forecasts.

The Discrete-Time CLV Model breaks value down into time intervals, providing a thorough understanding of value over time. The Non-Contractual CLV Model makes adjustments for non-contractual customer relationships. Machine-learning-based CLV models employ sophisticated algorithms to provide accurate predictions, and customer segmentation groups customers for targeted analysis. Companies select models according to their data, their requirements, and the precision they need in customer value forecasts and retention tactics.

Businesses can calculate Customer Lifetime Value (CLV) through five steps:

Step 1: Calculate the AOV (Average Order Value) = Total Revenue / Number of Orders
Step 2: Calculate the APF (Average Purchase Frequency) = Total Number of Orders / Unique Number of Customers
Step 3: Calculate the Customer Value = AOV × APF
Step 4: Calculate the Average Customer Lifespan = Total Number of Years / Total Number of Customers
Step 5: Calculate CLV = Customer Value × Average Customer Lifespan
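Tracing the five steps with made-up figures (all numbers below are purely illustrative):

```python
# Hypothetical aggregates for one year of sales (invented numbers).
total_revenue = 500_000.0
total_orders = 2_000
unique_customers = 400
total_customer_years = 1_200   # summed tenure across all customers

aov = total_revenue / total_orders             # Step 1: Average Order Value
apf = total_orders / unique_customers          # Step 2: Average Purchase Frequency
customer_value = aov * apf                     # Step 3: revenue per customer per period
lifespan = total_customer_years / unique_customers   # Step 4: average lifespan in years
clv = customer_value * lifespan                # Step 5: Customer Lifetime Value
```

With these figures, AOV = 250, APF = 5, customer value = 1,250, average lifespan = 3 years, and CLV = 3,750 per customer.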


1.6 Churn Rate

Churn rate is a vital metric that reflects customer attrition and has a substantial impact on a company's success, profitability, and customer satisfaction. Churn rate, also known as customer attrition rate or customer turnover rate, is a key performance indicator (KPI) that measures the rate at which customers or subscribers discontinue their relationship with a company or product. Typically expressed as a percentage, it is calculated by dividing the number of customers lost during a specific time period by the total number of customers at the beginning of that period.

To calculate the churn rate, choose a specific time period, divide the number of subscribers lost during that period by the total number of subscribers at the beginning of the period, and multiply by 100 to express the result as a percentage.
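The period-based definition translates directly to code; the customer counts in the example are invented:

```python
def churn_rate(customers_at_start, customers_lost):
    """Churn rate over a period: customers lost during the period divided
    by the customer count at the start, expressed as a percentage."""
    return customers_lost / customers_at_start * 100

# e.g. 1,000 customers at the start of a quarter, 50 of whom left
rate = churn_rate(1000, 50)   # 5.0 %
```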

Churn rate holds immense significance for businesses due to its direct impact on various aspects. Firstly, it affects revenue, as high churn rates can erode a company's revenue stream when customers and the associated recurring revenue are lost. Secondly, it directly influences Customer Lifetime Value (CLV), the total revenue a customer is expected to generate over their lifetime as a customer; higher churn rates lead to lower CLV. Additionally, a low churn rate can provide a competitive advantage, given that it is usually more cost-effective to retain existing customers than to acquire new ones. Lastly, high churn rates may signal underlying issues with customer satisfaction, product quality, or service delivery, necessitating attention.

1.7 Retention

Retention here means customer retention: a set of strategies that a business uses to increase the number of previous customers returning to make purchases. The goal of customer retention activities is to help the business retain as many customers as possible. In addition, it reduces the number of customers abandoning its products to switch to a competitor's brand. Thanks to this, the brand builds customer loyalty and optimizes revenue per customer. Customer retention is an important business strategy with many benefits: compared to bringing on new clients, it lowers expenses and raises order closure rates.

The higher the Customer Retention Rate (CRR), the better your customer retention strategy. To calculate this ratio, apply the following formula:

CRR = (CE − CN) / CS × 100%

where CE is the number of customers at the end of a certain period, CN the number of new customers acquired in that period, and CS the number of customers at the beginning of the period.
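As a small sketch of the formula (the customer counts are invented):

```python
def customer_retention_rate(ce, cn, cs):
    """CRR = (CE - CN) / CS * 100, where CE is the customer count at the
    end of the period, CN the new customers acquired during it, and CS
    the count at the start."""
    return (ce - cn) / cs * 100

# 110 customers at the end, 20 of them newly acquired, 100 at the start
rate = customer_retention_rate(110, 20, 100)
```

Subtracting the new customers first ensures the rate measures only how many of the original customers were kept.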

1.8 K-Means

1.8.1 Definition

K-means is a simple clustering algorithm that belongs to the unsupervised learning (unlabeled data) category and is a commonly used method for solving clustering problems and analyzing the cluster attributes of data. The idea of the K-means clustering algorithm is to divide the dataset into k different clusters, where k is a pre-given number of clusters. Each cluster is characterized by a centroid, the representative point of the cluster, located at the middle of the observations in that cluster. We rely on the distance from each observation to the centers to assign each observation the label of its nearest center. Initially, the algorithm randomly initializes the centroids. Then, the process determines the label for each data point and continues to update the cluster centers. The algorithm stops once every data point is assigned to its correct cluster and the centers are no longer updated.

This clustering is based on the principle that data points in the same cluster must share certain attributes; that is, there must be a relationship between points within the same cluster. To a computer, points in a cluster are data points that are close to each other. Using this algorithm, companies can determine which group the data actually belongs to.

The K-means clustering algorithm is often used in applications such as search engines, customer segmentation, and data statistics, and it is especially common in data mining and statistics.

Figure 1.2 An example of K-means clustering algorithm results

1.8.2 The steps of the K-means clustering algorithm


Figure 1.3 Flowchart of K-means clustering

Step 1: Choose the number of clusters k, i.e., determine how many clusters the data set should be divided into.

Step 2: Next, identify k data points (also known as observations, records, or objects) to be the "central" points representing each cluster. The central points of each cluster will change in the following steps.

Step 3: For each record or object in the data set, find the closest central point using the Euclidean distance formula. Each observation belongs to the cluster whose center it is closest to.

The Euclidean distance formula is given by:

d = √((x₁ − x₂)² + (y₁ − y₂)²)

Where:

d is the Euclidean distance
(x₁, y₁) is the coordinate of the first point
(x₂, y₂) is the coordinate of the second point

Step 4: Recalculate the central points, i.e., determine the representative values of the central points based on the data points in each cluster. With K-means clustering, the value of each center point equals the mean of all observations in its cluster.

Step 5: Repeat step 3, identifying which new center each observation is closest to; that observation then belongs to that center's cluster. The clusters may now change, and observations belonging to a previous cluster may move to another cluster. We then proceed as in step 4, recalculating the new center points for the changed clusters, and repeat step 3 if the new center points of the clusters have shifted on the graph.

Step 6: Repeat steps 3, 4, and 5 until the representative values in the clusters no longer change, meaning the representative points stop moving as observations change clusters, or until no observation can be moved from one cluster to another.
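The steps above can be sketched in a minimal pure-Python implementation; the toy 2-D points and the fixed starting centroids below are hypothetical, chosen only for illustration:

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic K-means following steps 3 to 6 above."""
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid.
        # math.dist is the Euclidean distance sqrt((x1-x2)^2 + (y1-y2)^2).
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Steps 5 and 6: repeat until the centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups; the starting centroids (0, 0) and (10, 10) are arbitrary.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
# Each group of three points ends up in its own cluster.
```

Here the stopping condition compares the recomputed centroids with the previous ones, matching step 6.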

1.8.3 How to find the optimal K
1.8.3.1 Elbow method

In the K-Means algorithm, we need to determine the number of clusters in advance The Elbow Method is a way to help us choose the appropriate number of clusters based on a visual graph by observing the decline of the distortion function and selecting the elbow point

This is a graphical method for assessing data variance at different numbers of clusters. Specifically, we calculate the value of the distortion function, the total sum of squared distances between data points and their cluster centers in K-means clustering. The idea of the elbow method is to run K-means clustering on the dataset for a range of values of k (say, k from 1 to 10), and for each value of k calculate the sum of squared errors (SSE):

SSE = Σᵢ₌₁ᵏ Σₓ∈Cᵢ ‖x − mᵢ‖²

Where:

x: data points in cluster i
mᵢ: mean value (centroid) of cluster i
k: number of clusters

Steps to use the Elbow method to find the optimal k:

Step 1: Start by building K-means models with a range of different cluster numbers (e.g., from 2 to 10 clusters).

Step 2: Calculate the distortion value for each cluster number, using the distortion function to compute the total sum of squared distances between data points and cluster centers.

Step 3: Plot an Elbow graph: a chart displaying the number of clusters on the x-axis and the distortion value on the y-axis.

Figure 1.4 A chart displaying the number of clusters k

Step 4: Identify the elbow point by observing the graph and finding the point where the distortion value levels off significantly. This is known as the elbow point, where the rate of decrease in distortion changes most sharply. In other words, beyond this point, increasing the number of clusters does not significantly reduce the distortion.

Step 5: Determine the optimal number of clusters, which is typically the value at or just before the elbow point. It represents the balance between effective clustering and minimizing distortion.
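A minimal sketch of the elbow procedure, assuming a basic K-means that (for determinism) seeds from the first k points; the toy data is hypothetical:

```python
import math

def kmeans_sse(points, k, iters=50):
    """Run a basic K-means (seeded with the first k points) and return the
    within-cluster sum of squared errors, i.e. the distortion."""
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    # SSE: total squared distance from each point to its cluster center.
    return sum(
        math.dist(p, centroids[i]) ** 2
        for i, c in enumerate(clusters)
        for p in c
    )

points = [(1, 1), (1.2, 0.8), (2, 2), (8, 8), (8.3, 7.9), (9, 9)]
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 5)}
# The SSE drops sharply from k=1 to k=2 (the true number of groups),
# then flattens out: the "elbow" suggests k = 2.
```

Plotting `sse_by_k` with k on the x-axis reproduces the elbow chart described in steps 3 and 4.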

1.8.3.2 Silhouette index

The Silhouette index is one of the most common and widely used indices to evaluate the effectiveness of the clustering process. Its main goal is to assess how optimally an observation or data point is assigned to a specific cluster. Essentially, the Silhouette method helps determine whether a data point sits comfortably within a cluster (indicating a good fit) or hovers near the periphery of the cluster (indicating a poor fit), for the purpose of assessing the overall validity of clustering methods. The Silhouette index quantifies the distance between data points within a cluster and the centroid of that cluster, and also measures the distance from each point to the nearest centroid among the other clusters, always choosing the shortest such distance. This method proves valuable when examining results obtained with the K-means clustering algorithm. However, when clusters are not produced by centroid-based techniques, the silhouette metric is adjusted: instead of measuring the distance from a point to its centroid, it calculates the average distance from that point to all other points within the same cluster, and the average distance to the points of each different cluster. The goal is still to find the shortest average distance to evaluate how well the data points fit their clusters.

s(i) = (b(i) − a(i)) / max(a(i), b(i))

Where:

x(i) = data point in the cluster, i = 1, 2, 3, …, n
a(i) = average distance between x(i) and every other data point in the same cluster
b(i) = minimum average distance between x(i) and the data points in each of the other clusters

The value of the silhouette score varies between -1 and 1. If the score is close to 1, the cluster is dense and well separated from other clusters. Values close to 0 indicate overlapping clusters whose samples are very close to the decision boundaries of neighboring clusters. Negative values in [-1, 0) indicate that the sample may have been assigned to the wrong cluster.
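The silhouette computation can be sketched directly from the definitions of a(i) and b(i); the toy points and labels below are hypothetical, and every cluster is assumed to contain at least two points:

```python
import math

def silhouette_score(points, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points.
    Assumes every cluster contains at least two points."""
    scores = []
    for i, p in enumerate(points):
        same = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        # a(i): average distance to the other points of its own cluster.
        a = sum(math.dist(p, q) for q in same) / len(same)
        # b(i): smallest average distance to the points of any other cluster.
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in set(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(1, 1), (1.5, 1), (8, 8), (8.5, 8)]
good = silhouette_score(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters mixed together
# good is close to 1, while bad is negative.
```

A high score for the first labeling and a negative score for the second illustrates how the index separates good assignments from wrong ones.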

Figure 1.5 Some graphical illustrations of the silhouette score

1.9 K-Means++
1.9.1 Definition

K-means++ represents a significant improvement over the traditional K-means clustering algorithm, designed to optimize the selection of initial centroids for clusters. While K-means typically selects cluster centers randomly, leading to issues such as unstable clustering results and convergence to local minima, K-means++ addresses this problem with a smarter approach.

The initialization process of K-means++ begins by randomly selecting a data point as the first cluster centroid. Then, the distance from each data point to its nearest cluster centroid is computed. Based on these distances, the algorithm selects the next cluster centroid from the data points with a probability proportional to the square of each point's distance to its nearest existing centroid. This process continues until all cluster centroids are determined.

The advantages of K-means++ include minimizing the risk of ending up in local minima and significantly improving the quality of the generated clusters This not only increases the accuracy of clustering but also makes clustering results more stable across different runs Thanks to this intelligent initialization method, K-means++ has become an important choice in many modern data analysis applications

1.9.2 Impact of K-means++ on customer segmentation:

K-means++ is a powerful tool for customer segmentation with significant effects on businesses Firstly, it efficiently creates customer clusters By using this algorithm, businesses can identify customer groups based on common factors such as product preferences, shopping behavior, or demographic characteristics This helps gain deeper insights into the target audience and how they interact with your products or services Furthermore, K-means++ optimizes segmentation results By intelligently initializing centroids, this algorithm eliminates the dependence on random initial point selection, avoiding suboptimal segmentation results This saves time and effort needed for fine-tuning and ensuring the accuracy of cluster results

1.9.3 Process:

The K-means++ algorithm follows a specific procedure to create initial centroids and meaningfully partition data into clusters. This process begins by initializing the centroids: a random data point is chosen from the dataset, and the distances from the other points to this centroid are calculated. Next, the algorithm iteratively repeats the centroid selection, choosing each new centroid at random with probability proportional to the squared distance from each point to its nearest current centroid, so that distant points are favored.

After the centroids are chosen, the process of dividing the data into clusters is performed using the traditional K-means method, assigning each data point to the nearest centroid Finally, centroids are updated by recalculating the average of the points within each cluster This process iterates until there is no change in cluster assignments or reaches a predetermined maximum number of iterations


The result is that the K-means++ algorithm creates k clusters and corresponding centroids, efficiently segmenting the data and optimizing this process, making data segmentation meaningful and useful in various applications.
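The K-means++ seeding described above can be sketched as follows; the toy data and the fixed random seed are hypothetical:

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: the first centroid is chosen uniformly at random;
    each further centroid is drawn with probability proportional to the
    squared distance from each point to its nearest already-chosen centroid."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its closest current centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Three well-separated groups; the fixed seed just makes the run repeatable.
points = [(0, 0), (0.2, 0.1), (10, 10), (10.1, 9.9), (20, 0), (19.8, 0.3)]
seeds = kmeans_pp_init(points, k=3, rng=random.Random(0))
# Already-chosen points have zero weight, so the three seeds are distinct,
# and with high probability each comes from a different group.
```

These seeds would then be handed to the ordinary K-means loop in place of random initialization.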

1.10 K-Medoids
1.10.1 Definition

K-Medoids is a type of partitioning algorithm used in cluster analysis. Unlike K-Means, which minimizes the sum of squared distances between data points and the mean of their respective cluster, K-Medoids minimizes the sum of dissimilarities between points and the most centrally located point in the cluster (the medoid). This difference makes K-Medoids more robust to outliers than K-Means.


Figure 1.8 Comparison between K-Means and K-Medoids

1.10.2 Process

Step 1: Initialization: Select k initial medoids randomly from the dataset. These medoids act as the initial centers of the clusters. The number k is the number of clusters you want to find.

Step 2: Assignment: Assign each data point to the closest medoid. Closeness is determined by a distance metric, such as Euclidean or Manhattan distance. Each point is assigned to the cluster of the medoid to which it has the smallest distance.

Step 3: Update: For each cluster, select a new medoid by examining all the points in the cluster and choosing the one that minimizes the total distance to all other points in that cluster. This step is crucial as it refines the cluster quality.

Step 4: Iteration: Repeat the Assignment and Update steps until there is no change in the medoids, or until a predefined number of iterations is reached. This iterative process ensures that the clusters are optimized around the chosen medoids.

Step 5: Convergence: The algorithm converges when the medoids do not change between successive iterations, or the change falls below a certain threshold. At this point, the clusters are considered stable and the algorithm terminates.

Step 6: Output: The final output is a set of k clusters with their respective medoids. Each data point in the dataset is assigned to one of these clusters.
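Steps 1 to 6 can be sketched in a few lines of Python; the toy data, including a deliberate outlier at (100, 100), is hypothetical:

```python
import math

def k_medoids(points, medoids, max_iter=20):
    """Basic K-Medoids (steps 2 to 5): assign points to the nearest medoid,
    then make each cluster's new medoid the member that minimizes the total
    distance to the rest of its cluster. Assumes no cluster becomes empty."""
    for _ in range(max_iter):
        # Step 2: assignment to the nearest medoid.
        clusters = [[] for _ in medoids]
        for p in points:
            i = min(range(len(medoids)), key=lambda m: math.dist(p, medoids[m]))
            clusters[i].append(p)
        # Step 3: update each medoid from its cluster's members.
        new_medoids = [
            min(c, key=lambda cand: sum(math.dist(cand, q) for q in c))
            for c in clusters
        ]
        # Steps 4 and 5: iterate until the medoids stop changing.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

# The extreme outlier (100, 100) barely moves the medoid, because a medoid
# must be an actual data point, unlike a mean-based centroid.
points = [(1, 1), (1, 2), (2, 1), (100, 100), (8, 8), (9, 8), (8, 9)]
medoids, clusters = k_medoids(points, medoids=[(1, 1), (8, 8)])
```

Note how the second cluster's medoid stays among the genuine points near (8, 8) even though the outlier belongs to that cluster.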


1.10.3 Advantages of K-Medoids

Robustness to outliers: Due to the use of medoids, K-Medoids is more resistant to noise and outliers, as these do not skew the cluster center as much as in mean-based clustering.

Flexibility with distance measures: It can be used with various types of distance calculations, making it suitable for different types of data, including non-numeric data.

Interpretability: The use of actual data points as cluster centers makes the interpretation of clusters more intuitive and meaningful in certain applications.

1.10.4 Limitations of K-Medoids

Computational complexity: K-Medoids is more computationally intensive than K-Means, especially for large datasets, as it involves exhaustive searches for medoids.

Sensitivity to initial selection: The initial selection of medoids can affect the final clustering result, though less severely than in K-Means.

Determining the number of clusters: Similar to K-Means, deciding the appropriate number of clusters (k) can be challenging and often requires additional methods, like the Elbow method or Silhouette analysis

1.10.5 Applications

K-Medoids is well-suited for applications where the integrity of the clusters is paramount, and the data contains noise and outliers In marketing analytics, K-Medoids can be used to segment customers based on purchasing behavior, demographics, or other relevant attributes Since K-Medoids use actual data points as the centers of clusters, the resulting segments are typically more representative and interpretable for targeted marketing strategies

1.10.6 Conclusion

K-Medoids is a robust and versatile clustering algorithm, ideal for scenarios where outlier sensitivity is a concern. Its ability to produce interpretable and meaningful clusters makes it a valuable tool in the realm of data analytics, despite its computational demands.


1.11 Cohort Analysis

Cohort analysis is a marketing analytics technique that focuses on analyzing the behavior of a group of users/customers with a common characteristic over a specific period of time This can be used to gain insights into the customer experience and identify opportunities for improvement

Cohort analysis is becoming increasingly important because it helps marketers overcome the limitations of average metrics, giving them clearer insights and, from there, more accurate decisions. If an average report tells us that per capita income in Vietnam increased in 2019 compared to 2018, the cohort analysis technique gives us a clearer view of the level of increase in each region and province. By comparing metrics across different cohorts in the same analysis, we can identify areas whose changes differ significantly (not increasing, or even decreasing) from the overall upward trend in the country (according to adbrix).

Figure 1.11 Cohort Retention Rate Table
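A retention table like the one in the figure can be built from raw purchase events; the sketch below uses hypothetical (customer, month) records:

```python
from collections import defaultdict

# Hypothetical (customer_id, purchase_month) events.
events = [
    ("A", 0), ("B", 0), ("C", 0),  # three customers first buy in month 0
    ("A", 1), ("B", 1),            # two of them come back in month 1
    ("A", 2),                      # one comes back in month 2
    ("D", 1), ("D", 2),            # D's cohort is month 1
]

# Each customer's cohort is the month of their first purchase.
first_month = {}
for cust, month in sorted(events, key=lambda e: e[1]):
    first_month.setdefault(cust, month)

# Which customers of each cohort are active `offset` months after joining.
active = defaultdict(set)
for cust, month in events:
    active[(first_month[cust], month - first_month[cust])].add(cust)

cohort_size = defaultdict(int)
for cust in first_month:
    cohort_size[first_month[cust]] += 1

# retention[(cohort, offset)] = share of the cohort still active.
retention = {
    key: len(members) / cohort_size[key[0]]
    for key, members in active.items()
}
# Month-0 cohort: 100% in month 0, 2/3 after one month, 1/3 after two.
```

Each row of the table in the figure corresponds to one cohort, and each column to an offset in months.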

Cohort analytics is widely used in the following verticals:

- E-commerce
- Cloud software
- Digital marketing
- Online gaming

In all of these industries, cohort analysis is commonly used to determine why customers are leaving and what can be done to prevent them from leaving. That brings us to the Customer Retention Rate (abbreviated CRR), calculated by this formula:

CRR = ((E − N) / S) × 100

Where:

- E – Number of customers at the end of the period.
- N – Number of new customers acquired during that period.
- S – Number of customers at the beginning of the period.
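The formula translates directly into code; the figures below are hypothetical:

```python
def customer_retention_rate(e, n, s):
    """CRR = ((E - N) / S) * 100"""
    return (e - n) / s * 100

# Hypothetical figures: 1000 customers at the start of the period,
# 950 at the end, of which 150 were newly acquired during the period.
crr = customer_retention_rate(e=950, n=150, s=1000)
print(crr)  # 80.0
```

Subtracting N first ensures that newly acquired customers do not inflate the retention of the starting base.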


Chapter 2 DATA PREPARATION

In the second chapter, we meticulously collect and clean the AdventureWorks Sales data, conducting exploratory analysis and feature engineering to optimize it for the algorithms in the next chapter. This ensures a robust foundation for effective customer segmentation in the subsequent stages of the project.

2.1 Exploratory data analysis
2.1.1 Data Description
2.1.1.1 About the Dataset: AdventureWorks Sales

AdventureWorks Sales is a sample database developed by Microsoft to help users learn how to work with SQL Server and related technologies. It is often used in examples and tutorials in Microsoft's documentation for SQL Server and related products. The AdventureWorks sample comes in a series of different database versions developed by Microsoft to showcase the features and functionality of SQL Server and other products, including AdventureWorks, AdventureWorksLT, and AdventureWorksDW (Data Warehouse). Each version focuses on different aspects of the data, such as product, customer, and sales information. The sample has gone through many releases, starting with SQL Server 2005 and continuing through SQL Server 2019.

2.1.1.2 Data description

AdventureWorks Sales contains the following seven sheets:

● Sales_Order data:
