
VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

UNIVERSITY OF ECONOMICS AND LAW

FINAL PROJECT REPORT: FUNDAMENTAL DATA ANALYSIS

TOPIC: IMPROVING CHURN MANAGEMENT BASED ON CUSTOMER SEGMENTATION USING K-MEANS, K-MEANS++ AND K-MEDOIDS

Lecturers: 1. Assoc. Prof. Ho Trung Thanh, Ph.D.; 2. Le Thi Kim Hien, Ph.D.; 3. Nguyen Phat Dat, M.A.

Group 07: 1. Le Nguyen Minh Tai; 2. Le Trong Tan Dung; 3. La Nam Khanh; 4. Chung Bao Phu; 5. Le Hoang Giang

Ho Chi Minh City, December 21, 2023


Members of Group 7

2. Nguyễn Trọng Tấn Dũng (K224161808)


Acknowledgements

First and foremost, Group 07 would like to express its deepest gratitude for all the encouragement and support received throughout the duration of this project. In addition, we extend our sincerest thanks for the insightful lectures, devoted counsel, and enthusiastic guidance of Mr. Ho Trung Thanh, our study supervisor. His substantial investment of time and effort in guiding and instructing our class, including the oversight of our smaller assignments and culminating in the final project, has been invaluable.

Furthermore, we would like to express our appreciation to Mr. Nguyen Phat Dat, who represented Mr. Thanh in directly supporting our class. Without his passionate involvement and contributions, our performance and the successful completion of our project would not have been possible.

Last but certainly not least, our heartfelt thanks go to our families, who have consistently provided encouragement, support, and assistance, enabling us to fulfill our project objectives.

Group 07


Commitment

We solemnly declare that the contents of this report are the product of our own scholarly research, and we affirm that the findings presented herein have not been submitted for any other peer-reviewed report.

Ho Chi Minh City, December 2023
Group 07


Chapter 1 THEORETICAL BACKGROUND 17

1.1 Customer segmentation with RFM 17

1.2 How to compare other methods? 20

1.3 Customer Behavior and Customer Experience 22

1.4 Customer Lifetime Value (CLV) 23

Chapter 2 DATA PREPARATION 39

2.1 Exploratory data analysis 39

2.1.1 Data Description 39

2.1.2 Sales_data EDA steps 49


2.3.2 Customer Segmentation with RFM 61

Chapter 3 CUSTOMER SEGMENTATION WITH K-MEANS, K-MEANS++ AND K-MEDOIDS 63

3.1 Find K 63

3.1.1 Selecting the optimal number of clusters by Elbow 63

3.1.2 Measure the optimal number of clusters by Silhouette index 64

3.2 K-Means 66

3.3 K-Means++ 69

3.4 K-Medoids 74

3.5 Compare methods 80

3.6 Compare the best one with traditional 84

Chapter 4 DATA VISUALIZATION & ANALYSIS 87

4.1 Revenue by customer segments 87

List of Tables

Table 2.2 Sales Territory_data sheet's variables description 42
Table 2.3 Sales_data sheet's variables description 44
Table 2.4 Reseller_data sheet's variables description 45
Table 2.5 Date_data sheet's variables description 46
Table 2.6 Product_data sheet's variables description 47
Table 2.7 Customer_data sheet's variables description 48
Table 2.8 The list of customer segments 60
Table 2.9 Customer segmentation by the RFM method 61
Table 3.1 The comparison of evaluation methods among algorithms 80

List of Figures

Figure 1 Gantt Chart
Figure 2 Research methods and procedures 14
Figure 1.1 The results of the RFM model presented in a tree map chart 19
Figure 1.2 An example of K-means clustering algorithm results 26
Figure 1.3 Flowchart of K-means clustering 27
Figure 1.4 A chart displaying the number of clusters k 29
Figure 1.5 Some graphical illustrations of silhouette score 31
Figure 1.6 Input data points for K-means++ 33
Figure 1.7 Output data points for K-means++ 33
Figure 1.8 Comparison between K-Means and K-Medoids 34
Figure 1.9 Randomly generated data 35
Figure 1.10 Cluster assignments and medoids 35
Figure 1.11 Cohort retention rate table 37
Figure 2.1 Contribution of Recency values 56
Figure 2.2 Contribution of Frequency values 56
Figure 2.3 Contribution of Monetary values 57
Figure 2.4 Customer purchase behavior analysis 58
Figure 2.5 Data after min-max scaling 59
Figure 3.1 SSE curve results in the Elbow method 64
Figure 3.2 Silhouette score for n_clusters = 2: 0.683 65
Figure 3.3 Silhouette score for n_clusters = 3: 0.629 65
Figure 3.4 Silhouette score for n_clusters = 4: 0.582 66
Figure 3.5 Silhouette score for n_clusters = 6: 0.445 66
Figure 3.6 The 3D scatter plot showing the clusters 68
Figure 3.7 The 3D scatter plot showing the clusters 70
Figure 3.8 The dispersion of K-Medoids clustering 75
Figure 3.9 The centroids of K-Medoids clustering 76
Figure 3.10 Customer segmentation by RFM using K-Medoids 77
Figure 3.11 ARI comparison across different clustering methods 82
Figure 3.12 AMI comparison across different clustering methods 82
Figure 3.13 DBI comparison across different clustering methods 83
Figure 3.14 Customer segmentation by the RFM method 84
Figure 3.15 Customer segmentation by RFM using K-Medoids 85
Figure 4.1 The average revenue of each customer by traditional RFM 87
Figure 4.2 The average revenue of each customer by K-Medoids 88
Figure 4.3 Stacked bar chart of product category revenue share by region 89
Figure 4.4 Area chart of cumulative revenue by customer quantity 90
Figure 4.5 Bar chart of average number of orders per customer by region 91
Figure 4.6 Cohort analysis heatmap for top-tier customer retention rates 92
Figure 4.7 Cohort analysis heatmap for mid-tier customer retention rates 93
Figure 4.8 Cohort analysis heatmap for low-tier customer retention rates 94
Figure A.1 SOM algorithm 105
Figure A.2 Competitive learning process 107
Figure A.3 Customer segmentation by the SOM algorithm 109
Figure A.4 Average revenue of each customer 110


List of Acronyms

AMI: Average Memory Intensity
AOV: Average Order Value
APF: Average Purchase Frequency
ART: Average Response Time
CLV: Customer Lifetime Value
CRR: Customer Retention Rate
CX: Customer Experience
DBI: Davies-Bouldin Index
EDA: Exploratory Data Analysis
KPI: Key Performance Indicator
RAM: Random Access Memory
ROI: Return on Investment
SO: Sales Order
SOM: Self-Organizing Map
SQL: Structured Query Language
SSE: Sum of Squared Errors


GANTT CHART

Figure 1 Gantt Chart


Project Overview

Introduction

The topic of retaining customers, building loyalty, and reducing churn is gaining significant traction across various industries. This holds particular significance within the framework of customer lifetime value: it allows companies to gauge the true extent of losses attributed to customer churn and determine the appropriate scale of retention efforts required. In today's diverse consumer landscape, a one-size-fits-all mass marketing approach is unlikely to thrive. Instead, leveraging customer value analysis and predictive churn modeling can enable marketing initiatives to better target specific customer segments (Mutanen, 2006).

Dunford et al. (2014) asserted that 20% of the causes lead to 80% of the outcomes. In the business domain, this principle can be illustrated as 80% of the revenue being generated by 20% of the customers. Therefore, effectively segmenting customers allows businesses to identify the specific 20% of customer segments that contribute the most revenue compared to the rest. Customer segmentation is carried out to support personalized customer engagement, marketing campaigns, product development, and efficient distribution strategies. Most attributes used for segmentation relate to geographical, demographic, and behavioral characteristics of customers (Christy et al., 2018).

This research focuses on customer segmentation using the K-Means, K-Means++, and K-Medoids algorithms. The clustering results obtained from these three algorithms are then compared based on cluster characteristics and relevant metrics. In addition, this study addresses potential directions for leveraging the predictive insights to enhance churn management within the enterprise.

Objectives

In order to enhance customer lifetime value (CLV) and customer relationship management (CRM), businesses strive for effective customer service. This study aims to employ unsupervised learning algorithms for customer segmentation. Subsequently, distinct customer groups will receive improved service from the business, as opposed to applying a one-size-fits-all approach.

Currently, a variety of algorithms exist for customer segmentation. This underscores the rapid development of technology and machine learning, which can be perplexing for businesses embarking on this process. Therefore, this research also assesses the performance of the applied clustering methods in customer segmentation.

Hung (2006) points out that churn management involves addressing two key factors: identifying those who are likely to leave the company, and proposing new strategies to enhance the retention rate. The study utilizes various algorithms and data mining techniques to improve the churn management of businesses.

From the research objectives we aim to achieve, several business questions are proposed:

1. How many current customer segments exist in the business's data?
2. What is the most effective algorithm for identifying distinct customer groups?
3. What actions should be taken to enhance the churn management of the business?

Objects and Scope

The research works with customer behavioral data. This data undergoes thorough cleaning and preprocessing to ensure its suitability for clustering algorithms, encompassing steps like handling missing values, detecting and managing outliers, and normalizing the data for consistency.

Our research delves into the specifics of three clustering algorithms. K-Means is explored for its foundational principles and adaptability in segmenting customers, K-Means++ for its advanced initialization process, which potentially leads to better clustering outcomes, and K-Medoids for its distinct approach and applicability in the context of our customer data. To assess the effectiveness of the clustering results, we employ various cluster validity indices, including the Silhouette Coefficient, the Davies-Bouldin Index, and the Elbow Method, providing a comprehensive evaluation of clustering performance.

A comparative analysis is conducted to scrutinize the results obtained from K-Means, K-Means++, and K-Medoids, focusing on the distinctiveness and business relevance of the derived customer segments. This study also ventures into predictive analysis, leveraging clustering insights to forecast customer churn and formulating strategies to enhance customer retention, including personalized marketing initiatives, product development insights, and improvements in customer service. The performance of each algorithm is evaluated based on accuracy, efficiency, and business applicability, leading to informed recommendations for businesses on selecting the most suitable algorithm for customer segmentation and churn prediction.

The methodology is implemented in Python, with libraries such as Pandas for data manipulation, Scikit-Learn for algorithm implementation, and Matplotlib/Seaborn for visualization. Statistical tests are employed in the comparative analysis to ascertain the significance of the findings. This research prioritizes the reproducibility of results, maintains data integrity and confidentiality, and adheres to ethical standards in data handling and analysis.


Process

Figure 2 Research methods and procedures

Establish the RFM model: First, we explore and analyze the data. Then, based on what has been discovered, we determine the values of Recency, Frequency, and Monetary to obtain all three factors R, F, and M, thereby establishing the RFM model. Next, we divide those factors on a scale of 1 to 5. From that result, we can evaluate and analyze them.

Clustering with the K-means algorithm: The K-means algorithm is a popular clustering technique in unsupervised machine learning. It starts with the random initialization of K points as the initial cluster centers, where K is the desired number of clusters. Each data point is then assigned to the cluster with the nearest center, typically based on Euclidean distance. Next, the center of each cluster is updated by calculating the average coordinates of all the points in that cluster. This process of cluster assignment and center updating repeats until there is no further change in cluster assignments or a predefined maximum number of iterations is reached. The outcome of the algorithm is evaluated based on criteria such as the total squared distance within clusters, determining the effectiveness of the clustering. K-means is widely applied in various fields, from market segmentation to biological data analysis.
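As an illustration of this pipeline, the sketch below builds RFM values from a tiny, invented transaction table and clusters them with scikit-learn's KMeans; the column names and figures are assumptions for the example, not the AdventureWorks schema.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Hypothetical transaction-level data (invented for illustration).
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "days_since_purchase": [10, 40, 5, 20, 90, 200],
    "amount": [100.0, 50.0, 300.0, 80.0, 60.0, 20.0],
})

# Recency = days since the most recent purchase, Frequency = order count,
# Monetary = total spend per customer.
rfm = tx.groupby("customer_id").agg(
    recency=("days_since_purchase", "min"),
    frequency=("amount", "size"),
    monetary=("amount", "sum"),
)

# Scale to [0, 1] before clustering so no single factor dominates.
X = MinMaxScaler().fit_transform(rfm)

# k is fixed at 2 here only to keep the toy example small.
rfm["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
```

In the report's setting, the 1-to-5 scoring of R, F, and M would be applied to the same table of per-customer values before evaluation.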

Initialize the center points of the K-means++ algorithm: First, we randomly select the first centroid from the data points. For each data point, we then calculate its distance from the nearest previously selected centroid. Next, we select the next centroid from the data points such that the probability of selecting a point is proportional to its squared distance from the nearest previously selected centroid. Finally, this step iterates until K centroids have been sampled.
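A minimal NumPy sketch of this seeding procedure (in the standard k-means++ formulation the selection probability is proportional to the squared distance, often called D² sampling):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: first centroid uniform at random; each further
    centroid is drawn with probability proportional to its squared
    distance (D^2) from the nearest already-chosen centroid."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Toy usage: two well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = kmeans_pp_init(X, 2, np.random.default_rng(0))
```

Because far-away points receive high selection probability, the initial centers tend to spread across the data, which is the motivation for this initialization.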

Partitioning around medoids: The starting step is initialization: select k random points among the n data points as medoids. Then, link each data point to the nearest medoid using any common distance measure. Go through all the data points in each cluster and try replacing the current medoid with another data point in the same cluster. Next, calculate the total distance from all data points in the cluster to the new medoid, and choose the new medoid so that this total distance is minimal. Finally, repeat the assignment and update steps until there is no significant change in medoid selection and data assignment to clusters.
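The swap procedure just described can be sketched as follows; this is an illustrative, small-data implementation assuming Euclidean distance, not the report's code.

```python
import numpy as np

def assign(X, medoids):
    # label each point with the index of its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.argmin(axis=1)

def k_medoids(X, k, rng, max_iter=100):
    """Initialize k random medoids, then repeatedly replace each medoid
    by the cluster member that minimizes the summed in-cluster distance."""
    medoids = list(rng.choice(len(X), size=k, replace=False))
    for _ in range(max_iter):
        labels = assign(X, medoids)
        changed = False
        for i in range(k):
            members = np.flatnonzero(labels == i)
            costs = [np.linalg.norm(X[members] - X[m], axis=1).sum()
                     for m in members]
            best = members[int(np.argmin(costs))]
            if best != medoids[i]:
                medoids[i], changed = best, True
        if not changed:          # medoids stable: converged
            break
    return np.array(medoids), assign(X, medoids)
```

Unlike K-means, each cluster representative is an actual data point, which makes the method less sensitive to outliers.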

Tools and Programming Languages

Our group offers two distinct methods for customer segmentation: one that makes use of the Python programming language and another that uses Excel. Excel is utilized for its intuitive design and robust calculation functions, particularly effective in calculating RFM (Recency, Frequency, Monetary) models, catering to those who prefer a visual, spreadsheet-based analysis. In contrast, the Python method involves advanced machine learning algorithms to automatically segment customer data using the RFM model, incorporating K-means clustering for automation. This advanced method, ideal for a more automated and data-intensive analysis, displays results on an accessible dashboard. Supporting these methods, Google Colab is employed for developing and sharing Python code, essential for querying and displaying data from the AdventureWorks Sales company database.

Structure of project

The project is divided into five chapters. The first chapter, "Theoretical Background", gives background material and reviews other studies on the subject.

The second chapter, "Data Preparation", describes how we meticulously collect and clean the AdventureWorks Sales data, conducting exploratory analysis and feature engineering to optimize it for the algorithms in the next chapter. This ensures a robust foundation for effective customer segmentation in the subsequent stages of the project.

The third chapter, "Customer Segmentation with K-Means, K-Means++, and K-Medoids", explores the real-world application of those algorithms for customer segmentation. It compares the efficiency of these algorithms in improving churn management through relevant customer segmentation and offers insights into the clustering results using evaluation metrics.

The fourth chapter, "Data Visualization & Analysis", presents segmented customer clusters and their attributes using charts and dashboards in order to improve accessibility and support efficient churn management strategies.

The final chapter, "Conclusion and Future Works", summarizes the main aims initially stated in the study, offering a concise overview of the research goals and how they were achieved.


Chapter 1 THEORETICAL BACKGROUND

This chapter gives background material and reviews other studies on the subject

1.1 Customer segmentation with RFM

1.1.1 Theory

The RFM analysis is a model used for customer segmentation based on three criteria: R (Recency), F (Frequency), and M (Monetary), derived from the customer transaction history

● Recency (R): This represents the time elapsed since the customer's last transaction in relation to the analysis date based on the RFM model

If the recency is large, the likelihood of customer churn (churn rate) is higher, and the business needs more resources to retain these customers. Conversely, if the recency is short, upselling and cross-selling become easier.

In the context of e-commerce, this criterion could be the last time a customer accessed specific products, product categories, or the website itself

● Frequency (F): This measures how often a customer conducts transactions. Higher frequency indicates a higher likelihood of customer responsiveness to campaigns and new products. Transaction frequency also quantifies the customer's engagement with the business.

In e-commerce, this could refer to how often a customer accesses a particular product, product category, or performs transactions

● Monetary (M): This criterion represents the total amount of money spent by a customer on transactions

It provides insights into the customer's spending capacity and can help calculate the average spending per transaction. Consequently, businesses can tailor marketing campaigns or introduce new products to customers with appropriate average spending levels.

1.1.2 Evaluation

R, F, and M are each scored on a 5-point scale, so there are 125 different RFM value combinations.

Loyal Customers: 543, 444, 435, 355, 354, 345, 344, 335

Potential Loyalist: 553, 551, 552, 541, 542, 533, 532, 531, 452, 451, 442, 441, 431, 453, 433, 432, 423, 353, 352, 351, 342, 341, 333, 323

Can't Lose Them: 155, 154, 144, 214, 215, 115, 114, 113


The tree map chart helps readers get an overview and easily identify the customer group that occupies the largest proportion. The use of different colors or shades of color helps to quickly identify the customer groups.

Figure 1.1 The results of the RFM model presented in a tree map chart

1.1.3 Why the RFM model?

Classifying customers based on the RFM model helps businesses monitor each step of the customer journey. This model assists businesses in answering the following questions:

● Which customers bring the most benefit to the company?

● Which customers might leave the company and how can the churn rate be improved?

● Which customers are likely to spend more?

● Which customers are likely to respond positively to campaigns?

RFM analysis helps marketers build campaigns, messages, and incentive programs tailored to each analyzed customer group. Based on this, marketing activities will increase response rates and improve customer retention, customer satisfaction, and customer lifetime value (CLV).

Some benefits of the RFM model include:

● Increasing the success rate of remarketing campaigns:
- Increase revenue from new products (related to WOM)
- Help businesses create more loyal customers

● Reducing marketing operation costs:
- Segment customer files for effective email marketing
- Optimize the costs of marketing campaigns
- Improve ROI

● Improving customer lifetime value (CLV):
- Reduce the churn rate
- Make upselling and cross-selling more successful
- Increase the likelihood of customer retention

1.2 How to compare the methods?

1.2.1 Davies-Bouldin Index (DBI)

DBI is calculated using the following formula:

DB = (1/k) · Σᵢ max_{j≠i} (σᵢ + σⱼ) / d(cᵢ, cⱼ)

where k is the number of clusters, σᵢ is the average distance from the points in cluster i to its centroid cᵢ, and d(cᵢ, cⱼ) is the distance between centroids i and j. The goal when optimizing DBI is to have its value as low as possible, indicating that clusters are well separated and that data points are highly concentrated within each cluster.
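In practice the index does not need to be computed by hand: scikit-learn exposes it as davies_bouldin_score. A quick sanity check on invented, well-separated blobs:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
# Two tight, well-separated synthetic clusters (data invented here).
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])
labels = np.array([0] * 20 + [1] * 20)

dbi = davies_bouldin_score(X, labels)  # lower is better; near zero here
```

Because within-cluster spread is tiny relative to the distance between the two centroids, the score comes out close to zero, illustrating the "lower is better" reading of the index.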

1.2.2 Average response time (ART)

Average response time is simply the average amount of time that elapses between a customer contacting your business and an agent replying to them

This metric is calculated by dividing the total time taken to respond during a particular time period by the number of replies sent during that same period. Your average response time can vary across different digital support channels, but whatever the channel, a lower average response time is always better.


Processing Time: Processing time refers to the duration that a system requires to handle a specific request or task. This component of ART encompasses activities such as data processing, computations, and executing relevant algorithms.

Waiting Time: Waiting time is the period during which a request must wait in a queue before it is processed. In multi-tasking or multi-user environments, requests may arrive faster than they can be processed, leading to queuing.

Data Transmission Time: This factor considers the latency introduced by data transmission protocols, network congestion, and the physical distance between communicating devices.

Response Time Variability: The variability in response times is a measure of how consistent or erratic the system's responses are. Stable systems exhibit minimal fluctuations in ART, ensuring a predictable user experience.

1.2.3 Average Memory Intensity (AMI)

Average Memory Intensity (AMI) is a performance metric that quantifies the average amount of computer memory (typically RAM) actively utilized by processes and applications during a given timeframe.

Memory Usage: AMI primarily evaluates how much of a computer system's memory is actively used by running processes and applications. It quantifies the intensity of memory consumption during a specific time frame.

Memory Allocation: Effective memory allocation is crucial for maintaining an optimal AMI. It involves allocating and deallocating memory resources to various processes and tasks as needed.

Memory Access Patterns: AMI considers how memory is accessed and utilized by programs. It takes into account factors such as the frequency of memory accesses, whether they are sequential or random, and how efficiently the system manages cache memory.

Peak Memory Usage: In addition to average memory intensity, monitoring peak memory usage is essential. Peak memory usage represents the highest amount of memory allocated or used during a specific operation or workload.

Memory Fragmentation: Memory fragmentation can lead to inefficient memory usage. AMI may assess both external fragmentation (gaps between memory allocations) and internal fragmentation (wasted space within allocated memory blocks).

Resource Allocation: AMI is often considered in the broader context of resource allocation. It involves deciding how memory resources are allocated to different processes, threads, or virtual machines.

1.3 Customer Behavior and Customer Experience

Customer Behavior: Customer behavior refers to the actions, decisions, and interactions that customers engage in when they interact with a business, its products or services, and its marketing efforts. Understanding customer behavior is crucial for businesses, as it helps them tailor their marketing strategies, product offerings, and customer service to meet customer needs and preferences effectively.

Customer Experience: Customer experience is the overall perception and feeling a customer has throughout their entire journey with a brand or business, from initial contact to post-purchase support. A positive customer experience is crucial for building customer loyalty, positive word-of-mouth, and repeat business. Businesses that prioritize CX are more likely to retain customers and stand out in a competitive market.

The association between customer behavior and customer experience: According to Kim et al. (2013), customer behavior and customer experience are intricately connected and highly influential on each other. To enhance customer experience, businesses often use insights from customer behavior data to make informed decisions, implement improvements, and create personalized marketing strategies. This synergy between customer behavior analysis and customer experience management is key to success in today's business landscape.

In summary, customer behavior data is the foundation for understanding and optimizing the customer experience. It helps businesses personalize interactions, improve processes, and anticipate customer needs. A positive customer experience, in turn, enhances customer loyalty and advocacy and influences future customer behavior, creating a cyclical relationship that is central to the success of any business.


1.4 Customer Lifetime Value (CLV)

Customer lifetime value (CLV) is the total economic value of a customer to a business over the whole period of their relationship with the brand, from first interaction to final purchase. This value computes the total revenue from a customer by accounting for all possible transactions over the course of the customer relationship, as opposed to focusing only on the value of individual transactions.

Customer lifetime value can be viewed in two ways: historically, it shows how much each current customer has already spent with your brand; prospectively, it shows how much customers could spend with you. Both measurements are helpful for monitoring the success of a business.

CLV models differ in complexity and application. The Traditional CLV Model determines value using acquisition cost and average purchase metrics. The Historical CLV Model estimates future value based on past behavior, while the Predictive CLV Model integrates historical data with predictive analytics to produce more precise forecasts.

The Discrete-Time CLV Model breaks value down into time intervals, providing a thorough understanding of value over time. The Non-Contractual CLV Model makes adjustments for non-contractual customer relationships. Machine-learning-based CLV models employ sophisticated algorithms to provide accurate predictions, and customer segmentation groups customers for targeted analysis. Companies select models according to their data, their requirements, and the precision they need in customer value forecasts and retention tactics.

Businesses can calculate Customer Lifetime Value (CLV) through five steps:

Step 1: Calculate the AOV (Average Order Value) = Total Revenue / Number of Orders
Step 2: Calculate the APF (Average Purchase Frequency) = Total Number of Orders / Unique Number of Customers
Step 3: Calculate the Customer Value = AOV × APF
Step 4: Calculate the Average Customer Lifespan = Total Number of Years / Total Number of Customers
Step 5: Calculate CLV = Customer Value × Average Customer Lifespan
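Tracing the five steps with made-up figures (all numbers below are purely illustrative):

```python
# Hypothetical aggregates for one year of sales (invented numbers).
total_revenue = 500_000.0
total_orders = 2_000
unique_customers = 400
total_customer_years = 1_200   # summed tenure across all customers

aov = total_revenue / total_orders             # Step 1: Average Order Value
apf = total_orders / unique_customers          # Step 2: Average Purchase Frequency
customer_value = aov * apf                     # Step 3: revenue per customer per period
lifespan = total_customer_years / unique_customers   # Step 4: average lifespan in years
clv = customer_value * lifespan                # Step 5: Customer Lifetime Value
```

With these figures, AOV = 250, APF = 5, customer value = 1,250, average lifespan = 3 years, and CLV = 3,750 per customer.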


1.6 Churn Rate

Churn rate is a vital metric that reflects customer attrition and has a substantial impact on a company's success, profitability, and customer satisfaction. Churn rate, also known as customer attrition rate or customer turnover rate, is a key performance indicator (KPI) that measures the rate at which customers or subscribers discontinue their relationship with a company or product. Typically expressed as a percentage, it is calculated by dividing the number of customers lost during a specific time period by the total number of customers at the beginning of that period.

To calculate the churn rate, choose a specific time period, divide the number of subscribers lost during that period by the total number of subscribers at the beginning of the period, and multiply by 100 to express the result as a percentage.
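The period-based definition translates directly to code; the customer counts in the example are invented:

```python
def churn_rate(customers_at_start, customers_lost):
    """Churn rate over a period: customers lost during the period divided
    by the customer count at the start, expressed as a percentage."""
    return customers_lost / customers_at_start * 100

# e.g. 1,000 customers at the start of a quarter, 50 of whom left
rate = churn_rate(1000, 50)   # 5.0 %
```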

Churn rate holds immense significance for businesses due to its direct impact on various aspects. Firstly, it affects revenue, as high churn rates can erode a company's revenue stream when customers and the associated recurring revenue are lost. Secondly, it directly influences Customer Lifetime Value (CLV), the total revenue a customer is expected to generate over their lifetime as a customer; higher churn rates lead to lower CLV. Additionally, a low churn rate can provide a competitive advantage, given that it is usually more cost-effective to retain existing customers than to acquire new ones. Lastly, high churn rates may signal underlying issues with customer satisfaction, product quality, or service delivery, necessitating attention.

1.7 Retention

Retention here means customer retention: a set of strategies that a business uses to increase the number of previous customers returning to make purchases. The goal of customer retention activities is to help the business retain as many customers as possible. In addition, it reduces the number of customers abandoning its products to switch to a competitor's brand. Thanks to this, the brand builds customer loyalty and optimizes revenue per customer. Customer retention is an important business strategy with many benefits: compared to bringing on new clients, it lowers expenses and raises order closure rates.

The higher the Customer Retention Rate (CRR), the better your customer retention strategy. To calculate this ratio, apply the following formula:

CRR = (CE − CN) / CS × 100%

where CE is the number of customers at the end of a certain period, CN the number of new customers acquired in that period, and CS the number of customers at the beginning of the period.
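As a small sketch of the formula (the customer counts are invented):

```python
def customer_retention_rate(ce, cn, cs):
    """CRR = (CE - CN) / CS * 100, where CE is the customer count at the
    end of the period, CN the new customers acquired during it, and CS
    the count at the start."""
    return (ce - cn) / cs * 100

# 110 customers at the end, 20 of them newly acquired, 100 at the start
rate = customer_retention_rate(110, 20, 100)
```

Subtracting the new customers first ensures the rate measures only how many of the original customers were kept.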

1.8 K-Means

1.8.1 Definition

K-means is a simple clustering algorithm that belongs to the unsupervised learning (unlabeled data) category and is a commonly used method for solving clustering problems and analyzing the cluster attributes of data. The idea of the K-means clustering algorithm is to divide the dataset into k different clusters, where k is a pre-given number of clusters. Each cluster is characterized by a centroid, the representative point of the cluster, located at the middle of the observations in that cluster. We rely on the distance from each observation to the centers to assign each observation the label of its nearest center. Initially, the algorithm randomly initializes the centroids. Then, the process determines the label for each data point and continues to update the cluster centers. The algorithm stops once every data point is assigned to its correct cluster and the centers are no longer updated.

This clustering is based on the principle that data points in the same cluster must share certain attributes; that is, there must be a relationship between points within the same cluster. To a computer, points in a cluster are data points that are close to each other. Using this algorithm, companies can determine which group the data actually belongs to.

The K-means clustering algorithm is often used in applications such as search engines, customer segmentation, and data statistics, and it is especially common in data mining and statistics.

Figure 1.2 An example of K-means clustering algorithm results

1.8.2 The steps of the K-means clustering algorithm


Figure 1.3 Flowchart of K-means clustering

Step 1: Choose the number of clusters k, i.e., determine how many clusters the data set should be divided into.

Step 2: Next, identify k data points (also known as observations, records, or objects) to be the "central" points representing each cluster. The central points of each cluster will change in the following steps.

Step 3: For each record or object in the data set, find the closest central point using the Euclidean distance formula. Each observation belongs to the cluster whose center it is closest to.

The Euclidean distance formula is given by:

d = √((x₁ − x₂)² + (y₁ − y₂)²)

Where:

d is the Euclidean distance
(x₁, y₁) is the coordinate of the first point
(x₂, y₂) is the coordinate of the second point

Step 4: Recalculate the central points, i.e., determine the representative values of the central points based on the data points in each cluster. With K-means clustering, the value of each center point equals the mean of all observations in its cluster.

Step 5: Repeat step 3, identifying which new center each observation is closest to; that observation then belongs to that center's cluster. The clusters may now change, and observations belonging to a previous cluster may move to another cluster. We then proceed as in step 4, recalculating the new center points for the changed clusters, and repeat step 3 if the new center points of the clusters have shifted on the graph.

Step 6: Repeat steps 3, 4, and 5 until the representative values in the clusters no longer change, meaning the representative points stop moving as observations change clusters, or until no observation can be moved from one cluster to another.
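The steps above can be sketched in a minimal pure-Python implementation; the toy 2-D points and the fixed starting centroids below are hypothetical, chosen only for illustration:

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic K-means following steps 3 to 6 above."""
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid.
        # math.dist is the Euclidean distance sqrt((x1-x2)^2 + (y1-y2)^2).
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Steps 5 and 6: repeat until the centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups; the starting centroids (0, 0) and (10, 10) are arbitrary.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
# Each group of three points ends up in its own cluster.
```

Here the stopping condition compares the recomputed centroids with the previous ones, matching step 6.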

1.8.3 How to find the optimal K
1.8.3.1 Elbow method

In the K-Means algorithm, we need to determine the number of clusters in advance The Elbow Method is a way to help us choose the appropriate number of clusters based on a visual graph by observing the decline of the distortion function and selecting the elbow point

This is a graphical method for assessing data variance at different numbers of clusters. Specifically, we calculate the value of the distortion function, the total sum of squared distances between data points and their cluster centers in K-means clustering. The idea of the elbow method is to run K-means clustering on the dataset for a range of values of k (say, k from 1 to 10), and for each value of k calculate the sum of squared errors (SSE):

SSE = Σᵢ₌₁ᵏ Σₓ∈Cᵢ ‖x − mᵢ‖²

Where:

x: data points in cluster i
mᵢ: mean value (centroid) of cluster i
k: number of clusters

Steps to use the Elbow method to find the optimal k:

Step 1: Start by building K-means models with a range of different cluster numbers (e.g., from 2 to 10 clusters).

Step 2: Calculate the distortion value for each cluster number, using the distortion function to compute the total sum of squared distances between data points and cluster centers.

Step 3: Plot an Elbow graph: a chart displaying the number of clusters on the x-axis and the distortion value on the y-axis.

Figure 1.4 A chart displaying the number of clusters k

Step 4: Identify the elbow point by observing the graph and finding the point where the distortion value levels off significantly. This is known as the elbow point, where the rate of decrease in distortion changes most sharply. In other words, beyond this point, increasing the number of clusters does not significantly reduce the distortion.

Step 5: Determine the optimal number of clusters, which is typically the value at or just before the elbow point. It represents the balance between effective clustering and minimizing distortion.
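A minimal sketch of the elbow procedure, assuming a basic K-means that (for determinism) seeds from the first k points; the toy data is hypothetical:

```python
import math

def kmeans_sse(points, k, iters=50):
    """Run a basic K-means (seeded with the first k points) and return the
    within-cluster sum of squared errors, i.e. the distortion."""
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    # SSE: total squared distance from each point to its cluster center.
    return sum(
        math.dist(p, centroids[i]) ** 2
        for i, c in enumerate(clusters)
        for p in c
    )

points = [(1, 1), (1.2, 0.8), (2, 2), (8, 8), (8.3, 7.9), (9, 9)]
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 5)}
# The SSE drops sharply from k=1 to k=2 (the true number of groups),
# then flattens out: the "elbow" suggests k = 2.
```

Plotting `sse_by_k` with k on the x-axis reproduces the elbow chart described in steps 3 and 4.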

1.8.3.2 Silhouette index

The Silhouette index is one of the most common and widely used indices to evaluate the effectiveness of the clustering process. Its main goal is to assess how optimally an observation or data point is assigned to a specific cluster. Essentially, the Silhouette method helps determine whether a data point sits comfortably within a cluster (indicating a good fit) or hovers near the periphery of the cluster (indicating a poor fit), for the purpose of assessing the overall validity of clustering methods. The Silhouette index quantifies the distance between data points within a cluster and the centroid of that cluster, and also measures the distance from each point to the nearest centroid among the other clusters, always choosing the shortest such distance. This method proves valuable when examining results obtained with the K-means clustering algorithm. However, when clusters are not produced by centroid-based techniques, the silhouette metric is adjusted: instead of measuring the distance from a point to its centroid, it calculates the average distance from that point to all other points within the same cluster, and the average distance to the points of each different cluster. The goal is still to find the shortest average distance to evaluate how well the data points fit their clusters.

s(i) = (b(i) − a(i)) / max(a(i), b(i))

Where:

x(i) = data point in the cluster, i = 1, 2, 3, …, n
a(i) = average distance between x(i) and every other data point in the same cluster
b(i) = minimum average distance between x(i) and the data points in each of the other clusters

The value of the silhouette score varies between -1 and 1. If the score is close to 1, the cluster is dense and well separated from other clusters. Values close to 0 indicate overlapping clusters whose samples are very close to the decision boundaries of neighboring clusters. Negative values in [-1, 0) indicate that the sample may have been assigned to the wrong cluster.
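The silhouette computation can be sketched directly from the definitions of a(i) and b(i); the toy points and labels below are hypothetical, and every cluster is assumed to contain at least two points:

```python
import math

def silhouette_score(points, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points.
    Assumes every cluster contains at least two points."""
    scores = []
    for i, p in enumerate(points):
        same = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        # a(i): average distance to the other points of its own cluster.
        a = sum(math.dist(p, q) for q in same) / len(same)
        # b(i): smallest average distance to the points of any other cluster.
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in set(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(1, 1), (1.5, 1), (8, 8), (8.5, 8)]
good = silhouette_score(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters mixed together
# good is close to 1, while bad is negative.
```

A high score for the first labeling and a negative score for the second illustrates how the index separates good assignments from wrong ones.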

Figure 1.5 Some graphical illustrations of the silhouette score

1.9 K-Means++
1.9.1 Definition

K-means++ represents a significant improvement over the traditional K-means clustering algorithm, designed to optimize the selection of initial centroids for clusters. While K-means typically selects cluster centers randomly, leading to issues such as unstable clustering results and convergence to local minima, K-means++ addresses this problem with a smarter approach.

The initialization process of K-means++ begins by randomly selecting a data point as the first cluster centroid. Then, the distance from each data point to its nearest cluster centroid is computed. Based on these distances, the algorithm selects the next cluster centroid from the data points with a probability proportional to the square of each point's distance to its nearest existing centroid. This process continues until all cluster centroids are determined.

The advantages of K-means++ include minimizing the risk of ending up in local minima and significantly improving the quality of the generated clusters This not only increases the accuracy of clustering but also makes clustering results more stable across different runs Thanks to this intelligent initialization method, K-means++ has become an important choice in many modern data analysis applications

1.9.2 Impact of K-means++ on customer segmentation:

K-means++ is a powerful tool for customer segmentation with significant effects on businesses Firstly, it efficiently creates customer clusters By using this algorithm, businesses can identify customer groups based on common factors such as product preferences, shopping behavior, or demographic characteristics This helps gain deeper insights into the target audience and how they interact with your products or services Furthermore, K-means++ optimizes segmentation results By intelligently initializing centroids, this algorithm eliminates the dependence on random initial point selection, avoiding suboptimal segmentation results This saves time and effort needed for fine-tuning and ensuring the accuracy of cluster results

1.9.3 Process:

The K-means++ algorithm follows a specific procedure to create initial centroids and meaningfully partition data into clusters. This process begins by initializing the centroids: a random data point is chosen from the dataset, and the distances from the other points to this centroid are calculated. Next, the algorithm iteratively repeats the centroid selection, choosing each new centroid at random with probability proportional to the squared distance from each point to its nearest current centroid, so that distant points are favored.

After the centroids are chosen, the process of dividing the data into clusters is performed using the traditional K-means method, assigning each data point to the nearest centroid Finally, centroids are updated by recalculating the average of the points within each cluster This process iterates until there is no change in cluster assignments or reaches a predetermined maximum number of iterations


The result is that the K-means++ algorithm creates k clusters and corresponding centroids, efficiently segmenting the data and optimizing this process, making data segmentation meaningful and useful in various applications.
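The K-means++ seeding described above can be sketched as follows; the toy data and the fixed random seed are hypothetical:

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: the first centroid is chosen uniformly at random;
    each further centroid is drawn with probability proportional to the
    squared distance from each point to its nearest already-chosen centroid."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its closest current centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Three well-separated groups; the fixed seed just makes the run repeatable.
points = [(0, 0), (0.2, 0.1), (10, 10), (10.1, 9.9), (20, 0), (19.8, 0.3)]
seeds = kmeans_pp_init(points, k=3, rng=random.Random(0))
# Already-chosen points have zero weight, so the three seeds are distinct,
# and with high probability each comes from a different group.
```

These seeds would then be handed to the ordinary K-means loop in place of random initialization.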

1.10 K-Medoids
1.10.1 Definition

K-Medoids is a type of partitioning algorithm used in cluster analysis. Unlike K-Means, which minimizes the sum of squared distances between data points and the mean of their respective cluster, K-Medoids minimizes the sum of dissimilarities between points and the most centrally located point in the cluster (the medoid). This difference makes K-Medoids more robust to outliers than K-Means.


Figure 1.8 Comparison between K-Means and K-Medoids

1.10.2 Process

Step 1: Initialization: Select k initial medoids randomly from the dataset. These medoids act as the initial centers of the clusters. The number k is the number of clusters you want to find.

Step 2: Assignment: Assign each data point to the closest medoid. Closeness is determined by a distance metric, such as Euclidean or Manhattan distance. Each point is assigned to the cluster of the medoid to which it has the smallest distance.

Step 3: Update: For each cluster, select a new medoid by examining all the points in the cluster and choosing the one that minimizes the total distance to all other points in that cluster. This step is crucial as it refines the cluster quality.

Step 4: Iteration: Repeat the Assignment and Update steps until there is no change in the medoids, or until a predefined number of iterations is reached. This iterative process ensures that the clusters are optimized around the chosen medoids.

Step 5: Convergence: The algorithm converges when the medoids do not change between successive iterations, or the change falls below a certain threshold. At this point, the clusters are considered stable and the algorithm terminates.

Step 6: Output: The final output is a set of k clusters with their respective medoids. Each data point in the dataset is assigned to one of these clusters.
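Steps 1 to 6 can be sketched in a few lines of Python; the toy data, including a deliberate outlier at (100, 100), is hypothetical:

```python
import math

def k_medoids(points, medoids, max_iter=20):
    """Basic K-Medoids (steps 2 to 5): assign points to the nearest medoid,
    then make each cluster's new medoid the member that minimizes the total
    distance to the rest of its cluster. Assumes no cluster becomes empty."""
    for _ in range(max_iter):
        # Step 2: assignment to the nearest medoid.
        clusters = [[] for _ in medoids]
        for p in points:
            i = min(range(len(medoids)), key=lambda m: math.dist(p, medoids[m]))
            clusters[i].append(p)
        # Step 3: update each medoid from its cluster's members.
        new_medoids = [
            min(c, key=lambda cand: sum(math.dist(cand, q) for q in c))
            for c in clusters
        ]
        # Steps 4 and 5: iterate until the medoids stop changing.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

# The extreme outlier (100, 100) barely moves the medoid, because a medoid
# must be an actual data point, unlike a mean-based centroid.
points = [(1, 1), (1, 2), (2, 1), (100, 100), (8, 8), (9, 8), (8, 9)]
medoids, clusters = k_medoids(points, medoids=[(1, 1), (8, 8)])
```

Note how the second cluster's medoid stays among the genuine points near (8, 8) even though the outlier belongs to that cluster.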


1.10.3 Advantages of K-Medoids

Robustness to outliers: Due to the use of medoids, K-Medoids is more resistant to noise and outliers, as these do not skew the cluster center as much as in mean-based clustering.

Flexibility with distance measures: It can be used with various types of distance calculations, making it suitable for different types of data, including non-numeric data.

Interpretability: The use of actual data points as cluster centers makes the interpretation of clusters more intuitive and meaningful in certain applications.

1.10.4 Limitations of K-Medoids

Computational complexity: K-Medoids is more computationally intensive than K-Means, especially for large datasets, as it involves exhaustive searches for medoids.

Sensitivity to initial selection: The initial selection of medoids can affect the final clustering result, though less severely than in K-Means.

Determining the number of clusters: Similar to K-Means, deciding the appropriate number of clusters (k) can be challenging and often requires additional methods, like the Elbow method or Silhouette analysis

1.10.5 Applications

K-Medoids is well-suited for applications where the integrity of the clusters is paramount, and the data contains noise and outliers In marketing analytics, K-Medoids can be used to segment customers based on purchasing behavior, demographics, or other relevant attributes Since K-Medoids use actual data points as the centers of clusters, the resulting segments are typically more representative and interpretable for targeted marketing strategies

1.10.6 Conclusion

K-Medoids is a robust and versatile clustering algorithm, ideal for scenarios where outlier sensitivity is a concern. Its ability to produce interpretable and meaningful clusters makes it a valuable tool in the realm of data analytics, despite its computational demands.


1.11 Cohort Analysis

Cohort analysis is a marketing analytics technique that focuses on analyzing the behavior of a group of users/customers with a common characteristic over a specific period of time This can be used to gain insights into the customer experience and identify opportunities for improvement

Cohort analysis is becoming increasingly important because it helps marketers overcome the limitations of average metrics, giving them clearer insights and, from there, more accurate decisions. If an average report tells us that per capita income in Vietnam increased in 2019 compared to 2018, the cohort analysis technique gives us a clearer view of the level of increase in each region and province. By comparing metrics across different cohorts in the same analysis, we can identify areas whose changes differ significantly (not increasing, or even decreasing) from the overall upward trend in the country (according to adbrix).

Figure 1.11 Cohort Retention Rate Table
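A retention table like the one in the figure can be built from raw purchase events; the sketch below uses hypothetical (customer, month) records:

```python
from collections import defaultdict

# Hypothetical (customer_id, purchase_month) events.
events = [
    ("A", 0), ("B", 0), ("C", 0),  # three customers first buy in month 0
    ("A", 1), ("B", 1),            # two of them come back in month 1
    ("A", 2),                      # one comes back in month 2
    ("D", 1), ("D", 2),            # D's cohort is month 1
]

# Each customer's cohort is the month of their first purchase.
first_month = {}
for cust, month in sorted(events, key=lambda e: e[1]):
    first_month.setdefault(cust, month)

# Which customers of each cohort are active `offset` months after joining.
active = defaultdict(set)
for cust, month in events:
    active[(first_month[cust], month - first_month[cust])].add(cust)

cohort_size = defaultdict(int)
for cust in first_month:
    cohort_size[first_month[cust]] += 1

# retention[(cohort, offset)] = share of the cohort still active.
retention = {
    key: len(members) / cohort_size[key[0]]
    for key, members in active.items()
}
# Month-0 cohort: 100% in month 0, 2/3 after one month, 1/3 after two.
```

Each row of the table in the figure corresponds to one cohort, and each column to an offset in months.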

Cohort analytics is widely used in the following verticals:

- E-commerce
- Cloud software
- Digital marketing
- Online gaming

In all of these industries, cohort analysis is commonly used to determine why customers are leaving and what can be done to prevent them from leaving. That brings us to the Customer Retention Rate (abbreviated CRR), calculated by this formula:

CRR = ((E − N) / S) × 100

Where:

- E – Number of customers at the end of the period.
- N – Number of new customers acquired during that period.
- S – Number of customers at the beginning of the period.
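The formula translates directly into code; the figures below are hypothetical:

```python
def customer_retention_rate(e, n, s):
    """CRR = ((E - N) / S) * 100"""
    return (e - n) / s * 100

# Hypothetical figures: 1000 customers at the start of the period,
# 950 at the end, of which 150 were newly acquired during the period.
crr = customer_retention_rate(e=950, n=150, s=1000)
print(crr)  # 80.0
```

Subtracting N first ensures that newly acquired customers do not inflate the retention of the starting base.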


Chapter 2 DATA PREPARATION

In the second chapter, we meticulously collect and clean the AdventureWorks Sales data, conducting exploratory analysis and feature engineering to optimize it for the algorithms in the next chapter. This ensures a robust foundation for effective customer segmentation in the subsequent stages of the project.

2.1 Exploratory data analysis
2.1.1 Data Description
2.1.1.1 About the Dataset: AdventureWorks Sales

AdventureWorks Sales is a sample database developed by Microsoft to help users learn how to work with SQL Server and related technologies. It is often used in examples and tutorials in Microsoft's documentation for SQL Server and related products. The AdventureWorks sample comes in a series of different database versions developed by Microsoft to showcase the features and functionality of SQL Server and other products, including AdventureWorks, AdventureWorksLT, and AdventureWorksDW (Data Warehouse). Each version focuses on different aspects of the data, such as product, customer, and sales information. The sample has gone through many releases, starting with SQL Server 2005 and continuing through SQL Server 2019.

2.1.1.2 Data description

AdventureWorks Sales contains the following seven sheets:

● Sales_Order data:
