UNIVERSITY OF ECONOMICS AND LAW
FACULTY OF INFORMATION SYSTEMS
FINAL PROJECT REPORT
INTERDISCIPLINARY RESEARCH METHOD COURSE
TOPIC: APPLYING K-MEANS CLUSTERING TO SEGMENT CUSTOMERS IN RETAIL INDUSTRY
2 Nguyen Trong Nghia
3 Thai Kim Huynh
4 Le Quang Thanh Tai
5 Tran Thi Minh Han
Ho Chi Minh City, November, 2022
Members of Group 9
NO
Point / 10 (Individual Contribution)
4 Le Quang Thanh Tai K214111962 10/10
5 Tran Thi Minh Han K214110831 10/10
Acknowledgements
This research was supported by lecturers Ho Trung Thanh and Nguyen Van Ho, who generously shared their knowledge and expertise, introduced our group to useful techniques for preparing scientific research, and guided us carefully throughout the course until the research was complete. We sincerely wish to express our gratitude to them. Any errors are our own and should not tarnish the reputations of these esteemed persons.
Commitment
This research was conducted by every member of Group 9 under the instruction of two lecturers, Ho Trung Thanh and Nguyen Van Ho. If any evidence of cheating is found in this research paper, our group will take full responsibility and accept punishment at any level. The paper also draws on references from several articles on related subjects.
Ho Chi Minh City, November, 2021
Table of Contents
Members of Group 9
Acknowledgements
Commitment
Table of Contents
List of Tables
List of Figures
List of Acronyms
GANTT CHART
ABSTRACT
I Introduction
II Theoretical basis and literature review
III Methodology and proposed research models
IV Experimental results and discussions
4.1 The process of customer segmentation
4.1.1 Preparing and preprocessing dataset
4.1.2 Calculating R, F and M
4.1.3 Transforming and Scaling R, F, M data
4.1.4 Grouping customers into separate clusters by K-Means
4.2 Applying marketing strategies to each cluster of customers
4.3 Discussions
V Conclusions and implications
Appendix
List of Tables
Table 1 RFM values after removing outliers
Table 2 R, F and M after being processed
Table 3 Scaling data to uniform value
Table 4 Description of three clusters
List of Figures
Figure 1 Customer Segment Model based on data mining (Yun Chen, Guozheng Zhang, Dengfeng Hu, Shanshan Wang, 2006)
Figure 2 The outline of the process of segmenting customers
Figure 3 Boxplots of Recency, Frequency and Monetary
Figure 4 Distribution of the variables Recency, Frequency and Monetary
Figure 5 Base, log, square root, Box-Cox of R, F, M
Figure 6 Elbow Method Visualization
List of Acronyms
M Monetary
K The number of clusters
SSD The sum of squared distances
Q Quartile
Log Logarithm
SQRT Square root
GANTT CHART
ABSTRACT
As customer demands grow more and more diverse, every business finds it challenging to fulfill all of its customers' expectations. To meet that challenge, businesses have researched and implemented the process of dividing customers into different groups. This study uses the RFM model and K-Means clustering, with the Elbow method, to analyze the AdventureWorks database. After segmenting customers, businesses can identify the factors affecting customer behavior and design the right marketing campaigns and effective business plans. The experimental results in this research show only a relative level of accuracy, but businesses can build on them and apply this research model to develop suitable business strategies for each customer group.
Keywords: RFM, K-means clustering, customer segmentation, marketing, retail industry.
I Introduction
In the context of ongoing international economic integration, widespread market development, and rapid scientific progress, businesses have many opportunities, but they also face many challenges and ever-present dangers. This requires leaders to set the right direction for their businesses to survive, stand firm, and develop.
Factors in the business environment change extremely quickly, making incoming sources of information obsolete and inaccurate. The failure to fully measure external factors greatly affects how assessments are conducted, how objectives are set, and whether appropriate conclusions are drawn. Under these conditions, it is difficult for the strategic direction of a business to remain feasible without a complete and detailed analysis of its target customers: which customer group needs to be satisfied? Improving the ability of businesses to group customers is therefore an important issue. If this requirement is met, businesses can more easily determine the right way to satisfy their customers, optimize costs, and be more proactive in their development. This research confirms that excellent customer relationship management is very important for companies to achieve higher profits.
Nowadays, using data mining techniques to support customer segmentation is a useful measure. Among these techniques, the K-Means method, introduced by J. MacQueen in “Some Methods for Classification and Analysis of Multivariate Observations” (1967), is a useful tool for analysts in many different fields. The K-Means clustering algorithm is capable of grouping large amounts of data with relatively fast and efficient computation time (Bain K K, Firli I, and Tri S 2016).
Starting from the above practical requirements, this study focuses on the use of K-Means clustering to divide customer data into the most suitable clusters. The rest of the study covers the application of the K-Means algorithm to a real dataset and presents the evaluation results of this method.
II Theoretical basis and literature review
Customer segmentation is widely used to group customers by specific characteristics. Clustering is the process of forming segments of a dataset by measuring the similarity between data points (Singh H and Kaur K, 2013). Each cluster of customers has different features and behaviors, which affect business strategies. Segmentation gives organizations and companies a thorough view of their customers so that they can target and market to them more effectively.
Several methods have often been used and studied to make the segmentation process smoother and more accurate, such as RFM, K-Means clustering, and data mining.
Regarding the RFM method, this analysis ranks each customer on three factors: Recency, which shows how recent the customer's last purchase was; Frequency, which shows how often the customer made a purchase in a given period; and Monetary, which shows how much the customer spent in the given period. RFM has been applied to calculate customer lifetime value (CLV) and to segment the customers of a health and beauty company (Mahboubeh Khajvand 2010). RFM variables have been used as features to describe customer characteristics (Kaymak 2001) and to classify customers into categories such as uncertain customers, spenders, frequent customers, and the best customers (Thompson 2002). However, in customer segmentation, the RFM model has some problems with data that changes over time (Monireh Hosseini 2015). To overcome that problem, the authors divided time into separate periods using an equation and then, based on the RFM model, calculated customer value for each period. Yeh (2008) implemented a comprehensive method to select customer segments from the data source by extending the RFM model to the RFMTC model (adding two parameters, namely time since first occurrence and identification general capacity). This model can estimate the probability that a customer will make a purchase next time and the expected total number of purchases that customer will make in the future.
Data mining has many strengths for analyzing customer segmentation data. It allows managers to identify the most profitable customers in each sector. Based on the data obtained, data mining analysis separates customers into groups with specific characteristics. Businesses can better understand consumers by uncovering the relationship between groups of consumers and the products they have purchased, and can present customer groups in the most intuitive way (Aslihan Dursun, 2016). Data mining also has drawbacks: data preparation, model interpretation, and evaluation are very time-consuming because data must be collected from reliable sources. The lack of technical knowledge and information is another big obstacle. Therefore, more research is needed in the following areas: link analysis to establish customer buying patterns, enhancing merchant websites, and predicting the lifetime value of each customer (Daqing Chen, Sai Laing Sain & Kun Guo, 2012). It is also difficult to compare customer profiles across time periods of the year, as recent visits come close to the winter deadline (Aslihan Dursun, 2016).
Figure 1 Customer Segment Model based on data mining (Yun Chen, Guozheng Zhang, Dengfeng Hu, Shanshan Wang, 2006)
K-Means is an unsupervised learning method in pattern recognition and machine learning that is often used to solve clustering problems (MacQueen, 1967). This method requires the number of clusters k to be chosen in advance. Each of the k clusters starts from a single random point, and each new point is then added to the cluster whose centroid (mean) is nearest to it. The following objective function is used (Saurabh Shah, Manmohan Singh, 2012):
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$
where $\left\| x_i^{(j)} - c_j \right\|^2$ is the squared distance between the data point $x_i^{(j)}$ and the centroid $c_j$.
The algorithm consists of the following steps, whose output is a set of k clusters:
1. Choose the number of clusters k.
2. Initialize k data points in the dataset as center points (centroids).
3. Assign each data point to the cluster whose center point is nearest to it.
4. For each cluster, recompute the center point as the mean of all data points assigned to that cluster.
5. Repeat steps 3 and 4 until convergence.
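The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in this study: the function names `k_means`, `squared_distance` and `mean_point`, the `max_iter` and `seed` parameters, and the choice to keep the old centroid when a cluster becomes empty are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Hypothetical helper: returns (centroids, clusters) for the given k."""
    rng = random.Random(seed)
    # Step 2: pick k data points as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 4: recompute each centroid as the mean of its cluster
        # (assumption: an empty cluster keeps its previous centroid).
        updated = [
            mean_point(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        # Step 5: stop once the centroids no longer move.
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


if __name__ == "__main__":
    sample = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
    found_centroids, found_clusters = k_means(sample, k=2)
    print(found_centroids)
    print(found_clusters)
```

Because the initial centroids are drawn at random, different seeds can lead to different final clusters on less clearly separated data, which is why the algorithm is usually run several times.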
When the K-Means algorithm is run several times, each run starts from different initial centroids. Once all runs are complete, the final clustering is derived from the results of those runs.
To choose the value of k in K-Means clustering accurately, the elbow and silhouette methods can be used:
The elbow approach is a technique in cluster analysis used to determine the number of clusters in a dataset. The method entails plotting the explained variance as a function of the number of clusters and selecting the elbow of the curve as the number of clusters to use. The same procedure can be used to select the number of parameters in other data-driven models, such as the number of principal components needed to describe a dataset. There are four steps in this method:
1. Run K-Means for a range of values of K.
2. For each K, sum the squared distances (SSD) from each point to its cluster mean.
3. Plot the SSD curve against K.
4. Visually pick the K at the elbow.
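As a rough illustration of the first three steps, the sketch below computes the SSD for several values of K in plain Python. It is a sketch under stated assumptions rather than the code used in this study: `kmeans_ssd` and `elbow_curve` are hypothetical names, the `n_init` restarts and fixed `seed` are choices made for the example, and the final plot and visual choice of the elbow are left to a plotting library.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_ssd(points, k, n_init=10, max_iter=100, seed=0):
    """Lowest SSD found over n_init random restarts of K-Means."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n_init):
        centroids = rng.sample(points, k)
        for _ in range(max_iter):
            # Assign each point to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                clusters[j].append(p)
            # Recompute centroids; an empty cluster keeps its old centroid.
            updated = [
                tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
                for i, c in enumerate(clusters)
            ]
            if updated == centroids:
                break
            centroids = updated
        # Step 2: sum of squared distances to the nearest centroid.
        ssd = sum(min(sq_dist(p, c) for c in centroids) for p in points)
        best = min(best, ssd)
    return best


def elbow_curve(points, k_values):
    """Step 1 and step 3 input: map each K to its SSD."""
    return {k: kmeans_ssd(points, k) for k in k_values}


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9), (9.0, 1.0), (9.1, 1.3)]
    for k, ssd in elbow_curve(data, range(1, 5)).items():
        print(k, round(ssd, 3))
```

Plotting the returned values against K gives the curve on which the elbow is read off by eye.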
The silhouette method is also used to identify the optimal number of clusters, as well as to interpret and validate consistency within clusters of data. It provides a brief graphical representation of how well each object has been classified. The method calculates a silhouette coefficient for each point, which measures how similar the point is to its own cluster compared with other clusters. The silhouette score is obtained by averaging the silhouette coefficients over all samples.
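The silhouette coefficient of data point i is
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
where:
$b_i$ is the average distance from data point i to the points of the closest cluster to which it does not belong
$a_i$ is the average distance from data point i to all other points in the cluster to which it belongs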
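The silhouette score built from these two quantities can be sketched as follows. This is a minimal plain-Python sketch, not the code used in this study: the function names are hypothetical, Euclidean distance is assumed, and giving singleton clusters (and points with both averages equal to zero) a coefficient of 0 is a convention chosen for the example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(clusters):
    """Average silhouette coefficient.

    clusters: list of lists of points, with at least two non-empty clusters.
    """
    scores = []
    for idx, cluster in enumerate(clusters):
        for i, p in enumerate(cluster):
            if len(cluster) == 1:
                # Assumed convention: a singleton cluster scores 0.
                scores.append(0.0)
                continue
            # a: average distance to the other points of the same cluster.
            a = sum(
                euclidean(p, q) for j, q in enumerate(cluster) if j != i
            ) / (len(cluster) - 1)
            # b: average distance to the points of the closest other cluster.
            b = min(
                sum(euclidean(p, q) for q in other) / len(other)
                for o, other in enumerate(clusters)
                if o != idx and other
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    good = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
    print(silhouette_score(good))
```

A score close to 1 indicates compact, well-separated clusters, while a negative score suggests that many points sit closer to another cluster than to their own.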
III Methodology and proposed research models
The proposed research methodology includes four major steps (Figure 2).
In the first step, this study preprocesses the AdventureWorks dataset (including data cleaning, data selection, and data transformation). Preprocessing yields a clean dataset that reflects the values needed for the RFM model. Then the R, F, and M values are calculated, and the K-Means method is used to assign the data points to different clusters. The clustering is implemented in Python.
In addition, this study uses exploratory data analysis (EDA) to analyze the data. Scatter plots, column plots, box plots, and similar graphs turn numbers, which are relatively difficult for the human brain to analyze as a whole, into easily understood images. Using EDA speeds up the process of figuring out what the data means, what it can be used for, and what conclusions can be drawn from it.
IV Experimental results and discussions
4.1 The process of customer segmentation
The whole process involves preprocessing data, RFM scoring, and K-Means customer segmentation.
4.1.1 Preparing and preprocessing dataset
The study uses the dataset of Adventure Works Cycles, a company which manufactures and distributes metal and composite bicycles to commercial markets in North America, Europe, and Asia. The figures in the dataset, covering customers and sellers, were recorded over just under three years, from 01/07/2017 to 15/06/2020. The full dataset is used to group customers into separate clusters, without restricting it to specific criteria.
4.1.2 Calculating R, F and M
RFM analysis is a popular segmentation method that ranks and groups customers by their behavior: how recently the customer purchased (Recency), how frequently the customer buys (Frequency), and the total or average transaction value (Monetary). To build an RFM model, the dataset used contains CustomerID, SalesOrderID, InvoiceDate, Quantity, and Unit Price, from which Recency, Frequency, and Monetary values are calculated for each CustomerKey. The figures are shown in Table 1 below.
To be specific, Recency is derived from the InvoiceDate column: the latest date in the whole dataset minus the date of a customer's most recent purchase gives that customer's R. Frequency is obtained simply by counting the sales orders of each customer. Monetary is calculated by multiplying Quantity by Unit Price and then summing the results over all products bought by a customer.
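The calculation described above can be illustrated with a small plain-Python sketch. It is not the code used in this study and rests on several assumptions: the rows are dictionaries keyed by `CustomerID`, `SalesOrderID`, `InvoiceDate`, `Quantity` and `UnitPrice`, Recency is measured in days, Frequency is taken as the number of distinct orders per customer, and the function name `compute_rfm` is hypothetical.

```python
from collections import defaultdict
from datetime import date


def compute_rfm(rows):
    """Compute Recency (days), Frequency (orders) and Monetary per customer.

    rows: list of dicts with the keys CustomerID, SalesOrderID,
    InvoiceDate (datetime.date), Quantity and UnitPrice (assumed layout).
    """
    # Reference point for Recency: the latest invoice date in the dataset.
    reference_date = max(r["InvoiceDate"] for r in rows)
    last_purchase = {}
    orders = defaultdict(set)
    monetary = defaultdict(float)
    for r in rows:
        cid = r["CustomerID"]
        invoice = r["InvoiceDate"]
        # Track the most recent purchase date of each customer.
        last_purchase[cid] = max(last_purchase.get(cid, invoice), invoice)
        # Assumption: Frequency counts distinct sales orders.
        orders[cid].add(r["SalesOrderID"])
        # Monetary: Quantity times Unit Price, summed over all rows.
        monetary[cid] += r["Quantity"] * r["UnitPrice"]
    return {
        cid: {
            "Recency": (reference_date - last_purchase[cid]).days,
            "Frequency": len(orders[cid]),
            "Monetary": monetary[cid],
        }
        for cid in last_purchase
    }


if __name__ == "__main__":
    demo = [
        {"CustomerID": 1, "SalesOrderID": "A", "InvoiceDate": date(2020, 6, 1),
         "Quantity": 2, "UnitPrice": 10.0},
        {"CustomerID": 2, "SalesOrderID": "B", "InvoiceDate": date(2020, 6, 15),
         "Quantity": 3, "UnitPrice": 4.0},
    ]
    print(compute_rfm(demo))
```

With a real dataset the same grouping would normally be done with a data-frame library, but the per-customer logic stays the same.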