

UNIVERSITY OF ECONOMICS AND LAW FACULTY OF INFORMATION SYSTEMS

FINAL PROJECT REPORT

INTERDISCIPLINARY RESEARCH METHOD COURSE

TOPIC: APPLYING K-MEANS CLUSTERING TO SEGMENT CUSTOMERS IN RETAIL INDUSTRY

1 Ta Thi My Hanh

2 Nguyen Trong Nghia

3 Thai Kim Huynh

4 Le Quang Thanh Tai

5 Tran Thi Minh Han

Ho Chi Minh City, November, 2022


Acknowledgements

This research was supported by our lecturers, Ho Trung Thanh and Nguyen Van Ho, who generously shared their knowledge and expertise, taught our group useful techniques for preparing a complete scientific study, and guided us carefully throughout the course. We sincerely wish to express our gratitude to them. Any errors are our own and should not tarnish the reputations of these esteemed persons.


Commitment

This research was conducted by all members of Group 9 under the instruction of two lecturers, Ho Trung Thanh and Nguyen Van Ho. If any evidence of cheating is found in this research paper, our group will take full responsibility and accept punishment at any level. The paper also draws on references from several articles on related subjects.

Ho Chi Minh City, November, 2022

Group 9


Members of Group 9

I Introduction

II Theoretical basis and literature review

III Methodology and proposed research models

IV Experimental results and discussions

4.1 The process of customer segmentation

4.1.1 Preparing and preprocessing dataset

4.1.2 [illegible]

4.1.3 Transforming and scaling R, F, M data

4.1.4 Grouping customers into separate clusters by K-Means

4.2 Applying marketing strategies to each cluster of customers

[illegible]

V Conclusions and implications

Appendix


List of Tables

Table 1 RFM values after removing outliers

Table 2 R, F and M after being processed

Table 3 Scaling data to uniform value

Table 4 Description of three clusters


List of Figures

Figure 1 Customer Segment Model based on data mining (Yun Chen, Guozheng Zhang, Dengfeng Hu, Shanshan Wang, 2006)

Figure 2 The outline of the process of segmenting customers

Figure 3 Boxplots of Recency, Frequency and Monetary

Figure 4 Distribution of the variables Recency, Frequency and Monetary

Figure 5 Base, log, square root, boxcox of R, F, M

Figure 6 Elbow Method Visualization


K The number of clusters

SSD The sum of squared distances


[Figure: structure diagram; legible labels: "II Theoretical basis and literature review", "IV Results and discussions"]


ABSTRACT

Customer demands are becoming increasingly diverse, and every business finds it challenging to fulfill all of its customers' expectations. To meet that challenge, businesses have researched and implemented the process of dividing customers into different groups. This study uses the RFM model and the K-means clustering method, together with the Elbow method, to analyze AdventureWorks data. After segmenting customers, businesses can identify the factors affecting customer behavior and design the right marketing campaigns and effective business plans. The experimental results in this research show only a relative level of accuracy; nevertheless, businesses can apply this research model to develop suitable business strategies for each customer group.

Keywords: RFM, K-means clustering, customer segmentation, marketing, retail industry

I Introduction

In the context of ongoing international economic integration, widespread market development, and rapid scientific progress, businesses have many opportunities, but they also face many challenges and ever-present risks. This requires leaders to set the right direction for their businesses to survive, stand firm, and develop.

Factors in the business environment change extremely quickly, making incoming information sources obsolete and inaccurate. Failing to measure external factors fully greatly affects how a business conducts assessments, sets objectives, and draws appropriate conclusions. Under those conditions, it is difficult to ensure the feasibility of a business's strategic direction without a complete and detailed analysis of the target customer group: which customer group needs to be satisfied? Improving the ability to group customers is therefore an important issue: if this requirement is met, businesses can more easily determine the right way to satisfy their customers, optimize costs, and be more proactive in their development. This research confirms that excellent Customer Relationship Management is very important for companies that want to achieve higher profits.


Nowadays, using data mining techniques to support customer segmentation is a useful measure. In particular, the K-Means method, introduced by MacQueen in "Some Methods for Classification and Analysis of Multivariate Observations" (1967), is a useful tool for analysts in many different fields. The K-Means clustering algorithm is capable of grouping large amounts of data with relatively fast and efficient computation time (Bain K K, Firli L, and Tn S, 2016).

Starting from the above practical requirements, this study focuses on the use of K-Means clustering to divide customer data into the most suitable clusters. The rest of the study applies the K-Means algorithm to a real data set and presents the evaluation results of this method.

II Theoretical basis and literature review

Customer segmentation is widely used to group customers by specific characteristics. Clustering is the process of forming segments of a data set by measuring the similarities between data items (Singh H and Kaur K, 2013). Each cluster of customers has different features and behaviors, which affect business strategies. Segmentation gives organizations and companies a thorough view of their customers so that they can target and market to them more effectively.

Several methods have often been used and studied to make the segmentation process smoother and more accurate, such as RFM, K-Means clustering, and data mining.

The RFM method ranks each customer on three factors: Recency, which shows how recent the customer's last purchase was; Frequency, which shows how often the customer made a purchase in a given period; and Monetary, which shows how much the customer spent in the given period. RFM has been applied to calculate customer lifetime value (CLV) and to segment the customers of a health and beauty company (Mahboubeh Khajvand 2010). RFM variables have been used as features to describe customer characteristics (Kaymak 2001) and to classify customers into categories such as uncertain customers, spenders, frequent customers, and the best customers (Thompson 2002). However, in customer segmentation, the RFM model has some problems with data that change over time (Monireh Hosseini 2015). To overcome that problem, they broke down time into separate parts using the equation:

[equation missing in source]

then, based on the RFM model, calculated customer value for each time period. Yeh (2008) implemented a comprehensive method to select customer segments from the data source by extending the RFM model to the RFMTC model (adding two parameters: time since first purchase and churn probability). This model can estimate the probability that a customer will make a purchase next time and the expected total number of purchases that customer will make in the future.

Data mining has many strengths for customer segmentation analysis. It allows managers to identify the most profitable customers in each sector. From the data obtained, data mining separates customers into groups with specific characteristics. Businesses can better understand consumers by finding the relationship between groups of consumers and the products they have purchased, and can present customer groups in the most intuitive way (Aslihan Dursun, 2016). Data mining also has problems: data preparation, model interpretation, and evaluation are very time-consuming because data must be collected from reliable sources. The lack of technical knowledge and information is also a big obstacle. Therefore, more research is needed in the following areas: link analysis to establish customer buying patterns, enhancing merchant websites, and predicting the lifetime value of each customer (Daqing Chen, Sai Laing Sain & Kun Guo, 2012). That makes it difficult to compare customer profiles across time periods of the year, as recent visits come close to the winter deadline (Aslihan Dursun, 2016).


Figure 1 Customer Segment Model based on data mining (Yun Chen, Guozheng Zhang, Dengfeng Hu, Shanshan Wang, 2006)

K-Means is an unsupervised clustering method from pattern recognition and machine learning that is often used to solve clustering problems (MacQueen, 1967). This method requires the number of clusters k to be chosen in advance. Each of the k clusters starts from a single random point; each new point is then added to the cluster whose mean (centroid) is nearest to it. The following objective function is used (Saurabh Shah, Manmohan Singh, 2012):

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2$$

where $C_j$ is the set of data points assigned to cluster $j$, $c_j$ is the centroid of that cluster, and $\lVert x_i - c_j \rVert^2$ is the squared distance between the centroid and the data point $x_i$.

The algorithm consists of the following steps, and its output is a set of k clusters.

1. Choose the number of clusters k.

2. Initialize k data points from the dataset as center points (centroids).

3. Assign each data point to the cluster whose center point is nearest to it.

4. For each cluster, recompute the center point as the mean of all data points assigned to that cluster.

5. Repeat steps 3 and 4 until convergence.
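The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in this study: the function name `kmeans`, the optional `init` argument (added so a run can be reproduced), and the sample data are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    `init` is a hypothetical convenience argument: a list of k starting
    centroids. When omitted, k data points are drawn at random (step 2).
    """
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical two-dimensional data, for illustration only.
data = [(1.0, 9.0), (2.0, 8.0), (1.5, 9.5), (9.0, 1.0), (8.0, 2.0), (9.5, 1.5)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Because the starting centroids are drawn at random, two runs with different seeds may end in different clusterings, which is why the starting points matter.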

When the K-means algorithm is run many times, every rerun starts from different initial centroids. After all runs have finished, their results are reduced to a single final result, typically by keeping the run with the lowest sum of squared distances.

To choose a suitable value of k in K-means clustering, the elbow and silhouette methods can be used:

The elbow method is a technique in cluster analysis used to determine the number of clusters in a data set. It plots the explained variance as a function of the number of clusters and selects the elbow of the curve as the number of clusters to use. The same procedure can be used to select the number of parameters in other data-driven models, such as the number of principal components needed to describe a data set. The method has four steps:

1. Run K-means for a range of values of K.

2. For each K, sum the squared distances from each point to its cluster mean (SSD).

3. Plot the SSD curve against K.

4. Pick the K at the elbow visually.
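A minimal sketch of the first two steps is shown below, with the SSD values printed rather than plotted. It is an illustration under stated assumptions, not the code used in this study: the helper name `kmeans_ssd`, the number of random restarts, and the sample points are all hypothetical.

```python
import math
import random


def kmeans_ssd(points, k, n_restarts=10, max_iter=100, seed=0):
    """Run K-means from several random starts and return the lowest SSD found."""
    rng = random.Random(seed)
    best = math.inf
    for _ in range(n_restarts):
        centroids = rng.sample(points, k)  # k data points as initial centroids
        labels = None
        for _ in range(max_iter):
            new_labels = [
                min(range(k), key=lambda j: math.dist(p, centroids[j]))
                for p in points
            ]
            if new_labels == labels:  # converged
                break
            labels = new_labels
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:
                    centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        # Sum of squared distances from each point to its cluster mean.
        ssd = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
        best = min(best, ssd)
    return best


# Hypothetical data with three visible groups, for illustration only.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
ssd_by_k = {k: kmeans_ssd(data, k) for k in range(1, 7)}
for k, ssd in ssd_by_k.items():
    print(f"K={k}: SSD={ssd:.2f}")
```

Plotting `ssd_by_k` against K with any charting library gives the curve whose bend is read off in the last step.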

The silhouette method is also used to identify the optimal number of clusters, as well as to interpret and validate consistency within clusters. It calculates a silhouette coefficient for each point, which measures how similar the point is to its own cluster compared with other clusters, and gives a brief graphical representation of how well each object was classified. The silhouette coefficient of data point $i$ is

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

$b_i$: the average distance from data point $i$ to the points of the closest cluster to which it does not belong.

$a_i$: the average distance from data point $i$ to all other points in the cluster to which it belongs.

The silhouette coefficients of all points are then averaged over all samples to get the silhouette score.
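The calculation can be illustrated with a short plain-Python sketch. The function name `silhouette_score`, the convention of scoring single-point clusters as 0, and the sample data are assumptions made for this example; the sketch also assumes at least two clusters are present.

```python
import math


def silhouette_score(points, labels):
    """Average of s_i = (b_i - a_i) / max(a_i, b_i) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-point clusters
            continue
        # a_i: mean distance to the other points of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b_i: smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical one-dimensional data with two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good, bad)
```

A score close to 1 indicates compact, well-separated clusters, while a negative score, as with the second labelling here, suggests that points sit in the wrong cluster.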

III Methodology and proposed research models

The proposed research methodology includes four major steps (Figure 2).

In the first step, this study preprocesses the Adventure Works dataset (including data cleaning, data selection, and data transformation). Preprocessing yields a clean dataset that reflects the values needed for the RFM model. Then the R, F, and M values are calculated, and the K-Means method is used to assign the data points to different clusters. The clustering is implemented in Python.
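As an illustration of how R, F, and M values can be derived from order records, consider the sketch below. It does not reproduce the study's own code or the Adventure Works schema: the record layout `(customer_id, order_date, amount)`, the function name `compute_rfm`, and the snapshot date are assumptions made for this example.

```python
from datetime import date


def compute_rfm(orders, snapshot_date):
    """Map each customer to (recency in days, frequency, monetary)."""
    summary = {}
    for customer_id, order_date, amount in orders:
        last_date, freq, total = summary.get(customer_id, (order_date, 0, 0.0))
        # Track the latest purchase date, the order count, and the total spend.
        summary[customer_id] = (max(last_date, order_date), freq + 1, total + amount)
    return {
        cid: ((snapshot_date - last_date).days, freq, total)
        for cid, (last_date, freq, total) in summary.items()
    }


# Hypothetical order records: (customer_id, order_date, amount).
orders = [
    ("C1", date(2022, 1, 5), 120.0),
    ("C1", date(2022, 3, 20), 80.0),
    ("C2", date(2021, 11, 2), 500.0),
]
rfm = compute_rfm(orders, snapshot_date=date(2022, 4, 1))
print(rfm)
```

Here recency is measured in days between the last purchase and a chosen snapshot date, so a smaller R means a more recent customer.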


Figure 2 The outline of the process of segmenting customers

For data analysis, this study also uses exploratory data analysis (EDA). EDA displays the data visually, turning numbers, which are relatively difficult for the human brain to analyze as a whole, into easily read graphics such as scatter plots, column plots, and box plots. Using EDA speeds up the process of figuring out what the data mean, what they can be used for, and what conclusions can be drawn from them.

IV Experimental results and discussions

4.1 The process of customer segmentation

The whole process involves preprocessing data, RFM scoring, and K-Means customer segmentation.

4.1.1 Preparing and preprocessing dataset

The study uses the dataset of the company Adventure Works Cycles, which manufactures and distributes metal and composite bicycles to commercial markets in North America, Europe, and Asia. The figures in the dataset including customers and

Title	Applying K-Means Clustering to Segment Customers in Retail Industry
Authors	Ta Thi My Hanh, Nguyen Trong Nghia, Thai Kim Huynh, Le Quang Thanh Tai, Tran Thi Minh Han
Supervisors	Ho Trung Thanh, Ph.D., Nguyen Van Ho, MA
University	University of Economics and Law
Major	Interdisciplinary Research Method
Document type	Final Project Report
Year of publication	2022
City	Ho Chi Minh City
