
Topic: Applying K-Means Clustering to Segment Customers in Retail Industry


DOCUMENT INFORMATION

Basic information

Title Applying K-Means Clustering To Segment Customers In Retail Industry
Authors Ta Thi My Hanh, Nguyen Trong Nghia, Thai Kim Huynh, Le Quang Thanh Tai, Tran Thi Minh Han
Supervisors Ho Trung Thanh, Ph.D., Nguyen Van Ho, MA
University University of Economics and Law
Major Information Systems
Type Final Project Report
Year of publication 2022
City Ho Chi Minh City
Pages 30
File size 2.43 MB


UNIVERSITY OF ECONOMICS AND LAW

FACULTY OF INFORMATION SYSTEMS

FINAL PROJECT REPORT

INTERDISCIPLINARY RESEARCH METHOD COURSE

TOPIC: APPLYING K-MEANS CLUSTERING TO SEGMENT CUSTOMERS IN RETAIL INDUSTRY

2 Nguyen Trong Nghia

3 Thai Kim Huynh

4 Le Quang Thanh Tai

5 Tran Thi Minh Han

Ho Chi Minh City, November, 2022


4 Le Quang Thanh Tai K214111962 10/10

5 Tran Thi Minh Han K214110831 10/10


Acknowledgements

This research was supported by our lecturers Ho Trung Thanh and Nguyen Van Ho, who generously shared their knowledge and expertise, taught our group useful techniques for preparing a complete piece of scientific research, and guided us carefully throughout the course until the research was completed. Any errors are our own and should not tarnish the reputations of these esteemed persons, and we sincerely wish to express our gratitude to them.

We are also grateful to one another within the group for the enthusiastic checking and cooperation, the late-night feedback, and the moral support during the course.

Group 9


Commitment

This research was conducted by every member of Group 9 under the instruction of two lecturers, Ho Trung Thanh and Nguyen Van Ho. If any evidence of cheating is found in this research paper, our group will take full responsibility and accept punishment at any level. In addition, the paper draws on references from several articles on related subjects.

Ho Chi Minh City, November, 2021

Group 9


I Introduction

II Theoretical basis and literature review

III Methodology and proposed research models

IV Experimental results and discussions

4.1 The process of customer segmentation

4.1.1 Preparing and preprocessing dataset

4.1.2 Calculating R, F and M

4.1.3 Transforming and Scaling R, F, M data

4.1.4 Grouping customers into separate clusters by K-Means

4.2 Applying marketing strategies to each cluster of customers

4.3 Discussions

V Conclusions and implications

Appendix


List of Tables

Table 1 RFM values after removing outliers

Table 2 R, F and M after being processed

Table 3 Scaling data to uniform value

Table 4 Description of three clusters


List of Figures

Figure 1 Customer Segment Model based on data mining (Yun Chen, Guozheng Zhang, Dengfeng Hu, Shanshan Wang, 2006)

Figure 2 The outline of the process of segmenting customers

Figure 3 Boxplots of Recency, Frequency and Monetary

Figure 4 Distribution of the variables Recency, Frequency and Monetary

Figure 5 Base, log, square root, boxcox of R, F, M

Figure 6 Elbow Method Visualization


List of Acronyms

M Monetary

K The number of clusters

SSD The sum of squared distances

Q Quartile

Log Logarithm

SQRT Square root


GANTT CHART


Recently, customer demands have become more and more diverse, and every business finds it challenging to fulfill all of its customers’ expectations. To meet that requirement, businesses have researched and implemented the process of dividing customers into different groups. This study uses the RFM model and the K-means clustering method, with the Elbow method, to exploit a database based on AdventureWorks’ data. After segmenting customers, business people can identify the factors affecting customers’ behaviors and devise the right marketing campaigns and effective business plans. The experimental results in this research show only a relative level of accuracy; nevertheless, based on them, businesses can apply this research model to devise suitable business strategies for each customer group.

Keywords: RFM, K-means clustering, customer segmentation, marketing, retail industry.

I Introduction

In the context of ongoing international economic integration, widespread market development, and rapid scientific progress, businesses have many opportunities, but they also face many challenges and ever-present dangers. This requires leaders to set the right direction for their businesses to survive, stand firm, and develop.

Factors in the business environment change extremely quickly, making incoming information sources obsolete and inaccurate. The failure to fully measure external factors greatly affects the assessment, the setting of objectives, and the drawing of appropriate conclusions. Under these conditions, it is difficult to ensure that a business’s strategic direction is feasible without a complete and detailed analysis of its target customers: which customer group needs to be satisfied? Thus, improving businesses’ ability to group customers is an important issue: if this requirement is met, businesses can more easily determine the right way to satisfy their customers, optimize costs, and be more proactive in their development. This research affirms that excellent Customer Relationship Management is very important for companies to achieve more profits.


Nowadays, using data mining techniques to support customer segmentation is a useful measure. In particular, the K-Means method, introduced by J. MacQueen in “Some Methods for Classification and Analysis of Multivariate Observations” (1967), is a useful tool for analysts in many different fields. The K-means clustering algorithm is capable of grouping large amounts of data with relatively fast and efficient computation time (Bain K K, Firli I, and Tri S 2016).

Starting from the above practical requirements, this study focuses on the use of K-Means clustering to divide customer data into the most suitable clusters. The rest of the study covers the application of the K-Means algorithm to a real data set and presents the evaluation results of this method.

II Theoretical basis and literature review

Customer segmentation is widely used to group customers by specific characteristics. Clustering is the process of forming segments of a data set by measuring the similarities between data points (Singh H and Kaur K, 2013). Each cluster of customers has different features and behaviors, which affect business strategies. Segmentation gives organizations and companies a thorough view of their customers so that they can target and market to them more effectively.

Several methods have often been used and studied to make the segmentation process smoother and more accurate, such as RFM, K-Means clustering, and data mining.

Regarding the RFM method, this analysis ranks each customer on three factors: Recency, which shows how recent the customer’s last purchase was; Frequency, which shows how often the customer made a purchase in a given period; and Monetary, which shows how much the customer spent in the given period. RFM has been applied to calculate customer lifetime value (CLV) and to segment the customers of a health and beauty company (Mahboubeh Khajvand 2010). RFM variables have been used as features to describe customer characteristics (Kaymak 2001) and to classify customers into categories such as uncertain customers, spenders, frequent customers, and the best customers (Thompson 2002). However, in customer segmentation, the RFM model has some problems with data that changes over time (Monireh Hosseini 2015). To overcome that problem, they broke down time into separate parts using the equation:

then, based on the RFM model, calculated the customer value for each time period. Yeh (2008) implemented a comprehensive method to select customer segments from the data source by extending the RFM model to the RFMTC model (adding two parameters: time since first purchase and churn probability). This model can estimate the probability that a customer will make a purchase next time and the expected value of the total number of times that customer will make a purchase in the future.

Data mining has many strengths for analyzing customer segmentation data. It allows managers to identify the most profitable customers in each sector. Using the data obtained, data mining analysis separates customers into groups with specific characteristics. Businesses can better understand consumers by finding the relationships between groups of consumers and the products they have purchased, and can present customer groups in the most intuitive way (Aslihan Dursun, 2016). However, data mining also has problems: data preparation, model interpretation, and evaluation are very time-consuming because data must be collected from reliable sources. The lack of technical knowledge and information is also a big obstacle. Therefore, more research is needed in the following areas: link analysis to establish customer buying patterns, enhancing merchant websites, and predicting the lifetime value of each customer (Daqing Chen, Sai Laing Sain & Kun Guo, 2012). It is also difficult to compare customer profiles across time periods of the year, as recent visits come close to the winter deadline (Aslihan Dursun, 2016).


Figure 1 Customer Segment Model based on data mining (Yun Chen, Guozheng Zhang, Dengfeng Hu, Shanshan Wang, 2006)

K-Means is an unsupervised clustering method in pattern recognition and machine learning that is often used to solve clustering problems (MacQueen, 1967). This method requires the number of clusters k to be chosen in advance. Each of the k clusters starts from a single random point, and each new point is then added to the cluster whose centroid (mean) is nearest to it. The following objective function is used (Saurabh Shah, Manmohan Singh, 2012):

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where $\left\| x_i^{(j)} - c_j \right\|^2$ is the squared distance between the data point $x_i^{(j)}$ and the centroid $c_j$.

The algorithm is composed of the following steps, whose output is a set of k clusters:

1 Select the number of clusters k for the dataset.

2 Initialize k data points in the dataset as center points (centroids).

3 For each data point, assign that data point to the cluster whose center point is the nearest.

4 For each cluster, recompute the center point from all data points assigned to that cluster.

5 Repeat steps 3 and 4 until convergence.
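The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code used in this study: the function name `k_means`, the `max_iter` and `seed` parameters, and the sample points are assumptions made for the example.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical sample data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
```

Stopping when the assignments stop changing is one common convergence test; a tolerance on centroid movement would work as well.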

When the K-Means algorithm is repeated many times, every rerun starts from different initial centroids. After the learning process, the final result is derived from the results of the individual runs.

To choose k in K-means clustering, the elbow and silhouette methods can be used:

The elbow approach is a technique in cluster analysis used to determine the number of clusters in a data set. The method entails charting the explained variance as a function of the number of clusters and selecting the elbow of the curve as the number of clusters to use. The same procedure can be used to select the number of parameters in other data-driven models, such as the number of principal components used to describe a data set. There are four steps in this method:

1 Run K-means for a range of values of K.

2 For each K, sum the squares of the distances from each point to its cluster mean (SSD).

3 Plot the SSD curve over the values of K.

4 Pick the K at the elbow visually.
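The elbow steps can be illustrated as follows. This is a minimal sketch under stated assumptions: `kmeans_ssd` is a hypothetical helper, the three-group sample data is invented, and the SSD values are printed rather than plotted so that the example needs only the standard library.

```python
import math
import random


def kmeans_ssd(points, k, n_init=10, max_iter=100, seed=0):
    """Run K-means n_init times and return the lowest SSD found."""
    rng = random.Random(seed)
    best = math.inf
    for _ in range(n_init):
        centroids = rng.sample(points, k)
        for _ in range(max_iter):
            # Assign each point to its nearest centroid.
            labels = [
                min(range(k), key=lambda j: math.dist(p, centroids[j]))
                for p in points
            ]
            # Recompute each centroid as the mean of its members.
            updated = []
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                updated.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                    if members else centroids[j]
                )
            if updated == centroids:
                break
            centroids = updated
        # Step 2: sum of squared distances to the nearest centroid.
        ssd = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
        best = min(best, ssd)
    return best


# Hypothetical sample data with three visible groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]
# Step 1: run K-means for a range of K; step 3 would plot these values.
ssd_by_k = {k: kmeans_ssd(points, k) for k in range(1, 7)}
for k, ssd in ssd_by_k.items():
    print(k, round(ssd, 2))
```

The elbow is the value of K after which the SSD stops dropping sharply; reading it off remains a visual judgment, as step 4 notes.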

The silhouette method is also used to identify the optimal number of clusters, as well as to interpret and validate consistency within clusters of data. It calculates a silhouette coefficient for each point, which measures how similar the point is to its own cluster compared with other clusters, and provides a brief graphical representation of how well each object was classified. The silhouette score is obtained by computing the silhouette coefficient for each point and averaging over all samples.


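The silhouette coefficient of data point i is:

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

$b_i$: defined as the average distance from data point i to the points of the closest other cluster.

$a_i$: defined as the average distance from data point i to all other points in the cluster to which it belongs.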
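A small sketch of this computation is given below, assuming Euclidean distance and at least two clusters. The function name `silhouette_score` and the sample points and labels are illustrative assumptions, and scoring a single-point cluster as 0 is a convention adopted here rather than something stated in this report.

```python
import math


def silhouette_score(points, labels):
    """Average silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    def mean_dist(p, others):
        return sum(math.dist(p, q) for q in others) / len(others)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-point clusters
            continue
        # a_i: mean distance to the other points of the same cluster
        # (the distance from p to itself is 0, so divide by len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b_i: mean distance to the points of the closest other cluster.
        b = min(mean_dist(p, pts) for other, pts in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: a sensible labelling versus a deliberately mixed one.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

A score close to 1 indicates compact, well-separated clusters, so the candidate k with the highest average score is usually preferred.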

III Methodology and proposed research models

The proposed research methodology includes four major steps (Figure 2). In the first step, this study preprocesses the AdventureWorks dataset (including data cleaning, data selection, and data transformation). Preprocessing yields a clean dataset that reflects the values needed for the RFM model. Then the R, F, and M values are calculated, and the K-Means method is used to assign the data points to different clusters. The grouping is implemented in Python.


In addition, this study uses exploratory data analysis (EDA). Plots such as scatter plots, column plots, and box plots turn numbers, which are relatively difficult for the human brain to analyze as a whole, into easily read images. Using EDA helps speed up the process of figuring out what the data means, what it can be used for, and what conclusions can be drawn from it.

IV Experimental results and discussions

4.1 The process of customer segmentation

The whole process involves preprocessing data, RFM scoring, and K-Means customer segmentation.

4.1.1 Preparing and preprocessing dataset

The study uses the dataset of Adventure Works Cycles, a company that manufactures and distributes metal and composite bicycles to commercial markets in North America, Europe, and Asia. The figures in the dataset, including customers and sellers, were recorded over nearly three years, from 01/07/2017 to 15/06/2020. The full dataset is used to group customers into separate clusters, without restricting it to specific criteria.

4.1.2 Calculating R, F and M

RFM analysis is a popular segmentation method that ranks and groups customers by their behavior: how recently they purchased (Recency), how frequently they buy (Frequency), and their total or average transaction value (Monetary). To build an RFM model, the dataset used contains CustomerID, SalesOrderID, InvoiceDate, Quantity, and Unit Price, from which the Recency, Frequency, and Monetary values are calculated for each CustomerKey. The figures are shown in Table 1 below.

Specifically, Recency is derived from the InvoiceDate column: the maximum date across all customers minus the date in each row of a given CustomerID gives that customer’s R. Frequency is simply counted over all the Sales Amount entries of a customer. Monetary is calculated by multiplying Quantity by Unit Price and then summing the prices of all products bought by a customer.
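The calculation can be sketched as follows. This is an illustrative sketch rather than the study's actual code: the rows are invented, the function name `compute_rfm` is hypothetical, Recency is assumed to be the number of days between the latest date in the data and the customer's most recent purchase, and Frequency is assumed to be the number of distinct SalesOrderID values per customer.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (CustomerID, SalesOrderID, InvoiceDate, Quantity, UnitPrice)
rows = [
    (1, "SO1", date(2020, 6, 1), 2, 10.0),
    (1, "SO2", date(2020, 6, 10), 1, 25.0),
    (2, "SO3", date(2019, 12, 31), 3, 5.0),
    (2, "SO3", date(2019, 12, 31), 1, 100.0),
    (3, "SO4", date(2020, 6, 15), 1, 40.0),
]


def compute_rfm(rows):
    """Return {CustomerID: (recency_days, frequency, monetary)}."""
    # Reference point: the maximum InvoiceDate across all customers.
    reference = max(r[2] for r in rows)
    last_purchase = {}
    orders = defaultdict(set)
    spend = defaultdict(float)
    for cust, order, day, qty, price in rows:
        # Track the most recent purchase date of each customer.
        last_purchase[cust] = max(day, last_purchase.get(cust, day))
        # Assumption: Frequency counts distinct orders, not order lines.
        orders[cust].add(order)
        # Monetary: Quantity * Unit Price, summed over all products.
        spend[cust] += qty * price
    return {
        cust: ((reference - last_purchase[cust]).days, len(orders[cust]), spend[cust])
        for cust in last_purchase
    }


rfm = compute_rfm(rows)
print(rfm[1])  # (5, 2, 45.0)
```

With this definition, a lower Recency means a more recent purchase, so the customer who bought on the last recorded date gets a Recency of 0.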
