
Graduation Thesis: Exploring Machine Learning Techniques to Predict Customer Segmentation: Case study in Vietnam's Banking


Structure

  • 1.1 Aim and Objectives
    • 1.1.1 Aim
    • 1.1.2 Objectives
  • 1.2 Research Questions
  • 2.1 Machine Learning
  • 2.2 Machine Learning Algorithms
  • 2.3 Label Encoding
  • 2.4 Standard Scaler
  • 2.5 Classification Evaluation Metrics
  • 2.6 Working Environment
  • 2.7 Customer’s Segment
  • 3.1 Data Collection
  • 3.2 Data Cleaning & Preprocessing
    • 3.2.1 Removing Duplicates
    • 3.2.2 Handling Missing Values
    • 3.2.3 Label Encoding
    • 3.2.4 Standard Scaler
  • 3.3 Exploratory Data Analysis
    • 3.3.1 Univariate analysis
    • 3.3.2 Bivariate analysis
    • 3.3.3 Business Findings from Exploratory Data Analysis (EDA)
  • 3.4 Data Splitting
  • 3.5 Training Models
  • 3.6 Model Evaluation
    • 3.6.1 Confusion Matrix
  • 3.7 Testing Models

Content

VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
GRADUATION PROJECT
PROJECT NAME: Exploring Machine Learning Techniques to Predict Customer Segmentation: Case study in Vietnam's Banking

Aim and Objectives

Aim

The aim of this dissertation is to analyze and assess the use and effectiveness of advanced machine learning methodologies for customer segmentation, focusing on the competitive and evolving banking sector of Vietnam. Because this line of study targets applying these methodologies to optimize the delivery of specific value propositions in banking, achieve better tiered customer segmentation, and position financial organizations more strongly within a rapidly advancing sector, it aims to offer a clearer perspective on both the capabilities and the concerns attendant to applying machine learning in segmentation frameworks. This study therefore seeks a deeper understanding of how advanced analytical technology can be applied in the Vietnamese banking context, which aspects of banks' operations can be improved by adopting a data-driven segmentation approach, and how the further enhancement and integration of these processes may support the sustainable growth and competitiveness of Vietnamese banks in a continually growing and changing global market.

Objectives

Assess Current Segmentation Practices: Review the techniques currently used by banks in Vietnam to group customers, and analyze their strengths and weaknesses under the present conditions of the Vietnamese banking system.

Understand Data Availability and Quality: Investigate the current state and availability of data within the Vietnamese banking system in terms of volume, variety, velocity, and veracity, in relation to the feasibility and effectiveness of applying machine learning to segmentation.

Identify Key Segmentation Variables: Determine which types of variables (demographic, transactional, behavioral, etc.) are the most significant in predicting future customer behavior and preferences, given the nature of the Vietnamese market.

Select and Develop Machine Learning Models: Choose machine learning algorithms and methods suited to the type of customer segmentation required in the Vietnamese banking environment. Establish a rich customer database and build robust, reliable internal modeling structures capable of accurately segmenting clients based on prior data.

Evaluate Model Performance: Evaluate the efficacy of the developed machine learning models for predicting customer segments using metrics such as accuracy, precision, recall, and F1 score. Check the newly developed models to confirm their validity and their ability to replicate results on other datasets.

Interpret Model Results: Explain the findings gleaned from the machine learning models to foster a better understanding of the different client segments, their tendencies, and their preferences regarding banking services, so that these insights can be acted on by the observed banks.

Recommend Implementation Strategies: Suggest appropriate measures for the implementation and management of ML-based customer segmentation in the banking context, including specific tactics for its application, performance evaluation, and improvement.

Assess Impact on Business Outcomes: Assess the implications of applying machine learning to customer segmentation for customer satisfaction, retention, acquisition, cross-selling, and competitiveness within the banking field in the Vietnamese context.

Research Questions

Which machine learning algorithms and techniques are most suitable for customer segmentation in the context of Vietnam's banking industry?

Machine Learning

Machine learning (ML), a subset of computer science, has the potential to transform epidemiological sciences. With the growing emphasis on "Big Data," ML offers epidemiologists new strategies for addressing challenges that traditional approaches may not manage adequately. To successfully evaluate the integration of ML algorithms with existing approaches, it is critical to remove linguistic and technical obstacles between the two disciplines, which can prevent epidemiologists from fully understanding and evaluating ML research.

From the perspective of artificial intelligence, the primary drive for machine learning is the desire to acquire knowledge. Expert knowledge is frequently scarce (such as when classifying credit card fraud for all daily transactions) or the activities require quick decisions (such as stock trading, when decisions must be made within seconds), necessitating the employment of computers. Furthermore, the vast bulk of modern data is designed for computer reading, making it a valuable information resource. Machine learning focuses on developing models that allow computers to perceive, process, and learn from data in order to execute tasks and evaluate performance more efficiently.

Machine learning models are divided into three main categories:

Supervised learning entails categorizing data items in available data sets or predicting results with a high degree of accuracy.

Supervised learning can help enterprises solve a wide range of real-world problems. Some of the techniques used in supervised learning are: neural networks, which build complicated models that recognize the patterns involved in classification; Naive Bayes classification models; and linear regression, a strategy that uses a model to analyze the link between previous data and a prediction.

Logistic regression is an approach that delivers a probability for classification, while the random forest algorithm combines a number of decision trees into a single predictor.

Unsupervised learning, also referred to as unsupervised machine learning, is a category of machine learning algorithms used to cluster datasets in which the data has not been tagged or categorized and is in its raw form. These algorithms identify concealed patterns or clusters in the data without interference from human judgement.

This method proves very useful for data exploration and is mostly used in cross-selling strategies, customer segmentation, and image recognition.

It also helps reduce a model to a limited set of features through dimensionality reduction. Two of the most well-known methods for this problem are principal component analysis (PCA) and singular value decomposition (SVD). Other algorithms employed in this area include clustering neural networks, k-means, and probabilistic clustering techniques.

Semi-supervised learning sits between supervised and unsupervised learning methods.

During training, it uses less labelled data than a fully supervised procedure to guide the classification, and it extracts features from a larger pool of unlabelled data.

Semi-supervised learning is typically used when there is a scarcity of labeled data for a purely supervised learning technique.

Machine Learning Algorithms

The ML algorithms utilized in this thesis are explained in this section.

The Random Forest Classifier (RFC) is a widely used machine learning algorithm known for its strength and accuracy in classification tasks. The Random Forest algorithm combines the outputs of many decision trees to create a single, unified outcome. Its popularity is largely due to its ease of use and versatility, making it suitable for both classification and regression tasks.

The wide use of Random Forest stems from its user-friendliness and flexibility, letting it handle both classification and regression problems well. The algorithm's main strength is in managing complex data and reducing overfitting, making it a valuable tool for many prediction tasks in machine learning.

One of the most important features of the Random Forest algorithm is its ability to handle data with both continuous variables (for regression) and categorical variables (for classification). It consistently delivers high performance on both classification and regression tasks. In this thesis, I examine how the Random Forest algorithm works and use it for a classification task.

Logistic regression is a supervised machine learning algorithm designed to perform binary classification tasks by predicting the probability of a specific outcome, event, or observation. The model generates a binary or dichotomous result, confined to two possible outcomes: yes/no, 0/1, or true/false. It operates by analyzing the relationship between one or more independent variables and classifying data into distinct, discrete classes.

This algorithm is particularly valuable in predictive modeling, where it estimates the mathematical probability of an instance belonging to a particular category. By quantifying the likelihood of an outcome, logistic regression helps in making informed decisions based on the predicted probabilities. For instance, in a binary classification scenario, the model might use 0 to represent a negative class and 1 to represent a positive class. Logistic regression is especially effective for problems where the outcome variable is binary, taking either of the two categories (0 or 1). This makes it a widely used tool in various applications, including medical diagnosis, financial forecasting, and marketing analytics.

Naive Bayes classifiers comprise a collection of classification algorithms rooted in Bayes’ Theorem. Rather than being a single algorithm, Naive Bayes encompasses a family of algorithms that all share a common principle: every pair of features being classified is considered independent of each other.

One of the simplest and most effective classification algorithms, the Naive Bayes classifier facilitates the swift development of machine learning models with rapid prediction capabilities.

The Naive Bayes algorithm is predominantly used for classification problems and is especially prevalent in text classification. In text classification tasks, the data typically involves high dimensions, with each word representing a feature. The algorithm is widely applied in spam filtering, sentiment analysis, and rating classification. The primary advantage of using Naive Bayes is its speed, as it allows for fast and straightforward predictions even with high-dimensional data.

K-Nearest Neighbors (KNN) is one of the fundamental yet crucial classification algorithms in machine learning. As part of the supervised learning domain, KNN finds significant application in pattern recognition, data mining, and intrusion detection.

KNN is highly useful in real-life scenarios because it is non-parametric, meaning it does not make any underlying assumptions about the data distribution. This characteristic sets it apart from algorithms like Gaussian Mixture Models (GMM), which assume a Gaussian distribution for the data. In KNN, I am provided with prior data, also known as training data, which classifies coordinates into groups based on a specific attribute. The algorithm works by identifying the 'k' nearest data points to a given input and assigning the most common class among those points to the input, making it a versatile and practical tool for various classification tasks.

XGBoost, short for Extreme Gradient Boosting, is a powerful and scalable machine learning library designed for gradient-boosted decision trees (GBDT). It supports parallel tree boosting and is widely recognized as a leading library for solving regression, classification, and ranking problems.

To fully appreciate the capabilities of XGBoost, it is important to understand the foundational machine learning concepts it builds upon: supervised machine learning, decision trees, ensemble learning, and gradient boosting.

Supervised machine learning involves using algorithms to train a model on a dataset that includes both labels and features. The model learns to identify patterns within the data and then uses this knowledge to predict the labels of new datasets based on their features. XGBoost enhances these principles by optimizing the boosting process, resulting in faster and more accurate models. This makes it a vital tool for many machine learning tasks, providing robust performance and scalability.

Label Encoding

Label Encoding is a technique for converting categorical columns into numerical ones, allowing them to be analyzed by machine learning models that require numerical input.

This is an important pre-processing step in every machine learning project because most algorithms can only handle numerical data.

Label encoding assigns a distinct number to each category in a categorical feature. The mapping is usually arbitrary, which means that the numbers are assigned with no underlying ordinal relationship. For example, if a column has three categories, "Red," "Green," and "Blue," label encoding may assign 0 to "Red," 1 to "Green," and 2 to "Blue."

Standard Scaler

StandardScaler is a scikit-learn preprocessing tool that standardizes features by removing the mean and scaling them to unit variance.

The Standard Scaler subtracts the mean of each feature from the data points and divides the result by the feature's standard deviation. This can be stated mathematically as z = (X - μ) / σ, where:

 X is the original data point

 μ is the mean of the feature

 σ is the standard deviation of the feature

This process centers the data around zero with unit variance, making it easier to compare features that originally had different units or scales. Standardization is especially crucial for machine learning algorithms that assume normally distributed input data, such as linear regression, logistic regression, support vector machines, and neural networks. These techniques are sensitive to the scale of the input features, and unstandardized data might result in poor model performance or slow convergence during training.

The benefits of using the Standard Scaler include:

1. Standardized data improves model performance and training speed.

2. Equal contribution of features: making sure all features contribute equally prevents features with larger scales from dominating the model training process.

3. Standardized data facilitates feature comparison across multiple units.

It is vital to highlight that the scaler should be fitted on the training data first and only then applied to the test data.

This maintains consistency and prevents data leakage, which occurs when information from the test set affects the training process.
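As a small illustration of this fit-on-train, transform-on-test rule, here is a minimal sketch with toy numbers (not the thesis data):

```python
# Minimal sketch: standardizing features with scikit-learn's StandardScaler.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0],
                    [2.0, 300.0],
                    [3.0, 400.0]])
X_test = np.array([[2.5, 250.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics (no leakage)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0))   # 1 for each feature
```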

Classification Evaluation Metrics

A confusion matrix is a valuable tool for assessing the performance of a classification model. It gives a detailed breakdown of how the model's predictions compare to the actual values of the target variable. We can obtain several essential metrics from the confusion matrix to evaluate the model's accuracy, precision, recall, and overall efficacy. The confusion matrix consists of four fundamental components, which are particularly relevant in binary classification problems:

• True Positive (TP): The model accurately predicts the positive class

• False Positive (FP): The model predicts a positive class when it is actually negative

• True Negative (TN): The model accurately predicts the negative class

• False Negative (FN): The model predicts a negative class instead of a positive one.

Key Metrics Derived from the Confusion Matrix

1. Accuracy: This indicator shows the percentage of correctly categorized instances out of the total, i.e. (TP + TN) / (TP + TN + FP + FN).

2. Precision: This indicator measures the proportion of positive predictions that are truly positive, i.e. TP / (TP + FP).

3. Recall (Sensitivity or True Positive Rate): This metric measures the proportion of actual positive cases that the model correctly identifies, i.e. TP / (TP + FN).

4. F1 Score: The F1 Score is a balanced metric that takes into account both false positives and false negatives. It is calculated as the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall).

The ROC (Receiver Operating Characteristic) curve is a graphical depiction of a classification model's performance, contrasting the True Positive Rate (TPR) against the False Positive Rate (FPR) for different threshold values. It is a useful tool for comparing classifiers and understanding how they behave at different decision thresholds.

 True Positive Rate (TPR): Also known as recall, it is defined as TPR = TP / (TP + FN).

 False Positive Rate (FPR): It is defined as FPR = FP / (FP + TN).

The AUC (Area Under the ROC Curve) is a scalar metric that summarizes the classifier's overall performance. A higher AUC indicates better model performance, with 1 corresponding to a perfect classifier and 0.5 signifying a model without discriminative capacity.

Data scientists and machine learning practitioners can obtain deeper insights into their model's strengths and limitations by utilizing these metrics and visual tools, ultimately leading to improved model selection and modification.
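A minimal sketch of how these metrics can be computed with scikit-learn on toy binary labels (not the thesis data):

```python
# Minimal sketch: deriving the metrics discussed above with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                  # actual classes (toy values)
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                  # predicted classes
y_score = [0.2, 0.7, 0.9, 0.8, 0.4, 0.1, 0.6, 0.3]  # predicted probability of class 1

print(confusion_matrix(y_true, y_pred))              # rows = true labels, columns = predictions
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc auc  :", roc_auc_score(y_true, y_score))
```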

Working Environment

This experiment was conducted using Jupyter Notebook, a versatile and widely used platform for data analysis and scientific computing. Unlike traditional local systems, Jupyter Notebook offers an interactive environment that integrates code execution, visualization, and rich text within a single document, providing significant flexibility and ease of use for data analysis tasks. This thesis utilizes several Python libraries, including:

Pandas is a Python library that offers fast, flexible, and expressive data structures designed to simplify working with "relational" or "labeled" data. It aims to be the fundamental high-level building block for practical, real-world data analysis in Python. Pandas can read data from various sources such as CSV, Excel, SQL, and JSON files and handle large datasets efficiently. It provides two main data structures: Series and DataFrame. The Series is a one-dimensional labeled array capable of holding any data type, while the DataFrame is a two-dimensional labeled data structure with columns of potentially different types. Pandas is also compatible with other popular Python libraries like NumPy, Matplotlib, and Scikit-learn.

Scikit-learn, also known as sklearn, is a popular open-source Python library extensively used for data analysis, data preprocessing, and machine learning algorithms. Built on top of NumPy, SciPy, and Matplotlib, Scikit-learn focuses on performance and scalability, enabling it to handle large datasets and complex machine learning tasks. Additionally, Scikit-learn offers tools for feature selection, feature extraction, and model evaluation.

Matplotlib is a well-known open-source Python library created by John D. Hunter in 2003 for scientific and technical computing. It is widely used for creating high-quality visualizations, including line plots, scatter plots, bar plots, and histograms. Matplotlib seamlessly integrates with other popular Python libraries, such as NumPy and Pandas.

NumPy, short for Numerical Python, is a widely used open-source Python library that provides support for multidimensional arrays, matrices, and a variety of mathematical functions. It is extensively used for data analysis, data preprocessing, and numerical operations. One of NumPy's key advantages is its performance, as it is built on top of the C programming language. Additionally, NumPy is compatible with other popular Python libraries like SciPy and Matplotlib.

Customer’s Segment

New_Segment: There are three groups in the New segment in the data:

 Mass: The mass customer group with average income and assets. This group comprises the majority of the population and typically uses basic financial services.

 Mass Affluent: The customer group with higher income and assets than the Mass group but not yet at the Affluent level. They have higher spending capacity and often use more premium financial services.

 Affluent: The wealthy customer group with the highest income and assets. They use the most premium financial services and have complex investment and asset management needs.

Customer_Segment: There are seven groups in the Customer segment:

 CN01 Segment: Largest, over 800,000 customers, lower average income.

 Smaller Segments (CN02, CN03, CN04): Fewer customers, higher and variable incomes.

 CN06 Segment: Highest average income, many high-income outliers.

 Other Segments (CN05, CN08): Mix of medium- and high-income customers.

Figure 3.1: Experimentation procedure.

Data Collection

For this analysis, we utilized a real-world dataset concerning customer segmentation from a bank in Vietnam, collected in 2022. This comprehensive dataset comprises information about the bank's customers, encompassing both personal and financial details of the applicants. It consists of 23 variables, which include 22 features and 1 target variable, spanning more than 1 million records. The target variable, represented by the "new_segment" column, indicates the segment to which each customer belongs. Below is a detailed description of the dataset's variables:

 _col0: Customer's code
 age: Age of the customer (years)
 customer_segment: The former segment to which a customer belongs
 thu_nhap: Total income of the customer (per month/year)
 casa: Customer's current account savings account (đ)
 fd: Customer's fixed deposit account (đ)
 avg_loan: Average loan amount that the customer borrows (đ)
 vay_mua_oto: Car loan amount (đ)
 vay_tieu_dung: Consumer loan amount (đ)
 vay_sxkd: Loan amount for production and business (đ)
 vay_mua_bds: Loan amount to buy real estate (đ)
 vay_dac_thu: Loans that are unique or specialized (đ)
 vay_khac: Other loans
 tai_san_bds: Real estate assets that the customer owns
 tai_san_oto: Car assets that the customer owns
 chi_tieu_the: Card spending
 job_title: Customer's job
 toi: Profit from banking business activities for the customer
 atm_debit_atm_card: Debit card amount; a debit card is a card issued by the bank to the cardholder for payment instead of cash
 atm_debit_post_card: Postpaid debit card account
 atm_credit_card: Credit card; a type of card that allows customers to spend first and pay later without having money on the card
 cnt_service: The count of services that a customer is using or has subscribed to with the bank
 new_segment: New customer segment

Table 3.1: Definition of variables in the dataset

Data Cleaning & Preprocessing

Removing Duplicates

- Initially, there were duplicate rows present in the DataFrame `data`

- The code successfully identified and removed all duplicate rows

- After removing the duplicates, the DataFrame no longer contains any duplicate rows, as indicated by the output `0` from the `print(data.duplicated().sum())` statement

This operation ensures that the dataset is now free from redundant data, which can improve the accuracy and reliability of subsequent data analysis or machine learning models applied to this dataset.
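A minimal sketch of this step with pandas; the file name is only a placeholder for the actual dataset path:

```python
# Minimal sketch of the duplicate-removal step described above.
import pandas as pd

data = pd.read_csv("bank_customers.csv")  # placeholder path for the real dataset

print(data.duplicated().sum())                        # duplicate rows before cleaning
data = data.drop_duplicates().reset_index(drop=True)  # keep the first occurrence of each row
print(data.duplicated().sum())                        # expected output: 0
```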

Handling Missing Values

The dataset has been checked for missing values, and the results show:

- The “job_title” column has 1,010,214 missing entries

- All other columns have no missing values
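A minimal sketch of this check, continuing from the previous snippet (so `data` is the de-duplicated DataFrame); whether to drop or impute the almost-empty "job_title" column is a judgment call, and the drop shown here is only one option, not necessarily the thesis's choice:

```python
# Per-column count of missing entries; only job_title is affected.
print(data.isnull().sum())

# One possible treatment (an assumption): with roughly one million missing
# entries, the column carries little usable signal, so it can be dropped.
data = data.drop(columns=["job_title"])
```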

Label Encoding

 The code iterates through all columns of the DataFrame data

 For each column with a data type of 'object' (i.e., categorical variables), the code applies the LabelEncoder to transform the categorical values into numerical labels

 This step is crucial for preparing the dataset for machine learning models, which generally require numerical input

 Converting categorical variables into numerical labels allows the dataset to be used effectively in various machine learning algorithms

All categorical columns in the dataset are encoded as numerical values, making the dataset suitable for machine learning tasks.

This transformation ensures that the dataset is now fully numerical and ready for further analysis or model building.
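A minimal sketch of this encoding loop, continuing from the cleaning sketches above (so `data` is the cleaned DataFrame):

```python
from sklearn.preprocessing import LabelEncoder

# Replace every remaining 'object' (categorical) column by integer labels.
for col in data.select_dtypes(include="object").columns:
    data[col] = LabelEncoder().fit_transform(data[col].astype(str))

print(data.dtypes)  # all columns should now be numeric
```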

Standard Scaler

 Purpose: To standardize the feature values by removing the mean and scaling to unit variance

 Details:
  o StandardScaler(): A scaler from sklearn.preprocessing used for standardization
  o fit_transform(X_train): Fits the scaler on the training data and transforms it
  o transform(X_test): Transforms the test data using the previously fitted scaler

Data Preparation: The dataset is successfully split into training and testing sets, with 80% of the data used for training and 20% for testing.

Data Standardization: The feature values are standardized, ensuring that they have a mean of 0 and a standard deviation of 1. This step is crucial for many machine learning algorithms to perform optimally.

These operations ensure that the data is ready for model training and evaluation, providing a standardized and reproducible framework for building and testing machine learning models.

Exploratory Data Analysis

Univariate analysis

The histogram reveals a broad spectrum of customer ages, ranging from the late teens to around 60 years. Notably, there is a pronounced peak at approximately 30 years of age, indicating that this is the most common age among the bank's clientele. The frequency of customers in this age bracket significantly surpasses that of any other group, highlighting a substantial concentration of young adults within the bank's customer base.

The age distribution is right-skewed, with a higher concentration of customers in the younger age brackets (20-40 years) and a gradual decrease in customer numbers as age increases beyond 40. This skewness suggests a youthful customer demographic, with fewer older individuals engaging with the bank. The density plot, illustrated by the smooth blue line, further confirms this observation by showing the probability density function of the age variable. This plot gives a more refined and continuous picture of the age distribution, which helps to strengthen the conclusions drawn while smoothing out the high and low peaks seen in the histogram.

The visualization further confirms that the bank's core clients are young adults and middle-aged people in the 20-40 age range. This demographic information is important to the strategic development of the bank and its future marketing campaigns and interventions, indicating that products and services catering to this age group may be a profitable approach. Furthermore, the low count of customers above 40 suggests that the bank should focus on retaining the customers it has already attained and consider developing products suited to this demographic, thus expanding its market.

In summary, the count of customers by age depicted in the age distribution plot (Figure 3.2) offers valuable insight into the bank's customers, with a clear concentration among the young.

It is critical for the organization to have access to such information for the purposes of product design, marketing, and customer care, so that the bank can target its key consumers while also exploring further market opportunities.

Figure 3.3: Number of customers by segment

The bar chart depicted in Figure 3.3 provides a detailed visualization of the distribution of customers across the various segments within the dataset. The chart illustrates a significant disparity in the number of customers among the different segments. The most prominent observation is the overwhelming dominance of the "CN01" segment, which accounts for the majority of the customer base, with a count exceeding 800,000. This segment's customer count far surpasses that of any other segment, indicating that "CN01" is the predominant segment within the bank's clientele.

Other segments, such as "CN02," "CN03," and "CN04," have substantially fewer customers, each with counts markedly lower than that of "CN01." The remaining segments, "CN05," "CN06," and "CN08," have an almost negligible number of customers, as evidenced by the very short bars in the chart. This distribution suggests a highly uneven spread of customers across the various segments.

The dominance of the "CN01" segment indicates that the bank's primary focus should be on this segment due to its substantial customer base. This segment is critical for the bank's revenue and customer engagement strategies. Ensuring the retention and satisfaction of customers in this segment should be a priority, as it constitutes the bulk of the customer base. Moreover, the bank can leverage its understanding of the needs and preferences of this segment to tailor its products and services accordingly.

The relatively smaller segments, such as "CN02," "CN03," and "CN04," present potential growth opportunities. The bank could investigate the specific needs and preferences of customers within these segments to devise targeted strategies for increasing their customer base. This might include specialized marketing campaigns, personalized services, or products designed to meet the unique demands of these segments. By focusing on these areas, the bank can potentially diversify its customer base and reduce its reliance on the "CN01" segment.

Given the significant dominance of the "CN01" segment, the bank might consider allocating more resources to retain and further develop this segment. However, it is also important to explore ways to attract and retain customers in the smaller segments. Balancing resource allocation between maintaining the dominant segment and growing the smaller segments could lead to a more robust and diversified customer base.

The customer segment distribution plot in Figure 3.3 provides valuable insights into the bank's customer segmentation. The dominance of the "CN01" segment highlights a concentration of customers within a single segment, suggesting a targeted approach for product development and marketing strategies. Simultaneously, the presence of smaller segments indicates opportunities for diversification and growth by catering to the needs of these less-represented customer groups. This analysis underscores the importance of understanding customer distribution to inform strategic decisions and optimize resource allocation effectively. By focusing on both retaining the dominant segment and exploring growth opportunities in smaller segments, the bank can enhance its overall customer engagement and business performance.

Figure 3.4: Income distribution by customer segment

The box plot presented in Figure 3.4 provides a comprehensive visualization of income distribution across the various customer segments within the dataset. Each box plot represents a distinct customer segment, offering insights into the central tendency, dispersion, and potential outliers of the income variable (`thu_nhap`) within each segment.

The box plot reveals notable variations in income distribution among the different customer segments. The segment labeled "CN01," despite having the largest customer base, exhibits a relatively lower median income compared to other segments. This indicates that while "CN01" encompasses a significant portion of the customer population, the income levels within this segment are comparatively modest.

Segments "CN02" and "CN03" show a wider range of income levels, with the median income being higher than that of "CN01." These segments also display a considerable number of outliers, suggesting the presence of customers with exceptionally high incomes. The variability within these segments indicates a diverse income distribution, with a significant portion of customers earning substantially more than the median.

The "CN06" segment stands out with the highest median income among all segments. This segment also exhibits a broader interquartile range (IQR), indicating a more dispersed income distribution. The presence of numerous outliers in "CN06" suggests that this segment includes customers with significantly higher incomes compared to others. This wide spread of income levels points to a segment with both affluent customers and those with lower incomes.

The segments "CN04," "CN05," and "CN08" also show interesting patterns. "CN04" has a relatively narrow IQR and lower median income, suggesting a more homogeneous group with fewer high-income outliers. "CN05" and "CN08" have similar distributions, with moderate medians and some high-income outliers, indicating a mix of middle-income and high-income customers.

Bivariate analysis

Figure 3.9 shows a correlation matrix that summarizes the links between the dataset's numerous financial variables. Each cell in the matrix indicates a correlation coefficient between two variables, which can range from -1 to 1. A value closer to 1 suggests a strong positive relationship, whereas a value closer to -1 indicates a strong negative relationship. Values approaching zero indicate little or no association.

The most notable observation is the high correlation between the variables "vay_mua_bds" (real estate loan) and "avg_loan" (average loan amount), with a correlation coefficient of 0.81. This strong positive correlation indicates that customers with higher average loan amounts are likely to have substantial real estate loans. Similarly, there is a significant correlation between "vay_sxkd" (business production loan) and "avg_loan," with a coefficient of 0.53, suggesting that business loans also contribute significantly to the total loan amount.

Another prominent relationship is observed between "amt_debit_atm_card" (ATM card debit amount) and "chi_tieu_the" (credit card spending), with a correlation coefficient of 0.97. This near-perfect correlation suggests that customers who spend more on their credit cards also tend to withdraw and spend more using their ATM cards. This relationship can be leveraged to identify high-spending customers and tailor financial products to their needs.

The variable "fd" (fixed deposits) shows a moderate positive correlation of 0.35 with "cnt_service" (number of services used), indicating that customers with more fixed deposits tend to use a wider range of banking services. This insight can help the bank in cross-selling additional services to customers with significant fixed deposits.

"casa" (current account and savings account) is positively correlated with "cnt_service," with a coefficient of 0.14. This suggests that customers with higher balances in their current and savings accounts tend to use more banking services. The bank can target these customers for promotional offers and new service introductions.

Interestingly, "thu_nhap" (income) shows relatively low correlations with most variables, with the highest being 0.19 with "casa." This indicates that higher income does not strongly correlate with other financial behaviors in this dataset, highlighting the importance of understanding other factors influencing customer financial activities.

Summing up, the correlation matrix reveals key relationships between various financial variables, offering valuable insights for strategic decision-making. The strong correlations between certain loan types and average loan amounts, as well as between spending behaviors on different cards, can guide the bank in designing targeted financial products and services. Additionally, understanding the moderate correlations with fixed deposits and service usage can help the bank enhance customer engagement and cross-sell opportunities. Overall, this analysis underscores the importance of leveraging data-driven insights to tailor banking strategies effectively to meet customer needs.
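A correlation matrix like the one in Figure 3.9 can be produced with Pandas and Matplotlib; below is a minimal sketch, assuming `data` is the fully encoded, numeric DataFrame from the preprocessing steps:

```python
import matplotlib.pyplot as plt

corr = data.corr()  # pairwise Pearson correlation coefficients

plt.figure(figsize=(12, 10))
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="correlation coefficient")
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlation matrix of the financial variables")
plt.tight_layout()
plt.show()
```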

Business Findings from Exploratory Data Analysis (EDA)

The exploratory data analysis (EDA) of the bank's dataset has yielded several important insights that have significant implications for business strategy. These findings are crucial for informing the bank's approach to product development, customer engagement, and overall strategic planning.

The analysis of the age distribution reveals that the majority of the bank's customers are young adults and middle-aged individuals, particularly those between the ages of 20 and 40. This demographic insight suggests that the bank's primary customer base is relatively young and likely to be tech-savvy. Consequently, there is a strong case for the bank to enhance its digital and mobile banking solutions to better cater to the preferences and needs of this age group.

In examining customer segmentation, it was observed that the segment labeled "CN01" comprises the majority of the customer base. However, this segment also displays lower median income levels compared to other segments. This indicates a need for the bank to tailor its products and services to meet the needs of this large, lower-income segment, while simultaneously exploring opportunities to expand its offerings to smaller, higher-income segments. Developing targeted marketing campaigns and tailored financial products for these higher-income segments can help attract and retain affluent customers.

The income distribution analysis indicates that segments such as "CN02," "CN03," and "CN06" have substantial income levels, which suggests potential for offering premium services, wealth management, and investment products to these customers. The bank should leverage this opportunity by developing and promoting financial products that cater specifically to high-income individuals.

Loan distribution analysis reveals a pattern heavily skewed towards smaller loan amounts, indicating either conservative borrowing behavior or a customer base with lower borrowing needs. This suggests an opportunity for the bank to develop micro-loan products or low-interest loan schemes to better serve these customers. Additionally, ensuring robust risk management practices for higher loan amounts will be crucial in mitigating potential financial risks.

The analysis of job titles and corresponding incomes highlights that the highest-paying job titles are predominantly senior management and specialized roles. This indicates the presence of affluent professionals within the bank's customer base. The bank can capitalize on this by offering tailored financial advice, exclusive banking services, and customized loan products to these high-earning individuals. Conversely, for lower-paying job titles, the bank should ensure that basic banking services remain accessible and affordable, potentially offering financial literacy programs and budget management tools to support these customers.

Spending behavior analysis shows a high correlation between ATM card spending and credit card spending, indicating that customers who spend more on credit cards also tend to use their ATM cards frequently. This insight allows the bank to identify high-spending customers and offer them personalized credit card rewards, cashback programs, and other spending incentives to enhance customer loyalty and engagement.

The correlation between the number of banking services used and both fixed deposits and current account balances suggests significant cross-selling opportunities. The bank can encourage customers with substantial deposits to utilize a wider range of banking services, such as insurance, investment products, and financial planning services, thereby deepening customer relationships and increasing service utilization.

Data Splitting

Splitting the dataset into training and testing sets is a crucial step in machine learning workflows. It allows the model to be trained on one subset of the data and evaluated on another, unseen subset. This process helps to assess the model's generalization ability and ensures that it performs well on new, unseen data. Using only the most important features (`X_important`) focuses the model on the most relevant information, potentially leading to better performance and more efficient computations.

 Purpose: To divide the dataset into training and testing subsets, as sketched below.

 Details:
  o X: Features of the dataset
  o y: Target variable of the dataset
  o train_test_split: A function from sklearn.model_selection that splits the data
  o test_size=0.2: 20% of the data will be used for testing, and 80% for training
  o random_state: Ensures reproducibility by fixing the random seed.
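A minimal sketch of the split followed by scaling, assuming `X_important` and `y` are already defined; the seed value 42 is an illustrative assumption rather than the thesis's exact setting:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 80/20 split of the selected features and the "new_segment" labels.
X_train, X_test, y_train, y_test = train_test_split(
    X_important, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training split only
X_test_scaled = scaler.transform(X_test)        # transform the test split with the same statistics
```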

Training Models

During this step, the preprocessed dataset was used to build machine learning models using various techniques, including Logistic Regression Classifier (LRC), Random Forest Classifier (RFC), XGBoost Classifier (XGB), K-Nearest Neighbors (KNN), and Naive Bayes (NB). The training dataset was used to educate these models, and the "fit" procedure was applied to each algorithm. The input features (X_train_imp) and associated output labels (y_train) were given as parameters. The "fit" method calibrates each model by altering its internal parameters to reduce the difference between projected and actual results. After training, the models are ready to make predictions on previously unseen data. This is accomplished using the "predict" method, which accepts the input attributes of the new data and returns the matching predictions.
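A minimal sketch of this fit/predict step is given below; the hyperparameters are library defaults (plus a larger max_iter for Logistic Regression), and GaussianNB is assumed for the Naive Bayes variant, neither of which is confirmed as the thesis's exact configuration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# X_train_imp, X_test_imp, y_train come from the preprocessing and splitting steps.
models = {
    "LRC": LogisticRegression(max_iter=1000),
    "RFC": RandomForestClassifier(),
    "XGB": XGBClassifier(),
    "KNN": KNeighborsClassifier(),
    "NB":  GaussianNB(),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train_imp, y_train)                # calibrate internal parameters
    predictions[name] = model.predict(X_test_imp)  # predict previously unseen data
```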

Handle Imbalanced Data with SMOTE:

Cross-Validation for All Models:
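A minimal sketch of these two steps, assuming the `models` dictionary and training split from the previous sketch and that the imbalanced-learn package is available; the resampling seed, cv=5, and accuracy scoring are illustrative choices:

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score

# Oversample the minority segments in the training data only (never the test data).
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train_imp, y_train)

# 5-fold cross-validation for every model on the balanced training data.
for name, model in models.items():
    scores = cross_val_score(model, X_train_bal, y_train_bal, cv=5, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {scores.mean():.4f}")
```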

Model Evaluation

Confusion Matrix

Figure 3.10: Confusion Matrix - Logistic Regression

The confusion matrix presented in Figure 3.10 provides a detailed evaluation of the performance of the logistic regression model applied to the dataset. The matrix displays the number of correct and incorrect predictions made by the model, categorized by the true labels and predicted labels.

In the confusion matrix, the rows represent the actual classes (true labels), while the columns represent the predicted classes (predicted labels). The matrix is divided into three classes, labeled 0, 1, and 2.

- The model accurately predicted 1,884 instances of type 0 (True Positives)

- There were 685 cases where class 0 was wrongly projected as class 1 (False Negatives)

- There were 327 cases where class 0 was wrongly projected as class 2 (False Negatives)

- The model accurately predicted 170,426 cases as class 1 (True Positives)

- There were 210 cases where class 1 was wrongly projected as class 0 (False Negatives)

- There were 1,605 cases where class 1 was wrongly projected as class 2 (False Negatives)

- The model successfully predicted 19,335 cases as class 2 (true positives)

- There were 694 instances where class 2 was wrongly projected as class 0 (False Negatives)

- There were 3,105 cases where class 2 was mistakenly projected as class 1 (False Negatives)

This confusion matrix shows that the logistic regression model does a good job of predicting the majority class (class 1), with a large number of true positives (170,426) and few false negatives and false positives. However, the model exhibits considerable confusion when discriminating between classes 0 and 2, as evidenced by the increased frequency of false negatives and false positives in these classes.

The matrix's diagonal elements reflect correct predictions, whereas the off-diagonal elements represent incorrect classifications. The proportion of correctly predicted instances (the sum of the diagonal elements) to the total number of instances indicates the model's overall accuracy.

To sum up, the confusion matrix provides a comprehensive view of the logistic regression model's classification performance, highlighting its strengths in predicting the majority class and areas for improvement in distinguishing between the minority classes. This analysis is crucial for understanding the model's behavior and guiding further refinements to enhance its predictive accuracy and reliability across all classes.

The confusion matrix presented in Figure 3.11 evaluates the performance of the K-Nearest Neighbors (KNN) model applied to the dataset. This matrix provides insights into the accuracy and misclassification patterns of the model across three classes, labeled 0, 1, and 2. The matrix is structured with the true labels along the vertical axis and the predicted labels along the horizontal axis. Each cell represents the number of instances for a particular combination of true and predicted labels.

- The model accurately predicted 2,710 instances of type 0 (True Positives)

- There were 51 cases where class 0 was wrongly projected as class 1 (False Negatives)

- There were 135 cases where class 0 was wrongly projected as class 2 (False Negatives)

- The model accurately predicted 171,861 cases as class 1 (True Positives)

- There were 35 cases where class 1 was wrongly forecasted as class 0 (False Negatives)

- There were 345 cases where class 1 was wrongly projected as class 2 (False Negatives)

- The model successfully predicted 22,521 cases as class 2 (true positives)

- There were 40 cases where class 2 was wrongly projected as class 0 (False Negatives)

- There were 573 cases where class 2 was wrongly projected as class 1 (False Negatives)

The confusion matrix reveals that the KNN model demonstrates high accuracy in predicting class 1, with a substantial number of true positives and relatively low numbers of false negatives and false positives. However, the model shows some confusion in distinguishing between classes 0 and 2, as indicated by the non-zero off-diagonal elements. This suggests that while the KNN model is effective in classifying the majority class, it requires further tuning to improve its performance on the minority classes.

In conclusion, the confusion matrix analysis highlights the strengths and weaknesses of the KNN model in classifying the dataset. The model performs exceptionally well in predicting the majority class but exhibits some misclassification errors with the minority classes. These insights are crucial for guiding further refinements to enhance the model's predictive accuracy and reliability across all classes. Understanding these patterns helps in identifying areas for improvement, ensuring that the model can achieve better overall classification performance and accuracy. This comprehensive evaluation is vital for making informed decisions about model selection and optimization in future applications.

Figure 3.12: Confusion Matrix - Naive Bayes

Figure 3.12 shows the confusion matrix, which analyzes the performance of the Naive Bayes model on the dataset. This matrix shows the model's accuracy and misclassification trends over three classes, labeled 0, 1, and 2. The true labels are arranged vertically, whereas the predicted labels are arranged horizontally. Each cell shows the number of instances for a specific combination of true and predicted labels.

- The model successfully predicted 1,702 cases as class 0 (true positives)

- There were 1,182 cases where class 0 was wrongly projected as class 1 (False Negatives)

- There were 12 cases where class 0 was wrongly projected as class 2 (False Negatives)

- The model accurately predicted 167,682 cases as class 1 (True Positives)

- There were 226 cases where class 1 was wrongly projected as class 0 (False Negatives)

- There were 4,333 instances where class 1 was wrongly projected as class 2 (False Negatives)

- The model accurately classified 14,119 cases as class 2 (True Positives)

- There were 1,554 instances of class 2 being mistakenly forecasted as class 0 (False Negatives)

- There were 7,461 instances where class 2 was wrongly forecasted as class 1 (False Negatives)

The confusion matrix reveals that the Naive Bayes model demonstrates high accuracy in predicting class 1, with a substantial number of true positives and relatively low numbers of false negatives and false positives. However, the model shows some confusion in distinguishing between classes 0 and 2, as indicated by the non-zero off-diagonal elements. This suggests that while the Naive Bayes model is effective in classifying the majority class, it requires further tuning to improve its performance on the minority classes.

In conclusion, the confusion matrix analysis highlights the strengths and weaknesses of the Naive Bayes model in classifying the dataset. The model performs well in predicting the majority class but exhibits some misclassification errors with the minority classes. These insights are crucial for guiding further refinements to enhance the model's predictive accuracy and reliability across all classes. Understanding these patterns helps in identifying areas for improvement, ensuring that the model can achieve better overall classification performance and accuracy. This comprehensive evaluation is vital for making informed decisions about model selection and optimization in future applications.

The confusion matrix shown in Figure 3.13 assesses the performance of the XGBoost model on the dataset. This matrix shows the model's accuracy and misclassification trends over three classes, labeled 0, 1, and 2. The true labels are arranged vertically, whereas the predicted labels are arranged horizontally. Each cell shows the number of instances for a specific combination of true and predicted labels.

- The model accurately predicted 2,723 instances of type 0 (True Positives)

- There were no instances of class 0 being mistakenly predicted as class 1 (false negatives)

- There were 173 cases where class 0 was wrongly projected as class 2 (False Negatives)

- The model accurately predicted 172,094 cases as class 1 (True Positives)

- There were no instances of class 1 being mistakenly predicted as class 0 (false negatives)

- There were 147 cases where class 1 was wrongly projected as class 2 (False Negatives)

- The model accurately predicted 22,628 cases as class 2 (True Positives)

- There were 203 cases where class 2 was wrongly projected as class 0 (False Negatives)

- There were 303 cases where class 2 was wrongly projected as class 1 (False Negatives)

The confusion matrix reveals that the XGBoost model demonstrates high accuracy across all classes, with substantial numbers of true positives and relatively low numbers of false negatives and false positives. The model shows minimal confusion between classes, as indicated by the low off-diagonal counts. This suggests that the XGBoost model is highly effective in classifying the dataset accurately.

In conclusion, the confusion matrix analysis highlights the strengths of the XGBoost model in classifying the dataset. The model performs exceptionally well across all classes, with minimal misclassification errors. These insights are crucial for confirming the model's predictive accuracy and reliability, ensuring that it is a robust choice for deployment. Understanding these patterns helps in making informed decisions about model selection and optimization, ultimately achieving better overall classification performance and accuracy in real-world applications.

Figure 3.14: Confusion Matrix - Random Forest

Figure 3.14 shows the confusion matrix, which analyzes the performance of the Random Forest model on the dataset. This matrix shows the model's accuracy and misclassification trends over the same three classes, with the true labels arranged vertically and the predicted labels arranged horizontally.

Testing Models

The following techniques were used to create predictions on the test data: Logistic Regression Classifier (LRC), Random Forest Classifier (RFC), XGBoost Classifier (XGB), K-Nearest Neighbors (KNN), and Naive Bayes (NB). To select the best model, the performance of each model was assessed using several classification measures such as the F1 score and ROC AUC. These measures are critical for acquiring a thorough view of the models' efficacy, as they take into account both precision and recall, as well as the models' capacity to distinguish between classes.

The F1 score, which is the harmonic mean of precision and recall, is critical for determining the model's accuracy (the correctness of positive predictions) and completeness (the capacity to detect all positive cases). A higher F1 score reflects stronger overall model performance.

The ROC AUC metric (Receiver Operating Characteristic Area Under Curve) assesses the model's ability to distinguish between classes at all classification thresholds. A higher ROC AUC value indicates that the model performs better in terms of true and false positive rates.

Through the evaluation of these metrics, I can discern which model excels in predictive accuracy and reliability, thus guiding the selection of the most appropriate model for real-world deployment. A detailed evaluation of these metrics follows, providing an in-depth understanding of each model's strengths and areas for improvement.

In this study, accuracy is the primary performance indicator used to determine the most precise algorithm among the selected machine learning algorithms. To evaluate the efficacy of these algorithms, the testing dataset is used, and the accuracy results are carefully compared.

Table 4.1: Accuracy of the testing models

Based on the accuracy metrics presented in Table 4.1, we can derive several key insights for each model following rigorous evaluation.

 The Random Forest model achieved the highest accuracy at 99.993%. This indicates that the Random Forest model is the most effective among the tested algorithms in predicting the correct customer segments. The ensemble nature of Random Forest, which aggregates the predictions of multiple decision trees, likely contributes to its superior performance.

 The XGBoost model demonstrated an impressive accuracy of 99.583%, making it the second-best performing model. XGBoost is known for its efficiency and performance in handling large datasets and complex classification tasks. Its high accuracy underscores its robustness and reliability as a predictive model.

 The KNN model also performed exceptionally well, with an accuracy of 99.405%. This high accuracy suggests that KNN is effective in capturing the patterns and relationships within the dataset, making it a strong contender for predicting customer segments.

 Logistic Regression achieved an accuracy of 96.658%. While this is lower than the ensemble and instance-based learning methods, it still represents a solid performance. Logistic Regression is a simpler model compared to Random Forest and XGBoost, yet it provides valuable insights and reliable predictions.

 The Naive Bayes model had the lowest accuracy among the tested models, at 92.552%. Although this accuracy is still reasonably high, it indicates that Naive Bayes may not be as effective as the other models for this particular dataset. This could be due to the model's assumptions about feature independence, which might not hold true for this dataset.

⇒ Based on accuracy, RFC, KNN, or XGB may be a good choice. So we will compare the ROC-AUC scores of these three models to find the best model.

Figure 4.1: ROC Curve of KNN

Figure 4.2: ROC Curve of XGB

Figure 4.3: ROC Curve of RFC

After analyzing the ROC Curve results, it became evident that all three models achieved scores close to 1, indicating near-perfect classification performance. Therefore, the selection of the optimal model was based on their accuracy indices. Among the five models evaluated, the Random Forest Classifier (RFC) demonstrated the highest accuracy index, making it the superior choice for deployment.
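A minimal sketch of how such a ROC-AUC comparison can be computed for the three shortlisted models, assuming the `models` dictionary and the test split from the earlier sketches; one-vs-rest AUC with macro averaging is an illustrative choice for this three-class problem:

```python
from sklearn.metrics import roc_auc_score

for name in ["RFC", "KNN", "XGB"]:
    proba = models[name].predict_proba(X_test_imp)   # class-probability estimates
    auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
    print(f"{name}: ROC AUC = {auc:.4f}")
```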

Train the Random Forest model to determine the importance of features:
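A minimal sketch of this step, assuming `X_train_imp`, `y_train`, and a `feature_names` list of column names from the earlier preprocessing; the random seed is illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train_imp, y_train)

# Impurity-based importance of each input feature.
importances = pd.Series(rfc.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))  # ten most influential features
```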

After saving the RFC model with the selected features, we use the prediction function to look up a customer's predicted segment; for example, customer number 300 belongs to the Mass Affluent group.
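A minimal sketch of such a single-customer lookup, assuming `X_test_imp` is a NumPy array (as produced by the scaler) and that the label encoder assigned 0, 1, and 2 to the segments in alphabetical order; that mapping is an assumption, not confirmed by the text:

```python
# Assumed decoding of the encoded target back to segment names.
segment_names = {0: "Affluent", 1: "Mass", 2: "Mass Affluent"}

customer = X_test_imp[300].reshape(1, -1)   # one customer row, kept two-dimensional
predicted_label = rfc.predict(customer)[0]
print(segment_names.get(predicted_label, predicted_label))
```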

Personal financial situation and history, loans, and assets play an important role in customer segmentation decisions. Delving into the complexities of customer behavior and characteristics, my exploration aims to reveal insights that can revolutionize the way businesses understand and serve distinct customer segments. By harnessing the power of machine learning techniques, I work to build predictive models that not only identify current customer segments but also predict changes and trends in consumer preferences.

The purpose of this thesis is to examine the accuracy of five different classification models (LRC, RFC, XGB, NB, and KNN) in predicting customer segments. In this study, accuracy and ROC-AUC were employed as the critical performance indicators to identify which of the selected ML algorithms was the most accurate. The findings of this investigation show that the Random Forest Classifier outperforms the other four models, which indicates that the Random Forest Classifier is superior at predicting customer segments.

It can be seen that the Random Forest Classifier model demonstrates a strong ability to segment customers into the Mass, Mass Affluent, and Affluent types with high precision and recall. Banking organizations can use these insights to tailor specific financial products and services to each segment, ensuring that marketing efforts and resources are allocated in a targeted and effective way. The model's reliability in accurately classifying the Affluent and Mass segments is commendable; however, some improvement may be needed for the Mass Affluent category to ensure no customers are accidentally overlooked, guaranteeing excellent service for all wealth classes.

