Final project report: Enterprise Analytics for Decision Support


Topic meaning of project

Customer churn signifies the end of the business relationship between a customer and a specific company or service. In the highly competitive telecommunications sector, where consumers can select from numerous providers, the annual churn rate stands at 22%, highlighting the industry's intense rivalry.

Individualized customer retention is challenging for companies because of large customer bases and limited resources. The costs of extensive retention efforts often outweigh the potential revenue gains. However, accurately predicting which customers are likely to leave allows businesses to focus their retention strategies on high-risk clients. The goal is to enhance customer loyalty and expand coverage, which depends on understanding and meeting customer needs.

Customer churn is crucial for businesses to address, as retaining existing customers is far more cost-effective than acquiring new ones. To tackle this issue, telecom companies must proactively identify customers at risk of churning by developing a comprehensive understanding of their interactions across various channels. This includes analyzing physical store visits, purchase histories, customer service interactions, online transactions, and social media engagements.

Telecom companies can enhance their market standing and drive growth by effectively reducing customer churn. Retaining customers leads to lower acquisition costs and increased profit potential. Therefore, minimizing client attrition and adopting robust retention strategies are crucial for success in this competitive landscape.

Objectives of project

Understand Customer Churn Patterns

- Analyze historical data to understand patterns and trends associated with customer churn.

- Identify key factors that have historically led to customer attrition in the telecom industry.

Data preprocessing

Clean and preprocess the data to ensure its quality and suitability for analysis.

Model Development

- Build predictive models using machine learning algorithms to predict customer churn.

- Evaluate and compare the performance of different models to choose the most effective one.

Tools in the project

Dataset

Source of dataset

Our dataset comes from Kaggle: https://www.kaggle.com/datasets/blastchar/telco-customer-churn/data

Dataset description

The data set includes information about:

- Customers who left within the last month – the column is called Churn.

- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies.

- Customer account information - how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges.

- Demographic information about customers – gender, and if they have partners and dependents.

Name: Description

- customerID: Customer ID
- gender: Whether the customer is a male or a female
- SeniorCitizen: Whether the customer is a senior citizen or not (1 = senior citizen, 0 = not)
- Partner: Whether the customer has a partner or not (Yes, No)
- Dependents: Whether the customer has dependents or not (Yes, No)
- tenure: Numerical value, number of months the customer has stayed with the company
- PhoneService: Whether the customer has a phone service or not (Yes, No)
- MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
- InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
- OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
- DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
- TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
- StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
- StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
- Contract: The contract term of the customer (Month-to-month, One year, Two year)
- PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
- PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- MonthlyCharges: The amount charged to the customer monthly
- TotalCharges: The total amount paid by the customer
- Churn: Whether the customer churned or not (Yes or No)

Descriptive statistics

Before creating complex visualizations, it's crucial to verify the quality of our data using Python. This initial step is vital for understanding how to manage our data effectively in the project's later phases.

Figure 1.3 Loading and displaying the dataset

Figure 1.4 Checking the information of the dataset


- Dataset Structure: The dataset has 7043 rows and 21 columns.

- Missing Data: There are no null values in the dataframe.

- Data type: There are 18 columns with object type data, 2 columns with integer value data and only 1 column with float value data.

- Label: The “Churn” column is the label in our dataset; our project aims to identify ahead of time which customers are likely to leave.

- Imbalanced dataset: The "No" category significantly outnumbers the "Yes" category, with 5174 (73.46%) instances compared to 1869 (26.54%).

The analysis using Tableau reveals an unbalanced dataset, with a ratio of approximately 3:1 between Not-Churn and Churn customers. This imbalance is likely to skew predictions in favor of Not-Churn customers, potentially affecting the accuracy of the insights derived from the data.
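As a quick check of the figures above, the following minimal sketch recomputes the class proportions from the reported counts. The helper name `class_balance` is hypothetical, and the label list is rebuilt from the counts in the text rather than loaded from the Kaggle file.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: (count, percentage)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 2)) for label, n in counts.items()}


# Label column rebuilt from the counts reported above (5174 "No", 1869 "Yes");
# in the project this would be the Churn column of the loaded dataset.
churn = ["No"] * 5174 + ["Yes"] * 1869
balance = class_balance(churn)
imbalance_ratio = balance["No"][0] / balance["Yes"][0]

print(balance)
print(f"No : Yes ratio = {imbalance_ratio:.2f} : 1")
```

The same check could be run on the real Churn column once the dataset is loaded; only the input list would change.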

Figure 1.5 Pie chart of Churn

1.3.1 Categorical features vs ‘Churn’ variable

Since there are various categorical features in this dataset, we divide them into 3 groups according to their properties:

- Group 1: Customer information (gender, SeniorCitizen, Partner, Dependents)

- Group 2: Service registered by the customer (OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport)

- Group 3: Type of contract (Contract, PaperlessBilling, PaymentMethod)

Group 1: Customer information (gender, SeniorCitizen, Partner, Dependents)

The churn rates for male and female customers are very similar.

The number of Senior Citizen customers is relatively low, but their churn rate is about 42%: 476 customers lost out of 1,142 Senior Citizen clients.

Customers who live with a Partner churned less than those who do not. Similarly, churn is higher among customers who have no Dependents.
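The per-group churn rates discussed above can be computed with a small helper. This is a minimal sketch, not the project's code: the function name `churn_rate_by` is hypothetical, and the rows are rebuilt from the Senior Citizen counts quoted in the text (476 of 1,142).

```python
from collections import defaultdict


def churn_rate_by(rows, feature, target="Churn", positive="Yes"):
    """Return {category: (churned, total, rate)} for one categorical feature."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in rows:
        totals[row[feature]] += 1
        if row[target] == positive:
            churned[row[feature]] += 1
    return {cat: (churned[cat], n, churned[cat] / n) for cat, n in totals.items()}


# Senior-citizen rows rebuilt from the counts quoted above (476 churned out of 1,142).
rows = (
    [{"SeniorCitizen": 1, "Churn": "Yes"}] * 476
    + [{"SeniorCitizen": 1, "Churn": "No"}] * (1142 - 476)
)
rates = churn_rate_by(rows, "SeniorCitizen")
n_churned, n_total, rate = rates[1]
print(f"SeniorCitizen=1: {n_churned}/{n_total} churned ({rate:.1%})")
```

Passing a different `feature` name (for example `"Partner"` or `"Dependents"`) would give the corresponding breakdown for the other Group 1 columns.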

Group 2: Service registered by the customer (OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport)

Figure 1.9 Services registered by the customer

The churn rate is low among internet users who subscribe to the above services (first figure).

High churn rates are observed among internet users who do not subscribe to these services, while customers who do not use the internet at all exhibit the lowest churn rates.

Group 3: Type of contract (Contract, PaperlessBilling, PaymentMethod)

Figure 1.10 Type of contract chart

Customer churn rates for month-to-month contracts tend to be high, as many customers use this flexible option to explore different services. This approach allows them to test various offerings without a long-term commitment, ultimately helping them save money while they determine the best fit for their needs.

A high number of customers with PaperlessBilling churned. This is probably due to payment or receipt issues.

Customers appear significantly dissatisfied with the Electronic Check payment method: 1,071 out of 2,365 users abandoned the service. To retain customers, the company must either eliminate the Electronic Check option or enhance its usability to make it more user-friendly.

1.3.2 Numerical features vs ‘Churn’ variable

A very high number of customers with TotalCharges below 500 opted out of the services. Churn remains high across the TotalCharges range from 0 to 1000.

For MonthlyCharges, the churn rate is high for values between 65 and 105. Customers in this range were the most likely to switch.

Customer churn is notably high in the first month, with many customers canceling their services. This trend of cancellations persists for the next 4 to 5 months; however, the rate of churn begins to decline after the initial month. As customer tenure increases, the likelihood of dropping out decreases significantly, leading to lower overall churn rates over time.
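One way to quantify how churn varies across a numerical feature is to bin the values and compute the churn rate per bin. The sketch below illustrates this for tenure; the function name `churn_rate_by_bin`, the bin edges, and the sample records are all assumptions made up for illustration and do not come from the dataset.

```python
from bisect import bisect_right


def churn_rate_by_bin(records, edges):
    """Group (value, churned) pairs into [edges[i], edges[i+1]) bins and
    return a list of (bin_label, n_customers, churn_rate)."""
    counts = [[0, 0] for _ in range(len(edges) - 1)]  # [churned, total] per bin
    for value, churned in records:
        i = bisect_right(edges, value) - 1
        if 0 <= i < len(counts):  # values outside the edges are ignored
            counts[i][0] += int(churned)
            counts[i][1] += 1
    return [
        (f"{edges[i]}-{edges[i + 1] - 1}", total, churned / total if total else 0.0)
        for i, (churned, total) in enumerate(counts)
    ]


# Hypothetical (tenure in months, churned?) pairs, made up for illustration only.
toy = [(1, True), (1, True), (2, False), (5, True), (8, False),
       (14, False), (20, True), (30, False), (45, False), (70, False)]
for label, n, rate in churn_rate_by_bin(toy, edges=[0, 6, 12, 24, 73]):
    print(f"tenure {label:>5} months: n={n}, churn rate={rate:.0%}")
```

The same helper could be applied to MonthlyCharges or TotalCharges with suitable integer bin edges.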

Conclusion

- Three types of customers should be targeted: senior citizens, customers living with a partner, and customers living alone.

SeniorCitizen customers, although fewer in number, tend to have higher Monthly Charges compared to other demographics, indicating their willingness to pay for premium services. In contrast, both partnered customers and those living alone typically seek services with Monthly Charges under $65, reflecting a preference for more affordable options.

To enhance payment efficiency and reduce churn, it is essential to discontinue the use of Electronic Checks for payments, given their high churn rates. Instead, the focus should shift entirely to automatic Bank Transfers and Credit Card payments. However, a significant challenge lies in decreasing the median churn tenure for these two payment methods, which currently stands at over 20 months, double that of Electronic Checks.

Model application

Data preprocessing

2.1.2 Separate dependent and independent variables

Figure 2.2 Separate dependent and independent variables

Feature scaling is a crucial data preprocessing technique that standardizes the values of features in a dataset to a uniform scale. This process ensures that each feature contributes equally to the model's performance, preventing features with larger values from overshadowing others.

In this project, we use min-max normalization. For every feature, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.
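The transformation described above is x' = (x - min) / (max - min). A minimal sketch in plain Python follows; the function name `min_max_scale` and the sample values are hypothetical, and the report does not state which library was actually used (scikit-learn's `MinMaxScaler` applies the same formula by default).

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical MonthlyCharges values, for illustration only.
monthly_charges = [20.0, 45.0, 70.0, 120.0]
scaled = min_max_scale(monthly_charges)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

In practice the minimum and maximum are usually taken from the training set only and then reused for the test set, so that no information from the test data leaks into preprocessing.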

2.1.4 Handle data imbalance and Split dataset into train and test sets

The graph indicates that just 26.5% of customers fall into the 'churn' category, so a predictive model could achieve about 73.5% accuracy simply by always predicting that the customer stays.

Therefore, we need to handle the data imbalance. We use two different approaches to solve this problem: SMOTE and Near-Miss.

Figure 2.5 Handle data imbalance using SMOTE

Figure 2.6 Handle data imbalance using Near-Miss
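To make the two resampling ideas concrete, here is a simplified sketch of each in plain Python. It is an illustration under stated assumptions, not the code shown in the figures and not the imbalanced-learn implementation: the function names, the neighbour count `k`, and the 2-D sample points are all hypothetical. SMOTE oversamples the minority class by interpolating between a minority sample and one of its nearest minority neighbours; the Near-Miss variant sketched here (NearMiss-1 style) undersamples the majority class by keeping the majority samples closest on average to their nearest minority samples.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_new, k=2, seed=0):
    """Oversample: create n_new synthetic points, each on the segment between
    a random minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbour = rng.choice(k_nearest(base, others, k))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def near_miss(majority, minority, n_keep, k=2):
    """Undersample (NearMiss-1 style): keep the n_keep majority points with the
    smallest mean distance to their k nearest minority points."""
    def mean_dist(p):
        nearest = k_nearest(p, minority, k)
        return sum(math.dist(p, q) for q in nearest) / len(nearest)
    return sorted(majority, key=mean_dist)[:n_keep]


# Hypothetical 2-D scaled features, for illustration only.
churn = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.7)]
stay = [(0.8, 0.2), (0.9, 0.1), (0.7, 0.3), (0.6, 0.4), (0.3, 0.6), (0.95, 0.05)]

oversampled_churn = churn + smote(churn, n_new=len(stay) - len(churn))
undersampled_stay = near_miss(stay, churn, n_keep=len(churn))
print(len(oversampled_churn), len(stay))    # classes balanced by oversampling
print(len(churn), len(undersampled_stay))   # classes balanced by undersampling
```

A common recommendation is to apply either resampler to the training split only, so that the test set keeps the original class distribution.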

Proposed model

This section focuses on utilizing machine learning models to identify and categorize key factors contributing to customer attrition, enhancing our understanding of the reasons behind customer departures. Additionally, we will develop models to predict customer attrition using the available dataset.

In the crucial stage of our "Telco Customer Churn Prediction" project, we explore the implementation of three machine learning algorithms: Logistic Regression, Decision Tree Classifier, and Support Vector Classifier (SVC). Each algorithm offers unique advantages, enhancing our ability to conduct a thorough analysis for predicting customer churn in the telecommunications sector.

Model result

In this section, we show a confusion matrix and related evaluation metrics like Accuracy, Recall, F1 score, Precision by using Python.
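For reference, the four metrics can be derived directly from the cells of a confusion matrix. The sketch below uses a hypothetical helper name, `classification_metrics`, and evaluates the trivial baseline that always predicts "No" on the full dataset (counts taken from the descriptive statistics: 5174 "No", 1869 "Yes"); it does not reproduce the results in the figures.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (churn) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # 0.0 by convention when undefined
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Baseline that always predicts "No" on the full dataset (5174 No, 1869 Yes).
baseline = classification_metrics(tp=0, fp=0, fn=1869, tn=5174)
print(baseline)
```

The baseline reaches roughly 73% accuracy while its recall and F1-score for the churn class are zero, which is why accuracy alone is not sufficient on this imbalanced dataset.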

Figure 2.7 Logistic Regression result using SMOTE resampling

Figure 2.8 Decision Tree Classifier using SMOTE resampling

Figure 2.9 SVC result using SMOTE resampling

Figure 2.10 Logistic Regression result using Near-Miss resampling

Figure 2.11 Decision Tree Classifier result using Near-Miss resampling

Figure 2.12 SVC result using Near-Miss resampling

Result summary

- Overall, with this dataset, handling data imbalance with SMOTE results in higher accuracy and F1-Score. Therefore, we choose the SMOTE approach for this dataset.

- Among the 3 supervised machine learning models, SVC achieves the highest accuracy and F1-Score; hence, SVC seems to be the best fit.

Obtained results

The dataset exhibits an imbalance with a ratio of approximately 3:1 between Not-Churn and Churn customers, leading to biased predictions favoring Not-Churn customers. To improve prediction accuracy, it is essential to preprocess the dataset prior to model training.

- We scale numerical data using min-max normalization and use two different ways to handle data imbalance: SMOTE and Near-Miss.

In our analysis, we evaluated three machine learning models: Logistic Regression, Decision Tree Classifier, and Support Vector Classifier. Comparing their accuracy and F1-Scores under both resampling techniques, we found that SMOTE resulted in higher accuracy and F1-Score than Near-Miss. Consequently, we selected SMOTE as the method for handling imbalance in this dataset.

Recommendation

- Three types of customers should be targeted: senior citizens, customers living with a partner, and customers living alone.

- It is necessary to discontinue the use of Electronic check for payment purposes due to its high churn and focus entirely on Bank Transfer (automatic) & Credit Card (automatic).

- The prices of the StreamingTV and StreamingMovies services need to be made more affordable. The content of these services should aim at a diverse range of customers.

- To establish a robust customer base, the Telco Company should establish a straightforward and cost-effective entry point for its services. For the first 6 months of tenure, it needs to focus extensively on OnlineSecurity, OnlineBackup, DeviceProtection & TechSupport, as this period is the most critical and uncertain for the customers.

Future work

- This project has not been able to exploit all aspects of the dataset, so we will continue to explore further insights from it.

- We will look for other preprocessing methods and machine learning models to increase the accuracy of the model.

Title	Enterprise Analytics for Decision Support
Authors	Nguyen Tuan Dat, Banh Gia Bao, Vu Thanh Van, Pham Huy Nam, Nguyen Thi Quynh Anh
Supervisor	Assoc. Prof. Do Trung Tuan
University	Vietnam National University
Major	Enterprise Analytics
Type	Final project report
