VIETNAM GENERAL CONFEDERATION OF LABOUR
TON DUC THANG UNIVERSITY
Subject: Big Data
Lecturer: Dr Ngé Tan Vũ Khanh
Group members:
Tran Thi Lan Anh 718H0243
Sin Gia Han 719H0032
3.2.1 Definition
3.2.2 Evaluate Linear regression
IV Data mining and analysis using Orange software
4.1 Data preprocessing
4.2 Data mining and analysis using Orange software
4.2.2 The customer portrait of the airline industry
4.2.3 Describe customer portraits by using Box Plot
4.2.4 Relationship between Flight Distances, Annual Income and Type of Travel by using Scatter
Table of Figures
Figure 1 Passenger age and gender
Figure 2 Type of Customers
Figure 3 Passenger’s annual income
Figure 4 Customer’s profession
Figure 5 Customer’s profession by Class
Figure 6 Relationship between flight distance, income and type of travel
Figure 7 Scatter chart of customer clusters by age
Figure 8 Scatter chart of customer clusters by Annual Income
Figure 9 Scatter chart of customer clusters according to Work Experiences
Figure 10 Scatter chart of customer clusters by Age
Figure 11 Scatter chart of customer clusters by Income
Figure 12 Scatter chart of customer clusters by Flight Distances
Figure 13 Scatter chart of customer clusters according to Work Experiences
Picture 1 Data Preprocessing
Picture 2 Adding data to Orange Data Mining by CSV File Import
Picture 3 Evaluate the data by the Feature Statistics
Picture 4 Missing Information
Picture 5 Handling missing information
Picture 6 Merging 2 Data by Concatenate
Picture 7 Table of Customer Data After Preprocessing and Merging
Picture 8 Cluster analysis model and customer satisfaction level
Picture 9 K-means clustering diagram using Orange
Picture 10 Select information
Picture 12 The process of Linear Regression method
Picture 13 The table of variables
Picture 14 The result of independent variables
Summary
In summary, we analyze and evaluate the data to build customer profiles and to analyze customer satisfaction with airline services. The data used in the article is taken from Kaggle and analyzed in the Orange software: the K-means method is used to discover customer segments and portraits, and Linear Regression is used to analyze the factors affecting customer satisfaction.
I Introduction
With the development of society and the economy, people's need to travel keeps increasing. Air travel is one of the most favored means of transportation, especially for long journeys or for people who need to save travel time. Airlines not only provide their passengers with safe and convenient flights, they also want their customers to experience quality flight service. This is one of the competitive advantages of airlines, especially as their customers often come from the middle class and above and pay close attention to their own experience and to customer service.
Airline customer service comprises the activities or services that an airline provides to its passengers at each stage of their journey to meet or improve customers’ overall in-airport and in-flight experience. Specifically, these range from pre-flight services such as ease of booking and check-in to in-flight services such as seat comfort, entertainment and many others.
Analyzing customer reviews and satisfaction with airline services helps airlines understand their customers better. It is especially important to know whether the services provided by the airline are suitable and meet customers' needs. With this knowledge, airlines can assess customers’ attitudes towards them and identify which services need to change to suit passengers better.
Data on customer satisfaction ratings for flight services can be collected from many sources, and the volume can be huge. Therefore, to understand customer satisfaction, we used the Orange tool to model the data and analyze customer satisfaction with the different services in a flight journey. Specifically, we used K-means to segment customers and Linear Regression to determine the impact of factors on customer satisfaction. From this, we assess customer satisfaction for the airline, orient its future strategy and reinforce customer satisfaction.
The data set from Kaggle includes 4000 responses containing customer information as well as customers' evaluations of the flight service experience. The information includes:
Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Profession: The passenger's current job
Annual income: The annual income of the passengers
Spending Score: The spending score of the passenger (1-100)
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, …)
Flight distance: The flight distance of this journey
Inflight wifi service: Satisfaction level of the inflight wifi service (0: Not Applicable; 1-5)
Seat comfort: Satisfaction level of Seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of On-board service
Leg room service: Satisfaction level of Leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of Check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of Cleanliness
Departure Delay in Minutes: Minutes delayed at departure
Arrival Delay in Minutes: Minutes delayed at arrival
Satisfaction: Airline satisfaction level (Satisfaction, neutral or …)
The K-means++ technique was found to successfully resolve some of the issues related to establishing the initial cluster centroids for K-means in a review by Shindler that included numerous clustering algorithms.
K-Means is an unsupervised learning algorithm that partitions a given dataset into a fixed number of clusters (K clusters) by defining K centroids, one for each cluster. To ensure better results, the initial centroids are placed far away from each other.
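The partitioning described above can be sketched in a few lines of Python. This is only a minimal illustration of the standard K-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its members), not the implementation Orange uses. The function name `kmeans`, its parameters and the `init` argument are assumptions made for this example.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Cluster a list of numeric tuples into k groups (hypothetical helper)."""
    rng = random.Random(seed)
    # Start from the given centroids, or from k randomly sampled points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Leave the centroid of an empty cluster where it is.
                updated.append(centroids[j])
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, labels
```

Because the result depends on where the centroids start, passing well-separated starting points through `init` (or using a better initialization scheme) can change which clusters are found.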
3.1.2 The idea of K-means++ algorithm
The procedure selects the first of the k centroids uniformly at random from the data points being clustered, then chooses each further centroid so that it tends to lie far from those already chosen:
1. Choose one centroid uniformly at random from among the data points
2. Calculate D(x), the distance between each unselected data point x and the closest chosen centroid
3. Using a weighted probability distribution, choose a new data point as a new centroid, where a point x is chosen with probability proportional to D(x)^2
4. Repeat steps 2 and 3 until k centroids have been chosen
5. After choosing the initial centroids, continue clustering with the conventional K-means algorithm
The K-means++ algorithm was created to overcome the problem of picking the initial centroids at random, because the final clustering outcome depends on the initial cluster centroids.
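The initialization steps of Section 3.1.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code Orange runs: the function name `kmeans_pp_init` and its parameters are hypothetical, and points are assumed to be tuples of numbers.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids with K-means++ style seeding (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 1: the first centroid is drawn uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Step 2: squared distance from each point to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a chosen centroid; nothing new to pick.
            break
        # Step 3: draw the next centroid with probability proportional to D(x)^2.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids
```

A point that is already a centroid has D(x) = 0 and so cannot be drawn again, while points far from every chosen centroid are the most likely candidates. The returned list can then seed the conventional K-means iterations (step 5).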
3.1.3 Details of the K-means++ algorithm
The K-means++ algorithm selects each new centroid using the distance D(x) between a data point and the centroids already selected. Initialization of K-means++:
$$d_i = \max_{1 \le j \le m} \lVert x_i - c_j \rVert$$
In which:
d_i: the distance from the point x_i to the farthest centroid, where c_1, ..., c_m are the centroids selected so far
k: number of centroids selected
Step 3: Choose as the new center the point x_i with the highest probability, which is proportional to d_i
Step 4: Repeat steps 2 and 3 until we find k centroids
After finding the k centroids, we continue clustering with the standard K-means algorithm as in Section 3, Part III above.
3.2 Linear Regression method
3.2.1 Definition
Linear Regression is a statistical method for modeling the relationship between a dependent variable (also known as the response variable) and one or more independent variables (also known as predictor variables). Linear regression seeks the best-fit line through the data.
Regression is commonly used for two purposes. The first is to make predictions and forecasts, a task that overlaps with the field of machine learning. The second is to investigate possible causal relationships between independent and dependent variables. It is important to note that regression analysis only reveals relationships between a dependent variable and a specific collection of variables; it does not provide conclusive evidence of causal relationships.
Simple Linear Regression is the most basic form of linear regression, and it involves finding the best straight line to fit the data. The equation for a simple linear regression model is y = b0 + b1*x.
In this equation, y is the dependent variable, x is the independent variable, b0 is the intercept, and b1 is the line's slope.
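As a worked illustration of y = b0 + b1*x, the sketch below estimates b0 and b1 by ordinary least squares, the usual way the "best straight line" is defined. The function name `fit_simple_linear_regression` and the sample numbers are made up for this example and do not come from the airline data set.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up data lying exactly on y = 1 + 2*x.
b0, b1 = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
predicted = b0 + b1 * 5  # prediction for a new value x = 5
```

With real data the points do not fall exactly on a line, and b0 and b1 are the values that minimize the sum of squared vertical distances between the points and the line.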
Linear regression can be used to predict future values of the dependent variable, understand the relationship between the independent and dependent variables, and identify the most important predictors in a dataset, among other things. It is commonly used in finance, economics, social sciences, and engineering.
3.2.2 Evaluate Linear regression
Linear regression analysis based on the coefficient (or regression coefficient) is used to evaluate the degree of influence of the independent variables on the dependent variable in a linear regression model. The coefficient indicates the average change in the dependent variable (in the dependent variable's unit of measurement) when an independent variable changes by one unit while the other independent variables are held constant. If the coefficient is positive, the independent and dependent variables change in the same direction: when the value of the independent variable increases, the value of the dependent variable also increases. Conversely, if the coefficient is negative, they change in opposite directions: when the value of the independent variable increases, the value of the dependent variable decreases. The larger the absolute value of the coefficient, the stronger the impact of the independent variable on the dependent variable, provided the independent variables are measured on comparable scales. Therefore, the coefficient is essential in evaluating the effects of independent variables on the dependent variable and helps us to better understand the relationship between them.
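To make the reading of coefficients concrete, the sketch below fits a regression with two independent variables by solving the least-squares normal equations directly. It is only an illustration: the function name `fit_two_predictors`, the variable names and the numbers are assumptions invented for this example, not results from the report.

```python
def fit_two_predictors(x1, x2, y):
    """Return (b0, b1, b2) of the least-squares fit y = b0 + b1*x1 + b2*x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Center every variable so the intercept can be recovered at the end.
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * c for a, c in zip(c1, cy))
    s2y = sum(b * c for b, c in zip(c2, cy))
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Made-up example: a rating (1-5) and a delay in minutes.
seat_comfort = [1, 2, 3, 4, 5]
delay_minutes = [0, 30, 10, 60, 20]
score = [3 + 0.5 * s - 0.02 * d for s, d in zip(seat_comfort, delay_minutes)]
b0, b1, b2 = fit_two_predictors(seat_comfort, delay_minutes, score)
```

In this invented example the coefficient of the rating is positive (the score rises with it) and the coefficient of the delay is negative (the score falls as delay grows). The two magnitudes are not directly comparable, because one unit of rating and one minute of delay are different scales.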
IV Data mining and analysis using Orange software
4.1 Data preprocessing
Picture 1 Data Preprocessing
Step 1: Add the data to Orange using the CSV File Import widget and evaluate it. In this task, we added two data sets, Segmentation Information and Customer Survey. These are separate data files that we extracted to simulate data from multiple sources being imported into a data warehouse.
Picture 2 Adding data to Orange Data Mining by CSV File Import
To evaluate the data, we use the Feature Statistics tool in Orange. This tool gives a quick overview of the data by showing various statistics for each feature, which lets us identify prominent features and check for any missing data.
Picture 3 Evaluate the data by the Feature Statistics (columns shown: Name, Distribution, Mean, Mode, Median, Dispersion, Min, Max, Missing)
Picture 4 Missing Information
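The kind of per-feature overview that Feature Statistics provides can be approximated outside Orange. The sketch below is a rough stand-in under stated assumptions: rows are plain dictionaries, a missing value is represented by `None`, and the function name `feature_statistics` is hypothetical.

```python
import statistics


def feature_statistics(rows, column):
    """Summarize one numeric column of a list of dict rows (hypothetical helper)."""
    # Keep only the rows where the value is present.
    values = [r[column] for r in rows if r.get(column) is not None]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "missing": len(rows) - len(values),  # rows without a value
    }


# Made-up rows; the second one has no Age recorded.
sample = [{"Age": 20}, {"Age": None}, {"Age": 40}, {"Age": 30}]
summary = feature_statistics(sample, "Age")
```

A non-zero `missing` count is the signal that the column needs handling before clustering or regression.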
Step 2: If the data has missing values as shown above, we can use the Preprocess widget to handle them. Here, we select the Impute Missing Values option and choose Replace with Random Value to fill in the missing data with random values.
Picture 5 Handling missing information
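The idea of filling gaps with random values can be sketched as below. This is an assumption-laden illustration rather than a description of Orange's internals: it assumes a missing value is `None` and that each replacement is drawn at random from the values observed in the same column. The function name `impute_random` is hypothetical.

```python
import random


def impute_random(rows, column, seed=0):
    """Return copies of rows with missing values in `column` filled at random."""
    rng = random.Random(seed)
    # Pool of values actually observed in this column.
    observed = [r[column] for r in rows if r.get(column) is not None]
    filled = []
    for r in rows:
        new_row = dict(r)  # copy, so the input rows are left untouched
        if new_row.get(column) is None:
            new_row[column] = rng.choice(observed)
        filled.append(new_row)
    return filled
```

Random replacement keeps the overall spread of the column roughly intact, but it also adds noise, so results that rest on imputed rows should be read with some caution.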
To merge the two processed data files, we can use the Concatenate widget in Orange. This widget stacks the rows of two or more data sources into a single table, aligning variables that have the same name into the same column.
Picture 6 Merging 2 Data by Concatenate
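The effect of stacking two tables can be sketched as follows, including the choice between keeping all variables that appear in the input tables and keeping only the variables that appear in all tables. This is an illustrative sketch only: tables are assumed to be lists of dictionaries, and the function name `concatenate` and its `common_only` flag are hypothetical.

```python
def concatenate(tables, common_only=False):
    """Stack the rows of several tables (lists of dicts) into one table."""
    column_sets = [set().union(*(row.keys() for row in t)) for t in tables]
    if common_only:
        keep = set.intersection(*column_sets)  # variables present in every table
    else:
        keep = set.union(*column_sets)  # variables present in any table
    # Preserve the order in which the kept columns first appear.
    ordered = []
    for t in tables:
        for row in t:
            for name in row:
                if name in keep and name not in ordered:
                    ordered.append(name)
    # A column that a table lacks is filled with None (a missing value).
    return [{name: row.get(name) for name in ordered} for t in tables for row in t]


# Made-up tables standing in for the two data files.
info = [{"ID": 1, "Age": 30}, {"ID": 2, "Age": 41}]
survey = [{"ID": 3, "Age": 25, "Seat comfort": 4}]
merged = concatenate([info, survey])
common = concatenate([info, survey], common_only=True)
```

Keeping all variables preserves every column at the cost of introducing missing values, whereas keeping only the common variables avoids gaps but discards columns.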
Step 3: Add the processed data to a table to get an overview of the data.
Picture 7 Table of Customer Data After Preprocessing and Merging
4.2 Data mining and analysis using Orange software
Picture 8 Cluster analysis model and customer satisfaction level
4.2.2 The customer portrait of the airline industry
a) Use distributions to visualize customer information
Figure 1 Passenger age and gender
In general, male customers fly more often than female customers. The age groups with the highest numbers of flying customers are those between 25 and 27 years old and those in their early 40s. The passengers with the highest flight frequency are male customers about 41 years old, with nearly 120 flights. From about 50 years old onward, the number of passengers fluctuates but tends to decrease.
b) Customer information by age and customer type
Figure 2 Type of Customers
According to the chart, Loyal Customers are mainly between 40 and 60 years old. In particular, the age group from 40 to 45 has the highest number of loyal customers, up to 180. Loyal customers are also concentrated in the age group from 25 to 30, with up to 120 loyal customers. However, the 20 to 30 age range also has the largest number of disloyal customers, nearly 70. For other age groups, the number of disloyal customers is about 20 to 30.
c) Customer’s annual income
Figure 3 Passenger’s annual income
In general, customers of the aviation industry have an annual income of $60,000 or more, regardless of whether they are loyal or disloyal customers. Those who use aviation services most often have an annual income of between $60,000 and $100,000. Although there are still customers with an income of less than $60,000, their number is not significant.
4.2.3 Describe customer portraits by using Box Plot
Customer profession