
Final Project: Analyze Airline Passenger Portrait and Airline Customer Services Satisfaction by Using Orange Software


DOCUMENT INFORMATION

Basic information

Title: Analyze Airline Passenger Portrait and Airline Customer Services Satisfaction by Using Orange Software
Authors: Tran Thi Lan Anh, Nguyen Cong Doan, Phan Ngoc Duyen, Sin Gia Han
Supervisor: Dr. Ngo Tan Vu Khanh, Lecturer
University: Ton Duc Thang University
Major: Big Data
Type: Final Project
Year of publication: 2023
City: Ho Chi Minh City
Pages: 38
File size: 7.4 MB


VIETNAM GENERAL CONFEDERATION OF LABOUR TON DUC THANG UNIVERSITY

Subject: Big Data

Lecturer: Dr. Ngo Tan Vu Khanh

Group members:

Tran Thi Lan Anh 718H0243

Sin Gia Han 719H0032

3.2.1 Definition

3.2.2 Evaluate Linear regression

IV Data mining and analysis using Orange software
4.1 Data preprocessing
4.2 Data mining and analysis using Orange software
4.2.2 The customer portrait of the airline industry
4.2.3 Describe customer portraits by using Box Plot
4.2.4 Relationship between Flight Distances, Annual Income and Type of Travel by using Scatter Plot

Table of Figures

Figure 1 Passenger age and gender
Figure 2 Type of customers
Figure 3 Passenger's annual income
Figure 4 Customer's profession
Figure 5 Customer's profession by Class
Figure 6 Relationship between flight distance, income and type of travel
Figure 7 Scatter chart of customer clusters by Age
Figure 8 Scatter chart of customer clusters by Annual Income
Figure 9 Scatter chart of customer clusters according to Work Experiences
Figure 10 Scatter chart of customer clusters by Age
Figure 11 Scatter chart of customer clusters by Income
Figure 12 Scatter chart of customer clusters by Flight Distances
Figure 13 Scatter chart of customer clusters according to Work Experiences

Picture 1 Data preprocessing
Picture 2 Adding data to Orange Data Mining by CSV File Import
Picture 3 Evaluate the data by the Feature Statistics
Picture 4 Missing Information
Picture 5 Handling missing information
Picture 6 Merging 2 Data by Concatenate
Picture 7 Table of Customer Data After Preprocessing and Merging
Picture 8 Cluster analysis model and customer satisfaction level
Picture 9 K-means clustering diagram using Orange
Picture 10 Select information
Picture 12 The process of Linear Regression method
Picture 13 The table of variables
Picture 14 The result of independent variables


Summary

In summary, we analyze and evaluate the data to build customer profiles and to assess customer satisfaction with airline services. The data used in this report is taken from Kaggle and analyzed in Orange software, using the K-means method to discover customer segments and portraits, and Linear Regression to analyze the factors affecting customer satisfaction.

I Introduction

With the development of society and the economy, people's need to travel is constantly increasing. Air travel is one of the most favored means of transportation, especially for long journeys or for people who need to save travel time. Airlines not only provide their passengers with a safe and convenient flight, they also want their customers to experience quality flight service. Service quality is one of the competitive advantages of airlines, especially because their customers often come from the middle class and above and care about their own experience and about customer service.

Airline customer service comprises the activities or services that an airline offers its passengers at each stage of their journey to meet or improve their overall in-airport and in-flight experience. Specifically, these range from pre-flight services, such as ease of booking and check-in, to in-flight services, such as seat comfort, entertainment and many others.

Analyzing customer reviews and satisfaction with airline services helps airlines understand their customers and how satisfied they are with the services offered. It is especially important to know whether the services provided by the airline are suitable and meet customers' needs. With this knowledge, airlines can assess customers' attitudes towards them and identify which services need to change to suit passengers better.

Data on customer satisfaction ratings for flight services can be collected from many sources, and the volume can be huge. Therefore, to understand customer satisfaction, we used the Orange tool to model the data and analyze customer satisfaction with the different services in a flight journey. Specifically, we used K-means and Linear regression to determine the impact of individual factors on customer satisfaction. On that basis, we assess customer satisfaction with the airline and suggest directions for its future strategy to reinforce customer satisfaction.

The data set from Kaggle includes 4000 responses, each containing customer information and the customer's evaluation of the flight service experience. The information includes:

Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Profession: The passenger's current job
Annual income: The annual income of the passengers
Spending Score: The spending score of the passenger (1-100)
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, ...)
Flight distance: The flight distance of this journey
Inflight wifi service: Satisfaction level of the inflight wifi service (0: Not Applicable; 1-5)
Seat comfort: Satisfaction level of seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of on-board service
Leg room service: Satisfaction level of leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of cleanliness
Departure Delay in Minutes: Minutes delayed at departure
Arrival Delay in Minutes: Minutes delayed at arrival
Satisfaction: Airline satisfaction level (Satisfaction, neutral or ...)

In a review by Shindler that covered numerous clustering algorithms, the K-means++ technique was found to successfully resolve some of the issues related to establishing the initial cluster centroids for K-means.


K-Means is an unsupervised learning algorithm that partitions a given dataset into a fixed number of clusters (K clusters) by defining K centroids, one for each cluster. To ensure better results, the centroids are placed far away from each other.

3.1.2 The idea of K-means++ algorithm

The procedure for selecting the k initial centroids starts from one data point chosen uniformly at random; each further centroid is then drawn so that points far from the centroids already chosen are more likely to be picked:

1. Choose the first centroid uniformly at random from the data points.

2. Calculate D(x), the distance between each unselected data point x and the closest chosen centroid.

3. Using a weighted probability distribution, choose a new data point as a new centroid, where a point x is chosen with probability proportional to D(x)^2.

4. Repeat steps 2 and 3 until k centroids have been chosen.

5. After choosing the initial centroids, continue clustering with the conventional K-means algorithm.

The K-means++ algorithm was created to overcome the problem of picking the initial centroids purely at random, because the final clustering outcome depends on the initial cluster centroids.
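The seeding steps listed above can be sketched in a few lines of Python. This is a minimal illustration and not the code Orange runs internally; the function names, the `seed` parameter and the toy (age, income) points are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids from `points` using K-means++ seeding."""
    rng = random.Random(seed)
    # Step 1: the first centroid is drawn uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Step 2: D(x)^2, the squared distance from x to its closest chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Step 3: draw the next centroid with probability proportional to D(x)^2.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Hypothetical toy data: (age, annual income in thousands of dollars).
customers = [(25, 60), (27, 62), (26, 58), (41, 95), (43, 98), (40, 92)]
print(kmeans_pp_init(customers, k=2))
```

Points that are already centroids have weight zero, so they cannot be drawn twice; far-away points are favored without being guaranteed, which keeps the seeding randomized.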

3.1.3 Details of the K-means++ algorithm

The K-means++ algorithm is built on D(x), the shortest distance between a data point and the nearest selected centroid. Initialization of K-

$$d_i = \max_{1 \le j \le m} \lVert x_i - c_j \rVert$$

where:

d_i: the distance from the point x_i to the farthest centroid

m: the number of centroids selected so far

Step 3: Choose a new center x_i, selected with probability proportional to d_i.

Step 4: Repeat steps 2 and 3 until k centroids have been found.

After finding the k centroids, we continue clustering with the standard K-means algorithm as in Section 3, Part III above.
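For illustration, the standard K-means iterations that follow the seeding can be sketched as below. This is an assumed, simplified version (Lloyd's assign-and-update loop) in pure Python; the function name `kmeans`, the `max_iter` parameter and the toy data are hypothetical and do not come from the report.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Standard K-means iterations starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: (age, annual income in thousands of dollars).
customers = [(25, 60), (27, 62), (26, 58), (41, 95), (43, 98), (40, 92)]
final_centroids, labels = kmeans(customers, centroids=[(25, 60), (41, 95)])
print(labels)
```

The loop stops as soon as no centroid moves, which is the usual convergence test for this algorithm.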

3.2 Linear Regression method

3.2.1 Definition

Linear Regression is a statistical method for modeling the relationship between a dependent variable (also known as the response variable) and one or more independent variables (also known as predictor variables). Linear regression seeks the best-fit line.

Regression is commonly used for two purposes. The first is to make predictions and forecasts, a task that overlaps with the field of machine learning. The second is to investigate possible causal relationships between independent and dependent variables. It is important to note that regression analysis on its own only reveals relationships between a dependent variable and a specific collection of variables; it does not provide conclusive evidence of causal relationships.

Simple Linear Regression is the most basic form of linear regression, and it involves finding the best straight line to fit the data. The equation for a simple linear regression model is y = b0 + b1*x.

In this equation, y is the dependent variable, x is the independent variable, b0 is the intercept, and b1 is the line's slope.
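As a small worked example of the equation y = b0 + b1*x, the sketch below fits the intercept and slope by ordinary least squares. The function name and the data (seat comfort ratings against a numeric satisfaction score) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data: seat comfort rating (1-5) and a numeric satisfaction score.
seat_comfort = [1, 2, 3, 4, 5]
satisfaction = [1.5, 2.0, 2.5, 3.0, 3.5]
b0, b1 = fit_simple_linear_regression(seat_comfort, satisfaction)
print(b0, b1)  # prints: 1.0 0.5
```

Here each additional point of seat comfort raises the predicted score by 0.5, which is exactly what the slope b1 expresses.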

Linear regression can be used to predict future values of the dependent variable, understand the relationship between the independent and dependent variables, and identify the most important predictors in a dataset, among other things. It is commonly used in finance, economics, social sciences, and engineering.

3.2.2 Evaluate Linear regression

The regression coefficient is used to evaluate the degree of influence of each independent variable on the dependent variable in a linear regression model. The coefficient indicates the average change in the dependent variable (in the unit of measurement of the dependent variable) when an independent variable changes by one unit. If the coefficient is positive, the independent and dependent variables change in the same direction: when the value of the independent variable increases, the value of the dependent variable also increases. Conversely, if the coefficient is negative, they change in opposite directions: when the value of the independent variable increases, the value of the dependent variable decreases. The larger the absolute value of the coefficient, the stronger the impact of the independent variable on the dependent variable. The coefficient is therefore essential for evaluating the effects of the independent variables on the dependent variable and helps us better understand the relationship between them.
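To make the reading of coefficients concrete, the following sketch fits a regression with two predictors through the normal equations and prints the fitted coefficients. It is an assumed, minimal implementation; the helper names, the predictors (seat comfort, departure delay) and the numeric satisfaction scores are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_regression(rows, ys):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 + ... via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 models the intercept
    cols = list(zip(*design))
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * y for u, y in zip(ci, ys)) for ci in cols]
    return solve_linear_system(xtx, xty)


# Hypothetical data: (seat comfort rating, departure delay in minutes).
rows = [(1, 0), (2, 30), (3, 10), (4, 60), (5, 20), (3, 45)]
scores = [2.5, 2.4, 3.3, 2.8, 4.1, 2.6]
intercept, b_seat, b_delay = fit_linear_regression(rows, scores)
print(round(intercept, 3), round(b_seat, 3), round(b_delay, 3))
```

In this toy data the seat comfort coefficient is positive and the delay coefficient is negative, matching the same-direction and opposite-direction readings described above. Comparing absolute values across predictors is only meaningful when the predictors are on comparable scales, so variables measured in different units are usually standardized first.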


IV Data mining and analysis using Orange software


Picture 1 Data Preprocessing

Step 1: Add data to Orange using the CSV File Import widget and evaluate the data. In this task, we added two datasets, Segmentation Information and Customer Survey, which are separate data files that we extracted to simulate data from multiple sources being imported into a data warehouse.


Picture 2 Adding data to Orange Data Mining by CSV File Import

To evaluate the data, we use the Feature Statistics widget in Orange. It gives a quick overview of the data by showing various statistics for each feature, which lets us identify prominent features and check for missing data.
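Outside Orange, the same kind of per-feature overview can be approximated with Python's standard library. The sketch below is only an illustration of the idea (mean, median, min, max and a missing-value count per column); the inline CSV sample and the function name are hypothetical, and `statistics.mode` is assumed to run on Python 3.8 or later.

```python
import csv
import io
import statistics

# Hypothetical sample standing in for an imported CSV file.
SAMPLE_CSV = """Age,Annual Income,Profession
25,60000,Engineer
41,,Lawyer
33,58000,
47,93000,Artist
"""


def feature_statistics(csv_text):
    """Summarize each column: missing count plus numeric or categorical statistics."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        present = [v for v in values if v != ""]  # empty cells count as missing
        stats = {"missing": len(values) - len(present)}
        if present:
            try:
                numbers = [float(v) for v in present]
            except ValueError:
                # Non-numeric column: report the most frequent value.
                stats["mode"] = statistics.mode(present)
            else:
                stats.update(
                    mean=statistics.mean(numbers),
                    median=statistics.median(numbers),
                    min=min(numbers),
                    max=max(numbers),
                )
        report[name] = stats
    return report


report = feature_statistics(SAMPLE_CSV)
for name, stats in report.items():
    print(name, stats)
```

The `missing` entry plays the role of the Missing column in the widget: any feature with a non-zero count needs handling before clustering or regression.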

Picture 3 Evaluate the data by the Feature Statistics (Customer Survey; columns: Name, Distribution, Mean, Mode, Median, Dispersion, Min, Max, Missing)

Picture 4 Missing Information

Step 2: If the data has missing values as shown above, we can use the Preprocess widget to handle them. Here, we select the Impute Missing Values option and choose Replace with Random Value to fill in the missing data with random values.
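The idea of random-value imputation can be sketched as follows. The sketch assumes that each missing entry is replaced by a value drawn at random from the observed values of the same column; how Orange samples its random values internally is not described in the report, so this is an illustration rather than a reproduction. The function name, the `seed` parameter and the sample column are hypothetical.

```python
import random


def impute_random(column, seed=0):
    """Replace None entries with values drawn at random from the observed ones."""
    rng = random.Random(seed)
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing to sample from, leave the column unchanged
    return [v if v is not None else rng.choice(observed) for v in column]


# Hypothetical column with two missing ages.
ages = [25, None, 41, 33, None, 47]
filled = impute_random(ages)
print(filled)
```

Random imputation keeps the filled column inside the range of observed values, but it adds noise, so results that depend heavily on the imputed rows should be read with care.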

Picture 5 Handling missing information (Preprocess widget: Impute Missing Values, Replace with random value)

To merge the two processed data files, we use the Concatenate widget in Orange. This widget stacks the rows of two or more data sources into a single table, keeping either all variables that appear in the input tables or only the variables that appear in all of them.
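A rough Python equivalent of this row-wise merge is sketched below, with tables represented as lists of dictionaries. The `mode` parameter mirrors the two variable-merging options of the widget (all variables, or only variables shared by every table); the function name and the sample rows are hypothetical.

```python
def concatenate(tables, mode="union"):
    """Stack the rows of several tables, each given as a list of dicts."""
    # Collect column names in first-seen order across all tables.
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    if mode == "intersection":
        # Keep only the columns that occur in every input table.
        columns = [
            name for name in columns
            if all(any(name in row for row in table) for table in tables)
        ]
    # Values absent from a source table become None (missing).
    return [{name: row.get(name) for name in columns} for table in tables for row in table]


# Hypothetical rows from the two source files.
segmentation = [
    {"ID": 1, "Age": 25, "Profession": "Engineer"},
    {"ID": 2, "Age": 41, "Profession": "Lawyer"},
]
survey = [{"ID": 3, "Age": 33, "Seat comfort": 4}]

merged_all = concatenate([segmentation, survey], mode="union")
merged_common = concatenate([segmentation, survey], mode="intersection")
print(len(merged_all), list(merged_all[0]))
print(len(merged_common), list(merged_common[0]))
```

With the union option, columns that exist in only one source produce missing values in rows from the other source, which is one reason to check the merged table before analysis.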

(Concatenate widget, Variable Merging: when there is no primary table, the output contains either all variables that appear in input tables or only variables that appear in all tables; data source IDs can optionally be appended.)

Picture 6 Merging 2 Data by Concatenate

Step 3: Add the processed data to a table to get an overview of the data.

Picture 7 Table of Customer Data After Preprocessing and Merging (sample rows showing gender, customer type, annual income and profession, among other columns)

4.2 Data mining and analysis using Orange software

Picture 8 Cluster analysis model and customer satisfaction level (workflow including the Distributions and Silhouette Plot widgets)

4.2.2 The customer portrait of the airline industry

a) Use distributions to visualize customer information


Figure 1 Passenger age and gender

In general, male customers fly more often than female customers. The age groups with the highest number of customers are 25 to 27 years old and the early 40s. The highest flight frequency is among male customers of about 41 years old, with nearly 120 flights. The number of passengers aged about 50 and older fluctuates but tends to decrease.


b) Customer information by age and customer type


Figure 2 Type of Customers

According to the chart, Loyal Customers are mainly between 40 and 60 years old. In particular, the 40 to 45 age group has the highest number of loyal customers, up to 180. Loyal customers also concentrate in the 25 to 30 age group, with up to 120 loyal customers. However, the 20 to 30 age group also has the largest number of disloyal customers, nearly 70. For other age groups, the number of disloyal customers is about 20 to 30.

c) Customer’s annual income


Figure 3 Passenger’s annual income

In general, customers of the aviation industry have an annual income of $60,000 or more, regardless of whether they are loyal or disloyal customers. Among them, those who use aviation services most often have an annual income of between $60,000 and $100,000. Although there are still customers with an income of less than $60,000, their number is not significant.


4.2.3 Describe customer portraits by using Box Plot

Customer profession


Date posted: 29/08/2024, 08:26
