Probability and Statistics Final Project. Topic: Airlines Traffic Passenger Statistics



HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY OF APPLIED SCIENCE FACULTY OF TRANSPORTATION ENGINEERING


Preface

This report covers descriptive statistics and data analysis, using ANOVA and linear regression to compare the factors that can affect the number of passengers between airlines.

To obtain the results, we used the R programming language to solve the different tasks. The final results are presented as numbers and graphs to give the clearest observations about how passenger numbers are affected by many factors. In this report, we include the theoretical basis and the analysis of "The number of passengers between the airlines."

The report has two main sections:

Section 1: Theoretical basis

Section 2: Analysis of the number of passengers

The table below is our task division for this report:

We also want to express our sincere thanks to Dr Nguyen Tien Dung for his instruction in Probability and Statistics. "Probability and Statistics are two important concepts in Maths. Probability is all about chance, whereas statistics is more about how we handle various data using different techniques. It helps to represent complicated data in a very easy and understandable way. Statistics has a huge application nowadays in data science professions. The professionals use statistics to make predictions about many different aspects and departures." It is an important subject, especially to us, Transportation Engineering students. Therefore, involvement in this subject has improved and sharpened our skills, not only in Data Science but also in teamwork and problem orientation.


1 Theoretical Basis

1.1 ANOVA (one-way)

Let k be the number of treatments (groups), and let $\mu_1, \mu_2, \ldots, \mu_k$ be the expected values of populations 1, 2, ..., k. We want to test the following:

+ $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$
+ $H_1$: There exist two different $\mu_i, \mu_j$ $(i \neq j)$

We need the following assumptions:
+ The data are random and independent.
+ They follow a normal distribution.
+ They have the same variances.

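Formula:

$$F = \frac{MSB}{MSW} = \frac{SSB/(k-1)}{SSW/(N-k)}$$

$$SSB = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2$$

$$SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$$

Where:
k: total number of groups
N: total number of observations
$n_i$: number of samples in group i
$x_{ij}$: the j-th observation in group i
$\bar{x}_i$: mean value of the samples in group i
$\bar{x}$: overall mean of all observations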
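As a small illustration of these quantities, the following R sketch computes SSB, SSW and F = (SSB/(k-1)) / (SSW/(N-k)) by hand and compares the result with the F value reported by aov(). The three groups are made-up numbers chosen for easy arithmetic; they are not taken from the airline dataset.

```r
# Toy data (hypothetical values, for illustration only)
y <- c(4, 5, 6, 7, 8, 9, 10, 11, 12)
g <- factor(rep(c("A", "B", "C"), each = 3))

k <- nlevels(g)                 # number of groups
N <- length(y)                  # total number of observations
n_i <- tapply(y, g, length)     # group sizes
mean_i <- tapply(y, g, mean)    # group means
grand_mean <- mean(y)           # overall mean

SSB <- sum(n_i * (mean_i - grand_mean)^2)        # between-group sum of squares
SSW <- sum((y - mean_i[as.character(g)])^2)      # within-group sum of squares
F_manual <- (SSB / (k - 1)) / (SSW / (N - k))

# Same F statistic from R's built-in one-way ANOVA
F_aov <- summary(aov(y ~ g))[[1]][["F value"]][1]

# Critical value at the 5% level; H0 is rejected if F_manual exceeds it
F_crit <- qf(0.95, df1 = k - 1, df2 = N - k)
print(c(F_manual = F_manual, F_aov = F_aov, F_crit = F_crit))
```

Here SSB = 54 and SSW = 6, so F = (54/2)/(6/6) = 27, which matches the aov() result.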


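We can reject $H_0$ if $F > F_{\alpha;\,k-1,\,N-k}$, the critical value of the F distribution with $k-1$ and $N-k$ degrees of freedom at significance level $\alpha$.

1.2 Others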

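1.2.1 Shapiro-Wilk test

The Shapiro-Wilk test tests the null hypothesis that a sample $x_1, x_2, \ldots, x_n$ came from a normally distributed population. The test statistic is:

$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Where:
+ $x_{(i)}$ is the i-th order statistic (the i-th smallest value in the sample) and $\bar{x}$ is the sample mean.
+ $a_i$ is called the Shapiro-Wilk constant, given by $(a_1, a_2, \ldots, a_n) = \frac{m^{T} V^{-1}}{C}$; some of its values can be found in published tables.
+ $m = (m_1, \ldots, m_n)^{T}$ is made of the expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution N(0,1).
+ V is the covariance matrix of those order statistics.
+ C is the norm $C = \lVert V^{-1} m \rVert = (m^{T} V^{-1} V^{-1} m)^{1/2}$.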

We wish to test the following hypotheses:
Null hypothesis $H_0$: The sample is normally distributed.
Alternative hypothesis $H_1$: The sample is not normally distributed.
If the p-value is less than $\alpha$, we reject $H_0$ and conclude that the sample does not follow a normal distribution; otherwise, we cannot reject $H_0$, and the sample is treated as consistent with a normal distribution.
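The decision rule can be sketched in R with the built-in shapiro.test() function. The two samples below are simulated (one from a normal distribution, one from a right-skewed exponential distribution) purely for illustration; the seed and sample sizes are arbitrary choices.

```r
set.seed(1)                 # fixed seed so the sketch is reproducible
x_normal <- rnorm(100)      # sample from N(0, 1)
x_skewed <- rexp(100)       # right-skewed sample

alpha <- 0.05
res_normal <- shapiro.test(x_normal)
res_skewed <- shapiro.test(x_skewed)

# Reject H0 (normality) when the p-value is below alpha
decide <- function(res) {
  if (res$p.value < alpha) "reject H0" else "fail to reject H0"
}
cat("normal sample:", decide(res_normal), "\n")
cat("skewed sample:", decide(res_skewed), "\n")
```

The returned object also carries the W statistic in `res$statistic`, which is always at most 1; values close to 1 are consistent with normality.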

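1.2.2 Levene's test

The test statistic is:

$$W = \frac{N-k}{k-1} \cdot \frac{\sum_{i=1}^{k} N_i (\bar{Z}_{i\cdot} - \bar{Z}_{\cdot\cdot})^2}{\sum_{i=1}^{k} \sum_{j=1}^{N_i} (Z_{ij} - \bar{Z}_{i\cdot})^2}$$

where $Z_{ij} = |Y_{ij} - \tilde{Y}_i|$, $N_i$ is the size of group i, $\bar{Z}_{i\cdot}$ is the mean of the $Z_{ij}$ in group i, and $\bar{Z}_{\cdot\cdot}$ is the overall mean of the $Z_{ij}$. The group center $\tilde{Y}_i$ can be the mean, the median, or the trimmed mean of group i: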


Trimmed mean: similar to the mid-mean except that different percentile values are used. For example, a common choice is to trim 5% of the points in both the lower and upper tails, i.e., calculate the mean for data between the 5th and 95th percentiles.

+ Using mean provides the best power for symmetric, moderate-tailed, distributions

+ Using trimmed means performs best when the underlying data follows a Cauchy distribution

+ Using the median performs best when the underlying data follow a $\chi^2_4$ (skewed) distribution (Recommended)

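Given significance level $\alpha$ and W defined above, we can reject $H_0$ if $W > F_{\alpha;\,k-1,\,N-k}$, the upper critical value of the F distribution with $k-1$ and $N-k$ degrees of freedom.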
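A minimal sketch of this test in base R, assuming the median is used as the group center: the absolute deviations from each group median are computed first, and W is then the one-way ANOVA F statistic of those deviations. The data are hypothetical values chosen so that the three groups have visibly different spread.

```r
# Toy data (hypothetical): three groups with different spread
y <- c(10, 11, 12, 13, 14,   5, 10, 15, 20, 25,   0, 12, 24, 36, 48)
g <- factor(rep(c("A", "B", "C"), each = 5))

k <- nlevels(g)
N <- length(y)

# Absolute deviation of each observation from its group median
z <- abs(y - ave(y, g, FUN = median))

N_i <- tapply(z, g, length)   # group sizes
zbar_i <- tapply(z, g, mean)  # group means of the deviations
zbar <- mean(z)               # overall mean of the deviations

W <- ((N - k) / (k - 1)) *
  sum(N_i * (zbar_i - zbar)^2) /
  sum((z - ave(z, g))^2)

# Upper-tail p-value from the F distribution with k-1 and N-k degrees of freedom
p_value <- pf(W, df1 = k - 1, df2 = N - k, lower.tail = FALSE)
print(c(W = W, p_value = p_value))
```

The same W can be obtained with `anova(lm(z ~ g))`, which is a convenient cross-check of the hand computation.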

1.2.3 Tukey’s Honest Significance Difference (Tukey’s HSD) test

Tukey’s range test, also known as Tukey’s test, Tukey method, Tukey’s honest significance test, or Tukey’s HSD (honestly significant difference) test,[1] is a single-step multiple comparison procedure and statistical test. It can be used to find means that are significantly different from each other.

Tukey’s test compares the means of every treatment to the means of every other treatment; that is, it applies simultaneously to the set of all pairwise comparisons

That is, for all $i \neq j$ it considers the differences $\mu_i - \mu_j$.

Assumptions:
+ The observations being tested are independent within and among the groups.
+ The groups associated with each mean in the test are normally distributed.
+ There is equal within-group variance across the groups associated with each mean in the test (homogeneity of variance).

The statistic can be described as:

$$q_s = \frac{\bar{Y}_A - \bar{Y}_B}{SE}$$

where $\bar{Y}_A$ is the larger of the two means being compared, $\bar{Y}_B$ is the smaller of the two means being compared, and SE is the standard error of the sum of the means.

This $q_s$ value can then be compared to a q-value from the studentized range distribution. If the $q_s$ value is larger than the critical value $q_\alpha$ obtained from the distribution, the two means are said to be significantly different at level $\alpha$, $0 < \alpha < 1$.

Since the null hypothesis for Tukey’s test states that all means being compared are from the same population ($\mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$), the means should be normally distributed (according to the central limit theorem). This gives rise to the normality assumption of Tukey’s test.
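The comparison of the statistic with the studentized range distribution can be sketched in R using qtukey() and ptukey(). The data are hypothetical, with equal group sizes n, and SE is taken here as sqrt(MSW / n), where MSW is the within-group mean square from the ANOVA; that choice of SE is an assumption of this sketch for the balanced case. The hand-computed p-value is compared with the "p adj" value from TukeyHSD().

```r
# Toy data (hypothetical): three groups of equal size n = 4
y <- c(10, 12, 11, 13,   14, 15, 16, 17,   20, 19, 21, 22)
g <- factor(rep(c("A", "B", "C"), each = 4))

k <- nlevels(g)
n <- 4
N <- length(y)

fit <- aov(y ~ g)
MSW <- deviance(fit) / df.residual(fit)   # within-group mean square
SE <- sqrt(MSW / n)                       # standard error for the balanced case

means <- tapply(y, g, mean)
q_s <- (max(means["C"], means["A"]) - min(means["C"], means["A"])) / SE

alpha <- 0.05
q_crit <- qtukey(1 - alpha, nmeans = k, df = N - k)               # critical value
p_adj <- ptukey(q_s, nmeans = k, df = N - k, lower.tail = FALSE)  # p-value by hand

# "p adj" reported by TukeyHSD for the same pair of groups
p_tukey <- TukeyHSD(fit)$g["C-A", "p adj"]
print(c(q_s = q_s, q_crit = q_crit, p_adj = p_adj, p_tukey = p_tukey))
```

In this toy example the statistic exceeds the critical value, so the means of groups A and C are declared different at the 5% level.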


2 Data analysis

2.1 Data importing

Explanation of the dataset

The data that our team uses for this project is taken from Kaggle (website: https://www.kaggle.com/).

The Flight data.csv dataset contains information on the number of flights departing from two major airports in the USA, namely SEA (Seattle) and PDX (Portland). This dataset is supplied by the United States government to analyze the causes of late flight departures.

Attribute information:
- Total number of flights counted: 15007
- Total number of variables: 16

Description of the main variables:
- Activity Period, Operating Airline, Operating Airline IATA Code, Published Airline, Published Airline IATA Code, GEO Summary, GEO Region, Activity Type Code, Price Category Code, Terminal, Boarding Area, Passenger Count, Adjusted Activity Type Code, Adjusted Passenger Count, Year, Month
- Carrier: the name of the airline, encoded with 2 capital letters: UA = United Airlines, VX = Virgin America, etc.

Required steps:
a. Import the flight information into RStudio.
b. Create a new data frame named new_data containing the main variables.
c. Among the variables under consideration, some contain many missing values (NA, Not Available). Print the table of missing-value rates for each variable, and suggest a method to handle these missing values.
d. Calculate basic statistics (sample size, mean, standard deviation, min, max, quartiles).
e. Plot a boxplot corresponding to each carrier airline.


f. Based on the boxplots, analyze the late departure time of each airline.

Import data

- Use the read.csv function in R and specify the path of the dataset (load() only reads saved R data files, not CSV files), and include the necessary library for the task.
- Code:

library(dplyr)  # assumed here: provides %>% and select(), which are used in the data cleaning step
ATPS_data <- read.csv("~/Desktop/Air_Traffic_Passenger_Statistics.csv", check.names = FALSE)


## $ Activity Period : int 200507 200507 200507 200507 200507 200507 200507 200507 200507 200507
## $ Published Airline : chr "ATA Airlines" "ATA Airlines" "ATA Airlines" "Air Canada "
## $ Published Airline IATA Code: chr "TZ" "TZ" "TZ" "AC"
## $ GEO Region : chr "US" "US" "US" "Canada"
## $ Boarding Area : chr "B" "B" "B" "B"
## $ Passenger Count : int 27271 29131 5415 35156 34090 6263 5500 12050 11638 4998
## $ Adjusted Activity Type Code: chr "Deplaned" "Enplaned" "Thru / Transit * 2" "Deplaned"
## $ Adjusted Passenger Count : int 27271 29131 10830 35156 34090 6263 5500 12050 11638 4998
## $ Year : int 2005 2005 2005 2005 2005 2005 2005 2005 2005 2005

2.2 Data cleaning

- Create a new data frame named "new_data" including the variables that we want to analyze:

new_data <- ATPS_data %>% select(-c("index", "Activity Period", "Operating Airline", "Published Airline", "Published Airline IATA Code", "Passenger Count", "Activity Type Code"))

## Operating Airline IATA Code GEO Summary GEO Region Price Category Code
## 1 TZ Domestic US Low Fare

names(new_data) <- gsub(" ", "_", names(new_data))

head(new_data,3)

## Operating_Airline_IATA_Code GEO_Summary GEO_Region Price_Category_Code
## 3 TZ Domestic US Low Fare
## 1 2005 July
## 2 2005 July

Check missing data in "new_data"


## Operating_Airline_IATA_Code GEO_Summary

new_data$Month <- match(new_data$Month, month.name)
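The two cleaning steps above (checking the rate of missing values and converting month names to numbers) can be sketched on a small stand-in data frame. The data frame `toy` and its values are hypothetical and only mimic a few columns of new_data; colMeans(is.na(...)) is one possible way to obtain the missing-value rate, not necessarily the one used for the original output.

```r
# Toy data frame (hypothetical values) standing in for new_data
toy <- data.frame(
  Terminal = c("Terminal 1", NA, "Terminal 3", "International"),
  Adjusted_Passenger_Count = c(27271, 29131, NA, 35156),
  Month = c("July", "July", "August", "December"),
  stringsAsFactors = FALSE
)

# Share of missing values (NA) in each column
na_rate <- colMeans(is.na(toy))
print(na_rate)

# Convert month names to month numbers with the built-in constant month.name
toy$Month <- match(toy$Month, month.name)
print(toy$Month)
```

In this toy frame, Terminal and Adjusted_Passenger_Count each have a missing rate of 0.25, and the months become 7, 7, 8 and 12.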

2.3 Data clarification

2.3.1 Theory

Histogram

A histogram can be defined as a set of rectangles whose bases lie along the intervals between class boundaries. Each rectangular bar depicts some sort of data, and all the rectangles are adjacent. The heights of the rectangles are proportional to the frequencies of the corresponding classes.

A histogram is the graphical representation of data where data is grouped into continuous number ranges and each range corresponds to a vertical bar.

+ The horizontal axis displays the number ranges.
+ The vertical axis (frequency) represents the amount of data that is present in each range.

The number ranges depend upon the data that is being used.
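To make the link between class boundaries and bar heights concrete, the sketch below bins a small hypothetical sample with hist() and inspects the result without drawing it. The sample values and the breaks are arbitrary choices for illustration.

```r
# Hypothetical sample values, for illustration only
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)

# plot = FALSE returns the bins without drawing, so the counts can be inspected
h <- hist(x, breaks = c(0, 2, 4, 6, 8, 10), plot = FALSE)
print(h$breaks)   # class boundaries (horizontal axis)
print(h$counts)   # frequencies (bar heights on the vertical axis)
```

With R's default right-closed intervals, the ranges (0,2], (2,4], (4,6], (6,8] and (8,10] contain 3, 5, 1, 0 and 1 values respectively.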

Boxplot


In its simplest form, the boxplot presents five sample statistics (the minimum, the lower quartile, the median, the upper quartile and the maximum) in a visual display. The box of the plot is a rectangle which encloses the middle half of the sample, with an end at each quartile. The length of the box is thus the interquartile range of the sample. The other dimension of the box does not represent anything in particular. A line is drawn across the box at the sample median. Whiskers sprout from the two ends of the box until they reach the sample maximum and minimum. The crossbar at the far end of each whisker is optional and its length signifies nothing. The following diagram shows a dotplot of a sample of 20 observations (actual sample values used in the display) together with a boxplot of the same data.


Although boxplots can be drawn in any orientation, most statistical packages seem to produce them vertically by default, as shown on the right, rather than horizontally. The length of the box becomes its height. The width across the page signifies nothing.

Much more can be read from a boxplot than might be surmised from the simplistic method of its construction, particularly when the boxplots of several samples are lined up alongside one another (Parallel Boxplots). The box length gives an indication of the sample variability and the line across the box shows where the sample is centered. The position of the box in its whiskers and the position of the line in the box also tell us whether the sample is symmetric or skewed, either to the right or left. For a symmetric distribution, long whiskers, relative to the box length, can betray a heavy tailed population and short whiskers, a short tailed population. So, provided the number of points in the sample is not too small, the boxplot also gives us some idea of the "shape" of the sample, and by implication, the shape of the population from which it was drawn. This is all important when considering appropriate analyses of the data.
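The five statistics behind a boxplot can be inspected directly in R. One point worth noting is that R's boxplot() does not use the simplest form described above by default: its whiskers stop at the most extreme points within 1.5 times the box length, and points beyond that are drawn individually. The sample below is hypothetical and contains one large value to show the difference.

```r
# Hypothetical sample with one large value
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

five <- fivenum(x)        # minimum, lower hinge, median, upper hinge, maximum
bs <- boxplot.stats(x)    # what boxplot() draws with its default settings

print(five)
print(bs$stats)           # lower whisker, box edges with median, upper whisker
print(bs$out)             # points plotted individually beyond the whiskers
```

Here the five-number summary is 2, 4, 6, 8, 30, while the default boxplot ends its upper whisker at 9 and shows 30 as a separate point.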

Scatter plot

A scatter plot can be used either when one continuous variable is under the control of the experimenter and the other depends on it or when both continuous variables are independent. If a parameter exists that is systematically incremented and/or decremented by the other, it is called the control parameter or independent variable and is customarily plotted along the horizontal axis. The measured or dependent variable is customarily plotted along the vertical axis. If no dependent variable exists, either type of variable can be plotted on either axis and a scatter plot will illustrate only the degree of correlation (not causation) between two variables.
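As a brief illustration, the sketch below draws a scatter plot of two hypothetical variables and measures the degree of correlation with cor(). The values are invented so that y rises almost linearly with x.

```r
# Hypothetical paired observations
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)

r <- cor(x, y)   # Pearson correlation coefficient, between -1 and 1

# Independent variable on the horizontal axis, dependent variable on the vertical axis
plot(x, y, pch = 20, xlab = "independent variable", ylab = "dependent variable")
print(r)
```

A coefficient close to 1 indicates a strong positive linear association, but, as noted above, it says nothing about causation.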

2.3.2 Calculate descriptive statistics for continuous variables:

library(psych)
describe(new_data[, c("Adjusted_Passenger_Count", "Year", "Month")], fast = TRUE)

##                          vars     n     mean       sd  min    max  range     se
## Adjusted_Passenger_Count    1 15007 29331.92 58284.18    1 659837 659836 475.78
## Year                        2 15007  2010.39     3.14 2005   2016     11   0.03
## Month                       3 15007     6.55     3.46    1     12     11   0.03

Draw a histogram showing the distribution of the number of passengers:

hist(new_data$Adjusted_Passenger_Count, xlab = "Adjusted_Passenger_Count", main = "Histogram of Adjusted_Passenger_Count", labels = TRUE, ylim = c(0, 13100))

[Figure: Histogram of Adjusted_Passenger_Count. x-axis: Adjusted_Passenger_Count, from 0 to 6e+05. Bar labels (frequencies), left to right: 12768, 1053, 678, 185, 39, 45, 73, 114, 38, 2, 4, 4, 3, 1.]

The graph of Adjusted_Passenger_Count displays a distribution with a right-skewed shape. The distribution is evident, with the initial part having the highest frequency, surpassing 12,500. However, the frequency decreases sharply in the subsequent bins until it reaches 1.

Draw boxplots showing the distribution of passenger numbers according to the categorical variables:

par(cex.axis = 0.5)
boxplot(Adjusted_Passenger_Count ~ Operating_Airline_IATA_Code, new_data)
boxplot(Adjusted_Passenger_Count ~ GEO_Summary, new_data)

par(cex.axis = 0.5)
boxplot(Adjusted_Passenger_Count ~ GEO_Region, new_data)
boxplot(Adjusted_Passenger_Count ~ Price_Category_Code, new_data)
boxplot(Adjusted_Passenger_Count ~ Terminal, new_data)

boxplot(Adjusted_Passenger_Count ~ Boarding_Area, new_data)
boxplot(Adjusted_Passenger_Count ~ Adjusted_Activity_Type_Code, new_data)

[Figures: boxplots of Adjusted_Passenger_Count by each categorical variable, including Price_Category_Code]

According to the summary and the plots above:

Operating_Airline_IATA_Code: The boxplot shows more clearly that the median, 1st quartile and 3rd quartile fluctuate from airline to airline. This again suggests that the passenger counts of most airlines are not normally distributed. Moreover, the outliers, which cannot be removed, have a significant effect on the dataset.

GEO_Summary: The Q1 and Q3 values of domestic flights are clearly very different from those of international flights. The two groups therefore have different distributions, and the median also clearly decreases from one group to the other.

GEO_Region: The median, 1st quartile and 3rd quartile fluctuate only slightly between regions. The distributions for all regions except the United States are almost the same, and the United States has the highest distribution.

Price_Category_Code: The medians of the low fare category and the other category are at almost the same level. However, Q3 of the low fare category is higher than that of the other category. This suggests that travelers tend to prioritize the type of ticket whose price suits them better.

Terminal: The boxplot shows that the median fluctuates marginally between terminals. Among the domestic terminals (Terminal 1, Terminal 2 and Terminal 3), the second one stands out: its median and Q1 are higher than those of the first and third terminals, while its Q3 is lower.

Boarding_Area: The medians of all areas are almost the same except for area D, which has the highest median and also the highest Q1. By contrast, area F has the highest Q3.

Adjusted_Activity_Type_Code: The medians of deplaned, enplaned and transit passengers are the same. Similarly, the Q3 values of enplaned and deplaned passengers are almost equal.

Draw scatter plots showing the distribution of passenger numbers by month and by year:

plot(new_data$Month, new_data$Adjusted_Passenger_Count, pch = 20)
plot(new_data$Year, new_data$Adjusted_Passenger_Count, pch = 20)


Month & year: For the monthly data, passenger counts are distributed fairly uniformly across the months of the year. Notably, the later months, from August to December, show a more obvious increase than the first half of the year. For the yearly data, on the other hand, a noticeable surge is observed from 2012 to 2014 in contrast to other years.

2.4 ANOVA method

Because the data file is relatively large and there are relatively many airlines in different regions, in this problem we select a subset of the data and use the ANOVA method to compare the number of passengers between the airlines in the Mexico region.

Filtering Mexico region data:

Mexico_data <- subset(new_data, new_data$GEO_Region == "Mexico")

Airlines in Mexico:

unique(Mexico_data$Operating_Airline_IATA_Code)
## [1] "AS" "MX" "UA" "AM" "VX" "SY"

Comment: The airlines in the Mexico region include Alaska Airlines, Mexicana Airlines, United Airlines - Pre 07/01/2013, Aeromexico, Virgin America and Sun Country Airlines.

Draw a boxplot showing the distribution of passenger numbers among airlines:

boxplot(Adjusted_Passenger_Count ~ Operating_Airline_IATA_Code, Mexico_data)


Comment: The number of passengers differs considerably between the airlines in Mexico. We want to determine whether or not the mean number of passengers is the same across these airlines.

Step 1: Check ANOVA assumptions:

Check the normality assumption by using the Shapiro-Wilk test

For AM:

[Figure: Normal Q-Q plot; x-axis: Theoretical Quantiles]

shapiro.test(AM_data$Adjusted_Passenger_Count)
##
## Shapiro-Wilk normality test
##


## W = 0.84735, p-value = 1.91e-12

Comment: By applying the Shapiro-Wilk test, we get a p-value of $1.91 \times 10^{-12}$. The p-value is smaller than 0.05, so we can conclude that the residuals do not follow a normal distribution.

For AS:

qqnorm(AS_data$Adjusted_Passenger_Count)
qqline(AS_data$Adjusted_Passenger_Count)

## data: AS_data$Adjusted_Passenger_Count


## W = 0.96257, p-value = 5.484e-07

Comment: By applying the Shapiro-Wilk test, we get a p-value of $5.484 \times 10^{-7}$. The p-value is smaller than 0.05, so we can conclude that the residuals do not follow a normal distribution.

For MX:

[Figure: Normal Q-Q plot]

## Shapiro-Wilk normality test
##
## data: MX_data$Adjusted_Passenger_Count
## W = 0.97591, p-value = 0.02566

Comment: By applying the Shapiro-Wilk test, we get a p-value of 0.02566. The p-value is smaller than 0.05, so we can conclude that the residuals do not follow a normal distribution.

For SY:


qqnorm(SY_data$Adjusted_Passenger_Count)
qqline(SY_data$Adjusted_Passenger_Count)

[Figure: Normal Q-Q Plot]

## data: SY_data$Adjusted_Passenger_Count
## W = 0.92242, p-value = 0.3393

Comment: By applying the Shapiro-Wilk test, we get a p-value of 0.3393. The p-value is larger than 0.05, so $H_0$ cannot be rejected, which means the residuals can be treated as following a normal distribution.

For UA:

UA_data <- subset(Mexico_data, Mexico_data$Operating_Airline_IATA_Code == "UA")


qqnorm(UA_data$Adjusted_Passenger_Count)
qqline(UA_data$Adjusted_Passenger_Count)

shapiro.test(UA_data$Adjusted_Passenger_Count)
##
## Shapiro-Wilk normality test
##
## data: UA_data$Adjusted_Passenger_Count
## W = 0.92402, p-value = 8.95e-13

Comment: By applying the Shapiro-Wilk test, we get a p-value of $8.95 \times 10^{-13}$. The p-value is smaller than 0.05, so we can conclude that the residuals do not follow a normal distribution.


For VX:

## data: VX_data$Adjusted_Passenger_Count
## W = 0.97976, p-value = 0.05076

Comment: By applying the Shapiro-Wilk test, we get a p-value of 0.05076. The p-value is larger than 0.05, so $H_0$ cannot be rejected.
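The per-airline checks above repeat the same call for each airline code. A compact alternative can be sketched by splitting the passenger counts by airline and applying shapiro.test() to each group. The data frame `toy` below is simulated and only borrows the column names of Mexico_data; the airline codes "AA", "BB" and "CC" and all values are hypothetical.

```r
# Simulated stand-in for Mexico_data (hypothetical values, not the real dataset)
set.seed(42)
toy <- data.frame(
  Operating_Airline_IATA_Code = rep(c("AA", "BB", "CC"), each = 30),
  Adjusted_Passenger_Count = c(rnorm(30, 5000, 500),
                               rexp(30, 1 / 4000),
                               rnorm(30, 8000, 900))
)

# One Shapiro-Wilk test per airline code
groups <- split(toy$Adjusted_Passenger_Count, toy$Operating_Airline_IATA_Code)
p_values <- sapply(groups, function(x) shapiro.test(x)$p.value)

result <- data.frame(
  airline = names(p_values),
  p_value = as.numeric(p_values),
  reject_normality = as.numeric(p_values) < 0.05
)
print(result)
```

With the real data, replacing `toy` by Mexico_data should give one row per airline, which makes the six conclusions easier to compare side by side.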

Check the homogeneity of variance assumption

library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##


## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value    Pr(>F)
## group  5    99.1 < 2.2e-16 ***
##
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Remark:

Hypothesis $H_0$: The variances of the number of passengers of all airlines are the same.
Hypothesis $H_1$: At least 2 airlines have different variances of the number of passengers.

Since the p-value $< 2.2 \times 10^{-16}$ is below the significance level of 5%, we reject the hypothesis $H_0$. So, the variances of the number of passengers differ between airlines.

Step 2: State the hypotheses:
Null hypothesis $H_0$: The mean number of passengers is equal across the regional airlines.
Alternative hypothesis $H_1$: There are at least 2 regional airlines in Mexico with different mean passenger numbers.

Step 3: Calculate p_value using aov() function:

anova_1 <- aov(Adjusted_Passenger_Count ~ Operating_Airline_IATA_Code, data = Mexico_data)
summary(anova_1)
##                             Df    Sum Sq   Mean Sq F value Pr(>F)
## Operating_Airline_IATA_Code  5 4.114e+09 822869891   33.95 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 4: Compare p_value to $\alpha$

Because Pr(>F) $< 2 \times 10^{-16}$ (far smaller than the significance level of 5%), we reject $H_0$. So, there is a difference in the mean number of passengers between the regional airlines in Mexico.
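Steps 3 and 4 can also be carried out programmatically by reading the F value and the p-value out of the ANOVA table instead of from the printed output. The sketch below uses simulated data that only borrows the column names of Mexico_data; the airline codes and all numbers are hypothetical.

```r
# Simulated stand-in for Mexico_data (hypothetical values, not the real dataset)
set.seed(7)
toy <- data.frame(
  Operating_Airline_IATA_Code = factor(rep(c("AA", "BB", "CC"), each = 20)),
  Adjusted_Passenger_Count = c(rnorm(20, 5000, 800),
                               rnorm(20, 5200, 800),
                               rnorm(20, 9000, 800))
)

anova_toy <- aov(Adjusted_Passenger_Count ~ Operating_Airline_IATA_Code, data = toy)
tab <- summary(anova_toy)[[1]]   # the ANOVA table as a data frame

F_value <- tab[["F value"]][1]   # first row: the factor; second row: residuals
p_value <- tab[["Pr(>F)"]][1]

alpha <- 0.05
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
cat("F =", F_value, " p =", p_value, " ->", decision, "\n")
```

Because one simulated group has a much larger mean than the other two, the test rejects $H_0$ here, mirroring the decision taken for the Mexico data.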

Step 5: Post-hoc test:

Post-hoc tests (or post-hoc comparison tests) are used at the second stage of the analysis of variance (ANOVA) if the null hypothesis is rejected. The question of interest at this stage is which groups significantly differ from others with respect to the mean.


The Tukey Test (or Tukey procedure), also called Tukey’s Honest Significant Difference test, is a post-hoc test based on the studentized range distribution. An ANOVA test can tell you if your results are significant overall, but it won’t tell you exactly where those differences lie. After you have run an ANOVA and found significant results, then you can run Tukey’s HSD to find out which specific groups’ means (compared with each other) are different. The test compares all possible pairs of means.

TukeyHSD(anova_1)
## Tukey multiple comparisons of means
##
## Fit: aov(formula = Adjusted_Passenger_Count ~ Operating_Airline_IATA_Code, data = Mexico_data)
##
## $Operating_Airline_IATA_Code
##            diff          lwr        upr     p adj
## MX-AM  2529.984    889.90407  4170.0644 0.0001693
## SY-AM -4697.277  -9062.02770  -332.5258 0.0264195
## UA-AM  3973.733   2697.21202  5250.2541 0.0000000
## VX-AM -1246.869  -2868.01989   374.2824 0.2405110
## MX-AS  1139.540   -360.79036  2639.8699 0.2535670
## SY-AS -6087.721 -10401.90460 -1773.5378 0.0008487
## UA-AS  2583.289   1492.12704  3674.4501 0.0000000
## VX-AS -2637.313  -4116.92755 -1157.6988 0.0000063
## SY-MX -7227.261 -11648.40924 -2806.1128 0.0000504
## UA-MX  1443.749    -13.99156  2901.4892 0.0539653
## VX-MX -3776.853  -5544.23298 -2009.4729 0.0000000
## UA-SY  8671.010   4371.45230 12970.5673 0.0000002
## VX-SY  3450.408   -963.75326  7864.5693 0.2241127
## VX-UA -5220.602  -6657.01226 -3784.1913 0.0000000

plot(TukeyHSD(anova_1))
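To read the Tukey output, each pair's adjusted p-value is compared with the significance level. The sketch below copies the adjusted p-values reported above into a named vector and separates the pairs at the 5% level; only the pairs listed in the output are included.

```r
# Adjusted p-values copied from the TukeyHSD output above
p_adj <- c(
  "MX-AM" = 0.0001693, "SY-AM" = 0.0264195, "UA-AM" = 0.0000000,
  "VX-AM" = 0.2405110, "MX-AS" = 0.2535670, "SY-AS" = 0.0008487,
  "UA-AS" = 0.0000000, "VX-AS" = 0.0000063, "SY-MX" = 0.0000504,
  "UA-MX" = 0.0539653, "VX-MX" = 0.0000000, "UA-SY" = 0.0000002,
  "VX-SY" = 0.2241127, "VX-UA" = 0.0000000
)

alpha <- 0.05
significant <- names(p_adj)[p_adj < alpha]        # means differ at the 5% level
not_significant <- names(p_adj)[p_adj >= alpha]   # no difference detected

cat("Pairs with different means at the 5% level:\n", significant, "\n")
cat("Pairs without a detected difference:\n", not_significant, "\n")
```

Of the 14 listed pairs, 10 differ significantly; no difference is detected for VX-AM, MX-AS, UA-MX and VX-SY, and UA-MX (p adj = 0.0539653) lies only just above the 5% threshold.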


Title	Airlines Traffic Passenger Statistics
Authors	Lý Chớ Hào, Nguyễn Huy Hoang, D6 Quang Sinh, Vừ Minh Thăng, Nguyễn Khỏnh Duy
Supervisor	Dr Nguyen Tien Dung
University	Ho Chi Minh City University of Technology
Major	Probability and Statistics
Type	Final Project
Year of publication	2023
City	Ho Chi Minh City

Format
Number of pages	53
File size	3.53 MB