HO CHI MINH NATIONAL UNIVERSITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
****************

PROBABILITY AND STATISTICS PROJECT
Group – Class CC03 – Semester 221
Lecturer: Dr. Phan Thị Hưng

Group Members:
Đinh Lê Minh – 2153568
Bin Công Khanh – 2053104
Nguyễn Bảo Duy – 2152042
Ngô Minh Đức – 2052073
Nguyễn Trọng Thân – 2052720

HO CHI MINH CITY, DECEMBER 2022

TABLE OF CONTENTS
Workload
Theory summary
Activity
  3.1 Import data
  3.2 Data cleaning
  3.3 Data visualization
    3.3.1 Descriptive statistics for each of the variables
    3.3.2 Graph: boxplot – dep_delay for each carrier. Remove outliers
  3.4 One-way ANOVA
  3.5 Generalized linear model
Activity
  4.1 Introduction
  4.2 Descriptive statistics
  4.3 Hypothesis testing
    4.3.1 The dataset is assumed to fully satisfy the conditions of ANOVA
    4.3.2 The dataset does not fully satisfy the conditions of ANOVA
      4.3.2.1 Kruskal–Wallis test
      4.3.2.2 Brown–Forsythe test
  4.4 Ridge regression model
    4.4.1 Checking correlation
    4.4.2 Build Ridge regression model
      4.4.2.1 Define response (y) and predictor (x) variables
      4.4.2.2 Fit Ridge regression model
      4.4.2.3 Choose an optimal value for λ (lambda)
      4.4.2.4 Analyze final model
Reference

Workload
Đinh Lê Minh, ID 2153568 – Activity 1: Questions 1, 2, 3, 5; summarise the report, R code, theory summary – 23%
Bin Công Khanh, ID 2053104 – Activity 1: Question 4; write the report, R code, theory summary – 18%
Nguyễn Bảo Duy, ID 2152042 – Activity 2: find the dataset, write the report, R code, theory summary – 18%
Ngô Minh Đức, ID 2052073 – Activity 2: R code, summarise the report, theory summary – 23%
Nguyễn Trọng Thân, ID 2052720 – Activity 2: find the dataset, write the report, R code, theory summary – 18%

Theory summary

Definition of ANOVA
Analysis of Variance (ANOVA) is a parametric statistical technique used to compare data sets. In simple words, an ANOVA analysis evaluates the potential difference in a scale-level (continuous) dependent variable across the categories of a nominal-level variable with two or more groups.

Classification
There are more than two types of analysis of variance, but within the scope of this report we will only cover the two most common: one-factor (one-way) analysis of variance and two-factor (two-way) analysis of variance. One-factor analysis of variance analyses the effect of one causal (qualitative) factor on an outcome (quantitative) factor. Two-factor ANOVA is an extension of one-way analysis of variance: with one-way ANOVA we have one independent variable affecting the dependent variable, whereas with two-way ANOVA we have two independent variables.

Shapiro–Wilk normality test
The Shapiro–Wilk test is a way to tell whether a random sample comes from a normal distribution. The test gives a W statistic; small values indicate that the sample is not normally distributed (you can reject the null hypothesis that the population is normally distributed if the p-value is below the chosen significance level).

Levene test for equality of variances
Levene's test is used to test whether k samples have equal variances. Equal variance across samples is called homogeneity of variance. Some statistical tests, for example the analysis of variance, assume that variances are equal across groups or samples; the Levene test can be used to verify that assumption.

Kruskal–Wallis test
Like ANOVA, the Kruskal–Wallis test is used to check whether or not there is a statistically significant difference between at least two groups of an independent variable on a continuous or ordinal dependent variable. The only difference between the Kruskal–Wallis test and ANOVA is that the former is used for non-normally distributed data while the latter requires a normal distribution.
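To illustrate how these three tests are called in R, here is a minimal sketch with simulated data (the variable names are illustrative only, not the project's; the Levene test requires the "car" package):

    # Shapiro-Wilk, Levene and Kruskal-Wallis tests on simulated data.
    set.seed(1)
    df <- data.frame(
      delay   = c(rnorm(50, 10, 5), rnorm(50, 12, 5), rnorm(50, 15, 5)),
      carrier = factor(rep(c("AA", "AS", "B6"), each = 50))
    )
    shapiro.test(df$delay)                        # Shapiro-Wilk normality test
    car::leveneTest(delay ~ carrier, data = df)   # Levene homogeneity-of-variance test
    kruskal.test(delay ~ carrier, data = df)      # Kruskal-Wallis rank sum test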
Brown–Forsythe test
The Brown–Forsythe test is a statistical test for the equality of group variances based on performing an analysis of variance (ANOVA) on a transformation of the response variable. Instead of dividing by the mean square of the error, the mean square is adjusted using the observed variances of each group. The p-value can be interpreted in the same manner as in the analysis of variance table.

Definition of multiple linear regression
Multiple linear regression, also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable.

Formula and calculation
yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + ⋯ + βₚxᵢₚ + ε, for i = 1, …, n observations, with:
yᵢ: dependent variable
xᵢ: explanatory variables
β₀: y-intercept (constant term)
βₚ: slope coefficients for each explanatory variable
ε: the model's error term (also known as the residuals)

Assumptions of multiple regression:
• There is a linear relationship between the dependent variable and the independent variables.
• The independent variables are not too highly correlated with each other.
• The yᵢ observations are selected independently and randomly from the population.
• Residuals should be normally distributed with a mean of 0 and variance σ².

Ridge regression model
Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. Also known as Tikhonov regularization, named for Andrey Tikhonov, it is a method of regularization of ill-posed problems. It is particularly useful for mitigating the problem of multicollinearity in linear regression, which commonly occurs in models with large numbers of parameters. In general, the method provides improved efficiency in parameter estimation problems in exchange for a tolerable amount of bias (see the bias–variance tradeoff).

Formula of Ridge regression: the ridge estimator minimizes
Σᵢ₌₁ⁿ (yᵢ − β₀ − Σⱼ₌₁ᵖ βⱼxᵢⱼ)² + λ Σⱼ₌₁ᵖ βⱼ²
with:
n: the number of rows (observations)
p: the number of columns (predictors)
λ: the tuning parameter

Activity

3.1 Import data
Load the file "flights.rda" into RStudio and create a new data set "data1" containing only the main variables. The "c" function is used to concatenate the vectors, which hold the values of the variables. The "head" function is then used to check that the data set actually contains the data.

3.2 Data cleaning
We check the number of missing values of each variable in the data set "data1". As we can see, there are missing values in the data set. In order to make "data1" cleaner, we have to remove those values. To do that, we use the "na.omit" function, and a new clean data set called "data2" is created. After having removed the NA values from "data1", we check whether any NA values are left. The result from the console is "0", which means that the data no longer contains missing values.

3.3 Data visualization

3.3.1 Descriptive statistics for each of the variables
In order to compute descriptive statistics for the variables, we create a data set with the variables that have changing values and name it "data3". After that, we calculate statistics such as the mean, min, max, Q1, Q2, Q3 and standard deviation on the columns with the appropriate functions, and display them in columns with the "t()" function for better visualization.

3.3.2 Graph: boxplot – dep_delay for each carrier. Remove outliers
We plot the boxplot graph using the "boxplot" function. It can be observed that there are outliers in the data. They need to be eliminated, as they might cause problems in the statistical analyses. To remove them from the "dep_delay" variable, a function called "remove_outl" is created. First, the values of the first quartile, third quartile and interquartile range of the parameter are calculated with the "quantile" and "IQR" functions. After that, each value passed into the function is checked: if it satisfies the condition of an outlier, it is changed to NA; if it does not, it is kept the same as the input.
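The project's exact code is not shown in this preview; a minimal sketch of such a function, following the description above, might be:

    # Reconstruction of the "remove_outl" function described above:
    # values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are replaced with NA.
    remove_outl <- function(x) {
      q1  <- quantile(x, 0.25, na.rm = TRUE)
      q3  <- quantile(x, 0.75, na.rm = TRUE)
      iqr <- IQR(x, na.rm = TRUE)
      ifelse(x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr, NA, x)
    }

The NA values produced this way are exactly what the "is.na" count and the "na.omit" step described next operate on.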
We then create subsets of the dep_delay values of the carriers and apply the function to each created subset. The separated subsets are combined and stored in a new data set named "data3", and we check that the data set is intact. We then count the number of outliers, which are now NA values, with the "is.na" function, and remove them with the "na.omit" function. After having removed those values, we check again whether there are any outliers left in "data3". Finally, the boxplot graph is plotted with the "boxplot" function on the data set "data3" with the outliers removed.

3.4 One-way ANOVA: Is there a difference in average delayed departure times among airlines for flights departing from Portland in 2014?
We first test the assumption of normal distribution. To do so, we create a new dataframe containing only the values of the delayed departures of the airlines.

3.5 Generalized linear model
From the above analysis, we obtain the regression model:
arr_delay = 0.444587 + 0.723235 × carrierAS + (−0.058560) × carrierB6 + ⋯ + (−20.083396) × destTPA + (−4.122422) × destTUS + 0.997834 × dep_delay + (−0.002899) × distance + ε

According to the result of the linear regression model above, we test:
H₀: the coefficients of the variables are not statistically significant
H₁: the coefficients of the variables are statistically significant
Since the p-values of "carrierB6", "carrierOO", "carrierUS" and most of the "dest" variables are greater than the 5% significance level, we do not have enough evidence to reject hypothesis H₀ for them. Therefore, the coefficients of those variables are not statistically significant, and we exclude those variables from the model. Since the p-values of the remaining variables are lower than the 5% significance level, we can reject hypothesis H₀ for them; the coefficients of those variables are statistically significant, so we do not need to exclude them from the model.

We construct the second model with the exclusion of the variables "carrier" and "dest".

Comparing the models: according to the result above, we test:
H₀: the 1st model is more effective than the second one
H₁: the 2nd model is more effective than the first one
When we compare the two models, the probability Pr(>F) is 2e-16, which is below the 5% significance level.
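For reference, this kind of comparison is usually carried out in R with a partial F-test via the built-in "anova" function. Below is a minimal sketch, assuming the cleaned data set "data2" and the variables named earlier; the model formulas are our reconstruction, not the project's exact code:

    # Hypothetical reconstruction: full model vs. reduced model
    # (without "carrier" and "dest"), compared with a partial F-test.
    model1 <- lm(arr_delay ~ carrier + dest + dep_delay + distance, data = data2)  # full model
    model2 <- lm(arr_delay ~ dep_delay + distance, data = data2)                   # reduced model
    anova(model2, model1)   # the Pr(>F) column reports the comparison p-value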