HANOI UNIVERSITY FACULTY OF MANAGEMENT AND TOURISM
Business and Economics Statistics
Academic Performance of
University Students
I Scenario
We are FMT students at Hanoi University. Our team is researching the relationship between school level (Secondary school and University) and expenditure on education for students in three provinces: Ha Noi, Nam Dinh, and Hai Duong. Education expenditure is measured in Vietnamese dong. The objective of our study is to find out whether there is a significant interaction between Province and Level of Education, and to examine whether education expenditure differs significantly across these two variables. We use a 0.05 level of significance.
The sample data are provided in case3.csv, which contains 120 observations.
II Answering questions
Question 1: What inference technique should be considered for this study? Explain
There are 4 variables in the given dataset: obs, province, edulevel and eduspend. Province and edulevel are the two independent variables, and eduspend is the dependent variable. "Province" has 3 levels and "edulevel" has 2 levels, so these two are categorical variables (factors). In contrast, eduspend is a numerical outcome variable. A two-way ANOVA test is therefore the appropriate technique for this study: it examines the main effect of each of the two factors, and their interaction effect, on the dependent variable.
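For reference, the two-way ANOVA model with interaction can be written in standard textbook notation as below. The symbols are not taken from the report itself and are an assumption for illustration: i indexes the three provinces, j the two education levels, and k the 20 students within each cell.

```latex
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim N(0, \sigma^2),
\qquad i = 1,2,3,\quad j = 1,2,\quad k = 1,\dots,20
```

Here \alpha_i is the main effect of province, \beta_j is the main effect of education level, and (\alpha\beta)_{ij} is the interaction term.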
Question 2: Produce descriptive statistics for the dataset. You are expected to generate as many relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs of this course. Remember to provide appropriate interpretations for the descriptive statistics. Try not to include unnecessary or irrelevant descriptive statistics.
Descriptive statistics describe the basic features of the data in a study. They give an overview of the research sample through numerical measures and through graphical summaries of the quantitative data.
1 Import the Case 3.csv data frame into R and assign it to VHLSS
After setting up the working directory, we import the file into R.
R code:
VHLSS <- read.table("Case 3.csv", header=TRUE, sep=",", quote="\"", stringsAsFactors=FALSE)
2 Cross Tabulation table between factors
There are 120 observations in this case, so we use the head() function to preview the data.
R code:
head(VHLSS)
Output:
  obs province   edulevel eduspend
1   1 HaiDuong University    16490
2   2 HaiDuong University    25800
3   3 HaiDuong University    29570
4   4 HaiDuong University    35000
5   5 HaiDuong University    16970
6   6 HaiDuong University    11600
We then convert the two grouping variables to factors and cross-tabulate them.
R code:
VHLSS$province <- factor(VHLSS$province, levels=c("HaiDuong","Hanoi","NamDinh"))
VHLSS$edulevel <- factor(VHLSS$edulevel, levels=c("Secondary school","University"))
table(VHLSS$province, VHLSS$edulevel)
Output: It provides us the sample size (n=20) for each stratum
           Secondary school University
  HaiDuong               20         20
  Hanoi                  20         20
  NamDinh                20         20
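The factor conversion and cross tabulation above can be tried without the data file. The following is a minimal sketch on a hypothetical stand-in data frame (the name demo and its construction are assumptions, not part of the original dataset); it only reproduces the 3 x 2 layout with 20 rows per cell.

```r
# Hypothetical stand-in for the real data: 3 provinces x 2 levels x 20 students.
demo <- expand.grid(
  obs_in_cell = 1:20,
  province    = c("HaiDuong", "Hanoi", "NamDinh"),
  edulevel    = c("Secondary school", "University"),
  stringsAsFactors = FALSE
)

# Convert the character columns to factors with an explicit level order.
demo$province <- factor(demo$province, levels = c("HaiDuong", "Hanoi", "NamDinh"))
demo$edulevel <- factor(demo$edulevel, levels = c("Secondary school", "University"))

# Cross tabulation: rows are provinces, columns are education levels.
cell_counts <- table(demo$province, demo$edulevel)
print(cell_counts)

# A design is balanced when every cell has the same count.
is_balanced <- length(unique(as.vector(cell_counts))) == 1
print(is_balanced)
```

Setting the levels explicitly fixes the row and column order of the table, which also determines the order of the groups in later plots.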
3 Mean for groups
R code:
by(VHLSS$eduspend, list(VHLSS$province, VHLSS$edulevel), mean)
Output:
: HaiDuong : Secondary school
[1] 4334.6
: Hanoi : Secondary school
[1] 15234.6
: NamDinh : Secondary school
[1] 5492.45
: HaiDuong : University
4 Standard deviation for groups
R code:
by(VHLSS$eduspend, list(VHLSS$province, VHLSS$edulevel), sd)
Output:
: HaiDuong : Secondary school
[1] 1012.826
: Hanoi : Secondary school
[1] 8817.526
: NamDinh : Secondary school
[1] 2161.205
: HaiDuong : University
[1] 9902.959
: Hanoi : University
[1] 9847.818
: NamDinh : University
[1] 7899.908
5 Median for groups
R code:
by(VHLSS$eduspend, list(VHLSS$province, VHLSS$edulevel), median)
Output:
: HaiDuong : Secondary school
[1] 4268
: Hanoi : Secondary school
[1] 15095
: NamDinh : Secondary school
[1] 4605
: HaiDuong : University
[1] 16950
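The separate by() calls for mean, standard deviation and median can also be collapsed into a single table. The sketch below shows one way to do this with aggregate(); it runs on simulated spending values (the data frame demo and its numbers are invented for illustration), so its output will not match the figures in this report.

```r
set.seed(1)  # reproducible illustration only

# Simulated stand-in for VHLSS: same column names, invented spending values.
demo <- expand.grid(
  rep      = 1:20,
  province = c("HaiDuong", "Hanoi", "NamDinh"),
  edulevel = c("Secondary school", "University")
)
demo$eduspend <- round(rlnorm(nrow(demo), meanlog = 9, sdlog = 0.5))

# One row per province x edulevel cell, with mean, sd and median side by side.
group_stats <- aggregate(
  eduspend ~ province + edulevel,
  data = demo,
  FUN  = function(x) c(mean = mean(x), sd = sd(x), median = median(x))
)

# aggregate() stores the three statistics in a matrix column; flatten it
# into ordinary columns named eduspend.mean, eduspend.sd, eduspend.median.
group_stats <- do.call(data.frame, group_stats)
print(group_stats)
```

With the real data, replacing demo by VHLSS in the aggregate() call should give the same six groups in one data frame.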
Output:
: HaiDuong : Secondary school
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2230    3656    4268    4335    4886    6135
: Hanoi : Secondary school
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2728    8031   15095   15235   19283   39970
: NamDinh : Secondary school
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2974    4200    4605    5492    6820   10900
: HaiDuong : University
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3000   11750   16950   19579   26720   35050
: Hanoi : University
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5000   12700   17075   19157   28285   39600
: NamDinh : University
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10050   17498   21155   22978   27280   38275
7 Graphical description
With about 20 observations in each group, a boxplot is a suitable way to compare the six groups. Here is the code and output for the boxplot.
R code:
boxplot(eduspend ~ interaction(province, edulevel), data = VHLSS, xlab = "Province & Schooling level", ylab = "Expenditure on education (Thousands of VND)", col = c("green", "red", "purple", "yellow", "brown", "pink"))
[Figure: boxplots of expenditure on education by province and schooling level]
The black line in the middle of each box represents the median of that group. There are also two white dots, which represent outliers (unusually high expenditure). We can additionally use a mean plot to compare the groups.
R code:
install.packages("gplots")
library(gplots)
plotmeans(eduspend ~ interaction(province, edulevel), data = VHLSS, xlab = "Province & Schooling level", ylab = "Expenditure on education (Thousands of VND)", main = "Mean Plot with 95% CI")
[Figure: Mean Plot with 95% CI, by province and schooling level]
The mean plot shows the six groups with their 95% confidence intervals. The NamDinh University group has the largest mean. The six group means also vary noticeably, which is the kind of difference the two-way ANOVA test is designed to assess.
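The intervals drawn in the mean plot can be approximated by hand. The sketch below assumes the usual t-based 95% interval, mean ± t × sd / sqrt(n), and uses the rounded group means from the summary output together with the standard deviations from "4 Standard deviation for groups". The short group labels are introduced here for illustration only.

```r
# Group means (rounded, from the summary output) and standard deviations
# (from "4 Standard deviation for groups"); n = 20 per group.
groups <- c("HaiDuong.Secondary", "Hanoi.Secondary", "NamDinh.Secondary",
            "HaiDuong.University", "Hanoi.University", "NamDinh.University")
means <- c(4335, 15235, 5492, 19579, 19157, 22978)
sds   <- c(1012.826, 8817.526, 2161.205, 9902.959, 9847.818, 7899.908)
n     <- 20

# Half-width of a t-based 95% confidence interval for each group mean.
t_crit     <- qt(0.975, df = n - 1)
half_width <- t_crit * sds / sqrt(n)

ci <- data.frame(group = groups,
                 lower = means - half_width,
                 mean  = means,
                 upper = means + half_width)
print(ci, digits = 6)
```

Groups with a small standard deviation, such as HaiDuong Secondary school, get a narrow interval, and that interval lies well below the HaiDuong University interval.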
Question 3: Check all the assumptions of the inference technique you suggest in Question 1. Are the assumptions satisfied? Explain
In Question 1 we suggested the two-way ANOVA test as the most suitable method for this case study. Before conducting the test, we have to check its assumptions, with expenditure on education as the outcome variable and two independent factors: place of residence (Hai Duong, Hanoi and Nam Dinh) and schooling level (Secondary school and University). There are three assumptions to consider:
• The samples are independent simple random samples of size nij from each of the k (= a*b) populations
• All populations are normally distributed
• All populations have the same variance
Firstly, we check assumption 1. Comparing the two factors, Province and Level of Education, we find no relation between a student's Province and their Level of Education: each student's answer is recorded separately and is not influenced by other respondents. Therefore, the spending observations for students from Hai Duong, Hanoi and Nam Dinh at each Level of Education can be treated as independent. Moreover, the data were collected from 120 observations, and selecting an individual from a given province does not affect who is selected at a given education level, and vice versa. As a result, we treat the samples as simple random samples of size nij from each of the k (= a*b) populations. The cross tabulation of the province and schooling level variables gives the sample sizes:
R code:
table(VHLSS$province, VHLSS$edulevel)
Output:
           Secondary school University
  HaiDuong               20         20
  Hanoi                  20         20
  NamDinh                20         20
Secondly, we check whether the populations are normally distributed, using a Q-Q plot:
R code:
library(car)
qqPlot(lm(eduspend ~ province + edulevel + province*edulevel, data=VHLSS), simulate=TRUE, main="Q-Q Plot", labels=FALSE)
We then have the result of a Q-Q plot:
[Figure: Q-Q Plot]
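A base-R variant of this check is sketched below on simulated data (the data frame demo and its values are invented). It also runs a Shapiro-Wilk test on the residuals, which is an addition for illustration and was not part of the analysis in this report.

```r
set.seed(42)  # reproducible illustration only

# Simulated stand-in for VHLSS with the same factor structure (values invented).
demo <- expand.grid(
  rep      = 1:20,
  province = c("HaiDuong", "Hanoi", "NamDinh"),
  edulevel = c("Secondary school", "University")
)
demo$eduspend <- 10000 + rnorm(nrow(demo), mean = 0, sd = 3000)

# Same model structure as in the report: main effects plus interaction.
fit <- lm(eduspend ~ province * edulevel, data = demo)
res <- residuals(fit)

# Base-R Q-Q plot of the residuals with a reference line.
qqnorm(res, main = "Q-Q Plot of residuals")
qqline(res)

# Shapiro-Wilk test: a small p-value would indicate departure from normality.
sw <- shapiro.test(res)
print(sw)
```

Points that follow the reference line closely support the normality assumption; the formal test is best read alongside the plot rather than instead of it.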
Lastly, we check whether the standard deviations of the six populations are equal. Based on the output "4 Standard deviation for groups" in Question 2, the ratio of the largest standard deviation to the smallest one (= 9902.959/1012.826) is approximately 9.7776, which is larger than 2. Therefore the assumption is not satisfied, because not all populations have the same standard deviation.
Question 4: Perform the inference technique you suggest in Question 1. Remember to provide all the necessary steps. What are your interpretations and conclusions? Explain
Step 1: Hypotheses
H0: There is no significant interaction between province and edulevel in their effect on eduspend
Ha: There is a significant interaction between province and edulevel in their effect on eduspend
Step 2: Check assumptions (Question 3)
• All populations are normally distributed
• Samples are independent, simple random samples
• Not all populations have the same standard deviation
Step 3: Test statistic:
We used RStudio to calculate the test statistic and p-value and obtained the following output:
R code:
VHLSS.result <- aov(eduspend ~ province*edulevel, data = VHLSS)
summary(VHLSS.result)
Output:
                  Df    Sum Sq   Mean Sq F value   Pr(>F)
province           2 5.521e+08 2.760e+08   4.858 0.009443 **
edulevel           1 4.478e+09 4.478e+09  78.812 1.13e-14 ***
province:edulevel  2 1.057e+09 5.286e+08   9.303 0.000181 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Step 4: Level of significance
The level of significance is α = 0.05
Step 5: Decision rule
• We reject H0 if p-value < α
• The p-value of interest is the one for the interaction between Province and Edulevel, which equals 0.000181 < α = 0.05. Therefore, following the decision rule, we reject H0
Step 6: Conclusion
At the 5% significance level, there is sufficient evidence to conclude that the interaction between province and edulevel is significant; there are differences in eduspend due to province and edulevel.
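The p-values in the ANOVA table of Step 3 can be cross-checked from the reported F values. The sketch below assumes a residual degrees of freedom of 120 - 3*2 = 114 (total observations minus the number of cells); this value is derived here and is not shown in the output above.

```r
# F values and numerator df as reported in the ANOVA table of Step 3.
f_values <- c(province = 4.858, edulevel = 78.812, "province:edulevel" = 9.303)
df_num   <- c(2, 1, 2)

# Assumption: residual df = N - (number of cells) = 120 - 3 * 2 = 114.
df_resid <- 120 - 3 * 2

# Upper-tail probability of the F distribution gives the p-value.
p_values <- pf(f_values, df1 = df_num, df2 = df_resid, lower.tail = FALSE)
names(p_values) <- names(f_values)
print(signif(p_values, 3))

# Decision rule from Step 5: reject H0 when the p-value is below alpha.
alpha  <- 0.05
reject <- p_values < alpha
print(reject)
```

Small differences from the printed table are expected, because the F values used here are already rounded to three decimals.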
Question 5: Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusions made in Question 4?
This command is used to draw the interaction plot:
R code:
interaction.plot(VHLSS$edulevel, VHLSS$province, VHLSS$eduspend, type="b", col=c("red","blue","black"), pch=c(16, 18), main="Interaction between Province and Schooling level")
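As a numerical companion to the plot, the slope of each province's line is the difference between its University mean and its Secondary school mean. The sketch below computes these gaps from the rounded group means in the summary output of Question 2; the object names are introduced here for illustration.

```r
# Group means (rounded) taken from the summary output in Question 2.
# matrix() fills column by column: first Secondary school, then University.
cell_means <- matrix(
  c(4335, 15235, 5492,     # Secondary school: HaiDuong, Hanoi, NamDinh
    19579, 19157, 22978),  # University:       HaiDuong, Hanoi, NamDinh
  nrow = 3,
  dimnames = list(province = c("HaiDuong", "Hanoi", "NamDinh"),
                  edulevel = c("Secondary school", "University"))
)

# Slope of each province's line: University mean minus Secondary school mean.
level_gap <- cell_means[, "University"] - cell_means[, "Secondary school"]
print(level_gap)
```

With parallel lines the three gaps would be roughly equal. Here the Hanoi gap (3922) is far smaller than the HaiDuong (15244) and NamDinh (17486) gaps.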
As shown in the plot, the three lines are not parallel, which indicates a considerable interaction. In other words, the plot is consistent with the conclusion in Question 4 that there is a significant interaction between province and edulevel in their effect on expenditure on education.
Question 6: Discuss the credibility of interpretations and conclusions of Question 4. Is there anything we should be concerned about? Explain
a The credibility of the interpretations and conclusion
For this report, we use two-way ANOVA as the inference technique to test the interaction between place of residence and schooling level, and to test for differences in education expenditure due to these two variables. Not all of the assumptions are satisfied (Question 3). After conducting all the steps in the test, we conclude that the interaction between province and edulevel is significant and that there are differences in eduspend due to province and edulevel, because the p-value of 0.000181 is smaller than the level of significance (α = 0.05). Using α = 0.05 means that there is a 5% chance of making a Type I error, that is, rejecting a true null hypothesis. Equivalently, if the null hypothesis of no interaction between province and edulevel is true, there is a 95% chance that it will not be rejected. Finally, we conclude that the conclusions in Question 4 are reliable and that eduspend differs significantly across province and edulevel.
b Limitations of two-way ANOVA
Although the two-way ANOVA addresses the case, we recognized a limitation after carrying it out: one of its assumptions is not satisfied. Therefore, despite working at a 95% confidence level, we acknowledge that the violated assumption may make the results inaccurate.
Additionally, in the given case study we only take into account two factors, schooling level and place of residence, in their effect on the level of spending, even though other factors might also influence it.
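The unsatisfied assumption referred to above is the equal standard deviation check from Question 3. It can be reproduced from the standard deviations reported in "4 Standard deviation for groups", as in the short sketch below (the group labels are shortened here for illustration).

```r
# Group standard deviations as reported in "4 Standard deviation for groups".
group_sd <- c(
  HaiDuong.Secondary  = 1012.826,
  Hanoi.Secondary     = 8817.526,
  NamDinh.Secondary   = 2161.205,
  HaiDuong.University = 9902.959,
  Hanoi.University    = 9847.818,
  NamDinh.University  = 7899.908
)

# Rule of thumb used in the report: largest SD / smallest SD should not exceed 2.
sd_ratio <- max(group_sd) / min(group_sd)
print(round(sd_ratio, 4))

equal_variance_ok <- sd_ratio <= 2
print(equal_variance_ok)
```

The ratio is driven by the very small spread of the HaiDuong Secondary school group compared with the University groups.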
BES 2021 - PEER EVALUATION FORM
Phạm Thị Phương Thảo 1904040113 100% Phạm Thị Phương Thảo
Nguyễn Hải Vũ 1904050056 100% Nguyễn Hải Vũ