HANOI UNIVERSITY
FACULTY OF MANAGEMENT AND TOURISM
Business and Economics Statistics CASE STUDY
ACADEMIC PERFORMANCE OF UNIVERSITY STUDENTS
Tutor: Mr. Nguyen Hoang Viet
Tutorial class: Tut 4
Group members:
Nguyen Huong Giang - 1904040027
Pham Ngoc Van Giang - 1904040029
5 Interaction plot and interpretations
6 Credibility of the interpretations and conclusions
TABLE OF FIGURES
Figure 1 The structure of this data frame
Figure 2 Frequency table (sample sizes)
Figure 3 Mean of Edu Spend according to Edu Levels and Provinces
Figure 4 The standard deviation of Edu Spend according to Edu levels and Provinces
Figure 7 Levene’s Test result
Figure 8 Q-Q plot of residual
Figure 9 Test statistic output
Figure 10 Interaction plot between Spending for Edulevel and Province
A Scenario
The Vietnam Household Living Standards Survey (VHLSS) was conducted nationwide every two years to systematically monitor the living standards of Vietnamese society. In 2018, the survey was carried out with a sample size of 46,995 households in 3,133 communes/wards, which were representative at national, regional, urban, rural, and provincial levels. The household questionnaire contained many sections, each of which covered a separate aspect of household activities, and education was one important indicator. In the survey, household heads were asked to specify their place of residence (province), the schooling level of their children (edulevel), and expenditure on education per child for the past 12 months in thousands of VND (eduspend). The objective of our study is to test for any significant interaction between the place of residence and schooling level, and to test for any significant differences in education expenditure due to these two variables, using a 0.05 level of significance.
A portion of the obtained data is presented below. The complete dataset, consisting of 90 observations and 4 variables (obs, province, edulevel, eduspend), is provided in the accompanying file named Case5.csv.
B Questions and Answers
1 Inference technique
The Vietnam Household Living Standards Survey data are used to test for any significant interaction between the place of residence (province) and schooling level (edulevel), and for any significant differences in education expenditure (eduspend) due to these two variables. In this study, a two-way ANOVA (two-way analysis of variance) is applied to assess the effects of 2 independent variables, and of their interaction, on 1 dependent variable. Firstly, province and edulevel are the two factors (independent variables) in this case study. Secondly, eduspend is the variable that depends on these two factors.
The purpose of this study is to examine the effect of place of residence and schooling level on education expenditure, and the interaction between the two factors (province and edulevel).
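For reference, the usual textbook form of a two-way ANOVA model with interaction can be written as below. The notation is an assumption introduced here for illustration and does not appear in the case study: y_ijk stands for eduspend of observation k in province i and schooling level j, with 2 provinces, 3 schooling levels and 15 observations per group.

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad i = 1, 2; \quad j = 1, 2, 3; \quad k = 1, \dots, 15
```

In this notation, \alpha_i is the main effect of province, \beta_j is the main effect of edulevel, (\alpha\beta)_{ij} is the interaction effect, and \varepsilon_{ijk} is the random error term.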
2 Descriptive statistics for the dataset
Firstly, we use RStudio to compute the descriptive statistics for this question. To start with, we import the CSV file “Case5.csv” into R for further calculation:
> setwd("C:/Users/Admin/Documents/bes research")
> getwd()
> case5 <- read.table("Case5.csv", header=TRUE, sep=",", quote="\"", stringsAsFactors=FALSE)
The structure of this data frame can be checked using the str() function:
Figure 1 The structure of this data frame
From the above R output, we can see that there are 90 observations and 4 variables: obs, province, edulevel, eduspend. The obs and eduspend variables are numeric data, while the province and edulevel variables are character data. To apply some graphical or statistical methods, we should convert province and edulevel into factors, using the following code:
Figure 2 Frequency table (sample sizes)
It can be seen that all 6 treatment groups have the same sample size of 15. This balanced design is well suited to a two-way ANOVA test.
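The factor conversion and the frequency table described above can be sketched as follows. This is only an illustration: the data frame is a simulated stand-in with the same column names as Case5.csv, the spending values are random placeholders rather than survey data, and the exact commands used for Figure 2 are not shown in the report.

```r
# Hypothetical stand-in for Case5.csv: same column names, simulated values only
set.seed(1)
case5 <- data.frame(
  obs = 1:90,
  province = rep(c("HungYen", "ThaiBinh"), each = 45),
  edulevel = rep(rep(c("Nursery School", "Primary School", "Secondary school"),
                     each = 15), times = 2),
  eduspend = round(runif(90, 500, 15000)),  # placeholder spending values
  stringsAsFactors = FALSE
)

# Convert the two character columns into factors
case5$province <- factor(case5$province)
case5$edulevel <- factor(case5$edulevel)

# Cross-tabulate the two factors to get the sample size of each treatment group
freq <- table(case5$province, case5$edulevel)
print(freq)
```

With a balanced design every cell of the table holds the same count, which is what Figure 2 reports for the real data.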
Next, we use the by() function in R to find several descriptive statistics, such as the mean and the standard deviation, for each treatment group defined by the two factors. The commands and their output are listed below:
> by(case5$eduspend, list(case5$province, case5$edulevel), mean)
Figure 3 Mean of Edu Spend according to Edu Levels and Provinces
> by(case5$eduspend, list(case5$province, case5$edulevel), sd)
HungYen, Nursery School
[1] 2481.657
HungYen, Primary School
[1] 1071.281
ThaiBinh, Primary School
[1] 356.0621
HungYen, Secondary school
[1] 4938.879
ThaiBinh, Secondary school
[1] 1377.442
Figure 4 The standard deviation of Edu Spend according to Edu levels and Provinces
Each command gives one descriptive statistic of the outcome variable (Edu Spend) for each treatment group defined by province and schooling level.
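The group-wise summaries can also be collected into two compact tables with tapply(), which is a base R alternative to by(). The sketch below uses a tiny made-up data frame (the name demo and its values are hypothetical, not survey data) so that the results are easy to check by hand.

```r
# Tiny hypothetical data set: 2 provinces x 2 schooling levels, 2 values per cell
demo <- data.frame(
  province = rep(c("HungYen", "ThaiBinh"), each = 4),
  edulevel = rep(rep(c("Primary", "Secondary"), each = 2), times = 2),
  eduspend = c(10, 20, 30, 50, 5, 15, 40, 80)
)

# Mean and standard deviation of eduspend for every province x edulevel cell
group_mean <- tapply(demo$eduspend, list(demo$province, demo$edulevel), mean)
group_sd   <- tapply(demo$eduspend, list(demo$province, demo$edulevel), sd)

print(group_mean)
print(group_sd)
```

Each result is a matrix with provinces in rows and schooling levels in columns, so a single cell can be read with, for example, group_mean["HungYen", "Primary"].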
To get further information, we draw a boxplot and a mean plot:
> boxplot(eduspend ~ interaction(province, edulevel), data = case5, xlab = "Place of residents", ylab = "Education expenditure", col = c("pink", "light blue", "yellow", "white", "orange", "gray"))
Figure 5 Boxplot for distribution of groups (groups shown: HungYen Nursery School, ThaiBinh Nursery School, HungYen Primary School, ThaiBinh Primary School, HungYen Secondary school, ThaiBinh Secondary school)
Initially, the box plot clearly shows several descriptive statistics: the median, quartiles, maximum and minimum of each group. Each cell has its own characteristics. Based on the R output, we can see that the Hung Yen - Secondary School group has the highest median, but there is no difference between the Hung Yen - School group and the Thai Binh - Secondary School group. Moreover, the Hung Yen - Nursery School group is the lowest on almost every value, including the median and the minimum.
The skewness of each group can also be read from a boxplot. Whether a group is roughly symmetric, positively skewed or negatively skewed is judged from the distances between the median and the two endpoints. Thai Binh - Secondary School and Hung Yen - Nursery School appear approximately normally distributed. Besides, Hung Yen - Secondary School and Thai Binh - Nursery School are examples of positively skewed distributions, while the others are negatively skewed. Also, the white dots mark outliers in Hung Yen - Secondary School (1 outlier), Thai Binh - Secondary School (1 outlier), Hung Yen - Nursery School (2 outliers), and Thai Binh - Nursery School (3 outliers); these are few relative to the 90 observations.
We also use a mean plot to identify the mean value of each group and to compare means between groups, with the following code and its output:
> install.packages("gplots")
> library("gplots")
> plotmeans(eduspend ~ interaction(province, edulevel), data = case5, xlab = "Province and edulevel", ylab = "Eduspend", main = "Mean Plot + with 95% CI")
Figure 6 Gplot of group means (plot title: Mean Plot + with 95% CI; n = 15 for each group)
Figure 6 helps us to understand the structure of the Case 5 data better and summarizes the differences between the group means with 95% confidence intervals. It also displays the sample size of each group, which equals 15. The plot suggests large differences in mean eduspend across the groups defined by edulevel and province. The means of ThaiBinh.Primary School, HungYen.Secondary school and ThaiBinh.Secondary school differ widely, while the means of HungYen.Primary School, HungYen.Nursery School and ThaiBinh.Nursery School differ only slightly.
3 Checking assumptions
The two-way ANOVA test has three assumptions:
1 Assumption 1: Samples are independent simple random samples
2 Assumption 2: All population variances are identical
3 Assumption 3: All population distributions are normal
3.1 Samples are independent simple random samples
As the spending on education of one household is not determined by that of another, the samples are independent. The scenario stated that: “In 2018, the survey was carried out with a sample size of 46,995 households in 3,133 communes/wards which were representative at national, regional, urban, rural and provincial levels”. Therefore, it can be assumed that this sample was selected randomly.
3.2 All population variances are identical
From Figure 4 (the standard deviation of Edu Spend according to Edu levels and Provinces), it can be seen that the largest standard deviation equals 4938.879 and the smallest one equals 356.0621. Their ratio is 4938.879/356.0621 = 13.87084, which is larger than 2. This ratio suggests that the second assumption is not satisfied, but Levene’s test below does not reject equal variances. In fact, the condition of Levene’s test is not met when the ratio is larger than 3.
a. Hypothesis
Ho: All population variances are equal
Ha: At least one population variance is different
b. Significance level: α = 0.05
c. Test statistic: F = 2.0727
p-value = 0.07684
d. Rejection rule:
We reject Ho if p-value < α. Here p-value = 0.07684 > 0.05, so we do not reject Ho.
e. Conclusion
There is not enough evidence at the 0.05 level of significance to conclude that at least one population variance is different.
The result of Levene’s test was obtained with the following code:
> install.packages("car")
> library(car)
> leveneTest(case5$eduspend, interaction(case5$province, case5$edulevel), center = median)
Figure 7 Levene’s Test result
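The two checks in this subsection can be reproduced in base R without the car package. The ratio uses the two standard deviations quoted from Figure 4. The median-centred Levene statistic is computed by hand on simulated data, since the survey values are not reproduced here; the data frame demo and its columns are hypothetical. The idea is to run a one-way ANOVA on the absolute deviations of each observation from its group median.

```r
# Ratio of the largest to the smallest group standard deviation (values from Figure 4)
sd_ratio <- 4938.879 / 356.0621
print(sd_ratio)

# Simulated stand-in data (NOT survey values): 3 groups of 15 observations
set.seed(42)
demo <- data.frame(
  group = factor(rep(c("g1", "g2", "g3"), each = 15)),
  y = c(rnorm(15, 10, 1), rnorm(15, 12, 1.5), rnorm(15, 11, 2))
)

# Absolute deviation of each observation from the median of its own group
demo$dev <- abs(demo$y - ave(demo$y, demo$group, FUN = median))

# Levene's test (median-centred) is the one-way ANOVA F test on these deviations
lev <- anova(lm(dev ~ group, data = demo))
f_stat <- lev[["F value"]][1]
p_val  <- lev[["Pr(>F)"]][1]
print(lev)
```

For the case study, the grouping variable would be interaction(province, edulevel) and the response would be eduspend, which gives 5 and 84 degrees of freedom for 6 groups and 90 observations.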
3.3 All population distributions are normal
By using the following code, we can check the normality of the population distributions:
> install.packages("car")
> library(car)
> qqPlot(lm(eduspend ~ province + edulevel + province*edulevel, data=case5), simulate=TRUE, main="Q-Q Plot", labels=FALSE)
Figure 8 Q-Q plot of residual
Looking at Figure 8, numerous points fall outside the blue area, so the plot does not support the normal distribution of all populations. However, due to the scope of the course, we assume that the last 2 assumptions are satisfied. To sum up, we carry out the two-way ANOVA test under the assumption that all three conditions hold.
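A simpler base R view of the same idea is sketched below: fit the two-way model, extract the residuals, and compare their quantiles with normal quantiles. It is not identical to the car plot in Figure 8 (it draws no confidence envelope), and the data are simulated placeholders, deliberately right-skewed, rather than the survey values.

```r
# Simulated stand-in data (NOT survey values); column names follow the report
set.seed(123)
demo <- data.frame(
  province = factor(rep(c("HungYen", "ThaiBinh"), each = 45)),
  edulevel = factor(rep(rep(c("Nursery", "Primary", "Secondary"), each = 15),
                        times = 2)),
  eduspend = rexp(90, rate = 1 / 3000)  # deliberately right-skewed values
)

# Fit the two-way model with interaction and extract its residuals
fit <- lm(eduspend ~ province * edulevel, data = demo)
res <- residuals(fit)

# Sample quantiles against normal quantiles; points far from the reference
# line suggest a departure from normality
qq <- qqnorm(res, main = "Q-Q Plot of residuals")
qqline(res)
```

With skewed data such as these, the upper tail bends away from the reference line, which is the kind of pattern that leads to doubts about the normality assumption.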
4 Two-way ANOVA test
As mentioned in question 1, we use two-way ANOVA to test for the significance of the interaction between Province and Edulevel (interaction effect) as well as that of the differences in education expenditure due to Province and Edulevel (2 main effects) at the 0.05 level of significance.
Step 1: Form hypotheses for the three tests
The three null hypotheses and alternative hypotheses for the test are stated below:
- The hypotheses to test the interaction effect:
Ho1: There is no interaction between Province and Edulevel
Ha1: There is a significant interaction between Province and Edulevel
- The hypotheses to test the main effects:
Ho2: There are no differences in education expenditure due to Province
Ha2: There are differences in education expenditure due to Province
Ho3: There are no differences in education expenditure due to Edulevel
Ha3: There are differences in education expenditure due to Edulevel
Step 2: Check assumptions
The assumptions of the test have been checked in the answer to question 3:
- Samples are independent, simple random samples
- All populations are normally distributed
- All populations have the same variance
Step 3: Test statistic
We run the two-way ANOVA in RStudio with Eduspend as the outcome variable and Province and Edulevel as the two factors, using the following commands:
> case5.result <- aov(eduspend ~ province*edulevel, data = case5)
> summary(case5.result)
Figure 9 Test statistic output
To test the main effect of edulevel:
F = 67582359 / 6309566 = 10.711
Step 4: Level of significance
The level of significance is α = 0.05.
Step 5: Decision rule
Reject Ho if p-value < α.
To test for the interaction effect: p-value = 0.6905 > α = 0.05
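The F ratio for the main effect of edulevel in Step 3 and its p-value can be checked directly in R. The two mean squares are the values quoted above. The degrees of freedom are not quoted in the text and are derived here from the design (3 schooling levels, 6 treatment groups, 90 observations), so they should be read as an assumption of this sketch.

```r
# Mean squares quoted in Step 3
ms_edulevel <- 67582359
ms_residual <- 6309566

# F statistic for the main effect of edulevel
f_edulevel <- ms_edulevel / ms_residual

# Degrees of freedom derived from the design (an assumption of this sketch)
df_edulevel <- 3 - 1       # three schooling levels
df_residual <- 90 - 2 * 3  # 90 observations minus 6 treatment groups

# Upper-tail probability of the F distribution gives the p-value
p_edulevel <- pf(f_edulevel, df_edulevel, df_residual, lower.tail = FALSE)

print(round(f_edulevel, 3))
print(p_edulevel)
```

Comparing p_edulevel with α = 0.05 applies the decision rule of Step 5 to this main effect.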
Step 6: Conclusion
At the 5% level of significance, we do not have enough statistical evidence to conclude that there is a significant interaction between the two factors, place of residence and schooling level, in the education expenditure of households in the two provinces (Thai Binh and Hung Yen).
Because the interaction effect is not significant, we examine the 2 main effects: the effect of province on education expenditure and the effect of edulevel on education expenditure. As regards the effect of the province, we have:
5 Interaction plot and interpretations
To visualize the possible interaction between the two factors graphically, we use the interaction.plot function as follows:
> interaction.plot(case5$province, case5$edulevel, case5$eduspend, type = "b", col = c("red", "blue", "black"), pch = c(16, 18), main = "Interaction between Province and Eduspend")