HANOI UNIVERSITY, FACULTY OF MANAGEMENT & TOURISM
Business and Economics Statistics
CASE STUDY: EXPENDITURE ON EDUCATION
Tutor’s name: Ms Hoai Phuong Tutorial class: Tutorial 3
Bùi Thị Thu Huyền 1904000052
Hanoi, November 10, 2021
TABLE OF CONTENTS
Question 1: What inference technique should be considered for this study? Explain
Question 2: Produce descriptive statistics for the dataset
Question 3: Check all assumptions of the inference technique you suggest in Question 1. Are the assumptions satisfied? Explain
Question 4: Perform the inference technique you suggest in Question 1. Remember to provide all the necessary steps. What are your interpretations and conclusions? Explain
Question 5: Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusions made in Question 4?
Question 6: Discuss the credibility of the interpretations and conclusions of question 4. Is there
TABLE OF FIGURES
Figure 1: Some first rows of the data
Figure 2: Structure of the data when factors have not been converted yet
Figure 3: Structure of the data when factors have been converted
Figure 4: Frequency table (sample sizes)
Figure 5: Standard deviation of Eduspend according to Province and Edulevel
Figure 6: Summary of Eduspend according to Province and Edulevel
Figure 7: Boxplot
Figure 8: Mean plot with 95% CI
Figure 9: Levene's Test
Figure 10: Q-Q plot
Figure 11: Two-way ANOVA output
Figure 12: Interaction between Province and Edulevel
Question 1: What inference technique should be considered for this study? Explain
Two-way ANOVA should be considered as the inference method in this case study, for two reasons. First, this test assesses the mean differences associated with each factor. Second, it tests for any significant interaction between place of residence and schooling level, and for any significant differences in education expenditure due to these two variables. Therefore, our group chose two-way ANOVA, since it compares the means of groups defined by two independent factors (place of residence and level of education) on one dependent variable (education expenditure), and also indicates whether the two factors interact.
Question 2: Produce descriptive statistics for the dataset
We use RStudio to produce the descriptive statistics for this question. Firstly, we must import the csv file "Household-survey.csv" into R for further calculation:
> Householdsurvey <- read.table("Household-survey.csv", header = TRUE, sep = ",", quote = "\"", stringsAsFactors = FALSE)
In addition, there are 90 observations in this case study; therefore, we look at the first few observations to get a better understanding of the data, using the head() function in R:
> head(Householdsurvey)
  obs province       edulevel eduspend
1   1  HungYen Primary School     1230
2   2  HungYen Primary School     1130
3   3  HungYen Primary School     3200
4   4  HungYen Primary School     2140
5   5  HungYen Primary School     2780
6   6  HungYen Primary School     3550
Figure 1: Some first rows of the data
The internal structure of the data can be obtained by:
> str(Householdsurvey)
'data.frame':	90 obs. of  4 variables:
 $ obs     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ province: chr  "HungYen" "HungYen" "HungYen" "HungYen" ...
 $ edulevel: chr  "Primary School" "Primary School" "Primary School" "Primary School" ...
 $ eduspend: int  1230 1130 3200 2140 2780 3550 2715 1788 2020 1320 ...
Figure 2: Structure of the data when factors have not been converted yet
From the above output, it is clear that there are 90 observations of 4 variables: obs, province, edulevel and eduspend. Since province and edulevel are stored as character strings, we convert them into factors using the following R code:
> Householdsurvey$province <- factor(Householdsurvey$province, levels = c("HungYen", "ThaiBinh"))
> Householdsurvey$edulevel <- factor(Householdsurvey$edulevel, levels = c("Primary School", "Secondary school", "Nursery School"))
Then we use str(Householdsurvey) again to display the new structure of the data, with "province" and "edulevel" converted into factors.
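The effect of the conversion can be checked on a small stand-in data frame. The sketch below is only illustrative: the data frame `demo` and its six rows are made up for the example, and only the column names and factor levels are taken from the report.

```r
# Hypothetical stand-in for the survey data (one made-up row per cell)
demo <- data.frame(
  province = rep(c("HungYen", "ThaiBinh"), each = 3),
  edulevel = rep(c("Primary School", "Secondary school", "Nursery School"), times = 2),
  eduspend = c(1230, 1130, 3200, 2140, 2780, 3550),
  stringsAsFactors = FALSE
)

# Convert the character columns into factors with an explicit level order
demo$province <- factor(demo$province, levels = c("HungYen", "ThaiBinh"))
demo$edulevel <- factor(demo$edulevel,
                        levels = c("Primary School", "Secondary school", "Nursery School"))

# Both columns are now reported as factors
str(demo)

# Frequency table of the cell sizes (province by education level)
cell_counts <- table(demo$province, demo$edulevel)
print(cell_counts)
```

A value that does not match any listed level exactly (including its capitalisation) becomes NA after conversion, so it is worth checking the frequency table before going further.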
Next, we use the by() function in R to obtain several descriptive statistics (mean, median, standard deviation and a summary) for each treatment group defined by the two factors:
> by(Householdsurvey$eduspend, list(Householdsurvey$province, Householdsurvey$edulevel), mean)
> by(Householdsurvey$eduspend, list(Householdsurvey$province, Householdsurvey$edulevel), median)
> by(Householdsurvey$eduspend, list(Householdsurvey$province, Householdsurvey$edulevel), sd)
: HungYen : Primary School [1] 1071.281
: ThaiBinh : Primary School [1] 356.0621
: HungYen : Secondary school [1] 4938.879
: ThaiBinh : Secondary school [1] 1377.442
: HungYen : Nursery School [1] 2481.657
: ThaiBinh : Nursery School [1] 2033.343
Figure 5: Standard deviation of Eduspend according to Province and Edulevel
> by(Householdsurvey$eduspend, list(Householdsurvey$province, Householdsurvey$edulevel), summary)
: HungYen : Primary School
: ThaiBinh : Primary School
: HungYen : Secondary school
: ThaiBinh : Secondary school
: HungYen : Nursery School
: ThaiBinh : Nursery School
Each command gives one descriptive statistic of the outcome variable (eduspend) for each treatment group, listed by province first and then by edulevel. The final command, with summary, reports six basic statistics of eduspend: the minimum, the first quartile, the median, the mean, the third quartile and the maximum.
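As an alternative to calling by() once per statistic, the same per-cell statistics can be collected in a single table. The following is a minimal sketch using base R's aggregate() on made-up data; the data frame `toy`, its values and the reduced set of two education levels are assumptions for illustration, not the survey data.

```r
# Hypothetical toy data: 2 provinces x 2 education levels, 3 households per cell
toy <- data.frame(
  province = rep(c("HungYen", "ThaiBinh"), each = 6),
  edulevel = rep(rep(c("Primary School", "Nursery School"), each = 3), times = 2),
  eduspend = c(1000, 1200, 1400, 2000, 2500, 3000,
               900, 1000, 1100, 1500, 2500, 3500)
)

# One row per cell, with mean, median and standard deviation side by side
cell_stats <- aggregate(
  eduspend ~ province + edulevel, data = toy,
  FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x))
)

# aggregate() stores the three statistics in one matrix column;
# flatten it into ordinary columns named eduspend.mean, eduspend.median, eduspend.sd
cell_stats <- do.call(data.frame, cell_stats)
print(cell_stats)
```

The flattened table has one row for each province and education level combination, which makes it easy to compare cells or to compute further quantities such as the ratio of the largest to the smallest standard deviation.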
To get further information, we draw a boxplot and a mean plot:
> boxplot(eduspend ~ interaction(province, edulevel), data = Householdsurvey, xlab = "Province and Education level", ylab = "Education expenditure", col = c("red", "blue", "yellow", "grey", "brown", "pink"))
Figure 7: Boxplot
The box plot shows several descriptive statistics for the six groups: medians, quartiles, and maximum and minimum values. Each cell has its own characteristics. Based on the R output, the HungYen - Secondary School group has the highest median, although the boxplot shows hardly any difference between the HungYen - Secondary School and ThaiBinh - Secondary School groups. The maximum of the HungYen - Secondary School group (21500) is also far above that of every other group. Although the HungYen - Nursery School group has the lowest minimum and median, the ThaiBinh - Primary School group has nearly the lowest values overall.
Moreover, the skewness of each group is visible in the boxplot. Judging by the distance from the median to the two endpoints, a group can look approximately symmetric (as a normal distribution would), positively skewed or negatively skewed. ThaiBinh - Secondary School and HungYen - Nursery School look approximately symmetric, with nearly equal distances from the median to the two endpoints. HungYen - Secondary School and ThaiBinh - Nursery School are typical examples of positively skewed distributions, while the remaining groups are negatively skewed. The boxplot also shows several outliers, drawn as white dots: HungYen - Secondary School (1 outlier), ThaiBinh - Secondary School (1 outlier), HungYen - Nursery School (2 outliers), and ThaiBinh - Nursery School (3 outliers).
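The points that boxplot() draws as separate dots are those lying more than 1.5 times the box length beyond the hinges. The sketch below shows how such points could be listed with base R's boxplot.stats(). The vector `x` is hypothetical: only its largest value, 21500, is taken from the report, and the other values are made up.

```r
# Hypothetical expenditures for one group; only 21500 comes from the report
x <- c(1200, 1500, 1800, 2000, 2100, 2300, 2500, 2600, 3000, 21500)

# boxplot.stats() applies the same rule boxplot() uses for its separate dots
stats <- boxplot.stats(x, coef = 1.5)

# Five-number summary behind the box and whiskers:
# lower whisker end, lower hinge, median, upper hinge, upper whisker end
print(stats$stats)

# Values beyond the whiskers, i.e. the points plotted as outliers
print(stats$out)
```

Running the same call on each of the six groups would give a per-group list of outliers that can be compared directly with the dots on the boxplot.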
A mean plot can also be used to identify the mean value of each group and to compare means between groups. Before creating the mean plot in RStudio, we need to install the required package with install.packages("gplots"); we then used the following code to obtain the plot:
> library(gplots)
> plotmeans(eduspend ~ interaction(province, edulevel), data = Householdsurvey, xlab = "Province and Education level", ylab = "Education Expenditure", main = "Mean Plot\nwith 95% CI")
Figure 8: Mean plot with 95% CI
Question 3: Check all assumptions of the inference technique you suggest in Question 1. Are the assumptions satisfied? Explain
As mentioned in question 1, two-way factorial analysis of variance is the most appropriate inference method for this case. However, it is essential to check all the assumptions of this method before carrying out the two-way ANOVA, to ensure that our results are reliable and valid. For two-way ANOVA, there are three assumptions we need to examine:
• Samples are independent simple random samples of size n_ij from each of k (= a × b) populations.
• All population distributions are normal.
• All populations have the same standard deviation.
To check whether the study satisfies these three assumptions for two-way ANOVA, the notation is defined as follows:
• n_ij: size of cell (i, j), i.e. one combination of the two factors
• i (Factor A): Province
• j (Factor B): Edulevel
Firstly, we check assumption 1. The terms and notation for the two-way ANOVA, together with the cell sizes, are shown in the table below:

            Primary School   Secondary school   Nursery School
HungYen                 15                 15               15
ThaiBinh                15                 15               15
Moreover, there are k = a × b = 2 × 3 = 6 populations, corresponding to the six groups HungYen - Primary School, ThaiBinh - Primary School, HungYen - Secondary School, ThaiBinh - Secondary School, HungYen - Nursery School and ThaiBinh - Nursery School. Within each combination of Province and Edulevel, every individual has the same probability of being chosen at random as one of the 90 observations. This is the reason why the study contains independent simple random samples.
Assumption 2: All populations have the same standard deviation
Secondly, we can check assumption 2 (equal standard deviations) using the output of the by() function for Province and Edulevel obtained in question 2. The ratio of the largest sample standard deviation to the smallest (4938.879/356.0621) is around 13.87, far above the common rule-of-thumb limit of 2. As a result, we use Levene's test, from the car package, to check whether the variances are equal, with the following code:
> library(car)
> leveneTest(Householdsurvey$eduspend, interaction(Householdsurvey$province, Householdsurvey$edulevel), center = median)
The output is shown below:
Levene's Test for Homogeneity of Variance (center = median)
      84
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Figure 9: Levene's Test
Since p-value = 0.07684 > α = 0.05, we do not reject the null hypothesis of equal variances. There is not enough evidence that the variances differ, so we treat the standard deviations of all populations as equal.
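The two checks above can be sketched in base R. The first part recomputes the ratio from the six standard deviations reported in Figure 5. The second part is a hand-rolled version of the median-centred Levene test (a one-way ANOVA on absolute deviations from the group medians), run on simulated data; the data frame `sim`, its three groups and the seed are assumptions for illustration and do not reproduce the report's p-value.

```r
# Part 1: ratio of largest to smallest cell standard deviation (values from Figure 5)
cell_sd <- c(
  "HungYen:Primary School"    = 1071.281,
  "ThaiBinh:Primary School"   = 356.0621,
  "HungYen:Secondary school"  = 4938.879,
  "ThaiBinh:Secondary school" = 1377.442,
  "HungYen:Nursery School"    = 2481.657,
  "ThaiBinh:Nursery School"   = 2033.343
)
sd_ratio <- max(cell_sd) / min(cell_sd)
print(round(sd_ratio, 2))

# Part 2: median-centred Levene test on simulated (hypothetical) data
set.seed(1)
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 15)),
  y     = c(rnorm(15, sd = 1), rnorm(15, sd = 1), rnorm(15, sd = 5))
)

# Absolute deviation of each observation from its own group median
abs_dev <- abs(sim$y - ave(sim$y, sim$group, FUN = median))

# One-way ANOVA on the absolute deviations gives the Levene F statistic
levene_tab <- anova(lm(abs_dev ~ group, data = sim))
levene_p   <- levene_tab[["Pr(>F)"]][1]
print(levene_tab)
```

With six cells of 15 observations each, as in the survey, the same calculation would have 5 and 84 degrees of freedom, which matches the 84 that survives in the output above.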
Assumption 3: All populations are normally distributed
To check whether all populations are normally distributed, we can draw a Q-Q plot in R:
> Householdsurvey$edulevel <- factor(Householdsurvey$edulevel, levels = c("Primary School", "Secondary school", "Nursery School"))
> qqPlot(lm(eduspend ~ province + edulevel + province*edulevel, data = Householdsurvey), simulate = TRUE, main = "Q-Q Plot", labels = FALSE)
With a sample size of 90, we can use a normal Q-Q plot to assess the normality of the residuals. The plot compares the residuals with what a perfect normal distribution would produce. It can be seen from the plot that the points lie far from the reference line and outside the confidence envelope, with some outliers. Since the points neither follow the line nor stay within the envelope, the normality assumption is not satisfied: the populations cannot be regarded as normally distributed.
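A numerical check can complement the visual reading of the Q-Q plot. The sketch below applies the Shapiro-Wilk test to the residuals of a two-way model. This test is not part of the report's analysis, and the data are simulated: the data frame `sim`, the exponential distribution used to create skewed spending and the seed are all assumptions for illustration.

```r
# Simulated (hypothetical) balanced design: 15 households in each of 6 cells
set.seed(42)
sim <- expand.grid(
  rep      = 1:15,
  province = c("HungYen", "ThaiBinh"),
  edulevel = c("Primary School", "Secondary school", "Nursery School")
)

# Right-skewed made-up spending, to mimic data that depart from normality
sim$eduspend <- rexp(nrow(sim), rate = 1 / 2000)

# Fit the two-way model with interaction and extract its residuals
fit <- lm(eduspend ~ province * edulevel, data = sim)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(res)
print(sw)

# Coordinates of the normal Q-Q plot, computed without drawing it
qq <- qqnorm(res, plot.it = FALSE)
```

Calling qqnorm(res) followed by qqline(res) would draw the plot itself. The test and the plot answer the same question, so they are best read together rather than in isolation.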
Question 4: Perform the inference technique you suggest in Question 1. Remember to provide all the necessary steps. What are your interpretations and conclusions? Explain
We use the two-way ANOVA test, as mentioned in question 1, with a significance level of 0.05.
Step 1: Identify null and alternative hypotheses:
• Hypothesis testing for interaction factor:
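One way the three hypothesis pairs tested in this question could be written is sketched below. The effects-model notation ($\alpha_i$ for Province with $i = 1, 2$, $\beta_j$ for Edulevel with $j = 1, 2, 3$, and $(\alpha\beta)_{ij}$ for their interaction) is an assumption chosen for illustration; only the numbering $H_{01}$, $H_{02}$, $H_{03}$ follows the conclusions in Step 6.

```latex
\begin{align*}
H_{01}&: (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
  & H_{a1}&: \text{at least one } (\alpha\beta)_{ij} \neq 0 \\
H_{02}&: \alpha_1 = \alpha_2 = 0
  & H_{a2}&: \text{at least one } \alpha_i \neq 0 \\
H_{03}&: \beta_1 = \beta_2 = \beta_3 = 0
  & H_{a3}&: \text{at least one } \beta_j \neq 0
\end{align*}
```

In words, the first pair asks whether the effect of Province on education expenditure depends on Edulevel, and the other two ask whether mean expenditure differs between provinces and between education levels.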
• Check assumptions: We have checked this step in Question 3.
  o All populations are normally distributed
  o Samples are independent simple random samples of size 15 from each of 6 populations
  o All populations have the same standard deviation
• Test statistic and p-value:
We used RStudio to carry out the calculation and obtained the following output:
> Householdsurvey.result <- aov(eduspend ~ province * edulevel, data = Householdsurvey)
> summary(Householdsurvey.result)
            Df Sum Sq Mean Sq F value Pr(>F)
Figure 11: Two-way ANOVA output
Step 3: Level of significance
The level of significance: α = 0.05
Step 4: Decision rule
We will reject H0 if p-value < α.
Step 5: Value of test statistic
• To test the interaction between Province and Edulevel, we got: p-value = 0.6905 > α = 0.05
• To test the difference in Eduspend due to Province, we got: p-value = 0.0546 > α = 0.05
• To test the difference in Eduspend due to Edulevel, we got: p-value = 7.19e-05 < α = 0.05
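The decision rule can be applied mechanically to the three reported p-values. The short sketch below does this in R; the p-values are the ones quoted in Step 5, while the object names `alpha`, `p_values` and `decision` are chosen here for illustration.

```r
# Significance level and the three p-values reported in Step 5
alpha    <- 0.05
p_values <- c(interaction = 0.6905, province = 0.0546, edulevel = 7.19e-05)

# Decision rule: reject H0 when the p-value is below alpha
decision <- ifelse(p_values < alpha, "reject H0", "do not reject H0")
print(decision)
```

The Province p-value (0.0546) is only slightly above 0.05, so the decision for that factor is sensitive to the chosen significance level.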
Step 6: Conclusion
• Do not reject H01. We do not have sufficient evidence to conclude that there is a significant interaction between place of residence and schooling level in education expenditure at the 5% level of significance.
• Do not reject H02. We do not have sufficient evidence to conclude that mean education expenditure differs between the two provinces.
• Reject H03. We have sufficient evidence to conclude that mean education expenditure differs across education levels.
Question 5: Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusions made in Question 4?
This command is used to draw the interaction plot:
> interaction.plot(Householdsurvey$province, Householdsurvey$edulevel, Householdsurvey$eduspend, type = "b", col = c("red", "blue"), pch = c(16, 18), main = "Interaction between Province and Edulevel")
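An interaction plot is a picture of the table of cell means, so it can help to compute that table directly. The sketch below does so with tapply() on made-up data; the data frame `toy`, its values and the reduced set of two education levels are assumptions for illustration only.

```r
# Hypothetical toy data: 2 provinces x 2 education levels, 3 households per cell
toy <- data.frame(
  province = rep(c("HungYen", "ThaiBinh"), each = 6),
  edulevel = rep(rep(c("Primary School", "Nursery School"), each = 3), times = 2),
  eduspend = c(1000, 1200, 1400, 2000, 2500, 3000,
               900, 1000, 1100, 1500, 2500, 3500)
)

# Matrix of cell means: rows are provinces, columns are education levels.
# These are the points that interaction.plot() joins with lines.
cell_means <- with(toy, tapply(eduspend, list(province, edulevel), mean))
print(cell_means)

# Gap between the two provinces at each education level.
# Similar gaps across levels correspond to roughly parallel lines.
province_gap <- cell_means["HungYen", ] - cell_means["ThaiBinh", ]
print(province_gap)
```

Roughly parallel lines suggest little interaction, while lines that cross or diverge strongly suggest that the effect of one factor depends on the level of the other. The formal test remains the interaction row of the ANOVA table.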