
HANOI UNIVERSITY, FACULTY OF MANAGEMENT & TOURISM

Business and Economics Statistics

CASE STUDY: EXPENDITURE ON EDUCATION

Tutor’s name: Ms Hoai Phuong Tutorial class: Tutorial 3

Bùi Thị Thu Huyền 1904000052

Hanoi, November 10, 2021

TABLE OF CONTENTS

Question 1: What inference technique should be considered for this study? Explain

Question 2: Produce descriptive statistics for the dataset

Question 3: Check all assumptions of the inference technique you suggest in Question 1. Are the assumptions satisfied? Explain

Question 4: Perform the inference technique you suggest in Question 1. Remember to provide all the necessary steps. What are your interpretations and conclusions? Explain

Question 5: Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusions made in Question 4?

Question 6: Discuss the credibility of the interpretations and conclusions of Question 4. Is there ...

TABLE OF FIGURES

Figure 1: Some first rows of the data

Figure 2: Structure of the data when factors have not been converted yet

Figure 3: Structure of the data when factors have been converted

Figure 4: Frequency table (sample sizes)

Figure 5: Standard deviation of Eduspend according to Province and Edulevel

Figure 6: Summary of Eduspend according to Province and Edulevel

Figure 7: Boxplot

Figure 8: Mean plot with 95% CI

Figure 9: Levene's Test

Figure 10: Q-Q plot

Figure 11: Two-way ANOVA output

Figure 12: Interaction between Province and Edulevel

Question 1: What inference technique should be considered for this study? Explain

Two-way ANOVA should be considered as the inference method in this case study, for two reasons. First, it assesses the mean differences across the levels of each factor. Second, it tests for a significant interaction between place of residence and schooling level, as well as for significant differences in education expenditure due to each of these two variables. Our group therefore chose two-way ANOVA: it compares groups defined by two independent factors (place of residence and level of education) on one dependent variable (education expenditure), and it also indicates whether the two factors interact.

Question 2: Produce descriptive statistics for the dataset

We use RStudio to produce the descriptive statistics for this question. First, we import the CSV file "Household-survey.csv" into R:

> Householdsurvey <- read.table("Household-survey.csv", header = TRUE, sep = ",", quote = "\"", stringsAsFactors = FALSE)

The data set contains 90 observations, so we first look at a few rows with the head() function to get familiar with the data:

> head(Householdsurvey)

  obs province       edulevel eduspend
1   1  HungYen Primary School     1230
2   2  HungYen Primary School     1130
3   3  HungYen Primary School     3200
4   4  HungYen Primary School     2140
5   5  HungYen Primary School     2780
6   6  HungYen Primary School     3550

Figure 1: Some first rows of the data

The internal structure of the data can be obtained by:

> str(Householdsurvey)

'data.frame': 90 obs. of 4 variables:
 $ obs     : int 1 2 3 4 5 6 7 8 9 10 ...
 $ province: chr "HungYen" "HungYen" "HungYen" "HungYen" ...
 $ edulevel: chr "Primary School" "Primary School" "Primary School" "Primary School" ...
 $ eduspend: int 1230 1130 3200 2140 2780 3550 2715 1788 2020 1320 ...

Figure 2: Structure of the data when factors have not been converted yet

From the above output, there are 90 observations of 4 variables: obs, province, edulevel and eduspend. Since province and edulevel are character variables, we convert them into factors with the following R code:

> Householdsurvey$province <- factor(Householdsurvey$province, levels = c("HungYen", "ThaiBinh"))

> Householdsurvey$edulevel <- factor(Householdsurvey$edulevel, levels = c("Primary School", "Secondary school", "Nursery School"))

We then run str(Householdsurvey) again to get the new structure of the data, with "province" and "edulevel" converted into factors.

Figure 3: Structure of the data when factors have been converted
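The conversion step can be tried on a small stand-in data frame. The sketch below is only an illustration: the `toy` data frame and its values are hypothetical and are not taken from the survey file. It shows what `factor()` with explicit `levels` produces, and that a label which does not match a level exactly (the match is case-sensitive) is silently turned into `NA`, which is worth checking given the mixed capitalisation of "Secondary school".

```r
# Hypothetical miniature data set (NOT the survey data) with the same column types
toy <- data.frame(
  province = c("HungYen", "ThaiBinh", "HungYen", "ThaiBinh"),
  edulevel = c("Primary School", "Nursery School", "Secondary school", "Primary School"),
  eduspend = c(1230L, 900L, 4100L, 1500L),
  stringsAsFactors = FALSE
)

# Same conversion as in the report: character columns become factors with a fixed level order
toy$province <- factor(toy$province, levels = c("HungYen", "ThaiBinh"))
toy$edulevel <- factor(toy$edulevel,
                       levels = c("Primary School", "Secondary school", "Nursery School"))

str(toy)               # province and edulevel are now reported as Factor
levels(toy$edulevel)   # level order follows the 'levels' argument, not the alphabet

# A label that differs from every level (here only by capitalisation) becomes NA
mismatch <- factor("Secondary School", levels = levels(toy$edulevel))
is.na(mismatch)

# Quick check that no value was lost during the conversion
any(is.na(toy$edulevel))
```

Running `any(is.na(...))` on the real columns after conversion would be a cheap way to confirm that every label in the file matched one of the declared levels.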

Next, we use the by() function to compute descriptive statistics (mean, median, standard deviation and a summary) of eduspend for each treatment group defined by the two factors. The commands and the available output are:

> by(Householdsurvey$eduspend, list(Householdsurvey$province, Householdsurvey$edulevel), mean)

> by(Householdsurvey$eduspend, list(Householdsurvey$province, Householdsurvey$edulevel), median)

> by(Householdsurvey$eduspend, list(Householdsurvey$province, Householdsurvey$edulevel), sd)

: HungYen : Primary School [1] 1071.281
: ThaiBinh : Primary School [1] 356.0621
: HungYen : Secondary school [1] 4938.879
: ThaiBinh : Secondary school [1] 1377.442
: HungYen : Nursery School [1] 2481.657
: ThaiBinh : Nursery School [1] 2033.343

Figure 5: Standard deviation of Eduspend according to Province and Edulevel

> by(Householdsurvey$eduspend, list(Householdsurvey$province, Householdsurvey$edulevel), summary)

Figure 6: Summary of Eduspend according to Province and Edulevel

Each command returns the requested descriptive statistic of the outcome variable (eduspend) for every treatment group, listed by province first and then by edulevel. The last command, with summary, returns six basic statistics of eduspend: the minimum, the first quartile, the median, the mean, the third quartile and the maximum.
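As a cross-check of the by() calls, the same per-cell statistics can be obtained as compact two-way tables with `tapply()`. The sketch below is a minimal illustration on a hypothetical eight-row data frame (`toy`, with made-up values chosen so the results are easy to verify by hand); it is not the survey data.

```r
# Hypothetical data: 2 provinces x 2 education levels, 2 observations per cell
toy <- data.frame(
  province = factor(rep(c("HungYen", "ThaiBinh"), each = 4)),
  edulevel = factor(rep(c("Primary School", "Nursery School"), times = 4)),
  eduspend = c(1000, 2000, 3000, 4000, 500, 1500, 2500, 3500)
)

# One statistic per cell, returned as a province-by-edulevel matrix
cell_means   <- tapply(toy$eduspend, list(toy$province, toy$edulevel), mean)
cell_medians <- tapply(toy$eduspend, list(toy$province, toy$edulevel), median)
cell_sds     <- tapply(toy$eduspend, list(toy$province, toy$edulevel), sd)

print(cell_means)
print(cell_sds)

# by() gives the same numbers, printed group by group as in the report
by(toy$eduspend, list(toy$province, toy$edulevel), summary)
```

The matrix form makes it easier to read off the largest and smallest cell standard deviations, which are needed later for the equal-variance check.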

For further insight, we draw a boxplot and a mean plot:

> boxplot(eduspend ~ interaction(province, edulevel), data = Householdsurvey, xlab = "Province and Education level", ylab = "Education expenditure", col = c("red", "blue", "yellow", "grey", "brown", "pink"))

Figure 7: Boxplot

The boxplot shows several descriptive statistics for the six groups: medians, quartiles, maxima and minima. The six cells differ noticeably from one another. Based on the R output, the HungYen - Secondary School group has the highest median, although on the boxplot it is hardly distinguishable from the ThaiBinh - Secondary School group. The maximum of the HungYen - Secondary School group (21500) is also far above that of the other groups. Although the HungYen - Nursery School group has the lowest minimum and median, the ThaiBinh - Primary School group has almost the lowest values overall.

The boxplot also shows the shape of each group's distribution. A group can look roughly symmetric, positively skewed or negatively skewed, judged by the distances from the median to the two ends of the box and whiskers. The ThaiBinh - Secondary School and HungYen - Nursery School groups look roughly symmetric, with nearly equal distances from the median to the two ends. The HungYen - Secondary School and ThaiBinh - Nursery School groups are typical examples of positively skewed distributions, while the remaining groups are negatively skewed. There are also outliers, shown as white dots: in HungYen - Secondary School (1 outlier), ThaiBinh - Secondary School (1 outlier), HungYen - Nursery School (2 outliers) and ThaiBinh - Nursery School (3 outliers).

A mean plot can also be used to identify the mean of each group and to compare means between groups. Before creating the mean plot in RStudio, we install the required package with install.packages("gplots"); we then use the following code:

> library(gplots)

> plotmeans(eduspend ~ interaction(province, edulevel), data = Householdsurvey, xlab = "Province and Education level", ylab = "Education Expenditure", main = "Mean Plot with 95% CI")

Figure 8: Mean plot with 95% CI

Question 3: Check all assumptions of the inference technique you suggest in Question 1. Are the assumptions satisfied? Explain

As mentioned in Question 1, two-way factorial analysis of variance is the appropriate inference method for this case. However, before running the two-way ANOVA we must check all of its assumptions to ensure that our results are reliable and valid. For two-way ANOVA, there are three assumptions to examine:

* Samples are independent simple random samples of size n_ij from each of k (= a*b) populations

* All populations have the same standard deviation

* All population distributions are normal

To check whether the study satisfies these three assumptions, the notation is defined as follows:

* n_ij: size of the cell (combination of the factor levels)

* i (Factor A): Province, with a = 2 levels

* j (Factor B): Edulevel, with b = 3 levels

Firstly, we check assumption 1. The sample sizes of the cells are shown in the following table:

           Primary School   Secondary school   Nursery School
HungYen          15                15                15
ThaiBinh         15                15                15

There are k = a*b = 2*3 = 6 populations, corresponding to six groups: HungYen - Primary School, ThaiBinh - Primary School, HungYen - Secondary School, ThaiBinh - Secondary School, HungYen - Nursery School and ThaiBinh - Nursery School. Within each group, every individual has the same probability of being chosen at random as one of the 90 observations. This is why the study is taken to contain independent simple random samples.

Assumption 2: All populations have the same standard deviation

Secondly, we check the assumption of equal standard deviations using the output of the by() function with sd from Question 2. The ratio of the largest to the smallest sample standard deviation (4938.879/356.0621) is about 13.87, far above the common rule-of-thumb limit of 2, so this informal check fails. We therefore use Levene's test (from the car package) to test formally whether the variances are equal, with the following code:

> library(car)

> leveneTest(Householdsurvey$eduspend, interaction(Householdsurvey$province, Householdsurvey$edulevel), center = median)

The output, with 84 residual degrees of freedom, is shown below:

Levene's Test for Homogeneity of Variance (center = median)

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure 9: Levene's Test

Since p-value = 0.07684 > α = 0.05, we do not reject the null hypothesis that all variances are equal. In other words, we treat the assumption that all populations have the same standard deviation as satisfied.
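Both checks can be reproduced without extra packages. The sketch below first recomputes the standard deviation ratio from the two values reported in Figure 5, then runs a median-centred Levene test in base R as a one-way ANOVA on the absolute deviations from each group median, which is how this test is usually defined. The data frame `sim` is simulated purely for illustration (three hypothetical groups named A, B and C), so its p-value says nothing about the survey.

```r
# Ratio of the largest to the smallest cell standard deviation (values from Figure 5)
report_ratio <- 4938.879 / 356.0621
round(report_ratio, 2)

# Simulated data only: three hypothetical groups of 15, one with a larger spread
set.seed(1)
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 15)),
  y     = c(rnorm(15, mean = 10, sd = 1),
            rnorm(15, mean = 10, sd = 1),
            rnorm(15, mean = 10, sd = 5))
)

# Informal check: ratio of largest to smallest sample standard deviation
sds <- tapply(sim$y, sim$group, sd)
sd_ratio <- max(sds) / min(sds)

# Median-centred Levene test: one-way ANOVA on |y - group median|
sim$abs_dev <- abs(sim$y - ave(sim$y, sim$group, FUN = median))
levene <- anova(lm(abs_dev ~ group, data = sim))
p_value <- levene[["Pr(>F)"]][1]

print(levene)
# Decision at the 5% level: TRUE means the equal-variance hypothesis is rejected
p_value < 0.05
```

With six cells of 15 observations, as in the survey, the same construction gives 5 and 84 degrees of freedom.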

Assumption 3: All populations are normally distributed

To check whether all populations are normally distributed, we draw a Q-Q plot of the model residuals in R:

> Householdsurvey$edulevel <- factor(Householdsurvey$edulevel, levels = c("Primary School", "Secondary school", "Nursery School"))

> qqPlot(lm(eduspend ~ province + edulevel + province*edulevel, data = Householdsurvey), simulate = TRUE, main = "Q-Q Plot", labels = FALSE)

With a sample size of 90, we use a normal Q-Q plot to assess the normality of the residuals. The plot compares the ordered residuals with the quantiles of a perfect normal distribution. Many points lie far outside the confidence envelope, and there are some outliers. The Q-Q plot therefore does not meet the requirements for normality, and we conclude that the populations are not normally distributed.
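A numerical check can complement the visual reading of the Q-Q plot. The sketch below is an assumption-laden illustration, not the method used in this report: it swaps in the Shapiro-Wilk test (base R `shapiro.test()`) on the residuals of the same model formula, and it uses simulated right-skewed data in place of the survey file. The data frame `sim` and all its values are hypothetical.

```r
# Simulated stand-in for the survey: 15 observations in each of the 6 cells
set.seed(42)
sim <- expand.grid(
  rep      = 1:15,
  province = c("HungYen", "ThaiBinh"),
  edulevel = c("Primary School", "Secondary school", "Nursery School")
)
# Right-skewed spending values (exponential noise), chosen only for illustration
sim$eduspend <- 1000 + rexp(nrow(sim), rate = 1 / 2000)

# Same model structure as in the report: both main effects plus the interaction
fit <- lm(eduspend ~ province * edulevel, data = sim)
res <- residuals(fit)

# Q-Q coordinates without drawing; a correlation well below 1 suggests non-normality
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)

# Shapiro-Wilk test of the residuals (null hypothesis: residuals are normal)
sw <- shapiro.test(res)
print(sw)
qq_cor
```

A small Shapiro-Wilk p-value would point in the same direction as a Q-Q plot with many points outside the envelope.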

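Question 4: Perform the inference technique you suggest in Question 1. Remember to provide all the necessary steps. What are your interpretations and conclusions? Explain

We use the two-way ANOVA test, as mentioned in Question 1, with a significance level of 0.05.

Step 1: Identify null and alternative hypotheses

* Hypothesis test for the interaction:

o H01: There is no interaction between Province and Edulevel. Ha1: There is an interaction between Province and Edulevel.

* Hypothesis test for the factor Province:

o H02: Mean education expenditure is the same in both provinces. Ha2: Mean education expenditure differs between the provinces.

* Hypothesis test for the factor Edulevel:

o H03: Mean education expenditure is the same at all education levels. Ha3: At least one education level has a different mean education expenditure.

* Check assumptions: We have checked this step in Question 3

o All populations are normally distributed

o Samples are independent simple random samples of size 15 from each of 6 populations

o All populations have the same standard deviation

* Test statistic and p-value:

We used RStudio for the calculation and obtained the following output:

> Householdsurvey.result <- aov(eduspend ~ province * edulevel, data = Householdsurvey)

> summary(Householdsurvey.result)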

Df Sum Sq Mean Sq F value Pr(>F)

Figure 11: Two-way ANOVA output

Step 3: Level of significance

The level of significance is α = 0.05.

Step 4: Decision rule

We reject H0 if p-value < α.

Step 5: Value of test statistic

* Test of the interaction between Province and Edulevel: p-value = 0.6905 > α = 0.05

* Test of the difference in Eduspend due to Province: p-value = 0.0546 > α = 0.05

* Test of the difference in Eduspend due to Edulevel: p-value = 7.19e-05 < α = 0.05

Step 6: Conclusion

* Do not reject H01. We do not have sufficient evidence to conclude that there is a significant interaction between place of residence and schooling level in education expenditure at the 5% significance level.

* Do not reject H02. We do not have sufficient evidence to conclude that mean education expenditure differs between the two provinces.

* Reject H03. We have sufficient evidence to conclude that mean education expenditure differs across education levels.

Question 5: Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusions made in Question 4?

The following command draws the interaction plot:

> interaction.plot(Householdsurvey$province, Householdsurvey$edulevel, Householdsurvey$eduspend, type = "b", col = c("red", "blue"), pch = c(16, 18), main = "Interaction between Province and Edulevel")
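An interaction plot is a picture of the six cell means, so its reading can be compared directly with the ANOVA decisions. The sketch below first re-applies the Step 4 decision rule to the three p-values reported in Step 5, then fits the same model formula to simulated data and extracts the cell means that interaction.plot() would draw. The data frame `sim`, its seed and its effect sizes are hypothetical and only stand in for the survey file.

```r
# Decision rule (reject H0 if p-value < alpha) applied to the p-values reported in Step 5
alpha <- 0.05
p_report <- c(interaction = 0.6905, province = 0.0546, edulevel = 7.19e-05)
reject_report <- p_report < alpha
print(reject_report)   # only the Edulevel hypothesis is rejected

# Simulated stand-in data: 15 observations per cell, with an Edulevel effect only
set.seed(2021)
sim <- expand.grid(
  rep      = 1:15,
  province = c("HungYen", "ThaiBinh"),
  edulevel = c("Primary School", "Secondary school", "Nursery School")
)
sim$eduspend <- 2000 + 1500 * (sim$edulevel == "Secondary school") +
  rnorm(nrow(sim), sd = 1000)

# Same model formula as in the report
fit <- aov(eduspend ~ province * edulevel, data = sim)
tab <- summary(fit)[[1]]

# p-values for province, edulevel and their interaction (the Residuals row has none)
p_sim <- tab[["Pr(>F)"]][1:3]
names(p_sim) <- c("province", "edulevel", "province:edulevel")
print(p_sim)

# Cell means: the points an interaction plot connects, one line per education level
cell_means <- tapply(sim$eduspend, list(sim$province, sim$edulevel), mean)
print(round(cell_means))
```

Roughly parallel lines across the two provinces would agree with not rejecting H01, while clearly separated lines for the education levels would agree with rejecting H03.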

Title	Expenditure Statistics on Education
Authors	Dang Thi Thao My, Dang Linh Nga, Vũ Thúy Hằng, Nguyễn Thị Thùy Dung, Bùi Thị Thu Huyền, Chu Thị Ngọc Anh, Dao Thi Dung
Supervisor	Ms. Hoai Phuong
University	Hanoi University
Major	Business and Economics
Type	Case study
Year of publication	2021
City	Hanoi
