Case Study: Expenditure on Education
Subject: Business and Economics Statistics



HANOI UNIVERSITY FACULTY OF MANAGEMENT AND TOURISM

Tutor: Ms Lại Hoài Phương

Students: Nguyễn Thị Nguyệt Hà - 1904000034

Trịnh Cẩm Lân - 1904000060

Nguyễn Trang Linh - 1904040069

Nguyễn Ngoc Mai - 1904000076

Nguyễn Thị Thanh Tuyét - 1904040106

Hoàng Đức Huy - 1504040050

Date: November 19, 2021


I Introduction

II Answering the questions

Question 1

Question 2

Question 3


LIST OF TABLES AND FIGURES

Figure 1: The first six rows of the data
Figure 2: The structure of the data after being converted into factors
Figure 3: Crosstabulation table
Figure 4: The mean of Eduspend according to Province and Edulevel
Figure 5: Standard deviation of Eduspend according to Province and Edulevel
Figure 6: Boxplot
Figure 7: Mean Plot with 95% CI
Figure 8: Q-Q plot for the data
Figure 9: Levene's Test for Homogeneity of variance
Figure 10: Two-way ANOVA table
Figure 11: Interaction Plot

I Introduction

In the VHSS survey, respondents were asked to specify their place of residence (province), the schooling level of their children (edulevel), and their expenditure on education per child over the past 12 months, in thousands of VND (eduspend). In this case study, we use two-way ANOVA to test for a significant interaction between place of residence and schooling level, and to examine whether education expenditure differs significantly across these two variables. A further objective of this report is to show how the main features of the two-way ANOVA model are applied in a real-life case study.

Question 1 What inference technique should be considered for this study? Explain

Two-way ANOVA is the suitable inference technique for this case study, for several reasons. Firstly, there are two factors (place of residence and schooling level) that might affect the measured variable. Two-way ANOVA is more efficient because its design allows both factors to be analyzed simultaneously rather than separately. Secondly, two-way ANOVA can determine whether the factors interact (a significant interaction means that the effect of one factor changes depending on the level of the other factor). This matches the goal of the study, which is to look for a significant interaction between the two variables. Another benefit of two-way ANOVA is that it reduces the residual variance in the model by introducing a second factor that is expected to influence the response variable. Based on all of these considerations, two-way ANOVA is the most appropriate method in this situation.

Question 2 Produce descriptive statistics for the dataset

After setting up the working directory, we import the file "Eduexpense.csv" into R using this code:

> education <- read.table("Eduexpense.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

The dataset is fairly large, with 120 observations, so we use the head() function to display only the first 6 rows of the data:

> head(education)

obs  province  edulevel    eduspend
1    HaiDuong  University  16490
2    HaiDuong  University  25800
3    HaiDuong  University  29570
4    HaiDuong  University  35000
5    HaiDuong  University  16970
6    HaiDuong  University  11600

Figure 1: The first six rows of the data

Because we want to examine eduspend for each combination of edulevel and province, we convert the province and edulevel variables into factors for later use:

> education$province <- factor(education$province, levels = c("HaiDuong", "Hanoi", "NamDinh"), labels = c("HaiDuong province", "Hanoi province", "NamDinh province"))

> education$edulevel <- factor(education$edulevel, levels = c("University", "Secondary school"), labels = c("University level", "Secondary level"))

The structure of the data frame after the two variables are converted into factors is shown below:

Figure 2: The structure of the data after being converted into factors

We investigate the data with a crosstabulation table, produced by the table() function, to check the sample size of each stratum.

For the descriptive statistics, we also need the mean and standard deviation of each group.
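The crosstabulation and the group-wise summaries can be sketched as follows. This is a minimal illustration on synthetic data: the data frame toy and its values are made up for the example and are not the contents of Eduexpense.csv; only the column names follow the report.

```r
# Synthetic stand-in for the survey data (NOT the real Eduexpense.csv values).
set.seed(1)
toy <- expand.grid(
  rep      = 1:5,
  province = c("HaiDuong", "Hanoi", "NamDinh"),
  edulevel = c("University", "Secondary school"),
  stringsAsFactors = FALSE
)
toy$eduspend <- round(rnorm(nrow(toy), mean = 10000, sd = 3000))

# Convert the two grouping variables into factors.
toy$province <- factor(toy$province)
toy$edulevel <- factor(toy$edulevel,
                       levels = c("University", "Secondary school"))

# Crosstabulation: number of observations in each province x edulevel cell.
cell_counts <- table(toy$province, toy$edulevel)
print(cell_counts)

# Group-wise mean and standard deviation of eduspend.
cell_means <- tapply(toy$eduspend, list(toy$province, toy$edulevel), mean)
cell_sds   <- tapply(toy$eduspend, list(toy$province, toy$edulevel), sd)
print(round(cell_means, 1))
print(round(cell_sds, 1))
```

Here tapply() returns one matrix per statistic, which can be easier to scan than the list-style output of by(); both approaches compute the same group summaries.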

: HaiDuong province
: Secondary level
[1] 4334.6

: Hanoi province
: Secondary level
[1] 6…

: NamDinh province
: Secondary level
[1] 5492.45

Figure 4: The mean of Eduspend according to Province and Edulevel

> by(education$eduspend, list(education$province, education$edulevel), sd)

: HaiDuong province
: University level
[1] 9902.959

: Hanoi province
: University level
[1] 9847.818

: NamDinh province
: University level
[1] 7899.908

: HaiDuong province
: Secondary level
[1] 1012.826

: Hanoi province
: Secondary level
[1] 8817.526

: NamDinh province
: Secondary level
[1] 2161.205

Figure 5: Standard deviation of Eduspend according to Province and Edulevel

Several types of charts are presented for better visualization. A boxplot is a useful graphical technique for comparing multiple groups:

> boxplot(eduspend ~ interaction(province, edulevel), data = education, xlab = "Province and Schooling level", ylab = "Education expenditure", frame = FALSE, col = c("lightblue", "yellow", "pink"))

Figure 6: Boxplot

The boxplot indicates how the data are spread out. It shows the distribution of the data, including skewness, variance and outliers. The box represents the interquartile range of each group, and the bold black line is the median of that group. In this case, the medians differ noticeably across groups. The median education expenditure of the NamDinh province.University level group is the highest at over 2000, whereas the lowest belongs to the HaiDuong province.Secondary level group. Moreover, in all three provinces, spending at university level is much higher than spending at the lower schooling level.

The boxplot is also a good indicator of the skewness of the data. Except for the HaiDuong province.Secondary level group, whose boxplot appears symmetric and is therefore consistent with a normal distribution, the groups are skewed and their data do not appear to be normally distributed.

Interestingly, Hanoi's spending on higher education shows the most variation, whereas the least variation is found in Hai Duong's education expenditure at the lower schooling level.

Another way to visualize the data is the mean plot. This requires the plotmeans() function, which is available after installing and loading the gplots package:

> install.packages("gplots")
> library(gplots)
> plotmeans(eduspend ~ interaction(province, edulevel), data = education, xlab = "Province and education level", ylab = "Education expenditure", main = "Mean Plot with 95% CI")

Figure 7: Mean Plot with 95% CI

Running this code in R produces a plot of the group means with 95% confidence intervals. The NamDinh province.University level group has the highest mean, in contrast to the HaiDuong province.Secondary level group, which has the lowest.
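The intervals drawn in a mean plot can also be computed by hand. The sketch below assumes the usual t-based interval, mean ± t × SD/√n, computed separately for each group; whether plotmeans() uses exactly this formula should be checked against the gplots documentation. The data are synthetic and the helper name ci_for_group is hypothetical.

```r
# Synthetic data for two groups (NOT the survey values).
set.seed(2)
spend <- c(rnorm(20, mean = 20000, sd = 9000),
           rnorm(20, mean = 5000,  sd = 1000))
group <- factor(rep(c("University", "Secondary"), each = 20))

# Hypothetical helper: t-based confidence interval for one group mean.
ci_for_group <- function(x, level = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)                              # standard error of the mean
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * se
  c(mean = mean(x), lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# One row per group: mean, lower limit, upper limit.
ci_table <- t(sapply(split(spend, group), ci_for_group))
print(round(ci_table, 1))
```

Groups with a larger standard deviation get wider intervals, which is why the spread of the bars in a mean plot also hints at unequal variances.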

Question 3 Check all the assumptions of the inference technique you suggest in Question 1 Are the assumptions satisfied? Explain

Like all ANOVA tests, two-way ANOVA requires 3 assumptions. Firstly, the samples are independent, simple random samples. Secondly, the populations are normally distributed. Lastly, the populations have the same standard deviation.

Assumption 1: Samples are independent, simple random samples

By definition, an independent sample is one that has no relationship to any other sample. The samples are independent when the occurrence of one sample is unaffected by the presence of another. The R output of the crosstabulation table between the Province and Edulevel variables gives the sample size of each stratum.

The crosstabulation table above covers 6 groups (2 schooling levels x 3 places of residence). There are two factors: Province (Hai Duong, Hanoi, Nam Dinh) and Education level (University and Secondary). Each response was given by a different household, and each household's answer is unaffected by the responses of others. In other words, education expenses by place of residence and education expenses by schooling level are unrelated to each other. Hence, the study has independent samples. In addition, since no information is given on how the interviewees were chosen, we assume that the samples are simple random samples, in which each response for the two factors (Province and Edulevel) has the same probability of being selected.

Assumption 2: All populations are normally distributed

To examine the normality assumption, we use the Q-Q plot of residuals:

> install.packages("car")
> library(car)
> qqPlot(lm(eduspend ~ province + edulevel + province*edulevel, data = education), simulate = TRUE, main = "Q-Q Plot", labels = FALSE)
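The same check can be sketched with base R only, which avoids the dependency on the car package. This is an illustration on synthetic data, not the report's analysis: the data frame toy is invented, and the sketch plots standardized residuals against normal quantiles without the confidence envelope that qqPlot() adds.

```r
# Synthetic two-factor data with normal errors (NOT the survey values).
set.seed(3)
toy <- data.frame(
  province = factor(rep(c("HaiDuong", "Hanoi", "NamDinh"), each = 20)),
  edulevel = factor(rep(rep(c("University", "Secondary"), each = 10), times = 3))
)
toy$eduspend <- 10000 + 5000 * (toy$edulevel == "University") +
  rnorm(nrow(toy), sd = 2000)

# Fit the two-way model with interaction and extract standardized residuals.
fit <- lm(eduspend ~ province * edulevel, data = toy)
res <- rstandard(fit)

# Q-Q coordinates: theoretical normal quantiles (x) vs. sample quantiles (y).
qq <- qqnorm(res, plot.it = FALSE)

# Draw the plot only in an interactive session.
if (interactive()) {
  qqnorm(res, main = "Q-Q Plot of standardized residuals")
  qqline(res)
}

# A correlation close to 1 means the points lie close to a straight line.
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

Points that bend away from the reference line at either end indicate heavier or lighter tails than a normal distribution would produce.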


Assumption 3: All populations have the same standard deviation

To check the assumption of equal standard deviations (SD), we compute the ratio of the largest SD to the smallest SD. We can treat the population SDs as equal if this ratio is smaller than 2. If the ratio is between 2 and 3 and it is not clear whether the variances can be pooled, it is appropriate to check again using Levene's test. Here, the by() function gives the SD of each group, and the ratio of the largest to the smallest SD is 9.78 (9902.959/1012.826). This ratio is greater than 2 (9.78 > 2), so this assumption is not satisfied.

> by(education$eduspend, list(education$province, education$edulevel), sd)
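The ratio check can be reproduced directly from the six group SDs reported in Figure 5. In this sketch the vector names are shortened labels chosen for the example, and the values are typed in from the report rather than recomputed from the data.

```r
# Group standard deviations as reported in Figure 5 (typed in by hand).
group_sd <- c(
  "HaiDuong.University" = 9902.959,
  "Hanoi.University"    = 9847.818,
  "NamDinh.University"  = 7899.908,
  "HaiDuong.Secondary"  = 1012.826,
  "Hanoi.Secondary"     = 8817.526,
  "NamDinh.Secondary"   = 2161.205
)

# Rule of thumb from the text: SDs are treated as equal if max/min < 2.
sd_ratio    <- max(group_sd) / min(group_sd)
equal_sd_ok <- sd_ratio < 2

print(round(sd_ratio, 2))
print(equal_sd_ok)
```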


We still check the result of Levene's test. This test examines the homogeneity of variance, so the null hypothesis is that all the variances are equal. The p-value is much smaller than the 0.05 significance level, so we reject the null hypothesis and infer that the variances are not all equal; therefore, the SDs are not the same.

> leveneTest(education$eduspend,interaction(education$province,
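To show what Levene's test computes, the sketch below implements the median-centered version in base R: a one-way ANOVA on the absolute deviations of each observation from its group median. It assumes that leveneTest() in the car package centers on the median by default, which should be confirmed in the package documentation. The data and the group labels A, B and C are synthetic.

```r
# Synthetic data: three groups with clearly different spreads (NOT survey values).
set.seed(4)
grp <- factor(rep(c("A", "B", "C"), each = 20))
y   <- c(rnorm(20, sd = 1000), rnorm(20, sd = 9000), rnorm(20, sd = 2000))

# Absolute deviation of each observation from its own group median.
abs_dev <- abs(y - ave(y, grp, FUN = median))

# Levene's test is a one-way ANOVA on these absolute deviations.
levene_tab <- anova(lm(abs_dev ~ grp))
print(levene_tab)

# A small p-value is evidence against equal variances.
levene_p <- levene_tab[["Pr(>F)"]][1]
print(levene_p)
```

For the two-factor design in this report, the grouping factor would be the interaction of province and edulevel, so that each of the 6 cells forms its own group.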

After checking the requirements for two-way ANOVA, we see that only the first assumption is fully satisfied. Despite the violations of the other two assumptions, two-way ANOVA is still considered the most suitable method to test the interaction between place of residence and schooling level and the significant differences in education expenditure due to these two variables.

Question 4 Perform the inference technique you suggest in Question 1 What are your interpretations and conclusions? Explain

As mentioned in Question 1, the inference technique applied is two-way ANOVA. We use it to test for an interaction, if any, between place of residence and schooling level, and for significant differences in education expenditure due to the two variables Province and Edulevel. The steps required to perform this inference technique are given below.

Step 1: Set up hypotheses

H0: No change, no difference
Ha: Investigator's opinion

To test for any differences in Eduspend due to Province:
• H0: There is no effect of Province on Eduspend
• Ha: There is an effect of Province on Eduspend

To test for any differences in Eduspend due to Edulevel:
• H0: There is no effect of Edulevel on Eduspend
• Ha: There is an effect of Edulevel on Eduspend

To test whether there is interaction between Province and Edulevel:
• H0: There is no interaction between Province and Edulevel in Eduspend
• Ha: There is a significant interaction between Province and Edulevel in Eduspend

Step 2: Level of significance

The level of significance is α = 0.05.

Step 3: Test statistic and p-value

We run the two-way ANOVA test with Eduspend as the outcome variable and Province and Edulevel as the two factors. We are interested in the main effects of Province and Edulevel and in their interaction, so we use the formula eduspend ~ province*edulevel. The main and interaction effects of these two factors are then summarized with summary(), which reports the test statistics and p-values in the following output:

> education.result <- aov(eduspend ~ province*edulevel, data = education)
> summary(education.result)

                   Df     Sum Sq    Mean Sq  F value
province            2  5.521e+08  2.760e+08      4.…
edulevel            1  4.478e+09  4.478e+09     78.…
province:edulevel   2  1.057e+09  5.286e+08      9.…
Residuals         114  6.477e+09  5.682e+07
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Figure 10: Two-way ANOVA table
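Each F statistic in the table is the mean square of the term divided by the residual mean square, and the p-value is the upper tail of the F distribution with the corresponding degrees of freedom. The sketch below recomputes these quantities from the rounded degrees of freedom and mean squares printed in the table, so small rounding differences from R's own output are expected; the data frame anova_tab is built by hand for the illustration.

```r
# Degrees of freedom and mean squares as printed in the ANOVA table.
anova_tab <- data.frame(
  term    = c("province", "edulevel", "province:edulevel"),
  df      = c(2, 1, 2),
  mean_sq = c(2.760e+08, 4.478e+09, 5.286e+08)
)
resid_df      <- 114
resid_mean_sq <- 5.682e+07

# F = mean square of the term / residual mean square.
anova_tab$f_value <- anova_tab$mean_sq / resid_mean_sq

# p-value = P(F > observed) under the null hypothesis.
anova_tab$p_value <- pf(anova_tab$f_value, anova_tab$df, resid_df,
                        lower.tail = FALSE)

print(anova_tab)
```

For the interaction row this reproduces the statistic quoted in the report, F(2, 114) = 9.303 with p = 0.000181, up to rounding.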

The summary lists the independent variables being tested (Province and Edulevel), followed by the interaction between Province and Edulevel, and then the residual variance, which is the variation in the dependent variable that is not explained by the independent variables. The columns of the output table provide all of the information needed to interpret the model: degrees of freedom, sum of squares, mean sum of squares, test statistic (F-value) and p-value of the F-statistic. In particular, the two-way ANOVA gives a test statistic of 9.303 with a p-value of 0.000181 for the province:edulevel interaction (F(2, 114) = 9.303, p = 0.000181). Through the mean square,


Title: Expenditure On Education
Authors: Nguyễn Thị Nguyệt Hà, Trịnh Cẩm Lân, Nguyễn Trang Linh, Nguyễn Ngoc Mai, Nguyễn Thị Thanh Tuyệt, Hoàng Đức Huy
Supervisor: Ms. Lại Hoài Phương
School: University Of Management And Tourism
Major: Business and Economics Statistics
Type: Case Study
Year of publication: 2021
City: HANOI
