HANOI UNIVERSITY
FACULTY OF MANAGEMENT AND TOURISM
BUSINESS ECONOMICS STATISTIC
CASE STUDY REPORT
Tutor: Lai Hoai Phuong
Students:
Nguyen Cong Thuy Linh - 1904040065
Pham Diem Quynh - 1904040101
Pham Tuan Dat - 1904040026
Vu Thi Hau - 1807010107
Ta Thu Thuy - 1804040107
Nguyen Nhu Quynh - 1904050037
Hanoi, 19th October, 2021
Inference technique
Descriptive statistics for the dataset
Checking the assumption
Performing two-way ANOVA
Interpreting the interaction plot
The credibility of the interpretations and conclusions
PEER EVALUATION
A Scenario
The Vietnam Household Living Standards Survey (VHLSS) was conducted nationwide every two years to systematically monitor the living standards of Vietnam's societies. In 2018, the survey was carried out with a sample size of 46,995 households in 3,133 communes/wards, which were representative at national, regional, urban, rural and provincial levels. The household questionnaire contained many sections, each of which covered a separate aspect of household activities, and education was one important indicator. In the survey, household heads were asked to specify their place of residence (province), the schooling level of their children (edulevel) and expenditure on education per child for the past 12 months in thousands of VND (eduspend). The objective of our study is to test for any significant interaction between place of residence and schooling level, and to test for any significant differences in education expenditure due to these two variables. Use the 0.05 level of significance.
B Questions
Question 1: What inference technique should be considered for this study? Explain.
The purpose of this research is to see whether there is any significant interaction between place of residence and schooling level, and whether there are any significant differences in education spending due to these two variables.
Given these objectives, the two-way factorial analysis of variance (two-way ANOVA) is the most suitable inference technique. The reason is that two-way ANOVA can examine, at the same time, the effect of province, the effect of schooling level, and the interaction between the two factors on education expenditure.
As we use two-way ANOVA, we treat place of residence (province) and schooling level (edulevel) as the two factors and education expenditure (eduspend) as the outcome variable. The first factor, Province, has two levels: Hai Duong and Hai Phong. The second factor, Schooling level, has three levels: Nursery School, Primary School and Secondary school. As a result of these considerations, two-way ANOVA is the most suitable method for this study: it helps detect any interaction between province and schooling level and tests for any significant differences in eduspend related to these factors. RStudio, an integrated development environment for R (a programming language for statistical analysis and graphics), is used in this study.
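For reference, the model behind this choice can be written out as below. The notation is the usual textbook form and is an assumption here; the report itself does not state the model or name these symbols.

```latex
% Two-way factorial ANOVA model (textbook notation, assumed):
%   i = 1, 2     indexes province (Hai Duong, Hai Phong)
%   j = 1, 2, 3  indexes schooling level (Nursery, Primary, Secondary)
%   k            indexes households within a province/level group
\[
  y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
  \qquad \varepsilon_{ijk} \sim N(0, \sigma^2)
\]
```

Here $y_{ijk}$ is eduspend, $\alpha_i$ and $\beta_j$ are the main effects of province and schooling level, and $(\alpha\beta)_{ij}$ is the interaction term whose significance the study sets out to test.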
Question 2: Produce descriptive statistics for the dataset. You are expected to generate as many relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs of this course. Remember to provide appropriate interpretations for the descriptive statistics. Try not to include unnecessary or irrelevant descriptive statistics.
In this case, we use numerical and graphical methods to look for patterns in the dataset and summarize the information it reveals. We divide the data into six groups based on province and schooling level: (Hai Duong, Nursery school), (Hai Phong, Nursery School), (Hai Duong, Primary school), (Hai Phong, Primary school), (Hai Duong, Secondary School), (Hai Phong, Secondary school).
The numerical methods describe the sample sizes, central tendency and variation of the groups (a cross-tabulation table, group means and group standard deviations). The graphical method is a box plot of the groups.
1 Import casel.csv data frame into R and assign it to datal
Firstly, we set the working directory and import the file 'casel.csv' into R.
We use the R function str() to see the internal structure of the data. Its output tells us that there are 150 observations of four variables (observations, province, edulevel, eduspend), and it lists the first values of each variable in order.
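As a rough illustration of this import step, the sketch below writes a tiny stand-in CSV to a temporary path and reads it back. The file contents, the temporary path and the name `demo` are invented for the example; the report's own import call is not preserved in the text.

```r
# Hypothetical stand-in for the survey file: three made-up rows in a temp CSV.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "observations,province,edulevel,eduspend",
  "1,HaiDuong,Nursery School,1500",
  "2,HaiPhong,Primary School,7200",
  "3,HaiDuong,Secondary school,4300"
), csv_path)

# stringsAsFactors = TRUE stores the two grouping columns as factors,
# which is the form aov() and table() work with most naturally.
demo <- read.csv(csv_path, stringsAsFactors = TRUE)

# str() reports the number of observations, the variables and their types.
str(demo)
```

With the real file, the same two calls would be pointed at 'casel.csv' in the working directory.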
2 Cross Tabulation table between factors
Secondly, we make a cross tabulation. A cross-tabulation table between the province and edulevel variables gives the sample size of each group. From the tabulation below, we can see that the groups have equal sample sizes of 25 observations each.
> table(datal$province, datal$edulevel)
Nursery School Primary School Secondary school
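The balance check described above can be sketched on a toy design of the same shape (2 provinces, 3 schooling levels, 25 households per group). The data frame `toy` is constructed for illustration only and is not the survey data.

```r
# Toy balanced design: every province/level combination appears 25 times.
toy <- expand.grid(
  household = 1:25,
  province  = c("HaiDuong", "HaiPhong"),
  edulevel  = c("Nursery School", "Primary School", "Secondary school")
)

# Cross-tabulate the two factors to get the sample size of each group.
counts <- table(toy$province, toy$edulevel)
counts

# The design is balanced when every cell holds the same count.
all(counts == counts[1, 1])
```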
3 Means for groups
The code below gives the means of the six groups:
> #3 Means for groups
: HaiDuong : Nursery School
[1] 1913.88
: HaiPhong : Nursery School
[1] 4056.44
: HaiDuong : Primary School
[1] 2839.2
: HaiPhong : Primary School
[1] 7478
: HaiDuong : Secondary school
[1] 4516.88
: HaiPhong : Secondary school
[1] 10866.84
It can be seen that the largest mean education expenditure is 10866.84 thousand VND, belonging to Secondary school in Hai Phong. Nursery school in Hai Duong has the smallest mean education expenditure, 1913.88 thousand VND. Furthermore, Hai Phong spends more on education than Hai Duong at every level, from Nursery school to Secondary school.
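The comparison between provinces can be made explicit by arranging the reported means in a small table and subtracting one row from the other. The sketch below uses only the six means printed above; the object names `means` and `gap` are illustrative.

```r
# Group means as reported in the text (thousand VND per child).
means <- matrix(
  c(1913.88, 2839.20,  4516.88,
    4056.44, 7478.00, 10866.84),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    province = c("HaiDuong", "HaiPhong"),
    edulevel = c("Nursery School", "Primary School", "Secondary school")
  )
)

# Difference between the two provinces at each schooling level.
gap <- means["HaiPhong", ] - means["HaiDuong", ]
round(gap, 2)
```

The gap is positive at every level and grows from nursery to secondary school, which is the kind of pattern an interaction test is meant to examine formally.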
4 Standard deviation for groups
The standard deviation measures how far the data disperse from the mean. A higher standard deviation indicates a higher level of dispersion. We use the following R code to get the standard deviation of each group:
> by(datal$eduspend, list(datal$province,datal$edulevel), sd)
: HaiDuong : Nursery School
[1] 1070.98
: HaiPhong : Nursery School
[1] 2584.12
: HaiDuong : Primary School
[1] 1180.223
: HaiPhong : Primary School
[1] 6159.959
: HaiDuong : Secondary school
[1] 1040.453
: HaiPhong : Secondary school
[1] 6143.572
From the output above, Primary School in Hai Phong has the largest standard deviation of education expenditure, 6159.959. The smallest standard deviation is 1040.453, belonging to Secondary School in Hai Duong.
5 Box plot
R code:
> boxplot(eduspend ~ province + edulevel, xlab = "Groups", ylab="eduspend", data = datal, col = c("pink", "green", "skyblue","orange","red", "yellow"))
Output:
[Figure: box plots of eduspend for the six groups; x-axis: Groups, y-axis: eduspend]
Based on the graph, we can determine the range, shape, central tendency and variability of the distributions of education expenditure per child in the six groups: (Hai Duong, Nursery school), (Hai Phong, Nursery School), (Hai Duong, Primary school), (Hai Phong, Primary school), (Hai Duong, Secondary School), (Hai Phong, Secondary school).
Of the six groups, the yellow box plot has the widest range. This means that education expenditure at Secondary schools in Hai Phong shows the highest level of disparity and variability.
The black line in each box represents the median of the group. The shape of the distribution can be read from the position of this line. If the black line divides the box into two equal parts, the distribution is roughly symmetric; otherwise it is skewed to the right or to the left. In particular, the red box plot is symmetric, the green box plot is right-skewed and the blue one is left-skewed. In addition, there are some outliers, which are plotted as individual points. The orange box plot has 5 outliers above the box. Such extreme values pull the group mean upward and inflate its standard deviation, while the median is much less affected. This explains why the orange box itself is small even though the mean (7478) and the standard deviation (6159.959) of that group are high.
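The effect of outliers on the mean and median can be checked on a small example. The sample below is made up for illustration and has nothing to do with the survey values; `boxplot.stats()` is the base R helper that computes the numbers a box plot draws.

```r
# Made-up right-skewed sample with one extreme value.
x <- c(2, 3, 3, 4, 4, 5, 5, 6, 7, 30)

bs <- boxplot.stats(x)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values drawn as individual outlier points

# The single extreme value pulls the mean well above the median.
c(mean = mean(x), median = median(x))
```

In this toy sample only the value 30 is flagged as an outlier; the mean is 6.9 while the median stays at 4.5.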
Question 3: Check all the assumptions of the inference technique you suggest in Question 1. Are the assumptions satisfied? Explain.
As mentioned above, the two-way ANOVA test is the most suitable method for this case.
Before running the test, we check its assumptions, with eduspend (expenditure on education per child) as the outcome variable and two factors: edulevel (schooling level: Nursery, Primary, Secondary) and province (Hai Phong and Hai Duong). The method requires the following 3 assumptions:
- Samples are independent, simple random samples of size $n_j$ from each of k (= ab = 6) populations
- All populations are normally distributed
- All populations have the same standard deviation $\sigma_{11} = \sigma_{12} = \dots = \sigma_{ab} = \sigma$
Firstly, we check the independence assumption. Comparing the two factors, there is no relationship between them, because neither is affected by the other. We can therefore treat eduspend in the different schooling levels and provinces as independent. The total sample size is 150 observations, divided equally into 6 samples of 25 observations each, collected randomly from household heads in Hai Phong and Hai Duong at 3 levels of education. Consequently, in this case, the samples are independent, simple random samples of size $n_j$ from each of 6 populations defined by the 2 levels of factor province and the 3 levels of factor edulevel (k = 2*3 = 6):
- HaiDuong - Nursery school
- HaiDuong - Primary school
- HaiDuong - Secondary school
- HaiPhong - Nursery school
- HaiPhong - Primary school
- HaiPhong - Secondary school
table(datal$province, datal$edulevel)
[...] =("brown"), labels=F)
[1] 105 131
[Figure: Q-Q Plot]
[...] differences, there are also some points that lie far from the straight line, but they are few compared with the overall distribution. The other points lie close to the straight line. As a result, we conclude that the populations are approximately normally distributed, which meets the assumption for running the ANOVA test.
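One common way to carry out such a normality check is sketched below with base R only, on simulated data. This is not the report's own plotting call, which survives only as a fragment; the data frame `toy`, the simulated eduspend values and the use of `shapiro.test()` as a numerical companion to the plot are all assumptions made for the example.

```r
set.seed(1)

# Simulated data with the same layout as the study (values are NOT survey data).
toy <- expand.grid(
  household = 1:25,
  province  = c("HaiDuong", "HaiPhong"),
  edulevel  = c("Nursery School", "Primary School", "Secondary school")
)
toy$eduspend <- rnorm(nrow(toy), mean = 5000, sd = 1500)

# Fit the two-way model and take its residuals: the normality assumption
# concerns the deviations of observations from their own group mean.
fit <- aov(eduspend ~ province * edulevel, data = toy)
res <- residuals(fit)

# Q-Q plot of the residuals against the normal distribution.
qqnorm(res, main = "Q-Q Plot")
qqline(res, col = "brown")

# Shapiro-Wilk test: a small p-value would cast doubt on normality.
shapiro.test(res)
```

Points that follow the reference line support normality; a few points straying at the ends, as described above, are usually tolerated with samples of this size.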
Finally, we check the last assumption: whether all populations have the same standard deviation $\sigma_{11} = \sigma_{12} = \dots = \sigma_{ab} = \sigma$. Based on the standard deviations of the groups, we divide the greatest standard deviation (HaiPhong, primary school) by the smallest standard deviation (HaiDuong, secondary school), and the ratio is equal to 5.920459.
: HaiPhong : Primary School
[1] 6159.959
Hence, the standard deviations of the samples are concluded to be not equal. However, we will assume that all populations have the same standard deviation $\sigma_{11} = \sigma_{12} = \dots = \sigma_{ab} = \sigma$ in order to continue running the ANOVA test.
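The ratio quoted above can be reproduced from the six standard deviations reported in Question 2. The threshold of 2 used in the last line is a common textbook rule of thumb and is an assumption here, since the report does not say which cut-off it applied.

```r
# Group standard deviations as reported earlier in the report.
sds <- c(
  HaiDuong_Nursery   = 1070.98,
  HaiPhong_Nursery   = 2584.12,
  HaiDuong_Primary   = 1180.223,
  HaiPhong_Primary   = 6159.959,
  HaiDuong_Secondary = 1040.453,
  HaiPhong_Secondary = 6143.572
)

# Ratio of the largest to the smallest standard deviation.
ratio <- max(sds) / min(sds)
ratio

# Assumed rule of thumb: treat the standard deviations as roughly equal
# only when the ratio is below 2.
ratio < 2
```

Because the ratio is close to 6, the rule of thumb is clearly not met, in line with the conclusion in the text.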
Question 4: Perform the inference technique you suggest in Question 1. Remember to provide all the necessary steps. What are your interpretations and conclusions? Explain.
Step 1: Hypotheses
- To test whether there are differences in Education expenditure due to Province (1)
Ho: The means of Province groups are equal
Ha: The means of Province groups are different
- To test whether there are differences in Education expenditure due to Schooling levels (2)
Ho: The means of Schooling levels are equal
Ha: The means of Schooling levels are different
- To test whether there is interaction between Province and Schooling levels (3)
Ho: There is no significant interaction between Province and Schooling levels in Education expenditure
Ha: There is a significant interaction between Province and Schooling levels in Education expenditure
Step 2: Checking assumption
- The samples are independent, simple random samples
- All populations are assumed to be normally distributed
- All populations are assumed to have the same variance
The details of the above assumptions have been stated in Question 3. The equal-variance assumption is not satisfied, but we still use two-way ANOVA (the reason for using two-way ANOVA will be explained in the next section).
Step 3: Test statistic and p-value
We use RStudio to calculate the test statistics and p-values. Here is the R output for the two-way ANOVA test:
R code:
datal.result<-aov(eduspend~province*edulevel, data= datal)
summary(datal.result)
> #Question 4
> datal.result<-aov(eduspend~province*edulevel, data= datal)
> summary(datal.result)
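To show what this call returns and how the p-values can be read from it, the sketch below fits the same model formula to simulated data. The data frame `toy` and its eduspend values are invented, so the numbers it prints are not the study's results.

```r
set.seed(42)

# Simulated balanced data with the study's layout (values are NOT survey data).
toy <- expand.grid(
  household = 1:25,
  province  = c("HaiDuong", "HaiPhong"),
  edulevel  = c("Nursery School", "Primary School", "Secondary school")
)
toy$eduspend <- rnorm(nrow(toy), mean = 5000, sd = 1500)

# province * edulevel expands to both main effects plus their interaction.
fit <- aov(eduspend ~ province * edulevel, data = toy)

# summary() returns a list; its first element is the ANOVA table.
tab <- summary(fit)[[1]]
tab

# One p-value per term; the Residuals row has none (NA).
p_values <- tab[["Pr(>F)"]]
names(p_values) <- trimws(rownames(tab))
p_values
```

The table has one row each for province, edulevel, province:edulevel and Residuals, with 1, 2, 2 and 144 degrees of freedom for a design of this size.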
Step 4: Level of significance: α = 0.05
Step 5: Decision rule
We use the p-value approach to make decisions. Hence, we reject Ho if the p-value < α.
Based on the R output, we find p-value = 0.0223.
- P-value = 0.0223 < α = 0.05
- P-value = 5.85e-11 < α = 0.05
In each case the p-value is below α, so the corresponding Ho is rejected.
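The decision rule amounts to a direct comparison, sketched below with the two p-values quoted above. The text as preserved does not say which test each p-value belongs to, so the sketch leaves them unlabelled.

```r
alpha <- 0.05

# p-values quoted in the text.
p_values <- c(0.0223, 5.85e-11)

# Reject Ho whenever the p-value falls below the significance level.
reject <- p_values < alpha
data.frame(p_value = p_values, reject_Ho = reject)
```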
Question 5: Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusions made in Question 4?
R code:
> #Question 5
> interaction.plot(datal$province, datal$edulevel, datal$eduspend, type="b", col=c("orange","red","blue"), pch =c(16,18,15), main="Interaction between province and edulevel")
Output:
The schooling expenditure differs markedly between the provinces. In general, expenditure at all three schooling levels is much higher in Hai Phong than in Hai Duong. Spending on Secondary school in Hai Phong is the highest, at more than 10,000,000 VND per child for the past 12 months, followed by spending on Primary school in Hai Phong and on Secondary school in Hai Duong. As we can see, the lines are not parallel, which indicates an interaction between Schooling level and Province. Spending by Hai Phong households differs substantially across schooling levels, while the differences in Hai Duong are smaller. Furthermore, although the Hai Duong