
HANOI UNIVERSITY FACULTY OF MANAGEMENT AND TOURISM

BUSINESS ECONOMICS STATISTIC

CASE STUDY REPORT

Tutor: Lai Hoai Phuong

Students:

Nguyen Cong Thuy Linh - 1904040065
Pham Diem Quynh - 1904040101
Pham Tuan Dat - 1904040026
Vu Thi Hau - 1807010107
Ta Thu Thuy - 1804040107
Nguyen Nhu Quynh - 1904050037

Hanoi, 19th October, 2021

Inference technique
Descriptive statistics for the dataset
Checking the assumption
Performing two-way ANOVA
Interpreting the interaction plot
The credibility of the interpretations and conclusions
PEER EVALUATION

A. Scenario

The Vietnam Household Living Standards Survey (VHLSS) is conducted nationwide every two years to systematically monitor the living standards of Vietnamese society. In 2018, the survey was carried out with a sample of 46,995 households in 3,133 communes/wards, which were representative at national, regional, urban, rural and provincial levels. The household questionnaire contained many sections, each covering a separate aspect of household activities, and education was one important indicator. In the survey, household heads were asked to specify their place of residence (province), the schooling level of their children (edulevel) and the expenditure on education per child for the past 12 months in thousands of VND (eduspend). The objective of our study is to test for any significant interaction between place of residence and schooling level and to test for any significant differences in education expenditure due to these two variables. Use the 0.05 level of significance.

B. Questions

Question 1: What inference technique should be considered for this study? Explain.

The purpose of this study is to determine whether there is a significant interaction between place of residence and schooling level, and whether education spending differs significantly because of these two variables.

Given these objectives, two-way factorial analysis of variance (two-way ANOVA) is the most suitable inference technique, because it examines the effects of province and schooling level on education expenditure, together with the interaction between the two factors, at the same time.

In the two-way ANOVA, we treat place of residence (province) and schooling level (edulevel) as the two factors and education expenditure (eduspend) as the outcome variable. The first factor, Province, has two levels: Hai Duong and Hai Phong. The second factor, Schooling level, has three levels: Nursery School, Primary School and Secondary school. Two-way ANOVA therefore lets us assess the interaction between province and schooling level and test for any significant differences in eduspend related to these factors. R, a programming language for statistical analysis and graphics, is used in this study through RStudio.

Question 2: Produce descriptive statistics for the dataset. You are expected to generate as many relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs of this course. Remember to provide appropriate interpretations for the descriptive statistics. Try not to include unnecessary or irrelevant descriptive statistics.

In this case, we use numerical and graphical methods to look for patterns in the dataset and summarize the information it contains. We divide the data into six groups based on province and schooling level: (Hai Duong, Nursery school), (Hai Phong, Nursery School), (Hai Duong, Primary school), (Hai Phong, Primary school), (Hai Duong, Secondary School), (Hai Phong, Secondary school).

The numerical methods describe the central tendency and variation of each group (means and standard deviations) and tabulate the data in a cross-tabulation table. The graphical method displays the distribution of each group with box plots.

1 Import case1.csv into R and assign it to data1

Firstly, we set the working directory and import the file 'case1.csv' into R as the data frame data1.

We then use str() to inspect the internal structure of the data. The output shows 150 observations of four variables (observations, province, edulevel, eduspend) and lists the first few values of each variable in order.
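A minimal sketch of the import step, assuming the file has the four columns named above. The three rows and the temporary file are made-up placeholders, not the survey data; with the real file one would instead call setwd() on the folder that holds case1.csv and read that file directly.

```r
# Hypothetical stand-in for case1.csv: three made-up rows in a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "observations,province,edulevel,eduspend",
  "1,HaiDuong,Nursery School,1500",
  "2,HaiPhong,Primary School,7200",
  "3,HaiPhong,Secondary school,10900"
), csv_path)

# Read the file; text columns become factors so they can act as ANOVA factors.
data1 <- read.csv(csv_path, stringsAsFactors = TRUE)

# Show the number of observations, the variable types and the first values.
str(data1)
```

Reading province and edulevel as factors is a design choice that keeps the later table(), by() and aov() calls working without further conversion.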

2 Cross tabulation between factors

Secondly, we make a cross tabulation. A cross-tabulation of the province and edulevel variables gives the sample size of each group. From the table below, we can see that the groups are of equal size, with 25 observations each.

> table(data1$province, data1$edulevel)

           Nursery School Primary School Secondary school
  HaiDuong             25             25               25
  HaiPhong             25             25               25
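The balanced layout can be reproduced without the expenditure values. The sketch below rebuilds only the two factor columns, with 25 rows for each province and schooling level combination as stated above, and tabulates them. The object names `design` and `tab` are assumptions made for illustration.

```r
# Rebuild the 2 x 3 design with 25 replicates per cell (factor columns only).
design <- expand.grid(
  replicate = 1:25,
  province  = c("HaiDuong", "HaiPhong"),
  edulevel  = c("Nursery School", "Primary School", "Secondary school")
)

# Cross-tabulate the two factors to get the sample size of each group.
tab <- table(design$province, design$edulevel)
print(tab)

# A balanced design has the same count in every cell.
print(length(unique(as.vector(tab))) == 1)
```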

3 Means for groups

The code below gives the means of the six groups:

> #3 Means for groups
> by(data1$eduspend, list(data1$province, data1$edulevel), mean)

: HaiDuong : Nursery School

[1] 1913.88

: HaiPhong : Nursery School

[1] 4056.44

: HaiDuong : Primary School

[1] 2839.2

: HaiPhong : Primary School

[1] 7478

: HaiDuong : Secondary school

[1] 4516.88

: HaiPhong : Secondary school

[1] 10866.84

The largest mean education expenditure is 10,866.84 thousand VND, for Secondary school in Hai Phong. Nursery school in Hai Duong has the smallest mean education expenditure, 1,913.88 thousand VND. Furthermore, Hai Phong spends more on education than Hai Duong at every schooling level, from Nursery school to Secondary school.
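The comparison between the provinces can be checked directly from the six reported means. The sketch below stores them in a matrix and subtracts the Hai Duong row from the Hai Phong row; the object names `group_means` and `gap` are assumptions, and the numbers are the means listed above, in thousands of VND.

```r
# Reported group means (thousand VND), filled column by column:
# each column is one schooling level, rows are HaiDuong then HaiPhong.
group_means <- matrix(
  c(1913.88, 4056.44,
    2839.20, 7478.00,
    4516.88, 10866.84),
  nrow = 2,
  dimnames = list(
    province = c("HaiDuong", "HaiPhong"),
    edulevel = c("Nursery School", "Primary School", "Secondary school")
  )
)

# Difference in mean spending between the provinces at each schooling level.
gap <- group_means["HaiPhong", ] - group_means["HaiDuong", ]
print(gap)

# TRUE when Hai Phong spends more at every level.
print(all(gap > 0))
```

The gap is 2142.56 at Nursery School, 4638.80 at Primary School and 6349.96 at Secondary school, so it widens as the schooling level rises.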

4 Standard deviation for groups

The standard deviation measures how far the data are dispersed from the mean. A higher standard deviation indicates greater dispersion of the data. We use the following R code to get the standard deviation of each group:

> by(data1$eduspend, list(data1$province, data1$edulevel), sd)

: HaiDuong : Nursery School

[1] 1070.98

: HaiPhong : Nursery School

[1] 2584.12

: HaiDuong : Primary School

[1] 1180.223

: HaiPhong : Primary School

[1] 6159.959

: HaiDuong : Secondary school

[1] 1040.453

: HaiPhong : Secondary school

[1] 6143.572

From the output above, Primary School in Hai Phong has the largest standard deviation of education expenditure, 6159.959. The smallest standard deviation, 1040.453, belongs to Secondary school in Hai Duong.

5 Box plot

R code:

> boxplot(eduspend ~ province * edulevel, xlab = "Groups", ylab = "eduspend", data = data1, col = c("pink", "green", "skyblue", "orange", "red", "yellow"))

Output:

[Figure: box plots of eduspend for the six groups; x-axis "Groups", y-axis "eduspend"]

Based on the graph, we can determine the range, shape, central tendency and variability of the distributions of education expenditure per child in the six groups: (Hai Duong, Nursery school), (Hai Phong, Nursery School), (Hai Duong, Primary school), (Hai Phong, Primary school), (Hai Duong, Secondary School), (Hai Phong, Secondary school).

Among the six groups, the yellow box plot has the widest range. This means that the education expenditure recorded for Secondary schools in Hai Phong has the highest level of disparity and variability.

The black line in each box plot represents the median of the group, which indicates its central tendency. If the black line divides the box into two equal parts, the distribution is roughly symmetric; otherwise it is skewed to the right or to the left. In particular, the red box plot is symmetric, the green box plot is right skewed and the blue one is left skewed. In addition, there are some outliers, which are plotted as individual points. Outliers above the box, as in the orange box plot with 5 outliers, pull the group mean upward, whereas the median is resistant to them. This is consistent with the orange group (Hai Phong, Primary school): its box is narrow, yet its mean (7478) and standard deviation (6159.959) are high.

Question 3: Check all the assumptions of the inference technique you suggest in Question 1. Are the assumptions satisfied? Explain.

As mentioned above, the two-way ANOVA test is the most suitable method for this case. Before running the test, we check its assumptions, with eduspend (expenditure on education per child) as the outcome variable and two factors: edulevel (schooling level: Nursery, Primary, Secondary) and province (Hai Phong and Hai Duong). The method requires the following three assumptions:

• The samples are independent simple random samples of size $n_j$ from each of the $k$ ($= ab = 6$) populations
• All populations are normally distributed
• All populations have the same standard deviation: $\sigma_{11} = \sigma_{12} = \dots = \sigma_{ab} = \sigma$

Firstly, we check the first assumption. The two factors are compared, and there is no relationship between them because neither is affected by the other. We can therefore conclude that the eduspend samples across schooling levels and provinces are independent. The total sample of 150 observations is divided equally into 6 samples of 25 observations each, collected randomly from household heads in Hai Phong and Hai Duong at 3 levels of education. Consequently, the samples are independent simple random samples of size $n_j$ from each of 6 populations, formed by the 2 levels of the factor province and the 3 levels of the factor edulevel ($k = 2 \times 3 = 6$):

- HaiDuong - Nursery school
- HaiDuong - Primary school
- HaiDuong - Secondary school
- HaiPhong - Nursery school
- HaiPhong - Primary school
- HaiPhong - Secondary school

> table(data1$province, data1$edulevel)

=("brown"), labels=F)
[1] 105 131

Q-Q Plot

differences, there are also some points that lie far from the straight line, but they are few compared with the overall distribution. The other points lie close to the straight line. As a result, we conclude that the populations are approximately normally distributed, which meets the assumption required to run the ANOVA test.
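One common way to draw such a plot is on the residuals of the fitted two-way model. The sketch below is an assumption about the approach, not the report's own code: it uses base R's qqnorm() and qqline(), and the data frame `sim` holds simulated placeholder values rather than the survey data.

```r
set.seed(1)

# Simulated placeholder data with the same 2 x 3 x 25 layout as the study.
sim <- expand.grid(
  replicate = 1:25,
  province  = c("HaiDuong", "HaiPhong"),
  edulevel  = c("Nursery School", "Primary School", "Secondary school")
)
sim$eduspend <- rnorm(nrow(sim), mean = 5000, sd = 1500)  # made-up values

# Fit the two-way model and take its residuals.
fit <- aov(eduspend ~ province * edulevel, data = sim)
res <- residuals(fit)

# Draw to a null device when run non-interactively (e.g. with Rscript).
if (!interactive()) pdf(NULL)

# Points close to the reference line suggest approximately normal residuals.
qqnorm(res, main = "Q-Q Plot")
qqline(res, col = "brown")

if (!interactive()) invisible(dev.off())
```

Checking the residuals rather than the raw eduspend values removes the differences between group means, so all six groups can be judged against a single reference line.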

Finally, we check the last assumption, that all populations have the same standard deviation: $\sigma_{11} = \sigma_{12} = \dots = \sigma_{ab} = \sigma$. Using the group standard deviations, we divide the largest standard deviation (HaiPhong, Primary school) by the smallest standard deviation (HaiDuong, Secondary school); the ratio is equal to 5.920459.

: HaiPhong : Primary School [1] 6159.959

Hence, because this ratio is far above the usual rule-of-thumb limit of 2, the sample standard deviations cannot be regarded as equal. However, we will assume that all populations have the same standard deviation, $\sigma_{11} = \sigma_{12} = \dots = \sigma_{ab} = \sigma$, in order to continue running the ANOVA test.
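The ratio can be recomputed from the six standard deviations reported in Question 2. The sketch below is illustrative: the vector name `group_sd` is an assumption, and the cut-off of 2 is the common rule of thumb for this check rather than a threshold stated in the report.

```r
# Group standard deviations as reported in Question 2.
group_sd <- c(
  "HaiDuong, Nursery School"   = 1070.98,
  "HaiPhong, Nursery School"   = 2584.12,
  "HaiDuong, Primary School"   = 1180.223,
  "HaiPhong, Primary School"   = 6159.959,
  "HaiDuong, Secondary school" = 1040.453,
  "HaiPhong, Secondary school" = 6143.572
)

# Ratio of the largest to the smallest standard deviation.
sd_ratio <- max(group_sd) / min(group_sd)
print(names(group_sd)[which.max(group_sd)])  # group with the largest spread
print(names(group_sd)[which.min(group_sd)])  # group with the smallest spread
print(sd_ratio)

# Assumed rule of thumb: treat the spreads as comparable only if ratio < 2.
print(sd_ratio < 2)
```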

Question 4: Perform the inference technique you suggest in Question 1. Remember to provide all the necessary steps. What are your interpretations and conclusions? Explain.

Step 1: Hypotheses

- To test whether there are differences in education expenditure due to Province (1)
H0: The means of the Province groups are equal
Ha: The means of the Province groups are different

- To test whether there are differences in education expenditure due to Schooling levels (2)
H0: The means of the Schooling levels are equal
Ha: The means of the Schooling levels are different

- To test whether there is interaction between Province and Schooling levels (3)
H0: There is no significant interaction between Province and Schooling levels in education expenditure
Ha: There is a significant interaction between Province and Schooling levels in education expenditure
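One conventional way to write the model behind these three pairs of hypotheses is sketched below. The symbols $\mu$, $A_i$, $B_j$, $(AB)_{ij}$ and $\varepsilon_{ijk}$ are notation assumed here for illustration and do not come from the report; $i$ indexes the province, $j$ the schooling level and $k$ the observation within a group.

```latex
\begin{align*}
x_{ijk} &= \mu + A_i + B_j + (AB)_{ij} + \varepsilon_{ijk},
  \qquad i = 1, 2;\quad j = 1, 2, 3;\quad k = 1, \dots, 25 \\[4pt]
H_0^{(1)} &:\ A_1 = A_2 = 0 \\
H_0^{(2)} &:\ B_1 = B_2 = B_3 = 0 \\
H_0^{(3)} &:\ (AB)_{ij} = 0 \ \text{for all } i, j
\end{align*}
```

In each case the alternative hypothesis is that at least one of the listed terms is not zero.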

Step 2: Checking assumptions

• The samples are independent simple random samples
• All populations are normally distributed
• All populations are assumed to have the same variance

The details of the above assumptions are given in Question 3. The equal-variance assumption is not satisfied, but we still use two-way ANOVA (the reason for using two-way ANOVA will be explained in the next section).

Step 3: Test statistic and p-value

We use RStudio to calculate the test statistics and p-values. The R code for the two-way ANOVA test is:

R code:

> #Question 4
> data1.result <- aov(eduspend ~ province * edulevel, data = data1)
> summary(data1.result)
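As a rough illustration of what this call returns, the sketch below runs the same model on simulated data. The data frame `sim` is a placeholder drawn from normal distributions with the group means and standard deviations reported in Question 2, so its F statistics and p-values will not match those of the real survey data.

```r
set.seed(2021)

# One row per group, in the order used in Question 2 (province varies fastest).
cells <- expand.grid(
  province = c("HaiDuong", "HaiPhong"),
  edulevel = c("Nursery School", "Primary School", "Secondary school")
)
cells$mean <- c(1913.88, 4056.44, 2839.20, 7478.00, 4516.88, 10866.84)
cells$sd   <- c(1070.98, 2584.12, 1180.223, 6159.959, 1040.453, 6143.572)

# Simulate 25 placeholder observations per group (not the survey data).
sim <- cells[rep(seq_len(nrow(cells)), each = 25), c("province", "edulevel")]
sim$eduspend <- rnorm(nrow(sim),
                      mean = rep(cells$mean, each = 25),
                      sd   = rep(cells$sd, each = 25))

# Two-way ANOVA with interaction: province, edulevel, province:edulevel.
fit <- aov(eduspend ~ province * edulevel, data = sim)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Pull out the three p-values by name for use in the decision rule.
p_values <- anova_table[["Pr(>F)"]][1:3]
names(p_values) <- trimws(rownames(anova_table)[1:3])
print(p_values)
```

The table has one row for each main effect, one for the interaction and one for the residuals, with 1, 2, 2 and 144 degrees of freedom for this design.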

Step 4: Level of significance: α = 0.05

Step 5: Decision rule

We use the p-value approach to make decisions. Hence, we reject H0 if the p-value < α.

Based on the R output, we find p-value = 0.0223

• P-value = 0.0223 < α = 0.05
• P-value = 5.85e-11 < α = 0.05
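The decision rule is a simple comparison, sketched below for the two p-values quoted above. The names `alpha`, `p_reported` and `reject_h0` are assumptions; the sketch does not say which effect each p-value belongs to, since that depends on the ANOVA table.

```r
# Significance level and the two p-values quoted in the text.
alpha      <- 0.05
p_reported <- c(0.0223, 5.85e-11)

# Reject H0 whenever the p-value is below alpha.
reject_h0 <- p_reported < alpha
print(data.frame(p_value = p_reported, reject_h0 = reject_h0))
```

Both comparisons come out TRUE, so the corresponding null hypotheses are rejected at the 0.05 level.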

Question 5: Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusions made in Question 4?

R code:

> #Question 5
> interaction.plot(data1$province, data1$edulevel, data1$eduspend, type="b", col=c("orange","red","blue"), pch=c(16,18,15), main="Interaction between province and edulevel")

Output:

Notably, schooling expenditure differs considerably between the provinces. In general, the expenditure at all three schooling levels is much higher in Hai Phong than in Hai Duong. Spending on Secondary school in Hai Phong is the highest, at more than 10,000,000 VND per child for the past 12 months, followed by expenditure on Primary school in Hai Phong and on Secondary school in Hai Duong. As we can see, the lines are not parallel, which indicates an interaction between Schooling levels and Provinces. The spending of Hai Phong households differs markedly across schooling levels, while the differences in Hai Duong are smaller. Furthermore, although the Hai Duong

Title	Case Study Report
Authors	Nguyen Cong Thuy Linh, Pham Diem Quynh, Pham Tuan Dat, Vu Thi Hau, Ta Thu Thuy, Nguyen Nhu Quynh
Supervisor	Lai Hoai Phuong
Institution	Hanoi University
Major	Business Economics Statistic
Type	Case Study Report
Year of publication	2021
City	Hanoi
