business and economics statistics academic performance of university students


HANOI UNIVERSITY FACULTY OF MANAGEMENT AND TOURISM

Business and Economics Statistics

Academic Performance of University Students


I. Scenario

We are FMT students of Hanoi University. Our team is conducting research on the relationship between school level (Secondary school and University) and expenditure on education for students in three provinces: Ha Noi, Nam Dinh, and Hai Duong. Education expenditure is measured in Vietnamese dong. The objective of our study is to find out whether there is a significant interaction between Province and Level of Education, and to examine whether there are significant differences in education expenditure across these two variables. We use a 0.05 level of significance.

The sample data are provided in case3.csv, a dataset consisting of 120 observations.

II. Answering questions

Question 1: What inference technique should be considered for this study? Explain.

There are 4 variables in the given dataset. From this data frame, we can see that province and edulevel are the two independent variables, and eduspend is the dependent variable. "Province" has 3 levels and "edulevel" has 2 levels, so both are categorical variables (factors). In contrast, eduspend is the numerical outcome variable. A two-way ANOVA test is therefore the most suitable way to examine the main effects of the two independent variables on the dependent variable, as well as their interaction, so it should be applied in this study.

Question 2: Produce descriptive statistics for the dataset. You are expected to generate as many relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs of this course. Remember to provide appropriate interpretations for the descriptive statistics. Try not to include unnecessary or irrelevant descriptive statistics.

Descriptive statistics are used to describe the basic features of the data in a study. They provide an overview and summary measures of the research sample. This section also includes graphical analysis of the quantitative data.

1. Import the Case 3.csv data frame into R and assign it to VHLSS

After setting up the working directory, we import the file into R.

R code:

VHLSS <- read.table("Case 3.csv", header=TRUE, sep=",", quote="\"", stringsAsFactors=FALSE)

2. Cross-tabulation table between factors

There are 120 observations in this case, so it is convenient to use the head() function to preview the data.


R code:

head(VHLSS)

Output:

  obs province   edulevel eduspend
1   1 HaiDuong University    16490
2   2 HaiDuong University    25800
3   3 HaiDuong University    29570
4   4 HaiDuong University    35000
5   5 HaiDuong University    16970
6   6 HaiDuong University    11600

We can also inspect the data structure with the following R code:

R code:

VHLSS$province <- factor(VHLSS$province, levels=c("HaiDuong","Hanoi","NamDinh"))
VHLSS$edulevel <- factor(VHLSS$edulevel, levels=c("Secondary school","University"))
table(VHLSS$province, VHLSS$edulevel)

Output: the table gives the sample size (n=20) for each stratum, with provinces in rows and the columns Secondary school and University.

3. Mean for groups

R code:

by(VHLSS$eduspend, list(VHLSS$province, VHLSS$edulevel), mean)

Output:

: HaiDuong : Secondary school
[1] 4334.6

: Hanoi : Secondary school
[1] 15234.6

: NamDinh : Secondary school
[1] 5492.45

: HaiDuong : University

4. Standard deviation for groups

R code:

by(VHLSS$eduspend, list(VHLSS$province, VHLSS$edulevel), sd)

Output:

: HaiDuong : Secondary school
[1] 1012.826

: Hanoi : Secondary school
[1] 8817.526

: NamDinh : Secondary school
[1] 2161.205

: HaiDuong : University
[1] 9902.959

: Hanoi : University
[1] 9847.818

: NamDinh : University
[1] 7899.908

5. Median for groups

R code:

by(VHLSS$eduspend, list(VHLSS$province, VHLSS$edulevel), median)

Output:

: HaiDuong : Secondary school
[1] 4268

: Hanoi : Secondary school
[1] 15095

: NamDinh : Secondary school
[1] 4605

: HaiDuong : University
[1] 16950
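The per-group statistics in this question all follow one pattern: split eduspend by the two factors and apply a function to each cell. A minimal sketch of that pattern is given here, assuming a toy data frame. The values are made up and are not the Case 3.csv data; only the column names match.

```r
# Toy data: NOT the Case 3.csv values, only the same column layout.
toy <- data.frame(
  province = factor(rep(c("HaiDuong", "Hanoi", "NamDinh"), each = 4)),
  edulevel = factor(rep(c("Secondary school", "University"), times = 6)),
  eduspend = c(4000, 16000, 4500, 17000,
               15000, 19000, 15500, 20000,
               5000, 22000, 5500, 23000)
)

# One summary() per province x edulevel cell (min, quartiles, mean, max).
cell_summary <- by(toy$eduspend, list(toy$province, toy$edulevel), summary)
print(cell_summary)

# The same cell means as a 3 x 2 matrix, which is easier to scan.
cell_means <- tapply(toy$eduspend, list(toy$province, toy$edulevel), mean)
print(cell_means)
```

Replacing summary with mean, sd or median in the by() call gives the other per-group statistics.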


6. Summary for groups

Output:

: HaiDuong : Secondary school
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2230    3656    4268    4335    4886    6135

: Hanoi : Secondary school

: NamDinh : Secondary school

: Hanoi : University
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5000   12700   17075   19157   28285   39600

: NamDinh : University
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10050   17498   21155   22978   27280   38275

7. Graphical description

In this situation, each group has about 20 observations, so we find it more suitable to use a boxplot to compare the six groups. Here is the code and output for the boxplot.

R code:

boxplot(eduspend ~ interaction(province, edulevel), data = VHLSS, xlab = "Province & Schooling level", ylab = "Expenditure on education (Thousands of VND)", col = c("green", "red", "purple", "yellow", "brown", "pink"))

[Figure: boxplot of eduspend for the six groups; x-axis: Province & Schooling level]

The black line in the middle of each box represents the median of each group. There are also two white dots, which represent outliers (higher expenditure). We can additionally use a mean plot to compare the differences between groups.

R code:

install.packages("gplots")
library(gplots)
plotmeans(eduspend ~ interaction(province, edulevel), data = VHLSS, xlab = "Province & Schooling level", ylab = "Expenditure on education (Thousands of VND)", main = "Mean Plot with 95% CI")

[Figure: Mean Plot with 95% CI; x-axis: Province & Schooling level]

The mean plot shows all six groups with 95% confidence intervals. The NamDinh University group has the largest mean. In addition, the means of the six groups vary, which motivates testing for differences with the two-way ANOVA.
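To make the 95% interval around a group mean concrete, the sketch below computes one by hand for the HaiDuong / Secondary school group, using the mean and standard deviation reported earlier in this question. It assumes a t-based interval with n = 20 in the group; whether the plotting function draws exactly this interval is an assumption here.

```r
# Reported values for the HaiDuong / Secondary school group (Question 2).
group_mean <- 4334.6
group_sd <- 1012.826
n <- 20  # sample size per stratum, as stated with the cross-tabulation

# ASSUMPTION: a t-based 95% confidence interval for the group mean.
half_width <- qt(0.975, df = n - 1) * group_sd / sqrt(n)
ci <- c(lower = group_mean - half_width, upper = group_mean + half_width)
print(ci)
```

Groups with larger standard deviations, such as the University groups, would get proportionally wider intervals under the same formula.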

Question 3: Check all the assumptions of the inference technique you suggest in Question 1. Are the assumptions satisfied? Explain.

In Question 1 we suggested the two-way ANOVA test as the most suitable method for this case study. Before conducting the test, we have to check all the assumptions of the inference technique, with expenditure on education as the outcome variable and two independent factors: place of residence (Hai Duong, Hanoi and Nam Dinh) and schooling level (Secondary School and University). There are three assumptions to consider:

• The samples are independent, simple random samples of size nij from each of the k (= a*b) populations.
• All populations are normally distributed.
• All populations have the same variance.

Firstly, we check assumption 1. Comparing the 2 factors, Province and Level of Education, we find no relation between a student's Province and their Level of Education, because students' answers differ and are not influenced by other elements. Therefore, it can be concluded that the Level of Spending for students from Hai Duong, Hanoi and Nam Dinh and their Level of Education are independent. Moreover, the data were collected from 120 observations. Choosing individuals from one "province" group does not affect who is chosen from an "edulevel" group, and vice versa. As a result, the samples are simple random samples of size nij from each of the k (= a*b) populations. The R cross-tabulation of the province and schooling level variables gives the sample sizes:

R code:

table(VHLSS$province, VHLSS$edulevel)

Output: a table of counts with provinces in rows and the columns Secondary school and University.

Secondly, we check whether the populations are normally distributed, using a Q-Q plot:

R code:

library(car)
qqPlot(lm(eduspend ~ province + edulevel + province*edulevel, data=VHLSS), simulate=TRUE, main="Q-Q Plot", labels=FALSE)

We then have the result of a Q-Q plot:
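As a numerical complement to the visual check, one could run a Shapiro-Wilk test on the model residuals. This is not part of the original analysis. The sketch below uses simulated toy data with the same column names (not the Case 3.csv values) and a base-R Q-Q plot instead of the car package.

```r
# Toy data: simulated, NOT the Case 3.csv values; 20 rows per cell, 120 in total.
set.seed(1)
toy <- expand.grid(
  rep = 1:20,
  province = c("HaiDuong", "Hanoi", "NamDinh"),
  edulevel = c("Secondary school", "University")
)
toy$eduspend <- 10000 + 5000 * (toy$edulevel == "University") +
  rnorm(nrow(toy), sd = 2000)

# province * edulevel expands to both main effects plus their interaction.
fit <- lm(eduspend ~ province * edulevel, data = toy)

# Base-R Q-Q plot of the residuals (no extra package needed).
qqnorm(residuals(fit))
qqline(residuals(fit))

# Shapiro-Wilk test: H0 is that the residuals are normally distributed.
sw <- shapiro.test(residuals(fit))
print(sw)
```

A small p-value from the test would be evidence against normality of the residuals; a large one means the test found no such evidence.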


Lastly, we check the remaining assumption: whether the standard deviations of the six populations are equal. Based on the output "4 Standard deviation for groups" in Question 2, the ratio of the largest standard deviation to the smallest one (= 9902.959/1012.826) is approximately 9.7776, which is larger than 2. Therefore, the assumption is not satisfied, because not all populations have the same standard deviation.

Question 4: Perform the inference technique you suggest in Question 1. Remember to provide all the necessary steps. What are your interpretations and conclusions? Explain.

Step 1: Hypotheses

H0: There is no significant interaction between province and edulevel in eduspend.
Ha: There is a significant interaction between province and edulevel in eduspend.

Step 2: Check assumptions (see Question 3)

• All populations are normally distributed.
• Samples are independent, simple random samples.
• Not all populations have the same standard deviation.

Step 3: Test statistic

We used RStudio to calculate the test statistic and p-value, and obtained the following output:

R code:

VHLSS.result <- aov(eduspend ~ province*edulevel, data = VHLSS)
summary(VHLSS.result)

Output:

                   Df    Sum Sq   Mean Sq F value   Pr(>F)
province            2 5.521e+08 2.760e+08   4.858 0.009443 **
edulevel            1 4.478e+09 4.478e+09  78.812 1.13e-14 ***
province:edulevel   2 1.057e+09 5.286e+08   9.303 0.000181 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Step 4: Level of significance

The level of significance is α = 0.05.

Step 5: Decision rule

The p-value for the interaction term equals 0.000181 < α = 0.05. Therefore, following the decision rule, we reject H0.

Step 6: Conclusion

At the 5% significance level, there is sufficient evidence to conclude that the interaction between province and edulevel is significant; there are differences in eduspend due to province and edulevel.
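The decision can be cross-checked directly from the F statistic reported for the interaction term. The sketch below assumes residual degrees of freedom of 114 (120 observations minus 3 x 2 cells), since the Residuals row is not shown in the output; the result should come out close to the reported p-value.

```r
# Values copied from the ANOVA table in Step 3 (province:edulevel row).
f_interaction <- 9.303
df_interaction <- 2

# ASSUMPTION: residual df = N - a*b = 120 - 3*2 = 114; the Residuals row
# is not shown in the reported output.
df_residual <- 120 - 3 * 2

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_interaction, df_interaction, df_residual, lower.tail = FALSE)

# Decision rule: reject H0 when the p-value is below alpha.
alpha <- 0.05
reject_h0 <- p_value < alpha
print(p_value)
print(reject_h0)
```

The same call with the F values of the province and edulevel rows would check the two main effects.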

Question 5: Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusions made in Question 4?

This command is used to draw the interaction plot:

interaction.plot(VHLSS$edulevel, VHLSS$province, VHLSS$eduspend, type="b", col=c("red","blue","black"), pch=c(16, 18), main="Interaction between Province and Schooling level")

As shown in the plot, the three lines are non-parallel, so the interaction appears considerable. In other words, the plot is consistent with the conclusion in Question 4 that there is a significant interaction between province and edulevel in their effect on expenditure on education.

Question 6: Discuss the credibility of interpretations and conclusions of question 4. Is there anything we should be concerned about? Explain.

a. The credibility of the interpretations and conclusion

For this report, we use two-way ANOVA as the inference technique to test the interaction between place of residence and schooling level, and to test differences in education expenditure due to these two variables. Not all of the assumptions are satisfied. After conducting all the steps in the test, we conclude that the interaction between province and edulevel is significant and that there are differences in eduspend due to province and edulevel, because the p-value of 0.000181 is smaller than the level of significance (α = 0.05). At this level, there is a 5% chance of a Type I error, that is, of rejecting a true null hypothesis. Equivalently, if the null hypothesis of no interaction between province and edulevel is true, there is a 95% chance that it will not be rejected. Finally, we conclude that the conclusions in Question 4 are reliable, and that both factors (province and edulevel) and their interaction have significant effects on eduspend.

b. Limitations of two-way ANOVA

Although the two-way ANOVA addresses the case, we identified one limitation after applying it: the equal-variance assumption is not satisfied. Therefore, despite working at the 95% confidence level, we recognize that this violated assumption may make the results inaccurate.
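The size of this violation can be recomputed from the six group standard deviations reported in Question 2, using the rule of thumb from Question 3 (largest SD divided by smallest SD should be below 2). A minimal sketch:

```r
# Group standard deviations as reported in Question 2.
group_sd <- c(1012.826, 8817.526, 2161.205, 9902.959, 9847.818, 7899.908)

# Rule of thumb used in Question 3: max SD / min SD should be below 2.
sd_ratio <- max(group_sd) / min(group_sd)
equal_variance_ok <- sd_ratio < 2

print(round(sd_ratio, 4))
print(equal_variance_ok)
```

A ratio this far above 2 is why the F-test p-values should be read with caution.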

Additionally, in this case study we only take into account two factors, schooling level and place of residence, in their effect on the level of spending, even though other factors might also influence it.


BES 2021 - PEER EVALUATION FORM

Phạm Thị Phương Thảo 1904040113 100% Phạm Thị Phương Thảo

Nguyễn Hải Vũ 1904050056 100% Nguyễn Hải Vũ

Title	Academic Performance of University Students
Authors	Cao Bạch Dương, Bùi Việt Hà, Đỗ Thị Hằng, Nguyễn Mạnh Hùng, Nguyễn Thị Minh Phương, Phạm Thị Phương Thảo, Nguyễn Hải Vũ
Supervisor	Mr. Nguyễn Hoàng Việt
School	Hanoi University
Major	Business and Economics Statistics
Type	Academic Paper
Year of publication	2021
City	Hanoi

Format
Pages	15
File size	545.8 KB