
HANOI UNIVERSITY FACULTY OF MANAGEMENT AND TOURISM

STATISTICS FOR ECONOMICS

(FALL 2019)

Case study analysis: Academic Performance of

University Students

by TWO-WAY ANOVA Test

Instructor: Ms Lai Hoai Phuong

Tutorial 5 - Group 6

Group members:

2. Nguyễn Tuan Phong 1704040093
3. Nguyễn Gia Phuong Anh 1704040005
4. Lê Thi Bao Ngoc 1704040084
5. Nguyễn Việt Hoa 1704040043

TABLE OF CONTENTS

I Introduction
II Answering questions
• Question 1
• Question 2
• Question 3
• Question 4
• Question 5

I INTRODUCTION

Analysis of variance (ANOVA) is a statistical technique that assesses potential differences in a scale-level dependent variable across a nominal-level variable with two or more categories. The method divides the total variability in a dataset into two parts: variability due to systematic factors and variability due to random factors. Two types of ANOVA are commonly used to determine whether differences exist among population means: one-way and two-way. A one-way ANOVA has just one independent variable and estimates the effect of that single factor on a response variable. A two-way ANOVA uses two independent variables.

In this case study, we use the two-way ANOVA method to examine the relationship, if any, between classroom seating position and academic performance (GPA) for female and male students at a large university in the United States. The aim of our project is to show how the two-way ANOVA model is applied to a real case study.

II ANSWERING THE QUESTIONS

1 What inference technique should be considered for this study? Explain.

The objective of the survey is to test for a significant interaction between classroom seating position and gender, and to test for significant differences in academic performance (GPA) due to seat preference and due to gender. The suitable inference technique for this study is therefore the two-way ANOVA model. Two-way ANOVA compares the means of groups that are defined by two independent factors, each with several levels. Respondents were asked to specify one of three seat preferences: “front”, “middle” and “back”, so seating position is the first factor, with 3 levels. The second factor is gender, with 2 levels: male and female. Because it uses both factors, two-way ANOVA also reveals any interaction between them. Each combination of factor levels is called a cell, so the 3 seat levels and 2 genders give 3 × 2 = 6 cells.
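The six cells can be made concrete with a small sketch. The data frame below is a hypothetical layout, not the StudentSurvey data; the figure of 50 students per cell is an assumption chosen only so that the total matches the 300 respondents.

```r
# Hypothetical balanced layout: 3 seat levels x 2 genders, 50 students per cell (assumed).
design <- expand.grid(
  Seat   = c("Back", "Front", "Middle"),
  Gender = c("Male", "Female"),
  rep    = 1:50
)

# Cross-tabulation: one count per cell of the two-way layout.
cells <- table(design$Seat, design$Gender)
print(cells)

# Number of cells = levels of factor 1 x levels of factor 2.
n_cells <- nlevels(design$Seat) * nlevels(design$Gender)
print(n_cells)
```

In the real survey the cell counts need not be equal; the same `table()` call on the actual Seat and Gender columns would show them.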

2 Produce descriptive statistics for the dataset. You are expected to generate as many relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs of this course. Remember to provide appropriate interpretations for the descriptive statistics. Try not to include unnecessary or irrelevant descriptive statistics.

2.1 Sample size

The observations in the sample are approximately normally distributed. The sample consists of 300 respondents, which is large enough, and the observations are independent because the respondents are randomly selected. There are three variables: GPA, Gender (male, female) and Seat (front, middle and back).

2.2 Mean and Standard deviation

To obtain the mean and the standard deviation of GPA for each group, we first convert the variables Gender and Seat into factors with the factor() function. We then use the by() function to compute each statistic for every combination of the two factors at the same time.

• Convert the variables Gender and Seat into factors, and produce a cross-tabulation table between the Gender and Seat variables:

StudentSurvey$Gender <- factor(StudentSurvey$Gender, levels=c("Male","Female"))
StudentSurvey$Seat <- factor(StudentSurvey$Seat, levels=c("Back","Front","Middle"))

> StudentSurvey$Gender <- factor(StudentSurvey$Gender, levels=c("Male","Female"))
> StudentSurvey$Seat <- factor(StudentSurvey$Seat, levels=c("Back","Front","Middle"))

• Mean of GPA for each combination of Seat and Gender:

: Front
: Male
[1] 3.1028

: Front
: Female
[1] 3.3356

: Middle
: Female
[1] 3.1042

• Standard deviation of GPA for each combination of Seat and Gender:

> by(StudentSurvey$GPA, list(StudentSurvey$Seat, StudentSurvey$Gender), sd)

: Back
: Male
[1] 0.4958685

: Front
: Male
[1] 0.4919393

: Middle
: Male
[1] 0.4132551

: Back
: Female
[1] 0.4180591

: Front
: Female
[1] 0.3795011

: Middle
: Female
[1] 0.452702

From this output, the highest standard deviation belongs to the combination of back seat and male gender, at 0.4958685, and the lowest, 0.3795011, belongs to the group of front seat and female gender.
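As an alternative to by(), the same cell statistics can be collected into compact Seat-by-Gender matrices with tapply(). This is only a sketch: the data frame `sim` is simulated (its mean, spread and cell sizes are assumptions) and stands in for StudentSurvey.

```r
set.seed(1)  # simulated data only; not the StudentSurvey file

# Hypothetical stand-in for StudentSurvey: 50 students per Seat x Gender cell.
sim <- expand.grid(Seat   = c("Back", "Front", "Middle"),
                   Gender = c("Male", "Female"),
                   rep    = 1:50)
sim$GPA <- rnorm(nrow(sim), mean = 3.1, sd = 0.45)

# Rows are seats, columns are genders; one statistic per cell.
cell_mean <- tapply(sim$GPA, list(sim$Seat, sim$Gender), mean)
cell_sd   <- tapply(sim$GPA, list(sim$Seat, sim$Gender), sd)

print(round(cell_mean, 3))
print(round(cell_sd, 3))
```

The matrix form puts all six values side by side, which makes the largest and smallest standard deviation easy to spot.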

2.3 Boxplot and mean plot

• Graphical description

boxplot(GPA ~ interaction(Seat,Gender), data = StudentSurvey, xlab = "Seat and Gender", ylab = "expected GPA", col = c("red", "blue", "yellow", "grey", "pink", "green"))

[Figure: boxplots of GPA for each Seat and Gender combination; x-axis: Seat and Gender, y-axis: expected GPA]

Judging from the boxplots above, female students show a more stable GPA than male students. Among male students, the lowest GPA appears in the group that preferred the back, while among female students it appears in the middle group. The black line, which represents the median of each group, is highest in “Front.Female” and lowest in “Back.Male”. Furthermore, there are seven outliers in total in the boxplots.

> install.packages("gplots")
> library(gplots)
> plotmeans(GPA ~ interaction(Seat,Gender), data = StudentSurvey, xlab = "Seat and Gender", ylab = "expected GPA", main="Mean Plot with 95% CI")

[Figure: Mean Plot with 95% CI; x-axis: Seat and Gender, y-axis: expected GPA]

The mean plot shows the mean GPA of each combination together with its 95% confidence interval, which makes the differences between the groups visible. The front seat combined with female gender has the highest mean GPA, above 3.3, followed by “Back.Female” at nearly 3.2; the lowest is “Back.Male”, at only 3.0.

3 Check all the assumptions of the inference technique you suggest in Question 1. Are the assumptions satisfied? Explain.

There are 3 assumptions required to use two-way ANOVA:
• Samples are independent, simple random samples
• All populations are normal distributions
• All populations have the same standard deviation: $\sigma_1 = \sigma_2 = \dots = \sigma_k$

3.1 Samples are independent, simple random samples

By definition, an independent sample is a sample that has no connection to any other sample: the occurrence of one sample does not influence the probability of another.

The six samples in this case are the Seat and Gender groups: Back-Male, Back-Female, Front-Male, Front-Female, Middle-Male and Middle-Female. Since there is no information on how the respondents were selected, the group assumes that they were chosen randomly. Each response came from a different person, and his or her answer is not affected by anyone else's. Therefore, we treat the samples as independent and randomly selected.

3.2 All populations have the same standard deviation

To check whether all populations have the same standard deviation, we take the ratio of the largest sample standard deviation to the smallest one. If this ratio is smaller than 2, we can conclude that the population standard deviations are equal.

From the by() output for the standard deviation in Question 2, the largest SD is 0.4958685, while the smallest SD is 0.3795011. The ratio of these two values is about 1.3, which is smaller than 2. Therefore, we can conclude that all populations have the same standard deviation.
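The rule of thumb can be checked in a few lines. The sketch below simply re-enters the six standard deviations reported in Question 2; the variable names are illustrative.

```r
# Standard deviations of GPA for the six cells, as reported by by() in Question 2.
sds <- c(0.4958685, 0.4919393, 0.4132551, 0.4180591, 0.3795011, 0.452702)

# Rule of thumb: largest SD divided by smallest SD should be below 2.
ratio <- max(sds) / min(sds)
print(round(ratio, 2))

equal_sd_ok <- ratio < 2
print(equal_sd_ok)
```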

Another technique that can be used to check this assumption is the Levene test. This test checks the homogeneity of variance, so the null hypothesis is that all the variances are equal. We compare the P-value of the Levene test with our significance level (α = 0.05). The rejection rule is to reject H0 if the P-value is smaller than α.

The Levene test is in the “car” package, so the “car” package must be installed first. R code:

> install.packages("car")
> library(car)
> leveneTest(StudentSurvey$GPA, interaction(StudentSurvey$Seat, StudentSurvey$Gender), center = mean)

The outcome:

Levene's Test for Homogeneity of Variance (center = mean)
       Df F value Pr(>F)
group   5  1.1739  0.322
      294

The P-value of the test is 0.322 while our α is 0.05; therefore we do not reject the null hypothesis and cannot conclude that the standard deviations are different.
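To see what the test computes, the Levene statistic with center = mean can be reproduced in base R as a one-way ANOVA on the absolute deviations of each observation from its own group mean. The sketch below uses simulated data (an assumption, not the StudentSurvey file), so its F value and P-value will differ from the output above; only the degrees of freedom, 5 and 294, match by construction.

```r
set.seed(2)  # simulated data only

# Hypothetical stand-in for StudentSurvey: 6 cells x 50 students = 300 rows.
sim <- expand.grid(Seat   = c("Back", "Front", "Middle"),
                   Gender = c("Male", "Female"),
                   rep    = 1:50)
sim$GPA   <- rnorm(nrow(sim), mean = 3.1, sd = 0.45)
sim$group <- interaction(sim$Seat, sim$Gender)

# Absolute deviation of each observation from the mean of its own group.
sim$absdev <- abs(sim$GPA - ave(sim$GPA, sim$group, FUN = mean))

# One-way ANOVA on the absolute deviations: its F test is the Levene statistic.
lev <- anova(lm(absdev ~ group, data = sim))
print(lev)

levene_df <- lev$Df
levene_p  <- lev[["Pr(>F)"]][1]
```

A large P-value here means that the typical spread around the group mean does not differ detectably between the groups.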

However, since the ratio is smaller than 2, conducting the Levene test is not strictly necessary in this case. If the ratio in this case were larger than 3, we should choose other tests instead of the


We used RStudio for the calculation and obtained the following output:

> StudentSurvey.result <- aov(GPA ~ Gender * Seat, data = StudentSurvey)
> summary(StudentSurvey.result)

             Df Sum Sq Mean Sq F value Pr(>F)
Gender        1   1.40  1.4008   7.108 0.0081 **
Seat          2   0.93  0.4673   2.371 0.0951 .
Gender:Seat   2   1.35  0.6745   3.423 0.0339 *
Residuals   294  57.94  0.1971

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Step 3: Level of significance: α = 0.05

Step 4: Decision rule: Reject H0 if p-value < α.

From the R output, the interaction between seat preference and gender has a p-value of 0.0339 < α.

Therefore, we reject H0. The effect of the interaction between seat preference and gender is significant.

Step 5: Conclusion:

We have enough statistical evidence to conclude that the interaction between seat preference and gender has a significant effect on GPA.
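The F values and p-values in the ANOVA table can be re-derived from the mean squares as a check on the decision. The sketch below copies the rounded figures from the summary() output, so its results agree with the table only up to rounding; the variable names are illustrative.

```r
# Mean squares and degrees of freedom copied from the summary() output.
ms  <- c(Gender = 1.4008, Seat = 0.4673, `Gender:Seat` = 0.6745)
df1 <- c(Gender = 1,      Seat = 2,      `Gender:Seat` = 2)
mse <- 0.1971   # residual mean square
df2 <- 294      # residual degrees of freedom

# F = effect mean square / residual mean square; p = upper tail of F(df1, df2).
f_value <- ms / mse
p_value <- pf(f_value, df1, df2, lower.tail = FALSE)
print(round(cbind(F = f_value, p = p_value), 4))

# Decision rule from Step 4: reject H0 when the p-value is below alpha.
alpha  <- 0.05
reject <- p_value < alpha
print(reject)
```

At α = 0.05 this rejects H0 for Gender and for the Gender:Seat interaction, but not for Seat on its own.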

Question 5: Draw an interaction plot and interpret the plot.

Since gender and seat have a significant interaction effect on GPA, we draw the interaction plot as follows.

R code:

> interaction.plot(StudentSurvey$Gender, StudentSurvey$Seat, StudentSurvey$GPA, type="b", col=c("red", "blue"), pch=c(16,18), main="Interaction between Gender and Seat")

Figure 7: Interaction Plot between Gender and Seat

As we can see from the interaction plot, the difference between the male and female student groups changes noticeably across the front, middle and back seats. Looking at the details, the female group sitting in the front scores the highest GPA, over 3.3, while the male group sitting in the same position has 3.1. The female group sitting in the middle has approximately 3.1, and the male group has a slightly higher GPA. The female group sitting in the back is similar to the female group in the middle, but the male group in the back has the lowest GPA (less than 3.0).

From this interaction, we can conclude that students who sit from the middle to the front tend to have a higher GPA. Yet the female group sitting in the back also has a remarkable result. An intersection among the seat lines can be observed in the interaction plot above, which indicates that there is a connection between gender and seat position. The female students sitting in the front and the back of the class perform better than the male students, and the contrary can be seen in the middle seat group.
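The pattern that the plot shows can also be read from a table of cell means. In the sketch below the GPA values are hypothetical toy numbers, two students per cell, chosen only to mimic the direction of the pattern described above; they are not the survey data.

```r
# Hypothetical GPA values: two students per cell, chosen only to show the pattern.
toy <- data.frame(
  Seat   = rep(c("Back", "Front", "Middle"), each = 4),
  Gender = rep(c("Male", "Male", "Female", "Female"), times = 3),
  GPA    = c(2.9, 3.1, 3.1, 3.3,   # Back:   Male, Male, Female, Female
             3.0, 3.2, 3.2, 3.4,   # Front:  Male, Male, Female, Female
             3.1, 3.3, 3.0, 3.2)   # Middle: Male, Male, Female, Female
)

# Cell means: rows are seats, columns are genders.
cell_mean <- tapply(toy$GPA, list(toy$Seat, toy$Gender), mean)

# Gender gap (Female minus Male) within each seat position.
gap <- cell_mean[, "Female"] - cell_mean[, "Male"]
print(round(gap, 2))

# Gaps of opposite sign mean the seat lines slope in opposite directions.
opposite_slopes <- any(gap > 0) && any(gap < 0)
print(opposite_slopes)
```

A gender gap that changes sign from one seat position to another is the numerical counterpart of non-parallel lines in the interaction plot.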

6 Discuss the credibility of the interpretations and conclusions of question 4. Is there anything we should be concerned about? Explain.

a Credibility of the interpretations

Firstly, the purpose is to compare population means when the population is categorized by two categorical factors, and the two-way ANOVA test used in this case study is an appropriate and useful tool for that purpose. Secondly, a significance level of 0.05 is used, which limits the probability of wrongly rejecting a true null hypothesis to 5%. At the same time, the p-value for the interaction (0.0339) is below this level, so the null hypothesis is rejected. Besides, all the assumptions for the test are satisfied, with clear evidence and an explanation for each in the third part of the report. It should be highlighted that although we used the by() function to check for equal variances and obtained a ratio of largest to smallest standard deviation of 1.3 (< 2), we still applied leveneTest to confirm the result of this assumption check. Finally, the plot and the interpretation of the interaction between the two factors form an important part of the case study.

b Limitations of the case

First of all, one of the assumptions is that the sample is a simple random sample. However, nothing in the case confirms that the sample was chosen randomly from its population. Moreover, the ANOVA test assumes that the data are normally distributed, and a violation of this assumption can greatly affect the results. In this case the violation is only moderate, so even with a few outliers in the QQ-plot the assumption can still be regarded as satisfied.

Another limitation concerns the condition of equal variances: the greater the difference in variances between groups, the greater the chance that the conclusion of the test is inaccurate. Finally, when ANOVA is run to test the difference in GPA due to Gender and Seat position, the result only tells whether there is a difference or not but it does not indicate how the difference


III CONCLUSION

The two-way ANOVA used to address this case is appropriate, and its assumptions are satisfied. It leads to the conclusion that the interaction between classroom seating position and gender has a significant effect on the academic performance (GPA) of female and male students.