HANOI UNIVERSITY FACULTY OF MANAGEMENT AND TOURISM
Business and Economics Statistics
Academic Performance of University Students
Table of Contents
I Scenario
II Questions
Question 1: What inference technique should be considered for this study? Explain.
Question 2: Produce descriptive statistics for the dataset.
Question 3: Check all assumptions of the inference technique you suggest in Question 1. Are the assumptions satisfied? Explain.
Question 4: Perform the inference technique you suggest in Question 1. Remember to provide all the necessary steps. What are your interpretations and conclusions? Explain. What are your interpretations and conclusions if we use 0.05 level of significance?
Question 5: Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusions made in Question 4?
Question 6: Discuss the credibility of the interpretations and conclusions of Question 4. Is there anything we should be concerned about? Explain.
III Conclusion
Table of Figures
Figure 1: Structure of data
Figure 2: Some first rows of the data
Figure 3: Structure of the data when factors have not been converted yet
Figure 4: Structure of the data when factors have been converted
Figure 5: Frequency table (sample sizes)
Figure 6: Mean of GPA according to Gender and Major
Figure 7: Median of GPA according to Gender and Major
Figure 8: Standard deviation of GPA according to Gender and Major
Figure 9: Summary of GPA according to Gender and Major
Figure 10: Boxplot
Figure 11: Mean plot with 90% CI
Figure 12: Q-Q Plot
Figure 13: Two-way ANOVA output
Figure 14: Interaction Plot between Gender and Major
I Scenario
A large university in the United States conducted a survey to study the relationship between major and academic performance (GPA) among its female and male students. Respondents were asked to indicate their major and their GPA, which was measured on a 0 to 4.0 scale. The purpose of this study is to examine whether there is any substantial interaction between major and gender and to check for any significant differences in GPA due to these two variables at the 0.1 level of significance.
Figure 1: Structure of data
Question 1: What inference technique should be considered for this study? Explain
In this case study, two-way ANOVA is the appropriate inference method for two reasons. First, the test evaluates the differences in means across the levels of each factor. Second, it tests whether the two factors interact, that is, whether the effect of major on GPA depends on gender. Our team therefore decided to use two-way ANOVA: it compares the means of groups defined by two independent categorical variables (major and gender) on one dependent variable (GPA), and it also tests for interaction between the two factors.
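For reference, the model behind a two-way ANOVA with interaction is usually written as below. This is the standard textbook formulation rather than something stated in the survey material, and the symbols are introduced here only for illustration: i indexes gender (i = 1, 2), j indexes major (j = 1, 2, 3), and k indexes the student within a cell.

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim N(0, \sigma^2) \text{ independently}
```

Under this notation, the test for interaction asks whether all $(\alpha\beta)_{ij}$ are zero, and the tests for the main effects ask whether all $\alpha_i$ (gender) or all $\beta_j$ (major) are zero.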
Question 2: Produce descriptive statistics for the dataset.
We use RStudio to produce the descriptive statistics for this question. To start with, we import the CSV file “StudentSurvey 2.csv” into R for further calculation:
⮚ studentsurvey2 <- read.table("StudentSurvey 2.csv", header = TRUE, sep = ",", quote = "\"", stringsAsFactors = FALSE)
There are 234 observations in this case study, so we first look at the first few observations to get a better sense of the data, using the head() function in R:
⮚ head(studentsurvey2)
Figure 2: Some first rows of the data
The internal structure of the data can be obtained by:
⮚ str(studentsurvey2)
Figure 3: Structure of the data when factors have not been converted yet
From the above output, it is clear that there are 234 observations of 4 variables: observation, gender, major and GPA. Since gender and major are categorical variables that have not yet been stored as factors, we convert them into factors using the following R code:
⮚ studentsurvey2$gender <- factor(studentsurvey2$gender, levels = c("1", "2"), labels = c("Female", "Male"))
⮚ studentsurvey2$major <- factor(studentsurvey2$major, levels = c("1", "2", "3"), labels = c("Administration", "Accounting", "Finance"))
Then we use the R code str(studentsurvey2) to get the new structure of the data, with “gender” and “major” converted into factors:
Figure 4: Structure of the data when factors have been converted
A frequency table can be created to see the sample size of each treatment group with the following R code:
⮚ table(studentsurvey2$gender, studentsurvey2$major)
Figure 5: Frequency table (sample sizes)
It can be seen that all 6 treatment groups have the same sample size of 39, so the design is balanced, which is well suited to a two-way ANOVA test.
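The factor conversion and the balance check described above can be sketched on a stand-in data frame. The data frame `toy` below is hypothetical (it is generated here, not read from “StudentSurvey 2.csv”); it only mirrors the coding described in the text, with 1/2 for gender and 1/2/3 for major.

```r
# Hypothetical stand-in for the survey data: 2 genders x 3 majors x 39 students.
# The numeric codes mirror the coding described in the text.
toy <- expand.grid(id = 1:39, gender = 1:2, major = 1:3)

# Convert the numeric codes into labelled factors.
toy$gender <- factor(toy$gender, levels = c(1, 2),
                     labels = c("Female", "Male"))
toy$major  <- factor(toy$major, levels = c(1, 2, 3),
                     labels = c("Administration", "Accounting", "Finance"))

# Cross-tabulate the two factors to get the sample size of each cell.
cell_sizes <- table(toy$gender, toy$major)

# A design is balanced when every cell has the same number of observations.
is_balanced <- length(unique(as.vector(cell_sizes))) == 1

print(cell_sizes)
print(is_balanced)
```

The same two lines at the end (the `table()` call and the `unique()` check) could be applied to the real data frame once it has been imported.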
Next, we use the by() function in R to compute several descriptive statistics (mean, median, standard deviation and a summary) for each treatment group defined by the two factors. The commands and their outputs are as follows:
⮚ by(studentsurvey2$gpa,list(studentsurvey2$gender,studentsurvey2$major), mean)
Figure 6: Mean of GPA according to Gender and Major
⮚ by(studentsurvey2$gpa,list(studentsurvey2$gender,studentsurvey2$major), median)
Figure 7: Median of GPA according to Gender and Major
⮚ by(studentsurvey2$gpa,list(studentsurvey2$gender,studentsurvey2$major), sd)
Figure 8: Standard deviation of GPA according to Gender and Major
⮚ by(studentsurvey2$gpa,list(studentsurvey2$gender,studentsurvey2$major), summary)
Figure 9: Summary of GPA according to Gender and Major
Each command gives one descriptive statistic of the outcome variable (GPA) for each treatment group, listed by Gender first and then by Major. The final command, with summary, returns six basic statistics of GPA: the minimum, the first quartile, the median, the mean, the third quartile and the maximum.
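As an alternative to running by() once per statistic, the per-cell mean, median and standard deviation can be collected in a single table with aggregate(). The sketch below uses simulated GPA values in a hypothetical data frame `toy`, so the numbers it prints are not the survey results.

```r
set.seed(1)

# Hypothetical stand-in data: 39 simulated students per gender-by-major cell.
toy <- expand.grid(id = 1:39,
                   gender = c("Female", "Male"),
                   major = c("Administration", "Accounting", "Finance"))

# Simulated GPA values, clipped to the 0 to 4.0 scale (assumed distribution).
toy$gpa <- pmin(pmax(rnorm(nrow(toy), mean = 3, sd = 0.6), 0), 4)

# One row per cell, with mean, median and sd computed together.
cell_stats <- aggregate(
  gpa ~ gender + major, data = toy,
  FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x))
)

# aggregate() returns the three statistics as a matrix column;
# flatten it into ordinary columns gpa.mean, gpa.median and gpa.sd.
cell_stats <- do.call(data.frame, cell_stats)

print(cell_stats)
```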
To get further information, we draw a boxplot and a mean plot.
⮚ boxplot(gpa~ interaction(gender,major), data = studentsurvey2, xlab = "Gender and Major", ylab = "GPA", col = c("red", "blue", "yellow","grey","brown","pink"))
Figure 10: Boxplot
The box plot shows several descriptive statistics for each cell: the median, the quartiles, and the minimum and maximum values. The cells differ noticeably from one another. The Male - Accounting group appears to have the highest median and the most stable GPA values: its interquartile range and its overall range between the highest and lowest values are the smallest, so its within-group variance is the smallest. The Female - Finance group is the lowest on almost every measure (median, minimum and maximum), whereas every other group has a maximum GPA above 3.5; its interquartile range is about average, but its variance is large. In contrast, the Male - Finance group has the highest maximum GPA, the largest interquartile range and the largest variance.
The skewness of each group is also visible in the boxplot. The data of each group can be symmetric, positively skewed or negatively skewed, depending on the distances from the median to the two ends of the box and whiskers. Taking the three male groups as examples, the Male - Administration distribution is left-skewed, because the values below the median are spread over a wider range than the values above it. By the same reasoning, Male - Accounting is an example of a right-skewed distribution, and the Male - Finance group is roughly symmetric. There are also 3 outliers, shown as white dots in the Male – Accounting, Female – Finance and Male – Finance groups respectively, but 3 outliers out of 234 observations are unlikely to affect our test result.
We also use a mean plot to identify the mean of each group and to compare the means between groups, with the following code and output:
⮚ install.packages("gplots")
⮚ library(gplots)
⮚ plotmeans(gpa ~ interaction(gender, major), data = studentsurvey2, p = 0.90, xlab = "Gender and Major", ylab = "GPA", main = "Mean Plot with 90% CI")
Figure 11: Mean plot with 90% CI
The mean plot presents the six groups with 90% confidence intervals. The means shown agree with those returned by the by() function. The Female – Accounting group has the highest mean and the Female – Finance group has the lowest. The six group means also differ from one another, which is what the two-way ANOVA goes on to test formally.
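The 90% confidence interval drawn around each group mean can be reproduced by hand with the usual t-based formula, mean ± t × sd / sqrt(n). The helper `mean_ci()` and the simulated vector `gpa_cell` below are hypothetical names used only for this sketch; the values are not taken from the survey.

```r
# t-based confidence interval for a single group mean.
# `level` is the confidence level (0.90 matches the mean plot in the text).
mean_ci <- function(x, level = 0.90) {
  n     <- length(x)
  se    <- sd(x) / sqrt(n)                       # standard error of the mean
  tcrit <- qt(1 - (1 - level) / 2, df = n - 1)   # two-sided critical value
  c(lower = mean(x) - tcrit * se,
    mean  = mean(x),
    upper = mean(x) + tcrit * se)
}

set.seed(2)

# Simulated GPA values for one cell of 39 students (assumed distribution).
gpa_cell <- rnorm(39, mean = 3, sd = 0.6)

ci <- mean_ci(gpa_cell, level = 0.90)
print(ci)
```

Applying `mean_ci()` to the GPA values of each of the six cells would give the intervals that the mean plot displays as error bars.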
Question 3: Check all assumptions of the inference technique you suggest in Question 1 Are the assumptions satisfied? Explain
As explained in Question 1, two-way factorial analysis of variance is the appropriate inference method for this case. However, it is necessary to check all the assumptions of this method before carrying out the two-way ANOVA, to ensure that our results are valid. There are three assumptions to check for two-way ANOVA:
● Samples are independent, simple random samples of size $n_{ij}$ from each of k (= a*b) populations
● All populations are normally distributed
● All populations have the same standard deviation: $\sigma_1 = \sigma_2 = \dots = \sigma_k$
To check whether the study satisfies these three assumptions, the notation is defined as follows:
● $n_{ij}$: sample size of cell (i, j), a combination of the two factors
● i (Factor A): Gender
● j (Factor B): Major
Firstly, we check Assumption 1. The terms and notation for two-way ANOVA are shown in the following table:
[Table: terms and notation for two-way ANOVA (Factor A by Factor B)]
The table() output in Question 2 already gives the cross-tabulation of the Gender and Major variables, and hence the sample size of each cell. Applying these terms and notation to that R output, we obtain the corresponding table for checking the assumption:
                    MAJOR
GENDER    Administration   Accounting   Finance
Female          39              39          39
Male            39              39          39
To check Assumption 1, note that the selection of individuals with respect to Major does not affect the selection with respect to Gender, so the samples are independent. Moreover, within each of the k = a*b = 2*3 = 6 populations, every individual has the same probability of being randomly chosen as one of the 234 observations. The study therefore consists of independent simple random samples.
Assumption 2: All populations have the same standard deviation
Secondly, we check Assumption 2 of equal standard deviations. Looking at the output of the by() function for the standard deviations in Question 2, the ratio of the largest sample standard deviation to the smallest sample standard deviation (0.765563 / 0.5129712) is about 1.49, which is less than 2. Following this rule of thumb, we infer that all populations have the same standard deviation.
Assumption 3: All populations are normally distributed
To check whether all populations are normally distributed, we can use a Q-Q plot with the following R commands:
⮚ install.packages("car")
⮚ library(car)
⮚ qqPlot(lm(gpa ~ gender + major + gender*major, data = studentsurvey2), simulate = TRUE, main = "Q-Q Plot", labels = FALSE)
Figure 12: Q-Q Plot
A normal Q-Q plot is commonly used to assess the normality of the residuals: it compares the quantiles of the residuals with those of a perfect normal distribution. The plot shows that the points lie close to the reference line, with no outliers. The Q-Q plot therefore meets both requirements, and we conclude that the populations are approximately normally distributed.
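The two numerical checks discussed in this question can be sketched as follows. The first part only reproduces the ratio quoted in the text from the two reported standard deviations. The second part runs on a simulated, hypothetical data frame `toy`, and it adds a Shapiro-Wilk test on the residuals as a numerical complement to the Q-Q plot; that test is not part of the original analysis and is included here only as an illustration.

```r
# Part 1: ratio of the largest to the smallest reported cell SD (from the text).
reported_ratio <- 0.765563 / 0.5129712
print(round(reported_ratio, 2))
print(reported_ratio < 2)   # rule of thumb used in the text

# Part 2: the same checks on simulated stand-in data (not the survey data).
set.seed(3)
toy <- expand.grid(id = 1:39,
                   gender = c("Female", "Male"),
                   major = c("Administration", "Accounting", "Finance"))
toy$gpa <- rnorm(nrow(toy), mean = 3, sd = 0.6)

# Standard deviation of GPA in each gender-by-major cell (2 x 3 matrix).
cell_sd  <- tapply(toy$gpa, list(toy$gender, toy$major), sd)
sd_ratio <- max(cell_sd) / min(cell_sd)
print(sd_ratio)

# Fit the two-way model and test the residuals for normality.
# Shapiro-Wilk is an added illustration, not the method used in the text.
fit <- aov(gpa ~ gender * major, data = toy)
sw  <- shapiro.test(residuals(fit))
print(sw$p.value)
```

A large Shapiro-Wilk p-value would be in line with residuals that follow the reference line in the Q-Q plot, although with 234 observations the plot remains the more informative check.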
Question 4: Perform the inference technique you suggest in Question 1. Remember to provide all the necessary steps. What are your interpretations and conclusions? Explain. What are your interpretations and conclusions if we use 0.05 level of significance?
Two-way ANOVA test:
- Step 1: Identify the null and alternative hypotheses:
Ho: There is no significant interaction between major and gender in GPA
Ha: There is a significant interaction between major and gender in GPA
- Step 2: Test statistic and p-value
❖ Check assumptions: We use Two-way ANOVA to test the hypothesis
● All populations are normally distributed
● Samples are independent, simple random samples of 39 from each of 6 populations
● All populations have the same standard deviation
❖ Test statistic and p-value:
We used RStudio for the calculation and obtained the following output:
⮚ studentsurvey2.result <- aov(gpa ~ gender * major, data = studentsurvey2)
⮚ summary(studentsurvey2.result)
Figure 13: Two-way ANOVA output
- Step 3 : Level of significance
The level of significance: α=0.1
- Step 4: Decision rule and conclusion
Reject Ho if p-value < α
As mentioned in Question 1, the primary purpose of a two-way ANOVA is to examine the influence of two categorical independent variables on one continuous dependent variable. We therefore consider the interaction between major and gender first.
● If α = 0.1
As can be seen from the R output, p-value < α (0.06504 < 0.1). Therefore, following the decision rule, we reject Ho.
Conclusion: At the 0.1 level of significance, we have enough evidence to conclude that there is a significant interaction between major and gender in GPA.
● If α = 0.05
At the 0.1 level of significance we reject Ho, but the conclusion changes if we choose the 0.05 level of significance.
As can be seen from the R output, p-value > α (0.06504 > 0.05). Therefore, following the decision rule, we do not reject Ho.
Conclusion: At the 0.05 level of significance, we do not have enough evidence to conclude that there is a significant interaction between major and gender in GPA.
Subsequently, we focus on the tests of each factor, Gender and Major.
- Test for Gender: p-value < α (0.00737 < 0.05)
From this result, we have enough evidence to conclude that the mean GPA differs between female and male students.
- Test for Major: p-value < α (8.89e-13 < 0.05)
From this result, we have enough evidence to conclude that the mean GPA of at least one major differs from the others.
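The decision rule applied above (reject Ho if p-value < α) can be tabulated for both significance levels at once. The sketch below uses the three p-values quoted in this question; the helper `decide()` and the table layout are hypothetical and exist only for this illustration.

```r
# p-values as reported in the text for the two-way ANOVA.
p_values <- c("gender" = 0.00737,
              "major" = 8.89e-13,
              "gender:major" = 0.06504)

# Decision rule: reject Ho when the p-value is below alpha.
decide <- function(p, alpha) {
  ifelse(p < alpha, "reject Ho", "do not reject Ho")
}

decisions <- data.frame(
  p_value    = unname(p_values),
  alpha_0.10 = unname(decide(p_values, 0.10)),
  alpha_0.05 = unname(decide(p_values, 0.05)),
  row.names  = names(p_values)
)

print(decisions)
```

Only the interaction row changes between the two columns, which is the point made in the text: the interaction is significant at the 0.1 level but not at the 0.05 level.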
Question 5: Draw an interaction plot and interpret the plot Is the plot consistent with the conclusions made in Question 4?
Another way to examine the interaction between Major and Gender in GPA is the interaction plot, drawn with the following R code:
interaction.plot(studentsurvey2$gender,studentsurvey2$major,studentsurvey2$gpa, type="b", col=c("red","blue"), pch=c(16, 18),main="Interaction between Gender and Major")
Figure 14: Interaction Plot between Gender and Major
Theoretically, the less parallel the lines are, the stronger the interaction. This interaction plot suggests that there is an interaction between gender and major. The Accounting and Administration lines show the strongest interaction, while Administration and Finance show a moderate one. Overall, male students have a higher GPA than female students. Female students in Accounting perform better than female students in the other majors, and their GPA is slightly higher than that of male Accounting students, so the blue line slopes downward. The Administration line slopes upward: male students have a higher GPA than female students, approximately equal to that of male Accounting students. These opposite slopes are evidence of a strong interaction between gender and major for these two majors. The GPA of both male and female Finance students is much lower than in the other two majors, and the interaction involving Finance is fairly weak. Because the three lines are not parallel, the interaction can overall be regarded as moderate. This is consistent with the conclusions in Question 4, where the alternative hypothesis is supported at the 0.1 level of significance but not at the 0.05 level.
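The idea of non-parallel lines can be made concrete by comparing the male-minus-female gap in mean GPA across majors: parallel lines correspond to equal gaps. The sketch below uses simulated data in a hypothetical data frame `toy`, with cell means in `true_means` that are invented to mimic the pattern described above (a gap that changes sign between Administration and Accounting). They are assumptions, not the survey's values.

```r
set.seed(4)

# Hypothetical stand-in data: 39 simulated students per cell.
toy <- expand.grid(id = 1:39,
                   gender = c("Female", "Male"),
                   major = c("Administration", "Accounting", "Finance"))

# Invented cell means (assumption): the gender gap changes sign across majors.
true_means <- c("Female.Administration" = 3.0, "Male.Administration" = 3.3,
                "Female.Accounting"     = 3.4, "Male.Accounting"     = 3.3,
                "Female.Finance"        = 2.6, "Male.Finance"        = 2.8)

# Look up each student's cell mean and add random noise.
cell    <- interaction(toy$gender, toy$major)   # labels like "Female.Finance"
toy$gpa <- unname(true_means[as.character(cell)]) +
  rnorm(nrow(toy), mean = 0, sd = 0.3)

# Observed cell means as a 2 x 3 table (rows: gender, columns: major).
cell_means <- tapply(toy$gpa, list(toy$gender, toy$major), mean)

# Male-minus-female gap per major; equal gaps would mean parallel lines.
gender_gap <- cell_means["Male", ] - cell_means["Female", ]
print(round(gender_gap, 3))

# Interaction p-value from the two-way ANOVA (third row of the ANOVA table).
fit <- aov(gpa ~ gender * major, data = toy)
p_interaction <- summary(fit)[[1]][["Pr(>F)"]][3]
print(p_interaction)
```

The further the three gaps are from one another, the less parallel the lines in the interaction plot and the smaller the interaction p-value tends to be.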
Question 6: Discuss the credibility of the interpretations and conclusions of Question 4 Is there anything we should be concerned about? Explain
a. The credibility of the interpretations and conclusions
In terms of interpretation, the assumptions appear reasonable, and all of them were checked and found to be satisfied without any ambiguity. At α = 0.1 (α: level of significance), we could conclude that there exists interaction in