HANOI UNIVERSITY
FACULTY OF MANAGEMENT AND TOURISM
BUSINESS AND ECONOMICS STATISTIC
CASE STUDY
TUTORIAL5 — CASE 4
TUTOR: Tran Thi Thu Hién
2104040026 Đỗ Thùy Dương
2104010055 Hứa Nguyễn Thanh Loan
Question 1: Produce descriptive statistics to summarize the data. You are expected to generate as many relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs of this course. Remember to provide appropriate interpretations for the descriptive statistics. Try not to include unnecessary or irrelevant descriptive statistics.
We use RStudio to produce the descriptive statistics for this question. First, we import the CSV file "dataset23.csv" into R for further calculation:
> VN <- read.table("dataset23.csv", header=TRUE, sep=",", quote="\"", stringsAsFactors=FALSE)
There are 300 observations in this study, so we inspect the first few observations with the head() function to get a better sense of the data:
> head(VN)
Figure 1: Some first rows of data
A frequency table shows the sample size of each treatment group. The table() function takes the following form: tableName <- table(row variable, column variable)
> table(VN$X.province, VN$own)
           Multi-owner One-owner
  Haiphong          50        50
  Hanoi             50        50
  TP HCM            50        50
Figure 2: Frequency table (sample size)
It can be seen that all 6 treatment groups have the same sample size of 50. This balanced design suits the ANOVA tests used in the following questions.
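The balance check described above can be sketched in a few lines. This is a minimal illustration on a hypothetical toy data frame (`toy`, built with `expand.grid`), not on dataset23.csv itself; the names `toy`, `cell_counts` and `is_balanced` are assumptions introduced here.

```r
# Hypothetical toy data: 3 provinces x 2 ownership types, 50 rows per cell
toy <- expand.grid(
  rep = 1:50,
  province = c("Haiphong", "Hanoi", "TP HCM"),
  own = c("Multi-owner", "One-owner"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two factors, as table(VN$X.province, VN$own) does
cell_counts <- table(toy$province, toy$own)

# The design is balanced when every cell has the same count
is_balanced <- length(unique(as.vector(cell_counts))) == 1

print(cell_counts)
print(is_balanced)
```

The same two lines (`table(...)` followed by the `unique` check) should apply unchanged to the real data frame.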
The internal structure of the data can be obtained by:
> str(VN)
'data.frame': 300 obs. of 5 variables:
 $ X.province : chr "Hanoi" "Hanoi" "Hanoi" "Hanoi" ...
 $ own : chr "One-owner" "One-owner" "One-owner" "One-owner" ...
Figure 3: Structure of the data
From the above output, it is clear that there are 300 observations with 5 variables: X.province, own, X.quantityproduct, X.quantitysold, and totalass.
Next, we use the by() function in R to find descriptive statistics such as the mean, standard deviation, minimum and maximum for each treatment group defined by the two factors. In this part, we focus only on the Total assets value.
> by(VN$totalass, list(VN$X.province, VN$own), summary)
: Haiphong
: Multi-owner
: Hanoi
: Multi-owner
: TP HCM
: Multi-owner
: Haiphong
: One-owner
: Hanoi
: One-owner
: TP HCM
: One-owner
Figure 4: Summary of Total assets according to Province and Ownership
> by(VN$totalass, list(VN$X.province, VN$own), sd)
: Haiphong
: Multi-owner
[1] 21815.23
: Hanoi
: Multi-owner
[1] 7562.748
: TP HCM
: Multi-owner
[1] 148425.6
: Haiphong
: One-owner
[1] 57733.84
: Hanoi
: One-owner
[1] 13946.72
: TP HCM
: One-owner
[1] 47527.46
Figure 5: Standard Deviation of Total assets according to Province and Ownership
Each call returns descriptive statistics of the outcome variable (total assets) for each treatment group, indexed first by province and then by ownership. The summary call reports six basic statistics for each group: the minimum, the first quartile, the median, the mean, the third quartile, and the maximum. The sd call reports the standard deviation of each group.
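As a compact alternative to reading the by() output group by group, the per-cell statistics can be collected into tables. The sketch below uses base R `tapply` and `aggregate` on a small hypothetical data frame; the data values and the names `toy`, `cell_sd` and `cell_summary` are illustrative assumptions, not the study data.

```r
set.seed(1)

# Hypothetical toy data: 2 observations in each of the 6 province x ownership cells
toy <- data.frame(
  province = rep(c("Haiphong", "Hanoi", "TP HCM"), each = 4),
  own = rep(c("Multi-owner", "One-owner"), times = 6),
  totalass = round(runif(12, 1000, 60000))
)

# Standard deviation per cell, laid out as a province x ownership matrix
cell_sd <- tapply(toy$totalass, list(toy$province, toy$own), sd)

# Several statistics per cell in one data frame
cell_summary <- aggregate(
  totalass ~ province + own,
  data = toy,
  FUN = function(x) c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
)

print(cell_sd)
print(cell_summary)
```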
To get further information, we draw a box plot and mean plots.
> boxplot(VN$totalass ~ VN$X.province + VN$own, ylim=c(1000,60000), col=c("salmon","green","orange","skyblue","brown","yellow"))
Figure 6: Box plot
This box plot shows several descriptive statistics for the 6 groups: medians, quartiles, and minimum and maximum values. The distributions differ from group to group. Based on the R output, the TP HCM multi-owner group has the highest median and also the largest outliers.
> plot(VN$X.quantityproduct, VN$X.quantitysold)
Figure 7: Scatter plot
It can be seen that the points show an upward trend: across the pooled provinces and types of ownership, enterprises that produce more tend to sell more. The relationship between these variables will be discussed thoroughly in Question 5.
Mean plots are also used to identify the mean value of each variable (Quantity sold, Quantity produced and Total assets) in the different groups and to compare means between groups. Before creating mean plots in RStudio, we need to install the gplots package; we then use the following code:
> library(gplots)
> plotmeans(VN$totalass ~ interaction(VN$X.province, VN$own), data = VN, xlab = "Enterprises", ylab = "Total assets", main = "Mean Plot with 95% CI")
> plotmeans(VN$X.quantityproduct ~ interaction(VN$X.province, VN$own), data = VN, xlab = "Enterprises", ylab = "Total quantity produced", main = "Mean Plot with 95% CI")
> plotmeans(VN$X.quantitysold ~ interaction(VN$X.province, VN$own), data = VN, xlab = "Enterprises", ylab = "Total quantity sold", main = "Mean Plot with 95% CI")
Figure 8: Mean Plots with 95% CI
The group means do not appear to differ markedly, except for the TP HCM Multi-owner group. The last two plots (Total quantity produced and Total quantity sold) have almost exactly the same shape, which is consistent with the upward trend seen in the scatter plot.
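The quantities drawn in a mean plot can also be computed directly. The sketch below assumes the usual t-based 95% confidence interval for a mean (mean ± t × sd / √n); the helper `ci95` and the simulated data are hypothetical and only illustrate the calculation.

```r
# Hypothetical helper: mean and t-based 95% confidence limits for one group
ci95 <- function(x) {
  n <- length(x)
  m <- mean(x)
  half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  c(mean = m, lower = m - half_width, upper = m + half_width, n = n)
}

set.seed(42)

# Simulated data (not the study data): three groups of 50 observations
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 50),
  value = c(rnorm(50, 100, 10), rnorm(50, 105, 10), rnorm(50, 150, 40))
)

# One row per group: mean, lower limit, upper limit, sample size
ci_by_group <- t(sapply(split(toy$value, toy$group), ci95))
print(ci_by_group)
```

Groups whose intervals overlap heavily are the ones a mean plot shows as "not clearly different"; a group whose interval sits apart from the others stands out in the same way the TP HCM Multi-owner group does.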
Question 2: Use analysis of variance to test for any significant differences due to province. Use a .05 level of significance, and for now, ignore the effect of types of ownership, quantity produced and quantity sold. Check all the assumptions of the inference technique you use. Are the assumptions satisfied? Explain.
Because the purpose is to test for significant differences due to province while ignoring the effects of type of ownership, quantity produced and quantity sold, there is only one independent variable (province), so we use one-way ANOVA.
1. Hypothesis:
H0: All the population means are equal
Ha: Not all the means are equal
2. Checking assumptions
For one-way ANOVA, there are three assumptions we need to examine:
- Samples are independent, simple random samples
- All populations are normally distributed
- All population standard deviations are equal
Assumption 1: Samples are independent, simple random samples
To see whether these samples were chosen by simple random sampling, we need to know how the samples were selected. Because the scenario does not mention this, we assume the samples were chosen by simple random sampling.
Assumption 2: All populations are normally distributed
To check whether all populations are normally distributed, we can use a Q-Q plot with the following R commands:
> install.packages("car")
> library(car)
> library(carData)
> qqPlot(lm(totalass ~ X.province, data=VN), simulate=TRUE, main="Q-Q Plot", labels=FALSE)
Figure 9: Q-Q plot
It can be seen from Figure 9 that the points in the Q-Q plot do not follow the 45-degree reference line: the scatter lies far outside the confidence envelope and includes some outliers. Therefore, the normality assumption is not satisfied.
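A visual Q-Q check can be complemented by a formal test. The sketch below applies the Shapiro-Wilk test (base R `shapiro.test`) to the residuals of a one-way model. This is a swapped-in technique, not the method used in this report, and the right-skewed data are simulated purely for illustration (the names `group`, `assets`, `fit` and `sw` are assumptions).

```r
set.seed(123)

# Simulated right-skewed "assets" data (hypothetical, not the study data)
group <- factor(rep(c("A", "B", "C"), each = 100))
assets <- rlnorm(300, meanlog = 10, sdlog = 1)

# Fit the one-way model and test its residuals for normality
fit <- lm(assets ~ group)
sw <- shapiro.test(residuals(fit))

print(sw)
# A small p-value indicates that the residuals depart from normality
print(sw$p.value < 0.05)
```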
Assumption 3: All population standard deviations are equal
To check whether the standard deviations are equal, we calculate the ratio of the largest to the smallest sample standard deviation. If this ratio is not larger than 2, assumption 3 is satisfied.
> by(VN$totalass, VN$X.province, sd)
This is the output:
VN$X.province: Haiphong
[1] 43451.78
VN$X.province: Hanoi
[1] 11162.1
VN$X.province: TP HCM
[1] 110745.9
Figure 10: Sample standard deviations of Total assets by province
The largest sample standard deviation is 110745.9, the smallest is 11162.1, and their ratio is 9.921601, which is much larger than 2, so assumption 3 is not satisfied. Because the ratio is also greater than 3, we do not apply Levene's test. Instead, since both the normality and the equal-variance assumptions fail, we run the Kruskal-Wallis test, a rank-based alternative to one-way ANOVA that does not require normality.
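The ratio quoted above can be reproduced from the three standard deviations in Figure 10. The vector name `sds` is introduced here only for illustration.

```r
# Sample standard deviations of total assets by province, copied from Figure 10
sds <- c(Haiphong = 43451.78, Hanoi = 11162.1, `TP HCM` = 110745.9)

# Rule of thumb: the equal-SD assumption is acceptable if max/min <= 2
ratio <- max(sds) / min(sds)

print(ratio)
print(ratio <= 2)
```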
1. Hypothesis
H0: All population distributions are identical
Ha: Values are systematically different
2. Checking assumptions
- The data are quantitative but not normal
- The samples are independent, simple random samples
3. Test statistic: p-value
Run the Kruskal-Wallis test
> ex1 <- kruskal.test(VN$totalass, VN$X.province)
> ex1
Kruskal-Wallis rank sum test
data: VN$totalass and VN$X.province
Kruskal-Wallis chi-squared = 23.238, df = 2, p-value = 8.996e-06
Figure 11: Kruskal-Wallis test outcome
4. Decision rule
Reject H0 if p-value < alpha
We have: p-value = 8.996e-06 < alpha = 0.05
5. Making decision
Reject H0
6. Conclusion
There is enough evidence to conclude that the population distributions are not identical.
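The p-value in Figure 11 follows from the chi-squared approximation that the Kruskal-Wallis test uses: it is the upper-tail probability of the test statistic under a chi-squared distribution with k - 1 = 2 degrees of freedom. A minimal check, using only the numbers reported in Figure 11 (variable names are illustrative):

```r
# Values reported in the Kruskal-Wallis output (Figure 11)
kw_stat <- 23.238
kw_df <- 2
alpha <- 0.05

# Upper-tail chi-squared probability
p_value <- pchisq(kw_stat, df = kw_df, lower.tail = FALSE)

# Decision rule: reject H0 if p-value < alpha
reject_h0 <- p_value < alpha

print(p_value)
print(reject_h0)
```

The small difference from the printed 8.996e-06 comes from using the rounded statistic 23.238.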
3. Test statistic: p-value (one-way ANOVA)
Run one-way ANOVA
> anv1 <- aov(totalass ~ X.province, data = VN)
> summary(anv1)
             Df    Sum Sq   Mean Sq F value Pr(>F)
X.province    2 2.871e+10 1.436e+10   3.017 0.0505
Residuals   297 1.413e+12 4.759e+09
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure 12: One-way ANOVA outcome
4. Decision rule
Reject H0 if p-value < alpha
We see: p-value = 0.0505 > alpha = 0.05
5. Making decision
Do not reject H0
6. Conclusion
There is not enough statistical evidence to conclude that the mean total assets differ among the 3 provinces; that is, the differences among the province means are not significant at the .05 level. Because the normality and equal-variance assumptions are not satisfied, this result should be interpreted with caution.
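The F value and p-value in Figure 12 can be traced back to the sums of squares and degrees of freedom in the same table. The sketch below uses only the rounded figures printed there, so the results agree with the output up to rounding; the variable names are illustrative.

```r
# Sums of squares and degrees of freedom from Figure 12 (rounded as printed)
ss_between <- 2.871e10
df_between <- 2
ss_within <- 1.413e12
df_within <- 297

# Mean squares and the F statistic
ms_between <- ss_between / df_between
ms_within <- ss_within / df_within
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(f_value)
print(p_value)
print(p_value < 0.05)
```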
Question 3: At the .05 level of significance, test for any significant differences due to province, types of ownership, and interaction (ignore the effect of quantity produced and quantity sold). Check all the assumptions of the inference technique you use. Are the assumptions satisfied? Explain. Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusion?
In this question, we use the two-way ANOVA method to check the differences due to province, types of ownership, and interaction. We need to check the following assumptions:
- Samples are independent, simple random samples of size $n_{ij}$ from each of k (= ab) populations
- All populations are normally distributed
- All populations have the same standard deviation ($\sigma_{11} = \sigma_{12} = \dots = \sigma_{ab} = \sigma$)
1. Hypotheses:
H01: The total assets means of Province are equal
Ha1: The total assets means of Province are different
H02: The total assets means of Ownership are equal
Ha2: The total assets means of Ownership are different
H03: There is no significant interaction between Province and Ownership
Ha3: There is significant interaction between Province and Ownership
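The three pairs of hypotheses above correspond to the three rows of a two-way ANOVA table: one for each main effect and one for the interaction. The sketch below fits such a model with base R `aov` on simulated data with the same 3 × 2 layout and 50 observations per cell. The data are randomly generated and purely hypothetical, so the printed F and p-values say nothing about the real dataset; only the structure of the table carries over.

```r
set.seed(2024)

# Simulated balanced design (hypothetical): 3 provinces x 2 ownership types x 50 rows
sim <- expand.grid(
  rep = 1:50,
  province = c("Haiphong", "Hanoi", "TP HCM"),
  own = c("Multi-owner", "One-owner")
)
sim$totalass <- rnorm(nrow(sim), mean = 30000, sd = 5000)

# province * own expands to province + own + province:own
fit2 <- aov(totalass ~ province * own, data = sim)
anova_table <- summary(fit2)[[1]]

# Rows: province (H01), own (H02), province:own (H03), Residuals
print(anova_table)
```

With a = 3 provinces and b = 2 ownership types, the degrees of freedom are a - 1 = 2, b - 1 = 1, (a - 1)(b - 1) = 2, and 300 - ab = 294 for the residuals.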
2. Check assumptions:
Assumption 1: Samples are independent, simple random samples
To check assumption 1, we first set out the terms and notation for two-way ANOVA, shown in the following table:
[Table: two-way ANOVA notation, with columns for Factor B and Total]
From Figure 2 in Question 1, we have the cross-tabulation between the Province and Ownership variables, which shows the sample size for each cell. Applying this notation to the R output gives the corresponding table for checking the assumption:
> table(VN$X.province, VN$own)
Figure 13: Frequency table
The table shows that each of the k = ab = 2 × 3 = 6 populations (Haiphong - Multi-owner, Hanoi - Multi-owner, TP HCM - Multi-owner, Haiphong - One-owner, Hanoi - One-owner, TP HCM - One-owner) contributes a sample of 50 observations. The scenario does not describe how the 300 observations were selected, so, as in Question 2, we assume that each individual in the "Province" and "Ownership status" groups had the same probability of being chosen and that the samples are independent. Therefore, we treat the study as having independent simple random samples.
Assumption 2: All populations are normally distributed
To check whether "All populations are normally distributed" holds, we can use a Q-Q plot with the following R commands:
> install.packages("car")
> library(car)
> library(carData)
> qqPlot(lm(totalass ~ X.province + own + X.province*own, data=VN), simulate=TRUE, main="Q-Q Plot", labels=FALSE)
Figure 14: Q-Q Plot of Total assets based on Province and Ownership
As we can see from the Q-Q plot, the points lie along a nearly flat (horizontal) line rather than along the reference line, and the scatter is far outside the confidence envelope, with some outliers. Thus, it is reasonable to say that the populations are not normally distributed.
Assumption 3: All populations have the same standard deviation ($\sigma_{11} = \sigma_{12} = \dots = \sigma_{ab} = \sigma$)
To check the final assumption, we use the by() function in R to obtain the sample standard deviations of the six groups (Figure 5). The ratio of the largest to the smallest sample standard deviation is 148425.6/7562.748 = 19.62588202, which is much greater than 2. As a result, we use Levene's test to check whether the variances are equal, with the following code:
> leveneTest(VN$X.quantityproduct, interaction(VN$X.province, VN$X.quantitysold), center=median)