business and economics statistic case study



HANOI UNIVERSITY

FACULTY OF MANAGEMENT AND TOURISM

BUSINESS AND ECONOMICS STATISTIC

CASE STUDY

TUTORIAL5 — CASE 4

TUTOR: Tran Thi Thu Hién

2104040026 Đỗ Thùy Dương

2104010055 Hứa Nguyễn Thanh Loan


Question 1: Produce descriptive statistics to summarize the data. You are expected to generate as many relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs of this course. Remember to provide appropriate interpretations for the descriptive statistics. Try not to include unnecessary or irrelevant descriptive statistics.

We use RStudio to produce the descriptive statistics for this question. First, we import the CSV file "dataset23.csv" into R for further calculation:

> VN <- read.table("dataset23.csv", header=TRUE, sep=",", quote="\"", stringsAsFactors=FALSE)

There are 300 observations in this study, so we look at the first few observations with the head() function in R to get a better sense of the data:

> head(VN)

Figure 1: Some first rows of data

A frequency table shows the sample size of each treatment group. The table() function has the following format: tableName <- table(row variable, column variable)

> table(VN$X.province, VN$own)

           Multi-owner One-owner
  Haiphong          50        50
  Hanoi             50        50
  TP HCM            50        50

Figure 2: Frequency table (sample size)

It can be seen that all 6 treatment groups have the same sample size of 50. Equal group sizes give a balanced design, which is the most favorable setting for the ANOVA tests used below.
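The balance check described above can be sketched in a few lines of R. This is only an illustration: `VN_demo` is a hypothetical stand-in built with the same two factor columns as `VN`, not the real "dataset23.csv".

```r
# Hypothetical stand-in for the real data: same factor columns as VN,
# 50 rows per province-ownership combination (an assumption for illustration).
provinces  <- c("Haiphong", "Hanoi", "TP HCM")
ownerships <- c("Multi-owner", "One-owner")
VN_demo <- expand.grid(X.province = provinces, own = ownerships,
                       rep = seq_len(50), stringsAsFactors = FALSE)

# Cross-tabulate province by ownership, as in Figure 2.
cell_counts <- table(VN_demo$X.province, VN_demo$own)
print(cell_counts)

# The design is balanced when every cell has the same count.
is_balanced <- length(unique(as.vector(cell_counts))) == 1
print(is_balanced)
```

With the real data, replacing `VN_demo` by `VN` should give the same kind of check.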

The internal structure of the data can be obtained by:

> str(VN)

'data.frame': 300 obs. of 5 variables:

 $ X.province: chr "Hanoi" "Hanoi" "Hanoi" "Hanoi" ...

 $ own       : chr "One-owner" "One-owner" "One-owner" "One-owner" ...

Figure 3: Structure of the data

From the above output, it is clear that there are 300 observations of 5 variables: X.province, own, X.quantityproduct, X.quantitysold, and totalass.

Next, we use the by() function in R to find several descriptive statistics, such as the mean, standard deviation, minimum and maximum value, for each treatment group defined by the two factors. In this part, we focus only on the total assets value.

> by(VN$totalass, list(VN$X.province, VN$own), summary)

[R output: summary() of totalass for each of the six groups (Haiphong, Hanoi and TP HCM, each for Multi-owner and One-owner)]

Figure 4: Summary of Total assets according to Province and Ownership

> by(VN$totalass, list(VN$X.province, VN$own), sd)

: Haiphong
: Multi-owner
[1] 21815.23

: Hanoi
: Multi-owner
[1] 7562.748

: TP HCM
: Multi-owner
[1] 148425.6

: Haiphong
: One-owner
[1] 57733.84

: Hanoi
: One-owner
[1] 13946.72

: TP HCM
: One-owner
[1] 47527.46

Figure 5: Standard Deviation of Total assets according to Province and Ownership

Each command gives descriptive statistics of the outcome variable (total assets) for each treatment group, listed by province first and then by ownership. The summary function returns six basic statistics for each group: the minimum value, the first quartile, the median, the mean, the third quartile, and the maximum value.
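As a possible complement to the separate by() calls, the same group statistics can be collected in one table with aggregate(). The sketch below runs on simulated values; `VN_demo` and its log-normal total assets are assumptions made for illustration, not the real data.

```r
set.seed(1)  # simulated data only; not the real dataset23.csv
VN_demo <- expand.grid(X.province = c("Haiphong", "Hanoi", "TP HCM"),
                       own = c("Multi-owner", "One-owner"),
                       rep = seq_len(50), stringsAsFactors = FALSE)
VN_demo$totalass <- rlnorm(nrow(VN_demo), meanlog = 9, sdlog = 1)

# One row per group, with several statistics side by side.
group_stats <- aggregate(totalass ~ X.province + own, data = VN_demo,
                         FUN = function(x) c(n = length(x), mean = mean(x),
                                             sd = sd(x), min = min(x),
                                             median = median(x), max = max(x)))

# aggregate() stores the statistics as a matrix column; flatten it
# into ordinary columns named totalass.n, totalass.mean, and so on.
group_stats <- do.call(data.frame, group_stats)
print(group_stats)
```

Having one row per group makes it easier to compare, for example, the standard deviations across the six cells.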

To get further information, we draw a box plot and mean plots.

> boxplot(VN$totalass ~ VN$X.province + VN$own, ylim=c(1000,60000), col = c("salmon","green","orange","skyblue","brown","yellow"))

Figure 6: Box plot

This box plot shows several descriptive statistics for the 6 groups: medians, quartiles, and maximum and minimum values. Each group has different characteristics. Based on the R output, the TP HCM Multi-owner group has the highest median value and also the largest outliers.

> plot(VN$X.quantityproduct, VN$X.quantitysold)

Figure 7: Scatter plot

It can be seen that the points have an upward trend. This means that the more products are produced, the more are sold, across provinces and types of ownership. The relationship between these variables is discussed thoroughly in Question 5.
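The upward trend read from the scatter plot can also be summarized numerically. The following sketch uses simulated quantities; the variable names and the assumed relation between production and sales are hypothetical and only illustrate the calls.

```r
set.seed(2)  # simulated data only
quantity_produced <- round(runif(300, min = 100, max = 1000))
# Assume, for illustration, that sales track production with some noise.
quantity_sold <- round(0.8 * quantity_produced + rnorm(300, sd = 40))

# Pearson correlation: values near +1 indicate a strong upward linear trend.
r <- cor(quantity_produced, quantity_sold)
print(r)

# Slope of the least-squares line through the scatter.
fit <- lm(quantity_sold ~ quantity_produced)
print(coef(fit))
```

On the real data, the two columns would be VN$X.quantityproduct and VN$X.quantitysold.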

A mean plot is also used to identify the mean value of each variable (quantity sold, quantity produced and total assets) in the different groups and to compare means between groups. Before creating mean plots in RStudio, we need to install the gplots package; then we use the following code to obtain the output:

> library(gplots)

> plotmeans(VN$totalass ~ interaction(VN$X.province, VN$own), data = VN, xlab = "Enterprises", ylab = "Total assets", main = "Mean Plot with 95% CI")

> plotmeans(VN$X.quantityproduct ~ interaction(VN$X.province, VN$own), data = VN, xlab = "Enterprises", ylab = "Total quantity produced", main = "Mean Plot with 95% CI")

> plotmeans(VN$X.quantitysold ~ interaction(VN$X.province, VN$own), data = VN, xlab = "Enterprises", ylab = "Total quantity sold", main = "Mean Plot with 95% CI")

[Three mean plots, each titled "Mean Plot with 95% CI", with "Enterprises" on the horizontal axis and n=50 for every group]

Figure 8: Mean Plots with 95% CI

The group means do not appear to differ much, except for the TP HCM Multi-owner group. The last two plots (total quantity produced and total quantity sold) have the same shape, which agrees with the upward trend seen in the scatter plot.
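The numbers behind a mean plot can be reproduced by hand. The sketch below assumes the error bars are t-based 95% confidence intervals for each group mean and uses simulated data (`VN_demo` is hypothetical), so it shows the calculation rather than the actual results.

```r
set.seed(3)  # simulated data only
VN_demo <- expand.grid(X.province = c("Haiphong", "Hanoi", "TP HCM"),
                       own = c("Multi-owner", "One-owner"),
                       rep = seq_len(50), stringsAsFactors = FALSE)
VN_demo$totalass <- rlnorm(nrow(VN_demo), meanlog = 9, sdlog = 1)

group <- interaction(VN_demo$X.province, VN_demo$own)
n     <- tapply(VN_demo$totalass, group, length)
means <- tapply(VN_demo$totalass, group, mean)
sds   <- tapply(VN_demo$totalass, group, sd)

# Half-width of a t-based 95% confidence interval for each group mean.
half_width <- qt(0.975, df = n - 1) * sds / sqrt(n)

ci_table <- data.frame(group = names(means),
                       n     = as.vector(n),
                       mean  = as.vector(means),
                       lower = as.vector(means - half_width),
                       upper = as.vector(means + half_width))
print(ci_table)
```

Groups whose intervals overlap heavily are the ones that look similar in the plot.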

Question 2: Use analysis of variance to test for any significant differences due to province. Use a .05 level of significance, and for now, ignore the effect of types of ownership, quantity produced and quantity sold. Check all the assumptions of the inference technique you use. Are the assumptions satisfied? Explain.

Because the purpose is to test for any significant differences due to province while ignoring the effect of types of ownership, quantity produced and quantity sold, there is only one independent variable, province, so we use one-way ANOVA.

1. Hypothesis:

H0: All the population means are equal

Ha: Not all the means are equal

2. Checking assumptions

For one-way ANOVA, there are three assumptions we need to examine:

- Samples are independent, simple random samples

- All populations are normally distributed

- All population standard deviations are equal

Assumption 1: Samples are independent, simple random samples

To see whether these samples were chosen by simple random sampling, we would need to know how they were selected. Because the scenario does not mention this, we assume the samples were chosen by simple random sampling.

Assumption 2: All populations are normally distributed

To check whether all populations are normally distributed, we can use a Q-Q plot with the following R commands:

> install.packages("car")

> library(car)

> library(carData)

> qqPlot(lm(totalass ~ X.province, data=VN), simulate=TRUE, main="Q-Q Plot", labels=FALSE)

Figure 9: Q-Q plot (horizontal axis: t quantiles)

It can be seen from Figure 9 that the points in the Q-Q plot lie roughly on a straight line, but the line does not pass through the origin and does not have a 45-degree slope. Many points also fall far outside the confidence envelope, and there are some outliers. Therefore, we conclude that the populations are not normally distributed.
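A similar check can be sketched without the car package. Note that this is a swapped-in variant, not the qqPlot() call used above: it plots standardized residuals against normal quantiles with base R, and it runs on simulated, deliberately right-skewed data (`VN_demo` is hypothetical).

```r
set.seed(4)  # simulated data only
VN_demo <- data.frame(
  X.province = rep(c("Haiphong", "Hanoi", "TP HCM"), each = 100),
  totalass   = rlnorm(300, meanlog = 9, sdlog = 1)  # right-skewed on purpose
)

fit <- lm(totalass ~ X.province, data = VN_demo)
res <- rstandard(fit)  # standardized residuals

# Normal Q-Q plot with a reference line through the quartiles.
qqnorm(res, main = "Q-Q Plot (base R sketch)")
qqline(res)

# Correlation between theoretical and sample quantiles; values well
# below 1 suggest a departure from normality.
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

For normally distributed residuals the points would follow the reference line closely and the correlation would be very close to 1.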

Assumption 3: All population standard deviations are equal

To check whether the standard deviations are equal, we calculate the ratio between the largest and the smallest sample standard deviation. If this ratio is not larger than 2, assumption 3 is satisfied.

> by(VN$totalass, VN$X.province, sd)

This is the output:

VN$X.province: Haiphong
[1] 43451.78

VN$X.province: Hanoi
[1] 11162.1

VN$X.province: TP HCM
[1] 110745.9

Figure 10: Standard deviations of total assets by province

The largest sample standard deviation is 110745.9, the smallest is 11162.1, and their ratio is 9.921601, which is much larger than 2, so assumption 3 is not satisfied. Moreover, because the ratio is greater than 3, we cannot apply Levene's test. Instead, we use the Kruskal-Wallis test, a nonparametric alternative to one-way ANOVA.
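The ratio can be verified directly from the three standard deviations reported in Figure 10. The variable names below are hypothetical; the numbers are the ones given in the text.

```r
# Sample standard deviations by province, copied from Figure 10.
sd_by_province <- c(Haiphong = 43451.78, Hanoi = 11162.1, "TP HCM" = 110745.9)

# Ratio of the largest to the smallest standard deviation.
sd_ratio <- max(sd_by_province) / min(sd_by_province)
print(sd_ratio)

# Rule of thumb used in this report: the ratio should not exceed 2.
equal_sd_ok <- sd_ratio <= 2
print(equal_sd_ok)
```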

1. Hypothesis

H0: All population distributions are identical

Ha: Values are systematically different

2. Checking assumptions

- The data are quantitative but not normal

- The samples are independent, simple random samples

3. Test statistic: p-value

Run the Kruskal-Wallis test:

> ex1 <- kruskal.test(VN$totalass, VN$X.province)

> ex1

Kruskal-Wallis rank sum test

data: VN$totalass and VN$X.province

Kruskal-Wallis chi-squared = 23.238, df = 2, p-value = 8.996e-06

Figure 11: Kruskal-Wallis test outcome

4. Decision rule

Reject H0 if p-value < alpha

We have: p-value = 8.996e-06 < alpha = 0.05

5. Making decision

Reject H0

6. Conclusion

There is enough evidence to conclude that the population distributions are not identical.
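The test and decision rule above can be sketched end to end. The data here are simulated (`VN_demo` and its group parameters are assumptions for illustration), so the printed statistic will not match Figure 11.

```r
set.seed(5)  # simulated data only
VN_demo <- data.frame(
  X.province = rep(c("Haiphong", "Hanoi", "TP HCM"), each = 100),
  totalass   = c(rlnorm(100, 9.0, 1.0),
                 rlnorm(100, 9.0, 0.5),
                 rlnorm(100, 9.8, 1.2))
)

# Kruskal-Wallis rank sum test of total assets across provinces.
kw <- kruskal.test(totalass ~ X.province, data = VN_demo)
print(kw)

# Decision rule: reject H0 when the p-value is below alpha.
alpha <- 0.05
decision <- if (kw$p.value < alpha) "Reject H0" else "Do not reject H0"
print(decision)
```

The degrees of freedom equal the number of groups minus one, which is 2 for three provinces.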

3. Test statistic: p-value

Returning to the one-way ANOVA for province, we run:

> anv1 <- aov(totalass ~ X.province, data = VN)

> summary(anv1)

             Df    Sum Sq   Mean Sq F value Pr(>F)
X.province    2 2.871e+10 1.436e+10   3.017 0.0505
Residuals   297 1.413e+12 4.759e+09

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure 12: One-way ANOVA outcome

4. Decision rule

Reject H0 if p-value < alpha

We see: p-value = 0.0505 > alpha = 0.05

5. Making decision

Do not reject H0

6. Conclusion

There is not enough statistical evidence to conclude that the mean total assets differ among the three provinces. In other words, at the 0.05 level there is no significant difference in mean total assets due to province.
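The F value and p-value can be recomputed from the sums of squares and degrees of freedom in Figure 12. Because the table entries are rounded, the result should agree with the reported values only approximately; the variable names are hypothetical.

```r
# Values copied from the ANOVA table in Figure 12.
ss_between <- 2.871e10; df_between <- 2
ss_within  <- 1.413e12; df_within  <- 297

# Mean squares and the F statistic.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_value    <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
print(c(F = f_value, p = p_value))

# Critical value at alpha = 0.05 for comparison.
f_crit <- qf(0.95, df_between, df_within)
print(f_crit)
```

The F statistic falls just below the critical value, which is why the p-value lands only slightly above 0.05.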

Question 3: At the .05 level of significance, test for any significant differences due to province, types of ownership, and interaction (ignore the effect of quantity produced and quantity sold). Check all the assumptions of the inference technique you use. Are the assumptions satisfied? Explain. Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusion?

In this question, we use the two-way ANOVA method to check the differences due to province, types of ownership, and their interaction. We need to check the following assumptions:

- Samples are independent, simple random samples of size n_ij from each of k (= ab) populations

- All populations are normally distributed

- All populations have the same standard deviation (σ_11 = σ_12 = ... = σ_ab = σ)

1. Hypotheses:

H01: The mean total assets are equal across provinces

Ha1: The mean total assets differ across provinces

H02: The mean total assets are equal across types of ownership

Ha2: The mean total assets differ across types of ownership

H03: There is no significant interaction between Province and Ownership

Ha3: There is a significant interaction between Province and Ownership
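Each pair of hypotheses corresponds to one row of a two-way ANOVA table. The sketch below shows how such a table and an interaction plot could be produced; it uses simulated data (`VN_demo` is hypothetical), so its output says nothing about the real enterprises.

```r
set.seed(6)  # simulated data only
VN_demo <- expand.grid(X.province = c("Haiphong", "Hanoi", "TP HCM"),
                       own = c("Multi-owner", "One-owner"),
                       rep = seq_len(50), stringsAsFactors = TRUE)
VN_demo$totalass <- rlnorm(nrow(VN_demo), meanlog = 9, sdlog = 1)

# X.province * own expands to both main effects plus their interaction.
anv2 <- aov(totalass ~ X.province * own, data = VN_demo)
anova_table <- summary(anv2)[[1]]
print(anova_table)

# Interaction plot: roughly parallel lines suggest little interaction.
with(VN_demo, interaction.plot(X.province, own, totalass,
                               xlab = "Province",
                               ylab = "Mean total assets",
                               trace.label = "Ownership"))
```

The rows for X.province, own and X.province:own give the p-values for the first, second and third pair of hypotheses respectively.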

2. Check assumptions:

Assumption 1: Samples are independent, simple random samples

To check assumption 1, we first recall the terms and notation for two-way ANOVA, shown in the following table:

[Table: two-way ANOVA notation; surviving column headers: Factor B, Total]

The cross-tabulation of Province and Ownership from Question 1 (Figure 2) shows the sample size for each cell. Applying this notation to the R output, we obtain the corresponding table to check the assumption:

> table(VN$X.province, VN$own)

Figure 13: Frequency table

After checking the table, we conclude that factor A ("Province") and factor B ("Own") are unrelated, because the observations were chosen at random from the 300 observations. In detail, within each of the k = ab = 3 x 2 = 6 populations (Haiphong - Multi-owner, Hanoi - Multi-owner, TP HCM - Multi-owner, Haiphong - One-owner, Hanoi - One-owner, TP HCM - One-owner), every individual has the same probability of being chosen as one of the 300 observations. Therefore, the study has independent simple random samples.

Assumption 2: All populations are normally distributed

To check whether all populations are normally distributed, we can use a Q-Q plot with the following R commands:

> install.packages("car")

> library(car)

> library(carData)

> qqPlot(lm(totalass ~ X.province + own + X.province*own, data=VN), simulate=TRUE, main="Q-Q Plot", labels=FALSE)

Figure 14: Q-Q Plot of Total assets based on Province and Ownership

As we can see from the Q-Q plot, the line of points is nearly horizontal (close to 180 degrees), and many points lie far outside the confidence envelope, with some outliers. Thus, it is reasonable to say that the populations are not normally distributed.

Assumption 3: All populations have the same standard deviation (σ_11 = σ_12 = ... = σ_ab = σ)

To check the final assumption, we use the group standard deviations obtained with the by() function in R (Figure 5). The ratio of the largest sample standard deviation to the smallest is 148425.6 / 7562.748 = 19.62588, which is much greater than 2. As a result, we use Levene's test to check whether the variances are equal, with the following code:

> leveneTest(VN$totalass, interaction(VN$X.province, VN$own), center=median)
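For readers without the car package, the idea behind Levene's test with center=median can be sketched in base R: take the absolute deviation of each observation from its group median and run a one-way ANOVA on those deviations. The data below are simulated (`VN_demo` is hypothetical), and the sketch is meant to illustrate the technique, not to replace leveneTest().

```r
set.seed(7)  # simulated data only
VN_demo <- expand.grid(X.province = c("Haiphong", "Hanoi", "TP HCM"),
                       own = c("Multi-owner", "One-owner"),
                       rep = seq_len(50), stringsAsFactors = TRUE)
VN_demo$totalass <- rlnorm(nrow(VN_demo), meanlog = 9, sdlog = 1)

# Six groups: every province-ownership combination.
group <- interaction(VN_demo$X.province, VN_demo$own)

# Absolute deviation of each observation from its own group median.
group_median <- ave(VN_demo$totalass, group, FUN = median)
abs_dev <- abs(VN_demo$totalass - group_median)

# One-way ANOVA on the absolute deviations.
levene_table <- anova(lm(abs_dev ~ group))
print(levene_table)

levene_p <- levene_table[["Pr(>F)"]][1]
print(levene_p)
```

A small p-value here would indicate that the spread differs between groups, that is, that the equal-variance assumption is doubtful.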

Title	Business and Economics Statistic Case Study
Authors	Pham Bao Huyén, Nguyễn Thùy Trang, Đỗ Thùy Dương, Nguyễn Thị Hương, Phạm Như Hiển, Nguyễn Linh Chỉ, Hứa Nguyễn Thanh Loan
Supervisor	Tran Thi Thu Hién
University	Hanoi University
Major	Business and Economics
Type	Tutorial
