
Business and economics statistics case study


Question 1: Produce descriptive statistics to summarize the data. You are expected to generate as many relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs of this course. Remember to provide appropriate interpretations for the descriptive statistics. Try not to include unnecessary or irrelevant descriptive statistics.

We use RStudio to compute the descriptive statistics for this question. First, we import the CSV file "dataset23.csv" into R for further calculation:

> VN <- read.table("dataset23.csv", header=TRUE, sep=",", quote="\"", stringsAsFactors=FALSE)

There are 300 observations in this study, so we look at the first few rows with the head() function in R to get a better sense of the data:

Figure 1: Some first rows of data

A frequency table shows the sample size of each treatment group. The table() function takes the following form: tableName <- table(row variable, column variable)

> table(VN$X.province, VN$own)

           Multi-owner One-owner
  Haiphong          50        50
  Hanoi             50        50
  TP HCM            50        50

Figure 2: Frequency table (sample size)

All 6 treatment groups have the same sample size of 50. This balanced selection is well suited to the one-way ANOVA test.
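The balance check described above can be sketched in base R. This is only an illustration on a hypothetical data frame (the name demo and its contents are assumptions, not the rows of dataset23.csv): it builds 50 rows for each province and ownership combination, cross-tabulates them, and tests whether every cell has the same count.

```r
# Hypothetical data: 3 provinces x 2 ownership types, 50 enterprises per cell.
provinces <- c("Haiphong", "Hanoi", "TP HCM")
owners <- c("Multi-owner", "One-owner")
demo <- expand.grid(rep = 1:50, province = provinces, own = owners,
                    stringsAsFactors = FALSE)

# Cross-tabulate: rows are provinces, columns are ownership types.
counts <- table(demo$province, demo$own)
print(counts)

# A design is balanced when every cell has the same count.
balanced <- length(unique(as.vector(counts))) == 1
print(balanced)
```

The same two lines (table() followed by the uniqueness check) should work on the real VN data frame, assuming its columns are named as in the report.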

The internal structure of the data can be obtained by:

> str(VN)

'data.frame': 300 obs. of 5 variables:
 $ X.province: chr "Hanoi" "Hanoi" "Hanoi" "Hanoi"
 $ own       : chr "One-owner" "One-owner" "One-owner" "One-owner"

Figure 3: Structure of the data

From the above output, there are 300 observations of 5 variables: X.province, own, X.quantityproduct, X.quantitysold, and totalass.

Next, we use the by() function in R to find several descriptive statistics, such as the mean, standard deviation, minimum, and maximum, for each treatment group defined by the two factors. In this part, we focus only on the total assets value.

> by(VN$totalass, list(VN$X.province, VN$own), summary)

: Haiphong : Multi-owner
: Hanoi : Multi-owner
: TP HCM : Multi-owner
: Haiphong : One-owner
: Hanoi : One-owner
: TP HCM : One-owner

Figure 4: Summary of Total assets according to Province and Ownership

> by(VN$totalass, list(VN$X.province, VN$own), sd)

: Haiphong : Multi-owner
[1] 21815.23
: Hanoi : Multi-owner
[1] 7562.748
: TP HCM : Multi-owner
[1] 148425.6
: Haiphong : One-owner
[1] 57733.84
: Hanoi : One-owner
[1] 13946.72
: TP HCM : One-owner
[1] 47527.46

Figure 5: Standard Deviation of Total assets according to Province and Ownership

Each call gives descriptive statistics of the outcome variable (total assets) for each treatment group, listed by province first and then by ownership. The summary call returns six basic statistics for each group: the minimum, the first quartile, the median, the mean, the third quartile, and the maximum. The sd call returns the standard deviation of each group.
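As a rough illustration of per-group statistics, the sketch below uses tapply(), a base-R relative of by(), on simulated data. The data frame demo and its totalass values are hypothetical and only stand in for the real dataset.

```r
set.seed(1)  # hypothetical data only; not the values in dataset23.csv
demo <- expand.grid(rep = 1:50,
                    province = c("Haiphong", "Hanoi", "TP HCM"),
                    own = c("Multi-owner", "One-owner"))
demo$totalass <- rlnorm(nrow(demo), meanlog = 9, sdlog = 1)

groups <- list(province = demo$province, own = demo$own)

# Standard deviation per province x ownership cell (3 x 2 matrix).
sd_by_group <- tapply(demo$totalass, groups, sd)
print(round(sd_by_group, 1))

# Six-number summary for one cell, in the order R prints it.
one_cell <- demo$totalass[demo$province == "Hanoi" & demo$own == "One-owner"]
print(summary(one_cell))
```

tapply() returns a compact matrix, which can be easier to scan than the block-by-block output of by() when only one statistic per group is needed.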

To get further information, we draw a box plot and mean plots:

> boxplot(VN$totalass ~ VN$X.province + VN$own, ylim=c(1000,60000), col=c("salmon","green","orange","skyblue","brown","yellow"))

Figure 6: Box plot

This box plot shows several descriptive statistics for the 6 groups: medians, quartiles, and maximum and minimum values. Each group has its own characteristics. Based on the R output, the TP HCM Multi-owner group has the highest median value and also the largest outliers.

> plot(VN$X.quantityproduct, VN$X.quantitysold)

Figure 7: Scatter plot

The points show an upward trend. This means that the more products are produced, the more are sold, in every province and for every type of ownership. The relationship between these variables will be discussed thoroughly in Question 5.

Mean plots are also used to identify the mean value of each variable (quantity sold, quantity produced, and total assets) in the different groups and to compare means between groups. Before creating a mean plot in RStudio, we need to install the gplots package; we then used the following code to obtain the outcome:

> plotmeans(VN$X.quantitysold ~ interaction(VN$X.province, VN$own), data = VN, xlab = "Enterprises", ylab = "Total quantity sold", main = "Mean Plot with 95% CI")

Figure 8: Mean Plots with 95% CI

The group means do not differ much, except for the TP HCM Multi-owner group. The last two plots (total quantity produced and total quantity sold) have almost exactly the same shape, which agrees with the upward trend seen in the scatter plot.
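The numbers behind a mean plot can be sketched in base R. The example below assumes a t-based 95% confidence interval for each group mean and uses simulated data; the data frame demo, the column sold, and the helper mean_ci are hypothetical names introduced only for this illustration.

```r
set.seed(2)  # hypothetical data only
demo <- expand.grid(rep = 1:50,
                    province = c("Haiphong", "Hanoi", "TP HCM"),
                    own = c("Multi-owner", "One-owner"))
demo$sold <- rnorm(nrow(demo), mean = 500, sd = 80)
demo$group <- interaction(demo$province, demo$own)

# Mean and t-based confidence interval for one numeric vector.
mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(mean = mean(x), lower = mean(x) - half, upper = mean(x) + half)
}

# One row per group: mean, lower bound, upper bound.
ci_table <- t(sapply(split(demo$sold, demo$group), mean_ci))
print(round(ci_table, 1))
```

Groups whose intervals overlap heavily are the ones a mean plot would show as not clearly different.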

Question 2: Use analysis of variance to test for any significant differences due to province. Use a .05 level of significance, and for now, ignore the effect of types of ownership, quantity produced and quantity sold. Check all the assumptions of the inference technique you use. Are the assumptions satisfied? Explain.

Because the purpose is to test for any significant differences due to province while ignoring the effect of types of ownership, quantity produced, and quantity sold, there is only one independent variable, province. We therefore use one-way ANOVA.

1. Hypotheses:

H0: All the population means are equal
Ha: Not all the means are equal

2. Checking assumptions

For one-way ANOVA, there are three assumptions we need to examine:

- Samples are independent, simple random samples
- All populations are normally distributed
- All population standard deviations are equal

Assumption 1: Samples are independent, simple random samples

To see whether these samples were chosen by simple random sampling, we would need to know how they were selected. Because the scenario does not mention this, we assume that the samples were chosen by simple random sampling.

Assumption 2: All populations are normally distributed

To check whether all populations are normally distributed, we can use a Q-Q plot with the following R commands:

> install.packages("car")
> library(car)
> library(carData)
> qqPlot(lm(totalass ~ X.province, data=VN), simulate=T, main="Q-Q Plot", labels=F)

The points in the Q-Q plot fall away from the confidence interval, with some outliers. Therefore, the populations are not normally distributed.
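A base-R version of the same kind of check is sketched below, without the car package. It is only an illustration on hypothetical, deliberately right-skewed data: it takes the residuals of the one-way model, computes the Q-Q coordinates without drawing them, and compares their straightness with that of a truly normal sample. Using the Q-Q correlation as a summary is an assumption of this sketch, not a step taken in the report.

```r
set.seed(3)  # hypothetical, deliberately right-skewed data
demo <- data.frame(province = rep(c("Haiphong", "Hanoi", "TP HCM"), each = 100))
demo$totalass <- rlnorm(nrow(demo), meanlog = 9, sdlog = 1)

# Residuals of the one-way model: each value minus its province mean.
res <- residuals(lm(totalass ~ province, data = demo))

# Q-Q coordinates without drawing; correlation near 1 means near-normal.
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)

# Reference: the same check on a truly normal sample of equal size.
ref <- qqnorm(rnorm(length(res)), plot.it = FALSE)
ref_cor <- cor(ref$x, ref$y)
print(c(skewed = qq_cor, normal = ref_cor))
```

Calling qqnorm(res) followed by qqline(res) draws the corresponding plot.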

Assumption 3: All population standard deviations are equal

To check whether the standard deviations are equal, we calculate the ratio between the largest and the smallest standard deviation. If this ratio is not larger than 2, assumption 3 is satisfied.

> by(VN$totalass, VN$X.province, sd)

This is the output:

VN$X.province: Haiphong
[1] 43451.78
VN$X.province: Hanoi
[1] 11162.1
VN$X.province: TP HCM
[1] 110745.9

Figure 10: Population standard deviations

The largest sample standard deviation is 110745.9 and the smallest is 11162.1, so the ratio is 9.921601, which is much larger than 2. Because the ratio is also greater than 3, we do not apply Levene's test. Instead, we use the Kruskal-Wallis test, which does not assume normally distributed populations, to compare the provinces.
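The ratio above can be reproduced directly from the three standard deviations reported in Figure 10. The variable names in this short sketch are illustrative only.

```r
# Sample standard deviations of total assets, as reported in Figure 10.
sds <- c(Haiphong = 43451.78, Hanoi = 11162.1, "TP HCM" = 110745.9)

# Ratio of the largest to the smallest standard deviation.
ratio <- max(sds) / min(sds)
print(ratio)

# Rule of thumb used in the text: the assumption holds if ratio <= 2.
equal_sd_ok <- ratio <= 2
print(equal_sd_ok)
```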

1. Hypotheses

H0: All population distributions are identical
Ha: Values are systematically different

2. Checking assumptions

- The data are quantitative but not normal
- The samples are independent, simple random samples

3. Test statistic: p-value

Run the Kruskal-Wallis test:

> ex1 <- kruskal.test(VN$totalass, VN$X.province)
> ex1

Kruskal-Wallis rank sum test

data: VN$totalass and VN$X.province
Kruskal-Wallis chi-squared = 23.238, df = 2, p-value = 8.996e-06

Figure 11: Kruskal-Wallis test outcome

4. Decision rule

Reject H0 if p-value < alpha

We have: p-value = 8.996e-06 < alpha = 0.05

5. Making decision

Reject H0

6. Conclusion

There is enough evidence to conclude that the population distributions are not identical.

3. Test statistic: p-value

Run one-way ANOVA:

> anv1 <- aov(totalass ~ X.province, data = VN)
> summary(anv1)

             Df    Sum Sq   Mean Sq F value Pr(>F)
X.province    2 2.871e+10 1.436e+10   3.017 0.0505
Residuals   297 1.413e+12 4.759e+09

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure 12: One-way ANOVA outcome

4. Decision rule

Reject H0 if p-value < alpha

We see: p-value = 0.0505 > alpha = 0.05

5. Making decision

Do not reject H0

6. Conclusion

There is not enough statistical evidence to conclude that the mean total assets values differ among the 3 provinces; that is, we find no significant differences among the means at the 0.05 level.
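Both p-values in this question can be cross-checked from the test statistics printed in Figures 11 and 12. The sketch below uses only the rounded figures shown in those outputs, so the results match the reported values only up to rounding.

```r
# Kruskal-Wallis: the statistic is compared with a chi-squared distribution.
kw_p <- pchisq(23.238, df = 2, lower.tail = FALSE)

# One-way ANOVA: F is the ratio of the two mean squares from Figure 12.
f_value <- 1.436e10 / 4.759e9
anova_p <- pf(f_value, df1 = 2, df2 = 297, lower.tail = FALSE)

print(c(kw_p = kw_p, f_value = f_value, anova_p = anova_p))

# Decisions at the 0.05 level of significance.
print(c(reject_kw = kw_p < 0.05, reject_anova = anova_p < 0.05))
```

The two tests lead to different decisions here: the rank-based test rejects H0 while the ANOVA p-value falls just above 0.05.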

Question 3: At the .05 level of significance, test for any significant differences due to province, types of ownership, and interaction (ignore the effect of quantity produced and quantity sold). Check all the assumptions of the inference technique you use. Are the assumptions satisfied? Explain. Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusion?

In this question, we use the two-way ANOVA method to check the differences due to province, types of ownership, and interaction. We need to check the following assumptions:

- Samples are independent, simple random samples of size n_ij from each of k (= ab) populations
- All populations are normally distributed
- All populations have the same standard deviation (σ11 = σ12 = ... = σab = σ)

1. Hypotheses:

H01: The mean total assets are equal across provinces
Ha1: The mean total assets differ across provinces
H02: The mean total assets are equal across types of ownership
Ha2: The mean total assets differ across types of ownership
H03: There is no significant interaction between Province and Ownership
Ha3: There is significant interaction between Province and Ownership

2. Check assumptions:

Assumption 1: Samples are independent, simple random samples

To check assumption 1, we first lay out the terms and notation for two-way ANOVA in the following table:

Figure 13: Frequency table

After checking the table, we can conclude that there is no relationship between factor A ("Province") and factor B ("Own"), because the two are recorded separately for units chosen at random from the 300 observations. In detail, from each of the k = ab = 2x3 = 6 populations (Haiphong - Multi-owner, Hanoi - Multi-owner, TP HCM - Multi-owner, Haiphong - One-owner, Hanoi - One-owner, TP HCM - One-owner), each individual has the same probability of being chosen at random as one of the 300 observations. Therefore, the study has independent simple random samples.

Assumption 2: All populations are normally distributed.

To check whether all populations are normally distributed, we can use a Q-Q plot with the following R commands:

> install.packages("car")
> library(car)
> library(carData)
> qqPlot(lm(totalass ~ X.province + own + X.province*own, data=VN), simulate=T, main="Q-Q Plot", labels=F)

Figure 14: Q-Q Plot of Total assets based on Province and Ownership

As we can see from the Q-Q plot, the line is nearly flat (close to 180 degrees), and the points lie far from the confidence interval, with some outliers. Thus, it is reasonable to say that the populations are not normally distributed.

Assumption 3: All populations have the same standard deviation (σ11 = σ12 = ... = σab = σ)

We check the final assumption with the by() function in R. The ratio of the largest sample standard deviation to the smallest (148425.6 / 7562.748) equals 19.62588202, which is much greater than 2. As a result, we use Levene's test to check whether the variances are equal, with the following code:

> leveneTest(VN$X.quantityproduct, interaction(VN$X.province, VN$X.quantitysold), center=median)


Figure 15: Levene’s Test outcome

We have p-value = 0.08589 > α = 0.05, so we do not reject the hypothesis that the variances are equal. We therefore treat all populations as having the same standard deviation.
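For readers without the car package, the median-centered form of Levene's test can be sketched in base R as a one-way ANOVA on absolute deviations from each group's median. The example below is an assumption-laden illustration: it uses simulated total assets grouped by province and ownership, and the names demo, cell, and abs_dev are hypothetical. It does not reproduce the p-value reported above.

```r
set.seed(4)  # hypothetical data only
demo <- expand.grid(rep = 1:50,
                    province = c("Haiphong", "Hanoi", "TP HCM"),
                    own = c("Multi-owner", "One-owner"))
demo$totalass <- rnorm(nrow(demo), mean = 30000, sd = 5000)
demo$cell <- interaction(demo$province, demo$own)

# Absolute deviation of each value from the median of its own cell.
cell_median <- ave(demo$totalass, demo$cell, FUN = median)
demo$abs_dev <- abs(demo$totalass - cell_median)

# Levene's test (median-centered) is a one-way ANOVA on those deviations.
lev <- anova(lm(abs_dev ~ cell, data = demo))
lev_p <- lev[["Pr(>F)"]][1]
print(lev)
print(lev_p > 0.05)  # TRUE means equal variances are not rejected
```

With 6 cells and 300 observations, the test has 5 and 294 degrees of freedom.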

We use the two-way ANOVA test, as mentioned before, at a significance level of 0.05.

3. Test statistic and p-value:

We used RStudio to run the test and obtained the following output:

> VN.result <- aov(totalass ~ X.province * own, data = VN)
> summary(VN.result)

Figure 16: Two-way ANOVA outcome

4. Level of significance:

The level of significance: α = 0.05

5. Decision: Reject H0 if p-value < α. We have:

p-value province:own = 0.2199 > α = 0.05 => Do not reject H03
p-value province = 0.0494 < α = 0.05 => Reject H01
p-value own = 0.1483 > α = 0.05 => Do not reject H02
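The three decisions above each read one row of the ANOVA table. As a sketch of how those p-values can be pulled out programmatically, the example below fits the same kind of model to simulated data; the data frame demo and its values are hypothetical, so the p-values it prints are unrelated to those in Figure 16.

```r
set.seed(5)  # hypothetical data only
demo <- expand.grid(rep = 1:50,
                    province = c("Haiphong", "Hanoi", "TP HCM"),
                    own = c("Multi-owner", "One-owner"))
demo$totalass <- rnorm(nrow(demo), mean = 30000, sd = 5000)

# province * own expands to province + own + province:own.
fit <- aov(totalass ~ province * own, data = demo)
tab <- summary(fit)[[1]]

# Pull the three p-values out by row; row names are padded, so trim them.
p_values <- setNames(tab[["Pr(>F)"]][1:3], trimws(rownames(tab))[1:3])
print(round(p_values, 4))
print(p_values < 0.05)  # TRUE marks an effect significant at the 0.05 level
```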

Figure 17: Interaction plot of Province and Ownership

From the interaction plot, we can see that there is no interaction between Province and Ownership, as shown by the separate lines in the plot. Therefore, the plot is consistent with the conclusion of the two-way ANOVA test: there is no significant interaction between Province and Ownership. This is clearest for Haiphong and Hanoi, where the two lines are parallel.
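An interaction plot is a plot of cell means, so the same reading can be done numerically. The sketch below uses simulated data (the data frame demo is hypothetical): it computes the mean for each province and ownership cell, measures the gap between the two ownership lines at each province, and then draws the plot with the base-R function interaction.plot().

```r
set.seed(6)  # hypothetical data only
demo <- expand.grid(rep = 1:50,
                    province = c("Haiphong", "Hanoi", "TP HCM"),
                    own = c("Multi-owner", "One-owner"))
demo$totalass <- rnorm(nrow(demo), mean = 30000, sd = 5000)

# Cell means: one row per province, one column per ownership type.
cell_means <- tapply(demo$totalass, list(demo$province, demo$own), mean)
print(round(cell_means))

# Gap between the two ownership lines at each province; similar gaps
# mean roughly parallel lines, i.e. little visible interaction.
gap <- cell_means[, "One-owner"] - cell_means[, "Multi-owner"]
print(round(gap))

interaction.plot(demo$province, demo$own, demo$totalass,
                 xlab = "Province", ylab = "Mean total assets",
                 trace.label = "Ownership")
```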

Question 4: Discuss the credibility of the interpretations and conclusions of these tests. Is there anything we should be concerned about? Explain.

a. The credibility of the interpretations and conclusions

In this case study, we used the two-way ANOVA test in Question 3 to see whether there is any interaction between types of ownership and the location of the business. All the assumptions were checked thoroughly and are approximately satisfied. We concluded that the interpretations and conclusions in Question 3 are reliable to some degree: total assets differ significantly by province, but not by ownership or by the interaction of the two.

b. Limitations

Firstly, in Question 1, because the outliers are too large, we had to set axis limits to see the box plot clearly. Therefore, the box plot does not reflect the data fully; it shows only total assets values ranging from 2,000 to 60,000.

Secondly, in Question 2, the data do not satisfy 2 of the 3 conditions (normal distributions and equal standard deviations). Therefore, the conclusion from the one-way ANOVA test is not highly reliable.

Thirdly, when we ran the two-way ANOVA in Question 3, the same 2 conditions (normal distributions and equal standard deviations) were not satisfied, and we needed a more advanced procedure to check the last assumption. Therefore, the conclusion from the two-way ANOVA is somewhat reliable, but not fully trustworthy.

Lastly, both one-way and two-way ANOVA assume that all samples are selected by simple random sampling. We do not have any evidence for this assumption. Given the large sample size, we believe the data were randomly selected, but we cannot be sure that the samples came from the stated populations.

Question 5: Based on your dataset, make your own problem using simple/multiple linear regression. Interpret the output.

As initially observed from the scatter plot (Figure 7), we choose to analyze the relationship between Quantity sold and Quantity product, whether there

Date posted: 16/08/2024, 18:14