CASE STUDY REPORT
Tutor: Mrs Tran Thi Thu Hién Tutorial 4 - Group 5
Nguyễn Thị Hải Anh 2004040009 Nguyễn Minh Hằng 1904040038 Nguyễn Nghiêm Hưng 1904000122
Nguyễn Thị Bích Phượng 2004040090 Ngô Hồng Giang 2004050015
Hanoi, November 7th, 2023
TABLE OF CONTENTS
A Scenario
A Answering the questions
Question 1
Question 2
Question 3
Question 4
Question 5
REFERENCE
LIST OF FIGURES
Figure 1: Mean plot of total assets with 95% CI
Figure 2: Box plot of total assets
Figure 3: Histograms of firms’ total assets based on province
Figure 4: Box plot of firms’ total assets based on province and type of ownership
Figure 5: Interaction plot between the effects of province and ownership on total assets
Figure 6: Added-Variable Plots of quantity sold, quantity product vs total assets
A Scenario
The Viet Nam Small and Medium Enterprises (SME) database is an important source of data for scholars doing research on the Vietnamese economy and its micro dynamics. In 2015, the survey was carried out with a sample of over 2500 enterprises from nine provinces across the country (Viet Nam SME database, 2015).
The survey instrument consists of three modules: (i) a main enterprise questionnaire for owners or managers; (ii) an employee questionnaire administered to a random subset of employees in a quarter of randomly selected enterprises; and (iii) an economic accounts module.
In the survey, businesses were asked to provide:
● Address of the firm: Hanoi, Haiphong, TP HCM (province)
● Ownership status: One owner, Multiple owners (own)
● Quantity produced for the most important product (in revenue terms) (quantityproduct)
● Quantity sold, based on the quantity produced, for the most important product (quantitysold)
● Total assets at the end of 2014 (million VND, in market value) (totalass)
A — Answering the questions
Question 1: Produce descriptive statistics to summarize the data. You are expected to generate as many relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs of this course. Remember to provide appropriate interpretations for the descriptive statistics. Try not to include unnecessary or irrelevant descriptive statistics.
We produce the descriptive statistics using RStudio. First, we import the CSV file "Dataset5.csv" into R for further analysis:
> Dataset5 <- read.table("Dataset5.csv", header = TRUE, sep = ",", quote = "\"",
    stringsAsFactors = TRUE)
There are 300 observations in this case study, so we first look at the first few observations with the head() function in R to get a better sense of the data:
From the previous output, we can conclude that there are 300 observations with 5 variables: province, own, quantityproduct, quantitysold, totalass. After that, we use the str() function to obtain the structure of the data:
> str(Dataset5)
Next, we use the summary() function to obtain the length, class, mode, maximum, minimum, median, and mean of the variables in Dataset5:
[summary() output; columns shown: province, own, quantityproduct, quantitysold]
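As a minimal sketch of the import-and-inspect steps above, the block below writes a tiny made-up CSV to a temporary file and reads it back. The rows are invented for illustration and are not taken from Dataset5.csv; only the column names follow the report.

```r
# Toy stand-in for Dataset5.csv: the rows below are invented for illustration.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "province,own,quantityproduct,quantitysold,totalass",
  "Hanoi,One owner,120,100,5000",
  "Haiphong,Multiple owners,300,280,12000",
  "TP HCM,One owner,80,75,40000",
  "TP HCM,Multiple owners,500,450,90000"
), toy_csv)

# read.csv() defaults to header = TRUE, sep = "," and quote = "\"".
toy <- read.csv(toy_csv, stringsAsFactors = TRUE)

head(toy, 3)   # first three rows
str(toy)       # variable types: two factors, three integer columns
summary(toy)   # level counts for factors; min, quartiles, median, mean, max for numerics
```

Reading with `stringsAsFactors = TRUE` matters here because `summary()` reports level counts only for factor columns; character columns would show just length, class, and mode.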
Following that, we draw a mean plot to show the mean of each group and to compare the means across groups:
> install.packages("gplots")
> library(gplots)
> plotmeans(totalass ~ interaction(Dataset5$own, Dataset5$province), data = Dataset5,
    xlab = "Ownership and Province", ylab = "Totalass",
    main = "Mean Plot + with 95% CI")
[Figure: mean plot of Totalass; x-axis: Ownership and Province]
Figure 1: Mean plot of total assets with 95% CI
The mean plot shows the mean of each group and allows the means to be compared between groups. In the mean plot with a 95 percent confidence interval, there are six groups with six different means. As the chart shows, for both one-owner and multi-owner firms, TP HCM has the highest mean total assets compared with Haiphong and Hanoi.
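To make the quantities behind a mean plot concrete, here is a small sketch that computes each group's mean and a t-based 95% confidence interval by hand. The data and the helper name `mean_ci` are invented for illustration; this is one common way to build such intervals, not necessarily the exact computation used for Figure 1.

```r
# Invented toy data: total assets for two hypothetical groups.
toy <- data.frame(
  group    = rep(c("A", "B"), each = 5),
  totalass = c(10, 12, 14, 16, 18, 20, 25, 30, 35, 40)
)

# Mean and t-based confidence interval for one numeric vector.
mean_ci <- function(x, level = 0.95) {
  n    <- length(x)
  se   <- sd(x) / sqrt(n)                           # standard error of the mean
  half <- qt(1 - (1 - level) / 2, df = n - 1) * se  # half-width of the interval
  c(mean = mean(x), lower = mean(x) - half, upper = mean(x) + half)
}

# One row per group, with columns mean, lower, upper.
ci_table <- t(sapply(split(toy$totalass, toy$group), mean_ci))
print(ci_table)
```

Groups whose intervals do not overlap are the ones that stand out visually in a mean plot, although overlap alone is not a formal test.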
Next, we draw a box plot to examine the findings more closely:
> boxplot(Dataset5$totalass ~ interaction(Dataset5$own, Dataset5$province), data = Dataset5, xlab = "Province and Type of ownership", ylab = "Totalass", col = c("red",
[Figure: box plot of Totalass; x-axis: Province and Type of ownership]
Figure 2: Box plot of total assets
The figure displays the minimum and maximum values, medians, quartiles, and outliers of total assets, categorized by province and ownership. The multi-owner companies in Ho Chi Minh City are among those with the largest total assets, whereas the sole-owner firms in Hai Phong have the smallest. The box plot also indicates that total assets are right-skewed in all groups. The plot also shows outliers: the one-owner Hai Phong and one-owner Ha Noi groups have the most outliers compared to the other groups.
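For reference, the points a box plot draws as outliers can be listed directly. The sketch below uses base R's `boxplot.stats()` on an invented right-skewed sample; the numbers are illustrative only and do not come from the survey data.

```r
# Invented right-skewed sample, for illustration only.
x <- c(1, 2, 3, 4, 5, 100)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses by default.
stats <- boxplot.stats(x)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values beyond the whiskers, drawn as individual points
```

Running the same call per group (for example with `tapply`) would give the outlier count for each province and ownership combination.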
Question 2: Use analysis of variance to test for any significant differences due to province. Use a .05 level of significance, and for now, ignore the effect of types of ownership, quantity produced and quantity sold. Check all the assumptions of the inference technique you use.
In this question, we conduct an analysis of variance with a single factor, province. Therefore, we use a one-way ANOVA test, and we must check all the assumptions of this inference technique before running the test to ensure that our results are valid. The one-way ANOVA test has three assumptions:
● The samples are independent simple random samples (1)
● The populations are normally distributed (2)
● All population standard deviations are equal (3)
Firstly, we check assumption 1: whether the samples are independent simple random samples. The data were collected from randomly selected enterprises in three provinces (Hanoi, Haiphong, and TP HCM) that have no relation to one another. Therefore, the samples are independent.
Secondly, we check assumption 2: whether the populations are normally distributed. We use histograms:
We can see that the distributions are not symmetrical, but rather right-skewed. Therefore, we infer that the assumption of normally distributed populations is not satisfied.
Finally, we check assumption 3 by calculating the ratio of the largest standard deviation to the smallest and comparing it to 2 (as a rule of thumb, the assumption is considered met when this ratio is below 2).
> by(dataset5$totalass, dataset5$X.province, sd)
OUTPUT
dataset5$X.province: Haiphong
[1] 20036.26
dataset5$X.province: Hanoi
[1] 7819.229
dataset5$X.province: TP HCM
[1] 56633.3
Largest standard deviation / smallest standard deviation = 56633.3 / 7819.229 = 7.242824
> 56633.3/7819.229
[1] 7.242824
The ratio is approximately 7.24, which is well above 2. We therefore conclude that the populations have different standard deviations (and hence different variances), so assumption 3 is not satisfied.
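The ratio check can be written as a short script. The sketch below plugs in the three standard deviations reported in the by() output; the variable names `sds`, `sd_ratio`, and `equal_sd_ok` are ours, and the threshold of 2 is the rule of thumb used in this report.

```r
# Standard deviations of totalass by province, as reported in the by() output.
sds <- c(Haiphong = 20036.26, Hanoi = 7819.229, "TP HCM" = 56633.3)

# Ratio of the largest to the smallest standard deviation.
sd_ratio <- max(sds) / min(sds)
sd_ratio

# Rule of thumb: treat the equal-SD assumption as met only if the ratio is below 2.
equal_sd_ok <- sd_ratio < 2
equal_sd_ok
```

With the raw data available, `tapply(dataset5$totalass, dataset5$province, sd)` would produce the `sds` vector directly instead of typing the values in.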
As 2 out of 3 ANOVA assumptions are not met, we run the Kruskal-Wallis test instead of the ANOVA test. The Kruskal-Wallis test is a nonparametric way of testing for differences among populations that are not normally distributed.
1. Hypotheses:
H0: All population distributions are identical
Ha: Some populations are significantly different from others
2. Assumptions:
● The objective is to compare 3 populations based on province
● The samples are independent simple random samples
● The data are quantitative but not normally distributed
The assumptions were checked above, so we carry out the Kruskal-Wallis test.
> kruskal.test(totalass ~ province, data = dataset5)
Kruskal-Wallis rank sum test
data: totalass by province
Kruskal-Wallis chi-squared = 15.983, df = 2, p-value = 0.0003384
3. Test statistic: H = 15.983 (rounded to 3 decimal places)
4. Level of significance: α = 0.05
5. Decision rule: Reject H0 if p < 0.05
As seen in the Kruskal-Wallis R output, p-value = 0.0003 < 0.05, so we reject the null hypothesis H0.
6. Conclusion:
At the 5% level of significance, there is enough evidence to support Ha: total assets are systematically higher in some populations than in others. This means there is a difference due to province.
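As a cross-check on the decision step, the p-value can be recomputed from the reported statistic, since the Kruskal-Wallis statistic is referred to a chi-squared distribution with (number of groups - 1) degrees of freedom. The sketch below uses only the numbers printed in the R output; the variable names are ours.

```r
H  <- 15.983   # Kruskal-Wallis chi-squared statistic from the R output
df <- 2        # three provinces minus one

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(H, df = df, lower.tail = FALSE)
p_value

# Decision at the 5% level of significance.
reject_H0 <- p_value < 0.05
reject_H0
```

The result should agree with the reported p-value up to rounding, because H is entered here to only three decimal places.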
Question 3: At the .05 level of significance, test for any significant differences due to province, types of ownership, and interaction (ignore the effect of quantity produced and quantity sold). Check all the assumptions of the inference technique you use. Are the assumptions satisfied? Explain. Draw an interaction plot and interpret the plot. Is the plot consistent with the conclusions?
To assess the effect of two independent variables (factors) on one dependent variable, together with the interaction between them, two-way factorial analysis of variance is the most suitable technique.
Ha: There is interaction between province and types of ownership
2. Check assumptions:
● Samples are independent simple random samples of size n from each of k (= a*b) populations
● All populations are normally distributed
● All populations have the same standard deviation
First, we check that the samples are independent. A cross-tabulation of the province and own variables gives the sample size for each stratum:
Next, we check the assumption of equal standard deviations by looking at the output of the by() function in R for all six groups:
> by(dataset5$totalass, list(dataset5$province, dataset5$own), sd)
: Haiphong
: Multi-owner
[1] 26067.86
: Hanoi
: Multi-owner
[1] 6936.263
: TP HCM
: Multi-owner
[1] 78797.02
: Haiphong
The output shows that the populations do not all have the same standard deviation. Therefore, the equal standard deviation assumption is not satisfied.
Finally, we use the boxplot to check the normality of populations:
> boxplot(totalass ~ interaction(own, province), data = dataset5,
    xlab = "Province and Type of ownership", ylab = "Totalassets",
    col = c("red", "purple", "orange", "yellow", "beige", "maroon"),
    ylim = c(0, 500000))
[Figure: box plot of Totalassets; x-axis: Province and Type of ownership]
Figure 4: Box plot of firms’ total assets based on province and type of ownership
The box plots suggest that the distributions are not normal. The median lines sit off-center in the boxes, indicating a lack of symmetry. Most of the samples are right-skewed, with the exception of the one-owner firms based in TP HCM. Therefore, the assumption of normality is not met.
Because these assumptions fail, we opt for a method designed for analyses of variance on factorial models when the assumptions of traditional parametric ANOVA, such as normality and homoscedasticity, are not met: the Aligned Rank Transform (ART) for nonparametric factorial ANOVAs. It aligns and ranks the data and then runs a series of ANOVAs on the transformed data, so that the traditional ANOVA model can be applied (Wobbrock, J.O. et al., 2011). The ART technique provides accurate nonparametric treatment of both main and interaction effects.
Firstly, we install and load the ARTool package:
> install.packages("ARTool")
> library(ARTool)
Secondly, we transform the data using the aligned rank transform (ART):
> totalass_art <- art(totalass ~ own * province, data = dataset5)
We then verify that the ART procedure was correctly applied and is appropriate for this dataset, as follows:
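The ART workflow described above can be sketched end to end on invented data. The block assumes the ARTool package is installed and skips the analysis otherwise; the data frame `toy` is simulated and is not the survey data, so its output says nothing about the real firms.

```r
# Simulated toy data with the report's variable names; not the survey data.
set.seed(42)
toy <- expand.grid(
  own      = c("One owner", "Multiple owners"),
  province = c("Hanoi", "Haiphong", "TP HCM"),
  rep      = 1:10
)
toy$own      <- factor(toy$own)        # both factors must be of class factor
toy$province <- factor(toy$province)
toy$totalass <- rexp(nrow(toy), rate = 1 / 10000)  # right-skewed response

if (requireNamespace("ARTool", quietly = TRUE)) {
  m <- ARTool::art(totalass ~ own * province, data = toy)
  print(summary(m))  # alignment check: column sums of aligned responses should be about 0
  print(anova(m))    # F tests for own, province, and own:province on the aligned ranks
} else {
  message("ARTool is not installed; run install.packages(\"ARTool\") first.")
}
```

The `summary()` call is the verification step and `anova()` is the test step, which is the order the report follows.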
ANOVA Two-way test
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Test statistic:
F ownership = 34.6064
F province = 9.8526
F own:province = 6.6387
3. Decision Rule and Conclusion
Decision rule: Reject H0 if p-value < α (α = 0.05).
Interaction effect of types of ownership and province:
p-value = 0.0015 < 0.05. Therefore, we reject H0. There is enough evidence to conclude that there is an interaction effect between province and type of ownership on total assets. We represent this relationship in the graph below.
> interaction.plot(dataset5$province, dataset5$own, dataset5$totalass, type = "b",
    col = c("blue", "red"), pch = c(16, 18), main = "Interaction between Province and Ownership")
[Figure: Interaction between Province and Ownership]
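An interaction plot draws one line per ownership type through the cell means, so the same information can be read from a table of means. The sketch below uses invented numbers (not the survey data) chosen so that the gap between ownership types differs by province, which is what non-parallel lines look like numerically.

```r
# Invented toy data: two firms per province-by-ownership cell.
toy <- data.frame(
  province = rep(c("Hanoi", "Haiphong", "TP HCM"), each = 4),
  own      = rep(c("One owner", "Multiple owners"), times = 6),
  totalass = c(10, 20, 12, 22,   8, 15, 10, 17,   30, 90, 34, 94)
)

# Cell means: rows are provinces, columns are ownership types.
cell_means <- tapply(toy$totalass, list(toy$province, toy$own), mean)
cell_means

# Gap between ownership types within each province.
gap <- cell_means[, "Multiple owners"] - cell_means[, "One owner"]
gap
```

Roughly equal gaps across provinces would correspond to parallel lines and little sign of interaction; clearly unequal gaps, as in this toy example, correspond to lines that diverge or cross.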
a. The credibility of the interpretations and conclusions
For question 2, two of the ANOVA assumptions are violated, so we replaced the ANOVA test with the Kruskal-Wallis test, which is suitable for this case study. The conclusion to reject H0 is reliable because the assumptions of the nonparametric test match our data. Regarding question 3, we applied the Aligned Rank Transform to our data so that the two-way ANOVA model could be applied. According to researchers from the University of Washington, the Aligned Rank Transform (ART) procedure was devised to analyze multi-factor nonparametric designs. The conclusions align with our visual depiction of the dataset and corroborate the relationship between the factors.
b. Limitations of the case
The case has several drawbacks. Firstly, the distributions of the numerical variables (quantityproduct, quantitysold, totalass) are non-normal. They are heavily right-skewed, violating the normality assumption of the ANOVA test. Secondly, the population variances are not equal. Consequently, we cannot perform ANOVA tests directly, and we turn to alternative methods such as the Kruskal-Wallis test and the Aligned Rank Transform, which have drawbacks of their own. The Kruskal-Wallis test has lower statistical power than its parametric counterparts because it does not rely on a distributional assumption, so the result, while valid, may not be as convincing. Our dataset exhibits extreme skewness, and the ART method reduces that skew, which may be undesirable if the distributions are meaningful to the case study (Wobbrock, J.O. et al., 2011). All in all, we cannot say with full confidence whether the samples effectively represent the population of Vietnamese small and medium enterprises at the time.
Question 5: Based on your dataset, make your own problem using simple/multiple linear regression. Interpret the output.
Multiple linear regression is useful for modeling the relationship between a numeric outcome (dependent variable, Y) and multiple explanatory (independent) variables (X). In a balance sheet, total assets are calculated as the sum of all short-term, long-term, and other assets. These include cash and inventory. Therefore, we surmise that the number of goods produced and sold may have some influence on total assets. In this case, quantitysold and quantityproduct are the independent variables, whereas totalass is the dependent variable.
# Fit model using quantitysold and quantityproduct as X-variables
> multiple.regression <- lm(totalass ~ quantitysold + quantityproduct, data = dataset5)
> summary(multiple.regression)
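To show which parts of the lm() output are usually interpreted, here is a sketch on simulated data. The data frame `toy` and its true coefficients (intercept 10, slopes 2 and 3) are invented for illustration, so the numbers say nothing about the real firms.

```r
# Simulated data with known coefficients; not the survey data.
set.seed(1)
n <- 100
toy <- data.frame(
  quantitysold    = runif(n, 0, 10),
  quantityproduct = runif(n, 0, 10)
)
toy$totalass <- 10 + 2 * toy$quantitysold + 3 * toy$quantityproduct + rnorm(n)

# Same model form as in the report.
fit <- lm(totalass ~ quantitysold + quantityproduct, data = toy)

est <- coef(fit)            # intercept and slopes
est
summary(fit)$coefficients   # estimates, standard errors, t values, p-values
summary(fit)$r.squared      # share of variance in totalass explained by the model
```

Each slope is read as the expected change in totalass for a one-unit increase in that predictor while the other predictor is held fixed. In real data, quantity sold and quantity produced are likely to be strongly correlated, which would make the individual slopes harder to interpret than in this simulation.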