STATISTICS FOR ECONOMICS
Instructor: Ms Lai Hoai Phuong
Lưu Thị Thanh Ngân 2004040079
Table of Contents
Tables of figure
A Scenario
B Answering questions
Question 1:
Question 2:
Question 3:
Question 4:
Question 5:
Question 6:
Table of figures
Figure 1 Some first observations of the data set
Figure 2 Structure of the data when factors have not been converted yet
Figure 3 Frequency table of sample size
Figure 4 Mean of dataset
Figure 5 Median of dataset
Figure 6 Standard deviation
Figure 7 Summary
Figure 8 Structure of new data
Figure 9 Frequency table of sample size
Figure 10 Box plot
Figure 11 Mean plot with 95% CI
Figure 12 QQ plot
Figure 13 Standard deviation
Figure 14 Levene test's result
Figure 15 Output
Figure 16 QQ plot
Figure 17 Standard deviation
Figure 18 Levene test's result
Figure 19 Output
Figure 20 Frequency table of sample size
Figure 21 Levene test's result
Figure 22 QQ plot
Figure 23 Interaction plot between Province and Ownership
A Scenario
The database of the annual Vietnamese Enterprise Surveys (VESs) is an important source of data for scholars doing research on Vietnam's economy and its micro dynamics. In 2004, the survey was carried out with a sample size of more than 2 million businesses in all provinces across the country. The household questionnaire contained many sections, each of which covered a separate aspect of business activities, and profitability was one important indicator. In the survey, businesses were asked to specify their site of operation (province), types of ownership (own) and profitability (roa). The objective of our study is to test for any significant interaction between provinces and types of ownership and to test for any significant differences in the profitability of businesses due to these two variables.
A portion of the VES data is to be given to each group by your tutor. In the given dataset, 1 represents firms from Hanoi, 2 represents firms from Danang and 3 represents firms from Ho Chi Minh City.
B Answering questions
Question 1:
Produce descriptive statistics to summarize the data. You are expected to generate as many relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs of this course. Remember to provide appropriate interpretations for the descriptive statistics. Try not to include unnecessary or irrelevant descriptive statistics.
We use RStudio to produce the descriptive statistics for this question. First, we import the CSV file "Dataset2.csv" into R for further calculation:
➢ Dataset2 <- read.table("Dataset2.csv", header = TRUE, sep = ",", quote = "\"", stringsAsFactors = FALSE)
There are 180 observations in this case study, so we use head(Dataset2) to view the first few observations and get a better feel for the data:
➢ head(Dataset2)
Figure 1 Some first observations of the data set
The internal structure of the data can be obtained by:
➢ str(Dataset2)
Figure 2 Structure of the data when factors have not been converted yet
From the above output, it is clear that there are 180 observations of 3 variables: roa, own and province. Since province and own are stored as characters, we convert them into factors using R code such as the following (shown for province):
➢ Dataset2$province <- factor(Dataset2$province, levels = c("1", "2", "3"), labels = c("HaNoi", "DaNang", "HCM"))
A frequency table can be created to see the sample size of each treatment group with the following R code:
➢ table(Dataset2$province, Dataset2$own)
Figure 3 Frequency table of sample size
It can be seen that all 6 treatment groups have the same sample size of 30. Equal group sizes make this data set well suited to the ANOVA tests that follow.
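The factor conversion and the balance check can be tried on a small stand-in table. This is only a sketch: the data frame toy is invented for illustration, and the ownership coding (1 = state-owned, 2 = private-owned) is an assumption, since the report does not show the codes used for own.

```r
# Toy data standing in for Dataset2: 3 provinces x 2 ownership types, 2 firms each.
toy <- expand.grid(rep = 1:2, province = 1:3, own = 1:2)
toy$roa <- seq(0.01, 0.12, length.out = nrow(toy))   # made-up profitability values

# Same pattern as the province conversion in the report.
toy$province <- factor(toy$province, levels = c(1, 2, 3),
                       labels = c("HaNoi", "DaNang", "HCM"))

# Hypothetical coding for ownership: 1 = state-owned, 2 = private-owned.
toy$own <- factor(toy$own, levels = c(1, 2),
                  labels = c("State", "Private"))

# Cross-tabulate the two factors and check that every cell has the same count.
counts <- table(toy$province, toy$own)
print(counts)
balanced <- length(unique(as.vector(counts))) == 1
print(balanced)
```

A design is balanced when every cell of this table holds the same count, which is what the frequency table in Figure 3 reports for the real data.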
Next, we use the by() function in R to find several descriptive statistics (mean, median, standard deviation and summary) for each treatment group defined by the two factors. The code and its output are shown in turn:
➢ by(Dataset2$roa, list(Dataset2$province, Dataset2$own), mean)
Figure 4 Mean of dataset
➢ by(Dataset2$roa, list(Dataset2$province, Dataset2$own), median)
Figure 5 Median of dataset
➢ by(Dataset2$roa, list(Dataset2$province, Dataset2$own), sd)
Figure 6 Standard deviation
➢ by(Dataset2$roa, list(Dataset2$province, Dataset2$own), summary)
Figure 7 Summary
Each call gives one descriptive statistic of the outcome variable for each treatment group defined by province and ownership. The final call, with summary, returns 6 basic statistics: the minimum, the first quartile, the median, the mean, the third quartile and the maximum.
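As an alternative to four separate by() calls, the same statistics can be collected into one table with aggregate(). The sketch below uses simulated values, not the VES data, and the names sim and stats are illustrative.

```r
set.seed(1)
# Simulated stand-in for Dataset2 (made-up values): 3 provinces x 2 ownership types, 30 firms each.
sim <- expand.grid(rep = 1:30,
                   province = c("HaNoi", "DaNang", "HCM"),
                   own = c("State", "Private"))
sim$roa <- rnorm(nrow(sim), mean = 0.04, sd = 0.02)

# One row per treatment group, with several statistics side by side.
stats <- aggregate(roa ~ province + own, data = sim,
                   FUN = function(x) c(n = length(x), mean = mean(x),
                                       median = median(x), sd = sd(x)))
stats <- do.call(data.frame, stats)  # flatten the matrix column that aggregate() returns
print(stats)
```

Having the group size, mean, median and standard deviation in adjacent columns makes it easier to compare groups than reading four separate outputs.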
Finally, we draw a box plot and a mean plot to explore the data further.
When plotting the data, we realized that there are many outliers lying very far from the rest of the observations, which compresses the boxes in the box plot. Therefore, we filtered the data to obtain a new data table, newdata.
The new data has 150 observations of 3 variables, and each of the 6 groups has a sample size of 25.
Figure 8 Structure of new data
Figure 9 Frequency table of sample size
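The report does not state which rule was used to filter the outliers. One common option, shown here purely as an assumption, is the 1.5 × IQR fence applied within each group. The data are simulated, and the helper inside_fences is a hypothetical name.

```r
set.seed(2)
# Simulated stand-in for Dataset2 with a few extreme values injected (made-up numbers).
sim <- expand.grid(rep = 1:30,
                   province = c("HaNoi", "DaNang", "HCM"),
                   own = c("State", "Private"))
sim$roa <- rnorm(nrow(sim), mean = 0.04, sd = 0.02)
sim$roa[c(5, 40, 100)] <- c(60, -45, 90)   # injected outliers

# Assumed rule (not stated in the report): within each group, keep values
# inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
inside_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x >= q[1] - fence & x <= q[2] + fence
}
# ave() applies the helper group by group and returns 0/1 values in the original row order.
keep <- as.logical(ave(sim$roa, sim$province, sim$own, FUN = inside_fences))
filtered <- sim[keep, ]

print(table(filtered$province, filtered$own))
```

A rule like this does not guarantee equal group sizes afterwards, so the 25 observations per group reported above suggest the authors may have used a different procedure.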
The box plot below shows the medians, quartiles, maximum and minimum values and outliers of the new dataset:
➢ boxplot(roa ~ province + own, data = newdata, xlab = "Province and ownership", ylab = "Profitability", main = "boxplot", col = c("lightcoral", "skyblue2", "hotpink2", "palegreen", "purple1", "orangered3"))
Figure 10 Box plot
The box plot shows several descriptive statistics for the six groups: medians, quartiles, maximum and minimum values. Each group has its own characteristics. Based on the R output, the Ho Chi Minh private-owned group has the highest median value. The maximum value of the Ho Chi Minh state-owned group is higher than that of the other groups. The Da Nang private-owned group is the most stable: its within-group variability is the smallest, as shown by the smallest interquartile range and the narrowest range between maximum and minimum values. The private-owned groups in all provinces have relatively lower values; however, the Ho Chi Minh private-owned group has an outlier that reaches nearly 0.6, which is the highest value in the dataset.
The box plot also shows the skewness of the dataset. While the profitability of the private-owned and state-owned groups in Ho Chi Minh City has a symmetric distribution, the 2 groups in Ha Noi are negatively skewed (skewed left). On the other hand, the distributions in Da Nang are positively skewed (skewed right).
A mean plot can also be used to identify the mean value of each group and to compare means between groups. Before creating the mean plot in RStudio, we need to install the gplots package with install.packages("gplots"); we then use the following code to obtain the plot:
➢ library(gplots)
➢ plotmeans(roa ~ interaction(province, own), data = newdata, xlab = "Province and ownership", ylab = "Profitability", main = "Mean Plot with 95% CI")
Figure 11 Mean plot with 95% CI
There are 6 groups presented in the mean plot, each with a 95% confidence interval. The pattern in this graph is consistent with the means generated above through the by() function. Looking at the chart, there is a distinct pattern between private-owned and state-owned firms in the 3 provinces. While the mean of the Ho Chi Minh state-owned group is about 0.06, the means of the other groups remain under 0.05. The Ho Chi Minh state-owned group has the highest mean and the Ha Noi private-owned group has the lowest. Moreover, the sample means of the 6 groups differ, which motivates the ANOVA tests in the following questions to check whether these differences are statistically significant.
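The error bars in a mean plot of this kind are, to our understanding, t-based confidence intervals computed separately for each group. A minimal sketch for a single group, using simulated values rather than the VES data:

```r
set.seed(3)
# Simulated profitability values for one group of 25 firms (made-up numbers).
roa <- rnorm(25, mean = 0.04, sd = 0.02)

n      <- length(roa)
m      <- mean(roa)
se     <- sd(roa) / sqrt(n)              # standard error of the mean
t_crit <- qt(0.975, df = n - 1)          # two-sided 95% critical value
ci     <- c(lower = m - t_crit * se, upper = m + t_crit * se)
print(ci)

# Cross-check against the interval reported by t.test().
print(t.test(roa)$conf.int)
```

Intervals built this way describe each group on its own; overlapping bars are a visual hint, not a substitute for the formal ANOVA test.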
Question 2:
Use analysis of variance to test for any significant differences due to province. Use a .05 level of significance, and for now, ignore the effect of types of ownership. Check all the assumptions of the inference technique you use. Are the assumptions satisfied? Explain.
One-way analysis of variance compares the means of two or more independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. For this question, one-way ANOVA is considered the most suitable test, and it relies on the following assumptions:
- Samples are independent, simple random samples
- All populations in question are normally distributed
- All population standard deviations are equal
1. Hypotheses
Ho: All population means are equal
Ha: At least two population means are different
2. Check assumptions
Assumption 1: Samples are independent, simple random samples
In 2004, the annual Vietnamese Enterprise Surveys (VESs) were carried out with a sample size of more than 2 million businesses in all provinces across the country. A portion of the VES data was given to each group by our tutor. Therefore, we assume that the observations in each group were collected as independent random samples.
Assumption 2: All populations in question are normally distributed
We can check the normality assumption graphically via a Q-Q plot. We can draw an individual Q-Q plot for each sample and check whether all samples are normally distributed. Alternatively, we can draw one plot to check the normality of the residuals:
➢ install.packages("car")
➢ library(car)
➢ qqPlot(lm(roa ~ province, data = Dataset2), simulate = TRUE, labels = FALSE)
Figure 12 QQ plot
For normally distributed data, observations should lie approximately on a straight line. If the data are non-normal, the points form a curve that deviates markedly from a straight line. Looking at the plot, we can see that almost all the points lie on a straight line. So, based on this Q-Q plot, it is reasonable to treat the assumption “All populations in question are normally distributed” as satisfied.
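A single residual plot can replace one plot per group because the residuals of a one-factor model are simply the observations with their own group mean subtracted. The sketch below checks this on simulated data (sim and its values are invented), and uses base R's qqnorm() in place of car::qqPlot() so that no extra package is needed.

```r
set.seed(4)
# Simulated stand-in data (made-up numbers): three provinces, 60 firms each.
sim <- data.frame(
  province = factor(rep(c("HaNoi", "DaNang", "HCM"), each = 60)),
  roa      = rnorm(180, mean = rep(c(0.03, 0.04, 0.05), each = 60), sd = 0.02)
)

fit <- lm(roa ~ province, data = sim)

# Residuals of the one-factor model equal observations minus their group mean.
centered <- sim$roa - ave(sim$roa, sim$province)
print(all.equal(unname(resid(fit)), centered))

# Q-Q coordinates of the pooled residuals; a correlation near 1 means
# the points lie close to a straight line.
qq <- qqnorm(resid(fit), plot.it = FALSE)
print(cor(qq$x, qq$y))
```

Pooling the centered observations gives one plot with all 180 points, which is easier to judge than three small plots.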
Assumption 3: All population standard deviations are equal
"All population standard deviations are equal" if the largest standard deviation is less than twice the smallest standard deviation. That means the ratio between the largest standard deviation and the smallest standard deviation is less than 2. We can find the standard deviations by using the by() function in R:
➢ by(Dataset2$roa, Dataset2$province, sd)
Figure 13 Standard deviation
From this result, we have
$$\frac{\text{Largest standard deviation}}{\text{Smallest standard deviation}} = \frac{65.61424}{0.07268901} = 902.6707063$$
We see that this ratio is far greater than 2, which means that we cannot claim “all population standard deviations are equal” on the basis of this rule of thumb.
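The rule-of-thumb check is a one-line calculation. The sketch below reuses the two standard deviations quoted above; the variable names are illustrative.

```r
# Standard deviations quoted in the text above (largest and smallest by province).
sd_largest  <- 65.61424
sd_smallest <- 0.07268901

ratio <- sd_largest / sd_smallest
print(ratio)

# Rule of thumb: treat the spreads as similar only when the ratio is below 2.
rule_satisfied <- ratio < 2
print(rule_satisfied)
```

A ratio this extreme is consistent with the earlier observation that the raw data contain outliers lying very far from the other values.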
So, we need to use Levene’s test to check this equal-variance assumption. We use the function leveneTest in the package named “car”:
➢ install.packages('car')
➢ library(car)
➢ leveneTest(Dataset2$roa, Dataset2$province, center = median)
o Hypotheses: Ho: All population standard deviations are equal
Ha: Not all population standard deviations are equal
The result after running the test in R:
Figure 14 Levene test's result
From the result, we see that p-value = 0.372 > α = 0.05, so we do not reject Ho. That means there is insufficient evidence against “All population standard deviations are equal”, and we treat the assumption as satisfied.
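With center = median, Levene's test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. A minimal base-R sketch of that idea on simulated data (not the VES data; all names are illustrative):

```r
set.seed(5)
# Simulated stand-in data (made-up numbers): three provinces, 60 firms each.
sim <- data.frame(
  province = factor(rep(c("HaNoi", "DaNang", "HCM"), each = 60)),
  roa      = rnorm(180, mean = 0.04, sd = 0.02)
)

# Step 1: absolute deviation of each observation from its group median.
abs_dev <- abs(sim$roa - ave(sim$roa, sim$province, FUN = median))

# Step 2: ordinary one-way ANOVA on those deviations.
levene_tab <- anova(lm(abs_dev ~ sim$province))
f_stat  <- levene_tab[["F value"]][1]
p_value <- levene_tab[["Pr(>F)"]][1]
cat("F =", f_stat, " p =", p_value, "\n")
```

Because the test works on deviations from the median rather than on squared deviations from the mean, it is less sensitive to a few extreme values than the raw standard deviation ratio.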
3. Test statistic: F = MSG/MSE = 1.004
Run one-way ANOVA:
➢ aov1 <- aov(roa ~ province, data = Dataset2)
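The statistic F = MSG/MSE can be reproduced by hand from the group means, which shows what aov() computes internally. The sketch below uses simulated data, so its F value will not match the 1.004 reported above.

```r
set.seed(6)
# Simulated stand-in data (made-up numbers): three provinces, 60 firms each.
sim <- data.frame(
  province = factor(rep(c("HaNoi", "DaNang", "HCM"), each = 60)),
  roa      = rnorm(180, mean = 0.04, sd = 0.02)
)

k <- nlevels(sim$province)            # number of groups
n <- nrow(sim)                        # total sample size
grand_mean  <- mean(sim$roa)
group_means <- tapply(sim$roa, sim$province, mean)
group_sizes <- tapply(sim$roa, sim$province, length)

ssg <- sum(group_sizes * (group_means - grand_mean)^2)   # between-group sum of squares
sse <- sum((sim$roa - ave(sim$roa, sim$province))^2)     # within-group sum of squares
msg <- ssg / (k - 1)
mse <- sse / (n - k)
f_manual <- msg / mse
p_manual <- pf(f_manual, k - 1, n - k, lower.tail = FALSE)

# Compare with the F value that aov() reports.
f_aov <- summary(aov(roa ~ province, data = sim))[[1]][["F value"]][1]
print(c(manual = f_manual, aov = f_aov, p = p_manual))
```

The null hypothesis is rejected at the 0.05 level when the p-value from the F distribution with k − 1 and n − k degrees of freedom falls below 0.05.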
o Hypotheses: Ho: All population means are equal
Ha: The population means are different
2. Check assumptions
In 2004, the annual Vietnamese Enterprise Surveys (VESs) were carried out with a sample size of more than 2 million businesses in all provinces across the country. A portion of the VES data was given to each group by our tutor. Hence, we assume that the observations within each group were obtained by random sampling. Moreover, the profitability of state-owned firms must be independent of the profitability of private-owned firms.
Assumption 2: For each population, the response variable is normally distributed.
A quantile-quantile plot, also called a Q-Q plot, is used to check the normality of the data, so the following R code is used:
➢ qqPlot(lm(roa ~ own, data = Dataset2), simulate = TRUE, labels = FALSE)
The qqPlot function plots the quantiles of the model residuals on one axis against the corresponding theoretical quantiles of the reference distribution on the other axis. If the two distributions match, the points fall along a straight line.
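The construction of a normal Q-Q plot can be spelled out in a few lines: sort the sample, then pair each ordered value with a standard normal quantile at an evenly spaced probability. The sketch below uses a simulated sample and base R's qqnorm() as the reference, not car::qqPlot().

```r
set.seed(7)
x <- rnorm(40, mean = 0.04, sd = 0.02)   # made-up sample

# Theoretical quantiles: standard normal quantiles at evenly spaced probabilities.
theoretical <- qnorm(ppoints(length(x)))
# Sample quantiles: the ordered observations.
sample_q <- sort(x)

# These pairs are the points that a normal Q-Q plot draws.
qq <- qqnorm(x, plot.it = FALSE)
print(all.equal(sort(qq$x), theoretical))
print(all.equal(sort(qq$y), sample_q))
```

If the sample comes from a normal distribution, sample_q is approximately a linear function of theoretical, which is why the points line up.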
We use the by() function below in R to calculate the standard deviations:
➢ by(Dataset2$roa, Dataset2$own, sd)
Figure 17 Standard deviation
From this result, we take the ratio:
$$\frac{\text{Largest standard deviation}}{\text{Smallest standard deviation}} = \frac{53.57284}{0.08292021} = 646.0777$$
The number 646.0777 is much greater than 2, which means that we cannot claim the variance is the same for all of the populations. Therefore, we use another technique, Levene’s test, to check the equality of all variances before running a one-way ANOVA test. First, we set the hypotheses for Levene’s test:
H0: The variance among groups is equal
Ha: The variance among different groups is not equal
Then we use the R code below to find the p-value of Levene's test:
➢ install.packages('car')
➢ library(car)
➢ leveneTest(Dataset2$roa, Dataset2$own, center = median)
Figure 18 Levene test's result
If the p-value of the Levene test is smaller than the significance level (α = 0.05), we reject the null hypothesis that the variances are equal for all populations. From the results above, the p-value (0.3174) is greater than the significance level (α = 0.05), so we cannot reject H0. That means we treat the population variances as equal.
3. Test statistic
To run the one-way ANOVA test, the R code is:
➢ aov <- aov(roa ~ own, data = Dataset2)
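When the factor has only two levels, as ownership appears to have here (state-owned and private-owned), one-way ANOVA is equivalent to the pooled two-sample t-test: the F statistic equals the square of the t statistic. A sketch on simulated data, with invented group means:

```r
set.seed(8)
# Simulated stand-in data (made-up numbers): two ownership types, 90 firms each.
sim <- data.frame(
  own = factor(rep(c("State", "Private"), each = 90)),
  roa = rnorm(180, mean = rep(c(0.045, 0.035), each = 90), sd = 0.02)
)

f_stat <- summary(aov(roa ~ own, data = sim))[[1]][["F value"]][1]
# var.equal = TRUE gives the pooled-variance t-test that matches ANOVA's assumption.
t_stat <- unname(t.test(roa ~ own, data = sim, var.equal = TRUE)$statistic)

print(c(F = f_stat, t_squared = t_stat^2))
```

The two tests also give the same p-value, so either can be reported for a two-level factor.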
Although two-way factorial analysis of variance is usually the optimal inference method for this case, it is vital to check all of its assumptions before running our two-way ANOVA in order to ensure the validity of our findings.
Two-way ANOVA requires three assumptions:
• Samples are independent, simple random samples from each population
• All populations are normally distributed
• All populations have the same standard deviation (equivalently, the same variance): $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$
Assumption 1: Samples are independent
… populations. The research consists of separate simple random samples. The code below shows, in the R output, that the 6 groups are of equal size:
➢ table(Dataset2$province, Dataset2$own)
Figure 20 Frequency table of sample size
Assumption 2: All populations have the same standard deviation
Secondly, we check assumption 2, that the standard deviations are equal. Looking at the output of the by() function in R for the six province-ownership groups, we observe that the ratio of the largest sample standard deviation to the smallest sample standard deviation (= 92.7883 / 0.03831905) is around 2421.466, which is greater than 2. As a result, by this rule of thumb, the standard deviations cannot be treated as the same in all populations.
The Levene test is a different method that can be used to verify the assumption of equal standard deviations. This test determines whether the variances of the samples are approximately equal (homogeneous). The procedure is to compare the Levene test's p-value to our significance level (α = 0.05). Equal variances are assumed if the p-value is greater than 0.05, that is, if the Levene test is non-significant, and vice versa.
The Levene test, however, is only needed when the ratio of standard deviations is inconclusive (between 2 and 3). In this scenario, the Levene test was not strictly necessary. The code for running the Levene test is as follows:
➢ install.packages("car")
➢ library(car)
➢ leveneTest(Dataset2$roa, interaction(Dataset2$province,Dataset2$own), center= median)
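Once the assumptions have been checked, the two-way model with interaction can be fitted with the formula roa ~ province * own. The sketch below runs on simulated data with no real effects built in, so it illustrates only the mechanics; sim, fit2 and cell_means are illustrative names.

```r
set.seed(9)
# Simulated stand-in for Dataset2 (made-up values): 3 provinces x 2 ownership types, 30 firms each.
sim <- expand.grid(rep = 1:30,
                   province = c("HaNoi", "DaNang", "HCM"),
                   own = c("State", "Private"))
sim$roa <- rnorm(nrow(sim), mean = 0.04, sd = 0.02)

# province * own expands to province + own + province:own.
fit2 <- aov(roa ~ province * own, data = sim)
tab  <- summary(fit2)[[1]]
print(tab)

# Cell means behind an interaction plot: roughly parallel profiles suggest no interaction.
cell_means <- with(sim, tapply(roa, list(province, own), mean))
print(cell_means)
```

The ANOVA table has one row for each main effect, one for the province:own interaction and one for the residuals; the interaction row is the one to read first, because a significant interaction changes how the main effects should be interpreted.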