
DOCUMENT INFORMATION

Basic information

Title: Statistics For Economics
Authors: Trịnh Phương Anh, Chu Bích Diệp, Lê Thanh Hà, Trần Thị Thu Hương, Lưu Thị Thanh Ngân, Lê Thị Hương Thảo, Hoàng Minh Trang
Supervisor: Ms. Lai Hoai Phuong
University: Hanoi University
Major: Statistics
Type: Tutorial
Number of pages: 22
File size: 536.81 KB


STATISTICS FOR ECONOMICS

Instructor: Ms Lai Hoai Phuong

Lưu Thị Thanh Ngân 2004040079


Table of Contents

Table of figures

A Scenario

B Answering questions

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Figure 4 Mean of dataset
Figure 5 Median of dataset
Figure 6 Standard deviation
Figure 7 Summary
Figure 8 Structure of new data
Figure 9 Frequency table of sample size
Figure 10 Box plot
Figure 11 Mean plot with 95% CI
Figure 12 QQ plot
Figure 13 Standard deviation
Figure 14 Levene test's result
Figure 15 Output
Figure 16 QQ plot
Figure 17 Standard deviation
Figure 18 Levene test's result
Figure 19 Output
Figure 20 Frequency table of sample size
Figure 21 Levene test's result
Figure 22 QQ plot
Figure 23 Interaction plot between Province and Ownership


A Scenario

The database of the annual Vietnamese Enterprise Surveys (VESs) is an important source of data for scholars doing research on Vietnam's economy and its micro dynamics. In 2004, the survey was carried out with a sample size of more than 2 million businesses in all provinces across the country. The household questionnaire contained many sections, each of which covered a separate aspect of business activities, and profitability was one important indicator. In the survey, businesses were asked to specify their site of operation (province), type of ownership (own) and profitability (roa). The objective of our study is to test for any significant interaction between provinces and types of ownership, and to test for any significant differences in the profitability of businesses due to these two variables.

A portion of the VES data is to be given to each group by your tutor. In the given dataset, 1 represents firms from Hanoi, 2 represents firms from Danang and 3 represents firms from Ho Chi Minh City.

B Answering questions

Question 1:

Produce descriptive statistics to summarize the data. You are expected to generate as many relevant descriptive statistics as possible using ALL the relevant tools introduced in the labs of this course. Remember to provide appropriate interpretations for the descriptive statistics. Try not to include unnecessary or irrelevant descriptive statistics.

We use RStudio to produce the descriptive statistics for this question. First, we import the CSV file "Dataset2.csv" into R:

➢ Dataset2 <- read.table("Dataset2.csv", header = TRUE, sep = ",", quote = "\"", stringsAsFactors = FALSE)

There are 180 observations in this case study, so we use head(Dataset2) to view the first few observations and get a feel for the data:

➢ head(Dataset2)


Figure 1 Some first observations of the data set

The internal structure of the data can be obtained by:

➢ str(Dataset2)

Figure 2 Structure of the data when factors have not been converted yet

From the above output, it is clear that there are 180 observations of 3 variables: roa, own and province. Since province and own are characters, we convert them into factors using R code such as the following:

➢ Dataset2$province <- factor(Dataset2$province, levels = c("1", "2", "3"), labels = c("HaNoi", "DaNang", "HCM"))

A frequency table can be created to see the sample size of each treatment group with the following R code:

➢ table(Dataset2$province, Dataset2$own)

Figure 3 Frequency table of sample size

It can be seen that all 6 treatment groups have the same sample size of 30. Equal group sizes make this dataset well suited to an ANOVA test.
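The factor conversion and the balance check can be tried on a small toy data frame. This is only a sketch: `toy` is made-up data standing in for Dataset2, and the coding of `own` (1 = state-owned, 2 = private-owned) is an assumption, since the text only documents the coding of province.

```r
# Toy stand-in for Dataset2: 2 made-up firms per province/ownership cell.
toy <- expand.grid(rep = 1:2, province = 1:3, own = 1:2)
toy$roa <- seq(0.01, 0.12, length.out = nrow(toy))

# Province coding follows the scenario: 1 = Hanoi, 2 = Danang, 3 = HCM City.
toy$province <- factor(toy$province, levels = 1:3,
                       labels = c("HaNoi", "DaNang", "HCM"))

# ASSUMPTION: 1 = state-owned, 2 = private-owned (not stated in the text).
toy$own <- factor(toy$own, levels = 1:2, labels = c("State", "Private"))

# Cross-tabulate the two factors and check that every cell has the same count.
counts <- table(toy$province, toy$own)
print(counts)
balanced <- length(unique(as.vector(counts))) == 1
print(balanced)
```

With the real data, the same `table()` call should show 30 firms in each of the 6 cells.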

Next, we use the by() function in R to find several descriptive statistics (mean, median, standard deviation and a summary) for each treatment group defined by the two factors. The commands and their outputs are listed below:

➢ by(Dataset2$roa, list(Dataset2$province, Dataset2$own), mean)

Figure 4 Mean of dataset

➢ by(Dataset2$roa, list(Dataset2$province, Dataset2$own), median)

Figure 5 Median of dataset

➢ by(Dataset2$roa, list(Dataset2$province, Dataset2$own), sd)

Figure 6 Standard deviation

➢ by(Dataset2$roa, list(Dataset2$province, Dataset2$own), summary)

Figure 7 Summary

Each command gives a specific descriptive statistic of the outcome variable for each treatment group defined by province and ownership. The final command, with summary, returns 6 basic statistics: the minimum, the first quartile, the median, the mean, the third quartile and the maximum.
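As an alternative to four separate by() calls, the same statistics can be collected in a single table. The sketch below uses base R's aggregate() on simulated data; the data frame `toy` and its values are invented for illustration and are not the VES data.

```r
set.seed(1)
# Simulated stand-in for Dataset2: 30 firms in each of the 6 groups.
toy <- expand.grid(rep = 1:30,
                   province = c("HaNoi", "DaNang", "HCM"),
                   own = c("State", "Private"))
toy$roa <- rnorm(nrow(toy), mean = 0.04, sd = 0.05)

# One row per province/ownership group, several statistics per row.
stats <- aggregate(roa ~ province + own, data = toy,
                   FUN = function(x) c(n = length(x), mean = mean(x),
                                       median = median(x), sd = sd(x)))
# aggregate() returns the statistics as a matrix column; flatten it.
stats <- do.call(data.frame, stats)
print(stats)
```

The resulting columns are named roa.n, roa.mean, roa.median and roa.sd, which makes the groups easy to compare side by side.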


Finally, we draw a box plot and a mean plot to explore the data further.

While working with the data, we found that it contains many outliers lying far from the rest of the observations, which compresses the boxes in the box plot. Therefore, we screened the data to obtain a new data table (newdata).

The new data has 150 observations of 3 variables, and each of the 6 groups has a sample size of 25.
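The report does not say how the outliers were screened out. One common option, shown here purely as an illustration and not as the authors' method, is to drop values lying more than 1.5 interquartile ranges beyond the quartiles of their own group. The data and the helper `inside_fences` are hypothetical, and this rule would not necessarily leave exactly 25 observations per group.

```r
set.seed(2)
# Simulated stand-in for Dataset2 with three injected extreme values.
toy <- expand.grid(rep = 1:30,
                   province = c("HaNoi", "DaNang", "HCM"),
                   own = c("State", "Private"))
toy$roa <- rnorm(nrow(toy), mean = 0.04, sd = 0.05)
toy$roa[c(1, 31, 61)] <- c(60, -45, 90)

# Hypothetical helper: TRUE where a value lies within the 1.5 * IQR fences.
inside_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x >= q[1] - fence & x <= q[2] + fence
}

# ave() applies the helper within each province/ownership group (1 = keep).
keep <- ave(toy$roa, toy$province, toy$own, FUN = inside_fences) == 1
toy_trimmed <- toy[keep, ]
print(table(toy_trimmed$province, toy_trimmed$own))
```

Because the rule is applied per group, the group sizes after trimming can differ, which is worth checking before running a balanced ANOVA.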

Figure 8 Structure of new data

Figure 9 Frequency table of sample size

The box plot below shows the medians, quartiles, maximum and minimum values, and outliers of the dataset. It is produced with the following R code:

➢ boxplot(roa ~ province + own, data = newdata, xlab = "Province and ownership", ylab = "Profitability", main = "boxplot", col = c("lightcoral", "skyblue2", "hotpink2", "palegreen", "purple1", "orangered3"))


Figure 10 Box plot

The box plot shows several descriptive statistics for the six groups: medians, quartiles, and maximum and minimum values. Each group has its own characteristics. Based on the R output, the Ho Chi Minh private-owned group has the highest median value. The maximum value (excluding outliers) of the Ho Chi Minh state-owned group is higher than that of the other groups. The Da Nang private-owned group is the most stable: its within-group variability is the smallest, as shown by the smallest interquartile range and the narrowest range between maximum and minimum values. The private-owned groups in all provinces have relatively lower values; however, the Ho Chi Minh private-owned group has an outlier that reaches nearly 0.6, which is the highest value in the dataset.

The box plot also shows the skewness of the dataset. While the profitability of the private-owned and state-owned groups in Ho Chi Minh has a roughly symmetric distribution, the 2 groups in Ha Noi are negatively skewed (skewed left). On the other hand, the distributions in Da Nang are positively skewed (skewed right).

A mean plot can also be used to identify the mean value of each group and to compare means between groups. Before creating the mean plot in RStudio, we need to install the package with install.packages("gplots"); we then use the following code to obtain the outcome:

➢ library(gplots)

➢ plotmeans(roa ~ interaction(province, own), data = newdata, xlab = "Province and ownership", ylab = "Profitability", main = "Mean Plot with 95% CI")

Figure 11 Mean plot with 95% CI

There are 6 groups presented in the mean plot, each with a 95% confidence interval. The means in this graph are the same as those generated above with the by() function. Looking at the chart, there is a distinct pattern between private-owned and state-owned firms in the 3 provinces. While the mean of the Ho Chi Minh state-owned group is about 0.06, the means of the other groups remain under 0.05. The Ho Chi Minh state-owned group has the highest mean and the Ha Noi private-owned group has the lowest. The sample means of the 6 groups differ, so we use ANOVA in the following questions to test whether these differences are statistically significant.
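The intervals in a mean plot can also be computed by hand. The sketch below assumes the usual t-based interval, mean ± t(0.975, n − 1) · s/√n, for each group; it runs on simulated data, not on the report's newdata.

```r
set.seed(3)
# Simulated stand-in for newdata: 25 firms in each of the 6 groups.
toy <- expand.grid(rep = 1:25,
                   province = c("HaNoi", "DaNang", "HCM"),
                   own = c("State", "Private"))
toy$roa <- rnorm(nrow(toy), mean = 0.04, sd = 0.05)

# Split roa by the 6 province/ownership combinations.
by_group <- split(toy$roa, interaction(toy$province, toy$own))
n <- sapply(by_group, length)
m <- sapply(by_group, mean)
s <- sapply(by_group, sd)

# Half-width of the 95% t-interval for each group mean.
half <- qt(0.975, df = n - 1) * s / sqrt(n)
ci <- data.frame(mean = m, lower = m - half, upper = m + half)
print(round(ci, 4))
```

Groups whose intervals do not overlap are the first candidates for a significant difference, although the formal decision comes from the ANOVA test.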

Question 2:

Use analysis of variance to test for any significant differences due to province. Use a .05 level of significance, and for now, ignore the effect of types of ownership. Check all the assumptions of the inference technique you use. Are the assumptions satisfied? Explain.

One-way analysis of variance compares the means of two or more independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. For this question, one-way ANOVA is considered the most suitable test, provided that the following assumptions hold:

- Samples are independent, simple random samples

- All populations in question are normally distributed

- All population standard deviations are equal

1. Hypotheses

H0: All population means are equal

Ha: At least two population means are different

2. Check assumptions

Assumption 1: Samples are independent, simple random samples

In 2004, the annual Vietnamese Enterprise Surveys (VESs) were carried out with a sample size of more than 2 million businesses in all provinces across the country. A portion of the VES data was given to each group by our tutor. Therefore, we assume that the observations in each group were collected as an independent random sample.

Assumption 2: All populations in question are normally distributed

We can check the normality assumption graphically via a Q-Q plot. We can draw an individual Q-Q plot for each sample and check whether all samples are normally distributed. Alternatively, we can draw one plot to check the normality of the residuals:

➢ install.packages("car")

➢ library(car)

➢ qqPlot(lm(roa ~ province, data = Dataset2), simulate = TRUE, labels = FALSE)

Figure 12 QQ plot

For normally distributed data, the observations should lie approximately on a straight line. If the data is non-normal, the points form a curve that deviates markedly from a straight line. Looking at the plot, we can see that almost all the points lie on a straight line. So, based on this Q-Q plot, we treat the assumption "All populations in question are normally distributed" as satisfied.

Assumption 3: All population standard deviations are equal

"All population standard deviations are equal" is a reasonable assumption if the largest sample standard deviation is less than twice the smallest, that is, if the ratio between the largest and the smallest standard deviation is less than 2. We can find the standard deviations by using the by() function in R:

➢ by(Dataset2$roa, Dataset2$province, sd)

Figure 13 Standard deviation

From this result, we have

$$\frac{\text{Largest standard deviation}}{\text{Smallest standard deviation}} = \frac{65.61424}{0.07268901} = 902.6707063$$

We see that the ratio is far greater than 2, which means that by this rule of thumb we cannot claim "all population standard deviations are equal".
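The ratio check is easy to script. The sketch below reproduces the arithmetic from the two reported standard deviations and then applies the same rule of thumb to simulated data; the data frame `toy`, including its single extreme value, is invented to show how strongly one outlier can inflate the ratio.

```r
# Ratio computed from the two standard deviations reported in the text.
reported_ratio <- 65.61424 / 0.07268901
print(reported_ratio)

set.seed(4)
# Simulated stand-in: 60 firms per province, plus one extreme value.
toy <- data.frame(province = rep(c("HaNoi", "DaNang", "HCM"), each = 60),
                  roa = rnorm(180, mean = 0.04, sd = 0.05))
toy$roa[1] <- 80

# Standard deviation per province, then largest over smallest.
sds <- sapply(split(toy$roa, toy$province), sd)
ratio <- max(sds) / min(sds)
print(sds)
print(ratio > 2)  # TRUE means the rule of thumb is violated
```

A ratio in the hundreds, as here, usually points to a few extreme observations rather than to uniformly larger spread in one group.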

So, we use Levene's test to check this equal-variance assumption. We use the function leveneTest in the package named "car":

➢ install.packages('car')

➢ library(car)

➢ leveneTest(Dataset2$roa, Dataset2$province, center = median)

o Hypotheses: H0: All population standard deviations are equal

Ha: Not all population standard deviations are equal

Result after running the test in R:

Figure 14 Levene test's result

From the result, we see that p-value = 0.372 > α = 0.05, so we do not reject H0. That means we have no significant evidence against "All population standard deviations are equal", and we proceed as if the assumption holds.
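Levene's test with center = median amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below carries out that computation in base R on simulated data, so the p-value it prints has nothing to do with the 0.372 reported above.

```r
set.seed(5)
# Simulated stand-in for Dataset2: 60 firms per province.
toy <- data.frame(province = factor(rep(c("HaNoi", "DaNang", "HCM"), each = 60)),
                  roa = rnorm(180, mean = 0.04, sd = 0.05))

# Absolute deviation of each observation from its own group's median.
group_median <- ave(toy$roa, toy$province, FUN = median)
toy$abs_dev <- abs(toy$roa - group_median)

# One-way ANOVA on the absolute deviations gives the Levene F and p-value.
levene_tab <- anova(lm(abs_dev ~ province, data = toy))
levene_p <- levene_tab[["Pr(>F)"]][1]
print(levene_tab)
print(levene_p > 0.05)  # TRUE means equal variances are not rejected
```

Using the median rather than the mean as the center makes the test less sensitive to skewed data and outliers.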

3. Test statistic: F = MSG/MSE = 1.004

Run one-way ANOVA:

➢ aov1 <- aov(roa ~ province, data = Dataset2)
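To make the statistic F = MSG/MSE concrete, the sketch below computes the mean square between groups (MSG) and the mean square error (MSE) by hand and compares the result with aov(). It uses simulated data, so the F value differs from the 1.004 reported in the text.

```r
set.seed(6)
# Simulated stand-in for Dataset2: 60 firms per province.
toy <- data.frame(province = factor(rep(c("HaNoi", "DaNang", "HCM"), each = 60)),
                  roa = rnorm(180, mean = 0.04, sd = 0.05))

k <- nlevels(toy$province)  # number of groups
n <- nrow(toy)              # total number of observations

grand_mean <- mean(toy$roa)
group_mean <- tapply(toy$roa, toy$province, mean)
group_n <- tapply(toy$roa, toy$province, length)

# Between-group and within-group sums of squares.
ssg <- sum(group_n * (group_mean - grand_mean)^2)
sse <- sum((toy$roa - ave(toy$roa, toy$province))^2)

msg <- ssg / (k - 1)
mse <- sse / (n - k)
f_manual <- msg / mse
p_manual <- pf(f_manual, k - 1, n - k, lower.tail = FALSE)

# The same F statistic as reported by aov().
f_aov <- summary(aov(roa ~ province, data = toy))[[1]][["F value"]][1]
print(c(manual = f_manual, aov = f_aov, p = p_manual))
```

The p-value is the upper tail of the F distribution with k − 1 and n − k degrees of freedom; H0 is rejected at the .05 level when it falls below 0.05.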

o The hypotheses to test: H0: All population means are equal

Ha: The population means are different

2. Check assumptions

In 2004, the annual Vietnamese Enterprise Surveys (VESs) were carried out with a sample size of more than 2 million businesses in all provinces across the country. A portion of the VES data was given to each group by our tutor. Hence, we assume that the observations within each group were obtained by random sampling. Moreover, the profitability of state-owned firms must be independent of the profitability of private-owned firms.

Assumption 2: For each population, the response variable is normally distributed.

A quantile-quantile plot, also called a Q-Q plot, is used to check the normality of the data, so the following R code is used:

➢ qqPlot(lm(roa ~ own, data = Dataset2), simulate = TRUE, labels = FALSE)

The qqPlot function plots the quantiles of the model residuals against the quantiles of a reference distribution, one on each axis. If the two distributions have the same shape, the points fall on a straight line.
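The idea behind a Q-Q plot can be seen without drawing anything. In the sketch below, base R's qqnorm() returns the theoretical and sample quantiles of the residuals, and their correlation serves as a rough numerical summary of how straight the plot would be. The data are simulated, and the correlation is only an informal aid, not a replacement for inspecting the plot.

```r
set.seed(7)
# Simulated stand-in for Dataset2: 90 firms per ownership type.
toy <- data.frame(own = factor(rep(c("State", "Private"), each = 90)),
                  roa = rnorm(180, mean = 0.04, sd = 0.05))

# Residuals of the one-way model: each value minus its group mean.
res <- residuals(lm(roa ~ own, data = toy))

# x = theoretical normal quantiles, y = ordered sample quantiles.
qq <- qqnorm(res, plot.it = FALSE)

# Close to 1 when the points would lie near a straight line.
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

Calling qqnorm(res) followed by qqline(res) draws the corresponding base-R plot with a reference line.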

We use the by() function below in R to calculate the standard deviations:

➢ by(Dataset2$roa, Dataset2$own, sd)


Figure 17 Standard deviation

From this result, we take the ratio:

$$\frac{\text{Largest standard deviation}}{\text{Smallest standard deviation}} = \frac{53.57284}{0.08292021} = 646.0777$$

The number 646.0777 is much greater than 2, which means that we cannot claim the variance is the same for all of the populations. Therefore, we use another technique, Levene's test, to check the equality of all variances before running a one-way ANOVA test. First, we set the hypotheses for Levene's test:

H0: The variance among groups is equal

Ha: The variance among different groups is not equal

Then we use the R code below to find the p-value of Levene's test:

➢ install.packages('car')

➢ library(car)

➢ leveneTest(Dataset2$roa, Dataset2$own, center = median)

Figure 18 Levene test's result

If the p-value of the Levene test is smaller than the significance level (α = 0.05), we reject the null hypothesis that variances are equal for all populations. From the results above, the p-value (= 0.3174) is greater than the significance level (α = 0.05), so we cannot reject H0. That means we have no significant evidence that the population variances differ.

3. Test statistic

To run one-way ANOVA test, the R code will be:


➢ aov <- aov(roa ~ own, data = Dataset2)

Although the two-way factorial analysis of variance is usually the optimal inference method to handle this case, it is vital to check all the assumptions of the inference technique before running our two-way ANOVA, in order to ensure the validity of our findings.

Three assumptions must be checked for two-way ANOVA:

• Samples are independent, simple random samples from each population

• All populations are normally distributed

• All populations have the same standard deviation: $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$
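Once these assumptions have been checked, a two-way factorial ANOVA with an interaction term can be fitted in one call. The sketch below shows the general form on simulated data. The formula roa ~ province * own expands to both main effects plus their interaction, and nothing in the output reflects the report's actual results.

```r
set.seed(8)
# Simulated stand-in for Dataset2: 30 firms in each of the 6 groups.
toy <- expand.grid(rep = 1:30,
                   province = c("HaNoi", "DaNang", "HCM"),
                   own = c("State", "Private"))
toy$roa <- rnorm(nrow(toy), mean = 0.04, sd = 0.05)

# Main effects of province and own, plus the province:own interaction.
fit <- aov(roa ~ province * own, data = toy)
tab <- summary(fit)[[1]]
print(tab)
```

The table has one row each for province, own, the interaction and the residuals; the interaction row is usually read first, because a significant interaction changes how the main effects should be interpreted. Base R's interaction.plot(toy$province, toy$own, toy$roa) draws the matching interaction plot.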

Assumption 1: Samples are independent

… populations. The research consists of separate simple random samples. The code below produces the R output showing that the 6 groups are of similar size:

➢ table(Dataset2$province, Dataset2$own)

Figure 20 Frequency table of sample size

Assumption 2: All populations have the same standard deviation

Secondly, we check assumption 2, that the standard deviations are equal. Looking at the output of the by() function in R for the six province-ownership groups, we observe that the ratio of the largest sample standard deviation to the smallest (= 92.7883 / 0.03831905) is around 2421.466, which is far greater than 2. As a result, by this rule of thumb we cannot assume that the standard deviations are the same in all populations.

Levene's test is another method that can be used to verify the assumption of equal standard deviations. This test determines whether the variances of the samples are approximately equal (homogeneous). The procedure is to compare the Levene test's p-value with our significance level (α = 0.05). Equal variances are assumed if the p-value is greater than 0.05, that is, if the Levene test is non-significant, and vice versa.

Levene's test, however, is mainly needed when the ratio of standard deviations is inconclusive (between 2 and 3). In this scenario, the ratio is far outside that range, so the Levene test was not strictly necessary. The following is the code for the Levene test:

➢ install.packages("car")

➢ library(car)

➢ leveneTest(Dataset2$roa, interaction(Dataset2$province,Dataset2$own), center= median)

Date posted: 08/03/2024, 16:25
