Business Statistics Group Midterm Exam: Consider the variable income in gss.sav file



NATIONAL ECONOMICS UNIVERSITY SCHOOL OF ADVANCED EDUCATION PROGRAMS

BUSINESS STATISTICS

GROUP MIDTERM EXAM

Instructor: Mrs Tran Thi Bich

Class: Advanced Accounting K63

Students:
Pham Minh Anh 11219597
Vu Ha Chau 11211022
Tran Ngoc Diep 11211320
Pham Tra My 11219197
Tran Thi Thu Thao 11215482
Le Khanh Vi 11216231

Hanoi, September, 2023


Question 1: Consider the variable income in gss.sav file (the variable is total family income in the year before the survey)

1, Make a frequency table for the variable. Does the frequency table make sense? Does it make sense to make a histogram of the variable? A bar chart?

- Frequency table

TOTAL FAMILY INCOME FOR LAST YEAR

                             Frequency  Percent  Valid Percent  Cumulative Percent
Valid    UNDER $1 000               17      1.2            1.3                 1.3
         $1 000 TO 2 999            17      1.2            1.3                 2.5
         $3 000 TO 3 999             9       .6             .7                 3.2
         $4 000 TO 4 999             7       .5             .5                 3.7
         $5 000 TO 5 999            13       .9            1.0                 4.7
         $6 000 TO 6 999            19      1.3            1.4                 6.1
         $7 000 TO 7 999            17      1.2            1.3                 7.3
         $8 000 TO 9 999            40      2.8            3.0                10.3
         $10000 TO 12499            58      4.1            4.3                14.5
         $12500 TO 14999            56      3.9            4.1                18.7
         $15000 TO 17499            50      3.5            3.7                22.4
         $17500 TO 19999            54      3.8            4.0                26.4
         $20000 TO 22499            42      3.0            3.1                29.5
         $22500 TO 24999            59      4.2            4.4                33.8
         $25000 TO 29999            79      5.6            5.8                39.7
         $30000 TO 34999            86      6.1            6.4                46.0
         $35000 TO 39999            82      5.8            6.1                52.1
         $40000 TO 49999           119      8.4            8.8                60.9
         $50000 TO 59999           108      7.6            8.0                68.8
         $60000 TO 74999           111      7.8            8.2                77.0
         $75000 TO $89999           66      4.7            4.9                81.9
         $90000 - $109999           45      3.2            3.3                85.2
         $110000 OR OVER            76      5.4            5.6                90.8
         REFUSED                   124      8.7            9.2               100.0
         Total                    1354     95.4          100.0
Missing  DK                         49      3.5
         NA                         16      1.1
         Total                      65      4.6
Total                             1419    100.0

A frequency distribution records the number of times each value occurs and is presented in the form of a table. However, as can be seen from the frequency table above, it is hard to draw a conclusion or identify a trend because there are too many categories. Therefore, in our opinion, this frequency table does not make sense, as we believe readers would struggle to interpret it fully.
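As a rough illustration of how the Percent, Valid Percent, and Cumulative Percent columns relate to the raw counts, the following sketch recomputes them in Python. The counts are copied from the table above; the variable and function names are our own, and SPSS's internal rounding is assumed to match ordinary rounding to one decimal place.

```python
# Counts copied from the frequency table above (valid categories, in table order).
valid_counts = [17, 17, 9, 7, 13, 19, 17, 40, 58, 56, 50, 54,
                42, 59, 79, 86, 82, 119, 108, 111, 66, 45, 76, 124]
missing_counts = [49, 16]  # DK, NA

n_valid = sum(valid_counts)
n_total = n_valid + sum(missing_counts)


def frequency_rows(counts, n_valid, n_total):
    """Return (count, percent, valid percent, cumulative percent) per category."""
    rows = []
    running = 0
    for count in counts:
        running += count
        rows.append((
            count,
            round(100 * count / n_total, 1),    # share of all 1419 respondents
            round(100 * count / n_valid, 1),    # share of the 1354 valid answers
            round(100 * running / n_valid, 1),  # running share of valid answers
        ))
    return rows


rows = frequency_rows(valid_counts, n_valid, n_total)
for row in rows:
    print(row)
```

Percent uses all respondents as the base, while Valid Percent and Cumulative Percent use only the valid answers, which is why the two columns differ slightly in every row.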

- Histogram

[Figure: histogram of TOTAL FAMILY INCOME FOR LAST YEAR; Mean = 16.13, Std. Dev. = 5.636, N = 1354]

A histogram is the most frequently used graph for showing frequency distributions. The X-axis is divided into intervals that show the scale of values the measurements fall under, while the Y-axis shows the number of times values occurred within each interval. It is possible to make an equal-width histogram of the income variable, since family income is generally continuous data that can be grouped into equal class intervals. However, when looking at the histogram, our group realized that the bins on the X-axis are hard to understand and do not relate clearly to the categories in the dataset. Moreover, the missing-values column is not clearly identified, since it lies among the non-missing ones, which may mislead readers. Therefore, the histogram does not make sense.

- Bar chart

A bar chart is used to show a distribution of data points or to compare metric values across groups. From a bar chart, we can see which groups are highest or most common and how the other groups compare. Even though there are many values on the X-axis of the bar chart, we can still see the trend and compare the values in order to draw a conclusion. Besides, the missing-values column (labelled "REFUSED") is shown explicitly, which helps readers understand the graph more clearly. Hence, the bar chart does make sense in this case.

[Figure: bar chart of TOTAL FAMILY INCOME FOR LAST YEAR, count by income category]

2. What is the scale of measurement for the variable?

[Figure: screenshot of the SPSS Variable View for gss.sav (IBM SPSS Statistics Data Editor), listing the Measure setting of each variable]


The scale of measurement for the variable "income" is ordinal: income has been divided into several ordered categories, and the codes assigned to these categories are labels that indicate rank rather than measured dollar amounts.

3. What descriptive statistics are appropriate for describing this variable and why? Does it make sense to compute a mean?

There are 4 types of descriptive statistics: measures of frequency, measures of central tendency, measures of dispersion or variation, and measures of position.

From our viewpoint, measures of central tendency are the most appropriate for describing the data and the closest to our purpose, which is to know the most commonly indicated response.

In this situation, computing a mean does not make sense, and there are reasons to choose the median and mode instead:

+ There is not enough information about the values in the ranges "under $1,000" and "$110,000 or over", as the frequency table is open-ended.

+ When we draw a histogram, we see that it is skewed to the left.

+ As can be seen in the Statistics table, the coefficient of skewness (-.608) is smaller than 0, so the distribution is negatively skewed.

As a result, the mean is a poor descriptive statistic for this skewed distribution, and because 65 values are missing from the frequency table, using the mean may give inaccurate results. Hence, it does NOT make sense to compute a mean.

Statistics
TOTAL FAMILY INCOME FOR LAST YEAR

N      Valid               1354
       Missing               65
Mean                      16.13
Median                    17.00
Mode                         24
Std. Deviation            5.636
Skewness                  -.608
Std. Error of Skewness     .066
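To make concrete why the reported mean is hard to interpret, the sketch below recomputes the mean, median, and mode from the frequency counts. It assumes (this is our assumption, not something stated in the SPSS output) that the 24 valid categories are coded 1 to 24 in table order, with 24 = REFUSED. Under that assumption the figures match the Statistics table, which suggests that 16.13 is an average of category codes rather than of dollar amounts.

```python
# Assumption: SPSS codes the valid categories 1..24 in table order
# (1 = UNDER $1 000, ..., 23 = $110000 OR OVER, 24 = REFUSED).
counts = [17, 17, 9, 7, 13, 19, 17, 40, 58, 56, 50, 54,
          42, 59, 79, 86, 82, 119, 108, 111, 66, 45, 76, 124]
codes = list(range(1, 25))


def value_at(codes, counts, position):
    """Return the code of the observation at a 1-based position in sorted order."""
    running = 0
    for code, count in zip(codes, counts):
        running += count
        if running >= position:
            return code
    raise ValueError("position exceeds number of observations")


def median_code(codes, counts):
    """Median of the coded values; averages the two middle observations if n is even."""
    n = sum(counts)
    lower, upper = (n + 1) // 2, n // 2 + 1
    return (value_at(codes, counts, lower) + value_at(codes, counts, upper)) / 2


n = sum(counts)
mean_code = sum(code * count for code, count in zip(codes, counts)) / n
median = median_code(codes, counts)
mode = max(zip(codes, counts), key=lambda pair: pair[1])[0]

print(round(mean_code, 2), median, mode)
```

Under the same coding assumption, the median code corresponds to the "$35000 TO 39999" bracket and the mode to "REFUSED", so the median category is the more informative summary of the three.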

4. Discuss the advantages and disadvantages of recording income in this manner. Describe other ways of recording income and the problems associated with each of them.

When the income range is partitioned in detail, labels like these allow readers to readily read and understand the values. Furthermore, the labels are divided into fixed pay brackets (similar to the bins of a histogram), making it easy to determine bin borders as well as to evaluate and compare frequencies or the distribution of data points across bins. On the other hand, dividing income into brackets that are too narrow makes individual values easy to read but is somewhat redundant, making the overall trend difficult to detect. Furthermore, an average computed from ranges rather than exact amounts is only approximate, which makes it harder to distinguish between values.


There are many ways to present people's earnings data. One of the most common is to separate the labels by age. For example, income can be divided into groups such as "under 18 years old", "from 18 years old to 29 years old", "from 30 years old to 39 years old", and so on. With labels like these, we can easily identify the pattern of income levels by age and thereby understand the chart's meaning. Another option is to separate the labels by occupation, such as workers, officials, and state employees. However, the disadvantage of these approaches is that they are time-intensive, and not only at the stage of collecting information. Furthermore, because the number of people in each group might vary greatly, it is easy to introduce errors when calculating the average value.

Question 2: In the gss.sav file, the variable tvhours tells you how many hours per day GSS respondents say they watch TV.

1, Make a frequency table of the hours of television watched. Do any of the values strike you as strange? Explain.

Statistics
Hours per day watching TV

N    Valid      906
     Missing    513

Hours per day watching TV

                Frequency  Percent  Valid Percent  Cumulative Percent
Valid    0             54      3.8            6.0                 6.0
         1            189     13.3           20.9                26.8
         2            238     16.8           26.3                53.1
         3            159     11.2           17.5                70.6
         4            115      8.1           12.7                83.3
         5             54      3.8            6.0                89.3
         6             30      2.1            3.3                92.6
         7             10       .7            1.1                93.7
         8             22      1.6            2.4                96.1
         10            13       .9            1.4                97.6
         11             3       .2             .3                97.9
         12            13       .9            1.4                99.3
         14             2       .1             .2                99.6
         15             2       .1             .2                99.8
         20             1       .1             .1                99.9
         24             1       .1             .1               100.0
         Total        906     63.8          100.0
Missing  NAP          486     34.2
         NA            27      1.9
         Total        513     36.2
Total                1419    100.0


Based on the frequency table, the value that strikes us as strange is "12", that is, people who spend 12 hours a day watching TV. As can be seen, the number of people at each value from 0 to 10 hours is comparatively high (at least 10 people), while the counts for 11 to 20 hours are much lower (fewer than 4 people), except for 12 hours. The count for 12 hours (13 people) is about 4 times higher than the largest of the other counts in this group (3 people). Therefore, the figure for 12 hours stands out as unusual.

2, Based on the frequency table, answer the following questions: Of the people who answered the question, what percentage don't watch any television? What percentage watch two hours or less? Five hours or more? Of the people who watch TV, what percentage watch one hour? What percentage watch four hours or less?

- Of the people who answered the question

• The percentage of people who don't watch any television: 6.0%

• The percentage of people who watch two hours or less: 53.1%

• The percentage of people who watch five hours or more: 16.7% (= 100% - 83.3%)

- Of the people who watch TV

• The percentage of people who watch one hour: 22.2% (= 189 / 852)

• The percentage of people who watch four hours or less: 82.3% (= 701 / 852)
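The percentages above can be checked directly from the counts in the frequency table. The following is a minimal sketch; the helper names are our own, and "people who watch TV" is taken to mean respondents who reported more than 0 hours (906 - 54 = 852 people).

```python
# Counts copied from the frequency table of hours per day watching TV.
tv_counts = {0: 54, 1: 189, 2: 238, 3: 159, 4: 115, 5: 54, 6: 30, 7: 10,
             8: 22, 10: 13, 11: 3, 12: 13, 14: 2, 15: 2, 20: 1, 24: 1}


def everyone(hours):
    """Base group: all respondents who answered the question."""
    return True


def watchers(hours):
    """Base group: respondents who watch at least some TV."""
    return hours > 0


def share(counts, keep, base):
    """Percentage of the base group whose hours also satisfy `keep`."""
    denominator = sum(f for h, f in counts.items() if base(h))
    numerator = sum(f for h, f in counts.items() if base(h) and keep(h))
    return round(100 * numerator / denominator, 1)


no_tv = share(tv_counts, lambda h: h == 0, everyone)
two_or_less = share(tv_counts, lambda h: h <= 2, everyone)
five_or_more = share(tv_counts, lambda h: h >= 5, everyone)
one_hour_watchers = share(tv_counts, lambda h: h == 1, watchers)
four_or_less_watchers = share(tv_counts, lambda h: h <= 4, watchers)

print(no_tv, two_or_less, five_or_more, one_hour_watchers, four_or_less_watchers)
```

Changing the base group from all respondents to watchers only is what moves the one-hour share from 20.9% in the table to 22.2%.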

3. From the frequency table, estimate the 25th, 50th, 75th, and 95th percentiles. What is the value for the Median, Mode?

Statistics

N            Valid      906
             Missing    513
Median                 2.00
Mode                      2
Percentiles  25        1.00
             50        2.00
             75        4.00
             95        8.00

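- The value for the 25th percentile is 1.00

- The value for the 50th percentile is 2.00

- The value for the 75th percentile is 4.00

- The value for the 95th percentile is 8.00

- The value for Median is 2.00

- The value for Mode is 2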
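One simple way to estimate percentiles from a frequency table is to take the smallest value whose cumulative share of valid answers reaches the requested percentage. The sketch below applies that rule; it is an illustration under this stated rule, not a reproduction of SPSS's own percentile algorithm, which interpolates and could give different results on other data.

```python
# Counts copied from the frequency table of hours per day watching TV.
tv_counts = {0: 54, 1: 189, 2: 238, 3: 159, 4: 115, 5: 54, 6: 30, 7: 10,
             8: 22, 10: 13, 11: 3, 12: 13, 14: 2, 15: 2, 20: 1, 24: 1}


def percentile_from_counts(counts, p):
    """Smallest value whose cumulative percent of valid answers is at least p."""
    n = sum(counts.values())
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if 100 * running / n >= p:
            return value
    raise ValueError("p must be between 0 and 100")


percentiles = {p: percentile_from_counts(tv_counts, p) for p in (25, 50, 75, 95)}
mode = max(tv_counts, key=tv_counts.get)  # value with the highest frequency

print(percentiles, mode)
```

For these data the rule amounts to reading down the Cumulative Percent column until it first reaches 25, 50, 75, and 95.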

4. Make a bar chart of the hours of TV watched. What problem do you see with this display?


[Figure: bar chart of hours per day watching TV; X-axis: hours (observed values only, 0 to 24), Y-axis: count]

As can be seen from the graph, most people spend from 1 hour to 4 hours per day watching TV, which accounts for 82.3% of the people who watch TV. On the other hand, people who watch TV for more than 10 hours per day account for just 2.6% of the people who watch TV.

However, there are a few problems with the information on the bar chart. Firstly, some values, namely 9, 13, 16, 17, 18, 19, 21, 22, and 23 hours per day, are not presented on the bar chart. This is probably because no respondent reported these values in the survey. Secondly, even where there are no responses, the chart should include a gap (representing 0 people) for these categories to avoid misleading the reader. Finally, the number of people who watch TV for more than 10 hours per day is very low; therefore, the shape of the distribution is hard to make out.
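The missing categories can be found, and filled with zero counts so that a chart leaves a gap for them, with a few lines of code. This is a minimal sketch that assumes the possible answers are the whole hours 0 to 24; the names are illustrative.

```python
# Counts copied from the frequency table of hours per day watching TV.
tv_counts = {0: 54, 1: 189, 2: 238, 3: 159, 4: 115, 5: 54, 6: 30, 7: 10,
             8: 22, 10: 13, 11: 3, 12: 13, 14: 2, 15: 2, 20: 1, 24: 1}

# Assumption: every whole number of hours from 0 to 24 is a possible answer.
all_hours = range(0, 25)

# Give every possible value a bar, using 0 where nobody reported that value.
full_counts = {hours: tv_counts.get(hours, 0) for hours in all_hours}
missing_hours = [hours for hours, count in full_counts.items() if count == 0]

print(missing_hours)

# Crude text bar chart: one '#' per 10 respondents, so empty rows show the gaps.
for hours, count in full_counts.items():
    print(f"{hours:>2} | {'#' * (count // 10)} {count}")
```

Plotting `full_counts` instead of `tv_counts` keeps the spacing on the X-axis proportional to the number of hours, which is the property the bar chart of observed values lacks.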

5. Make a histogram of the hours of TV watched. What causes all of the values to be clumped together? Compare this histogram to the bar chart you generated in question 2.4. Which is a better display for these data?

As can be seen from the histogram, all of the values are clumped together because the distribution is positively skewed, as the coefficient of skewness is positive:

Coefficient of skewness = 3 × (Mean - Median) / Std. Dev. = 3 × (3 - 2) / 2.56 ≈ 1.2 > 0

This means that most of the values lie on the left of the chart, followed by a long tail to the right. Overall, the majority of people watch TV from 1 hour to 4 hours per day, while far fewer watch TV for more than 10 hours.
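The coefficient used here appears to be Pearson's second skewness coefficient, 3 × (mean - median) / standard deviation. As a check, the sketch below recomputes the mean and the sample standard deviation from the frequency counts and then the coefficient. The median is taken as reported (2.00) rather than recomputed, and the variable names are our own.

```python
import math

# Counts copied from the frequency table of hours per day watching TV.
tv_counts = {0: 54, 1: 189, 2: 238, 3: 159, 4: 115, 5: 54, 6: 30, 7: 10,
             8: 22, 10: 13, 11: 3, 12: 13, 14: 2, 15: 2, 20: 1, 24: 1}

n = sum(tv_counts.values())
mean = sum(hours * count for hours, count in tv_counts.items()) / n

# Sample variance (divisor n - 1), weighting each squared deviation by its count.
variance = sum(count * (hours - mean) ** 2
               for hours, count in tv_counts.items()) / (n - 1)
std_dev = math.sqrt(variance)

median = 2.0  # taken from the Statistics table, not recomputed here

# Pearson's second skewness coefficient.
pearson_skew = 3 * (mean - median) / std_dev

print(round(mean, 2), round(std_dev, 2), round(pearson_skew, 1))
```

A positive value indicates that the mean is pulled above the median by the long right tail, which is consistent with the few respondents reporting 20 or 24 hours.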


[Figure: histogram of hours per day watching TV; Mean = 3, Std. Dev. = 2.56, N = 906]

In our group's opinion, the histogram is a better display for these data because it solves the problems of the bar chart identified above. The histogram presents all the values, showing the missing ones as gaps, so the chart is easier to understand. Moreover, the histogram can show the distribution of the collected values with a curve, so even without knowing the exact count in each category, we can still understand the overall trend of the chart.

Question 3: Find a data set which is related to a specific organisational problem (either at the macro or micro level) and apply all possible descriptive statistical techniques that you think suitable to the problem. Write a short report, which includes the objectives of your analysis, the research questions, the analytical techniques you apply to address the research questions, and your findings. The maximum length of the report is 5 pages including Tables and Figures.

1, Objectives of the analysis:

Food production is intricately linked to environmental sustainability, with various stages in the food supply chain contributing to greenhouse gas emissions. This analysis focuses on assessing the environmental impact of different stages and types of food in food production, using data on greenhouse gas emissions measured in kg CO2 equivalents per kg of product. By doing so, our group aims to understand the environmental impact of different stages in the lifecycle of food production and provide insights into potential areas for carbon dioxide mitigation.

2, Research questions:


a) Which stage of food production contributes more to greenhouse gas emissions?

b) Compare the carbon footprint of plant-based foods

c) Compare the carbon footprint of animal-based foods

d) Which types of food have a more negative impact on the environment?

3, Analytical techniques:

This report applies descriptive statistics and the sample correlation coefficient to find out the impact of each stage of food production on the environment. In addition, different types of charts, such as bar charts, pie charts, and histograms, are used to analyze and compare which foods contribute most to greenhouse gas emissions.

4, The analysis

a) Which stage of food production contributes more to greenhouse gas emissions?

[Figure: histogram of Total_emissions; Mean = 5.97, Std. Dev. = 10.502, N = 43]

Statistics
Total_emissions

N    Missing        0
Maximum         59.60

The difference between total emissions of food products

The mean is 5.97, which is well above the value around which most products cluster (about 1.6). The dataset has a positively skewed distribution: the majority of total emissions of food products are clustered towards the lower end (around 1.6), but there are some significant outliers at the higher end (up to 59.6). In conclusion, there is a 9

Title	Consider the variable income in gss.sav file
Authors	Pham Minh Anh, Vu Ha Chau, Tran Ngoc Diep, Pham Tra My, Tran Thi Thu Thao, Le Khanh Vi
Supervisor	Mrs. Tran Thi Bich
School	National Economics University
Major	Business Statistics
Type	Midterm Exam
Year of publication	2023
City	Hanoi
