Ministry of Education and Training National Economics University
GROUP MID-TERM EXAM
SUBJECT: STATISTICS
Students:
Hoang Lé Anh Tho - 11203780
Nguyễn Vũ Thùy Linh - 11205851
Nguyễn Nguyệt Minh - 11206121
Nguyễn Hải Nam - 11206222
Nguyễn Phương Thảo - 11206947
Nguyễn Quang Anh - 11204425
Class: Advanced Finance 62C
Ha Noi - 03/2022
income in the year before the survey)
1 Make a frequency table for the variable. Does the frequency table make sense? Does it make sense to make a histogram of the variable? A bar chart?
FREQUENCY TABLE

Statistics: TOTAL FAMILY INCOME FOR LAST YEAR
N  Valid    1354
   Missing    65

TOTAL FAMILY INCOME FOR LAST YEAR
                             Frequency  Percent  Valid Percent  Cumulative Percent
Valid    $10000 TO
         $12500 TO
         $15000 TO
         $17500 TO
         $20000 TO
         $22500 TO
         $25000 TO
         $30000 TO 39999
         $40000 TO 49999        119       8,4        8,8             60,9
         $50000 TO 59999        108       7,6        8,0             68,8
         $60000 TO 74999        111       7,8        8,2             77,0
         $75000 TO $89999        66       4,7        4,9             81,9
         $90000 - $109999        45       3,2        3,3             85,2
         $110000 OR OVER         76       5,4        5,6             90,8
         REFUSED                124       8,7        9,2            100,0
         Total                 1354      95,4      100,0
Missing  DK                      49       3,5
         NA                      16       1,1
         Total                   65       4,6
Total                          1419     100,0
=> It makes sense to use this frequency table. Because it is a Grouped Frequency Table and income data take so many different values, we need grouping to define the income classes accurately. Together, the frequencies, percentages, and cumulative percentages give a general picture of the family income groups.
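As a rough check on how the columns of the table relate to one another, the sketch below recomputes Percent, Valid Percent, and Cumulative Percent for the seven valid rows whose frequencies are legible in the table ($40000 TO 49999 through REFUSED). It is an illustration in Python, not the SPSS procedure itself. The variable names are our own, the pairing of counts with labels is our reading of the table layout, and the number of valid cases below $40000 is inferred as the valid total minus the listed rows.

```python
# Frequencies of the last seven valid rows of the income table.
# Pairing each count with its label is our reading of the table layout.
rows = [
    ("$40000 TO 49999", 119),
    ("$50000 TO 59999", 108),
    ("$60000 TO 74999", 111),
    ("$75000 TO $89999", 66),
    ("$90000 - $109999", 45),
    ("$110000 OR OVER", 76),
    ("REFUSED", 124),
]
N_VALID = 1354  # valid cases reported by SPSS
N_TOTAL = 1419  # valid + missing cases

# Valid cases in the rows below $40000 (inferred, not read from the table).
running = N_VALID - sum(freq for _, freq in rows)

percent, valid_percent, cumulative_percent = [], [], []
for label, freq in rows:
    running += freq
    percent.append(round(100 * freq / N_TOTAL, 1))
    valid_percent.append(round(100 * freq / N_VALID, 1))
    cumulative_percent.append(round(100 * running / N_VALID, 1))

for (label, freq), p, vp, cp in zip(rows, percent, valid_percent, cumulative_percent):
    print(f"{label:<18}{freq:>5}{p:>7}{vp:>7}{cp:>7}")
```

The recomputed columns agree with the printed ones, which supports reading REFUSED as a valid category and DK and NA as the 65 missing cases.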
Histogram
[Figure: histogram of TOTAL FAMILY INCOME FOR LAST YEAR; Y-axis: Frequency; Mean = 16,13; Std Dev = 5,636; N = 1354]
=> A histogram is the most frequently used graph for showing frequency distributions. The X-axis is divided into intervals that show the scale of values the measurements fall under, while the Y-axis shows the number of times values occurred within each interval. While it is possible to make an equal-width histogram of the income variable, because family income is generally continuous data with the same class intervals, it is not advisable here. The missing values are not clearly set apart, since they lie among the non-missing values, which in turn could cause several misunderstandings. Therefore, it does not make sense to use the histogram for this variable.
BAR CHART
[Figure: bar chart of TOTAL FAMILY INCOME FOR LAST YEAR; Y-axis: Frequency]
=> Using a bar chart in this situation makes sense. A bar chart is a viable option when you want to show a distribution of data points or compare values across categories. With the bar chart, we can see which group has the highest frequency and how it compares to the other groups. From this bar chart, we can also easily identify the overall trend and draw a conclusion, despite the numerous values on the X-axis.
2 What is the scale of measurement for the variable?
The scale of measurement for the income variable is ordinal, because it has been divided into several categories that are not measured but merely assigned as labels. Moreover, the gss.sav file itself lists the scale of measurement for income as ordinal.
3 What descriptive statistics are appropriate for describing this variable, and why? Does it make sense to compute a mean?
The four types of descriptive statistics are Measures of Frequency, Measures of Central Tendency, Measures of Dispersion or Variation, and Measures of Position. For describing this variable, the Measures of Central Tendency (Median and Mode) are the most suitable and the closest to our purpose, which is to determine the most often stated response.
In this situation, computing a Mean does not make sense, and there are several reasons to choose the Median and Mode instead:
• The extreme classes (under $1,000 and $110,000 or above) are open-ended, so we do not have any information about the values within them.
• When we draw a histogram, we see that it is heavily skewed to the right, and the mean is a poor descriptive statistic for skewed distributions.
• Because there are 65 missing values in this frequency table, using the Mean may give inaccurate results.
4 Discuss the advantages and disadvantages of recording income in this manner. Describe other ways of recording income and the problem associated with each of them.
- On the one hand, the data can be presented in a comprehensible manner, since it has been divided into small ranges with logical detail. Moreover, the categories are easy to observe in the data view, where each one needs only the room of a short label. On the other hand, there are some drawbacks to this manner of recording. A Mean computed from the ranges will deviate from the true value, and it is difficult to distinguish between incomes that fall in the same range. In addition, every observation is assigned to a "class". The intervals are equally wide, which helps people read and interpret the graph easily, but this is not essential.
- There are other ways to record income. The first way is to compile the data by the occupation of the members of a family. For example, we divide family members into categories such as Adults (>18 years old) and Teenagers (<18 years old). From this, we can continue to divide each category into smaller sectors and focus on the careers within it. Another way is to collect the data by age. With this method, we have to divide ages into ranges such as under 18, 18 - 24, 24 - 30, 30 - 36, and so on, and then run a survey to gather the information. Although these methods have different advantages, the problem associated with them is that, because of the detail of the procedures, it will take a long time to synthesize the information, which might result in large expenses. Furthermore, the mean values will be unreliable because the variance is so large.
Question 2: In the gss.sav file, the variable tvhours tells you how many hours per day GSS respondents say they watch TV.
a, Make a frequency table of the hours of television watched. Do any of the values strike you as strange? Explain.
FREQUENCY TABLE

Statistics: Hours per day watching TV
N  Missing  513

Hours per day watching TV
Frequency  Percent  Valid Percent  Cumulative Percent
We can see from the frequency table that the value striking us as strange is 12. The number of people watching TV for 0 to 10 hours is high, but from 11 to 24 hours the frequencies drop significantly. However, 12 has a comparatively high frequency within the range from 11 to 24, which is unusual.
b, Based on the frequency table, answer the following questions: Of the people who answered the question, what percentage don’t watch any television? What percentage watch two hours or less? Five hours or more? Of the people who watch TV, what percentage watch one hour? What percentage watch four hours or less?
- Based on the frequency table, of the people who answered this question:
• The percentage of people who do not watch any television is 6%
• The percentage of people who watch two hours or less is 53,1%
• The percentage of people who watch five hours or more is 100% - (6% + 20,9% + 26,3% + 17,5% + 12,7%) = 16,6%
- Based on the frequency table, of the people who watch TV (the value 0 is excluded):
• The percentage of people who watch one hour is 20,9% / (100% - 6%) × 100% = 22,2%
• The percentage of people who watch four hours or less is (20,9% + 26,3% + 17,5% + 12,7%) / (100% - 6%) × 100% = 82,3%
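The arithmetic in this answer can be retraced in a few lines of Python. This is only a sketch: the dictionary holds the valid percentages for 0 to 4 hours quoted in the five-hours-or-more calculation, and the variable names are our own.

```python
# Valid percentages for 0 to 4 hours of TV, as quoted in the answer.
valid_percent = {0: 6.0, 1: 20.9, 2: 26.3, 3: 17.5, 4: 12.7}

# Share of all respondents watching five hours or more.
five_or_more = round(100 - sum(valid_percent.values()), 1)

# Among people who watch TV at all, the base shrinks to 100% - 6%.
watchers = 100 - valid_percent[0]
one_hour = round(100 * valid_percent[1] / watchers, 1)
four_or_less = round(100 * sum(valid_percent[h] for h in (1, 2, 3, 4)) / watchers, 1)

print(five_or_more, one_hour, four_or_less)
```

Running it prints 16.6 22.2 82.3, the same three figures as in the answer. Adding the rounded percentages for 0 to 2 hours gives 53.2 rather than 53,1, which is the kind of small rounding difference to expect when summing already rounded values instead of reading the Cumulative Percent column.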
c, From the frequency table, estimate the 25th, 50th, 75th, 95th percentiles. What is the value for the Median, Mode?
We calculate the 25th, 50th, 75th, and 95th percentiles for the variable tvhours with the Frequencies option, which generates percentiles, median, and mode (as shown in the SPSS output below).
Statistics: Hours per day watching TV
N            Valid    906
Percentiles  25       1,00
             50       2,00
In conclusion, the SPSS output shows:
• The value for the 25th percentile is 1
• The value for the 50th percentile is 2
• The value for the 75th percentile is 4
• The value for the 95th percentile is 8
• The value for the Median is 2
• The value for the Mode is 2
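For readers who want to estimate these values by hand from the frequency table, one common rule is to take the smallest value whose cumulative percent reaches the requested percentile. The Python sketch below applies that rule to the valid percentages quoted in part (b). It is an assumption that this simple rule matches the SPSS setting used here, since SPSS can interpolate between values; the function name is hypothetical. Hours above 4 are not quoted individually in part (b), so the 95th percentile cannot be retraced this way.

```python
# Valid percentages for 0 to 4 hours of TV, as quoted in part (b).
valid_percent = {0: 6.0, 1: 20.9, 2: 26.3, 3: 17.5, 4: 12.7}

def percentile_from_table(table, p):
    """Smallest value whose cumulative percent reaches p.

    Returns None when p lies beyond the values listed in the table.
    """
    cumulative = 0.0
    for value in sorted(table):
        cumulative += table[value]
        if cumulative >= p:
            return value
    return None

quartiles = {p: percentile_from_table(valid_percent, p) for p in (25, 50, 75)}

# The mode is the value with the largest share of respondents.
mode = max(valid_percent, key=valid_percent.get)

print(quartiles, mode)
```

With these inputs the rule gives 1, 2, and 4 for the 25th, 50th, and 75th percentiles and 2 for the mode, in line with the SPSS output. The median is simply the 50th percentile.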
d, Make a bar chart of the hours of TV watched. What problem do you see with this display?
BAR CHART
[Figure: bar chart of Hours per day watching TV; X-axis categories: 0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 14, 15, 20, 24]
There are a few problems with this bar chart:
• A few outliers with low frequencies are spread over a large number of discrete values
• Some values are missing (9, 13, 16, 17, 18, 19, 21, 22, 23) because they do not appear in the survey responses
• Because the higher values have low frequencies, the true shape of the distribution is difficult to see
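The gaps on the X-axis can be listed directly. In the Python sketch below, `observed_hours` copies the category labels from the X-axis of the bar chart; the variable names are our own.

```python
# Category labels on the X-axis of the bar chart (hours per day).
observed_hours = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 14, 15, 20, 24]

# Whole hours between 0 and 24 that no respondent reported.
missing_hours = [h for h in range(25) if h not in observed_hours]

print(len(observed_hours), "bars drawn;", len(missing_hours), "values skipped:", missing_hours)
```

Because a bar chart treats each reported value as a separate category, the 16 bars sit side by side and the nine skipped values leave no visible gap, so distances along the axis are not to scale.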
e, Make a histogram of the hours of TV watched. What causes all of the values to be clumped together? Compare this histogram to the bar chart you generated in question 2d. Which is a better display for these data?
Histogram
[Figure: histogram of Hours per day watching TV; X-axis ticks: 5, 10, 15, 20, 25; Mean = 3; Std Dev = 2.56; N = 908]
All of the values in the histogram are clumped together because the distribution is skewed right: most values are concentrated on the left side and the right tail is long. The positively skewed histogram gives a brief overview that most of the people who took the survey watch television for 1 to 4 hours, while very few of them watch television for more than 10 hours.
From our point of view, a histogram would be a better choice than a bar chart in this situation. As we wrote in part (d), the bar chart does not represent the values that nobody reported, whereas the histogram does. Because the histogram presents the distribution of the collected values and also shows the unreported values in their proper place, it is a more appropriate option than the bar chart.
Question 3: Many factors contribute to the happiness of a marriage (variable hapmar in the gss.sav file). Select one of the factors in the gss.sav file and a descriptive tool to answer the question “what distinguishes people who are very happy with their marriage from those who are less content”. Write a maximum of 5 sentences to justify your selection and findings.
Bar Chart
[Figure: bar chart by HAPPINESS OF MARRIAGE (VERY HAPPY, PRETTY HAPPY, NOT TOO HAPPY); Y-axis ticks: 10000 to 60000]
=> To answer this question, we chose family income (variable incomdol in the gss.sav file) as the factor that determines the happiness of a marriage. We chose this factor because we believe that income has a direct correlation with people's happiness. As such, a difference in income would lead to a very clear difference in people's level of satisfaction in their marriage, giving a chart with a clear trend that is easy to draw conclusions from. According to the figure above, people who are more content with their marriage are people with higher income, because they have fewer money worries and can focus more on their marital happiness. In conclusion, a higher income will most likely lead to a higher level of happiness in marriage.