NATIONAL ECONOMICS UNIVERSITY
BUSINESS STATISTICS
MID-TERM EXAM
CLASS: Advanced Finance 64B
LECTURER: Assoc.Prof Tran Thi Bich
PART I: QUESTIONS
Question 1
The variable incomdol in the gss.sav file is calculated as the class midpoint of the income range for each value of income (family income).
1 Propose how this variable is calculated. Illustrate your answer by calculating the class midpoint of one category of the variable income.
2 Compute summary statistics of this variable and make a histogram as well. What kind of distribution do you get? Explain why you have that kind of distribution.
3 Based on the outputs in (2), below what value of income do 25% of respondents fall? Above what value do 25% of respondents fall?
4 Repeat (2), but in the Frequency procedure select "Values are group midpoints". Indicate the change you notice.
5 Do you think you know the exact income of each family in your sample? Explain your answer.
Question 2
In the gss.sav file, the variable tvhours tells you how many hours per day GSS respondents say they watch TV.
1 Make a frequency table of the hours of television watched. Do any of the values strike you as strange? Explain.
2 Based on the frequency table, answer the following questions: Of the people who answered the question, what percentage don't watch any television? What percentage watch two hours or less? Five hours or more? Of the people who watch TV, what percentage watch one hour? What percentage watch four hours or less?
3 From the frequency table, estimate the 25th, 50th, 75th and 95th percentiles. What is the value for the Median and the Mode?
4 Make a bar chart of the hours of TV watched. What problem do you see with this display?
5 Make a histogram of the hours of TV watched. What causes all of the values to be clumped together? Compare this histogram to the bar chart you generated in (4). Which is a better display for these data?
Question 3
Find a data set which is related to a specific organisational problem (either at the macro or micro level) and apply all descriptive statistical techniques that you think are suitable to the problem. Write a short report, which includes the objectives of your analysis, the research questions, the analytical techniques you apply to address the research questions, and your findings. The maximum length of the report is 5 pages including Tables and Figures.
PART II: ANSWERS
Question 1
1.1
The formula to calculate the class midpoint is: Class midpoint = (Lower class limit + Upper class limit)/2
In this case, for the category with limits 1000 and 2999, the answer is: Midpoint = (1000 + 2999)/2 = 1999.5
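The midpoint formula above can be checked with a few lines of Python. This is only a minimal sketch: the function name class_midpoint is made up for illustration, and the limits are the ones used in the worked example.

```python
def class_midpoint(lower_limit, upper_limit):
    """Return the midpoint of a class given its lower and upper limits."""
    return (lower_limit + upper_limit) / 2


# Limits taken from the worked example in 1.1.
midpoint = class_midpoint(1000, 2999)
print(midpoint)  # 1999.5
```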
1.2
Statistics: Family income; ranges recoded to midpoints
N Valid                      1230
N Missing                    189
Median                       32500.00
Std. Deviation               30091.997
Variance                     905528303.176
Std. Error of Skewness       .070
Range                        109500
Family income; ranges recoded to midpoints
                          Frequency  Percent  Valid Percent  Cumulative Percent
Valid    500                   17      1.2        1.4               1.4
         2000                  17      1.2        1.4               2.8
         6500                  19      1.3        1.5               6.7
         7500                  17      1.2        1.4               8.0
         9000                  40      2.8        3.3              11.3
         11250                 58      4.1        4.7              16.0
         13750                 56      3.9        4.6              20.6
         16250                 50      3.5        4.1              24.6
         18750                 54      3.8        4.4              29.0
         21250                 42      3.0        3.4              32.4
         23750                 59      4.2        4.8              37.2
         27500                 79      5.6        6.4              43.7
         32500                 86      6.1        7.0              50.7
         37500                 82      5.8        6.7              57.3
         45000                119      8.4        9.7              67.0
         55000                108      7.6        8.8              75.8
         67500                111      7.8        9.0              84.8
         82500                 66      4.7        5.4              90.2
         100000                45      3.2        3.7              93.8
         110000                76      5.4        6.2             100.0
         Total               1230     86.7      100.0
Missing  don't know/NA         65      4.6
         refused              124      8.7
         Total                189     13.3
Total                        1419    100.0
[Figure: Histogram of TOTAL FAMILY INCOME FOR LAST YEAR. Mean = 16.13, Std. Dev. = 5.636, N = 1,354]
The histogram shows a left-skewed, or negatively skewed, distribution. The peak is on the right side, signifying the concentration of observations in the higher income categories, while the longer tail of the distribution extends towards the lower income values. This shape occurs because high income levels increase at a slower rate than lower income values.
1.3
Statistics: Family income; ranges recoded to midpoints
N Valid            1230
N Missing          189
Percentiles 25     18750.00
Percentiles 75     55000.00
Based on the output in (2), 25% of respondents have a family income below 18,750 (the 25th percentile), and 25% of respondents have a family income above 55,000 (the 75th percentile).
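As a rough cross-check, the quartiles can be read off the cumulative percent column of the frequency table in 1.2: the p-th percentile is the first midpoint whose cumulative percent reaches p. The sketch below uses only the rows around the quartiles, copied from that table. The function name is hypothetical, and this simple lookup is not necessarily the exact rule SPSS applies.

```python
# (midpoint, cumulative valid percent) rows copied from the frequency table in 1.2.
rows = [
    (13750, 20.6), (16250, 24.6), (18750, 29.0), (21250, 32.4),
    (23750, 37.2), (27500, 43.7), (32500, 50.7), (37500, 57.3),
    (45000, 67.0), (55000, 75.8), (67500, 84.8),
]


def percentile_from_cumulative(table, p):
    """Return the first value whose cumulative percent is at least p."""
    for value, cumulative_percent in table:
        if cumulative_percent >= p:
            return value
    raise ValueError("p lies beyond the rows supplied")


q1 = percentile_from_cumulative(rows, 25)
median = percentile_from_cumulative(rows, 50)
q3 = percentile_from_cumulative(rows, 75)
print(q1, median, q3)
```

With these rows the lookup lands on 18750, 32500 and 55000, which agrees with the SPSS output above.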
1.4
Statistics: Family income; ranges recoded to midpoints
N Valid                      1230
N Missing                    189
Mean                         41840.85
Median                       34583.33 (a)
Mode                         45000
Std. Deviation               30091.997
Variance                     905528303.176
Std. Error of Skewness       .070
Percentiles 25               17668.27 (b)
Percentiles 50               34583.33
Percentiles 75               60079.91
a Calculated from grouped data
b Percentiles are calculated from grouped data
Statistics: Family income; ranges recoded to midpoints
N Valid                      1230
N Missing                    189
Mean                         41840.85
Median                       32500.00
Mode                         45000
Std. Deviation               30091.997
Variance                     905528303.2
Std. Error of Skewness       .070
Percentiles 25               18750.00
Percentiles 50               32500.00
Percentiles 75               55000.00
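After selecting "Values are group midpoints" in the Frequency procedure, the Median and the Percentiles change: the Median moves from 32500.00 to 34583.33, the 25th percentile from 18750.00 to 17668.27, and the 75th percentile from 55000.00 to 60079.91. The Mean, Mode, Std. Deviation and Variance stay the same.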
1.5
The exact income of each family in the sample is not known. The dataset does not record the precise income of a family; it only gives the midpoint of the income range into which the family falls. The variable therefore cannot represent the individual income of each member of the family; it only represents a range, or an approximate income, for the whole family. Without additional information, the exact income of a family cannot be determined from the provided dataset.
Question 2
2.1
Statistics: Hours per day watching TV
N Valid      906
N Missing    513
Hours per day watching TV
                  Frequency  Percent  Valid Percent  Cumulative Percent
Valid    0             54      3.8        6.0               6.0
         1            189     13.3       20.9              26.8
         2            238     16.8       26.3              53.1
         3            159     11.2       17.5              70.6
         4            115      8.1       12.7              83.3
         5             54      3.8        6.0              89.3
         6             30      2.1        3.3              92.6
         8             22      1.6        2.4              96.1
         10            13       .9        1.4              97.6
         12            13       .9        1.4              99.3
         24             1       .1         .1             100.0
         Total        906     63.8      100.0
Missing  NAP          486     34.2
         NA            27      1.9
         Total        513     36.2
Total                1419    100.0
As can be seen from the frequency table above, the figure that stands out is 12, which corresponds to 12 hours a day spent watching television. The number of individuals who watch TV from 0 to 10 hours is fairly high, and the frequencies fall sharply from 11 hours onwards. Within the range from 11 to 24 hours, however, only the value 12 has a noticeably higher frequency, which strikes us as unusual.
Most people watch 6 hours or less per day, with a cumulative percentage of 92.6.
It is also strange that there is still an observation for the value 24 hours, since watching TV for the whole day is biologically implausible.
2.2
Based on the above frequency table, we can infer the following figures (using the Valid Percent and Cumulative Percent columns).
Of the people who answered the question:
o 6.0% of the people don't watch any television
o 53.1% of the people watch TV for two hours or less
o 16.7% of the people watch TV for five hours or more. This is 100% minus the cumulative percent of those watching TV from 0 to 4 hours (83.3%)
Of the people who watch TV (which means the value 0 is excluded, leaving 906 - 54 = 852 people):
o 22.2% watch TV for one hour (189/852)
o 82.3% watch TV for four hours or less (701/852)
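The percentages in 2.2 can be recomputed directly from the counts in the frequency table, which avoids the small errors that come from adding rounded percentages. The sketch below is an illustration only. The counts for 0 to 4 hours and the valid total of 906 are copied from the table, and the helper name pct is made up. Note that for the questions about people who watch TV, the denominator excludes the 54 respondents who answered 0 hours.

```python
# Counts for 0 to 4 hours and the valid total, copied from the frequency table in 2.1.
counts = {0: 54, 1: 189, 2: 238, 3: 159, 4: 115}
n_valid = 906


def pct(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


four_or_less_count = sum(counts.values())  # respondents watching 0 to 4 hours

# Of the people who answered the question.
no_tv = pct(counts[0], n_valid)
two_or_less = pct(counts[0] + counts[1] + counts[2], n_valid)
five_or_more = pct(n_valid - four_or_less_count, n_valid)

# Of the people who watch TV: drop the respondents with 0 hours.
n_watchers = n_valid - counts[0]
one_hour_watchers = pct(counts[1], n_watchers)
four_or_less_watchers = pct(four_or_less_count - counts[0], n_watchers)

print(no_tv, two_or_less, five_or_more)
print(one_hour_watchers, four_or_less_watchers)
```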
2.3
Statistics: Hours per day watching TV
N Valid            906
N Missing          513
Median             2.00
Mode               2
Percentiles 25     1.00
Percentiles 50     2.00
Percentiles 75     4.00
Percentiles 95     8.00
In a data distribution, a percentile is the value below which a specified proportion of values falls. In SPSS, there are several methods for calculating percentiles, each with its own formula. Our group computes the 25th, 50th, 75th and 95th percentiles for the variable TV hours. We select the Frequencies option, which uses a weighted average algorithm to determine percentiles (as displayed in the SPSS output above).
As can be seen from the results which appear in the SPSS output view:
The value for 25th percentile is 1.00
The value for 50th percentile is 2.00
The value for 75th percentile is 4.00
The value for 95th percentile is 8.00
The value for the Median is 2.00 hours
The value for the Mode is 2 hours
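To make the weighted average idea mentioned above concrete, the sketch below implements one common form of it: locate position (n + 1) * p in the sorted data and interpolate linearly between the two neighbouring observations. This is an assumption about the rule, not a confirmed reproduction of the SPSS procedure, and the sample values are made up for illustration (they are not the gss.sav data).

```python
def weighted_average_percentile(values, p):
    """Percentile at position (n + 1) * p with linear interpolation.

    p is a fraction between 0 and 1.
    """
    data = sorted(values)
    n = len(data)
    position = (n + 1) * p
    k = int(position)          # 1-based rank of the lower neighbour
    fraction = position - k    # how far to move towards the upper neighbour
    if k < 1:
        return data[0]
    if k >= n:
        return data[-1]
    return data[k - 1] + fraction * (data[k] - data[k - 1])


# Hypothetical sample of daily TV hours, for illustration only.
sample = [0, 1, 1, 2, 2, 2, 3, 4, 5, 8, 24]
for p in (0.25, 0.50, 0.75):
    print(p, weighted_average_percentile(sample, p))
```

Because TV hours are whole numbers with many ties, interpolation usually lands exactly on an observed value, which is why the reported percentiles are round numbers.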
2.4
[Figure: Bar chart of Hours per day watching TV. Categories on the x-axis: 0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 14, 15, 20, 24]
There are a few problems with this bar chart:
o A few outliers with low frequencies exist, and they are spread over a large number of discrete values
o Some values (9, 13, 16, 17, 18, 19, 21, 22, 23) are missing since they do not occur in the survey answers. The bar chart above lacks a gap that reflects these unobserved values, which might lead readers to misinterpret it at first look
o The real form of the distribution is difficult to determine because only low frequencies are indicated for the higher values
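The list of values that the bar chart silently skips can be derived from the x-axis categories. A minimal sketch, assuming the categories are exactly those shown on the chart axis and that the possible answers run from 0 to 24 hours:

```python
# Categories that appear on the x-axis of the bar chart in 2.4.
observed_hours = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 14, 15, 20, 24]

# Every whole number of hours in a day that no respondent reported.
missing_hours = [h for h in range(0, 25) if h not in observed_hours]
print(missing_hours)  # [9, 13, 16, 17, 18, 19, 21, 22, 23]
```

A bar chart draws one bar per observed category, so these nine values take up no space on the axis, whereas a histogram keeps the numeric scale and shows them as empty intervals.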
2.5
[Figure: Histogram of Hours per day watching TV. Std. Dev. = 2.56, N = 906]
The values are clumped together because most of the respondents watch TV for 8 hours or less a day (96.1%), which is biologically normal, while a few extreme values stretch the scale up to 24 hours. Therefore, the distribution is skewed to the right (positively skewed): the observations form a group on the left of the histogram, with a long tail to the right.
The histogram is a better display for these data than the bar chart, because a histogram also shows the values with 0 responses. Therefore, the histogram gives readers a clearer overview of the distribution across all of its possible outcomes.
Question 3
Title:
1 INTRODUCTION
In today's fiercely competitive business environment, gaining insights into customer perceptions and purchase outcomes holds immense significance for enterprises, especially those operating in the B2B sector
Our research will mainly focus on analysing three elements: geography, customer segment, revenue in correspondence with Therefore, our study objective is to present the influence of customer and region on customers’ decision to purchase products
The current study aims to fill the gap in the literature by addressing the following specific objectives:
1 Examining the relationship between different customer types and their purchasing of our products, based on the value 'revenue'
2 Examining the relationship between geographical regions in the US and customer purchasing power toward different products
3 Recommending a specific strategic plan for the company to expand its market share with regard to customer type and geographical region
2 ANALYTICAL TECHNIQUES & DESCRIPTIVE STATISTICS
2.1 What is the relationship between customer type and line total?
To examine the relationship between these variables, we use a frequency table and a custom table
Customer Type
                     Frequency  Percent  Valid Percent  Cumulative Percent
Valid  (blank)            8        .2         .2               .2
       Club             589      11.8       11.8             11.9
       Distributor      908      18.1       18.1             30.1
       Export           274       5.5        5.5             35.5
       Online           874      17.5       17.5             53.0
       Wholesale       2354      47.0       47.0            100.0
       Total           5007     100.0      100.0
We treat line total as the dependent variable (the variable or outcome you want to measure or predict) and customer type as the independent variable (a factor which may have an impact on the dependent variable). We chose this factor because we believe that customer type has a direct relationship with line total.
There are five customer types: Club, Distributor, Export, Online and Wholesale. Wholesale customers have the most transactions: they account for about 47% of all transactions, nearly half, while Export is the least common type at 5.5%.
Custom Tables
[DataSet1]
                                      Customer Type
                   Club          Distributor   Export        Online        Wholesale
LineTotal  Mean    1410.2203     1367.9475     1360.9597     1359.8465     1405.2199
LineTotal  Sum     830619.7381   1242036.313   372902.9508   1188505.854   3307887.628
Our research mainly analyses the market share of each customer group through the average value (Mean) of total revenue and the total revenue (Sum) of each customer type.
Looking at total revenue, we can see the proportion contributed by each of the five customer types. The table above shows that this company focuses on Distributor and Wholesale customers, because these two types of customers have the largest market share.
Explanation:
Arithmetic Mean: $\mu = \frac{\sum_{i=1}^{n} x_i}{n}$
Line total of Club has the highest Mean but one of the lowest Sums => the number sold is low. Export has the lowest Sum of all.
Line total of Distributor, Export, Online and Wholesale has a similar Mean (1350 - 1411), but Distributor and Wholesale have higher Sums. Therefore, the sales volume of these two types of customers is the highest, and they generate the largest revenue.
From the table, the mean revenue of the different customer types does not differ much. However, there is a big gap between the sums. The reason is that the number of transactions varies between customer types. Therefore, when the mean is multiplied by the number of transactions, the sum is affected.
In conclusion, wholesale customers bring the most revenue to the company. However, this is only because the company has a large number of orders from wholesale customers. On average, all customer types bring almost the same revenue per transaction to the company.
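The link between Mean, Sum and the number of transactions (Sum = Mean x N) can be illustrated with the figures reported above. In this sketch the counts come from the Customer Type frequency table and the means from the Custom Tables output. The decimal placement of the means is an assumption, chosen to match the 1350 - 1411 range quoted in the text.

```python
# Number of transactions per customer type (from the frequency table).
counts = {"Club": 589, "Distributor": 908, "Export": 274,
          "Online": 874, "Wholesale": 2354}

# Mean line total per customer type (decimal placement assumed).
means = {"Club": 1410.2203, "Distributor": 1367.9475, "Export": 1360.9597,
         "Online": 1359.8465, "Wholesale": 1405.2199}

# Sum = Mean x N, so similar means with different N give very different sums.
estimated_sum = {name: means[name] * counts[name] for name in counts}

# Rank customer types from largest to smallest total revenue.
ranked = sorted(estimated_sum, key=estimated_sum.get, reverse=True)
for name in ranked:
    print(f"{name:12s} N={counts[name]:5d}  sum={estimated_sum[name]:14.2f}")
```

The ranking is driven almost entirely by the number of transactions, which is the point made in the conclusion: Wholesale leads because of its 2354 orders, not because each order is worth more.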