HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF ECONOMICS AND MANAGEMENT

GROUP ASSIGNMENT
APPLIED STATISTICS IN BUSINESS

NGUYEN PHUONG ANH    anh.np203253@sis.hust.edu.vn
NINH TRONG DUC       duc.nt203257@sis.hust.edu.vn
DOAN VAN HUNG        hung.dv203299@sis.hust.edu.vn
TRUONG CONG THANH    thanh.tc203288@sis.hust.edu.vn

Lecturer: Dr Nguyen Tien Dzung
Department: Business Administration
School: Economics and Management

HANOI, June 2021

Lecturer's Signature

PROTESTATION

We assure that this is our own research report. All the data and figures in the report come from our own study and are fully cited from known sources. We did not copy from any documents and did not violate the regulations on plagiarism.

Group Applied Statistics Assignment Page of 27

ACKNOWLEDGEMENTS

The success and final outcome of this project required a lot of guidance and assistance from many people, and we are extremely fortunate to have received it throughout the completion of our project work. Whatever we have done is due to such guidance and assistance, and we would not forget to thank them. We respect and thank Mr Nguyen Tien Dung for giving us the opportunity to do the project work in Applied Statistics and Experimental Design. We are extremely grateful to him for providing such good lectures in every online class on Microsoft Teams.

TABLE OF CONTENTS

List of Figures
List of Tables
Executive Summary
Section 1 Descriptive Statistics with Tabular and Graphical Displays
1.1 Question
1.2 Question
1.3 Question
1.4 Question
1.5 Question
Section 2 Descriptive Statistics with Numerical Measures
2.1 Question
2.2 Question
2.3 Question
2.4 Question
2.5 Question
Section 3 Hypothesis Tests
3.1 Question
3.2 Question
3.3 Question
3.4 Question
3.5 Question
Section 4 Experimental Design and ANOVA
4.1 Question
4.2 Question
4.3 Question
4.4 Question
4.5 Question
Section 5 Statistical Analysis with Real Data
5.1 Data Description
5.2 Analysis Objectives
5.3 Data Analysis and
Interpretation
5.4 Concluding Remarks
References
Appendices

EXECUTIVE SUMMARY

Students write a short summary of the whole report. The purpose is to summarize the key findings of the report in a short, easy-to-understand form so that managers can catch the main ideas and the key management issues. Superiors are often very busy and may not have enough time to read the whole report.

SECTION 1 DESCRIPTIVE STATISTICS WITH TABULAR AND GRAPHICAL DISPLAYS

1.1 Question

A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several nonoverlapping categories or classes. A percent frequency distribution gives the percentage of observations in each class. Thus, to help management develop a customer profile, we first construct the percent frequency distributions for the key variables: type of customer, items, net sales, method of payment, gender, marital status and age group.

1.1.1 Percent frequency distribution for Type of Customers

Type of Customer    Frequency    Percent Frequency
Promotional                70                   70
Regular                    30                   30
Grand Total               100                  100

Table 1.1 Percent frequency distribution for Type of Customers

From the table above, we can see that in the sample of 100 customers, there are 70 promotional customers and 30 regular customers.

1.1.2 Percent frequency distribution for Items

Items          Frequency    Percent Frequency
1                     29                   29
2                     27                   27
3                     10                   10
4                     10                   10
5                      9                    9
6                      7                    7
7+                     8                    8
Grand Total          100                  100

Table 1.2 Percent frequency distribution for Items

1.1.3 Percent frequency distribution for Net Sales

Net Sales        Frequency    Percent Frequency
0.00-24.99               9                    9
25.00-49.99             30                   30
50.00-74.99             25                   25
75.00-99.99             10                   10
100.00-124.99           12                   12
125.00-149.99            4                    4
150.00-174.99            3                    3
175.00-199.99            3                    3
200+                     4                    4
Grand Total            100                  100

Table 1.3 Percent frequency distribution for Net Sales

1.1.4 Percent frequency distribution for Gender

Gender         Frequency    Percent Frequency
Female                93                   93
Male                   7                    7
Grand Total          100                  100

Table 1.4 Percent frequency distribution for Gender

1.1.5 Percent frequency distribution for Marital Status

Marital Status    Frequency    Percent Frequency
Married                  84                   84
Single                   16                   16
Grand Total             100                  100

Table 1.5 Percent frequency distribution for Marital Status

1.1.6 Percent frequency distribution for Age Group

Age Group      Frequency    Percent Frequency
20-29                 10                   10
30-39                 30                   30
40-49                 33                   33
50-59                 16                   16
60-69                  7                    7
70-79                  4                    4
Grand Total          100                  100

Table 1.6 Percent frequency distribution for Age Group

The sum of the frequencies in each frequency distribution is 100, which equals the number of observations. In addition, the sum of the percentages in a percent frequency distribution always equals 100.

These percent frequency distributions provide a profile of Pelican's customers. We can conclude that:
- Over half of the customers purchase 1 or 2 items, but a few make numerous purchases.
- The percent frequency distribution of net sales shows that 61% of the customers spent $50 or more.
- Customers are distributed across all adult age groups.
- The overwhelming majority of customers are female.
- Most of the customers are married.

1.2 Question

To construct a bar chart showing the number of customer purchases attributable to each method of payment, we count the customers by method of payment using a PivotTable in Excel. Excel's PivotTable Report is an interactive tool that allows us to quickly summarize data in a variety of ways, including developing a frequency distribution for quantitative data.

1.2.1 PivotTable showing the number of customer purchases attributable to the method of payment

Method of Payment    Count of Customer
American Express
Discover
MasterCard                          14
Proprietary Card                    70
Visa                                10
Grand Total                        100

Table 1.7 The number of customer purchases attributable to the method of payment

1.2.2 Bar chart showing the number of customer purchases attributable to the method of payment

Figure 1.1 The number of customer purchases attributable to the method of payment (bar chart; vertical axis: Number of Customers, horizontal axis: Method of Payment)

From the bar chart above, we conclude that a large majority of the customers (70 of 100) use the proprietary credit card.

1.3 Question

1.3.1 A crosstabulation of type of customer (regular or promotional) versus net sales

Crosstabulation is a basic technique for examining the relationship between two categorical variables. In this case, using the Net Sales category as the row variable and Customer type as the column variable, we create a two-dimensional crosstabulation that shows the number of customers of each type in each Net Sales category.

Net Sales        Promotional    Regular    Total
0.00-24.99
25.00-49.99               17         13       30
50.00-74.99               17                  25
75.00-99.99                                   10
100.00-124.99                                 12
125.00-149.99
150.00-174.99                         1
175.00-199.99
200+
Total                     70         30      100

Table 1.8 A crosstabulation of types of customer versus net sales

1.3.2 Comment on similarities or differences present

In terms of similarities between promotional and regular customers, we reach the following conclusions:
- For both types of customers, the amounts charged to the credit card fall most often in the 25.00-49.99 and 50.00-74.99 ranges.
- Only a few customers charged more than $125.00.
In terms of differences, we can conclude that:
- Some customers who use promotional coupons have net sales above $175.00, but no regular customers do.
In conclusion, from the crosstabulation above, it appears that net sales are larger for promotional customers.

1.4 Question

A scatter diagram is a graphical display of the relationship between two quantitative variables, and a trendline is a line that provides an approximation of the relationship. In
this case, we want to determine whether age and net sales are related, so we construct a scatter diagram of net sales against customer age.

Customers taking advantage of the promotional coupons spent more money on average. The mean amount spent by all customers is $77.60; the average amount spent by promotional customers was $84.29. The standard deviation of sales is $55.66, which indicates a fairly wide variability in purchase amounts across customers. This variability is quite a bit smaller for the regular customers. The distribution of the sales data is skewed to the right: the mean ($77.60) is larger than the median ($59.71) and the skewness measure (1.71) is positive. Positive skewness is typical for this kind of data, since there are no negative sales amounts and there are a few large purchases.

2.2 Question

To determine the relationship between Age and Net Sales, we calculate the correlation coefficient. Let $x$ be the age variable and $y$ be the net sales variable. We apply the formula for the sample correlation coefficient, denoted by $r_{xy}$:

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

where
$r_{xy}$ = sample correlation coefficient
$s_{xy}$ = sample covariance
$s_x$ = sample standard deviation of $x$
$s_y$ = sample standard deviation of $y$

According to Table 2.1, the sample standard deviation of Net Sales is $s_y = 55.66$. We still need to calculate the sample covariance and the sample standard deviation of the Age variable. We use MegaStat to determine descriptive statistics on Age.

                                  Age
Count                             100
Mean                            43.08
Sample standard deviation       12.39
Sample variance                153.49
Minimum                            20
Maximum                            78
Range                              58
Sum                          4,308.00
Skewness                         0.52
Kurtosis                         0.07
Mode                            46.00

Table 2.3 Descriptive statistics on age

It indicates that the sample standard deviation of Age is $s_x = 12.39$. The sample covariance is calculated using the formula

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

To get the result, we use Excel to carry out the computation.

Since the value of $r_{xy}$ is near zero, it indicates a weak linear relationship between the Net Sales and Age variables. In
other words, age is not a factor in determining net sales.

2.3 Conclusion

Using the methods of descriptive statistics, we can conclude that promotional coupons and the proprietary card may affect the store's net sales (specifically, they appear to increase net sales), while age is not a factor in determining net sales.

SECTION 3 HYPOTHESIS TESTS

3.1 Question

After conducting a hypothesis test for each of the four samples at the 0.01 level of significance, we obtain the following results:

                       Sample 1    Sample 2    Sample 3    Sample 4
Hypothesized value           12          12          12          12
Mean                      11.96       12.03       11.89       12.08
Standard deviation         0.22        0.22        0.20        0.21
Standard error             0.04        0.04        0.04        0.04
Sample size                  30          30          30          30
Test statistic            -1.03        0.71       -2.94        2.16
p-value                    0.30        0.48       0.003        0.03

Table 3.1 Hypothesis testing results

Only Sample 3 leads to the rejection of the hypothesis $H_0: \mu = 12$, because its p-value (0.003) is below 0.01. Thus, corrective action is warranted for Sample 3. For the other samples, $H_0$ cannot be rejected; thus, the process is operating satisfactorily. Sample 3, with $\bar{x} = 11.89$, shows the process is operating below the desired mean. Sample 4, with $\bar{x} = 12.08$, is on the high side, but the p-value of 0.03 is not sufficient to reject $H_0$ at the 0.01 level.

3.2 Question

                       Sample 1    Sample 2    Sample 3    Sample 4
Standard deviation        0.220       0.220       0.207       0.206

The sample standard deviations for all four samples are in the 0.20 to 0.22 range. This indicates that the assumption of 0.21 for the population standard deviation is reasonable.

3.3 Question

With $\alpha = 0.01$ and $z_{\alpha/2} = z_{0.005}$, and using the standard error of the mean $\sigma_{\bar{x}} = \sigma / \sqrt{n}$, the upper and lower control limits are computed as follows:

Upper control limit = $\mu_0 + z_{0.005}\,\sigma_{\bar{x}}$
Lower control limit = $\mu_0 - z_{0.005}\,\sigma_{\bar{x}}$

where $\mu_0 = 12$ is the hypothesized mean, $\sigma = 0.21$ and $n = 30$. As long as a sample mean is between these two limits, the process is in control and no corrective action is required.

3.4 Question

Increasing the level of significance will cause the null hypothesis to be rejected more often. Although this may mean quicker corrective action when the process is out of control, it also means a higher probability of stopping the process and attempting corrective action when
the process is operating satisfactorily. This would be an increase in the probability of making a Type I error.

3.5 Conclusion

SECTION 4 EXPERIMENTAL DESIGN AND ANOVA

4.1 Question

Anova: Single Factor
Data from Medical 1

SUMMARY
Groups            Count    Sum    Average    Variance    Sample STD    Minimum    Maximum    Range
Florida              20    111       5.55        4.58          2.14
New York             20    160       8.00        4.84          2.20                    13        9
North Carolina       20    141       7.05        8.05          2.84                    12        9

Anova: Single Factor
Data from Medical 2

SUMMARY
Groups            Count    Sum    Average    Variance    Sample STD    Minimum    Maximum    Range
Florida              20    290      14.50       10.05          3.17                    21       12
New York             20    305      15.25       17.04          4.13                    24       15
North Carolina       20    279      13.95        8.68          2.95                    19       11

Conclusion: People 65 years of age or older who have a chronic health condition have higher depression scores than individuals in reasonably good health. In both groups (reasonably good health and chronic health condition), people living in New York have the highest mean depression score.

4.2 Question

Medical 1

Hypotheses tested:
$H_0: \mu_1 = \mu_2 = \mu_3$ (there is no significant difference in the mean depression score of healthy people in the three locations)
$H_a$: not all of the means are equal (there is a significant difference in the mean depression score of healthy people in the three locations)
where:
$\mu_1$ = the mean depression score of healthy people in Florida
$\mu_2$ = the mean depression score of healthy people in New York
$\mu_3$ = the mean depression score of healthy people in North Carolina

Rejection Rule: Reject the null hypothesis if the calculated value of the F statistic is greater than the critical value of F (F crit).

SUMMARY
Groups            Count    Sum    Average    Variance
Florida              20    111       5.55        4.58
New York             20    160       8.00        4.84
North Carolina       20    141       7.05        8.05

ANOVA
Source of Variation        SS    df       MS       F    P-value    F crit
Between Groups          61.03     2    30.52    5.24       0.01      3.16
Within Groups          331.90    57     5.82
Total                  392.93    59

Conclusion: The null hypothesis is rejected, because the sample provides enough evidence to support a difference in the mean depression score of healthy
people in the three locations (F = 5.24 > F crit = 3.16), so not all of the location means are equal. The difference comes mainly from the gap between the New York and Florida means.

Medical 2

Hypotheses tested:
$H_0: \mu_1 = \mu_2 = \mu_3$ (there is no significant difference in the mean depression score of people with a chronic health condition in the three locations)
$H_a$: not all of the means are equal (there is a significant difference in the mean depression score of people with a chronic health condition in the three locations)
where:
$\mu_1$ = the mean depression score of people with a chronic health condition in Florida
$\mu_2$ = the mean depression score of people with a chronic health condition in New York
$\mu_3$ = the mean depression score of people with a chronic health condition in North Carolina

Rejection Rule: Reject the null hypothesis if the calculated value of the F statistic is greater than the critical value of F (F crit).

SUMMARY
Groups            Count    Sum    Average    Variance
Florida              20    290      14.50       10.05
New York             20    305      15.25       17.04
North Carolina       20    279      13.95        8.68

ANOVA
Source of Variation        SS    df       MS       F    P-value    F crit
Between Groups          17.03     2     8.52    0.71       0.49      3.16
Within Groups          679.70    57    11.92
Total                  696.73    59

Conclusion: The null hypothesis cannot be rejected, because the sample does not provide enough evidence of a difference in the mean depression score across the three locations (F = 0.71 < F crit = 3.16). For this group, the data show no relation between location and depression score.

4.3 Question

From the two output results above we observe that:
- There is no interaction between health and location.
- There is a big difference in depression scores between people in good health and people with a chronic health condition.
- For people in reasonably good health, geographical location affects the level of depression. However, for people with a chronic health problem, the depression score does not vary significantly across the states.

SECTION 5 STATISTICAL ANALYSIS WITH REAL DATA

Population and samples

5.1 Introduction

5.1.1 Population and Sample

Our team found the dataset on a website named "Kaggle". The survey gathered basic information such as height and weight from 500 respondents. Our major goal is to see if
there is a difference in average height between boys and girls, as well as to examine the relationship between the respondents' height and weight.

5.1.2 Sample size

Following a discussion among team members, everyone agreed to select a sample of 500 students. This number is sufficiently large for us to obtain appropriate proportions for our testing and should give a more accurate result. Furthermore, when the sample size is too small, the picture it gives of height and weight may not reflect the actual condition of the whole population of respondents.

5.1.3 Sampling method

There are several sampling methods available for studying people's height and weight. However, because the total population is very large (for example, the total number of obese people is approximately 650 million) and a sample size of 500 was chosen, we decided to use simple random sampling to collect the responses.

5.1.4 Data Collection

After selecting the main objective and content, our group sought datasets on the Internet. Thanks to a recommendation by Mr. Nguyen Tien Dung, we came across a survey of height and weight on a website named "Kaggle". The data has been viewed by nearly 100 thousand users; therefore, we are confident that the collected data is authentic. Finally, we recorded the variables (gender, height, weight, index) and processed the data with the help of Excel. The tables of the data are presented in Appendix C of this report.

5.1.5 Data Processing

Because the data set is too large to compute manually, we collected the information and analysed the figures with the assistance of Microsoft Excel. We entered the data into Excel and computed statistics using "MegaStat" and "PivotTable". Our team also used Excel's chart tools to visualize the data in pie charts, histograms and a regression line, which make the results much easier for readers to understand.

5.1.6 Significance level of sample test

According to our research, the average height of men and women is roughly 170 cm. Therefore, based on what we have learnt, we form the hypothesis that there is no difference in the average height between boys and girls, tested at the 5% level of significance.

5.2 Descriptive Statistics

After obtaining the raw data, we divided the data into groups based on gender in order to conduct further statistics.

Figure 5.1 The gender proportion (pie chart)

The percentages of boys and girls among the 500 respondents are roughly equal, as seen in the pie chart, with 49 percent (245 boys) and 51 percent (255 girls) respectively. This balance helps ensure that the comparison between the two groups is not biased toward one gender.

First, we computed descriptive statistics for males. The table is shown below:

                                    Height
Sample size                         245.00
Sample mean                         169.65
Sample standard deviation            17.07
Mode                                179.00
Median                              171.00
Minimum                             140.00
Maximum                             199.00
Range                                59.00
Confidence interval 95% lower       167.50
Confidence interval 95% upper       171.80
1st quartile                        154.00
3rd quartile                        183.00
Interquartile range                  29.00

Table 5.1 The male's height descriptive statistic

Then we built the frequency distribution table and histogram.

lower        upper    midpoint    width    frequency    percent
  140   <      145         143        5           22        9.0
  145   <      150         148        5           19        7.8
  150   <      155         153        5           21        8.6
  155   <      160         158        5           17        6.9
  160   <      165         163        5           17        6.9
  165   <      170         168        5           21        8.6
  170   <      175         173        5           20        8.2
  175   <      180         178        5           25       10.2
  180   <      185         183        5           26       10.6
  185   <      190         188        5           23        9.4
  190   <      195         193        5           18        7.3
  195   <      200         197        5           16        6.5

Table 5.2 The male's height frequency distribution table

Figure 5.2 The male's height histogram (vertical axis: Frequency, horizontal axis: Height)

We can observe some basic information from the table and histogram. The sample includes 245 boys with an average height of 169.65 cm. Their heights range from 140 to 199 centimeters, with a
sample standard deviation of 17.07 centimeters. Furthermore, among boys, the most common height (mode) is 179 cm, which is higher than the median (171 cm) and the mean (169.65 cm), indicating a left-skewed distribution.

                                    Weight
Sample size                            245
Sample mean                         106.31
Sample standard deviation            31.83
Mode                                 80.00
Median                              105.00
Minimum                                 50
Maximum                                160
Range                                  110
Confidence interval 95% lower       102.31
Confidence interval 95% upper       110.32
1st quartile                         80.00
3rd quartile                        137.00
Interquartile range                  57.00

Table 5.3 The male's weight descriptive statistic

lower        upper    midpoint    width    frequency    percent
   50   <       60          55       10           19        7.8
   60   <       70          65       10           19        7.8
   70   <       80          75       10           20        8.2
   80   <       90          85       10           26       10.6
   90   <      100          95       10           23        9.4
  100   <      110         105       10           26       10.6
  110   <      120         115       10           21        8.6
  120   <      130         125       10           14        5.7
  130   <      140         135       10           27       11.0
  140   <      150         145       10           27       11.0
  150   <      160         155       10           19        7.8
  160   <      170         165       10            4        1.6

Table 5.4 The male's weight frequency distribution table

Figure 5.3 The male's weight histogram (vertical axis: Frequency, horizontal axis: Weight)

The sample mean weight is 106.31 kg, according to the data. The range is 110 kg, with the lowest value being 50 kg. Furthermore, the most common weight (mode) is 80 kg, which is lower than the median (105 kg) and the mean (106.31 kg), indicating that the frequency distribution is right skewed with a long right tail.

We followed the same steps for females:

                                    Height
Sample size                         255.00
Sample mean                         170.23
Sample standard deviation            15.71
Mode                                150.00
Median                              170.00
Minimum                             140.00
Maximum                             199.00
Range                                59.00
Confidence interval 95% lower       168.29
Confidence interval 95% upper       172.17
1st quartile                        157.00
3rd quartile                        184.00
Interquartile range                  27.00

Table 5.5 The female's height descriptive statistic

lower        upper    midpoint    width    frequency    percent
  140   <      145         143        5           11        4.3
  145   <      150         148        5           16        6.3
  150   <      155         153        5           26       10.2
  155   <      160         158        5           19        7.5
  160   <      165         163        5           25        9.8
  165   <      170         168        5           30       11.8
  170   <      175         173        5           18        7.1
  175   <      180         178        5           22        8.6
  180   <      185         183        5           28       11.0
  185   <      190         188        5           31       12.2
  190   <      195         193        5           14        5.5
  195   <      200         197        5           15        5.9

Table 5.6 The female's height frequency distribution table

Figure 5.5 The female's height histogram (vertical axis: Frequency, horizontal axis: Height)

According to the tables and histogram above, the sample includes 255 girls with an average height of 170.23 cm. Their height range (59 cm) is the same as that of their male counterparts, while the female sample standard deviation is lower than the male sample standard deviation, at 15.71 cm versus 17.07 cm. Another interesting feature of the female height dataset is that the distribution is right skewed, with the mode (150 cm) being smaller than the median (170 cm) and the mean (170.23 cm).

                                    Weight
Sample size                            255
Sample mean                         105.70
Sample standard deviation            32.96
Mode                                126.00
Median                              106.00
Minimum                              50.00
Maximum                             160.00
Range                               110.00
Confidence interval 95% lower       101.63
Confidence interval 95% upper       109.76
1st quartile                         79.00
3rd quartile                        135.00
Interquartile range                  56.00

Table 5.7 The female's weight descriptive statistic

lower        upper    midpoint    width    frequency    percent
   50   <       60          55       10           25        9.8
   60   <       70          65       10           23        9.0
   70   <       80          75       10           18        7.1
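The hypothesis stated in 5.1.6 (no difference in average height between boys and girls, tested at the 5% level) can be checked directly from the summary statistics in Table 5.1 and Table 5.5. The following is a minimal Python sketch, not the procedure used in this report (which relies on Excel and MegaStat). The function name `two_sample_z_from_summary` is hypothetical, and the sketch assumes independent samples, unequal variances, and a large-sample normal approximation for the test statistic.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Return (z, two-sided p-value) for H0: mu1 == mu2.

    Assumption: both samples are large, so the standardized difference
    of sample means is treated as approximately standard normal.
    """
    # Standard error of the difference between two independent sample means.
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / se
    # Two-sided p-value from the standard normal distribution.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Summary statistics copied from Table 5.1 (male) and Table 5.5 (female).
z, p = two_sample_z_from_summary(169.65, 17.07, 245, 170.23, 15.71, 255)
alpha = 0.05  # significance level stated in 5.1.6
print(f"z = {z:.3f}, p-value = {p:.3f}, reject H0: {p < alpha}")
```

Under these assumptions the p-value is far above 0.05, so this sketch would not reject the hypothesis of equal mean heights. A t-based test on the raw data could give slightly different figures.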