STATISTICS ANOVA AND MULTIPLE REGRESSION REPORT

VIETNAM NATIONAL UNIVERSITY OF HCMC
INTERNATIONAL UNIVERSITY

Statistics for Business
Instructor: Hồ Thanh Vũ

PROJECT REPORT

Group:
Student name               Student ID
• NGUYỄN THỊ ÁNH TUYẾT     BABAWE12088
• HUỲNH THANH TONG         BABAWE14143
• TRẦN HOÀNG ANH           BABAWE14100
• TRẦN LÊ NGỌC THỊNH       BABAWE14230
• TRẦN VŨ KHA              BABAWE14113

CONTENT
I Descriptive statistics about the data
II Method presentation and result interpretation for the statements
• Women are likely to shop online rather than men
• There is some difference in the amount of money spent among different age groups of women
• There is some difference in the amount of money spent among different age groups of men
III Multiple linear regression between revenue/month and the five factors
IV Multiple regression with factors for the following groups
• Male under 18
• Male aged 18 – 27
• Male over 27
• Female under 18
• Female aged 18 – 27
• Female over 27

I Descriptive statistics about the data

In 2015, Lazada hired Neslien CA, a market research company, to study the factors that affect customers' decisions when shopping at Lazada. 1,500 surveys were issued and 1,241 responses were received (a response rate of 82.73%). The table and graph below show how the 1,241 responses are distributed between males and females in different age groups.

Age group   MALE   FEMALE
Under 18    114    215
18 – 27     125    308
Over 27     130    349
Total       369    872
Total of both genders: 1241

The survey covers five factors that are believed to influence customers' decisions in online shopping: 1) Price, 2) Brand awareness, 3) Security, 4) Ease of payment, and 5) Promotion and Marketing. Each factor is scored on a range of [-3; 3], from the lowest to the highest evaluation of Lazada's online service by consumers. The responses for the five factors are summarized by the following descriptive statistics.

The mean, or average, is probably the most commonly used measure of central tendency. In this case, the mean of Price is -0.0757, which means that, on average, consumers do not expect the price at Lazada to outperform other brands. Expectations are slightly higher for the rest: Brand (0.359), Security (0.275), Payments (0.3158), and Promotion and Marketing (0.302), but on average these are still not expected to outperform other brands.

The standard error is the standard deviation of the sampling distribution of a statistic, most commonly of the mean. In this case, the standard errors of the factors are close to each other: Price is 0.057, slightly higher than Brand (0.055), Security (0.056), Payments (0.056), and Promotion and Marketing (0.056).

The median is the score found at the exact middle of the set of values. One way to compute the median is to list all scores in numerical order and then locate the score in the center of the sample. In this case, the median of Price is 0, equal to the median of Brand (0), Security (0), Payments (0), and Promotion and Marketing (0), because they all share the same range and scale of evaluation.

The mode is the most frequently occurring value in the set of scores. To determine the mode, order the scores as described above and count each one; the most frequently occurring value is the mode. In this case, the mode of Price is -3, which means most consumers in the survey expect Lazada to be totally outperformed by other brands on price. Meanwhile, the other factors most often receive the highest score: Brand (3), Security (3), Payments (3), and Promotion and Marketing (3).

The standard deviation (σ) quantifies the amount of variation or dispersion in a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. In this case, the standard deviation of Price is the highest (2.022), followed by Security (1.991), Promotion and Marketing (1.972), Payments (1.967), and Brand (1.955).

The variance is the expectation of the squared deviation of a random variable from its mean; informally, it measures how far a set of numbers is spread out from their mean. In this case, the variance of Price is also the highest (4.091), followed by Security (3.966), Promotion and Marketing (3.891), Payments (3.869), and Brand (3.823).

The skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, negative, or even undefined. In this case, Price has the highest skewness (0.053), followed by Payments (-0.08), Security (-0.112), Promotion and Marketing (-0.113), and Brand (-0.114).

Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution, and there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample. Depending on the particular measure of kurtosis that is used, there are various interpretations of kurtosis and of how particular measures should be interpreted. In this case, Brand has the highest kurtosis (-1.216), followed by Security (-1.230), Promotion and Marketing (-1.249), Payments (-1.253), and Price (-1.259).

The range is simply the difference between the highest and the lowest score. If the range is small, the scores are close together; if it is large, the scores are more spread out. In this case, the five factors have the same range of scores.

The maximum and minimum appear in the calculations of other summary statistics; both are used to calculate the range, which is simply the difference between the maximum and the minimum. In this case, the maximum and minimum are the same for all five factors: 3 and -3 respectively.

The count is the number of survey responses collected, from 1,241 consumers of Lazada.

The same indicators are used to present Lazada's revenue/month from the
chosen consumers, that is, the money they spend each month shopping online on Lazada.

REVENUE/MONTH
Mean                 167.3296132
Standard Error       11.20328795
Median               128.01
Mode                 139.95
Standard Deviation   394.6675223
Sample Variance      155762.4532
Kurtosis             96.70825409
Skewness             9.86533105
Range                4344.3
Minimum              7.19
Maximum              4351.49
Sum                  207656.05
Count                1241

The average amount of money spent per consumer is 167.3296132. The amount spent by the largest number of Lazada consumers (the mode) is 139.95. The standard deviation is 394.6675223 and the sample variance is 155762.4532.

II Illustrate the following issues

Are women more likely to shop online than men?

We can accept this statement if the average amount of money a woman spends on online shopping is larger than that of a man.

• First, conduct a hypothesis test in which the means for females and males are µ1 and µ2 respectively:
H0: µ1 - µ2 ≤ 0, indicating that women do not spend more money than men
H1: µ1 - µ2 > 0, indicating that women spend more money than men
• Next, use Excel tools to produce the z-test table. Assume normally distributed populations and independent random samples; the population variance is not given.

z-Test: Two Sample for Means
Mean, Known Variance, Observations, Hypothesized Mean Difference, z, P(Z

• Now, we can construct a regression model from the regression analysis table, produced from the given data with Data Analysis in Excel.

Coefficients   150.348   37.530   7.383   11.926

From the coefficients table, we can set up the regression equation as follows:
Y = 150.35 + 37.53 X1 + 7.38 X2 + 11.93 X3 + 8.98 X4 + 36.55 X5 + ε

Fcrit = F(0.05, 5, 1235) = 2.22132601
Since F ratio > Fcrit, there is enough evidence to reject H0. Thus, revenue/month depends on at least 1 of the factors.

To test whether the variables of the regression model are significant, we conduct a t-test of the individual regression parameters. The critical value at the 0.05 level of significance, df = 1235: T critical = 1.960. The null and alternative hypotheses for each variable, and the conclusion, can be briefly described through the table below.

Variable   Hypotheses                Test                  Decision
PRICE      H0: β1 = 0, H1: β1 ≠ 0    T test > T critical   Reject
BRAND      H0: β2 = 0, H1: β2 ≠ 0    T test < T critical   Non reject
SECURITY   H0: β3 = 0, H1: β3 ≠ 0    T test > T critical   Reject
PAYMENTS   H0: β4 = 0, H1: β4 ≠ 0    T test < T critical   Non reject
P&M        H0: β5 = 0, H1: β5 ≠ 0    T test > T critical   Reject

Thus, among the five factors, price, security, and promotion and marketing are expected to have a significant effect on revenue/month.

IV Regression models for specific groups

CASE 1: MALE UNDER 18

ANOVA
             MS           F             Significance F
Regression   7553.15872   24.64870935   1.7193E-16
Residual     306.432219

FT > FC indicates that there is enough evidence to reject the null hypothesis, which means at least 1 factor affects the money spent by males under 18.

The regression table for the group of male under 18
            Coefficients   Standard Error   t Stat
Intercept   124.4638167    1.848918513      67.31709153
PRICE       0.689368267    0.816347265      0.844454678
BRAND       -0.993916385   0.813290658      -1.222092466
SECURITY    6.233068927    0.949998207      6.561137566
PAYMENTS    4.627027576    0.947222083      4.884839213
P&M         6.866677223    0.833018299      8.243128913

From the table, we can set up the regression equation as follows:
Y = 124.464 + 0.689X1 - 0.994X2 + 6.233X3 + 4.627X4 + 6.867X5 + ε

• The null and alternative hypotheses for each variable:
H0: βi = 0
H1: βi ≠ 0
The critical value ±tC = ±t108,0.025 = ±1.9822

Based on the table coefficients, we can conclude:
PRICE      T test < T critical    → Non reject (Removable)
BRAND      T test > -T critical   → Non reject (Removable)
SECURITY   T test > T critical    → Reject (Significant)
PAYMENTS   T test > T critical    → Reject (Significant)
P&M        T test > T critical    → Reject (Significant)

CASE 2: MALE AGED 18-27

H0: β1 = β2 = β3 = β4 = β5 = 0
H1: not all βi (i = 1, 2, 3, 4, 5) are equal to zero

ANOVA
             df    SS            MS            F            Significance F
Regression   5     37398.7613    7479.752261   46.3666747   2.23276E-26
Residual     119   19196.77275   161.3174181
Total        124   56595.53405

At the significance level of 0.05, the critical value is F(0.05, 5, 119) = 2.2899.
FT > FC indicates that there is enough evidence to reject the null hypothesis, which means at least 1 factor affects the money spent by males aged 18-27.

The regression table for the group of male aged 18-27
            Coefficients   Standard Error   t Stat        P-value
Intercept   122.3481617    1.272018764      96.18424284   1.1E-114
PRICE       0.508473755    0.561728716      0.905194519   0.36719137
BRAND       0.611516825    0.578680059      1.05674425    0.29276888
SECURITY    5.899148815    0.624324642      9.448848274   3.78409E-16
PAYMENTS    5.983963271    0.591661516      10.11382879   9.92225E-18
P&M         1.027075128    0.572507111      1.793995407   0.07535318

From the table, we can set up the regression equation as follows:
Y = 122.348 + 0.5085X1 + 0.6115X2 + 5.899X3 + 5.984X4 + 1.0271X5 + ε

• The null and alternative hypotheses for each variable are the same as the ones previously used.
Df = 5
Alpha = 0.05, alpha/2 = 0.025
The critical value ±tC = ±t119,0.025 = ±1.9801

Based on the table coefficients, we can conclude:
PRICE      T test < T critical   → Non reject H0: β1 = 0 (Removed)
BRAND      T test < T critical   → Non reject (Removed)
SECURITY   T test > T critical   → Reject (Significant)
PAYMENTS   T test > T critical   → Reject (Significant)
P&M        T test < T critical   → Non reject (Removed)

CASE 3: MALE OVER 27

H0: β1 = β2 = β3 = β4 = β5 = 0
H1: not all βi (i = 1, 2, 3, 4, 5) are equal to zero

ANOVA
             df    SS            MS            F            Significance F
Regression   5     64974.6426    12994.92852   60.6207682   1.14242E-31
Residual     124   26581.17312   214.3642994
Total        129   91555.81572

At the significance level of 0.05, the critical value is F(0.05, 5, 124) = 2.2899.
FT > FC indicates that there is enough evidence to reject the null hypothesis, which means at least 1 factor affects the money spent by males aged more than 27.

The regression table for the group of male aged more than 27
            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   121.761        1.308            93.075   0.000     119.172     124.350
PRICE       0.261          0.604            0.432    0.666     -0.935      1.457
BRAND       1.146          0.651            1.759    0.081     -0.143      2.436
SECURITY    8.451          0.713            11.851   0.000     7.040       9.863
PAYMENTS    6.822          0.673            10.139   0.000     5.490       8.154
P&M         -1.690         0.676            -2.501   0.014     -3.027      -0.353

From the table, we can set up the regression equation as follows:
Y = 121.761 + 0.261X1 + 1.146X2 + 8.451X3 + 6.822X4 - 1.690X5 + ε

• The null and alternative hypotheses for each variable are the same as in the previous cases.
Df = 5
Alpha = 0.05, alpha/2 = 0.025
The critical value ±tC = ±t124,0.025 = ±1.9793

Based on the table coefficients, we can conclude:
PRICE      T test < T critical    → Non reject H0: β1 = 0 (Removed)
BRAND      T test < T critical    → Non reject (Removed)
SECURITY   T test > T critical    → Reject (Significant)
PAYMENTS   T test > T critical    → Reject (Significant)
P&M        T test < -T critical   → Reject (Significant)

CASE 4: FEMALE UNDER 18

FT > FC indicates that there is enough evidence to reject the null hypothesis, which means at least 1 factor affects the money spent by females under 18.

The regression table for the group of female under 18
            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   139.993        19.703           7.105    0.000     101.151     178.835
PRICE       28.383         9.375            3.028    0.003     9.901       46.865
BRAND       4.055          9.576            0.423    0.672     -14.823     22.933
SECURITY    -4.630         9.065            -0.511   0.610     -22.501     13.240
PAYMENTS    13.334         9.091            1.467    0.144     -4.588      31.255
P&M         28.604         9.344            3.061    0.002     10.184      47.024

From the table, we can set up the regression equation as follows:
Y = 139.993 + 28.3831X1 + 4.055X2 - 4.6304X3 + 13.334X4 + 28.6036X5

• The null and alternative hypotheses for each variable are the same as in the previous cases.
Df = 5
Alpha = 0.05, alpha/2 = 0.025
The critical value ±tC = ±t209,0.025 = ±1.9719

Based on the table coefficients, we can conclude:
PRICE      T test > T critical    → Reject H0: β1 = 0 (Significant)
BRAND      T test < T critical    → Non reject (Removed)
SECURITY   T test > -T critical   → Non reject (Removed)
PAYMENTS   T test < T critical    → Non reject (Removed)
P&M        T test > T critical    → Reject (Significant)

CASE 5: FEMALE AGED 18-27

H0: β1 = β2 = β3 = β4 = β5 = 0
H1: not all βi (i = 1, 2, 3, 4, 5) are equal to zero

ANOVA
             df    SS            MS           F            Significance F
Regression   5     15089637.21   3017927.44   9.61141921   1.5825E-08
Residual     302   94826171.57   313993.945
Total        307   109915808.8

At the significance level of 0.05, the critical value is F(0.05, 5, 302) = 2.245.
FT > FC indicates that there is enough evidence to reject the null hypothesis, which means at least 1 factor affects the money spent by females aged 18-27.

The regression table for the group of female aged 18-27
            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   163.256        35.066           4.656    0.000     94.252      232.260
PRICE       78.896         15.456           5.104    0.000     48.480      109.311
BRAND       17.806         16.304           1.092    0.276     -14.278     49.889
SECURITY    25.835         16.045           1.610    0.108     -5.739      57.408
PAYMENTS    9.677          16.388           0.591    0.555     -22.572     41.926
P&M         74.382         16.740           4.443    0.000     41.439      107.324

From the table, we can set up the regression equation as follows:
Y = 163.256 + 78.896X1 + 17.806X2 + 25.835X3 + 9.677X4 + 74.382X5

• The null and alternative hypotheses for each variable are the same as in the previous cases.
Df = 5
Alpha = 0.05, alpha/2 = 0.025
The critical value ±tC = ±t302,0.025 = ±1.96

Based on the table coefficients, we can conclude:
PRICE      T test > T critical   → Reject H0: β1 = 0 (Significant)
BRAND      T test < T critical   → Non reject (Removable)
SECURITY   T test < T critical   → Non reject (Removable)
PAYMENTS   T test < T critical   → Non reject (Removable)
P&M        T test > T critical   → Reject (Significant)

CASE 6: FEMALE OVER 27

H0: β1 = β2 = β3 = β4 = β5 = 0
H1: not all βi (i = 1, 2, 3, 4, 5) are equal to zero

ANOVA
             df    SS            MS           F           Significance F
Regression   5     6840409.475   1368081.89   8.1395103   2.82554E-07
Residual     343   57651144.54   168079.138
Total        348   64491554.02

At the significance level of 0.05, the critical value is F(0.05, 5, 343) = 2.245.
FT > FC indicates that there is enough evidence to reject the null hypothesis, which means at least 1 factor affects the money spent by females aged more than 27.
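The F ratio behind each overall test can be recomputed from the ANOVA sums of squares and degrees of freedom. The following is a minimal Python sketch, using the figures reported for the female over-27 group. The function name `f_ratio` is hypothetical, and the critical value is copied from the report rather than computed, so this only illustrates the arithmetic and is not the procedure the authors ran in Excel.

```python
def f_ratio(ss_regression, df_regression, ss_residual, df_residual):
    """Return (MS regression, MS residual, F) from ANOVA sums of squares."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return ms_regression, ms_residual, ms_regression / ms_residual


# Sums of squares and degrees of freedom reported for the female over-27 group.
ms_reg, ms_res, f_value = f_ratio(6840409.475, 5, 57651144.54, 343)

# Critical value F(0.05, 5, 343) as stated in the report (assumed, not computed here).
F_CRITICAL = 2.245
reject_h0 = f_value > F_CRITICAL

print(f"MS regression = {ms_reg:.2f}")
print(f"MS residual   = {ms_res:.3f}")
print(f"F = {f_value:.4f}, reject H0: {reject_h0}")
```

The same function can be applied to the ANOVA rows of the other groups to check that each reported F equals the ratio of the two mean squares.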
The regression table for the group of female aged more than 27
            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   147.512        24.243           6.085    0.000     99.828      195.196
PRICE       49.228         11.654           4.224    0.000     26.307      72.150
BRAND       9.145          11.973           0.764    0.445     -14.404     32.695
SECURITY    15.865         10.775           1.472    0.142     -5.328      37.058
PAYMENTS    8.630          11.572           0.746    0.456     -14.132     31.392
P&M         44.272         11.422           3.876    0.000     21.807      66.737

From the table, we can set up the regression equation as follows:
Y = 147.512 + 49.228X1 + 9.1455X2 + 15.8646X3 + 8.6299X4 + 44.272X5

• The null and alternative hypotheses for each variable are the same as in the previous cases.
Df = 5
Alpha = 0.05, alpha/2 = 0.025
The critical value ±tC = ±t343,0.025 = ±1.96

Based on the table coefficients, we can conclude:
PRICE      T test > T critical   → Reject (Significant)
BRAND      T test < T critical   → Non reject (Removable)
SECURITY   T test < T critical   → Non reject (Removable)
PAYMENTS   T test < T critical   → Non reject (Removable)
P&M        T test > T critical   → Reject (Significant)

Conclusion from regression models

The overall trend across all surveys shows that revenue/month depends significantly on price, security, and promotion and marketing. Meanwhile, each specific group, with its different characteristics, spends money depending on different factors. Through this analysis, we can observe that:
• Male under 18 responds to security, payments, P&M
• Male aged 18-27 responds to security and payments
• Male over 27 responds to security, payments, P&M
• Female under 18 responds to price, P&M
• Female aged 18-27 responds to price, P&M
• Female over 27 responds to price, P&M
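The per-variable decisions in each case come from dividing a coefficient by its standard error and comparing the absolute t statistic with the critical value. The following is a minimal Python sketch of that rule, using the coefficients and standard errors reported for the female over-27 group. The function name `classify` and the labels it prints are illustrative assumptions, and the critical value 1.96 is taken from the report rather than computed.

```python
# (coefficient, standard error) pairs reported for the female over-27 group.
coefficients = {
    "PRICE": (49.228, 11.654),
    "BRAND": (9.145, 11.973),
    "SECURITY": (15.865, 10.775),
    "PAYMENTS": (8.630, 11.572),
    "P&M": (44.272, 11.422),
}

# Two-sided critical value t(343, 0.025) as stated in the report (assumed here).
T_CRITICAL = 1.96


def classify(coefs, t_critical):
    """Map each variable to (t statistic, True if H0: beta = 0 is rejected)."""
    results = {}
    for name, (coef, std_err) in coefs.items():
        t_stat = coef / std_err
        results[name] = (t_stat, abs(t_stat) > t_critical)
    return results


results = classify(coefficients, T_CRITICAL)
significant = [name for name, (_, rejected) in results.items() if rejected]

for name, (t_stat, rejected) in results.items():
    label = "Significant" if rejected else "Removable"
    print(f"{name:<9} t = {t_stat:6.3f}  {label}")
```

Using the absolute value of t covers both tails, so a large negative t statistic (as for P&M in the male over-27 group) is also classified as significant.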
