Business Statistics - David F. Groebner, Patrick W. Shannon, Phillip C. Fry, and Kent D. Smith, 2011 (Part 2)


Chapter 12: Analysis of Variance

Quick Prep Links
• Review the basics of hypothesis testing discussed in Section 9.1.
• Review the computational methods for the sample mean and the sample variance.
• Re-examine the material on hypothesis testing for the difference between two population variances in Section 11.2.

Chapter Outline
12.1 One-Way Analysis of Variance
12.2 Randomized Complete Block Analysis of Variance
12.3 Two-Factor Analysis of Variance with Replication

Chapter Outcomes
1. Understand the basic logic of analysis of variance.
2. Perform a hypothesis test for a single-factor design using analysis of variance, manually and with the aid of Excel or Minitab software.
3. Conduct and interpret post-analysis of variance pairwise comparisons procedures.
4. Recognize when randomized block analysis of variance is useful and be able to perform analysis of variance on a randomized block design.
5. Perform analysis of variance on a two-factor design of experiments with replications using Excel or Minitab and interpret the output.

Why you need to know

Chapters 9 through 11 introduced hypothesis testing. By now you should understand that regardless of the population parameter in question, the hypothesis-testing steps are basically the same:
1. Specify the population parameter of interest.
2. Formulate the null and alternative hypotheses.
3. Specify the level of significance.
4. Determine a decision rule defining the rejection and "acceptance" regions.
5. Select a random sample of data from the population(s).
6. Compute the appropriate sample statistic(s) and, finally, calculate the test statistic.
7. Reach a decision: reject the null hypothesis, H0, if the sample statistic falls in the rejection region; otherwise, do not reject the null hypothesis. If the test is conducted using the p-value approach, H0 is rejected whenever the p-value is smaller than the significance level; otherwise, H0 is not rejected.
8. Draw a conclusion: state the result of your hypothesis test in the context of the exercise or analysis of interest.

Chapter 9 focused on hypothesis tests involving a single population. Chapters 10 and 11 expanded the hypothesis-testing process to include applications in which differences between two populations are involved. However, you will encounter many instances involving more than two populations. For example, the vice president of operations at Farber Rubber, Inc., oversees production at Farber's six different U.S. manufacturing plants. Because each plant uses slightly different manufacturing processes, the vice president needs to know whether there are any differences in the average strength of the products produced at the different plants. Similarly, Golf Digest, a major publisher of articles about golf, might wish to determine which of five major brands of golf balls has the highest mean distance off the tee. The Environmental Protection Agency (EPA) might conduct a test to determine if there is a difference in the average miles-per-gallon performance of cars manufactured by the Big Three U.S. automobile producers. In each of these cases, testing a hypothesis involving more than two population means could be required.

This chapter introduces a tool called analysis of variance (ANOVA), which can be used to test whether there are differences among three or more population means. There are several ANOVA procedures, depending on the type of test being conducted.
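Throughout the chapter, the computations are carried out in Excel or Minitab. Purely as an illustrative aside, the same one-way F-test can be run in a few lines of Python with SciPy; the three "plant" samples below are hypothetical values invented for this sketch, not data from the text:

```python
from scipy import stats

# Hypothetical strength measurements from three plants (invented data).
plant_a = [89.1, 90.2, 88.7, 91.0, 90.4]
plant_b = [87.5, 88.9, 86.4, 88.0, 87.2]
plant_c = [90.3, 91.8, 89.9, 92.1, 90.8]

# f_oneway performs the one-way ANOVA F-test described in this chapter.
f_stat, p_value = stats.f_oneway(plant_a, plant_b, plant_c)
print(f"F = {f_stat:.3f}, p-value = {p_value:.4f}")
# Reject H0 (all means equal) when the p-value is below the chosen alpha.
```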
Our aim in this chapter is to introduce you to ANOVA and to illustrate how to use Microsoft Excel and Minitab to help conduct hypothesis tests involving three or more population parameters. You will almost certainly need either to apply ANOVA in future decision-making situations or to interpret the results of an ANOVA study performed by someone else. Thus, you need to be familiar with this powerful statistical technique.

12.1 One-Way Analysis of Variance

In Chapter 10 we introduced the t-test for testing whether two populations have equal means when the samples from the two populations are independent. However, you will often encounter situations in which you are interested in determining whether three or more populations have equal means. To conduct this test, you will need a new tool called analysis of variance (ANOVA). There are many different analysis of variance designs to fit different situations; the simplest is a completely randomized design. Analyzing a completely randomized design results in a one-way analysis of variance.

Completely Randomized Design: An experiment is completely randomized if it consists of the independent random selection of observations representing each level of one factor.

One-Way Analysis of Variance: An analysis of variance design in which independent samples are obtained from two or more levels of a single factor for the purpose of testing whether the levels have equal means.

Factor: A quantity under examination in an experiment as a possible cause of variation in the response variable.

Levels: The categories, measurements, or strata of a factor of interest in the current experiment.

Introduction to One-Way ANOVA

BUSINESS APPLICATION: Applying One-Way Analysis of Variance (Bayhill Marketing Company)

The Bayhill Marketing Company is a full-service marketing and advertising firm in San Francisco. Although Bayhill provides many different marketing services, one of its most lucrative in recent years has been Web site sales designs: companies that wish to increase Internet sales have contracted with Bayhill to design effective Web sites. Bayhill executives have learned that certain Web site features are more effective than others. For example, a major greeting card company wants to work with Bayhill on developing a Web-based sales campaign for its "Special Events" card set. The company plans to work with Bayhill designers to come up with a Web site that will maximize sales effectiveness, where sales effectiveness is measured by the dollar value of the greeting card sets purchased.

Through a series of meetings with the client and focus-group sessions with potential customers, Bayhill has developed four Web site design options. Bayhill plans to test the effectiveness of the designs by sending e-mails to a random sample of regular greeting card customers. The sample of potential customers will be divided into four groups of eight customers each. Group 1 will be directed to a Web site with design 1, group 2 to a Web site with design 2, and so forth. The dollar value of the cards ordered is recorded and shown in Table 12.1.

In this example, we are interested in whether the different Web site designs result in different mean order sizes. In other words, we are trying to determine if "Web site design" is one of the possible causes of the variation in the dollar value of the card sets ordered (the response variable). In this case, Web site design is called a factor; it is the single factor of interest. This factor has four categories, measurements, or strata, called levels. These four levels are the four designs: 1, 2, 3, and 4.
Because we are using only one factor, each dollar value of card sets ordered is associated with only one level (that is, with Web site design 1, 2, 3, or 4), as you can see in Table 12.1. Each level is a population of interest, and the values in Table 12.1 are sample values taken from those populations. The null and alternative hypotheses to be tested are

H0: μ1 = μ2 = μ3 = μ4 (mean order sizes are equal)
HA: At least two of the population means are different

Balanced Design: An experiment has a balanced design if the factor levels have equal sample sizes.

The appropriate statistical tool for conducting this hypothesis test is analysis of variance. Because this ANOVA addresses an experiment with only one factor, it is a one-way ANOVA, or one-factor ANOVA. Because the sample size for each Web site design (level) is the same, the experiment has a balanced design.

TABLE 12.1: Bayhill Marketing Company Web Site Order Data

                        Web Site Design
Customer    1           2           3           4
1           $4.10       $6.90       $4.60       $12.50
2           5.90        9.10        11.40       7.50
3           10.45       13.00       6.15        6.25
4           11.55       7.90        7.85        8.75
5           5.25        9.10        4.30        11.15
6           7.75        13.40       8.70        10.25
7           4.78        7.60        10.20       6.40
8           6.22        5.00        10.80       9.20
Mean        x̄1 = $7.00  x̄2 = $9.00  x̄3 = $8.00  x̄4 = $9.00    Grand mean: x̿ = $8.25
Variance    s1² = 7.341 s2² = 8.423 s3² = 7.632 s4² = 5.016

Note: Data are the dollar value of card sets ordered with each Web site design.

ANOVA tests the null hypothesis that three or more populations have the same mean. The test is based on four assumptions:
1. All populations are normally distributed.
2. The population variances are equal.
3. The observations are independent; that is, the occurrence of any one individual value does not affect the probability that any other observation will occur.
4. The data are interval or ratio level.

If the null hypothesis is true, the populations have identical distributions, and the sample means for random samples from each population should be close in value. The basic logic of ANOVA is the same as that of the two-sample t-test introduced in Chapter 10: the null hypothesis should be rejected only if the sample means are substantially different.

Partitioning the Sum of Squares

Total Variation: The aggregate dispersion of the individual data values across the various factor levels is called the total variation in the data.

Within-Sample Variation: The dispersion that exists among the data values within a particular factor level is called the within-sample variation.

Between-Sample Variation: Dispersion among the factor sample means is called the between-sample variation.

To understand the logic of ANOVA, you should note several things about the data in Table 12.1. First, the dollar values of the orders differ throughout the data table; some values are higher, others lower. Thus, variation exists across all customer orders. This is the total variation in the data. Next, within any particular Web site design (i.e., factor level), not all customers ordered the same dollar value of greeting card sets. For instance, within level 1, order size ranged from $4.10 to $11.55, and similar differences occur within the other levels. The variation within the factor levels is the within-sample variation. Finally, the sample means for the four Web site designs are not all equal, so variation exists between the four designs' averages. This variation between the factor levels is referred to as the between-sample variation.

Recall that the sample variance is computed as

s² = Σ(x − x̄)² / (n − 1)

The sample variance is the sum of squared deviations from the sample mean divided by its degrees of freedom.
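The summary rows of Table 12.1 can be reproduced directly from the data. Here is a short sketch in Python/NumPy (a language the text itself does not use; the arithmetic is exactly the sample-variance formula above):

```python
import numpy as np

# Order data from Table 12.1: dollar values of card-set orders,
# one list per Web site design (level).
designs = {
    1: [4.10, 5.90, 10.45, 11.55, 5.25, 7.75, 4.78, 6.22],
    2: [6.90, 9.10, 13.00, 7.90, 9.10, 13.40, 7.60, 5.00],
    3: [4.60, 11.40, 6.15, 7.85, 4.30, 8.70, 10.20, 10.80],
    4: [12.50, 7.50, 6.25, 8.75, 11.15, 10.25, 6.40, 9.20],
}

for level, x in designs.items():
    # ddof=1 gives the sample variance s^2 = sum((x - xbar)^2) / (n - 1)
    print(f"design {level}: mean = {np.mean(x):.2f}, "
          f"variance = {np.var(x, ddof=1):.3f}")

grand_mean = np.mean([v for x in designs.values() for v in x])
print(f"grand mean = {grand_mean:.2f}")   # 8.25
```

Running this reproduces the means ($7.00, $9.00, $8.00, $9.00), the variances (7.341, 8.423, 7.632, 5.016), and the grand mean of $8.25 shown in Table 12.1.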
When all the data from all the samples are included, s² is the estimator of the total variation. The numerator of this estimator is called the total sum of squares (SST), and it can be partitioned into the sums of squares associated with the estimators of the between-sample variation and the within-sample variation, as shown in Equation 12.1.

Partitioned Sum of Squares

SST = SSB + SSW     (12.1)

where:
SST = Total sum of squares
SSB = Sum of squares between
SSW = Sum of squares within

After separating the sum of squares, SSB and SSW are divided by their respective degrees of freedom to produce two estimates of the overall population variance. If the between-sample variance estimate is large relative to the within-sample estimate, the ANOVA procedure will lead us to reject the null hypothesis and conclude that the population means are different. The question is, how can we determine at what point any difference becomes statistically significant?

The ANOVA Assumptions

BUSINESS APPLICATION: Understanding the ANOVA Assumptions (Bayhill Marketing Company, continued)

Recall that Bayhill is testing whether the four Web site designs generate orders of equal average dollar value. The null and alternative hypotheses are

H0: μ1 = μ2 = μ3 = μ4
HA: At least two population means are different

Before we jump into the ANOVA calculations, recall the four basic assumptions of ANOVA:
1. All populations are normally distributed.
2. The population variances are equal.
3. The sampled observations are independent.
4. The data's measurement level is interval or ratio.

Figure 12.1 illustrates the first two assumptions: the populations are normally distributed, and the spread (variance) is the same for each population. However, in that figure the populations have different means, and therefore the null hypothesis is false. [Figure 12.1: Normal Populations with Equal Variances and Unequal Means.] Figure 12.2 illustrates the same assumptions, but in a case in which the population means are equal and the null hypothesis is therefore true. [Figure 12.2: Normal Populations with Equal Variances and Equal Means.]

You can do a rough check of the normality assumption by developing graphs of the sample data from each population. Histograms are probably the best graphical tool for checking the normality assumption, but they require a fairly large sample size. The stem and leaf diagram and the box and whisker plot are alternatives when sample sizes are smaller. If the graphical tools show plots consistent with a normal distribution, that evidence suggests the normality assumption is satisfied. (Chapter 13 introduces a goodness-of-fit approach to testing whether sample data come from a normally distributed population.)

Figure 12.3 shows the box and whisker plots for the Bayhill data, along with the five-number summary for each design:

[Figure 12.3: Box and Whisker Plot for Bayhill Marketing Company]

Five-Number Summary    Design 1   Design 2   Design 3   Design 4
Minimum                4.1        5.0        4.3        6.25
First Quartile         4.78       6.9        4.6        6.4
Median                 6.06       8.5        8.275      8.975
Third Quartile         10.45      13.0       10.8       11.15
Maximum                11.55      13.4       11.4       12.5

Note that when the sample sizes are very small, as they are here, the graphical techniques may not be very effective. In Chapter 11, you learned how to test whether two populations have equal variances using the F-test.
To determine whether the second assumption (equal variances) is satisfied, we can hypothesize that all the population variances are equal:

H0: σ1² = σ2² = ⋯ = σk²
HA: Not all variances are equal

Because you are now testing a null hypothesis involving more than two population variances, you need an alternative to the F-test introduced in Chapter 11. This alternative is called Hartley's Fmax test, and its test statistic is computed as shown in Equation 12.2. (Other tests for equal variances exist; for example, Minitab has a procedure that uses Bartlett's and Levene's tests.)

Hartley's F-Test Statistic

Fmax = s²max / s²min     (12.2)

where:
s²max = Largest sample variance
s²min = Smallest sample variance

We can use the value computed with Equation 12.2 to test whether the variances are equal by comparing it to a critical value from the Hartley's Fmax distribution, which appears in Appendix I. For the Bayhill example, the computed variances for the four samples are

s1² = 7.341    s2² = 8.423    s3² = 7.632    s4² = 5.016

Using Equation 12.2, we compute

Fmax = 8.423 / 5.016 = 1.679

This value is now compared to the critical value Fα from the table in Appendix I for α = 0.05, with k = 4 and n̄ − 1 = 7 degrees of freedom, where k is the number of groups (populations) and n̄ is the average sample size (n̄ = 8 in this example). If n̄ is not an integer, set n̄ equal to the integer portion of the computed value. If Fmax > Fα, reject the null hypothesis of equal variances; if Fmax ≤ Fα, do not reject the null hypothesis and conclude that the population variances could be equal. From the Hartley's Fmax distribution table, the critical value is F0.05 = 8.44. Because Fmax = 1.679 < 8.44, the null hypothesis of equal variances is not rejected. (Hartley's Fmax test is very dependent on the populations being normally distributed and should not be used if the populations' distributions are skewed. Note also that in Hartley's Fmax table, c = k and v = n̄ − 1.)

Examining the sample data to see whether the basic assumptions are satisfied is always a good idea, but you should be aware that the analysis of variance procedures discussed in this chapter are robust, in the sense that the ANOVA test is relatively unperturbed when the equal-variance assumption is not met. This is especially so when all samples are the same size, as in the Bayhill Marketing Company example. Hence, for one-way analysis of variance, or any other ANOVA design, try to have equal sample sizes when possible. Recall that we earlier referred to an analysis of variance design with equal sample sizes as a balanced design. If for some reason you are unable to use a balanced design, the rule of thumb is that the ratio of the largest sample size to the smallest should not exceed 1.5. When the samples are the same size (or meet the 1.5-ratio rule), the analysis of variance is also robust with respect to the assumption that the populations are normally distributed. So, in brief, the one-way ANOVA for independent samples can be applied to virtually any set of interval- or ratio-level data.

Finally, if the data are not interval or ratio level, or if they do not satisfy the normal distribution assumption, Chapter 17 introduces an ANOVA procedure called the Kruskal-Wallis one-way ANOVA, which does not require these assumptions.
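The Hartley's Fmax check above can be scripted in a few lines. In the sketch below, the critical value 8.44 is hard-coded from the Appendix I table (α = 0.05, c = k = 4, v = n̄ − 1 = 7), because standard statistics libraries do not ship Hartley's Fmax distribution:

```python
# Hartley's F_max check for the Bayhill data (Equation 12.2).
variances = [7.341, 8.423, 7.632, 5.016]   # s^2 for the four designs
f_max = max(variances) / min(variances)
print(round(f_max, 3))                     # 1.679

F_CRIT_TABLE = 8.44   # Appendix I: alpha = 0.05, c = k = 4, v = 7
print("reject equal variances" if f_max > F_CRIT_TABLE
      else "do not reject equal variances")
```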
Applying One-Way ANOVA

Although the previous discussion covers the essence of ANOVA, determining whether the null hypothesis should be rejected requires that we actually compute the estimators of the total variation, between-sample variation, and within-sample variation. Most ANOVA tests are done using a computer, but we will illustrate the manual computational approach one time to show you how it is done. Because software such as Excel and Minitab can perform all the calculations, future examples will be done using the computer: the software does the computations while we focus on interpreting the results.

BUSINESS APPLICATION: Developing the ANOVA Table (Bayhill Marketing Company, continued)

Now we are ready to perform the necessary one-way ANOVA computations for the Bayhill example. Recall from Equation 12.1 that we can partition the total sum of squares into two components: SST = SSB + SSW. The total sum of squares is computed as shown in Equation 12.3.

Total Sum of Squares

SST = Σᵢ Σⱼ (xij − x̿)²     (12.3)

where:
SST = Total sum of squares
k = Number of populations (treatments), with i = 1, …, k
ni = Sample size from population i, with j = 1, …, ni
xij = jth measurement from population i
x̿ = Grand mean (mean of all the data values)

Equation 12.3 is not as complicated as it appears. Manually applying it to the Bayhill data shown in Table 12.1 (grand mean x̿ = 8.25), we compute

SST = (4.10 − 8.25)² + (5.90 − 8.25)² + (10.45 − 8.25)² + ⋯ + (9.20 − 8.25)² = 220.88

Thus, the sum of the squared deviations of all values from the grand mean is 220.88. Equation 12.3 can also be restated as

SST = Σᵢ Σⱼ (xij − x̿)² = (nT − 1)s²

where s² is the sample variance for all the data combined and nT is the sum of the combined sample sizes.

We now need to determine how much of this total sum of squares is due to the between-sample sum of squares and how much is due to the within-sample sum of squares. The between-sample portion is called the sum of squares between and is found using Equation 12.4.

Sum of Squares Between

SSB = Σᵢ ni (x̄i − x̿)²     (12.4)

where:
SSB = Sum of squares between samples
k = Number of populations
ni = Sample size from population i
x̄i = Sample mean from population i
x̿ = Grand mean

Using Equation 12.4 on the Bayhill data:

SSB = 8(7 − 8.25)² + 8(9 − 8.25)² + 8(8 − 8.25)² + 8(9 − 8.25)² = 22.00

Once both SST and SSB have been computed, the sum of squares within (also called the sum of squares error, SSE) is easily computed using Equation 12.5; it can also be computed directly using Equation 12.6.

Sum of Squares Within

SSW = SST − SSB     (12.5)

or

SSW = Σᵢ Σⱼ (xij − x̄i)²     (12.6)

where:
SSW = Sum of squares within samples
k = Number of populations
ni = Sample size from population i
x̄i = Sample mean from population i
xij = jth measurement from population i

For the Bayhill example,

SSW = 220.88 − 22.00 = 198.88
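Equations 12.3 through 12.5 are easy to verify in code. The sketch below (Python/NumPy, used here only as a worked illustration of the formulas) reproduces the three sums of squares for the Table 12.1 data:

```python
import numpy as np

# Table 12.1 data again, one array per Web site design.
samples = [
    np.array([4.10, 5.90, 10.45, 11.55, 5.25, 7.75, 4.78, 6.22]),
    np.array([6.90, 9.10, 13.00, 7.90, 9.10, 13.40, 7.60, 5.00]),
    np.array([4.60, 11.40, 6.15, 7.85, 4.30, 8.70, 10.20, 10.80]),
    np.array([12.50, 7.50, 6.25, 8.75, 11.15, 10.25, 6.40, 9.20]),
]
all_data = np.concatenate(samples)
grand_mean = all_data.mean()                                        # 8.25

sst = ((all_data - grand_mean) ** 2).sum()                          # Eq. 12.3 -> 220.88
ssb = sum(len(x) * (x.mean() - grand_mean) ** 2 for x in samples)   # Eq. 12.4 -> 22.00
ssw = sst - ssb                                                     # Eq. 12.5 -> 198.88
print(round(sst, 2), round(ssb, 2), round(ssw, 2))
```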
These computations are the essential first steps in performing the ANOVA test to determine whether the population means are equal. Table 12.2 illustrates the standard ANOVA table format used to conduct the test. The mean square column contains the MSB (mean square between samples) and the MSW (mean square within samples), computed by dividing the sums of squares by their respective degrees of freedom. (MSW is also known as the mean square for error, MSE.)

TABLE 12.2: One-Way ANOVA Table: The Basic Format

Source of Variation    SS     df       MS     F-Ratio
Between samples        SSB    k − 1    MSB    MSB/MSW
Within samples         SSW    nT − k   MSW
Total                  SST    nT − 1

where:
k = Number of populations
nT = Sum of the sample sizes from all populations
df = Degrees of freedom
MSB = Mean square between = SSB / (k − 1)
MSW = Mean square within = SSW / (nT − k)

For the Bayhill example, we substitute the numerical values for SSB, SSW, and SST and complete the ANOVA table, as shown in Table 12.3.

TABLE 12.3: One-Way ANOVA Table for the Bayhill Marketing Company

Source of Variation    SS        df    MS      F-Ratio
Between samples        22.00     3     7.33    1.03
Within samples         198.88    28    7.10
Total                  220.88    31

where:
MSB = Mean square between = 22.00 / 3 = 7.33
MSW = Mean square within = 198.88 / 28 = 7.10

Restating the null and alternative hypotheses for the Bayhill example:

H0: μ1 = μ2 = μ3 = μ4
HA: At least two population means are different

Glance back at Figures 12.1 and 12.2. If the null hypothesis is true (that is, all the means are equal, as in Figure 12.2), the MSW and MSB will be equal, except for the presence of sampling error. However, the more the sample means differ (Figure 12.1), the larger the MSB becomes. As the MSB increases, it will tend to get larger than the MSW. When this difference gets too large, we conclude that the population means must not be equal, and the null hypothesis is rejected. But how do we determine what "too large" is? How do we know when the difference is due to more than just sampling error?

To answer these questions, recall from Chapter 11 that the F-distribution is used to test whether two populations have the same variance. In the ANOVA test, if the null hypothesis is true, the ratio of MSB over MSW forms an F-distribution with D1 = k − 1 and D2 = nT − k degrees of freedom. If the calculated F-ratio in Table 12.3 gets too large, the null hypothesis is rejected. Figure 12.4 illustrates the hypothesis test for a significance level of 0.05:

Figure 12.4: Bayhill Company Hypothesis Test

H0: μ1 = μ2 = μ3 = μ4
HA: At least two population means are different
α = 0.05
Degrees of freedom: D1 = k − 1 = 4 − 1 = 3; D2 = nT − k = 32 − 4 = 28
Decision rule: If F > F0.05 = 2.95, reject H0; otherwise, do not reject H0.
F = MSB / MSW = 7.33 / 7.10 = 1.03
Because F = 1.03 < F0.05 = 2.95, we do not reject H0.

Because the calculated F-ratio of 1.03 is less than the critical F0.05 = 2.95 (found using Excel's FINV function) with 3 and 28 degrees of freedom, the null hypothesis cannot be rejected. The F-ratio indicates that the between-levels and within-levels estimates are not different enough to conclude that the population means differ. This means there is insufficient statistical evidence to conclude that any one of the four Web site designs will generate a higher average dollar value of orders than the others. Therefore, the choice of which Web site design to use can be based on other factors, such as company preference.
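The final F test in Figure 12.4 takes only a few lines to replicate. In this sketch, SciPy's stats.f.ppf plays the role of Excel's FINV function for the critical value:

```python
from scipy import stats

ssb, ssw = 22.00, 198.88
k, n_t = 4, 32
msb = ssb / (k - 1)                              # 7.33
msw = ssw / (n_t - k)                            # 7.10
f_ratio = msb / msw                              # about 1.03

# Critical value with D1 = k - 1 = 3 and D2 = nT - k = 28 df.
f_crit = stats.f.ppf(1 - 0.05, k - 1, n_t - k)   # about 2.95
print(f"F = {f_ratio:.2f}, F_crit = {f_crit:.2f}")
# F < F_crit, so H0 is not rejected.
```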
EXAMPLE 12-1: One-Way Analysis of Variance (Roderick, Wilterding & Associates)

Roderick, Wilterding & Associates (RWA) operates automobile dealerships in three regions: the West, Southwest, and Northwest. Recently, RWA's general manager questioned whether the company's mean profit margin per vehicle sold differed by region. To determine this, the following steps can be performed:

Step 1: Specify the parameter(s) of interest.
The parameter of interest is the mean dollars of profit margin per vehicle sold in each region.

Step 2: Formulate the null and alternative hypotheses.
H0: μW = μSW = μNW
HA: At least two populations have different means

Step 3: Specify the significance level (α) for testing the hypothesis.
The test will be conducted using α = 0.05.

Step 4: Select independent simple random samples from each population, and compute the sample means and the grand mean.
Simple random samples of vehicles sold in the three regions have been selected: 10 in the West, 8 in the Southwest, and 12 in the Northwest. Note that even though the sample sizes are not equal, the largest sample is not more than 1.5 times as large as the smallest. The following sample data were collected (in dollars; the table continues across two panels of three columns):

West    Southwest   Northwest   West    Southwest   Northwest
3,700   3,300       2,900       5,300   2,700       3,300
2,900   4,100       4,900       4,900   2,100       2,600
2,100   3,600       4,300       5,200   3,300       3,600
2,200   3,700       4,800       3,000   4,500       2,400
3,700   2,400       4,400       3,300   4,400       3,200

The sample means are

x̄W = Σx/n = 39,500/10 = $3,950
x̄SW = 23,300/8 = $2,912.50
x̄NW = 44,000/12 = $3,666.67

and the grand mean, the mean of the data from all samples combined, is

x̿ = ΣΣx/nT = (3,700 + 2,900 + ⋯ + 3,200)/30 = 106,800/30 = $3,560

Step 5: Determine the decision rule.
The F-critical value from the F-distribution table in Appendix H for D1 = 2 and D2 = 27 degrees of freedom is a value between 3.316 and 3.403. The exact value, F0.05 = 3.354, can be found using Excel's FINV function or Minitab's Calc > Probability Distributions command. The decision rule is: If F > 3.354, reject the null hypothesis; otherwise, do not reject the null hypothesis.

Step 6: Check that the equal-variance assumption has been satisfied.
As long as we assume that the populations are normally distributed, Hartley's Fmax test can be used to test whether the three populations have equal variances. The three sample variances, computed using s² = Σ(x − x̄)²/(n − 1), are

s²W = 1,062,777.8    s²SW = 695,535.7    s²NW = 604,242.4

Fmax = 1,062,777.8 / 604,242.4 = 1.76

From the Fmax table in Appendix I, the critical value for α = 0.05, c = 3 (c = k), and v = 9 (v = n̄ − 1 = 10 − 1 = 9) is 5.34. Because 1.76 < 5.34, we do not reject the null hypothesis of equal variances.

Step 7: Create the ANOVA table.
Compute the total sum of squares, sum of squares between, and sum of squares within, and complete the ANOVA table.

SST = Σᵢ Σⱼ (xij − x̿)² = (3,700 − 3,560)² + (2,900 − 3,560)² + ⋯ + (3,200 − 3,560)² = 26,092,000

SSB = Σᵢ ni(x̄i − x̿)² = 10(3,950 − 3,560)² + 8(2,912.50 − 3,560)² + 12(3,666.67 − 3,560)² = 5,011,583

SSW = SST − SSB = 26,092,000 − 5,011,583 = 21,080,417

The ANOVA table is

Source of Variation    SS           df    MS           F-Ratio
Between samples        5,011,583    2     2,505,792    3.209
Within samples         21,080,417   27    780,756
Total                  26,092,000   29

F = 2,505,792 / 780,756 = 3.209

Step 8: Reach a decision.
Because the F-test statistic of 3.209 is less than F0.05 = 3.354, we do not reject the null hypothesis based on these sample data.

Step 9: Draw a conclusion.
We are not able to detect a difference in the mean profit margin per vehicle sold by region.

END EXAMPLE (Try Problem 12-2)
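Because Example 12-1 reports its summary statistics explicitly, the ANOVA table can be rebuilt from those alone. The sketch below uses only the published sample sizes, means, and total sum of squares; small discrepancies versus the text (a few units in SSB) come from the rounding of the published means:

```python
from scipy import stats

ns    = [10, 8, 12]                   # West, Southwest, Northwest
means = [3950.0, 2912.50, 3666.67]    # published sample means
sst   = 26_092_000.0                  # published total sum of squares
k, n_t = len(ns), sum(ns)

grand = sum(n * m for n, m in zip(ns, means)) / n_t         # 3,560
ssb = sum(n * (m - grand) ** 2 for n, m in zip(ns, means))  # ~5,011,583
ssw = sst - ssb                                             # ~21,080,417
f_ratio = (ssb / (k - 1)) / (ssw / (n_t - k))               # ~3.21
f_crit = stats.f.ppf(0.95, k - 1, n_t - k)                  # 3.354
print(f"F = {f_ratio:.3f}, F_crit = {f_crit:.3f}")
# F < F_crit, so H0 is not rejected, matching Step 8.
```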
BUSINESS APPLICATION: Using Software to Perform One-Way ANOVA (Hydronics Corporation)

The Hydronics Corporation makes and distributes health products. Currently, the company's research department is experimenting with two new herb-based weight loss-enhancing products. To gauge their effectiveness, researchers at the company conducted a test using 300 human subjects over a six-week period. All the people in the study were between 30 and 40 pounds overweight. One third of the subjects were randomly selected to receive a placebo, in this case a pill containing only vitamin C. One third of the subjects were randomly selected and given product 1. The remaining 100 people received product 2. The subjects did not know which pill they had been assigned. Each person was asked to take the pill regularly for six weeks and otherwise observe his or her normal routine. At the end of six weeks, the subjects' weight loss was recorded. The company was hoping to find statistical evidence that at least one of the products is an effective weight-loss aid.

The file Hydronics shows the study data. Positive values indicate that the subject lost weight, whereas negative values indicate that the subject gained weight during the six-week study period. As often happens in studies involving human subjects, people drop out. Thus, at the end of six weeks, only 89 placebo subjects, 91 product 1 subjects, and 83 product 2 subjects with valid data remained. Consequently, this experiment resulted in an unbalanced design. Although the sample sizes are not equal, they are close to the same size and do not violate the 1.5-ratio rule of thumb mentioned earlier.

The null and alternative hypotheses to be tested, using a significance level of 0.05, are

H0: μ1 = μ2 = μ3
HA: At least two population means are different

The experimental design is completely randomized. The factor is diet supplement, which has three levels: placebo, product 1, and product 2. Figures 12.5a and 12.5b show the Excel and Minitab analysis of variance results.

[Figure 12.5a: Excel 2007 Output, Hydronics Weight Loss ANOVA Results. Excel 2007 instructions: Open file Hydronics.xls; on the Data tab, click Data Analysis; select ANOVA: Single Factor; define the data range (columns B, C, and D); specify alpha = 0.05; indicate the output location; click OK. The output shows F, the p-value, and F-critical.]

[Figure 12.5b: Minitab Output, Hydronics Weight Loss ANOVA Results. Minitab instructions: Open file Hydronics.MTW; choose Stat > ANOVA > One way; in Response, enter the data column, Loss; in Factor, enter the factor-level column, Program; click OK. The output shows F and the p-value.]

The top section of the Excel output and the bottom section of the Minitab output provide descriptive information for the three levels; the ANOVA table appears in the other section of each output. These tables look like the one we generated manually in the Bayhill example. However, Excel and Minitab also compute the p-value, and Excel additionally displays the critical value, F-critical, from the F-distribution. Thus, you can test the null hypothesis by comparing the calculated F to F-critical or by comparing the p-value to the significance level. The decision rule is

If F > F0.05 = 3.03, reject H0; otherwise, do not reject H0.
or
If p-value < α = 0.05, reject H0; otherwise, do not reject H0.

Because F = 20.48 > F0.05 = 3.03 (and p-value ≈ 0.0000 < α = 0.05), we reject the null hypothesis and conclude that there is a difference in mean weight loss for people on the three treatments: at least two of the populations have different means.

The top portion of Figure 12.5a also shows the descriptive measures for the sample data. For example, the subjects who took the placebo actually gained an average of 1.75 pounds, while subjects on product 1 lost an average of 2.45 pounds and subjects on product 2 lost an average of 2.58 pounds.
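The critical value and p-value reported by Excel and Minitab can be confirmed from the F-distribution alone; with the raw data file unavailable here, this sketch simply uses the reported F = 20.48 and the degrees of freedom D1 = k − 1 = 2 and D2 = nT − k = 263 − 3 = 260:

```python
from scipy import stats

print(stats.f.ppf(0.95, 2, 260))   # critical value, about 3.03
print(stats.f.sf(20.48, 2, 260))   # p-value for F = 20.48, about 2e-9
# The p-value is so small it displays as 0.0000 in the Excel output.
```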
The Tukey-Kramer Procedure for Multiple Comparisons

What does this conclusion imply about which treatment results in greater weight loss? One approach to answering this question is to use confidence interval estimates for all possible pairs of population means, based on pooling the two relevant sample variances, as introduced in Chapter 10:

sp = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ]

These confidence intervals are constructed using the formula also given in Chapter 10:

(x̄1 − x̄2) ± t · sp · √(1/n1 + 1/n2)

This approach uses a weighted average of only the two sample variances corresponding to the two sample means in the confidence interval. However, in the Hydronics example we have three samples, and thus three variances, involved. If we were to use the pooled standard deviation sp shown here, we would be disregarding one third of the information available to estimate the common population variance. Instead, we use confidence intervals based on the pooled standard deviation obtained from the square root of MSW, which is the square root of the weighted average of all (three, in this example) sample variances. This is preferred to the interval estimate shown above because we are assuming that each of the three sample variances is an estimate of the common population variance.

A better method for testing which populations have different means, after the one-way ANOVA has led us to reject the null hypothesis, is the Tukey-Kramer procedure for multiple comparisons. (There are other methods for making these comparisons, and statisticians disagree over which method to use; alternative methods are introduced later.) To understand why the Tukey-Kramer procedure is superior, we introduce the concept of an experiment-wide error rate.

Experiment-Wide Error Rate: The proportion of experiments in which at least one of the set of confidence intervals constructed does not contain the true value of the population parameter being estimated.

The Tukey-Kramer procedure is based on the simultaneous construction of confidence intervals for all differences of pairs of treatment means. In this example, there are three different pairs of means (μ1 − μ2, μ1 − μ3, μ2 − μ3). The Tukey-Kramer procedure simultaneously constructs three different confidence intervals for a specified confidence level, say 95%. Intervals that do not contain zero imply that a difference exists between the associated population means.

Suppose we repeat the study a large number of times and each time construct the Tukey-Kramer 95% confidence intervals. The Tukey-Kramer method assures us that in 95% of these experiments, all three confidence intervals constructed will include the true difference between the population means, μi − μj. In 5% of the experiments, at least one of the confidence intervals will not contain the true difference, and in those situations we would make at least one mistake in our conclusions about which populations have different means. This proportion of errors (0.05) is the experiment-wide error rate. For a 95% confidence interval, the Tukey-Kramer procedure controls the experiment-wide error to the 0.05 level. However, because we are concerned with only this one experiment (one set of sample data), the error rate associated with any one of the three confidence intervals is actually less than 0.05.
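To see why controlling the experiment-wide rate matters, consider a back-of-the-envelope calculation. If three separate 95% intervals were statistically independent (they are not, since they share the same data, so this is only a rough illustration), the chance that at least one of them misses its target would be well above 0.05:

```python
alpha = 0.05   # error rate used for each individual interval
m = 3          # number of pairwise comparisons
print(1 - (1 - alpha) ** m)   # 0.1426..., almost triple the nominal 0.05
```

This is why the Tukey-Kramer procedure fixes the 0.05 rate for the whole family of comparisons rather than for each interval separately.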
The Tukey-Kramer procedure allows us to simultaneously examine all pairs of populations after the ANOVA test has been completed, without increasing the true alpha level. Because these comparisons are made after the ANOVA F-test, the procedure is called a post-test (or post-hoc) procedure.

The first step in using the Tukey-Kramer procedure is to compute the absolute differences between each pair of sample means. Using the results shown in Figure 12.5a, we get

|x̄1 − x̄2| = |−1.75 − 2.45| = 4.20
|x̄1 − x̄3| = |−1.75 − 2.58| = 4.33
|x̄2 − x̄3| = |2.45 − 2.58| = 0.13

The Tukey-Kramer procedure requires us to compare these absolute differences to the critical range computed using Equation 12.7.

Tukey-Kramer Critical Range

Critical range = q1−α · √[ (MSW/2)(1/ni + 1/nj) ]     (12.7)

where:
q1−α = Value from the studentized range table (Appendix J), with D1 = k and D2 = nT − k degrees of freedom for the desired level of 1 − α (k = number of groups or factor levels, and nT = total number of data values from all populations (levels) combined)
MSW = Mean square within
ni and nj = Sample sizes from populations (levels) i and j, respectively

A critical range is computed for each pairwise comparison, but if the sample sizes are equal, only one critical-range calculation is necessary, because the quantity under the radical in Equation 12.7 will be the same for all comparisons. If the calculated pairwise comparison value is greater than the critical range, we conclude the difference is significant.

To determine the q-value from the studentized range table in Appendix J for a significance level of α = 0.05, we use k = 3 and nT − k = 260 degrees of freedom. For D2 = 260 degrees of freedom, we use the row labeled ∞; the studentized range value for 1 − 0.05 = 0.95 is approximately q0.95 = 3.31. Then, for the placebo versus product 1 comparison, with n1 = 89, n2 = 91, and MSW = 26.18, Equation 12.7 gives

Critical range = 3.31 · √[ (26.18/2)(1/89 + 1/91) ] = 1.785
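The same computation, repeated for all three pairs of sample sizes, yields the three critical ranges summarized in Table 12.4 below. A short sketch of Equation 12.7, with the table value q = 3.31 and MSW = 26.18 taken from the text:

```python
import math

q, msw = 3.31, 26.18   # q from Appendix J (1 - alpha = 0.95, k = 3, inf df row)
ns = {"placebo": 89, "product 1": 91, "product 2": 83}
pairs = [("placebo", "product 1"),
         ("placebo", "product 2"),
         ("product 1", "product 2")]

for a, b in pairs:
    # Equation 12.7: q * sqrt((MSW / 2) * (1/n_i + 1/n_j))
    cr = q * math.sqrt((msw / 2) * (1 / ns[a] + 1 / ns[b]))
    print(f"{a} vs {b}: critical range = {cr:.3f}")
# Prints 1.785, 1.827, and 1.818, matching Table 12.4.
```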
TABLE 12.4: Hydronics Pairwise Comparisons, Tukey-Kramer Test

Comparison                 |x̄i − x̄j|   Critical Range   Significant?
Placebo vs. product 1      4.20        1.785            Yes
Placebo vs. product 2      4.33        1.827            Yes
Product 1 vs. product 2    0.13        1.818            No

Because |x̄1 − x̄2| = 4.20 > 1.785, we conclude that μ1 ≠ μ2: the mean weight loss for the placebo group is not equal to the mean weight loss for the product 1 group. Table 12.4 summarizes the results for the three pairwise comparisons. From the table we see that product 1 and product 2 both offer significantly higher average weight loss than the placebo. However, the sample data do not indicate a difference in average weight loss between product 1 and product 2. Thus, the company can conclude that both products are superior to taking a placebo.

EXAMPLE 12-2: The Tukey-Kramer Procedure for Multiple Comparisons (Digitron, Inc.)

Digitron, Inc., makes disc brakes for automobiles. Digitron's research and development (R&D) department recently tested four brake systems to determine if there is a difference in the average stopping distance among them. Forty identical mid-sized cars were driven on a test track: 10 cars were fitted with brake A, 10 with brake B, and so forth. An electronic remote switch was used to apply the brakes at exactly the same point on the road, and the number of feet required to bring the car to a full stop was recorded. The data are in the file Digitron. Because we wish to determine only whether the four brake systems have the same or different mean stopping distances, the test is a one-way (single-factor) test with four levels and can be completed using the following steps:

Step 1: Specify the parameter(s) of interest.
The parameter of interest is the mean stopping distance for each brake type. The company wants to know whether a difference exists in mean stopping distance among the four brake types.

Step 2: Formulate the appropriate null and alternative hypotheses.
H0: μ1 = μ2 = μ3 = μ4
HA: At least two population means are different

Step 3: Specify the significance level for the test.
The test will be conducted using α = 0.05.

Step 4: Select independent simple random samples from each population.

Step 5: Check that the normality and equal-variance assumptions have been satisfied.
Because of the small sample sizes, box and whisker plots are used. [Figure: box and whisker plots of stopping distance for Brake A, Brake B, Brake C, and Brake D; the vertical axis runs from about 255 to 285 feet.] The box plots indicate some skewness in the samples and call the assumption of equal variances into question. However, if we assume that the populations are approximately normally distributed, Hartley's Fmax test can be used to test whether the four populations have equal variances. The four sample variances, computed using s² = Σ(x − x̄)²/(n − 1), are

s1² = 49.9001    s2² = 61.8557    s3² = 21.7356    s4² = 106.4385

Fmax = 106.4385 / 21.7356 = 4.8970

From the Fmax table in Appendix I, the critical value for α = 0.05, k = 4, and v = n̄ − 1 = 9 is F0.05 = 6.31. Because 4.8970 < 6.31, we conclude that the population variances could be equal. Recall our earlier discussion stating that when the sample sizes are equal, as they are in this example, the ANOVA test is robust with regard to both the equal-variance and normality assumptions.

Step 6: Determine the decision rule.
Because k − 1 = 3 and nT − k = 36, from Excel or Minitab F0.05 = 2.8663. The decision rule is: If the calculated F > F0.05 = 2.8663, or if the p-value < α = 0.05, reject H0; otherwise, do not reject H0.

Step 7: Use Excel or Minitab to construct the ANOVA table.
Figure 12.6 shows the Excel output for the ANOVA. [Figure 12.6: Excel 2007 One-Way ANOVA Output for the Digitron Example. Because the calculated F = 3.8854 > 2.8663, we reject the null hypothesis and conclude the means are not equal. Excel 2007 instructions: Open file Digitron.xls; on the Data tab, click Data Analysis; select ANOVA: Single Factor; define the data range (columns B, C, D, E); specify alpha = 0.05; specify the output location; click OK. Minitab instructions (for similar results): Open file Digitron.MTW; choose Stat > ANOVA > One-way; in Response, enter the data column, Distance; in Factor, enter the factor-level column, Brake; click OK.]
Step 8: Reach a decision.
From Figure 12.6, we see that F = 3.89 > F0.05 = 2.8663 and p-value = 0.0167 < 0.05. We reject the null hypothesis.

Step 9: Draw a conclusion.
We conclude that not all population means are equal. But which systems are different? Is one system superior to all the others?

Step 10: Use the Tukey-Kramer test to determine which populations have different means.
Because we have rejected the null hypothesis of equal means, we need to perform a post-ANOVA multiple comparisons test. Using Equation 12.7 to construct the critical range to compare to the absolute differences in all possible pairs of sample means:

Critical range = q1−α · √[ (MSW/2)(1/ni + 1/nj) ] = 3.85 · √[ (59.98/2)(1/10 + 1/10) ] = 9.43

(The q-value from the studentized range table with α = 0.05 and degrees of freedom k = 4 and nT − k = 36 must be approximated using degrees of freedom 4 and 30, because the table does not show degrees of freedom of 4 and 36. This value is 3.85. Rounding the degrees of freedom down to 30 gives a larger q-value and a conservatively large critical range.)

Only one critical range is necessary because the sample sizes are equal. If any pair of sample means has an absolute difference |x̄i − x̄j| greater than the critical range, we can infer that a difference exists in those population means. The possible pairwise comparisons (part of a family of comparisons called contrasts) are

Contrast                                                Significant Difference?
|x̄1 − x̄2| = |272.3590 − 271.3299| = 1.0291 < 9.43       No
|x̄1 − x̄3| = |272.3590 − 262.3140| = 10.0450 > 9.43      Yes
|x̄1 − x̄4| = |272.3590 − 265.2357| = 7.1233 < 9.43       No
|x̄2 − x̄3| = |271.3299 − 262.3140| = 9.0159 < 9.43       No
|x̄2 − x̄4| = |271.3299 − 265.2357| = 6.0942 < 9.43       No
|x̄3 − x̄4| = |262.3140 − 265.2357| = 2.9217 < 9.43       No

Therefore, based on the Tukey-Kramer procedure, we can infer that population 1 (brake system A) and population 3 (brake system C) have different mean stopping distances. Because short stopping distances are preferred, system C would be preferred over system A, but no other differences are supported by these sample data. For the other contrasts, the difference between the two sample means is insufficient to conclude that a difference in population means exists.

END EXAMPLE (Try Problem 12-6)
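As a cross-check, the Digitron contrasts can be reproduced from the sample means and MSW reported in Figure 12.6; this sketch uses the conservative table value q = 3.85 exactly as in Step 10:

```python
import math
from itertools import combinations

# q = 3.85 (studentized range table, k = 4, 30 df used in place of 36),
# MSW = 59.98 from Figure 12.6, n = 10 cars per brake system.
q, msw, n = 3.85, 59.98, 10
critical_range = q * math.sqrt((msw / 2) * (1 / n + 1 / n))   # about 9.43

means = {"A": 272.3590, "B": 271.3299, "C": 262.3140, "D": 265.2357}
for a, b in combinations(means, 2):
    diff = abs(means[a] - means[b])
    flag = "significant" if diff > critical_range else "not significant"
    print(f"|{a} - {b}| = {diff:.4f}: {flag}")
# Only the A vs. C contrast (10.0450 > 9.43) is significant.
```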
Fixed Effects Versus Random Effects in Analysis of Variance

In the Digitron brake example, the company was testing four brake systems; these were the only brake systems under consideration, and the ANOVA was intended to determine whether there was a difference in these four brake systems only. In the Hydronics weight-loss example, the company was interested in determining whether there was a difference in mean weight loss for two supplements and a placebo. In the Bayhill example involving Web site designs, the company narrowed its choices to a final four designs, and the ANOVA test was used to determine whether there was a difference in means for these four designs only. Thus, in each of these examples, the inferences extend only to the factor levels being analyzed, and the levels are assumed to be the only levels of interest. This type of test is called a fixed effects analysis of variance test.

Suppose in the Bayhill Web site example that, instead of reducing the list of possible Web site designs to a final four, the company had simply selected a random sample of four Web site designs from all possible designs being considered. In that case, the factor levels included in the test would be a random sample of the possible levels. Then, if the ANOVA leads to rejecting the null hypothesis, the conclusion applies to all possible Web site designs. The assumption is that the possible levels have a normal distribution and that the tested levels are a random sample from this distribution. When the factor levels are selected through random sampling, the analysis of variance test is called a random effects test.

12-1: Exercises (MyStatLab)

Skill Development

12-1. A start-up cell phone applications company is interested in determining whether household incomes are different for subscribers to three different service providers. A random sample of 25 subscribers to each of the three service providers was taken, and the annual household income for each subscriber was recorded. The partially completed ANOVA table for the analysis is shown here:

ANOVA
Source of Variation   SS               df   MS   F
Between Groups        2,949,085,157
Within Groups
Total                 9,271,678,090

a. Complete the ANOVA table by filling in the missing sums of squares, the degrees of freedom for each source, the mean square, and the calculated F-test statistic.
b. Based on the sample results, can the start-up firm conclude that there is a difference in household incomes for subscribers to the three service providers? You may assume normal distributions and equal variances. Conduct your test at the α = 0.10 level of significance. Be sure to state a critical F-statistic, a decision rule, and a conclusion.

12-2. An analyst is interested in testing whether four populations have equal means. The following sample data have been collected from populations that are assumed to be normally distributed with equal variances:

Sample 1   Sample 2   Sample 3   Sample 4
11         14         14         12
16         16         12         8
12         10         17         15
17         16         13

Conduct the appropriate hypothesis test using a significance level equal to 0.05.

12-3. A manager is interested in testing whether three populations of interest have equal population means. Simple random samples of size 10 were selected from each population. The following ANOVA table and related statistics were computed:

ANOVA: Single Factor
Summary
Groups     Count   Sum      Average   Variance
Sample 1   10      507.18   50.72     35.06
Sample 2   10      405.79   40.58     30.08
Sample 3   10      487.64   48.76     23.13

ANOVA
Source           SS         df   MS       F      p-value   F-crit
Between Groups   578.78     2    289.39   9.84   0.0006    3.354
Within Groups    794.36     27   29.42
Total            1,373.14   29

a. State the appropriate null and alternative hypotheses.
b. Conduct the appropriate test of the null hypothesis, assuming that the populations have equal variances and are normally distributed. Use a 0.05 level of significance.
c. If warranted, use the Tukey-Kramer procedure for multiple comparisons to determine which populations have different means. (Assume α = 0.05.)

12-4. Respond to each of the following questions using this partially completed one-way ANOVA table:

Source of Variation   SS      df    MS   F-ratio
Between Samples       1,745
Within Samples                240
Total                 6,504   246

a. How many different populations are being considered in this analysis?
b. Fill in the ANOVA table with the missing values.
c. State the appropriate null and alternative hypotheses.
d. Based on the analysis of variance F-test, what conclusion should be reached regarding the null hypothesis? Test using a significance level of 0.01.

12-5. Respond to each of the following questions using this partially completed one-way ANOVA table:

Source of Variation   SS    df   MS   F-ratio
Between Samples       405
Within Samples        888
Total                       31

a. How many different populations are being considered in this analysis?
b. Fill in the ANOVA table with the missing values.
c. State the appropriate null and alternative hypotheses.
d. Based on the analysis of variance F-test, what conclusion should be reached regarding the null hypothesis? Test using a significance level of 0.05.

12-6. Given the following sample data:

Group 1   Group 2   Group 3   Group 4
20.9      27.2      26.6      22.1
25.3      30.1      23.8      28.2
26.2      21.6      29.7      30.3
25.9      17.8      15.9      18.4
20.2      14.1      21.2      23.9
19.5      17.4

a. Based on the computations for the within- and between-sample variation, develop the ANOVA table and test the appropriate null hypothesis using α = 0.05. Use the p-value approach.
b. If warranted, use the Tukey-Kramer procedure to determine which populations have different means. Use α = 0.05.

12-7. Examine the three samples obtained independently from three populations:

Item   Group 1   Group 2   Group 3
1      14        13        12
2      15        16        17
3      16        16        18
4      17        14        15
5      16        14        16

a. Conduct a one-way analysis of variance on the data. Use alpha = 0.05.
b. If warranted, use the Tukey-Kramer procedure to determine which populations have different means. Use an experiment-wide error rate of 0.05.

Business Applications

12-8. In conjunction with the housing foreclosure crisis of 2009, many economists expressed increasing concern about the level of credit card debt and the efforts of banks to raise interest rates on these cards. The banks claimed the increases were justified. A Senate sub-committee decided to determine if the average credit card balance depends on the type of credit card used. Under consideration are Visa, MasterCard, Discover, and American Express. The sample sizes to be used for each level are 25, 25, 26, and 23, respectively.
a. Describe the parameter of interest for this analysis.
b. Determine the factor associated with this experiment.
c. Describe the levels of the factor associated with this analysis.
d. State the number of degrees of freedom available for determining the between-samples variation.
e. State the number of degrees of freedom available for determining the within-samples variation.
f. State the number of degrees of freedom available for determining the total variation.

12-9. EverRun Incorporated produces treadmills for use in exercise clubs and recreation centers. EverRun assembles, sells, and services its treadmills, but it does not manufacture the treadmill motors; rather, treadmill motors are purchased from an outside vendor. Currently, EverRun is considering which motor to include in its new ER1500 series. Three potential suppliers have been identified: Venetti, Madison, and Edison; however, only one supplier will be used. The motors produced by these three suppliers are identical in terms of noise and cost. Consequently, EverRun has decided to make its decision based on how long a motor operates at a high level of speed and incline before it fails.
A random sample of 10 motors of each type is selected, and each motor is tested to determine how many minutes (rounded to the nearest minute) it operates before it needs to be repaired. The sample information for each motor type is as follows:

Venetti   Madison   Edison
14,722    14,699    12,627
13,010    13,570    14,217
13,687    13,465    14,786
12,494    13,649    13,592
11,788    12,623    14,552
13,441    13,404    13,427
12,049    11,672    13,296
13,262    11,552    11,036
12,978    12,170    12,674
11,851    12,342    11,557

a. At the α = 0.01 level of significance, is there a difference in the average time before failure for the three suppliers' motors?
b. Is it possible for EverRun to decide on a single motor supplier based on the analysis of the sample results? Support your answer by conducting the appropriate post-test analysis.

12-10. ESSROC Cement Corporation is a leading North American cement producer, with over 6.5 million metric tons of annual capacity. With headquarters in Nazareth, Pennsylvania, ESSROC operates production facilities strategically located throughout the United States, Canada, and Puerto Rico. One of its products is Portland cement. Portland cement's properties and performance standards are defined by its type designation, with each type designated by a Roman numeral. Ninety-two percent of the Portland cement produced in North America is Type I, II, or I/II. One characteristic of the type of cement is its compressive strength. Sample data for compressive strength (psi) are shown as follows:

Type   Compressive Strength
I      4,972   4,983   4,889   5,063
II     3,216   3,399   3,267   3,357
I/II   4,073   3,949   3,936   3,925

a. Develop the appropriate ANOVA table to determine if there is a difference in the average compressive strength among the three types of Portland cement. Use a significance level of 0.01.
b. If warranted, use the Tukey-Kramer procedure to determine which populations have different mean compressive strengths. Use an experiment-wide error rate of 0.01.

12-11. The Weidmann Group Companies, with headquarters in Rapperswil, Switzerland, are worldwide leaders in insulation systems technology for power and distribution transformers. One facet of their expertise is the development of dielectric fluids for electrical equipment. Mineral oil-based dielectric fluids have been used more extensively than other dielectric fluids; their only shortcomings are their relatively low flash and fire points. One study examined the fire points of mineral oil, high-molecular-weight hydrocarbon (HMWH), and silicone. The fire points (°C) for each of these fluids were as follows:

Mineral Oil   HMWH   Silicone
162           312    343
151           310    337
168           300    345
165           311    345
169           308    337

a. Develop the appropriate ANOVA table to determine if there is a difference in the average fire points among the types of dielectric fluids. Use a significance level of 0.05.
b. If warranted, use the Tukey-Kramer procedure to determine which populations have different mean fire points. Use an experiment-wide error rate of 0.05.

12-12. The manager at the Hillsberg Savings and Loan is interested in determining whether there is a difference in the mean time that customers spend completing their transactions depending on which of four tellers they use. To conduct the test, the manager has selected simple random samples of 15 customers for each of the tellers and has timed them (in seconds) from the moment they start their transaction to the time the transaction is completed and they leave the teller station. The manager then asked one of her assistants to perform the appropriate statistical test. The assistant returned with the following partially completed ANOVA table:

Summary
Groups     Count   Sum       Average   Variance
Teller 1   15      3,043.9             827.4
Teller 2   15      3,615.5             472.2
Teller 3   15      3,427.7             445.6
Teller 4   15      4,072.4             619.4

ANOVA
Source of Variation   SS         df   MS   F-ratio   p-value    F-crit
Between Groups        36,530.6                       4.03E-09   2.7694
Within Groups
Total                 69,633.7   59

a. State the appropriate null and alternative hypotheses.
b. Test to determine whether the population variances are equal. Use a significance level equal to 0.05.
c. Fill in the missing parts of the ANOVA table and perform the statistical hypothesis test using α = 0.05.
d. Based on the result of the test in part c, if warranted, use the Tukey-Kramer method with α = 0.05 to determine which teller requires the most time on average to complete a customer's transaction.

12-13. Suppose as part of your job you are responsible for installing emergency lighting in a series of state office buildings. Bids have been received from four manufacturers of battery-operated emergency lights. The costs are about equal, so the decision will be based on the length of time the lights last before failing. A sample of four lights from each manufacturer has been tested, with the following values (time in hours) recorded for each manufacturer:

Type A   Type B   Type C   Type D
1,024    1,270    1,121    923
1,121    1,325    1,201    983
1,250    1,426    1,190    1,087
1,022    1,322    1,122    1,121

a. Using a significance level equal to 0.01, what conclusion should you reach about the four manufacturers' battery-operated emergency lights? Explain.
b. If the test conducted in part a reveals that the null hypothesis should be rejected, which manufacturer should be used to supply the lights? Can you eliminate one or more manufacturers based on these data? Use the appropriate test, with α = 0.01, for multiple comparisons. Discuss.

Computer Database Exercises

12-14. Damage to homes caused by burst piping can be expensive to repair. By the time the leak is discovered, hundreds of gallons of water may have already flooded the home. Automatic shutoff valves can prevent extensive water damage from plumbing failures; the valves contain sensors that cut off water flow in the event of a leak, thereby preventing flooding. One important characteristic is the time (in milliseconds) required for the sensor to detect the water leak. Sample data obtained for four different shutoff valve models are contained in the file entitled Waterflow.
a. Produce the relevant ANOVA table and conduct a hypothesis test to determine if the mean detection time differs among the four shutoff valve models. Use a significance level of 0.05.
b. Use the Tukey-Kramer multiple comparison technique to discover any differences in the average detection time. Use a significance level of 0.05.
c. Which of the four shutoff valves would you recommend?
State your criterion for your selection 12-15 A regional package delivery company is considering changing from full-size vans to minivans The company sampled minivans from each of three manufacturers The number sampled represents the number the manufacturer was able to provide for the test Each minivan was driven for 5,000 miles, and the operating cost per mile was computed The operating costs, in cents per mile, for the 12 are provided in the data file called Delivery: Mini Mini Mini 13.3 14.3 13.6 12.8 14.0 12.4 13.4 13.1 13.9 15.5 15.2 14.5 a Perform an analysis of variance on these data Assume a significance level of 0.05 Do the experimental data provide evidence that the average operating costs per mile for the three types of minivans are different? Use a p-value approach b Referring to part a, based on the sample data and the appropriate test for multiple comparisons, what conclusions should be reached concerning which type of car the delivery company should adopt? Discuss and prepare a report to the company CEO Use a  0.05 c Provide an estimate of the maximum and minimum difference in average savings per year if the CEO chooses the “best” versus the “worst” minivan using operating costs as a criterion Assume that minivans are driven 30,000 miles a year Use a 90% confidence interval 12-16 The Lottaburger restaurant chain in central New Mexico is conducting an analysis of its restaurants, (23) CHAPTER 12 which take pride in serving burgers and fries to go faster than the competition As a part of its analysis, Lottaburger wants to determine if its speed of service is different across its four outlets Orders at Lottaburger restaurants are tracked electronically, and the chain is able to determine the speed with which every order is filled The chain decided to randomly sample 20 orders from each of the four restaurants it operates The speed of service for each randomly sampled order was noted and is contained in the file Lottaburger a At the a  0.05 level of service, can Lottaburger conclude that the speed of service is different across the four restaurants in the chain? b If the chain concludes that there is a difference in speed of service, is there a particular restaurant the chain should focus its attention on? Use the appropriate test for multiple comparisons to support your decision Use a  0.05 12-17 Most auto batteries are made by just three manufacturers—Delphi, Exide, and Johnson Controls Industries Each makes batteries sold under several | Analysis of Variance 497 different brand names Delphi makes ACDelco and some EverStart (Wal-Mart) models Exide makes Champion, Exide, Napa, and some EverStart batteries Johnson Controls makes Diehard (Sears), Duralast (AutoZone), Interstate, Kirkland (Costco), Motorcraft (Ford), and some EverStarts To determine if who makes the auto batteries affects the average length of life of the battery, the samples in the file entitled Start were obtained The data represent the length of life (months) for batteries of the same specifications for each of the three manufacturers a Determine if the average length of battery life is different among the batteries produced by the three manufacturers Use a significance level of 0.05 b Which manufacturer produces the battery with the longest average length of life? 
If warranted, conduct the Tukey-Kramer procedure to determine this Use a significance level of 0.05 (Note: You will need to manipulate the data columns to obtain the appropriate factor levels) END EXERCISES 12-1 Chapter Outcome 12.2 Randomized Complete Block Analysis of Variance Section 12.1 introduced one-way ANOVA for testing hypotheses involving three or more population means This ANOVA method is appropriate as long as we are interested in analyzing one factor at a time and we select independent random samples from the populations For instance, Example 12-2 involving brake assembly systems at the Digitron Corporation (Figure 12.6) illustrated a situation in which we were interested in only one factor: type of brake assembly system The measurement of interest was the stopping distance with each brake system To test the hypothesis that the four brake systems were equal with respect to average stopping distance, four groups of the same make and model cars were assigned to each brake system independently Thus, the one-way ANOVA design was appropriate There are, however, situations in which another factor may affect the observed response in a one-way design Often, this additional factor is unknown This is the reason for randomization within the experiment However, there are also situations in which we know the factor that is impinging on the response variable of interest Chapter 10 introduced the concept of paired samples and indicated that there are instances when you will want to test for differences in two population means by controlling for sources of variation that might adversely affect the analysis For instance, in the Digitron example, we might be concerned that, even though we used the same make and model of car in the study, the cars themselves may interject a source of variability that could affect the result To control for this, we could use the concept of paired samples by using the same 10 cars for each of the four brake systems When an additional factor with two or more levels is involved, a design technique called blocking can be used to eliminate the additional factor’s effect on the statistical analysis of the main factor of interest Randomized Complete Block ANOVA Excel and Minitab tutorials Excel and Minitab Tutorial BUSINESS APPLICATION A RANDOMIZED BLOCK DESIGN CITIZEN’S STATE BANK At Citizen’s State Bank, homeowners can borrow money against the equity they have in their homes To determine equity, the bank determines the home’s value and subtracts the mortgage balance The maximum loan is 90% of the equity (24) 498 CHAPTER 12 | Analysis of Variance The bank outsources the home appraisals to three companies: Allen & Associates, Heist Appraisal, and Appraisal International The bank managers know that appraisals are not exact Some appraisal companies may overvalue homes on average, whereas others might undervalue homes Bank managers wish to test the hypothesis that there is no difference in the average house appraisal among the three different companies The managers could select a random sample of homes for Allen & Associates to appraise, a second sample of homes for Heist Appraisal to work on, and a third sample of homes for Appraisal International One-way ANOVA would be used to compare the sample means Obviously a problem could occur if, by chance, one company received larger, higher-quality homes located in better neighborhoods than the other companies This company’s appraisals would naturally be higher on average, not because it tended to appraise higher, but because 
the homes were simply more expensive. Citizen's State Bank officers need to control for the variation in size, quality, and location of homes to test fairly whether the three companies' appraisals are equal on average. To do this, they select a random sample of properties and have each company appraise the same properties. In this case, the properties are called blocks, and the test design is called a randomized complete block design. The data in Table 12.5 were obtained when each appraisal company was asked to appraise the same five properties. The bank managers wish to test the following hypothesis:

H0: μ1 = μ2 = μ3
HA: At least two populations have different means

The randomized block design requires the following assumptions:

Assumptions
1. The populations are normally distributed.
2. The populations have equal variances.
3. The observations within samples are independent.
4. The data measurement must be interval or ratio level.

Because the managers have chosen to have the same properties appraised by each company (block on property), the samples are not independent, and a method known as randomized complete block ANOVA must be employed to test the hypothesis. This method is similar to the one-way ANOVA in Section 12.1. However, there is one more source of variation to be accounted for: the block variation. As was the case in Section 12.1, we must find estimators for each source of variation. Identifying the appropriate sums of squares and then dividing each by its degrees of freedom does this. As was the case in the one-way ANOVA, the sums of squares are obtained by partitioning the total sum of squares (SST). However, in this case the SST is divided into three components instead of two, as shown in Equation 12.8.

TABLE 12.5  Citizen's State Bank Property Appraisals (in thousands of dollars)

                                        Appraisal Company
Property (Block)    Allen & Associates    Heist Appraisal    Appraisal International    Block Mean
1                          78                    82                    79                  79.67
2                         102                   102                    99                 101.00
3                          68                    74                    70                  70.67
4                          83                    88                    86                  85.67
5                          95                    99                    92                  95.33
Factor-Level Mean     x̄1 = 85.2              x̄2 = 89               x̄3 = 85.2          x̄ = 86.47 = Grand mean

Sum of Squares Partitioning for Randomized Complete Block Design

SST = SSB + SSBL + SSW    (12.8)

where:
SST = Total sum of squares
SSB = Sum of squares between factor levels
SSBL = Sum of squares between blocks
SSW = Sum of squares within levels

Both SST and SSB are computed just as we did with one-way ANOVA, using Equations 12.3 and 12.4. The sum of squares for blocking (SSBL) is computed using Equation 12.9.

Sum of Squares for Blocking

$SSBL = \sum_{j=1}^{b} k(\bar{x}_j - \bar{x})^2$    (12.9)

where:
k = Number of levels for the factor
b = Number of blocks
x̄j = The mean of the jth block
x̄ = Grand mean

Finally, the sum of squares within (SSW) is computed using Equation 12.10. This sum of squares is what remains (the residual) after the variation for all known factors has been removed. This residual sum of squares may be due to the inherent variability of the data, measurement error, or other unidentified sources of variation. Therefore, the sum of squares within is also known as the sum of squares of error, SSE.

Sum of Squares Within

SSW = SST − (SSB + SSBL)    (12.10)

The effect of computing SSBL and subtracting it from SST in Equation 12.10 is that SSW is reduced. Also, if the corresponding variation in the blocks is significant, the variation within the factor levels will be significantly reduced. This can make it easier to detect a difference in the population means if such a difference actually exists. If it does, the estimator for the within variability will in all likelihood be reduced, and thus the denominator for the F-test statistic will be smaller.
This will produce a larger F-test statistic, which will more likely lead to rejecting the null hypothesis. This will depend, of course, on the relative size of SSBL and the respective changes in the degrees of freedom. Table 12.6 shows the randomized complete block ANOVA table format and the equations for degrees of freedom, mean squares, and F-ratios. As you can see, we now have two F-ratios. The reason for this is that we test not only to determine whether the population means are equal but also to obtain an indication of whether the blocking was necessary, by examining the ratio of the mean square for blocks to the mean square within.

TABLE 12.6  Basic Format for the Randomized Block ANOVA Table

Source of Variation    SS      df                MS      F-ratio
Between blocks         SSBL    b − 1             MSBL    MSBL/MSW
Between samples        SSB     k − 1             MSB     MSB/MSW
Within samples         SSW     (k − 1)(b − 1)    MSW
Total                  SST     nT − 1

where:
k = Number of levels
b = Number of blocks
df = Degrees of freedom
nT = Combined sample size
MSBL = Mean square blocking = SSBL/(b − 1)
MSB = Mean square between = SSB/(k − 1)
MSW = Mean square within = SSW/[(k − 1)(b − 1)]

Note: Some randomized block ANOVA tables put SSB first, followed by SSBL.

Although you could manually compute the necessary values for the randomized block design, both Excel and Minitab contain a procedure that will do all the computations and build the ANOVA table. The Citizen's State Bank appraisal data are included in the file Citizens. (Note that the first column contains labels for each block.) Figures 12.7a and 12.7b show the ANOVA output. Using Excel or Minitab to perform the computations frees the decision maker to focus on interpreting the results. Note that Excel refers to the randomized block ANOVA as Two-Factor ANOVA Without Replication, and Minitab refers to it as Two-Way ANOVA.

The main issue is to determine whether the three appraisal companies differ in average appraisal values. The primary test is

H0: μ1 = μ2 = μ3
HA: At least two populations have different means
α = 0.05

Using the output presented in Figures 12.7a and 12.7b, you can test this hypothesis two ways. First, we can use the F-distribution approach. Figure 12.8 shows the results of this test. Based on the sample data, we reject the null hypothesis and conclude that the three appraisal companies do not provide equal average values for properties. The second approach to testing the null hypothesis is the p-value approach. The decision rule in an ANOVA application for p-values is

If p-value < α, reject H0; otherwise, do not reject H0.

In this case, α = 0.05 and the p-value in Figure 12.7a is 0.0103. Because p-value = 0.0103 < α = 0.05, we reject the null hypothesis. Both the F-distribution approach and the p-value approach give the same result, as they must.
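If you want to verify this output outside Excel or Minitab, the randomized block ANOVA table can be reproduced with a few lines of code. The following is a minimal Python sketch, assuming the pandas and statsmodels packages are available; the data frame layout and column names are ours, not part of the Citizens file.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Table 12.5: each of the five properties (blocks) is appraised
# by all three companies; values are in thousands of dollars
appraisals = pd.DataFrame({
    "property": [1, 2, 3, 4, 5] * 3,
    "company": (["Allen"] * 5) + (["Heist"] * 5) + (["International"] * 5),
    "value": [78, 102, 68, 83, 95,      # Allen & Associates
              82, 102, 74, 88, 99,      # Heist Appraisal
              79,  99, 70, 86, 92],     # Appraisal International
})

# Additive model with one observation per company/property cell --
# the randomized complete block layout
model = smf.ols("value ~ C(company) + C(property)", data=appraisals).fit()
print(sm.stats.anova_lm(model, typ=2))
# C(company):  F ≈ 8.54,  p ≈ 0.0103  (the primary test)
# C(property): F ≈ 156.1              (the blocking pseudotest)
```

Fitting an additive two-way model to one observation per cell is the same analysis that Excel's Two-Factor ANOVA Without Replication and Minitab's Two-way procedure perform.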
Was Blocking Necessary?

Before we take up the issue of determining which company provides the highest mean property values, we need to discuss one other issue. Recall that the bank managers chose to control for variation between properties by having each appraisal company evaluate the same five properties. This restriction is called blocking, and the properties are the blocks. The ANOVA output in Figure 12.7a contains information that allows us to test whether blocking was necessary. If blocking was necessary, it would mean that appraisal values are in fact influenced by the particular property being appraised. The blocks then form a second factor of interest, and we formulate a secondary hypothesis test for this factor, as follows:

H0: μb1 = μb2 = μb3 = μb4 = μb5
HA: Not all block means are equal

[FIGURE 12.7A  Excel 2007 Output: Citizen's State Bank Analysis of Variance. The output identifies the blocking test, the main factor test, and the within (error) source. Excel 2007 Instructions: 1. Open file: Citizens.xls. 2. On the Data tab, click Data Analysis. 3. Select ANOVA: Two-Factor Without Replication. 4. Define data range (include column A). 5. Specify alpha level = 0.05. 6. Indicate output location. 7. Click OK.]

[FIGURE 12.7B  Minitab Output: Citizen's State Bank Analysis of Variance. The output identifies the main factor test and the blocking test. Minitab Instructions: 1. Open file: Citizens.MTW. 2. Choose Stat > ANOVA > Two-way. 3. In Response, enter the data column (Appraisal). 4. In Row Factor, enter the main factor indicator column (Company) and select Display Means. 5. In Column Factor, enter the block indicator column (Property) and select Display Means. 6. Choose Fit additive model. 7. Click OK.]

Note that we are using μbj to represent the mean of the jth block. It seems only natural to use a test statistic that consists of the ratio of the mean square for blocks to the mean square within. However, certain (randomization) restrictions placed on the complete block design make this proposed test statistic invalid from a theoretical statistics point of view. As an approximate procedure, however, the examination of the ratio MSBL/MSW is certainly reasonable. If it is large, it implies that the blocks had a large effect on the response variable and that they were probably helpful in improving the precision of the F-test for the primary factor's means.7

7 Many authors argue that the randomization restriction imposed by using blocks means that the F-ratio really is a test for the equality of the block means plus the randomization restriction. For a summary of this argument and references, see D. C. Montgomery, Design and Analysis of Experiments, 4th ed. (New York: John Wiley & Sons, 1997), pp. 175–176.

[FIGURE 12.8  Appraisal Company Hypothesis Test for Citizen's State Bank. H0: μ1 = μ2 = μ3; HA: At least two population means are different; α = 0.05. Degrees of freedom: D1 = k − 1 = 3 − 1 = 2 and D2 = (b − 1)(k − 1) = (4)(2) = 8, so the rejection region begins at F0.05 = 4.459. Because F = 8.54 > F0.05 = 4.459, reject H0.]

In performing the analysis of variance, we may also conduct a pseudotest to see whether the average appraisals for each property are equal. If the null hypothesis is rejected, we have an indication that the blocking is necessary and that the randomized block design is justified. However, we should be careful to present this only as an indication and not as a precise test of hypothesis for the blocks. The output in Figure 12.7a provides the F-value and p-value for this pseudotest to determine whether the blocking was a necessity. Because F = 156.13 > F0.05 = 3.838, we definitely have an indication that the blocking design was necessary.

If a hypothesis test indicates blocking is not necessary, the chance of a Type II error for the primary hypothesis has been unnecessarily increased by the use of blocking. The reason is that by blocking we not only partition the sum of squares, we also partition the degrees of freedom.
Therefore, the denominator of MSW is decreased, and MSW will most likely increase. If blocking isn't needed, the MSW will tend to be relatively larger than if we had run a one-way design with independent samples. This can lead to failing to reject the null hypothesis for the primary test when it actually should have been rejected. Therefore, if blocking is indicated to be unnecessary, follow these rules:

1. If the primary H0 is rejected, proceed with your analysis and decision making. There is no concern.
2. If the primary H0 is not rejected, redo the study without using blocking. Run a one-way ANOVA with independent samples.

Chapter Outcome 4

EXAMPLE 12-3  PERFORMING A RANDOMIZED BLOCK ANALYSIS OF VARIANCE

Frankle Training & Education  Frankle Training & Education conducts project management training courses throughout the eastern United States and Canada. The company has developed three 1,000-point practice examinations meant to simulate the certification exams given by the Project Management Institute (PMI). The Frankle leadership wants to know if the three exams will yield the same or different mean scores. To test this, a random sample of fourteen people who have been through the project management training are asked to take the three tests. The order in which the tests are taken is randomized, and the scores are recorded. A randomized block analysis of variance test can be performed using the following steps:

Step 1  Specify the parameter of interest and formulate the appropriate null and alternative hypotheses.
The parameter of interest is the mean test score for the three different exams, and the question is whether there is a difference among the mean scores for the three. The appropriate null and alternative hypotheses are
H0: μ1 = μ2 = μ3
HA: At least two populations have different means
In this case, the Frankle leadership wants to control for variation in student ability by having the same students take all three tests. The test scores will be independent because the scores achieved by one student do not influence the scores achieved by other students. Here, the students are the blocks.

Step 2  Specify the level of significance for conducting the tests.
The tests will be conducted using α = 0.05.

Step 3  Select simple random samples from each population, and compute treatment means, block means, and the grand mean.
The following sample data were observed:

Student    Exam 1    Exam 2    Exam 3    Block Means
 1           830       647       630       702.33
 2           743       840       786       789.67
 3           652       747       730       709.67
 4           885       639       617       713.67
 5           814       943       632       796.33
 6           733       916       410       686.33
 7           770       923       727       806.67
 8           829       903       726       819.33
 9           847       760       648       751.67
10           878       856       668       800.67
11           728       878       670       758.67
12           693       990       825       836.00
13           807       871       564       747.33
14           901       980       719       866.67
Treatment means:  793.57    849.50    668.00    Grand mean = 770.36

Step 4  Compute the sums of squares and complete the ANOVA table.
Four sums of squares are required:

1. Total Sum of Squares (Equation 12.3)
$SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x})^2 = 614{,}641.6$

2. Sum of Squares Between (Equation 12.4)
$SSB = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 = 241{,}912.7$

3. Sum of Squares Blocking (Equation 12.9)
$SSBL = \sum_{j=1}^{b} k(\bar{x}_j - \bar{x})^2 = 116{,}605.0$

4. Sum of Squares Within (Equation 12.10)
SSW = SST − (SSB + SSBL) = 256,123.9
The ANOVA table is (see Table 12.6 format):

Source             SS           df    MS           F-Ratio
Between blocks     116,605.0    13      8,969.6     0.9105
Between samples    241,912.7     2    120,956.4    12.2787
Within samples     256,123.9    26      9,850.9
Total              614,641.6    41

Step 5  Test to determine whether blocking is effective.
Fourteen people were used to evaluate the three tests. These people constitute the blocks, so if blocking is effective, the mean test scores across the three tests will not be the same for all 14 students. The null and alternative hypotheses are
H0: μb1 = μb2 = μb3 = … = μb14
HA: Not all means are equal (blocking is effective)
As shown in step 4, the F-test statistic to test this null hypothesis is formed by
F = MSBL/MSW = 8,969.6/9,850.9 = 0.9105
The F-critical value from the F-distribution, with α = 0.05 and D1 = 13 and D2 = 26 degrees of freedom, can be approximated using the F-distribution table in Appendix H as F0.05 ≈ 2.15. The exact F-critical value can be found using the FINV function in Excel or the Calc > Probability Distributions command in Minitab as F0.05 = 2.119. Then, because F = 0.9105 < F0.05 = 2.119, do not reject the null hypothesis. This means that, based on these sample data, we cannot conclude that blocking was effective.

Step 6  Conduct the main hypothesis test to determine whether the populations have equal means.
We have three different project management exams being considered. At issue is whether the mean score is equal for the three exams. The appropriate null and alternative hypotheses are
H0: μ1 = μ2 = μ3
HA: At least two populations have different means
As shown in the ANOVA table in step 4, the F-test statistic for this null hypothesis is formed by
F = MSB/MSW = 120,956.4/9,850.9 = 12.2787
The F-critical value from the F-distribution, with α = 0.05 and D1 = 2 and D2 = 26 degrees of freedom, can be approximated using the F-distribution table in Appendix H as F0.05 ≈ 3.40. The exact F-critical value can be found using the FINV function in Excel or the Calc > Probability Distributions command in Minitab as F0.05 = 3.369. Then, because F = 12.2787 > F0.05 = 3.369, reject the null hypothesis. Even though in step 5 we concluded that blocking was not effective, the sample data still lead us to reject the primary null hypothesis and conclude that the three tests do not all have the same mean score. The Frankle leaders will now be interested in looking into the issue in more detail to determine which tests yield higher or lower average scores. (See Example 12-4.)
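The arithmetic in Example 12-3 is easy to check directly. The following Python sketch (a minimal illustration assuming NumPy and SciPy are installed; the variable names are ours) reproduces the four sums of squares, both F-ratios, and the exact critical values quoted in steps 5 and 6.

```python
import numpy as np
from scipy import stats

# Example 12-3 scores: rows = 14 students (blocks), columns = 3 exams
scores = np.array([
    [830, 647, 630], [743, 840, 786], [652, 747, 730], [885, 639, 617],
    [814, 943, 632], [733, 916, 410], [770, 923, 727], [829, 903, 726],
    [847, 760, 648], [878, 856, 668], [728, 878, 670], [693, 990, 825],
    [807, 871, 564], [901, 980, 719],
])
b, k = scores.shape                # b = 14 blocks, k = 3 treatments
x_bar = scores.mean()              # grand mean = 770.36

SST = ((scores - x_bar) ** 2).sum()                      # 614,641.6
SSB = b * ((scores.mean(axis=0) - x_bar) ** 2).sum()     # 241,912.7
SSBL = k * ((scores.mean(axis=1) - x_bar) ** 2).sum()    # 116,605.0
SSW = SST - (SSB + SSBL)                                 # 256,123.9

MSB = SSB / (k - 1)
MSBL = SSBL / (b - 1)
MSW = SSW / ((k - 1) * (b - 1))
df_w = (k - 1) * (b - 1)           # 26

print("Blocking:  F =", MSBL / MSW, " F-critical =", stats.f.ppf(0.95, b - 1, df_w))
print("Treatment: F =", MSB / MSW, " F-critical =", stats.f.ppf(0.95, k - 1, df_w))
# Blocking:  F ≈ 0.9105 < 2.119  -> do not reject H0
# Treatment: F ≈ 12.2787 > 3.369 -> reject H0
```

The `stats.f.ppf` calls return the same exact critical values that Excel's FINV function provides.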
END EXAMPLE  TRY PROBLEM 12-21 (pg 507)

Chapter Outcome 3

Fisher's Least Significant Difference Test

An analysis of variance test can be used to test whether the populations of interest have different means. However, even if the null hypothesis of equal population means is rejected, the ANOVA does not specify which population means are different. In Section 12.1, we showed how the Tukey-Kramer multiple comparisons procedure is used to determine where the population differences occur for a one-way ANOVA design. Likewise, Fisher's least significant difference test is one test for multiple comparisons that we can use for a randomized block ANOVA design. If the primary null hypothesis has been rejected, then we can compare the absolute differences in sample means from any two populations to the least significant difference (LSD), as computed using Equation 12.11.

Fisher's Least Significant Difference

$LSD = t_{\alpha/2}\sqrt{MSW\,\frac{2}{b}}$    (12.11)

where:
$t_{\alpha/2}$ = One-tailed value from Student's t-distribution for α/2 and (k − 1)(b − 1) degrees of freedom
MSW = Mean square within from ANOVA table
b = Number of blocks
k = Number of levels of the main factor

EXAMPLE 12-4  APPLYING FISHER'S LEAST SIGNIFICANT DIFFERENCE TEST

Frankle Training & Education (continued)  Recall that in Example 12-3 the Frankle leadership used a randomized block ANOVA design to conclude that the three project management tests do not all have the same mean test score. To determine which populations (tests) have different means, you can use the following steps:

Step 1  Compute the LSD statistic using Equation 12.11.
$LSD = t_{\alpha/2}\sqrt{MSW\,\frac{2}{b}}$
Using a significance level equal to 0.05, the t-critical value for (3 − 1)(14 − 1) = 26 degrees of freedom is t0.05/2 = 2.0555. The mean square within from the ANOVA table (see Example 12-3, step 4) is MSW = 9,850.9. The LSD is
$LSD = 2.0555\sqrt{9{,}850.9 \times \frac{2}{14}} = 77.11$

Step 2  Compute the sample means from each population.
x̄1 = Σx/n = 793.57    x̄2 = Σx/n = 849.50    x̄3 = Σx/n = 668.00

Step 3  Form all possible contrasts by finding the absolute differences between all pairs of sample means. Compare these to the LSD value.

Comparison                                 Absolute Difference    Significant Difference
|x̄1 − x̄2| = |793.57 − 849.50| = 55.93      55.93 < 77.11          No
|x̄1 − x̄3| = |793.57 − 668| = 125.57        125.57 > 77.11         Yes
|x̄2 − x̄3| = |849.50 − 668| = 181.50        181.50 > 77.11         Yes

We infer, based on the sample data, that the mean score for test 1 exceeds the mean for test 3, and the mean for test 2 exceeds the mean for test 3. Now the manager may wish to evaluate test 3 to see why the scores are lower than for the other two tests. No difference is detected between tests 1 and 2.

END EXAMPLE  TRY PROBLEM 12-22 (pg 507)
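To see Equation 12.11 at work, the short Python fragment below recomputes the LSD for Example 12-4 and flags the significant pairwise differences. It is an illustrative sketch assuming SciPy is available; the MSW value and sample means are taken from Examples 12-3 and 12-4.

```python
from itertools import combinations
from scipy import stats

k, b = 3, 14                       # treatments (exams) and blocks (students)
MSW = 9850.9                       # mean square within from Example 12-3
means = {"Exam 1": 793.57, "Exam 2": 849.50, "Exam 3": 668.00}

t_crit = stats.t.ppf(1 - 0.05 / 2, (k - 1) * (b - 1))   # 2.0555
LSD = t_crit * (MSW * 2 / b) ** 0.5                      # 77.11

for (name_i, x_i), (name_j, x_j) in combinations(means.items(), 2):
    diff = abs(x_i - x_j)
    verdict = "different" if diff > LSD else "no detected difference"
    print(f"{name_i} vs {name_j}: |diff| = {diff:6.2f} -> {verdict}")
# Exam 1 vs Exam 2:  55.93 -> no detected difference
# Exam 1 vs Exam 3: 125.57 -> different
# Exam 2 vs Exam 3: 181.50 -> different
```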
MyStatLab

12-2: Exercises

Skill Development

12-18 A study was conducted to determine if differences in new textbook prices exist between on-campus bookstores, off-campus bookstores, and Internet bookstores. To control for differences in textbook prices that might exist across disciplines, the study randomly selected 12 textbooks and recorded the price of each of the 12 books at each of the three retailers. You may assume normality and equal-variance assumptions have been met. The partially completed ANOVA table based on the study's findings is shown here:

ANOVA
Source of Variation    SS          df    MS    F
Textbooks              16,624
Retailer                    2.4
Error
Total                  17,477.6

a. Complete the ANOVA table by filling in the missing sums of squares, the degrees of freedom for each source, the mean square, and the calculated F-test statistic for each possible hypothesis test.
b. Based on the study's findings, was it correct to block for differences in textbooks? Conduct the appropriate test at the α = 0.10 level of significance.
c. Based on the study's findings, can it be concluded that there is a difference in the average price of textbooks across the three retail outlets? Conduct the appropriate hypothesis test at the α = 0.10 level of significance.

12-19 The following data were collected for a randomized block analysis of variance design with four populations and eight blocks:

           Group 1    Group 2    Group 3    Group 4
Block 1      56         34         44         30
Block 2      57         38         84         50
Block 3      50         41         48         52
Block 4      19         17         21         30
Block 5      33         30         35         38
Block 6      74         72         78         79
Block 7      33         24         27         33
Block 8      56         44         56         71

a. State the appropriate null and alternative hypotheses for the treatments and determine whether blocking is necessary.
b. Construct the appropriate ANOVA table.
c. Using a significance level equal to 0.05, can you conclude that blocking was necessary in this case? Use a test-statistic approach.
d. Based on the data and a significance level equal to 0.05, is there a difference in population means for the four groups? Use a p-value approach.
e. If you found that a difference exists in part d, use the LSD approach to determine which populations have different means.

12-20 The following ANOVA table and accompanying information are the result of a randomized block ANOVA test.

Summary     Count    Sum      Average    Variance
1             4        443    110.8        468.9
2             4        275     68.8         72.9
3             4      1,030    257.5      1,891.7
4             4        300     75.0        433.3
5             4        603    150.8        468.9
6             4        435    108.8         72.9
7             4      1,190    297.5      1,891.7
8             4        460    115.0        433.3
Sample 1      8      1,120    140.0      7,142.9
Sample 2      8      1,236    154.5      8,866.6
Sample 3      8      1,400    175.0      9,000.0
Sample 4      8        980    122.5      4,307.1

ANOVA
Source of Variation    SS         df    MS          F       p-value    F-crit
Rows                   199,899     7    28,557.0    112.8    0.0000     2.488
Columns                 11,884     3     3,961.3     15.7    0.0000     3.073
Error                    5,317    21       253.2
Total                  217,100    31

a. How many blocks were used in this study?
b. How many populations are involved in this test?
c. Test to determine whether blocking is effective using an alpha level equal to 0.05.
d. Test the main hypothesis of interest using α = 0.05.
e. If warranted, conduct an LSD test with α = 0.05 to determine which population means are different.

12-21 The following sample data were recently collected in the course of conducting a randomized block analysis of variance. Based on these sample data, what conclusions should be reached about blocking effectiveness and about the means of the three populations involved?
Test using a significance level equal to 0.05 | 507 Analysis of Variance Block Sample Sample Sample 3 30 50 60 40 80 20 40 70 40 40 70 10 40 50 70 30 90 10 12-22 A randomized complete block design is carried out, resulting in the following statistics: x1 x2 x3 x4 Primary Factor 237.15 315.15 414.01 612.52 Block 363.57 382.22 438.33 Source SST  364,428 a Determine if blocking was effective for this design b Using a significance level of 0.05, produce the relevant ANOVA and determine if the average responses of the factor levels are equal to each other c If you discovered that there were differences among the average responses of the factor levels, use the LSD approach to determine which populations have different means Business Applications 12-23 Frasier and Company manufactures four different products that it ships to customers throughout the United States Delivery times are not a driving factor in the decision as to which type of carrier to use (rail, plane, or truck) to deliver the product However, breakage cost is very expensive, and Frasier would like to select a mode of delivery that reduces the amount of product breakage To help it reach a decision, the managers have decided to examine the dollar amount of breakage incurred by the three alternative modes of transportation under consideration Because each product’s fragility is different, the executives conducting the study wish to control for differences due to type of product The company randomly assigns each product to each carrier and monitors the dollar breakage that occurs over the course of 100 shipments The dollar breakage per shipment (to the nearest dollar) is as follows: Product Product Product Product Rail Plane Truck $7,960 $8,399 $9,429 $6,022 $8,053 $7,764 $9,196 $5,821 $8,818 $9,432 $9,260 $5,676 a Was Frasier and Company correct in its decision to block for type of product? Conduct the appropriate hypothesis test using a level of significance of 0.01 b Is there a difference due to carrier type? 
Conduct the appropriate hypothesis test using a level of significance of 0.01 (34) 508 CHAPTER 12 | Analysis of Variance 12-24 The California Lettuce Research Board was originally formed as the Iceberg Lettuce Advisory Board in 1973 The primary function of the board is to fund research on iceberg and leaf lettuce The California Lettuce Research Board published research (M Cahn and H Ajwa, “Salinity Effects on Quality and Yield of Drip Irrigated Lettuce”) concerning the effect of varying levels of sodium absorption ratios (SAR) on the yield of head lettuce The trials followed a randomized complete block design where variety of lettuce (Salinas and Sniper) was the main factor and salinity levels were the blocks The measurements (the number of lettuce heads from each plot) of the kind observed were SAR Salinas 10 Sniper 104 160 142 133 109 163 146 156 a Determine if blocking was effective for this design b Using a significance level of 0.05, produce the relevant ANOVA and determine if the average number of lettuce heads among the SARs are equal to each other c If you discovered that there were differences among the average number of lettuce heads among the SARs, use the LSD approach to determine which populations have different means 12-25 CB Industries operates three shifts every day of the week Each shift includes full-time hourly workers, nonsupervisory salaried employees, and supervisors/ managers CB Industries would like to know if there is a difference among the shifts in terms of the number of hours of work missed due to employee illness To control for differences that might exist across employee groups, CB Industries randomly selects one employee from each employee group and shift and records the number of hours missed for one year The results of the study are shown here: Hourly Nonsupervisory Supervisors/Managers Shift Shift 48 31 25 54 36 33 Shift 60 55 40 a Develop the appropriate test to determine whether blocking is effective or not Conduct the test at the a  0.05 level of significance b Develop the appropriate test to determine whether there are differences in the average number of hours missed due to illness across the three shifts Conduct the test at the a  0.05 level of significance c If it is determined that a difference in the average hours of work missed due to illness is not the same for the three shifts, use the LSD approach to determine which shifts have different means 12-26 Grant Thornton LLP is the U.S member firm of Grant Thornton International, one of the six global accounting, tax, and business advisory organizations It provides firmwide auditing training for its employees in three different auditing methods Auditors were grouped into four blocks according to the education they had received: (1) high school, (2) bachelor’s, (3) master’s, (4) doctorate Three auditors at each education level were used—one assigned to each method They were given a posttraining examination consisting of complicated auditing scenarios The scores for the 12 auditors were as follows: Doctorate Master’s Bachelor’s High School Method Method 83 77 74 72 81 75 73 70 Method 82 79 75 69 a Indicate why blocking was employed in this design b Determine if blocking was effective for this design by producing the relevant ANOVA c Using a significance level of 0.05, determine if the average posttraining examination scores among the auditing methods are equal to each other d If you discovered that there were differences among the average posttraining examination scores among the auditing methods, use 
the LSD approach to determine which populations have different means Computer Database Exercises 12-27 Applebee’s International, Inc., is a U.S company that develops, franchises, and operates the Applebee’s Neighborhood Grill and Bar restaurant chain It is the largest chain of casual dining restaurants in the country, with over 1,500 restaurants across the United States The headquarters is located in Overland Park, Kansas The company is interested in determining if mean weekly revenue differs among three restaurants in a particular city The file entitled Applebees contains revenue data for a sample of weeks for each of the three locations a Test to determine if blocking the week on which the testing was done was necessary Use a significance level of 0.05 b Based on the data gathered by Applebee’s, can it be concluded that there is a difference in the average revenue among the three restaurants? c If you did conclude that there was a difference in the average revenue, use Fisher’s LSD approach to determine which restaurant has the lowest mean sales 12-28 In a local community there are three grocery chain stores The three have been carrying out a spirited advertising campaign in which each claims to have the lowest prices A local news station recently sent a reporter to the three stores to check prices on several items She found that for certain items each store had the lowest price This survey didn’t really answer the question for consumers Thus, the station set up a test in which 20 shoppers were given different lists of grocery items and were sent to each of the three chain stores The sales receipts from each of the three stores are recorded in the data file Groceries (35) CHAPTER 12 a Why should this price test be conducted using the design that the television station used? What was it attempting to achieve by having the same shopping lists used at each of the three grocery stores? b Based on a significance level of 0.05 and these sample data, test to determine whether blocking was necessary in this example State the null and alternative hypotheses Use a test-statistic approach c Based on these sample data, can you conclude the three grocery stores have different sample means? Test using a significance level of 0.05 State the appropriate null and alternative hypotheses Use a p-value approach d Based on the sample data, which store has the highest average prices? Use Fisher’s LSD test if appropriate 12-29 The Cordage Institute, based in Wayne, Pennsylvania, is an international association of manufacturers, producers, and resellers of cordage, rope, and twine It is a not-for-profit corporation that reports on research concerning these products Although natural fibers like manila, sisal, and cotton were once the predominant rope materials, industrial synthetic fibers dominate the marketplace today, with most ropes made of nylon, polyester, or polypropylene One of the principal traits of rope material is its breaking strength A research project generated data given in the file entitled Knots The data listed were gathered on 10 different days from '' ⁄2 -diameter ropes | Analysis of Variance 509 a Test to determine if inserting the day on which the testing was done was necessary Use a significance level of 0.05 b Based on the data gathered by the Cordage Institute, can it be concluded that there is a difference in the average breaking strength of nylon, polyester, and polypropylene? 
c If you concluded that there was a difference in the average breaking strength of the rope material, use Fisher’s LSD approach to determine which material has the highest breaking strength 12-30 When the world’s largest retailer, Wal-Mart, decided to enter the grocery marketplace in a big way with its “Super Stores,” it changed the retail grocery landscape in a major way The other major chains such as Albertsons have struggled to stay competitive In addition, regional discounters such as WINCO in the western United States have made it difficult for the traditional grocery chains Recently, a study was conducted in which a “market basket” of products was selected at random from those items offered in three stores in Boise, Idaho: Wal-Mart, Winco, and Albertsons At issue was whether the mean prices at the three stores are equal or whether there is a difference in prices The sample data are in the data file called Food Price Comparisons Using an alpha level equal to 0.05, test to determine whether the three stores have equal population mean prices If you conclude that there are differences in the mean prices, perform the appropriate posttest to determine which stores have different means END EXERCISES 12-2 Chapter Outcome 12.3 Two-Factor Analysis of Variance with Replication Section 12.2 introduced an ANOVA procedure called the randomized complete block ANOVA This method is used when we are interested in testing whether the means for the populations (levels) for a factor of interest are equal and we want to control for potential variation due to a second factor The second factor is called the blocking factor Consider again the Citizen’s State Bank property appraisal application, in which the bank was interested in determining whether the mean property valuation was the same for three different appraisal companies The company used the same five properties to test each appraisal company in an attempt to reduce any variability that might exist due to the properties involved in the test The properties were the blocks in that example, but we were not really interested in knowing whether the mean appraisal was the same for all properties The single factor of interest was the appraisal companies However, you will encounter many situations in which there are actually two or more factors of interest in the same study In this section, we limit our discussion to situations involving only two factors The technique that is used when we wish to analyze two factors is called two-factor ANOVA with replications (36) 510 CHAPTER 12 | Analysis of Variance Two-Factor ANOVA with Replications BUSINESS APPLICATION Excel and Minitab tutorials Excel and Minitab Tutorial USING SOFTWARE FOR TWO-FACTOR ANOVA FLY HIGH AIRLINES Like other major U.S airlines, Fly High Airlines is concerned because many of its frequent flier program members have accumulated large quantities of free miles.8 The airline worries that at some point in the future there will be a big influx of customers wanting to use their miles and the airline will have difficulty satisfying all the requests at once Thus, Fly High recently conducted an experiment in which each of three methods for redeeming frequent flier miles was offered to a sample of 16 customers Each customer had accumulated more than 100,000 frequent flier miles The customers were equally divided into four age groups The variable of interest was the number of miles redeemed by the customers during the six-week trial Table 12.7 shows the number of miles redeemed for each person in the 
study. These data are also contained in the Fly High file. Method 1 offered cash inducements to use miles, method 2 offered discount vacation options, and method 3 offered access to a discount-shopping program through the Internet. The airline wants to know if the mean number of miles redeemed under the three redemption methods is equal and whether the mean miles redeemed is the same across the four age groups.

A two-factor ANOVA design is the appropriate method in this case because the airline has two factors of interest. Factor A is the redemption offer type, with three levels. Factor B is the age group of each customer, with four levels. As shown in Table 12.7, there are 3 × 4 = 12 cells in the study and four customers in each cell. The measurements are called replications because we get four measurements (miles redeemed) at each combination of redemption offer level (factor A) and age level (factor B).

Two-factor ANOVA follows the same logic as all other ANOVA designs. Each factor of interest introduces variability into the experiment. As was the case in Sections 12.1 and 12.2, we must find estimators for each source of variation. Identifying the appropriate sums of squares and then dividing each by its degrees of freedom does this. As in the one-way ANOVA, the total sum of squares (SST) in two-factor ANOVA can be partitioned. The SST is partitioned into four parts, as follows:

1. One part is due to differences in the levels of factor A (SSA).
2. Another part is due to the levels of factor B (SSB).
3. Another part is due to the interaction between factor A and factor B (SSAB). (We will discuss the concept of interaction between factors later.)
4. The final component making up the total sum of squares is the sum of squares due to the inherent random variation in the data (SSE).

TABLE 12.7  Fly High Airlines Frequent Flier Miles Data

                    Cash Option    Vacation    Shopping
Under 25 years         30,000        25,000      60,000
                            0             0      25,000
                       40,000        25,000      25,000
                            0             0       5,000
25 to 40 years         25,000        50,000      40,000
                       25,000             0           0
                       40,000        25,000       5,000
                       25,000        25,000      50,000
41 to 60 years         25,000        45,000      25,000
                       50,000        25,000      25,000
                       75,000         5,000      30,000
                       25,000        50,000      25,000
Over 60 years          50,000             0           0
                       30,000        25,000      25,000
                       50,000             0

8 Name changed at request of the airline.

[FIGURE 12.9  Two-Factor ANOVA: Partitioning of Total Sums of Squares. SST splits into SSA (factor A), SSB (factor B), SSAB (interaction between A and B), and SSE (inherent variation, or error).]

Figure 12.9 illustrates this partitioning concept. The variations due to each of these components will be estimated using the respective mean squares, obtained by dividing the sums of squares by their degrees of freedom. If the variation accounted for by factor A and factor B is large relative to the error variation, we will tend to conclude that the factor levels have different means. Table 12.8 illustrates the format of the two-factor ANOVA. Three different hypotheses can be tested from the information in this ANOVA table. First, for factor A (redemption options), we have

H0: μA1 = μA2 = μA3
HA: Not all factor A means are equal

TABLE 12.8  Basic Format of the Two-Factor ANOVA Table

Source of Variation    SS      df                MS      F-Ratio
Factor A               SSA     a − 1             MSA     MSA/MSE
Factor B               SSB     b − 1             MSB     MSB/MSE
AB interaction         SSAB    (a − 1)(b − 1)    MSAB    MSAB/MSE
Error                  SSE     nT − ab           MSE
Total                  SST     nT − 1

where:
a = Number of levels of factor A
b = Number of levels of factor B
nT = Total number of observations in all cells
MSA = Mean square factor A = SSA/(a − 1)
MSB = Mean square factor B = SSB/(b − 1)
MSAB = Mean square interaction = SSAB/[(a − 1)(b − 1)]
MSE = Mean square error = SSE/(nT − ab)

For factor B (age levels):

H0: μB1 = μB2 = μB3 = μB4
HA: Not all factor B means are equal

Test to determine whether interaction exists between the two factors:

H0: Factors A and B do not interact to affect the mean response
HA: Factors A and B do interact

Here is what we must assume to be true to use two-factor ANOVA:

Assumptions
1. The population values for each combination of pairwise factor levels are normally distributed.
2. The variances for each population are equal.
3. The samples are independent.
4. The data measurement is interval or ratio level.

Although all the necessary values to complete Table 12.8 could be computed manually using the equations shown in Table 12.9, this would be a time-consuming task for even a small example because the equations for the various sum-of-squares values are quite complicated. Instead, you will want to use software such as Excel or Minitab to perform the two-factor ANOVA.

Interaction Explained

Before we share the ANOVA results for the Fly High Airlines example, a few comments regarding the concept of factor interaction are needed. Consider our example involving the two factors: miles-redemption-offer type and age category of customer. The response variable is the number of miles redeemed in the six weeks after the offer. Suppose one redemption-offer type is really better and results in higher average miles being redeemed. If there is no interaction between age and offer type, then customers of all ages will have uniformly higher average miles redeemed for this offer type compared with the other offer types. If another offer type yields lower average miles, and if there is no interaction, all age groups receiving this offer type will redeem uniformly lower miles on average than with the other offer types. Figure 12.10 illustrates a situation with no interaction between the two factors. However, if interaction exists between the factors, we would see a graph similar to the one shown in Figure 12.11. Interaction would be indicated if one age group redeemed higher average miles than the other age groups with one program but lower average miles than the other age groups on the other mileage-redemption programs. In general, interaction occurs if the differences in the averages of the response variable for the various levels of one factor (say, factor A) are not the same for each level of the other factor (say, factor B). The general idea is that interaction between two factors means that the effect due to one of them is not uniform across all levels of the other factor.

Another example in which potential interaction might exist occurs in plywood manufacturing, where thin layers of wood called veneer are glued together to form plywood. One of the important quality attributes of plywood is its strength. However, plywood is made from different species of wood (pine, fir, hemlock, etc.), and different types of glue are available. If some species of wood work better (stronger plywood) with certain glues, whereas other species work better with different glues, we say that the wood species and the glue type interact.

If interaction is suspected, it should be accounted for by subtracting the interaction term (SSAB) from the total sum-of-squares term in the ANOVA. From a strictly arithmetic point of view, the effect of computing SSAB and subtracting it from SST is that SSE is reduced. Also, if the corresponding variation due to interaction is significant, the variation within the factor levels (error) will be significantly reduced. This can make it easier to detect a difference in the population means if such a difference actually exists.
If so, MSE will most likely be reduced. This will produce a larger F-test statistic, which will more likely lead to correctly rejecting the null hypothesis. Thus, by considering potential interaction, your chances of finding a difference in the factor A and factor B mean values, if such a difference exists, are improved. This will depend, of course, on the relative size of SSAB and the respective changes in the degrees of freedom. We will comment later on the appropriateness of testing the factor hypotheses if interaction is present. Note that to measure the interaction effect, the sample size for each combination of factor A and factor B must be 2 or greater.

TABLE 12.9  Two-Factor ANOVA Equations

Total Sum of Squares:
$SST = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(x_{ijk} - \bar{x})^2$    (12.12)

Sum of Squares Factor A:
$SSA = bn\sum_{i=1}^{a}(\bar{x}_{i} - \bar{x})^2$    (12.13)

Sum of Squares Factor B:
$SSB = an\sum_{j=1}^{b}(\bar{x}_{j} - \bar{x})^2$    (12.14)

Sum of Squares Interaction between Factors A and B:
$SSAB = n\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar{x}_{ij} - \bar{x}_{i} - \bar{x}_{j} + \bar{x})^2$    (12.15)

Sum of Squares Error:
$SSE = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(x_{ijk} - \bar{x}_{ij})^2$    (12.16)

where:
$\bar{x} = \dfrac{\sum_{i}\sum_{j}\sum_{k} x_{ijk}}{abn}$ = Grand mean
$\bar{x}_{i} = \dfrac{\sum_{j}\sum_{k} x_{ijk}}{bn}$ = Mean of each level of factor A
$\bar{x}_{j} = \dfrac{\sum_{i}\sum_{k} x_{ijk}}{an}$ = Mean of each level of factor B
$\bar{x}_{ij} = \dfrac{\sum_{k} x_{ijk}}{n}$ = Mean of each cell
a = Number of levels of factor A
b = Number of levels of factor B
n = Number of replications in each cell

[FIGURE 12.10  Differences between Factor-Level Mean Values: No Interaction. The mean response lines for factor B levels 1 through 4 remain parallel across the levels of factor A.]

[FIGURE 12.11  Differences between Factor-Level Mean Values when Interaction Is Present. The mean response lines for the factor B levels cross or diverge across the levels of factor A.]

Excel and Minitab contain a data analysis tool for performing two-factor ANOVA with replications. They can be used to compute the different sums of squares and complete the ANOVA table. However, Excel requires that the data be organized in a special way, as shown in Figure 12.12.9 (Note that the first row must contain the names for each level of factor A. Also, column 1 contains the factor B level names. These must be in the row corresponding to the first sample item for each factor B level.)

[FIGURE 12.12  Excel 2007 Data Format for Two-Factor ANOVA for Fly High Airlines, with the factor A names across the first row and the factor B names down the first column. Excel 2007 Instruction: 1. Open file: Fly High.xls.]

9 Minitab uses the same data input format for two-factor ANOVA as for randomized block ANOVA (see Section 12.2).
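The equations in Table 12.9 translate directly into code. The following Python sketch computes each sum of squares and the three F-ratios of Table 12.8 for a balanced two-factor layout. It is a minimal illustration assuming NumPy and SciPy; the data array is a randomly generated placeholder with the same 3 × 4 × 4 shape as the Fly High study, not the actual Table 12.7 values.

```python
import numpy as np
from scipy import stats

# data[i, j, k] = k-th replication for factor A level i and factor B level j
# (a = 3 offers, b = 4 age groups, n = 4 replications per cell)
rng = np.random.default_rng(seed=12)
data = rng.choice([0, 5_000, 25_000, 30_000, 40_000, 50_000], size=(3, 4, 4))

a, b, n = data.shape
x_bar = data.mean()                 # grand mean
xA = data.mean(axis=(1, 2))         # means of the factor A levels
xB = data.mean(axis=(0, 2))         # means of the factor B levels
x_cell = data.mean(axis=2)          # cell means

SSA = b * n * ((xA - x_bar) ** 2).sum()                               # Eq. 12.13
SSB = a * n * ((xB - x_bar) ** 2).sum()                               # Eq. 12.14
SSAB = n * ((x_cell - xA[:, None] - xB[None, :] + x_bar) ** 2).sum()  # Eq. 12.15
SSE = ((data - x_cell[:, :, None]) ** 2).sum()                        # Eq. 12.16

dfE = a * b * (n - 1)               # same as nT - ab for a balanced design
MSE = SSE / dfE
for label, SS, df in [("Factor A", SSA, a - 1), ("Factor B", SSB, b - 1),
                      ("AB interaction", SSAB, (a - 1) * (b - 1))]:
    F = (SS / df) / MSE
    print(f"{label:15s} F = {F:7.3f}  p-value = {stats.f.sf(F, df, dfE):.4f}")
```

Because the design is balanced, SSA + SSB + SSAB + SSE reproduces SST from Equation 12.12 exactly, which provides a handy check on the computations.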
The Excel two-factor ANOVA output for this example is actually too big to fit on one screen The top portion of the printout shows summary information for each cell, including FIGURE 12.12 | Excel 2007 Data Format for Two-Factor ANOVA for Fly High Airlines Factor A Names Excel 2007 Instruction: Open file: Fly High.xls Factor B Names 9Minitab uses the same data input format for two-factor ANOVA as for randomized block ANOVA (see Section 12.2) (41) CHAPTER 12 FIGURE 12.13 | Analysis of Variance 515 | Excel 2007 Output (Part 1) for Two-Factor ANOVA for Fly High Airlines Excel 2007 Instructions: Open file: Fly High.xls On the Data tab, click Data Analysis Select ANOVA: Two Factor with Replication Define data range (include factor A and B labels) Specify the number of rows per sample Specify alpha level means and variances (see Figure 12.13) At the bottom of the output (scroll down) is the ANOVA table shown in Figure 12.14a Figure 12.14b shows the Minitab output Excel changes a few labels For example, factor A (the miles redemption options) is now referred to as Columns Factor B (age groups) is referred to as Sample In Figures 12.14a and 12.14b, we see all the information necessary to test whether the three redemption offers (factor A) result in different mean miles redeemed H0: mA1  mA2  mA3 HA: Not all factor A means are equal a  0.05 Both the p-value and F-distribution approaches can be used Because p-value (columns)  0.5614  a  0.05 the null hypothesis H0 is not rejected (Also, F  0.59 F0.05  3.259; the null hypothesis is not rejected.) This means the test data not indicate that a difference exists between the average amounts of mileage redeemed for the three types of offers None seems superior to the others We can also test to determine if age level makes a difference in frequent flier miles redeemed H0: mB1  mB2  mB3  mB4 HA: Not all factor B means are equal a  0.05 (42) 516 CHAPTER 12 FIGURE 12.14A | Analysis of Variance | Excel 2007 Output (Part 2) for Two-Factor ANOVA for Fly High Airlines Excel terminology: Sample  Factor B (age) Columns  Factor A (program) Within  Error F-statistics and p-values for testing the three different hypotheses of interest in the two-factor test In Figure 12.14a, we see that the p-value  0.8796  a  0.05 (Also, F  0.22 F0.05  2.866) Thus, the null hypothesis is not rejected The test data not indicate that customer age significantly influences the average number of frequent flier miles that will be redeemed Finally, we can also test for interaction The null hypothesis is that no interaction exists The alternative is that interaction does exist between the two factors The ANOVA table in Figure 12.14b shows a p-value of 0.939, which is greater than a  0.05 Based on these data, FIGURE 12.14B | Minitab Output for Two-Factor ANOVA for Fly High Airlines F-statistics and p-values for testing the three different hypotheses of interest in the two-factor test Minitab Instructions: Open file: Fly High.MTW Choose Stat  ANOVA  Two-way In Response, enter the data Column (Value) In Row Factor, enter main factor indicator column (Redemption Option) In Column Factor, enter the block indicator column (Age) Click OK (43) CHAPTER 12 | Analysis of Variance 517 interaction between the two factors does not appear to exist This would indicate that the differences in the average mileage redeemed between the various age categories are the same for each redemption-offer type A Caution about Interaction In this example, the sample data indicate that no interaction between factors A and B 
is present Based on the sample data, we were unable to conclude that the three redemption offers resulted in different average frequent flier miles redeemed Finally, we were unable to conclude that a difference in average miles redeemed occurred over the four different age groups The appropriate approach is to begin by testing for interaction If the interaction null hypothesis is not rejected, proceed to test the factor A and factor B hypotheses However, if we conclude that interaction is present between the two factors, hypothesis tests for factors A and B generally should not be performed The reason is that findings of significance for either factor might be due only to interactive effects when the two factors are combined and not to the fact that the levels of the factor differ significantly It is also possible that interactive effects might mask differences between means of one of the factors for at least some of the levels of the other factor If significant interaction is present, the experimenter may conduct a one-way ANOVA to test the levels of one of the factors, for example, factor A, using only one level of the other factor, factor B Thus, when conducting hypothesis tests for a two-factor ANOVA: Test for interaction If interaction is present, conduct a one-way ANOVA to test the levels of one of the factors using only one level of the other factor.10 If no interaction is found, test factor A and factor B 10There are, however, some instances in which the effects of the factors provide important and meaningful information even though interaction is present See D R Cox, Planning of Experiments (New York City: John Wiley and Sons, 1992), pp 107–108 MyStatLab 12-3: Exercises Skill Development 12-31 Consider the following data from a two-factor experiment: Factor A Factor B Level Level Level Level Level 43 49 50 53 25 26 27 31 37 45 46 48 a Determine if there is interaction between factor A and factor B Use the p-value approach and a significance level of 0.05 b Does the average response vary among the levels of factor A? Use the test-statistic approach and a significance level of 0.05 c Determine if there are differences in the average response between the levels of factor B Use the p-value approach and a significance level of 0.05 12-32 Examine the following two-factor analysis of variance table: Source Factor A Factor B AB Interaction Error Total SS df 162.79 262.31 _ 1,298.74 12 84 MS F-Ratio 28.12 a Complete the analysis of variance table b Determine if interaction exists between factor A and factor B Use a  0.05 (44) 518 CHAPTER 12 | Analysis of Variance c Determine if the levels of factor A have equal means Use a significance level of 0.05 d Does the ANOVA table indicate that the levels of factor B have equal means? Use a significance level of 0.05 12-33 Consider the following data for a two-factor experiment: 12-35 A two-factor experiment yielded the following data: Factor A Factor B Level Level Level Level 375 390 402 396 395 390 Level 335 342 336 338 320 331 Level 302 324 485 455 351 346 Factor A Level Level Level Level 33 31 35 30 42 36 21 30 30 23 32 27 30 27 25 21 33 18 Factor B Level a Based on the sample data, factors A and B have significant interaction? State the appropriate null and alternative hypotheses and test using a significance level of 0.05 b Based on these sample data, can you conclude that the levels of factor A have equal means? Test using a significance level of 0.05 c Do the data indicate that the levels of factor B have different means? 
Test using a significance level equal to 0.05.

12-34. Consider the following partially completed two-factor analysis of variance table, which is an outgrowth of a study in which factor A has four levels and factor B has three levels. The number of replications was 11 in each cell.

Source of Variation   SS        df    MS    F-Ratio
Factor A              345.1     _     _     _
Factor B              1,123.2   _     _     _
AB Interaction        256.7     _     _     _
Error                 _         _     _
Total                 1,987.3   _

a. Complete the analysis of variance table.
b. Based on the sample data, can you conclude that the two factors have significant interaction? Test using a significance level equal to 0.05.
c. Based on the sample data, should you conclude that the means for factor A differ across the four levels or the means for factor B differ across the three levels? Discuss.
d. Considering the outcome of part b, determine what can be said concerning the differences of the levels of factors A and B. Use a significance level of 0.10 for any hypothesis tests required. Provide a rationale for your response to this question.

12-35. A two-factor experiment yielded the following data:

                    Factor A
                    Level 1      Level 2      Level 3
Factor B, Level 1   375, 390     402, 396     395, 390
Factor B, Level 2   335, 342     336, 338     320, 331
Factor B, Level 3   302, 324     485, 455     351, 346

a. Determine if there is interaction between factor A and factor B. Use the p-value approach and a significance level of 0.05.
b. Given your findings in part a, determine any significant differences among the response means of the levels of factor A for level 1 of factor B.
c. Repeat part b at levels 2 and 3 of factor B, respectively.

Business Applications

12-36. A PEW Research Center survey concentrated on the issue of weight loss. It investigated how many pounds heavier the respondents were than their perceived ideal weight, and whether these perceptions differed among different regions of the country and between genders. The following data (pounds) reflect the survey results:

             Region
Gender       West      Midwest   South     Northeast
Men          14, 18    13, 16    16, 20    13, 18
Women        15, 16    15, 14    17, 17    17, 13

a. Determine if there is interaction between Region and Gender. Use the p-value approach and a significance level of 0.05.
b. Given your findings in part a, determine any significant differences among the discrepancies between the average existing and desired weights in the regions.
c. Repeat part b for the Gender factor.

12-37. A manufacturing firm produces a single product on three production lines. Because the lines were developed at different points in the firm's history, they use different equipment. The firm is considering changing the layouts of the lines and would like to know what effects different layouts would have on production output. A study was conducted to determine the average output for each line over four randomly selected weeks using each of the three layouts under consideration. The output (in hundreds of units produced) was measured for each line for each of the four weeks for each layout being evaluated. The results of the study are as follows:

           Line 1            Line 2            Line 3
Layout 1   12, 10, 12, 12    12, 14, 10, 11    11, 10, 14, 12
Layout 2   17, 18, 15, 17    16, 15, 16, 17    18, 18, 17, 18
Layout 3   12, 12, 11, 11    10, 11, 11, 11    11, 11, 10, 12

a. Based on the sample data, can the firm conclude that there is an interaction effect between the type of layout and the production line? Conduct the appropriate test at the 0.05 level of significance.
b. At the 0.05 level of significance, can the firm conclude that there is a difference in mean output across the three production lines?
c. At the 0.05 level of significance, can the firm conclude that there is a difference in mean output due to the type of layout used?
12-38. A popular consumer staple was displayed in different locations in the same aisle of a grocery store to determine what, if any, effect different placement might have on its sales. The product was placed at one of three heights on the aisle—low, medium, and high—and at one of three locations in the store—at the front, at the middle, or at the rear of the store. The number of units sold of the product at the various height and location combinations was recorded each week for five weeks. The following results were obtained:

          Front                      Middle                     Rear
Low       125, 143, 150, 138, 149    195, 150, 160, 195, 162    126, 136, 129, 136, 147
Medium    141, 137, 145, 150, 130    186, 161, 157, 165, 194    128, 133, 148, 145, 141
High      129, 141, 148, 130, 137    157, 152, 186, 164, 176    149, 137, 138, 126, 138

a. At the 0.10 level of significance, is there an interaction effect?
b. At the 0.10 level of significance, does the height of the product's placement have an effect on the product's mean sales?
c. At the 0.10 level of significance, does the location in the store have an effect on the product's mean sales?

Computer Database Exercises

12-39. Mt. Jumbo Plywood Company makes plywood for use in furniture production. The first major step in the plywood process is the peeling of the logs into thin layers of veneer. The peeling is done by a lathe that rotates the logs through a knife that peels the log into layers 3/8-inch thick. Ideally, when a log is reduced to a 4-inch core diameter, the lathe releases the core and a new log is loaded onto the lathe. However, a problem called "spinouts" occurs if the lathe kicks out a core that has more than 4 inches left. This wastes wood and costs the company money. Before going to the lathe, the logs are conditioned in a heated, water-filled vat to warm the logs. The company is concerned that improper log conditioning may lead to excessive spinouts. Two factors are believed to affect the core diameter: the vat temperature and the time the logs spend in the vat prior to peeling. The lathe supervisor has recently conducted a test during which logs were peeled at each combination of temperature and time. The sample data for this experiment are in the data file called Mt Jumbo. The data are the core diameters in inches.

a. Based on the sample data, is there an interaction between water temperature and vat hours? Test using a significance level of 0.01. Discuss what interaction would mean in this situation. Use a p-value approach.
b. Based on the sample data, is there a difference in mean core diameter at the three water temperatures? Test using a significance level of 0.01.
c. Do the sample data indicate a difference in mean core diameter across the three vat times analyzed in this study?
Use a significance level of 0.10 and a p-value approach.

12-40. A psychologist is conducting a study to determine whether there are differences between the ability of history majors and mathematics majors to solve various types of puzzles. Five mathematics majors and five history majors were randomly selected from the students at a liberal arts college in Maine. Each student was given five different puzzles to complete: a crossword puzzle, a cryptogram, a logic problem, a maze, and a cross sums. The time in minutes (rounded to the nearest minute) was recorded for each student in the study. If a student could not complete a puzzle in the maximum time allowed, or completed a puzzle incorrectly, then a penalty of 10 minutes was added to his or her time. The results are shown in the file Puzzle.

a. Plot the mean time to complete a puzzle for each puzzle type by major. What conclusion would you reach about the interaction between major and puzzle type?
b. At the 0.05 level of significance, is there an interaction effect?
c. If interaction is present, conduct a one-way ANOVA to test whether the mean time to complete a puzzle for history majors depends on the type of puzzle. Does the mean time to complete a puzzle for mathematics majors depend on the type of puzzle? Conduct the one-way ANOVA tests at a level of significance of 0.05.

12-41. The Iams Company sells Eukanuba and Iams premium dog and cat foods (dry and canned) in 70 countries. Iams makes dry dog and cat food at plants in Lewisburg, Ohio; Aurora, Nebraska; Henderson, North Carolina; Leipsic, Ohio; and Coevorden, The Netherlands. Its Eukanuba brand dry dog foods come in five formulas. One of the ingredients is of particular importance: crude fat. To discover if there is a difference in the average percent of crude fat among the five formulas and among the production sites, the sample data found in the file entitled Eukanuba were obtained.

a. Determine if there is interaction between the Eukanuba formulas and the plant sites where they are produced. Use the p-value approach and a significance level of 0.025.
b. Given your findings in part a, determine if there is a difference in the average percentage of crude fat in the Eukanuba formulas. Use a test-statistic approach with a significance level of 0.025.
c. Repeat part b for the plant sites in which the formulas are produced.
d. One important finding will be whether the average percent of crude fat for the "Reduced Fat" formula is equal to the advertised 9%. Conduct a relevant hypothesis test to determine this using a significance level of 0.05.

12-42. The amount of sodium in food has been of increasing concern due to its health implications. Beers from various producers have been analyzed for their sodium content. The file entitled Sodium contains the amount of sodium (mg) discovered in 12 fluid ounces of beer produced by the four major producers: Anheuser-Busch Inc., Miller Brewing Co., Coors Brewing Co., and Pabst Brewing Co. The types of beer (ales, lagers, and specialty beers) were also scrutinized in the analysis.

a. Determine if there is interaction between the producer and the type of beer. Use a significance level of 0.05.
b. Given your findings in part a, determine if there is a difference in the average amount of sodium in 12 ounces of beer among the producers of the beer. Use a significance level of 0.05.
c. Repeat part b for the types of beer. Use a significance level of 0.05.

END EXERCISES 12-3

Visual Summary
Chapter 12: A group of procedures known as analysis of variance (ANOVA) was introduced in this chapter. The procedures presented here represent a wide range of techniques used to determine whether three or more populations have equal means. Depending upon the experimental design employed, there are different hypothesis tests that must be performed. The one-way design is used to test whether three or more populations have equal mean values when the samples from the populations are considered to be independent. If an outside source of variation is present, the randomized complete block design is used. If there are two factors of interest and we wish to test to see whether the levels of each separate factor have equal means, then a two-factor design with replications is used. Regardless of which method is used, if the null hypothesis of equal means is rejected, methods presented in this chapter enable you to determine which pairs of populations have different means. Analysis of variance is actually an array of statistical techniques used to test hypotheses related to these (and many other) experimental designs. By completing this chapter, you have been introduced to some of the most popular ANOVA techniques.

12.1 One-Way Analysis of Variance (pg. 476–497)

Summary There are often circumstances in which independent samples are obtained from two or more levels of a single factor to determine if the levels have equal means. The experimental design that produces the data for this experiment is referred to as a completely randomized design, and the appropriate statistical tool for conducting the related hypothesis test is analysis of variance. Because this procedure addresses an experiment with only one factor, it is called a one-way analysis of variance. The data produced by a completely randomized design will not all be the same value; that is, there is variation in the data, referred to as the total variation. Each level's data exhibit dispersion as well, called the within-sample variation, and the dispersion between the factor levels is designated the between-sample variation. The ratio between estimators of these two variances forms the test statistic used to detect differences in the levels' means. If the null hypothesis of equal means is rejected, the Tukey-Kramer procedure can be used to determine which pairs of populations have different means.

Outcome 1. Understand the basic logic of analysis of variance.
Outcome 2. Perform a hypothesis test for a single-factor design using analysis of variance manually and with the aid of Excel or Minitab software.
Outcome 3. Conduct and interpret post-analysis of variance pairwise comparisons procedures.
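As a companion to the Excel and Minitab procedures described in Section 12.1, the following is a minimal sketch of the same one-way ANOVA and Tukey-Kramer follow-up in Python. The three sample arrays are hypothetical, and the scipy and statsmodels packages are assumed to be installed.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical samples from three levels of a single factor
level_1 = np.array([21.0, 24.5, 23.1, 22.8, 25.0])
level_2 = np.array([28.2, 27.4, 29.9, 26.5, 28.8])
level_3 = np.array([22.4, 23.9, 21.7, 24.8, 23.3])

# One-way ANOVA: the F-ratio is MSB / MSW
f_stat, p_value = stats.f_oneway(level_1, level_2, level_3)
print(f"F = {f_stat:.3f}, p-value = {p_value:.4f}")

# If H0 is rejected, follow up with Tukey-Kramer pairwise comparisons
values = np.concatenate([level_1, level_2, level_3])
groups = np.repeat(["level 1", "level 2", "level 3"], 5)
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```

The pairwise_tukeyhsd output flags each pair of levels whose sample means differ by more than the critical range, which parallels the manual Tukey-Kramer comparison described in the chapter.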
12.2 Randomized Complete Block Analysis of Variance (pg. 497–509)

Summary Section 12.1 addressed procedures for determining the equality of three or more population means of the levels of a single factor. In that case, all other unknown sources of variation are addressed by the use of randomization. However, there are situations in which an additional known factor with at least two levels is impinging on the response variable of interest. A technique called blocking is used in such cases to eliminate the effects of the levels of the additional known factor on the analysis of variance. As was the case in Section 12.1, a multiple comparisons procedure, known as Fisher's least significant difference, can be used to determine any difference among the population means of a randomized block ANOVA design.

Outcome 3. Conduct and interpret post-analysis of variance pairwise comparisons procedures.
Outcome 4. Recognize when randomized block analysis of variance is useful and be able to perform analysis of variance on a randomized block design.

12.3 Two-Factor Analysis of Variance with Replication (pg. 509–520)

Summary Two-factor ANOVA follows the same logic as the one-way and randomized complete block ANOVA designs. In those two procedures, there is only one factor of interest; in the two-factor ANOVA, there are two factors of interest. Each factor of interest introduces variability into the experiment. There are also circumstances in which the presence of a level of one factor affects the relationship between the response variable and the levels of the other factor. This effect is called interaction and, if present, is another source of variation. As was the case in Sections 12.1 and 12.2, we must find estimators for each source of variation; identifying the appropriate sums of squares and then dividing each by its degrees of freedom does this. If the variation accounted for by factor A, factor B, and interaction is large relative to the error variation, we will tend to conclude that the factor levels have different means. The technique used when we wish to analyze two factors as described above is called two-factor ANOVA with replications.

Outcome 5. Perform analysis of variance on a two-factor design of experiments with replications using Excel or Minitab and interpret the output.

Conclusion

Chapter 12 has illustrated that there are many instances in business in which we are interested in testing to determine whether three or more populations have equal means. The technique for performing such tests is called analysis of variance. If the sample means tend to be substantially different, then the hypothesis of equal means is rejected. The most elementary ANOVA experimental design is the one-way design, which is used to test whether three or more populations have equal mean values when the samples from the populations are considered to be independent. If we need to control for an outside source of variation (analogous to forming paired samples in Chapter 10), we can use the randomized complete block design. If there are two factors of interest and we wish to test to see whether the levels of each separate factor have equal means, then a two-factor design with replications is used.

Equations

(12.1) Partitioned Sum of Squares (pg. 478)
$SST = SSB + SSW$

(12.2) Hartley's F-Test Statistic (pg. 480)
$F_{max} = \dfrac{s_{max}^2}{s_{min}^2}$

(12.3) Total Sum of Squares (pg. 481)
$SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{\bar{x}})^2$

(12.4) Sum of Squares Between (pg. 482)
$SSB = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{\bar{x}})^2$

(12.5) Sum of Squares Within (pg. 482)
$SSW = SST - SSB$

(12.6) Sum of Squares Within (pg. 482)
$SSW = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2$

(12.7) Tukey-Kramer Critical Range (pg. 489)
$\text{Critical range} = q_{1-\alpha}\sqrt{\dfrac{MSW}{2}\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}$

(12.8) Sum of Squares Partitioning for Randomized Complete Block Design (pg. 499)
$SST = SSB + SSBL + SSW$

(12.9) Sum of Squares for Blocking (pg. 499)
$SSBL = \sum_{j=1}^{b} k(\bar{x}_j - \bar{\bar{x}})^2$

(12.10) Sum of Squares Within (pg. 499)
$SSW = SST - (SSB + SSBL)$

(12.11) Fisher's Least Significant Difference (pg. 505)
$LSD = t_{\alpha/2}\sqrt{MSW\dfrac{2}{b}}$

(12.12) Total Sum of Squares (pg. 513)
$SST = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(x_{ijk} - \bar{\bar{x}})^2$

(12.13) Sum of Squares Factor A (pg. 513)
$SSA = bn\sum_{i=1}^{a}(\bar{x}_{i.} - \bar{\bar{x}})^2$

(12.14) Sum of Squares Factor B (pg. 513)
$SSB = an\sum_{j=1}^{b}(\bar{x}_{.j} - \bar{\bar{x}})^2$

(12.15) Sum of Squares Interaction between Factors A and B (pg. 513)
$SSAB = n\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar{x}_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{\bar{x}})^2$

(12.16) Sum of Squares Error (pg. 513)
$SSE = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(x_{ijk} - \bar{x}_{ij})^2$

Key Terms
Balanced design (pg. 476)
Between-sample variation (pg. 477)
Completely randomized design (pg. 476)
Experiment-wide error rate (pg. 488)
Factor (pg. 476)
Levels (pg. 476)
One-way analysis of variance (pg. 476)
Total variation (pg. 477)
Within-sample variation (pg. 477)
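The sums of squares in Equations 12.1 through 12.6 are easy to verify numerically. The short sketch below computes SST, SSB, and SSW for three hypothetical samples in Python and confirms the partition SST = SSB + SSW (numpy assumed available).

```python
import numpy as np

# Hypothetical samples from k = 3 levels of a single factor
samples = [np.array([21.0, 24.5, 23.1, 22.8]),
           np.array([28.2, 27.4, 29.9, 26.5]),
           np.array([22.4, 23.9, 21.7, 24.8])]

all_values = np.concatenate(samples)
grand_mean = all_values.mean()

# Equation 12.3: total sum of squares
sst = ((all_values - grand_mean) ** 2).sum()

# Equation 12.4: between-sample sum of squares
ssb = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)

# Equation 12.6: within-sample sum of squares
ssw = sum(((s - s.mean()) ** 2).sum() for s in samples)

print(f"SST = {sst:.3f}, SSB = {ssb:.3f}, SSW = {ssw:.3f}")
print(f"Partition holds: {np.isclose(sst, ssb + ssw)}")   # Equation 12.1
```

Dividing SSB by k − 1 and SSW by nT − k yields MSB and MSW, whose ratio is the one-way ANOVA F-statistic used throughout the chapter.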
Chapter Exercises (MyStatLab)

Conceptual Questions

12-43. A one-way analysis of variance has just been performed. The conclusion reached is that the null hypothesis stating the population means are equal has not been rejected. What would you expect the Tukey-Kramer procedure for multiple comparisons to show if it were performed for all pairwise comparisons? Discuss.

12-44. In journals related to your major, locate two articles where tests of three or more population means were important. Discuss the issue being addressed, how the data were collected, the results of the statistical test, and any conclusions drawn based on the analysis.

12-45. Discuss why in some circumstances it is appropriate to use the randomized complete block design. Give an example other than those discussed in the text where this design could be used.

12-46. A two-way analysis of variance experiment is to be conducted to examine CEO salaries ($K) as a function of the number of years the CEO has been with the company and the size of the company's sales. The years spent with the company are categorized into 0–3, 4–6, 7–9, and more than 9 years. The size of the company is categorized using sales ($million) per year into three categories: 0–50, 51–100, and more than 100.

a. Describe the factors associated with this experiment.
b. List the levels of each of the factors identified in part a.
c. List the treatment combinations of the experiment.
d. Indicate the components of the ANOVA table that will be used to explain the variation in the CEOs' salaries.
e. Determine the degrees of freedom for each of the components in the ANOVA if two replications are used.

12-47. In any of the multiple comparison techniques (Tukey-Kramer, LSD), the estimate of the within-sample variance uses data from the entire experiment. However, if one were to do a two-sample t-test to determine if there were a difference between any two means, the estimate of the population variances would only include data from the two specific samples under consideration. Explain this seeming discrepancy.

Business Applications

12-48. The development of the Internet has made many things possible, in addition to downloading music. In particular, it allows an increasing number of people to telecommute, or work from home. Although this has many advantages, it has required some companies to provide employees with the necessary equipment, which has made your job as office manager more difficult. Your company provides computers, printers, and Internet service to a number of engineers and programmers, and although the cost of hardware has decreased, the cost of supplies, in this case printer cartridges, has not. Because of the cost of name-brand printer replacement cartridges, several companies have entered the secondary market. You are currently considering offers from four companies. The prices are equivalent, so you will make your decision based on length of service, specifically the number of pages printed. You have given samples of four cartridges to 16 programmers and engineers and have received the following values:

Supplier A    Supplier B    Supplier C    Supplier D
424           521           650           422
650           725           826           722
521           601           590           522
323           383           487           521

a. Using a significance level equal to 0.01, what conclusion should you reach about the four manufacturers' printer cartridges? Explain.
b. If the test conducted in part a reveals that the null hypothesis should be rejected, which supplier should be used? Is there one or more you can eliminate based on these data? Use the appropriate test for multiple comparisons. Discuss.

12-49. The W. Atlee Burpee Co. was founded in Philadelphia in 1876 by an 18-year-old with a passion for plants and animals and a mother willing to lend him $1,000 of "seed money" to get started in business. Today, it is owned by George Ball Jr. One of Burpee's most demanded seeds is corn, and Burpee continues to increase production to meet the growing demand. To this end, an experiment such as the one presented here is used to determine the combination of fertilizer and seed type that produces the largest number of kernels per ear.

          Seed A       Seed B         Seed C
Fert 1    807, 800     1,010, 912     1,294, 1,097
Fert 2    995, 909     1,098, 987     1,286, 1,099
Fert 3    894, 907     1,000, 801     1,298, 1,099
Fert 4    903, 904     1,008, 912     1,199, 1,201

a. Determine if there is interaction between the type of seed and the type of fertilizer. Use a significance level of 0.05.
b. Given your findings in part a, determine if there is a difference in the average number of kernels per ear among the seeds.
c. Repeat part b for the types of fertilizer. Use a significance level of 0.05.

12-50. Recent news stories have highlighted errors national companies such as H & R Block have made in preparing taxes. However, many people rely on local accountants to handle their tax work. A local television station, which prides itself on doing investigative reporting, decided to determine whether similar preparation problems occur in its market area. The station selected eight people to have their taxes figured at each of three accounting offices in its market area. The following data show the tax bills (in dollars) as figured by each of the three accounting offices:

Return    Office 1     Office 2     Office 3
1         4,376.20     5,100.10     4,988.03
2         5,678.45     6,234.23     5,489.23
3         2,341.78     2,242.60     2,121.90
4         9,875.33     10,300.30    9,845.60
5         7,650.20     8,002.90     7,590.88
6         1,324.80     1,450.90     1,356.89
7         2,345.90     2,356.90     2,345.90
8         15,468.75    16,080.70    15,376.70

a. Discuss why this test was conducted as a randomized block design. Why did the station think it important to have all three offices do the returns for each of the eight people?
b. Test to determine whether blocking was necessary in this situation. Use a significance level of 0.01. State the null and alternative hypotheses.
c. Based on the sample data, can the station report statistical evidence that there is a difference in the mean taxes due on tax returns?
Test using a significance level of 0.01. State the appropriate null and alternative hypotheses.
d. Referring to part c, if you did conclude that a difference exists, use the appropriate test to determine which office has the highest mean tax due.

12-51. A senior analyst working for Ameritrade has reviewed purchases his customers have made over the last six months. He has categorized the mutual funds purchased into eight categories: (1) Aggressive Growth (AG), (2) Growth (G), (3) Growth-Income (G-I), (4) Income Funds (IF), (5) International (I), (6) Asset Allocation (AA), (7) Precious Metal (PM), and (8) Bond (B). The percentage gains accrued by randomly selected customers in each group are as follows:

Mutual Fund   AG G 12 -2 G-I IF I 6 6 14 13 10 AA -3 7 7 PM B -1

a. Develop the appropriate ANOVA table to determine if there is a difference in the average percentage gains accrued by his customers among the mutual fund types. Use a significance level of 0.05.
b. Use the Tukey-Kramer procedure to determine which mutual fund type has the highest average percentage gain. Use an experiment-wide error rate of 0.05.

12-52. Anyone who has gone into a supermarket or discount store has walked by displays at the end of aisles. These are referred to as endcaps and are often prized because they increase the visibility of products. A manufacturer of tortilla chips has recently developed a new product, a blue corn tortilla chip. The manufacturer has arranged with a regional supermarket chain to display the chips on endcaps at four different locations in stores that have had similar weekly sales in snack foods. The dollar volumes of sales for the last six weeks in the four stores are as follows:

          Store
Week      1         2         3         4
1         $1,430    $2,200    $1,140    $  880
2         $1,670    $  990    $  980    $1,400
3         $1,200    $1,300    $1,300    $  550
4         $1,780    $2,890    $1,500    $1,470
5         $2,400    $1,600    $2,300    $2,680
6         $2,000    $1,900    $2,540    $1,900

a. If the assumptions of a one-way ANOVA design are satisfied in this case, what should be concluded about the average sales at the four stores? Use a significance level of 0.05.
b. Discuss whether you think the assumptions of a one-way ANOVA are satisfied in this case and indicate why or why not. If they are not, what design is appropriate? Discuss.
c. Perform a randomized block analysis of variance test using a significance level of 0.05 to determine whether the mean sales for the four stores are different.
d. Comment on any differences between the means in parts b and c.
e. Suppose blocking was necessary and the researcher chooses not to use blocks. Discuss what impact this could have on the results of the analysis of variance.
f. Use Fisher's least significant difference procedure to determine which, if any, stores have different true average weekly sales.

Computer Database Exercises

12-53. A USA Today editorial addressed the growth of compensation for corporate CEOs. As an example, quoting a study made by BusinessWeek, USA Today indicated that the pay packages for CEOs increased almost sevenfold on average from 1994 to 2004. The file entitled CEODough contains the salaries of CEOs in 1994 and in 2004, adjusted for inflation.

a. Use analysis of variance to determine if there is a difference in the CEOs' average salaries between 1994 and 2004, adjusted for inflation.
b. Determine if there is a difference in the CEOs' average salaries between 1994 and 2004 using the two-sample t-test procedure.
c. What is the relationship between the two test statistics and the critical values, respectively, that were used in parts a and b?
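Exercise 12-53 part c points at a useful fact: when only two groups are compared, the one-way ANOVA F-statistic equals the square of the pooled-variance t-statistic, and the two tests produce identical p-values. The sketch below demonstrates this numerically with two small hypothetical samples (scipy assumed available; the values are illustrative, not from the CEODough file).

```python
import numpy as np
from scipy import stats

# Two hypothetical salary samples (e.g., 1994 vs. 2004, in $K)
group_1994 = np.array([1200.0, 950.0, 1430.0, 1100.0, 1280.0])
group_2004 = np.array([2100.0, 1850.0, 2450.0, 1990.0, 2300.0])

# Pooled-variance two-sample t-test
t_stat, p_t = stats.ttest_ind(group_1994, group_2004, equal_var=True)

# One-way ANOVA on the same two groups
f_stat, p_f = stats.f_oneway(group_1994, group_2004)

print(f"t = {t_stat:.4f}, t^2 = {t_stat**2:.4f}")
print(f"F = {f_stat:.4f}")                    # equals t^2
print(f"p-values: {p_t:.6f} and {p_f:.6f}")   # identical
```

The same relationship holds for the critical values: the upper-tail F critical value with (1, nT − 2) degrees of freedom is the square of the two-tailed t critical value with nT − 2 degrees of freedom.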
12-54. The use of high-technology materials and design has dramatically impacted the game of golf. Not only are the professionals hitting the balls farther, but so too are the average players. This has led to a rush to design new and better equipment. Gordon Manufacturing produces golf balls. Recently, Gordon developed a golf ball made from a space-age material; this new golf ball promises greater distance off the tee. To test Gordon Manufacturing's claim, a test was set up to measure the average distance of four different golf balls (the New Gordon, Competitor 1, Competitor 2, Competitor 3) hit by a driving machine using three different types of drivers (Driver 1, Driver 2, Driver 3). The results (rounded to the nearest yard) are listed in the data file called Gordon. Conduct a test to determine if there are significant differences due to type of golf ball.

a. Does there appear to be interaction between type of golf ball and type of driver?
b. Conduct a test to determine if there is a significant effect due to the type of driver used.
c. How could the results of the tests be used by Gordon Manufacturing?

12-55. Maynards, a regional home improvement store chain located in the Intermountain West, is considering upgrading to a new series of scanning systems for its automatic checkout lanes. Although scanners can save customers a great deal of time, scanners will sometimes misread an item's price code. Before investing in one of three new systems, Maynards would like to determine if there is a difference in scanner accuracy. To investigate possible differences in scanner accuracy, 30 shopping carts were randomly selected from customers at the Golden, Colorado, store. The 30 carts differed from each other in both the number and types of items each contained. The items in each cart were then scanned by the three new scanning systems under consideration, as well as by the current scanner used in all stores, at a test facility specially designed for this analysis. Each item was also checked manually, and a count was kept of the number of scanning errors made by each scanner for each basket. Each of the scannings was repeated 30 times, and the average number of scanning errors was determined. The sample data are in the data file called Maynards.

a. What type of experimental design did Maynards use to test for differences among scanning systems? Why was this type of design selected?
b. State the primary hypotheses of interest for this test.
c. At the 0.01 level of significance, is there a difference in the average number of errors among the four different scanners?
d. (1) Is there a difference in the average number of errors by cart? (2) Was Maynards correct in blocking by cart?
e. If you determined that there is a difference in the average number of errors among the four different scanners, identify where those differences exist.
f. Do you think that Maynards should upgrade from its existing scanner to Scanner A, Scanner B, or Scanner C? What other factors may it want to consider before making a decision?
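A blocked design like the Maynards study (every treatment applied within every block) can also be analyzed outside Excel or Minitab. The sketch below runs a randomized block ANOVA in Python on a small hypothetical data set; the column names and values are illustrative, not taken from the Maynards file, and the pandas and statsmodels packages are assumed available.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical error counts: 4 scanners (treatments) x 5 carts (blocks)
data = pd.DataFrame({
    "errors":  [3, 5, 2, 6, 4,  4, 6, 3, 7, 5,
                2, 4, 1, 5, 3,  3, 5, 2, 6, 4],
    "scanner": ["Current"] * 5 + ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "cart":    ["C1", "C2", "C3", "C4", "C5"] * 4,
})

# Randomized block model: treatment effect plus block effect,
# with no interaction term (one observation per treatment/block cell)
model = ols("errors ~ C(scanner) + C(cart)", data=data).fit()
print(anova_lm(model, typ=2))
# The C(scanner) row tests the primary hypothesis of equal scanner means;
# the C(cart) row tests whether blocking on carts was effective.
```

This mirrors the partition in Equation 12.8: the blocking sum of squares is removed from what would otherwise be within-sample variation, sharpening the test on the treatment factor.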
12-56. PhoneEx provides call center services for many different companies. A large increase in its business has made it necessary to establish a new call center. Four cities are being considered—Little Rock, Wichita, Tulsa, and Memphis. The new center will employ approximately 1,500 workers, and PhoneEx will transfer 75 people from its Omaha center to the new location. One concern in the choice of where to locate the new center is the cost of housing for the employees who will be moving there. To help determine whether significant housing cost differences exist across the competing sites, PhoneEx has asked a real estate broker in each city to randomly select a list of 33 homes between and 15 years old and ranging in size between 1,975 and 2,235 square feet. The prices (in dollars) that were recorded for each city are contained in the file called PhoneEx.

a. At the 0.05 level of significance, is there evidence to conclude that the average price of houses between and 15 years old and ranging in size between 1,975 and 2,235 square feet is not the same in the four cities? Use the p-value approach.
b. At the 0.05 level of significance, is there a difference in average housing price between Wichita and Little Rock? Between Little Rock and Tulsa? Between Tulsa and Memphis?
c. Determine the sample size required to estimate the average housing price in Wichita to within $500 with a 95% confidence level. Assume the required parameters' estimates are sufficient for this calculation.

12-57. An investigation into the effects of various levels of nitrogen (M. L. Vitosh, Tri-State Fertilizer Recommendations for Corn, Soybeans, Wheat and Alfalfa, Bulletin E-2567) at Ohio State University addressed the pounds per acre of nitrogen required to produce certain yield levels of corn on fields that had previously been planted with other crops. The file entitled Nitrogen indicates the amount of nitrogen required to produce given quantities of corn planted.

a. Determine if there is interaction between the yield levels of corn and the crop that had been previously planted in the field. Use a significance level of 0.05.
b. Given your findings in part a, determine any significant differences among the average pounds per acre of nitrogen required to produce yield levels of corn on fields that had been planted with corn as the previous crop.
c. Repeat part b for soybeans and grass sod, respectively.

Video Case: Drive-Thru Service Times @ McDonald's

When you're on the go and looking for a quick meal, where do you go?
If you’re like millions of people every day, you make a stop at McDonald’s Known as “quick service restaurants” in the industry (not “fast food”), companies such as McDonald’s invest heavily to determine the most efficient and effective ways to provide fast, high-quality service in all phases of their business Drive-thru operations play a vital role It’s not surprising that attention is focused on the drive-thru process After all, over 60% of individual restaurant revenues in the United States come from the drive-thru experience Yet, understanding the process is more complex than just counting cars Marla King, professor at the company’s international training center, Hamburger University, got her start 25 years ago working at a McDonald’s drive-thru She now coaches new restaurant owners and managers “Our stated drive-thru service time is 90 seconds or less We train every manager and team member to understand that a quality customer experience at the drive-thru depends on them,” says Marla Some of the factors that affect a customer’s ability to complete their purchases within 90 seconds include restaurant staffing, equipment layout in the restaurant, training, efficiency of the grill team, and frequency of customer arrivals, to name a few Customerorder patterns also play a role Some customers will just order drinks, whereas others seem to need enough food to feed an entire soccer team And then there are the special orders Obviously, there is plenty of room for variability here Yet, that doesn’t stop the company from using statistical techniques to better understand the drive-thru action In particular, McDonald’s utilizes statistical techniques to display data and to help transform the data into useful information For restaurant managers to achieve the goal in their own restaurants, they need training in proper restaurant and drive-thru operations Hamburger University, McDonald’s training center located near Chicago, satisfies that need In the mock-up restaurant service lab, managers go thru a “before and after” training scenario In the “before” scenario, they run the restaurant for 30 minutes as if they were back in their home restaurants Managers in the training class are assigned to be crew, customers, drive-thru cars, special needs guests (such as hearing impaired, indecisive, or clumsy), or observers Statistical data about the operations, revenues, and service times are collected and analyzed Without the right training, the restaurant’s operations usually start breaking down after 10–15 minutes After debriefing and analyzing the data collected, the managers make suggestions for adjustments and head back to the service lab to try again This time, the results usually come in well within standards “When presented with the quantitative results, managers are pretty quick to make the connections between better operations, higher revenues, and happier customers,” Marla states When managers return to their respective restaurants, the training results and techniques are shared with staff charged with implementing the ideas locally The results of the training eventually are measured when McDonald’s conducts a restaurant operations improvement process study, or ROIP The goal is simple: improved operations When the ROIP review is completed, statistical analyses are performed and managers are given their results Depending on the results, decisions might be made that require additional financial resources, building construction, staff training, or reconfiguring layouts Yet one thing is clear: 
Statistics drive the decisions behind McDonald's drive-thru service operations.

Discussion Questions:

1. After returning from the training session at Hamburger University, a McDonald's store owner selected a random sample of 362 drive-thru customers and carefully measured the time it took from when a customer entered the McDonald's property until the customer had received the order at the drive-thru window. These data are in the file called McDonald's Drive-Thru Waiting Times. Note: the owner selected some customers during the breakfast period, others during lunch, and others during dinner. Test, using an alpha level equal to 0.05, to determine whether the mean drive-thru time is equal during the three dining periods (breakfast, lunch, and dinner).

2. Referring to question 1, write a short report discussing the results of the test conducted. Make sure to include a discussion of any ramifications the results of this test might have regarding the efforts the manager will need to take to reduce drive-thru times.

Case 12.1 Agency for New Americans

Denise Walker collapsed at home after her first outing as a volunteer for the Agency for New Americans in Raleigh, North Carolina. Denise had a fairly good career going with various federal agencies after graduating with a degree in accounting. She decided to stay at home after she and her husband started a family. Now that their youngest is in high school, Denise decided she needed something more to do than manage the household. She decided on volunteer work and joined the Agency for New Americans.

The purpose of the Agency for New Americans is to help new arrivals become comfortable with the basic activities necessary to function in American society. One of the major activities, of course, is shopping for food and other necessities. Denise had just returned from her first outing to a supermarket with a recently arrived Somali Bantu family. It was their first time also, and they were astonished by both the variety and the selection. Since the family was on a very limited budget, Denise spent much time talking about comparison shopping, and for someone working with a new currency this was hard. She didn't even want to tell them the store they were in was only one of four possible chains within a mile of their apartment. Denise realized the store she started with would be the one they would automatically return to when on their own.

Next week Denise and the family were scheduled to go to a discount store. Denise typically goes to a national chain close to her house but hasn't felt the need to be primarily a value shopper for some time. Since she feels the Somali family will automatically return to the store she picks, and she has her choice of two national chains and one regional chain, she decides not to automatically take them to "her" store. Because each store advertises low prices and meeting all competitors' prices, she also doesn't want to base her decision on what she hears in commercials. Instead, she picks a random selection of items and finds the prices in each store. The items and prices are shown in the file New Americans. In looking at the data, Denise sees there are differences in some prices but wonders if there is any way to determine which store to take the family to.

Required Tasks:

1. Identify the major issue in the case.
2. Identify the appropriate statistical test that could be conducted to address the case's major issue.
3. Explain why you selected the test you chose in (2).
4. State the appropriate null and alternative hypotheses for the
statistical test you identified.
5. Perform the statistical test(s). Be sure to state your conclusion(s).
6. If possible, identify the stores that Denise should recommend to the family.
7. Summarize your analysis and findings in a short report.

Case 12.2 McLaughlin Salmon Works

John McLaughlin's father correctly predicted that a combination of declining wild populations of salmon and an increase in demand for fish in general would create a growing market for salmon grown in "fish farms." Over recent years, an increasing percentage of salmon, trout, and catfish, for example, have come from commercial operations. At first, operating a fish farm consisted of finding an appropriate location, installing the pens, putting in smelt, and feeding the fish until they grew to the appropriate size. However, as the number of competitors increased, successful operation required taking a more scientific approach to raising fish.

Over the past year, John has been looking at the relationship between food intake and weight gain. Since food is a major cost of the operation, the higher the weight gain for a given amount of food, the more cost-effective the food. John's most recent effort involved trying to determine the relationship between four component mixes and three size progressions for the food pellets. Since smaller fish require smaller food pellets but larger pellets contain more food, one question John was addressing is at what rate to move from smaller to larger pellets. Also, since fish are harder to individually identify than livestock, the study involved constructing small individual pens and giving the fish in each pen a different combination of pellet mix and size progression. This involved a reasonable cost but a major commitment of time, and John's father wasn't sure the cost and time were justified.

John had just gathered his first set of data and has started to analyze it. The data are shown in the file called McLaughlin Salmon Works. John is not only interested in whether one component mix, or one pellet size progression, seemed to lead to maximum weight gain but would really like to find one combination of mix and size progression that proved to be superior.

Required Tasks:

1. Identify the major issues in the case.
2. Identify an appropriate statistical analysis to perform.
3. Explain why you selected the test you chose in (2).
4. State the appropriate null and alternative hypotheses for the statistical test you identified.
5. Perform the statistical test(s). Be sure to state your conclusion(s).
6. Is there one combination of mix and size progression that is superior to the others?
7. Summarize your analysis and findings in a short report.
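For a case like this, a quick visual check for interaction before the formal F-test is often helpful: roughly parallel lines suggest no interaction, while crossing lines suggest the factors interact. The sketch below draws such an interaction plot in Python using hypothetical weight-gain data; the column names and values are illustrative, not taken from the McLaughlin file, and the pandas, matplotlib, and statsmodels packages are assumed available.

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot

# Hypothetical mean weight gains for 4 component mixes x 3 size progressions
data = pd.DataFrame({
    "mix":         [1, 2, 3, 4] * 3,
    "progression": ["Slow"] * 4 + ["Medium"] * 4 + ["Fast"] * 4,
    "gain":        [2.1, 2.4, 2.2, 2.0,
                    2.6, 2.9, 2.5, 2.3,
                    2.4, 3.1, 2.6, 2.2],
})

# One line per size progression, plotted across the component mixes
fig = interaction_plot(x=data["mix"], trace=data["progression"],
                       response=data["gain"], ms=8)
plt.xlabel("Component mix")
plt.ylabel("Mean weight gain (lb)")
plt.show()
```

If the plot (and the formal test) show significant interaction, the case's question of a single superior mix/progression combination is answered by comparing cell means rather than the main-effect means, as discussed in Section 12.3.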
Case 12.3 NW Pulp and Paper

Cassie Coughlin had less than a week to finish her presentation to the CEO of NW Pulp and Paper. Cassie had inherited a project started by her predecessor as head of the new-product development section of the company, and by the nature of the business, dealing with wood products, projects tended to have long lifetimes. Her predecessor had successfully predicted the consequences of a series of events that, in fact, had occurred:

1. The western United States, where NW Pulp and Paper had its operations, was running out of water, caused by a combination of population growth and increased irrigation. The situation had currently been made worse by several years of drought. This meant many farming operations were becoming unprofitable.
2. The amount of timber harvesting from national forests continued to be limited.
3. At least some of the land that had been irrigated would become less productive due to alkaline deposits caused by taking water from rivers.

Based on these three factors, Cassie's predecessor had convinced the company to purchase a 2,000-acre farm that had four types of soil commonly found in the West and also had senior water rights. Water rights in the West are given by the state, and senior rights are those that will continue to be able to get irrigation water after those with junior rights are cut off. His idea had been to plant three types of genetically modified poplar trees (these are generally fast-growing trees) on the four types of soil and assess growth rates. His contention was that it might be economically feasible for the company to purchase more farms that were becoming less productive and to become self-sufficient in its supply of raw material for making paper.

The project had been started 15 years ago, and since her predecessor had since retired, Cassie was now in charge of it. The primary focus of the 15-year review was tree growth. Growth in this case did not refer to height but wood volume; volume is assessed by measuring the girth of the tree three feet above the ground. She had just received data from the foresters who had been managing the experiment. They had taken a random sample of measurements from each of the tree types. The data are shown in the file NW Pulp and Paper. Cassie knew the CEO would at least be interested in whether one type of tree was generally superior and whether there was some unique combination of soil type and tree type that stood out.

Case 12.4 Quinn Restoration

Last week John Quinn sat back in a chair with his feet on his deck and nodded at his wife, Kate. They had just finished a conversation that would likely influence the direction of their lives for the next several years or longer. John retired a little less than a year ago after 25 years in the Lake Oswego police department. He had steadily moved up the ranks and retired as a captain. Although his career had, in his mind, gone excellently, he had been working much more than he had been home. Initially upon retiring he had reveled in the ability to spend time doing things he was never able to do while working: complete repairs around the house, travel with his wife, spend time with the children still at home, and visit those who had moved out. He was even able to knock five strokes off his golf handicap. However, he had become increasingly restless, and both he and Kate agreed he needed something to do, but that something did not involve a full-time job.
John had, over the years, bought, restored, and sold a series of older Corvettes. Although this had been entirely a hobby, it also had been a profitable one. The discussion John and Kate just concluded involved expanding this hobby, not into a full-time job, but into a part-time business. John would handle the actual restoration, which he enjoyed, and Kate would cover the paperwork, ordering parts, keeping track of expenses, and billing clients, which John did not like.

The last part of their conversation involved ordering parts. In the past John had ordered parts for old Corvettes from one of three possible sources: Weckler's, American Auto Parts, or Corvette Central. Kate, however, didn't want to call all three any time John needed a part but instead wanted to set up an account with one of the three and be able to order parts over the Internet. The question was which company, if any, would be the appropriate choice. John agreed to develop a list of common parts; Kate would then call each of the companies asking for their prices and, based on this information, determine with which company to establish the account. Kate spent time over the last week on the phone developing the data located in the data file called Quinn Restoration. The question John now faced is whether the prices he found could lead him to conclude one company will be less expensive, on average, than the other two.

Business Statistics Capstone Project

Theme: Analysis of Variance

Project Objective: The objective of this business statistics capstone project is to provide you with an opportunity to integrate the statistical tools and concepts you have learned in your business statistics course. As in all real-world applications, it is not expected that through the completion of this project you will have utilized every statistical technique you have been taught in this course. Rather, an objective of the assignment is for you to determine which of the statistical tools and techniques are appropriate to employ for the situation you have selected.

Project Description: You are to identify a business or organizational issue that is appropriately addressed using analysis of variance or experimental design. You will need to specify one or more sets of null and alternative hypotheses to be tested in order to reach conclusions pertaining to the business or organizational issue you have selected. You are responsible for designing and carrying out an "experiment" or otherwise collecting appropriate data required to test the hypotheses using one or more of the analysis of variance designs introduced in your text and statistics course. There is no minimum sample size; the sample size should depend on the design you choose and the cost and difficulty in obtaining the data. You are responsible for making sure that the data are accurate. All methods (or sources) for data collection should be fully documented.

Project Deliverables: To successfully complete this capstone project, you are required to deliver, at a minimum, the following items in the context of a management report:

• A complete description of the central issue of your project and of the background of the company or organization you have selected as the basis for the project.
• A clear and concise explanation of the data collection method used. Included should be a discussion of your rationale for selecting the analysis of variance technique(s) used in your analysis.
• A complete descriptive analysis of all variables in the data set, including both numerical and graphical analysis. You
should demonstrate the extent to which the basic assumptions of the analysis of variance designs have been satisfied • Provide a clear and concise review of the hypotheses tests that formed the objective of your project Show any post–ANOVA multiple comparison tests where appropriate • Offer a summary and conclusion section that relates back to the central issue(s) of your project and discusses the results of the hypothesis tests • All pertinent appendix materials The final report should be presented in a professional format using the style or format suggested by your instructor References Berenson, Mark L., and David M Levine, Basic Business Statistics: Concepts and Applications, 11th ed (Upper Saddle River, NJ: Prentice Hall, 2009) Bowerman, Bruce L., and Richard T O’Connell, Linear Statistical Models: An Applied Approach, 2nd ed (Belmont, CA: Duxbury Press, 1990) Cox, D R., Planning of Experiments (New York City: John Wiley & Sons, 1992) Cryer, Jonathan D., and Robert B Miller, Statistics for Business: Data Analysis and Modeling, 2nd ed (Belmont, CA: Duxbury Press, 1994) Kutner, Michael H., Christopher J Nachtsheim, John Neter, William Li, Applied Linear Statistical Models, 5th ed (New York City, McGraw-Hill Irwin, 2005) Microsoft Excel 2007 (Redmond,WA: Microsoft Corp., 2007) Minitab for Windows Version 15 (State College, PA: Minitab, 2007) Montgomery, D C., Design and Analysis of Experiments, 6th ed (New York City: John Wiley & Sons, 2004) Searle, S R., and R F Fawcett, “Expected mean squares in variance component models having finite populations.” Biometrics 26 (1970), pp 243–254 (56) chapters 8–12 Special Review Section Chapter Estimating Single Population Parameters Chapter Introduction to Hypothesis Testing Chapter 10 Estimation and Hypothesis Testing for Two Population Parameters Chapter 11 Hypothesis Tests and Estimation for Population Variances Chapter 12 Analysis of Variance This review section, which is presented using block diagrams and flowcharts, is intended to help you tie together the material from several key chapters This section is not a substitute for reading and studying the chapters covered by the review However, you can use this review material to add to your understanding of the individual topics in the chapters Chapters to 12 Statistical inference is the process of reaching conclusions about a population based on a random sample selected from the population Chapters to 12 introduced the fundamental concepts of statistical inference involving two major categories of inference, estimation and hypothesis testing These chapters have covered a fairly wide range of different situations that for beginning students can sometimes seem overwhelming The following diagrams will, we hope, help you better identify which specific estimation or hypothesistesting technique to use in a given situation These diagrams form something resembling a decision support system that you should be able to use as a guide through the estimation and hypothesis-testing processes 530 (57) CHAPTERS 8–12 | Special Review Section A Business Application Hypothesis Test Estimation Population Go to B Populations Go to C Estimating Population Parameter Population Populations ≥3 Populations Go to D Go to E Go to F B Population Mean Population Proportion Estimate  Estimate  σ known σ unknown Go to B-1 Go to B-3 Go to B-3 531 (58) 532 CHAPTERS 8–12 | Special Review Section Estimate , B-1 Known Σx Point Estimate for  x±z n Confidence Interval Estimate for  2 n z e2 Determine Sample Size x n Critical z from 
Standard Normal Distribution e  Margin of Error Estimate , B-2  Unknown Critical t from t-Distribution with n – Degrees of Freedom s= Σ (x – x) Σx x= Point Estimate for  x±t s n Confidence Interval Estimate for  n n–1 Assumption: Population Is Normally Distributed Estimate  Population Proportion Critical z from Standard Normal Distribution e Margin of Error  Estimated from Pilot Sample or Specified B-3 x p= n Point Estimate for  p ± z p(1 – p) n Confidence Interval Estimate for  n z2(1 – ) e2 Requirement: np ≥ and n(1 – p) ≥ Determine Sample Size (59) CHAPTERS 8–12 Estimating Population Parameters | Special Review Section C Population Means, Independent Samples Population Proportions Paired Samples Estimate 1  2 σ1 and σ2 known σ1 and σ2 unknown Estimate d Estimate 1  2 Go to C-3 Go to C-4 Go to C-1 Go to C-2 Estimate 1  2, 1 and 2 Known Critical z from Standard Normal Distribution C-1 x1 – x2 Point Estimate for 1  2 2 (x1 – x2) ± z σ1 + σ2 n1 n2 Confidence Interval Estimate for 1  2 Estimate 1  2, 1 and 2 Unknown C-2 x1 – x2 Critical t from t-Distribution with (x1 – x2) ± tsp n1  n2  where: Degrees of Freedom sp = 1 + n n2 Point Estimate for 1  2 Confidence Interval Estimate for 1  2 (n1 – 1)s12 + (n2 – 1)s22 n1 + n2 – Assumptions: Populations Are Normally Distributed Populations Have Equal Variances Independent Random Samples Measurements Are Interval or Ratio 533 (60) 534 CHAPTERS 8–12 | Special Review Section Estimate d Paired Samples C-3 Σd d= n Critical t from t-Distribution with n – Degrees of Freedom sd n d±t where: sd = Estimate 1  2 Difference Between Proportions Σ (d – d)2 Point Estimate for d Confidence Interval Estimate for d n–1 C-4 Point Estimate for 1  2 p1 – p Critical z from Standard Normal Distribution (p1 – p2) ± z Hypothesis Tests for Population Parameter Confidence Interval Estimate for 1  2 p1(1 – p1) p (1 – p2)  n1 n2 D Population Mean Population Proportion Population Variance Go to Go to D-3 Test for  σ known Go to D-1 Go to D-4 σ unknown Go to D-2 Test for  (61) CHAPTERS 8–12 Hypothesis Test for ,  Known D-1 Null and Alternative Hypothesis Options for  H0: μ = 20 H0: μ ≤ 20 H0: μ ≥ 20 HA: μ ≠ 20 HA: μ > 20 HA: μ < 20 z= Critical z from Standard Normal Distribution x–μ n z-Test Statistic  Significance level One-tailed test, critical value  z or –z Two-tailed test, critical values  z /2 Hypothesis Test for ,  Unknown D-2 H0: μ = 20 H0: μ  20 H0: μ (62) 20 HA: μ ≠ 20 HA: μ > 20 HA: μ < 20 t= Critical t from t-Distribution with n – Degrees of Freedom | Special Review Section x–μ s n  Significance level One-tailed test, critical value  t or t Two-tailed test, critical values  t /2 Assumption: Population Is Normally Distributed Null and Alternative Hypothesis Options for  t-Test Statistic 535 (63) 536 CHAPTERS 8–12 | Special Review Section Hypothesis Test for 2 D-3 H0: = 50 HA: ≠ 50 H0:  50 HA:  50 2  H0: (64) 50 HA: 50 (n – 1)s2 Null and Alternative Hypothesis Options for 2 Test Statistic  Significance level df  n 1 One-tailed test, critical value  or 21 Two-tailed test, critical value  /2 and 21 /2 Assumption: Population Is Normally Distributed Hypothesis Test for  D-4 H0:  = 0.20 H0:  ≤ 0.20 H0:  ≥ 0.20 HA:  ≠ 0.20 HA:  > 0.20 HA:  < 0.20 z= Critical z from Standard Normal Distribution Null and Alternative Hypothesis Options for  p– (1 – ) n z-Test Statistic  significance level one-tailed test, critical value  z or z two-tailed test, critical values  z /2 Requirement: n (65) and n(1  ) (66) (67) CHAPTERS 8–12 Hypothesis Tests for Population Parameters | Special 
Review Section E: Hypothesis Tests for Two Populations

Population means, independent samples:
  Test μ1 − μ2 with σ1 and σ2 known → Go to E-1
  Test μ1 − μ2 with σ1 and σ2 unknown → Go to E-2
Paired samples: Test μd → Go to E-3
Population proportions: Test π1 − π2 → Go to E-4
Population variances: Test σ1² versus σ2² → Go to E-5

E-1  Test of μ1 − μ2, σ1 and σ2 Known

Hypothesis options for testing μ1 − μ2:
  H0: μ1 − μ2 = 0     H0: μ1 − μ2 ≥ 0     H0: μ1 − μ2 ≤ 0
  HA: μ1 − μ2 ≠ 0     HA: μ1 − μ2 < 0     HA: μ1 − μ2 > 0

z-test statistic for μ1 − μ2:

  z = [(x̄1 − x̄2) − (μ1 − μ2)] / √(σ1²/n1 + σ2²/n2)

The critical z comes from the standard normal distribution. With significance level α, a one-tailed test uses critical value zα or −zα; a two-tailed test uses critical values ±zα/2.

E-2  Test of μ1 − μ2, σ1 and σ2 Unknown

Hypothesis options for testing μ1 − μ2:
  H0: μ1 − μ2 = 0     H0: μ1 − μ2 ≥ 0     H0: μ1 − μ2 ≤ 0
  HA: μ1 − μ2 ≠ 0     HA: μ1 − μ2 < 0     HA: μ1 − μ2 > 0

t-test statistic for μ1 − μ2:

  t = [(x̄1 − x̄2) − (μ1 − μ2)] / [sp √(1/n1 + 1/n2)]

where the pooled standard deviation is

  sp = √{[(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)}

The critical t comes from the t-distribution with n1 + n2 − 2 degrees of freedom. With significance level α, a one-tailed test uses tα or −tα; a two-tailed test uses ±tα/2.

Assumptions: populations are normally distributed; populations have equal variances; samples are independent; measurements are interval or ratio.

E-3  Test of μd, Paired Samples

Hypothesis options for testing μd:
  H0: μd = 0     H0: μd ≥ 0     H0: μd ≤ 0
  HA: μd ≠ 0     HA: μd < 0     HA: μd > 0

t-test statistic for μd:

  t = (d̄ − μd) / (sd /√n),  where  sd = √[Σ(d − d̄)² / (n − 1)]

The critical t comes from the t-distribution with n − 1 degrees of freedom. One-tailed test, critical value tα or −tα; two-tailed test, critical values ±tα/2.

E-4  Test for the Difference Between Proportions, π1 − π2

Hypothesis options for testing π1 − π2:
  H0: π1 − π2 = 0     H0: π1 − π2 ≥ 0     H0: π1 − π2 ≤ 0
  HA: π1 − π2 ≠ 0     HA: π1 − π2 < 0     HA: π1 − π2 > 0

z-test statistic for testing π1 − π2:

  z = [(p1 − p2) − (π1 − π2)] / √[p̄(1 − p̄)(1/n1 + 1/n2)]

where the pooled proportion is p̄ = (n1 p1 + n2 p2)/(n1 + n2). The critical z comes from the standard normal distribution. One-tailed test, critical value zα or −zα; two-tailed test, critical values ±zα/2.

E-5  Test for the Difference Between Population Variances, σ1² and σ2²

Hypothesis options for testing σ1² versus σ2²:
  H0: σ1² = σ2²     H0: σ1² ≤ σ2²     H0: σ1² ≥ σ2²
  HA: σ1² ≠ σ2²     HA: σ1² > σ2²     HA: σ1² < σ2²

F-test statistic for testing σ1² versus σ2²:  F = s1²/s2². For a two-tailed test, put the larger sample variance in the numerator. Degrees of freedom: D1 = n1 − 1 and D2 = n2 − 1. With significance level α, a one-tailed test uses critical value Fα; a two-tailed test uses Fα/2.

F  Hypothesis Tests for 3 or More Population Means

One-way ANOVA design → Go to F-1
Randomized block ANOVA design without replications → Go to F-2
Two-factor ANOVA design with replications → Go to F-3

F-1  One-Way ANOVA Design for 3 or More Population Means

Null and alternative hypotheses:
  H0: μ1 = μ2 = μ3 = ... = μk
  HA: At least two populations have different means

ANOVA table:
  Source of Variation    SS     df        MS     F-Ratio
  Between samples        SSB    k − 1     MSB    MSB/MSW
  Within samples         SSW    nT − k    MSW
  Total                  SST    nT − 1

The critical F comes from the F-distribution with D1 = k − 1 and D2 = nT − k degrees of freedom at significance level α. If the null hypothesis is rejected, compare all possible pairs |x̄i − x̄j| to the Tukey-Kramer critical range:

  critical range = q1−α √[(MSW/2)(1/ni + 1/nj)]

Assumptions: populations are normally distributed; populations have equal variances; samples are independent; measurements are interval or ratio.

F-2  Randomized Block ANOVA Design for 3 or More Population Means

Primary null and alternative hypotheses:
  H0: μ1 = μ2 = μ3 = ... = μk
  HA: At least two populations have different means

Blocking null and alternative hypotheses:
  H0: μb1 = μb2 = μb3 = ... = μbn (blocking is not effective)
  HA: At least two block means differ (blocking is effective)

ANOVA table:
  Source of Variation    SS      df                MS      F-Ratio
  Between blocks         SSBL    b − 1             MSBL    MSBL/MSW
  Between samples        SSB     k − 1             MSB     MSB/MSW
  Within samples         SSW     (k − 1)(b − 1)    MSW
  Total                  SST     nT − 1

With significance level α, the blocking critical value is Fα with df D1 = b − 1 and D2 = (k − 1)(b − 1); the primary critical value is Fα with df D1 = k − 1 and D2 = (k − 1)(b − 1). If the primary null hypothesis is rejected, compare all pairs |x̄i − x̄j| to Fisher's least significant difference:

  LSD = tα/2 √(2 MSW / b)

Assumptions: populations are normally distributed; populations have equal variances; observations within samples are independent; measurements are interval or ratio.

F-3  Two-Factor ANOVA Design with Replications

Factor A null and alternative hypotheses:
  H0: μA1 = μA2 = μA3 = ... = μAk
  HA: Not all Factor A means are equal

Factor B null and alternative hypotheses:
  H0: μB1 = μB2 = μB3 = ... = μBn
  HA: Not all Factor B means are equal

Null and alternative hypotheses for testing whether the two factors interact:
  H0: Factors A and B do not interact to affect the mean response
  HA: Factors A and B do interact

ANOVA table:
  Source of Variation    SS      df                MS      F-Ratio
  Factor A               SSA     a − 1             MSA     MSA/MSE
  Factor B               SSB     b − 1             MSB     MSB/MSE
  AB interaction         SSAB    (a − 1)(b − 1)    MSAB    MSAB/MSE
  Error                  SSE     nT − ab           MSE
  Total                  SST     nT − 1

With significance level α, the Factor A critical value is Fα with df D1 = a − 1 and D2 = nT − ab; the Factor B critical value is Fα with df D1 = b − 1 and D2 = nT − ab; the interaction critical value is Fα with df D1 = (a − 1)(b − 1) and D2 = nT − ab.

Assumptions: the population values for each combination of pairwise factor levels are normally distributed; the variances for each population are equal; the samples are independent; measurements are interval or ratio.

Using the Flow Diagrams

Example Problem: A travel agent in Florida is interested in determining whether there is a difference in the mean out-of-pocket costs incurred by customers on two major cruise lines. To test this, she has selected a simple random sample of 20 customers who have taken cruise line I and has asked these people to track their costs over and above the fixed price of the cruise. She did the same for a second simple random sample of 15 people who took cruise line II. You can use the flow diagrams to direct you to the appropriate statistical tool.

A  The travel agency wishes to test a hypothesis involving two populations, so we proceed to Section E.

E  The hypothesis test is for two population means. The samples are independent, because the spending by customers on one cruise line in no way influences the spending by customers on the second cruise line. The population standard deviations are unknown, so we proceed to E-2.

At E-2, we determine the null and alternative hypotheses to be

  H0: μ1 − μ2 = 0
  HA: μ1 − μ2 ≠ 0

Next, we establish the test statistic as

  t = [(x̄1 − x̄2) − (μ1 − μ2)] / [sp √(1/n1 + 1/n2)]

where

  sp = √{[(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)}

Finally, the critical value is a t-value from the t-distribution with 20 + 15 − 2 = 33 degrees of freedom. Note that if the degrees of freedom are not shown in the t table, you can use Excel's TINV function or Minitab to determine the t-value.

Thus, by using the flow diagrams and answering a series of basic questions, you should be successful in identifying the statistical tools required to address any problem or application covered in Chapters 8 to 12. You are encouraged to apply this process to the application problems and projects listed here.
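Before turning to the problems, here is a minimal sketch of the E-2 calculation in Python. It is a supplement to the text, not part of the original example: only the sample sizes (20 and 15) are given above, so the sample means and standard deviations below are hypothetical placeholders, and the scipy library stands in for the t table or Excel's TINV.

  import math
  from scipy import stats

  n1, n2 = 20, 15                 # customers sampled from cruise lines I and II
  x_bar1, x_bar2 = 820.0, 760.0   # hypothetical sample mean out-of-pocket costs
  s1, s2 = 110.0, 95.0            # hypothetical sample standard deviations

  # Pooled standard deviation (assumes equal population variances)
  sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

  # t-test statistic for H0: mu1 - mu2 = 0
  t_stat = (x_bar1 - x_bar2) / (sp * math.sqrt(1 / n1 + 1 / n2))

  # Two-tailed critical value, alpha = 0.05, df = 20 + 15 - 2 = 33
  df = n1 + n2 - 2
  t_crit = stats.t.ppf(1 - 0.05 / 2, df)

  print(f"t = {t_stat:.3f}, critical values = +/-{t_crit:.4f}")
  if abs(t_stat) > t_crit:
      print("Reject H0: mean costs differ between the two cruise lines.")
  else:
      print("Do not reject H0.")

The decision rule mirrors the flow diagram exactly: reject H0 whenever the computed t falls beyond either tail critical value.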
MyStatLab

Exercises: Integrative Application Problems

SR.1 Brandon Outdoor Advertising supplies neon signs to retail stores. A major complaint from its clients is that letters in the signs can burn out and leave the signs looking silly, depending on which letters stop working. The primary cause of neon letters not working is the failure of the starter unit attached to each letter. Starter units fail primarily based on turn-on/turn-off cycles. The present unit bought by Brandon averages 1,000 cycles before failure. A new manufacturer has approached Brandon claiming to have a model that is superior to the current unit. Brandon is skeptical but agrees to sample 50 starter units. It says it will buy from the new supplier if the sample results indicate the new unit is better. The sample of 50 gives the following values:

  Sample mean = 1,010 cycles
  Sample standard deviation = 48 cycles

Would you recommend changing suppliers?

SR.2 PestFree Chemicals has developed a new fungus preventative that may have a significant market among potato growers. Unfortunately, the actual extent of the fungus problem in any year depends on rainfall, temperature, and many other factors. To test the new chemical, PestFree has used it on 500 acres of potatoes and has used the leading competitor on an additional 500 acres. At the end of the season, 120 acres treated by the new chemical show significant levels of fungus infestation, whereas 160 of the acres treated by the leading chemical show significant infestation. Do these data provide statistical proof that the new product is superior to the leading competitor?

SR.3 Last year Tucker Electronics decided to try to do something about turnover among assembly-line workers at its plants. It implemented two trial personnel policies, one based on an improved hiring policy and the other based on increasing worker responsibility. These policies were put into effect at two different plants, with the following results:

                            Plant 1            Plant 2
                            Improved Hiring    Increased Responsibility
  Workers in trial group    800                900
  Turnover proportion       0.05               0.09

Do these data provide evidence that there is a difference between the turnover rates for the two trial policies?

SR.4 A Big 10 university has been approached by Wilson Sporting Goods. Wilson has developed a football designed specifically for practice sessions. Wilson would like to claim the ball will last for 500 practice hours before it needs to be replaced. Wilson has supplied six balls for use during spring and fall practice. The following data have been gathered on the time used before the ball must be replaced:

  Hours: 551  511  479  435  440  466

Do you see anything wrong with Wilson claiming the ball will last 500 hours?
SR.5 The management of a chain of movie theaters believes the average weekend attendance at its downtown theater is greater than at its suburban theater. The following sample results were found from their accounting data:

                       Downtown    Suburban
  Number of weekends   11          10
  Average attendance   855         750
  Sample variance      1,684       1,439

Do these data provide sufficient evidence to indicate there is a difference in average attendance? The company is also interested in whether there is a significant difference in the variability of attendance.

SR.6 A large mail-order company has placed an order for 5,000 thermal-powered fans to sit on wood-burning stoves from a supplier in Canada, with the stipulation that no more than 2% of the units will be defective. To check the shipment, the company tests a random sample of 400 fans and finds 11 defective. Should this sample evidence lead the company to conclude the supplier has violated the terms of the contract?

SR.7 A manufacturer of automobile shock absorbers is interested in comparing the durability of its shocks with that of its two biggest competitors. To make the comparison, one of the manufacturer's shocks and one of each competitor's shocks were randomly selected and installed on the rear wheels of each of six randomly selected cars of the same type. After the cars had been driven 20,000 miles, the strength of each test shock was measured, coded, and recorded:

  Car number    Manufacturer    Competitor 1    Competitor 2
  1             8.8             9.3             8.6
  2             10.5            9.0             13.7
  3             12.5            8.4             11.2
  4             9.7             13.0            9.7
  5             9.6             12.0            12.2
  6             13.2            10.1            8.9

Do these data present sufficient evidence to conclude there is a difference in the mean strength of the three types of shocks after 20,000 miles?

SR.8 AstraZeneca is the maker of the stomach medicine Prilosec, which is the second-best-selling drug in the world. Recently, the company has come under close scrutiny concerning the cost of its medicines. The company's internal audit department selected a random sample of 300 purchases of Prilosec. They wished to characterize how much is being spent on this medicine. In the sample, the mean price per 20-milligram tablet of Prilosec was $2.70. The sample had a standard deviation of $0.30. Determine an estimate that will characterize the average range of values charged for a tablet of Prilosec.

SR.9 A manufacturer of PC monitors is interested in the effects that the type of glass and the type of phosphor used in the manufacturing process have on the brightness of the monitors. The director of research and development has received anecdotal evidence that the type of glass does not affect the brightness of the monitor as long as phosphor type 1 is used. However, the evidence seems to indicate that the type of glass does make a difference if the two other phosphor types are used. Here are data to validate this anecdotal evidence (two observations per glass/phosphor combination):

                  Phosphor Type
  Glass Type      1      2      3
  1               279    307    287
                  254    313    290
  2               297    294    285
                  243    253    252
  3               245    232    236
                  267    223    278

Conduct a procedure to verify or repudiate the anecdotal evidence.

SR.10 The Vilmore Corporation is considering two word processing programs for its PCs. One factor that will influence its decision is the ease of use in preparing a business report. Consequently, Jody Vilmore selected a random sample of nine typists from the clerical pool and asked them to type a typical report using both word processors. The typists then were timed (in seconds) to determine how quickly they could type one of the frequently used forms. The results were as follows:

  Typist    Processor 1    Processor 2
  1         82             75
  2         76             80
  3         90             70
  4         55             58
  5         49             53
  6         82             75
  7         90             80
  8         45             45
  9         70             80

Jody wishes to have an estimate of the smallest and largest differences that might exist in the average time required for typing the business form using the two programs. Provide this information.

SR.11 The research department of an appliance manufacturing firm has developed a solid-state switch for its blender that the department claims will reduce the percentage of appliances being returned under the one-year full warranty by a range of 3% to 6%. To determine if the claim can be supported, the testing department selects a group of the blenders manufactured with the new switch and the old switch and subjects them to a normal year's worth of wear. Out of 250 blenders tested with the new switch, would have been returned. Sixteen would have been returned out of the 250 blenders with the old switch. Use a statistical procedure to verify or refute the department's claim.

SR.12 The Ecco Company makes electronics products for distribution throughout the world. As a member of the quality department, you are interested in the warranty claims that are made by customers who have experienced problems with Ecco products. The file called Ecco contains data for a random sample of warranty claims. Large warranty claims not only cost the company money but also provide adverse publicity. The quality manager has asked you to provide her with a range of values that would represent the percentage of warranty claims filed for more than $300. Provide this information for your quality manager.

END EXERCISES

Term Project Assignments

Investigate whether there are differences in grocery prices for three or more stores in your city.
a. Specify the type of testing procedure you will use.
b. What type of experimental design will be used? Why?
c. Develop a "typical" market basket of at least 10 items that you will price-check. Collect price data on these items at three or more different stores that sell groceries.
d. Analyze your price data using the testing procedure and experimental design you specified in parts a and b.
e. Present your findings in a report. Did you find differences in average prices of the "market basket" across the different grocery stores?

Business Statistics Capstone Project

Theme: Financial Data Analysis

Project Objective: The objective of this business statistics capstone project is to provide you with an opportunity to integrate the statistical tools and concepts that you have learned in your business statistics course. Like all real-world applications, it is not expected that through the completion of this project you will have utilized every statistical technique you have been taught in this course. Rather, an objective of the assignment will be for you to determine which of the statistical tools and techniques are appropriate to apply to the situation you have selected.

Project Description: Assume that you are working as an intern for a financial management company. Your employer has a large number of clients who trust the company managers to invest their funds. In your position, you have the responsibility for producing reports for clients when they request information. Your company has two large data files with financial information for a large number of U.S. companies. The first is called US Companies 2003, which contains financial information for the companies' 2001 or 2002 fiscal year-end. The second file is called US Companies 2005, which has data for the fiscal 2003 or 2004 year-end. The 2003 file has data for 7,441 companies. The 2005 file has data for 6,992 companies. Thus, many companies are listed in both files, but some are just in one or the other. The two files have many of the same variables, but the 2003 file has a larger range of financial variables than the 2005 file. For some companies, the data for certain variables are not available, and a code of NA is used to so indicate. The 2003 file has a special worksheet containing the description of each variable. These descriptions apply to the 2005 data file as well. You have been given access to these two data files for use in preparing your reports. Your role will be to perform certain statistical analyses that can be used to help convert these data into useful information in order to respond to the clients' questions.

This morning, one of the partners of your company received a call from a client who asked for a report that would compare companies in the financial services industry (SIC codes in the 6000s) to companies in production-oriented industries (SIC codes in the 2000s and 3000s). There are no firm guidelines on what the report should entail, but the partner has suggested the following:

● Start with the 2005 data file. Pull the data for all companies with the desired SIC codes into a new worksheet.
● Prepare a complete descriptive analysis of key financial variables using appropriate charts and graphs to help compare the two types of businesses.
● Determine whether there are statistical differences between the two classes of companies in terms of key financial measures.
● Using data from the 2003 file for companies that have these SIC codes and which are also in the 2005 file, develop a comparison that shows the changes over the time span, both within SIC code grouping and between SIC code groupings.
Project Deliverables: To successfully complete this capstone project, you are required to deliver a management report that addresses the partner's requests (listed above) and also contains at least one other substantial type of analysis not mentioned by the partner. This latter work should be set off in a special section of the report. The final report should be presented in a professional format using the style or format suggested by your instructor.

Chapter 13  Goodness-of-Fit Tests and Contingency Analysis

Quick Prep Links
• Review the logic involved in testing a hypothesis, discussed in Chapter 9.
• Review the characteristics of probability distributions such as the binomial, Poisson, uniform, and normal distributions in Chapters 5 and 6.
• Review the definitions of Type I and Type II errors in Chapter 9.

13.1 Introduction to Goodness-of-Fit Tests (pg 548–562)
  Outcome 1. Utilize the chi-square goodness-of-fit test to determine whether data from a process fit a specified distribution.

13.2 Introduction to Contingency Analysis (pg 562–572)
  Outcome 2. Set up a contingency analysis table and perform a chi-square test of independence.

Why you need to know

The previous 12 chapters introduced a wide variety of statistical techniques that are frequently used in business decision making. We have discussed numerous descriptive tools and techniques, as well as estimation and hypothesis tests for one and two populations, hypothesis tests using the t-distribution, and analysis of variance. However, as we have often mentioned, these statistical tools are limited to use under those conditions for which they were originally developed. For example, the tests based on the standard normal distribution assume that the data can be measured at least at the interval level. The tests that employ the t-distribution assume that the sampled populations are normally distributed. In those situations in which the conditions just mentioned are not satisfied, we suggest using nonparametric statistics. Several of the more widely used nonparametric techniques will be discussed in Chapter 17. These procedures are generally the nonparametric equivalents of the classical procedures discussed in Chapters 8–12.

The obvious questions when faced with a realistic decision-making situation are "Which test do I use? Should I consider a nonparametric test?" These questions are generally followed by a second question: "Do the data come from a normal distribution?" But recall that we also described situations involving data from Poisson or binomial distributions. How do we know which distribution applies to our situation?
Fortunately, a statistical technique called goodness-of-fit testing exists that can help us answer this question. Using goodness-of-fit tests, we can decide whether a set of data comes from a specific hypothesized distribution.

You will also encounter many business situations in which the level of data measurement for the variable of interest is either nominal or ordinal, not interval or ratio. For example, a bank may use a code to indicate whether a customer is a good or poor credit risk. The bank may also have data for these customers that indicate, by a code, whether each person is buying or renting a home. The loan officer may be interested in determining whether credit-risk status is independent of home ownership. Because both credit risk and home ownership are qualitative, or categorical, variables, their measurement level is nominal, and the statistical techniques introduced in Chapters 8–12 cannot be used to analyze this problem. We therefore need a new statistical tool to assist the manager in reaching an inference about the customer population. That statistical tool is contingency analysis. Contingency analysis is a widely used tool for analyzing the relationship between qualitative variables, one that decision makers in all business areas find helpful for data analysis.

13.1 Introduction to Goodness-of-Fit Tests  (Chapter Outcome 1)

Many of the statistical procedures introduced in earlier chapters require that the sample data come from populations that are normally distributed. For example, when we use the t-distribution in confidence interval estimation or hypothesis testing about one or two population means, the population(s) of interest is (are) assumed to be normally distributed. The F-test introduced in Chapters 11 and 12 is based on the assumption that the populations are normally distributed. But how can you determine whether these assumptions are satisfied? In other instances, you may wish to employ a particular probability distribution to help solve a problem related to an actual business process. To solve the problem, you may find it necessary to know whether the actual data from the process fit the probability distribution being considered. In such instances, a statistical technique known as a goodness-of-fit test can be used.

The term goodness-of-fit aptly describes the technique. Suppose Macy's, a major retail department store, believes the proportions of customers who use each of the four entrances to its Portland, Oregon, store are the same. This would mean that customer arrivals are uniformly distributed across the four entrances. Suppose a sample of 1,000 customers is observed entering the store, and the entrance (East, West, North, South) selected by each customer is recorded. Table 13.1 shows the results of the sample.

TABLE 13.1  Customer Door Entrance Data
  Entrance    Number of Customers
  East        260
  West        230
  North       220
  South       290

If the manager's assumption about the entrances being used uniformly holds, and if there were no sampling error involved, we would expect one fourth of the customers, or 250, to enter through each door. When we allow for the potential of sampling error, we would still expect close to 250 customers to enter through each entrance. The question is, how "good is the fit" between the sample data in Table 13.1 and the expected number of 250 people at each entrance?
At what point do we no longer believe that the differences between what is actually observed at each entrance and what we expected can be attributed to sampling error? If these differences get too big, we will reject the uniformity assumption and conclude that customers prefer some entrances to others.

Chi-Square Goodness-of-Fit Test

The chi-square goodness-of-fit test is one of the statistical tests that can be used to determine whether the sample data come from any hypothesized distribution. Consider the following application.

BUSINESS APPLICATION  Conducting a Goodness-of-Fit Test

VISTA HEALTH GUARD  Vista Health Guard, a Pennsylvania health clinic with 25 offices, is open seven days a week. The operations manager was recently hired from a similar position at a smaller chain of clinics in Florida. She is naturally concerned that the level of staffing (physicians, nurses, and other support personnel) be balanced with patient demand. Currently, the staffing level is balanced Monday through Friday, with reduced staffing on Saturday and Sunday. Her predecessor explained that patient demand is fairly level throughout the week and about 25% less on weekends, but the new manager suspects that the staff members simply want to have weekends free. Although she was willing to operate with this schedule for a while, she has decided to study patient demand to see whether the assumed demand pattern still applies. The operations manager requested a random sample of 20 days for each day of the week that showed the number of patients on each of the sample days. A portion of those data follows:

  Day                  Patient Count      Day                  Patient Count
  Monday, May          325                Monday, July 15      323
  Monday, October      379                Wednesday, April     467
  Tuesday, July        456                etc.                 etc.

For the 140 days observed, the total count was 56,000 patients. The total patient counts for each day of the week are shown in Table 13.2 and are graphed in Figure 13.1.

FIGURE 13.1  Graph of Actual Frequencies for Vista Health Guard (total patient count by day of week)

TABLE 13.2  Patient Count Data for the Vista Health Guard Example
  Day          Total Patient Count
  Sunday       4,502
  Monday       6,623
  Tuesday      8,308
  Wednesday    10,420
  Thursday     11,032
  Friday       10,754
  Saturday     4,361
  Total        56,000

Recall that the previous operations manager at Vista Health Guard based his staffing on the premise that from Monday to Friday the patient count remained essentially the same and on Saturdays and Sundays it went down 25%. If this is so, how many of the 56,000 patients would we expect on Monday? How many on Tuesday, and so forth?
To figure out this demand, we determine weighting factors by allocating four units each to days Monday through Friday and three units each (representing the 25% reduction) to Saturday and Sunday. The total number of units is then (5 × 4) + (2 × 3) = 26. The proportion of total patients expected on each weekday is 4/26, and the proportion expected on each weekend day is 3/26. The expected number of patients on a weekday is (4/26) × 56,000 = 8,615.38, and the expected number on each weekend day is (3/26) × 56,000 = 6,461.54. Figure 13.2 shows a graph with the actual sample data and the expected values.

FIGURE 13.2  Actual and Expected Frequencies for Vista Health Guard (actual data plotted against the hypothesized distribution)

With the exception of what might be attributed to sampling error, if the distribution claimed by the previous operations manager is correct, the actual frequencies for each day of the week should fit quite closely with the expected frequencies. As you can see in Figure 13.2, the actual data and the expected data do not match perfectly. However, is the difference enough to warrant a change in staffing patterns?

The situation facing Vista Health Guard is one for which a number of statistical tests have been developed. One of the most frequently used is the chi-square goodness-of-fit test. What we need to examine is how well the sample data fit the hypothesized distribution. The following null and alternative hypotheses can represent this:

  H0: The patient demand distribution is evenly spread through the weekdays and is 25% lower on the weekend.
  HA: The patient demand follows some other distribution.

Equation 13.1 is the equation for the chi-square goodness-of-fit test statistic. The logic behind this test is based on determining how far the actual observed frequency is from the expected frequency. Because we are interested in whether a difference exists, positive or negative, we remove the effect of negative values by squaring the differences. In addition, how important a difference is really depends on the magnitude of the expected frequency (a given difference matters more when the expected frequency is 10 than when the expected frequency is 1,000), so we divide the squared difference by the expected frequency. Finally, we sum these difference ratios over all categories. This sum is a statistic that has an approximate chi-square distribution.

Chi-Square Goodness-of-Fit Test Statistic

  χ² = Σ (oi − ei)² / ei    (13.1)

where the sum is taken over the i = 1, 2, ..., k categories, and:
  oi = observed frequency for category i
  ei = expected frequency for category i
  k = number of categories

The χ² statistic is distributed approximately as a chi-square distribution only if the sample size is large.

Special Note: A sample size of at least 30 is sufficient in most cases, provided that none of the expected frequencies is too small. The issue of expected cell frequencies will be discussed later.
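Equation 13.1 translates directly into code. The short sketch below is a supplement to the text (the function name is our own choice), showing the statistic as a one-line computation in Python.

  # Equation 13.1: chi-square goodness-of-fit statistic
  def chi_square_gof(observed, expected):
      """Sum of (o - e)^2 / e over all k categories."""
      return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

  # Quick sanity check: a perfect fit gives chi-square = 0
  print(chi_square_gof([250, 250, 250, 250], [250.0, 250.0, 250.0, 250.0]))  # 0.0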
If the calculated chi-square statistic gets large, this is evidence to suggest that the fit of the actual data to the hypothesized distribution is not good and that the null hypothesis should be rejected.

Figure 13.3 shows the hypothesis-test process and results for this chi-square goodness-of-fit test. Note that the degrees of freedom for the chi-square test are equal to k − 1, where k is the number of categories or observed cell frequencies. In this example, we have k = 7 categories corresponding to the days of the week, so the degrees of freedom are 7 − 1 = 6. The critical value of 12.5916 is found in Appendix G for an upper-tail test with 6 degrees of freedom and a significance level of 0.05.

FIGURE 13.3  Chi-Square Goodness-of-Fit Test for Vista Health Guard

Hypotheses:
  H0: Patient demand is evenly spread through the weekdays and is 25% lower on weekends.
  HA: Patient demand follows some other distribution.
  α = 0.05

  Day          Observed oi    Expected ei
  Sunday       4,502          6,461.54
  Monday       6,623          8,615.38
  Tuesday      8,308          8,615.38
  Wednesday    10,420         8,615.38
  Thursday     11,032         8,615.38
  Friday       10,754         8,615.38
  Saturday     4,361          6,461.54
  Total        56,000         56,000

Test statistic:
  χ² = Σ (oi − ei)²/ei
     = (4,502 − 6,461.54)²/6,461.54 + (6,623 − 8,615.38)²/8,615.38 + ... + (4,361 − 6,461.54)²/6,461.54
     = 594.2 + 460.8 + ... + 682.9
     = 3,335.6

  df = k − 1 = 7 − 1 = 6;  rejection region: α = 0.05, χ²0.05 = 12.5916

Decision rule: If χ² > 12.5916, reject H0; otherwise, do not reject H0. Because 3,335.6 > 12.5916, reject H0. Based on the sample data, we can conclude that the patient distribution is not the same as previously indicated.

As Figure 13.3 indicates, χ² = 3,335.6 > 12.5916, so the null hypothesis is rejected, and we should conclude that the demand pattern does not match the previously defined distribution. The data in Figure 13.3 indicate that demand is heavier than expected Wednesday through Friday and less than expected on the other days. The operations manager may now wish to increase staffing on Wednesday, Thursday, and Friday to more closely approximate current demand patterns.
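As a supplement to the manual calculation in Figure 13.3, the sketch below reproduces the same test with Python's scipy library (an alternative to the Excel and Minitab steps used elsewhere in this chapter). The observed counts and the 4/26 and 3/26 weights come directly from the Vista Health Guard data.

  from scipy import stats

  observed = [4502, 6623, 8308, 10420, 11032, 10754, 4361]  # Sun..Sat
  weights = [3, 4, 4, 4, 4, 4, 3]                           # weekday = 4, weekend = 3
  total = sum(observed)                                     # 56,000 patients
  expected = [w / 26 * total for w in weights]

  chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
  chi2_crit = stats.chi2.ppf(1 - 0.05, df=len(observed) - 1)  # 12.5916

  print(f"chi-square = {chi2_stat:.1f}, critical value = {chi2_crit:.4f}, p = {p_value:.4g}")
  # chi-square is about 3,335.6, far beyond 12.5916, so H0 is rejected

Because the statistic is so extreme, the p-value is essentially zero, which matches the Figure 13.3 conclusion regardless of the significance level chosen.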
the number of times each of the possible answers was chosen: Answer Times Selected 358 402 577 403 380 Under the hypothesis of a uniform distribution, 20% of this total (424) should be selected for each answer This is the expected cell frequency Equation 13.1 is used to form the test statistic based on these sample data k 2  ∑ (oi  ei )2 ei i=11  (358  424)2 (402  424)2  424 424 (577  424)2 (403  424)2 (380  424)2   424 424 424  72.231  Step Reach a decision The decision rule is If   9.4877, reject H0 Otherwise, not reject H0 Because   72.231  9.4877, reject the null hypothesis Step Draw a conclusion We conclude that the answer choices are not occurring equally across the five possible answers This is evidence to suggest that this particular question is not being answered by random guessing >> END EXAMPLE TRY PROBLEM 13-2 (pg 559) BUSINESS APPLICATION Excel and Minitab tutorials Excel and Minitab Tutorial USING SOFTWARE TO CONDUCT A GOODNESS-OF-FIT TEST WOODTRIM PRODUCTS, INC Woodtrim Products, Inc., makes wood moldings, doorframes, and window frames It purchases lumber from mills throughout New England and eastern Canada The first step in the production process is to rip the lumber into narrower (88) CHAPTER 13 | Goodness-of-Fit Tests and Contingency Analysis 553 strips Different widths are used for different products For example, wider pieces with no imperfections are used to make door and window frames Once an operator decides on the appropriate width, that information is locked into a computer and a ripsaw automatically cuts the board to the desired size The manufacturer of the saw claims that the ripsaw cuts an average deviation of zero from target and that the differences from target will be normally distributed, with a standard deviation of 0.01 inch Woodtrim has recently become concerned that the ripsaw may not be cutting to the manufacturer’s specifications because operators at other machines downstream in the production process are finding excessive numbers of ripped pieces that are too wide or too narrow A quality improvement team (QIT) has started to investigate the problem Team members selected a random sample of 300 boards just as they came off the ripsaw To provide a measure of control, the only pieces sampled in the initial study had stated widths of 27⁄8 (2.875) inches Each piece’s width was measured halfway from its end A portion of the data and the differences between the target 2.875 inches and the actual measured width are shown in Figure 13.4 The full data set is contained in the file Woodtrim The team can use these data and the chi-square goodness-of-fit testing procedure to test the following null and alternative hypotheses: H0: The differences are normally distributed, with μ  and s  0.01 HA: The differences are not normally distributed, with μ  and s  0.01 This example differs slightly from the previous examples because the hypothesized distribution is continuous rather than discrete Thus, we must organize the data into a grouped-data frequency distribution (see Chapter 2), as shown in Figure 13.5 Our choice of classes requires careful consideration The chi-square goodness-of-fit test compares the actual cell frequencies with the expected cell frequencies The test statistic from Equation 13.1, k 2  ∑ i =1 (oi  ei )2 ei is approximately chi-square distributed if the expected cell frequencies are large Because the expected cell frequencies are used in computing the test statistic, the general recommendation is that the goodness-of-fit test be performed only when all 
BUSINESS APPLICATION  Using Software to Conduct a Goodness-of-Fit Test  (Excel and Minitab tutorials accompany this example)

WOODTRIM PRODUCTS, INC.  Woodtrim Products, Inc., makes wood moldings, door frames, and window frames. It purchases lumber from mills throughout New England and eastern Canada. The first step in the production process is to rip the lumber into narrower strips. Different widths are used for different products. For example, wider pieces with no imperfections are used to make door and window frames. Once an operator decides on the appropriate width, that information is locked into a computer, and a ripsaw automatically cuts the board to the desired size. The manufacturer of the saw claims that the ripsaw cuts with an average deviation of zero from target and that the differences from target will be normally distributed, with a standard deviation of 0.01 inch.

Woodtrim has recently become concerned that the ripsaw may not be cutting to the manufacturer's specifications, because operators at other machines downstream in the production process are finding excessive numbers of ripped pieces that are too wide or too narrow. A quality improvement team (QIT) has started to investigate the problem. Team members selected a random sample of 300 boards just as they came off the ripsaw. To provide a measure of control, the only pieces sampled in the initial study had stated widths of 2 7/8 (2.875) inches. Each piece's width was measured halfway from its end. A portion of the data and the differences between the target 2.875 inches and the actual measured width are shown in Figure 13.4. The full data set is contained in the file called Woodtrim. The team can use these data and the chi-square goodness-of-fit testing procedure to test the following null and alternative hypotheses:

  H0: The differences are normally distributed, with μ = 0 and σ = 0.01
  HA: The differences are not normally distributed, with μ = 0 and σ = 0.01

This example differs slightly from the previous examples because the hypothesized distribution is continuous rather than discrete. Thus, we must organize the data into a grouped-data frequency distribution (see Chapter 2), as shown in Figure 13.5. Our choice of classes requires careful consideration. The chi-square goodness-of-fit test compares the actual cell frequencies with the expected cell frequencies, and the test statistic from Equation 13.1 is approximately chi-square distributed only if the expected cell frequencies are large. Because the expected cell frequencies are used in computing the test statistic, the general recommendation is that the goodness-of-fit test be performed only when all expected cell frequencies are at least 5. If any of the cells have expected frequencies less than 5, the cells should be combined in a meaningful way such that all expected frequencies are at least 5. We have chosen to use k = 6 classes; the number of classes is your choice.

FIGURE 13.4  Woodtrim Products Test Data
  Excel 2007 Instructions: 1. Open file: Woodtrim.xls.

FIGURE 13.5  Excel 2007 Results: Goodness-of-Fit Test for the Woodtrim Example
  Excel 2007 Instructions:
  1. Open file: Woodtrim.xls (see Figure 13.4).
  2. Define the classes (column J).
  3. Determine the observed frequencies (e.g., the cell K4 formula is =COUNTIF($D$2:$D$301,"<0.0")-SUM($K$2:K3)).
  4. Determine the normal distribution probabilities, assuming the mean = 0.0 and the standard deviation = 0.01 (e.g., the cell L4 formula is =NORMDIST(0,0,0.01,TRUE)-SUM($L$2:L3)).
  5. Determine the expected frequencies by multiplying each normal probability by the sample size (n = 300).
  6. Compute the values for chi-square in column N (the cell N5 formula is =(K5-M5)^2/M5) and sum column N to get the chi-square statistic.
  7. Find the p-value using the CHITEST function (the cell N10 formula is =CHITEST(K2:K7,M2:M7)).
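The Figure 13.5 procedure can be sketched in Python as well. This is an illustrative supplement: the file name below is hypothetical (the text supplies the data only in the Woodtrim Excel/Minitab files), and the class boundaries are illustrative stand-ins for the k = 6 classes chosen in the figure. Both parameters (μ = 0, σ = 0.01) are specified by H0, so no degrees of freedom are lost to estimation.

  import numpy as np
  from scipy import stats

  differences = np.loadtxt("woodtrim_differences.txt")   # hypothetical file of the 300 deviations

  mu, sigma = 0.0, 0.01                                  # values specified by the null hypothesis
  cuts = np.array([-0.01, -0.005, 0.0, 0.005, 0.01])     # interior boundaries -> 6 classes (illustrative)

  # Observed class counts, including the two open-ended end classes
  observed = np.bincount(np.digitize(differences, cuts), minlength=cuts.size + 1)

  # Expected counts from the hypothesized normal distribution
  cdf = stats.norm.cdf(cuts, loc=mu, scale=sigma)
  probs = np.diff(np.concatenate(([0.0], cdf, [1.0])))
  expected = probs * differences.size

  chi2_stat = ((observed - expected) ** 2 / expected).sum()
  p_value = stats.chi2.sf(chi2_stat, df=observed.size - 1)   # df = 6 - 1 = 5
  print(f"chi-square = {chi2_stat:.3f}, p-value = {p_value:.4g}")

  # A Kolmogorov-Smirnov check in the spirit of Figure 13.6; scipy's one-sample
  # KS test assumes mu and sigma are fully specified, which matches H0 here.
  ks_stat, ks_p = stats.kstest(differences, "norm", args=(mu, sigma))
  print(f"KS = {ks_stat:.4f}, p-value = {ks_p:.4g}")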
You can perform the chi-square goodness-of-fit test using Excel functions and formulas or Minitab commands. (The Excel and Minitab tutorials that accompany this text take you through the specific steps required to complete this example.) Figure 13.5 shows the normal distribution probabilities, expected cell frequencies, and the chi-square calculation. The calculated chi-square statistic is χ² = 26.432. The p-value associated with χ² = 26.432 and 6 − 1 = 5 degrees of freedom is 0.0001. Therefore, because the p-value of 0.0001 is less than any reasonable level of alpha, we reject the null hypothesis and conclude that the ripsaw is not currently meeting the manufacturer's specification. The saw errors are not normally distributed with a mean equal to 0 and a standard deviation equal to 0.01.

Special Note: In this case, because the null hypothesis specified both the mean and the standard deviation, the normal distribution probabilities were computed using these values. However, if the mean and/or the standard deviation had not been specified, the sample mean and standard deviation would be used in the probability computation. You would lose one additional degree of freedom for each parameter estimated from the sample data. This is true any time sample statistics are specified in place of population parameters in the hypothesis.

Minitab has a procedure for performing a goodness-of-fit test for a normal distribution. In fact, it offers three different approaches, none of which is exactly the chi-square approach just outlined. Figure 13.6 shows the Minitab results for the Woodtrim example. Consistent with our other Minitab and Excel results, this output illustrates that the null hypothesis should be rejected because the p-value < 0.01.

FIGURE 13.6  Minitab Output: Test of Normally Distributed Ripsaw Cuts for the Woodtrim Example
  Minitab Instructions:
  1. Open file: Woodtrim.MTW.
  2. Choose Stat > Basic Statistics > Normality Test.
  3. In Variable, enter the data column (Difference).
  4. Under Normality test, select Kolmogorov-Smirnov.
  5. Click OK.
  Because the p-value < 0.01 is less than any reasonable level of significance, reject the null hypothesis of normality.

EXAMPLE 13-2  GOODNESS-OF-FIT TEST

Early Dawn Egg Company  The Early Dawn Egg Company operates an egg-producing operation in Maine. One of the key steps in the egg-production business is packaging eggs into cartons so that eggs arrive at stores unbroken. That means the eggs have to leave the Early Dawn plant unbroken. Because of the high volume of egg cartons shipped each day, the employees at Early Dawn can't inspect every carton of eggs. Instead, every hour, 10 cartons are inspected. If two or more contain broken or cracked eggs, a full inspection is done for all eggs produced since the previous inspection an hour earlier. If the inspectors find one or fewer cartons containing cracked or broken eggs, they ship that hour's production without further analysis. The company's contract with retailers calls for at most 10% of the egg cartons to have broken or cracked eggs. At issue is whether Early Dawn Egg Company managers can evaluate this sampling plan using a binomial distribution with n = 10 and p = 0.10. To test this, a goodness-of-fit test can be performed using the following steps:

Step 1  State the appropriate null and alternative hypotheses.
In this case, the null and alternative hypotheses are:
  H0: The distribution of defects is binomial, with n = 10 and p = 0.10
  HA: The distribution is not binomial, with n = 10 and p = 0.10

Step 2  Specify the level of significance.
The test will be conducted using α = 0.025.

Step 3  Determine the critical value.
The critical value depends on the number of degrees of freedom and the level of significance. The degrees of freedom equal k − 1, where k is the number of categories for which observed and expected frequencies will be recorded. In this case, the managers have set up the following groups: defects of 0, 1, 2, and 3 and over. Therefore, k = 4, and the degrees of freedom are 4 − 1 = 3. The critical chi-square value for α = 0.025 found in Appendix G is 9.3484.

Step 4  Collect the sample data and compute the chi-square test statistic using Equation 13.1.
The company selected a simple random sample of 100 hourly test results from past production records and recorded the number of defective cartons when the sample of 10 cartons was inspected. The following table shows the computations for the chi-square statistic:

  Defective Cartons    o Observed    Binomial Probability (n = 10, p = 0.10)    e Expected    (o − e)²/e
  0                    30            0.3487                                      34.87         0.6802
  1                    40            0.3874                                      38.74         0.0410
  2                    20            0.1937                                      19.37         0.0205
  3 and over           10            0.0702                                      7.02          1.2650
  Total                100                                                                     2.0067

The calculated chi-square test statistic is χ² = 2.0067.

Step 5  Reach a decision.
Because χ² = 2.0067 is less than the critical value of 9.3484, we do not reject the null hypothesis.

Step 6  Draw a conclusion.
The binomial distribution may be the appropriate distribution to describe the company's sampling plan.

END EXAMPLE  (TRY PROBLEM 13-1, pg 559)
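The Example 13-2 table is easy to recompute. This supplementary sketch uses scipy's binomial probability function to rebuild the expected frequencies for 0, 1, 2, and "3 and over" defective cartons with n = 10 and p = 0.10.

  from scipy import stats

  observed = [30, 40, 20, 10]                       # from the 100 sampled hours
  p = [stats.binom.pmf(k, 10, 0.10) for k in range(3)]
  p.append(1 - sum(p))                              # P(3 or more defective cartons)
  expected = [100 * prob for prob in p]             # 34.87, 38.74, 19.37, 7.02

  chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
  chi2_crit = stats.chi2.ppf(1 - 0.025, df=3)       # 9.3484
  print(f"chi-square = {chi2_stat:.4f}, critical value = {chi2_crit:.4f}")
  # 2.0067 < 9.3484, so H0 (binomial, n = 10, p = 0.10) is not rejected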
EXAMPLE 13-3  GOODNESS-OF-FIT TEST

University Internet Service  Students in a computer information systems class at a major university have established an Internet service provider (ISP) company for the university's students, faculty, and staff. Customers of this ISP connect via a wireless signal available throughout the campus and surrounding business area. Capacity is always an issue for an ISP, and the students had to estimate the capacity demands for their service. Before opening for business, the students conducted a survey of likely customers. Based on this survey, they estimated that demand during the late afternoon and evening hours is Poisson distributed (refer to Chapter 5) with a mean equal to 10 users per hour. Based on this assumption, the students developed the ISP with the capacity to handle 20 users simultaneously. However, they have lately been receiving complaints from customers saying they have been denied access to the system because 20 users are already online. The students are now interested in determining whether the demand distribution is still Poisson distributed with a mean equal to 10 per hour. To test this, they have collected data on the number of user requests for ISP access for 225 randomly selected time periods during the heavy-use hours. The following steps can be used to conduct the statistical test:

Step 1  State the appropriate null and alternative hypotheses.
  H0: The demand distribution is Poisson distributed with a mean equal to 10 users per time period
  HA: The demand distribution is not Poisson distributed with a mean equal to 10 per period

Step 2  Specify the level of significance.
The hypothesis test will be conducted using α = 0.05.

Step 3  Determine the critical value.
The critical value depends on the level of significance and the number of degrees of freedom. The degrees of freedom equal k − 1, where k is the number of categories. In this case, after collapsing the categories to get the expected frequencies to be at least 5, we have 13 categories. Thus, the degrees of freedom for the chi-square critical value are 13 − 1 = 12. For 12 degrees of freedom and a level of significance equal to 0.05, from Appendix G we find a critical value of 21.0261. Thus the decision rule is: If χ² > 21.0261, reject the null hypothesis; otherwise, do not reject.

Step 4  Collect the sample data and compute the chi-square test statistic using Equation 13.1.
A random sample of 225 time periods was selected, and the number of users requesting access to the ISP at each time period was recorded. The observed frequencies based on the sample data are as follows:

  Requests    Observed    Requests       Observed
  0           0           10             18
  1           1           11             14
  2           2           12             17
  3           3           13             18
  4           4           14             25
  5           3           15             28
  6           5           16             23
  7           7           17             17
  8           11          18             9
  9           9           19 and over    11
                          Total          225

To compute the chi-square test statistic, you must determine the expected frequencies. Start by determining the probability for each number of user requests based on the hypothesized distribution (Poisson with λt = 10).
The expected frequencies are calculated by multiplying each probability by the total observed frequency of 225. These results are as follows:

  Number of Requests    Observed Frequency    Poisson Probability (λt = 10)    Expected Frequency
  0                     0                     0.0000                           0.00
  1                     1                     0.0005                           0.11
  2                     2                     0.0023                           0.52
  3                     3                     0.0076                           1.71
  4                     4                     0.0189                           4.25
  5                     3                     0.0378                           8.51
  6                     5                     0.0631                           14.20
  7                     7                     0.0901                           20.27
  8                     11                    0.1126                           25.34
  9                     9                     0.1251                           28.15
  10                    18                    0.1251                           28.15
  11                    14                    0.1137                           25.58
  12                    17                    0.0948                           21.33
  13                    18                    0.0729                           16.40
  14                    25                    0.0521                           11.72
  15                    28                    0.0347                           7.81
  16                    23                    0.0217                           4.88
  17                    17                    0.0128                           2.88
  18                    9                     0.0071                           1.60
  19 and over           11                    0.0072                           1.62
  Total                 225                                                    225

Now you need to check whether any of the expected cell frequencies are less than 5. In this case, we see there are several instances where this is so. To deal with this, collapse categories so that all expected frequencies are at least 5. Doing this gives the following:

  Number of Requests    Observed Frequency    Poisson Probability (λt = 10)    Expected Frequency
  4 or fewer            10                    0.0293                           6.59
  5                     3                     0.0378                           8.51
  6                     5                     0.0631                           14.20
  7                     7                     0.0901                           20.27
  8                     11                    0.1126                           25.34
  9                     9                     0.1251                           28.15
  10                    18                    0.1251                           28.15
  11                    14                    0.1137                           25.58
  12                    17                    0.0948                           21.33
  13                    18                    0.0729                           16.40
  14                    25                    0.0521                           11.72
  15                    28                    0.0347                           7.81
  16 or more            60                    0.0488                           10.98
  Total                 225                                                    225

Now we can compute the chi-square test statistic using Equation 13.1 as follows:

  χ² = Σ (oi − ei)²/ei
     = (10 − 6.59)²/6.59 + (3 − 8.51)²/8.51 + ... + (60 − 10.98)²/10.98
     = 338.1

Step 5  Reach a decision.
Because χ² = 338.1 > 21.0261, reject the null hypothesis.

Step 6  Draw a conclusion.
The demand distribution is not Poisson distributed with a mean of 10. The students should conclude that either the mean demand per period has increased from 10, or the distribution is not Poisson, or both. They may need to add more capacity to the ISP business.

END EXAMPLE  (TRY PROBLEM 13-3, pg 559)
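The whole Example 13-3 calculation, including the collapsing of sparse categories, can be scripted. This sketch is a supplement to the example; the observed array mirrors the table above, and the collapsing exactly reproduces the "4 or fewer" and "16 or more" groupings used in the text.

  import numpy as np
  from scipy import stats

  # Observed counts for 0 through 18 requests, then "19 and over"
  observed = np.array([0, 1, 2, 3, 4, 3, 5, 7, 11, 9,
                       18, 14, 17, 18, 25, 28, 23, 17, 9, 11])
  n = observed.sum()                                 # 225 time periods

  probs = stats.poisson.pmf(np.arange(19), mu=10)    # P(X = 0), ..., P(X = 18)
  probs = np.append(probs, 1 - probs.sum())          # P(X >= 19)
  expected = probs * n

  # Collapse sparse cells as in the example: "4 or fewer" and "16 or more"
  obs_c = np.concatenate(([observed[:5].sum()], observed[5:16], [observed[16:].sum()]))
  exp_c = np.concatenate(([expected[:5].sum()], expected[5:16], [expected[16:].sum()]))

  chi2_stat = ((obs_c - exp_c) ** 2 / exp_c).sum()
  chi2_crit = stats.chi2.ppf(0.95, df=len(obs_c) - 1)   # df = 13 - 1 = 12, so 21.0261

  print(f"chi-square = {chi2_stat:.1f}, critical value = {chi2_crit:.4f}")
  # The statistic is roughly 338, far beyond 21.0261, so H0 is rejected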
MyStatLab

13-1: Exercises

Skill Development

13-1 A large retailer receives shipments of batteries for consumer electronic products in packages of 50 batteries. The packages are held at a distribution center and are shipped to retail stores as requested. Because some packages may contain defective batteries, the retailer randomly samples 400 packages from its distribution center and tests to determine whether the batteries are defective or not. The most recent sample of 400 packages revealed the following observed frequencies for defective batteries per package:

  # of Defective Batteries per Package    Frequency of Occurrence
  0                                       165
  1                                       133
  2                                       65
  3                                       28
  4 or more                               9

The retailer's managers would like to know if they can evaluate this sampling plan using a binomial distribution with n = 50 and p = 0.02. Test at the α = 0.01 level of significance.

13-2 The following frequency distribution shows the number of times an outcome was observed from the toss of a die. Based on the frequencies that were observed from 2,400 tosses of the die, can it be concluded at the 0.05 level of significance that the die is fair?

  Outcome    Frequency
  1          352
  2          418
  3          434
  4          480
  5          341
  6          375

13-3 Based on the sample data in the following frequency distribution, conduct a test to determine whether the population from which the sample data were selected is Poisson distributed with the mean stated. Test using α = 0.05. (Several table entries did not survive in the source; the surviving values are shown as printed.)

  x            Frequency    x             Frequency
  1 or less    29           10            53
  2            35           11            28
  3            52           12            18
  4            77           13            13
  5            77           14 or more    72
  6            26
  7            13           Total         500

13-4 A chi-square goodness-of-fit test is to be conducted to test whether a population is normally distributed. No statement has been made regarding the value of the population mean and standard deviation. A frequency distribution has been formed based on a random sample of 1,000 values. The frequency distribution has k classes. Assuming that the test is to be conducted at the α = 0.10 level, determine the correct decision rule to be used.

13-5 An experiment is run that is claimed to have a binomial distribution with p = 0.15 and n = 18, and the number of successes is recorded. The experiment is conducted 200 times with the following results:

  Number of Successes    Observed Frequency
  0                      80
  1                      75
  2                      39
  3                      6

Using a significance level of 0.01, is there sufficient evidence to conclude that the distribution is binomially distributed with p = 0.15 and n = 18?

13-6 Data collected from a hospital emergency room reflect the number of patients per day that visited the emergency room due to cardiac-related symptoms. It is believed that the distribution of the number of cardiac patients entering the emergency room per day over a two-month period has a Poisson distribution with the stated mean number of patients per day:

  12  8  12  11  10  10  7  7  8  11
  11  14  7  7  7  10  10  16  10  12
  6   9  9   7   9   5   6  6  12  10
  8   9  8   7   9  10  10  7  10  10
  9   5  11  9   7  10  10  7  10  10

Use a chi-square goodness-of-fit test to determine if the data come from a Poisson distribution with that mean. Test using a significance level of 0.01.

Business Applications

13-7 HSBC Bank is a large, London-based international banking company. One of its most important sources of income is home loans. A component of its effort to maintain and increase its customer base is excellent service. The loan manager at one of its branches in New York keeps track of the number of loan applicants who visit his branch's loan department per week. Having enough loan officers available is one of the ways of providing excellent service. Over the last year, the loan manager accumulated the following data (only part of the table survived in the source):

  Number of Customers
  Frequencies    11  14

From previous years, the manager believes that the distribution of the number of customer arrivals has a Poisson distribution with an average of 3.5 loan applicants per week. Determine if the loan officer's belief is correct using a significance level of 0.025.

13-8 Managers of a major book publisher believe that the occurrence of typographical errors in the books the company publishes is Poisson distributed with a mean of 0.2 per page. Because of some customer quality complaints, the managers have arranged for a test to be conducted to determine if the error distribution still holds. A total of 400 pages were randomly selected, and the number of errors per page was counted. These data are summarized in the following frequency distribution:

  Errors       Frequency
  0            335
  1            56
  2 or more    9
  Total        400

Conduct the appropriate hypothesis test using a significance level equal to 0.01. Discuss the results.

13-9 The Baltimore Steel and Pipe Company recently developed a new pipe product for a customer. According to specifications, the pipe is supposed to have an average outside diameter of 2.00 inches with a standard deviation equal to 0.10 inch, and the
distribution of outside diameters is to be normally distributed. Before going into full-scale production, the company selected a random sample of 30 sections of pipe from the initial test run. The following data were recorded:

  Pipe Section    Diameter (inches)    Pipe Section    Diameter (inches)
  1               2.04                 16              1.96
  2               2.13                 17              1.89
  3               2.07                 18              1.99
  4               1.99                 19              2.13
  5               1.90                 20              1.90
  6               2.06                 21              1.91
  7               2.19                 22              1.95
  8               2.01                 23              2.18
  9               2.05                 24              1.94
  10              1.98                 25              1.93
  11              1.95                 26              2.08
  12              1.90                 27              1.82
  13              2.10                 28              1.94
  14              2.02                 29              1.96
  15              2.11                 30              1.81

a. Using a significance level of 0.01, perform the appropriate test.
b. Based on these data, should the company conclude that it is meeting the product specifications? Explain your reasoning.

13-10 Quality control managers work in every type of production environment possible, from producing dictionaries to dowel cutting for boat plugs. The Cincinnati Dowel & Wood Products Co., located in Mount Orab, Ohio, manufactures wood dowels and wood turnings. Four-inch-diameter boat plugs are one of its products. The quality control procedures aimed at maintaining the 4-inch diameter are valid only if the diameters have a normal distribution. The quality control manager recently obtained the following summary of diameters taken from randomly selected boat plugs on the production line (most of the frequency entries did not survive in the source):

  Interval         Frequency    Interval         Frequency
  < 3.872          11           4.001–4.025
  3.872–3.916                   4.026–4.052
  3.917–3.948                   4.053–4.084
  3.949–3.975                   4.085–4.128
  3.976–4.000                   > 4.128

The boat plug diameters are specified to have a normal distribution with a mean of 4 inches and a standard deviation of 0.10. Determine if the distribution of the 4-inch boat plugs is currently adhering to specification. Use a chi-square goodness-of-fit test and a significance level of 0.05.

Computer Database Exercises

13-11 The owners of Big Boy Burgers are considering remodeling their facility to include a drive-thru window. There will be room for three cars in the drive-thru line if they build it. However, they are concerned that the capacity may be too low during their busy lunchtime hours between 11:00 A.M. and 1:30 P.M. One of the factors they need to know is the distribution of the length of time it takes to fill an order for cars coming to the drive-thru. To collect information on this, the owners have received permission from a similar operation owned by a relative in a nearby town to collect some data at that drive-thru. The data in the file called Clair's Deli reflect the service time per car. Based on these sample data, is there sufficient evidence to conclude that the distribution of service time is not normally distributed?
Test using the chi-square distribution and α = 0.05.

13-12 Executives at The Walt Disney Company are interested in estimating the mean spending per capita for people who visit Disney World in Orlando, Florida. Since they do not know the population standard deviation, they plan to use the t-distribution (see Chapter 9) to conduct the test. However, they realize that the t-distribution requires that the population be normally distributed. Six hundred customers were randomly surveyed, and the amount spent during their stay at Disney World was recorded. These data are in the file called Disney. Before using these sample data to estimate the population mean, the managers wish to test whether the population is normally distributed.
a. State the appropriate null and alternative hypotheses.
b. Organize the data into six classes and form the grouped-data frequency distribution (refer to Chapter 2).
c. Using the sample mean and sample standard deviation, calculate the expected frequencies, assuming that the null hypothesis is true.
d. Compute the test statistic and compare it to the appropriate critical value for a significance level equal to 0.05. What conclusion should be reached? Discuss.

13-13 Again working with the data in Problem 13-11, the number of cars that arrive in each 10-minute period is another factor that will determine whether there will be the capacity to handle the drive-thru business. In addition to studying the service times, the owners also counted the number of cars that arrived at the deli in the nearby town in a sample of 10-minute time periods. These data are as follows:

  3  3  0  0  2  2  3  1  3  3

Based on these data, is there evidence to conclude that the arrivals are not Poisson distributed? State the appropriate null and alternative hypotheses and test using a significance level of 0.025.

13-14 Damage to homes caused by burst piping can be expensive to repair. By the time the leak is discovered, hundreds of gallons of water may have already flooded the home. Automatic shutoff valves can prevent extensive water damage from plumbing failures. The valves contain sensors that cut off water flow in the event of a leak, thereby preventing flooding. One important characteristic is the time (in milliseconds) required for the sensor to detect the water flow. The data obtained for four different shutoff valves are contained in the file entitled Waterflow. The differences between the observed time for the sensor to detect the water flow and the predicted time (termed residuals) are listed and are assumed to be normally distributed. Using the four sets of residuals given in the data file, determine if the residuals have a normal distribution. Use a chi-square goodness-of-fit test and a significance level of 0.05. Use five groups of equal width to conduct the test.

13-15 An article in the San Francisco Chronicle indicated that just 38% of drivers crossing the San Francisco Bay Area's seven state-owned bridges pay their tolls electronically, compared with rates nearing 80% at systems elsewhere in the nation. Albert Yee, director of highway and arterial operations for the regional Metropolitan Transportation Commission, indicated that the commission is eager to drive up the percentage of tolls paid electronically. In an attempt to see if its efforts are producing the required results, 15 vehicles each day are tracked through the toll lanes of the Bay Area bridges. The number of drivers using electronic payment to pay their toll for a period of three months appears in the file entitled Fastrak.
a. Determine if the distribution of the number of FasTrak users could be described as a binomial distribution with a population proportion equal to 0.50. Use a chi-square goodness-of-fit test and a significance level of 0.05.
b. Conduct a test of hypothesis to determine if the percentage of tolls paid electronically has increased to more than 70% since Yee's efforts.

END EXERCISES 13-1

13.2 Introduction to Contingency Analysis  (Chapter Outcome 2)

In Chapters 9 and 10 you were introduced to hypothesis tests involving one and two population proportions. Although these techniques are useful in many cases, you will also encounter many situations involving multiple population proportions. For example, a mutual fund company offers six different mutual funds. The president of the company may wish to determine if the proportion of customers selecting each mutual fund is related to the four sales regions in which the customers reside. A hospital administrator who collects service-satisfaction data from patients might be interested in determining whether there is a significant difference in patient rating by hospital department. A personnel manager for a large corporation might be interested in determining whether there is a relationship between the level of employee job satisfaction and job classification. In each of these cases, the proportions relate to characteristic categories of the variable of interest. The six mutual funds, four sales regions, hospital departments, and job classifications are all specific categories. These situations involving categorical data call for a new statistical tool known as contingency analysis, which helps decision makers when multiple proportions are involved. Contingency analysis can be used when the level of data measurement is either nominal or ordinal and the values are determined by counting the number of occurrences in each category.

Contingency Tables

BUSINESS APPLICATION  Applying Contingency Analysis

DALGARNO PHOTO, INC.  Dalgarno Photo, Inc., gets much of its business from taking photographs for college yearbooks. Dalgarno hired a first-year masters of business administration (MBA) student to develop the survey it mailed to 850 yearbook representatives at the colleges and universities in its market area. The representatives were unaware that Dalgarno Photo had developed the survey. The survey asked about the photography and publishing activities associated with yearbook development. For instance, what photographer and publisher services did the schools use, and what factors were most important in selecting services?
The survey instrument contained 30 questions, which were coded into 137 separate variables. Among his many interests in this study, Dalgarno's marketing manager questioned whether college funding source and gender of the yearbook editor were related in some manner. To analyze this issue, we examine these two variables more closely. Source of university funding is a categorical variable, coded as follows:

1 = Private funding    2 = State funding

Of the 221 respondents who provided data for this variable, 155 came from privately funded colleges or universities and 66 were from state-funded institutions. The second variable, gender of the yearbook editor, is also a categorical variable, with two response categories, coded as follows:

1 = Male    2 = Female

Of the 221 responses to the survey, 164 were from females and 57 were from males. In cases in which the variables of interest are both categorical and the decision maker is interested in determining whether a relationship exists between the two, a statistical technique known as contingency analysis is useful.

Contingency Table: A table used to classify sample observations according to two or more identifiable characteristics. It is also called a crosstabulation table.

We first set up a two-dimensional table called a contingency table. The contingency table for these two variables is shown in Table 13.3.

TABLE 13.3 | Contingency Table for Dalgarno Photo

                 Source of Funding
Gender       Private      State
Male             14          43       57
Female          141          23      164
                155          66      221

Table 13.3 shows that 14 of the respondents were males from schools that are privately funded. The numbers at the extreme right and along the bottom are called the marginal frequencies. For example, 57 respondents were males, and 155 respondents were from privately funded institutions. The issue of whether there is a relationship between responses to these two variables is formally addressed through a hypothesis test, in which the null and alternative hypotheses are stated as follows:

H0: Gender of yearbook editor is independent of the college's funding source.
HA: Gender of yearbook editor is not independent of the college's funding source.

If the null hypothesis is true, the population proportion of yearbook editors from private institutions who are males should be equal to the proportion of male editors from state-funded institutions. These two proportions should also equal the population proportion of male editors without regard to a school's funding source. To illustrate, we can use the sample data to determine the sample proportion of male editors as follows:

p_M = Number of male editors / Number of respondents = 57/221 = 0.2579

Then, if the null hypothesis is true, we would expect 25.79% of the 155 privately funded schools, or 39.98 schools, to have a male yearbook editor. We would also expect 25.79% of the 66 state-funded schools, or 17.02, to have male yearbook editors. (Note that the expected numbers need not be integer values. Note also that the sum of the expected frequencies in any column or row adds up to the marginal frequency.)
We can use this reasoning to determine the expected number of respondents in each cell of the contingency table, as shown in Table 13.4. You can simplify the calculations needed to produce the expected values for each cell. Note that the first cell's expected value, 39.98, was obtained by the following calculation:

e11 = 0.2579(155) = 39.98

However, because the probability, 0.2579, is calculated by dividing the row total, 57, by the grand total, 221, the calculation can also be represented as

e11 = (Row total)(Column total)/Grand total = (57)(155)/221 = 39.98

TABLE 13.4 | Contingency Table for Dalgarno Photo

                     Source of Funding
Gender          Private           State
Male         o11 = 14          o12 = 43           57
             e11 = 39.98       e12 = 17.02
Female       o21 = 141         o22 = 23          164
             e21 = 115.02      e22 = 48.98
                  155               66            221

As a further example, we can calculate the expected value for the next cell in the same row. The expected number of male yearbook editors in state-funded schools is

e12 = (Row total)(Column total)/Grand total = (57)(66)/221 = 17.02

Keep in mind that the row and column totals (the marginal frequencies) must be the same for the expected values as for the observed values. Therefore, when there is only one cell left in a row or a column for which you must calculate an expected value, you can obtain it by subtraction. So, as an example, the expected value e12 could have been calculated as

e12 = 57 − 39.98 = 17.02

Allowing for sampling error, we would expect the actual frequencies in each cell to approximately match the corresponding expected cell frequencies when the null hypothesis is true. The greater the difference between the actual and the expected frequencies, the more likely the null hypothesis of independence is false and should be rejected. The statistical test to determine whether the sample data support or refute the null hypothesis is given by Equation 13.2. Do not be confused by the double summation in Equation 13.2; it merely indicates that all rows and columns must be used in calculating χ².

Chi-Square Contingency Test Statistic

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (o_ij − e_ij)²/e_ij,  with df = (r − 1)(c − 1)    (13.2)

where:
o_ij = Observed frequency in cell (i, j)
e_ij = Expected frequency in cell (i, j)
r = Number of rows
c = Number of columns

As was the case in the goodness-of-fit tests, the degrees of freedom are the number of independent data values obtained from the experiment. In any given row, once you know c − 1 of the data values, the remaining data value is determined. For instance, once you know that 14 of the 57 male editors were from privately funded institutions, you know that 43 were from state-funded institutions. Similarly, once r − 1 data values in a column are known, the remaining data value is determined. Therefore, the degrees of freedom are obtained by the expression (r − 1)(c − 1).

Figure 13.7 presents the hypotheses and test results for this example. As was the case in the goodness-of-fit tests, the test statistic has a distribution that can be approximated by the chi-square distribution if the expected values are larger than 5.

FIGURE 13.7 | Chi-Square Contingency Analysis Test for Dalgarno Photo

Hypotheses:
H0: Gender of yearbook editor is independent of college's funding source.
HA: Gender of yearbook editor is not independent of college's funding source.
α = 0.05

Test Statistic:
χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (o_ij − e_ij)²/e_ij
   = (14 − 39.98)²/39.98 + (43 − 17.02)²/17.02 + (141 − 115.02)²/115.02 + (23 − 48.98)²/48.98
   = 76.19

df = (r − 1)(c − 1) = (1)(1) = 1,  χ²_0.05 = 3.8415

Decision Rule: If χ² > 3.8415, reject H0; otherwise, do not reject H0.
Because 76.19 > 3.8415, reject H0. Thus, the gender of the yearbook editor and the school's source of funding are not independent.

Note that the calculated chi-square statistic is compared to the tabled value of chi-square for α = 0.05 and degrees of freedom = (2 − 1)(2 − 1) = 1. Because χ² = 76.19 > 3.8415, the null hypothesis of independence should be rejected. Dalgarno Photo representatives should conclude that the gender of the yearbook editor and each school's source of funding are not independent. By examining the data in Figure 13.7, you can see that private schools are more likely to have female editors, whereas state schools are more likely to have male yearbook editors.

EXAMPLE 13-4 CONTINGENCY ANALYSIS

Wireridge Marketing. Before releasing a major advertising campaign to the media, Wireridge Marketing managers run a test on the media material. Recently, they randomly called 100 people and asked them to listen to a commercial that was slated to run nationwide on the radio. At the end of the commercial, the respondents were asked to name the company that was in the advertisement. The company is interested in determining whether there is a relationship between gender and a person's ability to recall the company name. To test this, the following steps can be used:

Step 1: Specify the null and alternative hypotheses.
The company is interested in testing whether a relationship exists between gender and recall ability. Here are the appropriate null and alternative hypotheses:
H0: Ability to correctly recall the company name is independent of gender.
HA: Recall ability and gender are not independent.

Step 2: Determine the significance level.
The test will be conducted using a 0.01 level of significance.

Step 3: Determine the critical value.
The critical value for this test will be the chi-square value with (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1 degree of freedom and α = 0.01. From Appendix G, the critical value is 6.6349.

Step 4: Collect the sample data and compute the chi-square test statistic using Equation 13.2.
The following contingency table shows the results of the sampling:

                   Female    Male    Total
Correct Recall         33      25       58
Incorrect Recall       22      20       42
Total                  55      45      100

Note that 58 percent of the entire sample correctly recalled the company name. If the ability to correctly recall the company name is independent of gender, you would expect the same percentage (58%) to occur for each gender. Thus, 58% of the males [0.58(45) = 26.10] would be expected to have a correct recall. In general, the expected cell frequencies are determined by multiplying the row total by the column total and dividing by the overall sample size. For example, for the cell corresponding to female and correct recall, we get

Expected = (58)(55)/100 = 31.90

The expected cell values for all cells are:

                   Female          Male           Total
Correct Recall     o = 33          o = 25            58
                   e = 31.90       e = 26.10
Incorrect Recall   o = 22          o = 20            42
                   e = 23.10       e = 18.90
Total                  55              45           100

After checking to make sure all the expected cell frequencies are at least 5, the test statistic is computed using Equation 13.2:

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (o_ij − e_ij)²/e_ij
   = (33 − 31.90)²/31.90 + (25 − 26.10)²/26.10 + (22 − 23.10)²/23.10 + (20 − 18.90)²/18.90 = 0.20

Step 5: Reach a decision.
Because χ² = 0.20 < 6.6349, do not reject the null hypothesis.

Step 6: Draw a conclusion.
Based on the sample data, there is no reason to believe that being able to recall the name of the company in the ad is related to gender.

>> END EXAMPLE

TRY PROBLEM 13-17 (pg. 569)
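For readers who want to verify the arithmetic in Example 13-4 outside Excel or Minitab, the test can be reproduced in a few lines of Python. This is a minimal sketch using scipy's chi2_contingency function with the counts from the example; the Yates continuity correction is turned off so the statistic matches the hand calculation above.

# Chi-square test of independence for the Wireridge recall data (Example 13-4)
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[33, 25],    # correct recall: female, male
                     [22, 20]])   # incorrect recall: female, male

# correction=False disables the Yates continuity correction that scipy
# applies by default to 2 x 2 tables, so the statistic matches the text.
chi2_stat, p_value, df, expected = chi2_contingency(observed, correction=False)

print(round(chi2_stat, 2), df)    # 0.2  1
print(expected)                   # [[31.9 26.1] [23.1 18.9]]
# Because 0.20 < 6.6349 (the critical value for alpha = 0.01 and df = 1),
# the null hypothesis of independence is not rejected.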
r × c Contingency Tables

BUSINESS APPLICATION: LARGER CONTINGENCY TABLES

BENTON STONE & TILE Benton Stone & Tile makes a wide variety of products for the building industry. It pays market wages, provides competitive benefits, and offers attractive options for employees in an effort to create a satisfied workforce and reduce turnover. Recently, however, several supervisors have complained that employee absenteeism is becoming a problem. In response to these complaints, the human resources manager studied a random sample of 500 employees. One aim of this study was to determine whether there is a relationship between absenteeism and marital status. Absenteeism during the past year was broken down into three levels:

0 absences    1 to 5 absences    Over 5 absences

Marital status was divided into four categories:

Single    Married    Divorced    Widowed

Table 13.5 shows the contingency table for the sample of 500 employees. The table is also shown in the file Benton. The null and alternative hypotheses to be tested are

H0: Absentee behavior is independent of marital status.
HA: Absentee behavior is not independent of marital status.

TABLE 13.5 | Contingency Table for Benton Stone & Tile

                          Absentee Rate
Marital Status        0      1–5    Over 5    Row Totals
Single               84       82       34          200
Married              50       64       36          150
Divorced             50       34       16          100
Widowed              16       20       14           50
Column Total        200      200      100          500

As with 2 × 2 contingency analysis, the test for independence can be made using the chi-square test, where the expected cell frequencies are compared to the actual cell frequencies and the test statistic shown as Equation 13.2 is used. The logic of the test says that if the actual and expected frequencies closely match, then the null hypothesis of independence is not rejected. However, if the actual and expected cell frequencies are substantially different overall, the null hypothesis of independence is rejected. The calculated chi-square statistic is compared to an Appendix G critical value for the desired significance level and degrees of freedom equal to (r − 1)(c − 1). The expected cell frequencies are determined assuming that the row and column variables are independent. This means, for example, that the probability of a married person being absent more than 5 days during the year is the same as the probability of any employee being absent more than 5 days. An easy way to compute the expected cell frequencies, e_ij, is given by Equation 13.3.

Expected Cell Frequencies

e_ij = (ith row total)(jth column total)/total sample size    (13.3)

For example, the expected cell frequency for row 1, column 1 is

e11 = (200)(200)/500 = 80

and the expected cell frequency for row 2, column 3 is

e23 = (150)(100)/500 = 30

Figures 13.8a and 13.8b show the completed contingency table with the actual and expected cell frequencies that were developed using Excel and Minitab.

FIGURE 13.8A | Excel 2007 Output—Benton Stone & Tile Contingency Analysis Test

Excel 2007 Instructions:
1. Open file: Benton.xls.
2. Compute the expected cell frequencies using an Excel formula; for example, the first expected frequency is found using =(D$16*$F13)/$F$16.
3. Compute the chi-square statistic using an Excel formula.

The calculated chi-square test value is computed as follows:

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (o_ij − e_ij)²/e_ij
   = (84 − 80)²/80 + (82 − 80)²/80 + … + (20 − 20)²/20 + (14 − 10)²/10
   = 10.88
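The same r × c computation can be scripted. Below is a minimal Python sketch using scipy, with the observed counts taken from Table 13.5; for tables larger than 2 × 2, scipy applies no continuity correction, so the result matches the hand calculation.

# Chi-square test of independence for the Benton Stone & Tile data (Table 13.5)
import numpy as np
from scipy.stats import chi2_contingency, chi2

observed = np.array([[84, 82, 34],    # Single
                     [50, 64, 36],    # Married
                     [50, 34, 16],    # Divorced
                     [16, 20, 14]])   # Widowed; columns: 0, 1-5, over 5 absences

chi2_stat, p_value, df, expected = chi2_contingency(observed)
critical_value = chi2.ppf(1 - 0.05, df)   # upper-tail critical value, alpha = 0.05

print(round(chi2_stat, 3), df)            # 10.883  6
print(round(critical_value, 4))           # 12.5916
# Because 10.883 < 12.5916, do not reject H0: based on these data,
# absenteeism and marital status cannot be declared dependent.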
FIGURE 13.8B | Minitab Output—Benton Stone & Tile Contingency Analysis Test

Minitab Instructions:
1. Open file: Benton.MTW.
2. Choose Stat > Tables > Chi-Square Test.
3. In Columns containing the table, enter the data columns.
4. Click OK.

The degrees of freedom are (r − 1)(c − 1) = (4 − 1)(3 − 1) = 6. You can use the chi-square table in Appendix G to get the chi-square critical value for α = 0.05 and 6 degrees of freedom, or you can use Minitab's Probability Distributions command or Excel's CHIINV function (CHIINV(0.05,6) = 12.5916). Because the calculated chi-square value (10.883) shown in Figures 13.8a and 13.8b is less than 12.5916, we cannot reject the null hypothesis. Based on these sample data, there is insufficient evidence to conclude that absenteeism and marital status are not independent.

Chi-Square Test Limitations

The chi-square distribution is only an approximation for the true distribution for contingency analysis. We use the chi-square approximation because the true distribution is impractical to compute in most instances. However, the approximation (and, therefore, the conclusion reached) is quite good when all expected cell frequencies are at least 5.0. When expected cell frequencies drop below 5.0, the calculated chi-square value tends to be inflated and may inflate the true probability of a Type I error beyond the stated significance level. As a rule, if the null hypothesis is not rejected, you do not need to worry when the expected cell frequencies drop below 5.0.

There are two alternatives that can be used to overcome the small expected-cell-frequency problem. The first is to increase the sample size. This may increase the marginal frequencies in each row and column enough to increase the expected cell frequencies. The second option is to combine the categories of the row and/or column variables. If you decide to group categories together, there should be some logic behind the resulting categories. You don't want to lose the meaning of the results through poor groupings. You will need to examine each situation individually to determine whether the option of grouping classes to increase expected cell frequencies makes sense.
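Both the expected-frequency check and the category-combining remedy described above can be automated. The following is a hedged Python sketch reusing the Benton counts; the pooling of the last two rows is shown purely for illustration, since the Benton table itself needs no pooling.

# Checking the "expected frequency >= 5" rule and pooling sparse categories
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats.contingency import expected_freq

observed = np.array([[84, 82, 34],
                     [50, 64, 36],
                     [50, 34, 16],
                     [16, 20, 14]])

expected = expected_freq(observed)   # (row total)(column total) / grand total
print((expected < 5).any())          # False: every expected count is at least 5

# If any expected count were below 5, one remedy is to combine adjacent
# categories before re-running the test -- here, hypothetically pooling the
# Divorced and Widowed rows into a single row:
pooled = np.vstack([observed[:2], observed[2] + observed[3]])
chi2_stat, p_value, df, _ = chi2_contingency(pooled)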
13-2: Exercises

Skill Development

13-16 The billing department of a national cable service company is conducting a study of how customers pay their monthly cable bills. The cable company accepts payment in one of four ways: in person at a local office, by mail, by credit card, or by electronic funds transfer from a bank account. The cable company randomly sampled 400 customers to determine if there is a relationship between the customer's age and the payment method used. The following sample results were obtained:

Age of Customer: 20–30, 31–40, 41–50, Over 50
Payment Method: In Person, By Mail, By Credit Card, By Funds Transfer
29  26  23  12  67  19  35  11  72  17  13  50

Based on the sample data, can the cable company conclude that there is a relationship between the age of the customer and the payment method used? Conduct the appropriate test at the α = 0.01 level of significance.

13-17 A contingency analysis table has been constructed from data obtained in a phone survey of customers in a market area in which respondents were asked to indicate whether they owned a domestic or foreign car and whether they were a member of a union or not. The following contingency table is provided:

                        Car
Union         Domestic     Foreign
Yes                155          40
No                 470         325

a. Use the chi-square approach to test whether the type of car owned (domestic or foreign) is independent of union membership. Test using an α = 0.05 level.
b. Calculate the p-value for this hypothesis test.

13-18 Utilize the following contingency table to answer the questions listed below:

         C1      C2
R1       51     207
R2      146     185
R3      240     157

a. State the relevant null and alternative hypotheses.
b. Calculate the expected values for each of the cells.
c. Compute the chi-square test statistic for the hypothesis test.
d. Determine the appropriate critical value and reach a decision for the hypothesis test. Use a significance level of 0.05.
e. Obtain the p-value for this hypothesis test.

13-19 A manufacturer of sports drinks has randomly sampled 198 men and 202 women. Each sampled participant was asked to taste an unflavored version and a flavored version of a new sports drink currently in development. The participants' preferences are shown below:

           Flavored    Unflavored
Men             101            97
Women            68           134

a. State the relevant null and alternative hypotheses.
b. Conduct the appropriate test and state a conclusion. Use a level of significance of 0.05.

13-20 A marketing research firm is conducting a study to determine if there is a relationship between an individual's age and the individual's preferred source of news. The research firm asked 1,000 individuals to list their preferred source for news: newspaper, radio and television, or the Internet. The following results were obtained:

                               Age of Respondent
Preferred News Source     20–30    31–40    41–50    Over 50
Newspaper                    19       62       95        147
Radio/TV                     27      125      168         88
Internet                    104      113       37         15

At the 0.01 level of significance, can the marketing research firm conclude that there is a relationship between the age of the individual and the individual's preferred source for news?

13-21 A loan officer wished to determine if the marital status of loan applicants was independent of the approval of loans. The following table presents the results of her survey:

             Single    Married    Divorced
Approved        213        374         358
Rejected        189        231         252

a. Conduct the appropriate hypothesis test that will provide an answer to the loan officer. Use a significance level of 0.01.
b. Calculate the p-value for the hypothesis test in part a.

13-22 An instructor in a large accounting class is interested in determining whether the grades that students get are related to how close to the front of the room the students sit. He has categorized the room seating as "Front," "Middle," and "Back." The following data were collected over two sections with 400 total students:

Grade: A, B, C, D, F (grade totals 28, 112, 229, 28, 3; overall total 400)
Seating: Front, Middle, Back (row totals 106, 156, 138)
18  55  42  15  30  95  104  11  14

Based on the sample data, can you conclude that there is a dependency relationship between seating location and grade, using a significance level equal to 0.05?

13-23 A study was conducted to determine if there is a difference between the investing preferences of midlevel managers working in the public and private sectors in New York City. A random sample of 320 public sector employees and 380 private sector employees was taken. The sampled participants were then asked about their retirement investment decisions and classified as being either "aggressive," if they invested only in stocks or stock mutual funds, or "balanced," if they invested in some combination of stocks, bonds, cash, and other. The following results were found:

            Aggressive    Balanced
Public             164         156
Private            236         144

a. State the hypothesis of interest and conduct the appropriate hypothesis test to determine whether there is a relationship between employment sector and investing preference. Use a level of significance of 0.01.
b. State the conclusion of the test conducted in part a.
c. Calculate the p-value for the hypothesis test conducted in part a.

13-24 The following table classifies a stock's price change as up, down, or no change for both today's and yesterday's prices. Price changes were examined for 100 days. A financial theory states that stock prices follow what is called a "random walk." This means, in part, that the price change today for a stock must be independent of yesterday's price change. Test the hypothesis that daily stock price changes for this stock are independent. Let α = 0.05.

                        Price Change Previous Day
Price Change Today      Up      No Change      Down
Up                      14          16
No Change               16          14
Down                                             12

13-25 A local appliance retailer handles four washing machine models for a major manufacturer: standard, deluxe, superior, and XLT. The marketing manager has recently conducted a study on the purchasers of the washing machines. The study recorded the model of appliance purchased and the credit account balance of the customer at the time of purchase. The sample data are in the following table. Based on these data, is there evidence of a relationship between the account balance and the model of washer purchased?
Use a significance level of 0.025. Conduct the test using a p-value approach.

Credit Balance: Under $200, $200–$800, Over $800
Washer Model Purchased: Standard, Deluxe, Superior, XLT
10  16  16  12  12  40  24  16  15  30

13-26 A random sample of 980 heads of households was taken from the customer list for State Bank and Trust. Those sampled were asked to classify their own attitudes and their parents' attitudes toward borrowing money as follows:

A: Borrow only for real estate and car purchases
B: Borrow for short-term purchases such as appliances and furniture
C: Never borrow money

The following table indicates the responses from those in the study:

                     Respondent
Parent          A        B        C
A             240      180      180
B              80      120       80
C              20       40       40

Test the hypothesis that the respondents' borrowing habits are independent of what they believe their parents' attitudes to be. Let α = 0.01.

13-27 The California Lettuce Research Board was originally formed as the Iceberg Lettuce Advisory Board in 1973. The primary function of the board is to fund research on iceberg and leaf lettuce. A recent project involved studying the effect of varying levels of sodium absorption ratios (SAR) on the yield of head lettuce. The measurements (the number of lettuce heads from each plot) were as follows:

                Lettuce Type
SAR        Salinas      Sniper
               104         109
               160         163
               142         146
10             133         156

a. Determine if the number of lettuce heads harvested for the two lettuce types is independent of the levels of sodium absorption ratios (SAR). Use a significance level of 0.025 and a p-value approach.
b. Which type of lettuce would you recommend?

13-28 In its ninth year, the Barclaycard Business Travel Survey has become an information source for business travelers not only in the United Kingdom but internationally as well. Each year, as a result of the research, Barclaycard Business has been able to predict and comment on trends within the business travel industry. One question asked in the 2003/2004 and 2004/2005 surveys was, "Have you considered reducing hours spent away from home to increase quality of life?" The following table represents the responses:

                          2003/2004    2004/2005    Total
Yes—have reduced                400          384      784
Yes—not been able to            400          300      700
No—not certain                  400          516      916
Total                         1,200        1,200    2,400

a. Determine if the response to the survey question was independent of the year in which the question was asked. Use a significance level of 0.05.
b. Determine if there is a significant difference between the proportions of travelers who say they have reduced hours spent away from home between the 2003/2004 and the 2004/2005 years.

Computer Database Exercises
13-29 Daniel Vinson of the University of Missouri–Columbia led a team of researchers investigating the increased risk, when people are angry, of serious injuries in the workplace requiring emergency medical care. The file entitled Angry contains the data collected by the team of researchers. It displays the emotions reported by patients just before they were injured.
a. Use the data in the file entitled Angry to construct a contingency table.
b. Determine if the type of emotion felt by patients just before they were injured is independent of the severity of that emotion. Use a contingency analysis and a significance level of 0.05.

13-30 Gift of the Gauche, a left-handedness information Web site (www.left-handedness.info), provides information concerning left-handed activities, products, and demography. It indicates that about 10%–11% of the population of Europe and North America is left-handed. It also reports on demographic surveys. It cites an American study of over one million magazine respondents, which found that 12.6% of the male respondents were left-handed, as were 9.9% of the female respondents, although this was not a random sample. The data obtained by a British survey of over 8,000 randomly selected men and women, published in Issue 37 of The Graphologist in 1992, are furnished in a file entitled Lefties. Based on these data, determine if the "handedness" of an individual is independent of gender. Use a significance level of 0.01 and a p-value approach.

13-31 The Marriott Company owns and operates a number of different hotel chains, including the Courtyard chain. Recently, a survey was mailed to a random sample of 400 Courtyard customers. A total of 62 customers responded to the survey. The customers were asked a number of questions related to their satisfaction with the chain as well as several demographic questions. Among the issues of interest was whether there is a relationship between the likelihood that customers will stay at the chain again and whether or not this was the customer's first stay at the Courtyard. The following contingency table has been developed from the data set contained in the file called CourtyardSurvey:

Stay Again?: Definitely Will, Probably Will, Maybe, Probably Not
First Stay?: Yes, No
18  15  12  21  20  18
Total: Yes 44, No 18, overall 62

Using a significance level equal to 0.05, test to see whether these sample data imply a relationship between the two variables. Discuss the results.

13-32 ECCO (Electronic Controls Company) makes backup alarms that are used on such equipment as forklifts and delivery trucks. The quality manager recently performed a study involving a random sample of 110 warranty claims. One of the questions the manager wanted to answer was whether there is a relationship between the type of warranty complaint and the plant at which the alarm was made. The data are in the file called ECCO.
a. Calculate the expected values for the cells in this analysis. Suggest a way in which cells can be combined to assure that the expected value of each cell is at least 5, so that as many level combinations of the two variables as possible are retained.
b. Using a significance level of 0.01, conduct the relevant hypothesis test and provide an answer to the manager's question.

13-33 Referring to Problem 13-32, can the quality control manager conclude that the type of warranty problem is independent of the shift on which the alarm was manufactured?
Test using a significance level of 0.05. Discuss your results.

END EXERCISES 13-2

Visual Summary

Chapter 13: Many of the statistical procedures introduced in earlier chapters require that the sample data come from populations that are normally distributed and that the data be measured at least at the interval level. However, you will encounter many situations in which these specifications are not met. This chapter introduces two sets of procedures that are used to address each of these issues in turn. Goodness-of-fit procedures are used to determine if sample data have been drawn from a specific hypothesized distribution. Contingency analysis is an often-used technique for determining the relationship between qualitative variables. Though these procedures are not used as much as those requiring normally distributed populations, you will discover that far too few of the procedures presented in this chapter are used when they should be. It is, therefore, important that you learn when and how to use the procedures presented in this chapter.

13.1 Introduction to Goodness-of-Fit Tests (pg. 548–562)

Summary The chi-square goodness-of-fit test can be used to determine if a set of data comes from a specific hypothesized distribution. Recall that several of the procedures presented in Chapters 8–12 require that the sampled populations are normally distributed; for example, tests involving the t-distribution are based on such a requirement. In order to verify this requirement, the goodness-of-fit test determines if the observed set of values agrees with a set of data obtained from a specified probability distribution. The goodness-of-fit test is perhaps most often used to verify a normal distribution; however, it can be used to detect many other probability distributions.

Outcome 1. Utilize the chi-square goodness-of-fit test to determine whether data from a process fit a specified distribution.

13.2 Introduction to Contingency Analysis (pg. 562–572)

Summary You will encounter many business situations in which the level of data measurement for the variable of interest is either nominal or ordinal, not interval or ratio. In Chapters 9 and 10 you were introduced to hypothesis tests involving one and two population proportions. However, you will also encounter many situations involving multiple population proportions for which two-population procedures are not applicable. In each of these cases, the proportions relate to characteristic categories of the variable of interest. These situations involving categorical data call for a statistical tool known as contingency analysis to help make decisions when multiple proportions are involved. Contingency analysis can be used when the level of data measurement is either nominal or ordinal and the values are determined by counting the number of occurrences in each category.

Outcome 2. Set up a contingency table analysis and perform a chi-square test of independence.
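As a quick computational illustration of the goodness-of-fit idea summarized above, the test can be run in a few lines of Python with scipy's chisquare function. The counts below are hypothetical, and the sketch assumes the simplest case: equal expected proportions across k = 4 categories.

# Chi-square goodness-of-fit test: H0 says the four categories are equally likely
import numpy as np
from scipy.stats import chisquare

observed = np.array([165, 142, 158, 135])   # hypothetical category counts
expected = np.full(4, observed.sum() / 4)   # 150 per category under H0

chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(round(chi2_stat, 3), round(p_value, 4))   # df = k - 1 = 3
# Reject H0 at significance level alpha whenever p_value < alpha.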
Conclusion

This chapter has introduced two very useful statistical tools: goodness-of-fit tests and contingency analysis. Goodness-of-fit testing is used when a decision maker wishes to determine whether sample data come from a population having specific characteristics. The chi-square goodness-of-fit procedure that was introduced in this chapter addresses this issue. This test relies on the idea that if the distribution of the sample data is substantially different from the hypothesized population distribution, then the population distribution from which these sample data came must not be what was hypothesized. Contingency analysis is a frequently used statistical tool that allows the decision maker to test whether responses to two variables are independent. Market researchers, for example, use contingency analysis to determine whether attitude about the quality of their company's product is independent of the gender of a customer. By using contingency analysis and the chi-square contingency test, they can make this determination based on a sample of customers.

Equations

(13.1) Chi-Square Goodness-of-Fit Test Statistic (pg. 550)

χ² = Σ_{i=1}^{k} (o_i − e_i)²/e_i

(13.2) Chi-Square Contingency Test Statistic (pg. 564)

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (o_ij − e_ij)²/e_ij,  with df = (r − 1)(c − 1)

(13.3) Expected Cell Frequencies (pg. 567)

e_ij = (ith row total)(jth column total)/total sample size

Key Term

Contingency table (pg. 563)

Chapter Exercises

Conceptual Questions

13-34 Locate a journal article that uses either contingency analysis or a goodness-of-fit test. Discuss the article, paying particular attention to the reasoning behind using the particular statistical test.

13-35 Find a marketing research book (or borrow one from a friend). Does it discuss either of the tests considered in this chapter? If yes, outline the discussion. If no, determine where in the text such a discussion would be appropriate.

13-36 One of the topics in Chapter 10 was hypothesis testing for the difference between two population proportions. For the test to have validity, there were conditions set on the sample sizes with respect to the sample proportions. A 2 × 2 contingency table may also be utilized to test the difference between proportions of two independent populations. This procedure has conditions placed on the expected value of each cell. Discuss the relationship between these two conditions.

13-37 A 2 × 2 contingency table and a hypothesis test of the difference between two population proportions can be used to analyze the same data set. However, besides all the similarities of the two methods, the hypothesis test of the difference between two proportions has two advantages. Identify these advantages.

Business Applications

13-38 The College Bookstore has just hired a new manager, one with a business background. Claudia Markman has been charged with increasing the profitability of the bookstore, with the profits going to the general scholarship fund. Claudia started her job just before the beginning of the semester and was analyzing the sales during the days when students are buying their textbooks. The store has four checkout stands, and Claudia noticed registers three and four served more students than registers one and two. She is not sure whether the layout of the store channels customers into these registers, whether the checkout clerks in these lines are simply slower than the other two, or whether she was just seeing random differences. Claudia kept a record of which stands the next 1,000 students chose for checkout. The students checked out of the four stands according to the following pattern:

Stand 1    Stand 2    Stand 3    Stand 4
   338        275        201        186

a. Based on these data, can Claudia conclude the proportion of students using the four checkout stands is equal? (Use α = 0.05.)
b. A friend suggested that you could just as well conduct four hypothesis tests that the proportion of customers visiting each stand is equal to p = 0.25. Discuss the merits of this suggestion.

13-39 A regional cancer treatment center has had success treating localized cancers with a linear accelerator. Whereas admissions for further treatment nationally average 2.1 per patient per year, the center's director thinks that re-admissions with the new treatment are Poisson distributed, with a mean of 1.2 patients per year. He has collected the following data on a random sample of 300 patients:

Re-admissions Last Year    Patients
                              139
                               87
                               48
                               14
                                1
                                1
                              300

a. Adjust the data so that you can test the director's claim using a test statistic whose sampling distribution can be approximated by a chi-square distribution.
b. Assume the Type I error rate is to be controlled at 0.05. Do you agree with the director's claim? Why? Conduct a statistical procedure to support your opinion.

13-40 Cooper Manufacturing, Inc., of Dallas, Texas, has a contract with the U.S. Air Force to produce a part for a new fighter plane being manufactured. The part is a bolt with specifications requiring that the length be normally distributed with a mean of 3.05 inches and a standard deviation of 0.015 inch. As part of the company's quality control efforts, each day Cooper's engineers select a random sample of 100 bolts produced that day and carefully measure the bolts to determine whether the production is within specifications. The following data were collected yesterday:

Length (inches): Under 3.030; 3.030 and under 3.035; 3.035 and under 3.040; 3.040 and under 3.050; 3.050 and under 3.060; 3.060 and under 3.065; 3.065 and over
Frequency: 16  20  36  8  8

Based on these sample data, what should Cooper's engineers conclude about the production output if they test using α = 0.01? Discuss.

13-41 The Cooper Company discussed in Problem 13-40 has a second contract with a private firm for which it makes fuses for an electronic instrument. The quality control department at Cooper periodically selects a random sample of five fuses and tests each fuse to determine whether it is defective. Based on these findings, the production process is either shut down (if too many defectives are observed) or allowed to run. The quality control department believes that the sampling process follows a binomial distribution, and it has been using the binomial distribution to compute the probabilities associated with the sampling outcomes. The contract allows for at most 5% defectives. The head of quality control recently compiled a list of the sampling results for the past 300 days in which five randomly selected fuses were tested, with the following frequency distribution for the number of defectives observed. She is concerned that the binomial distribution with a sample size of 5 and a probability of defectives of 0.05 may not be appropriate.

Number of Defectives    Frequency
                        209  33  43  10

a. Calculate the expected values for the cells in this analysis. Suggest a way in which cells can be combined to assure that the expected value of each cell is at least 5.
b. Using a significance level of 0.10, what should the quality control manager conclude based on these sample data?
Discuss.

13-42 A survey performed by Simmons Market Research investigated the percentage of individuals in various age groups who indicated they were willing to pay more for environmentally friendly products. The results were presented in USA Today "Snapshots" (July 21, 2005). The survey had approximately 3,240 respondents in each age group. Results of the survey follow:

Age Group      18–24    25–34    35–44    45–54    55–64    65 and over
Percentage        11       17       19       20       14       19

Conduct a goodness-of-fit analysis to determine if the proportions of individuals willing to pay more for environmentally friendly products in the various age groups are equal. Use a significance level of 0.01.

13-43 An article published in USA Today asserts that many children are abandoning outdoor for indoor activities. The National Sporting Goods Association annual survey for 2004 (the latest data available) compared activity levels in 1995 versus 2004. A random selection of children (7- to 11-year-olds) indicating their favorite outdoor activity is given in the following table:

                   1995    2004
Bicycling            68      47
Swimming             60      42
Baseball             29      22
Fishing              25      18
Touch Football       16      10

Construct a contingency analysis to determine if the type of preferred outdoor activity is dependent on the year in this survey. Use a significance level of 0.05 and the p-value approach.

Computer Database Exercises

13-44 With the economic downturn that started in late 2007, many people started worrying about retirement and whether they would even be able to retire. A study recently done by the Employee Benefit Research Institute (EBRI) found about 69% of workers said they and/or their spouses have saved for retirement. The file entitled Retirement contains the total savings and investments indicated. Use a contingency analysis to determine if the amount of total savings and investments is dependent on the age of the worker.

13-45 The airport manager at the Sacramento, California, airport recently conducted a study of passengers departing from the airport. A random sample of 100 passengers was selected. The data are in the file called Airline Passengers. An earlier study showed the following usage by airline:

Delta        20%
Horizon      10%
Northwest    10%
Skywest       3%
Southwest    25%
United       32%

a. If the manager wishes to determine whether the airline usage pattern has changed from that reported in the earlier study, state the appropriate null and alternative hypotheses.
b. Based on the sample data, what should be concluded? Test using a significance level of 0.01.

13-46 A pharmaceutical company is planning to market a drug that is supposed to help reduce blood pressure. The company claims that if the drug is taken properly, the amount of blood pressure decrease will be normally distributed with a mean equal to 10 points on the diastolic reading and a standard deviation equal to 4.0. One hundred patients were administered the drug, and data were collected showing the reduction in blood pressure at the end of the test period. The data are located in the file labeled Blood Pressure.
a. Using a goodness-of-fit test and a significance level equal to 0.05, what conclusion should be reached with respect to the distribution of diastolic blood pressure reduction?
Discuss.
b. Conduct a hypothesis test to determine if the standard deviation for this population could be considered to be 4.0. Use a significance level of 0.10.
c. Given the results of the two tests in parts a and b, is it appropriate to construct a confidence interval based on a normal distribution with a population standard deviation of 4.0? Explain your answer.
d. If appropriate, construct a 99% confidence interval for the mean reduction in blood pressure. Based on this confidence interval, does an average diastolic loss of 10 seem reasonable for this procedure? Explain your reasoning.

13-47 An Ariel Capital Management and Charles Schwab survey addressed the proportion of African-Americans and White Americans who have money invested in the stock market. Suppose the file entitled Stockrace contains data obtained in the surveys. The survey asked 500 African-American and 500 White respondents if they personally had money invested in the stock market.
a. Create a contingency table using the data in the file Stockrace.
b. Conduct a contingency analysis to determine if the proportion of African-Americans differs from the proportion of White Americans who invest in stocks. Use a significance level of 0.05.

13-48 The state transportation department recently conducted a study of motorists in Idaho. Two main factors of interest were whether the vehicle was insured with liability insurance and whether the driver was wearing a seat belt. A random sample of 100 cars was stopped at various locations throughout the state. The data are in the file called Liabins. The investigators were interested in determining whether seat belt status is independent of insurance status. Conduct the appropriate hypothesis test using a 0.05 level of significance and discuss your results.

Case 13.1 American Oil Company

Chad Williams sat back in his airline seat to enjoy the hour-long flight between Los Angeles and Oakland, California. The hour would give him time to reflect on his upcoming trip to Australia and the work he had been doing the past week in Los Angeles. Chad is one man on a six-man crew employed by the American Oil Company to literally walk the earth searching for oil. His college degrees in geology and petroleum engineering landed him the job with American, but he never dreamed he would be doing the exciting work he now does. Chad and his crew spend several months in special locations around the world using highly sensitive electronic equipment for oil exploration. The upcoming trip to Australia is one that Chad has been looking forward to since it was announced that his crew would be going there to search the Outback for oil. In preparation for the trip, the crew has been in Los Angeles at American's engineering research facility working on some new equipment that will be used in Australia.

Chad's thoughts centered on the problem he was having with a particular component part on the new equipment. The specifications called for 200 of the components, with each having a diameter of between 0.15 and 0.18 inch. The only available supplier of the component manufactures the components in New Jersey to specifications calling for normally distributed output, with a mean of 0.16 inch and a standard deviation of 0.02 inch. Chad faces two problems. First, he is unsure that the supplier actually does produce parts with a mean of 0.16 inch and a standard deviation of 0.02 inch according to a normal distribution. Second, if the parts are made to specifications, he needs to
determine how many components to purchase if enough acceptable components are to be received to make two oil exploration devices. The supplier has sent Chad the following data for 330 randomly selected components. Chad believes that the supplier is honest and that he can rely on the data:

Diameter (Inch): Under 0.14; 0.14 and under 0.15; 0.15 and under 0.16; 0.16 and under 0.17; 0.17 and under 0.18; Over 0.18
Frequency: 70  90  105  50  10    Total 330

Chad needs to have a report ready for Monday indicating whether he believes the supplier delivers at its stated specifications and, if so, how many of the components American should order to have enough acceptable components to outfit two oil exploration devices.

Required Tasks:
1. State the problems faced by Chad Williams.
2. Identify the statistical test Chad Williams can use to determine whether the supplier's claim is true.
3. State the null and alternative hypotheses for the test to determine whether the supplier's claim is true.
4. Assuming that the supplier produces output whose diameter is normally distributed with a mean of 0.16 inch and a standard deviation of 0.02 inch, determine the expected frequencies that Chad would expect to see in a sample of 330 components.
5. Based on the observed and expected frequencies, calculate the appropriate test statistic.
6. Calculate the critical value of the test statistic. Select an alpha value.
7. State a conclusion. Is the supplier's claim with respect to specifications of the component parts supported by the sample data?
8. Provide a short report that summarizes your analysis and conclusion.

Case 13.2 Bentford Electronics—Part

On Saturday morning, Jennifer Bentford received a call at her home from the production supervisor at Bentford Electronics Plant 1. The supervisor indicated that she and the supervisors from Plants 2, 3, and 4 had agreed that something must be done to improve company morale and thereby increase the production output of their plants. Jennifer Bentford, president of Bentford Electronics, agreed to set up a Monday morning meeting with the supervisors to see if they could arrive at a plan for accomplishing these objectives. By Monday each supervisor had compiled a list of several ideas, including a four-day work week and interplant competitions of various kinds. A second meeting was set for Wednesday to discuss the issue further.

Following the Wednesday afternoon meeting, Jennifer Bentford and her plant supervisors agreed to implement a weekly contest called the NBE Game of the Week. The plant producing the most each week would be considered the NBE Game of the Week winner and would receive 10 points. The second-place plant would receive fewer points, and the third- and fourth-place plants fewer still, down to 1 point for last place. The contest would last 26 weeks. At the end of that period, a $200,000 bonus would be divided among the employees in the four plants in proportion to the total points accumulated by each plant.

The announcement of the contest created a lot of excitement and enthusiasm at the four plants. No one complained about the rules because the four plants were designed and staffed to produce equally. At the close of the contest, Jennifer Bentford called the supervisors into a meeting, at which time she asked for data to determine whether the contest had significantly improved productivity. She indicated that she had to know this before she could authorize a second contest. The supervisors, expecting this request, had put together the
following data:

Units Produced            Before-Contest    During-Contest
(4 Plants Combined)       Frequency         Frequency
0–2,500                        11                  0
2,501–8,000                    23                 20
8,001–15,000                   56                 83
15,001–20,000                  15                 52
                              105 days           155 days

Jennifer examined the data and indicated that the contest looked to be a success, but she wanted to base her decision to continue the contest on more than just an observation of the data. "Surely there must be some way to statistically test the worthiness of this contest," Jennifer stated. "I have to see the results before I will authorize the second contest."

References

Berenson, Mark L., and David M. Levine, Basic Business Statistics: Concepts and Applications, 11th ed. (Upper Saddle River, NJ: Prentice Hall, 2009).
Conover, W. J., Practical Nonparametric Statistics, 3rd ed. (New York: Wiley, 1999).
Higgins, James J., Introduction to Modern Nonparametric Statistics (Pacific Grove, CA: Duxbury, 2004).
Marascuilo, Leonard, and M. McSweeney, Nonparametric and Distribution Free Methods for the Social Sciences (Monterey, CA: Brooks/Cole, 1977).
Microsoft Excel 2007 (Redmond, WA: Microsoft Corp., 2007).
Minitab for Windows Version 15 (State College, PA: Minitab, Inc., 2007).

Chapter 14 Quick Prep Links

• Review the methods for testing a null hypothesis using the t-distribution in Chapter 9.
• Review confidence intervals discussed in Chapter 8.
• Make sure you review the discussion about scatter plots in Chapter 2.
• Review the concepts associated with selecting a simple random sample in Chapter 1.
• Review the F-distribution and the approach for finding critical values from the F-table as discussed in Chapters 11 and 12.

chapter 14

Introduction to Linear Regression and Correlation Analysis

14.1 Scatter Plots and Correlation (pg. 580–589)
Outcome 1. Calculate and interpret the correlation between two variables.
Outcome 2. Determine whether the correlation is significant.

14.2 Simple Linear Regression Analysis (pg. 589–612)
Outcome 3. Calculate the simple linear regression equation for a set of data and know the basic assumptions behind regression analysis.
Outcome 4. Determine whether a regression model is significant.

14.3 Uses for Regression Analysis (pg. 612–623)
Outcome 5. Recognize regression analysis applications for purposes of description and prediction.
Outcome 6. Calculate and interpret confidence intervals for the regression analysis.
Outcome 7. Recognize some potential problems if regression analysis is used incorrectly.

Why you need to know

Although some business situations involve only one variable, others require decision makers to consider the relationship between two or more variables. For example, a financial manager might be interested in the relationship between stock prices and the dividends issued by a publicly traded company. A marketing manager would be interested in examining the relationship between product sales and the amount of money spent on advertising. Finally, consider a loan manager at a bank who is interested in determining the fair market value of a home or business. She would begin by collecting data on a sample of comparable properties that have sold recently. In addition to the selling price, she would collect data on other factors, such as the size and age of the property. She might then analyze the relationship between the price and the other variables and use this relationship to determine an appraised price for the property in question. Simple linear regression and correlation analysis, which are introduced in this chapter, are statistical techniques the broker, marketing director, and appraiser will need in
their analyses. These techniques are two of the most often applied statistical procedures used by business decision makers for analyzing the relationship between two variables. In Chapter 15, we will extend the discussion to include three or more variables.

14.1 Scatter Plots and Correlation

Scatter Plot: A two-dimensional plot showing the values for the joint occurrence of two quantitative variables. The scatter plot may be used to graphically represent the relationship between two variables. It is also known as a scatter diagram.

Decision-making situations that call for understanding the relationship between two quantitative variables are aided by the use of scatter plots, or scatter diagrams. Figure 14.1 shows scatter plots that depict several potential relationships between values of a dependent variable, y, and an independent variable, x. A dependent (or response) variable is the variable whose variation we wish to explain. An independent (or explanatory) variable is a variable used to explain variation in the dependent variable. In Figure 14.1, (a) and (b) are examples of strong linear (or straight-line) relationships between x and y. Note that the linear relationship can be either positive (as the x variable increases, the y variable also increases) or negative (as the x variable increases, the y variable decreases). Figures 14.1 (c) and (d) illustrate situations in which the relationship between the x and y variables is nonlinear. There are many possible nonlinear relationships that can occur. The scatter plot is very useful for visually identifying the nature of the relationship. Figures 14.1 (e) and (f) show examples in which there is no identifiable relationship between the two variables. This means that as x increases, y sometimes increases and sometimes decreases but with no particular pattern.

FIGURE 14.1 | Two-Variable Relationships: (a) Linear, (b) Linear, (c) Curvilinear, (d) Curvilinear, (e) No Relationship, (f) No Relationship

The Correlation Coefficient

Correlation Coefficient: A quantitative measure of the strength of the linear relationship between two variables. The correlation ranges from +1.0 to −1.0. A correlation of ±1.0 indicates a perfect linear relationship, whereas a correlation of 0 indicates no linear relationship.

In addition to analyzing the relationship between two variables graphically, we can also measure the strength of the linear relationship between two variables using a measure called the correlation coefficient. The correlation coefficient of two variables can be estimated from sample data using Equation 14.1 or the algebraic equivalent, Equation 14.2.

Sample Correlation Coefficient

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² Σ(y − ȳ)²]    (14.1)

or the algebraic equivalent:

r = [nΣxy − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}    (14.2)

where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable

The sample correlation coefficient computed using Equations 14.1 and 14.2 is called the Pearson Product Moment Correlation. The sample correlation coefficient, r, can range from a perfect positive correlation, +1.0, to a perfect negative correlation, −1.0. A perfect correlation is one in which all points on the scatter plot fall on a straight line. If two variables have no linear relationship, the correlation between them is 0 and there
is no linear relationship between the change in x and y. Consequently, the more the correlation differs from 0.0, the stronger the linear relationship between the two variables. The sign of the correlation coefficient indicates the direction of the relationship. Figure 14.2 illustrates some examples of correlation between two variables. Once again, for the correlation coefficient to equal plus or minus 1.0, all the (x, y) points must form a perfectly straight line. The more the points depart from a straight line, the weaker (closer to 0.0) the correlation is between the two variables.

FIGURE 14.2 | Correlation between Two Variables: panels show r = +1, r = +0.7, r = −1, r = −0.55, and two patterns with r = 0

BUSINESS APPLICATION: TESTING FOR SIGNIFICANT CORRELATIONS

MIDWEST DISTRIBUTION COMPANY Consider the application involving Midwest Distribution, which supplies soft drinks and snack foods to convenience stores in Michigan, Illinois, and Iowa. Although Midwest Distribution has been profitable, the director of marketing has been concerned about the rapid turnover in her salesforce. In the course of exit interviews, she discovered a major concern with the compensation structure. Midwest Distribution has a two-part wage structure: a base salary and a commission computed on monthly sales. Typically, about half of the total wages paid comes from the base salary, which increases with longevity with the company. This portion of the wage structure is not an issue. The concern expressed by departing employees is that new employees tend to be given parts of the sales territory previously covered by existing employees and are assigned prime customers as a recruiting inducement. At issue, then, is the relationship between sales (on which commissions are paid) and number of years with the company. The data for a random sample of 12 sales representatives are in the file called Midwest.

The first step is to develop a scatter plot of the data. Both Excel and Minitab have procedures for constructing a scatter plot and computing the correlation coefficient. The scatter plot for the Midwest data is shown in Figure 14.3. Based on this plot, total sales and years with the company appear to be linearly related. However, the strength of this relationship is uncertain. That is, how close do the points come to being on a straight line?

FIGURE 14.3 | Excel 2007 Scatter Plot of Sales vs. Years with Midwest Distribution

Excel 2007 Instructions:
1. Open file: Midwest.xls.
2. Move the Sales column to the right of the Years with Midwest column.
3. Select the data for the chart.
4. On the Insert tab, click XY (Scatter), and then click the Scatter with only Markers option.
5. Use the Layout tab of the Chart Tools to add titles and remove gridlines.
6. Use the Design tab of the Chart Tools to move the chart to a new worksheet.

Minitab Instructions (for similar results):
1. Open file: Midwest.MTW.
2. Choose Graph > Scatterplot.
3. Under Scatterplot, choose Simple. Click OK.
4. Under Y variable, enter the y column. In X variable, enter the x column.
5. Click OK.
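Outside Excel or Minitab, the same scatter plot can be produced with a few lines of Python. This sketch uses matplotlib and the 12 (years, sales) pairs for the Midwest data (they are listed in Table 14.1 below); the axis labels are simply descriptive, since the text does not state the units of the sales figures.

# Scatter plot of sales versus years with the company (Midwest data)
import matplotlib.pyplot as plt

years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]

plt.scatter(years, sales)
plt.xlabel("Years with Midwest Distribution")
plt.ylabel("Sales")
plt.title("Sales vs. Years with the Company")
plt.show()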
To answer this question, we need a quantitative measure of the strength of the linear relationship between the two variables. That measure is the correlation coefficient. Equation 14.1 is used to determine the correlation between sales and years with the company. Table 14.1 shows the manual calculations that were used to determine this correlation coefficient of 0.8325. However, because the calculations are rather tedious and long, we almost always use computer software to perform the computation, as shown in Figure 14.4. The r = 0.8325 indicates that there is a fairly strong, positive correlation between these two variables for the sample data.

Significance Test for the Correlation  Although a correlation coefficient of 0.8325 seems quite large (relative to 0), you should remember that this value is based on a sample of 12 data points and is subject to sampling error. Therefore, a formal hypothesis-testing procedure is needed to determine whether the linear relationship between sales and years with the company is significant.

FIGURE 14.3 | Excel 2007 Scatter Plot of Sales vs. Years with Midwest Distribution
Excel 2007 Instructions:
1. Open file: Midwest.xls.
2. Move the Sales column to the right of the Years with Midwest column.
3. Select data for chart.
4. On the Insert tab, click XY (Scatter), and then click the Scatter with only Markers option.
5. Use the Layout tab of the Chart Tools to add titles and remove grid lines.
6. Use the Design tab of the Chart Tools to move the chart to a new worksheet.
Minitab Instructions (for similar results):
1. Open file: Midwest.MTW.
2. Choose Graph > Scatterplot.
3. Under Scatterplot, choose Simple. Click OK.
4. Under Y variable, enter the y column. In X variable, enter the x column.
5. Click OK.

TABLE 14.1 | Correlation Coefficient Calculations for the Midwest Distribution Example

Sales y | Years x | x - x̄  | y - ȳ    | (x - x̄)(y - ȳ) | (x - x̄)² | (y - ȳ)²
487     | 3       | -1.58  | 82.42   | -130.22        | 2.50     | 6,793.06
445     | 5       | 0.42   | 40.42   | 16.98          | 0.18     | 1,633.78
272     | 2       | -2.58  | -132.58 | 342.06         | 6.66     | 17,577.46
641     | 8       | 3.42   | 236.42  | 808.56         | 11.70    | 55,894.42
187     | 2       | -2.58  | -217.58 | 561.36         | 6.66     | 47,341.06
440     | 6       | 1.42   | 35.42   | 50.30          | 2.02     | 1,254.58
346     | 7       | 2.42   | -58.58  | -141.76        | 5.86     | 3,431.62
238     | 1       | -3.58  | -166.58 | 596.36         | 12.82    | 27,748.90
312     | 4       | -0.58  | -92.58  | 53.70          | 0.34     | 8,571.06
269     | 2       | -2.58  | -135.58 | 349.80         | 6.66     | 18,381.94
655     | 9       | 4.42   | 250.42  | 1,106.86       | 19.54    | 62,710.18
563     | 6       | 1.42   | 158.42  | 224.96         | 2.02     | 25,096.90
Σy = 4,855 | Σx = 55 |      |         | Σ = 3,838.92   | Σ = 76.92 | Σ = 276,434.92

ȳ = Σy/n = 4,855/12 = 404.58        x̄ = Σx/n = 55/12 = 4.58

Using Equation 14.1,

r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} = \frac{3,838.92}{\sqrt{(76.92)(276,434.92)}} = 0.8325

FIGURE 14.4 | Excel 2007 Correlation Output for Midwest Distribution
Excel 2007 Instructions:
1. Open file: Midwest.xls.
2. On the Data tab, click Data Analysis.
3. Select Correlation.
4. Define Data Range.
5. Click on Labels in First Row.
6. Specify output location.
7. Click OK.
Minitab Instructions (for similar results):
1. Open file: Midwest.MTW.
2. Choose Stat > Basic Statistics > Correlation.
3. In Variables, enter the Y and X columns.
4. Click OK.

The null and alternative hypotheses to be tested are

H0: ρ = 0 (no correlation)
HA: ρ ≠ 0 (correlation exists)

where the Greek symbol ρ (rho) represents the population correlation coefficient. We must test whether the sample data support or refute the null hypothesis. The test procedure utilizes the t-test statistic in Equation 14.3.

Test Statistic for Correlation

t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}},    df = n - 2    (14.3)

where:
t = Number of standard errors r is from 0
r = Sample correlation coefficient
n = Sample size
The degrees of freedom for this test are n - 2, because we lose 1 degree of freedom for each of the two sample means (x̄ and ȳ) that are used to estimate the population means of the two variables. Figure 14.5 shows the hypothesis test for the Midwest Distribution example using an alpha level of 0.05. Recall that the sample correlation coefficient was r = 0.8325. Based on these sample data, we should conclude there is a significant, positive linear relationship in the population between years of experience and total sales for Midwest Distribution sales representatives.

FIGURE 14.5 | Correlation Significance Test for the Midwest Distribution Example

Hypotheses:
H0: ρ = 0 (no correlation)
HA: ρ ≠ 0
α = 0.05, df = n - 2 = 10

Rejection regions: α/2 = 0.025 in each tail, with critical values -t0.025 = -2.228 and t0.025 = 2.228.

The calculated t-value is

t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.8325}{\sqrt{\dfrac{1 - 0.6931}{10}}} = 4.752

Decision Rule:
If t > t0.025 = 2.228, reject H0.
If t < -t0.025 = -2.228, reject H0.
Otherwise, do not reject H0.

Because 4.752 > 2.228, reject H0. Based on the sample evidence, we conclude there is a significant positive linear relationship between years with the company and sales volume.

The implication is that the more years an employee has been with the company, the more sales that employee generates. This runs counter to the claims made by some of the departing employees. The manager will probably want to look further into the situation to see whether a problem might exist in certain regions.

The t-test for determining whether the population correlation is significantly different from 0 requires the following assumptions:

Assumptions
1. The data are interval or ratio-level.
2. The two variables (y and x) are distributed as a bivariate normal distribution.

Although the formal mathematical representation is beyond the scope of this text, two variables are bivariate normal if their joint distribution is normally distributed. Although the t-test assumes a bivariate normal distribution, it is robust; that is, correct inferences can be reached even with slight departures from the normal-distribution assumption. (See Kutner et al., Applied Linear Statistical Models, for further discussion of bivariate normal distributions.)
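The arithmetic in Figure 14.5 can also be verified in a few lines of Python. This is a sketch for checking purposes only, not part of the text's Excel/Minitab procedure: it recomputes r and the Equation 14.3 t statistic for the Table 14.1 data and, as a cross-check, asks scipy.stats.pearsonr for the two-sided p-value of the same test.

```python
from math import sqrt
from scipy import stats

years = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
sales = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]
n = len(years)

# Sample correlation (Equation 14.1) with its two-sided p-value
r, p_value = stats.pearsonr(years, sales)

# t statistic from Equation 14.3, df = n - 2
t = r / sqrt((1 - r ** 2) / (n - 2))

print(f"r = {r:.4f}")                  # about 0.8325
print(f"t = {t:.3f} with df = {n - 2}")  # about 4.752
print(f"p-value = {p_value:.4f}")      # about 0.0008, well below 0.05
```

Because the p-value is far below α = 0.05, the sketch agrees with Figure 14.5: reject H0 and conclude the correlation is significant.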
EXAMPLE 14-1 CORRELATION ANALYSIS

Stock Portfolio Analysis A student intern at the investment firm of McMillan & Associates was given the assignment of determining whether there is a positive correlation between the number of individual stocks in a client's portfolio (x) and the annual rate of return (y) for the portfolio. The intern selected a simple random sample of 10 client portfolios and determined the number of individual company stocks and the annual rate of return earned by the client on his or her portfolio. To determine whether there is a statistically significant positive correlation between the two variables, the following steps can be employed:

Step 1 Specify the population parameter of interest.
The intern wishes to determine whether the number of stocks is positively correlated with the rate of return earned by the client. The parameter of interest is, therefore, the population correlation, ρ.

Step 2 Formulate the appropriate null and alternative hypotheses.
Because the intern was asked to determine whether a positive correlation exists between the variables of interest, the hypothesis test will be one-tailed, as follows:
H0: ρ ≤ 0
HA: ρ > 0

Step 3 Specify the level of significance.
A significance level of 0.05 is chosen.

Step 4 Compute the correlation coefficient and the test statistic.
Compute the sample correlation coefficient using Equation 14.1 or 14.2, or by using software such as Excel or Minitab. The following sample data were obtained:

Number of Stocks: 16, 25, 16, 20, 16, 20, 20, 16
Rate of Return: 0.13, 0.16, 0.21, 0.18, 0.18, 0.19, 0.15, 0.17, 0.13, 0.11

Using Equation 14.1, we get

r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} = 0.7796

Compute the t-test statistic using Equation 14.3:

t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.7796}{\sqrt{\dfrac{1 - 0.7796^2}{10 - 2}}} = 3.52

Step 5 Construct the rejection region and decision rule.
For an alpha level equal to 0.05, the one-tailed, upper-tail critical value for n - 2 = 10 - 2 = 8 degrees of freedom is t0.05 = 1.8595. The decision rule is:
If t > 1.8595, reject the null hypothesis.
Otherwise, do not reject the null hypothesis.

Step 6 Reach a decision.
Because t = 3.52 > 1.8595, reject the null hypothesis.

Step 7 Draw a conclusion.
Because the null hypothesis is rejected, the sample data support the contention that there is a positive linear relationship between the number of individual stocks in a client's portfolio and the portfolio's rate of return.

>>END EXAMPLE  TRY PROBLEM 14-3 (pg 587)

Cause-and-Effect Interpretations

Care must be used when interpreting correlation results. For example, even though we found a significant linear relationship between years of experience and sales for the Midwest Distribution sales force, the correlation does not imply cause and effect. Although an increase in experience may, in fact, cause sales to change, simply because the two variables are correlated does not guarantee a cause-and-effect situation. Two seemingly unconnected variables may be highly correlated. For example, over a period of time, teachers' salaries in North Dakota might be highly correlated with the price of grapes in Spain. Yet we doubt that a change in grape prices will cause a corresponding change in salaries for teachers in North Dakota, or vice versa. When a correlation exists between two seemingly unrelated variables, the correlation is said to be a spurious correlation. You should take great care to avoid basing conclusions on spurious correlations. The Midwest Distribution marketing director has a logical reason to believe that years of experience
with the company and total sales are related That is, sales theory and customer feedback hold that product knowledge is a major component in successfully marketing a product However, a statistically significant correlation alone does not prove that this causeand-effect relationship exists When two seemingly unrelated variables are correlated, they may both be responding to changes in some third variable For example, the observed correlation could be the effect of a company policy of giving better sales territories to more senior salespeople (124) CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis 587 MyStatLab 14-1: Exercises Skill Development 14-1 An industry study was recently conducted in which the sample correlation between units sold and marketing expenses was 0.57 The sample size for the study included 15 companies Based on the sample results, test to determine whether there is a significant positive correlation between these two variables Use an a  0.05 14-2 The following data for the dependent variable, y, and the independent variable, x, have been collected using simple random sampling: x y 10 14 16 12 20 18 16 14 16 18 120 130 170 150 200 180 190 150 160 200 a Construct a scatter plot for these data Based on the scatter plot, how would you describe the relationship between the two variables? b Compute the correlation coefficient 14-3 A random sample of the following two variables was obtained: x 29 48 28 22 28 42 33 26 48 44 y 16 46 34 26 49 11 41 13 47 16 a Calculate the correlation between these two variables b Conduct a test of hypothesis to determine if there exists a correlation between the two variables in the population Use a significance level of 0.10 14-4 A random sample of two variables, x and y, produced the following observations: x y 19 13 17 12 25 20 17 11 a Develop a scatter plot for the two variables and describe what relationship, if any, exists b Compute the correlation coefficient for these sample data c Test to determine whether the population correlation coefficient is negative Use a significance level of 0.05 for the hypothesis test 14-5 You are given the following data for variables x and y: x y 3.0 2.0 2.5 3.0 2.5 4.0 1.5 1.0 2.0 2.5 1.5 0.5 1.0 1.8 1.2 2.2 0.4 0.3 1.3 1.0 a Plot these variables in scatter plot format Based on this plot, what type of relationship appears to exist between the two variables? 
b Compute the correlation coefficient for these sample data Indicate what the correlation coefficient measures c Test to determine whether the population correlation coefficient is positive Use the   0.01 level to conduct the test Be sure to state the null and alternative hypotheses and show the test and decision rule clearly 14-6 For each of the following circumstances, perform the indicated hypothesis tests: a HA: r  0, r  0.53, and n  30 with a  0.01, using a test-statistic approach b HA: r  0, r  0.48, and n  20 with a  0.05, using a p-value approach c HA: r  0, r  0.39, and n  45 with a  0.02, using a test-statistic approach d HA: r 0, r  0.34, and n  25 with a  0.05, using a test-statistic approach Business Applications 14-7 The Federal No Child Left Behind Act requires periodic testing in standard subjects A random sample of 50 junior high school students from Atlanta was selected, and each student’s scores on a standardized mathematics examination and a standardized English examination were recorded School administrators were interested in the relation between the two scores Suppose the correlation coefficient for the two examination scores is 0.75 (125) 588 CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis a Provide an explanation of the sample correlation coefficient in this context b Using a level of significance of a  0.01, test to determine whether there is a positive linear relationship between mathematics scores and English scores for junior high school students in Atlanta 14-8 Because of the current concern over credit card balances, a bank’s Chief Financial Officer is interested in whether there is a relationship between account balances and the number of times a card is used each month A random sample of 50 accounts was selected The account balance and the number of charges during the past month were the two variables recorded The correlation coefficient for the two variables was 0.23 a Discuss what the r  0.23 measures Make sure to frame your discussion in terms of the two variables mentioned here b Using an a  0.10 level, test to determine whether there is a significant linear relationship between account balance and the number of card uses during the past month State the null and alternative hypotheses and show the decision rule c Consider the decision you reached in part b Describe the type of error you could have made in the context of this problem 14-9 Farmers National Bank issues MasterCard credit cards to its customers A main factor in determining whether a credit card will be profitable to the bank is the average monthly balance that the customer will maintain on the card that will be subject to finance charges Bank analysts wish to determine whether there is a relationship between the average monthly credit card balance and the income stated on the original credit card application form The following sample data have been collected from existing credit card customers: Income Credit Balance $43,000 $35,000 $47,000 $55,000 $55,000 $59,000 $28,000 $43,000 $54,000 $36,000 $39,000 $31,000 $30,000 $37,000 $39,000 $345 $1,370 $1,140 $201 $56 $908 $2,345 $104 $0 $1,290 $130 $459 $0 $1,950 $240 a Indicate which variable is to be the independent variable and which is to be the dependent variable in the bank’s analysis and indicate why b Construct a scatter plot for these data and describe what, if any, relationship appears to exist between these two variables c Calculate the correlation coefficient for these two variables and test to determine whether 
there is a significant correlation at the a  0.05 level 14-10 Amazon.com has become one of the most successful online merchants Two measures of its success are sales and net income/loss figures (all figures in $million) These values for the years 1995–2007 are shown as follows Year Net Income/Loss Sales 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 0.3 5.7 27.5 124.5 719.9 1,411.2 567.3 149.1 35.3 588.5 359 190 476 0.5 15.7 147.7 609.8 1,639.8 2,761.9 3,122.9 3,933 5,263.7 6,921 8,490 10,711 14,835 a Produce a scatter plot for Amazon’s net income/loss and sales figures for the period 1995 to 2007 Does there appear to be a linear relationship between these two variables? Explain your response b Calculate the correlation coefficient between Amazon’s net income/loss and sales figures for the period 1995 to 2007 c Conduct a hypothesis test to determine if a positive correlation exists between Amazon’s net income/loss and sales figures Use a significance level of 0.05 and assume that these figures form a random sample 14-11 Complaints concerning excessive commercials seem to grow as the amount of “clutter,” including commercials and advertisements for other television shows, steadily increases on network and cable television A recent analysis by Nielsen Monitor-Plus compares the average nonprogram minutes in an hour of prime time for both network and cable television Data for selected years are shown as follows Year Network Cable 1996 1999 2001 2004 9.88 14.00 14.65 15.80 12.77 13.88 14.50 14.92 a Calculate the correlation coefficient for the average nonprogram minutes in an hour of prime time between network and cable television b Conduct a hypothesis test to determine if a positive correlation exists between the average nonprogram (126) CHAPTER 14 minutes in an hour of prime time between network and cable television Use a significance level of 0.05 and assume that these figures form a random sample Computer Database Exercises 14-12 Platinum Billiards, Inc., is a retailer of billiard supplies based in Jacksonville, Florida It stands out among billiard suppliers because of the research it does to assure its products are top-notch One experiment was conducted to measure the speed attained by a cue ball struck by various weighted pool cues The conjecture is that a light cue generates faster speeds while breaking the balls at the beginning of a game of pool Anecdotal experience has indicated that a billiard cue weighing less than 19 ounces generates faster speeds Platinum used a robotic arm to investigate this claim Its research generated the data given in the file entitled Breakcue a To determine if there is a negative relationship between the weight of the pool cue and the speed attained by the cue ball, calculate a correlation coefficient b Conduct a test of hypothesis to determine if there is a negative relationship between the weight of the pool cue and the speed attained by the cue ball Use a significance level of 0.025 and a p-value approach 14-13 Customers who made online purchases last quarter from an Internet retailer were randomly sampled from the retailer’s database The dollar value of each customer’s quarterly purchases along with the time the customer spent shopping the company’s online catalog that quarter were recorded The sample results are contained in the file Online a Create a scatter plot of the variables Time (x) and Purchases (y) What relationship, if any, appears to exist between the two variables? 
b Compute the correlation coefficient for these sample data What does the correlation coefficient measure? c Conduct a hypothesis test to determine if there is a positive relationship between time viewing the retailer’s catalog and dollar amount purchased Use a level of significance equal to 0.025 Provide a managerial explanation of your results | Introduction to Linear Regression and Correlation Analysis 589 14-14 A regional accreditation board for colleges and universities is interested in determining whether a relationship exists between student applicant verbal SAT scores and the in-state tuition costs at the university Data have been collected on a sample of colleges and universities and are in the data file called Colleges and Universities a Develop a scatter plot for these two variables and discuss what, if any, relationship you see between the two variables based on the scatter plot b Compute the sample correlation coefficient c Based on the correlation coefficient computed in part b, test to determine whether the population correlation coefficient is positive for these two variables That is, can we expect schools that charge higher in-state tuition will attract students with higher average verbal SAT scores? Test using a 0.05 significance level 14-15 As the number of air travelers with time on their hands increases, logic would indicate spending on retail purchases in airports would increase as well A study by Airport Revenue News addressed the per person spending at select airports for merchandise, excluding food, gifts, and news items A file entitled Revenues contains sample data selected from airport retailers in 2001 and again in 2004 a Produce a scatter plot for the per person spending at selected airports for merchandise, excluding food, gifts, and news items, for the years 2001 and 2004 Does there appear to be a linear relationship between spending in 2001 and spending in 2004? Explain your response b Calculate the correlation coefficient between the per person spending in 2001 and the per person spending in 2004 Does it appear that an increase in per person spending in 2001 would be associated with an increase in spending in 2004? 
Support your assertion c Conduct a hypothesis test to determine if a positive correlation exists between the per person spending in 2001 and that in 2004 Use a significance level of 0.05 and assume that these figures form a random sample END EXERCISES 14-1 14.2 Simple Linear Regression Analysis Simple Linear Regression The method of regression analysis in which a single independent variable is used to predict the dependent variable In the Midwest Distribution application, we determined that the relationship between years of experience and total sales is linear and statistically significant, based on the correlation analysis performed in the previous section Because hiring and training costs have been increasing, we would like to use this relationship to help formulate a more acceptable wage package for the sales force The statistical method we will use to analyze the relationship between years of experience and total sales is regression analysis When we have only two variables—a dependent variable, such as sales, and an independent variable, such as years with the company—the technique is referred to as simple regression analysis When the relationship between the dependent variable and the independent variable is linear, the technique is simple linear regression (127) 590 CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis The Regression Model and Assumptions The objective of simple linear regression (which we shall call regression analysis) is to represent the relationship between values of x and y with a model of the form shown in Equation 14.4 Simple Linear Regression Model (Population Model) y  b0  b1x   (14.4) where: y  Value of the dependent variable x  Value of the independent variable b0  Population’s y intercept b1  Slope of the population regression line   Random error term The simple linear regression population model described in Equation 14.4 has four assumptions: Assumptions Individual values of the error terms, , are statistically independent of one another, and these values represent a random sample from the population of possible -values at each level of x For a given value of x, there can exist many values of y and therefore many values of  Further, the distribution of possible -values for any x-value is normal The distributions of possible -values have equal variances for all values of x The means of the dependent variable, y, for all specified values of the independent variable, (my|x), can be connected by a straight line called the population regression model Figure 14.6 illustrates assumptions 2, 3, and The regression model (straight line) connects the average of the y-values for each level of the independent variable, x The actual y-values for each level of x are normally distributed around the mean of y Finally, observe that the spread of possible y-values is the same regardless of the level of x The population regression line is determined by two values, b0 and b1 These values are known as the population regression coefficients Value b0 identifies the y intercept and b1 the slope of the regression line Under the regression assumptions, the coefficients define the true population model For each observation, the actual value of the dependent variable, y, for any x is the sum of two components: y b0  b1 x Linear component  ε Random error component The random error component, , may be positive, zero, or negative, depending on whether a single value of y for a given x falls above, on, or below the population regression line Section 15.5 in Chapter 15 
discusses how to check whether assumptions have been violated and the possible courses of action if the violations occur.

FIGURE 14.6 | Graphical Display of Linear Regression Assumptions (the population regression line μy|x = β0 + β1x connects the means μy|x1, μy|x2, μy|x3 of the normal distributions of y at x1, x2, x3)

Meaning of the Regression Coefficients

Regression Slope Coefficient
The average change in the dependent variable for a unit change in the independent variable. The slope coefficient may be positive or negative, depending on the relationship between the two variables.

Coefficient β1, the regression slope coefficient of the population regression line, measures the average change in the value of the dependent variable, y, for each unit change in x. The regression slope can be either positive, zero, or negative, depending on the relationship between x and y. For example, a positive population slope of 12 (β1 = 12) means that for a 1-unit increase in x, we can expect an average 12-unit increase in y. Correspondingly, if the population slope is -12 (β1 = -12), we can expect an average decrease of 12 units in y for a 1-unit increase in x. The population's y intercept, β0, indicates the mean value of y when x is 0. However, this interpretation holds only if the population could have x values equal to 0. When this cannot occur, β0 does not have a meaningful interpretation in the regression model.

BUSINESS APPLICATION   SIMPLE LINEAR REGRESSION ANALYSIS

MIDWEST DISTRIBUTION (CONTINUED) The Midwest Distribution marketing manager has data for a sample of 12 sales representatives. In Section 14.1, she has established that a significant linear relationship exists between years of experience and total sales using correlation analysis. (Recall that the sample correlation between the two variables was r = 0.8325.) Now she would like to estimate the regression equation that defines the true linear relationship (that is, the population's linear relationship) between years of experience and sales. Figure 14.3 shows the scatter plot for the two variables: years with the company and sales. We need to use the sample data to estimate β0 and β1, the true intercept and slope of the line representing the relationship between the two variables. The regression line through the sample data is the best estimate of the population regression line. However, there are an infinite number of possible regression lines for a set of points. For example, Figure 14.7 shows three of the possible different lines that pass through the Midwest Distribution data.

FIGURE 14.7 | Possible Regression Lines through the sales-versus-years scatter: (a) ŷ = 450 + 0x, (b) ŷ = 250 + 40x, (c) ŷ = 150 + 60x

FIGURE 14.8 | Computation of Regression Error for the Midwest Distribution Example (for the line ŷ = 150 + 60x, at x = 4 the predicted value is ŷ = 390 while the actual value is y = 312, so the residual is 312 - 390 = -78)

Least Squares Criterion
The criterion for determining a regression line that minimizes the sum of squared prediction errors.

Residual
The difference between the actual value of y and the predicted value ŷ for a given level of the independent variable, x.

Which line should be used to estimate the true regression model?
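Before a formal criterion is introduced, it is instructive to compare the three candidate lines in Figure 14.7 numerically. The Python sketch below (an illustration only; the text's own computations use Excel and Minitab) scores each line by its sum of squared prediction errors, the quantity adopted as the selection criterion in the discussion that follows. The data pairs are again those of Table 14.1.

```python
# Years (x) and sales (y) for the 12 Midwest representatives (Table 14.1)
x = [3, 5, 2, 8, 2, 6, 7, 1, 4, 2, 9, 6]
y = [487, 445, 272, 641, 187, 440, 346, 238, 312, 269, 655, 563]

# The three candidate lines of Figure 14.7 as (intercept, slope) pairs
candidates = {
    "y-hat = 450 + 0x":  (450, 0),
    "y-hat = 250 + 40x": (250, 40),
    "y-hat = 150 + 60x": (150, 60),
}

for label, (b0, b1) in candidates.items():
    # Sum of squared residuals: each residual is y - (b0 + b1 * x)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    print(f"{label}: sum of squared errors = {sse:,}")
```

The printed totals match Table 14.2 (approximately 301,187; 102,307; and 97,667), so among these three candidates the line ŷ = 150 + 60x fits these points best.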
We must establish a criterion for selecting the best line The criterion used is the least squares criterion To understand the least squares criterion, you need to know about prediction error, or residual, which is the distance between the actual y coordinate of an (x, y) point and the predicted value of that y coordinate produced by the regression line Figure 14.8 shows how the prediction error is calculated for the employee who was with Midwest for four years (x  4) using one possible regression line: (where yˆ is the predicted sales value) The predicted sales value is yˆ  150  60(4)  390 However, the actual sales (y) for this employee is 312 (see Table 14.2) Thus, when x  4, the difference between the observed value, y  312, and the predicted value, yˆ  390, is 312  390 78 The residual (or prediction error) for this case when x  is 78 Table 14.2 shows the calculated prediction errors and sum of squared errors for each of the three regression lines shown in Figure 14.7.1 Of these three potential regression models, the line with the equation yˆ  150  60 x has the smallest sum of squared errors However, is there a better line than n this? That is, would ∑ ( yi  yˆi ) be smaller for some other line? One way to determine this is i1 to calculate the sum of squared errors for all other regression lines However, because there are an infinite number of these lines, this approach is not feasible Fortunately, through the use of calculus, equations can be derived to directly determine the slope and intercept estimates n such that ∑ ( yi  yˆi ) is minimized.2 This is accomplished by letting the estimated regression i1 model be of the form shown in Equation 14.5 Chapter Outcome Estimated Regression Model (Sample Model) yˆ  b0  b1 x (14.5) where: yˆ  Estimated, or predicted, y-value b0  Unbiased estimate of the regression intercept, found using Equation 14.8 b1  Unbiased estimate of the regression slope, found using Eqquation 14.6 or 14.7 x  Value of the independent variable Equations 14.6 and 14.8 are referred to as the solutions to the least squares equations because they provide the slope and intercept that minimize the sum of squared errors Equation 14.7 is 1The reason we are using the sum of the squared residuals is that the sum of the residuals will be zero for the best regression line (the positive values of the residuals will balance the negative values) 2The calculus derivation of the least squares equations is contained in the Kutner et al reference shown at the end of this chapter (130) CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis | Sum of Squared Errors for Three Linear Equations for Midwest Distribution TABLE 14.2 From Figure 14.7(a): yˆ = 450 + x Residual x yˆ y y − yˆ 450 450 450 450 450 450 450 450 450 450 450 450 487 445 272 641 187 440 346 238 312 269 655 563 37 5 178 191 263 10 104 212 138 181 205 113 ( y − yˆ )2 1,369 25 31,684 36,481 69,169 100 10,816 44,944 19,044 32,761 42,025 12,769 ∑  301,187 From Figure 14.7(b): yˆ  250  40 x Residual x yˆ y y − yˆ 370 450 330 570 330 490 530 290 410 330 610 490 487 445 272 641 187 440 346 238 312 269 655 563 117 5 58 71 143 50 184 52 98 61 45 73 ( y − yˆ )2 13,689 25 3,364 5,041 20,449 2,500 33,856 2,704 9,604 3,721 2,025 5,329 ∑  102,307 From Figure 14.7(c): yˆ  150  60x Residual x yˆ y y − yˆ 330 450 270 630 270 510 570 210 390 270 690 510 487 445 272 641 187 440 346 238 312 269 655 563 157 5 11 83 70 224 28 78 1 35 53 ( y − yˆ )2 24,649 25 121 6,889 4,900 50,176 784 6,084 1,225 2,809 ∑  97,667 593 (131) 594 
CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis the algebraic equivalent of Equation 14.6 and may be easier to use when the computation is performed using a calculator Least Squares Equations b1  ∑( xi  x )( yi  y ) ∑ (xi  x )2 (14.6) algebraic equivalent: ∑x∑y n b1  ( ∑ x )2 ∑ x2  n (14.7) b0  y  b1 x (14.8) ∑ xy − and Table 14.3 shows the manual calculations, which are subject to rounding, for the least squares estimates for the Midwest Distribution example However, you will almost always | Manual Calculations for Least Squares Regression Coefficients for the Midwest Distribution Example TABLE 14.3 y x xy x2 487 1,461 237,169 445 2,225 25 198,025 272 544 73,984 641 5,128 64 410,881 187 374 34,969 440 2,640 36 193,600 346 2,422 49 119,716 238 238 56,644 312 1,248 16 97,344 269 538 72,361 655 5,895 81 429,025 563 3,378 36 316,969 ∑ y  4,855 ∑ x  55 ∑ xy  26,091 ∑ x2  329 ∑ y2  2,240,687 y ∑y n  b1  4, 855 12 ∑xy − ∑x −  49.91  404.58 ∑x ∑ y n = ( ∑x ) n x ∑x 26, 091 − 329 − n  55 12 y2  4.58 55(4, 855) 12 (55) 12 Then, b0  y − b1 x  404.58 − 49.91(4.58)  175.99 The least squares regression line is, therefore, yˆ  175.99  49.91x There is a slight difference between the manual calculation and the computer result due to rounding (132) CHAPTER 14 FIGURE 14.9A | Introduction to Linear Regression and Correlation Analysis 595 | Excel 2007 Midwest Distribution Regression Results Excel 2007 Instructions: Open file: Midwest.xls On the Data tab, click Data Analysis Select Regression Analysis Define x (Years with Midwest) and y (Sales) variable data range Select output location Check Labels Click Residuals Click OK SSE = 84834.2947 Estimated regression equation is y = 175.8288 = 49.9101(x) ^ use a software package such as Excel or Minitab to perform these computations (Figures 14.9a and 14.9b show the Excel and Minitab output.) 
In this case, the “best” regression line, given the least squares criterion, is yˆ  175.8288  49.9101(x) Figure 14.10 shows the predicted sales values along with the prediction errors and squared errors associated with this best simple linear regression line Keep in mind that the prediction errors are also referred to as residuals From Figure 14.10, the sum of the squared errors is 84,834.29 This is the smallest sum of squared residuals possible for this set of sample data No other simple linear regression line FIGURE 14.9B | Minitab Midwest Distribution Regression Results Minitab Instructions: Open file: Midwest MTW Choose Stat  Regression  Regression In Response, enter the y variable column In Predictors, enter the x variable column Click Storage; under Diagnostic Measures select Residuals Click OK OK Estimated regression equation is yˆ = 176 + 49.9(x) Sum of squares residual = 84,834 (133) 596 CHAPTER 14 FIGURE 14.10 | | Introduction to Linear Regression and Correlation Analysis Residuals and Squared Residuals for the Midwest Distribution Example Excel 2007 Instructions: Create Squared Residuals using Excel formula (i.e., for cell D25, use = C25^2) Sum the residuals and squared residuals columns Minitab Instructions (for similar results): Choose Calc  Column Statistics Under Statistics, Choose Sum In Input variable, enter residual column Click OK Choose Calc  Column Statistics Under Statistic, choose Sum of Squares In Input variable, enter residual column Click OK Sum of residuals equal zero SSE = 84,834.2947 through these 12 (x, y) points will produce a smaller sum of squared errors Equation 14.9 presents a formula that can be used to calculate the sum of squared errors manually Sum of Squared Errors SSE  ∑ y  b0 ∑ y  b1 ∑ xy (14.9) Figure 14.11 shows the scatter plot of sales and years of experience and the least squares regression line for Midwest Distribution This line is the best fit for these sample data The regression line passes through the point corresponding to (x , y ) This will always be the case Least Squares Regression Properties Figure 14.10 illustrates several important properties of least squares regression These are as follows: The sum of the residuals from the least squares regression line is (Equation 14.10) The total underprediction by the regression model is exactly offset by the total overprediction Sum of Residuals n ∑ ( yi  yˆi )  i1 (14.10) (134) CHAPTER 14 FIGURE 14.11 | Introduction to Linear Regression and Correlation Analysis 597 | Least Squares Regression Line for Midwest Distribution Excel 2007 Instructions: Open file: Midwest.xls Move the Sales column to the right of Years with Midwest column Select data for chart On Insert tab, click XY (Scatter), and then click Scatter with only Markers option Use the Layout tab of the Chart Tools to add titles and remove grid lines Use the Design tab of the Chart Tools to move the chart to a new worksheet Click on a chart point Right click and select Add Trendline Select Linear yˆ  175.8288  49.9101x The sum of the squared residuals is the minimum (Equation 14.11) Sum of Squared Residuals (Errors) n SSE  ∑ ( yi  yˆi )2 (14.11) i1 This property provided the basis for developing the equations for b0 and b1 The simple regression line always passes through the mean of the y variable, y , and the mean of the x variable, x So, to manually draw any simple linear regression line, all you need to is to draw a line connecting the least squares y intercept with the ( x , y ) point The least squares coefficients are unbiased 
estimates of b0 and b1 Thus, the expected values of b0 and b1 equal b0 and b1, respectively EXAMPLE 14-2 SIMPLE LINEAR REGRESSION AND CORRELATION Fitzpatrick & Associates The investment firm Fitzpatrick & Associates wants to manage the pension fund of a major Chicago retailer For their presentation to the retailer, the Fitzpatrick analysts want to use simple linear regression to model the relationship between profits and numbers of employees for 50 Fortune 500 companies in the firm’s portfolio The data for the analysis are contained in the file Fortune 50 This analysis can be done using the following steps: Excel and Minitab tutorials Excel and Minitab Tutorial Step Specify the independent and dependent variables The object in this example is to model the linear relationship between number of employees (the independent variable) and each company’s profits (the dependent variable) Step Develop a scatter plot to graphically display the relationship between the independent and dependent variables Figure 14.12 shows the scatter plot, where the dependent variable, y, is company profits and the independent variable, x, is number of employees (135) 598 CHAPTER 14 FIGURE 14.12 | Introduction to Linear Regression and Correlation Analysis | Excel 2007 Scatter Plot for Fitzpatrick & Associates Excel 2007 Instructions: Open file: Fortune 50.xls Copy the Profits column to the immediate right of Employees column Select data for chart (Employees and Profits) On Insert tab, click XY (Scatter), and then click the Scatter with only Markers option Minitab Instructions (for similar result): Open file: Fortune 50.MTW Choose Graph  Character Graphs  Scatterplot Use the Layout tab of the Chart Tools to add titles and remove grid lines Use the Design tab of the Chart Tools to move the chart to a new worksheet In Y variable, enter y column In X variable, enter x column Click OK There appears to be a slight positive linear relationship between the two variables Step Calculate the correlation coefficient and the linear regression equation Do either manually using Equations 14.1, 14.6 (or 14.7), and 14.8, respectively, or by using Excel or Minitab software Figure 14.13 shows the regression results The sample correlation coefficient (called “Multiple R” in Excel) is r  0.3638 The regression equation is yˆ  2, 556.88  0.0048 x The regression slope is estimated to be 0.0048, which means that for each additional employee, the average increase in company profit is 0.0048 million dollars, or $4,800 The intercept can only be interpreted when a value equal to zero for the x variable (employees) is plausible Clearly, no company has zero employees, so the intercept in this case has no meaning other than it locates the height of the regression line for x  (136) CHAPTER 14 FIGURE 14.13 | Introduction to Linear Regression and Correlation Analysis 599 | Excel 2007 Regression Results for Fitzpatrick & Associates Correlation coefficient r = √ 0.1323 = 0.3638 Regression Equation Excel 2007 Instructions: Open file: Fortune 50.xls On the Data tab, click Data Analysis Select Regression Analysis Define x (Employees) and y (Profits) variable data ranges Check Labels Select output location Click OK Minitab Instructions (for similar result): In Response, enter the y variable column Open file: Fortune 50.MTW In Predictors, enter the x variable column Choose Stat  Regression  Regression Click OK >>END EXAMPLE TRY PROBLEM 14-17 a,b,c (pg 610) Chapter Outcome Significance Tests in Regression Analysis In Section 14.1, we pointed out that the 
correlation coefficient computed from sample data is a point estimate of the population correlation coefficient and is subject to sampling error We also introduced a test of significance for the correlation coefficient Likewise, the regression coefficients developed from a sample of data are also point estimates of the true regression coefficients for the population The regression coefficients are subject to sampling error For example, due to sampling error the estimated slope coefficient may be positive or negative while the population slope is really Therefore, we need a test procedure to determine whether the regression slope coefficient is statistically significant As you will see in this section, the test for the simple linear regression slope coefficient is equivalent to the test for the correlation coefficient That is, if the correlation between two variables is found to be significant, then the regression slope coefficient will also be significant (137) 600 CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis The Coefficient of Determination, R2 BUSINESS APPLICATION TESTING THE REGRESSION MODEL MIDWEST DISTRIBUTION (CONTINUED) Recall that the Midwest Distribution marketing manager was analyzing the relationship between the number of years an employee had been with the company (independent variable) and the sales generated by the employee (dependent variable) We note when looking at the sample data for 12 employees (see Table 14.3) that sales vary among employees Regression analysis aims to determine the extent to which an independent variable can explain this variation In this case, does number of years with the company help explain the variation in sales from employee to employee? The SST (total sum of squares) can be used in measuring the variation in the dependent variable SST is computed using Equation 14.12 For Midwest Distribution, the total sum of squares for sales is provided in the output generated by Excel or Minitab, as shown in Figure 14.14a and Figure 14.14b As you can see, the total sum of squares in sales that needs to be explained is 276,434.92 Note that the SST value is in squared units and has no particular meaning Total Sum of Squares n SST  ∑ ( yi  y )2 (14.12) i1 where: FIGURE 14.14A SST  Total sum of squares n  Sample size yi  ith value of the dependent variable y  Average value of the dependent variable | Excel 2007 Regression Results for Midwest Distribution R-squared = 0.6931 Excel 2007 Instructions: Open file: Midwest.xls On the Data tab, click Data Analysis Select Regression Analysis Define x (Years with Midwest) and y (Sales) variable data range Click on Labels Specify output location Click OK SSR = 191,600.62 SST = 276,434.92 SSE = 84,834.29 (138) CHAPTER 14 FIGURE 14.14B | Introduction to Linear Regression and Correlation Analysis 601 | Minitab Regression Results for Midwest Distribution R-squared = 0.693 Minitab Instructions: Open file: Midwest.MTW Choose Stat  Regression  Regression In Response, enter the y variable column In Predictors, enter the x variable column Click OK SSR = 191,601 SSE = 84,834 SST = 276,435 The least squares regression line is computed so that the sum of squared residuals is minimized (recall the discussion of the least squares equations) The sum of squared residuals is also called the sum of squares error (SSE) and is defined by Equation 14.13 Sum of Squares Error n SSE  ∑ ( yi  yˆi )2 (14.13) i1 where: n  Sample size yi  i th value of the dependent variable yˆi  i th predicted value of y given the i th 
value of x SSE represents the amount of the total sum of squares in the dependent variable that is not explained by the least squares regression line Excel refers to SSE as sum of squares residual and Minitab refers to SSE as residual error This value is contained in the regression output shown in Figure 14.14a and Figure 14.14b SSE  ∑( y  yˆ)2  84,834.29 Thus, of the total sum of squares (SST  276,434.92), the regression model leaves SSE  84,834.29 unexplained Then, the portion of the total sum of squares that is explained by the regression line is called the sum of squares regression (SSR) and is calculated by Equation 14.14 (139) 602 CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis Sum of Squares Regression n SSR  ∑ ( yˆi  y )2 (14.14) i1 where: yˆi  Estimated value of y for each value of x y  Average value of the y variable The sum of squares regression (SSR  191,600.62) is also provided in the regression output shown in Figure 14.14a and Figure 14.14b You should also note that the following holds: SST  SSR  SSE For the Midwest Distribution example, in the Minitab output we get 276,435  191,601  84,834 Coefficient of Determination The portion of the total variation in the dependent variable that is explained by its relationship with the independent variable The coefficient of determination is also called R-squared and is denoted as R We can use these calculations to compute an important measure in regression analysis called the coefficient of determination The coefficient of determination is calculated using Equation 14.15 Coefficient of Determination, R2 R2  SSR SST (14.15) Then, for the Midwest Distribution example, the proportion of variation in sales that can be explained by its linear relationship with the years of sales force experience is R2  SSR 191, 600.62   0.6931 SST 276, 434.92 This means that 69.31% of the variation in the sales data for this sample can be explained by the linear relationship between sales and years of experience Notice that R2 is part of the regression output in Figures 14.14a and 14.14b R2 can be a value between and 1.0 If there is a perfect linear relationship between two variables, then the coefficient of determination, R2, will be 1.0 This would correspond to a situation in which the least squares regression line would pass through each of the points in the scatter plot R2 is the measure used by many decision makers to indicate how well the linear regression line fits the (x, y) data points The better the fit, the closer R2 will be to 1.0 R2 will be close to when there is a weak linear relationship Finally, when you are employing simple linear regression (a linear relationship between a single independent variable and the dependent variable), there is an alternative way of computing R2, as shown in Equation 14.16 Coefficient of Determination for the Single Independent Variable Case R2  r2 where: R2  Coefficient of determination r  Sample correlation coefficient (14.16) (140) CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis 603 Therefore, by squaring the correlation coefficient, we can get R2 for the simple regression model Figure 14.14a shows the correlation, r  0.8325, which is referred to as Multiple R in Excel Then, using Equation 14.16, we get R2 R2  r  0.83252  0.6931 Keep in mind that R2  0.6931 is based on the random sample of size 12 and is subject to sampling error Thus, just because R2  0.6931 for the sample data does not mean that knowing the number of years an employee has worked for the 
company will explain 69.31% of the variation in sales for the population of all employees with the company. Likewise, just because R² > 0.0 for the sample data does not mean that the population coefficient of determination, noted as ρ² (rho-squared), is greater than zero. However, a statistical test exists for testing the following null and alternative hypotheses:

H0: ρ² = 0
HA: ρ² > 0

The test statistic is an F-test with the test statistic defined as shown in Equation 14.17.

Test Statistic for Significance of the Coefficient of Determination

F = \frac{SSR}{\dfrac{SSE}{n - 2}},    df = (D1 = 1, D2 = n - 2)    (14.17)

where:
SSR = Sum of squares regression
SSE = Sum of squares error

For the Midwest Distribution example, the test statistic is computed using Equation 14.17 as follows:

F = \frac{191,600.62}{\dfrac{84,834.29}{12 - 2}} = 22.58

The critical value from the F-distribution table in Appendix H for α = 0.05 and for 1 and 10 degrees of freedom is 4.965. This gives the following decision rule:

If F > 4.965, reject the null hypothesis.
Otherwise, do not reject the null hypothesis.

Because F = 22.58 > 4.965, we reject the null hypothesis and conclude the population coefficient of determination (ρ²) is greater than zero. This means the independent variable explains a significant proportion of the variation in the dependent variable.

For a simple regression model (a regression model with a single independent variable), the test for ρ² is equivalent to the test shown earlier for the population correlation coefficient, ρ. Refer to Figure 14.5 to see that the t-test statistic for the correlation coefficient was t = 4.752. If we square this t-value, we get t² = 4.752² = F = 22.58. Thus, the tests are equivalent. They will provide the same conclusions about the relationship between the x and y variables.

Significance of the Slope Coefficient

For a simple linear regression model (one independent variable), there are three equivalent statistical tests:
1. Test for significance of the correlation between x and y
2. Test for significance of the coefficient of determination
3. Test for significance of the regression slope coefficient

We have already introduced the first two of these tests. The third one deals specifically with the significance of the regression slope coefficient. The null and alternative hypotheses to be tested are

H0: β1 = 0
HA: β1 ≠ 0

To test the significance of the simple linear regression slope coefficient, we are interested in determining whether the population regression slope coefficient is 0. A slope of 0 would imply that there is no linear relationship between the x and y variables and that the x variable, in its linear form, is of no use in explaining the variation in y. If the linear relationship is useful, then we should reject the hypothesis that the regression slope is 0. However, because the estimated regression slope coefficient, b1, is calculated from sample data, it is subject to sampling error. Therefore, even though b1 is not 0, we must determine whether its difference from 0 is greater than would generally be attributed to sampling error.

If we selected several samples from the same population and for each sample determined the least squares regression line, we would likely get regression lines with different slopes and different y intercepts. This is analogous to getting different sample means from different samples when attempting to estimate a population mean. Just as the distribution of possible sample means has a standard error, the possible regression slopes also have a standard error, which
is given in Equation 14.18 Simple Regression Standard Error of the Slope Coefficient (Population) sb  where: s ∑(x  x)2 (14.18) sb  Standard deviation of the regression sloope (called the standard error of the slope) sε  Population standard error of the estimate Equation 14.18 requires that we know the standard error of the estimate It measures the dispersion of the dependent variable about its mean value at each value of the dependent variable in the original units of the dependent variable However, because we are sampling from the population, we can estimate se as shown in Equation 14.19 Simple Regression Estimator for the Standard Error of the Estimate sε  SSE n2 (14.19) where: SSE  Sum of squares error n  Sample size Equation 14.18, the standard error of the regression slope, applies when we are dealing with a population However, in most cases, such as the Midwest Distribution example, we are (142) CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis 605 dealing with a sample from the population Thus, we need to estimate the regression slope’s standard error using Equation 14.20 Simple Regression Estimator for the Standard Error of the Slope sb  sε (14.20) ∑(x  x )2 where: sb  Estimate of the standard error of the leeast squares slope sε  SSE  Sample standard error of the estimate (the measure n  of deviation of the actual y-values around the regression line) BUSINESS APPLICATION REGRESSION ANALYSIS USING COMPUTER SOFTWARE MIDWEST DISTRIBUTION (CONTINUED) For Midwest Distribution, the regression outputs in Figures 14.15a and 14.15b show b1  49.91 The question is whether this value is different enough from to have not been caused by sampling error We find the answer by looking at the value of the estimate of the standard error of the slope, calculated using Equation 14.20, which is also shown in Figure 14.15a The standard error of the slope coefficient is 10.50 If the standard error of the slope s b is large, then the value of b1 will be quite variable from sample to sample Conversely, if s b is small, the possible slope values will be less vari1 able However, regardless of the standard error of the slope, the average value of b1 will equal b1, the true regression slope, if the assumptions of the regression analysis are satisfied Figure 14.16 illustrates what this means Notice that when the standard error of the slope is FIGURE 14.15A | Excel 2007 Regression Results for Midwest Distribution Excel 2007 Instructions: Open file: Midwest.xls On the Data tab, click Data Analysis Select Regression Analysis Define x (Years with Midwest) and y (Sales) variable data range Click on Labels Specify output location Click OK The calculated t statistic and p-value for testing whether the regression slope is Standard error of the regression slope = 10.50 The corresponding F-ratio and p-value for testing whether the regression slope equals (143) 606 CHAPTER 14 FIGURE 14.15B | Introduction to Linear Regression and Correlation Analysis | Minitab Regression Results for Midwest Distribution The calculated t statistic and p-value for testing whether the regression slope is Minitab Instructions: Open file: Midwest MTW Choose Stat  Regression  Regression In Response, enter the y variable column In Predictors enter the x variable column Click OK The corresponding F-ratio and p-value for testing whether the regression slope equals Regression slope = 49.91 Standard error of the regression slope = 10.50 large, the sample slope can take on values much different from the true population slope As 
Figure 14.16(a) shows, a sample slope and the true population slope can even have different signs However, when b1 is small, the sample regression lines will cluster closely around the true population line [Figure 14.16(b)] Because the sample regression slope will most likely not equal the true population slope, we must test to determine whether the true slope could possibly be A slope of in the linear model means that the independent variable will not explain any variation in the dependent variable, nor will it be useful in predicting the dependent variable Assuming a 0.05 level of significance, the null and alternative hypotheses to be tested are H0: b1  HA: b1  FIGURE 14.16 | Standard Error of the Slope y y Sample E(y)  0  1x E(y)  0  1x Sample Sample (a) Large Standard Error x (b) Small Standard Error x (144) CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis 607 To test the significance of a slope coefficient, we use the t-test value in Equation 14.21 Simple Linear Regression Test Statistic for Test of the Significance of the Slope t b1  b1 sb df  n  (14.21) where: b1  Sample regression slope coefficient b1  Hypothesized slope (usually b1  0) sb  Estimator of the standard error of the slope Figure 14.17 illustrates this test for the Midwest Distribution example The calculated t-value of 4.752 exceeds the critical value, t  2.228, from the t-distribution with 10 degrees of freedom and a/2  0.025 This indicates that we should reject the hypothesis that the true regression slope is Thus, years of experience can be used to help explain the variation in an individual representative’s sales This is not a coincidence This test is always equivalent to the tests for r and r2 presented earlier The output shown in Figures 14.15a and 14.15b also contains the calculated t statistic The p-value for the calculated t statistic is also provided As with other situations involving two-tailed hypothesis tests, if the p-value is less than a, the null hypothesis is rejected In this case, because p-value  0.0008 0.05, we reject the null hypothesis FIGURE 14.17 | Significance Test of the Regression Slope for Midwest Distribution Hypotheses: H0: 1  HA: 1    0.05 df  12 –  10 Rejection Region /2  0.025 –t0.025  –2.228 Rejection Region /2  0.025 t0.025  2.228 The calculated t is b – 1 49.91 – t   4.752 sb1 10.50 Decision Rule: If t > t0.025  2.228, reject H0 If t < –t0.025  –2.228, reject H0 Otherwise, not reject H0 Because 4.752 > 2.228, we should reject the null hypothesis and conclude that the true slope is not Thus, the simple linear relationship that utilizes the independent variable, years with the company, is useful in explaining the variation in the dependent variable, sales volume (145) 608 CHAPTER 14 How to it | Introduction to Linear Regression and Correlation Analysis (Example 14-3) EXAMPLE 14-3 Vantage Electronic Systems Consider the example involving Simple Linear Regression Analysis The following steps outline the process that can be used in developing a simple linear regression model and the various hypotheses tests used to determine the significance of a simple linear regression model Define the independent (x) and dependent (y) variables and select a simple random sample of pairs of (x, y) values SIMPLE LINEAR REGRESSION ANALYSIS Vantage Electronic Systems in Deerfield, Michigan, which started out supplying electronic equipment for the automobile industry but in recent years has ventured into other areas One area is visibility sensors used by airports to provide takeoff 
HOW TO DO IT (Example 14-3) Simple Linear Regression Analysis
The following steps outline the process that can be used in developing a simple linear regression model and the various hypothesis tests used to determine the significance of a simple linear regression model.
1. Define the independent (x) and dependent (y) variables and select a simple random sample of pairs of (x, y) values.
2. Develop a scatter plot of y and x. You are looking for a linear relationship between the two variables.
3. Compute the correlation coefficient for the sample data.
4. Calculate the least squares regression line for the sample data and the coefficient of determination, R². The coefficient of determination measures the proportion of variation in the dependent variable explained by the independent variable.
5. Conduct any of the following tests for determining whether the regression model is statistically significant:
   a. Test to determine whether the true regression slope is 0. The test statistic with df = n − 2 is
      t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{b_1 - 0}{s_{b_1}}
   b. Test to see whether ρ is significantly different from 0. The test statistic is
      t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}
   c. Test to see whether ρ² is significantly greater than 0. The test statistic is
      F = \frac{SSR/1}{SSE/(n - 2)}
6. Reach a decision.
7. Draw a conclusion.

EXAMPLE 14-3 SIMPLE LINEAR REGRESSION ANALYSIS
Vantage Electronic Systems Consider Vantage Electronic Systems in Deerfield, Michigan, which started out supplying electronic equipment for the automobile industry but in recent years has ventured into other areas. One area is visibility sensors used by airports to provide takeoff and landing information and by transportation departments to detect low visibility on roadways during fog and snow. The recognized leader in the visibility sensor business is the SCR Company, which makes a sensor called the Scorpion. The research and development (R&D) department at Vantage has recently performed a test on its new unit by locating a Vantage sensor and a Scorpion sensor side by side. Various data, including visibility measurements, were collected at randomly selected points in time over a two-week period. These data are contained in a file called Vantage.

Step 1 Define the independent (x) and dependent (y) variables.
The analysis included a simple linear regression using the Scorpion visibility measurement as the dependent variable, y, and the Vantage visibility measurement as the independent variable, x.

Step 2 Develop a scatter plot of y and x.
The scatter plot is shown in Figure 14.18. There does not appear to be a strong linear relationship.

FIGURE 14.18 | Scatter Plot—Example 14-3: Scorpion Visibility (y, 0 to 14) plotted against Vantage Visibility (x, 0 to 1.2). The points show only a weak linear pattern.

Step 3 Compute the correlation coefficient for the sample data.
Equation 14.1 or 14.2 can be used for manual computation, or we can use Excel or Minitab. The sample correlation coefficient is r = 0.5778.

Step 4 Calculate the least squares regression line for the sample data and the coefficient of determination, R².
Equations 14.7 and 14.8 can be used to manually compute the regression slope coefficient and intercept, respectively, and Equation 14.15 or 14.16 can be used to manually compute R². Excel and Minitab can also be used to eliminate the computational burden. The coefficient of determination is

R² = r² = 0.5778² = 0.3339

Thus, approximately 33% of the variation in the Scorpion visibility measures is explained by knowing the corresponding Vantage system visibility measure. The least squares regression equation is

ŷ = 0.586 + 3.017x

Step 5 Conduct a test to determine whether the regression model is statistically significant (or whether the population correlation is equal to 0).
The null and alternative hypotheses to test the correlation coefficient are

H0: ρ = 0
HA: ρ ≠ 0

The t-test statistic using Equation 14.3 is

t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{0.5778}{\sqrt{\frac{1 - 0.5778^2}{280 - 2}}} = 11.8

The t = 11.8 exceeds the critical t for any reasonable level of α for 278 degrees of freedom, so the null hypothesis is rejected and we conclude that there is a statistically significant linear relationship between visibility measures for the two visibility sensors.
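The Step 5 calculation can be reproduced from just the two quantities quoted in the example, r = 0.5778 and n = 280; the following sketch evaluates Equation 14.3 and its p-value:

```python
import math
from scipy import stats

r, n = 0.5778, 280
t_stat = r / math.sqrt((1 - r**2) / (n - 2))   # Equation 14.3
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)

print(f"t = {t_stat:.1f}, p-value = {p_value:.3g}")   # t is about 11.8
```

Squaring this statistic gives the F-ratio from Step 5c (11.8² is roughly 139), and essentially the same t reappears in the slope test below; in simple linear regression the three significance tests are equivalent.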
Alternatively, the null and alternative hypotheses to test the regression slope coefficient are

H0: β1 = 0
HA: β1 ≠ 0

The t-test statistic is

t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{3.017 - 0}{0.2557} = 11.8

Step 6 Reach a decision.
The t-test statistic of 11.8 exceeds the critical t for any reasonable level of α for 278 degrees of freedom.

Step 7 Draw a conclusion.
The population regression slope coefficient is not equal to 0. This means that knowing the Vantage visibility reading provides useful help in knowing what the Scorpion visibility reading will be.

>>END EXAMPLE TRY PROBLEM 14-16 (pg 609)

MyStatLab
14-2: Exercises
Skill Development
14-16 You are given the following sample data for variables y and x:
y: 140.1 120.3 80.8 100.7 130.2 90.6 110.5 120.2 130.4 130.3 100.1
x: 5 5 4 4
a. Develop a scatter plot for these data and describe what, if any, relationship exists.
b. (1) Compute the correlation coefficient. (2) Test to determine whether the correlation is significant at the significance level of 0.05. Conduct this hypothesis test using the p-value approach. (3) Compute the regression equation based on these sample data and interpret the regression coefficients.
c. Test the significance of the overall regression model using a significance level equal to 0.05.
Describe the criterion you used to determine “best” fit d Determine the regression line that minimizes the sum of squares error Business Applications 14-21 The Skelton Manufacturing Company recently did a study of its customers A random sample of 50 customer accounts was pulled from the computer records Two variables were observed: y  Total dollar volume of business this year x  Miles customer is from corporate headquarters The following statistics were computed: yˆ  2,140.23 10.12 x sb  3.12 a Develop a simple linear regression equation for these data b Calculate the sum of squared residuals, the total sum of squares, and the coefficient of determination c Calculate the standard error of the estimate d Calculate the standard error for the regression slope e Conduct the hypothesis test to determine whether the regression slope coefficient is equal to Test using a  0.02 14-19 Consider the following sample data for the variables y and x: x y a Calculate the linear regression equation for these data b Determine the predicted y-value when x  10 c Estimate the change in the y variable resulting from the increase in the x variable of 10 units d Conduct a hypothesis test to determine if an increase of unit in the x variable will result in the decrease of the average value of the y variable Use a significance of 0.025 14-20 Examine the following sample data for the variables y and x: 20.1 13.2 9.3 25.6 11.2 19.4 a Interpret the regression slope coefficient b Using a significance level of 0.01, test to determine whether it is true that the farther a customer is from the corporate headquarters, the smaller the total dollar volume of business 14-22 A shipping company believes that the variation in the cost of a customer’s shipment can be explained by differences in the weight of the package being shipped To investigate whether this relationship is useful, a random sample of 20 customer shipments was selected, and the weight (in lb) and the cost (in dollars, rounded to the nearest dollar) for each shipment were recorded The following results were obtained: Weight (x) Cost (y) 11 11 11 (148) CHAPTER 14 Weight (x) Cost (y) 12 17 13 18 17 17 10 20 13 6 12 17 11 27 16 25 21 24 16 24 21 10 21 16 11 20 a Construct a scatter plot for these data What, if any, relationship appears to exist between the two variables? b Compute the linear regression model based on the sample data Interpret the slope and intercept coefficients c Test the significance of the overall regression model using a significance level equal to 0.05 d What percentage of the total variation in shipping cost can be explained by the regression model you developed in part b? 14-23 College tuition has risen at a pace faster than inflation for more than two decades, according to an article in USA Today The following data indicate the average college tuition (in 2003 dollars) for private and public colleges: Period 1983–1984 1988–1989 1993–1994 1998–1999 2003–2004 2008–2009 Private 9,202 12,146 13,844 16,454 19,710 21,582 Public 2,074 2,395 3,188 3,632 4,694 5,652 a Conduct a simple linear regression analysis of these data in which the average tuition for private colleges is predicted by the average public college tuition Test the significance of the model using an a  0.10 b How much does the average private college tuition increase when the average public college tuition increases by $100? c When the average public college tuition reaches $7,500, how much would you expect the average private college tuition to be? 
Computer Database Exercises 14-24 The file Online contains a random sample of 48 customers who made purchases last quarter from an online retailer The file contains information related to the time each customer spent viewing the online catalog and the dollar amount of purchases made | Introduction to Linear Regression and Correlation Analysis 611 The retailer would like to analyze the sample data to determine whether a relationship exists between the time spent viewing the online catalog and the dollar amount of purchases a Compute the regression equation based on these sample data and interpret the regression coefficients b Compute the coefficient of determination and interpret its meaning c Test the significance of the overall regression model using a significance level of 0.01 d Test to determine whether the true regression slope coefficient is equal to Use a significance level of 0.01 to conduct the hypothesis test 14-25 The National Football League (NFL) is arguably the most successful professional sports league in the United States Following the recent season, the commissioner’s office staff performed an analysis in which a simple linear regression model was developed with average home attendance used as the dependent variable and the total number of games won during the season as the independent variable The staff was interested in determining whether games won could be used as a predictor for average attendance Develop the simple linear regression model The data are in the file called NFL a What percentage of total variation in average home attendance is explained by knowing the number of games the team won? b What is the standard error of the estimate for this regression model? c Using a  0.05, test to determine whether the regression slope coefficient is significantly different from d After examining the regression analysis results, what should the NFL staff conclude about how the average attendance is related to the number of games the team won? 14-26 The consumer price index (CPI) is a measure of the average change in prices over time in a fixed market basket of goods and services typically purchased by consumers The CPI for all urban consumers covers about 80% of the total population It is prepared and published by the Bureau of Labor Statistics of the Department of Labor, which measures average changes in prices of goods and services The CPI is one way the government measures the general level of inflation—the annual percentage change in the value of this index is one way of measuring the annual inflation rate The file entitled CPI contains the monthly CPI and inflation rate for the period 2000–2005 a Construct a scatter plot of the CPI versus inflation for the period 2000–2005 Describe the relationship that appears to exist between these two variables b Conduct a hypothesis test to confirm your preconception of the relationship between the CPI and the inflation rate Use a  0.05 (149) 612 CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis c Does it appear that the CPI and the inflation rate are measuring the same component of our economy? 
Support your assertion with statistical reasoning 14-27 The College Board, administrator of the SAT test for college entrants, has made several changes to the test in recent years One recent change occurred between years 2005 and 2006 In a press release the College Board announced SAT scores for students in the class of 2005, the last class to take the former version of the SAT featuring math and verbal sections The board indicated that for the class of 2005, the average SAT math scores continued their strong upward trend, increasing from 518 in 2004 to 520 in 2005, 14 points above 10 years previous and an all-time high The file entitled MathSAT contains the math SAT scores for the interval 1967 to 2005 a Produce a scatter plot of the average SAT math scores versus the year the test was taken for all students (male and female) during the last 10 years (1996–2005) b Construct a regression equation to predict the average math scores with the year as the predictor c Use regression to determine if the College Board’s assertion concerning the improvement in SAT average math test scores over the last 10 years is overly optimistic 14-28 One of the editors of a major automobile publication has collected data on 30 of the best-selling cars in the United States The data are in a file called Automobiles The editor is particularly interested in the relationship between highway mileage and curb weight of the vehicles a Develop a scatter plot for these data Discuss what the plot implies about the relationship between the two variables Assume that you wish to predict highway mileage by using vehicle curb weight b Compute the correlation coefficient for the two variables and test to determine whether there is a linear relationship between the curb weight and the highway mileage of automobiles c (1) Compute the linear regression equation based on the sample data (2) A CTS Sedan weighs approximately 4,012 pounds Provide an estimate of the average highway mileage you would expect to obtain from this model 14-29 The Insider View of Las Vegas Web site (www.insidervlv com) furnishes information and facts concerning Las Vegas A set of data published by them provides the amount of gaming revenue for various portions of Clark County, Nevada The file entitled “VEGAS” provides the gaming revenue for the year 2005 a Compute the linear regression equation to predict the gaming revenue for Clark County based on the gaming revenue of the Las Vegas Strip b Conduct a hypothesis test to determine if the gaming revenue from the Las Vegas Strip can be used to predict the gaming revenue for all of Clark County c Estimate the increased gaming revenue that would accrue to all of Clark County if the gaming revenue on the Las Vegas Strip were to increase by a million dollars END EXERCISES 14-2 Chapter Outcome 14.3 Uses for Regression Analysis Regression analysis is a statistical tool that is used for two main purposes: description and prediction This section discusses these two applications Regression Analysis for Description BUSINESS APPLICATION USING REGRESSION ANALYSIS FOR DECISION-MAKING CAR MILEAGE In the summer of 2006, gasoline prices soared to record levels in the United States, heightening motor vehicle customers’ concern for fuel economy Analysts at a major automobile company collected data on a variety of variables for a sample of 30 different cars and small trucks Included among those data were the Environmental Protection Agency (EPA)’s highway mileage rating and the horsepower of each vehicle The analysts were 
interested in the relationship between horsepower (x) and highway mileage (y). The data are contained in the file Automobiles. A simple linear regression model can be developed using Excel or Minitab. The Excel output is shown in Figure 14.19.

FIGURE 14.19 | Excel 2007 Regression Results for the Automobile Mileage Study. The output shows the regression equation HW Mileage = 31.1658 − 0.0286(Horse Power), along with the calculated t statistic, p-value, and confidence interval for the slope. Excel 2007 Instructions: 1. Open file: Automobiles.xls. 2. Click on Data tab. 3. Select Data Analysis > Regression. 4. Define y variable range (Highway Mileage) and x variable range (Horse Power). 5. Check Labels. 6. Specify output location. 7. Click OK. (Minitab Instructions, for similar results: Open file Automobiles.MTW; choose Stat > Regression > Regression; in Response, enter the y variable column; in Predictors, enter the x variable column; click OK.)

For these sample data, the coefficient of determination, R² = 0.3016, indicates that knowing the horsepower of the vehicle explains 30.16% of the variation in the highway mileage. The estimated regression equation is

ŷ = 31.1658 − 0.0286x

Before the analysts attempt to describe the relationship between horsepower and highway mileage, they first need to test whether there is a statistically significant linear relationship between the two variables. To do this, they can apply the t-test described in Section 14.2 to test the following null and alternative hypotheses:

H0: β1 = 0
HA: β1 ≠ 0

at the significance level α = 0.05. The calculated t statistic and the corresponding p-value are shown in Figure 14.19. Because the p-value (Significance F) = 0.0017 < 0.05, the null hypothesis, H0, is rejected and the analysts can conclude that the population regression slope is not equal to 0. The sample slope, b1, equals −0.0286. This means that for each 1-unit increase in horsepower, the highway mileage is estimated to decrease by an average of 0.0286 miles per gallon. However, b1 is subject to sampling error and is considered a point estimate for the true regression slope coefficient. From earlier discussions about point estimates in Chapters 8 and 10, we expect that b1 ≠ β1. Therefore, to help describe the relationship between the independent variable, horsepower, and the dependent variable, highway miles per gallon, we need to develop a confidence interval estimate for β1. Equation 14.22 is used to do this.

Confidence Interval Estimate for the Regression Slope, Simple Linear Regression

b_1 \pm t\, s_{b_1}    (14.22)

or equivalently,

b_1 \pm t \frac{s_\varepsilon}{\sqrt{\sum(x - \bar{x})^2}}, \quad df = n - 2

where:
s_b1 = Standard error of the regression slope coefficient
s_ε = Standard error of the estimate

The regression output shown in Figure 14.19 contains the 95% confidence interval estimate for the slope coefficient, which is

−0.045 ≤ β1 ≤ −0.012

Thus, at the 95% confidence level, based on the sample data, the analysts for the car company can conclude that a 1-unit increase in horsepower will result in a drop in mileage by an average amount between 0.012 and 0.045 miles per gallon.

There are many other situations in which the prime purpose of regression analysis is description. Economists use regression analysis for descriptive purposes as they search for a way of explaining the economy. Market researchers also use regression analysis, among other techniques, in an effort to describe the factors that influence the demand for products.
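Output like Figure 14.19 is not limited to Excel or Minitab. As an illustration, the sketch below fits the same kind of model with Python's statsmodels library on simulated stand-in data (the Automobiles file values are not reproduced here); the conf_int method returns the 95% confidence interval columns defined by Equation 14.22:

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in data: 30 vehicles with horsepower between 100 and 300.
rng = np.random.default_rng(1)
horsepower = rng.uniform(100, 300, 30)
mileage = 31.2 - 0.03 * horsepower + rng.normal(0, 2.0, 30)

X = sm.add_constant(horsepower)   # adds the intercept column
fit = sm.OLS(mileage, X).fit()

print(fit.summary())              # coefficients, t statistics, p-values, R-squared
print(fit.conf_int(alpha=0.05))   # 95% CI for intercept and slope (Equation 14.22)
```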
EXAMPLE 14-4 DEVELOPING A CONFIDENCE INTERVAL ESTIMATE FOR THE REGRESSION SLOPE
Home Prices Home values are determined by a variety of factors. One factor is the size of the house (square feet). Recently, a study was conducted by First City Real Estate aimed at estimating the average value of each additional square foot of space in a house. A simple random sample of 319 homes sold within the past year was collected. Here are the steps required to compute a confidence interval estimate for the regression slope coefficient:

Step 1 Define the y (dependent) and x (independent) variables.
The dependent variable is sales price, and the independent variable is square feet.

Step 2 Obtain the sample data.
The study consists of sales prices and corresponding square feet for a random sample of 319 homes. The data are in a file called First-City.

Step 3 Compute the regression equation and the standard error of the slope coefficient.
These computations can be performed manually using Equations 14.7 and 14.8 for the regression model and Equation 14.20 for the standard error of the slope. Alternatively, we can use Excel or Minitab to obtain these values:

                    Coefficients    Standard Error
Intercept (b0)      39,838.48       7,304.95
Square Feet (b1)    75.70           3.78

The point estimate for the regression slope coefficient is $75.70. Thus, for a 1-square-foot increase in the size of a house, house prices increase by an average of $75.70. This is a point estimate and is subject to sampling error.

Step 4 Construct and interpret the confidence interval estimate for the regression slope using Equation 14.22.
The confidence interval estimate is

b_1 \pm t\, s_{b_1}

where the degrees of freedom for the critical t is 319 − 2 = 317. The critical t for a 95% confidence interval estimate is approximately 1.97, and the interval estimate is

$75.70 ± 1.97($3.78)
$75.70 ± $7.45
$68.25 ≤ β1 ≤ $83.15

So, for a 1-square-foot increase in house size, at the 95% confidence level, we estimate that homes increase in price by an average of between $68.25 and $83.15.
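Step 4 can also be checked without the rounded critical value of 1.97. The sketch below uses only the three quantities reported in the example (b1 = 75.70, s_b1 = 3.78, n = 319); small differences in the last digits reflect rounding in the text:

```python
from scipy import stats

b1, s_b1, n = 75.70, 3.78, 319
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)   # roughly 1.967 with df = 317

margin = t_crit * s_b1                      # Equation 14.22
print(f"95% CI for the slope: {b1 - margin:.2f} to {b1 + margin:.2f}")
# about 68.26 to 83.14, matching the text's rounded 68.25 to 83.15
```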
59.6%) Notice in Figure 14.20 that Minitab has rounded the regression coefficient The more precise values are provided in the column headed “Coef” and are yˆ  527.6 1,352.80 x The administrator could use this equation to predict total charges by substituting the length of stay into the regression equation for x For example, suppose a patient has a five-day stay The predicted total charges are yˆ  527.6 1,352.80(5) yˆ  $7,291.60 Note that this predicted value is a point estimate of the actual charges for this patient The true charges will be either higher or lower than this amount The administrator can develop a prediction interval, which is similar to the confidence interval estimates developed in Chapter (153) 616 CHAPTER 14 FIGURE 14.20 | Introduction to Linear Regression and Correlation Analysis | Minitab Regression Output for Freedom Hospital Minitab Instructions: Open file: Patients.MTW Choose Stat  Regression  Regression In Response, enter the y variable column In Predictors, enter the x variable column Click OK Chapter Outcome Confidence Interval for the Average y, Given x The marketing manager might like a 95% confidence interval for average, or expected value, for charges of patients who stay in the hospital five days The confidence interval for the expected value of a dependent variable, given a specific level of the independent variable, is determined by Equation 14.23 Observe that the specific value of x used to provide the prediction is denoted as xp Confidence Interval for E(y)|xp yˆ tsε (x p  x )  n ∑ (x  x )2 (14.23) where: yˆ  Point estimate of the dependent variable t  Critical value with n  degrees of freedom n  Sample size x p  Specific value of the independent variable x  Mean of the independent variable observations in the sample sε  Estimate of the standard error of the estimatee Although the confidence interval estimate can be manually computed using Equation 14.23, using your computer is much easier Both PHStat and Minitab have built-in options to generate the confidence interval estimate for the dependent variable for a given value of the x variable Figure 14.21 shows the Minitab results when length of stay, x, equals five days Given this length of stay, the point estimate for the mean total charges is rounded by Minitab to $7,292, and at the 95% confidence level, the administrators believe the mean total charges will be in the interval $6,790 to $7,794 Prediction Interval for a Particular y, Given x The confidence interval shown in Figure 14.21 is for the average value of y given xp The administrator might also be interested in predicting the total charges for a particular patient with a five-day stay, rather than the average of the charges for all patients staying five days Developing this 95% prediction interval requires only a slight modification to Equation 14.23 This prediction interval is given by Equation 14.24 (154) | CHAPTER 14 FIGURE 14.21 Introduction to Linear Regression and Correlation Analysis | Minitab Output: Freedom Hospital Confidence Interval Estimate se √n — (xp − x ) Σ(x − x)2 Point Estimate Interval Estimate 6,790 -7,794 617 Minitab Instructions: Use Instructions in Figure 14.20 to get regression results Before clicking OK, select Options In Prediction Interval for New Observations, enter value(s) of x variable In Confidence level, enter 0.95 Click OK OK Prediction Interval for y|xp (x  x ) ˆy tsε 1  p n ∑ (x  x )2 (14.24) As was the case with the confidence interval application discussed previously, the manual computations required to use 
As was the case with the confidence interval application discussed previously, the manual computations required to use Equation 14.24 can be onerous. We recommend using your computer and software such as Minitab or PHStat to find the prediction interval. Figure 14.22 shows the PHStat results. Note that the same PHStat process generates both the prediction and confidence interval estimates.

FIGURE 14.22 | Excel 2007 (PHStat) Prediction Interval for Freedom Hospital (point estimate $7,292; prediction interval $1,545 to $13,038). Excel 2007 (PHStat) Instructions: 1. Open file: Patients.xls. 2. Click on Add-Ins > PHStat. 3. Select Regression > Simple Linear Regression. 4. Define y variable range (Total Charges) and x variable range (Length of Stay). 5. Select Confidence and Prediction Interval; set x = 5 and 95% confidence.

Based on this regression model, at the 95% confidence level, the hospital administrators can predict total charges for any patient with a length of stay of five days to be between $1,545 and $13,038. As you can see, this prediction has extremely poor precision. We doubt any hospital administrator will use a prediction interval that is so wide. Although the regression model explains a significant proportion of variation in the dependent variable, it is relatively imprecise for predictive purposes. To improve the precision, we might decrease the confidence level or increase the sample size and redevelop the model.

The prediction interval for a specific value of the dependent variable is wider (less precise) than the confidence interval for predicting the average value of the dependent variable. This will always be the case, as seen in Equations 14.23 and 14.24. From an intuitive viewpoint, we should expect to come closer to predicting an average value than a single value.

Note that the term (x_p − x̄)² has a particular effect on the intervals determined by both Equations 14.23 and 14.24. The farther x_p (the value of the independent variable used to predict y) is from x̄, the greater (x_p − x̄)² becomes. Figure 14.23 shows two regression lines developed from two samples with the same set of x-values. We have made both lines pass through the same (x̄, ȳ) point; however, they have different slopes and intercepts. At x_p = x1, the two regression lines give predictions of y that are close to each other. However, for x_p = x2, the predictions of y are quite different. Thus, when x_p is close to x̄, the problems caused by variations in regression slopes are not as great as when x_p is far from x̄.

FIGURE 14.23 | Regression Lines Illustrating the Increase in Potential Variation in y as x_p Moves Farther from x̄. The regression lines from Sample 1 and Sample 2 both pass through (x̄, ȳ); their predictions nearly agree at x_p = x1, near x̄, but diverge at x_p = x2, far from x̄.

Figure 14.24 shows the prediction intervals over the range of possible x_p values. The band around the estimated regression line bends away from the regression line as x_p moves in either direction from x̄.

Common Problems Using Regression Analysis
Regression is perhaps the most widely used statistical tool other than descriptive statistical techniques. Because it is so widely used, you need to be aware of the common problems encountered when the technique is employed. One potential problem occurs when decision makers apply regression analysis for predictive purposes. The conclusions and inferences made from a regression line are statistically valid only over the range of the data contained in the sample used to develop the regression line. For instance, in the Midwest Distribution example, we analyzed the performance of sales representatives with one to nine years of experience. Therefore, predicting sales levels for employees with one to nine years of experience would be justified. However, if we were to try to predict the sales performance of
someone with more than nine years of experience, the relationship between sales and experience might be different Because no observations were FIGURE 14.23 | Regression Lines Illustrating the Increase in Potential Variation in y _as xp Moves Farther from x Regression Line from Sample x, y yˆ1 y yˆ2 Regression Line from Sample xp = x x xp = x (156) | CHAPTER 14 FIGURE 14.24 | 619 Introduction to Linear Regression and Correlation Analysis y Confidence Intervals for y|xp and E( y)|xp yˆ = b0 + b1x y|xp E(y)|xp x xp taken for experience levels beyond the 1- to 9-year range, we have no information about what might happen outside that range Figure 14.25 shows a case in which the true relationship between sales and experience reaches a peak value at about 20 years and then starts to decline If a linear regression equation were used to predict sales based on experience levels beyond the relevant range of data, large prediction errors could occur A second important consideration, one that was discussed previously, involves correlation and causation The fact that a significant linear relationship exists between two variables does not imply that one variable causes the other Although there may be a cause-and-effect relationship, you should not infer that such a relationship is present based only on regression and/or correlation analysis You should also recognize that a cause-and-effect relationship between two variables is not necessary for regression analysis to be an effective tool What matters is that the regression model accurately reflects the relationship between the two variables and that the relationship remains stable Many users of regression analysis mistakenly believe that a high coefficient of determination (R2) guarantees that the regression model will be a good predictor You should remember that R2 is a measure of the variation in the dependent variable explained by the independent variable Although the least squares criterion assures us that R2 will be maximized (because the sum of squares error is a minimum) for the given set of sample data, the FIGURE 14.25 | 1,200 Graph for a Sales Peak at 20 Years Sales in Thousands 1,000 800 600 400 200 0 10 15 Years 20 25 30 (157) 620 CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis value applies only to those data used to develop the model Thus, R2 measures the fit of the regression line to the sample data There is no guarantee that there will be an equally good fit with new data The only true test of a regression model’s predictive ability is how well the model actually predicts Finally, we should mention that you might find a large R2 with a large standard error This can happen if total sum of squares is large in comparison to the SSE Then, even though R2 is relatively large, so too is the estimate of the model’s standard error Thus, confidence and prediction errors may simply be too wide for the model to be used in many situations This is discussed more fully in Chapter 15 MyStatLab 14-3: Exercises Skill Development Problems 14-32 and 14-33 refer to the following output for a simple linear regression model: 14-30 The following data have been collected by an accountant who is performing an audit of paper products at a large office supply company The dependent variable, y, is the time taken (in minutes) by the accountant to count the units The independent variable, x, is the number of units on the computer inventory record Summary Output Regression Statistics Multiple R R-Square Adjusted R-Square Standard Error 
Observations y 23.1 100.5 242.9 56.4 178.7 10.5 94.2 200.4 44.2 128.7 180.5 x 24 120 228 56 190 13 85 190 32 120 230 0.1027 0.0105 0.0030 9.8909 75 Anova a Develop a scatter plot for these data b Determine the regression equation representing the data Is the model significant? Test using a significance level of 0.10 and the p-value approach c Develop a 90% confidence interval estimate for the true regression slope and interpret this interval estimate Based on this interval, could you conclude the accountant takes an additional minute to count each additional unit? 14-31 You are given the following sample data: x y 10 3 a Develop a scatter plot for these data b Determine the regression equation for the data c Develop a 95% confidence interval estimate for the true regression slope and interpret this interval estimate d Provide a 95% prediction interval estimate for a particular y, given xp  Regression Residual Total df SS MS F Significance F 73 74 76.124 7141.582 7217.706 76.12 97.83 0.778 0.3806 Intercept x Intercept x Coefficents Standard Error t-Statistic 4.0133 0.0943 3.878 0.107 1.035 0.882 p-value Lower 95% Upper 95% 0.3041 0.3806 3.715 0.119 11.742 0.307 14-32 Referring to the displayed regression model, what percent of variation in the y variable is explained by the x variable in the model? 14-33 Construct and interpret a 90% confidence interval estimate for the regression slope coefficient 14-34 You are given the following summary statistics from a regression analysis: yˆ  200 150 x SSE  25.25 SSX  Sum of squares X  ∑ ( x − x )2  99, 645 n  18 x  52.0 (158) CHAPTER 14 a Determine the point estimate for y if xp  48 is used b Provide a 95% confidence interval estimate for the average y, given xp  48 Interpret this interval c Provide a 95% prediction interval estimate for a particular y, given xp  48 Interpret d Discuss the difference between the estimates provided in parts b and c 14-35 The sales manager at Sun City Real Estate Company in Tempe, Arizona, is interested in describing the relationship between condo sales prices and the number of weeks the condo is on the market before it sells He has collected a random sample of 17 low-end condos that have sold within the past three months in the Tempe area These data are shown as follows: Weeks on the Market Selling Price 23 48 26 20 40 51 18 25 62 33 11 15 26 27 56 12 $76,500 $102,000 $53,000 $84,200 $73,000 $125,000 $109,000 $60,000 $87,000 $94,000 $76,000 $90,000 $61,000 $86,000 $70,000 $133,000 $93,000 10 103 85 11 115 73 10 97 11 102 65 75 Introduction to Linear Regression and Correlation Analysis 621 14-37 A regression analysis from a sample of 15 produced the following: ∑(xi  x )( yi  y ) 156.4 ∑(xi  x )2  173.5 ∑( yi  y )2  181.6 ∑( yi  yˆ )2  40.621 x  13.4 and y  56.4 a Produce the regression line b Determine if there is a linear relationship between the dependent and independent variables Use a significance level of 0.05 and a p-value approach c Calculate a 90% confidence interval for the amount the dependent variable changes when the independent variable increases by unit Business Applications 14-38 During the recession that began in 2008, not only did some people stop making house payments, they also stopped making payments for local government services such as trash collection and water and sewer services The following data have been collected by an accountant who is performing an audit of account balances for a major city billing department The population from which the data were collected represent those accounts for 
which the customer had indicated the balance was incorrect The dependent variable, y, is the actual account balance as verified by the accountant The independent variable, x, is the computer account balance y x a Develop a simple linear regression model to explain the variation in selling price based on the number of weeks the condo is on the market b Test to determine whether the regression slope coefficient is significantly different from using a significance level equal to 0.05 c Construct and interpret a 95% confidence interval estimate for the regression slope coefficient 14-36 A sample of 10 yields the following data: x y | 15 155 95 a Provide a 95% confidence interval for the average y when xp  9.4 b Provide a 95% confidence interval for the average y when xp  10 c Obtain the margin of errors for both part a and part b Explain why the margin of error obtained in part b is larger than that in part a 233 245 10 12 24 22 56 56 78 90 102 103 90 85 200 190 344 320 120 120 18 23 a Compute the least squares regression equation b If the computer account balance was 100, what would you expect to be the actual account balance as verified by the accountant? c The computer balance for Timothy Jones is listed as 100 in the computer account record Provide a 90% interval estimate for Mr Jones’s actual account balance d Provide a 90% interval estimate for the average of all customers’ actual account balances in which a computer account balance is the same as that of Mr Jones (part c) Interpret 14-39 Gym Outfitters sells and services exercise equipment such as treadmills, ellipticals, and stair climbers to gymnasiums and recreational centers The company’s management would like to determine if there is a relationship between the number of minutes required to complete a routine service call and the number of machines serviced A random sample of 12 records revealed the following information concerning the (159) 622 CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis number of machines serviced and the time (in minutes) to complete the routine service call: Number of Machines Service Time (minutes) 11 10 10 5 12 115 60 80 90 55 65 70 33 95 50 40 110 a Estimate the least squares regression equation b If a gymnasium had six machines, how many minutes should Gym Outfitters expect a routine service call to require? 
c Provide a 90% confidence interval for the average amount of time required to complete a routine service call when the number of machines being serviced is nine d Provide a 90% prediction interval for the time required to complete a particular routine service call for a gymnasium that has seven machines 14-40 The National Association of Realtors (NAR) ExistingHome Sales Series provides a measurement of the residential real estate market On or about the 25th of each month, NAR releases statistics on sales and prices of condos and co-ops, in addition to existing singlefamily homes, for the nation and the four regions The data presented here indicate the number of (in thousands) existing-home sales as well as condo/ co-op sales: Single-Family Condo/Co-op Sales Sales a Construct the regression equation that would predict the number of condo/co-op sales using the number of single-family sales b One might conjecture that these two markets (single-family sales and condo/co-op sales) would be competing for the same audience Therefore, we would expect that as the number of single-family sales increases, the number of condo/co-op sales would decrease Conduct a hypothesis test to determine this using a significance level of 0.05 c Provide a prediction interval for the number of condo/co-op sales when the number of singlefamily sales is 6,000 (thousands) Use a confidence level of 95% 14-41 J.D Power and Associates conducts an initial quality study (IQS) each year to determine the quality of newly manufactured automobiles IQS measures 135 attributes across nine categories, including ride/ handling/braking, engine and transmission, and a broad range of quality problem symptoms reported by vehicle owners The 2008 IQS was based on responses from more than 62,000 purchasers and lessees of new 2008 model-year cars and trucks, who were surveyed after 90 days of ownership The data given here portray the industry average of the number of reported problems per 100 vehicles for 1998–2008 Year 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 Problems 176 167 154 147 133 133 125 121 119 118 118 a Construct a scatter plot of the number of reported problems per 100 vehicles as a function of the year b Determine if the average number of reported problems per 100 vehicles declines from year to year Use a significance level of 0.01 and a p-value approach c Assume the relationship between the number of reported problems per 100 vehicles and the year continues into the future Provide a 95% prediction interval for the initial quality industry average of the number of reported problems per 100 vehicles for 2010 Year Month 2009 Apr May 6,270 6,230 895 912 Jun 6,330 943 Computer Database Exercises Jul 6,220 914 Aug 6,280 928 Sept 6,290 908 Oct 6,180 867 Nov 6,150 876 Dec 5,860 885 Jan 5,790 781 Feb 6,050 852 Mar 6,040 862 Apr 5,920 839 14-42 A manufacturer produces a wash-down motor for the food service industry The company manufactures the motors to order by modifying a base model to meet the specifications requested by the customer The motors are produced in a batch environment with the batch size equal to the number ordered The manufacturer has recently sampled 27 customer orders The motor manufacturer would like to determine if there is a relationship between the cost of producing the order and the order size so that it could estimate the cost of producing a particular size order The sampled data are contained in the file Washdown Motors 2010 (160) CHAPTER 14 a Use the sample data to estimate the least squares 
regression model b Provide an interpretation of the regression coefficients c Test the significance of the overall regression model using a significance level of 0.01 d The company has just received an order for 30 motors Use the regression model developed in part a to estimate the cost of producing this particular order e Referring to part d, what is the 90% confidence interval for an average cost of an order of 30 motors? 14-43 Each month, the Bureau of Labor Statistics (BLS) of the U.S Department of Labor announces the total number of employed and unemployed persons in the United States for the previous month At the same time, it also publishes the inflation rate, which is the rate of change in the price of goods and services from one month to the next It seems quite plausible that there should be some relationship between these two indicators The file entitled CPI provides the monthly unemployment and inflation rates for the period 2000–2005 a Construct a scatter plot of the unemployment rate versus inflation rate for the period 2000–2005 Describe the relationship that appears to exist between these two variables b Produce a 95% prediction interval for the unemployment rate for the maximum inflation rate in the period 2000–2005 Interpret the interval | Introduction to Linear Regression and Correlation Analysis 623 c Produce a 95% prediction interval for the unemployment rate when the inflation rate is 0.00 d Which of the prediction intervals in parts b and c has the larger margin of error? Explain why this is the case 14-44 The National Highway Transportation Safety Administration’s National Center for Statistics and Analysis released its Vehicle Survivability Travel Mileage Schedules in January 2006 One item investigated was the relationship between the annual vehicle miles traveled (VMT) as a function of vehicle age for passenger cars up to 25 years old The VMT data were collected by asking consumers to estimate the number of miles driven in a given year The data were collected over a 14-month period, starting in March 2001 and ending in May 2002 The file entitled Miles contains this data a Produce a regression equation modeling the relationship between VMT and the age of the vehicle Estimate how many more annual vehicle miles would be traveled for a vehicle that is 10 years older than another vehicle b Provide a 90% confidence interval estimate for the average annual vehicle miles traveled when the age of the vehicle is 15 years c Determine if it is plausible for a vehicle that is 10 years old to travel 12,000 miles in a year Support your answer with statistical reasoning END EXERCISES 14-3 (161) 624 CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis Visual Summary Chapter 14: Although some business situations involve only one variable, others require decision makers to consider the relationship between two or more variables In analyzing the relationship between two variables, there are two basic models that we can use The regression model covered in this chapter is referred to as simple linear regression This relationship between x and y assumes that the x variable takes on known values specifically selected from all the possible values for x The y variable is a random variable observed at the different levels of x Testing that a linear relationship exists between the dependent and independent variables is performed using the standard statistical procedures of hypothesis testing and confidence intervals A second model is referred to as the correlation model and 
is used in applications in which both the x and y variables are considered to be random variables These two models arise in practice by the way in which the data are obtained Regression analysis and correlation analysis are two of the most often applied statistical tools for business decision making 14.1 Scatter Plots and Correlation (pg 580–589) Summary Decision-making situations that call for understanding the relationship between two quantitative variables are aided by the use of scatter plots, or scatter diagrams A scatter plot is a two-dimensional plot showing the values for the joint occurrence of two quantitative variables The scatter plot may be used to graphically represent the relationship between two variables A numerical quantity that measures the strength of the linear relationship between two variables is labeled the correlation coefficient The sample correlation coefficient, r, can range from a perfect positive correlation, +1.0, to a perfect negative correlation, –1.0 A test based upon the t-distribution can determine whether the population correlation coefficient is significantly different from and, therefore, whether a linear relationship exists between the dependent and independent variables Outcome Calculate and interpret the correlation between two variables Outcome Determine whether the correlation is significant 14.2 Simple Linear Regression Analysis (pg 589–612) Summary The statistical technique we use to analyze the relationship between the dependent variable and the independent variable is known as regression analysis When the relationship between the dependent variable and the independent variable is linear, the technique is referred as simple linear regression The population regression model is determined by three values known as the population regression coefficients: (1) the y-intercept, (2) the slope of the regression line, and (3) the random error term The criterion used to determine the best estimate of the population regression line is known as the least squares criterion It chooses values for the y-intercept and slope that will produce the smallest possible sum of squared prediction errors Testing that the population’s slope coefficient is equal to zero provides a method to determine if there is no linear relationship between the dependent and independent variables The test for the simple linear regression is equivalent to the test that the correlation coefficient is significant A less involved procedure that indicates the goodness of fit of the regression equation to the data is known as the coefficient of determination Simple linear regression, which is introduced in this chapter, is one of the most often applied statistical tools by business decision makers for analyzing the relationship between two variables Outcome Calculate the simple linear regression equation for a set of data and know the basic assumptions behind regression analysis Outcome Determine whether a regression model is significant Conclusion 14.3 Uses for Regression Analysis (pg 612–623) Summary Regression analysis is a statistical tool that is used for two main purposes: description and prediction Description is accomplished by describing the plausible values the population slope coefficient may attain To provide this, a confidence interval estimator of the population slope is employed There are many other situations in which the prime purpose of regression analysis is description Market researchers also use regression analysis, among other techniques, in an effort to describe the 
factors that influence the demand for their products The analyst may wish to provide a confidence interval for the expected value of a dependent variable, given a specific level of the independent variable This is obtained by the use of a confidence interval for the average y, given x Another confidence interval is available in the case that the analyst wishes to predict a particular y for a given x This interval estimator is called a prediction interval Any procedure in statistics is valid only if the assumptions it is built upon is valid This is particularly true in regression analysis Therefore, before using a regression model for description or prediction, you should check to see if the assumptions associated with linear regression analysis are valid Residual analysis is the procedure that is used for that purpose Outcome Recognize regression analysis applications for purposes of description and prediction Outcome Calculate and interpret confidence intervals for the regression analysis Outcome Recognize some potential problems if regression analysis is used incorrectly Correlation and regression analysis are two of the most frequently used statistical techniques by business decision makers This chapter has introduced the basics of these two topics The discussion of regression analysis has been limited to situations in which you have one dependent variable and one independent variable Chapter 15 will extend the discussion of regression analysis by showing how two or more independent variables are included in the analysis The focus of that chapter will be on building a model for explaining the variation in the dependent variable However, the basic concepts presented in this chapter will be carried forward (162) CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis 625 Equations (14.1) Sample Correlation Coefficient pg 580 (14.13) Sum of Squares Error pg 601 ∑ (x  x )( y  y ) r n SSE  [∑ (x  x )2 ][∑ ( y  y )2 ] ∑ ( yi  yˆi )2 i1 (14.2) or the algebraic equivalent: pg 581 r (14.14) Sum of Squares Regression pg 602 n ∑ xy  ∑ x ∑y n SSR  [ n( ∑ x )  ( ∑ x)2 ][ n( ∑y )  ( ∑y)2 ] (14.3) Test Statistic for Correlation pg 584 t r i1 (14.15) Coefficient of Determination, R2 pg 602 df  n  1 r n2 ∑ ( yˆi  y )2 R2  SSR SST (14.16) Coefficient of Determination for the Single Independent (14.4) Simple Linear Regression Model (Population Model) pg 590 Variable Case pg 602 R2  r2 y  b0  b1x   (14.5) Estimated Regression Model (Sample Model) pg 592 yˆ  b0  b1 x (14.6) Least Squares Equations pg 594 b1  ∑ (x  x )( y  y ) ∑ (x  x )2 (14.17) Test Statistic for Significance of the Coefficient of Determination pg 603 F SSR / SSE / (n  2) df  (D1  1, D2  n − 2) (14.18) Simple Regression Standard Error of the Slope Coefficient (Population) pg 604 (14.7) or the algebraic equivalent: pg 594 ∑x∑y ∑ xy  n b1  ( ∑ x )2 ∑ x2  n b  ε ∑(x  x )2 (14.19) Simple Regression Estimator for the Standard Error of the Estimate pg 604 (14.8) and pg 594 b0  y  b1 x sε  SSE n 2 (14.9) Sum of Squared Errors pg 596 SSE  ∑ y  b0 ∑ y  b1 ∑ xy (14.20) Simple Regression Estimator for the Standard Error of the Slope pg 605 (14.10) Sum of Residuals pg 596 sb  n ∑ ( yi  yˆi )  i1 Significance of the Slope pg 607 n ∑ ( yi  yˆi )2 i1 (14.12) Total Sum of Squares pg 600 n SST  ∑(x  x )2 (14.21) Simple Linear Regression Test Statistic for Test of the (14.11) Sum of Squared Residuals (Errors) pg 597 SSE  sε ∑ ( yi  y )2 i1 t b1  1 sb df  n  (14.22) Confidence Interval Estimate for the Regression Slope, Simple Linear 
Regression pg 614 b1 tsb (163) 626 CHAPTER 14 | Introduction to Linear Regression and Correlation Analysis or equivalently, b1 t (14.24) Prediction Interval for y|xp pg 617 sε ∑(x  x )2 df  n  2 (x  x ) ˆy tsε 1  p n ∑ (x  x )2 (14.23) Confidence Interval for E(y)|xp pg 616 yˆ tsε (x p  x )  n ∑ (x  x )2 Key Terms Coefficient of determination pg 602 Correlation coefficient pg 580 Least squares criterion pg 592 Regression slope coefficient pg 591 Residual pg 592 Scatter plot pg 580 Simple linear regression pg 589 Chapter Exercises Conceptual Questions 14-45 A statistics student was recently working on a class project that required him to compute a correlation coefficient for two variables After careful work he arrived at a correlation coefficient of 0.45 Interpret this correlation coefficient for the student who did the calculations 14-46 Referring to the previous problem, another student in the same class computed a regression equation relating the same two variables The slope of the equation was found to be 0.735 After trying several times and always coming up with the same result, she felt that she must have been doing something wrong since the value was negative and she knew that this could not be right Comment on this student’s conclusion 14-47 If we select a random sample of data for two variables and, after computing the correlation coefficient, conclude that the two variables may have zero correlation, can we say that there is no relationship between the two variables? Discuss 14-48 Discuss why prediction intervals that attempt to predict a particular y-value are less precise than confidence intervals for predicting an average y 14-49 Consider the two following scenarios: a The number of new workers hired per week in your county has a high positive correlation with the average weekly temperature Can you conclude that an increase in temperature causes an increase in the number of new hires? Discuss b Suppose the stock price and the common dividends declared for a certain company have a high positive correlation Are you safe in concluding on the basis of the correlation coefficient that an increase in the common dividends declared causes an increase in MyStatLab the stock price? Present other reasons than the correlation coefficient that might lead you to conclude that an increase in common dividends declared causes an increase in the stock price 14-50 Consider the following set of data: x y 48 47 27 23 34 31 24 20 49 50 29 48 39 47 38 47 46 42 32 47 a Calculate the correlation coefficient of these two variables b Multiply each value of the variable x by and add 10 to the resulting products Now multiply each value of the variable y by and subtract from the resulting products Finally, calculate the correlation coefficient of the new x and y variables c Describe the principle that the example developed in parts a and b demonstrates 14-51 Go to the library and locate an article in a journal related to your major (Journal of Marketing, Journal of Finance, etc.) that uses linear regression Discuss the following: a How the author chose the dependent and independent variables b How the data were gathered c What statistical tests the author used d What conclusions the analysis allowed the author to draw Business Applications 14-52 The Smithfield Organic Milk Company recently studied a random sample of 30 of its distributors and found the correlation between sales and advertising dollars to be 0.67 (164) CHAPTER 14 a Is there a significant linear relationship between sales and advertising? 
If so, is it fair to conclude that advertising causes sales to increase?
b. If a regression model were developed using sales as the dependent variable and advertising as the independent variable, determine the proportion of the variation in sales that would be explained by its relationship to advertising. Discuss what this says about the usefulness of using advertising to predict sales.

14-53. A previous exercise discussed the relationship between the average college tuition (in 2003 dollars) for private and public colleges. The data indicated in the article follow:

Period | Private | Public
1983–1984 | 9,202 | 2,074
1988–1989 | 12,146 | 2,395
1993–1994 | 13,844 | 3,188
1998–1999 | 16,454 | 3,632
2003–2004 | 19,710 | 4,694
2008–2009 | 21,582 | 5,652

a. Construct the regression equation that would predict the average college tuition for private colleges using that of the public colleges.
b. Determine if there is a linear tendency for the average college tuition for private colleges to increase when the average college tuition for public colleges increases. Use a significance level of 0.05 and a p-value approach.
c. Provide a 95% confidence interval for the average college tuition for private colleges when the average college tuition for public colleges reaches $7,000.
d. Is it plausible that the average college tuition for private colleges would be larger than $35,000 when the average college tuition for public colleges reaches $7,000? Support your assertion with statistical reasoning.

14-54. The Farmington City Council recently commissioned a study of park users in their community. Data were collected on the age of the person surveyed and the number of hours he or she spent in the park in the past month. The data collected were as follows:

Time in Park | Age
7.2 | 16
3.5 | 15
6.6 | 28
5.4 | 16
1.5 | 29
2.3 | 38
4.4 | 48
8.8 | 18
4.9 | 24
5.1 | 33
1.0 | 56

a. Draw a scatter plot for these data and discuss what, if any, relationship appears to be present between the two variables.
b. Compute the correlation coefficient between age and the amount of time spent in the park. Provide an explanation to the Farmington City Council of what the correlation measures.
c. Test to determine whether the amount of time spent in the park decreases with the age of the park user. Use a significance level of 0.10. Use a p-value approach to conduct this hypothesis test.

14-55. At State University, a study was done to establish whether a relationship exists between students' graduating grade point average (GPA) and the SAT verbal score when the student originally entered the university. The sample data are reported as follows:

GPA: 2.5 3.2 3.5 2.8 3.0 2.4 3.4 2.9 2.7 3.8
SAT: 640 700 550 540 620 490 710 600 505 710

a. Develop a scatter plot for these data and describe what, if any, relationship exists between the two variables, GPA and SAT score.
b. (1) Compute the correlation coefficient. (2) Does it appear that the success of students at State University is related to the SAT verbal scores of those students?
Conduct a statistical procedure to answer this question. Use a significance level of 0.01.
c. (1) Compute the regression equation based on these sample data if you wish to predict the university GPA using the student SAT score. (2) Interpret the regression coefficients.

14-56. An American airline company recently performed a customer survey in which it asked a random sample of 100 passengers to indicate their income and the total cost of the airfares they purchased for pleasure trips during the past year. A regression model was developed for the purpose of determining whether income could be used as a variable to explain the variation in the total cost of airfare purchased on airlines in a year. The following regression results were obtained:

\[ \hat{y} = 0.25 + 0.0150x, \qquad s_\varepsilon = 721.44, \qquad R^2 = 0.65, \qquad s_{b_1} = 0.0000122 \]

a. Produce an estimate of the maximum and minimum differences in the amounts allocated to purchase airline tickets by two families that have a difference of $20,000 in family income. Assume that you wish to use a 90% confidence level.
b. Can the intercept of the regression equation be interpreted in this case, assuming that no one who was surveyed had an income of 0 dollars? Explain.
c. Use the information provided to perform an F-test for the significance of the regression model. Discuss your results, assuming the test is performed at the significance level of 0.05.

14-57. One of the advances that has helped to diminish carpal tunnel syndrome is the ergonomic keyboard. Ergonomic keyboards may also increase typing speed. Ten administrative assistants were chosen to type on both standard and ergonomic keyboards. The resulting typing speeds follow:

Ergonomic: 69 70 60 71 73 64 63 70 63 74
Standard: 80 68 54 56 58 64 62 51 64 53

a. Produce a scatter plot of the typing speed of administrative assistants using ergonomic and standard keyboards. Does there appear to be a linear relationship between these two variables?
Explain your response.
b. Calculate the correlation coefficient of the typing speed of administrative assistants using ergonomic and standard keyboards.
c. Conduct a hypothesis test to determine if a positive correlation exists between the typing speeds of administrative assistants using ergonomic and standard keyboards. Use a significance level of 0.05.

14-58. A company is considering recruiting new employees from a particular college and plans to place a great deal of emphasis on the student's college GPA. However, the company is aware that not all schools have the same grading standards, so it is possible that a student at this school might have a lower (or higher) GPA than a student from another school, yet really be on par with the other student. To make this comparison between schools, the company has devised a test that it has administered using a sample size of 400 students. With the results of the test, it has developed a regression model that it uses to predict student GPA. The following equation represents the model:

\[ \hat{y} = 1.0 + 0.028x \]

The R² for this model is 0.88, and the standard error of the estimate is 0.20, based on the sample data used to develop the model. Note that the dependent variable is the GPA and the independent variable is the test score, where this score can range from 0 to 100. For the sample data used to develop the model, the following values are known:

\[ \bar{y} = 2.76, \qquad \bar{x} = 68, \qquad \sum (x - \bar{x})^2 = 148{,}885.73 \]

a. Based on the information contained in this problem, can you conclude that as the test score increases, the GPA will also increase, using a significance level of 0.05?
b. Suppose a student interviews with this company, takes the company test, and scores 80 correct. What is the 90% prediction interval estimate for this student's GPA? Interpret the interval.
c. Suppose the student in part b actually has a 2.90 GPA at this school. Based on this evidence, what might be concluded about this person's actual GPA compared with other students with the same GPA at other schools? Discuss the limitations you might place on this conclusion.
d. Suppose a second student with a 2.45 GPA took the test and scored 65 correct. What is the 90% prediction interval for this student's "real" GPA?
Interpret.

Computer Database Exercises

14-59. Although the Jordan Banking System, a smaller regional bank, generally avoided the subprime mortgage market and consequently did not take money from the federal Troubled Asset Relief Program (TARP), its board of directors has decided to look into all aspects of revenues and costs. One service the bank offers is free checking, and the board is interested in whether the costs of this service are offset by revenues from interest earned on the deposits. One aspect of studying checking accounts is to determine whether changes in average checking account balance can be explained by knowing the number of checks written per month. The sample data selected are contained in the data file named Jordan.
a. Draw a scatter plot for these data.
b. Develop the least squares regression equation for these data.
c. Develop the 90% confidence interval estimate for the change in the average checking account balance when a person who formerly wrote 25 checks a month doubles the number of checks used.
d. Test to determine if an increase in the number of checks written by an individual can be used to predict the checking account balance of that individual. Use α = 0.05. Comment on this result and the result of part c.

14-60. An economist for the state government of Mississippi recently collected the data contained in the file called Mississippi on the percentage of people unemployed in the state at randomly selected points in time over the past 25 years and the interest rate of Treasury bills offered by the federal government at that point in time.
a. (1) Develop a plot showing the relationship between the two variables. (2) Describe the relationship as being either linear or curvilinear.
b. (1) Develop a simple linear regression model with the unemployment rate as the dependent variable. (2) Write a short report describing the model and indicating the important measures.

14-61. Terry Downes lost his job as an operations analyst last year in a company downsizing effort. In looking for job opportunities, Terry remembered reading an article in Fortune stating companies were looking to outsource activities they were currently doing that were not part of their core competence. Terry decided no company's core competence involved cleaning its facilities, and so, using his savings, he started a cleaning company. In a surprise to his friends, Terry's company proved to be successful. Recently, Terry decided to survey customers to determine how satisfied they are with the work performed. He devised a rating scale between 0 and 100, with 0 being poor and 100 being excellent service. He selected a random sample of 14 customers and asked them to rate the service. He also recorded the number of worker hours spent in the customer's facility. These data are in the data file named Downes.
a. (1) Draw a scatter plot showing these two variables, with the y variable on the vertical axis and the x variable on the horizontal axis. (2) Describe the relationship between these two variables.
b. (1) Develop a linear regression model to explain the variation in the service rating. (2) Write a short report describing the model and showing the results of pertinent hypothesis tests, using a significance level of 0.10.

14-62. A previous problem discussed the College Board changing the SAT test between 2005 and 2006. The class of 2005 was the last to take the former version of the SAT, featuring math and verbal sections. The file entitled MathSAT contains the math SAT scores for the interval 1967 to 2005. One point of interest concerning
the data is the relationship between the average scores of male and female students.
a. Produce a scatter plot depicting the relationship between the average math SAT score of males (the dependent variable) and females (the independent variable) over the period 1967 to 2005. Describe the relationship between these two variables.
b. Is there a linear relationship between the average scores for males and females over the period 1967 to 2005? Use a significance level of 0.05 and the p-value approach to determine this.

14-63. The housing market in the United States saw a major decrease in value between 2007 and 2008. The file entitled House contains the data on average and median housing prices between November 2007 and November 2008. Assume the data can be viewed as samples of the relevant populations.
a. Determine the linear relationship that could be used to predict the average selling prices for November 2007 using the median selling prices for that period.
b. Conduct a hypothesis test to determine if the median selling prices for November 2007 could be used to determine the average selling prices in that period. Use a significance level of 0.05 and the p-value approach to conduct the test.
c. Provide an interval estimate of the average selling price of homes in November 2007 if the median selling price was $195,000. Use a 90% confidence interval.

14-64. The Grinfield Service Company's marketing director is interested in analyzing the relationship between her company's sales and the advertising dollars spent. In the course of her analysis, she selected a random sample of 20 weeks and recorded the sales for each week and the amount spent on advertising. These data are contained in the data file called Grinfield.
a. Identify the independent and dependent variables.
b. Draw a scatter plot with the dependent variable on the vertical axis and the independent variable on the horizontal axis.
c. The marketing director wishes to know if increasing the amount spent on advertising increases sales. As a first attempt, use a statistical test that will provide the required information. Use a significance level of 0.025. On careful consideration, the marketing manager realizes that it takes a certain amount of time for the effect of advertising to register in terms of increased sales. She therefore asks you to calculate a correlation coefficient for sales of the current week against the amount of advertising spent in the previous week and to conduct a hypothesis test to determine if, under this model, increasing the amount spent on advertising increases sales. Again, use a significance level of 0.025.

14-65. Refer to the Grinfield Service Company discussed in Problem 14-64.
a. Develop the least squares regression equation for these variables. Plot the regression line on the scatter plot.
b. Develop a 95% confidence interval estimate for the increase in sales resulting from increasing the advertising budget by $50. Interpret the interval.
c. Discuss whether it is appropriate to interpret the intercept value in this model. Under what conditions is it appropriate? Discuss.
d. Develop a 90% confidence interval for the mean sales amount achieved during all weeks in which advertising is $200 for the week.
e. Suppose you are asked to use this regression model to predict the weekly sales when advertising is to be set at $100. What would you reply to the request?
Discuss.

Case 14.1 A & A Industrial Products

Alex Court, the cost accountant for A & A Industrial Products, was puzzled by the repair cost analysis report he had just reviewed. This was the third consecutive report where unscheduled plant repair costs were out of line with the repair cost budget allocated to each plant. A & A budgets for both scheduled maintenance and unscheduled repair costs for its plants' equipment, mostly large industrial machines. Budgets for scheduled maintenance activities are easy to estimate and are based on the equipment manufacturer's recommendations. The unscheduled repair costs, however, are harder to determine. Historically, A & A Industrial Products has estimated unscheduled maintenance using a formula based on the average number of hours of operation between major equipment failures at a plant. Specifically, plants were given a budget of $65.00 per hour of operation between major failures. Alex had arrived at this amount by dividing aggregate historical repair costs by the total number of hours between failures. Plant averages would then be used to estimate unscheduled repair cost. For example, if a plant averaged 450 hours of run time before a major repair occurred, the plant would be allocated a repair budget of 450 × $65 = $29,250 per repair. If the plant was expected to be in operation 3,150 hours per year, the company would anticipate seven unscheduled repairs (3,150/450) annually and budget $204,750 for annual unscheduled repair costs.

Alex was becoming more and more convinced that this approach was not working. Not only was upper management upset about the variance between predicted and actual costs of repair, but plant managers believed that the model did not account for potential differences among the company's three plants when allocating dollars for unscheduled repairs. At the weekly management meeting, Alex was informed that he needed to analyze his cost projections further and produce a report that provided a more reliable method for predicting repair costs. On leaving the meeting, Alex had his assistant randomly pull 64 unscheduled repair reports. The data are in the file A & A Costs. The management team is anxiously waiting for Alex's analysis.

Required Tasks:
1. Identify the major issue(s) of the case.
2. Analyze the overall cost allocation issues by developing a scatter plot of Cost versus Hours of Operation. Which variable, cost or hours of operation, should be the dependent variable?
Explain why.
3. Fit a linear regression equation to the data. Explain how the results of the linear regression equation could be used to develop a cost allocation formula. State any adjustments or modifications you have made to the regression output to develop a cost allocation formula that can be used to predict repair costs.
4. Sort the data by plant. Fit a linear regression equation to each plant's data. Explain how the results of the individual plant regression equations can help the manager determine whether a different linear regression equation could be used to develop a cost allocation formula for each plant. State any adjustments or modifications you have made to the regression output to develop a cost allocation formula.
5. Based on the individual plant regression equations, determine whether there is reason to believe there are differences among the repair costs of the company's three plants.
6. Summarize your analysis and findings in a report to the company's manager.

Case 14.2 Sapphire Coffee—Part 1

Jennie Garcia could not believe that her career had moved so far so fast. When she left graduate school with a master's degree in anthropology, she intended to work at a local coffee shop until something else came along that was more related to her academic background. But after a few months she came to enjoy the business, and in a little over a year she was promoted to store manager. When the company for whom she worked continued to grow, Jennie was given oversight of a few stores. Now, eight years after she started as a barista, Jennie was in charge of operations and planning for the company's southern region.

As a part of her responsibilities, Jennie tracks store revenues and forecasts coffee demand. Historically, Sapphire Coffee would base its demand forecast on the number of stores, believing that each store sold approximately the same amount of coffee. This approach seemed to work well when the company had shops of similar size and layout, but as the company grew, stores became more varied. Now, some stores had drive-thru windows, a feature that top management added to some stores believing that it would increase coffee sales for customers who wanted a cup of coffee on their way to work but who were too rushed to park and enter the store to place an order.

Jennie noticed that weekly sales seemed to be more variable across stores in her region and was wondering what, if anything, might explain the differences. The company's financial vice president had also noticed the increased differences in sales across stores and was wondering what might be happening. In an e-mail to Jennie he stated that weekly store sales are expected to average $5.00 per square foot. Thus, a 1,000-square-foot store would have average weekly sales of $5,000. He asked that Jennie analyze the stores in her region to see if this rule of thumb was a reliable measure of a store's performance. The vice president of finance was expecting the analysis to be completed by the weekend. Jennie decided to randomly select weekly sales records for 53 stores. The data are in the file Sapphire Coffee-1. A full analysis needs to be sent to the corporate office by Friday.

Required Tasks:
1. Identify the major issue(s) of the case.
2. Develop a scatter plot of the variables Store Size and Weekly Sales. Identify the dependent variable. Briefly describe the relationship between the two variables.
3. Fit a linear regression equation to the data. Does the variable Store Size explain a significant amount of the variation in Weekly Sales?
4. Based on the estimated regression equation, does it appear the $5.00 per square foot weekly sales expectation the company currently uses is a valid one?
5. Summarize your analysis and findings in a report to the company's vice president of finance.

Case 14.3 Alamar Industries

While driving home in northern Kentucky at 8:00 P.M., Juan Alamar wondered whether his father had done him any favor by retiring early and letting him take control of the family machine tool–restoration business. When his father started the business of overhauling machine tools (both for resale and on a contract basis), American companies dominated the tool manufacturing market. During the past 30 years, however, the original equipment industry had been devastated, first by competition from Germany and then from Japan. Although foreign competition had not yet invaded the overhaul segment of the business, Juan had heard about foreign companies establishing operations on the West Coast. The foreign competitors were apparently stressing the high-quality service and operations that had been responsible for their great inroads into the original equipment market.

Last week Juan attended a daylong conference on total quality management that discussed the advantages of competing for the Baldrige Award, the national quality award established in 1987. Presenters from past Baldrige winners, including Xerox, Federal Express, Cadillac, and Motorola, stressed the positive effects on their companies of winning and said similar effects would be possible for any company. This assertion of only positive effects was what Juan questioned. He was certain that the effect on his remaining free time would not be positive.

The Baldrige Award considers seven corporate dimensions of quality. Although the award is not based on a numerical score, an overall score is calculated. The maximum score is 1,000, with most recent winners scoring about 800. Juan did not doubt the award was good for the winners, but he wondered about the nonwinners. In particular, he wondered about any relationship between attempting to improve quality according to the Baldrige dimensions and company profitability. Individual company scores are not released, but Juan was able to talk to one of the conference presenters, who shared some anonymous data, such as companies' scores in the year they applied, their returns on investment (ROIs) in the year applied, and returns on investment in the year after application. Juan decided to commit the company to a total quality management process if the data provided evidence that the process would lead to increased profitability.

Baldrige Score | ROI Application Year | ROI Next Year
470 | 11% | 13%
520 | 10 | 11
660 | 14 | 15
540 | 12 | 12
600 | 15 | 16
710 | 16 | 16
580 | 11 | 12
600 | 12 | 13
740 | 16 | 16
610 | 11 | 14
570 | 12 | 13
660 | 17 | 19

Case 14.4 Continental Trucking

Norm Painter is the newly hired cost analyst for Continental Trucking. Continental is a nationwide trucking firm, and until recently, most of its routes were driven under regulated rates. These rates were set to allow small trucking firms to earn an adequate profit, leaving little incentive to work to reduce costs through efficient management techniques. In fact, the greatest effort was made to try to influence regulatory agencies to grant rate increases. A recent rash of deregulation has made the long-distance trucking industry more competitive. Norm has been hired to analyze Continental's whole expense structure. As part of this study, Norm is looking at truck repair
costs. Because the trucks are involved in long hauls, they inevitably break down. In the past, little preventive maintenance was done, and if a truck broke down in the middle of a haul, either a replacement tractor was sent or an independent contractor finished the haul. The truck was then repaired at the nearest local shop. Norm is sure this procedure has led to more expense than if major repairs had been made before the trucks failed.

Norm thinks that some method should be found for determining when preventive maintenance is needed. He believes that fuel consumption is a good indicator of possible breakdowns; as trucks begin to run badly, they will consume more fuel. Unfortunately, the major determinants of fuel consumption are the weight of a truck and headwinds. Norm picks a sample of a single truck model and gathers data relating fuel consumption to truck weight. All trucks in the sample are in good condition. He separates the data by direction of the haul, realizing that winds tend to blow predominantly out of the west. Although he can rapidly gather future data on fuel consumption and haul weight, now that Norm has these data, he is not quite sure what to do with them.

East-West Haul:
Miles/Gallon | Haul Weight
4.1 | 41,000 lb
4.7 | 36,000
3.9 | 37,000
4.3 | 38,000
4.8 | 32,000
5.1 | 37,000
4.3 | 46,000
4.6 | 35,000
5.0 | 37,000

West-East Haul:
Miles/Gallon | Haul Weight
4.3 | 40,000 lb
4.5 | 37,000
4.8 | 36,000
5.2 | 38,000
5.0 | 35,000
4.7 | 42,000
4.9 | 37,000
4.5 | 36,000
5.2 | 42,000
4.8 | 41,000

References
Berenson, Mark L., and David M. Levine, Basic Business Statistics: Concepts and Applications, 11th ed. (Upper Saddle River, NJ: Prentice Hall, 2009).
Cryer, Jonathan D., and Robert B. Miller, Statistics for Business: Data Analysis and Modeling, 2nd ed. (Belmont, CA: Duxbury Press, 1994).
Dielman, Terry E., Applied Regression Analysis—A Second Course in Business and Economic Statistics, 4th ed. (Belmont, CA: Duxbury Press, 2005).
Draper, Norman R., and Harry Smith, Applied Regression Analysis, 3rd ed. (New York: John Wiley and Sons, 1998).
Frees, Edward W., Data Analysis Using Regression Models: The Business Perspective (Upper Saddle River, NJ: Prentice Hall, 1996).
Kleinbaum, David G., Lawrence L. Kupper, Azhar Nizam, and Keith E. Muller, Applied Regression Analysis and Multivariable Methods, 4th ed. (Florence, KY: Cengage Learning, 2008).
Kutner, Michael H., Christopher J. Nachtsheim, John Neter, and William Li, Applied Linear Statistical Models, 5th ed. (New York: McGraw-Hill Irwin, 2005).
Microsoft Excel 2007 (Redmond, WA: Microsoft Corp., 2007).
Minitab for Windows Version 15 (State College, PA: Minitab, 2007).

Chapter 15

Quick Prep Links
• Review the concepts associated with simple linear regression and correlation analysis presented in Chapter 14.
• Make sure you review the discussion about scatter plots in Chapters 2 and 14.
• Review the methods for testing a null hypothesis using the t-distribution in Chapter 9.
• In Chapter 14, review the steps involved in using the t-distribution for testing the significance of a correlation coefficient and a regression coefficient.
• Review confidence intervals discussed in Chapter 8.

Multiple Regression Analysis and Model Building

15.1 Introduction to Multiple Regression Analysis (pg 634–653)
Outcome Understand the general concepts behind model building using multiple regression analysis.
Outcome Apply multiple regression analysis to business decision-making situations.
Outcome Analyze the computer output for a multiple regression model and interpret the regression results.
Outcome Test hypotheses about the significance of a multiple regression model and test the significance of the independent variables in the model.
Outcome Recognize potential problems when using multiple regression analysis and take steps to correct the problems.

15.2 Using Qualitative Independent Variables (pg 654–661)
Outcome Incorporate qualitative variables into a regression model by using dummy variables.

15.3 Working with Nonlinear Relationships (pg 661–678)
Outcome Apply regression analysis to situations where the relationship between the independent variable(s) and the dependent variable is nonlinear.

15.4 Stepwise Regression (pg 678–689)
Outcome Understand the uses of stepwise regression.

15.5 Determining the Aptness of the Model (pg 689–699)
Outcome Analyze the extent to which a regression model satisfies the regression assumptions.

Why you need to know

Chapter 14 introduced linear regression and correlation analyses for analyzing the relationship between two variables. As you might expect, business problems are not limited to linear relationships involving only two variables. Many practical situations involve analyzing the relationships among three or more variables. For example, a vice president of planning for an automobile manufacturer would be interested in the relationship between her company's automobile sales and the variables that influence those sales. Included in her analysis might be such independent, or explanatory, variables as automobile price, competitors' sales, and advertising, as well as economic variables such as disposable personal income, the inflation rate, and the unemployment rate. When multiple independent variables are to be included in an analysis simultaneously, the technique introduced in this chapter—multiple linear regression—is very useful. When a relationship between variables is nonlinear, we may be able to transform the independent variables in ways that allow us to use multiple linear regression analysis to model the nonlinear relationships. This chapter examines the general topic of model building by extending the concepts of simple linear regression analysis provided in Chapter 14.

15.1 Introduction to Multiple Regression Analysis

Chapter 14 introduced the concept of simple linear regression analysis. The simple regression model is characterized by two variables: y, the dependent variable, and x, the independent variable. The single independent variable explains some variation in the dependent variable, but unless x and y are perfectly correlated, the proportion explained will be less than 100%. In multiple regression analysis, additional independent variables are added to the regression model to clear up some of the as yet unexplained variation in the dependent variable. Multiple regression is merely an extension of simple regression analysis; however, as we expand the model for the population from one independent variable to two or more, there are some new considerations. The general format of a multiple regression model for the population is given by Equation 15.1.

Multiple Regression Model (Population)
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \qquad (15.1) \]
where:
β0 = Population's regression constant
βj = Population's regression coefficient for each variable xj, j = 1, 2, …, k
k = Number of independent variables
ε = Model error

Four assumptions similar to those that apply to the simple linear regression model must also apply to the multiple regression model.
Assumptions
1. Individual model errors, ε, are statistically independent of one another, and these values represent a random sample from the population of possible errors at each level of x.
2. For a given value of x there can exist many values of y, and therefore many possible values for ε. Further, the distribution of possible model errors for any level of x is normally distributed.
3. The distributions of possible ε-values have equal variances at each level of x.
4. The means of the dependent variable, y, for all specified values of x can be connected with a line called the population regression model.

Equation 15.1 represents the multiple regression model for the population. However, in most instances, you will be working with a random sample from the population. Given the preceding assumptions, the estimated multiple regression model, based on the sample data, is of the form shown in Equation 15.2.

Estimated Multiple Regression Model
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \qquad (15.2) \]

This estimated model is an extension of an estimated simple regression model. The principal difference is that whereas the estimated simple regression model is the equation for a straight line in a two-dimensional space, the estimated multiple regression model forms a hyperplane (or response surface) through multidimensional space. Each regression coefficient represents a different slope. Therefore, using Equation 15.2, a value of the dependent variable can be estimated using values of two or more independent variables.

Regression Hyperplane: The multiple regression equivalent of the simple regression line. The plane typically has a different slope for each independent variable.

The regression hyperplane represents the relationship between the dependent variable and the k independent variables. For example, Table 15.1A shows sample data for a dependent variable, y, and one independent variable, x1.

TABLE 15.1 | Sample Data to Illustrate the Difference between Simple and Multiple Regression Models

(A) One Independent Variable
y | x1
564.99 | 50
601.06 | 60
560.11 | 40
616.41 | 50
674.96 | 60
630.58 | 45
554.66 | 53

(B) Two Independent Variables
y | x1 | x2
564.99 | 50 | 10
601.06 | 60 | 13
560.11 | 40 | 14
616.41 | 50 | 12
674.96 | 60 | 15
630.58 | 45 | 16
554.66 | 53 | 14

Figure 15.1 shows a scatter plot and the regression line for the simple regression analysis of y and x1. The points are plotted in two-dimensional space, and the regression model is represented by a line through the points such that the sum of squared errors, \( SSE = \sum (y - \hat{y})^2 \), is minimized. If we add variable x2 to the model, as shown in Table 15.1B, the resulting multiple regression equation becomes

\[ \hat{y} = 307.71 + 2.85 x_1 + 10.94 x_2 \]

For the time being, don't worry about how this equation was computed; that will be discussed shortly. Note, however, that the (y, x1, x2) points form a three-dimensional space, as shown in Figure 15.2. The regression equation forms a slice (hyperplane) through the data such that \( \sum (y - \hat{y})^2 \) is minimized. This is the same least squares criterion that is used with simple linear regression. The mathematics for developing the least squares regression equation for simple linear regression involves differential calculus. The same is true for the multiple regression equation, but the mathematical derivation is beyond the scope of this text.¹

¹For a complete treatment of the matrix algebra approach for estimating multiple regression coefficients, consult Applied Linear Statistical Models by Kutner et al.

FIGURE 15.1 | Simple Regression Line Scatter Plot (the fitted line is ŷ = 463.89 + 2.67x1)
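Because Table 15.1 gives the full data set, both fits can be checked directly. The following Python sketch, which uses NumPy's least squares solver (an illustrative alternative to the Excel and Minitab procedures used in the text), should reproduce, up to rounding, the line in Figure 15.1 and the hyperplane equation above:

```python
import numpy as np

# Data from Table 15.1 (panels A and B)
y  = np.array([564.99, 601.06, 560.11, 616.41, 674.96, 630.58, 554.66])
x1 = np.array([50, 60, 40, 50, 60, 45, 53], dtype=float)
x2 = np.array([10, 13, 14, 12, 15, 16, 14], dtype=float)

# Simple regression of y on x1 (the line in Figure 15.1)
X_simple = np.column_stack([np.ones_like(x1), x1])
b_simple, *_ = np.linalg.lstsq(X_simple, y, rcond=None)
print(b_simple)   # approximately [463.89, 2.67]

# Multiple regression of y on x1 and x2 (the hyperplane in Figure 15.2)
X_multi = np.column_stack([np.ones_like(x1), x1, x2])
b_multi, *_ = np.linalg.lstsq(X_multi, y, rcond=None)
print(b_multi)    # approximately [307.71, 2.85, 10.94]
```

In both cases the solver minimizes the same least squares criterion, \( \sum (y - \hat{y})^2 \); only the number of columns in the design matrix changes.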
FIGURE 15.2 | Multiple Regression Hyperplane for Population (the original figure shows the (x1, x2, y) points in three dimensions with the regression plane passing through them)

Multiple regression analysis is usually performed with the aid of a computer and appropriate software. Both Minitab and Excel contain procedures for performing multiple regression. Minitab has a far more complete regression procedure; however, the PHStat Excel add-ins expand Excel's capabilities. Each software package presents the output in a slightly different format; however, the same basic information will appear in all regression output.

Basic Model-Building Concepts

Model: A representation of an actual system using either a physical or a mathematical portrayal.

An important activity in business decision making is referred to as model building. Models are often used to test changes in a system without actually having to change the real system. Models are also used to help describe a system or to predict the output of a system based on certain specified inputs. You are probably quite aware of physical models. Airlines use flight simulators to train pilots. Wind tunnels are used to determine the aerodynamics of automobile designs. Golf ball makers use a physical model of a golfer called "Iron Mike" that can be set to swing golf clubs in a very controlled manner to determine how far a golf ball will fly. Although physical models are very useful in business decision making, our emphasis in this chapter is on statistical models that are developed using multiple regression analysis.

Modeling is both an art and a science. Determining an appropriate model is a challenging task, but it can be made manageable by employing a model-building process consisting of the following three components: model specification, model fitting, and model diagnosis.

Model Specification

Model specification, or model identification, is the process of determining the dependent variable, deciding which independent variables should be included in the model, and obtaining the sample data for all variables. As with any statistical procedure, the larger the sample size the better, because the potential for extreme sampling error is reduced when the sample size is large. However, at a minimum, the sample size required to compute a regression model must be at least one greater than the number of independent variables.² If we are thinking of developing a regression model with five independent variables, the absolute minimum number of cases required is six. Otherwise, the computer software will indicate an error has been made or will print out meaningless values. As a practical matter, however, the sample size should be at least four times the number of independent variables. Thus, if we had five independent variables (k = 5), we would want a sample of at least 20.

²There are mathematical reasons for this sample-size requirement that are beyond the scope of this text. In essence, the regression coefficients in Equation 15.2 can't be computed if the sample size is not at least one larger than the number of independent variables.

Model Building: Model building is the process of actually constructing a mathematical equation in which some or all of the independent variables are used in an attempt to explain the variation in the dependent variable.
How to do it: Model Specification
In the context of the statistical models discussed in this chapter, this component involves the following three steps:
1. Decide what question you want to ask. The question being asked usually indicates the dependent variable. In the previous chapter, we discussed how simple linear regression analysis could be used to describe the relationship between a dependent and an independent variable.
2. List the potential independent variables for your model. Here, your knowledge of the situation you are modeling guides you in identifying potential independent variables.
3. Gather the sample data (observations) for all variables.

How to do it: Developing a Multiple Regression Model
The following steps are employed in developing a multiple regression model:
1. Specify the model by determining the dependent variable and potential independent variables, and select the sample data.
2. Formulate the model. This is done by computing the correlation coefficients for the dependent variable and each independent variable, and for each independent variable with all other independent variables. The multiple regression equation is also computed. The computations are performed using computer software such as Excel or Minitab.
3. Perform diagnostic checks on the model to determine how well the specified model fits the data and how well the model appears to meet the multiple regression assumptions.

Model Diagnosis

Model diagnosis is the process of analyzing the quality of the model you have constructed by determining how well a specified model fits the data you just gathered. You will examine output values such as R-squared and the standard error of the model. At this stage, you will also assess the extent to which the model's assumptions appear to be satisfied. (Section 15.5 is devoted to examining whether a model meets the regression analysis assumptions.) If the model is unacceptable in any of these areas, you will be forced to revert to the model-specification step and begin again. However, you will be the final judge of whether the model provides acceptable results, and you will always be constrained by time and cost considerations. You should use the simplest available model that will meet your needs. The objective of model building is to help you make better decisions. You do not need to feel that a sophisticated model is better if a simpler one will provide acceptable results.

BUSINESS APPLICATION: DEVELOPING A MULTIPLE REGRESSION MODEL

First City Real Estate. First City Real Estate executives wish to build a model to predict sales prices for residential property. Such a model will be valuable when working with potential sellers who might list their homes with First City. This can be done using the following steps:

Step 1: Model Specification. The question being asked is how can the real estate firm determine the selling price for a house?
Thus, the dependent variable is the sales price. This is what the managers want to be able to predict.

The managers met in a brainstorming session to determine a list of possible independent (explanatory) variables. Some variables, such as "condition of the house," were eliminated because of lack of data. Others, such as "curb appeal" (the appeal of the house to people as they drive by), were eliminated because the values for these variables would be too subjective and difficult to quantify. From a wide list of possibilities, the managers selected the following variables as good candidates:

x1 = Home size (in square feet)
x2 = Age of house
x3 = Number of bedrooms
x4 = Number of bathrooms
x5 = Garage size (number of cars)

Data were obtained for a sample of 319 residential properties that had sold within the previous two months in an area served by two of First City's offices. For each house in the sample, the sales price and values for each potential independent variable were collected. The data are in the file First City.

Step 2: Model Building. The regression model is developed by including independent variables from among those for which you have complete data. There is no way to determine whether an independent variable will be a good predictor variable by analyzing the individual variable's descriptive statistics, such as the mean and standard deviation. Instead, we need to look at the correlation between the independent variables and the dependent variable, which is measured by the correlation coefficient.

Correlation coefficient: A quantitative measure of the strength of the linear relationship between two variables. The correlation coefficient, r, ranges from −1.0 to +1.0.

Correlation matrix: A table showing the pairwise correlations between all variables (dependent and independent).

When we have multiple independent variables and one dependent variable, we can look at the correlation between all pairs of variables by developing a correlation matrix. Each correlation is computed using one of the equations in Equation 15.3. The appropriate formula is determined by whether the correlation is being calculated for an independent variable and the dependent variable or for two independent variables.

Correlation Coefficient
\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \quad \text{(one } x \text{ variable with } y\text{)} \]
or
\[ r = \frac{\sum (x_i - \bar{x}_i)(x_j - \bar{x}_j)}{\sqrt{\sum (x_i - \bar{x}_i)^2 \sum (x_j - \bar{x}_j)^2}} \quad \text{(one } x \text{ variable with another } x\text{)} \qquad (15.3) \]

The actual calculations are done using Excel's correlation tool or Minitab's correlation command, and the results are shown in Figure 15.3a and Figure 15.3b. The output provides the correlation between y and each x variable and between each pair of independent variables.³ Recall that in Chapter 14, a t-test (see Equation 14.3) was used to test whether the correlation coefficient is statistically significant:

H0: ρ = 0
HA: ρ ≠ 0

We will conduct the test with a significance level of α = 0.05.

FIGURE 15.3A | Excel 2007 Results Showing First City Real Estate Correlation Matrix (annotation: the correlation between age and square feet is −0.0729; older homes tend to be smaller)

Excel 2007 Instructions:
1. Open file: First City.xls.
2. Select the Home Sample worksheet.
3. Click on Data > Data Analysis.
4. Select Correlation.
5. Define the variable range (all rows and columns).
6. Click on Labels.
7. Click OK.

³Minitab, in addition to providing the correlation matrix, can provide the p-values for each correlation. If the p-value is less than the specified alpha, the correlation is statistically significant.
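For readers working outside Excel or Minitab, a correlation matrix like the one in Figure 15.3a can be produced with a few lines of Python. This sketch assumes hypothetical file and column names (first_city.csv, Price, SqFt, and so on), since the layout of the actual data file is not reproduced here:

```python
import pandas as pd

# Hypothetical file and column names standing in for the First City data
df = pd.read_csv("first_city.csv")  # columns: Price, SqFt, Age,
                                    # Bedrooms, Bathrooms, Garage

# Pairwise correlations for every pair of variables (Equation 15.3)
corr_matrix = df.corr()
print(corr_matrix.round(3))

# The first column: correlation of each independent variable with Price
print(corr_matrix["Price"].drop("Price"))
```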
FIGURE 15.3B | Minitab Results Showing First City Real Estate Correlation Matrix (annotation: the correlation between age and square feet is −0.073; older homes tend to have fewer square feet)

Minitab Instructions:
1. Open file: First City.MTW.
2. Choose Stat > Basic Statistics > Correlation.
3. In Variables, enter the variable columns.
4. Click OK.

Given degrees of freedom equal to n − 2 = 319 − 2 = 317, the critical t (see Appendix E) for a two-tailed test is approximately 1.96.⁴ Any correlation coefficient generating a t-value greater than 1.96 or less than −1.96 is determined to be significant. For now, we will focus on the correlations in the first column in Figures 15.3a and 15.3b, which measure the strength of the linear relationship between each independent variable and the dependent variable, sales price. For example, the t statistic for price and square feet is

\[ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.7477}{\sqrt{\dfrac{1 - 0.7477^2}{319 - 2}}} = 20.048 \]

Because t = 20.048 > 1.96, we reject H0 and conclude that the correlation between sales price and square feet is statistically significant. Similar calculations for the other independent variables with price show that all variables are statistically correlated with price. This indicates that a significant linear relationship exists between each independent variable and sales price. Variable x1, square feet, has the highest correlation at 0.748. Variable x2, age of the house, has the lowest correlation at −0.485. The negative correlation implies that older homes tend to have lower sales prices.

As we discussed in Chapter 14, it is always a good idea to develop scatter plots to visualize the relationship between two variables. Figure 15.4 shows the scatter plots for each independent variable and the dependent variable, sales price. In each case, the plots indicate a linear relationship between the independent variable and the dependent variable. Note that several of the independent variables (bedrooms, bathrooms, garage size) are quantitative but discrete. The scatter plots for these variables show points at each level of the independent variable rather than over a continuum of values.

⁴You can use the Excel TINV function to get the precise t-value, which is 1.967.

FIGURE 15.4 | First City Real Estate Scatter Plots (five panels plot Price against each independent variable: (a) Price versus Square Feet, (b) Price versus Age, (c) Price versus Bedrooms, (d) Price versus Bathrooms, (e) Price versus # Car Garage)
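The test statistic and critical value quoted above are easy to verify. A minimal Python sketch, using the r = 0.7477 and n = 319 values from the text:

```python
import numpy as np
from scipy import stats

r, n = 0.7477, 319
t_stat = r / np.sqrt((1 - r**2) / (n - 2))      # about 20.05
t_crit = stats.t.ppf(0.975, df=n - 2)           # about 1.967 (Excel's TINV)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(t_stat, t_crit, p_value)                  # t >> t_crit, so reject H0
```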
Computing the Regression Equation

First City's goal is to develop a regression model to predict the appropriate selling price for a home, using certain measurable characteristics. The first attempt at developing the model will be to run a multiple regression computer program using all available independent variables. The regression outputs from Excel and Minitab are shown in Figure 15.5a and Figure 15.5b. The estimate of the multiple regression model given in Figure 15.5a is

\[ \hat{y} = 31{,}127.6 + 63.1(\text{sq ft}) - 1{,}144.4(\text{age}) - 8{,}410.4(\text{bedrooms}) + 3{,}522.0(\text{bathrooms}) + 28{,}203.5(\text{garage}) \]

FIGURE 15.5A | Excel 2007 Multiple Regression Model Results for First City Real Estate (the output shows the regression coefficients, the multiple coefficient of determination, the standard error of the estimate, and SSR = 1.0389E+12, SSE = 2.34135E+11, SST = 1.27303E+12)

Excel 2007 Instructions:
1. Open file: First City.xls.
2. Click on Data > Data Analysis.
3. Select Regression.
4. Define the y variable range and the x variable range (include labels).
5. Click Labels.
6. Click OK.

The coefficients for each independent variable represent an estimate of the average change in the dependent variable for a 1-unit change in the independent variable, holding all other independent variables constant. For example, for houses of the same age, with the same number of bedrooms, baths, and garages, a 1-square-foot increase in the size of the house is estimated to increase its price by an average of $63.10. Likewise, for houses with the same square feet, bedrooms, bathrooms, and garages, a 1-year increase in the age of the house is estimated to result in an average drop in sales price of $1,144.40. The other coefficients are interpreted in the same way. Note, in each case, we are interpreting the regression coefficient for one independent variable while holding the other variables constant.

To estimate the value of a residential property, First City Real Estate brokers would substitute values for the independent variables into the regression equation. For example, suppose a house with the following characteristics is considered:

x1 = Square feet = 2,100
x2 = Age = 15
x3 = Number of bedrooms = 4
x4 = Number of bathrooms = 3
x5 = Size of garage = 2

The point estimate for the sales price is

\[ \hat{y} = 31{,}127.6 + 63.1(2{,}100) - 1{,}144.4(15) - 8{,}410.4(4) + 3{,}522.0(3) + 28{,}203.5(2) = \$179{,}802.70 \]

FIGURE 15.5B | Minitab Multiple Regression Model Results for First City Real Estate (the output shows the regression coefficients, the multiple coefficient of determination, the standard error of the estimate, and SSR = 1.0389E+12, SSE = 2.34135E+11, SST = 1.27303E+12)

Minitab Instructions:
1. Open file: First City.MTW.
2. Choose Stat > Regression > Regression.
3. In Response, enter the dependent (y) variable.
4. In Predictors, enter the independent (x) variables.
5. Click OK.

The Coefficient of Determination

Multiple coefficient of determination (R²): The proportion of the total variation of the dependent variable in a multiple regression model that is explained by its relationship to the independent variables. It is, as is the case in the simple linear model, called R-squared and is denoted as R².

You learned in Chapter 14 that the coefficient of determination, R², measures the proportion of variation in the dependent variable that can be explained by the dependent variable's relationship to a single independent variable. When there are multiple independent variables in a model, R² is called the multiple coefficient of determination and is used to determine the proportion of variation in the dependent variable that is explained by the dependent variable's relationship to all the independent variables in the model. Equation 15.4 is used to compute R² for a multiple regression model.

Multiple Coefficient of Determination (R²)
\[ R^2 = \frac{\text{Sum of squares regression}}{\text{Total sum of squares}} = \frac{SSR}{SST} \qquad (15.4) \]

As shown in Figure 15.5a, R² = 0.8161. Both SSR and SST are also included in the output. Therefore, you can also use Equation 15.4 to get R², as follows:

\[ R^2 = \frac{SSR}{SST} = \frac{1.0389\text{E+}12}{1.27303\text{E+}12} = 0.8161 \]

More than 81% of the variation in sales price can be explained by the linear relationship of the five independent variables in the regression model to the dependent variable. However, as we shall shortly see, not all independent variables are equally important to the model's ability to explain this variation.
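A fitted model like the one in Figure 15.5a, together with the point estimate for the example house, could also be obtained in Python with the statsmodels library. The file and column names below are hypothetical placeholders for the First City data:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names standing in for the First City data
df = pd.read_csv("first_city.csv")
X = sm.add_constant(df[["SqFt", "Age", "Bedrooms", "Bathrooms", "Garage"]])
model = sm.OLS(df["Price"], X).fit()

print(model.params)     # b0, b1, ..., b5 (the Figure 15.5a coefficients)
print(model.rsquared)   # multiple coefficient of determination, Eq. 15.4

# Point estimate for the example house: 2,100 sq ft, 15 years old,
# 4 bedrooms, 3 bathrooms, 2-car garage
house = pd.DataFrame({"const": [1.0], "SqFt": [2100.0], "Age": [15.0],
                      "Bedrooms": [4.0], "Bathrooms": [3.0], "Garage": [2.0]})
print(model.predict(house))  # about $179,800 with the text's coefficients
```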
Model Diagnosis

Before First City actually uses this regression model to estimate the sales price of a house, there are several questions that should be answered:
1. Is the overall model significant?
2. Are the individual variables significant?
3. Is the standard deviation of the model error too large to provide meaningful results?
4. Is multicollinearity a problem?
5. Have the regression analysis assumptions been satisfied?

We shall answer the first four questions in order. We will have to wait until Section 15.5 before we have the procedures to answer the fifth important question.

Is the Model Significant?

Because the regression model we constructed is based on a sample of data from the population and is subject to sampling error, we need to test the statistical significance of the overall regression model. The specific null and alternative hypotheses tested for First City Real Estate are

H0: β1 = β2 = β3 = β4 = β5 = 0
HA: At least one βi ≠ 0

If the null hypothesis is true and all the slope coefficients are simultaneously equal to zero, the overall regression model is not useful for predictive or descriptive purposes. The F-test is a method for testing whether the regression model explains a significant proportion of the variation in the dependent variable (and whether the overall model is significant). The F-test statistic for a multiple regression model is shown in Equation 15.5.

F-Test Statistic
\[ F = \frac{SSR/k}{SSE/(n - k - 1)} \qquad (15.5) \]
where:
SSR = Sum of squares regression = \( \sum (\hat{y} - \bar{y})^2 \)
SSE = Sum of squares error = \( \sum (y - \hat{y})^2 \)
n = Sample size
k = Number of independent variables
Degrees of freedom: D1 = k and D2 = (n − k − 1)

The ANOVA portion of the output shown in Figure 15.5a contains values for SSR, SSE, and the F-value. The general format of the ANOVA table in a regression analysis is as follows:

ANOVA
Source | df | SS | MS | F | Significance F
Regression | k | SSR | MSR = SSR/k | MSR/MSE | computed p-value
Residual | n − k − 1 | SSE | MSE = SSE/(n − k − 1) | |
Total | n − 1 | SST | | |

The ANOVA portion of the output from Figure 15.5a is as follows:

ANOVA
Source | df | SS | MS | F | Significance F
Regression | 5 | 1.04E+12 | 2.08E+11 | 277.8 | 0.0000
Residual | 313 | 2.34E+11 | 7.48E+08 | |
Total | 318 | 1.27303E+12 | | |

We can test the model's significance

H0: β1 = β2 = β3 = β4 = β5 = 0
HA: At least one βi ≠ 0

by either comparing the calculated F-value, 277.8, with a critical value for a given alpha level (for α = 0.01 with k = 5 and n − k − 1 = 313 degrees of freedom, Excel's FINV function gives F0.01 = 3.079) or comparing the p-value in the output with the specified alpha level.

Because F = 277.8 > 3.079, reject H0; or, because the p-value ≈ 0.0 < 0.01, reject H0.

We should therefore conclude that the regression model does explain a significant proportion of the variation in sales price. Thus, the overall model is statistically significant. This means we can conclude that at least one of the regression slope coefficients is not equal to zero.
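Using the SSR and SSE values reported in the ANOVA table, the F statistic, its critical value, and its p-value can be reproduced with a short sketch (scipy's F distribution stands in for Excel's FINV function):

```python
from scipy import stats

SSR, SSE = 1.0389e12, 2.34135e11   # from the ANOVA table
n, k = 319, 5

MSR = SSR / k                      # about 2.08E+11
MSE = SSE / (n - k - 1)            # about 7.48E+08
F = MSR / MSE                      # about 277.8

F_crit = stats.f.ppf(0.99, k, n - k - 1)  # about 3.08; Excel: FINV(0.01, 5, 313)
p_value = stats.f.sf(F, k, n - k - 1)     # essentially 0
print(F, F_crit, p_value)
```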
Excel and Minitab also provide a measure called R-sq(adj), which is the adjusted R-squared value (see Figures 15.5a and 15.5b). It is calculated by Equation 15.6.

Adjusted R-Squared: A measure of the percentage of explained variation in the dependent variable that takes into account the relationship between the sample size and the number of independent variables in the regression model.

\[ \text{R-sq(adj)} = R_A^2 = 1 - (1 - R^2)\left(\frac{n - 1}{n - k - 1}\right) \qquad (15.6) \]
where:
n = Sample size
k = Number of independent variables
R² = Coefficient of determination

Adding independent variables to the regression model will always increase R², even if these variables have no relationship to the dependent variable. Therefore, as the number of independent variables is increased (regardless of the quality of the variables), R² will increase. However, each additional variable results in the loss of one degree of freedom. This is viewed as part of the cost of adding the specified variable. The addition to R² may not justify the reduction in degrees of freedom. The \( R_A^2 \) value takes this cost into account and adjusts the R² value accordingly. \( R_A^2 \) will always be less than R². When a variable is added that does not contribute its fair share to the explanation of the variation in the dependent variable, the \( R_A^2 \) value may actually decline, even though R² will always increase. The adjusted R-squared is a particularly important measure when the number of independent variables is large relative to the sample size, because R² may appear artificially high in that situation. In this example, in which the sample size is quite large relative to the number of independent variables, the adjusted R-squared is 81.3%, only slightly less than R² = 81.6%.
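A quick check of Equation 15.6 with the First City values:

```python
n, k, R2 = 319, 5, 0.8161
R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
print(round(R2_adj, 4))  # 0.8132 -- the 81.3% reported in the output
```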
Are the Individual Variables Significant?

We have concluded that the overall model is significant. This means at least one independent variable explains a significant proportion of the variation in sales price. This does not mean that all the variables are significant, however. To determine which variables are significant, we test the following hypotheses:

H0: βj = 0
HA: βj ≠ 0, for all j

We can test the significance of each independent variable using significance level α = 0.05 and a t-test, as discussed in Chapter 14. The calculated t-values should be compared to the critical t-value with n − k − 1 = 319 − 5 − 1 = 313 degrees of freedom, which is approximately t0.025 ≈ 1.97 for α = 0.05. The calculated t-value for each variable is provided in the computer printout in Figures 15.5a and 15.5b. Recall that the t statistic is determined by dividing the regression coefficient by the estimator of the standard deviation of the regression coefficient, as shown in Equation 15.7.

t-Test for Significance of Each Regression Coefficient
\[ t = \frac{b_j - 0}{s_{b_j}}, \qquad df = n - k - 1 \qquad (15.7) \]
where:
bj = Sample slope coefficient for the jth independent variable
\( s_{b_j} \) = Estimate of the standard error for the jth sample slope coefficient

For example, the t-value for square feet shown in Figure 15.5a is 15.70. This was computed using Equation 15.7, as follows:

\[ t = \frac{b_j - 0}{s_{b_j}} = \frac{63.1}{4.02} = 15.70 \]

Because t = 15.70 > 1.97, we reject H0 and conclude that, given the other independent variables in the model, the regression slope for square feet is not zero. We can also look at the Excel or Minitab output and compare the p-value for each regression slope coefficient with alpha. If the p-value is less than alpha, we reject the null hypothesis and conclude that the independent variable is statistically significant in the model. Both the t-test and the p-value techniques will give the same results.

You should recognize that these t-tests are conditional tests. This means that the null hypothesis is that the value of each slope coefficient is 0, given that the other independent variables are already in the model.⁵ Figure 15.6 shows the hypothesis tests for each independent variable using a 0.05 significance level.

⁵Note that the t-tests may be affected if the independent variables in the model are themselves correlated. A procedure known as the sum of squares drop F-test, discussed by Kutner et al. in Applied Linear Statistical Models, should be used in this situation. Each t-test considers only the marginal contribution of the independent variables and may indicate that none of the variables in the model are significant, even though the ANOVA procedure indicates otherwise.

FIGURE 15.6 | Significance Tests for Each Independent Variable in the First City Real Estate Example

Hypotheses:
H0: βj = 0, given all other variables are already in the model
HA: βj ≠ 0, given all other variables are already in the model
α = 0.05, df = n − k − 1 = 319 − 5 − 1 = 313

Decision Rule: If t > t0.025 = 1.97 or t < −t0.025 = −1.97, reject H0; otherwise, do not reject H0. (The rejection region places α/2 = 0.025 in each tail.)

The test is:
For β1: calculated t = 15.70. Because 15.70 > 1.97, reject H0.
For β2: calculated t = −10.15. Because −10.15 < −1.97, reject H0.
For β3: calculated t = −2.80. Because −2.80 < −1.97, reject H0.
For β4: calculated t = 2.23. Because 2.23 > 1.97, reject H0.
For β5: calculated t = 9.87. Because 9.87 > 1.97, reject H0.

We conclude that all five independent variables in the model are significant.
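The five tests in Figure 15.6 can be reproduced, and their two-tailed p-values computed, from the reported t statistics, as in this minimal sketch:

```python
import numpy as np
from scipy import stats

# t statistics reported in Figure 15.6, one per slope coefficient
t_stats = np.array([15.70, -10.15, -2.80, 2.23, 9.87])
df = 319 - 5 - 1

t_crit = stats.t.ppf(0.975, df)                  # about 1.97
p_values = 2 * stats.t.sf(np.abs(t_stats), df)   # two-tailed p-values
for t, p in zip(t_stats, p_values):
    print(f"t = {t:7.2f}   p = {p:.4f}   reject H0: {abs(t) > t_crit}")
```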
When a regression model is to be used for prediction, the model should contain no insignificant variables. If insignificant variables are present, they should be dropped and a new regression equation obtained before the model is used for prediction purposes. We will have more to say about this later.

Is the Standard Deviation of the Regression Model Too Large?

The purpose of developing the First City regression model is to be able to determine values of the dependent variable when corresponding values of the independent variables are known. An indication of how good the regression model is can be found by looking at the relationship between the measured values of the dependent variable and the values predicted by the regression model. The standard deviation of the regression model (also called the standard error of the estimate) measures the dispersion of the observed home sale values, y, around the values predicted by the regression model. The standard error of the estimate is shown in Figure 15.5a and can be computed using Equation 15.8.

Standard Error of the Estimate

$$s_{\varepsilon} = \sqrt{\frac{SSE}{n - k - 1}} = \sqrt{MSE} \tag{15.8}$$

where:
SSE = Sum of squares error (residual)
n = Sample size
k = Number of independent variables

Examining Equation 15.8 closely, we see that the standard error of the estimate is the square root of the mean square error of the residuals found in the analysis of variance table. Sometimes, even though a model has a high R², the standard error of the estimate will be too large to provide adequate precision for confidence and prediction intervals. A rule of thumb that we have found useful is to examine the range ±2sε. Taking into account the mean value of the dependent variable, if this range is acceptable from a practical viewpoint, the standard error of the estimate might be considered acceptable.(6)

In this First City Real Estate Company example, the standard error, shown in Figure 15.5a, is $27,350. Thus, the rough prediction range for the price of an individual home is ±2($27,350) = ±$54,700. Considering that the mean price of homes in this study is in the low $200,000s, a potential error of $54,700 high or low is probably not acceptable. Not many homeowners would be willing to have their appraisal value set by a model with a possible error this large. Even though the model is statistically significant, the company needs to take steps to reduce the standard deviation of the estimate. Subsequent sections of this chapter discuss some ways we can attempt to reduce it.

6. The actual confidence interval for prediction of a new observation requires the use of matrix algebra. However, when the sample size is large and values of the independent variables near their means are used, the rule of thumb given here is a close approximation. Refer to Applied Linear Statistical Models by Kutner et al. for further discussion.
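Equation 15.8 and the ±2sε rule of thumb are simple to check numerically. Below is a minimal sketch in Python; the SSE value is hypothetical, backed out so that the result matches the $27,350 standard error reported in Figure 15.5a.

```python
import math

n, k = 319, 5
sse = 27_350**2 * (n - k - 1)       # hypothetical SSE consistent with s = $27,350

s_e = math.sqrt(sse / (n - k - 1))  # Equation 15.8: s = sqrt(SSE / (n - k - 1))
rough_range = 2 * s_e               # rule-of-thumb rough prediction range

print(f"standard error of the estimate = ${s_e:,.0f}")      # $27,350
print(f"rough prediction range = +/- ${rough_range:,.0f}")  # +/- $54,700
```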
Chapter Outcome

Is Multicollinearity a Problem?

Multicollinearity: A high correlation between two independent variables such that the two variables contribute redundant information to the model. When highly correlated independent variables are included in the regression model, they can adversely affect the regression results.

Even if the overall regression model is significant and each independent variable is significant, decision makers should still examine the regression model to determine whether it appears reasonable. This is referred to as checking for face validity. Specifically, you should check that the signs on the regression coefficients are consistent with the signs on the correlation coefficients between the independent variables and the dependent variable. Does any regression coefficient have an unexpected sign?

Before answering this question for the First City Real Estate example, we should review what the regression coefficients mean. First, the constant term, b0, is the estimate of the model's y intercept. If the data used to develop the regression model contain values of x1, x2, x3, x4, and x5 that are simultaneously 0 (as would be the case for vacant land), b0 is the mean value of y, given that x1 through x5 all equal 0. Under these conditions b0 would estimate the average value of a vacant lot. However, in the First City example, no vacant land was in the sample, so b0 has no particular meaning.

The coefficient for square feet, b1, estimates the average change in sales price corresponding to a change in house size of 1 square foot, holding the other independent variables constant. The value shown in Figure 15.5a for b1 is 63.1. The coefficient is positive, indicating that an increase in square footage is associated with an increase in sales price. This relationship is expected: all other things being equal, bigger houses should sell for more money. Likewise, the coefficient for x5, the size of the garage, is positive, indicating that an increase in garage size is also associated with an increase in price, as expected. The coefficient for x2, the age of the house, is negative, indicating that an older house is worth less than a similar younger house; this also seems reasonable. Finally, variable x4, bathrooms, has the expected positive sign. However, the coefficient for variable x3, the number of bedrooms, is -$8,410.4, meaning that if we hold the other variables constant but increase the number of bedrooms by one, the average price is estimated to drop by $8,410.40. Does this seem reasonable? Referring to the correlation matrix shown earlier in Figure 15.3, the correlation between variable x3, bedrooms, and y, the sales price, is 0.540. This indicates that, without considering the other independent variables, the linear relationship between the number of bedrooms and the sales price is positive. Why, then, does the regression coefficient for variable x3 turn out to be negative in the model?
The answer lies in what is called multicollinearity. Multicollinearity occurs when independent variables are correlated with each other and therefore overlap with respect to the information they provide in explaining the variation in the dependent variable. For example, x3 and the other independent variables have the following correlations (see Figure 15.3b):

r(x3, x1) = 0.706
r(x3, x2) = -0.202
r(x3, x4) = 0.600
r(x3, x5) = 0.312

All four correlations have t-values indicating a significant linear relationship. Refer to the correlation matrix in Figure 15.3 to see that other independent variables are also correlated with each other.

The problems caused by multicollinearity, and how to deal with them, continue to be of prime concern to statisticians. From a decision maker's viewpoint, you should be aware that multicollinearity can (and often does) exist and recognize the basic problems it can cause. The following are some of the most obvious problems and indications of severe multicollinearity:

1. Unexpected, and therefore potentially incorrect, signs on the coefficients
2. A sizable change in the values of the previously estimated coefficients when a new variable is added to the model
3. A variable that was previously significant in the regression model becomes insignificant when a new independent variable is added
4. The estimate of the standard deviation of the model error increases when a variable is added to the model

Variance Inflation Factor: A measure of how much the variance of an estimated regression coefficient increases if the independent variables are correlated. A VIF equal to 1.0 for a given independent variable indicates that this independent variable is not correlated with the remaining independent variables in the model. The greater the multicollinearity, the larger the VIF.

Mathematical approaches exist for dealing with multicollinearity and reducing its impact. Although these procedures are beyond the scope of this text, one suggestion is to eliminate the variables that are the chief cause of the multicollinearity problems. If the independent variables in a regression model are correlated and multicollinearity is present, another potential problem is that the t-tests for the significance of the individual independent variables may be misleading. That is, a t-test may indicate that a variable is not statistically significant when in fact it is. One method of measuring multicollinearity is known as the variance inflation factor (VIF). Equation 15.9 is used to compute the VIF for each independent variable.

Variance Inflation Factor

$$VIF = \frac{1}{1 - R_j^2} \tag{15.9}$$

where:
Rj² = Coefficient of determination when the jth independent variable is regressed against the remaining k - 1 independent variables

Both the PHStat add-in for Excel and Minitab contain options that provide VIF values.(7) Figure 15.7 shows the Excel (PHStat) output of the VIFs for the First City Real Estate example. The effect of multicollinearity is to decrease the test statistic, thus reducing the probability that the variable will be declared significant. A related impact is to increase the width of the confidence interval estimate of the slope coefficient in the regression model.

7. Excel's Regression procedure in the Data Analysis Tools area does not provide VIF values directly. Without PHStat, you would need to compute each regression analysis individually and record the R-squared value to compute the VIF.
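Equation 15.9 can be applied directly: regress each independent variable on the remaining ones and record the resulting R². The sketch below shows one way to do this in Python; the data are hypothetical and the function is ours, not part of PHStat or Minitab.

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF for column j of X: regress x_j on the other columns
    (with an intercept) and apply Equation 15.9, 1 / (1 - R_j^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r_sq = 1 - resid.var() / y.var()
    return 1 / (1 - r_sq)

# Hypothetical data: x1 and x2 are correlated, x3 is not
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

for j in range(3):
    print(f"VIF for x{j + 1}: {vif(X, j):.2f}")  # x1 and x2 show inflated values
```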
Generally, if the VIF is less than 5 for a particular independent variable, multicollinearity is not considered a problem for that variable. VIF values of 5 or more imply that the correlation between the independent variables is too extreme and should be dealt with by dropping variables from the model. As Figure 15.7 illustrates, the VIF values for each independent variable are less than 5, so based on the variance inflation factors, even though the sign on the variable bedrooms has switched from positive to negative, the other multicollinearity issues do not exist among these independent variables.

FIGURE 15.7 | Excel 2007 (PHStat) Multiple Regression Model Results for First City Real Estate with Variance Inflation Factors

Excel 2007 Instructions:
1. Open file: First City.xls.
2. Click on Add-Ins > PHStat.
3. Select Regression > Multiple Regression.
4. Define the y variable range and the x variable range.
5. Select Regression Statistics Table and ANOVA and Coefficients Table.
6. Select Variance Inflation Factor (VIF).
7. Click OK.
Note: VIFs consolidated to one page for display in Figure 15.7.

Minitab Instructions (for similar result):
1. Open file: First City.MTW.
2. Choose Stat > Regression > Regression.
3. In Response, enter the dependent (y) variable.
4. In Predictors, enter the independent (x) variables.
5. Click Options; in Display, select Variance inflation factors.
6. Click OK. OK.

Confidence Interval Estimation for Regression Coefficients

Previously, we showed how to determine whether the regression coefficients are statistically significant. This was necessary because the estimates of the regression coefficients are developed from sample data and are subject to sampling error. The issue of sampling error also comes into play when interpreting the slope coefficients. Consider again the regression models for First City Real Estate shown in Figures 15.8a and 15.8b. The regression coefficients shown are point estimates for the true regression coefficients. For example, the coefficient for the variable square feet is b1 = 63.1. We interpret this to mean that, holding the other variables constant, each 1-square-foot increase in the size of a home is estimated to increase its price by $63.10. But like all point estimates, this is subject to sampling error. In Chapter 14 you were introduced to the concept of confidence interval estimates for the regression coefficients. The same concept applies in multiple regression models. Equation 15.10 is used to develop the confidence interval estimate for the regression coefficients.

FIGURE 15.8A | Excel 2007 Multiple Regression Model Results for First City Real Estate (95% confidence interval estimates for the regression coefficients)

Excel 2007 Instructions:
1. Open file: First City.xls.
2. Click on Data > Data Analysis.
3. Select Regression.
4. Define the y variable range and the x variable range (include labels).
5. Click Labels.
6. Click OK.

Confidence Interval Estimate for the Regression Slope

$$b_j \pm t\, s_{b_j} \tag{15.10}$$

where:
bj = Point estimate for the regression coefficient for xj
t = Critical t-value for the specified confidence level
sbj = The standard error of the jth regression coefficient
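Equation 15.10 needs only the point estimate, its standard error, and a critical t-value. Here is a minimal sketch in Python using the square-feet values that appear on the printouts (b1 = 63.1, sb1 ≈ 4.017, df = 313); scipy stands in for Excel's TINV function.

```python
from scipy import stats

b1, se_b1 = 63.1, 4.017          # slope and standard error from the printout
df = 319 - 5 - 1                 # n - k - 1 = 313

t_crit = stats.t.ppf(0.975, df)  # about 1.967 for 95% confidence
margin = t_crit * se_b1          # about 7.90

print(f"95% CI for the square-feet slope: "
      f"({b1 - margin:.1f}, {b1 + margin:.1f})")  # roughly (55.2, 71.0)
```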
The Excel output in Figure 15.8a provides the confidence interval estimates for each regression coefficient. For example, the 95% interval estimate for square feet is $55.2 to $71.0. Minitab does not have a command to generate confidence intervals for the individual regression parameters. However, the statistical quantities provided on the Minitab output in Figure 15.8b allow these intervals to be calculated manually.

FIGURE 15.8B | Minitab Multiple Regression Model Results for First City Real Estate (b1 = 63.1 and its standard error sb1 highlighted)

Minitab Instructions:
1. Open file: First City.MTW.
2. Choose Stat > Regression > Regression.
3. In Response, enter the dependent (y) variable.
4. In Predictors, enter the independent (x) variables.
5. Click OK.

As an example, the confidence interval for the coefficient associated with the square-feet variable can be computed using Equation 15.10 as(8)

b1 ± t sb1 = 63.1 ± 1.967(4.017) = 63.1 ± 7.90, or $55.2 to $71.0

We interpret this interval as follows: Holding the other variables constant, at the 95% confidence level, a 1-square-foot change in house size is estimated to generate an average change in home price of between $55.20 and $71.00. Each of the other regression coefficients can be interpreted in the same manner.

8. Note: we used Excel's TINV function to get the precise t-value of 1.967.

MyStatLab 15-1: Exercises

Skill Development

15-1. The following output is associated with a multiple regression model with three independent variables:

              df      SS           MS          F       Significance F
Regression     3    16,646.091   5,548.697    5.328        0.007
Residual      21    21,871.669   1,041.508
Total         24    38,517.760

            Coefficients   Standard Error   t Stat    p-value
Intercept      87.790          25.468        3.447     0.002
x1             -0.970           0.586       -1.656     0.113
x2              0.002           0.001        3.133     0.005
x3             -8.723           7.495       -1.164     0.258

            Lower 95%   Upper 95%   Lower 90%   Upper 90%
Intercept     34.827     140.753      43.966     131.613
x1            -2.189       0.248      -1.979       0.038
x2             0.001       0.004       0.001       0.004
x3           -24.311       6.864     -21.621       4.174

a. What is the regression model associated with these data?
b. Is the model statistically significant?
c. How much of the variation in the dependent variable can be explained by the model?
d. Are all of the independent variables in the model significant? If not, which are not, and how can you tell?
e. How much of a change in the dependent variable will be associated with a one-unit change in x2? In x3?
f. Do any of the 95% confidence interval estimates of the slope coefficients contain zero? If so, what does this indicate?

15-2. You are given the following estimated regression equation involving a dependent and two independent variables:

ŷ = 12.67 + 4.14x1 + 8.72x2

a. Interpret the values of the slope coefficients in the equation.
b. Estimate the value of the dependent variable when x1 = … and x2 = …

15-3. In working for a local retail store you have developed the following estimated regression equation:

ŷ = 22,167 + 412x1 + 818x2 + 93x3 + 71x4

where:
y = Weekly sales
x1 = Local unemployment rate
x2 = Weekly average high temperature
x3 = Number of activities in the local community
x4 = Average gasoline price

a. Interpret the values of b1, b2, b3, and b4 in this estimated regression equation.
b. What are the estimated sales if the unemployment rate is 5.7%, the average high temperature is 61°, there are 14 activities, and the average gasoline price is $1.39?
15-4. The following correlation matrix is associated with the same data used to build the regression model in Problem 15-1:

        y       x1      x2
x1    0.406
x2    0.459   0.051
x3    0.244   0.504   0.272

Does this output indicate any potential multicollinearity problems with the analysis?

15-5. Consider the following set of data:

x1    x2    y
29    15    16
48    37    46
28    24    34
22    32    26
28    47    49
42    13    11
33    43    41
26    12    13
48    58    47
44    19    16

a. Obtain the estimated regression equation.
b. Develop the correlation matrix for this set of data. Select the independent variable whose correlation magnitude is the smallest with the dependent variable. Determine if its correlation with the dependent variable is significant.
c. Determine if the overall model is significant. Use a significance level of 0.05.
d. Calculate the variance inflation factor for each of the independent variables. Indicate if multicollinearity exists between the two independent variables.

15-6. Consider the following set of data:

x2     x1     y
10     50    103
45     85     11
37    115     32
73     10     44
97     11     51
102    42     65

a. Obtain the estimated regression equation.
b. Examine the coefficient of determination and the adjusted coefficient of determination. Does it seem that either independent variable's addition to R² does not justify the reduction in degrees of freedom that results from its addition to the regression model? Support your assertions.
c. Conduct a hypothesis test to determine if the dependent variable increases when x2 increases. Use a significance level of 0.025 and the p-value approach.
d. Construct a 95% confidence interval for the coefficient of x1.

Computer Database Exercises

15-7. An investment analyst collected data about 20 randomly chosen companies. The data consist of the 52-week-high stock prices, price-to-earnings (PE) ratios, and market values of the companies. These data are in the file entitled Investment.
a. Produce a regression equation to predict the market value using the 52-week-high stock price and the PE ratio of the company.
b. Determine if the overall model is significant. Use a significance level of 0.05.
c. OmniVision Technologies (Sunnyvale, CA) in April 2006 had a 52-week-high stock price of 31 and a PE ratio of 19. Estimate its market value for that time period. (Note: Its actual market value for that time period was $1,536.)

15-8. An article in BusinessWeek presents a list of the 100 companies perceived as having "hot growth" characteristics. A company's rank on the list is the sum of 0.5 times its rank in return on total capital and 0.25 times its sales and profit-growth ranks. The file entitled Growth contains sales ($million), sales increase (%), return on capital, market value ($million), and recent stock price of the top 20 ranked companies.
Support your assertions e Select the variable that is most correlated with the stock price and test to see if it is a significant predictor of the stock price Use a significance level of 0.10 and the p-value approach 15-9 Refer to Exercise 15-8, which referenced a list of the 100 companies perceived as having “hot growth” characteristics The file entitled Logrowth contains sales ($million), sales increase (%), return on capital, market value ($million), and recent stock price of the companies ranked from 81 to 100 In Exercise 15-8, stock prices were the focus Examine the sales of the companies a Produce a regression equation that will predict the sales as a function of the other variables b Determine if the overall model is significant Use a significance level of 0.05 c Conduct a test of hypothesis to discover if market value should be removed from this model d To see that a variable can be insignificant in one model but very significant in another, construct a regression equation in which sales is the dependent variable and market value is the independent variable Test the hypothesis that market value is a significant predictor of sales for those companies ranked from 81 to 100 Use a significance level of 0.05 and the p-value approach 15-10 The National Association of Theatre Owners is the largest exhibition trade organization in the world, representing more than 26,000 movie screens in all 50 states and in more than 20 countries worldwide Its membership includes the largest cinema chains and hundreds of independent theatre owners It publishes statistics concerning the movie sector of the economy The file entitled Flicks contains data on total U.S box office grosses ($billion), total number of admissions (billion), average U.S ticket price ($), and number of movie screens a Construct a regression equation in which total U.S box office grosses are predicted using the other variables b Determine if the overall model is significant Use a significance level of 0.05 | Multiple Regression Analysis and Model Building 653 c Determine the range of plausible values for the change in box office grosses if the average ticket price were to be increased by $1 Use a confidence level of 95% d Calculate the variance inflation factor for each of the independent variables Indicate if multicollinearity exists between any two independent variables e Produce the regression equation suggested by your answer to part d 15-11 The athletic director of State University is interested in developing a multiple regression model that might be used to explain the variation in attendance at football games at his school A sample of 16 games was selected from home games played during the past 10 seasons Data for the following factors were determined: y  Game attendance x1  Team win/loss percentage to date x2  Opponent win/loss percentage to date x3  Games played this season x4  Temperature at game time The data collected are in the file called Football a Produce scatter plots for each independent variable versus the dependent variable Based on the scatter plots, produce a model that you believe represents the relationship between the dependent variable and the group of predictor variables represented in the scatter plots b Based on the correlation matrix developed from these data, comment on whether you think a multiple regression model will be effectively developed from these data c Use the sample data to estimate the multiple regression model that contains all four independent variables d What percentage of the total 
e. Test to determine whether the overall model is statistically significant. Use α = 0.05.
f. Which, if any, of the independent variables is statistically significant? Use a significance level of α = 0.08 and the p-value approach to conduct these tests.
g. Estimate the standard deviation of the model error and discuss whether this regression model is acceptable as a means of predicting the football attendance at State University at any given game.
h. Define the term multicollinearity and indicate the potential problems that multicollinearity can cause for this model. Indicate what, if any, evidence there is of multicollinearity problems with this regression model. Use the variance inflation factor to assist you in this analysis.
i. Develop a 95% confidence interval estimate for each of the regression coefficients and interpret each estimate. Comment on whether the interpretation of the intercept is relevant in this situation.

END EXERCISES 15-1

Chapter Outcome

15.2 Using Qualitative Independent Variables

Dummy Variable: A variable that is assigned a value equal to either 0 or 1, depending on whether the observation possesses a given characteristic.

In Example 15-1 involving the First City Real Estate Company, the independent variables were quantitative and ratio level. However, you will encounter many situations in which you may wish to use a qualitative, lower-level variable as an explanatory variable. If a variable is nominal and numerical codes are assigned to its categories, you already know not to perform mathematical calculations using those data: the results would be meaningless. Yet we may wish to use a variable such as marital status, gender, or geographical location as an independent variable in a regression model. If the variable of interest is coded as an ordinal variable, such as education level or job performance ranking, computing means and variances is also inappropriate. How, then, are these variables incorporated into a multiple regression analysis?
The answer lies in using what are called dummy (or indicator) variables. For instance, consider the variable gender, which can take on two possible values: male or female. Gender can be converted to a dummy variable as follows:

x1 = 1 if female
x1 = 0 if male

Thus, a data set consisting of males and females will have corresponding values for x1 equal to 0s and 1s, respectively. Note that it makes no difference which gender is coded 1 and which is coded 0.

If a categorical variable has more than two mutually exclusive outcome possibilities, multiple dummy variables must be created. Consider the variable marital status, with the following possible outcomes: never married, married, divorced, widowed. In this case, marital status has four values. To account for all the possibilities, you would create three dummy variables, one less than the number of possible outcomes for the original variable. They could be coded as follows:

x1 = 1 if never married, 0 if not
x2 = 1 if married, 0 if not
x3 = 1 if divorced, 0 if not

Note that we don't need a fourth variable: a person who isn't single, married, or divorced must be widowed, so x1 = 0, x2 = 0, and x3 = 0 identifies the widowed category. Always use one fewer dummy variable than the number of categories. The mathematical reason for this is called the dummy variable trap: if the number of dummy variables equals the number of possible categories, perfect multicollinearity is introduced and the least squares regression estimates cannot be obtained.

EXAMPLE 15-1 INCORPORATING DUMMY VARIABLES

Business Executive Salaries To illustrate the effect of incorporating dummy variables into a regression model, consider the sample data displayed in the scatter plot in Figure 15.9. The population from which the sample was selected consists of executives between the ages of 24 and 60 who are working in U.S. manufacturing businesses. Data for annual salary (y) and age (x1) are available. The objective is to determine whether a model can be generated to explain the variation in annual salary for business executives. Even though age and annual salary are significantly correlated (r = 0.686) at the α = 0.05 level, the coefficient of determination is only 47%. Therefore, we would likely search for other independent variables that could help us to further explain the variation in annual salary.

FIGURE 15.9 | Executive Salary Data, Scatter Plot: Annual Salary versus Age (r = 0.686)

Suppose we can determine which of the 16 people in the sample have a master of business administration (MBA) degree. Figure 15.10 shows the scatter plot for these same data, with the MBA data represented by triangles. To incorporate a qualitative variable into the analysis, use the following steps:

Step 1 Code the qualitative variable as a dummy variable.
Create a new variable, x2, which is a dummy variable coded as x2 = 1 if MBA, 0 if not. The data with the new variable are shown in Table 15.2.

TABLE 15.2 | Executive Salary Data Including MBA Variable

Salary ($)   Age   MBA
 65,000      26     0
 85,000      28     0
 74,000      36     1
 83,000      35     1
110,000      35     1
160,000      40     1
100,000      41     1
122,000      42     …
 85,000      45     …
120,000      46     …
105,000      50     …
135,000      51     …
125,000      55     …
175,000      50     …
156,000      61     …
140,000      63     …

Step 2 Develop a multiple regression model with the dummy variable incorporated as an independent variable.
The two-variable population multiple regression model has the following form:

y = β0 + β1x1 + β2x2 + ε

Using either Excel or Minitab, we get the following regression equation as an estimate of the population model:

ŷ = 6,974 + 2,055x1 + 35,236x2
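Outside of Excel or Minitab, the same kind of estimate can be reproduced with any least squares routine. The sketch below, in Python, codes the MBA dummy and fits the two-variable model; it uses only the seven observations of Table 15.2 whose MBA codes are shown, so the printed coefficients are illustrative and will not match the full-sample equation above.

```python
import numpy as np

# First seven (salary, age, MBA) observations from Table 15.2
salary = np.array([65_000, 85_000, 74_000, 83_000, 110_000, 160_000, 100_000], dtype=float)
age    = np.array([26, 28, 36, 35, 35, 40, 41], dtype=float)
mba    = np.array([0, 0, 1, 1, 1, 1, 1], dtype=float)  # dummy: 1 if MBA, 0 if not

# Design matrix with intercept: y = b0 + b1*age + b2*mba
X = np.column_stack([np.ones_like(age), age, mba])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)

print(f"intercept = {b[0]:,.0f}, age slope = {b[1]:,.0f}, MBA shift = {b[2]:,.0f}")
# b2 shifts the intercept: non-MBAs follow b0 + b1*age,
# MBAs follow (b0 + b2) + b1*age -- two parallel lines.
```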
Because the dummy variable, x2, has been coded 1 or 0 depending on MBA status, incorporating it into the regression model is like having two simple linear regression lines with the same slopes but different intercepts. For instance, when x2 = 0, the regression equation is

ŷ = 6,974 + 2,055x1 + 35,236(0) = 6,974 + 2,055x1

This line is shown in Figure 15.10. However, when x2 = 1 (the executive has an MBA), the regression equation is

ŷ = 6,974 + 2,055x1 + 35,236(1) = 42,210 + 2,055x1

This regression line is also shown in Figure 15.10. As you can see, incorporating the dummy variable affects the regression intercept. In this case, the intercept for executives with an MBA degree is $35,236 higher than for those without an MBA. We interpret the regression coefficient on this dummy variable as follows: Based on these data, and holding age (x1) constant, we estimate that executives with an MBA degree make an average of $35,236 per year more in salary than their non-MBA counterparts.

FIGURE 15.10 | Impact of a Dummy Variable: MBAs: ŷ = 42,210 + 2,055x1; non-MBAs: ŷ = 6,974 + 2,055x1; b2 = 35,236 = regression coefficient on the dummy variable

>> END EXAMPLE | TRY PROBLEM 15-17 (pg. 659)

BUSINESS APPLICATION: REGRESSION MODELS USING DUMMY VARIABLES

FIRST CITY REAL ESTATE (CONTINUED) The regression model developed in Example 15-1 for First City Real Estate showed potential because the overall model was statistically significant. Looking back at Figure 15.8, we see that the model explained nearly 82% of the variation in sales prices for the homes in the sample. All of the independent variables were significant, given that the other independent variables were in the model. However, the standard error of the estimate was $27,350, so the managers have decided to try to improve the model. First, they have decided to add a new variable: area. At this point, the only area variable they have access to defines whether the home is in the foothills. Because this is a categorical variable with two possible outcomes (foothills or not foothills), a dummy variable can be created as follows:

x6 (area) = 1 if foothills, 0 if not

Of the 319 homes in the sample, 249 were in the foothills and 70 were not. Figure 15.11 shows the revised Minitab multiple regression with the variable area added. This model is an improvement over the original model: the adjusted R-squared has increased from 81.3% to 90.2%, and the standard error of the estimate has decreased from $27,350 to $19,828. The conditional t-tests show that all of the regression model's slope coefficients, except that for the variable bathrooms, differ significantly from 0. The Minitab output shows that the variance inflation factors are all less than 5.0, so we don't need to be too concerned about the t-tests understating the significance of the regression coefficients. (See the Excel Tutorial for this example to get the full VIF output from PHStat.) The resulting regression model is

ŷ = -6,817 + 63.3(sq. ft.) - 334(age) - 8,445(bedrooms) + 949(bathrooms) + 26,246(garage) + 62,041(area)
Because the variable bathrooms is not significant in the presence of the other variables, we can remove it and rerun the multiple regression. The resulting model is

Price = -7,050 + 62.5(sq. ft.) - 322(age) - 8,830(bedrooms) + 26,054(garage) + 61,370(area)

FIGURE 15.11 | Minitab Output: First City Real Estate Revised Regression Model (note the dummy variable coefficient and the improved R-square, adjusted R-square, and standard error)

Minitab Instructions:
1. Open file: First City.MTW.
2. Choose Stat > Regression > Regression.
3. In Response, enter the dependent (y) variable.
4. In Predictors, enter the independent (x) variables.
5. Click Options.
6. In Display, select Variance inflation factors.
7. Click OK. OK.

Based on the sample data and this regression model, we estimate that a house with otherwise identical characteristics (square feet, age, bedrooms, and garages) is worth an average of $61,370 more if it is located in the foothills (based on how the dummy variable was coded).

There are still signals of multicollinearity problems. The coefficient on the independent variable bedrooms is negative, when we might expect homes with more bedrooms to sell for more. Also, the standard error of the estimate is still very large ($19,817) and does not provide the precision the managers need to set prices for homes. More work needs to be done before the model is complete.

Possible Improvements to the First City Appraisal Model

Because the standard error of the estimate is still too high, we look to improve the model. We could start by identifying possible problems:

1. We may be missing useful independent variables.
2. Independent variables may have been included that should not have been included.

There is no sure way of determining the correct model specification. However, a recommended approach is for the decision maker to try adding variables to, or removing variables from, the model. We begin by removing the bedrooms variable, which has an unexpected sign on its regression slope coefficient. (Note: If the regression model's sole purpose is prediction, independent variables with unexpected signs do not automatically pose a problem and do not necessarily need to be deleted. However, insignificant variables should be deleted.) The resulting model is shown in Figures 15.12a and 15.12b. Now all the variables in the model have the expected signs; however, the standard error of the estimate has increased slightly.

Adding other explanatory variables might help. For instance, whether the house has central air conditioning might affect its sales price. If we can identify which houses have air conditioning, we could add a dummy variable coded as follows:

If air conditioning, x7 = 1
If no air conditioning, x7 = 0

Other potential independent variables might include a more detailed location variable, a measure of the physical condition, or whether the house has one or two stories. Can you think of others?
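In this add-and-drop search, the adjusted R-squared and the standard error of the estimate are the natural yardsticks, since plain R² never decreases when a variable is added. Here is a minimal sketch of both computations in Python; the SSE and SST inputs below are hypothetical, chosen only to illustrate a comparison of two candidate models.

```python
import math

def adjusted_r_squared(sse: float, sst: float, n: int, k: int) -> float:
    """R-squared penalized for the number of independent variables k."""
    r_sq = 1 - sse / sst
    return 1 - (1 - r_sq) * (n - 1) / (n - k - 1)

def std_error_estimate(sse: float, n: int, k: int) -> float:
    """Equation 15.8: s = sqrt(SSE / (n - k - 1))."""
    return math.sqrt(sse / (n - k - 1))

# Hypothetical comparison of two candidate models on the same data
n, sst = 319, 2.0e12
candidates = [("with bathrooms", 1.22e11, 6), ("without bathrooms", 1.23e11, 5)]
for label, sse, k in candidates:
    print(f"{label}: adj R^2 = {adjusted_r_squared(sse, sst, n, k):.4f}, "
          f"s = {std_error_estimate(sse, n, k):,.0f}")
```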
FIGURE 15.12A | Excel 2007 Output for the First City Real Estate Revised Model (all variables are significant and have the expected signs)

Excel 2007 Instructions:
1. Open file: First City.xls (worksheet: HomesSample-2).
2. Click on the Data tab, then click on Data Analysis.
3. Select Regression.
4. Define the y variable range and x variables range.
5. Click OK.

FIGURE 15.12B | Minitab Output for the First City Real Estate Revised Model (all variables are significant and have the expected signs)

Minitab Instructions:
1. Open file: First City.MTW.
2. Choose Stat > Regression > Regression.
3. In Response, enter the dependent (y) variable.
4. In Predictors, enter the independent (x) variables.
5. Click OK.

The First City example illustrates that even though a regression model may pass the statistical tests of significance, it may not be functional. Good appraisal models can be developed using multiple regression analysis, provided more detail is available about such characteristics as finish quality, landscaping, location, neighborhood characteristics, and so forth. The cost and effort required to obtain these data can be relatively high. Developing a multiple regression model is more of an art than a science. The real decisions revolve around how to select the best set of independent variables for the model.

MyStatLab 15-2: Exercises

Skill Development

15-12. Consider the following regression model:

y = β0 + β1x1 + β2x2 + ε

where:
x1 = A quantitative variable
x2 = 1 if x1 < 20, 0 if x1 ≥ 20

The following estimated regression equation was obtained from a sample of 30 observations:

ŷ = 24.1 + 5.8x1 + 7.9x2

a. Provide the estimated regression equation for instances in which x1 < 20.
b. Determine the value of ŷ when x1 = 10.
c. Provide the estimated regression equation for instances in which x1 ≥ 20.
d. Determine the value of ŷ when x1 = 30.

15-13. You are considering developing a regression equation relating a dependent variable to two independent variables. One of the variables can be measured on a ratio scale, but the other is a categorical variable with two possible levels.
a. Write a multiple regression equation relating the dependent variable to the independent variables.
b. Interpret the meaning of the coefficients in the regression equation.

15-14. You are considering developing a regression equation relating a dependent variable to two independent variables. One of the variables can be measured on a ratio scale, but the other is a categorical variable with four possible levels.
a. How many dummy variables are needed to represent the categorical variable?
b. Write a multiple regression equation relating the dependent variable to the independent variables.
c. Interpret the meaning of the coefficients in the regression equation.

15-15. A real estate agent wishes to estimate the monthly rental for apartments based on the size (square feet) and the location of the apartments. She chose the following model:

y = β0 + β1x1 + β2x2 + ε

where:
x1 = Square footage of the apartment
x2 = 1 if located in town center, 0 if not located in town center

This linear regression model was fitted to a sample of size 50 to produce the following regression equation:

ŷ = 145 + 1.2x1 + 300x2

a. Predict the average monthly rent for an apartment located in the town center that has 1,500 square feet.
b. Predict the average monthly rent for an apartment located in the suburbs that has 1,500 square feet.
c. Interpret b2 in the context of this exercise.

Business Applications

15-16. The Polk Utility Corporation is developing a multiple regression model that it plans to use to predict customers' utility usage. The analyst currently has three quantitative variables (x1, x2, and x3) in the model, but she is dissatisfied with the R-squared and the estimate of the standard deviation of the model's error. Two variables she thinks might be useful are whether the house has a gas or an electric water heater and whether the house was constructed before or after the 1974 energy crisis. Provide the model she should use to predict customers' utility usage. Specify the dummy variables to be used, the values these variables could assume, and what each value will represent.

15-17. A study was recently performed by the American Automobile Association in which it attempted to develop a regression model to explain variation in Environmental Protection Agency (EPA) mileage ratings of new cars. At one stage of the analysis, the estimate of the model took the following form:

ŷ = 34.20 - 0.003x1 + 4.56x2

where:
x1 = Vehicle weight
x2 = 1 if standard transmission, 0 if automatic transmission

a. Interpret the regression coefficient for variable x1.
b. Interpret the regression coefficient for variable x2.
c. Present an estimate of a model that would predict the average EPA mileage rating for an automobile with standard transmission as a function of the vehicle's weight.
d. Cadillac's STS-V with automatic transmission weighs approximately 4,394 pounds. Provide an estimate of the average highway mileage you would expect to obtain from this model.
e. Discuss the effect of a dummy variable being incorporated in a regression equation like this one. Use a graph if it is helpful.

15-18. A real estate agent wishes to determine the selling price of residences using the size (square feet) and whether the residence is a condominium or a single-family home. A sample of 20 residences was obtained with the following results:

Price ($)   Type     Square Feet
199,700     Family     1,500
211,800     Condo      2,085
197,100     Family     1,450
228,400     Family     1,836
215,800     Family     1,730
190,900     Condo      1,726
312,200     Family     2,300
313,600     Condo      1,650
239,000     Family     1,950
184,400     Condo      1,545
200,600     Condo      1,375
208,000     Condo      1,825
210,500     Family     1,650
233,300     Family     1,960
187,200     Condo      1,360
185,200     Condo      1,200
284,100     Family     2,000
207,200     Family     1,755
258,200     Family     1,850
203,100     Family     1,630

a. Produce a regression equation to predict the selling price for residences using a model of the following form:

yi = β0 + β1x1 + β2x2 + ε

where:
x1 = Square footage
x2 = 1 if a condo, 0 if a single-family home
b. Interpret the parameters b1 and b2 in the model given in part a.
c. Produce an equation that describes the relationship between the selling price and the square footage of (1) condominiums and (2) single-family homes.
d. Conduct a test of hypothesis to determine if the relationship between the selling price and the square footage is different between condominiums and single-family homes.

15-19. When cars from Korean automobile manufacturers started coming to the United States, they were given very poor quality ratings. That started changing several years ago. J.D. Power and Associates generates a widely respected report on initial quality, and the improved quality started being seen in the 2004 Initial Quality Study. Results were based on responses from more than 62,000 purchasers and lessors of new-model-year cars and trucks, who were surveyed after 90 days of ownership. Initial quality is measured by the number of problems per 100 vehicles (PP100). The PP100 data for the interval 1998-2004 follow:

           1998   1999   2000   2001   2002   2003   2004
Korean      272    227    222    214    172    152    117
Domestic    182    177    164    153    137    135    123
European    158    171    154    141    137    136    122

a. Produce a regression equation to predict the PP100 for vehicles using the model

yi = β0 + β1x1 + β2x2 + ε

where:
x1 = 1 if Domestic, 0 if not Domestic
x2 = 1 if European, 0 if not European

b. Interpret the parameters β0, β1, and β2 in the model given in part a.
c. Conduct a test of hypothesis using the model in part a to determine if the average PP100 is the same for the three international automobile production regions.

Computer Database Exercises

15-20. The Energy Information Administration (EIA), created by Congress in 1977, is a statistical agency of the U.S. Department of Energy. It provides data, forecasts, and analyses to promote sound policymaking and public understanding regarding energy and its interaction with the economy and the environment. One of the most important areas of analysis is petroleum. The file entitled Crude contains data for the period 1991-2006 concerning the price, supply, and demand for fuel. It has been conjectured that the pricing structure of gasoline changed at the turn of the century.
a. Produce a regression equation to predict the selling price of gasoline:

yi = β0 + β1x1 + ε

where:
x1 = 1 if in twenty-first century, 0 if in twentieth century

b. Conduct a hypothesis test to address the conjecture. Use a significance level of 0.05 and the test statistic approach.
c. Produce a 95% confidence interval to estimate the change in the average selling price of gasoline between the twentieth and the twenty-first centuries.

15-21. The Gilmore Accounting firm, in an effort to explain variation in client profitability, collected the data found in the file called Gilmore, where:

y = Net profit earned from the client
x1 = Number of hours spent working with the client
x2 = Type of client: 1 if manufacturing, 2 if service, 3 if governmental

a. Develop a scatter plot of each independent variable against the client income variable. Comment on what, if any, relationship appears to exist in each case.
b. Run a simple linear regression analysis using only variable x1 as the independent variable. Describe the resulting estimate fully.
c. Test to determine if the number of hours spent working with the client is useful in predicting client profitability.
15-22. Using the data from the Gilmore Accounting firm found in the data file Gilmore (see Exercise 15-21),
a. Incorporate the client type into the regression analysis using dummy variables. Describe the resulting multiple regression estimate.
b. Test to determine if this model is useful in predicting the net profit earned from the client.
c. Test to determine if the number of hours spent working with the client is useful in this model in predicting the net profit earned from a client.
d. Considering the tests you have performed, construct a model and its estimate for predicting the net profit earned from the client.
e. Predict the average difference in profit if the client is governmental versus one in manufacturing. Also state this in terms of a 95% confidence interval estimate.

15-23. Several previous problems have dealt with the College Board changing the format of the SAT test taken by many entering college freshmen. Many reasons were given for changing the format. The class of 2005 was the last to take the former version of the SAT, featuring math and verbal sections. There had been conjecture about whether a relationship existed between the average math SAT score, on the one hand, and the average verbal SAT score and the gender of the student taking the SAT examination, on the other. Consider the following relationship:

yi = β0 + β1x1 + β2x2 + ε

where:
x1 = Average verbal SAT score
x2 = 1 if female, 0 if male

a. Use the file MathSAT to compute the linear regression equation to predict the average math SAT score using the gender and the average verbal SAT score of the students taking the SAT examination.
b. Interpret the parameters in the model.
c. Conduct a hypothesis test to determine if the gender of the student taking the SAT examination is a significant predictor of the student's average math SAT score for a given average verbal SAT score.
d. Predict the average math SAT score of female students with an average verbal SAT score of 500.

END EXERCISES 15-2

Chapter Outcome

15.3 Working with Nonlinear Relationships

Section 14.1 in Chapter 14 showed that there are a variety of ways in which two variables can be related. Correlation and regression analysis techniques are tools for measuring and modeling linear relationships between variables. Many business situations involve a linear relationship between two variables, and regression equations that model that relationship are appropriate in those situations. However, there are also many instances in which the relationship between two variables is curvilinear rather than linear. For instance, demand for electricity has grown at an almost exponential rate relative to population growth in some areas. Advertisers believe that a diminishing-returns relationship will occur between sales and advertising if advertising is allowed to grow too large. These two situations are shown in Figures 15.13 and 15.14, respectively. They represent just two of the many possible curvilinear relationships that could exist between two variables.

FIGURE 15.13 | Exponential Relationship of Increased Demand for Electricity versus Population Growth

FIGURE 15.14 | Diminishing Returns Relationship of Advertising versus Sales

As you will soon see, models with nonlinear relationships become more complicated than models with only linear relationships. Although complicated models are sometimes necessary, decision makers should use them with caution, for several reasons. First, management researchers and authors have written that people use decision aids they understand and don't use those they don't understand.
So, the more complicated a model is, the less likely it is to be used. Second, the scientific principle of parsimony suggests using the simplest model possible that provides a reasonable fit of the data, because complex models typically do not reflect the underlying phenomena that produce the data in the first place.

This section provides a brief introduction to how linear regression analysis can be used in dealing with curvilinear relationships. To model such curvilinear relationships, we must incorporate terms into the multiple regression model that create "curves" in the model we are building. Including terms whose independent variable has an exponent larger than 1 generates these curves. A model possessing such terms is referred to as a polynomial model. The general equation for a polynomial with one independent variable is given in Equation 15.11.

Polynomial Population Regression Model

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + \varepsilon \tag{15.11}$$

where:
β0 = Population regression's constant
βj = Population's regression coefficient for variable x^j; j = 1, 2, …, p
p = Order (or degree) of the polynomial
ε = Model error

The order, or degree, of the model is determined by the largest exponent of the independent variable in the model. For instance, the model

y = β0 + β1x + β2x² + ε

is a second-order polynomial because the largest exponent in any term of the polynomial is 2. You will note that this model contains terms of all orders less than or equal to 2. A polynomial with this property is said to be a complete polynomial, so the previous model would be referred to as a complete second-order regression model.

A second-order model produces a parabola that opens either upward (β2 > 0) or downward (β2 < 0), as shown in Figure 15.15.

FIGURE 15.15 | Second-Order Regression Models (parabolas for β2 > 0 and β2 < 0)

You will notice that the models in Figures 15.13, 15.14, and 15.15 each possess a single curve. As more curves appear in the data, the order of the polynomial must be increased. A general (complete) third-order polynomial is given by the equation

y = β0 + β1x + β2x² + β3x³ + ε

This model produces a curvilinear shape that reverses the direction of the initial curve to produce a second curve, as shown in Figure 15.16. Note that there are two curves in the third-order model. In general, a pth-order polynomial will exhibit p - 1 curves.

FIGURE 15.16 | Third-Order Regression Models (curves for β3 > 0 and β3 < 0)

Although polynomials of all orders exist in the business sector, second-order polynomials are perhaps the most common. Sharp reversals in the curvature of a relationship between variables in the business environment usually point to unexpected, perhaps severe, changes that were not foreseen, and the vast majority of organizations try to avoid such reversals. For this reason, and because this is an introductory business statistics course, we will direct most of our attention to second-order polynomials. The following examples illustrate two of the most common instances in which curvilinear relationships can be used in decision making. They should give you an idea of how to approach similar situations.
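Estimating a complete second-order model requires nothing beyond ordinary least squares, because the squared term is simply an additional column in the design matrix. Here is a minimal sketch in Python with hypothetical data:

```python
import numpy as np

# Hypothetical (x, y) data containing a single curve
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([3.1, 2.4, 2.2, 2.6, 3.5, 5.2, 7.4, 10.1])

# Complete second-order model: y = b0 + b1*x + b2*x^2 + e
X = np.column_stack([np.ones_like(x), x, x**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b
r_sq = 1 - ((y - y_hat)**2).sum() / ((y - y.mean())**2).sum()
print(f"y-hat = {b[0]:.2f} + {b[1]:.2f}x + {b[2]:.2f}x^2, R^2 = {r_sq:.3f}")
# Here b2 > 0, so the fitted parabola opens upward (see Figure 15.15).
```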
Chapter Outcome

EXAMPLE 15-2 MODELING CURVILINEAR RELATIONSHIPS

Ashley Investment Services Ashley Investment Services was severely shaken by the downturn in the stock market during the summer and fall of 2008. To maintain profitability and save as many jobs as possible, everyone has since been extra busy analyzing new investment opportunities. The director of personnel has noticed an increased number of people suffering from "burnout," in which physical and emotional fatigue hurt job performance. Although he cannot change the job's pressures, he has read that the more time a person spends socializing with coworkers away from the job, the more likely there is to be a higher degree of burnout. With the help of the human resources lab at the local university, the personnel director has administered a questionnaire to company employees. A burnout index has been computed from the responses to the survey. Likewise, the survey responses are used to determine quantitative measures of socialization. Sample data from the questionnaires are contained in the file Ashley. The following steps can be used to model the relationship between the socialization index and the burnout index for Ashley employees:

Step 1 Specify the model by determining the dependent and potential independent variables.
The dependent variable is the burnout index; the company wishes to explain the variation in burnout level. One potential independent variable is the socialization index.

Step 2 Formulate the model.
We begin by proposing that a linear relationship exists between the two variables. Figures 15.17a and 15.17b show the linear regression analysis results using Excel and Minitab. The correlation between the two variables is r = 0.818, which is statistically different from zero at any reasonable significance level. The estimate of the population linear regression model shown in Figure 15.17a is

ŷ = -66.164 + 9.589x

FIGURE 15.17A | Excel 2007 Output of a Simple Linear Regression for Ashley Investment Services (regression coefficients)

Excel 2007 Instructions:
1. Open file: Ashley.xls.
2. Select Data > Data Analysis.
3. Select Regression.
4. Specify the y variable range and x variable range (include labels).
5. Check the Labels option.
6. Specify the output location.
7. Click OK.

FIGURE 15.17B | Minitab Output of a Simple Linear Regression for Ashley Investment Services (regression coefficients)

Minitab Instructions:
1. Open file: Ashley.MTW.
2. Choose Stat > Regression > Regression.
3. In Response, enter the y variable column.
4. In Predictors, enter the x variable column.
5. Click OK.

Step 3 Perform diagnostic checks on the model.
The sample data and the regression line are plotted in Figure 15.18. The line appears to fit the data. However, a closer inspection reveals instances where several consecutive points lie above or below the line. The points are not randomly dispersed around the regression line, as they should be given the regression analysis assumptions.

FIGURE 15.18 | Plot of Regression Line for the Ashley Investment Services Example (ŷ = -66.164 + 9.589x, R² = 0.6693)

As you will recall from earlier discussions, we can use an F-test to test whether a regression model explains a significant amount of variation in the dependent variable:

H0: ρ² = 0
HA: ρ² > 0

From the output in Figure 15.17a, F = 36.43, which has a p-value ≈ 0.0000. Thus, we conclude that the simple linear model is statistically significant. However, we should also examine the data to determine whether any curvilinear relationship may be present.
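The F-statistic on the printout can be reproduced from R² alone. The sketch below does so in Python for a simple one-predictor regression; the sample size of 20 is our assumption, chosen only because it is consistent with the F = 36.43 and R² = 0.6693 values shown above.

```python
from scipy import stats

r_sq = 0.6693    # R-squared from Figure 15.18
n, k = 20, 1     # assumed sample size; one independent variable

# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
f_stat = (r_sq / k) / ((1 - r_sq) / (n - k - 1))
p_value = stats.f.sf(f_stat, k, n - k - 1)

print(f"F = {f_stat:.2f}, p-value = {p_value:.6f}")  # F is about 36.4
```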
FIGURE 15.19A | Excel 2007 Output of a Second-Order Polynomial Fit for Ashley Investment (R-squared; regression coefficients)

Excel 2007 Instructions:
1. Open file: Ashley.xls.
2. Use Excel equations to create the new variable in column C (i.e., for the first data value use =A2^2, then copy down).
3. Select Data > Data Analysis.
4. Select Regression.
5. Specify the y variable range and x variable range (include the new variable and the labels).
6. Check the Labels option.
7. Specify the output location.
8. Click OK.

FIGURE 15.19B | Minitab Output of a Second-Order Polynomial Fit for Ashley Investment (R-squared)

Minitab Instructions:
1. Open file: Ashley.MTW.
2. Use Calc > Calculator to create the squared socialization column.
3. Choose Stat > Regression > Fitted Line Plot.
4. In Response, enter the y variable.
5. In Predictor, enter the x variable.
6. Under Type of Regression Model, choose Quadratic.
7. Click OK.

Step 4 Model the curvilinear relationship.
Nonrandom patterns in the residuals of a regression model indicate the possibility that a curvilinear relationship, rather than a linear one, is appropriate. One possible approach to modeling the curvilinear nature of the data in the Ashley Investment example is the use of polynomials. From Figure 15.18, we can see that there appears to be one curve in the data. This suggests fitting the second-order polynomial

y = β0 + β1x + β2x² + ε

Before fitting the estimate for this population model, you need to create the new independent variable by squaring the socialization measure variable. In Excel, use the formula option; in Minitab, use the Calc > Calculator command to create the new variable. Figures 15.19a and 15.19b show the output after fitting this second-order polynomial model.

Step 5 Perform diagnostics on the revised curvilinear model.
Notice that the second-order polynomial provides a model whose estimated regression equation has an R² of 74.1%, higher than the R² of 66.9% for the linear model. Figure 15.20 shows the plot of the second-order polynomial model. Comparing Figure 15.20 with Figure 15.18, we can see that the polynomial model does appear to fit the sample data better than the linear model.

FIGURE 15.20 | Plot of the Second-Order Polynomial Model for Ashley Investment (second-degree polynomial)

>> END EXAMPLE | TRY PROBLEM 15-24 (pg. 675)

Analyzing Interaction Effects

BUSINESS APPLICATION: DEALING WITH INTERACTION

ASHLEY INVESTMENT SERVICES (CONTINUED) Referring to Example 15-2 involving Ashley Investment Services, the director of personnel wondered whether the effects of burnout differ between male and female workers. He therefore identified the gender of the previously surveyed employees (see file Ashley-2). A multiple scatter plot of the data appears in Figure 15.21. The personnel director tried to determine the relationship between the burnout index and the socialization measure for men and for women; the graphical result is presented in Figure 15.21. Note that both relationships appear to be curvilinear, with similarly shaped curves. As we showed earlier, curvilinear shapes can often be modeled by the second-order polynomial

ŷ = b0 + b1x1 + b2x1²
>>END EXAMPLE

TRY PROBLEM 15-24 (pg 675)

Analyzing Interaction Effects

BUSINESS APPLICATION  DEALING WITH INTERACTION

ASHLEY INVESTMENT SERVICES (CONTINUED) Referring to Example 15-3 involving Ashley Investment Services, the director of personnel wondered if the effects of burnout differ among male and female workers. He therefore identified the gender of the previously surveyed employees (see file Ashley-2). A multiple scatter plot of the data appears as Figure 15.21.

The personnel director tried to determine the relationship between the burnout index and socialization measure for men and women. The graphical result is presented in Figure 15.21. Note that both relationships appear to be curvilinear, with a similarly shaped curve. As we showed earlier, curvilinear shapes often can be modeled by the second-order polynomial

ŷ = b0 + b1x1 + b2x1²

FIGURE 15.20 | Plot of Second-Order Polynomial Model for Ashley Investment (second-degree polynomial; Burnout Index, 0 to 1,200, plotted against Socialization Measure, 0 to 100)

FIGURE 15.21 | Excel 2007 Multiple Scatter Plot for Ashley Investment Services (separate series for male and female employees)
Excel 2007 Instructions: 1 Open file: Ashley-2.xls. 2 Select the Socialization Measure and Burnout Index columns. 3 Select the Insert tab. 4 Select the XY (Scatter). 5 Select Chart and click the right mouse button, then choose Select Data. 6 Click Add in the Legend Entries (Series) section. 7 Enter the series name (Females); for Series X Values, select data from the Socialization column for the rows corresponding to females (rows 2–11); for Series Y Values, select data from the Burnout column corresponding to females (rows 2–11). 8 Repeat step 7 for males. 9 Click on the Layout tab to remove the legend and to add chart and axis titles. 10 Select the data points for males: right-click and select Add Trendline → Exponential. 11 Repeat step 10 for females.

However, the regression equations that estimate this second-order polynomial for men and women are not the same. The two equations seem to have different locations and different rates of curvature. Whether an employee is a man or a woman seems to change the basic relationship between burnout index (y) and socialization measure (x1). To represent this difference, the equation's coefficients b0, b1, and b2 must be different for male and female employees. Thus, we could use two models, one for each gender. Alternatively, we could use one model for both male and female employees by incorporating a dummy independent variable with two levels:

x2 = 1 if male, 0 if female

As x2 changes values from 0 to 1, it affects the values of the coefficients b0, b1, and b2. Suppose the director fitted the second-order model for the female employees only. He obtained the following regression equation:

ŷ = 291.70 − 4.62x1 + 0.102x1²

The equation for only the male employees was

ŷ = 149.59 − 4.40x1 + 0.160x1²

Interaction The case in which one independent variable (such as x2) affects the relationship between another independent variable (x1) and the dependent variable (y).

To explain how a change in gender can cause this kind of change, we must introduce interaction. In our example, gender (x2) interacts with the relationship between socialization measure (x1) and burnout index (y). The question is how to obtain the interaction terms to model such a relationship.
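Before deriving the interaction terms formally, the two gender-specific equations above can be checked by fitting the quadratic model separately to each group. This sketch assumes a hypothetical ashley2.csv export of the Ashley-2 data with columns Socialization, Burnout, and Male (1 = male, 0 = female); the actual file layout may differ.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("ashley2.csv")   # hypothetical export of the Ashley-2 data

def fit_quadratic(subset):
    """Fit burnout = b0 + b1*x + b2*x^2 on one gender group."""
    X = sm.add_constant(pd.DataFrame({
        "x": subset["Socialization"],
        "x_sq": subset["Socialization"] ** 2,
    }))
    return sm.OLS(subset["Burnout"], X).fit()

females = fit_quadratic(df[df["Male"] == 0])
males = fit_quadratic(df[df["Male"] == 1])

# Coefficients should differ by group: roughly (291.70, -4.62, 0.102)
# for females and (149.59, -4.40, 0.160) for males
print("Females:", females.params.round(3).to_dict())
print("Males:  ", males.params.round(3).to_dict())
```

Seeing all three coefficients shift between the groups is exactly the pattern a single composite model with interaction terms is built to capture, which is what the derivation below constructs.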
To construct the interaction terms, we first obtain the model for the basic relationship between the x1 and y variables. The population model is

y = β0 + β1x1 + β2x1² + ε

To obtain the interaction terms, multiply the terms on the right-hand side of this model by the variable that is interacting with this relationship between y and x1. In this case, that interacting variable is x2. The interaction terms are then

β3x2 + β4x1x2 + β5x1²x2

Composite Model The model that contains both the basic terms and the interaction terms.

Notice that we have changed the coefficient subscripts so we do not duplicate those in the original model. The interaction terms are added to the original model to produce the composite model

y = β0 + β1x1 + β2x1² + β3x2 + β4x1x2 + β5x1²x2 + ε

Note that the model for women is obtained by substituting x2 = 0 into the composite model. This gives

y = β0 + β1x1 + β2x1² + β3(0) + β4x1(0) + β5x1²(0) + ε = β0 + β1x1 + β2x1² + ε

Similarly, for men we substitute the value x2 = 1. The model then becomes

y = β0 + β1x1 + β2x1² + β3(1) + β4x1(1) + β5x1²(1) + ε = (β0 + β3) + (β1 + β4)x1 + (β2 + β5)x1² + ε

This illustrates how the coefficients change for different values of x2 and, therefore, how x2 interacts with the relationship between x1 and y. Once we know β3, β4, and β5, we know the effect of the interaction of gender on the original relationship between the burnout index (y) and the socialization measure (x1). To estimate the composite model, we need to create the required variables, as shown in Figure 15.22.

FIGURE 15.22 | Excel 2007 Data Preparation for Estimating Interactive Effects for the Second-Order Model for Ashley Investment
Excel 2007 Instructions: 1 Open file: Ashley-2.xls. 2 Use Excel formulas to create the new variables in columns C, E, and F.

FIGURE 15.23A | Excel 2007 Composite Model for Ashley Investment Services (regression coefficients for the composite model)
Excel 2007 Instructions: 1 Open file: Ashley-2.xls. 2 Create the new variables (see Figure 15.22 Excel 2007 Instructions). 3 Rearrange the columns so all x variables are contiguous. 4 Select Data > Data Analysis. 5 Select Regression. 6 Specify the y variable range and x variable range (include the new variables and the labels). 7 Check the Labels option. 8 Specify the output location. 9 Click OK.

FIGURE 15.23B | Minitab Composite Model for Ashley Investment Services (regression coefficients for the composite model)
Minitab Instructions: 1 Continue from Figure 15.19b. 2 Choose Stat → Regression → Regression. 3 In Response, enter the dependent (y) variable. 4 In Predictors, enter the independent (x) variables. 5 Click OK.

Figures 15.23a and 15.23b show the regression for the composite model. The estimate of the composite model is

ŷ = 291.706 − 4.615x1 + 0.102x1² − 142.113x2 + 0.215x1x2 + 0.058x1²x2

We obtain the model for females by substituting x2 = 0, giving

ŷ = 291.706 − 4.615x1 + 0.102x1² − 142.113(0) + 0.215x1(0) + 0.058x1²(0)
ŷ = 291.706 − 4.615x1 + 0.102x1²

For males, we substitute x2 = 1, giving

ŷ = 291.706 − 4.615x1 + 0.102x1² − 142.113(1) + 0.215x1(1) + 0.058x1²(1)
ŷ = 149.593 − 4.40x1 + 0.160x1²

Note that these equations for male and female employees are the same as those we found earlier when we generated two separate regression models, one for each gender.
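The data preparation of Figure 15.22 and the regression of Figures 15.23a and 15.23b can be mirrored in code. The sketch below reuses the hypothetical ashley2.csv layout from the previous example, builds the three interaction columns, and estimates the composite model; the coefficients should be close to those shown above.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("ashley2.csv")        # hypothetical export of the Ashley-2 data
x1 = df["Socialization"]
x2 = df["Male"]                        # dummy: 1 = male, 0 = female

# Basic terms plus the three interaction terms x2, x1*x2, x1^2*x2
X = sm.add_constant(pd.DataFrame({
    "x1": x1,
    "x1_sq": x1 ** 2,
    "x2": x2,
    "x1_x2": x1 * x2,
    "x1_sq_x2": (x1 ** 2) * x2,
}))
composite = sm.OLS(df["Burnout"], X).fit()

# Expect roughly (291.7, -4.6, 0.10, -142.1, 0.21, 0.058)
print(composite.params.round(3))
```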
In this example, we have looked at a case in which a dummy variable interacts with the relationship between another independent variable and the dependent variable. However, the interacting variable need not be a dummy variable; it can be any independent variable. Also, strictly speaking, interaction is not said to exist if the only effect of the interacting variable is to change the y intercept of the equation relating another independent variable to the dependent variable. Therefore, when you examine a scatter plot to detect interaction, you are trying to determine whether the relationships produced when the interacting variable changes values are parallel or not. If the relationships are parallel, only the y intercept is being affected by the change in the interacting variable, and interaction does not exist. Figure 15.24 demonstrates this concept graphically.

FIGURE 15.24 | Graphical Evidence of Interaction: (a) first-order polynomial without interaction (lines for x2 = 0 and x2 = 1 are parallel); (b) first-order polynomial with interaction (the lines are not parallel); (c) second-order polynomial without interaction (curves for x2 = 11.3 and x2 = 13.2 are parallel); (d) second-order polynomial with interaction (the curves are not parallel).

The Partial-F Test
So far you have been given the procedures required to test the significance of either one or all of the coefficients in a regression model. For instance, in Example 15-3 a hypothesis test was used to determine that a second-order model involving the socialization measure fit the sample data better than the linear model. Testing H0: β2 = 0 was the mechanism used to establish this. We could have determined whether both the linear and quadratic components were useful in predicting the burnout index level by testing the hypothesis H0: β1 = β2 = 0. However, more complex models occur. The interaction model involving Ashley Investment Services contained five predictor variables:

yi = β0 + β1x1i + β2x1i² + β3x2i + β4x1ix2i + β5x1i²x2i + ε

The presence of two of these predictor variables (x1ix2i and x1i²x2i) in the model would indicate that interaction is evident in this regression model. If the two interaction variables were absent, the model would be

yi = β0 + β1x1i + β2x1i² + β3x2i + ε

To determine whether there is statistical evidence of interaction, we must determine whether the coefficients of the interaction terms are all equal to 0. If they are, there is no interaction; otherwise, at least some interaction exists. For the Ashley Investment example, we test the hypotheses

H0: β4 = β5 = 0
HA: At least one of the βi ≠ 0

Earlier in this chapter we introduced procedures for testing whether all of the coefficients of a model equal 0. In that case, you use the analysis of variance F-test found on both Excel and Minitab output. However, to test whether there is significant interaction, we must test more than one but fewer than all of the regression coefficients. The method for doing this is the partial-F test. This test relies on the fact that, given a choice between two models, one model is a better fit if its sum of squares of error (SSE) is significantly smaller than that of the other model. Therefore, to determine whether interaction exists in our model, we must obtain the SSE for the model with the interaction terms and for the model without the interaction terms. The model without the interaction terms is called the reduced model; the model containing the interaction terms is called the complete model. We will denote the respective sums of squares as SSER and SSEC. It is important to note that the procedure is appropriate for testing any subset of more than one but fewer than all of the model's coefficients; we use the interaction terms in this example as just one such application.
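The two SSE values that the partial-F test compares come directly from fitting the complete and reduced models. A sketch, again under the hypothetical ashley2.csv layout used earlier (statsmodels reports a fitted model's SSE as the attribute ssr, the sum of squared residuals):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("ashley2.csv")                    # hypothetical export of Ashley-2
x1, x2, y = df["Socialization"], df["Male"], df["Burnout"]

base = {"x1": x1, "x1_sq": x1 ** 2, "x2": x2}
complete = sm.OLS(y, sm.add_constant(pd.DataFrame(
    {**base, "x1_x2": x1 * x2, "x1_sq_x2": (x1 ** 2) * x2}))).fit()
reduced = sm.OLS(y, sm.add_constant(pd.DataFrame(base))).fit()

sse_c = complete.ssr        # SSE of the complete model
sse_r = reduced.ssr         # SSE of the reduced model
mse_c = complete.mse_resid  # MSE of the complete model
print(f"SSE_C = {sse_c:.0f}, SSE_R = {sse_r:.0f}, MSE_C = {mse_c:.0f}")
```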
There are many models not containing interaction terms in which the partial-F test is applicable. The test is based on the concept that the SSE will be significantly reduced if not all of the regression coefficients being tested equal zero. Of course, if the SSE is significantly reduced, then SSER − SSEC must be significantly different from zero. To determine whether this difference is significantly different from zero, we use the partial-F test statistic given by Equation 15.12.

Partial-F Test Statistic

F = [(SSER − SSEC) / (c − r)] / MSEC    (15.12)

where:
MSEC = Mean square error for the complete model = SSEC/(n − c − 1)
r = The number of coefficients in the reduced model
c = The number of coefficients in the complete model
n = Sample size

The numerator of this test statistic is basically the average SSE per degree of freedom reduced by including the coefficients being tested in the model. This is compared to the average SSE per degree of freedom for the complete model. If these averages are significantly different, the null hypothesis is rejected. This test statistic has an F-distribution whose numerator degrees of freedom equal the number of parameters being tested (c − r) and whose denominator degrees of freedom equal the degrees of freedom for the complete model (n − c − 1).

We are now prepared to determine whether the director's data indicate a significant interaction between gender and the relationship between the Socialization Measure and the Burnout Index. In order to conduct the test of hypothesis, the director produced regression equations for both models (Figures 15.25a and 15.25b). He obtained SSEC and MSEC from Figure 15.25a and SSER from Figure 15.25b. He was then able to conduct the hypothesis test to determine if there was any interaction; Figure 15.25c displays this test. Since the null hypothesis was rejected, we can conclude that interaction does exist in this model. Apparently, the gender of the employee does affect the relationship between the Burnout Index and the Socialization Measure: the relationship differs between men and women.

FIGURE 15.25A | Sum of Squares for the Complete Model (SSEC and MSEC highlighted)

You must be very careful with interpretations of regression coefficients when interaction exists. Notice that the equation that contains the interaction terms is given by

ŷi = 292 − 4.61x1i + 0.102x1i² − 142x2i + 0.2x1ix2i + 0.058x1i²x2i

When interpreting the coefficient b1, you may be tempted to say that the Burnout Index will decrease by an average of 4.61 units for every unit the Socialization Measure (x1i) increases, holding all other predictor variables constant. However, this is not true: there are three other components of this regression equation that contain x1i. When x1i increases by one unit, x1i² will also increase. In addition, the interaction terms also contain x1i, and therefore those terms will change as well. This being the case, every time the variable x2 changes, the rates of change tied to the interaction terms are also affected. Perhaps you will see this more clearly if we rewrite the equation as

ŷi = (292 − 142x2i) + (0.2x2i − 4.61)x1i + (0.102 + 0.058x2i)x1i²

In this form you can see that the coefficients of x1i and x1i² change whenever x2 changes. Thus, the interpretation of any of these components depends on the value of x2 as well as x1i. Whenever interaction or higher-order components are present, you should be very careful in your attempts to interpret the results of your regression analysis.
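The arithmetic of the partial-F test is simple to verify. The sketch below plugs in the sums of squares reported in Figures 15.25a through 15.25c (SSER = 231,845, SSEC = 127,317, MSEC = 9,094, with n = 20, c = 5, and r = 3) and lets scipy supply the critical value.

```python
from scipy.stats import f

sse_r, sse_c = 231_845, 127_317   # reduced and complete model SSE (Figures 15.25a/b)
mse_c = 9_094                     # mean square error of the complete model
n, c, r = 20, 5, 3                # sample size; coefficients in complete/reduced models

partial_f = (sse_r - sse_c) / (c - r) / mse_c
critical_f = f.ppf(0.95, dfn=c - r, dfd=n - c - 1)   # alpha = 0.05

print(f"Partial-F = {partial_f:.3f}, critical F = {critical_f:.3f}")
# Partial-F = 5.747 > 3.739, so H0: beta4 = beta5 = 0 is rejected
```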
674 CHAPTER 15 FIGURE 15.25B | | Multiple Regression Analysis and Model Building Sum of Squares for the Reduced Model SSER FIGURE 15.25C | H0: 4  5  HA: At least one of the i s  Partial-F Hypothesis Test for Interaction   0.05 Test Statistic: F (SSER  SSEC)/(c  r) MSEC  (231,845  127,317)/(5  3) 9,094  5.747 Rejection Region: d.f.: D1  c  r    D2  (n  c  1)  20    14 Rejection Region F  3.739 Because Partial-F  5.747 > 3.739, we reject H0   0.05 (214) CHAPTER 15 | Multiple Regression Analysis and Model Building 675 MyStatLab 15-3: Exercises Skill Development 15-27 Examine the following two sets of data: 15-24 Consider the following values for the dependent and independent variables: When x2  x1 y When x2  x1 y x y x y 2 15 40 10 15 25 50 60 80 44 79 112 12 11 14 19 20 15 23 52 60 154 122 200 381 392 10 14 13 16 21 10 48 50 87 51 63 202 a Develop a scatter plot of the data Does the plot suggest a linear or nonlinear relationship between the dependent and independent variables? b Develop an estimated linear regression equation for the data Is the relationship significant? Test at an a  0.05 level c Develop a regression equation of the form yˆ  b0  b1 x  b2 x Does this equation provide a better fit to the data than that found in part b? 15-25 Consider the following values for the dependent and independent variables: x 14 y x y 20 28 18 22 27 30 33 35 a Develop a scatter plot of the data Does the plot suggest a linear or nonlinear relationship between the dependent and independent variables? b Develop an estimated linear regression equation for the data Is the relationship significant? Test at an a  0.05 level c Develop a regression equation of the form yˆ  b0  b1 ln(x ) Does this equation provide a better fit to the data than that found in part b? 15-26 Examine the following data: x 12 15 y 75 175 415 620 22 21 25 37 39 7,830 7,551 7,850 11,112 11,617 a Construct a scatter plot of the data Determine the order of the polynomial that is represented by the data b Obtain an estimate of the model identified in part a c Conduct a test of hypothesis to determine if a thirdorder, as opposed to a second-order, polynomial is a better representation of the relationship between y and x Use a significance level of 0.05 and the p-value approach a Produce a distinguishable scatter plot for each of the data sets on the same graph Does it appear that there is interaction between x2 and the relationship between y and x1? Support your assertions b Consider the following model to represent the relationship among y, x1, and x2: yi  b0  b1 x1  b2 x12  b3 x1 x2  b4 x12 x2   Produce the estimated regression equation for this model c Conduct a test of hypothesis for each interaction term Use a significance level of 0.05 and the p-value approach d Based on the two hypothesis tests in part c, does it appear that there is interaction between x2 and the relationship between y and x1? 
Support your assertions 15-28 Consider the following data: x 12 11 14 19 20 y 54 125 324 512 5,530 5,331 5,740 7,058 7,945 a Construct a scatter plot of the data Determine the order of the polynomial that is represented by this data b Obtain an estimate of the model identified in part a c Conduct a test of hypothesis to determine if a thirdorder, as opposed to a first-order, polynomial is a better representation of the relationship between y and x Use a significance level of 0.05 and the p-value approach 15-29 A regression equation to be used to predict a dependent variable with four independent variables is (215) 676 CHAPTER 15 | Multiple Regression Analysis and Model Building developed from a sample of size 10 The resulting equation is yˆ  32.8  0.470x1  0.554x2  4.77x3  0.929x4 Two other equations are developed from the sample: yˆ  12.4  0.60x1  1.60x2 and yˆ  49.7  5.38x3  1.35x4 The respective sum of squares errors for the three equations are 201.72, 1,343, and 494.6 a Use the summary information to determine if the independent variables x3 and x4 belong in the complete regression model Use a significance level of 0.05 b Repeat part a for the independent variables x1 and x2 Use the p-value approach and a significance level of 0.05 Computer Database Exercises 15-30 In a bit of good news for male students, American men have closed the gap with women on life span, according to a USA Today article Male life expectancy attained a record 75.2 years, and women’s reached 80.4 The National Center for Health Statistics provided the data given in the file entitled Life a Produce a scatter plot depicting the relationship between the life expectancy of women and men b Determine the order of the polynomial that is represented on the scatter plot obtained in part a Produce the estimated regression equation that represents this relationship c Determine if women’s average life expectancy can be used in a second-order polynomial to predict the average life expectancy of men Use a significance level of 0.05 d Use the estimated regression equation computed in part b to predict the average length of life of men when women’s length of life equals 100 What does this tell you about the wisdom (or lack thereof) of extrapolation in regression models? 15-31 The Gilmore Accounting firm previously mentioned, in an effort to explain variation in client profitability, collected the data found in the file called Gilmore, where: y  Net profit earned from the client x1  Number of hours spent working with the client x2  Type of client: 1, if manufacturing 2, if service 3, if governmental Gilmore has asked if it needs the client type in addition to the number of hours spent working with the client to predict the net profit earned from the client You are asked to provide this information a Fit a model to the data that incorporates the number of hours spent working with the client and the type of client as independent variables (Hint: Client type has three levels.) b Fit a second-order model to the data, again using dummy variables for client type Does this model provide a better fit than that found in part a? Which model would you recommend be used? 
15-32 McCullom’s International Grains is constantly searching out areas in which to expand its market Such markets present different challenges since tastes in the international market are often different from domestic tastes India is one country on which McCullom’s has recently focused Paddy is a grain used widely in India, but its characteristics are unknown to McCullom’s Charles Walters has been assigned to take charge of the handling of this grain He has researched its various characteristics During his research he came across an article, “Determination of Biological Maturity and Effect of Harvesting and Drying Conditions on Milling Quality of Paddy” [Journal of Agricultural Engineering Research (1975), pp 353–361], which examines the relationship between y, the yield (kg/ha) of paddy, as a function of x, and the number of days after flowering at which harvesting took place The accompanying data appeared in the article and are in a file called Paddy y 2,508 2,518 3,304 3,423 3,057 3,190 3,500 3,883 x y x 16 18 20 22 24 26 28 30 3,823 3,646 3,708 3,333 3,517 3,241 3,103 2,776 32 34 36 38 40 42 44 46 a Construct a scatter plot of the yield (kg/ha) of paddy as a function of the number of days after flowering at which harvesting took place Display at least two models that would explain the relationship you see in the scatter plot b Conduct tests of hypotheses to determine if the models you selected are useful in predicting the yield of paddy c Consider the model that includes the second-order term x2 Would a simple linear regression model be preferable to the model containing the second-order term? Conduct a hypothesis test using the p-value approach to arrive at your answer d Which model should Charles use to predict the yield of paddy? Explain your answer 15-33 The National Association of Realtors Existing-Home Sales Series provides a measurement of the residential real estate market One of the measurements it produces is the Housing Affordability Index (HAI) It is a measure of the financial ability of U.S families to buy a house: 100 means that families earning the national median income have just the amount of money needed to qualify for a mortgage on a median-priced (216) CHAPTER 15 home; higher than 100 means they have more than enough, and lower than 100 means they have less than enough The file entitled Index contains the HAI and associated variables a Construct a scatter plot relating the HAI to the median family income b Determine the order of the polynomial that is suggested by the scatter plot produced in part a Obtain the estimated regression equation of the polynomial selected c Determine if monthly principle and interest interacts with the relationship between the HAI and the median family income indicated in part b (Hint: You may need to conduct more than one hypothesis test.) 
Use a significance level of 0.05 and the p-value approach 15-34 Badeaux Brothers Louisiana Treats ships packages of Louisiana coffee, cakes, and Cajun spices to individual customers around the United States The cost to ship these products depends primarily on the weight of the package being shipped Badeaux charges the customers for shipping and then ships the product itself As a part of a study of whether it is economically feasible to continue to ship products itself, Badeaux sampled 20 recent shipments to determine what, if any, relationship exists between shipping costs and package weight These data are in the file called Badeaux a Develop a scatter plot of the data with the dependent variable, cost, on the vertical axis and the independent variable, weight, on the horizontal axis Does there appear to be a relationship between the two variables? Is the relationship linear? b Compute the sample correlation coefficient between the two variables Conduct a test, using a significance level of 0.05, to determine whether the population correlation coefficient is significantly different from zero c Badeaux Brothers has been using a simple linear regression equation to predict the cost of shipping various items Would you recommend it use a second-order polynomial model instead? Is the second-order polynomial model a significant improvement on the simple linear regression equation? d Badeaux Brothers has made a decision to stop shipping products if the shipping charges exceed $100 The company has asked you to determine the maximum weight for future shipments Do this for both the first- and second-order models you have developed 15-35 The National Association of Theatre Owners is the largest exhibition trade organization in the world, representing more than 26,000 movie screens in all 50 states and in more than 20 countries worldwide Its membership includes the largest cinema chains and hundreds of independent theater owners It publishes statistics concerning the movie sector of the economy The file entitled Flicks contains data on total U.S box | Multiple Regression Analysis and Model Building 677 office grosses ($billion), total number of admissions (billion), average U.S ticket price ($), and number of movie screens One concern is the effect the increasing ticket prices have on the number of individuals who go to the theaters to view movies a Construct a scatter plot depicting the relationship between the total number of admissions and the U.S ticket price b Determine the order of the polynomial that is suggested by the scatter plot produced in part a Obtain the estimated regression equation of the polynomial selected c Examine the p-value associated with the F-test for the polynomial you have selected in part a Relate these results to those of the t-tests for the individual parameters and the adjusted coefficient of determination To what is this attributed? 
d Conduct t-tests to remove higher order components until no components can be removed 15-36 The Energy Information Administration (EIA), created by Congress in 1977, is a statistical agency of the U.S Department of Energy It provides data, forecasts, and analyses to promote sound policymaking and public understanding regarding energy and its interaction with the economy and the environment One of the most important areas of analysis is petroleum The file entitled Crude contains data for the period 1991–2006 concerning the price, supply, and demand for fuel One concern has been the increase in imported oil into the United States a Examine the relationship between price of gasoline and the annual amount of imported crude oil Construct a scatter plot depicting this relationship b Determine the order of the polynomial that would fit the data displayed in part a Express “Imports” in millions of gallons, i.e., 3,146,454/1,000,000  3.146454 Produce an estimate of this polynomial c Is a linear or quadratic model more appropriate for predicting the price of gasoline using the annual quantity of imported oil? Conduct the appropriate hypothesis test to substantiate your answer 15-37 The National Association of Realtors Existing-Home Sales Series provides a measurement of the residential real estate market One of the measurements it produces is the Housing Affordability Index (HAI) It is a measure of the financial ability of U.S families to buy a house A value of 100 means that families earning the national median income have just the amount of money needed to qualify for a mortgage on a median-priced home; higher than 100 means they have more than enough and lower than 100 means they have less than enough The file entitled INDEX contains the HAI and associated variables a Construct a second order of the polynomial relating the HAI to the median family income b Conduct a test of hypothesis to determine if this polynomial is useful in predicting the HAI Use a p-value approach and a significance level of 0.01 (217) 678 CHAPTER 15 | Multiple Regression Analysis and Model Building c Determine if monthly principle and interest interacts with the relationship between the HAI and the median family income indicated in part b Use a significance level of 0.01 Hint: You must produce another regression equation with the interaction terms inserted You must then use the appropriate test to determine if the interaction terms belong in this latter model 15.38 An investment analyst collected data of 20 randomly chosen companies The data consisted of the 52-week-high stock prices, PE ratio, and the market value of the company These data are in the file entitled INVESTMENT The analyst wishes to produce a regression equation to predict the market value using the 52-week-high stock price and the PE ratio of the company He creates a complete second-degree polynomial a Construct an estimate of the regression equation using the indicated variables b Determine if any of the quadratic terms are useful in predicting the average market value Use a p-value approach with a significance level of 0.10 c Determine if any of the PE ratio terms are useful in predicting the average market value Use a test statistic approach with a significance level of 0.05 END EXERCISES 15-3 Chapter Outcome 15.4 Stepwise Regression One option in regression analysis is to bring all possible independent variables into the model in one step This is what we have done in the previous sections We use the term full regression to describe this approach 
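For reference, a full regression is a single fit with every candidate independent variable entered at once. A minimal sketch, assuming a hypothetical file candidates.csv containing a dependent variable y and candidate predictors x1 through x4 (names are placeholders, not one of the chapter's data files):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("candidates.csv")            # hypothetical candidate-variable data
X = sm.add_constant(df[["x1", "x2", "x3", "x4"]])
full_model = sm.OLS(df["y"], X).fit()         # all predictors enter in one step
print(full_model.summary())                   # t-tests for every coefficient at once
```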
Another method for developing a regression model is called stepwise regression Stepwise regression, as the name implies, develops the least squares regression equation in steps, either through forward selection, backward elimination, or standard stepwise regression Forward Selection Coefficient of Partial Determination The measure of the marginal contribution of each independent variable, given that other independent variables are in the model The forward selection procedure begins (Step 1) by selecting a single independent variable from all those available The independent variable selected at Step is the variable that is most highly correlated with the dependent variable A t-test is used to determine if this variable explains a significant amount of the variation in the dependent variable At Step 1, if the variable does explain a significant amount of the dependent variable’s variation, it is selected to be part of the final model used to predict the dependent variable If it does not, the process is terminated If no variables are found to be significant, the researcher will have to search for different independent variables than the ones already tested In the next step (Step 2), a second independent variable is selected based on its ability to explain the remaining unexplained variation in the dependent variable The independent variable selected in the second, and each subsequent, step is the variable with the highest coefficient of partial determination Recall that the coefficient of determination (R2) measures the proportion of variation explained by all of the independent variables in the model Thus, after the first variable (say, x1) is selected, R2 will indicate the percentage of variation explained by this variable The forward selection routine will then compute all possible two-variable regression models, with x1 included, and determine the R2 for each model The coefficient of partial determination at Step is the proportion of the as yet unexplained variation (after x1 is in the model) that is explained by the additional variable The independent variable that adds the most to R2, given the variable(s) already in the model, is the one selected Then, a t-test is conducted to determine if the newly added variable is significant This process continues until either all independent variables have been entered or the remaining independent variables not add appreciably to R2 For the forward selection procedure, the model begins with no variables Variables are entered one at a time, and after a variable is entered, it cannot be removed (218) CHAPTER 15 | Multiple Regression Analysis and Model Building 679 Backward Elimination Backward elimination is just the reverse of the forward selection procedure In the backward elimination procedure, all variables are forced into the model to begin the process Variables are removed one at a time until no more insignificant variables are found Once a variable has been removed from the model, it cannot be re-entered EXAMPLE 15-3 APPLYING FORWARD SELECTION STEPWISE REGRESSION ANALYSIS B T Longmont Company The B T Longmont Company operates a large retail department store in Macon, Georgia Like other department stores, Longmont has incurred heavy losses due to shoplifting and employee pilferage The store’s security manager wants to develop a regression model to explain the monthly dollar loss The following steps can be used when developing a multiple regression model using stepwise regression: Step Specify the model by determining the dependent variable and 
potential independent variables. The dependent variable (y) is the monthly dollar loss due to shoplifting and pilferage. The security manager has identified the following potential independent variables:

x1 = Average monthly temperature (degrees Fahrenheit)
x2 = Number of sales transactions
x3 = Dummy variable for holiday month (1 if holiday during month, 0 if not)
x4 = Number of persons on the store's monthly payroll

The data are contained in the file Longmont.

Step 2 Formulate the regression model.
The correlation matrix for the data is presented in Figure 15.26. The forward selection procedure will select the independent variable most highly correlated with the dependent variable. By examining the bottom row in the correlation matrix in Figure 15.26, you can see the variable x2, number of sales transactions, is most highly correlated (r = 0.6307) with dollars lost. Once this variable is entered into the model, the remaining independent variables will be entered based on their ability to explain the remaining variation in the dependent variable.

FIGURE 15.26 | Excel 2007 Correlation Matrix Output for the B.T. Longmont Company
Excel 2007 Instructions: 1 Open file: Longmont.xls. 2 Select Data tab. 3 Select Data Analysis > Correlation. 4 Specify data range (include labels). 5 Click Labels. 6 Specify output location. 7 Click OK.

FIGURE 15.27A | Excel 2007 (PHStat) Forward Selection Results for the B.T. Longmont Company, Step 1 (number of sales transactions entered; R-squared = SSR/SST = 1,270,172/3,192,631; t = 3.1481; p-value = 0.0066)
Excel 2007 (PHStat) Instructions: 1 Open file: Longmont.xls. 2 Select Add-Ins. 3 Select PHStat. 4 Select Regression > Stepwise Regression. 5 Define data range for y variable and data range for x variables. 6 Check p-value criteria. 7 Select Forward Selection. 8 Click OK.

Figure 15.27a shows the PHStat stepwise regression output, and Figure 15.27b shows the Minitab output. At Step 1 of the process, variable x2, number of monthly sales transactions, enters the model.

Step 3 Perform diagnostic checks on the model.
Although PHStat does not provide R² or the standard error of the estimate directly, they can be computed from the output in the ANOVA section of the printout. Recall from Chapter 14 that R² is computed as

R² = SSR/SST = 1,270,172.193/3,192,631.529 = 0.398

This single independent variable explains 39.8% (R² = 0.398) of the variation in the dependent variable. The standard error of the estimate is the square root of the mean square residual:

s = √MSE = √128,163.96 = 358

Now at Step 1, we test the following:

H0: β2 = 0 (slope for variable x2 = 0)
HA: β2 ≠ 0
α = 0.05

As shown in Figure 15.27a, the calculated t-value is 3.1481. We compare this to the critical value from the t-distribution for α/2 = 0.05/2 = 0.025 and degrees of freedom equal to n − k − 1 = 17 − 1 − 1 = 15.

FIGURE 15.27B | Minitab Forward Selection Results for the B.T. Longmont Company, Step 1 (variable x2 has entered the model; t-value = 3.15; p-value = 0.007; s = √MS Residual = 358; R² = SSR/SST = 1,270,172.193/3,192,631.529 = 0.3978)
Minitab Instructions: 1 Open file: Longmont.MTW. 2 Choose Stat → Regression → Stepwise. 3 In Response, enter the dependent variable (y); in Predictors, enter the independent variables (x). 4 Select Methods. 5 Select Forward selection; enter 0.05 in Alpha to enter (or 4 in F to enter). 6 Click OK.

This critical value is t0.025 = 2.1315. Because t = 3.1481 > 2.1315, we reject the null hypothesis and conclude that the regression slope coefficient for the
variable, number of sales transactions, is not zero Note also, because the p-value  0.0066 a  0.05 we would reject the null hypothesis.9 9Some authors use an F-distribution to perform these tests This is possible since squaring a random variable that has a t-distribution produces a random variable that has an F-distribution with one degree of freedom in the numerator and the same number of degrees as the t-distribution in the denominator (221) 682 CHAPTER 15 | Multiple Regression Analysis and Model Building Step Continue to formulate and diagnose the model by adding other independent variables The next variable to be selected will be the one that can the most to increase R2 If you were doing this manually, you would try each variable to see which one yields the highest R2, given that the transactions variable is already in the model Both the PHStat add-in software and Minitab this automatically As shown in Figure 15.27b and Figure 15.28, the variable selected in Step of the process is x4, number of employees Using the ANOVA section, we can determine R2 and s as before R2  SSR 1, 833, 270.524   0.5742 and SST 3,192, 631.529 s  MS residual  97, 097.22  311.6 The model now explains 57.42% of the variation in the dependent variable The t-values for both slope coefficients exceed t  2.145 (the critical value from the t-distribution table with a one-tailed area equal to 0.025 and 17    14 degrees of freedom), so we conclude that both variables are significant in explaining the variation in the dependent variable, shoplifting loss The forward selection routine continues to enter variables as long as each additional variable explains a significant amount of the remaining variation in the dependent variable Note that PHStat allows you to set the significance level in terms of a p-value or in terms of the t statistic Then, as long as the calculated p-value for an incoming variable is less than your limit, the variable is allowed to enter the model Likewise, if the calculated t-statistic exceeds your t limit, the variable is allowed to enter In this example, with the p-value limit set at 0.05, neither of the two remaining independent variables would explain a significant amount of the remaining variation in the dependent variable The procedure is, therefore, terminated The resulting regression equation provided by forward selection is yˆ  4600.8  0.203x2  21.57 x4 FIGURE 15.28 | PHStat Forward Selection Results for the B.T Longmont Company—Step (222) CHAPTER 15 | Multiple Regression Analysis and Model Building 683 Note that the dummy variables for holiday and temperature did not enter the model This implies that, given the other variables are already included, knowing whether the month in question has a holiday or knowing its average temperature does not add significantly to the model’s ability to explain the variation in the dependent variable The B.T Longmont Company can now use this regression model to explain variation in shoplifting and pilferage losses based on knowing the number of sales transactions and the number of employees >>END EXAMPLE TRY PROBLEM 15-40 (pg 677) Standard Stepwise Regression The standard stepwise procedure (sometimes referred to as forward stepwise regression—not to be confused with forward selection) combines attributes of both backward elimination and forward selection The standard stepwise method serves one more important function If two or more independent variables are correlated, a variable selected in an early step may become insignificant when other variables are 
added at later steps The standard stepwise procedure will drop this insignificant variable from the model Standard stepwise regression also offers a means of observing multicollinearity problems, because we can see how the regression model changes as each new variable is added to it The standard stepwise procedure is widely used in decision-making applications and is generally recognized as a useful regression method However, care should be exercised when using this procedure because it is easy to rely too heavily on the automatic selection process Remember, the order of variable selection is conditional, based on the variables already in the model There is no guarantee that stepwise regression will lead you to the best set of independent variables from those available Decision makers still must use common sense in applying regression analysis to make sure they have usable regression models Best Subsets Regression Another method for developing multiple regression models is called the best subsets method As the name implies, the best subsets method works by selecting subsets from the chosen possible independent variables to form models The user can then select the “best” model based on such measures as R-squared or the standard error of the estimate Both Minitab and PHStat contain procedures for performing best subsets regression EXAMPLE 15-4 APPLYING BEST SUBSETS REGRESSION Winston Investment Advisors Charles L Winston, founder and CEO at Winston Investment Advisors in Burbank, California, is interested in developing a regression model to explain the variation in dividends paid per share by U.S companies Such a model would be useful in advising his clients The following steps show how to develop such a model using the best subsets regression approach: Step Specify the model Some publicly traded companies pay higher dividends than others The CEO is interested in developing a multiple regression model to explain the variation in dividends per share paid to shareholders The dependent variable will be dividends per share The CEO met with other analysts in his firm to identify (223) 684 CHAPTER 15 | Multiple Regression Analysis and Model Building potential independent variables for which data would be readily available The following list of potential independent variables was selected: x1  Return on equity (net income/equity) x2  Earnings per share x3  Current assets in millions of dollars x4  Year-ending stock price x5  Current ratio (current assets/current liabilities) A random sample of 35 publicly traded U.S companies was selected For each company in the sample, the analysis obtained data on the dividend per share paid last year and the year-ending data on the five independent variables These data are contained in the data file Company Financials Step Formulate the regression model The CEO is interested in developing the “best” regression model for explaining the variation in the dependent variable, dividends per share The approach taken is to use best subsets, which requires that multiple regression models be developed, each containing a different mix of variables The models tried will contain from one to five independent variables The resulting models will be evaluated by comparing values for R-squared and the standard error of the estimate High R-squares and low standard errors are desirable Another statistic, identified as Cp, is sometimes used to evaluate regression models This statistic measures the difference between the estimated model and the true population model Values of Cp 
close to p  k  (k is the number of independent variables in the model) are preferred Both the PHStat Excel add-ins and Minitab can be used to perform best subsets regression analysis Figure 15.29 shows output from PHStat Notice that all possible combinations of models with k  to k  independent variables are included Several models appear to be good candidates based on R-squared, adjusted R-squared, standard error of the estimate, and Cp values These are as follows: Model Cp k1 R-square Adj R-square Std Error x1 x2 x3 x4 x1 x2 x3 x4 x5 x1 x2 x3 x5 x2 x3 x2 x3 x4 x2 x3 x4 x5 x2 x3 x5 4.0 6.0 5.0 1.4 2.5 4.5 3.4 5 0.628 0.629 0.615 0.610 0.622 0.622 0.610 0.579 0.565 0.564 0.586 0.585 0.572 0.573 0.496 0.505 0.505 0.492 0.493 0.500 0.500 There is little difference in these seven models in terms of the statistics shown We can examine any of them in more detail by looking at further PHStat output For instance, the model containing variables x1, x2, x3, and x5 is shown in Figure 15.30 Note that although this model is among the best with respect to R-squared, adjusted R-squared, standard error of the estimate, and Cp value, two of the four variables (x1  ROE and x5  Current ratio) have statistically insignificant regression coefficients Figure 15.31 shows the regression model with the two statistically significant variables remaining The R-squared value is 0.61, the adjusted R2 has increased, and the overall model is statistically significant >>END EXAMPLE TRY PROBLEM 15-43 (pg 687) (224) CHAPTER 15 FIGURE 15.29 | Multiple Regression Analysis and Model Building | Best Subsets Regression Output for Winston Investment Advisors Excel 2007 (PHStat) Instructions: Open file: Company Financials.xls Select Add-Ins Select PHStat Select Regression > Best Subsets Regression Define data range for y variable and data range for x variables Click OK 685 (225) 686 CHAPTER 15 FIGURE 15.30 | Multiple Regression Analysis and Model Building | One Potential Model for Winston Investment Advisors FIGURE 15.31 | A Final Model for Winston Financial Advisors (226) CHAPTER 15 | Multiple Regression Analysis and Model Building 687 MyStatLab 15-4: Exercises Skill Development 15-39 Suppose you have four potential independent variables, x1, x2, x3, and x4, from which you want to develop a multiple regression model Using stepwise regression, x2 and x4 entered the model a Why did only two variables enter the model? Discuss b Suppose a full regression with only variables x2 and x4 had been run Would the resulting model be different from the stepwise model that included only these two variables? Discuss c Now, suppose a full regression model had been developed, with all four independent variables in the model Which would have the higher R2 value, the full regression model or the stepwise model? Discuss 15-40 You are given the following set of data: y x1 x2 x3 33 44 34 60 20 30 11 10 13 11 192 397 235 345 245 235 40 47 37 61 23 35 y x1 x2 x3 45 25 53 45 37 44 12 10 13 11 13 296 235 295 335 243 413 52 27 57 50 41 51 a Determine the appropriate correlation matrix and use it to predict which variable will enter in the first step of a stepwise regression model b Use standard stepwise regression to construct a model, entering all significant variables c Construct a full regression model What are the differences in the model? Which model explains the most variation in the dependent variable? 
15-41 You are given the following set of data: y x1 x2 x3 y x1 x2 x3 45 41 43 38 50 39 50 40 31 45 43 42 48 44 41 41 49 41 42 40 44 39 35 39 41 51 42 41 45 43 34 49 45 40 43 42 37 40 35 39 43 53 39 52 47 44 40 30 34 37 41 40 44 45 42 34 a Determine the appropriate correlation matrix and use it to predict which variable will enter in the first step of a stepwise regression model b Use standard stepwise regression to construct a model, entering all significant variables c Construct a full regression model What are the differences in the model? Which model explains the most variation in the dependent variable? 15-42 The following data represent a dependent variable and four independent variables: y x1 x2 x3 x4 y x1 x2 61 37 22 48 66 37 25 23 12 34 13 15 10 11 21 32 18 30 33 69 24 68 65 45 35 23 35 37 30 19 14 17 11 x3 17 24 x4 23 31 33 19 31 a Use the standard stepwise regression to produce an estimate of a multiple regression model to predict y Use 0.15 as the alpha to enter and to remove b Change the alpha to enter to 0.01 Repeat the standard stepwise procedure c Change the alpha to remove to 0.35, leaving alpha to enter to be 0.15 Repeat the standard stepwise procedure d Change the alpha to remove to 0.15, leaving alpha to enter to be 0.05 Repeat the standard stepwise procedure e Change the alpha to remove and to enter to 0.35 Repeat the standard stepwise procedure f Compare the estimated regression equations developed in parts a–e 15-43 Consider the following set of data: y x1 x2 x3 x4 y x1 x2 x3 x4 61 37 22 48 66 37 25 23 12 34 18 12 14 13 10 15 25 69 24 68 65 45 35 23 35 37 30 21 15 19 12 3 20 14 19 12 a Use standard stepwise regression to produce an estimate of a multiple regression model to predict y b Use forward selection stepwise regression to produce an estimate of a multiple regression model to predict y (227) 688 CHAPTER 15 | Multiple Regression Analysis and Model Building c Use backwards elimination stepwise regression to produce an estimate of a multiple regression model to predict y d Use best subsets regression to produce an estimate of a multiple regression model to predict y Computer Database Exercises 15-44 The U.S Energy Information Administration publishes summary statistics concerning the energy sector of the U.S economy The electric power industry continues to grow Electricity generation and sales rose to record levels The file entitled Energy presents the revenue from retail sales ($million) and the net generation by energy source for the period 1993–2004 a Produce the correlation matrix of all the variables Predict the variables that will remain in the estimated regression equation if standard stepwise regression is used to predict the revenues from retail sales of energy b Use standard stepwise regression to develop an estimate of a model that is to predict the revenue from retail sales of energy ($million) c Compare the results of parts a and b Explain any difference between the two models 15-45 The Western State Tourist Association gives out pamphlets, maps, and other tourist-related information to people who call a toll-free number and request the information The association orders the packets of information from a document-printing company and likes to have enough available to meet the immediate need without having too many sitting around taking up space The marketing manager decided to develop a multiple regression model to be used in predicting the number of calls that will be received in the coming week A random sample of 12 weeks is selected, with the 
following variables: y  Number of calls x1  Number of advertisements placed the previous week x2  Number of calls received the previous week x3  Number of airline tour bookings into Western cities for the current week These data are in the data file called Western States a Develop the multiple regression model for predicting the number of calls received, using backward elimination stepwise regression b At the final step of the analysis, how many variables are in the model? c Discuss why the variables were removed from the model in the order shown by the stepwise regression 15-46 Refer to Problem 15-45 a Develop the correlation matrix that includes all independent variables and the dependent variable Predict the order that the variables will be selected into the model if forward selection stepwise regression is used b Use forward selection stepwise regression to develop a model for predicting the number of calls that the company will receive Write a report that describes what has taken place at each step of the regression process c Compare the forward selection stepwise regression results in part b with the backward elimination results determined in Problem 15-45 Which model would you choose? Explain your answer 15-47 An investment analyst collected data of 20 randomly chosen companies The data consisted of the 52-weekhigh stock prices, PE ratios, and the market values of the companies These data are in the file entitled Investment The analyst wishes to produce a regression equation to predict the market value using the 52-week-high stock price and the PE ratio of the company He creates a complete second-degree polynomial a Produce two scatter plots: (1) market value versus stock price and (2) market value versus PE ratio Do the scatter plots support the analyst’s decision to produce a second-order polynomial? Support your assertion with statistical reasoning b Use forward selection stepwise regression to eliminate any unneeded components from the analyst’s model c Does forward selection stepwise regression support the analyst’s decision to produce a second-order polynomial? 
Support your assertion with statistical reasoning 15-48 A variety of sources suggest that individuals assess their health, at least in part, by estimating their percentage of body fat A widely accepted measure of body fat uses an underwater weighing technique There are, however, more convenient methods using only a scale and a measuring tape An article in the Journal of Statistics Education by Roger W Johnson explored regression models to predict body fat The file entitled Bodyfat lists a portion of the data presented in the cited article a Use best subsets stepwise regression to establish the relationship between body fat and the variables in the specified file b Predict the body fat of an individual whose age is 21, weight is 170 pounds, height is 70 inches, chest circumference is 100 centimeters, abdomen is 90 centimeters, hip is 105 centimeters, and thigh is 60 centimeters around 15-49 The consumer price index (CPI) is a measure of the average change in prices over time in a fixed market basket of goods and services typically purchased by consumers One of the items in this market basket that affects the CPI is the price of oil and its derivatives The file entitled Consumer contains the price of the derivatives of oil and the CPI adjusted to 2005 levels (228) CHAPTER 15 | Multiple Regression Analysis and Model Building a Use backward elimination stepwise regression to determine which combination of the oil derivative prices drive the CPI If you encounter difficulties in completing this task, explain what caused the difficulties 689 b Eliminate the source of the difficulties in part a by producing a correlation matrix to determine where the difficulty lies c Delete one of the variables indicated in part b and complete the instructions in part a END EXERCISES 15-4 15.5 Determining the Aptness of the Model In Section 15.1 we discussed the basic steps involved in building a multiple regression model These are as follows: Specify the model Build the model Perform diagnostic checks on the model The final step is the diagnostic step in which you examine the model to determine how well it performs In Section 15.2, we discussed several statistics that you need to consider when performing the diagnostic step, including analyzing R2, adjusted R2, and the standard error of the estimate In addition, we discussed the concept of multicollinearity and the impacts that can occur when multicollinearity is present Section 15.3 introduced another diagnostic step that involves looking for potential curvilinear relationships between the independent variables and the dependent variable We presented some basic data transformation techniques for dealing with curvilinear situations However, a major part of the diagnostic process involves an analysis of how well the model fits the regression analysis assumptions The assumptions required to use multiple regression include the following: Assumptions Individual model errors, , are statistically independent of one another, and these values represent a random sample from the population of possible residuals at each level of x For a given value of x there can exist many values of y, and therefore many possible values for  Further, the distribution of possible -values for any level of x is normally distributed The distributions of possible -values have equal variances at each level of x The means of the dependent variable, y, for all specified values of x can be connected with a line called the population regression model The degree to which a regression model satisfies 
these assumptions is called aptness.

Analysis of Residuals

Residual The difference between the actual value of the dependent variable and the value predicted by the regression model.

The residual is computed using Equation 15.13.

Residual

ei = yi − ŷi    (15.13)

A residual value can be computed for each observation in the data set. A great deal can be learned about the aptness of the regression model by analyzing the residuals. The principal means of residual analysis is a study of residual plots. The following problems can be detected through graphical analysis of residuals:

1. The regression function is not linear.
2. The residuals do not have a constant variance.
3. The residuals are not independent.
4. The residual terms are not normally distributed.

We will address each of these in order. The regression options in both Minitab and Excel provide extensive residual analysis.

Checking for Linearity
A plot of the residuals (on the vertical axis) against the independent variable (on the horizontal axis) is useful for detecting whether a linear function is the appropriate regression function. Figure 15.32 illustrates two different residual plots. Figure 15.32a shows residuals that systematically depart from 0: when x is small, the residuals are negative; when x is in the midrange, the residuals are positive; and for large x-values, the residuals are negative again. This type of plot suggests that the relationship between y and x is nonlinear. Figure 15.32b shows a plot in which the residuals do not show a systematic variation around 0, implying that the relationship between x and y is linear. If a linear model is appropriate, we expect the residuals to band around 0 with no systematic pattern displayed. If the residual plot shows a systematic pattern, it may be possible to transform the independent variable (refer to Section 15.3) so that the revised model will produce residual plots that do not systematically vary from 0.

FIGURE 15.32 | Residual Plots Showing Linear and Nonlinear Patterns: (a) nonlinear pattern; (b) linear pattern (residuals plotted against x in both panels).
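A residual-versus-predictor plot like Figure 15.32 takes only a few lines to produce. The sketch below uses small synthetic data generated in the code (not data from the chapter's files) in which the true relationship is curved, so the residuals from a straight-line fit show the systematic arch of Figure 15.32a.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Synthetic illustration: a truly curved relationship fit with a straight line
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x - 0.4 * x**2 + rng.normal(0, 1, size=x.size)
df = pd.DataFrame({"x": x, "y": y})

model = sm.OLS(df["y"], sm.add_constant(df[["x"]])).fit()

plt.scatter(df["x"], model.resid)   # systematic arch signals a nonlinear relationship
plt.axhline(0, linestyle="--")      # residuals should band randomly around 0
plt.xlabel("x")
plt.ylabel("Residual")
plt.title("Residuals versus x")
plt.show()
```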
Do the Residuals Have Equal Variances at All Levels of Each x Variable? Residual plots also can be used to determine whether the residuals have a constant variance. Consider Figure 15.35, in which the residuals are plotted against an independent variable. The plot in Figure 15.35a shows an example in which, as x increases, the residuals become less variable. Figure 15.35b shows the opposite situation: when x is small, the residuals are tightly packed around 0, but as x increases the residuals become more variable. Figure 15.35c shows an example in which the residuals exhibit a constant variance around the zero mean.

FIGURE 15.35 | Residual Plots Showing Constant and Nonconstant Variances: (a) Variance Decreases as x Increases; (b) Variance Increases as x Increases; (c) Equal Variance

When a multiple regression model has been developed, we can analyze the equal variance assumption by plotting the residuals against the fitted (ŷ) values. When the residual plot is cone-shaped, as in Figure 15.36, it suggests that the assumption of equal variance has been violated. This is evident because the residuals are wider on one end than the other, which indicates that the standard error of the estimate is larger on one end than the other; that is, it is not constant.

FIGURE 15.36 | Residual Plots against the Fitted (ŷ) Values: (a) Variance Not Constant; (b) Variance Not Constant

Figure 15.37 shows the residuals plotted against the ŷ-values for First City Real Estate's appraisal model. We have drawn a band around the residuals to show that the variance of the residuals stays quite constant over the range of the fitted values.

FIGURE 15.37 | Minitab Plot of Residuals versus Fitted Values for First City Real Estate (Minitab instructions: 1. Open file First City-3.MTW. 2. Choose Stat > Regression > Regression. 3. In Response, enter the dependent (y) variable. 4. In Predictors, enter the independent (x) variables. 5. Choose Graphs; under Residual Plots, select Residuals versus fits. 6. Click OK. 7. Click OK.)
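The residuals-versus-fitted plot used in Figures 15.36 and 15.37 takes only a few lines of Python. This minimal sketch again assumes the hypothetical first_city.csv and column names introduced in the earlier sketch; a cone-shaped scatter would suggest nonconstant variance.

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv("first_city.csv")  # hypothetical file, as in the earlier sketch
X = sm.add_constant(df[["square_feet", "bedrooms", "garage", "log_lot"]])
model = sm.OLS(df["price"], X).fit()

# Residuals against fitted (y-hat) values; look for a widening or narrowing band
plt.scatter(model.fittedvalues, model.resid, s=12)
plt.axhline(0, color="gray", linewidth=1)
plt.xlabel("Fitted value (y-hat)")
plt.ylabel("Residual")
plt.show()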
Are the Residuals Independent? If the data used to develop the regression model are measured over time, a plot of the residuals against time is used to determine whether the residuals are correlated. Figure 15.38a shows an example in which the residual plot against time suggests independence: the residuals appear to be randomly distributed around the mean of zero over time. However, in Figure 15.38b, the plot suggests that the residuals are not independent, because in the early time periods the residuals are negative and in later time periods the residuals are positive. This, or any other nonrandom pattern in the residuals over time, indicates that the assumption of independent residuals has been violated. Generally, this means some variable associated with the passage of time has been omitted from the model. Often, time is used as a surrogate for other time-related variables in a regression model. Chapter 16 will discuss time-series data analysis and forecasting techniques in more detail and will address the issue of incorporating the time variable into the model. In Chapter 16, we also introduce a procedure called the Durbin-Watson test to determine whether residuals are correlated over time.

FIGURE 15.38 | Plot of Residuals against Time: (a) Independent Residuals; (b) Residuals Not Independent
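For time-ordered data, the kind of plot shown in Figure 15.38, along with a first look at autocorrelation, can be produced as follows. This is a minimal sketch assuming a hypothetical, time-ordered file sales.csv with columns sales, advertising, and price; the Durbin-Watson statistic it prints is discussed formally in Chapter 16 (values near 2 suggest uncorrelated residuals).

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("sales.csv")  # hypothetical data, rows assumed to be in time order
X = sm.add_constant(df[["advertising", "price"]])  # hypothetical predictors
model = sm.OLS(df["sales"], X).fit()

# Residuals in time order; a run of negatives followed by positives signals dependence
plt.plot(range(1, len(model.resid) + 1), model.resid, marker="o")
plt.axhline(0, color="gray", linewidth=1)
plt.xlabel("Time order")
plt.ylabel("Residual")
plt.show()

print("Durbin-Watson:", durbin_watson(model.resid))  # near 2 => little autocorrelation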
Checking for Normally Distributed Error Terms The need for normally distributed model errors arises when we want to test a hypothesis about the regression model. Small departures from normality do not cause serious problems. However, if the model errors depart dramatically from a normal distribution, there is cause for concern, and examining the sample residuals will allow us to detect such dramatic departures. One method for graphically analyzing the residuals is to form a frequency histogram of the residuals to determine whether the general shape is normal. The chi-square goodness-of-fit test presented in Chapter 13 can also be used to test whether the residuals fit a normal distribution.

Another method for determining normality is to calculate and plot the standardized residuals. In Chapter 3 you learned that a random variable is standardized by subtracting its mean and dividing the result by its standard deviation. The mean of the residuals is zero. Therefore, dividing each residual by an estimate of its standard deviation gives the standardized residual.¹⁰ Although the proof is beyond the scope of this text, it can be shown that the standardized residual for any particular observation in a simple linear regression model is found using Equation 15.14.

Standardized Residual for Linear Regression
Standardized residual = ei / ( s √[ 1 − 1/n − (xi − x̄)² / ( Σx² − (Σx)²/n ) ] )     (15.14)

where:
ei = ith residual value
s = Standard error of the estimate
xi = Value of x used to generate the predicted y-value for the ith observation

Computing the standardized residual for an observation in a multiple regression model is too complicated to be done by hand. However, the standardized residuals are generated by most statistical software, including Minitab and Excel. The Excel and Minitab tutorials illustrate the methods required to generate the standardized residuals and residual plots. Because other problems, such as nonconstant variance and nonindependent residuals, can result in residuals that seem to be abnormal, you should check these other factors before addressing the normality assumption.

Recall that for a normal distribution, approximately 68% of the values will fall within 1 standard deviation of the mean, 95% will fall within 2 standard deviations of the mean, and virtually all values will fall within 3 standard deviations of the mean. Figure 15.39 illustrates the histogram of the residuals for the First City Real Estate example. The distribution of residuals looks to be close to a normal distribution. Figure 15.40 shows the histogram for the standardized residuals, which has the same basic shape as the residual distribution in Figure 15.39.

FIGURE 15.39 | Minitab Histogram of Residuals for First City Real Estate (Minitab instructions: 1. Open file First City-3.MTW. 2. Choose Stat > Regression > Regression. 3. In Response, enter the dependent (y) variable. 4. In Predictors, enter the independent (x) variables. 5. Choose Graphs; under Residual Plots, select Histogram of residuals. 6. Click OK. 7. Click OK.)

FIGURE 15.40 | Minitab Histogram of Standardized Residuals for First City Real Estate (Minitab instructions: as for Figure 15.39, but under Residuals for Plots, select Standardized.)

¹⁰The standardized residual is also referred to as the studentized residual.

Another approach for checking the normality of the residuals is to form a probability plot. We start by arranging the residuals in numerical order from smallest to largest. The standardized residuals are plotted on the horizontal axis, and the corresponding expected value for each standardized residual is plotted on the vertical axis. Although we won't delve into how the expected value is computed, you can examine the normal probability plot to see whether it forms a straight line. The closer the plot is to linear, the closer the residuals are to being normally distributed. Figure 15.41 shows the normal probability plot for the First City Real Estate Company example.

FIGURE 15.41 | Minitab Normal Probability Plot of Residuals for First City Real Estate (Minitab instructions: as for Figure 15.39, but under Residual Plots, select Normal plot of residuals.)

You should be aware that Minitab and Excel format their residual plots in slightly different ways. However, the same general information is conveyed, and you can look for the same signs of problems with the regression model.
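Equation 15.14 can be verified directly for a simple (one-x) regression, and the histogram and normal probability plot described above take only a few lines. The sketch below uses made-up data rather than the First City file; scipy's probplot function draws the normal probability plot.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(12)            # made-up illustrative data
x = rng.uniform(10, 50, size=40)
y = 5 + 2 * x + rng.normal(0, 4, size=40)

model = sm.OLS(y, sm.add_constant(x)).fit()
e = model.resid
n = len(x)
s = np.sqrt(model.mse_resid)               # standard error of the estimate

# Equation 15.14: standardized residual for simple linear regression
leverage = 1 / n + (x - x.mean()) ** 2 / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
std_resid = e / (s * np.sqrt(1 - leverage))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(std_resid, bins=8)                # roughly bell-shaped if errors are normal
ax1.set_xlabel("Standardized residual")
stats.probplot(std_resid, dist="norm", plot=ax2)  # near-linear plot suggests normality
plt.show()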
Corrective Actions If, based on analyzing the residuals, you decide the model constructed is not appropriate but you still want a regression-based model, some corrective action may be warranted. Three approaches may work: transform some of the existing independent variables, remove some variables from the model, or start over in the development of the regression model.

Earlier in this chapter, we discussed a basic approach to variable transformation. In general, transformations of the independent variables (such as raising x to a power, taking the square root of x, or taking the log of x) are used to make the data better conform to a linear relationship. If the model suffers from nonlinearity and the residuals have a nonconstant variance, you may want to transform both the independent and dependent variables. In cases in which the normality assumption is not satisfied, transforming the dependent variable is often useful; in many instances, a log transformation works. In some instances, a transformation involving the product of two independent variables will help. A more detailed discussion is beyond the scope of this text, but you can read more about this subject in the Kutner et al. reference listed at the end of the chapter.

The alternative of using a different regression model means that we respecify the model to include new independent variables or remove existing variables from the model. In most modeling situations, we are in a continual state of model respecification, always seeking to improve the regression model by finding new independent variables.
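To illustrate the transformation idea on a small scale, the following sketch fits y against both x and log x for made-up data with a curved relationship; in practice you would re-examine the residual plots, not just R², after each candidate transformation.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)             # made-up data with a logarithmic shape
x = rng.uniform(1, 100, size=50)
y = 10 + 8 * np.log(x) + rng.normal(0, 1.5, size=50)

linear = sm.OLS(y, sm.add_constant(x)).fit()
logged = sm.OLS(y, sm.add_constant(np.log(x))).fit()

print(f"R-squared, y on x:      {linear.rsquared:.3f}")
print(f"R-squared, y on log(x): {logged.rsquared:.3f}")  # transformed fit should be better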
MyStatLab

15-5: Exercises

Skill Development
15-50 Consider the following values for an independent and dependent variable:
x: 14, 18, 22, 27, 33
y: 20, 28, 30, 33, 35, 45
a. Determine the estimated linear regression equation relating the dependent and independent variables.
b. Is the regression equation you found significant? Test at the α = 0.05 level.
c. Determine both the residuals and standardized residuals. Is there anything about the residuals that would lead you to question whether the assumptions necessary to use regression analysis are satisfied? Discuss.

15-51 Consider the following values for an independent and dependent variable:
x: 14, 18, 22, 27, 33, 50, 61, 75
y: 20, 28, 15, 27, 31, 32, 60, 132, 160
a. Determine the estimated linear regression equation relating the dependent and independent variables.
b. Is the regression equation you found significant? Test at the α = 0.05 level.
c. Determine both the residuals and standardized residuals. Is there anything about the residuals that would lead you to question whether the assumptions necessary to use regression analysis are satisfied?

15-52 Examine the following data set:

y    x
25   10
35   10
14   10
45   20
52   20
41   20
65   30
63   30
68   30

a. Determine the estimated regression equation for this data set.
b. Calculate the residuals for this regression equation.
c. Produce the appropriate residual plot to determine if the linear function is the appropriate regression function for this data set.
d. Use a residual plot to determine if the residuals have a constant variance.
e. Produce a residual plot to determine if the residuals are independent. Assume the order of appearance is the time order of the data.
f. Use a probability plot to determine if the error terms are normally distributed.

15-53 Examine the following data set:
y:  25, 35, 14, 45, 52, 41, 65, 63, 68, 75
x1: 5, 5, 5, 25, 25, 25, 30, 30, 30, 40
x2: 25, 5, 5, 40, 25, 30, 30, 25, 30
a. Determine the estimated regression equation for this data set.
b. Calculate the residuals and the standardized residuals for this regression equation.
c. Produce the appropriate residual plot to determine if the linear function is the appropriate regression function for this data set.
d. Use a residual plot to determine if the residuals have a constant variance.
e. Produce the appropriate residual plot to determine if the residuals are independent.
f. Construct a probability plot to determine if the error terms are normally distributed.

Computer Database Exercises
15-54 Refer to Exercise 15-9, which referenced an article in BusinessWeek that presented a list of the 100 companies perceived as having "hot growth" characteristics. The file entitled Logrowth contains sales ($million), sales increase (%), return on capital, market value ($million), and recent stock price of the companies ranked from 81 to 100. In Exercise 15-9, a regression equation was constructed in which the sales of the companies were predicted using their market value.
a. Determine the estimated regression equation for this data set.
b. Calculate the residuals and the standardized residuals for this regression equation.
c. Produce the appropriate residual plot to determine if the linear function is the appropriate regression function for this data set.
d. Use a residual plot to determine if the residuals have a constant variance.
e. Produce the appropriate residual plot to determine if the residuals are independent. Assume the data were extracted in the order listed.
f. Construct a probability plot to determine if the error terms are normally distributed.

15-55 The White Cover Snowmobile Association promotes snowmobiling in both the Upper Midwest and the Rocky Mountain region. The industry has been affected in the West because of uncertainty associated with conflicting court rulings about the number of snowmobiles allowed in national parks. The Association advertises in outdoor- and tourist-related publications and then sends out pamphlets, maps, and other regional related information to people who call a toll-free number and request the information. The association orders the packets from a document-printing company and likes to have enough available to meet the immediate need without having too many sitting around taking up space. The marketing manager decided to develop a multiple regression model to be used in predicting the number of calls that will be received in the coming week. A random sample of 12 weeks is selected, with the following variables:
y = Number of calls
x1 = Number of advertisements placed the previous week
x2 = Number of calls received the previous week
x3 = Number of airline tour bookings into Western cities for the current week
The data are in the file called Winter Adventures.
a. Construct a multiple regression model using all three independent variables. Write a short report discussing the model.
b. Based on the appropriate residual plots, what can you conclude about the constant variance assumption? Discuss.
c. Based on the appropriate residual analysis, does it appear that the residuals are independent? Discuss.
d. Use an appropriate analysis of the residuals to determine whether the regression model meets the assumption of normally distributed error terms. Discuss.

15-56 The athletic director of State University is interested in developing a multiple regression model that might be used to explain the variation in attendance at football games at his school. A sample of 16 games was selected from home games played during the past 10 seasons. Data for the following factors were determined:
y = Game attendance
x1 = Team win/loss percentage to date
x2 = Opponent win/loss percentage to date
x3 = Games played this season
x4 = Temperature at game time
The sample data are in the file called Football.
a. Build a multiple regression model using all four independent variables. Write a short report that outlines the characteristics of this model.
b. Develop a table of residuals for this model. What is the average residual value? Why do you suppose it came out to this value? Discuss.
c. Based on the appropriate residual plot, what can you conclude about the constant variance assumption? Discuss.
d. Based on the appropriate residual analysis, does it appear that the model errors are independent? Discuss.
e. Can you conclude, based on the appropriate method of analysis, that the model error terms are approximately normally distributed?

15-57 The consumer price index (CPI) is a measure of the average change in prices over time in a fixed market basket of goods and services typically purchased by consumers. One of the items in this market basket that affects the CPI is the price of oil and its derivatives. The file entitled Consumer contains the price of the derivatives of oil and the CPI adjusted to 2005 levels. In Exercise 15-49, backward elimination stepwise regression was used to determine the relationship between the CPI and two independent variables: the price of heating oil and of diesel fuel.
a. Construct an estimate of the regression equation using the same variables.
b. Produce the appropriate residual plots to determine if the linear function is the appropriate regression function for this data set.
c. Use a residual plot to determine if the residuals have a constant variance.
d. Produce the appropriate residual plot to determine if the residuals are independent. Assume the data were extracted in the order listed.
e. Construct a probability plot to determine if the error terms are normally distributed.

15-58 In Exercise 15-48 you were asked to use best subsets stepwise regression to establish the relationship between body fat and the independent variables weight, abdomen circumference, and thigh circumference based on data in the file Bodyfat. This exercise is an extension of that one.
a. Construct an estimate of the regression equation using the same variables.
b. Produce the appropriate residual plots to determine if the linear function is the appropriate regression function for this data set.
c. Use a residual plot to determine if the residuals have a constant variance.
d. Produce the appropriate residual plot to determine if the residuals are independent. Assume the data were extracted in the order listed.
e. Construct a probability plot to determine if the error terms are normally distributed.
15-59 The National Association of Theatre Owners is the largest exhibition trade organization in the world, representing more than 26,000 movie screens in all 50 states and in more than 20 countries worldwide. Its membership includes the largest cinema chains and hundreds of independent theater owners. It publishes statistics concerning the movie sector of the economy. The file entitled Flicks contains data on total U.S. box office grosses ($billion), total number of admissions (billion), average U.S. ticket price ($), and number of movie screens.
a. Construct a regression equation in which total U.S. box office grosses are predicted using the other variables.
b. Produce the appropriate residual plots to determine if the linear function is the appropriate regression function for this data set.
c. Square each of the independent variables and add them to the model on which the regression equation in part a was built. Produce the new regression equation.
d. Use a residual plot to determine if the quadratic model in part c alleviates the problem identified in part b.
e. Construct a probability plot to determine if the error terms are normally distributed for the updated model.

END EXERCISES 15-5

Visual Summary

Chapter 15: Chapter 14 introduced linear regression, concentrating on analyzing a linear relationship between two variables. However, business problems are not limited to linear relationships involving only two variables; many situations involve linear and nonlinear relationships among three or more variables. This chapter introduces several extensions of the techniques covered in the last chapter, including multiple linear regression, incorporating qualitative variables in the regression model, working with nonlinear relationships, techniques for determining how well the model fits the data, and stepwise regression.

15.1 Introduction to Multiple Regression Analysis (pg 634–653)
Summary: Multiple linear regression analysis examines the relationship between a dependent variable and more than one independent variable. Determining the appropriate relationship starts with model specification, where the appropriate variables are determined, then moves to model building, followed by model diagnosis, where the quality of the model built is determined. The purpose of the model is to explain variation in the dependent variable, so useful independent variables are those highly correlated with the dependent variable. The percentage of variation in the dependent variable explained by the model is measured by the multiple coefficient of determination, R². The overall model can be tested for significance, as can the individual terms in the model. A common problem in multiple regression models occurs when the independent variables are highly correlated with one another; this is called multicollinearity (a quick programmatic check is sketched after the outcomes below).

Outcome 1. Understand the general concepts behind model building using multiple regression analysis.
Outcome 2. Apply multiple regression analysis to business decision-making situations.
Outcome 3. Analyze the computer output for a multiple regression model and interpret the regression results.
Outcome 4. Test hypotheses about the significance of a multiple regression model and test the significance of the independent variables in the model.
Outcome 5. Recognize potential problems when using multiple regression analysis and take steps to correct the problems.
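The multicollinearity check flagged in the summary above can be done programmatically: statsmodels computes the variance inflation factor (Equation 15.9) for each independent variable. A minimal sketch, assuming a hypothetical data file example.csv with predictor columns x1, x2, and x3:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("example.csv")              # hypothetical data file
X = sm.add_constant(df[["x1", "x2", "x3"]])  # hypothetical predictors

# VIF = 1 / (1 - R_j^2); values above about 5 suggest troublesome multicollinearity
for j, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, j))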
15.2 Using Qualitative Independent Variables (pg 654–661)
Summary: Independent variables are not always quantitative and ratio level. Important independent variables might include whether someone is married or not, owns her home or not, is recently employed, and the type of car owned. All of these are qualitative, not quantitative, variables, and they are incorporated into multiple regression analysis using dummy variables. Dummy variables are numerical codes, 0 or 1, depending on whether the observation has the indicated characteristic. Be careful to ensure you use one fewer dummy variable than the number of categories to avoid the dummy variable trap.

Outcome 6. Incorporate qualitative variables into a regression model by using dummy variables.

15.3 Working with Nonlinear Relationships (pg 661–668)
Summary: Sometimes business situations involve a nonlinear relationship between the dependent and independent variables. Regression models with nonlinear relationships become more complicated to build and analyze. Start by plotting the data to see the relationships between the dependent variable and each independent variable. Exponential or second- or third-order polynomial relationships are commonly found. Once the appropriate relationship is determined, the independent variable is modified accordingly and used in the model.

Outcome 7. Apply regression analysis to situations where the relationship between the independent variable(s) and the dependent variable is nonlinear.

15.4 Stepwise Regression (pg 678–689)
Summary: Stepwise regression develops the regression equation through forward selection, backward elimination, or standard stepwise regression. Forward selection begins by selecting the single independent variable that is most highly correlated with the dependent variable; additional variables are added to the model as long as they reduce a significant amount of the remaining variation in the dependent variable (a small sketch of this selection logic appears after the Conclusion below). Backward elimination starts with all variables in the model; variables are removed one at a time until no more insignificant variables are found. Standard stepwise is similar to forward selection; however, if two or more variables are correlated, a variable selected in an early step may become insignificant when other variables are added at later steps, and the standard stepwise procedure will drop this insignificant variable from the model.

Outcome 8. Understand the uses of stepwise regression.

15.5 Determining the Aptness of the Model (pg 689–699)
Summary: Determining the aptness of a model relies on an analysis of residuals, the difference between the observed value of the dependent variable and the value predicted by the model. The residuals should be randomly scattered about the regression line, with a normal distribution and constant variance. If a plot of the residuals indicates that any of the preceding does not hold, corrective action should be taken, which might involve transforming some independent variables, dropping some variables or adding new ones, or even starting over with the model-building process.

Outcome 9. Analyze the extent to which a regression model satisfies the regression assumptions.

Conclusion
Multiple regression uses two or more independent variables to explain the variation in the dependent variable. As a decision maker, you will generally not be required to manually develop the regression model, but you will have to judge its applicability based on a computer printout. Consequently, this chapter has largely involved an analysis of computer printouts. You no doubt will encounter printouts that look somewhat different from those shown in this text, and some of the terms used may differ slightly, but the Excel and Minitab software we have used are representative of the many software packages that are available.
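The forward selection logic summarized in Section 15.4 above can be prototyped in a few lines. This sketch is a simplified illustration, not the algorithm as implemented by Minitab or Excel add-ins; it assumes a pandas DataFrame whose columns are the candidate x variables plus a response column, and it enters variables by smallest p-value until none falls below the significance level.

import pandas as pd
import statsmodels.api as sm

def forward_select(df, response, alpha=0.05):
    remaining = [c for c in df.columns if c != response]
    chosen = []
    while remaining:
        # p-value of each candidate when added to the current model
        pvals = {}
        for c in remaining:
            X = sm.add_constant(df[chosen + [c]])
            pvals[c] = sm.OLS(df[response], X).fit().pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break  # no remaining variable explains a significant amount of variation
        chosen.append(best)
        remaining.remove(best)
    return sm.OLS(df[response], sm.add_constant(df[chosen])).fit()

# Usage (hypothetical data file and response column):
# model = forward_select(pd.read_csv("example.csv"), "y")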
Equations

(15.1) Population Multiple Regression Model (pg 634)
y = β0 + β1x1 + β2x2 + … + βkxk + ε

(15.2) Estimated Multiple Regression Model (pg 634)
ŷ = b0 + b1x1 + b2x2 + … + bkxk

(15.3) Correlation Coefficient (pg 638)
One x variable with y:  r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² Σ(y − ȳ)²]
One x variable with another x:  r = Σ(xi − x̄i)(xj − x̄j) / √[Σ(xi − x̄i)² Σ(xj − x̄j)²]

(15.4) Multiple Coefficient of Determination (R²) (pg 642)
R² = Sum of squares regression / Total sum of squares = SSR / SST

(15.5) F-Test Statistic (pg 643)
F = (SSR / k) / (SSE / (n − k − 1))

(15.6) Adjusted R-Squared (pg 644)
R-sq(adj) = R²A = 1 − (1 − R²)[(n − 1) / (n − k − 1)]

(15.7) t-Test for Significance of Each Regression Coefficient (pg 645)
t = bj / sbj,  df = n − k − 1

(15.8) Standard Error of the Estimate (pg 646)
sε = √[SSE / (n − k − 1)] = √MSE

(15.9) Variance Inflation Factor (pg 648)
VIF = 1 / (1 − R²j)

(15.10) Confidence Interval Estimate for the Regression Slope (pg 650)
bj ± t sbj

(15.11) Polynomial Population Regression Model (pg 662)
y = β0 + β1x + β2x² + … + βpx^p + ε

(15.12) Partial-F Test Statistic (pg 672)
F = [(SSER − SSEC) / (c − r)] / MSEC

(15.13) Residual (pg 689)
ei = yi − ŷi

(15.14) Standardized Residual for Linear Regression (pg 695)
Standardized residual = ei / ( s √[ 1 − 1/n − (xi − x̄)² / ( Σx² − (Σx)²/n ) ] )

Key Terms

Adjusted R-squared, pg 644
Coefficient of partial determination, pg 678
Composite model, pg 669
Correlation coefficient, pg 638
Correlation matrix, pg 638
Dummy variable, pg 654
Interaction, pg 669
Model, pg 636
Multicollinearity, pg 647
Multiple coefficient of determination (R²), pg 642
Regression hyperplane, pg 635
Residual, pg 689
Variance inflation factor (VIF), pg 648

Chapter Exercises (MyStatLab)

Conceptual Questions
15-60 Go to the library or use the Internet to locate three articles using a regression model with more than one independent variable. For each article write a short summary covering the following points:
1. Purpose for using the model
2. How the variables in the model were selected
3. How the data in the model were selected
4. Any possible violations of the needed assumptions
5. The conclusions drawn from using the model

15-61 Discuss in your own terms the similarities and differences between simple linear regression analysis and multiple regression analysis.

15-62 Discuss what is meant by the least squares criterion as it pertains to multiple regression analysis. Is the least squares criterion any different for simple regression analysis? Discuss.

15-63 List the basic assumptions of regression analysis and discuss in your own terms what each means.

15-64 What does it mean if we have developed a multiple regression model and have concluded that the model is apt?

15-65 Consider the following model:
ŷ = 3x1 + 5x2
a. Provide an interpretation of the coefficient of x1.
b. Is the interpretation provided in part a true regardless of the value of x2? Explain.
c. Now consider the model ŷ = 3x1 + 5x2 + 4x1x2. Holding x2 fixed at a given value, give an interpretation of the coefficient of x1.
d. Repeat part c for a different value of x2. Is the interpretation provided in part a true regardless of the value of x2? Explain.
e. Considering your answers to parts c and d, what type of regression components has conditional interpretations?
Computer Database Exercises
15-66 Amazon.com has become one of the most successful online merchants. Two measures of its success are sales and net income/loss figures. They are given here ($million):

Year | Net Income/Loss | Sales
1995 | −0.3 | 0.5
1996 | −5.7 | 15.7
1997 | −27.5 | 147.7
1998 | −124.5 | 609.8
1999 | −719.9 | 1,639.8
2000 | −1,411.2 | 2,761.9
2001 | −567.3 | 3,122.9
2002 | −149.1 | 3,933
2003 | 35.3 | 5,263.7
2004 | 588.5 | 6,921
2005 | 359 | 8,490
2006 | 190 | 10,711
2007 | 476 | 14,835

a. Produce a scatter plot for Amazon's net income/loss and sales figures for the period 1995 to 2007. Determine the order (or degree) of the polynomial that could be used to predict Amazon's net income/loss using sales figures for the period 1995 to 2007.
b. To simplify the analysis, consider only the values from 1995 to 2004. Produce the polynomial indicated by these data.
c. Test to determine whether the overall model from part b is statistically significant. Use a significance level of 0.10.
d. Conduct a hypothesis test to determine if curvature exists in the model that predicts Amazon's net income/loss using sales figures from part b. Use a significance level of 0.02 and the test statistic approach.

The following information applies to Exercises 15-67, 15-68, and 15-69: A publishing company in New York is attempting to develop a model that it can use to help predict textbook sales for books it is considering for future publication. The marketing department has collected data on several variables from a random sample of 15 books. These data are given in the file Textbooks.

15-67 Develop the correlation matrix showing the correlation between all possible pairs of variables. Test statistically to determine which independent variables are significantly correlated with the dependent variable, book sales. Use a significance level of 0.05.

15-68 Develop a multiple regression model containing all four independent variables. Show clearly the regression coefficients. Write a short report discussing the model. In your report make sure you cover the following issues:
a. How much of the total variation in book sales can be explained by these four independent variables? Would you conclude that the model is significant at the 0.05 level?
b. Develop a 95% confidence interval for each regression coefficient and interpret these confidence intervals.
c. Which of the independent variables can you conclude to be significant in explaining the variation in book sales? Test using α = 0.05.
d. How much of the variation in the dependent variable is explained by the independent variables? Is the model statistically significant at the α = 0.01 level? Discuss.
e. How much, if at all, does adding one more page to the book impact the sales volume of the book? Develop and interpret a 95% confidence interval estimate to answer this question.
f. Perform the appropriate analysis to determine the aptness of this regression model. Discuss your results and conclusions.
15-69 The publishing company recently came up with some additional data for the 15 books in the original sample. Two new variables, production expenditures (x5) and number of prepublication reviewers (x6), have been added. These additional data are as follows:

Book | x5 ($) | x6
1 | 38,000 | 3
2 | 86,000 | 3
3 | 59,000 | 4
4 | 80,000 | 5
5 | 29,500 | 2
6 | 31,000 | 8
7 | 40,000 | 3
8 | 69,000 | 3
9 | 51,000 | 5
10 | 34,000 | 5
11 | 20,000 | 7
12 | 80,000 | 1
13 | 60,000 | 2
14 | 87,000 | 4
15 | 29,000 | 5

Incorporating these additional data, calculate the correlation between each of these additional variables and the dependent variable, book sales.
a. Test the significance of the correlation coefficients, using α = 0.05. Comment on your results.
b. Develop a multiple regression model that includes all six independent variables. Which, if any, variables would you recommend be retained if this model is going to be used to predict book sales for the publishing company? For any statistical tests you might perform, use a significance level of 0.05. Discuss your results.
c. Use the F-test approach to test the null hypothesis that all slope coefficients are 0. Test with a significance level of 0.05. What do these results mean? Discuss.
d. Do multicollinearity problems appear to be present in the model? Discuss the potential consequences of multicollinearity with respect to the regression model.
e. Discuss whether the standard error of the estimate is small enough to make this model useful for predicting the sales of textbooks.
f. Plot the residuals against the predicted value of y and comment on what this plot means relative to the aptness of the model.
g. Compute the standardized residuals and form these into a frequency histogram. What does this indicate about the normality assumption?
h. Comment on the overall aptness of this model and indicate what might be done to improve the model.

The following information applies to Exercises 15-70 through 15-79: The J. J. McCracken Company has authorized its marketing research department to make a study of customers who have been issued a McCracken charge card. The marketing research department hopes to identify the significant variables that explain the variation in purchases. Once these variables are determined, the department intends to try to attract new customers who would be predicted to make a high volume of purchases. Twenty-five customers were selected at random, and values for the following variables were recorded in the file called McCracken:
y = Average monthly purchases (in dollars) at McCracken
x1 = Customer age
x2 = Customer family income
x3 = Family size

15-70 A first step in regression analysis often involves developing a scatter plot of the data. Develop the scatter plots of all the possible pairs of variables, and with a brief statement indicate what each plot says about the relationship between the two variables.

15-71 Compute the correlation matrix for these data. Develop the decision rule for testing the significance of each coefficient. Which, if any, correlations are not significant?
Use a  0.05 15-72 Use forward selection stepwise regression to develop the multiple regression model The variable x2, family income, was brought into the model Discuss why this happened 15-73 Test the significance of the regression model at Step of the process Justify the significance level you have selected 15-74 Develop a 95% confidence level for the slope coefficient for the family income variable at Step of the model Be sure to interpret this confidence interval 15-75 Describe the regression model at Step of the analysis In your discussion, be sure to discuss the effect of adding a new variable on the standard error of the estimate and on R2 15-76 Referring to Problem 15-75, suppose the manager of McCracken’s marketing department questions the appropriateness of adding a second variable How would you respond to her question? 15-77 Looking carefully at the stepwise regression model, you can see that the value of the slope coefficient for variable x2, family income, changes as a new variable is added to the regression model Discuss why this change takes place 15-78 Analyze the stepwise regression model Write a report to the marketing manager pointing out the strengths and weaknesses of the model Be sure to comment on the department’s goal of being able to use the model to predict which customers will purchase high volumes from McCracken 15-79 Plot the residuals against the predicted value of y and comment on what this plot means relative to the aptness of the model a Compute the standardized residuals and form these in a frequency histogram What does this indicate about the normality assumption? b Comment on the overall aptness of this model and indicate what might be done to improve the model 15-80 The National Association of Realtors Existing-Home Sales Series provides a measurement of the residential (243) 704 CHAPTER 15 | Multiple Regression Analysis and Model Building real estate market One of the measurements it produces is the Housing Affordability Index (HAI), which is a measure of the financial ability of U.S families to buy a house A value of 100 means that families earning the national median income have just the amount of money needed to qualify for a mortgage on a median-priced home; higher than 100 means they have more than enough, and lower than 100 means they have less than enough The file entitled Index contains the HAI and associated variables a Produce the correlation matrix of all the variables Predict the variables that will remain in the estimated regression equation if standard stepwise regression is used b Use standard stepwise regression to develop an estimate of a model that is to predict the HAI from the associated variables found in the file entitled Index c Compare the results of parts a and b Explain any difference between the two models 15-81 An investment analyst collected data from 20 randomly chosen companies The data consisted of the 52-weekhigh stock prices, PE ratios, and the market values of the companies This data are in the file entitled Investment The analyst wishes to produce a regression equation to predict the market value using the 52-week-high stock price and the PE ratio of the company He creates a complete second-degree polynomial a Construct an estimate of the regression equation using the indicated variables b Produce the appropriate residual plots to determine if the polynomial function is the appropriate regression function for this data set c Use a residual plot to determine if the residuals have a constant variance d Produce the 
appropriate residual plot to determine if the residuals are independent Assume the data were extracted in the order listed e Construct a probability plot to determine if the error terms are normally distributed 15-82 The consumer price index (CPI) is a measure of the average change in prices over time in a fixed market basket of goods and services typically purchased by consumers One of the items in this market basket that affects the CPI is the price of oil and its derivatives The file entitled Consumer contains the price of the derivatives of oil and the CPI adjusted to 2005 levels a Produce a multiple regression equation depicting the relationship between the CPI and the price of the derivatives of oil b Conduct a t-test on the coefficient that has the highest p-value Use a significance level of 0.02 and the p-value approach c Produce a multiple regression equation depicting the relationship between the CPI and the price of the derivatives of oil leaving out the variable tested in part b d Referring to the regression results in part c, repeat the tests indicated in part b e Perform a test of hypothesis to determine if the resulting overall model is statistically significant Use a significance level of 0.02 and the p-value approach 15-83 Badeaux Brothers Louisiana Treats ships packages of Louisiana coffee, cakes, and Cajun spices to individual customers around the United States The cost to ship these products depends primarily on the weight of the package being shipped Badeaux charges the customers for shipping and then ships the product itself As a part of a study of whether it is economically feasible to continue to ship products themselves, Badeaux sampled 20 recent shipments to determine what if any relationship exists between shipping costs and package weight The data are contained in the file Badeaux a Develop a scatter plot of the data with the dependent variable, cost, on the vertical axis and the independent variable, weight, on the horizontal axis Does there appear to be a relationship between the two variables? Is the relationship linear? b Compute the sample correlation coefficient between the two variables Conduct a test, using an alpha value of 0.05, to determine whether the population correlation coefficient is significantly different from zero c Determine the simple linear regression model for this data Plot the simple linear regression model together with the data Would a nonlinear model better fit the sample data? d Now develop a nonlinear model and plot the model against the data Does the nonlinear model provide a better fit than the linear model developed in part c? 15-84 The State Tax Commission must download information files each morning The time to download the files primarily depends on the size of the file The Tax Commission has asked your computer consulting firm to determine what, if any, relationship exists between download time and size of files The Tax Commission randomly selected a sample of days and provided the information contained in the file Tax Commission a Develop a scatter plot of the data with the dependent variable, download time, on the vertical axis and the independent variable, size, on the horizontal axis Does there appear to be a relationship between the two variables? Is the relationship linear? 
b. Compute the sample correlation coefficient between the two variables. Conduct a test, using an alpha value of 0.05, to determine whether the population correlation coefficient is significantly different from zero.
c. Determine the simple linear regression model for these data. Plot the simple linear regression model together with the data. Would a nonlinear model better fit the sample data?
d. Now determine a nonlinear model and plot the model against the data. Does the nonlinear model provide a better fit than the linear model developed in part c?

15-85 Refer to the State Department of Transportation data set called Liabins. The department was interested in determining the rate of compliance with the state's mandatory liability insurance law, as well as other things. Assume the data were collected using a simple random sampling process. Develop the best possible linear regression model using vehicle year as the dependent variable and any or all of the other variables as potential independent variables. Assume that your objective is to develop a predictive model. Write a report that discusses the steps you took to develop the final model. Include a correlation matrix and all appropriate statistical tests. Use α = 0.05. If you are using a nominal or ordinal variable, remember that you must make sure it is in the form of one or more dummy variables.

Case 15.1 Dynamic Scales, Inc.

In 2005, Stanley Ahlon and three financial partners formed Dynamic Scales, Inc. The company was based on an idea Stanley had for developing a scale to weigh trucks in motion and thus eliminate the need for every truck to stop at weigh stations along highways. This dynamic scale would be placed in the highway approximately one-quarter mile from the regular weigh station. The scale would have a minicomputer that would automatically record truck speed, axle weights, and climate variables, including temperature, wind, and moisture. Stanley Ahlon and his partners believed that state transportation departments in the United States would be the primary market for such a scale.

As with many technological advances, developing the dynamic scale has been difficult. When the scale finally proved accurate for trucks traveling 40 miles per hour, it would not perform for trucks traveling at higher speeds. However, eight months ago, Stanley announced that the dynamic scale was ready to be field-tested by the Nebraska State Department of Transportation under a grant from the federal government. Stanley explained to his financial partners, and to Nebraska transportation officials, that the dynamic weight would not exactly equal the static weight (truck weight on a static scale). However, he was sure a statistical relationship between dynamic weight and static weight could be determined, which would make the dynamic scale useful.

Nebraska officials, along with people from Dynamic Scales, installed a dynamic scale on a major highway in Nebraska. Each month for six months, data were collected for a random sample of trucks weighed on both the dynamic scale and a static scale. Table 15.3 presents these data. Once the data were collected, the next step was to determine whether, based on this test, the dynamic scale measurements could be used to predict static weights. A complete report will be submitted to the U.S. government and to Dynamic Scales.

TABLE 15.3 | Test Data for the Dynamic Scales Example

Month | Front-Axle Static Weight (lb.) | Front-Axle Dynamic Weight (lb.) | Truck Speed (mph) | Temperature (°F) | Moisture (%)
January | 1,800 | 1,625 | 52 | 21 | 0.00
January | 1,311 | 1,904 | 71 | 17 | 0.15
January | 1,504 | 1,390 | 48 | 13 | 0.40
January | 1,388 | 1,402 | 50 | 19 | 0.10
January | 1,250 | 1,100 | 61 | 24 | 0.00
February | 2,102 | 1,950 | 55 | 26 | 0.10
February | 1,410 | 1,475 | 58 | 32 | 0.20
February | 1,000 | 1,103 | 59 | 38 | 0.15
February | 1,430 | 1,387 | 43 | 24 | 0.00
February | 1,073 | 948 | 59 | 18 | 0.40
March | 1,502 | 1,493 | 62 | 34 | 0.00
March | 1,721 | 1,902 | 67 | 36 | 0.00
March | 1,113 | 1,415 | 48 | 42 | 0.21
March | 978 | 983 | 59 | 29 | 0.32
March | 1,254 | 1,149 | 60 | 48 | 0.00
April | 994 | 1,052 | 58 | 37 | 0.00
April | 1,127 | 999 | 52 | 34 | 0.21
April | 1,406 | 1,404 | 59 | 40 | 0.40
April | 875 | 900 | 47 | 48 | 0.00
April | 1,350 | 1,275 | 68 | 51 | 0.00
May | 1,102 | 1,120 | 55 | 52 | 0.00
May | 1,240 | 1,253 | 57 | 57 | 0.00
May | 1,087 | 1,040 | 62 | 63 | 0.00
May | 993 | 1,102 | 59 | 62 | 0.10
May | 1,408 | 1,400 | 67 | 68 | 0.00
June | 1,420 | 1,404 | 58 | 70 | 0.00
June | 1,808 | 1,790 | 54 | 71 | 0.00
June | 1,401 | 1,396 | 49 | 83 | 0.00
June | 933 | 1,004 | 62 | 88 | 0.40
June | 1,150 | 1,127 | 64 | 81 | 0.00

Case 15.2 Glaser Machine Works

Glaser Machine Works has experienced a significant change in its business operations over the past 50 years. Glaser started business as a machine shop that produced specialty tools and products for the timber and lumber industry. This was a logical fit, given its location in the southern part of the United States. However, over the years Glaser looked to expand its offerings beyond the lumber and timber industry. Initially, its small size coupled with its rural location made it difficult to attract the attention of large companies that could use its products. All of that began to change as Glaser developed the ability not only to fabricate parts and tools but also to assemble products for customers who needed special components in large quantities.

Glaser's business really took off when first foreign, and then domestic, automakers began to build automobile plants in the southern United States. Glaser was able to provide quality parts quickly for firms that expected high quality and responsive delivery. Many of Glaser's customers operated with little inventory and required that suppliers be able to provide shipments with short lead times. As part of its relationship with the automobile industry, Glaser was expected to buy into the lean manufacturing and quality improvement initiatives of its customers. Glaser had always prided itself on its quality, but as the number and variety of its products increased, along with ever higher expectations by its customers, Glaser knew that it would have to respond by ensuring its quality and operations were continually improving.

Of recent concern was the performance of its manufacturing line 107B. This line produced a component part for a Japanese automobile company. The Japanese firm had initially been pleased with Glaser's performance, but lately the number of defects was approaching an unacceptable level. Managers of the 107B line knew the line and its workers had been asked to ramp up production to meet increased demand and that some workers were concerned with the amount of overtime being required. There was also concern about the second shift now being run at 107B. Glaser had initially run only one shift, but when demand for its product became so high that there was not sufficient capacity with one shift, additional workers were hired to operate a night shift. Management was wondering if the new shift had been stretched beyond its capabilities.

Glaser plant management asked Kristi Johnson, the assistant production supervisor for line 107B, to conduct an analysis of product defects for the line. Kristi randomly selected several days of output and counted the number of defective parts produced on the 107B line. This information,
along with other data, is contained in the file Glaser Machine Works Kristi promised to have a full report for the management team by the end of the month Required Tasks: Identify the primary issue of the case Identify a statistical model you might use to help analyze the case Develop a multiple regression model that can be used to help Kristi Johnson analyze the product defects for line 107B Be sure to carefully specify the dependent variable and the independent variables Discuss how the variables overtime hours, supervisor training, and shift will be modeled Run the regression model you developed and interpret the results Which variables are significant? Provide a short report that describes your analysis and explains in managerial terms the findings of your model Be sure to explain which variables, if any, are significant explanatory variables Provide a recommendation to management Case 15.3 Hawlins Manufacturing Ross Hawlins had done it all at Hawlins Manufacturing, a company founded by his grandfather 63 years ago Among his many duties, Ross oversaw all the plant’s operations, a task that had grown in responsibility given the company’s rapid growth over the past three decades When Ross’s grandfather founded the company, there were only two manufacturing sites Expansion and acquisition of competitors over the years had caused that number to grow to over 50 manufacturing plants in 18 states Hawlins had a simple process that produced only two products, but the demand for these products was strong and Ross had spent millions of dollars upgrading his facilities over the past decade Consequently, most of the company’s equipment was less than 10 years old on average Hawlins’s two products were produced for local markets, as prohibitive shipping costs prevented shipping the product long distances Product demand was sufficiently strong to support two manufacturing shifts (day and night) at every plant, and every plant had the capability to produce both products sold by Hawlins Recently, the management team at Hawlins noticed that there were differences in output levels across the various plants They were uncertain what, if anything, might explain these differences Clearly, if some plants were more productive than others, there might be some meaningful insights that could be standardized across plants to boost overall productivity Ross Hawlins asked Lisa Chandler, an industrial engineer at the company’s headquarters, to conduct a study of the plant’s productivity Lisa randomly sampled 159 weeks of output from various plants together with the number of plant employees working that week, the plants’ average age in years, the product mix produced that week (either product A or B), and whether the output was from the day or night shift The sampled data are contained in the file Hawlins Manufacturing The Hawlins management team is expecting a written report and a presentation by Lisa when it meets again next Tuesday (246) CHAPTER 15 Required Tasks: Identify the primary issue of the case Identify a statistical model you might use to help analyze the case Develop a multiple regression model for Lisa Chandler Be sure to carefully specify the dependent variable and the independent variables Discuss how the type of product (A or B) and the shift (Day or Night) can be included in the regression model | Multiple Regression Analysis and Model Building 707 Run the regression model you developed and interpret the results Which variables are significant? 
Provide a short report that describes your analysis and explains in management terms the findings of your model Be sure to explain which variables, if any, are significant explanatory variables Provide a recommendation to management Case 15.4 Sapphire Coffee—Part Jennie Garcia could not believe that her career had moved so far so fast When she left graduate school with a master’s degree in anthropology, she intended to work at a local coffee shop until something else came along that was more related to her academic background But after a few months, she came to enjoy the business, and in a little over a year she was promoted to store manager When the company for whom she worked continued to grow, Jennie was given oversight of a few stores Now, eight years after she started as a barista, Jennie is in charge of operations and planning for the company’s southern region As a part of her responsibilities, Jennie tracks store revenues and forecasts coffee demand Historically, Sapphire Coffee would base its demand forecast on the number of stores, believing that each store sold approximately the same amount of coffee This approach seemed to work well when the company had shops of similar size and layout, but as the company grew, stores became more varied Now, some stores have drive-thru windows, a feature that top management added to some stores believing that it would increase coffee sales for customers who wanted a cup of coffee on their way to work but who were too rushed to park and enter the store Jennie noticed that weekly sales seemed to be more variable across stores in her region and was wondering what, if anything, might explain the differences The company’s financial vice president had also noticed the increased differences in sales across stores and was wondering what might be happening In an e-mail to Jennie he stated that weekly store sales are expected to average $5.00 per square foot Thus, a 1,000-square-foot store would have average weekly sales of $5,000 He asked that Jennie analyze the stores in her region to see if this rule of thumb was a reliable measure of a store’s performance Jennie had been in the business long enough to know that a store’s size, although an important factor, was not the only thing that might influence sales She had never been convinced of the efficacy of the drive-thru window, believing that it detracted from the coffee house experience that so many of Sapphire Coffee customers had come to expect The VP of finance was expecting the analysis to be completed by the weekend Jennie decided to randomly select weekly sales records for 53 stores, along with each store’s size, whether it was located close to a college, and whether it had a drive-thru window The data are in the file Sapphire Coffee-2 A full analysis would need to be sent to the corporate office by Friday Case 15.5 Wendell Motors Wendell Motors manufactures and ships small electric motors and drives to a variety of industrial and commercial customers in and around St Louis Wendell is a small operation with a single manufacturing plant Wendell’s products are different from other motor and drive manufacturers because Wendell only produces small motors (25 horsepower or less) and because its products are used in a variety of industries and businesses that appreciate Wendell’s quality and speed of delivery Because it has only one plant, Wendell ships motors directly from the plant to its customers Wendell’s reputation for quality and speed of delivery allows it to maintain low inventories of motors 
and to ship make-to-order products directly As part of its ongoing commitment to lean manufacturing and continuous process improvement, Wendell carefully monitors the cost associated with both production and shipping The manager of shipping for Wendell, Tyler Jenkins, regularly reports the shipping costs to Wendell’s management team Because few finished goods inventories are maintained, competitive delivery times often require that Wendell expedite shipments This is almost always the case for those customers who operate their business around the clock every day of the week Such customers might maintain their own backup safety stock of a particular motor or drive, but circumstances often result in cases where replacement products have to be rushed through production and then expedited to the customer Wendell’s management team wondered if these special orders were too expensive to handle in this way and if it might be less expensive to produce and hold certain motors as finished goods inventory, enabling off-the-shelf delivery using less expensive modes of shipping This might especially be true for orders that must be filled on a holiday, incurring an additional shipping charge At the last meeting of the management team, Tyler Jenkins was asked to analyze expedited shipping costs and to develop a model that could be used to estimate the cost of expediting a customer’s order (247) 708 CHAPTER 15 | Multiple Regression Analysis and Model Building Donna Layton, an industrial engineer in the plant, was asked to prepare an inventory cost analysis to determine the expenses of holding additional finished goods inventory Tyler began his analysis by randomly selecting 45 expedited shipping records The sampled data can be found in the file Wendell Motors The management team expects a full report in five days Tyler knew he would need a model for explaining shipping costs for expedited orders and that he would also need to answer the questions as to what effect, if any, shipping on a holiday had on costs References Berenson, Mark L., and David M Levine, Basic Business Statistics: Concepts and Applications, 11th ed (Upper Saddle River, NJ: Prentice Hall, 2009) Bowerman, Bruce L., and Richard T O’Connell, Linear Statistical Models: An Applied Approach, 2nd ed (Belmont, CA: Duxbury Press, 1990) Cryer, Jonathan D., and Robert B Miller, Statistics for Business: Data Analysis and Modeling, 2nd ed (Belmont, CA: Duxbury Press, 1994) Demmert, Henry, and Marshall Medoff, “Game-Specific Factors and Major League Baseball Attendance: An Econometric Study,” Santa Clara Business Review (1977), pp 49–56 Dielman, Terry E., Applied Regression Analysis: A Second Course in Business and Economic Statistics, 4th ed (Belmont, CA: Duxbury Press, 2005) Draper, Norman R., and Harry Smith, Applied Regression Analysis, 3rd ed (New York City: John Wiley and Sons, 1998) Frees, Edward W., Data Analysis Using Regression Models: The Business Perspective (Upper Saddle River, NJ: Prentice Hall, 1996) Gloudemans, Robert J., and Dennis Miller, “Multiple Regression Analysis Applied to Residential Properties.” Decision Sciences (April 1976), pp 294–304 Kleinbaum, David G., Lawrence L Kupper, Azhar Nizam, and Keith E Muller, Applied Regression Analysis and Multivariable Methods, 4th ed (Florence, KY: Cengage Learning, 2008) Kutner, Michael H., Christopher J Nachtsheim, John Neter, and William Li, Applied Linear Statistical Models, 5th ed (New York City: McGraw-Hill Irwin, 2005) Microsoft Excel 2007 (Redmond, WA: Microsoft Corp 2007) Minitab 
Chapter 16
Quick Prep Links
• Review the steps used to develop a line chart discussed in Chapter 3.
• Make sure you understand the steps necessary to construct and interpret linear and nonlinear regression models in Chapters 14 and 15.
• Review the concepts and properties associated with means discussed in Chapter 3.

Analyzing and Forecasting Time-Series Data

16.1 Introduction to Forecasting, Time-Series Data, and Index Numbers (pg. 710–723)
Outcome 1. Identify the components present in a time series.
Outcome 2. Understand and compute basic index numbers.
16.2 Trend-Based Forecasting Techniques (pg. 724–749)
Outcome 3. Apply the fundamental steps in developing and implementing forecasting models.
Outcome 4. Apply trend-based forecasting models, including linear trend, nonlinear trend, and seasonally adjusted trend.
16.3 Forecasting Using Smoothing Methods (pg. 750–761)
Outcome 5. Use smoothing-based forecasting models, including single and double exponential smoothing.

Why you need to know
No organization, large or small, can function effectively without a forecast for the goods or services it provides. A retail clothing store must forecast the demand for the shirts it sells by shirt size. The concessionaire at Dodger Stadium in Los Angeles must forecast each game's attendance to determine how many soft drinks and Dodger dogs to have on hand. Your state's elected officials must forecast tax revenues in order to establish a budget each year. These are only a few of the instances in which forecasting is required. For many organizations, the success of the forecasting effort will play a major role in determining the general success of the organization.

When you graduate and join an organization in the public or private sector, you will almost certainly be required to prepare forecasts or to use forecasts provided by someone else in the organization. You won't have access to a crystal ball on which to rely for an accurate prediction of the future. Fortunately, if you have learned the material presented in this chapter, you will have a basic understanding of forecasting and of how and when to apply various forecasting techniques. We urge you to focus on the material and take with you the tools that will give you a competitive advantage over those who are not familiar with forecasting techniques.

16.1 Introduction to Forecasting, Time-Series Data, and Index Numbers

Decision makers often confuse forecasting and planning. Planning is the process of determining how to deal with the future. On the other hand, forecasting is the process of predicting what the future will be like. Forecasts are used as inputs for the planning process. Experts agree that good planning is essential for an organization to be effective. Because forecasts are an important part of the planning process, you need to be familiar with forecasting methods.

There are two broad categories of forecasting techniques: qualitative and quantitative. Qualitative forecasting techniques are based on expert opinion and judgment. Quantitative forecasting techniques are based on statistical methods for analyzing quantitative historical data. This chapter focuses on quantitative forecasting techniques. In general, quantitative forecasting techniques are used whenever the following conditions are true:
1. Historical data relating to the variable to be forecast exist,
2. the historical data can be quantified, and
3. you are willing to assume that the historical pattern will continue into the future.

General Forecasting Issues

Model Specification: The process of selecting the forecasting technique to be used in a particular situation.
Model Fitting: The process of estimating the specified model's parameters to achieve an adequate fit of the historical data.
Model Diagnosis: The process of determining how well a model fits past data and how well the model's assumptions appear to be satisfied.
Forecasting Horizon: The number of future periods covered by a forecast; sometimes referred to as the forecast lead time.

Decision makers who are actively involved in forecasting frequently say that forecasting is both an art and a science. Operationally, the forecaster is engaged in the process of modeling a real-world system. Determining the appropriate forecasting model is a challenging task, but it can be made manageable by employing the same model-building process discussed in Chapter 15, consisting of model specification, model fitting, and model diagnosis. As we will point out in later sections, guidelines exist for determining which techniques may be more appropriate than others in certain situations. However, you may have to specify (and try) several model forms for a given situation before deciding on one that is acceptable.

The idea is that if the future tends to look like the past, a model should adequately fit the past data to have a reasonable chance of forecasting the future. As a forecaster, you will spend much time selecting a model's specification and estimating its parameters to reach an acceptable fit of the past data. You will need to determine how well a model fits past data, how well it performs in mock forecasting trials, and how well its assumptions appear to be satisfied. If the model is unacceptable in any of these areas, you will be forced to revert to the model specification step and begin again.

An important consideration when you are developing a forecasting model is to use the simplest available model that will meet your forecasting needs. The objective of forecasting is to provide good forecasts. You do not need to feel that a sophisticated approach is better if a simpler one will provide acceptable forecasts.

As in football, in which some players specialize in defense and others in offense, forecasting techniques have been developed for special situations, which are generally dependent on the forecasting horizon. For the purpose of categorizing forecasting techniques in most business situations, the forecast horizon, or lead time, is typically divided into four categories:
1. Immediate term—less than one month
2. Short term—one to three months
3. Medium term—three months to two years
4. Long term—two years or more

Forecasting Period: The unit of time for which forecasts are to be made.
Forecasting Interval: The frequency with which new forecasts are prepared.

As we introduce various forecasting techniques, we will indicate the forecasting horizon(s) for which each is typically best suited. In addition to determining the desired forecasting horizon, the forecaster must determine the forecasting period. For instance, the forecasting period might be a day, a week, a month, a quarter, or a year. Thus, the forecasting horizon consists of one or more forecasting periods. If quantitative forecasting techniques are to be employed, historical quantitative data must be available for a similar period. If we want weekly forecasts, weekly historical data must be available. The forecasting interval is generally the same length as the forecast period. That is, if the forecast period is one week, then we will provide a new forecast each week.
Components of a Time Series

Quantitative forecasting models have one factor in common: They use past measurements of the variable of interest to generate a forecast of the future. The past data, measured over time, are called time-series data. The decision maker who plans to develop a quantitative forecasting model must analyze the relevant time-series data.

Chapter Outcome 1

BUSINESS APPLICATION  IDENTIFYING TIME-SERIES COMPONENTS

WEB SITE DESIGN AND CONSULTING  For the past four years, White-Space, Inc., has been helping firms to design and implement Web sites. The owners need to forecast revenues in order to make sure they have ample cash flows to operate the business. In forecasting this company's revenue for next year, they plan to consider the historical pattern over the prior four years. They want to know whether demand for consulting services has tended to increase or decrease and whether there have been particular times during the year when demand was typically higher than at other times. The forecasters can perform a time-series analysis of the historical sales.

Table 16.1 presents the time-series data for the revenue generated by the firm's sales for the four-year period. An effective means for analyzing these data is to develop a time-series plot, or line chart, as shown in Figure 16.1. By graphing the data, much can be observed about the firm's revenue over the past four years. The time-series plot is an important tool in identifying the time-series components. All time-series data exhibit one or more of the following:
1. Trend component
2. Seasonal component
3. Cyclical component
4. Random component

TABLE 16.1 | Time-Series Data for Sales Revenues (Thousands of Dollars)

Billing Total
Month       2006   2007   2008   2009
January      170    390    500    750
February     200    350    470    700
March        190    300    510    680
April        220    320    480    710
May          180    310    530    710
June         230    350    500    660
July         220    380    540    630
August       260    420    580    670
September    300    460    630    700
October      330    500    690    720
November     370    540    770    850
December     390    560    760    880

Trend Component

Linear Trend: A long-term increase or decrease in a time series in which the rate of change is relatively constant.

A trend is the long-term increase or decrease in a variable being measured over time. Figure 16.1 shows that White-Space's revenues exhibited an upward trend over the four-year period. In other situations, the time series may exhibit a downward trend. Trends can be classified as linear or nonlinear. A trend can be observed when a time series is measured in any time increment, such as years, quarters, months, or days. Figure 16.1 shows a good example of a positive linear trend. Time-series data that exhibit a linear trend will tend to increase or decrease at a fairly constant rate. However, not all trends are linear.

[FIGURE 16.1 | Time-Series Plot for Billing Data — a line chart of monthly billings ($ in 1,000s, scaled 0 to 1,000) from January 2006 through December 2009.]

Many time series will show a nonlinear trend. For instance, in the years between 2001 and 2008, total annual game attendance for the New York Yankees Major League Baseball team is shown in Figure 16.2. Attendance was fairly flat between 2001 and 2003, increased dramatically between 2003 and 2006, and then slowed down again through 2008.

[FIGURE 16.2 | New York Yankees Annual Attendance Showing a Nonlinear Trend]
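The components are easiest to spot in a plot like Figure 16.1. For readers who want to reproduce such a chart outside Excel or Minitab, here is a minimal Python sketch using the Table 16.1 billing data; it assumes the matplotlib library is installed, and the variable names are our own, not part of any data file.

```python
import matplotlib.pyplot as plt

# Monthly billing totals (thousands of dollars) from Table 16.1
billings = {
    2006: [170, 200, 190, 220, 180, 230, 220, 260, 300, 330, 370, 390],
    2007: [390, 350, 300, 320, 310, 350, 380, 420, 460, 500, 540, 560],
    2008: [500, 470, 510, 480, 530, 500, 540, 580, 630, 690, 770, 760],
    2009: [750, 700, 680, 710, 710, 660, 630, 670, 700, 720, 850, 880],
}

# Flatten into a single series running January 2006 through December 2009
series = [value for year in sorted(billings) for value in billings[year]]

plt.plot(range(1, len(series) + 1), series)
plt.title("Time-Series Plot for Billing Data")
plt.xlabel("Month (t = 1 is January 2006)")
plt.ylabel("$ in 1,000s")
plt.show()
```

An upward drift in the plotted line suggests a trend component; a wave that repeats every 12 months suggests a seasonal component.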
Seasonal Component

Seasonal Component: A wavelike pattern that is repeated throughout a time series and has a recurrence period of at most one year.

Another component that may be present in time-series data is the seasonal component. Many time series show a repeating pattern over time. For instance, Figure 16.1 showed a time series that exhibits a wavelike pattern. This pattern repeats itself throughout the time series. Web site consulting revenues reach an annual maximum around January and then decline to an annual minimum around April. This pattern repeats itself every 12 months. The shortest period of repetition for a pattern is known as its recurrence period. A seasonal component's recurrence period is at most one year. If the time series exhibits a repetitious pattern with a recurrence period longer than a year, the time series is said to exhibit a cyclical effect—a concept to be explored shortly.

In analyzing past sales data for a retail toy store, we would expect to see sales increase in the months leading into Christmas and then substantially decrease after Christmas. Automobile gasoline sales might show a seasonal increase during the summer months, when people drive more, and a decrease during the cold winter months. These predictable highs and lows at specific times during the year indicate seasonality in data.

[FIGURE 16.3 | Hotel Sales by Quarter — quarterly sales in millions (scaled 0 to 4,500) from summer 2004 through winter 2009.]

To view seasonality in a time series, the data must be measured quarterly, monthly, weekly, or daily. Annual data will not show seasonal patterns of highs and lows. Figure 16.3 shows quarterly sales data for a major hotel chain from June 2004 through December 2009. Notice that the data exhibit a definite seasonal pattern. The local maximums occur in the spring. The recurrence period of the component in the time series is, therefore, one year. The winter quarter tends to be low, whereas the following quarter (spring) is the high quarter each year.

Seasonality can be observed in time-series data measured over time periods shorter than a year. For example, the number of checks processed daily by a bank may show predictable highs and lows at certain times during a month. The pattern of customers arriving at the bank during any hour may be "seasonal" within a day, with more customers arriving near opening time, around the lunch hour, and near closing time.

Cyclical Component

Cyclical Component: A wavelike pattern within the time series that repeats itself throughout the time series and has a recurrence period of more than one year.
Random Component: Changes in time-series data that are unpredictable and cannot be associated with a trend, seasonal, or cyclical component.

If you observe time-series data over a long enough time span, you may see sustained periods of high values followed by periods of lower values. If the recurrence period of these fluctuations is larger than a year, the data are said to contain a cyclical component. National economic measures such as the unemployment rate, gross national product, stock market indexes, and personal saving rates tend to cycle. The cycles vary in length and magnitude. That is, some cyclical time series may have longer runs of high and low values than others. Also, some time series may exhibit deeper troughs and higher crests than others. Figure 16.4 shows quarterly housing starts in the United States between 1995 and 2006.
Note the definite cyclical pattern, with low periods in 1995, 1997, and 2000. Although the pattern resembles the shape of a seasonal component, the length of the recurrence period identifies this pattern as being the result of a cyclical component.

[FIGURE 16.4 | Time-Series Plot of Housing Starts — quarterly U.S. housing starts (millions, scaled 1,200 to 2,400), January 1995 through January 2006.]

Random Component

Although not all time series possess a trend, seasonal, or cyclical component, virtually all time series will have a random component. The random component is often referred to as "noise" in the data. A time series with no identifiable pattern is completely random and contains only noise. In addition to other components, each of the time series in Figures 16.1 through 16.4 contains random fluctuations.

In the following sections of this chapter, you will see how various forecasting techniques deal with the time-series components. An important first step in forecasting is to identify which components are present in the time series to be analyzed. As we have shown, constructing a time-series plot is the first step in this process.

Introduction to Index Numbers

Base Period Index: The time-series value to which all other values in the time series are compared. The index number for the base period is defined as 100.

When analyzing time-series data, decision makers must often compare one value measured at one point in time with other values measured at different points in time. For example, a real estate broker may wish to compare house prices in 2009 with house prices in previous years. A common procedure for making relative comparisons is to begin by determining a base period index to which all other data values can be fairly compared. Equation 16.1 is used to make relative comparisons for data found in different periods by calculating a simple index number.

Simple Index Number

$$I_t = \frac{y_t}{y_0} \times 100 \qquad (16.1)$$

where:
I_t = Index number at time period t
y_t = Value of the time series at time t
y_0 = Value of the time series at the index base period

EXAMPLE 16-1  COMPUTING SIMPLE INDEX NUMBERS

Wilson Windows, Inc.  The managers at Wilson Windows, Inc., are considering the purchase of a window and door plant in Wisconsin. The current owners of the window and door plant have touted their company's rapid sales growth over the past 10 years as a reason for their asking price. Wilson executives wish to convert the company's sales data to index numbers. The following steps can be used to do this:

Step 1 Obtain the time-series data.
The company has sales data for each of the 10 years since 2000.

Step 2 Select a base period.
Wilson managers have selected 2000 as the index base period. Sales in 2000 were $14.0 million.

Step 3 Compute the simple index numbers for each year using Equation 16.1.
For instance, sales in 2001 were $15.2 million. Using Equation 16.1, the index for 2001 is

$$I_{2001} = \frac{y_t}{y_0} \times 100 = \frac{15.2}{14.0} \times 100 = 108.6$$

For the 10 years, we get:

Year   Sales ($ millions)   Index
2000   14.0                 100.0
2001   15.2                 108.6
2002   17.8                 127.1
2003   21.4                 152.9
2004   24.6                 175.7
2005   30.5                 217.9
2006   29.8                 212.9
2007   32.4                 231.4
2008   37.2                 265.7
2009   39.1                 279.3

>> END EXAMPLE  TRY PROBLEM 16-8 (pg. 722)
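The arithmetic in Example 16-1 can be verified with a few lines of code. The following Python sketch applies Equation 16.1 to the Wilson Windows sales figures; the list and variable names are ours, not part of any data file.

```python
# Wilson Windows sales ($ millions) for 2000-2009, from Example 16-1
sales = [14.0, 15.2, 17.8, 21.4, 24.6, 30.5, 29.8, 32.4, 37.2, 39.1]
base = sales[0]  # 2000 is the index base period, so I_2000 = 100

# Equation 16.1: I_t = (y_t / y_0) * 100
for year, y in zip(range(2000, 2010), sales):
    print(year, round(y / base * 100, 1))  # 2001 prints 108.6, 2009 prints 279.3
```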
Referring to Example 16-1, we can use the index numbers to determine the percentage change any year is from the base year. For instance, sales in 2007 have an index of 231.4. This means that sales in 2007 are 131.4% above sales in the base year of 2000. Sales in 2009 are 179.3% higher than they were in 2000.

Note that although you can use the index number to compare values between any one time period and the base period and can express the difference in percentage-change terms, you cannot compare period-to-period changes by subtracting the index numbers. For instance, in Example 16-1, when comparing sales for 2008 and 2009, we cannot say that the growth has been 279.3 − 265.7 = 13.6%. To determine the actual percentage growth, we do the following:

$$\frac{279.3 - 265.7}{265.7} \times 100 = 5.1\%$$

Thus, the sales growth rate between 2008 and 2009 has been 5.1%, not 13.6%.

Aggregate Price Indexes

Aggregate Price Index: An index that is used to measure the rate of change from a base period for a group of two or more items.

"The dollar's not worth what it once was" is a saying that everyone has heard. The problem is that nothing is worth what it used to be; sometimes it is worth more, and other times it is worth less. The simple index shown in Equation 16.1 works well for comparing prices when we wish to analyze the price of a single item over time. For instance, we could use the simple index to analyze how apartment rents have changed over time or how college tuition has increased over time. However, if we wish to compare prices of a group of items, we might construct an aggregate price index. Equation 16.2 is used to compute an unweighted aggregate price index.

Unweighted Aggregate Price Index

$$I_t = \frac{\sum p_t}{\sum p_0} \times 100 \qquad (16.2)$$

where:
I_t = Unweighted aggregate index at time period t
∑p_t = Sum of the prices for the group of items at time period t
∑p_0 = Sum of the prices for the group of items at the base time period

EXAMPLE 16-2  COMPUTING AN UNWEIGHTED AGGREGATE PRICE INDEX

College Costs  There have been many news stories recently discussing the rate of growth of college and university costs. One university is interested in analyzing the growth in the total costs for students over the past five years. The university wishes to consider three main costs: tuition and fees, room and board, and books and supplies. Rather than analyzing these factors individually using three simple indexes, an unweighted aggregate price index can be developed using the following steps:

Step 1 Define the variables to be included in the index and gather the time-series data.
The university has identified three main categories of costs: tuition and fees, room and board, and books and supplies. Data for the past five years have been collected. Full-time tuition and fees for two semesters are used. The full dorm-and-meal package offered by the university is priced for the room-and-board variable, and the books-and-supplies costs for a "typical" student are used for that component of the total costs.

Step 2 Select a base period.
The base period for this study will be the 2004–2005 academic year.

Step 3 Use Equation 16.2 to compute the unweighted aggregate price index.
The equation is

$$I_t = \frac{\sum p_t}{\sum p_0} \times 100$$

The sum of the prices for the three components during the base academic year of 2004–2005 is $13,814. The sum of the prices in the 2008–2009 academic year is $19,492. Applying Equation 16.2, the unweighted aggregate price index is

$$I_{2008-2009} = \frac{\$19{,}492}{\$13{,}814} \times 100 = 141.1$$

This means that, as a group, the components making up the cost of attending this university have increased by 41.1% since the 2004–2005 academic year. The indexes for the other years are shown as follows:

Academic Year   Tuition & Fees ($)   Room & Board ($)   Books & Supplies ($)   ∑p_t ($)   Index
2004–2005        7,300                5,650                 864                13,814     100.0
2005–2006        7,720                5,980                 945                14,645     106.0
2006–2007        8,560                6,350               1,067                15,977     115.7
2007–2008        9,430                6,590               1,234                17,254     124.9
2008–2009       10,780                7,245               1,467                19,492     141.1

>> END EXAMPLE  TRY PROBLEM 16-9 (pg. 722)
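Equation 16.2 is just as easy to check. A minimal Python sketch using the Example 16-2 college costs follows; the dictionary layout is our own framing of the table.

```python
# College costs from Example 16-2: (tuition & fees, room & board, books & supplies)
costs = {
    "2004-2005": (7_300, 5_650, 864),
    "2005-2006": (7_720, 5_980, 945),
    "2006-2007": (8_560, 6_350, 1_067),
    "2007-2008": (9_430, 6_590, 1_234),
    "2008-2009": (10_780, 7_245, 1_467),
}
base_total = sum(costs["2004-2005"])  # sum of the prices in the base period

# Equation 16.2: I_t = (sum of prices at t / sum of prices at base) * 100
for year, prices in costs.items():
    print(year, round(sum(prices) / base_total * 100, 1))  # 2008-2009 prints 141.1
```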
Weighted Aggregate Price Indexes

Example 16-2 utilized an unweighted aggregate price index to determine the change in university costs. This was appropriate because each student would incur the same set of three costs. However, in some situations, the items composing a total cost are not equally weighted. For instance, in a consumer price study of a "market basket" of 10 food items, a typical household will not use the same number (or volume) of each item. During a week, a typical household might use three gallons of milk, but only two loaves of bread. In these types of situations, we need to compute a weighted aggregate price index to account for the different levels of use. Two common weighted indexes are the Paasche Index and the Laspeyres Index.

The Paasche Index  Equation 16.3 is used to compute a Paasche Index. Note that the weighting percentage in Equation 16.3 for the Paasche Index is always the percentage for the time period for which the index is being computed. The idea is that the prices in the base period should be weighted relative to their current use, not to what that use level was in other periods.

Paasche Index

$$I_t = \frac{\sum q_t p_t}{\sum q_t p_0} \times 100 \qquad (16.3)$$

where:
q_t = Weighting percentage at time t
p_t = Price in time period t
p_0 = Price in the base period

EXAMPLE 16-3  COMPUTING THE PAASCHE INDEX

Wage Rates  Before a company makes a decision to locate a new manufacturing plant in a community, the managers will be interested in knowing how the wage rates have changed. Two categories of wages are to be analyzed as a package: production hourly wages and administrative/clerical hourly wages. Annual data showing the average hourly wage rates since 2000 are available. Each year, the makeup of the labor market differs in terms of the percentage of employees in the two categories. To compute a Paasche Index, use the following steps:

Step 1 Define the variables to be included in the index and gather the time-series data.
The variables are the mean hourly wage for production workers and the mean hourly wage for administrative/clerical workers. Data are collected for the 10-year period through 2009.

Step 2 Select the base period.
Because similar data for another community are available only back to 2003, the company will use 2003 as the base period to make comparisons between the two communities easier.

Step 3 Use Equation 16.3 to compute the Paasche Index.
The equation is

$$I_t = \frac{\sum q_t p_t}{\sum q_t p_0} \times 100$$

The hourly wage rate for production workers in the base year 2003 was $10.80, whereas the average hourly administrative/clerical rate was $10.25. In 2009, the production hourly rate had increased to $15.45, and the administrative/clerical rate was $13.45. In 2009, 60% of the employees in the community were designated as working in production and 40% were administrative/clerical. Equation 16.3 is used to compute the Paasche Index for 2009, as follows:

$$I_{2009} = \frac{(0.60)(\$15.45) + (0.40)(\$13.45)}{(0.60)(\$10.80) + (0.40)(\$10.25)} \times 100 = 138.5$$

This means that, overall, the wage rates in this community have increased by 38.5% since the base year of 2003. The following table shows the Paasche Indexes for all years.

Year   Production Wage Rate ($)   Percent Production   Admin./Clerical Wage Rate ($)   Percent Admin./Clerical   Paasche Index
2000    8.50   0.78    9.10   0.22    80.8
2001    9.10   0.73    9.45   0.27    86.3
2002   10.00   0.69    9.80   0.31    93.5
2003   10.80   0.71   10.25   0.29   100.0
2004   11.55   0.68   10.60   0.32   105.9
2005   12.15   0.67   10.95   0.33   110.7
2006   12.85   0.65   11.45   0.35   116.5
2007   13.70   0.65   11.90   0.35   123.2
2008   14.75   0.62   12.55   0.38   131.4
2009   15.45   0.60   13.45   0.40   138.5

>> END EXAMPLE  TRY PROBLEM 16-12 (pg. 723)
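A short Python sketch of Equation 16.3, using the Example 16-3 wage data, is shown below. The function and group names are our own; the key point is that the weights come from the period being indexed.

```python
# Example 16-3 base-period (2003) wage rates
p0 = {"production": 10.80, "admin": 10.25}

def paasche(prices, weights):
    """Equation 16.3: current-period weights applied to current and base prices."""
    numerator = sum(weights[g] * prices[g] for g in prices)
    denominator = sum(weights[g] * p0[g] for g in prices)
    return numerator / denominator * 100

p2009 = {"production": 15.45, "admin": 13.45}  # 2009 wage rates
q2009 = {"production": 0.60, "admin": 0.40}    # 2009 employment mix
print(round(paasche(p2009, q2009), 1))         # prints 138.5
```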
The Laspeyres Index  The Paasche Index is computed using the logic that the index for the current period should be compared to a base period with the current period weightings. An alternate index, called the Laspeyres Index, uses the base-period weighting in its computation, as shown in Equation 16.4.

Laspeyres Index

$$I_t = \frac{\sum q_0 p_t}{\sum q_0 p_0} \times 100 \qquad (16.4)$$

where:
q_0 = Weighting percentage at the base period
p_t = Price in time period t
p_0 = Price in the base period

EXAMPLE 16-4  COMPUTING THE LASPEYRES INDEX

Wage Rates  Refer to Example 16-3, in which the managers of a company are interested in knowing how the wage rates have changed in the community in which they are considering building a plant. Two categories of wages are to be analyzed as a package: production hourly wages and administrative/clerical hourly wages. Annual data showing the average hourly wage rate since 2000 are available. Each year, the makeup of the labor market differs in terms of the percentage of employees in the two categories. To compute a Laspeyres Index, use the following steps:

Step 1 Define the variables to be included in the index and gather the time-series data.
The variables are the mean hourly wage for production workers and the mean hourly wage for administrative/clerical workers. Data are collected for the 10-year period through 2009.

Step 2 Select the base period.
Because similar data for another community are available only back to 2003, the company will use 2003 as the base period to make comparisons between the two communities easier.

Step 3 Use Equation 16.4 to compute the Laspeyres Index.
The equation is

$$I_t = \frac{\sum q_0 p_t}{\sum q_0 p_0} \times 100$$

The hourly wage rate for production workers in the base year of 2003 was $10.80, whereas the average hourly administrative/clerical rate was $10.25. In that year, 71% of the workers were classified as production. In 2009, the production hourly rate had increased to $15.45, and the administrative/clerical rate was at $13.45. Equation 16.4 is used to compute the Laspeyres Index for 2009, as follows:

$$I_{2009} = \frac{(0.71)(\$15.45) + (0.29)(\$13.45)}{(0.71)(\$10.80) + (0.29)(\$10.25)} \times 100 = 139.7$$

This means that, overall, the wage rates in this community have increased by 39.7% since the base year of 2003. The following table shows the Laspeyres Indexes for all years.

Year   Production Wage Rate ($)   Percent Production   Admin./Clerical Wage Rate ($)   Percent Admin./Clerical   Laspeyres Index
2000    8.50   0.78    9.10   0.22    81.5
2001    9.10   0.73    9.45   0.27    86.5
2002   10.00   0.69    9.80   0.31    93.4
2003   10.80   0.71   10.25   0.29   100.0
2004   11.55   0.68   10.60   0.32   106.0
2005   12.15   0.67   10.95   0.33   110.9
2006   12.85   0.65   11.45   0.35   116.9
2007   13.70   0.65   11.90   0.35   123.8
2008   14.75   0.62   12.55   0.38   132.6
2009   15.45   0.60   13.45   0.40   139.7

>> END EXAMPLE  TRY PROBLEM 16-13 (pg. 723)
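The companion sketch for Equation 16.4, again with the Example 16-4 data and our own naming, is below. The only difference from the Paasche sketch is where the weights come from: the base period here, the current period there.

```python
# Example 16-4: same wage data, but the weights are frozen at the 2003 base mix
p0 = {"production": 10.80, "admin": 10.25}  # base-period (2003) wage rates
q0 = {"production": 0.71, "admin": 0.29}    # base-period (2003) employment mix

def laspeyres(prices):
    """Equation 16.4: base-period weights applied to every period's prices."""
    numerator = sum(q0[g] * prices[g] for g in prices)
    denominator = sum(q0[g] * p0[g] for g in prices)
    return numerator / denominator * 100

print(round(laspeyres({"production": 15.45, "admin": 13.45}), 1))  # prints 139.7
```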
Commonly Used Index Numbers

In addition to converting time-series data to index numbers, you will encounter a variety of indexes in your professional and personal life.

Consumer Price Index  To most of us, inflation has come to mean increased prices and less purchasing power for our dollar. The Consumer Price Index (CPI) attempts to measure the overall changes in retail prices for goods and services. The CPI, originally published in 1913 by the U.S. Department of Labor, uses a "market basket" of goods and services purchased by a typical wage earner living in a city. The CPI, a weighted aggregate index similar to a Laspeyres Index, is based on items grouped into seven categories, including food, housing, clothing, transportation, medical care, entertainment, and miscellaneous items. The items in the market basket have changed over time to keep pace with the buying habits of our society and as new products and services have become available. Since 1945, the base period used to construct the CPI has been updated. Currently, the base period, 1982 to 1984, has an index of 100. Table 16.2 shows the CPI index values for 1996 to 2008. For instance, the index for 2005 is 195.3, which means that the price of the market basket of goods increased 95.3% between 1984 and 2005.

TABLE 16.2 | CPI Index (1996 to 2008)

Year   1996    1997    1998    1999    2000    2001    2002    2003    2004    2005    2006    2007    2008
CPI    156.9   157.0   160.5   166.6   172.2   177.1   179.9   184.0   188.9   195.3   201.6   207.3   215.3

Base = 1982 to 1984 (Index = 100)
Source: Bureau of Labor Statistics: http://www.bls.gov/cpi/home.htm#data

Remember also that you cannot determine the inflation rate by subtracting index values for successive years. Instead, you must divide the difference by the earlier year's index. For instance, the rate of inflation between 2004 and 2005 was

$$\text{Inflation rate} = \frac{195.3 - 188.9}{188.9} \times 100 = 3.39\%$$

Thus, in general terms, if your income did not increase by at least 3.39% between 2004 and 2005, you failed to keep pace with inflation and your purchasing power was reduced.

Producer Price Index  The U.S. Bureau of Labor Statistics publishes the Producer Price Index (PPI) on a monthly basis to measure the rate of change in nonretail prices. Like the CPI, the PPI is a Laspeyres weighted aggregate index. This index is used as a leading indicator of upcoming changes in the CPI. Table 16.3 shows the PPI between 1996 and 2005.

TABLE 16.3 | PPI Index (1996 to 2005)

Year   1996    1997    1998    1999    2000    2001    2002    2003    2004    2005
PPI    127.7   127.6   124.4   125.5   132.7   134.2   131.1   138.1   142.7   157.4

Base = 1984 (Index = 100)
Source: Bureau of Labor Statistics: http://www.bls.gov/ppi/home.htm#data

Stock Market Indexes  Every night on the national and local TV news, reporters tell us what happened on the stock market that day by reporting on the Dow Jones Industrial Average (DJIA). The Dow, as this index is commonly referred to, is not the same type of index as the CPI or PPI, in that it is not a percentage of a base year. Rather, the DJIA is the sum of the stock prices for 30 large industrial companies whose stocks trade on the New York Stock Exchange divided by a factor that is adjusted for stock splits. Many analysts use the DJIA, which is computed daily, as a measure of the health of the stock market. Other analysts prefer other indexes, such as the Standard and Poor's 500 (S&P 500). The S&P 500 includes stock prices for 500 companies and is thought by some to be more representative of the broader market. The NASDAQ is an index made up of stocks on the NASDAQ exchange and is heavily influenced by technology-based companies that are traded on this exchange. Publications such as The Wall Street Journal and Barron's publish all these indexes and others every day for investors to use in their investing decisions.
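The warning about subtracting index values is easy to respect in code. A small Python sketch using two of the Table 16.2 CPI values:

```python
cpi = {2004: 188.9, 2005: 195.3}  # CPI values from Table 16.2

# Wrong: 195.3 - 188.9 = 6.4 is not the inflation rate.
# Right: divide the change by the earlier year's index.
inflation = (cpi[2005] - cpi[2004]) / cpi[2004] * 100
print(round(inflation, 2))  # prints 3.39 (percent)
```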
Using Index Numbers to Deflate a Time Series

A common use of index numbers is to convert values measured at different times into more directly comparable values. For instance, if your wages increase, but at a rate less than inflation, you will in fact be earning less in "real terms." Likewise, a company whose sales are increasing at a rate less than inflation is actually not growing in "real terms."

BUSINESS APPLICATION  DEFLATING TIME-SERIES VALUES USING INDEX VALUES

WYMAN-GORMAN COMPANY  The Wyman-Gorman Company, located in Massachusetts, designs and produces forgings, primarily for internal combustion engines. The company has recently been experiencing some financial difficulty and has discontinued its agricultural and earthmoving divisions. Table 16.4 shows sales in millions of dollars for the company since 1996. Also shown is the PPI (Producer Price Index) for the same years. Finally, sales, adjusted to 1984 dollars, are also shown. Equation 16.5 is used to determine the adjusted time-series values.

Deflation Formula

$$y_{\mathrm{adj}_t} = \frac{y_t}{I_t} \times 100 \qquad (16.5)$$

where:
y_adj_t = Deflated time-series value at time t
y_t = Actual value of the time series at time t
I_t = Index (such as CPI or PPI) at time t

For instance, in 1996 sales were $610.3 million. The PPI for that year was 127.7. The sales figure, adjusted to 1984 dollars, is

$$y_{\mathrm{adj}_{1996}} = \frac{610.3}{127.7} \times 100 = \$477.9$$

TABLE 16.4 | Deflated Sales Data—Using Producer Price Index (PPI)

Year   Sales ($ millions)   PPI (Base = 1984)   Sales ($ millions, adjusted to 1984 dollars)
1996   610.3                127.7               477.9
1997   473.1                127.6               370.8
1998   383.5                124.4               308.3
1999   425.5                125.5               339.0
2000   384.1                132.7               289.4
2001   341.1                134.2               254.2
2002   310.3                131.1               236.7
2003   271.6                138.1               196.7
2004   371.6                142.7               260.4
2005   390.2                157.4               247.9
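Equation 16.5 applied in code reproduces the last column of Table 16.4. A minimal Python sketch, with the table values typed in directly:

```python
# Wyman-Gorman sales ($ millions) and PPI (base = 1984) from Table 16.4
data = [
    (1996, 610.3, 127.7), (1997, 473.1, 127.6), (1998, 383.5, 124.4),
    (1999, 425.5, 125.5), (2000, 384.1, 132.7), (2001, 341.1, 134.2),
    (2002, 310.3, 131.1), (2003, 271.6, 138.1), (2004, 371.6, 142.7),
    (2005, 390.2, 157.4),
]

# Equation 16.5: deflated value = (y_t / I_t) * 100
for year, sales, ppi in data:
    print(year, round(sales / ppi * 100, 1))  # 1996 prints 477.9
```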
16-1: Exercises

Skill Development

16-1. What is meant by time-series data? Give an example.
16-2. Explain the difference between time-series data and cross-sectional data. Are these two types of data sets mutually exclusive? What do they have in common? How do they differ?
16-3. What are the differences between quantitative and qualitative forecasting techniques? Under what conditions is it appropriate to use a quantitative technique?
16-4. Provide an example of a business decision that requires (1) a short-term forecast, (2) a medium-term forecast, and (3) a long-term forecast.
16-5. What is meant by the trend component of a time series? How is a linear trend different from a nonlinear trend?
16-6. Must a seasonal component be associated with the seasons (fall, spring, summer, winter) of the year? Provide an example of a seasonal effect that is not associated with the seasons of the year.
16-7. A Greek entrepreneur followed the olive harvests. He noted that olives ripen in September. Each March he would try to determine if the upcoming olive harvest would be especially bountiful. If his analysis indicated it would, he would enter into agreements with the owners of all the olive oil presses in the region. In exchange for a small deposit months ahead of the harvest, he would obtain the right to lease the presses at market prices during the harvest. If he was correct about the harvest and demand for olive oil presses boomed, he could make a great deal of money. Identify the following quantities in the context of this scenario:
a. forecasting horizon
b. category that applies to the forecasting horizon identified in part a
c. forecasting period
d. forecasting interval
16-8. Consider the following median selling prices ($ thousands) for homes in a community:

Year    1     2     3     4     5     6     7     8     9     10
Price   320   334   329   344   358   347   383   404   397   411

a. Use year 1 as a base year and construct a simple index number to show how the median selling price has increased.
b. Determine the actual percentage growth in the median selling price between the base year and year 10.
c. Determine the actual percentage growth in the median selling price between the base year and year 5.
d. Determine the actual percentage growth in the median selling price between year 5 and year 10.
16-9. The following values represent advertising rates paid by a regional catalog retailer that advertises either on radio or in newspapers:

Year   Radio Rates ($)   Newspaper Rates ($)
1      300               400
2      310               420
3      330               460
4      346               520
5      362               580
6      380               640
7      496               660

a. Determine a relative index for each type of advertisement using year 1 as the base year.
b. Determine an unweighted aggregate index for the two types of advertisement.
c. In year 7 the retailer spent 30% of the advertisement budget on radio advertising. Construct a Laspeyres index for the data.
d. Using year 1 as the base, construct a Paasche index for the same data.

Business Applications

Problems 16-10 through 16-13 refer to Gallup Construction and Paving, a company whose primary business has been constructing homes in planned communities in the upper Midwest. The company has kept a record of the relative cost of labor and materials in its market areas for the last 11 years. These data are as follows:

Year   Hourly Wages ($)   Average Material Cost ($)
1999   30.10              66,500
2000   30.50              68,900
2001   31.70              70,600
2002   32.50              70,900
2003   34.00              71,200
2004   35.50              71,700
2005   35.10              72,500
2006   35.05              73,700
2007   34.90              73,400
2008   33.80              74,100
2009   34.20              74,000

[Figure: Retail Forward Future Spending Index™ (December 2005 = 100) — monthly index values plotted from June 2005 through June 2006, ranging from roughly 94 to 108.]

16-10. Using 1999 as the base year, construct a separate index for each component in the construction of a house.
16-11. Plot both series of data and comment on the trend you see in both plots.
16-12. Construct a Paasche index for 2004 using the data. Use 1999 as the base year and assume that in 2004 60% of the cost of a townhouse was in materials.
16-13. Construct a Laspeyres index using the data, assuming that in 1999, 40% of the cost of a townhouse was labor.
16-14. Retail Forward, Inc., is a global management consulting and market research firm specializing in retail intelligence and strategies. One of its press releases (June Consumer Outlook: Spending Plans Show Resilience, June 1,
2006) divulged the results of the Retail Forward ShopperScape™ survey conducted each month from a sample of 4,000 U.S. primary household shoppers. A measure of consumer spending is represented by the figure at the top of the page.
a. Describe the type of index used by Retail Forward to explore consumer spending.
b. Determine the actual percentage change in the Future Spending Index between December 2005 and June 2006.
c. Determine the actual percentage change in the Future Spending Index between June 2005 and June 2006.

Computer Database Exercises

16-15. The Energy Information Administration (EIA), created by Congress in 1977, is a statistical agency of the U.S. Department of Energy. It provides data, forecasts, and analyses to promote sound policymaking and public understanding regarding energy and its interaction with the economy and the environment. The price of the sources of energy is becoming more and more important as our natural resources are consumed. The file entitled Prices contains data for the period 1993–2008 concerning the price of gasoline ($/gal.), natural gas ($/cu. ft.), and electricity (cents/kilowatt hr.).
a. Using 1993 as the base, calculate an aggregate energy price index for these three energy costs.
b. Determine the actual percentage change in the aggregate energy prices between 1993 and 2008.
c. Determine the actual percentage change in the aggregate energy prices between 1998 and 2008.
16-16. The federal funds rate is the interest rate charged by banks when banks borrow "overnight" from each other. The funds rate fluctuates according to supply and demand and is not under the direct control of the Federal Reserve Board, but is strongly influenced by the Fed's actions. The file entitled The Fed contains the federal funds rates for the period 1955–2008.
a. Construct a time-series plot for the federal funds rate for the period 1955–2008.
b. Describe the time-series components that are present in the data set.
c. Indicate the recurrence periods for any seasonal or cyclical components.
16-17. The Census Bureau of the Department of Commerce released the U.S. retail e-commerce sales for the period Fourth Quarter 1999–Fourth Quarter 2008. The file entitled E-Commerce contains the data.
a. Using the fourth quarter of 1999 as the base, calculate a Laspeyres Index for the retail sales for the period Fourth Quarter 1999–Fourth Quarter 2008.
b. Determine the actual percentage change in the retail sales for the period Fourth Quarter 1999–First Quarter 2004.
c. Determine the actual percentage change in the retail sales for the period First Quarter 2004–First Quarter 2006.

END EXERCISES 16-1

16.2 Trend-Based Forecasting Techniques

As we discussed in Section 16.1, some time series exhibit an increasing or decreasing trend. Further, the trend may be linear or nonlinear. A plot of the data will be very helpful in identifying which, if any, of these trends exist.

Developing a Trend-Based Forecasting Model

In this section, we introduce trend-based forecasting techniques. As the name implies, these techniques are used to identify the presence of a trend and to model that trend. Once the trend model has been defined, it is used to provide forecasts for future time periods.

Chapter Outcomes 3 and 4

BUSINESS APPLICATION  LINEAR TREND FORECASTING

THE TAFT ICE CREAM COMPANY  The Taft Ice Cream Company is a family-operated company selling gourmet ice cream to resort areas, primarily on the North Carolina coast.
Figure 16.5 displays the annual sales data for the 10-year period 1997–2006 and shows the time-series plot illustrating that sales have trended up in the 10-year period. These data are in a file called Taft. Taft's owners are considering expanding their ice cream manufacturing facilities. As part of the bank's financing requirements, the managers are asked to supply a forecast of future sales. Recall from our earlier discussions that the forecasting process has three steps: (1) model specification, (2) model fitting, and (3) model diagnosis.

Step 1 Model Specification
The time-series plot in Figure 16.5 indicates that sales have exhibited a linear growth pattern. A possible forecasting tool is a linear trend (straight-line) model.

Step 2 Model Fitting
Because we have specified a linear trend model, the process of fitting can be accomplished using least squares regression analysis of a form described by Equation 16.6.

[FIGURE 16.5 | Excel 2007 Output Showing Taft Ice Cream Sales Trend Line]
Excel 2007 Instructions:
1. Open file: Taft.xls.
2. Select data in the Sales data column.
3. Click on Insert > Line Chart.
4. Click on Select Data.
5. Under Horizontal (Categories) Axis Labels, select data in Year column.
6. Click on Layout > Chart Title and enter desired title.
7. Click on Layout > Axis Titles and enter horizontal and vertical axis titles.
Minitab Instructions (for similar results):
1. Open file: Taft.MTW.
2. Choose Graph > Time Series Plot.
3. Select Simple.
4. Under Series, enter the time series column.
5. Click Time/Scale. Under Time Scale, select Calendar and Year; under Start Values, insert the starting year.
6. Click OK. OK.

Linear Trend Model

$$y_t = \beta_0 + \beta_1 t + \varepsilon_t \qquad (16.6)$$

where:
y_t = Value of the trend at time t
β_0 = y intercept of the trend line
β_1 = Slope of the trend line
t = Time period (t = 1, 2, . . .)
ε_t = Model error at time t

We let the first period in the time series be t = 1, the second period be t = 2, and so forth. The values for time form the independent variable, with sales being the dependent variable. Referring to Chapter 14, the least squares estimates for the slope and intercept are given by Equations 16.7 and 16.8. Here the sums are taken over the values of t (t = 1, 2, . . . , n).

Least Squares Equations Estimates

$$b_1 = \frac{\sum t y_t - \dfrac{\sum t \sum y_t}{n}}{\sum t^2 - \dfrac{(\sum t)^2}{n}} \qquad (16.7)$$

$$b_0 = \frac{\sum y_t}{n} - b_1 \frac{\sum t}{n} \qquad (16.8)$$

where:
n = Number of periods in the time series
t = Time period (independent variable)
y_t = Dependent variable at time t

The linear regression procedures in either Excel or Minitab can be used to compute the least squares trend model. Figure 16.6 shows the Excel output for the Taft Ice Cream Company example. The least squares trend model for the Taft Company is

$$\hat{y}_t = b_0 + b_1 t = 277{,}333.33 + 14{,}575.76\,t$$

For a forecast, we use F_t as the forecast value or predicted value at time period t. Thus,

$$F_t = 277{,}333.33 + 14{,}575.76\,t$$

Step 3 Model Diagnosis
The linear trend regression output in Figure 16.6 offers some conclusions about the potential capabilities of our model. The R-squared value of 0.9123 shows that for these 10 years of data, the linear trend model explains more than 91% of the variation in sales. The p-value for the regression slope coefficient, to four decimal places, is 0.0000. This means that time (t) can be used to explain a significant portion of the variation in sales. Figure 16.7 shows the plot of the trend line through the data. You can see the trend model fits the historical data quite closely. Although these results are a good sign, the model diagnosis step requires further analysis.
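Equations 16.7 and 16.8 are simple enough to apply without regression software. The Python sketch below is a minimal implementation; the sales list is an illustrative placeholder, not the contents of the Taft file, so its coefficients will not match Figure 16.6 exactly.

```python
def fit_linear_trend(y):
    """Least squares slope and intercept from Equations 16.7 and 16.8."""
    n = len(y)
    t = list(range(1, n + 1))                 # time periods t = 1, 2, ..., n
    sum_t, sum_y = sum(t), sum(y)
    sum_ty = sum(ti * yi for ti, yi in zip(t, y))
    sum_t2 = sum(ti * ti for ti in t)

    b1 = (sum_ty - sum_t * sum_y / n) / (sum_t2 - sum_t ** 2 / n)  # Eq. 16.7
    b0 = sum_y / n - b1 * sum_t / n                                # Eq. 16.8
    return b0, b1

# Hypothetical annual sales; only a few of the actual Taft values are quoted in the text
sales = [300_000, 295_000, 315_000, 345_000, 355_000,
         370_000, 385_000, 400_000, 395_000, 430_000]

b0, b1 = fit_linear_trend(sales)
print(f"F_t = {b0:,.2f} + {b1:,.2f} t")
```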
[FIGURE 16.6 | Excel 2007 Output for Taft Ice Cream Trend Model — regression output with the linear trend equation Sales = 277,333.33 + 14,575.76(t).]
Excel 2007 Instructions:
1. Open file: Taft.xls.
2. Select Data > Data Analysis.
3. Click on Regression.
4. Enter range for the y variable (Sales).
5. Enter range for the x variable (t = 1, 2, . . .).
6. Specify output location.
7. Click on Labels.
Minitab Instructions (for similar results):
1. Open file: Taft.MTW.
2. Choose Stat > Regression > Regression.
3. In Response, enter the time series column, Sales; in Predictors, enter the time variable column, t.
4. Click OK.

[FIGURE 16.7 | Excel 2007 Output for Taft Ice Cream Trend Line — the sales data plotted with the fitted linear trend line F_t = 277,333.33 + 14,575.76(t).]
Excel 2007 Instructions:
1. Open file: Taft.xls.
2. Select the Sales data.
3. Click on Insert > Line Chart.
4. Click on Select Data.
5. Under Horizontal (Categories) Axis Labels, select data in Year column.
6. Click on Layout > Chart Title and enter desired title.
7. Click on Layout > Axis Titles and enter horizontal and vertical axes titles.
8. Select the data.
9. Right-click and select Add Trendline > Linear.
10. To set color, go to Trend Line Options (see step 9).
Minitab Instructions (for similar results):
1. Open file: Taft.MTW.
2. Choose Stat > Time Series > Trend Analysis.
3. In Variable, enter the time series column.
4. Under Model Type choose Linear. Click OK.

Comparing the Forecast Values to the Actual Data

The slope of the trend line indicates that the Taft Ice Cream Company has experienced an average increase in sales of $14,575.76 per year over the 10-year period. The linear trend model's fitted sales values for periods t = 1 through t = 10 can be found by substituting for t in the following forecast equation:

$$F_t = 277{,}333.33 + 14{,}575.76\,t$$

For example, for t = 1, we get

$$F_1 = 277{,}333.33 + 14{,}575.76(1) = \$291{,}909.09$$

Note that the actual sales figure, y_1, for period 1 was $300,000. The difference between the actual sales in time t and the forecast values in time t, found using the trend model, is called the forecast error or the residual. Figure 16.8 shows the forecasts for periods 1 through 10 and the forecast errors at each period.

Computing the forecast error by comparing the trend-line values with actual past data is an important part of the model diagnosis step. The errors measure how closely the model fits the actual data at each point. A perfect fit would lead to residuals of 0 each time. We would like to see small residuals and an overall good fit. Two commonly used measures of fit are the mean squared residual, or mean squared error (MSE), and the mean absolute deviation (MAD). These measures are computed using Equations 16.9 and 16.10, respectively. MAD measures the average magnitude of the forecast errors. MSE is a measure of the variability in the forecast errors. The forecast error is the observed value, y_t, minus the predicted value, F_t.

Mean Squared Error

$$\text{MSE} = \frac{\sum (y_t - F_t)^2}{n} \qquad (16.9)$$

Mean Absolute Deviation

$$\text{MAD} = \frac{\sum |y_t - F_t|}{n} \qquad (16.10)$$

where:
y_t = Actual value at time t
F_t = Predicted value at time t
n = Number of time periods

[FIGURE 16.8 | Excel 2007 Residual Output for Taft Ice Cream — the regression residuals (residual = forecast error) for periods 1 through 10.]
Excel 2007 Instructions:
1. Follow Figure 16.6 Excel Instructions.
2. In Regression procedure, click on Residuals.
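Equations 16.9 and 16.10 translate directly into code. This Python sketch reuses the hypothetical sales list from the earlier fitting sketch, together with the trend coefficients reported in the text; with the actual Taft data in place of the placeholders, the results would match Figure 16.9.

```python
def mse(actual, forecast):
    """Mean squared error, Equation 16.9."""
    return sum((y - f) ** 2 for y, f in zip(actual, forecast)) / len(actual)

def mad(actual, forecast):
    """Mean absolute deviation, Equation 16.10."""
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)

# Trend coefficients reported in the text for the Taft model
b0, b1 = 277_333.33, 14_575.76
forecasts = [b0 + b1 * t for t in range(1, 11)]

# Hypothetical actual sales (placeholders, as in the fitting sketch)
sales = [300_000, 295_000, 315_000, 345_000, 355_000,
         370_000, 385_000, 400_000, 395_000, 430_000]

print(f"MSE = {mse(sales, forecasts):,.2f}")
print(f"MAD = {mad(sales, forecasts):,.2f}")
```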
[FIGURE 16.9 | Excel 2007 MSE and MAD Computations for Taft Ice Cream — MSE = Σ(y_t − F_t)²/n = 168,515,151.52 and MAD = Σ|y_t − F_t|/n = 11,042.42.]
Excel 2007 Instructions:
1. Follow Figure 16.6 Excel Instructions.
2. In Regression procedure, click on Residuals.
3. Create a new column of squared residuals.
4. Create a column of absolute values of the residuals using the ABS function in Excel.
5. Use Equations 16.9 and 16.10 to calculate MSE and MAD. (Excel equations: Squared residual = C25^2; Absolute value = ABS(C25).)

Figure 16.9 shows the MSE and MAD calculations using Excel for the Taft Ice Cream example. The MAD value of $11,042 indicates the linear trend model has an average absolute error of $11,042 per period. The MSE (in squared units) equals 168,515,151.52. The square root of the MSE (often referred to as RMSE, the root mean square error) is $12,981.34, and although it is not equal to the MAD value, it does provide similar information about the relationship between the forecast values and the actual values of the time series.¹ These error measures are particularly helpful when comparing two or more forecasting techniques. We can compute the MSE and/or the MAD for each forecasting technique. The forecasting technique that gives the smallest MSE or MAD is generally considered to provide the best fit.

¹Technically, this is the square root of the average squared distance between the forecasts and the observed data values. Algebraically, of course, this is not the same as the average forecast error, but it is comparable.

Autocorrelation

Autocorrelation: Correlation of the error terms (residuals); occurs when the residuals at points in time are related.

In addition to examining the fit of the forecasts to the actual time series, the model-diagnosis step also should examine how a model meets the assumptions of the regression model. One regression assumption is that the error terms are uncorrelated, or independent. When using regression with time-series data, the assumption of independence could be violated; that is, the error terms may be correlated over time. We call this serial correlation, or autocorrelation. When dealing with a time-series variable, the value of y at time period t is commonly related to the value of y at previous time periods. If a relationship between y_t and y_{t−1} exists, we conclude that first-order autocorrelation exists. If y_t is related to y_{t−2}, second-order autocorrelation exists, and so forth. If the time-series values are autocorrelated, the assumption that the error terms are independent is violated.

The autocorrelation can be positive or negative. For instance, when the values are first-order positively autocorrelated, we expect a positive residual to be followed by a positive residual in the next period, and we expect a negative residual to be followed by another negative residual. With negative first-order autocorrelation, we expect a positive residual to be followed by a negative residual, followed by a positive residual, and so on.

The presence of autocorrelation can have adverse consequences on tests of statistical significance in a regression model. Thus, you need to be able to detect the presence of autocorrelation and take action to remove the problem. The Durbin-Watson statistic, which is shown in Equation 16.11, is used to test whether residuals are autocorrelated.

Durbin-Watson Statistic

$$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \qquad (16.11)$$

where:
d = Durbin-Watson test statistic
e_t = (y_t − ŷ_t) = Residual at time t
n = Number of time periods in the time series

Figure 16.10 shows the Minitab output providing the Durbin-Watson statistic for the Taft Ice Cream data, as follows:

$$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} = 2.65$$
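Equation 16.11 is also straightforward to compute directly from a model's residuals. In this Python sketch the residual list is illustrative only; in practice, e_t = y_t − F_t would come from the fitted trend model, as in Figure 16.8.

```python
def durbin_watson(residuals):
    """Durbin-Watson d statistic, Equation 16.11."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Illustrative residuals only (note the alternating signs)
e = [5.0, -3.0, 4.0, -6.0, 2.0, -1.0, 3.0, -4.0, 1.0, -2.0]
print(round(durbin_watson(e), 2))  # prints 3.18; sign changes push d above 2
```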
Examining Equation 16.11, we see that if successive values of the residual are close in value, the Durbin-Watson d statistic will be small. This situation would describe one in which residuals are positively correlated.

The Durbin-Watson statistic can have a value ranging from 0 to 4. A value of 2 indicates no autocorrelation. However, like any other statistic computed from a sample, the Durbin-Watson d is subject to sampling error. We may wish to test formally to determine whether positive autocorrelation exists:

H0: ρ = 0
HA: ρ > 0

[FIGURE 16.10 | Minitab Output—Durbin-Watson Statistic: Taft Ice Cream Company Example]
Minitab Instructions:
1. Open file: Taft.MTW.
2. Choose Stat > Regression > Regression.
3. In Response, enter the time series column, Sales.
4. In Predictors, enter the time variable column, t.
5. Select Options.
6. Under Display, select Durbin-Watson statistic.
7. Click OK. OK.
Excel 2007 Instructions (for similar results):
1. Open file: Taft.xls.
2. Click on Add-Ins.
3. Select PHStat.
4. Select Regression > Simple Linear Regression.
5. Define y variable.
6. Define x variable.
7. Check box for Durbin-Watson statistic.
8. Click OK.

If the d statistic is too small, we will reject the null hypothesis and conclude that positive autocorrelation exists. If the d statistic is too large, we will not reject and will not be able to conclude that positive autocorrelation exists. Appendix O contains a table of one-tailed Durbin-Watson critical values for the α = 0.05 and α = 0.01 levels. (Note: The critical values in Appendix O are for one-tailed tests with α = 0.05 or 0.01. For a two-tailed test, the alpha is doubled.) The Durbin-Watson table provides two critical values: dL and dU. In this test for positive autocorrelation, the decision rule is:

If d < dL, reject H0 and conclude that positive autocorrelation exists.
If d > dU, do not reject H0 and conclude that no positive autocorrelation exists.
If dL ≤ d ≤ dU, the test is inconclusive.

The Durbin-Watson test is not reliable for sample sizes smaller than 15. Therefore, for the Taft Ice Cream Company application, we are unable to conduct the hypothesis test for autocorrelation. However, Example 16-5 shows a Durbin-Watson test carried out.

EXAMPLE 16-5  TESTING FOR AUTOCORRELATION

Banion Automotive, Inc.  Banion Automotive, Inc., has supplied parts to General Motors since the company was founded in 1992. During this time, revenues from the General Motors account have grown steadily. Figure 16.11 displays the data in a time-series plot. The data are in a file called Banion Automotive. Recently the managers of the company developed a linear trend regression model they hope to use to forecast revenue for the next two years to determine whether they can support adding another production line to their Ohio factory. They are now interested in determining whether the linear model is subject to positive autocorrelation. To test for this, the following steps can be used:

Step 1 Specify the model.
Based on a study of the line chart, the forecasting model is to be a simple linear trend regression model, with revenue as the dependent variable and time (t) as the independent variable.

Step 2 Fit the model.
Because we have specified a linear trend model, the process of fitting can be accomplished using least squares regression analysis and Excel or Minitab to estimate the slope and intercept for the model.

[FIGURE 16.11 | Time-Series Plot of Banion Automotive Revenue Data — annual revenue in millions of dollars, 1992 through 2009.]

Fitting the 18 data points with a least squares line, we find the following:

$$F_t = 5.0175 + 3.3014\,t$$

Step 3 Diagnose the model.
The following values were also found:

R² = 0.935
F-statistic = 230.756
Standard error = 4.78

The large F-statistic indicates that the model explains a significant amount of variation in revenue over time. However, looking at a plot of the trend line shown in Figure 16.12, we see a pattern of actual revenue values first above, and then below, the trend line. This pattern indicates possible autocorrelation among the error terms. We will test for autocorrelation by calculating the Durbin-Watson d statistic. Both Minitab and the PHStat add-in for Excel have the option to generate the Durbin-Watson statistic. The output, shown in Figure 16.13, gives the Durbin-Watson d statistic as d = 0.661. The null and alternative hypotheses for testing for positive autocorrelation are

H0: ρ = 0
HA: ρ > 0

We next go to the Durbin-Watson table (Appendix O) for α = 0.05, sample size n = 18, and number of independent variables p = 1. The values from the table for dL and dU are

dL = 1.16 and dU = 1.39

The decision rule for testing whether we have positive autocorrelation is:

If d < 1.16, reject H0 and conclude that positive autocorrelation exists.
If d > 1.39, conclude that no positive autocorrelation exists.
If 1.16 ≤ d ≤ 1.39, the test is inconclusive.

[FIGURE 16.12 | Banion Automotive Trend Line — annual revenue in millions of dollars, 1992 through 2009, with the fitted linear trend line.]

[FIGURE 16.13 | Excel 2007 (PHStat) Output—Durbin-Watson Statistic for Banion Automotive]
Excel 2007 Instructions:
1. Open file: Banion Automotive.xls.
2. Select Add-Ins.
3. Select PHStat.
4. Select Regression > Simple Linear Regression.
5. Define y variable data range.
6. Specify x variable data range (time = t values).
7. Click on Durbin-Watson Statistic.
8. Click OK.
Minitab Instructions (for similar results):
1. Open file: Banion Automotive.MTW.
2. Choose Stat > Regression > Regression.
3. In Response, enter the time series column, Revenue.
4. In Predictors, enter the time variable column, t.
5. Select Options.
6. Under Display, select Durbin-Watson statistic.
7. Click OK. OK.

Because d = 0.661 < dL = 1.16, we must reject the null hypothesis and conclude that significant positive autocorrelation exists in the regression model. This means that the assumption of uncorrelated error terms has been violated in this case. Thus, the linear trend model is not the appropriate model to provide the annual revenue forecasts for the next two years. There are several techniques for dealing with the problem of autocorrelation. Some of these are beyond the scope of this text. (Refer to books by Nelson and Wonnacott.)
However, one option is to attempt to fit a nonlinear trend to the data, which is discussed starting on page 734.

>> END EXAMPLE

TRY PROBLEM 16-18 (pg. 747)

True Forecasts
Although a decision maker is interested in how well a forecasting technique can fit historical data, the real test comes with how well it forecasts future values. Recall that in the Taft example, we had 10 years of historical data. If we wish to forecast ice cream sales for year 11 using the linear trend model, we substitute t = 11 into the forecast equation to produce a forecast as follows:

F11 = 277,333.33 + 14,575.76(11) = $437,666.69

This method of forecasting is called trend projection. To determine how well our trend model actually forecasts, we would have to wait until the actual sales amount for period 11 is known.

As we just indicated, a model's true forecasting ability is determined by how well it forecasts future values, not by how well it fits historical values. However, having to wait until after the forecast period to know how effective a forecast is doesn't help us assess a model's effectiveness ahead of time. This problem can be partially overcome by using split samples, which involves dividing a time series into two groups. You put the first (n1) periods of historical data in the first group. These (n1) periods will be used to develop the forecasting model. The second group contains the remaining (n2) periods of historical data, which will be used to test the model's forecasting ability. These data are called the holdout data. Usually, three to five periods are held out, depending on the total number of periods in the time series.

In the Taft Ice Cream business application, we have only 10 years of historical data, so we will hold out the last three periods and use the first seven periods to develop the linear trend model. The computations are performed as before, using Excel or Minitab or Equations 16.7 and 16.8. Because we are using a different data set to develop the linear equation, we get a slightly different trend line than when all 10 periods were used. The new trend line is

Ft = 277,142.85 + 14,642.85(t)

This model is now used to provide forecasts for periods 8 through 10 by using trend projection. These forecasts are

Year t | Actual yt | Forecast Ft | Error (yt - Ft)
8 | 400,000 | 394,285.65 | 5,714.35
9 | 395,000 | 408,928.50 | -13,928.50
10 | 430,000 | 423,571.35 | 6,428.65

Then we can compute the MSE and the MAD values for periods 8 through 10:

MSE = [(5,714.35)² + (-13,928.50)² + (6,428.65)²]/3 = 89,328,149.67

and

MAD = (|5,714.35| + |-13,928.50| + |6,428.65|)/3 = 8,690.50

These values could be compared with those produced using other forecasting techniques or evaluated against the forecaster's own standards. Smaller values are preferred. Other factors should also be considered. For instance, in some cases, the forecast values might tend to be higher (or lower) than the actual values. This may imply the linear trend model isn't the best model to use. Forecasting models that tend to over- or underforecast are said to contain forecast bias. Equation 16.12 is used as an estimator of the bias.

Forecast Bias

Forecast bias = Σ(yt - Ft)/n    (16.12)

The forecast bias can be either positive or negative. A positive value indicates a tendency to underforecast. A negative value indicates a tendency to overforecast. The estimated bias taken from the forecasts for periods 8 through 10 in our example is

Forecast bias = [(5,714.35) + (-13,928.50) + (6,428.65)]/3 = -595.17

This means that, on average, the model overforecasts sales by $595.17.
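Because the three holdout errors are listed above, the MSE, MAD, and bias computations can be verified with a few lines of code. A minimal Python sketch, using the actual and forecast values from the Taft holdout table:

```python
# Holdout-period accuracy measures for the Taft example.
# Actuals and trend-projection forecasts for periods 8-10 are from the text.

actual   = [400_000, 395_000, 430_000]
forecast = [394_285.65, 408_928.50, 423_571.35]

errors = [a - f for a, f in zip(actual, forecast)]

mse  = sum(e ** 2 for e in errors) / len(errors)   # ~89,328,149.67
mad  = sum(abs(e) for e in errors) / len(errors)   # ~8,690.50
bias = sum(errors) / len(errors)                   # ~-595.17 (Equation 16.12)

print(f"MSE = {mse:,.2f}  MAD = {mad:,.2f}  Bias = {bias:,.2f}")
```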
Suppose that on the basis of our bias estimate we judge that the linear trend model does an acceptable job in forecasting. Then all available data (periods 1 through 10) would be used to develop a linear trend model (see Figure 16.6), and a trend projection would be used to forecast for future time periods by substituting appropriate values for t into the trend model

Ft = 277,333.33 + 14,575.76(t)

However, if the linear model is judged to be unacceptable, the forecaster will need to try a different technique. For the purpose of the bank loan application, the Taft Ice Cream Company needs to forecast sales for the next three years (periods 11 through 13). Assuming the linear trend model is acceptable, these forecasts are

F11 = 277,333.33 + 14,575.76(11) = $437,666.69
F12 = 277,333.33 + 14,575.76(12) = $452,242.45
F13 = 277,333.33 + 14,575.76(13) = $466,818.21

Nonlinear Trend Forecasting
As we indicated earlier, you may encounter a time series that exhibits a nonlinear trend. Figure 16.2 showed an example of a nonlinear trend. When the historical data show a nonlinear trend, you should consider using a nonlinear trend forecasting model. A common method for dealing with nonlinear trends is to use an extension of the linear trend method. This extension calls for making a data transformation before applying the least squares regression analysis.

BUSINESS APPLICATION FORECASTING NONLINEAR TRENDS

HARRISON EQUIPMENT COMPANY Consider Harrison Equipment Company, which leases large construction equipment to contractors in the Southwest. The lease arrangements call for Harrison to perform all repairs and maintenance on this equipment. Figure 16.14 shows a line chart for the repair costs for a crawler tractor leased to a contractor in Phoenix for the past 20 quarters. The data are contained in the file Harrison.

FIGURE 16.14 | Excel 2007 Time-Series Plot for Harrison Equipment Repair Costs (a nonlinear trend is evident)

Excel 2007 Instructions: 1. Open file: Harrison.xls. 2. Select data in the Repair Costs data column. 3. Click on Insert > Line Chart. 4. Click on Select Data. 5. Under Horizontal (categories) Axis Labels, select data in year and quarter columns. 6. Click on Layout > Chart Title and enter desired title. 7. Click on Layout > Axis Titles and enter horizontal and vertical axes titles.

Minitab Instructions (for similar results): 1. Open file: Harrison.MTW. 2. Select Graph > Time Series Plot. 3. Select Simple. 4. Under Series, enter time series column. 5. Click Time/Scale. 6. Under Time Scale, select Calendar and Quarter Year. 7. Under Start Values, insert the starting quarter and year. 8. Click OK. OK.

Model Specification
Harrison Equipment is interested in forecasting future repair costs for the crawler tractor. Recall that the first step in forecasting is model specification. Even though the plot in Figure 16.14 indicates a sharp upward nonlinear trend, the forecaster may start by specifying a linear trend model.

Model Fitting
As a part of the model-fitting step, the forecaster could use Excel's or Minitab's regression procedure to obtain the linear forecasting model shown in Figure 16.15. As shown, the linear trend model is

Ft = -1,022.7 + 570.9(t)

Model Diagnosis
The fit is pretty good, with an R-squared of 0.8214 and a standard error of 1,618.5. But we need to look closer. Figure 16.15 shows a plot of the trend line compared with the actual data. A close inspection indicates the linear trend model may not be best for this case.
Notice that the linear model underforecasts, then overforecasts, then underforecasts again. From this we might suspect positive autocorrelation. We can establish the following null and alternative hypotheses:

H0: ρ = 0
HA: ρ > 0

Equation 16.11 could be used to manually compute the Durbin-Watson d statistic, or more likely, we would use either PHStat or Minitab. The calculated Durbin-Watson is d = 0.505.

FIGURE 16.15 | Excel 2007 (PHStat) Output for the Harrison Equipment Company Linear Trend Model

Excel 2007 Instructions: 1. Open file: Harrison.xls. 2. Select Add-Ins. 3. Select PHStat. 4. Select Regression > Simple Linear Regression. 5. Specify y variable data range. 6. Specify x variable data range (time = t values). 7. Check box for Durbin-Watson Statistic. 8. Copy the Figure 16.14 line chart onto the output and add the linear trend line. 9. Click OK.

Minitab Instructions (for similar results): 1. Open file: Harrison.MTW. 2. Choose Stat > Time Series > Trend Analysis. 3. In Variable, enter the time-series column. 4. Under Model Type, choose Linear. 5. Follow the Minitab instructions in Figure 16.13. 6. Click OK.

The dL critical value from the Durbin-Watson table in Appendix O for α = 0.05 and a sample size of n = 20 and p = 1 independent variable is 1.20. Because d = 0.505 < dL = 1.20, we reject the null hypothesis. We conclude that the error terms are significantly positively autocorrelated. The model-building process needs to be repeated.

Model Specification
After examining Figure 16.15 and determining the results of the test for positive autocorrelation, a nonlinear trend will likely provide a better fit for these data. To account for the nonlinear growth trend, which starts out slowly and then builds rapidly, the forecaster might consider transforming the time variable by squaring t to form a model of the form

y = β0 + β1t + β2t² + ε

This transformation is suggested because the growth in costs appears to be increasing at an increasing rate. Other nonlinear trends may require different types of transformations, such as taking a square root or natural log. Each situation must be analyzed separately. (See the reference by Kutner et al. for further discussion of transformations.)
Model Fitting
Figure 16.16 shows the Excel regression results, and Figure 16.17 shows the revised time-series plot using the polynomial transformation. The resulting nonlinear trend regression model is

Ft = 2,318.7 - 340.4(t) + 43.4(t²)

FIGURE 16.16 | Excel 2007 (PHStat) Transformed Regression Model for Harrison Equipment

Excel 2007 (PHStat) Instructions: 1. Open data file: Harrison.xls. 2. Select Add-Ins. 3. Select PHStat. 4. Select Regression > Multiple Regression. 5. Specify y variable data range. 6. Specify x variable data range (time = t values and t² = squared values). 7. Check Residuals table (output not shown here). 8. Check box for Durbin-Watson Statistic. 9. Click OK.

Minitab Instructions (for similar results): 1. Open file: Harrison.MTW. 2. Choose Calc > Calculator; in Store result in variable, enter destination column Qrt square; with the cursor in Expressions, enter the Quarter column, then **2; click OK. 3. Choose Stat > Regression > Regression. 4. In Response, enter Repair Costs; in Predictors, enter Qrt square. 5. Click Storage; under Diagnostic Measures, select Residuals; under Characteristics of Estimated Equation, select Fits. 6. Click OK. OK.

FIGURE 16.17 | Excel 2007 Transformed Model for Harrison Equipment Company (line chart with fitted values)

Excel 2007 (PHStat) Instructions: 1. Open data file: Harrison.xls. 2. Follow the Figure 16.14 instructions to generate the line chart. 3. Select Add-Ins. 4. Select PHStat. 5. Select Regression > Multiple Regression. 6. Specify y variable data range. 7. Specify x variable data range (time = t values and t-squared values). 8. Click on Residuals table (output not shown here). 9. Paste the Predicted Values onto the line chart.

Model Diagnosis
Visually, the transformed model now looks more appropriate. The fit is much better, as the R-squared value is increased to 0.9466 and the standard error is reduced to 910.35. The null and alternative hypotheses for testing whether positive autocorrelation exists are

H0: ρ = 0
HA: ρ > 0

As seen in Figure 16.16, the calculated Durbin-Watson statistic is d = 1.63. The dL and dU critical values from the Durbin-Watson table in Appendix O for α = 0.05 and a sample size of n = 20 and p = 2 independent variables are 1.10 and 1.54, respectively. Because d = 1.63 > 1.54, the Durbin-Watson test indicates that there is no positive autocorrelation. Given this result and the improvements to R-squared and the standard error of the estimate, the nonlinear model is judged superior to the original linear model. Forecasts for periods 21 and 22, using this latest model, are obtained using the trend projection method:

For period t = 21: F21 = 2,318.7 - 340.4(21) + 43.4(21²) = $14,310
For period t = 22: F22 = 2,318.7 - 340.4(22) + 43.4(22²) = $15,836

Using transformations often provides a very effective way of improving the fit of a time series. However, a forecaster should be careful not to get caught up in an exercise of "curve-fitting." One suggestion is that only explainable terms (terms that can be justified) be used for transforming data. For instance, in our example, we might well expect repair costs to increase at a faster rate as a tractor gets older and begins wearing out. Thus, the t² transformation seems to make sense.
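For readers who want to reproduce this kind of transformed fit outside of Excel or PHStat, the sketch below fits y = b0 + b1t + b2t² by least squares with NumPy and projects periods 21 and 22. The cost values are hypothetical stand-ins, since the Harrison data file is not reproduced here; the structure of the computation, not the numbers, is the point.

```python
# Sketch: quadratic trend fit via the t-squared transformation,
# then trend projection for the next two periods.
import numpy as np

# Hypothetical quarterly repair costs for 20 quarters
costs = np.array([2300, 2100, 2400, 2900, 3100, 3600, 4200, 4700,
                  5400, 6100, 7000, 7800, 8800, 9700, 10800, 11900,
                  13100, 14400, 15800, 17300], dtype=float)
t = np.arange(1, len(costs) + 1)

b2, b1, b0 = np.polyfit(t, costs, 2)   # coefficients, highest power first

for t_new in (21, 22):                 # trend projection
    print(f"F{t_new} = {b0 + b1 * t_new + b2 * t_new ** 2:,.0f}")
```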
Some Words of Caution
The trend projection method relies on the future behaving in a manner similar to the past. In the previous example, if equipment repair costs continue to follow the pattern displayed over the past 20 quarters, these forecasts may prove acceptable. However, if the future pattern changes, there is no reason to believe these forecasts will be close to actual costs.

Adjusting for Seasonality
In Section 16.1, we discussed seasonality in a time series. The seasonal component represents those changes (highs and lows) in the time series that occur at approximately the same time every period. If the forecasting model you are using does not already explicitly account for seasonality, you should adjust your forecast to take into account the seasonal component. The linear and nonlinear trend models discussed thus far do not automatically incorporate the seasonal component. Forecasts using these models should be adjusted as illustrated in the following application.

BUSINESS APPLICATION FORECASTING WITH SEASONAL DATA

BIG MOUNTAIN SKI RESORT Most businesses in the tourist industry know that sales are seasonal. For example, at the Big Mountain Ski Resort, business peaks at two times during the year: winter for skiing and summer for golf and tennis. These peaks can be identified in a time series if the sales data are measured on at least a quarterly basis. Figure 16.18 shows the quarterly sales data for the past four years in spreadsheet form. The line chart for these data is also shown. The data are in the file Big Mountain. The time-series plot clearly shows that the summer and winter quarters are the busy times. There has also been a slightly increasing linear trend in sales over the four years.

FIGURE 16.18 | Excel 2007 Big Mountain Resort Quarterly Sales Data

Excel 2007 Instructions: 1. Open file: Big Mountain.xls. 2. Select data in the Sales data column. 3. Click on Insert > Line Chart. 4. Click on Select Data. 5. Under Horizontal (categories) Axis Labels, select data in Year and Season columns. 6. Click on Layout > Chart Title and enter desired title. 7. Click on Layout > Axis Titles and enter horizontal and vertical axes titles.

Seasonal Index: A number used to quantify the effect of seasonality in time-series data.

Big Mountain Resort wants to forecast sales for each quarter of the coming year, and it hopes to use a linear trend model. When the historical data show a trend and seasonality, the trend-based forecasting model needs to be adjusted to incorporate the seasonality. One method for doing this involves computing seasonal indexes. For instance, when we have quarterly data, we can develop four seasonal indexes, one each for winter, spring, summer, and fall. A seasonal index below 1.00 indicates that the quarter has a value that is typically below the average value for the year. On the other hand, an index greater than 1.00 indicates that the quarter's value is typically higher than the yearly average.

Computing Seasonal Indexes
Although there are several methods for computing the seasonal indexes, the procedure introduced here is the ratio-to-moving-average method. This method assumes that the actual time-series data can be represented as a product of the four time-series components (trend, seasonal, cyclical, and random), which produces the multiplicative model shown in Equation 16.13.

Multiplicative Time-Series Model

yt = Tt × St × Ct × It    (16.13)

where:
yt = Value of the time series at time t
Tt = Trend value at time t
St = Seasonal value at time t
Ct = Cyclical value at time t
It = Irregular or random value at time t

Moving Average: The successive averages of n consecutive values in a time series.

The ratio-to-moving-average method begins by removing the seasonal and irregular components, St and It, from the data, leaving the combined trend and cyclical components, Tt and Ct.
This is done by first computing successive four-period moving averages for the time series. A moving average is the average of n consecutive values of a time series. Using the Big Mountain sales data in Figure 16.19, we find that the moving average using the first four quarters is

(205 + 96 + 194 + 102)/4 = 149.25

FIGURE 16.19 | Excel 2007 Seasonal Index, Step 1: Moving Average Values for Big Mountain Resort (each moving average corresponds to the midpoint between its cell and the following cell)

Excel 2007 Instructions: 1. Open file: Big Mountain.xls. 2. Create a new column of 4-period moving averages using Excel's AVERAGE function; the first moving average is placed in cell E3 and the equation is =AVERAGE(D2:D5). 3. Copy the equation down to cell E15.

This moving average is associated with the middle time period of the data values in the moving average. The middle period of the first four quarters is 2.5 (between quarter 2 and quarter 3). The second moving average is found by dropping the value from period 1 and adding the value from period 5, as follows:

(96 + 194 + 102 + 230)/4 = 155.50

This moving average is associated with time period 3.5, the middle period between quarters 3 and 4. Figure 16.19 shows the moving averages for the Big Mountain sales data in Excel spreadsheet form.² We selected 4 data values for the moving average because we have quarterly data; with monthly data, 12 data values would have been used.

The next step is to compute the centered moving averages by averaging each successive pair of moving averages. Centering the moving averages is necessary so that the resulting moving average will be associated with one of the data set's original time periods. In this example, Big Mountain is interested in quarterly sales data, that is, time periods 1, 2, 3, etc. Therefore, the moving averages we have representing time periods 2.5, 3.5, and so forth are not of interest to Big Mountain. Centering these averaged time series values, however, produces moving averages for the (quarterly) time periods of interest. For example, the first two moving averages are averaged to produce the first centered moving average. We get

(149.25 + 155.50)/2 = 152.38

This centered moving average is associated with quarter 3. The centered moving averages are shown in Figure 16.20.³ These values estimate the Tt × Ct value.

FIGURE 16.20 | Excel 2007 Seasonal Index, Step 2: Big Mountain Resort Centered Moving Averages

Excel 2007 Instructions: 1. Open file: Big Mountain.xls. 2. Follow the instructions in Figure 16.19. 3. Create a new column of centered moving averages using Excel's AVERAGE function; the first centered moving average is placed in cell F4 and the equation is =AVERAGE(E3:E4). 4. Copy the equation down to cell F15.

²The Excel process illustrated by Figures 16.19 through 16.23 is also accomplished using Minitab's Time Series > Decomposition command, which is illustrated by Figure 16.25 on page 744.

³Excel's tabular format does not allow the uncentered moving averages to be displayed with their "interquarter" time periods. That is, 149.25 is associated with time period 2.5, 155.50 with time period 3.5, and so forth.

If the number of data values used for a moving average is odd, the moving average will be associated with the time period of the middle observation. In such cases, we would not have to center the moving average, as we did in Figure 16.20, because the moving averages would already be associated with one of the time periods from the original time series.
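The moving-average and centering arithmetic just described is easy to express in code. A short Python sketch, using the first five Big Mountain sales values shown above:

```python
# Four-period moving averages and centered moving averages
# (the T_t x C_t estimates). Extend the list for the full series.

sales = [205, 96, 194, 102, 230]
k = 4  # quarterly data

moving_avg = [sum(sales[i:i + k]) / k for i in range(len(sales) - k + 1)]
# moving_avg[0] = 149.25 (period 2.5), moving_avg[1] = 155.50 (period 3.5)

centered = [(a + b) / 2 for a, b in zip(moving_avg, moving_avg[1:])]
# centered[0] = 152.375, associated with quarter 3

print(moving_avg, centered)
```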
Next, we estimate the St × It value. Dividing the actual sales value for each quarter by the corresponding centered moving average, as in Equation 16.14, does this. As an example, we examine the third time period: summer of 2006. The sales value of 194 is divided by the centered moving average of 152.38 to produce 1.273. This value is called the ratio-to-moving-average. Figure 16.21 shows these values for the Big Mountain data.

Ratio-to-Moving-Average

St × It = yt/(Tt × Ct)    (16.14)

FIGURE 16.21 | Excel 2007 Seasonal Index, Step 3: Big Mountain Resort Ratio-to-Moving-Averages

Excel 2007 Instructions: 1. Open file: Big Mountain.xls. 2. Follow the instructions in Figures 16.19 and 16.20. 3. Create a new column of ratio-to-moving-averages using an Excel equation (e.g., D4/F4). 4. Copy the equation down to cell G15.

The final step in determining the seasonal indexes is to compute the mean ratio-to-moving-average value for each season. Each quarter's ratio-to-moving-average is averaged over the years to produce the seasonal index for that quarter. Figure 16.22 shows the seasonal indexes. The seasonal index for the winter quarter is 1.441. This indicates that sales for Big Mountain during the winter are 44.1% above the average for the year. Also, sales in the spring quarter are only 60.8% of the average for the year.

One important point about the seasonal indexes is that the sum of the indexes is equal to the number of seasonal indexes. That is, the average of all seasonal indexes equals 1.0. In the Big Mountain Resort example, we find

Summer = 1.323
Fall = 0.626
Winter = 1.441
Spring = 0.608
Sum = 3.998 (difference from 4 due to rounding)

Likewise, in an example with monthly data instead of quarterly data, we would generate 12 seasonal indexes, one for each month. The sum of these indexes should be 12.

FIGURE 16.22 | Excel 2007 Seasonal Index, Step 4: Big Mountain Resort Mean Ratios (seasonal indexes: Summer = 1.323, Fall = 0.626, Winter = 1.441, Spring = 0.608)

Excel 2007 Instructions: 1. Open file: Big Mountain.xls. 2. Follow the instructions in Figures 16.19 through 16.21. 3. Rearrange the ratio-to-moving-average values, organizing them by season of the year (summer, fall, etc.). 4. Total and average the ratio-to-moving-averages for each season.
The Need to Normalize the Indexes
If the sum of the seasonal indexes does not equal the number of time periods in the recurrence period of the time series, an adjustment is necessary. In the Big Mountain Resort example, the sum of the four seasonal indexes may have been something other than 4 (the recurrence period). In such cases, we must adjust the seasonal indexes by multiplying each by the number of time periods in the recurrence period over the sum of the unadjusted seasonal indexes. For quarterly data such as the Big Mountain Resort example, we would multiply each seasonal index by 4/(Sum of the unadjusted seasonal indexes). Performing this multiplication will normalize the seasonal indexes. This adjustment is necessary if the seasonal adjustments are going to even out over the recurrence period.

Deseasonalizing
A strong seasonal component may partially mask a trend in the time-series data. Consequently, to identify the trend you should first remove the effect of the seasonal component. This is called deseasonalizing the time series. Again, assume that the multiplicative model shown previously in Equation 16.13 is appropriate:

yt = Tt × St × Ct × It

Deseasonalizing is accomplished by dividing yt by the appropriate seasonal index, St, as shown in Equation 16.15.

Deseasonalization

Tt × Ct × It = yt/St    (16.15)

For time period 1, which is the winter quarter, the seasonal index is 1.441. The deseasonalized value for y1 is

205/1.441 = 142.26

Figure 16.23 presents the deseasonalized values and the graph of these deseasonalized sales data for the Big Mountain example. This shows that there has been a gentle upward trend over the four years.

FIGURE 16.23 | Excel 2007 Deseasonalized Time Series for Big Mountain Sales Data

Excel 2007 Instructions: 1. Open file: Big Mountain.xls. 2. Follow the instructions for Figures 16.18 through 16.22. 3. Create a new column containing the deseasonalized values, using Excel equations and Equation 16.15. 4. Select the new deseasonalized data and paste onto the line graph.

Once the data have been deseasonalized, the next step is to determine the trend based on the deseasonalized data. As in the previous examples of trend estimation, you can use either Excel or Minitab to develop the linear model for the deseasonalized data. The results are shown in Figure 16.24. The linear regression trend line equation is

Ft = 142.113 + 4.686(t)

You can use this trend line and the trend projection method to forecast sales for period t = 17:

F17 = 142.113 + 4.686(17) = 221.775 = $221,775

Seasonally Unadjusted Forecast: A forecast made for seasonal data that does not include an adjustment for the seasonal component in the time series.

This is a seasonally unadjusted forecast, because the time-series data used in developing the trend line were deseasonalized. Now we need to adjust the forecast for period 17 to reflect the quarterly fluctuations. We do this by multiplying the unadjusted forecast values by the appropriate seasonal index. In this case, period 17 corresponds to the winter quarter. The winter quarter has a seasonal index of 1.441, indicating a high sales period. The adjusted forecast is

F17 = (221.775)(1.441) = 319.578, or $319,578

FIGURE 16.24 | Excel 2007 Regression Trend Line of Big Mountain Deseasonalized Data (linear trend equation: Ft = 142.113 + 4.686(t))

Excel 2007 Instructions: 1. Open file: Big Mountain.xls. 2. Follow the instructions in Figures 16.19 through 16.23. 3. Click on Data. 4. Select Data Analysis > Regression. 5. Specify the y variable range (deseasonalized variable) and the x variable range (time variable). 6. Click OK.
How to do it: The Seasonal Adjustment Process, the Multiplicative Model
We can summarize the steps for performing a seasonal adjustment to a trend-based forecast as follows (a code sketch implementing these steps appears at the end of this discussion):

1. Compute each moving average from the k appropriate consecutive data values, where k is the number of values in one period of the time series.
2. Compute the centered moving averages.
3. Isolate the seasonal component by computing the ratio-to-moving-average values.
4. Compute the seasonal indexes by averaging the ratio-to-moving-average values for comparable periods.
5. Normalize the seasonal indexes (if necessary).
6. Deseasonalize the time series by dividing the actual data by the appropriate seasonal index.
7. Use least squares regression to develop the trend line using the deseasonalized data.
8. Develop the unadjusted forecasts using trend projection.
9. Seasonally adjust the forecasts by multiplying the unadjusted forecasts by the appropriate seasonal index.

The seasonally adjusted forecasts for each quarter in 2010 are as follows:

Quarter (2010) | t | Unadjusted Forecast | Index | Adjusted Forecast
Winter | 17 | 221.775 | 1.441 | 319.578 = $319,578
Spring | 18 | 226.461 | 0.608 | 137.688 = $137,688
Summer | 19 | 231.147 | 1.323 | 305.807 = $305,807
Fall | 20 | 235.833 | 0.626 | 147.631 = $147,631

You can use the seasonally adjusted trend model when a time series exhibits both a trend and seasonality. This process allows for a better identification of the trend and produces forecasts that are more sensitive to seasonality in the data.

Minitab contains a procedure for generating seasonal indexes and seasonally adjusted forecasts. Figure 16.25 shows the Minitab results for the Big Mountain Ski Resort example. Notice that the forecast option in Minitab gives different forecasts than we showed earlier. This is because Minitab generates the linear trend model using original sales data rather than deseasonalized data. Our suggestion is to use Minitab to generate the seasonal indexes, but then follow our outline to generate seasonally adjusted forecasts.⁴

FIGURE 16.25 | Minitab Output Showing Big Mountain Seasonal Indexes (the trend model is based on original time-series data, not on deseasonalized data; the Minitab forecasts are based on data with the seasonal component still present; MAPE = Mean Absolute Percentage Error)

Minitab Instructions: 1. Open file: Big Mountain.MTW. 2. Select Stat > Time Series > Decomposition. 3. In Variable, enter the time series column; in Seasonal length, enter the number of time periods in a season. 4. Under Model Type, choose Multiplicative; under Model Components, choose Trend plus seasonal. 5. Select Generate forecasts; for Number of forecasts insert 4; for Starting with origin insert the last time series time period: 16. 6. Click OK.

⁴Neither Excel nor PHStat offers a procedure for automatically generating seasonal indexes. However, as shown in the Big Mountain example, you can use the spreadsheet formulas to do this. See the Excel Tutorial that accompanies this text.
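As a supplement to the spreadsheet and Minitab procedures, the following Python sketch walks through all nine steps using the 16 quarterly Big Mountain sales values from Table 16.5. Up to rounding, its output should agree with the seasonal indexes, trend line, and adjusted forecasts shown in this section.

```python
# Multiplicative seasonal adjustment: Big Mountain quarterly sales
# (in $1,000s) from Table 16.5.

sales = [205, 96, 194, 102, 230, 105, 245, 120,
         272, 110, 255, 114, 296, 130, 270, 140]
k = 4  # quarters per year (recurrence period; even k assumed here)

# Steps 1-2: moving averages, then centered moving averages
ma = [sum(sales[i:i + k]) / k for i in range(len(sales) - k + 1)]
cma = [(a + b) / 2 for a, b in zip(ma, ma[1:])]
offset = k // 2  # first centered average lines up with period 3 (index 2)

# Step 3: ratio-to-moving-average values, grouped by season
ratios = {}
for i, c in enumerate(cma):
    period = i + offset  # 0-based index into sales
    ratios.setdefault(period % k, []).append(sales[period] / c)

# Steps 4-5: average ratios per season, then normalize so they sum to k
idx = {s: sum(r) / len(r) for s, r in ratios.items()}
scale = k / sum(idx.values())
idx = {s: v * scale for s, v in idx.items()}  # seasonal indexes

# Step 6: deseasonalize (Equation 16.15)
deseason = [y / idx[i % k] for i, y in enumerate(sales)]

# Step 7: least squares trend on the deseasonalized data
n = len(deseason)
t = range(1, n + 1)
t_bar, y_bar = sum(t) / n, sum(deseason) / n
b1 = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, deseason))
      / sum((ti - t_bar) ** 2 for ti in t))
b0 = y_bar - b1 * t_bar

# Steps 8-9: trend projection, then re-seasonalize
for t_new in range(17, 21):
    unadjusted = b0 + b1 * t_new
    adjusted = unadjusted * idx[(t_new - 1) % k]
    print(f"t={t_new}: unadjusted {unadjusted:.1f}, adjusted {adjusted:.1f}")
```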
Using Dummy Variables to Represent Seasonality
The multiplicative model approach for dealing with seasonal data in a time-series forecasting application is one method that is commonly used by forecasters. Another method used to incorporate the seasonal component into a linear trend forecast involves the use of dummy variables. To illustrate, we again use the Big Mountain example, which had four years of quarterly data. Because we have quarterly data, we start by constructing 3 dummy variables (one less than the number of data values in the year; if you have monthly data, construct 11 dummy variables). Form the dummy variables as follows:

x1 = 1 if season is winter, 0 if not winter
x2 = 1 if season is spring, 0 if not spring
x3 = 1 if season is summer, 0 if not summer

Table 16.5 shows the revised data set for the Big Mountain Company.

TABLE 16.5 | Big Mountain Sales Data Using Dummy Variables

Season | Year | y = Sales | Quarter = t | x1 Winter Dummy | x2 Spring Dummy | x3 Summer Dummy
Winter | 2006 | 205 | 1 | 1 | 0 | 0
Spring | | 96 | 2 | 0 | 1 | 0
Summer | | 194 | 3 | 0 | 0 | 1
Fall | | 102 | 4 | 0 | 0 | 0
Winter | 2007 | 230 | 5 | 1 | 0 | 0
Spring | | 105 | 6 | 0 | 1 | 0
Summer | | 245 | 7 | 0 | 0 | 1
Fall | | 120 | 8 | 0 | 0 | 0
Winter | 2008 | 272 | 9 | 1 | 0 | 0
Spring | | 110 | 10 | 0 | 1 | 0
Summer | | 255 | 11 | 0 | 0 | 1
Fall | | 114 | 12 | 0 | 0 | 0
Winter | 2009 | 296 | 13 | 1 | 0 | 0
Spring | | 130 | 14 | 0 | 1 | 0
Summer | | 270 | 15 | 0 | 0 | 1
Fall | | 140 | 16 | 0 | 0 | 0

Next form a multiple regression model:

Ft = β0 + β1t + β2x1 + β3x2 + β4x3 + ε

Note, this model formulation is an extension of the linear trend model in which the seasonality is accounted for by adding the regression coefficient for the season to the linear trend fitted value. Figure 16.26 shows the Excel multiple regression output. The regression equation is

Ft = 71.0 + 4.8t + 146.2(x1) + 0.9(x2) + 126.8(x3)

The R-square value is very high at 0.9710, indicating the regression model fits the historical data quite well. The F-ratio of 92.07 is significant at any reasonable level of significance, indicating the overall regression model is statistically significant. However, the p-value for x2, the spring dummy variable, is 0.9359, indicating that variable is insignificant. Consequently, we will drop this variable and rerun the regression analysis with only three independent variables. The resulting model is

Ft = 71.5 + 4.8t + 145.7x1 + 126.4x3

This overall model is significant, and all three variables are statistically significant at any reasonable level of alpha. The coefficients on the two dummy variables can be interpreted as the seasonal indexes for winter and summer. The indexes for spring and fall are incorporated into the intercept value. We can now use this model to develop forecasts for year 5 (periods 17-20) as follows:

Winter (t = 17): Ft = 71.5 + 4.8(17) + 145.7(1) + 126.4(0) = 298.80 = $298,800
Spring (t = 18): Ft = 71.5 + 4.8(18) + 145.7(0) + 126.4(0) = 157.90 = $157,900
Summer (t = 19): Ft = 71.5 + 4.8(19) + 145.7(0) + 126.4(1) = 289.10 = $289,100
Fall (t = 20): Ft = 71.5 + 4.8(20) + 145.7(0) + 126.4(0) = 167.50 = $167,500

FIGURE 16.26 | Excel 2007 Regression Output with Dummy Variables Included, Big Mountain Example

Excel 2007 Instructions: 1. Open file: Big Mountain.xls. 2. Create three dummy variables for winter, spring, and summer. 3. Click on Data. 4. Select Data Analysis > Regression. 5. Specify the y variable range and the x variable range (time variable plus three dummies). 6. Click OK.

If you compare these forecasts to the ones we previously obtained using the multiplicative model approach, the forecasts for winter and summer are lower with the dummy variable model but higher for spring and fall. You could use the split-sample approach to test the two alternative approaches to see which, in this case, seems to provide more accurate forecasts based on MAD and MSE calculations. Both the multiplicative and the dummy variable approach have their advantages, and both methods are commonly used by business forecasters.

MyStatLab 16-2: Exercises

Skill Development

Problems 16-18 to 16-22 refer to Tran's Furniture Store, which has maintained monthly sales records for the past 48 months, with the following results:

Month Sales ($) (Jan.) 10 23,500 21,700 18,750 22,000 23,000 26,200 27,300 29,300 31,200 34,200 Month 11 12 13 (Jan.) 14 15 16 17 18 19 20 Sales ($) 39,500 43,400 23,500 23,400 21,400 24,200 26,900 29,700 31,100 32,400 Month 21 22 23 24 25 (Jan.)
26 27 28 29 30 31 32 33 34 Sales ($) Month Sales ($) 34,500 35,700 42,000 42,600 31,000 30,400 29,800 32,500 34,500 33,800 34,200 36,700 39,700 42,400 35 36 37 (Jan) 38 39 40 41 42 43 44 45 46 47 48 43,600 47,400 32,400 35,600 31,200 34,600 36,800 35,700 37,500 40,000 43,200 46,700 50,100 52,100 (286) CHAPTER 16 16-18 Based on the Durbin-Watson statistic, is there evidence of autocorrelation in these data? Use a linear trend model 16-19 Using the multiplicative model, estimate the T C portion by computing a 12-month moving average and then the centered 12-month moving average 16-20 Estimate the S I portion of the multiplicative model by finding the ratio-to-moving-averages for the timeseries data Determine whether these ratio-to-movingaverages are stable from year to year 16-21 Extract the irregular component by taking the normalized average of the ratio-to-moving-averages Make a table that shows the normalized seasonal indexes Interpret what the index for January means relative to the index for July 16-22 Based on your work in the previous three problems, a Determine a seasonally adjusted linear trend forecasting model Compare this model with an unadjusted linear trend model Use both models to forecast Tran’s sales for period 49 b Which of the two models developed has the lower MAD and lower MSE? 16-23 Consider the following set of sales data, given in millions of dollars: 2006 2008 1st quarter 152 2nd quarter 162 3rd quarter 157 4th quarter 167 1st quarter 217 2nd quarter 209 3rd quarter 202 4th quarter 221 | Analyzing and Forecasting Time-Series Data 747 16-24 Examine the following time series: t yt 10 52 72 58 66 68 60 46 43 17 a Produce a scatter plot of this time series Indicate the appropriate forecasting model for this time series b Construct the equation for the forecasting model identified in part a c Produce forecasts for time periods 11, 12, 13, and 14 d Obtain the forecast bias for the forecasts produced in part c if the actual time series values are -35, -41, -79, and -100 for periods 11–14, respectively 16-25 Examine the following quarterly data: t yt 10 11 12 12 23 20 18 32 48 41 35 52 79 63 a Compute the four-period moving averages for this set of data b Compute the centered moving averages from the moving averages of part a c Compute the ratio-to-moving-averages values d Calculate the seasonal indexes Normalize them if necessary e Deseasonalize the time series f Produce the trend line using the deseasonalized data g Produce seasonally adjusted forecasts for each of the time periods 13, 14, 15, and 16 Business Applications 2007 2009 1st quarter 182 2nd quarter 192 3rd quarter 191 4th quarter 197 1st quarter 236 2nd quarter 242 3rd quarter 231 4th quarter 224 a Plot these data Based on your visual observations, what time-series components are present in the data? 
b Determine the seasonal index for each quarter c Fit a linear trend model to the data and determine the MAD and MSE values Comment on the adequacy of the linear trend model based on these measures of forecast error d Provide a seasonally unadjusted forecast using the linear trend model for each quarter of the year 2010 e Use the seasonal index values computed in part b to provide seasonal adjusted forecasts for each quarter of 2010 16-26 “The average college senior graduated this year with more than $19,000 in debt” was the beginning sentence of a recent article in USA Today The majority of students have loans that are not due until the student leaves school This can result in the student ignoring the size of debt that piles up Federal loans obtained to finance college education are steadily mounting The data given here show the amount of loans ($million) for the last 13 academic years, with year 20 being the most recent Year Amount Year Amount Year Amount 9,914 10,182 12,493 13,195 13,414 13,890 15,232 10 11 12 13 14 16,221 22,557 26,011 28,737 31,906 33,930 34,376 15 16 17 18 19 20 37,228 39,101 42,761 49,360 57,463 62,614 a Produce a time-series plot of these data Indicate the time-series components that exist in the data (287) 748 CHAPTER 16 | Analyzing and Forecasting Time-Series Data b Conduct a test of hypothesis to determine if there exists a linear trend in these data Use a significance level of 0.10 and the p-value approach c Provide a 90% prediction interval for the amount of federal loans for the 26th academic year 16-27 The average monthly price of regular gasoline in Southern California is monitored by the Automobile Club of Southern California’s monthly Fuel Gauge Report The prices of the time period July 2004 to June 2006 are given here Month Price ($) Month Price ($) Month Price ($) 7/04 8/04 9/04 10/04 11/04 12/04 1/05 2/05 3/05 4/05 2.247 2.108 2.111 2.352 2.374 2.192 1.989 2.130 2.344 2.642 5/05 6/05 7/05 8/05 9/05 10/05 11/05 12/05 1/06 2/06 2.532 2.375 2.592 2.774 3.031 2.943 2.637 2.289 2.357 2.628 3/06 4/06 5/06 6/06 2.626 2.903 3.417 3.301 a Produce a time-series plot of the average price of regular gas in Southern California Identify any time-series components that exist in the data b Identify the recurrence period of the time series Determine the seasonal index for each month within the recurrence period c Fit a linear trend model to the deseasonalized data d Provide a seasonally adjusted forecast using the linear trend model for July 2006 and July 2010 16-28 Manuel Gutierrez correctly predicted the increasing need for home health care services due to the country’s aging population Five years ago, he started a company offering meal delivery, physical therapy, and minor housekeeping services in the Galveston area Since that time he has opened offices in seven additional Gulf State cities Manuel is currently analyzing the revenue data from his first location for the first five years of operation Revenue ($10,000s) January February March April May June July August September October November December 2005 2006 2007 2008 2009 23 34 45 48 46 49 60 65 67 60 71 76 67 63 65 71 75 70 72 75 80 78 89 94 72 64 64 77 79 72 71 77 79 78 87 92 76 75 77 81 86 75 80 82 86 87 91 96 81 72 71 83 85 77 79 84 91 86 94 99 a Plot these data Based on your visual observations, what time-series components are present in the data? 
b Determine the seasonal index for each month c (1) Fit a linear trend model to the deseasonalized data for the years 2005–2009 and determine the MAD and MSE for forecasts for each of the months in 2010 (2) Conduct a test of hypothesis to determine if the linear trend model fits the existing data (3) Comment on the adequacy of the linear trend model based on the measures of forecast error and the hypothesis test you conducted d Manuel had hoped to reach $2,000,000 in revenue by the time he had been in business for 10 years From the results in part c, is this a feasible goal based on the historical data provided? Consider and comment on the size of the standard error for this prediction What makes this value so large? How does it affect your conclusion? e Use the seasonal index values computed in part b to provide seasonal adjusted forecasts for each month of the year 2010 16-29 A major brokerage company has an office in Miami, Florida The manager of the office is evaluated based on the number of new clients generated each quarter The following data reflect the number of new customers added during each quarter between 2006 and 2009 2006 2007 1st quarter 218 2nd quarter 190 3rd quarter 236 4th quarter 218 1st quarter 250 2nd quarter 220 3rd quarter 265 4th quarter 241 2008 2009 1st quarter 244 2nd quarter 228 3rd quarter 263 4th quarter 240 1st quarter 229 2nd quarter 221 3rd quarter 248 4th quarter 231 a Plot the time series and discuss the components that are present in the data b Referring to part a, fit the linear trend model to the data for the years 2006–2008 Then use the resulting model to forecast the number of new brokerage customers for each quarter in the year 2009 Compute the MAD and MSE for these forecasts and discuss the results c Using the data for the years 2006–2008, determine the seasonal indexes for each quarter d Develop a seasonally unadjusted forecast for the four quarters of year 2009 e Using the seasonal indexes computed in part d, determine the seasonally adjusted forecast for each quarter for the year 2009 Compute the MAD and MSE for these forecasts (288) CHAPTER 16 f Examine the values for the MAD and MSE in parts b and e Which of the two forecasting techniques would you recommend the manager use to forecast the number of new clients generated each quarter? Support your choice by giving your rationale Computer Database Exercises 16-30 Logan Pickens is a plan/build construction company specializing in resort area construction projects Plan/build companies typically have a cash flow problem since they tend to be paid in lump sums when projects are completed or hit milestones However, their expenses, such as payroll, must be paid regularly Consequently, such companies need bank lines of credit to finance their initial costs, but in 2009 lines of credit were difficult to negotiate The data file LoganPickens contains month-end cash balances for the past 16 months a Plot the data as a time-series graph Discuss what the graph implies concerning the relationship between cash balance and the time variable, month b Fit a linear trend model to the data Compute the coefficient of determination for this model and show the trend line on the time-series graph Discuss the appropriateness of the linear trend model What are the strengths and weaknesses of the model? 
c Referring to part b, compute the MAD and MSE for the 16 data points d Use the t2 transformation approach and recompute the linear model using the transformed time variable Plot the new trend line against the transformed data Discuss whether this model appears to provide a better fit than did the model without the transformation Compare the coefficients of determination for the two models Which model seems to be superior, using the coefficient of determination as the criterion? e Refer to part d Compute the MAD and MSE for the 16 data values Discuss how these compare to those that were computed in part c, prior to transformation Do the measures of fit (R2, MSE, or MAD) agree on the best model to use for forecasting purposes? 16-31 Refer to Problem 16-30 a Use the linear trend model (without transformation) for the first 15 months and provide a cash balance forecast for month 16 Then make | Analyzing and Forecasting Time-Series Data 749 the t2 transformation and develop a new linear trend forecasting model based on months 1–15 Forecast the cash balance for month 16 Now compare the accuracy of the forecasts with and without the transformation Which of the two forecast models would you prefer? Explain your answer b Provide a 95% prediction interval for the cash balance forecast for month 16 using the linear trend model both with and without the transformation Which interval has the widest width? On this basis, which procedure would you choose? 16-32 The federal funds rate is the interest rate charged by banks when banks borrow “overnight” from each other The funds rate fluctuates according to supply and demand and is not under the direct control of the Federal Reserve Board, but is strongly influenced by the Fed’s actions The file entitled The Fed contains the federal funds rates for the period 1955–2008 a Produce a scatter plot of the federal funds rate for the period 1955–2008 Identify any time-series components that exist in the data b Identify the recurrence period of the time series Determine the seasonal index for each month within the recurrence period c Fit a nonlinear trend model that uses coded years and coded years squared as predictors for the deseasonalized data d Provide a seasonally adjusted forecast using the nonlinear trend model for 2010 and 2012 e Diagnose the model 16-33 The Census Bureau of the Department of Commerce released the U.S retail e-commerce sales (“Quarterly Retail E-Commerce Sales 1st Quarter 2006,” May 18, 2006) for the period of Fourth Quarter 1999–Fourth Quarter 2008 The file entitled E-Commerce contains those data a Produce a time-series plot of this data Indicate the time-series components that exist in the data b Conduct a test of hypothesis to determine if there exists a linear trend in these data Use a significance level of 0.10 and the p-value approach c Provide forecasts for the e-commerce retail sales for the next four quarters d Presume the next four quarters exhibit e-commerce retail sales of 35,916, 36,432, 35,096, and 36,807, respectively Produce the forecast bias Interpret this number in the context of this exercise END EXERCISES 16-2 (289) 750 CHAPTER 16 | Analyzing and Forecasting Time-Series Data 16.3 Forecasting Using Smoothing Methods The trend-based forecasting technique introduced in the previous section is widely used and can be very effective in many situations However, it has a disadvantage in that it gives as much weight to the earliest data in the time series as it does to the data that are close to the period for which the 
forecast is required. Also, this trend approach does not provide an opportunity for the model to "learn" or "adjust" to changes in the time series. A class of forecasting techniques called smoothing models is widely used to overcome these problems and to provide forecasts in situations in which there is no pronounced trend in the data. These models attempt to "smooth out" the random or irregular component in the time series by an averaging process. In this section we introduce two frequently used smoothing-based forecasting techniques: single exponential smoothing and double exponential smoothing. Double exponential smoothing offers a modification to the single exponential smoothing model that specifically deals with trends.

Exponential Smoothing: A time-series and forecasting technique that produces an exponentially weighted moving average in which each smoothing calculation or forecast is dependent on all previous observed values.

Exponential Smoothing
The trend-based forecasting methods discussed in Section 16.2 are used in many forecasting situations. As we showed, the least squares trend line is computed using all available historical data. Each observation is given equal input in establishing the trend line, thus allowing the trend line to reflect all the past data. If the future pattern looks like the past, the forecast should be reasonably accurate. However, in many situations involving time-series data, the more recent the observation, the more indicative it is of possible future values. For example, this month's sales are probably a better indicator of next month's sales than would be sales from 20 months ago. However, the regression analysis approach to trend-based forecasting does not take this fact into account. The data from 20 periods ago will be given the same weight as data from the most current period in developing a forecasting model. This equal valuation can be a drawback to the trend-based forecasting approach.

With exponential smoothing, current observations can be weighted more heavily than older observations in determining the forecast. Therefore, if in recent periods the time-series values are much higher (or lower) than those in earlier periods, the forecast can be made to reflect this difference. The extent to which the forecast reflects the current data depends on the weights assigned by the decision maker. We will introduce two classes of exponential smoothing models: single exponential smoothing and double exponential smoothing. Double smoothing is used when a time series exhibits a linear trend. Single smoothing is used when no linear trend is present in the time series. Both single and double exponential smoothing are appropriate for short-term forecasting and for time series that are not seasonal.

Single Exponential Smoothing
Just as its name implies, single exponential smoothing uses a single smoothing constant. Equations 16.16 and 16.17 represent two equivalent methods for forecasting using single exponential smoothing.

Exponential Smoothing Model

Ft+1 = Ft + α(yt - Ft)    (16.16)

or

Ft+1 = αyt + (1 - α)Ft    (16.17)

where:
Ft+1 = Forecast value for period t + 1
yt = Actual value of the time series at time t
Ft = Forecast value for period t
α = Alpha (smoothing constant, 0 ≤ α ≤ 1)

The logic of the exponential smoothing model is that the forecast made for the next period will equal the forecast made for the current period, plus or minus some adjustment factor.
The adjustment factor is determined by the difference between this period's forecast and the actual value (yt - Ft), multiplied by the smoothing constant, α. The idea is that if we forecast low, we will adjust next period's forecast upward, by an amount determined by the smoothing constant.

EXAMPLE 16-6 DEVELOPING A SINGLE EXPONENTIAL SMOOTHING MODEL

Dawson Graphic Design Consider the past 10 weeks of potential incoming customer sale calls for Dawson Graphic Design, located in Orlando, Florida. These data and their line graph are shown in Figure 16.27. The data showing the number of incoming calls from potential customers are in the file Dawson. Suppose the current time period is the end of week 10 and we wish to forecast the number of incoming calls for week 11 using a single exponential smoothing model. The following steps can be used:

Step 1 Specify the model.
Because the data do not exhibit a pronounced trend and because we are interested in a short-term forecast (one period ahead), the single exponential smoothing model with a single smoothing constant can be used.

Step 2 Fit the model.
We start by selecting a value for α, the smoothing constant, between 0.0 and 1.0. The closer α is to 0.0, the less influence the current observations have in determining the forecast. Small α values will result in greater smoothing of the time series. Likewise, when α is near 1.0, the current observations have greater impact in determining the forecast and less smoothing will occur. There is no firm rule for selecting the value for the smoothing constant. However, in general, if the time series is quite stable, a small α should be used to lessen the impact of random or irregular fluctuations. Because the time series shown in Figure 16.27 appears to be relatively stable, we will use α = 0.20 in this example.

FIGURE 16.27 | Incoming Customer Sale Calls Data and Line Graph for Dawson Graphic Design

Excel 2007 Instructions: 1. Open data file: Dawson.xls. 2. Select the Calls data. 3. Click on Insert > Line. 4. Click on Layout. 5. Use Chart Titles and Axis Titles to provide appropriate labels.

The forecast value for period t = 11 is found using Equation 16.17, as follows:

F11 = 0.20y10 + (1 - 0.20)F10

This demonstrates that the forecast for period 11 is a weighted average of the actual number of calls in period 10 and the forecast for period 10. Although we know the number of calls for period 10, we don't know the forecast for period 10. However, we can determine it by

F10 = 0.20y9 + (1 - 0.20)F9

Again, this forecast is a weighted average of the actual number of calls in period 9 and the forecast calls for period 9. We would continue in this manner until we get to

F2 = 0.20y1 + (1 - 0.20)F1

This requires a forecast for period 1. Because we have no data before period 1 from which to develop a forecast, a rule often used is to assume that F1 = y1.⁵

Forecast for period 1 = Actual value in period 1

Because setting the starting value is somewhat arbitrary, you should obtain as much historical data as possible to "warm" the model and dampen out the effect of the starting value. In our example, we have 10 periods of data to warm the model before the forecast for period 11 is made. Note that when using an exponential smoothing model, the effect of the initial forecast is reduced by (1 - α) in the forecast for period 2, then reduced again for period 3, and so on. After sufficient periods, any error due to the arbitrary initial forecast should be very small. Figure 16.28 shows the results of using the single exponential smoothing equation and Excel for weeks 2 through 10.
For week 1, F1 = y1 = 400. Then, for week 2, we get

F2 = 0.20y1 + (1 - 0.20)F1
F2 = (0.20)400 + (1 - 0.20)400.00 = 400.00

For week 3,

F3 = 0.20y2 + (1 - 0.20)F2
F3 = (0.20)430 + (1 - 0.20)400.00 = 406.00

At the end of week 2, after seeing what actually happened to the number of calls in week 2, our forecast for week 3 is 406 calls. This is a 6-unit increase over the forecast for week 2 of 400 calls. The actual number of calls in week 2 was 430, rather than 400. The number of calls for week 2 was 30 units higher than the forecast for that time period. Because the actual calls were larger than the forecast, an adjustment must be made. The 6-unit adjustment is determined by multiplying the smoothing constant by the forecast error [0.20(30) = 6], as specified in Equation 16.16. The adjustment compensates for the forecast error in week 2.

Continuing for week 4, again using Equation 16.17,

F4 = 0.20y3 + (1 - 0.20)F3
F4 = (0.20)420 + (1 - 0.20)406.00 = 408.80

Recall that our forecast for week 3 was 406. However, actual calls were higher than forecast at 420, and we underforecast by 14 calls. The adjustment for week 4 is then 0.20(14) = 2.80, and the forecast for week 4 is 406 + 2.80 = 408.80. This process continues through the data until we are ready to forecast week 11, as shown in Figure 16.28:

F11 = 0.20y10 + (1 - 0.20)F10
F11 = (0.20)420 + (1 - 0.20)435.70 = 432.56

FIGURE 16.28 | Dawson Graphic Design Single Exponential Smoothing, Excel Spreadsheet (forecast for period 11 = 432.565, or about 433 calls)

Minitab Instructions (for similar results): 1. Open file: Dawson.MTW. 2. Choose Stat > Time Series > Single Exp Smoothing. 3. In Variable, enter the time series column. 4. Under Weight to use in smoothing, select Use and insert α. 5. Click on Storage and select Fits (one-period-ahead forecasts). 6. Click OK. OK.

Excel 2007 Instructions: 1. Open data file: Dawson.xls. 2. Create and label two new columns. 3. Enter the smoothing constant in an empty cell (e.g., B14). 4. Enter the initial forecast for period 1 in C3 (400). 5. Use Equation 16.17 to create the forecast for period t + 1 in D2. 6. The forecast for period t is set equal to the forecast for period t + 1 from the previous period. 7. Copy the equations down.

Dawson Graphic Design managers would forecast incoming customer calls for week 11 to be 432. If we wished to forecast week 12 calls, we would either use the week 11 forecast or wait until the actual week 11 calls are known and then update the smoothing equations to get a new forecast for week 12.

Step 3 Diagnose the model.
However, before we actually use the exponential smoothing forecast for decision-making purposes, we need to determine how successfully the model fits the historical data. Unlike the trend-based forecast, which uses least squares regression, there is no need to use split samples to test the forecasting ability of an exponential smoothing model, because the forecasts are "true forecasts." The forecast for a given period is made before considering the actual value for that period. Figure 16.29 shows the MAD for the forecast model with α = 0.20 and a plot of the forecast values versus the actual call values. This plot shows the smoothing that has occurred. Note, we don't include period 1 in the MAD calculation, since the forecast is set equal to the actual value.

⁵Another approach for establishing the starting value, F1, is to use the mean value for some portion of the available data. Regardless of the method used, the quantity of available data should be large enough to dampen out the impact of the starting value.

Our next step would be to try different smoothing constants and find the MAD for each new α. The forecast for period 11 would be made using the smoothing constant that generates the smallest MAD.
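The week-by-week updating just illustrated is compactly expressed in code. In the Python sketch below, the first three call values (400, 430, 420) are taken from the example; the remaining weekly values are hypothetical stand-ins, so the resulting F11 will not exactly match Figure 16.28.

```python
# Single exponential smoothing (Equation 16.17) with one-period-ahead
# forecasts and the MAD (period 1 excluded, as in the text).

def ses_forecasts(y, alpha):
    """Return [F1, F2, ..., F_{n+1}]; F1 is seeded with y1."""
    f = [y[0]]                                   # F1 = y1
    for t in range(len(y)):
        f.append(alpha * y[t] + (1 - alpha) * f[t])
    return f

y = [400, 430, 420, 445, 410, 430, 450, 465, 440, 420]  # weeks 4-10 hypothetical
f = ses_forecasts(y, alpha=0.20)

mad = sum(abs(y[t] - f[t]) for t in range(1, len(y))) / (len(y) - 1)
print(f"F11 = {f[-1]:.2f}, MAD = {mad:.2f}")
```

Rerunning the sketch with different alpha values and comparing the MADs mirrors the search for the best smoothing constant described above.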
Both Excel and Minitab have single exponential smoothing procedures, although Minitab's procedure is much more extensive. Refer to the Excel and Minitab tutorials for instructions on each. Minitab provides optional methods for determining the initial forecast value for period 1 and a variety of useful graphs. Minitab also has an option for determining the optimal smoothing constant value.⁶ Figure 16.30 shows the output generated using Minitab. This shows that the best forecast is found using α = 0.524. Note, the MAD is decreased from 20.86 (when α = 0.20) to 17.321 when the optimal smoothing constant is used.

FIGURE 16.29 | Excel 2007 Output for Dawson Graphic Design: MAD Computation for Single Exponential Smoothing, α = 0.20

Excel 2007 Instructions: 1. Open data file: Dawson.xls. 2. Select the Calls data. 3. Click on Insert > Line. 4. Click on Layout. 5. Use Chart Titles and Axis Titles to provide appropriate labels. 6. Follow the directions for Figure 16.28. 7. Select the forecast values (Ft) and copy and paste onto the line chart. 8. Create a column of forecast errors using an Excel equation. 9. Create a column of absolute forecast errors using Excel's ABS function. 10. Compute the MAD by using the AVERAGE function for the absolute errors.

FIGURE 16.30 | Minitab Output for Dawson Graphic Design Single Exponential Smoothing Model (optimal alpha = 0.524)

Minitab Instructions: 1. Open file: Dawson.MTW. 2. Choose Stat > Time Series > Single Exp Smoothing. 3. In Variable, enter the time series column. 4. Under Weight to Use in Smoothing, select Optimal ARIMA. 5. Click OK.

>> END EXAMPLE

TRY PROBLEM 16-34 (pg. 759)

A major advantage of the single exponential smoothing model is that it is easy to update. In Example 16-6, the forecast for week 12 using this model is found by simply plugging the actual data value for week 11, once it is known, into the smoothing formula:

F12 = αy11 + (1 - α)F11

We do not need to go back and recompute the entire model, as would have been necessary with a trend-based regression model.

Double Exponential Smoothing
When the time series has an increasing or decreasing trend, a modification to the single exponential smoothing model is used to explicitly account for the trend. The resulting technique is called double exponential smoothing. The double exponential smoothing model is often referred to as exponential smoothing with trend. In double exponential smoothing, a second smoothing constant, beta (β), is included to account for the trend. Equations 16.18, 16.19, and 16.20 are needed to provide the forecasts.

Double Exponential Smoothing Model

Ct = αyt + (1 - α)(Ct-1 + Tt-1)    (16.18)
Tt = β(Ct - Ct-1) + (1 - β)Tt-1    (16.19)
Ft+1 = Ct + Tt    (16.20)

where:
yt = Value of the time series at time t
α = Constant-process smoothing constant
β = Trend-smoothing constant
Ct = Smoothed constant-process value for period t
Tt = Smoothed trend value for period t
Ft+1 = Forecast value for period t + 1
t = Current time period

Equation 16.18 is used to smooth the time-series data; Equation 16.19 is used to smooth the trend; and Equation 16.20 combines the two smoothed values to form the forecast for period t + 1.

⁶The Solver in Excel can be used to determine the optimal alpha level to minimize the MAD.

EXAMPLE 16-7 DOUBLE EXPONENTIAL SMOOTHING
EXAMPLE 16-7  DOUBLE EXPONENTIAL SMOOTHING  (Excel and Minitab tutorials)

Billingsley Insurance Company  The Billingsley Insurance Company has maintained data on the number of automobile claims filed at its Denver office over the past 12 months. These data, which are in the file Billingsley, are listed and graphed in Figure 16.31. The claims manager wants to forecast claims for month 13. A double exponential smoothing model can be developed using the following steps:

Step 1 Specify the model.
The time series contains a strong upward trend, so a double exponential smoothing model might be selected. As was the case with single exponential smoothing, we must select starting values. In the case of the double exponential smoothing model, we must select initial values for C0, T0, and the smoothing constants α and β. The choice of smoothing constant values (α and β) depends on the same issues as those discussed earlier for single exponential smoothing: use larger smoothing constants when less smoothing is desired and values closer to 0 when more smoothing is desired. The larger the smoothing constant value, the more impact current data will have on the forecast. Suppose we use α = 0.20 and β = 0.30 in this example. There are several approaches for selecting starting values for C0 and T0. The method we use here is to fit the least squares trend to the historical data,

ŷt = b0 + b1t

where the y intercept, b0, is used as the starting value, C0, and the slope, b1, is used as the starting value for the trend, T0. We can use the regression procedure in Excel or Minitab to perform these calculations, giving

ŷt = 34.273 + 4.1119(t)

So C0 = 34.273 and T0 = 4.1119. Keep in mind that these are arbitrary starting values, and as with single exponential smoothing, their effect will be dampened out as you proceed through the sample data to the current period. The more historical data you have, the less impact the starting values will have on the forecast.

FIGURE 16.31 | Excel 2007 Billingsley Insurance Company Data and Time-Series Plot
Excel 2007 Instructions:
1. Open file: Billingsley.xls.
2. Select the Claims data.
3. Click on Insert > Line Chart.
4. Click on Layout > Chart Title and enter the desired title.
5. Click on Layout > Axis Titles and enter horizontal and vertical axes titles.

Step 2 Fit the model.
The forecast for period 1 made at the beginning of period 1 is

F1 = C0 + T0
F1 = 34.273 + 4.1119 = 38.385

At the close of period 1, in which actual claims were 38, the smoothing equations are updated as follows:

C1 = 0.20(38) + (1 − 0.20)(34.273 + 4.1119) = 38.308
T1 = 0.30(38.308 − 34.273) + (1 − 0.30)(4.1119) = 4.089

Next, the forecast for period 2 is

F2 = 38.308 + 4.089 = 42.397

We then repeat the process through period 12 to find the forecast for period 13.

Step 3 Diagnose the model.
Figures 16.32 and 16.33 show the results of the computations and the MAD value. The forecast for period 13 is

F13 = C12 + T12
F13 = 83.867 + 3.908 = 87.775

Based on this double exponential smoothing model, the number of claims for period 13 is forecast to be about 88. However, before settling on this forecast, we should try different smoothing constants to determine whether a smaller MAD can be found.

END EXAMPLE  TRY PROBLEM 16-39 (pg 760)

As you can see, the computations required for double exponential smoothing are somewhat tedious and are ideally suited for a computer. Although Excel does not have a double exponential smoothing procedure, in Figure 16.32 we have used Excel formulas to develop our model, in conjunction with the regression tool for determining the starting values.
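The same pass can be scripted instead of built in a spreadsheet. The sketch below is ours: only month 1 (38 claims) is quoted in the example, so the claims list is a placeholder standing in for the Billingsley file, while the starting values C0 = 34.273 and T0 = 4.1119 come from the regression fit above. The MAD and MAPE lines correspond to Equations 16.10 and 16.21.

```python
# Sketch of the full double exponential smoothing pass for Example 16-7.
# Only month 1 (38 claims) is quoted in the text; extend the list with
# the remaining months from the Billingsley file to reproduce Fig. 16.32.

claims = [38]                # months 2-12 would follow from the data file
alpha, beta = 0.20, 0.30
c, t = 34.273, 4.1119        # C0, T0 from the least squares trend line

forecasts, abs_err, abs_pct_err = [], [], []
for y in claims:
    f = c + t                                  # Eq. 16.20: forecast made before seeing y
    forecasts.append(f)
    abs_err.append(abs(y - f))
    abs_pct_err.append(abs(y - f) / y * 100)
    c_new = alpha * y + (1 - alpha) * (c + t)  # Eq. 16.18
    t = beta * (c_new - c) + (1 - beta) * t    # Eq. 16.19
    c = c_new

print(forecasts[0], c, t, c + t)   # 38.385, 38.308, 4.089, 42.397 -> matches the text
mad = sum(abs_err) / len(abs_err)              # Eq. 16.10 (3.395 on the full series)
mape = sum(abs_pct_err) / len(abs_pct_err)     # Eq. 16.21 (5.7147 on the full series)
```

With the full 12-month series in place, the final level and trend give F13 = C12 + T12 = 87.775, the month 13 forecast shown in Figures 16.32 and 16.33.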
FIGURE 16.32 | Excel 2007 Double Exponential Smoothing Spreadsheet for Billingsley Insurance (month 13 forecast)
Excel 2007 Instructions:
1. Open file: Billingsley.xls.
2. Create five new column headings as shown in Figure 16.32.
3. Place the smoothing constants (alpha and beta) in empty cells (B17 and B18).
4. Place the starting values for the constant process and the trend in empty cells (D17 and D18).
5. Use Equations 16.18 and 16.19 to create the process columns.
6. Use Equation 16.20 to create the forecast values in the forecast column.
7. Calculate the forecast error by subtracting the forecast column values from the y column values.
8. Calculate the absolute forecast errors using the Excel ABS function.
9. Calculate the MAD by using the Excel AVERAGE function.

FIGURE 16.33 | Minitab Double Exponential Smoothing Spreadsheet for Billingsley Insurance (forecast for month 13; MAD = 3.395)
Minitab Instructions:
1. Open file: Billingsley.MTW.
2. Choose Stat > Time Series > Double Exp Smoothing.
3. In Variable, enter the time series column.
4. Check Generate forecasts and enter 1 in Number of forecasts and 12 in Starting from origin.
5. Click OK.

Minitab does have a double exponential smoothing routine, as illustrated in Figure 16.33. The MAPE on the Minitab output is the Mean Absolute Percent Error, which is computed using Equation 16.21. The MAPE = 5.7147, indicating that, on average, the double exponential smoothing model produced a forecast that differed from the actual claims by 5.7%.

Mean Absolute Percent Error

MAPE = [ Σ ( |yt − Ft| / yt ) / n ] (100)   (16.21)

where:
yt = Value of time series in time t
Ft = Forecast value for time period t
n = Number of periods of available data

MyStatLab  16-3: Exercises

Skill Development

16-34 The following table represents two years of quarterly data:

Year 1: 1st quarter 242, 2nd quarter 252, 3rd quarter 257, 4th quarter 267
Year 2: 1st quarter 272, 2nd quarter 267, 3rd quarter 276, 4th quarter 281

a. Prepare a single exponential smoothing forecast for the first quarter of year 3 using an alpha value of 0.10. Let the initial forecast value (for the first quarter of year 1) be 250.
b. Prepare a single exponential smoothing forecast for the first quarter of year 3 using an alpha value of 0.25. Let the initial forecast value (for the first quarter of year 1) be 250.
c. Calculate the MAD value for the forecasts you generated in parts a and b. Which alpha value provides the smaller MAD value at the end of the 4th quarter in year 2?

16-35 The following data represent enrollment in a major at your university for the past six semesters. (Note: semester 1 is the oldest data; semester 6 is the most recent data.)

Semester:   1    2    3    4    5    6
Enrollment: 87  110  123  127  145  160

a. Prepare a graph of enrollment for the six semesters.
b. Based on the graph you prepared in part a, does it appear that a trend is present in the enrollment figures?
c. Prepare a single exponential smoothing forecast for semester 7 using an alpha value of 0.35. Assume that the initial forecast for semester 1 is 90.
d. Prepare a double exponential smoothing forecast for semester 7 using an alpha value of 0.20 and a beta value of 0.25. Assume that the initial smoothed constant value (C0) is 80 and the initial smoothed trend value (T0) is 10.
e. Calculate the MAD values for the single exponential smoothing model and the double exponential smoothing model at the end of semester 6. Which model appears to be doing the better job of forecasting course enrollment?
Don’t include period in the calculation 16-36 The following data represent the average number of employees in outlets of a large consumer electronics retailer: Year 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 Number 20.6 17.3 18.6 21.5 23.2 19.9 18.7 15.6 19.7 20.4 a Construct a time-series plot of this time series Does it appear that a linear trend exists in the time series? b Calculate forecasts for each of the years in the time series Use a smoothing constant of 0.25 and single exponential smoothing c Calculate the MAD value for the forecasts you generated in part b d Construct a single exponential smoothing forecast for 2011 Use a smoothing constant of 0.25 16-37 A brokerage company is interested in forecasting the number of new accounts the office will obtain next month It has collected the following data for the past 12 months: Month Accounts 10 11 12 19 20 21 25 26 24 24 21 27 30 24 30 a Produce a time-series plot for these data Specify the exponential forecasting model that should be used to obtain next month’s forecast b Assuming a double exponential smoothing model, fit the least squares trend to the historical data, to determine the smoothed constant-process value and the smoothed trend value for period c Produce the forecasts for periods through 12 using   0.15,   0.25 Indicate the number of new accounts the company may expect to receive next month based on the forecast model d Calculate the MAD for this model (299) 760 CHAPTER 16 | Analyzing and Forecasting Time-Series Data determine the smoothed constant-process value and the smoothed trend value for period c Using data for periods through 20 and using a  0.20 and   0.30, forecast the total student loan volume for the year 21 d Calculate the MAD for this model 16-40 The human resources manager for a medium-sized business is interested in predicting the dollar value of medical expenditures filed by employees of her company for the year 2011 From her company’s database she has collected the following information showing the dollar value of medical expenditures made by employees for the previous seven years: Business Applications 16-38 With tax revenues declining in many states, school districts have been searching for methods of cutting costs without affecting classroom academics One district has been looking at the cost of extracurricular activities ranging from band trips to athletics The district business manager has gathered the past six months’ costs for these activities as shown here Month Expenditures ($) 23,586.41 September October November December January February 23,539.22 23,442.06 23,988.71 23,727.13 23,799.69 Using this past history, prepare a single exponential smoothing forecast for March using an  value of 0.25 16-39 “The average college senior graduated this year with more than $19,000 in debt” was the beginning sentence of a recent article in USA Today The majority of students have loans that are not due until the student leaves school This can result in the student ignoring the size of debt that piles up Federal loans obtained to finance college education are steadily mounting The data given here show the amount of loans ($million) for the last 20 academic years, with year 20 being the most recent Year Amount Year Amount Year Amount 9,914 16,221 15 37,228 10,182 12,493 13,195 13,414 13,890 15,232 10 11 12 13 14 22,557 26,011 28,737 31,906 33,930 34,376 16 17 18 19 20 39,101 42,761 49,360 57,463 62,614 a Produce a time-series plot for these data Specify the exponential forecasting model that should be 
used to obtain next year’s forecast b Assuming a double exponential smoothing model, fit the least squares trend to the historical data to Year Medical Claims 2004 2005 2006 2007 2008 2009 2010 $405,642.43 $407,180.60 $408,203.30 $410,088.03 $411,085.64 $412,200.39 $414,043.90 a Prepare a graph of medical expenditures for the years 2004–2010 Which forecasting technique you think is most appropriate for this time series, single exponential smoothing or double exponential smoothing? Why? b Use an a value of 0.25 and a b value of 0.15 to produce a double exponential forecast for the medical claims data Use linear trend analysis to obtain the starting values for C0 and T0 c Compute the MAD value for your model for the years 2004 to 2010 Also produce a graph of your forecast values 16-41 Retail Forward, Inc., is a global management consulting and market research firm specializing in retail intelligence and strategies One of its press releases (June Consumer Outlook: Spending Plans Show Resilience, June 1, 2006) divulged the result of the Retail Forward ShopperScape™ survey conducted each month from a sample of 4,000 U.S primary household shoppers A measure of consumer spending is represented by the following figure: Retail Forward Future Spending IndexTM (December 2005  100) 110 107.5 104.6 102.8 103.5 105 100 99.7 99.1 96.8 97.3 101.6 95.9 94.0 95 101.3 99.6 90 Jun05 Jul05 Aug- Sep05 05 Oct05 Nov- Dec05 05 Jan06 Feb- Mar06 06 Apr- May- Jun06 06 06 (300) CHAPTER 16 a Construct a time-series plot of these data Does it appear that a linear trend exists in the time series? b Calculate forecasts for each of the months in the time series Use a smoothing constant of 0.25 c Calculate the MAD value for the forecasts you generated in part b d Construct a single exponential smoothing forecast for July 2006 Use a smoothing constant of 0.25 Computer Database Exercises 16-42 The National Association of Theatre Owners is the largest exhibition trade organization in the world, representing more than 26,000 movie screens in all 50 states and in more than 20 countries worldwide Its membership includes the largest cinema chains and hundreds of independent theater owners It publishes statistics concerning the movie sector of the economy The file entitled Flicks contains data on average U.S ticket prices ($) One concern is the rapidly increasing price of tickets a Produce a time-series plot for these data Specify the exponential forecasting model that should be used to obtain next year’s forecast b Assuming a double exponential smoothing model, fit the least squares trend to the historical data to determine the smoothed constant-process value and the smoothed trend value for period c Use a  0.20 and b  0.30 to forecast the average yearly ticket price for the year 2010 d Calculate the MAD for this model 16-43 Inflation is a fall in the market value or purchasing power of money Measurements of inflation are prepared and published by the Bureau of Labor Statistics of the Department of Labor, which measures average changes in prices of goods and services The file entitled CPI contains the monthly CPI and inflation rate for the period January 2000–December 2005 a Construct a plot of this time series Does it appear that a linear trend exists in the time series? 
Specify the exponential forecasting model that should be used to obtain next month’s forecast b Assuming a single exponential smoothing model, calculate forecasts for each of the months in the time series Use a smoothing constant of 0.15 c Calculate the MAD value for the forecasts you generated in part b d Construct a single exponential smoothing forecast for January 2006 Use a smoothing constant of 0.15 | Analyzing and Forecasting Time-Series Data 761 16-44 The sales manager at Grossmieller Importers in New York City needs to determine a monthly forecast for the number of men’s golf sweaters that will be sold so that he can order an appropriate amount of packing boxes Grossmieller ships sweaters to retail stores throughout the United States and Canada Shirts are packed six to a box Data for the past 12 months are contained in the data file called Grossmieller a Plot the sales data using a time-series plot Based on the graph, what time series components are present? Discuss b (1) Use a single exponential smoothing model with a  0.30 to forecast sales for month 17 Assume that the initial forecast for period is 36,000 (2) Compute the MAD for this model (3) Graph the smoothing-model-fitted values on the time-series plot c (1) Referring to part b, try different alpha levels to determine which smoothing constant value you would recommend (2) Indicate why you have selected this value and then develop the forecast for month 17 (3) Compare this to the forecast you got using a  0.30 in part b 16-45 Referring to Problem 16-44, in which the sales manager for Grossmieller Imports of New York City needs to forecast monthly sales, a Discuss why a double exponential smoothing model might be preferred over a single exponential smoothing model b (1) Develop a double exponential smoothing model using a  0.20 and b  0.30 as smoothing constants To obtain the starting values, use the regression trend line approach discussed in this section (2) Determine the forecast for month 17 (3) Also compute the MAD for this model (4) Graph the fitted values on the time-series graph c Compare the results for this double exponential smoothing model with the “best” single exponential smoothing model developed in part c of Exercise 16-44 Discuss which model is preferred d Referring to part b, try different alpha and beta values in an attempt to determine an improved forecast model for monthly sales For each model, show the forecast for period 17 and the MAD Write a short report that compares the different models e Referring to part d and to part c for Exercise 16-44, write a report to the Grossmieller sales manager that indicates your choice for the forecasting model, complete with your justification for the selection END EXERCISES 16-3 (301) 762 CHAPTER 16 | Analyzing and Forecasting Time-Series Data Visual Summary Chapter 16: Organizations must operate effectively in the environment they face today, but also plan to continue to effectively operate in the future To plan for the future organizations must forecast This chapter introduces the two basic types of forecasting: qualitative forecasting and quantitative forecasting Qualitative forecasting techniques are based on expert opinion and judgment Quantitative forecasting techniques are based on statistical methods for analyzing quantitative historical data The chapter focuses on quantitative forecasting techniques Numerous techniques exist, often determined by the forecasting horizon Forecasts are often divided into four phases, immediate forecasts of one month or less, 
short term of one to three months, medium term of three months to two years and long term of two years of more The forecasting technique used is often determined by the length of the forecast, called the forecasting horizon The model building issues discussed in Chapter 15 involving model specification, model fitting, and model diagnosis also apply to forecasting models 16.1 Introduction to Forecasting, Time-Series Data, and Index Numbers (pg 710–723) Summary Quantitative forecasting techniques rely on data gathered in the past to forecast what will happen in the future Time series analysis is a commonly used quantitative forecasting technique Time series analysis involves looking for patterns in the past data that will hopefully continue into the future It involves looking for four components, trend, seasonal, cyclical and random A trend is the long-term increase or decrease in a variable being measured over time and can be linear or nonlinear A seasonal component is present if the data shows a repeating pattern over time If when observing time-series data you see sustained periods of high values followed by periods of lower values and the recurrence period of these fluctuations is larger than a year, the data are said to contain a cyclical component Although not all time series possess a trend, seasonal, or cyclical component, virtually all time series will have a random component.The random component is often referred to as “noise” in the data When analyzing time-series data, you will often compare one value measured at one point in time with other values measured at different points in time A common procedure for making relative comparisons is to begin by determining a base period index to which all other data values can be fairly compared The simplest index is an unweighted aggregate index More complicated weighted indexes include the Paasche and Lespeyres indexes Outcome Identify the components present in a time series Outcome Understand and compute basic index numbers 16.2 Trend-Based Forecasting Techniques (pg 724–749) Summary Trend-based forecasting techniques begin by identifying and modeling that trend Once the trend model has been defined, it is used to provide forecasts for future time periods Regression analysis is often used to identify the trend component How well the trend fits the actual data can be determined by the Mean Squared Error (MSE) or Mean Absolute Deviation (MAD) In general the smaller the MSE and MAD the better the model fits the actual data Using regression analysis to determine the trend carries some risk, one of which is that the error terms in the analysis are not independent Related error terms indicate autocorrelation in the data and is tested for using the Durbin-Watson Statistic Seasonality is often found in trend based forecasting models and if found is dealt with by computing seasonal indexes While alternate methods are used to compute seasonal indexes this section concentrates on the ratio-to-moving-average method Once the seasonal indexes are determined, they are used to deseasonalize the data to allow for a better trend forecast The indexes are then used to determine a seasonally adjusted forecast Determining the trend and seasonal components to a time series model allows the cyclical and random components to be better determined Outcome Apply the fundamental steps in developing and implementing forecasting models Outcome Apply trend-based forecasting models, including linear trend, nonlinear trend, and seasonally adjusted trend Conclusion 16.3 
Forecasting Using Smoothing Methods (pg 750–761)

Summary  A disadvantage of trend-based forecasting is that it gives as much weight to the earliest data in the time series as it does to the data close to the period for which the forecast is required. It therefore does not allow the model to "learn," or adjust to, changes in the time series. This section introduces exponential smoothing models. With exponential smoothing, current observations can be weighted more heavily than older observations in determining the forecast. Therefore, if in recent periods the time-series values are much higher (or lower) than those in earlier periods, the forecast can be made to reflect this difference. The section discusses single exponential smoothing models and double exponential smoothing models. Single exponential smoothing models are used when only random fluctuations are present in the data; double exponential smoothing models are used when the data appear to combine random variation with a trend. Both models weight recent data more heavily than past data. As with all forecast models, the basic model-building steps of specification, fitting, and diagnosis are followed.

Outcome 5. Use smoothing-based forecasting models, including single and double exponential smoothing.

While both qualitative and quantitative forecasting techniques are used, this chapter has emphasized quantitative techniques. Quantitative forecasting techniques require historical data for the variable to be forecast. The success of a quantitative model is determined by how well the model fits the historical time-series data and how closely the future resembles the past. Forecasting is as much an art as it is a science. The more experience you have in a given situation, the more effective you will likely be at identifying and applying the appropriate forecasting tool. You will find that the techniques introduced in this chapter are used frequently as an initial basis for a forecast. However, in most cases, the decision maker will modify the forecast based on personal judgment and other qualitative inputs that are not considered by the quantitative model.

Equations

(16.1) Simple Index Number (pg 714): It = (yt / y0)(100)
(16.2) Unweighted Aggregate Price Index (pg 716): It = (Σpt / Σp0)(100)
(16.3) The Paasche Index (pg 717): It = (Σqt pt / Σqt p0)(100)
(16.4) Laspeyres Index (pg 718): It = (Σq0 pt / Σq0 p0)(100)
(16.5) Deflation Formula (pg 721): yadj t = (yt / It)(100)
(16.6) Linear Trend Model (pg 725): yt = β0 + β1t + εt
(16.7) Least Squares Equations Estimates (pg 725): b1 = [Σt yt − (Σt)(Σyt)/n] / [Σt² − (Σt)²/n]
(16.8) b0 = Σyt/n − b1(Σt/n)
(16.9) Mean Squared Error (pg 727): MSE = Σ(yt − Ft)² / n
(16.10) Mean Absolute Deviation (pg 727): MAD = Σ|yt − Ft| / n
(16.11) Durbin-Watson Statistic (pg 729): d = Σ(t = 2 to n)(et − et−1)² / Σ(t = 1 to n)et²
(16.12) Forecast Bias (pg 733): Forecast bias = Σ(yt − Ft) / n
(16.13) Multiplicative Time-Series Model (pg 739): yt = Tt × St × Ct × It
(16.14) Ratio-to-Moving-Average (pg 741): St × It = yt / (Tt × Ct)
(16.15) Deseasonalization (pg 742): yt / St = Tt × Ct × It
(16.16) Exponential Smoothing Model (pg 750): Ft+1 = Ft + α(yt − Ft), or
(16.17) Ft+1 = αyt + (1 − α)Ft
(16.18) Double Exponential Smoothing Model (pg 755): Ct = αyt + (1 − α)(Ct−1 + Tt−1)
(16.19) Tt = β(Ct − Ct−1) + (1 − β)Tt−1
(16.20) Ft+1 = Ct + Tt
(16.21) Mean Absolute Percent Error (pg 758): MAPE = [Σ(|yt − Ft| / yt) / n](100)

Key Terms
Aggregate price index pg 715 Autocorrelation pg 728 Base period index pg 714 Cyclical component pg 713 Exponential smoothing pg 752 Forecasting horizon pg
710 Forecasting interval pg 710 Forecasting period pg 710 Linear trend pg 711 Model diagnosis pg 710 Model fitting pg 710 Model specification pg 710 Moving average pg 739 Random component pg 713 Seasonal component pg 712 Seasonal index pg 739 Seasonally unadjusted forecast pg 743 763 (303) 764 CHAPTER 16 | Analyzing and Forecasting Time-Series Data Chapter Exercises Conceptual Questions 16-46 Go to the library or use the Internet to find data showing your state’s population for the past 20 years Plot these data and indicate which of the time-series components are present 16-47 A time series exhibits the pattern stated below Indicate the type of time-series component described a The pattern is “wavelike” with a recurrence period of nine months b The time series is steadily increasing c The pattern is “wavelike” with a recurrence period of two years d The pattern is unpredictable e The pattern steadily decreases, with a “wavelike” shape which reoccurs every 10 years 16-48 Identify the businesses in your community that might be expected to have sales that exhibit a seasonal component Discuss 16-49 Discuss the difference between a cyclical component and a seasonal component Which component is more predictable, seasonal or cyclical? Discuss and illustrate with examples 16-50 In the simple linear regression model, confidence and prediction intervals are utilized to provide interval estimates for an average and a particular value, respectively, of the dependent variable The linear trend model in time series is an application of simple linear regression This being said, discuss whether a confidence or a prediction interval is the relevant interval estimate for a linear trend model’s forecast Business Applications Problems 16-51 through 16-54 refer to Malcar Autoparts Company, which has started producing replacement control microcomputers for automobiles This has been a growth industry since the first control units were introduced in 1985 Sales data since 1994 are as follows: Year Sales ($) Year Sales ($) 1994 1995 1996 1997 1998 1999 2000 2001 240,000 2002 1,570,000 218,000 405,000 587,000 795,000 762,000 998,000 1,217,000 2003 2004 2005 2006 2007 2008 2009 1,947,000 2,711,000 3,104,000 2,918,000 4,606,000 5,216,000 5,010,000 MyStatLab 16-51 As a start in analyzing these data, a Graph these data and indicate whether they appear to have a linear trend b Develop a simple linear regression model with time as the independent variable Using this regression model, describe the trend and the strength of the linear trend over the 16 years Is the trend line statistically significant? Plot the trend line against the actual data c Compute the MAD value for this model d Provide the Malcar Autoparts Company an estimate of its expected sales for the next years e Provide the maximum and minimum sales Malcar can expect with 90% confidence for the year 2014 16-52 Develop a single exponential smoothing model using a  0.20 Use as a starting value the average of the first years’ data Determine the forecasted value for year 2010 a Compute the MAD for this model b Plot the forecast values against the actual data c Use the same starting value but try different smoothing constants (say, 0.05, 0.10, 0.25, and 0.30) in an effort to reduce the MAD value d Is it possible to answer part d of Problem 16.51 using this forecasting technique? 
Explain your answer 16-53 Develop a double exponential smoothing model using smoothing constants a  0.20 and b  0.40 As starting values, use the least squares trend line slope and intercept values a Compute the MAD for this model b Plot the forecast values against the actual data c Use the same starting values but try different smoothing constants [say, (a, b)  (0.10, 0.50), (0.30, 0.30), and (0.40, 0.20)] in an effort to reduce the MAD value 16-54 Using whatever diagnostic tools you are familiar with, determine which of the three forecasting methods utilized to forecast sales for Malcar Autoparts Company in the previous three problems provides superior forecasts Explain the reasons for your choice 16-55 Amazon.com has become one of the most successful online merchants Two measures of its success are sales and net income/loss figures They are given here (304) CHAPTER 16 Year Net Income/Loss 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 Sales -0.3 0.5 -5.7 -27.5 -124.5 -719.9 -1,411.2 -567.3 -149.1 35.3 588.5 359 190 476 15.7 147.7 609.8 1,639.8 2,761.9 3,122.9 3,933 5,263.7 6,921 8,490 10,711 14,835 a Produce a time-series plot for these data Specify the exponential forecasting model that should be used to obtain the following years’ forecasts b Assuming a double exponential smoothing model, fit the least squares trend to the historical data to determine the smoothed constant-process value and the smoothed trend value for period c Produce the forecasts for periods through 13 using a  0.10 and b  0.20 Indicate the sales Amazon should expect for 2008 based on the forecast model d Calculate the MAD for this model 16-56 College tuition has risen at a pace faster than inflation for more than two decades, according to an article in USA Today The following data indicate the average college tuition (in 2003 dollars) for public colleges: Period 1983–1984 1988–1989 1993–1994 1998–1999 2003–2004 2008–2009 Public 2,074 2,395 3,188 3,632 4,694 5,652 a Produce a time-series plot of these data Indicate the time-series components that exist in the data b Provide a forecast for the average tuition for public colleges in the academic year 2013–2014 (Hint: One time-series time period represents five academic years.) 
c Provide an interval of plausible values for the average tuition change after five academic periods have gone by Use a confidence level of 0.90 Computer Database Exercises 16-57 HSH® Associates, financial publishers, is the nation’s largest publisher of mortgage and consumer loan information Every week it collects current data from 2,000 mortgage lenders across the nation It tracks a variety of adjustable rate mortgage (ARM) indexes and makes them available on its Web site The file ARM | Analyzing and Forecasting Time-Series Data 765 contains the national monthly average one-year ARM for the time period January 2004 to December 2008 a Produce a scatter plot of the federal ARM for the time period January 2004 to December 2008 Identify any time-series components that exist in the data b Identify the recurrence period of the time series Determine the seasonal index for each month within the recurrence period c Fit a nonlinear trend model containing coded years and coded years squared as predictors for the deseasonalized data d Provide a seasonally adjusted forecast using the nonlinear trend model for January 2009 e Diagnose the model 16-58 DataNet is an Internet service where clients can find information and purchase various items such as airline tickets, stereo equipment, and listed stocks DataNet has been in operation for four years Data on monthly calls for service for the time that the company has been in business are in the data file called DataNet a Plot these data in a time-series graph Based on the graph, what time-series components are present in the data? b Develop the seasonal indexes for each month Describe what the seasonal index for August means c Fit a linear trend model to the deseasonalized data for months 1–48 and determine the MAD value Comment on the adequacy of the linear trend model based on these measures of forecast error d Provide a seasonally unadjusted forecast using the linear trend model for each month of the year e Use the seasonal index values computed in part b to provide seasonal adjusted forecasts for months 49–52 16-59 Referring to Problem 16-58, the managers of DataNet, the Internet company where users can purchase products like airline tickets, need to forecast monthly call volumes in order to have sufficient capacity Develop a single exponential smoothing model using a  0.30 Use as a starting value the average of the first six months’ data a Compute the MAD for this model b Plot the forecast values against the actual data c Use the same starting value but try different smoothing constants (say, 0.10, 0.20, 0.40, and 0.50) in an effort to reduce the MAD value d Reflect on the type of time series for which the single exponential smoothing model is designed to provide forecasts Does it surprise you that the MAD for this method is relatively large for these data? Explain your reasoning 16-60 Continuing with the DataNet forecasting problems, develop a double exponential smoothing model using smoothing constants a  0.20 and b  0.20 As (305) 766 CHAPTER 16 | Analyzing and Forecasting Time-Series Data starting values, use the least squares trend line slope and intercept values a Compute the MAD for this model b Plot the forecast values against the actual data c Compare this with a linear trend model Which forecast method would you use? 
Explain your rationale d Use the same starting values but try different smoothing constants [say, (, )  (0.10, 0.30), (0.15, 0.25), and (0.30, 0.10)] in an effort to reduce the MAD value Prepare a short report that summarizes your efforts 16.61 The College Board, administrator of the SAT test for college entrants, has made several changes to the test in recent years One recent change occurred between years 2005 and 2006 In a press release the College Board announced SAT scores for students in the class of 2005, the last to take the former version of the SAT featuring math and verbal sections The board indicated that for the class of 2005, the average SAT video math scores continued their strong upward trend, increasing from 518 in 2004 to 520 in 2005, 14 points higher than 10 years ago and an all-time high The file entitled MathSAT contains the math SAT scores for the interval 1967 to 2005 a Produce a time-series plot for the combined gender math SAT scores for the period 1980 to 2005 Indicate the time-series components that exist in the data b Conduct a test of hypothesis to determine if the average SAT math scores of students continued to increase in the period indicated in part a Use a significance level of 0.10 and the test statistic approach c Produce a forecast for the average SAT math scores for 2010 d Beginning with the March 12, 2005, administration of the exam, the SAT Reasoning Test, was modified and lengthened How does this affect the forecast produced in part c? What statistical concept is exhibited by producing the forecast in part c? Video Case Restaurant Location and Re-imaging Decisions @ McDonald’s In the early days of his restaurant company’s growth, McDonald’s founder Ray Kroc knew that finding the right location was key He had a keen eye for prime real estate locations Today, the company is more than 30,000 restaurants strong When it comes to picking prime real estate locations for its restaurants and making the most of them, McDonald’s is way ahead of the competition In fact, when it comes to global real estate holdings, no corporate entity has more From urban office and airport locations, to Wal-Mart stores and the busiest street corner in your town, McDonald’s has grown to become one of the world’s most recognized brands Getting there hasn’t been just a matter of buying all available real estate on the market Instead, the company has used the basic principles and process Ray Kroc believed in to investigate and secure the best possible sites for its restaurants Factors such as neighborhood demographics, traffic patterns, competitor proximity, workforce, and retail shopping center locations all play a role Many of the company’s restaurant locations have been in operation for decades And although the restaurants have adapted to changing times—including diet fads and reporting nutrition information, staff uniform updates, and menu innovations such as Happy Meals, Chicken McNuggets, and premium salads—there’s more to bringing customers back time and again than an updated menu and a good location Those same factors that played a role in the original location decision need to be periodically examined to learn what’s changed and, as a result, what changes the local McDonald’s needs to consider Beginning in 2003, McDonald’s started work on “re-imaging” its existing restaurants while continuing to expand the brand globally More than 6,000 restaurants have been re-imaged to date Sophia Galassi, vice president of U.S Restaurant Development, is responsible for the new look 
nationwide According to Sophia, reimaging is more than new landscaping and paint In some cases, the entire store is torn down and rebuilt with redesigned drive-thru lanes to speed customers through faster, interiors with contemporary colors and coffee-house seating, and entertainment zones with televisions, and free Wi-Fi “We work very closely with our owner/operators to collect solid data about their locations, and then help analyze them so we can present the business case to them,” says Sophia Charts and graphs, along with the detailed statistical results, are vital to the decision process One recent project provides a good example of how statistics supported the re-imaging decision Dave Traub, owner/operator, had been successfully operating a restaurant in Midlothian, Virginia, for more than 30 years The location was still prime, but the architecture and décor hadn’t kept up with changing times After receiving the statistical analysis on the location from McDonald’s, Dave had the information he needed to make the decision to invest in re-imaging the restaurant With revenues and customer traffic up, he has no regrets “We’ve become the community’s gathering place The local senior citizens group now meets here regularly in the mornings,” he says The re-imaging effort doesn’t mean the end to new restaurant development for the company “As long as new communities are developed and growth continues in neighborhoods across the country, we’ll be analyzing data about them to be sure our restaurants are positioned in the best possible locations,” states Sophia Ray Kroc would be proud (306) CHAPTER 16 Discussion Questions: Sophia Galassi, vice president of U.S Restaurant Development for McDonald’s, indicated that she and her staff work very closely with owner/operators to collect data about McDonald’s restaurant locations Describe some of the kinds of data that Sophia’s staff would collect and the respective types of charts that could be used to present their findings to the owner/operators At the end of 2001, Sophia Galassi and her team led a remodel and re-imaging effort for the McDonald’s franchises in a major U.S city This entailed a total change in store layout and design and a renewed emphasis on customer service Once this work had been completed, the company put in place a comprehensive customer satisfaction data collection and tracking system The data in the file called McDonald’s Customer Satisfaction consist of the overall percentage of customers at the franchise McDonald’s in this city who have rated the customer service as Excellent or Very Good during each quarter since the re-imaging and remodeling was completed Develop a line chart and discuss what time-series components appear to be contained in these data Referring to question 2, based on the available historical data, develop a seasonally adjusted forecast for the percentage of customers who will rate the stores as Excellent or Very Good for Quarter and Quarter of 2006 Discuss the process you used to arrive at these forecasts | Analyzing and Forecasting Time-Series Data 767 Referring to questions and 3, use any other forecasting method discussed in this chapter to arrive at a forecast for Quarters and of 2006 Compare your chosen model with the seasonally adjusted forecast model specified in question Use appropriate measures of forecast error Prepare a short report outlining your forecasting attempts along with your recommendation of which method McDonald’s should use in this case Prior to remodeling or re-imaging a McDonald’s 
store, extensive research is conducted This includes the use of “mystery shoppers,” who are people hired by McDonald’s to go to stores as customers to observe various attributes of the store and the service being provided The file called McDonald’s Mystery Shopper contains data pertaining to the “cleanliness” rating provided by the mystery shoppers who visited a particular McDonald’s location each month between January 2004 and June 2006 The values represent the average rating on a 0–100 percent scale provided by five shoppers A score of 100% is considered perfect Using these time-series data, develop a line chart and discuss what timeseries components are present in these data Referring to question 5, develop a double exponential smoothing model to forecast the rating for July 2006 (use alpha  0.20 and beta  0.30 smoothing constants.) Compare the results of this forecasting approach with a simple linear trend forecasting approach Write a short report describing the methods you have used and the results Use linear trend analysis to obtain the starting values for C0 and T0 Case 16.1 Park Falls Chamber of Commerce Masao Sugiyama is the recently elected president of the Chamber of Commerce in Park Falls, Wisconsin He is the long-time owner of the only full-service hardware store in this small farming town Being president of the Chamber of Commerce has been considered largely a ceremonial post because business conditions have not changed in Park Falls for as long as anyone can remember However, Masao has just read an article in The Wall Street Journal that has made him think he needs to take a more active interest in the business conditions of his town The article concerned Wal-Mart, the largest retailer in the United States Wal-Mart has expanded primarily by locating in small towns and avoiding large suburban areas The Park Falls merchants have not had to deal with either Lowes or Home Depot because these companies have located primarily in large urban centers In addition, a supplier has recently told Masao that both Lowes and Home Depot are considering locating stores in smaller towns Sugiyama knows that Wal-Mart has moved into the outskirts of metropolitan areas and now is considering stores for smaller, untapped markets He also has heard that Lowes and Home Depot have recently had difficulty Masao decided he needs to know more about all three retailers He asked the son of a friend to locate the following sales data, which are also in a file called Park Falls Quarterly Sales Values in Millions of Dollars Fiscal 1999 Fiscal 2000 Fiscal 2001 Fiscal 2002 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Lowes Home Depot Wal-Mart $ 3,772 $ 4,435 $ 3,909 $ 3,789 $ 4,467 $ 5,264 $ 4,504 $ 4,543 $ 5,276 $ 6,127 $ 5,455 $ 5,253 $ 6,470 $ 7,488 $ 6,415 $ 6,118 $ 8,952 $10,431 $ 9,877 $ 9,174 $11,112 $12,618 $11,545 $10,463 $12,200 $14,576 $13,289 $13,488 $14,282 $16,277 $14,475 $13,213 $29,819 $33,521 $33,509 $40,785 $34,717 $38,470 $40,432 $51,394 $42,985 $46,112 $45,676 $56,556 $48,052 $52,799 $52,738 $64,210 (continued ) (307) 768 CHAPTER 16 | Analyzing and Forecasting Time-Series Data Quarterly Sales Values in Millions of Dollars Lowes Home Depot Wal-Mart Fiscal 2003 Q1 Q2 Q3 Q4 $ $ $ $ 7,118 8,666 7,802 7,252 $15,104 $17,989 $16,598 $15,125 $51,705 $56,271 $55,241 $66,400 Fiscal 2004 Q1 Q2 Q3 Q4 $ 8,681 $10,169 $ 9,064 $ 8,550 $17,550 $19,960 $18,772 $16,812 $56,718 $62,637 $62,480 $74,494 Quarterly Sales Values in Millions of Dollars Fiscal 2005 Q1 Q2 Q3 Q4 Lowes Home Depot Wal-Mart $ 9,913 
$11,929 $10,592 $10,808 $18,973 $22,305 $30,744 $19,489 $64,763 $69,722 $68,520 $82,216 Masao is interested in what all these data tell him How much faster has Wal-Mart grown than the other two firms? Is there any evidence Wal-Mart’s growth has leveled off? Does Lowes seem to be rebounding, based on sales? Are seasonal fluctuations an issue in these sales figures? Is there any evidence that one firm is more affected by the cyclical component than the others? He needs some help in analyzing these data Case 16.2 The St Louis Companies An irritated Roger Hatton finds himself sitting in the St Louis airport after hearing that his flight to Chicago has been delayed— and, if the storm in Chicago continues, possibly cancelled Because he must get to Chicago if at all possible, Roger is stuck at the airport He decides he might as well try to get some work done, so he opens his laptop computer and calls up the Claimnum file Roger was recently assigned as an analyst in the worker’s compensation section of the St Louis Companies, one of the biggest issuers of worker’s compensation insurance in the country Until this year, the revenues and claim costs for all parts of the company were grouped together to determine any yearly profit or loss Therefore, no one really knew if an individual department was profitable Now, however, the new president is looking at each part of the company as a profit center The clear implication is that money-losing departments may not have a future unless they develop a clear plan to become profitable When Roger asked the accounting department for a listing, by client, of all policy payments and claims filed and paid, he was told that the information is available but he may have to wait two or three months to get it He was able to determine, however, that the department has been keeping track of the clients who file frequent (at least one a month) claims and the total number of firms that purchase workers’ compensation insurance Using the data from this report, Roger divides the number of clients filing frequent claims by the corresponding number of clients These ratios, in the file Claimnum, are as follows: Year Ratio (%) Year Ratio (%) 3.8 12 6.1 3.6 13 7.8 3.5 14 7.1 4.9 15 7.6 5.9 16 9.7 5.6 17 9.6 4.9 18 7.5 5.6 19 7.9 8.5 20 8.3 10 7.7 21 8.4 11 7.1 Staring at these figures, Roger feels there should be some way to use them to project what the next several years may hold if the company doesn’t change its underwriting policies Case 16.3 Wagner Machine Works Mary Lindsey has recently agreed to leave her upper-level management job at a major paper manufacturing firm and return to her hometown to take over the family machine-products business The U.S machine-products industry had a strong position of world dominance until recently, when it was devastated by foreign competition, particularly from Germany and Japan Among the many problems facing the American industry is that it is made up of many small firms that must compete with foreign industrial giants Wagner Machine Works, the company Mary is taking over, is one of the few survivors in its part of the state, but it, too, faces increasing competitive pressure Mary’s father let the business slide as he approached retirement, and Mary sees the need for an immediate modernization of their plant She has arranged for a loan from the local bank, but now she must forecast sales for the next three years to ensure that the company has enough cash flow to repay the debt Surprisingly, Mary finds that her father has no (308) CHAPTER 
16 forecasting system in place, and she cannot afford the time, or money, to install a system like that used at her previous company Wagner Machine Works’ quarterly sales (in millions of dollars) for the past 15 years are as follows: Quarter Year 1995 1996 10,490 11,130 10,005 11,058 11,424 12,550 10,900 12,335 1997 12,835 13,100 11,660 13,767 1998 13,877 14,100 12,780 14,738 1999 14,798 15,210 13,785 16,218 2000 16,720 17,167 14,785 17,725 2001 18,348 18,951 16,554 19,889 2002 20,317 21,395 19,445 22,816 2003 23,335 24,179 22,548 25,029 2004 25,729 27,778 23,391 27,360 2005 28,886 30,125 26,049 30,300 2006 30,212 33,702 27,907 31,096 2007 31,715 35,720 28,554 34,326 2008 35,533 39,447 30,046 37,587 2009 39,093 44,650 30,046 37,587 | Analyzing and Forecasting Time-Series Data 769 While looking at these data, Mary wonders whether they can be used to forecast sales for the next three years She wonders how much, if any, confidence she can have in a forecast made with these data She also wonders if the recent increase in sales is due to growing business or just to inflationary price increases in the national economy Required Tasks: Identify the central issue in the case Plot the quarterly sales for the past 15 years for Wagner Machine Works Identify any patterns that are evident in the quarterly sales data If a seasonal pattern is identified, estimate quarterly seasonal factors Deseasonalize the data using the quarterly seasonal factors developed Run a regression model on the deseasonalized data using the time period as the independent variable Develop a seasonally adjusted forecast for the next three years Prepare a report that includes graphs and analysis References Armstrong, J Scott, “Forecasting by Extrapolation: Conclusions from 25 Years of Research.” Interfaces, 14, no (1984) Bails, Dale G., and Larry C Peppers, Business Fluctuations: Forecasting Techniques and Applications, 2nd ed (Englewood Cliffs, NJ: Prentice Hall, 1992) Berenson, Mark L., and David M Levine, Basic Business Statistics: Concepts and Applications, 11th ed (Upper Saddle River, NJ: Prentice Hall, 2008) Bowerman, Bruce L., and Richard T O’Connell, Forecasting and Time Series: An Applied Approach, 4th ed (North Scituate, MA: Duxbury Press, 1993) Brandon, Charles, R Fritz, and J Xander, “Econometric Forecasts: Evaluation and Revision.” Applied Economics, 15, no (1983) Cryer, Jonathan D., Time Series Analysis (Boston: Duxbury Press, 1986) Frees, Edward W., Data Analysis Using Regression Models: The Business Perspective (Englewood Cliffs, NJ: Prentice Hall, 1996) Granger, C W G., Forecasting in Business and Economics, 2nd ed (New York: Academic Press, 1989) Kutner, Michael H., Christopher J Nachtshein, John Neter, and William Li, Applied Linear Statistical Models, 5th ed (New York: McGraw-Hill Irwin, 2005) Makridakis, Spyros, Steven C Wheelwright, and Rob J Hyndman, Forecasting: Methods and Applications, 3rd ed (New York: John Wiley & Sons, 1998) McLaughlin, Robert L., “Forecasting Models: Sophisticated or Naive?” Journal of Forecasting, 2, no (1983) Microsoft Excel 2007 (Redmond,WA: Microsoft Corp., 2007) Minitab for Windows Version 15 (State College, PA: Minitab, 2007) Montgomery, Douglas C., and Lynwood A Johnson, Forecasting and Time Series Analysis, 2nd ed (New York: McGraw-Hill, 1990) Nelson, C R., Applied Time Series Analysis for Managerial Forecasting (San Francisco: Holdon-Day, 1983) The Ombudsman: “Research on Forecasting—A Quarter-Century Review, 1960–1984.” Interfaces, 16, no (1986) Willis, R E., A Guide to 
Forecasting for Planners (Englewood Cliffs, NJ: Prentice Hall, 1987) Wonnacott, T H., and R J Wonnacott, Econometrics, 2nd ed (New York: John Wiley & Sons, 1979) (309) chapter 17 Chapter 17 Quick Prep Links • Review the concepts associated with hypothesis testing for a single population mean using the t-distribution in Chapter • Make sure you are familiar with the steps involved in testing hypotheses for the difference between two population means discussed in Chapter 10 • Review the concepts and assumptions associated with analysis of variance in Chapter 12 Introduction to Nonparametric Statistics 17.1 The Wilcoxon Signed Rank Test for One Population Median Outcome Recognize when and how to use the Wilcoxon signed rank test for a population median (pg 771–776) 17.2 Nonparametric Tests for Two Population Medians (pg 776–789) Outcome Recognize the situations for which the Mann–Whitney U-test for the difference between two population medians applies and be able to use it in a decision-making context Outcome Know when to apply the Wilcoxon matchedpairs signed rank test for related samples 17.3 Kruskal–Wallis One-Way Analysis of Variance Outcome Perform nonparametric analysis of variance using the Kruskal–Wallis one-way ANOVA (pg 789–796) Why you need to know Housing prices are particularly important when a company considers potential locations for a new manufacturing plant because the company would like affordable housing to be available for employees who transfer to the new location A company that is in the midst of relocation has taken a sample of real estate listings from the four cities in contention for the new plant and would like to make a statistically valid comparison of home prices based on this sample information Another company is considering changing to a group-based, rather than an individual-based, employee evaluation system As a part of its analysis, the firm has gathered questionnaire data from employees who were asked to rate their satisfaction with the evaluation system on a five-point scale: very satisfied, satisfied, of no opinion, dissatisfied, or very dissatisfied In previous chapters, you were introduced to a wide variety of statistical techniques that would seem to be useful tools for these companies However, many of the techniques discussed earlier may not be appropriate for these situations For instance, in the plant relocation situation, the analysis of variance (ANOVA) F-test introduced in Chapter 12 would seem appropriate However, this test is based on the assumptions that all populations are normally distributed and have equal variances Unfortunately, housing prices are generally not normally distributed because most cities have home prices that are highly right skewed with most home prices clustered around the median price with a few very expensive houses that pull the mean value up In the employee questionnaire situation, answers were measured on an ordinal, not on an interval or ratio, scale, and interval or ratio data is required to use a t or F test To handle cases where interval or ratio data is not available, a class of statistical tools called nonparametric statistics has been developed 770 (310) CHAPTER 17 | Introduction to Nonparametric Statistics 771 17.1 The Wilcoxon Signed Rank Test for One Population Median Up to this point, the text has presented a wide array of statistical tools for describing data and for drawing inferences about a population based on sample information from that population These tools are widely used in decision-making 
situations. However, you will also encounter decision situations in which major departures from the required assumptions exist. For example, many populations, such as family income levels and house prices, are highly skewed. In other instances, the level of data measurement will be too low (ordinal or nominal) to warrant use of the techniques presented earlier. In such cases, the alternative is to employ a nonparametric statistical procedure that has been developed to meet specific inferential needs. Such procedures have fewer restrictive assumptions concerning data level and underlying probability distributions. There are a great many nonparametric procedures that cover a wide range of applications. The purpose of this chapter is to introduce you to the concept of nonparametric statistics and illustrate some of the more frequently used methods.

Chapter Outcome 1.

The Wilcoxon Signed Rank Test—Single Population
Chapter 9 introduced examples that involved testing hypotheses about a single population mean. Recall that if the data were interval or ratio level and the population was normally distributed, a t-test was used to test whether a population mean had a specified value. However, the t-test is not appropriate in cases in which the data level is ordinal or when the population is not believed to be approximately normally distributed. To overcome data limitation issues, a nonparametric statistical technique known as the Wilcoxon signed rank test can be used. This test makes no highly restrictive assumption about the shape of the population distribution. The Wilcoxon test is used to test hypotheses about a population median rather than a population mean. The basic logic of the Wilcoxon test is straightforward: because the median is the midpoint of a population, allowing for sampling error, we would expect approximately half the data values in a random sample to be below the hypothesized median and about half to be above it. The hypothesized median will be rejected if the actual data distribution shows too large a departure from this expectation.

BUSINESS APPLICATION  APPLYING THE WILCOXON SIGNED RANK TEST

UNIVERSITY UNDERGRADUATE STARTING SALARIES  The university placement office is interested in testing to determine whether the median starting salary distribution for undergraduates exceeds $35,000. People in the office believe the salary distribution is highly skewed to the right, so the center of the distribution should be measured by the median. Therefore, the t-test from Chapter 9, which requires that the population be normally distributed, is not appropriate. A simple random sample of n = 10 graduates is selected. The Wilcoxon signed rank test can be used to test whether the population median exceeds $35,000. As with all tests, we start by stating the appropriate null and alternative hypotheses. The null and alternative hypotheses for the one-tailed test are

H0: μ̃ ≤ $35,000
HA: μ̃ > $35,000

The test will be conducted using α = 0.05. For small samples, the hypothesis is tested using a W-test statistic determined by the following steps:

Step 1 Collect the sample data.
Step 2 Compute di, the deviation between each value and the hypothesized median.
TABLE 17.1 | Wilcoxon Ranking Table for Starting Salaries Example

Salary = xi ($)   di = xi − $35,000   |di|     Rank   R+     R−
36,400            1,400               1,400    2      2
38,500            3,500               3,500    3      3
27,000            −8,000              8,000    8             8
35,000            0                   —        —
29,000            −6,000              6,000    6.5           6.5
40,000            5,000               5,000    5      5
52,000            17,000              17,000   9      9
34,000            −1,000              1,000    1             1
38,900            3,900               3,900    4      4
41,000            6,000               6,000    6.5    6.5
                                     Total = W = 29.5       15.5

Table 17.1 shows the results for a random sample of 10 starting salaries. The hypothesis is tested by comparing the calculated W-value with the critical values for the Wilcoxon signed rank test shown in Appendix P. Both upper and lower critical values are shown, for sample sizes up to n = 20 and for various levels of alpha. Note that n equals the number of nonzero di values. In this example, we have n = 9 nonzero di values. The lower critical value for n = 9 and a one-tailed α = 0.05 is 8; the corresponding upper-tailed critical value is 37. Because this is an upper-tail test, we are interested only in the upper critical value, W0.05. Therefore, the decision rule is

If W > 37, reject H0.

Because W = 29.5 < 37, we do not reject the null hypothesis and are unable to conclude that the median starting salary for university graduates exceeds $35,000.

Although neither Excel nor PHStat has a Wilcoxon signed rank test, Minitab does. Figure 17.1 illustrates the Minitab output for this example. Note that the p-value = 0.221 > α = 0.05, which reinforces the conclusion that the null hypothesis should not be rejected.

FIGURE 17.1 | Minitab Output—Wilcoxon Signed Rank Test for Starting Salaries
Minitab Instructions:
1. Enter data in column.
2. Choose Stat ➤ Nonparametrics ➤ 1-Sample Wilcoxon.
3. In Variables, enter the data column.
4. Choose Test median and enter the median value tested.
5. From Alternative, select the appropriate hypothesis.
6. Click OK.

The starting salary example illustrates how the Wilcoxon signed rank test is used when the sample sizes are small. The W-test statistic approaches a normal distribution as n increases. Therefore, for sample sizes greater than 20, the Wilcoxon test can be approximated using the normal distribution, where the test statistic is a z-value, as shown in Equation 17.1.

Large-Sample Wilcoxon Signed Rank Test Statistic

$$z = \frac{W - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}} \qquad (17.1)$$

where:
W = Sum of the R+ ranks
n = Number of nonzero di values
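If SciPy is available, the whole test can also be run directly. The sketch below (ours) passes the differences from the hypothesized median to scipy.stats.wilcoxon, then applies Equation 17.1 to the same W purely for illustration, since the normal approximation is intended for n > 20.

```python
import math
from scipy.stats import wilcoxon

salaries = [36400, 38500, 27000, 35000, 29000,
            40000, 52000, 34000, 38900, 41000]
d = [x - 35000 for x in salaries]   # the single zero difference is dropped
# SciPy may use a normal approximation here because two |d| values are tied.
stat, p = wilcoxon(d, zero_method="wilcox", alternative="greater")
print(stat, p)   # W = 29.5; p is well above 0.05, so H0 is not rejected

# Equation 17.1 applied to the same data (illustration only; n is just 9 here)
n, w = 9, 29.5
z = (w - n * (n + 1) / 4) / math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
print(round(z, 2))   # about 0.83
```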
WILCOXON SIGNED RANK TEST, ONE SAMPLE, n > 20

EXAMPLE 17-1 Executive Salaries
A recent article in the business section of a regional newspaper indicated that the median salary for C-level executives (CEO, CFO, CIO, etc.) in the United States is less than $276,200. A shareholder advocate group has decided to test this assertion. A random sample of 25 C-level executives was selected. Because we would expect executive salaries to be highly right-skewed, a t-test is not appropriate. Instead, a large-sample Wilcoxon signed rank test can be conducted using the following steps:

Step 1 Specify the null and alternative hypotheses.
In this case, the null and alternative hypotheses are

H0: μ̃ ≥ $276,200
HA: μ̃ < $276,200 (claim)

Step 2 Determine the significance level for the test.
The test will be conducted using α = 0.01.

Step 3 Collect the sample data and compute the W-test statistic.
Using the steps outlined on pages 771–772, we manually compute the W-test statistic as shown in Table 17.2. The sum of the positive ranks gives W = 70.

TABLE 17.2 | Wilcoxon Ranking Table for Executive Salaries Example

Salary = xi ($)   di = xi − $276,200   |di|     Rank   R+     R−
273,000           −3,200               3,200    1             1
269,900           −6,300               6,300    2             2
263,500           −12,700              12,700   3             3
260,600           −15,600              15,600   4             4
259,200           −17,000              17,000   5             5
257,200           −19,000              19,000   6             6
256,500           −19,700              19,700   7             7
255,400           −20,800              20,800   8             8
255,200           −21,000              21,000   9             9
297,750           21,550               21,550   10     10
254,200           −22,000              22,000   11            11
300,750           24,550               24,550   12     12
249,500           −26,700              26,700   13            13
303,000           26,800               26,800   14     14
304,900           28,700               28,700   15     15
245,900           −30,300              30,300   16            16
243,500           −32,700              32,700   17            17
237,650           −38,550              38,550   18            18
316,250           40,050               40,050   19     19
234,500           −41,700              41,700   20            20
228,900           −47,300              47,300   21            21
217,000           −59,200              59,200   22            22
212,400           −63,800              63,800   23            23
204,500           −71,700              71,700   24            24
202,600           −73,600              73,600   25            25
                                       Totals          70     255

Step 4 Compute the z-test statistic.
The z-test statistic using the sum of the positive ranks is

$$z = \frac{W - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}} = \frac{70 - \dfrac{25(26)}{4}}{\sqrt{\dfrac{25(26)(51)}{24}}} = \frac{-92.5}{37.17} = -2.49$$

Step 5 Reach a decision.
The critical value for a one-tailed test with α = 0.01 from the standard normal distribution is −2.33. Because z = −2.49 < −2.33, we reject the null hypothesis.

Step 6 Draw a conclusion.
Based on the sample data, the shareholder group should conclude that the median executive salary is less than $276,200.

END EXAMPLE — TRY PROBLEM 17-1 (pg 774)

17-1: Exercises

Skill Development

17-1 Consider the following set of observations:
10.21  13.65  12.30  9.51  11.32  12.77  6.16  8.55  11.78  12.32
You should not assume these data come from a normal distribution. Test the hypothesis that these data come from a distribution with a median less than or equal to 10.

17-2 Consider the following set of observations:
9.0  15.6  21.1  11.1  13.5  9.2  13.6  15.8  12.5  18.7  18.9
You should not assume these data come from a normal distribution. Test the hypothesis that the median of these data is greater than or equal to 14.

17-3 Consider the following set of observations:
3.1  4.8  2.3  5.6  2.8  2.9  4.4
You should not assume these data come from a normal distribution. Test the hypothesis that these data come from a distribution with a median equal to ___. Use α = 0.10.

Business Applications

17-4 Sigman Corporation makes batteries that are used in highway signals in rural areas. The company managers claim that the median life of a battery exceeds 4,000 hours. To test this claim, they have selected a random sample of n = 12 batteries and have traced their life spans between installation and failure. The following data were obtained:
1,973  4,459  4,838  4,098  3,805  4,722  4,494  5,894  4,738  3,322  5,249  4,800
a. State the appropriate null and alternative hypotheses.
b. Assuming that the test is to be conducted using a 0.05 level of significance, what conclusion should be reached based on these sample data? Be sure to examine the required normality assumption.

17-5 A cable television customer call center has a goal that states that the median time for each completed call should not exceed four minutes. If calls take too long, productivity is reduced and other customers have to wait too long on hold. The operations manager does not want to incorrectly conclude that the goal isn't being satisfied unless sample data justify that conclusion. A sample of 12 calls was selected, and the following times (in seconds) were recorded:
194  278  302  140  245  234  268  208  102  190  220  255
a. Construct the appropriate null and alternative hypotheses.
b. Based on the sample data, what should the operations manager conclude?
Test at the 0.05 significance level.

17-6 A recent trade newsletter reported that during the initial 6-month period of employment, new sales personnel in an insurance company spent a median of 119 hours per month in the field. A random sample of 20 new salespersons was selected. The numbers of hours spent in the field by members of this sample in a randomly chosen month are listed here:
163  147  189  142  103  102  126  111  112  95
135  103  96   134  114  89   134  126  129  115
Do the data support the trade newsletter's claim? Conduct the appropriate hypothesis test with a significance level of 0.05.

17-7 At Hershey's, the chocolate maker, a particular candy bar is supposed to weigh 11 ounces. However, the company has received complaints that the bars are underweight. To assess this situation, the company conducted a statistical study that concluded that the average weight of the candy is indeed 11 ounces. However, a consumer organization, while acknowledging the finding that the mean weight is 11 ounces, claims that more than 50% of the candy bars weigh less than 11 ounces and that a few heavy bars pull the mean up, thereby cheating a majority of customers. A sample of 20 candy bars was selected. The data obtained follow:
10.9  11.5  10.6  10.7  11.7  10.8  10.9  10.8  10.5  11.2
11.6  10.5  11.8  11.8  11.2  11.3  10.2  10.7  11.0  10.1
Test the consumer organization's claim at a significance level of 0.05.

17-8 Sylvania's quality control division is constantly monitoring various parameters related to its products. One investigation addressed the life of incandescent light bulbs (in hours). Initially, they were satisfied with examining the average length of life. However, a recent sample taken from the production floor gave them pause for thought. The data follow:
1,100  1,460  1,150  1,770  1,140  1,940  1,260  1,270
1,550  2,080  1,760  1,210  1,210  1,350  1,250  1,230
1,280  1,150  1,500  1,230  840    730    1,560  2,100
1,620  2,410  1,210  1,630  1,500  1,060  1,440  500
Their initial efforts indicated that the average length of life of the light bulbs was 1,440 hours.
a. Construct a box and whisker plot of these data. On this basis, draw a conclusion concerning the distribution of the population from which this sample was drawn.
b. Conduct a hypothesis test to determine if the median length of life of the light bulbs is longer than the average length of life. Use α = 0.05.

17-9 The Penn Oil Company wished to verify the viscosity of its premium 30-weight oil. A simple random sample of specimens taken from automobiles running at normal temperatures was obtained. The viscosities observed were as follows:
25  25  35  27  24  35  29  31  21  38
30  32  35  32  27  30  25  36  28  30
Determine if the median viscosity at normal running temperatures is equal to 30 as advertised for Penn's premium 30-weight oil. (Use α = 0.05.)
Computer Database Exercises

17-10 The Cell Tone Company sells cellular phones and airtime in several states. At a recent meeting, the marketing manager stated that the median age of its customers is less than 40. This came up in conjunction with a proposed advertising plan that is to be directed toward a young audience. Before actually completing the advertising plan, Cell Tone decided to randomly sample customers. Among the questions asked in a survey of 50 customers in the Jacksonville, Florida, area was the customers' ages. The data are in the file Cell Phone Survey.
a. Examine the sample data. Is the variable being measured a discrete or a continuous variable? Does it seem feasible that these data could have come from a normal distribution?
b. The marketing manager must support his statement concerning customer age in an upcoming board meeting. Using a significance level of 0.10, provide this support for the marketing manager.

17-11 The Wilson Company uses a great deal of water in the process of making industrial milling equipment. To comply with federal clean-water laws, it has a water purification system that all wastewater goes through before being discharged into a settling pond on the company's property. To determine whether the company is complying with federal requirements, sample measures are taken every so often. One requirement is that the median pH level must be less than 7.4. A sample of 95 pH measures has been taken. The data for these measures are shown in the file Wilson Water.
a. Carefully examine the data. Use an appropriate procedure to determine if the data could have been sampled from a normal distribution. (Hint: Review the goodness-of-fit test in Chapter 13.)
b. Based on the sample data of pH level, what should the company conclude about its current status on meeting federal requirements?
Test the hypothesis at the 0.05 level.

END EXERCISES 17-1

17.2 Nonparametric Tests for Two Population Medians

Chapters 9 through 12 introduced a variety of hypothesis-testing tools and techniques, including tests involving two or more population means. These tests carried with them several assumptions and requirements. For some situations in which you are testing the difference between two population means, the Student t-distribution is employed. One of the assumptions for the t-distribution is that the two populations are normally distributed. Another is that the data are interval or ratio level. Although in many situations these assumptions and data requirements will be satisfied, you will often encounter situations in which this is not the case. In this section we introduce two nonparametric techniques that do not require such stringent assumptions and data requirements: the Mann–Whitney U-test¹ and the Wilcoxon matched-pairs signed rank test. Both tests can be used with ordinal (ranked) data, and neither requires that the populations be normally distributed. The Mann–Whitney U-test is used when the samples are independent, whereas the Wilcoxon matched-pairs signed rank test is used when the design has paired samples.

Chapter Outcome.

The Mann–Whitney U-Test

BUSINESS APPLICATION: TESTING TWO POPULATION MEDIANS

BLAINE COUNTY HIGHWAY DISTRICT The workforce of the Blaine County Highway District (BCHD) is made up of the rural and urban divisions. A few months ago, several rural division supervisors began claiming that the urban division employees waste gravel from the county gravel pit. The supervisors claimed the urban division uses more gravel per mile of road maintenance than the rural division. In response to these claims, the BCHD materials manager performed a test. He selected a random sample from the district's job-cost records of jobs performed by the urban (U) division and another sample of jobs performed by the rural (R) division. The yards of gravel per mile for each job are recorded. Even though the data are ratio-level, the manager is not willing to make the normality assumptions necessary to employ the two-sample t-test (discussed in Chapter 10). However, the Mann–Whitney U-test will allow him to compare the gravel use of the two divisions.

The Mann–Whitney U-test is one of the most commonly used nonparametric tests to compare samples from two populations when the following assumptions are satisfied:

Assumptions
1. The two samples are independent and random.
2. The value measured is a continuous variable.
3. The measurement scale used is at least ordinal.
4. If they differ, the distributions of the two populations will differ only with respect to central location.

The fourth point is instrumental in setting your null and alternative hypotheses. We are interested in determining whether two populations have the same or different medians. The test can be performed using the following steps:

Step 1 State the appropriate null and alternative hypotheses.
In this situation, the variable of interest is cubic yards of gravel used. This is a ratio-level variable. However, the populations are suspected to be skewed, so the materials manager has decided to test the following hypotheses, stated in terms of the population medians:

H0: μ̃U ≤ μ̃R (Median urban gravel use is less than or equal to median rural use.)
HA: μ̃U > μ̃R (Urban median exceeds rural median.)

---
¹An equivalent test to the Mann–Whitney U-test is the Wilcoxon rank-sum test.
Step Specify the desired level of significance The decision makers have determined that the test will be conducted using a  0.05 Step Select the sample data and compute the appropriate test statistic Computing the test statistic manually requires several steps: Combine the raw data from the two samples into one set of numbers, keeping track of the sample from which each value came Rank the numbers in this combined set from low to high Note that we expect no ties to occur because the values are considered to have come from continuous distributions However, in actual situations ties will sometimes occur When they do, we give tied observations the average of the rank positions for which they are tied For instance, if the lowest four data points were each 460, each of the four 460s would receive a rank of (1    4)/4  10/4  2.5.2 Separate the two samples, listing each observation with the rank it has been assigned This leads to the rankings shown in Table 17.3 The logic of the Mann–Whitney U-test is based on the idea that if the sum of the rankings of one sample differs greatly from the sum of the rankings of the second sample, we should conclude that there is a difference in the population medians 2Noether provides an adjustment when ties occur He, however, points out that using the adjustment has little effect unless a large proportion of the observations are tied or there are ties of considerable extent See the References at the end of this chapter (318) 778 CHAPTER 17 | Introduction to Nonparametric Statistics | Ranking of Yards of Gravel per Mile for the Blaine County Highway District Example TABLE 17.3 Urban (n1  12) Yards of Gravel 460 830 720 930 500 620 703 407 1,521 900 750 800 Rural (n2  12) Rank Yards of Gravel 16 12 20 11 24 17 13 14 R1  142 Rank 600 652 603 594 1,402 1,111 902 700 827 490 904 1,400 23 21 18 10 15 19 22 R2  158 Calculate a U-value for each sample, as shown in Equations 17.2 and 17.3 U Statistics U1  n1n2  U  n1n2  n1 (n1  1) − ∑ R1 (17.2) − ∑ R2 (17.3) n2 (n2  1) where: n1 and n2  Sample sizes from populations and R1 and R2  Sum of ranks for samples and For our example using the ranks in Table 17.3, U1  12(12)   80 U  12(12)   64 12(13) − 142 12(13) − 158 Note that U1  U2 = n1n2 This is always the case, and it provides a good check on the correctness of the rankings in Table 17.3 Select the U-value to be the test statistic The Mann–Whitney U tables in Appendices L and M give the lower tail of the U-distribution For one-tailed tests such as our Blaine County example, you need to look at the alternative hypothesis to determine whether U1 or U2 should be selected as the test statistic Recall that H : ~  ~  A U R If the alternative hypothesis indicates that population has a higher median, as in this case, then U1 is selected as the test statistic If population is expected to have a higher median, then U2 should be selected as the test statistic The reason is that the population with the larger median should have the larger sum of ranked values, thus producing the smaller U-value (319) CHAPTER 17 | Introduction to Nonparametric Statistics 779 It is very important to note that this logic must be made in terms of the alternative hypothesis and not on the basis of the U-values obtained from the samples Now, we select the U-value that the alternative hypothesis indicates should be the smaller and call this U Because population (Urban) should have the smaller U-value (larger median) if the alternative hypothesis is true, the sample data give a U  80 This is actually larger 
This is actually larger than the U-value for the rural population, but we still use it as the test statistic because the alternative hypothesis indicates that μ̃U > μ̃R.

Step 4 Determine the critical value for the Mann–Whitney U-test.
For sample sizes less than 9, use the Mann–Whitney U table in Appendix L for the appropriate sample size. For sample sizes from 9 to 20, as in this example, the null hypothesis can be tested by comparing U with the appropriate critical value given in the Mann–Whitney U table in Appendix M. We begin by locating the part of the table associated with the desired significance level. In this case, we have a one-tailed test with α = 0.05. Go across the top of the Mann–Whitney U table to locate the value corresponding to the sample size from population 2 (Rural) and down the left side of the table to the sample size from population 1 (Urban). In the Blaine County example, both sample sizes are 12, so we use the Mann–Whitney table in Appendix M for α = 0.05. Go across the top of the table to n2 = 12 and down the left-hand side to n1 = 12. The intersection of these column and row values gives a critical value of U0.05 = 42. We can now form the decision rule as follows:

If U < 42, reject H0; otherwise, do not reject H0.³

Step 5 Reach a decision.
Because U = 80 > 42, we do not reject the null hypothesis.

Step 6 Draw a conclusion.
Based on the sample data, there is not sufficient evidence to conclude that the median yards of gravel per mile used by the urban division is greater than that for the rural division.

Neither Excel nor the PHStat add-ins contain a Mann–Whitney U-test. PHStat does have the equivalent Wilcoxon Rank Sum Test, which uses as its test statistic the sum of the ranks from the population that is supposed to have the larger median. Referring to Table 17.3, Urban is supposed to have the larger median, and the sum of those ranks is 142. Minitab, on the other hand, contains the Mann–Whitney test but not the Wilcoxon Rank Sum Test. The test statistic used in the Mann–Whitney test is a function of the rank sums produced in the Wilcoxon test, and the two tests are equivalent.

FIGURE 17.2 | Minitab Output—Mann–Whitney U-Test for the Blaine County Example
Minitab Instructions:
1. Enter data in columns.
2. Choose Stat ➤ Nonparametrics ➤ Mann–Whitney.
3. In First Sample, enter one data column.
4. In Second Sample, enter the other data column.
5. In Alternative, select greater than.
6. Click OK.

Mann–Whitney U-Test—Large Samples

When you encounter a situation with sample sizes in excess of 20, the previous approaches to the Mann–Whitney U-test cannot be used because of table limitations. However, the U statistic approaches a normal distribution as the sample sizes increase, and the Mann–Whitney U-test can be conducted using a normal approximation approach, where the mean and standard deviation for the U statistic are as given in Equations 17.4 and 17.5, respectively.

Mean and Standard Deviation for U Statistic

$$\mu = \frac{n_1 n_2}{2} \qquad (17.4)$$

$$\sigma = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} \qquad (17.5)$$

where:
n1 and n2 = Sample sizes from populations 1 and 2

Equations 17.4 and 17.5 are used to form the U-test statistic in Equation 17.6.

Mann–Whitney U-Test Statistic

$$z = \frac{U - \dfrac{n_1 n_2}{2}}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} \qquad (17.6)$$

---
³For a two-tailed test, you should select the smaller U-value as your test statistic. This will force you toward the lower tail. If the U-value is smaller than the critical value in the Mann–Whitney U table, you will reject the null hypothesis.
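A short Python sketch (ours, not the text's) of Equations 17.2 and 17.3 for the Blaine County data follows; scipy.stats.mannwhitneyu is shown as a cross-check, with the caveat that SciPy's U convention can differ from the book's, so the p-value is the part to compare.

```python
from scipy.stats import rankdata, mannwhitneyu

urban = [460, 830, 720, 930, 500, 620, 703, 407, 1521, 900, 750, 800]
rural = [600, 652, 603, 594, 1402, 1111, 902, 700, 827, 490, 904, 1400]

ranks = rankdata(urban + rural)              # rank the combined 24 values
r1, r2 = sum(ranks[:12]), sum(ranks[12:])    # R1 = 142, R2 = 158
n1, n2 = len(urban), len(rural)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1        # Equation 17.2 -> 80
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2        # Equation 17.3 -> 64
print(u1, u2, u1 + u2 == n1 * n2)            # the two U-values sum to n1*n2

# SciPy's version of the test; 'greater' matches HA: urban median > rural median.
# SciPy may define its U as n1*n2 minus the value above, so compare p-values, not U.
stat, p = mannwhitneyu(urban, rural, alternative="greater")
print(p)                                     # large p-value: do not reject H0
```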
BUSINESS APPLICATION: LARGE-SAMPLE TEST OF TWO POPULATION MEDIANS

FUTURE-VISION Consider an application involving the managers of a local network television affiliate who are preparing for a national television advertising conference. The theme of the presentation centers on the advantage for businesses of advertising on network television rather than on cable. The managers believe that median household income for cable subscribers is less than the median for those who do not subscribe. Therefore, by advertising on network stations, businesses could reach a higher-income audience. The managers are concerned with the median (as opposed to the mean) income because data such as household income are notorious for having large outliers. In such cases, the median, which is not sensitive to outliers, is a preferable measure of the center of the data. The large outliers are also an indication that the data do not have a symmetric (such as the normal) distribution, another reason to use a nonparametric procedure such as the Mann–Whitney test. The managers can use the Mann–Whitney U-test and the following steps to conduct a test about the median incomes for cable subscribers versus nonsubscribers.

Step 1 Specify the null and alternative hypotheses.
Given that the managers believe that median household income for cable subscribers (C) is less than the median for those who do not subscribe (NC), the null and alternative hypotheses to be tested are

H0: μ̃C ≥ μ̃NC
HA: μ̃C < μ̃NC (claim)

Step 2 Specify the desired level of significance.
The test is to be conducted using α = 0.05.

Step 3 Select the random sample and compute the test statistic.
In the spirit of friendly cooperation, the network managers joined forces with the local cable provider, Future-Vision, to survey a total of 548 households (144 nonsubscribers and 404 cable subscribers) in the market area. The results of the survey are contained in the file Future-Vision. Because of the sample size, we can use the large-sample approach to the Mann–Whitney U-test. To compute the test statistic shown in Equation 17.6, use the following steps:

First, the income data must be converted to ranks. The sample data and ranks are in a file called Future-Vision-Ranks. Note that when data are tied in value, they share the same average rank. For example, if four values are tied for the fifth position, each one is assigned the average of rankings 5, 6, 7, and 8, or (5 + 6 + 7 + 8)/4 = 6.5.

Next, we compute the U-value. The sum of the ranks for noncable subscribers is R1 = 41,204, and the sum of the ranks for cable subscribers is R2 = 109,222. Based on sample sizes of n1 = 144 noncable subscribers and n2 = 404
decision Since z  1.027  1.645, the null hypothesis cannot be rejected Step Draw a conclusion This means that the claim that noncable families have higher median incomes than cable families is not supported by the sample data Chapter Outcome Assumptions The Wilcoxon Matched-Pairs Signed Rank Test The Mann–Whitney U-test is a very useful nonparametric technique However, as discussed in the Blaine County Highway District example, its use is limited to those situations in which the samples from the two populations are independent As we discussed in Chapter 10, you will encounter decision situations in which the samples will be paired and, therefore, are not independent The Wilcoxon matched-pairs signed rank test has been developed for situations in which you have related samples and are unwilling or unable (due to data-level limitations) to use the paired-sample t-test It is useful when the two related samples have a measurement scale that allows us to determine not only whether the pairs of observations differ but also the magnitude of any difference The Wilcoxon matched-pairs test can be used in those cases in which the following assumptions are satisfied: The differences are measured on a continuous variable The measurement scale used is at least interval The distribution of the population differences is symmetric about their median (324) CHAPTER 17 EXAMPLE 17-2 | Introduction to Nonparametric Statistics 783 SMALL-SAMPLE WILCOXON TEST Financial Systems Associates Financial Systems Associates develops and markets financial planning software To differentiate its products from the other packages on the market, Financial Systems has built many macros into its software According to Financial Systems, once a user learns the macro keystrokes, complicated financial computations become much easier to perform As part of its product-development testing program, software engineers at Financial Systems have selected a focus group of seven people who frequently use spreadsheet packages Each person is given complicated financial and accounting data and is asked to prepare a detailed analysis The software tracks the amount of time each person takes to complete the task Once the analysis is complete, these same seven individuals are given a training course in Financial Systems add-ons After the training course, they are given a similar set of data and are asked to the same analysis Again, the systems software determines the time needed to complete the analysis You should recognize that the samples in this application are not independent because the same subjects are used in both cases If the software engineers performing the analysis are unwilling to make the normal distribution assumption required of the paired-sample t-test, they can use the Wilcoxon matched-pairs signed rank test This test can be conducted using the following steps: Step Specify the appropriate null and alternative hypotheses The null and alternative hypotheses being tested are H0 : ~ mb  ~ ma H :~ m ~ m ( Median time will be leess after the training.) 
Chapter Outcome.

The Wilcoxon Matched-Pairs Signed Rank Test

The Mann–Whitney U-test is a very useful nonparametric technique. However, as discussed in the Blaine County Highway District example, its use is limited to those situations in which the samples from the two populations are independent. As we discussed in Chapter 10, you will encounter decision situations in which the samples are paired and, therefore, are not independent. The Wilcoxon matched-pairs signed rank test has been developed for situations in which you have related samples and are unwilling or unable (due to data-level limitations) to use the paired-sample t-test. It is useful when the two related samples have a measurement scale that allows us to determine not only whether the pairs of observations differ but also the magnitude of any difference. The Wilcoxon matched-pairs test can be used in those cases in which the following assumptions are satisfied:

Assumptions
1. The differences are measured on a continuous variable.
2. The measurement scale used is at least interval.
3. The distribution of the population differences is symmetric about their median.

EXAMPLE 17-2 SMALL-SAMPLE WILCOXON TEST

Financial Systems Associates Financial Systems Associates develops and markets financial planning software. To differentiate its products from the other packages on the market, Financial Systems has built many macros into its software. According to Financial Systems, once a user learns the macro keystrokes, complicated financial computations become much easier to perform. As part of its product-development testing program, software engineers at Financial Systems have selected a focus group of seven people who frequently use spreadsheet packages. Each person is given complicated financial and accounting data and is asked to prepare a detailed analysis. The software tracks the amount of time each person takes to complete the task. Once the analysis is complete, these same seven individuals are given a training course in Financial Systems add-ons. After the training course, they are given a similar set of data and are asked to do the same analysis. Again, the systems software determines the time needed to complete the analysis. You should recognize that the samples in this application are not independent because the same subjects are used in both cases. If the software engineers performing the analysis are unwilling to make the normal distribution assumption required of the paired-sample t-test, they can use the Wilcoxon matched-pairs signed rank test. This test can be conducted using the following steps:

Step 1 Specify the appropriate null and alternative hypotheses.
The null and alternative hypotheses being tested are

H0: μ̃b ≤ μ̃a
HA: μ̃b > μ̃a (Median time will be less after the training.)

Step 2 Specify the desired level of significance.
The test will be conducted using α = 0.025.

Step 3 Collect the sample data and compute the test statistic.
The data are shown in Table 17.4. First, we convert the data to differences; the column of differences, d, gives the "before minus after" differences. The next column is the rank of the d-values from low to high. Note that the ranks are determined without considering the sign on the d-value. However, once the rank is determined, the original sign on the d-value is attached to the rank. For example, d = 13 is given a rank of 7, whereas d = −4 has a rank of −3. The final column is titled "Ranks with Smallest Expected Sum." To determine the values in this column, we take the absolute values of either the positive or the negative ranks, depending on which group has the smallest expected sum of absolute-valued ranks. We look to the alternative hypothesis, which is HA: μ̃b > μ̃a. Because the before median is predicted to exceed the after median, we would expect the positive differences to exceed the negative differences. Therefore, the negative ranks should have the smaller sum and are used in the final column, as shown in Table 17.4. The test statistic, T, is equal to the sum of the absolute values of these negative ranks. Thus, T = 5.

TABLE 17.4 | Financial Systems Associates Ranked Data

Subject   Before Training   After Training   d     Rank of d   Ranks with Smallest Expected Sum
1         24                11               13    7
2         20                18               2     1
3         19                23               −4    −3          3
4         20                15               5     4
5         13                16               −3    −2          2
6         28                22               6     5.5
7         15                9                6     5.5
                                                               T = 5

Step 4 Determine the critical value.
To determine whether T is sufficiently small to reject the null hypothesis, we consult the Wilcoxon table of critical T-values in Appendix N. If the calculated T is less than or equal to the critical T from the table, the null hypothesis is rejected. With α = 0.025 for our one-tailed test and n = 7, the critical value is T0.025 = 2.

Step 5 Reach a decision.
Because T = 5 > 2, do not reject H0.

Step 6 Draw a conclusion.
Based on these sample data, Financial Systems Associates does not have a statistical basis for stating that its product will reduce the median time required to perform complicated financial analyses.

END EXAMPLE — TRY PROBLEM 17-20 (pg 786)
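SciPy's wilcoxon function also handles the matched-pairs case directly when given both samples. The sketch below (ours) reruns Example 17-2; because two differences tie in absolute value, SciPy may fall back to a normal approximation for the p-value.

```python
from scipy.stats import wilcoxon

before = [24, 20, 19, 20, 13, 28, 15]
after = [11, 18, 23, 15, 16, 22, 9]
# alternative="greater" matches HA: median(before) > median(after)
stat, p = wilcoxon(before, after, alternative="greater")
# SciPy reports the positive-rank sum (23 here), not the book's T = 5;
# either way, p > 0.025, so H0 is not rejected, as in Example 17-2.
print(stat, p)
```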
Ties in the Data

If the two measurements of an observed data pair have the same values and, therefore, a d-value of 0, that case is dropped from the analysis and the sample size is reduced accordingly. You should note that this procedure favors rejecting the null hypothesis because we are eliminating cases in which the two sample points have exactly the same values. If two or more d-values have the same absolute values, we assign the same average rank to each one, using the same approach as with the Mann–Whitney U-test. For example, if we have two d-values that tie for ranks 4 and 5, we average them as (4 + 5)/2 = 4.5 and assign both a rank of 4.5. Studies have shown that this method of assigning ranks to ties has little effect on the Wilcoxon test results. For a more complete discussion of the effect of ties on the Wilcoxon matched-pairs signed rank test, please see the text by Marascuilo and McSweeney referenced at the end of this chapter.

Large-Sample Wilcoxon Test

If the sample size (number of matched pairs) exceeds 25, the Wilcoxon table of critical T-values in Appendix N cannot be used. However, it can be shown that for large samples, the distribution of T-values is approximately normal, with a mean and standard deviation given by Equations 17.7 and 17.8, respectively.

Wilcoxon Mean and Standard Deviation

$$\mu = \frac{n(n+1)}{4} \qquad (17.7)$$

$$\sigma = \sqrt{\frac{n(n+1)(2n+1)}{24}} \qquad (17.8)$$

where:
n = Number of paired values

The Wilcoxon test statistic is given by Equation 17.9.

Wilcoxon Test Statistic

$$z = \frac{T - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}} \qquad (17.9)$$

The z-value is then compared to the critical value from the standard normal table in the usual manner.

17-2: Exercises

Skill Development

17-12 For each of the following tests, determine which of the two U statistics (U1 or U2) you would choose, the appropriate test statistic, and the rejection region for the Mann–Whitney test:
a. HA: μ̃1 ___ μ̃2, α = 0.05, n1 = 5, and n2 = 10
b. HA: μ̃1 ___ μ̃2, α = 0.05, n1 = 15, and n2 = 12
c. HA: μ̃1 ≠ μ̃2, α = 0.10, n1 = 12, and n2 = 17
d. HA: μ̃1 ___ μ̃2, α = 0.05, n1 = 22, and n2 = 25
e. HA: μ̃1 ≠ μ̃2, α = 0.10, n1 = 44, and n2 = 15

17-13 The following sample data have been collected from two independent samples from two populations. Test the claim that the second population median will exceed the median of the first population.

Sample 1   Sample 2
12         21
15         10
18         16
17         11
14         12
20         12
19

a. State the appropriate null and alternative hypotheses.
b. If you are unwilling to assume that the two populations are normally distributed, based on the sample data, what should you conclude about the null hypothesis? Test using α = 0.05.

17-14 The following sample data have been collected from independent samples from two populations. The claim is that the first population median will be larger than the median of the second population.

Sample 1   Sample 2
4.4        2.7
1.0        3.5
2.8        3.7
3.5        4.0
4.9        3.1
2.6        2.4
2.0        2.8
4.2        5.2
4.4        4.3

a. State the appropriate null and alternative hypotheses.
b. Using the Mann–Whitney U-test, based on the sample data, what should you conclude about the null hypothesis? Test using α = 0.05.

17-15 The following sample data have been collected from two independent random samples from two populations. Test the claim that the first population median will exceed the median of the second population.

Sample 1   Sample 2
50         47
44         48
40         36
38         44
38         37
43         44
43         46
72         40
55         38
31         38
39         54
41         40

a. State the appropriate null and alternative hypotheses.
b. Using the Mann–Whitney U-test, based on the sample data, what should you conclude about the null hypothesis? Test using a significance level of 0.01.

17-16 Determine the rejection region for the Mann–Whitney U-test in each of the following cases:
a. HA: μ̃1 ___ μ̃2, α = 0.05, n1 = 3, and n2 = 15
b. HA: μ̃1 ≠ μ̃2, α = 0.10, n1 = 5, and n2 = 20
c. HA: μ̃1 ___ μ̃2, α = 0.025, n1 = 9, and n2 = 12
d. HA: μ̃1 ≠ μ̃2, α = 0.10, n1 = 124, and n2 = 25

17-17 The following sample data have been collected from independent samples from two populations. Do the populations have different medians?
Test at a significance level of 0.05. Use the Mann–Whitney U-test.

Sample 1:
550  483  379  438  398  582  528  502  352  488
400  451  571  382  588  465  492  384  563  506
489  480  433  436  540  415  532  412  572  579
556  383  515  501  353  369  475  470  595  361

Sample 2:
594  542  447  466  560  447  526  446  573  542
473  418  511  510  577  585  436  461  545  441
538  505  486  425  497  511  576  558  500  467
556  383  515  501  353

17-18 For each of the following tests, determine which of the two sums of absolute ranks (negative or positive) you would choose, the appropriate test statistic, and the rejection region for the Wilcoxon matched-pairs signed rank test:
a. HA: μ̃1 ___ μ̃2, α = 0.025, n = 15
b. HA: μ̃1 ___ μ̃2, α = 0.01, n = 12
c. HA: μ̃1 ≠ μ̃2, α = 0.05, n = 9
d. HA: μ̃1 ___ μ̃2, α = 0.05, n = 26
e. HA: μ̃1 ≠ μ̃2, α = 0.10, n = 44

17-19 You are given two paired samples with the following information:

Item   Sample 1   Sample 2
1      3.4        2.8
2      2.5        3.0
3      7.0        5.5
4      5.9        6.7
5      4.0        3.5
6      5.0        5.0
7      6.2        7.5
8      5.3        4.2

a. Based on these paired samples, test at the α = 0.05 level whether the true median paired difference is 0.
b. Answer part a assuming the data given here were sampled from normal distributions with equal variances.

17-20 You are given two paired samples with the following information:

Item   Sample 1   Sample 2
1      19.6       21.3
2      22.1       17.4
3      19.5       19.0
4      20.0       21.2
5      21.5       20.1
6      20.2       23.5
7      17.9       18.9
8      23.0       22.4
9      12.5       14.3
10     19.0       17.8

Based on these paired samples, test at the α = 0.05 level whether the true median paired difference is 0.

17-21 You are given two paired samples with the following information:

Item   Sample 1   Sample 2
1      1,004      1,045
2      1,245      1,145
3      1,360      1,400
4      1,150      1,000
5      1,300      1,350
6      1,450      1,350
7      900        1,140

Based on these paired samples, test at the α = 0.05 level whether the true median paired difference is 0.

17-22 From a recent study we have collected the following data from two independent random samples:

Sample 1   Sample 2
405        450
290        370
345        460
425        275
380        330
500        215
300        340
400        250
270        410
435        390
225        210
395        315

Suppose we do not wish to assume normal distributions. Use the appropriate nonparametric test to determine whether the populations have equal medians. Test at α = 0.05.

17-23 You are given two paired samples with the following information:

Item   Sample 1   Sample 2
1      234        245
2      221        224
3      196        194
4      245        267
5      234        230
6      204        198

Based on these paired samples, test at the α = 0.05 level whether the true median paired difference is 0.

17-24 Consider the following data for two paired samples:

Case #   Sample 1   Sample 2
1        258        304
2        197        190
3        400        500
4        350        340
5        237        250
6        400        358
7        370        390
8        130        100

a. Test the following null and alternative hypotheses at an α = 0.05 level:
H0: There is no difference between the two population distributions.
HA: There is a difference between the two populations.
b. Answer part a as if the samples were independent samples from normal distributions with equal variances.

Business Applications

17-25 National Reading Academy claims that graduates of its program have a higher median reading speed per minute than people who do not take the course. An independent agency conducted a study to determine whether this claim was justified. Researchers from the agency selected a random sample of people who had taken the speed reading course and another random sample of people who had not taken the course. The agency was unwilling to make the assumption that the populations were normally distributed. Therefore, a nonparametric test was needed. The following summary data were observed:

With Course          Without Course
n = 7                n = 5
Sum of ranks = 42    Sum of ranks = 36
Assuming that higher ranks imply more words per minute being read, what should the testing agency conclude based on the sample data? Test at an α = 0.05 level.

17-26 The makers of the Plus 20 Hardcard, a plug-in hard disk unit on a PC board, have recently done a marketing research study in which they asked two independently selected groups to rate the Hardcard on a scale of ___ to 100, with 100 being perfect satisfaction. The first group consisted of professional computer programmers. The second group consisted of home computer users. The company hoped to be able to say that the product would receive the same median ranking from each group. The following summary data were recorded:

Professionals        Home Users
n = 10               n = 8
Sum of ranks = 92    Sum of ranks = 79

Based on these data, what should the company conclude? Test at the α = 0.02 level.

17-27 Property taxes are based on assessed values of property. In most states, the law requires that assessed values be "at or near" the market value of the property. In one Washington county, a tax protest group has claimed that assessed values are higher than market values. To address this claim, the county tax assessor, together with representatives from the protest group, selected 15 properties at random that had sold within the past six months. Both parties agree that the sales price was the market value at the time of the sale. The assessor then listed the assessed values and the sales values side by side, as shown:

House   Assessed Value ($)   Market Value ($)
1       302,000              198,000
2       176,000              182,400
3       149,000              154,300
4       198,500              198,500
5       214,000              218,000
6       235,000              230,000
7       305,000              298,900
8       187,500              190,000
9       150,000              149,800
10      223,000              222,000
11      178,500              180,000
12      245,000              250,900
13      167,000              165,200
14      219,000              220,700
15      334,000              320,000

a. Assuming that the population of assessed values and the population of market values have the same distribution shape and that they may differ only with respect to medians, state the appropriate null and alternative hypotheses.
b. Test the hypotheses using an α = 0.01 level.
c. Discuss why one would not assume that the samples were obtained from normal distributions for this problem. What characteristic of the market values of houses would lead you to conclude that these data were not normally distributed?

17-28 The Kansas Tax Commission recently conducted a study to determine whether there is a difference in median deductions taken for charitable contributions depending on whether a tax return is filed as a single or a joint return. A random sample from each category was selected, with the following results:

Single               Joint
n = 6                n = 8
Sum of ranks = 43    Sum of ranks = 62

Based on these data, what should the tax commission conclude?
Use an α = 0.05 level.

17-29 A cattle feedlot operator has collected data for 40 matched pairs of cattle showing weight gain on two different feed supplements. His purpose in collecting the data is to determine whether there is a difference in the median weight gain for the two supplements. He has no preconceived idea about which supplement might produce higher weight gain. He wishes to test using an α = 0.05 level. Assuming that the T-value for these data is 480, what should be concluded concerning which supplement might produce higher weight gain? Use the large-sample Wilcoxon matched-pairs signed rank test normal approximation. Conduct the test using a p-value approach.

17-30 Radio advertisements have been stressing the virtues of an audiotape program to help children learn to read. To test whether this tape program can cause a quick improvement in reading ability, 10 children were given a nationally recognized reading test that measures reading ability. The same 10 children were then given the tapes to listen to for ___ hours spaced over a 2-day period. The children were then tested again. The test scores were as follows:

Child   Before   After
1       60       63
2       40       38
3       78       77
4       53       50
5       67       74
6       88       96
7       77       80
8       60       70
9       64       65
10      75       75

If higher scores are better, use the Wilcoxon matched-pairs signed rank test to test whether this tape program produces quick improvement in reading ability. Use an α = 0.025.

17-31 The Montgomery Athletic Shoe Company has developed a new shoe-sole material it thinks provides superior wear compared with the old material the company has been using for its running shoes. The company selected 10 cross-country runners and supplied each runner with a pair of shoes. Each pair had one sole made of the old material and the other made of the new material. The shoes were monitored until the soles wore out. The following lifetimes (in hours) were recorded for each material:

Runner   Old Material   New Material
1        45.5           47.0
2        50.0           51.0
3        43.0           42.0
4        45.5           46.0
5        58.5           58.0
6        49.0           50.5
7        29.5           39.0
8        52.0           53.0
9        48.0           48.0
10       57.5           61.0

a. If the populations from which these samples were taken could be considered to have normal distributions, determine if the soles made of the new material have a longer mean lifetime than those made from the old material. Use a significance level of 0.025.
b. Suppose you were not willing to consider that the populations have normal distributions. Make the determination requested in part a.
c. Given only the information in this problem, which of the two procedures indicated in parts a and b would you choose to use? Give reasons for your answer.

Computer Database Exercises

17-32 For at least the past 20 years, there has been a debate over whether children who are placed in child-care facilities while their parents work suffer as a result. A recent study of 6,000 children discussed in the March 1999 issue of Developmental Psychology found "no permanent negative effects caused by their mothers' absence." In fact, the study indicated that there might be some positive benefits from the day-care experience. To investigate this premise, a nonprofit organization called Child Care Connections conducted a small study in which children were observed playing in neutral settings (not at home or at a day-care center). Over a period of 20 hours of observation, 15 children who did not go to day care and 21 children who had spent much time in day care were observed. The variable of interest was the total minutes of play in which each child was actively interacting with other students. Child Care Connections leaders hoped to show that the children who had been in day care would have a higher median time in interactive situations than the stay-at-home children. The file Children contains the results of the study.
a. Conduct a hypothesis test to determine if the hopes of the Child Care Connections leaders can be substantiated. Use a significance level of 0.05, and write a short statement that describes the results of the test.
b. Based on the outcome of the hypothesis test, which statistical error might have been committed?
17-33 The California State Highway Patrol recently conducted a study on a stretch of interstate highway south of San Francisco to determine whether the mean speed for California vehicles exceeded the mean speed for Nevada vehicles. A total of 140 California cars were included in the study, and 75 Nevada cars were included. Radar was used to measure the speed. The file Speed-Test contains the data collected by the California Highway Patrol.
a. Past studies have indicated that the speeds at which both Nevada and California drivers drive have normal distributions. Using a significance level equal to 0.10, obtain the results desired by the California Highway Patrol. Use a p-value approach to conduct the relevant hypothesis test. Discuss the results of this test in a short written statement.
b. Describe, in the context of this problem, what a Type I error would be.

17-34 The Sunbeam Corporation makes a wide variety of appliances for the home. One product is a digital blood pressure gauge. For obvious reasons, the blood pressure readings made by the monitor need to be accurate. When a new model is being designed, one of the steps is to test it. To do this, a sample of people is selected. Each person has his or her systolic blood pressure taken by a highly respected physician. They then immediately have their systolic blood pressure taken using the Sunbeam monitor. If the mean blood pressure is the same for the monitor as it is as determined by the physician, the monitor is determined to pass the test. In a recent test, 15 people were randomly selected to be in the sample. The blood pressure readings for these people using both methods are contained in the file Sunbeam.
a. Based on the sample data and a significance level equal to 0.05, what conclusion should the Sunbeam engineers reach regarding the latest blood pressure monitor? Discuss your answer in a short written statement.
b. Conduct the test as a paired t-test.
c. Discuss which of the two procedures in parts a and b is more appropriate to analyze the data presented in this problem.

17-35 The Hersh Corporation is considering two word-processing systems for its computers. One factor that will influence its decision is the ease of use in preparing a business report. Consequently, nine typists were selected from the clerical pool and asked to type a typical report using both word-processing systems. The typists then rated the systems on a scale of 0 to 100. The resulting ratings are in the file Hersh.
a. Which measurement level describes the data collected for this analysis?
b. (1) Could a normal distribution describe the population distribution from which these data were sampled? (2) Which measure of central tendency would be appropriate to describe the center of the populations from which these data were sampled?
c. Choose the appropriate hypothesis procedure to determine if there is a difference in the measures of central tendency you selected in part b between these two word-processing systems. Use a significance level of 0.01.
d. Which word-processing system would you recommend the Hersh Corporation adopt?
Support your answer with statistical reasoning.

END EXERCISES 17-2

Chapter Outcome.

17.3 Kruskal–Wallis One-Way Analysis of Variance

Section 17.2 showed that the Mann–Whitney U-test is a useful nonparametric procedure for determining whether two independent samples are from populations with the same median. However, as discussed in Chapter 12, many decisions involve comparing more than two populations. Chapter 12 introduced one-way analysis of variance and showed how, if the assumptions of normally distributed populations with equal variances are satisfied, the F-distribution can be used to test the hypothesis of equal population means. However, what if decision makers are not willing to assume normally distributed populations? In that case, they can turn to a nonparametric procedure to compare the populations. Kruskal–Wallis one-way analysis of variance is the nonparametric counterpart to the one-way ANOVA procedure. It is applicable any time the variables in question satisfy the following conditions:

Assumptions
1. They have a continuous distribution.
2. The data are at least ordinal.
3. The samples are independent.
4. The samples come from populations whose only possible difference is that at least one may have a different central location than the others.

BUSINESS APPLICATION: USING THE KRUSKAL–WALLIS ONE-WAY ANOVA TEST

WESTERN STATES OIL AND GAS Western States Oil and Gas is considering outsourcing its information systems activities, including demand-supply analysis, general accounting, and billing. On the basis of cost and performance standards, the company's information systems manager has reduced the possible suppliers to three, each using a different computer system. One critical factor in the decision is downtime (the time when the system is not operational). When the system goes down, the online applications stop and normal activities are interrupted. The information systems manager received from each supplier a list of firms using its service. From these lists, the manager selected random samples of nine users of each service. In a telephone interview, she found the number of hours of downtime in the previous month for each service. At issue is whether the three computer downtime populations have the same or different centers. If the manager is unwilling to make the assumptions of normality and equal variances required for the one-way ANOVA technique introduced in Chapter 12, she can implement the Kruskal–Wallis nonparametric test using the following steps.

Step 1 Specify the appropriate null and alternative hypotheses to be tested.
In this application the information systems manager is interested in determining whether a difference exists between the median downtimes for the three systems. Thus, the null and alternative hypotheses are

H0: μ̃A = μ̃B = μ̃C
HA: Not all population medians are equal

Step 2 Specify the desired level of significance for the test.
The test will be conducted using a significance level equal to α = 0.10.

Step 3 Collect the sample data and compute the test statistic.
The data represent a random sample of downtimes from each service. The samples are independent. To use the Kruskal–Wallis ANOVA, first replace each downtime measurement by its relative ranking within all groups combined. The smallest downtime is given a rank of 1, the next smallest a rank of 2, and so forth, until all downtimes for the three services have been replaced by their relative rankings. Table 17.5 shows the sample data and the rankings for the 27 observations.
Notice that the rankings are summed for each service. The Kruskal–Wallis test will determine whether these sums are so different that it is not likely that they came from populations with equal medians. If the samples actually come from populations with equal medians (that is, the three services have the same per-month median downtime), then the H statistic, calculated by Equation 17.10, will be approximately distributed as a chi-square variable with k − 1 degrees of freedom, where k equals the number of populations (systems in this application) under study.

TABLE 17.5 | Sample Data and Rankings of System Downtimes for the Western States Oil and Gas Example

Service A               Service B               Service C
Data    Ranking         Data    Ranking         Data    Ranking
4.0     11              6.9     19              0.5     1
3.7     10              11.3    23              1.4     4
5.1     15              21.7    27              1.0     2
2.0     6               9.2     20              1.7     5
4.6     12              6.5     17              3.6     9
9.3     21              4.9     14              2.7     8
5.2     16              12.2    25              1.3     3
2.5     7               11.7    24              6.8     18
4.8     13              10.5    22              14.1    26
Sum of ranks = 103      Sum of ranks = 191      Sum of ranks = 84

H Statistic

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1) \qquad (17.10)$$

where:
N = Sum of the sample sizes from all populations
k = Number of populations
Ri = Sum of ranks in the sample from the ith population
ni = Size of the sample from the ith population

Using Equation 17.10 and the rank sums from Table 17.5, the H statistic is

$$H = \frac{12}{27(27+1)}\left[\frac{103^2}{9} + \frac{191^2}{9} + \frac{84^2}{9}\right] - 3(27+1) = 11.50$$

Step 4 Determine the critical value from the chi-square distribution.
If H is larger than χ² from the chi-square distribution with k − 1 degrees of freedom in Appendix G, the hypothesis of equal medians should be rejected. The critical value for α = 0.10 and k − 1 = 3 − 1 = 2 degrees of freedom is χ²0.10 = 4.6052.

Step 5 Reach a decision.
Since H = 11.50 > 4.6052, reject the null hypothesis based on these sample data.

Step 6 Draw a conclusion.
The Kruskal–Wallis one-way ANOVA shows the information systems manager should conclude, based on the sample data, that the three services do not have equal median downtimes. From this analysis, the supplier with system B would most likely be eliminated from consideration unless other factors such as price or service offset the apparent longer downtimes.
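Equation 17.10 is a one-line computation once the rank sums are known. The sketch below (ours) reproduces H for Table 17.5 and pulls the chi-square cutoff from SciPy rather than from Appendix G; given the three raw samples, scipy.stats.kruskal would carry out the entire procedure, including a tie adjustment.

```python
from scipy.stats import chi2

r = [103, 191, 84]   # rank sums for services A, B, and C (Table 17.5)
n = [9, 9, 9]
big_n = sum(n)       # N = 27 observations in all

# Equation 17.10
h = 12 / (big_n * (big_n + 1)) * sum(ri**2 / ni for ri, ni in zip(r, n)) \
    - 3 * (big_n + 1)
crit = chi2.ppf(1 - 0.10, df=3 - 1)   # 4.6052 for alpha = 0.10 and k - 1 = 2
print(round(h, 2), round(crit, 4), h > crit)   # 11.5 4.6052 True -> reject H0
```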
EXAMPLE 17-3 USING KRUSKAL–WALLIS ONE-WAY ANOVA

Amalgamated Sugar Amalgamated Sugar has recently begun a new effort called total productive maintenance (TPM). The TPM concept is to increase the overall operating effectiveness of the company's equipment. One component of the TPM process attempts to reduce unplanned machine downtime. The first step is to gain an understanding of the current downtime situation. To do this, a sample of 20 days has been collected for each of the three shifts (day, swing, and graveyard). The variable of interest is the minutes of unplanned downtime per shift per day. The minutes are tabulated by summing the downtime minutes for all equipment in the plant. The Kruskal–Wallis test can be performed using the following steps:

Step 1 State the appropriate null and alternative hypotheses.
The Kruskal–Wallis one-way ANOVA procedure can test whether the medians are equal, as follows:

H0: μ̃1 = μ̃2 = μ̃3
HA: Not all population medians are equal

Step 2 Specify the desired significance level for the test.
The test will be conducted using α = 0.05.

Step 3 Collect the sample data and compute the test statistic.
The sample data are in the Amalgamated file. Both Minitab and Excel (using the PHStat add-ins) can be used to perform Kruskal–Wallis nonparametric ANOVA tests.⁴ Figures 17.3a and 17.3b illustrate the Excel and Minitab outputs for these sample data. The calculated H statistic is H = 0.1859.

FIGURE 17.3A | Excel 2007 (PHStat) Kruskal–Wallis ANOVA Output for Amalgamated Sugar
(The output displays the test statistic, the p-value, and the conclusion.)
Excel 2007 Instructions:
1. Open File: Amalgamated.xls.
2. Select Add-Ins.
3. Select PHStat.
4. Select Multiple Sample Tests ➤ Kruskal–Wallis Rank Test.
5. Define data range including labels.
6. Specify significance level (0.05).
7. Click OK.

FIGURE 17.3B | Minitab Kruskal–Wallis ANOVA Output for Amalgamated Sugar
(The output displays the test statistic, the p-value, and the conclusion: do not reject H0.)
Minitab Instructions:
1. Open file: Amalgamated.MTW.
2. Choose Data ➤ Stack ➤ Columns.
3. In Stack the following columns, enter the data columns.
4. In Store the stacked data in, select Column of current worksheet, and enter column name: Downtime.
5. In Store subscripts in, enter column name: Shifts.
6. Click OK.
7. Choose Stat ➤ Nonparametrics ➤ Kruskal–Wallis.
8. In Response, enter data column: Downtime.
9. In Factor, enter factor levels column: Shifts.
10. Click OK.

Step 4 Determine the critical value from the chi-square distribution.
The critical value for α = 0.05 and k − 1 = 3 − 1 = 2 degrees of freedom is χ²0.05 = 5.9915.

Step 5 Reach a decision.
Because H = 0.1859 < 5.9915, we do not reject the null hypothesis. Both the PHStat and Minitab outputs provide the p-value associated with the H statistic. The p-value of 0.9112 far exceeds an alpha of 0.05.

Step 6 Draw a conclusion.
Based on the sample data, the three shifts do not appear to differ with respect to median equipment downtime. The company can now begin to work on steps that will reduce the downtime across the three shifts.

END EXAMPLE — TRY PROBLEM 17-37 (pg 795)

Limitations and Other Considerations

The Kruskal–Wallis one-way ANOVA does not require the assumption of normality and is, therefore, often used instead of the ANOVA technique discussed in Chapter 12. However, the Kruskal–Wallis test as discussed here applies only if the sample size from each population is at least 5, the samples are independently selected, and each population has the same distribution except for a possible difference in central location.

When ranking observations, you will sometimes encounter ties. When ties occur, each observation is given the average rank for which it is tied. The H statistic is influenced by ties and should be corrected by dividing Equation 17.10 by Equation 17.11.

Correction for Tied Rankings—Kruskal–Wallis Test

$$1 - \frac{\sum_{i=1}^{g}(t_i^3 - t_i)}{N^3 - N} \qquad (17.11)$$

where:
g = Number of different groups of ties
ti = Number of tied observations in the ith tied group of scores
N = Total number of observations

The correct formula for calculating the Kruskal–Wallis H statistic when ties are present is Equation 17.12. Correcting for ties increases H. This makes rejecting the null hypothesis more likely than if the correction is not used. A rule of thumb is that if no more than 25% of the observations are involved in ties, the correction factor is not required. Note that if you use Minitab to perform the Kruskal–Wallis test, the adjusted H statistic is provided.

---
⁴In Minitab, the variable of interest must be in one column. A second column contains the population identifier. In Excel, the data are placed in separate columns by population. The column headings identify the population.
Note that if you use Minitab to perform the Kruskal–Wallis test, the adjusted H statistic is provided. The PHStat add-in to Excel for performing the Kruskal–Wallis test does not provide the adjusted H statistic. However, the adjustment is only necessary when the null hypothesis is not rejected and the H statistic is "close" to the rejection region. In that case, making the proper adjustment could lead to rejecting the null hypothesis.

H Statistic Corrected for Tied Rankings
\[ H = \frac{\dfrac{12}{N(N+1)}\displaystyle\sum_{i=1}^{k}\dfrac{R_i^2}{n_i} - 3(N+1)}{1 - \dfrac{\displaystyle\sum_{i=1}^{g}\left(t_i^3 - t_i\right)}{N^3 - N}} \]    (17.12)

MyStatLab
17-3: Exercises

Skill Development

17-36. Given the following sample data:

Group 1 | Group 2 | Group 3
21 | 25 | 36
35 | 33 | 23
31 | 32 | 17
15 | 34 | 22
16 | 19 | 30
20 | 29 | 38
28 | 27 | 14
26 | 39 | 36

a. State the appropriate null and alternative hypotheses to test whether there is a difference in the medians of the three populations.
b. Based on the sample data and a significance level of 0.05, what conclusion should be reached about the medians of the three populations if you are not willing to make the assumption that the populations are normally distributed?
c. Test the hypothesis stated in part a, assuming that the populations are normally distributed with equal variances.
d. Which of the procedures described in parts b and c would you select to analyze the data? Explain your reasoning.

17-37. Given the following sample data:

Group 1: 10, 11, 12, 13, 12
Group 2: 8, 8, 10, 10, 13
Group 3: 12, 12, 11, 13, 15

a. State the appropriate null and alternative hypotheses for determining whether a difference exists in the median value for the three populations.
b. Based on the sample data, use the Kruskal–Wallis ANOVA procedure to test the null hypothesis using α = 0.05. What conclusion should be reached?

17-38. Given the following data:

Group 1 | Group 2 | Group 3 | Group 4
20 | 27 | 26 | 22
25 | 30 | 23 | 28
26 | 21 | 29 | 30
25 | 17 | 15 | 18
20 | 14 | 21 | 23
19 | 17 | 20 |

a. State the appropriate null and alternative hypotheses for determining whether a difference exists in the median value for the four populations.
b. Based on the sample data, use the Kruskal–Wallis one-way ANOVA procedure to test the null hypothesis. What conclusion should be reached using a significance level of 0.10? Discuss.
c. Determine the H-value adjusted for ties.
d. Given the results in part b, is it necessary to use the H-value adjusted for ties? If it is, conduct the hypothesis test using this adjusted value of H. If not, explain why not.

17-39. A study was conducted in which samples were selected independently from four populations. The sample size from each population was 20. The data were converted to ranks. The sum of the ranks for the data from each sample is as follows:

Sum of ranks — Sample 1: 640; Sample 2: 780; Sample 3: 460; Sample 4: 1,360

a. State the appropriate null and alternative hypotheses if we wish to determine whether the populations have equal medians.
b. Use the information in this exercise to perform a Kruskal–Wallis one-way ANOVA.
Business Applications

17-40. The American Beef Growers Association is trying to promote the consumption of beef products. The organization performs numerous studies, the results of which are often used in advertising campaigns. One such study involved a quality perception test. Three grades of beef were involved: choice, standard, and economy. A random sample of people was provided pieces of choice-grade beefsteak and was asked to rate its quality on a scale of 1 to 100. A second sample of people was given pieces of standard-grade beefsteak, and a third sample was given pieces of economy-grade beefsteak, with instructions to rate the beef on the 100-point scale. The following data were obtained:

Choice | Standard | Economy
78 | 87 | 90
87 | 89 | 90
67 | 80 | 78
80 | 67 | 70
65 | 62 | 70
66 | 70 | 73

a. What measurement level do these data possess? Would it be appropriate to assume that such data could be obtained from a normal distribution? Explain your answers.
b. Based on the sample data, what conclusions should be reached concerning the median quality perception scores for the three grades of beef? Test using α = 0.01.

17-41. A study was conducted by the sports department of a national network television station in which the objective was to determine whether a difference exists between median annual salaries of National Basketball Association (NBA) players, National Football League (NFL) players, and Major League Baseball (MLB) players. The analyst in charge of the study believes that the normal distribution assumption is violated in this study. Thus, she thinks that a nonparametric test is in order. The following summary data have been collected:

NBA: n = 20, ΣRi = 1,655
NFL: n = 30, ΣRi = 1,100
MLB: n = 40, ΣRi = 1,340

a. Why would the sports department address the median as the parameter of interest in this analysis, as opposed to the mean? Explain your answer.
b. What characteristics of the salaries of professional athletes suggest that such data are not normally distributed? Explain.
c. Based on these data, what can be concluded about the median salaries for the three sports? Test at an α = 0.05. Assume no ties.

17-42. Referring to Exercise 17-41, suppose that there were 40 ties at eight different salary levels. The following shows how many scores were tied at each salary level:

Level t 10

a. Given the results in the previous exercise, is it necessary to use the H-value adjusted for ties?
b. If your answer to part a is yes, conduct the test of hypothesis using this adjusted value of H. If it is not, explain why not.

17-43. Suppose as part of your job you are responsible for installing emergency lighting in a series of state office buildings. Bids have been received from four manufacturers of battery-operated emergency lights. The costs are about equal, so the decision will be based on the length of time the lights last before failing. A sample of four lights from each manufacturer has been tested, with the following values (time in hours) recorded for each manufacturer:

Type A | Type B | Type C | Type D
1,024 | 1,121 | 1,250 | 1,022
1,270 | 1,325 | 1,426 | 1,322
1,121 | 1,201 | 1,190 | 1,122
923 | 983 | 1,087 | 1,121

Using α = 0.01, what conclusion for the four manufacturers should you reach about the median length of time the lights last before failing? Explain.

Computer Database Exercises

17-44. As purchasing agent for the Horner-Williams Company, you have primary responsibility for securing high-quality raw materials at the best possible prices. One particular material the Horner-Williams Company uses a great deal of is aluminum. After careful study, you have been able to reduce the prospective vendors to three. It is unclear whether these three vendors produce aluminum that is equally durable. To compare durability, the recommended procedure is to put pressure on the aluminum until it cracks. The vendor whose aluminum requires the highest median pressure will be judged to provide the most durable product. To carry out this test, 14 pieces from each vendor have been selected. These data are in the file Horner-Williams. (The data are pounds per square inch pressure.) Using α = 0.05, what should the company conclude about whether there is a difference in the median strength of the three vendors' aluminum?
17-45. A large metropolitan police force is considering changing from full-size to mid-size cars. The police force sampled cars from each of three manufacturers. The number sampled represents the number that the manufacturer was able to provide for the test. Each car was driven for 5,000 miles, and the operating cost per mile was computed. The operating costs, in cents per mile, for the 12 cars are provided in the file Police. Perform the appropriate ANOVA test on these data. Assume a significance level of 0.05. State the appropriate null and alternative hypotheses. Do the experimental data provide evidence that the median operating costs per mile for the three types of police cars are different?

17-46. A nationwide moving company is considering five different types of nylon tie-down straps. The purchasing department randomly selected straps from each company and determined their breaking strengths in pounds. The sample data are contained in the file Nylon. Based on your analysis, with a Type I error rate of 0.05, can you conclude that a difference exists among the median breaking strengths of the types of nylon ropes?

END EXERCISES 17-3

Visual Summary

Chapter 17: Previous chapters introduced a wide variety of commonly used statistical techniques, all of which rely on underlying assumptions about the data used. The t-distribution assumes the population from which the sample is selected is normally distributed. Analysis of variance is based on the assumptions that all populations are normally distributed and have equal variances. In addition, each of these techniques requires the data measurement level for the variables of interest to be either interval or ratio level. In decision-making situations, you will encounter situations in which either the level of data measurement is too low or the distribution assumptions are clearly violated. To handle cases such as these, a class of statistical tools called nonparametric statistics has been developed. While many different nonparametric statistical tests exist, this chapter introduces some of the more commonly used: the Wilcoxon signed rank test for one population median, two nonparametric tests for two population medians (the Mann–Whitney U-test and the Wilcoxon matched-pairs signed rank test), and finally the Kruskal–Wallis one-way analysis of variance test.

17.1 The Wilcoxon Signed Rank Test for One Population Median (pg 771–776)

Summary: In Chapter 9 we introduced the t-test, which is used to test whether a population mean has a specified value. However, the t-test is not appropriate if the data are ordinal or when the populations are not believed to be approximately normally distributed. In these cases, the Wilcoxon signed rank test can be used. This test makes no highly restrictive assumption about the shape of the population distribution. The Wilcoxon test is used to test hypotheses about a population median rather than a population mean. The logic of the Wilcoxon test is that because the median is the midpoint in a population, we would expect approximately half the data values in a random sample to lie below the hypothesized median and about half to lie above it. The hypothesized median will be rejected if the actual data distribution shows too large a departure from a 50-50 split.

Outcome 1. Recognize when and how to use the Wilcoxon signed rank test for a population median.
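As a supplement to the summary above, here is a minimal sketch of how the Wilcoxon signed rank test for one population median can be carried out in Python with scipy. The sample values and the hypothesized median are hypothetical; scipy's wilcoxon() tests whether a set of differences is centered at zero, so the hypothesized median is subtracted first.

```python
# A minimal sketch of the Wilcoxon signed rank test for a single
# population median, using made-up sample values.
import numpy as np
from scipy import stats

hypothesized_median = 25.0
sample = np.array([23.1, 26.4, 24.8, 22.0, 27.5, 25.9,
                   21.7, 26.1, 24.2, 28.0, 23.8, 25.3])

# scipy reports the smaller of the positive and negative rank sums;
# the p-value is the part to act on
stat, p_value = stats.wilcoxon(sample - hypothesized_median)
print("test statistic =", stat, " p-value =", round(p_value, 4))
# Reject H0 (median = 25.0) only if p_value < alpha
```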
17.2 Nonparametric Tests for Two Population Medians (pg 776–789)

Summary: Chapter 10 discussed testing the difference between two population means using the Student t-distribution. Again, the t-distribution assumes the two populations are normally distributed and the data are restricted to being interval or ratio level. Although in many situations these assumptions and the data requirements will be satisfied, you will often encounter situations where they are not. This section introduces two nonparametric techniques that do not require the distribution and data level assumptions of the t-test: the Mann–Whitney U-test and the Wilcoxon matched-pairs signed rank test. Both tests can be used with ordinal (ranked) data, and neither requires that the populations be normally distributed. The Mann–Whitney U-test is used when the samples are independent, whereas the Wilcoxon matched-pairs signed rank test is used when the design has paired samples.

Outcome 2. Recognize the situations for which the Mann–Whitney U-test for the difference between two population medians applies and be able to use it in a decision-making context.
Outcome 3. Know when to apply the Wilcoxon matched-pairs signed rank test for related samples.

17.3 Kruskal–Wallis One-Way Analysis of Variance (pg 789–796)

Summary: Decision makers are often faced with deciding between three or more alternatives. Chapter 12 introduced one-way analysis of variance and showed how, if the assumptions of normally distributed populations with equal variances are satisfied, the F-distribution can be used to test the hypothesis of equal population means. If the assumption of normally distributed populations cannot be made, the Kruskal–Wallis one-way analysis of variance is the nonparametric counterpart to the one-way ANOVA procedure presented in Chapter 12. However, it has its own set of assumptions:
1. The distributions are continuous.
2. The data are at least ordinal.
3. The samples are independent.
4. The samples come from populations whose only possible difference is that at least one may have a different central location than the others.

Outcome 4. Perform nonparametric analysis of variance using the Kruskal–Wallis one-way ANOVA.

Conclusion

Many statistical techniques discussed in this book are based on the assumptions that the data being analyzed are interval or ratio and that the underlying populations are normal. If these assumptions come close to being satisfied, many of the tools discussed before this chapter apply and are useful. However, in many practical situations these assumptions just do not apply. In such cases nonparametric statistical tests may be appropriate. While this chapter introduced some common nonparametric tests, many other nonparametric statistical techniques have been developed for specific applications. Many are aimed at situations involving small samples. Figure 17.4 may help you determine which nonparametric test to use in different situations.

FIGURE 17.4 | Nonparametric Tests Introduced in Chapter 17. The figure is a decision tree keyed to the number of samples: one sample — Wilcoxon signed rank test; two independent samples — Mann–Whitney U test; two paired samples — Wilcoxon signed rank test; three or more independent samples — Kruskal–Wallis one-way ANOVA. Other commonly used nonparametric tests: Friedman test (randomized block ANOVA); sign test (test for randomness); runs test (test for randomness); Spearman rank correlation (measure of the linear relationship between two variables).

Equations

(17.1) Large-Sample Wilcoxon Signed Rank Test Statistic (pg 772)
\[ z = \frac{W - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}} \]

(17.2) and (17.3) U Statistics (pg 778)
\[ U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - \sum R_1 \qquad U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - \sum R_2 \]

(17.4) and (17.5) Mean and Standard Deviation for U Statistic (pg 780)
\[ \mu = \frac{n_1 n_2}{2} \qquad \sigma = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} \]

(17.6) Mann–Whitney U-Test Statistic (pg 780)
\[ z = \frac{U - \dfrac{n_1 n_2}{2}}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} \]

(17.7) and (17.8) Wilcoxon Mean and Standard Deviation (pg 784)
\[ \mu = \frac{n(n+1)}{4} \qquad \sigma = \sqrt{\frac{n(n+1)(2n+1)}{24}} \]

(17.9) Wilcoxon Test Statistic (pg 785)
\[ z = \frac{T - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}} \]

(17.10) H Statistic (pg 791)
\[ H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1) \]

(17.11) Correction for Tied Rankings—Kruskal–Wallis Test (pg 794)
\[ 1 - \frac{\sum_{i=1}^{g}\left(t_i^3 - t_i\right)}{N^3 - N} \]

(17.12) H Statistic Corrected for Tied Rankings (pg 794)
\[ H = \frac{\dfrac{12}{N(N+1)}\displaystyle\sum_{i=1}^{k}\dfrac{R_i^2}{n_i} - 3(N+1)}{1 - \dfrac{\displaystyle\sum_{i=1}^{g}\left(t_i^3 - t_i\right)}{N^3 - N}} \]
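The large-sample Mann–Whitney test can be assembled directly from Equations 17.2 through 17.6, which makes the summary list above concrete. The sketch below does this in Python; the two samples are made-up values, and the cross-check call to scipy.stats.mannwhitneyu() assumes a recent scipy version that accepts the method argument.

```python
# A minimal sketch of the large-sample Mann-Whitney test built from
# Equations 17.2-17.6, using hypothetical samples.
import numpy as np
from scipy import stats

x1 = np.array([310, 295, 320, 330, 340, 305, 315, 325, 300, 335])
x2 = np.array([280, 290, 275, 305, 265, 285, 295, 270, 300, 260])
n1, n2 = len(x1), len(x2)

# Rank the combined samples, then sum the ranks belonging to sample 1
ranks = stats.rankdata(np.concatenate([x1, x2]))
R1 = ranks[:n1].sum()

U1 = n1 * n2 + n1 * (n1 + 1) / 2 - R1            # Equation 17.2
mu_U = n1 * n2 / 2                               # Equation 17.4
sigma_U = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # Equation 17.5
z = (U1 - mu_U) / sigma_U                        # Equation 17.6
print("U =", U1, " z =", round(z, 4))

# Cross-check; scipy applies a continuity correction by default,
# so its p-value can differ slightly from the hand-computed z
U_scipy, p = stats.mannwhitneyu(x1, x2, alternative="two-sided",
                                method="asymptotic")
print("scipy U =", U_scipy, " p-value =", round(p, 4))
```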
MyStatLab
Chapter Exercises

Conceptual Questions

17-47. Find an organization you think would be interested in data that would violate the measurement scale or known distribution assumptions necessary to use the statistical tools found in Chapters 10–12 (retail stores are a good candidate). Determine to what extent this organization considers these problems and whether it uses any of the techniques discussed in this chapter.

17-48. Discuss the data conditions that would lead you to use the Kruskal–Wallis test as opposed to the ANOVA procedure introduced in Chapter 12. Present an example illustrating these conditions.

17-49. In the library, locate two journal articles that use one of the nonparametric tests discussed in this chapter. Prepare a brief outline of the articles, paying particular attention to the reasons given for using the particular test.

17-50. As an example of how the sampling distribution for the Mann–Whitney test is derived, consider two samples with sample sizes n1 = 2 and n2 = 3. The distribution is obtained under the assumption that the two variables, say x and y, are identically distributed. Under this assumption, each measurement is equally likely to obtain one of the ranks between 1 and n1 + n2.
a. List all the possible sets of two ranks that could be obtained from five ranks. Calculate the Mann–Whitney U-value for each of these sets of two ranks.
b. The number of ways in which we may choose n1 ranks from n1 + n2 is given by
\[ \frac{(n_1 + n_2)!}{n_1!\, n_2!} \]
Calculate this value for n1 = 2 and n2 = 3. Now calculate the probability of any one of the possible Mann–Whitney U-values.
c. List all the possible Mann–Whitney U-values you obtained in part a. Then, using part b, calculate the probability that each of these U-values occurs, thereby producing the sampling distribution for the Mann–Whitney U statistic when n1 = 2 and n2 = 3.

17-51. Let us examine how the sampling distribution of the Wilcoxon test statistic is obtained. Consider the sampling distribution of the positive ranks from a sample size of 4. The ranks to be considered are, therefore, 1, 2, 3, and 4. Under the null hypothesis, the differences to be ranked are distributed symmetrically about zero. Thus, each difference is just as likely to be positively as negatively ranked.
a. For a sample size of four, there are 2^4 = 16 possible sets of signs associated with the four ranks. List the 16 possible sets of ranks that could be positive—that is, (none), (1), (2), …, (1, 2, 3, 4). Each of these sets of positive ranks (under the null hypothesis) has the same probability of occurring.
b. Calculate the sum of the ranks of each set specified in part a.
c. Using parts a and b, produce the sampling distribution for the Wilcoxon test statistic when n = 4.

Business Applications

17-52. Students attending West Valley Community College buy their textbooks online from one of two different book sellers because the college does not have a bookstore. The following data represent sample amounts that students spend on books per term:

Company 1 ($): 246, 211, 235, 270, 411, 310, 450, 502, 311, 200
Company 2 ($): 300, 305, 308, 325, 340, 295, 320, 330, 240, 360

a. Do these data indicate a difference in mean textbook prices for the two companies? Apply the Mann–Whitney U test with a significance level of 0.10.
b. Apply the t-test to determine whether the data indicate a difference between the mean amount spent on books at the two companies. Use a significance level of 0.10. Indicate what assumptions must be made to apply the t-test.

17-53. The Hunter Family Corporation owns roadside diners in numerous locations across the country. For the past few months, the company has undertaken a new advertising study. Initially, company executives selected 22 of its retail outlets that were similar with respect to sales volume, profitability, location, climate, economic status of customers, and experience of store management. Each of the outlets was randomly assigned one of two advertising plans promoting a new sandwich product. The accompanying data represent the number of new sandwiches sold during the specific test period at each retail outlet. Hunter executives want you to determine which of the two advertising plans leads to the largest average sales levels for the new product. They are not willing to make the assumptions necessary for you to use the t-test. They do not wish to have an error rate of more than 0.05.

Advertising Plan 1 ($): 1,711; 1,915; 1,905; 2,153; 1,504; 1,195; 2,103; 1,601; 1,580; 1,475; 1,588
Advertising Plan 2 ($): 2,100; 2,210; 1,950; 3,004; 2,725; 2,619; 2,483; 2,520; 1,904; 1,875; 1,943

17-54. The Miltmore Corporation performs consulting services for companies that think they have image problems. Recently, the Bluedot Beer Company approached Miltmore. Bluedot executives were concerned that the company's image, relative to its two closest competitors, had diminished. Miltmore conducted an image study in which a random sample of 8 people was asked to rate Bluedot's image. Five people were asked to rate competitor A's image, and 10
people were asked to rate competitor B’s image The image ratings were made on a 100-point scale, with 100 being the best possible rating Here are the results of the sampling Bluedot Competitor A Competitor B 40 60 70 40 55 90 20 20 95 50 53 55 92 90 80 82 87 93 51 63 72 96 88 a Based on these sample results, should Bluedot conclude there is an image difference among the three companies? Use a significance level equal to 0.05 b Should Bluedot infer that its image has been damaged by last year’s federal government recall of its product? Discuss why or why not c Why might the decision maker wish to use parametric ANOVA rather than the corresponding nonparametric test? Discuss 17-55 The Style-Rite Company of Atlanta makes windbreaker jackets for people who play golf and who are active outdoors during the spring and fall months The company recently developed a new material and is in the process of test-marketing jackets made from the material As part of this test-marketing effort, 10 people were each supplied with a jacket made from the original material and were asked to wear it for two months, washing it at least twice during that time A second group of 10 people was each given a jacket made from the new material and asked to wear it for two months with the same washing requirements Following the two-month trial period, the individuals were asked to rate the jackets on a scale of to 100, with being the worst performance rating and 100 being the best The ratings for each material are shown as follows: Original Material 76 34 70 23 45 80 10 46 67 75 New Material 55 90 72 17 56 69 91 95 86 74 (342) CHAPTER 17 The company expects that, on the average, the performance ratings will be superior for the new material a Examine the data given What characteristics of these data sets would lead you to reject the assumption that the data came from populations that had normal distributions and equal variances? b Do the sample data support this belief at a significance level of 0.05? Discuss 17-56 A study was recently conducted by the Bonneville Power Association (BPA) to determine attitudes regarding the association’s policies in western U.S states One part of the study asked respondents to rate the performance of the BPA on its responsiveness to environmental issues The following responses were obtained for a sample of 12 urban residents and 10 rural residents The ratings are on a to 100 scale, with 100 being perfect Urban 76 90 86 60 43 96 50 20 30 82 75 84 | 801 Introduction to Nonparametric Statistics Based on the 18 customer balances sampled, is there enough evidence to allow you to conclude the median balance has changed? 
Test at the 0.05 level of significance 17-58 During the production of a textbook, there are many steps between when the author begins preparing the manuscript and when the book is finally printed and bound Tremendous effort is made to minimize the number of errors of any type in the text One type of error that is especially difficult to eliminate is the typographical error that can creep in when the book is typeset The Prolythic Type Company does contract work for many publishers As part of its quality control efforts, it charts the number of corrected errors per page in its manuscripts In one particularly difficult to typeset book, the following data were observed for a sample of 15 pages (in sequence): Rural 55 80 94 40 85 92 77 68 35 59 a Based on the sample data, should the BPA conclude that there is no difference between the urban and rural residents with respect to median environmental rating? Test using a significance level of 0.02 b Perform the appropriate parametric statistical test and indicate the assumptions necessary to use this test that were not required by the Mann–Whitney tests Use a significance level of 0.02 (Refer to Chapter 10, if needed.) 17-57 The manager of credit card operations for a small regional bank has determined that last year’s median credit card balance was $1,989.32 A sample of 18 customer balances this year revealed the following figures, in dollars: Sample Sample Sample Sample Sample Sample 1,827.85 1,992.75 2,012.35 1,955.64 2,023.19 1,998.52 Sample Sample Sample Sample 10 Sample 11 Sample 12 2,003.75 1,752.55 1,865.32 2,013.13 2,225.35 2,100.35 Sample 13 Sample 14 Sample 15 Sample 16 Sample 17 Sample 18 2,002.02 1,850.37 1,995.35 2,001.18 2,252.54 2,035.75 Page Errors 2 10 11 12 13 14 15 Is there sufficient evidence to conclude the median number of errors per page is greater than 6? 17-59 A Vermont company is monitoring a process that fills maple syrup bottles When the process is filling correctly, the median average fill in an 8-ounce bottle of syrup is 8.03 ounces The last 15 bottles sampled revealed the following levels of fill: 7.95 8.04 8.02 8.07 8.06 8.05 8.04 7.97 8.05 8.08 8.11 7.99 8.00 8.02 8.01 a Formulate the null and alternative hypotheses needed in this situation b Do the sample values support the null or alternative hypothesis? 
Computer Database Exercises 17-60 A major car manufacturer is experimenting with three new methods of pollution control The testing lab must determine whether the three methods produce equal pollution reductions Readings from a calibrated carbon monoxide meter are taken from groups of engines randomly equipped with one of the three control units The data are in the file Pollution Determine whether the three pollution-control methods will produce equal results 17-61 A business statistics instructor at State University has been experimenting with her testing procedure This term, she has taken the approach of giving two tests over each section of material The first test is a problem-oriented exam, in which students have to set up and solve applications problems The exam is worth 50 points The second test, given a day later, is a (343) 802 CHAPTER 17 | Introduction to Nonparametric Statistics multiple-choice test, covering the concepts introduced in the section of the text covered by the exam This exam is also worth 50 points In one class of 15 students, the observed test scores over the first section of material in the course are contained in the file State University a If the instructor is unwilling to make the assumptions for the paired-sample t-test, what should she conclude based on these data about the distribution of scores for the two tests if she tests at a significance level of 0.05? b In the context of this problem, define a Type II error 17-62 Two brands of tires are being tested for tread wear To control for vehicle and driver variation, one tire of each brand is put on the front wheels of 10 cars The cars are driven under normal driving conditions for a total of 15,000 miles The tread wear is then measured using a very sophisticated instrument The data that were observed are in the file Tread Wear (Note that the larger the number, the less wear in the tread.) a What would be the possible objection in this case for employing the paired-sample t-test? Discuss b Assuming that the decision makers in this situation are not willing to make the assumptions required to perform the paired-sample t-test, what decision should be reached using the appropriate nonparametric test if a significance level of 0.05 is used? Discuss 17-63 High Fuel Company markets a gasoline additive for automobiles that it claims will increase a car’s miles per gallon (mpg) performance In an effort to determine whether High Fuel’s claim is valid, a consumer testing agency randomly selected eight makes of automobiles Each car’s tank was filled with gasoline and driven around a track until empty Then the car’s tank was refilled with gasoline and the additive, and the car was driven until the gas tank was empty again The miles per gallon were measured for each car with and without the additive The results are reported in the file High Fuel The testing agency is unwilling to accept the assumption that the underlying probability distribution is normally distributed, but it would still like to perform a statistical test to determine the validity of High Fuel’s claim a What statistical test would you recommend the testing agency use in this case? Why? b Conduct the test that you believe to be appropriate Use a significance level of 0.025 c State your conclusions based on the test you have just conducted Is High Fuel’s claim supported by the test’s findings? 
17-64 A company assembles remote controls for television sets The company’s design engineers have developed a revised design that they think will make it faster to assemble the controls To test whether the new design leads to faster assembly, 14 assembly workers were randomly selected and each worker was asked to assemble a control using the current design and then asked to assemble a control using the revised design The times in seconds to assemble the controls are shown in the file Remote Control The company’s engineers are unable to assume that the assembly times are normally distributed, but they would like to test whether assembly times are lower using the revised design a What statistical test you recommend the company use? Why? b State the null and alternative hypotheses of interest to the company c At the 0.025 level of significance, is there any evidence to support the engineers’ belief that the revised design reduces assembly time? d How might the results of the statistical test be used by the company’s management? Case 17.A Bentford Electronics—Part On Saturday morning, Jennifer Bentford received a call at her home from the production supervisor at Bentford Electronics Plant The supervisor indicated that she and the supervisors from Plants 2, 3, and had agreed that something must be done to improve company morale and, thereby, increase the production output of their plants Jennifer Bentford, president of Bentford Electronics, agreed to set up a Monday morning meeting with the supervisors to see if they could arrive at a plan for accomplishing these objectives By Monday each supervisor had compiled a list of several ideas, including a 4-day work week and interplant competition of various kinds After listening to the discussion for some time, Jennifer Bentford asked if anyone knew if there was a difference in average daily output for the four plants When she heard no positive response, she told the supervisors to select a random sample of daily production reports from each plant and test whether there was a difference They were to meet again on Wednesday afternoon with test results By Wednesday morning, the supervisors had collected the following data on units produced: (344) CHAPTER 17 Plant Plant Plant Plant 4,306 2,852 1,900 4,711 2,933 3,627 1,853 1,948 2,702 4,110 3,950 2,300 2,700 2,705 2,721 2,900 2,650 2,480 1,704 2,320 4,150 3,300 3,200 2,975 The supervisors had little trouble collecting the data, but they were at a loss about how to determine whether there was a | Introduction to Nonparametric Statistics 803 difference in the output of the four plants Jerry Gibson, the company’s research analyst, told the supervisors that there were statistical procedures that could be used to test hypotheses regarding multiple samples if the daily output was distributed in a bell shape (normal distribution) at each plant The supervisors expressed dismay because no one thought his or her output was normal Jerry Gibson indicated that there were techniques that did not require the normality assumption, but he did not know what they were The meeting with Jennifer Bentford was scheduled to begin in hours, so he needed some statistical-analysis help immediately References Berenson, Mark L., and David M Levine, Basic Business Statistics: Concepts and Applications, 11th ed (Upper Saddle River, NJ: Prentice Hall, 2009) Conover,W J., Practical Nonparametric Statistics, 3rd ed (New York: John Wiley and Sons, 1999) Dunn, O J., “Multiple Comparisons Using Rank Sums,” Technometrics, (1964), pp 
241–252 Marascuilo, Leonard A., and Maryellen McSweeney, Nonparametric and Distribution-Free Methods for Social Sciences (Pacific Grove, CA: Brooks/Cole, 1977) Microsoft Excel 2007 (Redmond,WA: Microsoft Corp., 2007) Minitab for Windows Version 15 (State College, PA: Minitab, 2007) Noether, G E., Elements of Nonparametric Statistics (New York: John Wiley & Sons, 1967) (345) chapter 18 Chapter 18 Quick Prep Links Modern quality control is based on many of the statistical concepts you have covered up to now, so to adequately understand the material, you need to review many previous topics • Review how to construct and interpret line • Review how to determine the mean and charts, covered in Chapter • Make sure you are familiar with the steps involved in determining the mean and standard deviation of the binomial and Poisson distributions, covered in Chapter standard deviation of samples and the meaning of the Central Limit Theorem from Chapter • Finally, become familiar again with how to determine a confidence interval estimate and test a hypothesis of a single population parameter as covered in Chapters and Introduction to Quality and Statistical Process Control 18.1 Quality Management and Tools for Process Improvement (pg 805–808) Outcome Use the seven basic tools of quality 18.2 Introduction to Statistical Process Control Charts Outcome Construct and interpret x-charts and R-charts (pg 808–830) Outcome Construct and interpret p-charts Outcome Construct and interpret c-charts Why you need to know Organizations across the United States and around the world have turned to quality management in an effort to meet the competitive challenges of the international marketplace Although there is no set approach for implementing quality management, a commonality among most organizations is for employees at all levels to be brought into the effort as members of process improvement teams Successful organizations, such as General Electric and HewlettPackard, realize that thrusting people together in teams and then expecting process improvement to occur will generally lead to disappointment They know that their employees need to understand how to work together as a team In many instances, teams are formed to improve a process so that product or service quality is enhanced However, teamwork and team building must be combined with training and education in the proper tools if employees are to be successful at making lasting process improvements Over the past several decades, a number of techniques and methods for process improvement have been developed and used by organizations As a group, these are referred to as the Tools of Quality Many of these tools are based on statistical procedures and data analysis One set of quality tools known as statistical process control charts is so prevalent in business today, and is so closely linked to the material in Chapters through 9, that its coverage merits a separate chapter Today, successful managers must have an appreciation of, and familiarity with, the role of quality in process improvement activities This chapter is designed to introduce you to the fundamental tools and techniques of quality management and to show you how to construct and interpret statistical process control charts 804 (346) CHAPTER 18 | Introduction to Quality and Statistical Process Control 805 18.1 Quality Management and Tools for Process Improvement Total Quality Management A journey to excellence in which everyone in the organization is focused on continuous process improvement directed 
toward increased customer satisfaction.

Pareto Principle
80% of the problems come from 20% of the causes.

From the end of World War II through the mid-1970s, industry in the United States was kept busy meeting a pent-up demand for its products both at home and abroad. The emphasis in most companies was on getting the "product out the door." The U.S. effort to produce large quantities of goods and services led to less emphasis on quality. During this same time, Dr. W. Edwards Deming and Dr. Joseph Juran were consulting with Japanese business leaders to help them rebuild their economic base after World War II. Deming, a statistician, and Juran, an engineer, emphasized that quality was the key to being competitive and that quality could be best achieved by improving the processes that produced the products and delivered the services. Employing the process improvement approach was a slow, but effective, method for improving quality, and by the early 1970s Japanese products began to exceed those of the United States in terms of quality. The impact was felt by entire industries, such as the automobile and electronics industries. Whereas Juran focused on quality planning and helping businesses drive costs down by eliminating waste in processes, Deming preached a new management philosophy, which has become known as Total Quality Management, or TQM. There are about as many definitions of TQM as there are companies who have attempted to implement it.

In the early 1980s, U.S. business leaders began to realize the competitive importance of providing high-quality products and services, and a quality revolution was under way in the United States. Deming's 14 points (see Table 18.1) reflected a new philosophy of management that emphasized the importance of leadership. The numbers attached to each point do not indicate an order of importance; rather, the 14 points collectively are seen as necessary steps to becoming a world-class company.

TABLE 18.1 | Deming's 14 Points
1. Create a constancy of purpose toward the improvement of products and services in order to become competitive, stay in business, and provide jobs.
2. Adopt the new philosophy. Management must learn that it is in a new economic age and awaken to the challenge, learn its responsibilities, and take on leadership for change.
3. Stop depending on inspection to achieve quality. Build in quality from the start.
4. Stop awarding contracts on the basis of low bids.
5. Improve continuously and forever the system of production and service to improve quality and productivity, and thus constantly reduce costs.
6. Institute training on the job.
7. Institute leadership. The purpose of leadership should be to help people and technology work better.
8. Drive out fear so that everyone may work effectively.
9. Break down barriers between departments so that people can work as a team.
10. Eliminate slogans, exhortations, and targets for the workforce. They create adversarial relationships.
11. Eliminate quotas and management by objectives. Substitute leadership.
12. Remove barriers that rob employees of their pride of workmanship.
13. Institute a vigorous program of education and self-improvement.
14. Make the transformation everyone's job and put everyone to work on it.

Juran's role in the quality movement was also important. Juran is noted for many contributions to TQM, including his 10 steps to quality improvement, which are outlined in Table 18.2. Note that Juran and Deming differed with respect to the use of goals and targets. Juran is also credited with applying the Pareto principle to quality. Juran urges managers to use the Pareto principle to focus on the vital few sources of problems and to separate the vital few from the trivial many. A form of a bar chart, a Pareto chart is used to display data in a way that helps managers find the most important problem issues.

TABLE 18.2 | Juran's 10 Steps to Quality Improvement
1. Build awareness of both the need for improvement and the opportunity for improvement.
2. Set goals for improvement.
3. Organize to meet the goals that have been set.
4. Provide training.
5. Implement projects aimed at solving problems.
6. Report progress.
7. Give recognition.
8. Communicate the results.
9. Keep score.
10. Maintain momentum by building improvement into the company's regular systems.

There have been numerous other individuals who have played significant roles in the quality movement. Among these are Philip B. Crosby, who is probably best known for his book Quality Is Free, in which he emphasized that in the long run, the costs of improving quality are more than offset by the reductions in waste, rework, returns, and unsatisfied customers. Kauro Ishikawa is credited with developing and popularizing the application of the fishbone diagram that we will discuss shortly in the section on the Basic Tools of Quality. There is also the work of Masaaki Imai, who popularized the philosophy of kaizen, or people-based continuous improvement. Finally, we must not overlook the contributions of many different managers at companies such as Hewlett-Packard, General Electric, Motorola, Toyota, and Federal Express. These leaders synthesized and applied many different quality ideas and concepts to their organizations in order to create world-class corporations. By sharing their successes with other firms, they have inspired and motivated others to continually seek opportunities in which the tools of quality can be applied to improve business processes.

Chapter Outcome 1.

The Tools of Quality for Process Improvement

Once U.S. managers realized that their businesses were engaged in a competitive battle with companies around the world, they reacted in many ways. Some managers ignored the challenge and continued to see their market presence erode. Other managers realized that they needed a system or approach for improving their firms' operations and processes. The Deming Cycle, which is illustrated in Figure 18.1, has been effectively used by many organizations as a guide to their quality improvement efforts. The approach taken by the Deming Cycle is that problems should be identified and solved based on data.

FIGURE 18.1 | The Deming Cycle (a circular cycle: Plan, Do, Study, Act)

Over time, a collection of tools and techniques known as the seven Basic Tools has been developed for quality and process improvement. Some of these tools have already been introduced at various points throughout this text. However, we will briefly discuss all seven Basic Tools in this section. Section 18.2 will explore one of these tools—Statistical Process Control Charts—in greater depth.

Process Flowcharts  A flowchart is a diagram that illustrates the steps in a process. Flowcharts provide a visualization of the process and are good beginning points in planning a process improvement effort.

Brainstorming  Brainstorming is a tool that is used to generate ideas from the members of the team. Employees are encouraged to share any idea that comes to mind, and all ideas are listed, with no ideas being evaluated until all ideas are posted. Brainstorming
can be either structured or unstructured. In structured brainstorming, team members are asked for their ideas, in order, around the table. Members may pass if they have no further ideas. With unstructured brainstorming, members are free to interject ideas at any point.

Fishbone Diagram  Kauro Ishikawa from Japan is credited with developing the fishbone diagram, which is also called the cause-and-effect diagram or the Ishikawa diagram. The fishbone diagram can be applied as a simple graphical brainstorming tool in which team members are given a problem and several categories of possible causes. They then brainstorm possible causes in any or all of the cause categories.

Histograms  You were introduced to histograms in Chapter 2 as a method for graphically displaying quantitative data. Recall that histograms are useful for identifying the center, spread, and shape of a distribution of measurements. As a tool of quality, histograms are used to display measurements to determine whether the output of a process is centered on the target value and whether the process is capable of meeting specifications.

Trend Charts  In Chapter 2 we illustrated the use of a line chart to display time-series data. A trend chart is a line chart that is used to track output from a process over time.

Scatter Plots  There are many instances in quality improvement efforts in which you will want to examine the relationship between two quantitative variables. A scatter plot is an excellent tool for doing this. You were first introduced to scatter plots in Section 2.3 in Chapter 2. Scatter plots were also used in developing regression models in Chapters 14 and 15.

Statistical Process Control Charts  One of the most frequently used Basic Tools is the statistical process control (SPC) chart. SPC charts are a special type of trend chart. In addition to the data, the charts display the process average and the upper and lower control limits. These control limits define the range of random variation expected in the output of a process. SPC charts are used to provide early warnings when a process has gone out of control. There are several types of control charts depending on the type of data generated by the process of interest. Section 18.2 presents an introductory discussion of why control charts work and how to develop and interpret some of the most commonly used SPC charts. You will have the opportunity to use the techniques presented in this chapter and throughout this text as you help your organization meet its quality challenges.

MyStatLab
18-1: Exercises

Skill Development

18-1. Discuss the similarities and differences between Dr. Deming's 14 points and Dr. Juran's 10 steps to quality improvement.

18-2. Deming is opposed to setting quotas or specific targets for workers.
a. Use the library or the Internet to locate information that explains his reasoning.
b. Discuss whether you agree or disagree with him.
c. If possible, cite one or more examples based on your own experience to support your position.

18-3. Philip Crosby wrote the book Quality Is Free. In it he argues that it is possible to have high quality and low price. Do you agree?
Provide examples that support your position.

Business Applications

18-4. Develop a process flowchart that describes the registration process at your college or university.

18-5. Generate a process flowchart showing the steps you have to go through to buy a concert ticket at your school.

18-6. Assume that you are a member of a team at your university charged with the task of improving student, faculty, and staff parking. Use brainstorming to generate a list of problems with the current parking plan. After you have generated the list, prioritize the list of problems based on their importance to you.

18-7. Refer to Exercise 18-6.
a. Use brainstorming to generate a list of solutions for the top-rated parking problem.
b. Order these possible solutions separately according to each of the following factors: cost to implement, time to implement, and easiest to gain approval.
c. Did your lists come out in a different order? Why?

18-8. Suppose the computer lab manager at your school has been receiving complaints from students about long waits to use a computer. The "effect" is long wait times. The categories of possible causes are people, equipment, methods/rules, and the environment. Brainstorm specific ideas for each of these causes.

18-9. The city bus line is consistently running behind schedule. Brainstorm possible causes organized by such cause categories as people, methods, equipment, and the environment. Once you have finished, develop a priority order based on which cause is most likely to be the root cause of the problem.

END EXERCISES 18-1

18.2 Introduction to Statistical Process Control Charts

As we stated in Section 18.1, one of the most important tools for quality improvement is the statistical process control (SPC) chart. In this section we provide an overview of SPC charts. As you will see, SPC is actually an application of hypothesis testing.

The Existence of Variation

After studying the material in Chapters 1 through 17, you should be well aware of the importance of variation in business decision making. Variation exists naturally in the world around us. In any process or activity, the day-to-day outcomes are rarely the same. As a practical example, think about the time it takes you to travel to the university each morning. You know it's about a 15-minute trip, and even though you travel the same route, your actual time will vary somewhat from day to day. You will notice this variation in many other daily occurrences. The next time you renew your car license plates, notice that some people seem to get through faster than others. The same is true at a bank, where the time to cash your payroll check varies each payday.

Even in instances when variation is hard to detect, it is present. For example, when you measure a stack of 4-foot by 8-foot sheets of plywood using a tape measure, they will all appear to be 4 feet wide. However, when the stack is measured using an engineer's scale, you may be able to detect slight variations among sheets, and using a caliper you can detect even more (see Figure 18.2). Therefore, three concepts to remember about variation are:
1. Variation is natural; it is inherent in the world around us.
2. No two products or service experiences are exactly the same.
3. With a fine-enough gauge, all things can be seen to differ.

Sources of Variation

What causes variation? Variation in the output of a process comes from variation in the inputs to the process. Let's go back to your travel time to school. Why isn't it always the same?
Your travel time depends on many factors, such as what route you take, how much traffic you encounter, whether you are in a hurry, how your car is running, and so on.

FIGURE 18.2 | Plywood Variation. As the measuring device becomes more precise (tape measure, engineer's scale, caliper, electronic microscope), apparently identical 4-foot-wide sheets in an 8-foot stack show increasingly visible differences in measured width (e.g., 4.01', 3.99', 4.01', 4.0' with the engineer's scale; 4.00913', 3.98672', 4.01204', 4.00395' with the caliper).

The six most common sources of variation are:
1. People
2. Machines
3. Materials
4. Methods
5. Measurement
6. Environment

Types of Variation

Although variation is always present, we can define two major types that occur. The first is called common cause variation, which means it is naturally occurring or expected in the system. Other terms people use for common cause variation include normal, random, chance occurrence, inherent, and stable variation. The other type of variation is called special cause variation. This type of variation is abnormal, indicating that something out of the ordinary has happened. This type of variation is also called nonrandom, unstable, and assignable cause variation. In our example of travel time to school, there are common causes of variation such as traffic lights, traffic patterns, weather, and departure time. On the days when it takes you significantly more or less time to arrive, there are also special causes of variation occurring. These may be factors such as accidents, road construction detours, or needing to stop for gas. Examples of the two types of variation and some sources are as follows:

Sources of Common Cause Variation: weather conditions; inconsistent work methods; machine wear; temperature; employee skill levels; computer response times.
Sources of Special Cause Variation: equipment not maintained and cleaned; poor training; worker fatigue; procedures not followed; misuse of tools; incorrect data entry.

In any process or system, the total process variation is a combination of common cause and special cause factors. This can be expressed by Equation 18.1.

Variation Components
\[ \text{Total process variation} = \text{Common cause variation} + \text{Special cause variation} \]    (18.1)

In process improvement efforts, the goal is first to remove the special cause variation and then to reduce the common cause variation in a system. Removing special cause variation requires that the source of a variation be identified and its cause eliminated.

The Predictability of Variation: Understanding the Normal Distribution

Dr. W. Edwards Deming said there is no such thing as consistency. However, there is such a thing as a constant-cause system. A system that contains only common cause variation is very predictable. Though the outputs vary, they exhibit an important feature called stability. This means that some percentage of the output will continue to lie within given limits hour after hour, day after day, so long as the constant-cause system is operating. When a process exhibits stability, it is in control. The reason that the outputs vary in a predictable manner is that measurable data, when subgrouped and pictured in a histogram, tend to cluster around the average and spread out symmetrically on both sides. This tendency is a function of the Central Limit Theorem that you first encountered earlier in the text. This means that the frequency distribution of most processes will begin to resemble the shape of the normal distribution as the values are collected and grouped into classes.
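A small simulation makes the idea of a constant-cause system concrete. The Python sketch below (with arbitrary simulation settings) generates output from a stable process, confirms that roughly 99.7% of values stay within 3 standard deviations of the mean, and then shows how a special cause, modeled here as a sudden shift in the process mean, pushes output beyond those limits.

```python
# A minimal sketch illustrating stability: a process driven only by
# common cause variation keeps nearly all output within 3 standard
# deviations of its mean. All values are simulated.
import numpy as np

rng = np.random.default_rng(seed=1)
process = rng.normal(loc=19.0, scale=2.0, size=10_000)  # stable process

mean, sd = process.mean(), process.std()
within = np.mean(np.abs(process - mean) <= 3 * sd)
print(f"share within mean +/- 3 sd: {within:.4f}")   # about 0.997

# A special cause (a shift in the mean for the last 1,000 values)
# pushes output past the old limits
shifted = process + np.where(np.arange(process.size) > 9_000, 7.0, 0.0)
outside = np.mean(np.abs(shifted - mean) > 3 * sd)
print(f"share outside the old limits after the shift: {outside:.4f}")
```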
The Concept of Stability

We showed earlier in the text that the normal distribution can be divided into six sections, the sum of which includes 99.7% of the data values. The width of each of these sections is called the standard deviation. The standard deviation is the primary way the spread (or dispersion) of the distribution is measured. Thus, we expect virtually all (99.7%) of the data in a stable process to fall within plus or minus 3 standard deviations of the mean. Generally speaking, as long as the measurements fall within the 3-standard-deviation boundary, we consider the process to be stable. This concept provides the basis for statistical process control charts.

Introducing Statistical Process Control Charts

Most cars are equipped with a temperature gauge that measures engine temperature. We come to rely on the gauge to let us know if "everything is all right." As long as the gauge points to the normal range, we conclude that there is no problem. However, if the gauge moves outside the normal range toward the hot mark, it's a signal that the engine is overheating and something is wrong. If the gauge moves out of the normal range toward the cold mark, it's also a signal of potential problems. Under typical driving conditions, engine temperature will fluctuate. The normal range on the car's gauge defines the expected temperature variation when the car is operating properly. Over time, we come to know what to expect. If something changes, the gauge is designed to give us a signal.

The engine temperature gauge is analogous to a process control chart. Like the engine gauge, process control charts are used in business to define the boundaries that represent the amount of variation that can be considered normal. Figure 18.3 illustrates the general format of a process control chart. The upper and lower control limits define the normal operating region for the process. The horizontal axis reflects the passage of time, or order of production. The vertical axis corresponds to the variable of interest.

FIGURE 18.3 | Process Control Chart Format. The chart shows a centerline at the process average, an upper control limit (UCL = Average + 3 Standard Deviations), and a lower control limit (LCL = Average − 3 Standard Deviations). Common cause variation falls between the limits (99.7% of values); special cause variation lies outside the limits (only 0.3% of values fall outside when the process is in control).

There are a number of different types of process control charts. In this section, we introduce four of the most commonly used:
1. x̄-chart
2. R-chart (range chart)
3. p-chart
4. c-chart

Each of these charts is designed for a special purpose. However, as you will see, the underlying logic is the same for each. The x̄-chart and R-chart are almost always used in tandem. The x̄-charts are used to monitor a process average. R-charts are used to monitor the variation of individual process values. They require the variable of interest to be quantitative. The following Business Application shows how these two charts are developed and used.

Chapter Outcome 2.

x̄-Chart and R-Chart

BUSINESS APPLICATION    MONITORING A PROCESS USING x̄- AND R-CHARTS
(Excel and Minitab Tutorial)

CATTLEMEN'S BAR AND GRILL  The Cattlemen's Bar and Grill in Kansas City, Missouri, has developed a name for its excellent food and service. To maintain this reputation, the owners have established key measures of product and service quality, and they monitor these regularly. One measure is the amount of time customers wait from the time they are seated until they are served. Every day, each hour that the business is open, four tables are randomly selected. The elapsed time from when customers are seated at these tables until their orders arrive is recorded. The owners wish to use these data to construct an x̄-chart and an R-chart. These control charts can be developed using the following steps:

Step 1 Collect the initial sample data from which the control charts will be developed.
Four measurements during each hour for 30 hours are contained in the file Cattlemen. The four values recorded each hour make up a subgroup. The x̄-charts and R-charts are typically generated from small subgroups (three to six observations), and the general recommendation is that data from a minimum of 20 subgroups be gathered before a chart is constructed. Once the subgroup size is determined, all subgroups must be the same size. In this case, the subgroup size is four tables.

Step 2 Calculate subgroup means and ranges.
Figure 18.4 shows the Excel worksheet with a partial listing of the data after the means and ranges have been computed for each subgroup.

FIGURE 18.4 | Excel 2007 Worksheet of Cattlemen's Service-Time Data, Including Subgroup Means and Ranges

Excel 2007 Instructions:
1. Open file: Cattlemen.xls.
(Note: Mean and Range values have been computed using Excel functions.)

Minitab Instructions (for similar results):
1. Open file: Cattlemen.MTW.
2. Choose Calc > Row Statistics.
3. Under Statistics, choose Mean.
4. In Input variables, enter data columns.
5. In Store result in, enter storage column.
6. Click OK.
7. Repeat 2 through 6, choosing Range under Statistics.
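For readers working outside of Excel or Minitab, Step 2's arithmetic can also be reproduced in a few lines of Python. The three hourly subgroups below are hypothetical stand-ins for the 30 subgroups in the Cattlemen file.

```python
# A minimal sketch of Step 2 for data laid out like the Cattlemen file:
# one row per hour, four observed service times (minutes) per row.
import numpy as np

subgroups = np.array([
    [18.2, 21.0, 17.5, 21.3],   # hour 1 (hypothetical values)
    [22.4, 19.8, 20.1, 21.7],   # hour 2
    [16.9, 18.4, 20.6, 19.2],   # hour 3
])

x_bar = subgroups.mean(axis=1)                      # subgroup means
R = subgroups.max(axis=1) - subgroups.min(axis=1)   # subgroup ranges
print("means: ", x_bar.round(2))
print("ranges:", R.round(2))
```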
Step 3 Compute the average of the subgroup means and the average range value.
The average of the subgroup means and the average range value are computed using Equations 18.2 and 18.3.

Average Subgroup Mean
\[ \bar{\bar{x}} = \frac{\sum_{i=1}^{k} \bar{x}_i}{k} \]    (18.2)
where:
x̄_i = ith subgroup average
k = Number of subgroups

Average Subgroup Range
\[ \bar{R} = \frac{\sum_{i=1}^{k} R_i}{k} \]    (18.3)
where:
R_i = ith subgroup range
k = Number of subgroups

Using Equations 18.2 and 18.3, we get:
\[ \bar{\bar{x}} = \frac{\sum \bar{x}}{k} = \frac{19.5 + 21 + \cdots + 20.75}{30} = 19.24 \]
\[ \bar{R} = \frac{\sum R}{k} = \frac{7 + 7 + \cdots + 3}{30} = 5.73 \]

Step 4 Prepare graphs of the subgroup means and ranges as a line chart.
On one graph, plot the x̄-values in time order across the graph and draw a line across the graph at the value corresponding to \(\bar{\bar{x}}\). This is shown in Figure 18.5. Likewise, graph the R-values and \(\bar{R}\) as a line chart, as shown in Figure 18.6.

FIGURE 18.5 | Line Chart for x̄-Values for Cattlemen's Data (subgroup means plotted by hour, with a horizontal line at \(\bar{\bar{x}}\))

FIGURE 18.6 | Line Chart for R-Values for Cattlemen's Data (subgroup ranges plotted by hour, with a horizontal line at \(\bar{R}\))

The \(\bar{\bar{x}}\) and \(\bar{R}\) values in Figures 18.5 and 18.6 are called the "process control centerlines." The centerline is a graph of the mean value of the sample data. We use \(\bar{\bar{x}}\) as the notation for the centerline, which represents the current process average. For these sample data, the average time people in the subgroups wait between being seated and being served is 19.24 minutes. However, as seen in Figure 18.5, there is variation around the centerline. The next step is to establish the boundaries that define the limits for what is considered normal variation in the process.
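Continuing the earlier sketch, Step 3 is just an average of the k subgroup means and the k subgroup ranges (Equations 18.2 and 18.3). The arrays below are hypothetical; with the full 30 Cattlemen subgroups these averages work out to 19.24 and 5.73.

```python
# A minimal sketch of Step 3: compute the process centerlines.
import numpy as np

x_bar = np.array([19.5, 21.0, 18.7])   # subgroup means (hypothetical)
R = np.array([7.0, 7.0, 3.0])          # subgroup ranges (hypothetical)
k = len(x_bar)

x_double_bar = x_bar.sum() / k   # Equation 18.2: x-bar chart centerline
R_bar = R.sum() / k              # Equation 18.3: R-chart centerline
print(round(x_double_bar, 2), round(R_bar, 2))
```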
Step 5: Compute the upper and lower control limits for the x̄-chart.
For a normal distribution with mean μ and standard deviation σ, approximately 99.7% of the values will fall within μ ± 3σ. Most process control charts are developed as 3-sigma control charts, meaning that the range of inherent variation extends 3 standard deviations on either side of the mean. Because the x̄-chart is a graph of subgroup (sample) means, the control limits are established at points 3 standard errors from the centerline, x̿.

The control limits are analogous to the critical values we establish in hypothesis-testing problems. Using this analogy, the null hypothesis is that the process is in control. We will reject this null hypothesis whenever we obtain a subgroup mean beyond 3 standard errors from the centerline in either direction. Because the control chart is based on sample data, our conclusions are subject to error. Approximately 3 times in 1,000 (0.003), a subgroup mean will be outside the control limits when, in fact, the process is still in control. If this happens, we will have committed a Type I error; the 0.003 value is the significance level for the test. This small alpha level implies that 3-sigma control charts are very conservative when it comes to saying that a process is out of control. We might also conclude that the process is in control when in fact it isn't; if this happens, we have committed a Type II error.

To construct the control limits, we must determine the standard error of the sample means, σ/√n. Based on what you have learned in previous chapters, you might suspect that we would use s/√n. However, in most applications this is not done. In the 1930s, when process control charts were first introduced, there was no such thing as a pocket calculator. To make control charts usable by people without calculators and without statistical training, a simpler approach was needed. An unbiased estimator of the standard error of the sample means that was relatively easy to calculate was developed by Walter Shewhart.¹ The unbiased estimator is

    \frac{A_2}{3}\bar{R}

where:
    \bar{R} = the mean of the subgroups' ranges
    A_2 = a Shewhart factor that makes (A_2/3)\bar{R} an unbiased estimator of the standard error of the sample means, σ/√n

Thus, 3 standard errors of the sample means can be estimated by

    3\left(\frac{A_2}{3}\bar{R}\right) = A_2\bar{R}

Appendix Q displays the Shewhart factors for various subgroup sizes. Equations 18.4 and 18.5 are used to compute the upper and lower control limits for the x̄-chart.²

Upper Control Limit, x̄-Chart

    UCL = \bar{\bar{x}} + A_2\bar{R}        (18.4)

Lower Control Limit, x̄-Chart

    LCL = \bar{\bar{x}} - A_2\bar{R}        (18.5)

where:
    A_2 = Shewhart factor for subgroup size n

¹The leader of a group at the Bell Telephone Laboratories that did much of the original work in SPC, Shewhart is credited with developing the idea of control charts.
²When A_2/3 is multiplied by \bar{R}, the product becomes an unbiased estimator of the standard error, which is the reason for A_2's use here.

For the Cattlemen's Bar and Grill example, the subgroup size is 4. Thus, the A_2 factor from the Shewhart table (Appendix Q) is 0.729. We can compute the upper and lower control limits as follows:

    UCL = x̿ + A_2(R̄) = 19.24 + 0.729(5.73) = 23.42
    LCL = x̿ − A_2(R̄) = 19.24 − 0.729(5.73) = 15.06
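As a quick check of this arithmetic, here is a small Python sketch (an illustration, not part of the text) that reproduces the x̄-chart limits from the values just computed:

```python
# x-bar chart control limits (Equations 18.4 and 18.5) for the Cattlemen data.
x_double_bar = 19.24   # centerline from Equation 18.2
r_bar = 5.73           # average range from Equation 18.3
A2 = 0.729             # Shewhart factor for subgroups of size 4 (Appendix Q)

ucl = x_double_bar + A2 * r_bar   # Equation 18.4 -> 23.42
lcl = x_double_bar - A2 * r_bar   # Equation 18.5 -> 15.06
print(f"x-bar chart: UCL = {ucl:.2f}, LCL = {lcl:.2f}")
```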
Step 6: Compute the upper and lower control limits for the R-chart.
The D_3 and D_4 factors in the Shewhart table presented in Appendix Q are used to compute the 3-sigma control limits for the R-chart. The control limits are established at points 3 standard errors from the centerline, R̄. However, unlike the case for the x̄-chart, the unbiased estimator of the standard error of the sample ranges is a constant multiplied by R̄: the constant for the lower control limit is the D_3 value from the Shewhart table, and the D_4 value is the constant for the upper control limit. Equations 18.6 and 18.7 are used to find the UCL and LCL values. Because the subgroup size is 4 in our example, D_3 = 0.0 and D_4 = 2.282.³

Upper Control Limit, R-Chart

    UCL = D_4\bar{R}        (18.6)

Lower Control Limit, R-Chart

    LCL = D_3\bar{R}        (18.7)

where D_3 and D_4 are taken from Appendix Q, the Shewhart table, for subgroup size n.

Using Equations 18.6 and 18.7, we get:

    UCL = D_4(R̄) = 2.282(5.73) = 13.08
    LCL = D_3(R̄) = 0.0(5.73) = 0.0

Step 7: Finish constructing the control charts by locating the control limits on the x̄- and R-charts.
Graph the UCL and LCL values on the x̄-chart and R-chart, as shown in Figures 18.7a and 18.7b and Figures 18.8a and 18.8b, which were constructed using the PHStat add-in to Excel and Minitab.⁴

³Because a range cannot be negative, the constant is adjusted so that the lower boundary for the range equals 0.
⁴See the Excel and Minitab Tutorial for the specific steps required to obtain the x̄- and R-charts. Minitab provides a much more extensive set of SPC options than PHStat.

[Figure 18.7A: Excel 2007 (PHStat) Cattlemen's x̄-chart output. Instructions: open Cattlemen.xls; select Add-Ins > PHStat > Control Charts > R and X-Bar Charts; specify the subgroup size; define the cell range for the subgroup ranges; check the R and X-Bar Chart option; define the cell range for the subgroup means; click OK.]

[Figure 18.7B: Minitab Cattlemen's x̄-chart output. Instructions: open Cattlemen.MTW; choose Stat > Control Charts > Variables Charts for Subgroups > Xbar; select "Observations for a subgroup are in one row of columns" and enter the data columns; because the file contains additional data, use Data Options to exclude rows 31 through 36; under Xbar Options, select the Tests tab and choose the first four tests for special causes; click OK.]

[Figure 18.8A: Excel 2007 (PHStat) Cattlemen's R-chart output. The PHStat steps are the same as for Figure 18.7A.]

[Figure 18.8B: Minitab Cattlemen's R-chart output, showing UCL = 13.08, centerline R̄ = 5.733, and LCL = 0. The Minitab steps parallel those for Figure 18.7B, using the R chart and its options.]
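Step 7 can also be reproduced outside PHStat or Minitab. The sketch below assumes matplotlib is available (any plotting library would do), and the 30 "subgroup means" are random placeholders standing in for the actual Cattlemen values:

```python
# Sketch of Step 7: an x-bar chart with centerline and 3-sigma limits.
# The 30 subgroup means are random placeholders, not the Cattlemen data.
import random
import matplotlib.pyplot as plt

random.seed(18)
xbars = [random.uniform(16.0, 23.0) for _ in range(30)]
x_double_bar, ucl, lcl = 19.24, 23.42, 15.06

hours = range(1, 31)
plt.plot(hours, xbars, marker="o", label="subgroup mean")
plt.axhline(x_double_bar, color="green", label="centerline")
plt.axhline(ucl, color="red", linestyle="--", label="UCL = 23.42")
plt.axhline(lcl, color="red", linestyle="--", label="LCL = 15.06")
plt.xlabel("Hour")
plt.ylabel("Mean service time (minutes)")
plt.title("x-bar chart, Cattlemen's Bar and Grill")
plt.legend()
plt.show()
```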
Both students and people in industry sometimes confuse control limits and specification limits. Specification limits are arbitrary in the sense that they are defined by a customer, by an industry standard, or by the engineers who designed the item; they are set as values above and below the "target" value for the item. Specification limits pertain to individual items: an item either meets specifications or it does not. Process control limits, by contrast, are computed from actual data from the process. These limits define the range of inherent variation that is actually occurring in the process; they are values above and below the current process average (which may be higher or lower than the "target"). Therefore, a process may be operating in a state of control but still be producing individual items that do not meet the specifications. Companies interested in improving quality must first bring the process under control before attempting to make changes in the process to reduce the defect level.

Using the Control Charts

Once control charts for Cattlemen's service time have been developed, they can be used to determine whether the time it takes to serve customers remains in control. The concept involved is essentially a hypothesis test in which the null and alternative hypotheses can be stated as

H0: The process is in control; the variation around the centerline is a result of common causes inherent in the process.
HA: The process is out of control; the variation around the centerline is due to some special cause and is beyond what is normal for the process.

In the Cattlemen's Bar and Grill example, the hypothesis is tested every hour, when four tables are selected and the service time is recorded for each table. The x̄- and R-values for the new subgroup are computed and plotted on their respective control charts. There are three main process changes that can be detected with a process control chart:

1. The process average has shifted up or down from normal.
2. The process average is trending up or down from normal.
3. The process is behaving in such a manner that the existing variation is not random in nature.

If any of these has happened, the null hypothesis is considered false and the process is considered to be out of control. The control charts are used to provide signals that something has changed. There are four primary signals that indicate a change and that, if observed, will cause us to reject the null hypothesis:⁵

Signals
1. One or more points outside the upper or lower control limits
2. Nine or more points in a row above (or below) the centerline
3. Six or more consecutive points moving in the same direction (increasing or decreasing)
4. Fourteen points in a row, alternating up and down

These signals were derived such that the probability of a Type I error is less than 0.01. Thus, there is a very small chance that we will conclude the process has changed when, in fact, it has not. If we examine the control charts in Figures 18.7a and 18.7b and 18.8a and 18.8b, we find that none of these signals occurs. Thus, the process is deemed in control during the period in which the initial sample data were collected.

Suppose that the Cattlemen's owners monitor the process for the next 5 hours. Table 18.3 shows these new values, along with the mean and range for each hour. The means are plotted on the x̄-chart, and the R-values are plotted on the R-chart, as shown in Figures 18.9 and 18.10.

TABLE 18.3  Data for Hours 31 to 35 for Cattlemen's Bar and Grill

Hour   Table 1   Table 2   Table 3   Table 4   Mean    Range
31     20        21        24        22        21.75   4
32     17        22        18        20        19.25   5
33     23        20        22        22        21.75   3
34     24        23        19        20        21.50   5
35     24        25        26        27        25.50   3

⁵There is some minor disagreement on the signals, depending on which process control source you refer to. Minitab lets the user define the signals under the option Define Tests.
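Before turning to the charts themselves, note that the hour-by-hour check of signal 1 is simple to script. The Python sketch below (illustrative, not from the text) tests the new subgroup means in Table 18.3 against the x̄-chart limits; the other three signals are pattern checks over consecutive points and can be coded in a similar spirit.

```python
# Check the new subgroup means from Table 18.3 against the x-bar chart limits
# (signal 1: one or more points outside the control limits).
ucl, lcl = 23.42, 15.06
new_means = {31: 21.75, 32: 19.25, 33: 21.75, 34: 21.50, 35: 25.50}

for hour, mean in new_means.items():
    status = "OUT OF CONTROL" if (mean > ucl or mean < lcl) else "in control"
    print(f"hour {hour}: mean = {mean:5.2f} -> {status}")
# Only hour 35 (25.50 > 23.42) triggers the signal.
```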
[Figure 18.9: Cattlemen's x̄-chart for hours 1 through 35, with UCL = 23.42 and LCL = 15.06. The subgroup mean for hour 35 (x̄ = 25.5) plots above the UCL and is flagged as out of control.]

When x̄- and R-charts are used, we first look at the R-chart. Figure 18.10 shows that the range (R) has been below the centerline (R̄ = 5.733) for seven consecutive hours. Although this doesn't quite come up to the nine points of signal 2, the owners should begin to suspect something unusual might be happening to cause a downward shift in the variation in service time between tables. Although the R-chart does not indicate the reason for the shift, the owners should be pleased, because this might indicate greater consistency in service times. This change may represent a quality improvement. If this trend continues, the owners will want to study the situation so they will be able to retain these improvements in service-time variability.

[Figure 18.10: Cattlemen's R-chart for hours 1 through 35, with UCL, centerline R̄, and LCL = 0.0. Seven consecutive points fall below the centerline.]

The x̄-chart in Figure 18.9 indicates that the average service time is out of control, because in hour 35 the mean service time exceeded the upper control limit of 23.42 minutes. The mean wait time for the four tables during this hour was 25.5 minutes. The chance of this happening is extremely low unless something has changed in the process. This should be a signal to the owners that a special cause exists. They should immediately investigate possible problems to determine whether there has been a system change (e.g., a training issue) or whether this is truly a one-time event (e.g., a fire in the kitchen). They could use a brainstorming technique or a fishbone diagram to identify possible causes of the problem.

An important point is that the control charts should not be analyzed in isolation. A moment's consideration will lead you to see that if the variation of the process has gotten out of control (above the upper control limit), then trying to interpret the x̄-chart can be very misleading. Widely fluctuating variation could make it much more probable that an x̄-value would exceed the control limits even though the process mean had not changed. Adding (or subtracting) a given number to all of the numbers in a data set does not change the variance of that data set, so a shift in the mean of a process can occur without that shift affecting the variation of the process. However, a change in the variation almost always affects the x̄ control chart. For this reason, the general advice is to bring the variation of a process under control before subjecting the process mean to control chart analysis.

The process control charts signal the user when something in the process has changed. For process control charts to be effective, they must be updated immediately after data have been collected, and action must be taken when the charts signal that a change has occurred.

p-Charts

The previous example illustrated how x̄-charts and R-charts can be developed and used. They are used in tandem and are applicable when the characteristic being monitored is a variable measured on a continuous scale (e.g., time, weight, length).
However, there are instances when the process issue involves an attribute rather than a quantitative variable. An attribute is a quality characteristic that is either present or not present. In many quality control situations, the attribute is whether an item is good (meets specifications) or defective, and in those cases a p-chart can be used to monitor the proportion of defects.

BUSINESS APPLICATION: CONSTRUCTING p-CHARTS

HILDER'S PUBLISHING COMPANY  Hilder's Publishing Company sells books and records through a catalog, processing hundreds of mail and phone orders each day. Each customer order requires numerous data-entry steps. Mistakes made in data entry can be costly, resulting in shipping delays, incorrect prices, or the wrong items being shipped. As part of its ongoing efforts to improve quality, Hilder's managers and employees want to reduce errors. The manager of the order-entry department has developed a process control chart to monitor order-entry errors. For each of the past 30 days she has selected a random sample of 100 orders. These orders were examined, with the attribute being

● Order entry is correct
● Order entry is incorrect

In developing a p-chart, the sample size should be large enough that np ≥ 5 and n(1 − p) ≥ 5. Unlike the x̄- and R-chart cases, the sample size may differ from sample to sample; however, this complicates the development of the p-chart. The p-chart can be developed using the following steps:

Step 1: Collect the sample data.
The sample size is 100 orders. A sample of this size was selected for each of 30 days. The proportion of incorrect orders, called nonconformances, for each day is contained in the file Hilders. The proportions are given the notation p.

Step 2: Plot the subgroup proportions as a line chart.
Figures 18.11a and 18.11b show the line chart for the 30 days.

[Figure 18.11A: Excel 2007 (PHStat) p-chart for Hilder's Publishing. Instructions: open Hilders.xls; select Add-Ins > PHStat > Control Charts > p-Charts; define the cell range for the number of nonconformances (do not enter the cell range for the proportions); check "Sample/Subgroup Size does not vary"; enter the sample/subgroup size; click OK.]

[Figure 18.11B: Minitab p-chart for Hilder's Publishing, showing UCL = 0.1604, p̄ = 0.0793, and LCL = 0. Instructions: open Hilders.MTW; choose Stat > Control Charts > Attribute Charts > P; enter the column containing the number of orders with errors and a subgroup size of 100; under P Chart Options, select the Tests tab and "Perform all tests for special causes"; click OK.]

Step 3: Compute the mean subgroup proportion for all samples using Equation 18.8 or 18.9, depending on whether the sample sizes are equal.

Mean Subgroup Proportion
For equal-size samples:

    \bar{p} = \frac{\sum_{i=1}^{k} p_i}{k}        (18.8)

where:
    p_i = sample proportion for subgroup i
    k = number of samples of size n

For unequal sample sizes:

    \bar{p} = \frac{\sum_{i=1}^{k} n_i p_i}{\sum_{i=1}^{k} n_i}        (18.9)

where:
    n_i = the number of items in sample i
    p_i = sample proportion for subgroup i
    \sum_{i=1}^{k} n_i = total number of items sampled in k samples
    k = number of samples

Because we have equal sample sizes, we use Equation 18.8, as follows:

    p̄ = (0.10 + 0.06 + 0.06 + 0.07 + ⋯)/30 = 2.38/30 = 0.0793

Thus, the average proportion of orders with errors is 0.0793.
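The Step 3 calculation is again straightforward to script. In this Python sketch (illustrative; only the proportions quoted above are listed, so the printed numbers are partial), Equation 18.8 handles the equal-size case and Equation 18.9 the weighted, unequal-size case:

```python
# Mean subgroup proportion. With all 30 daily proportions from the Hilders
# file, the sum is 2.38, giving p-bar = 2.38/30 = 0.0793; only the first
# few values quoted in the text are listed here.
p_vals = [0.10, 0.06, 0.06, 0.07]            # ... 26 more values in the file
p_bar = sum(p_vals) / len(p_vals)            # Equation 18.8 (equal sizes)

# Equation 18.9: weight each proportion by its sample size when sizes differ.
sizes = [100, 120, 90, 100]                  # hypothetical unequal sizes
p_bar_weighted = sum(n * p for n, p in zip(sizes, p_vals)) / sum(sizes)
print(p_bar, p_bar_weighted)
```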
Step 4: Compute the standard error of p using Equation 18.10.

Standard Error for the Subgroup Proportions
For equal sample sizes:

    s_p = \sqrt{\frac{\bar{p}(1-\bar{p})}{n}}        (18.10)

where:
    \bar{p} = mean subgroup proportion
    n = common sample size

For unequal sample sizes, two options are available:
Option 1: Compute s_p using the largest sample size and again using the smallest sample size, and construct control limits using each value.
Option 2: Compute a unique value, s_p = \sqrt{\bar{p}(1-\bar{p})/n_i}, for each different sample size n_i, and construct control limits for each s_p value, producing "wavy" control limits.

We compute s_p using Equation 18.10, as follows:

    s_p = \sqrt{(0.0793)(1 − 0.0793)/100} = 0.027

Step 5: Compute the 3-sigma control limits using Equations 18.11 and 18.12.

Control Limits for p-Chart

    UCL = \bar{p} + 3 s_p        (18.11)
    LCL = \bar{p} - 3 s_p        (18.12)

where:
    \bar{p} = mean subgroup proportion
    s_p = estimated standard error of p̄ = \sqrt{\bar{p}(1-\bar{p})/n}

Using Equations 18.11 and 18.12, we get the following control limits:

    UCL = 0.079 + 3(0.027) = 0.160
    LCL = 0.079 − 3(0.027) = −0.002 → 0.0

Because a proportion of nonconforming items cannot be negative, the lower control limit is set to 0.0.

Step 6: Plot the centerline and control limits on the control chart.
Both upper and lower control limits are plotted on the control charts in Figures 18.11a and 18.11b.

Using the p-Chart

Once the control chart is developed, the same rules are used as for the x̄- and R-charts:⁶

Signals
1. One or more points outside the upper or lower control limits
2. Nine or more points in a row above (or below) the centerline
3. Six or more consecutive points moving in the same direction (increasing or decreasing)
4. Fourteen points in a row, alternating up and down

The p-chart shown in Figure 18.11a indicates the process is in control. None of the signals is present in these data, so the variation in the nonconformance rates is assumed to be due to common causes. For future days, the managers would select random samples of 100 orders, count the number with errors, and compute the proportion. This value would be plotted on the p-chart. For each day, the managers would use the control chart to test the hypotheses:

H0: The process is in control. The variation around the centerline is a result of common causes and is inherent in the process.
HA: The process is out of control. The variation around the centerline is due to some special cause and is beyond what is normal for the process.

The signals mentioned previously would be used to test the null hypothesis. Remember, control charts are most useful when they are updated as soon as new sample data become available. When a signal of special cause variation is present, you should take action to determine the source of the problem and address it as quickly as possible.

⁶Minitab allows the user to specify the signals. This is done in the Define Tests feature under Stat—Control Charts. Minitab also allows unequal sample sizes. See the Excel and Minitab Tutorial for specifics on developing a p-chart.
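Steps 4 and 5 can be verified with a few lines of Python (an illustration using the Hilder's results from the text):

```python
# p-chart standard error and 3-sigma limits (Equations 18.10-18.12),
# using the Hilders results: p-bar = 0.0793 with n = 100 orders per day.
import math

p_bar, n = 0.0793, 100
s_p = math.sqrt(p_bar * (1 - p_bar) / n)   # Equation 18.10 -> about 0.027

ucl = p_bar + 3 * s_p                      # Equation 18.11 -> about 0.160
lcl = max(p_bar - 3 * s_p, 0.0)            # Equation 18.12, floored at 0
print(f"p-chart: UCL = {ucl:.3f}, LCL = {lcl:.3f}")
```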
c-Charts

The p-chart just discussed is used when you select a sample of items and determine the number of the sampled items that possess a specific attribute of interest; each item either has or does not have that attribute. You will encounter other situations that involve attribute data but differ from the p-chart applications. In these situations, you have what is defined as a sampling unit (or experimental unit), which could be a sheet of plywood, a door panel on a car, a textbook page, an hour of service, or any other defined unit of space, volume, or time. Each sampling unit could have one or more of the attributes of interest, and you would be able to count the number of attributes present in each sampling unit. In cases in which the sampling units are the same size, the appropriate control chart is a c-chart.

BUSINESS APPLICATION: CONSTRUCTING c-CHARTS

CHANDLER TILE COMPANY  The Chandler Tile Company makes ceramic tile. In recent years, there has been a big demand for tile products in private residences for kitchens and bathrooms and in commercial establishments for decorative counter and wall coverings. Although demand has increased, so has competition. The senior management at Chandler knows that three factors are key to winning business from contractors: price, quality, and service. One quality issue is scratches on a tile. The production managers wish to set up a control chart to monitor the level of scratches per tile to determine whether the production process remains in control.

Special Note: Control charts monitor a process as it currently operates, not necessarily how you would like it to operate. Thus, a process that is in control might still yield a higher number of scratches per tile than the managers would like.

The managers believe that the numbers of scratches per tile are independent of each other and that the prevailing operating conditions are consistent from tile to tile. In this case, the proper control chart is a c-chart. Here the tiles being sampled are the same size, and the managers will tally the number of scratches per tile. However, if we asked, "How many opportunities were there to scratch each tile?" we probably would not be able to answer the question; there are more opportunities than we could count. For this reason, the c-chart is based on the Poisson probability distribution introduced in Chapter 5, rather than the binomial distribution. You might recall that the Poisson distribution is defined by the mean number of successes per interval, or sampling unit, as shown in Equation 18.13. A success can be regarded as a defect, a nonconformance, or any other characteristic of interest. In the Chandler example, a success is a scratch on a tile.

Mean for c-Chart

    \bar{c} = \frac{\sum_{i=1}^{k} x_i}{k}        (18.13)

where:
    x_i = number of successes per sampling unit
    k = number of sampling units

Because the Poisson distribution is skewed when the mean of the sampling unit is small, we must define the sampling unit so that it is large enough to provide an average of at least 5 successes per sampling unit (c̄ ≥ 5). This may require that you combine smaller sampling units into a larger unit size. In this case we combine six tiles to form a sampling unit.

The mean and the variance of the Poisson distribution are identical. Therefore, the standard deviation of the Poisson distribution is the square root of its mean, and for this reason the estimator of the standard deviation is computed as the square root of the sample mean, as shown in Equation 18.14.

Standard Deviation for c-Chart

    s_c = \sqrt{\bar{c}}        (18.14)

Equations 18.15 and 18.16 are then used to compute the 3-sigma (3 standard deviation) control limits for the c-chart.

c-Chart Control Limits

    UCL = \bar{c} + 3\sqrt{\bar{c}}        (18.15)

and

    LCL = \bar{c} - 3\sqrt{\bar{c}}        (18.16)
You can use the following steps to construct a c-chart:

Step 1: Collect the sample data.
The original plan called for the Chandler Tile Company to select six tiles each hour from the production line and to perform a thorough inspection to count the number of scratches per tile. As with all control charts, at least 20 samples are desired in developing the initial control chart. After collecting 40 sampling units of six tiles each, the total number of scratches found was 228. The data set is contained in the file Chandler.

Step 2: Plot the subgroup number of occurrences as a line chart.
Figures 18.12a and 18.12b show the line chart for the 40 sampling units.

Step 3: Compute the average number of occurrences per sampling unit using Equation 18.13.
The mean is

    c̄ = Σx/k = 228/40 = 5.70

Step 4: Compute the standard deviation, s_c, using Equation 18.14.
The standard deviation is

    s_c = √c̄ = √5.70 = 2.387

Step 5: Construct the 3-sigma control limits using Equations 18.15 and 18.16.
The upper and lower 3-sigma control limits are

    UCL = c̄ + 3√c̄ = 5.7 + 3(2.387) = 12.86
    LCL = c̄ − 3√c̄ = 5.7 − 3(2.387) = −1.46 → 0.0

Step 6: Plot the centerline and control limits on the control chart.
Both upper and lower control limits are plotted on the control charts in Figures 18.12a and 18.12b. As with the p-chart, the lower control limit can't be negative, so we change it to zero, which is the fewest possible scratches on a tile.

[Figure 18.12A: Excel 2007 c-chart for Chandler Tile Company. Instructions: open Chandler.xls; calculate the c̄ value using Excel's AVERAGE function and copy this value into a new column; calculate the standard deviation as the square root of the mean using Excel's SQRT function and copy that value into a new column; select the three columns; click Insert > Line Chart; use Layout to add axis labels and chart titles.]

[Figure 18.12B: Minitab c-chart for Chandler Tile Company, with out-of-control points flagged. Instructions: open the Chandler file; choose Stat > Control Charts > Attributes Charts > C; enter the column containing the number of defectives; under C Chart Options, select the Tests tab and "Perform all tests for special causes"; click OK.]

The completed c-chart is shown in Figures 18.12a and 18.12b. Note in both figures that four samples of six tiles each had a total number of scratches that fell outside the upper control limit of 12.86. The managers need to consult production records and other information to determine what special cause might have generated this level of scratches. If they can determine the cause, these data points should be removed and the control limits should be recomputed from the remaining 36 values. You might also note that the graph changes beginning with about sample 22; the process seems more stable from sample 22 onward. Managers might consider inspecting for another 13 to 15 hours and recomputing the control limits using data from hours 22 and higher.

How to do it: Constructing SPC charts
The following steps are employed when constructing statistical quality control charts:
1. Collect the sample data.
2. Plot the subgroup statistics as a line chart.
3. Compute the average subgroup statistic, i.e., the centerline value. The centerline on the control chart is the average value for the sample data; this is the current process average.
4. Compute the appropriate standard error.
5. Compute the upper and lower control limits.
6. Plot the appropriate data on the control chart, along with the centerline and control limits.

Other Control Charts
Our purpose in this chapter has been to introduce SPC charts. We have illustrated a few of the most frequently used charts. However, there are many other types of control charts that can be used in special situations. You are encouraged to consult several of the references listed at the end of the chapter for information about these other charts. Regardless of the type of statistical quality control chart you are using, the same general steps apply.
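As an illustration of those general steps, here is a minimal Python sketch of the Chandler c-chart computation (Steps 3 through 5); it simply reproduces the arithmetic from the text:

```python
# c-chart centerline and 3-sigma limits (Equations 18.13-18.16) for the
# Chandler data: 228 scratches counted over 40 sampling units of six tiles.
import math

total_scratches, k = 228, 40
c_bar = total_scratches / k          # Equation 18.13 -> 5.70
s_c = math.sqrt(c_bar)               # Equation 18.14 -> about 2.387

ucl = c_bar + 3 * s_c                # Equation 18.15 -> about 12.86
lcl = max(c_bar - 3 * s_c, 0.0)      # Equation 18.16: -1.46 floored to 0
print(f"c-chart: UCL = {ucl:.2f}, LCL = {lcl:.2f}")
```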
MyStatLab

18-2: Exercises

Skill Development

18-10 Fifty sampling units of equal size were inspected, and the number of nonconforming situations was recorded. The total number of instances was 449.
a. Determine the appropriate control chart to use for this process.
b. Compute the mean value for the control chart.
c. Compute the upper and lower control limits.

18-11 Data were collected on a quantitative measure with a subgroup size of five observations. Thirty subgroups were collected, with the following results: x̿ = 44.52, R̄ = 5.6.
a. Determine the Shewhart factors that will be needed if x̄- and R-charts are to be developed.
b. Compute the upper and lower control limits for the R-chart.
c. Compute the upper and lower control limits for the x̄-chart.

18-12 Data were collected from a process in which the factor of interest was whether the finished item contained a particular attribute. The fraction of items that did not contain the attribute was recorded. A total of 30 samples were selected. The common sample size was 100 items. The total number of nonconforming items was 270. Based on these data, compute the upper and lower control limits for the p-chart.

18-13 Explain why it is important to update the control charts as soon as new data become available.

Computer Database Exercises

18-14 Grandfoods, Inc., makes energy supplement bars for use by athletes and others who need an energy boost. One of the critical quality characteristics is the weight of the bars. Too much weight implies that too many liquids have been added to the mix and the bar will be too chewy. If the bars are light, the implication is that the bars are too dry. To monitor the weights, the production manager wishes to use process control charts. Data for 30 subgroups of bars are contained in the file Grandfoods. Note that a subgroup is selected every 15 minutes as bars come off the manufacturing line.
a. Use these data to construct the appropriate process control chart(s).
b. Discuss what each chart is used for. Why do we need both charts?
c. Examine the control charts and indicate which, if any, of the signals are present. Is the process currently in control?
d. Develop a histogram for the energy bar weights. Discuss the shape of the distribution and the implications it has for the validity of the control chart procedures you have used.

18-15 The Haines Lumber Company makes plywood for residential and commercial construction. One of the key quality measures is plywood thickness. Every hour, five pieces of plywood are selected and the thicknesses are measured. The data (in inches) for the first 20 subgroups are in the file Haines.
a. Construct an x̄-chart based on these data. Make sure you plot the centerline and both 3-sigma upper and lower control limits.
b. Construct an R-chart based on these data.
c. Examine both control charts and determine if there are any special causes of variation that require attention in this process.

18-16 Referring to Exercise 18-15, suppose the process remained in control for the next 40 hours. The thickness measurements for hours 41 through 43 are as follows:

Hour 41: 0.764, 0.737, 0.724, 0.716, 0.752
Hour 42: 0.766, 0.785, 0.777, 0.790, 0.799
Hour 43: 0.812, 0.774, 0.767, 0.799, 0.821

a. Based on these data and the two control charts, what should you conclude about the process? Has the process gone out of control? Discuss.
b. Was it necessary to obtain your answer to Exercise 18-15 before part a could be answered? Explain your reasoning.

18-17 Referring to the process control charts developed in Exercise 18-14, data for periods 31 to 40 are contained in the file Grandfoods-Extra.
a. Based on these data, what would you conclude about the energy bar process?
b. Write a report discussing the results, and show the control charts along with the new data.

18-18 Wilson, Ryan, and Reed is a large certified public accounting (CPA) firm in Charleston, South Carolina. It has been monitoring the accuracy of its employees and wishes to get the number of accounts with errors under statistical control. It has sampled 100 accounts for each of the last 30 days and has examined them for errors. The data are presented in the file Accounts.
a. Construct the relevant control chart for the account process.
b. What does the chart indicate about the statistical stability of the process? Give reasons for your answers.
c. Suppose that for the next three days, samples of 100 accounts are examined, with the following numbers of errors: 4, 2, and 3. Plot the appropriate data on the control chart and indicate whether any of the control chart signals are present. Discuss your results.

18-19 Trinkle & Sons performs subcontract body paint work for one of the "Big Three" automakers. One of its recent contracts called for the company to paint 12,500 door panels. Several quality characteristics are very important to the manufacturer, one of which is blemishes in the paint. The manufacturer has required Trinkle & Sons to have control charts to monitor the number of paint blemishes per door panel. The panels are all for the same model car and are the same size. To initially develop the control chart, data for 88 door panels were collected and are provided in the file CarPaint.
a. Determine the appropriate type of process control chart to develop.
b. Develop a 3-sigma control chart.
c. Based on the control chart and the standard signals discussed in this chapter, what conclusions can you reach about whether the paint process is in control? Discuss.
18-20 Tony Perez is the manager of one of the largest chains of service stores that specialize in oil and lubrication of automobiles, Fastlube, Inc. One of the company's stated goals is to provide a lube and oil change for anyone's automobile in 15 minutes. Tony has thought for some time now that there is a growing disparity among his workers in the time it takes to lube and change the oil of an automobile. To monitor this aspect of Fastlube, Tony has selected a sample of 20 days and has recorded the time it took five randomly selected employees to service an automobile. The data are located in the file Fastlube.
a. Tony glanced through the data and noticed that the longest time it took to service a car was 25.33 minutes. Suppose the distribution of times to service a car is normal, with a mean of 15. Use your knowledge of the normal distribution to let Tony know what the standard deviation is for the time it takes to service a car.
b. Use the Fastlube data to construct an x̄-chart and an R-chart.
c. Based on these data, what would you conclude about the service process?
d. Based on your findings on the R-chart, would it be advisable to draw conclusions based on the x̄-chart?

18-21 The Ajax Taxi company in Manhattan, New York, wishes to set up an x̄-chart and an R-chart to monitor the number of miles driven per day by its taxi drivers. Each week, the scheduler selects four taxis and (without the drivers' knowledge) monitors the number of miles driven. He has done this for the past 40 weeks. The data are in the file Ajax.
a. Construct the R-chart for these 40 subgroups.
b. Construct the x̄-chart for these 40 subgroups. Be sure to label the chart correctly.
c. Look at both control charts and determine if any of the control chart signals are present to indicate that the process is not in control. Explain the implications of what you have found for the Ajax Taxi Company.

18-22 Referring to Exercise 18-21, assume the Ajax managers determine that any issues identified by the control charts were caused by one-time events. The data for weeks 41 through 45 are in the Ajax-Extra file.
a. Using the control limits developed from the first 40 weeks, do these data indicate that the process is now out of control? Explain.
b. If a change has occurred, brainstorm some of the possible reasons.
c. What will be the impact on the control charts when the new data are included?
d. Use the data in the files Ajax and Ajax-Extra to develop the new control charts.
e. Are any of the typical control chart signals present? Discuss.

18-23 The Kaiser Corporation makes aluminum at various locations around the country. One of the key factors in being profitable is keeping the machinery running. One particularly troublesome machine is a roller that flattens the sheets to the appropriate thickness. This machine tends to break down for various reasons. Consequently, the maintenance manager has decided to develop a process control chart. Over a period of 10 weeks, 20 subgroups consisting of downtime measures (in minutes) were collected (one measurement at the end of each of the two shifts).
The subgroup means and ranges are shown as follows and are contained in the file called Kaiser.

Subgroup   Mean    Range      Subgroup   Mean    Range      Subgroup   Mean    Range
1          104.8   9.6        8          91.1    5.2        15         57.0    15.5
2          85.9    14.3       9          79.5    14.2       16         98.9    13.8
3          78.6    8.6        10         71.9    14.1       17         87.9    16.6
4          72.8    10.6       11         47.6    14.9       18         64.9    11.2
5          102.6   11.2       12         106.7   12.7       19         101.6   9.6
6          84.8    13.5       13         80.7    13.3       20         83.9    11.5
7          67.0    10.8       14         81.0    15.4

a. Explain why the x̄- and R-charts would be appropriate in this case.
b. Find the centerline value for the x̄-chart.
c. Calculate the centerline value for the R-chart.
d. Compute the upper and lower control limits for the R-chart, and construct the chart with appropriate labels.
e. Compute the upper and lower control limits for the x̄-chart, and construct the chart with appropriate labels.
f. Examine the charts constructed in parts d and e and determine whether the process was in control during the period for which the control charts were developed. Explain.

18-24 Referring to Exercise 18-23, if necessary delete any out-of-control points and construct the appropriate x̄- and R-charts. Now, suppose the process stays in control for the next six weeks (subgroups 18 through 23). The subgroup means and ranges for subgroups 33 to 38 are as follows:

Subgroup   Mean    Range
33         89.0    11.4
34         88.4    5.4
35         85.2    14.2
36         89.3    11.7
37         97.2    9.5
38         105.3   10.2

a. Plot the ranges on the R-chart. Is there evidence based on the range chart that the process has gone out of control? Discuss.
b. Plot the subgroup means on the x̄-chart. Is there evidence to suggest that the process has gone out of control with respect to the process mean? Discuss.

18-25 Regis Printing Company performs printing services for individuals and business customers. Many of the jobs require that brochures be folded for mailing. The company has a machine that does the folding. It generally does a good job, but it can have problems that cause it to make improper folds. To monitor this process, the company selects a sample of 50 brochures from every order and counts the number of incorrectly folded items in each sample. Until now, nothing has been done with the 300 samples that have been collected. The data are located in the file Regis.
a. What is the appropriate control chart to use to monitor this process?
b. Using the data in this file, construct the appropriate control chart and label it properly.
c. Suppose that for the next three orders, samples of 50 brochures are examined, with the following results:

Sample Number   Number of Bad Folds
301             3
302             2
303             5

Plot the appropriate data on the control chart and indicate whether any of the control chart signals are present. Discuss your results.
d. Suppose that the next sample of 50 has 14 improperly folded brochures. What conclusion should be reached based on the control chart? Discuss.
18-26 Recall from Exercise 18-20 that Tony Perez is the manager of one of the largest chains of service stores that specialize in oil and lubrication of automobiles, Fastlube, Inc. One of the company's stated goals is to provide a lube and oil change for anyone's automobile in 15 minutes. Tony has thought for some time now that there is a growing disparity among his workers in the time it takes to lube and change the oil of an automobile. To monitor this aspect of Fastlube, Tony has selected a sample of 24 days and has recorded the time it took to service 100 automobiles each day. The number of times the service was performed in 15 minutes or less (≤ 15) is given in the file Lubeoil.
a. (1) Convert the sample data to proportions and plot the data as a line graph. (2) Compute p̄ and plot this value on the line graph. (3) Compute s_p and interpret what it measures.
b. Construct a p-chart and determine if the process of the time required for oil and lube jobs is in control.
c. Specify the signals that are used to indicate an out-of-control situation on a p-chart.

18-27 Susan Booth is the director of operations for National Skyways, a small commuter airline with headquarters in Cedar Rapids, Iowa. She has become increasingly concerned about the amount of carry-on luggage passengers have been carrying on board National Skyways' planes. She collected data concerning the number of pieces of baggage that were taken on board over a one-month period. The data collected are provided in the file Carryon. Hint: Consider a u-chart from the optional topics.
a. Set up a control chart for the number of carry-on bags per day.
b. Is the process in a state of statistical control? Explain your answer.
c. Suppose that National Skyways' aircraft were full for each of the 30 days. Each Skyways aircraft holds 40 passengers. Describe the control chart you would use. Is it necessary that you use this latter alternative, or is it just a preference? Explain your answer.

18-28 Sid Luka is the service manager for Brakes Unlimited, a franchise corporation that specializes in servicing automobile brakes. He wants to study the length of time required to replace the rear drum brakes of automobiles. A subgroup of 10 automobiles needing their brakes replaced was selected on each day for a period of 20 days. The subgroup times required (in hours) for this service were recorded and are presented in the file Brakes. (This problem cannot be done using Minitab.)
a. Sid has been trying to get the average time required to replace the rear drum brakes of an automobile under 1.65 hours. Use the data Sid has collected to determine if he has reached his goal.
b. Set up the appropriate control charts to determine if this process is under control.
c. Determine whether the process is under control. If the process is not under control, brainstorm suggestions that might help Sid bring it under control. What tools of quality might Sid find useful?
END EXERCISES 18-2

Visual Summary

Chapter 18: Organizations across the United States and around the world have turned to quality management in an effort to meet the competitive challenges of the international marketplace. Their efforts have generally followed two distinctive but complementary tracks. The first track involves a change in managerial philosophy following principles set out by W. Edwards Deming and Joseph Juran, two pioneers in the quality movement; it generally requires that employees at all levels be brought into the effort as members of process improvement teams, assisted by training in the Tools of Quality. The second track involves a process of continual improvement using a set of statistically based tools built around process control charts. This chapter presents a brief introduction to the contributions of Deming, Juran, and other leaders in the quality movement and then discusses four of the most commonly used statistical process control charts.

18.1 Quality Management and Tools for Process Improvement

Summary: Although Deming is better known, both Deming and Juran were instrumental in the worldwide quality movement. Deming, a statistician, and Juran, an engineer, emphasized that quality is the key to being competitive and that quality is best achieved by improving the processes that produce the products and deliver the services. Deming, also known as the "Father of Japanese Quality and Productivity," preached a new management philosophy, which has become known as Total Quality Management, or TQM. His philosophy is spelled out in his 14 points, which emphasize the importance of managerial leadership and continual improvement. Juran is noted for his 10 steps to quality improvement. Juran and Deming differed in some areas, including the use of goals and targets. Juran is also credited with applying the Pareto principle to quality; the principle is used to focus on the vital few sources of problems and to separate the vital few from the trivial many. Pareto charts are used to display data in a way that helps managers find the most important problem issues. Deming, Juran, and others contributed to the Tools of Quality: process flowcharts, brainstorming, fishbone diagrams, histograms, trend charts, scatter plots, and statistical process control charts.

Outcome 1: Use the seven basic tools of quality.

18.2 Introduction to Statistical Process Control Charts

Summary: Process control charts are based on the idea that all processes exhibit variation. Although variation is always present, two major types occur. The first is common cause variation, which is naturally occurring or expected in the system; other terms for common cause variation include normal, random, chance, inherent, and stable variation. The other type is special cause variation, which indicates that something out of the ordinary has happened; this type of variation is also called nonrandom, unstable, or assignable cause variation. Process control charts are used to separate special cause variation from common cause variation. They are constructed by determining a centerline (the process average) and upper and lower control limits around the centerline (lines three standard deviations above and below the average). The charts considered in this chapter are the most commonly used: the x̄- and R-charts (used in tandem), p-charts, and c-charts.
Data are gathered continually from the process being monitored and plotted on the chart. Data points between the upper and lower control limits generally, but not always, indicate the process is stable. Process control charts are continually monitored for signs of going "out of control." Common signals include one or more points outside the upper or lower control limits; nine or more points in a row above (or below) the centerline; six or more consecutive points moving in the same direction (increasing or decreasing); and fourteen points in a row alternating up and down. Generally, continual process improvement procedures involve identifying and addressing the reason for special cause variation and then working to reduce the common cause variation.

Outcome 2: Construct and interpret x̄-charts and R-charts.
Outcome 3: Construct and interpret p-charts.
Outcome 4: Construct and interpret c-charts.

Conclusion

The quality movement throughout the United States and much of the rest of the world has created great expectations among consumers. Ideas such as continuous process improvement and customer focus have become central to raising these expectations. Statistics has played a key role in increasing expectations of quality products. The enemy of quality is variation, which exists in everything. Through the use of appropriate statistical tools and the concept of statistical reasoning, managers and employees have developed better understandings of their processes. Although they haven't yet figured out how to eliminate variation, statistics has helped reduce it and has shown how to operate more effectively when it exists.

Statistical process control (SPC) has played a big part in the understanding of variation. SPC is quite likely the most frequently used of the basic tools. This chapter has introduced SPC. Hopefully, you realize that these tools are merely extensions of the hypothesis-testing and estimation concepts presented in Chapters 8-10. You will very likely have the opportunity to use SPC in one form or another after you leave this course and enter the workforce. Figure 18.13 summarizes some of the key SPC charts and the conditions under which each is developed.

[Figure 18.13: Statistical Process Control Chart Options, organized by type of data.
Variables data:
  R-chart: UCL = D_4(R̄), LCL = D_3(R̄), where R̄ = ΣR_i/k
  x̄-chart: UCL = x̿ + A_2(R̄), LCL = x̿ − A_2(R̄), where x̿ = Σx̄_i/k
  Other variables charts: x-chart, MR chart, zone chart
Attribute data:
  p-chart: UCL = p̄ + 3s_p, LCL = p̄ − 3s_p, where p̄ = Σp_i/k and s_p = √(p̄(1 − p̄)/n)
  c-chart: UCL = c̄ + 3√c̄, LCL = c̄ − 3√c̄, where c̄ = Σx_i/k and s_c = √c̄
  Other attribute charts: u-chart, np-chart]

Equations

(18.1) Variation Components: Total process variation = Common cause variation + Special cause variation

(18.2) Average Subgroup Mean: \bar{\bar{x}} = \frac{\sum_{i=1}^{k} \bar{x}_i}{k}

(18.3) Average Subgroup Range: \bar{R} = \frac{\sum_{i=1}^{k} R_i}{k}

(18.4) Upper Control Limit, x̄-Chart: UCL = \bar{\bar{x}} + A_2\bar{R}

(18.5) Lower Control Limit, x̄-Chart: LCL = \bar{\bar{x}} - A_2\bar{R}

(18.6) Upper Control Limit, R-Chart: UCL = D_4\bar{R}

(18.7) Lower Control Limit, R-Chart: LCL = D_3\bar{R}

(18.8) Mean Subgroup Proportion (equal-size samples): \bar{p} = \frac{\sum_{i=1}^{k} p_i}{k}

(18.9) Mean Subgroup Proportion (unequal sample sizes): \bar{p} = \frac{\sum_{i=1}^{k} n_i p_i}{\sum_{i=1}^{k} n_i}
(18.10) Estimate for the Standard Error of the Subgroup Proportions (equal sample sizes): s_p = \sqrt{\frac{\bar{p}(1-\bar{p})}{n}}

(18.11) Control Limits for p-Chart: UCL = \bar{p} + 3 s_p

(18.12) LCL = \bar{p} - 3 s_p

(18.13) Mean for c-Chart: \bar{c} = \frac{\sum_{i=1}^{k} x_i}{k}

(18.14) Standard Deviation for c-Chart: s_c = \sqrt{\bar{c}}

(18.15) c-Chart Control Limits: UCL = \bar{c} + 3\sqrt{\bar{c}}

(18.16) LCL = \bar{c} - 3\sqrt{\bar{c}}

Key Terms
Pareto principle
Total quality management (TQM)

Chapter Exercises (MyStatLab)

Conceptual Questions

18-29 Data were collected on a quantitative measure with a subgroup size of three observations. Thirty subgroups were collected, with the following results: x̿ = 1,345.4, R̄ = 209.3.
a. Determine the Shewhart factors that will be needed if x̄- and R-charts are to be developed.
b. Compute the upper and lower control limits for the R-chart.
c. Compute the upper and lower control limits for the x̄-chart.

18-30 Data were collected on a quantitative measure with subgroups of four observations. Twenty-five subgroups were collected, with the following results: x̿ = 2.33, R̄ = 0.80.
a. Determine the Shewhart factors that will be needed if x̄- and R-charts are to be developed.
b. Compute the upper and lower control limits for the R-chart.
c. Compute the upper and lower control limits for the x̄-chart.

18-31 Data were collected from a process in which the factor of interest was whether a finished item contained a particular attribute. The fraction of items that did not contain the attribute was recorded. A total of 20 samples were selected. The common sample size was 150 items. The total number of nonconforming items was 720. Based on these data, compute the upper and lower control limits for the p-chart.

18-32 Data were collected from a process in which the factor of interest was whether a finished item contained a particular attribute. The fraction of items that did not contain the attribute was recorded. A total of 30 samples were selected. The common sample size was 100 items. The average number of nonconforming items per sample was 14. Based on these data, construct the upper and lower control limits for the p-chart.

Computer Database Exercises

18-33 CC, Inc., provides billing services for the health care industry. To ensure that its processes are operating as intended, CC selects 100 billing records at random every day and inspects each record to determine if it is free of errors. A billing record is classified as defective whenever there is an error that requires that the bill be reprocessed and mailed again. Such errors can occur for a variety of reasons; for example, a defective bill could have an incorrect mailing address, a wrong insurance identification number, or an improper doctor or hospital reference. The sample data taken from the most recent five weeks of billing records are contained in the file CC Inc. Use the sample information to construct the appropriate 3-sigma control chart. Does CC's billing process appear to be in control? What, if any, comments can you make regarding the performance of its billing process?
18-34 A & A Enterprises ships integrated circuits to companies that assemble computers. Because computer manufacturing operations run on little inventory, parts must be available when promised. Thus, a critical element of A & A's customer satisfaction is on-time delivery performance. To ensure that the delivery process is performing as intended, a quality improvement team decided to monitor the firm's distribution and delivery process. From the A & A corporate database, 100 monthly shipping records were randomly selected for the previous 21 months, and the number of on-time shipments was counted. This information is contained in the file A & A On Time Shipments. Develop the appropriate 3-sigma control chart(s) for monitoring this process. Does it appear that the delivery process is in control? If not, can you suggest some possible assignable causes?

18-35 Fifi Carpets, Inc., produces carpet for homes and offices. Fifi has recently opened a new production process dedicated to the manufacture of a special type of carpet used by firms that want a floor covering for high-traffic spaces. As part of their ongoing quality improvement activities, the managers of Fifi regularly monitor their production processes using statistical process control. For their new production process, Fifi managers would like to develop control charts to help them in their monitoring activities. Thirty samples of carpet sections, each section having an area of 50 square meters, were randomly selected, and the numbers of stains, cuts, snags, and tears were counted on each section. The sample data are contained in the file Fifi Carpets. Use the sample data to construct the appropriate 3-sigma control chart(s) for monitoring the production process. Does the process appear to be in statistical control?

18-36 The order-entry, order-processing call center for PS Industries is concerned about the amounts of time that customers must wait before their calls are handled. A quality improvement consultant suggests that it monitor call-wait times using control charts. Using call center statistics maintained by the company's database system, the consultant randomly selects four calls every hour for 30 different hours and examines the wait time, in seconds, for each call. This information is contained in the file PS Industries. Use the sample data to construct the appropriate control chart(s). Does the process appear to be in statistical control? What other information concerning the call center's process should the consultant be aware of?

18-37 Varians Controls manufactures a variety of different electric motors and drives. One step in the manufacturing process involves cutting copper wire from large reels into smaller lengths. For a particular motor, there is a dedicated machine for cutting wire to the required length. As part of its regular quality improvement activities, the continuous process improvement team at Varians took a sample of five cuttings every hour for 30 consecutive hours of operation. At the time the samples were taken, Varians had every reason to believe that its process was working as intended. The automatic cutting machine records the length of each cut, and the results are reported in the file Varians Controls.
a. Develop the appropriate 3-sigma control chart(s) for this process. Does the process appear to be working as intended (in control)?
b. A few weeks after the previous data were sampled, a new operator was hired to calibrate the company's cutting machines. The first samples taken from the machine after the calibration adjustments (samples 225 to 229) are shown as follows:

            Sample 225   Sample 226   Sample 227   Sample 228   Sample 229
Cutting 1   0.7818       0.7694       0.7875       0.7762       0.7805
Cutting 2   0.7760       0.7838       0.7738       0.7711       0.7724
Cutting 3   0.7814       0.7675       0.7737       0.7700       0.7748
Cutting 4   0.7824       0.7834       0.7594       0.7823       0.7823
Cutting 5   0.7702       0.7730       0.7837       0.7673       0.7924

Based on these sample values, what can you say about the cutting process? Does it appear to be in control?

Case 18.1: Izbar Precision Casters, Inc.

Izbar Precision Casters, Inc., manufactures a variety of structural steel products for the construction trade. Currently, there is a strong demand for its I-beam product, produced at a mill outside Memphis. Beams at this facility are shipped throughout the Midwest and mid-South, and demand for the product is high due to the strong economy in the regions served by the plant. Angie Schneider, the mill's manager, wants to ensure that the plant's operations are in control, and she has selected several characteristics to monitor. Specifically, she collects data on the number of weekly accidents at the plant, the number of orders shipped on time, and the thickness of the steel I-beams produced.

For the number of reported accidents, Angie selected 30 days at random from the company's safety records. Angie and all the plant employees are very concerned about workplace safety, and management, labor, and government officials have worked together to help create a safe work environment. As part of the safety program, the company requires employees to report every accident, regardless of how minor it may be. In fact, most accidents are very minor, but Izbar still records them and works to prevent them from recurring. Because of Izbar's strong reporting requirement, Angie was able to get a count of the number of reported accidents for each of the 30 sampled days. These data are shown in Table 18.4.

TABLE 18.4  Accident Data

Day   Reported Accidents     Day   Reported Accidents
1     11                     16    11
2     4                      17    7
3     9                      18    9
4     9                      19    9
5     11                     20    7
6     7                      21    6
7     10                     22    10
8     10                     23    11
9     10                     24    8
10    7                      25    8
11    6                      26    6
12    7                      27    6
13    11                     28    8
14    5                      29    11
15    10                     30    9

To monitor the percentage of on-time shipments, Angie randomly selected 50 records from the firm's shipping and billing system every day for 20 different days over the past six months. These records contain the actual and promised shipping dates for each order. Angie used a spreadsheet to determine the number of shipments that were made after the promised shipment dates. The number of late shipments from the 50 sampled records was then reported. These data are shown in Table 18.5.

Finally, to monitor the thickness of the I-beams produced at the plant, Angie randomly selected six I-beams every day for 30 days and had each sampled beam measured. The thickness of each beam, in inches, was recorded. All of the data collected by Angie are contained in the file Izbar. She wants to use the information she has collected to construct and analyze the appropriate control charts for the plant's production processes. She intends to present this information at the next manager's meeting on Monday morning.

Required Tasks:
a. Use the data that Angie has collected to develop and analyze the appropriate control charts for this process. Be sure to label each control chart carefully and also to identify the type of control chart used.
List of Appendix Tables
APPENDIX A Random Numbers Table 837
APPENDIX B Cumulative Binomial Distribution Table 838
APPENDIX C Cumulative Poisson Probability Distribution Table 851
APPENDIX D Standard Normal Distribution Table 856
APPENDIX E Exponential Distribution Table 857
APPENDIX F Values of t for Selected Probabilities 858
APPENDIX G Values of χ² for Selected Probabilities 859
APPENDIX H F-Distribution Table 860
APPENDIX I Critical Values of Hartley's Fmax Test 866
APPENDIX J Distribution of the Studentized Range (q-values) 867
APPENDIX K Critical Values of r in the Runs Test 869
APPENDIX L Mann-Whitney U Test Probabilities (n < 9) 870
APPENDIX M Mann-Whitney U Test Critical Values (9 ≤ n ≤ 20) 872
APPENDIX N Critical Values of T in the Wilcoxon Matched-Pairs Signed-Ranks Test (n ≤ 25) 874
APPENDIX O Critical Values dL and dU of the Durbin-Watson Statistic D 875
APPENDIX P Lower and Upper Critical Values W of Wilcoxon Signed-Ranks Test 877
APPENDIX Q Control Chart Factors 878

APPENDIX A Random Numbers Table
1511 6249 2587 0168 9664 1384 6390 6944 3610 9865 7044 9304 1717 2461 8240 1697
4695 3056 6887 1267 4369 2888 9893 8927 2676 0775 3828 3281 0328 8406 7076 0446
3719 5648 3694 3554 4934 7835 1098 1186 4618 5529 0754 5865 6168 7479 4608 0654
3000 2686 4713 9281 5736 2383 8740 4745 7073 4800 1379 9021 4981 8953 8134 3119
0028 6712 4857 8278 3598 9856 6805 2251 8558 9035 8824 9267 0333 7251 3977 7064
7316 9178 3419 7471 1826 8418 8641 9712 0563 8582 9876 8446 1506 2113 2685 1522
4173 5808 0806 8963 4144 6576 2483 9694 3675 4121 6522 9419 0408 8038 8716 0460
3455 7838 4990 2708 4292 0704 7442 1783 7530 5476 0072 5173 0075 1386 8962 3020
8520 5588 9377 5347 6243 6054 2198 2249 3726 6660 5352 8437 6778 3249 7472 6346
3434 4249 4646 0019 8287 7225 0627 5711 8458 2070 0235 6697 9422 6001 6616 5760
5144 7916 5022 2821 7284 2793 0819 7565 7487 5570 6437 7372 8500 6218 9029 0018
8386 2636 9666 7599 2340 5638 7509 6571 2821 8205 4849 4617 5979 3234 5606 0743
7968 2019 3078 1292 5431 1517 1981 4052 9473 2054 5011 3487 8311 0448 7419 2218
7986 1514 2255 4198 4486 5599 2918 5164 8941 6955 7313 6054 9142 0729 1196 7420
4697 2298 7197 6996 7623 2858 0945 1540 3217 6165 8468 6694 9459 5105 3233 1247
6479 5526 9256 8566
3796 9411 4075 1238 5842 9068 2019 4068 8850 9512 8392 9085 1136 0563 8250 3835 0669 2535 9180 4800 7875 5465 2578 4941 7759 0185 8104 6710 3356 5781 2246 4958 6806 7768 5285 7939 6230 2121 3492 0546 8737 8803 5760 1693 7438 7653 9786 5578 4283 7175 0967 7002 2975 4039 8120 5506 3818 3560 2246 1665 1425 3506 6045 6862 0659 3883 6594 1023 4450 2269 8059 4086 5876 6213 3076 2176 7233 1701 1500 1581 7364 0403 1670 5732 6951 1674 5245 2713 6137 8045 5842 7443 6538 4397 8394 7023 4467 9815 6081 6805 6272 0536 0676 5390 2859 4355 0649 5295 4800 2566 4462 5268 9542 2164 5939 1232 7474 1939 6990 5694 5126 2434 6295 1466 1876 9163 4083 8435 5280 2866 3095 4981 4764 3502 9896 9985 4984 1399 1042 7181 9984 8312 6595 4941 6679 5353 9425 2093 8802 3286 0444 0979 7191 1330 2357 0573 6423 2276 5715 1615 1385 4731 5071 9393 4449 5121 7652 3922 4567 6337 0573 0141 5626 5475 6668 0477 9453 6483 6334 3684 2539 0881 2564 4753 0515 1171 3553 7460 9693 2312 5930 3877 5961 0527 0608 0355 837 8925 5149 0488 1361 7503 5384 7629 3253 4463 8575 1342 3291 3458 6994 4344 1083 4724 8405 3349 0727 7086 6011 3263 2414 9052 6098 7688 1801 9102 7751 6544 1867 6227 2563 4034 8883 9915 2606 8856 6487 4270 3031 0696 7417 7892 8144 3509 1956 8140 9869 8772 4714 7441 2864 0775 (382) 838 APPENDIX B X APPENDIX B P( x ≤ X ) = ∑ i!(n  i)! p i (1  p) ni p  0.06 0.9400 1.0000 p  0.35 0.6500 1.0000 p  0.80 0.2000 1.0000 p  0.97 0.0300 1.0000 p  0.07 0.9300 1.0000 p  0.40 0.6000 1.0000 p  0.85 0.1500 1.0000 p  0.98 0.0200 1.0000 p  0.08 0.9200 1.0000 p  0.45 0.5500 1.0000 p  0.90 0.1000 1.0000 p  0.99 0.0100 1.0000 p  0.09 0.9100 1.0000 p  0.50 0.5000 1.0000 p  0.91 0.0900 1.0000 p  1.00 0.0000 1.0000 p  0.06 0.8836 0.9964 1.0000 p  0.35 0.4225 0.8775 1.0000 p  0.80 0.0400 0.3600 1.0000 p  0.97 0.0009 0.0591 1.0000 p  0.07 0.8649 0.9951 1.0000 p  0.40 0.3600 0.8400 1.0000 p  0.85 0.0225 0.2775 1.0000 p  0.98 0.0004 0.0396 1.0000 p  0.08 0.8464 0.9936 1.0000 p  0.45 0.3025 0.7975 1.0000 p  0.90 0.0100 0.1900 1.0000 p  0.99 0.0001 0.0199 1.0000 p  0.09 0.8281 0.9919 1.0000 p  0.50 0.2500 0.7500 1.0000 p  0.91 0.0081 0.1719 1.0000 p  1.00 0.0000 0.0000 1.0000 n! 
i =0 Cumulative Binomial Distribution Table n1 x x x x p  0.01 0.9900 1.0000 p  0.10 0.9000 1.0000 p  0.55 0.4500 1.0000 p  0.92 0.0800 1.0000 p  0.02 0.9800 1.0000 p  0.15 0.8500 1.0000 p  0.60 0.4000 1.0000 p  0.93 0.0700 1.0000 p  0.03 0.9700 1.0000 p  0.20 0.8000 1.0000 p  0.65 0.3500 1.0000 p  0.94 0.0600 1.0000 p  0.04 0.9600 1.0000 p  0.25 0.7500 1.0000 p  0.70 0.3000 1.0000 p  0.95 0.0500 1.0000 p  0.05 0.9500 1.0000 p  0.30 0.7000 1.0000 p  0.75 0.2500 1.0000 p  0.96 0.0400 1.0000 n2 x x x x p  0.01 0.9801 0.9999 1.0000 p  0.10 0.8100 0.9900 1.0000 p  0.55 0.2025 0.6975 1.0000 p  0.92 0.0064 0.1536 1.0000 p  0.02 0.9604 0.9996 1.0000 p  0.15 0.7225 0.9775 1.0000 p  0.60 0.1600 0.6400 1.0000 p  0.93 0.0049 0.1351 1.0000 p  0.03 0.9409 0.9991 1.0000 p  0.20 0.6400 0.9600 1.0000 p  0.65 0.1225 0.5775 1.0000 p  0.94 0.0036 0.1164 1.0000 p  0.04 0.9216 0.9984 1.0000 p  0.25 0.5625 0.9375 1.0000 p  0.70 0.0900 0.5100 1.0000 p  0.95 0.0025 0.0975 1.0000 p  0.05 0.9025 0.9975 1.0000 p  0.30 0.4900 0.9100 1.0000 p  0.75 0.0625 0.4375 1.0000 p  0.96 0.0016 0.0784 1.0000 n3 x p  0.01 p  0.02 p  0.03 p  0.04 p  0.05 p  0.06 p  0.07 p  0.08 p  0.09 0.9703 0.9412 0.9127 0.8847 0.8574 0.8306 0.8044 0.7787 0.7536 x x x 0.9997 1.0000 1.0000 p  0.10 0.7290 0.9720 0.9990 1.0000 p  0.55 0.0911 0.4253 0.8336 1.0000 p  0.92 0.0005 0.0182 0.2213 1.0000 0.9988 1.0000 1.0000 p  0.15 0.6141 0.9393 0.9966 1.0000 p  0.60 0.0640 0.3520 0.7840 1.0000 p  0.93 0.0003 0.0140 0.1956 1.0000 0.9974 1.0000 1.0000 p  0.20 0.5120 0.8960 0.9920 1.0000 p  0.65 0.0429 0.2818 0.7254 1.0000 p  0.94 0.0002 0.0104 0.1694 1.0000 0.9953 0.9999 1.0000 p  0.25 0.4219 0.8438 0.9844 1.0000 p  0.70 0.0270 0.2160 0.6570 1.0000 p  0.95 0.0001 0.0073 0.1426 1.0000 0.9928 0.9999 1.0000 p  0.30 0.3430 0.7840 0.9730 1.0000 p  0.75 0.0156 0.1563 0.5781 1.0000 p  0.96 0.0001 0.0047 0.1153 1.0000 0.9896 0.9998 1.0000 p  0.35 0.2746 0.7183 0.9571 1.0000 p  0.80 0.0080 0.1040 0.4880 1.0000 p  0.97 0.0000 0.0026 0.0873 1.0000 0.9860 0.9997 1.0000 p  0.40 0.2160 0.6480 0.9360 1.0000 p  0.85 0.0034 0.0608 0.3859 1.0000 p  0.98 0.0000 0.0012 0.0588 1.0000 0.9818 0.9995 1.0000 p  0.45 0.1664 0.5748 0.9089 1.0000 p  0.90 0.0010 0.0280 0.2710 1.0000 p  0.99 0.0000 0.0003 0.0297 1.0000 0.9772 0.9993 1.0000 p  0.50 0.1250 0.5000 0.8750 1.0000 p  0.91 0.0007 0.0228 0.2464 1.0000 p  1.00 0.0000 0.0000 0.0000 1.0000 (383) APPENDIX B n4 x p  0.01 p  0.02 p  0.03 p  0.04 p  0.05 p  0.06 p  0.07 p  0.08 p  0.09 0.9606 0.9224 0.8853 0.8493 0.8145 0.7807 0.7481 0.7164 0.6857 x 0.9994 1.0000 1.0000 1.0000 p  0.10 0.9977 1.0000 1.0000 1.0000 p  0.15 0.9948 0.9999 1.0000 1.0000 p  0.20 0.9909 0.9998 1.0000 1.0000 p  0.25 0.9860 0.9995 1.0000 1.0000 p  0.30 0.9801 0.9992 1.0000 1.0000 p  0.35 0.9733 0.9987 1.0000 1.0000 p  0.40 0.9656 0.9981 1.0000 1.0000 p  0.45 0.9570 0.9973 0.9999 1.0000 p  0.50 0.6561 0.5220 0.4096 0.3164 0.2401 0.1785 0.1296 0.0915 0.0625 x x 0.9477 0.9963 0.9999 1.0000 p  0.55 0.0410 0.2415 0.6090 0.9085 1.0000 p  0.92 0.0000 0.0019 0.0344 0.2836 1.0000 0.8905 0.9880 0.9995 1.0000 p  0.60 0.0256 0.1792 0.5248 0.8704 1.0000 p  0.93 0.0000 0.0013 0.0267 0.2519 1.0000 0.8192 0.9728 0.9984 1.0000 p  0.65 0.0150 0.1265 0.4370 0.8215 1.0000 p  0.94 0.0000 0.0008 0.0199 0.2193 1.0000 0.7383 0.9492 0.9961 1.0000 p  0.70 0.0081 0.0837 0.3483 0.7599 1.0000 p  0.95 0.0000 0.0005 0.0140 0.1855 1.0000 0.6517 0.9163 0.9919 1.0000 p  0.75 0.0039 0.0508 0.2617 0.6836 1.0000 p  0.96 0.0000 0.0002 0.0091 0.1507 1.0000 0.5630 0.8735 0.9850 1.0000 p  0.80 
0.0016 0.0272 0.1808 0.5904 1.0000 p  0.97 0.0000 0.0001 0.0052 0.1147 1.0000 0.4752 0.8208 0.9744 1.0000 p  0.85 0.0005 0.0120 0.1095 0.4780 1.0000 p  0.98 0.0000 0.0000 0.0023 0.0776 1.0000 0.3910 0.7585 0.9590 1.0000 p  0.90 0.0001 0.0037 0.0523 0.3439 1.0000 p  0.99 0.0000 0.0000 0.0006 0.0394 1.0000 0.3125 0.6875 0.9375 1.0000 p  0.91 0.0001 0.0027 0.0430 0.3143 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 1.0000 n5 x p  0.01 p  0.02 p  0.03 p  0.04 p  0.05 p  0.06 p  0.07 p  0.08 p  0.09 0.9510 0.9039 0.8587 0.8154 0.7738 0.7339 0.6957 0.6591 0.6240 x 0.9990 1.0000 1.0000 1.0000 1.0000 p  0.10 0.9962 0.9999 1.0000 1.0000 1.0000 p  0.15 0.9915 0.9997 1.0000 1.0000 1.0000 p  0.20 0.9852 0.9994 1.0000 1.0000 1.0000 p  0.25 0.9774 0.9988 1.0000 1.0000 1.0000 p  0.30 0.9681 0.9980 0.9999 1.0000 1.0000 p  0.35 0.9575 0.9969 0.9999 1.0000 1.0000 p  0.40 0.9456 0.9955 0.9998 1.0000 1.0000 p  0.45 0.9326 0.9937 0.9997 1.0000 1.0000 p  0.50 0.5905 0.4437 0.3277 0.2373 0.1681 0.1160 0.0778 0.0503 0.0313 x x 0.9185 0.9914 0.9995 1.0000 1.0000 p  0.55 0.0185 0.1312 0.4069 0.7438 0.9497 1.0000 p  0.92 0.0000 0.0002 0.0045 0.0544 0.3409 1.0000 0.8352 0.9734 0.9978 0.9999 1.0000 p  0.60 0.0102 0.0870 0.3174 0.6630 0.9222 1.0000 p  0.93 0.0000 0.0001 0.0031 0.0425 0.3043 1.0000 0.7373 0.9421 0.9933 0.9997 1.0000 p  0.65 0.0053 0.0540 0.2352 0.5716 0.8840 1.0000 p  0.94 0.0000 0.0001 0.0020 0.0319 0.2661 1.0000 0.6328 0.8965 0.9844 0.9990 1.0000 p  0.70 0.0024 0.0308 0.1631 0.4718 0.8319 1.0000 p  0.95 0.0000 0.0000 0.0012 0.0226 0.2262 1.0000 0.5282 0.8369 0.9692 0.9976 1.0000 p  0.75 0.0010 0.0156 0.1035 0.3672 0.7627 1.0000 p  0.96 0.0000 0.0000 0.0006 0.0148 0.1846 1.0000 0.4284 0.7648 0.9460 0.9947 1.0000 p  0.80 0.0003 0.0067 0.0579 0.2627 0.6723 1.0000 p  0.97 0.0000 0.0000 0.0003 0.0085 0.1413 1.0000 0.3370 0.6826 0.9130 0.9898 1.0000 p  0.85 0.0001 0.0022 0.0266 0.1648 0.5563 1.0000 p  0.98 0.0000 0.0000 0.0001 0.0038 0.0961 1.0000 0.2562 0.5931 0.8688 0.9815 1.0000 p  0.90 0.0000 0.0005 0.0086 0.0815 0.4095 1.0000 p  0.99 0.0000 0.0000 0.0000 0.0010 0.0490 1.0000 0.1875 0.5000 0.8125 0.9688 1.0000 p  0.91 0.0000 0.0003 0.0063 0.0674 0.3760 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 x p  0.01 0.9415 0.9985 1.0000 1.0000 1.0000 p  0.02 0.8858 0.9943 0.9998 1.0000 1.0000 p  0.03 0.8330 0.9875 0.9995 1.0000 1.0000 p  0.04 0.7828 0.9784 0.9988 1.0000 1.0000 p  0.06 0.6899 0.9541 0.9962 0.9998 1.0000 p  0.07 0.6470 0.9392 0.9942 0.9997 1.0000 p  0.08 0.6064 0.9227 0.9915 0.9995 1.0000 p  0.09 0.5679 0.9048 0.9882 0.9992 1.0000 n6 p  0.05 0.7351 0.9672 0.9978 0.9999 1.0000 (continued ) 839 (384) 840 APPENDIX B x 1.0000 1.0000 p  0.10 1.0000 1.0000 p  0.15 1.0000 1.0000 p  0.20 1.0000 1.0000 p  0.25 1.0000 1.0000 p  0.30 1.0000 1.0000 p  0.35 1.0000 1.0000 p  0.40 1.0000 1.0000 p  0.45 1.0000 1.0000 p  0.50 0.5314 0.3771 0.2621 0.1780 0.1176 0.0754 0.0467 0.0277 0.0156 x x 0.8857 0.9842 0.9987 0.9999 1.0000 1.0000 p  0.55 0.0083 0.0692 0.2553 0.5585 0.8364 0.9723 1.0000 p  0.92 0.0000 0.0000 0.0005 0.0085 0.0773 0.3936 1.0000 0.7765 0.9527 0.9941 0.9996 1.0000 1.0000 p  0.60 0.0041 0.0410 0.1792 0.4557 0.7667 0.9533 1.0000 p  0.93 0.0000 0.0000 0.0003 0.0058 0.0608 0.3530 1.0000 0.6554 0.9011 0.9830 0.9984 0.9999 1.0000 p  0.65 0.0018 0.0223 0.1174 0.3529 0.6809 0.9246 1.0000 p  0.94 0.0000 0.0000 0.0002 0.0038 0.0459 0.3101 1.0000 0.5339 0.8306 0.9624 0.9954 0.9998 1.0000 p  0.70 0.0007 0.0109 0.0705 0.2557 0.5798 0.8824 1.0000 p  0.95 0.0000 0.0000 0.0001 0.0022 0.0328 0.2649 1.0000 
0.4202 0.7443 0.9295 0.9891 0.9993 1.0000 p  0.75 0.0002 0.0046 0.0376 0.1694 0.4661 0.8220 1.0000 p  0.96 0.0000 0.0000 0.0000 0.0012 0.0216 0.2172 1.0000 0.3191 0.6471 0.8826 0.9777 0.9982 1.0000 p  0.80 0.0001 0.0016 0.0170 0.0989 0.3446 0.7379 1.0000 p  0.97 0.0000 0.0000 0.0000 0.0005 0.0125 0.1670 1.0000 0.2333 0.5443 0.8208 0.9590 0.9959 1.0000 p  0.85 0.0000 0.0004 0.0059 0.0473 0.2235 0.6229 1.0000 p  0.98 0.0000 0.0000 0.0000 0.0002 0.0057 0.1142 1.0000 0.1636 0.4415 0.7447 0.9308 0.9917 1.0000 p  0.90 0.0000 0.0001 0.0013 0.0159 0.1143 0.4686 1.0000 p  0.99 0.0000 0.0000 0.0000 0.0000 0.0015 0.0585 1.0000 0.1094 0.3438 0.6563 0.8906 0.9844 1.0000 p  0.91 0.0000 0.0000 0.0008 0.0118 0.0952 0.4321 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 p  0.06 0.6485 0.9382 0.9937 0.9996 1.0000 1.0000 1.0000 1.0000 p  0.35 0.0490 0.2338 0.5323 0.8002 0.9444 0.9910 0.9994 1.0000 p  0.80 0.0000 0.0004 0.0047 0.0333 0.1480 0.4233 0.7903 1.0000 p  0.97 0.0000 0.0000 0.0000 0.0000 0.0009 0.0171 0.1920 1.0000 p  0.07 0.6017 0.9187 0.9903 0.9993 1.0000 1.0000 1.0000 1.0000 p  0.40 0.0280 0.1586 0.4199 0.7102 0.9037 0.9812 0.9984 1.0000 p  0.85 0.0000 0.0001 0.0012 0.0121 0.0738 0.2834 0.6794 1.0000 p  0.98 0.0000 0.0000 0.0000 0.0000 0.0003 0.0079 0.1319 1.0000 p  0.08 0.5578 0.8974 0.9860 0.9988 0.9999 1.0000 1.0000 1.0000 p  0.45 0.0152 0.1024 0.3164 0.6083 0.8471 0.9643 0.9963 1.0000 p  0.90 0.0000 0.0000 0.0002 0.0027 0.0257 0.1497 0.5217 1.0000 p  0.99 0.0000 0.0000 0.0000 0.0000 0.0000 0.0020 0.0679 1.0000 p  0.09 0.5168 0.8745 0.9807 0.9982 0.9999 1.0000 1.0000 1.0000 p  0.50 0.0078 0.0625 0.2266 0.5000 0.7734 0.9375 0.9922 1.0000 p  0.91 0.0000 0.0000 0.0001 0.0018 0.0193 0.1255 0.4832 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 n7 x x x x p  0.01 0.9321 0.9980 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.10 0.4783 0.8503 0.9743 0.9973 0.9998 1.0000 1.0000 1.0000 p  0.55 0.0037 0.0357 0.1529 0.3917 0.6836 0.8976 0.9848 1.0000 p  0.92 0.0000 0.0000 0.0001 0.0012 0.0140 0.1026 0.4422 1.0000 p  0.02 0.8681 0.9921 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.15 0.3206 0.7166 0.9262 0.9879 0.9988 0.9999 1.0000 1.0000 p  0.60 0.0016 0.0188 0.0963 0.2898 0.5801 0.8414 0.9720 1.0000 p  0.93 0.0000 0.0000 0.0000 0.0007 0.0097 0.0813 0.3983 1.0000 p  0.03 0.8080 0.9829 0.9991 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.20 0.2097 0.5767 0.8520 0.9667 0.9953 0.9996 1.0000 1.0000 p  0.65 0.0006 0.0090 0.0556 0.1998 0.4677 0.7662 0.9510 1.0000 p  0.94 0.0000 0.0000 0.0000 0.0004 0.0063 0.0618 0.3515 1.0000 p  0.04 0.7514 0.9706 0.9980 0.9999 1.0000 1.0000 1.0000 1.0000 p  0.25 0.1335 0.4449 0.7564 0.9294 0.9871 0.9987 0.9999 1.0000 p  0.70 0.0002 0.0038 0.0288 0.1260 0.3529 0.6706 0.9176 1.0000 p  0.95 0.0000 0.0000 0.0000 0.0002 0.0038 0.0444 0.3017 1.0000 p  0.05 0.6983 0.9556 0.9962 0.9998 1.0000 1.0000 1.0000 1.0000 p  0.30 0.0824 0.3294 0.6471 0.8740 0.9712 0.9962 0.9998 1.0000 p  0.75 0.0001 0.0013 0.0129 0.0706 0.2436 0.5551 0.8665 1.0000 p  0.96 0.0000 0.0000 0.0000 0.0001 0.0020 0.0294 0.2486 1.0000 (385) APPENDIX B n8 x x x x p  0.01 0.9227 0.9973 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.10 0.4305 0.8131 0.9619 0.9950 0.9996 1.0000 1.0000 1.0000 1.0000 p  0.55 0.0017 0.0181 0.0885 0.2604 0.5230 0.7799 0.9368 0.9916 1.0000 p  0.92 0.0000 0.0000 0.0000 0.0001 0.0022 0.0211 0.1298 0.4868 1.0000 p  0.02 0.8508 0.9897 0.9996 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.15 0.2725 0.6572 0.8948 0.9786 0.9971 0.9998 1.0000 1.0000 1.0000 p  
0.60 0.0007 0.0085 0.0498 0.1737 0.4059 0.6846 0.8936 0.9832 1.0000 p  0.93 0.0000 0.0000 0.0000 0.0001 0.0013 0.0147 0.1035 0.4404 1.0000 p  0.03 0.7837 0.9777 0.9987 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.20 0.1678 0.5033 0.7969 0.9437 0.9896 0.9988 0.9999 1.0000 1.0000 p  0.65 0.0002 0.0036 0.0253 0.1061 0.2936 0.5722 0.8309 0.9681 1.0000 p  0.94 0.0000 0.0000 0.0000 0.0000 0.0007 0.0096 0.0792 0.3904 1.0000 p  0.04 0.7214 0.9619 0.9969 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.25 0.1001 0.3671 0.6785 0.8862 0.9727 0.9958 0.9996 1.0000 1.0000 p  0.70 0.0001 0.0013 0.0113 0.0580 0.1941 0.4482 0.7447 0.9424 1.0000 p  0.95 0.0000 0.0000 0.0000 0.0000 0.0004 0.0058 0.0572 0.3366 1.0000 p  0.05 0.6634 0.9428 0.9942 0.9996 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.30 0.0576 0.2553 0.5518 0.8059 0.9420 0.9887 0.9987 0.9999 1.0000 p  0.75 0.0000 0.0004 0.0042 0.0273 0.1138 0.3215 0.6329 0.8999 1.0000 p  0.96 0.0000 0.0000 0.0000 0.0000 0.0002 0.0031 0.0381 0.2786 1.0000 p  0.06 0.6096 0.9208 0.9904 0.9993 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.35 0.0319 0.1691 0.4278 0.7064 0.8939 0.9747 0.9964 0.9998 1.0000 p  0.80 0.0000 0.0001 0.0012 0.0104 0.0563 0.2031 0.4967 0.8322 1.0000 p  0.97 0.0000 0.0000 0.0000 0.0000 0.0001 0.0013 0.0223 0.2163 1.0000 p  0.07 0.5596 0.8965 0.9853 0.9987 0.9999 1.0000 1.0000 1.0000 1.0000 p  0.40 0.0168 0.1064 0.3154 0.5941 0.8263 0.9502 0.9915 0.9993 1.0000 p  0.85 0.0000 0.0000 0.0002 0.0029 0.0214 0.1052 0.3428 0.7275 1.0000 p  0.98 0.0000 0.0000 0.0000 0.0000 0.0000 0.0004 0.0103 0.1492 1.0000 p  0.08 0.5132 0.8702 0.9789 0.9978 0.9999 1.0000 1.0000 1.0000 1.0000 p  0.45 0.0084 0.0632 0.2201 0.4770 0.7396 0.9115 0.9819 0.9983 1.0000 p  0.90 0.0000 0.0000 0.0000 0.0004 0.0050 0.0381 0.1869 0.5695 1.0000 p  0.99 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0027 0.0773 1.0000 p  0.09 0.4703 0.8423 0.9711 0.9966 0.9997 1.0000 1.0000 1.0000 1.0000 p  0.50 0.0039 0.0352 0.1445 0.3633 0.6367 0.8555 0.9648 0.9961 1.0000 p  0.91 0.0000 0.0000 0.0000 0.0003 0.0034 0.0289 0.1577 0.5297 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 n9 x x p  0.01 p  0.02 p  0.03 p  0.04 p  0.05 p  0.06 p  0.07 p  0.08 p  0.09 0.9135 0.8337 0.7602 0.6925 0.6302 0.5730 0.5204 0.4722 0.4279 0.9966 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.10 0.3874 0.7748 0.9470 0.9917 0.9991 0.9999 1.0000 1.0000 1.0000 1.0000 0.9869 0.9994 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.15 0.2316 0.5995 0.8591 0.9661 0.9944 0.9994 1.0000 1.0000 1.0000 1.0000 0.9718 0.9980 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.20 0.1342 0.4362 0.7382 0.9144 0.9804 0.9969 0.9997 1.0000 1.0000 1.0000 0.9522 0.9955 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.25 0.0751 0.3003 0.6007 0.8343 0.9511 0.9900 0.9987 0.9999 1.0000 1.0000 0.9288 0.9916 0.9994 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.30 0.0404 0.1960 0.4628 0.7297 0.9012 0.9747 0.9957 0.9996 1.0000 1.0000 0.9022 0.9862 0.9987 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.35 0.0207 0.1211 0.3373 0.6089 0.8283 0.9464 0.9888 0.9986 0.9999 1.0000 0.8729 0.9791 0.9977 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.40 0.0101 0.0705 0.2318 0.4826 0.7334 0.9006 0.9750 0.9962 0.9997 1.0000 0.8417 0.9702 0.9963 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.45 0.0046 0.0385 0.1495 0.3614 0.6214 0.8342 0.9502 0.9909 0.9992 1.0000 0.8088 0.9595 0.9943 0.9995 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.50 0.0020 0.0195 0.0898 0.2539 0.5000 0.7461 0.9102 0.9805 0.9980 1.0000 (continued ) 841 
(386) APPENDIX B 842 x x p  0.55 0.0008 0.0091 0.0498 0.1658 0.3786 0.6386 0.8505 0.9615 0.9954 1.0000 p  0.92 0.0000 0.0000 0.0000 0.0000 0.0003 0.0037 0.0298 0.1583 0.5278 1.0000 p  0.60 0.0003 0.0038 0.0250 0.0994 0.2666 0.5174 0.7682 0.9295 0.9899 1.0000 p  0.93 0.0000 0.0000 0.0000 0.0000 0.0002 0.0023 0.0209 0.1271 0.4796 1.0000 p  0.65 0.0001 0.0014 0.0112 0.0536 0.1717 0.3911 0.6627 0.8789 0.9793 1.0000 p  0.94 0.0000 0.0000 0.0000 0.0000 0.0001 0.0013 0.0138 0.0978 0.4270 1.0000 p  0.70 0.0000 0.0004 0.0043 0.0253 0.0988 0.2703 0.5372 0.8040 0.9596 1.0000 p  0.95 0.0000 0.0000 0.0000 0.0000 0.0000 0.0006 0.0084 0.0712 0.3698 1.0000 p  0.75 0.0000 0.0001 0.0013 0.0100 0.0489 0.1657 0.3993 0.6997 0.9249 1.0000 p  0.96 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0045 0.0478 0.3075 1.0000 p  0.80 0.0000 0.0000 0.0003 0.0031 0.0196 0.0856 0.2618 0.5638 0.8658 1.0000 p  0.97 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0020 0.0282 0.2398 1.0000 p  0.85 0.0000 0.0000 0.0000 0.0006 0.0056 0.0339 0.1409 0.4005 0.7684 1.0000 p  0.98 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0006 0.0131 0.1663 1.0000 p  0.90 0.0000 0.0000 0.0000 0.0001 0.0009 0.0083 0.0530 0.2252 0.6126 1.0000 p  0.99 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0034 0.0865 1.0000 p  0.91 0.0000 0.0000 0.0000 0.0000 0.0005 0.0057 0.0405 0.1912 0.5721 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 n  10 x 10 x 10 x 10 x p  0.01 0.9044 0.9957 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.10 0.3487 0.7361 0.9298 0.9872 0.9984 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.55 0.0003 0.0045 0.0274 0.1020 0.2616 0.4956 0.7340 0.9004 0.9767 0.9975 1.0000 p  0.92 p  0.02 0.8171 0.9838 0.9991 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.15 0.1969 0.5443 0.8202 0.9500 0.9901 0.9986 0.9999 1.0000 1.0000 1.0000 1.0000 p  0.60 0.0001 0.0017 0.0123 0.0548 0.1662 0.3669 0.6177 0.8327 0.9536 0.9940 1.0000 p  0.93 p  0.03 0.7374 0.9655 0.9972 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.20 0.1074 0.3758 0.6778 0.8791 0.9672 0.9936 0.9991 0.9999 1.0000 1.0000 1.0000 p  0.65 0.0000 0.0005 0.0048 0.0260 0.0949 0.2485 0.4862 0.7384 0.9140 0.9865 1.0000 p  0.94 p  0.04 0.6648 0.9418 0.9938 0.9996 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.25 0.0563 0.2440 0.5256 0.7759 0.9219 0.9803 0.9965 0.9996 1.0000 1.0000 1.0000 p  0.70 0.0000 0.0001 0.0016 0.0106 0.0473 0.1503 0.3504 0.6172 0.8507 0.9718 1.0000 p  0.95 p  0.05 0.5987 0.9139 0.9885 0.9990 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.30 0.0282 0.1493 0.3828 0.6496 0.8497 0.9527 0.9894 0.9984 0.9999 1.0000 1.0000 p  0.75 0.0000 0.0000 0.0004 0.0035 0.0197 0.0781 0.2241 0.4744 0.7560 0.9437 1.0000 p  0.96 p  0.06 0.5386 0.8824 0.9812 0.9980 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.35 0.0135 0.0860 0.2616 0.5138 0.7515 0.9051 0.9740 0.9952 0.9995 1.0000 1.0000 p  0.80 0.0000 0.0000 0.0001 0.0009 0.0064 0.0328 0.1209 0.3222 0.6242 0.8926 1.0000 p  0.97 p  0.07 0.4840 0.8483 0.9717 0.9964 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.40 0.0060 0.0464 0.1673 0.3823 0.6331 0.8338 0.9452 0.9877 0.9983 0.9999 1.0000 p  0.85 0.0000 0.0000 0.0000 0.0001 0.0014 0.0099 0.0500 0.1798 0.4557 0.8031 1.0000 p  0.98 p  0.08 0.4344 0.8121 0.9599 0.9942 0.9994 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.45 0.0025 0.0233 0.0996 0.2660 0.5044 0.7384 0.8980 0.9726 0.9955 0.9997 1.0000 p  0.90 0.0000 0.0000 0.0000 0.0000 0.0001 0.0016 0.0128 0.0702 0.2639 0.6513 1.0000 p  
0.99 p  0.09 0.3894 0.7746 0.9460 0.9912 0.9990 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.50 0.0010 0.0107 0.0547 0.1719 0.3770 0.6230 0.8281 0.9453 0.9893 0.9990 1.0000 p  0.91 0.0000 0.0000 0.0000 0.0000 0.0001 0.0010 0.0088 0.0540 0.2254 0.6106 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 (387) APPENDIX B 10 0.0000 0.0000 0.0006 0.0058 0.0401 0.1879 0.5656 1.0000 0.0000 0.0000 0.0003 0.0036 0.0283 0.1517 0.5160 1.0000 0.0000 0.0000 0.0002 0.0020 0.0188 0.1176 0.4614 1.0000 0.0000 0.0000 0.0001 0.0010 0.0115 0.0861 0.4013 1.0000 0.0000 0.0000 0.0000 0.0004 0.0062 0.0582 0.3352 1.0000 0.0000 0.0000 0.0000 0.0001 0.0028 0.0345 0.2626 1.0000 0.0000 0.0000 0.0000 0.0000 0.0009 0.0162 0.1829 1.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0043 0.0956 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 p  0.06 0.5063 0.8618 0.9752 0.9970 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.35 0.0088 0.0606 0.2001 0.4256 0.6683 0.8513 0.9499 0.9878 0.9980 0.9998 1.0000 1.0000 p  0.80 0.0000 0.0000 0.0000 0.0002 0.0020 0.0117 0.0504 0.1611 0.3826 0.6779 0.9141 1.0000 p  0.97 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0037 0.0413 0.2847 1.0000 p  0.07 0.4501 0.8228 0.9630 0.9947 0.9995 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.40 0.0036 0.0302 0.1189 0.2963 0.5328 0.7535 0.9006 0.9707 0.9941 0.9993 1.0000 1.0000 p  0.85 0.0000 0.0000 0.0000 0.0000 0.0003 0.0027 0.0159 0.0694 0.2212 0.5078 0.8327 1.0000 p  0.98 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0012 0.0195 0.1993 1.0000 p  0.08 0.3996 0.7819 0.9481 0.9915 0.9990 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.45 0.0014 0.0139 0.0652 0.1911 0.3971 0.6331 0.8262 0.9390 0.9852 0.9978 0.9998 1.0000 p  0.90 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0028 0.0185 0.0896 0.3026 0.6862 1.0000 p  0.99 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0052 0.1047 1.0000 p  0.09 0.3544 0.7399 0.9305 0.9871 0.9983 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.50 0.0005 0.0059 0.0327 0.1133 0.2744 0.5000 0.7256 0.8867 0.9673 0.9941 0.9995 1.0000 p  0.91 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0017 0.0129 0.0695 0.2601 0.6456 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 n  11 x 10 11 x 10 11 x 10 11 x 10 11 p  0.01 0.8953 0.9948 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.10 0.3138 0.6974 0.9104 0.9815 0.9972 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.55 0.0002 0.0022 0.0148 0.0610 0.1738 0.3669 0.6029 0.8089 0.9348 0.9861 0.9986 1.0000 p  0.92 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0010 0.0085 0.0519 0.2181 0.6004 1.0000 p  0.02 0.8007 0.9805 0.9988 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.15 0.1673 0.4922 0.7788 0.9306 0.9841 0.9973 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.60 0.0000 0.0007 0.0059 0.0293 0.0994 0.2465 0.4672 0.7037 0.8811 0.9698 0.9964 1.0000 p  0.93 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0005 0.0053 0.0370 0.1772 0.5499 1.0000 p  0.03 0.7153 0.9587 0.9963 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.20 0.0859 0.3221 0.6174 0.8389 0.9496 0.9883 0.9980 0.9998 1.0000 1.0000 1.0000 1.0000 p  0.65 0.0000 0.0002 0.0020 0.0122 0.0501 0.1487 0.3317 0.5744 0.7999 0.9394 0.9912 1.0000 p  0.94 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 
0.0003 0.0030 0.0248 0.1382 0.4937 1.0000 p  0.04 0.6382 0.9308 0.9917 0.9993 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.25 0.0422 0.1971 0.4552 0.7133 0.8854 0.9657 0.9924 0.9988 0.9999 1.0000 1.0000 1.0000 p  0.70 0.0000 0.0000 0.0006 0.0043 0.0216 0.0782 0.2103 0.4304 0.6873 0.8870 0.9802 1.0000 p  0.95 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0016 0.0152 0.1019 0.4312 1.0000 p  0.05 0.5688 0.8981 0.9848 0.9984 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.30 0.0198 0.1130 0.3127 0.5696 0.7897 0.9218 0.9784 0.9957 0.9994 1.0000 1.0000 1.0000 p  0.75 0.0000 0.0000 0.0001 0.0012 0.0076 0.0343 0.1146 0.2867 0.5448 0.8029 0.9578 1.0000 p  0.96 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0007 0.0083 0.0692 0.3618 1.0000 (continued ) 843 (388) 844 APPENDIX B n  12 x p  0.01 p  0.02 p  0.03 p  0.04 p  0.05 p  0.06 p  0.07 p  0.08 p  0.09 0.8864 0.7847 0.6938 0.6127 0.5404 0.4759 0.4186 0.3677 0.3225 10 11 12 x 10 11 12 x 10 11 12 x 10 11 12 0.9938 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.10 0.2824 0.6590 0.8891 0.9744 0.9957 0.9995 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.55 0.0001 0.0011 0.0079 0.0356 0.1117 0.2607 0.4731 0.6956 0.8655 0.9579 0.9917 0.9992 1.0000 p  0.92 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0016 0.0120 0.0652 0.2487 0.6323 1.0000 0.9769 0.9985 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.15 0.1422 0.4435 0.7358 0.9078 0.9761 0.9954 0.9993 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.60 0.0000 0.0003 0.0028 0.0153 0.0573 0.1582 0.3348 0.5618 0.7747 0.9166 0.9804 0.9978 1.0000 p  0.93 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0009 0.0075 0.0468 0.2033 0.5814 1.0000 0.9514 0.9952 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.20 0.0687 0.2749 0.5583 0.7946 0.9274 0.9806 0.9961 0.9994 0.9999 1.0000 1.0000 1.0000 1.0000 p  0.65 0.0000 0.0001 0.0008 0.0056 0.0255 0.0846 0.2127 0.4167 0.6533 0.8487 0.9576 0.9943 1.0000 p  0.94 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0004 0.0043 0.0316 0.1595 0.5241 1.0000 0.9191 0.9893 0.9990 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.25 0.0317 0.1584 0.3907 0.6488 0.8424 0.9456 0.9857 0.9972 0.9996 1.0000 1.0000 1.0000 1.0000 p  0.70 0.0000 0.0000 0.0002 0.0017 0.0095 0.0386 0.1178 0.2763 0.5075 0.7472 0.9150 0.9862 1.0000 p  0.95 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0022 0.0196 0.1184 0.4596 1.0000 0.8816 0.9804 0.9978 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.30 0.0138 0.0850 0.2528 0.4925 0.7237 0.8822 0.9614 0.9905 0.9983 0.9998 1.0000 1.0000 1.0000 p  0.75 0.0000 0.0000 0.0000 0.0004 0.0028 0.0143 0.0544 0.1576 0.3512 0.6093 0.8416 0.9683 1.0000 p  0.96 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0010 0.0107 0.0809 0.3873 1.0000 0.8405 0.9684 0.9957 0.9996 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.35 0.0057 0.0424 0.1513 0.3467 0.5833 0.7873 0.9154 0.9745 0.9944 0.9992 0.9999 1.0000 1.0000 p  0.80 0.0000 0.0000 0.0000 0.0001 0.0006 0.0039 0.0194 0.0726 0.2054 0.4417 0.7251 0.9313 1.0000 p  0.97 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0048 0.0486 0.3062 1.0000 0.7967 0.9532 0.9925 0.9991 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.40 0.0022 0.0196 0.0834 0.2253 0.4382 0.6652 0.8418 0.9427 0.9847 0.9972 0.9997 1.0000 1.0000 p  0.85 0.0000 0.0000 0.0000 0.0000 0.0001 0.0007 0.0046 0.0239 0.0922 0.2642 0.5565 
0.8578 1.0000 p  0.98 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0015 0.0231 0.2153 1.0000 0.7513 0.9348 0.9880 0.9984 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.45 0.0008 0.0083 0.0421 0.1345 0.3044 0.5269 0.7393 0.8883 0.9644 0.9921 0.9989 0.9999 1.0000 p  0.90 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0005 0.0043 0.0256 0.1109 0.3410 0.7176 1.0000 p  0.99 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0062 0.1136 1.0000 0.7052 0.9134 0.9820 0.9973 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.50 0.0002 0.0032 0.0193 0.0730 0.1938 0.3872 0.6128 0.8062 0.9270 0.9807 0.9968 0.9998 1.0000 p  0.91 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0027 0.0180 0.0866 0.2948 0.6775 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 x p  0.01 0.8775 0.9928 0.9997 1.0000 p  0.02 0.7690 0.9730 0.9980 0.9999 p  0.03 0.6730 0.9436 0.9938 0.9995 p  0.04 0.5882 0.9068 0.9865 0.9986 p  0.06 0.4474 0.8186 0.9608 0.9940 p  0.07 0.3893 0.7702 0.9422 0.9897 p  0.08 0.3383 0.7206 0.9201 0.9837 p  0.09 0.2935 0.6707 0.8946 0.9758 n  13 p  0.05 0.5133 0.8646 0.9755 0.9969 (389) APPENDIX B 10 11 12 13 x 10 11 12 13 x 10 11 12 13 x 10 11 12 13 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.10 0.2542 0.6213 0.8661 0.9658 0.9935 0.9991 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.55 0.0000 0.0005 0.0041 0.0203 0.0698 0.1788 0.3563 0.5732 0.7721 0.9071 0.9731 0.9951 0.9996 1.0000 p  0.92 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0024 0.0163 0.0799 0.2794 0.6617 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.15 0.1209 0.3983 0.6920 0.8820 0.9658 0.9925 0.9987 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.60 0.0000 0.0001 0.0013 0.0078 0.0321 0.0977 0.2288 0.4256 0.6470 0.8314 0.9421 0.9874 0.9987 1.0000 p  0.93 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0013 0.0103 0.0578 0.2298 0.6107 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.20 0.0550 0.2336 0.5017 0.7473 0.9009 0.9700 0.9930 0.9988 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.65 0.0000 0.0000 0.0003 0.0025 0.0126 0.0462 0.1295 0.2841 0.4995 0.7217 0.8868 0.9704 0.9963 1.0000 p  0.94 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0007 0.0060 0.0392 0.1814 0.5526 1.0000 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.25 0.0238 0.1267 0.3326 0.5843 0.7940 0.9198 0.9757 0.9944 0.9990 0.9999 1.0000 1.0000 1.0000 1.0000 p  0.70 0.0000 0.0000 0.0001 0.0007 0.0040 0.0182 0.0624 0.1654 0.3457 0.5794 0.7975 0.9363 0.9903 1.0000 p  0.95 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0031 0.0245 0.1354 0.4867 1.0000 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.30 0.0097 0.0637 0.2025 0.4206 0.6543 0.8346 0.9376 0.9818 0.9960 0.9993 0.9999 1.0000 1.0000 1.0000 p  0.75 0.0000 0.0000 0.0000 0.0001 0.0010 0.0056 0.0243 0.0802 0.2060 0.4157 0.6674 0.8733 0.9762 1.0000 p  0.96 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0014 0.0135 0.0932 0.4118 1.0000 0.9993 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.35 0.0037 0.0296 0.1132 0.2783 0.5005 0.7159 0.8705 0.9538 0.9874 0.9975 0.9997 1.0000 1.0000 1.0000 p  0.80 0.0000 0.0000 0.0000 0.0000 0.0002 0.0012 0.0070 0.0300 0.0991 0.2527 0.4983 0.7664 0.9450 1.0000 p  0.97 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 
0.0000 0.0000 0.0000 0.0005 0.0062 0.0564 0.3270 1.0000 0.9987 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.40 0.0013 0.0126 0.0579 0.1686 0.3530 0.5744 0.7712 0.9023 0.9679 0.9922 0.9987 0.9999 1.0000 1.0000 p  0.85 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0013 0.0075 0.0342 0.1180 0.3080 0.6017 0.8791 1.0000 p  0.98 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0020 0.0270 0.2310 1.0000 0.9976 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.45 0.0004 0.0049 0.0269 0.0929 0.2279 0.4268 0.6437 0.8212 0.9302 0.9797 0.9959 0.9995 1.0000 1.0000 p  0.90 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0009 0.0065 0.0342 0.1339 0.3787 0.7458 1.0000 p  0.99 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0072 0.1225 1.0000 0.9959 0.9995 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.50 0.0001 0.0017 0.0112 0.0461 0.1334 0.2905 0.5000 0.7095 0.8666 0.9539 0.9888 0.9983 0.9999 1.0000 p  0.91 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0005 0.0041 0.0242 0.1054 0.3293 0.7065 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 p  0.06 0.4205 0.7963 0.9522 0.9920 0.9990 0.9999 1.0000 p  0.07 0.3620 0.7436 0.9302 0.9864 0.9980 0.9998 1.0000 p  0.08 0.3112 0.6900 0.9042 0.9786 0.9965 0.9996 1.0000 p  0.09 0.2670 0.6368 0.8745 0.9685 0.9941 0.9992 0.9999 n  14 x p  0.01 0.8687 0.9916 0.9997 1.0000 1.0000 1.0000 1.0000 p  0.02 0.7536 0.9690 0.9975 0.9999 1.0000 1.0000 1.0000 p  0.03 0.6528 0.9355 0.9923 0.9994 1.0000 1.0000 1.0000 p  0.04 0.5647 0.8941 0.9833 0.9981 0.9998 1.0000 1.0000 p  0.05 0.4877 0.8470 0.9699 0.9958 0.9996 1.0000 1.0000 (continued ) 845 (390) 846 APPENDIX B 10 11 12 13 14 x 10 11 12 13 14 x 10 11 12 13 14 x 10 11 12 13 14 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.10 0.2288 0.5846 0.8416 0.9559 0.9908 0.9985 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.55 0.0000 0.0003 0.0022 0.0114 0.0426 0.1189 0.2586 0.4539 0.6627 0.8328 0.9368 0.9830 0.9971 0.9998 1.0000 p  0.92 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0004 0.0035 0.0214 0.0958 0.3100 0.6888 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.15 0.1028 0.3567 0.6479 0.8535 0.9533 0.9885 0.9978 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.60 0.0000 0.0001 0.0006 0.0039 0.0175 0.0583 0.1501 0.3075 0.5141 0.7207 0.8757 0.9602 0.9919 0.9992 1.0000 p  0.93 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0020 0.0136 0.0698 0.2564 0.6380 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.20 0.0440 0.1979 0.4481 0.6982 0.8702 0.9561 0.9884 0.9976 0.9996 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.65 0.0000 0.0000 0.0001 0.0011 0.0060 0.0243 0.0753 0.1836 0.3595 0.5773 0.7795 0.9161 0.9795 0.9976 1.0000 p  0.94 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0010 0.0080 0.0478 0.2037 0.5795 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.25 0.0178 0.1010 0.2811 0.5213 0.7415 0.8883 0.9617 0.9897 0.9978 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.70 0.0000 0.0000 0.0000 0.0002 0.0017 0.0083 0.0315 0.0933 0.2195 0.4158 0.6448 0.8392 0.9525 0.9932 1.0000 p  0.95 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0004 0.0042 0.0301 0.1530 0.5123 1.0000 x p  0.01 0.8601 0.9904 0.9996 1.0000 p  0.02 0.7386 0.9647 0.9970 0.9998 p  0.03 0.6333 0.9270 0.9906 0.9992 p  0.04 0.5421 
0.8809 0.9797 0.9976 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.30 0.0068 0.0475 0.1608 0.3552 0.5842 0.7805 0.9067 0.9685 0.9917 0.9983 0.9998 1.0000 1.0000 1.0000 1.0000 p  0.75 0.0000 0.0000 0.0000 0.0000 0.0003 0.0022 0.0103 0.0383 0.1117 0.2585 0.4787 0.7189 0.8990 0.9822 1.0000 p  0.96 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0019 0.0167 0.1059 0.4353 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.35 0.0024 0.0205 0.0839 0.2205 0.4227 0.6405 0.8164 0.9247 0.9757 0.9940 0.9989 0.9999 1.0000 1.0000 1.0000 p  0.80 0.0000 0.0000 0.0000 0.0000 0.0000 0.0004 0.0024 0.0116 0.0439 0.1298 0.3018 0.5519 0.8021 0.9560 1.0000 p  0.97 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0006 0.0077 0.0645 0.3472 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.40 0.0008 0.0081 0.0398 0.1243 0.2793 0.4859 0.6925 0.8499 0.9417 0.9825 0.9961 0.9994 0.9999 1.0000 1.0000 p  0.85 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0022 0.0115 0.0467 0.1465 0.3521 0.6433 0.8972 1.0000 p  0.98 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0025 0.0310 0.2464 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.45 0.0002 0.0029 0.0170 0.0632 0.1672 0.3373 0.5461 0.7414 0.8811 0.9574 0.9886 0.9978 0.9997 1.0000 1.0000 p  0.90 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0015 0.0092 0.0441 0.1584 0.4154 0.7712 1.0000 p  0.99 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0084 0.1313 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.50 0.0001 0.0009 0.0065 0.0287 0.0898 0.2120 0.3953 0.6047 0.7880 0.9102 0.9713 0.9935 0.9991 0.9999 1.0000 p  0.91 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0008 0.0059 0.0315 0.1255 0.3632 0.7330 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 p  0.06 0.3953 0.7738 0.9429 0.9896 p  0.07 0.3367 0.7168 0.9171 0.9825 p  0.08 0.2863 0.6597 0.8870 0.9727 p  0.09 0.2430 0.6035 0.8531 0.9601 n  15 p  0.05 0.4633 0.8290 0.9638 0.9945 (391) APPENDIX B 10 11 12 13 14 15 x 10 11 12 13 14 15 x 10 11 12 13 14 15 x 10 11 12 13 14 15 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.10 0.2059 0.5490 0.8159 0.9444 0.9873 0.9978 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.55 0.0000 0.0001 0.0011 0.0063 0.0255 0.0769 0.1818 0.3465 0.5478 0.7392 0.8796 0.9576 0.9893 0.9983 0.9999 1.0000 p  0.92 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0007 0.0050 0.0273 0.1130 0.3403 0.7137 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.15 0.0874 0.3186 0.6042 0.8227 0.9383 0.9832 0.9964 0.9994 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.60 0.0000 0.0000 0.0003 0.0019 0.0093 0.0338 0.0950 0.2131 0.3902 0.5968 0.7827 0.9095 0.9729 0.9948 0.9995 1.0000 p  0.93 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0028 0.0175 0.0829 0.2832 0.6633 1.0000 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.20 0.0352 0.1671 0.3980 0.6482 0.8358 0.9389 0.9819 0.9958 0.9992 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.65 0.0000 0.0000 0.0001 0.0005 0.0028 0.0124 0.0422 0.1132 0.2452 0.4357 0.6481 0.8273 0.9383 0.9858 0.9984 1.0000 p  0.94 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 
0.0000 0.0001 0.0014 0.0104 0.0571 0.2262 0.6047 1.0000 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.25 0.0134 0.0802 0.2361 0.4613 0.6865 0.8516 0.9434 0.9827 0.9958 0.9992 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.70 0.0000 0.0000 0.0000 0.0001 0.0007 0.0037 0.0152 0.0500 0.1311 0.2784 0.4845 0.7031 0.8732 0.9647 0.9953 1.0000 p  0.95 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0006 0.0055 0.0362 0.1710 0.5367 1.0000 0.9994 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.30 0.0047 0.0353 0.1268 0.2969 0.5155 0.7216 0.8689 0.9500 0.9848 0.9963 0.9993 0.9999 1.0000 1.0000 1.0000 1.0000 p  0.75 0.0000 0.0000 0.0000 0.0000 0.0001 0.0008 0.0042 0.0173 0.0566 0.1484 0.3135 0.5387 0.7639 0.9198 0.9866 1.0000 p  0.96 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0024 0.0203 0.1191 0.4579 1.0000 0.9986 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.35 0.0016 0.0142 0.0617 0.1727 0.3519 0.5643 0.7548 0.8868 0.9578 0.9876 0.9972 0.9995 0.9999 1.0000 1.0000 1.0000 p  0.80 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0008 0.0042 0.0181 0.0611 0.1642 0.3518 0.6020 0.8329 0.9648 1.0000 p  0.97 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0008 0.0094 0.0730 0.3667 1.0000 0.9972 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.40 0.0005 0.0052 0.0271 0.0905 0.2173 0.4032 0.6098 0.7869 0.9050 0.9662 0.9907 0.9981 0.9997 1.0000 1.0000 1.0000 p  0.85 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0006 0.0036 0.0168 0.0617 0.1773 0.3958 0.6814 0.9126 1.0000 p  0.98 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0030 0.0353 0.2614 1.0000 0.9950 0.9993 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.45 0.0001 0.0017 0.0107 0.0424 0.1204 0.2608 0.4522 0.6535 0.8182 0.9231 0.9745 0.9937 0.9989 0.9999 1.0000 1.0000 p  0.90 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0022 0.0127 0.0556 0.1841 0.4510 0.7941 1.0000 p  0.99 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0004 0.0096 0.1399 1.0000 0.9918 0.9987 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.50 0.0000 0.0005 0.0037 0.0176 0.0592 0.1509 0.3036 0.5000 0.6964 0.8491 0.9408 0.9824 0.9963 0.9995 1.0000 1.0000 p  0.91 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0013 0.0082 0.0399 0.1469 0.3965 0.7570 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 (continued ) 847 (392) 848 APPENDIX B n  20 x 10 11 12 13 14 15 16 17 18 19 20 x 10 11 12 13 14 15 16 17 18 19 20 x 10 11 12 13 14 15 16 17 18 19 20 p  0.01 0.8179 0.9831 0.9990 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.10 0.1216 0.3917 0.6769 0.8670 0.9568 0.9887 0.9976 0.9996 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.55 0.0000 0.0000 0.0000 0.0003 0.0015 0.0064 0.0214 0.0580 0.1308 0.2493 0.4086 0.5857 0.7480 0.8701 0.9447 0.9811 0.9951 0.9991 0.9999 1.0000 1.0000 p  0.02 0.6676 0.9401 0.9929 0.9994 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.15 0.0388 0.1756 0.4049 0.6477 0.8298 0.9327 0.9781 0.9941 
0.9987 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.60 0.0000 0.0000 0.0000 0.0000 0.0003 0.0016 0.0065 0.0210 0.0565 0.1275 0.2447 0.4044 0.5841 0.7500 0.8744 0.9490 0.9840 0.9964 0.9995 1.0000 1.0000 p  0.03 0.5438 0.8802 0.9790 0.9973 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.20 0.0115 0.0692 0.2061 0.4114 0.6296 0.8042 0.9133 0.9679 0.9900 0.9974 0.9994 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.65 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0015 0.0060 0.0196 0.0532 0.1218 0.2376 0.3990 0.5834 0.7546 0.8818 0.9556 0.9879 0.9979 0.9998 1.0000 p  0.04 0.4420 0.8103 0.9561 0.9926 0.9990 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.25 0.0032 0.0243 0.0913 0.2252 0.4148 0.6172 0.7858 0.8982 0.9591 0.9861 0.9961 0.9991 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.70 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0013 0.0051 0.0171 0.0480 0.1133 0.2277 0.3920 0.5836 0.7625 0.8929 0.9645 0.9924 0.9992 1.0000 p  0.05 0.3585 0.7358 0.9245 0.9841 0.9974 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.30 0.0008 0.0076 0.0355 0.1071 0.2375 0.4164 0.6080 0.7723 0.8867 0.9520 0.9829 0.9949 0.9987 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.75 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0009 0.0039 0.0139 0.0409 0.1018 0.2142 0.3828 0.5852 0.7748 0.9087 0.9757 0.9968 1.0000 p  0.06 0.2901 0.6605 0.8850 0.9710 0.9944 0.9991 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.35 0.0002 0.0021 0.0121 0.0444 0.1182 0.2454 0.4166 0.6010 0.7624 0.8782 0.9468 0.9804 0.9940 0.9985 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.80 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0006 0.0026 0.0100 0.0321 0.0867 0.1958 0.3704 0.5886 0.7939 0.9308 0.9885 1.0000 p  0.07 0.2342 0.5869 0.8390 0.9529 0.9893 0.9981 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.40 0.0000 0.0005 0.0036 0.0160 0.0510 0.1256 0.2500 0.4159 0.5956 0.7553 0.8725 0.9435 0.9790 0.9935 0.9984 0.9997 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.85 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0013 0.0059 0.0219 0.0673 0.1702 0.3523 0.5951 0.8244 0.9612 1.0000 p  0.08 0.1887 0.5169 0.7879 0.9294 0.9817 0.9962 0.9994 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.45 0.0000 0.0001 0.0009 0.0049 0.0189 0.0553 0.1299 0.2520 0.4143 0.5914 0.7507 0.8692 0.9420 0.9786 0.9936 0.9985 0.9997 1.0000 1.0000 1.0000 1.0000 p  0.90 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0004 0.0024 0.0113 0.0432 0.1330 0.3231 0.6083 0.8784 1.0000 p  0.09 0.1516 0.4516 0.7334 0.9007 0.9710 0.9932 0.9987 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.50 0.0000 0.0000 0.0002 0.0013 0.0059 0.0207 0.0577 0.1316 0.2517 0.4119 0.5881 0.7483 0.8684 0.9423 0.9793 0.9941 0.9987 0.9998 1.0000 1.0000 1.0000 p  0.91 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0013 0.0068 0.0290 0.0993 0.2666 0.5484 0.8484 1.0000 (393) APPENDIX B x 10 11 12 13 14 15 16 17 18 19 20 p 
 0.92 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0006 0.0038 0.0183 0.0706 0.2121 0.4831 0.8113 1.0000 p  0.93 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0019 0.0107 0.0471 0.1610 0.4131 0.7658 1.0000 p  0.94 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0009 0.0056 0.0290 0.1150 0.3395 0.7099 1.0000 p  0.95 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0026 0.0159 0.0755 0.2642 0.6415 1.0000 x 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 x 10 11 12 13 p  0.01 0.7778 0.9742 0.9980 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.10 0.0718 0.2712 0.5371 0.7636 0.9020 0.9666 0.9905 0.9977 0.9995 0.9999 1.0000 1.0000 1.0000 1.0000 p  0.02 0.6035 0.9114 0.9868 0.9986 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.15 0.0172 0.0931 0.2537 0.4711 0.6821 0.8385 0.9305 0.9745 0.9920 0.9979 0.9995 0.9999 1.0000 1.0000 p  0.03 0.4670 0.8280 0.9620 0.9938 0.9992 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.20 0.0038 0.0274 0.0982 0.2340 0.4207 0.6167 0.7800 0.8909 0.9532 0.9827 0.9944 0.9985 0.9996 0.9999 p  0.04 0.3604 0.7358 0.9235 0.9835 0.9972 0.9996 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.25 0.0008 0.0070 0.0321 0.0962 0.2137 0.3783 0.5611 0.7265 0.8506 0.9287 0.9703 0.9893 0.9966 0.9991 p  0.96 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0010 0.0074 0.0439 0.1897 0.5580 1.0000 p  0.97 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0003 0.0027 0.0210 0.1198 0.4562 1.0000 p  0.98 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0006 0.0071 0.0599 0.3324 1.0000 p  0.99 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0010 0.0169 0.1821 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 p  0.06 0.2129 0.5527 0.8129 0.9402 0.9850 0.9969 0.9995 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.35 0.0000 0.0003 0.0021 0.0097 0.0320 0.0826 0.1734 0.3061 0.4668 0.6303 0.7712 0.8746 0.9396 0.9745 p  0.07 0.1630 0.4696 0.7466 0.9064 0.9726 0.9935 0.9987 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.40 0.0000 0.0001 0.0004 0.0024 0.0095 0.0294 0.0736 0.1536 0.2735 0.4246 0.5858 0.7323 0.8462 0.9222 p  0.08 0.1244 0.3947 0.6768 0.8649 0.9549 0.9877 0.9972 0.9995 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.45 0.0000 0.0000 0.0001 0.0005 0.0023 0.0086 0.0258 0.0639 0.1340 0.2424 0.3843 0.5426 0.6937 0.8173 p  0.09 0.0946 0.3286 
0.6063 0.8169 0.9314 0.9790 0.9946 0.9989 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.50 0.0000 0.0000 0.0000 0.0001 0.0005 0.0020 0.0073 0.0216 0.0539 0.1148 0.2122 0.3450 0.5000 0.6550 n  25 p  0.05 0.2774 0.6424 0.8729 0.9659 0.9928 0.9988 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.30 0.0001 0.0016 0.0090 0.0332 0.0905 0.1935 0.3407 0.5118 0.6769 0.8106 0.9022 0.9558 0.9825 0.9940 (continued ) 849 (394) 850 APPENDIX B 14 15 16 17 18 19 20 21 22 23 24 25 x 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 x 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.55 0.0000 0.0000 0.0000 0.0000 0.0001 0.0004 0.0016 0.0058 0.0174 0.0440 0.0960 0.1827 0.3063 0.4574 0.6157 0.7576 0.8660 0.9361 0.9742 0.9914 0.9977 0.9995 0.9999 1.0000 1.0000 1.0000 p  0.92 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0005 0.0028 0.0123 0.0451 0.1351 0.3232 0.6053 0.8756 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.60 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0012 0.0043 0.0132 0.0344 0.0778 0.1538 0.2677 0.4142 0.5754 0.7265 0.8464 0.9264 0.9706 0.9905 0.9976 0.9996 0.9999 1.0000 1.0000 p  0.93 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0013 0.0065 0.0274 0.0936 0.2534 0.5304 0.8370 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.65 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0008 0.0029 0.0093 0.0255 0.0604 0.1254 0.2288 0.3697 0.5332 0.6939 0.8266 0.9174 0.9680 0.9903 0.9979 0.9997 1.0000 1.0000 p  0.94 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0005 0.0031 0.0150 0.0598 0.1871 0.4473 0.7871 1.0000 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.70 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0005 0.0018 0.0060 0.0175 0.0442 0.0978 0.1894 0.3231 0.4882 0.6593 0.8065 0.9095 0.9668 0.9910 0.9984 0.9999 1.0000 p  0.95 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0012 0.0072 0.0341 0.1271 0.3576 0.7226 1.0000 0.9982 0.9995 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.75 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0009 0.0034 0.0107 0.0297 0.0713 0.1494 0.2735 0.4389 0.6217 0.7863 0.9038 0.9679 0.9930 0.9992 1.0000 p  0.96 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0004 0.0028 0.0165 0.0765 0.2642 0.6396 1.0000 0.9907 0.9971 0.9992 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.80 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0004 0.0015 0.0056 0.0173 0.0468 0.1091 0.2200 0.3833 0.5793 0.7660 0.9018 0.9726 0.9962 1.0000 p  0.97 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0008 0.0062 0.0380 0.1720 0.5330 1.0000 0.9656 0.9868 0.9957 
0.9988 0.9997 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.85 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0005 0.0021 0.0080 0.0255 0.0695 0.1615 0.3179 0.5289 0.7463 0.9069 0.9828 1.0000 p  0.98 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0014 0.0132 0.0886 0.3965 1.0000 0.9040 0.9560 0.9826 0.9942 0.9984 0.9996 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 p  0.90 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0005 0.0023 0.0095 0.0334 0.0980 0.2364 0.4629 0.7288 0.9282 1.0000 p  0.99 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0020 0.0258 0.2222 1.0000 0.7878 0.8852 0.9461 0.9784 0.9927 0.9980 0.9995 0.9999 1.0000 1.0000 1.0000 1.0000 p  0.91 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0011 0.0054 0.0210 0.0686 0.1831 0.3937 0.6714 0.9054 1.0000 p  1.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 (395) APPENDIX C X APPENDIX C P( x ≤ X ) = ∑ i =0 Cumulative Poisson Probability Distribution Table 851 (t)i e − t i! t x 0.005 0.9950 1.0000 1.0000 1.0000 0.01 0.9900 1.0000 1.0000 1.0000 0.02 0.9802 0.9998 1.0000 1.0000 0.03 0.9704 0.9996 1.0000 1.0000 0.04 0.9608 0.9992 1.0000 1.0000 0.05 0.9512 0.9988 1.0000 1.0000 0.06 0.9418 0.9983 1.0000 1.0000 0.07 0.9324 0.9977 0.9999 1.0000 0.08 0.9231 0.9970 0.9999 1.0000 0.09 0.9139 0.9962 0.9999 1.0000 x 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 0.9048 0.8187 0.7408 0.6703 0.6065 0.5488 0.4966 0.4493 0.4066 0.3679 0.9953 0.9998 1.0000 1.0000 1.0000 1.0000 1.0000 0.9825 0.9989 0.9999 1.0000 1.0000 1.0000 1.0000 0.9631 0.9964 0.9997 1.0000 1.0000 1.0000 1.0000 0.9384 0.9921 0.9992 0.9999 1.0000 1.0000 1.0000 0.9098 0.9856 0.9982 0.9998 1.0000 1.0000 1.0000 0.8781 0.9769 0.9966 0.9996 1.0000 1.0000 1.0000 0.8442 0.9659 0.9942 0.9992 0.9999 1.0000 1.0000 0.8088 0.9526 0.9909 0.9986 0.9998 1.0000 1.0000 0.7725 0.9371 0.9865 0.9977 0.9997 1.0000 1.0000 0.7358 0.9197 0.9810 0.9963 0.9994 0.9999 1.0000 x 1.10 0.3329 0.6990 0.9004 0.9743 0.9946 0.9990 0.9999 1.0000 1.0000 1.0000 1.20 0.3012 0.6626 0.8795 0.9662 0.9923 0.9985 0.9997 1.0000 1.0000 1.0000 1.30 0.2725 0.6268 0.8571 0.9569 0.9893 0.9978 0.9996 0.9999 1.0000 1.0000 1.40 0.2466 0.5918 0.8335 0.9463 0.9857 0.9968 0.9994 0.9999 1.0000 1.0000 1.50 0.2231 0.5578 0.8088 0.9344 0.9814 0.9955 0.9991 0.9998 1.0000 1.0000 1.60 0.2019 0.5249 0.7834 0.9212 0.9763 0.9940 0.9987 0.9997 1.0000 1.0000 1.70 0.1827 0.4932 0.7572 0.9068 0.9704 0.9920 0.9981 0.9996 0.9999 1.0000 1.80 0.1653 0.4628 0.7306 0.8913 0.9636 0.9896 0.9974 0.9994 0.9999 1.0000 1.90 0.1496 0.4337 0.7037 0.8747 0.9559 0.9868 0.9966 0.9992 0.9998 1.0000 2.00 0.1353 0.4060 0.6767 0.8571 0.9473 0.9834 0.9955 0.9989 0.9998 1.0000 x 10 11 12 2.10 0.1225 0.3796 0.6496 0.8386 0.9379 0.9796 0.9941 0.9985 0.9997 0.9999 1.0000 1.0000 1.0000 2.20 0.1108 0.3546 0.6227 0.8194 0.9275 0.9751 0.9925 0.9980 0.9995 0.9999 1.0000 1.0000 1.0000 2.30 0.1003 0.3309 0.5960 0.7993 0.9162 0.9700 0.9906 0.9974 0.9994 0.9999 1.0000 1.0000 1.0000 2.40 0.0907 0.3084 0.5697 0.7787 0.9041 0.9643 0.9884 0.9967 0.9991 
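Each entry in the Appendix B table is a cumulative binomial probability, and any value can be reproduced directly from the formula at the head of the appendix. A minimal sketch (Python 3.8+ standard library; the function name is illustrative):

from math import comb

def binomial_cdf(X, n, p):
    """P(x <= X) for a binomial random variable with parameters n and p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(X + 1))

# Spot-check against the n = 10 block of the table:
# for p = 0.30, the tabulated P(x <= 3) is 0.6496.
print(round(binomial_cdf(3, 10, 0.30), 4))  # prints 0.6496

The Poisson values in Appendix C, which follows, can be spot-checked the same way by replacing the summand with lt**i * exp(-lt) / factorial(i), where lt is the tabulated value of λt and exp and factorial come from the math module.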
0.9998 1.0000 1.0000 1.0000 2.50 0.0821 0.2873 0.5438 0.7576 0.8912 0.9580 0.9858 0.9958 0.9989 0.9997 0.9999 1.0000 1.0000 2.60 0.0743 0.2674 0.5184 0.7360 0.8774 0.9510 0.9828 0.9947 0.9985 0.9996 0.9999 1.0000 1.0000 2.70 0.0672 0.2487 0.4936 0.7141 0.8629 0.9433 0.9794 0.9934 0.9981 0.9995 0.9999 1.0000 1.0000 2.80 0.0608 0.2311 0.4695 0.6919 0.8477 0.9349 0.9756 0.9919 0.9976 0.9993 0.9998 1.0000 1.0000 2.90 0.0550 0.2146 0.4460 0.6696 0.8318 0.9258 0.9713 0.9901 0.9969 0.9991 0.9998 0.9999 1.0000 3.00 0.0498 0.1991 0.4232 0.6472 0.8153 0.9161 0.9665 0.9881 0.9962 0.9989 0.9997 0.9999 1.0000 t t t (continued ) (396) 852 APPENDIX C t x 10 11 12 13 14 3.10 0.0450 0.1847 0.4012 0.6248 0.7982 0.9057 0.9612 0.9858 0.9953 0.9986 0.9996 0.9999 1.0000 1.0000 1.0000 3.20 0.0408 0.1712 0.3799 0.6025 0.7806 0.8946 0.9554 0.9832 0.9943 0.9982 0.9995 0.9999 1.0000 1.0000 1.0000 3.30 0.0369 0.1586 0.3594 0.5803 0.7626 0.8829 0.9490 0.9802 0.9931 0.9978 0.9994 0.9998 1.0000 1.0000 1.0000 3.40 0.0334 0.1468 0.3397 0.5584 0.7442 0.8705 0.9421 0.9769 0.9917 0.9973 0.9992 0.9998 0.9999 1.0000 1.0000 3.50 0.0302 0.1359 0.3208 0.5366 0.7254 0.8576 0.9347 0.9733 0.9901 0.9967 0.9990 0.9997 0.9999 1.0000 1.0000 3.60 0.0273 0.1257 0.3027 0.5152 0.7064 0.8441 0.9267 0.9692 0.9883 0.9960 0.9987 0.9996 0.9999 1.0000 1.0000 3.70 0.0247 0.1162 0.2854 0.4942 0.6872 0.8301 0.9182 0.9648 0.9863 0.9952 0.9984 0.9995 0.9999 1.0000 1.0000 3.80 0.0224 0.1074 0.2689 0.4735 0.6678 0.8156 0.9091 0.9599 0.9840 0.9942 0.9981 0.9994 0.9998 1.0000 1.0000 3.90 0.0202 0.0992 0.2531 0.4532 0.6484 0.8006 0.8995 0.9546 0.9815 0.9931 0.9977 0.9993 0.9998 0.9999 1.0000 4.00 0.0183 0.0916 0.2381 0.4335 0.6288 0.7851 0.8893 0.9489 0.9786 0.9919 0.9972 0.9991 0.9997 0.9999 1.0000 t x 4.10 4.20 4.30 4.40 4.50 4.60 4.70 4.80 4.90 5.00 0.0166 0.0150 0.0136 0.0123 0.0111 0.0101 0.0091 0.0082 0.0074 0.0067 10 11 12 13 14 15 16 0.0845 0.2238 0.4142 0.6093 0.7693 0.8786 0.9427 0.9755 0.9905 0.9966 0.9989 0.9997 0.9999 1.0000 1.0000 1.0000 0.0780 0.2102 0.3954 0.5898 0.7531 0.8675 0.9361 0.9721 0.9889 0.9959 0.9986 0.9996 0.9999 1.0000 1.0000 1.0000 0.0719 0.1974 0.3772 0.5704 0.7367 0.8558 0.9290 0.9683 0.9871 0.9952 0.9983 0.9995 0.9998 1.0000 1.0000 1.0000 0.0663 0.1851 0.3594 0.5512 0.7199 0.8436 0.9214 0.9642 0.9851 0.9943 0.9980 0.9993 0.9998 0.9999 1.0000 1.0000 0.0611 0.1736 0.3423 0.5321 0.7029 0.8311 0.9134 0.9597 0.9829 0.9933 0.9976 0.9992 0.9997 0.9999 1.0000 1.0000 0.0563 0.1626 0.3257 0.5132 0.6858 0.8180 0.9049 0.9549 0.9805 0.9922 0.9971 0.9990 0.9997 0.9999 1.0000 1.0000 0.0518 0.1523 0.3097 0.4946 0.6684 0.8046 0.8960 0.9497 0.9778 0.9910 0.9966 0.9988 0.9996 0.9999 1.0000 1.0000 0.0477 0.1425 0.2942 0.4763 0.6510 0.7908 0.8867 0.9442 0.9749 0.9896 0.9960 0.9986 0.9995 0.9999 1.0000 1.0000 0.0439 0.1333 0.2793 0.4582 0.6335 0.7767 0.8769 0.9382 0.9717 0.9880 0.9953 0.9983 0.9994 0.9998 0.9999 1.0000 0.0404 0.1247 0.2650 0.4405 0.6160 0.7622 0.8666 0.9319 0.9682 0.9863 0.9945 0.9980 0.9993 0.9998 0.9999 1.0000 x 5.10 5.20 5.30 5.40 5.50 5.60 5.70 5.80 5.90 6.00 0.0061 0.0055 0.0050 0.0045 0.0041 0.0037 0.0033 0.0030 0.0027 0.0025 10 11 12 13 14 15 16 17 18 0.0372 0.1165 0.2513 0.4231 0.5984 0.7474 0.8560 0.9252 0.9644 0.9844 0.9937 0.9976 0.9992 0.9997 0.9999 1.0000 1.0000 1.0000 0.0342 0.1088 0.2381 0.4061 0.5809 0.7324 0.8449 0.9181 0.9603 0.9823 0.9927 0.9972 0.9990 0.9997 0.9999 1.0000 1.0000 1.0000 0.0314 0.1016 0.2254 0.3895 0.5635 0.7171 0.8335 0.9106 0.9559 0.9800 0.9916 0.9967 0.9988 0.9996 0.9999 1.0000 1.0000 1.0000 
0.0289 0.0948 0.2133 0.3733 0.5461 0.7017 0.8217 0.9027 0.9512 0.9775 0.9904 0.9962 0.9986 0.9995 0.9998 0.9999 1.0000 1.0000 0.0266 0.0884 0.2017 0.3575 0.5289 0.6860 0.8095 0.8944 0.9462 0.9747 0.9890 0.9955 0.9983 0.9994 0.9998 0.9999 1.0000 1.0000 0.0244 0.0824 0.1906 0.3422 0.5119 0.6703 0.7970 0.8857 0.9409 0.9718 0.9875 0.9949 0.9980 0.9993 0.9998 0.9999 1.0000 1.0000 0.0224 0.0768 0.1800 0.3272 0.4950 0.6544 0.7841 0.8766 0.9352 0.9686 0.9859 0.9941 0.9977 0.9991 0.9997 0.9999 1.0000 1.0000 0.0206 0.0715 0.1700 0.3127 0.4783 0.6384 0.7710 0.8672 0.9292 0.9651 0.9841 0.9932 0.9973 0.9990 0.9996 0.9999 1.0000 1.0000 0.0189 0.0666 0.1604 0.2987 0.4619 0.6224 0.7576 0.8574 0.9228 0.9614 0.9821 0.9922 0.9969 0.9988 0.9996 0.9999 1.0000 1.0000 0.0174 0.0620 0.1512 0.2851 0.4457 0.6063 0.7440 0.8472 0.9161 0.9574 0.9799 0.9912 0.9964 0.9986 0.9995 0.9998 0.9999 1.0000 t (397) APPENDIX C t x 6.10 6.20 6.30 6.40 6.50 6.60 6.70 6.80 6.90 7.00 0.0022 0.0020 0.0018 0.0017 0.0015 0.0014 0.0012 0.0011 0.0010 0.0009 10 11 12 13 14 15 16 17 18 19 20 0.0159 0.0577 0.1425 0.2719 0.4298 0.5902 0.7301 0.8367 0.9090 0.9531 0.9776 0.9900 0.9958 0.9984 0.9994 0.9998 0.9999 1.0000 1.0000 1.0000 0.0146 0.0536 0.1342 0.2592 0.4141 0.5742 0.7160 0.8259 0.9016 0.9486 0.9750 0.9887 0.9952 0.9981 0.9993 0.9997 0.9999 1.0000 1.0000 1.0000 0.0134 0.0498 0.1264 0.2469 0.3988 0.5582 0.7017 0.8148 0.8939 0.9437 0.9723 0.9873 0.9945 0.9978 0.9992 0.9997 0.9999 1.0000 1.0000 1.0000 0.0123 0.0463 0.1189 0.2351 0.3837 0.5423 0.6873 0.8033 0.8858 0.9386 0.9693 0.9857 0.9937 0.9974 0.9990 0.9996 0.9999 1.0000 1.0000 1.0000 0.0113 0.0430 0.1118 0.2237 0.3690 0.5265 0.6728 0.7916 0.8774 0.9332 0.9661 0.9840 0.9929 0.9970 0.9988 0.9996 0.9998 0.9999 1.0000 1.0000 0.0103 0.0400 0.1052 0.2127 0.3547 0.5108 0.6581 0.7796 0.8686 0.9274 0.9627 0.9821 0.9920 0.9966 0.9986 0.9995 0.9998 0.9999 1.0000 1.0000 0.0095 0.0371 0.0988 0.2022 0.3406 0.4953 0.6433 0.7673 0.8596 0.9214 0.9591 0.9801 0.9909 0.9961 0.9984 0.9994 0.9998 0.9999 1.0000 1.0000 0.0087 0.0344 0.0928 0.1920 0.3270 0.4799 0.6285 0.7548 0.8502 0.9151 0.9552 0.9779 0.9898 0.9956 0.9982 0.9993 0.9997 0.9999 1.0000 1.0000 0.0080 0.0320 0.0871 0.1823 0.3137 0.4647 0.6136 0.7420 0.8405 0.9084 0.9510 0.9755 0.9885 0.9950 0.9979 0.9992 0.9997 0.9999 1.0000 1.0000 0.0073 0.0296 0.0818 0.1730 0.3007 0.4497 0.5987 0.7291 0.8305 0.9015 0.9467 0.9730 0.9872 0.9943 0.9976 0.9990 0.9996 0.9999 1.0000 1.0000 7.60 0.0005 0.0043 0.0188 0.0554 0.1249 0.2307 0.3646 0.5100 0.6482 0.7649 0.8535 0.9148 0.9536 0.9762 0.9886 0.9948 0.9978 0.9991 0.9996 0.9999 1.0000 1.0000 7.70 0.0005 0.0039 0.0174 0.0518 0.1181 0.2203 0.3514 0.4956 0.6343 0.7531 0.8445 0.9085 0.9496 0.9739 0.9873 0.9941 0.9974 0.9989 0.9996 0.9998 0.9999 1.0000 7.80 0.0004 0.0036 0.0161 0.0485 0.1117 0.2103 0.3384 0.4812 0.6204 0.7411 0.8352 0.9020 0.9454 0.9714 0.9859 0.9934 0.9971 0.9988 0.9995 0.9998 0.9999 1.0000 7.90 0.0004 0.0033 0.0149 0.0453 0.1055 0.2006 0.3257 0.4670 0.6065 0.7290 0.8257 0.8952 0.9409 0.9687 0.9844 0.9926 0.9967 0.9986 0.9994 0.9998 0.9999 1.0000 8.00 0.0003 0.0030 0.0138 0.0424 0.0996 0.1912 0.3134 0.4530 0.5925 0.7166 0.8159 0.8881 0.9362 0.9658 0.9827 0.9918 0.9963 0.9984 0.9993 0.9997 0.9999 1.0000 t x 10 11 12 13 14 15 16 17 18 19 20 21 7.10 0.0008 0.0067 0.0275 0.0767 0.1641 0.2881 0.4349 0.5838 0.7160 0.8202 0.8942 0.9420 0.9703 0.9857 0.9935 0.9972 0.9989 0.9996 0.9998 0.9999 1.0000 1.0000 7.20 0.0007 0.0061 0.0255 0.0719 0.1555 0.2759 0.4204 0.5689 0.7027 0.8096 0.8867 0.9371 0.9673 0.9841 
0.9927 0.9969 0.9987 0.9995 0.9998 0.9999 1.0000 1.0000 7.30 0.0007 0.0056 0.0236 0.0674 0.1473 0.2640 0.4060 0.5541 0.6892 0.7988 0.8788 0.9319 0.9642 0.9824 0.9918 0.9964 0.9985 0.9994 0.9998 0.9999 1.0000 1.0000 7.40 0.0006 0.0051 0.0219 0.0632 0.1395 0.2526 0.3920 0.5393 0.6757 0.7877 0.8707 0.9265 0.9609 0.9805 0.9908 0.9959 0.9983 0.9993 0.9997 0.9999 1.0000 1.0000 7.50 0.0006 0.0047 0.0203 0.0591 0.1321 0.2414 0.3782 0.5246 0.6620 0.7764 0.8622 0.9208 0.9573 0.9784 0.9897 0.9954 0.9980 0.9992 0.9997 0.9999 1.0000 1.0000 (continued ) 853 (398) APPENDIX C 854 t x 10 11 12 13 14 15 16 17 18 19 20 21 22 23 8.10 0.0003 0.0028 0.0127 0.0396 0.0940 0.1822 0.3013 0.4391 0.5786 0.7041 0.8058 0.8807 0.9313 0.9628 0.9810 0.9908 0.9958 0.9982 0.9992 0.9997 0.9999 1.0000 1.0000 1.0000 8.20 0.0003 0.0025 0.0118 0.0370 0.0887 0.1736 0.2896 0.4254 0.5647 0.6915 0.7955 0.8731 0.9261 0.9595 0.9791 0.9898 0.9953 0.9979 0.9991 0.9997 0.9999 1.0000 1.0000 1.0000 8.30 0.0002 0.0023 0.0109 0.0346 0.0837 0.1653 0.2781 0.4119 0.5507 0.6788 0.7850 0.8652 0.9207 0.9561 0.9771 0.9887 0.9947 0.9977 0.9990 0.9996 0.9998 0.9999 1.0000 1.0000 8.40 0.0002 0.0021 0.0100 0.0323 0.0789 0.1573 0.2670 0.3987 0.5369 0.6659 0.7743 0.8571 0.9150 0.9524 0.9749 0.9875 0.9941 0.9973 0.9989 0.9995 0.9998 0.9999 1.0000 1.0000 8.50 0.0002 0.0019 0.0093 0.0301 0.0744 0.1496 0.2562 0.3856 0.5231 0.6530 0.7634 0.8487 0.9091 0.9486 0.9726 0.9862 0.9934 0.9970 0.9987 0.9995 0.9998 0.9999 1.0000 1.0000 8.60 0.0002 0.0018 0.0086 0.0281 0.0701 0.1422 0.2457 0.3728 0.5094 0.6400 0.7522 0.8400 0.9029 0.9445 0.9701 0.9848 0.9926 0.9966 0.9985 0.9994 0.9998 0.9999 1.0000 1.0000 8.70 0.0002 0.0016 0.0079 0.0262 0.0660 0.1352 0.2355 0.3602 0.4958 0.6269 0.7409 0.8311 0.8965 0.9403 0.9675 0.9832 0.9918 0.9962 0.9983 0.9993 0.9997 0.9999 1.0000 1.0000 8.80 0.0002 0.0015 0.0073 0.0244 0.0621 0.1284 0.2256 0.3478 0.4823 0.6137 0.7294 0.8220 0.8898 0.9358 0.9647 0.9816 0.9909 0.9957 0.9981 0.9992 0.9997 0.9999 1.0000 1.0000 8.90 0.0001 0.0014 0.0068 0.0228 0.0584 0.1219 0.2160 0.3357 0.4689 0.6006 0.7178 0.8126 0.8829 0.9311 0.9617 0.9798 0.9899 0.9952 0.9978 0.9991 0.9996 0.9998 0.9999 1.0000 9.00 0.0001 0.0012 0.0062 0.0212 0.0550 0.1157 0.2068 0.3239 0.4557 0.5874 0.7060 0.8030 0.8758 0.9261 0.9585 0.9780 0.9889 0.9947 0.9976 0.9989 0.9996 0.9998 0.9999 1.0000 x 9.10 9.20 9.30 9.40 9.50 9.60 9.70 9.80 9.90 10.00 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0000 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 0.0011 0.0058 0.0198 0.0517 0.1098 0.1978 0.3123 0.4426 0.5742 0.6941 0.7932 0.8684 0.9210 0.9552 0.9760 0.9878 0.9941 0.9973 0.9988 0.9995 0.9998 0.9999 1.0000 1.0000 0.0010 0.0053 0.0184 0.0486 0.1041 0.1892 0.3010 0.4296 0.5611 0.6820 0.7832 0.8607 0.9156 0.9517 0.9738 0.9865 0.9934 0.9969 0.9986 0.9994 0.9998 0.9999 1.0000 1.0000 0.0009 0.0049 0.0172 0.0456 0.0986 0.1808 0.2900 0.4168 0.5479 0.6699 0.7730 0.8529 0.9100 0.9480 0.9715 0.9852 0.9927 0.9966 0.9985 0.9993 0.9997 0.9999 1.0000 1.0000 0.0009 0.0045 0.0160 0.0429 0.0935 0.1727 0.2792 0.4042 0.5349 0.6576 0.7626 0.8448 0.9042 0.9441 0.9691 0.9838 0.9919 0.9962 0.9983 0.9992 0.9997 0.9999 1.0000 1.0000 0.0008 0.0042 0.0149 0.0403 0.0885 0.1649 0.2687 0.3918 0.5218 0.6453 0.7520 0.8364 0.8981 0.9400 0.9665 0.9823 0.9911 0.9957 0.9980 0.9991 0.9996 0.9999 0.9999 1.0000 0.0007 0.0038 0.0138 0.0378 0.0838 0.1574 0.2584 0.3796 0.5089 0.6329 0.7412 0.8279 0.8919 0.9357 0.9638 0.9806 0.9902 0.9952 0.9978 0.9990 0.9996 0.9998 0.9999 1.0000 0.0007 0.0035 0.0129 0.0355 
0.0793 0.1502 0.2485 0.3676 0.4960 0.6205 0.7303 0.8191 0.8853 0.9312 0.9609 0.9789 0.9892 0.9947 0.9975 0.9989 0.9995 0.9998 0.9999 1.0000 0.0006 0.0033 0.0120 0.0333 0.0750 0.1433 0.2388 0.3558 0.4832 0.6080 0.7193 0.8101 0.8786 0.9265 0.9579 0.9770 0.9881 0.9941 0.9972 0.9987 0.9995 0.9998 0.9999 1.0000 0.0005 0.0030 0.0111 0.0312 0.0710 0.1366 0.2294 0.3442 0.4705 0.5955 0.7081 0.8009 0.8716 0.9216 0.9546 0.9751 0.9870 0.9935 0.9969 0.9986 0.9994 0.9997 0.9999 1.0000 0.0005 0.0028 0.0103 0.0293 0.0671 0.1301 0.2202 0.3328 0.4579 0.5830 0.6968 0.7916 0.8645 0.9165 0.9513 0.9730 0.9857 0.9928 0.9965 0.9984 0.9993 0.9997 0.9999 1.0000 t (399) APPENDIX C t x 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 0.0002 0.0012 0.0049 0.0151 0.0375 0.0786 0.1432 0.2320 0.3405 0.4599 0.5793 0.6887 0.7813 0.8540 0.9074 0.9441 0.9678 0.9823 0.9907 0.9953 0.9977 0.9990 0.9995 0.9998 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0001 0.0005 0.0023 0.0076 0.0203 0.0458 0.0895 0.1550 0.2424 0.3472 0.4616 0.5760 0.6815 0.7720 0.8444 0.8987 0.9370 0.9626 0.9787 0.9884 0.9939 0.9970 0.9985 0.9993 0.9997 0.9999 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 0.0002 0.0011 0.0037 0.0107 0.0259 0.0540 0.0998 0.1658 0.2517 0.3532 0.4631 0.5730 0.6751 0.7636 0.8355 0.8905 0.9302 0.9573 0.9750 0.9859 0.9924 0.9960 0.9980 0.9990 0.9995 0.9998 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 0.0001 0.0005 0.0018 0.0055 0.0142 0.0316 0.0621 0.1094 0.1757 0.2600 0.3585 0.4644 0.5704 0.6694 0.7559 0.8272 0.8826 0.9235 0.9521 0.9712 0.9833 0.9907 0.9950 0.9974 0.9987 0.9994 0.9997 0.9999 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0002 0.0009 0.0028 0.0076 0.0180 0.0374 0.0699 0.1185 0.1848 0.2676 0.3632 0.4657 0.5681 0.6641 0.7489 0.8195 0.8752 0.9170 0.9469 0.9673 0.9805 0.9888 0.9938 0.9967 0.9983 0.9991 0.9996 0.9998 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0001 0.0004 0.0014 0.0040 0.0100 0.0220 0.0433 0.0774 0.1270 0.1931 0.2745 0.3675 0.4667 0.5660 0.6593 0.7423 0.8122 0.8682 0.9108 0.9418 0.9633 0.9777 0.9869 0.9925 0.9959 0.9978 0.9989 0.9994 0.9997 0.9999 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0002 0.0007 0.0021 0.0054 0.0126 0.0261 0.0491 0.0847 0.1350 0.2009 0.2808 0.3715 0.4677 0.5640 0.6550 0.7363 0.8055 0.8615 0.9047 0.9367 0.9594 0.9748 0.9848 0.9912 0.9950 0.9973 0.9986 0.9993 0.9996 0.9998 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0010 0.0029 0.0071 0.0154 0.0304 0.0549 0.0917 0.1426 0.2081 0.2867 0.3751 0.4686 0.5622 0.6509 0.7307 0.7991 0.8551 0.8989 0.9317 0.9554 0.9718 0.9827 0.9897 0.9941 0.9967 0.9982 0.9990 0.9995 0.9998 0.9999 0.9999 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0005 0.0015 0.0039 0.0089 0.0183 0.0347 0.0606 0.0984 0.1497 0.2148 0.2920 0.3784 0.4695 0.5606 0.6472 0.7255 0.7931 0.8490 0.8933 0.9269 0.9514 0.9687 0.9805 0.9882 0.9930 0.9960 0.9978 0.9988 0.9994 0.9997 0.9998 0.9999 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0008 0.0021 0.0050 0.0108 0.0214 0.0390 0.0661 0.1049 0.1565 0.2211 0.2970 0.3814 0.4703 0.5591 
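Any entry can be checked directly from the formula above. The following is a minimal Python sketch (the helper name poisson_cdf is ours, for illustration only) that reproduces the worked example:

```python
import math

def poisson_cdf(x, lam_t):
    """Cumulative Poisson probability P(X <= x) for mean lam_t = lambda*t."""
    return sum(lam_t ** i * math.exp(-lam_t) / math.factorial(i)
               for i in range(x + 1))

# Table entry for lambda*t = 2.00 and x = 4:
print(round(poisson_cdf(4, 2.0), 4))   # 0.9473
```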
APPENDIX D  Standard Normal Distribution Table

Each entry gives the area under the standard normal curve between 0 and z; for example, the area between 0 and z = 1.25 is 0.3944. Rows index z from 0.0 to 3.0 in steps of 0.1, and the columns supply the second decimal place of z, from 0.00 to 0.09.

APPENDIX E  Exponential Distribution Table

Values of $e^{-\lambda a}$ for $\lambda a$ from 0.00 to 10.00 in steps of 0.05; for example, at $\lambda a = 2.00$, $e^{-\lambda a} = 0.1353$.
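Both tables can be regenerated with Python's standard library alone. The sketch below relies on the identity that the area between 0 and z equals erf(z/√2)/2 (the helper name std_normal_area is illustrative):

```python
import math

def std_normal_area(z):
    """Area under the standard normal curve between 0 and z (a table entry)."""
    return 0.5 * math.erf(z / math.sqrt(2.0))

print(round(std_normal_area(1.25), 4))   # 0.3944, the worked example above

# Appendix E entries are simply e^(-lambda*a):
print(round(math.exp(-2.00), 4))         # 0.1353
```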
APPENDIX F  Values of t for Selected Probabilities

Entries are critical values of the t-distribution. For each column the header gives the confidence level (0.1 through 0.99) together with the corresponding one-tailed probability (0.45 down to 0.005) and two-tailed probability (0.9 down to 0.01). Rows index degrees of freedom: 1 through 30, then 40, 50, 60, 70, 80, 90, 100, 250, 500, and ∞. For example, with df = 10 and 0.05 in each tail (90% confidence), t = ±1.8125.
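Where software is at hand, these critical values come from the inverse CDF of the t-distribution. A minimal sketch, assuming SciPy is installed:

```python
from scipy import stats

# Worked example above: df = 10 with 0.05 in the upper tail.
t_crit = stats.t.ppf(1 - 0.05, df=10)
print(round(t_crit, 4))   # 1.8125
```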
APPENDIX G  Values of χ² for Selected Probabilities

Entries are critical values of the chi-square distribution. Columns give the probability (area) in the upper tail: 0.995, 0.99, 0.975, 0.95, 0.90, 0.10, 0.05, 0.025, 0.01, and 0.005; rows index degrees of freedom from 1 to 30. For example, with df = 5, an upper-tail area of 0.10 corresponds to χ² = 9.2363.
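The same lookup via software, again assuming SciPy is available:

```python
from scipy import stats

# Worked example above: df = 5 with an upper-tail area of 0.10.
chi2_crit = stats.chi2.ppf(1 - 0.10, df=5)
print(chi2_crit)   # 9.2363..., matching the table entry
```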
APPENDIX H  F-Distribution Tables

Three tables give critical values placing 5%, 2.5%, and 1% of the area, respectively, in the upper tail of the F-distribution. Columns index the numerator degrees of freedom D1 (1 through 20, then 24, 30, 40, 50, 100, 200, 300); rows index the denominator degrees of freedom D2 over the same range. For example, the upper 5% point for D1 = 5 and D2 = 10 is F = 3.326; the upper 2.5% point for D1 = 9 and D2 = 15 is F = 3.123; and the upper 1% point for D1 = 7 and D2 = 14 is F = 4.278.
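All three tables follow from the F-distribution's inverse CDF. A minimal SciPy sketch reproducing the three examples above:

```python
from scipy import stats

# One worked example per table (upper-tail areas 0.05, 0.025, and 0.01):
print(round(stats.f.ppf(0.95, dfn=5, dfd=10), 3))    # 3.326 (upper 5%)
print(round(stats.f.ppf(0.975, dfn=9, dfd=15), 3))   # 3.123 (upper 2.5%)
print(round(stats.f.ppf(0.99, dfn=7, dfd=14), 3))    # 4.278 (upper 1%)
```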
APPENDIX I  Critical Values of Hartley's Fmax Test

$F_{max} = \frac{s^{2}_{largest}}{s^{2}_{smallest}} \sim F_{max,\,1-\alpha}(c, v)$

Upper 5% and upper 1% points of the Fmax distribution are tabulated with c, the number of independent sample variances, across the columns (2 through 12) and v, the degrees of freedom of each variance, down the rows (2 through 10, then 12, 15, 20, 30, 60, and ∞). For example, the upper 5% point for c = 3 and v = 10 is 4.85.
Note: $s^{2}_{largest}$ is the largest and $s^{2}_{smallest}$ the smallest in a set of c independent mean squares, each based on v degrees of freedom.
Source: Reprinted from E. S. Pearson and H. O. Hartley, eds., Biometrika Tables for Statisticians, 3d ed. (New York: Cambridge University Press, 1966), by permission of the Biometrika Trustees.

APPENDIX J  Distribution of the Studentized Range (q-values)

Two tables give q-values for p = 0.95 and p = 0.99. Columns index D1 (2 through 20) and rows index D2 (1 through 20, then 24, 30, 40, 60, 120, and ∞). For example, with p = 0.95, D1 = 3, and D2 = 10, q = 3.88.
Note: D1 = K populations and D2 = N − K.
Source: Reprinted with permission from E. S. Pearson and H. O. Hartley, Biometrika Tables for Statisticians (New York: Cambridge University Press, 1954).
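Recent SciPy releases (1.7 or later, an assumption worth checking in your environment) expose the studentized range distribution directly, so the table can be reproduced in software:

```python
from scipy import stats

# p = 0.95 with D1 = k = 3 groups and D2 = 10 degrees of freedom:
q_crit = stats.studentized_range.ppf(0.95, 3, 10)   # arguments: p, k, df
print(round(q_crit, 2))   # 3.88
```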
APPENDIX K  Critical Values of r in the Runs Test

Two tables give critical numbers of runs, indexed by the two sample sizes n1 and n2 (each up to 20): (a) lower-tail values (too few runs) and (b) upper-tail values (too many runs).
Source: Adapted from Frieda S. Swed and C. Eisenhart, "Tables for testing randomness of grouping in a sequence of alternatives," Ann. Math. Statist. 14 (1943): 83–86, with the permission of the publisher.

APPENDIX L  Mann-Whitney U Test Probabilities (n < 9)

Entries are exact cumulative probabilities $P(U \le u)$ under the null hypothesis, tabulated for n2 = 3 through 8 and n1 = 1 up to n2, with U running down the rows; the final panel (n2 = 8) also lists the normal-approximation z-value beside each probability. For example, with n1 = 3 and n2 = 3, $P(U \le 1) = 0.100$.
Source: Reproduced from H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," Ann. Math. Statist. 18 (1947): 52–54, with the permission of the publisher.
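Because every assignment of ranks to the first sample is equally likely under the null hypothesis, these small-sample probabilities can be verified by brute-force enumeration. A sketch (mann_whitney_cdf is our illustrative name):

```python
from itertools import combinations

def mann_whitney_cdf(u, n1, n2):
    """Exact P(U <= u) under H0, by enumerating all rank assignments."""
    count = total = 0
    for sample1 in combinations(range(1, n1 + n2 + 1), n1):
        u_stat = sum(sample1) - n1 * (n1 + 1) / 2   # U from the rank sum R1
        total += 1
        if u_stat <= u:
            count += 1
    return count / total

# Table entry for n1 = 3, n2 = 3, U = 1:
print(round(mann_whitney_cdf(1, 3, 3), 3))   # 0.1
```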
016 028 048 075 111 155 210 274 345 421 500 579 U n1 10 11 12 13 14 15 16 17 18 n2  7 .125 250 375 500 625 .028 056 111 167 250 333 444 556 .008 017 033 058 092 133 192 258 333 417 500 583 .003 006 012 021 036 055 082 115 158 206 264 324 394 464 538 .001 003 005 009 015 024 037 053 074 101 134 172 216 265 319 378 438 500 562 .001 001 002 004 007 011 017 026 037 051 069 090 117 147 183 223 267 314 365 418 473 527 .000 001 001 002 003 006 009 013 019 027 036 049 064 082 104 130 159 191 228 267 310 355 402 451 500 549 n2  4 .200 400 600 .067 133 267 400 600 .028 057 114 200 314 429 571 .014 029 057 100 171 243 343 443 557 n2  6 .143 286 428 571 .036 071 143 214 321 429 571 .012 024 048 083 131 190 274 357 452 548 .005 010 019 033 057 086 129 176 238 305 381 457 545 .002 004 009 015 026 041 063 089 123 165 214 268 331 396 465 535 .001 002 004 008 013 021 032 047 066 090 120 155 197 242 294 350 409 469 531 (415) APPENDIX L U n1 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 871 n2  8 t Normal .111 222 333 444 556 .022 044 089 133 200 267 356 444 556 .006 012 024 042 067 097 139 188 248 315 387 461 539 .002 004 008 014 024 036 055 077 107 141 184 230 285 341 404 467 533 .001 002 003 005 009 015 023 033 047 064 085 111 142 177 217 262 311 362 416 472 528 .000 001 001 002 004 006 010 015 021 030 041 054 071 091 114 141 172 207 245 286 331 377 426 475 525 .000 000 001 001 002 003 005 007 010 014 020 027 036 047 060 076 095 116 140 168 198 232 268 306 347 389 433 478 522 .000 000 000 001 001 001 002 003 005 007 010 014 019 025 032 041 052 065 080 097 117 139 164 191 221 253 287 323 360 399 439 480 520 3.308 3.203 3.098 2.993 2.888 2.783 2.678 2.573 2.468 2.363 2.258 2.153 2.048 1.943 1.838 1.733 1.628 1.523 1.418 1.313 1.208 1.102 998 893 788 683 578 473 368 263 158 052 .001 001 001 001 002 003 004 005 007 009 012 016 020 026 033 041 052 064 078 094 113 135 159 185 215 247 282 318 356 396 437 481 Source: Reproduced from H B Mann and D R Whitney, “On a test of whether one of two random variables is stochastically larger than the other,” Ann Math Statist 18 (1947): 52–54, with the permission of the publisher (416) 872 APPENDIX M APPENDIX M Mann-Whitney U Test Critical Values (9 ≤ n ≤ 20) Critical Values of U for a One-Tailed Test at a  0.001 or for a Two-Tailed Test at a  0.002 Critical Values of U for a One-Tailed Test at a  0.01 or for a Two-Tailed Test at a  0.02 n1 n2 10 11 12 13 14 15 16 17 18 19 20 n1 n2 10 11 12 13 14 15 16 17 18 19 20 10 11 12 13 14 15 16 17 18 19 20 10 12 14 15 17 19 21 23 25 26 10 12 14 17 19 21 23 25 27 29 32 10 12 15 17 20 22 24 27 29 32 34 37 12 14 17 20 23 25 28 31 34 37 40 42 11 14 17 20 23 26 29 32 35 38 42 45 48 12 15 19 22 25 29 32 36 39 43 46 50 54 10 14 17 21 24 28 32 36 40 43 47 51 55 59 11 15 19 23 27 31 35 39 43 48 52 56 60 65 13 17 21 25 29 34 38 43 47 52 57 61 66 70 10 14 18 23 27 32 37 42 46 51 56 61 66 71 76 11 15 20 25 29 34 40 45 50 55 60 66 71 77 82 12 16 21 26 32 37 42 48 54 59 65 70 76 82 88 10 11 12 13 14 15 16 17 18 19 20 11 14 16 18 21 23 26 28 31 33 36 38 40 11 13 16 19 22 24 27 30 33 36 38 41 44 47 12 15 18 22 25 28 31 34 37 41 44 47 50 53 11 14 17 21 24 28 31 35 38 42 46 49 53 56 60 12 16 20 23 27 31 35 39 43 47 51 55 59 63 67 10 13 17 22 26 30 34 38 43 47 51 56 60 65 69 73 11 15 19 24 28 33 37 42 47 51 56 61 66 70 75 80 12 16 21 26 31 36 41 46 51 56 61 66 71 76 82 87 13 18 23 28 33 38 44 49 55 60 66 71 77 82 88 93 14 19 24 30 36 41 47 53 59 65 70 76 82 88 94 100 15 20 26 32 38 44 50 56 63 69 75 82 88 94 101 107 10 16 22 28 
Source: Adapted and abridged from Tables 1, 3, 5, and 7 of D. Auble, "Extended tables for the Mann-Whitney statistic," Bulletin of the Institute of Educational Research at Indiana University 1, No. 2 (1953), with the permission of the publisher.

APPENDIX N  Critical Values of T in the Wilcoxon Matched-Pairs Signed-Ranks Test (n ≤ 25)

Entries are critical values of T for sample sizes up to n = 25 pairs, with columns for one-tailed levels of significance 0.025, 0.01, and 0.005 (two-tailed levels 0.05, 0.02, and 0.01). For example, with n = 25 the critical values are 89, 77, and 68, respectively.
Source: Adapted from a table in F. Wilcoxon, Some Rapid Approximate Statistical Procedures (New York: American Cyanamid Company, 1949), 13, with the permission of the publisher.
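The T statistic compared against this table is easy to compute from paired data. The sketch below assumes no tied absolute differences (ties would require average ranks) and uses made-up numbers purely for illustration:

```python
def wilcoxon_T(x, y):
    """Smaller of the positive and negative rank sums of the differences."""
    d = [a - b for a, b in zip(x, y) if a != b]          # drop zero differences
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0] * len(d)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank                                  # rank 1 = smallest |d|
    t_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    t_minus = len(d) * (len(d) + 1) // 2 - t_plus        # ranks sum to n(n+1)/2
    return min(t_plus, t_minus)

before = [12.3, 11.7, 9.8, 13.1, 10.6, 12.0, 11.2, 10.1]
after = [12.0, 11.8, 9.2, 12.2, 10.2, 11.3, 11.4, 9.6]
# Reject H0 at the chosen alpha if this T is <= the tabled critical value:
print(wilcoxon_T(before, after))   # T = 3 for these illustrative pairs
```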
1.24 1.26 1.27 1.28 1.29 1.31 1.32 1.33 1.34 1.38 1.42 1.45 1.48 1.50 1.52 1.54 1.56 1.57 1.59 1.60 1.61 1.75 1.73 1.71 1.69 1.68 1.68 1.67 1.66 1.66 1.66 1.66 1.65 1.65 1.65 1.65 1.65 1.65 1.65 1.65 1.65 1.65 1.65 1.66 1.66 1.66 1.66 1.67 1.67 1.68 1.69 1.70 1.70 1.71 1.72 1.72 1.73 1.73 1.74 .69 74 78 82 86 90 93 96 99 1.01 1.04 1.06 1.08 1.10 1.12 1.14 1.16 1.18 1.19 1.21 1.22 1.24 1.25 1.26 1.27 1.29 1.34 1.38 1.41 1.44 1.47 1.49 1.51 1.53 1.55 1.57 1.58 1.59 1.97 1.93 1.90 1.87 1.85 1.83 1.81 1.80 1.79 1.78 1.77 1.76 1.76 1.75 1.74 1.74 1.74 1.73 1.73 1.73 1.73 1.73 1.72 1.72 1.72 1.72 1.72 1.72 1.72 1.73 1.73 1.74 1.74 1.74 1.75 1.75 1.75 1.76 .56 62 67 71 75 79 83 86 90 93 95 98 1.01 1.03 1.05 1.07 1.09 1.11 1.13 1.15 1.16 1.18 1.19 1.21 1.22 1.23 1.29 1.34 1.38 1.41 1.44 1.46 1.49 1.51 1.52 1.54 1.56 1.57 2.21 2.15 2.10 2.06 2.02 1.99 1.96 1.94 1.92 1.90 1.89 1.88 1.86 1.85 1.84 1.83 1.83 1.82 1.81 1.81 1.80 1.80 1.80 1.79 1.79 1.79 1.78 1.77 1.77 1.77 1.77 1.77 1.77 1.77 1.77 1.78 1.78 1.78 (continued ) n  Number of observations; P  Number of independent variables Source: This table is reproduced from Biometrika 41 (1951): 173 and 175, with the permission of the Biometrika Trustees (420) 876 APPENDIX O a  01 P1 P2 P3 P4 P5 n dL dU dL dU dL dU dL dU dL dU 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 45 50 55 60 65 70 75 80 85 90 95 100 .81 84 87 90 93 95 97 1.00 1.02 1.04 1.05 1.07 1.09 1.10 1.12 1.13 1.15 1.16 1.17 1.18 1.19 1.21 1.22 1.23 1.24 1.25 1.29 1.32 1.36 1.38 1.41 1.43 1.45 1.47 1.48 1.50 1.51 1.52 1.07 1.09 1.10 1.12 1.13 1.15 1.16 1.17 1.19 1.20 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28 1.29 1.30 1.31 1.32 1.32 1.33 1.34 1.34 1.38 1.40 1.43 1.45 1.47 1.49 1.50 1.52 1.53 1.54 1.55 1.56 .70 74 77 80 83 86 89 91 94 96 98 1.00 1.02 1.04 1.05 1.07 1.08 1.10 1.11 1.13 1.14 1.15 1.16 1.18 1.19 1.20 1.24 1.28 1.32 1.35 1.38 1.40 1.42 1.44 1.46 1.47 1.49 1.50 1.25 1.25 1.25 1.26 1.26 1.27 1.27 1.28 1.29 1.30 1.30 1.31 1.32 1.32 1.33 1.34 1.34 1.35 1.36 1.36 1.37 1.38 1.38 1.39 1.39 1.40 1.42 1.45 1.47 1.48 1.50 1.52 1.53 1.54 1.55 1.56 1.57 1.58 .59 63 67 71 74 77 80 83 86 88 90 93 95 97 99 1.01 1.02 1.04 1.05 1.07 1.08 1.10 1.11 1.12 1.14 1.15 1.20 1.24 1.28 1.32 1.35 1.37 1.39 1.42 1.43 1.45 1.47 1.48 1.46 1.44 1.43 1.42 1.41 1.41 1.41 1.40 1.40 1.41 1.41 1.41 1.41 1.41 1.42 1.42 1.42 1.43 1.43 1.43 1.44 1.44 1.45 1.45 1.45 1.46 1.48 1.49 1.51 1.52 1.53 1.55 1.56 1.57 1.58 1.59 1.60 1.60 .49 53 57 61 65 68 72 75 77 80 83 85 88 90 92 94 96 98 1.00 1.01 1.03 1.04 1.06 1.07 1.09 1.10 1.16 1.20 1.25 1.28 1.31 1.34 1.37 1.39 1.41 1.43 1.45 1.46 1.70 1.66 1.63 1.60 1.58 1.57 1.55 1.54 1.53 1.53 1.52 1.52 1.51 1.51 1.51 1.51 1.51 1.51 1.51 1.51 1.51 1.51 1.51 1.52 1.52 1.52 1.53 1.54 1.55 1.56 1.57 1.58 1.59 1.60 1.60 1.61 1.62 1.63 .39 44 48 52 56 60 63 66 70 72 75 78 81 83 85 88 90 92 94 95 97 99 1.00 1.02 1.03 1.05 1.11 1.16 1.21 1.25 1.28 1.31 1.34 1.36 1.39 1.41 1.42 1.44 1.96 1.90 1.85 1.80 1.77 1.74 1.71 1.69 1.67 1.66 1.65 1.64 1.63 1.62 1.61 1.61 1.60 1.60 1.59 1.59 1.59 1.59 1.59 1.58 1.58 1.58 1.58 1.59 1.59 1.60 1.61 1.61 1.62 1.62 1.63 1.64 1.64 1.65 n  Number of observations; P  Number of independent variables Source: This table is reproduced from Biometrika 41 (1951): 173 and 175, with the permission of the Biometrika Trustees (421) APPENDIX P One-Tailed: a  05 Two-Tailed: a  10 APPENDIX P Lower and Upper Critical Values W of Wilcoxon Signed-Ranks Test n 10 11 12 13 14 15 16 17 18 19 20 a  025 a  05 a  01 a  02 a  005 a  01 —,— —,— 0,28 
1,35 3,42 5,50 7,59 10,68 12,79 16,89 19,101 23,113 27,126 32,139 37,153 43,167 —,— —,— —,— 0,36 1,44 3,52 5,61 7,71 10,81 13,92 16,104 19,117 23,130 27,144 32,158 37,173 (Lower, Upper) 0,15 2,19 3,25 5,31 8,37 10,45 13,53 17,61 21,70 25,80 30,90 35,101 41,112 47,124 53,137 60,150 —,— 0,21 2,26 3,33 5,40 8,47 10,56 13,65 17,74 21,84 25,95 29,107 34,119 40,131 46,144 52,158 Source: Adapted from Table of F Wilcoxon and R A Wilcox, Some Rapid Approximate Statistical Procedures (Pearl River, NY: Lederle Laboratories, 1964), with permission of the American Cyanamid Company 877 (422) 878 APPENDIX Q APPENDIX Q Control Chart Factors Number of Observations in Subgroup d2 d3 D3 D4 A2 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 1.128 1.693 2.059 2.326 2.534 2.704 2.847 2.970 3.078 3.173 3.258 3.336 3.407 3.472 3.532 3.588 3.640 3.689 3.735 3.778 3.819 3.858 3.895 3.931 0.853 0.888 0.880 0.864 0.848 0.833 0.820 0.808 0.797 0.787 0.778 0.770 0.763 0.756 0.750 0.744 0.739 0.733 0.729 0.724 0.720 0.716 0.712 0.708 0 0 0.076 0.136 0.184 0.223 0.256 0.283 0.307 0.328 0.347 0.363 0.378 0.391 0.404 0.415 0.425 0.435 0.443 0.452 0.459 3.267 2.575 2.282 2.114 2.004 1.924 1.864 1.816 1.777 1.744 1.717 1.693 1.672 1.653 1.637 1.622 1.609 1.596 1.585 1.575 1.565 1.557 1.548 1.541 1.880 1.023 0.729 0.577 0.483 0.419 0.373 0.337 0.308 0.285 0.266 0.249 0.235 0.223 0.212 0.203 0.194 0.187 0.180 0.173 0.167 0.162 0.157 0.153 Source: Reprinted from ASTM-STP 15D by kind permission of the American Society for Testing and Materials (423) Answers to Selected Odd-Numbered Problems This section contains summary answers to most of the odd-numbered problems in the text The Student Solutions Manual contains fully developed solutions to all odd-numbered problems and shows clearly how each answer is determined Chapter Chapter 1-1 Descriptive; use charts, graphs, tables, and numerical measures 1-3 A bar chart is used whenever you want to display data that has already been categorized, while a histogram is used to display data over a range of values for the factor under consideration 1-5 Hypothesis testing uses statistical techniques to validate a claim 1-13 statistical inference, particularly estimation 1-17 written survey or telephone survey 1-19 An experiment is any process that generates data as its outcome 1-23 internal and external validity 1-27 Advantages—low cost, speed of delivery, instant updating of data analysis; disadvantages—low response and potential confusion about questions 1-29 personal observation data gathering 1-33 Part range  1-37 1-41 1-43 1-49 1-51 1-53 1-55 1-61 1-67 Population size Sample size  18, 000 100  180 Thus, the first person selected will come from employees 1–180 Once that person is randomly selected, the second person will be the one numbered 100 higher than the first, and so on The census would consist of all items produced on the line in a defined period of time parameters, since it would include all U.S colleges a stratified random sampling b simple random sampling or possibly cluster random sampling c systematic random sampling d stratified random sampling a time-series b cross-sectional c time-series d cross-sectional a ordinal—categories with defined order b nominal—categories with no defined order c ratio d nominal—categories with no defined order ordinal data a nominal data b ratio data c nominal data d ratio data e ratio data f nominal data g ratio data interval or ratio data a Use a random sample or systematic random sample b The product is going to be ruined after testing it You 
would not want to ruin the entire product that comes off the assembly line 2-3 a 2k (424) n or 210  1,024 (425) 1,000 Thus, use k  10 classes b w  High  Low Classes  2,900  300 10  2,600 10  260 (round to 300) 2-5 a 2.833, which is rounded to b Divide the number of occurrences (frequency) in each class by the total number of occurrences c Compute a running sum for each class by adding the frequency for that class to the frequencies for all classes above it d Classes form the horizontal axis, and the frequency forms the vertical axis Bars corresponding to the frequency of each class are developed 2-7 a  0.24  0.76 b 0.56  0.08  0.48 c 0.96  0.08  0.86 2-9 a Class Frequency Relative Frequency 2–3 0.0333 4–5 25 0.4167 6–7 26 0.4333 8–9 0.1000 10–11 0.0167 b cumulative frequencies: 2; 27; 53; 59; 60 c cumulative relative frequencies: 0.0333; 0.4500; 0.8833; 0.9833; 1.000 d ogive 2-13 a The weights are sorted from smallest to largest to create the data array b Weight (Classes) Frequency 77–81 82–86 87–91 16 92–96 16 97–101 Total = 49 c The histogram can be created from the frequency distribution d 10.20% Largest  smallest 214.4 105.0  9.945 → w  10  11 number of classes b of the 25, or 0.32 of the salaries at least $175,000 c 18 of the 25, or 0.72 having salaries that are at most $205,000 but at least $135,000 2-15 a w  879 (426) 880 ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 2-19 a classes High  Low 32 10 22 b w     2.44 (round up p to 3.0) Classes 9 c The frequency distribution with nine classes and a class width of 3.0 will depend on the starting point for the first class This starting value must be at or below the minimum value of 10 d The distribution is mound shaped and fairly symmetrical It appears that the center is between 19 and 22 rounds per year, but the rounds played are quite spread out around the center 2-21 a 25  32 and 26  64 Therefore, classes are chosen b The class width is 690,189/6 classes = 115,031.5 Rounding up to the nearest 1,000 passengers results in a class width of 116,000 c Classes Frequency 116,000 232,000 347,000 347,000 462,000 462,000 577,000 577,000 692,000 2-43 a 50 $6,398 2-67 2-73 3-1 Q1 = 4,423; Median = 5,002; Q3 = 5,381 3-3 Q1 Q3   24.28 University Related 2-55 2-61 Chapter d More airlines have fewer than 116,000 passengers 2-23 a Order the observations (coffee consumption) from smallest to largest b Using the 2k  n guideline, the number of classes, k, would be 0.9 and w  (10.1  3.5)/8  0.821, which is rounded to Most observations fall in the class of 5.3–7.9 kg of coffee c The histogram can be created from the frequency distribution The classes are shown on the horizontal axis and the frequency on the vertical axis 2-29 a The pie chart categories are the regions and the measure is the region’s percentage of total income b The bar chart categories are the regions and the measure for each category is the region’s percentage of total income c The bar chart, however, makes it easier to compare percentages across regions 2-31 b   0.0985  0.9015 2-33 The bar chart is skewed below indicating that the number of $1 million houses is growing rapidly It also appears that that growth is exponential rather than linear 2-35 A bar chart can be used to make the comparison 2-37 a The stem unit is 10 and the leaf unit is b between 70 and 79 seconds 2-41 a Leaf unit  1.0 b Slightly skewed to the left The center is between 24 and 26 2, 428 2-51 29 116,000 232,000 c x  2-47 2-49 c A pie chart showing how that total is divided among the four hospital types would not be useful 
or appropriate The sales have trended upward over the past 12 months The line chart shows that year-end deposits have been increasing since 1997, but have increased more sharply since 2002 and have leveled off between 2006 and 2007 b curvilinear c The largest difference in sales occurred between 2006 and 2007 That difference was 14.835  10.711  4.124 ($billions) positive linear relationship b Both relationships seem to be linear in nature c This occurred in 1998, 1999, and 2001 b It appears that there is a positive linear relationship between the attendance and the year a The independent variable is hours and the dependent variable is sales b It appears that there is a positive linear relationship Municipally Owned Privately Held $3,591 $4,613 $5,191 b The largest average charges occurs at university-related hospitals and the lowest average appears to be in Religious Affiliated hospitals 15.5 15.9  13.55  15.7 (31.2  32.2)/2 = 31.7 (26.7  31.2)/2  28.95 (20.8  22.8)/2  21.8 Mean  19; Median  (19  19)/2  19; Mode  19 Symmetrical; Mean  Median  Mode 11,213.48 Use weighted average Mean  114.21; Median  107.50; Mode  156 skewed right 562.99 551.685 562.90 FDIC  768,351,603.83 Bank of America average  113,595,000 b 768,351,603.83/113,595,000  6.76 3-19 Mean  0.33 minutes; Median  0.31 minutes; Mode  0.24 minutes; slightly right skewed; 80th percentile  0.40 minutes 3-7 a b c 3-9 a b 3-11 a b 3-13 a b 3-15 a b c 3-17 a 3-21 a x  3-25 3-27 Religious Affiliated 13.5 13.6 3-29 3-31 2, 448.30 20  122.465; Median  (123.2 + 123.7)  123.45; left-skewed b 0.925 c 126.36 d weighted average a Range  -  b 3.99 c 1.998 a 16.87 b 4.11 Standard deviation  2.8 a Standard deviation  7.21; IQR  1212 12 b Range  Largest  Smallest  30 -  24; Standard deviation  6.96; IQR  12 c s2 is smaller than s2 by a factor of ( N 1)/N s is smaller than s by a factor of affected ( N 1) / N The range is not (427) ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 3-33 a The variance is 815.79 and the standard deviation is 28.56 b Interquartile range overcomes the susceptibility of the range to being highly influenced by extreme values 3-35 a Range  33 - 21  12 n ∑ xi x  i =1 n  261/10  26.1 n S2 = S= b 3-37 a b c 3-39 a b ∑(x − x ) i =1 n −1 3-55 a x = b 51 22.60, 51 2(22.60), 51 3(22.60), i.e., (28.4, 73.6), (5.8, 96.2), and (16.8, 118.8) There are (19/30)100%  63.3% of the data within (28.4, 73.6), (30/30)100%  100% of the data within (5.8, 96.2), (30/30)100%  100% of the data within (16.8, 118.8) c bell-shaped population 3-57 a Mean S = 16.5444 = 4.0675 c s1  1.000 and s2  0.629 3-41 a Men spent an average of $117, whereas women spent an average of $98 for their phones The standard deviation for men was nearly twice that for women b Business users spent an average of $166.67 on their phone, whereas home users spent an average of $105.74 The variation in phone costs for the two groups was about equal 3-43 a The population mean is ∑x m= = $178, 465 N b The population median is ~ $173,000 m c The range is: R  High  Low R  $361,100  $54,100  $307,000 d The population standard deviation is ∑ (x − ) = $63,172 N 3-47 a at least 75% in the range 2,600 to 3,400; m 2(s) b The range 2,400 to 3,600 should contain at least 89% of the data values c less than 11% 3-49 a 25.008 b CV  23.55% c The range from 31.19 to 181.24 should contain at least 89% of the data values s 100 (100 ) = 20% 3-51 For Distribution A: CV = (100 ) = m 500 s 4.0 For Distribution B: CV = (100 ) = (100 ) = 40% 10.0  800 − x 800 − 1, 000 = = − 0.80 s 250 b z  0.80 c z  0.00 
3-53 a z = Drug A Drug B 234.75 270.92 13.92 19.90 Standard Deviation b Based on the sample means of the time each drug is effective, Drug B appears to be effective longer than Drug A c Based on the standard deviation of effect time, Drug B exhibits a higher variability in effect time than Drug A d Drug A, CV  5.93%; Drug B, CV  7.35% Drug B has a higher coefficient of variation and the greater relative spread 0.078 3-59 Existing supplier: CV = (100 ) = 2.08% 3.75 New supplier: CV = 0.135 (100 ) = 0.75% 18.029 3-61 Anyone scoring below 61.86 (rounded to 62) will be rejected without an interview Anyone scoring higher than 91.98 (rounded to 92) will be sent directly to the company 3-63 CV = 3, 083.45 (100 ) = 27.67% 11,144.48 At least 75% of CPA firms will compute a tax owed between $4,977.58 ————— $17,311.38 3-65 a Varibale Mean StDev Variance Scores 94.780 4.130 17.056 s= 1, 530 = 51; Variance = 510.76; Standard deviatiion = 22.60 30 = 148.9 / (10 − 1) = 16.5444; Interquartile range  28 - 23  Ages are lower at Whitworth than for the U.S colleges and universities as a group The range is 113.0, the IQR is 62.25, the variance is 1,217.14, and the standard deviation is 34.89 No Adding a constant to all the data values leaves the variance unchanged 2004: Mean  0.422; Variance  0.999; Standard deviation  1.000; IQR  1.7 2005: Mean  1.075; Variance  0.396; Standard deviation  0.629; IQR  0.75 x  1.075 and x1 0.422 881 3-73 3-75 3-77 3-81 Q1 Median Q3 IQR 93.000 96.000 98.000 5.000 b Tchebysheff’s Theorem would be preferable c 99 The mode is a useful measure of location of a set of data if the data set is large and involves nominal or ordinal data a 0.34 b 0.34 c 0.16 a 364.42 b Variance  16,662.63; Standard deviation  129.08 a Comparing only the mean bushels/acre you would say that Seed Type C produces the greatest average yield per acre b CV of Seed Type A  25/88  0.2841 or 28.41% CV of Seed Type B  15/56  0.2679 or 26.79% CV of Seed Type C  16/100  0.1600 or 16% Seed Type C shows the least relative variability c Seed Type A: 68% between 63 to 113 95% between 38 to 138 approximately 100% between 13 to 163 Seed Type B: 68% between 41 to 71 95% between 26 to 86 approximately 100% between 11 to 101 (428) 882 ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS Seed Type C: 68% between 84 to 116 95% between 68 to 132 approximately 100% between 52 to 148 d Seed Type A e Seed type C 3-87 a Variable Mean StDev Price 22.000 -0.0354 0.2615 -0.0600 b It means that the closing price for GE stock is an average of approximately four ($0.0354) cents lower than the opening price c Variable Mean StDev Median Open 33.947 0.503 33.980 Close-Open -0.0354 0.2615 -0.0600 Chapter 4-1 independent events 4-3 V, V V, C V, S C, V C, C C, S S, V S, C S, S 4-5 a subjective probability based on expert opinion b relative frequency based on previous customer return history c 1/5  0.20 4-7 1/3  0.333333 4-9 a P(Brown)  # Brown/Total  310/982  0.3157 b P(YZ-99)  # YZ-99/Total  375/982  0.3819 c P(YZ-99 and Brown)  205/982  0.2088 d not mutually exclusive since their joint probability is 0.1324 4-11 0.375 4-15 Type of Ad Occurrences Help Wanted Ad Real Estate Ad Other Ad Total Electrical Mechanical Total 28 64 92 39 69 108 67 133 200 Lincoln Tyler Total 3.813 b x 1s  22 (3.813)  (18.187, 25.813); x 2s  (14.374, 29.626); x 3s  (10.561, 33.439) c The Empirical Rule indicates that 95% of the data is contained within x 2s This would mean that each tail has (1  0.95)/2  0.025 of the data Therefore, the costume should be priced at $14.37 3-89 a 
Variable Mean StDev Median Close-Open 4-23 The following joint frequency table (developed using Excel’s pivot table feature) summarizes the data 204 520 306 1,030 a b c 4-17 a 0.1981 relative frequency yes relative frequency assessment method 4, 000  0.69 b P(# 1)  5, 900 4-19 a 3,122 / 21, 768  0.1434 b relative frequency assessment method 4-21 a P (Caesarean)  22  0.44 50 b New births may not exactly match the 50 in this study a 133 200  0.665 b 108 200  0.54 c 28 200  0.14 4-25 a b 43  0.43 100 56 6 = 0.17 100 c For Pepsi, Probability  For Coke, Probability  56 6 17   0.486 12 12 11 35 6 6 18   0.514 12 12 11 35 d For Pepsi, Probability   8  26   0.4 19 16 14 16 65 For Coke, Probability  12 10  11 39   0.6 19 16 14 16 65 4-27 a (0.9)(1  0.5)  0.45 b (0.6)(0.8)  0.48 ⎛ ⎞ ⎛ ⎞ 20  0.22 4-29 P(senior1 and senior2)  ⎜ ⎟ ⎜ ⎟  ⎝ 10 ⎠ ⎝ ⎠ 90 4-31 a P(E1 and B)  P(E1|B)P(B)  0.25(0.30)  0.075 b P(E1 or B)  P(E1)  P(B)  P(E1 and B)  0.35  0.30  0.075  0.575 c P(E1 and E2 and E3)  P(E1)P(E2) P(E3)  0.35(0.15)(0.40)  0.021 4-33 a P ( B)  Number of drives from B 195   0.28 Total drives 700 b P(Defect )  50 Number of defective drives   0.07 700 Total drives c P (Defect B)  P ( Defect and B ) 0.02   0.076 P ( B) 0.28 Number of defective drives from B Number of drives from B 15   0.076 195 P (Defect B)  4-35 a 0.61; 0.316 b 0.42; 0.202; 0.518 c 0.39 4-37 They cannot get to 99.9% on color copies 4-39 P(Free gas)  0.00015  0.00585  0.0005  0.0065 4-40 a 0.1667 b 0.0278 c 0.6944 d (1/6)(5/6)  (5/6)(1/6)  0.2778 (429) ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 4-41 a b c d P(NFL)  105/200  0.5250 P(College degree and NBA)  40/200  0.20 10/50  0.20 The two events are not independent 4-43 P (Line | Defective)  (0.05)(0.4)/0.0725  0.2759 P (Line | Defective)  (0.10)(0.35)/0.0725  0.4828 P (Line | Defective)  (0.07)(0.25)/0.0725  0.2413 The unsealed cans probably came from Line 4-45 P (Supplier A | Defective)  (0.15)(0.3)/0.115  0.3913 P (Supplier B | Defective)  (0.10)(0.7)/0.115  0.6087 Supplier B is the most likely to have supplied the defective parts 4-47 a P(E1 and E 2)  P(E1| E2)P(E2)  0.508(0.607)  0.308 b P(E1 and E 3)  P(E1| E3)  0.607/0.853  0.712 4-49 a b c d 4-51 a b c d 4-53 a b c 4-55 a b c 4-61 a b 0.76 0.988 0.024 0.9999 0.1856 0.50 0.0323 0.3653 0.119 0.148 0.3814 0.50 0.755 0.269 0.80; 0.40; 0.20; 0.60 A B and A B are complements 4-63 a 0.0156 b 0.1563 c 0.50 4-65 a 0.149 b 0.997 c 0.449 4-67 a the relative frequency assessment approach b 0.028 c 0.349 d yes 4-69 Clerk is most likely responsible for the boxes that raised the complaints 100 4-71 a  0.33 300 30 b  0.10 300 c P(East or C)  P(East)  P(C)  P(East and C)  0.25  0.333  0.103  0.48 d P (C East )  P (C and East)/P (East)  0.103 / 0.25  0.41 4-73 a 0.3333 b Boise will get 70.91% of the cost, Salt Lake will get 21.82%, and Toronto will get 7.27% regardless of production volume 5-5 3.7 days 5-7 a 130 b 412.50 c 412.50  20.31 5-9 a b c d e 5-11 a b 5-13 a s  1.4931  1.22 1.65 days to 4.09 days $58,300 $57,480 Small firm profits  $135,000 Mid-sized profits  $155,000 Large firm profits  $160,000 b Small firm: s  $30,000 Mid-sized firm: s  $90,000 Large firm: s  $156,604.60 c The large firm has the largest expected profit 5-21 a x P(x) x P(x) b c 5-15 a b 5-17 a 5-23 5-25 5-27 5-29 5-31 5-33 Chapter 5-1 a b 5-3 a b discrete random variable The possible values for x are x  {0, 1, 2, 3, 4, 5, 6} number of children under 22 living in a household discrete 15.75 20.75 78.75 increases both the expected value by an amount equal to the constant added 
both the expected value being multiplied by that same constant 3.51 s2  1.6499; s  1.2845 2.87 days 5-35 14 0.008 19 0.216 15 0.024 20 0.240 16 0.064 21 0.128 17 0.048 22 0.072 18 0.184 23 0.016 b 19.168; s   3.1634  1.7787 c Median is 19 and the quality control department is correct 0.2668 a P(x  5)  0.0746 b P(x (430) 7)  0.2143 c d s  npq  20(.20)(.80)  1.7889 0.1029 a 0.1442 b 0.8002 c 4.55 a 0.0688 b 0.0031 c 0.1467 d 0.8470 e 0.9987 a 3.2 b 1.386 c 0.4060 d 0.9334 a 3.08 b 0.0068 c 0.0019 d It is quite unlikely 883 (431) 884 ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 5-37 a b c 5-39 a b c 5-41 a b c 5-43 a b 5-45 a b c d e 5-49 5-51 5-53 5-55 5-57 5-59 5-61 5-63 5-65 5-67 5-69 5-71 5-75 5-77 0.5580 0.8784 An increase in sample size would be required 2.96 Variance  1.8648; Standard deviation  1.3656 0.3811 0.3179 0.2174 0.25374 0.051987 0.028989 0.372 12 estimate may be too high 0.0832 0.0003 Redemption rate is lower than either Vericours or TCA Fulfillment estimate a corporations b 0.414324 c 70th percentile is 12 a 0.0498 b 0.1512 0.175 a 0.4242 b 0.4242 c 0.4696 a P(x  3)  0.5 b P(x  5)  c 0.6667 d Since 0.6667  0.25, then x’  P(x (432) 10)   0.8305  0.1695 0.0015 a P(x  4)  0.4696 b P(x  3)  0.2167 c 0.9680 a 0.0355 b 0.0218 c 0.0709 a 0.0632 b 120 Spicy Dogs a 0.0274 b 0.0000 c 0.0001 a b lt  1(3)  c 0.0119 d It is very unlikely Therefore, we believe that the goal has not been met a This means the trials are dependent b does not imply that the trials are independent a X P(x) xP(x) 0.56 0.00 0.21 0.21 0.13 0.26 0.07 0.21 0.03 0.12 0.80 b Standard deviation  1.0954; Variance  1.20 5-79 0.0020 5-81 0.6244 5-83 a b c 5-85 a b c 5-87 a b 5-89 a b c d 2.0 1.4142 because outcomes equally likely E(x)  750; E(y)  100 StDev(x)  844.0972; StDev(y)  717.635 CV(x)  844.0972/750  1.1255 CV(y)  717.635/100  7.1764 0.3501 0.3250 0.02 E(X)  0.6261 (0, 2.2783)  0.9929  0.0071 Chapter 6-1 a b 6-5 6-7 6-9 6-11 6-13 6-15 6-17 6-19 6-21 6-23 6-25 190 − 200 10  0.50 20 20 240  200 40   2.00 20 20 a 0.4901 b 0.6826 c 0.0279 a 0.4750 b 0.05 c 0.0904 d 0.97585 e 0.8513 a 0.9270 b 0.6678 c 0.9260 d 0.8413 e 0.3707 a x  1.29(0.50)  5.5  6.145 b m  6.145  (1.65)(0.50)  5.32 a 0.0027 b 0.2033 c 0.1085 a 0.0668 b 0.228 c 0.7745 a 0.3446 b 0.673 c 51.30 d 0.9732 a 0.1762 b 0.3446 c 0.4401 d 0.0548 The mean and standard deviation of the random variable are 15,000 and 1,250, respectively a 0.0548 b 0.0228 c m  15,912 (approximately) about $3,367.35 a 0.1949 b 0.9544 c Mean  Median; symmetric distribution P(x 1.0)  0.5000  0.4761  0.0239 c 6-3 225  200 25   1.25 20 20 (433) ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 6-27 a P(0.747  x  0.753)  0.6915  0.1587  0.5328 0.753  0.75  0.001 b  2.33 6-31 a b c d 6-33 a b 6-35 6-37 6-39 6-41 6-43 c d a b c d a b a b c a b a b 6-45 a b 6-47 a b c 6-49 a b skewed right approximate normal distribution 0.1230 2.034% 0.75 Q1  4.25/0.0625  8; Q2  4.50/0.0625  12; Q3  4.75/0.0625  16 14.43 0.92 0.9179 0.0498 0.0323 0.9502 0.3935 0.2865 0.7143 0.1429 0.0204 0.4084; yes 40,840 0.2939 0.4579 0.1455 0.0183 0.3679 0.0498 0.4493 approximately,   0.08917 positively skewed Descriptive Statistics: ATM FEES Variable Mean StDev ATM FEES 2.907 2.555 c  0.6433  0.3567 6-55 a 0.1353 b 0.1353 6-57 0.5507 6-59 Machine #1: 0.4236 Machine #2: 0.4772 6-61 a 0.0498 b 0.0971 c 0.1354 6-63 d e 0.25 f 0.125 g 0.167 6-65 P(x  74)  0.1271 P(x  90)  0.011 6-67 a approximately normally distributed b Mean  2.453; Standard deviation  4.778 c 0.0012 d No 6-69 a The histogram seems to be “bell shaped.” b The 
90th percentile is 540.419 c 376.71 is the 43rd percentile 6-71 a Uniform distribution Sampling error could account for differences in this sample b f ( x )    0.098 b  a 35  24.8 c 0.451 885 Chapter 7-1 18.50 7-3 x  m  10.17  11.38  1.21 7-5 a – 4.125 b –13.458 to 11.042 c –9.208 to 9.208 7-9 0.64 ∑ x 864 = = 43.20 days 7-11 a  = N 20 b x  ∑ x 206   41.20 days; n Sampling error  41.20  43.20  2 28.4 days to 40.4 days $3,445.30 $29.70 1,432.08 87.12 175.937 to 178.634 Mean of Sales Mean of sales  2,764.83 b Mean of Sample Mean of sample  2,797.16 c $32.33 million d Smallest  $170.47; Largest  $218.41 7-21 a Sampling error  x  m  $15.84  $20.00  $4.16 b Random and nonrandom samples can produce sampling error and the error is computed the same way 7-23 P( x  2,100)  0.5000  0.3907  0.1093 c 7-13 a b 7-15 a b c 7-17 a 7-25 x  n  40 25 8 7-27 0.0016 7-29 a 0.3936 b 0.2119 c 0.1423 d 0.0918 7-31 a 0.8413 b 0.1587 7-33 a 0.3830 b 0.9444 c 0.9736 7-35 P( x (434) 4.2)  0.5000  0.4989  0.0011 7-37 P( x  33.5)  0.5000  0.2019  0.2981 7-39 a Descriptive Statistics: Video Price Variable Mean StDev Video Price 45.580 2.528 b Cumulative Distribution Function Normal with mean  46 and Standard deviation  0.179 x P(X  x) 45.58 0.00948 c Cumulative Distribution Function Normal with mean  45.75 and Standard deviation  0.179 x P(X  x) 45.58 0.171127 7-43 a Mean  55.68; Standard deviation  6.75 (435) 886 ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 7-45 a b c d 7-47 a b 7-49 a b 7-51 a b 0.8621 0.0146 0.8475 0.7422 Sampling error  p  p  0.65  0.70  0.05 0.1379 0.8221 0.6165 0.9015 0.0049 x 27 7-53 a p    0.45 n 60 b P(p (436) 0.45)  0.5000  0.2852  0.2148 7-55 P(p (437) 0.09)  0.5000  0.4838  0.0162 7-57 a 0.1020 b 0.4522 c ≈ 0.0 7-59 a 0.0749 b 0.0359 c essentially 7-61 a 0.72 b 0.9448 c The proportion of on-time arrivals is smaller than in 2004 7-63 a 131 over $100,000 and 65 of $100,000 or less b 0.668 c 0.2981 7-67 Sample averages would be less variable than the population 7-69 a 405.55 b 159.83 7-71 A sample size of would be sufficient 7-73 a 0.2643 b 0.3752 c 0.0951 7-75 a Right skewed distribution; a normal distribution cannot be used b Sampling distribution of the sample means cannot be approximated by a normal distribution c 0.50 7-77 a 0.8660 b 0.9783 7-79 a P(x  16.10)  0.5000  0.1554  0.3446 b P( x  16.10)  0.5000  0.4177  0.0823 7-81 Note, because of the small population, the finite correction factor is used a 0.1112 b Either the mean or the standard deviation or both may have changed c 0.2483 7-83 a b highly unlikely c 0.4999 7-85 a b 0.117 7-89 a 0.216 b 0.3275 c Reducing the warranty is a judgment call Chapter 8-1 8-3 8-5 8-7 15.86 —————————— 20.94 293.18 —————————— 306.82 1,180.10 —————————— 1,219.90 a 1.69 —————————— 4.31 b 1.38 —————————— 4.62 8-9 97.62 ————— 106.38 8-11 a b c d 8-13 a b 8-15 a b c 8-17 a b 8-19 a b c 8-21 a b 8-23 a 8-25 a b c 11.6028 ————— 15.1972 29.9590 ————— 32.4410 2.9098 ————— 6.0902 18.3192 ————— 25.0808 $13.945 ———— $14.515 $7,663.92; $7,362.96 (4,780.25; 5,219.75) 219.75 715.97 ≈ 716 $5.29 ————— $13.07 Sample data not dispute the American Express study 83.785 (505.415, 567.985) increased since 2007 163.5026 ————— 171.5374 Increase the sample size, decrease the level of confidence, decrease the standard deviation 6.3881 ————— 6.6855 33.281 x  256.01; s is 80.68 130 242.01 ————— 270.01 14.00 seconds 8-27 n  z s 1.962350   188.24  189 e2 50 2 ⎛ (1.96 ) ( 680 ) ⎞ z s ⎛ zs ⎞ 8-29 n   ⎜ ⎟  ⎜ ⎟  917.54 ≈ 918 44 ⎝ e ⎠ e ⎠ ⎝ 8-31 3,684.21 8-33 a 61.47; n  62 b 5,725.95; n  
5726 c 1.658; n  d 305.38; n  306 e 61.47; n  62 8-35 n  249 8-37 a 883 b Reduce the confidence level to something less than 95% Increase the margin of error beyond 0.25 pounds some combination of decreasing the confidence level and increasing the margin of error ⎛ ( 2.575) (1.4 ) ⎞ z s ⎛ zs ⎞ 8-39 n   ⎜ ⎟  ⎜ ⎟  324.9 ≅ 325 ⎝ e ⎠ 0.2 e ⎠ ⎝ n  246 n  6,147 Margin of error is from $0.44 to $0.51 n  60 n  239 429  137  292 additional 302  137  165 additional Net required sample is 1,749  150  1,599 Reduce the confidence level (lowers the z-value) or increase the margin of error or some combination of the two 1,698 0.224 ————— 0.336 a yes b (0.286, 414) c 0.286 —— 0.414 d 0.064x a p    0.175 n 40 b (0.057, 0.293) c n  888 8-41 a b c 8-43 a b 8-45 a b 8-47 a b 8-49 8-51 8-53 8-55 (438) ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 8-57 a 0.324 —– 0.436 b 9,604 8-59 a 0.3155 —– 0.3745 b 179.20 —– 212.716 c 0.3692 —– 0.4424 d 1,224.51  1,225 8-61 0.895 ——————————— 0.925 8-63 a 0.6627 ——— 0.7173 b 2,401 c 0.4871(2)  0.9742 8-65 a 0.1167 b 0.1131 c (0.0736, 0.1598) 8-67 a 0.7444 b 0.6260 ——— 0.8628 c The sample size could be increased 8-75 a 0.7265 —– 0.7935 b 25,427.50 —– 27,772.50 8-77 a n  62  40  22 more b $620 without pilot; savings of $1,390  $620  $770 8-79 a 5.21 b n  390 c 0.25 work days  2.00 work hours 8-81 a 0.7003 —– 0.7741 b 32,279.4674 —– 33,322.7227 8-83 a 45.23 to 45.93 b is plausible c n  25 Chapter 9-1 a b c d 9-3 a b c d e 9-5 a b c 9-7 a b c 9-9 a b c 9-11 a b c d e f 9-13 a b c d z  1.96 t  1.6991 t  2.4033 z  1.645 za  1.645 t/2  2.5083 za/2  2.575 ta  1.5332 Invalid Reject the null hypothesis if the calculated value of the test statistic, z, is greater than 2.575 or less than 2.575 Otherwise, not reject z  3.111 Reject the null hypothesis Reject the null hypothesis if the calculated value of the test statistic, t, is less than the critical value of 2.0639 Otherwise, not reject t  1.875 Do not reject the null hypothesis Reject the null hypothesis if the calculated value of the test statistic, t, is greater than 1.3277 Otherwise, not reject t  0.78 Do not reject the null hypothesis Type I error Type II error Type I error No error Type II error No error H0: m (439) 30,000 HA: m 30,000 $30,411.25 Do not reject Type II 887 9-15 a H0: m (440) 3,600 HA: m 3,600 b Since t  0.85  1.8331, the null hypothesis is not rejected 9-17 a H0: m (441) 55 HA: m 55 b Because t  0.93  2.4620, the null hypothesis is not rejected 9-19 The annual average consumer unit spending for food at home in Detroit is less than the 2006 national consumer unit average 9-21 a Since t  0.74 2.1604, we not reject the null hypothesis b Type II error 9-23 a z  1.96 b z  1.645 c z  2.33 d z  1.645 9-25 Since 2.17  2.33, don’t reject 9-27 a Reject the null hypothesis if the calculated value of the test statistic, z, is less than the critical value of the test statistic z  1.96 Otherwise, not reject b z  2.0785 c reject 9-29 a p-value  0.05 b p-value  0.5892 c p-value  0.1902 d p-value  0.0292 9-31 Because z  3.145 is less than 2.055, reject H0 9-33 Since z  0.97 1.645, we not reject the null hypothesis 9-35 Because z  1.543 is less than 1.96, not reject H0 p-value  0.50  0.4382  0.0618 9-37 a H0: p  0.40 HA: p  0.40 b Since z  1.43 1.645, we not reject the null hypothesis 9-39 a H0: p  0.10 HA: p  0.10 b Since the p-value  0.1736 is greater than 0.05, don’t reject 9-41 a Since z  3.8824  1.96, reject H0 b p-value  2(0.5  0.5)  0.0 9-43 Because z  0.5155 is neither less than 1.645 nor greater than 1.645, not reject H0 9-45 
a H0: p (442) 0.50 HA: p 0.50 b Since z  6.08 2.05, we reject the null hypothesis 9-47 a The appropriate null and alternative hypotheses are H0: p (443) 0.95 HA: p 0.95 b Since z  4.85 1.645, we reject the null hypothesis 9-49 a 0.80 b 0.20 c The power increases, and beta decreases d Since x  1.23 then 1.0398 1.23 1.3062, not reject H0 9-51 0.8888 9-53 0.3228 9-55 a 0.0084 b 0.2236 c 0.9160 (444) 888 ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 9-57 a b c 9-59 a 9-61 9-63 9-65 9-67 9-77 0.1685 0.1446 0.1190 H0: m (445) 243 HA: m 243 b 0.0537 a H0: m (446) 15 HA: m 15 b 0.0606 a H0  m (447) $47,413 HA  m $47,413 b 0.1446 0.0495 a Since t  3.97 1.6991, we reject H0 b 0.3557 a If a is decreased, the rejection region is smaller making it easier to accept H0, so b is increased b If n is increased, the test statistic is also increased, making it harder to accept H0, so b is decreased c If n is increased, the test statistic is also increased, making it harder to accept H0, so b is decreased and power is increased d If a is decreased, the rejection region is smaller, making it easier to accept H0, so b is increased and power is decreased 9-79 a z  x m s 9-99 a b 9-101 a b Chapter 10 10-1 10-3 10-5 10-7 10-9 10-11 10-13 10-15 10-17 10-19 10-21 10-23 10-25 n 10-27 x m b t  s n c z  p   (1 ) n 9-81 a H0: m  4,000 HA: m  4,000 b Since t  1.2668 1.7959, not reject 9-83 a H0: p (448) 0.50 HA: p 0.50 b Since z  5.889 1.645, reject the null hypothesis Since z  5.889, the p-value is approximately zero 9-85 a Since z  1.5275  1.645, not reject b Type II error 9-87 a yes b p-value  0.001, reject 9-89 a yes b Since z  1.1547  1.96, not reject H0 9-91 a H0: m  inches HA: m  inches b Reject H0 if z  2.58 or z 2.58; otherwise not reject H0 c Since x  6.03  6.0182, reject the null hypothesis 9-93 a Because z  4.426  1.96 we reject H0 b 50,650.33 ——— 51,419.67 9-95 p-value  0, so reject H0 9-97 a H0: m  0.75 inch HA: m  0.75 inch b Since t  0.9496 2.6264, not reject H0 Since the p-value is less than a, we reject H0 0.1170 Since the p-value is greater than a, we not reject H0 0.5476 10-29 10-31 10-33 10-35 10-37 10-39 10-41 10-43 10-45 10-47 10-49 10-51 6.54  (m1  m2)  0.54 13.34  m1  m2  6.2 0.07  (m1  m2)  50.61 19.47  m1  m2  7.93 a 0.05 b 0.0974  (m1  m2)  0.0026 c The two lines not fill bags with equal average amounts a 0.1043 —— 2.7043; no b no a highly unlikely b (36.3551, 37.4449) 0.10  m1  m2  0.30 a 2.35% b 1,527.32 ——— 4,926.82 c plausible that there is no difference d 3,227 Since t  4.80 2.1199, we reject Since z  5.26  1.645, reject the null hypothesis a If t  1.9698 or t 1.9698, reject H0 b Since 5.652  1.9698, reject H0 Because t  0.896 is neither less than t  2.0167, nor greater than t  2.0167, not reject a Since 0.9785  1.677, not reject H0 b Type II error Since t  4.30 1.6510, we reject the null hypothesis a 2,084/2,050  1.02 b Because p-value  P(t  5.36) ⬵ 1.00, the null hypothesis is not rejected a The ratio of the two means is 9.60/1.40  6.857 b The ratio of the two standard deviations is 8.545/1.468  5.821 c Since p-value  0.966  0.25, not reject the null a It is plausible to conjecture the goal has been met b Since the p-value is greater than 0.025, we not reject the null hypothesis 674.41  m  191.87 a H0: md (449) HA: md b Since 3.64 1.3968, reject H0 c 2.1998 —— 0.7122 a (7.6232, 13.4434) b Because t  0.37 1.459, the null hypothesis cannot be rejected a The samples were matched pairs b Because 0.005 p-value 0.01  a, the null hypothesis should be rejected a Since t  7.35 1.98, reject H0 b 
100.563 —— 57.86; yes a Because t  9.24 2.3646, reject b The ratio of the two standard errors is 3.84 ( 0.5118/0.13313) Since 1.068 2.136, not reject the null hypothesis a no b no c yes d no (450) 889 ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 10-53 Because the p-value  0.367 is greater than a  0.05, we not reject the null hypothesis 10-55 a Since z  2.538  1.96, reject H0 b Since z  2.538  1.645, fail to reject H0 c Since z  2.538  1.96, reject H0 d Since z  2.08 2.33, fail to reject H0 10-57 Since the test statistic, 0.4111  2.575, not reject the null hypothesis 10-59 a Since p-value  0.0244 0.05 reject H0 b 0.00597 10-61 a yes b Since the p-value  0.039 0.05, the null hypothesis is rejected 10-63 a yes b Since the p-value  0.095  0.01, the null hypothesis is not rejected c Larger sample sizes would be required 10-67 a (2.1064, 7.8936) b t  2.0687 c t  2.0687 10-69 120.8035 —— 146.0761 10-71 a yes b Since 0.7745 2.17, not reject H0 10-73 26.40  md  0.36 10-75 a paired samples experiment b Since the p-value  0.000 0.05  a, the null hypothesis is rejected c (0.5625, 0.8775) 10-77 a Since t  5.25  2.3499, reject b Type I error Chapter 11 11-1 74,953.7  s2  276,472.2 11-3 Since  12.39  10.1170, not reject the null hypothesis 11-5 a Because  17.82 20.05  19.6752 and because  17.82  20.95  4.5748, not reject b Because  12.96 20.025  31.5264 and because  12.96  20.975  8.2307, we not reject the null hypothesis 11-7 a 0.01 p-value 0.05; since p-value a, reject H0 b Since the test statistic  1.591 the critical value  1.6899, reject the null hypothesis c p-value  0.94; not reject 11-9 22.72  s2  81.90 11-11 353.38  s2  1,745.18 11-13 a H0: m  10 HA: m  10 b Since 1.2247 1.383, not reject H0 c Since 3.75 14.6837, not reject H0 11-15 a s2  4.884 b Since the test statistic  10.47 the critical value  13.8484, not reject Since t  17.57 1.7109, reject H0 11-17 a s2  0.000278 b Since p-value  0.004 0.01, we reject 11-19 a If the calculated F  2.278, reject H0, otherwise not reject H0 b Since 1.0985 2.278, not reject H0 11-21 a F  3.619 b F  3.106 c F  3.051 11-23 Since F  1.154 6.388  F0.05, fail to reject H0 11-25 Since 3.4807  1.984, reject H0 11-27 a Since 2.0818 2.534, not reject b Type I error Decrease the alpha or increase the sample sizes 11-29 Since F  3.035  2.526  F0.05, reject H0 11-31 Since 1.4833 3.027, not reject H0 11-33 a The F-test approach is the appropriate b 33.90 11-39 0.753  s2  2.819 11-41 Since 1.2129 2.231, not reject 11-43 a Since  37.24  2U  30.1435, reject H0 b P(x (451) 3)  0.0230  0.0026  0.0002  0.0256 11-45 a Since F  3.817  1.752  F0.025, reject H0 b yes 11-47 Since 202.5424 224.9568, not reject 11-49 Since 2.0609  1.4953, reject the null hypothesis 11-51 a Since the p-value  0.496  0.05  a  0.05, not reject b yes Chapter 12 12-1 a SSW  SST  SSB  9,271,678,090  2,949,085,157  6,322,592,933 MSB  2,949,085,157/2  1,474,542,579 MSW  6,322,592,933/72  87,813,791 F  1,474,542,579/87,813,791  16.79 b Because the F test statistic  16.79  Fa  2.3778, we reject 12-3 a The appropriate null and alternative hypotheses are H0: m1  m2  m3 HA: Not all mj are equal b Because F  9.84  critical F  3.35, we reject Because p-value  0.0006 a  0.05, we reject c Critical range  6.0; m1  m2 and m3  m2 12-5 a dfB  b F  11.1309 c H0: m1  m2  m3  m4 HA: At least two population means are different d Since 11.1309  2.9467, reject H0 12-7 a Because 2.729 15.5, we conclude that the population variances could be equal; since F  5.03  3.885, we reject H0 b Critical range  2.224; pop and means differ  no 
other differences 12-9 a Since F  7.131  Fa0.01  5.488, reject b CR  1,226.88; Venetti mean is greater than Edison mean 12-11 a Because 3.125 15.5, we conclude that the population variances could be equal Since F  1,459.78  3.885, we reject H0 b Critical range  9.36 12-13 a Since 10.48326  5.9525 reject H0 and conclude that at least two populations means are different b Critical range  222.02; eliminate Type D and A 12-15 a Since 0.01 0.05, reject H0 and conclude that at least two populations means are different b Mini  Mini Absolute Differences Critical Range Significant? 0.633 1.264 no Mini  Mini 1.175 1.161 yes Mini  Mini 1.808 1.322 yes (452) 890 12-17 12-19 12-21 12-23 12-25 12-27 12-29 12-31 12-33 12-35 12-37 12-39 12-41 12-49 12-51 12-53 ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS Student reports will vary, but they should recommend either or since there is no statistically significant difference between them c 0.978 —— 2.678 cents per mile; $293.40 —— $803.40 annual savings e p-value  0.000 a  0.05 f Average length of life differs between Delphi and Exide and also between Delphi and Johnson There is not enough evidence to indicate that the average lifetime for Exide and Johnson differ a H0: m1  m2  m3  m4 HA: At least two population means are different H0: mb1  mb2  mb3  mb4  mb5  mb6  mb7  mb8 HA: Not all block means are equal b F Blocks  2.487 F Groups  3.072 c Since 46.876  2.487, reject H0 d Since p-value 0.0000326 0.05, reject e LSD  5.48 Because F  14.3  Critical F  3.326, we reject and conclude that blocking is effective Because F  0.1515 Critical F  4.103, we not reject a Because F  32.12  Fa0.01  9.78, reject the null hypothesis b Because F  1.673 Fa0.01  10.925, not reject the null hypothesis a Because F  22.32  Fa0.05  6.944, reject the null hypothesis b Because F  14.185  Fa0.05  6.944, reject the null hypothesis c LSD  8.957 a p-value  0.000 a  0.05 b p-value  0.004 a  0.05 c LSD  1.55; m1 m3 and m2 m3 a p-value  0.628 a  0.05 b p-value  0.000 a  0.05 c LSD  372.304; m1  m2  m3 a p-value  0.854  a  0.05 Therefore, fail to reject H0 b Since F  47.10  F0.05  5.143, reject H0 c p-value  0.039 a  0.05 Therefore, reject H0 a Since 0.4617 3.8853, not reject H0 b Since 2.3766 3.8853, not reject H0 c Since 5.7532  4.7472, reject H0 a Since F  39.63  F0.05  3.633, reject H0 b Since F  2.90 F0.05  9.552, fail to reject H0 c Since F  3.49 F0.05  9.552, fail to reject H0 Since F  57.73  F0.05  9.552, reject H0 a Because F  1.016 Fa0.05  2.728, not reject b Because F  1.157 Fa0.05  3.354, not reject c Because F  102.213  Fa0.05  3.354, reject a Since p-value  0.0570  0.01, not reject b Since 2.9945 6.0129, not reject c Since p-value  0.4829  0.1, not reject a a  0.025 p-value  0.849 Therefore, fail to reject b Since F  15.61  F0.025  3.353, reject c Since F  3.18 F0.025  3.353, not reject d Since 2.0595 t  0.69 2.0595, fail to reject a a  0.05 p-value  0.797 Therefore, fail to reject b Since F  25.55  F0.05  3.855, reject c Since F  0.82 F0.05  3.940, not reject a Since F  3.752  2.657, we reject b Critical range  9.5099; m1  mB, m1  m1F, and m1  mG a Since (F1,200,0.05  3.888 F1,998,0.05 F1,100,0.05  3.936) F  89.53, reject b Since t  9.46 (t250,0.05  1.9695 t998,0.05 t100,0.05  1.9840), reject c Note that (t-value)2  (9.46)2  89.49 ≈ 89.53 12-55 a randomized block design b H0: m1  m2  m3  m4 HA: At least two population means are different c Since 3.3785 4.0150, not reject d H0: Since 20.39312  1.9358, reject e no difference 12-57 a Since F  5.37  F0.05  2.642, reject b Since F  
142.97  F0.05  5.192, reject c Since F  129.91  F0.05  5.192, reject; since F  523.33  F0.05  5.192, reject Chapter 13 13-1 Because  6.4607 13.2767, not reject 13-3 Because  218.62  18.3070, we reject 13-5 Because the calculated value of 595.657  13.2767, we reject 13-7 Because the calculated value of 4.48727 is less than 12.8345, not reject 13-9 a Since chi-square statistic  3.379443 11.3449, not reject b Based on the test, we have no reason to conclude that the company is not meeting its product specification 13-11 Chi-square value  0.3647; not reject 13-13 Since chi-square  3.6549 7.3778, not reject 13-15 a Because the calculated value of 1.97433 is less than the critical value of 14.0671, we not reject b Since z  16.12 1.645, not reject 13-17 a Since the calculated value of 27.9092  3.8415, we reject b 0.000000127 13-19 a H0: Gender is independent of drink preference HA: There is a relationship between gender and drink preference b 12.331  3.8415; reject 13-21 a  7.783 9.2104, not reject b 0.0204 13-23 a 8.3584  6.6349; reject b p-value  0.00384 13-25 The p-value  0.00003484; reject 13-27 a Chi-square  0.932; not reject b A decision could be made for other reasons, like cost 13-29 Because  24.439  12.5916, reject the null hypothesis 13-31 Collapse cells, chi-square  11.6  5.9915, reject 13-33 Since 0.308 5.991, not reject 13-41 a Chi-square  3,296.035 b 402.3279  4.6052; reject 13-43 Chi-square  0.172; not reject 13-45 a H0: Airline usage pattern has not changed from a previous study HA: Airline usage pattern has changed from a previous study b Chi-square test statistic is 66.4083  15.08632; reject 13-47 b  37.094  3.8415; reject Chapter 14 14-1 H0: r  0.0 HA: r  0.0 a  0.05, t  2.50, 2.50  1.7709; reject 14-3 a r  0.206 b H0: r  (453) ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 14-5 a b c 14-7 a b 14-9 a b c 14-11 a b 14-13 a b c 14-15 a b c 14-17 a b c d e 14-19 a b c d 14-21 a b 14-23 a b c 14-25 a b HA: r  a  0.10, 1.8595 t  0.59 1.8595; not reject H0 There appears to be a positive linear relationship r  0.9239 H0: r  HA: r  d.f  10   Since 6.8295  2.8965, reject H0 fairly strong positive correlation H0: r  HA: r  a  0.01, t  7.856  2.4066; reject the null hypothesis The dependent is the average credit card balance The independent variable is the income variable does not appear to be a strong relationship H0: r  0.0 HA: r  0.0 a  0.05, t  1.56  2.1604; not reject r  0.979 H0: r  HA: r  a  0.05, 6.791  t  2.9200 as it was; we reject H0 positive linear relationship using Excel, r  0.75644 H0: r  HA: r  If t  2.0096, reject the null hypothesis; Because t  8.0957  2.0096, reject the null hypothesis As 2001 revenue increases there is an increase in the 2004 revenue, which can be seen as the upward “tilt” of the scatter plot r  0.186 H0: r  HA: r  a  0.05, 1.325 t  1.67655; not reject H0 There appears to be a positive linear relationship between x and y yˆ  15.31  4.92(x); b0 of 15.31 would be the value of y if x were 0; b1 of 4.92 means that for every one unit increase in x, y will increase by 4.92 R2  0.8702 H0: r  HA: r  Because t  5.7923  4.0321, reject H0 H0: b1  HA: b1  Because t  5.79  4.0321, reject H0 yˆ  26.830  0.4923x when x  10, yˆ  26.830  0.4923(10)  21.907 10(0.4923)  4.923 H0: b1 (454) HA: b1 a  0.025, 3.67 2.4469; reject 10.12 H0: b1 (455) HA: b1 Since 3.2436 2.407, reject H0 t test statistic  10.86  2.1318, reject An increase of the average public college tuition of $1 would accompany an increase in the average private college tuition of $3.36 yˆ  3,372  3.36(7,500)  
$28,572 R-square  0.0395 Se  8,693.43 14-27 14-29 14-31 14-33 14-35 14-37 14-39 14-41 14-43 14-45 14-47 14-49 14-51 14-53 14-55 891 c H0: b1  0.0 HA: b1  0.0 a  0.05, t  1.11 2.0423; not reject d insignificant b yˆ  1,995  1.25x c R2  94.0% H0: b1 (456) 1.4 HA: b1 1.4 a  0.05; t test statistic  1.301 is greater than the t-critical value of 1.8595 Do not reject a yˆ  145,000,000  1.3220x b The regression equation is yˆ  145,000,000  1.3220x and the coefficient of determination is R2  83.1% H0: b1  HA: b1  a  0.05; the t test statistic of 7.02  2.2281 Therefore, reject c yˆ  1,000,000(1.322)  $1,322,000 b yˆ  10.1  (0.7)(x) c 0.9594 ——————— 0.4406 d 2.98165 ——— 7.41835 0.0847 ———————————————— 0.2733 a yˆ  $58,766.11  $943.58(x) b H0: b1  0.0 H0: b1  0.0 a  0.05 Because t  3.86  2.1315, reject c $422.58 ——— $1,464.58 a yˆ  44.3207  0.9014x b H0: 1  HA: 1  a  0.05 The p-value 0.05 (t  6.717 is greater than any value for 13 degrees of freedom shown in the table) and the null hypothesis is rejected c (0.6637, 1.1391) a yˆ  6.2457  (9.8731)(x) b yˆ  6.2457  (9.8731)(6)  52.9929 minutes c 78.123 ——— 87.102 d 47.96866 ——— 77.76391 b yˆ  11,991  5.92x and the coefficient of determination is R2  89.9% c (76.47  115.40) b (3.2426, 6.4325) c (3.78 to 6.5) The value of 0.45 would indicate a relatively weak positive correlation no a no b cannot assume cause and effect The answer will vary depending on the article the students select a The regression equation is yˆ  3,372  3.36x b H0: b1  HA: b1  a  0.05, t  10.86, p-value  0.000 c ($23,789, $30,003) d Since $35,000 is larger than $30,966, it does not seem to be a plausible value a There appears to be a weak positive linear relationship b r  0.6239 H0: r  HA: r  (457) 892 ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS c 14-57 a b 14-59 b c d 14-61 a b 14-63 a b c 14-65 b c d e d.f  10   Since 2.2580 3.3554, not reject H0 yˆ  0.9772  0.0034(x) The y-intercept has no interpretation in this case The slope indicates that the average university GPA increases by 0.0034 for each increase of unit in the SAT score There seems to be a random pattern in the relationship between the typing speed using the standard and ergonomic H0: r  HA: r  a  0.05, r  0.071 t calc  0.2013 t  1.8595; not reject H0 yˆ  1,219.8035  (9.1196)(x) 568.285  112.305 Since 1.5 5.117, not reject H0 There appears to be a possible positive linear relationship between time (in hours) and rating yˆ  66.7111  10.6167(x) yˆ  3,786  1.35x H0: b1  HA: b1  a  0.05, p-value  0.000 Since the p-value is less than a  0.05, we reject the null hypothesis ($265,597, $268,666) 378.365  592.08 no 2,115.458  2,687.465 $100 is outside of the range of the sample Chapter 15 15-1 a yˆ  87.7897  0.9705x1  0.0023x2  8.7233x3 b F  5.3276  F0.05  3.0725; also, p-value  0.00689 any reasonable alpha Therefore, reject H0: 1  2  3  c R  SSR 16, 646.09124   0.432 SST 38, 517.76 d x1 ( p-value  0.1126  a  0.05; fail to reject H0: b1  0) and x3 (p-value  0.2576  a  0.05; fail to reject H0: b3  0) are not significant e b2  0.0023; yˆ increases 0.0023 for each one-unit increase of x2 b3  8.7233; yˆ decreases 8.7233 for each one-unit increase of x3 f The confidence intervals for b1 and b3 contain 15-3 a b1  412; b2  818; b3  93; b4  71 b yˆ  22,167  412(5.7)  818(61)  93(14)  71(1.39)  68,315.91 15-5 a yi  5.05  0.051x1  0.888x2 b yi x1 x1 0.206 x2 0.919 d Predictor Coef SE Coef Constant 5.045 8.698 0.58 P VIF 0.580 x1 0.0513 0.2413 0.21 0.838 1.1 x2 0.8880 0.1475 6.02 0.001 1.1 15-7 a yˆ  977.1  11.252(WK HI)  117.72(P-E) b H0: b1  
b2  HA: at least one bi  a  0.05, F  39.80  3.592; we reject c yˆ  1,607 15-9 a yˆ  503  10.5x1  2.4x2  0.165x3  1.90x4 b H0: b1  b2  b3  b4  HA: at least one bi  a  0.05 Since F  2.44 3.056, fail to reject H0 c H0: b3  HA: b3  a  0.05, p-value  0.051  0.05; fail to reject H0 d yˆ  344  0.186x1 The p-value  0.004 Since the p-value  0.004 0.05, reject 15-11 a There is a positive linear relationships between team win/loss percentage and game attendance, opponent win/loss percentage and game attendance, games played and game attendance There is no relationship between temperature and game attendance b There is a significant relationship between game attendance and team win/loss percentage and games played c Attendance  14,122.24  63.15(Win/loss%) 10.10 (Opponent win/loss)  31.51(Games played)  55.46 (Temperature) d R2  0.7753, so 77.53% is explained e H0: b1  b2  b3  b4  HA: at least one bi does not equal significance F  0.00143; reject H0 f For team win/loss % the p-value  0.0014 0.08 For opponent win/loss % the p-value  0.4953  0.08 For games played the p-value  0.8621  0.08 For temperature the p-value  0.3909  0.08 g 1,184.1274; interval of 2(1,184.1274) h VIF Team win/loss percentage and all other X 1.569962033 Temperature and all other X 1.963520336 Games played and all other X 1.31428258 Opponent win/loss percentage and all other X 1.50934547 0.257 H0: r  HA: r  a  0.05, t  0.5954, 2.306 t  0.5954 2.306; we fail to reject H0 c H0: b1  b2  HA: at least one bi  a  0.05, F  19.07 Since F  19.07  4.737, reject H0 T 15-15 a b c 15-17 a b c Multicollinearity is not a problem since no VIF is greater than x2  1, yˆ  145  1.2(1,500)  300(1)  2,245 x2  0, yˆ  145  1.2(1,500)  300(0)  1,945 b2 indicates the average premium paid for living in the city’s town center As the vehicle weight increases by pound, the average highway mileage rating would decrease by 0.003 If the car has standard transmission the highway mileage rating will increase by 4.56, holding the weight constant yˆ  34.2  0.003x1  4.56(1)  38.76  0.003x1 (458) ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 15-19 15-21 15-23 15-25 15-27 15-29 15-31 15-33 d yˆ  34.2  0.003(4,394)  4.56(0)  21.02 e Incorporating the dummy variable essentially gives two regression equations with the same slope but different intercepts depending on whether the automobile is an automatic or standard transmission a yˆ  197  43.6x1  51x2 b b1  The difference in the average PP100 between Domestic and Korean vehicles b2  The difference in the average PP100 between European and Korean vehicles c H0: b1  b2  HA: at least one bi  a  0.05, F  4.53  3.555; reject H0 a There appears to be a weak positive linear relationship between hours and net profit There appears to be a weak negative linear relationship between client type and net profit b yˆ  1,012.0542  69.1471(x1) c The p-value  0.0531 The R2 is only 0.3549 a yˆ  390  37.0x1  0.263x2 H0: b1  HA: b1  c a  0.05 Since t  20.45 1.9921, we reject H0 d yˆ  390  37.0x1  0.263x2  390  37.0(1)  0.263(500)  484.5 ≈ 485 a A linear line is possible, nonlinear is more likely c yˆ  4.937  1.2643x; the p-value  0.015 a  0.05, reject yˆ  25.155  18.983ln x b two quadratic models; interaction between x2 and the quadratic relationship between y and x2 yˆ i  4.9  3.58x1  0.014x12  1.42x1x2  0.528x21x2 c b3x1x2 and b4x21 x2 So you must conduct two hypothesis tests: i H0: b3  HA: b3  a  0.05, p-value  0.488; we fail to reject H0 ii For b4  0, the p-value  0.001 d Conclude that there is interaction between x2 and the quadratic 
relationship between x1 and y a The complete model is yi  b0  b1x1  b2x2  b3x3  b4x4  i The reduced model is yi  b0  b1x1  b2x2  i H0: b3  b4  0, HA: at least one bi  SSEC  201.72 So MSEC  SSEC /(n  c  1)  201.72/(10   1)  40.344 and SSER  1,343 a  0.05, F  14.144; 14.144  5.786; we reject H0 b The complete model is yi  b0  b1x1  b2x2  b3x3  b4x4  i The reduced model is yi  b0  b1x3  b2x4  i SSEC  201.72 So MSEC  SSEC /(n  c  1)  201.72/(10   1)  40.344 and SSER  494.6 H0: b1  b2  0; HA: at least one bi  0; a  0.05, F  3.63 The numerator degrees of freedom are c  r    and the denominator degrees of freedom are n  c   10    The p-value  P(F (459) 3.630)  0.1062 Fail to reject a two dummy variables x2  if manufacturing, otherwise x3  if service, otherwise Net profit  586.256  22.862x1  2,302.267x2  1,869.813x3 b Net profit  5,828.692  334.406x1  4.577x1 sq  2,694.801x2  12,874,953x3 a Create scatter plot b Second-order polynomial seems to be the correct model yˆ  8,083  0.273x  0.000002x2 15-35 15-39 15-41 15-43 15-45 15-47 15-49 15-51 15-55 15-57 893 c y  b0  b1x1  b2x12  b3x2  b4x1x2  b5x12 x2   The two interaction terms are b4x1x2 and b5x12x2 So you must conduct two hypothesis tests: i Test for b4  Since the p-value  0.252  0.05, we fail to reject H0 ii Test for b5  Since the p-value for b5 0.273  0.05, we fail to reject H0 a Create scatter plot b fourth order polynomial c The regression equation is Admissions  30.0  24.5(Average prices)  7.3(AP2)  0.98(AP3) – 0.0499(AP4) d Test the hypothesis b4  Since p-value  0.572  0.05, we fail to reject H0; there is sufficient evidence to remove the fourth order component Similarly, there is evidence to remove the third order component yˆ i  2.68  1.47x1  0.129x21 a x2 and x4 only; x1 and x3 did not have high enough coefficients of partial determination to add significantly to the model b would be identical c Stepwise regression cannot have a larger R2 than the full model a None are significant b Alpha-to-enter: 0.25; alpha-to-remove: 0.25 yˆ  26.19  0.42x3, R2  14.68 c yˆ  27.9  0.035x1  0.002x2  0.42x3, R2  14.8 The adjusted R2 is 0% for the full model and 7.57% for the standard selection model Neither model offers a good approach to fitting this data a yˆ  32.08  0.76x1  5x3  0.95x4 b yˆ  32.08  0.76x1  5x3  0.95x4 c yˆ  32.08  0.76x1  5x3  0.95x4 d yˆ  32.08  0.76x1  5x3  0.95x4 a yˆ  18.33  1.07x2 b one independent variable (x2) and one dependent variable (y) c x1 was the first variable removed, p-value (0.817)  0.05 x3 was the last variable removed, p-value  0.094  0.05 b yˆ  1,110  1.60x22 a Using Minitab you get an error message indicating there is multicollinearity among the predictor variables b Either crude oil or diesel prices should be removed c yˆ  0.8741  0.00089x2  0.00023x3 a yˆ  16.02  2.1277x b p-value  0.000 0.05 a Calls  269.838  4.953(Ads previous week)  0.834(Calls previous week)  0.089(Airline bookings) The overall model is not significant and none of the independent variables are significant b The assumption of constant variance has not been violated c It is inappropriate to test for randomness using a plot of the residuals over time since the weeks were randomly selected and are not in sequential, time-series order d Model meets the assumption of normally distributed error terms a yˆ  0.874  0.000887x1  0.000235x2 b The residual plot supports the choice of the linear model c The residuals not have constant variances (460) 894 ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS d e 15-59 a b 15-63 15-65 15-69 15-71 15-73 15-75 15-77 
15-79 15-81 15-83 The linear model appears to be insufficient The error terms are normally distributed yˆ  6.81  5.29x1  1.51x2  0.000033x3 Plot the residuals versus the independent variable (x) or the fitted value (ˆyi) c yˆ  0.97  3.20x1  0.285x2  0.000029x3  3.12x12  0.103x22  0.000000x32 d The residual plot does not display any nonrandom pattern e The error terms are normally distributed a The relationship between the dependent and each independent variable is linear b The residuals are independent c The variances of the residuals are constant over the range of the independent variables d The residuals are normally distributed a The average y increases by three units holding x2 constant b x2, since x2 only affects the y-intercept of this model c The coefficient of x1 indicates that the average y increases by units when x2  d The coefficient of x1 indicates that the average y increases by 11 units when x2  e Those coefficients affected by the interaction terms have conditional interpretations a The critical t for all pairs would be 2.1604, correlated pairs Volumes sold (y)  Production expenditures Volumes sold (y)  Number of reviewers Volumes sold (y)  Pages Volumes sold ( y)  Advertising budget b All p-values  0.05 c Critical F  3.581; since F  9.1258  3.581, conclude that the overall model is significant e 2(24,165.9419)  48,331.8 f Constant variance assumption is satisfied g The residuals appear to be approximately normally distributed h The model satisfies the normal distribution assumption The t-critical for all pairs would be 2.0687, correlated pairs are For family size and age For purchase volume and age For purchase volume and family income The significance F  0.0210 Age entered the model The R2 at step was 0.2108 and the standard error at step was 36.3553 The R2 at step is 0.3955 and the standard error at step is 32.5313 Other variables that enter into the model partially overlap with the other included variables in its ability to explain the variation in the dependent variable a Normal distribution of the residuals b The selected independent variables are not highly correlated with the dependent variable a yˆ  2,857  26.4x1  80.6x2  0.115x21  2.31x22  0.542x1x2 b The residual plot supports the choice of the linear model c The residuals have constant variances d The linear model appears to be insufficient The addition of an independent variable representing time is indicated e A transformation of the independent or dependent variables is required a Quadratic relationship exists between cost and weight b r  0.963 H0: r  HA: r  a  0.05; Since the p-value  0.000  0.05, we reject H0 c Cost  64.06  14.92(Weight) d Cost  113.8  9.22(Weight)  1.44(Weight2) Comparing the R2adj for the quadratic equation (95.6%) and the R2 for the simple linear equation (94.5%), the quadratic equation appears to fit the data better 15-85 Vehicle year  73.18  9.1(Gender)  1.39(Years education)  24(Not seat belt) R-squared  14.959% Chapter 16 16-3 Generally, quantitative forecasting techniques can be used whenever historical data related to the variable of interest exist and we believe that the historical patterns will continue into the future 16-7 a The forecasting horizon is months b a medium term forecast c a month d 12 months 16-9 c Year Radio % radio Newspaper Laspeyres d 300 0.3 400 100 310 0.42 420 104.59 330 0.42 460 113.78 346 0.4 520 126.43 362 0.38 580 139.08 380 0.37 640 151.89 496 0.43 660 165.08 Year Radio % radio 300 Newspaper Paasche 0.3 400 100 310 0.42 420 104.41 330 0.42 460 113.22 346 
0.4 520 124.77 362 0.38 580 136.33 380 0.37 640 148.12 496 0.43 660 165.12 16-13 Year Labor Costs Material Costs % % Laspeyres Materials Labor Index 1999 44,333 66,500 60 40 100 2000 49,893 68,900 58 42 106.36 2001 57,764 70,600 55 45 113.59 2002 58,009 70,900 55 45 114.07 2003 55,943 71,200 56 44 112.95 2004 61,078 71,700 54 46 117.03 2005 67,015 72,500 52 48 122.09 2006 73,700 73,700 50 50 127.88 2007 67,754 73,400 52 48 123.44 2008 74,100 74,100 50 50 128.57 2009 83,447 74,000 47 53 134.95 (461) ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 16-15 a The sum ∑p1993  1.07  6.16  8.32  15.55 b 170.87 170.87 103.86 c 103.86; 100  64.52% 103.86 16-17 a 102.31 b 15.41% c 14.43% 16-21 January  0.849; July  0.966 16-23 a upward linear trend with seasonal component as a slight drop in the 3rd quarter b Normalize to get the following values: Quarter Seasonal Index 1.035013 1.020898 0.959934 0.984154 d Quarter Period Quarter 2010 Quarter 2010 Quarter 2010 Quarter 2010 17 18 19 20 Seasonal Index 250.15 256.17 262.18 268.20 1.02349 1.07969 1.16502 1.12147 0.98695 0.83324 0.86807 0.91287 0.97699 10 1.07311 11 1.01382 12 0.94529 Seasonally Adjusted Forecast 48.27804189 Prediction Interval Lower Limit 161.7600534 Prediction Interval Upper Limit 258.3161371 Model with transformation: For Individual Response y 258.91 261.52 251.68 263.95 Interval Half Width 16-33 a b c d 16-35 b c d e 16-37 a c F  1.98  0.0459(Month) d F25  1.98  0.0459(25)  3.12 Adjusted F25  (1.02349)(3.12)  3.19 F73  1.98  0.0458589(73)  5.32 Adjusted F73  (1.02349)(5.32)  5.44 16-29 a seasonal component to the data b MSE  976.34 and MAD  29.887 c Quarter Index 256.5620033 260.0884382 263.614873 267.1413079 Interval Half Width 1.0350 1.0209 0.9599 0.9842 Index Forecast 13 14 15 16 For Individual Response y 16-27 b The seasonal indexes generated by Minitab are: Month Period e MSE  926.1187, MAD  29.5952 f The adjusted model has a lower MSE and MAD 16-31 a Forecast without transformation  36.0952  10.8714(16)  210.0376 Forecast with transformation  65.2986  0.6988(16)2  244.1914 Actual cash balance for Month 16 was 305 The transformed model had a smaller error than the model without the transformation b Model without transformation: c MSE  36.955 and MAD  4.831 d and e Seasonally Unadjusted Forecast 2009 Qtr Qtr Qtr Qtr 895 b c 23.89550188 Prediction Interval Lower Limit 220.29634337 Prediction Interval Upper Limit 268.08734713 The model without the transformation has the wider interval Linear trend evidenced by the slope from small to large values Randomness is exhibited since not all of the data points would lie on a straight line H0: b1  HA: b1  a  0.10, Minitab lists the p-value as 0.000 The fitted values are F38  36,051, F39  36,955, F40  37,858, and F41  38,761 The forecast bias is 1,343.5 On average, the model over forecasts the e-commerce retail sales an average of $1,343.5 million A trend is present Forecast  136.78, MAD  23.278 Forecast  163.69, MAD  7.655 The double exponential smoothing forecast has a lower MAD The time series contains a strong upward trend, so a double exponential smoothing model is selected The equation is yˆ t  19.364  0.7517t Since C0  b0, C0  19.364 T0  b1  0.7517 Forecasts Period Forecast Lower Upper 13 29.1052 23.9872 34.2231 d MAD as calculated by Minitab: Accuracy Measures 1.0290 0.9207 MAPE 8.58150 1.0789 MAD 2.08901 0.9714 MSE 6.48044 (462) 896 ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 16-39 a The time series contains a strong upward trend, so a double exponential smoothing model is selected b yˆ t  
990  2,622.8t Since C0  b0, C0  990 T0  b1  2,622.8 c Forecast  58,852.1 d MAD  3,384 16-41 a There does not appear to be any trend component in this time series c MAD  3.2652 d F14  0.25y13  (1  0.25)F13  0.25(101.3)  0.75(100.22)  100.49 16-43 a Single exponential smoothing model is selected b The forecast is calculated as an example F1  F2  0.296 Then F3  0.15y2  (1  0.15)F2  0.15(0.413)  0.85(0.296)  0.3136 c MAD  15.765/71  0.222 d F73  0.15y72  (1  0.15)F72  0.15(0.051)  0.85(0.259)  0.212 16-45 a The double exponential smoothing model will incorporate the trend effect b From regression output, Initial constant  28,848; Initial trend  2,488.96 Forecast for period 17  72,450.17 MAD  5,836.06 c The MAD produced by the double exponential smoothing model at the end of Month 16 is smaller than the MAD produced by the single exponential smoothing model d and e Of the combinations considered the minimum MAD at the end of Month 16 occurs when alpha  0.05 and beta  0.05 The forecast for Month 17 with alpha  0.05 and beta  0.05 is 71,128.45 16-47 a a seasonal component b The pattern is linear with a positive slope c a cyclical component d a random component e a cyclical component 16-49 A seasonal component is one that is repeated throughout a time series and has a recurrence period of at most one year A cyclical component is one that is represented by wavelike fluctuations that has a recurrence period of more than one year Seasonal components are more predictable 16-51 a There does appear to be an upward linear trend b Forecast  682,238,010.3  342,385.3(Year) Since F  123.9719  4.6001, conclude that there is a significant relationship c MAD  461,216.7279 d Year Forecast 2010 4,929,275.00 2011 5,271,660.29 2012 5,614,045.59 2013 5,956,430.88 2014 6,298,816.18 Period Index 0.98230 1.01378 1.00906 1.00979 0.99772 1.01583 0.99739 1.00241 0.98600 10 0.98572 c The nonlinear trend model (using t and t2) fitted to the deseasonalized data ARM  3.28  0.114(Month)  0.00177(Monthsq) d The unadjusted forecast: F61  3.28  0.114(61)  0.00117(61)2  5.8804 The adjusted forecast is F61  (0.9823)(5.8804)  5.7763 e The following values have been computed: R2  92.9%, F-statistic  374.70, and standard error  0.226824 The model explains a significant amount of variation in the ARM Durbin-Watson d statistic  0.378224 Because d  0.378224 dL  1.35, conclude that significant positive autocorrelation exists 16-59 a MAD  4,767.2 c Alpha MAD 0.1 5,270.7 0.2 4,960.6 0.3 4,767.2 0.4 4,503.3 0.5 4,212.4 16-61 a A strong trend component is evident in the data b Using 1980  Year 1, the estimated regression equation is yˆt  490.249  1.09265t, R2  95.7% e For Individual Response y Interval Half Width c Forecast(2008)  740.073 d MAD  89.975 16-57 a A cyclical component is evidenced by the wave form, which recurs approximately every 10 months b If recurrence period, as explained in part a, is 10 months, the seasonal indexes generated by Minitab are 1,232,095.322 Prediction Interval Lower Limit 5,066,720.854 Prediction Interval Upper Limit 7,530,911.499 16-55 a The time series contains a strong upward trend, so a double exponential smoothing model is selected b Since C0  b0, C0  2,229.9; T0  b1  1.12 H0: b1  HA: b1  a  0.10, t  23.19, n   26   24, critical value is t0.10  1.3178 c yˆt  490.249  1.09265(31)  524.1212 Chapter 17 ~ (463) 14 17-1 The hypotheses are H0: m ~ 14 HA: m W  36 n  11, a  05; reject if W 13 (464) ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS ~4 17-3 The hypotheses are H0: m ~4 HA: m W  9, W  19: Critical values for 
n  7, assuming a  0.1 are and 25 Cannot reject ~4 17-5 a The hypotheses are H0: m ~4 HA: m b Using the Wilcoxon Signed Rank test, W  26: Upper tail test and n  12, letting a  05, reject if W  61 So, cannot reject ~ (465) 11 17-7 H0: m ~ 11 HA: m Using the Wilcoxon Signed Rank test, W  92: Reject if W 53 ~  30 17-9 H0: m ~  30 HA: m Using the Wilcoxon Signed Rank test, W  71.5, W  81.5: Because some of the differences are 0, n  17 The upper and lower values for the Wilcoxon test are 34 and 119 for a  0.05 Do not reject 17-11 a Using data classes one standard deviation wide, with the data mean of 7.6306 and a standard deviation of 0.2218: e o 14.9440 32.4278 32.4278 14.9440 21 31 27 16 (o - e)2/e 2.45417 0.06287 0.90851 0.07462 Sum = 3.5002 Testing at the a  0.05 level, χa  5.9915 b Since we concluded the data come from a normal distribution we test the following: H0: m (466) 7.4 HA: m 7.4 Decision rule: If z 1.645, reject H0; otherwise not reject Z  10.13 17-13 a Putting the claim in the alternate hypothesis: ~ m ~ (467) H0: m ~ ~ HA: m1  m b Test using the Mann-Whitney U Test U1  40, U2  24 Use U2 as the test statistic For n1  and n2  and U  24, p-value  0.221 ~ m ~ 0 17-15 a H0: m ~ m ~ 0 HA: m b Since the alternate hypothesis indicates Population should have the larger median, U1  40 n1  12 and n2  12 Reject if U  31 ~ m ~ 0 17-17 H0: m ~ ~ 0 HA: m1  m Mann-Whitney Test and CI: C1, C2 C1 C2 N  40, N  35, Median  481.50 Median  505.00 Point estimate for ETA1  ETA2 is 25.00 95.1% CI for ETA1  ETA2 is (62.00, 9.00) W  1,384.0 Test of ETA1  ETA2 vs ETA1 not  ETA2 is significant at 0.1502 ~ m ~ 0 17-19 a H0: m ~ m ~ 0 HA: m 897 With n  8, reject if T  Since T  11.5, we not reject the null hypothesis b Use the paired sample t test p-value  0.699 ~ m ~ 0 17-21 H0: m ~ ~ 0 HA: m1  m With n  7, reject if T  2, T  13 ~ m ~ 17-23 H0: m ~ m ~ H :m A If T  reject H0; T  ~ m ~ 0 17-25 H0: m W WO ~ m ~ 0 HA: m W WO U1  (7)(5)  (7)(7  1)/2  42  21 U2  (7)(5)  (5)(5  1)/2  36  14 Utest  21 Since 21 is not in the table you cannot determine the exact p-value, but you know that the p-value will be greater than 0.562 ~ m ~ 17-27 a H0: m ~ m ~ H :m A b Since T  51 is greater than 16, not reject H0 c Housing values are typically skewed ~ m ~ 17-29 H0: m ~ ~ HA: m1  m m  40(40  1)/4  410 s 40(40 1)(80 1)/ 24  74.3976 z  (480  410)/74.3976  0.94 p-value  (0.5  0.3264)2  (0.1736)(2)  0.3472 Do not reject H0 17-31 a A paired-t test H0: md (468) HA: md t  (1.7)/(3.011091/ 10 )  1.785 Since 1.785  t critical  2.2622, not reject H0 ~ (469) m ~ b H0: m O N ~ ~ HA: mO m N T  5.5 Since 5.5 6, reject H0 and conclude that the medians are not the same c Because you cannot assume the underlying populations are normal you must use the technique from part b ~ (470) m ~ 17-33 a H0: m N C ~ m ~ H :m b 17-35 a b c d 17-37 a A N C U1  4,297, U2  6,203, m  5,250, s  434.7413 z  2.19 p-value  0.0143 a Type I error The data are ordinal The median would be the best measure ~ m ~ H0: m ~ ~ HA: m1  m Using a  0.01, if T  2, reject H0 Since 12.5  2, not reject H0 The decision could be made based on some other factor, such as cost ~ m ~ m ~ H :m HA: Not all population medians are equal H  10.98 Since, with a  0.05, χa  5.9915, and H  10.98, we reject ~ m ~ m ~ m ~ 17-39 a H0: m HA: Not all population medians are equal b Use Equation 17-10 Selecting a  0.05, χa  7.8147, since H  42.11, we reject the null hypothesis of equal medians (471) 898 ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 17-41 a Salaries in general are usually thought to be 
skewed b The top-salaried players get extremely high salaries compared to the other players ~ m ~ m ~ c H0: m HA: Not all population medians are equal H  52.531 If a  0.05, χa  5.9915 Reject ~ ~ ~ ~ 17-43 H0: m1  m2  m3  m4 HA: Not all population medians are equal Using PHStat, H test statistic  11.13971 Adjusting for ties, the test statistic is 11.21, which is smaller than the critical value (11.34488) ~ m ~ m ~ 17-45 H0: m HA: Not all population medians are equal H  13.9818, testing at a  0.05, χa  5.9915 Since 13.9818  5.9915, reject H0 ~ m ~ 0 17-53 H0: m ~ ~ 0 HA: m1  m U1  107, U2  14; U test  14 with a  0.05, Ua  30 Since 14 30, reject H0 17-55 a A nonparametric test ~ m ~ (472) b H0: m O N ~ m ~ HA: m O N U1  71, U2  29; U test  29 If a  0.05, Ua  27 Since 29  27, not reject H0 17-57 The hypotheses being tested are ~  1,989.32 H0: m ~  1,989.32 HA: m Find W  103, W  68 With a  0.05, reject if W  40 or W  131 ~  8.03 17-59 a H0: m ~  8.03 HA: m b W  62.5, W  57.5 This is a two-tailed test with n  15 If a  0.05, reject if W  25 or W  95 ~ m ~ 17-61 H0: m ~ ~ HA: m1  m Constructing the paired difference table, T  44.5 With a  0.05, reject if T  21 or if T (473) 84 b a Type II error 17-63 a They should use the Wilcoxon Matched-Pairs Signed Rank ~ ~ b H0: m w/oA (474) mA ~ ~ H :m m A w/oA A T6 Using a  0.025, Ta  c Do not reject H0 Chapter 18 18-9 Some possible causes by category are: People: Too Few Drivers, High Driver Turnover Methods: Poor Scheduling, Improper Route Assignments Equipment: Buses Too Small, Bus Reliability, Too Few Buses Environment: Weather, Traffic Congestion, Road Construction 18-11 a A2  0.577; D3  and D4  2.114 b The R-chart upper control limit is 2.114 × 5.6  11.838 The R-chart lower control limit is × 5.6  c X-bar chart upper control limit  44.52  (0.577 × 5.6)  47.751; Lower control limit  44.52  (0.577 × 5.6)  41.289 18-15 a x-bar chart centerline  0.753 UCL  0.753  (0.577 0.074)  0.7957 LCL  0.753  (0.577 0.074)  0.7103 b R-chart centerline  0.074 UCL  2.114 0.074  0.1564 LCL  0.074  0.000 c There are no subgroup means outside of the upper or lower control limits on the x-bar chart For the R-chart, there are no subgroup ranges outside the control limits 18-17 a The process has gone out of control since all but two observations and the 1st eight in sequence are below the LCL 18-19 a c-chart b c  3.2841 UCL  6.5682  3( 6.5682 )  14.2568 LCL  6.5682  3( 6.5682 )  1.1204, so set to c in statistical control 18-21 a For R-chart UCL  2.282 100.375  229.056 CL  100.375 LCL  100.375  b For x-bar chart UCL  415.3  0.729(100.375)  488.473 CL  415.3 LCL  415.3  0.729(100.375)  342.127 c out of control 18-23 b 82.46 c 12.33 d UCL  2.114 12.33  26.07 and LCL  12.33  e UCL  82.46  (0.577 12.33)  89.57 LCL  82.46  (0.577 12.33)  75.35 f There is a run of nine values above the centerline 18-25 b p  441/(300 50)  0.0294 s  (0.0294)(1  0.0294)/50  0.0239 UCL  0.0294  3(0.0239)  0.1011 LCL  0.0294  3(0.0239)  0.0423, so set to c Sample Number 301 302 303 p-bar 0.12 0.18 0.14 The process has gone out of control b 0.28, which is again above the UCL 18-27 a c  29.3333 UCL  29.3333  3( 29.3333 )  45.5814 LCL  29.3333  3( 29.3333 )  13.0852 b The process seems to be out of control c Need to convert the data to bags per passenger by dividing bags by 40 and then developing a u-chart based on the explanation in optional topics CL  29.333/40  0.7333 UCL  0.7333  d 18-29 a b c 0.7333/40  1.1395 LCL  0.7333  0.7333/40  0.3271 Process is out of control A2  1.023 D3  0.0 and D4  2.575 UCL  
2.575 0.80  2.06 LCL  0.80  UCL  2.33  (1.023 0.80)  3.1468 LCL  2.33  (1.023 0.80)  1.512 (475) ANSWERS TO SELECTED ODD-NUMBERED PROBLEMS 18-31 The centerline of the control chart is the average proportion of defective  720/(20 150)  0.240 For 3-sigma control chart limits we find UCL  0.240  0.240 (1  0.240)  0.345 150 LCL  0.240  0.240 (1  0.240)  0.135 150 18-33 p-chart; p  0.0524 Sp  0.0524 (1  0.0524)  0.0223 100 Lower Control Limit  0.0524  0.0223  0.0145, so set to Centerline  0.0524 Upper Control Limit  0.0524  0.0223  0.1193 in control 899 18-35 The appropriate control chart for monitoring this process is a c-chart The 3-sigma upper control limit is 23.00  (3 4.7958)  37.3875 The 3-sigma lower control limit is 23.00  (3 4.7958)  8.6125 Note that sample number with 38 defects is above the upper control limit 18-37 a x-bar chart: CL  0.7499 UCL  0.7499  (0.577)(0.0115)  0.7565 LCL  0.7499  (0.577)(0.0115)  0.7433 R-chart: CL  0.0115 UCL  (0.0115)(2.114)  0.0243 LCL  (0.0115)(0)  b It now appears that the process is out of control (476) Glossary Adjusted R-squared A measure of the percentage of explained Box and Whisker Plot A graph that is composed of two parts: a variation in the dependent variable in a multiple regression model that takes into account the relationship between the sample size and the number of independent variables in the regression model box and the whiskers The box has a width that ranges from the first quartile (Q1) to the third quartile (Q3) A vertical line through the box is placed at the median Limits are located at a value that is 1.5 times the difference between Q1 and Q3 below Q1 and above Q3 The whiskers extend to the left to the lowest value within the limits and to the right to the highest value within the limits Aggregate Price Index An index that is used to measure the rate of change from a base period for a group of two or more items All-Inclusive Classes A set of classes that contains all the pos- sible data values Alternative Hypothesis The hypothesis that includes all popu- lation values not included in the null hypothesis The alternative hypothesis will be selected only if there is strong enough sample evidence to support it The alternative hypothesis is deemed to be true if the null hypothesis is rejected Arithmetic Average or Mean The sum of all values divided by the number of values Autocorrelation Correlation of the error terms (residuals) occurs when the residuals at points in time are related Balanced Design An experiment has a balanced design if the factor levels have equal sample sizes Bar Chart A graphical representation of a categorical data set in which a rectangle or bar is drawn over each category or class The length or height of each bar represents the frequency or percentage of observations or some other measure associated with the category The bars may be vertical or horizontal The bars may all be the same color or they may be different colors depicting different categories Additionally, multiple variables can be graphed on the same bar chart Base Period Index The time-series value to which all other val- ues in the time series are compared The index number for the base period is defined as 100 Between-Sample Variation Dispersion among the factor sam- ple means is called the between-sample variation Business Statistics A collection of procedures and techniques that are used to convert data into meaningful information in a business environment Census An enumeration of the entire set of measurements taken from the whole population 
The Central Limit Theorem For simple random samples of n ob- servations taken from a population with mean m and standard deviation s, regardless of the population’s distribution, provided the sample size is sufficiently large, the distribution of the sam– ple means, x, will be approximately normal with a mean equal to the population mean (m x  m x ) and a standard deviation equal to the population standard deviation divided by the square root of the sample size s x  s  n The larger the sample size, the better the approximation to the normal distribution Class Boundaries The upper and lower values of each class Class Width The distance between the lowest possible value and the highest possible value for a frequency class Classical Probability Assessment The method of determining probability based on the ratio of the number of ways an outcome or event of interest can occur to the number of ways any outcome or event can occur when the individual outcomes are equally likely Closed-End Questions Questions that require the respondent to select from a short list of defined choices Cluster Sampling A method by which the population is divided distorting it; different from a random error which may distort on any one occasion but balances out on the average into groups, or clusters, that are each intended to be minipopulations A simple random sample of m clusters is selected The items chosen from a cluster can be selected using any probability sampling technique Binomial Probability Distribution Characteristics A distribu- Coefficient of Determination The portion of the total variation Bias An effect which alters a statistical result by systematically tion that gives the probability of x successes in n trials in a process that meets the following conditions: A trial has only two possible outcomes: a success or a failure There is a fixed number, n, of identical trials The trials of the experiment are independent of each other This means that if one outcome is a success, this does not influence the chance of another outcome being a success The process must be consistent in generating successes and failures That is, the probability, p, associated with a success remains constant from trial to trial If p represents the probability of a success, then 1p  q is the probability of a failure 900 in the dependent variable that is explained by its relationship with the independent variable The coefficient of determination is also called R-squared and is denoted as R2 Coefficient of Partial Determination The measure of the mar- ginal contribution of each independent variable, given that other independent variables are in the model Coefficient of Variation The ratio of the standard deviation to the mean expressed as a percentage The coefficient of variation is used to measure variation relative to the mean Complement The complement of an event E is the collection of all possible outcomes not contained in event E (477) GLOSSARY Completely Randomized Design An experiment is completely randomized if it consists of the independent random selection of observations representing each level of one factor Composite Model The model that contains both the basic terms and the interaction terms Conditional Probability The probability that an event will occur given that some other event has already happened Confidence Interval An interval developed from sample values | 901 standard deviation can be calculated from a sample of size n, the degrees of freedom are equal to n  k Demographic Questions Questions relating to the 
respondents’ characteristics, backgrounds, and attributes Dependent Events Two events are dependent if the occurrence of one event impacts the probability of the other event occurring Dependent Variable A variable whose values are thought to be such that if all possible intervals of a given width were constructed, a percentage of these intervals, known as the confidence level, would include the true population parameter a function of, or dependent on, the values of another variable called the independent variable On a scatter plot, the dependent variable is placed on the y axis and is often called the response variable Confidence Level The percentage of all possible confidence Discrete Data Data that can take on a countable number of pos- intervals that will contain the true population parameter Consistent Estimator An unbiased estimator is said to be a con- sistent estimator if the difference between the estimator and the parameter tends to become smaller as the sample size becomes larger Contingency Table A table used to classify sample observations according to two or more identifiable characteristics It is also called a crosstabulation table Continuous Data Data whose possible values are uncountable and which may assume any value in an interval Continuous Random Variables Random variables that can as- sume an uncountably infinite number of values Convenience Sampling A sampling technique that selects the items from the population based on accessibility and ease of selection Correlation Coefficient A quantitative measure of the strength of the linear relationship between two variables The correlation ranges from 1.0 to 1.0 A correlation of 1.0 indicates a perfect linear relationship, whereas a correlation of indicates no linear relationship Correlation Matrix A table showing the pairwise correlations between all variables (dependent and independent) Critical Value The value corresponding to a significance level that determines those test statistics that lead to rejecting the null hypothesis and those that lead to a decision not to reject the null hypothesis Cross-Sectional Data A set of data values observed at a fixed point in time Cumulative Frequency Distribution A summary of a set of data that displays the number of observations with values less than or equal to the upper limit of each of its classes Cumulative Relative Frequency Distribution A summary of a sible values Discrete Random Variable A random variable that can only as- sume a finite number of values or an infinite sequence of values such as 0, 1, Dummy Variable A variable that is assigned a value equal to ei- ther or 1, depending on whether the observation possesses a given characteristic Empirical Rule If the data distribution is bell-shaped, then the interval m 1s contains approximately 68% of the values m 2s contains approximately 95% of the values m 3s contains virtually all of the data values Equal-Width Classes The distance between the lowest possi- ble value and the highest possible value in each class is equal for all classes Event A collection of experimental outcomes Expected Value The mean of a probability distribution The av- erage value when the experiment that generates values for the random variable is repeated over the long run Experiment A process that produces a single outcome whose result cannot be predicted with certainty Experimental Design A plan for performing an experiment in which the variable of interest is defined One or more factors are identified to be manipulated, changed, or observed so that 
the impact (or influence) on the variable of interest can be measured or observed Experiment-Wide Error Rate The proportion of experiments in which at least one of the set of confidence intervals constructed does not contain the true value of the population parameter being estimated Exponential Smoothing A time-series and forecasting tech- set of data that displays the proportion of observations with values less than or equal to the upper limit of each of its classes nique that produces an exponentially weighted moving average in which each smoothing calculation or forecast is dependent on all previous observed values Cyclical Component A wavelike pattern within the time series External Validity A characteristic of an experiment whose results that repeats itself throughout the time series and has a recurrence period of more than one year can be generalized beyond the test environment so that the outcomes can be replicated when the experiment is repeated Data Array Data that have been arranged in numerical order Degrees of Freedom The number of independent data values available to estimate the population’s standard deviation If k parameters must be estimated before the population’s Factor A quantity under examination in an experiment as a pos- sible cause of variation in the response variable Forecasting Horizon The number of future periods covered by a forecast It is sometimes referred to as forecast lead time (478) 902 | GLOSSARY Forecasting Interval The frequency with which new forecasts are prepared Forecasting Period The unit of time for which forecasts are to be made Frequency Distribution A summary of a set of data that dis- Median The median is a center value that divides a data array ~ to denote the population median into two halves We use m and Md to denote the sample median Mode The mode is the value in a data set that occurs most frequently plays the number of observations in each of the distribution’s distinct categories or classes Model A representation of an actual system using either a phys- Frequency Histogram A graph of a frequency distribution with Model Diagnosis The process of determining how well a model the horizontal axis showing the classes, the vertical axis showing the frequency count, and (for equal class widths) the rectangles having a height equal to the frequency in each class fits past data and how well the model’s assumptions appear to be satisfied Hypergeometric Distribution The hypergeometric distribution is formed by the ratio of the number of ways an event of interest can occur over the total number of ways any event can occur Independent Events Two events are independent if the occur- rence of one event in no way influences the probability of the occurrence of the other event Independent Samples Samples selected from two or more pop- ulations in such a way that the occurrence of values in one sample has no influence on the probability of the occurrence of values in the other sample(s) Independent Variable A variable whose values are thought to impact the values of the dependent variable The independent variable, or explanatory variable, is often within the direct control of the decision maker On a scatter plot, the independent variable, or explanatory variable, is graphed on the x axis Interaction The case in which one independent variable (such as x2) affects the relationship between another independent variable (x1) and a dependent variable (y) Internal Validity A characteristic of an experiment in which data ical or a mathematical portrayal Model 
Fitting The process of estimating the specified model’s parameters to achieve an adequate fit of the historical data Model Specification The process of selecting the forecasting technique to be used in a particular situation Moving Average The successive averages of n consecutive val- ues in a time series Multicollinearity A high correlation between two independent variables such that the two variables contribute redundant information to the model When highly correlated independent variables are included in the regression model, they can adversely affect the regression results Multiple Coefficient of Determination The proportion of the total variation of the dependent variable in a multiple regression model that is explained by its relationship to the independent variables It is, as is the case in the simple linear model, called R-squared and is denoted as R2 Mutually Exclusive Classes Classes that not overlap so that a data value can be placed in only one class are collected in such a way as to eliminate the effects of variables within the experimental environment that are not of interest to the researcher Mutually Exclusive Events Two events are mutually exclusive Interquartile Range The interquartile range is a measure of vari- Nonstatistical Sampling Techniques Those methods of select- if the occurrence of one event precludes the occurrence of the other event ation that is determined by computing the difference between the third and first quartiles ing samples using convenience, judgment, or other nonchance processes Least Squares Criterion The criterion for determining a regres- Normal Distribution The normal distribution is a bell-shaped sion line that minimizes the sum of squared prediction errors Left-Skewed Data A data distribution is left skewed if the mean for the data is smaller than the median Levels The categories, measurements, or strata of a factor of in- terest in the current experiment Line Chart A two-dimensional chart showing time on the hori- zontal axis and the variable of interest on the vertical axis Linear Trend A long-term increase or decrease in a time series in which the rate of change is relatively constant Margin of Error The amount that is added and subtracted to the point estimate to determine the endpoints of the confidence interval Also, a measure of how close we expect the point estimate to be to the population parameter with the specified level of confidence Mean A numerical measure of the center of a set of quantitative measures computed by dividing the sum of the values by the number of values in the data distribution with the following properties: It is unimodal; that is, the normal distribution peaks at a single value It is symmetrical; this means that the two areas under the curve between the mean and any two points equidistant on either side of the mean are identical One side of the distribution is the mirror image of the other side The mean, median, and mode are equal The normal approaches the horizontal axis on either side of the mean toward plus and minus infinity () In more formal terms, the normal distribution is asymptotic to the x axis The amount of variation in the random variable determines the height and spread of the normal distribution Null Hypothesis The statement about the population parameter that will be assumed to be true during the conduct of the hypothesis test The null hypothesis will be rejected only if the sample data provide substantial contradictory evidence Ogive The graphical representation of the cumulative relative 
frequency A line is connected to points plotted above the (479) GLOSSARY | 903 upper limit of each class at a height corresponding to the cumulative relative frequency of the event occurring The definition given is for a countable number of events One-Tailed Test A hypothesis test in which the entire rejection p-Value The probability (assuming the null hypothesis is true) of region is located in one tail of the sampling distribution In a one-tailed test, the entire alpha level is located in one tail of the distribution obtaining a test statistic at least as extreme as the test statistic we calculated from the sample The p-value is also known as the observed significance level One-Way Analysis of Variance An analysis of variance design Qualitative Data Data whose measurement scale is inherently in which independent samples are obtained from two or more levels of a single factor for the purpose of testing whether the levels have equal means Open-End Questions Questions that allow respondents the freedom to respond with any value, words, or statements of their own choosing Paired Samples Samples that are selected in such a way that categorical Quantitative Data Measurements whose values are inherently numerical Quartiles Quartiles in a data array are those values that divide the data set into four equal-sized groups The median corresponds to the second quartile values in one sample are matched with the values in the second sample for the purpose of controlling for extraneous factors Another term for paired samples is dependent samples Random Component Changes in time-series data that are un- Parameter A measure computed from the entire population As Random Variable A variable that takes on different numerical long as the population does not change, the value of the parameter will not change Range The range is a measure of variation that is computed by Pareto Principle 80% of the problems come from 20% of the causes predictable and cannot be associated with a trend, seasonal, or cyclical component values based on chance finding the difference between the maximum and minimum values in a data set Percentiles The pth percentile in a data array is a value that di- Regression Hyperplane The multiple regression equivalent of vides the data set into two parts The lower segment contains at least p% and the upper segment contains at least (100  p)% of the data The 50th percentile is the median Regression Slope Coefficient The average change in the de- Pie Chart A graph in the shape of a circle The circle is divided into “slices” corresponding to the categories or classes to be displayed The size of each slice is proportional to the magnitude of the displayed variable associated with each category or class Pilot Sample A sample taken from the population of interest of a size smaller than the anticipated sample size that is used to provide an estimate for the population standard deviation Point Estimate A single statistic, determined from a sample, the simple regression line The plane typically has a different slope for each independent variable pendent variable for a unit change in the independent variable The slope coefficient may be positive or negative, depending on the relationship between the two variables Relative Frequency The proportion of total observations that are in a given category Relative frequency is computed by dividing the frequency in a category by the total number of observations The relative frequencies can be converted to percentages by multiplying by 100 Relative Frequency 
Assessment The method that defines that is used to estimate the corresponding population parameter probability as the number of times an event occurs divided by the total number of times an experiment is performed in a large number of trials Population Mean The average for all values in the population Research Hypothesis The hypothesis the decision maker at- computed by dividing the sum of all values by the population size Population Proportion The fraction of values in a population that have a specific attribute Population The set of all objects or individuals of interest or the measurements obtained from all objects or individuals of interest Power The probability that the hypothesis test will correctly re- ject the null hypothesis when the null hypothesis is false Power Curve A graph showing the probability that the hypoth- esis test will correctly reject a false null hypothesis for a range of possible “true” values for the population parameter Probability The chance that a particular event will occur The probability value will be in the range to A value of means the event will not occur A probability of means the event will occur Anything between and reflects the uncertainty tempts to demonstrate to be true Because this is the hypothesis deemed to be the most important to the decision maker, it will be declared true only if the sample data strongly indicates that it is true Residual The difference between the actual value of y and the predicted value yˆ for a given level of the independent variable, x Right-Skewed Data A data distribution is right skewed if the mean for the data is larger than the median Sample A subset of the population Sample Mean The average for all values in the sample computed by dividing the sum of all sample values by the sample size Sample Proportion The fraction of items in a sample that have the attribute of interest Sample Space The collection of all outcomes that can result from a selection, decision, or experiment (480) 904 | GLOSSARY Sampling Distribution The distribution of all possible values of Statistical Inference Procedures Procedures that allow a de- a statistic for a given sample size that has been randomly selected from a population cision maker to reach a conclusion about a set of data based on a subset of that data Sampling Error The difference between a measure computed Statistical Sampling Techniques Those sampling methods that from a sample (a statistic) and the corresponding measure computed from the population (a parameter) Scatter Diagram, or Scatter Plot A two-dimensional graph of use selection techniques based on chance selection Stratified Random Sampling A statistical sampling method in plotted points in which the vertical axis represents values of one quantitative variable and the horizontal axis represents values of the other quantitative variable Each plotted point has coordinates whose values are obtained from the respective variables which the population is divided into subgroups called strata so that each population item belongs to only one stratum The objective is to form strata such that the population values of interest within each stratum are as much alike as possible Sample items are selected from each stratum using the simple random sampling method Scatter Plot A two-dimensional plot showing the values for the Structured Interview Interviews in which the questions are joint occurrence of two quantitative variables The scatter plot may be used to graphically represent the relationship between two variables It is also known 
as a scatter diagram Seasonal Component A wavelike pattern that is repeated throughout a time series and has a recurrence period of at most one year Seasonal Index A number used to quantify the effect of season- ality in time-series data Seasonally Unadjusted Forecast A forecast made for seasonal data that does not include an adjustment for the seasonal component in the time series Significance Level The maximum allowable probability of committing a Type I statistical error The probability is denoted by the symbol a Simple Linear Regression The method of regression analysis in which a single independent variable is used to predict the dependent variable Simple Random Sample A sample selected in such a manner that each possible sample of a given size has an equal chance of being selected Simple Random Sampling A method of selecting items from a population such that every possible sample of a specified size has an equal chance of being selected Skewed Data Data sets that are not symmetric For skewed data, the mean will be larger or smaller than the median Standard Deviation The standard deviation is the positive square root of the variance Standard Error A value that measures the spread of the sample means around the population mean The standard error is reduced when the sample size is increased Standard Normal Distribution A normal distribution that has a mean  0.0 and a standard deviation  1.0 The horizontal axis is scaled in z-values that measure the number of standard deviations a point is from the mean Values above the mean have positive z-values Values below the mean have negative z-values Standardized Data Values The number of standard deviations scripted Student’s t-Distributions A family of distributions that is bell- shaped and symmetric like the standard normal distribution but with greater area in the tails Each distribution in the t-family is defined by its degrees of freedom As the degrees of freedom increase, the t-distribution approaches the normal distribution Subjective Probability Assessment The method that defines probability of an event as reflecting a decision maker’s state of mind regarding the chances that the particular event will occur Symmetric Data Data sets whose values are evenly spread around the center For symmetric data, the mean and median are equal Systematic Random Sampling A statistical sampling technique that involves selecting every kth item in the population after a randomly selected starting point between and k The value of k is determined as the ratio of the population size over the desired sample size Tchebysheff’s Theorem Regardless of how data are distributed, at least (11/k 2) of the values will fall within k standard deviations of the mean For example: ⎛ ⎝ At least ⎜  1⎞ ⎟   0% of the values will fall within k  12 ⎠ standard deviation of the mean ⎛ 1⎞ At least ⎜  ⎟   75% of the values will lie within k  ⎝ ⎠ standard deviations of the mean ⎛ 1⎞ At least ⎜  ⎟   89% of the values will lie within k  ⎝ ⎠ standard deviations of the mean Test Statistic A function of the sampled observations that pro- vides a basis for testing a statistical hypothesis a value is from the mean Standardized data values are sometimes referred to as z scores Time-Series Data A set of consecutive data values observed at Statistic A measure computed from a sample that has been Total Quality Management A journey to excellence in which selected from a population The value of the statistic will depend on which sample is selected everyone in the organization is focused on 
continuous process improvement directed toward increased customer satisfaction successive points in time (481) GLOSSARY Total Variation The aggregate dispersion of the individual data values across the various factor levels is called the total variation in the data Two-Tailed Test A hypothesis test in which the entire rejection region is split into the two tails of the sampling distribution In a two-tailed test, the alpha level is split evenly between the two tails Type I Error Rejecting the null hypothesis when it is, in fact, true Type II Error Failing to reject the null hypothesis when it is, in fact, false Unbiased Estimator A characteristic of certain statistics in | 905 Variance The population variance is the average of the squared distances of the data values from the mean Variance Inflation Factor A measure of how much the vari- ance of an estimated regression coefficient increases if the independent variables are correlated A VIF equal to 1.0 for a given independent variable indicates that this independent variable is not correlated with the remaining independent variables in the model The greater the multicollinearity, the larger the VIF Variation A set of data exhibits variation if all the data are not the same value which the average of all possible values of the sample statistic equals a parameter, no matter the value of the parameter Weighted mean The mean value of data values that have been Unstructured Interview Interviews that begin with one or more Within-Sample Variation The dispersion that exists among the broadly stated questions, with further questions being based on the responses data values within a particular factor level is called the within-sample variation weighted according to their relative importance (482) Index A Addition rule Individual outcomes, 160–162 Mutually exclusive events, 167 Two events, 163–167 Adjusted R-square Equation, 644, 701 Aggregate price index Defined, 715 Unweighted, 716, 763 All-inclusive classes, 39 Alpha, controlling, 378 Alternative hypothesis Defined, 347 Analysis of variance Assumptions, 477, 478–481, 498, 512 Between-sample variation, 477 Experiment-wide error rate, 488 Fisher’s Least Significant Difference test, 505–506, 522 Fixed effects, 493 Hartley’s F-test statistic, 480, 485, 522 Kruskal-Wallis one-way, 789–794 One-way ANOVA, 476–493 One-way ANOVA table, 483 Random effects, 493 Randomized block ANOVA, 497–505 Total variation, 477 Tukey-Kramer, 488–493, 522 Two-factor ANOVA, 509–517 Within-sample variation, 477 Arithmetic mean, 4, 97 Autocorrelation Defined, 728 Durbin-Watson statistic, 728–729, 763 Average, See also Mean Equation, Moving average, 739, 740 Ratio-to-moving-average, 763 Sample equation, 265 Average subgroup range, 812, 832 B Backward elimination stepwise, 679–683 Balanced design, 476, 480 Defined, 476 Bar Chart, Cluster, 58 Column, 55 Defined, 57 Excel examples, 59 Horizontal, 55–56 Minitab example, 59 Pie Chart versus, 61 Summary steps, 56 Base period index Defined, 714 Simple index number, 714, 763 Bayes’ Theorem, 175–179 Equation, 176 Best subsets regression, 683–686 Beta Calculating, 376–377, 379–382 Controlling, 378–382 Power, 382 Proportion, 380–381 Summary steps, 378 Two-tailed test, 379–380 Between-sample variation Defined, 477 906 Bias Interviewer, 12 Nonresponse, 12 Observer, 12–13 Selection, 12 Binomial distribution Characteristics, 199–200 Defined, 199 Excel example, 206 Formula, 202 Mean, 205–207 Minitab example, 207 Shapes, 208 Standard deviation, 207–208 Table, 204–205 Binomial formula, 202–203 
Bivariate normal distribution, 585 Box and whisker plots ANOVA assumptions, 479 Defined, 100 Summary steps, 100 Brainstorming, 807 Business statistics Defined, C c-charts, 824–827, 833 Control limits, 825 Excel example, 826 Minitab example, 826 Standard deviation, 825, 833 Census Defined, 15 Centered moving average, 740 Central limit theorem, 282–286 Examples, 284–285 Theorem 4, 283 Central tendency, applying measures, 94–97 Charts, Bar chart, 54–60, 61 Box and whisker, 100–101 Histogram, 41–46 Line, 66–69 Pie, 60–61 Scatter diagram, 70–72, 581, 639–640 Scatter plot, 580 Stem and leaf diagrams, 62–63 Chi-square Assumptions, 450 Confidence interval, 454–455 Contingency analysis, 564 Contingency test statistic, 564, 574 Degrees of freedom, 450, 550, 554, 564 Goodness-of-fit, 548–559 Goodness-of-fit-test statistic, 550, 574 Sample size, 550 Single variance, 449–455, 471 Summary steps, 452 Test for single population variance, 450, 452 Test limitations, 569 Class boundaries, 39 Classes All-inclusive, 39 Boundaries, 39 Equal-width, 39 Mutually exclusive, 39 Classical probability assessment Defined, 152 Equation, 152 Class width Equation, 39 Closed-end questions, Cluster sampling, 18–19 Primary clusters, 19 Coefficient of determination Adjusted R-square, 644, 701 Defined, 602 Equation, 602, 625 Hypothesis test, 603, 643–644 Multiple regression, 642–643 Single independent variable case, 602 Test statistic, 603 Coefficient of partial determination, 678 Coefficient of variation Defined, 119 Population equation, 119 Sample equation, 119 Combinations Counting rule equation, 201 Complement Defined, 162 Rule, 162–163 Completely randomized design Defined, 476 Composite polynomial model Defined, 669 Excel example, 669–670 Minitab example, 670–671 Conditional probability Bayes’ theorem, 175–179 Defined, 167 Independent events, 171–172 Rule for independent events, 171 Tree diagrams, 170–171 Two events, 168 Confidence interval Average y, given x, 616 Critical value, 309, 340 Defined, 306 Difference between means, 398 Estimate, 399, 441, 471 Excel example, 307, 318 Flow diagram, 340 General format, 309, 340, 398, 441 Impact of sample size, 314 Larger sample sizes, 320 Margin of error, 311–312 Minitab example, 307, 318 Paired samples, 423–427 Population mean, 308–314 Population mean estimate, 317 Proportion, 331–333 Regression slope, 614, 625, 649–651, 701 Sample size requirements, 324–327 Standard error of mean, 308 Summary steps, 310, 319 t-distribution, 314–320, 400–406 two proportions, 432–433 Unequal variances, 404 Variance, 455 Consistent estimator Defined, 280 Consumer price index, 719–720 Contingency analysis, 562–569 Chi-square test statistic, 564, 574 Contingency table, 562–569 Excel example, 568 Expected cell frequencies, 567, 574 (483) INDEX Marginal frequencies, 563 Minitab, 568 r c contingency analysis, 566–569 2 contingency analysis, 564–566 Contingency table Defined, 563 Continuous data, 36 Continuous probability distributions Exponential distribution, 252–254 Normal distribution, 234–245 Uniform distribution, 249–252 Continuous random variables Defined, 192 Convenience sampling, 15 Correlation coefficient Assumptions, 585 Cause-and-effect, 586 Defined, 580, 638 Equation, 580–581, 625, 638, 701 Excel example, 582–583, 638 Hypothesis test, 584 Minitab example, 582–583, 639 Test statistic, 584, 625 Correlation matrix, 638 Counting rule Combinations, 201–202 Critical value Calculating, 352–353 Commonly used values, 309, 340 Confidence interval estimate, 310 Defined, 352 Hypothesis testing, 351–352 
Crosby, Philip B., 805–806 Cumulative frequency distribution Defined, 40 Relative frequency, 40 Cyclical component Defined, 713 D Data Categorizing, 23–24 Classification, 27 Discrete, 33 Hierarchy, 21–23 Interval, 22 Measurement levels, 21–24 Nominal, 21–22 Ordinal, 22 Qualitative, 21 Quantitative, 21 Ratio, 22–23 Skewed, 92–93 Symmetric, 92–93 Time-series, 21 Data array, 37, 98 Data collection methods, 7, 27 Array, 37 Bar codes, 11–12 Direct observation, 7, 11 Experiments, 7, Issues, 12–13 Personal interview, 11 Telephone surveys, 7–9 Written questionnaires, 7, 9–11 Data frequency distribution See Grouped data frequency distribution Decision rule Hypothesis testing, 351–352, 354–357 Deflating time-series data, 721, 763 Formula, 721, 763 Degrees of freedom Chi-square, 450 One sample, 315 Student’s t-distribution, 314–315 Unequal means, 441 Unequal variances, 419, 442 Deming, W Edwards Cycle, 806 Fourteen points, 805 Variation, 810 Demographic Questions, Dependent events Defined, 150 Dependent variable, 70, 634 Descriptive statistical techniques Data-level issues, 102–103 Deseasonalization Equation, 742, 763 Excel examples, 743 Direct Observation, 7, 11 Discrete probability distributions, 192–195 Binomial distribution, 199–208 Hypergeometric distribution, 217, 219–223 Poisson distribution, 213–217 Discrete random variable Defined, 192 Displaying graphically, 192–193 Expected value equation, 194 Mean, 193–194 Standard deviation, 194–195 Dummy variables, 654–657 Defined, 654 Excel example, 746 Seasonality, 744–746 Durbin-Watson statistic Equation, 729, 763 Test for autocorrelation, 730–732 E Empirical rule Defined, 120 Empty classes, 39 Equal-width classes, 39 Error See also Standard error Experimental-wide error rate, 488 Forecast, 727, 763 Margin of error, 311–312, 340 Mean absolute percent error, 758, 763 Measurement, 13 Standard error of mean, 308 Sum of squares, 522 Type I, 350 Type II, 350, 376–383 Estimate Confidence interval, 306, 398 Difference between means, 398 Paired difference, 423–427 Point, 306 Testing flow diagram, 441 Estimation, 5, 306 Sample size for population, 326–327 Event Defined, 149–150 Dependent, 150–151 Independent, 150–151 Mutually exclusive, 150 Expected cell frequencies Equation, 567, 574 Expected value Binomial distribution, 205–207 Defined, 193 Equation, 194 Experimental design, Experimental-wide error rate Defined, 488 Experiments, 7, Exponential probability distribution Density function, 252 Excel example, 253–254 Minitab example, 254 Probability, 253 Exponential smoothing Defined, 750 Double smoothing, 755–758, 763 Equation, 763 Excel examples, 751, 753–754, 756–757 Minitab examples, 754, 758 Single smoothing, 750–755 Smoothing constant, 750 External validity, 13 F Factor Defined, 476 Finite population correction factor, 280 Fishbone diagram, 807 Fisher’s Least Significant Difference test, 505–506, 522 Fixed effects, 493 Flowcharts, 807 Forecast bias Equation, 733, 763 Forecasting Autocorrelation, 728–732 Bias, 733, 763 Cyclical component, 713 Dummy variables, 744–746 Durbin-Watson statistic, 728–730, 763 Error, 727, 763 Excel example, 734–737 Exponential smoothing, 750–758 Horizon, 710 Interval, 710 Linear trend, 711, 725, 763 Mean absolute deviation, 727, 763 Mean absolute percent error, 758, 763 Mean squared error, 727, 763 Minitab example, 735–736 Model diagnosis, 710 Model fitting, 710 Model specification, 710 Nonlinear trend, 734–738 Period, 710 Random component, 713 Residual, 727–728 Seasonal adjustment, 738–746 Seasonal component, 712–713 Seasonally 
unadjusted, 743 Trend-base technique, 724–746 True forecasts, 732–734 Forward selection stepwise, 678 Forward stepwise regression, 683 Frequency distribution, 32–41 Classes, 39 Data array, 37 Defined, 33 Discrete data, 33 Grouped data, 36–41 Joint, 47–50 Qualitative, 36 Quantitative, 35 Relative, 33–35 Tables, 32–34 Frequency histogram Defined, 41 Issues with Excel, 44 Relative frequency, 45–46 Summary steps, 44 F-test Assumptions, 459 Coefficient of determination, 603 Excel example, 464–465 Minitab, 464–465 Multiple regression, 643, 701 Test statistic, 459 Two variances, 458–467 907 (484) 908 INDEX G Goodness-of-fit tests, 548–559 Chi-square test, 548–559 Chi-square test statistic, 550, 574 Degrees of freedom, 550, 554 Excel example, 553–554 Minitab, 554–555 Sample size, 550 Grouped data frequency distribution All-inclusive classes, 39 Class boundaries, 39 Classes, 39 Class width, 39 Continuous data, 36 Cumulative frequency, 40 Data array, 37 Empty classes, 39 Equal-width classes, 39 Excel example, 37 Minitab example, 37 Mutually exclusive classes, 39 Number of classes, 39 Steps, 39–40 H Hartley’s F-test statistic, 480, 485, 522 Histogram, 3, 41–46 Examples, 42 Excel example, 43 Issues with Excel, 44 Minitab example, 44 Quality, 807 Relative frequency, 45–46 Summary steps, 44 Types of information, 41–42 Hypergeometric distribution Multiple possible outcomes, 222–223 Two possible outcomes, 220–221 Hypothesis Alternative, 347 ANOVA, 477 Null, 347 Research, 348 Summary steps, 349 Hypothesis testing, 5, 347–372 Alternative hypothesis, 347 Calculating beta, 376–377 Chi-square test, 449–455 Controlling alpha and beta, 378–382 Correlation coefficient, 584, 638–639 Critical value, 351–353 Decision rule, 351–352, 354–357 Difference between two population proportions, 433–437 Excel example, 364, 415–416, 435 Flow diagram, 441 F-test, 458–467 Median, 771–785 Minitab, 365, 416–417, 436 Multiple regression analysis, 643–644 Nonparametric tests, 770–803 Null hypothesis, 347 One-tailed test, 358 Paired samples, 427–428 Population mean, 347–365 Population proportion, 368–372 Power, 382–383 Power curve, 382 Procedures, deciding among, 388 p-value, 357–358, 412 Significance level, 352 Simple regression coefficient, 606 Single population variance, 452 Summary steps, 355, 362, 370, 410, 452 t-test statistic, 361–362, 412, 427 Two means, 409–419 Two-tailed test, 358 Two variances, 458–467 Type I error, 350 Type II error, 350, 376–383 Types of tests, 358–359 z-test statistic, 354 I Imai, Masaaki, 806 Independent events Conditional probability rule, 171 Defined, 150 Independent samples, 398 Defined, 459 Independent variable, 70, 634 Index numbers, 714–721, 763 Aggregate price index, 715–717 Base period index, 714 Consumer price, 719–720 Deflating time-series data, 721, 763 Laspeyres, 718–719, 763 Paasche, 717–718, 763 Producer price, 720 Simple index number, 714, 763 Stock market, 720–721 Unweighted aggregate price, 716, 763 Inferences, Interaction Cautions, 517 Defined, 669 Explained, 512, 514–517 Partial F-test, 671–674, 701 Polynomial regression model, 667–671 Internal validity, 13 Interquartile Range Defined, 108 Equation, 108 Interval data, 22 Interviewer bias, 12 Interviews Structured, 11 Unstructured, 11 Ishikawa, Kauro, 806, 807 J Joint frequency distribution, 47–50 Excel example, 48–49 Minitab example, 49–50 Judgment sampling, 15 Juran, Joseph, 805 Ten steps, 806 K kaizen, 806 Kruskal-Wallis one-way ANOVA Assumptions, 790 Correction, 799 Correction for ties, 794, 799 Excel example, 792 H-statistic, 
791, 794, 799 H-statistic corrected for tied rankings, 799 Hypotheses, 790 Limitations, 793–794 Minitab example, 793 Steps, 790–793 L Laspeyres index, 718–719, 763 Equation, 718, 763 Least squares criterion Defined, 592 Equations, 594, 725, 763 Linear trend Defined, 711 Model, 725, 763 Line charts, 66–69 Excel examples, 67–69 Minitab examples, 67–69 Summary steps, 68 Location measures Percentiles, 98–99 Quartiles, 99–100 M MAD, 727–728 Mann-Whitney U-test Assumptions, 777 Critical value, 779 Equations, 798 Hypotheses, 777 Large sample test, 780–785 Minitab example, 780 Steps, 777–779 Test statistic, 785, 798 U-statistics, 778, 780–782, 798 MAPE, 758, 763 Margin of error Defined, 311 Equation, 312, 340 Proportions, 340 Mean Advantages and disadvantages, 103 Binomial distribution, 205–207 c-charts, 824–827, 833 Defined, 86 Discrete random variable, 193–194 Excel example, 89, 95–96 Expected value, 194 Extreme values, 90–91 Hypothesis test, 347–365 Minitab example, 89–90 Poisson distribution, 217 Population, determining required sample size, 325–326 Population equation, 86, 265 Sample equation, 90, 265, 266 Sampling distribution of a proportion, 292 Summary steps, 87 Uniform distribution, 251 U-statistics, 778, 798 Weighted, 97–98 Wilcoxon, 784 Mean absolute deviation Equation, 727, 763 Mean absolute percent error Equation, 758, 763 Mean squared error Equation, 727, 763 Mean subgroup proportion, 822, 832 Measurement error, 13 Median Advantages and disadvantages, 103 Data array, 91 Defined, 91 Excel example, 95–96 Hypothesis test, 771–785 Index point, 91 Issues with Excel, 96–97 Mode Advantages and disadvantages, 103 Defined, 93 Model Building, 636, 637 Diagnosis, 637, 643, 710 Model fitting, 710 Specification, 636, 637, 710 Model building concepts, 636 Summary steps, 637 Moving average Centered, 740 Defined, 739 (485) INDEX Multicollinearity Defined, 647 Face validity, 647 Variance inflation factor, 648–649, 701 Multiple coefficient of determination, 642–643, 701 Equation, 642 Multiple regression analysis, 634–686 Aptness of the model, 689–697 Assumptions, 634, 689 Coefficient of determination, 642–643, 701 Correlation coefficient, 638–639 Dependent variable, 634 Diagnosis, 637, 643 Dummy variables, 654–657 Estimated model, 634, 701 Excel example, 640–641 Hyperplane, 635 Independent variable, 634 Interval estimate for slope, 649–651 Minitab example, 642 Model building, 636–651 Multicollinearity, 647–649 Nonlinear relationships, 661–667 Partial F-test, 671–674, 701 Polynomial, 662–663, 701 Population model, 634, 701 Scatter plots, 639–640 Significance test, 643–644 Standard error of the estimate, 646, 701 Stepwise regression, 678–686 Summary steps, 637 Multiplication probability rule, 172 Independent events, 174–175 Tree diagram, 173–174 Two events, 172–173 Multiplicative time-series model Equation, 739, 763 Seasonal indexes, 738–746 Summary steps, 744 Mutually exclusive classes, 39 Mutually exclusive events Defined, 150 N Nominal data, 21–22 Nonlinear trend, 712, 734–738 Nonparametric statistics, 770–803 Kruskal-Wallis one-way ANOVA, 789–794 Mann-Whitney U test, 776–785, 798 Wilcoxon matched pairs test, 782–785, 798 Wilcoxon signed rank test, 771–774, 798 Nonresponse bias, 12 Nonstatistical sampling, 15 Convenience sampling, 15 Judgment sampling, 15 Ratio sampling, 15 Normal distribution, 234–245 Approximate areas under normal curve, 245 Defined, 234 Empirical rule, 245 Excel example, 242–243 Function, 235 Minitab example, 242, 244 Standard normal, 235–245 Standard normal table, 237–242 Steps, 
237 Summary steps, 237 Null hypothesis Claim, 348–349 Defined, 347 Research hypothesis, 348 Status quo, 347–348 Numerical statistical measures Summary, 129 O Observer bias, 12–13 Ogive, 45–46 One-tailed hypothesis test Defined, 358 One-way ANOVA Assumptions, 477 Balanced design, 476 Between-sample variation, 477 Completely randomized design, 476 Defined, 476 Excel example, 486–487, 492 Experimental-wide error rate, 488 Factor, 476 Fixed effects, 493 Hartley’s F-test statistic, 480, 485, 522 Levels, 476 Logic, 476 Minitab, 486–487, 492 Partitioning sums of squares, 477–478 Random effects, 493 Sum of squares between, 482, 522 Sum of squares within, 482, 522 Table, 483 Total sum of squares, 481, 522 Total variation, 477 Tukey-Kramer, 488–493, 522 Within-sample variation, 477 Open-end questions, 10–11 Ordinal data, 22 P Paasche index Equation, 717, 763 Paired sample Confidence interval estimation, 425 Defined, 423 Equation, 424 Point estimate, 426 Population mean, 426 Standard deviation, 425 Why use, 423–424 Parameters, 15 Defined, 86, 266 Unbiased estimator, 276 Pareto principle, 805 Partial F-test, 671–674, 701 Statistic formula, 672, 701 p-charts, 820–823 Control limits, 823, 833 Pearson product moment correlation, 581 Percentiles Defined, 98 Location index, 98 Summary steps, 99 Personal interviews, 11 Pie chart bar chart versus, 61 Defined, 60 Summary steps, 60 Pilot sample Defined, 326 Proportions, 340 Point estimate Defined, 306 Paired difference, 424 Poisson distribution, 213–217 Assumptions, 213 Equation, 214 Excel example, 218 Mean, 217 Minitab example, 218 Standard deviation, 217 Summary steps, 216 Table, 214–217 Polynomial regression model Composite model, 669 Equation, 662, 701 909 Excel example, 664, 666, 669–670 Interaction, 667–671 Minitab example, 665, 666, 670–671 Second order model, 662–663, 666 Third order model, 663 Population Defined, 14 Mean, 86–89, 265 Proportion, 289–290 Population model, multiple regression analysis, 634, 701 Power of the test Curve, 382 Defined, 382 Equation, 382, 388 Prediction interval for y given x, 616–618, 625 Probability Addition rule, 159–162 Classical assessment, 152–153 Conditional, 167 Defined, 147 Experiment, 147 Methods of assigning, 152–156 Relative frequency assessment, 153–155 Rules, 159–179 Rules summary and equations, 186 Sample space, 147 Subjective assessment, 155–156 Probability sampling, 16 Producer price index, 720 Proportions Confidence interval, 340 Estimation, 333–335 Hypothesis tests, 368–372 Pooled estimator, 434, 442 Population, 289–290 Sample proportion, 330 Sampling distribution, 289–294 Sampling error, 290 Standard error, 331 z-test statistic equation, 388 p-value, 357–358 Q Qualitative Data Defined, 21 Dummy variables, 654–657 Frequency distribution, 36 Qualitative forecasting, 710 Quality Basic tools, 806–807 Brainstorming, 807 Control charts, 807 Deming, 805 Fishbone diagram, 807 Flowcharts, 807 Juran, 805, 806 Scatter plots, 807 SPC, 807, 808–827 Total quality management, 805 Trend charts, 807 Quantitative Data Defined, 21 Frequency distribution, 35 Quantitative forecasting, 710 Quartiles Defined, 99 Issues with Excel, 100 Questions Closed-end, Demographic, Leading, 10 Open-end, 10–11 Poorly worded, 10–11 R Random component Defined, 713 (486) 910 INDEX Randomized complete block ANOVA, 497–505 Assumptions, 498 Excel example, 499–500, 501 Fisher’s Least Significant Difference test, 505–506, 522 Minitab, 499–500, 501 Partitioning sums of squares, 499, 522 Sum of squares blocking, 499, 522 Sum of squares within, 499, 522 
Randomized complete block ANOVA, 497–505; Assumptions, 498; Excel example, 499–500, 501; Fisher's Least Significant Difference test, 505–506, 522; Minitab, 499–500, 501; Partitioning sums of squares, 499, 522; Sum of squares blocking, 499, 522; Sum of squares within, 499, 522; Table, 500; Type II error, 502
Random sample, 16–17; Excel example, 17
Random variable: Continuous, 192; Defined, 192; Discrete, 192
Range: Defined, 107; Equation, 107; Interquartile, 109
Ratio data, 22–23
Ratio sampling, 15
Ratio-to-moving-average method, 739–740; Equation, 741, 763
Regression analysis: Aptness, 689–697; Assumptions, 590; Coefficient of determination, 602, 625, 642–643, 701; Confidence interval estimate, 614, 626, 649–651, 701; Descriptive purposes, 612–615; Dummy variables, 654–657, 671; Equations, 625; Excel examples, 595–598, 599, 600, 613, 656, 658; Exponential relationship, 661; Hyperplane, 635; Least squares criterion, 592; Least squares equations, 594, 625; Least squares regression properties, 596–599; Minitab examples, 598, 599, 601, 613, 657, 658; Multicollinearity, 647–649; Multiple regression, 634–686; Nonlinear relationships, 661–667; Partial F-test, 671–674, 701; Polynomial, 662–663; Prediction, 615–618, 625; Problems using, 618–620; Residual, 592, 597, 625, 727; R-squared, 602, 625, 642–643; Sample model, 592; Significance tests, 599–609; Simple linear model, 590, 625; Slope coefficients, 591; Standard error, 646, 701; Stepwise, 678–686; Summary steps, 608; Sum of squares error, 593; Sum of squares regression, 602; Test statistic for the slope, 607; Total sum of squares, 600, 625
Regression slope coefficient: Defined, 591; Excel example, 605; Intercept, 591; Interval estimate, 614, 626, 649–651; Minitab example, 606; Significance, 604–605, 645–646, 701; Slope, 591; Standard error, 604
Relative frequency, 33–35; Distributions, 36–41; Equation, 34; Histogram, 45–46
Relative frequency assessment: Defined, 153; Equation, 153; Issues, 155
Research hypothesis: Defined, 348
Residual: Assumptions, 689; Checking for linearity, 690; Corrective actions, 697; Defined, 592, 689; Equal variances, 692–693; Equation, 689, 701; Excel examples, 690–691; Forecasting error, 727; Independence, 693; Minitab examples, 690–691, 694–696; Normality, 693, 695; Plots, 691–694; Standardized residual, 695–697, 701; Sum of squared residuals, 597, 625
Review Sections: Chapters 1–3, 139–142; Chapters 8–12, 530–543

S
Sample: Defined, 14; Mean, 89–90, 265, 266; Proportion, 290; Size, 324–327
Sample size requirements: Equation, 325; Estimating sample mean, 324–327; Estimating sample proportion, 340; Pilot sample, 326–327
Sample space, 147–148; Tree Diagrams, 148–149
Sampling distribution of a proportion: Mean, 292; Standard error, 292; Summary steps, 294; Theorem 5, 292
Sampling distribution of the mean, 273–282; Central limit theorem, 282–286; Defined, 273; Excel example, 274–275; Minitab example, 275; Normal populations, 277–280; Proportions, 289–294; Steps, 285; Theorem 1, 276; Theorem 2, 276; Theorem 3, 278–279
Sampling error: Computing, 267; Defined, 265, 306; Equation, 265; Role of sample size, 268–269
Sampling techniques, 15–19, 27; Nonstatistical, 15; Statistical, 16
Scatter diagram/plot: Defined, 70, 580; Dependent variable, 70, 580; Examples, 580, 581, 640; Excel example, 71; Independent variables, 70, 580; Minitab example, 71; Multiple regression, 640; Quality, 807; Summary steps, 71
Seasonal component: Defined, 712
Seasonal index: Adjustment process steps, 744; Computing, 739–740; Defined, 739; Deseasonalization, 743, 763; Dummy variables, 744–746; Excel example, 740–742; Minitab example, 744; Multiplicative model, 739; Normalize, 741–742; Ratio-to-moving-average, 739–740
Selection bias, 12
Significance level: Defined, 352
Significance tests, 599–609
Simple index number: Equation, 714, 716
Simple linear regression: Assumptions, 590; Defined, 590; Equations, 625; Least squares criterion, 592; Summary steps, 608
Simple random sample, 16–17; Defined, 266
Skewed data: Defined, 92; Left-skewed, 93; Right-skewed, 93
Standard deviation, 112–115; Binomial distribution, 207–208; c-charts, 825, 833; Defined, 109; Discrete random variable, 194–195; Excel example, 114–115, 121; Minitab example, 114–115, 121; Poisson distribution, 217; Population standard deviation equation, 111; Population variance equation, 110; Regression model, 646–647; Sample equation, 112; Summary steps, 111; Uniform distribution, 251; U-statistics, 778, 798; Wilcoxon, 784
Standard error: Defined, 308; Difference between two means, 398, 441; Proportion, 331; Sampling distribution of a proportion, 292; Statistical process control, 822, 833
Standard error of regression slope: Equation, 604, 605; Graphed, 606
Standard error of the estimate: Equation, 604; Multiple regression equation, 646
Standardized data values, 122; Population equation, 122; Sample equation, 122; Summary steps, 123
Standardized residuals: Equation, 695, 701
Standard normal distribution, 235–245; Table, 237–239
States of nature, 350
Statistical inference procedures
Statistical inference tools: Nonstatistical sampling, 15; Statistical sampling techniques, 16
Statistical process control: Average subgroup means, 812, 832; Average subgroup range, 812, 832; c-charts, 824–827; Control limits, 810, 814–815, 818, 823, 825, 827, 832; Excel examples, 812, 820–821; Mean subgroup proportion, 822, 832; Minitab example, 812, 820–821; p-charts, 820–823; R-charts, 811–820; Signals, 818; Stability, 810; Standard deviation, 825; Standard error, 822, 833; Summary steps, 827; Variation, 807, 808–810; x̄ charts, 811–820
Statistical sampling: Cluster sampling, 18–19; Simple random sampling, 16–17; Stratified random sampling, 17–18; Systematic random sampling, 18
Statistics, 15; Defined, 86
Stem and leaf diagrams, 62–63; Summary steps, 62
Stepwise regression, 678–686; Backward elimination, 679–683; Best subsets, 683–686; Forward selection, 678; Standard, 683
Stratified random sampling, 17–18
Subjective probability assessment: Defined, 155
Sum of squares between: Equation, 478, 482, 522
Sum of squares blocking: Equation, 499, 522
Sum of squares error: Equation, 596, 601, 625; Interaction, 672–674
Sum of squares regression: Equation, 602
Sum of squares within: Equation, 482, 499, 522
Symmetric Data, 92–93
Systematic random sampling, 18

T
Tchebysheff's Theorem, 121–122
t-distribution: Assumptions, 315, 412; Defined, 314; Degrees of freedom, 314–315; Equation, 315; Table, 316; Two means, 400–406; Unequal variances, 404
Telephone surveys, 7–9
Test statistic: Correlation coefficient, 584, 625; Defined, 354; R-squared, 602; t-test, 361–362, 412; z-test, 354, 409–410
Time-series data: Components, 711–714; Defined, 21; Deseasonalization, 743, 763; Index numbers, 714–721; Laspeyres index, 718–719, 763; Linear trend, 711; Nonlinear trend, 712; Paasche index, 717–718, 763; Random component, 713; Seasonal component, 712–713; Trend, 711–712
Total quality management: Defined, 805
Total sum of squares: Equation, 481, 600, 625
Total variation: Defined, 477
Tree diagrams, 148–149, 170–171
Trend: Defined, 711; Excel example, 724, 726, 727–728; Forecasting technique, 724–746; Linear, 711, 725, 763; Minitab example, 724, 726; Nonlinear, 712; Quality chart, 807
t-test statistic: Assumption, 361; Correlation coefficient, 584, 638; Equation, 361, 388, 412, 427, 442; Paired samples, 427, 442; Population variances unknown and not assumed equal, 419, 442; Regression coefficient significance, 645, 701
Tukey-Kramer multiple comparisons, 488–493; Critical range equation, 522; Equation, 488
Two-factor ANOVA, 510–517; Assumptions, 512; Equations, 513, 522; Excel example, 514–516; Interaction, 512, 514–517; Minitab, 516; Partitioning sum of squares, 510–511, 522; Replications, 509–517
Two-tailed hypothesis test: Defined, 358; p-value, 359–361; Summary steps, 362
Type I error: Defined, 350
Type II error: Calculating beta, 376–377; Defined, 350

U
Unbiased estimator, 276
Uniform probability distribution: Density function, 250; Mean, 251; Standard deviation, 251
Unweighted aggregate price index: Equation, 716, 763

V
Validity: External, 13; Internal, 13
Variable: Dependent, 70; Independent, 70
Variance: Defined, 109; F-test statistic, 459; Population variance equation, 110; Sample equation, 112, 461, 471; Sample shortcut equation, 112; Shortcut equation, 110; Summary steps, 111
Variance inflation factor: Equation, 648, 701; Excel example, 648–649; Minitab example, 648–649
Variation, 107, 808; Components, 810; Sources, 808–810

W
Weighted aggregate price index, 717–719; Laspeyres index, 718–719, 763; Paasche index, 717–718, 763
Weighted Mean: Defined, 97; Population equation, 97; Sample equation, 97
Wilcoxon matched-pairs signed rank, 782–785; Assumptions, 782; Large sample test, 784–785; Test statistic, 785, 798; Ties, 784
Wilcoxon signed rank test, 771–774; Equation, 798; Hypotheses, 771; Large sample test statistic, 772; Minitab example, 773; Steps, 771–772
Within-sample variation: Defined, 477
Written questionnaires, 7, 9–11; Steps

Z
z-scores: Finite population correction, 280; Sampling distribution of mean, 280; Sampling distribution of p, 293; Standardized, 236; Standard normal distribution, 237, 245
z-test statistic: Defined, 354; Equation, proportion, 370, 388; Equation, sigma known, 354, 388; Equation, two means, 409, 441; Equation, two proportions, 434, 442

Values of t for Selected Probabilities

[Figure: t-distribution curve for df = 10, with area 0.05 in each tail beyond t = −1.8125 and t = +1.8125]

Probabilities (Or Areas Under t-Distribution Curve)

Conf. Level     0.1     0.3     0.5     0.7     0.8     0.9      0.95     0.98     0.99
One Tail        0.45    0.35    0.25    0.15    0.1     0.05     0.025    0.01     0.005
Two Tails       0.9     0.7     0.5     0.3     0.2     0.1      0.05     0.02     0.01
df                                   Values of t
  1    0.1584  0.5095  1.0000  1.9626  3.0777  6.3137  12.7062  31.8210  63.6559
  2    0.1421  0.4447  0.8165  1.3862  1.8856  2.9200   4.3027   6.9645   9.9250
  3    0.1366  0.4242  0.7649  1.2498  1.6377  2.3534   3.1824   4.5407   5.8408
  4    0.1338  0.4142  0.7407  1.1896  1.5332  2.1318   2.7765   3.7469   4.6041
  5    0.1322  0.4082  0.7267  1.1558  1.4759  2.0150   2.5706   3.3649   4.0321
  6    0.1311  0.4043  0.7176  1.1342  1.4398  1.9432   2.4469   3.1427   3.7074
  7    0.1303  0.4015  0.7111  1.1192  1.4149  1.8946   2.3646   2.9979   3.4995
  8    0.1297  0.3995  0.7064  1.1081  1.3968  1.8595   2.3060   2.8965   3.3554
  9    0.1293  0.3979  0.7027  1.0997  1.3830  1.8331   2.2622   2.8214   3.2498
 10    0.1289  0.3966  0.6998  1.0931  1.3722  1.8125   2.2281   2.7638   3.1693
 11    0.1286  0.3956  0.6974  1.0877  1.3634  1.7959   2.2010   2.7181   3.1058
 12    0.1283  0.3947  0.6955  1.0832  1.3562  1.7823   2.1788   2.6810   3.0545
 13    0.1281  0.3940  0.6938  1.0795  1.3502  1.7709   2.1604   2.6503   3.0123
 14    0.1280  0.3933  0.6924  1.0763  1.3450  1.7613   2.1448   2.6245   2.9768
 15    0.1278  0.3928  0.6912  1.0735  1.3406  1.7531   2.1315   2.6025   2.9467
 16    0.1277  0.3923  0.6901  1.0711  1.3368  1.7459   2.1199   2.5835   2.9208
 17    0.1276  0.3919  0.6892  1.0690  1.3334  1.7396   2.1098   2.5669   2.8982
 18    0.1274  0.3915  0.6884  1.0672  1.3304  1.7341   2.1009   2.5524   2.8784
 19    0.1274  0.3912  0.6876  1.0655  1.3277  1.7291   2.0930   2.5395   2.8609
 20    0.1273  0.3909  0.6870  1.0640  1.3253  1.7247   2.0860   2.5280   2.8453
 21    0.1272  0.3906  0.6864  1.0627  1.3232  1.7207   2.0796   2.5176   2.8314
 22    0.1271  0.3904  0.6858  1.0614  1.3212  1.7171   2.0739   2.5083   2.8188
 23    0.1271  0.3902  0.6853  1.0603  1.3195  1.7139   2.0687   2.4999   2.8073
 24    0.1270  0.3900  0.6848  1.0593  1.3178  1.7109   2.0639   2.4922   2.7970
 25    0.1269  0.3898  0.6844  1.0584  1.3163  1.7081   2.0595   2.4851   2.7874
 26    0.1269  0.3896  0.6840  1.0575  1.3150  1.7056   2.0555   2.4786   2.7787
 27    0.1268  0.3894  0.6837  1.0567  1.3137  1.7033   2.0518   2.4727   2.7707
 28    0.1268  0.3893  0.6834  1.0560  1.3125  1.7011   2.0484   2.4671   2.7633
 29    0.1268  0.3892  0.6830  1.0553  1.3114  1.6991   2.0452   2.4620   2.7564
 30    0.1267  0.3890  0.6828  1.0547  1.3104  1.6973   2.0423   2.4573   2.7500
 40    0.1265  0.3881  0.6807  1.0500  1.3031  1.6839   2.0211   2.4233   2.7045
 50    0.1263  0.3875  0.6794  1.0473  1.2987  1.6759   2.0086   2.4033   2.6778
 60    0.1262  0.3872  0.6786  1.0455  1.2958  1.6706   2.0003   2.3901   2.6603
 70    0.1261  0.3869  0.6780  1.0442  1.2938  1.6669   1.9944   2.3808   2.6479
 80    0.1261  0.3867  0.6776  1.0432  1.2922  1.6641   1.9901   2.3739   2.6387
 90    0.1260  0.3866  0.6772  1.0424  1.2910  1.6620   1.9867   2.3685   2.6316
100    0.1260  0.3864  0.6770  1.0418  1.2901  1.6602   1.9840   2.3642   2.6259
250    0.1258  0.3858  0.6755  1.0386  1.2849  1.6510   1.9695   2.3414   2.5956
500    0.1257  0.3855  0.6750  1.0375  1.2832  1.6479   1.9647   2.3338   2.5857
  ∞    0.1257  0.3853  0.6745  1.0364  1.2816  1.6449   1.9600   2.3263   2.5758
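The sketch above the table shows how an entry is read: for df = 10, an upper-tail area of 0.05 corresponds to t = 1.8125. The same value can be reproduced in software; in Excel, =TINV(0.10, 10) returns the two-tail 0.10 (one-tail 0.05) critical value, and the short Python sketch below does the equivalent lookup. The Python version is offered only as an illustration and assumes the SciPy library is available; it is not part of the book's own toolset.

```python
# Cross-check of the t-table example: df = 10, one-tail area 0.05
# (equivalently, two-tail area 0.10). Minimal sketch assuming SciPy is installed.
from scipy import stats

t_crit = stats.t.ppf(1 - 0.05, df=10)  # inverse CDF at 0.95 gives the upper 0.05 cutoff
print(round(t_crit, 4))                # 1.8125, matching the df = 10 row above
```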
Standard Normal Distribution Table

[Figure: standard normal curve, with shaded area 0.3944 between 0 and z = 1.25]

  z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
 0.0  0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
 0.1  0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
 0.2  0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
 0.3  0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
 0.4  0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
 0.5  0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
 0.6  0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
 0.7  0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
 0.8  0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
 0.9  0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
 1.0  0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
 1.1  0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
 1.2  0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
 1.3  0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
 1.4  0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
 1.5  0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
 1.6  0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
 1.7  0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
 1.8  0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
 1.9  0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767
 2.0  0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
 2.1  0.4821  0.4826  0.4830  0.4834  0.4838  0.4842  0.4846  0.4850  0.4854  0.4857
 2.2  0.4861  0.4864  0.4868  0.4871  0.4875  0.4878  0.4881  0.4884  0.4887  0.4890
 2.3  0.4893  0.4896  0.4898  0.4901  0.4904  0.4906  0.4909  0.4911  0.4913  0.4916
 2.4  0.4918  0.4920  0.4922  0.4925  0.4927  0.4929  0.4931  0.4932  0.4934  0.4936
 2.5  0.4938  0.4940  0.4941  0.4943  0.4945  0.4946  0.4948  0.4949  0.4951  0.4952
 2.6  0.4953  0.4955  0.4956  0.4957  0.4959  0.4960  0.4961  0.4962  0.4963  0.4964
 2.7  0.4965  0.4966  0.4967  0.4968  0.4969  0.4970  0.4971  0.4972  0.4973  0.4974
 2.8  0.4974  0.4975  0.4976  0.4977  0.4977  0.4978  0.4979  0.4979  0.4980  0.4981
 2.9  0.4981  0.4982  0.4982  0.4983  0.4984  0.4984  0.4985  0.4985  0.4986  0.4986
 3.0  0.4987  0.4987  0.4987  0.4988  0.4988  0.4989  0.4989  0.4989  0.4990  0.4990
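The normal table can be checked the same way. The body of the table reports P(0 ≤ Z ≤ z), the area between the mean and z, so for the example sketched above, Excel's =NORMSDIST(1.25)-0.5 returns 0.3944. A minimal Python sketch follows, again assuming SciPy is available and offered only as an illustration.

```python
# Cross-check of the standard normal table example: area between 0 and z = 1.25.
# The table gives P(0 <= Z <= z); norm.cdf returns P(Z <= z), so subtract 0.5.
from scipy import stats

area = stats.norm.cdf(1.25) - 0.5
print(round(area, 4))  # 0.3944, matching row 1.2, column 0.05 above
```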
