Chapter 7
Developing Models

Introduction to Statistics Through Resampling Methods & Microsoft Office Excel®, by Phillip I. Good
Copyright © 2005 John Wiley & Sons, Inc.

IN THIS CHAPTER YOU WILL LEARN VALUABLE TECHNIQUES with which to develop forecasts and classification schemes. These techniques have been used to forecast parts sales by the Honda Motors Company and epidemics at naval training centers, and to develop criteria for retention of marine recruits, optimal tariffs for Federal Express, and multitiered pricing plans for Delta Airlines. And these are just examples in which I've been personally involved!

7.1. MODELS

A model in statistics is simply a way of expressing a quantitative relationship between one variable, usually referred to as the dependent variable, and one or more other variables, often referred to as the predictors. We began our text with a reference to Boyle's law for the behavior of perfect gases, V = KT/P. In this version of Boyle's law, V (the volume of the gas) is the dependent variable; T (the temperature of the gas) and P (the pressure exerted on and by the gas) are the predictors; and K (known as Boyle's constant) is the coefficient of the ratio T/P.

An even more familiar relationship is that between the distance S traveled in t hours and the velocity V of the vehicle in which we are traveling: S = Vt. Here S is the dependent variable, and V and t are predictors. If we travel at a velocity of 60 mph for 3 hours, we can plot the distance we travel over time with Excel as follows:

1. Put the labels Time and Distance at the head of the first two columns.
2. Put the values 0.5, 1, 1.5, 2, 2.5, and 3 in the first column.
3. Put the formula = 60*A3 in cell B3 and copy it down the column.
4. Create a scatterplot, using Excel's Chart Wizard.
Select "XY(Scatter)" and use the option "Scatter with data points connected by smoothed lines without markers."

I attempted to drive at 60 mph on a nearby highway past where a truck had recently overturned. Recording the distances at half-hour intervals, I found I'd traveled 32, 66, 75, 90, 115, and 150 miles. As you can see from Fig. 7.1, the reality on a busy highway was quite different from what theory would predict. Incidentally, I created this figure with the aid of DDXL. The setup is depicted in Fig. 7.2.

FIGURE 7.1 Distance expected at 60 mph (straight line) vs. distance observed.

Exercise 7.1. My average velocity over the three-hour period was equal to distance traveled/time = 150/3 = 50 miles per hour, or Distance_i = 50*Time_i + z_i, where the {z_i} are random deviations from the expected distance. Construct a graph to show that this new model is a much better fit than the old.

7.1.1. Why Build Models?

We develop models for at least three different purposes. First, as the term "predictors" suggests, models can be used for prediction. A manufacturer of automobile parts will want to predict part sales several months in advance to ensure that its dealers have the necessary parts on hand. Too few parts in stock will reduce profits; too many may necessitate interim borrowing. So entire departments are hard at work trying to come up with the needed formula. At one time, I was part of just such a study team. We soon realized that the primary predictor of part sales was the weather. Snow, sleet, and freezing rain sent sales skyrocketing. Unfortunately, predicting the weather is at least as difficult as predicting part sales.

Second, models can be used to develop additional insight into cause-and-effect relationships.
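The comparison Exercise 7.1 asks for can also be made numerically. The text works in Excel; the sketch below uses Python instead, with the trip data recorded above, and measures each straight-line model's fit by its sum of squared deviations (smaller is better):

```python
# The trip behind Fig. 7.1: distances logged every half hour while
# attempting to hold 60 mph on a congested highway.
times = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]   # hours since the start of the trip
observed = [32, 66, 75, 90, 115, 150]     # miles actually traveled

def sum_sq_residuals(velocity):
    """Sum of squared deviations of the observed distances from Distance = velocity * Time."""
    return sum((d - velocity * t) ** 2 for t, d in zip(times, observed))

print(sum_sq_residuals(60))   # the theoretical 60 mph line: 3290.0
print(sum_sq_residuals(50))   # the 50 mph average-velocity line: 505.0
```

The 50 mph line leaves a far smaller sum of squares, which is exactly what the requested graph will show visually.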
At one time, it was assumed that the growth of the welfare caseload L was a simple function of time t, so that L = ct, where the growth rate c was a function of population size. Throughout the 1960s, in state after state, the constant c had to be adjusted upward repeatedly if this model were to fit the data. An alternative and better-fitting model proved to be L = ct + dt², an equation often used in modeling the growth of an epidemic. As it proved, the basis for the new second-order model was the same as it was for an epidemic: Welfare recipients were spreading the news of welfare availability to others who had not yet taken advantage of the program, much as diseased individuals might spread an infection.

Boyle's law seems to fit the data in the sense that if we measure both the pressure and volume of gases at various temperatures, we find that a plot of pressure times volume versus temperature yields a straight line. Or, if we fix the volume, say by confining all the gas in a chamber of fixed size with a piston on top to keep the gas from escaping, a plot of the pressure exerted on the piston against the temperature of the gas yields a straight line. Observations such as these both suggested and confirmed what is known today as kinetic molecular theory.

FIGURE 7.2 Preparing a scatterplot that will depict multiple lines.

A third use for models is in classification. At first glance, the problem of classification might seem quite similar to that of prediction. For example, instead of predicting that Y would be 5 or 6 or even 6.5, we need only predict whether Y will be greater or less than 6. But the loss functions for the two problems are quite different. The loss connected with predicting y_p when the observed value is y_o is usually a monotone increasing function of the difference between the two.
By contrast, the loss function connected with a classification problem has jumps, being zero if the classification is correct and taking one of several possible values otherwise, depending on the nature of the misclassification.

Not surprisingly, different modeling methods have been developed to meet the different purposes. For the balance of this chapter, we shall consider two primary modeling methods: linear regression, whose objective is to predict the expected value of a given dependent variable, and decision trees, which are used for classification. We shall briefly discuss some other alternatives.

7.1.2. Caveats

The modeling techniques that you learn in this chapter may seem impressive—they require extensive calculations that only a computer can do—so I feel it necessary to issue three warnings.

• You cannot use the same data both to formulate a model and to test it. The model must be validated on independent data.
• A cause-and-effect basis is required for every model, just as molecular theory serves as the causal basis for Boyle's law.
• Don't let your software do your thinking for you. Just because a model fits the data does not mean that it is appropriate or correct. It must be independently validated and have a cause-and-effect basis.

You may have heard that having a black cat cross your path will bring bad luck. Don't step in front of a moving vehicle to avoid that black cat unless you have some causal basis for believing that black cats can affect your luck. (And why not white cats or tortoiseshell?) I avoid cats myself because cats lick themselves and shed their fur; when I breathe cat hairs, the traces of saliva on the cat fur trigger an allergic reaction that results in the blood vessels in my nose dilating. Now that is a causal connection.

7.2. REGRESSION

Regression combines two ideas with which we gained familiarity in previous chapters:

1. Correlation or dependence among variables
2.
Additive model

Here is an example: Anyone familiar with the restaurant business (or indeed, with any number of businesses that provide direct service to the public, including the post office) knows that the volume of business is a function of the day of the week. Using an additive model, we can represent business volume via the formula

    V_ij = μ + d_i + z_ij

where V_ij is the volume of business on the ith day of the jth week, μ is the average volume, d_i is the deviation from the average volume observed on the ith day of the week, i = 1, ..., 7, and the z_ij are independent, identically distributed random fluctuations.

Many physiological processes such as body temperature have a circadian rhythm, rising and falling each 24 hours. We could represent body temperature by the formula

    T_ij = μ + d_i + z_ij

where i (in minutes) takes values from 1 to 24 * 60, but this would force us to keep track of 1441 different parameters. Besides, we can get almost as good a fit to the data by using the formula

    E(T_t) = μ + β cos(2π(t + 300)/1440)     (7.1)

If you are not familiar with the cos() function, you can use Excel to gain familiarity as follows:

1. Put the hours from 1 to 24 in the first column.
2. In the third cell of the second column, put = cos(2*3.1416*(A3 + 6)/24).
3. Copy the formula down the column; then construct a scatterplot.

Note how the cos() function first falls, then rises, undergoing a complete cycle in a 24-hour period.

Why use a formula as complicated as Equation 7.1? Because now we have only two parameters we need to estimate, μ and β. For predicting body temperature, μ = 98.6 and β = 0.4 might be reasonable choices. Of course, the values of these parameters will vary from individual to individual. For me, μ = 97.6.

Exercise 7.2. If E(Y) = 3X + 2, can X and Y be independent?

Exercise 7.3.
According to the inside of the cap on a bottle of Snapple's Mango Madness, "the number of times a cricket chirps in 15 seconds plus 37 will give you the current air temperature." How many times would you expect to hear a cricket chirp in 15 seconds when the temperature is 39 degrees? 124 degrees?

Exercise 7.4. If we constantly observe large values of one variable, call it Y, whenever we observe large values of another variable, call it X, does this mean X is part of the mechanism responsible for increases in the value of Y? If not, what are the other possibilities? To illustrate the several possibilities, give at least three real-world examples in which this statement would be false. (You'll do better at this exercise if you work on it with one or two others.)

7.2.1. Linear Regression

Equation 7.1 is an example of linear regression. The general form of linear regression is

    Y = μ + βf[X] + Z     (7.2)

where Y is known as the dependent or response variable, X is known as the independent variable or predictor, f[X] is a function of known form, μ and β are unknown parameters, and Z is a random variable whose expected value is zero. If it weren't for this last random component Z, then if we knew the parameters μ and β, we could plot the values of the dependent variable Y and the function f[X] as a straight line on a graph; hence the name: linear regression.

For the past year, the price of homes in my neighborhood could be represented as a straight line on a graph relating house prices to time, P = μ + βt, where μ was the price of the house on the first of the year and t is the day of the year. Of course, as far as the price of any individual house was concerned, there was a lot of fluctuation around this line depending on how good a salesman the realtor was and how desperate the owner was to sell. If the price of my house ever reaches $700K, I might just sell and move to Australia.
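The random component Z in Equation 7.2 is what keeps individual points from lying exactly on the line, just as individual house sales fluctuated about P = μ + βt. The simulation below (a Python sketch; the parameter values and noise scale are illustrative, not taken from the text) shows the fluctuations averaging out to roughly zero, as E(Z) = 0 requires:

```python
import random

# Simulate draws from the linear-regression form of Equation 7.2,
# Y = mu + beta*f[X] + Z, with f[X] = X and Z a zero-mean random fluctuation.
random.seed(1)                 # fixed seed so the run is reproducible
mu, beta = 2.0, 3.0            # illustrative parameter values
xs = [x / 10 for x in range(100)]
ys = [mu + beta * x + random.gauss(0, 1) for x in xs]

# Average deviation of the simulated points from the underlying straight line.
avg_deviation = sum(y - (mu + beta * x) for x, y in zip(xs, ys)) / len(xs)
print(round(avg_deviation, 2))   # close to zero, because E(Z) = 0
```

Plotting ys against xs would show a scatter of points about the line Y = 2 + 3X rather than the line itself.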
Of course, a straight line might not be realistic. Prices have a way of coming down as well as going up. A better prediction formula might be P = μ + βt - γt², in which prices continue to rise until β - 2γt = 0, after which they start to drop. If I knew what β and γ were, or could at least get some good estimates of their value, then I could sell my house at the top of the market! The trick is to look at a graph such as Fig. 7.1 and somehow extract that information.

Note that P = μ + βt - γt² is another example of linear regression, only with three parameters rather than two. So is the formula W = μ + βH + γA + Z, where W denotes the weight of a child, H is its height, A its age, and Z, as always, is a purely random component. W = μ + βH + γA + δAH + Z is still another example. The parameters μ, β, γ, and so forth are sometimes referred to as the coefficients of the model.

What then is a nonlinear regression? Here are two examples:

    Y = β log(γX), which is linear in β but nonlinear in the unknown parameter γ,

and

    Y = β + cos(γt), which also is linear in β but nonlinear in γ.

Regression models that are nonlinear in their parameters are beyond the scope of this text. The important lesson to be learned from their existence is that we need to have some idea of the functional relationship between a response variable and its predictors before we start to fit a linear regression model.

Exercise 7.5. Generate a plot of the function P = 100 + 10t - 1.5t² for values of t = 0, 1, . . . 10. Does the curve reach a maximum and then turn over?

7.3. FITTING A REGRESSION EQUATION

Suppose we have determined that the response variable Y whose value we wish to predict is related to the value of a predictor variable X by the equation E(Y) = a + bX, and on the basis of a sample of n paired observations (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) we wish to estimate the unknown coefficients a and b.
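Returning for a moment to the quadratic price model above: its turning point can be found directly, since the derivative of μ + βt - γt² is β - 2γt, which vanishes at t* = β/(2γ). A brief Python check (Python rather than Excel), using the coefficients of Exercise 7.5, confirms that such a curve does reach a maximum and then turn over:

```python
# Peak of the quadratic model P = mu + beta*t - gamma*t**2:
# the derivative beta - 2*gamma*t is zero at t* = beta/(2*gamma).
mu, beta, gamma = 100, 10, 1.5   # the coefficients of Exercise 7.5

def price(t):
    return mu + beta * t - gamma * t ** 2

t_star = beta / (2 * gamma)            # about 3.33
print(t_star, price(t_star))           # the continuous-time maximum, about 116.67

curve = [price(t) for t in range(11)]  # t = 0, 1, ..., 10, as in the exercise
print(max(curve), curve.index(max(curve)))   # tops out at t = 3, then turns over
```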
Three methods of estimation are in common use: ordinary least squares, least absolute deviation, and errors-in-variables regression, also known as Deming regression. We will study all three in the next few sections.

7.3.1. Ordinary Least Squares

The ordinary least squares (OLS) technique of estimation is the most commonly used, primarily for historical reasons, as its computations can be done (with some effort) by hand or with a primitive calculator. The objective of the method is to determine the parameter values that will minimize the sum of squares Σ(y_i - EY)², where EY, the expected or mean value of Y, is modeled by the right-hand side of our regression equation. In our example, EY = a + bx_i, and so we want to find the values of a and b that will minimize Σ(y_i - a - bx_i)².

We can readily obtain the desired estimates with the aid of the XLStat add-in. Suppose we have the following data relating age and systolic blood pressure (SBP):

• Age: 39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42
• SBP: 144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128

From the main XLStat menu select the scatterplot (fifth from left). Select the straight-line scatterplot (second from left) from the modeling data menu that pops up. Enter the observations in the first two columns and complete the Linear Regression menu as shown in Fig. 7.3.

A plethora of results appears on a second worksheet. Let's focus on what is important. In Table 7.1, extracted from the worksheet, we see that the best-fitting model by least-squares methods is that the expected SBP of an individual is 95.6125119693584 + 1.04743855729333 times that person's Age. Note that when we report our results, we write this as Ê(SBP) = â + b̂·Age = 95.6 + 1.04·Age, dropping decimal places that convey a false impression of precision.

FIGURE 7.3 Preparing to fit a regression line.
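Those who want to check XLStat's arithmetic can compute the same OLS fit from the classic closed-form formulas, b = Σ(x - x̄)(y - ȳ)/Σ(x - x̄)² and a = ȳ - b·x̄. A Python sketch (outside the text's Excel workflow) with the Age/SBP data above:

```python
# OLS fit of SBP on Age from the closed-form formulas, as a cross-check
# on the XLStat output reported in Table 7.1.
age = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42]
sbp = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128]

n = len(age)
xbar = sum(age) / n
ybar = sum(sbp) / n
sxx = sum((x - xbar) ** 2 for x in age)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(age, sbp))

b = sxy / sxx          # slope: about 1.047
a = ybar - b * xbar    # intercept: about 95.6
print(round(a, 1), round(b, 4))
```

The estimates agree with Table 7.1 to the reported precision.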
TABLE 7.1 Model Parameters

Parameter    Value     Standard     Student's t   Pr > t   Lower bound   Upper bound
                       Deviation                           95%           95%
Intercept    95.613    29.894       3.198         0.007    31.031        160.194
Age           1.047     0.566       1.850         0.087    -0.176          2.271

The equation of the model is: SBP = 95.6125119693584 + 1.04743855729333 * Age

We also see from Table 7.1 that the coefficient of Age, that is, the slope of the regression line depicted in Fig. 7.4, is not significantly different from zero at the 5% level; the associated p value is 0.087 > 0.05. Whether this p value is meaningful is the topic of Section 7.4.1.

What can be the explanation for the poor fit? Our attention is immediately drawn to the point in Fig. 7.4 that stands out from the rest: that of a 47-year-old whose systolic blood pressure is 220. Part of our output, reproduced in Table 7.2, includes a printout of all the residuals, that is, of the differences between the SBPs that were actually observed and the values our regression equation would predict. Consider the fourth residual in the series, 0.158. This is the difference between what was observed, SBP = 145, and what the regression equation estimates as the expected SBP for a 47-year-old individual, E(SBP) = 95.6 + 1.04 * 47 = 144.8. The largest residual is 75, which corresponds to the outlying value we've already alluded to.
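The residual arithmetic is easy to reproduce. Using the full-precision coefficients reported above, a few lines of Python recover the fourth entry of Table 7.2:

```python
# residual = observed - predicted, for the 47-year-old with SBP = 145.
intercept = 95.6125119693584    # full-precision XLStat estimates
slope = 1.04743855729333

predicted = intercept + slope * 47   # the model's expected SBP at age 47
residual = 145 - predicted
print(round(predicted, 3), round(residual, 3))   # 144.842 and 0.158
```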
TABLE 7.2 Deviations from Regression Line

Age      SBP      SBP (Model)   Residuals
39.000   144.000    136.463       7.537
47.000   220.000    144.842      75.158
45.000   138.000    142.747      -4.747
47.000   145.000    144.842       0.158
65.000   162.000    163.696      -1.696
46.000   142.000    143.795      -1.795
67.000   170.000    165.791       4.209
42.000   124.000    139.605     -15.605
67.000   158.000    165.791      -7.791
56.000   154.000    154.269      -0.269
64.000   162.000    162.649      -0.649
56.000   150.000    154.269      -4.269
59.000   140.000    157.411     -17.411
34.000   110.000    131.225     -21.225
42.000   128.000    139.605     -11.605

FIGURE 7.4 Data and regression line of SBP vs. Age (observations, predictions, and 95% confidence bounds on the predictions and on the mean).

[...]

Exercise 7.6. Do U.S. residents do their best to spend what they earn? Fit a regression line, using OLS, to the data in the accompanying table.

Source: Economic Report of the President, 1988, Table B-27.

Year   Income (1982 $s)   Expenditures (1982 $s)
1960        6036                 5561
1962        6271                 5729
1964        6727                 6099
1966        7280                 6607
1968        7728                 7003
1970        8134                 7275
1972        8562                 7726
1974        8867                 7826
1976        9175                 8272
1978        9735                 8808
1980        9722                 8783
1982        9725                 8818

[...]

3, 8, 12, 11, 27

Oxygen 95.64, 102.09, 104.76, 106.98, 102.6, 109.15, 96.12, 111.98, 100.67, 103.87, 107.57, 106.55, 89.21, 100.65, 100.54, 102.98, 98, 106.86, 98.17, 100.98, 99.78, 100.87, 97.25, 97.78, 99.24, 104.32, 101.21, 102.73, 99.17, 104.88, 97.13, 102.43, 99.87, 100.89, 99.43, 99.5, 99.07, 105.32, 102.89, 102.67, 106.04, 106.67, 98.14, 100.65, 103.98, 100.34, 98.27, 105.69, 96.22, 102.87, 103.98, 102.76, 107.54, 104.13, 98.74, 101.12, 104.98, 101.43, 106.42, 107.99, 95.89, 104.87, 104.98, 100.89, 109.39, 98.17, 99.14, 103.87, 103.87, 102.89, 108.78, 107.73, 97.34, 105.32, 101.87, 100.78, 98.21, 97.66, 96.22, 22, 99.
78, 101.54, 100.53, 109.86

Exercise 7.13. The slope of a regression line is zero if and only if the correlation between the predictor and the predicted variable...

2, 8, 6, 3, 5, 3, 6, 8, 2, 5, 6, 6, 3, 5, 8, 8, 1, 9, 8, 8, 7, 5, 2, 2, 3, 8, 2, 2, 8, 9, 5, 6, 7, 4, 6, 5, 8, 4, 7, 8, 7, 5, 5, 9, 9, 9, 7, 3, 8, 9, 8, 4, 8, 5, 5, 8, 4, 3, 7, 1, 2, 1, 1, 7, 5, 5, 1, 4, 1, 9, 9, 6, 5, 4, 3, 6, 6, 4, 5, 7, 2, 6, 5, 6, 3, 8, 2, 5, 3, 4, 2, 3, 8, 3, 9, 1, 3, 1, 6, 7, 1, 1, 1, 4, 4, 8, 4, 7, 4, 4, 2, 6, 6, 6, 7, 2, 9, 4, 1, 9, 3, 5, 7, 2, 2, 8, 9, 1, 3, 6, 2, 6, 2, 8,
5, 8, 6, 6, 5, 6, 4, 4, 6, 4, 7, 4, 4, 5, 4, 3, 7, 8, 1, 4, 4, 7, 4, 5, 4, 5, 1, 3, 4, 4, 4, 5, 3, 5, 5, 4, 7, 6, 3, 6, 4, 6, 5, 3, 4, 5, 7, 4, 5, 4, 5, 3, 7, 6, 4, 6, 4, 8, 3, 4, 2, 5, 5, 5, 4, 5, 6, 3, 5, 8, 4, 5, 2, 5, 4, 5, 6, 3, 3, 3, 1, 5, 3, 4, 7, 4, 4, 6, 4, 3, 5, 3, 4, 4, 8, 6, 7, 4, 6, 4, 5, 4, 6, 8, 7, 2, 5, 4, 7, 4, 5, 6, 4, 6, 6

4, 4, 1, 1, 2, 2, 8, 3, 3, 3, 1, 1, 6, 8, 3, 7, 5, 9, 8, 3, 5, 6, 1, 5, 6, 6, 9, 6, 9, 9, 6, 7, 3, 8, 4, 2, 6, 4, 8, 3, 3, 6, 4, 4, 9, 5, 6, 4, 5, 3, 3, 1, 3, 4, 3, 6, 8, 1, 5, 3, 4, 8, 2, 5, 3, 2, 3, 2, 5, 8, 3, 1, 6, 3, 7, 8, 9, 2, 3, 5, 3, 9, 2, 9, 3, 9, 2, 8, 9, 5, 1, 9, 9, 1, 8, 7, 1, 4, 9, 3, 4, 9, 1, 3, 9, 1, 5, 2, 7, 4, 6, 1, 4, 2, 7, 5, 4, 5, 9, 5, 5, 5, 2, 4, 1, 8, 7, 9, 6, 8, 1, 5, 9, 9, 9,

19, 18, 20, 22, 17, 15, 19, 18, 20, 23, 18, 15, 17, 18, 20, 15, 17, 18, 20, 25, 19, 15, 17, 19, 20, 25, 18, 15, 18, 19, 21, 24, 19

FecalColiform 16, 8, 8, 11, 11, 21, 34, 11, 11, 7, 11, 6, 8, 6, 35, 18, 18, 21, 13, 9, 32, 11, 29, 11, 28, 7, 12, 7, 12, 9, 10, 3, 43, 5, 12, 14, 4, 9, 8, 10, 4, 12, 0, 4, 7, 5, 12, 26, 0, 3, 32, 0, 8, 12, 0, 0, 21, 0, 7, 8, 0, 0, 17, 4, 0, 14, 0, 0, 11, 7, 6, 0, 8, 0, ...

April through September 1998. Included in this data set are levels of fecal coliform, dissolved oxygen, and temperature.

• Are there significant differences in each of these variables from month to month?
• Develop a model for fecal coliform levels in terms of month, temperature, and dissolved oxygen.

Month 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9

Temp 25, 21, 15, 19, 25, 17, 14, 17, 24, 21, 22, 20, 14, 17, 24, 21, 23, 22, 14, 17, 25, 21, 21, 22, 14, 17, 25, 20, 14, 17, 25, 21, 21, 19, 14, 17, 25, 21, 25, 19, 14, 16, 25, 21, 25, 19, 18, 21

10.621, 10.405, 11.874, 13.444, 13.343, 16.402, 19.108, 19.25, 20.917, 23.409, 5.583, 5.063, 6.272, 7.469, 10.176, 6.581, 7.63

NEW 2.362, 3.548, 4.528, 4.923, 6.443, 6.494, 8.275, 9.623, 9.646, 11.542, 10.251, 11.866, 13.388, 17.666, 17.379, 21.089, 21.296, 23.983, 5.42, 6.369, 7.899, 8.619, 11.247, 7.526, 7.653

Exercise 7.15. Which method should be used to regress U as a function of W in the following...

Year   Income (1982 $s)   Expenditures (1982 $s)
1960        6036                 5561
1962        6271                 5729
1964        6727                 6099
1966        7280                 6607
1968        7728                 7003
1970        8134                 7275
1972        8562                 7726
1974        8867                 7826
1976        9175                 8272
1978        ...                  ...