Essentials of Business Statistics, Fifth Edition (Bowerman)

Bruce L. Bowerman, Miami University
Richard T. O'Connell, Miami University
Emily S. Murphree, Miami University
J. B. Orris, Butler University

Essentials of Business Statistics, FIFTH EDITION

with major contributions by Steven C. Huchendorf, University of Minnesota; Dawn C. Porter, University of Southern California; and Patrick J. Schur, Miami University.

ESSENTIALS OF BUSINESS STATISTICS, FIFTH EDITION. Published by McGraw-Hill Education, Penn Plaza, New York, NY 10121. Copyright © 2015 by McGraw-Hill Education. All rights reserved. Printed in the United States of America. Previous editions © 2012, 2010, 2008, and 2004. No part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written consent of McGraw-Hill Education, including, but not limited to, in any network or other electronic storage or transmission, or broadcast for distance learning. Some ancillaries, including electronic and print components, may not be available to customers outside the United States. This book is printed on acid-free paper. DOW/DOW. ISBN 978-0-07-802053-7; MHID 0-07-802053-0.

Senior Vice President, Products & Markets: Kurt L. Strand
Vice President, Content Production & Technology Services: Kimberly Meriwether David
Managing Director: Douglas Reiner
Senior Brand Manager: Thomas Hayward
Executive Director of Development: Ann Torbert
Senior Development Editor: Wanda J. Zeman
Senior Marketing Manager: Heather A. Kazakoff
Director, Content Production: Terri Schiesl
Content Project Manager: Harvey Yep
Content Project Manager: Daryl Horrocks
Senior Buyer: Debra R. Sylvester
Design: Matthew Baldwin
Cover Image: © Bloomberg via Getty Images
Lead Content Licensing Specialist: Keri Johnson
Typeface: 10/12 Times New Roman
Compositor: MPS Limited
Printer: R. R. Donnelley

All credits appearing on page or at the end of the book are considered to be an extension of the copyright page. The CIP data for this title has been applied for. The Internet addresses listed in the text were accurate at the time of publication. The inclusion of a website does not indicate an endorsement by the authors or McGraw-Hill Education, and McGraw-Hill Education does not guarantee the accuracy of the information presented at these sites. www.mhhe.com

About the Authors

Bruce L. Bowerman
Bruce L. Bowerman is professor emeritus of decision sciences at Miami University in Oxford, Ohio. He received his Ph.D. degree in statistics from Iowa State University in 1974, and he has over 41 years of experience teaching basic statistics, regression analysis, time series forecasting, survey sampling, and design of experiments to both undergraduate and graduate students. In 1987 Professor Bowerman received an Outstanding Teaching award from the Miami University senior class, and in 1992 he received an Effective Educator award from the Richard T. Farmer School of Business Administration. Together with Richard T. O'Connell, Professor Bowerman has written 20 textbooks. In his spare time, Professor Bowerman enjoys watching movies and sports, playing tennis, and designing houses.

Richard T. O'Connell
Richard T. O'Connell is professor emeritus of decision sciences at Miami University in Oxford, Ohio. He has more than 36 years of experience teaching basic statistics, statistical quality control and process improvement, regression analysis, time series forecasting, and design of experiments to both undergraduate and graduate business students. He also has extensive consulting experience and has taught workshops dealing with statistical process control and process improvement for a variety of companies in the Midwest.
In 2000 Professor O'Connell received an Effective Educator award from the Richard T. Farmer School of Business Administration. Together with Bruce L. Bowerman, he has written 20 textbooks. In his spare time, Professor O'Connell enjoys fishing, collecting 1950s and 1960s rock music, and following the Green Bay Packers and Purdue University sports.

Emily S. Murphree
Emily S. Murphree is associate professor of statistics in the Department of Mathematics and Statistics at Miami University in Oxford, Ohio. She received her Ph.D. degree in statistics from the University of North Carolina and does research in applied probability. Professor Murphree received Miami's College of Arts and Science Distinguished Educator Award in 1998. In 1996, she was named one of Oxford's Citizens of the Year for her work with Habitat for Humanity and for organizing annual Sonia Kovalevsky Mathematical Sciences Days for area high school girls. Her enthusiasm for hiking in wilderness areas of the West motivated her current research on estimating animal population sizes.

James Burdeane "Deane" Orris
J. B. Orris is a professor emeritus of management science at Butler University in Indianapolis, Indiana. He received his Ph.D. from the University of Illinois in 1971, and in the late 1970s, with the advent of personal computers, he combined his interest in statistics and computers to write one of the first personal computer statistics packages—MICROSTAT. Over the past 20 years, MICROSTAT has evolved into MegaStat, an Excel add-in statistics program. He wrote an Excel book, Essentials: Excel 2000 Advanced, in 1999 and Basic Statistics Using Excel and MegaStat in 2006. He taught statistics and computer courses in the College of Business Administration of Butler University from 1971 until 2013. He is a member of the American Statistical Association and is past president of the Central Indiana Chapter. In his spare time, Professor Orris enjoys reading, working out, and working in his woodworking shop.

FROM THE AUTHORS

In Essentials of Business Statistics, Fifth Edition, we provide a modern, practical, and unique framework for teaching an introductory course in business statistics. As in previous editions, we employ real or realistic examples, continuing case studies, and a business improvement theme to teach the material. Moreover, we believe that this fifth edition features more concise and lucid explanations, an improved topic flow, and a judicious use of realistic and compelling examples. Overall, the fifth edition is 32 pages shorter than the fourth edition while covering all previous material as well as additional topics. Below we outline the attributes and new features we think make this book an effective learning tool.

• Continuing case studies that tie together different statistical topics. These continuing case studies span not only individual chapters but also groups of chapters. Students tell us that when new statistical topics are developed in the context of familiar cases, their "fear factor" is reduced. Of course, to keep the examples from becoming overtired, we introduce new case studies throughout the book.

• Business improvement conclusions that explicitly show how statistical results lead to practical business decisions. After appropriate analysis and interpretation, examples and case studies often result in a business improvement conclusion. To emphasize this theme of business improvement, BI icons are placed in the page margins to identify when statistical analysis has led to an important business conclusion. The text of each conclusion is also highlighted in yellow for additional clarity.
• Examples exploited to motivate an intuitive approach to statistical ideas. Most concepts and formulas, particularly those that introductory students find most challenging, are first approached by working through the ideas in accessible examples. Only after simple and clear analysis within these concrete examples are more general concepts and formulas discussed.

• An improved introduction to business statistics in Chapter 1. The example introducing data and how data can be used to make a successful offer to purchase a house has been made clearer, and two new and more graphically oriented examples have been added to better introduce quantitative and qualitative variables. Random sampling is introduced informally in the context of more tightly focused case studies. [The technical discussion about how to select random samples and other types of samples is in Chapter 7 (Sampling and Sampling Distributions), but the reader has the option of reading about sampling in Chapter 7 immediately after Chapter 1.] Chapter 1 also includes a new discussion of ethical guidelines for practitioners of statistics. Throughout the book, statistics is presented as a broad discipline requiring not simply analytical skills but also judgment and personal ethics.

• A more streamlined discussion of the graphical and numerical methods of descriptive statistics. Chapters 2 and 3 utilize several new examples, including an example leading off Chapter 2 that deals with college students' pizza brand preferences. In addition, the explanations of some of the more complicated topics have been simplified. For example, the discussion of percentiles, quartiles, and box plots has been shortened and clarified.

• An improved, well-motivated discussion of probability and probability distributions in Chapters 4, 5, and 6. In Chapter 4, methods for calculating probabilities are more clearly motivated in the context of two new examples. We use the Crystal Cable Case, which deals with studying cable television and Internet penetration rates, to illustrate many probabilistic concepts and calculations. Moreover, students' understanding of the important concepts of conditional probability and statistical independence is sharpened by a new real-world case involving gender discrimination at a pharmaceutical company. The probability distribution, mean, and standard deviation of a discrete random variable are all motivated and explained in a more succinct discussion in Chapter 5. An example illustrates how knowledge of a mean and standard deviation are enough to estimate potential investment returns. Chapter 5 also features an improved introduction to the binomial distribution, where the previous careful discussion is supplemented by an illustrative tree diagram. Students can now see the origins of all the factors in the binomial formula more clearly. Chapter 5 ends with a new optional section where joint probabilities and covariances are explained in the context of portfolio diversification. In Chapter 6, continuous probabilities are developed by improved examples. The coffee temperature case introduces the key ideas and is eventually used to help study the normal distribution. Similarly, the elevator waiting time case is used to explore the continuous uniform distribution.

• An improved discussion of sampling distributions and statistical inference in Chapters 7 through 12. In Chapter 7, the discussion of sampling distributions has been modified to more seamlessly move from a small population example involving sampling car mileages to a related large population example.
The introduction to confidence intervals in Chapter 8 features a very visual, graphical approach that we think makes finding and interpreting confidence intervals much easier. This chapter now also includes a shorter and clearer discussion of the difference between a confidence interval and a tolerance interval and concludes with a new section about estimating parameters of finite populations. Hypothesis testing procedures (using both the critical value and p-value approaches) are summarized efficiently and visually in summary boxes that are much more transparent than traditional summaries lacking visual prompts. These summary boxes are featured throughout the chapter covering inferences for one mean, one proportion, and one variance (Chapter 9), and the chapter covering inferences for two means, two proportions, and two variances (Chapter 10), as well as in later chapters covering regression analysis. In addition, the discussion of formulating the null and alternative hypotheses has been completely rewritten and expanded, and a new, earlier discussion of the weight of evidence interpretation of p-values is given. Also, a short presentation of the logic behind finding the probability of a Type II error when testing a two-sided alternative hypothesis now accompanies the general formula that can be used to calculate this probability. In Chapter 10 we mention the unrealistic "known variance" case when comparing population means only briefly and move swiftly to the more realistic "unknown variance" case. The discussion of comparing population variances has been shortened and made clearer. In Chapter 11 (Experimental Design and Analysis of Variance) we use a concise but understandable approach to covering one-way ANOVA, the randomized block design, and two-way ANOVA. A new, short presentation of using hypothesis testing to make pairwise comparisons now supplements our usual confidence interval discussion. Chapter 12 covers chi-square goodness-of-fit tests and tests of independence.

• Streamlined and improved discussions of simple and multiple regression and statistical quality control. As in the fourth edition, we use the Tasty Sub Shop Case to introduce the ideas of both simple and multiple regression analysis. This case has been popular with our readers. In Chapter 13 (Simple Linear Regression Analysis), the discussion of the simple linear regression model has been slightly shortened, the section on residual analysis has been significantly shortened and improved, and more exercises on residual analysis have been added. After discussing the basics of multiple regression, Chapter 14 has five innovative, advanced sections that are concise and can be covered in any order. These optional sections explain (1) using dummy variables (including an improved discussion of interaction when using dummy variables), (2) using squared and interaction terms, (3) model building and the effects of multicollinearity (including an added discussion of backward elimination), (4) residual analysis in multiple regression (including an improved and slightly expanded discussion of outlying and influential observations), and (5) logistic regression (a new section). Chapter 15, which is on the book's website and deals with process improvement, has been streamlined by relying on a single case, the hole location case, to explain X-bar and R charts as well as establishing process control, pattern analysis, and capability studies.
• Increased emphasis on Excel and MINITAB throughout the text. The main text features Excel and MINITAB outputs. The end-of-chapter appendices provide improved step-by-step instructions about how to perform statistical analyses using these software packages as well as MegaStat, an Excel add-in.

Bruce L. Bowerman
Richard T. O'Connell
Emily S. Murphree
J. B. Orris

A TOUR OF THIS TEXT'S FEATURES

Chapter Introductions
Each chapter begins with a list of the section topics that are covered in the chapter, along with chapter learning objectives and a preview of the case study analysis to be carried out in the chapter. For example, Chapter 1 (An Introduction to Business Statistics) opens as follows.

The subject of statistics involves the study of how to collect, analyze, and interpret data. Data are facts and figures from which conclusions can be drawn. Such conclusions are important to the decision making of many professions and organizations. For example, economists use conclusions drawn from the latest data on unemployment and inflation to help the government make policy decisions. Financial planners use recent trends in stock market prices and economic conditions to make investment decisions. Accountants use sample data concerning a company's actual sales revenues to assess whether the company's claimed sales revenues are valid. Marketing professionals help businesses decide which products to develop and market by using data that reveal consumer preferences. Production supervisors use manufacturing data to evaluate, control, and improve product quality. Politicians rely on data from public opinion polls to formulate legislation and to devise campaign strategies. Physicians and hospitals use data on the effectiveness of drugs and surgical procedures to provide patients with the best possible treatment.

In this chapter we begin to see how we collect and analyze data. As we proceed through the chapter, we introduce several case studies. These case studies (and others to be introduced later) are revisited throughout later chapters as we learn the statistical methods needed to analyze them. Briefly, we will begin to study three cases:

The Cell Phone Case. A bank estimates its cellular phone costs and decides whether to outsource management of its wireless resources by studying the calling patterns of its employees.

The Marketing Research Case. A bottling company investigates consumer reaction to a new bottle design for one of its popular soft drinks.

The Car Mileage Case. To determine if it qualifies for a federal tax credit based on fuel economy, an automaker studies the gas mileage of its new midsize model.

1.1 Data

Data sets, elements, and variables. We have said that data are facts and figures from which conclusions can be drawn. Together, the data that are collected for a particular study are referred to as a data set. For example, Table 1.1 is a data set that gives information about the new homes sold in a Florida luxury home development over a recent three-month period. Potential buyers in this housing community could choose either the "Diamond" or the "Ruby" home model design and could have the home built on either a lake lot or a treed lot (with no water access). In order to understand the data in Table 1.1, note that any data set provides information about some group of individual elements, which may be people, objects, events, or other entities. The information that a data set provides about its elements usually describes one or more characteristics of these elements. Any characteristic of an element is called a variable.

Learning Objectives
When you have mastered the material in this chapter, you will be able to:
LO1-1 Define a variable.
LO1-2 Describe the difference between a quantitative variable and a qualitative variable.
LO1-3 Describe the difference between cross-sectional data and time series data.
LO1-4 Construct and interpret a time series (runs) plot.
LO1-5 Identify the different types of data sources: existing data sources, experimental studies, and observational studies.
LO1-6 Describe the difference between a population and a sample.
LO1-7 Distinguish between descriptive statistics and statistical inference.
LO1-8 Explain the importance of random sampling.
LO1-9 Identify the ratio, interval, ordinal, and nominative scales of measurement (Optional).

Chapter Outline
1.1 Data
1.2 Data Sources
1.3 Populations and Samples
1.4 Three Case Studies That Illustrate Sampling and Statistical Inference
1.5 Ratio, Interval, Ordinal, and Nominative Scales of Measurement (Optional)

For the data set in Table 1.1, each sold home is an element, and four variables are used to describe the homes. These variables are (1) the home model design, (2) the type of lot on which the home was built, (3) the list (asking) price, and (4) the (actual) selling price. Moreover, each home model design came with "everything included"—specifically, a complete, luxury interior package and a choice (at no price difference) of one of three different architectural exteriors. The builder made the list price of each home solely dependent on the model design; however, the builder gave various price reductions for homes built on treed lots.

TABLE 1.1 A Data Set Describing Five Home Sales (DS HomeSales)

Model Design   Lot Type   List Price   Selling Price
Diamond        Lake       $494,000     $494,000
Ruby           Treed      $447,000     $398,000
Diamond        Treed      $494,000     $440,000
Diamond        Treed      $494,000     $469,000
Ruby           Lake       $447,000     $447,000

Continuing Case Studies and Business Improvement Conclusions
The main chapter discussions feature real or realistic examples, continuing case studies, and a business improvement theme. The continuing case studies span not only individual chapters but also groups of chapters and tie together different statistical topics. To emphasize the text's theme of business improvement, BI icons are placed in the page margins to identify when statistical analysis has led to an important business improvement conclusion. Each conclusion is also highlighted in yellow for additional clarity. For example, in Chapters 1 and 3 we consider The Cell Phone Case:

TABLE 1.4 A Sample of Cellular Usages (in Minutes) for 100 Randomly Selected Employees (DS CellUse)

EXAMPLE 3.5 The Cell Phone Case: Reducing Cellular Phone Costs

Suppose that a cellular management service tells the bank that if its cellular cost per minute for the random sample of 100 bank employees is over 18 cents per minute, the bank will benefit from automated cellular management of its calling plans. Last month's cellular usages for the 100 randomly selected employees are given in Table 1.4 (page 9), and a dot plot of these usages is given in the page margin. If we add the usages together, we find that the 100 employees used a total of 46,625 minutes. Furthermore, the total cellular cost incurred by the 100 employees is found to be $9,317 (this total includes base costs, overage costs, long distance, and roaming). This works out to an average of $9,317/46,625 = $0.1998, or 19.98 cents per minute. Because this average cellular cost per minute exceeds 18 cents per minute, the bank will hire the cellular management service to manage its calling plans. BI
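The arithmetic behind this business improvement conclusion is easy to verify. Here is a minimal Python sketch that recomputes the average cost per minute from the two sample totals quoted above; the variable names are ours, not the book's.

# Cell Phone Case: average cellular cost per minute for the sample
total_minutes = 46_625    # total usage for the 100 sampled employees
total_cost = 9_317.00     # total cellular cost in dollars

cost_per_minute = total_cost / total_minutes
print(f"Average cost per minute: ${cost_per_minute:.4f}")  # $0.1998

# Decision rule quoted by the cellular management service
if cost_per_minute > 0.18:
    print("Over 18 cents per minute: hire the cellular management service.")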
Figures and Tables
Throughout the text, charts, graphs, tables, and Excel and MINITAB outputs are used to illustrate statistical concepts. For example:

• In Chapter 3 (Descriptive Statistics: Numerical Methods), the following figures are used to help explain the Empirical Rule. Moreover, in The Car Mileage Case an automaker uses the Empirical Rule to find estimates of the "typical," "lowest," and "highest" mileage that a new midsize car should be expected to get in combined city and highway driving. In actual practice, real automakers have provided similar information broken down into separate estimates for city and highway driving—see the Buick LaCrosse new car sticker in Figure 3.14.

FIGURE 3.14: EPA Fuel Economy Estimates for the 2012 Buick LaCrosse. The sticker reports city MPG (expected range for most drivers 14 to 20 MPG), highway MPG (expected range for most drivers 22 to 32 MPG), and an estimated annual fuel cost of $2,485 based on 15,000 miles at $3.48 per gallon; it notes that these estimates reflect new EPA methods beginning with 2008 models and that actual mileage will vary depending on how the car is driven and maintained.

FIGURE 3.15: The Empirical Rule and Tolerance Intervals for a Normally Distributed Population, with Estimated Tolerance Intervals in the Car Mileage Case. Panel (a) illustrates the Empirical Rule: 68.26 percent of the population measurements are within (plus or minus) one standard deviation of the mean, 95.44 percent are within two standard deviations, and 99.73 percent are within three standard deviations. Panel (b), based on a histogram of the 50 mileages, shows the estimated tolerance intervals: [30.8, 32.4] for the mileages of 68.26 percent of all individual cars, [30.0, 33.2] for 95.44 percent, and [29.2, 34.0] for 99.73 percent.
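To see where the estimated tolerance intervals in Figure 3.15 come from, the short sketch below applies the Empirical Rule directly. The sample mean of 31.6 mpg and standard deviation of 0.8 mpg are read off the figure's intervals, so treat them as illustrative values rather than the book's raw data.

# Empirical Rule: for a roughly normal sample, x_bar +/- k*s contains about
# 68.26% (k=1), 95.44% (k=2), and 99.73% (k=3) of the measurements.
x_bar, s = 31.6, 0.8   # mean and std dev implied by Figure 3.15

for k, pct in [(1, 68.26), (2, 95.44), (3, 99.73)]:
    lo, hi = x_bar - k * s, x_bar + k * s
    print(f"~{pct}% of mileages in [{lo:.1f}, {hi:.1f}] mpg")
# prints [30.8, 32.4], [30.0, 33.2], [29.2, 34.0], matching the figure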
• In Chapter 7 (Sampling and Sampling Distributions), the following figures (and others) are used to help explain the sampling distribution of the sample mean and the Central Limit Theorem. In addition, the figures describe different applications of random sampling in The Car Mileage Case, and thus this case is used as an integrative tool to help students understand sampling distributions.

FIGURE 7.1: A Comparison of Individual Car Mileages and Sample Means. Panel (a) graphs the probability distribution describing the population of six individual car mileages; panel (b) graphs the probability distribution describing the population of 15 sample means.

FIGURE 7.2: The Normally Distributed Population of All Individual Car Mileages and the Normally Distributed Population of All Possible Sample Means. Samples of n = 5 mileages (for example, 30.8, 31.9, 30.3, 32.1, 31.4, with sample mean x-bar = 31.3) are drawn from the population of individual mileages, whose mean is 31.6.

FIGURE 7.3: A Comparison of (1) the Population of All Individual Car Mileages, (2) the Sampling Distribution of the Sample Mean When n = 5, and (3) the Sampling Distribution of the Sample Mean When n = 50. Each sampling distribution is normal with mean equal to the population mean and standard deviation equal to the population standard deviation divided by the square root of n, so the distribution tightens as n grows.

FIGURE 7.5: The Central Limit Theorem Says That the Larger the Sample Size Is, the More Nearly Normally Distributed Is the Population of All Possible Sample Means. Panel (a) shows several differently shaped sampled populations; panel (b) shows the corresponding populations of all possible sample means for sample sizes n = 2, n = 6, and n = 30.

• In Chapter 8 (Confidence Intervals), the following figure (and others) are used to help explain the meaning of a 95 percent confidence interval for the population mean. Furthermore, in The Car Mileage Case an automaker uses a confidence interval procedure specified by the Environmental Protection Agency (EPA) to find the EPA estimate of a new midsize model's true mean mileage.

FIGURE 8.2: Three 95 Percent Confidence Intervals for the Population Mean. The probability is .95 that the sample mean will be within plus or minus 1.96 standard errors (here, .22) of the population mean; repeated samples of n = 50 car mileages (with sample means such as 31.56, 31.68, and 31.2) give intervals of the form x-bar plus or minus .22, about 95 percent of which contain the population mean.

• In Chapter 9 (Hypothesis Testing), a five-step hypothesis testing procedure, new graphical hypothesis testing summary boxes, and many graphics are used to show how to carry out hypothesis tests. The summary box for a t test about a population mean (sigma unknown) reads as follows. Assumptions: normal population or large sample size. Null hypothesis H0: mu = mu0; test statistic t = (x-bar − mu0)/(s/√n), with df = n − 1. Critical value rules: for Ha: mu > mu0, reject H0 if t > t(alpha); for Ha: mu < mu0, reject H0 if t < −t(alpha); for Ha: mu ≠ mu0, reject H0 if |t| > t(alpha/2). Equivalently, reject H0 if the p-value is less than alpha, where the p-value is the area to the right of t, the area to the left of t, or twice the area to the right of |t|, respectively.

The Five Steps of Hypothesis Testing: (1) state the null hypothesis H0 and the alternative hypothesis Ha; (2) specify the level of significance alpha; (3) select the test statistic; (4) using a critical value rule, determine the rule for deciding whether to reject H0, collect the sample data, compute the value of the test statistic, and decide whether to reject H0 by using the critical value rule (or, using a p-value, collect the sample data, compute the value of the test statistic and the p-value, and reject H0 at level of significance alpha if the p-value is less than alpha); (5) interpret the statistical results.

FIGURE 9.5: Testing H0: mu = 1.5 versus Ha: mu < 1.5 by Using a Critical Value and the p-Value. (a) Setting alpha = .01 with 14 degrees of freedom gives t(.01) = 2.624, so the critical value rule is: if t < −2.624, reject H0: mu = 1.5. (b) The test statistic is t = −3.1589 and the p-value is .00348, as on the MINITAB output:

Test of mu = 1.5 vs < 1.5
Variable   N    Mean     StDev    SE Mean   95% Upper Bound   T       P
Ratio      15   1.3433   0.1921   0.0496    1.4307            -3.16   0.003
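The test in Figure 9.5 can be reproduced from the summary statistics on the MINITAB output alone. The sketch below does this with scipy; since the 15 individual data values are not shown in the figure, only n, the sample mean, and the standard deviation are used.

from scipy import stats

# One-sample t test of H0: mu = 1.5 vs Ha: mu < 1.5 (Figure 9.5)
n, x_bar, s, mu0 = 15, 1.3433, 0.1921, 1.5

se = s / n ** 0.5                      # 0.0496, the SE Mean on the output
t = (x_bar - mu0) / se                 # -3.1589
p_value = stats.t.cdf(t, df=n - 1)     # left-tail area: about .0035

print(f"t = {t:.4f}, p-value = {p_value:.5f}")
# t < -t_.01 = -2.624 and p-value < .01, so reject H0 at alpha = .01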
• In Chapters 13 and 14 (Simple Linear and Multiple Regression), a substantial number of data plots, Excel and MINITAB outputs, and other graphics are used to teach simple and multiple regression analysis. For example, in The Tasty Sub Shop Case a business entrepreneur uses data plotted in Figures 14.1 and 14.2 and the Excel and MINITAB outputs in Figure 14.4 to predict the yearly revenue of a potential Tasty Sub Shop restaurant site on the basis of the population and business activity near the site. Using the 95 percent prediction interval on the MINITAB output and projected restaurant operating costs, the entrepreneur decides whether to purchase a Tasty Sub Shop franchise for the potential restaurant site.

FIGURE 14.1: Plot of y (Yearly Revenue) versus x1 (Population Size). FIGURE 14.2: Plot of y (Yearly Revenue) versus x2 (Business Rating).

FIGURE 14.4: Excel and MINITAB Outputs of a Regression Analysis of the Tasty Sub Shop Revenue Data in Table 14.1 Using the Model y = β0 + β1x1 + β2x2 + ε. Key results on the outputs (n = 10 observations): the regression equation is revenue = 125 + 14.2 population + 22.8 bus_rating; the least squares point estimates, their t statistics, and their p-values are b0 = 125.29 (t = 3.06, p = .018), b1 = 14.1996 (t = 15.60, p < .001), and b2 = 22.811 (t = 3.95, p = .006), with 95 percent confidence intervals for the βj also reported; the standard error is s = 36.6856, R² = 98.10 percent, and adjusted R² = 97.6 percent; the F(model) statistic is 180.69 with a p-value less than .001 (explained variation 486,356, unexplained variation SSE = 9,421, total variation 495,777). For a new site with population x1 = 47.3 and a given business rating x2, the outputs show the point prediction ŷ = 956.6 together with its standard error, the 95 percent confidence interval (921.0, 992.2), and the 95 percent prediction interval (862.8, 1050.4).
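A reader who wants to reproduce Figure 14.4 outside of Excel or MINITAB can fit the same two-predictor model in Python with statsmodels. The Table 14.1 data are not reproduced here, so the data-loading step is an assumption: a hypothetical file tasty_sub.csv with columns population, bus_rating, and revenue, and a business rating of 7 for the new site.

import pandas as pd
import statsmodels.formula.api as smf

# Tasty Sub Shop model: revenue = b0 + b1*population + b2*bus_rating + error
data = pd.read_csv("tasty_sub.csv")    # hypothetical file holding Table 14.1
fit = smf.ols("revenue ~ population + bus_rating", data=data).fit()
print(fit.summary())                   # coefficients, t stats, R-squared, F

# 95% confidence and prediction intervals for a new site, analogous to
# the "Predicted Values for New Observations" block on the MINITAB output
new_site = pd.DataFrame({"population": [47.3], "bus_rating": [7]})  # rating assumed
pred = fit.get_prediction(new_site)
print(pred.summary_frame(alpha=0.05))  # mean_ci_* and obs_ci_* columns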
Exercises
Many of the exercises in the text require the analysis of real data. Data sets are identified by an icon in the text and are included on the Online Learning Center (OLC): www.mhhe.com/bowermaness5e. Exercises in each section are broken into two parts—"Concepts" and "Methods and Applications"—and there are supplementary and Internet exercises at the end of each chapter.

2.7 Below we give the overall dining experience ratings (Outstanding, Very Good, Good, Average, or Poor) of 30 randomly selected patrons at a restaurant on a Saturday evening. DS RestRating

Outstanding, Outstanding, Very Good, Outstanding, Good, Good, Outstanding, Outstanding, Good, Very Good, Very Good, Outstanding, Outstanding, Very Good, Outstanding, Very Good, Very Good, Outstanding, Outstanding, Very Good, Outstanding, Very Good, Outstanding, Very Good, Good, Good, Average, Very Good, Outstanding, Outstanding

a. Find the frequency distribution and relative frequency distribution for these data.
b. Construct a percentage bar chart for these data.
c. Construct a percentage pie chart for these data.

Chapter Ending Material and Excel/MINITAB/MegaStat Tutorials
The end-of-chapter material includes a chapter summary, a glossary of terms, important formula references, and comprehensive appendices that show students how to use Excel, MINITAB, and MegaStat. For example, Chapter 3 ends with the following summary and glossary excerpts.

Chapter Summary
We began this chapter by presenting and comparing several measures of central tendency. We defined the population mean and we saw how to estimate the population mean by using a sample mean. We also defined the median and mode, and we compared the mean, median, and mode for symmetrical distributions and for distributions that are skewed to the right or left. We then studied measures of variation (or spread). We defined the range, variance, and standard deviation, and we saw how to estimate a population variance and standard deviation by using a sample. We learned that a good way to interpret the standard deviation when a population is (approximately) normally distributed is to use the Empirical Rule, and we studied Chebyshev's Theorem, which gives us intervals containing reasonably large fractions of the population units no matter what the population's shape might be. We also saw that, when a data set is highly skewed, it is best to use percentiles and quartiles to measure variation, and we learned how to construct a box-and-whiskers plot by using the quartiles. After learning how to measure and depict central tendency and variability, we presented several optional topics. First, we discussed several numerical measures of the relationship between two variables. These included the covariance, the correlation coefficient, and the least squares line. We then introduced the concept of a weighted mean and also explained how to compute descriptive statistics for grouped data. Finally, we showed how to calculate the geometric mean and demonstrated its interpretation.

Glossary of Terms
box-and-whiskers display (box plot): A graphical portrayal of a data set that depicts both the central tendency and variability of the data. It is constructed using Q1, Md, and Q3. (pages 121, 122)
central tendency: A term referring to the middle of a population or sample of measurements. (page 99)
Chebyshev's Theorem: A theorem that (for any population) allows us to find an interval that contains a specified percentage of the individual population measurements, whatever the population's shape might be.
outlier (in a box-and-whiskers display): A measurement less than the lower limit or greater than the upper limit. (page 122)
percentile: The value such that a specified percentage of the measurements in a population or sample fall at or below it. (page 118)
point estimate: A one-number estimate for the value of a population parameter. (page 99)

Excel tutorial example: Constructing a scatter plot of sales volume versus advertising expenditure, as in Figure 2.24 on page 67 (data file: SalesPlot.xlsx):
• Enter the advertising and sales data in Table 2.20 on page 67 into columns A and B—advertising expenditures in column A with label "Ad Exp" and sales values in column B with label "Sales Vol." Note: The variable to be graphed on the horizontal axis must be in the first column (that is, the left-most column) and the variable to be graphed on the vertical axis must be in the second column (that is, the right-most column).
• Select the entire range of data to be graphed.
• Select Insert : Scatter : Scatter with only Markers.
• The scatter plot will be displayed in a graphics window. Move the plot to a chart sheet and edit appropriately.

14.10 Model Building and the Effects of Multicollinearity

Unfortunately, although we would like to test the significance of the independent variables in this model, extreme multicollinearity (relationships between the independent variables) exists when using squared and interaction variables. Thus, the usual t tests for assessing the significance of individual independent variables might not be reliable. As an alternative, we will use a partial F-test. Specifically, considering the model with the smallest s of 174.6 and a total of 17 variables to be a complete model, we will use this test to assess whether at least one variable in the subset of 12 squared and interaction variables in this model is significant.

The Partial F-Test: An F-Test for a Portion of a Regression Model

Suppose that the regression assumptions hold, and consider a complete model that uses k independent variables. To assess whether at least one of the independent variables in a subset of k* independent variables in this model is significant, we test the null hypothesis

H0: All of the βj coefficients corresponding to the independent variables in the subset are zero,

which says that none of the independent variables in the subset are significant, versus

Ha: At least one of the βj coefficients corresponding to the independent variables in the subset is not equal to zero,

which says that at least one of the independent variables in the subset is significant. Let SSEC denote the unexplained variation for the complete model, and let SSER denote the unexplained variation for the reduced model that uses all k independent variables except for the k* independent variables in the subset. Also, define

F(partial) = [(SSER − SSEC)/k*] / [SSEC/(n − (k + 1))]

and define the p-value related to F(partial) to be the area under the curve of the F distribution (having k* and [n − (k + 1)] degrees of freedom) to the right of F(partial). Then, we can reject H0 in favor of Ha at level of significance α if either of the following equivalent conditions holds: F(partial) > Fα, or p-value < α. Here the point Fα is based on k* numerator and n − (k + 1) denominator degrees of freedom.

Using Excel or MINITAB, we find that the unexplained variation for the complete model that uses all k = 17 variables is SSEC = 213,396.12, and the unexplained variation for the reduced model that does not use the k* = 12 squared and interaction variables (and thus uses only the linear variables) is SSER = 3,516,859.2.
or sample fall at or below it (page 118) point estimate: A one-number estimate for the value of a population parameter (page 99) l ti (d t d ) Th f l i f Constructing a scatter plot of sales volume versus advertising expenditure as in Figure 2.24 on page 67 (data file: SalesPlot.xlsx): • Enter the advertising and sales data in Table 2.20 on page 67 into columns A and B—advertising expenditures in column A with label “Ad Exp” and sales values in column B with label “Sales Vol.” Note: The variable to be graphed on the horizontal axis must be in the first column (that is, the left-most column) and the variable to be graphed on the vertical axis must be in the second column (that is, the rightmost column) • • Select the entire range of data to be graphed • The scatter plot will be displayed in a graphics window Move the plot to a chart sheet and edit appropriately Select Insert : Scatter : Scatter with only Markers 14.10 Model Building and the Effects of Multicollinearity 573 Unfortunately, although we would like to test the significance of the independent variables in this model, extreme multicollinearity (relationships between the independent variables) exists when using squared and interaction variables Thus, the usual t tests for assessing the significance of individual independent variables might not be reliable As an alternative, we will use a partial F-test Specifically, considering the model with the smallest s of 174.6 and a total of 17 variables to be a complete model, we will use this test to assess whether at least one variable in the subset of 12 squared and interaction variables in this model is significant The Partial F-Test: An F-Test for a Portion of a Regression Model S uppose that the regression assumptions hold, and consider a complete model that uses k independent variables To assess whether at least one of the independent variables in a subset of k* independent variables in this model is significant, we test the null hypothesis reduced model that uses all k independent variables except for the k* independent variables in the subset Also, define H0: All of the ␤j coefficients corresponding to the independent variables in the subset are zero and define the p-value related to F(partial) to be the area under the curve of the F distribution (having k* and [n Ϫ (k ϩ 1)] degrees of freedom) to the right of F(partial) Then, we can reject H0 in favor of Ha at level of significance a if either of the following equivalent conditions holds: which says that none of the independent variables in the subset are significant We test H0 versus Ha: At least one of the ␤j coefficients corresponding F(partial) ϭ to the independent variables in the subset is not equal to zero which says that at least one of the independent variables in the subset is significant Let SSEC denote the unexplained variation for the complete model, and let SSER denote the unexplained variation for the (SSER Ϫ SSEC ) ͞k* SSEC ͞[n Ϫ (k ϩ 1)] F (partial) Ͼ Fa p-value Ͻ a Here the point Fa is based on k* numerator and n Ϫ (k ϩ 1) denominator degrees of freedom Using Excel or MINITAB, we find that the unexplained variation for the complete model that uses all k ϭ 17 variables is SSEC ϭ 213,396.12 and the unexplained variation for the reduced model that does not use the k* ϭ 12 squared and interaction variables (and thus uses only the linear variables) is SSER ϭ 3,516,859.2 It follows that F(partial) ϭ (SSE R Ϫ SSE C)͞k* SSE C͞[n Ϫ (k ϩ 1)] ϭ (3,516,859.2 Ϫ 213,396.12)͞12 213,396.12͞[25 Ϫ (17 ϩ 1)] ϭ 3,303,463.1͞12 ϭ 9.03 
Exercises for Section 14.10

CONCEPTS

14.38 What is multicollinearity? What problems can be caused by multicollinearity?

14.39 List the criteria and model selection procedures we use to compare regression models.

FIGURE 14.30: Multicollinearity and a Model Building Analysis in the Hospital Labor Needs Case.
(a) Excel output of the correlation matrix. The correlations with Hours are: Xray .9425, BedDays .9889, Length .5603, Load .9886, and Pop .9465. Among the independent variables, Xray correlates .9048 with BedDays, .4243 with Length, .9051 with Load, and .9124 with Pop; BedDays correlates .6609 with Length, .9999 with Load, and .9328 with Pop; Length correlates .6610 with Load and .4515 with Pop; and Load correlates .9353 with Pop.
(b) A MINITAB regression analysis using all five independent variables:

Predictor      Coef      SE Coef   T       P       VIF
Constant       2270.4    670.8     3.38    0.007
Xray (x1)      0.04112   0.01368   3.01    0.013   8.1
BedDays (x2)   1.413     1.925     0.73    0.48    8684.2
Length (x3)    -467.9    131.6     -3.55   0.005   4.2
Load (x4)      -9.30     60.81     -0.15   0.882   9334.5
Pop (x5)       -3.223    4.474     -0.72   0.488   23.0

(c) A best subsets output giving the two best models of each size, comparing R-Sq, adjusted R-Sq, Mallows C-p, and s for models using one through five of the variables Xray, BedDays, Length, Load, and Pop; for example, the best three-variable model has R-Sq = 99.6 percent and s = 387.16, and the best four-variable model has s = 381.56.
(d) Stepwise regression (αentry = αstay = .10), which in three steps enters BedDays (s = 857, R-Sq = 97.79 percent), then Length (s = 489, R-Sq = 99.33 percent), then Xray (s = 387, R-Sq = 99.61 percent).

METHODS AND APPLICATIONS

14.40 THE HOSPITAL LABOR NEEDS CASE DS HospLab2

Recall that Table 14.6 (page 534) presents data concerning the need for labor in 16 U.S. Navy hospitals. This table gives values of the dependent variable Hours (monthly labor hours) and of the independent variables Xray (monthly X-ray exposures), BedDays (monthly occupied bed days—a hospital has one occupied bed day if one bed is occupied for an entire day), and Length (average length of patients' stay, in days). The data in Table 14.6 are part of a larger data set analyzed by the Navy. The complete data set includes two additional independent variables—Load (average daily patient load) and Pop (eligible population in the area, in thousands)—values of which are given in the page margin:

Load: 15.57, 44.02, 20.42, 18.74, 49.20, 44.92, 55.48, 59.28, 94.39, 128.02, 96.00, 131.42, 127.21, 409.20, 463.70, 510.22
Pop: 18.0, 9.5, 12.8, 36.7, 35.7, 24.0, 43.3, 46.7, 78.7, 180.5, 60.9, 103.7, 126.8, 169.4, 331.4, 371.6

Figure 14.30 gives Excel and MINITAB outputs of multicollinearity analysis and model building for the complete hospital labor needs data set.
a. (1) Find the three largest simple correlation coefficients among the independent variables and the three largest variance inflation factors in Figure 14.30(a) and (b). (2) Discuss why these statistics imply that the independent variables BedDays, Load, and Pop are most strongly involved in multicollinearity and thus contribute possibly redundant information for predicting Hours. Note that, although we have reasoned in Exercise 14.6(a) on page 535 that a negative coefficient (that is, least squares point estimate) for Length might be intuitively reasonable, the negative coefficients for Load and Pop [see Figure 14.30(b)] are not intuitively reasonable and are a further indication of strong multicollinearity.
We conclude that a final regression model for predicting Hours may not need all three of the potentially redundant independent variables BedDays, Load, and Pop.
b. Figure 14.30(c) indicates that the two best hospital labor needs models are the model using Xray, BedDays, Pop, and Length, which we will call Model 1, and the model using Xray, BedDays, and Length, which we will call Model 2. (1) Which model gives the smallest value of s and the largest value of R²? (2) Which model gives the smallest value of C? (3) Consider a questionable hospital for which Xray = 56,194, BedDays = 14,077.88, Pop = 329.7, and Length = 6.89. The 95 percent prediction intervals given by Models 1 and 2 for labor hours corresponding to this combination of values of the independent variables are, respectively, [14,888.43, 16,861.30] and [14,906.24, 16,886.26]. Which model gives the shortest prediction interval?
c. (1) Which model is chosen by stepwise regression in Figure 14.30(d)? (2) If we start with all five potential independent variables and use backward elimination with an αstay of .05, the procedure removes (in order) Load and Pop and then stops. Which model is chosen by backward elimination? (3) Overall, which model seems best? (4) Which of BedDays, Load, and Pop does this best model use?

14.41 THE SALES REPRESENTATIVE CASE DS SalePerf

Consider Figure 14.29 on page 572. The model using 12 squared and interaction variables has the smallest s. However, if we desire a somewhat simpler model, note that s does not increase substantially until we move from a model having seven squared and interaction variables to a model having six such variables. Moreover, we might subjectively conclude that the s of 210.70 for the model using seven squared and interaction variables is not that much larger than the s of 174.6 for the model using 12 squared and interaction variables. Using the fact that the unexplained variations for these respective models are 532,733.88 and 213,396.12, perform a partial F-test to assess whether at least one of the extra five squared and interaction variables is significant. If none of the five extra variables are significant, we might consider the simpler model to be best.

LO14-10 Use residual analysis to check the assumptions of multiple regression.

14.11 Residual Analysis in Multiple Regression

Basic residual analysis. For a multiple regression model we plot the residuals given by the model against (1) values of each independent variable, (2) values of the predicted value of the dependent variable, and (3) the time order in which the data have been observed (if the regression data are time series data). A fanning-out pattern on a residual plot indicates an increasing error variance; a funneling-in pattern indicates a decreasing error variance. Both violate the constant variance assumption. A curved pattern on a residual plot indicates that the functional form of the regression model is incorrect. If the regression data are time series data, a cyclical pattern on the residual plot versus time suggests positive autocorrelation, while an alternating pattern suggests negative autocorrelation. Both violate the independence assumption. On the other hand, if all residual plots have (at least approximately) a horizontal band appearance, then it is reasonable to believe that the constant variance, correct functional form, and independence assumptions approximately hold. To check the normality assumption, we can construct a histogram, stem-and-leaf display, and normal plot of the residuals. The histogram and stem-and-leaf display should look bell-shaped and symmetric about 0; the normal plot should have a straight-line appearance.
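These residual plots are straightforward to produce in Python once a model has been fit. A minimal sketch with matplotlib and statsmodels; the fitted-model object `fit` is assumed to come from an earlier statsmodels OLS call, as in the Tasty Sub sketch above.

import matplotlib.pyplot as plt
import statsmodels.api as sm

# Basic residual analysis for a fitted statsmodels OLS model `fit`
residuals = fit.resid
fitted = fit.fittedvalues

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Residuals vs. predicted values: look for a horizontal band
ax1.scatter(fitted, residuals)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Predicted value")
ax1.set_ylabel("Residual")

# Normal plot of the residuals: look for a straight line
sm.qqplot(residuals, line="s", ax=ax2)

plt.tight_layout()
plt.show()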
To illustrate these ideas, consider the sales representative performance data in Table 14.8 (page 549). Figure 14.10 (page 549) gives a partial Excel output of a regression analysis of these data using the model that relates y to x1, x2, x3, x4, and x5. The least squares point estimates on the output give the prediction equation

ŷ = −1,113.7879 + 3.6121x1 + .0421x2 + .1289x3 + 256.9555x4 + 324.5334x5

Using this prediction equation, we can calculate the predicted sales values and residuals given in the page margin (the margin lists these for all 25 observations; observation 1, for example, has predicted sales 3,504.990 and residual 164.890). For example, observation 10 corresponds to a sales representative for whom x1 = 105.69, x2 = 42,053.24, x3 = 5,673.11, x4 = 8.85, and x5 = 31. If we insert these values into the prediction equation, we obtain a predicted sales value of ŷ10 = 4,143.597. Because the actual sales for the sales representative are y10 = 4,876.370, the residual e10 equals the difference between y10 = 4,876.370 and ŷ10 = 4,143.597, which is 732.773. The normal plot of the residuals in Figure 14.31(a) has an approximate straight-line appearance. The plot of the residuals versus predicted sales in Figure 14.31(b) has a horizontal band appearance, as do the plots of the residuals versus the independent variables (these plots are not given here). We conclude that the regression assumptions approximately hold for the sales representative performance model.

Outliers. An observation that is well separated from the rest of the data is called an outlier, and an observation may be an outlier with respect to its y value and/or its x values. We illustrate these ideas by considering Figure 14.32, which is a hypothetical plot of the values of a dependent variable y against an independent variable x. Observation 1 in this figure is outlying with respect to its y value, but not with respect to its x value. Observation 2 is outlying with respect to its x value, but because its y value is consistent with the regression relationship displayed by the nonoutlying observations, it is not outlying with respect to its y value. Observation 3 is an outlier with respect to its x value and its y value. It is important to identify outliers because (as we will see) outliers can have adverse effects on a regression analysis and thus are candidates for removal from a data set. Moreover, in addition to using data plots, we can use more sophisticated procedures to detect outliers. For example, suppose that the U.S. Navy wishes to develop a regression model based on efficiently run Navy hospitals to evaluate the labor needs of questionably run Navy hospitals. Figure 14.33(a) gives labor needs data for 17 Navy hospitals. Specifically, this table gives values of the dependent variable Hours (y, monthly labor hours required) and of the independent variables Xray (x1, monthly X-ray exposures), BedDays (x2, monthly occupied bed days—a hospital has one occupied bed day if one bed is occupied for an entire day), and Length (x3, average length of patients' stay, in days). When we perform a regression analysis of these data using the model relating y to x1, x2, and x3, we obtain the Excel add-in (MegaStat) output of residuals and outlier diagnostics shown in Figure 14.33(b), as well as the residual plot shown in Figure 14.33(c). (MINITAB gives the same diagnostics, and at the end of this section we will give formulas for most of these diagnostics.) We now explain the meanings of the diagnostics.

Leverage values. The leverage value for an observation is the distance value that has been discussed in the optional technical note at the end of Section 14.6 (page 546). This value is a measure of the distance between the observation's x values and the center of all of the observed x values. If the leverage value for an observation is large, the observation is outlying with respect to its x values and thus would have substantial leverage in determining the least squares prediction equation. For example, each of observations 2 and 3 in Figure 14.32 is an outlier with respect to its x value and thus would have substantial leverage in determining the position of the least squares line. Moreover, because observations 2 and 3 have inconsistent y values, they would pull the least squares line in opposite directions. A leverage value is considered to be large if it is greater than twice the average of all of the leverage values, which can be shown to be equal to 2(k + 1)/n. [The Excel add-in (MegaStat) shades such a leverage value in dark blue.]
outliers can have adverse effects on a regression analysis and thus are candidates for removal from a data set Moreover, in addition to using data plots, we can use more sophisticated procedures to detect outliers For example, suppose that the U.S Navy wishes to develop a regression model based on efficiently run Navy hospitals to evaluate the labor needs of questionably run Navy hospitals Figure 14.33(a) gives labor needs data for 17 Navy hospitals Specifically, this table gives values of the dependent variable Hours (y, monthly labor hours required) and of the independent variables Xray (x1, monthly X-ray exposures), BedDays (x2, monthly occupied bed days—a hospital has one occupied bed day if one bed is occupied for an entire day), and Length (x3, average length of patients’ stay, in days) When we perform a regression analysis of these data using the model relating y to x1, x2, and x3, we obtain the Excel add-in (MegaStat) output of residuals and outlier diagnostics shown in Figure 14.33(b), as well as the residual plot shown in Figure 14.33(c) (MINITAB gives the same diagnostics, and at the end of this section we will give formulas for most of these diagnostics) We now explain the meanings of the diagnostics Leverage values The leverage value for an observation is the distance value that has been discussed in the optional technical note at the end of Section 14.6 (page 546) This value is a measure of the distance between the observation’s x values and the center of all of the observed x values If the leverage value for an observation is large, the observation is outlying with respect to its x values and thus would have substantial leverage in determining the least squares prediction equation For example, each of observations and in Figure 14.32 is an outlier with respect to its x value and thus would have substantial leverage in determining the position of the least squares line Moreover, because observations and have inconsistent y values, they would pull the least squares line in opposite directions A leverage value is considered to be large if it is greater than twice the average of all of the leverage values, which can be shown to be equal to 2(k ϩ 1)͞n [The Excel add-in (MegaStat) shades such a leverage value in dark blue.] 
For example, because there are n ϭ 17 observations in Figure 14.33(a) and because the model relating y to x1, x2, and x3 utilizes k ϭ independent variables, twice the average leverage value is 2(k ϩ 1)͞n ϭ 2(3 ϩ 1)͞17 ϭ 4706 Looking at Figure 14.33(b), we see that the leverage values for hospitals 15, 16, and 17 are, respectively, 682, 785, and 863 Because these leverage values are greater than 4706, we conclude that hospitals 15, 16, and 17 are outliers with respect to their x values Intuitively, this is because Figure14.33(a) indicates that x2 (monthly occupied bed days) is substantially larger for hospitals 15, 16, and 17 than for hospitals through 14 Also note that both x1 (monthly X-ray exposures) and x2 (monthly occupied bed days) are substantially larger for hospital 14 than for hospitals through 13 To summarize, we might classify hospitals through 13 as small to medium sized hospitals and hospitals 14, 15, 16, and 17 as larger hospitals 14.11 FIGURE 14.33 (a) The data DS Hospital Labor Needs Data, Outlier Diagnostics, and Residual Plots HospLab3 Hours y Xray x1 BedDays x2 Length x3 566.52 696.82 1033.15 1603.62 1611.37 1613.27 1854.17 2160.55 2305.58 3503.93 3571.89 3741.40 4026.52 10343.81 11732.17 15414.94 18854.45 2463 2048 3940 6505 5723 11520 5779 5969 8461 20106 13313 10771 15543 36194 34703 39204 86533 472.92 1339.75 620.25 568.33 1497.60 1365.83 1687.00 1639.92 2872.33 3655.08 2912.00 3921.00 3865.67 7684.10 12446.33 14098.40 15524.00 4.45 6.92 4.28 3.90 5.50 4.60 5.62 5.15 6.18 6.15 5.88 4.88 5.50 7.00 10.78 7.05 6.35 Hospital 10 11 12 13 14 15 16 17 577 Residual Analysis in Multiple Regression (b) Excel add-in (MegaStat) outlier diagnostics for the model y ‫␤ ؍‬0 ؉ ␤1 x1 ؉ ␤2 x2 ؉ ␤3 x3 ؉ ⑀ Observation 10 11 12 13 14 15 16 17 Residual -121.889 -25.028 67.757 431.156 84.590 -380.599 177.612 369.145 -493.181 -687.403 380.933 -623.102 -337.709 1,630.503 -348.694 281.914 -406.003 Leverage 0.121 0.226 0.130 0.159 0.085 0.112 0.084 0.083 0.085 0.120 0.077 0.177 0.064 0.146 0.682 0.785 0.863 Studentized Residual -0.211 -0.046 0.118 0.765 0.144 -0.657 0.302 0.627 -0.838 -1.192 0.645 -1.117 -0.568 2.871 -1.005 0.990 -1.786 Studentized Deleted Residual -0.203 -0.044 0.114 0.752 0.138 -0.642 0.291 0.612 -0.828 -1.214 0.630 -1.129 -0.553 4.558 -1.006 0.989 -1.975 (c) Plot of residuals in Figure 14.33(b) (d) Plot of residuals for Option (e) Plot of residuals for Option Residual (gridlines ϭ std error) Residual (gridlines ϭ std error) Residual (gridlines ϭ std error) Source: “Hospital Labor Needs Data” from Procedures and Analysis for Staffing Standards Development: Regression Analysis Handbook, © 1979 1,844.338 1,229.559 614.779 0.000 Ϫ614.779 Ϫ1,229.559 5000 10000 15000 20000 25000 Predicted 774.320 387.160 0.000 Ϫ387.160 Ϫ774.320 1,091.563 727.708 363.854 0.000 Ϫ363.854 Ϫ727.708 5000 10000 15000 20000 Predicted Residuals and studentized residuals To identify outliers with respect to their y values, we can use residuals Any residual that is substantially different from the others is suspect For example, note from Figure 14.33(b) that the residual for hospital 14, e14 ϭ 1630.503, seems much larger than the other residuals Assuming that the labor hours of 10,343.81 for hospital 14 has not been misrecorded, the residual of 1630.503 says that the labor hours are 1630.503 hours more than predicted by the regression model If we divide an observation’s residual by the residual’s standard error, we obtain a studentized residual For example, Figure 14.33(b) tells us that the studentized 
Deleted residuals and studentized deleted residuals. Consider again Figure 14.32, and suppose that we use observation 3 to help determine the least squares line. Doing this might draw the least squares line toward observation 3, causing the point prediction ŷ3 given by the line to be near y3 and thus the usual residual y3 − ŷ3 to be small. This would falsely imply that observation 3 is not an outlier with respect to its y value. Moreover, this sort of situation shows the need for computing a deleted residual. For a particular observation, observation i, the deleted residual is found by subtracting from yi the point prediction ŷ(i) computed using least squares point estimates based on all n observations except for observation i. Standard statistical software packages calculate the deleted residual for each observation and divide this residual by its standard error to form the studentized deleted residual. The experience of the authors leads us to suggest that one should conclude that an observation is an outlier with respect to its y value if (and only if) the studentized deleted residual is greater in absolute value than t.005, which is based on n − k − 2 degrees of freedom. [The Excel add-in (MegaStat) shades such a studentized deleted residual in dark blue.] For the hospital labor needs model, n − k − 2 = 17 − 3 − 2 = 12, and therefore t.005 = 3.055. The studentized deleted residual for hospital 14, which equals 4.558 [see Figure 14.33(b)], is greater in absolute value than t.005 = 3.055. Therefore, we conclude that hospital 14 is an outlier with respect to its y value.

An example of dealing with outliers. One option for dealing with the fact that hospital 14 is an outlier with respect to its y value is to assume that hospital 14 has been run inefficiently. Because we need to develop a regression model using efficiently run hospitals, based on this assumption we would remove hospital 14 from the data set. If we perform a regression analysis using a model relating y to x1, x2, and x3 with hospital 14 removed from the data set (we call this Option 1), we obtain a standard error of s = 387.16. This s is considerably smaller than the large standard error of 614.779 caused by hospital 14's large residual when we use all 17 hospitals to relate y to x1, x2, and x3. A second option is motivated by the fact that large organizations sometimes exhibit inherent inefficiencies. To assess whether there might be a general large hospital inefficiency, we define a dummy variable DL that equals 1 for the larger hospitals 14–17 and 0 for the smaller hospitals 1–13. If we fit the resulting regression model y = β0 + β1x1 + β2x2 + β3x3 + β4DL + ε to all 17 hospitals (we call this Option 2), we obtain a b4 of 2,871.78 and a p-value for testing H0: β4 = 0 of .0003. This indicates the existence of a large hospital inefficiency that is estimated to be an extra 2,871.78 hours per month. In addition, the dummy variable model's s is 363.854, which is slightly smaller than the s of 387.16 obtained using Option 1. The studentized deleted residual for hospital 14 using the dummy variable model tells us what would happen if we removed hospital 14 from the data set and predicted y14 by using a newly fitted dummy variable model. In the exercises the reader will show that the prediction obtained, which uses a large hospital inefficiency estimate based on the remaining large hospitals 15, 16, and 17, indicates that hospital 14's labor hours are not unusually large. This justifies leaving hospital 14 in the data set when using the dummy variable model. In summary, both Options 1 and 2 seem reasonable. The reader will further compare these options in the exercises.
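The studentized residuals and studentized deleted residuals of Figure 14.33(b) are available from the same statsmodels influence object used above for the leverage values. A sketch of the outlier rule described in this passage, using scipy for the t.005 critical point:

from scipy import stats

# `influence` is fit.get_influence() for the 17-hospital model above
stud_resid = influence.resid_studentized_internal    # studentized residuals
stud_deleted = influence.resid_studentized_external  # studentized deleted residuals

n, k = 17, 3
t_005 = stats.t.ppf(1 - 0.005, df=n - k - 2)         # 3.055 for 12 df

for i, sdr in enumerate(stud_deleted, start=1):
    if abs(sdr) > t_005:
        print(f"Hospital {i}: studentized deleted residual {sdr:.3f} -> y-outlier")
# expect hospital 14 (4.558) to be flagged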
The studentized deleted residual for hospital 14 using the dummy variable model tells us what would happen if we removed hospital 14 from the data set and predicted y14 by using a newly fitted dummy variable model. In the exercises the reader will show that the prediction obtained, which uses a large hospital inefficiency estimate based on the remaining large hospitals 15, 16, and 17, indicates that hospital 14's labor hours are not unusually large. This justifies leaving hospital 14 in the data set when using the dummy variable model. In summary, both Options 1 and 2 seem reasonable. The reader will further compare these options in the exercises.

Obs   Cook's D
 1     0.002
 2     0.000
 3     0.001
 4     0.028
 5     0.000
 6     0.014
 7     0.002
 8     0.009
 9     0.016
10     0.049
11     0.009
12     0.067
13     0.006
14     0.353
15     0.541
16     0.897
17     5.033

Cook's D, Dfbetas, and Dffits (Optional)  If a particular observation, observation i, is an outlier with respect to its y and/or x values, it might significantly influence the least squares point estimates of the model parameters. To detect such influence, we compute Cook's distance measure (or Cook's D) for observation i, which we denote as Di. To understand Di, let F.50 denote the 50th percentile of the F distribution based on k + 1 numerator and n - (k + 1) denominator degrees of freedom. It can be shown that if Di is greater than F.50, then removing observation i from the data set would significantly change (as a group) the least squares point estimates of the model parameters. In this case we say that observation i is influential. For example, suppose that we relate y to x1, x2, and x3 using all n = 17 observations in Figure 14.33(a). Noting that k + 1 = 4 and n - (k + 1) = 13, we find (using Excel) that F.50 = .8845. The Excel add-in (MegaStat) output in the page margin tells us that both D16 = .897 and D17 = 5.033 are greater than F.50 = .8845 (see the dark blue shading). It follows that removing either hospital 16 or 17 from the data set would significantly change (as a group) the least squares estimates of the model parameters. To assess whether a particular least squares point estimate would significantly change, we can use an advanced statistical software package such as SAS, which gives the following difference in estimate of βj statistics (Dfbetas) for hospitals 16 and 17:

Obs   Dfbetas (Intercept)   Dfbetas (X1)   Dfbetas (X2)   Dfbetas (X3)
16          0.9880            -1.4289         1.7339         -1.1029
17          0.0294            -3.0114         1.2688          0.3155

Examining the Dfbetas statistics, we see that hospital 17's Dfbetas for the independent variable x1 (monthly X-ray exposures) equals -3.0114, which is negative and greater in absolute value than 2, a sometimes used critical value for Dfbetas statistics. This implies that removing hospital 17 from the data set would significantly decrease the least squares point estimate of the effect, β1, of monthly X-ray exposures on monthly labor hours. One possible consequence might then be that our model would significantly underpredict the monthly labor hours for a hospital which [like hospital 17; see Figure 14.33(a)] has a particularly large number of monthly X-ray exposures.
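Cook's D, Dfbetas, and Dffits are produced by most regression packages. A sketch using the statsmodels influence object, again with the hospital arrays assumed above:

import numpy as np
import statsmodels.api as sm
from scipy.stats import f

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x3]))).fit()
infl = fit.get_influence()

cooks_d = infl.cooks_distance[0]            # Cook's D for each observation
n, p = 17, 4                                # p = k + 1
f50 = f.ppf(0.50, p, n - p)                 # F.50 with 4 and 13 df, about .8845
print(np.where(cooks_d > f50)[0] + 1)       # influential observations (16 and 17)

print(infl.dfbetas[16])                     # Dfbetas for hospital 17 (row index 16)
print(infl.dffits[0][16])                   # Dffits for hospital 17, about -4.96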
In fact, consider the MINITAB output in the page margin of the difference in fits statistic (Dffits). Dffits for hospital 17 equals -4.96226, which is negative and greater in absolute value than the critical value of 2. This implies that removing hospital 17 from the data set would significantly decrease the point prediction of the monthly labor hours for a hospital that has the same values of x1, x2, and x3 as does hospital 17. Moreover, although it can be verified that using the previously discussed Option 1 or Option 2 to deal with hospital 14's large residual substantially reduces Cook's D, Dfbetas for x1, and Dffits for hospital 17, these or similar statistics remain or become somewhat significant for the large hospitals 15, 16, and 17. The practical implication is that if we wish to predict monthly labor hours for questionably run large hospitals, it is very important to keep all of the efficiently run large hospitals 15, 16, and 17 in the data set. (Furthermore, it would be desirable to add information for additional efficiently run large hospitals to the data set.)

Hosp   Dffits
 1    -0.07541
 2    -0.02404
 3     0.04383
 4     0.32657
 5     0.04213
 6    -0.22799
 7     0.08818
 8     0.18406
 9    -0.25179
10    -0.44871
11     0.18237
12    -0.52368
13    -0.14509
14     1.88820
15    -1.47227
16     1.89295
17    -4.96226

A technical note (Optional)  Let hi and ei denote the leverage value and residual for observation i based on a regression model using k independent variables. Then, for observation i, the studentized residual is ei/(s√(1 - hi)); the studentized deleted residual (denoted ti) is ei√(n - k - 2)/√(SSE(1 - hi) - ei²); Cook's distance measure is ei²hi/[(k + 1)s²(1 - hi)²]; and the difference in fits statistic is ti[hi/(1 - hi)]^(1/2). The formula for the difference in estimate of βj statistics is very complicated and will not be given here.
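The technical-note formulas can be implemented in a few lines. The sketch below assumes the residuals e and leverage values h from the earlier fit and reproduces all four diagnostics for a model with k independent variables.

import numpy as np

def outlier_stats(e, h, n, k):
    # SSE and s-squared from the residuals
    SSE = np.sum(e**2)
    s2 = SSE / (n - k - 1)
    stud = e / np.sqrt(s2 * (1 - h))                            # studentized residual
    t = e * np.sqrt(n - k - 2) / np.sqrt(SSE * (1 - h) - e**2)  # studentized deleted residual
    cooks = e**2 * h / ((k + 1) * s2 * (1 - h)**2)              # Cook's distance measure
    dffits = t * np.sqrt(h / (1 - h))                           # difference in fits
    return stud, t, cooks, dffits

Applied to the hospital data (n = 17, k = 3), this should reproduce the studentized residual 2.871 and studentized deleted residual 4.558 for hospital 14 and the Dffits value -4.96226 for hospital 17.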
Exercises for Section 14.11

CONCEPTS
14.42 Discuss how we use residual plots to check the regression assumptions for multiple regression.
14.43 List the tools that we use to identify an outlier with respect to its y value and/or x values.

METHODS AND APPLICATIONS
14.44 For each of the following cases, use the indicated residual plots to check for any violations of the regression assumptions.
a The Tasty Sub Shop Case: For the model relating Revenue to Population and Business Rating, use (a) the plot of residuals versus Population and (b) the plot of residuals versus Business Rating. [Plots not reproduced.] DS TastySub2
b The Natural Gas Consumption Case: For the model relating GasCons to Temp and Chill, use the plots shown on pages 591 (in Appendix 14.2) and 594 (in Appendix 14.3). DS GasCon2
14.45 THE HOSPITAL LABOR NEEDS CASE  DS HospLab  DS HospLab4
(1) Analyze the studentized deleted residuals shown in the page margin for Options 1 and 2 (see SDR1 and SDR2). (2) Is hospital 14 an outlier with respect to its y value when using Option 2? (3) Consider a questionable large hospital (DL = 1) for which Xray = 56,194, BedDays = 14,077.88, and Length = 6.89. Also, consider the labor needs in an efficiently run large hospital described by this combination of values of the independent variables. The 95 percent prediction intervals for these labor needs given by the models of Options 1 and 2 are, respectively, [14,906.24, 16,886.26] and [15,175.04, 17,030.01]. By comparing these prediction intervals, by analyzing the residual plots for Options 1 and 2 given in Figure 14.33(d) and (e) on page 577, and by using your conclusions regarding the studentized deleted residuals, recommend which option should be used. (4) What would you conclude if the questionable large hospital used 17,207.31 monthly labor hours?

Obs    SDR1      SDR2
 1    -0.333    -1.439
 2     0.404     0.233
 3     0.161    -0.750
 4     1.234     0.202
 5     0.425     0.213
 6    -0.795    -1.490
 7     0.677     0.617
 8     1.117     1.010
 9    -1.078    -0.409
10    -1.359    -0.400
11     1.461     2.571
12    -2.224    -0.624
13    -0.685     0.464
14       -       1.406
15    -0.137    -2.049
16     1.254     1.108
17     0.597    -0.639

(Under Option 1, hospital 14 is removed from the data set, so SDR1 has no value for observation 14.)

14.46 Recall that Figure 13.27(a) on page 508 gives n = 16 weekly values of Pages' Bookstore sales (y), Pages' advertising expenditure (x1), and competitor's advertising expenditure (x2). When we fit the model y = β0 + β1x1 + β2x2 + ε to the data, we find that the Durbin-Watson statistic is d = 1.63. Use the partial Durbin-Watson table in the page margin to test for positive autocorrelation by setting α equal to .05.

k = 2
n     dL,.05   dU,.05
15     0.95     1.54
16     0.98     1.54
17     1.02     1.54
18     1.05     1.53

LO14-11  Use a logistic model to estimate probabilities and odds ratios.

14.12 Logistic Regression

Suppose that in a study of the effectiveness of offering a price reduction on a given product, 300 households having similar incomes were selected. A coupon offering a price reduction, x, on the product, as well as advertising material for the product, was sent to each household. The coupons offered different price reductions (10, 20, 30, 40, 50, and 60 dollars), and 50 homes were assigned at random to each price reduction. The table in the page margin summarizes the number, y, and proportion, p̂, of households redeeming coupons for each price reduction, x (expressed in units of $10).  DS PrcRed

x     y     p̂
1     4    .08
2     7    .14
3    20    .40
4    35    .70
5    44    .88
6    46    .92

On the left side of Figure 14.34 we plot the p̂ values versus the x values and draw a hypothetical curve through the plotted points. A theoretical curve having the shape of the curve in Figure 14.34 is the logistic curve

p(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))

where p(x) denotes the probability that a household receiving a coupon having a price reduction of x will redeem the coupon. The MINITAB output in Figure 14.34 tells us that the point estimates of β0 and β1 are b0 = -3.7456 and b1 = 1.1109. (The point estimates in logistic regression are usually obtained by an advanced statistical procedure called maximum likelihood estimation.) Using these estimates, it follows that, for example,

p̂(5) = e^(-3.7456 + 1.1109(5)) / (1 + e^(-3.7456 + 1.1109(5))) = 6.1037 / (1 + 6.1037) = .8593

That is, p̂(5) = .8593 is the point estimate of the probability that a household receiving a coupon having a price reduction of $50 will redeem the coupon. The MINITAB output in Figure 14.34 gives the values of p̂(x) for x = 1, 2, 3, 4, 5, and 6.

FIGURE 14.34  MINITAB Output of a Logistic Regression of the Price Reduction Data

Logistic Regression Table
Predictor     Coef      SE Coef      Z       P
Constant    -3.7456    0.434355    -8.62   0.000
x            1.1109    0.119364     9.31   0.000

Price Reduction, x    Probability Estimate
1                     0.066943
2                     0.178920
3                     0.398256
4                     0.667791
5                     0.859260
6                     0.948831

[The left side of the figure plots the redemption proportion p̂ (from 0.0 to 0.9) against the price reduction x, with the fitted logistic curve rising from near 0 to near 1.]

The general logistic regression model relates the probability that an event (such as redeeming a coupon) will occur to k independent variables x1, x2, ..., xk. This general model is

p(x1, x2, ..., xk) = e^(β0 + β1x1 + β2x2 + ... + βkxk) / (1 + e^(β0 + β1x1 + β2x2 + ... + βkxk))

where p(x1, x2, ..., xk) is the probability that the event will occur when the values of the independent variables are x1, x2, ..., xk. In order to estimate β0, β1, β2, ..., βk, we obtain n observations, with each observation consisting of observed values of x1, x2, ..., xk and of a dependent variable y. Here, y is a dummy variable that equals 1 if the event has occurred and 0 otherwise.
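Because the coupon data are grouped (50 households at each price reduction), the logistic curve can be fit as a binomial regression. A sketch using the statsmodels GLM interface, with the counts taken from the margin table:

import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6])                  # price reduction in units of $10
redeemed = np.array([4, 7, 20, 35, 44, 46])       # households redeeming, out of 50 each
response = np.column_stack([redeemed, 50 - redeemed])

fit = sm.GLM(response, sm.add_constant(x),
             family=sm.families.Binomial()).fit()
b0, b1 = fit.params                               # about -3.7456 and 1.1109

p5 = np.exp(b0 + b1*5) / (1 + np.exp(b0 + b1*5))  # estimated P(redeem | $50), about .8593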
For example, suppose that the personnel director of a firm has developed two tests to help determine whether potential employees would perform successfully in a particular position. To help estimate the usefulness of the tests, the director gives both tests to 43 employees that currently hold the position. Table 14.16 gives the scores of each employee on both tests and indicates whether the employee is currently performing successfully or unsuccessfully in the position. If the employee is performing successfully, we set the dummy variable Group equal to 1; if the employee is performing unsuccessfully, we set Group equal to 0. Let x1 and x2 denote the scores of a potential employee on tests 1 and 2, and let p(x1, x2) denote the probability that a potential employee having the scores x1 and x2 will perform successfully in the position. We can estimate the relationship between p(x1, x2) and x1 and x2 by using the logistic regression model

p(x1, x2) = e^(β0 + β1x1 + β2x2) / (1 + e^(β0 + β1x1 + β2x2))

TABLE 14.16  The Performance Data  DS PerfTest
Group (1 = successful, 0 = unsuccessful):
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Test 1 scores:
96 96 91 95 92 93 98 92 97 95 99 89 94 92 94 90 91 90 86 90 91 88 87 93 90 91 91 88 86 79 83 79 88 81 85 82 82 81 86 81 85 83 81
Test 2 scores:
85 88 81 78 85 87 84 82 89 96 93 90 90 94 84 92 70 81 81 76 79 83 82 74 84 81 78 78 86 81 84 77 75 85 83 72 81 77 76 84 78 77 71
Source: Performance data from T.E. Dielman, Applied Regression Analysis for Business and Economics, 2nd ed., © 1996. Reprinted with permission of Brooks/Cole, a division of Cengage Learning, www.cengagerights.com, Fax 800-730-2215.

FIGURE 14.35  MINITAB Output of a Logistic Regression of the Performance Data

Logistic Regression Table
Predictor     Coef      SE Coef      Z       P      Odds Ratio   95% CI Lower   95% CI Upper
Constant    -56.17     17.4516    -3.22   0.001
Test 1        0.4833    0.1578     3.06   0.002        1.62          1.19           2.21
Test 2        0.1652    0.1021     1.62   0.106        1.18          0.97           1.44

Log-Likelihood = -13.959
Test that all slopes are zero: G = 31.483, DF = 2, P-Value = 0.000

The MINITAB output in Figure 14.35 tells us that the point estimates of β0, β1, and β2 are b0 = -56.17, b1 = .4833, and b2 = .1652. Consider, therefore, a potential employee who scores a 93 on test 1 and an 84 on test 2. It follows that a point estimate of the probability that the potential employee will perform successfully in the position is

p̂(93, 84) = e^(-56.17 + .4833(93) + .1652(84)) / (1 + e^(-56.17 + .4833(93) + .1652(84))) = 14.206506 / 15.206506 = .9342

To further analyze the logistic regression output, we consider several hypothesis tests that are based on the chi-square distribution. We first consider testing H0: β1 = β2 = 0 versus Ha: At least one of β1 or β2 does not equal 0. The p-value for this test is the area under the chi-square curve having k = 2 degrees of freedom to the right of the test statistic value G = 31.483. Although the calculation of G is too complicated to demonstrate in this book, the MINITAB output gives the value of G and the related p-value, which is less than .001. This p-value implies that we have extremely strong evidence that at least one of β1 or β2 does not equal zero. The p-value for testing H0: β1 = 0 versus Ha: β1 ≠ 0 is the area under the chi-square curve having one degree of freedom to the right of the square of z = b1/s_b1 = .4833/.1578 = 3.06. The MINITAB output tells us that this p-value is .002, which implies that we have very strong evidence that the score on test 1 is related to the probability of a potential employee's success. The p-value for testing H0: β2 = 0 versus Ha: β2 ≠ 0 is the area under the chi-square curve having one degree of freedom to the right of the square of z = b2/s_b2 = .1652/.1021 = 1.62. The MINITAB output tells us that this p-value is .106, which implies that we do not have strong evidence that the score on test 2 is related to the probability of a potential employee's success. In Exercise 14.49 we will consider a logistic regression model that uses only the score on test 1 to estimate the probability of a potential employee's success.
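The probability estimate and the Wald chi-square p-values can be verified numerically. A short sketch using the MINITAB point estimates above, with scipy supplying the chi-square tail area:

import numpy as np
from scipy.stats import chi2

b0, b1, b2 = -56.17, 0.4833, 0.1652
lg = b0 + b1*93 + b2*84                  # estimated logit for scores 93 and 84
p_hat = np.exp(lg) / (1 + np.exp(lg))    # about .9342

z1 = 0.4833 / 0.1578                     # z statistic for test 1
print(chi2.sf(z1**2, df=1))              # about .002
z2 = 0.1652 / 0.1021                     # z statistic for test 2
print(chi2.sf(z2**2, df=1))              # about .106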
The odds of success for a potential employee is defined to be the probability of success divided by the probability of failure for the employee. That is,

odds = p(x1, x2) / (1 - p(x1, x2))

For the potential employee who scores a 93 on test 1 and an 84 on test 2, we estimate that the odds of success are .9342/(1 - .9342) = 14.2. That is, we estimate that the odds of success for the potential employee are about 14 to 1. It can be shown that e^b1 = e^.4833 = 1.62 is a point estimate of the odds ratio for x1, which is the proportional change in the odds (for any potential employee) that is associated with an increase of one in x1 when x2 stays constant. This point estimate of the odds ratio for x1 is shown on the MINITAB output and says that, for every one point increase in the score on test 1 when the score on test 2 stays constant, we estimate that a potential employee's odds of success increase by 62 percent. Furthermore, the 95 percent confidence interval for the odds ratio for x1, [1.19, 2.21], does not contain 1. Therefore, as with the (equivalent) chi-square test of H0: β1 = 0, we conclude that there is strong evidence that the score on test 1 is related to the probability of success for a potential employee. Similarly, it can be shown that e^b2 = e^.1652 = 1.18 is a point estimate of the odds ratio for x2, which is the proportional change in the odds (for any potential employee) that is associated with an increase of one in x2 when x1 stays constant. This point estimate of the odds ratio for x2 is shown on the MINITAB output and says that, for every one point increase in the score on test 2 when the score on test 1 stays constant, we estimate that a potential employee's odds of success increase by 18 percent. However, the 95 percent confidence interval for the odds ratio for x2, [.97, 1.44], contains 1. Therefore, as with the equivalent chi-square test of H0: β2 = 0, we cannot conclude that there is strong evidence that the score on test 2 is related to the probability of success for a potential employee.

To conclude this section, consider the general logistic regression model

p(x1, x2, ..., xk) = e^(β0 + β1x1 + β2x2 + ... + βkxk) / (1 + e^(β0 + β1x1 + β2x2 + ... + βkxk))

where p(x1, x2, ..., xk) is the probability that the event under consideration will occur when the values of the independent variables are x1, x2, ..., xk. The odds of the event occurring is defined to be p(x1, x2, ..., xk)/(1 - p(x1, x2, ..., xk)), which is the probability that the event will occur divided by the probability that the event will not occur. It can be shown that the odds equals e^(β0 + β1x1 + β2x2 + ... + βkxk). The natural logarithm of the odds is β0 + β1x1 + β2x2 + ... + βkxk, which is called the logit. If b0, b1, b2, ..., bk are the point estimates of β0, β1, β2, ..., βk, the point estimate of the logit, denoted ℓ̂g, is b0 + b1x1 + b2x2 + ... + bkxk. It follows that the point estimate of the probability that the event will occur is

p̂(x1, x2, ..., xk) = e^ℓ̂g / (1 + e^ℓ̂g) = e^(b0 + b1x1 + b2x2 + ... + bkxk) / (1 + e^(b0 + b1x1 + b2x2 + ... + bkxk))

Finally, consider an arbitrary independent variable xj. It can be shown that e^bj is the point estimate of the odds ratio for xj, which is the proportional change in the odds that is associated with a one unit increase in xj when the other independent variables stay constant.
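Odds and odds ratios follow directly from these estimates; a short sketch:

import numpy as np

p_hat = 0.9342
odds = p_hat / (1 - p_hat)      # about 14.2, that is, roughly 14 to 1

b1, b2 = 0.4833, 0.1652
print(np.exp(b1))               # odds ratio for x1, about 1.62
print(np.exp(b2))               # odds ratio for x2, about 1.18

Exponentiating a coefficient converts its additive effect on the logit into a multiplicative effect on the odds, which is why e^b1 = 1.62 is read as a 62 percent increase in the odds per point on test 1.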
Exercises for Section 14.12

CONCEPTS
14.47 What two values does the dependent variable equal in logistic regression? What do these values represent?
14.48 Define the odds of an event, and the odds ratio for xj.

METHODS AND APPLICATIONS
14.49 If we use the logistic regression model

p(x1) = e^(β0 + β1x1) / (1 + e^(β0 + β1x1))

to analyze the performance data in Table 14.16 on page 581, we find that the point estimates of the model parameters and their associated p-values (given in parentheses) are b0 = -43.37 (.001) and b1 = .4897 (.001). (1) Find a point estimate of the probability of success for a potential employee who scores a 93 on test 1. (2) Using b1 = .4897, find a point estimate of the odds ratio for x1. (3) Interpret this point estimate.
14.50 Mendenhall and Sincich (1993) present data that can be used to investigate allegations of gender discrimination in the hiring practices of a particular firm. These data are given in the page margin. In this table, y is a dummy variable that equals 1 if a potential employee was hired and 0 otherwise; x1 is the number of years of education of the potential employee; x2 is the number of years of experience of the potential employee; and x3 is a dummy variable that equals 1 if the potential employee was a male and 0 if the potential employee was a female. If we use the logistic regression model

p(x1, x2, x3) = e^(β0 + β1x1 + β2x2 + β3x3) / (1 + e^(β0 + β1x1 + β2x2 + β3x3))

to analyze these data, we find that the point estimates of the model parameters and their associated p-values (given in parentheses) are b0 = -14.2483 (.0191), b1 = 1.1549 (.0552), b2 = .9098 (.0341), and b3 = 5.6037 (.0313).
a Consider a potential employee with given values of x1 (years of education) and x2 (years of experience). Find (1) a point estimate of the probability that the potential employee will be hired if the potential employee is a male, and (2) a point estimate of the probability that the potential employee will be hired if the potential employee is a female.
b (1) Using b3 = 5.6037, find a point estimate of the odds ratio for x3. (2) Interpret this odds ratio. (3) Using the p-value describing the importance of x3, can we conclude that there is strong evidence that gender is related to the probability that a potential employee will be hired?
DS Gender
[The page-margin table for Exercise 14.50 lists y, x1, x2, and x3 for each potential employee; the individual rows are not reproduced here.]
Source: William Mendenhall and Terry Sincich, A Second Course in Business Statistics: Regression Analysis, Fourth edition, © 1993. Reprinted with permission of Prentice Hall.

Chapter Summary

This chapter has discussed multiple regression analysis. We began by considering the multiple regression model and the assumptions behind this model. We next discussed the least squares point estimates of the model parameters and some ways to judge overall model utility: the standard error, the multiple coefficient of determination, the adjusted multiple coefficient of determination, and the overall F test. Then we considered testing the significance of a single independent variable in a multiple regression model, calculating a confidence interval for the mean value of the dependent variable, and calculating a prediction interval for an individual value of the dependent variable. We continued this chapter by explaining the use of dummy variables to model qualitative independent variables and the use of squared and interaction variables. We then considered multicollinearity, which can adversely affect the ability of the t statistics and associated p-values to assess the importance of the independent variables in a regression model. For this reason, we need to determine if the overall model gives a high R², a small s, a high adjusted R², short prediction intervals, and a small C statistic. We explained how to compare regression models on the basis of these criteria, and we discussed stepwise regression, backward elimination, and the partial F test. We then considered using residual analysis (including the detection of outliers) to check the assumptions for multiple regression models. We concluded this chapter by discussing how to use logistic regression to estimate the probability of an event.

Glossary of Terms

dummy variable: A variable that takes on the values 0 or 1 and is used to describe the effects of the different levels of a qualitative independent variable in a regression model. (page 550)
interaction: The situation in which the relationship between the mean value of the dependent variable and an independent variable is dependent on the value of another independent variable. (pages 554 and 562)
multicollinearity: The situation in which the independent variables used in a regression analysis are related to each other. (page 566)
multiple regression model: An equation that describes the relationship between a dependent variable and more than one independent variable. (page 530)
stepwise regression (and backward elimination): Iterative model building techniques for selecting important predictor variables. (pages 570-571)

Important Formulas and Tests

The multiple regression model: page 530
The least squares point estimates: page 527
Point estimate of a mean value of y: page 530
Point prediction of an individual value of y: page 530
Mean square error: page 536
Standard error: page 536
Total variation: page 537
Explained variation: page 537
Unexplained variation: page 537
Multiple coefficient of determination: page 537
Multiple correlation coefficient: page 537
Adjusted multiple coefficient of determination: page 538
An F test for the multiple regression model: page 539
Testing the significance of an independent variable: page 542
Confidence interval for βj: page 544
Confidence interval for a mean value of y: pages 545 and 546
Prediction interval for an individual value of y: pages 545 and 546
Distance value (in multiple regression): page 546
The quadratic regression model: page 560
Variance inflation factor: page 566
C statistic: page 570
Partial F test: page 573
Studentized deleted residual: pages 577 and 578
Cook's D, Dfbetas, and Dffits: pages 578 and 579
Logistic curve: page 580
Logistic regression model: pages 581 and 582
Odds and odds ratio: page 582
Supplementary Exercises

14.51 The trend in home building in recent years has been to emphasize open spaces and great rooms, rather than smaller living rooms and family rooms. A builder of speculative homes in the college community of Oxford, Ohio, had been building such homes, but his homes had been taking many months to sell and selling for substantially less than the asking price. In order to determine what types of homes would attract residents of the community, the builder contacted a statistician at a local college. The statistician went to a local real estate agency and obtained the data in Table 13.10 on page 518. This table presents the sales price y, square footage x1, number of rooms x2, number of bedrooms x3, and age x4 for each of 63 single-family residences recently sold in the community. When we perform a regression analysis of these data using the model

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε

we find that the least squares point estimates of the model parameters and their associated p-values (given in parentheses) are as shown in Table 14.17. Discuss why the estimates b2 = 6.3218 and b3 = -11.1032 suggest that it might be more profitable when building a house of a specified square footage (1) to include both a (smaller) living room and family room rather than a (larger) great room and (2) to not increase the number of bedrooms (at the cost of another type of room) that would normally be included in a house of the specified square footage. Note: Based on the statistical results, the builder realized that there are many families with children in a college town and that the parents in such families would rather have one living area for the children (the family room) and a separate living area for themselves (the living room). The builder started modifying his open-space homes accordingly and greatly increased his profits. DS OxHome

TABLE 14.17  The Least Squares Point Estimates for Exercise 14.51
b0 = 10.3676 (.3710)
b1 = .0500 (<.001)
b2 = 6.3218 (.0152)
b3 = -11.1032 (.0635)
b4 = -.4319 (.0002)

14.52 THE SUPERMARKET CASE  DS BakeSale
The Tastee Bakery Company supplies a bakery product to many supermarkets in a metropolitan area. The company wishes to study the effect of the height of the shelf display employed by the supermarkets on monthly sales, y (measured in cases of 10 units each), for this product. Shelf display height has three levels: bottom (B), middle (M), and top (T). For each shelf display height, six supermarkets of equal sales potential are randomly selected, and each supermarket displays the product using its assigned shelf height for a month. At the end of the month, sales of the bakery product at the 18 participating stores are recorded, and the data in Table 14.18 are obtained. To compare the population mean sales amounts μB, μM, and μT that would be obtained by using the bottom, middle, and top display heights, we use the following dummy variable regression model: y = βB + βM DM + βT DT + ε, which we call Model 1. Here, DM equals 1 if a middle display height is used and 0 otherwise; DT equals 1 if a top display height is used and 0 otherwise.¹

TABLE 14.18  Bakery Sales Study Data (Sales in Cases)  DS BakeSale
Shelf Display Height
Bottom (B)   Middle (M)   Top (T)
  58.2         73.0        52.4
  53.7         78.1        49.7
  55.8         75.4        50.9
  55.7         76.2        54.0
  52.5         78.4        52.1
  58.9         82.1        49.9

¹In general, the regression approach of this exercise produces the same comparisons of several population means that are produced by one-way analysis of variance (see Section 11.2).
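As a quick illustration of how this dummy variable model recovers the three group means, the sketch below fits Model 1 to the Table 14.18 data with statsmodels; the estimates should match the MINITAB output in Figure 14.36(a).

import numpy as np
import statsmodels.api as sm

bottom = [58.2, 53.7, 55.8, 55.7, 52.5, 58.9]
middle = [73.0, 78.1, 75.4, 76.2, 78.4, 82.1]
top    = [52.4, 49.7, 50.9, 54.0, 52.1, 49.9]

y  = np.array(bottom + middle + top)
DM = np.array([0]*6 + [1]*6 + [0]*6)    # 1 for a middle display height
DT = np.array([0]*6 + [0]*6 + [1]*6)    # 1 for a top display height

model1 = sm.OLS(y, sm.add_constant(np.column_stack([DM, DT]))).fit()
print(model1.params)                    # about [55.8, 21.4, -4.3]

The intercept estimates μB, and the two dummy coefficients estimate μM - μB and μT - μB, which is exactly the comparison structure used in the exercise parts that follow.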
FIGURE 14.36  MINITAB Output for the Bakery Sales Data (for Exercise 14.52)

(a) Partial MINITAB output for Model 1: y = βB + βM DM + βT DT + ε

Predictor    Coef     SE Coef     T       P
Constant    55.800     1.013    55.07   0.000
DMiddle     21.400     1.433    14.93   0.000
DTop        -4.300     1.433    -3.00   0.009

S = 2.48193   R-Sq = 96.1%   R-Sq(adj) = 95.6%

Analysis of Variance
Source           DF      SS       MS        F        P
Regression        2    2273.9   1136.9   184.57   0.000
Residual Error   15      92.4      6.2
Total            17    2366.3

(b) Partial MINITAB output for Model 2: y = βT + βB DB + βM DM + ε

Predictor    Coef     SE Coef     T       P
Constant    51.500     1.013    50.83   0.000
DBottom      4.300     1.433     3.00   0.009
DMiddle     25.700     1.433    17.94   0.000

(c) MINITAB prediction using Model 1 or 2

Fit       95% CI               95% PI
77.200    (75.040, 79.360)     (71.486, 82.914)

a Because the expression βB + βM DM + βT DT represents mean monthly sales for the bakery product, the definitions of the dummy variables imply, for example, that μT = βB + βM(0) + βT(1) = βB + βT. (1) In a similar fashion, show that μB = βB and μM = βB + βM. (2) By appropriately subtracting the expressions for μB, μM, and μT, show that μM - μB = βM, μT - μB = βT, and μM - μT = βM - βT.
b Use the overall F statistic in Figure 14.36(a) to test H0: βM = βT = 0, or, equivalently, H0: μB = μM = μT. Interpret the practical meaning of the result of this test.
c Consider the following two differences in means: μM - μB = βM and μT - μB = βT. Use information in Figure 14.36(a) to (1) find a point estimate of, (2) test the significance of, and (3) find a 95 percent confidence interval for each difference. (Hint: Use the confidence interval formula on page 544.)
Interpret your results.
d Consider the following alternative model: y = βT + βB DB + βM DM + ε, which we call Model 2. Here, DB equals 1 if a bottom display height is used and 0 otherwise. This model implies that μM - μT = βM. Use information in Figure 14.36(b) to (1) find a point estimate of, (2) test the significance of, and (3) find a 95 percent confidence interval for μM - μT = βM. Interpret your results.
e Show by hand calculation that both Models 1 and 2 give the same point estimate ŷ = 77.2 of mean monthly sales when using a middle display height.
f Use information in Figure 14.36(c) to find (1) a 95 percent confidence interval for mean sales when using a middle display height, and (2) a 95 percent prediction interval for individual sales during a month at a supermarket that employs a middle display height.

14.53 THE FRESH DETERGENT CASE  DS Fresh3
Recall from Exercise 14.32 (page 558) that Enterprise Industries has advertised Fresh liquid laundry detergent by using three different advertising campaigns: advertising campaign A (television commercials), advertising campaign B (a balanced mixture of television and radio commercials), and advertising campaign C (a balanced mixture of television, radio, newspaper, and magazine ads). To compare the effectiveness of these advertising campaigns, consider using two models, Model 1 and Model 2, that are shown with corresponding partial Excel outputs in Figure 14.37 on the next page. In these models y is demand for Fresh; x4 is the price difference; x3 is Enterprise Industries' advertising expenditure for Fresh; DA equals 1 if advertising campaign A is used in a sales period and 0 otherwise; DB equals 1 if advertising campaign B is used in a sales period and 0 otherwise; and DC equals 1 if advertising campaign C is used in a sales period and 0 otherwise. Moreover, in Model 1 the parameter β5 represents the effect on mean demand of advertising campaign B compared to advertising campaign A, and the parameter β6 represents the effect on mean demand of advertising campaign C compared to advertising campaign A. In Model 2 the parameter β6 represents the effect on mean demand of advertising campaign C compared to advertising campaign B.
a Compare advertising campaigns A, B, and C by finding 95 percent confidence intervals for (1) β5 and β6 in Model 1 and (2) β6 in Model 2. Interpret the intervals.
associated p-values (given in parentheses) are as shown in Table 14.19 in the page margin Let m[d,a,A], m[d,a,B], and m[d,a,C] denote the mean demands for Fresh when the price difference is d, the advertising expenditure is a, and we use advertising campaigns A, B, and C, respectively The model of this part implies that TA B L E The Least Squares Point Estimates for Model b0 ϭ 28.6873 (Ͻ.0001) b1 ϭ 10.8253 (.0036) b2 ϭ Ϫ7.4115 (.0002) b3 ϭ 6458 (Ͻ.0001) b4 ϭ Ϫ1.4156 (.0091) b5 ϭ Ϫ.4807 (.5179) b6 ϭ Ϫ.9351 (.2758) b7 ϭ 10722 (.3480) b8 ϭ 20349 (.1291) m[d,a,A] ϭ b0 ϩ b1d ϩ b2a ϩ b3a2 ϩ b4da ϩ b5(0) ϩ b6(0) ϩ b7a(0) ϩ b8a(0) m[d,a,B] ϭ b0 ϩ b1d ϩ b2a ϩ b3a2 ϩ b4da ϩ b5(1) ϩ b6(0) ϩ b7a(1) ϩ b8a(0) m[d,a,C] ϭ b0 ϩ b1d ϩ b2a ϩ b3a2 ϩ b4da ϩ b5(0) ϩ b6(1) ϩ b7a(0) ϩ b8a(1) (1) Using these equations, verify that m[d,a,C] Ϫ m[d,a,A] equals b6 ϩ b8a (2) Using the least squares point estimates, show that a point estimate of m[d,a,C] Ϫ m[d,a,A] equals 3266 when a ϭ 6.2 and equals 4080 when a ϭ 6.6 (3) Verify that m[d,a,C] Ϫ m[d,a,B] equals b6 Ϫ b5 ϩ b8a Ϫ b7a (4) Using the least squares point estimates, show that a point estimate of m[d,a,C] Ϫ m[d,a,B] equals 14266 when a ϭ 6.2 and equals 18118 when a ϭ 6.6 (5) Discuss why these results imply that the larger that advertising expenditure a is, then the larger is the improvement in mean sales that is obtained by using advertising campaign C rather than advertising campaign A or B d If we use an Excel add-in (MegaStat), we can use Models 1, 2, and to predict demand for Fresh in a future sales period when the price difference will be x4 ϭ 20, the advertising expenditure will be x3 ϭ 6.50, and campaign C will be used The prediction results using Model or Model are given on the left below and the prediction results using Model are given on the right below Model or Predicted 8.50068 95% Prediction Interval lower upper 8.21322 Model Predicted 8.78813 8.51183 95% Prediction Interval lower upper 8.22486 8.79879 Which model gives the shortest 95 percent prediction interval for Fresh demand? e Using all of the results in this exercise, discuss why there might be a small amount of interaction between advertising expenditure and advertising campaign 14.54 THE FRESH DETERGENT CASE DS Fresh3 The unexplained variation for Model of the previous exercise y ϭ b0 ϩ b1x4 ϩ b2x3 ϩ b3x23 ϩ b4x4x3 ϩ b5DB ϩ b6DC ϩ e 587 Supplementary Exercises is 3936 If we set both b5 and b6 in this model equal to (that is, if we eliminate the dummy variable portion of this model), the resulting reduced model has an unexplained variation of 1.0644 Using an a of 05, perform a partial F-test (see page 573) of H0: ␤5 ϭ ␤6 ϭ (Hint: n ϭ 30, k ϭ 6, and k* ϭ 2.) 
If we reject H0, we conclude that at least two of advertising campaigns A, B, and C have different effects on mean demand. Many statisticians believe that rejection of H0 by using the partial F-test makes it more legitimate to make pairwise comparisons of advertising campaigns A, B, and C, as we did in part a of the previous exercise. Here, the partial F-test is regarded as a preliminary test of significance.

14.55 Table 14.20 gives the number of bathrooms for each of the 63 homes in Table 13.10 on page 518. Using the following MINITAB output of the best single model of each size in terms of R², s, and C, determine which overall model seems best. (An X in the output indicates that a variable is included in a model; the candidate variables are SqFt, Rooms, Bed, Age, and Bath, and SqFt is included in every model.) DS OxHome2

Vars   R-Sq   R-Sq(adj)   Mallows C-p      S
 1     63.3     62.7         18.9        21.382
 2     69.1     68.1          8.6        19.782
 3     70.9     69.4          6.9        19.372
 4     72.6     70.7          5.3        18.962
 5     73.2     70.8          6.0        18.917

TABLE 14.20  Number of Bathrooms  DS OxHome2
1.0  1.0  1.0  1.0  1.0  1.5  1.5  2.0  2.0  2.5
1.0  1.0  2.0  1.5  2.0  1.5  1.0  1.5  1.0  1.5
2.5  2.0  1.0  2.5  2.0  2.0  1.0  2.0  1.0  2.5
2.0  1.0  2.0  1.5  1.0  1.5  2.0  1.5  1.0  1.5
1.5  2.0  2.5  2.0  2.5  1.0  1.0  1.0  2.0  2.5
2.0  2.0  1.0  2.0  1.0  1.5  1.5  1.0  2.0  1.0
2.0  2.0  2.0

14.56 THE QHIC CASE  DS QHIC
Consider the QHIC data in Figure 13.18 (page 502). When we performed a regression analysis of these data by using the simple linear regression model, plots of the model's residuals versus x (home value) and ŷ (predicted upkeep expenditure) both fanned out and had a "dip," or slightly curved appearance (see Figure 13.18). In order to remedy the indicated violations of the constant variance and correct functional form assumptions, we transformed the dependent variable by taking the square roots of the upkeep expenditures. An alternative approach consists of two steps. First, the slightly curved appearance of the residual plots implies that it is reasonable to add the squared term x² to the simple linear regression model. This gives the quadratic regression model y = β0 + β1x + β2x² + ε. The upper residual plot in the MINITAB output that follows shows that a plot of this model's residuals versus x fans out, indicating a violation of the constant variance assumption. To remedy this violation, we (in the second step) divide all terms in the quadratic model by x, which gives the transformed model

y/x = β0(1/x) + β1 + β2x + ε/x

and the associated MINITAB regression output shown below. The lower residual plot is the plot of the residuals versus x for the transformed model. [The two residual plots, "Residuals Versus Value (response is Upkeep)" for the quadratic model and the corresponding plot for the transformed model, are not reproduced.]

Transformed Model (Noconstant)
Predictor    Coef        SE Coef     T      P
1/Value     -53.50       83.20     -0.64  0.524
One           3.409       1.321     2.58  0.014
Value         0.011224    0.004627  2.43  0.020

Predicted Values for New Observations
Fit      95% CI            95% PI
5.635    (5.306, 5.964)    (3.994, 7.276)

ŷ/220 = -53.50(1/220) + 3.409 + .011224(220) = 5.635

a Does the lower residual plot indicate that the constant variance assumption holds for the transformed model?
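The transformed model can be fit by regressing y/x on 1/x, a column of ones, and x, with no additional intercept. A sketch, assuming arrays value (x) and upkeep (y) hold the QHIC observations:

import numpy as np
import statsmodels.api as sm

yt = upkeep / value                       # transformed response y/x
Xt = np.column_stack([1/value, np.ones(len(value)), value])
fit = sm.OLS(yt, Xt).fit()                # "One" enters as a predictor, as in the MINITAB output
b0, b1, b2 = fit.params                   # about -53.50, 3.409, and .011224

x0 = 220
y_over_x = b0/x0 + b1 + b2*x0             # about 5.635
upkeep_hat = x0 * y_over_x                # point prediction of upkeep itself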