Bruce L. Bowerman, Miami University
Richard T. O'Connell, Miami University
Emily S. Murphree, Miami University

Business Statistics in Practice: Using Data, Modeling, and Analytics
EIGHTH EDITION

With major contributions by
Steven C. Huchendorf, University of Minnesota
Dawn C. Porter, University of Southern California
Patrick J. Schur, Miami University

BUSINESS STATISTICS IN PRACTICE: USING DATA, MODELING, AND ANALYTICS, EIGHTH EDITION

Published by McGraw-Hill Education, Penn Plaza, New York, NY 10121. Copyright © 2017 by McGraw-Hill Education. All rights reserved. Printed in the United States of America. Previous editions © 2014, 2011, and 2009. No part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written consent of McGraw-Hill Education, including, but not limited to, in any network or other electronic storage or transmission, or broadcast for distance learning.

Some ancillaries, including electronic and print components, may not be available to customers outside the United States.

This book is printed on acid-free paper.

DOW/DOW

ISBN 978-1-259-54946-5
MHID 1-259-54946-1

Senior Vice President, Products & Markets: Kurt L. Strand
Vice President, General Manager, Products & Markets: Marty Lange
Vice President, Content Design & Delivery: Kimberly Meriwether David
Managing Director: James Heine
Senior Brand Manager: Dolly Womack
Director, Product Development: Rose Koos
Product Developer: Camille Corum
Marketing Manager: Britney Hermsen
Director of Digital Content: Doug Ruby
Digital Product Developer: Tobi Philips
Director, Content Design & Delivery: Linda Avenarius
Program Manager: Mark Christianson
Content Project Managers: Harvey Yep (Core) / Bruce Gin (Digital)
Buyer: Laura M. Fuller
Design: Srdjan Savanovic
Content Licensing Specialists: Ann Marie Jannette (Image) / Beth Thole
(Text)
Cover Image: ©Sergei Popov, Getty Images, and ©teekid, Getty Images
Compositor: MPS Limited
Printer: R. R. Donnelley

All credits appearing on page or at the end of the book are considered to be an extension of the copyright page.

Library of Congress Control Number: 2015956482

The Internet addresses listed in the text were accurate at the time of publication. The inclusion of a website does not indicate an endorsement by the authors or McGraw-Hill Education, and McGraw-Hill Education does not guarantee the accuracy of the information presented at these sites.

www.mhhe.com

ABOUT THE AUTHORS

Bruce L. Bowerman
Bruce L. Bowerman is emeritus professor of information systems and analytics at Miami University in Oxford, Ohio. He received his Ph.D. degree in statistics from Iowa State University in 1974, and he has over 40 years of experience teaching basic statistics, regression analysis, time series forecasting, survey sampling, and design of experiments to both undergraduate and graduate students. In 1987 Professor Bowerman received an Outstanding Teaching award from the Miami University senior class, and in 1992 he received an Effective Educator award from the Richard T. Farmer School of Business Administration. Together with Richard T. O'Connell, Professor Bowerman has written 23 textbooks. These include Forecasting, Time Series, and Regression: An Applied Approach (also coauthored with Anne B. Koehler); Linear Statistical Models: An Applied Approach; Regression Analysis: Unified Concepts, Practical Applications, and Computer Implementation (also coauthored with Emily S. Murphree); and Experimental Design: Unified Concepts, Practical Applications, and Computer Implementation (also coauthored with Emily S. Murphree). The first edition of Forecasting and Time Series earned an Outstanding Academic Book award from Choice magazine. Professor Bowerman has also published a number of articles in applied stochastic processes, time series forecasting, and statistical education. In his spare time, Professor Bowerman enjoys watching movies and sports, playing tennis, and designing houses.

Richard T. O'Connell
Richard T. O'Connell is emeritus professor of information systems and analytics at Miami University in Oxford, Ohio. He has more than 35 years of experience teaching basic statistics, statistical quality control and process improvement, regression analysis, time series forecasting, and design of experiments to both undergraduate and graduate business students. He also has extensive consulting experience and has taught workshops dealing with statistical process control and process improvement for a variety of companies in the Midwest. In 2000 Professor O'Connell received an Effective Educator award from the Richard T. Farmer School of Business Administration. Together with Bruce L. Bowerman, he has written 23 textbooks. These include Forecasting, Time Series, and Regression: An Applied Approach (also coauthored with Anne B. Koehler); Linear Statistical Models: An Applied Approach; Regression Analysis: Unified Concepts, Practical Applications, and Computer Implementation (also coauthored with Emily S. Murphree); and Experimental Design: Unified Concepts, Practical Applications, and Computer Implementation (also coauthored with Emily S. Murphree). Professor O'Connell has published a number of articles in the area of innovative statistical education. He is one of the first college instructors in the United States to integrate statistical process control and process improvement methodology into his basic business statistics course. He (with Professor Bowerman) has written several articles advocating this approach. He has also given presentations on this subject at meetings such as the Joint Statistical Meetings of the American Statistical Association and the Workshop on Total Quality Management: Developing Curricula and Research Agendas (sponsored by the Production and Operations Management Society). Professor
O'Connell received an M.S. degree in decision sciences from Northwestern University in 1973. In his spare time, Professor O'Connell enjoys fishing, collecting 1950s and 1960s rock music, and following the Green Bay Packers and Purdue University sports.

Emily S. Murphree
Emily S. Murphree is emerita professor of statistics at Miami University in Oxford, Ohio. She received her Ph.D. degree in statistics from the University of North Carolina and does research in applied probability. Professor Murphree received Miami's College of Arts and Science Distinguished Educator Award in 1998. In 1996, she was named one of Oxford's Citizens of the Year for her work with Habitat for Humanity and for organizing annual Sonia Kovalevsky Mathematical Sciences Days for area high school girls. In 2012 she was recognized as "A Teacher Who Made a Difference" by the University of Kentucky.

AUTHORS' PREVIEW

Business Statistics in Practice: Using Data, Modeling, and Analytics, Eighth Edition, provides a unique and flexible framework for teaching the introductory course in business statistics. This framework features:

• A new theme of statistical modeling introduced in Chapter 1 and used throughout the text
• Many new exercises, with increased emphasis on students doing complete statistical analyses on their own
• Continuing case studies that facilitate student learning by presenting new concepts in the context of familiar situations
• Business improvement conclusions—highlighted in yellow and designated by BI icons in the page margins—that explicitly show how statistical analysis leads to practical business decisions
• A substantial and innovative presentation of business analytics and data mining that provides instructors with a choice of different teaching options
• Improved and easier to understand discussions of probability, probability modeling, traditional statistical inference, and regression and time series modeling
• Use of Excel (including the Excel add-in MegaStat) and Minitab to carry out traditional statistical analysis and descriptive analytics
• Use of JMP and the Excel add-in XLMiner to carry out predictive analytics

We now discuss how these features are implemented in the book's 18 chapters.

Chapters 1, 2, and 3: Introductory concepts and statistical modeling. Graphical and numerical descriptive methods. In Chapter 1 we discuss data, variables, populations, and how to select random and other types of samples (a topic formerly discussed in Chapter 7). A new section introduces statistical modeling by defining what a statistical model is and by using The Car Mileage Case to preview specifying a normal probability model describing the mileages obtained by a new midsize car model (see Figure 1.6):

[Figure 1.6: A Histogram of the 50 Mileages and the Normal Probability Curve, from the section "Random Sampling, Three Case Studies That Illustrate Statistical Inference." Panel (a) shows a histogram of the 50 mileages; panel (b) shows the normal probability curve.]

If we could examine all mileages achieved by the new midsize cars, the population histogram would look "bell-shaped." This leads us to "smooth out" the sample histogram and represent the population of all mileages by the bell-shaped probability curve in Figure 1.6(b). One type of bell-shaped probability curve is a graph of what is called the normal probability distribution (or normal probability model), which is discussed in Chapter 6. Therefore, we might conclude that the statistical model describing the sample of 50 mileages in Table 1.7 states that this sample has been (approximately) randomly selected from a population of car mileages that is described by a normal probability distribution. We will see in Chapters 7 and 8 that this statistical model and probability theory allow us to conclude that we are "95 percent" confident that the sampling error in estimating the population mean mileage by the sample mean mileage is no more than .23 mpg. (The exact reasoning behind and meaning of this statement is given in Chapter 8, which discusses confidence intervals.) Because we have seen in Example 1.4 that the mean of the sample of n = 50 mileages in Table 1.7 is 31.56 mpg, this implies that we are 95 percent confident that the true population mean EPA combined mileage for the new midsize model is between 31.56 − .23 = 31.33 mpg and 31.56 + .23 = 31.79 mpg. Because we are 95 percent confident that the population mean EPA combined mileage is at least 31.33 mpg, we have strong statistical evidence that this not only meets, but slightly exceeds, the tax credit standard of 31 mpg and thus that the new midsize model deserves the tax credit.

Throughout this book we will encounter many situations where we wish to make a statistical inference about one or more populations by using sample data. Whenever we make assumptions about how the sample data are selected and about the population(s) from which the sample data are selected, we are specifying a statistical model that will lead to making what we hope are valid statistical inferences. In Chapters 13, 14, and 15 these models become complex and not only specify the probability distributions describing the sampled populations but also specify how the means of the sampled populations are related to each other through one or more predictor variables. For example, we might relate mean, or expected, sales of a product to the predictor variables advertising expenditure and price. In order to relate a response variable such as sales to one or more predictor variables so that we can explain and predict values of the response variable, we sometimes use a statistical technique called regression analysis and specify a regression model.

The idea of building a model to help explain and predict is not new. Sir Isaac Newton's equations describing motion and gravitational attraction help us understand bodies in motion and are used today by scientists plotting the trajectories of spacecraft. Despite their successful use, however, these equations are only approximations to the exact nature of motion. Seventeenth-century Newtonian physics has been superseded by the more sophisticated twentieth-century physics of Einstein and Bohr.

In Chapters 2 and 3 we begin to formally discuss the statistical analysis used in statistical modeling and the statistical inferences that can be made using statistical models. For example, in Chapter 2 (graphical descriptive methods) we show how to construct the histogram of car mileages shown in Chapter 1, and in Chapter 3 (numerical descriptive methods) we use this histogram to help explain the Empirical Rule. As illustrated in Figure 3.15, this rule gives tolerance intervals providing estimates of the "lowest" and "highest" mileages that the new midsize car model should be expected to get in combined city and highway driving:

[Figure 3.15: Estimated Tolerance Intervals in the Car Mileage Case. Below the histogram of the 50 mileages, the figure shows estimated tolerance intervals for the mileages of 68.26 percent ([30.8, 32.4]), 95.44 percent ([30.0, 33.2]), and 99.73 percent ([29.2, 34.0]) of all individual cars.]

Figure 3.15 depicts these estimated tolerance intervals, which are shown below the histogram. Because the difference between the upper and lower limits of each estimated tolerance interval is fairly small, we might conclude that the variability of the individual car mileages around the estimated mean mileage of 31.6 mpg is fairly small. Furthermore, the interval [x̄ ± 3s] = [29.2, 34.0] implies that almost any individual car that a customer might purchase this year will obtain a mileage between 29.2 mpg and 34.0 mpg.

Before continuing, recall that we have rounded x̄ and s to one decimal point accuracy in order to simplify our initial example of the Empirical Rule. If, instead, we calculate the Empirical Rule intervals by using x̄ = 31.56 and s = .7977 and then round the interval endpoints to one decimal place accuracy at the end of the calculations, we obtain the same intervals as obtained above. In general, however, rounding intermediate calculated results can lead to inaccurate final results. Because of this, throughout this book we will avoid greatly rounding intermediate results.

We next note that if we actually count the number of the 50 mileages in Table 3.1 that are contained in each of the intervals [x̄ ± s] = [30.8, 32.4], [x̄ ± 2s] = [30.0, 33.2], and [x̄ ± 3s] = [29.2, 34.0], we find that these intervals contain, respectively, 34, 48, and 50 of the 50 mileages. The corresponding sample percentages—68 percent, 96 percent, and 100 percent—are close to the theoretical percentages—68.26 percent, 95.44 percent, and 99.73 percent—that apply to a normally distributed population. This is further evidence that the population of all mileages is (approximately) normally distributed and thus that the Empirical Rule holds for this population.

To conclude this example, we note that the automaker has studied the combined city and highway mileages of the new model because the federal tax credit is based on these combined mileages. When reporting fuel economy estimates for a particular car model to the public, however, the EPA realizes that the proportions of city and highway driving vary from purchaser to purchaser. Therefore, the EPA reports both a combined mileage estimate and separate city and highway mileage estimates to the public (see Table 3.1(b) on page 137).
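The interval arithmetic in the Car Mileage Case above is easy to verify. A minimal Python sketch, using the summary statistics quoted in the text (x̄ = 31.56, s = .7977, n = 50) and an assumed t critical value t.025,49 ≈ 2.0096; the book's exact Chapter 8 method may differ in detail:

```python
import math

x_bar, s, n = 31.56, 0.7977, 50  # summary statistics from the Car Mileage Case

# Empirical Rule tolerance intervals: x_bar +/- k*s for k = 1, 2, 3,
# rounded to one decimal place only at the end of the calculation
tolerance_intervals = {
    k: (round(x_bar - k * s, 1), round(x_bar + k * s, 1)) for k in (1, 2, 3)
}

# 95 percent confidence interval for the population mean mileage, using
# the t critical value t_{.025, 49} ~= 2.0096 (an assumed constant here)
t_crit = 2.0096
half_width = t_crit * s / math.sqrt(n)  # the sampling-error bound
ci = (round(x_bar - half_width, 2), round(x_bar + half_width, 2))
```

Here the sampling-error bound rounds to the .23 mpg quoted in the text, giving the interval [31.33, 31.79], and the three tolerance intervals match Figure 3.15.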
Chapters 1, 2, and 3: Six optional sections discussing business analytics and data mining. The Disney Parks Case is used in an optional section of Chapter 1 to introduce how business analytics and data mining are used to analyze big data. This case considers how Walt Disney World in Orlando, Florida, uses MagicBands worn by many of its visitors to collect massive amounts of real-time location, riding pattern, and purchase history data. These data help Disney improve visitor experiences and tailor its marketing messages to different types of visitors. At its Epcot park, Disney helps visitors choose their next ride by continuously summarizing predicted waiting times for seven popular rides on large screens in the park. Disney management also uses the riding pattern data it collects to make planning decisions, as is shown by the following business improvement conclusion from Chapter 1:

…As a matter of fact, Channel 13 News in Orlando reported on March 6, 2015—during the writing of this case—that Disney had announced plans to add a third "theatre" for Soarin' (a virtual ride) in order to shorten visitor waiting times.

The Disney Parks Case is also used in an optional section of Chapter 2 (Descriptive Statistics: Tabular and Graphical Methods and Descriptive Analytics) to help discuss descriptive analytics. Chapter 2 contains four optional sections that discuss descriptive analytics. Specifically, Figure 2.36 shows a bullet graph summarizing Disney's predicted waiting times for the seven Epcot rides posted by Disney on February 21, 2015, and Figure 2.37 shows a treemap illustrating fictitious visitor ratings of the seven Epcot rides. Other graphics discussed in the optional sections on descriptive analytics include gauges, sparklines, data drill-down graphics, and combining graphics into dashboards illustrating a business's key performance indicators. For example, Figure 2.35 is a dashboard showing eight "flights on time" bullet graphs and three "fleet utilization" gauges for an airline.

[Figure 2.35: A Dashboard of the Key Performance Indicators for an Airline. The dashboard shows flights-on-time (arrival and departure) bullet graphs for the Midwest, Northeast, Pacific, and South; average load factor versus breakeven load factor by month; fleet utilization gauges for short-haul, international, and regional flights; and monthly fuel costs and total costs.]

[Figure 2.36: Excel Output of a Bullet Graph of Disney's Predicted Waiting Times (in minutes) for the Seven Epcot Rides Posted on February 21, 2015. DS DisneyTimes]

Bullet graphs compare a single primary measure to a target, or objective, which is represented by a symbol on the bullet graph. The bullet graph of Disney's predicted waiting times uses five colors ranging from dark green to red and signifying short (0 to 20 minutes) to very long (80 to 100 minutes) predicted waiting times. This bullet graph does not compare the predicted waiting times to an objective. However, the bullet graphs located in the upper left of the dashboard in Figure 2.35 (representing the percentages of on-time arrivals and departures for the airline) display objectives represented by short vertical black lines. For example, consider the bullet graphs representing the percentages of on-time arrivals and departures in the Midwest. The airline's objective was to have 80 percent of midwestern arrivals be on time. The approximately 75 percent of actual midwestern arrivals that were on time is in the airline's light brown "satisfactory" region of the bullet graph, but this 75 percent does not reach the 80 percent objective.

Treemaps. We next discuss treemaps, which help visualize two variables. Treemaps display information in a series of clustered rectangles, which represent a whole. The sizes of the rectangles represent a first variable, and treemaps use color to characterize the various rectangles within the treemap according to a second variable. For example, suppose (as a purely hypothetical example) that Disney gave visitors at Epcot the voluntary opportunity to use their personal computers or smartphones to rate as many of the seven Epcot rides as desired on a scale from 0 to 5. Here, 0 represents "poor," 1 represents "fair," 2 represents "good," 3 represents "very good," 4 represents "excellent," and 5 represents "superb." Figure 2.37(a) gives the number of ratings and the mean rating for each ride on a particular day. (These data are completely fictitious.)

[Figure 2.37: The Number of Ratings and the Mean Rating for Each of Seven Rides at Epcot (0 = Poor, 1 = Fair, 2 = Good, 3 = Very Good, 4 = Excellent, 5 = Superb). DS DisneyRatings]

(a) The number of ratings and the mean ratings:

Ride | Number of Ratings | Mean Rating
Soarin' | 2572 | 4.815
Test Track presented by Chevrolet | 2045 | 4.247
Mission: Space orange | 1589 | 3.408
The Seas with Nemo & Friends | 1157 | 2.712
Living With The Land | 725 | 2.186
Spaceship Earth | 697 | 1.319
Mission: Space green | 467 | 3.116

Figure 2.37(b) shows the Excel output of a treemap, where the size and color of the rectangle for a particular ride represent, respectively, the total number of ratings and the mean rating for the ride. The colors range from dark green (signifying a mean rating near the "superb," or 5, level) to white (signifying a mean rating near the "fair," or 1, level), as shown by the color scale on the treemap. Note that six of the seven rides are rated to be at least "good," four of the seven rides are rated to be at least "very good," and one ride is rated as "fair." Many treemaps use a larger range of colors (ranging, say, from dark green to red), but the Excel app we used to obtain Figure 2.37(b) gave the range of colors shown in that figure. Also, note that treemaps are frequently used to display hierarchical information (information that could be displayed as a tree, where different branchings would be used to show the hierarchical information). For example, Disney could have visitors voluntarily rate the rides in each of its four Orlando parks—Disney's Magic Kingdom, Epcot, Disney's Animal Kingdom, and Disney's Hollywood Studios. A treemap would be constructed by breaking a large rectangle into rectangles for the four parks and, within each park, rectangles for its rides.
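The two encodings a treemap uses (rectangle area for the number of ratings, color for the mean rating) can be computed directly from the Figure 2.37(a) data. A Python sketch of this encoding; the `area_share` and `color` names are our own, not part of any Excel treemap app:

```python
# Fictitious data from Figure 2.37(a): (ride, number of ratings, mean rating)
ratings = [
    ("Soarin'", 2572, 4.815),
    ("Test Track presented by Chevrolet", 2045, 4.247),
    ("Mission: Space orange", 1589, 3.408),
    ("The Seas with Nemo & Friends", 1157, 2.712),
    ("Living With The Land", 725, 2.186),
    ("Spaceship Earth", 697, 1.319),
    ("Mission: Space green", 467, 3.116),
]

total = sum(n for _, n, _ in ratings)  # 9,252 ratings in all

# Treemap encoding: rectangle area is the ride's share of all ratings,
# and color is the mean rating rescaled to 0..1 (1.0 = "superb")
treemap = {
    ride: {"area_share": n / total, "color": mean / 5.0}
    for ride, n, mean in ratings
}

largest = max(treemap, key=lambda ride: treemap[ride]["area_share"])
```

Soarin', with the most ratings and the highest mean, would appear as the largest, darkest rectangle, while Spaceship Earth's low color value corresponds to the one "fair" ride noted in the text.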
Purchases.539.925 PlatProfile Purchases.526.185 (0) PlatProfile(1) Purchases.532.45 Purchases,26.185 Purchases,32.45 All Rows used to obtain Figure 2.37(b) gave the range of colors shown in that figure Also, note that treemaps are frequently used to display hierarchical information (information that could be displayed as a tree, where different branchings would be used to show the hierarchical information) For example, Disney could have visitors voluntarily rate the rides in each of its four Orlando parks—Disney’s Magic Kingdom, Epcot, Disney’s Animal Kingdom, and Disney’s Hollywood Studios A treemap would be constructed by breaking a large Split Prune Color Points RSquare 0.640 tree (for Exercise 3.57(a)) N Number of Splits 40 DS Fresh2 All Rows Count G^2 40 55.051105 Level Rate AdvExp LogWorth 6.0381753 22 0.4500 0.4500 18 PurchasesϾϭ32.45 G^2 Count 21 20.450334 Level Rate 0.8095 0.7932 17 Rate Level 0.9375 0.9108 15 1 Rate 18 0.0526 0.0725 0.4000 0.4141 Level Rate G^2 Count 5.0040242 Prob Count 0.0000 0.0394 1.0000 0.9606 11 Level Rate Prob Count 0.2000 0.2455 0.8000 0.7545 Error Report Class # Cases # Err Overall (h) Pruning the tree in (e) Nodes % Error 6.25 6.25 6.25 6.25 62.5 Best P & Min Classification Confusion M Predicted C Actual Class 0 Error Report Class # Cases # Errors 10 Overall 16 AdvExp 5.525 7.075 Count 10 Prob Count 0.8889 0.8588 0.1111 0.1412 Level Prob for Cust Cust PriceDif 7.45 G^2 Rate For Exercise 3.56 (d) and (e) 8.022 0.3 Prob for 0.142857 0.857143 9.52 Prob Count 1.0000 0.9625 10 0.0000 0.0375 8.587 Predicted Value 8.826 PriceDif 8.826 8.021667 v bow49461_fm_i–xxi.indd 16 24 AdvExp PurchasesϽ26.185 G^2 Count 6.2789777 Prob Count 0.6000 0.5859 Prob Count 0.9474 0.9275 PurchasesϾϭ26.185 G^2 Count 6.7301167 Level Rate LogWorth 0.1044941 PurchasesϽ39.925 G^2 Rate LogWorth 0.937419 Prob Count 0.0625 0.0892 PurchasesϾϭ39.925 Count 11 Level PlatProfile(0) G^2 Count 16 7.4813331 Level G^2 Count 19 7.8352979 Prob Count PlatProfile(1) 
PurchasesϽ32.45 LogWorth 1.9031762 0.1905 0.2068 15 6.65 Prob Count 0.5500 0.5500 ^ Partition for Upgrade 1.00 0.75 Actual Class 100 ^ 3.7 33.33333 20.83333 12.5 4.166667 4.166667 Classification Confusio Predicte ^ 60 % Error $22.52 Purchases ^ Nodes ➔ Chapter A Dashboard of the Key Performance Indicators for an Airline Figure 2.35 Excel Output of a Bullet Graph of Disney’s Predicted Waiting Times (in minutes) for the Seven Epcot Rides Posted at p.m on February 21, 2015 DS DisneyTimes Descriptive Statistics: Tabular and Graphical Methods and Descriptive Analytics Figure 2.36 94 93 Descriptive Analytics (Optional) 20/11/15 4:06 pm 0.3 0.1 AdvExp 6 (a) The Minitab output Dendrogram Complete Linkage 192 0.00 Chapter Sport SportCluster ID Boxing Boxing Basketball Basketball Golf Similarity Swimming 33.33 66.67 100.00 BOX SKI 12 SWIM P PONG H BALL TEN 10 12 TR & F GOLF BOWL BSKT HKY Descriptive Statistics: Golf Swimming Skiing Skiing valwww.freebookslides.com the centroids of each cluster (that is, the six mean Baseball Baseball ues on the six perception scales of the cluster’s memPing-Pong Ping-Pong bers), the average distance of each cluster’s members Hockey Hockey from the cluster centroid, and the distances between Handball Handball the cluster centroids DS SportsRatings Track & Field Track & Field a Use the output to summarize the members of each Bowling Bowling 36 cluster Tennis FOOT Tennis BASE Dist Clust-1 Dist Clust-2 Dist Dist Clust-3 Dist Dist Clust-4 Dist Dist Clust-5 Dist C Cluster ID Dist Clust-1 Clust-2 Clust-3 Clust-4 3 5.64374 3.350791 Numerical 5.192023 5.649341 5.64374 6.914365 3.350791 Methods and 0.960547 5.192023 4.656952 5.528173 3.138766 4.656952 4.049883 5.528173 Method Complete Dendrogram • Boxing Chapter • Basketball Some Predictive3.825935 Analytics 3.825935 7.91777 0.960547 1.836496 3.138766 2.58976 4.049883 4.171558 4.387004 3.396653 4.171558 7.911413 4.387004 6.290393 1.836496 6.107439 2.58976 1.349154 3.396653 6.027006 7.911413 XLMiner 
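Classification tree programs such as those behind the JMP and XLMiner outputs above grow a tree by repeatedly choosing the numeric split that best separates the two classes. A minimal Python sketch using hypothetical card-upgrade data and Gini impurity, one common splitting criterion (the JMP output shown here reports LogWorth, a different criterion):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels (0.0 = pure node)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return the threshold t minimizing the weighted Gini impurity
    of the two groups {v < t} and {v >= t}."""
    n = len(values)
    best_t, best_score = None, float("inf")
    for t in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v < t]
        right = [lab for v, lab in zip(values, labels) if v >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical card-upgrade records (not the book's DS CardUpgrade data):
# last year's purchases and whether the cardholder upgraded (1 = yes)
purchases = [10, 15, 20, 28, 33, 38, 44, 50]
upgraded = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_split(purchases, upgraded)
```

Here the chosen split (Purchases ≥ 33) separates the two classes perfectly; real programs recurse on each side of a split and then prune the tree against validation data, as the XLMiner figures illustrate.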
Output for Exercise 3.61 (b) The JMP output 5.353059 4.289699 1.315692 4.599382 b By using the members of each cluster and the clus0 4.573068 3.76884 3.928169 4.573068 3.76884 ter1 centroids, discuss the basic differences between 4.081887 3.115774 1.286466 6.005379 4.081887 1.286466 the clusters Also, discuss 3.115774 how this k-means cluster 4.357545 7.593 5.404686 0.593558 4.357545 7.593 5.404686 analysis leads to the same practical conclusions 3.621243 3.834742 0.924898 4.707396 3.834742 0.924898 about how3.621243 to improve the popularities of baseball 3.362202 3.4311 1.201897 4.604437 3.4311 1.201897 and tennis 3.362202 that have been obtained using the previ4.088288 0.960547 2.564819 7.005408 4.088288 ously discussed hierachical0.960547 clustering 2.564819 Football 3.8 andFootball the following figures): • Hierarchical clustering and k-means clustering (see Section 196 4.289699 5.649341 4.599382 6.914365 5.417842 1.349154 1.04255 6.027006 2.382945 5.353059 5.130782 1.315692 2.3 4.546991 7.91777 3.223697 6.290393 4.5 2.382945 6.107439 5.052507 3.928169 2.3 3.434243 6.005379 5.110051 0.593558 3.4 2.643109 4.707396 2.647626 4.604437 2.6 4.257925 7.005408 2.710078 5.417842 4.2 5.712401 1.04255 5.7 } } Sport Cluster ID Dist.Cluster Clust-1 Dist Clust-2 Dist Dist.Summary Clust-4Ncon Dist Fast Compl Team Opp Cluster Fast Clust-3 Compl Team Easy Clust-5 Ncon ChapterEasy 201 Descriptive Statistics: Numerical Methods and Some AnalyticsCluster-1 Cluster-1 4.78 2.16 3.33 3.6 2.67 3.6 Boxing Predictive 5.64374 5.649341 4.184.784.289699 5.353059 4.18 2.16 3.332.382945 Opp Summary Chapter 5.1 3.2 5.0 5.1 2.6 2.7 2.67 In the real world, companies such as Amazon In the Netflix real world, sell or companies rent thousands such oraseven Amazon and Netflix sell or rent thousands or even Cluster-2 5.6 and 4.825 5.99 3.475 1.71 3.92 3.350791 6.914365 1.315692 Cluster-2 5.64.599382 4.825 5.99 3.4755.130782 1.71 These 3.92are These are millions of items and find association rules 
In the real world, companies such as Amazon and Netflix sell or rent thousands or even millions of items and find association rules based on millions of customers. In order to make obtaining meaningful association rules manageable, these companies break the products for which they are obtaining association rules into various categories (for example, comedies or thrillers) and hierarchies (for example, a hierarchy related to how new the product is).

[Output residue omitted: the k-means cluster centroids, cluster sizes, average distances, and between-centroid distances for the sports ratings data, and the caption of Figure 3.35, the Minitab output of a factor analysis of the applicant data (4 factors used).]

Exercises for Section 3.10

CONCEPTS

3.66 What is the purpose of association rules?

3.67 Discuss the meanings of the terms support percentage, confidence percentage, and lift ratio.

METHODS AND APPLICATIONS

3.68 In the previous XLMiner output, show how the lift ratio of 1.1111 (rounded) for the recommendation of C to renters of B has been calculated. Interpret this lift ratio.

3.69 The XLMiner output of an association rule analysis of the DVD renters data using a specified support percentage of 40 percent and a specified confidence percentage of 70 percent is shown below. DS DVDRent
a. Summarize the recommendations based on a lift ratio greater than 1.
b. Consider the recommendation of DVD B based on having rented C & E. (1) Identify and interpret the support for C & E. Do the same for the support for C & E & B. (2) Show how the Confidence% of 80 has been calculated. (3) Show how the Lift Ratio of 1.1429 (rounded) has been calculated.

We will illustrate k-means clustering by using a real data mining project. For confidentiality purposes, we will consider a fictional grocery chain. However, the conclusions reached are real. Consider, then, the Just Right grocery chain, which has 2.3 million holders of store loyalty cards. Store managers are interested in clustering their customers into various subgroups whose shopping habits tend to be similar. They expect to find that certain customers tend to buy many cooking basics like oil, flour, eggs, rice, and raw chickens, while others are buying prepared items from the deli, salad bar, and frozen food aisle. Perhaps there are other important categories like calorie-conscious, vegetarian, or premium-quality shoppers. The executives don't know what the clusters are and hope the data will enlighten them. They choose to concentrate on 100 important products offered in their stores.
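The k-means procedure the Just Right project relies on can be sketched minimally: assign each point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until nothing changes. The shopper profiles and function name below are hypothetical, standing in for the 100-product purchase records described in the text.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch (illustrative, not the book's software)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # update step: each centroid becomes the mean of its members
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:               # converged
            break
        centroids = new
    return centroids, clusters

# Hypothetical purchase-frequency profiles: (cooking basics, prepared foods)
shoppers = [(9, 1), (8, 2), (9, 2), (1, 9), (2, 8), (1, 8)]
centroids, clusters = kmeans(shoppers, 2)
```

With two well-separated shopping styles, the two final centroids land near the "cooking basics" group and the "prepared foods" group, which is exactly the kind of subgroup structure the Just Right executives hope the data will reveal.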
Suppose that one product is fresh strawberries, another is olive oil, another is hamburger buns, and another is potato chips. For each customer having a Just Right loyalty card, they will know the

LO3-10 Interpret the information provided by a factor analysis.

• Factor analysis and association rule mining (see Sections 3.9 and 3.10 and the following figures)

Rule: If all Antecedent items are purchased, then with Confidence percentage, Consequent items will also be purchased.

[Table residue omitted: the XLMiner association rule output, with columns Row ID, Confidence%, Antecedent (x), Consequent (y), Support for x, Support for y, Support for x & y, and Lift Ratio; among its rows are B → A and A → B (Confidence% 71.43, Lift Ratio 1.0204), A → C (Confidence% 85.71, Lift Ratio 0.9524), C → B (Confidence% 77.78, Lift Ratio 1.1111), E → C (Confidence% 83.33, Lift Ratio 0.9259), C & E → B (Confidence% 80, Lift Ratio 1.1429), and B & E → C (Confidence% 100, Lift Ratio 1.1111).]
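The support, confidence, and lift quantities asked about in the Section 3.10 exercises can be computed directly from a transaction list. The toy transactions below are invented (not the book's DVDRent data), but they are chosen so that the rule C & E → B reproduces the chapter's rounded values: Confidence% of 80 and Lift Ratio 80/70 ≈ 1.1429.

```python
# Ten toy transactions; each is the set of items one customer rented.
transactions = [
    {"B", "C", "E"}, {"B", "C", "E"}, {"B", "C", "E"}, {"B", "C", "E"},
    {"C", "E"}, {"B", "E"}, {"B", "C"}, {"A", "B"}, {"A", "C"}, {"A", "E"},
]

def support(itemset):
    """Fraction of all transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the consequent's baseline support; a lift
    ratio above 1 means the antecedent makes the consequent MORE likely."""
    return confidence(antecedent, consequent) / support(consequent)
```

Here support({C, E}) = 5/10 = .5, support({C, E, B}) = 4/10 = .4, so the confidence of C & E → B is .4/.5 = 80 percent; dividing by support({B}) = .7 gives the lift ratio 8/7 ≈ 1.1429.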
correlated A 7variables B 1.020408163 1.0204081 Rotated Factor Loadings and Communalities 85.71428571 A C 85.71428571 A C 6applicants 0.952380952 sales 0.9523809 viewed and 7rated 48 and job9 applicants sales positions on positions the 9following 15following variables Principal Component Factor Analysis of the Correlation Matrix viewed rated 48 jobfor for on the Varimax Rotation Cluster #Obs Avg DistC Distance 77.77777778 B 77.77777778 C B 1.111111111 7 1.1111111 Unrotated Loadings and Communalities Variable Factor Factor1 Factor2 Factor3 Between 15 Form of100 Lucidity 11 Ambition B Form of application letter Lucidity 11 Cluster-1 5Factor4 100 BCommunality C 7application letter C 1.111111111 1.1111111 Variable Factor1 Factor2 Factor3 Var 0.114 20.833✓ 0.739 Centers Cluster-1 Cluster-3 6Factor4 71.42857143 BCommunality &C A 71.42857143 Cluster-2 B&C A Cluster-4 1.020408163 Cluster-5 1.0204081 Cluster-220.111 220.138 0.960547 26 Appearance Honesty Honesty 12 Grasp Appearance 12 Var 0.440 20.150 20.394 0.226 0.422 20.121 83.33333333 A & C B 83.33333333 C B 1.19047619 1.190476 Cluster-1 06 A &4.573068 3.768845 3.9281696 5.052507 Var 0.447 20.619 20.376 0.739 Cluster-3 1.319782 Var Var Var Var Var Var Var Var Var Var Var Var Var Var Var 10 Var Var 11 Var 10 Var 11 12 Var Var 13 Var 12 Var 14 Var 13 Var 15 Var 14 Var 15 Variance % Var Variance % Var 0.061 0.583 0.216 0.109 0.919 ✓ 0.617 0.864 ✓ 0.798 0.217 0.867 0.918 ✓ 0.433 0.085 0.882 0.796 ✓ 0.365 0.916 ✓ 0.863 0.804 ✓ 0.872 0.739 ✓ 0.908 0.436 0.913 0.379 0.710 0.646 5.7455 0.383 7.5040 0.500 20.127 0.050 20.247 20.339 0.104 0.181 20.102 0.356 0.246 0.185 20.206 0.582 20.849✓ 0.056 20.354 20.794 20.163 20.069 20.259 0.098 20.329 0.030 20.364 20.032 20.798✓ 0.115 20.604 2.7351 0.182 2.0615 0.137 0.881 C 100 ability A Academic &B C 1.111111111 1.1111111 38 Academic Salesmanship 13 Potential 84.298823 Salesmanship 13 0.422 Cluster-2 4.573068 ability 3.1121355 7.413553 0.877 A&C 71.42857143 B A&C 1.19047619 1.190476 
0.881 Likability Experience 14 Keenness 92.622167 Experience 14 20.162 20.062 0.885 Cluster-3 3.768847 A Likability 3.112135 05 1.020408163 5.2763467 10 71.42857143 A B&C 10 71.42857143 B&C 1.0204081 Cluster-5 2.382945 20.580 0.357 0.877 20.259 0.825 11 0.006 83.33333333 E C 83.33333333 E 7.413553 C 5.2763465 0.925925926 06 0.9259259 Cluster-4 3.928169 0.299 20.179 0.885 511 Self-confidence 10 Drive 105.224902 15 Suitabilit Self-confidence Drive 15 Overall 20.864 13 1.249053 ✓ 0.855 12 0.003 80 C & E B 12 580 C & E B 1.142857143 1.1428571 0.184 20.069 0.825 Cluster-5 5.052507 4.298823 2.6221674 1.111111111 5.2249024 20.088 0.895 13 20.049 100 B & E C 13 100 B&E C 1.1111111 20.360 0.446 0.855 0.055 0.219 0.779 0.248 20.228 0.895 20.160 20.050 0.787 20.093 0.074 0.779 20.105 20.042 0.879 0.100 20.166 0.787 Chapter Summary Chapter Summary 20.340 0.152 0.852 0.256 20.209 0.879 LO3-10 20.425 0.230 0.888 0.135 We began this 0.097 0.852 use percentiles and and quartiles to measure use we percentiles and quartiles to measure variation, and chapter by presenting and comparing several We began mea- thistochapter by presenting comparing several variation, mea- to and 20.541 20.519 0.884 0.073 0.218 0.888 Interpret the learned how toWe construct box-and-whiskers using the how to construct a box-and-whiskers plot by using tendency We defined the population mean sures of andcentral tendency defined athe population meanplot and by learned 20.078 sures of central 0.082 0.794 20.558 we saw how 20.235 0.884 to estimate the population mean by using we a sample saw how quartiles to estimate the population mean by using a sample quartiles information provided 20.107 0.794with 2.4140 mean We20.029 1.3478 12.2423 Factor analysis starts large of variables and attempts to findhowfewer After learning howand to measure and central tendency After learning to measure and depict central tende also defined the median and mode, and we acompared mean We number also defined the correlated median 
mode, and wedepict compared by a factor 0.161 analysis 0.090 and mode for symmetrical 0.816 and variability, weforpresented various optional topics and First,variability, we we presented median, the mean, and and mode symmetrical distributions and underlying uncorrelated factors thatmedian, describe the “essential aspects” of the large number of various optional topics First, 1.4677 the mean,1.2091 12.2423 distributions (Optional).0.098 for distributions discussed numerical measures relationship between several numerical measures of the relationship betw that are skewed to the right or left We then for dis studtributions that areseveral skewed to the right or left of Wethe then stud- discussed 0.081 0.816 correlated variables To illustrate factor analysis, suppose that a personnel officer has inter20.006 0.020 ✓ Cluster-420.874 0.494 0.928✓ 0.282 100 A & B 20.081 0.983933 93 71.42857143 B 0.714 3.9 Factor Analysis (Optional and Requires Section 3.4) variables These included covariance, the correlation two variables These included the covariance, the correla ied measures of variation (or spread) We defined the iedrange, measurestwo of variation (or spread) We the defined the range, coefficient, and the least We then introduced coefficient, the and the least squares line We then introduced variance, and standardand deviation, and48 we job saw how tovariance, estimate and standard andsquares we saw how to estimate viewed rated applicants for sales deviation, positions on theline following 15 variables a weighted mean andby also explained how toconcept computeof a weighted mean and also explained how to com a population variance and standard deviation by using a population sample concept varianceofand standard deviation using a sample Factor 2, “experience”; Factor 3, “agreeable descriptive statistics for grouped data Indeviation addition, wedescriptive showed statistics for grouped data In addition, we sho We learned that aForm good way to interpret the standard We 
deviation learned that a good way to interpret the standard of application letter Lucidity 11 Ambition how toiscalculate the geometric mean and demonstrated its interto calculate the geometric mean and demonstrated its in when a population is (approximately) normally distributed whenisa to population (approximately) normally distributed is to how Variable Factor1 Factor2 Factor3 Factor4 personality”; Factor 4, “academic ability.” Variable (appearance) does notCommunality load heavily pretation Finally, used the numerical methods of chapter Finally, we used the numerical methods of this cha use the Empirical Rule, and we studied Chebyshev’s Theorem, use the Empirical Rule, and wewestudied Chebyshev’s Theorem, Appearance Honesty 12thispretation Grasp Var 0.114 20.833✓ 20.111 20.138 0.739 on any factor and thus is its own factor, as Factor on the Minitab outputcontaining in Figure 3.34large fractions to give an introduction four important to give an introduction to four important techniques of predic which gives us intervals reasonably which gives of us intervals containing to reasonably large techniques fractions ofof predictive Var 0.440 20.150 20.394 0.226 0.422 decision trees, analysis, analysis, analytics: and decision trees, cluster analysis, factor analysis, the population matter what the population’s shape the population might analytics: units8no matter what the cluster population’s shapefactor might13 units Academic Salesmanship Potential indicated is true Variable (form20.127 of application letter) loads heavily onnoFactor 2ability (“experiVar 0.061 20.006 be We also 0.928✓ association rules saw that, when a data set is 0.881 highly skewed,be it isWe best also saw that, when a data set is highly skewed, it is best association rules Rotated Factor Loadings and Communalities asVarimax follows:Rotation Factor 1, “extroverted personality”; BI BI We believe that an early introduction to predictive andistributions Sampling distributions and conalytics (in Chapter 
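The principal component factor extraction used in Figure 3.35 rests on the eigenstructure of the correlation matrix: each factor's loadings are an eigenvector scaled by the square root of its eigenvalue, and the "Variance" row reports that eigenvalue. The sketch below extracts just the first factor by power iteration from an invented 3 × 3 correlation matrix; it is an illustration of the idea, not the Minitab computation.

```python
import math

def first_principal_factor(R, iters=200):
    """Power iteration on a correlation matrix R: returns the largest
    eigenvalue and the loadings of each variable on the first factor
    (eigenvector scaled by the square root of the eigenvalue)."""
    n = len(R)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))   # converges to the eigenvalue
        v = [x / lam for x in w]                 # renormalize the direction
    loadings = [math.sqrt(lam) * x for x in v]
    return lam, loadings

# Invented correlation matrix: three variables that all "move together"
R = [[1.0, 0.8, 0.8],
     [0.8, 1.0, 0.8],
     [0.8, 0.8, 1.0]]
lam, loadings = first_principal_factor(R)
```

For this matrix the first eigenvalue is 2.6, so the single factor explains 2.6/3 ≈ 87 percent of the total variance, and each variable's communality on it is its squared loading — the same quantities reported in the Variance, % Var, and Communality entries of an output like Figure 3.35.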
We believe that an early introduction to predictive analytics (in Chapter 3) will make statistics seem more useful and relevant from the beginning and thus motivate students to be more interested in the entire course. However, our presentation gives instructors various choices. This is because, after covering the introduction to business analytics in Chapter 1, the five optional sections on descriptive analytics and predictive analytics can be covered in any order without loss of continuity. Therefore, the instructor can choose which of the six optional analytics sections to cover early, as part of the main flow of Chapters 1–3, and which to discuss later. We recommend that sections chosen to be discussed later be covered after Chapter 14, which presents the further predictive analytics topics of multiple linear regression, logistic regression, and neural networks.

Chapters 4–8: Probability and probability modeling. Discrete and continuous probability distributions. Sampling distributions and confidence intervals. Chapter 4 discusses probability by featuring a new discussion of probability modeling and by using motivating examples—The Crystal Cable Case and a real-world example of gender discrimination at a pharmaceutical company—to illustrate the probability rules. Chapters 5 and 6 give more concise discussions of discrete and continuous probability distributions (models) and feature practical examples illustrating the "rare event approach" to making a statistical inference. In Chapter 7, The Car Mileage Case is used to introduce sampling distributions and motivate the Central Limit Theorem (see Figures 7.1, 7.3, and 7.5). In Chapter 8, the automaker in The Car Mileage Case uses a confidence interval procedure specified by the Environmental Protection Agency (EPA) to find the EPA estimate of a new midsize model's true mean mileage and determine if the new midsize model deserves a federal tax credit (see Figure 8.2).

This sample mean is the point estimate of the mean mileage µ for the population of six preproduction cars and is the preliminary mileage estimate for the new midsize model that was reported at the auto shows. When the auto shows were over, the automaker decided to further study the new midsize model by subjecting the four auto show cars to various tests.

Because the automaker has been working to improve gas mileages, we cannot assume that we know the true value of the population mean mileage µ for the new midsize model. However, engineering data might indicate that the spread of individual car mileages for the automaker's midsize cars is the same from model to model and year to year. Therefore, if the mileages for previous models had a standard deviation equal to a known value, it might be reasonable to assume that the standard deviation of the mileages for the new
model will also equal that same value. Such an assumption would, of course, be questionable, and in most real-world situations there would probably not be an actual basis for knowing σ. However, assuming that σ is known will help us to illustrate sampling distributions, and in later chapters we will see what to do when σ is unknown.

When the EPA mileage test was performed, the four cars obtained mileages of 29 mpg, 31 mpg, 33 mpg, and 34 mpg. Thus, the mileages obtained by the six preproduction cars were 29 mpg, 30 mpg, 31 mpg, 32 mpg, 33 mpg, and 34 mpg. The probability distribution of this population of six individual car mileages is given in Table 7.1 and graphed in Figure 7.1(a). The mean of this population of six individual car mileages is µ = 31.5 mpg.

EXAMPLE 7.2 The Car Mileage Case: Estimating Mean Mileage — Part 1: Basic Concepts

Consider the infinite population of the mileages of all of the new midsize cars that could potentially be produced by this year's manufacturing process. If we assume that this population is normally distributed with mean µ and standard deviation σ,

Table 7.1 A Probability Distribution Describing the Population of Six Individual Car Mileages

Individual Car Mileage    29     30     31     32     33     34
Probability              1/6    1/6    1/6    1/6    1/6    1/6

[Figure 7.1 omitted: (a) a graph of the probability distribution describing the population of six individual car mileages; (b) a graph of the probability distribution describing the population of 15 sample means.]

[Figure 7.3 omitted: a comparison of (1) the population of all individual car mileages, (2) the sampling distribution of the sample mean x̄ when n = 5, and (3) the sampling distribution of the sample mean x̄ when n = 50; in each case µ_x̄ = µ, with σ_x̄ = σ/√5 = .358 when n = 5 and σ_x̄ = σ/√50 = .113 when n = 50.]

Table 7.2 The Population of Sample Means

(a) The population of the 15 samples of n = 2 car mileages and corresponding sample means

Sample    Car Mileages    Sample Mean
1         29, 30          29.5
2         29, 31          30
3         29, 32          30.5
4         29, 33          31
5         29, 34          31.5
6         30, 31          30.5
7         30, 32          31
8         30, 33          31.5
9         30, 34          32
10        31, 32          31.5
11        31, 33          32
12        31, 34          32.5
13        32, 33          32.5
14        32, 34          33
15        33, 34          33.5

(b) A probability distribution describing the population of 15 sample means: the sampling distribution of the sample mean

Sample Mean    29.5    30     30.5    31     31.5    32     32.5    33     33.5
Frequency       1       1      2       2      3       2      2       1      1
Probability    1/15    1/15   2/15    2/15   3/15    2/15   2/15    1/15   1/15

[Figure 7.5 omitted: The Central Limit Theorem says that the larger the sample size is, the more nearly normally distributed is the population of all possible sample means; panels show (a) several sampled populations and (b) the corresponding populations of all possible sample means for n = 2, n = 6, and n = 30.]

[Figure 8.2 omitted: Three 95 Percent Confidence Intervals for µ. The probability is .95 that x̄ will be within plus or minus 1.96σ_x̄ = .22 of µ; samples of n = 50 car mileages with sample means 31.56, 31.2, and 31.68 give the intervals [31.34, 31.78], [30.98, 31.42], and [31.46, 31.90].]
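Table 7.2 can be verified by brute force: enumerate all 15 equally likely samples of n = 2 from the six mileages, tabulate the sample means, and check that the mean of the population of sample means equals the population mean µ = 31.5. This is a pure-Python sketch; exact fractions avoid any rounding.

```python
from itertools import combinations
from collections import Counter
from fractions import Fraction

population = [29, 30, 31, 32, 33, 34]            # Table 7.1's six mileages
mu = Fraction(sum(population), len(population))  # population mean = 31.5

# All 15 equally likely samples of n = 2 and their sample means (Table 7.2a)
means = [Fraction(a + b, 2) for a, b in combinations(population, 2)]

# Frequency of each sample mean (Table 7.2b); e.g. 31.5 occurs 3 times
dist = Counter(means)

# The mean of the population of all 15 sample means equals mu
mean_of_means = sum(means, Fraction(0)) / len(means)
```

Repeating the exercise with larger n (and a larger population) would also show the spread of the sample means shrinking, which is what Figures 7.3 and 7.5 depict graphically.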
How large must the sample size be for the sampling distribution of x̄ to be approximately normal? In general, the more skewed the probability distribution of the sampled population, the larger the sample size must be for the population of all possible sample means to be approximately normally distributed. For some sampled populations, particularly those described by symmetric distributions, the population of all possible sample means is approximately normally distributed for a fairly small sample size. In addition, studies indicate that, if the sample size is at least 30, then for most sampled populations the population of all possible sample means is approximately normally distributed. In this book, whenever the sample size n is at least 30, we will assume that the sampling distribution of x̄ is approximately a normal distribution. Of course, if the sampled population is exactly normally distributed, the sampling distribution of x̄ is exactly normal for any sample size.

EXAMPLE 7.3 The e-billing Case: Reducing Mean Bill Payment Time

Recall that a management consulting firm has installed a new computer-based electronic billing system in a Hamilton, Ohio, trucking company. Because of the previously discussed advantages of the new billing system, and because the trucking company's clients are receptive to using this system, the management consulting firm believes that the new system will reduce the mean bill payment time by more than 50 percent. The mean payment time using the old billing system was approximately equal to, but no less than, 39 days. Therefore, if µ denotes the new mean payment time, the consulting firm believes that µ will be less than 19.5 days. To assess whether µ is less than 19.5 days, the consulting firm has randomly selected a sample of n = 65 invoices processed using the new billing system and has determined the payment times for these invoices. The mean of the 65 payment times is x̄ = 18.1077 days, which is less than 19.5 days. Therefore, we ask the following question: If

Chapters 9–12: Hypothesis testing. Two-sample procedures. Experimental design and analysis of variance. Chi-square tests. Chapter 9 discusses hypothesis testing and begins with a new section on formulating statistical hypotheses. Three cases—The Trash Bag Case, The e-billing Case, and The Valentine's Day Chocolate Case—are then used in a new section that explains the critical value and p-value approaches to testing a hypothesis about a population mean. A summary box visually illustrating these
payment time, consulting firmillustrating believes that m will be fore, ifmean less than 19.5 days To assess whether m is less than 19.5 days, the consulting firm has randomly selected a sample of n 65 invoices processed using the new billing system and _has determined the payment times for these invoices The mean of the 65 payment times is x 18.1077 days, which is less than 19.5 days Therefore, we ask the following question: If bow49461_fm_i–xxi.indd 31.42 _ In statement we showed that the probability is 95 that the sample mean x will be we showed within plus or minus 1.96s_x 22 of the population mean m In statement _ _ that x being within plus or minus 22 of m is the same as the interval [x 22] containing m Combining these results, we see that the probability is 95 that the sample mean _ x will be such that the interval _ _ [x 1.96s_x] [x 22] contains the population mean m approaches is presented in the middle of this section editions) so that Statement says that, before we randomly select the sample,to theredeveloping is a 95 probability that more of the section can be devoted the _ we will obtain an interval [x 22] that contains the population mean m In other words, summary boxthat and showing how it ofInthese addition, m,to anduse 5 percent intervals 95 percent of all intervals we might obtain contain _ not contain m For this reason, we call the interval [x 22] a 95 percent confidence interval a five-step hypothesis testing procedure emphasizes for m To better understand this interval, we must realize that, when we actually select the sample, will observe one particular extremely large number of possible thatwesuccessfully usingsample anyfrom ofthethe book’s hypothesis samples Therefore, we will obtain one particular confidence interval from the extremely large testing summary boxes requires simply identifying number of possible confidence intervals For example, recall that when the automaker randomly selected sample of n 50 cars and tested them as prescribed 
the EPA, the automaker the the alternative hypothesis being testedbyand then look_ obtained the sample of 50 mileages given in Table 1.7 The mean of this sample is x ing inandthe summary box the corresponding critical 31.56 mpg, a histogram constructed usingfor this sample (see Figure 2.9 on page 66) indicates that the population of all individual car mileages is normally distributed It follows that a value rule and/or p-value (see the next page) 95 percent confidence interval for the population mean mileage m of the new midsize model is (rather than at the interval end, asfor in mprevious A 95 percent confidence _ [ x 22] [31.56 22] [31.34, 31.78] vii Because we not know the true value of m, we not know for sure whether this interval contains m However, we are 95 percent confident that this interval contains m That is, we are 95 percent confident that m is between 31.34 mpg and 31.78 mpg What we mean by “95 percent confident” is that we hope that the confidence interval [31.34, 31.78] is one of the 95 percent of all confidence intervals that contain m and not one of the percent of all confidence intervals that not contain m Here, we say that 95 percent is the confidence 20/11/15 4:06 pm 396 Chapter www.freebookslides.com Hypothesis Testing p-value is a right-tailed p-value This p-value, which we have previously computed, is the area under the standard normal curve to the right of the computed test statistic value z In the next two subsections we will discuss using the critical value rules and p-values in the summary box to test a “less than” alternative hypothesis (Ha: m , m0) and a “not equal to” alternative hypothesis (Ha: m Þ m0) Moreover, throughout this book we will (formally or informally) use the five steps below to implement the critical value and p-value approaches to hypothesis testing The Five Steps of Hypothesis Testing State the null hypothesis H0 and the alternative hypothesis Ha Specify the level of significance a 402 Chapter Plan the sampling 
procedure and select the test statistic Hypothesis Testing Using a critical value rule: LO9-4 Use the summary box to find the critical value rule corresponding to the alternative hypothesis Use critical values Collect the and sample data, compute the value of the test statistic, and decide whether to reject H0 by using the p-values to perform a t critical value rule Interpret the statistical results If we not know s (which is usually the case), we can base a hypothesis test about m on test about a population Usingwhen a p-value mean s is rule: the sampling distribution of _ x alternative 2m unknown Use the summary box to find the p-value corresponding to the hypothesis Collect the sample data, 9.3 t Tests about a Population Mean: s Unknown syÏn compute the value of the test statistic, and compute the p-value the sampledapopulation is normally distributed (or ifthe thestatistical sample size is large—at least 30), Reject H0 at level ofIfsignificance if the p-value is less than a Interpret results then this sampling distribution is exactly (or approximately) a t distribution having n degrees of freedom This leads to the following results: Testing a “less than” alternative hypothesis We have seen in the e-billing case that to study whether the new electronic billing system A ttime Test a Population s Unknown reduces the mean bill payment byabout more than 50 percent, theMean: management consulting firm will test H0: m 19.5 versus Ha: m , 19.5 (step 1) A Type I error (concluding that Ha: m , 19.5 is true when H0: m_ 19.5 is true) would result in the consulting firm overstating Null Test Normal population x2 m the benefits of the new billing system, both to the company in which it has been installed and Hypothesis H0: m m0 Statistic t _ df n Assumptions or syÏ n to other companies that are considering installing such a system Because the consulting firm Large sample size desires to have only a percent chance of doing this, the firm will set a equal to 01 (step 2) To 
To perform the hypothesis test, we will randomly select a sample of n = 65 invoices paid using the new billing system and calculate the mean x̄ of the payment times of these invoices. Then, because the sample size is large, we will utilize the test statistic in the summary box (step 3):

z = (x̄ − 19.5)/(s/√n)

A value of the test statistic z that is less than zero results when x̄ is less than 19.5. This provides evidence to support rejecting H0 in favor of Ha because the point estimate x̄ indicates that μ might be less than 19.5. To decide how much less than zero the value of the test statistic must be to reject H0 in favor of Ha at level of significance α, we note that Ha: μ < 19.5 is of the form Ha: μ < μ0, and we look in the summary box under the critical value rule heading Ha: μ < μ0. The critical value rule that we find is a left-tailed critical value rule and says the following: place the probability of a Type I error, α, in the left-hand tail of the standard normal curve and use the normal table to find the critical value −zα. Here −zα is the negative of the normal point zα.

[The summary box's graphic gives the critical value rules and p-value rules for Ha: μ > μ0, Ha: μ < μ0, and Ha: μ ≠ μ0—reject H0 if t > tα, t < −tα, or |t| > tα/2, respectively, or if the corresponding right-tail area, left-tail area, or twice the tail area (the p-value) is less than α.]

[EXAMPLE 9.4 The Commercial Loan Case: Mean Debt-to-Equity Ratio—this example continues on the facsimile pages below.]

[Facsimile of page 407—9.4 z Tests about a Population Proportion:] In order to see how to test this kind of hypothesis, remember that when n is large, the sampling distribution of

(p̂ − p0) / √(p0(1 − p0)/n)

is approximately a standard normal distribution. Let p0 denote a specified value between 0 and 1 (its exact value will depend on the problem), and consider testing the null hypothesis H0: p = p0. We then have the following result:
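The left-tailed critical value −z.01 = −2.33 quoted on these pages comes from the standard normal table. It can be reproduced from the inverse CDF of the standard normal distribution; a minimal sketch using Python's standard library:

```python
from statistics import NormalDist

# The critical value -z_alpha is the point with left-tail area alpha
# under the standard normal curve.
alpha = 0.01
critical = NormalDist().inv_cdf(alpha)
print(round(critical, 2))  # -2.33
```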
A Large Sample Test about a Population Proportion
  Null Hypothesis: H0: p = p0
  Test Statistic: z = (p̂ − p0) / √(p0(1 − p0)/n)
  Assumptions: np0 ≥ 5 and n(1 − p0) ≥ 5

[The summary box's graphic gives the critical value rules and p-value rules for Ha: p > p0, Ha: p < p0, and Ha: p ≠ p0—reject H0 if z > zα, z < −zα, or |z| > zα/2, respectively, or if the corresponding right-tail area, left-tail area, or twice the tail area (the p-value) is less than α.]

[E-billing case, continued:] That is, −zα is the point on the horizontal axis under the standard normal curve that gives a left-hand tail area equal to α. Reject H0: μ = 19.5 in favor of Ha: μ < 19.5 if and only if the computed value of the test statistic z is less than the critical value −zα (step 4). Because α equals .01, the critical value −zα is −z.01 = −2.33 [see Table A.3 and Figure 9.3(a)].

[Example 9.4, continued:] One measure of a company's financial health is its debt-to-equity ratio. This quantity is defined to be the ratio of the company's corporate debt to the company's equity. If this ratio is too high, it is one indication of financial instability. For obvious reasons, banks often monitor the financial health of companies to which they have extended commercial loans. Suppose that, in order to reduce risk, a large bank has decided to initiate a policy limiting the mean debt-to-equity ratio for its portfolio of commercial loans to being less than 1.5. In order to assess whether the mean debt-to-equity ratio μ of its (current) commercial loan portfolio is less than 1.5, the bank will test the null hypothesis H0: μ = 1.5 versus the alternative hypothesis Ha: μ < 1.5. In this situation, a Type I error (rejecting H0: μ = 1.5 when H0: μ = 1.5 is true) would result in the bank concluding that the mean debt-to-equity ratio of its commercial loan portfolio is less than 1.5 when it is not. Because the bank wishes to be very sure that it does not commit this Type I error, it will test H0 versus Ha by using a .01 level of significance. To perform the hypothesis test, the bank randomly selects a sample of 15 of its commercial
loan accounts. Audits of these companies result in the following debt-to-equity ratios (arranged in increasing order): 1.05, 1.11, 1.19, 1.21, 1.22, 1.29, 1.31, 1.32, 1.33, 1.37, 1.41, 1.45, 1.46, 1.65, and 1.78. DS DebtEq

The mound-shaped stem-and-leaf display of these ratios is given in the page margin and indicates that the population of all debt-to-equity ratios is (approximately) normally distributed. It follows that it is appropriate to calculate the value of the test statistic t in the summary box. Furthermore, because the alternative hypothesis Ha: μ < 1.5 says to use…

[Facsimile—EXAMPLE 9.6 The Cheese Spread Case: Improving Profitability:] We have seen that the cheese spread producer wishes to test H0: p = .10 versus Ha: p < .10, where p is the proportion of all current purchasers who would stop buying the cheese spread if the new spout were used. The producer will use the new spout if H0 can be rejected in favor of Ha at the .01 level of significance. To perform the hypothesis test, we will randomly select n = 1,000 current purchasers of the cheese spread, find the proportion (p̂) of these purchasers who would stop buying the cheese spread if the new spout were used, and calculate the value of the test statistic z in the summary box. Then, because the alternative…

[Preface text:] Hypothesis testing summary boxes are featured throughout Chapter 9, Chapter 10 (two-sample procedures), Chapter 11 (one-way, randomized block, and two-way analysis of variance), Chapter 12 (chi-square tests of goodness of fit and independence), and the remainder of… Chapters 13–15 present predictive analytics methods that are based on parametric regression and time series models. Specifically, Chapter 13 and the first seven sections of Chapter 14 discuss simple and basic multiple regression analysis by using a more…
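The t statistic for the commercial loan case can be computed directly from the fifteen audited ratios listed above. A minimal sketch (the comparison value −t.01 with 14 degrees of freedom, approximately −2.624, is a standard t-table value assumed here):

```python
import math
from statistics import mean, stdev

# Debt-to-equity ratios from the commercial loan case (n = 15):
ratios = [1.05, 1.11, 1.19, 1.21, 1.22, 1.29, 1.31, 1.32, 1.33,
          1.37, 1.41, 1.45, 1.46, 1.65, 1.78]

xbar, s, n = mean(ratios), stdev(ratios), len(ratios)  # stdev uses n - 1
t = (xbar - 1.5) / (s / math.sqrt(n))
print(round(xbar, 4), round(t, 2))  # 1.3433 -3.16
```

Because −3.16 is less than −2.624, the left-tailed rule would lead the bank to reject H0: μ = 1.5 at the .01 level of significance.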
hypothesis the book addition, emphasis is placed , 10 says to use the left-tailed critical value rule in thestreamlined summary box, weorganization and The Tasty Sub Shop (revHa: pIn the value of zimportance is less than 2za 5after 2z.01 22.33 (Note that using H0: p 10 ifpractical throughout will on reject estimating enue prediction) Case (see Figure 14.4) The next five this procedure is valid because np0 1,000(.10) 100 and n(1 p0) 1,000(1 10) testing for statistical sections ofthat Chapter 14 present five advanced modeling 900 are both significance at least 5.) Suppose that when the sample is randomly selected, we find 63 of the 1,000 current purchasers say they would stop buying the cheese spread if the new topics that can be covered in any order without loss of spout were used Because pˆ 63y1,000 063, the value of the test statistic is Chapters 13–18: Simple and multiple regression continuity: dummy variables (including a discussion p ˆ p 063 10 23.90 _ z _ _ 10(1 10) and analysis Model building Logistic regression of interaction); quadratic and quantitative (1 p ) p _ 0 n Ï _ Ï 1,000 Con- interaction variables; modelvariables neural networks Time series forecasting building and the effects Because z 23.90 is less than 2z.01 22.33, we reject H0: p 10 in favor of Ha: p , 10 trol charts statistics Decision of multicollinearity; residual analysis and diagnosing that the proportion of all current purchasers who would ThatNonparametric is, we conclude (at an a of 01) ␣ 01 2z.01 22.33 p-value 00005 z 23.90 viii stop buying the cheese spread if the new spout were used is less than 10 It follows that the company will use the new spout Furthermore, the point estimate pˆ 063 says we estimate that 6.3 percent of all current customers would stop buying the cheese spread if the new spout were used BI Some statisticians suggest using the more conservative rule that both np0 and n(1 p0) must be at least 10 bow49461_fm_i–xxi.indd 20/11/15 4:06 pm www.freebookslides.com outlying 
[Preface text, continued:] …and influential observations; and logistic regression (see Figure 14.36). The last section of Chapter 14 discusses neural networks and has logistic regression as a prerequisite. This section shows why neural network modeling is particularly useful when analyzing big data and how neural network models are used to make predictions (see Figures 14.37 and 14.38). Chapter 15 discusses time series forecasting, including Holt–Winters' exponential smoothing models, and refers readers to Appendix B (at the end of the book), which succinctly discusses the Box–Jenkins methodology. The book concludes with Chapter 16 (a clear discussion of control charts and process capability), Chapter 17 (nonparametric statistics), and Chapter 18 (decision theory, another useful predictive analytics topic).

[Facsimile of page 594, Chapter 14, Multiple Regression and Model Building—Figure 14.4: Excel and Minitab Outputs of a Regression Analysis of the Tasty Sub Shop Revenue Data in Table 14.1 Using the Model y = β0 + β1x1 + β2x2 + ε. The Excel output reports Multiple R = 0.9905, R² = 0.9810, adjusted R² = 0.9756, standard error 36.6856, 10 observations; ANOVA sums of squares 486355.7 (regression), 9420.8 (residual), 495776.5 (total); F = 180.689 with significance F = 9.46E-07; and coefficients Intercept 125.289 (SE 40.9333, t = 3.06), population 14.1996 (SE 0.9100, t = 15.60), bus_rating 22.8107 (SE 5.7692, t = 3.95), with p-values 0.0183, 1.07E-06, and 0.0055. The Minitab output gives the same fit, including the regression equation Revenue = 125.3 + 14.200 Population + 22.81 Bus_Rating and R-sq(pred) = 96.31%.]
[Legend for Figure 14.4, keyed to the outputs: 1–3 the least squares point estimates b0, b1, b2; 4 s_bj = standard error of the estimate bj; 5 t statistics; 6 p-values for t statistics; 7 R²; 8 adjusted R²; 9 s = standard error; 10 explained variation; 11 SSE = unexplained variation; 12 total variation; 13 F(model) statistic; 14 p-value for F(model); 15 ŷ = 956.606, the point prediction at the setting population x1 = 47.3 and a specified business rating x2; 16 s_ŷ = 15.0476, the standard error of the estimate ŷ; 17 95% confidence interval (921.024, 992.188); 18 95% prediction interval (862.844, 1050.37); 19 95% confidence interval for βj.]

…residual—the difference between the restaurant's observed and predicted yearly revenues—is fairly small (in magnitude). We define the least squares point estimates to be the values of b0, b1, and b2 that minimize SSE, the sum of squared residuals for the 10 restaurants. The formula for the least squares point estimates of the parameters in a multiple regression model is expressed using a branch of mathematics called matrix algebra. This formula is presented in Bowerman, O'Connell, and Koehler (2005). In the main body of this book, we will rely on Excel and Minitab to compute the needed estimates. For example, consider the Excel and Minitab outputs in Figure 14.4. The Excel output tells us that the least squares point estimates of β0, β1, and β2 in the Tasty Sub Shop revenue model are b0 = 125.289, b1 = 14.1996, and b2 = 22.8107. The point estimate b1 = 14.1996 of β1 says we estimate that mean yearly revenue increases by $14,199.60 when the population size increases by 1,000 residents and the business rating does not change.

[Facsimile of page 654—Figure 14.37: The Single Layer Perceptron. An input layer of predictor variables x1, x2, …, xk feeds a hidden layer of m nodes, with ℓ1 = h10 + h11x1 + h12x2 + … + h1kxk, ℓ2 = h20 + h21x1 + h22x2 + … + h2kxk, …, ℓm = hm0 + hm1x1 + … + hmkxk and hidden node functions H1(ℓ1) = (e^ℓ1 − 1)/(e^ℓ1 + 1), and so on. These feed the output layer L = β0 + β1H1(ℓ1) + β2H2(ℓ2) + … + βmHm(ℓm), with output function g(L) = 1/(1 + e^−L) if the response variable is qualitative and g(L) = L if the response variable is quantitative.]

This model involves:
1  An input layer consisting of the predictor variables x1, x2, …, xk under consideration.
2  A single hidden layer consisting of m hidden nodes. At the vth hidden node, for v = 1, 2, …, m, we form
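The matrix-algebra least squares formula b = (X′X)⁻¹X′y that the text delegates to Excel and Minitab can be sketched in plain Python. This is an illustration on synthetic data (not the book's Tasty Sub Shop data), generated so that the estimates are recoverable exactly:

```python
def lstsq_3param(x1, x2, y):
    """Least squares for y = b0 + b1*x1 + b2*x2 via the normal equations
    (X'X) b = X'y, solved with Gaussian elimination."""
    n = len(y)
    X = [[1.0, a, b] for a, b in zip(x1, x2)]
    # Build the 3x3 matrix X'X and the vector X'y.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(3)] for r in range(3)]
    v = [sum(X[i][r] * y[i] for i in range(n)) for r in range(3)]
    # Forward elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    # Back substitution.
    b = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        b[r] = (v[r] - sum(A[r][c] * b[c] for c in range(r + 1, 3))) / A[r][r]
    return b

# Synthetic data from y = 125 + 14*x1 + 23*x2, so the least squares
# point estimates should recover 125, 14, and 23.
x1 = [20.8, 27.5, 32.3, 37.2, 39.6, 45.1]
x2 = [3.0, 2.0, 4.0, 5.0, 4.5, 6.5]
y = [125 + 14 * a + 23 * b for a, b in zip(x1, x2)]
b0, b1, b2 = lstsq_3param(x1, x2, y)
print(round(b0, 3), round(b1, 3), round(b2, 3))  # 125.0 14.0 23.0
```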
a linear combination ℓv of the k predictor variables:

ℓv = hv0 + hv1x1 + hv2x2 + … + hvkxk

Here, hv0, hv1, …, hvk are unknown parameters that must be estimated from the sample data. Having formed ℓv, we then specify a hidden node function Hv(ℓv) of ℓv. This hidden node function, which is also called an activation function, is usually nonlinear. The activation function used by JMP is

Hv(ℓv) = (e^ℓv − 1) / (e^ℓv + 1)

[Noting that (e^2x − 1)/(e^2x + 1) is the hyperbolic tangent function of the variable x, it follows that Hv(ℓv) is the hyperbolic tangent function of x = ℓv/2.] For example, at nodes 1, 2, …, m, we specify…

[Facsimile of page 652, Minitab logistic regression output residue: VIFs of 1.59 for both predictors; fitted probability 0.943012 (SE fit 0.0587319, 95% CI (0.660211, 0.992954)) at the setting Purchases = 42.571, and fitted probability 0.742486 (SE fit 0.250558, 95% CI (0.181013, 0.974102)) at the setting Purchases = 51.835.]

The odds ratio estimate of 1.25 for Purchases says that for each increase of $1,000 in last year's purchases by a Silver card holder, we estimate that the Silver card holder's odds of upgrading increase by 25 percent. The odds ratio estimate of 46.76 for PlatProfile says that we estimate that the odds of upgrading for a Silver card holder who conforms to the bank's Platinum profile are 46.76 times larger than the odds of upgrading for a Silver card holder who does not conform to the bank's Platinum profile, if both Silver card holders had the same amount of purchases last year.

• The upgrade probability for a Silver card holder who had purchases of $51,835 last year and does not conform to the bank's Platinum profile is

e^(−10.68 + .2264(51.835) + 3.84(0)) / (1 + e^(−10.68 + .2264(51.835) + 3.84(0))) = .7425

[Figure 14.38: JMP Output of Neural Network Estimation for the Credit Card Upgrade Data, DS CardUpgrade; Neural Validation: Random Holdback, Model NTanH(3). Parameter estimates: H1_1—Purchases 0.113579, PlatProfile:0 0.495872, Intercept −4.34324; H1_2—Purchases 0.062612, PlatProfile:0 0.172119, Intercept −2.28505; H1_3—Purchases 0.023852, PlatProfile:0 0.93322, Intercept −1.1118; Upgrade(0)—H1_1 −201.382, H1_2 236.2743, H1_3 81.97204, Intercept −7.26818.]

ℓ̂2 = ĥ20 + ĥ21(Purchases) + ĥ22(JDPlatProfile) = −2.28505 + .062612(51.835) + .172119(1) = 1.132562
H2(ℓ̂2) = (e^1.132562 − 1) / (e^1.132562 + 1) = .5126296505

ℓ̂3 = ĥ30 + ĥ31(Purchases) + ĥ32(JDPlatProfile) = −1.1118 + .023852(51.835) + .93322(1) = 1.0577884
H3(ℓ̂3) = (e^1.0577884 − 1) / (e^1.0577884 + 1) = .4845460029

L̂ = b0 + b1H1(ℓ̂1) + b2H2(ℓ̂2) + b3H3(ℓ̂3)
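The activation function above, and its hyperbolic tangent identity, can be checked numerically. A minimal sketch using the node 1 parameter estimates quoted in these pages for a card holder with Purchases = 51.835 and JDPlatProfile = 1:

```python
import math

def jmp_activation(l):
    """H(l) = (e^l - 1) / (e^l + 1), which equals tanh(l / 2)."""
    return (math.exp(l) - 1) / (math.exp(l) + 1)

# Node 1 linear combination, using the estimates quoted in the text.
l1 = -4.34324 + 0.113579 * 51.835 + 0.495872 * 1
print(round(l1, 4))                  # 2.04
print(round(jmp_activation(l1), 4))  # 0.7699
```

The identity H(ℓ) = tanh(ℓ/2) holds because tanh(x) = (e^2x − 1)/(e^2x + 1).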
L̂ = −7.26818 − 201.382(.7698664933) + 236.2743(.5126296505) + 81.97204(.4845460029) = −1.464996

g(L̂) = 1 / (1 + e^−(−1.464996)) = .1877344817

[Figure 14.38 also tabulates, for selected Silver card holders, Purchases, the estimated probabilities of Upgrade = 0 and Upgrade = 1, the hidden node values H1_1, H1_2, H1_3, and the "most likely" classification; for example, card holder 42 has Purchases = 51.835, P(Upgrade = 0) = .1877344817, P(Upgrade = 1) = .8122655183, and H values .7698664933, .5126296505, .4845460029.]

…card holders who have not yet been sent an upgrade offer and for whom we wish to estimate the probability of upgrading. Silver card holder 42 had purchases last year of $51,835 (Purchases = 51.835) and did not conform to the bank's Platinum profile (PlatProfile = 0). Because PlatProfile = 0, we have JDPlatProfile = 1. Figure 14.38 shows the parameter estimates for the neural network model based on the training data set and how they are used to estimate the probability that Silver card holder 42 would upgrade. Note that because the response variable Upgrade is qualitative, the output layer function is g(L) = 1/(1 + e^−L). The final result obtained in the calculations, g(L̂) = .1877344817, is an estimate of the probability that Silver card holder 42 would not upgrade (Upgrade = 0). This implies that the estimate of the probability that Silver card holder 42 would upgrade is 1 − .1877344817 = .8122655183. If we predict that a Silver card holder would upgrade if and only if his or her upgrade probability is at least .5, then Silver card holder 42 is predicted to upgrade (as is Silver card holder 41). JMP uses the model fit to the training data set to calculate an upgrade probability estimate for each of the 67 percent of the Silver card holders in the training data set and for each of the 33 percent of the Silver card holders in the validation data set. If a particular Silver card holder's upgrade probability estimate is at least .5, JMP predicts
an upgrade for the card holder and assigns a "most likely" qualitative value of 1 to the card holder. Otherwise, JMP assigns a "most likely" qualitative value of 0 to the card holder. At the bottom of Figure 14.38, we show the results of JMP doing this for Silver card holders 1, 17, 33, and 40. Specifically, JMP predicts an upgrade (1) for card holders 17 and 33, but only card holder 33 did upgrade. JMP predicts a nonupgrade (0) for card holders 1 and 40, and neither of these card holders upgraded. The "confusion matrices" in Figure 14.39 summarize…

The idea behind neural network modeling is to represent the response variable as a nonlinear function of linear combinations of the predictor variables. The simplest but most widely used neural network model is called the single-hidden-layer, feedforward neural network. This model, which is also sometimes called the single-layer perceptron, is motivated (like all neural network models) by the connections of the neurons in the human brain. As illustrated in Figure 14.37, this model involves:
…estimate that:

ℓ̂1 = ĥ10 + ĥ11(Purchases) + ĥ12(JDPlatProfile) = −4.34324 + .113579(51.835) + .495872(1) = 2.0399995
H1(ℓ̂1) = (e^2.0399995 − 1) / (e^2.0399995 + 1) = .7698664933

• The upgrade probability for a Silver card holder who had purchases of $42,571 last year and conforms to the bank's Platinum profile is

e^(−10.68 + .2264(42.571) + 3.84(1)) / (1 + e^(−10.68 + .2264(42.571) + 3.84(1))) = .9430

[Facsimile of page 652—Figure 14.36: Minitab Output of a Logistic Regression of the Credit Card Upgrade Data. The output includes the deviance table (Regression, Purchases, PlatProfile, Error, Total) with deviance R-sq = 65.10% and R-sq(adj) = 61.47%, AIC = 25.21; coefficients Constant −10.68 (SE Coef 4.19, z = −2.55, p = 0.011, 95% CI (−18.89, −2.46)), Purchases 0.2264 (SE Coef 0.0921, z = 2.46, p = 0.014, 95% CI (0.0458, 0.4070)), PlatProfile 3.84 (SE Coef 1.62, z = 2.38, p = 0.017, 95% CI (0.68, 7.01)); odds ratio 1.2541 (95% CI (1.0469, 1.5024)) for the continuous predictor Purchases and 46.7564 (95% CI (1.9693, 1110.1076)) for PlatProfile level 1 relative to level 0; and goodness-of-fit tests (deviance, Pearson, Hosmer–Lemeshow).]
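The logistic regression probabilities quoted in these pages follow directly from the fitted coefficients. A minimal sketch (note that the coefficients −10.68, 0.2264, and 3.84 are rounded display values, so this reproduces Minitab's fitted probabilities .9430 and .7425 only approximately):

```python
import math

def upgrade_probability(purchases, plat_profile):
    """Estimated P(upgrade) from the fitted logistic model
    logit(p) = -10.68 + 0.2264*Purchases + 3.84*PlatProfile."""
    logit = -10.68 + 0.2264 * purchases + 3.84 * plat_profile
    return 1 / (1 + math.exp(-logit))

print(upgrade_probability(42.571, 1))   # close to the text's .9430
print(upgrade_probability(51.835, 0))   # close to the text's .7425
print(round(math.exp(0.2264), 4))       # 1.2541, the odds ratio for Purchases
```

Exponentiating a coefficient gives the predictor's odds ratio, which is how the 1.2541 figure for Purchases arises.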
(box plots) construction of, 158–159 explanation of, 157–158 Box–Jenkins methodology, 682, 688 Boys & Girls Clubs of America, 101 Boy Scouts of America, 101 Branches, 812 Brother’s Brother Foundation, 101 Brown, C E., 496 Bruno’s Pizza, 55–58 Bryant, Kobe, 72 Buick, 8, 9, 14 Bullet graph, 22 Bullet graphs, 92–94 Bureau of Labor Statistics, 6, 251 Bureau of the Census, 157 Business analytics descriptive analytics as, 21–22 explanation of, 9, 21 predictive analytics as, 23–25 prescriptive analytics as, 23–25 B zones, 747 C Cable passing, 227–228 Cable penetration, 228 Cadillac Division, 730–731 CA Magazine, 373 Campus Crusade for Christ, 101 C&S Wholesale Grocers, 205 Capability studies, 754–760 CARE USA, 101 Cargill, 205 Categorical variables, 734 Catholic Charities USA, 101 Catholic Medical Mission Board, 101 Catholic Relief Services, 101 Cause-and-effect diagrams, 768–769 CBS, 394 CEEM Information Systems, 731 Cell frequency, 83, 514–515 Census, Census II method, 697 Centered moving averages, 692–694 bow21493_ind_875_892.indd 876 Index Center line (CNL), 741–745, 762 Centers for Disease Control and Prevention, 253 Central Limit Theorem, 334–335 Central tendency explanation of, 135, 784 mean and, 135–137, 139–142 median and, 138, 139–142 mode and, 138–142 Certainty, 810 Chambers, S., 580 Champ golf balls, 456 Charlotte Bobcats, 144 Charts See also Bar charts; Control charts; Graphs misleading, 19, 89–91 Pareto, 58–59, 61 pie, 60, 61 Chebyshev’s Theorem, 151, 261 Chevrolet, 261 Chicago Bulls, 144 Chicago Cubs, 4, 73 Chicago White Sox, 4, 73 Chi-square distribution, 417, 418 Chi-square goodness of fit test in Excel, 523 in MegaStat, 526 in Minitab, 527–528 for multinomial probabilities, 505–508 for normal distributions, 509–512 Chi-square point, 417, 419 Chi-square statistic, 506, 507 Chi-square table, 417, 418 Chi-square test for independence in Excel, 524–525 explanation of, 514–518 in MegaStat, 525–526 in Minitab, 527 Chronic Disease Fund, 101 Chrysler, 512 
Chrysler Motors, 99 Cincinnati Reds, 4, 73 Classical method, of assigning probabilities, 221, 222 Classification explanation of, 172 predictive analytics as, 24 Classification trees, 172–177 Class midpoints, 65 Cleary, Barbara A., 60, 61 Clemen, Robert T., 252 Cleveland Air Route Traffic Control Center, 274 Cleveland Cavaliers, 144 Cleveland Indians, 4, 73 Cluster analysis explanation of, 184 hierarchical clustering and, 184–188 k-means clustering and, 188–190 Cluster detection, 24 Cluster sampling, 27, 28 Coates, R., 35, 35n Coca-Cola, 222, 484 Coefficient of variation, 152–153 Colorado Rockies, 4, 73 Column percentages, 84, 514 Column totals, 83, 514 Coma, 277 20/11/15 4:17 pm www.freebookslides.com 877 Index Combinations, counting rule for, 248–249 Comparisonwise error rate, 473 Compassion International, 101 Complete lineage approach, 185–186 Composite score, 13 Conditional probability explanation of, 235–238 independence and, 238–241 Confidence coefficient, 350 Confidence intervals application of, 350 in Excel, 379 explanation of, 347 to find population mean, 350–353 in MegaStat, 380 in Minitab, 381 multiple regression model and, 611–613 95 percent confidence interval for m, 349 one-sided, 400 for pairwise difference, 473 for parameters of finite populations, 373–375 for population mean, 347–349, 440 for population proportion, 367–369, 447 for proportion of total number of units in category when sampling finite population, 375 for regression parameter, 610 sample size for, 364–366, 369–372 simple linear regression and, 559–562 simultaneous, 473 for slope, 554 t-based, 355–361, 431, 434 testing hypotheses in, 400 tolerance intervals vs., 361–362 two-sided, 400 in two-way ANOVA, 490 z-based, 347–353 Confidence level explanation of, 347, 349 increase in, 352 Confidence percentage, 200 Conforming units (nondefective), 762 Conners, Jimmy, 184, 187 Consolidated Power, 389, 401, 416–417 Constant seasonal variation, 686 Constant variance assumption, residual analysis 
and, 567 Consumer Price Index (CPI), 36, 716 Consumer Reports, 373, 373n3, 378, 423 Contingency tables example of, 81–84, 515 explanation of, 81, 514, 517 Continuity correction, 310–311 Continuous probability distributions explanation of, 223 properties of, 290–291 types of, 289–290 Continuous process improvement, 731 Continuous random variables explanation of, 255–256, 289 exponential distribution and, 313–315 normal approximation of binomial distribution and, 310–312 normal probability distribution and, 294–308 normal probability plot and, 316–318 properties of, 290–291 types of, 289–290 uniform distribution and, 291–293 bow21493_ind_875_892.indd 877 Control charts analysis of, 745 center lines and control limits in, 741–745, 762 constants and, 741 explanation of, 737–738 for fraction nonconforming, 762–765 in MegaStat, 775–776 in Minitab, 744, 745, 776–777 pattern analysis in, 747–749 p charts, 762–765 prevention using, 760 R charts, 738–749 x charts, 738–749 Control limits, 741–742, 745 Convenience sampling, 18 Cook’s distance measure, 646 Cooper, Donald R., 437, 519, 521 Correlation negative, 164 population rank, 798 positive, 164 Correlation coefficient explanation of, 163, 281 population, 164, 564 sample, 164 simple, 546–547 Spearman’s rank, 564, 797 Correlation matrix, 632 Cost variance, 309 Counting rules, 247–249 Covariance explanation of, 161 sample, 161–164 Cox, D R., 665 Cox Enterprises, 205 Cp index, 760 Cravens, David, 615, 631 Critical value rules right-tailed and left-tailed, 395 for testing “greater than” alternative hypothesis, 390–392 for testing not equal to alternative hypothesis, 398–399 Critical values, 390 Cross International, 101 Cross-sectional data, Crystal Cable, 221, 227–232, 235–237 C statistic, 636 Cumberland Gulf Group, 205 Cumulative frequencies, 70 Cumulative frequency distributions, 69–70 Cumulative percentage point, 59 Cumulative percent frequencies, 70 Cumulative relative frequencies, 70 Curvature rate, 625 Cycle, 681 Cyrus, 
Miley, 72 C zones, 747 D Dallas Mavericks, 144 D’Ambrosio, P., 580 Dana-Farber Cancer Institute, 101 Dartmouth College, 8–9 Dashboards, 92 20/11/15 4:17 pm www.freebookslides.com 878 Data big, 7, 21 bimodal, 138 cross-sectional, explanation of, multimodal, 138 overview of, 3–4 primary, quantitative and qualitative variables and, 4–5 secondary, sources of, 6–7 time series, Data and Story Library (DASL), 170, 205 Data discovery methods, 95 Data drill down, 95, 96 Data mining explanation of, 3, 9, 23 stepwise regression and, 638 Data sets explanation of, 3–4 training and validation, 640, 656 Data warehouses, Dax, George S., 198 Decision making under certainty, 810 posterior decision analysis for, 815–820 under risk, 811–812 under uncertainty, 810–811 Decision theory decision trees and, 812–813 explanation of, 809–810 posterior decision analysis and, 815–820 utility theory and, 823–824 Decision trees See also Tree diagrams classification trees as, 172–177 explanation of, 172, 812–813 regression trees as, 172, 176, 178–180 Defect concentration diagrams, 769 DeGeneres, Ellen, 72 Degrees of freedom (df), 356, 402, 417, 451, 452, 511–512, 604 Deleted residual, 641 Dell, 205 Deloitte & Touche Consulting, 62 Deming, W Edwards, 729, 730, 730n Dendogram, 188 Denman, D W., 100, 100n Denver Nuggets, 144 Dependent events, 238 Dependent variables, 531, 532 Descriptive analytics bullet graphs and, 92–94 dashboards and gauges and, 92 data discovery and, 95, 96 explanation of, 21–22, 92 sparklines and, 95 treemaps and, 94–95 Descriptive statistics See also Central tendency; Variance association rules and, 198–201 box-and-whiskers displays and, 157–160 central tendency and, 135–142 cluster analysis and multidimensional scaling and, 184–190 contingency tables and, 81–84 bow21493_ind_875_892.indd 878 Index covariance, correlation, and the least squares line and, 161–165 decision trees and, 172–180 dot plots and, 75 explanation of, 9, 55 factor analysis and, 192–196 geometric mean and, 
170–171 graphically summarizing qualitative data and, 55–59 graphically summarizing quantitative data and, 61–71 for grouped data, 167–169 measures of variation and, 145–153 misleading, 19, 89–91 percentiles, quartiles, and five-number displays and, 155–157 scatter plots and, 87–88 stem-and-leaf displays and, 76–79 weighted means and grouped data and, 166–169 Deseasonalized observation, 694–695 Designed statistical experiments, 731 Detroit Pistons, 144 Detroit Tigers, 4, 73 Dichotomous surveys, 29 Dielman, T.E., 649 Difference in fits statistic (Dffits), 646 Digital Equipment Corporation, 760 Dillon, W R., 444 Dillon, William R., 235, 369, 410, 450, 457, 519 Direct Relief, 101 Discount Soda Shop, Inc., 691 Discrete probability distributions explanation of, 223, 256–257 properties of, 257–261 Discrete random variables binomial distribution and, 263–272 explanation of, 255 hypergeometric distribution and, 278–279 joint distributions and, 280–281 mean or expected value of, 258 Poisson distribution and, 274–277 probability distributions and, 256–261 standard deviation of, 260 variance of, 260 Discrete uniform distribution, 261 Disney, 9, 24, 94 Disney Cruise Line, 24 Disney Parks, 92–95 Distance value, 560, 613, 644 Dobyns, Lloyd, 729 Dr Dre, 72 Dodge, Harold F., 728 Domino’s Pizza, 55–58 Dot plots, 4–5, 75 Double exponential smoothing, 704–707 Dow Jones & Company, Downey, Robert, Jr., 72 Dummy variables explanation of, 173, 616, 648 to model qualitative independent variables, 616–622 quantitative, 656 DuPont, 731 Durbin-Watson critical points, 688 Durbin-Watson statistic, 573, 688 Durbin-Watson test, 573–574 20/11/15 4:17 pm www.freebookslides.com 879 Index E Economic indexes, 716–717 Economists, data use by, Eigenvalue, 195 Einstein, Albert, 17 Elber, Lynn, 241 Electronics World, 616–622, 797, 798 Elements, Emenyonu, Emmanuel N., 518, 519 Emory, C William, 437, 519, 521 Empirical Rule explanation of, 294–295 for normally distributed population, 148, 149 skewness and, 
151 Employee performance, evaluation of, 614–616 Enterprise Holding, 205 Enterprise Industries, 184, 251, 541–542, 557–558, 563, 599–600, 614, 624–625, 627–630, 658, 663, 664 Entire small data set, 180 Environmental Protection Agency (EPA), 4, 9, 14–17, 15n8, 135–136, 138, 150, 223, 255, 289, 328–333, 347, 349, 350, 350n1, 360, 364, 366, 389, 401, 423 Epcot Center, 22 Ernst & Young Consulting, 62, 205 Errors forecast, 712–713 of nonobservation, 31 of observation, 31, 32 recording, 32 sampling, 31 in surveys, 30–32 Error sum of squares (SSE), 470, 471, 480 Error term, 533, 534, 592 ESPN Baseball Encyclopedia, 184 “Ethical Guidelines for Statistical Practice” (American Statistical Association), 18–19 Events conditional probability of, 235–238 dependent, 238 explanation of, 224 independent, 223, 238–241 mutually exclusive, 232–234 probability of, 224–228 Everlast, 307 Evert, Chris, 184 Excel analysis of variance in, 473, 489, 492, 497–498 bar charts, 56–57 binomial probabilities in, 284–285 bullet graph, 93 calculated results in, 42 chi-square tests in, 523–525 confidence intervals in, 379 creating 100 random numbers in, 43 creating time series plot in, 40–42 entering data in, 39 explanation of, 36 features of window in, 36–38 frequency histograms, 66, 67 hypergeometric probabilities in, 285 hypothesis testing in, 424, 459–460 including output in reports in, 42 bow21493_ind_875_892.indd 879 multiple regression analysis in, 594, 598, 619, 622, 629, 666–667 normal distribution in, 321–322 normal probability plot in, 317 numerical descriptive statistics in, 207–209 output of statistics in, 142, 143 Poisson probabilities in, 285–286 printing spreadsheet in embedded graph in, 42 procedure to start, 38 randomized block ANOVA in, 482 retrieving spreadsheet in, 40 saving data in, 39 simple linear regression analysis in, 553, 557, 559, 583–584 tabular and graphical methods in, 103–120 time series analysis in, 722 treemap, 94 Expected value of discrete random variable, 258 of 
perfect information, 813 Experimental design concepts related to, 465–467 randomized block, 479–480 Experimental region, 538, 540 Experimental studies, 6–7 Experiments binomial, 224 counting rules for, 247–249 explanation of, 221 multinomial, 505–508 paired differences, 439–443, 789 in tree diagrams, 224, 225 two-factor factorial, 486–487 Experimentwise error rate, 473 Exponential distribution, 313–315 Exponential smoothing double, 704–707 simple, 699–703 use of, 682 F Factor analysis, 192–196 Factor detection, 24 Factors, Farnum, Nicholas R., 203, 276 F distribution, 451–452 Federal Aviation Administration (FAA), 274, 276 Federal Reserve, Federal Trade Commission (FTC), 30, 386, 411–413 Federer, Roger, 72 Feeding America, 101 Feed the Children, 101 Ferguson, J T., 597 Fidelity Investments, 205 Fidelity Small Cap Discovery Fund, 458 Financial planners, data use by, Finite population correction, 374–375 Finite population multiplier, 337 Finite populations confidence intervals fort, 373–375 explanation of, 14 Firtle, N H., 444 Firtle, Neil H., 235, 369, 410, 450, 457, 519 Fishbone charts, 768–769 20/11/15 4:17 pm www.freebookslides.com 880 Fitzgerald, Neil, 373 Five-number summary, 156–157 Food for the Poor, 101 Forbes, 72–74 Forbes Magazine, 72, 73–74, 204 Ford, John K., 345 Ford Motor Company, 512, 728, 729, 747 Forecast error comparisons, 712–713 Forrest Gump (movie), 319 Fortune magazine, 342, 436 F point, 451 Fractional power transformation, 686 Fraction nonconforming, 762–765 Frames, 11, 27 Freeman, L., 35, 35n Frequencies cell, 83 of classes, 63 cumulative, 70 explanation of, 55 percent, 56 relative, 56 Frequency bar charts, 57 Frequency distributions classes of, 62–64 construction of, 65–66 cumulative, 69–70 examples using, 55–56, 64 explanation of, 55 histograms and, 61–65 percent, 56 relative, 56 shape of, 67–68 Frequency histograms, 64–65 Frequency polygons, 68–69 Friedman, David, 14 Frommer, F J., 274n2 F table, 451, 452 F test for differences between 
treatment means, 471 overall, 605–606 partial, 638–640 for significance of slope, 554–556 Fuel Economy Guide, 138 G Gallup, George, 18 Gallup News, 86 Gallup Organization, 60, 103, 341, 451 Gallup Poll, 18 Gaudard, M., 35, 35n Gauges, 92 General Electric, 760 General logistic regression model, 648, 650–651 General Motors Corporation, 9, 14, 221, 512, 729, 730–731 Geometric mean, 170–171 Georgetown University, 29 Giges, Nancy, 320 Gillette, Gary, 184 Gitlow, Howard, 730, 748, 753s, 754s Gitlow, Shelly, 730, 748, 753s, 754s Golden State Warriors, 144 Goldstein, Mathew, 191 Good 360, 101 GoodCarBadCar.net, 61 Goodness-of-fit tests for multinomial probabilities, 505–508 for normal distributions, 509–512 Goodwill Industries International, 101 Granbois, D H., 378 Graphical methods in Excel, 103–120 in MegaStat, 121–124 in Minitab, 125–133 Graphs See also Control charts bullet, 22 misleading, 19, 89–91 Gray, S J., 519 Gray, Sidney J., 518 Greater Cincinnati International Airport, 362 Greater than alternative hypothesis, 390–394 Grouped data, 167–169 Gunter, B., 754s Gupta, S., 312, 512 H Habitat for Humanity International, 101 Harris, Calvin, 72 Hartley, H O., 357 H. E. Butt Grocery, 205 Hierarchical clustering, 184–188 Hinckley, John, 252 Hirt, G A., 101n, 262–263 Hirt, Geoffrey A., 262 Histograms construction of, 16–17, 62–66 explanation of, 62, 64 frequency, 64–65 percent frequency, 65, 66 process capability and, 755 relative frequency, 65, 66 shape of, 67–68 HJ Heinz, 205 Holt-Winters’ double exponential smoothing, 704–707 Homogeneity test, 508 Honda, 15, 137–138 Horizontal bar charts, 57 Hospitals, data use by, Houston Astros, 4–5, 73 Houston Rockets, 144 Hypergeometric distribution, 278, 279 Hypergeometric probabilities in Excel, 285 in MegaStat, 286 in Minitab, 287 Hypergeometric probability formula, 278 Hypergeometric random variables, 278–279 Hypothesis testing about difference between two population proportions, 448 alternative
hypothesis and, 383–385, 390–392, 396–399 confidence intervals and, 400 critical value rule and, 398–399 in Excel, 424, 459–460 explanation of, 383 in MegaStat, 425–426, 460–461 in Minitab, 426–427, 462–463 null hypothesis and, 383–387, 394 population mean and, 390–400 p-value and, 392–394, 397–398 statistical significance and, 392 steps in, 396 test statistic and, 387 Type I and Type II errors and, 387–388, 392 z test and, 395–396 I IBM, 731, 760 Increasing seasonal variation, 686 Independence assumption chi-square test and, 515, 516 residual analysis and, 571–573 Independent events explanation of, 223, 238 multiplication rule for, 239, 240 Independent samples comparing population means using, 429–436 comparing population proportions using, 445–449 comparing population variances using, 453–455 Independent samples experiment, 430 Independent variables multicollinearity and, 632–633 simple linear regression model and, 531, 532 testing significance of, 607–610 using dummy variables to model qualitative, 616–622 Index numbers, 713–717 Indiana Pacers, 144 Indicator variables See Dummy variables Infinite population, 14 Information Resources, Inc., Interaction, 620 Interaction variables, 627–630, 638, 639 Interquartile range (IQR), 157 Interval variables, 25–26 Investment Digest, 203–204 Irregular fluctuations, 681 IRS, 23 Ishikawa, Kaoru, 768 Ishikawa diagrams, 768–769 ISO 9000, 731 J James, LeBron, 72 Japanese manufacturers, 512 Jay-Z, 72 Jeep, 99 JM Family Enterprises, 205 JMP analytics using, 216–219 decision trees and, 176, 177, 183 hierarchical clustering in, 189 neural network analysis in, 657–660, 677 Johnson, Dwayne, 72 Joint probability distribution, 280–281 Journal News, 241, 274 Journal of Accounting Research, 377 Journal of Advertising, 372, 450 Journal of Economic Education, 444 Journal of Economics and Business, 458 Journal of Management, 338 Journal of Marketing, 100, 369, 615
Journal of Marketing Research, 512 Journal of Retailing, 422 Judgment sampling, 18 JUSE (Union of Japanese Scientists and Engineers), 729 JVC CD player, 457 K Kansas City Royals, 4, 73 Kendall, Maurice, 193n Kerrich, John, 222n Kiewit Corporation, 205 k-means clustering, 188–190 Knowles, Beyonce, 72 Koch Industries, 205 Krogers, 386n Krohn, Gregory, 444 Kruskal-Wallis H statistic, 794 Kruskal-Wallis H test, 475, 794–795 Kumar, Kerwin, 422, 458 Kumar, V., 198 Kutner, M H., 582–583 L Lady Gaga, 72 Landers, Ann, 18, 32 Landon, Alf, 18 Laspeyres index, 715–716 Lawrence, Jennifer, 72 Leaf unit, 78 Least squares line, 164–165 Least squares plane, 595 Least squares point estimates calculation of, 537–538 explanation of, 535–536, 594 multiple regression model and, 593–596 Least squares prediction equation, 595 Leaves, 76 See also Stem-and-leaf displays Left-hand tail area, 306–307, 395 Less than alternative hypothesis, testing for, 396–398 Leukemia & Lymphoma Society, 101 Level of significance, 392 Leverage values, 644–645 Levine, D M., 185 Levine, David M., 191 Li, W., 582–583 Liebeck, Stella, 35 Lift Ratio, 200 Linear regression models See Multiple regression models; Simple linear regression models Linear relationship, 87, 161–162 Line of means, 533 Literary Digest, 18 Literary Digest Poll (1936), 31 Little Caesars, 58 Little Caesars Pizza, 55, 56 Local minimum, 656 Logarithmic transformation, 569 Logistic regression, 647–652 Logit, 651 Los Angeles Angels, 4, 73 Los Angeles Clippers, 144 Los Angeles Dodgers, 4, 73 Los Angeles Lakers, 144 Love’s Travel Stops & Country Stores, 205 Lower limit, 157 Lutheran Services in America, 101 M Madden, T J., 444 Madden, Thomas J., 235, 369, 410, 450, 457, 519 Magee, Robert P., 309 Mail surveys, 30 Major League Baseball (MLB), 4, 60, 73 Major League Soccer (MLS), 60 Make-A-Wish Foundation of America, 101 Makridakis, S., 576 Malcolm Baldrige National Quality Award Consortium, 730 Malcolm
Baldrige National Quality Awards, 730–731 Mall surveys, 30 Mann-Whitney test See Wilcoxon rank sum test MAP International, 101 Margin of error, 352 Marine Toys for Tots Foundation, 101 Maris, Roger, 80 Marketing professionals, Marketing Science, 354 Mars, 205 Mars, Bruno, 72 Martocchio, Joseph, 338–339 MasterCard, 8, 234–235, 241 Master golf balls, 456 Matrix algebra, 594 Maximax criterion, 810–811 Maximin criterion, 810 Mayo Clinic, 101 Mayweather, Floyd, 72 Mazis, M B., 100, 100n McDonald’s, 35 McEnroe, John, 184, 187 McGee, V E., 576 Mean absolute deviation (MAD), 712, 713 Mean absolute percentage error (MAPE), 712–713 Mean level, 532, 591 Means See also Population means; Sample means of binomial random variable, 272 compared to median and mode, 139–142 constant, 754 derivation of, 342–343 of discrete random variable, 258 explanation of, 16 geometric, 170–171 plane of, 592 Poisson random variable and, 276–277 process, 739, 740, 742 weighted, 166–169 Mean square, 469 Mean squared deviation (MSD), 712, 713 Mean square error explanation of, 549–550 standard error and, 604 Mean square of prediction error, 640 Measurements, 4, 61 Median compared to mean and mode, 139–142 explanation of, 138 as resistant to extreme values, 140 sign test and, 780–783 Medicaid, 60 Medicare, 60 MegaStat analysis of variance in, 498–499 binomial probabilities in, 286 chi-square tests in, 525–526 confidence intervals in, 380 control charts in, 775–776 creating 100 random numbers in, 46 data labels in, 45 data selection in, 44–45 example in, 45–46 explanation of, 36, 43 getting started in, 43–44 hypergeometric probabilities in, 286 hypothesis testing in, 425–426, 460–461 multiple regression analysis in, 629, 668–670 nonparametric methods on, 802–804 normal distribution in, 322–323 normal probability plot in, 317 numerical descriptive statistics in, 210–211 Poisson probabilities in, 286 simple linear regression analysis in, 585–586 tabular and graphical
methods in, 121–124 time series analysis in, 723–724 Meier, Heidi Hylton, 522 Meijer, 205 Memorial Sloan-Kettering Cancer Center, 101 Memphis Grizzlies, 144 Mendenhall, W., 27, 28, 653 Merrington, M., 452 Metropolitan Museum of Art, 101 Miami Heat, 144 Miami Marlins, 4, 73 Miami University, 140 Miami University Alumni Association, 31 Miami University of Ohio, 31 Michaels Art and Crafts, 24 Miller’s, 437–438, 456 Milliken and Company, 730–731 Milwaukee Brewers, 4, 73 Milwaukee Bucks, 144 Minimum error tree, 183 Minitab analysis of variance in, 473, 489, 492, 493, 500–502 binomial probabilities in, 287 box-and-whiskers display in, 158 chi-square tests in, 527–528 confidence intervals in, 381, 562 control charts in, 744, 745, 776–777 copying high-resolution graphics output in, 53 copying session window output in, 52 creating 100 random numbers in, 53 creating time series plot in, 48–49 entering data in, 47 explanation of, 36 exponential smoothing in, 703, 705, 706, 709–712 factor analysis in, 193–196 features of, 46–47 frequency histograms in, 66 getting started in, 47 hierarchical clustering in, 189 hypergeometric probabilities in, 287 hypothesis testing in, 426–427, 462, 463 logistic regression in, 648, 649, 652 multiple regression analysis in, 594, 598, 620, 627, 671–676 nonparametric methods on, 805–806 normal distribution in, 323–325 numerical descriptive statistics in, 212–216 output of statistics describing satisfaction ratings in, 158 output of statistics in, 142, 143 Poisson probabilities in, 287 prediction intervals in, 562 printing and saving graphs in, 50 printing data from Session window or Data window in, 51–52 randomized block ANOVA in, 482, 484 retrieving worksheet in, 48 saving data in, 48 simple linear regression analysis in, 553, 558, 569–571, 587–588 stem-and-leaf display in, 78 tabular and graphical methods in, 125–133 time series analysis in, 725 Minnesota Timberwolves, 144 Minnesota Twins, 4, 73
Misleading information, 19, 89–91 Mode compared to mean and median, 139–142 explanation of, 138–139 Model building, 638–640 Morgenstern, O., 823 Morningstar.com, 458 Motorola, 727, 730, 731, 760 Mound-shaped distributions, 68 Moving averages, 692–694 Multicollinearity, 631–634 Multidimensional scaling, 187–188 Multimodal data, 138 Multinomial experiments, 505–508 Multiple-choice surveys, 29 Multiple coefficient of determination (R2), 601–602 Multiple regression models advanced methods for, 638–640 comparison of, 634–636 confidence and prediction intervals and, 611–613 employee performance evaluation and, 614–616 in Excel, 594, 598, 619, 622, 629, 666–667 explanation of, 591–592 interaction variables and, 627–630 least squares point estimates and, 593–597 logistic regression and, 647–652 mean square error and, 604 in MegaStat, 629, 668–670 in Minitab, 594, 598, 620, 627, 671–676 model validation, PRESS statistic, and R2, 640–641 multicollinearity and, 631–635 multiple coefficient of determination and, 601–603 neural networks and, 653–658 overall F test and, 605–606 quadratic, 625–626 residual analysis in, 642–647 significance of independent variable and, 607–610 squared variables and, 625–627 standard error and, 603–604 stepwise regression and backward elimination and, 636–638 use of dummy variables to model qualitative independent variables and, 616–622 Multiplication rule for independent events, 239, 240 Multiplicative decomposition, 682, 691–698 Multiplicative model, 691–692 Multiplicative Winters’ method, 707–711 Multistage cluster sampling, 27, 28 Mutually exclusive events, 232–234 N Nachtsheim, C S., 582–583 Nadal, Rafael, 72 NASDAQ, 171 National Automobile Dealers Association, 60 National Basketball Association (NBA), 60, 143, 144 National Do Not Call Registry, 30 National Enquirer, 32 National Football League (NFL), 60 National Golf Association, 456, 478 National Hockey League (NHL), 60 National Multiple Sclerosis Society, 101
Nature Conservancy, 101 NBA Players Association, 143 NBC, 394, 729 Negative autocorrelation, 571, 572, 688 Negative correlation, 164, 546 Nelson, John R., 100 Neter, J., 582–583 Netflix, 24, 24n13 Neural network analysis, in JMP, 657–660, 677 Neural networks, 653–658 New Jersey Nets, 144 New Orleans Hornets, 144 Newton, Isaac, 17 New York Knicks, 144 New York Mets, 4, 73 New York Stock Exchange, 344 New York Yankees, 4, 73, 76–77 95 percent confidence interval for μ, 349 Nodes, 812 Nominative variables, 26 Nonconforming units (defective), 762 Nonparametric methods explanation of, 361, 780 Kruskal-Wallis H test and, 794–795 on MegaStat, 802–804 on Minitab, 805–806 sign test and, 780–783 Spearman’s rank correlation coefficient and, 797–799 Wilcoxon rank sum test and, 784–788 Wilcoxon signed ranks test and, 789–792 Nonresponse, 31–32 Normal curve areas under, 295–301 explanation of, 294–295 finding point on horizontal axis under, 304–306 left-hand tail area on, 306–307 right-hand tail area of, 298, 299, 305 standard normal distribution and, 296 used to approximate binomial probability, 312 Normal distributions approximating binomial distribution by using, 310–312 in Excel, 321–322 goodness of fit test for, 509–512 in MegaStat, 322–323 in Minitab, 323–325 Normality assumption, residual analysis and, 568–569 Normally distributed populations, Empirical Rule and, 148, 149, 151 Normal probability distribution applications of, 301–304 explanation of, 294, 302–303 normal curve and, 294–301, 304–308 Normal probability plot, 316–318 Normal table cumulative, 296–298 explanation of, 294 North American Oil Company, 466–468, 471, 794 Not equal to alternative hypothesis, 398–399 No trend regression model, 682 Novartis Pharmaceutical Company, 239 Null hypothesis See also Hypothesis testing explanation of, 383, 385 measuring weight of evidence against, 394 statistical significance and, 392 Number of degrees of freedom (df), 356, 417, 451,
452 O Oakland Athletics, 4, 73 Observation deseasonalized, 694–695 errors of, 32 explanation of, Observational studies, O’Connor, Catherine, 444 Odds, 650 Odds ratio, 650, 651 Ogives, 70, 71 Ohio State University, 88 Oklahoma City Thunder, 144 Olmsted, Don, 21 One-sided alternative hypothesis, 386, 453 One-sided confidence intervals, 400 One-way ANOVA assumptions for, 468 between-treatment variability and, 469 estimation in, 467, 474 explanation of, 467–468 F test and, 471–472 within-treatment variability and, 469 pairwise comparisons and, 472–475 testing for significant differences between treatment means and, 468–471 Onkyo CD player, 457 Open-ended classes, 66 Open-ended surveys, 29 Operation Blessing International Relief & Development, 101 Oppenheim, A., 748, 753s, 754s Oppenheim, Alan, 730 Oppenheim, R., 730, 748, 753s, 754s Ordinal variables, 26 Orlando Magic, 144 Ott, L., 27, 28 Outliers detection of, 75, 643–644 example of dealing with, 645–646 explanation of, 79, 643 Overall F test, 605–606 Overall total, 514 P Paasche index, 716 Paired differences experiments, 439–443, 789 Pairwise comparisons, 472–475 Pairwise differences, 472 Papa John’s Pizza, 55–58, 139 Parabola, 625–626 Parameters explanation of, 271 population, 373–375 Parametric model, 172 Parametric test, 792 Pareto, Vilfredo, 58–59 Pareto charts, 58–59, 61 Pareto principle, 58 Partial F test, 638, 639 Pattern analysis, 747–749 Patterson, 280 Payoffs, 809 Payoff table, 809–810 p charts, 762–765 Pearson, E S., 357 Pearson, Michael A., 522 Penalized least squares criterion, 655, 656 Penalty weight, 655 Pepsi, 484 Percent bar charts, 57 Percent frequency distributions, 56 Percent frequency histograms, 65, 66 Percentiles, 155–157 See also Quartiles Performance Food Group, 205 Perry, E S., 100, 100n Perry, Katy, 72 Petersen, Donald, 729 Pfaffenberger, 280 Philadelphia Phillies, 4, 73 Philadelphia 76ers, 144 Philip Morris, 100 Phoenix Suns, 144 Phone surveys, 30 Pie charts,
60, 61 Pilot Flying J, 205 Pittsburgh Pirates, 4, 73 Pizza Hut, 55, 56, 58 Plane of means, 592 Planned Parenthood Federation of America, 101 Platinum Equity, 205 Point estimate See also Least squares point estimates explanation of, 136, 138 of mean, 596 of population mean, 136, 538 randomized block design and, 481, 482 in simple linear regression, 540 in two-way ANOVA, 490 Point prediction in multiple regression, 596 in simple linear regression, 539, 540 Poisson distribution explanation of, 274–276 mean, variance, and standard deviation, 276–277 Poisson probabilities in Excel, 285–286 in MegaStat, 286 in Minitab, 287 Poisson probability table, 275 Poisson random variable, 274–277 Politicians, Population explanation of, 8–9 finite, 14 infinite, 14 Population correlation coefficient, 164, 564 Population covariance, 164 Population means confidence intervals for, 355–362 explanation of, 135 point estimate of, 136, 539 t tests and, 402–405 using independent samples to compare, 429–436 z tests and, 395–396 Population median, 782, 783 Population parameter, 135–136 Population proportions confidence intervals for, 367–369 hypothesis test about difference between two, 448 use of large, independent samples to compare, 445–449 z tests and, 406–409 Population rank correlation coefficient, 798 Populations, finite, 373–375 Population Services International, 101 Population standard deviation, 146 Population total, 374 Population variances explanation of, 146 statistical inference for, 418–419 testing equality of two, 454–455 use of independent samples to compare, 453–455 Portland Trail Blazers, 144 Positive autocorrelation, 571, 572, 688 Positive correlation, 164, 546 Posterior probability decision making using, 815–820 explanation of, 243, 815 PQ Systems, 60 Prediction, 24 Prediction interval multiple regression model and, 611–613 simple linear regression and, 559–562 Prediction model, 172 Predictive analytics, 23–25 Preposterior
analysis, 820 Prescriptive analytics, 23–25 Price indexes, 715, 716 PricewaterhouseCoopers, 205 Principal components, 193 Prior decision analysis, 815 Prior probability, 243 Probability classical method of assigning, 221, 222 conditional, 235–241 counting rules and, 247–249 of events, 224–228 explanation of, 221 independence and, 238–241 multinomial, 505–508 posterior, 243, 815–820 prior, 243 relative-frequency method of assigning, 221, 222 sample space outcomes and, 221–223 subjective, 223 Type I and Type II errors and, 387–388, 392 Probability distributions binomial, 267–268 continuous, 223, 289–291 discrete, 223, 256–261 explanation of, 16, 223 joint, 280–281 normal, 294–308 types of, 223–224 Probability models, 223–224 Probability revision table, 817, 818 Probability rules addition rule and, 232–234 explanation of, 229 multiplication rule and, 239, 240 mutually exclusive events and, 232–234 rule of complements as, 229–232 Probability sampling, 18 Process control analyzing charts to establish, 745–747 capability studies and, 754–760 explanation of, 14 Process mean, 739, 740, 743 Process performance graphs, 736, 737 Process sampling, 734–738 Process standard deviation, 739 Process variation, 731–733, 754 Procter & Gamble Company, 729 Producer Price Index (PPI), 716, 717 Production supervisors, Project HOPE, 101 pth percentile, 155 Public Roads, 558 Publix Super Markets, 205 p-value right-tailed, left-tailed, or two tailed, 395–396 for testing greater than alternative hypothesis, 392–394 Q Quadratic regression model, 625, 626 Qualitative data, graphical summaries See Bar charts; Pie charts Qualitative variables dummy variables to model independent, 616–622 explanation of, 5, 26 types of, 26 Quality, 727–728 Quality control history background of, 728–731 statistical process control and, 731–733 Quality Home Improvement Center (QHIC), 565–571, 664–665 Quality of conformance, 727 Quality of
design, 727 Quality of performance, 727–728 Quality Progress, 35, 35n, 60, 341 Quantitative dummy variable, 656 Quantitative variables explanation of, 4–5, 25, 734 types of, 25 Quartic root transformation, 569 Quartiles, 156 Queueing theory, 315 Queues, 315 QuikTrip, 205 Quiznos, 531 R Randomized block design explanation of, 479–480 point estimates and confidence intervals in, 481–483 use of, 475, 480–481 Random number, 11 Random number table, 11–12 Random sampling case studies illustrating statistical inference and, 10–16 confidence intervals and, 373–375 explanation of, 9, 10 inference and, 12, 13, 15–16 selection of, 11–13 stratified, 27 Random selection, 10 Random shock, 688 Random variables binomial, 266, 271–272 continuous, 255–256 discrete, 255–281 explanation of, 223, 255 hypergeometric, 278–279 Poisson, 274–277 Range, 145 Ranks, 799 Rare event approach, to statistical inference, 270 Rate of curvature, 625 Rational subgroups, 734–737 Ratio variables, 25 R charts, 738–749 Reagan, Ronald, 252 Recording errors, 32 Regression assumptions, 548–549 Regression models See also Multiple regression models; Simple linear regression models comparison of, 634–636 explanation of, 172, 532 general logistic, 648 logistic, 647–652 time series, 682–689 Regression parameters, 534, 593, 610 Regression trees, 172, 176, 178–179 Relative frequencies explanation of, 56, 223 long-run, 222 Relative frequency distributions, 56 Relative frequency histograms, 65, 66 Relative frequency method, 221, 222 Research hypothesis, 384 See also Alternative hypothesis Residual analysis assumption of correct functional form and, 568 constant variance assumption and, 567 Durbin-Watson test and, 573–574 independence assumption and, 571–573 in multiple regression, 642–646 normality assumption and, 568–569 residual plots and, 565–567 simple linear regression model and, 565–574 transforming the dependent variable and, 569–571 use of, 565 Residuals deleted, 641
studentized, 645 studentized deleted, 645 sum of squared, 595 Response bias, 32 Response variables, 6, 467 Reyes Holdings, 205 Right-hand tail area, 298, 299, 305, 395 Right-tailed critical value rule, 395 Rihanna, 72 Ringold, D J., 100, 100n Riordan, Edward A., 369 Risk decision making under, 811–812 explanation of, 810 Risk averter’s curve, 824 Risk neutral’s curve, 824 Risk seeker’s curve, 777, 824 Ritz Carlton Hotels, 730–731 Romig, Harold G., 728 Roosevelt, Franklin D., 18 Row percentages, 84, 514 Row totals, 83, 514 Rudd, Ralph, 362 Runs plot See Time series plot S Sacramento Kings, 144 St Jude Children’s Research Hospital, 101 St Louis Cardinals, 4, 73 Salford Systems, 180 Salomon Brothers, 102 Salvation Army, 101 Samaritan’s Purse, 101 Sample block means, 479 Sample correlation coefficient, 163–164 Sample covariance explanation of, 161–162 interpretation of, 162–164 Sample frame, 31 Sample means See also Means derivation of, 342–343 explanation of, 136 for grouped data, 168 as minimum-variance unbiased point estimate of μ, 337 sampling distribution of, 327–331, 348 variance of, 342–343 Sample of measurements, Sample proportion explanation of, 174–176 sampling distribution of, 339–340 Sample size for confidence intervals, 364–366, 369–372 explanation of, 10, 136, 221 needed to achieve specified values, 415–416 Type II error probabilities and, 411–416 Sample space, 226 Sample space outcomes, 221–223 Samples/sampling cluster, 27, 28 convenience, 18 expected payoff of, 820 explanation of, improper, 19 judgment, 18 probability, 18 random, 9–16, 27 with replacement, 10 stratified random, 27 systematic, 27, 28 voluntary response, 18 without replacement, 10 Sample standard deviation, 147 Sample statistic explanation of, 136 sampling distribution of, 336–337 Sample surveys, 27 Sample treatment means, 479, 485 Sample variance, 147–148 Sampling designs, 27 Sampling distributions Central Limit Theorem and, 334–335 derivation of mean and variance of sample mean and, 342–343 of F, 453 population mean and, 429–432 population proportions and, 446 of sample mean, 327–331, 348 of sample proportion, 339–340 of sample statistic, 336 unbiasedness and minimum-variance estimates and, 336–337 of x̄, 331–334 Sampling error, 16, 31–32, 137 San Antonio Spurs, 144 San Diego Padres, 4, 73 S&P 500, 171 Sanford, Wittels, and Heisler LLP, 239 San Francisco Giants, 4, 73 San Francisco Museum of Modern Art, 101 Save the Children Federation, 101 Scatter plots, 87–88, 531 Schaeffer, R L., 27, 28 s charts, 738 Scree plot, 195 Seasonal variation, 681, 684–689 Seattle Mariners, 4, 73 Selection bias, 32 Sharma, Subhash, 197 Shewhart, Walter, 728, 729 Shift parameter, 625 Sichelman, Lew, 372 Sigma level capability, 758, 759 Sign test, 780–783 Simple coefficient of determination, 543–546 Simple correlation coefficient, 546–547 Simple exponential smoothing, 699–703 Simple index, 713 Simple linear regression analysis in Excel, 553, 557, 559, 583–584 in MegaStat, 585–586 in Minitab, 553, 558, 587–588 Simple linear regression models, 531–534 assumptions of, 548–549 confidence and prediction intervals and, 559–562 F test for significance of slope and, 554–556 least squares point estimates and, 535–540 mean square error and standard error and, 549–550 point estimation and, 540 point prediction and, 540 regression parameters and, 534–535 residual analysis and, 565–574 significance of y-intercept and, 554 simple coefficient of determination and, 543–546 simple correlation coefficient and, 546–547 test of significance of population correlation coefficient and, 564 t test for significance of slope and, 551–554 Simpson, O J., 245 Sincich, Terry, 653 Single-hidden-layer, feedforward neural network, 654 Six sigma companies, 760 Six sigma philosophy, 760 Skewed to left, 67 Skewed to right, 67 Slope confidence interval for, 554 simple linear regression model and,
533 testing significance of, 551–556 Smith’s Department Stores, Inc., 683 Smoothing constant, 699, 704 Smoothing equation, 700, 702 Snell, E J., 665 Solomon, L., 496 Sound City, 169, 255–262 Southern Wine & Spirits, 205 Sparkline, 95 SPC See Statistical process control Spearman’s rank correlation coefficient, 564, 797–799 Spielberg, Steven, 72 Springsteen, Bruce, 72 Squared variables, 638, 639 Square root transformation, 569 Stamper, Joseph C., 615, 631 Standard and Poor’s 500 (S&P 500), 159 Standard deviation of binomial random variable, 272 constant, 754 of discrete random variable, 260 Poisson random variable and, 276–277 population, 146 process, 739 sample, 147 Standard error of estimate, 360, 551, 608 explanation of, 549–550 mean square error and, 604 model assumptions and, 603–604 Standardized normal quantile value, 316 Standardized value See z-scores State of nature, 809 Statesman Journal, 160 Statistical acceptance sampling, 728 Statistical inference explanation of, for population variance, 418–419 random samples and, 12, 13, 15–16 rare event approach, 270 sampling distribution and, 333–334 Statistical modeling, 9, 16 Statistical process control (SPC) See also Control charts explanation of, 731–732 quality variation and, 732–733 Statistical process monitoring, 737 Statistical quality control (SQC), 728, 729 Statistical significance, 392 Statistics See also Descriptive statistics Bayesian, 245 explanation of, traditional, Stem, 76 Stem-and-leaf displays back-to-back, 79 construction of, 78–79 example of, 76–78 explanation of, 76 Step Up for Students, 101 Stepwise regression, 636–638 Stratified random samples, 27 Studentized deleted residuals, 645 Studentized residuals, 645 Subgroups explanation of, 734 rational, 735–737 size of, 735 Subjective probability, 223 Subway, 531 Summation notation, 136 Sum of squared residuals, 595 Sums of squares, 469, 480 Sunshine Pools Inc., 205 Supervised learning, 24–25 Surveys
dichotomous, 29 errors in, 30–32 explanation of, mail, 30 mall, 30 multiple-choice, 29 open-ended, 29 phone, 30 questions on, 29–30 types of, 30 Swift, Taylor, 72 Symmetrical distributions, 68 Systematic samples, 28 T Tabular methods in Excel, 103–120 in MegaStat, 121–124 in Minitab, 125–133 Tampa Bay Rays, 4, 73 Target population, 30 Task Force for Global Health, 101 Tastee Bakery Company, 466–467, 485, 663 Tasty Sub Shop, 531–540, 543–547, 549–555, 560–562, 564, 591–596, 602–606, 609–610, 612–613, 647 t-based confidence intervals, 355–361, 431, 434 t distribution, 355–358, 402 Terminal nodes, 174 Test statistic explanation of, 387 legal system analogy and, 387 value of, 396, 397 Texas Rangers, 4, 73 Thompson, C M., 452 3M, 731 Time series analysis in Excel, 722 in MegaStat, 723–724 in Minitab, 725 Time series data, 5, 571 Time series forecasting components and model of, 681–682 forecast error comparisons and, 712–713 Holt-Winters’ double exponential smoothing and, 704–707 index numbers and, 713–717 multiplicative decomposition and, 691–698 multiplicative Winters’ method and, 707–711 seasonal components and, 684–689 simple exponential smoothing and, 699–703 Time series plots, 5, 23, 88 Time series regression modeling seasonal components and, 684–689 modeling trend components and, 682–684 Time Warner Cable, 228n2 Tobacco Institute, 100 Tolerance intervals confidence intervals vs., 361–362 estimated, 150 explanation of, 148–149 Toronto Blue Jays, 4, 73 Toronto Raptors, 144 Total quality control (TQC), 729 Total quality management (TQM), 729 Total sum of squares (SSTO), 480, 487 Total variation, 545 Toys “R” Us, 205 Traditional statistics, Training data set, 179, 640 Trammo, 205 Traveler’s Rest, Inc., 687 Treatments, 467 Treatment sum of squares (SST), 470, 471, 480 Tree diagrams See also Decision trees experiments in, 224, 225 showing clustering, 186 Treemaps, 94–95 Trend, 681
Trial control limits, 742 t table, 356, 357 t test population mean and, 402–405, 433 for significance of slope, 551–554 Tukey formula, 473–475 Two-factor factorial experiment, 486–487 Two-sided alternative hypothesis, 386 Two-sided confidence intervals, 400 Two-way ANOVA for analyzing data from two-factor factorial experiment, 487 confidence intervals and, 490–491 explanation of, 485–489 point estimates and, 490–491 Two-way ANOVA table, 488 Two-way cross-classification table See Contingency table Type I errors, 387–388, 392, 554 Type II errors calculating probability of, 414 explanation of, 387–388 sample size determination and, 411–416 U UC San Diego, 180 Unbiased point estimate, 229–230, 336, 337 Uncertainty, 810–811 Uncorrelated factors, 193 Undercoverage, 31 Unexplained variation, 545 Uniform distribution, 261, 291–293 Union of Concerned Scientists’ Clean Vehicle Program, 14 Union of Japanese Scientists and Engineers (JUSE), 729 United Medicine, Inc., 283 United Motors, 309 United Oil Company, 630–631, 664 United States Fund for UNICEF, 101 United States Golf Association, 273 United Way, 101 Universal Paper Company, 479 University of Chrysler/Jeep of Oxford, Ohio, 327 Unsupervised learning, 25 Upper limit, 157 U.S Bureau of Labor Statistics, 713, 716 U.S Bureau of the Census, 716 U.S Census Bureau, 6, 36, 229 U.S Department of Transportation, 252 U.S Energy Information Administration, U.S Open Tennis Tournament, 184 U.S War Department, 728 USA Today, 363 USA Weekend, 21 US Foods, 205 Utah Jazz, 144 Utilities, 813, 824 Utility curve, 824 V VALIC Investment Digest, 102 Validation data set, 179, 640, 656 Variable Annuity Life Insurance Company, 102, 203–204 Variables See also Dependent variables; Independent variables; Qualitative variables; Quantitative variables; Random variables dummy, 173, 616–622, 648, 656 explanation of, interaction, 627–630, 638, 639 interval, 25–26 nominative, 26 ordinal, 26 random, 223 ratio, 25 response, 6,
467 squared, 638, 639 Variance See also Analysis of variance (ANOVA) of binomial random variable, 272 of discrete random variable, 260 Poisson random variable, 276–277 population, 146 sample, 147–148 of sample mean, 342–343 Variance inflation factors (VIF), 632–633 Variation Chebyshev’s Theorem and, 151–152 coefficient of, 152–153 common cause, 754 Empirical Rule and, 148–151 explained, 545 explanation of, 145 measures of, 145–153 range, variance and standard deviation and, 145–148 skewness and, 151 total, 545 unexplained, 545 Varimax rotation, 195 Venn diagrams, 232, 233 Vertical bar charts, 57 Visa, 241 VISA, 234–235 Voluntary response samples, 18 Von Neumann, J., 823 W Wainer, Howard, 91, 92s Walsh, Bryan, 6, 14n5 Walt Disney World Orlando, 22–23 Walt Disney World Parks and Resorts, 3, 7–8 Walters, R G., 378 Walton 1986, 730 Washington Nationals, 4, 73 Washington Post, 372 Washington Wizards, 144 Weighted aggregate price index, 715 Weighted mean, 166–169 Weinberger, Daniel R., 252 West, Kanye, 72 Western Electric (AT&T), 728, 747 Western Steakhouses, 576–577 Westinghouse Electric Corporation, 730–731 Wheelwright, S C., 576 Whiskers, 158 Whitehurst, Kevin, 11, 11n3 Whole Foods, 363, 367 Wilcoxon rank sum test, 789–792 20/11/15 4:17 pm www.freebookslides.com 890 Wilcoxon signed ranks table, 789–792 Wilcoxon signed ranks test, 443 William’s Apparel, 659 Will’s Uptown Pizza, 55, 56, 58 Wilson, Edwin, 367n Winfrey, Oprah, 72 Winter Olympics (19th), 341 Winters’ method, 689 Within-treatment variability, 469 Wonnacott, Helen, 564 Wonnacott, Tim, 564 Woodruff, Robert B., 615, 631 Woods, Tiger, 72 World Vision, 101 X Index Y y intercept explanation of, 625 simple linear regression model and, 533 testing significance of, 554 YMCA of the UDa, 101 Z Zagat restaurant, 29 z-scores, 152–153 z tests general procedure for, 395–396 population mean and, 395–396 population proportion and, 406–409 x charts, 738–749 Xerox Corporation Business Products and Systems, 730–731 XLMiner, 
clustering, 180, 182, 183, 192 bow21493_ind_875_892.indd 890 20/11/15 4:17 pm ... PREVIEW 1.4 Business Statistics in Practice: Using Data, Modeling, and Analytics, Eighth Edition, provides a unique and flexible framework for teaching the introductory course in business statistics. .. traditional statistics? ? ?business analytics and data mining—have been developed to help analyze big data In optional Section 1.5 we will begin to discuss business analytics and data mining As one... Introduction to Business Statistics and? ?Analytics 1.1 ■? ?Data? ? 3 1.2 ■? ?Data Sources, Data Warehousing, and Big Data? ? 6 1.3 ■ Populations, Samples, and Traditional Statistics? ? 8 1.4 ■ Random Sampling, Three