PROBABILITY AND STATISTICS FOR ENGINEERS AND SCIENTISTS
Third Edition
AP and anyone who has been involved in the creation or production of the accompanying code ("the product") cannot and do not warrant the performance or results that may be obtained by using the product. The product is sold "as is" without warranty of merchantability or fitness for any particular purpose. AP warrants only that the magnetic diskette(s) on which the code is recorded is free from defects in material and faulty workmanship under the normal use and service for a period of ninety (90) days from the date the product is delivered. The purchaser's sole and exclusive remedy in the event of a defect is expressly limited to either replacement of the diskette(s) or refund of the purchase price, at AP's sole discretion.

In no event, whether as a result of breach of contract, warranty, or tort (including negligence), will AP or anyone who has been involved in the creation or production of the product be liable to purchaser for any damages, including any lost profits, lost savings, or other incidental or consequential damages arising out of the use or inability to use the product or any modifications thereof, or due to the contents of the code, even if AP has been advised of the possibility of such damages, or for any claim by any other party.

Any request for replacement of a defective diskette must be postage prepaid and must be accompanied by the original defective diskette, your mailing address and telephone number, and proof of date of purchase and purchase price. Send such requests, stating the nature of the problem, to Academic Press Customer Service, 6277 Sea Harbor Drive, Orlando, FL 32887, 1-800-321-5068. AP shall have no obligation to refund the purchase price or to replace a diskette based on claims of defects in the nature or operation of the product.

Some states do not allow limitation on how long an implied warranty lasts, nor exclusions or limitations of incidental or consequential damage, so the above limitations and exclusions may not apply to you. This warranty gives you specific legal rights, and you may also have other rights, which vary from jurisdiction to jurisdiction.

The re-export of United States original software is subject to the United States laws under the Export Administration Act of 1969 as amended. Any further sale of the product shall be in compliance with the United States Department of Commerce Administration regulations. Compliance with such regulations is your responsibility and not the responsibility of AP.
PROBABILITY AND STATISTICS FOR ENGINEERS AND SCIENTISTS
■ Third Edition ■
Sheldon M. Ross
Department of Industrial Engineering and Operations Research
University of California, Berkeley
Amsterdam Boston Heidelberg London New York Oxford Paris San Diego San Francisco Singapore Sydney Tokyo
84 Theobald's Road, London WC1X 8RR, UK
This book is printed on acid-free paper.
Copyright © 2004, Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any
means, electronic or mechanical, including photocopy, recording, or any information
storage and retrieval system, without permission in writing from the publisher.
Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.com.uk. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com), by selecting
“Customer Support” and then “Obtaining Permissions.”
Library of Congress Cataloging-in-Publication Data
Application submitted
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN: 0-12-598057-4 (Text)
ISBN: 0-12-598059-0 (CD-ROM)
For all information on all Academic Press publications
visit our Web site at www.academicpress.com
Printed in the United States of America
04 05 06 07 08 09 9 8 7 6 5 4 3 2 1
CONTENTS

Preface xiii
Chapter 1 Introduction to Statistics 1
1.1 Introduction 1
1.2 Data Collection and Descriptive Statistics 1
1.3 Inferential Statistics and Probability Models 2
1.4 Populations and Samples 3
1.5 A Brief History of Statistics 3
Problems 7
Chapter 2 Descriptive Statistics 9
2.1 Introduction 9
2.2 Describing Data Sets 9
2.2.1 Frequency Tables and Graphs 10
2.2.2 Relative Frequency Tables and Graphs 10
2.2.3 Grouped Data, Histograms, Ogives, and Stem and Leaf Plots 14
2.3 Summarizing Data Sets 17
2.3.1 Sample Mean, Sample Median, and Sample Mode 17
2.3.2 Sample Variance and Sample Standard Deviation 22
2.3.3 Sample Percentiles and Box Plots 24
2.4 Chebyshev’s Inequality 27
2.5 Normal Data Sets 31
2.6 Paired Data Sets and the Sample Correlation Coefficient 33
Problems 41
Chapter 3 Elements of Probability 55
3.1 Introduction 55
3.2 Sample Space and Events 56
3.3 Venn Diagrams and the Algebra of Events 58
3.4 Axioms of Probability 59
3.5 Sample Spaces Having Equally Likely Outcomes 61
3.6 Conditional Probability 67
3.7 Bayes’ Formula 70
3.8 Independent Events 76
Problems 80
Chapter 4 Random Variables and Expectation 89
4.1 Random Variables 89
4.2 Types of Random Variables 92
4.3 Jointly Distributed Random Variables 95
4.3.1 Independent Random Variables 101
*4.3.2 Conditional Distributions 105
4.4 Expectation 107
4.5 Properties of the Expected Value 111
4.5.1 Expected Value of Sums of Random Variables 115
4.6 Variance 118
4.7 Covariance and Variance of Sums of Random Variables 121
4.8 Moment Generating Functions 126
4.9 Chebyshev’s Inequality and the Weak Law of Large Numbers 127
Problems 130
Chapter 5 Special Random Variables 141
5.1 The Bernoulli and Binomial Random Variables 141
5.1.1 Computing the Binomial Distribution Function 147
5.2 The Poisson Random Variable 148
5.2.1 Computing the Poisson Distribution Function 155
5.3 The Hypergeometric Random Variable 156
5.4 The Uniform Random Variable 160
5.5 Normal Random Variables 168
5.6 Exponential Random Variables 175
*5.6.1 The Poisson Process 179
*5.7 The Gamma Distribution 182
5.8 Distributions Arising from the Normal 185
5.8.1 The Chi-Square Distribution 185
*5.8.1.1 The Relation Between Chi-Square and Gamma Random Variables 187
5.8.2 The t-Distribution 189
5.8.3 The F-Distribution 191
*5.9 The Logistics Distribution 192
Problems 194
Chapter 6 Distributions of Sampling Statistics 201
6.1 Introduction 201
6.2 The Sample Mean 202
6.3 The Central Limit Theorem 204
6.3.1 Approximate Distribution of the Sample Mean 210
6.3.2 How Large a Sample is Needed? 212
6.4 The Sample Variance 213
6.5 Sampling Distributions from a Normal Population 214
6.5.1 Distribution of the Sample Mean 215
6.5.2 Joint Distribution of X̄ and S² 215
6.6 Sampling from a Finite Population 217
Problems 221
Chapter 7 Parameter Estimation 229
7.1 Introduction 229
7.2 Maximum Likelihood Estimators 230
*7.2.1 Estimating Life Distributions 238
7.3 Interval Estimates 240
7.3.1 Confidence Interval for a Normal Mean When the Variance is Unknown 246
7.3.2 Confidence Intervals for the Variances of a Normal Distribution 251
7.4 Estimating the Difference in Means of Two Normal Populations 253
7.5 Approximate Confidence Interval for the Mean of a Bernoulli Random Variable 260
*7.6 Confidence Interval of the Mean of the Exponential Distribution 265
*7.7 Evaluating a Point Estimator 266
*7.8 The Bayes Estimator 272
Problems 277
Chapter 8 Hypothesis Testing 291
8.1 Introduction 291
8.2 Significance Levels 292
8.3 Tests Concerning the Mean of a Normal Population 293
8.3.1 Case of Known Variance 293
8.3.2 Case of Unknown Variance: The t-Test 305
8.4 Testing the Equality of Means of Two Normal Populations 312
8.4.1 Case of Known Variances 312
8.4.2 Case of Unknown Variances 314
8.4.3 Case of Unknown and Unequal Variances 318
8.4.4 The Paired t-Test 319
8.5 Hypothesis Tests Concerning the Variance of a Normal Population 321
8.5.1 Testing for the Equality of Variances of Two Normal Populations 322
8.6 Hypothesis Tests in Bernoulli Populations 323
8.6.1 Testing the Equality of Parameters in Two Bernoulli Populations 327
8.7 Tests Concerning the Mean of a Poisson Distribution 330
8.7.1 Testing the Relationship Between Two Poisson Parameters 331
Problems 334
Chapter 9 Regression 351
9.1 Introduction 351
9.2 Least Squares Estimators of the Regression Parameters 353
9.3 Distribution of the Estimators 355
9.4 Statistical Inferences about the Regression Parameters 361
9.4.1 Inferences Concerning β 362
9.4.1.1 Regression to the Mean 366
9.4.2 Inferences Concerning α 370
9.4.3 Inferences Concerning the Mean Response α + βx0 371
9.4.4 Prediction Interval of a Future Response 373
9.4.5 Summary of Distributional Results 375
9.5 The Coefficient of Determination and the Sample Correlation Coefficient 376
9.6 Analysis of Residuals: Assessing the Model 378
9.7 Transforming to Linearity 381
9.8 Weighted Least Squares 384
9.9 Polynomial Regression 391
*9.10 Multiple Linear Regression 394
9.10.1 Predicting Future Responses 405
9.11 Logistic Regression Models for Binary Output Data 410
Problems 413
Chapter 10 Analysis of Variance 439
10.1 Introduction 439
10.2 An Overview 440
10.3 One-Way Analysis of Variance 442
10.3.1 Multiple Comparisons of Sample Means 450
10.3.2 One-Way Analysis of Variance with Unequal Sample Sizes 452
10.4 Two-Factor Analysis of Variance: Introduction and Parameter Estimation 454
10.5 Two-Factor Analysis of Variance: Testing Hypotheses 458
10.6 Two-Way Analysis of Variance with Interaction 463
Problems 471
Chapter 11 Goodness of Fit Tests and Categorical Data Analysis 483
11.1 Introduction 483
11.2 Goodness of Fit Tests When all Parameters are Specified 484
11.2.1 Determining the Critical Region by Simulation 490
11.3 Goodness of Fit Tests When Some Parameters are Unspecified 493
11.4 Tests of Independence in Contingency Tables 495
11.5 Tests of Independence in Contingency Tables Having Fixed Marginal Totals 499
*11.6 The Kolmogorov–Smirnov Goodness of Fit Test for Continuous Data 504
Problems 508
Chapter 12 Nonparametric Hypothesis Tests 515
12.1 Introduction 515
12.2 The Sign Test 515
12.3 The Signed Rank Test 519
12.4 The Two-Sample Problem 525
12.4.1 The Classical Approximation and Simulation 529
12.5 The Runs Test for Randomness 533
Problems 537
Chapter 13 Quality Control 545
13.1 Introduction 545
13.2 Control Charts for Average Values: The X̄-Control Chart 546
13.2.1 Case of Unknown µ and σ 549
13.3 S-Control Charts 554
13.4 Control Charts for the Fraction Defective 557
13.5 Control Charts for Number of Defects 559
13.6 Other Control Charts for Detecting Changes in the Population Mean 563
13.6.1 Moving-Average Control Charts 563
13.6.2 Exponentially Weighted Moving-Average Control Charts 565
13.6.3 Cumulative Sum Control Charts 571
Problems 573
Chapter 14* Life Testing 581
14.1 Introduction 581
14.2 Hazard Rate Functions 581
14.3 The Exponential Distribution in Life Testing 584
14.3.1 Simultaneous Testing — Stopping at the rth Failure 584
14.3.2 Sequential Testing 590
14.3.3 Simultaneous Testing — Stopping by a Fixed Time 594
14.3.4 The Bayesian Approach 596
14.4 A Two-Sample Problem 598
14.5 The Weibull Distribution in Life Testing 600
14.5.1 Parameter Estimation by Least Squares 602
Problems 604
Appendix of Tables 611
Index 617
* Denotes optional material.
PREFACE

The third edition of this book continues to demonstrate how to apply probability theory to gain insight into real, everyday statistical problems and situations. As in the previous editions, carefully developed coverage of probability motivates probabilistic models of real phenomena and the statistical procedures that follow. This approach ultimately results in an intuitive understanding of statistical procedures and strategies most often used by practicing engineers and scientists.

This book has been written for an introductory course in statistics, or in probability and statistics, for students in engineering, computer science, mathematics, statistics, and the natural sciences. As such it assumes knowledge of elementary calculus.
ORGANIZATION AND COVERAGE
Chapter 1 presents a brief introduction to statistics, presenting its two branches of descriptive and inferential statistics, and a short history of the subject and some of the people whose early work provided a foundation for work done today.

The subject matter of descriptive statistics is then considered in Chapter 2. Graphs and tables that describe a data set are presented in this chapter, as are quantities that are used to summarize certain of the key properties of the data set.
To be able to draw conclusions from data, it is necessary to have an understanding of the data's origination. For instance, it is often assumed that the data constitute a "random sample" from some population. To understand exactly what this means and what its consequences are for relating properties of the sample data to properties of the entire population, it is necessary to have some understanding of probability, and that is the subject of Chapter 3. This chapter introduces the idea of a probability experiment, explains the concept of the probability of an event, and presents the axioms of probability.
Our study of probability is continued in Chapter 4, which deals with the important concepts of random variables and expectation, and in Chapter 5, which considers some special types of random variables that often occur in applications. Such random variables as the binomial, Poisson, hypergeometric, normal, uniform, gamma, chi-square, t, and F are presented.
In Chapter 6, we study the probability distribution of such sampling statistics as the sample mean and the sample variance. We show how to use a remarkable theoretical result of probability, known as the central limit theorem, to approximate the probability distribution of the sample mean. In addition, we present the joint probability distribution of the sample mean and the sample variance in the important special case in which the underlying data come from a normally distributed population.

Chapter 7 shows how to use data to estimate parameters of interest. For instance, a scientist might be interested in determining the proportion of Midwestern lakes that are afflicted by acid rain. Two types of estimators are studied. The first of these estimates the quantity of interest with a single number (for instance, it might estimate that 47 percent of Midwestern lakes suffer from acid rain), whereas the second provides an estimate in the form of an interval of values (for instance, it might estimate that between 45 and 49 percent of lakes suffer from acid rain). These latter estimators also tell us the "level of confidence" we can have in their validity. Thus, for instance, whereas we can be pretty certain that the exact percentage of afflicted lakes is not 47, it might very well be that we can be, say, 95 percent confident that the actual percentage is between 45 and 49.
Chapter 8 introduces the important topic of statistical hypothesis testing, which is concerned with using data to test the plausibility of a specified hypothesis. For instance, such a test might reject the hypothesis that fewer than 44 percent of Midwestern lakes are afflicted by acid rain. The concept of the p-value, which measures the degree of plausibility of the hypothesis after the data have been observed, is introduced. A variety of hypothesis tests concerning the parameters of both one and two normal populations are considered. Hypothesis tests concerning Bernoulli and Poisson parameters are also presented.
Chapter 9 deals with the important topic of regression. Both simple linear regression — including such subtopics as regression to the mean, residual analysis, and weighted least squares — and multiple linear regression are considered.
Chapter 10 introduces the analysis of variance. Both one-way and two-way (with and without the possibility of interaction) problems are considered.
Chapter 11 is concerned with goodness of fit tests, which can be used to test whether a proposed model is consistent with data. In it we present the classical chi-square goodness of fit test and apply it to test for independence in contingency tables. The final section of this chapter introduces the Kolmogorov–Smirnov procedure for testing whether data come from a specified continuous probability distribution.
Chapter 12 deals with nonparametric hypothesis tests, which can be used when one is unable to suppose that the underlying distribution has some specified parametric form (such as normal).
Chapter 13 considers the subject matter of quality control, a key statistical technique in manufacturing and production processes. A variety of control charts, including not only the Shewhart control charts but also more sophisticated ones based on moving averages and cumulative sums, are considered.
Chapter 14 deals with problems related to life testing. In this chapter, the exponential, rather than the normal, distribution plays the key role.
NEW TO THIS EDITION
New exercises and real data examples have been added throughout, including:
• The One-sided Chebyshev Inequality for Data (Section 2.4)
• The Logistics Distribution and Logistic Regression (Sections 5.9 and 9.11)
• Estimation and Testing in proofreader problems (Examples 7.2B and 8.7g)
• Product Form Estimates of Life Distributions (Section 7.2.1)
• Observational Studies (Example 8.6e)
About the CD
Packaged along with the text is a PC disk that can be used to solve most of the statistical problems in the text. For instance, the disk computes the p-values for most of the hypothesis tests, including those related to the analysis of variance and to regression. It can also be used to obtain probabilities for most of the common distributions. (For those students without access to a personal computer, tables that can be used to solve all of the problems in the text are provided.)
One program on the disk illustrates the central limit theorem. It considers random variables that take on one of the values 0, 1, 2, 3, 4, and allows the user to enter the probabilities for these values along with an integer n. The program then plots the probability mass function of the sum of n independent random variables having this distribution. By increasing n, one can "see" the mass function converge to the shape of a normal density function.
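A reader can reproduce the disk's demonstration with a few lines of code. The sketch below is not the program shipped with the book (whose source is not given here); it convolves an assumed probability mass function on the values 0 through 4 with itself n times and prints a crude text plot of the mass function of the sum, which visibly approaches a bell shape as n grows.

```python
import numpy as np

def pmf_of_sum(p, n):
    """Probability mass function of the sum of n i.i.d. random variables,
    each taking the values 0, 1, ..., len(p)-1 with probabilities p."""
    dist = np.array([1.0])            # pmf of an empty sum: point mass at 0
    for _ in range(n):
        dist = np.convolve(dist, p)   # add one more independent variable
    return dist

# Hypothetical probabilities for the values 0-4 (user-supplied in the disk's
# program; these particular numbers are made up for the illustration).
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

for n in (1, 5, 30):
    dist = pmf_of_sum(p, n)
    peak = dist.max()
    print(f"\nn = {n}")
    for k, prob in enumerate(dist):
        print(f"{k:3d} {'*' * int(50 * prob / peak)}")
```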
ACKNOWLEDGEMENTS
We thank the following people for their helpful comments on the Third Edition:
• Charles F. Dunkl, University of Virginia, Charlottesville
• Gabor Szekely, Bowling Green State University
• Krzysztof M. Ostaszewski, Illinois State University
• Micael Ratliff, Northern Arizona University
• Wei-Min Huang, Lehigh University
• Youngho Lee, Howard University
• Jacques Rioux, Drake University
• Lisa Gardner, Bradley University
• Murray Lieb, New Jersey Institute of Technology
• Philip Trotter, Cornell University
1.2 DATA COLLECTION AND DESCRIPTIVE STATISTICS
Sometimes a statistical analysis begins with a given set of data: For instance, the government regularly collects and publicizes data concerning yearly precipitation totals, earthquake occurrences, the unemployment rate, the gross domestic product, and the rate of inflation. Statistics can be used to describe, summarize, and analyze these data.
In other situations, data are not yet available; in such cases statistical theory can be used to design an appropriate experiment to generate data. The experiment chosen should depend on the use that one wants to make of the data. For instance, suppose that an instructor is interested in determining which of two different methods for teaching computer programming to beginners is most effective. To study this question, the instructor might divide the students into two groups, and use a different teaching method for each group. At the end of the class the students can be tested and the scores of the members of the different groups compared. If the data, consisting of the test scores of members of each group, are significantly higher in one of the groups, then it might seem reasonable to suppose that the teaching method used for that group is superior.
It is important to note, however, that in order to be able to draw a valid conclusion from the data, it is essential that the students were divided into groups in such a manner that neither group was more likely to have the students with greater natural aptitude for programming. For instance, the instructor should not have let the male class members be one group and the females the other. For if so, then even if the women scored significantly higher than the men, it would not be clear whether this was due to the method used to teach them, or to the fact that women may be inherently better than men at learning programming skills. The accepted way of avoiding this pitfall is to divide the class members into the two groups "at random." This term means that the division is done in such a manner that all possible choices of the members of a group are equally likely.
At the end of the experiment, the data should be described. For instance, the scores of the two groups should be presented. In addition, summary measures such as the average score of members of each of the groups should be presented. This part of statistics, concerned with the description and summarization of data, is called descriptive statistics.

1.3 INFERENTIAL STATISTICS AND PROBABILITY MODELS
After the preceding experiment is completed and the data are described and summarized, we hope to be able to draw a conclusion about which teaching method is superior. This part of statistics, concerned with the drawing of conclusions, is called inferential statistics.

To be able to draw a conclusion from the data, we must take into account the possibility of chance. For instance, suppose that the average score of members of the first group is quite a bit higher than that of the second. Can we conclude that this increase is due to the teaching method used? Or is it possible that the teaching method was not responsible for the increased scores but rather that the higher scores of the first group were just a chance occurrence? For instance, the fact that a coin comes up heads 7 times in 10 flips does not necessarily mean that the coin is more likely to come up heads than tails in future flips. Indeed, it could be a perfectly ordinary coin that, by chance, just happened to land heads 7 times out of the total of 10 flips. (On the other hand, if the coin had landed heads 47 times out of 50 flips, then we would be quite certain that it was not an ordinary coin.)
To be able to draw logical conclusions from data, we usually make some assumptions about the chances (or probabilities) of obtaining the different data values. The totality of these assumptions is referred to as a probability model for the data.
Sometimes the nature of the data suggests the form of the probability model that is assumed. For instance, suppose that an engineer wants to find out what proportion of computer chips, produced by a new method, will be defective. The engineer might select a group of these chips, with the resulting data being the number of defective chips in this group. Provided that the chips selected were "randomly" chosen, it is reasonable to suppose that each one of them is defective with probability p, where p is the unknown proportion of all the chips produced by the new method that will be defective. The resulting data can then be used to make inferences about p.
In other situations, the appropriate probability model for a given data set will not be readily apparent. However, careful description and presentation of the data sometimes enable us to infer a reasonable model, which we can then try to verify with the use of additional data.

Because the basis of statistical inference is the formulation of a probability model to describe the data, an understanding of statistical inference requires some knowledge of the theory of probability. In other words, statistical inference starts with the assumption that important aspects of the phenomenon under study can be described in terms of probabilities; it then draws conclusions by using data to make inferences about these probabilities.
1.4 POPULATIONS AND SAMPLES
In statistics, we are interested in obtaining information about a total collection of elements, which we will refer to as the population. The population is often too large for us to examine each of its members. For instance, we might have all the residents of a given state, or all the television sets produced in the last year by a particular manufacturer, or all the households in a given community. In such cases, we try to learn about the population by choosing and then examining a subgroup of its elements. This subgroup of a population is called a sample.
If the sample is to be informative about the total population, it must be, in some sense, representative of that population. For instance, suppose that we are interested in learning about the age distribution of people residing in a given city, and we obtain the ages of the first 100 people to enter the town library. If the average age of these 100 people is 46.2 years, are we justified in concluding that this is approximately the average age of the entire population? Probably not, for we could certainly argue that the sample chosen in this case is probably not representative of the total population because usually more young students and senior citizens use the library than do working-age citizens.
In certain situations, such as the library illustration, we are presented with a sample and must then decide whether this sample is reasonably representative of the entire population. In practice, a given sample generally cannot be assumed to be representative of a population unless that sample has been chosen in a random manner. This is because any specific nonrandom rule for selecting a sample often results in one that is inherently biased toward some data values as opposed to others.
Thus, although it may seem paradoxical, we are most likely to obtain a representative sample by choosing its members in a totally random fashion without any prior considerations of the elements that will be chosen. In other words, we need not attempt to deliberately choose the sample so that it contains, for instance, the same gender percentage and the same percentage of people in each profession as found in the general population. Rather, we should just leave it up to "chance" to obtain roughly the correct percentages. Once a random sample is chosen, we can use statistical inference to draw conclusions about the entire population by studying the elements of the sample.
1.5 A BRIEF HISTORY OF STATISTICS
A systematic collection of data on the population and the economy was begun in the Italian city states of Venice and Florence during the Renaissance. The term statistics, derived from the word state, was used to refer to a collection of facts of interest to the state. The idea of collecting data spread from Italy to the other countries of Western Europe. Indeed, by the first half of the 16th century it was common for European governments to require parishes to register births, marriages, and deaths. Because of poor public health conditions this last statistic was of particular interest.
The high mortality rate in Europe before the 19th century was due mainly to epidemic diseases, wars, and famines. Among epidemics, the worst were the plagues. Starting with the Black Plague in 1348, plagues recurred frequently for nearly 400 years. In 1562, as a way to alert the King's court to consider moving to the countryside, the City of London began to publish weekly bills of mortality. Initially these mortality bills listed the places of death and whether a death had resulted from plague. Beginning in 1625 the bills were expanded to include all causes of death.
In 1662 the English tradesman John Graunt published a book entitled Natural and Political Observations Made upon the Bills of Mortality. Table 1.1, which notes the total number of deaths in England and the number due to the plague for five different plague years, is taken from this book.
TABLE 1.1 Total Deaths in England

Source: John Graunt, Observations Made upon the Bills of Mortality, 3rd ed. London: John Martyn and James Allestry (1st ed. 1662).
Graunt used London bills of mortality to estimate the city's population. For instance, to estimate the population of London in 1660, Graunt surveyed households in certain London parishes (or neighborhoods) and discovered that, on average, there were approximately 3 deaths for every 88 people. Dividing by 3 shows that, on average, there was roughly 1 death for every 88/3 people. Because the London bills cited 13,200 deaths in London for that year, Graunt estimated the London population to be about

13,200 × 88/3 = 387,200

Graunt used this estimate to project a figure for all England. In his book he noted that these figures would be of interest to the rulers of the country, as indicators of both the number of men who could be drafted into an army and the number who could be taxed. Graunt also used the London bills of mortality — and some intelligent guesswork as to what diseases killed whom and at what age — to infer ages at death. (Recall that the bills of mortality listed only causes and places at death, not the ages of those dying.) Graunt then used this information to compute tables giving the proportion of the population that dies at various ages.
TABLE 1.2 John Graunt's Mortality Table

Age at Death | Number of Deaths per 100 Births
0–6 | 36
6–16 | 24
16–26 | 15
26–36 | 9
36–46 | 6
46–56 | 4
56–66 | 3
66–76 | 2
76 and greater | 1

Thus, for instance, of 100 births, 36 died before reaching age 6, 24 died between the ages of 6 and 15, and so on.
Graunt’s estimates of the ages at which people were dying were of great interest to those
in the business of selling annuities Annuities are the opposite of life insurance in that onepays in a lump sum as an investment and then receives regular payments for as long as onelives
Graunt’s work on mortality tables inspired further work by Edmund Halley in 1693.Halley, the discoverer of the comet bearing his name (and also the man who was mostresponsible, by both his encouragement and his financial support, for the publication ofIsaac Newton’s famous Principia Mathematica), used tables of mortality to compute theodds that a person of any age would live to any other particular age Halley was influential
in convincing the insurers of the time that an annual life insurance premium should depend
on the age of the person being insured
Following Graunt and Halley, the collection of data steadily increased throughout the remainder of the 17th and on into the 18th century. For instance, the city of Paris began collecting bills of mortality in 1667; and by 1730 it had become common practice throughout Europe to record ages at death.
The term statistics, which was used until the 18th century as a shorthand for the descriptive science of states, became in the 19th century increasingly identified with numbers. By the 1830s the term was almost universally regarded in Britain and France as being synonymous with the "numerical science" of society. This change in meaning was caused by the large availability of census records and other tabulations that began to be systematically collected and published by the governments of Western Europe and the United States beginning around 1800.
Throughout the 19th century, although probability theory had been developed by such mathematicians as Jacob Bernoulli, Karl Friedrich Gauss, and Pierre-Simon Laplace, its use in studying statistical findings was almost nonexistent, because most social statisticians at the time were content to let the data speak for themselves. In particular, statisticians of that time were not interested in drawing inferences about individuals, but rather were concerned with the society as a whole. Thus, they were not concerned with sampling but rather tried to obtain censuses of the entire population. As a result, probabilistic inference from samples to a population was almost unknown in 19th century social statistics.
It was not until the late 1800s that statistics became concerned with inferring conclusions from numerical data. The movement began with Francis Galton's work on analyzing hereditary genius through the uses of what we would now call regression and correlation analysis (see Chapter 9), and obtained much of its impetus from the work of Karl Pearson. Pearson, who developed the chi-square goodness of fit tests (see Chapter 11), was the first director of the Galton Laboratory, endowed by Francis Galton in 1904. There Pearson originated a research program aimed at developing new methods of using statistics in inference. His laboratory invited advanced students from science and industry to learn statistical methods that could then be applied in their fields. One of his earliest visiting researchers was W. S. Gosset, a chemist by training, who showed his devotion to Pearson by publishing his own works under the name "Student." (A famous story has it that Gosset was afraid to publish under his own name for fear that his employers, the Guinness brewery, would be unhappy to discover that one of its chemists was doing research in statistics.) Gosset is famous for his development of the t-test (see Chapter 8).
Two of the most important areas of applied statistics in the early 20th century were population biology and agriculture. This was due to the interest of Pearson and others at his laboratory and also to the remarkable accomplishments of the English scientist Ronald A. Fisher. The theory of inference developed by these pioneers, including among others Karl Pearson's son Egon and the Polish-born mathematical statistician Jerzy Neyman, was general enough to deal with a wide range of quantitative and practical problems. As a result, after the early years of the 20th century a rapidly increasing number of people in science, business, and government began to regard statistics as a tool that was able to provide quantitative solutions to scientific and practical problems (see Table 1.3).

TABLE 1.3 The Changing Definition of Statistics

Statistics has then for its object that of presenting a faithful representation of a state at a determined epoch. (Quetelet, 1849)

Statistics are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of man. (Galton, 1889)

Statistics may be regarded (i) as the study of populations, (ii) as the study of variation, and (iii) as the study of methods of the reduction of data. (Fisher, 1925)

Statistics is a scientific discipline concerned with collection, analysis, and interpretation of data obtained from observation or experiment. The subject has a coherent structure based on the theory of Probability and includes many different procedures which contribute to research and development throughout the whole of Science and Technology. (E. Pearson, 1936)

Statistics is the name for that science and art which deals with uncertain inferences — which uses numbers to find out something about nature and experience. (Weaver, 1952)

Statistics has become known in the 20th century as the mathematical tool for analyzing experimental and observational data. (Porter, 1986)

Statistics is the art of learning from data. (this book, 2004)

Nowadays the ideas of statistics are everywhere. Descriptive statistics are featured in every newspaper and magazine. Statistical inference has become indispensable to public health and medical research, to engineering and scientific studies, to marketing and quality control, to education, to accounting, to economics, to meteorological forecasting, to polling and surveys, to sports, to insurance, to gambling, and to all research that makes any claim to being scientific. Statistics has indeed become ingrained in our intellectual heritage.
Problems
1. An election will be held next week and, by polling a sample of the voting population, we are trying to predict whether the Republican or Democratic candidate will prevail. Which of the following methods of selection is likely to yield a representative sample?
(a) Poll all people of voting age attending a college basketball game.
(b) Poll all people of voting age leaving a fancy midtown restaurant.
(c) Obtain a copy of the voter registration list, randomly choose 100 names, and question them.
(d) Use the results of a television call-in poll, in which the station asked its listeners to call in and name their choice.
(e) Choose names from the telephone directory and call these people.
2. The approach used in Problem 1(e) led to a disastrous prediction in the 1936 presidential election, in which Franklin Roosevelt defeated Alfred Landon by a landslide. A Landon victory had been predicted by the Literary Digest. The magazine based its prediction on the preferences of a sample of voters chosen from lists of automobile and telephone owners.
(a) Why do you think the Literary Digest’s prediction was so far off?
(b) Has anything changed between 1936 and now that would make you believe
that the approach used by the Literary Digest would work better today?
3. A researcher is trying to discover the average age at death for people in the United States today. To obtain data, the obituary columns of the New York Times are read for 30 days, and the ages at death of people in the United States are noted. Do you think this approach will lead to a representative sample?
4. To determine the proportion of people in your town who are smokers, it has been decided to poll people at one of the following local spots:
(a) the pool hall;
(b) the bowling alley;
(c) the shopping mall;
$75,000
(a) Would the university be correct in thinking that $75,000 was a good approximation to the average salary level of all of its graduates? Explain the reasoning behind your answer.
(b) If your answer to part (a) is no, can you think of any set of conditions relating to the group that returned questionnaires for which it would be a good approximation?
6. An article reported that a survey of clothing worn by pedestrians killed at night in traffic accidents revealed that about 80 percent of the victims were wearing dark-colored clothing and 20 percent were wearing light-colored clothing. The conclusion drawn in the article was that it is safer to wear light-colored clothing at night.
(a) Is this conclusion justified? Explain.
(b) If your answer to part (a) is no, what other information would be needed before a final conclusion could be drawn?
7. Critique Graunt's method for estimating the population of London. What implicit assumption is he making?
8. The London bills of mortality listed 12,246 deaths in 1658. Supposing that a survey of London parishes showed that roughly 2 percent of the population died that year, use Graunt's method to estimate London's population in 1658.
9. Suppose you were a seller of annuities in 1662 when Graunt's book was published. Explain how you would make use of his data on the ages at which people were dying.
10. Based on Graunt’s mortality table:
(a) What proportion of people survived to age 6?
(b) What proportion survived to age 46?
(c) What proportion died between the ages of 6 and 36?
Chapter 2

DESCRIPTIVE STATISTICS
2.1 INTRODUCTION
In this chapter we introduce the subject matter of descriptive statistics, and in doing so learn ways to describe and summarize a set of data. Section 2.2 deals with ways of describing a data set. Subsections 2.2.1 and 2.2.2 indicate how data that take on only a relatively few distinct values can be described by using frequency tables or graphs, whereas Subsection 2.2.3 deals with data whose set of values is grouped into different intervals. Section 2.3 discusses ways of summarizing data sets by use of statistics, which are numerical quantities whose values are determined by the data. Subsection 2.3.1 considers three statistics that are used to indicate the "center" of the data set: the sample mean, the sample median, and the sample mode. Subsection 2.3.2 introduces the sample variance and its square root, called the sample standard deviation. These statistics are used to indicate the spread of the values in the data set. Subsection 2.3.3 deals with sample percentiles, which are statistics that tell us, for instance, which data value is greater than 95 percent of all the data. In Section 2.4 we present Chebyshev's inequality for sample data. This famous inequality gives a lower bound to the proportion of the data that can differ from the sample mean by more than k times the sample standard deviation. Whereas Chebyshev's inequality holds for all data sets, we can in certain situations, which are discussed in Section 2.5, obtain more precise estimates of the proportion of the data that is within k sample standard deviations of the sample mean. In Section 2.5 we note that when a graph of the data follows a bell-shaped form the data set is said to be approximately normal, and more precise estimates are given by the so-called empirical rule. Section 2.6 is concerned with situations in which the data consist of paired values. A graphical technique, called the scatter diagram, for presenting such data is introduced, as is the sample correlation coefficient, a statistic that indicates the degree to which a large value of the first member of the pair tends to go along with a large value of the second.
2.2 DESCRIBING DATA SETS
The numerical findings of a study should be presented clearly, concisely, and in such a manner that an observer can quickly obtain a feel for the essential characteristics of the data. Over the years it has been found that tables and graphs are particularly useful ways of presenting data, often revealing important features such as the range, the degree of concentration, and the symmetry of the data. In this section we present some common graphical and tabular ways for presenting data.
2.2.1 Frequency Tables and Graphs
A data set having a relatively small number of distinct values can be conveniently presented in a frequency table. For instance, Table 2.1 is a frequency table for a data set consisting of the starting yearly salaries (to the nearest thousand dollars) of 42 recently graduated students with B.S. degrees in electrical engineering. Table 2.1 tells us, among other things, that the lowest starting salary of $47,000 was received by four of the graduates, whereas the highest salary of $60,000 was received by a single student. The most common starting salary was $52,000, and was received by 10 of the students.

TABLE 2.1 Starting Yearly Salaries

Starting Salary | Frequency
Another type of graph used to represent a frequency table is the frequency polygon, which plots the frequencies of the different data values on the vertical axis, and then connects the plotted points with straight lines. Figure 2.3 presents a frequency polygon for the data of Table 2.1.
2.2.2 Relative Frequency Tables and Graphs
Consider a data set consisting of n values. If f is the frequency of a particular value, then the ratio f/n is called its relative frequency. That is, the relative frequency of a data value is the proportion of the data that have that value. The relative frequencies can be represented graphically by a relative frequency line or bar graph or by a relative frequency polygon. Indeed, these relative frequency graphs will look like the corresponding graphs of the absolute frequencies except that the labels on the vertical axis are now the old labels (that gave the frequencies) divided by the total number of data points.

FIGURE 2.3 Frequency polygon for starting salary data.
EXAMPLE 2.2a Table 2.2 is a relative frequency table for the data of Table 2.1. The relative frequencies are obtained by dividing the corresponding frequencies of Table 2.1 by 42, the size of the data set. ■
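For readers who would rather compute than tabulate by hand, the following sketch builds a frequency and relative frequency table from raw data. The salary values used are hypothetical stand-ins, since the full contents of Table 2.1 are not reproduced here.

```python
from collections import Counter

# Hypothetical raw data (stand-ins for the 42 starting salaries of Table 2.1).
data = [47, 47, 47, 47, 48, 49, 49, 52, 52, 52, 60]

n = len(data)
freq = Counter(data)  # maps each value to its frequency f

print("Value  Frequency  Relative frequency")
for value in sorted(freq):
    f = freq[value]
    print(f"{value:5d}  {f:9d}  {f / n:18.4f}")  # relative frequency is f/n
```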
A pie chart is often used to indicate relative frequencies when the data are not numerical in nature. A circle is constructed and then sliced into different sectors, one for each distinct type of data value. The relative frequency of a data value is indicated by the area of its sector, this area being equal to the total area of the circle multiplied by the relative frequency of the data value.
EXAMPLE 2.2b The following data relate to the different types of cancers affecting the 200 most recent patients to enroll at a clinic specializing in cancer. These data are represented in the pie chart presented in Figure 2.4. ■
FIGURE 2.4 A pie chart of the cancer data (sector labels: Prostate 27.5%, Breast 25%, Lung 21%, Colon 16%, Bladder 6%).
Type of Cancer | Number of New Cases | Relative Frequency
2.2.3 Grouped Data, Histograms, Ogives, and Stem and Leaf Plots
As seen in Subsection 2.2.2, using a line or a bar graph to plot the frequencies of data values is often an effective way of portraying a data set. However, for some data sets the number of distinct values is too large to utilize this approach. Instead, in such cases, it is useful to divide the values into groupings, or class intervals, and then plot the number of data values falling in each class interval. The number of class intervals chosen should be a trade-off between (1) choosing too few classes at a cost of losing too much information about the actual data values in a class and (2) choosing too many classes, which will result in the frequencies of each class being too small for a pattern to be discernible. Although 5 to 10 class intervals are typical, the appropriate number is a subjective choice, and of course, you can try different numbers of class intervals to see which of the resulting charts appears to be most revealing about the data. It is common, although not essential, to choose class intervals of equal length.

TABLE 2.3 Life in Hours of 200 Incandescent Lamps

The endpoints of a class interval are called the class boundaries. We will adopt the left-end inclusion convention, which stipulates that a class interval contains its left-end but not its right-end boundary point. Thus, for instance, the class interval 20–30 contains all values that are both greater than or equal to 20 and less than 30.

Table 2.3 presents the lifetimes of 200 incandescent lamps. A class frequency table for the data of Table 2.3 is presented in Table 2.4. The class intervals are of length 100, with the first one starting at 500.
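Under the left-end inclusion convention, producing such a class frequency table is mechanical. The sketch below bins a handful of hypothetical lamp lifetimes (stand-ins for the 200 values of Table 2.3) into class intervals of length 100 starting at 500.

```python
# Hypothetical lamp lifetimes in hours (stand-ins for the data of Table 2.3).
lifetimes = [521, 638, 715, 780, 806, 893, 937, 1004, 1113, 1282, 1378, 1456]

low, width, nclasses = 500, 100, 10   # intervals [500,600), [600,700), ..., [1400,1500)
counts = [0] * nclasses

for x in lifetimes:
    k = (x - low) // width            # index of the interval containing x
    if 0 <= k < nclasses:
        counts[k] += 1                # left-end inclusion: 600 falls in [600,700)

for k, c in enumerate(counts):
    a = low + k * width
    print(f"[{a}, {a + width})  {c}")
```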
TABLE 2.4 A Class Frequency Table

Class Interval | Frequency (Number of Data Values in the Interval)
A bar graph plot of class data, with the bars placed adjacent to each other, is called a histogram. The vertical axis of a histogram can represent either the class frequency or the relative class frequency; in the former case the graph is called a frequency histogram and in the latter a relative frequency histogram. Figure 2.5 presents a frequency histogram of the data in Table 2.4.

We are sometimes interested in plotting a cumulative frequency (or cumulative relative frequency) graph. A point on the horizontal axis of such a graph represents a possible data value; its corresponding vertical plot gives the number (or proportion) of the data whose values are less than or equal to it. A cumulative relative frequency plot of the data of Table 2.3 is given in Figure 2.6. We can conclude from this figure that 100 percent of the data values are less than 1,500, approximately 40 percent are less than or equal to 900, approximately 80 percent are less than or equal to 1,100, and so on. A cumulative frequency plot is called an ogive.

FIGURE 2.6 A cumulative frequency plot.
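An ogive is just a running sum of the class relative frequencies. The sketch below uses hypothetical class counts, chosen only so that the cumulative percentages roughly match those quoted above (about 40 percent by 900 and 80 percent by 1,100).

```python
# Hypothetical class counts for [500,600), [600,700), ..., [1400,1500),
# summing to 200 and roughly matching the percentages quoted in the text.
counts = [4, 12, 26, 38, 44, 36, 20, 12, 6, 2]
low, width = 500, 100
n = sum(counts)

cumulative = 0
for k, c in enumerate(counts):
    cumulative += c
    right_end = low + (k + 1) * width
    # Proportion of the data less than the interval's right endpoint.
    print(f"x = {right_end}: {cumulative / n:.0%} of the values are below x")
```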
An efficient way of organizing a small- to moderate-sized data set is to utilize a stem and leaf plot. Such a plot is obtained by first dividing each data value into two parts — its stem and its leaf. For instance, if the data are all two-digit numbers, then we could let the stem part of a data value be its tens digit and let the leaf be its ones digit. Thus, for instance, the value 62 is expressed as

Stem | Leaf
6 | 2
EXAMPLE 2.2c Table 2.5 gives the monthly and yearly average daily minimum temperatures in 35 U.S. cities.

TABLE 2.5 Normal Daily Minimum Temperature — Selected Cities

[In Fahrenheit degrees. Airport data except as noted. Based on standard 30-year period, 1961 through 1990.]

State | Station | Jan | Feb | Mar | Apr | May | June | July | Aug | Sept | Oct | Nov | Dec | Annual avg.
GA | Atlanta | 31.5 | 34.5 | 42.5 | 50.2 | 58.7 | 66.2 | 69.5 | 69.0 | 63.5 | 51.9 | 42.8 | 35.0 | 51.3
HI | Honolulu | 65.6 | 65.4 | 67.2 | 68.7 | 70.3 | 72.2 | 73.5 | 74.2 | 73.5 | 72.3 | 70.3 | 67.0 | 70.0
ID | Boise | 21.6 | 27.5 | 31.9 | 36.7 | 43.9 | 52.1 | 57.7 | 56.8 | 48.2 | 39.0 | 31.1 | 22.5 | 39.1
IL | Chicago | 12.9 | 17.2 | 28.5 | 38.6 | 47.7 | 57.5 | 62.6 | 61.6 | 53.9 | 42.2 | 31.6 | 19.1 | 39.5
IL | Peoria | 13.2 | 17.7 | 29.8 | 40.8 | 50.9 | 60.7 | 65.4 | 63.1 | 55.2 | 43.1 | 32.5 | 19.3 | 41.0
MN | Duluth | −2.2 | 2.8 | 15.7 | 28.9 | 39.6 | 48.5 | 55.1 | 53.3 | 44.5 | 35.1 | 21.5 | 4.9 | 29.0
MN | Minneapolis-St. Paul | 2.8 | 9.2 | 22.7 | 36.2 | 47.6 | 57.6 | 63.1 | 60.3 | 50.3 | 38.8 | 25.2 | 10.2 | 35.3

The annual average daily minimum temperatures from Table 2.5 are represented in the following stem and leaf plot.
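Such a plot is also easy to generate programmatically. The sketch below separates each value into its tens-digit stem and remaining leaf; the input values are a few of the annual averages from Table 2.5, used here only as sample data.

```python
from collections import defaultdict

# A few of the annual average minimum temperatures from Table 2.5,
# used as sample input (the full 35-city data set is not reproduced here).
temps = [51.3, 70.0, 39.1, 39.5, 41.0, 29.0, 35.3]

leaves = defaultdict(list)
for t in temps:
    stem = int(t // 10)              # tens digit is the stem
    leaf = round(t - 10 * stem, 1)   # remaining units and tenths form the leaf
    leaves[stem].append(leaf)

for stem in sorted(leaves):
    print(f"{stem} | {' '.join(str(leaf) for leaf in sorted(leaves[stem]))}")
# e.g. the row "3 | 5.3 9.1 9.5" collects the values 35.3, 39.1, and 39.5
```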
2.3 SUMMARIZING DATA SETS
Modern-day experiments often deal with huge sets of data. For instance, in an attempt to learn about the health consequences of certain common practices, in 1951 the medical statisticians R. Doll and A. B. Hill sent questionnaires to all doctors in the United Kingdom and received approximately 40,000 replies. Their questions dealt with age, eating habits, and smoking habits. The respondents were then tracked for the ensuing 10 years and the causes of death for those who died were monitored. To obtain a feel for such a large amount of data, it is useful to be able to summarize it by some suitably chosen measures. In this section we present some summarizing statistics, where a statistic is a numerical quantity whose value is determined by the data.
2.3.1 Sample Mean, Sample Median, and Sample Mode
In this section we introduce some statistics that are used for describing the center of a set of data values. To begin, suppose that we have a data set consisting of the $n$ numerical values $x_1, x_2, \ldots, x_n$. The sample mean is the arithmetic average of these values.

Definition

The sample mean, designated by $\bar{x}$, is defined by
$$\bar{x} = \sum_{i=1}^{n} x_i \big/ n$$
The computation of the sample mean can often be simplified by noting that if, for constants $a$ and $b$,
$$y_i = a x_i + b, \qquad i = 1, \ldots, n$$
then the sample mean of the data set $y_1, \ldots, y_n$ is
$$\bar{y} = \sum_{i=1}^{n} (a x_i + b)\big/n = \sum_{i=1}^{n} a x_i\big/n + \sum_{i=1}^{n} b\big/n = a\bar{x} + b$$
EXAMPLE 2.3a The winning scores in the U.S. Masters golf tournament in the years from 1982 to 1991 were as follows:

284, 280, 277, 282, 279, 285, 281, 283, 278, 277

Find the sample mean of these scores.

SOLUTION Rather than directly adding these values, it is easier to first subtract 280 from each one to obtain the new values $y_i = x_i - 280$:

4, 0, −3, 2, −1, 5, 1, 3, −2, −3

Because the arithmetic average of the transformed data set is $\bar{y} = 6/10$, it follows that
$$\bar{x} = \bar{y} + 280 = 280.6 \qquad ■$$
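The same shortcut is a one-liner in code; this sketch simply checks the arithmetic of Example 2.3a.

```python
scores = [284, 280, 277, 282, 279, 285, 281, 283, 278, 277]

# Direct computation of the sample mean.
xbar = sum(scores) / len(scores)

# Via the linear-transformation rule: y_i = x_i - 280, so xbar = ybar + 280.
y = [x - 280 for x in scores]
ybar = sum(y) / len(y)

print(xbar, ybar + 280)   # both print 280.6
```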
Sometimes we want to determine the sample mean of a data set that is presented in a frequency table listing the $k$ distinct values $v_1, \ldots, v_k$ having corresponding frequencies $f_1, \ldots, f_k$. Since such a data set consists of $n = \sum_{i=1}^{k} f_i$ observations, with the value $v_i$ appearing $f_i$ times, for each $i = 1, \ldots, k$, it follows that the sample mean of these $n$ data values is
$$\bar{x} = \sum_{i=1}^{k} v_i f_i \big/ n$$
EXAMPLE 2.3b The following is a frequency table giving the ages of members of a symphony orchestra for young adults.

Another statistic that is used to indicate the center of a data set is the sample median.
Definition
Order the values of a data set of size $n$ from smallest to largest. If $n$ is odd, the sample median is the value in position $(n+1)/2$; if $n$ is even, it is the average of the values in positions $n/2$ and $n/2 + 1$.

Thus the sample median of a set of three values is the second smallest; of a set of four values, it is the average of the second and third smallest.
EXAMPLE 2.3c Find the sample median for the data described in Example 2.3b.

SOLUTION Since there are 54 data values, it follows that when the data are put in increasing order, the sample median is the average of the values in positions 27 and 28. Thus, the sample median is 18.5. ■
The sample mean and sample median are both useful statistics for describing the central tendency of a data set. The sample mean makes use of all the data values and is affected by extreme values that are much larger or smaller than the others; the sample median makes use of only one or two of the middle values and is thus not affected by extreme values. Which of them is more useful depends on what one is trying to learn from the data. For instance, if a city government has a flat rate income tax and is trying to estimate its total revenue from the tax, then the sample mean of its residents' income would be a more useful statistic. On the other hand, if the city was thinking about constructing middle-income housing, and wanted to determine the proportion of its population able to afford it, then the sample median would probably be more useful.
EXAMPLE 2.3d In a study reported in Hoel, D. G., "A representation of mortality data by competing risks," Biometrics, 28, pp. 475–488, 1972, a group of 5-week-old mice were each given a radiation dose of 300 rad. The mice were then divided into two groups; the first group was kept in a germ-free environment, and the second in conventional laboratory conditions. The numbers of days until death were then observed. The data for those whose death was due to thymic lymphoma are given in the following stem and leaf plots (whose stems are in units of hundreds of days); the first plot is for mice living in the germ-free conditions, and the second for mice living under ordinary laboratory conditions.

Determine the sample means and the sample medians for the two sets of mice.

SOLUTION It is clear from the stem and leaf plots that the sample mean for the set of mice put in the germ-free setting is larger than the sample mean for the set of mice in the usual laboratory setting; indeed, a calculation gives that the former sample mean is 344.07, whereas the latter one is 292.32. On the other hand, since there are 29 data values for the germ-free mice, the sample median is the 15th largest data value, namely, 259; similarly, the sample median for the other set of mice is the 10th largest data value, namely, 265. Thus, whereas the sample mean is quite a bit larger for the first data set, the sample medians are approximately equal. The reason for this is that whereas the sample mean for the first set is greatly affected by the five data values greater than 500, these values have a much smaller effect on the sample median. Indeed, the sample median would remain unchanged if these values were replaced by any other five values greater than or equal to 259. It appears from the stem and leaf plots that the germ-free conditions probably improved the life span of the five longest living rats, but it is unclear what, if any, effect it had on the life spans of the other rats. ■
Another statistic that has been used to indicate the central tendency of a data set is the sample mode, defined to be the value that occurs with the greatest frequency. If no single value occurs most frequently, then all the values that occur at the highest frequency are called modal values.
EXAMPLE 2.3e The following frequency table gives the values obtained in 40 rolls of a die.

Value | Frequency
1 | 9
2 | 8
3 | 5
4 | 5
5 | 6
6 | 7

Find (a) the sample mean, (b) the sample median, and (c) the sample mode.

SOLUTION (a) The sample mean is
$$\bar{x} = (9 + 16 + 15 + 20 + 30 + 42)/40 = 3.3$$

(b) The sample median is the average of the 20th and 21st smallest values, and is thus equal to 3. (c) The sample mode is 1, the value that occurred most frequently. ■
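All three statistics are mechanical to compute from a frequency table. The sketch below redoes Example 2.3e; the frequencies used are the ones implied by the solution's arithmetic (each term of the sum there is a value times its frequency).

```python
values = [1, 2, 3, 4, 5, 6]
freqs  = [9, 8, 5, 5, 6, 7]          # frequency of each die value in the 40 rolls
n = sum(freqs)                        # 40

# (a) Sample mean: sum of value * frequency, divided by n.
mean = sum(v * f for v, f in zip(values, freqs)) / n

# (b) Sample median: average of the values in positions n/2 and n/2 + 1.
expanded = [v for v, f in zip(values, freqs) for _ in range(f)]  # already sorted
median = (expanded[n // 2 - 1] + expanded[n // 2]) / 2

# (c) Sample mode: the value with the greatest frequency.
mode = values[freqs.index(max(freqs))]

print(mean, median, mode)   # 3.3, 3.0, 1
```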
2.3.2 Sample Variance and Sample Standard Deviation
Whereas we have presented statistics that describe the central tendencies of a data set, we are also interested in ones that describe the spread or variability of the data values. A statistic that could be used for this purpose would be one that measures the average value of the squares of the distances between the data values and the sample mean. This is accomplished by the sample variance, which for technical reasons divides the sum of the squares of the differences by $n - 1$ rather than $n$, where $n$ is the size of the data set.
Definition

The sample variance, call it $s^2$, of the data set $x_1, \ldots, x_n$ is defined by
$$s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 \Big/ (n - 1)$$
EXAMPLE 2.3f Find the sample variances of the data sets A and B given below.

A: 3, 4, 6, 7, 10        B: −20, 5, 15, 24

SOLUTION As the sample mean for data set A is $\bar{x} = (3 + 4 + 6 + 7 + 10)/5 = 6$, it follows that its sample variance is
$$s^2 = [(-3)^2 + (-2)^2 + 0^2 + 1^2 + 4^2]/4 = 7.5$$

The sample mean for data set B is also 6; its sample variance is
$$s^2 = [(-26)^2 + (-1)^2 + 9^2 + (18)^2]/3 \approx 360.67$$

Thus, although both data sets have the same sample mean, there is a much greater variability in the values of the B set than in the A set. ■
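A direct translation of the definition confirms both numbers; note that statistics.variance in Python's standard library uses the same n − 1 divisor.

```python
import statistics

def sample_variance(xs):
    """Sample variance with the n - 1 divisor, straight from the definition."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

A = [3, 4, 6, 7, 10]
B = [-20, 5, 15, 24]

print(sample_variance(A), statistics.variance(A))  # 7.5 and 7.5
print(sample_variance(B), statistics.variance(B))  # about 360.67 for both
```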
The following algebraic identity is often useful for computing the sample variance:

An Algebraic Identity
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$

The identity is proven as follows:
$$\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2 &= \sum_{i=1}^{n} \left( x_i^2 - 2x_i\bar{x} + \bar{x}^2 \right) \\
&= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} \bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
&= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
\end{aligned}$$
The computation of the sample variance can also be eased by noting that if
$$y_i = a + bx_i, \qquad i = 1, \ldots, n$$
then $\bar{y} = a + b\bar{x}$, and so
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = b^2 \sum_{i=1}^{n} (x_i - \bar{x})^2$$
That is, if $s_y^2$ and $s_x^2$ are the respective sample variances, then
$$s_y^2 = b^2 s_x^2$$
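Both facts are easy to sanity-check numerically; the sketch below verifies the algebraic identity and the scaling rule on data set A of Example 2.3f, with arbitrary illustrative constants a and b.

```python
xs = [3, 4, 6, 7, 10]
n = len(xs)
xbar = sum(xs) / n

# Algebraic identity: the sum of squared deviations equals sum(x_i^2) - n*xbar^2.
lhs = sum((x - xbar) ** 2 for x in xs)
rhs = sum(x * x for x in xs) - n * xbar ** 2
print(lhs, rhs)                        # both 30.0

# Scaling rule: transforming y_i = a + b*x_i multiplies the variance by b^2.
a, b = 100, 3                          # arbitrary illustrative constants
ys = [a + b * x for x in xs]
s2 = lambda zs: sum((z - sum(zs) / len(zs)) ** 2 for z in zs) / (len(zs) - 1)
print(s2(ys), b ** 2 * s2(xs))         # both 67.5
```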