
Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics


DOCUMENT INFORMATION

Basic information

Title: Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics
Author: Anirban DasGupta
Series editors: G. Casella, S. Fienberg, I. Olkin
Institution: Purdue University
Subject: Statistics
Type: Book
Year of publication: 2011
City: New York

Format

Pages: 803
File size: 3.93 MB

Content

This is the companion second volume to my undergraduate text Fundamentals of Probability: A First Course. The purpose of my writing this book is to give graduate students, instructors, and researchers in statistics, mathematics, and computer science a lucidly written unique text at the confluence of probability, advanced stochastic processes, statistics, and key tools for machine learning. Numerous topics in probability and stochastic processes of current importance in statistics and machine learning that are widely scattered in the literature in many different specialized books are all brought together under one fold in this book. This is done with an extensive bibliography for each topic, and numerous worked-out examples and exercises.

Probability, with all its models, techniques, and its poignant beauty, is an incredibly powerful tool for anyone who deals with data or randomness. The content and the style of this book reflect that philosophy; I emphasize lucidity, a wide background, and the far-reaching applicability of probability in science. The book starts with a self-contained and fairly complete review of basic probability, and then traverses its way through the classics, to advanced modern topics and tools, including a substantial amount of statistics itself. Because of its nearly encyclopaedic coverage, it can serve as a graduate text for a year-long probability sequence, or for focused short courses on selected topics, for self-study, and as a nearly unique reference for research in statistics, probability, and computer science. It provides an extensive treatment of most of the standard topics in a graduate probability sequence, and integrates them with the basic theory and many examples of several core statistical topics, as well as with some tools of major importance in machine learning. This is done with unusually detailed bibliographies for the reader who wants to dig deeper into a particular topic, and with a huge repertoire of worked-out examples and exercises. The total number of worked-out examples in this book is 423, and the total number of exercises is 808. An instructor can rotate the exercises between semesters, and use them for setting exams, and a student can use them for additional exam preparation and self-study. I believe that the book is unique in its range, unification, bibliographic detail, and its collection of problems and examples.


Springer Texts in Statistics


Anirban DasGupta

Probability for Statistics and Machine Learning Fundamentals and Advanced Topics



Springer New York Dordrecht Heidelberg London

Library of Congress Control Number: 2011924777

© Springer Science+Business Media, LLC 2011

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


To Persi Diaconis, Peter Hall, Ashok Maitra, and my mother, with affection


Preface

Probability, with all its models, techniques, and its poignant beauty, is an incredibly powerful tool for anyone who deals with data or randomness. The content and the style of this book reflect that philosophy; I emphasize lucidity, a wide background, and the far-reaching applicability of probability in science. The book starts with a self-contained and fairly complete review of basic probability, and then traverses its way through the classics, to advanced modern topics and tools, including a substantial amount of statistics itself. Because of its nearly encyclopaedic coverage, it can serve as a graduate text for a year-long probability sequence, or for focused short courses on selected topics, for self-study, and as a nearly unique reference for research in statistics, probability, and computer science. It provides an extensive treatment of most of the standard topics in a graduate probability sequence, and integrates them with the basic theory and many examples of several core statistical topics, as well as with some tools of major importance in machine learning. This is done with unusually detailed bibliographies for the reader who wants to dig deeper into a particular topic, and with a huge repertoire of worked-out examples and exercises. The total number of worked-out examples in this book is 423, and the total number of exercises is 808. An instructor can rotate the exercises between semesters, and use them for setting exams, and a student can use them for additional exam preparation and self-study. I believe that the book is unique in its range, unification, bibliographic detail, and its collection of problems and examples.

Topics in core probability, such as distribution theory, asymptotics, Markov chains, martingales, Poisson processes, random walks, and Brownian motion are covered in the first 14 chapters. In these chapters, a reader will also find basic coverage of such core statistical topics as confidence intervals, likelihood functions, maximum likelihood estimates, posterior densities, sufficiency, hypothesis testing, variance stabilizing transformations, and extreme value theory, all illustrated with many examples. In Chapters 15, 16, and 17, I treat three major topics of great application potential, empirical processes and VC theory, probability metrics, and large deviations. Chapters 18, 19, and 20 are specifically directed to the statistics and machine-learning community, and cover simulation, Markov chain Monte Carlo, the exponential family, bootstrap, the EM algorithm, and kernels.

The book does not make formal use of measure theory. I do not intend to minimize the role of measure theory in a rigorous study of probability. However, I believe that a large amount of probability can be taught, understood, enjoyed, and applied without needing formal use of measure theory. We do it around the world every day. At the same time, some theorems cannot be proved without at least a mention of some measure theory terminology. Even some definitions require a mention of some measure theory notions. I include some unavoidable mention of measure-theoretic terms and results, such as the strong law of large numbers and its proof, the dominated convergence theorem, monotone convergence, Lebesgue measure, and a few others, but only in the advanced chapters in the book.

Following the table of contents, I have suggested some possible courses with different themes using this book. I have also marked the nonroutine and harder exercises in each chapter with an asterisk. Likewise, some specialized sections with reference value have also been marked with an asterisk. Generally, the exercises and the examples come with a caption, so that the reader will immediately know the content of an exercise or an example. The end of the proof of a theorem has been marked by a sign.

My deepest gratitude and appreciation are due to Peter Hall. I am lucky that the style and substance of this book are significantly molded by Peter's influence. Out of habit, I sent him the drafts of nearly every chapter as I was finishing them. It didn't matter where exactly he was, I always received his input and gentle suggestions for improvement. I have found Peter to be a concerned and warm friend, teacher, mentor, and guardian, and for this, I am extremely grateful.

Mouli Banerjee, Rabi Bhattacharya, Burgess Davis, Stewart Ethier, Arthur Frazho, Evarist Giné, T. Krishnan, S. N. Lahiri, Wei-Liem Loh, Hyun-Sook Oh, B. V. Rao, Yosi Rinott, Wen-Chi Tsai, Frederi Viens, and Larry Wasserman graciously went over various parts of this book. I am deeply indebted to each of them. Larry Wasserman, in particular, suggested the chapters on empirical processes, VC theory, concentration inequalities, the exponential family, and Markov chain Monte Carlo. The Springer series editors, Peter Bickel, George Casella, Steve Fienberg, and Ingram Olkin have consistently supported my efforts, and I am so very thankful to them. Springer's incoming executive editor Marc Strauss saw through the final production of this book extremely efficiently, and I have much enjoyed working with him. I appreciated Marc's gentility and his thoroughly professional handling of the transition of the production of this book to his oversight. Valerie Greco did an astonishing job of copyediting the book. The presentation, display, and the grammar of the book are substantially better because of the incredible care and thoughtfulness that she put into correcting my numerous errors. The staff at SPi Technologies, Chennai, India did an astounding and marvelous job of producing this book. Six anonymous reviewers gave extremely gracious and constructive comments, and their input has helped me in various dimensions to make this a better book. Doug Crabill is the greatest computer systems administrator, and with an infectious pleasantness has bailed me out of my stupidity far too many times.

I also want to mention my fond memories and deep-rooted feelings for the Indian Statistical Institute, where I had all of my college education. It was just a wonderful place for research, education, and friendships. Nearly everything that I know is due to my years at the Indian Statistical Institute, and for this I am thankful.

This is the third time that I have written a book in contract with John Kimmel. John is much more than a nearly unique person in the publishing world. To me, John epitomizes sensitivity and professionalism, a singular combination. I have now known John for almost six years, and it is very very difficult not to appreciate and admire him a whole lot for his warmth, style, and passion for the subjects of statistics and probability. Ironically, the day that this book entered production, the news came that John was leaving Springer. I will remember John's contribution to my professional growth with enormous respect and appreciation.

Contents

Suggested Courses with Different Themes xix

1 Review of Univariate Probability 1

1.1 Experiments and Sample Spaces 1

1.2 Conditional Probability and Independence 5

1.3 Integer-Valued and Discrete Random Variables 8

1.3.1 CDF and Independence 9

1.3.2 Expectation and Moments 13

1.4 Inequalities 19

1.5 Generating and Moment-Generating Functions 22

1.6  Applications of Generating Functions to a Pattern Problem 26

1.7 Standard Discrete Distributions 28

1.8 Poisson Approximation to Binomial 34

1.9 Continuous Random Variables 36

1.10 Functions of a Continuous Random Variable 42

1.10.1 Expectation and Moments 45

1.10.2 Moments and the Tail of a CDF 49

1.11 Moment-Generating Function and Fundamental Inequalities 51

1.11.1  Inversion of an MGF and Post’s Formula 53

1.12 Some Special Continuous Distributions 54

1.13 Normal Distribution and Confidence Interval for a Mean 61

1.14 Stein’s Lemma 66

1.15  Chernoff’s Variance Inequality 68

1.16  Various Characterizations of Normal Distributions 69

1.17 Normal Approximations and Central Limit Theorem 71

1.17.1 Binomial Confidence Interval 74

1.17.2 Error of the CLT 76

1.18 Normal Approximation to Poisson and Gamma 79

1.18.1 Confidence Intervals 80

1.19  Convergence of Densities and Edgeworth Expansions 82

References 92


2 Multivariate Discrete Distributions 95

2.1 Bivariate Joint Distributions and Expectations of Functions 95

2.2 Conditional Distributions and Conditional Expectations .100

2.2.1 Examples on Conditional Distributions and Expectations .101

2.3 Using Conditioning to Evaluate Mean and Variance 104

2.4 Covariance and Correlation 107

2.5 Multivariate Case 111

2.5.1 Joint MGF .112

2.5.2 Multinomial Distribution 114

2.6  The Poissonization Technique 116

3 Multidimensional Densities 123

3.1 Joint Density Function and Its Role 123

3.2 Expectation of Functions 132

3.3 Bivariate Normal 136

3.4 Conditional Densities and Expectations 140

3.4.1 Examples on Conditional Densities and Expectations .142

3.5 Posterior Densities, Likelihood Functions, and Bayes Estimates 147

3.6 Maximum Likelihood Estimates .152

3.7 Bivariate Normal Conditional Distributions 154

3.8  Useful Formulas and Characterizations for Bivariate Normal 155

3.8.1 Computing Bivariate Normal Probabilities .157

3.9  Conditional Expectation Given a Set and Borel’s Paradox 158

References 165

4 Advanced Distribution Theory 167

4.1 Convolutions and Examples 167

4.2 Products and Quotients and the t - and F -Distribution 172

4.3 Transformations 177

4.4 Applications of Jacobian Formula .178

4.5 Polar Coordinates in Two Dimensions 180

4.6  n-Dimensional Polar and Helmert’s Transformation 182

4.6.1 Efficient Spherical Calculations with Polar Coordinates 182

4.6.2 Independence of Mean and Variance in Normal Case 185

4.6.3 The t Confidence Interval 187

4.7 The Dirichlet Distribution 188

4.7.1  Picking a Point from the Surface of a Sphere 191

4.7.2 Poincaré's Lemma 191

4.8  Ten Important High-Dimensional Formulas for Easy Reference 191

References 197


5 Multivariate Normal and Related Distributions 199

5.1 Definition and Some Basic Properties .199

5.2 Conditional Distributions 202

5.3 Exchangeable Normal Variables 205

5.4 Sampling Distributions Useful in Statistics 207

5.4.1  Wishart Expectation Identities 208

5.4.2 * Hotelling's T² and Distribution of Quadratic Forms 209

5.4.3  Distribution of Correlation Coefficient 212

5.5 Noncentral Distributions 213

5.6 Some Important Inequalities for Easy Reference 214

References 218

6 Finite Sample Theory of Order Statistics and Extremes .221

6.1 Basic Distribution Theory 221

6.2 More Advanced Distribution Theory .225

6.3 Quantile Transformation and Existence of Moments 229

6.4 Spacings 233

6.4.1 Exponential Spacings and Réyni's Representation 233

6.4.2 Uniform Spacings 234

6.5 Conditional Distributions and Markov Property .235

6.6 Some Applications 238

6.6.1  Records 238

6.6.2 The Empirical CDF 241

6.7  Distribution of the Multinomial Maximum 243

References 247

7 Essential Asymptotics and Applications 249

7.1 Some Basic Notation and Convergence Concepts 250

7.2 Laws of Large Numbers 254

7.3 Convergence Preservation 259

7.4 Convergence in Distribution 262

7.5 Preservation of Convergence and Statistical Applications 267

7.5.1 Slutsky’s Theorem .268

7.5.2 Delta Theorem 269

7.5.3 Variance Stabilizing Transformations 272

7.6 Convergence of Moments .274

7.6.1 Uniform Integrability .275

7.6.2 The Moment Problem and Convergence in Distribution 277

7.6.3 Approximation of Moments .278

7.7 Convergence of Densities and Scheffé's Theorem 282

References 292


8 Characteristic Functions and Applications 293

8.1 Characteristic Functions of Standard Distributions 294

8.2 Inversion and Uniqueness .298

8.3 Taylor Expansions, Differentiability, and Moments 302

8.4 Continuity Theorems .303

8.5 Proof of the CLT and the WLLN 305

8.6  Producing Characteristic Functions 306

8.7 Error of the Central Limit Theorem 308

8.8 Lindeberg–Feller Theorem for General Independent Case 311

8.9  Infinite Divisibility and Stable Laws 315

8.10  Some Useful Inequalities 317

References 322

9 Asymptotics of Extremes and Order Statistics 323

9.1 Central-Order Statistics 323

9.1.1 Single-Order Statistic .323

9.1.2 Two Statistical Applications 325

9.1.3 Several Order Statistics .326

9.2 Extremes .328

9.2.1 Easily Applicable Limit Theorems 328

9.2.2 The Convergence of Types Theorem 332

9.3  Fisher–Tippett Family and Putting it Together 333

References 338

10 Markov Chains and Applications 339

10.1 Notation and Basic Definitions 340

10.2 Examples and Various Applications as a Model 340

10.3 Chapman–Kolmogorov Equation .345

10.4 Communicating Classes 349

10.5 Gambler’s Ruin 352

10.6 First Passage, Recurrence, and Transience .354

10.7 Long Run Evolution and Stationary Distributions 359

References 374

11 Random Walks 375

11.1 Random Walk on the Cubic Lattice 375

11.1.1 Some Distribution Theory .378

11.1.2 Recurrence and Transience 379

11.1.3 Pólya's Formula for the Return Probability 382

11.2 First Passage Time and Arc Sine Law 383

11.3 The Local Time 387

11.4 Practically Useful Generalizations 389

11.5 Wald’s Identity 390

11.6 Fate of a Random Walk 392


11.7 Chung–Fuchs Theorem 394

11.8 Six Important Inequalities 396

References 400

12 Brownian Motion and Gaussian Processes 401

12.1 Preview of Connections to the Random Walk 402

12.2 Basic Definitions 403

12.2.1 Condition for a Gaussian Process to be Markov 406

12.2.2  Explicit Construction of Brownian Motion 407

12.3 Basic Distributional Properties 408

12.3.1 Reflection Principle and Extremes 410

12.3.2 Path Properties and Behavior Near Zero and Infinity .412

12.3.3  Fractal Nature of Level Sets 415

12.4 The Dirichlet Problem and Boundary Crossing Probabilities .416

12.4.1 Recurrence and Transience 418

12.5 The Local Time of Brownian Motion 419

12.6 Invariance Principle and Statistical Applications 421

12.7 Strong Invariance Principle and the KMT Theorem .425

12.8 Brownian Motion with Drift and Ornstein–Uhlenbeck Process .427

12.8.1 Negative Drift and Density of Maximum 427

12.8.2  Transition Density and the Heat Equation 428

12.8.3  The Ornstein–Uhlenbeck Process 429

References 435

13 Poisson Processes and Applications .437

13.1 Notation .438

13.2 Defining a Homogeneous Poisson Process .439

13.3 Important Properties and Uses as a Statistical Model 440

13.4  Linear Poisson Process and Brownian Motion: A Connection 448

13.5 Higher-Dimensional Poisson Point Processes 450

13.5.1 The Mapping Theorem 452

13.6 One-Dimensional Nonhomogeneous Processes 453

13.7  Campbell’s Theorem and Shot Noise 456

13.7.1 Poisson Process and Stable Laws 458

References 462

14 Discrete Time Martingales and Concentration Inequalities 463

14.1 Illustrative Examples and Applications in Statistics 463

14.2 Stopping Times and Optional Stopping 468

14.2.1 Stopping Times 469

14.2.2 Optional Stopping 470

14.2.3 Sufficient Conditions for Optional Stopping Theorem 472

14.2.4 Applications of Optional Stopping 474


14.3 Martingale and Concentration Inequalities 477

14.3.1 Maximal Inequality .477

14.3.2  Inequalities of Burkholder, Davis, and Gundy 480

14.3.3 Inequalities of Hoeffding and Azuma 483

14.3.4  Inequalities of McDiarmid and Devroye 485

14.3.5 The Upcrossing Inequality 488

14.4 Convergence of Martingales 490

14.4.1 The Basic Convergence Theorem .490

14.4.2 Convergence in L1 and L2 493

14.5  Reverse Martingales and Proof of SLLN 494

14.6 Martingale Central Limit Theorem .497

References 503

15 Probability Metrics 505

15.1 Standard Probability Metrics Useful in Statistics 505

15.2 Basic Properties of the Metrics 508

15.3 Metric Inequalities 515

15.4 Differential Metrics for Parametric Families 519

15.4.1  Fisher Information and Differential Metrics 520

15.4.2  Rao’s Geodesic Distances on Distributions 522

References 525

16 Empirical Processes and VC Theory 527

16.1 Basic Notation and Definitions 527

16.2 Classic Asymptotic Properties of the Empirical Process 529

16.2.1 Invariance Principle and Statistical Applications 531

16.2.2  Weighted Empirical Process 534

16.2.3 The Quantile Process 536

16.2.4 Strong Approximations of the Empirical Process 537

16.3 Vapnik–Chervonenkis Theory 538

16.3.1 Basic Theory .538

16.3.2 Concrete Examples 540

16.4 CLTs for Empirical Measures and Applications 543

16.4.1 Notation and Formulation .543

16.4.2 Entropy Bounds and Specific CLTs .544

16.4.3 Concrete Examples 547

16.5 Maximal Inequalities and Symmetrization .547

16.6  Connection to the Poisson Process 551

References 557

17 Large Deviations .559

17.1 Large Deviations for Sample Means 560

17.1.1 The Cramér–Chernoff Theorem in R 560

17.1.2 Properties of the Rate Function 564

17.1.3 Cramér's Theorem for General Sets 566


17.2 The Gärtner–Ellis Theorem and Markov Chain Large Deviations 567

17.3 The t-Statistic 570

17.4 Lipschitz Functions and Talagrand’s Inequality 572

17.5 Large Deviations in Continuous Time .574

17.5.1  Continuity of a Gaussian Process 576

17.5.2  Metric Entropy of T and Tail of the Supremum 577

References 582

18 The Exponential Family and Statistical Applications 583

18.1 One-Parameter Exponential Family 583

18.1.1 Definition and First Examples 584

18.2 The Canonical Form and Basic Properties 589

18.2.1 Convexity Properties 590

18.2.2 Moments and Moment Generating Function .591

18.2.3 Closure Properties 594

18.3 Multiparameter Exponential Family .596

18.4 Sufficiency and Completeness .600

18.4.1  Neyman–Fisher Factorization and Basu’s Theorem 602

18.4.2  Applications of Basu’s Theorem to Probability 604

18.5 Curved Exponential Family 607

References 612

19 Simulation and Markov Chain Monte Carlo 613

19.1 The Ordinary Monte Carlo .615

19.1.1 Basic Theory and Examples .615

19.1.2 Monte Carlo P -Values 622

19.1.3 Rao–Blackwellization 623

19.2 Textbook Simulation Techniques .624

19.2.1 Quantile Transformation and Accept–Reject 624

19.2.2 Importance Sampling and Its Asymptotic Properties 629

19.2.3 Optimal Importance Sampling Distribution .633

19.2.4 Algorithms for Simulating from Common Distributions 634

19.3 Markov Chain Monte Carlo 637

19.3.1 Reversible Markov Chains 639

19.3.2 Metropolis Algorithms 642

19.4 The Gibbs Sampler 645

19.5 Convergence of MCMC and Bounds on Errors 651

19.5.1 Spectral Bounds 653

19.5.2 Dobrushin's Inequality and Diaconis–Fill–Stroock Bound 657

19.5.3  Drift and Minorization Methods 659


19.6 MCMC on General Spaces 662

19.6.1 General Theory and Metropolis Schemes 662

19.6.2 Convergence 666

19.6.3 Convergence of the Gibbs Sampler 670

19.7 Practical Convergence Diagnostics 673

References 686

20 Useful Tools for Statistics and Machine Learning 689

20.1 The Bootstrap 689

20.1.1 Consistency of the Bootstrap 692

20.1.2 Further Examples .696

20.1.3  Higher-Order Accuracy of the Bootstrap 699

20.1.4 Bootstrap for Dependent Data 701

20.2 The EM Algorithm 704

20.2.1 The Algorithm and Examples 706

20.2.2 Monotone Ascent and Convergence of EM 711

20.2.3  Modifications of EM 714

20.3 Kernels and Classification 715

20.3.1 Smoothing by Kernels 715

20.3.2 Some Common Kernels in Use 717

20.3.3 Kernel Density Estimation 719

20.3.4 Kernels for Statistical Classification 724

20.3.5 Mercer’s Theorem and Feature Maps .732

References 744

A Symbols, Useful Formulas, and Normal Table 747

A.1 Glossary of Symbols 747

A.2 Moments and MGFs of Common Distributions 750

A.3 Normal Table 755

Author Index 757

Subject Index 763


Suggested Courses with Different Themes

15 weeks: Special topics for statistics students (Chapters 9, 10, 15, 16, 17, 18, 20)
15 weeks: Special topics for computer science students (Chapters 4, 11, 14, 16, 17, 18, 19)
8 weeks: Summer course for statistics students (Chapters 11, 12, 14, 20)
8 weeks: Summer course for computer science students (Chapters 14, 16, 18, 20)
8 weeks: Summer course on modeling and simulation (Chapters 4, 10, 13, 19)



Chapter 1

Review of Univariate Probability

Probability is a universally accepted tool for expressing degrees of confidence or doubt about some proposition in the presence of incomplete information or uncertainty. By convention, probabilities are calibrated on a scale of 0 to 1; assigning something a zero probability amounts to expressing the belief that we consider it impossible, whereas assigning a probability of one amounts to considering it a certainty. Most propositions fall somewhere in between. Probability statements that we make can be based on our past experience, or on our personal judgments. Whether our probability statements are based on past experience or subjective personal judgments, they obey a common set of rules, which we can use to treat probabilities in a mathematical framework, and also for making decisions on predictions, for understanding complex systems, or as intellectual experiments and for entertainment. Probability theory is one of the most applicable branches of mathematics. It is used as the primary tool for analyzing statistical methodologies; it is used routinely in nearly every branch of science, such as biology, astronomy and physics, medicine, economics, chemistry, sociology, ecology, finance, and many others. A background in the theory, models, and applications of probability is almost a part of basic education. That is how important it is.

For a classic and lively introduction to the subject of probability, we recommend Feller (1968, 1971). Among numerous other expositions of the theory of probability, a variety of examples on various topics can be seen in Ross (1984), Stirzaker (1994), Pitman (1992), Bhattacharya and Waymire (2009), and DasGupta (2010). Ash (1972), Chung (1974), Breiman (1992), Billingsley (1995), and Dudley (2002) are masterly accounts of measure-theoretic probability.

1.1 Experiments and Sample Spaces

Treatment of probability theory starts with the consideration of a sample space.

The sample space is the set of all possible outcomes in some physical experiment. For example, if a coin is tossed twice and after each toss the face that shows is recorded, then the possible outcomes of this particular coin-tossing experiment, say E, are HH, HT, TH, TT, with H denoting the occurrence of heads and T denoting the occurrence of tails. We call

Ω = {HH, HT, TH, TT}

the sample space of the experiment E.

In general, a sample space is a general set Ω, finite or infinite. An easy example where the sample space Ω is infinite is to toss a coin until the first time heads show up and record the number of the trial at which the first head appeared. In this case, the sample space Ω is the countably infinite set

Ω = {1, 2, 3, ...}.

Sample spaces can also be uncountably infinite; for example, consider the experiment of choosing a number at random from the interval [0, 1]. The sample space of this experiment is Ω = [0, 1]. In this case, Ω is an uncountably infinite set. In all cases, individual elements of a sample space are denoted as ω. The first task is to define events and to explain the meaning of the probability of an event.

Definition 1.1. Let Ω be the sample space of an experiment E. Then any subset A of Ω, including the empty set ∅ and the entire sample space Ω, is called an event.

Events may contain even one single sample point ω, in which case the event is a singleton set {ω}. We want to assign probabilities to events. But we want to assign probabilities in a way that they are logically consistent. In fact, this cannot be done in general if we insist on assigning probabilities to arbitrary collections of sample points, that is, arbitrary subsets of the sample space Ω. We can only define probabilities for such subsets of Ω that are tied together like a family, the exact concept being that of a σ-field. In most applications, including those cases where the sample space Ω is infinite, events that we would want to normally think about will be members of such an appropriate σ-field. So we do not mention the need for consideration of σ-fields any further, and get along with thinking of events as subsets of the sample space Ω, including in particular the empty set ∅ and the entire sample space Ω itself.

Here is a definition of what counts as a legitimate probability on events.

Definition 1.2. Given a sample space Ω, a probability or a probability measure on Ω is a function P on subsets of Ω such that

(a) P(A) ≥ 0 for any A ⊆ Ω;
(b) P(Ω) = 1;
(c) (Countable additivity) if A1, A2, ... is a countable collection of pairwise disjoint subsets of Ω, then P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ....

Not all probabilists agree that countable additivity is natural; but we do not get into that debate in this book. One important point is that finite additivity is subsumed in countable additivity; that is, if there are some finite number m of disjoint subsets A1, A2, ..., Am of Ω, then P(A1 ∪ ... ∪ Am) = P(A1) + ... + P(Am). Also, it is useful to note that the last two conditions in the definition of a probability measure imply that P(∅), the probability of the empty set or the null event, is zero.

One notational convention is that strictly speaking, for an event that is just a singleton set {ω}, we should write P({ω}) to denote its probability. But to reduce clutter, we simply use the more convenient notation P(ω).

One pleasant consequence of the axiom of countable additivity is the following basic result. We do not prove it here as it is a simple result; see DasGupta (2010) for a proof.

Theorem 1.1. Let A1 ⊇ A2 ⊇ A3 ⊇ ... be an infinite family of subsets of a sample space Ω such that An ↓ A. Then P(An) → P(A) as n → ∞.

Next, the concept of equally likely sample points is a very fundamental one.

Definition 1.3. Let Ω be a finite sample space consisting of N sample points. We say that the sample points are equally likely if P(ω) = 1/N for each sample point ω.

An immediate consequence, due to the additivity axiom, is the following useful formula.

Proposition. Let Ω be a finite sample space consisting of N equally likely sample points. Let A be any event and suppose A contains n distinct sample points. Then

P(A) = n/N = (number of sample points favorable to A) / (total number of sample points).

Let us see some examples.

Example 1.1 (The Shoe Problem). Suppose there are five pairs of shoes in a closet and four shoes are taken out at random. What is the probability that among the four that are taken out, there is at least one complete pair?

The total number of sample points is C(10, 4) = 210. Because selection was done completely at random, we assume that all sample points are equally likely. At least one complete pair would mean two complete pairs, or exactly one complete pair and two other nonconforming shoes. Two complete pairs can be chosen in C(5, 2) = 10 ways.

Example 1.2 (Five-Card Poker). In five-card poker, a player is given 5 cards from a full deck of 52 cards at random. Various named hands of varying degrees of rarity exist. In particular, we want to calculate the probabilities of A = two pairs and B = a flush. Two pairs is a hand with 2 cards each of 2 different denominations and the fifth card of some other denomination; a flush is a hand with 5 cards of the same suit, but the cards cannot be of denominations in a sequence. For the flush,

P(B) = 4 (C(13, 5) − 10) / C(52, 5) = .00197.

These are basic examples of counting arguments that are useful whenever there is a finite sample space and we assume that all sample points are equally likely.
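As a quick numerical check of these counting arguments (a Python sketch, not part of the original text; the variable names are mine), the flush and two-pairs counts can be recomputed with binomial coefficients:

```python
from math import comb

total = comb(52, 5)                             # all five-card hands
flush = 4 * (comb(13, 5) - 10)                  # one suit, minus the 10 straight sequences per suit
two_pairs = comb(13, 2) * comb(4, 2) ** 2 * 44  # two ranks, two suits each, one of the 44 remaining cards

print(round(flush / total, 5))                  # -> 0.00197, matching the value above
print(round(two_pairs / total, 5))              # -> 0.04754 for the two-pairs hand
```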

A major result in combinatorial probability is the inclusion–exclusion formula, which says the following.

Theorem 1.2. Let A1, A2, ..., An be n general events. Let S1 = Σ_i P(A_i), S2 = Σ_{i<j} P(A_i ∩ A_j), and, in general, S_k = Σ_{i1<...<ik} P(A_{i1} ∩ ... ∩ A_{ik}). Then

P(A1 ∪ A2 ∪ ... ∪ An) = S1 − S2 + S3 − ... + (−1)^(n+1) S_n.

Example 1.3 (Missing Suits in a Bridge Hand). Consider a specific player, say North, in a Bridge game. We want to calculate the probability that North's hand is void in at least one suit. Towards this, denote the suits as 1, 2, 3, 4 and let

A_i = North's hand is void in suit i.

Then, by the inclusion–exclusion formula,

P(North's hand is void in at least one suit) = S1 − S2 + S3 − S4.

The inclusion–exclusion formula can be hard to apply exactly, because the quantities S_j for large indices j can be difficult to calculate. However, fortunately, the inclusion–exclusion formula leads to bounds in both directions for the probability of the union of n general events. We have the following series of bounds.

Theorem 1.3 (Bonferroni Bounds). Given n events A1, A2, ..., An, let p_n = Σ_{i=1}^n P(A_i^c).

1.2 Conditional Probability and Independence

Both conditional probability and independence are fundamental concepts for probabilists and statisticians alike. Conditional probabilities correspond to updating one's beliefs when new information becomes available. Independence corresponds to irrelevance of a piece of new information, even when it is made available. In addition, the assumption of independence can and does significantly simplify development, mathematical analysis, and justification of tools and procedures.

Definition 1.4. Let A, B be general events with respect to some sample space Ω, and suppose P(A) > 0. The conditional probability of B given A is defined as

P(B | A) = P(A ∩ B) / P(A).

Some immediate consequences of the definition of a conditional probability are the following.

Theorem 1.4. (a) (Multiplicative Formula) For any two events A, B such that P(A) > 0, one has P(A ∩ B) = P(A)P(B | A);

(b) For any two events A, B such that 0 < P(A) < 1, one has P(B) = P(B | A)P(A) + P(B | A^c)P(A^c);

(c) (Total Probability Formula) If A1, A2, ..., Ak form a partition of the sample space Ω, then for any event B, P(B) = Σ_{i=1}^k P(B | A_i)P(A_i);

(d) (Hierarchical Multiplicative Formula) Let A1, A2, ..., Ak be k general events in a sample space Ω. Then

P(A1 ∩ A2 ∩ ... ∩ Ak) = P(A1)P(A2 | A1)P(A3 | A1 ∩ A2) ... P(Ak | A1 ∩ A2 ∩ ... ∩ A_(k−1)).

Example 1.4. One of two urns has a red and b black balls, and the other has c red and d black balls. One ball is chosen at random from each urn, and then one of these two balls is chosen at random. What is the probability that this ball is red?

If each ball selected from the two urns is red, then the final ball is definitely red. If one of those two balls is red, then the final ball is red with probability 1/2. If none of those two balls is red, then the final ball cannot be red.

Thus,

P(the final ball is red) = [a/(a + b)][c/(c + d)] + (1/2){[a/(a + b)][d/(c + d)] + [b/(a + b)][c/(c + d)]}
 = (2ac + ad + bc) / (2(a + b)(c + d)).

As an example, suppose a = 99, b = 1, c = 1, d = 1. Then

(2ac + ad + bc) / (2(a + b)(c + d)) = .745.

Although the total percentage of red balls in the two urns is more than 98%, the chance that the final ball selected would be red is just about 75%.
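A short simulation (my own sketch, not part of the text; the helper name is made up) agrees with the closed form above:

```python
import random

def final_ball_is_red(a, b, c, d, trials=200_000):
    """Monte Carlo estimate of the probability that the finally chosen ball is red."""
    red = 0
    for _ in range(trials):
        ball1 = random.random() < a / (a + b)   # ball drawn from the first urn
        ball2 = random.random() < c / (c + d)   # ball drawn from the second urn
        red += random.choice([ball1, ball2])    # one of the two balls picked at random
    return red / trials

print(final_ball_is_red(99, 1, 1, 1))           # ~0.745 = (2ac + ad + bc)/(2(a+b)(c+d))
```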

Example 1.5 (A Clever Conditioning Argument). Coin A gives heads with probability s and coin B gives heads with probability t. They are tossed alternately, starting off with coin A. We want to find the probability that the first head is obtained on coin A.

We find this probability by conditioning on the outcomes of the first two tosses; more precisely, define

A1 = {H} = first toss gives H;  A2 = {TH};  A3 = {TT}.

Let also,

A = the first head is obtained on coin A.

One of the three events A1, A2, A3 must happen, and they are also mutually exclusive. Therefore, by the total probability formula,

P(A) = P(A | A1)P(A1) + P(A | A2)P(A2) + P(A | A3)P(A3) = s + 0 + (1 − s)(1 − t)P(A),

which gives P(A) = s / (s + t − st).

As an example, let s = .4, t = .5. Note that coin A is biased against heads. Even then, s/(s + t − st) = .57 > .5. We see that there is an advantage in starting first.
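The conditioning argument is easy to check by simulation; the sketch below is mine, not the book's:

```python
import random

def first_head_on_A(s, t, trials=200_000):
    """Estimate P(first head comes from coin A) when coins A and B alternate, A tossing first."""
    wins = 0
    for _ in range(trials):
        while True:
            if random.random() < s:      # coin A's toss
                wins += 1
                break
            if random.random() < t:      # coin B's toss
                break
    return wins / trials

print(first_head_on_A(0.4, 0.5))         # ~0.571 = s / (s + t - s*t)
```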

Definition 1.5. A collection of events A1, A2, ..., An is said to be mutually independent (or just independent) if for each k, 1 ≤ k ≤ n, and any k of the events, A_{i1}, ..., A_{ik},

P(A_{i1} ∩ ... ∩ A_{ik}) = P(A_{i1}) ... P(A_{ik}).

They are called pairwise independent if this property holds for k = 2.

Example 1.6 (Lotteries). Although many people buy lottery tickets out of an expectation of good luck, probabilistically speaking, buying lottery tickets is usually a waste of money. Here is an example. Suppose in a weekly state lottery, five of the numbers 00, 01, ..., 49 are selected without replacement at random, and someone holding exactly those numbers wins the lottery. Then, the probability that someone holding one ticket will be the winner in a given week is

1 / C(50, 5) = 4.72 × 10^(−7).

Suppose this person buys a ticket every week for 40 years. Then, the probability that he will win the lottery on at least one week is 1 − (1 − 4.72 × 10^(−7))^(52·40) = .00098 < .001, still a very small probability. We assumed in this calculation that the weekly lotteries are all mutually independent, a reasonable assumption. The calculation would fall apart if we did not make this independence assumption.
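Both lottery probabilities are easy to reproduce; a minimal sketch (not from the book):

```python
from math import comb

p_week = 1 / comb(50, 5)                      # winning in any single week
p_40_years = 1 - (1 - p_week) ** (52 * 40)    # at least one win over 2080 independent weeks

print(f"{p_week:.2e}")                        # -> 4.72e-07
print(round(p_40_years, 5))                   # -> 0.00098
```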

It is not uncommon to see the conditional probabilities P(A | B) and P(B | A) confused with each other. Suppose in some group of lung cancer patients, we see a large percentage of smokers. If we define B to be the event that a person is a smoker, and A to be the event that a person has lung cancer, then all we can conclude is that in our group of people P(B | A) is large. But we cannot conclude from just this information that smoking increases the chance of lung cancer, that is, that P(A | B) is large. In order to calculate a conditional probability P(A | B) when we know the other conditional probability P(B | A), a simple formula known as Bayes' theorem is useful. Here is a statement of a general version of Bayes' theorem.

Theorem 1.5. Let {A1, A2, ..., Am} be a partition of a sample space Ω. Let B be some fixed event. Then

P(A_j | B) = P(B | A_j)P(A_j) / Σ_{i=1}^m P(B | A_i)P(A_i).

Example 1.7 (Multiple Choice Exams). Suppose that the questions in a multiple choice exam have five alternatives each, of which a student has to pick one as the correct alternative. A student either knows the truly correct alternative with probability .7, or she randomly picks one of the five alternatives as her choice. Suppose a particular problem was answered correctly. We want to know what the probability is that the student really knew the correct answer.

Define

A = the student knew the correct answer;
B = the student answered the question correctly.

We want to compute P(A | B). By Bayes' theorem,

P(A | B) = P(B | A)P(A) / [P(B | A)P(A) + P(B | A^c)P(A^c)] = (1)(.7) / [(1)(.7) + (.2)(.3)] = .92.

The posterior probability .92 is larger than the prior probability .7 that she knew the answer; this is exactly what Bayes' theorem does; it updates our prior belief to the posterior belief, when new evidence becomes available.
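The posterior quoted above follows from plugging the numbers into Bayes' theorem; a minimal sketch (mine, with the variable names assumed) reproduces it:

```python
prior_knows = 0.7                 # P(A): the student knows the answer
p_correct_if_knows = 1.0          # P(B | A): she answers correctly if she knows it
p_correct_if_guessing = 1 / 5     # P(B | A^c): a random pick among five alternatives

posterior = (p_correct_if_knows * prior_knows) / (
    p_correct_if_knows * prior_knows + p_correct_if_guessing * (1 - prior_knows)
)
print(round(posterior, 3))        # -> 0.921
```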

1.3 Integer-Valued and Discrete Random Variables

In some sense, the entire subject of probability and statistics is about distributions of random variables. Random variables, as the very name suggests, are quantities that vary, over time, or from individual to individual, and the reason for the variability is some underlying random process. Depending on exactly how an underlying experiment E ends, the random variable takes different values. In other words, the value of the random variable is determined by the sample point ω that prevails, when the underlying experiment E is actually conducted. We cannot know a priori the value of the random variable, because we do not know a priori which sample point ω will prevail when the experiment E is conducted. We try to understand the behavior of a random variable by analyzing the probability structure of that underlying random experiment.

Random variables, like probabilities, originated in gambling. Therefore, the random variables that come to us more naturally are integer-valued random variables; for example, the sum of the two rolls when a die is rolled twice. Integer-valued random variables are special cases of what are known as discrete random variables. Discrete or not, a common mathematical definition of all random variables is the following.

Definition 1.6. Let Ω be a sample space corresponding to some experiment E and let X : Ω → R be a function from the sample space to the real line. Then X is called a random variable.

Discrete random variables are those that take a finite or a countably infinite number of possible values. In particular, all integer-valued random variables are discrete. From the point of view of understanding the behavior of a random variable, the important thing is to know the probabilities with which X takes its different possible values.

Definition 1.7. Let X : Ω → R be a discrete random variable taking a finite or countably infinite number of values x1, x2, x3, .... The probability distribution or the probability mass function (pmf) of X is the function p(x) = P(X = x), x = x1, x2, x3, ..., and p(x) = 0 otherwise.

It is common to not explicitly mention the phrase "p(x) = 0 otherwise," and we generally follow this convention. Some authors use the phrase mass function instead of probability mass function.

For any pmf, one must have p(x) ≥ 0 for any x, and Σ_i p(x_i) = 1. Any function satisfying these two properties for some set of numbers x1, x2, x3, ... is a valid pmf.

valid pmf

1.3.1 CDF and Independence

A second important definition is that of a cumulative distribution function (CDF).

The CDF gives the probability that a random variable X is less than or equal to any given number x. It is important to understand that the notion of a CDF is universal to all random variables; it is not limited to only the discrete ones.

Definition 1.8. The cumulative distribution function of a random variable X is the function F(x) = P(X ≤ x), x ∈ R.

Definition 1.9. Let X have the CDF F(x). Any number m such that P(X ≥ m) ≥ .5, and also P(X ≤ m) ≥ .5, is called a median of F, or equivalently, a median of X.

Remark. The median of a random variable need not be unique. A simple way to characterize all the medians of a distribution is available.

Proposition. Let X be a random variable with the CDF F(x). Let m0 be the first x such that F(x) ≥ .5, and let m1 be the last x such that P(X ≥ x) ≥ .5. Then, a number m is a median of X if and only if m ∈ [m0, m1].

The CDF of any random variable satisfies a set of properties. Conversely, any function satisfying these properties is a valid CDF; that is, it will be the CDF of some appropriately chosen random variable. These properties are given in the next result.

Theorem 1.6. A function F(x) is the CDF of some real-valued random variable X if and only if it satisfies all of the following properties.

(a) 0 ≤ F(x) ≤ 1 for all x ∈ R.
(b) F(x) → 0 as x → −∞, and F(x) → 1 as x → ∞.
(c) Given any real number a, F(x) ↓ F(a) as x ↓ a.
(d) Given any two real numbers x, y with x < y, F(x) ≤ F(y).

Property (c) is called continuity from the right, or simply right continuity. It is clear that a CDF need not be continuous from the left; indeed, for discrete random variables, the CDF has a jump at the values of the random variable, and at the jump points, the CDF is not left continuous. More precisely, one has the following result.

Proposition. Let F(x) be the CDF of some random variable X. Then, for any x,

(a) P(X = x) = F(x) − lim_{y↑x} F(y) = F(x) − F(x−), including those points x for which P(X = x) = 0.
(b) P(X ≥ x) = P(X > x) + P(X = x) = (1 − F(x)) + (F(x) − F(x−)) = 1 − F(x−).

Example 1.8 (Bridge). Consider the random variable

X = number of aces in North's hand in a Bridge game.

Clearly, X can take any of the values x = 0, 1, 2, 3, 4. If X = x, then the other 13 − x cards in North's hand must be non-ace cards. Thus, the pmf of X is

P(X = x) = C(4, x) C(48, 13 − x) / C(52, 13),  x = 0, 1, 2, 3, 4.
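The hypergeometric pmf displayed above is easy to tabulate; the snippet below (my own check, not the book's code) lists P(X = x) for x = 0, ..., 4:

```python
from math import comb

def aces_pmf(x):
    """P(North holds exactly x aces) in a randomly dealt 13-card bridge hand."""
    return comb(4, x) * comb(48, 13 - x) / comb(52, 13)

for x in range(5):
    print(x, round(aces_pmf(x), 4))   # 0.3038, 0.4388, 0.2135, 0.0412, 0.0026
```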

Example 1.9 (Indicator Variables). Consider the experiment of rolling a fair die twice and now define a random variable Y as follows:

Y = 1 if the sum of the two rolls X is an even number;
Y = 0 if the sum of the two rolls X is an odd number.

If we let A be the event that X is an even number, then Y = 1 if A happens, and Y = 0 if A does not happen. Such random variables are called indicator random variables and are immensely useful in mathematical calculations in many complex situations.

Definition 1.10. Let A be any event in a sample space Ω. The indicator random variable for A is defined as

I_A = 1 if A happens;
I_A = 0 if A does not happen.

Thus, the distribution of an indicator variable is simply P(I_A = 1) = P(A); P(I_A = 0) = 1 − P(A).

An indicator variable is also called a Bernoulli variable with parameter p, where p is just P(A). We later show examples of uses of indicator variables in calculation of expectations.

In applications, we are sometimes interested in the distribution of a function, say g(X), of a basic random variable X. In the discrete case, the distribution of a function is found in the obvious way.

Proposition. Let X be a discrete random variable and Y = g(X) a real-valued function of X. Then, P(Y = y) = Σ_{x: g(x) = y} P(X = x).

First, the constant c must be explicitly evaluated. By directly summing the values of the pmf and equating the total to one, c is determined, and then, for example, P(Z = 0) = P(X = −2) + P(X = 0) + P(X = 2) = 7/13. The pmf of Z = h(X) places probabilities 3/13, 7/13, and 3/13 on its three possible values.
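The setup of this example is truncated in this extract; as a generic illustration of the proposition (my own sketch with a made-up pmf), the pmf of Y = g(X) is obtained by grouping the probabilities of the x-values that map to each y:

```python
from collections import defaultdict

pmf_X = {-2: 0.1, -1: 0.2, 0: 0.4, 1: 0.2, 2: 0.1}   # a made-up pmf for X
g = lambda x: x * x                                   # the function of interest, Y = X^2

pmf_Y = defaultdict(float)
for x, p in pmf_X.items():
    pmf_Y[g(x)] += p            # P(Y = y) = sum of P(X = x) over x with g(x) = y

print(dict(pmf_Y))              # {4: 0.2, 1: 0.4, 0: 0.4}
```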

A key concept in probability is that of independence of a collection of random variables. The collection could be finite or infinite. In the infinite case, we want each finite subcollection of the random variables to be independent. The definition of independence of a finite collection is as follows.

Definition 1.11. Let X1, X2, ..., Xk be k ≥ 2 discrete random variables defined on the same sample space Ω. We say that X1, X2, ..., Xk are independent if

P(X1 = x1, X2 = x2, ..., Xk = xk) = P(X1 = x1)P(X2 = x2) ... P(Xk = xk)  for all x1, x2, ..., xk.

It follows from the definition of independence of random variables that if X1, X2 are independent, then any function of X1 and any function of X2 are also independent. In fact, we have a more general result.

Theorem 1.7. Let X1, X2, ..., Xk be k ≥ 2 discrete random variables, and suppose they are independent. Let U = f(X1, X2, ..., Xi) be some function of X1, X2, ..., Xi, and V = g(X_{i+1}, ..., Xk) be some function of X_{i+1}, ..., Xk. Then U and V are independent.

This result is true of any types of random variables X1, X2, ..., Xk, not just discrete ones.

A common notation of wide use in probability and statistics is now introduced. If X1, X2, ..., Xk are independent, and moreover have the same CDF, say F, then we say that X1, X2, ..., Xk are iid (or IID) and write X1, X2, ..., Xk ∼ iid F. The abbreviation iid (IID) means independent and identically distributed.

Example 1.11 (Two Simple Illustrations). Consider the experiment of tossing a fair coin (or any coin) four times. Suppose X1 is the number of heads in the first two tosses, and X2 is the number of heads in the last two tosses. Then, it is intuitively clear that X1, X2 are independent, because the last two tosses carry no information regarding the first two tosses. The independence can be easily mathematically verified by using the definition of independence.

Next, consider the experiment of drawing 13 cards at random from a deck of 52 cards. Suppose X1 is the number of aces and X2 is the number of clubs among the 13 cards. Then, X1, X2 are not independent. For example, P(X1 = 4, X2 = 0) = 0, but P(X1 = 4) and P(X2 = 0) are both > 0, and so P(X1 = 4)P(X2 = 0) > 0. So, X1, X2 cannot be independent.


1.3.2 Expectation and Moments

By definition, a random variable takes different values on different occasions. It is natural to want to know what value it takes on average. Averaging is a very primitive concept. A simple average of just the possible values of the random variable will be misleading, because some values may have so little probability that they are relatively inconsequential. The average or the mean value, also called the expected value of a random variable, is a weighted average of the different values of X, weighted according to how important the value is. Here is the definition.

Definition 1.12. Let X be a discrete random variable. We say that the expected value of X exists if Σ_x |x| p(x) < ∞, in which case the expected value is defined as E(X) = Σ_x x p(x). The expected value is also known as the expectation or the mean of X.

If the set of possible values of X is infinite, then the infinite sum Σ_x x p(x) must converge absolutely for the expectation to be well defined. When the sample space Ω is finite or countably infinite and X is a discrete random variable with expectation μ, one also has E(X) = Σ_ω X(ω)P(ω), where P(ω) is the probability of the sample point ω.

Important Point. Although it is not the focus of this chapter, in applications we are often interested in more than one variable at the same time. To be specific, consider two discrete random variables X, Y defined on a common sample space Ω. Then we could construct new random variables out of X and Y, for example, XY, X + Y, X² + Y², and so on. We can then talk of their expectations as well. Here is a general definition of expectation of a function of more than one random variable.

Definition 1.13. Let X1, X2, ..., Xn be n discrete random variables, all defined on a common sample space Ω with a finite or a countably infinite number of sample points. We say that the expectation of a function g(X1, X2, ..., Xn) exists if Σ_ω |g(X1(ω), X2(ω), ..., Xn(ω))| P(ω) < ∞, in which case the expected value is E[g(X1, X2, ..., Xn)] = Σ_ω g(X1(ω), X2(ω), ..., Xn(ω)) P(ω).

Proposition. (a) If there exists a finite constant c such that P(X = c) = 1, then E(X) = c.

(b) If X, Y are random variables defined on the same sample space Ω with finite expectations, and if P(X ≤ Y) = 1, then E(X) ≤ E(Y).

(c) If X has a finite expectation, and if P(X ≥ c) = 1, then E(X) ≥ c. If P(X ≤ c) = 1, then E(X) ≤ c.

Proposition (Linearity of Expectations). Let X1, X2, ..., Xn be random variables defined on the same sample space Ω, and c1, c2, ..., cn any real-valued constants. Then, provided E(Xi) exists for every Xi,

E(c1 X1 + c2 X2 + ... + cn Xn) = Σ_{i=1}^n ci E(Xi);

in particular, E(cX) = cE(X) and E(X1 + X2) = E(X1) + E(X2), whenever the expectations exist.

The following fact also follows easily from the definition of the pmf of a function of a random variable. The result says that the expectation of a function of a random variable X can be calculated directly using the pmf of X itself, without having to calculate the pmf of the function.

Proposition (Expectation of a Function). Let X be a discrete random variable on a sample space Ω with a finite or countable number of sample points, and let Y = g(X) be a real-valued function of X. Then E(Y) = Σ_x g(x) p(x), provided E(Y) exists.

Caution. If g(X) is a linear function of X, then, of course, E(g(X)) = g(E(X)). But, in general, the two things are not equal. For example, E(X²) is not the same as (E(X))²; indeed, E(X²) > (E(X))² for any random variable X that is not a constant.

A very important property of independent random variables is the following factorization result on expectations.

Theorem 1.8. Suppose X1, X2, ..., Xn are independent random variables. Then, provided each expectation exists,

E(X1 X2 ... Xn) = E(X1)E(X2) ... E(Xn).

Let us now show some more illustrative examples.

Example 1.12. Let X be the number of heads obtained in two tosses of a fair coin. The pmf of X is p(0) = p(2) = 1/4, p(1) = 1/2. Therefore, E(X) = 0 · 1/4 + 1 · 1/2 + 2 · 1/4 = 1. Because the coin is fair, we expect it to show heads 50% of the number of times it is tossed, which is 50% of 2, that is, 1.

Example 1.13 (Dice Sum). Let X be the sum of the two rolls when a fair die is rolled twice. The pmf of X is p(2) = p(12) = 1/36; p(3) = p(11) = 2/36; p(4) = p(10) = 3/36; p(5) = p(9) = 4/36; p(6) = p(8) = 5/36; p(7) = 6/36. Therefore, E(X) = 2 · 1/36 + 3 · 2/36 + 4 · 3/36 + ... + 12 · 1/36 = 7. This can also be seen by letting X1 = the face obtained on the first roll, X2 = the face obtained on the second roll, and by using E(X) = E(X1 + X2) = E(X1) + E(X2) = 3.5 + 3.5 = 7.

Let us now make this problem harder. Suppose that a fair die is rolled 10 times and X is the sum of all 10 rolls. The pmf of X is no longer so simple; it will be cumbersome to write it down. But, if we let Xi = the face obtained on the ith roll, it is still true by the linearity of expectations that E(X) = E(X1 + X2 + ... + X10) = E(X1) + E(X2) + ... + E(X10) = 3.5 × 10 = 35. We can easily compute the expectation, although the pmf would be difficult to write down.
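The linearity shortcut can be compared against brute-force enumeration; a small sketch (not from the book):

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))            # all 36 equally likely rolls of two dice
e_direct = sum(a + b for a, b in outcomes) / len(outcomes)
print(e_direct)        # 7.0, agreeing with E(X1) + E(X2) = 3.5 + 3.5

# For ten rolls, enumerating 6**10 outcomes is unnecessary; linearity gives the answer at once.
print(10 * 3.5)        # 35.0
```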

Example 1.14 (A Random Variable Without a Finite Expectation). Let X take the positive integers 1, 2, 3, ... as its values with the pmf

p(x) = P(X = x) = 1/(x(x + 1)),  x = 1, 2, 3, ....

This is a valid pmf, because obviously 1/(x(x + 1)) > 0 for any x = 1, 2, 3, ..., and also the infinite series Σ_{x=1}^∞ 1/(x(x + 1)) sums to 1, a fact from calculus. Now,

Σ_{x=1}^∞ x · 1/(x(x + 1)) = Σ_{x=1}^∞ 1/(x + 1) = ∞,

also a fact from calculus.

This example shows that not all random variables have a finite expectation. Here, the reason for the infiniteness of E(X) is that X takes large integer values x with probabilities p(x) that are not adequately small. The large values are realized sufficiently often that on average X becomes larger than any given finite number.

The zero–one nature of indicator random variables is extremely useful for calculating expectations of certain integer-valued random variables whose distributions are sometimes so complicated that it would be difficult to find their expectations directly from definition. We describe the technique and some illustrations of it below.

Proposition. Let X be an integer-valued random variable such that it can be represented as X = Σ_{i=1}^m c_i I_{A_i} for some m, constants c1, c2, ..., cm, and suitable events A1, A2, ..., Am. Then, E(X) = Σ_{i=1}^m c_i P(A_i).

Example 1.15 (Coin Tosses). Suppose a coin that has probability p of showing heads in any single toss is tossed n times, and let X denote the number of times in the n tosses that a head is obtained. Then, X = Σ_{i=1}^n I_{A_i}, where A_i is the event that a head is obtained in the ith toss. Therefore, E(X) = Σ_{i=1}^n P(A_i) = Σ_{i=1}^n p = np. A direct calculation of the expectation would involve finding the pmf of X and obtaining the sum Σ_{x=0}^n x P(X = x); it can also be done that way, but that is a much longer calculation.

The random variable X of this example is a binomial random variable with parameters n and p. Its pmf is given by the formula P(X = x) = C(n, x) p^x (1 − p)^(n−x), x = 0, 1, 2, ..., n.

Example 1.16 (Consecutive Heads in Coin Tosses). Suppose a coin with probability p for heads in a single toss is tossed n times. How many times can we expect to see a head followed by at least one more head? For example, if n = 5, and we see the outcomes HTHHH, then we see a head followed by at least one more head twice. Define A_i = the ith and the (i + 1)th toss both result in heads. Then

X = number of times a head is followed by at least one more head = Σ_{i=1}^{n−1} I_{A_i},

and therefore E(X) = Σ_{i=1}^{n−1} p² = (n − 1)p². For example, if a fair coin is tossed 20 times, we can expect to see a head followed by another head about five times (19 × .5² = 4.75).
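The indicator computation (n − 1)p² is easy to confirm by simulation; the function below is my own sketch, not the book's:

```python
import random

def avg_head_followed_by_head(n, p, trials=50_000):
    """Monte Carlo estimate of the expected number of times a head is followed by another head."""
    total = 0
    for _ in range(trials):
        tosses = [random.random() < p for _ in range(n)]
        total += sum(tosses[i] and tosses[i + 1] for i in range(n - 1))
    return total / trials

print(avg_head_followed_by_head(20, 0.5))   # ~4.75 = (n - 1) * p**2
```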

Another useful technique for calculating expectations of nonnegative integer-valued random variables is based on the CDF of the random variable, rather than directly on the pmf. This method is useful when calculating probabilities of the form P(X > x) is logically more straightforward than directly calculating P(X = x). Here is the expectation formula based on the tail CDF.

Theorem 1.9 (Tailsum Formula). Let X take values 0, 1, 2, .... Then

E(X) = Σ_{n=1}^∞ P(X ≥ n) = Σ_{n=0}^∞ P(X > n).

Example 1.17 (Family Planning). Suppose a couple will have children until they have at least one child of each sex. How many children can they expect to have? Let X denote the childbirth at which they have a child of each sex for the first time. Suppose the probability that any particular childbirth will be a boy is p, and that all births are independent. Then,

P(X > n) = P(the first n children are all boys or all girls) = p^n + (1 − p)^n,  n ≥ 1.

Therefore, E(X) = 2 + Σ_{n=2}^∞ [p^n + (1 − p)^n] = 2 + p²/(1 − p) + (1 − p)²/p = 1/(p(1 − p)) − 1. If boys and girls are equally likely on any childbirth, then this says that a couple waiting to have a child of each sex can expect to have three children.
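A quick simulation (my own sketch, assuming independent births as in the example) reproduces the expected value of three children when p = 1/2:

```python
import random

def children_until_both_sexes(p_boy=0.5, trials=100_000):
    """Average number of births until the couple has at least one boy and one girl."""
    total = 0
    for _ in range(trials):
        boys = girls = 0
        while boys == 0 or girls == 0:
            if random.random() < p_boy:
                boys += 1
            else:
                girls += 1
            total += 1
    return total / trials

print(children_until_both_sexes(0.5))   # ~3.0 = 1/(p(1-p)) - 1
```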

The expected value is calculated with the intention of understanding what a typical value is of a random variable. But two very different distributions can have exactly the same expected value. A common example is that of a return on an investment in a stock. Two stocks may have the same average return, but one may be much riskier than the other, in the sense that the variability in the return is much higher for that stock. In that case, most risk-averse individuals would prefer to invest in the stock with less variability. Measures of risk or variability are of course not unique. Some natural measures that come to mind are E(|X − μ|), known as the mean absolute deviation, or P(|X − μ| > k) for some suitable k. However, neither of these two is the most common measure of variability. The most common measure is the standard deviation of a random variable.

Definition 1.14. Let a random variable X have a finite mean μ. The variance of X is defined as

σ² = E[(X − μ)²],

and the standard deviation of X is defined as σ = √σ².

It is easy to prove that σ² < ∞ if and only if E(X²), the second moment of X, is finite. It is not uncommon to mistake the standard deviation for the mean absolute deviation, but they are not the same. In fact, an inequality always holds.

Proposition. σ ≥ E(|X − μ|), and σ is strictly greater unless X is a constant random variable, namely, P(X = μ) = 1.

We list some basic properties of the variance of a random variable.

Proposition.

(a) Var(cX) = c² Var(X) for any real c.
(b) Var(X + k) = Var(X) for any real k.
(c) Var(X) ≥ 0 for any random variable X, and equals zero only if P(X = c) = 1 for some real constant c.
(d) Var(X) = E(X²) − μ².

The quantity E(X²) is called the second moment of X. The definition of a general moment is as follows.

Definition 1.15. Let X be a random variable, and k ≥ 1 a positive integer. Then E(X^k) is called the kth moment of X, and E(X^(−k)) is called the kth inverse moment of X, provided they exist.

We therefore have the following relationships involving moments and the variance:

Variance = Second Moment − (First Moment)².
Second Moment = Variance + (First Moment)².

Statisticians often use the third moment around the mean as a measure of lack of symmetry in the distribution of a random variable. The point is that if a random variable X has a symmetric distribution, and has a finite mean μ, then all odd moments around the mean, namely, E[(X − μ)^(2k+1)], will be zero, if the moment exists.

In particular, E[(X − μ)³] will be zero. Likewise, statisticians also use the fourth moment around the mean as a measure of how spiky the distribution is around the mean. To make these indices independent of the choice of unit of measurement (e.g., inches or centimeters), they use certain scaled measures of asymmetry and peakedness. Here are the definitions.

Definition 1.16. (a) Let X be a random variable with E[|X|³] < ∞. The skewness of X is defined as E[(X − μ)³]/σ³.

(b) Let X be a random variable with E[X⁴] < ∞. The kurtosis of X is defined as E[(X − μ)⁴]/σ⁴ − 3.

A completely precise interpretation of skewness and kurtosis is not really available. We later show that the kurtosis is 0 for all normal distributions; hence the motivation for subtracting 3 in the definition of the kurtosis.

Example 1.18 (Variance of Number of Heads). Consider the experiment of two tosses of a fair coin and let X be the number of heads obtained. Then, we have seen that p(0) = p(2) = 1/4, and p(1) = 1/2. Thus, E(X²) = 0 · 1/4 + 1 · 1/2 + 4 · 1/4 = 3/2, and E(X) = 1. Therefore, Var(X) = E(X²) − μ² = 3/2 − 1 = 1/2, and the standard deviation is σ = √.5 = .707.

Example 1.19 (A Random Variable with an Infinite Variance). If a random variable has a finite variance, then it can be shown that it must have a finite mean. This example shows that the converse need not be true.

Let X be a discrete random variable with the pmf

p(x) = P(X = x) = 4/(x(x + 1)(x + 2)),  x = 1, 2, 3, ....

Therefore, by direct verification, X has a finite expectation. Let us now examine the second moment of X:

E(X²) = Σ_{x=1}^∞ x² p(x) = 4 Σ_{x=1}^∞ x/((x + 1)(x + 2)) = ∞,

because the series Σ_{x=1}^∞ x/((x + 1)(x + 2)) is not finitely summable, a fact from calculus. Because E(X²) is infinite, but E(X) is finite, σ² = E(X²) − [E(X)]² must also be infinite.

is finite, 2D E.X2/ ŒE.X/2must also be infinite

If a collection of random variables is independent, then just like the expectation, the variance also adds up. Precisely, one has the following very useful fact.

Theorem 1.10. Let X1, X2, ..., Xn be n independent random variables. Then,

Var(X1 + X2 + ... + Xn) = Var(X1) + Var(X2) + ... + Var(Xn).

An important corollary of this result is the following variance formula for the mean, X̄, of n independent and identically distributed random variables.

Corollary 1.1. Let X1, X2, ..., Xn be independent random variables with a common variance σ² < ∞. Let X̄ = (X1 + ... + Xn)/n. Then Var(X̄) = σ²/n.

Theorem 1.11 (Chebyshev's Inequality and Markov's Inequality). (a) (Chebyshev's Inequality) Let X be a random variable with mean μ and variance σ², assumed to be finite. Let k be any positive number. Then

P(|X − μ| ≥ kσ) ≤ 1/k².

(b) (Markov's Inequality) Suppose X takes only nonnegative values, and suppose E(X) = μ, assumed to be finite. Let c be any positive number. Then,

P(X ≥ c) ≤ μ/c.

The virtue of these two inequalities is that they make no restrictive assumptions on the random variable X. Whenever μ, σ are finite, Chebyshev's inequality is applicable, and whenever μ is finite, Markov's inequality applies, provided the random variable is nonnegative. However, the universal nature of these inequalities also makes them typically quite conservative.
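To see how conservative the bound can be, compare it with an exact tail probability; the computation below is my own illustration for a Binomial(100, 0.5) count, not an example from the book:

```python
from math import comb, sqrt

n, p, k = 100, 0.5, 2                      # Binomial(100, 0.5), deviations of at least k standard deviations
mu, sigma = n * p, sqrt(n * p * (1 - p))   # mean 50, standard deviation 5

exact = sum(comb(n, x) * p ** x * (1 - p) ** (n - x)
            for x in range(n + 1) if abs(x - mu) >= k * sigma)

print(round(exact, 4))    # -> about 0.057, the exact P(|X - mu| >= 2 sigma)
print(1 / k ** 2)         # -> 0.25, the much weaker Chebyshev guarantee
```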

Although Chebyshev’s inequality usually gives conservative estimates for tail probabilities, it does imply a major result in probability theory in a special case.
