Springer Texts in Statistics
Anirban DasGupta
Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011924777
© Springer Science+Business Media, LLC 2011
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To Persi Diaconis, Peter Hall, Ashok Maitra, and my mother, with affection
This is the companion second volume to my undergraduate text Fundamentals of Probability: A First Course. The purpose of my writing this book is to give graduate students, instructors, and researchers in statistics, mathematics, and computer science a lucidly written unique text at the confluence of probability, advanced stochastic processes, statistics, and key tools for machine learning. Numerous topics in probability and stochastic processes of current importance in statistics and machine learning that are widely scattered in the literature in many different specialized books are all brought together under one fold in this book. This is done with an extensive bibliography for each topic, and numerous worked-out examples and exercises. Probability, with all its models, techniques, and its poignant beauty, is an incredibly powerful tool for anyone who deals with data or randomness. The content and the style of this book reflect that philosophy; I emphasize lucidity, a wide background, and the far-reaching applicability of probability in science.
The book starts with a self-contained and fairly complete review of basic probability, and then traverses its way through the classics, to advanced modern topics and tools, including a substantial amount of statistics itself. Because of its nearly encyclopaedic coverage, it can serve as a graduate text for a year-long probability sequence, or for focused short courses on selected topics, for self-study, and as a nearly unique reference for research in statistics, probability, and computer science. It provides an extensive treatment of most of the standard topics in a graduate probability sequence, and integrates them with the basic theory and many examples of several core statistical topics, as well as with some tools of major importance in machine learning. This is done with unusually detailed bibliographies for the reader who wants to dig deeper into a particular topic, and with a huge repertoire of worked-out examples and exercises. The total number of worked-out examples in this book is 423, and the total number of exercises is 808. An instructor can rotate the exercises between semesters, and use them for setting exams, and a student can use them for additional exam preparation and self-study. I believe that the book is unique in its range, unification, bibliographic detail, and its collection of problems and examples.
Topics in core probability, such as distribution theory, asymptotics, Markov chains, martingales, Poisson processes, random walks, and Brownian motion are covered in the first 14 chapters. In these chapters, a reader will also find basic coverage of such core statistical topics as confidence intervals, likelihood functions, maximum likelihood estimates, posterior densities, sufficiency, hypothesis testing, variance stabilizing transformations, and extreme value theory, all illustrated with many examples. In Chapters 15, 16, and 17, I treat three major topics of great application potential: empirical processes and VC theory, probability metrics, and large deviations. Chapters 18, 19, and 20 are specifically directed to the statistics and machine-learning community, and cover simulation, Markov chain Monte Carlo, the exponential family, bootstrap, the EM algorithm, and kernels.
The book does not make formal use of measure theory. I do not intend to minimize the role of measure theory in a rigorous study of probability. However, I believe that a large amount of probability can be taught, understood, enjoyed, and applied without needing formal use of measure theory. We do it around the world every day. At the same time, some theorems cannot be proved without at least a mention of some measure theory terminology. Even some definitions require a mention of some measure theory notions. I include some unavoidable mention of measure-theoretic terms and results, such as the strong law of large numbers and its proof, the dominated convergence theorem, monotone convergence, Lebesgue measure, and a few others, but only in the advanced chapters in the book.
Following the table of contents, I have suggested some possible courses with different themes using this book. I have also marked the nonroutine and harder exercises in each chapter with an asterisk. Likewise, some specialized sections with reference value have also been marked with an asterisk. Generally, the exercises and the examples come with a caption, so that the reader will immediately know the content of an exercise or an example. The end of the proof of a theorem has been marked by a sign.
My deepest gratitude and appreciation are due to Peter Hall. I am lucky that the style and substance of this book are significantly molded by Peter’s influence. Out of habit, I sent him the drafts of nearly every chapter as I was finishing them. It didn’t matter where exactly he was, I always received his input and gentle suggestions for improvement. I have found Peter to be a concerned and warm friend, teacher, mentor, and guardian, and for this, I am extremely grateful.
Mouli Banerjee, Rabi Bhattacharya, Burgess Davis, Stewart Ethier, Arthur Frazho, Evarist Giné, T. Krishnan, S. N. Lahiri, Wei-Liem Loh, Hyun-Sook Oh, B. V. Rao, Yosi Rinott, Wen-Chi Tsai, Frederi Viens, and Larry Wasserman graciously went over various parts of this book. I am deeply indebted to each of them. Larry Wasserman, in particular, suggested the chapters on empirical processes, VC theory, concentration inequalities, the exponential family, and Markov chain Monte Carlo. The Springer series editors, Peter Bickel, George Casella, Steve Fienberg, and Ingram Olkin have consistently supported my efforts, and I am so very thankful to them. Springer’s incoming executive editor Marc Strauss saw through the final production of this book extremely efficiently, and I have much enjoyed working with him. I appreciated Marc’s gentility and his thoroughly professional handling of the transition of the production of this book to his oversight. Valerie Greco did an astonishing job of copyediting the book. The presentation, display, and the grammar of the book are substantially better because of the incredible care and thoughtfulness that she put into correcting my numerous errors. The staff at SPi Technologies, Chennai, India did an astounding and marvelous job of producing this book. Six anonymous reviewers gave extremely gracious and constructive comments, and their input has helped me in various dimensions to make this a better book. Doug Crabill is the greatest computer systems administrator, and with an infectious pleasantness has bailed me out of my stupidity far too many times. I also want to mention my fond memories and deep-rooted feelings for the Indian Statistical Institute, where I had all of my college education. It was just a wonderful place for research, education, and friendships. Nearly everything that I know is due to my years at the Indian Statistical Institute, and for this I am thankful.
This is the third time that I have written a book in contract with John Kimmel. John is much more than a nearly unique person in the publishing world. To me, John epitomizes sensitivity and professionalism, a singular combination. I have now known John for almost six years, and it is very very difficult not to appreciate and admire him a whole lot for his warmth, style, and passion for the subjects of statistics and probability. Ironically, the day that this book entered production, the news came that John was leaving Springer. I will remember John’s contribution to my professional growth with enormous respect and appreciation.
Suggested Courses with Different Themes
1 Review of Univariate Probability
1.1 Experiments and Sample Spaces
1.2 Conditional Probability and Independence
1.3 Integer-Valued and Discrete Random Variables
1.3.1 CDF and Independence
1.3.2 Expectation and Moments
1.4 Inequalities
1.5 Generating and Moment-Generating Functions
1.6 Applications of Generating Functions to a Pattern Problem
1.7 Standard Discrete Distributions
1.8 Poisson Approximation to Binomial
1.9 Continuous Random Variables
1.10 Functions of a Continuous Random Variable
1.10.1 Expectation and Moments
1.10.2 Moments and the Tail of a CDF
1.11 Moment-Generating Function and Fundamental Inequalities
1.11.1 Inversion of an MGF and Post’s Formula
1.12 Some Special Continuous Distributions
1.13 Normal Distribution and Confidence Interval for a Mean
1.14 Stein’s Lemma
1.15 Chernoff’s Variance Inequality
1.16 Various Characterizations of Normal Distributions
1.17 Normal Approximations and Central Limit Theorem
1.17.1 Binomial Confidence Interval
2 Multivariate Discrete Distributions
2.1 Bivariate Joint Distributions and Expectations of Functions
2.2 Conditional Distributions and Conditional Expectations
2.2.1 Examples on Conditional Distributions and Expectations
2.3 Using Conditioning to Evaluate Mean and Variance
2.4 Covariance and Correlation
3.4 Conditional Densities and Expectations
3.4.1 Examples on Conditional Densities and Expectations
3.5 Posterior Densities, Likelihood Functions, and Bayes Estimates
3.6 Maximum Likelihood Estimates
3.7 Bivariate Normal Conditional Distributions
3.8 Useful Formulas and Characterizations for Bivariate Normal
3.8.1 Computing Bivariate Normal Probabilities
3.9 Conditional Expectation Given a Set and Borel’s Paradox
References
4 Advanced Distribution Theory
4.1 Convolutions and Examples
4.2 Products and Quotients and the t- and F-Distribution
4.3 Transformations
4.4 Applications of Jacobian Formula
4.5 Polar Coordinates in Two Dimensions
4.6 n-Dimensional Polar and Helmert’s Transformation
4.6.1 Efficient Spherical Calculations with Polar Coordinates
4.6.2 Independence of Mean and Variance in Normal Case
4.6.3 The t Confidence Interval
4.7 The Dirichlet Distribution
4.7.1 Picking a Point from the Surface of a Sphere
4.7.2 Poincaré’s Lemma
4.8 Ten Important High-Dimensional Formulas for Easy Reference
References
5 Multivariate Normal and Related Distributions
5.1 Definition and Some Basic Properties
5.2 Conditional Distributions
5.3 Exchangeable Normal Variables
5.4 Sampling Distributions Useful in Statistics
5.4.1 Wishart Expectation Identities
5.4.2 * Hotelling’s $T^2$ and Distribution of Quadratic Forms
5.4.3 Distribution of Correlation Coefficient
5.5 Noncentral Distributions
5.6 Some Important Inequalities for Easy Reference
References
6 Finite Sample Theory of Order Statistics and Extremes
6.1 Basic Distribution Theory
6.2 More Advanced Distribution Theory
6.3 Quantile Transformation and Existence of Moments
7 Essential Asymptotics and Applications
7.1 Some Basic Notation and Convergence Concepts
7.2 Laws of Large Numbers
8 Characteristic Functions and Applications
8.1 Characteristic Functions of Standard Distributions
8.2 Inversion and Uniqueness
8.3 Taylor Expansions, Differentiability, and Moments
8.4 Continuity Theorems
8.5 Proof of the CLT and the WLLN
8.6 Producing Characteristic Functions
8.7 Error of the Central Limit Theorem
8.8 Lindeberg–Feller Theorem for General Independent Case
8.9 Infinite Divisibility and Stable Laws
8.10 Some Useful Inequalities
References
9 Asymptotics of Extremes and Order Statistics
9.1 Central-Order Statistics
9.1.1 Single-Order Statistic
9.1.2 Two Statistical Applications
9.1.3 Several Order Statistics
9.2 Extremes
9.2.1 Easily Applicable Limit Theorems
9.2.2 The Convergence of Types Theorem
9.3 Fisher–Tippett Family and Putting it Together
References
10 Markov Chains and Applications
10.1 Notation and Basic Definitions
10.2 Examples and Various Applications as a Model
10.3 Chapman–Kolmogorov Equation
10.4 Communicating Classes
10.5 Gambler’s Ruin
10.6 First Passage, Recurrence, and Transience
10.7 Long Run Evolution and Stationary Distributions
References
11 Random Walks
11.1 Random Walk on the Cubic Lattice
11.1.1 Some Distribution Theory
11.1.2 Recurrence and Transience
11.1.3 Pólya’s Formula for the Return Probability
11.2 First Passage Time and Arc Sine Law
11.3 The Local Time
11.4 Practically Useful Generalizations
11.5 Wald’s Identity
11.6 Fate of a Random Walk
11.7 Chung–Fuchs Theorem
11.8 Six Important Inequalities
References
12 Brownian Motion and Gaussian Processes
12.1 Preview of Connections to the Random Walk
12.2 Basic Definitions
12.2.1 Condition for a Gaussian Process to be Markov
12.2.2 Explicit Construction of Brownian Motion
12.3 Basic Distributional Properties
12.3.1 Reflection Principle and Extremes
12.3.2 Path Properties and Behavior Near Zero and Infinity
12.3.3 Fractal Nature of Level Sets
12.4 The Dirichlet Problem and Boundary Crossing Probabilities
12.4.1 Recurrence and Transience
12.5 The Local Time of Brownian Motion
12.6 Invariance Principle and Statistical Applications
12.7 Strong Invariance Principle and the KMT Theorem
12.8 Brownian Motion with Drift and Ornstein–Uhlenbeck Process
12.8.1 Negative Drift and Density of Maximum
12.8.2 Transition Density and the Heat Equation
12.8.3 The Ornstein–Uhlenbeck Process
References
13 Poisson Processes and Applications
13.1 Notation
13.2 Defining a Homogeneous Poisson Process
13.3 Important Properties and Uses as a Statistical Model
13.4 Linear Poisson Process and Brownian Motion: A Connection
13.5 Higher-Dimensional Poisson Point Processes
13.5.1 The Mapping Theorem
13.6 One-Dimensional Nonhomogeneous Processes
13.7 Campbell’s Theorem and Shot Noise
13.7.1 Poisson Process and Stable Laws
References
14 Discrete Time Martingales and Concentration Inequalities
14.1 Illustrative Examples and Applications in Statistics
14.2 Stopping Times and Optional Stopping
14.2.1 Stopping Times
14.2.2 Optional Stopping
14.2.3 Sufficient Conditions for Optional Stopping Theorem
14.2.4 Applications of Optional Stopping
14.3 Martingale and Concentration Inequalities
14.3.1 Maximal Inequality
14.3.2 Inequalities of Burkholder, Davis, and Gundy
14.3.3 Inequalities of Hoeffding and Azuma
14.3.4 Inequalities of McDiarmid and Devroye
14.3.5 The Upcrossing Inequality
14.4 Convergence of Martingales
14.4.1 The Basic Convergence Theorem
14.4.2 Convergence in $L_1$ and $L_2$
14.5 Reverse Martingales and Proof of SLLN
14.6 Martingale Central Limit Theorem
References
15 Probability Metrics
15.1 Standard Probability Metrics Useful in Statistics
15.2 Basic Properties of the Metrics
15.3 Metric Inequalities
15.4 Differential Metrics for Parametric Families
15.4.1 Fisher Information and Differential Metrics
15.4.2 Rao’s Geodesic Distances on Distributions
References
16 Empirical Processes and VC Theory
16.1 Basic Notation and Definitions
16.2 Classic Asymptotic Properties of the Empirical Process
16.2.1 Invariance Principle and Statistical Applications
16.2.2 Weighted Empirical Process
16.2.3 The Quantile Process
16.2.4 Strong Approximations of the Empirical Process
16.3 Vapnik–Chervonenkis Theory
16.3.1 Basic Theory
16.3.2 Concrete Examples
16.4 CLTs for Empirical Measures and Applications
16.4.1 Notation and Formulation
16.4.2 Entropy Bounds and Specific CLTs
16.4.3 Concrete Examples
16.5 Maximal Inequalities and Symmetrization
16.6 Connection to the Poisson Process
References
17 Large Deviations
17.1 Large Deviations for Sample Means
17.1.1 The Cramér–Chernoff Theorem in $\mathbb{R}$
17.1.2 Properties of the Rate Function
17.1.3 Cramér’s Theorem for General Sets
17.2 The Gärtner–Ellis Theorem and Markov Chain Large Deviations
17.3 The t-Statistic
17.4 Lipschitz Functions and Talagrand’s Inequality
17.5 Large Deviations in Continuous Time
17.5.1 Continuity of a Gaussian Process
17.5.2 Metric Entropy of T and Tail of the Supremum
References
18 The Exponential Family and Statistical Applications
18.1 One-Parameter Exponential Family
18.1.1 Definition and First Examples
18.2 The Canonical Form and Basic Properties
18.2.1 Convexity Properties
18.2.2 Moments and Moment Generating Function
18.2.3 Closure Properties
18.3 Multiparameter Exponential Family
18.4 Sufficiency and Completeness
18.4.1 Neyman–Fisher Factorization and Basu’s Theorem
18.4.2 Applications of Basu’s Theorem to Probability
18.5 Curved Exponential Family
References
19 Simulation and Markov Chain Monte Carlo
19.1 The Ordinary Monte Carlo
19.1.1 Basic Theory and Examples
19.1.2 Monte Carlo P-Values
19.1.3 Rao–Blackwellization
19.2 Textbook Simulation Techniques
19.2.1 Quantile Transformation and Accept–Reject
19.2.2 Importance Sampling and Its Asymptotic Properties
19.2.3 Optimal Importance Sampling Distribution
19.2.4 Algorithms for Simulating from Common Distributions
19.3 Markov Chain Monte Carlo
19.3.1 Reversible Markov Chains
19.3.2 Metropolis Algorithms
19.4 The Gibbs Sampler
19.5 Convergence of MCMC and Bounds on Errors
19.6 MCMC on General Spaces
19.6.1 General Theory and Metropolis Schemes
19.6.2 Convergence
19.6.3 Convergence of the Gibbs Sampler
19.7 Practical Convergence Diagnostics
20.1.3 Higher-Order Accuracy of the Bootstrap
20.1.4 Bootstrap for Dependent Data
20.2 The EM Algorithm
20.2.1 The Algorithm and Examples
20.2.2 Monotone Ascent and Convergence of EM
20.2.3 Modifications of EM
20.3 Kernels and Classification
20.3.1 Smoothing by Kernels
20.3.2 Some Common Kernels in Use
20.3.3 Kernel Density Estimation
20.3.4 Kernels for Statistical Classification
20.3.5 Mercer’s Theorem and Feature Maps
References
A Symbols, Useful Formulas, and Normal Table
A.1 Glossary of Symbols
A.2 Moments and MGFs of Common Distributions
A.3 Normal Table
Author Index
Subject Index
Suggested Courses with Different Themes
Duration   Theme                                          Chapters
15 weeks   Special topics for statistics students         9, 10, 15, 16, 17, 18, 20
15 weeks   Special topics for computer science students   4, 11, 14, 16, 17, 18, 19
8 weeks    Summer course for statistics students          11, 12, 14, 20
8 weeks    Summer course for computer science students    14, 16, 18, 20
8 weeks    Summer course on modeling and simulation       4, 10, 13, 19
Chapter 1
Review of Univariate Probability
Probability is a universally accepted tool for expressing degrees of confidence or doubt about some proposition in the presence of incomplete information or uncertainty. By convention, probabilities are calibrated on a scale of 0 to 1; assigning something a zero probability amounts to expressing the belief that we consider it impossible, whereas assigning a probability of one amounts to considering it a certainty. Most propositions fall somewhere in between. Probability statements that we make can be based on our past experience, or on our personal judgments. Whether our probability statements are based on past experience or subjective personal judgments, they obey a common set of rules, which we can use to treat probabilities in a mathematical framework, and also for making decisions on predictions, for understanding complex systems, or as intellectual experiments and for entertainment. Probability theory is one of the most applicable branches of mathematics. It is used as the primary tool for analyzing statistical methodologies; it is used routinely in nearly every branch of science, such as biology, astronomy and physics, medicine, economics, chemistry, sociology, ecology, finance, and many others. A background in the theory, models, and applications of probability is almost a part of basic education. That is how important it is.
For a classic and lively introduction to the subject of probability, we recommend Feller (1968, 1971). Among numerous other expositions of the theory of probability, a variety of examples on various topics can be seen in Ross (1984), Stirzaker (1994), Pitman (1992), Bhattacharya and Waymire (2009), and DasGupta (2010). Ash (1972), Chung (1974), Breiman (1992), Billingsley (1995), and Dudley (2002) are masterly accounts of measure-theoretic probability.
1.1 Experiments and Sample Spaces
Treatment of probability theory starts with the consideration of a sample space.
The sample space is the set of all possible outcomes in some physical experiment. For example, if a coin is tossed twice and after each toss the face that shows is recorded, then the possible outcomes of this particular coin-tossing experiment, say $\xi$, are HH, HT, TH, TT, with H denoting the occurrence of heads and T denoting the occurrence of tails. We call
$$\Omega = \{HH, HT, TH, TT\}$$
the sample space of the experiment $\xi$.
A. DasGupta, Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics, Springer Texts in Statistics, DOI 10.1007/978-1-4419-9634-3_1, © Springer Science+Business Media, LLC 2011
In general, a sample space is a general set $\Omega$, finite or infinite. An easy example where the sample space $\Omega$ is infinite is to toss a coin until the first time heads show up and record the number of the trial at which the first head appeared. In this case, the sample space $\Omega$ is the countably infinite set
$$\Omega = \{1, 2, 3, \ldots\}.$$
Sample spaces can also be uncountably infinite; for example, consider the experiment of choosing a number at random from the interval $[0, 1]$. The sample space of this experiment is $\Omega = [0, 1]$. In this case, $\Omega$ is an uncountably infinite set. In all cases, individual elements of a sample space are denoted as $\omega$. The first task is to define events and to explain the meaning of the probability of an event.
Definition 1.1 Let $\Omega$ be the sample space of an experiment. Then any subset $A$ of $\Omega$, including the empty set $\emptyset$ and the entire sample space $\Omega$, is called an event.
Events may contain even one single sample point $\omega$, in which case the event is a singleton set $\{\omega\}$. We want to assign probabilities to events. But we want to assign probabilities in a way that they are logically consistent. In fact, this cannot be done in general if we insist on assigning probabilities to arbitrary collections of sample points, that is, arbitrary subsets of the sample space $\Omega$. We can only define probabilities for such subsets of $\Omega$ that are tied together like a family, the exact concept being that of a $\sigma$-field. In most applications, including those cases where the sample space $\Omega$ is infinite, events that we would want to normally think about will be members of such an appropriate $\sigma$-field. So we do not mention the need for consideration of $\sigma$-fields any further, and get along with thinking of events as subsets of the sample space $\Omega$, including in particular the empty set $\emptyset$ and the entire sample space $\Omega$ itself.
Here is a definition of what counts as a legitimate probability on events.
Definition 1.2 Given a sample space $\Omega$, a probability or a probability measure on $\Omega$ is a function $P$ on subsets of $\Omega$ such that
(a) $P(A) \ge 0$ for any $A \subseteq \Omega$;
(b) $P(\Omega) = 1$;
(c) Given disjoint subsets $A_1, A_2, \ldots$ of $\Omega$, $P(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i)$.
Property (c) is known as countable additivity. Note that it is not something that can be proved, but it is like an assumption or an axiom. In our experience, we have seen that operating as if the assumption is correct leads to useful and credible answers in many problems, and so we accept it as a reasonable assumption. Not all probabilists agree that countable additivity is natural; but we do not get into that debate in this book. One important point is that finite additivity is subsumed in countable additivity; that is, if there are some finite number $m$ of disjoint subsets $A_1, A_2, \ldots, A_m$ of $\Omega$, then $P(\cup_{i=1}^{m} A_i) = \sum_{i=1}^{m} P(A_i)$. Also, it is useful to note that the last two conditions in the definition of a probability measure imply that $P(\emptyset)$, the probability of the empty set or the null event, is zero.
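On a finite sample space these properties can be checked mechanically. The following is a minimal sketch, assuming the two-toss sample space from Section 1.1 with equal weights on the four outcomes (the equal weights and all names in the code are illustrative assumptions, not part of the definition); countable additivity reduces to finite additivity here.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space for two tosses of a coin; equal weights are an assumption
# made only for this illustration.
omega = ["HH", "HT", "TH", "TT"]
weight = {w: Fraction(1, 4) for w in omega}


def prob(event):
    """P(event) for an event given as a collection of sample points."""
    return sum((weight[w] for w in set(event)), Fraction(0))


def all_events(points):
    """Every subset of a finite sample space, as tuples."""
    return chain.from_iterable(
        combinations(points, r) for r in range(len(points) + 1)
    )


# (a) nonnegativity and (b) total mass one
assert all(prob(e) >= 0 for e in all_events(omega))
assert prob(omega) == 1

# Finite additivity for two disjoint events
at_least_one_head = {"HH", "HT", "TH"}
no_heads = {"TT"}
assert prob(at_least_one_head | no_heads) == prob(at_least_one_head) + prob(no_heads)

# The empty set gets probability zero
assert prob(set()) == 0
print(prob(at_least_one_head))  # 3/4
```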
One notational convention is that strictly speaking, for an event that is just a singleton set $\{\omega\}$, we should write $P(\{\omega\})$ to denote its probability. But to reduce clutter, we simply use the more convenient notation $P(\omega)$.
One pleasant consequence of the axiom of countable additivity is the following basic result. We do not prove it here as it is a simple result; see DasGupta (2010) for a proof.
Theorem 1.1 Let $A_1 \supseteq A_2 \supseteq A_3 \supseteq \cdots$ be an infinite family of subsets of a sample space $\Omega$ such that $A_n \downarrow A$. Then, $P(A_n) \to P(A)$ as $n \to \infty$.
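A small numerical illustration of this continuity property may help. The sketch below assumes the toss-until-first-head experiment with a fair coin, so that $P(k) = 2^{-k}$ on $\Omega = \{1, 2, 3, \ldots\}$ (the fair coin and the function names are assumptions made for illustration). The events $A_n = \{n+1, n+2, \ldots\}$ decrease to the empty set, and their probabilities decrease to $P(\emptyset) = 0$.

```python
from fractions import Fraction


def p_point(k):
    """P(first head on trial k) for a fair coin (assumed): (1/2)^k."""
    return Fraction(1, 2 ** k)


def p_first_head_after(n):
    """P(A_n) with A_n = {n+1, n+2, ...}, computed as 1 - P(complement).

    The complement {1, ..., n} is finite, so finite additivity applies.
    """
    return 1 - sum((p_point(k) for k in range(1, n + 1)), Fraction(0))


probs = [p_first_head_after(n) for n in range(1, 11)]

# The events are nested and shrink to the empty set, so the
# probabilities are decreasing and approach zero.
assert all(later < earlier for earlier, later in zip(probs, probs[1:]))
print([str(p) for p in probs[:4]])  # ['1/2', '1/4', '1/8', '1/16']
```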
Next, the concept of equally likely sample points is a very fundamental one.
Definition 1.3 Let $\Omega$ be a finite sample space consisting of $N$ sample points. We say that the sample points are equally likely if $P(\omega) = \frac{1}{N}$ for each sample point $\omega$.
An immediate consequence, due to the additivity axiom, is the following useful formula.
Proposition Let $\Omega$ be a finite sample space consisting of $N$ equally likely sample points. Let $A$ be any event and suppose $A$ contains $n$ distinct sample points. Then
$$P(A) = \frac{n}{N} = \frac{\text{Number of sample points favorable to } A}{\text{Total number of sample points}}.$$
Let us see some examples.
Example 1.1 (The Shoe Problem) Suppose there are five pairs of shoes in a closet and four shoes are taken out at random. What is the probability that among the four that are taken out, there is at least one complete pair?
The total number of sample points is $\binom{10}{4} = 210$. Because selection was done completely at random, we assume that all sample points are equally likely. At least one complete pair would mean two complete pairs, or exactly one complete pair and two other nonconforming shoes. Two complete pairs can be chosen in $\binom{5}{2} = 10$ ways. Exactly one complete pair can be chosen in $\binom{5}{1}\binom{4}{2} \cdot 2 \cdot 2 = 120$ ways. The $\binom{5}{1}$ term is for choosing the pair that is complete; the $\binom{4}{2}$ term is for choosing two incomplete pairs, and then from each incomplete pair, one chooses the left or the right shoe. Thus, the probability that there will be at least one complete pair among the four shoes chosen is $(10 + 120)/210 = 13/21 = .62$.
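Because the sample space here has only 210 points, the count can also be confirmed by brute-force enumeration. The following sketch labels each shoe by a hypothetical (pair index, side) tuple and counts the selections of four shoes that contain a complete pair.

```python
from fractions import Fraction
from itertools import combinations

# Label each shoe by (pair index, side); 5 pairs give 10 shoes.
shoes = [(pair, side) for pair in range(5) for side in ("L", "R")]


def has_complete_pair(hand):
    """True if some pair index appears twice, i.e. both L and R were drawn."""
    pairs = [pair for pair, _ in hand]
    return len(set(pairs)) < len(pairs)


hands = list(combinations(shoes, 4))          # all equally likely selections
favorable = sum(has_complete_pair(h) for h in hands)
p = Fraction(favorable, len(hands))
print(len(hands), favorable, p)  # 210 130 13/21
```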
Example 1.2 (Five-Card Poker) In five-card poker, a player is given 5 cards from a full deck of 52 cards at random. Various named hands of varying degrees of rarity exist. In particular, we want to calculate the probabilities of $A$ = two pairs and $B$ = a flush. Two pairs is a hand with 2 cards each of 2 different denominations and the fifth card of some other denomination; a flush is a hand with 5 cards of the same suit, but the cards cannot be of denominations in a sequence.
Counting the choices of the two paired denominations, the two suits within each pair, and the fifth card,
$$P(A) = \frac{\binom{13}{2}\binom{4}{2}^2\binom{44}{1}}{\binom{52}{5}} \approx .0475.$$
To find $P(B)$, note that there are 10 ways to select 5 cards from a suit such that the cards are in a sequence, namely, $\{A, 2, 3, 4, 5\}, \{2, 3, 4, 5, 6\}, \ldots, \{10, J, Q, K, A\}$, and so,
$$P(B) = \frac{\binom{4}{1}\left[\binom{13}{5} - 10\right]}{\binom{52}{5}} \approx .00197.$$
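The sketch below is a numerical check of Example 1.2 under the hand definitions given there. It counts favorable hands with binomial coefficients; the variable names are illustrative, and the counting scheme for two pairs (choose the two paired denominations, the suits within each pair, then a fifth card of another denomination) is the standard argument rather than something quoted from the text.

```python
from fractions import Fraction
from math import comb

total_hands = comb(52, 5)

# Two pairs: choose the 2 paired denominations, 2 suits for each pair,
# then a fifth card from the 44 cards of the other 11 denominations.
two_pairs = comb(13, 2) * comb(4, 2) ** 2 * 44

# Flush (as defined in the text): 5 cards of one suit, excluding the
# 10 sequences A-2-3-4-5, ..., 10-J-Q-K-A within that suit.
flush = 4 * (comb(13, 5) - 10)

p_two_pairs = Fraction(two_pairs, total_hands)
p_flush = Fraction(flush, total_hands)
print(round(float(p_two_pairs), 5), round(float(p_flush), 5))  # 0.04754 0.00197
```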
These are basic examples of counting arguments that are useful whenever there is a finite sample space and we assume that all sample points are equally likely.
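A major result in combinatorial probability is the inclusion–exclusion formula, which says the following.
Theorem 1.2 Let $A_1, A_2, \ldots, A_n$ be $n$ general events. Let
$$S_1 = \sum_{i=1}^{n} P(A_i), \quad S_2 = \sum_{1 \le i < j \le n} P(A_i \cap A_j), \quad S_3 = \sum_{1 \le i < j < k \le n} P(A_i \cap A_j \cap A_k), \quad \ldots$$
Then
$$P(\cup_{i=1}^{n} A_i) = S_1 - S_2 + S_3 - \cdots + (-1)^{n+1} S_n.$$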
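Example 1.3 (Missing Suits in a Bridge Hand) Consider a specific player, say North, in a Bridge game. We want to calculate the probability that North’s hand is void in at least one suit. Towards this, denote the suits as 1, 2, 3, 4 and let
$A_i$ = North’s hand is void in suit $i$.
Then, by the inclusion–exclusion formula,
$$P(\text{North’s hand is void in at least one suit}) = P(A_1 \cup A_2 \cup A_3 \cup A_4) = 4\frac{\binom{39}{13}}{\binom{52}{13}} - 6\frac{\binom{26}{13}}{\binom{52}{13}} + 4\frac{\binom{13}{13}}{\binom{52}{13}} \approx .051.$$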
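The bridge calculation can be sketched in a few lines, assuming a standard 52-card deck with four suits of 13 cards and a 13-card hand. Here `s_term(j)` is a hypothetical helper for the sum of the probabilities of all $j$-fold intersections of the void events; by symmetry each $j$-fold intersection has the same probability.

```python
from fractions import Fraction
from math import comb

SUITS, SUIT_SIZE, HAND = 4, 13, 13
DECK = SUITS * SUIT_SIZE


def s_term(j):
    """S_j: sum over all j-subsets of suits of P(hand is void in all j suits)."""
    remaining = DECK - j * SUIT_SIZE          # cards outside the j void suits
    return comb(SUITS, j) * Fraction(comb(remaining, HAND), comb(DECK, HAND))


# Inclusion-exclusion: S_1 - S_2 + S_3 - S_4 (S_4 is zero: a hand cannot
# be void in all four suits).
p_void = sum((-1) ** (j + 1) * s_term(j) for j in range(1, SUITS + 1))
print(round(float(p_void), 4))  # 0.0511
```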
The inclusion–exclusion formula can be hard to apply exactly, because the quantities $S_j$ for large indices $j$ can be difficult to calculate. However, fortunately, the inclusion–exclusion formula leads to bounds in both directions for the probability of the union of $n$ general events. We have the following series of bounds.
Theorem 1.3 (Bonferroni Bounds) Given $n$ events $A_1, A_2, \ldots, A_n$, let $p_n = P(\cup_{i=1}^{n} A_i)$. Then,
$$p_n \le S_1; \quad p_n \ge S_1 - S_2; \quad p_n \le S_1 - S_2 + S_3; \quad \ldots$$
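The alternating nature of these bounds can be seen on a small example. The sketch below uses a hypothetical experiment (one roll of a fair 12-sided die) and four arbitrarily chosen events; it computes each $S_j$ as the sum of the probabilities of all $j$-fold intersections and checks that the partial sums $S_1$, $S_1 - S_2$, $S_1 - S_2 + S_3$, ... alternately lie above and below the probability of the union.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical example: one roll of a fair 12-sided die.
omega = set(range(1, 13))
events = [
    {w for w in omega if w % 2 == 0},   # even
    {w for w in omega if w % 3 == 0},   # multiple of 3
    {w for w in omega if w <= 4},       # at most 4
    {w for w in omega if w >= 9},       # at least 9
]


def prob(event):
    return Fraction(len(event), len(omega))


def s_term(j):
    """S_j: sum of P(intersection) over all j-subsets of the events."""
    return sum(
        (prob(set.intersection(*combo)) for combo in combinations(events, j)),
        Fraction(0),
    )


p_union = prob(set.union(*events))

partial_sums = []
running = Fraction(0)
for j in range(1, len(events) + 1):
    running += (-1) ** (j + 1) * s_term(j)
    partial_sums.append(running)

# Odd-order partial sums are upper bounds, even-order ones are lower bounds.
for j, bound in enumerate(partial_sums, start=1):
    if j % 2 == 1:
        assert p_union <= bound
    else:
        assert p_union >= bound

# The full alternating sum recovers the union probability exactly.
assert partial_sums[-1] == p_union
print(p_union, [str(b) for b in partial_sums])
```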
1.2 Conditional Probability and Independence
Both conditional probability and independence are fundamental concepts for probabilists and statisticians alike. Conditional probabilities correspond to updating one’s beliefs when new information becomes available. Independence corresponds to irrelevance of a piece of new information, even when it is made available. In addition, the assumption of independence can and does significantly simplify development, mathematical analysis, and justification of tools and procedures.
Definition 1.4 Let $A, B$ be general events with respect to some sample space $\Omega$, and suppose $P(A) > 0$. The conditional probability of $B$ given $A$ is defined as
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}.$$
Some immediate consequences of the definition of a conditional probability are the following.
Theorem 1.4 (a) (Multiplicative Formula) For any two events $A$, $B$ such that $P(A) > 0$, one has $P(A \cap B) = P(A)P(B \mid A)$;
(b) For any two events $A$, $B$ such that $0 < P(A) < 1$, one has $P(B) = P(B \mid A)P(A) + P(B \mid A^c)P(A^c)$;
(c) (Total Probability Formula) If $A_1, A_2, \ldots, A_k$ form a partition of the sample space (i.e., $A_i \cap A_j = \emptyset$ for all $i \ne j$, and $\cup_{i=1}^{k} A_i = \Omega$), and if $P(A_i) > 0$ for all $i$, then $P(B) = \sum_{i=1}^{k} P(B \mid A_i)P(A_i)$;
(d) (Hierarchical Multiplicative Formula) Let $A_1, A_2, \ldots, A_k$ be $k$ general events in a sample space $\Omega$. Then
$$P(A_1 \cap A_2 \cap \cdots \cap A_k) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2) \cdots P(A_k \mid A_1 \cap A_2 \cap \cdots \cap A_{k-1}).$$
Example 1.4 One of two urns has $a$ red and $b$ black balls, and the other has $c$ red and $d$ black balls. One ball is chosen at random from each urn, and then one of these two balls is chosen at random. What is the probability that this ball is red?
If each ball selected from the two urns is red, then the final ball is definitely red. If exactly one of those two balls is red, then the final ball is red with probability 1/2. If none of those two balls is red, then the final ball cannot be red. Thus,
$$P(\text{The final ball is red}) = \frac{a}{a+b}\,\frac{c}{c+d} + \frac{1}{2}\left[\frac{a}{a+b}\,\frac{d}{c+d} + \frac{b}{a+b}\,\frac{c}{c+d}\right] = \frac{2ac + ad + bc}{2(a+b)(c+d)}.$$
Although the total percentage of red balls in the two urns is more than 98%, the chance that the final ball selected would be red is just about 75%.
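The two-urn calculation is easy to reproduce numerically. The sketch below conditions on how many of the two drawn balls are red. The urn compositions `a, b, c, d = 99, 1, 1, 1` are an assumption made here for illustration (the text does not show which values it used); they happen to give a red share above 98% and a final probability of about 75%.

```python
from fractions import Fraction

def prob_final_red(a, b, c, d):
    """P(final ball is red), conditioning on the colors of the two drawn balls."""
    p1 = Fraction(a, a + b)          # red from urn 1
    p2 = Fraction(c, c + d)          # red from urn 2
    both_red = p1 * p2
    exactly_one_red = p1 * (1 - p2) + (1 - p1) * p2
    return both_red * 1 + exactly_one_red * Fraction(1, 2)

# Hypothetical urn compositions, chosen only for illustration.
a, b, c, d = 99, 1, 1, 1
share_red = Fraction(a + c, a + b + c + d)
p_red = prob_final_red(a, b, c, d)

print(float(share_red), float(p_red))
```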
Example 1.5 (A Clever Conditioning Argument) Coin A gives heads with probability $s$ and coin B gives heads with probability $t$. They are tossed alternately, starting off with coin A. We want to find the probability that the first head is obtained on coin A.
We find this probability by conditioning on the outcomes of the first two tosses; more precisely, define
$A_1 = \{H\}$ = First toss gives H; $A_2 = \{TH\}$; $A_3 = \{TT\}$.
Let also,
$A$ = The first head is obtained on coin A.
One of the three events $A_1, A_2, A_3$ must happen, and they are also mutually exclusive. Therefore, by the total probability formula,
$$P(A) = \sum_{i=1}^{3} P(A \mid A_i)P(A_i) = 1 \cdot s + 0 \cdot (1-s)t + P(A)(1-s)(1-t),$$
because after two tails the experiment starts afresh with coin A. Solving for $P(A)$,
$$P(A) = \frac{s}{1 - (1-s)(1-t)} = \frac{s}{s + t - st}.$$
As an example, let $s = .4$, $t = .5$. Note that coin A is biased against heads. Even then, $s/(s + t - st) = .57 > .5$. We see that there is an advantage in starting first.
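As a sanity check on the conditioning argument, the closed form $s/(s + t - st)$ can be compared with a direct summation over the round in which coin A produces the first head. This is a minimal sketch; the function names are illustrative, and the tosses are assumed independent.

```python
def prob_first_head_on_a(s, t):
    """Closed form obtained by conditioning on the first two tosses."""
    return s / (s + t - s * t)

def prob_first_head_on_a_series(s, t, rounds=500):
    """Sum over r of P(first 2r tosses are tails, then coin A shows a head)."""
    both_tails = (1 - s) * (1 - t)
    return sum(both_tails ** r * s for r in range(rounds))

s, t = 0.4, 0.5
print(prob_first_head_on_a(s, t), prob_first_head_on_a_series(s, t))
```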
Definition 1.5 A collection of events $A_1, A_2, \ldots, A_n$ is said to be mutually independent (or just independent) if for each $k$, $1 \le k \le n$, and any $k$ of the events, $A_{i_1}, \ldots, A_{i_k}$, $P(A_{i_1} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) \cdots P(A_{i_k})$. They are called pairwise independent if this property holds for $k = 2$.
Example 1.6 (Lotteries) Although many people buy lottery tickets out of an expectation of good luck, probabilistically speaking, buying lottery tickets is usually a waste of money. Here is an example. Suppose in a weekly state lottery, five of the numbers $00, 01, \ldots, 49$ are selected without replacement at random, and someone holding exactly those numbers wins the lottery. Then, the probability that someone holding one ticket will be the winner in a given week is
$$\frac{1}{\binom{50}{5}} = 4.72 \times 10^{-7}.$$
Suppose this person buys a ticket every week for 40 years. Then, the probability that he will win the lottery on at least one week is $1 - (1 - 4.72 \times 10^{-7})^{52 \times 40} = .00098 < .001$; still a very small probability. We assumed in this calculation that the weekly lotteries are all mutually independent, a reasonable assumption. The calculation would fall apart if we did not make this independence assumption.
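The lottery figures can be reproduced in a few lines. This sketch assumes, as in the example, 52 independent weekly drawings per year for 40 years and a single ticket per week; the variable names are illustrative.

```python
from math import comb

# One ticket wins if it matches the 5 numbers drawn from 50 without replacement.
p_week = 1 / comb(50, 5)

# Independent weekly drawings: P(at least one win) = 1 - P(no win in any week).
weeks = 52 * 40
p_at_least_once = 1 - (1 - p_week) ** weeks

print(p_week, p_at_least_once)
```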
It is not uncommon to see the conditional probabilities $P(A \mid B)$ and $P(B \mid A)$ confused with each other. Suppose in some group of lung cancer patients, we see a large percentage of smokers. If we define $B$ to be the event that a person is a smoker, and $A$ to be the event that a person has lung cancer, then all we can conclude is that in our group of people $P(B \mid A)$ is large. But we cannot conclude from just this information that smoking increases the chance of lung cancer, that is, that $P(A \mid B)$ is large. In order to calculate a conditional probability $P(A \mid B)$ when we know the other conditional probability $P(B \mid A)$, a simple formula known as Bayes’ theorem is useful. Here is a statement of a general version of Bayes’ theorem.
Theorem 1.5 Let $\{A_1, A_2, \ldots, A_m\}$ be a partition of a sample space $\Omega$. Let $B$ be some fixed event. Then
$$P(A_j \mid B) = \frac{P(B \mid A_j)P(A_j)}{\sum_{i=1}^{m} P(B \mid A_i)P(A_i)}.$$
Example 1.7 (Multiple Choice Exams) Suppose that the questions in a multiple choice exam have five alternatives each, of which a student has to pick one as the correct alternative. A student either knows the truly correct alternative with probability $.7$, or she randomly picks one of the five alternatives as her choice. Suppose a particular problem was answered correctly. We want to know what the probability is that the student really knew the correct answer.
Define
$A$ = The student knew the correct answer; $B$ = The student answered the question correctly.
We want to compute $P(A \mid B)$. By Bayes’ theorem,
$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A^c)P(A^c)} = \frac{1 \times .7}{1 \times .7 + .2 \times .3} = .921.$$
Before the student answered the question, our probability that she would know the correct answer to the question was $.7$; but once she answered it correctly, the posterior probability that she knew the correct answer increases to $.921$. This is exactly what Bayes’ theorem does; it updates our prior belief to the posterior belief, when new evidence becomes available.
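The prior-to-posterior update can be written as a small general-purpose function. This is a sketch only: `posterior` is an illustrative name, and it simply applies the partition form of Bayes’ theorem to a list of priors and likelihoods, here the two-event partition (knew the answer, guessed) of the exam example.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Bayes' theorem over a partition: P(A_j | B) for every j."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B | A_j) P(A_j)
    total = sum(joint)                                    # P(B), by total probability
    return [j / total for j in joint]

# Partition: A (knew the answer), A^c (guessed among five alternatives).
priors = [Fraction(7, 10), Fraction(3, 10)]
likelihoods = [Fraction(1), Fraction(1, 5)]   # P(correct | A), P(correct | A^c)

post = posterior(priors, likelihoods)
print([round(float(p), 3) for p in post])
```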
1.3 Integer-Valued and Discrete Random Variables
In some sense, the entire subject of probability and statistics is about distributions of random variables. Random variables, as the very name suggests, are quantities that vary, over time, or from individual to individual, and the reason for the variability is some underlying random process. Depending on exactly how an underlying experiment ends, the random variable takes different values. In other words, the value of the random variable is determined by the sample point $\omega$ that prevails, when the underlying experiment is actually conducted. We cannot know a priori the value of the random variable, because we do not know a priori which sample point $\omega$ will prevail when the experiment is conducted. We try to understand the behavior of a random variable by analyzing the probability structure of that underlying random experiment.
Random variables, like probabilities, originated in gambling. Therefore, the random variables that come to us more naturally are integer-valued random variables; for example, the sum of the two rolls when a die is rolled twice. Integer-valued random variables are special cases of what are known as discrete random variables. Discrete or not, a common mathematical definition of all random variables is the following.
Definition 1.6 Let $\Omega$ be a sample space corresponding to some experiment and let $X : \Omega \to \mathbb{R}$ be a function from the sample space to the real line. Then $X$ is called a random variable.
Discrete random variables are those that take a finite or a countably infinite number of possible values. In particular, all integer-valued random variables are discrete. From the point of view of understanding the behavior of a random variable, the important thing is to know the probabilities with which $X$ takes its different possible values.
Definition 1.7 Let $X : \Omega \to \mathbb{R}$ be a discrete random variable taking a finite or countably infinite number of values $x_1, x_2, x_3, \ldots$. The probability distribution or the probability mass function (pmf) of $X$ is the function $p(x) = P(X = x)$, $x = x_1, x_2, x_3, \ldots$, and $p(x) = 0$, otherwise.
It is common to not explicitly mention the phrase “$p(x) = 0$ otherwise,” and we generally follow this convention. Some authors use the phrase mass function instead of probability mass function.
For any pmf, one must have $p(x) \ge 0$ for any $x$, and $\sum_i p(x_i) = 1$. Any function satisfying these two properties for some set of numbers $x_1, x_2, x_3, \ldots$ is a valid pmf.
1.3.1 CDF and Independence
A second important definition is that of a cumulative distribution function (CDF). The CDF gives the probability that a random variable $X$ is less than or equal to any given number $x$. It is important to understand that the notion of a CDF is universal to all random variables; it is not limited to only the discrete ones.
Definition 1.8 The cumulative distribution function of a random variable $X$ is the function $F(x) = P(X \le x)$, $x \in \mathbb{R}$.
Definition 1.9 Let $X$ have the CDF $F(x)$. Any number $m$ such that $P(X \le m) \ge .5$, and also $P(X \ge m) \ge .5$ is called a median of $F$, or equivalently, a median of $X$.
Remark The median of a random variable need not be unique. A simple way to characterize all the medians of a distribution is available.
Proposition Let $X$ be a random variable with the CDF $F(x)$. Let $m_0$ be the first $x$ such that $F(x) \ge .5$, and let $m_1$ be the last $x$ such that $P(X \ge x) \ge .5$. Then, a number $m$ is a median of $X$ if and only if $m \in [m_0, m_1]$.
The CDF of any random variable satisfies a set of properties. Conversely, any function satisfying these properties is a valid CDF; that is, it will be the CDF of some appropriately chosen random variable. These properties are given in the next result.
Theorem 1.6 A function $F(x)$ is the CDF of some real-valued random variable $X$ if and only if it satisfies all of the following properties.
(a) $0 \le F(x) \le 1$ $\forall x \in \mathbb{R}$.
(b) $F(x) \to 0$ as $x \to -\infty$, and $F(x) \to 1$ as $x \to \infty$.
(c) Given any real number $a$, $F(x) \downarrow F(a)$ as $x \downarrow a$.
(d) Given any two real numbers $x, y$, $x < y$, $F(x) \le F(y)$.
Property (c) is called continuity from the right, or simply right continuity. It is clear that a CDF need not be continuous from the left; indeed, for discrete random variables, the CDF has a jump at the values of the random variable, and at the jump points, the CDF is not left continuous. More precisely, one has the following result.
Proposition Let $F(x)$ be the CDF of some random variable $X$. Then, for any $x$,
(a) $P(X = x) = F(x) - \lim_{y \uparrow x} F(y) = F(x) - F(x-)$, including those points $x$ for which $P(X = x) = 0$.
(b) $P(X \ge x) = P(X > x) + P(X = x) = (1 - F(x)) + (F(x) - F(x-)) = 1 - F(x-)$.
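Example 1.8 (Bridge) Consider the random variable
$X$ = Number of aces in North’s hand in a Bridge game.
Clearly, $X$ can take any of the values $x = 0, 1, 2, 3, 4$. If $X = x$, then the other $13 - x$ cards in North’s hand must be non-ace cards. Thus, the pmf of $X$ is
$$P(X = x) = \frac{\binom{4}{x}\binom{48}{13-x}}{\binom{52}{13}}, \quad x = 0, 1, 2, 3, 4.$$
The CDF of $X$ is a jump function, taking jumps at the values $0, 1, 2, 3, 4$, namely the possible values of $X$. The CDF is
$$F(x) = 0 \ \text{if } x < 0; \quad F(x) = \sum_{j=0}^{\lfloor x \rfloor} P(X = j) \ \text{if } 0 \le x < 4; \quad F(x) = 1 \ \text{if } x \ge 4.$$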
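The pmf and the jump heights of the CDF in the Bridge example can be tabulated directly. A minimal sketch, assuming a 13-card hand dealt at random from a standard deck; `pmf` and `cdf` are illustrative names.

```python
from fractions import Fraction
from itertools import accumulate
from math import comb

# P(X = x): choose x of the 4 aces and 13 - x of the 48 non-aces.
pmf = {
    x: Fraction(comb(4, x) * comb(48, 13 - x), comb(52, 13))
    for x in range(5)
}

# The CDF at the jump points is the running sum of the pmf.
cdf = dict(zip(pmf, accumulate(pmf.values())))

for x in pmf:
    print(x, round(float(pmf[x]), 4), round(float(cdf[x]), 4))
```

Between consecutive integers the CDF stays flat at the value printed for the smaller integer.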
Example 1.9 (Indicator Variables) Consider the experiment of rolling a fair die twice and now define a random variable $Y$ as follows.
$Y = 1$ if the sum of the two rolls $X$ is an even number; $Y = 0$ if the sum of the two rolls $X$ is an odd number.
If we let $A$ be the event that $X$ is an even number, then $Y = 1$ if $A$ happens, and $Y = 0$ if $A$ does not happen. Such random variables are called indicator random variables and are immensely useful in mathematical calculations in many complex situations.
Definition 1.10 Let $A$ be any event in a sample space $\Omega$. The indicator random variable for $A$ is defined as
$I_A = 1$ if $A$ happens; $I_A = 0$ if $A$ does not happen.
Thus, the distribution of an indicator variable is simply $P(I_A = 1) = P(A)$; $P(I_A = 0) = 1 - P(A)$.
An indicator variable is also called a Bernoulli variable with parameter $p$, where $p$ is just $P(A)$. We later show examples of uses of indicator variables in calculation of expectations.
In applications, we are sometimes interested in the distribution of a function, say $g(X)$, of a basic random variable $X$. In the discrete case, the distribution of a function is found in the obvious way.
Proposition Let $X$ be a discrete random variable and $Y = g(X)$ a real-valued function of $X$. Then, $P(Y = y) = \sum_{x : g(x) = y} p(x)$.
Note that $g(X)$ is a one-to-one function of $X$, but $h(X)$ is not one-to-one. The values of $Y$ are $0, \pm 1, \pm 8, \pm 27$. For example, $P(Y = 0) = P(X = 0) = c = 5/13$; $P(Y = 1) = P(X = 1) = c/2 = 5/26$, and so on. In general, for $y = 0,$
So, for example, $P(Z = 0) = P(X = -2) + P(X = 0) + P(X = 2) = \frac{7}{5}c = 7/13$. The pmf of $Z = h(X)$ is:
$P(Z = z)$: 3/13, 7/13, 3/13
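The rule $P(Y = y) = \sum_{x : g(x) = y} p(x)$ translates directly into code. The sketch below makes two assumptions that should be read as such: the pmf $p(x) = c/(1 + x^2)$ on $x = 0, \pm 1, \pm 2, \pm 3$ with $c = 5/13$ is inferred from the probabilities quoted above, and the non-one-to-one function $x \mapsto x^2$ is a hypothetical stand-in chosen for illustration, not the $h$ of the example.

```python
from collections import defaultdict
from fractions import Fraction

def pmf_of_function(pmf, g):
    """pmf of Y = g(X): add up p(x) over all x that map to the same y."""
    out = defaultdict(Fraction)
    for x, p in pmf.items():
        out[g(x)] += p
    return dict(out)

# Assumed pmf, consistent with P(X = 0) = c and P(X = 1) = c/2 for c = 5/13.
c = Fraction(5, 13)
pmf_x = {x: c / (1 + x * x) for x in range(-3, 4)}

pmf_cube = pmf_of_function(pmf_x, lambda x: x ** 3)    # one-to-one
pmf_square = pmf_of_function(pmf_x, lambda x: x * x)   # hypothetical, not one-to-one

print(pmf_cube)
print(pmf_square)
```

For a one-to-one function the probabilities are simply relabelled; otherwise several values of $x$ pool their mass into one value of $y$.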
A key concept in probability is that of independence of a collection of random variables. The collection could be finite or infinite. In the infinite case, we want each finite subcollection of the random variables to be independent. The definition of independence of a finite collection is as follows.
Definition 1.11 Let $X_1, X_2, \ldots, X_k$ be $k \ge 2$ discrete random variables defined on the same sample space $\Omega$. We say that $X_1, X_2, \ldots, X_k$ are independent if
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k) = P(X_1 = x_1)P(X_2 = x_2) \cdots P(X_k = x_k), \quad \forall\, x_1, x_2, \ldots, x_k.$$
It follows from the definition of independence of random variables that if $X_1, X_2$ are independent, then any function of $X_1$ and any function of $X_2$ are also independent. In fact, we have a more general result.
Theorem 1.7 Let $X_1, X_2, \ldots, X_k$ be $k \ge 2$ discrete random variables, and suppose they are independent. Let $U = f(X_1, X_2, \ldots, X_i)$ be some function of $X_1, X_2, \ldots, X_i$, and $V = g(X_{i+1}, \ldots, X_k)$ be some function of $X_{i+1}, \ldots, X_k$. Then, $U$ and $V$ are independent.
This result is true for any type of random variables $X_1, X_2, \ldots, X_k$, not just discrete ones.
A common notation of wide use in probability and statistics is now introduced.
If $X_1, X_2, \ldots, X_k$ are independent, and moreover have the same CDF, say $F$, then we say that $X_1, X_2, \ldots, X_k$ are iid (or IID) and write $X_1, X_2, \ldots, X_k \stackrel{iid}{\sim} F$. The abbreviation iid (IID) means independent and identically distributed.
Example 1.11 (Two Simple Illustrations) Consider the experiment of tossing a fair coin (or any coin) four times. Suppose $X_1$ is the number of heads in the first two tosses, and $X_2$ is the number of heads in the last two tosses. Then, it is intuitively clear that $X_1, X_2$ are independent, because the last two tosses carry no information regarding the first two tosses. The independence can be easily mathematically verified by using the definition of independence.
Next, consider the experiment of drawing 13 cards at random from a deck of 52 cards. Suppose $X_1$ is the number of aces and $X_2$ is the number of clubs among the 13 cards. Then, $X_1, X_2$ are not independent. For example, $P(X_1 = 4, X_2 = 0) = 0$, but $P(X_1 = 4)$ and $P(X_2 = 0)$ are both $> 0$, and so $P(X_1 = 4)P(X_2 = 0) > 0$. So, $X_1, X_2$ cannot be independent.
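The independence claimed in the coin-tossing illustration can be verified from the definition by enumerating the sample space. This is a minimal sketch for a fair coin (the fair-coin weights are an assumption of the sketch); it checks the product rule for every pair of values.

```python
from fractions import Fraction
from itertools import product

# Sample space: 16 equally likely outcomes of four tosses of a fair coin.
outcomes = list(product("HT", repeat=4))
weight = Fraction(1, len(outcomes))

# Joint pmf of X1 (heads in tosses 1-2) and X2 (heads in tosses 3-4).
joint = {}
for w in outcomes:
    x1 = w[:2].count("H")
    x2 = w[2:].count("H")
    joint[(x1, x2)] = joint.get((x1, x2), 0) + weight

# Marginal pmfs obtained by summing the joint pmf.
marg1 = {a: sum(p for (x1, _), p in joint.items() if x1 == a) for a in range(3)}
marg2 = {b: sum(p for (_, x2), p in joint.items() if x2 == b) for b in range(3)}

# Definition 1.11: the joint pmf must factor for every pair (a, b).
independent = all(
    joint.get((a, b), 0) == marg1[a] * marg2[b]
    for a in range(3)
    for b in range(3)
)
print(independent)
```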
1.3.2 Expectation and Moments
By definition, a random variable takes different values on different occasions. It is natural to want to know what value it takes on average. Averaging is a very primitive concept. A simple average of just the possible values of the random variable will be misleading, because some values may have so little probability that they are relatively inconsequential. The average or the mean value, also called the expected value of a random variable, is a weighted average of the different values of $X$, weighted according to how important the value is. Here is the definition.
Definition 1.12 Let $X$ be a discrete random variable. We say that the expected value of $X$ exists if $\sum_i |x_i| p(x_i) < \infty$, in which case the expected value of $X$ is defined as
$$\mu = E(X) = \sum_i x_i p(x_i).$$
The expected value is also known as the expectation or the mean of $X$.
If the set of possible values of $X$ is infinite, then the infinite sum $\sum_i x_i p(x_i)$ can take different values when its terms are rearranged, unless $\sum_i |x_i| p(x_i) < \infty$; this is why absolute convergence is required in the definition.
If the sample space of the underlying experiment is finite or countably infinite, then we can also calculate the expectation by averaging directly over the sample space.
Proposition Suppose the sample space $\Omega$ is finite or countably infinite and $X$ is a discrete random variable with expectation $\mu$. Then
$$\mu = \sum_{\omega} X(\omega)P(\omega),$$
where $P(\omega)$ is the probability of the sample point $\omega$.
Important Point Although it is not the focus of this chapter, in applications we are often interested in more than one variable at the same time. To be specific, consider two discrete random variables $X, Y$ defined on a common sample space $\Omega$. Then we could construct new random variables out of $X$ and $Y$, for example, $XY$, $X + Y$, $X^2 + Y^2$, and so on. We can then talk of their expectations as well. Here is a general definition of expectation of a function of more than one random variable.
Definition 1.13 Let $X_1, X_2, \ldots, X_n$ be $n$ discrete random variables, all defined on a common sample space $\Omega$, with a finite or a countably infinite number of sample points. We say that the expectation of a function $g(X_1, X_2, \ldots, X_n)$ exists if $\sum_{\omega} |g(X_1(\omega), \ldots, X_n(\omega))| P(\omega) < \infty$, in which case the expected value of $g(X_1, X_2, \ldots, X_n)$ is defined as
$$E[g(X_1, X_2, \ldots, X_n)] = \sum_{\omega} g(X_1(\omega), \ldots, X_n(\omega)) P(\omega).$$
Proposition (a) If there exists a finite constant $c$ such that $P(X = c) = 1$, then $E(X) = c$.
(b) If $X, Y$ are random variables defined on the same sample space $\Omega$ with finite expectations, and if $P(X \le Y) = 1$, then $E(X) \le E(Y)$.
(c) If $X$ has a finite expectation, and if $P(X \ge c) = 1$, then $E(X) \ge c$. If $P(X \le c) = 1$, then $E(X) \le c$.
Proposition (Linearity of Expectations) Let $X_1, X_2, \ldots, X_n$ be random variables defined on the same sample space, and $c_1, c_2, \ldots, c_n$ any real-valued constants. Then, provided $E(X_i)$ exists for every $X_i$,
$$E\left(\sum_{i=1}^{n} c_i X_i\right) = \sum_{i=1}^{n} c_i E(X_i);$$
in particular, $E(cX) = cE(X)$ and $E(X_1 + X_2) = E(X_1) + E(X_2)$, whenever the expectations exist.
The following fact also follows easily from the definition of the pmf of a function of a random variable. The result says that the expectation of a function of a random variable $X$ can be calculated directly using the pmf of $X$ itself, without having to calculate the pmf of the function.
Proposition (Expectation of a Function) Let $X$ be a discrete random variable on a sample space $\Omega$ with a finite or countable number of sample points, and $Y = g(X)$ a function of $X$. Then,
$$E(Y) = \sum_{\omega} Y(\omega)P(\omega) = \sum_{x} g(x)p(x),$$
provided $E(Y)$ exists.
Caution If $g(X)$ is a linear function of $X$, then, of course, $E(g(X)) = g(E(X))$. But, in general, the two things are not equal. For example, $E(X^2)$ is not the same as $(E(X))^2$; indeed, $E(X^2) > (E(X))^2$ for any random variable $X$ that is not a constant.
A very important property of independent random variables is the following factorization result on expectations.
Theorem 1.8 Suppose $X_1, X_2, \ldots, X_n$ are independent random variables. Then, provided each expectation exists,
$$E(X_1 X_2 \cdots X_n) = E(X_1)E(X_2) \cdots E(X_n).$$
Let us now show some more illustrative examples.
Example 1.12 Let $X$ be the number of heads obtained in two tosses of a fair coin. The pmf of $X$ is $p(0) = p(2) = 1/4$, $p(1) = 1/2$. Therefore, $E(X) = 0 \times 1/4 + 1 \times 1/2 + 2 \times 1/4 = 1$. Because the coin is fair, we expect it to show heads 50% of the number of times it is tossed, which is 50% of 2, that is, 1.
Example 1.13 (Dice Sum) Let $X$ be the sum of the two rolls when a fair die is rolled twice. The pmf of $X$ is $p(2) = p(12) = 1/36$; $p(3) = p(11) = 2/36$; $p(4) = p(10) = 3/36$; $p(5) = p(9) = 4/36$; $p(6) = p(8) = 5/36$; $p(7) = 6/36$. Therefore, $E(X) = 2 \times 1/36 + 3 \times 2/36 + 4 \times 3/36 + \cdots + 12 \times 1/36 = 7$. This can also be seen by letting $X_1$ = the face obtained on the first roll, $X_2$ = the face obtained on the second roll, and by using $E(X) = E(X_1 + X_2) = E(X_1) + E(X_2) = 3.5 + 3.5 = 7$.
Let us now make this problem harder. Suppose that a fair die is rolled 10 times and $X$ is the sum of all 10 rolls. The pmf of $X$ is no longer so simple; it will be cumbersome to write it down. But, if we let $X_i$ = the face obtained on the $i$th roll, it is still true by the linearity of expectations that $E(X) = E(X_1 + X_2 + \cdots + X_{10}) = E(X_1) + E(X_2) + \cdots + E(X_{10}) = 3.5 \times 10 = 35$. We can easily compute the expectation, although the pmf would be difficult to write down.
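Both routes to the expected dice sum can be compared in code: the direct route builds the pmf of the sum and averages, and the linearity route only needs the mean of a single roll. A minimal sketch with illustrative variable names:

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Direct route: pmf of the sum of two rolls, then E(X) = sum of x * p(x).
pmf_sum = {}
for r1, r2 in product(faces, repeat=2):
    pmf_sum[r1 + r2] = pmf_sum.get(r1 + r2, 0) + Fraction(1, 36)
mean_from_pmf = sum(x * p for x, p in pmf_sum.items())

# Linearity route: the mean of one roll, multiplied by the number of rolls.
mean_one_roll = Fraction(sum(faces), 6)
mean_two_rolls = 2 * mean_one_roll
mean_ten_rolls = 10 * mean_one_roll   # no pmf of the 10-roll sum is needed

print(mean_from_pmf, mean_two_rolls, mean_ten_rolls)
```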
Example 1.14 (A Random Variable Without a Finite Expectation) Let $X$ take the positive integers $1, 2, 3, \ldots$ as its values with the pmf
$$p(x) = P(X = x) = \frac{1}{x(x+1)}, \quad x = 1, 2, 3, \ldots.$$
This is a valid pmf, because obviously $\frac{1}{x(x+1)} > 0$ for any $x = 1, 2, 3, \ldots$, and also the infinite series $\sum_{x=1}^{\infty} \frac{1}{x(x+1)}$ sums to 1, a fact from calculus. Now,
$$E(X) = \sum_{x=1}^{\infty} x\,p(x) = \sum_{x=1}^{\infty} \frac{1}{x+1} = \infty,$$
also a fact from calculus.
This example shows that not all random variables have a finite expectation. Here, the reason for the infiniteness of $E(X)$ is that $X$ takes large integer values $x$ with probabilities $p(x)$ that are not adequately small. The large values are realized sufficiently often that on average $X$ becomes larger than any given finite number.
The zero–one nature of indicator random variables is extremely useful for calculating expectations of certain integer-valued random variables whose distributions are sometimes so complicated that it would be difficult to find their expectations directly from definition. We describe the technique and some illustrations of it below.
Proposition Let $X$ be an integer-valued random variable such that it can be represented as $X = \sum_{i=1}^{m} c_i I_{A_i}$ for some $m$, constants $c_1, c_2, \ldots, c_m$, and suitable events $A_1, A_2, \ldots, A_m$. Then, $E(X) = \sum_{i=1}^{m} c_i P(A_i)$.
Example 1.15 (Coin Tosses) Suppose a coin that has probability $p$ of showing heads in any single toss is tossed $n$ times, and let $X$ denote the number of times in the $n$ tosses that a head is obtained. Then, $X = \sum_{i=1}^{n} I_{A_i}$, where $A_i$ is the event that a head is obtained in the $i$th toss. Therefore, $E(X) = \sum_{i=1}^{n} P(A_i) = \sum_{i=1}^{n} p = np$. A direct calculation of the expectation would involve finding the pmf of $X$ and obtaining the sum $\sum_{x=0}^{n} x P(X = x)$; it can also be done that way, but that is a much longer calculation.
The random variable $X$ of this example is a binomial random variable with parameters $n$ and $p$. Its pmf is given by the formula $P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$, $x = 0, 1, 2, \ldots, n$.
Example 1.16 (Consecutive Heads in Coin Tosses) Suppose a coin with probability $p$ for heads in a single toss is tossed $n$ times. How many times can we expect to see a head followed by at least one more head? For example, if $n = 5$, and we see the outcomes HTHHH, then we see a head followed by at least one more head twice.
Define $A_i$ = The $i$th and the $(i+1)$th toss both result in heads. Then $X$ = number of times a head is followed by at least one more head $= \sum_{i=1}^{n-1} I_{A_i}$, and so $E(X) = \sum_{i=1}^{n-1} P(A_i) = \sum_{i=1}^{n-1} p^2 = (n-1)p^2$. For example, if a fair coin is tossed 20 times, we can expect to see a head followed by another head about five times ($19 \times .5^2 = 4.75$).
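For small $n$ the indicator-variable answer $(n-1)p^2$ can be confirmed by brute force over all $2^n$ toss sequences. The following sketch uses illustrative function names and an arbitrary choice $p = 1/3$, $n = 5$; it is a check of the formula, not an efficient method.

```python
from fractions import Fraction
from itertools import product

def count_head_then_head(seq):
    """Number of positions i where toss i and toss i + 1 are both heads."""
    return sum(1 for i in range(len(seq) - 1) if seq[i] == "H" and seq[i + 1] == "H")

def expected_count(n, p):
    """E(X) by direct enumeration of all 2^n sequences of independent tosses."""
    total = Fraction(0)
    for seq in product("HT", repeat=n):
        heads = seq.count("H")
        prob = p ** heads * (1 - p) ** (n - heads)
        total += prob * count_head_then_head(seq)
    return total

p = Fraction(1, 3)   # arbitrary illustrative value
print(count_head_then_head("HTHHH"))
print(expected_count(5, p), (5 - 1) * p ** 2)
```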
Another useful technique for calculating expectations of nonnegative integer-valued random variables is based on the CDF of the random variable, rather than directly on the pmf. This method is useful when calculating probabilities of the form $P(X > x)$ is logically more straightforward than directly calculating $P(X = x)$. Here is the expectation formula based on the tail CDF.
Theorem 1.9 (Tailsum Formula) Let $X$ take values $0, 1, 2, \ldots$. Then
$$E(X) = \sum_{n=0}^{\infty} P(X > n).$$
Example 1.17 (Family Planning) Suppose a couple will have children until they have at least one child of each sex. How many children can they expect to have? Let $X$ denote the childbirth at which they have a child of each sex for the first time. Suppose the probability that any particular childbirth will be a boy is $p$, and that all births are independent. Then, for $n \ge 1$,
$$P(X > n) = P(\text{the first } n \text{ children are all boys or all girls}) = p^n + (1-p)^n,$$
and $P(X > 0) = P(X > 1) = 1$. Therefore, $E(X) = 2 + \sum_{n=2}^{\infty} [p^n + (1-p)^n] = 2 + p^2/(1-p) + (1-p)^2/p = \frac{1}{p(1-p)} - 1$. If boys and girls are equally likely on any childbirth, then this says that a couple waiting to have a child of each sex can expect to have three children.
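The tailsum calculation can be compared with the direct definition $\sum_n n P(X = n)$ and with the closed form. In this sketch the infinite series are truncated at a large number of terms (the truncation point `terms` is an arbitrary assumption), and the function names are illustrative.

```python
def expected_children_tailsum(p, terms=5000):
    """E(X) = sum over n >= 0 of P(X > n); the n = 0 and n = 1 terms are both 1."""
    q = 1 - p
    return 2 + sum(p ** n + q ** n for n in range(2, terms))

def expected_children_direct(p, terms=5000):
    """E(X) = sum of n * P(X = n), with P(X = n) = p^(n-1) q + q^(n-1) p for n >= 2."""
    q = 1 - p
    return sum(n * (p ** (n - 1) * q + q ** (n - 1) * p) for n in range(2, terms))

def expected_children_closed(p):
    """Closed form 1 / (p (1 - p)) - 1."""
    return 1 / (p * (1 - p)) - 1

for p in (0.5, 0.3):
    print(p, expected_children_tailsum(p), expected_children_direct(p), expected_children_closed(p))
```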
The expected value is calculated with the intention of understanding what a typical value is of a random variable. But two very different distributions can have exactly the same expected value. A common example is that of a return on an investment in a stock. Two stocks may have the same average return, but one may be much riskier than the other, in the sense that the variability in the return is much higher for that stock. In that case, most risk-averse individuals would prefer to invest in the stock with less variability. Measures of risk or variability are of course not unique. Some natural measures that come to mind are $E(|X - \mu|)$, known as the mean absolute deviation, or $P(|X - \mu| > k)$ for some suitable $k$. However, neither of these two is the most common measure of variability. The most common measure is the standard deviation of a random variable.
Definition 1.14 Let a random variable $X$ have a finite mean $\mu$. The variance of $X$ is defined as
$$\sigma^2 = E[(X - \mu)^2],$$
and the standard deviation of $X$ is defined as $\sigma = \sqrt{\sigma^2}$.
It is easy to prove that $\sigma^2 < \infty$ if and only if $E(X^2)$, the second moment of $X$, is finite. It is not uncommon to mistake the standard deviation for the mean absolute deviation, but they are not the same. In fact, an inequality always holds.
Proposition. $\sigma \ge E(|X - \mu|)$, and is strictly greater unless $|X - \mu|$ is a constant with probability one (for example, when $X$ is a constant random variable, namely, $P(X = \mu) = 1$).
We list some basic properties of the variance of a random variable.
(a) $\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$ for any real $c$.
(b) $\mathrm{Var}(X + k) = \mathrm{Var}(X)$ for any real $k$.
(c) $\mathrm{Var}(X) \ge 0$ for any random variable $X$, and equals zero only if $P(X = c) = 1$ for some real constant $c$.
(d) $\mathrm{Var}(X) = E(X^2) - \mu^2$.
The quantity $E(X^2)$ is called the second moment of $X$. The definition of a general moment is as follows.
Definition 1.15 Let $X$ be a random variable, and $k \ge 1$ a positive integer. Then $E(X^k)$ is called the $k$th moment of $X$, and $E(X^{-k})$ is called the $k$th inverse moment of $X$, provided they exist.
We therefore have the following relationships involving moments and the variance.
Variance $=$ Second Moment $-$ (First Moment)$^2$.
Second Moment $=$ Variance $+$ (First Moment)$^2$.
Statisticians often use the third moment around the mean as a measure of lack of symmetry in the distribution of a random variable. The point is that if a random variable $X$ has a symmetric distribution, and has a finite mean $\mu$, then all odd moments around the mean, namely, $E[(X - \mu)^{2k+1}]$, will be zero, if the moment exists. In particular, $E[(X - \mu)^3]$ will be zero. Likewise, statisticians also use the fourth moment around the mean as a measure of how spiky the distribution is around the mean. To make these indices independent of the choice of unit of measurement (e.g., inches or centimeters), they use certain scaled measures of asymmetry and peakedness. Here are the definitions.
Definition 1.16 (a) Let $X$ be a random variable with $E[|X|^3] < \infty$. The skewness of $X$ is defined as
$$\beta = \frac{E[(X - \mu)^3]}{\sigma^3}.$$
(b) Let $X$ be a random variable with $E[X^4] < \infty$. The kurtosis of $X$ is defined as
$$\kappa = \frac{E[(X - \mu)^4]}{\sigma^4} - 3.$$
The skewness $\beta$ is zero for symmetric distributions, but the converse need not be true. The kurtosis $\kappa$ is necessarily $\ge -2$, but can be arbitrarily large, with spikier distributions generally having a larger kurtosis. But a very good interpretation of $\kappa$ is not really available. We later show that $\kappa = 0$ for all normal distributions; hence the motivation for subtracting 3 in the definition of $\kappa$.
Example 1.18 (Variance of Number of Heads) Consider the experiment of two tosses of a fair coin and let $X$ be the number of heads obtained. Then, we have seen that $p(0) = p(2) = 1/4$, and $p(1) = 1/2$. Thus, $E(X^2) = 0 \times 1/4 + 1 \times 1/2 + 4 \times 1/4 = 3/2$, and $E(X) = 1$. Therefore, $\mathrm{Var}(X) = E(X^2) - \mu^2 = 3/2 - 1 = \frac{1}{2}$, and the standard deviation is $\sigma = \sqrt{.5} = .707$.
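The variance of the number of heads can be computed both from the definition and from the shortcut $E(X^2) - \mu^2$. The sketch below also evaluates the mean absolute deviation for the same pmf, so the two measures of spread can be compared; the variable names are illustrative.

```python
from fractions import Fraction
from math import sqrt

# pmf of the number of heads in two tosses of a fair coin
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

mean = sum(x * p for x, p in pmf.items())
second_moment = sum(x * x * p for x, p in pmf.items())

var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())  # E[(X - mu)^2]
var_shortcut = second_moment - mean ** 2                           # E(X^2) - mu^2

sd = sqrt(var_definition)
mean_abs_dev = sum(abs(x - mean) * p for x, p in pmf.items())      # E|X - mu|

print(var_definition, var_shortcut, round(sd, 3), mean_abs_dev)
```

Here the standard deviation (about .707) exceeds the mean absolute deviation (.5), in line with the inequality between the two.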
Example 1.19 (A Random Variable with an Infinite Variance) If a random variable has a finite variance, then it can be shown that it must have a finite mean. This example shows that the converse need not be true.
Let $X$ be a discrete random variable with the pmf
is not finitely summable, a fact from calculus. Because $E(X^2)$ is infinite, but $E(X)$ is finite, $\sigma^2 = E(X^2) - [E(X)]^2$ must also be infinite.
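One pmf that exhibits a finite mean together with an infinite second moment is sketched below. It is an assumed illustration only (the choice $p(x) = c/x^3$ is hypothetical and is not claimed to be the pmf used in this example); it relies on the convergence of $\sum 1/x^2$ and the divergence of the harmonic series.

```latex
% Assumed illustration: p(x) = c / x^3 is a hypothetical choice of pmf.
\begin{align*}
p(x) &= \frac{c}{x^{3}}, \quad x = 1, 2, 3, \ldots,
  \qquad c = \Big( \sum_{x=1}^{\infty} \frac{1}{x^{3}} \Big)^{-1}, \\
E(X) &= c \sum_{x=1}^{\infty} \frac{1}{x^{2}} = \frac{c\,\pi^{2}}{6} < \infty, \\
E(X^{2}) &= c \sum_{x=1}^{\infty} \frac{1}{x} = \infty .
\end{align*}
```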
If a collection of random variables is independent, then just like the expectation, the variance also adds up. Precisely, one has the following very useful fact.
Theorem 1.10 Let $X_1, X_2, \ldots, X_n$ be $n$ independent random variables. Then,
$$\mathrm{Var}(X_1 + X_2 + \cdots + X_n) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots + \mathrm{Var}(X_n).$$
An important corollary of this result is the following variance formula for the mean, $\bar{X}$, of $n$ independent and identically distributed random variables.
Corollary 1.1 Let $X_1, X_2, \ldots, X_n$ be independent random variables with a common variance $\sigma^2 < \infty$. Let $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$. Then $\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$.
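The $\sigma^2/n$ formula can be checked exactly for a small case by enumerating all outcomes. This sketch takes the $X_i$ to be independent rolls of a fair die (an illustrative choice) and compares the variance of the sample mean with $\sigma^2/n$; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def variance(pmf):
    """Variance of a discrete distribution given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** 2 * p for x, p in pmf.items())

def pmf_of_mean(n):
    """Exact pmf of the mean of n independent fair die rolls, by enumeration."""
    pmf = {}
    weight = Fraction(1, 6 ** n)
    for rolls in product(range(1, 7), repeat=n):
        xbar = Fraction(sum(rolls), n)
        pmf[xbar] = pmf.get(xbar, 0) + weight
    return pmf

sigma2 = variance(pmf_of_mean(1))   # variance of a single roll
for n in (1, 2, 3, 4):
    print(n, variance(pmf_of_mean(n)), sigma2 / n)
```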
1.4 Inequalities
The mean and the variance, together, have earned the status of being the two most common summaries of a distribution. A relevant question is whether $\mu, \sigma$ are useful summaries of the distribution of a random variable. The answer is a qualified yes. The inequalities below suggest that knowing just the values of $\mu, \sigma$, it is in fact possible to say something useful about the full distribution.
Theorem (a) (Chebyshev’s Inequality) Suppose $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$, assumed to be finite. Let $k$ be any positive number. Then
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$
(b) (Markov’s Inequality) Suppose $X$ takes only nonnegative values, and suppose $E(X) = \mu$, assumed to be finite. Let $c$ be any positive number. Then,
$$P(X \ge c) \le \frac{\mu}{c}.$$
The virtue of these two inequalities is that they make no restrictive assumptions on the random variable $X$. Whenever $\mu, \sigma$ are finite, Chebyshev’s inequality is applicable, and whenever $\mu$ is finite, Markov’s inequality applies, provided the random variable is nonnegative. However, the universal nature of these inequalities also makes them typically quite conservative.
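How conservative the two inequalities can be is easy to see on a concrete distribution. The sketch below uses the sum of two fair dice (an illustrative choice) and returns, for each inequality, the exact tail probability next to the bound; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

# pmf of the sum of two fair dice
pmf = {}
for r1, r2 in product(range(1, 7), repeat=2):
    pmf[r1 + r2] = pmf.get(r1 + r2, 0) + Fraction(1, 36)

mu = sum(x * p for x, p in pmf.items())
var = sum((x - mu) ** 2 * p for x, p in pmf.items())
sigma = sqrt(var)

def chebyshev_check(k):
    """Exact P(|X - mu| >= k sigma) and the Chebyshev bound 1 / k^2."""
    exact = sum(p for x, p in pmf.items() if abs(x - mu) >= k * sigma)
    return float(exact), 1 / k ** 2

def markov_check(c):
    """Exact P(X >= c) and the Markov bound mu / c."""
    exact = sum(p for x, p in pmf.items() if x >= c)
    return float(exact), float(mu / c)

print(chebyshev_check(2))
print(markov_check(10))
```

In both cases the exact probability sits well below the bound, which is the price paid for making no assumptions about the distribution.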
Although Chebyshev’s inequality usually gives conservative estimates for tail probabilities, it does imply a major result in probability theory in a special case.