This is the companion second volume to my undergraduate text Fundamentals of Probability: A First Course. The purpose of my writing this book is to give graduate students, instructors, and researchers in statistics, mathematics, and computer science a lucidly written unique text at the confluence of probability, advanced stochastic processes, statistics, and key tools for machine learning. Numerous topics in probability and stochastic processes of current importance in statistics and machine learning that are widely scattered in the literature in many different specialized books are all brought together under one fold in this book. This is done with an extensive bibliography for each topic, and numerous worked-out examples and exercises. Probability, with all its models, techniques, and its poignant beauty, is an incredibly powerful tool for anyone who deals with data or randomness. The content and the style of this book reflect that philosophy; I emphasize lucidity, a wide background, and the far-reaching applicability of probability in science. The book starts with a self-contained and fairly complete review of basic probability, and then traverses its way through the classics, to advanced modern topics and tools, including a substantial amount of statistics itself. Because of its nearly encyclopaedic coverage, it can serve as a graduate text for a year-long probability sequence, or for focused short courses on selected topics, for self-study, and as a nearly unique reference for research in statistics, probability, and computer science. It provides an extensive treatment of most of the standard topics in a graduate probability sequence, and integrates them with the basic theory and many examples of several core statistical topics, as well as with some tools of major importance in machine learning.
This is done with unusually detailed bibliographies for the reader who wants to dig deeper into a particular topic, and with a huge repertoire of worked-out examples and exercises. The total number of worked-out examples in this book is 423, and the total number of exercises is 808. An instructor can rotate the exercises between semesters, and use them for setting exams, and a student can use them for additional exam preparation and self-study. I believe that the book is unique in its range, unification, bibliographic detail, and its collection of problems and examples.
Springer Texts in Statistics
Anirban DasGupta
Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011924777
© Springer Science+Business Media, LLC 2011
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To Persi Diaconis, Peter Hall, Ashok Maitra, and my mother, with affection
Trang 8is an incredibly powerful tool for anyone who deals with data or randomness Thecontent and the style of this book reflect that philosophy; I emphasize lucidity, awide background, and the far-reaching applicability of probability in science.The book starts with a self-contained and fairly complete review of basic prob-ability, and then traverses its way through the classics, to advanced modern topicsand tools, including a substantial amount of statistics itself Because of its nearlyencyclopaedic coverage, it can serve as a graduate text for a year-long probabil-ity sequence, or for focused short courses on selected topics, for self-study, and as
a nearly unique reference for research in statistics, probability, and computer ence It provides an extensive treatment of most of the standard topics in a graduateprobability sequence, and integrates them with the basic theory and many examples
sci-of several core statistical topics, as well as with some tools sci-of major importance
in machine learning This is done with unusually detailed bibliographies for thereader who wants to dig deeper into a particular topic, and with a huge repertoire ofworked-out examples and exercises The total number of worked-out examples inthis book is 423, and the total number of exercises is 808 An instructor can rotatethe exercises between semesters, and use them for setting exams, and a student canuse them for additional exam preparation and self-study I believe that the book isunique in its range, unification, bibliographic detail, and its collection of problemsand examples
Topics in core probability, such as distribution theory, asymptotics, Markov chains, martingales, Poisson processes, random walks, and Brownian motion are covered in the first 14 chapters. In these chapters, a reader will also find basic coverage of such core statistical topics as confidence intervals, likelihood functions, maximum likelihood estimates, posterior densities, sufficiency, hypothesis testing, variance stabilizing transformations, and extreme value theory, all illustrated with many examples. In Chapters 15, 16, and 17, I treat three major topics of great application potential: empirical processes and VC theory, probability metrics, and large deviations. Chapters 18, 19, and 20 are specifically directed to the statistics and machine-learning community, and cover simulation, Markov chain Monte Carlo, the exponential family, bootstrap, the EM algorithm, and kernels.
The book does not make formal use of measure theory. I do not intend to minimize the role of measure theory in a rigorous study of probability. However, I believe that a large amount of probability can be taught, understood, enjoyed, and applied without needing formal use of measure theory. We do it around the world every day. At the same time, some theorems cannot be proved without at least a mention of some measure theory terminology. Even some definitions require a mention of some measure theory notions. I include some unavoidable mention of measure-theoretic terms and results, such as the strong law of large numbers and its proof, the dominated convergence theorem, monotone convergence, Lebesgue measure, and a few others, but only in the advanced chapters in the book.
Following the table of contents, I have suggested some possible courses with different themes using this book. I have also marked the nonroutine and harder exercises in each chapter with an asterisk. Likewise, some specialized sections with reference value have also been marked with an asterisk. Generally, the exercises and the examples come with a caption, so that the reader will immediately know the content of an exercise or an example. The end of the proof of a theorem has been marked by a sign.
My deepest gratitude and appreciation are due to Peter Hall. I am lucky that the style and substance of this book are significantly molded by Peter's influence. Out of habit, I sent him the drafts of nearly every chapter as I was finishing them. It didn't matter where exactly he was, I always received his input and gentle suggestions for improvement. I have found Peter to be a concerned and warm friend, teacher, mentor, and guardian, and for this, I am extremely grateful.
Mouli Banerjee, Rabi Bhattacharya, Burgess Davis, Stewart Ethier, Arthur Frazho, Evarist Giné, T. Krishnan, S. N. Lahiri, Wei-Liem Loh, Hyun-Sook Oh, B. V. Rao, Yosi Rinott, Wen-Chi Tsai, Frederi Viens, and Larry Wasserman graciously went over various parts of this book. I am deeply indebted to each of them. Larry Wasserman, in particular, suggested the chapters on empirical processes, VC theory, concentration inequalities, the exponential family, and Markov chain Monte Carlo. The Springer series editors, Peter Bickel, George Casella, Steve Fienberg, and Ingram Olkin have consistently supported my efforts, and I am so very thankful to them. Springer's incoming executive editor Marc Strauss saw through the final production of this book extremely efficiently, and I have much enjoyed working with him. I appreciated Marc's gentility and his thoroughly professional handling of the transition of the production of this book to his oversight. Valerie Greco did an astonishing job of copyediting the book. The presentation, display, and the grammar of the book are substantially better because of the incredible care and thoughtfulness that she put into correcting my numerous errors. The staff at SPi Technologies, Chennai, India did an astounding and marvelous job of producing this book. Six anonymous reviewers gave extremely gracious and constructive comments, and their input has helped me in various dimensions to make this a better book. Doug Crabill is the greatest computer systems administrator, and with an infectious pleasantness has bailed me out of my stupidity far too many times.
I also want to mention my fond memories and deep-rooted feelings for the Indian Statistical Institute, where I had all of my college education. It was just a wonderful place for research, education, and friendships. Nearly everything that I know is due to my years at the Indian Statistical Institute, and for this I am thankful.
This is the third time that I have written a book in contract with John Kimmel. John is much more than a nearly unique person in the publishing world. To me, John epitomizes sensitivity and professionalism, a singular combination. I have now known John for almost six years, and it is very, very difficult not to appreciate and admire him a whole lot for his warmth, style, and passion for the subjects of statistics and probability. Ironically, the day that this book entered production, the news came that John was leaving Springer. I will remember John's contribution to my professional growth with enormous respect and appreciation.
Suggested Courses with Different Themes xix
1 Review of Univariate Probability 1
1.1 Experiments and Sample Spaces 1
1.2 Conditional Probability and Independence 5
1.3 Integer-Valued and Discrete Random Variables 8
1.3.1 CDF and Independence 9
1.3.2 Expectation and Moments 13
1.4 Inequalities 19
1.5 Generating and Moment-Generating Functions 22
1.6 Applications of Generating Functions to a Pattern Problem 26
1.7 Standard Discrete Distributions 28
1.8 Poisson Approximation to Binomial 34
1.9 Continuous Random Variables 36
1.10 Functions of a Continuous Random Variable 42
1.10.1 Expectation and Moments 45
1.10.2 Moments and the Tail of a CDF 49
1.11 Moment-Generating Function and Fundamental Inequalities 51
1.11.1 Inversion of an MGF and Post’s Formula 53
1.12 Some Special Continuous Distributions 54
1.13 Normal Distribution and Confidence Interval for a Mean 61
1.14 Stein’s Lemma 66
1.15 Chernoff’s Variance Inequality 68
1.16 Various Characterizations of Normal Distributions 69
1.17 Normal Approximations and Central Limit Theorem 71
1.17.1 Binomial Confidence Interval 74
1.17.2 Error of the CLT 76
1.18 Normal Approximation to Poisson and Gamma 79
1.18.1 Confidence Intervals 80
1.19 Convergence of Densities and Edgeworth Expansions 82
References 92
2 Multivariate Discrete Distributions 95
2.1 Bivariate Joint Distributions and Expectations of Functions 95
2.2 Conditional Distributions and Conditional Expectations 100
2.2.1 Examples on Conditional Distributions and Expectations 101
2.3 Using Conditioning to Evaluate Mean and Variance 104
2.4 Covariance and Correlation 107
2.5 Multivariate Case 111
2.5.1 Joint MGF 112
2.5.2 Multinomial Distribution 114
2.6 The Poissonization Technique 116
3 Multidimensional Densities 123
3.1 Joint Density Function and Its Role 123
3.2 Expectation of Functions 132
3.3 Bivariate Normal 136
3.4 Conditional Densities and Expectations 140
3.4.1 Examples on Conditional Densities and Expectations 142
3.5 Posterior Densities, Likelihood Functions, and Bayes Estimates 147
3.6 Maximum Likelihood Estimates 152
3.7 Bivariate Normal Conditional Distributions 154
3.8 Useful Formulas and Characterizations for Bivariate Normal 155
3.8.1 Computing Bivariate Normal Probabilities 157
3.9 Conditional Expectation Given a Set and Borel’s Paradox 158
References 165
4 Advanced Distribution Theory 167
4.1 Convolutions and Examples 167
4.2 Products and Quotients and the t- and F-Distribution 172
4.3 Transformations 177
4.4 Applications of Jacobian Formula 178
4.5 Polar Coordinates in Two Dimensions 180
4.6 n-Dimensional Polar and Helmert’s Transformation 182
4.6.1 Efficient Spherical Calculations with Polar Coordinates 182
4.6.2 Independence of Mean and Variance in Normal Case 185
4.6.3 The t Confidence Interval 187
4.7 The Dirichlet Distribution 188
4.7.1 Picking a Point from the Surface of a Sphere 191
4.7.2 Poincar´e’s Lemma 191
4.8 Ten Important High-Dimensional Formulas for Easy Reference 191
References 197
5 Multivariate Normal and Related Distributions 199
5.1 Definition and Some Basic Properties 199
5.2 Conditional Distributions 202
5.3 Exchangeable Normal Variables 205
5.4 Sampling Distributions Useful in Statistics 207
5.4.1 Wishart Expectation Identities 208
5.4.2 * Hotelling's T² and Distribution of Quadratic Forms 209
5.4.3 Distribution of Correlation Coefficient 212
5.5 Noncentral Distributions 213
5.6 Some Important Inequalities for Easy Reference 214
References 218
6 Finite Sample Theory of Order Statistics and Extremes 221
6.1 Basic Distribution Theory 221
6.2 More Advanced Distribution Theory 225
6.3 Quantile Transformation and Existence of Moments 229
6.4 Spacings 233
6.4.1 Exponential Spacings and Rényi's Representation 233
6.4.2 Uniform Spacings 234
6.5 Conditional Distributions and Markov Property 235
6.6 Some Applications 238
6.6.1 Records 238
6.6.2 The Empirical CDF 241
6.7 Distribution of the Multinomial Maximum 243
References 247
7 Essential Asymptotics and Applications 249
7.1 Some Basic Notation and Convergence Concepts 250
7.2 Laws of Large Numbers 254
7.3 Convergence Preservation 259
7.4 Convergence in Distribution 262
7.5 Preservation of Convergence and Statistical Applications 267
7.5.1 Slutsky's Theorem 268
7.5.2 Delta Theorem 269
7.5.3 Variance Stabilizing Transformations 272
7.6 Convergence of Moments 274
7.6.1 Uniform Integrability 275
7.6.2 The Moment Problem and Convergence in Distribution 277
7.6.3 Approximation of Moments 278
7.7 Convergence of Densities and Scheffé's Theorem 282
References 292
8 Characteristic Functions and Applications 293
8.1 Characteristic Functions of Standard Distributions 294
8.2 Inversion and Uniqueness 298
8.3 Taylor Expansions, Differentiability, and Moments 302
8.4 Continuity Theorems 303
8.5 Proof of the CLT and the WLLN 305
8.6 Producing Characteristic Functions 306
8.7 Error of the Central Limit Theorem 308
8.8 Lindeberg–Feller Theorem for General Independent Case 311
8.9 Infinite Divisibility and Stable Laws 315
8.10 Some Useful Inequalities 317
References 322
9 Asymptotics of Extremes and Order Statistics 323
9.1 Central-Order Statistics 323
9.1.1 Single-Order Statistic 323
9.1.2 Two Statistical Applications 325
9.1.3 Several Order Statistics 326
9.2 Extremes 328
9.2.1 Easily Applicable Limit Theorems 328
9.2.2 The Convergence of Types Theorem 332
9.3 Fisher–Tippett Family and Putting it Together 333
References 338
10 Markov Chains and Applications 339
10.1 Notation and Basic Definitions 340
10.2 Examples and Various Applications as a Model 340
10.3 Chapman–Kolmogorov Equation 345
10.4 Communicating Classes 349
10.5 Gambler’s Ruin 352
10.6 First Passage, Recurrence, and Transience 354
10.7 Long Run Evolution and Stationary Distributions 359
References 374
11 Random Walks 375
11.1 Random Walk on the Cubic Lattice 375
11.1.1 Some Distribution Theory 378
11.1.2 Recurrence and Transience 379
11.1.3 Pólya's Formula for the Return Probability 382
11.2 First Passage Time and Arc Sine Law 383
11.3 The Local Time 387
11.4 Practically Useful Generalizations 389
11.5 Wald’s Identity 390
11.6 Fate of a Random Walk 392
11.7 Chung–Fuchs Theorem 394
11.8 Six Important Inequalities 396
References 400
12 Brownian Motion and Gaussian Processes 401
12.1 Preview of Connections to the Random Walk 402
12.2 Basic Definitions 403
12.2.1 Condition for a Gaussian Process to be Markov 406
12.2.2 Explicit Construction of Brownian Motion 407
12.3 Basic Distributional Properties 408
12.3.1 Reflection Principle and Extremes 410
12.3.2 Path Properties and Behavior Near Zero and Infinity 412
12.3.3 Fractal Nature of Level Sets 415
12.4 The Dirichlet Problem and Boundary Crossing Probabilities 416
12.4.1 Recurrence and Transience 418
12.5 The Local Time of Brownian Motion 419
12.6 Invariance Principle and Statistical Applications 421
12.7 Strong Invariance Principle and the KMT Theorem 425
12.8 Brownian Motion with Drift and Ornstein–Uhlenbeck Process 427
12.8.1 Negative Drift and Density of Maximum 427
12.8.2 Transition Density and the Heat Equation 428
12.8.3 The Ornstein–Uhlenbeck Process 429
References 435
13 Poisson Processes and Applications 437
13.1 Notation 438
13.2 Defining a Homogeneous Poisson Process 439
13.3 Important Properties and Uses as a Statistical Model 440
13.4 Linear Poisson Process and Brownian Motion: A Connection 448
13.5 Higher-Dimensional Poisson Point Processes 450
13.5.1 The Mapping Theorem 452
13.6 One-Dimensional Nonhomogeneous Processes 453
13.7 Campbell’s Theorem and Shot Noise 456
13.7.1 Poisson Process and Stable Laws 458
References 462
14 Discrete Time Martingales and Concentration Inequalities 463
14.1 Illustrative Examples and Applications in Statistics 463
14.2 Stopping Times and Optional Stopping 468
14.2.1 Stopping Times 469
14.2.2 Optional Stopping 470
14.2.3 Sufficient Conditions for Optional Stopping Theorem 472
14.2.4 Applications of Optional Stopping 474
14.3 Martingale and Concentration Inequalities 477
14.3.1 Maximal Inequality 477
14.3.2 Inequalities of Burkholder, Davis, and Gundy 480
14.3.3 Inequalities of Hoeffding and Azuma 483
14.3.4 Inequalities of McDiarmid and Devroye 485
14.3.5 The Upcrossing Inequality 488
14.4 Convergence of Martingales 490
14.4.1 The Basic Convergence Theorem 490
14.4.2 Convergence in L1 and L2 493
14.5 Reverse Martingales and Proof of SLLN 494
14.6 Martingale Central Limit Theorem 497
References 503
15 Probability Metrics 505
15.1 Standard Probability Metrics Useful in Statistics 505
15.2 Basic Properties of the Metrics 508
15.3 Metric Inequalities 515
15.4 Differential Metrics for Parametric Families 519
15.4.1 Fisher Information and Differential Metrics 520
15.4.2 Rao’s Geodesic Distances on Distributions 522
References 525
16 Empirical Processes and VC Theory 527
16.1 Basic Notation and Definitions 527
16.2 Classic Asymptotic Properties of the Empirical Process 529
16.2.1 Invariance Principle and Statistical Applications 531
16.2.2 Weighted Empirical Process 534
16.2.3 The Quantile Process 536
16.2.4 Strong Approximations of the Empirical Process 537
16.3 Vapnik–Chervonenkis Theory 538
16.3.1 Basic Theory .538
16.3.2 Concrete Examples 540
16.4 CLTs for Empirical Measures and Applications 543
16.4.1 Notation and Formulation 543
16.4.2 Entropy Bounds and Specific CLTs 544
16.4.3 Concrete Examples 547
16.5 Maximal Inequalities and Symmetrization 547
16.6 Connection to the Poisson Process 551
References 557
17 Large Deviations 559
17.1 Large Deviations for Sample Means 560
17.1.1 The Cramér–Chernoff Theorem in R 560
17.1.2 Properties of the Rate Function 564
17.1.3 Cramér's Theorem for General Sets 566
17.2 The Gärtner–Ellis Theorem and Markov Chain Large Deviations 567
17.3 The t-Statistic 570
17.4 Lipschitz Functions and Talagrand’s Inequality 572
17.5 Large Deviations in Continuous Time 574
17.5.1 Continuity of a Gaussian Process 576
17.5.2 Metric Entropy of T and Tail of the Supremum 577
References 582
18 The Exponential Family and Statistical Applications 583
18.1 One-Parameter Exponential Family 583
18.1.1 Definition and First Examples 584
18.2 The Canonical Form and Basic Properties 589
18.2.1 Convexity Properties 590
18.2.2 Moments and Moment Generating Function 591
18.2.3 Closure Properties 594
18.3 Multiparameter Exponential Family 596
18.4 Sufficiency and Completeness 600
18.4.1 Neyman–Fisher Factorization and Basu’s Theorem 602
18.4.2 Applications of Basu’s Theorem to Probability 604
18.5 Curved Exponential Family 607
References 612
19 Simulation and Markov Chain Monte Carlo 613
19.1 The Ordinary Monte Carlo 615
19.1.1 Basic Theory and Examples 615
19.1.2 Monte Carlo P-Values 622
19.1.3 Rao–Blackwellization 623
19.2 Textbook Simulation Techniques 624
19.2.1 Quantile Transformation and Accept–Reject 624
19.2.2 Importance Sampling and Its Asymptotic Properties 629
19.2.3 Optimal Importance Sampling Distribution 633
19.2.4 Algorithms for Simulating from Common Distributions 634
19.3 Markov Chain Monte Carlo 637
19.3.1 Reversible Markov Chains 639
19.3.2 Metropolis Algorithms 642
19.4 The Gibbs Sampler 645
19.5 Convergence of MCMC and Bounds on Errors 651
19.5.1 Spectral Bounds 653
19.5.2 Dobrushin's Inequality and Diaconis–Fill–Stroock Bound 657
19.5.3 Drift and Minorization Methods 659
19.6 MCMC on General Spaces 662
19.6.1 General Theory and Metropolis Schemes 662
19.6.2 Convergence 666
19.6.3 Convergence of the Gibbs Sampler 670
19.7 Practical Convergence Diagnostics 673
References 686
20 Useful Tools for Statistics and Machine Learning 689
20.1 The Bootstrap 689
20.1.1 Consistency of the Bootstrap 692
20.1.2 Further Examples 696
20.1.3 Higher-Order Accuracy of the Bootstrap 699
20.1.4 Bootstrap for Dependent Data 701
20.2 The EM Algorithm 704
20.2.1 The Algorithm and Examples 706
20.2.2 Monotone Ascent and Convergence of EM 711
20.2.3 Modifications of EM 714
20.3 Kernels and Classification 715
20.3.1 Smoothing by Kernels 715
20.3.2 Some Common Kernels in Use 717
20.3.3 Kernel Density Estimation 719
20.3.4 Kernels for Statistical Classification 724
20.3.5 Mercer's Theorem and Feature Maps 732
References 744
A Symbols, Useful Formulas, and Normal Table 747
A.1 Glossary of Symbols 747
A.2 Moments and MGFs of Common Distributions 750
A.3 Normal Table 755
Author Index 757
Subject Index 763
Suggested Courses with Different Themes
15 weeks. Special topics for statistics students: Chapters 9, 10, 15, 16, 17, 18, 20
15 weeks. Special topics for computer science students: Chapters 4, 11, 14, 16, 17, 18, 19
8 weeks. Summer course for statistics students: Chapters 11, 12, 14, 20
8 weeks. Summer course for computer science students: Chapters 14, 16, 18, 20
8 weeks. Summer course on modeling and simulation: Chapters 4, 10, 13, 19
Chapter 1
Review of Univariate Probability
Probability is a universally accepted tool for expressing degrees of confidence or doubt about some proposition in the presence of incomplete information or uncertainty. By convention, probabilities are calibrated on a scale of 0 to 1; assigning something a zero probability amounts to expressing the belief that we consider it impossible, whereas assigning a probability of one amounts to considering it a certainty. Most propositions fall somewhere in between. Probability statements that we make can be based on our past experience, or on our personal judgments. Whether our probability statements are based on past experience or subjective personal judgments, they obey a common set of rules, which we can use to treat probabilities in a mathematical framework, and also for making decisions on predictions, for understanding complex systems, or as intellectual experiments and for entertainment. Probability theory is one of the most applicable branches of mathematics. It is used as the primary tool for analyzing statistical methodologies; it is used routinely in nearly every branch of science, such as biology, astronomy and physics, medicine, economics, chemistry, sociology, ecology, finance, and many others. A background in the theory, models, and applications of probability is almost a part of basic education. That is how important it is.
For a classic and lively introduction to the subject of probability, we recommend Feller (1968, 1971). Among numerous other expositions of the theory of probability, a variety of examples on various topics can be seen in Ross (1984), Stirzaker (1994), Pitman (1992), Bhattacharya and Waymire (2009), and DasGupta (2010). Ash (1972), Chung (1974), Breiman (1992), Billingsley (1995), and Dudley (2002) are masterly accounts of measure-theoretic probability.
1.1 Experiments and Sample Spaces
Treatment of probability theory starts with the consideration of a sample space.
The sample space is the set of all possible outcomes in some physical experiment. For example, if a coin is tossed twice and after each toss the face that shows is recorded, then the possible outcomes of this particular coin-tossing experiment, say Ω, are HH, HT, TH, TT, with H denoting the occurrence of heads and T denoting the occurrence of tails. We call

Ω = {HH, HT, TH, TT}

the sample space of the experiment.

A. DasGupta, Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics, Springer Texts in Statistics, DOI 10.1007/978-1-4419-9634-3_1, © Springer Science+Business Media, LLC 2011
In general, a sample space is a general set Ω, finite or infinite. An easy example where the sample space is infinite is to toss a coin until the first time heads show up and record the number of the trial at which the first head appeared. In this case, the sample space is the countably infinite set

Ω = {1, 2, 3, …}.
Sample spaces can also be uncountably infinite; for example, consider the experiment of choosing a number at random from the interval [0, 1]. The sample space of this experiment is Ω = [0, 1]. In this case, Ω is an uncountably infinite set. In all cases, individual elements of a sample space are denoted as ω. The first task is to define events and to explain the meaning of the probability of an event.
Definition 1.1. Let Ω be the sample space of an experiment. Then any subset A of Ω, including the empty set ∅ and the entire sample space Ω, is called an event.
Events may contain even one single sample point ω, in which case the event is a singleton set {ω}. We want to assign probabilities to events. But we want to assign probabilities in a way that they are logically consistent. In fact, this cannot be done in general if we insist on assigning probabilities to arbitrary collections of sample points, that is, arbitrary subsets of the sample space. We can only define probabilities for such subsets of Ω that are tied together like a family, the exact concept being that of a σ-field. In most applications, including those cases where the sample space is infinite, events that we would want to normally think about will be members of such an appropriate σ-field. So we do not mention the need for consideration of σ-fields any further, and get along with thinking of events as subsets of the sample space Ω, including in particular the empty set ∅ and the entire sample space Ω itself.
Here is a definition of what counts as a legitimate probability on events.
Definition 1.2. Given a sample space Ω, a probability or a probability measure on Ω is a function P on subsets of Ω such that

(a) P(A) ≥ 0 for any A ⊆ Ω;
(b) P(Ω) = 1;
(c) if A1, A2, A3, … is a countable collection of pairwise disjoint subsets of Ω, then P(A1 ∪ A2 ∪ A3 ∪ ⋯) = P(A1) + P(A2) + P(A3) + ⋯ (countable additivity).

Not all probabilists agree that countable additivity is natural; but we do not get into that debate in this book. One important point is that finite additivity is subsumed in countable additivity; that is, if there are some finite number m of disjoint subsets A1, A2, …, Am of Ω, then P(A1 ∪ A2 ∪ ⋯ ∪ Am) = P(A1) + P(A2) + ⋯ + P(Am). Also, it is useful to note that the last two conditions in the definition of a probability measure imply that P(∅), the probability of the empty set or the null event, is zero.
One notational convention is that, strictly speaking, for an event that is just a singleton set {ω}, we should write P({ω}) to denote its probability. But to reduce clutter, we simply use the more convenient notation P(ω).
One pleasant consequence of the axiom of countable additivity is the following basic result. We do not prove it here as it is a simple result; see DasGupta (2010) for a proof.
Theorem 1.1. Let A1 ⊇ A2 ⊇ A3 ⊇ ⋯ be an infinite family of subsets of a sample space Ω such that An ↓ A. Then P(An) → P(A) as n → ∞.
Next, the concept of equally likely sample points is a very fundamental one.
Definition 1.3. Let Ω be a finite sample space consisting of N sample points. We say that the sample points are equally likely if P(ω) = 1/N for each sample point ω.
An immediate consequence, due to the additivity axiom, is the following useful formula.
Proposition. Let Ω be a finite sample space consisting of N equally likely sample points. Let A be any event and suppose A contains n distinct sample points. Then

P(A) = n/N = (number of sample points favorable to A)/(total number of sample points).
Let us see some examples.
Example 1.1 (The Shoe Problem). Suppose there are five pairs of shoes in a closet and four shoes are taken out at random. What is the probability that among the four that are taken out, there is at least one complete pair?

The total number of sample points is (10 choose 4) = 210. Because selection was done completely at random, we assume that all sample points are equally likely. At least one complete pair would mean two complete pairs, or exactly one complete pair and two other nonconforming shoes. Two complete pairs can be chosen in (5 choose 2) = 10 ways, and exactly one complete pair with two nonconforming shoes in (5 choose 1) × (4 choose 2) × 2 × 2 = 120 ways. Hence, the probability of at least one complete pair is (10 + 120)/210 = 130/210 ≈ .62.
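Counts like this can be checked by exhaustive enumeration. Below is a minimal Python sketch, with the shoes labeled by (pair, side) purely for illustration; the variable names are ours, not the text's.

```python
from itertools import combinations

# Label the 10 shoes as (pair_index, side): 5 pairs, 2 shoes each.
shoes = [(p, s) for p in range(5) for s in range(2)]

# Enumerate all (10 choose 4) = 210 equally likely selections of four shoes.
favorable = 0
total = 0
for pick in combinations(shoes, 4):
    total += 1
    pair_labels = {p for p, _ in pick}
    # A complete pair is present iff fewer than 4 distinct pair labels appear.
    if len(pair_labels) < 4:
        favorable += 1

print(favorable, total)             # 130 210
print(round(favorable / total, 3))  # 0.619
```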
Example 1.2 (Five-Card Poker). In five-card poker, a player is given 5 cards from a full deck of 52 cards at random. Various named hands of varying degrees of rarity exist. In particular, we want to calculate the probabilities of A = two pairs and B = a flush. Two pairs is a hand with 2 cards each of 2 different denominations and the fifth card of some other denomination; a flush is a hand with 5 cards of the same suit, but the cards cannot be of denominations in a sequence.

P(A) = [(13 choose 2) × (4 choose 2)² × 11 × 4] / (52 choose 5) ≈ .0475;

P(B) = [((13 choose 5) − 10) × 4] / (52 choose 5) ≈ .00197.
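Both probabilities can be computed directly from binomial coefficients; here is a minimal Python sketch using the standard library's math.comb.

```python
from math import comb

deck = comb(52, 5)  # 2,598,960 five-card hands

# Two pairs: 2 denominations for the pairs, 2 suits within each,
# then one of the 11 remaining denominations and a suit for the fifth card.
two_pairs = comb(13, 2) * comb(4, 2) ** 2 * 11 * 4

# Flush: 5 cards of one suit, minus the 10 straight-flush denominations
# per suit, times 4 suits.
flush = (comb(13, 5) - 10) * 4

print(round(two_pairs / deck, 5))  # 0.04754
print(round(flush / deck, 5))      # 0.00197
```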
These are basic examples of counting arguments that are useful whenever there is a finite sample space and we assume that all sample points are equally likely.
A major result in combinatorial probability is the inclusion–exclusion formula, which says the following.

Theorem 1.2. Let A1, A2, …, An be n general events. Let Sj = Σ P(Ai1 ∩ Ai2 ∩ ⋯ ∩ Aij), the sum being over all 1 ≤ i1 < i2 < ⋯ < ij ≤ n, for 1 ≤ j ≤ n. Then

P(A1 ∪ A2 ∪ ⋯ ∪ An) = S1 − S2 + S3 − ⋯ + (−1)^(n+1) Sn.
Example 1.3 (Missing Suits in a Bridge Hand). Consider a specific player, say North, in a Bridge game. We want to calculate the probability that North's hand is void in at least one suit. Towards this, denote the suits as 1, 2, 3, 4 and let Ai = North's hand is void in suit i.

Then, by the inclusion–exclusion formula,

P(North's hand is void in at least one suit) = 4 × (39 choose 13)/(52 choose 13) − 6 × (26 choose 13)/(52 choose 13) + 4 × (13 choose 13)/(52 choose 13) ≈ .051.
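The inclusion–exclusion computation for the void probability is short enough to do by machine; here is a minimal Python sketch (the names total_hands, S, and p_void are illustrative, not from the text).

```python
from math import comb

total_hands = comb(52, 13)

# S_j: (4 choose j) ways to pick j suits to be void; a hand avoiding those
# suits is any 13 cards from the remaining 52 - 13j cards.
S = [comb(4, j) * comb(52 - 13 * j, 13) / total_hands for j in (1, 2, 3)]

# Inclusion-exclusion; S_4 = 0 since a 13-card hand cannot avoid all four suits.
p_void = S[0] - S[1] + S[2]
print(round(p_void, 4))  # 0.0511
```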
The inclusion–exclusion formula can be hard to apply exactly, because the quantities Sj for large indices j can be difficult to calculate. However, fortunately, the inclusion–exclusion formula leads to bounds in both directions for the probability of the union of n general events. We have the following series of bounds.

Theorem 1.3 (Bonferroni Bounds). Given n events A1, A2, …, An, with Sj as in the inclusion–exclusion formula, the partial sums alternately over- and under-estimate the union probability:

S1 − S2 ≤ P(A1 ∪ A2 ∪ ⋯ ∪ An) ≤ S1,
S1 − S2 + S3 − S4 ≤ P(A1 ∪ A2 ∪ ⋯ ∪ An) ≤ S1 − S2 + S3,

and so on. In particular, taking complements, P(A1 ∩ A2 ∩ ⋯ ∩ An) ≥ 1 − Σ P(Ai^c), the sum over i = 1, …, n.
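One can see the Bonferroni bounds in action by comparing the alternating partial sums of the Sj with the exact union probability on a small equally likely sample space; the sample space and events below are arbitrary illustrative choices of ours.

```python
from itertools import combinations
import random

random.seed(7)
N = 20
omega = list(range(N))                 # equally likely points, P(point) = 1/N
events = [set(random.sample(omega, 8)) for _ in range(4)]

def P(A):
    return len(A) / N

exact = P(set().union(*events))

# S_j = sum of P(intersection) over all j-subsets of the events
n = len(events)
S = [sum(P(set.intersection(*sub)) for sub in combinations(events, j))
     for j in range(1, n + 1)]

# Odd partial sums overestimate, even partial sums underestimate the union.
partial = 0.0
for j, s in enumerate(S, start=1):
    partial += s if j % 2 == 1 else -s
    if j % 2 == 1:
        assert partial >= exact - 1e-12
    else:
        assert partial <= exact + 1e-12
    print(f"through S_{j}: partial sum {partial:.3f}, exact {exact:.3f}")
```

When all n terms are used, the partial sum reduces to the exact inclusion–exclusion value.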
1.2 Conditional Probability and Independence
Both conditional probability and independence are fundamental concepts for probabilists and statisticians alike. Conditional probabilities correspond to updating one's beliefs when new information becomes available. Independence corresponds to irrelevance of a piece of new information, even when it is made available. In addition, the assumption of independence can and does significantly simplify development, mathematical analysis, and justification of tools and procedures.

Definition 1.4. Let A, B be general events with respect to some sample space Ω, and suppose P(A) > 0. The conditional probability of B given A is defined as

P(B|A) = P(A ∩ B)/P(A).
Some immediate consequences of the definition of a conditional probability are the following.
Theorem 1.4 (a) (Multiplicative Formula) For any two events A, B such that $P(A) > 0$, one has $P(A \cap B) = P(A)P(B \mid A)$;
(b) For any two events A, B such that $0 < P(A) < 1$, one has $P(B) = P(B \mid A)P(A) + P(B \mid A^c)P(A^c)$;
(c) (Total Probability Formula) More generally, if $\{A_1, A_2, \ldots, A_m\}$ is a partition of the sample space, with each $P(A_i) > 0$, then $P(B) = \sum_{i=1}^{m} P(B \mid A_i)P(A_i)$;
(d) (Hierarchical Multiplicative Formula) Let $A_1, A_2, \ldots, A_k$ be k general events in a sample space. Then

$P(A_1 \cap A_2 \cap \cdots \cap A_k) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2) \cdots P(A_k \mid A_1 \cap A_2 \cap \cdots \cap A_{k-1}).$
Example 1.4 One of two urns has a red and b black balls, and the other has c red and d black balls. One ball is chosen at random from each urn, and then one of these two balls is chosen at random. What is the probability that this final ball is red?

If each ball selected from the two urns is red, then the final ball is definitely red. If exactly one of those two balls is red, then the final ball is red with probability 1/2. If neither of those two balls is red, then the final ball cannot be red.
Thus,

$P(\text{The final ball is red}) = \frac{a}{a+b}\cdot\frac{c}{c+d} + \frac{1}{2}\left[\frac{a}{a+b}\cdot\frac{d}{c+d} + \frac{b}{a+b}\cdot\frac{c}{c+d}\right] = \frac{2ac + ad + bc}{2(a+b)(c+d)}.$

As an example, suppose $a = 99, b = 1, c = 1, d = 1$. Then $\frac{2ac + ad + bc}{2(a+b)(c+d)} = .745$. Although the total percentage of red balls in the two urns is more than 98%, the chance that the final ball selected would be red is just about 75%.
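The closed form can be checked against a simulation of the two-stage drawing scheme. This is a sketch of mine (the helper name `p_red` is invented, not from the text):

```python
from fractions import Fraction
import random

# Exact probability from the formula in Example 1.4.
def p_red(a, b, c, d):
    return Fraction(2 * a * c + a * d + b * c, 2 * (a + b) * (c + d))

assert float(p_red(99, 1, 1, 1)) == 0.745

# Monte Carlo check of the same two-stage scheme.
random.seed(0)
n, hits = 100_000, 0
for _ in range(n):
    ball1 = random.random() < 99 / 100   # red from urn 1 (a=99, b=1)
    ball2 = random.random() < 1 / 2      # red from urn 2 (c=1, d=1)
    final = ball1 if random.random() < 0.5 else ball2
    hits += final
print(hits / n)  # close to 0.745
```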
Example 1.5 (A Clever Conditioning Argument) Coin A gives heads with probability s and coin B gives heads with probability t. They are tossed alternately, starting off with coin A. We want to find the probability that the first head is obtained on coin A.

We find this probability by conditioning on the outcomes of the first two tosses; more precisely, define

$A_1 = \{H\}$ = {First toss gives H}; $A_2 = \{TH\}$; $A_3 = \{TT\}$.

Let also

A = {The first head is obtained on coin A}.

One of the three events $A_1, A_2, A_3$ must happen, and they are also mutually exclusive. Therefore, by the total probability formula,
$P(A) = P(A \mid A_1)P(A_1) + P(A \mid A_2)P(A_2) + P(A \mid A_3)P(A_3) = 1 \cdot s + 0 \cdot (1-s)t + P(A)(1-s)(1-t),$

which gives $P(A) = \frac{s}{s + t - st}$.

As an example, let $s = .4, t = .5$. Note that coin A is biased against heads. Even then, $s/(s + t - st) = .57 > .5$. We see that there is an advantage in starting first.
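The closed form s/(s + t − st) is easy to confirm by simulating the alternating tosses directly (my own sketch; the function name is invented):

```python
import random

# First head on coin A, with A tossed first: closed form from Example 1.5.
def p_first_head_on_A(s, t):
    return s / (s + t - s * t)

assert round(p_first_head_on_A(0.4, 0.5), 2) == 0.57

random.seed(1)
n, wins = 100_000, 0
for _ in range(n):
    while True:
        if random.random() < 0.4:   # coin A shows a head: A wins
            wins += 1
            break
        if random.random() < 0.5:   # coin B shows a head: A loses
            break
print(wins / n)  # close to 4/7 ≈ 0.571
```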
Definition 1.5 A collection of events $A_1, A_2, \ldots, A_n$ is said to be mutually independent (or just independent) if for each $k, 1 \le k \le n$, and any k of the events $A_{i_1}, \ldots, A_{i_k}$, $P(A_{i_1} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) \cdots P(A_{i_k})$. They are called pairwise independent if this property holds for $k = 2$.
Example 1.6 (Lotteries) Although many people buy lottery tickets out of an expectation of good luck, probabilistically speaking, buying lottery tickets is usually a waste of money. Here is an example. Suppose in a weekly state lottery, five of the numbers 00, 01, ..., 49 are selected without replacement at random, and someone holding exactly those numbers wins the lottery. Then, the probability that someone holding one ticket will be the winner in a given week is

$\dfrac{1}{\binom{50}{5}} = 4.72 \times 10^{-7}.$

Suppose this person buys a ticket every week for 40 years. Then, the probability that he will win the lottery on at least one week is $1 - (1 - 4.72 \times 10^{-7})^{2080} = .00098 < .001$, still a very small probability. We assumed in this calculation that the weekly lotteries are all mutually independent, a reasonable assumption. The calculation would fall apart if we did not make this independence assumption.
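Both numbers in the example follow from two lines of arithmetic (a sketch of mine, assuming, as the text does, 52 independent weekly drawings per year):

```python
from math import comb

# Weekly win probability: one winning combination out of C(50,5).
p_week = 1 / comb(50, 5)
print(p_week)                        # ≈ 4.72e-07

# At least one win in 40 years of independent weekly plays.
p_40_years = 1 - (1 - p_week) ** (40 * 52)
print(round(p_40_years, 5))          # ≈ 0.00098
```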
It is not uncommon to see the conditional probabilities $P(A \mid B)$ and $P(B \mid A)$ confused with each other. Suppose in some group of lung cancer patients, we see a large percentage of smokers. If we define B to be the event that a person is a smoker, and A to be the event that a person has lung cancer, then all we can conclude is that in our group of people $P(B \mid A)$ is large. But we cannot conclude from just this information that smoking increases the chance of lung cancer, that is, that $P(A \mid B)$ is large. In order to calculate a conditional probability $P(A \mid B)$ when we know the other conditional probability $P(B \mid A)$, a simple formula known as Bayes' theorem is useful. Here is a statement of a general version of Bayes' theorem.
Theorem 1.5 Let $\{A_1, A_2, \ldots, A_m\}$ be a partition of a sample space $\Omega$. Let B be some fixed event with $P(B) > 0$. Then

$P(A_j \mid B) = \dfrac{P(B \mid A_j)P(A_j)}{\sum_{i=1}^{m} P(B \mid A_i)P(A_i)}.$
Example 1.7 (Multiple Choice Exams) Suppose that the questions in a multiple choice exam have five alternatives each, of which a student has to pick one as the correct alternative. A student either knows the truly correct alternative, with probability .7, or she randomly picks one of the five alternatives as her choice. Suppose a particular problem was answered correctly. We want to know what the probability is that the student really knew the correct answer.
Define

A = {The student knew the correct answer};
B = {The student answered the question correctly}.

We want to compute $P(A \mid B)$. By Bayes' theorem,

$P(A \mid B) = \dfrac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A^c)P(A^c)} = \dfrac{1 \times .7}{1 \times .7 + .2 \times .3} = \dfrac{.7}{.76} = .92.$

The answer, .92, is larger than the prior probability, .7, that she knew the correct answer; this is what Bayes' theorem does; it updates our prior belief to the posterior belief, when new evidence becomes available.
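The prior-to-posterior update in Example 1.7 is a one-line Bayes computation. Here is a sketch of mine (the function and parameter names are invented; a random guess among 5 alternatives is correct with probability .2):

```python
# Posterior probability that the student knew the answer, given a
# correct answer, via Bayes' theorem.
def posterior(prior, p_correct_if_known=1.0, p_correct_if_guess=0.2):
    num = p_correct_if_known * prior
    den = num + p_correct_if_guess * (1 - prior)
    return num / den

p = posterior(0.7)
print(round(p, 4))  # 0.9211
```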
1.3 Integer-Valued and Discrete Random Variables
In some sense, the entire subject of probability and statistics is about distributions of random variables. Random variables, as the very name suggests, are quantities that vary, over time or from individual to individual, and the reason for the variability is some underlying random process. Depending on exactly how an underlying experiment ends, the random variable takes different values. In other words, the value of the random variable is determined by the sample point $\omega$ that prevails when the underlying experiment is actually conducted. We cannot know a priori the value of the random variable, because we do not know a priori which sample point $\omega$ will prevail when the experiment is conducted. We try to understand the behavior of a random variable by analyzing the probability structure of that underlying random experiment.
Random variables, like probabilities, originated in gambling. Therefore, the random variables that come to us most naturally are integer-valued random variables; for example, the sum of the two rolls when a die is rolled twice. Integer-valued random variables are special cases of what are known as discrete random variables. Discrete or not, a common mathematical definition of all random variables is the following.

Definition 1.6 Let $\Omega$ be a sample space corresponding to some experiment and let $X : \Omega \to \mathbb{R}$ be a function from the sample space to the real line. Then X is called a random variable.
Discrete random variables are those that take a finite or a countably infinite number of possible values. In particular, all integer-valued random variables are discrete. From the point of view of understanding the behavior of a random variable, the important thing is to know the probabilities with which X takes its different possible values.
Definition 1.7 Let $X : \Omega \to \mathbb{R}$ be a discrete random variable taking a finite or countably infinite number of values $x_1, x_2, x_3, \ldots$. The probability distribution or the probability mass function (pmf) of X is the function $p(x) = P(X = x), x = x_1, x_2, x_3, \ldots$, and $p(x) = 0$ otherwise.
It is common to not explicitly mention the phrase "$p(x) = 0$ otherwise," and we generally follow this convention. Some authors use the phrase mass function instead of probability mass function.
For any pmf, one must have $p(x) \ge 0$ for any x, and $\sum_i p(x_i) = 1$. Any function satisfying these two properties for some set of numbers $x_1, x_2, x_3, \ldots$ is a valid pmf.
1.3.1 CDF and Independence
A second important definition is that of a cumulative distribution function (CDF). The CDF gives the probability that a random variable X is less than or equal to any given number x. It is important to understand that the notion of a CDF is universal to all random variables; it is not limited to only the discrete ones.
Definition 1.8 The cumulative distribution function of a random variable X is the function $F(x) = P(X \le x), x \in \mathbb{R}$.
Definition 1.9 Let X have the CDF $F(x)$. Any number m such that $P(X \le m) \ge .5$, and also $P(X \ge m) \ge .5$, is called a median of F, or equivalently, a median of X.
Remark The median of a random variable need not be unique. A simple way to characterize all the medians of a distribution is available.

Proposition Let X be a random variable with the CDF $F(x)$. Let $m_0$ be the first x such that $F(x) \ge .5$, and let $m_1$ be the last x such that $P(X \ge x) \ge .5$. Then, a number m is a median of X if and only if $m \in [m_0, m_1]$.
The CDF of any random variable satisfies a set of properties. Conversely, any function satisfying these properties is a valid CDF; that is, it will be the CDF of some appropriately chosen random variable. These properties are given in the next result.
Theorem 1.6 A function $F(x)$ is the CDF of some real-valued random variable X if and only if it satisfies all of the following properties.

(a) $0 \le F(x) \le 1 \;\; \forall x \in \mathbb{R}$.
(b) $F(x) \to 0$ as $x \to -\infty$, and $F(x) \to 1$ as $x \to \infty$.
(c) Given any real number a, $F(x) \downarrow F(a)$ as $x \downarrow a$.
(d) Given any two real numbers $x, y, x < y$, $F(x) \le F(y)$.
Property (c) is called continuity from the right, or simply right continuity. It is clear that a CDF need not be continuous from the left; indeed, for discrete random variables, the CDF has a jump at the values of the random variable, and at the jump points, the CDF is not left continuous. More precisely, one has the following result.
Proposition Let $F(x)$ be the CDF of some random variable X. Then, for any x,

(a) $P(X = x) = F(x) - \lim_{y \uparrow x} F(y) = F(x) - F(x-)$, including those points x for which $P(X = x) = 0$.
(b) $P(X \ge x) = P(X > x) + P(X = x) = (1 - F(x)) + (F(x) - F(x-)) = 1 - F(x-)$.
Example 1.8 (Bridge) Consider the random variable

X = Number of aces in North's hand in a Bridge game.

Clearly, X can take any of the values $x = 0, 1, 2, 3, 4$. If $X = x$, then the other $13 - x$ cards in North's hand must be non-ace cards. Thus, the pmf of X is

$P(X = x) = \dfrac{\binom{4}{x}\binom{48}{13-x}}{\binom{52}{13}}, \quad x = 0, 1, 2, 3, 4.$
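This is a hypergeometric pmf, and it is easy to tabulate and check that it sums to 1 (a sketch of mine, not part of the text):

```python
from math import comb, isclose

# Number of aces in North's 13-card hand: choose x of the 4 aces and
# 13-x of the other 48 cards, out of all C(52,13) hands.
pmf = {x: comb(4, x) * comb(48, 13 - x) / comb(52, 13) for x in range(5)}

assert isclose(sum(pmf.values()), 1.0)
print(round(pmf[0], 4), round(pmf[1], 4))  # ≈ 0.3038 0.4388
```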
Example 1.9 (Indicator Variables) Consider the experiment of rolling a fair die twice and now define a random variable Y as follows:

Y = 1 if the sum of the two rolls X is an even number;
Y = 0 if the sum of the two rolls X is an odd number.

If we let A be the event that X is an even number, then Y = 1 if A happens, and Y = 0 if A does not happen. Such random variables are called indicator random variables and are immensely useful in mathematical calculations in many complex situations.
Definition 1.10 Let A be any event in a sample space. The indicator random variable for A is defined as

$I_A = 1$ if A happens;
$I_A = 0$ if A does not happen.

Thus, the distribution of an indicator variable is simply $P(I_A = 1) = P(A); P(I_A = 0) = 1 - P(A)$.

An indicator variable is also called a Bernoulli variable with parameter p, where p is just $P(A)$. We later show examples of uses of indicator variables in calculation of expectations.
In applications, we are sometimes interested in the distribution of a function, say $g(X)$, of a basic random variable X. In the discrete case, the distribution of a function is found in the obvious way.

Proposition Let X be a discrete random variable and $Y = g(X)$ a real-valued function of X. Then, $P(Y = y) = \sum_{x : g(x) = y} P(X = x)$.
First, the constant c must be explicitly evaluated. By directly summing the values,
So, for example, $P(Z = 0) = P(X = -2) + P(X = 0) + P(X = 2) = 7/13$. The pmf of $Z = h(X)$ is

$P(Z = z)$: 3/13, 7/13, 3/13

at the three possible values of Z.
A key concept in probability is that of independence of a collection of random variables. The collection could be finite or infinite. In the infinite case, we want each finite subcollection of the random variables to be independent. The definition of independence of a finite collection is as follows.

Definition 1.11 Let $X_1, X_2, \ldots, X_k$ be $k \ge 2$ discrete random variables defined on the same sample space $\Omega$. We say that $X_1, X_2, \ldots, X_k$ are independent if

$P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k) = P(X_1 = x_1)P(X_2 = x_2) \cdots P(X_k = x_k) \;\; \forall\, x_1, x_2, \ldots, x_k.$
It follows from the definition of independence of random variables that if $X_1, X_2$ are independent, then any function of $X_1$ and any function of $X_2$ are also independent. In fact, we have a more general result.

Theorem 1.7 Let $X_1, X_2, \ldots, X_k$ be $k \ge 2$ discrete random variables, and suppose they are independent. Let $U = f(X_1, X_2, \ldots, X_i)$ be some function of $X_1, X_2, \ldots, X_i$, and $V = g(X_{i+1}, \ldots, X_k)$ be some function of $X_{i+1}, \ldots, X_k$. Then, U and V are independent.

This result is true for any types of random variables $X_1, X_2, \ldots, X_k$, not just discrete ones.
A common notation of wide use in probability and statistics is now introduced. If $X_1, X_2, \ldots, X_k$ are independent, and moreover have the same CDF, say F, then we say that $X_1, X_2, \ldots, X_k$ are iid (or IID) and write $X_1, X_2, \ldots, X_k \overset{\text{iid}}{\sim} F$. The abbreviation iid (IID) means independent and identically distributed.
Example 1.11 (Two Simple Illustrations) Consider the experiment of tossing a fair coin (or any coin) four times. Suppose $X_1$ is the number of heads in the first two tosses, and $X_2$ is the number of heads in the last two tosses. Then, it is intuitively clear that $X_1, X_2$ are independent, because the last two tosses carry no information regarding the first two tosses. The independence can be easily mathematically verified by using the definition of independence.

Next, consider the experiment of drawing 13 cards at random from a deck of 52 cards. Suppose $X_1$ is the number of aces and $X_2$ is the number of clubs among the 13 cards. Then, $X_1, X_2$ are not independent. For example, $P(X_1 = 4, X_2 = 0) = 0$, but $P(X_1 = 4)$ and $P(X_2 = 0)$ are both $> 0$, and so $P(X_1 = 4)P(X_2 = 0) > 0$. So, $X_1, X_2$ cannot be independent.
1.3.2 Expectation and Moments
By definition, a random variable takes different values on different occasions. It is natural to want to know what value it takes on average. Averaging is a very primitive concept. A simple average of just the possible values of the random variable can be misleading, because some values may have so little probability that they are relatively inconsequential. The average or the mean value, also called the expected value of a random variable, is a weighted average of the different values of X, weighted according to how important the value is. Here is the definition.
Definition 1.12 Let X be a discrete random variable taking the values $x_1, x_2, x_3, \ldots$ with pmf $p(x)$. We say that the expected value of X exists if $\sum_i |x_i| p(x_i) < \infty$, in which case the expected value is

$E(X) = \mu = \sum_i x_i p(x_i).$

The expected value is also known as the expectation or the mean of X.

If the set of possible values of X is infinite, then the infinite sum $\sum_i x_i p(x_i)$ must converge absolutely for the expectation to exist. An equivalent expression is available: if the sample space $\Omega$ is finite or countably infinite and X is a discrete random variable with expectation $\mu$, then

$\mu = \sum_{\omega \in \Omega} X(\omega) P(\omega),$

where $P(\omega)$ is the probability of the sample point $\omega$.
Important Point Although it is not the focus of this chapter, in applications we are often interested in more than one variable at the same time. To be specific, consider two discrete random variables X, Y defined on a common sample space. Then we could construct new random variables out of X and Y, for example, $XY, X + Y, X^2 + Y^2$, and so on. We can then talk of their expectations as well. Here is a general definition of expectation of a function of more than one random variable.
Definition 1.13 Let $X_1, X_2, \ldots, X_n$ be n discrete random variables, all defined on a common sample space $\Omega$, with a finite or a countably infinite number of sample points. We say that the expectation of a function $g(X_1, X_2, \ldots, X_n)$ exists if $\sum_{\omega} |g(X_1(\omega), X_2(\omega), \ldots, X_n(\omega))| P(\omega) < \infty$, in which case the expected value is

$E[g(X_1, X_2, \ldots, X_n)] = \sum_{\omega} g(X_1(\omega), X_2(\omega), \ldots, X_n(\omega)) P(\omega).$
Proposition (a) If there exists a finite constant c such that $P(X = c) = 1$, then $E(X) = c$.
(b) If X, Y are random variables defined on the same sample space with finite expectations, and if $P(X \le Y) = 1$, then $E(X) \le E(Y)$.
(c) If X has a finite expectation, and if $P(X \le c) = 1$, then $E(X) \le c$; if $P(X \ge c) = 1$, then $E(X) \ge c$.
Proposition (Linearity of Expectations) Let $X_1, X_2, \ldots, X_n$ be random variables defined on the same sample space $\Omega$, and $c_1, c_2, \ldots, c_n$ any real-valued constants. Then, provided $E(X_i)$ exists for every $X_i$,

$E\left(\sum_{i=1}^{n} c_i X_i\right) = \sum_{i=1}^{n} c_i E(X_i);$

in particular, $E(cX) = cE(X)$ and $E(X_1 + X_2) = E(X_1) + E(X_2)$, whenever the expectations exist.
The following fact also follows easily from the definition of the pmf of a function of a random variable. The result says that the expectation of a function of a random variable X can be calculated directly using the pmf of X itself, without having to calculate the pmf of the function.

Proposition (Expectation of a Function) Let X be a discrete random variable on a sample space $\Omega$ with a finite or countable number of sample points, and let $Y = g(X)$ for some real-valued function g. Then

$E(Y) = \sum_{x} g(x) p(x),$

provided $E(Y)$ exists.
Caution If $g(X)$ is a linear function of X, then, of course, $E(g(X)) = g(E(X))$. But, in general, the two things are not equal. For example, $E(X^2)$ is not the same as $(E(X))^2$; indeed, $E(X^2) > (E(X))^2$ for any random variable X that is not a constant.
A very important property of independent random variables is the following factorization result on expectations.

Theorem 1.8 Suppose $X_1, X_2, \ldots, X_n$ are independent random variables. Then, provided each expectation exists,

$E(X_1 X_2 \cdots X_n) = E(X_1)E(X_2) \cdots E(X_n).$
Let us now show some more illustrative examples.
Example 1.12 Let X be the number of heads obtained in two tosses of a fair coin. The pmf of X is $p(0) = p(2) = 1/4, p(1) = 1/2$. Therefore, $E(X) = 0 \cdot 1/4 + 1 \cdot 1/2 + 2 \cdot 1/4 = 1$. Because the coin is fair, we expect it to show heads 50% of the number of times it is tossed, which is 50% of 2, that is, 1.
Example 1.13 (Dice Sum) Let X be the sum of the two rolls when a fair die is rolled twice. The pmf of X is $p(2) = p(12) = 1/36; p(3) = p(11) = 2/36; p(4) = p(10) = 3/36; p(5) = p(9) = 4/36; p(6) = p(8) = 5/36; p(7) = 6/36$. Therefore, $E(X) = 2 \cdot 1/36 + 3 \cdot 2/36 + 4 \cdot 3/36 + \cdots + 12 \cdot 1/36 = 7$. This can also be seen by letting $X_1$ = the face obtained on the first roll, $X_2$ = the face obtained on the second roll, and by using $E(X) = E(X_1 + X_2) = E(X_1) + E(X_2) = 3.5 + 3.5 = 7$.

Let us now make this problem harder. Suppose that a fair die is rolled 10 times and X is the sum of all 10 rolls. The pmf of X is no longer so simple; it will be cumbersome to write it down. But, if we let $X_i$ = the face obtained on the ith roll, it is still true by the linearity of expectations that $E(X) = E(X_1 + X_2 + \cdots + X_{10}) = E(X_1) + E(X_2) + \cdots + E(X_{10}) = 3.5 \times 10 = 35$. We can easily compute the expectation, although the pmf would be difficult to write down.
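Both routes to the dice-sum expectation can be checked in a few lines (a sketch of mine, not from the text): direct enumeration of the 36 equally likely outcomes for two rolls, and linearity for ten rolls.

```python
from itertools import product

# E(sum of two fair dice) by direct enumeration of all 36 outcomes.
outcomes = list(product(range(1, 7), repeat=2))
e_two = sum(a + b for a, b in outcomes) / len(outcomes)
assert e_two == 7.0

# E(sum of ten rolls) via linearity: no 10-roll pmf needed.
e_one = sum(range(1, 7)) / 6          # 3.5
assert 10 * e_one == 35.0
```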
Example 1.14 (A Random Variable Without a Finite Expectation) Let X take the positive integers 1, 2, 3, ... as its values with the pmf

$p(x) = P(X = x) = \dfrac{1}{x(x+1)}, \quad x = 1, 2, 3, \ldots.$

This is a valid pmf, because obviously $\frac{1}{x(x+1)} > 0$ for any $x = 1, 2, 3, \ldots$, and also the infinite series $\sum_{x=1}^{\infty} \frac{1}{x(x+1)}$ sums to 1, a fact from calculus. Now,

$E(X) = \sum_{x=1}^{\infty} x \cdot \frac{1}{x(x+1)} = \sum_{x=1}^{\infty} \frac{1}{x+1} = \infty,$

also a fact from calculus.

This example shows that not all random variables have a finite expectation. Here, the reason for the infiniteness of E(X) is that X takes large integer values x with probabilities $p(x)$ that are not adequately small. The large values are realized sufficiently often that on average X becomes larger than any given finite number.

The zero–one nature of indicator random variables is extremely useful for calculating expectations of certain integer-valued random variables whose distributions are sometimes so complicated that it would be difficult to find their expectations directly from definition. We describe the technique and some illustrations of it below.
Proposition Let X be an integer-valued random variable such that it can be represented as $X = \sum_{i=1}^{m} c_i I_{A_i}$ for some m, constants $c_1, c_2, \ldots, c_m$, and suitable events $A_1, A_2, \ldots, A_m$. Then, $E(X) = \sum_{i=1}^{m} c_i P(A_i)$.
Example 1.15 (Coin Tosses) Suppose a coin that has probability p of showing heads in any single toss is tossed n times, and let X denote the number of times in the n tosses that a head is obtained. Then, $X = \sum_{i=1}^{n} I_{A_i}$, where $A_i$ is the event that a head is obtained in the ith toss. Therefore, $E(X) = \sum_{i=1}^{n} P(A_i) = \sum_{i=1}^{n} p = np$. A direct calculation of the expectation would involve finding the pmf of X and obtaining the sum $\sum_{x=0}^{n} x P(X = x)$; it can also be done that way, but that is a much longer calculation.

The random variable X of this example is a binomial random variable with parameters n and p. Its pmf is given by the formula

$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, 2, \ldots, n.$
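The longer pmf-based route and the indicator argument agree, as a quick check shows (my sketch; the function name is invented, and n = 12, p = 0.3 are arbitrary test values):

```python
from math import comb

# E(X) for a binomial computed the long way, from its pmf.
def binom_mean(n, p):
    return sum(x * comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(n + 1))

n, p = 12, 0.3
assert abs(binom_mean(n, p) - n * p) < 1e-9   # matches E(X) = np
```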
Example 1.16 (Consecutive Heads in Coin Tosses) Suppose a coin with probability p for heads in a single toss is tossed n times. How many times can we expect to see a head followed by at least one more head? For example, if n = 5 and we see the outcomes HTHHH, then we see a head followed by at least one more head twice. Define $A_i$ = {The ith and the (i+1)th toss both result in heads}. Then

X = number of times a head is followed by at least one more head = $\sum_{i=1}^{n-1} I_{A_i}$.

Therefore, $E(X) = \sum_{i=1}^{n-1} p^2 = (n-1)p^2$. For example, if a fair coin is tossed 20 times, we can expect to see a head followed by another head about five times ($19 \times .5^2 = 4.75$).
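The indicator-based answer $(n-1)p^2$ can be verified exactly for a small n by brute force over all equally likely toss sequences (a sketch of mine, for a fair coin with n = 5):

```python
from itertools import product

# Exact E(number of HH adjacencies) over all 2^n toss sequences.
n = 5
total = 0
for seq in product("HT", repeat=n):
    total += sum(seq[i] == seq[i + 1] == "H" for i in range(n - 1))
e_exact = total / 2**n

assert e_exact == (n - 1) * 0.25   # (n-1)p^2 with p = 1/2
```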
Another useful technique for calculating expectations of nonnegative integer-valued random variables is based on the CDF of the random variable, rather than directly on the pmf. This method is useful when calculating probabilities of the form $P(X > x)$ is logically more straightforward than directly calculating $P(X = x)$. Here is the expectation formula based on the tail CDF.

Theorem 1.9 (Tailsum Formula) Let X take values 0, 1, 2, .... Then

$E(X) = \sum_{n=0}^{\infty} P(X > n).$
Example 1.17 (Family Planning) Suppose a couple will have children until they have at least one child of each sex. How many children can they expect to have? Let X denote the childbirth at which they have a child of each sex for the first time. Suppose the probability that any particular childbirth will be a boy is p, and that all births are independent. Then, for $n \ge 1$,

$P(X > n) = P(\text{the first n children are all boys or all girls}) = p^n + (1-p)^n.$

Therefore, $E(X) = 2 + \sum_{n=2}^{\infty} [p^n + (1-p)^n] = 2 + p^2/(1-p) + (1-p)^2/p = \frac{1}{p(1-p)} - 1$. If boys and girls are equally likely on any childbirth, then this says that a couple waiting to have a child of each sex can expect to have three children.
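The tailsum formula and the closed form can be compared numerically (my own sketch; note that $P(X > 0) = P(X > 1) = 1$ here, since X is always at least 2):

```python
# E(X) by the tailsum formula: sum_{n>=0} P(X > n), with
# P(X > n) = p^n + (1-p)^n for n >= 1 and P(X > 0) = 1.
def expected_children(p, terms=10_000):
    return 1.0 + sum(p**n + (1 - p)**n for n in range(1, terms))

p = 0.5
closed_form = 1 / (p * (1 - p)) - 1      # = 3 for p = 1/2
assert abs(expected_children(p) - closed_form) < 1e-9
```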
The expected value is calculated with the intention of understanding what a typical value of a random variable is. But two very different distributions can have exactly the same expected value. A common example is that of a return on an investment in a stock. Two stocks may have the same average return, but one may be much riskier than the other, in the sense that the variability in the return is much higher for that stock. In that case, most risk-averse individuals would prefer to invest in the stock with less variability. Measures of risk or variability are of course not unique. Some natural measures that come to mind are $E(|X - \mu|)$, known as the mean absolute deviation, or $P(|X - \mu| > k)$ for some suitable k. However, neither of these two is the most common measure of variability. The most common measure is the standard deviation of a random variable.
Definition 1.14 Let a random variable X have a finite mean $\mu$. The variance of X is defined as

$\sigma^2 = E[(X - \mu)^2],$

and the standard deviation of X is defined as $\sigma = \sqrt{\sigma^2}$.
It is easy to prove that $\sigma^2 < \infty$ if and only if $E(X^2)$, the second moment of X, is finite. It is not uncommon to mistake the standard deviation for the mean absolute deviation, but they are not the same. In fact, an inequality always holds.

Proposition $\sigma \ge E(|X - \mu|)$, and is strictly greater unless X is a constant random variable, namely, $P(X = \mu) = 1$.
We list some basic properties of the variance of a random variable.

Proposition
(a) $\text{Var}(cX) = c^2 \text{Var}(X)$ for any real c.
(b) $\text{Var}(X + k) = \text{Var}(X)$ for any real k.
(c) $\text{Var}(X) \ge 0$ for any random variable X, and equals zero only if $P(X = c) = 1$ for some real constant c.
(d) $\text{Var}(X) = E(X^2) - \mu^2$.
The quantity $E(X^2)$ is called the second moment of X. The definition of a general moment is as follows.

Definition 1.15 Let X be a random variable, and $k \ge 1$ a positive integer. Then $E(X^k)$ is called the kth moment of X, and $E(X^{-k})$ is called the kth inverse moment of X, provided they exist.
We therefore have the following relationships involving moments and the variance:

Variance = Second Moment − (First Moment)².
Second Moment = Variance + (First Moment)².
Statisticians often use the third moment around the mean as a measure of lack of symmetry in the distribution of a random variable. The point is that if a random variable X has a symmetric distribution and has a finite mean $\mu$, then all odd moments around the mean, namely, $E[(X - \mu)^{2k+1}]$, will be zero, if the moment exists.
In particular, $E[(X - \mu)^3]$ will be zero. Likewise, statisticians also use the fourth moment around the mean as a measure of how spiky the distribution is around the mean. To make these indices independent of the choice of unit of measurement (e.g., inches or centimeters), they use certain scaled measures of asymmetry and peakedness. Here are the definitions.

Definition 1.16 (a) Let X be a random variable with $E[|X|^3] < \infty$. The skewness of X is defined as $\beta = \frac{E[(X - \mu)^3]}{\sigma^3}$.
(b) Let X be a random variable with $E[X^4] < \infty$. The kurtosis of X is defined as $\gamma = \frac{E[(X - \mu)^4]}{\sigma^4} - 3$; it serves as a measure of peakedness when a picture of the distribution is not really available. We later show that $\gamma = 0$ for all normal distributions; hence the motivation for subtracting 3 in the definition of $\gamma$.
Example 1.18 (Variance of Number of Heads) Consider the experiment of two tosses of a fair coin and let X be the number of heads obtained. Then, we have seen that $p(0) = p(2) = 1/4$ and $p(1) = 1/2$. Thus, $E(X^2) = 0 \cdot 1/4 + 1 \cdot 1/2 + 4 \cdot 1/4 = 3/2$, and $E(X) = 1$. Therefore, $\text{Var}(X) = E(X^2) - \mu^2 = 3/2 - 1 = \frac{1}{2}$, and the standard deviation is $\sigma = \sqrt{.5} = .707$.
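The identity Var(X) = E(X²) − μ² used in Example 1.18 is easy to check directly from the pmf (a sketch of mine, not from the text):

```python
# Number of heads in two fair tosses: p(0) = p(2) = 1/4, p(1) = 1/2.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

mu = sum(x * p for x, p in pmf.items())
second = sum(x * x * p for x, p in pmf.items())
var = second - mu**2

assert mu == 1.0 and second == 1.5 and var == 0.5
print(var**0.5)  # ≈ 0.707, the standard deviation
```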
Example 1.19 (A Random Variable with an Infinite Variance) If a random variable has a finite variance, then it can be shown that it must have a finite mean. This example shows that the converse need not be true.

Let X be a discrete random variable with the pmf

$p(x) = \dfrac{c}{x(x+1)(x+2)}, \quad x = 1, 2, 3, \ldots,$

where c is the normalizing constant. Therefore, by direct verification, X has a finite expectation. Let us now examine the second moment of X:

$E(X^2) = c \sum_{x=1}^{\infty} x^2 \cdot \frac{1}{x(x+1)(x+2)} = c \sum_{x=1}^{\infty} \frac{x}{(x+1)(x+2)} = \infty,$

because the series is not finitely summable, a fact from calculus. Because $E(X^2)$ is infinite, but E(X) is finite, $\sigma^2 = E(X^2) - [E(X)]^2$ must also be infinite.
If a collection of random variables is independent, then, just like the expectation, the variance also adds up. Precisely, one has the following very useful fact.

Theorem 1.10 Let $X_1, X_2, \ldots, X_n$ be n independent random variables. Then,

$\text{Var}(X_1 + X_2 + \cdots + X_n) = \text{Var}(X_1) + \text{Var}(X_2) + \cdots + \text{Var}(X_n).$

An important corollary of this result is the following variance formula for the mean, $\bar{X}$, of n independent and identically distributed random variables.

Corollary 1.1 Let $X_1, X_2, \ldots, X_n$ be independent random variables with a common variance $\sigma^2 < \infty$. Let $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$. Then $\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$.
Theorem (a) (Chebyshev's Inequality) Suppose X has mean $\mu$ and variance $\sigma^2$, assumed to be finite. Let k be any positive number. Then

$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$

(b) (Markov's Inequality) Suppose X takes only nonnegative values, and suppose $E(X) = \mu$, assumed to be finite. Let c be any positive number. Then,

$P(X \ge c) \le \frac{\mu}{c}.$
The virtue of these two inequalities is that they make no restrictive assumptions on the random variable X. Whenever $\mu, \sigma$ are finite, Chebyshev's inequality is applicable, and whenever $\mu$ is finite, Markov's inequality applies, provided the random variable is nonnegative. However, the universal nature of these inequalities also makes them typically quite conservative.

Although Chebyshev's inequality usually gives conservative estimates for tail probabilities, it does imply a major result in probability theory in a special case.
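The conservatism of Chebyshev's inequality is easy to see on a concrete distribution. This sketch (mine, not the book's; Binomial(100, .5) is an arbitrary test case) compares the bound 1/k² at k = 2 with the exact two-sided tail:

```python
from math import comb, sqrt

# X ~ Binomial(100, 0.5): mu = 50, sigma = 5.
n, p, k = 100, 0.5, 2
mu, sigma = n * p, sqrt(n * p * (1 - p))

pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
exact = sum(q for x, q in enumerate(pmf) if abs(x - mu) >= k * sigma)

print(round(exact, 4), 1 / k**2)  # exact tail is far below the 0.25 bound
assert exact <= 1 / k**2
```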