This is the companion second volume to my undergraduate text Fundamentals of Probability: A First Course. The purpose of my writing this book is to give graduate students, instructors, and researchers in statistics, mathematics, and computer science a lucidly written, unique text at the confluence of probability, advanced stochastic processes, statistics, and key tools for machine learning. Numerous topics in probability and stochastic processes of current importance in statistics and machine learning, widely scattered across many different specialized books, are brought together under one fold in this book. This is done with an extensive bibliography for each topic, and numerous worked-out examples and exercises.

Probability, with all its models, techniques, and its poignant beauty, is an incredibly powerful tool for anyone who deals with data or randomness. The content and the style of this book reflect that philosophy; I emphasize lucidity, a wide background, and the far-reaching applicability of probability in science. The book starts with a self-contained and fairly complete review of basic probability, and then traverses its way through the classics to advanced modern topics and tools, including a substantial amount of statistics itself. Because of its nearly encyclopaedic coverage, it can serve as a graduate text for a year-long probability sequence, or for focused short courses on selected topics, for self-study, and as a nearly unique reference for research in statistics, probability, and computer science.

The book provides an extensive treatment of most of the standard topics in a graduate probability sequence, and integrates them with the basic theory and many examples of several core statistical topics, as well as with some tools of major importance in machine learning. This is done with unusually detailed bibliographies for the reader who wants to dig deeper into a particular topic, and with a huge repertoire of worked-out examples and exercises. The total number of worked-out examples in this book is 423, and the total number of exercises is 808. An instructor can rotate the exercises between semesters and use them for setting exams, and a student can use them for additional exam preparation and self-study. I believe that the book is unique in its range, unification, bibliographic detail, and its collection of problems and examples.
Experiments and Sample Spaces
Probability theory begins with the concept of a sample space, which is the set of all possible outcomes of a specific experiment. For instance, when a coin is tossed twice and the results of each toss are recorded, the possible outcomes of this coin-tossing experiment
are HH, HT, TH, TT, with H denoting the occurrence of heads and T denoting the occurrence of tails. We call Ω = {HH, HT, TH, TT} the sample space of the experiment.
In general, a sample space is a set, which can be either finite or infinite. An example of an infinite sample space arises when a coin is tossed repeatedly until heads appears for the first time, and the number of tosses required is recorded. In this case, the sample space is the countably infinite set Ω = {1, 2, 3, ...}.
Sample spaces can also be uncountably infinite; for example, if a random number is selected from the interval [0, 1], the sample space is Ω = [0, 1], which is an uncountably infinite set. Individual elements of a sample space are denoted by ω. The first tasks are to define events and to explain what is meant by the probability of an event.
In probability theory, an event is a subset of the sample space, including the empty set and the entire sample space itself; an event may also consist of a single sample point, in which case it is called a singleton set. To assign probabilities to events in a logically consistent manner, one restricts attention to a suitably chosen class of subsets, closed under the natural set operations, called a σ-field. In essentially all practical situations, including those with infinite sample spaces, the events of interest do belong to an appropriate σ-field. Therefore, we will treat events simply as subsets of the sample space without further emphasis on the σ-field concept.
Here is a definition of what counts as a legitimate probability on events.
Definition 1.2 Given a sample space Ω, a probability or a probability measure on Ω is a function P on subsets of Ω such that
(a) P(A) ≥ 0 for every A ⊆ Ω;
(b) P(Ω) = 1;
(c) given disjoint subsets A1, A2, ... of Ω, P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).
Countable additivity, property (c), is an assumption rather than a provable statement. Experience suggests that treating this assumption as valid yields useful and credible answers to a wide variety of problems, which makes it a reasonable basis for analysis.
Not all probabilists agree that countable additivity is an equally natural requirement, but this book does not enter that debate. It is useful to note that finite additivity is subsumed by countable additivity; that is, for any finite number of disjoint subsets A1, A2, ..., Am of Ω, P(∪_{i=1}^m Ai) = Σ_{i=1}^m P(Ai). Also, the last two conditions in the definition of a probability measure imply that P(∅), the probability of the empty set or the null event, is zero.
In probability notation, it is more precise to denote the probability of a singleton set {ω} by P({ω}). However, to simplify expressions and reduce clutter, we often use the more convenient notation P(ω).
One pleasant consequence of the axiom of countable additivity is the following basic result. We do not prove it here, as it is a simple result; see DasGupta (2010) for a proof.
Theorem 1.1 Let A1 ⊇ A2 ⊇ A3 ⊇ ... be an infinite family of subsets of a sample space Ω such that An ↓ A. Then, P(An) → P(A) as n → ∞.
Next, the concept of equally likely sample points is a very fundamental one.
Definition 1.3 Let Ω be a finite sample space consisting of N sample points. We say that the sample points are equally likely if P(ω) = 1/N for each sample point ω.
An immediate consequence, due to the additivity axiom, is the following useful formula.
Proposition Let Ω be a finite sample space consisting of N equally likely sample points. Let A be any event, and suppose A contains n distinct sample points. Then
P(A) = n/N = (Number of sample points favorable to A)/(Total number of sample points).
Let us see some examples.
Suppose that five pairs of shoes are kept in a closet, and four individual shoes are picked at random. We want the probability that the four chosen shoes contain at least one complete pair.
The total number of sample points is C(10, 4) = 210, and because the shoes are picked at random, each sample point is equally likely. To have at least one complete pair among the chosen shoes, we must either choose two complete pairs, or exactly one complete pair together with two shoes that do not form a pair. Two complete pairs can be chosen in C(5, 2) = 10 ways. Exactly one complete pair can be chosen in C(5, 1) × C(4, 2) × 2 × 2 = 120 ways; the C(5, 1) term is for choosing the pair that is complete, the C(4, 2) term is for choosing the two incomplete pairs, and from each incomplete pair one shoe is chosen, either the left or the right one. Therefore, the required probability is (10 + 120)/210 = 13/21 ≈ .62.
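The counting in this example is easy to confirm by brute force. The following illustrative Python sketch (not part of the original text) labels the ten shoes as (pair, side), enumerates all C(10, 4) equally likely selections, and applies the favorable-over-total formula from the proposition above.

```python
from itertools import combinations

# Label the ten shoes as (pair index, side): five pairs, a left and a right shoe each.
shoes = [(pair, side) for pair in range(5) for side in ("L", "R")]

def has_complete_pair(selection):
    # A complete pair means both shoes of some pair index were selected.
    pairs = [pair for pair, _ in selection]
    return any(pairs.count(p) == 2 for p in range(5))

selections = list(combinations(shoes, 4))              # C(10, 4) = 210 equally likely selections
favorable = sum(has_complete_pair(s) for s in selections)
print(favorable, len(selections), favorable / len(selections))   # 130 210 0.619...
```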
In five-card poker, a player is dealt five cards at random from a full deck of 52 cards. Poker hands of various descriptions differ greatly in their rarity; here we calculate the probabilities of being dealt two pairs and of being dealt a flush. A two pairs hand consists of two cards of one rank, two cards of another rank, and a fifth card of yet another rank; a flush is a hand with five cards of the same suit, provided the five cards do not form a sequence. Let A be the event of being dealt two pairs and B the event of being dealt a flush.
To find P(A), note that the two ranks contributing the pairs can be chosen in C(13, 2) ways, the two suits within each of those ranks in C(4, 2) ways, and the fifth card in 44 ways, so
P(A) = C(13, 2) C(4, 2)² × 44 / C(52, 5) ≈ .0475.
To find P(B), note that there are 10 ways to select 5 cards from a suit such that the cards are in a sequence, namely {A, 2, 3, 4, 5}, {2, 3, 4, 5, 6}, ..., {10, J, Q, K, A}, and so
P(B) = 4 [C(13, 5) − 10] / C(52, 5) ≈ .0020.
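Both probabilities can be reproduced directly from the binomial-coefficient counts above; the short Python sketch below is one illustrative way to do the arithmetic (math.comb is the standard-library binomial coefficient).

```python
from math import comb

total = comb(52, 5)                      # number of five-card hands

# Two pairs: two ranks for the pairs, two suits within each, and a fifth card of a different rank.
two_pairs = comb(13, 2) * comb(4, 2) ** 2 * 44

# Flush as defined here: a suit, five of its thirteen ranks, minus the ten sequences.
flush = 4 * (comb(13, 5) - 10)

print(two_pairs / total)                 # about .0475
print(flush / total)                     # about .0020
```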
These are basic examples of counting arguments that are useful whenever there is a finite sample space and we assume that all sample points are equally likely.
A major result in combinatorial probability is the inclusion-exclusion formula, which says the following.
Theorem 1.2 Let A1, A2, ..., An be n general events. Let
S1 = Σ_i P(Ai), S2 = Σ_{i<j} P(Ai ∩ Aj), and, in general, Sk = Σ_{i1<i2<...<ik} P(A_{i1} ∩ A_{i2} ∩ ... ∩ A_{ik}), 1 ≤ k ≤ n.
Then
P(A1 ∪ A2 ∪ ... ∪ An) = S1 − S2 + S3 − ... + (−1)^{n+1} Sn.
As an application, consider a Bridge game, in which North receives 13 cards at random from a full 52-card deck. We want the probability that North's hand has a void in at least one suit, that is, that at least one of the four suits is entirely missing from North's hand. Label the suits 1, 2, 3, and 4, and let
Ai = North's hand is void in suit i, i = 1, 2, 3, 4.
Then, by the inclusion-exclusion formula,
P(North's hand is void in at least one suit) = P(A1 ∪ A2 ∪ A3 ∪ A4) = S1 − S2 + S3 − S4
= 4 C(39, 13)/C(52, 13) − C(4, 2) C(26, 13)/C(52, 13) + C(4, 3) C(13, 13)/C(52, 13) − 0
≈ .051,
which is small, but not very small.
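The inclusion-exclusion computation for this example is easy to reproduce numerically; here is an illustrative Python sketch.

```python
from math import comb

total = comb(52, 13)                     # all possible 13-card hands for North

# S_j: C(4, j) ways to pick j suits, and a hand void in those suits uses the other 52 - 13j cards.
S = [comb(4, j) * comb(52 - 13 * j, 13) / total for j in range(1, 5)]

# Inclusion-exclusion: P(void in at least one suit) = S1 - S2 + S3 - S4.
print(S[0] - S[1] + S[2] - S[3])         # about .051
```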
Conditional Probability and Independence
The inclusion-exclusion formula can be hard to apply exactly, because the quantities S_j are difficult to compute for large j. However, it leads to useful bounds, in both directions, on the probability of the union of n general events.
Theorem 1.3 (Bonferroni Bounds) Given n events A1, A2, ..., An, let pn = P(∪_{i=1}^n Ai). Then,
pn ≤ S1;  pn ≥ S1 − S2;  pn ≤ S1 − S2 + S3;  and so on.
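For the Bridge void example above, the first few Bonferroni bounds can be compared with the exact inclusion-exclusion answer; the following sketch is illustrative only.

```python
from math import comb

total = comb(52, 13)
S1, S2, S3, S4 = [comb(4, j) * comb(52 - 13 * j, 13) / total for j in range(1, 5)]

exact = S1 - S2 + S3 - S4
print("p_n <= S1          :", S1)
print("p_n >= S1 - S2     :", S1 - S2)
print("p_n <= S1 - S2 + S3:", S1 - S2 + S3)
print("exact value        :", exact)     # the bounds bracket the exact answer of about .051
```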
Conditional probability and independence are essential concepts in probability and statistics. Conditioning amounts to updating beliefs in light of new information, while independence means that the new information does not alter existing beliefs. Furthermore, assuming independence can greatly simplify the development, mathematical analysis, and validation of many statistical tools and procedures.
Definition 1.4 Let A, B be general events with respect to some sample space Ω, and suppose P(A) > 0. The conditional probability of B given A is defined as
P(B | A) = P(A ∩ B)/P(A).
Some immediate consequences of the definition of a conditional probability are the following.
Theorem 1.4 (a) (Multiplicative Formula) For any two events A, B such that P(A) > 0,
P(A ∩ B) = P(A) P(B | A).
(b) For any two events A, B such that 0 < P(A) < 1, one has
P(B) = P(B | A) P(A) + P(B | A^c) P(A^c).
(c) (Total Probability Formula) If A1, A2, ..., Ak form a partition of the sample space Ω (i.e., Ai ∩ Aj = ∅ for all i ≠ j, and ∪_{i=1}^k Ai = Ω), and if 0 < P(Ai) < 1 for every i, then
P(B) = Σ_{i=1}^k P(B | Ai) P(Ai).
Definition 1.5 A collection of events A1, A2, ..., An is said to be mutually independent (or just independent) if for each k, 1 ≤ k ≤ n, and any k of the events, A_{i1}, ..., A_{ik},
P(A_{i1} ∩ ... ∩ A_{ik}) = P(A_{i1}) ... P(A_{ik}).
They are called pairwise independent if this property holds for k = 2.
Many individuals purchase lottery tickets hoping for good fortune, but statistically the practice almost always results in a financial loss. For instance, in a weekly state lottery in which five numbers are drawn from the pool 00 to 49 without replacement, the odds of winning with a single ticket are exceedingly small.
Even if an individual purchases a lottery ticket every week for 40 years, the probability of winning at least once is still only approximately 1.14 in 7,524,000, which highlights the extremely low odds of ever winning. This calculation rests on the assumption that the weekly lotteries are independent, a premise that is essential for the probability computation.
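The "at least one win" calculation uses independence across weeks. The sketch below is purely illustrative: it takes the single-ticket winning probability to be 1/C(50, 5), one possible reading of drawing five numbers from fifty values without replacement, which is an assumption rather than the exact rules of the lottery described above.

```python
from math import comb

# Assumed single-ticket winning probability: matching all five numbers drawn
# from fifty possible values (an illustrative assumption; see the text).
p_win = 1 / comb(50, 5)

weeks = 52 * 40                          # one ticket a week for forty years

# Independence across weeks: P(at least one win) = 1 - (1 - p)^n.
print(p_win, 1 - (1 - p_win) ** weeks)
```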
The conditional probabilities P(A|B) and P(B|A) are often confused. For instance, let A be the event that a person develops lung cancer and B the event that the person is a smoker. If, in a group of lung cancer patients, a high percentage are smokers, we can only conclude that P(B|A) is large, that is, that many lung cancer patients are smokers. By itself this does not allow us to infer that P(A|B), the probability of lung cancer for a smoker, is large. To calculate P(A|B) when P(B|A) is known, Bayes' theorem provides a straightforward formula.
Theorem 1.5 Let {A1, A2, ..., Am} be a partition of a sample space Ω. Let B be some fixed event. Then
P(Aj | B) = P(B | Aj) P(Aj) / [Σ_{i=1}^m P(B | Ai) P(Ai)], j = 1, 2, ..., m.
In a multiple choice exam, each question has five alternatives, exactly one of which is correct. A student knows the correct answer to a particular question with probability .7; with probability .3, she guesses one of the five alternatives at random. Given that she answered the question correctly, we want to find the probability that she actually knew the correct answer. Define the events
A = The student knew the correct answer;
B = The student answered the question correctly.
We want to compute P(A | B). By Bayes' theorem,
P(A | B) = P(B | A) P(A) / [P(B | A) P(A) + P(B | A^c) P(A^c)] = (1)(.7)/[(1)(.7) + (.2)(.3)] = .7/.76 ≈ .921.
Before the student answered the question, the probability that she knew the correct answer was 70%. After she answered the question correctly, the posterior probability that she knew the answer rises to about 92.1%. This updating of a prior belief in the light of new evidence is the essence of Bayes' theorem.
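The posterior computation in this example is a direct application of the total probability formula and Bayes' theorem; a small numerical sketch follows, purely as an illustration.

```python
# Priors and likelihoods from the example.
p_knew, p_guess = 0.7, 0.3               # P(A), P(A^c)
p_correct_given_knew = 1.0               # P(B | A)
p_correct_given_guess = 1 / 5            # P(B | A^c): one of five alternatives

# Total probability formula for P(B), then Bayes' theorem for P(A | B).
p_correct = p_correct_given_knew * p_knew + p_correct_given_guess * p_guess
print(p_correct_given_knew * p_knew / p_correct)   # 0.921...
```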
Integer-Valued and Discrete Random Variables
CDF and Independence
A cumulative distribution function (CDF) is a crucial concept in probability theory, representing the probability that a random variable X is less than or equal to a specified value x. This definition applies to all types of random variables, not just discrete ones, which is one reason for its fundamental role in describing probability distributions.
Definition 1.8 The cumulative distribution function of a random variable X is the function F(x) = P(X ≤ x), x ∈ R.
Definition 1.9 Let X have the CDF F(x). Any number m such that P(X ≤ m) ≥ .5, and also P(X ≥ m) ≥ .5, is called a median of F, or equivalently, a median of X.
Remark The median of a random variable need not be unique. A simple way to characterize all the medians of a distribution is available.
Proposition Let X be a random variable with the CDF F(x). Let m0 be the first x such that F(x) ≥ .5, and let m1 be the last x such that P(X ≥ x) ≥ .5. Then, a number m is a median of X if and only if m ∈ [m0, m1].
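This characterization is easy to apply to a discrete distribution. The sketch below uses a Binomial(3, 1/2) pmf as an illustrative test case (an arbitrary choice, not from the original text) and finds m0 and m1, hence all the medians.

```python
from math import comb

# pmf of X ~ Binomial(3, 1/2), just as a test case: values 0, 1, 2, 3.
values = [0, 1, 2, 3]
pmf = {x: comb(3, x) * 0.5 ** 3 for x in values}

def cdf(x):            # P(X <= x)
    return sum(p for v, p in pmf.items() if v <= x)

def tail(x):           # P(X >= x)
    return sum(p for v, p in pmf.items() if v >= x)

m0 = min(x for x in values if cdf(x) >= 0.5)    # first x with F(x) >= .5
m1 = max(x for x in values if tail(x) >= 0.5)   # last x with P(X >= x) >= .5
print(m0, m1)                                   # 1 2: every m in [1, 2] is a median
```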
The cumulative distribution function of a random variable satisfies certain specific properties, and any function that satisfies these properties is a valid CDF; that is, it is the CDF of some suitably defined random variable. The essential properties of CDFs are given in the following results.
Theorem 1.6 A function F(x) is the CDF of some real-valued random variable X if and only if it satisfies all of the following properties.
(a) F(x) → 0 as x → −∞.
(b) F(x) → 1 as x → +∞.
(c) Given any real number a, F(x) ↓ F(a) as x ↓ a.
(d) Given any two real numbers x, y with x < y, F(x) ≤ F(y).
Right continuity, or continuity from the right, is thus a key property of CDFs. Although a CDF must be right continuous, it need not be continuous from the left. This is particularly evident for discrete random variables, whose CDFs take jumps at specific values; at these jump points, the CDF fails to be left continuous.
Proposition Let F(x) be the CDF of some random variable X. Then, for any x,
(a) P(X = x) = F(x) − lim_{y↑x} F(y) = F(x) − F(x−), including those points x for which P(X = x) = 0.
Example 1.8 (Bridge) Consider the random variable
X = Number of aces in North's hand in a Bridge game.
Clearly, X can take any of the values x = 0, 1, 2, 3, 4. If X = x, then the other 13 − x cards in North's hand must be non-ace cards. Thus, the pmf of X is
P(X = x) = C(4, x) C(48, 13 − x) / C(52, 13), x = 0, 1, 2, 3, 4.
In decimals, the pmf of X is:
x:     0     1     2     3     4
p(x):  .304  .439  .213  .041  .003
The CDF of X is a jump function, taking jumps at the values 0, 1, 2, 3, 4, namely the possible values of X. The CDF is F(x) = 0 for x < 0, .304 for 0 ≤ x < 1, .743 for 1 ≤ x < 2, .956 for 2 ≤ x < 3, .997 for 3 ≤ x < 4, and 1 for x ≥ 4.
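The pmf and the CDF jumps above can be reproduced with binomial coefficients; here is an illustrative Python check.

```python
from math import comb

total = comb(52, 13)

# Hypergeometric pmf: x of the 4 aces and 13 - x of the 48 non-aces.
pmf = {x: comb(4, x) * comb(48, 13 - x) / total for x in range(5)}

running, cdf = 0.0, {}
for x in range(5):
    running += pmf[x]
    cdf[x] = running

print({x: round(p, 3) for x, p in pmf.items()})   # {0: 0.304, 1: 0.439, 2: 0.213, 3: 0.041, 4: 0.003}
print({x: round(v, 3) for x, v in cdf.items()})   # the jumps accumulate to 1
```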
Example 1.9 (Indicator Variables) Consider the experiment of rolling a fair die twice, and define a random variable Y as follows:
Y = 1 if the sum of the two rolls, X, is an even number;
Y = 0 if the sum of the two rolls, X, is an odd number.
If we let A be the event that X is an even number, then Y = 1 if A happens, and Y = 0 if A does not happen. Such random variables are called indicator random variables and are immensely useful in mathematical calculations in many complex situations.
Definition 1.10 Let A be any event in a sample space. The indicator random variable for A is defined as
I_A = 1 if A happens; I_A = 0 if A does not happen.
Thus, the distribution of an indicator variable is simply P(I_A = 1) = P(A); P(I_A = 0) = 1 − P(A). An indicator variable is also called a Bernoulli variable with parameter p, where p is just P(A). We later show examples of the use of indicator variables in calculating expectations.
In many applications, one needs the distribution of a function g(X) of a basic random variable X. In the discrete case, the distribution of such a function is found easily.
Proposition (Function of a Random Variable) Let X be a discrete random variable and Y = g(X) a real-valued function of X. Then, P(Y = y) = Σ_{x: g(x) = y} p(x).
Example 1.10 Suppose X has the pmf p(x) = c/(1 + x²), x = 0, ±1, ±2, ±3. Suppose we want to find the distributions of two functions of X: Y = g(X) = X³ and Z = h(X) = sin(πX/2). First, the constant c must be explicitly evaluated. By directly summing p(x) over the seven possible values of X, c[1 + 2(1/2) + 2(1/5) + 2(1/10)] = 13c/5 = 1, and so c = 5/13.
The function g is one-to-one, while h is not. The possible values of Y are 0, ±1, ±8, ±27. For instance, P(Y = 0) = P(X = 0) = c = 5/13, and P(Y = 1) = P(X = 1) = c/2 = 5/26. In general, for y = 0, ±1, ±8, ±27, P(Y = y) = P(X = y^{1/3}) = c/(1 + y^{2/3}), with c = 5/13.
However, Z = h(X) is not a one-to-one function of X. The possible values of Z are −1, 0, and 1. So, for example, P(Z = 0) = P(X = −2) + P(X = 0) + P(X = 2) = (7/5)c = 7/13. The pmf of Z = h(X) is:
z:        −1    0     1
P(Z = z): 3/13  7/13  3/13
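The grouping rule P(Y = y) = Σ_{x: g(x)=y} p(x) from the proposition is mechanical to apply. The sketch below reproduces the pmfs of Y and Z for this example, using the pmf of X and the functions g and h as reconstructed in the text above, so it is illustrative only.

```python
from collections import defaultdict
from math import sin, pi

c = 5 / 13
p = {x: c / (1 + x * x) for x in range(-3, 4)}          # pmf of X on -3, ..., 3

def pushforward(func):
    # P(func(X) = y) = sum of p(x) over all x with func(x) = y.
    out = defaultdict(float)
    for x, px in p.items():
        out[func(x)] += px
    return dict(out)

pmf_Y = pushforward(lambda x: x ** 3)                   # one-to-one: probabilities just get relabeled
pmf_Z = pushforward(lambda x: round(sin(pi * x / 2)))   # not one-to-one: probabilities add up
print(pmf_Y)
print(pmf_Z)                                            # roughly {1: 3/13, 0: 7/13, -1: 3/13}
```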
In probability theory, the independence of random variables is a fundamental concept, and it applies to both finite and infinite collections; an infinite collection of random variables is called independent if every finite subcollection is independent. Here is the formal definition for a finite collection.
Definition 1.11 Let X1, X2, ..., Xk be k ≥ 2 discrete random variables defined on the same sample space Ω. We say that X1, X2, ..., Xk are independent if
P(X1 = x1, X2 = x2, ..., Xk = xk) = P(X1 = x1) P(X2 = x2) ... P(Xk = xk) for all x1, x2, ..., xk.
It follows from the definition that if X1 and X2 are independent, then any function of X1 and any function of X2 are also independent. This extends to the following more general result on functions of independent random variables.
Theorem 1.7 Let X1, X2, ..., Xk be k ≥ 2 discrete random variables, and suppose they are independent. Let U = f(X1, X2, ..., Xi) be some function of X1, X2, ..., Xi, and V = g(X_{i+1}, ..., Xk) be some function of X_{i+1}, ..., Xk. Then, U and V are independent.
This result is true for random variables X1, X2, ..., Xk of any type, not just discrete ones.
A common notation of wide use in probability and statistics is now introduced.
When random variables X1, X2, ..., Xk are independent and have the same CDF, say F, they are called independent and identically distributed, abbreviated as iid; we also say that they are iid with common distribution F.
In the experiment of tossing a fair coin four times, let X1 represent the number of heads in the first two tosses and X2 the number of heads in the last two tosses. It is intuitively clear that X1 and X2 are independent, as the outcomes of the last two tosses do not influence the results of the first two. This can be confirmed mathematically by applying the formal definition of independence.
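The formal verification mentioned here amounts to checking the product rule over all value pairs; the brute-force sketch below does this over the 16 equally likely outcomes, purely as an illustration.

```python
from itertools import product

outcomes = list(product("HT", repeat=4))       # 16 equally likely outcomes of four tosses
n = len(outcomes)

def x1(o): return o[:2].count("H")             # heads in the first two tosses
def x2(o): return o[2:].count("H")             # heads in the last two tosses

independent = True
for i in range(3):
    for j in range(3):
        p_joint = sum(1 for o in outcomes if x1(o) == i and x2(o) == j) / n
        p1 = sum(1 for o in outcomes if x1(o) == i) / n
        p2 = sum(1 for o in outcomes if x2(o) == j) / n
        independent = independent and abs(p_joint - p1 * p2) < 1e-12
print(independent)                             # True
```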
In an experiment in which 13 cards are chosen at random from a standard 52-card deck, the variables X1, the number of aces, and X2, the number of clubs, are not independent. Specifically, the probability of drawing four aces (X1 = 4) while having no clubs (X2 = 0) is zero, even though the probabilities of drawing four aces and of drawing no clubs are individually both greater than zero. This shows a dependency between the two variables.
Expectation and Moments
A random variable can assume various values at different times, so it is natural to ask for its average value. However, a simple average of all the possible values can be misleading, because some values may have negligible probabilities. The mean value, also known as the expected value, is a weighted average of the possible values, with each value weighted by its probability.
Definition 1.12 Let X be a discrete random variable. We say that the expected value of X exists if Σ_i |xi| p(xi) < ∞, in which case the expected value is defined as
E(X) = Σ_i xi p(xi).
For notational convenience, we simply write Σ_x x p(x) instead of Σ_i xi p(xi). The expected value is also known as the expectation or the mean of X.
If the set of possible values of X is infinite, then the infinite sum Σ_x x p(x) can take different values on rearranging the terms of the infinite series unless Σ_x |x| p(x) < ∞; this is why absolute convergence is required in the definition. Sometimes, calculating the tail probability P(X > x) is logically more straightforward than directly calculating P(X = x). Here is the expectation formula based on the tail CDF.
Theorem 1.9 (Tailsum Formula) Let X take the values 0, 1, 2, .... Then
E(X) = Σ_{n=0}^∞ P(X > n).
In a family-planning scenario, a couple will keep having children until they have at least one child of each sex. Let X denote the childbirth at which they first have one child of each sex. Assume that the probability of having a boy at any childbirth is p, and that all births are independent. We want the expected number of children the couple will have. For n ≥ 1,
P(X > n) = P(the first n children are all boys or all girls) = p^n + (1 − p)^n.
Therefore, E(X) = Σ_{n=0}^∞ P(X > n) = 2 + Σ_{n=2}^∞ [p^n + (1 − p)^n] = 2 + p²/(1 − p) + (1 − p)²/p = 1/(p(1 − p)) − 1. If boys and girls are equally likely on any childbirth, then this says that a couple waiting to have a child of each sex can expect to have three children.
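The tailsum calculation in this example is easy to verify numerically; the sketch below truncates the infinite sum and compares it with the closed form 1/(p(1 − p)) − 1 for a few values of p, as an illustration.

```python
def tailsum_expectation(p, nmax=10_000):
    # E(X) = sum over n >= 0 of P(X > n); here P(X > 0) = P(X > 1) = 1 and
    # P(X > n) = p**n + (1 - p)**n for n >= 2 (truncated at nmax).
    return 2 + sum(p ** n + (1 - p) ** n for n in range(2, nmax))

for p in (0.5, 0.4, 0.3):
    print(p, tailsum_expectation(p), 1 / (p * (1 - p)) - 1)   # the two columns agree; p = .5 gives 3
```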
The expected value helps to determine the typical outcome of a random variable, but different distributions can yield the same expected value. For instance, two stocks may have identical average returns, yet one could be significantly riskier because of greater variability in its returns. Consequently, risk-averse investors often favor stocks with lower variability. While there are various measures of risk, such as the mean absolute deviation or the probability of exceeding a certain threshold, the standard deviation remains the most widely used measure of the variability of a random variable.
Definition 1.14 Let a random variable X have a finite mean μ. The variance of X is defined as
σ² = Var(X) = E[(X − μ)²],
and the standard deviation of X is defined as σ = √Var(X).
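For a discrete pmf, these definitions reduce to finite sums. As an illustration, the sketch below computes the mean, variance, and standard deviation of the number of aces in North's hand from Example 1.8.

```python
from math import comb, sqrt

total = comb(52, 13)
pmf = {x: comb(4, x) * comb(48, 13 - x) / total for x in range(5)}   # aces in North's hand

mu = sum(x * p for x, p in pmf.items())                  # E(X)
var = sum((x - mu) ** 2 * p for x, p in pmf.items())     # E[(X - mu)^2]
print(mu, var, sqrt(var))                                # mean 1, variance about .706
```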
Inequalities
The series in question, with (x + 1)(x + 2) in the denominator of its general term, is not finitely summable, a fact from calculus. Because E(X²) is infinite, but E(X) is finite, σ² = E(X²) − [E(X)]² must also be infinite.
If a collection of random variables is independent, then, just like the expectation, the variance also adds up. Precisely, one has the following very useful fact.
Theorem 1.10 Let X1, X2, ..., Xn be n independent random variables. Then,
Var(X1 + X2 + ... + Xn) = Var(X1) + Var(X2) + ... + Var(Xn).
An important corollary of this result is the following variance formula for the mean, X̄, of n independent and identically distributed random variables.
Corollary 1.1 Let X1, X2, ..., Xn be independent random variables with a common mean μ and a common variance σ². Then Var(X̄) = σ²/n, and therefore, by Chebyshev's inequality, for any ε > 0, P(|X̄ − μ| > ε) → 0 as n → ∞; this is the weak law of large numbers.
There is a stronger version of the weak law of large numbers, the strong law of large numbers, which says that in fact, with certainty, X̄ converges to μ as n → ∞. The precise mathematical statement is that P(X̄ → μ as n → ∞) = 1. The strong law of large numbers requires only that E|X1| be finite, and its proof needs more advanced tools that are developed later in the book.
There are also inequalities that can give tighter bounds than Chebyshev's or Markov's inequality, at the cost of further restrictions on the distribution of the underlying random variable X. We present three such inequalities that may offer improved bounds compared to those of Chebyshev and Markov.
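A small simulation can illustrate both the corollary Var(X̄) = σ²/n and the law of large numbers; the sketch below uses fair-die rolls and the standard-library random module, purely as an illustration.

```python
import random
random.seed(0)

def sample_mean(n):
    # Mean of n rolls of a fair die (mu = 3.5, sigma^2 = 35/12).
    return sum(random.randint(1, 6) for _ in range(n)) / n

# Law of large numbers: the sample mean settles near 3.5 as n grows.
for n in (10, 100, 10_000):
    print(n, sample_mean(n))

# Var(Xbar) = sigma^2 / n: empirical variance of repeated sample means shrinks like 1/n.
sigma2 = 35 / 12
for n in (10, 100):
    means = [sample_mean(n) for _ in range(2000)]
    m = sum(means) / len(means)
    print(n, sum((v - m) ** 2 for v in means) / len(means), sigma2 / n)
```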
Theorem 1.13 (a) (Cantelli’s Inequality) Suppose E(X) = μ, Var(X) = σ², assumed to be finite. Then, for any t > 0,
P(X ≥ μ + t) ≤ σ²/(σ² + t²).
(b) (Paley–Zygmund Inequality) Suppose X takes only nonnegative values, with E(X) = μ, Var(X) = σ², assumed to be finite. Then, for 0 < c < 1,
P(X > cμ) ≥ (1 − c)²μ² / [σ² + (1 − c)²μ²].
(c) (Alon–Spencer Inequality) Suppose X takes only nonnegative integer values, with E(X) = μ, Var(X) = σ², assumed to be finite. Then,
P(X = 0) ≤ σ² / (σ² + μ²).
These inequalities may be seen in Rao (1973), Paley and Zygmund (1932), and Alon and Spencer (2000, p. 58), respectively.
Probability inequalities form a vast and varied field, primarily because of their utility in providing approximate answers when exact answers are difficult or unattainable. Throughout this book, we periodically present and explain various inequalities. The following theorem gives fundamental inequalities based on moments.
Theorem 1.14 (a) (Cauchy–Schwarz Inequality) Let X, Y be two random variables such that E(X²) and E(Y²) are finite. Then,
[E(XY)]² ≤ E(X²) E(Y²).
(b) (Hölder's Inequality) Let X, Y be two random variables, and let 1 < p, q < ∞ with 1/p + 1/q = 1 be such that E|X|^p and E|Y|^q are finite. Then,
E|XY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}.
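Both of these moment inequalities are easy to sanity-check numerically for a discrete joint distribution; the sketch below uses an arbitrarily chosen joint pmf for (X, Y), purely as an illustration.

```python
# A small, arbitrary joint pmf for (X, Y), just to sanity-check the two inequalities.
joint = {(0, 1): 0.2, (1, 2): 0.3, (2, 0): 0.1, (3, 4): 0.4}

def E(f):
    return sum(f(x, y) * pr for (x, y), pr in joint.items())

# Cauchy-Schwarz: [E(XY)]^2 <= E(X^2) E(Y^2).
lhs = E(lambda x, y: x * y) ** 2
rhs = E(lambda x, y: x * x) * E(lambda x, y: y * y)
print(lhs, rhs, lhs <= rhs)              # True

# Hoelder with p = 3, q = 3/2: E|XY| <= (E|X|^p)^(1/p) * (E|Y|^q)^(1/q).
p_exp, q_exp = 3, 1.5
lhs_h = E(lambda x, y: abs(x * y))
rhs_h = E(lambda x, y: abs(x) ** p_exp) ** (1 / p_exp) * E(lambda x, y: abs(y) ** q_exp) ** (1 / q_exp)
print(lhs_h, rhs_h, lhs_h <= rhs_h)      # True
```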