Probability Theory
Richard F. Bass
These notes are © 1998 by Richard F. Bass. They may be used for personal or classroom purposes, but not for commercial purposes.
Revised 2001.
1 Basic notions
A probability or probability measure is a measure whose total mass is one. Because the origins of probability are in statistics rather than analysis, some of the terminology is different. For example, instead of denoting a measure space by (X, A, µ), probabilists use (Ω, F, P). So here Ω is a set, F is called a σ-field (which is the same thing as a σ-algebra), and P is a measure with P(Ω) = 1. Elements of F are called events. Elements of Ω are denoted ω.
Instead of saying a property occurs almost everywhere, we talk about properties occurring almost surely, written a.s. Real-valued measurable functions from Ω to R are called random variables and are usually denoted by X or Y or other capital letters. We often abbreviate "random variable" by r.v.
We let A^c = {ω ∈ Ω : ω ∉ A} (called the complement of A) and B − A = B ∩ A^c.
Integration (in the sense of Lebesgue) is called expectation or expected value, and we write EX for ∫ X dP. The notation E[X; A] is often used for ∫_A X dP.
The random variable 1_A is the function that is one if ω ∈ A and zero otherwise. It is called the indicator of A (the name characteristic function in probability refers to the Fourier transform). Events such as (ω : X(ω) > a) are almost always abbreviated by (X > a).
Given a random variable X, we can define a probability on R by

P_X(A) = P(X ∈ A),   A ⊂ R.   (1.1)

The probability P_X is called the law of X or the distribution of X. We define F_X : R → [0, 1] by

F_X(x) = P_X((−∞, x]) = P(X ≤ x).   (1.2)

The function F_X is called the distribution function of X.
As an example, let Ω = {H, T}, F all subsets of Ω (there are 4 of them), and P(H) = P(T) = 1/2. Let X(H) = 1 and X(T) = 0. Then P_X = (1/2)δ_0 + (1/2)δ_1, where δ_x is point mass at x, that is, δ_x(A) = 1 if x ∈ A and 0 otherwise. F_X(a) = 0 if a < 0, 1/2 if 0 ≤ a < 1, and 1 if a ≥ 1.
Proposition 1.1. The distribution function F_X of a random variable X satisfies:
(a) F_X is nondecreasing;
(b) F_X is right continuous with left limits;
(c) lim_{x→∞} F_X(x) = 1 and lim_{x→−∞} F_X(x) = 0.

Proof. We prove the first part of (b) and leave the others to the reader. If x_n ↓ x, then (X ≤ x_n) ↓ (X ≤ x), and so P(X ≤ x_n) ↓ P(X ≤ x) since P is a measure.
Note that if x_n ↑ x, then (X ≤ x_n) ↑ (X < x), and so F_X(x_n) ↑ P(X < x).
Proposition 1.2. Suppose F is a distribution function. There exists a random variable X such that F = F_X.

Proof. Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Define X(ω) = sup{x : F(x) < ω}. It is routine to check that F_X = F.

In the above proof, essentially X = F^{−1}. However F may have jumps or be constant over some intervals, so some care is needed in defining X.
Certain distributions or laws are very common. We list some of them.
(a) Bernoulli. A random variable is Bernoulli if P(X = 1) = p, P(X = 0) = 1 − p for some p ∈ [0, 1].
(b) Binomial. This is defined by P(X = k) = (n choose k) p^k (1 − p)^{n−k}, where n is a positive integer, 0 ≤ k ≤ n, and p ∈ [0, 1].
(c) Geometric. For p ∈ (0, 1) we set P(X = k) = (1 − p)p^k. Here k is a nonnegative integer.
(d) Poisson. For λ > 0 we set P(X = k) = e^{−λ} λ^k/k!. Again k is a nonnegative integer.
(e) Uniform. For some positive integer n, set P(X = k) = 1/n for 1 ≤ k ≤ n.
If F is absolutely continuous, we call f = F′ the density of F. Some examples of distributions characterized by densities are the following.
(f) Uniform on [a, b]. Define f(x) = (b − a)^{−1} 1_{[a,b]}(x). This means that if X has a uniform distribution, then

P(X ∈ A) = ∫_A (1/(b − a)) 1_{[a,b]}(x) dx.

(g) Exponential. For x > 0 let f(x) = λe^{−λx}.
(h) Standard normal. Define f(x) = (1/√(2π)) e^{−x²/2}. So

P(X ∈ A) = (1/√(2π)) ∫_A e^{−x²/2} dx.

(i) N(µ, σ²). We shall see later that a standard normal has mean zero and variance one. If Z is a standard normal, then a N(µ, σ²) random variable has the same distribution as µ + σZ. It is an exercise in calculus to check that such a random variable has density

(1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}.   (1.3)

(j) Cauchy. Here

f(x) = 1/(π(1 + x²)).
We can use the law of a random variable to calculate expectations.

Proposition 1.3. If g is bounded or nonnegative, then

E g(X) = ∫ g(x) P_X(dx).

Proof. If g is the indicator of an event A, this is just the definition of P_X. By linearity, the result holds for simple functions; by monotone convergence, for nonnegative g; and by linearity again, for bounded g.
If F_X has a density f, then P_X(dx) = f(x) dx. So, for example, EX = ∫ x f(x) dx and EX² = ∫ x² f(x) dx. (We need E|X| finite to justify this if X is not necessarily nonnegative.)
We define the mean of a random variable to be its expectation, and the variance of a random variable is defined by

Var X = E(X − EX)².

For example, it is routine to see that the mean of a standard normal is zero and its variance is one. Note

Var X = E(X² − 2X EX + (EX)²) = EX² − (EX)².
Another equality that is useful is the following.

Proposition 1.4. If X ≥ 0 a.s. and p > 0, then

EX^p = ∫_0^∞ p λ^{p−1} P(X > λ) dλ.

The proof will show that this equality is also valid if we replace P(X > λ) by P(X ≥ λ).

Proof. Use Fubini's theorem and write

∫_0^∞ p λ^{p−1} P(X > λ) dλ = E ∫_0^∞ p λ^{p−1} 1_{(λ,∞)}(X) dλ = E ∫_0^X p λ^{p−1} dλ = EX^p.
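As a quick numerical illustration of Proposition 1.4 (this example is an addition, not part of the original notes), one can compare both sides of the identity for an exponential random variable: the left side estimated by Monte Carlo, the right side by numerical integration of p λ^{p−1} P(X > λ). The rate, exponent, sample size, and grid below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
p, lam = 2.0, 1.5                       # exponent p and exponential rate (arbitrary)
x = rng.exponential(scale=1/lam, size=200_000)

lhs = np.mean(x**p)                     # Monte Carlo estimate of E[X^p]

lams = np.linspace(0, 20, 200_001)[1:]  # grid for the lambda integral
tail = np.exp(-lam * lams)              # P(X > lambda) for Exponential(lam)
dlam = lams[1] - lams[0]
rhs = np.sum(p * lams**(p - 1) * tail) * dlam   # Riemann sum for the right-hand side

print(lhs, rhs)                         # both should be close to 2/lam**2

Both numbers should agree to within Monte Carlo error, matching the exact value EX² = 2/λ² for this distribution.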
We need two elementary inequalities.

Proposition 1.5. (Chebyshev's inequality) If X ≥ 0,

P(X ≥ a) ≤ EX/a.

Proof. We write

P(X ≥ a) = E[1_{[a,∞)}(X)] ≤ E[(X/a) 1_{[a,∞)}(X)] ≤ EX/a,

since X/a is at least 1 when X ∈ [a, ∞).
If we apply this to X = (Y − EY)², we obtain

P(|Y − EY| ≥ a) = P((Y − EY)² ≥ a²) ≤ Var Y/a².   (1.4)

This special case of Chebyshev's inequality is sometimes itself referred to as Chebyshev's inequality, while Proposition 1.5 is sometimes called the Markov inequality.
Proposition 1.6. Suppose g is convex and X and g(X) are both integrable. Then

g(EX) ≤ E g(X).

Proof. One property of convex functions is that they lie above their tangent lines, and more generally their support lines. So if x_0 ∈ R, we have

g(x) ≥ g(x_0) + c(x − x_0)

for some constant c. Take x = X(ω) and take expectations to obtain

E g(X) ≥ g(x_0) + c(EX − x_0).

Now set x_0 equal to EX.
If A_n is a sequence of sets, define (A_n i.o.), read "A_n infinitely often," by

(A_n i.o.) = ∩_{n=1}^∞ ∪_{i=n}^∞ A_i.

This set consists of those ω that are in infinitely many of the A_n.
A simple but very important proposition is the Borel-Cantelli lemma. It has two parts, and we prove the first part here, leaving the second part to the next section.

Proposition 1.7. (Borel-Cantelli lemma) If Σ_n P(A_n) < ∞, then P(A_n i.o.) = 0.

Proof. We have

P(A_n i.o.) = lim_{n→∞} P(∪_{i=n}^∞ A_i).

However,

P(∪_{i=n}^∞ A_i) ≤ Σ_{i=n}^∞ P(A_i),

which tends to zero as n → ∞.
2 Independence
Let us say two events A and B are independent if P(A ∩ B) = P(A)P(B). The events A_1, ..., A_n are independent if

P(A_{i_1} ∩ A_{i_2} ∩ ··· ∩ A_{i_j}) = P(A_{i_1}) P(A_{i_2}) ··· P(A_{i_j})

for every subset {i_1, ..., i_j} of {1, 2, ..., n}.
Proposition 2.1. If A and B are independent, then A^c and B are independent.

Proof. We write

P(A^c ∩ B) = P(B) − P(A ∩ B) = P(B) − P(A)P(B) = P(B)(1 − P(A)) = P(B)P(A^c).
Two σ-fields F and G are independent if A and B are independent whenever A ∈ F and B ∈ G. Two random variables X and Y are independent if the σ-field generated by X and the σ-field generated by Y are independent. (The σ-field generated by X is the collection of events {(X ∈ A) : A a Borel subset of R}.) We define the independence of n σ-fields or n random variables in the obvious way.
Proposition 2.1 tells us that A and B are independent if the random variables 1_A and 1_B are independent, so the definitions above are consistent.
If f and g are Borel functions and X and Y are independent, then f(X) and g(Y) are independent. This follows because the σ-field generated by f(X) is a sub-σ-field of the one generated by X, and similarly for g(Y).
Let F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) denote the joint distribution function of X and Y. (The comma inside the set means "and.")

Proposition 2.2. F_{X,Y}(x, y) = F_X(x)F_Y(y) for all x and y if and only if X and Y are independent.
Proof. If X and Y are independent, then 1_{(−∞,x]}(X) and 1_{(−∞,y]}(Y) are independent by the above comments. Using the above comments and the definition of independence, this shows F_{X,Y}(x, y) = F_X(x)F_Y(y).
Conversely, if the equality holds, fix y and let M_y denote the collection of sets A for which P(X ∈ A, Y ≤ y) = P(X ∈ A)P(Y ≤ y). M_y contains all sets of the form (−∞, x]. It follows by linearity that M_y contains all sets of the form (x, z], and then by linearity again, all sets that are finite unions of such half-open, half-closed intervals. Note that the collection of finite unions of such intervals, A, is an algebra generating the Borel σ-field. It is clear that M_y is a monotone class, so by the monotone class lemma, M_y contains the Borel σ-field.
For a fixed set A, let M_A denote the collection of sets B for which P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B). Again, M_A is a monotone class and by the preceding paragraph contains the σ-field generated by the collection of finite unions of intervals of the form (x, z], hence contains the Borel sets. Therefore X and Y are independent.
The following is known as the multiplication theorem.

Proposition 2.3. If X, Y, and XY are integrable and X and Y are independent, then EXY = EX EY.

Proof. Consider the collection of random variables in σ(X) (the σ-field generated by X) and σ(Y) for which the multiplication theorem is true. It holds for indicators by the definition of X and Y being independent. It holds for simple random variables, that is, linear combinations of indicators, by linearity of both sides. It holds for nonnegative random variables by monotone convergence. And it holds for integrable random variables by linearity again.

Let us give an example of independent random variables. Let Ω = Ω_1 × Ω_2 and let P = P_1 × P_2, where (Ω_i, F_i, P_i) are probability spaces for i = 1, 2. We use the product σ-field. Then it is clear that F_1 and F_2 are independent by the definition of P. If X_1 is a random variable such that X_1(ω_1, ω_2) depends only on ω_1 and X_2 depends only on ω_2, then X_1 and X_2 are independent.
This example can be extended to n independent random variables, and in fact, if one has independent random variables, one can always view them as coming from a product space. We will not use this fact. Later on, we will talk about countable sequences of independent r.v.s, and the reader may wonder whether such things can exist. That they can is a consequence of the Kolmogorov extension theorem; see PTA, for example.
If X_1, ..., X_n are independent, then so are X_1 − EX_1, ..., X_n − EX_n. Assuming everything is integrable, we can expand

Var(X_1 + ··· + X_n) = E[((X_1 − EX_1) + ··· + (X_n − EX_n))²],

using the multiplication theorem to show that the expectations of the cross product terms are zero. We have thus shown

Var(X_1 + ··· + X_n) = Var X_1 + ··· + Var X_n.   (2.1)
We finish up this section by proving the second half of the Borel-Cantelli lemma.

Proposition 2.4. Suppose A_n is a sequence of independent events. If Σ_n P(A_n) = ∞, then P(A_n i.o.) = 1.

Note that here the A_n are independent, while in the first half of the Borel-Cantelli lemma no such assumption was necessary.

Proof. Note

P(∪_{i=n}^N A_i) = 1 − P(∩_{i=n}^N A_i^c) = 1 − Π_{i=n}^N P(A_i^c) = 1 − Π_{i=n}^N (1 − P(A_i)).

By the mean value theorem, 1 − x ≤ e^{−x}, so we have that the right hand side is greater than or equal to 1 − exp(−Σ_{i=n}^N P(A_i)). As N → ∞, this tends to 1, so P(∪_{i=n}^∞ A_i) = 1. This holds for all n, which proves the result.
3 Convergence
In this section we consider three ways a sequence of r.v.s X_n can converge.
We say X_n converges to X almost surely if (X_n ↛ X) has probability zero. X_n converges to X in probability if for each ε > 0, P(|X_n − X| > ε) → 0 as n → ∞. X_n converges to X in L^p if E|X_n − X|^p → 0 as n → ∞.
The following proposition shows some relationships among the types of convergence.

Proposition 3.1. (a) If X_n → X a.s., then X_n → X in probability.
(b) If X_n → X in L^p, then X_n → X in probability.
(c) If X_n → X in probability, there exists a subsequence n_j such that X_{n_j} converges to X almost surely.

Proof. To prove (a), note X_n − X tends to 0 almost surely, so 1_{(−ε,ε)^c}(X_n − X) also converges to 0 almost surely. Now apply the dominated convergence theorem.
(b) comes from Chebyshev's inequality:

P(|X_n − X| > ε) = P(|X_n − X|^p > ε^p) ≤ E|X_n − X|^p/ε^p → 0

as n → ∞.
To prove (c), choose n_j larger than n_{j−1} such that P(|X_n − X| > 2^{−j}) < 2^{−j} whenever n ≥ n_j. So if we let A_i = (|X_{n_j} − X| > 2^{−j} for some j ≥ i), then P(A_i) ≤ 2^{−i+1}. By the Borel-Cantelli lemma, P(A_i i.o.) = 0. This implies X_{n_j} → X on the complement of (A_i i.o.).
Let us give some examples to show there need not be any other implications among the three types of convergence.
Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Let X_n = e^n 1_{(0,1/n)}. Then clearly X_n converges to 0 almost surely and in probability, but EX_n^p = e^{np}/n → ∞ for any p.
Let Ω be the unit circle, and let P be Lebesgue measure on the circle normalized to have total mass 1. Let t_n = Σ_{i=1}^n i^{−1}, and let A_n = {θ : t_{n−1} ≤ θ < t_n}. Let X_n = 1_{A_n}. Any point on the unit circle will be in infinitely many A_n, so X_n does not converge almost surely to 0. But P(A_n) = 1/(2πn) → 0, so X_n → 0 in probability.
4 Weak law of large numbers
Suppose X_n is a sequence of independent random variables. Suppose also that they all have the same distribution, that is, F_{X_n} = F_{X_1} for all n. This situation comes up so often it has a name: independent, identically distributed, which is abbreviated i.i.d.
Define S_n = Σ_{i=1}^n X_i. S_n is called a partial sum process. S_n/n is the average value of the first n of the X_i's.
Theorem 4.1. (Weak law of large numbers) Suppose the X_i are i.i.d. and EX_1² < ∞. Then S_n/n → EX_1 in probability.

Proof. Since the X_i are i.i.d., they all have the same expectation, and so ES_n = nEX_1. Hence E(S_n/n − EX_1)² is the variance of S_n/n. If ε > 0, by Chebyshev's inequality,

P(|S_n/n − EX_1| > ε) ≤ Var(S_n/n)/ε² = Σ_{i=1}^n Var X_i/(n²ε²) = n Var X_1/(n²ε²).   (4.1)

Since EX_1² < ∞, then Var X_1 < ∞, and the result follows by letting n → ∞.
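To see what (4.1) says in practice, the following small simulation (an illustration added here, not part of the original notes) estimates P(|S_n/n − EX_1| > ε) for uniform X_i and compares it with the Chebyshev bound; the distribution, ε, and sample counts are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
eps, trials = 0.1, 5000
for n in [10, 100, 1000, 10000]:
    # sample means of n Uniform(0,1) variables; EX_1 = 0.5, Var X_1 = 1/12
    means = rng.random((trials, n)).mean(axis=1)
    freq = np.mean(np.abs(means - 0.5) > eps)
    bound = (1 / 12) / (n * eps**2)     # right-hand side of (4.1)
    print(n, freq, bound)

The observed frequency and the bound both tend to zero, the bound typically being much cruder.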
A nice application of the weak law of large numbers is a proof of the Weierstrass approximation theorem.

Theorem 4.2. Suppose f is a continuous function on [0, 1] and ε > 0. There exists a polynomial P such that sup_{x∈[0,1]} |f(x) − P(x)| < ε.

Proof. Let

P(x) = Σ_{k=0}^n f(k/n) (n choose k) x^k (1 − x)^{n−k}.

Clearly P is a polynomial. Since f is continuous, there exists M such that |f(x)| ≤ M for all x and there exists δ such that |f(x) − f(y)| < ε/2 whenever |x − y| < δ.
Let X_i be i.i.d. Bernoulli r.v.s with parameter x. Then S_n, the partial sum, is a binomial, and hence P(x) = E f(S_n/n). The mean of S_n/n is x. We have

|P(x) − f(x)| = |E f(S_n/n) − f(EX_1)| ≤ E|f(S_n/n) − f(EX_1)| ≤ M P(|S_n/n − x| > δ) + ε/2.

By (4.1) the first term will be less than

M Var X_1/(nδ²) = M x(1 − x)/(nδ²) ≤ M/(4nδ²),

which will be less than ε/2 if n is large enough, uniformly in x.
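The polynomial P in this proof is the Bernstein polynomial of f, and it can be evaluated directly. The sketch below (added for illustration; the test function and degrees are arbitrary) computes the sup-norm error on a grid for a few values of n.

import numpy as np
from math import comb

def bernstein(f, n, xs):
    # P(x) = sum_k f(k/n) * C(n, k) * x^k * (1 - x)^(n - k)
    ks = np.arange(n + 1)
    coeffs = np.array([comb(n, k) for k in ks]) * f(ks / n)
    return np.array([np.sum(coeffs * x**ks * (1 - x)**(n - ks)) for x in xs])

f = lambda x: np.abs(x - 0.3) + np.sin(3 * x)   # an arbitrary continuous test function
xs = np.linspace(0, 1, 201)
for n in [10, 50, 200]:
    err = np.max(np.abs(bernstein(f, n, xs) - f(xs)))
    print(n, err)                               # sup-norm error shrinks as n grows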
5 Techniques related to almost sure convergence
Our aim is the strong law of large numbers (SLLN), which says that S_n/n converges to EX_1 almost surely if E|X_1| < ∞.
Proposition 5.1. Suppose X_i is an i.i.d. sequence with EX_i^4 < ∞ and let S_n = Σ_{i=1}^n X_i. Then S_n/n → EX_1 a.s.

Proof. By looking at X_i − EX_i we may assume that the X_i have mean 0. By Chebyshev,

P(|S_n/n| > ε) ≤ E(S_n/n)^4/ε^4 = ES_n^4/(n^4 ε^4).

If we expand S_n^4, we will have terms involving X_i^4, terms involving X_i² X_j², terms involving X_i³ X_j, terms involving X_i² X_j X_k, and terms involving X_i X_j X_k X_ℓ, with i, j, k, ℓ all being different. By the multiplication theorem and the fact that the X_i have mean 0, the expectations of all the terms will be 0 except for those of the first two types. So

ES_n^4 = Σ_{i=1}^n EX_i^4 + Σ_{i≠j} EX_i² EX_j².

By the finiteness assumption, the first term on the right is bounded by c_1 n. By Cauchy-Schwarz, EX_i² ≤ (EX_i^4)^{1/2} < ∞, and there are at most n² terms in the second sum, so this second term is bounded by c_2 n². Substituting, we have

P(|S_n/n| > ε) ≤ c_3/(n²ε^4).

Consequently P(|S_n/n| > ε i.o.) = 0 by Borel-Cantelli. Since ε is arbitrary, this implies S_n/n → 0 a.s.
Before we can prove the SLLN assuming only the finiteness of first moments, we need some preliminaries.

Proposition 5.2. If Y ≥ 0, then EY < ∞ if and only if Σ_n P(Y > n) < ∞.

Proof. By Proposition 1.4, EY = ∫_0^∞ P(Y > x) dx. P(Y > x) is nonincreasing in x, so the integral is bounded above by Σ_{n=0}^∞ P(Y > n) and bounded below by Σ_{n=1}^∞ P(Y > n).
If X_i is a sequence of r.v.s, the tail σ-field is defined by ∩_{n=1}^∞ σ(X_n, X_{n+1}, ...). An example of an event in the tail σ-field is (lim sup_{n→∞} X_n > a). Another example is (lim sup_{n→∞} S_n/n > a). The reason for this is that if k < n is fixed,

S_n/n = S_k/n + (Σ_{i=k+1}^n X_i)/n.

The first term on the right tends to 0 as n → ∞. So lim sup S_n/n = lim sup (Σ_{i=k+1}^n X_i)/n, which is in σ(X_{k+1}, X_{k+2}, ...). This holds for each k. The set (lim sup S_n > a) is easily seen not to be in the tail σ-field.
Theorem 5.3. (Kolmogorov 0-1 law) If the X_i are independent, then the events in the tail σ-field have probability 0 or 1.

This implies that in the case of i.i.d. random variables, if S_n/n has a limit with positive probability, then it has a limit with probability one, and the limit must be a constant.

Proof. Let M be the collection of sets in σ(X_{n+1}, ...) that are independent of every set in σ(X_1, ..., X_n). M is easily seen to be a monotone class and it contains σ(X_{n+1}, ..., X_N) for every N > n. Therefore M must be equal to σ(X_{n+1}, ...).
If A is in the tail σ-field, then for each n, A is independent of σ(X_1, ..., X_n). The class M_A of sets independent of A is a monotone class, hence is a σ-field containing σ(X_1, ..., X_n) for each n. Therefore M_A contains σ(X_1, X_2, ...), and in particular A ∈ M_A.
We thus have that the event A is independent of itself, or

P(A) = P(A ∩ A) = P(A)P(A) = P(A)².

This implies P(A) is zero or one.
The next proposition shows that in considering a law of large numbers we can consider truncated random variables.

Proposition 5.4. Suppose X_i is an i.i.d. sequence of r.v.s with E|X_1| < ∞. Let X′_n = X_n 1_{(|X_n|≤n)}. Then
(a) X_n converges almost surely if and only if X′_n does;
(b) if S′_n = Σ_{i=1}^n X′_i, then S_n/n converges a.s. if and only if S′_n/n does.

Proof. Let A_n = (X_n ≠ X′_n) = (|X_n| > n). Then P(A_n) = P(|X_n| > n) = P(|X_1| > n). Since E|X_1| < ∞, then by Proposition 5.2 we have Σ P(A_n) < ∞. So by the Borel-Cantelli lemma, P(A_n i.o.) = 0. Thus for almost every ω, X_n = X′_n for n sufficiently large. This proves (a).
For (b), let k (depending on ω) be the largest integer such that X′_k(ω) ≠ X_k(ω). Then S_n/n − S′_n/n = (X_1 + ··· + X_k)/n − (X′_1 + ··· + X′_k)/n → 0 as n → ∞.
Next is Kolmogorov's inequality, a special case of Doob's inequality.

Proposition 5.5. Suppose the X_i are independent and EX_i = 0 for each i. Then

P(max_{1≤i≤n} |S_i| ≥ λ) ≤ ES_n²/λ².

Proof. Let A_k = (|S_k| ≥ λ, |S_1| < λ, ..., |S_{k−1}| < λ). Note the A_k are disjoint and that A_k ∈ σ(X_1, ..., X_k). Therefore A_k is independent of S_n − S_k. Then

ES_n² ≥ Σ_{k=1}^n E[S_n²; A_k]
= Σ_{k=1}^n E[(S_k² + 2S_k(S_n − S_k) + (S_n − S_k)²); A_k]
≥ Σ_{k=1}^n E[S_k²; A_k] + 2 Σ_{k=1}^n E[S_k(S_n − S_k); A_k].

Using the independence, E[S_k(S_n − S_k)1_{A_k}] = E[S_k 1_{A_k}] E[S_n − S_k] = 0. Therefore

ES_n² ≥ Σ_{k=1}^n E[S_k²; A_k] ≥ Σ_{k=1}^n λ² P(A_k) = λ² P(max_{1≤k≤n} |S_k| ≥ λ).

Our result is immediate from this.
The last result we need for now is a special case of what is known as Kronecker's lemma.

Proposition 5.6. Suppose x_i are real numbers and s_n = Σ_{i=1}^n x_i. If Σ_{j=1}^∞ (x_j/j) converges, then s_n/n → 0.

Proof. Let b_n = Σ_{j=1}^n (x_j/j), b_0 = 0, and suppose b_n → b. As is well known, this implies (Σ_{i=1}^n b_i)/n → b. We have n(b_n − b_{n−1}) = x_n, so

s_n/n = Σ_{i=1}^n i(b_i − b_{i−1})/n = (Σ_{i=1}^n i b_i − Σ_{i=1}^{n−1} (i+1) b_i)/n = b_n − (Σ_{i=1}^{n−1} b_i)/n → b − b = 0.
6 Strong law of large numbers
This section is devoted to a proof of Kolmogorov's strong law of large numbers. We showed earlier that if EX_i² < ∞, where the X_i are i.i.d., then the weak law of large numbers (WLLN) holds: S_n/n converges to EX_1 in probability. The WLLN can be improved greatly; it is enough that xP(|X_1| > x) → 0 as x → ∞. Here we show the strong law (SLLN): if one has a finite first moment, then there is almost sure convergence. First we need a lemma.
Lemma 6.1. Suppose V_i is a sequence of independent r.v.s, each with mean 0. Let W_n = Σ_{i=1}^n V_i. If Σ_{i=1}^∞ Var V_i < ∞, then W_n converges almost surely.

Proof. Choose n_j > n_{j−1} such that Σ_{i=n_j}^∞ Var V_i < 2^{−3j}. If n > n_j, then applying Kolmogorov's inequality shows that

P(max_{n_j≤i≤n} |W_i − W_{n_j}| > 2^{−j}) ≤ 2^{−3j}/2^{−2j} = 2^{−j}.

Letting n → ∞, we have P(A_j) ≤ 2^{−j}, where

A_j = (max_{n_j≤i} |W_i − W_{n_j}| > 2^{−j}).

By the Borel-Cantelli lemma, P(A_j i.o.) = 0.
Suppose ω ∉ (A_j i.o.). Let ε > 0. Choose j large enough so that 2^{−j+1} < ε and ω ∉ A_j. If n, m > n_j, then

|W_n − W_m| ≤ |W_n − W_{n_j}| + |W_m − W_{n_j}| ≤ 2^{−j+1} < ε.

Since ε is arbitrary, W_n(ω) is a Cauchy sequence, and hence converges.
Theorem 6.2. (SLLN) Let X_i be a sequence of i.i.d. random variables. Then S_n/n converges almost surely if and only if E|X_1| < ∞.

Proof. Let us first suppose S_n/n converges a.s. and show E|X_1| < ∞. If S_n(ω)/n → a, then

S_{n−1}/n = (S_{n−1}/(n−1)) · ((n−1)/n) → a.

So

X_n/n = S_n/n − S_{n−1}/n → a − a = 0.

Hence X_n/n → 0, a.s. Thus P(|X_n| > n i.o.) = 0. By the second part of Borel-Cantelli, Σ P(|X_n| > n) < ∞. Since the X_i are i.i.d., this means Σ_{n=1}^∞ P(|X_1| > n) < ∞, and by Proposition 5.2, E|X_1| < ∞.
Now suppose E|X_1| < ∞. By looking at X_i − EX_i, we may suppose without loss of generality that EX_i = 0. We truncate, and let Y_i = X_i 1_{(|X_i|≤i)}. It suffices to show Σ_{i=1}^n Y_i/n → 0 a.s., by Proposition 5.4.
Next we estimate. We have

EY_i = E[X_1; |X_1| ≤ i] → EX_1 = 0.

The convergence follows by the dominated convergence theorem, since the integrands are bounded by |X_1|.
To estimate the second moment of the Y_i, we write

EY_i² = ∫_0^∞ 2y P(|Y_i| ≥ y) dy = ∫_0^i 2y P(|Y_i| ≥ y) dy ≤ ∫_0^i 2y P(|X_1| ≥ y) dy,

and so

Σ_{i=1}^∞ E(Y_i²/i²) ≤ Σ_{i=1}^∞ (1/i²) ∫_0^i 2y P(|X_1| ≥ y) dy
= Σ_{i=1}^∞ (1/i²) ∫_0^∞ 1_{(y≤i)} 2y P(|X_1| ≥ y) dy
= ∫_0^∞ (Σ_{i=1}^∞ (1/i²) 1_{(y≤i)}) 2y P(|X_1| ≥ y) dy
≤ ∫_0^∞ (2/y) 2y P(|X_1| ≥ y) dy = 4 ∫_0^∞ P(|X_1| ≥ y) dy = 4 E|X_1| < ∞,

since Σ_{i≥y} (1/i²) ≤ 2/y.
Let U_i = Y_i − EY_i. Then Var U_i = Var Y_i ≤ EY_i², and by the above,

Σ_{i=1}^∞ Var(U_i/i) < ∞.

By Lemma 6.1 (with V_i = U_i/i), Σ_{i=1}^n (U_i/i) converges almost surely. By Kronecker's lemma, (Σ_{i=1}^n U_i)/n → 0 almost surely. Finally, since EY_i → 0, then Σ_{i=1}^n EY_i/n → 0, hence Σ_{i=1}^n Y_i/n → 0.
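A single trajectory makes the strong law concrete. The sketch below (an added illustration, not from the notes) follows S_n/n for symmetrized Pareto variables with tail index 1.5, which have a finite first moment but infinite variance, so the fourth-moment argument of Proposition 5.1 and the Chebyshev bound of the WLLN do not apply; the trajectory nevertheless settles near the mean 0.

import numpy as np

rng = np.random.default_rng(2)
a, n = 1.5, 10**6                        # tail index 1.5: finite mean, infinite variance
x = rng.pareto(a, size=n) * rng.choice([-1.0, 1.0], size=n)   # symmetrized, EX_i = 0
running_mean = np.cumsum(x) / np.arange(1, n + 1)
for k in [10**3, 10**4, 10**5, 10**6]:
    print(k, running_mean[k - 1])        # drifts toward 0 along the single path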
7 Uniform integrability
Before proceeding to some extensions of the SLLN, we discuss uniform integrability. A sequence of r.v.s is uniformly integrable if

sup_i ∫_{(|X_i|>M)} |X_i| dP → 0

as M → ∞.

Proposition 7.1. Suppose there exists ϕ : [0, ∞) → [0, ∞) such that ϕ is nondecreasing, ϕ(x)/x → ∞ as x → ∞, and sup_i Eϕ(|X_i|) < ∞. Then the X_i are uniformly integrable.

Proof. Let ε > 0 and choose x_0 such that x/ϕ(x) < ε if x ≥ x_0. If M ≥ x_0,

∫_{(|X_i|>M)} |X_i| = ∫ (|X_i|/ϕ(|X_i|)) ϕ(|X_i|) 1_{(|X_i|>M)} ≤ ε ∫ ϕ(|X_i|) ≤ ε sup_i Eϕ(|X_i|).
Proposition 7.2. If X_n and Y_n are two uniformly integrable sequences, then X_n + Y_n is also a uniformly integrable sequence.

Proof. Since there exists M_0 such that sup_n E[|X_n|; |X_n| > M_0] < 1 and sup_n E[|Y_n|; |Y_n| > M_0] < 1, then sup_n E|X_n| ≤ M_0 + 1, and similarly for the Y_n. Let ε > 0 and choose M_1 > 4(M_0 + 1)/ε such that sup_n E[|X_n|; |X_n| > M_1] < ε/4 and sup_n E[|Y_n|; |Y_n| > M_1] < ε/4. Let M_2 = 4M_1².
Note P(|X_n| + |Y_n| > M_2) ≤ (E|X_n| + E|Y_n|)/M_2 ≤ ε/(4M_1) by Chebyshev's inequality. Then

E[|X_n + Y_n|; |X_n + Y_n| > M_2] ≤ E[|X_n|; |X_n| > M_1]
+ E[|X_n|; |X_n| ≤ M_1, |X_n + Y_n| > M_2]
+ E[|Y_n|; |Y_n| > M_1]
+ E[|Y_n|; |Y_n| ≤ M_1, |X_n + Y_n| > M_2].

The first and third terms on the right are each less than ε/4 by our choice of M_1. The second and fourth terms are each less than M_1 P(|X_n + Y_n| > M_2) ≤ ε/4.
The main result we need in this section is Vitali's convergence theorem.

Theorem 7.3. If X_n → X almost surely and the X_n are uniformly integrable, then E|X_n − X| → 0.

Proof. By the above proposition, X_n − X is uniformly integrable and tends to 0 a.s., so without loss of generality we may assume X = 0. Let ε > 0 and choose M such that sup_n E[|X_n|; |X_n| > M] < ε. Then

E|X_n| ≤ E[|X_n|; |X_n| > M] + E[|X_n|; |X_n| ≤ M] ≤ ε + E[|X_n| 1_{(|X_n|≤M)}].

The second term on the right goes to 0 by dominated convergence.
8 Complements to the SLLN
Proposition 8.1. Suppose X_i is an i.i.d. sequence and E|X_1| < ∞. Then

E|S_n/n − EX_1| → 0.

Proof. Without loss of generality we may assume EX_1 = 0. By the SLLN, S_n/n → 0 a.s. So we need to show that the sequence S_n/n is uniformly integrable.
Pick M_1 such that E[|X_1|; |X_1| > M_1] < ε/2. Pick M_2 = M_1 E|X_1|/ε. So

P(|S_n/n| > M_2) ≤ E|S_n|/(nM_2) ≤ E|X_1|/M_2 = ε/M_1.

We used here E|S_n| ≤ Σ_{i=1}^n E|X_i| = nE|X_1|.
We then have

E[|X_i|; |S_n/n| > M_2] ≤ E[|X_i|; |X_i| > M_1] + E[|X_i|; |X_i| ≤ M_1, |S_n/n| > M_2]
≤ ε + M_1 P(|S_n/n| > M_2) ≤ 2ε.

Finally,

E[|S_n/n|; |S_n/n| > M_2] ≤ (1/n) Σ_{i=1}^n E[|X_i|; |S_n/n| > M_2] ≤ 2ε.

Since ε is arbitrary, S_n/n is uniformly integrable, and the result follows from Vitali's convergence theorem.
Theorem 8.2. Let X_i be a sequence of independent random variables, A > 0, and Y_i = X_i 1_{(|X_i|≤A)}. Then Σ X_i converges if and only if all of the following three series converge: (a) Σ P(|X_n| > A); (b) Σ EY_i; (c) Σ Var Y_i.

Proof of "if" part. Since (c) holds, then Σ (Y_i − EY_i) converges by Lemma 6.1. Since (b) holds, taking the difference shows Σ Y_i converges. Since (a) holds, Σ P(X_i ≠ Y_i) = Σ P(|X_i| > A) < ∞, so by Borel-Cantelli, P(X_i ≠ Y_i i.o.) = 0. It follows that Σ X_i converges.
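As a concrete instance of the theorem (added here for illustration), take X_i = ε_i/i where the ε_i are independent signs ±1 with probability 1/2 each, and A = 1. Then series (a) and (b) are identically zero and (c) is Σ 1/i² < ∞, so Σ X_i converges a.s. The sketch below looks at the partial sums along one sample path; the cutoffs are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
n = 10**6
i = np.arange(1, n + 1)
signs = rng.choice([-1.0, 1.0], size=n)
partial = np.cumsum(signs / i)           # partial sums of sum_i eps_i / i
for k in [10**2, 10**3, 10**4, 10**5, 10**6]:
    print(k, partial[k - 1])             # the partial sums settle to a (random) limit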
9 Conditional expectation
If F ⊆ G are two σ-fields and X is an integrable G measurable random variable, the conditional expectation of X given F, written E[X | F] and read as "the expectation (or expected value) of X given F," is any F measurable random variable Y such that E[Y; A] = E[X; A] for every A ∈ F. The conditional probability of A ∈ G given F is defined by P(A | F) = E[1_A | F].
If Y_1, Y_2 are two F measurable random variables with E[Y_1; A] = E[Y_2; A] for all A ∈ F, then Y_1 = Y_2 a.s.; that is, conditional expectation is unique up to a.s. equivalence.
In the case X is already F measurable, E[X | F] = X. If X is independent of F, E[X | F] = EX. Both of these facts follow immediately from the definition. For another example, which ties this definition with the one used in elementary probability courses, if {A_i} is a finite collection of disjoint sets whose union is Ω, P(A_i) > 0 for all i, and F is the σ-field generated by the A_i's, then

P(A | F) = Σ_i (P(A ∩ A_i)/P(A_i)) 1_{A_i}.

This follows since the right-hand side is F measurable and its expectation over any set A_i is P(A ∩ A_i).
As an example, suppose we toss a fair coin independently 5 times and let X_i be 1 or 0 depending on whether the ith toss was a heads or tails. Let A be the event that there were 5 heads and let F_i = σ(X_1, ..., X_i). Then P(A) = 1/32, while P(A | F_1) is equal to 1/16 on the event (X_1 = 1) and 0 on the event (X_1 = 0). P(A | F_2) is equal to 1/8 on the event (X_1 = 1, X_2 = 1) and 0 otherwise.
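These values are easy to verify by brute force (an added illustration): enumerate the 32 equally likely outcomes of 5 tosses and average 1_A over the outcomes consistent with what F_1 or F_2 reveals.

from itertools import product

outcomes = list(product([0, 1], repeat=5))          # all 32 equally likely toss sequences
A = lambda w: all(x == 1 for x in w)                # the event of five heads

print(sum(A(w) for w in outcomes) / 32)             # P(A) = 1/32

# P(A | F_1) on the event (X_1 = 1): average 1_A over outcomes whose first toss is a head
first_head = [w for w in outcomes if w[0] == 1]
print(sum(A(w) for w in first_head) / len(first_head))          # 1/16

# P(A | F_2) on the event (X_1 = 1, X_2 = 1)
two_heads = [w for w in outcomes if w[0] == 1 and w[1] == 1]
print(sum(A(w) for w in two_heads) / len(two_heads))            # 1/8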
We have

E[E[X | F]] = EX   (9.1)

because E[E[X | F]] = E[E[X | F]; Ω] = E[X; Ω] = EX.
The following is easy to establish.

Proposition 9.1. (a) If X ≥ Y are both integrable, then E[X | F] ≥ E[Y | F] a.s.
(b) If X and Y are integrable and a ∈ R, then E[aX + Y | F] = aE[X | F] + E[Y | F].

It is easy to check that limit theorems such as monotone convergence and dominated convergence have conditional expectation versions, as do inequalities like Jensen's and Chebyshev's. Thus, for example, we have the following.

Proposition 9.2. (Jensen's inequality for conditional expectations) If g is convex and X and g(X) are integrable, then

g(E[X | F]) ≤ E[g(X) | F], a.s.
Proposition 9.3. If X and XY are integrable and Y is measurable with respect to F, then

E[XY | F] = Y E[X | F].   (9.2)

Proof. If A ∈ F, then for any B ∈ F,

E[1_A E[X | F]; B] = E[E[X | F]; A ∩ B] = E[X; A ∩ B] = E[1_A X; B].

Since 1_A E[X | F] is F measurable, this shows that (9.2) holds when Y = 1_A and A ∈ F. Using linearity and taking limits shows that (9.2) holds whenever Y is F measurable and X and XY are integrable.
Two other equalities follow.

Proposition 9.4. If E ⊆ F ⊆ G, then

E[E[X | F] | E] = E[X | E] = E[E[X | E] | F].

Proof. The right equality holds because E[X | E] is E measurable, hence F measurable. To show the left equality, let A ∈ E. Then since A is also in F,

E[E[E[X | F] | E]; A] = E[E[X | F]; A] = E[X; A] = E[E[X | E]; A].

Since both sides are E measurable, the equality follows.
To show the existence of E[X | F], we proceed as follows.

Proposition 9.5. If X is integrable, then E[X | F] exists.

Proof. Using linearity, we need only consider X ≥ 0. Define a measure Q on F by Q(A) = E[X; A] for A ∈ F. This is trivially absolutely continuous with respect to P|_F, the restriction of P to F. Let E[X | F] be the Radon-Nikodym derivative of Q with respect to P|_F. The Radon-Nikodym derivative is F measurable by construction and so provides the desired random variable.
When F = σ(Y), one usually writes E[X | Y] for E[X | F]. Notation that is commonly used (however, we will use it only very occasionally and only for heuristic purposes) is E[X | Y = y]. The definition is as follows. If A ∈ σ(Y), then A = (Y ∈ B) for some Borel set B by the definition of σ(Y), or 1_A = 1_B(Y). By linearity and taking limits, if Z is σ(Y) measurable, Z = f(Y) for some Borel measurable function f. Set Z = E[X | Y] and choose f Borel measurable so that Z = f(Y). Then E[X | Y = y] is defined to be f(y).
If X ∈ L² and M = {Y ∈ L² : Y is F-measurable}, one can show that E[X | F] is equal to the projection of X onto the subspace M. We will not use this in these notes.
10 Stopping times
We next want to talk about stopping times. Suppose we have a sequence of σ-fields F_i such that F_i ⊂ F_{i+1} for each i. An example would be F_i = σ(X_1, ..., X_i). A random mapping N from Ω to {0, 1, 2, ...} is called a stopping time if for each n, (N ≤ n) ∈ F_n. A stopping time is also called an optional time in the Markov theory literature.
The intuition is that the sequence knows whether N has happened by time n by looking at F_n. Suppose some motorists are told to drive north on Highway 99 in Seattle and stop at the first motorcycle shop past the second realtor after the city limits. So they drive north, pass the city limits, pass two realtors, come to the next motorcycle shop, and stop. That is a stopping time. If they are instead told to stop at the third stop light before the city limits (and they had not been there before), they would need to drive to the city limits, then turn around and return past three stop lights. That is not a stopping time, because they have to go ahead of where they wanted to stop to know to stop there.
Proposition 10.1.
(a) Fixed times n are stopping times.
(b) If N_1 and N_2 are stopping times, then so are N_1 ∧ N_2 and N_1 ∨ N_2.
(c) If N_n is a nondecreasing sequence of stopping times, then so is N = sup_n N_n.
(d) If N_n is a nonincreasing sequence of stopping times, then so is N = inf_n N_n.
(e) If N is a stopping time, then so is N + n.

We define F_N = {A : A ∩ (N ≤ n) ∈ F_n for all n}.
11 Martingales
In this section we consider martingales. Let F_n be an increasing sequence of σ-fields. A sequence of random variables M_n is adapted to F_n if for each n, M_n is F_n measurable.
M_n is a martingale if M_n is adapted to F_n, M_n is integrable for all n, and

E[M_n | F_{n−1}] = M_{n−1}, a.s., n = 2, 3, ....   (11.1)

If we have E[M_n | F_{n−1}] ≥ M_{n−1} a.s. for every n, then M_n is a submartingale. If we have E[M_n | F_{n−1}] ≤ M_{n−1}, we have a supermartingale. Submartingales have a tendency to increase.
Let us take a moment to look at some examples. If X_i is a sequence of mean zero i.i.d. random variables and S_n is the partial sum process, then M_n = S_n is a martingale, since E[M_n | F_{n−1}] = M_{n−1} + E[M_n − M_{n−1} | F_{n−1}] = M_{n−1} + E[M_n − M_{n−1}] = M_{n−1}, using independence. If the X_i's have variance one and M_n = S_n² − n, then

E[S_n² | F_{n−1}] = E[(S_n − S_{n−1})² | F_{n−1}] + 2S_{n−1} E[S_n | F_{n−1}] − S_{n−1}² = 1 + S_{n−1}²,

using independence. It follows that M_n is a martingale.
Another example is the following: if X ∈ L¹ and M_n = E[X | F_n], then M_n is a martingale.
If M_n is a martingale and H_n ∈ F_{n−1} for each n, it is easy to check that N_n = Σ_{i=1}^n H_i(M_i − M_{i−1}) is also a martingale.
If M_n is a martingale, g is convex, and g(M_n) is integrable for each n, then by Jensen's inequality

E[g(M_{n+1}) | F_n] ≥ g(E[M_{n+1} | F_n]) = g(M_n),

so g(M_n) is a submartingale. Similarly if g is convex and nondecreasing on [0, ∞) and M_n is a positive submartingale, then g(M_n) is a submartingale because

E[g(M_{n+1}) | F_n] ≥ g(E[M_{n+1} | F_n]) ≥ g(M_n).
12 Optional stopping
Note that if one takes expectations in (11.1), one has EM_n = EM_{n−1}, and by induction EM_n = EM_0.
The theorem about martingales that lies at the basis of all other results is Doob's optional stopping theorem, which says that the same is true if we replace n by a stopping time N. There are various versions, depending on what conditions one puts on the stopping times.

Theorem 12.1. If N is a bounded stopping time with respect to F_n and M_n a martingale, then EM_N = EM_0.

Proof. Since N is bounded, let K be the largest value N takes. We write

EM_N = Σ_{k=0}^K E[M_N; N = k] = Σ_{k=0}^K E[M_k; N = k].

Note (N = k) is F_j measurable if j ≥ k, so

E[M_k; N = k] = E[M_{k+1}; N = k] = E[M_{k+2}; N = k] = ··· = E[M_K; N = k].

Hence

EM_N = Σ_{k=0}^K E[M_K; N = k] = EM_K = EM_0.

This completes the proof.

The assumption that N be bounded cannot be entirely dispensed with. For example, let M_n be the partial sums of a sequence of i.i.d. random variables that take the values ±1, each with probability 1/2. If N = min{i : M_i = 1}, we will see later on that N < ∞ a.s., but EM_N = 1 ≠ 0 = EM_0.
The same proof as that in Theorem 12.1 gives the following corollary.

Corollary 12.2. If N is bounded by K and M_n is a submartingale, then EM_N ≤ EM_K.

Also the same proof gives

Corollary 12.3. If N is bounded by K, A ∈ F_N, and M_n is a submartingale, then E[M_N; A] ≤ E[M_K; A].

Proposition 12.4. If N_1 ≤ N_2 are stopping times bounded by K and M is a martingale, then E[M_{N_2} | F_{N_1}] = M_{N_1}, a.s.

Proof. Suppose A ∈ F_{N_1}. We need to show E[M_{N_1}; A] = E[M_{N_2}; A]. Define a new stopping time N_3 by

N_3(ω) = N_1(ω) if ω ∈ A,  N_2(ω) if ω ∉ A.

It is easy to check that N_3 is a stopping time, so EM_{N_3} = EM_K = EM_{N_2} implies

E[M_{N_1}; A] + E[M_{N_2}; A^c] = E[M_{N_2}].

Subtracting E[M_{N_2}; A^c] from each side completes the proof.
The following is known as the Doob decomposition.

Proposition 12.5. Suppose X_k is a submartingale with respect to an increasing sequence of σ-fields F_k. Then we can write X_k = M_k + A_k such that M_k is a martingale adapted to the F_k and A_k is a sequence of random variables with A_k being F_{k−1}-measurable and A_0 ≤ A_1 ≤ ···.

Proof. Let a_k = E[X_k | F_{k−1}] − X_{k−1} for k = 1, 2, .... Since X_k is a submartingale, then each a_k ≥ 0. Then let A_k = Σ_{i=1}^k a_i. The fact that the A_k are increasing and measurable with respect to F_{k−1} is clear. Set M_k = X_k − A_k. Then

E[M_{k+1} − M_k | F_k] = E[X_{k+1} − X_k | F_k] − a_{k+1} = 0,

or M_k is a martingale.

Corollary 12.6. Suppose X_k is a submartingale, and N_1 ≤ N_2 are bounded stopping times. Then

E[X_{N_2} | F_{N_1}] ≥ X_{N_1}.
13 Doob’s inequalities
The first interesting consequences of the optional stopping theorems are Doob's inequalities. If M_n is a martingale, denote M*_n = max_{i≤n} |M_i|.

Theorem 13.1. If M_n is a martingale or a positive submartingale,

P(M*_n ≥ a) ≤ E[|M_n|; M*_n ≥ a]/a ≤ E|M_n|/a.

Proof. Set M_{n+1} = M_n. Let N = min{j : |M_j| ≥ a} ∧ (n+1). Since |·| is convex, |M_n| is a submartingale. If A = (M*_n ≥ a), then A ∈ F_N and by Corollary 12.3

a P(M*_n ≥ a) ≤ E[|M_N|; A] ≤ E[|M_n|; A] ≤ E|M_n|.
For p > 1, we have the following inequality.

Theorem 13.2. If p > 1 and E|M_i|^p < ∞ for i ≤ n, then

E(M*_n)^p ≤ (p/(p−1))^p E|M_n|^p.

Proof. Note M*_n ≤ Σ_{i=1}^n |M_i|, hence M*_n ∈ L^p. We write

E(M*_n)^p = ∫_0^∞ p a^{p−1} P(M*_n > a) da ≤ ∫_0^∞ p a^{p−1} E[|M_n| 1_{(M*_n≥a)}/a] da
= E ∫_0^{M*_n} p a^{p−2} |M_n| da = (p/(p−1)) E[(M*_n)^{p−1} |M_n|]
≤ (p/(p−1)) (E(M*_n)^p)^{(p−1)/p} (E|M_n|^p)^{1/p}.

The last inequality follows by Hölder's inequality. Now divide both sides by the quantity (E(M*_n)^p)^{(p−1)/p}.
14 Martingale convergence theorems
The martingale convergence theorems are another set of important consequences of optional stopping. The main step is the upcrossing lemma. The number of upcrossings of an interval [a, b] is the number of times a process crosses from below a to above b.
To be more exact, let

S_1 = min{k : X_k ≤ a},  T_1 = min{k > S_1 : X_k ≥ b},

and

S_{i+1} = min{k > T_i : X_k ≤ a},  T_{i+1} = min{k > S_{i+1} : X_k ≥ b}.

The number of upcrossings U_n before time n is the largest j such that T_j ≤ n.

Theorem 14.1. (Upcrossing lemma) If X_k is a submartingale,

EU_n ≤ (b − a)^{−1} E[(X_n − a)^+].
Proof. The number of upcrossings of [a, b] by X_k is the same as the number of upcrossings of [0, b−a] by Y_k = (X_k − a)^+. Moreover Y_k is still a submartingale. If we obtain the inequality for the number of upcrossings of the interval [0, b−a] by the process Y_k, we will have the desired inequality for upcrossings of X.
So we may assume a = 0. Fix n and define Y_{n+1} = Y_n. This will still be a submartingale. Define the S_i, T_i as above, and let S′_i = S_i ∧ (n+1), T′_i = T_i ∧ (n+1). Since T_{i+1} > S_{i+1} > T_i, then T′_{n+1} = n+1.
We write

EY_{n+1} = EY_{S′_1} + Σ_{i=1}^{n+1} E[Y_{T′_i} − Y_{S′_i}] + Σ_{i=1}^{n} E[Y_{S′_{i+1}} − Y_{T′_i}].

All the summands in the third term on the right are nonnegative since Y_k is a submartingale. For the jth upcrossing, Y_{T′_j} − Y_{S′_j} ≥ b − a, while Y_{T′_j} − Y_{S′_j} is always greater than or equal to 0. So

Σ_i (Y_{T′_i} − Y_{S′_i}) ≥ (b − a) U_n.

So

EU_n ≤ EY_{n+1}/(b − a).
This leads to the martingale convergence theorem.

Theorem 14.2. If X_n is a submartingale such that sup_n EX_n^+ < ∞, then X_n converges a.s. as n → ∞.

Proof. Let U(a, b) = lim_{n→∞} U_n. For each pair of rationals a < b, by monotone convergence,

EU(a, b) ≤ (b − a)^{−1} sup_n E(X_n − a)^+ < ∞.

So U(a, b) < ∞, a.s. Taking the union over all pairs of rationals a, b, we see that a.s. the sequence X_n(ω) cannot have lim sup X_n > lim inf X_n. Therefore X_n converges a.s., although we still have to rule out the possibility of the limit being infinite. Since X_n is a submartingale, EX_n ≥ EX_0, and thus

E|X_n| = EX_n^+ + EX_n^− = 2EX_n^+ − EX_n ≤ 2EX_n^+ − EX_0.

By Fatou's lemma, E lim_n |X_n| ≤ sup_n E|X_n| < ∞, so X_n converges a.s. to a finite limit.

Corollary 14.3. If X_n is a positive supermartingale or a martingale bounded above or below, X_n converges a.s.

Proof. If X_n is a positive supermartingale, −X_n is a submartingale bounded above by 0. Now apply Theorem 14.2.
If X_n is a martingale bounded above, by considering −X_n, we may assume X_n is bounded below. Looking at X_n + M for fixed M will not affect the convergence, so we may assume X_n is bounded below by 0. Now apply the first assertion of the corollary.
Proposition 14.4. If X_n is a martingale with sup_n E|X_n|^p < ∞ for some p > 1, then the convergence is in L^p as well as a.s. This is also true when X_n is a submartingale. If X_n is a uniformly integrable martingale, then the convergence is in L^1. If X_n → X_∞ in L^1, then X_n = E[X_∞ | F_n].

X_n is a uniformly integrable martingale if the collection of random variables X_n is uniformly integrable.

Proof. The L^p convergence assertion follows by using Doob's inequality (Theorem 13.2) and dominated convergence. The L^1 convergence assertion follows since a.s. convergence together with uniform integrability implies L^1 convergence. Finally, if j < n, we have X_j = E[X_n | F_j]. If A ∈ F_j,

E[X_j; A] = E[X_n; A] → E[X_∞; A]

by the L^1 convergence of X_n to X_∞. Since this is true for all A ∈ F_j, X_j = E[X_∞ | F_j].
15 Applications of martingales
Let S_n be your fortune at time n. In a fair casino, E[S_{n+1} | F_n] = S_n. If N is a stopping time, the optional stopping theorem says that ES_N = ES_0; in other words, no matter what stopping time you use and what method of betting, you will not do better on average than ending up with what you started with.
An elegant application of martingales is a proof of the SLLN. Fix N large. Let Y_i be i.i.d. Let Z_n = E[Y_1 | S_n, S_{n+1}, ..., S_N]. We claim Z_n = S_n/n. Certainly S_n/n is σ(S_n, ..., S_N) measurable. If A ∈ σ(S_n, ..., S_N), then A = ((S_n, ..., S_N) ∈ B) for some Borel subset B of R^{N−n+1}. Since the Y_i are i.i.d., for each k ≤ n,

E[Y_1; (S_n, ..., S_N) ∈ B] = E[Y_k; (S_n, ..., S_N) ∈ B].

Summing over k and dividing by n,

E[Y_1; (S_n, ..., S_N) ∈ B] = E[S_n/n; (S_n, ..., S_N) ∈ B].

Therefore E[Y_1; A] = E[S_n/n; A] for every A ∈ σ(S_n, ..., S_N). Thus Z_n = S_n/n.
Let X_k = Z_{N−k}, and let F_k = σ(S_{N−k}, S_{N−k+1}, ..., S_N). Note F_k gets larger as k gets larger, and by the above X_k = E[Y_1 | F_k]. This shows that X_k is a martingale (cf. the next to last example in Section 11). By Doob's upcrossing inequality, if U^X_N is the number of upcrossings of [a, b] by X, then EU^X_N ≤ EX_N^+/(b − a) ≤ E|Y_1|/(b − a). This differs by at most one from the number of upcrossings of [a, b] by Z_1, ..., Z_N. So the expected number of upcrossings of [a, b] by Z_k for k ≤ N is bounded by 1 + E|Y_1|/(b − a). Now let N → ∞. By Fatou's lemma, the expected number of upcrossings of [a, b] by Z_1, Z_2, ... is finite. Arguing as in the proof of the martingale convergence theorem, this says that Z_n = S_n/n does not oscillate.
It is conceivable that |S_n/n| → ∞. But by Fatou's lemma,

E[lim |S_n/n|] ≤ lim inf E|S_n/n| ≤ lim inf n E|Y_1|/n = E|Y_1| < ∞.

Hence S_n/n converges a.s. to a finite limit.
Proposition 15.1. Suppose the Y_i are i.i.d. with E|Y_1| < ∞, N is a stopping time with EN < ∞, and N is independent of the Y_i. Then ES_N = (EN)(EY_1), where the S_n are the partial sums of the Y_i.

Proof. S_n − n(EY_1) is a martingale, so ES_{n∧N} = E(n ∧ N) EY_1 by optional stopping. The right hand side tends to (EN)(EY_1) by monotone convergence. S_{n∧N} converges almost surely to S_N, and we need to show the expected values converge. Note

|S_{n∧N}| = Σ_{k=0}^∞ |S_{n∧k}| 1_{(N=k)} ≤ Σ_{k=0}^∞ Σ_{j=1}^{n∧k} |Y_j| 1_{(N=k)}
= Σ_{j=1}^n Σ_{k≥j} |Y_j| 1_{(N=k)} = Σ_{j=1}^n |Y_j| 1_{(N≥j)} ≤ Σ_{j=1}^∞ |Y_j| 1_{(N≥j)}.

The last expression, using the independence, has expected value

Σ_{j=1}^∞ (E|Y_j|) P(N ≥ j) ≤ (E|Y_1|)(1 + EN) < ∞.

So by dominated convergence, we have ES_{n∧N} → ES_N.

Wald's second identity is a similar expression for the variance of S_N.
We can use martingales to find certain hitting probabilities.

Proposition 15.2. Suppose the Y_i are i.i.d. with P(Y_1 = 1) = 1/2, P(Y_1 = −1) = 1/2, and S_n is the partial sum process. Suppose a and b are positive integers. Then

P(S_n hits −a before b) = b/(a + b).

If N = min{n : S_n ∈ {−a, b}}, then EN = ab.

Proof. S_n² − n is a martingale, so ES²_{n∧N} = E(n ∧ N). Let n → ∞. The right hand side converges to EN by monotone convergence. Since S_{n∧N} is bounded in absolute value by a + b, the left hand side converges by dominated convergence to ES_N², which is finite. So EN is finite, hence N is finite almost surely.
S_n is a martingale, so ES_{n∧N} = ES_0 = 0. By dominated convergence, and the fact that N < ∞ a.s., hence S_{n∧N} → S_N, we have ES_N = 0, or

−a P(S_N = −a) + b P(S_N = b) = 0.

We also have

P(S_N = −a) + P(S_N = b) = 1.

Solving these two equations for P(S_N = −a) and P(S_N = b) yields our first result. Since EN = ES_N² = a² P(S_N = −a) + b² P(S_N = b), substituting gives the second result.

Based on this proposition, if we let a → ∞, we see that P(N_b < ∞) = 1 and EN_b = ∞, where N_b = min{n : S_n = b}.
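Both conclusions of Proposition 15.2 are easy to check by simulation (an added illustration; the values of a, b and the number of runs are arbitrary).

import numpy as np

rng = np.random.default_rng(4)
a, b, runs = 3, 5, 5000
hits_minus_a, total_steps = 0, 0
for _ in range(runs):
    s, steps = 0, 0
    while -a < s < b:                    # run the walk until it reaches -a or b
        s += rng.choice([-1, 1])
        steps += 1
    hits_minus_a += (s == -a)
    total_steps += steps
print(hits_minus_a / runs, b / (a + b))  # estimate of P(hit -a before b) vs b/(a+b)
print(total_steps / runs, a * b)         # estimate of EN vs ab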
Proposition 15.3. Suppose A_n ∈ F_n. Then (A_n i.o.) and (Σ_{n=1}^∞ P(A_n | F_{n−1}) = ∞) differ by a null set.

Proof. Let X_n = Σ_{m=1}^n [1_{A_m} − P(A_m | F_{m−1})]. Note |X_n − X_{n−1}| ≤ 1. Also, it is easy to see that E[X_n − X_{n−1} | F_{n−1}] = 0, so X_n is a martingale.
We claim that for almost every ω either lim X_n exists and is finite, or else lim sup X_n = ∞ and lim inf X_n = −∞. In fact, if N = min{n : X_n ≤ −k}, then X_{n∧N} ≥ −k − 1, so X_{n∧N} converges by the martingale convergence theorem. Therefore lim X_n exists and is finite on (N = ∞). So if lim X_n does not exist or is not finite, then N < ∞. This is true for all k, hence lim inf X_n = −∞. A similar argument shows lim sup X_n = ∞ in this case.
Now if lim X_n exists and is finite, then Σ_{n=1}^∞ 1_{A_n} = ∞ if and only if Σ P(A_n | F_{n−1}) = ∞. On the other hand, if the limit does not exist or is not finite, then Σ 1_{A_n} = ∞ and Σ P(A_n | F_{n−1}) = ∞.
16 Weak convergence
We will see later that if the X_i are i.i.d. with mean zero and variance one, then S_n/√n converges in the sense

P(S_n/√n ∈ [a, b]) → P(Z ∈ [a, b]),

where Z is a standard normal. If S_n/√n converged in probability or almost surely, then by the zero-one law it would converge to a constant, contradicting the above. We want to generalize the above type of convergence.
We say F_n converges weakly to F if F_n(x) → F(x) for all x at which F is continuous. Here F_n and F are distribution functions. We say X_n converges weakly to X if F_{X_n} converges weakly to F_X. We sometimes say X_n converges in distribution or converges in law to X. Probabilities µ_n converge weakly if their corresponding distribution functions converge, that is, if F_{µ_n}(x) = µ_n(−∞, x] converges weakly.
An example that illustrates why we restrict the convergence to continuity points of F is the following. Let X_n = 1/n with probability one, and X = 0 with probability one. F_{X_n}(x) is 0 if x < 1/n and 1 otherwise. F_{X_n}(x) converges to F_X(x) for all x except x = 0.
Proposition 16.1. X_n converges weakly to X if and only if Eg(X_n) → Eg(X) for all g bounded and continuous.

The idea that Eg(X_n) converges to Eg(X) for all g bounded and continuous makes sense for any metric space and is used as a definition of weak convergence for X_n taking values in general metric spaces.

Proof. First suppose Eg(X_n) converges to Eg(X). Let x be a continuity point of F_X, let ε > 0, and choose δ such that |F_X(y) − F_X(x)| < ε if |y − x| < δ. Choose g continuous such that g is one on (−∞, x], takes values between 0 and 1, and is 0 on [x + δ, ∞). Then F_{X_n}(x) ≤ Eg(X_n) → Eg(X) ≤ F_X(x + δ) ≤ F_X(x) + ε.
Similarly, if h is a continuous function taking values between 0 and 1 that is 1 on (−∞, x − δ] and 0 on [x, ∞), then F_{X_n}(x) ≥ Eh(X_n) → Eh(X) ≥ F_X(x − δ) ≥ F_X(x) − ε. Since ε is arbitrary, F_{X_n}(x) → F_X(x).
Now suppose X_n converges weakly to X. If a and b are continuity points of F_X and of all the F_{X_n}, then E1_{[a,b]}(X_n) = F_{X_n}(b) − F_{X_n}(a) → F_X(b) − F_X(a) = E1_{[a,b]}(X). By taking linear combinations, we have Eg(X_n) → Eg(X) for every g which is a step function where the end points of the intervals are continuity points for all the F_{X_n} and for F_X. Since the set of points that are not a continuity point for some F_{X_n} or for F_X is countable, and we can approximate any continuous function on an interval by such step functions uniformly, we have Eg(X_n) → Eg(X) for all g such that the support of g is a closed interval whose endpoints are continuity points of F_X and g is continuous on its support.
Let ε > 0 and choose M such that F_X(M) > 1 − ε and F_X(−M) < ε and so that M and −M are continuity points of F_X and of all the F_{X_n}. We apply the above to 1_{[−M,M]}g, where g is a bounded continuous function. The difference between E(1_{[−M,M]}g)(X) and Eg(X) is bounded by ‖g‖_∞ P(X ∉ [−M, M]) ≤ 2ε‖g‖_∞. Similarly, when X is replaced by X_n, the difference is bounded by ‖g‖_∞ P(X_n ∉ [−M, M]) → ‖g‖_∞ P(X ∉ [−M, M]). So for n large, it is less than 3ε‖g‖_∞. Since ε is arbitrary, Eg(X_n) → Eg(X) whenever g is bounded and continuous.
Let us examine the relationship between weak convergence and convergence in probability. The example of S_n/√n shows that one can have weak convergence without convergence in probability.

Proposition 16.2. (a) If X_n converges to X in probability, then it converges weakly.
(b) If X_n converges weakly to a constant, it converges in probability.
(c) (Slutsky's theorem) If X_n converges weakly to X and Y_n converges weakly to a constant c, then X_n + Y_n converges weakly to X + c and X_n Y_n converges weakly to cX.

Proof. To prove (a), let g be a bounded and continuous function. If n_j is any subsequence, then there exists a further subsequence such that X_{n_{j_k}} converges almost surely to X. Then by dominated convergence, Eg(X_{n_{j_k}}) → Eg(X). That suffices to show Eg(X_n) converges to Eg(X).
For (b), if X_n converges weakly to c,

P(X_n − c > ε) = P(X_n > c + ε) = 1 − P(X_n ≤ c + ε) → 1 − P(c ≤ c + ε) = 0.

We use the fact that if Y ≡ c, then c + ε is a point of continuity for F_Y. A similar equation shows P(X_n − c ≤ −ε) → 0, so P(|X_n − c| > ε) → 0.
We now prove the first part of (c), leaving the second part for the reader. Let x be a point such that x − c is a continuity point of F_X. Choose ε so that x − c + ε is again a continuity point. Then

P(X_n + Y_n ≤ x) ≤ P(X_n + c ≤ x + ε) + P(|Y_n − c| > ε) → P(X ≤ x − c + ε).

So lim sup P(X_n + Y_n ≤ x) ≤ P(X + c ≤ x + ε). Since ε can be as small as we like and x − c is a continuity point of F_X, then lim sup P(X_n + Y_n ≤ x) ≤ P(X + c ≤ x). The lim inf is done similarly.
We say a sequence of distribution functions {F_n} is tight if for each ε > 0 there exists M such that F_n(M) ≥ 1 − ε and F_n(−M) ≤ ε for all n. A sequence of r.v.s is tight if the corresponding distribution functions are tight; this is equivalent to P(|X_n| ≥ M) ≤ ε for all n.

Theorem 16.3. (Helly's theorem) Let F_n be a sequence of distribution functions that is tight. There exists a subsequence n_j and a distribution function F such that F_{n_j} converges weakly to F.

What could happen is that X_n = n, so that F_{X_n} → 0; the tightness precludes this.

Proof. Let q_k be an enumeration of the rationals. Since F_n(q_k) ∈ [0, 1], any subsequence has a further subsequence that converges. Use diagonalization so that F_{n_j}(q_k) converges for each q_k and call the limit F(q_k). F is nondecreasing, and define F(x) = inf_{q_k≥x} F(q_k). So F is right continuous and nondecreasing.
If x is a point of continuity of F and ε > 0, then there exist r and s rational such that r < x < s and F(s) − ε < F(x) < F(r) + ε. Then

F_{n_j}(x) ≥ F_{n_j}(r) → F(r) > F(x) − ε

and

F_{n_j}(x) ≤ F_{n_j}(s) → F(s) < F(x) + ε.

Since ε is arbitrary, F_{n_j}(x) → F(x).
Since the F_n are tight, there exists M such that F_n(−M) < ε for all n. Then F(−M) ≤ ε, which implies F(x) → 0 as x → −∞. Similarly F(x) → 1 as x → ∞, so F is indeed a distribution function.
Proposition 16.4. Suppose there exists ϕ : [0, ∞) → [0, ∞) that is increasing and ϕ(x) → ∞ as x → ∞. If c = sup_n Eϕ(|X_n|) < ∞, then the X_n are tight.

Proof. Let ε > 0. Choose M such that ϕ(x) ≥ c/ε if x > M. Then

P(|X_n| > M) ≤ ∫ (ϕ(|X_n|)/(c/ε)) 1_{(|X_n|>M)} dP ≤ (ε/c) Eϕ(|X_n|) ≤ ε.
17 Characteristic functions
We define the characteristic function of a random variable X by ϕ_X(t) = E e^{itX} for t ∈ R.
Note that ϕ_X(t) = ∫ e^{itx} P_X(dx). So if X and Y have the same law, they have the same characteristic function. Also, if the law of X has a density, that is, P_X(dx) = f_X(x) dx, then ϕ_X(t) = ∫ e^{itx} f_X(x) dx, so in this case the characteristic function is the same as (one definition of) the Fourier transform of f_X.

Proposition 17.1. ϕ(0) = 1, |ϕ(t)| ≤ 1, ϕ(−t) is the complex conjugate of ϕ(t), and ϕ is uniformly continuous.

Proof. Since |e^{itx}| ≤ 1, everything follows immediately from the definitions except the uniform continuity. For that we write

|ϕ(t + h) − ϕ(t)| = |E e^{i(t+h)X} − E e^{itX}| ≤ E|e^{itX}(e^{ihX} − 1)| = E|e^{ihX} − 1|.

|e^{ihX} − 1| tends to 0 almost surely as h → 0, so the right hand side tends to 0 by dominated convergence. Note that the right hand side is independent of t.

Proposition 17.2. ϕ_{aX}(t) = ϕ_X(at) and ϕ_{X+b}(t) = e^{itb} ϕ_X(t).

Proof. The first follows from E e^{it(aX)} = E e^{i(at)X}, and the second is similar.

Proposition 17.3. If X and Y are independent, then ϕ_{X+Y}(t) = ϕ_X(t) ϕ_Y(t).

Proof. From the multiplication theorem,

E e^{it(X+Y)} = E e^{itX} e^{itY} = E e^{itX} E e^{itY}.

Note that if X_1 and X_2 are independent and identically distributed, then

ϕ_{X_1−X_2}(t) = ϕ_{X_1}(t) ϕ_{−X_2}(t) = ϕ_{X_1}(t) ϕ_{X_2}(−t) = |ϕ_{X_1}(t)|²,

since ϕ_{X_2}(−t) is the complex conjugate of ϕ_{X_1}(t).
Let us look at some examples of characteristic functions.
(a) Bernoulli: By direct computation, this is pe^{it} + (1 − p) = 1 − p(1 − e^{it}).
(b) Coin flip (i.e., P(X = +1) = P(X = −1) = 1/2): We have (1/2)e^{it} + (1/2)e^{−it} = cos t.
(c) Poisson:

E e^{itX} = Σ_{k=0}^∞ e^{itk} e^{−λ} λ^k/k! = e^{−λ} Σ_k (λe^{it})^k/k! = e^{−λ} e^{λe^{it}} = e^{λ(e^{it}−1)}.

(e) Binomial: Write X as the sum of n independent Bernoulli r.v.s B_i. So

ϕ_X(t) = Π_{i=1}^n ϕ_{B_i}(t) = [ϕ_{B_1}(t)]^n = [1 − p(1 − e^{it})]^n.

(f) Geometric:

ϕ(t) = Σ_{k=0}^∞ p(1 − p)^k e^{itk} = p Σ_k ((1 − p)e^{it})^k = p/(1 − (1 − p)e^{it}).

(g) Uniform on [a, b]:

ϕ(t) = (1/(b − a)) ∫_a^b e^{itx} dx = (e^{itb} − e^{ita})/((b − a)it).

Note that when a = −b this reduces to sin(bt)/(bt).
(h) Exponential:

ϕ(t) = ∫_0^∞ λ e^{itx} e^{−λx} dx = λ ∫_0^∞ e^{(it−λ)x} dx = λ/(λ − it).

(i) Standard normal:

ϕ(t) = (1/√(2π)) ∫_{−∞}^∞ e^{itx} e^{−x²/2} dx.

This can be done by completing the square and then doing a contour integration. Alternately, ϕ′(t) = (1/√(2π)) ∫_{−∞}^∞ ix e^{itx} e^{−x²/2} dx (do the real and imaginary parts separately, and use the dominated convergence theorem to justify taking the derivative inside). Integrating by parts (do the real and imaginary parts separately), this is equal to −tϕ(t). The only solution to ϕ′(t) = −tϕ(t) with ϕ(0) = 1 is ϕ(t) = e^{−t²/2}.
(j) Normal with mean µ and variance σ²: Writing X = σZ + µ, where Z is a standard normal, then ϕ_X(t) = e^{iµt} ϕ_Z(σt) = e^{iµt − σ²t²/2}.
(k) Cauchy: We have

ϕ(t) = (1/π) ∫ e^{itx}/(1 + x²) dx.

This is a standard exercise in contour integration in complex analysis. The answer is e^{−|t|}.
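The closed forms above are easy to spot-check numerically (an added illustration): the sketch compares Monte Carlo estimates of E e^{itX} with the formulas for the coin flip and the standard normal at a few arbitrary values of t.

import numpy as np

rng = np.random.default_rng(5)
n = 200_000
coin = rng.choice([-1.0, 1.0], size=n)
normal = rng.standard_normal(n)

for t in [0.5, 1.0, 2.0]:
    mc_coin = np.mean(np.exp(1j * t * coin))      # Monte Carlo E[e^{itX}] for the coin flip
    mc_normal = np.mean(np.exp(1j * t * normal))  # and for the standard normal
    print(t, mc_coin.real, np.cos(t), mc_normal.real, np.exp(-t**2 / 2))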
18 Inversion formula
We need a preliminary real variable lemma, and then we can proceed to the inversion formula, which gives a formula for the distribution function in terms of the characteristic function.

Lemma 18.1. (a) ∫_0^N (sin(Ax)/x) dx → sgn(A) π/2 as N → ∞.
(b) sup_a |∫_0^a (sin(Ax)/x) dx| < ∞.

An alternate proof of (a) is the following. e^{−xy} sin x is integrable on {(x, y) : 0 < x < a, 0 < y < ∞}. So

∫_0^a (sin x/x) dx = ∫_0^a ∫_0^∞ e^{−xy} sin x dy dx
= ∫_0^∞ ∫_0^a e^{−xy} sin x dx dy
= ∫_0^∞ [ (e^{−xy}/(y² + 1)) (−y sin x − cos x) ]_{x=0}^{x=a} dy
= ∫_0^∞ [ (e^{−ay}/(y² + 1)) (−y sin a − cos a) + 1/(y² + 1) ] dy
= π/2 − sin a ∫_0^∞ (y e^{−ay}/(y² + 1)) dy − cos a ∫_0^∞ (e^{−ay}/(y² + 1)) dy.

The last two integrals tend to 0 as a → ∞ since the integrands are bounded by (1 + y)e^{−y} if a ≥ 1.

Theorem 18.2. (Inversion formula) Let µ be a probability measure and let ϕ(t) = ∫ e^{itx} µ(dx). If a < b, then

lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = µ(a, b) + (1/2)µ({a}) + (1/2)µ({b}).

The example where µ is point mass at 0, so ϕ(t) = 1, shows that one needs to take a limit, since the integrand in this case is sin t/t, which is not integrable.
Proof. By Fubini,

∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = ∫_{−T}^T ∫ ((e^{−ita} − e^{−itb})/(it)) e^{itx} µ(dx) dt
= ∫ ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) e^{itx} dt µ(dx).

To justify this, we bound the integrand by the mean value theorem.
Expanding e^{−itb} and e^{−ita} using Euler's formula, and using the fact that cos is an even function and sin is odd, we are left with

∫ 2[ ∫_0^T (sin(t(x − a))/t) dt − ∫_0^T (sin(t(x − b))/t) dt ] µ(dx).

Using Lemma 18.1 and dominated convergence, this tends to

∫ [π sgn(x − a) − π sgn(x − b)] µ(dx).

Dividing by 2π gives the result, since the integrand equals 2π for x ∈ (a, b), π for x = a or x = b, and 0 otherwise.
Theorem 18.3. If ∫ |ϕ(t)| dt < ∞, then µ has a bounded density f and

f(y) = (1/2π) ∫ e^{−ity} ϕ(t) dt.

Proof. We have

µ(a, b) + (1/2)µ({a}) + (1/2)µ({b}) = lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt
= (1/2π) ∫_{−∞}^∞ ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt
≤ ((b − a)/2π) ∫ |ϕ(t)| dt.

Letting b → a shows that µ has no point masses. We now write

µ(x, x + h) = (1/2π) ∫ ((e^{−itx} − e^{−it(x+h)})/(it)) ϕ(t) dt = (1/2π) ∫ ∫_x^{x+h} e^{−ity} dy ϕ(t) dt
= ∫_x^{x+h} (1/2π) ∫ e^{−ity} ϕ(t) dt dy.

So µ has density f(y) = (1/2π) ∫ e^{−ity} ϕ(t) dt. As in the proof of Proposition 17.1, we see f is continuous.
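Theorem 18.3 can be tried numerically (an added sketch; the truncation of the t-integral and the grid are arbitrary): integrating e^{−ity} ϕ(t) for the standard normal characteristic function ϕ(t) = e^{−t²/2} recovers the standard normal density.

import numpy as np

ts = np.linspace(-40, 40, 400_001)       # truncated integration range; phi decays rapidly
phi = np.exp(-ts**2 / 2)                 # characteristic function of a standard normal
dt = ts[1] - ts[0]
for y in [0.0, 0.5, 1.0, 2.0]:
    f_y = (np.exp(-1j * ts * y) * phi).sum().real * dt / (2 * np.pi)   # Riemann sum
    exact = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)
    print(y, f_y, exact)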
A corollary to the inversion formula is the uniqueness theorem.

Theorem 18.4. If ϕ_X = ϕ_Y, then P_X = P_Y.

The following proposition can be proved directly, but the proof using characteristic functions is much easier.

Proposition 18.5. (a) If X and Y are independent, X is a normal with mean a and variance b², and Y is a normal with mean c and variance d², then X + Y is normal with mean a + c and variance b² + d².
(b) If X and Y are independent, X is Poisson with parameter λ_1, and Y is Poisson with parameter λ_2, then X + Y is Poisson with parameter λ_1 + λ_2.
(c) If X_i are i.i.d. Cauchy, then S_n/n is Cauchy.

Proof. For (a),

ϕ_{X+Y}(t) = ϕ_X(t)ϕ_Y(t) = e^{iat − b²t²/2} e^{ict − d²t²/2} = e^{i(a+c)t − (b²+d²)t²/2}.

Now use the uniqueness theorem.
Parts (b) and (c) are proved similarly.
19 Continuity theorem
Lemma 19.1. Suppose ϕ is the characteristic function of a probability µ. Then

µ([−2A, 2A]) ≥ A |∫_{−1/A}^{1/A} ϕ(t) dt| − 1.

Proof. Note

(1/2T) ∫_{−T}^T ϕ(t) dt = (1/2T) ∫_{−T}^T ∫ e^{itx} µ(dx) dt = ∫ ∫ (1/2T) 1_{[−T,T]}(t) e^{itx} dt µ(dx) = ∫ (sin(Tx)/(Tx)) µ(dx).

Since |sin(Tx)/(Tx)| ≤ 1 and |sin(Tx)| ≤ 1, then |sin(Tx)/(Tx)| ≤ 1/(2TA) if |x| ≥ 2A. So

|∫ (sin(Tx)/(Tx)) µ(dx)| ≤ µ([−2A, 2A]) + ∫_{[−2A,2A]^c} (1/(2TA)) µ(dx)
= µ([−2A, 2A]) + (1/(2TA))(1 − µ([−2A, 2A]))
= 1/(2TA) + (1 − 1/(2TA)) µ([−2A, 2A]).

Setting T = 1/A,

(A/2) |∫_{−1/A}^{1/A} ϕ(t) dt| ≤ 1/2 + (1/2) µ([−2A, 2A]).

Now multiply both sides by 2 and rearrange.
Proposition 19.2. If µ_n converges weakly to µ, then ϕ_n converges to ϕ uniformly on every finite interval.

Proof. Let ε > 0 and choose M large so that µ([−M, M]^c) < ε. Define f to be 1 on [−M, M], 0 on [−M − 1, M + 1]^c, and linear in between. Since ∫ f dµ_n → ∫ f dµ, then if n is large enough,

∫ (1 − f) dµ_n ≤ 2ε.

We have

|ϕ_n(t + h) − ϕ_n(t)| ≤ ∫ |e^{ihx} − 1| µ_n(dx) ≤ 2 ∫ (1 − f) dµ_n + |h| ∫ |x| f(x) µ_n(dx) ≤ 4ε + |h|(M + 1).

So for n large enough and |h| ≤ ε/(M + 1), we have

|ϕ_n(t + h) − ϕ_n(t)| ≤ 5ε,

which says that the ϕ_n are equicontinuous. Since ϕ_n(t) → ϕ(t) for each t (as e^{itx} is bounded and continuous), the convergence is uniform on finite intervals.
The interesting result of this section is the converse, Lévy's continuity theorem.

Theorem 19.3. Suppose µ_n are probabilities, ϕ_n(t) converges to a function ϕ(t) for each t, and ϕ is continuous at 0. Then ϕ is the characteristic function of a probability µ and µ_n converges weakly to µ.

Proof. Let ε > 0. Since ϕ is continuous at 0, choose δ small so that

|(1/2δ) ∫_{−δ}^δ ϕ(t) dt − 1| < ε.

Using the dominated convergence theorem, choose N such that

(1/2δ) ∫_{−δ}^δ |ϕ_n(t) − ϕ(t)| dt < ε

if n ≥ N. So if n ≥ N,

|(1/2δ) ∫_{−δ}^δ ϕ_n(t) dt| ≥ |(1/2δ) ∫_{−δ}^δ ϕ(t) dt| − (1/2δ) ∫_{−δ}^δ |ϕ_n(t) − ϕ(t)| dt ≥ 1 − 2ε.

By Lemma 19.1 with A = 1/δ, for such n,

µ_n([−2/δ, 2/δ]) ≥ 2(1 − 2ε) − 1 = 1 − 4ε.

This shows the µ_n are tight.
Let n_j be a subsequence such that µ_{n_j} converges weakly, say to µ. Then ϕ_{n_j}(t) → ϕ_µ(t), hence ϕ(t) = ϕ_µ(t), or ϕ is the characteristic function of a probability µ. If µ′ is any subsequential weak limit point of µ_n, then ϕ_{µ′}(t) = ϕ(t) = ϕ_µ(t); so µ′ must equal µ. Hence µ_n converges weakly to µ.
Proposition 19.4. If E|X|^k < ∞ for an integer k, then ϕ_X has a continuous derivative of order k and

ϕ^{(k)}(t) = ∫ (ix)^k e^{itx} P_X(dx).

In particular, ϕ^{(k)}(0) = i^k EX^k.

Proof. Write

(ϕ(t + h) − ϕ(t))/h = ∫ ((e^{i(t+h)x} − e^{itx})/h) P_X(dx).

The integrand is bounded by |x|. So if ∫ |x| P_X(dx) < ∞, we can use dominated convergence to obtain the desired formula for ϕ′(t). As in the proof of Proposition 17.1, we see ϕ′(t) is continuous. We do the case of general k by induction. Evaluating ϕ^{(k)} at 0 gives the particular case.
Here is a converse.

Proposition 19.5. If ϕ is the characteristic function of a random variable X and ϕ″(0) exists, then E|X|² < ∞.

Proof. Note

(e^{ihx} − 2 + e^{−ihx})/h² = −2(1 − cos hx)/h² ≤ 0,

and 2(1 − cos hx)/h² converges to x² as h → 0. So by Fatou's lemma,

∫ x² P_X(dx) ≤ 2 lim inf_{h→0} ∫ ((1 − cos hx)/h²) P_X(dx) = −lim sup_{h→0} (ϕ(h) − 2ϕ(0) + ϕ(−h))/h² = −ϕ″(0) < ∞.
One nice application of the continuity theorem is a proof of the weak law of large numbers. Its proof is very similar to the proof of the central limit theorem, which we give in the next section.
Another nice use of characteristic functions and martingales is the following.

Proposition 19.6. Suppose X_i is a sequence of independent r.v.s and S_n converges weakly. Then S_n converges almost surely.

Proof. Suppose S_n converges weakly to W. Then ϕ_{S_n}(t) → ϕ_W(t) uniformly on compact sets by Proposition 19.2. Since ϕ_W(0) = 1 and ϕ_W is continuous, there exists δ such that |ϕ_W(t) − 1| < 1/2 if |t| < δ. So for n large, |ϕ_{S_n}(t)| ≥ 1/4 if |t| < δ.
Note

E[e^{itS_n} | X_1, ..., X_{n−1}] = e^{itS_{n−1}} E[e^{itX_n} | X_1, ..., X_{n−1}] = e^{itS_{n−1}} ϕ_{X_n}(t).

Since ϕ_{S_n}(t) = Π_i ϕ_{X_i}(t), it follows that e^{itS_n}/ϕ_{S_n}(t) is a martingale.
Therefore for |t| < δ and n large, e^{itS_n}/ϕ_{S_n}(t) is a bounded martingale, and hence converges almost surely. Since ϕ_{S_n}(t) → ϕ_W(t) ≠ 0, then e^{itS_n} converges almost surely if |t| < δ.
Let A = {(ω, t) ∈ Ω × (−δ, δ) : e^{itS_n(ω)} does not converge}. For each t, we have almost sure convergence, so ∫ 1_A(ω, t) P(dω) = 0. Therefore ∫_{−δ}^δ ∫ 1_A dP dt = 0, and by Fubini, ∫ ∫_{−δ}^δ 1_A dt dP = 0. Hence almost surely, ∫ 1_A(ω, t) dt = 0. This means there exists a set N with P(N) = 0, and if ω ∉ N, then e^{itS_n(ω)} converges for almost every t ∈ (−δ, δ).
If ω ∉ N, by dominated convergence, ∫_0^a e^{itS_n(ω)} dt converges, provided a < δ. Call the limit A_a. Also

∫_0^a e^{itS_n(ω)} dt = (e^{iaS_n(ω)} − 1)/(iS_n(ω))

if S_n(ω) ≠ 0, and equals a otherwise.
Since S_n converges weakly, it is not possible for |S_n| → ∞ with positive probability. If we let N′ = {ω : |S_n(ω)| → ∞} and choose ω ∉ N ∪ N′, there exists a subsequence S_{n_j}(ω) which converges to a finite limit, say R. We can choose a < δ such that e^{iaS_n(ω)} converges and e^{iaR} ≠ 1. Therefore A_a = (e^{iaR} − 1)/(iR), a nonzero quantity. But then

S_n(ω) = (e^{iaS_n(ω)} − 1)/(i ∫_0^a e^{itS_n(ω)} dt) → (lim_{n→∞} e^{iaS_n(ω)} − 1)/(iA_a).

Therefore, except for ω ∈ N ∪ N′, we have that S_n(ω) converges.
20 Central limit theorem
The simplest case of the central limit theorem (CLT) is the case when the X_i are i.i.d., with mean zero and variance one, and then the CLT says that S_n/√n converges weakly to a standard normal. We first prove this case.
We need the fact that if c_n are complex numbers converging to c, then (1 + (c_n/n))^n → e^c. We leave the proof of this to the reader, with the warning that any proof using logarithms needs to be done with some care, since log z is a multi-valued function when z is complex.

Theorem 20.1. Suppose the X_i are i.i.d., mean zero, and variance one. Then S_n/√n converges weakly to a standard normal.

Proof. Since X_1 has finite second moment, ϕ_{X_1} has a continuous second derivative. By Taylor's theorem,

ϕ_{X_1}(t) = ϕ_{X_1}(0) + ϕ′_{X_1}(0) t + ϕ″_{X_1}(0) t²/2 + R(t),

where |R(t)|/t² → 0 as |t| → 0. So

ϕ_{X_1}(t) = 1 − t²/2 + R(t).

Then

ϕ_{S_n/√n}(t) = ϕ_{S_n}(t/√n) = (ϕ_{X_1}(t/√n))^n = [1 − t²/(2n) + R(t/√n)]^n.

Since t/√n converges to zero as n → ∞, we have

ϕ_{S_n/√n}(t) → e^{−t²/2}.

Now apply the continuity theorem.
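A simulation of Theorem 20.1 (added for illustration; the distribution of the X_i is an arbitrary mean-zero, variance-one choice): the fraction of samples of S_n/√n falling in an interval approaches the corresponding standard normal probability as n grows.

import numpy as np
from math import erf

rng = np.random.default_rng(6)
trials, a, b = 50_000, -1.0, 1.0
normal_prob = 0.5 * (erf(b / np.sqrt(2)) - erf(a / np.sqrt(2)))   # P(a <= Z <= b)

for n in [5, 50, 500]:
    # X_i uniform on [-sqrt(3), sqrt(3)]: mean zero, variance one
    x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(trials, n))
    s = x.sum(axis=1) / np.sqrt(n)
    print(n, np.mean((s >= a) & (s <= b)), normal_prob)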
(30)Proposition 20.2 WithXi as above,Sn/
√
nconverges weakly to a standard normal
Proof LetY1, , Ynbe i.i.d standard normal r.v.s that are independent of theXi LetZ1=Y2+· · ·+Yn,
Z2=X1+Y3+· · ·+Yn,Z3=X1+X2+Y4+· · ·+Yn, etc
Let us suppose g ∈ C³ with compact support and let W be a standard normal. Our first goal is to show
|E g(S_n/√n) − E g(W)| → 0.   (20.1)
We have
E g(S_n/√n) − E g(W) = E g(S_n/√n) − E g(Σ_{i=1}^n Y_i/√n) = Σ_{i=1}^n [E g((X_i + Z_i)/√n) − E g((Y_i + Z_i)/√n)].
By Taylor's theorem,
g((X_i + Z_i)/√n) = g(Z_i/√n) + g'(Z_i/√n) X_i/√n + (1/2) g''(Z_i/√n) X_i²/n + R_n,
where |R_n| ≤ ‖g'''‖_∞ |X_i|³/n^{3/2}. Taking expectations and using the independence of X_i and Z_i,
E g((X_i + Z_i)/√n) = E g(Z_i/√n) + (1/(2n)) E g''(Z_i/√n) E X_i² + E R_n.
We have a very similar expression for E g((Y_i + Z_i)/√n), and E X_i² = E Y_i² = 1. Taking the difference,
|E g((X_i + Z_i)/√n) − E g((Y_i + Z_i)/√n)| ≤ ‖g'''‖_∞ (E|X_i|³ + E|Y_i|³)/n^{3/2}.
Summing over i from 1 to n, we have (20.1).
By approximating continuous functions with compact support by C³ functions with compact support, we have (20.1) for such g. Since E(S_n/√n)² = 1, the sequence S_n/√n is tight. So given ε there exists M such that P(|S_n/√n| > M) < ε for all n. By taking M larger if necessary, we also have P(|W| > M) < ε. Suppose g is bounded and continuous. Let ψ be a continuous function with compact support that is bounded by one, is nonnegative, and that equals 1 on [−M, M]. By (20.1) applied to gψ,
|E(gψ)(S_n/√n) − E(gψ)(W)| → 0.
However,
|E g(S_n/√n) − E(gψ)(S_n/√n)| ≤ ‖g‖_∞ P(|S_n/√n| > M) < ε‖g‖_∞,
and similarly
|E g(W) − E(gψ)(W)| < ε‖g‖_∞.
Since ε is arbitrary, this proves (20.1) for bounded continuous g. By Proposition 16.1, this proves our proposition.
Proposition 20.3 Suppose for each n the r.v.s X_{ni}, i = 1, …, n, are i.i.d. Bernoullis with parameter p_n. If np_n → λ and S_n = Σ_{i=1}^n X_{ni}, then S_n converges weakly to a Poisson r.v. with parameter λ.
Proof. We write
ϕ_{S_n}(t) = [ϕ_{X_{n1}}(t)]^n = [1 + p_n(e^{it} − 1)]^n = [1 + (np_n/n)(e^{it} − 1)]^n → e^{λ(e^{it}−1)}.
Now apply the continuity theorem.
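As a quick numerical check, not in the original notes, one can compare Binomial(n, p_n) probabilities with np_n = λ against the Poisson(λ) probabilities; the values n = 500 and λ = 3 below are arbitrary.

    from math import comb, exp, factorial

    n, lam = 500, 3.0
    p = lam / n  # so that n * p_n = lambda exactly

    for k in range(8):
        binom = comb(n, k) * p**k * (1 - p)**(n - k)
        poisson = exp(-lam) * lam**k / factorial(k)
        print(k, round(binom, 5), round(poisson, 5))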
A much more general theorem than Theorem 20.1 is the Lindeberg-Feller theorem.
Theorem 20.4 Suppose for each n, X_{ni}, i = 1, …, n, are mean zero independent random variables. Suppose
(a) Σ_{i=1}^n E X_{ni}² → σ² > 0, and
(b) for each ε, Σ_{i=1}^n E[|X_{ni}|²; |X_{ni}| > ε] → 0.
Let S_n = Σ_{i=1}^n X_{ni}. Then S_n converges weakly to a normal r.v. with mean zero and variance σ².
Note that nothing is said about independence of the X_{ni} for different n.
Let us look at Theorem 20.1 in light of this theorem. Suppose the Y_i are i.i.d. and let X_{ni} = Y_i/√n. Then
Σ_{i=1}^n E(Y_i/√n)² = E Y_1²
and
Σ_{i=1}^n E[|X_{ni}|²; |X_{ni}| > ε] = n E[|Y_1|²/n; |Y_1| > √n ε] = E[|Y_1|²; |Y_1| > √n ε],
which tends to 0 by the dominated convergence theorem. If the Y_i are independent with mean 0, and
Σ_{i=1}^n E|Y_i|³ / (Var S_n)^{3/2} → 0,
then S_n/(Var S_n)^{1/2} converges weakly to a standard normal. This is known as Lyapounov's theorem; we leave the derivation of this from the Lindeberg-Feller theorem as an exercise for the reader.
Proof. Let ϕ_{ni} be the characteristic function of X_{ni} and let σ_{ni}² be the variance of X_{ni}. We need to show
∏_{i=1}^n ϕ_{ni}(t) → e^{−t²σ²/2}.   (20.2)
Using Taylor series, |e^{ib} − 1 − ib + b²/2| ≤ c|b|³ for a constant c. Also,
|e^{ib} − 1 − ib + b²/2| ≤ |e^{ib} − 1 − ib| + |b|²/2 ≤ c|b|².
If we apply this to a random variable tY and take expectations,
|ϕ_Y(t) − (1 + itEY − t²EY²/2)| ≤ c E[t²Y² ∧ t³|Y|³].
Applying this to Y = X_{ni},
|ϕ_{ni}(t) − (1 − t²σ_{ni}²/2)| ≤ c E[t³|X_{ni}|³ ∧ t²|X_{ni}|²].
The right hand side is less than or equal to
c E[t³|X_{ni}|³; |X_{ni}| ≤ ε] + c E[t²|X_{ni}|²; |X_{ni}| > ε] ≤ cεt³ E[|X_{ni}|²] + ct² E[|X_{ni}|²; |X_{ni}| ≥ ε].
Summing over i we obtain
Σ_{i=1}^n |ϕ_{ni}(t) − (1 − t²σ_{ni}²/2)| ≤ cεt³ Σ_i E[|X_{ni}|²] + ct² Σ_i E[|X_{ni}|²; |X_{ni}| ≥ ε].
We need the following inequality: if |a_i|, |b_i| ≤ 1, then
|∏_{i=1}^n a_i − ∏_{i=1}^n b_i| ≤ Σ_{i=1}^n |a_i − b_i|.
To prove this, note
∏ a_i − ∏ b_i = (a_n − b_n) ∏_{i<n} b_i + a_n (∏_{i<n} a_i − ∏_{i<n} b_i)
and use induction.
Note |ϕ_{ni}(t)| ≤ 1 and |1 − t²σ_{ni}²/2| ≤ 1, because σ_{ni}² ≤ ε² + E[|X_{ni}|²; |X_{ni}| > ε] < 1/t² if we take ε small enough and n large enough. So
|∏_{i=1}^n ϕ_{ni}(t) − ∏_{i=1}^n (1 − t²σ_{ni}²/2)| ≤ cεt³ Σ_i E[|X_{ni}|²] + ct² Σ_i E[|X_{ni}|²; |X_{ni}| ≥ ε].
Since sup_i σ_{ni}² → 0, log(1 − t²σ_{ni}²/2) is asymptotically equal to −t²σ_{ni}²/2, and so
∏_i (1 − t²σ_{ni}²/2) = exp(Σ_i log(1 − t²σ_{ni}²/2))
is asymptotically equal to
exp(−t² Σ_i σ_{ni}²/2) = e^{−t²σ²/2}. Since ε is arbitrary, the proof is complete.
We now complete the proof of Theorem 8.2.
Proof of the "only if" part of Theorem 8.2. Since Σ X_n converges, then X_n must converge to zero a.s., and so P(|X_n| > A i.o.) = 0. By the Borel-Cantelli lemma, this says Σ P(|X_n| > A) < ∞. We also conclude by Proposition 5.4 that Σ Y_n converges.
Let c_n = Σ_{i=1}^n Var Y_i and suppose c_n → ∞. Let Z_{nm} = (Y_m − EY_m)/√c_n. Then Σ_{m=1}^n Var Z_{nm} = (1/c_n) Σ_{m=1}^n Var Y_m = 1. If ε > 0, then for n large we have 2A/√c_n < ε. Since |Y_m| ≤ A and hence |EY_m| ≤ A, then |Z_{nm}| ≤ 2A/√c_n < ε. It follows that Σ_{m=1}^n E(|Z_{nm}|²; |Z_{nm}| > ε) = 0 for large n. By Theorem 20.4, Σ_{m=1}^n (Y_m − EY_m)/√c_n converges weakly to a standard normal. However, Σ_{m=1}^n Y_m converges, and c_n → ∞, so Σ Y_m/√c_n must converge to 0. The quantities Σ EY_m/√c_n are nonrandom, so there is no way the difference can converge to a standard normal, a contradiction. We conclude c_n does not converge to infinity.
Let V_i = Y_i − EY_i. Since |V_i| ≤ 2A, EV_i = 0, and Var V_i = Var Y_i, which is summable, by the "if" part of the three series criterion, Σ V_i converges. Since Σ Y_i converges, taking the difference shows Σ EY_i converges.
21 Framework for Markov chains
Suppose S is a set with some topological structure that we will use as our state space. Think of S as being R^d or the positive integers, for example. A sequence of random variables X_0, X_1, … is a Markov chain if
P(X_{n+1} ∈ A | X_0, …, X_n) = P(X_{n+1} ∈ A | X_n)   (21.1)
for all n and all measurable sets A. The definition of a Markov chain has this intuition: to predict the probability that X_{n+1} is in any set, we only need to know where we currently are; how we got there gives no additional information.
Let's make some additional comments. First of all, we previously considered random variables as mappings from Ω to R. Now we want to extend our definition by allowing a random variable to be a map X from Ω to S, where (X ∈ A) is F-measurable for all open sets A. This agrees with the definition of r.v. in the case S = R.
Although there is quite a theory developed for Markov chains with arbitrary state spaces, we will confine our attention to the case where either S is finite, in which case we will usually suppose S = {1, 2, …, n}, or countable and discrete, in which case we will usually suppose S is the set of positive integers.
We are going to further restrict our attention to Markov chains where
P(X_{n+1} ∈ A | X_n = x) = P(X_1 ∈ A | X_0 = x),
that is, where the probabilities do not depend on n. Such Markov chains are said to have stationary transition probabilities.
Define the initial distribution of a Markov chain with stationary transition probabilities by µ(i) = P(X_0 = i). Define the transition probabilities by p(i, j) = P(X_{n+1} = j | X_n = i). Since the transition probabilities are stationary, p(i, j) does not depend on n.
In this case we can use the definition of conditional probability given in undergraduate classes. If P(X_n = i) = 0 for all n, that means we never visit i and we could drop the point i from the state space.
Proposition 21.1 Let X be a Markov chain with initial distribution µ and transition probabilities p(i, j). Then
P(X_n = i_n, X_{n−1} = i_{n−1}, …, X_1 = i_1, X_0 = i_0) = µ(i_0) p(i_0, i_1) ··· p(i_{n−1}, i_n).   (21.2)
Proof. We use induction on n. It is clearly true for n = 0 by the definition of µ(i). Suppose it holds for n; we need to show it holds for n + 1. For simplicity, we will do the case n = 2. Then
P(X_3 = i_3, X_2 = i_2, X_1 = i_1, X_0 = i_0)
= E[P(X_3 = i_3 | X_0 = i_0, X_1 = i_1, X_2 = i_2); X_2 = i_2, X_1 = i_1, X_0 = i_0]
= E[P(X_3 = i_3 | X_2 = i_2); X_2 = i_2, X_1 = i_1, X_0 = i_0]
= p(i_2, i_3) P(X_2 = i_2, X_1 = i_1, X_0 = i_0).
Now by the induction hypothesis,
P(X_2 = i_2, X_1 = i_1, X_0 = i_0) = p(i_1, i_2) p(i_0, i_1) µ(i_0).
Substituting establishes the claim for n = 2.
Proposition 21.2 Suppose µ(i) is a sequence of nonnegative numbers with Σ_i µ(i) = 1, and for each i the sequence p(i, j) is nonnegative and sums to 1. Then there exists a Markov chain with µ(i) as its initial distribution and p(i, j) as the transition probabilities.
Proof. Define Ω = S^∞. Let F be the σ-field generated by the cylinder sets {ω : ω_0 = i_0, …, ω_n = i_n}, where n > 0 and i_j ∈ S. An element ω of Ω is a sequence (i_0, i_1, …). Define X_j(ω) = i_j if ω = (i_0, i_1, …). Define P(X_0 = i_0, …, X_n = i_n) by (21.2). Using the Kolmogorov extension theorem, one can show that P can be extended to a probability on Ω.
The above framework is rather abstract, but it is clear that under P the sequence X_n has initial distribution µ(i); what we need to show is that X_n is a Markov chain and that
P(X_{n+1} = i_{n+1} | X_0 = i_0, …, X_n = i_n) = P(X_{n+1} = i_{n+1} | X_n = i_n) = p(i_n, i_{n+1}).   (21.3)
By the definition of conditional probability, the left hand side of (21.3) is
P(X_{n+1} = i_{n+1} | X_0 = i_0, …, X_n = i_n) = P(X_{n+1} = i_{n+1}, X_n = i_n, …, X_0 = i_0) / P(X_n = i_n, …, X_0 = i_0)   (21.4)
= [µ(i_0) ··· p(i_{n−1}, i_n) p(i_n, i_{n+1})] / [µ(i_0) ··· p(i_{n−1}, i_n)] = p(i_n, i_{n+1}),
as desired.
To complete the proof we need to show
P(X_{n+1} = i_{n+1}, X_n = i_n) / P(X_n = i_n) = p(i_n, i_{n+1}),
or
P(X_{n+1} = i_{n+1}, X_n = i_n) = p(i_n, i_{n+1}) P(X_n = i_n).   (21.5)
Now
P(X_n = i_n) = Σ_{i_0, …, i_{n−1}} P(X_n = i_n, X_{n−1} = i_{n−1}, …, X_0 = i_0) = Σ_{i_0, …, i_{n−1}} µ(i_0) ··· p(i_{n−1}, i_n),
and similarly
P(X_{n+1} = i_{n+1}, X_n = i_n) = Σ_{i_0, …, i_{n−1}} P(X_{n+1} = i_{n+1}, X_n = i_n, X_{n−1} = i_{n−1}, …, X_0 = i_0) = p(i_n, i_{n+1}) Σ_{i_0, …, i_{n−1}} µ(i_0) ··· p(i_{n−1}, i_n).
Equation (21.5) now follows.
Note in this construction that the X_n sequence is fixed and does not depend on µ or p. Let p(i, j) be fixed. The probability we constructed above is often denoted P^µ. If µ is point mass at a point i or x, it is denoted P^i or P^x. So we have one probability space, one sequence X_n, but a whole family of probabilities.
Later on we will see that this framework allows one to express the Markov property and strong Markov property in a convenient way. As part of the preparation for doing this, we define the shift operators θ_k : Ω → Ω by
θ_k(i_0, i_1, …) = (i_k, i_{k+1}, …).
Then X_j ∘ θ_k = X_{j+k}. To see this, if ω = (i_0, i_1, …), then
X_j ∘ θ_k(ω) = X_j(i_k, i_{k+1}, …) = i_{j+k} = X_{j+k}(ω).
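To make the construction concrete, here is a minimal sketch, not from the notes, of how one might sample a trajectory of a chain on a finite state space {0, 1, …, N−1} given an initial distribution µ and a transition matrix P whose rows sum to 1; the function name sample_chain is made up for illustration.

    import numpy as np

    def sample_chain(mu, P, steps, rng=None):
        # returns X_0, ..., X_steps with X_0 ~ mu and X_{n+1} ~ p(X_n, .)
        rng = rng or np.random.default_rng()
        states = np.arange(len(mu))
        x = rng.choice(states, p=mu)
        path = [int(x)]
        for _ in range(steps):
            x = rng.choice(states, p=P[x])
            path.append(int(x))
        return path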
22 Examples
Random walk on the integers
We let Y_i be an i.i.d. sequence of r.v.'s, with p = P(Y_i = 1) and 1 − p = P(Y_i = −1). Let X_n = X_0 + Σ_{i=1}^n Y_i. Then the X_n can be viewed as a Markov chain with p(i, i+1) = p, p(i, i−1) = 1 − p, and p(i, j) = 0 if |j − i| ≠ 1. More general random walks on the integers also fit into this framework. To check that the random walk is Markov,
P(X_{n+1} = i_{n+1} | X_0 = i_0, …, X_n = i_n) = P(X_{n+1} − X_n = i_{n+1} − i_n | X_0 = i_0, …, X_n = i_n) = P(X_{n+1} − X_n = i_{n+1} − i_n),
using the independence, while
P(X_{n+1} = i_{n+1} | X_n = i_n) = P(X_{n+1} − X_n = i_{n+1} − i_n | X_n = i_n) = P(X_{n+1} − X_n = i_{n+1} − i_n).
Random walks on graphs
Suppose we have n points, and from each point there is some probability of going to another point. For example, suppose there are 5 points and we have p(1,2) = 1/2, p(1,3) = 1/2, p(2,1) = 1/4, p(2,3) = 1/2, p(2,5) = 1/4, p(3,1) = 1/4, p(3,2) = 1/4, p(3,3) = 1/2, p(4,1) = 1, p(5,1) = 1/2, p(5,5) = 1/2. The p(i, j) are often arranged into a matrix:
P =
[ 0    1/2  1/2  0   0   ]
[ 1/4  0    1/2  0   1/4 ]
[ 1/4  1/4  1/2  0   0   ]
[ 1    0    0    0   0   ]
[ 1/2  0    0    0   1/2 ]
Note the rows must sum to 1 since
Σ_{j=1}^5 p(i, j) = Σ_{j=1}^5 P(X_1 = j | X_0 = i) = P(X_1 ∈ S | X_0 = i) = 1.
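As an illustration, not in the original notes, the matrix above can be entered directly and its row sums checked; it can also be fed to the sampler sketched at the end of the previous section.

    import numpy as np

    P = np.array([
        [0,   1/2, 1/2, 0, 0  ],
        [1/4, 0,   1/2, 0, 1/4],
        [1/4, 1/4, 1/2, 0, 0  ],
        [1,   0,   0,   0, 0  ],
        [1/2, 0,   0,   0, 1/2],
    ])
    print(P.sum(axis=1))  # each row sums to 1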
Renewal processes
Let Y_i be i.i.d. with P(Y_i = k) = a_k, where the a_k are nonnegative and sum to 1. Let T_0 = i_0 and T_n = T_0 + Σ_{i=1}^n Y_i. We think of Y_i as the lifetime of the ith light bulb and T_n as the time when the nth light bulb burns out. Let X_n be the amount of time after time n until the current light bulb burns out.
If X_n = j and j > 0, then T_i = n + j for some i but T_i does not equal n, n+1, …, n+j−1 for any i. So T_i = (n+1) + (j−1) for some i and T_i does not equal (n+1), (n+1)+1, …, (n+1)+(j−2) for any i. Therefore X_{n+1} = j − 1. So p(i, i−1) = 1 if i ≥ 1.
If X_n = 0, then a light bulb burned out at time n, and X_{n+1} is 0 if the next light bulb burns out immediately and is j − 1 if the next light bulb has lifetime j. The probability of the latter is a_j. So p(0, j) = a_{j+1}. All the other p(i, j)'s are 0.
Branching processes
Consider k particles. At the next time interval, some of them die, and some of them split into several particles. The probability that a given particle will split into j particles is given by a_j, j = 0, 1, …, where the a_j are nonnegative and sum to 1. The behavior of each particle is independent of the behavior of all the other particles. If X_n is the number of particles at time n, then X_n is a Markov chain. Let Y_i be i.i.d. random variables with P(Y_i = j) = a_j. The p(i, j) for X_n are somewhat complicated, and can be defined by p(i, j) = P(Σ_{m=1}^i Y_m = j).
Queues
We will discuss briefly the M/G/1 queue. The M refers to the fact that the customers arrive according to a Poisson process, so the probability that the number of customers arriving in a time interval of length t is k is given by e^{−λt}(λt)^k/k!. The G refers to the fact that the length of time it takes to serve a customer is given by a distribution that is not necessarily exponential. The 1 refers to the fact that there is 1 server. Suppose the length of time to serve one customer has distribution function F with density f. The probability that k customers arrive during the time it takes to serve one customer is
a_k = ∫_0^∞ e^{−λt} (λt)^k/k! f(t) dt.
Let the Y_i be i.i.d. with P(Y_i = k − 1) = a_k. So Y_i is the number of customers arriving during the time it takes to serve one customer, minus the one customer whose service is completed. Let X_{n+1} = (X_n + Y_{n+1})^+ be the number of customers waiting. Then X_n is a Markov chain with p(0,0) = a_0 + a_1 and p(j, j − 1 + k) = a_k if j ≥ 1, k ≥ 0.
Ehrenfest urns
Suppose we have two urns with a total of r balls, k in one and r − k in the other. Pick one of the r balls at random and move it to the other urn. Let X_n be the number of balls in the first urn. X_n is a Markov chain with p(k, k+1) = (r−k)/r, p(k, k−1) = k/r, and p(i, j) = 0 otherwise.
One model for this is to consider two containers of air with a thin tube connecting them. Suppose a few molecules of a foreign substance are introduced. Then the number of molecules in the first container is like an Ehrenfest urn. We shall see that all states in this model are recurrent, so infinitely often all the molecules of the foreign substance will be in the first urn. Yet there is a tendency towards equilibrium, so on average there will be about the same number of molecules in each container for all large times.
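A short simulation, included here only as an illustrative sketch and not part of the notes, shows the tendency toward equilibrium in the Ehrenfest model: with r balls the chain spends most of its time near r/2, even though every state is visited infinitely often. The value r = 20 and the run length are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    r, steps = 20, 100_000
    k = 0                      # balls in the first urn
    visits = np.zeros(r + 1)
    for _ in range(steps):
        # move to k+1 with probability (r-k)/r, otherwise to k-1
        k = k + 1 if rng.random() < (r - k) / r else k - 1
        visits[k] += 1
    print("time-average of X_n:", (visits * np.arange(r + 1)).sum() / steps)  # close to r/2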
Birth and death processes
Suppose there are i particles, and the probability of a birth is a_i, the probability of a death is b_i, where a_i, b_i ≥ 0, a_i + b_i ≤ 1. Setting X_n equal to the number of particles, then X_n is a Markov chain with p(i, i+1) = a_i, p(i, i−1) = b_i, and p(i, i) = 1 − a_i − b_i.
23 Markov properties
A special case of the Markov property says that
E^x[f(X_{n+1}) | F_n] = E^{X_n} f(X_1).
The right hand side is to be interpreted as ϕ(X_n), where ϕ(y) = E^y f(X_1). The randomness on the right hand side all comes from the X_n. If we write f(X_{n+1}) = f(X_1) ∘ θ_n and we write Y for f(X_1), then the above can be rewritten
E^x[Y ∘ θ_n | F_n] = E^{X_n} Y.
Let F_∞ be the σ-field generated by ∪_{n=1}^∞ F_n.
Theorem 23.1 (Markov property) If Y is bounded and measurable with respect to F_∞, then
E^x[Y ∘ θ_n | F_n] = E^{X_n}[Y], P^x-a.s.,
for each n and x.
Proof. If we can prove this for Y = f_1(X_1) ··· f_m(X_m), then taking f_j(x) = 1_{i_j}(x), we will have it for Y's of the form 1_{(X_1=i_1, …, X_m=i_m)}. By linearity (and the fact that S is countable), we will then have it for Y's of the form 1_{((X_1, …, X_m) ∈ B)}. A monotone class argument shows that such Y's generate F_∞.
We use induction on m, and first we prove it for m = 1. We need to show
E[f_1(X_{n+1}) | F_n] = E^{X_n} f_1(X_1).
Using linearity and the fact that S is countable, it suffices to show this for f_1(y) = 1_{{j}}(y). Using the definition of θ_n, we need to show
P(X_{n+1} = j | F_n) = P^{X_n}(X_1 = j),
or equivalently,
P^x(X_{n+1} = j; A) = E^x[P^{X_n}(X_1 = j); A]   (23.2)
when A ∈ F_n. By linearity it suffices to consider A of the form A = (X_1 = i_1, …, X_n = i_n). The left hand side of (23.2) is then
P^x(X_{n+1} = j, X_1 = i_1, …, X_n = i_n),
and by (21.4) this is equal to
p(i_n, j) P^x(X_1 = i_1, …, X_n = i_n) = p(i_n, j) P^x(A).
Let g(y) = P^y(X_1 = j). We have
P^x(X_1 = j, X_0 = k) = P^k(X_1 = j) if x = k, and 0 if x ≠ k,
while
E^x[g(X_0); X_0 = k] = E^x[g(k); X_0 = k] = P^k(X_1 = j) P^x(X_0 = k) = P^k(X_1 = j) if x = k, and 0 if x ≠ k.
It follows that
p(i, j) = P^x(X_1 = j | X_0 = i) = P^i(X_1 = j).
So the right hand side of (23.2) is
E^x[g(X_n); A] = g(i_n) P^x(A) = p(i_n, j) P^x(A),
as required.
Suppose the result holds for m and we want to show it holds for m + 1. We have
E^x[f_1(X_{n+1}) ··· f_{m+1}(X_{n+m+1}) | F_n]
= E^x[ E^x[f_{m+1}(X_{n+m+1}) | F_{n+m}] f_1(X_{n+1}) ··· f_m(X_{n+m}) | F_n]
= E^x[ E^{X_{n+m}}[f_{m+1}(X_1)] f_1(X_{n+1}) ··· f_m(X_{n+m}) | F_n]
= E^x[ f_1(X_{n+1}) ··· f_{m−1}(X_{n+m−1}) h(X_{n+m}) | F_n].
Here we used the result for m = 1, and we defined h(y) = f_m(y) g(y), where g(y) = E^y[f_{m+1}(X_1)]. Using the induction hypothesis, this is equal to
E^{X_n}[f_1(X_1) ··· f_{m−1}(X_{m−1}) h(X_m)] = E^{X_n}[f_1(X_1) ··· f_m(X_m) E^{X_m} f_{m+1}(X_1)]
= E^{X_n}[f_1(X_1) ··· f_m(X_m) E[f_{m+1}(X_{m+1}) | F_m]]
= E^{X_n}[f_1(X_1) ··· f_{m+1}(X_{m+1})],
which is what we needed.
Define θ_N(ω) = (θ_{N(ω)})(ω). The strong Markov property is the same as the Markov property, but where the fixed time n is replaced by a stopping time N.
Theorem 23.2 If Y is bounded and measurable and N is a finite stopping time, then
E^x[Y ∘ θ_N | F_N] = E^{X_N}[Y].
Proof. We will show
P^x(X_{N+1} = j | F_N) = P^{X_N}(X_1 = j).
Once we have this, we can proceed as in the proof of Theorem 23.1 to obtain our result. To show the above equality, we need to show that if B ∈ F_N, then
P^x(X_{N+1} = j, B) = E^x[P^{X_N}(X_1 = j); B].   (23.3)
Recall that since B ∈ F_N, then B ∩ (N = k) ∈ F_k. We have
P^x(X_{N+1} = j, B, N = k) = P^x(X_{k+1} = j, B, N = k)
= E^x[P^x(X_{k+1} = j | F_k); B, N = k]
= E^x[P^{X_k}(X_1 = j); B, N = k]
= E^x[P^{X_N}(X_1 = j); B, N = k].
Now sum over k; since N is finite, we obtain our desired result.
Another way of expressing the Markov property is through the Chapman-Kolmogorov equations. Let p_n(i, j) = P(X_n = j | X_0 = i).
Proposition 23.3 For all i, j, m, n we have
p_{n+m}(i, j) = Σ_{k∈S} p_n(i, k) p_m(k, j).
Proof. We write
P(X_{n+m} = j, X_0 = i) = Σ_k P(X_{n+m} = j, X_n = k, X_0 = i)
= Σ_k P(X_{n+m} = j | X_n = k, X_0 = i) P(X_n = k | X_0 = i) P(X_0 = i)
= Σ_k P(X_{n+m} = j | X_n = k) p_n(i, k) P(X_0 = i)
= Σ_k p_m(k, j) p_n(i, k) P(X_0 = i).
If we divide both sides by P(X_0 = i), we have our result.
Note the resemblance to matrix multiplication. It is clear that if P is the matrix made up of the p(i, j), then P^n will be the matrix whose (i, j) entry is p_n(i, j).
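The matrix interpretation can be checked numerically; the sketch below, not part of the notes, verifies the Chapman-Kolmogorov identity P^{n+m} = P^n P^m for the 5-state matrix of Section 22.

    import numpy as np

    P = np.array([
        [0,   1/2, 1/2, 0, 0  ],
        [1/4, 0,   1/2, 0, 1/4],
        [1/4, 1/4, 1/2, 0, 0  ],
        [1,   0,   0,   0, 0  ],
        [1/2, 0,   0,   0, 1/2],
    ])
    P2 = np.linalg.matrix_power(P, 2)
    P3 = np.linalg.matrix_power(P, 3)
    # p_{2+3}(i, j) = sum_k p_2(i, k) p_3(k, j)
    print(np.allclose(np.linalg.matrix_power(P, 5), P2 @ P3))  # True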
24 Recurrence and transience
Let
T_y = min{i > 0 : X_i = y}.
This is the first time that the chain hits the point y. Even if X_0 = y we would have T_y > 0. We let T_y^k be the k-th time that the Markov chain hits y, and we set
r(x, y) = P^x(T_y < ∞),
the probability starting at x that the Markov chain ever hits y.
Proposition 24.1 P^x(T_y^k < ∞) = r(x, y) r(y, y)^{k−1}.
Proof. The case k = 1 is just the definition, so suppose k > 1. Using the strong Markov property,
P^x(T_y^k < ∞) = P^x(T_y ∘ θ_{T_y^{k−1}} < ∞, T_y^{k−1} < ∞)
= E^x[P^x(T_y ∘ θ_{T_y^{k−1}} < ∞ | F_{T_y^{k−1}}); T_y^{k−1} < ∞]
= E^x[P^{X(T_y^{k−1})}(T_y < ∞); T_y^{k−1} < ∞]
= E^x[P^y(T_y < ∞); T_y^{k−1} < ∞]
= r(y, y) P^x(T_y^{k−1} < ∞).
We used here the fact that at time T_y^{k−1} the Markov chain must be at the point y. Repeating this argument k − 2 times yields the result.
We say that y is recurrent if r(y, y) = 1; otherwise we say y is transient. Let
N(y) = Σ_{n=1}^∞ 1_{(X_n=y)}.
Proposition 24.2 y is recurrent if and only if E^y N(y) = ∞.
Proof. Note
E^y N(y) = Σ_{k=1}^∞ P^y(N(y) ≥ k) = Σ_{k=1}^∞ P^y(T_y^k < ∞) = Σ_{k=1}^∞ r(y, y)^k.
We used the fact that N(y) is the number of visits to y and that the number of visits being at least k is the same as the time of the k-th visit being finite. Since r(y, y) ≤ 1, the left hand side will be finite if and only if r(y, y) < 1.
Observe that
E^y N(y) = Σ_n P^y(X_n = y) = Σ_n p_n(y, y).
If we consider simple symmetric random walk on the integers, then p_n(0,0) is 0 if n is odd and equal to (n choose n/2) 2^{−n} if n is even. This is because in order to be at 0 after n steps, the walk must have had n/2 positive steps and n/2 negative steps; the probability of this is given by the binomial distribution. Using Stirling's approximation, we see that p_n(0,0) ∼ c/√n for n even, so the sum Σ_n p_n(0,0) diverges, and so simple random walk in one dimension is recurrent.
Similar arguments show that simple symmetric random walk is also recurrent in 2 dimensions but transient in 3 or more dimensions.
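The divergence of Σ_n p_n(0,0) in one dimension is easy to check numerically; the following sketch, not in the notes, compares the exact binomial formula with Stirling's approximation and prints the slowly growing partial sums. The cutoffs are arbitrary.

    from math import comb, sqrt, pi

    def p00(n):
        # p_n(0,0) for simple symmetric random walk on the integers
        return 0.0 if n % 2 else comb(n, n // 2) * 2.0 ** (-n)

    for N in (10, 100, 1000):
        partial = sum(p00(k) for k in range(1, N + 1))
        print(N, round(p00(N), 5), round(sqrt(2 / (pi * N)), 5), round(partial, 2))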
Proposition 24.3 If x is recurrent and r(x, y) > 0, then y is recurrent and r(y, x) = 1.
Proof. First we show r(y, x) = 1. Suppose not. Since r(x, y) > 0, there is a smallest n and y_1, …, y_{n−1} such that p(x, y_1) p(y_1, y_2) ··· p(y_{n−1}, y) > 0. Since this is the smallest n, none of the y_i can equal x. Then
P^x(T_x = ∞) ≥ p(x, y_1) ··· p(y_{n−1}, y)(1 − r(y, x)) > 0,
a contradiction to x being recurrent.
Next we show that y is recurrent. Since r(y, x) > 0, there exists L such that p_L(y, x) > 0, and since r(x, y) > 0, there exists K such that p_K(x, y) > 0. Then
p_{L+n+K}(y, y) ≥ p_L(y, x) p_n(x, x) p_K(x, y).
Summing over n,
Σ_n p_{L+n+K}(y, y) ≥ p_L(y, x) p_K(x, y) Σ_n p_n(x, x) = ∞.
We say a subset C of S is closed if x ∈ C and r(x, y) > 0 imply y ∈ C. A subset D is irreducible if x, y ∈ D implies r(x, y) > 0.
Proposition 24.4 Let C be finite and closed. Then C contains a recurrent state.
From this and the preceding proposition, if C is also irreducible, then all states in C will be recurrent.
Proof. If not, for all y we have r(y, y) < 1 and
E^x N(y) = Σ_{k=1}^∞ r(x, y) r(y, y)^{k−1} = r(x, y)/(1 − r(y, y)) < ∞.
Since C is finite, then Σ_{y∈C} E^x N(y) < ∞. But that is a contradiction since
Σ_{y∈C} E^x N(y) = Σ_{y∈C} Σ_n p_n(x, y) = Σ_n Σ_{y∈C} p_n(x, y) = Σ_n P^x(X_n ∈ C) = Σ_n 1 = ∞,
using the fact that C is closed and x ∈ C, so that X_n ∈ C for all n.
Theorem 24.5 Let R = {x : r(x, x) = 1}, the set of recurrent states. Then R = ∪_{i=1}^∞ R_i, where each R_i is closed and irreducible.
Proof. Say x ∼ y if r(x, y) > 0. Since every state in R is recurrent, x ∼ x, and if x ∼ y, then y ∼ x. If x ∼ y and y ∼ z, then p_n(x, y) > 0 and p_m(y, z) > 0 for some n and m. Then p_{n+m}(x, z) > 0, or x ∼ z. Therefore we have an equivalence relation and we let the R_i be the equivalence classes.
Looking at our examples, it is easy to see that in the Ehrenfest urn model all states are recurrent. For the branching process model, suppose p(x, 0) > 0 for all x. Then 0 is recurrent and all the other states are transient. In the renewal chain, there are two cases. If {k : a_k > 0} is unbounded, all states are recurrent. If K = max{k : a_k > 0}, then {0, 1, …, K − 1} are recurrent states and the rest are transient.
For the queueing model, let µ = Σ_k k a_k, the expected number of people arriving during one customer's service time. We may view this as a branching process by letting all the customers arriving during one person's service time be considered the progeny of that customer. It turns out that if µ ≤ 1, 0 is recurrent and all other states are also. If µ > 1, all states are transient.
25 Stationary measures
A probability µ is a stationary distribution if
Σ_x µ(x) p(x, y) = µ(y).   (25.1)
In matrix notation this is µP = µ, or µ is a left eigenvector corresponding to the eigenvalue 1. In the case of a stationary distribution, P^µ(X_1 = y) = µ(y), which implies that X_1, X_2, … all have the same distribution.
We can use (25.1) when µ is a measure rather than a probability, in which case it is called a stationary measure.
If we have a random walk on the integers, µ(x) = 1 for all x serves as a stationary measure. In the case of an asymmetric random walk, p(i, i+1) = p, p(i, i−1) = q = 1 − p with p ≠ q, setting µ(x) = (p/q)^x also works.
In the Ehrenfest urn model, µ(x) = 2^{−r} (r choose x) works. One way to see this is that µ is the distribution one gets if one flips r coins and puts a coin in the first urn when the coin is heads. A transition corresponds to picking a coin at random and turning it over.
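The Ehrenfest claim is easy to verify numerically; the sketch below, not part of the notes, builds the transition matrix for r = 6 (an arbitrary choice) and checks that the binomial measure satisfies µP = µ.

    import numpy as np
    from math import comb

    r = 6
    P = np.zeros((r + 1, r + 1))
    for k in range(r + 1):
        if k < r:
            P[k, k + 1] = (r - k) / r
        if k > 0:
            P[k, k - 1] = k / r

    mu = np.array([comb(r, x) for x in range(r + 1)], dtype=float) / 2**r
    print(np.allclose(mu @ P, mu))  # True: mu is stationary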
Proposition 25.1 Let a be recurrent and let T = T_a. Set
µ(y) = E^a Σ_{n=0}^{T−1} 1_{(X_n=y)}.
Then µ is a stationary measure.
The idea of the proof is that µ(y) is the expected number of visits to y by the sequence X_0, …, X_{T−1}, while µP is the expected number of visits to y by X_1, …, X_T. These should be the same because X_T = X_0 = a.
Proof. First, let p̄_n(a, y) = P^a(X_n = y, T > n). So
µ(y) = Σ_{n=0}^∞ P^a(X_n = y, T > n) = Σ_{n=0}^∞ p̄_n(a, y),
and
Σ_y µ(y) p(y, z) = Σ_y Σ_{n=0}^∞ p̄_n(a, y) p(y, z).
Second, we consider the case z ≠ a. Then
Σ_y p̄_n(a, y) p(y, z) = Σ_y P^a(hit y in n steps without first hitting a, and then go to z in one step) = p̄_{n+1}(a, z).
So
Σ_y µ(y) p(y, z) = Σ_n Σ_y p̄_n(a, y) p(y, z) = Σ_{n=0}^∞ p̄_{n+1}(a, z) = Σ_{n=0}^∞ p̄_n(a, z) = µ(z),
since p̄_0(a, z) = 0.
Third, we consider the case z = a. Then
Σ_y p̄_n(a, y) p(y, z) = Σ_y P^a(hit y in n steps without first hitting a, and then go to z in one step) = P^a(T = n + 1).
Recall P^a(T = 0) = 0, and since a is recurrent, T < ∞. So
Σ_y µ(y) p(y, z) = Σ_n Σ_y p̄_n(a, y) p(y, z) = Σ_{n=0}^∞ P^a(T = n + 1) = Σ_{n=0}^∞ P^a(T = n) = 1.
On the other hand,
Σ_{n=0}^{T−1} 1_{(X_n=a)} = 1_{(X_0=a)} = 1,
hence µ(a) = 1. Therefore, whether z ≠ a or z = a, we have µP(z) = µ(z).
Finally, we show µ(y) < ∞. If r(a, y) = 0, then µ(y) = 0. If r(a, y) > 0, then since a is recurrent, r(y, a) > 0 by Proposition 24.3, so we may choose n with p_n(y, a) > 0. Then
1 = µ(a) = Σ_x µ(x) p_n(x, a) ≥ µ(y) p_n(y, a),
which implies µ(y) < ∞.
Proposition 25.2 If the Markov chain is irreducible and all states are recurrent, then the stationary measure is unique up to a constant multiple.
Proof. Fix a ∈ S. Let µ_a be the stationary measure constructed above and let ν be any other stationary measure.
Since ν = νP, then
ν(z) = ν(a) p(a, z) + Σ_{y≠a} ν(y) p(y, z)
= ν(a) p(a, z) + Σ_{y≠a} ν(a) p(a, y) p(y, z) + Σ_{x≠a} Σ_{y≠a} ν(x) p(x, y) p(y, z)
= ν(a) P^a(X_1 = z) + ν(a) P^a(X_1 ≠ a, X_2 = z) + P^ν(X_0 ≠ a, X_1 ≠ a, X_2 = z).
Continuing,
ν(z) = ν(a) Σ_{m=1}^n P^a(X_1 ≠ a, X_2 ≠ a, …, X_{m−1} ≠ a, X_m = z) + P^ν(X_0 ≠ a, X_1 ≠ a, …, X_{n−1} ≠ a, X_n = z)
≥ ν(a) Σ_{m=1}^n P^a(X_1 ≠ a, X_2 ≠ a, …, X_{m−1} ≠ a, X_m = z).
Letting n → ∞, we obtain
ν(z) ≥ ν(a) µ_a(z).
We have
ν(a) = Σ_x ν(x) p_n(x, a) ≥ ν(a) Σ_x µ_a(x) p_n(x, a) = ν(a) µ_a(a) = ν(a),
since µ_a(a) = 1 (see the proof of Proposition 25.1). This means that we have equality, and so
ν(x) = ν(a) µ_a(x)
whenever p_n(x, a) > 0. Since r(x, a) > 0, this happens for some n. Consequently
ν(x)/ν(a) = µ_a(x).
Proposition 25.3 If a stationary distribution exists, then µ(y) > 0 implies y is recurrent.
Proof. If µ(y) > 0, then
∞ = Σ_{n=1}^∞ µ(y) = Σ_{n=1}^∞ Σ_x µ(x) p_n(x, y) = Σ_x µ(x) Σ_{n=1}^∞ p_n(x, y)
= Σ_x µ(x) Σ_{n=1}^∞ P^x(X_n = y) = Σ_x µ(x) E^x N(y)
= Σ_x µ(x) r(x, y)[1 + r(y, y) + r(y, y)² + ···].
Since r(x, y) ≤ 1 and µ is a probability measure, this is less than or equal to
Σ_x µ(x)(1 + r(y, y) + ···) ≤ 1 + r(y, y) + ···.
Hence r(y, y) must equal 1.
Proposition 25.4 If the Markov chain is irreducible and has stationary distribution µ, then
µ(x) = 1/(E^x T_x).
Proof. We have µ(x) > 0 for some x. If y ∈ S, then r(x, y) > 0 and so p_n(x, y) > 0 for some n. Hence
µ(y) = Σ_x µ(x) p_n(x, y) > 0.
Hence by Proposition 25.3, all states are recurrent. By the uniqueness of the stationary measure (Proposition 25.2), µ_x is a constant multiple of µ, i.e., µ_x = cµ. Recall
µ_x(y) = Σ_{n=0}^∞ P^x(X_n = y, T_x > n),
and so
Σ_y µ_x(y) = Σ_y Σ_{n=0}^∞ P^x(X_n = y, T_x > n) = Σ_n Σ_y P^x(X_n = y, T_x > n) = Σ_n P^x(T_x > n) = E^x T_x.
Thus c = E^x T_x. Recalling that µ_x(x) = 1,
µ(x) = µ_x(x)/c = 1/(E^x T_x).
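Proposition 25.4 can be illustrated by simulation; the sketch below, not in the notes, estimates the mean return time to the middle state of an Ehrenfest urn and compares it with 1/µ(x) for the binomial stationary distribution. The parameters are arbitrary.

    import numpy as np
    from math import comb

    rng = np.random.default_rng(2)
    r, x0 = 10, 5
    mu_x0 = comb(r, x0) / 2**r

    k, t_last, returns, total = x0, 0, 0, 0
    for t in range(1, 500_000):
        k = k + 1 if rng.random() < (r - k) / r else k - 1
        if k == x0:
            returns += 1
            total += t - t_last
            t_last = t
    print("empirical mean return time:", total / returns, " 1/mu(x):", 1 / mu_x0)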
We make the following distinction for recurrent states. If E^x T_x < ∞, then x is said to be positive recurrent. If x is recurrent but E^x T_x = ∞, x is null recurrent.
Proposition 25.5 Suppose a chain is irreducible.
(a) If there exists a positive recurrent state, then there is a stationary distribution.
(b) If there is a stationary distribution, all states are positive recurrent.
(c) If there exists a transient state, all states are transient.
(d) If there exists a null recurrent state, all states are null recurrent.
Proof. To show (a), if x is positive recurrent, then there exists a stationary measure µ_x with µ_x(x) = 1. Then µ(y) = µ_x(y)/E^x T_x will be a stationary distribution.
For (b), suppose µ(x) > 0 for some x. We showed this implies µ(y) > 0 for all y. Then 0 < µ(y) = 1/E^y T_y, which implies E^y T_y < ∞.
We showed that if x is recurrent and r(x, y) > 0, then y is recurrent. So (c) follows.
Suppose there exists a null recurrent state. If there exists a positive recurrent or transient state as well, then by (a) and (b) or by (c) all states are positive recurrent or transient, a contradiction, and (d) follows.
26 Convergence
Our goal is to show that under certain conditions p_n(x, y) → π(y), where π is the stationary distribution. (In the null recurrent case p_n(x, y) → 0.)
Consider a random walk on the set {0, 1}, where with probability one on each step the chain moves to the other state. Then p_n(x, y) = 0 if x ≠ y and n is even. A less trivial case is the simple random walk on the integers. We need to eliminate this periodicity.
Suppose x is recurrent, let I_x = {n ≥ 1 : p_n(x, x) > 0}, and let d_x be the g.c.d. (greatest common divisor) of I_x.
Proposition 26.1 If r(x, y) > 0, then d_y = d_x.
Proof. Since x is recurrent, r(y, x) > 0. Choose K and L such that p_K(x, y) > 0 and p_L(y, x) > 0. Then
p_{K+L+n}(y, y) ≥ p_L(y, x) p_n(x, x) p_K(x, y),
so taking n = 0, we have p_{K+L}(y, y) > 0, so d_y divides K + L. Hence d_y divides K + L + n whenever p_n(x, x) > 0, and therefore d_y divides n whenever p_n(x, x) > 0; that is, d_y divides every element of I_x. Hence d_y divides d_x. By symmetry d_x divides d_y.
Proposition 26.2 If d_x = 1, there exists m_0 such that p_m(x, x) > 0 whenever m ≥ m_0.
Proof. First of all, I_x is closed under addition: if m, n ∈ I_x, then
p_{m+n}(x, x) ≥ p_m(x, x) p_n(x, x) > 0.
Secondly, if there exists N such that N, N + 1 ∈ I_x, let m_0 = N². If m ≥ m_0, then m − N² = kN + r for some 0 ≤ r < N and
m = r + N² + kN = r(N + 1) + (N − r + k)N ∈ I_x.
Third, pick n_0 ∈ I_x and k > 0 such that n_0 + k ∈ I_x. If k = 1, we are done. Since d_x = 1, there exists n_1 ∈ I_x such that k does not divide n_1. We have n_1 = mk + r for some 0 < r < k. Note (m + 1)(n_0 + k) ∈ I_x and (m + 1)n_0 + n_1 ∈ I_x. The difference between these two numbers is (m + 1)k − n_1 = k − r < k. So now we have two numbers in I_x differing by at most k − 1. Repeating at most k times, we get two numbers in I_x differing by at most 1, and we are done.
We write d for d_x. A chain is aperiodic if d = 1.
If d > 1, we say x ∼ y if p_{kd}(x, y) > 0 for some k > 0. We divide S into equivalence classes S_1, …, S_d. Every d steps the chain started in S_i is back in S_i. So we look at p′ = p_d on S_i.
Theorem 26.3 Suppose the chain is irreducible, aperiodic, and has a stationary distribution π. Then p_n(x, y) → π(y) as n → ∞.
Proof. The idea is to take two copies of the chain with different starting distributions, let them run independently until they couple, i.e., hit each other, and then have them move together. So define
q((x_1, y_1), (x_2, y_2)) = p(x_1, x_2) p(y_1, y_2) if x_1 ≠ y_1; p(x_1, x_2) if x_1 = y_1 and x_2 = y_2; and 0 otherwise.
Let Z_n = (X_n, Y_n) and T = min{i : X_i = Y_i}. We have
P(X_n = y) = P(X_n = y, T ≤ n) + P(X_n = y, T > n) = P(Y_n = y, T ≤ n) + P(X_n = y, T > n),
while
P(Y_n = y) = P(Y_n = y, T ≤ n) + P(Y_n = y, T > n).
Subtracting,
P(X_n = y) − P(Y_n = y) ≤ P(X_n = y, T > n) − P(Y_n = y, T > n) ≤ P(T > n).
Using symmetry,
|P(X_n = y) − P(Y_n = y)| ≤ P(T > n).
Suppose we let Y_0 have distribution π and X_0 = x. Then
|p_n(x, y) − π(y)| ≤ P(T > n).
It remains to show P(T > n) → 0. To do this, consider another chain Z′_n = (X_n, Y_n), where now we take X_n, Y_n independent. Define
r((x_1, y_1), (x_2, y_2)) = p(x_1, x_2) p(y_1, y_2).
The chain under the transition probabilities r is irreducible. To see this, there exist K and L such that p_K(x_1, x_2) > 0 and p_L(y_1, y_2) > 0. If M is large, p_{L+M}(x_2, x_2) > 0 and p_{K+M}(y_2, y_2) > 0. So p_{K+L+M}(x_1, x_2) > 0 and p_{K+L+M}(y_1, y_2) > 0, and hence we have r_{K+L+M}((x_1, y_1), (x_2, y_2)) > 0.
It is easy to check that π′(a, b) = π(a)π(b) is a stationary distribution for Z′. Hence Z′_n is recurrent, and hence it will hit (x, x); hence the time to hit the diagonal {(y, y) : y ∈ S} is finite. However, the distribution of the time to hit the diagonal is the same as that of T.
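A quick numerical illustration of Theorem 26.3, not in the notes: for a made-up irreducible, aperiodic 3-state matrix (the entries below are arbitrary), every row of P^n converges to the stationary distribution π.

    import numpy as np

    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.3, 0.3, 0.4]])

    # stationary distribution: left eigenvector for eigenvalue 1, normalized to sum 1
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    pi = pi / pi.sum()

    print("pi:         ", np.round(pi, 4))
    print("row of P^50:", np.round(np.linalg.matrix_power(P, 50)[0], 4))  # same as pi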
27 Gaussian sequences
We first prove a converse to Proposition 17.3.
Proposition 27.1 If E e^{i(uX+vY)} = E e^{iuX} E e^{ivY} for all u and v, then X and Y are independent random variables.
Proof. Let X′ be a random variable with the same law as X, Y′ one with the same law as Y, and X′, Y′ independent. (We let Ω = [0,1]², P Lebesgue measure, X′ a function of the first variable, and Y′ a function of the second variable, defined as in Proposition 1.2.) Then E e^{i(uX′+vY′)} = E e^{iuX′} E e^{ivY′}. Since X, X′ have the same law, they have the same characteristic function, and similarly for Y, Y′. Therefore (X′, Y′) has the same joint characteristic function as (X, Y). By the uniqueness of the Fourier transform, (X′, Y′) has the same joint law as (X, Y), which is easily seen to imply that X and Y are independent.
A sequence of random variables X_1, …, X_n is said to be jointly normal if there exists a sequence of independent standard normal random variables Z_1, …, Z_m and constants b_{ij} and a_i such that X_i = Σ_{j=1}^m b_{ij} Z_j + a_i, i = 1, …, n. In matrix notation, X = BZ + A. For simplicity, in what follows let us take A = 0; the modifications for the general case are easy. The covariance of two random variables X and Y is defined to be E[(X − EX)(Y − EY)]. Since we are assuming our normal random variables are mean 0, we can omit the centering at expectations. Given a sequence of mean 0 random variables, we can talk about the covariance matrix, which is Cov(X) = E XX^t, where X^t denotes the transpose of the vector X. In the above case, we see Cov(X) = E[(BZ)(BZ)^t] = E[B ZZ^t B^t] = BB^t, since E ZZ^t = I, the identity.
Let us compute the joint characteristic function E e^{iu^tX} of the vector X, where u is an n-dimensional vector. First, if v is an m-dimensional vector,
E e^{iv^tZ} = E ∏_{j=1}^m e^{iv_jZ_j} = ∏_{j=1}^m E e^{iv_jZ_j} = ∏_{j=1}^m e^{−v_j²/2} = e^{−v^tv/2},
using the independence of the Z's. So
E e^{iu^tX} = E e^{iu^tBZ} = e^{−u^tBB^tu/2}.
Proposition 27.2 If the X_i are jointly normal and Cov(X_i, X_j) = 0 for i ≠ j, then the X_i are independent.
Proof. If Cov(X) = BB^t is a diagonal matrix, then the joint characteristic function of the X's factors, and so by Proposition 27.1, the X's would in this case be independent.
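The identity Cov(X) = BB^t is easy to check by simulation; the sketch below, not part of the notes, uses an arbitrary 2×3 matrix B and compares the empirical covariance of X = BZ with BB^t.

    import numpy as np

    rng = np.random.default_rng(3)
    B = np.array([[1.0, 0.5, 0.0],
                  [0.0, 2.0, 1.0]])         # n = 2, m = 3

    Z = rng.standard_normal((3, 200_000))   # columns are independent samples of Z
    X = B @ Z                               # columns are samples of X = BZ
    print("B B^t:")
    print(B @ B.T)
    print("empirical covariance of X:")
    print(np.cov(X))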
28 Stationary processes
In this section we give some preliminaries which will be used in the next section on the ergodic theorem. We say a sequence X_i is stationary if (X_k, X_{k+1}, …) has the same distribution as (X_0, X_1, …).
One example is if the X_i are i.i.d. For readers who are familiar with Markov chains, another is if X_i is a Markov chain, π is the stationary distribution, and X_0 has distribution π.
A third example is rotations of a circle. Let Ω be the unit circle, P normalized Lebesgue measure on Ω, and θ ∈ [0, 2π). We let X_0(ω) = ω and set X_n(ω) = ω + nθ (mod 2π).
A fourth example is the Bernoulli shift: let Ω = [0, 1), P Lebesgue measure, X_0(ω) = ω, and X_n(ω) be the binary expansion of ω from the nth place on.
Proposition 28.1 If X_n is stationary, then Y_k = g(X_k, X_{k+1}, …) is stationary.
Proof. If B ⊂ R^∞, let
A = {x = (x_0, x_1, …) : (g(x_0, x_1, …), g(x_1, x_2, …), …) ∈ B}.
Then
P((Y_0, Y_1, …) ∈ B) = P((X_0, X_1, …) ∈ A) = P((X_k, X_{k+1}, …) ∈ A) = P((Y_k, Y_{k+1}, …) ∈ B).
We say that T : Ω → Ω is measure preserving if P(T^{−1}A) = P(A) for all A ∈ F.
There is a one-to-one correspondence between measure preserving transformations and stationary sequences. Given T, let X_0(ω) = ω and X_n(ω) = T^nω. Then
P((X_k, X_{k+1}, …) ∈ A) = P(T^{−k}((X_0, X_1, …) ∈ A)) = P((X_0, X_1, …) ∈ A).
On the other hand, if X_k is stationary, define Ω̂ = R^∞, and define X̂_k(ω) = ω_k, where ω = (ω_0, ω_1, …). Define P̂ on Ω̂ so that the law of X̂ under P̂ is the same as the law of X under P. Then define Tω = (ω_1, ω_2, …). We see that
P̂(A) = P̂((ω_0, ω_1, …) ∈ A) = P̂((X̂_0, X̂_1, …) ∈ A) = P((X_0, X_1, …) ∈ A) = P((X_1, X_2, …) ∈ A) = P̂((X̂_1, X̂_2, …) ∈ A) = P̂((ω_1, ω_2, …) ∈ A) = P̂(Tω ∈ A) = P̂(T^{−1}A).
We say a set A is invariant if T^{−1}A = A (up to a null set, that is, the symmetric difference has probability zero). The invariant σ-field I is the collection of invariant sets. A measure preserving transformation is ergodic if the invariant σ-field is trivial.
In the case of an i.i.d. sequence, A invariant means A = T^{−n}A ∈ σ(X_n, X_{n+1}, …) for each n. Hence each invariant set is in the tail σ-field, and by the Kolmogorov 0-1 law, T is ergodic.
For the rotation of the circle, suppose θ is not a rational multiple of π. To see that T is ergodic in this case, recall that if f is measurable and bounded, then f is the L² limit of Σ_{k=−K}^{K} c_k e^{ikx}, where the c_k are the Fourier coefficients. So
f(T^n x) = Σ c_k e^{ikx + iknθ} = Σ d_k e^{ikx},
where d_k = c_k e^{iknθ}. If f(T^n x) = f(x) a.e., then c_k = d_k, or c_k e^{iknθ} = c_k. But θ is not a rational multiple of π, so e^{iknθ} ≠ 1 for k ≠ 0, so c_k = 0 for k ≠ 0. Therefore f is constant a.e. If we take f = 1_A for an invariant set A, this says that either A is empty or A is the whole space, up to sets of measure zero.
Our last example was the Bernoulli shift. Let X_i be i.i.d. with P(X_i = 1) = P(X_i = 0) = 1/2. Let Y_n = Σ_{m=0}^∞ 2^{−(m+1)} X_{n+m}. So there exists g such that Y_n = g(X_n, X_{n+1}, …). If A is invariant for the Bernoulli shift,
A = ((Y_n, Y_{n+1}, …) ∈ B) = ((X_n, X_{n+1}, …) ∈ C),
where C = {x : (g(x_0, x_1, …), g(x_1, x_2, …), …) ∈ B}. This is true for all n, so A is in the tail σ-field for the X_i's, which is trivial. Therefore T is ergodic.
29 The ergodic theorem
The key to the ergodic theorem is the following maximal lemma
Lemma 29.1 Let X be integrable. Let T be a measure preserving transformation, let X_j(ω) = X(T^jω), let S_k(ω) = X_0(ω) + ··· + X_{k−1}(ω), and M_k(ω) = max(0, S_1(ω), …, S_k(ω)). Then E[X; M_k > 0] ≥ 0.
Proof. If j ≤ k, M_k(Tω) ≥ S_j(Tω), so X(ω) + M_k(Tω) ≥ X(ω) + S_j(Tω) = S_{j+1}(ω), or
X(ω) ≥ S_{j+1}(ω) − M_k(Tω), j = 1, …, k.
Since S_1(ω) = X(ω) and M_k(Tω) ≥ 0, then
X(ω) ≥ S_1(ω) − M_k(Tω).
Therefore
E[X(ω); M_k > 0] ≥ ∫_{(M_k>0)} [max(S_1, …, S_k)(ω) − M_k(Tω)] = ∫_{(M_k>0)} [M_k(ω) − M_k(Tω)].
On the set (M_k = 0) we have M_k(ω) − M_k(Tω) = −M_k(Tω) ≤ 0. Hence
E[X(ω); M_k > 0] ≥ ∫ [M_k(ω) − M_k(Tω)].
Since T is measure preserving, E M_k(ω) − E M_k(Tω) = 0, which completes the proof.
Recall I is the invariant σ-field. The ergodic theorem says the following.
Theorem 29.2 Let T be measure preserving and X integrable. Then
(1/n) Σ_{m=0}^{n−1} X(T^mω) → E[X | I],
where the convergence takes place almost surely and in L¹.
Proof. By considering X − E[X | I], we may suppose E[X | I] = 0. Let X_j, S_n, and M_n be as in Lemma 29.1, let ε > 0, and let D = (lim sup_n S_n/n > ε); we want to show P(D) = 0.
Let δ > 0. Since X is integrable, Σ_n P(|X_n(ω)| > δn) = Σ_n P(|X| > δn) < ∞ (cf. the proof of Proposition 5.1). By Borel-Cantelli, |X_n|/n will eventually be less than δ. Since δ is arbitrary, |X_n|/n → 0 a.s. Since
(S_n/n)(Tω) − (S_n/n)(ω) = X_n(ω)/n − X_0(ω)/n → 0,
then lim sup (S_n/n)(Tω) = lim sup (S_n/n)(ω), and so D ∈ I. Let X*(ω) = (X(ω) − ε)1_D(ω), and define S_n* and M_n* analogously to the definitions of S_n and M_n. On D, lim sup (S_n/n) > ε, hence lim sup (S_n*/n) > 0.
Let F = ∪_n (M_n* > 0). Note ∪_{i=0}^n (M_i* > 0) = (M_n* > 0). Also |X*| ≤ |X| + ε is integrable. By Lemma 29.1, E[X*; M_n* > 0] ≥ 0. By dominated convergence, E[X*; F] ≥ 0.
We claim D = F, up to null sets. To see this, if lim sup (S_n*/n) > 0, then ω ∈ ∪_n (M_n* > 0). Hence D ⊂ F. On the other hand, if ω ∈ F, then M_n* > 0 for some n, so X_m* ≠ 0 for some m. By the definition of X*, for some m, T^mω ∈ D, and since D is invariant, ω ∈ D a.s.
Recall D ∈ I. Then
0 ≤ E[X*; D] = E[X − ε; D] = E[E[X | I]; D] − εP(D) = −εP(D),
using the fact that E[X | I] = 0. We conclude P(D) = 0, as desired.
Since we have this for every ε, then lim sup S_n/n ≤ 0. By applying the same argument to −X, we obtain lim inf S_n/n ≥ 0, and we have proved the almost sure result. Let us now turn to the L¹ convergence.
Let M > 0, X′_M = X 1_{(|X|≤M)}, and X″_M = X − X′_M. By the almost sure result,
(1/n) Σ_{m=0}^{n−1} X′_M(T^mω) → E[X′_M | I]
almost surely. Both sides are bounded by M, so
E| (1/n) Σ_{m=0}^{n−1} X′_M(T^mω) − E[X′_M | I] | → 0.   (29.1)
Let ε > 0 and choose M large so that E|X″_M| < ε; this is possible by dominated convergence. We have
E| (1/n) Σ_{m=0}^{n−1} X″_M(T^mω) | ≤ (1/n) Σ_{m=0}^{n−1} E|X″_M(T^mω)| = E|X″_M| ≤ ε
and
E| E[X″_M | I] | ≤ E[ E[|X″_M| | I] ] = E|X″_M| ≤ ε.
So combining with (29.1),
lim sup_n E| (1/n) Σ_{m=0}^{n−1} X(T^mω) − E[X | I] | ≤ 2ε.
This shows the L¹ convergence.
What does the ergodic theorem tell us about our examples? In the case of i.i.d. random variables, we see S_n/n → EX almost surely and in L¹, since E[X | I] = EX. Thus this gives another proof of the SLLN. For rotations of the circle with X(ω) = 1_A(ω) and θ an irrational multiple of π, E[X | I] = EX = P(A), the normalized Lebesgue measure of A. So the ergodic theorem says that (1/n) Σ_{m=0}^{n−1} 1_A(ω + mθ), the average number of times ω + mθ is in A, converges for almost every ω to the normalized Lebesgue measure of A.
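As a final numerical illustration, not in the original notes, the ergodic average for the irrational rotation can be computed directly; with A = [0, 1) ⊂ [0, 2π) and θ = 1 (not a rational multiple of π), the averages approach |A|/(2π) ≈ 0.1592. The starting point ω and the values of n are arbitrary.

    import numpy as np

    theta, omega = 1.0, 0.3
    a, b = 0.0, 1.0                # A = [a, b) as a subset of the circle [0, 2*pi)

    for n in (100, 10_000, 1_000_000):
        pts = (omega + np.arange(n) * theta) % (2 * np.pi)
        avg = np.mean((pts >= a) & (pts < b))
        print(n, round(float(avg), 5), round((b - a) / (2 * np.pi), 5))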