Probability Theory
Richard F. Bass
These notes are © 1998 by Richard F. Bass. They may be used for personal or classroom purposes, but not for commercial purposes.
Revised 2001.
1 Basic notions
A probability or probability measure is a measure whose total mass is one. Because the origins of probability are in statistics rather than analysis, some of the terminology is different. For example, instead of denoting a measure space by (X, A, µ), probabilists use (Ω, F, P). So here Ω is a set, F is called a σ-field (which is the same thing as a σ-algebra), and P is a measure with P(Ω) = 1. Elements of F are called events. Elements of Ω are denoted ω.
Instead of saying a property occurs almost everywhere, we talk about properties occurring almost surely, written a.s. Real-valued measurable functions from Ω to R are called random variables and are usually denoted by X or Y or other capital letters. We often abbreviate "random variable" by r.v.
We let A^c = {ω ∈ Ω : ω ∉ A} (called the complement of A) and B − A = B ∩ A^c.
Integration (in the sense of Lebesgue) is called expectation or expected value, and we write EX for ∫ X dP. The notation E[X; A] is often used for ∫_A X dP.
The random variable 1_A is the function that is one if ω ∈ A and zero otherwise. It is called the indicator of A (the name characteristic function in probability refers to the Fourier transform). Events such as (ω : X(ω) > a) are almost always abbreviated by (X > a).
Given a random variable X, we can define a probability on R by

P_X(A) = P(X ∈ A),   A ⊂ R.   (1.1)

The probability P_X is called the law of X or the distribution of X. We define F_X : R → [0, 1] by

F_X(x) = P_X((−∞, x]) = P(X ≤ x).   (1.2)

The function F_X is called the distribution function of X.
As an example, let Ω = {H, T}, F all subsets of Ω (there are 4 of them), and P(H) = P(T) = 1/2. Let X(H) = 1 and X(T) = 0. Then P_X = (1/2)δ_0 + (1/2)δ_1, where δ_x is point mass at x, that is, δ_x(A) = 1 if x ∈ A and 0 otherwise. F_X(a) = 0 if a < 0, 1/2 if 0 ≤ a < 1, and 1 if a ≥ 1.
Proposition 1.1. The distribution function F_X of a random variable X satisfies:
(a) F_X is nondecreasing;
(b) F_X is right continuous with left limits;
(c) lim_{x→∞} F_X(x) = 1 and lim_{x→−∞} F_X(x) = 0.

Proof. We prove the first part of (b) and leave the others to the reader. If x_n ↓ x, then (X ≤ x_n) ↓ (X ≤ x), and so P(X ≤ x_n) ↓ P(X ≤ x) since P is a measure.
Note that if x_n ↑ x, then (X ≤ x_n) ↑ (X < x), and so F_X(x_n) ↑ P(X < x).
Proposition 1.2. Suppose F is a distribution function. There exists a random variable X such that F = F_X.

Proof. Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Define X(ω) = sup{x : F(x) < ω}. It is routine to check that F_X = F.

In the above proof, essentially X = F^{−1}. However F may have jumps or be constant over some intervals, so some care is needed in defining X.
Certain distributions or laws are very common. We list some of them.
(a) Bernoulli. A random variable is Bernoulli if P(X = 1) = p, P(X = 0) = 1 − p for some p ∈ [0, 1].
(b) Binomial. This is defined by P(X = k) = (n choose k) p^k (1 − p)^{n−k}, where n is a positive integer, 0 ≤ k ≤ n, and p ∈ [0, 1].
(c) Geometric. For p ∈ (0, 1) we set P(X = k) = (1 − p)p^k. Here k is a nonnegative integer.
(d) Poisson. For λ > 0 we set P(X = k) = e^{−λ} λ^k/k!. Again k is a nonnegative integer.
(e) Uniform. For some positive integer n, set P(X = k) = 1/n for 1 ≤ k ≤ n.
If F is absolutely continuous, we call f = F′ the density of F. Some examples of distributions characterized by densities are the following.
(f) Uniform on [a, b]. Define f(x) = (b − a)^{−1} 1_{[a,b]}(x). This means that if X has a uniform distribution, then

P(X ∈ A) = ∫_A (1/(b − a)) 1_{[a,b]}(x) dx.

(g) Exponential. For x > 0 let f(x) = λe^{−λx}.
(h) Standard normal. Define f(x) = (1/√(2π)) e^{−x²/2}. So

P(X ∈ A) = (1/√(2π)) ∫_A e^{−x²/2} dx.

(i) N(µ, σ²). We shall see later that a standard normal has mean zero and variance one. If Z is a standard normal, then a N(µ, σ²) random variable has the same distribution as µ + σZ. It is an exercise in calculus to check that such a random variable has density

(1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}.   (1.3)

(j) Cauchy. Here

f(x) = 1/(π(1 + x²)).
We can use the law of a random variable to calculate expectations.

Proposition 1.3. If g is bounded or nonnegative, then

E g(X) = ∫ g(x) P_X(dx).

Proof. If g is the indicator of an event A, this is just the definition of P_X. By linearity, the result holds for simple functions; by monotone convergence, for nonnegative g; and by linearity again, for bounded g.
If F_X has a density f, then P_X(dx) = f(x) dx. So, for example, EX = ∫ x f(x) dx and EX² = ∫ x² f(x) dx. (We need E|X| finite to justify this if X is not necessarily nonnegative.)
We define the mean of a random variable to be its expectation, and the variance of a random variable is defined by

Var X = E(X − EX)².

For example, it is routine to see that the mean of a standard normal is zero and its variance is one. Note

Var X = E(X² − 2X EX + (EX)²) = EX² − (EX)².
Another equality that is useful is the following.

Proposition 1.4. If X ≥ 0 a.s. and p > 0, then

EX^p = ∫_0^∞ p λ^{p−1} P(X > λ) dλ.

The proof will show that this equality is also valid if we replace P(X > λ) by P(X ≥ λ).

Proof. Use Fubini's theorem and write

∫_0^∞ p λ^{p−1} P(X > λ) dλ = E ∫_0^∞ p λ^{p−1} 1_{(λ,∞)}(X) dλ = E ∫_0^X p λ^{p−1} dλ = EX^p.
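As a quick numerical illustration of Proposition 1.4 (this example is an addition, not part of the original notes), one can compare both sides of the identity for an exponential random variable: the left side estimated by Monte Carlo, the right side by numerical integration of p λ^{p−1} P(X > λ). The rate, exponent, sample size, and grid below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
p, lam = 2.0, 1.5                       # exponent p and exponential rate (arbitrary)
x = rng.exponential(scale=1/lam, size=200_000)

lhs = np.mean(x**p)                     # Monte Carlo estimate of E[X^p]

lams = np.linspace(0, 20, 200_001)[1:]  # grid for the lambda integral
tail = np.exp(-lam * lams)              # P(X > lambda) for Exponential(lam)
dlam = lams[1] - lams[0]
rhs = np.sum(p * lams**(p - 1) * tail) * dlam   # Riemann sum for the right-hand side

print(lhs, rhs)                         # both should be close to 2/lam**2

Both numbers should agree to within Monte Carlo error, matching the exact value EX² = 2/λ² for this distribution.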
We need two elementary inequalities.

Proposition 1.5. (Chebyshev's inequality) If X ≥ 0,

P(X ≥ a) ≤ EX/a.

Proof. We write

P(X ≥ a) = E[1_{[a,∞)}(X)] ≤ E[(X/a) 1_{[a,∞)}(X)] ≤ EX/a,

since X/a is at least 1 when X ∈ [a, ∞).
If we apply this to X = (Y − EY)², we obtain

P(|Y − EY| ≥ a) = P((Y − EY)² ≥ a²) ≤ Var Y/a².   (1.4)

This special case of Chebyshev's inequality is sometimes itself referred to as Chebyshev's inequality, while Proposition 1.5 is sometimes called the Markov inequality.
Proposition 1.6. Suppose g is convex and X and g(X) are both integrable. Then

g(EX) ≤ E g(X).

Proof. One property of convex functions is that they lie above their tangent lines, and more generally their support lines. So if x_0 ∈ R, we have

g(x) ≥ g(x_0) + c(x − x_0)

for some constant c. Take x = X(ω) and take expectations to obtain

E g(X) ≥ g(x_0) + c(EX − x_0).

Now set x_0 equal to EX.
If A_n is a sequence of sets, define (A_n i.o.), read "A_n infinitely often," by

(A_n i.o.) = ∩_{n=1}^∞ ∪_{i=n}^∞ A_i.

This set consists of those ω that are in infinitely many of the A_n.
A simple but very important proposition is the Borel-Cantelli lemma. It has two parts, and we prove the first part here, leaving the second part to the next section.

Proposition 1.7. (Borel-Cantelli lemma) If Σ_n P(A_n) < ∞, then P(A_n i.o.) = 0.

Proof. We have

P(A_n i.o.) = lim_{n→∞} P(∪_{i=n}^∞ A_i).

However,

P(∪_{i=n}^∞ A_i) ≤ Σ_{i=n}^∞ P(A_i),

which tends to zero as n → ∞.
2 Independence
Let us say two events A and B are independent if P(A ∩ B) = P(A)P(B). The events A_1, ..., A_n are independent if

P(A_{i_1} ∩ A_{i_2} ∩ ··· ∩ A_{i_j}) = P(A_{i_1}) P(A_{i_2}) ··· P(A_{i_j})

for every subset {i_1, ..., i_j} of {1, 2, ..., n}.
Proposition 2.1. If A and B are independent, then A^c and B are independent.

Proof. We write

P(A^c ∩ B) = P(B) − P(A ∩ B) = P(B) − P(A)P(B) = P(B)(1 − P(A)) = P(B)P(A^c).
Two σ-fields F and G are independent if A and B are independent whenever A ∈ F and B ∈ G. Two random variables X and Y are independent if the σ-field generated by X and the σ-field generated by Y are independent. (The σ-field generated by X is the collection of events {(X ∈ A) : A a Borel subset of R}.) We define the independence of n σ-fields or n random variables in the obvious way.
Proposition 2.1 tells us that A and B are independent if the random variables 1_A and 1_B are independent, so the definitions above are consistent.
If f and g are Borel functions and X and Y are independent, then f(X) and g(Y) are independent. This follows because the σ-field generated by f(X) is a sub-σ-field of the one generated by X, and similarly for g(Y).
Let F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) denote the joint distribution function of X and Y. (The comma inside the set means "and.")

Proposition 2.2. F_{X,Y}(x, y) = F_X(x)F_Y(y) for all x and y if and only if X and Y are independent.
Proof. If X and Y are independent, then 1_{(−∞,x]}(X) and 1_{(−∞,y]}(Y) are independent by the above comments. Using the above comments and the definition of independence, this shows F_{X,Y}(x, y) = F_X(x)F_Y(y).
Conversely, if the equality holds, fix y and let M_y denote the collection of sets A for which P(X ∈ A, Y ≤ y) = P(X ∈ A)P(Y ≤ y). M_y contains all sets of the form (−∞, x]. It follows by linearity that M_y contains all sets of the form (x, z], and then by linearity again, all sets that are finite unions of such half-open, half-closed intervals. Note that the collection of finite unions of such intervals, A, is an algebra generating the Borel σ-field. It is clear that M_y is a monotone class, so by the monotone class lemma, M_y contains the Borel σ-field.
For a fixed set A, let M_A denote the collection of sets B for which P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B). Again, M_A is a monotone class and by the preceding paragraph contains the σ-field generated by the collection of finite unions of intervals of the form (x, z], hence contains the Borel sets. Therefore X and Y are independent.
The following is known as the multiplication theorem.

Proposition 2.3. If X, Y, and XY are integrable and X and Y are independent, then EXY = EX EY.

Proof. Consider the collection of random variables in σ(X) (the σ-field generated by X) and σ(Y) for which the multiplication theorem is true. It holds for indicators by the definition of X and Y being independent. It holds for simple random variables, that is, linear combinations of indicators, by linearity of both sides. It holds for nonnegative random variables by monotone convergence. And it holds for integrable random variables by linearity again.

Let us give an example of independent random variables. Let Ω = Ω_1 × Ω_2 and let P = P_1 × P_2, where (Ω_i, F_i, P_i) are probability spaces for i = 1, 2. We use the product σ-field. Then it is clear that F_1 and F_2 are independent by the definition of P. If X_1 is a random variable such that X_1(ω_1, ω_2) depends only on ω_1 and X_2 depends only on ω_2, then X_1 and X_2 are independent.
This example can be extended to n independent random variables, and in fact, if one has independent random variables, one can always view them as coming from a product space. We will not use this fact. Later on, we will talk about countable sequences of independent r.v.s, and the reader may wonder whether such things can exist. That they can is a consequence of the Kolmogorov extension theorem; see PTA, for example.
If X_1, ..., X_n are independent, then so are X_1 − EX_1, ..., X_n − EX_n. Assuming everything is integrable, we can expand

Var(X_1 + ··· + X_n) = E[((X_1 − EX_1) + ··· + (X_n − EX_n))²],

using the multiplication theorem to show that the expectations of the cross product terms are zero. We have thus shown

Var(X_1 + ··· + X_n) = Var X_1 + ··· + Var X_n.   (2.1)
We finish up this section by proving the second half of the Borel-Cantelli lemma.

Proposition 2.4. Suppose A_n is a sequence of independent events. If Σ_n P(A_n) = ∞, then P(A_n i.o.) = 1.

Note that here the A_n are independent, while in the first half of the Borel-Cantelli lemma no such assumption was necessary.

Proof. Note

P(∪_{i=n}^N A_i) = 1 − P(∩_{i=n}^N A_i^c) = 1 − Π_{i=n}^N P(A_i^c) = 1 − Π_{i=n}^N (1 − P(A_i)).

By the mean value theorem, 1 − x ≤ e^{−x}, so we have that the right hand side is greater than or equal to 1 − exp(−Σ_{i=n}^N P(A_i)). As N → ∞, this tends to 1, so P(∪_{i=n}^∞ A_i) = 1. This holds for all n, which proves the result.
3 Convergence
In this section we consider three ways a sequence of r.v.s X_n can converge.
We say X_n converges to X almost surely if (X_n ↛ X) has probability zero. X_n converges to X in probability if for each ε > 0, P(|X_n − X| > ε) → 0 as n → ∞. X_n converges to X in L^p if E|X_n − X|^p → 0 as n → ∞.
The following proposition shows some relationships among the types of convergence.

Proposition 3.1. (a) If X_n → X a.s., then X_n → X in probability.
(b) If X_n → X in L^p, then X_n → X in probability.
(c) If X_n → X in probability, there exists a subsequence n_j such that X_{n_j} converges to X almost surely.

Proof. To prove (a), note X_n − X tends to 0 almost surely, so 1_{(−ε,ε)^c}(X_n − X) also converges to 0 almost surely. Now apply the dominated convergence theorem.
(b) comes from Chebyshev's inequality:

P(|X_n − X| > ε) = P(|X_n − X|^p > ε^p) ≤ E|X_n − X|^p/ε^p → 0

as n → ∞.
To prove (c), choose n_j larger than n_{j−1} such that P(|X_n − X| > 2^{−j}) < 2^{−j} whenever n ≥ n_j. So if we let A_i = (|X_{n_j} − X| > 2^{−j} for some j ≥ i), then P(A_i) ≤ 2^{−i+1}. By the Borel-Cantelli lemma, P(A_i i.o.) = 0. This implies X_{n_j} → X on the complement of (A_i i.o.).
Let us give some examples to show there need not be any other implications among the three types of convergence.
Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Let X_n = e^n 1_{(0,1/n)}. Then clearly X_n converges to 0 almost surely and in probability, but EX_n^p = e^{np}/n → ∞ for any p.
Let Ω be the unit circle, and let P be Lebesgue measure on the circle normalized to have total mass 1. Let t_n = Σ_{i=1}^n i^{−1}, and let A_n = {θ : t_{n−1} ≤ θ < t_n}. Let X_n = 1_{A_n}. Any point on the unit circle will be in infinitely many A_n, so X_n does not converge almost surely to 0. But P(A_n) = 1/(2πn) → 0, so X_n → 0 in probability.
4 Weak law of large numbers
Suppose X_n is a sequence of independent random variables. Suppose also that they all have the same distribution, that is, F_{X_n} = F_{X_1} for all n. This situation comes up so often it has a name: independent, identically distributed, which is abbreviated i.i.d.
Define S_n = Σ_{i=1}^n X_i. S_n is called a partial sum process. S_n/n is the average value of the first n of the X_i's.
Theorem 4.1. (Weak law of large numbers) Suppose the X_i are i.i.d. and EX_1² < ∞. Then S_n/n → EX_1 in probability.

Proof. Since the X_i are i.i.d., they all have the same expectation, and so ES_n = nEX_1. Hence E(S_n/n − EX_1)² is the variance of S_n/n. If ε > 0, by Chebyshev's inequality,

P(|S_n/n − EX_1| > ε) ≤ Var(S_n/n)/ε² = Σ_{i=1}^n Var X_i/(n²ε²) = n Var X_1/(n²ε²).   (4.1)

Since EX_1² < ∞, then Var X_1 < ∞, and the result follows by letting n → ∞.
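To see what (4.1) says in practice, the following small simulation (an illustration added here, not part of the original notes) estimates P(|S_n/n − EX_1| > ε) for uniform X_i and compares it with the Chebyshev bound; the distribution, ε, and sample counts are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
eps, trials = 0.1, 5000
for n in [10, 100, 1000, 10000]:
    # sample means of n Uniform(0,1) variables; EX_1 = 0.5, Var X_1 = 1/12
    means = rng.random((trials, n)).mean(axis=1)
    freq = np.mean(np.abs(means - 0.5) > eps)
    bound = (1 / 12) / (n * eps**2)     # right-hand side of (4.1)
    print(n, freq, bound)

The observed frequency and the bound both tend to zero, the bound typically being much cruder.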
A nice application of the weak law of large numbers is a proof of the Weierstrass approximation theorem.

Theorem 4.2. Suppose f is a continuous function on [0, 1] and ε > 0. There exists a polynomial P such that sup_{x∈[0,1]} |f(x) − P(x)| < ε.

Proof. Let

P(x) = Σ_{k=0}^n f(k/n) (n choose k) x^k (1 − x)^{n−k}.

Clearly P is a polynomial. Since f is continuous, there exists M such that |f(x)| ≤ M for all x and there exists δ such that |f(x) − f(y)| < ε/2 whenever |x − y| < δ.
Let X_i be i.i.d. Bernoulli r.v.s with parameter x. Then S_n, the partial sum, is a binomial, and hence P(x) = E f(S_n/n). The mean of S_n/n is x. We have

|P(x) − f(x)| = |E f(S_n/n) − f(EX_1)| ≤ E|f(S_n/n) − f(EX_1)| ≤ M P(|S_n/n − x| > δ) + ε/2.

By (4.1) the first term will be less than

M Var X_1/(nδ²) = M x(1 − x)/(nδ²) ≤ M/(4nδ²),

which will be less than ε/2 if n is large enough, uniformly in x.
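The polynomial P in this proof is the Bernstein polynomial of f, and it can be evaluated directly. The sketch below (added for illustration; the test function and degrees are arbitrary) computes the sup-norm error on a grid for a few values of n.

import numpy as np
from math import comb

def bernstein(f, n, xs):
    # P(x) = sum_k f(k/n) * C(n, k) * x^k * (1 - x)^(n - k)
    ks = np.arange(n + 1)
    coeffs = np.array([comb(n, k) for k in ks]) * f(ks / n)
    return np.array([np.sum(coeffs * x**ks * (1 - x)**(n - ks)) for x in xs])

f = lambda x: np.abs(x - 0.3) + np.sin(3 * x)   # an arbitrary continuous test function
xs = np.linspace(0, 1, 201)
for n in [10, 50, 200]:
    err = np.max(np.abs(bernstein(f, n, xs) - f(xs)))
    print(n, err)                               # sup-norm error shrinks as n grows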
5 Techniques related to almost sure convergence
Our aim is the strong law of large numbers (SLLN), which says that S_n/n converges to EX_1 almost surely if E|X_1| < ∞.
Proposition 5.1. Suppose X_i is an i.i.d. sequence with EX_i^4 < ∞ and let S_n = Σ_{i=1}^n X_i. Then S_n/n → EX_1 a.s.

Proof. By looking at X_i − EX_i we may assume that the X_i have mean 0. By Chebyshev,

P(|S_n/n| > ε) ≤ E(S_n/n)^4/ε^4 = ES_n^4/(n^4 ε^4).

If we expand S_n^4, we will have terms involving X_i^4, terms involving X_i² X_j², terms involving X_i³ X_j, terms involving X_i² X_j X_k, and terms involving X_i X_j X_k X_ℓ, with i, j, k, ℓ all being different. By the multiplication theorem and the fact that the X_i have mean 0, the expectations of all the terms will be 0 except for those of the first two types. So

ES_n^4 = Σ_{i=1}^n EX_i^4 + Σ_{i≠j} EX_i² EX_j².

By the finiteness assumption, the first term on the right is bounded by c_1 n. By Cauchy-Schwarz, EX_i² ≤ (EX_i^4)^{1/2} < ∞, and there are at most n² terms in the second sum, so this second term is bounded by c_2 n². Substituting, we have

P(|S_n/n| > ε) ≤ c_3/(n²ε^4).

Consequently P(|S_n/n| > ε i.o.) = 0 by Borel-Cantelli. Since ε is arbitrary, this implies S_n/n → 0 a.s.
Before we can prove the SLLN assuming only the finiteness of first moments, we need some preliminaries.

Proposition 5.2. If Y ≥ 0, then EY < ∞ if and only if Σ_n P(Y > n) < ∞.

Proof. By Proposition 1.4, EY = ∫_0^∞ P(Y > x) dx. P(Y > x) is nonincreasing in x, so the integral is bounded above by Σ_{n=0}^∞ P(Y > n) and bounded below by Σ_{n=1}^∞ P(Y > n).
If X_i is a sequence of r.v.s, the tail σ-field is defined by ∩_{n=1}^∞ σ(X_n, X_{n+1}, ...). An example of an event in the tail σ-field is (lim sup_{n→∞} X_n > a). Another example is (lim sup_{n→∞} S_n/n > a). The reason for this is that if k < n is fixed,

S_n/n = S_k/n + (Σ_{i=k+1}^n X_i)/n.

The first term on the right tends to 0 as n → ∞. So lim sup S_n/n = lim sup (Σ_{i=k+1}^n X_i)/n, which is in σ(X_{k+1}, X_{k+2}, ...). This holds for each k. The set (lim sup S_n > a) is easily seen not to be in the tail σ-field.
Theorem 5.3. (Kolmogorov 0-1 law) If the X_i are independent, then the events in the tail σ-field have probability 0 or 1.

This implies that in the case of i.i.d. random variables, if S_n/n has a limit with positive probability, then it has a limit with probability one, and the limit must be a constant.

Proof. Let M be the collection of sets in σ(X_{n+1}, ...) that are independent of every set in σ(X_1, ..., X_n). M is easily seen to be a monotone class and it contains σ(X_{n+1}, ..., X_N) for every N > n. Therefore M must be equal to σ(X_{n+1}, ...).
If A is in the tail σ-field, then for each n, A is independent of σ(X_1, ..., X_n). The class M_A of sets independent of A is a monotone class, hence is a σ-field containing σ(X_1, ..., X_n) for each n. Therefore M_A contains σ(X_1, X_2, ...), and in particular A ∈ M_A.
We thus have that the event A is independent of itself, or

P(A) = P(A ∩ A) = P(A)P(A) = P(A)².

This implies P(A) is zero or one.
The next proposition shows that in considering a law of large numbers we can consider truncated random variables.

Proposition 5.4. Suppose X_i is an i.i.d. sequence of r.v.s with E|X_1| < ∞. Let X′_n = X_n 1_{(|X_n|≤n)}. Then
(a) X_n converges almost surely if and only if X′_n does;
(b) if S′_n = Σ_{i=1}^n X′_i, then S_n/n converges a.s. if and only if S′_n/n does.

Proof. Let A_n = (X_n ≠ X′_n) = (|X_n| > n). Then P(A_n) = P(|X_n| > n) = P(|X_1| > n). Since E|X_1| < ∞, then by Proposition 5.2 we have Σ P(A_n) < ∞. So by the Borel-Cantelli lemma, P(A_n i.o.) = 0. Thus for almost every ω, X_n = X′_n for n sufficiently large. This proves (a).
For (b), let k (depending on ω) be the largest integer such that X′_k(ω) ≠ X_k(ω). Then S_n/n − S′_n/n = (X_1 + ··· + X_k)/n − (X′_1 + ··· + X′_k)/n → 0 as n → ∞.
Next is Kolmogorov's inequality, a special case of Doob's inequality.

Proposition 5.5. Suppose the X_i are independent and EX_i = 0 for each i. Then

P(max_{1≤i≤n} |S_i| ≥ λ) ≤ ES_n²/λ².

Proof. Let A_k = (|S_k| ≥ λ, |S_1| < λ, ..., |S_{k−1}| < λ). Note the A_k are disjoint and that A_k ∈ σ(X_1, ..., X_k). Therefore A_k is independent of S_n − S_k. Then

ES_n² ≥ Σ_{k=1}^n E[S_n²; A_k]
= Σ_{k=1}^n E[(S_k² + 2S_k(S_n − S_k) + (S_n − S_k)²); A_k]
≥ Σ_{k=1}^n E[S_k²; A_k] + 2 Σ_{k=1}^n E[S_k(S_n − S_k); A_k].

Using the independence, E[S_k(S_n − S_k)1_{A_k}] = E[S_k 1_{A_k}] E[S_n − S_k] = 0. Therefore

ES_n² ≥ Σ_{k=1}^n E[S_k²; A_k] ≥ Σ_{k=1}^n λ² P(A_k) = λ² P(max_{1≤k≤n} |S_k| ≥ λ).

Our result is immediate from this.
The last result we need for now is a special case of what is known as Kronecker's lemma.

Proposition 5.6. Suppose x_i are real numbers and s_n = Σ_{i=1}^n x_i. If Σ_{j=1}^∞ (x_j/j) converges, then s_n/n → 0.

Proof. Let b_n = Σ_{j=1}^n (x_j/j), b_0 = 0, and suppose b_n → b. As is well known, this implies (Σ_{i=1}^n b_i)/n → b. We have n(b_n − b_{n−1}) = x_n, so

s_n/n = Σ_{i=1}^n i(b_i − b_{i−1})/n = (Σ_{i=1}^n i b_i − Σ_{i=1}^{n−1} (i+1) b_i)/n = b_n − (Σ_{i=1}^{n−1} b_i)/n → b − b = 0.
6 Strong law of large numbers
This section is devoted to a proof of Kolmogorov's strong law of large numbers. We showed earlier that if EX_i² < ∞, where the X_i are i.i.d., then the weak law of large numbers (WLLN) holds: S_n/n converges to EX_1 in probability. The WLLN can be improved greatly; it is enough that xP(|X_1| > x) → 0 as x → ∞. Here we show the strong law (SLLN): if one has a finite first moment, then there is almost sure convergence. First we need a lemma.
Lemma 6.1. Suppose V_i is a sequence of independent r.v.s, each with mean 0. Let W_n = Σ_{i=1}^n V_i. If Σ_{i=1}^∞ Var V_i < ∞, then W_n converges almost surely.

Proof. Choose n_j > n_{j−1} such that Σ_{i=n_j}^∞ Var V_i < 2^{−3j}. If n > n_j, then applying Kolmogorov's inequality shows that

P(max_{n_j≤i≤n} |W_i − W_{n_j}| > 2^{−j}) ≤ 2^{−3j}/2^{−2j} = 2^{−j}.

Letting n → ∞, we have P(A_j) ≤ 2^{−j}, where

A_j = (max_{n_j≤i} |W_i − W_{n_j}| > 2^{−j}).

By the Borel-Cantelli lemma, P(A_j i.o.) = 0.
Suppose ω ∉ (A_j i.o.). Let ε > 0. Choose j large enough so that 2^{−j+1} < ε and ω ∉ A_j. If n, m > n_j, then

|W_n − W_m| ≤ |W_n − W_{n_j}| + |W_m − W_{n_j}| ≤ 2^{−j+1} < ε.

Since ε is arbitrary, W_n(ω) is a Cauchy sequence, and hence converges.
Theorem 6.2. (SLLN) Let X_i be a sequence of i.i.d. random variables. Then S_n/n converges almost surely if and only if E|X_1| < ∞.

Proof. Let us first suppose S_n/n converges a.s. and show E|X_1| < ∞. If S_n(ω)/n → a, then

S_{n−1}/n = (S_{n−1}/(n−1)) · ((n−1)/n) → a.

So

X_n/n = S_n/n − S_{n−1}/n → a − a = 0.

Hence X_n/n → 0, a.s. Thus P(|X_n| > n i.o.) = 0. By the second part of Borel-Cantelli, Σ P(|X_n| > n) < ∞. Since the X_i are i.i.d., this means Σ_{n=1}^∞ P(|X_1| > n) < ∞, and by Proposition 5.2, E|X_1| < ∞.
Now suppose E|X_1| < ∞. By looking at X_i − EX_i, we may suppose without loss of generality that EX_i = 0. We truncate, and let Y_i = X_i 1_{(|X_i|≤i)}. It suffices to show Σ_{i=1}^n Y_i/n → 0 a.s., by Proposition 5.4.
Next we estimate. We have

EY_i = E[X_1; |X_1| ≤ i] → EX_1 = 0.

The convergence follows by the dominated convergence theorem, since the integrands are bounded by |X_1|.
To estimate the second moment of the Y_i, we write

EY_i² = ∫_0^∞ 2y P(|Y_i| ≥ y) dy = ∫_0^i 2y P(|Y_i| ≥ y) dy ≤ ∫_0^i 2y P(|X_1| ≥ y) dy,

and so

Σ_{i=1}^∞ E(Y_i²/i²) ≤ Σ_{i=1}^∞ (1/i²) ∫_0^i 2y P(|X_1| ≥ y) dy
= Σ_{i=1}^∞ (1/i²) ∫_0^∞ 1_{(y≤i)} 2y P(|X_1| ≥ y) dy
= ∫_0^∞ (Σ_{i=1}^∞ (1/i²) 1_{(y≤i)}) 2y P(|X_1| ≥ y) dy
≤ ∫_0^∞ (2/y) 2y P(|X_1| ≥ y) dy = 4 ∫_0^∞ P(|X_1| ≥ y) dy = 4 E|X_1| < ∞,

since Σ_{i≥y} (1/i²) ≤ 2/y.
Let U_i = Y_i − EY_i. Then Var U_i = Var Y_i ≤ EY_i², and by the above,

Σ_{i=1}^∞ Var(U_i/i) < ∞.

By Lemma 6.1 (with V_i = U_i/i), Σ_{i=1}^n (U_i/i) converges almost surely. By Kronecker's lemma, (Σ_{i=1}^n U_i)/n → 0 almost surely. Finally, since EY_i → 0, then Σ_{i=1}^n EY_i/n → 0, hence Σ_{i=1}^n Y_i/n → 0.
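A single trajectory makes the strong law concrete. The sketch below (an added illustration, not from the notes) follows S_n/n for symmetrized Pareto variables with tail index 1.5, which have a finite first moment but infinite variance, so the fourth-moment argument of Proposition 5.1 and the Chebyshev bound of the WLLN do not apply; the trajectory nevertheless settles near the mean 0.

import numpy as np

rng = np.random.default_rng(2)
a, n = 1.5, 10**6                        # tail index 1.5: finite mean, infinite variance
x = rng.pareto(a, size=n) * rng.choice([-1.0, 1.0], size=n)   # symmetrized, EX_i = 0
running_mean = np.cumsum(x) / np.arange(1, n + 1)
for k in [10**3, 10**4, 10**5, 10**6]:
    print(k, running_mean[k - 1])        # drifts toward 0 along the single path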
7 Uniform integrability
Before proceeding to some extensions of the SLLN, we discuss uniform integrability. A sequence of r.v.s is uniformly integrable if

sup_i ∫_{(|X_i|>M)} |X_i| dP → 0

as M → ∞.

Proposition 7.1. Suppose there exists ϕ : [0, ∞) → [0, ∞) such that ϕ is nondecreasing, ϕ(x)/x → ∞ as x → ∞, and sup_i Eϕ(|X_i|) < ∞. Then the X_i are uniformly integrable.

Proof. Let ε > 0 and choose x_0 such that x/ϕ(x) < ε if x ≥ x_0. If M ≥ x_0,

∫_{(|X_i|>M)} |X_i| = ∫ (|X_i|/ϕ(|X_i|)) ϕ(|X_i|) 1_{(|X_i|>M)} ≤ ε ∫ ϕ(|X_i|) ≤ ε sup_i Eϕ(|X_i|).
Proposition 7.2. If X_n and Y_n are two uniformly integrable sequences, then X_n + Y_n is also a uniformly integrable sequence.

Proof. Since there exists M_0 such that sup_n E[|X_n|; |X_n| > M_0] < 1 and sup_n E[|Y_n|; |Y_n| > M_0] < 1, then sup_n E|X_n| ≤ M_0 + 1, and similarly for the Y_n. Let ε > 0 and choose M_1 > 4(M_0 + 1)/ε such that sup_n E[|X_n|; |X_n| > M_1] < ε/4 and sup_n E[|Y_n|; |Y_n| > M_1] < ε/4. Let M_2 = 4M_1².
Note P(|X_n| + |Y_n| > M_2) ≤ (E|X_n| + E|Y_n|)/M_2 ≤ ε/(4M_1) by Chebyshev's inequality. Then

E[|X_n + Y_n|; |X_n + Y_n| > M_2] ≤ E[|X_n|; |X_n| > M_1]
+ E[|X_n|; |X_n| ≤ M_1, |X_n + Y_n| > M_2]
+ E[|Y_n|; |Y_n| > M_1]
+ E[|Y_n|; |Y_n| ≤ M_1, |X_n + Y_n| > M_2].

The first and third terms on the right are each less than ε/4 by our choice of M_1. The second and fourth terms are each less than M_1 P(|X_n + Y_n| > M_2) ≤ ε/4.
The main result we need in this section is Vitali's convergence theorem.

Theorem 7.3. If X_n → X almost surely and the X_n are uniformly integrable, then E|X_n − X| → 0.

Proof. By the above proposition, X_n − X is uniformly integrable and tends to 0 a.s., so without loss of generality we may assume X = 0. Let ε > 0 and choose M such that sup_n E[|X_n|; |X_n| > M] < ε. Then

E|X_n| ≤ E[|X_n|; |X_n| > M] + E[|X_n|; |X_n| ≤ M] ≤ ε + E[|X_n| 1_{(|X_n|≤M)}].

The second term on the right goes to 0 by dominated convergence.
8 Complements to the SLLN
Proposition 8.1. Suppose X_i is an i.i.d. sequence and E|X_1| < ∞. Then

E|S_n/n − EX_1| → 0.

Proof. Without loss of generality we may assume EX_1 = 0. By the SLLN, S_n/n → 0 a.s. So we need to show that the sequence S_n/n is uniformly integrable.
Pick M_1 such that E[|X_1|; |X_1| > M_1] < ε/2. Pick M_2 = M_1 E|X_1|/ε. So

P(|S_n/n| > M_2) ≤ E|S_n|/(nM_2) ≤ E|X_1|/M_2 = ε/M_1.

We used here E|S_n| ≤ Σ_{i=1}^n E|X_i| = nE|X_1|.
We then have

E[|X_i|; |S_n/n| > M_2] ≤ E[|X_i|; |X_i| > M_1] + E[|X_i|; |X_i| ≤ M_1, |S_n/n| > M_2]
≤ ε + M_1 P(|S_n/n| > M_2) ≤ 2ε.

Finally,

E[|S_n/n|; |S_n/n| > M_2] ≤ (1/n) Σ_{i=1}^n E[|X_i|; |S_n/n| > M_2] ≤ 2ε.

Since ε is arbitrary, S_n/n is uniformly integrable, and the result follows from Vitali's convergence theorem.
Theorem 8.2. Let X_i be a sequence of independent random variables, A > 0, and Y_i = X_i 1_{(|X_i|≤A)}. Then Σ X_i converges if and only if all of the following three series converge: (a) Σ P(|X_n| > A); (b) Σ EY_i; (c) Σ Var Y_i.

Proof of "if" part. Since (c) holds, then Σ (Y_i − EY_i) converges by Lemma 6.1. Since (b) holds, taking the difference shows Σ Y_i converges. Since (a) holds, Σ P(X_i ≠ Y_i) = Σ P(|X_i| > A) < ∞, so by Borel-Cantelli, P(X_i ≠ Y_i i.o.) = 0. It follows that Σ X_i converges.
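As a concrete instance of the theorem (added here for illustration), take X_i = ε_i/i where the ε_i are independent signs ±1 with probability 1/2 each, and A = 1. Then series (a) and (b) are identically zero and (c) is Σ 1/i² < ∞, so Σ X_i converges a.s. The sketch below looks at the partial sums along one sample path; the cutoffs are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
n = 10**6
i = np.arange(1, n + 1)
signs = rng.choice([-1.0, 1.0], size=n)
partial = np.cumsum(signs / i)           # partial sums of sum_i eps_i / i
for k in [10**2, 10**3, 10**4, 10**5, 10**6]:
    print(k, partial[k - 1])             # the partial sums settle to a (random) limit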
9 Conditional expectation
If F ⊆ G are two σ-fields and X is an integrable G measurable random variable, the conditional expectation of X given F, written E[X | F] and read as "the expectation (or expected value) of X given F," is any F measurable random variable Y such that E[Y; A] = E[X; A] for every A ∈ F. The conditional probability of A ∈ G given F is defined by P(A | F) = E[1_A | F].
If Y_1, Y_2 are two F measurable random variables with E[Y_1; A] = E[Y_2; A] for all A ∈ F, then Y_1 = Y_2 a.s.; that is, conditional expectation is unique up to a.s. equivalence.
In the case X is already F measurable, E[X | F] = X. If X is independent of F, E[X | F] = EX. Both of these facts follow immediately from the definition. For another example, which ties this definition with the one used in elementary probability courses, if {A_i} is a finite collection of disjoint sets whose union is Ω, P(A_i) > 0 for all i, and F is the σ-field generated by the A_i's, then

P(A | F) = Σ_i (P(A ∩ A_i)/P(A_i)) 1_{A_i}.

This follows since the right-hand side is F measurable and its expectation over any set A_i is P(A ∩ A_i).
As an example, suppose we toss a fair coin independently 5 times and let X_i be 1 or 0 depending on whether the ith toss was a heads or tails. Let A be the event that there were 5 heads and let F_i = σ(X_1, ..., X_i). Then P(A) = 1/32, while P(A | F_1) is equal to 1/16 on the event (X_1 = 1) and 0 on the event (X_1 = 0). P(A | F_2) is equal to 1/8 on the event (X_1 = 1, X_2 = 1) and 0 otherwise.
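These values are easy to verify by brute force (an added illustration): enumerate the 32 equally likely outcomes of 5 tosses and average 1_A over the outcomes consistent with what F_1 or F_2 reveals.

from itertools import product

outcomes = list(product([0, 1], repeat=5))          # all 32 equally likely toss sequences
A = lambda w: all(x == 1 for x in w)                # the event of five heads

print(sum(A(w) for w in outcomes) / 32)             # P(A) = 1/32

# P(A | F_1) on the event (X_1 = 1): average 1_A over outcomes whose first toss is a head
first_head = [w for w in outcomes if w[0] == 1]
print(sum(A(w) for w in first_head) / len(first_head))          # 1/16

# P(A | F_2) on the event (X_1 = 1, X_2 = 1)
two_heads = [w for w in outcomes if w[0] == 1 and w[1] == 1]
print(sum(A(w) for w in two_heads) / len(two_heads))            # 1/8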
We have

E[E[X | F]] = EX   (9.1)

because E[E[X | F]] = E[E[X | F]; Ω] = E[X; Ω] = EX.
The following is easy to establish.

Proposition 9.1. (a) If X ≥ Y are both integrable, then E[X | F] ≥ E[Y | F] a.s.
(b) If X and Y are integrable and a ∈ R, then E[aX + Y | F] = aE[X | F] + E[Y | F].

It is easy to check that limit theorems such as monotone convergence and dominated convergence have conditional expectation versions, as do inequalities like Jensen's and Chebyshev's. Thus, for example, we have the following.

Proposition 9.2. (Jensen's inequality for conditional expectations) If g is convex and X and g(X) are integrable, then

g(E[X | F]) ≤ E[g(X) | F], a.s.
Proposition 9.3. If X and XY are integrable and Y is measurable with respect to F, then

E[XY | F] = Y E[X | F].   (9.2)

Proof. If A ∈ F, then for any B ∈ F,

E[1_A E[X | F]; B] = E[E[X | F]; A ∩ B] = E[X; A ∩ B] = E[1_A X; B].

Since 1_A E[X | F] is F measurable, this shows that (9.2) holds when Y = 1_A and A ∈ F. Using linearity and taking limits shows that (9.2) holds whenever Y is F measurable and X and XY are integrable.
Two other equalities follow.

Proposition 9.4. If E ⊆ F ⊆ G, then

E[E[X | F] | E] = E[X | E] = E[E[X | E] | F].

Proof. The right equality holds because E[X | E] is E measurable, hence F measurable. To show the left equality, let A ∈ E. Then since A is also in F,

E[E[E[X | F] | E]; A] = E[E[X | F]; A] = E[X; A] = E[E[X | E]; A].

Since both sides are E measurable, the equality follows.
To show the existence of E[X | F], we proceed as follows.

Proposition 9.5. If X is integrable, then E[X | F] exists.

Proof. Using linearity, we need only consider X ≥ 0. Define a measure Q on F by Q(A) = E[X; A] for A ∈ F. This is trivially absolutely continuous with respect to P|_F, the restriction of P to F. Let E[X | F] be the Radon-Nikodym derivative of Q with respect to P|_F. The Radon-Nikodym derivative is F measurable by construction and so provides the desired random variable.
When F = σ(Y), one usually writes E[X | Y] for E[X | F]. Notation that is commonly used (however, we will use it only very occasionally and only for heuristic purposes) is E[X | Y = y]. The definition is as follows. If A ∈ σ(Y), then A = (Y ∈ B) for some Borel set B by the definition of σ(Y), or 1_A = 1_B(Y). By linearity and taking limits, if Z is σ(Y) measurable, Z = f(Y) for some Borel measurable function f. Set Z = E[X | Y] and choose f Borel measurable so that Z = f(Y). Then E[X | Y = y] is defined to be f(y).
If X ∈ L² and M = {Y ∈ L² : Y is F-measurable}, one can show that E[X | F] is equal to the projection of X onto the subspace M. We will not use this in these notes.
10 Stopping times
We next want to talk about stopping times. Suppose we have a sequence of σ-fields F_i such that F_i ⊂ F_{i+1} for each i. An example would be F_i = σ(X_1, ..., X_i). A random mapping N from Ω to {0, 1, 2, ...} is called a stopping time if for each n, (N ≤ n) ∈ F_n. A stopping time is also called an optional time in the Markov theory literature.
The intuition is that the sequence knows whether N has happened by time n by looking at F_n. Suppose some motorists are told to drive north on Highway 99 in Seattle and stop at the first motorcycle shop past the second realtor after the city limits. So they drive north, pass the city limits, pass two realtors, come to the next motorcycle shop, and stop. That is a stopping time. If they are instead told to stop at the third stop light before the city limits (and they had not been there before), they would need to drive to the city limits, then turn around and return past three stop lights. That is not a stopping time, because they have to go ahead of where they wanted to stop to know to stop there.
Proposition 10.1.
(a) Fixed times n are stopping times.
(b) If N_1 and N_2 are stopping times, then so are N_1 ∧ N_2 and N_1 ∨ N_2.
(c) If N_n is a nondecreasing sequence of stopping times, then so is N = sup_n N_n.
(d) If N_n is a nonincreasing sequence of stopping times, then so is N = inf_n N_n.
(e) If N is a stopping time, then so is N + n.

We define F_N = {A : A ∩ (N ≤ n) ∈ F_n for all n}.
11 Martingales
In this section we consider martingales. Let F_n be an increasing sequence of σ-fields. A sequence of random variables M_n is adapted to F_n if for each n, M_n is F_n measurable.
M_n is a martingale if M_n is adapted to F_n, M_n is integrable for all n, and

E[M_n | F_{n−1}] = M_{n−1}, a.s., n = 2, 3, ....   (11.1)

If we have E[M_n | F_{n−1}] ≥ M_{n−1} a.s. for every n, then M_n is a submartingale. If we have E[M_n | F_{n−1}] ≤ M_{n−1}, we have a supermartingale. Submartingales have a tendency to increase.
Let us take a moment to look at some examples. If X_i is a sequence of mean zero i.i.d. random variables and S_n is the partial sum process, then M_n = S_n is a martingale, since E[M_n | F_{n−1}] = M_{n−1} + E[M_n − M_{n−1} | F_{n−1}] = M_{n−1} + E[M_n − M_{n−1}] = M_{n−1}, using independence. If the X_i's have variance one and M_n = S_n² − n, then

E[S_n² | F_{n−1}] = E[(S_n − S_{n−1})² | F_{n−1}] + 2S_{n−1} E[S_n | F_{n−1}] − S_{n−1}² = 1 + S_{n−1}²,

using independence. It follows that M_n is a martingale.
Another example is the following: if X ∈ L¹ and M_n = E[X | F_n], then M_n is a martingale.
If M_n is a martingale and H_n ∈ F_{n−1} for each n, it is easy to check that N_n = Σ_{i=1}^n H_i(M_i − M_{i−1}) is also a martingale.
If M_n is a martingale, g is convex, and g(M_n) is integrable for each n, then by Jensen's inequality

E[g(M_{n+1}) | F_n] ≥ g(E[M_{n+1} | F_n]) = g(M_n),

so g(M_n) is a submartingale. Similarly if g is convex and nondecreasing on [0, ∞) and M_n is a positive submartingale, then g(M_n) is a submartingale because

E[g(M_{n+1}) | F_n] ≥ g(E[M_{n+1} | F_n]) ≥ g(M_n).
12 Optional stopping
Note that if one takes expectations in (11.1), one has EM_n = EM_{n−1}, and by induction EM_n = EM_0.
The theorem about martingales that lies at the basis of all other results is Doob's optional stopping theorem, which says that the same is true if we replace n by a stopping time N. There are various versions, depending on what conditions one puts on the stopping times.

Theorem 12.1. If N is a bounded stopping time with respect to F_n and M_n a martingale, then EM_N = EM_0.

Proof. Since N is bounded, let K be the largest value N takes. We write

EM_N = Σ_{k=0}^K E[M_N; N = k] = Σ_{k=0}^K E[M_k; N = k].

Note (N = k) is F_j measurable if j ≥ k, so

E[M_k; N = k] = E[M_{k+1}; N = k] = E[M_{k+2}; N = k] = ··· = E[M_K; N = k].

Hence

EM_N = Σ_{k=0}^K E[M_K; N = k] = EM_K = EM_0.

This completes the proof.

The assumption that N be bounded cannot be entirely dispensed with. For example, let M_n be the partial sums of a sequence of i.i.d. random variables that take the values ±1, each with probability 1/2. If N = min{i : M_i = 1}, we will see later on that N < ∞ a.s., but EM_N = 1 ≠ 0 = EM_0.
The same proof as that in Theorem 12.1 gives the following corollary.

Corollary 12.2. If N is bounded by K and M_n is a submartingale, then EM_N ≤ EM_K.

Also the same proof gives

Corollary 12.3. If N is bounded by K, A ∈ F_N, and M_n is a submartingale, then E[M_N; A] ≤ E[M_K; A].

Proposition 12.4. If N_1 ≤ N_2 are stopping times bounded by K and M is a martingale, then E[M_{N_2} | F_{N_1}] = M_{N_1}, a.s.

Proof. Suppose A ∈ F_{N_1}. We need to show E[M_{N_1}; A] = E[M_{N_2}; A]. Define a new stopping time N_3 by

N_3(ω) = N_1(ω) if ω ∈ A,  N_2(ω) if ω ∉ A.

It is easy to check that N_3 is a stopping time, so EM_{N_3} = EM_K = EM_{N_2} implies

E[M_{N_1}; A] + E[M_{N_2}; A^c] = E[M_{N_2}].

Subtracting E[M_{N_2}; A^c] from each side completes the proof.
The following is known as the Doob decomposition.

Proposition 12.5. Suppose X_k is a submartingale with respect to an increasing sequence of σ-fields F_k. Then we can write X_k = M_k + A_k such that M_k is a martingale adapted to the F_k and A_k is a sequence of random variables with A_k being F_{k−1}-measurable and A_0 ≤ A_1 ≤ ···.

Proof. Let a_k = E[X_k | F_{k−1}] − X_{k−1} for k = 1, 2, .... Since X_k is a submartingale, then each a_k ≥ 0. Then let A_k = Σ_{i=1}^k a_i. The fact that the A_k are increasing and measurable with respect to F_{k−1} is clear. Set M_k = X_k − A_k. Then

E[M_{k+1} − M_k | F_k] = E[X_{k+1} − X_k | F_k] − a_{k+1} = 0,

or M_k is a martingale.

Corollary 12.6. Suppose X_k is a submartingale, and N_1 ≤ N_2 are bounded stopping times. Then

E[X_{N_2} | F_{N_1}] ≥ X_{N_1}.
13 Doob’s inequalities
The first interesting consequences of the optional stopping theorems are Doob's inequalities. If M_n is a martingale, denote M*_n = max_{i≤n} |M_i|.

Theorem 13.1. If M_n is a martingale or a positive submartingale,

P(M*_n ≥ a) ≤ E[|M_n|; M*_n ≥ a]/a ≤ E|M_n|/a.

Proof. Set M_{n+1} = M_n. Let N = min{j : |M_j| ≥ a} ∧ (n+1). Since |·| is convex, |M_n| is a submartingale. If A = (M*_n ≥ a), then A ∈ F_N and by Corollary 12.3

a P(M*_n ≥ a) ≤ E[|M_N|; A] ≤ E[|M_n|; A] ≤ E|M_n|.
For p > 1, we have the following inequality.

Theorem 13.2. If p > 1 and E|M_i|^p < ∞ for i ≤ n, then

E(M*_n)^p ≤ (p/(p−1))^p E|M_n|^p.

Proof. Note M*_n ≤ Σ_{i=1}^n |M_i|, hence M*_n ∈ L^p. We write

E(M*_n)^p = ∫_0^∞ p a^{p−1} P(M*_n > a) da ≤ ∫_0^∞ p a^{p−1} E[|M_n| 1_{(M*_n≥a)}/a] da
= E ∫_0^{M*_n} p a^{p−2} |M_n| da = (p/(p−1)) E[(M*_n)^{p−1} |M_n|]
≤ (p/(p−1)) (E(M*_n)^p)^{(p−1)/p} (E|M_n|^p)^{1/p}.

The last inequality follows by Hölder's inequality. Now divide both sides by the quantity (E(M*_n)^p)^{(p−1)/p}.
14 Martingale convergence theorems
The martingale convergence theorems are another set of important consequences of optional stopping. The main step is the upcrossing lemma. The number of upcrossings of an interval [a, b] is the number of times a process crosses from below a to above b.
To be more exact, let

S_1 = min{k : X_k ≤ a},  T_1 = min{k > S_1 : X_k ≥ b},

and

S_{i+1} = min{k > T_i : X_k ≤ a},  T_{i+1} = min{k > S_{i+1} : X_k ≥ b}.

The number of upcrossings U_n before time n is the largest j such that T_j ≤ n.

Theorem 14.1. (Upcrossing lemma) If X_k is a submartingale,

EU_n ≤ (b − a)^{−1} E[(X_n − a)^+].
Proof. The number of upcrossings of [a, b] by X_k is the same as the number of upcrossings of [0, b−a] by Y_k = (X_k − a)^+. Moreover Y_k is still a submartingale. If we obtain the inequality for the number of upcrossings of the interval [0, b−a] by the process Y_k, we will have the desired inequality for upcrossings of X.
So we may assume a = 0. Fix n and define Y_{n+1} = Y_n. This will still be a submartingale. Define the S_i, T_i as above, and let S′_i = S_i ∧ (n+1), T′_i = T_i ∧ (n+1). Since T_{i+1} > S_{i+1} > T_i, then T′_{n+1} = n+1.
We write

EY_{n+1} = EY_{S′_1} + Σ_{i=1}^{n+1} E[Y_{T′_i} − Y_{S′_i}] + Σ_{i=1}^{n} E[Y_{S′_{i+1}} − Y_{T′_i}].

All the summands in the third term on the right are nonnegative since Y_k is a submartingale. For the jth upcrossing, Y_{T′_j} − Y_{S′_j} ≥ b − a, while Y_{T′_j} − Y_{S′_j} is always greater than or equal to 0. So

Σ_i (Y_{T′_i} − Y_{S′_i}) ≥ (b − a) U_n.

So

EU_n ≤ EY_{n+1}/(b − a).
This leads to the martingale convergence theorem.

Theorem 14.2. If X_n is a submartingale such that sup_n EX_n^+ < ∞, then X_n converges a.s. as n → ∞.

Proof. Let U(a, b) = lim_{n→∞} U_n. For each pair of rationals a < b, by monotone convergence,

EU(a, b) ≤ (b − a)^{−1} sup_n E(X_n − a)^+ < ∞.

So U(a, b) < ∞, a.s. Taking the union over all pairs of rationals a, b, we see that a.s. the sequence X_n(ω) cannot have lim sup X_n > lim inf X_n. Therefore X_n converges a.s., although we still have to rule out the possibility of the limit being infinite. Since X_n is a submartingale, EX_n ≥ EX_0, and thus

E|X_n| = EX_n^+ + EX_n^− = 2EX_n^+ − EX_n ≤ 2EX_n^+ − EX_0.

By Fatou's lemma, E lim_n |X_n| ≤ sup_n E|X_n| < ∞, so X_n converges a.s. to a finite limit.

Corollary 14.3. If X_n is a positive supermartingale or a martingale bounded above or below, X_n converges a.s.

Proof. If X_n is a positive supermartingale, −X_n is a submartingale bounded above by 0. Now apply Theorem 14.2.
If X_n is a martingale bounded above, by considering −X_n, we may assume X_n is bounded below. Looking at X_n + M for fixed M will not affect the convergence, so we may assume X_n is bounded below by 0. Now apply the first assertion of the corollary.
Proposition 14.4. If X_n is a martingale with sup_n E|X_n|^p < ∞ for some p > 1, then the convergence is in L^p as well as a.s. This is also true when X_n is a submartingale. If X_n is a uniformly integrable martingale, then the convergence is in L^1. If X_n → X_∞ in L^1, then X_n = E[X_∞ | F_n].

X_n is a uniformly integrable martingale if the collection of random variables X_n is uniformly integrable.

Proof. The L^p convergence assertion follows by using Doob's inequality (Theorem 13.2) and dominated convergence. The L^1 convergence assertion follows since a.s. convergence together with uniform integrability implies L^1 convergence. Finally, if j < n, we have X_j = E[X_n | F_j]. If A ∈ F_j,

E[X_j; A] = E[X_n; A] → E[X_∞; A]

by the L^1 convergence of X_n to X_∞. Since this is true for all A ∈ F_j, X_j = E[X_∞ | F_j].
15 Applications of martingales
Let S_n be your fortune at time n. In a fair casino, E[S_{n+1} | F_n] = S_n. If N is a stopping time, the optional stopping theorem says that ES_N = ES_0; in other words, no matter what stopping time you use and what method of betting, you will not do better on average than ending up with what you started with.
An elegant application of martingales is a proof of the SLLN. Fix N large. Let Y_i be i.i.d. Let Z_n = E[Y_1 | S_n, S_{n+1}, ..., S_N]. We claim Z_n = S_n/n. Certainly S_n/n is σ(S_n, ..., S_N) measurable. If A ∈ σ(S_n, ..., S_N), then A = ((S_n, ..., S_N) ∈ B) for some Borel subset B of R^{N−n+1}. Since the Y_i are i.i.d., for each k ≤ n,

E[Y_1; (S_n, ..., S_N) ∈ B] = E[Y_k; (S_n, ..., S_N) ∈ B].

Summing over k and dividing by n,

E[Y_1; (S_n, ..., S_N) ∈ B] = E[S_n/n; (S_n, ..., S_N) ∈ B].

Therefore E[Y_1; A] = E[S_n/n; A] for every A ∈ σ(S_n, ..., S_N). Thus Z_n = S_n/n.
Let X_k = Z_{N−k}, and let F_k = σ(S_{N−k}, S_{N−k+1}, ..., S_N). Note F_k gets larger as k gets larger, and by the above X_k = E[Y_1 | F_k]. This shows that X_k is a martingale (cf. the next to last example in Section 11). By Doob's upcrossing inequality, if U^X_N is the number of upcrossings of [a, b] by X, then EU^X_N ≤ EX_N^+/(b − a) ≤ E|Y_1|/(b − a). This differs by at most one from the number of upcrossings of [a, b] by Z_1, ..., Z_N. So the expected number of upcrossings of [a, b] by Z_k for k ≤ N is bounded by 1 + E|Y_1|/(b − a). Now let N → ∞. By Fatou's lemma, the expected number of upcrossings of [a, b] by Z_1, Z_2, ... is finite. Arguing as in the proof of the martingale convergence theorem, this says that Z_n = S_n/n does not oscillate.
It is conceivable that |S_n/n| → ∞. But by Fatou's lemma,

E[lim |S_n/n|] ≤ lim inf E|S_n/n| ≤ lim inf n E|Y_1|/n = E|Y_1| < ∞.

Hence S_n/n converges a.s. to a finite limit.
Proposition 15.1. Suppose the Y_i are i.i.d. with E|Y_1| < ∞, N is a stopping time with EN < ∞, and N is independent of the Y_i. Then ES_N = (EN)(EY_1), where the S_n are the partial sums of the Y_i.

Proof. S_n − n(EY_1) is a martingale, so ES_{n∧N} = E(n ∧ N) EY_1 by optional stopping. The right hand side tends to (EN)(EY_1) by monotone convergence. S_{n∧N} converges almost surely to S_N, and we need to show the expected values converge. Note

|S_{n∧N}| = Σ_{k=0}^∞ |S_{n∧k}| 1_{(N=k)} ≤ Σ_{k=0}^∞ Σ_{j=1}^{n∧k} |Y_j| 1_{(N=k)}
= Σ_{j=1}^n Σ_{k≥j} |Y_j| 1_{(N=k)} = Σ_{j=1}^n |Y_j| 1_{(N≥j)} ≤ Σ_{j=1}^∞ |Y_j| 1_{(N≥j)}.

The last expression, using the independence, has expected value

Σ_{j=1}^∞ (E|Y_j|) P(N ≥ j) ≤ (E|Y_1|)(1 + EN) < ∞.

So by dominated convergence, we have ES_{n∧N} → ES_N.

Wald's second identity is a similar expression for the variance of S_N.
We can use martingales to find certain hitting probabilities.

Proposition 15.2. Suppose the Y_i are i.i.d. with P(Y_1 = 1) = 1/2, P(Y_1 = −1) = 1/2, and S_n is the partial sum process. Suppose a and b are positive integers. Then

P(S_n hits −a before b) = b/(a + b).

If N = min{n : S_n ∈ {−a, b}}, then EN = ab.

Proof. S_n² − n is a martingale, so ES²_{n∧N} = E(n ∧ N). Let n → ∞. The right hand side converges to EN by monotone convergence. Since S_{n∧N} is bounded in absolute value by a + b, the left hand side converges by dominated convergence to ES_N², which is finite. So EN is finite, hence N is finite almost surely.
S_n is a martingale, so ES_{n∧N} = ES_0 = 0. By dominated convergence, and the fact that N < ∞ a.s., hence S_{n∧N} → S_N, we have ES_N = 0, or

−a P(S_N = −a) + b P(S_N = b) = 0.

We also have

P(S_N = −a) + P(S_N = b) = 1.

Solving these two equations for P(S_N = −a) and P(S_N = b) yields our first result. Since EN = ES_N² = a² P(S_N = −a) + b² P(S_N = b), substituting gives the second result.

Based on this proposition, if we let a → ∞, we see that P(N_b < ∞) = 1 and EN_b = ∞, where N_b = min{n : S_n = b}.
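Both conclusions of Proposition 15.2 are easy to check by simulation (an added illustration; the values of a, b and the number of runs are arbitrary).

import numpy as np

rng = np.random.default_rng(4)
a, b, runs = 3, 5, 5000
hits_minus_a, total_steps = 0, 0
for _ in range(runs):
    s, steps = 0, 0
    while -a < s < b:                    # run the walk until it reaches -a or b
        s += rng.choice([-1, 1])
        steps += 1
    hits_minus_a += (s == -a)
    total_steps += steps
print(hits_minus_a / runs, b / (a + b))  # estimate of P(hit -a before b) vs b/(a+b)
print(total_steps / runs, a * b)         # estimate of EN vs ab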
Proposition 15.3. Suppose A_n ∈ F_n. Then (A_n i.o.) and (Σ_{n=1}^∞ P(A_n | F_{n−1}) = ∞) differ by a null set.

Proof. Let X_n = Σ_{m=1}^n [1_{A_m} − P(A_m | F_{m−1})]. Note |X_n − X_{n−1}| ≤ 1. Also, it is easy to see that E[X_n − X_{n−1} | F_{n−1}] = 0, so X_n is a martingale.
We claim that for almost every ω either lim X_n exists and is finite, or else lim sup X_n = ∞ and lim inf X_n = −∞. In fact, if N = min{n : X_n ≤ −k}, then X_{n∧N} ≥ −k − 1, so X_{n∧N} converges by the martingale convergence theorem. Therefore lim X_n exists and is finite on (N = ∞). So if lim X_n does not exist or is not finite, then N < ∞. This is true for all k, hence lim inf X_n = −∞. A similar argument shows lim sup X_n = ∞ in this case.
Now if lim X_n exists and is finite, then Σ_{n=1}^∞ 1_{A_n} = ∞ if and only if Σ P(A_n | F_{n−1}) = ∞. On the other hand, if the limit does not exist or is not finite, then Σ 1_{A_n} = ∞ and Σ P(A_n | F_{n−1}) = ∞.
16 Weak convergence
We will see later that if the X_i are i.i.d. with mean zero and variance one, then S_n/√n converges in the sense

P(S_n/√n ∈ [a, b]) → P(Z ∈ [a, b]),

where Z is a standard normal. If S_n/√n converged in probability or almost surely, then by the zero-one law it would converge to a constant, contradicting the above. We want to generalize the above type of convergence.
We say F_n converges weakly to F if F_n(x) → F(x) for all x at which F is continuous. Here F_n and F are distribution functions. We say X_n converges weakly to X if F_{X_n} converges weakly to F_X. We sometimes say X_n converges in distribution or converges in law to X. Probabilities µ_n converge weakly if their corresponding distribution functions converge, that is, if F_{µ_n}(x) = µ_n(−∞, x] converges weakly.
An example that illustrates why we restrict the convergence to continuity points of F is the following. Let X_n = 1/n with probability one, and X = 0 with probability one. F_{X_n}(x) is 0 if x < 1/n and 1 otherwise. F_{X_n}(x) converges to F_X(x) for all x except x = 0.
Proposition 16.1. X_n converges weakly to X if and only if Eg(X_n) → Eg(X) for all g bounded and continuous.

The idea that Eg(X_n) converges to Eg(X) for all g bounded and continuous makes sense for any metric space and is used as a definition of weak convergence for X_n taking values in general metric spaces.

Proof. First suppose Eg(X_n) converges to Eg(X). Let x be a continuity point of F_X, let ε > 0, and choose δ such that |F_X(y) − F_X(x)| < ε if |y − x| < δ. Choose g continuous such that g is one on (−∞, x], takes values between 0 and 1, and is 0 on [x + δ, ∞). Then F_{X_n}(x) ≤ Eg(X_n) → Eg(X) ≤ F_X(x + δ) ≤ F_X(x) + ε.
Similarly, if h is a continuous function taking values between 0 and 1 that is 1 on (−∞, x − δ] and 0 on [x, ∞), then F_{X_n}(x) ≥ Eh(X_n) → Eh(X) ≥ F_X(x − δ) ≥ F_X(x) − ε. Since ε is arbitrary, F_{X_n}(x) → F_X(x).
Now suppose X_n converges weakly to X. If a and b are continuity points of F_X and of all the F_{X_n}, then E1_{[a,b]}(X_n) = F_{X_n}(b) − F_{X_n}(a) → F_X(b) − F_X(a) = E1_{[a,b]}(X). By taking linear combinations, we have Eg(X_n) → Eg(X) for every g which is a step function where the end points of the intervals are continuity points for all the F_{X_n} and for F_X. Since the set of points that are not a continuity point for some F_{X_n} or for F_X is countable, and we can approximate any continuous function on an interval by such step functions uniformly, we have Eg(X_n) → Eg(X) for all g such that the support of g is a closed interval whose endpoints are continuity points of F_X and g is continuous on its support.
Let ε > 0 and choose M such that F_X(M) > 1 − ε and F_X(−M) < ε and so that M and −M are continuity points of F_X and of all the F_{X_n}. We apply the above to 1_{[−M,M]}g, where g is a bounded continuous function. The difference between E(1_{[−M,M]}g)(X) and Eg(X) is bounded by ‖g‖_∞ P(X ∉ [−M, M]) ≤ 2ε‖g‖_∞. Similarly, when X is replaced by X_n, the difference is bounded by ‖g‖_∞ P(X_n ∉ [−M, M]) → ‖g‖_∞ P(X ∉ [−M, M]). So for n large, it is less than 3ε‖g‖_∞. Since ε is arbitrary, Eg(X_n) → Eg(X) whenever g is bounded and continuous.
Let us examine the relationship between weak convergence and convergence in probability. The example of S_n/√n shows that one can have weak convergence without convergence in probability.

Proposition 16.2. (a) If X_n converges to X in probability, then it converges weakly.
(b) If X_n converges weakly to a constant, it converges in probability.
(c) (Slutsky's theorem) If X_n converges weakly to X and Y_n converges weakly to a constant c, then X_n + Y_n converges weakly to X + c and X_n Y_n converges weakly to cX.

Proof. To prove (a), let g be a bounded and continuous function. If n_j is any subsequence, then there exists a further subsequence such that X_{n_{j_k}} converges almost surely to X. Then by dominated convergence, Eg(X_{n_{j_k}}) → Eg(X). That suffices to show Eg(X_n) converges to Eg(X).
For (b), if X_n converges weakly to c,

P(X_n − c > ε) = P(X_n > c + ε) = 1 − P(X_n ≤ c + ε) → 1 − P(c ≤ c + ε) = 0.

We use the fact that if Y ≡ c, then c + ε is a point of continuity for F_Y. A similar equation shows P(X_n − c ≤ −ε) → 0, so P(|X_n − c| > ε) → 0.
We now prove the first part of (c), leaving the second part for the reader. Let x be a point such that x − c is a continuity point of F_X. Choose ε so that x − c + ε is again a continuity point. Then

P(X_n + Y_n ≤ x) ≤ P(X_n + c ≤ x + ε) + P(|Y_n − c| > ε) → P(X ≤ x − c + ε).

So lim sup P(X_n + Y_n ≤ x) ≤ P(X + c ≤ x + ε). Since ε can be as small as we like and x − c is a continuity point of F_X, then lim sup P(X_n + Y_n ≤ x) ≤ P(X + c ≤ x). The lim inf is done similarly.
We say a sequence of distribution functions {F_n} is tight if for each ε > 0 there exists M such that F_n(M) ≥ 1 − ε and F_n(−M) ≤ ε for all n. A sequence of r.v.s is tight if the corresponding distribution functions are tight; this is equivalent to P(|X_n| ≥ M) ≤ ε for all n.

Theorem 16.3. (Helly's theorem) Let F_n be a sequence of distribution functions that is tight. There exists a subsequence n_j and a distribution function F such that F_{n_j} converges weakly to F.

What could happen is that X_n = n, so that F_{X_n} → 0; the tightness precludes this.

Proof. Let q_k be an enumeration of the rationals. Since F_n(q_k) ∈ [0, 1], any subsequence has a further subsequence that converges. Use diagonalization so that F_{n_j}(q_k) converges for each q_k and call the limit F(q_k). F is nondecreasing, and define F(x) = inf_{q_k≥x} F(q_k). So F is right continuous and nondecreasing.
If x is a point of continuity of F and ε > 0, then there exist r and s rational such that r < x < s and F(s) − ε < F(x) < F(r) + ε. Then

F_{n_j}(x) ≥ F_{n_j}(r) → F(r) > F(x) − ε

and

F_{n_j}(x) ≤ F_{n_j}(s) → F(s) < F(x) + ε.

Since ε is arbitrary, F_{n_j}(x) → F(x).
Since the F_n are tight, there exists M such that F_n(−M) < ε for all n. Then F(−M) ≤ ε, which implies F(x) → 0 as x → −∞. Similarly F(x) → 1 as x → ∞, so F is indeed a distribution function.
Proposition 16.4. Suppose there exists ϕ : [0, ∞) → [0, ∞) that is increasing and ϕ(x) → ∞ as x → ∞. If c = sup_n Eϕ(|X_n|) < ∞, then the X_n are tight.

Proof. Let ε > 0. Choose M such that ϕ(x) ≥ c/ε if x > M. Then

P(|X_n| > M) ≤ ∫ (ϕ(|X_n|)/(c/ε)) 1_{(|X_n|>M)} dP ≤ (ε/c) Eϕ(|X_n|) ≤ ε.
17 Characteristic functions
We define the characteristic function of a random variable X by ϕ_X(t) = E e^{itX} for t ∈ R.
Note that ϕ_X(t) = ∫ e^{itx} P_X(dx). So if X and Y have the same law, they have the same characteristic function. Also, if the law of X has a density, that is, P_X(dx) = f_X(x) dx, then ϕ_X(t) = ∫ e^{itx} f_X(x) dx, so in this case the characteristic function is the same as (one definition of) the Fourier transform of f_X.

Proposition 17.1. ϕ(0) = 1, |ϕ(t)| ≤ 1, ϕ(−t) is the complex conjugate of ϕ(t), and ϕ is uniformly continuous.

Proof. Since |e^{itx}| ≤ 1, everything follows immediately from the definitions except the uniform continuity. For that we write

|ϕ(t + h) − ϕ(t)| = |E e^{i(t+h)X} − E e^{itX}| ≤ E|e^{itX}(e^{ihX} − 1)| = E|e^{ihX} − 1|.

|e^{ihX} − 1| tends to 0 almost surely as h → 0, so the right hand side tends to 0 by dominated convergence. Note that the right hand side is independent of t.

Proposition 17.2. ϕ_{aX}(t) = ϕ_X(at) and ϕ_{X+b}(t) = e^{itb} ϕ_X(t).

Proof. The first follows from E e^{it(aX)} = E e^{i(at)X}, and the second is similar.

Proposition 17.3. If X and Y are independent, then ϕ_{X+Y}(t) = ϕ_X(t) ϕ_Y(t).

Proof. From the multiplication theorem,

E e^{it(X+Y)} = E e^{itX} e^{itY} = E e^{itX} E e^{itY}.

Note that if X_1 and X_2 are independent and identically distributed, then

ϕ_{X_1−X_2}(t) = ϕ_{X_1}(t) ϕ_{−X_2}(t) = ϕ_{X_1}(t) ϕ_{X_2}(−t) = |ϕ_{X_1}(t)|²,

since ϕ_{X_2}(−t) is the complex conjugate of ϕ_{X_1}(t).
Let us look at some examples of characteristic functions.
(a) Bernoulli: By direct computation, this is pe^{it} + (1 − p) = 1 − p(1 − e^{it}).
(b) Coin flip (i.e., P(X = +1) = P(X = −1) = 1/2): We have (1/2)e^{it} + (1/2)e^{−it} = cos t.
(c) Poisson:

E e^{itX} = Σ_{k=0}^∞ e^{itk} e^{−λ} λ^k/k! = e^{−λ} Σ_k (λe^{it})^k/k! = e^{−λ} e^{λe^{it}} = e^{λ(e^{it}−1)}.

(e) Binomial: Write X as the sum of n independent Bernoulli r.v.s B_i. So

ϕ_X(t) = Π_{i=1}^n ϕ_{B_i}(t) = [ϕ_{B_1}(t)]^n = [1 − p(1 − e^{it})]^n.

(f) Geometric:

ϕ(t) = Σ_{k=0}^∞ p(1 − p)^k e^{itk} = p Σ_k ((1 − p)e^{it})^k = p/(1 − (1 − p)e^{it}).

(g) Uniform on [a, b]:

ϕ(t) = (1/(b − a)) ∫_a^b e^{itx} dx = (e^{itb} − e^{ita})/((b − a)it).

Note that when a = −b this reduces to sin(bt)/(bt).
(h) Exponential:

ϕ(t) = ∫_0^∞ λ e^{itx} e^{−λx} dx = λ ∫_0^∞ e^{(it−λ)x} dx = λ/(λ − it).

(i) Standard normal:

ϕ(t) = (1/√(2π)) ∫_{−∞}^∞ e^{itx} e^{−x²/2} dx.

This can be done by completing the square and then doing a contour integration. Alternately, ϕ′(t) = (1/√(2π)) ∫_{−∞}^∞ ix e^{itx} e^{−x²/2} dx (do the real and imaginary parts separately, and use the dominated convergence theorem to justify taking the derivative inside). Integrating by parts (do the real and imaginary parts separately), this is equal to −tϕ(t). The only solution to ϕ′(t) = −tϕ(t) with ϕ(0) = 1 is ϕ(t) = e^{−t²/2}.
(j) Normal with mean µ and variance σ²: Writing X = σZ + µ, where Z is a standard normal, then ϕ_X(t) = e^{iµt} ϕ_Z(σt) = e^{iµt − σ²t²/2}.
(k) Cauchy: We have

ϕ(t) = (1/π) ∫ e^{itx}/(1 + x²) dx.

This is a standard exercise in contour integration in complex analysis. The answer is e^{−|t|}.
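The closed forms above are easy to spot-check numerically (an added illustration): the sketch compares Monte Carlo estimates of E e^{itX} with the formulas for the coin flip and the standard normal at a few arbitrary values of t.

import numpy as np

rng = np.random.default_rng(5)
n = 200_000
coin = rng.choice([-1.0, 1.0], size=n)
normal = rng.standard_normal(n)

for t in [0.5, 1.0, 2.0]:
    mc_coin = np.mean(np.exp(1j * t * coin))      # Monte Carlo E[e^{itX}] for the coin flip
    mc_normal = np.mean(np.exp(1j * t * normal))  # and for the standard normal
    print(t, mc_coin.real, np.cos(t), mc_normal.real, np.exp(-t**2 / 2))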
18 Inversion formula
We need a preliminary real variable lemma, and then we can proceed to the inversion formula, which gives a formula for the distribution function in terms of the characteristic function.

Lemma 18.1. (a) ∫_0^N (sin(Ax)/x) dx → sgn(A) π/2 as N → ∞.
(b) sup_a |∫_0^a (sin(Ax)/x) dx| < ∞.

An alternate proof of (a) is the following. e^{−xy} sin x is integrable on {(x, y) : 0 < x < a, 0 < y < ∞}. So

∫_0^a (sin x/x) dx = ∫_0^a ∫_0^∞ e^{−xy} sin x dy dx
= ∫_0^∞ ∫_0^a e^{−xy} sin x dx dy
= ∫_0^∞ [ (e^{−xy}/(y² + 1)) (−y sin x − cos x) ]_{x=0}^{x=a} dy
= ∫_0^∞ [ (e^{−ay}/(y² + 1)) (−y sin a − cos a) + 1/(y² + 1) ] dy
= π/2 − sin a ∫_0^∞ (y e^{−ay}/(y² + 1)) dy − cos a ∫_0^∞ (e^{−ay}/(y² + 1)) dy.

The last two integrals tend to 0 as a → ∞ since the integrands are bounded by (1 + y)e^{−y} if a ≥ 1.

Theorem 18.2. (Inversion formula) Let µ be a probability measure and let ϕ(t) = ∫ e^{itx} µ(dx). If a < b, then

lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = µ(a, b) + (1/2)µ({a}) + (1/2)µ({b}).

The example where µ is point mass at 0, so ϕ(t) = 1, shows that one needs to take a limit, since the integrand in this case is sin t/t, which is not integrable.
Proof. By Fubini,

∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = ∫_{−T}^T ∫ ((e^{−ita} − e^{−itb})/(it)) e^{itx} µ(dx) dt
= ∫ ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) e^{itx} dt µ(dx).

To justify this, we bound the integrand by the mean value theorem.
Expanding e^{−itb} and e^{−ita} using Euler's formula, and using the fact that cos is an even function and sin is odd, we are left with

∫ 2[ ∫_0^T (sin(t(x − a))/t) dt − ∫_0^T (sin(t(x − b))/t) dt ] µ(dx).

Using Lemma 18.1 and dominated convergence, this tends to

∫ [π sgn(x − a) − π sgn(x − b)] µ(dx).

Dividing by 2π gives the result, since the integrand equals 2π for x ∈ (a, b), π for x = a or x = b, and 0 otherwise.
Theorem 18.3. If ∫ |ϕ(t)| dt < ∞, then µ has a bounded density f and

f(y) = (1/2π) ∫ e^{−ity} ϕ(t) dt.

Proof. We have

µ(a, b) + (1/2)µ({a}) + (1/2)µ({b}) = lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt
= (1/2π) ∫_{−∞}^∞ ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt
≤ ((b − a)/2π) ∫ |ϕ(t)| dt.

Letting b → a shows that µ has no point masses. We now write

µ(x, x + h) = (1/2π) ∫ ((e^{−itx} − e^{−it(x+h)})/(it)) ϕ(t) dt = (1/2π) ∫ ∫_x^{x+h} e^{−ity} dy ϕ(t) dt
= ∫_x^{x+h} (1/2π) ∫ e^{−ity} ϕ(t) dt dy.

So µ has density f(y) = (1/2π) ∫ e^{−ity} ϕ(t) dt. As in the proof of Proposition 17.1, we see f is continuous.
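Theorem 18.3 can be tried numerically (an added sketch; the truncation of the t-integral and the grid are arbitrary): integrating e^{−ity} ϕ(t) for the standard normal characteristic function ϕ(t) = e^{−t²/2} recovers the standard normal density.

import numpy as np

ts = np.linspace(-40, 40, 400_001)       # truncated integration range; phi decays rapidly
phi = np.exp(-ts**2 / 2)                 # characteristic function of a standard normal
dt = ts[1] - ts[0]
for y in [0.0, 0.5, 1.0, 2.0]:
    f_y = (np.exp(-1j * ts * y) * phi).sum().real * dt / (2 * np.pi)   # Riemann sum
    exact = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)
    print(y, f_y, exact)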
A corollary to the inversion formula is the uniqueness theorem.

Theorem 18.4. If ϕ_X = ϕ_Y, then P_X = P_Y.

The following proposition can be proved directly, but the proof using characteristic functions is much easier.

Proposition 18.5. (a) If X and Y are independent, X is a normal with mean a and variance b², and Y is a normal with mean c and variance d², then X + Y is normal with mean a + c and variance b² + d².
(b) If X and Y are independent, X is Poisson with parameter λ_1, and Y is Poisson with parameter λ_2, then X + Y is Poisson with parameter λ_1 + λ_2.
(c) If X_i are i.i.d. Cauchy, then S_n/n is Cauchy.

Proof. For (a),

ϕ_{X+Y}(t) = ϕ_X(t)ϕ_Y(t) = e^{iat − b²t²/2} e^{ict − d²t²/2} = e^{i(a+c)t − (b²+d²)t²/2}.

Now use the uniqueness theorem.
Parts (b) and (c) are proved similarly.
19 Continuity theorem
Lemma 19.1. Suppose ϕ is the characteristic function of a probability µ. Then

µ([−2A, 2A]) ≥ A |∫_{−1/A}^{1/A} ϕ(t) dt| − 1.

Proof. Note

(1/2T) ∫_{−T}^T ϕ(t) dt = (1/2T) ∫_{−T}^T ∫ e^{itx} µ(dx) dt = ∫ ∫ (1/2T) 1_{[−T,T]}(t) e^{itx} dt µ(dx) = ∫ (sin(Tx)/(Tx)) µ(dx).

Since |sin(Tx)/(Tx)| ≤ 1 and |sin(Tx)| ≤ 1, then |sin(Tx)/(Tx)| ≤ 1/(2TA) if |x| ≥ 2A. So

|∫ (sin(Tx)/(Tx)) µ(dx)| ≤ µ([−2A, 2A]) + ∫_{[−2A,2A]^c} (1/(2TA)) µ(dx)
= µ([−2A, 2A]) + (1/(2TA))(1 − µ([−2A, 2A]))
= 1/(2TA) + (1 − 1/(2TA)) µ([−2A, 2A]).

Setting T = 1/A,

(A/2) |∫_{−1/A}^{1/A} ϕ(t) dt| ≤ 1/2 + (1/2) µ([−2A, 2A]).

Now multiply both sides by 2 and rearrange.
Proposition 19.2. If µ_n converges weakly to µ, then ϕ_n converges to ϕ uniformly on every finite interval.

Proof. Let ε > 0 and choose M large so that µ([−M, M]^c) < ε. Define f to be 1 on [−M, M], 0 on [−M − 1, M + 1]^c, and linear in between. Since ∫ f dµ_n → ∫ f dµ, then if n is large enough,

∫ (1 − f) dµ_n ≤ 2ε.

We have

|ϕ_n(t + h) − ϕ_n(t)| ≤ ∫ |e^{ihx} − 1| µ_n(dx) ≤ 2 ∫ (1 − f) dµ_n + |h| ∫ |x| f(x) µ_n(dx) ≤ 4ε + |h|(M + 1).

So for n large enough and |h| ≤ ε/(M + 1), we have

|ϕ_n(t + h) − ϕ_n(t)| ≤ 5ε,

which says that the ϕ_n are equicontinuous. Since ϕ_n(t) → ϕ(t) for each t (as e^{itx} is bounded and continuous), the convergence is uniform on finite intervals.
The interesting result of this section is the converse, Lévy's continuity theorem.

Theorem 19.3. Suppose µ_n are probabilities, ϕ_n(t) converges to a function ϕ(t) for each t, and ϕ is continuous at 0. Then ϕ is the characteristic function of a probability µ and µ_n converges weakly to µ.

Proof. Let ε > 0. Since ϕ is continuous at 0, choose δ small so that

|(1/2δ) ∫_{−δ}^δ ϕ(t) dt − 1| < ε.

Using the dominated convergence theorem, choose N such that

(1/2δ) ∫_{−δ}^δ |ϕ_n(t) − ϕ(t)| dt < ε

if n ≥ N. So if n ≥ N,

|(1/2δ) ∫_{−δ}^δ ϕ_n(t) dt| ≥ |(1/2δ) ∫_{−δ}^δ ϕ(t) dt| − (1/2δ) ∫_{−δ}^δ |ϕ_n(t) − ϕ(t)| dt ≥ 1 − 2ε.

By Lemma 19.1 with A = 1/δ, for such n,

µ_n([−2/δ, 2/δ]) ≥ 2(1 − 2ε) − 1 = 1 − 4ε.

This shows the µ_n are tight.
Let n_j be a subsequence such that µ_{n_j} converges weakly, say to µ. Then ϕ_{n_j}(t) → ϕ_µ(t), hence ϕ(t) = ϕ_µ(t), or ϕ is the characteristic function of a probability µ. If µ′ is any subsequential weak limit point of µ_n, then ϕ_{µ′}(t) = ϕ(t) = ϕ_µ(t); so µ′ must equal µ. Hence µ_n converges weakly to µ.
Proposition 19.4. If E|X|^k < ∞ for an integer k, then ϕ_X has a continuous derivative of order k and

ϕ^{(k)}(t) = ∫ (ix)^k e^{itx} P_X(dx).

In particular, ϕ^{(k)}(0) = i^k EX^k.

Proof. Write

(ϕ(t + h) − ϕ(t))/h = ∫ ((e^{i(t+h)x} − e^{itx})/h) P_X(dx).

The integrand is bounded by |x|. So if ∫ |x| P_X(dx) < ∞, we can use dominated convergence to obtain the desired formula for ϕ′(t). As in the proof of Proposition 17.1, we see ϕ′(t) is continuous. We do the case of general k by induction. Evaluating ϕ^{(k)} at 0 gives the particular case.
Here is a converse.

Proposition 19.5. If ϕ is the characteristic function of a random variable X and ϕ″(0) exists, then E|X|² < ∞.

Proof. Note

(e^{ihx} − 2 + e^{−ihx})/h² = −2(1 − cos hx)/h² ≤ 0,

and 2(1 − cos hx)/h² converges to x² as h → 0. So by Fatou's lemma,

∫ x² P_X(dx) ≤ 2 lim inf_{h→0} ∫ ((1 − cos hx)/h²) P_X(dx) = −lim sup_{h→0} (ϕ(h) − 2ϕ(0) + ϕ(−h))/h² = −ϕ″(0) < ∞.
One nice application of the continuity theorem is a proof of the weak law of large numbers. Its proof is very similar to the proof of the central limit theorem, which we give in the next section.
Another nice use of characteristic functions and martingales is the following.

Proposition 19.6. Suppose X_i is a sequence of independent r.v.s and S_n converges weakly. Then S_n converges almost surely.

Proof. Suppose S_n converges weakly to W. Then ϕ_{S_n}(t) → ϕ_W(t) uniformly on compact sets by Proposition 19.2. Since ϕ_W(0) = 1 and ϕ_W is continuous, there exists δ such that |ϕ_W(t) − 1| < 1/2 if |t| < δ. So for n large, |ϕ_{S_n}(t)| ≥ 1/4 if |t| < δ.
Note

E[e^{itS_n} | X_1, ..., X_{n−1}] = e^{itS_{n−1}} E[e^{itX_n} | X_1, ..., X_{n−1}] = e^{itS_{n−1}} ϕ_{X_n}(t).

Since ϕ_{S_n}(t) = Π_i ϕ_{X_i}(t), it follows that e^{itS_n}/ϕ_{S_n}(t) is a martingale.
Therefore for |t| < δ and n large, e^{itS_n}/ϕ_{S_n}(t) is a bounded martingale, and hence converges almost surely. Since ϕ_{S_n}(t) → ϕ_W(t) ≠ 0, then e^{itS_n} converges almost surely if |t| < δ.
Let A = {(ω, t) ∈ Ω × (−δ, δ) : e^{itS_n(ω)} does not converge}. For each t, we have almost sure convergence, so ∫ 1_A(ω, t) P(dω) = 0. Therefore ∫_{−δ}^δ ∫ 1_A dP dt = 0, and by Fubini, ∫ ∫_{−δ}^δ 1_A dt dP = 0. Hence almost surely, ∫ 1_A(ω, t) dt = 0. This means there exists a set N with P(N) = 0, and if ω ∉ N, then e^{itS_n(ω)} converges for almost every t ∈ (−δ, δ).
If ω ∉ N, by dominated convergence, ∫_0^a e^{itS_n(ω)} dt converges, provided a < δ. Call the limit A_a. Also

∫_0^a e^{itS_n(ω)} dt = (e^{iaS_n(ω)} − 1)/(iS_n(ω))

if S_n(ω) ≠ 0, and equals a otherwise.
Since S_n converges weakly, it is not possible for |S_n| → ∞ with positive probability. If we let N′ = {ω : |S_n(ω)| → ∞} and choose ω ∉ N ∪ N′, there exists a subsequence S_{n_j}(ω) which converges to a finite limit, say R. We can choose a < δ such that e^{iaS_n(ω)} converges and e^{iaR} ≠ 1. Therefore A_a = (e^{iaR} − 1)/(iR), a nonzero quantity. But then

S_n(ω) = (e^{iaS_n(ω)} − 1)/(i ∫_0^a e^{itS_n(ω)} dt) → (lim_{n→∞} e^{iaS_n(ω)} − 1)/(iA_a).

Therefore, except for ω ∈ N ∪ N′, we have that S_n(ω) converges.
20 Central limit theorem
The simplest case of the central limit theorem (CLT) is the case when the X_i are i.i.d., with mean zero and variance one, and then the CLT says that S_n/√n converges weakly to a standard normal. We first prove this case.
We need the fact that if c_n are complex numbers converging to c, then (1 + (c_n/n))^n → e^c. We leave the proof of this to the reader, with the warning that any proof using logarithms needs to be done with some care, since log z is a multi-valued function when z is complex.

Theorem 20.1. Suppose the X_i are i.i.d., mean zero, and variance one. Then S_n/√n converges weakly to a standard normal.

Proof. Since X_1 has finite second moment, ϕ_{X_1} has a continuous second derivative. By Taylor's theorem,

ϕ_{X_1}(t) = ϕ_{X_1}(0) + ϕ′_{X_1}(0) t + ϕ″_{X_1}(0) t²/2 + R(t),

where |R(t)|/t² → 0 as |t| → 0. So

ϕ_{X_1}(t) = 1 − t²/2 + R(t).

Then

ϕ_{S_n/√n}(t) = ϕ_{S_n}(t/√n) = (ϕ_{X_1}(t/√n))^n = [1 − t²/(2n) + R(t/√n)]^n.

Since t/√n converges to zero as n → ∞, we have

ϕ_{S_n/√n}(t) → e^{−t²/2}.

Now apply the continuity theorem.
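A simulation of Theorem 20.1 (added for illustration; the distribution of the X_i is an arbitrary mean-zero, variance-one choice): the fraction of samples of S_n/√n falling in an interval approaches the corresponding standard normal probability as n grows.

import numpy as np
from math import erf

rng = np.random.default_rng(6)
trials, a, b = 50_000, -1.0, 1.0
normal_prob = 0.5 * (erf(b / np.sqrt(2)) - erf(a / np.sqrt(2)))   # P(a <= Z <= b)

for n in [5, 50, 500]:
    # X_i uniform on [-sqrt(3), sqrt(3)]: mean zero, variance one
    x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(trials, n))
    s = x.sum(axis=1) / np.sqrt(n)
    print(n, np.mean((s >= a) & (s <= b)), normal_prob)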
(30)Proposition 20.2 WithXi as above,Sn/
√
nconverges weakly to a standard normal
Proof LetY1, , Ynbe i.i.d standard normal r.v.s that are independent of theXi LetZ1=Y2+· · ·+Yn,
Z2=X1+Y3+· · ·+Yn,Z3=X1+X2+Y4+· · ·+Yn, etc
Let us suppose g ∈ C³ with compact support and let W be a standard normal. Our first goal is to show
|E g(S_n/√n) − E g(W)| → 0.   (20.1)
We have
E g(S_n/√n) − E g(W) = E g(S_n/√n) − E g(Σ_{i=1}^n Y_i/√n) = Σ_{i=1}^n [E g((X_i + Z_i)/√n) − E g((Y_i + Z_i)/√n)].
By Taylor's theorem,
g((X_i + Z_i)/√n) = g(Z_i/√n) + g'(Z_i/√n) X_i/√n + (1/2) g''(Z_i/√n) X_i²/n + R_n,
where |R_n| ≤ ‖g'''‖_∞ |X_i|³/n^{3/2}. Taking expectations and using the independence of X_i and Z_i,
E g((X_i + Z_i)/√n) = E g(Z_i/√n) + (1/(2n)) E g''(Z_i/√n) E X_i² + E R_n.
We have a very similar expression for E g((Y_i + Z_i)/√n), and E X_i² = E Y_i² = 1. Taking the difference,
|E g((X_i + Z_i)/√n) − E g((Y_i + Z_i)/√n)| ≤ ‖g'''‖_∞ (E|X_i|³ + E|Y_i|³)/n^{3/2}.
Summing over i from 1 to n, we have (20.1).
By approximating continuous functions with compact support by C³ functions with compact support, we have (20.1) for such g. Since E(S_n/√n)² = 1, the sequence S_n/√n is tight. So given ε there exists M such that P(|S_n/√n| > M) < ε for all n. By taking M larger if necessary, we also have P(|W| > M) < ε. Suppose g is bounded and continuous. Let ψ be a continuous function with compact support that is bounded by one, is nonnegative, and that equals 1 on [−M, M]. By (20.1) applied to gψ,
|E(gψ)(S_n/√n) − E(gψ)(W)| → 0.
However,
|E g(S_n/√n) − E(gψ)(S_n/√n)| ≤ ‖g‖_∞ P(|S_n/√n| > M) < ε‖g‖_∞,
and similarly
|E g(W) − E(gψ)(W)| < ε‖g‖_∞.
Since ε is arbitrary, this proves (20.1) for bounded continuous g. By Proposition 16.1, this proves our proposition.
Proposition 20.3 Suppose for each n the r.v.s X_{ni}, i = 1, …, n, are i.i.d. Bernoullis with parameter p_n. If np_n → λ and S_n = Σ_{i=1}^n X_{ni}, then S_n converges weakly to a Poisson r.v. with parameter λ.
Proof. We write
ϕ_{S_n}(t) = [ϕ_{X_{n1}}(t)]^n = [1 + p_n(e^{it} − 1)]^n = [1 + (np_n/n)(e^{it} − 1)]^n → e^{λ(e^{it}−1)}.
Now apply the continuity theorem.
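As a quick numerical check, not in the original notes, one can compare Binomial(n, p_n) probabilities with np_n = λ against the Poisson(λ) probabilities; the values n = 500 and λ = 3 below are arbitrary.

    from math import comb, exp, factorial

    n, lam = 500, 3.0
    p = lam / n  # so that n * p_n = lambda exactly

    for k in range(8):
        binom = comb(n, k) * p**k * (1 - p)**(n - k)
        poisson = exp(-lam) * lam**k / factorial(k)
        print(k, round(binom, 5), round(poisson, 5))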
A much more general theorem than Theorem 20.1 is the Lindeberg-Feller theorem.
Theorem 20.4 Suppose for each n, X_{ni}, i = 1, …, n, are mean zero independent random variables. Suppose
(a) Σ_{i=1}^n E X_{ni}² → σ² > 0, and
(b) for each ε, Σ_{i=1}^n E[|X_{ni}|²; |X_{ni}| > ε] → 0.
Let S_n = Σ_{i=1}^n X_{ni}. Then S_n converges weakly to a normal r.v. with mean zero and variance σ².
Note that nothing is said about independence of the X_{ni} for different n.
Let us look at Theorem 20.1 in light of this theorem. Suppose the Y_i are i.i.d. and let X_{ni} = Y_i/√n. Then
Σ_{i=1}^n E(Y_i/√n)² = E Y_1²
and
Σ_{i=1}^n E[|X_{ni}|²; |X_{ni}| > ε] = n E[|Y_1|²/n; |Y_1| > √n ε] = E[|Y_1|²; |Y_1| > √n ε],
which tends to 0 by the dominated convergence theorem. If the Y_i are independent with mean 0, and
Σ_{i=1}^n E|Y_i|³ / (Var S_n)^{3/2} → 0,
then S_n/(Var S_n)^{1/2} converges weakly to a standard normal. This is known as Lyapounov's theorem; we leave the derivation of this from the Lindeberg-Feller theorem as an exercise for the reader.
Proof. Let ϕ_{ni} be the characteristic function of X_{ni} and let σ_{ni}² be the variance of X_{ni}. We need to show
∏_{i=1}^n ϕ_{ni}(t) → e^{−t²σ²/2}.   (20.2)
Using Taylor series, |e^{ib} − 1 − ib + b²/2| ≤ c|b|³ for a constant c. Also,
|e^{ib} − 1 − ib + b²/2| ≤ |e^{ib} − 1 − ib| + |b|²/2 ≤ c|b|².
If we apply this to a random variable tY and take expectations,
|ϕ_Y(t) − (1 + itEY − t²EY²/2)| ≤ c E[t²Y² ∧ t³|Y|³].
Applying this to Y = X_{ni},
|ϕ_{ni}(t) − (1 − t²σ_{ni}²/2)| ≤ c E[t³|X_{ni}|³ ∧ t²|X_{ni}|²].
The right hand side is less than or equal to
c E[t³|X_{ni}|³; |X_{ni}| ≤ ε] + c E[t²|X_{ni}|²; |X_{ni}| > ε] ≤ cεt³ E[|X_{ni}|²] + ct² E[|X_{ni}|²; |X_{ni}| ≥ ε].
Summing over i we obtain
Σ_{i=1}^n |ϕ_{ni}(t) − (1 − t²σ_{ni}²/2)| ≤ cεt³ Σ_i E[|X_{ni}|²] + ct² Σ_i E[|X_{ni}|²; |X_{ni}| ≥ ε].
We need the following inequality: if |a_i|, |b_i| ≤ 1, then
|∏_{i=1}^n a_i − ∏_{i=1}^n b_i| ≤ Σ_{i=1}^n |a_i − b_i|.
To prove this, note
∏ a_i − ∏ b_i = (a_n − b_n) ∏_{i<n} b_i + a_n (∏_{i<n} a_i − ∏_{i<n} b_i)
and use induction.
Note |ϕ_{ni}(t)| ≤ 1 and |1 − t²σ_{ni}²/2| ≤ 1, because σ_{ni}² ≤ ε² + E[|X_{ni}|²; |X_{ni}| > ε] < 1/t² if we take ε small enough and n large enough. So
|∏_{i=1}^n ϕ_{ni}(t) − ∏_{i=1}^n (1 − t²σ_{ni}²/2)| ≤ cεt³ Σ_i E[|X_{ni}|²] + ct² Σ_i E[|X_{ni}|²; |X_{ni}| ≥ ε].
Since sup_i σ_{ni}² → 0, log(1 − t²σ_{ni}²/2) is asymptotically equal to −t²σ_{ni}²/2, and so
∏_i (1 − t²σ_{ni}²/2) = exp(Σ_i log(1 − t²σ_{ni}²/2))
is asymptotically equal to
exp(−t² Σ_i σ_{ni}²/2) = e^{−t²σ²/2}. Since ε is arbitrary, the proof is complete.
We now complete the proof of Theorem 8.2.
Proof of the "only if" part of Theorem 8.2. Since Σ X_n converges, then X_n must converge to zero a.s., and so P(|X_n| > A i.o.) = 0. By the Borel-Cantelli lemma, this says Σ P(|X_n| > A) < ∞. We also conclude by Proposition 5.4 that Σ Y_n converges.
Let c_n = Σ_{i=1}^n Var Y_i and suppose c_n → ∞. Let Z_{nm} = (Y_m − EY_m)/√c_n. Then Σ_{m=1}^n Var Z_{nm} = (1/c_n) Σ_{m=1}^n Var Y_m = 1. If ε > 0, then for n large we have 2A/√c_n < ε. Since |Y_m| ≤ A and hence |EY_m| ≤ A, then |Z_{nm}| ≤ 2A/√c_n < ε. It follows that Σ_{m=1}^n E(|Z_{nm}|²; |Z_{nm}| > ε) = 0 for large n. By Theorem 20.4, Σ_{m=1}^n (Y_m − EY_m)/√c_n converges weakly to a standard normal. However, Σ_{m=1}^n Y_m converges, and c_n → ∞, so Σ Y_m/√c_n must converge to 0. The quantities Σ EY_m/√c_n are nonrandom, so there is no way the difference can converge to a standard normal, a contradiction. We conclude c_n does not converge to infinity.
Let V_i = Y_i − EY_i. Since |V_i| ≤ 2A, EV_i = 0, and Var V_i = Var Y_i, which is summable, by the "if" part of the three series criterion, Σ V_i converges. Since Σ Y_i converges, taking the difference shows Σ EY_i converges.
21 Framework for Markov chains
Suppose S is a set with some topological structure that we will use as our state space. Think of S as being R^d or the positive integers, for example. A sequence of random variables X_0, X_1, … is a Markov chain if
P(X_{n+1} ∈ A | X_0, …, X_n) = P(X_{n+1} ∈ A | X_n)   (21.1)
for all n and all measurable sets A. The definition of a Markov chain has this intuition: to predict the probability that X_{n+1} is in any set, we only need to know where we currently are; how we got there gives no additional information.
Let's make some additional comments. First of all, we previously considered random variables as mappings from Ω to R. Now we want to extend our definition by allowing a random variable to be a map X from Ω to S, where (X ∈ A) is F-measurable for all open sets A. This agrees with the definition of r.v. in the case S = R.
Although there is quite a theory developed for Markov chains with arbitrary state spaces, we will confine our attention to the case where either S is finite, in which case we will usually suppose S = {1, 2, …, n}, or countable and discrete, in which case we will usually suppose S is the set of positive integers.
We are going to further restrict our attention to Markov chains where
P(X_{n+1} ∈ A | X_n = x) = P(X_1 ∈ A | X_0 = x),
that is, where the probabilities do not depend on n. Such Markov chains are said to have stationary transition probabilities.
Define the initial distribution of a Markov chain with stationary transition probabilities by µ(i) = P(X_0 = i). Define the transition probabilities by p(i, j) = P(X_{n+1} = j | X_n = i). Since the transition probabilities are stationary, p(i, j) does not depend on n.
In this case we can use the definition of conditional probability given in undergraduate classes. If P(X_n = i) = 0 for all n, that means we never visit i and we could drop the point i from the state space.
Proposition 21.1 Let X be a Markov chain with initial distribution µ and transition probabilities p(i, j). Then
P(X_n = i_n, X_{n−1} = i_{n−1}, …, X_1 = i_1, X_0 = i_0) = µ(i_0) p(i_0, i_1) ··· p(i_{n−1}, i_n).   (21.2)
Proof. We use induction on n. It is clearly true for n = 0 by the definition of µ(i). Suppose it holds for n; we need to show it holds for n + 1. For simplicity, we will do the case n = 2. Then
P(X_3 = i_3, X_2 = i_2, X_1 = i_1, X_0 = i_0)
= E[P(X_3 = i_3 | X_0 = i_0, X_1 = i_1, X_2 = i_2); X_2 = i_2, X_1 = i_1, X_0 = i_0]
= E[P(X_3 = i_3 | X_2 = i_2); X_2 = i_2, X_1 = i_1, X_0 = i_0]
= p(i_2, i_3) P(X_2 = i_2, X_1 = i_1, X_0 = i_0).
Now by the induction hypothesis,
P(X_2 = i_2, X_1 = i_1, X_0 = i_0) = p(i_1, i_2) p(i_0, i_1) µ(i_0).
Substituting establishes the claim for n = 2.
Proposition 21.2 Suppose µ(i) is a sequence of nonnegative numbers with Σ_i µ(i) = 1, and for each i the sequence p(i, j) is nonnegative and sums to 1. Then there exists a Markov chain with µ(i) as its initial distribution and p(i, j) as the transition probabilities.
Proof. Define Ω = S^∞. Let F be the σ-field generated by the cylinder sets {ω : ω_0 = i_0, …, ω_n = i_n}, where n > 0 and i_j ∈ S. An element ω of Ω is a sequence (i_0, i_1, …). Define X_j(ω) = i_j if ω = (i_0, i_1, …). Define P(X_0 = i_0, …, X_n = i_n) by (21.2). Using the Kolmogorov extension theorem, one can show that P can be extended to a probability on Ω.
The above framework is rather abstract, but it is clear that under P the sequence X_n has initial distribution µ(i); what we need to show is that X_n is a Markov chain and that
P(X_{n+1} = i_{n+1} | X_0 = i_0, …, X_n = i_n) = P(X_{n+1} = i_{n+1} | X_n = i_n) = p(i_n, i_{n+1}).   (21.3)
By the definition of conditional probability, the left hand side of (21.3) is
P(X_{n+1} = i_{n+1} | X_0 = i_0, …, X_n = i_n) = P(X_{n+1} = i_{n+1}, X_n = i_n, …, X_0 = i_0) / P(X_n = i_n, …, X_0 = i_0)   (21.4)
= [µ(i_0) ··· p(i_{n−1}, i_n) p(i_n, i_{n+1})] / [µ(i_0) ··· p(i_{n−1}, i_n)] = p(i_n, i_{n+1}),
as desired.
To complete the proof we need to show
P(X_{n+1} = i_{n+1}, X_n = i_n) / P(X_n = i_n) = p(i_n, i_{n+1}),
or
P(X_{n+1} = i_{n+1}, X_n = i_n) = p(i_n, i_{n+1}) P(X_n = i_n).   (21.5)
Now
P(X_n = i_n) = Σ_{i_0, …, i_{n−1}} P(X_n = i_n, X_{n−1} = i_{n−1}, …, X_0 = i_0) = Σ_{i_0, …, i_{n−1}} µ(i_0) ··· p(i_{n−1}, i_n),
and similarly
P(X_{n+1} = i_{n+1}, X_n = i_n) = Σ_{i_0, …, i_{n−1}} P(X_{n+1} = i_{n+1}, X_n = i_n, X_{n−1} = i_{n−1}, …, X_0 = i_0) = p(i_n, i_{n+1}) Σ_{i_0, …, i_{n−1}} µ(i_0) ··· p(i_{n−1}, i_n).
Equation (21.5) now follows.
Note in this construction that the X_n sequence is fixed and does not depend on µ or p. Let p(i, j) be fixed. The probability we constructed above is often denoted P^µ. If µ is point mass at a point i or x, it is denoted P^i or P^x. So we have one probability space, one sequence X_n, but a whole family of probabilities.
Later on we will see that this framework allows one to express the Markov property and strong Markov property in a convenient way. As part of the preparation for doing this, we define the shift operators θ_k : Ω → Ω by
θ_k(i_0, i_1, …) = (i_k, i_{k+1}, …).
Then X_j ∘ θ_k = X_{j+k}. To see this, if ω = (i_0, i_1, …), then
X_j ∘ θ_k(ω) = X_j(i_k, i_{k+1}, …) = i_{j+k} = X_{j+k}(ω).
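To make the construction concrete, here is a minimal sketch, not from the notes, of how one might sample a trajectory of a chain on a finite state space {0, 1, …, N−1} given an initial distribution µ and a transition matrix P whose rows sum to 1; the function name sample_chain is made up for illustration.

    import numpy as np

    def sample_chain(mu, P, steps, rng=None):
        # returns X_0, ..., X_steps with X_0 ~ mu and X_{n+1} ~ p(X_n, .)
        rng = rng or np.random.default_rng()
        states = np.arange(len(mu))
        x = rng.choice(states, p=mu)
        path = [int(x)]
        for _ in range(steps):
            x = rng.choice(states, p=P[x])
            path.append(int(x))
        return path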
22 Examples
Random walk on the integers
We let Y_i be an i.i.d. sequence of r.v.'s, with p = P(Y_i = 1) and 1 − p = P(Y_i = −1). Let X_n = X_0 + Σ_{i=1}^n Y_i. Then the X_n can be viewed as a Markov chain with p(i, i+1) = p, p(i, i−1) = 1 − p, and p(i, j) = 0 if |j − i| ≠ 1. More general random walks on the integers also fit into this framework. To check that the random walk is Markov,
P(X_{n+1} = i_{n+1} | X_0 = i_0, …, X_n = i_n) = P(X_{n+1} − X_n = i_{n+1} − i_n | X_0 = i_0, …, X_n = i_n) = P(X_{n+1} − X_n = i_{n+1} − i_n),
using the independence, while
P(X_{n+1} = i_{n+1} | X_n = i_n) = P(X_{n+1} − X_n = i_{n+1} − i_n | X_n = i_n) = P(X_{n+1} − X_n = i_{n+1} − i_n).
Random walks on graphs
Suppose we have n points, and from each point there is some probability of going to another point. For example, suppose there are 5 points and we have p(1,2) = 1/2, p(1,3) = 1/2, p(2,1) = 1/4, p(2,3) = 1/2, p(2,5) = 1/4, p(3,1) = 1/4, p(3,2) = 1/4, p(3,3) = 1/2, p(4,1) = 1, p(5,1) = 1/2, p(5,5) = 1/2. The p(i, j) are often arranged into a matrix:
P =
[ 0    1/2  1/2  0   0   ]
[ 1/4  0    1/2  0   1/4 ]
[ 1/4  1/4  1/2  0   0   ]
[ 1    0    0    0   0   ]
[ 1/2  0    0    0   1/2 ]
Note the rows must sum to 1 since
Σ_{j=1}^5 p(i, j) = Σ_{j=1}^5 P(X_1 = j | X_0 = i) = P(X_1 ∈ S | X_0 = i) = 1.
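As an illustration, not in the original notes, the matrix above can be entered directly and its row sums checked; it can also be fed to the sampler sketched at the end of the previous section.

    import numpy as np

    P = np.array([
        [0,   1/2, 1/2, 0, 0  ],
        [1/4, 0,   1/2, 0, 1/4],
        [1/4, 1/4, 1/2, 0, 0  ],
        [1,   0,   0,   0, 0  ],
        [1/2, 0,   0,   0, 1/2],
    ])
    print(P.sum(axis=1))  # each row sums to 1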
Renewal processes
Let Y_i be i.i.d. with P(Y_i = k) = a_k, where the a_k are nonnegative and sum to 1. Let T_0 = i_0 and T_n = T_0 + Σ_{i=1}^n Y_i. We think of Y_i as the lifetime of the ith light bulb and T_n as the time when the nth light bulb burns out. Let X_n be the amount of time after time n until the current light bulb burns out.
If X_n = j and j > 0, then T_i = n + j for some i but T_i does not equal n, n+1, …, n+j−1 for any i. So T_i = (n+1) + (j−1) for some i and T_i does not equal (n+1), (n+1)+1, …, (n+1)+(j−2) for any i. Therefore X_{n+1} = j − 1. So p(i, i−1) = 1 if i ≥ 1.
If X_n = 0, then a light bulb burned out at time n, and X_{n+1} is 0 if the next light bulb burns out immediately and is j − 1 if the next light bulb has lifetime j. The probability of the latter is a_j. So p(0, j) = a_{j+1}. All the other p(i, j)'s are 0.
Branching processes
Consider k particles. At the next time interval, some of them die, and some of them split into several particles. The probability that a given particle will split into j particles is given by a_j, j = 0, 1, …, where the a_j are nonnegative and sum to 1. The behavior of each particle is independent of the behavior of all the other particles. If X_n is the number of particles at time n, then X_n is a Markov chain. Let Y_i be i.i.d. random variables with P(Y_i = j) = a_j. The p(i, j) for X_n are somewhat complicated, and can be defined by p(i, j) = P(Σ_{m=1}^i Y_m = j).
Queues
We will discuss briefly the M/G/1 queue. The M refers to the fact that the customers arrive according to a Poisson process, so the probability that the number of customers arriving in a time interval of length t is k is given by e^{−λt}(λt)^k/k!. The G refers to the fact that the length of time it takes to serve a customer is given by a distribution that is not necessarily exponential. The 1 refers to the fact that there is 1 server. Suppose the length of time to serve one customer has distribution function F with density f. The probability that k customers arrive during the time it takes to serve one customer is
a_k = ∫_0^∞ e^{−λt} (λt)^k/k! f(t) dt.
Let the Y_i be i.i.d. with P(Y_i = k − 1) = a_k. So Y_i is the number of customers arriving during the time it takes to serve one customer, minus the one customer whose service is completed. Let X_{n+1} = (X_n + Y_{n+1})^+ be the number of customers waiting. Then X_n is a Markov chain with p(0,0) = a_0 + a_1 and p(j, j − 1 + k) = a_k if j ≥ 1, k ≥ 0.
Ehrenfest urns
Suppose we have two urns with a total of r balls, k in one and r − k in the other. Pick one of the r balls at random and move it to the other urn. Let X_n be the number of balls in the first urn. X_n is a Markov chain with p(k, k+1) = (r−k)/r, p(k, k−1) = k/r, and p(i, j) = 0 otherwise.
One model for this is to consider two containers of air with a thin tube connecting them. Suppose a few molecules of a foreign substance are introduced. Then the number of molecules in the first container is like an Ehrenfest urn. We shall see that all states in this model are recurrent, so infinitely often all the molecules of the foreign substance will be in the first urn. Yet there is a tendency towards equilibrium, so on average there will be about the same number of molecules in each container for all large times.
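A short simulation, included here only as an illustrative sketch and not part of the notes, shows the tendency toward equilibrium in the Ehrenfest model: with r balls the chain spends most of its time near r/2, even though every state is visited infinitely often. The value r = 20 and the run length are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    r, steps = 20, 100_000
    k = 0                      # balls in the first urn
    visits = np.zeros(r + 1)
    for _ in range(steps):
        # move to k+1 with probability (r-k)/r, otherwise to k-1
        k = k + 1 if rng.random() < (r - k) / r else k - 1
        visits[k] += 1
    print("time-average of X_n:", (visits * np.arange(r + 1)).sum() / steps)  # close to r/2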
Birth and death processes
Suppose there are i particles, and the probability of a birth is a_i, the probability of a death is b_i, where a_i, b_i ≥ 0, a_i + b_i ≤ 1. Setting X_n equal to the number of particles, then X_n is a Markov chain with p(i, i+1) = a_i, p(i, i−1) = b_i, and p(i, i) = 1 − a_i − b_i.
23 Markov properties
A special case of the Markov property says that
E^x[f(X_{n+1}) | F_n] = E^{X_n} f(X_1).
The right hand side is to be interpreted as ϕ(X_n), where ϕ(y) = E^y f(X_1). The randomness on the right hand side all comes from the X_n. If we write f(X_{n+1}) = f(X_1) ∘ θ_n and we write Y for f(X_1), then the above can be rewritten
E^x[Y ∘ θ_n | F_n] = E^{X_n} Y.
Let F_∞ be the σ-field generated by ∪_{n=1}^∞ F_n.
Theorem 23.1 (Markov property) If Y is bounded and measurable with respect to F_∞, then
E^x[Y ∘ θ_n | F_n] = E^{X_n}[Y], P^x-a.s.,
for each n and x.
Proof. If we can prove this for Y = f_1(X_1) ··· f_m(X_m), then taking f_j(x) = 1_{i_j}(x), we will have it for Y's of the form 1_{(X_1=i_1, …, X_m=i_m)}. By linearity (and the fact that S is countable), we will then have it for Y's of the form 1_{((X_1, …, X_m) ∈ B)}. A monotone class argument shows that such Y's generate F_∞.
We use induction on m, and first we prove it for m = 1. We need to show
E[f_1(X_{n+1}) | F_n] = E^{X_n} f_1(X_1).
Using linearity and the fact that S is countable, it suffices to show this for f_1(y) = 1_{{j}}(y). Using the definition of θ_n, we need to show
P(X_{n+1} = j | F_n) = P^{X_n}(X_1 = j),
or equivalently,
P^x(X_{n+1} = j; A) = E^x[P^{X_n}(X_1 = j); A]   (23.2)
when A ∈ F_n. By linearity it suffices to consider A of the form A = (X_1 = i_1, …, X_n = i_n). The left hand side of (23.2) is then
P^x(X_{n+1} = j, X_1 = i_1, …, X_n = i_n),
and by (21.4) this is equal to
p(i_n, j) P^x(X_1 = i_1, …, X_n = i_n) = p(i_n, j) P^x(A).
Let g(y) = P^y(X_1 = j). We have
P^x(X_1 = j, X_0 = k) = P^k(X_1 = j) if x = k, and 0 if x ≠ k,
while
E^x[g(X_0); X_0 = k] = E^x[g(k); X_0 = k] = P^k(X_1 = j) P^x(X_0 = k) = P^k(X_1 = j) if x = k, and 0 if x ≠ k.
It follows that
p(i, j) = P^x(X_1 = j | X_0 = i) = P^i(X_1 = j).
So the right hand side of (23.2) is
E^x[g(X_n); A] = g(i_n) P^x(A) = p(i_n, j) P^x(A),
as required.
Suppose the result holds for m and we want to show it holds for m + 1. We have
E^x[f_1(X_{n+1}) ··· f_{m+1}(X_{n+m+1}) | F_n]
= E^x[ E^x[f_{m+1}(X_{n+m+1}) | F_{n+m}] f_1(X_{n+1}) ··· f_m(X_{n+m}) | F_n]
= E^x[ E^{X_{n+m}}[f_{m+1}(X_1)] f_1(X_{n+1}) ··· f_m(X_{n+m}) | F_n]
= E^x[ f_1(X_{n+1}) ··· f_{m−1}(X_{n+m−1}) h(X_{n+m}) | F_n].
Here we used the result for m = 1, and we defined h(y) = f_m(y) g(y), where g(y) = E^y[f_{m+1}(X_1)]. Using the induction hypothesis, this is equal to
E^{X_n}[f_1(X_1) ··· f_{m−1}(X_{m−1}) h(X_m)] = E^{X_n}[f_1(X_1) ··· f_m(X_m) E^{X_m} f_{m+1}(X_1)]
= E^{X_n}[f_1(X_1) ··· f_m(X_m) E[f_{m+1}(X_{m+1}) | F_m]]
= E^{X_n}[f_1(X_1) ··· f_{m+1}(X_{m+1})],
which is what we needed.
Define θ_N(ω) = (θ_{N(ω)})(ω). The strong Markov property is the same as the Markov property, but where the fixed time n is replaced by a stopping time N.
Theorem 23.2 If Y is bounded and measurable and N is a finite stopping time, then
E^x[Y ∘ θ_N | F_N] = E^{X_N}[Y].
Proof. We will show
P^x(X_{N+1} = j | F_N) = P^{X_N}(X_1 = j).
Once we have this, we can proceed as in the proof of Theorem 23.1 to obtain our result. To show the above equality, we need to show that if B ∈ F_N, then
P^x(X_{N+1} = j, B) = E^x[P^{X_N}(X_1 = j); B].   (23.3)
Recall that since B ∈ F_N, then B ∩ (N = k) ∈ F_k. We have
P^x(X_{N+1} = j, B, N = k) = P^x(X_{k+1} = j, B, N = k)
= E^x[P^x(X_{k+1} = j | F_k); B, N = k]
= E^x[P^{X_k}(X_1 = j); B, N = k]
= E^x[P^{X_N}(X_1 = j); B, N = k].
Now sum over k; since N is finite, we obtain our desired result.
Another way of expressing the Markov property is through the Chapman-Kolmogorov equations. Let p_n(i, j) = P(X_n = j | X_0 = i).
Proposition 23.3 For all i, j, m, n we have
p_{n+m}(i, j) = Σ_{k∈S} p_n(i, k) p_m(k, j).
Proof. We write
P(X_{n+m} = j, X_0 = i) = Σ_k P(X_{n+m} = j, X_n = k, X_0 = i)
= Σ_k P(X_{n+m} = j | X_n = k, X_0 = i) P(X_n = k | X_0 = i) P(X_0 = i)
= Σ_k P(X_{n+m} = j | X_n = k) p_n(i, k) P(X_0 = i)
= Σ_k p_m(k, j) p_n(i, k) P(X_0 = i).
If we divide both sides by P(X_0 = i), we have our result.
Note the resemblance to matrix multiplication. It is clear that if P is the matrix made up of the p(i, j), then P^n will be the matrix whose (i, j) entry is p_n(i, j).
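The matrix interpretation can be checked numerically; the sketch below, not part of the notes, verifies the Chapman-Kolmogorov identity P^{n+m} = P^n P^m for the 5-state matrix of Section 22.

    import numpy as np

    P = np.array([
        [0,   1/2, 1/2, 0, 0  ],
        [1/4, 0,   1/2, 0, 1/4],
        [1/4, 1/4, 1/2, 0, 0  ],
        [1,   0,   0,   0, 0  ],
        [1/2, 0,   0,   0, 1/2],
    ])
    P2 = np.linalg.matrix_power(P, 2)
    P3 = np.linalg.matrix_power(P, 3)
    # p_{2+3}(i, j) = sum_k p_2(i, k) p_3(k, j)
    print(np.allclose(np.linalg.matrix_power(P, 5), P2 @ P3))  # True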
24 Recurrence and transience
Let
T_y = min{i > 0 : X_i = y}.
This is the first time that the chain hits the point y. Even if X_0 = y we would have T_y > 0. We let T_y^k be the k-th time that the Markov chain hits y, and we set
r(x, y) = P^x(T_y < ∞),
the probability starting at x that the Markov chain ever hits y.
Proposition 24.1 P^x(T_y^k < ∞) = r(x, y) r(y, y)^{k−1}.
Proof. The case k = 1 is just the definition, so suppose k > 1. Using the strong Markov property,
P^x(T_y^k < ∞) = P^x(T_y ∘ θ_{T_y^{k−1}} < ∞, T_y^{k−1} < ∞)
= E^x[P^x(T_y ∘ θ_{T_y^{k−1}} < ∞ | F_{T_y^{k−1}}); T_y^{k−1} < ∞]
= E^x[P^{X(T_y^{k−1})}(T_y < ∞); T_y^{k−1} < ∞]
= E^x[P^y(T_y < ∞); T_y^{k−1} < ∞]
= r(y, y) P^x(T_y^{k−1} < ∞).
We used here the fact that at time T_y^{k−1} the Markov chain must be at the point y. Repeating this argument k − 2 times yields the result.
We say that y is recurrent if r(y, y) = 1; otherwise we say y is transient. Let
N(y) = Σ_{n=1}^∞ 1_{(X_n=y)}.
Proposition 24.2 y is recurrent if and only if E^y N(y) = ∞.
Proof. Note
E^y N(y) = Σ_{k=1}^∞ P^y(N(y) ≥ k) = Σ_{k=1}^∞ P^y(T_y^k < ∞) = Σ_{k=1}^∞ r(y, y)^k.
We used the fact that N(y) is the number of visits to y and that the number of visits being at least k is the same as the time of the k-th visit being finite. Since r(y, y) ≤ 1, the left hand side will be finite if and only if r(y, y) < 1.
Observe that
E^y N(y) = Σ_n P^y(X_n = y) = Σ_n p_n(y, y).
If we consider simple symmetric random walk on the integers, then p_n(0,0) is 0 if n is odd and equal to (n choose n/2) 2^{−n} if n is even. This is because in order to be at 0 after n steps, the walk must have had n/2 positive steps and n/2 negative steps; the probability of this is given by the binomial distribution. Using Stirling's approximation, we see that p_n(0,0) ∼ c/√n for n even, so the sum Σ_n p_n(0,0) diverges, and so simple random walk in one dimension is recurrent.
Similar arguments show that simple symmetric random walk is also recurrent in 2 dimensions but transient in 3 or more dimensions.
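The divergence of Σ_n p_n(0,0) in one dimension is easy to check numerically; the following sketch, not in the notes, compares the exact binomial formula with Stirling's approximation and prints the slowly growing partial sums. The cutoffs are arbitrary.

    from math import comb, sqrt, pi

    def p00(n):
        # p_n(0,0) for simple symmetric random walk on the integers
        return 0.0 if n % 2 else comb(n, n // 2) * 2.0 ** (-n)

    for N in (10, 100, 1000):
        partial = sum(p00(k) for k in range(1, N + 1))
        print(N, round(p00(N), 5), round(sqrt(2 / (pi * N)), 5), round(partial, 2))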
Proposition 24.3 If x is recurrent and r(x, y) > 0, then y is recurrent and r(y, x) = 1.
Proof. First we show r(y, x) = 1. Suppose not. Since r(x, y) > 0, there is a smallest n and y_1, …, y_{n−1} such that p(x, y_1) p(y_1, y_2) ··· p(y_{n−1}, y) > 0. Since this is the smallest n, none of the y_i can equal x. Then
P^x(T_x = ∞) ≥ p(x, y_1) ··· p(y_{n−1}, y)(1 − r(y, x)) > 0,
a contradiction to x being recurrent.
Next we show that y is recurrent. Since r(y, x) > 0, there exists L such that p_L(y, x) > 0, and since r(x, y) > 0, there exists K such that p_K(x, y) > 0. Then
p_{L+n+K}(y, y) ≥ p_L(y, x) p_n(x, x) p_K(x, y).
Summing over n,
Σ_n p_{L+n+K}(y, y) ≥ p_L(y, x) p_K(x, y) Σ_n p_n(x, x) = ∞.
We say a subset C of S is closed if x ∈ C and r(x, y) > 0 imply y ∈ C. A subset D is irreducible if x, y ∈ D implies r(x, y) > 0.
Proposition 24.4 Let C be finite and closed. Then C contains a recurrent state.
From this and the preceding proposition, if C is also irreducible, then all states in C will be recurrent.
Proof. If not, for all y we have r(y, y) < 1 and
E^x N(y) = Σ_{k=1}^∞ r(x, y) r(y, y)^{k−1} = r(x, y)/(1 − r(y, y)) < ∞.
Since C is finite, then Σ_{y∈C} E^x N(y) < ∞. But that is a contradiction since
Σ_{y∈C} E^x N(y) = Σ_{y∈C} Σ_n p_n(x, y) = Σ_n Σ_{y∈C} p_n(x, y) = Σ_n P^x(X_n ∈ C) = Σ_n 1 = ∞,
using the fact that C is closed and x ∈ C, so that X_n ∈ C for all n.
Theorem 24.5 Let R = {x : r(x, x) = 1}, the set of recurrent states. Then R = ∪_{i=1}^∞ R_i, where each R_i is closed and irreducible.
Proof. Say x ∼ y if r(x, y) > 0. Since every state in R is recurrent, x ∼ x, and if x ∼ y, then y ∼ x. If x ∼ y and y ∼ z, then p_n(x, y) > 0 and p_m(y, z) > 0 for some n and m. Then p_{n+m}(x, z) > 0, or x ∼ z. Therefore we have an equivalence relation and we let the R_i be the equivalence classes.
Looking at our examples, it is easy to see that in the Ehrenfest urn model all states are recurrent. For the branching process model, suppose p(x, 0) > 0 for all x. Then 0 is recurrent and all the other states are transient. In the renewal chain, there are two cases. If {k : a_k > 0} is unbounded, all states are recurrent. If K = max{k : a_k > 0}, then {0, 1, …, K − 1} are recurrent states and the rest are transient.
For the queueing model, let µ = Σ_k k a_k, the expected number of people arriving during one customer's service time. We may view this as a branching process by letting all the customers arriving during one person's service time be considered the progeny of that customer. It turns out that if µ ≤ 1, 0 is recurrent and all other states are also. If µ > 1, all states are transient.
25 Stationary measures
A probability µ is a stationary distribution if
Σ_x µ(x) p(x, y) = µ(y).   (25.1)
In matrix notation this is µP = µ, or µ is a left eigenvector corresponding to the eigenvalue 1. In the case of a stationary distribution, P^µ(X_1 = y) = µ(y), which implies that X_1, X_2, … all have the same distribution.
We can use (25.1) when µ is a measure rather than a probability, in which case it is called a stationary measure.
If we have a random walk on the integers, µ(x) = 1 for all x serves as a stationary measure. In the case of an asymmetric random walk, p(i, i+1) = p, p(i, i−1) = q = 1 − p with p ≠ q, setting µ(x) = (p/q)^x also works.
In the Ehrenfest urn model, µ(x) = 2^{−r} (r choose x) works. One way to see this is that µ is the distribution one gets if one flips r coins and puts a coin in the first urn when the coin is heads. A transition corresponds to picking a coin at random and turning it over.
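The Ehrenfest claim is easy to verify numerically; the sketch below, not part of the notes, builds the transition matrix for r = 6 (an arbitrary choice) and checks that the binomial measure satisfies µP = µ.

    import numpy as np
    from math import comb

    r = 6
    P = np.zeros((r + 1, r + 1))
    for k in range(r + 1):
        if k < r:
            P[k, k + 1] = (r - k) / r
        if k > 0:
            P[k, k - 1] = k / r

    mu = np.array([comb(r, x) for x in range(r + 1)], dtype=float) / 2**r
    print(np.allclose(mu @ P, mu))  # True: mu is stationary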
Proposition 25.1 Let a be recurrent and let T = T_a. Set
µ(y) = E^a Σ_{n=0}^{T−1} 1_{(X_n=y)}.
Then µ is a stationary measure.
The idea of the proof is that µ(y) is the expected number of visits to y by the sequence X_0, …, X_{T−1}, while µP is the expected number of visits to y by X_1, …, X_T. These should be the same because X_T = X_0 = a.
Proof. First, let p̄_n(a, y) = P^a(X_n = y, T > n). So
µ(y) = Σ_{n=0}^∞ P^a(X_n = y, T > n) = Σ_{n=0}^∞ p̄_n(a, y),
and
Σ_y µ(y) p(y, z) = Σ_y Σ_{n=0}^∞ p̄_n(a, y) p(y, z).
Second, we consider the case z ≠ a. Then
Σ_y p̄_n(a, y) p(y, z) = Σ_y P^a(hit y in n steps without first hitting a, and then go to z in one step) = p̄_{n+1}(a, z).
So
Σ_y µ(y) p(y, z) = Σ_n Σ_y p̄_n(a, y) p(y, z) = Σ_{n=0}^∞ p̄_{n+1}(a, z) = Σ_{n=0}^∞ p̄_n(a, z) = µ(z),
since p̄_0(a, z) = 0.
Third, we consider the case z = a. Then
Σ_y p̄_n(a, y) p(y, z) = Σ_y P^a(hit y in n steps without first hitting a, and then go to z in one step) = P^a(T = n + 1).
Recall P^a(T = 0) = 0, and since a is recurrent, T < ∞. So
Σ_y µ(y) p(y, z) = Σ_n Σ_y p̄_n(a, y) p(y, z) = Σ_{n=0}^∞ P^a(T = n + 1) = Σ_{n=0}^∞ P^a(T = n) = 1.
On the other hand,
Σ_{n=0}^{T−1} 1_{(X_n=a)} = 1_{(X_0=a)} = 1,
hence µ(a) = 1. Therefore, whether z ≠ a or z = a, we have µP(z) = µ(z).
Finally, we show µ(y) < ∞. If r(a, y) = 0, then µ(y) = 0. If r(a, y) > 0, then since a is recurrent, r(y, a) > 0 by Proposition 24.3, so we may choose n with p_n(y, a) > 0. Then
1 = µ(a) = Σ_x µ(x) p_n(x, a) ≥ µ(y) p_n(y, a),
which implies µ(y) < ∞.
Proposition 25.2 If the Markov chain is irreducible and all states are recurrent, then the stationary measure is unique up to a constant multiple.
Proof. Fix a ∈ S. Let µ_a be the stationary measure constructed above and let ν be any other stationary measure.
Since ν = νP, then
ν(z) = ν(a) p(a, z) + Σ_{y≠a} ν(y) p(y, z)
= ν(a) p(a, z) + Σ_{y≠a} ν(a) p(a, y) p(y, z) + Σ_{x≠a} Σ_{y≠a} ν(x) p(x, y) p(y, z)
= ν(a) P^a(X_1 = z) + ν(a) P^a(X_1 ≠ a, X_2 = z) + P^ν(X_0 ≠ a, X_1 ≠ a, X_2 = z).
Continuing,
ν(z) = ν(a) Σ_{m=1}^n P^a(X_1 ≠ a, X_2 ≠ a, …, X_{m−1} ≠ a, X_m = z) + P^ν(X_0 ≠ a, X_1 ≠ a, …, X_{n−1} ≠ a, X_n = z)
≥ ν(a) Σ_{m=1}^n P^a(X_1 ≠ a, X_2 ≠ a, …, X_{m−1} ≠ a, X_m = z).
Letting n → ∞, we obtain
ν(z) ≥ ν(a) µ_a(z).
We have
ν(a) = Σ_x ν(x) p_n(x, a) ≥ ν(a) Σ_x µ_a(x) p_n(x, a) = ν(a) µ_a(a) = ν(a),
since µ_a(a) = 1 (see the proof of Proposition 25.1). This means that we have equality, and so
ν(x) = ν(a) µ_a(x)
whenever p_n(x, a) > 0. Since r(x, a) > 0, this happens for some n. Consequently
ν(x)/ν(a) = µ_a(x).
Proposition 25.3 If a stationary distribution exists, then µ(y) > 0 implies y is recurrent.
Proof. If µ(y) > 0, then
∞ = Σ_{n=1}^∞ µ(y) = Σ_{n=1}^∞ Σ_x µ(x) p_n(x, y) = Σ_x µ(x) Σ_{n=1}^∞ p_n(x, y)
= Σ_x µ(x) Σ_{n=1}^∞ P^x(X_n = y) = Σ_x µ(x) E^x N(y)
= Σ_x µ(x) r(x, y)[1 + r(y, y) + r(y, y)² + ···].
Since r(x, y) ≤ 1 and µ is a probability measure, this is less than or equal to
Σ_x µ(x)(1 + r(y, y) + ···) ≤ 1 + r(y, y) + ···.
Hence r(y, y) must equal 1.
Proposition 25.4 If the Markov chain is irreducible and has stationary distribution µ, then
µ(x) = 1/(E^x T_x).
Proof. We have µ(x) > 0 for some x. If y ∈ S, then r(x, y) > 0 and so p_n(x, y) > 0 for some n. Hence
µ(y) = Σ_x µ(x) p_n(x, y) > 0.
Hence by Proposition 25.3, all states are recurrent. By the uniqueness of the stationary measure (Proposition 25.2), µ_x is a constant multiple of µ, i.e., µ_x = cµ. Recall
µ_x(y) = Σ_{n=0}^∞ P^x(X_n = y, T_x > n),
and so
Σ_y µ_x(y) = Σ_y Σ_{n=0}^∞ P^x(X_n = y, T_x > n) = Σ_n Σ_y P^x(X_n = y, T_x > n) = Σ_n P^x(T_x > n) = E^x T_x.
Thus c = E^x T_x. Recalling that µ_x(x) = 1,
µ(x) = µ_x(x)/c = 1/(E^x T_x).
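Proposition 25.4 can be illustrated by simulation; the sketch below, not in the notes, estimates the mean return time to the middle state of an Ehrenfest urn and compares it with 1/µ(x) for the binomial stationary distribution. The parameters are arbitrary.

    import numpy as np
    from math import comb

    rng = np.random.default_rng(2)
    r, x0 = 10, 5
    mu_x0 = comb(r, x0) / 2**r

    k, t_last, returns, total = x0, 0, 0, 0
    for t in range(1, 500_000):
        k = k + 1 if rng.random() < (r - k) / r else k - 1
        if k == x0:
            returns += 1
            total += t - t_last
            t_last = t
    print("empirical mean return time:", total / returns, " 1/mu(x):", 1 / mu_x0)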
We make the following distinction for recurrent states. If E^x T_x < ∞, then x is said to be positive recurrent. If x is recurrent but E^x T_x = ∞, x is null recurrent.
Proposition 25.5 Suppose a chain is irreducible.
(a) If there exists a positive recurrent state, then there is a stationary distribution.
(b) If there is a stationary distribution, all states are positive recurrent.
(c) If there exists a transient state, all states are transient.
(d) If there exists a null recurrent state, all states are null recurrent.
Proof. To show (a), if x is positive recurrent, then there exists a stationary measure µ_x with µ_x(x) = 1. Then µ(y) = µ_x(y)/E^x T_x will be a stationary distribution.
For (b), suppose µ(x) > 0 for some x. We showed this implies µ(y) > 0 for all y. Then 0 < µ(y) = 1/E^y T_y, which implies E^y T_y < ∞.
We showed that if x is recurrent and r(x, y) > 0, then y is recurrent. So (c) follows.
Suppose there exists a null recurrent state. If there exists a positive recurrent or transient state as well, then by (a) and (b) or by (c) all states are positive recurrent or transient, a contradiction, and (d) follows.
26 Convergence
Our goal is to show that under certain conditions p_n(x, y) → π(y), where π is the stationary distribution. (In the null recurrent case p_n(x, y) → 0.)
Consider a random walk on the set {0, 1}, where with probability one on each step the chain moves to the other state. Then p_n(x, y) = 0 if x ≠ y and n is even. A less trivial case is the simple random walk on the integers. We need to eliminate this periodicity.
Suppose x is recurrent, let I_x = {n ≥ 1 : p_n(x, x) > 0}, and let d_x be the g.c.d. (greatest common divisor) of I_x.
Proposition 26.1 If r(x, y) > 0, then d_y = d_x.
Proof. Since x is recurrent, r(y, x) > 0. Choose K and L such that p_K(x, y) > 0 and p_L(y, x) > 0. Then
p_{K+L+n}(y, y) ≥ p_L(y, x) p_n(x, x) p_K(x, y),
so taking n = 0, we have p_{K+L}(y, y) > 0, so d_y divides K + L. Hence d_y divides K + L + n whenever p_n(x, x) > 0, and therefore d_y divides n whenever p_n(x, x) > 0; that is, d_y divides every element of I_x. Hence d_y divides d_x. By symmetry d_x divides d_y.
Proposition 26.2 If d_x = 1, there exists m_0 such that p_m(x, x) > 0 whenever m ≥ m_0.
Proof. First of all, I_x is closed under addition: if m, n ∈ I_x, then
p_{m+n}(x, x) ≥ p_m(x, x) p_n(x, x) > 0.
Secondly, if there exists N such that N, N + 1 ∈ I_x, let m_0 = N². If m ≥ m_0, then m − N² = kN + r for some 0 ≤ r < N and
m = r + N² + kN = r(N + 1) + (N − r + k)N ∈ I_x.
Third, pick n_0 ∈ I_x and k > 0 such that n_0 + k ∈ I_x. If k = 1, we are done. Since d_x = 1, there exists n_1 ∈ I_x such that k does not divide n_1. We have n_1 = mk + r for some 0 < r < k. Note (m + 1)(n_0 + k) ∈ I_x and (m + 1)n_0 + n_1 ∈ I_x. The difference between these two numbers is (m + 1)k − n_1 = k − r < k. So now we have two numbers in I_x differing by at most k − 1. Repeating at most k times, we get two numbers in I_x differing by at most 1, and we are done.
We write d for d_x. A chain is aperiodic if d = 1.
If d > 1, we say x ∼ y if p_{kd}(x, y) > 0 for some k > 0. We divide S into equivalence classes S_1, …, S_d. Every d steps the chain started in S_i is back in S_i. So we look at p′ = p_d on S_i.
Theorem 26.3 Suppose the chain is irreducible, aperiodic, and has a stationary distribution π. Then p_n(x, y) → π(y) as n → ∞.
Proof. The idea is to take two copies of the chain with different starting distributions, let them run independently until they couple, i.e., hit each other, and then have them move together. So define
q((x_1, y_1), (x_2, y_2)) = p(x_1, x_2) p(y_1, y_2) if x_1 ≠ y_1; p(x_1, x_2) if x_1 = y_1 and x_2 = y_2; and 0 otherwise.
Let Z_n = (X_n, Y_n) and T = min{i : X_i = Y_i}. We have
P(X_n = y) = P(X_n = y, T ≤ n) + P(X_n = y, T > n) = P(Y_n = y, T ≤ n) + P(X_n = y, T > n),
while
P(Y_n = y) = P(Y_n = y, T ≤ n) + P(Y_n = y, T > n).
Subtracting,
P(X_n = y) − P(Y_n = y) ≤ P(X_n = y, T > n) − P(Y_n = y, T > n) ≤ P(T > n).
Using symmetry,
|P(X_n = y) − P(Y_n = y)| ≤ P(T > n).
Suppose we let Y_0 have distribution π and X_0 = x. Then
|p_n(x, y) − π(y)| ≤ P(T > n).
It remains to show P(T > n) → 0. To do this, consider another chain Z′_n = (X_n, Y_n), where now we take X_n, Y_n independent. Define
r((x_1, y_1), (x_2, y_2)) = p(x_1, x_2) p(y_1, y_2).
The chain under the transition probabilities r is irreducible. To see this, there exist K and L such that p_K(x_1, x_2) > 0 and p_L(y_1, y_2) > 0. If M is large, p_{L+M}(x_2, x_2) > 0 and p_{K+M}(y_2, y_2) > 0. So p_{K+L+M}(x_1, x_2) > 0 and p_{K+L+M}(y_1, y_2) > 0, and hence we have r_{K+L+M}((x_1, y_1), (x_2, y_2)) > 0.
It is easy to check that π′(a, b) = π(a)π(b) is a stationary distribution for Z′. Hence Z′_n is recurrent, and hence it will hit (x, x); hence the time to hit the diagonal {(y, y) : y ∈ S} is finite. However, the distribution of the time to hit the diagonal is the same as that of T.
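A quick numerical illustration of Theorem 26.3, not in the notes: for a made-up irreducible, aperiodic 3-state matrix (the entries below are arbitrary), every row of P^n converges to the stationary distribution π.

    import numpy as np

    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.3, 0.3, 0.4]])

    # stationary distribution: left eigenvector for eigenvalue 1, normalized to sum 1
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    pi = pi / pi.sum()

    print("pi:         ", np.round(pi, 4))
    print("row of P^50:", np.round(np.linalg.matrix_power(P, 50)[0], 4))  # same as pi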
27 Gaussian sequences
We first prove a converse to Proposition 17.3.
Proposition 27.1 If E e^{i(uX+vY)} = E e^{iuX} E e^{ivY} for all u and v, then X and Y are independent random variables.
Proof. Let X′ be a random variable with the same law as X, Y′ one with the same law as Y, and X′, Y′ independent. (We let Ω = [0,1]², P Lebesgue measure, X′ a function of the first variable, and Y′ a function of the second variable, defined as in Proposition 1.2.) Then E e^{i(uX′+vY′)} = E e^{iuX′} E e^{ivY′}. Since X, X′ have the same law, they have the same characteristic function, and similarly for Y, Y′. Therefore (X′, Y′) has the same joint characteristic function as (X, Y). By the uniqueness of the Fourier transform, (X′, Y′) has the same joint law as (X, Y), which is easily seen to imply that X and Y are independent.
A sequence of random variables X_1, …, X_n is said to be jointly normal if there exists a sequence of independent standard normal random variables Z_1, …, Z_m and constants b_{ij} and a_i such that X_i = Σ_{j=1}^m b_{ij} Z_j + a_i, i = 1, …, n. In matrix notation, X = BZ + A. For simplicity, in what follows let us take A = 0; the modifications for the general case are easy. The covariance of two random variables X and Y is defined to be E[(X − EX)(Y − EY)]. Since we are assuming our normal random variables are mean 0, we can omit the centering at expectations. Given a sequence of mean 0 random variables, we can talk about the covariance matrix, which is Cov(X) = E XX^t, where X^t denotes the transpose of the vector X. In the above case, we see Cov(X) = E[(BZ)(BZ)^t] = E[B ZZ^t B^t] = BB^t, since E ZZ^t = I, the identity.
Let us compute the joint characteristic function E e^{iu^tX} of the vector X, where u is an n-dimensional vector. First, if v is an m-dimensional vector,
E e^{iv^tZ} = E ∏_{j=1}^m e^{iv_jZ_j} = ∏_{j=1}^m E e^{iv_jZ_j} = ∏_{j=1}^m e^{−v_j²/2} = e^{−v^tv/2},
using the independence of the Z's. So
E e^{iu^tX} = E e^{iu^tBZ} = e^{−u^tBB^tu/2}.
Proposition 27.2 If the X_i are jointly normal and Cov(X_i, X_j) = 0 for i ≠ j, then the X_i are independent.
Proof. If Cov(X) = BB^t is a diagonal matrix, then the joint characteristic function of the X's factors, and so by Proposition 27.1, the X's would in this case be independent.
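The identity Cov(X) = BB^t is easy to check by simulation; the sketch below, not part of the notes, uses an arbitrary 2×3 matrix B and compares the empirical covariance of X = BZ with BB^t.

    import numpy as np

    rng = np.random.default_rng(3)
    B = np.array([[1.0, 0.5, 0.0],
                  [0.0, 2.0, 1.0]])         # n = 2, m = 3

    Z = rng.standard_normal((3, 200_000))   # columns are independent samples of Z
    X = B @ Z                               # columns are samples of X = BZ
    print("B B^t:")
    print(B @ B.T)
    print("empirical covariance of X:")
    print(np.cov(X))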
28 Stationary processes
In this section we give some preliminaries which will be used in the next section on the ergodic theorem. We say a sequence X_i is stationary if (X_k, X_{k+1}, …) has the same distribution as (X_0, X_1, …).
One example is if the X_i are i.i.d. For readers who are familiar with Markov chains, another is if X_i is a Markov chain, π is the stationary distribution, and X_0 has distribution π.
A third example is rotations of a circle. Let Ω be the unit circle, P normalized Lebesgue measure on Ω, and θ ∈ [0, 2π). We let X_0(ω) = ω and set X_n(ω) = ω + nθ (mod 2π).
A fourth example is the Bernoulli shift: let Ω = [0, 1), P Lebesgue measure, X_0(ω) = ω, and X_n(ω) be the binary expansion of ω from the nth place on.
Proposition 28.1 If X_n is stationary, then Y_k = g(X_k, X_{k+1}, …) is stationary.
Proof. If B ⊂ R^∞, let
A = {x = (x_0, x_1, …) : (g(x_0, x_1, …), g(x_1, x_2, …), …) ∈ B}.
Then
P((Y_0, Y_1, …) ∈ B) = P((X_0, X_1, …) ∈ A) = P((X_k, X_{k+1}, …) ∈ A) = P((Y_k, Y_{k+1}, …) ∈ B).
We say that T : Ω → Ω is measure preserving if P(T^{−1}A) = P(A) for all A ∈ F.
There is a one-to-one correspondence between measure preserving transformations and stationary sequences. Given T, let X_0(ω) = ω and X_n(ω) = T^nω. Then
P((X_k, X_{k+1}, …) ∈ A) = P(T^{−k}((X_0, X_1, …) ∈ A)) = P((X_0, X_1, …) ∈ A).
On the other hand, if X_k is stationary, define Ω̂ = R^∞, and define X̂_k(ω) = ω_k, where ω = (ω_0, ω_1, …). Define P̂ on Ω̂ so that the law of X̂ under P̂ is the same as the law of X under P. Then define Tω = (ω_1, ω_2, …). We see that
P̂(A) = P̂((ω_0, ω_1, …) ∈ A) = P̂((X̂_0, X̂_1, …) ∈ A) = P((X_0, X_1, …) ∈ A) = P((X_1, X_2, …) ∈ A) = P̂((X̂_1, X̂_2, …) ∈ A) = P̂((ω_1, ω_2, …) ∈ A) = P̂(Tω ∈ A) = P̂(T^{−1}A).
We say a set A is invariant if T^{−1}A = A (up to a null set, that is, the symmetric difference has probability zero). The invariant σ-field I is the collection of invariant sets. A measure preserving transformation is ergodic if the invariant σ-field is trivial.
In the case of an i.i.d. sequence, A invariant means A = T^{−n}A ∈ σ(X_n, X_{n+1}, …) for each n. Hence each invariant set is in the tail σ-field, and by the Kolmogorov 0-1 law, T is ergodic.
For the rotation of the circle, suppose θ is not a rational multiple of π. To see that T is ergodic in this case, recall that if f is measurable and bounded, then f is the L² limit of Σ_{k=−K}^{K} c_k e^{ikx}, where the c_k are the Fourier coefficients. So
f(T^n x) = Σ c_k e^{ikx + iknθ} = Σ d_k e^{ikx},
where d_k = c_k e^{iknθ}. If f(T^n x) = f(x) a.e., then c_k = d_k, or c_k e^{iknθ} = c_k. But θ is not a rational multiple of π, so e^{iknθ} ≠ 1 for k ≠ 0, so c_k = 0 for k ≠ 0. Therefore f is constant a.e. If we take f = 1_A for an invariant set A, this says that either A is empty or A is the whole space, up to sets of measure zero.
Our last example was the Bernoulli shift. Let X_i be i.i.d. with P(X_i = 1) = P(X_i = 0) = 1/2. Let Y_n = Σ_{m=0}^∞ 2^{−(m+1)} X_{n+m}. So there exists g such that Y_n = g(X_n, X_{n+1}, …). If A is invariant for the Bernoulli shift,
A = ((Y_n, Y_{n+1}, …) ∈ B) = ((X_n, X_{n+1}, …) ∈ C),
where C = {x : (g(x_0, x_1, …), g(x_1, x_2, …), …) ∈ B}. This is true for all n, so A is in the tail σ-field for the X_i's, which is trivial. Therefore T is ergodic.
29 The ergodic theorem
The key to the ergodic theorem is the following maximal lemma
Lemma 29.1 Let X be integrable. Let T be a measure preserving transformation, let X_j(ω) = X(T^jω), let S_k(ω) = X_0(ω) + ··· + X_{k−1}(ω), and M_k(ω) = max(0, S_1(ω), …, S_k(ω)). Then E[X; M_k > 0] ≥ 0.
Proof. If j ≤ k, M_k(Tω) ≥ S_j(Tω), so X(ω) + M_k(Tω) ≥ X(ω) + S_j(Tω) = S_{j+1}(ω), or
X(ω) ≥ S_{j+1}(ω) − M_k(Tω), j = 1, …, k.
Since S_1(ω) = X(ω) and M_k(Tω) ≥ 0, then
X(ω) ≥ S_1(ω) − M_k(Tω).
Therefore
E[X(ω); M_k > 0] ≥ ∫_{(M_k>0)} [max(S_1, …, S_k)(ω) − M_k(Tω)] = ∫_{(M_k>0)} [M_k(ω) − M_k(Tω)].
On the set (M_k = 0) we have M_k(ω) − M_k(Tω) = −M_k(Tω) ≤ 0. Hence
E[X(ω); M_k > 0] ≥ ∫ [M_k(ω) − M_k(Tω)].
Since T is measure preserving, E M_k(ω) − E M_k(Tω) = 0, which completes the proof.
Recall I is the invariant σ-field. The ergodic theorem says the following.
Theorem 29.2 Let T be measure preserving and X integrable. Then
(1/n) Σ_{m=0}^{n−1} X(T^mω) → E[X | I],
where the convergence takes place almost surely and in L¹.
Proof. By considering X − E[X | I], we may suppose E[X | I] = 0. Let X_j, S_n, and M_n be as in Lemma 29.1, let ε > 0, and let D = (lim sup_n S_n/n > ε); we want to show P(D) = 0.
Let δ > 0. Since X is integrable, Σ_n P(|X_n(ω)| > δn) = Σ_n P(|X| > δn) < ∞ (cf. the proof of Proposition 5.1). By Borel-Cantelli, |X_n|/n will eventually be less than δ. Since δ is arbitrary, |X_n|/n → 0 a.s. Since
(S_n/n)(Tω) − (S_n/n)(ω) = X_n(ω)/n − X_0(ω)/n → 0,
then lim sup (S_n/n)(Tω) = lim sup (S_n/n)(ω), and so D ∈ I. Let X*(ω) = (X(ω) − ε)1_D(ω), and define S_n* and M_n* analogously to the definitions of S_n and M_n. On D, lim sup (S_n/n) > ε, hence lim sup (S_n*/n) > 0.
Let F = ∪_n (M_n* > 0). Note ∪_{i=0}^n (M_i* > 0) = (M_n* > 0). Also |X*| ≤ |X| + ε is integrable. By Lemma 29.1, E[X*; M_n* > 0] ≥ 0. By dominated convergence, E[X*; F] ≥ 0.
We claim D = F, up to null sets. To see this, if lim sup (S_n*/n) > 0, then ω ∈ ∪_n (M_n* > 0). Hence D ⊂ F. On the other hand, if ω ∈ F, then M_n* > 0 for some n, so X_m* ≠ 0 for some m. By the definition of X*, for some m, T^mω ∈ D, and since D is invariant, ω ∈ D a.s.
Recall D ∈ I. Then
0 ≤ E[X*; D] = E[X − ε; D] = E[E[X | I]; D] − εP(D) = −εP(D),
using the fact that E[X | I] = 0. We conclude P(D) = 0, as desired.
Since we have this for every ε, then lim sup S_n/n ≤ 0. By applying the same argument to −X, we obtain lim inf S_n/n ≥ 0, and we have proved the almost sure result. Let us now turn to the L¹ convergence.
Let M > 0, X′_M = X 1_{(|X|≤M)}, and X″_M = X − X′_M. By the almost sure result,
(1/n) Σ_{m=0}^{n−1} X′_M(T^mω) → E[X′_M | I]
almost surely. Both sides are bounded by M, so
E| (1/n) Σ_{m=0}^{n−1} X′_M(T^mω) − E[X′_M | I] | → 0.   (29.1)
Let ε > 0 and choose M large so that E|X″_M| < ε; this is possible by dominated convergence. We have
E| (1/n) Σ_{m=0}^{n−1} X″_M(T^mω) | ≤ (1/n) Σ_{m=0}^{n−1} E|X″_M(T^mω)| = E|X″_M| ≤ ε
and
E| E[X″_M | I] | ≤ E[ E[|X″_M| | I] ] = E|X″_M| ≤ ε.
So combining with (29.1),
lim sup_n E| (1/n) Σ_{m=0}^{n−1} X(T^mω) − E[X | I] | ≤ 2ε.
This shows the L¹ convergence.
What does the ergodic theorem tell us about our examples? In the case of i.i.d. random variables, we see S_n/n → EX almost surely and in L¹, since E[X | I] = EX. Thus this gives another proof of the SLLN. For rotations of the circle with X(ω) = 1_A(ω) and θ an irrational multiple of π, E[X | I] = EX = P(A), the normalized Lebesgue measure of A. So the ergodic theorem says that (1/n) Σ_{m=0}^{n−1} 1_A(ω + mθ), the average number of times ω + mθ is in A, converges for almost every ω to the normalized Lebesgue measure of A.
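As a final numerical illustration, not in the original notes, the ergodic average for the irrational rotation can be computed directly; with A = [0, 1) ⊂ [0, 2π) and θ = 1 (not a rational multiple of π), the averages approach |A|/(2π) ≈ 0.1592. The starting point ω and the values of n are arbitrary.

    import numpy as np

    theta, omega = 1.0, 0.3
    a, b = 0.0, 1.0                # A = [a, b) as a subset of the circle [0, 2*pi)

    for n in (100, 10_000, 1_000_000):
        pts = (omega + np.arange(n) * theta) % (2 * np.pi)
        avg = np.mean((pts >= a) & (pts < b))
        print(n, round(float(avg), 5), round((b - a) / (2 * np.pi), 5))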