Faculty of Mathematics, Mechanics and Informatics, VNU University of Science, Vietnam National University, Hanoi
(1) Stochastic Gradient Descent
Hoàng Nam Dũng
(2) Last time: proximal gradient descent
Consider the problem
min_x g(x) + h(x)
with g, h convex, g differentiable, and h "simple" in so much as
prox_t(x) = argmin_z 1/(2t) ‖x − z‖² + h(z)
is computable.
Proximal gradient descent: let x^{(0)} ∈ R^n, repeat:
x^{(k)} = prox_{t_k}(x^{(k−1)} − t_k ∇g(x^{(k−1)})), k = 1, 2, 3, ...
Step sizes t_k chosen to be fixed and small, or via backtracking.
If ∇g is Lipschitz with constant L, then this has convergence rate O(1/k).
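To make the recap concrete, here is a minimal NumPy sketch of this update for the lasso, taking g(x) = ½‖Ax − b‖² and h(x) = λ‖x‖₁, whose prox is soft-thresholding; the instance and all names below are illustrative, not from the lecture.

    import numpy as np

    def soft_threshold(z, tau):
        # prox of tau*||.||_1: shrink each coordinate toward zero
        return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    def proximal_gradient(A, b, lam, t, iters=500):
        # min_x g(x) + h(x) with g(x) = 0.5*||Ax - b||^2, h(x) = lam*||x||_1
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            grad = A.T @ (A @ x - b)                    # gradient of g
            x = soft_threshold(x - t * grad, t * lam)   # prox_t step
        return x

    A = np.random.randn(20, 5); b = np.random.randn(20)
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of grad g
    x_hat = proximal_gradient(A, b, lam=0.1, t=1.0 / L)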
(3) Outline
Today:
I Stochastic gradient descent
I Convergence rates
(4) Stochastic gradient descent
Consider minimizing an average of functions
min_x (1/m) ∑_{i=1}^m f_i(x)
As ∇ ∑_{i=1}^m f_i(x) = ∑_{i=1}^m ∇f_i(x), gradient descent would repeat
x^{(k)} = x^{(k−1)} − t_k · (1/m) ∑_{i=1}^m ∇f_i(x^{(k−1)}), k = 1, 2, 3, ...
In comparison, stochastic gradient descent or SGD (or incremental gradient descent) repeats:
x^{(k)} = x^{(k−1)} − t_k · ∇f_{i_k}(x^{(k−1)}), k = 1, 2, 3, ...
where i_k ∈ {1, ..., m} is some chosen index at iteration k.
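A minimal sketch of this iteration, assuming the component gradients ∇f_i are available through a user-supplied function grad_fi(i, x) (a hypothetical name) and the step sizes through a callable:

    import numpy as np

    def sgd(grad_fi, m, x0, step, n_iters=1000):
        # SGD: x^{(k)} = x^{(k-1)} - t_k * grad f_{i_k}(x^{(k-1)}),
        # with i_k chosen uniformly at random (the randomized rule below).
        x = x0.copy()
        for k in range(1, n_iters + 1):
            i = np.random.randint(m)          # i_k, uniform on {0, ..., m-1}
            x = x - step(k) * grad_fi(i, x)   # one cheap component gradient
        return x

    # e.g., diminishing step sizes t_k = 1/k (see the step-size slide):
    # x_hat = sgd(grad_fi, m, x0, step=lambda k: 1.0 / k)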
(5) Stochastic gradient descent
Two rules for choosing index i_k at iteration k:
I Randomized rule: choose i_k ∈ {1, ..., m} uniformly at random
I Cyclic rule: choose i_k = 1, 2, ..., m, 1, 2, ..., m, ...
The randomized rule is more common in practice. For the randomized rule, note that, since i_k is uniform over {1, ..., m},
E[∇f_{i_k}(x)] = (1/m) ∑_{i=1}^m ∇f_i(x) = ∇f(x),
so we can view SGD as using an unbiased estimate of the gradient at each step.
Main appeal of SGD: iteration cost is independent of m, since each update touches only a single f_i.
(6) Example: stochastic logistic regression
Given (x_i, y_i) ∈ R^p × {0, 1}, i = 1, ..., n, recall logistic regression
min_β f(β) = (1/n) ∑_{i=1}^n [−y_i x_i^T β + log(1 + exp(x_i^T β))]
with f_i(β) denoting the i-th summand.
Gradient computation ∇f(β) = (1/n) ∑_{i=1}^n (p_i(β) − y_i) x_i, where p_i(β) = exp(x_i^T β)/(1 + exp(x_i^T β)), is doable when n is moderate, but not when n is huge.
Full gradient (also called batch) versus stochastic gradient:
I One batch update costs O(np)
I One stochastic update costs O(p)
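To make the cost contrast concrete, here is a sketch of the two updates for this logistic objective; the data X ∈ R^{n×p}, y ∈ {0,1}^n and the function names are hypothetical stand-ins.

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def batch_grad(beta, X, y):
        # full gradient (1/n) sum_i (p_i(beta) - y_i) x_i: touches all n rows, O(np)
        return X.T @ (sigmoid(X @ beta) - y) / len(y)

    def stochastic_grad(beta, X, y, i):
        # gradient of the single term f_i: one row of X, O(p)
        return (sigmoid(X[i] @ beta) - y[i]) * X[i]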
(7) Batch vs stochastic gradient descent
Small example with n = 10, p = 2 to show the "classic picture" for batch versus stochastic methods:
[Figure: iterate paths of batch (blue) and stochastic (red) steps on contours of the criterion]
Blue: batch steps, O(np)
Red: stochastic steps, O(p)
Rule of thumb for stochastic methods:
I generally thrive far from optimum
I generally struggle close to optimum
(8) Step sizes
Standard in SGD is to use diminishing step sizes, e.g., t_k = 1/k, for k = 1, 2, 3, ...
Why not fixed step sizes? Here's some intuition.
Suppose we take the cyclic rule for simplicity. Setting t_k = t for m updates in a row, we get
x^{(k+m)} = x^{(k)} − t ∑_{i=1}^m ∇f_i(x^{(k+i−1)})
Meanwhile, full gradient with step size t would give
x^{(k+1)} = x^{(k)} − t ∑_{i=1}^m ∇f_i(x^{(k)})
The difference here: t ∑_{i=1}^m [∇f_i(x^{(k+i−1)}) − ∇f_i(x^{(k)})], and if we hold t constant this difference does not, in general, vanish.
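A small numeric check of this intuition, on an illustrative least-squares instance with f_i(x) = ½(a_i^T x − b_i)² (not from the lecture):

    import numpy as np

    np.random.seed(0)
    m, p, t = 5, 3, 0.1
    A = np.random.randn(m, p); b = np.random.randn(m)
    grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i]   # gradient of f_i

    x0 = np.zeros(p)

    x_cyc = x0.copy()                  # m cyclic SGD steps with fixed step t
    for i in range(m):
        x_cyc = x_cyc - t * grad_fi(i, x_cyc)

    x_full = x0 - t * sum(grad_fi(i, x0) for i in range(m))   # one full-gradient step

    # gap = t * sum_i [grad f_i(x^{(k+i-1)}) - grad f_i(x^{(k)})];
    # it does not shrink if t is held fixed
    print(np.linalg.norm(x_cyc - x_full))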
(9) Convergence rates
Recall: for convex f, (sub)gradient descent with diminishing step sizes satisfies
f(x^{(k)}) − f^* = O(1/√k)
When f is differentiable with Lipschitz gradient, there holds for gradient descent with suitable fixed step sizes
f(x^{(k)}) − f^* = O(1/k)
What about SGD? For convex f, SGD with diminishing step sizes satisfies¹
E[f(x^{(k)})] − f^* = O(1/√k)
Unfortunately this does not improve when we further assume f has Lipschitz gradient.
¹ E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming".
(10) Convergence rates
Even worse is the following discrepancy!
When f is strongly convex and has a Lipschitz gradient, gradient descent satisfies
f(x^{(k)}) − f^* = O(c^k)
where c < 1. But under the same conditions, SGD gives us²
E[f(x^{(k)})] − f^* = O(1/k)
So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity.
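As a quick illustrative experiment (the instance and step-size choices are assumptions, and constants matter in practice), one can watch the two rates on a small strongly convex least-squares problem:

    import numpy as np

    np.random.seed(1)
    m, p = 50, 5
    A = np.random.randn(m, p); b = np.random.randn(m)
    f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
    fstar = f(np.linalg.lstsq(A, b, rcond=None)[0])

    L = np.linalg.norm(A, 2) ** 2 / m          # Lipschitz constant of grad f
    x_gd, x_sgd = np.zeros(p), np.zeros(p)
    for k in range(1, 2001):
        x_gd = x_gd - (1.0 / L) * (A.T @ (A @ x_gd - b)) / m            # fixed step
        i = np.random.randint(m)
        x_sgd = x_sgd - (1.0 / (L * k)) * (A[i] @ x_sgd - b[i]) * A[i]  # t_k ~ 1/k

    print(f(x_gd) - fstar)    # linear rate O(c^k): near machine precision
    print(f(x_sgd) - fstar)   # sublinear rate: visibly larger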
What can we do to improve SGD?
² E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming".