Faculty of Mathematics, Mechanics and Informatics, VNU University of Science, Vietnam National University, Hanoi
(1) Stochastic Gradient Descent
Hoàng Nam Dũng
(2) Last time: proximal gradient descent
Consider the problem
min_x g(x) + h(x)
with g, h convex, g differentiable, and h "simple" in so much as
prox_t(x) = argmin_z 1/(2t) ‖x − z‖² + h(z)
is computable.
Proximal gradient descent: let x^{(0)} ∈ R^n, repeat:
x^{(k)} = prox_{t_k}(x^{(k−1)} − t_k ∇g(x^{(k−1)})), k = 1, 2, 3, ...
Step sizes t_k chosen to be fixed and small, or via backtracking.
If ∇g is Lipschitz with constant L, then this has convergence rate O(1/k).
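To make the recap concrete, here is a minimal NumPy sketch of this update for the lasso, taking g(x) = ½‖Ax − b‖² and h(x) = λ‖x‖₁, whose prox is soft-thresholding; the instance and all names below are illustrative, not from the lecture.

    import numpy as np

    def soft_threshold(z, tau):
        # prox of tau*||.||_1: shrink each coordinate toward zero
        return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    def proximal_gradient(A, b, lam, t, iters=500):
        # min_x g(x) + h(x) with g(x) = 0.5*||Ax - b||^2, h(x) = lam*||x||_1
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            grad = A.T @ (A @ x - b)                    # gradient of g
            x = soft_threshold(x - t * grad, t * lam)   # prox_t step
        return x

    A = np.random.randn(20, 5); b = np.random.randn(20)
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of grad g
    x_hat = proximal_gradient(A, b, lam=0.1, t=1.0 / L)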
(3) Outline
Today:
I Stochastic gradient descent
I Convergence rates
(4) Stochastic gradient descent
Consider minimizing an average of functions
min_x (1/m) ∑_{i=1}^m f_i(x)
As ∇ ∑_{i=1}^m f_i(x) = ∑_{i=1}^m ∇f_i(x), gradient descent would repeat
x^{(k)} = x^{(k−1)} − t_k · (1/m) ∑_{i=1}^m ∇f_i(x^{(k−1)}), k = 1, 2, 3, ...
In comparison, stochastic gradient descent or SGD (or incremental gradient descent) repeats:
x^{(k)} = x^{(k−1)} − t_k · ∇f_{i_k}(x^{(k−1)}), k = 1, 2, 3, ...
where i_k ∈ {1, ..., m} is some chosen index at iteration k.
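A minimal sketch of this iteration, assuming the component gradients ∇f_i are available through a user-supplied function grad_fi(i, x) (a hypothetical name) and the step sizes through a callable:

    import numpy as np

    def sgd(grad_fi, m, x0, step, n_iters=1000):
        # SGD: x^{(k)} = x^{(k-1)} - t_k * grad f_{i_k}(x^{(k-1)}),
        # with i_k chosen uniformly at random (the randomized rule below).
        x = x0.copy()
        for k in range(1, n_iters + 1):
            i = np.random.randint(m)          # i_k, uniform on {0, ..., m-1}
            x = x - step(k) * grad_fi(i, x)   # one cheap component gradient
        return x

    # e.g., diminishing step sizes t_k = 1/k (see the step-size slide):
    # x_hat = sgd(grad_fi, m, x0, step=lambda k: 1.0 / k)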
(5) Stochastic gradient descent
Two rules for choosing index i_k at iteration k:
I Randomized rule: choose i_k ∈ {1, ..., m} uniformly at random
I Cyclic rule: choose i_k = 1, 2, ..., m, 1, 2, ..., m, ...
The randomized rule is more common in practice. For the randomized rule, note that, since i_k is uniform over {1, ..., m},
E[∇f_{i_k}(x)] = (1/m) ∑_{i=1}^m ∇f_i(x) = ∇f(x),
so we can view SGD as using an unbiased estimate of the gradient at each step.
Main appeal of SGD: iteration cost is independent of m, since each update touches only a single f_i.
(6) Example: stochastic logistic regression
Given (x_i, y_i) ∈ R^p × {0, 1}, i = 1, ..., n, recall logistic regression
min_β f(β) = (1/n) ∑_{i=1}^n [−y_i x_i^T β + log(1 + exp(x_i^T β))]
with f_i(β) denoting the i-th summand.
Gradient computation ∇f(β) = (1/n) ∑_{i=1}^n (p_i(β) − y_i) x_i, where p_i(β) = exp(x_i^T β)/(1 + exp(x_i^T β)), is doable when n is moderate, but not when n is huge.
Full gradient (also called batch) versus stochastic gradient:
I One batch update costs O(np)
I One stochastic update costs O(p)
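To make the cost contrast concrete, here is a sketch of the two updates for this logistic objective; the data X ∈ R^{n×p}, y ∈ {0,1}^n and the function names are hypothetical stand-ins.

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def batch_grad(beta, X, y):
        # full gradient (1/n) sum_i (p_i(beta) - y_i) x_i: touches all n rows, O(np)
        return X.T @ (sigmoid(X @ beta) - y) / len(y)

    def stochastic_grad(beta, X, y, i):
        # gradient of the single term f_i: one row of X, O(p)
        return (sigmoid(X[i] @ beta) - y[i]) * X[i]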
(7) Batch vs stochastic gradient descent
Small example with n = 10, p = 2 to show the "classic picture" for batch versus stochastic methods:
[Figure: iterate paths of batch (blue) and stochastic (red) steps on contours of the criterion]
Blue: batch steps, O(np)
Red: stochastic steps, O(p)
Rule of thumb for stochastic methods:
I generally thrive far from optimum
I generally struggle close to optimum
(8) Step sizes
Standard in SGD is to use diminishing step sizes, e.g., t_k = 1/k, for k = 1, 2, 3, ...
Why not fixed step sizes? Here's some intuition.
Suppose we take the cyclic rule for simplicity. Setting t_k = t for m updates in a row, we get
x^{(k+m)} = x^{(k)} − t ∑_{i=1}^m ∇f_i(x^{(k+i−1)})
Meanwhile, full gradient with step size t would give
x^{(k+1)} = x^{(k)} − t ∑_{i=1}^m ∇f_i(x^{(k)})
The difference here: t ∑_{i=1}^m [∇f_i(x^{(k+i−1)}) − ∇f_i(x^{(k)})], and if we hold t constant this difference does not, in general, vanish.
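A small numeric check of this intuition, on an illustrative least-squares instance with f_i(x) = ½(a_i^T x − b_i)² (not from the lecture):

    import numpy as np

    np.random.seed(0)
    m, p, t = 5, 3, 0.1
    A = np.random.randn(m, p); b = np.random.randn(m)
    grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i]   # gradient of f_i

    x0 = np.zeros(p)

    x_cyc = x0.copy()                  # m cyclic SGD steps with fixed step t
    for i in range(m):
        x_cyc = x_cyc - t * grad_fi(i, x_cyc)

    x_full = x0 - t * sum(grad_fi(i, x0) for i in range(m))   # one full-gradient step

    # gap = t * sum_i [grad f_i(x^{(k+i-1)}) - grad f_i(x^{(k)})];
    # it does not shrink if t is held fixed
    print(np.linalg.norm(x_cyc - x_full))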
(9) Convergence rates
Recall: for convex f, (sub)gradient descent with diminishing step sizes satisfies
f(x^{(k)}) − f^* = O(1/√k)
When f is differentiable with Lipschitz gradient, there holds for gradient descent with suitable fixed step sizes
f(x^{(k)}) − f^* = O(1/k)
What about SGD? For convex f, SGD with diminishing step sizes satisfies¹
E[f(x^{(k)})] − f^* = O(1/√k)
Unfortunately this does not improve when we further assume f has Lipschitz gradient.
¹ E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming".
(10) Convergence rates
Even worse is the following discrepancy!
When f is strongly convex and has a Lipschitz gradient, gradient descent satisfies
f(x^{(k)}) − f^* = O(c^k)
where c < 1. But under the same conditions, SGD gives us²
E[f(x^{(k)})] − f^* = O(1/k)
So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity.
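As a quick illustrative experiment (the instance and step-size choices are assumptions, and constants matter in practice), one can watch the two rates on a small strongly convex least-squares problem:

    import numpy as np

    np.random.seed(1)
    m, p = 50, 5
    A = np.random.randn(m, p); b = np.random.randn(m)
    f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
    fstar = f(np.linalg.lstsq(A, b, rcond=None)[0])

    L = np.linalg.norm(A, 2) ** 2 / m          # Lipschitz constant of grad f
    x_gd, x_sgd = np.zeros(p), np.zeros(p)
    for k in range(1, 2001):
        x_gd = x_gd - (1.0 / L) * (A.T @ (A @ x_gd - b)) / m            # fixed step
        i = np.random.randint(m)
        x_sgd = x_sgd - (1.0 / (L * k)) * (A[i] @ x_sgd - b[i]) * A[i]  # t_k ~ 1/k

    print(f(x_gd) - fstar)    # linear rate O(c^k): near machine precision
    print(f(x_sgd) - fstar)   # sublinear rate: visibly larger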
What can we do to improve SGD?
² E.g., Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming".