Advanced Optimization Lecture: Chapter 7 - Hoàng Nam Dũng

DOCUMENT INFORMATION

Basic information

Pages	34
File size	543.2 KB

Contents

Advanced Optimization lecture, Chapter 7: Subgradient method. Topics covered: recap of gradient descent, the subgradient method, step size choices, convergence analysis, Lipschitz continuity, proof of the convergence results,...

Subgradient Method
Hoàng Nam Dũng
Faculty of Mathematics, Mechanics and Informatics, VNU University of Science, Vietnam National University, Hanoi

Last last time: gradient descent

Consider the problem
$$\min_x f(x)$$
for $f$ convex and differentiable, with $\mathrm{dom}(f) = \mathbb{R}^n$.

Gradient descent: choose an initial $x^{(0)} \in \mathbb{R}^n$, then repeat
$$x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
Step sizes $t_k$ are chosen to be fixed and small, or by backtracking line search. If $\nabla f$ is Lipschitz, gradient descent has convergence rate $O(1/\varepsilon)$.

Downsides:
• Requires $f$ differentiable (addressed this lecture)
• Can be slow to converge (addressed next lecture)

Subgradient method

Now consider $f$ convex with $\mathrm{dom}(f) = \mathbb{R}^n$, but not necessarily differentiable.

Subgradient method: like gradient descent, but replacing gradients with subgradients, i.e., initialize $x^{(0)}$, then repeat
$$x^{(k)} = x^{(k-1)} - t_k \cdot g^{(k-1)}, \quad k = 1, 2, 3, \ldots$$
where $g^{(k-1)} \in \partial f(x^{(k-1)})$ is any subgradient of $f$ at $x^{(k-1)}$.

The subgradient method is not necessarily a descent method, so we keep track of the best iterate $x^{(k)}_{\text{best}}$ among $x^{(0)}, \ldots, x^{(k)}$ so far, i.e.,
$$f(x^{(k)}_{\text{best}}) = \min_{i=0,\ldots,k} f(x^{(i)}).$$

Outline

Today:
• How to choose step sizes
• Convergence analysis
• Intersection of sets
• Projected subgradient method

Step size choices

• Fixed step sizes: $t_k = t$ for all $k = 1, 2, 3, \ldots$
• Fixed step length: $t_k = s/\|g^{(k-1)}\|_2$, and hence $\|t_k g^{(k-1)}\|_2 = s$
• Diminishing step sizes: choose $t_k$ to meet the conditions
$$\sum_{k=1}^{\infty} t_k^2 < \infty, \qquad \sum_{k=1}^{\infty} t_k = \infty,$$
i.e., square summable but not summable. It is important here that the step sizes go to zero, but not too fast.

There are several other options too, but the key difference to gradient descent is that step sizes are pre-specified, not adaptively computed.

Convergence analysis

Assume that $f$ is convex with $\mathrm{dom}(f) = \mathbb{R}^n$,
and also that $f$ is Lipschitz continuous with constant $L > 0$, i.e.,
$$|f(x) - f(y)| \le L \|x - y\|_2 \quad \text{for all } x, y.$$

Theorem
For a fixed step size $t$, the subgradient method satisfies
$$f(x^{(k)}_{\text{best}}) - f^* \le \frac{\|x^{(0)} - x^*\|_2^2}{2kt} + \frac{L^2 t}{2}.$$
For a fixed step length, i.e., $t_k = s/\|g^{(k-1)}\|_2$, we have
$$f(x^{(k)}_{\text{best}}) - f^* \le \frac{L \|x^{(0)} - x^*\|_2^2}{2ks} + \frac{Ls}{2}.$$
For diminishing step sizes, the subgradient method satisfies
$$f(x^{(k)}_{\text{best}}) - f^* \le \frac{\|x^{(0)} - x^*\|_2^2 + L^2 \sum_{i=1}^{k} t_i^2}{2 \sum_{i=1}^{k} t_i},$$
i.e., $\lim_{k \to \infty} f(x^{(k)}_{\text{best}}) = f^*$.

Lipschitz continuity

Before the proof, let us consider the Lipschitz continuity assumption.

Lemma
$f$ is Lipschitz continuous with constant $L > 0$, i.e., $|f(x) - f(y)| \le L \|x - y\|_2$ for all $x, y$, if and only if $\|g\|_2 \le L$ for all $x$ and all $g \in \partial f(x)$.

Proof.
($\Leftarrow$) Choose subgradients $g_x$ and $g_y$ at $x$ and $y$. We have
$$g_x^T (x - y) \ge f(x) - f(y) \ge g_y^T (x - y).$$
Applying the Cauchy-Schwarz inequality together with $\|g_x\|_2, \|g_y\|_2 \le L$ gives
$$L \|x - y\|_2 \ge f(x) - f(y) \ge -L \|x - y\|_2.$$
($\Rightarrow$) Assume $\|g\|_2 > L$ for some $g \in \partial f(x)$. Take $y = x + g/\|g\|_2$. Then $\|y - x\|_2 = 1$ and
$$f(y) \ge f(x) + g^T (y - x) = f(x) + \|g\|_2 > f(x) + L,$$
a contradiction.

Convergence analysis - Proof

We can prove both results from the same basic inequality. Key steps:

Using the definition of subgradient,
$$\begin{aligned}
\|x^{(k)} - x^*\|_2^2 &= \|x^{(k-1)} - t_k g^{(k-1)} - x^*\|_2^2 \\
&= \|x^{(k-1)} - x^*\|_2^2 - 2 t_k (g^{(k-1)})^T (x^{(k-1)} - x^*) + t_k^2 \|g^{(k-1)}\|_2^2 \\
&\le \|x^{(k-1)} - x^*\|_2^2 - 2 t_k (f(x^{(k-1)}) - f(x^*)) + t_k^2 \|g^{(k-1)}\|_2^2.
\end{aligned}$$
Summing this inequality over the iterations, using $\|x^{(k)} - x^*\|_2^2 \ge 0$ and $f(x^{(i-1)}) \ge f(x^{(k)}_{\text{best}})$, and writing $R = \|x^{(0)} - x^*\|_2$, gives the basic inequality.

The basic inequality tells us that after $k$ steps, we have
$$f(x^{(k)}_{\text{best}}) - f(x^*) \le \frac{R^2 + \sum_{i=1}^{k} t_i^2 \|g^{(i-1)}\|_2^2}{2 \sum_{i=1}^{k} t_i}.$$
From this and the Lipschitz continuity, we have
$$f(x^{(k)}_{\text{best}}) - f(x^*) \le \frac{R^2 + L^2 \sum_{i=1}^{k} t_i^2}{2 \sum_{i=1}^{k} t_i}.$$
With diminishing step sizes, $\sum_{i=1}^{\infty} t_i = \infty$ and $\sum_{i=1}^{\infty} t_i^2 < \infty$, so
$$\lim_{k \to \infty} f(x^{(k)}_{\text{best}}) = f^*.$$

Example: 1-norm minimization

$$\text{minimize } \|Ax - b\|_1$$
• The subgradient is
given by $A^T \mathrm{sign}(Ax - b)$
• Example with $A \in \mathbb{R}^{500 \times 100}$, $b \in \mathbb{R}^{500}$

Fixed step length $t_k = s/\|g^{(k-1)}\|_2$ for $s = 0.1, 0.01, 0.001$.

[Figure: relative error $(f(x^{(k)}) - f^\star)/f^\star$ versus $k$ (up to 100) and $(f_{\text{best}}^{(k)} - f^\star)/f^\star$ versus $k$ (up to 3000), log scale, for $s = 0.1, 0.01, 0.001$.]

Diminishing step size: $t_k = 0.01/\sqrt{k}$ and $t_k = 0.01/k$.

[Figure: $(f_{\text{best}}^{(k)} - f^\star)/f^\star$ versus $k$ (up to 5000), log scale, for $t_k = 0.01/\sqrt{k}$ and $t_k = 0.01/k$.]

Example: regularized logistic regression

Given $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$ for $i = 1, \ldots, n$, the logistic regression loss is
$$f(\beta) = \sum_{i=1}^{n} \left( -y_i x_i^T \beta + \log(1 + \exp(x_i^T \beta)) \right).$$
This is smooth and convex, with
$$\nabla f(\beta) = \sum_{i=1}^{n} (p_i(\beta) - y_i) x_i,$$
where $p_i(\beta) = \exp(x_i^T \beta)/(1 + \exp(x_i^T \beta))$, $i = 1, \ldots, n$.

Consider the regularized problem
$$\min_\beta f(\beta) + \lambda \cdot P(\beta),$$
where $P(\beta) = \|\beta\|_2^2$ (ridge penalty) or $P(\beta) = \|\beta\|_1$ (lasso penalty).

Ridge: use gradients; lasso: use subgradients. The example here has $n = 1000$, $p = 20$.

[Figure: $f - f^\star$ versus $k$ (up to 200), log scale, in two panels titled "Gradient descent" and "Subgradient method"; legend entries $t = 0.001$ and $t = 0.001/k$.]

Step sizes were hand-tuned to be favorable for each method (of course the comparison is imperfect, but it reveals the convergence behaviors).

Polyak step sizes

Polyak step sizes: when the optimal value $f^*$ is known, take
$$t_k = \frac{f(x^{(k-1)}) - f^*}{\|g^{(k-1)}\|_2^2}, \quad k = 1, 2, 3, \ldots$$
This can be motivated from the first step in the subgradient proof:
$$\|x^{(k)} - x^*\|_2^2 \le \|x^{(k-1)} - x^*\|_2^2 - 2 t_k (f(x^{(k-1)}) - f(x^*)) + t_k^2 \|g^{(k-1)}\|_2^2.$$
The Polyak step size minimizes the right-hand side.

With Polyak step sizes, one can show that the subgradient method converges to the optimal value. The convergence rate is still $O(1/\varepsilon^2)$:
$$f(x^{(k)}_{\text{best}}) - f(x^*) \le \frac{LR}{\sqrt{k}}.$$
(Proof: see slide 11, http://www.seas.ucla.edu/~vandenbe/236C/lectures/sgmethod.pdf)

Example: intersection
of sets

Suppose we want to find $x^* \in C_1 \cap \cdots \cap C_m$, i.e., find a point in the intersection of closed, convex sets $C_1, \ldots, C_m$.

First define
$$f_i(x) = \mathrm{dist}(x, C_i), \quad i = 1, \ldots, m, \qquad f(x) = \max_{i=1,\ldots,m} f_i(x),$$
and now solve
$$\min_x f(x).$$
Check: is this convex? Note that $f^* = 0 \iff x^* \in C_1 \cap \cdots \cap C_m$.

Recall the distance function $\mathrm{dist}(x, C) = \min_{y \in C} \|y - x\|_2$. Last time we computed its gradient
$$\nabla \mathrm{dist}(x, C) = \frac{x - P_C(x)}{\|x - P_C(x)\|_2},$$
where $P_C(x)$ is the projection of $x$ onto $C$.

Also recall the subgradient rule: if $f(x) = \max_{i=1,\ldots,m} f_i(x)$, then
$$\partial f(x) = \mathrm{conv}\Big( \bigcup_{i : f_i(x) = f(x)} \partial f_i(x) \Big).$$
So if $f_i(x) = f(x)$ and $g_i \in \partial f_i(x)$, then $g_i \in \partial f(x)$.

Put these two facts together for the intersection of sets problem, with $f_i(x) = \mathrm{dist}(x, C_i)$: if $C_i$ is the farthest set from $x$ (so $f_i(x) = f(x)$), and
$$g_i = \nabla f_i(x) = \frac{x - P_{C_i}(x)}{\|x - P_{C_i}(x)\|_2},$$
then $g_i \in \partial f(x)$.

Now apply the subgradient method with Polyak step size $t_k = f(x^{(k-1)})$ (here $f^* = 0$ and $\|g^{(k-1)}\|_2 = 1$). At iteration $k$, with $C_i$ farthest from $x^{(k-1)}$, we perform the update
$$x^{(k)} = x^{(k-1)} - f(x^{(k-1)}) \frac{x^{(k-1)} - P_{C_i}(x^{(k-1)})}{\|x^{(k-1)} - P_{C_i}(x^{(k-1)})\|_2} = P_{C_i}(x^{(k-1)}),$$
since
$$f(x^{(k-1)}) = \mathrm{dist}(x^{(k-1)}, C_i) = \|x^{(k-1)} - P_{C_i}(x^{(k-1)})\|_2.$$

For two sets, this is the famous alternating projections algorithm (1), i.e., just keep projecting back and forth.

[Figure: alternating projections between two sets. (From Boyd's lecture notes)]

(1) von Neumann (1950), "Functional operators, volume II: The geometry of orthogonal spaces"

Projected subgradient method

To optimize a convex function $f$ over a convex set $C$,
$$\min_x f(x) \quad \text{subject to } x \in C,$$
we can use the projected subgradient method. It is just like the usual subgradient method, except we project onto $C$ at each iteration:
$$x^{(k)} = P_C\big(x^{(k-1)} - t_k \cdot g^{(k-1)}\big), \quad k = 1, 2, 3, \ldots$$
Assuming we can do this projection, we get the same convergence guarantees as the
usual subgradient method, with the same step size choices.

What sets $C$ are easy to project onto? Lots, e.g.,
• Affine images: $\{Ax + b : x \in \mathbb{R}^n\}$
• Solution set of a linear system: $\{x : Ax = b\}$
• Nonnegative orthant: $\mathbb{R}^n_+ = \{x : x \ge 0\}$
• Some norm balls: $\{x : \|x\|_p \le 1\}$ for $p = 1, 2, \infty$
• Some simple polyhedra and simple cones

Warning: it is easy to write down a seemingly simple set $C$ for which $P_C$ turns out to be very hard! E.g., it is generally hard to project onto an arbitrary polyhedron $C = \{x : Ax \le b\}$.

Note: projected gradient descent works too, more next time.

Can we do better?

Upside of the subgradient method: broad applicability. Downside: the $O(1/\varepsilon^2)$ convergence rate over the problem class of convex, Lipschitz functions is really slow.

Nonsmooth first-order methods: iterative methods updating $x^{(k)}$ in
$$x^{(0)} + \mathrm{span}\{g^{(0)}, g^{(1)}, \ldots, g^{(k-1)}\},$$
where the subgradients $g^{(0)}, g^{(1)}, \ldots, g^{(k-1)}$ come from a weak oracle.

Theorem (Nesterov)
For any $k \le n - 1$ and starting point $x^{(0)}$, there is a function in the problem class such that any nonsmooth first-order method satisfies
$$f(x^{(k)}) - f^* \ge \frac{RG}{2(1 + \sqrt{k + 1})},$$
where $G$ is the Lipschitz constant of the function (written $L$ elsewhere in this lecture).

Improving on the subgradient method

In words, we cannot do better than the $O(1/\varepsilon^2)$ rate of the subgradient method (unless we go beyond nonsmooth first-order methods).

So instead of trying to improve across the board, we will focus on minimizing composite functions of the form
$$f(x) = g(x) + h(x),$$
where $g$ is convex and differentiable, and $h$ is convex and nonsmooth but "simple".

For a lot of problems (i.e., functions $h$), we can recover the $O(1/\varepsilon)$ rate of gradient descent with a simple algorithm, which has important practical consequences.

References and further reading

• S. Boyd, Lecture notes for EE 264B, Stanford University, Spring 2010-2011
• Y. Nesterov (1998), Introductory lectures on convex optimization: a basic course, Chapter
• B. Polyak (1987), Introduction to optimization, Chapter
• L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012
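The subgradient and projected subgradient updates from this lecture can be sketched in a few lines of Python. The block below is only an illustration under stated assumptions: the tiny matrix `A`, the vector `b`, the starting point, and every function name are made up for the example and do not come from the slides. It minimizes $f(x) = \|Ax - b\|_1$ with the subgradient $A^T \mathrm{sign}(Ax - b)$, once with Polyak step sizes and once with the diminishing rule $t_k = 1/k$ plus projection onto the nonnegative orthant.

```python
# Toy instance of the 1-norm example: f(x) = ||Ax - b||_1.
# A, b and all names below are illustrative assumptions, not from the slides.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, -2.0, 0.0]


def residual(x):
    return [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]


def f(x):
    return sum(abs(r) for r in residual(x))


def subgrad(x):
    # A^T sign(Ax - b); taking sign(0) = 0 is a valid subgradient choice.
    s = [(r > 0) - (r < 0) for r in residual(x)]
    return [sum(A[i][j] * s[i] for i in range(len(A))) for j in range(len(x))]


def subgradient_method(x0, step, iters, project=None):
    """Iterate x_k = P_C(x_{k-1} - t_k g_{k-1}) and return the best iterate."""
    x = list(x0)
    x_best, f_best = list(x), f(x)
    for k in range(1, iters + 1):
        g = subgrad(x)
        t = step(k, x, g)
        x = [xi - t * gi for xi, gi in zip(x, g)]
        if project is not None:
            x = project(x)
        fx = f(x)
        if fx < f_best:  # not a descent method, so remember the best point
            x_best, f_best = list(x), fx
    return x_best, f_best


def diminishing(k, x, g):
    return 1.0 / k  # square summable but not summable


def polyak(k, x, g, f_star=1.0):
    # f_star = 1 for this toy problem (it follows from the triangle inequality).
    gg = sum(gi * gi for gi in g)
    return (f(x) - f_star) / gg if gg > 0 else 0.0


def project_nonneg(x):
    return [max(xi, 0.0) for xi in x]  # projection onto the nonnegative orthant


x_polyak, f_polyak = subgradient_method([3.0, 4.0], polyak, 200)
x_proj, f_proj = subgradient_method([3.0, 4.0], diminishing, 200, project_nonneg)
print(x_polyak, f_polyak)
print(x_proj, f_proj)
```

For this toy problem the unconstrained optimal value is 1 and the optimal value over $x \ge 0$ is 3, so the two printed objective values can be compared against those numbers. The step rule is passed in as a function because the lecture treats step sizes as pre-specified choices that are independent of the update itself.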
Convergence analysis - Proof (excerpts)

Fixed step size: ... $f(x^{(k)}_{\text{best}})$ is approximately $L^2 t/2$-suboptimal. To make the gap $\le \varepsilon$, let's make each term $\le \varepsilon/2$. So we can choose $t = \varepsilon/L^2$, and $k = R^2/t \cdot 1/\varepsilon = R^2 L^2/\varepsilon^2$, where $R = \|x^{(0)} - x^*\|_2$.

Fixed step length: ... $f(x^{(k)}_{\text{best}})$ is approximately $Ls/2$-suboptimal. To make the gap $\le \varepsilon$, let's make each term $\le \varepsilon/2$. So we can choose $s = \varepsilon/L$, and $k = LR^2/s \cdot 1/\varepsilon = R^2 L^2/\varepsilon^2$.
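The fixed step size recipe $t = \varepsilon/L^2$, $k = R^2 L^2/\varepsilon^2$ can be checked numerically against the bound $R^2/(2kt) + L^2 t/2$ from the theorem. The sketch below does this for made-up values of $R$, $L$ and $\varepsilon$; the function names and the numbers are assumptions chosen only for illustration.

```python
import math


def fixed_step_plan(R, L, eps):
    """Step size and iteration count that make each bound term at most eps/2."""
    t = eps / L**2
    k = math.ceil(R**2 * L**2 / eps**2)
    return t, k


def fixed_step_bound(R, L, t, k):
    """Right-hand side of the fixed step size bound: R^2/(2kt) + L^2 t/2."""
    return R**2 / (2 * k * t) + L**2 * t / 2


# Hypothetical problem constants, chosen only for illustration.
R, L = 1.0, 2.0
t1, k1 = fixed_step_plan(R, L, eps=0.1)
t2, k2 = fixed_step_plan(R, L, eps=0.01)
print(t1, k1, fixed_step_bound(R, L, t1, k1))
print(t2, k2, fixed_step_bound(R, L, t2, k2))
```

Shrinking $\varepsilon$ by a factor of 10 multiplies the iteration count by about 100, which is what the $O(1/\varepsilon^2)$ rate means in practice.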

Date posted: 18/05/2021, 11:57