The lecture Advanced Optimization - Chapter 8: Proximal gradient descent (and acceleration) covers the following topics: the subgradient method, decomposable functions, the proximal mapping, proximal gradient descent, acceleration, and more.
Proximal Gradient Descent (and Acceleration)
Hồng Nam Dũng
Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội

Last time: subgradient method

Consider the problem

    min_x f(x)

with f convex and dom(f) = R^n. Subgradient method: choose an initial x^(0) ∈ R^n, and repeat

    x^(k) = x^(k-1) - t_k · g^(k-1),   k = 1, 2, 3, ...,

where g^(k-1) ∈ ∂f(x^(k-1)). We use pre-set rules for the step sizes (e.g., a diminishing step size rule).

If f is Lipschitz, then the subgradient method has convergence rate O(1/ε^2). Upside: very generic. Downside: it can be slow (addressed today).

Outline

Today:
- Proximal gradient descent
- Convergence analysis
- ISTA, matrix completion
- Special cases
- Acceleration

Decomposable functions

Suppose

    f(x) = g(x) + h(x)

where
- g is convex, differentiable, with dom(g) = R^n
- h is convex, not necessarily differentiable.

If f were differentiable, then the gradient descent update would be

    x^+ = x - t · ∇f(x).

Recall the motivation: minimize the quadratic approximation to f around x, replacing ∇^2 f(x) by (1/t) I:

    x^+ = argmin_z f(x) + ∇f(x)^T (z - x) + (1/(2t)) ||z - x||_2^2 =: argmin_z f̃_t(z).

In our case f is not differentiable, but f = g + h with g differentiable. Why don't we make a quadratic approximation to g and leave h alone?
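As a concrete illustration of the subgradient method recap, here is a minimal sketch on the toy problem f(x) = ||x||_1 (the problem, the step-size rule t_k = 1/k, and the function names are illustrative choices, not from the lecture):

```python
import numpy as np

# Minimal subgradient method sketch for f(x) = ||x||_1 (a toy choice:
# convex, non-differentiable at 0; sign(x) is a valid subgradient).
def subgradient_method(x0, n_iter=500):
    x = x0.copy()
    best = x.copy()
    for k in range(1, n_iter + 1):
        g = np.sign(x)               # g^(k-1) in the subdifferential of ||.||_1
        x = x - (1.0 / k) * g        # diminishing step size rule t_k = 1/k
        # not a descent method, so track the best iterate seen so far
        if np.abs(x).sum() < np.abs(best).sum():
            best = x.copy()
    return best

x_best = subgradient_method(np.array([3.0, -2.0, 1.0]))
```

After 500 steps the best iterate is close to the minimizer x* = 0, consistent with the generic but slow O(1/ε^2) behavior.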
I.e., update

    x^+ = argmin_z g̃_t(z) + h(z)
        = argmin_z g(x) + ∇g(x)^T (z - x) + (1/(2t)) ||z - x||_2^2 + h(z)
        = argmin_z (1/(2t)) ||z - (x - t ∇g(x))||_2^2 + h(z).

The first term says: stay close to the gradient update for g; the second: also make h small.

Proximal mapping

The proximal mapping (or prox-operator) of a convex function h is defined as

    prox_h(x) = argmin_z (1/2) ||x - z||_2^2 + h(z).

Examples:
- h(x) = 0: prox_h(x) = x.
- h(x) is the indicator function of a closed convex set C: prox_h is the projection onto C,

      prox_h(x) = argmin_{z ∈ C} ||x - z||_2^2 = P_C(x).

- h(x) = ||x||_1: prox_h is the 'soft-threshold' (shrinkage) operation,

      prox_h(x)_i = x_i - 1   if x_i ≥ 1,
                    0         if |x_i| ≤ 1,
                    x_i + 1   if x_i ≤ -1.

Theorem. If h is convex and closed (has closed epigraph), then

    prox_h(x) = argmin_z (1/2) ||x - z||_2^2 + h(z)

exists and is unique for all x.

Proof. See http://www.seas.ucla.edu/~vandenbe/236C/lectures/proxop.pdf. Uniqueness holds since the objective function is strictly convex.

Acceleration

It turns out we can accelerate proximal gradient descent in order to achieve the optimal O(1/√ε) convergence rate. Four ideas (three acceleration methods) by Nesterov:
- 1983: original acceleration idea for smooth functions
- 1988: another acceleration idea for smooth functions
- 2005: smoothing techniques for nonsmooth functions, coupled with the original acceleration idea
- 2007: acceleration idea for composite functions

We will follow Beck and Teboulle (2008), an extension of Nesterov (1983) to composite functions.
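All three prox examples above have closed forms, so they can be evaluated directly; a minimal sketch (function names are mine, and C is taken to be a box for the projection example):

```python
import numpy as np

# Closed-form proximal mappings, prox_h(x) = argmin_z (1/2)||x - z||_2^2 + h(z)

def prox_zero(x):
    # h = 0: the prox is the identity
    return x

def prox_box(x, lo, hi):
    # h = indicator of the box C = [lo, hi]^n (a closed convex set):
    # the prox is Euclidean projection onto C
    return np.clip(x, lo, hi)

def prox_l1(x):
    # h(x) = ||x||_1: soft-thresholding at level 1, i.e.
    # x_i - 1 if x_i >= 1, 0 if |x_i| <= 1, x_i + 1 if x_i <= -1
    return np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)
```

For example, prox_l1 maps (2, 0.5, -3) to (1, 0, -2): entries inside [-1, 1] collapse to 0 and the rest shrink toward 0 by 1.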
One approach: each step uses the entire history of previous steps and makes two prox calls. Another: each step uses information from the last two steps and makes one prox call.

Accelerated proximal gradient method

As before, consider

    min_x g(x) + h(x),

where g is convex and differentiable, and h is convex. Accelerated proximal gradient method: choose an initial point x^(0) = x^(-1) ∈ R^n, and repeat for k = 1, 2, 3, ...:

    v = x^(k-1) + ((k-2)/(k+1)) (x^(k-1) - x^(k-2))
    x^(k) = prox_{t_k h}(v - t_k ∇g(v)).

The first step (k = 1) is just the usual proximal gradient update. After that, v = x^(k-1) + ((k-2)/(k+1)) (x^(k-1) - x^(k-2)) carries some "momentum" from previous iterations. Taking h = 0 gives the accelerated gradient method.

[Figure: momentum weights (k-2)/(k+1) versus k = 1, ..., 100; the weight rises from negative values toward 1 as k grows.]

Back to the lasso example: acceleration can really help!
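The iteration above can be sketched in a few lines; this is a minimal implementation with fixed step t = 1/L, applied to a small separable quadratic-plus-ℓ1 problem of my own choosing (names and data are illustrative):

```python
import numpy as np

def soft_threshold(x, lam):
    # prox of lam * ||.||_1: shrink each entry toward 0 by lam
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def accel_prox_grad(grad_g, prox_th, x0, t, n_iter=500):
    # x^(0) = x^(-1), so the k = 1 step is a plain proximal gradient update
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(1, n_iter + 1):
        v = x + (k - 2.0) / (k + 1.0) * (x - x_prev)   # momentum step
        x_prev, x = x, prox_th(v - t * grad_g(v))      # one prox call
    return x

# Toy problem: g(x) = 0.5 x^T diag(a) x - b^T x, h(x) = lam * ||x||_1
a = np.array([1.0, 2.0])
b = np.array([3.0, -0.2])
lam, t = 0.5, 0.5                    # fixed step t = 1/L with L = max(a)
x_hat = accel_prox_grad(lambda x: a * x - b,
                        lambda z: soft_threshold(z, t * lam),
                        np.zeros(2), t)
```

Because the toy problem is separable, its exact solution is x_i = S_lam(b_i)/a_i, i.e., (2.5, 0), and the iterates approach it at the accelerated O(1/k^2) rate.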
[Figure: f - f* versus k on the lasso example, comparing the subgradient method, proximal gradient, and Nesterov acceleration.]

Note: accelerated proximal gradient is not a descent method.

Backtracking line search

Backtracking can be combined with acceleration in different ways. A simple approach: fix β < 1 and t_0 = 1. At iteration k, start with t = t_{k-1}, and while

    g(x^+) > g(v) + ∇g(v)^T (x^+ - v) + (1/(2t)) ||x^+ - v||_2^2,

shrink t = βt and recompute x^+ = prox_{th}(v - t ∇g(v)); else keep x^+. Note that this strategy forces us to take decreasing step sizes (more complicated strategies exist which avoid this).

Convergence analysis

For the criterion f(x) = g(x) + h(x), we assume as before:
- g is convex, differentiable, dom(g) = R^n, and ∇g is Lipschitz continuous with constant L > 0
- h is convex, and prox_{th}(x) = argmin_z { ||x - z||_2^2 / (2t) + h(z) } can be evaluated.

Theorem. The accelerated proximal gradient method with fixed step size t ≤ 1/L satisfies

    f(x^(k)) - f* ≤ 2 ||x^(0) - x*||_2^2 / (t (k+1)^2),

and the same result holds for backtracking, with t replaced by β/L.

This achieves the optimal rate O(1/k^2), i.e., O(1/√ε), for first-order methods.

FISTA (Fast ISTA)

Back to the lasso problem

    min_β (1/2) ||y - Xβ||_2^2 + λ ||β||_1.

Recall ISTA (Iterative Soft-thresholding Algorithm):

    β^(k) = S_{λ t_k}(β^(k-1) + t_k X^T (y - X β^(k-1))),   k = 1, 2, 3, ...,

with S_λ(·) being vector soft-thresholding. Applying acceleration gives us FISTA (F is for Fast): for k = 1, 2, 3, ...,

    v = β^(k-1) + ((k-2)/(k+1)) (β^(k-1) - β^(k-2))
    β^(k) = S_{λ t_k}(v + t_k X^T (y - Xv)).

Beck and Teboulle (2008) actually call their general acceleration technique (for general g, h) FISTA, which may be somewhat confusing.

ISTA vs FISTA

[Figure: lasso regression, 100 instances (n = 100, p = 500): f(k) - f* versus k for ISTA and FISTA.]

[Figure: lasso logistic regression, 100 instances (n = 100, p = 500): f(k) - f* versus k for ISTA and FISTA.]
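The two iterations can be compared directly in code; a minimal sketch on a small synthetic lasso instance (the data, sizes, and λ are my own illustrative choices, much smaller than the n = 100, p = 500 experiments):

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista(X, y, lam, t, n_iter=2000):
    # iterative soft-thresholding: one prox-gradient step per iteration
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
    return beta

def fista(X, y, lam, t, n_iter=2000):
    # same step, plus the (k-2)/(k+1) momentum term
    beta = beta_prev = np.zeros(X.shape[1])
    for k in range(1, n_iter + 1):
        v = beta + (k - 2.0) / (k + 1.0) * (beta - beta_prev)
        beta_prev, beta = beta, soft_threshold(v + t * X.T @ (y - X @ v), lam * t)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
beta_star = np.zeros(20)
beta_star[:3] = [2.0, -1.0, 1.5]
y = X @ beta_star + 0.01 * rng.standard_normal(50)
t = 1.0 / np.linalg.eigvalsh(X.T @ X).max()   # fixed step t = 1/L
b_ista, b_fista = ista(X, y, 1.0, t), fista(X, y, 1.0, t)
```

Both methods converge to the same lasso solution here; FISTA simply gets there in far fewer iterations, which is the point of the plots above.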
Is acceleration always useful?

Acceleration can be a very effective speedup tool, but should it always be used? In practice, the speedup from acceleration is diminished in the presence of warm starts. E.g., suppose we want to solve the lasso problem for tuning parameter values λ_1 > λ_2 > ... > λ_r:
- when solving for λ_1, initialize x^(0) = 0 and record the solution x̂(λ_1)
- when solving for λ_j, initialize x^(0) = x̂(λ_{j-1}), the recorded solution for λ_{j-1}.

Over a fine enough grid of λ values, proximal gradient descent can often perform just as well without acceleration.

Sometimes backtracking and acceleration can be disadvantageous! Recall the matrix completion problem: the proximal gradient update is

    B^+ = S_λ(B + t (P_Ω(Y) - P_Ω(B))),

where S_λ is the matrix soft-thresholding operator, which requires an SVD.

One backtracking loop evaluates the generalized gradient G_t(x), i.e., evaluates prox_t(x), across various values of t. For matrix completion, this means multiple SVDs.

Acceleration changes the argument we pass to the prox: v - t ∇g(v) instead of x - t ∇g(x). For matrix completion (and t = 1),

    B - ∇g(B) = P_Ω(Y) + P_Ω^⊥(B)   (sparse + low rank ⇒ fast SVD),
    V - ∇g(V) = P_Ω(Y) + P_Ω^⊥(V)   (sparse + not necessarily low rank ⇒ slow SVD).

References and further reading

Nesterov's four ideas (three acceleration methods):
- Y. Nesterov (1983), "A method for solving a convex programming problem with convergence rate O(1/k^2)"
- Y. Nesterov (1988), "On an approach to the construction of optimal methods of minimization of smooth convex functions"
- Y. Nesterov (2005), "Smooth minimization of non-smooth functions"
- Y. Nesterov (2007), "Gradient methods for minimizing composite objective function"

Extensions and/or analyses:
- A. Beck and M. Teboulle (2008), "A fast iterative shrinkage-thresholding algorithm for linear inverse problems"
- S. Becker, J. Bobin and E. Candès (2009), "NESTA: a fast and accurate first-order method for sparse recovery"
- P. Tseng (2008), "On accelerated proximal gradient methods for convex-concave optimization"

Helpful lecture notes/books:
- E. Candès, Lecture notes for Math 301, Stanford University, Winter 2010-2011
- Y. Nesterov (1998), "Introductory lectures on convex optimization: a basic course", Chapter
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012