Lecture on Advanced Optimization: Chapter 5 - Hoàng Nam Dũng


Document information

Lecture on Advanced Optimization, Chapter 5: Gradient descent covers the following topics: gradient descent, gradient descent interpretation, fixed step size, backtracking line search, and more.

Gradient Descent
Hoàng Nam Dũng
Faculty of Mathematics, Mechanics and Informatics, VNU University of Science, Vietnam National University, Hanoi

Gradient descent

Consider unconstrained, smooth convex optimization

min_x f(x)

with a convex and differentiable function f : R^n → R. Denote the optimal value by f* = min_x f(x) and a solution by x*.

Gradient descent: choose an initial point x^(0) ∈ R^n and repeat

x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),   k = 1, 2, 3, ...

Stop at some point.

Gradient descent interpretation

At each iteration, consider the expansion

f(y) ≈ f(x) + ∇f(x)^T (y − x) + (1/(2t)) ‖y − x‖_2^2.

This is a quadratic approximation, replacing the usual Hessian ∇^2 f(x) by (1/t) I:
- f(x) + ∇f(x)^T (y − x) is the linear approximation to f,
- (1/(2t)) ‖y − x‖_2^2 is a proximity term to x, with weight 1/(2t).

Choose the next point y = x^+ to minimize the quadratic approximation:

x^+ = argmin_y f(x) + ∇f(x)^T (y − x) + (1/(2t)) ‖y − x‖_2^2 = x − t ∇f(x).

[Figure: the blue point is x, the red point is x^+.]

Outline
- How to choose step sizes
- Convergence analysis
- Nonconvex functions
- Gradient boosting

Fixed step size

Simply take t_k = t for all k = 1, 2, 3, ...; this can diverge if t is too big. Consider f(x) = (10 x_1^2 + x_2^2)/2; with too large a step size the iterates diverge after only a few steps.

[Figure: contour plot of f with diverging gradient descent iterates.]

Gradient descent can also be slow if t is too small. Same example, gradient descent after 100 steps:

[Figure: contour plot of f with 100 slowly converging gradient descent iterates.]

Convergence analysis

Assume that f : R^n → R is convex and differentiable, and additionally

‖∇f(x) − ∇f(y)‖_2 ≤ L ‖x − y‖_2   for any x, y,

i.e., ∇f is Lipschitz continuous with constant L > 0.

Theorem. Gradient descent with fixed step size t ≤ 1/L satisfies

f(x^(k)) − f* ≤ ‖x^(0) − x*‖_2^2 / (2tk),

and the same result holds for backtracking, with t replaced by β/L.

We say gradient descent has convergence rate O(1/k), i.e., it finds an ε-suboptimal point in O(1/ε) iterations.

Proof: Slides 20-25 in http://www.seas.ucla.edu/~vandenbe/236C/lectures/gradient.pdf
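To make the iteration and the step-size condition concrete, here is a minimal NumPy sketch (my own illustration, not part of the slides; the starting point and the two step sizes are assumptions) run on the example f(x) = (10 x_1^2 + x_2^2)/2, for which ∇^2 f = diag(10, 1) and hence L = 10:

```python
import numpy as np

def grad_descent(grad, x0, t, k):
    """Run k gradient descent steps x <- x - t * grad(x) with fixed step size t."""
    x = np.asarray(x0, dtype=float)
    for _ in range(k):
        x = x - t * grad(x)
    return x

# Example from the slides: f(x) = (10*x1^2 + x2^2)/2, so grad f(x) = (10*x1, x2) and L = 10.
f = lambda x: (10 * x[0] ** 2 + x[1] ** 2) / 2
grad_f = lambda x: np.array([10 * x[0], x[1]])

x0 = [1.0, 1.0]  # hypothetical starting point
print(f(grad_descent(grad_f, x0, t=0.05, k=100)))  # t <= 1/L = 0.1: converges toward f* = 0
print(f(grad_descent(grad_f, x0, t=0.21, k=100)))  # t*L > 2: the first coordinate oscillates and diverges
```

The divergent run blows up because each step multiplies the x_1 coordinate by 1 − 10t, whose magnitude exceeds 1 once t > 0.2.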
Convergence under strong convexity

Reminder: strong convexity of f means that f(x) − (m/2) ‖x‖_2^2 is convex for some m > 0.

Assuming a Lipschitz gradient as before and also strong convexity:

Theorem. Gradient descent with fixed step size t ≤ 2/(m + L) or with backtracking line search satisfies

f(x^(k)) − f* ≤ c^k (L/2) ‖x^(0) − x*‖_2^2,

where 0 < c < 1.

The rate under strong convexity is O(c^k), exponentially fast, i.e., we find an ε-suboptimal point in O(log(1/ε)) iterations.

Proof: Slides 26-27 in http://www.seas.ucla.edu/~vandenbe/236C/lectures/gradient.pdf

Convergence rate

This is called linear convergence, because it looks linear on a semi-log plot.

[Figure 9.6 (from B & V, page 487): error f(x^(k)) − p* versus iteration k for the gradient method with backtracking and exact line search, for a problem in R^100.]

Important note: the contraction factor c in the rate depends adversely on the condition number L/m: a higher condition number means a slower rate. This affects not only our upper bound; it is very apparent in practice too.

A look at the conditions

A look at the conditions for a simple problem, f(β) = (1/2) ‖y − Xβ‖_2^2:
- Lipschitz continuity of ∇f: this means ∇^2 f(x) ⪯ LI. As ∇^2 f(β) = X^T X, we have L = σ_max(X^T X).
- Strong convexity of f: this means ∇^2 f(x) ⪰ mI. As ∇^2 f(β) = X^T X, we have m = σ_min(X^T X).
- If X is wide (i.e., X is n × p with p > n), then σ_min(X^T X) = 0, and f can't be strongly convex.
- Even if σ_min(X^T X) > 0, we can have a very large condition number L/m = σ_max(X^T X)/σ_min(X^T X).

Practicalities

Stopping rule: stop when ‖∇f(x)‖_2 is small.
- Recall that ∇f(x*) = 0 at a solution x*.
- If f is strongly convex with parameter m, then ‖∇f(x)‖_2 ≤ √(2mε) implies f(x) − f* ≤ ε.

Pros and cons of gradient descent:
- Pro: simple idea, and each iteration is cheap (usually).
- Pro: fast for well-conditioned, strongly convex problems.
- Con: can often be slow, because many interesting problems aren't strongly convex or well-conditioned.
- Con: can't handle nondifferentiable functions.
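The least-squares conditions and the stopping rule above can be checked numerically. Below is a small sketch (my own illustration; the random data, the tolerance 1e-6, and the zero initialization are assumptions) that computes L = σ_max(X^T X) and m = σ_min(X^T X), then runs gradient descent with the step size t = 2/(m + L) from the strong-convexity theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 20                        # tall X (p < n), so X^T X can be positive definite
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# f(beta) = 0.5 * ||y - X beta||_2^2, grad f(beta) = X^T (X beta - y), Hessian = X^T X
eigvals = np.linalg.eigvalsh(X.T @ X)
m, L = eigvals.min(), eigvals.max()   # strong convexity and Lipschitz constants
print("condition number L/m:", L / m)

t = 2 / (m + L)                       # fixed step size from the strong convexity theorem
beta = np.zeros(p)
grad = X.T @ (X @ beta - y)
while np.linalg.norm(grad) > 1e-6:    # stopping rule: ||grad f|| small
    beta = beta - t * grad
    grad = X.T @ (X @ beta - y)

print(np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0]))  # matches the least-squares solution
```

With a tall, well-conditioned X this stops after a handful of iterations; making X wide (p > n) would give m = 0, and the strongly convex rate would no longer apply.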
Can we do better?

Gradient descent has O(1/ε) convergence rate over the problem class of convex, differentiable functions with Lipschitz gradients.

First-order method: an iterative method which updates x^(k) in

x^(0) + span{∇f(x^(0)), ∇f(x^(1)), ..., ∇f(x^(k−1))}.

Theorem (Nesterov). For any k ≤ (n − 1)/2 and any starting point x^(0), there is a function f in the problem class such that any first-order method satisfies

f(x^(k)) − f* ≥ 3L ‖x^(0) − x*‖_2^2 / (32 (k + 1)^2).

Can we attain the rate O(1/k^2), i.e., O(1/√ε)? Answer: yes (we'll see)!

What about nonconvex functions?

Assume f is differentiable with Lipschitz gradient as before, but now nonconvex. Asking for optimality is too much, so we'll settle for x such that ‖∇f(x)‖_2 ≤ ε, called ε-stationarity.

Theorem. Gradient descent with fixed step size t ≤ 1/L satisfies

min_{i=0,...,k} ‖∇f(x^(i))‖_2 ≤ √( 2 (f(x^(0)) − f*) / (t (k + 1)) ).

Thus gradient descent has rate O(1/√k), or O(1/ε^2), even in the nonconvex case, for finding stationary points. This rate cannot be improved (over the class of differentiable functions with Lipschitz gradients) by any deterministic algorithm (Carmon et al. (2017), "Lower bounds for finding stationary points I").

Proof. Key steps:
- ∇f Lipschitz with constant L means
  f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ‖y − x‖_2^2   for all x, y.
- Plugging in y = x^+ = x − t ∇f(x),
  f(x^+) ≤ f(x) − (1 − Lt/2) t ‖∇f(x)‖_2^2.
- Taking 0 < t ≤ 1/L and rearranging,
  ‖∇f(x)‖_2^2 ≤ (2/t) (f(x) − f(x^+)).
- Summing over iterations,
  Σ_{i=0}^{k} ‖∇f(x^(i))‖_2^2 ≤ (2/t) (f(x^(0)) − f(x^(k+1))) ≤ (2/t) (f(x^(0)) − f*).
- Lower bound the sum by (k + 1) min_{i=0,...,k} ‖∇f(x^(i))‖_2^2 and conclude.

References and further reading
- S. Boyd and L. Vandenberghe (2004), Convex optimization, Chapter 9.
- T. Hastie, R. Tibshirani and J. Friedman (2009), The elements of statistical learning, Chapters 10 and 16.
- Y. Nesterov (1998), Introductory lectures on convex optimization: a basic course.
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012.
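As a closing sanity check of the nonconvex stationarity bound proved above, here is a small sketch (my own illustration; the one-dimensional test function, starting point, and iteration count are assumptions). For f(x) = x^2/2 + 2 sin(x) we have f''(x) = 1 − 2 sin(x) ∈ [−1, 3], so f is nonconvex while ∇f is Lipschitz with L = 3; the script compares min_i |∇f(x^(i))| with the bound from the summation step of the proof, using f(x^(k+1)) in place of the unknown f*:

```python
import numpy as np

# Nonconvex test function: f(x) = x^2/2 + 2*sin(x); f''(x) = 1 - 2*sin(x) lies in [-1, 3],
# so f is not convex but grad f is Lipschitz with constant L = 3.
f = lambda x: x ** 2 / 2 + 2 * np.sin(x)
grad_f = lambda x: x + 2 * np.cos(x)

L = 3.0
t = 1.0 / L          # fixed step size t = 1/L
x, k = 5.0, 50       # hypothetical starting point and number of steps
xs = [x]
for _ in range(k + 1):
    x = x - t * grad_f(x)
    xs.append(x)

grad_norms = np.abs([grad_f(xi) for xi in xs[: k + 1]])            # |grad f(x^(i))|, i = 0..k
bound = np.sqrt(2 * (f(xs[0]) - f(xs[k + 1])) / (t * (k + 1)))     # summation step of the proof
print(grad_norms.min(), "<=", bound, ":", grad_norms.min() <= bound)
```

Increasing k shrinks the bound at the O(1/√k) rate stated in the theorem.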
