Advanced Optimization Lecture Notes: Chapter 6 - Hoàng Nam Dũng




Document information

The Advanced Optimization lecture notes, Chapter 6: Subgradients, cover the following topics: last time - gradient descent, subgradients, examples of subgradients, monotonicity, examples of non-subdifferentiable functions, and more. Readers are invited to consult the material.

Subgradients
Hoàng Nam Dũng
Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội

Last time: gradient descent
Consider the problem
    min_x f(x)
for f convex and differentiable, dom(f) = R^n. Gradient descent: choose an initial x^(0) ∈ R^n and repeat
    x^(k) = x^(k-1) − t_k · ∇f(x^(k-1)),   k = 1, 2, 3, ...
Step sizes t_k are chosen to be fixed and small, or by backtracking line search. If ∇f is Lipschitz, gradient descent has convergence rate O(1/ε). Downsides:
- Requires f differentiable.
- Can be slow to converge.

Outline
Today:
- Subgradients
- Examples
- Properties
- Optimality characterizations

Basic inequality
Recall that for convex and differentiable f,
    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)   for all x, y ∈ dom(f).
- The first-order approximation of f at x is a global lower bound.
- ∇f(x) defines a non-vertical supporting hyperplane to epi(f) at (x, f(x)):
    (∇f(x), −1)ᵀ ((y, t) − (x, f(x))) ≤ 0   for all (y, t) ∈ epi(f).

Subgradients
A subgradient of a convex function f at x is any g ∈ R^n such that
    f(y) ≥ f(x) + gᵀ(y − x)   for all y ∈ dom(f).
- Always exists (on the relative interior of dom(f)).
- If f is differentiable at x, then g = ∇f(x) uniquely.
- The same definition works for nonconvex f (however, subgradients need not exist).
(Figure: g_1 and g_2 are subgradients at x_1; g_3 is a subgradient at x_2.)

Examples of subgradients
Consider f: R → R, f(x) = |x|.
- For x ≠ 0, the unique subgradient is g = sign(x).
- For x = 0, the subgradient g is any element of [−1, 1].

Consider f: R^n → R, f(x) = ‖x‖₂.
- For x ≠ 0, the unique subgradient is g = x / ‖x‖₂.
- For x = 0, the subgradient g is any element of {z : ‖z‖₂ ≤ 1}.

Consider f: R^n → R, f(x) = ‖x‖₁.
- For x_i ≠ 0, the unique ith component is g_i = sign(x_i).
- For x_i = 0, the ith component g_i is any element of [−1, 1].

Consider f(x) = max{f_1(x), f_2(x)}, for f_1, f_2: R^n → R convex and differentiable.
- For f_1(x) > f_2(x), the unique subgradient is g = ∇f_1(x).
- For f_2(x) > f_1(x), the unique subgradient is g = ∇f_2(x).
- For f_1(x) = f_2(x), the subgradient g is any point on the line segment between ∇f_1(x) and ∇f_2(x).
(A small numerical check of these examples follows below.)
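As a quick illustration of the subgradient inequality, the following Python sketch (not part of the original slides; function names and the random test are purely illustrative) numerically verifies that the subgradients stated above for ‖x‖₁ and ‖x‖₂ satisfy f(y) ≥ f(x) + gᵀ(y − x) at randomly drawn points y.

```python
import numpy as np

def subgrad_l1(x):
    # A subgradient of f(x) = ||x||_1: sign(x_i) where x_i != 0,
    # and 0 (any value in [-1, 1] would do) where x_i == 0.
    return np.sign(x)

def subgrad_l2(x):
    # A subgradient of f(x) = ||x||_2: x / ||x||_2 for x != 0,
    # and 0 (any vector with norm <= 1 would do) at x = 0.
    n = np.linalg.norm(x)
    return x / n if n > 0 else np.zeros_like(x)

def check_subgradient(f, subgrad, x, trials=1000, tol=1e-10):
    # Check the inequality f(y) >= f(x) + g^T (y - x) at random points y.
    g = subgrad(x)
    for _ in range(trials):
        y = x + np.random.randn(*x.shape)
        if f(y) < f(x) + g @ (y - x) - tol:
            return False
    return True

np.random.seed(0)
x = np.random.randn(5)
print(check_subgradient(lambda z: np.sum(np.abs(z)), subgrad_l1, x))            # True
print(check_subgradient(lambda z: np.linalg.norm(z), subgrad_l2, x))            # True
print(check_subgradient(lambda z: np.sum(np.abs(z)), subgrad_l1, np.zeros(5)))  # True
```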
Optimality condition
Subgradient optimality condition: for any f (convex or not),
    f(x*) = min_x f(x)   ⟺   0 ∈ ∂f(x*),
i.e., x* is a minimizer if and only if 0 is a subgradient of f at x*.

Why? Easy: g = 0 being a subgradient means that for all y,
    f(y) ≥ f(x*) + 0ᵀ(y − x*) = f(x*).
Note the implication for a convex and differentiable function f with ∂f(x) = {∇f(x)}.

Derivation of first-order optimality
Example of the power of subgradients: we can use what we have learned so far to derive the first-order optimality condition.

Theorem. For f convex and differentiable and C convex,
    min_x f(x)  subject to  x ∈ C
is solved at x if and only if
    ∇f(x)ᵀ(y − x) ≥ 0   for all y ∈ C.
(For a direct proof see, e.g., http://www.princeton.edu/~amirali/Public/Teaching/ORF523/S16/ORF523_S16_Lec7_gh.pdf; a proof using subgradients is given below.)

Intuitively: the gradient ∇f(x) makes a nonnegative inner product with every feasible direction y − x, so f does not decrease to first order as we move away from x within C. Note that for C = R^n (the unconstrained case) it reduces to ∇f(x) = 0.

Proof. First recast the problem as
    min_x f(x) + I_C(x),
where I_C is the indicator function of C. Now apply subgradient optimality: 0 ∈ ∂(f(x) + I_C(x)). Observe, using ∂I_C(x) = N_C(x), the normal cone to C at x,
    0 ∈ ∂(f(x) + I_C(x))
    ⟺ 0 ∈ {∇f(x)} + N_C(x)
    ⟺ −∇f(x) ∈ N_C(x)
    ⟺ −∇f(x)ᵀx ≥ −∇f(x)ᵀy for all y ∈ C
    ⟺ ∇f(x)ᵀ(y − x) ≥ 0 for all y ∈ C,
as desired. Note: the condition 0 ∈ ∂f(x) + N_C(x) is a fully general condition for optimality in convex problems, but it is not always easy to work with (the KKT conditions, covered later, are easier).

Example: lasso optimality conditions
Given y ∈ R^n and X ∈ R^{n×p}, the lasso problem can be parametrized as
    min_β (1/2) ‖y − Xβ‖₂² + λ ‖β‖₁,
where λ ≥ 0. Subgradient optimality:
    0 ∈ ∂( (1/2) ‖y − Xβ‖₂² + λ ‖β‖₁ )
    ⟺ 0 ∈ −Xᵀ(y − Xβ) + λ ∂‖β‖₁
    ⟺ Xᵀ(y − Xβ) = λ v   for some v ∈ ∂‖β‖₁,
i.e., for i = 1, ..., p,
    v_i ∈ {1}       if β_i > 0,
    v_i ∈ {−1}      if β_i < 0,
    v_i ∈ [−1, 1]   if β_i = 0.

Write X_1, ..., X_p for the columns of X. Then our condition reads:
    X_iᵀ(y − Xβ) = λ · sign(β_i)   if β_i ≠ 0,
    |X_iᵀ(y − Xβ)| ≤ λ             if β_i = 0.
Note: the subgradient optimality conditions do not lead to a closed-form expression for a lasso solution. However, they provide a way to check lasso optimality (a small numerical sketch of such a check follows below). They are also helpful in understanding the lasso estimator; e.g., if |X_iᵀ(y − Xβ)| < λ, then β_i = 0 (used by screening rules, later?).
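To make the lasso optimality check concrete, here is a minimal Python sketch (not from the slides; the function name, tolerance, and test data are illustrative assumptions) that tests whether a candidate β satisfies X_iᵀ(y − Xβ) = λ·sign(β_i) on the support and |X_iᵀ(y − Xβ)| ≤ λ off the support.

```python
import numpy as np

def is_lasso_optimal(X, y, beta, lam, tol=1e-8):
    # Subgradient optimality check for (1/2)||y - X beta||_2^2 + lam * ||beta||_1.
    r = X.T @ (y - X @ beta)          # X_i^T (y - X beta) for each column X_i
    active = np.abs(beta) > tol       # coordinates with beta_i != 0
    cond_active = np.all(np.abs(r[active] - lam * np.sign(beta[active])) <= tol)
    cond_zero = np.all(np.abs(r[~active]) <= lam + tol)
    return cond_active and cond_zero

# Tiny example: for lam >= max_i |X_i^T y|, beta = 0 is optimal, since the
# condition then reduces to |X_i^T y| <= lam for all i.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
lam_big = np.max(np.abs(X.T @ y))
print(is_lasso_optimal(X, y, np.zeros(5), lam_big))        # True
print(is_lasso_optimal(X, y, np.zeros(5), 0.5 * lam_big))  # False (in general)
```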
Example: soft-thresholding
Simplified lasso problem with X = I:
    min_β (1/2) ‖y − β‖₂² + λ ‖β‖₁.
This we can solve directly using subgradient optimality. The solution is β = S_λ(y), where S_λ is the soft-thresholding operator (a small numerical sketch of this operator is given after the references below): for i = 1, ..., n,
    [S_λ(y)]_i = y_i − λ   if y_i > λ,
    [S_λ(y)]_i = 0         if −λ ≤ y_i ≤ λ,
    [S_λ(y)]_i = y_i + λ   if y_i < −λ.
Check: from the last slide, the subgradient optimality conditions are
    y_i − β_i = λ · sign(β_i)   if β_i ≠ 0,
    |y_i − β_i| ≤ λ             if β_i = 0.
Now plug in β = S_λ(y) and check that these are satisfied:
- When y_i > λ, β_i = y_i − λ > 0, so y_i − β_i = λ = λ · 1.
- When y_i < −λ, the argument is similar.
- When −λ ≤ y_i ≤ λ, β_i = 0 and |y_i − β_i| = |y_i| ≤ λ.
(Figure: the soft-thresholding operator in one dimension.)

Example: distance to a convex set
Recall the distance function to a closed convex set C,
    dist(x, C) = min_{y ∈ C} ‖y − x‖₂,
and write P_C(x) for the projection of x onto C, so that dist(x, C) = ‖x − P_C(x)‖₂. When dist(x, C) > 0,
    ∂ dist(x, C) = { (x − P_C(x)) / ‖x − P_C(x)‖₂ }.
This set only has one element, so in fact dist(x, C) is differentiable and this is its gradient.

We will only show one direction, i.e., that (x − P_C(x)) / ‖x − P_C(x)‖₂ ∈ ∂ dist(x, C). Write u = P_C(x). Then by the first-order optimality conditions for a projection,
    (x − u)ᵀ(y − u) ≤ 0   for all y ∈ C.
Hence C ⊆ H = {y : (x − u)ᵀ(y − u) ≤ 0}.

Claim:
    dist(y, C) ≥ (x − u)ᵀ(y − u) / ‖x − u‖₂   for all y.
Check: first, for y ∈ H, the right-hand side is ≤ 0, so the claim holds trivially. Now for y ∉ H, we have
    (x − u)ᵀ(y − u) = ‖x − u‖₂ ‖y − u‖₂ cos θ,
where θ is the angle between x − u and y − u. Thus
    (x − u)ᵀ(y − u) / ‖x − u‖₂ = ‖y − u‖₂ cos θ = dist(y, H) ≤ dist(y, C),
as desired.

Using the claim, we have for any y
    dist(y, C) ≥ (x − u)ᵀ(y − x + x − u) / ‖x − u‖₂
               = ‖x − u‖₂ + ((x − u) / ‖x − u‖₂)ᵀ (y − x)
               = dist(x, C) + ((x − u) / ‖x − u‖₂)ᵀ (y − x).
Hence g = (x − u) / ‖x − u‖₂ is a subgradient of dist(x, C) at x.

References and further reading
- S. Boyd, Lecture notes for EE 264B, Stanford University, Spring 2010-2011.
- R. T. Rockafellar (1970), Convex Analysis, Chapters 23-25.
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012.
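As referenced in the soft-thresholding example above, here is a minimal Python sketch (not part of the slides; names and test values are illustrative) of the operator S_λ together with a numerical check, via the conditions stated there, that S_λ(y) is optimal for the simplified lasso with X = I.

```python
import numpy as np

def soft_threshold(y, lam):
    # Componentwise soft-thresholding: y_i - lam if y_i > lam,
    # 0 if |y_i| <= lam, y_i + lam if y_i < -lam.
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def check_simplified_lasso_optimality(y, beta, lam, tol=1e-10):
    # Subgradient optimality for min_beta (1/2)||y - beta||_2^2 + lam * ||beta||_1:
    # y_i - beta_i = lam * sign(beta_i) when beta_i != 0, |y_i - beta_i| <= lam otherwise.
    r = y - beta
    nonzero = beta != 0
    ok_nonzero = np.all(np.abs(r[nonzero] - lam * np.sign(beta[nonzero])) <= tol)
    ok_zero = np.all(np.abs(r[~nonzero]) <= lam + tol)
    return ok_nonzero and ok_zero

rng = np.random.default_rng(0)
y = rng.standard_normal(8)
lam = 0.7
beta = soft_threshold(y, lam)
print(beta)
print(check_simplified_lasso_optimality(y, beta, lam))  # True
```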


