Advanced Optimization lecture notes, Chapter 6: Subgradients. Topics covered include: last time (gradient descent), subgradients, examples of subgradients, monotonicity, and examples of non-subdifferentiable functions, among others.
Subgradients

Hoàng Nam Dũng
Faculty of Mathematics, Mechanics and Informatics (Khoa Toán - Cơ - Tin học), University of Science, Vietnam National University, Hanoi

Last time: gradient descent

Consider the problem
$$\min_x f(x)$$
for $f$ convex and differentiable, $\mathrm{dom}(f) = \mathbb{R}^n$.

Gradient descent: choose an initial $x^{(0)} \in \mathbb{R}^n$, then repeat
$$x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
Step sizes $t_k$ are chosen to be fixed and small, or by backtracking line search.

If $\nabla f$ is Lipschitz, gradient descent has convergence rate $O(1/\varepsilon)$, i.e., it needs $O(1/\varepsilon)$ iterations to reach an $\varepsilon$-suboptimal point.

Downsides:
• Requires $f$ differentiable
• Can be slow to converge

Outline

Today:
• Subgradients
• Examples
• Properties
• Optimality characterizations

Basic inequality

Recall that for convex and differentiable $f$,
$$f(y) \ge f(x) + \nabla f(x)^T (y - x), \quad \forall x, y \in \mathrm{dom}(f)$$

[Figure: graph of $f$ with its tangent at $(x, f(x))$ and the normal vector $(\nabla f(x), -1)$.]

• The first-order approximation of $f$ at $x$ is a global lower bound
• $\nabla f(x)$ defines a non-vertical supporting hyperplane to $\mathrm{epi}(f)$ at $(x, f(x))$:
$$\begin{pmatrix} \nabla f(x) \\ -1 \end{pmatrix}^T \left( \begin{pmatrix} y \\ t \end{pmatrix} - \begin{pmatrix} x \\ f(x) \end{pmatrix} \right) \le 0, \quad \forall (y, t) \in \mathrm{epi}(f)$$

Subgradients

A subgradient of a convex function $f$ at $x \in \mathrm{dom}(f)$ is any $g \in \mathbb{R}^n$ such that
$$f(y) \ge f(x) + g^T (y - x), \quad \forall y \in \mathrm{dom}(f)$$
• Always exists (on the relative interior of $\mathrm{dom}(f)$)
• If $f$ is differentiable at $x$, then $g = \nabla f(x)$ uniquely
• The same definition works for nonconvex $f$ (however, subgradients need not exist)

[Figure: graph of $f(y)$ with the affine lower bounds $f(x_1) + g_1^T (y - x_1)$, $f(x_1) + g_2^T (y - x_1)$ and $f(x_2) + g_3^T (y - x_2)$.]

$g_1$ and $g_2$ are subgradients at $x_1$; $g_3$ is a subgradient at $x_2$.

Examples of subgradients

Consider $f : \mathbb{R} \to \mathbb{R}$, $f(x) = |x|$.

[Figure: plot of $f(x) = |x|$.]

• For $x \ne 0$, unique subgradient $g = \mathrm{sign}(x)$
• For $x = 0$, subgradient $g$ is any element of $[-1, 1]$

Consider $f : \mathbb{R}^n \to \mathbb{R}$, $f(x) = \|x\|_2$.

[Figure: surface plot of $f(x) = \|x\|_2$ over $(x_1, x_2)$.]

• For $x \ne 0$, unique subgradient $g = x / \|x\|_2$
• For $x = 0$, subgradient $g$ is any element of $\{z : \|z\|_2 \le 1\}$

Consider $f : \mathbb{R}^n \to \mathbb{R}$, $f(x) = \|x\|_1$.

[Figure: surface plot of $f(x) = \|x\|_1$ over $(x_1, x_2)$.]

• For $x_i \ne 0$, unique $i$th component $g_i = \mathrm{sign}(x_i)$
• For $x_i = 0$, $i$th component $g_i$ is any element of $[-1, 1]$

Consider $f(x) = \max\{f_1(x), f_2(x)\}$, for $f_1, f_2 : \mathbb{R}^n \to \mathbb{R}$ convex and differentiable.

[Figure: plot of $f(x) = \max\{f_1(x), f_2(x)\}$.]

• For $f_1(x) > f_2(x)$, unique subgradient $g = \nabla f_1(x)$
• For $f_2(x) > f_1(x)$, unique subgradient $g = \nabla f_2(x)$
• For $f_1(x) = f_2(x)$, subgradient $g$ is any point on the line segment between $\nabla f_1(x)$ and $\nabla f_2(x)$

Optimality condition

Subgradient optimality condition: for any $f$ (convex or not),
$$f(x^*) = \min_x f(x) \iff 0 \in \partial f(x^*),$$
where $\partial f(x^*)$ denotes the subdifferential, the set of all subgradients of $f$ at $x^*$. That is, $x^*$ is a minimizer if and only if $0$ is a subgradient of $f$ at $x^*$.

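
The subgradient examples above can be spot-checked numerically. The following is only a minimal sketch under stated assumptions: the helper names (`subgrad_l1`, `subgrad_l2`, `check_subgradient`) are hypothetical and not part of the lecture, the test point and sampling range are arbitrary, and at coordinates where $x_i = 0$ the value $g_i = 0$ is just one convenient element of $[-1, 1]$. The check samples random points $y$ and tests the defining inequality $f(y) \ge f(x) + g^T (y - x)$.

```python
import math
import random

def l1_norm(x):
    return sum(abs(v) for v in x)

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

def sign(v):
    return (v > 0) - (v < 0)

def subgrad_l1(x):
    # g_i = sign(x_i) if x_i != 0; at x_i = 0 any value in [-1, 1] works,
    # and this sketch (an assumption) picks 0.
    return [float(sign(v)) for v in x]

def subgrad_l2(x):
    # g = x / ||x||_2 if x != 0; at x = 0 any z with ||z||_2 <= 1 works,
    # and this sketch (an assumption) picks z = 0.
    n = l2_norm(x)
    if n == 0.0:
        return [0.0] * len(x)
    return [v / n for v in x]

def check_subgradient(f, g, x, trials=1000, scale=5.0, tol=1e-12, seed=0):
    # Returns False as soon as a sampled y violates f(y) >= f(x) + g^T (y - x).
    rng = random.Random(seed)
    fx = f(x)
    for _ in range(trials):
        y = [rng.uniform(-scale, scale) for _ in x]
        lower = fx + sum(gi * (yi - xi) for gi, yi, xi in zip(g, y, x))
        if f(y) < lower - tol:
            return False
    return True

x = [1.5, 0.0, -2.0]
ok_l1 = check_subgradient(l1_norm, subgrad_l1(x), x)
ok_l2 = check_subgradient(l2_norm, subgrad_l2(x), x)
# Not a subgradient: the second component 2.0 lies outside [-1, 1] although x_2 = 0.
bad_l1 = check_subgradient(l1_norm, [1.0, 2.0, -1.0], x)
print(ok_l1, ok_l2, bad_l1)
```

A random check like this can only refute a candidate, never prove it; it is useful mainly for catching sign or indexing mistakes. Note also that at $x = 0$ both helpers return $g = 0$, which is consistent with $0$ being a minimizer of either norm.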
The optimality condition holds for a simple reason: $g = 0$ being a subgradient means that for all $y$,
$$f(y) \ge f(x^*) + 0^T (y - x^*) = f(x^*)$$
Note the implication for a convex and differentiable function $f$, with $\partial f(x) = \{\nabla f(x)\}$: the condition reduces to $\nabla f(x^*) = 0$.

Derivation of first-order optimality

Example of the power of subgradients: we can use what we have learned so far to derive the first-order optimality condition.

Theorem. For $f$ convex and differentiable and $C$ convex, the problem
$$\min_x f(x) \quad \text{subject to } x \in C$$
is solved at $x$ if and only if
$$\nabla f(x)^T (y - x) \ge 0 \quad \text{for all } y \in C.$$
(For a direct proof see, e.g., http://www.princeton.edu/~amirali/Public/Teaching/ORF523/S16/ORF523_S16_Lec7_gh.pdf. A proof using subgradients follows.)

Intuitively, the condition says that $f$ cannot decrease, to first order, as we move from $x$ toward any other feasible point. Note that for $C = \mathbb{R}^n$ (the unconstrained case) it reduces to $\nabla f(x) = 0$.

Proof. First recast the problem as
$$\min_x f(x) + I_C(x),$$
where $I_C$ is the indicator function of $C$ (equal to $0$ on $C$ and $+\infty$ outside). Now apply subgradient optimality: $0 \in \partial (f(x) + I_C(x))$. Writing $N_C(x) = \{g : g^T x \ge g^T y \text{ for all } y \in C\}$ for the normal cone to $C$ at $x$, observe
$$\begin{aligned}
0 \in \partial (f(x) + I_C(x)) &\iff 0 \in \{\nabla f(x)\} + N_C(x) \\
&\iff -\nabla f(x) \in N_C(x) \\
&\iff -\nabla f(x)^T x \ge -\nabla f(x)^T y \text{ for all } y \in C \\
&\iff \nabla f(x)^T (y - x) \ge 0 \text{ for all } y \in C
\end{aligned}$$
as desired.

Note: the condition $0 \in \partial f(x) + N_C(x)$ is a fully general condition for optimality in convex problems. But it is not always easy to work with (the KKT conditions, later, are easier).

Example: lasso optimality conditions

Given $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, the lasso problem can be parametrized as
$$\min_\beta \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
where $\lambda \ge 0$. Subgradient optimality:
$$\begin{aligned}
0 \in \partial \left( \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right) &\iff 0 \in -X^T (y - X\beta) + \lambda \, \partial \|\beta\|_1 \\
&\iff X^T (y - X\beta) = \lambda v
\end{aligned}$$
for some $v \in \partial \|\beta\|_1$, i.e.,
$$v_i \in \begin{cases} \{1\} & \text{if } \beta_i > 0 \\ \{-1\} & \text{if } \beta_i < 0 \\ [-1, 1] & \text{if } \beta_i = 0 \end{cases}, \quad i = 1, \ldots, p$$

Write $X_1, \ldots, X_p$ for the columns of $X$. Then our condition reads
$$\begin{cases} X_i^T (y - X\beta) = \lambda \cdot \mathrm{sign}(\beta_i) & \text{if } \beta_i \ne 0 \\ |X_i^T (y - X\beta)| \le \lambda & \text{if } \beta_i = 0 \end{cases}$$

Note: the subgradient optimality conditions do not lead to a closed-form expression for a lasso solution. However, they do provide a way to check lasso optimality. They are also helpful in understanding the lasso estimator; e.g., if $|X_i^T (y - X\beta)| < \lambda$, then $\beta_i = 0$ (used by screening rules, later?).

Example: soft-thresholding

Simplified lasso problem with $X = I$:
$$\min_\beta \frac{1}{2} \|y - \beta\|_2^2 + \lambda \|\beta\|_1$$
This we can solve directly using subgradient optimality. The solution is $\beta = S_\lambda(y)$, where $S_\lambda$ is the soft-thresholding operator
$$[S_\lambda(y)]_i = \begin{cases} y_i - \lambda & \text{if } y_i > \lambda \\ 0 & \text{if } -\lambda \le y_i \le \lambda \\ y_i + \lambda & \text{if } y_i < -\lambda \end{cases}, \quad i = 1, \ldots, n$$

Check: from the lasso conditions above with $X = I$, the subgradient optimality conditions are
$$\begin{cases} y_i - \beta_i = \lambda \cdot \mathrm{sign}(\beta_i) & \text{if } \beta_i \ne 0 \\ |y_i - \beta_i| \le \lambda & \text{if } \beta_i = 0 \end{cases}$$
Now plug in $\beta = S_\lambda(y)$ and check that these are satisfied:
• When $y_i > \lambda$, $\beta_i = y_i - \lambda > 0$, so $y_i - \beta_i = \lambda = \lambda \cdot 1$
• When $y_i < -\lambda$, the argument is similar
• When $|y_i| \le \lambda$, $\beta_i = 0$, and $|y_i - \beta_i| = |y_i| \le \lambda$

Example: distance to a convex set

Consider the distance function to a closed, convex set $C$,
$$\mathrm{dist}(x, C) = \min_{y \in C} \|y - x\|_2 = \|x - P_C(x)\|_2,$$
where $P_C(x)$ is the projection of $x$ onto $C$. When $\mathrm{dist}(x, C) > 0$,
$$\partial \, \mathrm{dist}(x, C) = \left\{ \frac{x - P_C(x)}{\|x - P_C(x)\|_2} \right\}$$
This set has only one element, so in fact $\mathrm{dist}(x, C)$ is differentiable and this is its gradient.

We will only show one direction, i.e., that
$$\frac{x - P_C(x)}{\|x - P_C(x)\|_2} \in \partial \, \mathrm{dist}(x, C)$$
Write $u = P_C(x)$. Then by the first-order optimality conditions for a projection,
$$(x - u)^T (y - u) \le 0 \quad \text{for all } y \in C$$
Hence
$$C \subseteq H = \{y : (x - u)^T (y - u) \le 0\}$$

Claim:
$$\mathrm{dist}(y, C) \ge \frac{(x - u)^T (y - u)}{\|x - u\|_2} \quad \text{for all } y$$
Check: first, for $y \in H$, the right-hand side is $\le 0$, while the left-hand side is nonnegative.

Now for $y \notin H$, we have $(x - u)^T (y - u) = \|x - u\|_2 \|y - u\|_2 \cos\theta$, where $\theta$ is the angle between $x - u$ and $y - u$. Thus
$$\frac{(x - u)^T (y - u)}{\|x - u\|_2} = \|y - u\|_2 \cos\theta = \mathrm{dist}(y, H) \le \mathrm{dist}(y, C),$$
as desired, where the last inequality holds because $C \subseteq H$.

Using the claim, we have for any $y$
$$\mathrm{dist}(y, C) \ge \frac{(x - u)^T (y - x + x - u)}{\|x - u\|_2} = \|x - u\|_2 + \left( \frac{x - u}{\|x - u\|_2} \right)^T (y - x)$$
Since $\|x - u\|_2 = \mathrm{dist}(x, C)$, it follows that $g = (x - u) / \|x - u\|_2$ is a subgradient of $\mathrm{dist}(x, C)$ at $x$.
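
The subgradient formula for the distance function can be illustrated on a set whose projection is known in closed form. The sketch below is only an illustration under assumptions not made in the lecture: $C$ is taken to be the Euclidean unit ball, the function names are hypothetical, and the test point and sampling range are arbitrary. It computes $g = (x - P_C(x)) / \|x - P_C(x)\|_2$ and records the smallest observed gap in the inequality $\mathrm{dist}(y, C) \ge \mathrm{dist}(x, C) + g^T (y - x)$ over random points $y$.

```python
import math
import random

def norm2(v):
    return math.sqrt(sum(t * t for t in v))

def project_unit_ball(x):
    # Projection onto C = {z : ||z||_2 <= 1}; this choice of C is an assumption.
    n = norm2(x)
    if n <= 1.0:
        return list(x)
    return [t / n for t in x]

def dist_to_set(x, project):
    # dist(x, C) = ||x - P_C(x)||_2
    u = project(x)
    return norm2([a - b for a, b in zip(x, u)])

def dist_subgradient(x, project):
    # g = (x - u) / ||x - u||_2 with u = P_C(x); only valid when dist(x, C) > 0.
    u = project(x)
    r = [a - b for a, b in zip(x, u)]
    n = norm2(r)
    if n == 0.0:
        raise ValueError("x lies in C; the formula requires dist(x, C) > 0")
    return [t / n for t in r]

x = [3.0, 4.0]
g = dist_subgradient(x, project_unit_ball)
d_x = dist_to_set(x, project_unit_ball)

# Smallest value of dist(y, C) - [dist(x, C) + g^T (y - x)] over random y;
# it should be nonnegative up to rounding error.
rng = random.Random(0)
worst_gap = float("inf")
for _ in range(2000):
    y = [rng.uniform(-10.0, 10.0) for _ in x]
    lower = d_x + sum(gi * (yi - xi) for gi, yi, xi in zip(g, y, x))
    worst_gap = min(worst_gap, dist_to_set(y, project_unit_ball) - lower)
print(g, d_x, worst_gap)
```

For this test point the projection is $u = (0.6, 0.8)$, so the distance is 4 and $g$ is approximately $(0.6, 0.8)$, up to floating-point rounding. Any other closed convex set can be tried by passing a different projection function.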
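
Returning to the soft-thresholding example, the claim that $\beta = S_\lambda(y)$ satisfies the subgradient optimality conditions is easy to verify in code. This is a minimal sketch, assuming the objective $\frac{1}{2} \|y - \beta\|_2^2 + \lambda \|\beta\|_1$; the function names and the sample data are hypothetical choices made for illustration.

```python
def soft_threshold(y, lam):
    # Componentwise soft-thresholding operator S_lam.
    out = []
    for yi in y:
        if yi > lam:
            out.append(yi - lam)
        elif yi < -lam:
            out.append(yi + lam)
        else:
            out.append(0.0)
    return out

def satisfies_optimality(y, beta, lam, tol=1e-12):
    # Conditions for X = I:
    #   y_i - beta_i = lam * sign(beta_i)  if beta_i != 0
    #   |y_i - beta_i| <= lam              if beta_i == 0
    for yi, bi in zip(y, beta):
        r = yi - bi
        if bi > 0 and abs(r - lam) > tol:
            return False
        if bi < 0 and abs(r + lam) > tol:
            return False
        if bi == 0 and abs(r) > lam + tol:
            return False
    return True

def objective(y, beta, lam):
    # 0.5 * ||y - beta||_2^2 + lam * ||beta||_1
    fit = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    return fit + lam * sum(abs(bi) for bi in beta)

y = [3.0, -0.5, 1.0, -2.5]
lam = 1.0
beta = soft_threshold(y, lam)
print(beta)                                # [2.0, 0.0, 0.0, -1.5]
print(satisfies_optimality(y, beta, lam))  # True
print(satisfies_optimality(y, y, lam))     # False: beta = y is not optimal
```

The boundary case $y_i = \lambda$ (the third entry) is thresholded to exactly zero, matching the middle case of the operator. Comparing `objective(y, beta, lam)` with the objective at any other candidate gives a further sanity check.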