Lecture notes on Optimization - Chapter 6: Subgradients - Hoàng Nam Dũng

Why subgradients? Subgradients are important for two reasons. Convex analysis: optimality characterization via subgradients, monotonicity, relationship to duality. Convex optimization: if you can compute subgradients, then you can minimize any convex function.

Subgradients

Hoàng Nam Dũng
Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội

Last time: gradient descent

Consider the problem
    min_x f(x)
for f convex and differentiable, with dom(f) = R^n. Gradient descent: choose an initial point x^(0) ∈ R^n and repeat
    x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),   k = 1, 2, 3, . . .
Step sizes t_k are chosen to be fixed and small, or by backtracking line search.

If ∇f is Lipschitz, gradient descent has convergence rate O(1/ε).

Downsides:
- Requires f differentiable.
- Can be slow to converge.

Outline

Today:
- Subgradients
- Examples
- Properties
- Optimality characterizations

Basic inequality

Recall that for convex and differentiable f,
    f(y) ≥ f(x) + ∇f(x)^T (y − x),   ∀x, y ∈ dom(f).

- The first-order approximation of f at x is a global lower bound.
- ∇f(x) defines a non-vertical supporting hyperplane to epi(f) at (x, f(x)):
      (∇f(x), −1)^T ((y, t) − (x, f(x))) ≤ 0,   ∀(y, t) ∈ epi(f).

Subgradients

A subgradient of a convex function f at x is any g ∈ R^n such that
    f(y) ≥ f(x) + g^T (y − x),   ∀y ∈ dom(f).

- Always exists (on the relative interior of dom(f)).
- If f is differentiable at x, then g = ∇f(x) uniquely.
- The same definition works for nonconvex f (however, subgradients need not exist).

(Figure: a function f with affine minorants f(x_1) + g_1^T (y − x_1), f(x_1) + g_2^T (y − x_1), and f(x_2) + g_3^T (y − x_2); g_1 and g_2 are subgradients at x_1, g_3 is a subgradient at x_2.)

Examples of subgradients

Consider f : R → R, f(x) = |x|.
- For x ≠ 0, the unique subgradient is g = sign(x).
- For x = 0, a subgradient g is any element of [−1, 1].

Consider f : R^n → R, f(x) = ‖x‖_2.
- For x ≠ 0, the unique subgradient is g = x/‖x‖_2.
- For x = 0, a subgradient g is any element of {z : ‖z‖_2 ≤ 1}.

Consider f : R^n → R, f(x) = ‖x‖_1.
- For x_i ≠ 0, the unique ith component is g_i = sign(x_i).
- For x_i = 0, the ith component g_i is any element of [−1, 1].
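The examples above (|x|, ‖x‖_2, ‖x‖_1) translate directly into code. Below is a minimal NumPy sketch (an illustration added here, not part of the original slides; the function names are my own) that returns one valid subgradient for each function. At points of nondifferentiability any element of the stated set would do; the sketch simply picks 0 or the zero vector.

```python
import numpy as np

def subgrad_abs(x):
    """One subgradient of f(x) = |x|: sign(x) if x != 0, any value in [-1, 1] at x = 0."""
    return np.sign(x)  # np.sign(0) = 0, which lies in [-1, 1]

def subgrad_l2(x):
    """One subgradient of f(x) = ||x||_2: x/||x||_2 if x != 0, any z with ||z||_2 <= 1 at 0."""
    nrm = np.linalg.norm(x)
    return x / nrm if nrm > 0 else np.zeros_like(x)

def subgrad_l1(x):
    """One subgradient of f(x) = ||x||_1, built componentwise from sign(x_i)."""
    return np.sign(x)

# Sanity check of the subgradient inequality f(y) >= f(x) + g^T (y - x) for the l1 norm.
rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
g = subgrad_l1(x)
assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x)
```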
Examples of subgradients

Consider f(x) = max{f_1(x), f_2(x)}, for f_1, f_2 : R^n → R convex and differentiable.
- For f_1(x) > f_2(x), the unique subgradient is g = ∇f_1(x).
- For f_2(x) > f_1(x), the unique subgradient is g = ∇f_2(x).
- For f_1(x) = f_2(x), a subgradient g is any point on the line segment between ∇f_1(x) and ∇f_2(x).

Subdifferential

The set of all subgradients of a convex f is called the subdifferential:
    ∂f(x) = {g ∈ R^n : g is a subgradient of f at x}.

Properties:
- Nonempty for convex f at x ∈ int(dom(f)).
- ∂f(x) is closed and convex (even for nonconvex f).
- If f is differentiable at x, then ∂f(x) = {∇f(x)}.
- If ∂f(x) = {g}, then f is differentiable at x and ∇f(x) = g.

Proof: see lectures/subgradients.pdf.

Monotonicity

Theorem. The subdifferential of a convex function f is a monotone operator:
    (u − v)^T (x − y) ≥ 0,   ∀u ∈ ∂f(x), v ∈ ∂f(y).

Proof. By definition we have
    f(y) ≥ f(x) + u^T (y − x)   and   f(x) ≥ f(y) + v^T (x − y).
Adding the two inequalities gives monotonicity.

Question: what does monotonicity mean for a differentiable convex function? It becomes
    (∇f(x) − ∇f(y))^T (x − y) ≥ 0,
which follows directly from the first-order characterization of convex functions.
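As a quick numerical illustration (my own addition, not from the original slides), the sketch below checks the monotonicity inequality for f(x) = ‖x‖_1 at random pairs of points, using the componentwise sign subgradient from the earlier example.

```python
import numpy as np

# Check (u - v)^T (x - y) >= 0 for f(x) = ||x||_1,
# with u = sign(x) in ∂f(x) and v = sign(y) in ∂f(y).
rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.normal(size=10), rng.normal(size=10)
    u, v = np.sign(x), np.sign(y)
    assert (u - v) @ (x - y) >= 0
print("monotonicity inequality held on all random draws")
```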
Examples of non-subdifferentiable functions

The following functions are not subdifferentiable at x = 0:
- f : R → R with dom(f) = R_+,
      f(x) = 1 if x = 0,   f(x) = 0 if x > 0.
- f : R → R with dom(f) = R_+,
      f(x) = −√x.
In both cases the only supporting hyperplane to epi(f) at (0, f(0)) is vertical.

Connection to convex geometry

For a convex set C ⊆ R^n, consider the indicator function I_C : R^n → R,
    I_C(x) = I{x ∈ C} = 0 if x ∈ C,  ∞ if x ∉ C.

For x ∈ C, ∂I_C(x) = N_C(x), the normal cone of C at x, which recall is
    N_C(x) = {g ∈ R^n : g^T x ≥ g^T y for any y ∈ C}.

Why? By the definition of a subgradient g, I_C(y) ≥ I_C(x) + g^T (y − x) for all y.
- For y ∉ C, I_C(y) = ∞ and the inequality holds trivially.
- For y ∈ C, it means 0 ≥ g^T (y − x).

Subgradient calculus

Basic rules for convex functions:
- Scaling: ∂(af) = a · ∂f, provided a > 0.
- Addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2.
- Affine composition: if g(x) = f(Ax + b), then
      ∂g(x) = A^T ∂f(Ax + b).
- Finite pointwise maximum: if f(x) = max_{i=1,...,m} f_i(x), then
      ∂f(x) = conv( ⋃_{i : f_i(x) = f(x)} ∂f_i(x) ),
  the convex hull of the union of the subdifferentials of the active functions at x.
- General pointwise maximum: if f(x) = max_{s ∈ S} f_s(x), then
      ∂f(x) ⊇ cl conv( ⋃_{s : f_s(x) = f(x)} ∂f_s(x) ).
  Under some regularity conditions (on S, f_s), we get equality.
- Norms: an important special case is f(x) = ‖x‖_p. Let q be such that 1/p + 1/q = 1, so that
      ‖x‖_p = max_{‖z‖_q ≤ 1} z^T x.
  Then
      ∂f(x) = argmax_{‖z‖_q ≤ 1} z^T x.

Why subgradients?

Subgradients are important for two reasons:
- Convex analysis: optimality characterization via subgradients, monotonicity, relationship to duality.
- Convex optimization: if you can compute subgradients, then you can minimize any convex function.

Optimality condition

Subgradient optimality condition: for any f (convex or not),
    f(x*) = min_x f(x)   ⇐⇒   0 ∈ ∂f(x*),
i.e., x* is a minimizer if and only if 0 is a subgradient of f at x*.

Why? Easy: g = 0 being a subgradient means that for all y,
    f(y) ≥ f(x*) + 0^T (y − x*) = f(x*).

Note the implication for a convex and differentiable function f with ∂f(x) = {∇f(x)}: the condition reduces to ∇f(x*) = 0.

Derivation of first-order optimality

An example of the power of subgradients: we can use what we have learned so far to derive the first-order optimality condition.

Theorem. For f convex and differentiable and C convex, the problem
    min_x f(x)   subject to x ∈ C
is solved at x if and only if
    ∇f(x)^T (y − x) ≥ 0 for all y ∈ C.

(For a direct proof see, e.g., Teaching/ORF523/S16/ORF523_S16_Lec7_gh.pdf; a proof using subgradients is given below.)

Intuitively, this says that moving from x toward any other feasible point cannot decrease f to first order. Note that for C = R^n (the unconstrained case) it reduces to ∇f(x) = 0.

Proof. First recast the problem as
    min_x f(x) + I_C(x).
Now apply subgradient optimality: 0 ∈ ∂(f(x) + I_C(x)). Observe
    0 ∈ ∂(f(x) + I_C(x)) ⇔ 0 ∈ {∇f(x)} + N_C(x)
                         ⇔ −∇f(x) ∈ N_C(x)
                         ⇔ −∇f(x)^T x ≥ −∇f(x)^T y for all y ∈ C
                         ⇔ ∇f(x)^T (y − x) ≥ 0 for all y ∈ C,
as desired.

Note: the condition 0 ∈ ∂f(x) + N_C(x) is a fully general condition for optimality in convex problems. But it is not always easy to work with (the KKT conditions, later, are easier).

Example: lasso optimality conditions

Given y ∈ R^n and X ∈ R^{n×p}, the lasso problem can be parametrized as
    min_β (1/2) ‖y − Xβ‖_2^2 + λ ‖β‖_1,
where λ ≥ 0. Subgradient optimality:
    0 ∈ ∂( (1/2) ‖y − Xβ‖_2^2 + λ ‖β‖_1 )
    ⇔ 0 ∈ −X^T (y − Xβ) + λ ∂‖β‖_1
    ⇔ X^T (y − Xβ) = λ v for some v ∈ ∂‖β‖_1,
i.e., for i = 1, . . . , p,
    v_i ∈ {1}        if β_i > 0,
    v_i ∈ {−1}       if β_i < 0,
    v_i ∈ [−1, 1]    if β_i = 0.

Write X_1, . . . , X_p for the columns of X. Then our condition reads
    X_i^T (y − Xβ) = λ · sign(β_i)   if β_i ≠ 0,
    |X_i^T (y − Xβ)| ≤ λ             if β_i = 0.
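These conditions give a concrete numerical certificate of optimality. The NumPy sketch below (an illustration added here, not part of the original slides; the function name and tolerance are my own choices) checks whether a candidate β satisfies them.

```python
import numpy as np

def check_lasso_optimality(X, y, beta, lam, tol=1e-8):
    """Check the lasso subgradient optimality conditions at a candidate beta:
    X_i^T (y - X beta) = lam * sign(beta_i)  if beta_i != 0,
    |X_i^T (y - X beta)| <= lam              if beta_i == 0."""
    r = X.T @ (y - X @ beta)          # inner products of the columns with the residual
    active = beta != 0
    ok_active = np.allclose(r[active], lam * np.sign(beta[active]), atol=tol)
    ok_inactive = np.all(np.abs(r[~active]) <= lam + tol)
    return ok_active and ok_inactive

# Example: beta = 0 is optimal if and only if lam >= ||X^T y||_inf.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 5)), rng.normal(size=20)
lam_max = np.max(np.abs(X.T @ y))
print(check_lasso_optimality(X, y, np.zeros(5), lam_max))        # True
print(check_lasso_optimality(X, y, np.zeros(5), 0.5 * lam_max))  # False for generic data
```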
Note: the subgradient optimality conditions do not lead to a closed-form expression for a lasso solution. However, they do provide a way to check lasso optimality. They are also helpful in understanding the lasso estimator; e.g., at a solution β, if |X_i^T (y − Xβ)| < λ, then β_i = 0 (used by screening rules, later?).

Example: soft-thresholding

Simplified lasso problem with X = I:
    min_β (1/2) ‖y − β‖_2^2 + λ ‖β‖_1.
This we can solve directly using subgradient optimality. The solution is β = S_λ(y), where S_λ is the soft-thresholding operator: for i = 1, . . . , n,
    [S_λ(y)]_i = y_i − λ   if y_i > λ,
    [S_λ(y)]_i = 0         if −λ ≤ y_i ≤ λ,
    [S_λ(y)]_i = y_i + λ   if y_i < −λ.

Check: specializing the lasso conditions above to X = I, the subgradient optimality conditions are
    y_i − β_i = λ · sign(β_i)   if β_i ≠ 0,
    |y_i − β_i| ≤ λ             if β_i = 0.

Now plug in β = S_λ(y) and check that these are satisfied:
- When y_i > λ, β_i = y_i − λ > 0, so y_i − β_i = λ = λ · 1.
- When y_i < −λ, the argument is similar.
- When |y_i| ≤ λ, β_i = 0, and |y_i − β_i| = |y_i| ≤ λ.

(Figure: soft-thresholding in one variable, the piecewise-linear map y_i ↦ [S_λ(y)]_i.)

Example: distance to a convex set

Recall the distance function to a closed, convex set C:
    dist(x, C) = min_{y ∈ C} ‖y − x‖_2.
This is a convex function. What are its subgradients?

Write dist(x, C) = ‖x − P_C(x)‖_2, where P_C(x) is the projection of x onto C. It turns out that when dist(x, C) > 0,
    ∂ dist(x, C) = { (x − P_C(x)) / ‖x − P_C(x)‖_2 }
has only one element, so in fact dist(x, C) is differentiable and this is its gradient.

We will only show one direction, i.e., that
    (x − P_C(x)) / ‖x − P_C(x)‖_2 ∈ ∂ dist(x, C).
Write u = P_C(x). Then by the first-order optimality conditions for a projection,
    (x − u)^T (y − u) ≤ 0 for all y ∈ C,
hence C ⊆ H = {y : (x − u)^T (y − u) ≤ 0}.

Claim:
    dist(y, C) ≥ (x − u)^T (y − u) / ‖x − u‖_2 for all y.
Check: first, for y ∈ H the right-hand side is ≤ 0, so the claim holds. Now for y ∉ H, we have
    (x − u)^T (y − u) = ‖x − u‖_2 ‖y − u‖_2 cos θ,
where θ is the angle between x − u and y − u. Thus
    (x − u)^T (y − u) / ‖x − u‖_2 = ‖y − u‖_2 cos θ = dist(y, H) ≤ dist(y, C),
as desired.

Using the claim, we have for any y
    dist(y, C) ≥ (x − u)^T (y − x + x − u) / ‖x − u‖_2 = ‖x − u‖_2 + ((x − u) / ‖x − u‖_2)^T (y − x).
Hence g = (x − u) / ‖x − u‖_2 is a subgradient of dist(x, C) at x.
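As a concrete numerical check (my own illustration, not from the original slides), the sketch below takes C to be the Euclidean unit ball, where the projection P_C has a simple closed form, and verifies the subgradient inequality dist(y, C) ≥ dist(x, C) + g^T (y − x) with g = (x − P_C(x)) / ‖x − P_C(x)‖_2 at random points y.

```python
import numpy as np

def proj_unit_ball(x):
    """Projection P_C(x) onto the Euclidean unit ball C = {z : ||z||_2 <= 1}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

def dist_unit_ball(x):
    """dist(x, C) = ||x - P_C(x)||_2 for the unit ball."""
    return np.linalg.norm(x - proj_unit_ball(x))

rng = np.random.default_rng(0)
x = rng.normal(size=4)
x = 2.0 * x / np.linalg.norm(x)          # a point at distance 1 from the ball, so dist(x, C) > 0
u = proj_unit_ball(x)
g = (x - u) / np.linalg.norm(x - u)      # the claimed gradient of dist(., C) at x
for _ in range(1000):
    y = 3.0 * rng.normal(size=4)
    assert dist_unit_ball(y) >= dist_unit_ball(x) + g @ (y - x) - 1e-9
```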
References and further reading

- S. Boyd, Lecture notes for EE 264B, Stanford University, Spring 2010-2011.
- R. T. Rockafellar (1970), Convex Analysis, Chapters 23–25.
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012.