Subgradients
Hoàng Nam Dũng
Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội
Last time: gradient descent
Consider the problem

    min_x f(x)

for f convex and differentiable, dom(f) = Rn.
Gradient descent: choose initial x (0) ∈ Rn, repeat
x (k) = x (k−1) − tk · ∇f (x (k−1)), k = 1, 2, 3, . . .
Step sizes tk chosen to be fixed and small, or by backtracking line search.

If ∇f is Lipschitz continuous, gradient descent has convergence rate O(1/ε), i.e., O(1/ε) iterations suffice to reach accuracy ε.
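To make the recap concrete, here is a minimal Python sketch of gradient descent with backtracking line search (not part of the original slides); the quadratic test function, the parameters alpha and beta, and the stopping rule are assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(f, grad_f, x0, alpha=0.5, beta=0.8, max_iter=1000, tol=1e-8):
    """Gradient descent with backtracking line search (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:        # stop once the gradient is tiny
            break
        t = 1.0
        # Backtracking: shrink t until the sufficient-decrease condition holds
        while f(x - t * g) > f(x) - alpha * t * g.dot(g):
            t *= beta
        x = x - t * g
    return x

# Example: minimize the convex quadratic f(x) = 0.5 * ||x - c||_2^2
c = np.array([1.0, -2.0])
f = lambda x: 0.5 * np.sum((x - c) ** 2)
grad_f = lambda x: x - c
print(gradient_descent(f, grad_f, x0=np.zeros(2)))   # ≈ [ 1., -2.]
```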
Downsides:
I Requires f differentiable
I Can be slow to converge
Outline
Today:
I Subgradients
I Examples
I Properties
I Optimality characterizations
Basic inequality
Recall that for convex and differentiable f ,
f (y) ≥ f (x) +∇f (x)T (y − x), ∀x , y ∈ dom(f ).
[Figure: the first-order approximation of f at x and the supporting hyperplane to epi f at (x, f(x)) with normal (∇f(x), −1).]
I The first-order approximation of f at x is a global lower bound.
I ∇f(x) defines a non-vertical supporting hyperplane to epi(f) at (x, f(x)):

    (∇f(x), −1)T ((y, t) − (x, f(x))) ≤ 0, ∀(y, t) ∈ epi(f).
Subgradients
A subgradient of a convex function f at x is any g ∈ Rn such that
f (y) ≥ f (x) + gT (y − x), ∀y ∈ dom(f ).
I Always exists (on the relative interior of dom(f ))
I If f differentiable at x , then g = ∇f (x) uniquely
I Same definition works for nonconvex f (however, subgradients
need not exist).
[Figure: a convex function f with affine lower bounds f(x1) + g1T(y − x1) and f(x1) + g2T(y − x1) at x1, and f(x2) + g3T(y − x2) at x2.]
g1 and g2 are subgradients at x1; g3 is a subgradient at x2.
Examples of subgradients
Consider f : R→ R, f (x) = |x |
[Figure: plot of f(x) = |x| on [−2, 2].]
I For x ≠ 0, unique subgradient g = sign(x)
I For x = 0, subgradient g is any element of [−1, 1].
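As a quick numerical sanity check (an illustrative sketch, not from the slides), one can verify the subgradient inequality f(y) ≥ f(x) + g(y − x) for f(x) = |x| at x = 0 over a grid of test points; the grid and the candidate values of g below are arbitrary choices.

```python
import numpy as np

f = np.abs                        # f(x) = |x|
x = 0.0
ys = np.linspace(-2.0, 2.0, 401)  # test points y

for g in (-1.0, -0.3, 0.0, 0.7, 1.0):            # candidate slopes in [-1, 1]
    ok = np.all(f(ys) >= f(x) + g * (ys - x))    # subgradient inequality at x = 0
    print(f"g = {g:+.1f}: subgradient at 0? {ok}")

# A slope outside [-1, 1] fails, e.g. g = 1.5 is violated at y = 2:
print(np.all(f(ys) >= f(x) + 1.5 * (ys - x)))    # False
```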
Examples of subgradients
Consider f : Rn → R, f(x) = ‖x‖2.

[Figure: graph of f(x) = ‖x‖2 over (x1, x2).]

I For x ≠ 0, unique subgradient g = x/‖x‖2
I For x = 0, subgradient g is any element of {z : ‖z‖2 ≤ 1}.
Examples of subgradients
Consider f : Rn → R, f(x) = ‖x‖1.

[Figure: graph of f(x) = ‖x‖1 over (x1, x2).]

I For xi ≠ 0, unique ith component gi = sign(xi)
I For xi = 0, ith component gi is any element of [−1, 1].
Examples of subgradients
Consider f (x) = max{f1(x), f2(x)}, for f1, f2 : Rn → R convex,
differentiable
[Figure: plot of f(x) = max{f1(x), f2(x)} with a kink where f1(x) = f2(x).]

I For f1(x) > f2(x), unique subgradient g = ∇f1(x)
I For f2(x) > f1(x), unique subgradient g = ∇f2(x)
I For f1(x) = f2(x), subgradient g is any point on the line segment between ∇f1(x) and ∇f2(x).
Subdifferential
Set of all subgradients of convex f is called the subdifferential:
∂f (x) = {g ∈ Rn : g is a subgradient of f at x}.
Properties:
I Nonempty for convex f at x ∈ int(domf )
I ∂f (x) is closed and convex (even for nonconvex f )
I If f is differentiable at x , then ∂f (x) = {∇f (x)}
I If ∂f (x) = {g}, then f is differentiable at x and ∇f (x) = g .
Proof: See lectures/subgradients.pdf
Monotonicity
Theorem
The subdifferential of a convex function f is a monotone operator
(u − v)T (x − y) ≥ 0, ∀u ∈ ∂f (x), v ∈ ∂f (y).
Proof.
By definition we have
f (y) ≥ f (x) + uT (y − x) and f (x) ≥ f (y) + vT (x − y).
Adding these two inequalities gives 0 ≥ (u − v)T(y − x), i.e., (u − v)T(x − y) ≥ 0, which is the claimed monotonicity.
Question: what does monotonicity say for a differentiable convex function? It becomes

    (∇f(x) − ∇f(y))T (x − y) ≥ 0,

which follows directly from the first-order characterization of convex functions.
Examples of non-subdifferentiable functions
The following functions are not subdifferentiable at x = 0
I f : R → R, dom(f) = R+,

    f(x) = 1 if x = 0,  and  f(x) = 0 if x > 0.
I f : R→ R, dom(f ) = R+
f (x) = −√x .
The only supporting hyperplane to epi(f ) at (0, f (0)) is vertical.
Connection to convex geometry
Convex set C ⊆ Rn, consider indicator function IC : Rn → R,
    IC(x) = I{x ∈ C} = { 0 if x ∈ C,  ∞ if x ∉ C }.
For x ∈ C, ∂IC(x) = NC(x), the normal cone of C at x, which (recall) is

    NC(x) = {g ∈ Rn : gT x ≥ gT y for any y ∈ C}.
Why? By definition of subgradient g ,
IC (y) ≥ IC (x) + gT (y − x) for all y .
I For y ∉ C, IC(y) = ∞, so the inequality holds trivially
I For y ∈ C, it means 0 ≥ gT(y − x), i.e., gT x ≥ gT y.
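A small numerical illustration (a sketch under the assumption that C is the unit Euclidean ball, which is not specified in the slides): at a boundary point x, vectors of the form g = t·x with t ≥ 0 satisfy gT x ≥ gT y for all y ∈ C, so they lie in NC(x) = ∂IC(x).

```python
import numpy as np

rng = np.random.default_rng(0)

# C: the unit Euclidean ball (chosen only for this illustration)
x = np.array([0.6, 0.8])          # a boundary point of C, ||x||_2 = 1
g = 2.5 * x                       # candidate normal-cone element t*x with t >= 0

# Sample points y in C (project any sample outside the ball back onto it)
Y = rng.normal(size=(10000, 2))
Y = Y / np.maximum(1.0, np.linalg.norm(Y, axis=1, keepdims=True))

print(np.all(Y @ g <= g @ x))     # True: g^T y <= g^T x for all sampled y in C

# A tangent direction at x is not in the normal cone and fails the test
h = np.array([-0.8, 0.6])
print(np.all(Y @ h <= h @ x))     # False
```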
Subgradient calculus
Basic rules for convex functions:
I Scaling: ∂(af ) = a · ∂f provided a > 0.
I Addition: ∂(f1 + f2) = ∂f1 + ∂f2.
I Affine composition: if g(x) = f (Ax + b), then
∂g(x) = AT∂f (Ax + b).
I Finite pointwise maximum: if f(x) = max_{i=1,...,m} fi(x), then

    ∂f(x) = conv( ⋃_{i : fi(x)=f(x)} ∂fi(x) ),

convex hull of union of subdifferentials of active functions at x.
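As an illustration of the finite-maximum rule (a sketch, not from the slides), consider a piecewise-affine function f(x) = max_i (aiT x + bi): each active ai is a subgradient, and so is any convex combination of the active ai. The helper max_affine_subgradient, its averaging choice, and the tolerance tol are assumptions for this example.

```python
import numpy as np

def max_affine_subgradient(A, b, x, tol=1e-9):
    """A subgradient of f(x) = max_i (a_i^T x + b_i).

    Each active row a_i is a subgradient; any convex combination of the
    active rows works too. Here we simply average the active rows.
    """
    vals = A @ x + b
    active = vals >= vals.max() - tol        # indices attaining the max
    return A[active].mean(axis=0)            # a point in conv{a_i : i active}

A = np.array([[1.0, 0.0],
              [-1.0, 0.0]])
b = np.zeros(2)
# Here f(x) = max(x1, -x1) = |x1|
print(max_affine_subgradient(A, b, np.array([2.0, 5.0])))   # [1. 0.]
print(max_affine_subgradient(A, b, np.array([0.0, 3.0])))   # [0. 0.], midpoint of ±e1
```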
Subgradient calculus
I General pointwise maximum: if f(x) = max_{s∈S} fs(x), then

    ∂f(x) ⊇ cl( conv( ⋃_{s : fs(x)=f(x)} ∂fs(x) ) ).

Under some regularity conditions (on S, fs), we get equality.
I Norms: important special case, f(x) = ‖x‖p. Let q be such that 1/p + 1/q = 1, then

    ‖x‖p = max_{‖z‖q≤1} zT x   and   ∂f(x) = argmax_{‖z‖q≤1} zT x.
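For example (an illustrative sketch, not from the slides), with p = 1 and q = ∞ the argmax formula gives ∂‖x‖1 componentwise: sign(xi) where xi ≠ 0, and anything in [−1, 1] where xi = 0. The helper l1_subgradient and its free_value parameter are hypothetical names chosen for this illustration.

```python
import numpy as np

def l1_subgradient(x, free_value=0.0):
    """A subgradient of f(x) = ||x||_1, via argmax over ||z||_inf <= 1 of z^T x.

    Coordinates with x_i != 0 are forced to sign(x_i); coordinates with
    x_i = 0 may take any value in [-1, 1] (free_value picks one of them).
    """
    g = np.sign(x).astype(float)
    g[x == 0] = free_value
    return g

x = np.array([3.0, -0.5, 0.0])
g = l1_subgradient(x, free_value=0.25)
print(g)                                   # [ 1.   -1.    0.25]

# Check the subgradient inequality ||y||_1 >= ||x||_1 + g^T (y - x) on random y
rng = np.random.default_rng(1)
Y = rng.normal(size=(10000, 3))
lhs = np.abs(Y).sum(axis=1)
rhs = np.abs(x).sum() + (Y - x) @ g
print(np.all(lhs >= rhs - 1e-12))          # True
```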
Why subgradients?
Subgradients are important for two reasons:
I Convex analysis: optimality characterization via subgradients,
monotonicity, relationship to duality.
I Convex optimization: if you can compute subgradients, then
you can minimize any convex function.
Optimality condition
Subgradient optimality condition: For any f (convex or not),
    f(x∗) = min_x f(x)  ⇐⇒  0 ∈ ∂f(x∗),
i.e., x∗ is a minimizer if and only if 0 is a subgradient of f at x∗.
Why? Easy: g = 0 being a subgradient means that for all y
f (y) ≥ f (x∗) + 0T (y − x∗) = f (x∗).
Note the implication for a convex and differentiable function f with ∂f(x) = {∇f(x)}: the condition becomes ∇f(x∗) = 0.
Derivation of first-order optimality
Example of the power of subgradients: we can use what we have
learned so far to derive the first-order optimality condition.
Theorem
For f convex and differentiable and C convex
    min_x f(x) subject to x ∈ C
is solved at x if and only if
∇f(x)T (y − x) ≥ 0 for all y ∈ C.
(For a direct proof see, e.g., Teaching/ORF523/S16/ORF523_S16_Lec7_gh.pdf; a proof using subgradients is on the next slide.)
Intuitively: moving from x toward any feasible point y cannot decrease f to first order.
Note that for C = Rn (the unconstrained case) this reduces to ∇f(x) = 0.
Derivation of first-order optimality
Proof.
First recast problem as

    min_x f(x) + IC(x).
Now apply subgradient optimality: 0 ∈ ∂(f (x) + IC (x)).
Observe
0 ∈ ∂(f (x) + IC (x))⇔ 0 ∈ {∇f (x)}+NC (x)
⇔ −∇f (x) ∈ NC (x)
⇔ −∇f (x)T x ≥ −∇f (x)T y for all y ∈ C
⇔ ∇f (x)T (y − x) ≥ 0 for all y ∈ C
as desired.
Note: the condition 0 ∈ ∂f (x) +NC (x) is a fully general condition
for optimality in convex problems. But it’s not always easy to work
with (KKT conditions, later, are easier).
Example: lasso optimality conditions
Given y ∈ Rn, X ∈ Rn×p, the lasso problem can be parametrized as

    min_β (1/2) ‖y − Xβ‖22 + λ ‖β‖1,

where λ ≥ 0.
Subgradient optimality:

    0 ∈ ∂( (1/2) ‖y − Xβ‖22 + λ ‖β‖1 )
    ⇔ 0 ∈ −XT(y − Xβ) + λ ∂‖β‖1
    ⇔ XT(y − Xβ) = λv

for some v ∈ ∂‖β‖1, i.e.,

    vi ∈ {1}      if βi > 0
    vi ∈ {−1}     if βi < 0       i = 1, . . . , p.
    vi ∈ [−1, 1]  if βi = 0
Example: lasso optimality conditions
Write X1, . . . , Xp for columns of X. Then our condition reads

    XTi (y − Xβ) = λ · sign(βi)   if βi ≠ 0
    |XTi (y − Xβ)| ≤ λ            if βi = 0.
Note: subgradient optimality conditions don’t lead to closed-form
expression for a lasso solution. However they do provide a way to
check lasso optimality.
They are also helpful in understanding the lasso estimator; e.g., if
|XTi (y − Xβ)| < λ, then βi = 0 (used by screening rules, later?).
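A hedged sketch (not from the slides) of how one might check these conditions numerically; the helper name check_lasso_optimality, the tolerance tol, and the synthetic data are assumptions. It uses the known fact that β = 0 is a lasso solution exactly when λ ≥ ‖XT y‖∞.

```python
import numpy as np

def check_lasso_optimality(X, y, beta, lam, tol=1e-8):
    """Check the subgradient optimality conditions for a candidate lasso solution."""
    r = X.T @ (y - X @ beta)                 # X_i^T (y - X beta), all i at once
    nonzero = np.abs(beta) > tol
    cond_nonzero = np.allclose(r[nonzero], lam * np.sign(beta[nonzero]), atol=tol)
    cond_zero = np.all(np.abs(r[~nonzero]) <= lam + tol)
    return cond_nonzero and cond_zero

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = rng.normal(size=50)

# beta = 0 is optimal exactly when lam >= ||X^T y||_inf
lam_max = np.abs(X.T @ y).max()
print(check_lasso_optimality(X, y, np.zeros(10), lam=1.1 * lam_max))  # True
print(check_lasso_optimality(X, y, np.zeros(10), lam=0.5 * lam_max))  # False
```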
Example: soft-thresholding
Simplified lasso problem with X = I:

    min_β (1/2) ‖y − β‖22 + λ ‖β‖1.
This we can solve directly using subgradient optimality. Solution is
β = Sλ(y), where Sλ is the soft-thresholding operator
    [Sλ(y)]i = yi − λ    if yi > λ
             = 0         if −λ ≤ yi ≤ λ      i = 1, . . . , n.
             = yi + λ    if yi < −λ

Check: from last slide, subgradient optimality conditions are

    yi − βi = λ · sign(βi)   if βi ≠ 0
    |yi − βi| ≤ λ            if βi = 0.
Example: soft-thresholding
Now plug in β = Sλ(y) and check these are satisfied:
I When yi > λ, βi = yi − λ > 0, so yi − βi = λ = λ · 1.
I When yi < −λ, argument is similar.
I When |yi | ≤ λ, βi = 0, and |yi − βi | = |yi | ≤ λ.
[Figure: the soft-thresholding operator Sλ in one variable.]
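A minimal sketch of the soft-thresholding operator together with a numerical check of the conditions above (the random y and the value of λ are arbitrary choices for the illustration).

```python
import numpy as np

def soft_threshold(y, lam):
    """Elementwise soft-thresholding operator S_lambda(y)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

rng = np.random.default_rng(0)
y = rng.normal(scale=2.0, size=8)
lam = 1.0
beta = soft_threshold(y, lam)

# Verify the subgradient optimality conditions from the previous slide
nonzero = beta != 0
print(np.allclose(y[nonzero] - beta[nonzero], lam * np.sign(beta[nonzero])))  # True
print(np.all(np.abs(y[~nonzero] - beta[~nonzero]) <= lam))                    # True
```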
Example: distance to a convex set
Recall the distance function to a closed, convex set C
    dist(x, C) = min_{y∈C} ‖y − x‖2.
This is a convex function. What are its subgradients?
Write dist(x ,C ) = ‖x − PC (x)‖2, where PC (x) is the projection of
x onto C . It turns out that when dist(x ,C ) > 0,
    ∂ dist(x, C) = { (x − PC(x)) / ‖x − PC(x)‖2 }
only has one element, so in fact dist(x ,C ) is differentiable and this
is its gradient.
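A numerical illustration (a sketch; the choice C = unit Euclidean ball is an assumption, made because its projection PC(x) = x / max(1, ‖x‖2) is known in closed form): compare the claimed gradient with a finite-difference gradient of dist(·, C).

```python
import numpy as np

def proj_ball(x):
    """Projection onto the unit Euclidean ball."""
    return x / max(1.0, np.linalg.norm(x))

def dist(x):
    return np.linalg.norm(x - proj_ball(x))

x = np.array([2.0, -1.0, 0.5])                                # a point outside C
g = (x - proj_ball(x)) / np.linalg.norm(x - proj_ball(x))     # claimed gradient

# Central finite-difference gradient of dist(., C) at x
eps = 1e-6
fd = np.array([(dist(x + eps * e) - dist(x - eps * e)) / (2 * eps)
               for e in np.eye(3)])
print(g)
print(fd)   # ≈ g, consistent with dist(., C) being differentiable here
```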
Example: distance to a convex set
We will only show one direction, i.e., that

    (x − PC(x)) / ‖x − PC(x)‖2 ∈ ∂ dist(x, C).
Write u = PC (x). Then by first-order optimality conditions for a
projection,
(x − u)T (y − u) ≤ 0 for all y ∈ C .
Hence
C ⊆ H = {y : (x − u)T (y − u) ≤ 0}.
Claim:

    dist(y, C) ≥ (x − u)T (y − u) / ‖x − u‖2   for all y.
Check: first, for y ∈ H, the right-hand side is ≤ 0.
Example: distance to a convex set
Now for y ∉ H, we have (x − u)T(y − u) = ‖x − u‖2 ‖y − u‖2 cos θ, where θ is the angle between x − u and y − u. Thus

    (x − u)T(y − u) / ‖x − u‖2 = ‖y − u‖2 cos θ = dist(y, H) ≤ dist(y, C)
as desired.
Using the claim, we have for any y

    dist(y, C) ≥ (x − u)T(y − x + x − u) / ‖x − u‖2 = ‖x − u‖2 + ((x − u) / ‖x − u‖2)T (y − x).
Hence g = (x − u)/ ‖x − u‖2 is a subgradient of dist(x ,C ) at x .
References and further reading
S. Boyd, Lecture notes for EE 264B, Stanford University, Spring 2010-2011
R. T. Rockafellar (1970), Convex analysis, Chapters 23–25
L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012