Convex Optimization Reading Notes (8)
Chapter 9: Unconstrained minimization
9.1 Unconstrained minimization problems
The unconstrained optimization problem is

$$\mathrm{minimize} \quad f(x)$$

where $f : \mathbf{R}^n \to \mathbf{R}$ is convex and twice continuously differentiable (which implies that $\mathbf{dom}\ f$ is open).
Since $f$ is differentiable and convex, a necessary and sufficient condition for a point $x^\star$ to be optimal is

$$\nabla f(x^\star) = 0$$
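As a small illustration (my own example, not from the text): for a strictly convex quadratic $f(x) = \frac{1}{2}x^TPx + q^Tx$ with $P \succ 0$, the optimality condition $\nabla f(x^\star) = Px^\star + q = 0$ can be solved directly.

```python
import numpy as np

# Hypothetical example: f(x) = (1/2) x^T P x + q^T x with P positive
# definite; the optimality condition ∇f(x*) = P x* + q = 0 gives
# x* = -P^{-1} q, which we verify numerically.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
P = A.T @ A + np.eye(3)          # positive definite by construction
q = rng.standard_normal(3)

x_star = np.linalg.solve(P, -q)  # solves ∇f(x) = 0
grad = P @ x_star + q
print(np.linalg.norm(grad))      # ~0 up to floating-point error
```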
9.1.1 Examples
9.1.2 Strong convexity and implications
The objective function is strongly convex on $S$, which means that there exists an $m > 0$ such that

$$\nabla^2 f(x) \succeq mI$$

for all $x \in S$. For $x, y \in S$ we have

$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2}(y - x)^T \nabla^2 f(z)(y - x)$$

for some $z$ on the line segment $[x, y]$, which means

$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{m}{2}\|y - x\|_2^2$$

for all $x$ and $y$ in $S$.
Upper bound on $\nabla^2 f(x)$
There exists a constant $M$ such that

$$\nabla^2 f(x) \preceq MI$$

for all $x \in S$, which implies that

$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{M}{2}\|y - x\|_2^2$$

for all $x, y \in S$.
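These bounds are easy to see concretely (a sketch on an assumed example, not from the text): for a quadratic $f(x) = \frac{1}{2}x^TPx$ the Hessian is $P$ everywhere, so $m$ and $M$ are the extreme eigenvalues of $P$, and the lower bound can be checked numerically at random points.

```python
import numpy as np

# Assumed example: f(x) = (1/2) x^T P x has ∇²f(x) = P, so m is the
# smallest eigenvalue of P.  Check the strong convexity lower bound
# f(y) >= f(x) + ∇f(x)^T (y - x) + (m/2) ||y - x||_2^2.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
P = A.T @ A + 0.5 * np.eye(4)

f = lambda x: 0.5 * x @ P @ x
grad = lambda x: P @ x
m = np.linalg.eigvalsh(P).min()

x, y = rng.standard_normal(4), rng.standard_normal(4)
lower = f(x) + grad(x) @ (y - x) + 0.5 * m * np.linalg.norm(y - x) ** 2
print(f(y) >= lower - 1e-12)   # True
```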
9.2 Descent methods
The algorithms described in this chapter produce a minimizing sequence $x^{(k)},\ k = 1, \ldots,$ where

$$x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)}$$

and $t^{(k)} > 0$ (except when $x^{(k)}$ is optimal). Here the concatenated symbols $\Delta$ and $x$ that form $\Delta x$ are to be read as a single entity, a vector in $\mathbf{R}^n$ called the step or search direction (even though it need not have unit norm), and $k = 0, 1, \ldots$ denotes the iteration number. The scalar $t^{(k)} \geq 0$ is called the step size or step length at iteration $k$ (even though it is not equal to $\|x^{(k+1)} - x^{(k)}\|$ unless $\|\Delta x^{(k)}\| = 1$).
Exact line search
One line search method sometimes used in practice is exact line search, in which $t$ is chosen to minimize $f$ along the ray $\{x + t\Delta x \mid t \geq 0\}$:

$$t = \arg\min_{s \geq 0} f(x + s\Delta x)$$

Many inexact line search methods have been proposed. One inexact line search method that is very simple and quite effective is called backtracking line search. It depends on two constants $\alpha, \beta$ with $0 < \alpha < 0.5$, $0 < \beta < 1$.
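A minimal sketch of backtracking line search (function names are my own): starting from $t = 1$, shrink $t$ by the factor $\beta$ until the sufficient-decrease condition $f(x + t\Delta x) \leq f(x) + \alpha t \nabla f(x)^T \Delta x$ holds.

```python
import numpy as np

# Backtracking line search: dx must be a descent direction at x.
def backtracking(f, grad_f, x, dx, alpha=0.3, beta=0.8):
    t = 1.0
    fx, slope = f(x), grad_f(x) @ dx   # slope = ∇f(x)^T dx < 0
    while f(x + t * dx) > fx + alpha * t * slope:
        t *= beta                      # shrink until sufficient decrease
    return t

# Usage on f(x) = ||x||_2^2 with the negative gradient as direction:
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([1.0, -2.0])
t = backtracking(f, grad_f, x, -grad_f(x))
print(0 < t <= 1)  # True
```

Since $f$ is strongly convex here, the loop terminates after a few shrinks and the resulting step strictly decreases $f$.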
9.3 Gradient descent method
A natural choice for the search direction is the negative gradient $\Delta x = -\nabla f(x)$. The resulting algorithm is called the gradient algorithm or gradient descent method.
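Putting the pieces together, a sketch of gradient descent with backtracking line search (names and defaults are my own choices):

```python
import numpy as np

# Gradient descent with backtracking line search.
def gradient_descent(f, grad_f, x0, alpha=0.3, beta=0.8, tol=1e-8, max_iter=1000):
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:   # stopping criterion ||∇f(x)||_2 <= tol
            break
        dx = -g                        # search direction: negative gradient
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta                  # backtracking
        x = x + t * dx
    return x

# Usage: minimize f(x) = (1/2) x^T P x + q^T x; optimum is x* = -P^{-1} q.
P = np.array([[3.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, -1.0])
x_min = gradient_descent(lambda x: 0.5 * x @ P @ x + q @ x,
                         lambda x: P @ x + q, np.zeros(2))
print(np.allclose(x_min, np.linalg.solve(P, -q), atol=1e-6))  # True
```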
9.3.1 Convergence analysis
Analysis for exact line search
We assume $f$ is strongly convex on $S$, so there are positive constants $m$ and $M$ such that $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$. Define the function $\tilde{f} : \mathbf{R} \to \mathbf{R}$ by $\tilde{f}(t) = f(x - t\nabla f(x))$. We obtain a quadratic upper bound on $\tilde{f}$:

$$\tilde{f}(t) \leq f(x) - t\|\nabla f(x)\|_2^2 + \frac{Mt^2}{2}\|\nabla f(x)\|_2^2$$

Minimizing the right-hand side over $t$ gives the value $f(x) - \frac{1}{2M}\|\nabla f(x)\|_2^2$, so with exact line search we have

$$f(x^+) - p^\star \leq f(x) - p^\star - \frac{1}{2M}\|\nabla f(x)\|_2^2$$

Combining this with $\|\nabla f(x)\|_2^2 \geq 2m(f(x) - p^\star)$, we get

$$f(x^+) - p^\star \leq \left(1 - \frac{m}{M}\right)(f(x) - p^\star)$$

Applying this inequality recursively, we find that

$$f(x^{(k)}) - p^\star \leq c^k (f(x^{(0)}) - p^\star)$$

where $c = 1 - \frac{m}{M} < 1$, which shows that $f(x^{(k)})$ converges to $p^\star$ as $k \to \infty$.
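The linear convergence bound is easy to observe numerically (my own construction, not from the text): for $f(x) = \frac{1}{2}x^TPx$, exact line search along $-\nabla f(x)$ has the closed form $t = g^Tg / (g^TPg)$, and the error $f(x^{(k)}) - p^\star$ decays at least as fast as $c^k$ with $c = 1 - m/M$.

```python
import numpy as np

# f(x) = (1/2) x^T P x with P = diag(1, 10): m = 1, M = 10, c = 0.9, p* = 0.
P = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ P @ x
x = np.array([1.0, 1.0])
errs = [f(x)]
for _ in range(20):
    g = P @ x
    t = (g @ g) / (g @ P @ g)       # exact line search step for a quadratic
    x = x - t * g
    errs.append(f(x))
c = 1 - 1.0 / 10.0
print(all(errs[k] <= (c ** k) * errs[0] + 1e-12 for k in range(len(errs))))  # True
```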
Analysis for backtracking line search
An analogous argument for backtracking line search gives

$$f(x^{(k)}) - p^\star \leq c^k (f(x^{(0)}) - p^\star)$$

where

$$c = 1 - \min\left\{2m\alpha,\ \frac{2\beta\alpha m}{M}\right\} < 1$$
9.3.2 Examples
9.4 Steepest descent method
The first-order Taylor approximation of $f(x + v)$ around $x$ is

$$f(x + v) \approx \hat{f}(x + v) = f(x) + \nabla f(x)^T v$$

Let $\|\cdot\|$ be any norm on $\mathbf{R}^n$. We define a normalized steepest descent direction (with respect to the norm $\|\cdot\|$) as

$$\Delta x_{\rm nsd} = \arg\min\{\nabla f(x)^T v \mid \|v\| = 1\}$$

It is also convenient to consider a steepest descent step $\Delta x_{\rm sd}$ that is unnormalized, obtained by scaling the normalized steepest descent direction in a particular way:

$$\Delta x_{\rm sd} = \|\nabla f(x)\|_* \, \Delta x_{\rm nsd}$$

where $\|\cdot\|_*$ denotes the dual norm.
9.4.1 Steepest descent for Euclidean and quadratic norms
Steepest descent for Euclidean norm
If we take the norm $\|\cdot\|$ to be the Euclidean norm, we find that the steepest descent direction is simply the negative gradient: $\Delta x_{\rm sd} = -\nabla f(x)$.
Steepest descent for quadratic norm
We consider the quadratic norm

$$\|z\|_P = (z^T P z)^{\frac{1}{2}} = \|P^{\frac{1}{2}} z\|_2$$

where $P \in \mathbf{S}_{++}^n$. The normalized and unnormalized steepest descent directions are

$$\Delta x_{\rm nsd} = -\left(\nabla f(x)^T P^{-1} \nabla f(x)\right)^{-\frac{1}{2}} P^{-1} \nabla f(x), \qquad \Delta x_{\rm sd} = -P^{-1} \nabla f(x)$$
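A short numerical sketch of these two formulas (the matrices here are my own example): $\Delta x_{\rm sd} = -P^{-1}\nabla f(x)$, and normalizing it to unit quadratic norm recovers $\Delta x_{\rm nsd}$.

```python
import numpy as np

# Steepest descent directions for the quadratic norm ||z||_P.
rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
P = B @ B.T + np.eye(4)             # P in S^n_++
g = rng.standard_normal(4)          # stands in for ∇f(x)

dx_sd = -np.linalg.solve(P, g)      # unnormalized step: -P^{-1} ∇f(x)
dx_nsd = dx_sd / np.sqrt(dx_sd @ P @ dx_sd)   # scale to ||v||_P = 1
print(np.isclose(dx_nsd @ P @ dx_nsd, 1.0))   # True
print(g @ dx_sd < 0)                # descent direction: True
```

Equivalently, this is gradient descent after the change of coordinates $\bar{x} = P^{1/2}x$.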
9.4.2 Steepest descent for $l_1$-norm
As another example, we consider the steepest descent method for the $l_1$-norm. A normalized steepest descent direction is

$$\Delta x_{\rm nsd} = \arg\min\{\nabla f(x)^T v \mid \|v\|_1 \leq 1\}$$

Let $i$ be any index for which $\|\nabla f(x)\|_\infty = |(\nabla f(x))_i|$. Then a normalized steepest descent direction $\Delta x_{\rm nsd}$ for the $l_1$-norm is given by

$$\Delta x_{\rm nsd} = -{\rm sign}\left(\frac{\partial f(x)}{\partial x_i}\right) e_i$$

An unnormalized steepest descent step is then

$$\Delta x_{\rm sd} = \Delta x_{\rm nsd} \|\nabla f(x)\|_\infty = -\frac{\partial f(x)}{\partial x_i} e_i$$
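The $l_1$ step moves along a single coordinate axis, so each iteration updates only the component with the largest partial derivative. A minimal sketch (function name is my own):

```python
import numpy as np

# l1-norm steepest descent step: pick the coordinate i achieving
# ||∇f(x)||_inf and move along -(∂f/∂x_i) e_i (ties broken by argmax).
def l1_steepest_descent_step(g):
    i = np.argmax(np.abs(g))        # index with largest |∂f/∂x_i|
    dx_sd = np.zeros_like(g)
    dx_sd[i] = -g[i]
    return dx_sd

g = np.array([0.5, -3.0, 1.0])      # stands in for ∇f(x)
print(l1_steepest_descent_step(g))  # [0. 3. 0.]
```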
9.4.3 Convergence analysis
We have

$$f(x^{(k)}) - p^\star \leq c^k (f(x^{(0)}) - p^\star)$$

where

$$c = 1 - 2m\alpha\tilde{\gamma}^2 \min\left\{1, \frac{\beta\gamma^2}{M}\right\} < 1$$

and $\gamma, \tilde{\gamma} \in (0, 1]$ are constants relating the norm $\|\cdot\|$ and its dual to the Euclidean norm, i.e. $\|x\| \geq \gamma\|x\|_2$ and $\|x\|_* \geq \tilde{\gamma}\|x\|_2$ for all $x$.
9.4.4 Discussion and examples
9.5 Newton’s method
9.5.1 The Newton step
For $x \in \mathbf{dom}\ f$, the vector

$$\Delta x_{\rm nt} = -\nabla^2 f(x)^{-1} \nabla f(x)$$

is called the Newton step. Positive definiteness of $\nabla^2 f(x)$ implies that

$$\nabla f(x)^T \Delta x_{\rm nt} = -\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) < 0$$

unless $\nabla f(x) = 0$, so the Newton step is a descent direction (unless $x$ is optimal).
9.5.2 Newton’s method
Newton’s method, as outlined below, is sometimes called the damped Newton method or guarded Newton method, to distinguish it from the pure Newton method, which uses a fixed step size $t = 1$.
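A sketch of the damped Newton method with backtracking line search (function names and the test function are my own choices; the stopping criterion uses the squared Newton decrement $\lambda(x)^2 = \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)$):

```python
import numpy as np

# Damped Newton method with backtracking line search.
def newton_method(f, grad_f, hess_f, x0, alpha=0.25, beta=0.5, eps=1e-10):
    x = x0.astype(float)
    for _ in range(100):
        g, H = grad_f(x), hess_f(x)
        dx = -np.linalg.solve(H, g)          # Newton step Δx_nt
        lam_sq = -g @ dx                     # Newton decrement squared λ(x)^2
        if lam_sq / 2 <= eps:                # stop when λ^2/2 is small
            break
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta                        # backtracking
        x = x + t * dx
    return x

# Usage on a smooth strictly convex test function (my own choice):
# f(x) = exp(x1 + x2) + exp(x1 - x2) + x1^2
f = lambda x: np.exp(x[0] + x[1]) + np.exp(x[0] - x[1]) + x[0] ** 2
def grad_f(x):
    a, b = np.exp(x[0] + x[1]), np.exp(x[0] - x[1])
    return np.array([a + b + 2 * x[0], a - b])
def hess_f(x):
    a, b = np.exp(x[0] + x[1]), np.exp(x[0] - x[1])
    return np.array([[a + b + 2.0, a - b], [a - b, a + b]])

x_min = newton_method(f, grad_f, hess_f, np.zeros(2))
print(np.linalg.norm(grad_f(x_min)) < 1e-4)  # True
```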
9.5.3 Convergence analysis
There are numbers $\eta$ and $\gamma$ with $0 < \eta \leq \frac{m^2}{L}$ and $\gamma > 0$ such that the following hold:
If $\|\nabla f(x^{(k)})\|_2 \geq \eta$, then

$$f(x^{(k+1)}) - f(x^{(k)}) \leq -\gamma$$

If $\|\nabla f(x^{(k)})\|_2 < \eta$, then the backtracking line search selects $t^{(k)} = 1$ and

$$\frac{L}{2m^2}\|\nabla f(x^{(k+1)})\|_2 \leq \left(\frac{L}{2m^2}\|\nabla f(x^{(k)})\|_2\right)^2$$

The second case gives the result

$$f(x^{(l)}) - p^\star \leq \frac{1}{2m}\|\nabla f(x^{(l)})\|_2^2 \leq \frac{2m^3}{L^2}\left(\frac{1}{2}\right)^{2^{l-k+1}}$$

This last inequality shows that convergence is extremely rapid once the second condition is satisfied. This phenomenon is called quadratic convergence.
9.6 Self-concordance
9.6.1 Definition and examples
We start by considering functions on $\mathbf{R}$. A convex function $f : \mathbf{R} \to \mathbf{R}$ is self-concordant if

$$|f'''(x)| \leq 2 f''(x)^{\frac{3}{2}}$$

for all $x \in \mathbf{dom}\ f$.
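The prototypical example is $f(x) = -\log x$: here $f''(x) = 1/x^2$ and $f'''(x) = -2/x^3$, so $|f'''(x)| = 2f''(x)^{3/2}$ and the inequality holds with equality. A quick numerical check (my own illustration):

```python
import numpy as np

# For f(x) = -log x on x > 0: f''(x) = 1/x^2, f'''(x) = -2/x^3, so
# |f'''(x)| = 2 f''(x)^{3/2} -- self-concordance holds with equality.
x = np.linspace(0.1, 10.0, 100)
f2 = 1.0 / x ** 2
f3 = -2.0 / x ** 3
print(np.allclose(np.abs(f3), 2.0 * f2 ** 1.5))  # True
```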
9.6.2 Self-concordant calculus
Scaling and sum
Self-concordance is preserved by scaling by a factor exceeding one: if $f$ is self-concordant and $a \geq 1$, then $af$ is self-concordant. Self-concordance is also preserved by addition: if $f_1, f_2$ are self-concordant, then $f_1 + f_2$ is self-concordant.
Composition with affine function
If $f : \mathbf{R}^n \to \mathbf{R}$ is self-concordant, and $A \in \mathbf{R}^{n \times m}$, $b \in \mathbf{R}^n$, then $f(Ax + b)$ is self-concordant.
Composition with logarithm
Let $g : \mathbf{R} \to \mathbf{R}$ be a convex function with $\mathbf{dom}\ g = \mathbf{R}_{++}$, and

$$|g'''(x)| \leq 3\frac{g''(x)}{x}$$

for all $x$. Then

$$f(x) = -\log(-g(x)) - \log x$$

is self-concordant on $\{x \mid x > 0,\ g(x) < 0\}$.
9.6.3 Properties of self-concordant functions
9.6.4 Analysis of Newton’s method for self-concordant functions
We will show that there are numbers $\eta$ and $\gamma > 0$, with $0 < \eta \leq \frac{1}{4}$, that depend only on the line search parameters $\alpha$ and $\beta$, such that the following hold:

If $\lambda(x^{(k)}) \geq \eta$, then

$$f(x^{(k+1)}) - f(x^{(k)}) \leq -\gamma$$

If $\lambda(x^{(k)}) < \eta$, then the backtracking line search selects $t^{(k)} = 1$ and

$$2\lambda(x^{(k+1)}) \leq \left(2\lambda(x^{(k)})\right)^2$$

As a consequence,

$$f(x^{(l)}) - p^\star \leq \lambda(x^{(l)})^2 \leq \left(\frac{1}{2}\right)^{2^{l-k+1}}$$
9.6.5 Discussion and numerical examples
9.7 Implementation
9.7.1 Pre-computation for line searches
In the simplest implementation of a line search, $f(x + t\Delta x)$ is evaluated for each value of $t$ in the same way that $f(z)$ is evaluated for any $z \in \mathbf{dom}\ f$. But in some cases we can exploit the fact that $f$ (and its derivatives, in an exact line search) are to be evaluated at many points along the ray $\{x + t\Delta x \mid t \geq 0\}$ to reduce the total computational effort. This usually requires some pre-computation, which is often on the same order as computing $f$ at any point, after which $f$ (and its derivatives) can be computed more efficiently along the ray.
9.7.2 Computing the Newton step
To compute the Newton step $\Delta x_{\rm nt}$, we first evaluate and form the Hessian matrix $H = \nabla^2 f(x)$ and the gradient $g = \nabla f(x)$ at $x$. Then we solve the system of linear equations $H\Delta x_{\rm nt} = -g$ to find the Newton step. This set of equations is sometimes called the Newton system (since its solution gives the Newton step) or the normal equations, since the same type of equation arises in solving a least-squares problem.
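Since $H$ is symmetric positive definite, the Newton system is typically solved via a Cholesky factorization $H = LL^T$ followed by forward and back substitution, rather than by forming $H^{-1}$. A sketch of this standard approach (the explicit substitution loops are for illustration; the matrices are my own example):

```python
import numpy as np

def solve_newton_system(H, g):
    """Solve H dx = -g via Cholesky H = L L^T and two triangular solves."""
    L = np.linalg.cholesky(H)
    n = len(g)
    # Forward substitution: L y = -g
    y = np.zeros_like(g)
    for i in range(n):
        y[i] = (-g[i] - L[i, :i] @ y[:i]) / L[i, i]
    # Back substitution: L^T dx = y
    dx = np.zeros_like(g)
    for i in reversed(range(n)):
        dx[i] = (y[i] - L[i + 1:, i] @ dx[i + 1:]) / L[i, i]
    return dx

# Usage with stand-ins for ∇²f(x) and ∇f(x):
rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
H = B @ B.T + 5 * np.eye(5)         # symmetric positive definite
g = rng.standard_normal(5)
dx_nt = solve_newton_system(H, g)
print(np.allclose(H @ dx_nt, -g))   # True
```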