Convex Optimization Reading Notes (8)
Chapter 9: Unconstrained minimization
9.1 Unconstrained minimization problems
The unconstrained optimization problem is

$$\mathrm{minimize} \quad f(x)$$

where $f : \mathbf{R}^n \to \mathbf{R}$ is convex and twice continuously differentiable (which implies that $\mathbf{dom}\ f$ is open).
Since $f$ is differentiable and convex, a necessary and sufficient condition for a point $x^\star$ to be optimal is

$$\nabla f(x^\star) = 0$$
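As a small illustration (my own example, not from the text): for a strictly convex quadratic $f(x) = \frac{1}{2}x^TPx + q^Tx$ with $P \succ 0$, the optimality condition $\nabla f(x^\star) = Px^\star + q = 0$ can be solved directly.

```python
import numpy as np

# Hypothetical example: f(x) = (1/2) x^T P x + q^T x with P positive
# definite; the optimality condition ∇f(x*) = P x* + q = 0 gives
# x* = -P^{-1} q, which we verify numerically.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
P = A.T @ A + np.eye(3)          # positive definite by construction
q = rng.standard_normal(3)

x_star = np.linalg.solve(P, -q)  # solves ∇f(x) = 0
grad = P @ x_star + q
print(np.linalg.norm(grad))      # ~0 up to floating-point error
```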
9.1.1 Examples
9.1.2 Strong convexity and implications
The objective function is strongly convex on $S$, which means that there exists an $m > 0$ such that

$$\nabla^2 f(x) \succeq mI$$

for all $x \in S$. For $x, y \in S$ we have

$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2}(y - x)^T \nabla^2 f(z)(y - x)$$

for some $z$ on the line segment $[x, y]$, which means

$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{m}{2}\|y - x\|_2^2$$

for all $x$ and $y$ in $S$.
Upper bound on $\nabla^2 f(x)$
There exists a constant $M$ such that

$$\nabla^2 f(x) \preceq MI$$

for all $x \in S$, which implies that

$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{M}{2}\|y - x\|_2^2$$

for all $x, y \in S$.
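These bounds are easy to see concretely (a sketch on an assumed example, not from the text): for a quadratic $f(x) = \frac{1}{2}x^TPx$ the Hessian is $P$ everywhere, so $m$ and $M$ are the extreme eigenvalues of $P$, and the lower bound can be checked numerically at random points.

```python
import numpy as np

# Assumed example: f(x) = (1/2) x^T P x has ∇²f(x) = P, so m is the
# smallest eigenvalue of P.  Check the strong convexity lower bound
# f(y) >= f(x) + ∇f(x)^T (y - x) + (m/2) ||y - x||_2^2.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
P = A.T @ A + 0.5 * np.eye(4)

f = lambda x: 0.5 * x @ P @ x
grad = lambda x: P @ x
m = np.linalg.eigvalsh(P).min()

x, y = rng.standard_normal(4), rng.standard_normal(4)
lower = f(x) + grad(x) @ (y - x) + 0.5 * m * np.linalg.norm(y - x) ** 2
print(f(y) >= lower - 1e-12)   # True
```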
9.2 Descent methods
The algorithms described in this chapter produce a minimizing sequence $x^{(k)},\ k = 1, \ldots,$ where

$$x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)}$$

and $t^{(k)} > 0$ (except when $x^{(k)}$ is optimal). Here the concatenated symbols $\Delta$ and $x$ that form $\Delta x$ are to be read as a single entity, a vector in $\mathbf{R}^n$ called the step or search direction (even though it need not have unit norm), and $k = 0, 1, \ldots$ denotes the iteration number. The scalar $t^{(k)} \geq 0$ is called the step size or step length at iteration $k$ (even though it is not equal to $\|x^{(k+1)} - x^{(k)}\|$ unless $\|\Delta x^{(k)}\| = 1$).
Exact line search
One line search method sometimes used in practice is exact line search, in which $t$ is chosen to minimize $f$ along the ray $\{x + t\Delta x \mid t \geq 0\}$:

$$t = \arg\min_{s \geq 0} f(x + s\Delta x)$$

Many inexact line search methods have been proposed. One inexact line search method that is very simple and quite effective is called backtracking line search. It depends on two constants $\alpha, \beta$ with $0 < \alpha < 0.5$, $0 < \beta < 1$.
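A minimal sketch of backtracking line search (function names are my own): starting from $t = 1$, shrink $t$ by the factor $\beta$ until the sufficient-decrease condition $f(x + t\Delta x) \leq f(x) + \alpha t \nabla f(x)^T \Delta x$ holds.

```python
import numpy as np

# Backtracking line search: dx must be a descent direction at x.
def backtracking(f, grad_f, x, dx, alpha=0.3, beta=0.8):
    t = 1.0
    fx, slope = f(x), grad_f(x) @ dx   # slope = ∇f(x)^T dx < 0
    while f(x + t * dx) > fx + alpha * t * slope:
        t *= beta                      # shrink until sufficient decrease
    return t

# Usage on f(x) = ||x||_2^2 with the negative gradient as direction:
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([1.0, -2.0])
t = backtracking(f, grad_f, x, -grad_f(x))
print(0 < t <= 1)  # True
```

Since $f$ is strongly convex here, the loop terminates after a few shrinks and the resulting step strictly decreases $f$.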
9.3 Gradient descent method
A natural choice for the search direction is the negative gradient $\Delta x = -\nabla f(x)$. The resulting algorithm is called the gradient algorithm or gradient descent method.
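Putting the pieces together, a sketch of gradient descent with backtracking line search (names and defaults are my own choices):

```python
import numpy as np

# Gradient descent with backtracking line search.
def gradient_descent(f, grad_f, x0, alpha=0.3, beta=0.8, tol=1e-8, max_iter=1000):
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:   # stopping criterion ||∇f(x)||_2 <= tol
            break
        dx = -g                        # search direction: negative gradient
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta                  # backtracking
        x = x + t * dx
    return x

# Usage: minimize f(x) = (1/2) x^T P x + q^T x; optimum is x* = -P^{-1} q.
P = np.array([[3.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, -1.0])
x_min = gradient_descent(lambda x: 0.5 * x @ P @ x + q @ x,
                         lambda x: P @ x + q, np.zeros(2))
print(np.allclose(x_min, np.linalg.solve(P, -q), atol=1e-6))  # True
```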
9.3.1 Convergence analysis
Analysis for exact line search
We assume $f$ is strongly convex on $S$, so there are positive constants $m$ and $M$ such that $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$. Define the function $\tilde{f} : \mathbf{R} \to \mathbf{R}$ by $\tilde{f}(t) = f(x - t\nabla f(x))$. We obtain a quadratic upper bound on $\tilde{f}$:

$$\tilde{f}(t) \leq f(x) - t\|\nabla f(x)\|_2^2 + \frac{Mt^2}{2}\|\nabla f(x)\|_2^2$$

Minimizing the right-hand side over $t$ gives the value $f(x) - \frac{1}{2M}\|\nabla f(x)\|_2^2$, so with exact line search we have

$$f(x^+) - p^\star \leq f(x) - p^\star - \frac{1}{2M}\|\nabla f(x)\|_2^2$$

Combining this with $\|\nabla f(x)\|_2^2 \geq 2m(f(x) - p^\star)$, we get

$$f(x^+) - p^\star \leq \left(1 - \frac{m}{M}\right)(f(x) - p^\star)$$

Applying this inequality recursively, we find that

$$f(x^{(k)}) - p^\star \leq c^k (f(x^{(0)}) - p^\star)$$

where $c = 1 - \frac{m}{M} < 1$, which shows that $f(x^{(k)})$ converges to $p^\star$ as $k \to \infty$.
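The linear convergence bound is easy to observe numerically (my own construction, not from the text): for $f(x) = \frac{1}{2}x^TPx$, exact line search along $-\nabla f(x)$ has the closed form $t = g^Tg / (g^TPg)$, and the error $f(x^{(k)}) - p^\star$ decays at least as fast as $c^k$ with $c = 1 - m/M$.

```python
import numpy as np

# f(x) = (1/2) x^T P x with P = diag(1, 10): m = 1, M = 10, c = 0.9, p* = 0.
P = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ P @ x
x = np.array([1.0, 1.0])
errs = [f(x)]
for _ in range(20):
    g = P @ x
    t = (g @ g) / (g @ P @ g)       # exact line search step for a quadratic
    x = x - t * g
    errs.append(f(x))
c = 1 - 1.0 / 10.0
print(all(errs[k] <= (c ** k) * errs[0] + 1e-12 for k in range(len(errs))))  # True
```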
Analysis for backtracking line search
An analogous argument for backtracking line search gives

$$f(x^{(k)}) - p^\star \leq c^k (f(x^{(0)}) - p^\star)$$

where

$$c = 1 - \min\left\{2m\alpha,\ \frac{2\beta\alpha m}{M}\right\} < 1$$
9.3.2 Examples
9.4 Steepest descent method
The first-order Taylor approximation of $f(x + v)$ around $x$ is

$$f(x + v) \approx \hat{f}(x + v) = f(x) + \nabla f(x)^T v$$

Let $\|\cdot\|$ be any norm on $\mathbf{R}^n$. We define a normalized steepest descent direction (with respect to the norm $\|\cdot\|$) as

$$\Delta x_{\rm nsd} = \arg\min\{\nabla f(x)^T v \mid \|v\| = 1\}$$

It is also convenient to consider a steepest descent step $\Delta x_{\rm sd}$ that is unnormalized, obtained by scaling the normalized steepest descent direction in a particular way:

$$\Delta x_{\rm sd} = \|\nabla f(x)\|_* \, \Delta x_{\rm nsd}$$

where $\|\cdot\|_*$ denotes the dual norm.
9.4.1 Steepest descent for Euclidean and quadratic norms
Steepest descent for Euclidean norm
If we take the norm $\|\cdot\|$ to be the Euclidean norm, we find that the steepest descent direction is simply the negative gradient: $\Delta x_{\rm sd} = -\nabla f(x)$.
Steepest descent for quadratic norm
We consider the quadratic norm

$$\|z\|_P = (z^T P z)^{\frac{1}{2}} = \|P^{\frac{1}{2}} z\|_2$$

where $P \in \mathbf{S}_{++}^n$. The normalized and unnormalized steepest descent directions are

$$\Delta x_{\rm nsd} = -\left(\nabla f(x)^T P^{-1} \nabla f(x)\right)^{-\frac{1}{2}} P^{-1} \nabla f(x), \qquad \Delta x_{\rm sd} = -P^{-1} \nabla f(x)$$
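A short numerical sketch of these two formulas (the matrices here are my own example): $\Delta x_{\rm sd} = -P^{-1}\nabla f(x)$, and normalizing it to unit quadratic norm recovers $\Delta x_{\rm nsd}$.

```python
import numpy as np

# Steepest descent directions for the quadratic norm ||z||_P.
rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
P = B @ B.T + np.eye(4)             # P in S^n_++
g = rng.standard_normal(4)          # stands in for ∇f(x)

dx_sd = -np.linalg.solve(P, g)      # unnormalized step: -P^{-1} ∇f(x)
dx_nsd = dx_sd / np.sqrt(dx_sd @ P @ dx_sd)   # scale to ||v||_P = 1
print(np.isclose(dx_nsd @ P @ dx_nsd, 1.0))   # True
print(g @ dx_sd < 0)                # descent direction: True
```

Equivalently, this is gradient descent after the change of coordinates $\bar{x} = P^{1/2}x$.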
9.4.2 Steepest descent for $l_1$-norm
As another example, we consider the steepest descent method for the $l_1$-norm. A normalized steepest descent direction is

$$\Delta x_{\rm nsd} = \arg\min\{\nabla f(x)^T v \mid \|v\|_1 \leq 1\}$$

Let $i$ be any index for which $\|\nabla f(x)\|_\infty = |(\nabla f(x))_i|$. Then a normalized steepest descent direction $\Delta x_{\rm nsd}$ for the $l_1$-norm is given by

$$\Delta x_{\rm nsd} = -{\rm sign}\left(\frac{\partial f(x)}{\partial x_i}\right) e_i$$

An unnormalized steepest descent step is then

$$\Delta x_{\rm sd} = \Delta x_{\rm nsd} \|\nabla f(x)\|_\infty = -\frac{\partial f(x)}{\partial x_i} e_i$$
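The $l_1$ step moves along a single coordinate axis, so each iteration updates only the component with the largest partial derivative. A minimal sketch (function name is my own):

```python
import numpy as np

# l1-norm steepest descent step: pick the coordinate i achieving
# ||∇f(x)||_inf and move along -(∂f/∂x_i) e_i (ties broken by argmax).
def l1_steepest_descent_step(g):
    i = np.argmax(np.abs(g))        # index with largest |∂f/∂x_i|
    dx_sd = np.zeros_like(g)
    dx_sd[i] = -g[i]
    return dx_sd

g = np.array([0.5, -3.0, 1.0])      # stands in for ∇f(x)
print(l1_steepest_descent_step(g))  # [0. 3. 0.]
```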
9.4.3 Convergence analysis
We have

$$f(x^{(k)}) - p^\star \leq c^k (f(x^{(0)}) - p^\star)$$

where

$$c = 1 - 2m\alpha\tilde{\gamma}^2 \min\left\{1, \frac{\beta\gamma^2}{M}\right\} < 1$$

and $\gamma, \tilde{\gamma} \in (0, 1]$ are constants relating the norm $\|\cdot\|$ and its dual to the Euclidean norm, i.e. $\|x\| \geq \gamma\|x\|_2$ and $\|x\|_* \geq \tilde{\gamma}\|x\|_2$ for all $x$.
9.4.4 Discussion and examples
9.5 Newton’s method
9.5.1 The Newton step
For $x \in \mathbf{dom}\ f$, the vector

$$\Delta x_{\rm nt} = -\nabla^2 f(x)^{-1} \nabla f(x)$$

is called the Newton step. Positive definiteness of $\nabla^2 f(x)$ implies that

$$\nabla f(x)^T \Delta x_{\rm nt} = -\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) < 0$$

unless $\nabla f(x) = 0$, so the Newton step is a descent direction (unless $x$ is optimal).
9.5.2 Newton’s method
Newton’s method, as outlined below, is sometimes called the damped Newton method or guarded Newton method, to distinguish it from the pure Newton method, which uses a fixed step size $t = 1$.
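A sketch of the damped Newton method with backtracking line search (function names and the test function are my own choices; the stopping criterion uses the squared Newton decrement $\lambda(x)^2 = \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)$):

```python
import numpy as np

# Damped Newton method with backtracking line search.
def newton_method(f, grad_f, hess_f, x0, alpha=0.25, beta=0.5, eps=1e-10):
    x = x0.astype(float)
    for _ in range(100):
        g, H = grad_f(x), hess_f(x)
        dx = -np.linalg.solve(H, g)          # Newton step Δx_nt
        lam_sq = -g @ dx                     # Newton decrement squared λ(x)^2
        if lam_sq / 2 <= eps:                # stop when λ^2/2 is small
            break
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta                        # backtracking
        x = x + t * dx
    return x

# Usage on a smooth strictly convex test function (my own choice):
# f(x) = exp(x1 + x2) + exp(x1 - x2) + x1^2
f = lambda x: np.exp(x[0] + x[1]) + np.exp(x[0] - x[1]) + x[0] ** 2
def grad_f(x):
    a, b = np.exp(x[0] + x[1]), np.exp(x[0] - x[1])
    return np.array([a + b + 2 * x[0], a - b])
def hess_f(x):
    a, b = np.exp(x[0] + x[1]), np.exp(x[0] - x[1])
    return np.array([[a + b + 2.0, a - b], [a - b, a + b]])

x_min = newton_method(f, grad_f, hess_f, np.zeros(2))
print(np.linalg.norm(grad_f(x_min)) < 1e-4)  # True
```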
9.5.3 Convergence analysis
There are numbers $\eta$ and $\gamma$ with $0 < \eta \leq \frac{m^2}{L}$ and $\gamma > 0$ such that the following hold:
If $\|\nabla f(x^{(k)})\|_2 \geq \eta$, then

$$f(x^{(k+1)}) - f(x^{(k)}) \leq -\gamma$$

If $\|\nabla f(x^{(k)})\|_2 < \eta$, then the backtracking line search selects $t^{(k)} = 1$ and

$$\frac{L}{2m^2}\|\nabla f(x^{(k+1)})\|_2 \leq \left(\frac{L}{2m^2}\|\nabla f(x^{(k)})\|_2\right)^2$$

The second case gives the result

$$f(x^{(l)}) - p^\star \leq \frac{1}{2m}\|\nabla f(x^{(l)})\|_2^2 \leq \frac{2m^3}{L^2}\left(\frac{1}{2}\right)^{2^{l-k+1}}$$

This last inequality shows that convergence is extremely rapid once the second condition is satisfied. This phenomenon is called quadratic convergence.
9.6 Self-concordance
9.6.1 Definition and examples
We start by considering functions on $\mathbf{R}$. A convex function $f : \mathbf{R} \to \mathbf{R}$ is self-concordant if

$$|f'''(x)| \leq 2 f''(x)^{\frac{3}{2}}$$

for all $x \in \mathbf{dom}\ f$.
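The prototypical example is $f(x) = -\log x$: here $f''(x) = 1/x^2$ and $f'''(x) = -2/x^3$, so $|f'''(x)| = 2f''(x)^{3/2}$ and the inequality holds with equality. A quick numerical check (my own illustration):

```python
import numpy as np

# For f(x) = -log x on x > 0: f''(x) = 1/x^2, f'''(x) = -2/x^3, so
# |f'''(x)| = 2 f''(x)^{3/2} -- self-concordance holds with equality.
x = np.linspace(0.1, 10.0, 100)
f2 = 1.0 / x ** 2
f3 = -2.0 / x ** 3
print(np.allclose(np.abs(f3), 2.0 * f2 ** 1.5))  # True
```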
9.6.2 Self-concordant calculus
Scaling and sum
Self-concordance is preserved by scaling by a factor exceeding one: if $f$ is self-concordant and $a \geq 1$, then $af$ is self-concordant. Self-concordance is also preserved by addition: if $f_1, f_2$ are self-concordant, then $f_1 + f_2$ is self-concordant.
Composition with affine function
If $f : \mathbf{R}^n \to \mathbf{R}$ is self-concordant, and $A \in \mathbf{R}^{n \times m}$, $b \in \mathbf{R}^n$, then $f(Ax + b)$ is self-concordant.
Composition with logarithm
Let $g : \mathbf{R} \to \mathbf{R}$ be a convex function with $\mathbf{dom}\ g = \mathbf{R}_{++}$, and

$$|g'''(x)| \leq 3\frac{g''(x)}{x}$$

for all $x$. Then

$$f(x) = -\log(-g(x)) - \log x$$

is self-concordant on $\{x \mid x > 0,\ g(x) < 0\}$.
9.6.3 Properties of self-concordant functions
9.6.4 Analysis of Newton’s method for self-concordant functions
We will show that there are numbers $\eta$ and $\gamma > 0$, with $0 < \eta \leq \frac{1}{4}$, that depend only on the line search parameters $\alpha$ and $\beta$, such that the following hold:

If $\lambda(x^{(k)}) \geq \eta$, then

$$f(x^{(k+1)}) - f(x^{(k)}) \leq -\gamma$$

If $\lambda(x^{(k)}) < \eta$, then the backtracking line search selects $t^{(k)} = 1$ and

$$2\lambda(x^{(k+1)}) \leq \left(2\lambda(x^{(k)})\right)^2$$

As a consequence,

$$f(x^{(l)}) - p^\star \leq \lambda(x^{(l)})^2 \leq \left(\frac{1}{2}\right)^{2^{l-k+1}}$$
9.6.5 Discussion and numerical examples
9.7 Implementation
9.7.1 Pre-computation for line searches
In the simplest implementation of a line search, $f(x + t\Delta x)$ is evaluated for each value of $t$ in the same way that $f(z)$ is evaluated for any $z \in \mathbf{dom}\ f$. But in some cases we can exploit the fact that $f$ (and its derivatives, in an exact line search) are to be evaluated at many points along the ray $\{x + t\Delta x \mid t \geq 0\}$ to reduce the total computational effort. This usually requires some pre-computation, which is often on the same order as computing $f$ at any point, after which $f$ (and its derivatives) can be computed more efficiently along the ray.
9.7.2 Computing the Newton step
To compute the Newton step $\Delta x_{\rm nt}$, we first evaluate and form the Hessian matrix $H = \nabla^2 f(x)$ and the gradient $g = \nabla f(x)$ at $x$. Then we solve the system of linear equations $H\Delta x_{\rm nt} = -g$ to find the Newton step. This set of equations is sometimes called the Newton system (since its solution gives the Newton step) or the normal equations, since the same type of equation arises in solving a least-squares problem.
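Since $H$ is symmetric positive definite, the Newton system is typically solved via a Cholesky factorization $H = LL^T$ followed by forward and back substitution, rather than by forming $H^{-1}$. A sketch of this standard approach (the explicit substitution loops are for illustration; the matrices are my own example):

```python
import numpy as np

def solve_newton_system(H, g):
    """Solve H dx = -g via Cholesky H = L L^T and two triangular solves."""
    L = np.linalg.cholesky(H)
    n = len(g)
    # Forward substitution: L y = -g
    y = np.zeros_like(g)
    for i in range(n):
        y[i] = (-g[i] - L[i, :i] @ y[:i]) / L[i, i]
    # Back substitution: L^T dx = y
    dx = np.zeros_like(g)
    for i in reversed(range(n)):
        dx[i] = (y[i] - L[i + 1:, i] @ dx[i + 1:]) / L[i, i]
    return dx

# Usage with stand-ins for ∇²f(x) and ∇f(x):
rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
H = B @ B.T + 5 * np.eye(5)         # symmetric positive definite
g = rng.standard_normal(5)
dx_nt = solve_newton_system(H, g)
print(np.allclose(H @ dx_nt, -g))   # True
```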