Convex Optimization Reading Notes (5)
Chapter 6: Approximation and fitting
6.1 Norm approximation
6.1.1 Basic norm approximation problem
The simplest norm approximation problem is an unconstrained problem of the form
$$ {\rm minimize} \ \ \ \ ||Ax-b|| $$

where $A\in \mathbf{R}^{m\times n}$, $b\in \mathbf{R}^m$, $x\in \mathbf{R}^n$, and $||\cdot||$ is a norm on $\mathbf{R}^m$. A solution of the norm approximation problem is sometimes called an approximate solution of $Ax \approx b$, in the norm $||\cdot||$. The vector $r = Ax - b$ is called the residual for the problem; its components are sometimes called the individual residuals associated with $x$.
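As a concrete illustration (not part of the original notes), the norm approximation problem is easy to set up in CVXPY; the matrix $A$ and vector $b$ below are random placeholders, and the loop solves the problem for the $l_1$, $l_2$, and $l_\infty$ norms.

```python
import cvxpy as cp
import numpy as np

np.random.seed(0)
m, n = 30, 10
A = np.random.randn(m, n)     # placeholder data matrix A in R^{m x n}
b = np.random.randn(m)        # placeholder vector b in R^m

x = cp.Variable(n)
residual = A @ x - b          # r = Ax - b

for p in [1, 2, "inf"]:       # l1-, l2-, and l-infinity-norm approximation
    prob = cp.Problem(cp.Minimize(cp.norm(residual, p)))
    prob.solve()
    print(f"p = {p}: optimal value {prob.value:.4f}")
```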
6.1.2 Penalty function approximation
In $l_p$-norm approximation, for $1 \leq p < \infty$, the objective is

$$ (|r_1|^p+\cdots+|r_m|^p)^{\frac{1}{p}} $$

As in least-squares problems, we can consider the equivalent problem with objective

$$ |r_1|^p+\cdots+|r_m|^p $$

The penalty function approximation problem has the form
$$ \begin{aligned} {\rm minimize} \ \ \ \ & \phi(r_1)+\cdots+\phi(r_m)\\ {\rm subject \ to} \ \ \ \ & r=Ax-b \end{aligned} $$

where $\phi : \mathbf{R} \to \mathbf{R}$ is called the (residual) penalty function. We assume that $\phi$ is convex, so the penalty function approximation problem is a convex optimization problem.
deadzone-linear
$$ \phi(u)=\left\{ \begin{array}{ll} 0 & |u|\leq a \\ |u|-a & |u| > a \end{array} \right. $$
log barrier
$$ \phi(u)=\left\{ \begin{array}{ll} -a^2\log(1-(u/a)^2) & |u| < a \\ \infty & |u| \geq a \end{array} \right. $$
The log barrier penalty (with limit $a=1$) behaves very much like the $l_2$-norm penalty for small residuals, but puts very strong weight on residuals larger than around 0.8, and infinite weight on residuals larger than 1.
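A sketch of penalty function approximation with the deadzone-linear penalty, assuming CVXPY and random placeholder data; the deadzone width $a$ is a made-up value. The penalty is written as $\max\{|u|-a,\,0\}$, which is convex.

```python
import cvxpy as cp
import numpy as np

np.random.seed(0)
m, n = 30, 10
a = 0.5                                 # made-up deadzone width
A = np.random.randn(m, n)               # placeholder data
b = np.random.randn(m)

x = cp.Variable(n)
r = A @ x - b
# sum_i phi(r_i) with phi(u) = max(|u| - a, 0): residuals with |r_i| <= a
# are free, larger residuals are penalized linearly.
deadzone_obj = cp.sum(cp.pos(cp.abs(r) - a))
cp.Problem(cp.Minimize(deadzone_obj)).solve()
print(x.value)
```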
6.1.3 Approximation with constraints
It is possible to add constraints to the basic norm approximation problem. When these constraints are convex, the resulting problem is convex.
6.2 Least-norm problems
The basic least-norm problem has the form
$$ \begin{aligned} {\rm minimize} \ \ \ \ & ||x|| \\ {\rm subject \ to} \ \ \ \ & Ax=b \end{aligned} $$

where $A\in \mathbf{R}^{m\times n}$, $b\in \mathbf{R}^m$, $x\in \mathbf{R}^n$, and $||\cdot||$ is a norm on $\mathbf{R}^n$.
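For the Euclidean norm, the least-norm problem has the closed-form solution $x = A^T(AA^T)^{-1}b$ when $A$ has full row rank; the NumPy sketch below (with placeholder data) computes it.

```python
import numpy as np

np.random.seed(0)
m, n = 5, 10                           # underdetermined system: m < n
A = np.random.randn(m, n)              # placeholder data, full row rank w.h.p.
b = np.random.randn(m)

# minimum Euclidean norm solution of Ax = b
x_ln = A.T @ np.linalg.solve(A @ A.T, b)
# np.linalg.pinv(A) @ b returns the same vector
print(np.allclose(A @ x_ln, b), np.linalg.norm(x_ln))
```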
6.3 Regularized approximation
6.3.1 Bi-criterion formulation
In the basic form of regularized approximation, the goal is to find a vector $x$ that is small (if possible), and also makes the residual $Ax-b$ small. This is naturally described as a (convex) vector optimization problem with two objectives, $||Ax-b||$ and $||x||$:

$$ {\rm minimize \ (w.r.t. \ } \mathbf{R}^2_+) \ \ \ \ (||Ax-b||,||x||) $$

The two norms can be different: the first, used to measure the size of the residual, is on $\mathbf{R}^m$; the second, used to measure the size of $x$, is on $\mathbf{R}^n$.
6.3.2 Regularization
Regularization is a common scalarization method used to solve the bi-criterion problem. One form of regularization is to minimize the weighted sum of the objectives:
$$ {\rm minimize} \ \ \ \ ||Ax-b||+\gamma||x|| $$

where $\gamma > 0$ is a problem parameter. As $\gamma$ varies over $(0, \infty)$, the solution traces out the optimal trade-off curve.
Another common method of regularization, especially when the Euclidean norm is used, is to minimize the weighted sum of squared norms:
$$ {\rm minimize} \ \ \ \ ||Ax-b||^2+\delta||x||^2 $$

for a variety of values of $\delta > 0$.
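For the Euclidean norm, this sum-of-squares regularization (Tikhonov regularization) has the closed-form solution $x = (A^TA + \delta I)^{-1}A^Tb$. A minimal NumPy sketch with placeholder data, sweeping $\delta$ to sample the trade-off between residual norm and solution norm:

```python
import numpy as np

np.random.seed(0)
m, n = 30, 10
A = np.random.randn(m, n)              # placeholder data
b = np.random.randn(m)

for delta in [0.01, 0.1, 1.0, 10.0]:   # made-up grid of weights
    x = np.linalg.solve(A.T @ A + delta * np.eye(n), A.T @ b)
    # (residual norm, solution norm) pairs sample the trade-off curve
    print(delta, np.linalg.norm(A @ x - b), np.linalg.norm(x))
```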
6.3.3 Reconstruction, smoothing, and de-noising
Quadratic smoothing
The simplest reconstruction method uses the quadratic smoothing function
$$ \phi_{\rm quad}(x)=\sum_{i=1}^{n-1}(x_{i+1}-x_i)^2 $$
We can obtain the optimal trade-off between $||\hat{x}-x_{\rm cor}||_2^2$ and $||D\hat{x}||_2^2$ by minimizing

$$ ||\hat{x}-x_{\rm cor}||^2_2+\delta||D\hat{x}||_2^2 $$

where $D\in \mathbf{R}^{(n-1)\times n}$ is the bidiagonal first-difference matrix, so that $D\hat{x}=(\hat{x}_2-\hat{x}_1,\ldots,\hat{x}_n-\hat{x}_{n-1})$.
But any rapid variations in the original signal will, obviously, be attenuated or removed by quadratic smoothing.
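Setting the gradient of the quadratic-smoothing objective to zero gives the closed form $\hat{x} = (I + \delta D^TD)^{-1}x_{\rm cor}$. The NumPy sketch below (synthetic noisy signal, made-up $\delta$) builds the first-difference matrix $D$ explicitly and solves this linear system.

```python
import numpy as np

np.random.seed(0)
n, delta = 200, 10.0                       # made-up size and weight
t = np.linspace(0, 4 * np.pi, n)
x_cor = np.sin(t) + 0.2 * np.random.randn(n)   # synthetic corrupted signal

# bidiagonal first-difference matrix D in R^{(n-1) x n}
D = np.zeros((n - 1, n))
D[np.arange(n - 1), np.arange(n - 1)] = -1.0
D[np.arange(n - 1), np.arange(1, n)] = 1.0

# minimizer of ||xhat - x_cor||_2^2 + delta * ||D xhat||_2^2
x_hat = np.linalg.solve(np.eye(n) + delta * D.T @ D, x_cor)
print(np.linalg.norm(x_hat - x_cor), np.linalg.norm(D @ x_hat))
```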
Total variation reconstruction
The method is based on the smoothing function
$$ \phi_{\rm tv}(x)=\sum_{i=1}^{n-1}|x_{i+1}-x_i| $$
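Because $\phi_{\rm tv}$ is an $l_1$-type penalty, total variation reconstruction is naturally written as a convex problem. A hedged CVXPY sketch on a synthetic step signal (the data and the weight $\delta$ are placeholders):

```python
import cvxpy as cp
import numpy as np

np.random.seed(0)
n, delta = 200, 2.0                                   # made-up size and weight
x_true = np.concatenate([np.zeros(n // 2), np.ones(n - n // 2)])  # step signal
x_cor = x_true + 0.2 * np.random.randn(n)             # synthetic noisy signal

x_hat = cp.Variable(n)
phi_tv = cp.sum(cp.abs(x_hat[1:] - x_hat[:-1]))       # total variation of xhat
prob = cp.Problem(cp.Minimize(cp.sum_squares(x_hat - x_cor) + delta * phi_tv))
prob.solve()
# the reconstruction keeps the jump instead of smearing it out
print(x_hat.value[:3], x_hat.value[-3:])
```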
6.4 Robust approximation
6.4.1 Stochastic robust approximation
We assume that $A$ is a random variable taking values in $\mathbf{R}^{m\times n}$ with mean $\bar{A}$, so we can describe $A$ as

$$ A=\bar{A}+U $$

where $U$ is a random matrix with zero mean. Then the problem is
$$ {\rm minimize} \ \ \ \ \mathbb{E}[||Ax-b||_2^2] $$
We can express the objective as $\mathbb{E}[||Ax-b||_2^2]=||\bar{A}x-b||_2^2+x^TPx$ with $P=\mathbb{E}[U^TU]$, so the problem is equivalent to

$$ {\rm minimize} \ \ \ \ ||\bar{A}x-b||_2^2+||P^{\frac{1}{2}}x||_2^2 $$

The solution is

$$ x=(\bar{A}^T\bar{A}+P)^{-1}\bar{A}^Tb $$
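A small NumPy sketch of this formula, under the assumption that $U$ has i.i.d. zero-mean entries with standard deviation $\sigma$, in which case $P = \mathbb{E}[U^TU] = m\sigma^2 I$; all data are synthetic.

```python
import numpy as np

np.random.seed(0)
m, n, sigma = 30, 10, 0.1                 # made-up sizes and noise level
Abar = np.random.randn(m, n)              # mean of the random matrix A
b = np.random.randn(m)
P = sigma**2 * m * np.eye(n)              # E[U^T U] for i.i.d. N(0, sigma^2) entries

x_stoch = np.linalg.solve(Abar.T @ Abar + P, Abar.T @ b)
x_nom = np.linalg.solve(Abar.T @ Abar, Abar.T @ b)   # nominal solution, ignores U
print(np.linalg.norm(x_stoch), np.linalg.norm(x_nom))
```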
6.4.2 Worst-case robust approximation
The (worst-case) robust approximation problem is to minimize the worst-case error:
$$ {\rm minimize} \ \ \ \ e_{\rm wc}(x)=\sup\{ ||Ax-b|| \mid A\in \mathcal{A} \} $$

where $\mathcal{A}$ is the set of possible values of $A$. The robust approximation problem is always a convex optimization problem, but its tractability depends on the norm used and the description of the uncertainty set $\mathcal{A}$.
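For a finite uncertainty set $\mathcal{A} = \{A_1, \ldots, A_K\}$, the worst-case problem can be written in epigraph form and solved directly; a CVXPY sketch with synthetic placeholder data:

```python
import cvxpy as cp
import numpy as np

np.random.seed(0)
m, n, K = 30, 10, 5                          # made-up sizes
A_nom = np.random.randn(m, n)
As = [A_nom + 0.1 * np.random.randn(m, n) for _ in range(K)]   # possible A's
b = np.random.randn(m)

x = cp.Variable(n)
t = cp.Variable()
# e_wc(x) <= t  <=>  ||A_k x - b||_2 <= t for every A_k in the set
constraints = [cp.norm(Ak @ x - b, 2) <= t for Ak in As]
cp.Problem(cp.Minimize(t), constraints).solve()
print(t.value)
```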
6.5 Function fitting and interpolation
6.5.1 Function families
We consider a family of functions
$f_1, \cdots , f_n : \mathbf{R}^k \to \mathbf{R}$, with common domain $\mathbf{dom} \ f_i = D$. With each $x \in \mathbf{R}^n$ we associate the function $f : \mathbf{R}^k \to \mathbf{R}$ given by

$$ f(u)=x_1f_1(u)+\cdots+x_nf_n(u) $$

The family $\{f_1, \cdots , f_n\}$ is sometimes called the set of basis functions (for the fitting problem) even when the functions are not independent. The vector $x \in \mathbf{R}^n$, which parametrizes the subspace of functions, is our optimization variable, and is sometimes called the coefficient vector.
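An illustrative sketch, assuming the polynomial basis $f_i(u) = u^{i-1}$ on $\mathbf{R}$ and synthetic data: the coefficient vector $x$ parametrizes $f$, and a least-squares fit of the data reduces to a norm approximation problem in $x$.

```python
import numpy as np

np.random.seed(0)
n, m = 4, 25                                    # basis functions / data points
u = np.linspace(-1, 1, m)
y = np.sin(3 * u) + 0.05 * np.random.randn(m)   # synthetic data to fit

F = np.vander(u, n, increasing=True)        # F[:, k] holds u**k at the data points
x, *_ = np.linalg.lstsq(F, y, rcond=None)   # minimize ||F x - y||_2
print(x)                                    # coefficient vector of the fitted f
```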
6.5.2 Constraints
Interpolation and inequalities
Interpolation conditions
$$ f(v_j)=z_j, \ \ j=1,\ldots,m $$

which require the function $f$ to have the values $z_j \in \mathbf{R}$ at specified points $v_j \in D$, form a set of linear equalities in $x$.
Derivative constraints
The gradient

$$ \nabla f(v)=\sum_{i=1}^{n}x_i\nabla f_i(v) $$

is a linear function of $x$, so interpolation conditions on the derivative of $f$ at $v$ reduce to linear equality constraints on $x$. Likewise, the norm bound

$$ ||\nabla f(v)||=\left|\left|\sum_{i=1}^{n}x_i\nabla f_i(v)\right|\right|\leq M $$

is a convex constraint on $x$.
Integral constraints
Any linear functional $\mathcal{L}$ on the subspace of functions can be expressed as a linear function of $x$:

$$ \mathcal{L}(f)=c^Tx $$

where $c_i=\int_{D}\phi(u)f_i(u)\,du$ for a given weight function $\phi$.
6.5.3 Fitting and interpolation problems
6.5.4 Sparse descriptions and basis pursuit
In basis pursuit, there is a very large number of basis functions, and the goal is to find a good fit of the given data as a linear combination of a small number of the basis functions.
Thus we seek a function
$f \in \mathcal{F}$ that fits the data well,

$$ f(u_i)\approx y_i,\ \ i=1,\ldots,m $$

with a sparse coefficient vector $x$, i.e., $\mathbf{card}(x)$ small:

$$ f=\sum_{i \in \mathcal{B}}x_if_i $$

where $\mathcal{B}$ is the (small) set of indices of the nonzero coefficients.
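A common heuristic for making the coefficient vector sparse is to replace $\mathbf{card}(x)$ with the $l_1$ norm; the CVXPY sketch below uses a synthetic basis matrix and data, and the weight $\gamma$ is a placeholder.

```python
import cvxpy as cp
import numpy as np

np.random.seed(0)
m, n, gamma = 40, 200, 0.1                # many basis functions, made-up weight
F = np.random.randn(m, n)                 # synthetic basis matrix, F[j, i] = f_i(u_j)
x_true = np.zeros(n)
x_true[[3, 50, 120]] = [1.0, -2.0, 0.5]   # only three basis functions are active
y = F @ x_true + 0.01 * np.random.randn(m)

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.sum_squares(F @ x - y) + gamma * cp.norm(x, 1)))
prob.solve()
print(np.sum(np.abs(x.value) > 1e-3))     # number of (essentially) nonzero coefficients
```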
6.5.5 Interpolation with convex functions
There exists a convex function $f:\mathbf{R}^k \to \mathbf{R}$ with $\mathbf{dom} \ f= \mathbf{R}^k$ that satisfies the interpolation conditions

$$ f(u_i)=y_i, \ \ i=1,\ldots,m $$

if and only if there exist $g_1, \ldots , g_m$ such that

$$ y_j\geq y_i+g_i^T(u_j-u_i), \ \ i,j=1,\ldots,m $$
For example, the piecewise-linear function

$$ f(z)=\max_{i=1,\ldots,m}\left(y_i+g_i^T(z-u_i)\right) $$

satisfies the interpolation conditions.
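A hedged CVXPY sketch of this condition: given data $(u_i, y_i)$ (synthetic here), we search for subgradients $g_i$ satisfying the inequalities; if the problem is feasible, the piecewise-linear maximum above interpolates the data.

```python
import cvxpy as cp
import numpy as np

np.random.seed(0)
m, k = 8, 2
U = np.random.randn(m, k)                  # synthetic points u_i in R^k
y = np.sum(U**2, axis=1)                   # values of a convex function at the u_i

G = cp.Variable((m, k))                    # candidate subgradients g_i (rows)
constraints = [G[i, :] @ (U[j] - U[i]) <= y[j] - y[i]
               for i in range(m) for j in range(m) if i != j]
prob = cp.Problem(cp.Minimize(cp.sum_squares(G)), constraints)
prob.solve()
print(prob.status)                         # "optimal" means the data are interpolatable

Gv = G.value
def f(z):
    # the piecewise-linear interpolant built from the recovered subgradients
    return max(float(y[i] + Gv[i] @ (z - U[i])) for i in range(m))

print(np.allclose([f(U[i]) for i in range(m)], y, atol=1e-5))
```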