Convex Optimization Reading Notes (6)
Chapter 7: Statistical estimation
7.1 Parametric distribution estimation
7.1.1 Maximum likelihood estimation
The log-likelihood function, denoted $l$, is defined as

$$l(x)=\log p_x(y)$$

A widely used method, called maximum likelihood (ML) estimation, is to estimate $x$ as

$$\hat{x}_{\rm ml}=\arg\max_x p_x(y)=\arg\max_x l(x)$$
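As a quick illustration (an assumed special case, not spelled out in this note): for a linear measurement model $y = Ax + w$ with IID Gaussian noise, $l(x)$ is a negative scaled least-squares residual plus a constant, so the ML estimate is the least-squares solution. A minimal numpy sketch:

```python
# Minimal sketch: ML estimation for an assumed linear Gaussian model
# y = A x + w, w ~ N(0, sigma^2 I).  Here l(x) = -||A x - y||^2 / (2 sigma^2)
# + const, so maximizing l(x) is ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 50                                   # parameter dim, number of measurements
x_true = rng.standard_normal(n)                # unknown parameter (simulation only)
A = rng.standard_normal((m, n))                # known measurement matrix
y = A @ x_true + 0.1 * rng.standard_normal(m)  # noisy observations

x_ml, *_ = np.linalg.lstsq(A, y, rcond=None)   # maximizes l(x)
print("ML estimate:", x_ml)
```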
7.1.2 Maximum a posteriori probability estimation
Maximum a posteriori probability (MAP) estimation can be considered a Bayesian version of maximum likelihood estimation, with a prior probability density on the underlying parameter
$x$. We assume that $x$ (the vector to be estimated) and $y$ (the observation) are random variables with a joint probability density $p(x,y)$.
The prior density of $x$ is given by

$$p_x(x)=\int p(x,y)\,dy$$

Similarly,

$$p_y(y)=\int p(x,y)\,dx$$

The conditional density of $y$, given $x$, is given by

$$p_{y\mid x}(x,y)=\frac{p(x,y)}{p_x(x)}$$

In the MAP estimation method, our estimate of $x$, given the observation $y$, is given by

$$\begin{aligned} \hat{x}_{\rm map} &= \arg\max_x p_{x\mid y}(x,y)\\ &= \arg\max_x p_{y\mid x}(x,y)\,p_x(x)\\ &= \arg\max_x p(x,y) \end{aligned}$$
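Continuing the assumed linear Gaussian example, adding a Gaussian prior $x \sim \mathcal{N}(0,\tau^2 I)$ turns MAP estimation into ridge-regularized least squares; a minimal sketch:

```python
# Minimal sketch, assuming the same linear Gaussian model as above plus a
# Gaussian prior x ~ N(0, tau^2 I).  Maximizing p(x, y) then amounts to
# minimizing ||A x - y||^2 / sigma^2 + ||x||^2 / tau^2.
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma, tau = 3, 50, 0.1, 1.0
x_true = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = A @ x_true + sigma * rng.standard_normal(m)

reg = (sigma / tau) ** 2                       # ratio of noise to prior variance
x_map = np.linalg.solve(A.T @ A + reg * np.eye(n), A.T @ y)
print("MAP estimate:", x_map)
```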
7.2 Nonparametric distribution estimation
7.3 Optimal detector design and hypothesis testing
Suppose $X$ is a random variable with values in $\{1, \dots, n\}$, with a distribution that depends on a parameter $\theta \in \{1, \dots, m\}$. The distributions of $X$, for the $m$ possible values of $\theta$, can be represented by a matrix $P \in \mathbf{R}^{n\times m}$, with elements

$$p_{kj}=\mathbf{prob}(X=k\mid \theta=j)$$

The $j$th column of $P$ gives the probability distribution associated with the parameter value $\theta = j$. The $m$ values of $\theta$ are called hypotheses, and guessing which hypothesis is correct is called hypothesis testing.
7.3.1 Deterministic and randomized detectors
A (deterministic) estimator or detector is a function $\psi$ from $\{1, \dots, n\}$ (the set of possible observed values) into $\{1, \dots, m\}$ (the set of hypotheses).

A randomized detector of $\theta$ is a random variable $\hat{\theta} \in \{1, \dots, m\}$. A randomized detector can be defined in terms of a matrix $T\in\mathbf{R}^{m\times n}$ with

$$t_{ik} = \mathbf{prob}(\hat{\theta}=i\mid X=k)$$
7.3.2 Detection probability matrix
For the randomized detector defined by the matrix $T$, we define the detection probability matrix as $D = TP$. We have

$$D_{ij}=(TP)_{ij}=\mathbf{prob}(\hat{\theta}=i\mid \theta=j)$$

so $D_{ij}$ is the probability of guessing $\hat{\theta} = i$ when in fact $\theta = j$.
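A tiny numerical illustration (with made-up numbers) of $D = TP$:

```python
# Two hypotheses (m = 2), three possible observed values (n = 3).
# Columns of P are the distributions of X under each hypothesis;
# columns of T give the (randomized) guess distribution for each observation.
import numpy as np

P = np.array([[0.70, 0.10],     # prob(X = k | theta = j), shape (n, m)
              [0.20, 0.10],
              [0.10, 0.80]])
T = np.array([[1.0, 0.5, 0.0],  # prob(theta_hat = i | X = k), shape (m, n)
              [0.0, 0.5, 1.0]])

D = T @ P                       # detection probability matrix
print(D)                        # D[i, j] = prob(theta_hat = i+1 | theta = j+1)
print("correct-detection probabilities:", np.diag(D))
```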
7.3.3 Optimal detector design
7.3.4 Multicriterion formulation and scalarization
The optimal detector design problem can be considered a multicriterion problem, with the $m(m-1)$ objectives given by the off-diagonal entries of $D$, which are the probabilities of the different types of detection error:

$$\begin{aligned} {\rm minimize \ (w.r.t. \ } \mathbf{R}^{m(m-1)}_+) \ \ \ \ & D_{ij}, \ \ i,j=1,\dots,m, \ \ i\ne j \\ {\rm subject \ to \ } \ \ \ \ & t_k\succeq0, \ \ \mathbf{1}^Tt_k=1, \ \ k=1,\dots,n \end{aligned}$$

Here the variables are the columns $t_1, \dots, t_n$ of $T$.
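One common scalarization is to minimize a nonnegative weighted sum of the error probabilities $D_{ij}$, $i\ne j$, which is a linear program in $T$. A sketch assuming cvxpy is available; the weight matrix $W$ and the matrix $P$ below are made-up examples, not from the book:

```python
# Scalarized detector design: minimize sum_{i != j} W[i, j] * D[i, j]
# over column-stochastic T.  This is an LP.
import cvxpy as cp
import numpy as np

P = np.array([[0.70, 0.10],
              [0.20, 0.10],
              [0.10, 0.80]])            # (n, m): distributions of X per hypothesis
n, m = P.shape
W = np.ones((m, m)) - np.eye(m)         # equal weight on every error type

T = cp.Variable((m, n), nonneg=True)    # t_ik = prob(theta_hat = i | X = k)
D = T @ P
objective = cp.Minimize(cp.sum(cp.multiply(W, D)))
constraints = [cp.sum(T, axis=0) == 1]  # each column of T is a distribution
cp.Problem(objective, constraints).solve()
print("optimal T:\n", T.value)
print("detection probability matrix:\n", (T.value @ P).round(3))
```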
7.3.5 Binary hypothesis testing
As an illustration, we consider the special case $m = 2$, which is called binary hypothesis testing.
7.3.6 Robust detectors
Here the matrix $P$ is only known to lie in a set $\mathcal{P}$ of possible values. We define the worst-case detection probability matrix $D^{\rm wc}$ as

$$D^{\rm wc}_{ij}=\sup_{P\in\mathcal{P}}D_{ij}, \ \ i,j=1,\dots,m, \ \ i\neq j$$

and

$$D^{\rm wc}_{ii}=\inf_{P \in \mathcal{P}}D_{ii}, \ \ i=1,\dots,m$$
7.4 Chebyshev and Chernoff bounds
7.4.1 Chebyshev bounds
Chebyshev bounds give an upper bound on the probability of a set based on known expected values of certain functions. If $X$ is a random variable on $\mathbf{R}$ with $\mathbb{E}X=\mu$ and $\mathbb{E}(X-\mu)^2 =\sigma^2$, then we have $\mathbf{prob}(|X-\mu|\geq 1)\leq\sigma^2$, no matter what the distribution of $X$ is.
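A quick numerical sanity check of this bound (the exponential distribution below is an arbitrary choice for illustration):

```python
# For any distribution with mean mu and variance sigma^2,
# prob(|X - mu| >= 1) should not exceed sigma^2.
import numpy as np

rng = np.random.default_rng(2)
samples = rng.exponential(scale=0.5, size=1_000_000)  # mean 0.5, variance 0.25
mu, sigma2 = samples.mean(), samples.var()

empirical = np.mean(np.abs(samples - mu) >= 1.0)
print(f"empirical prob = {empirical:.4f}  <=  sigma^2 = {sigma2:.4f}")
```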
7.4.2 Chernoff bounds
Let $X$ be a random variable on $\mathbf{R}$. The Chernoff bound states that

$$\mathbf{prob}(X\geq u)\leq \inf_{\lambda\geq 0}\mathbb{E}\,e^{\lambda(X-u)}$$

which can be expressed as

$$\log\mathbf{prob}(X\geq u)\leq \inf_{\lambda\geq 0} \left\{-\lambda u +\log \mathbb{E}\, e^{\lambda X} \right\}$$
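As a concrete (assumed) example, for a standard normal $X$ we have $\log \mathbb{E}\,e^{\lambda X} = \lambda^2/2$, so the infimum over $\lambda\geq 0$ is attained at $\lambda = u$ (for $u \geq 0$), giving $\mathbf{prob}(X\geq u) \leq e^{-u^2/2}$:

```python
# Compare the optimized Chernoff bound exp(-u^2 / 2) for X ~ N(0, 1)
# with the exact Gaussian tail probability.
import numpy as np
from scipy.stats import norm

for u in (1.0, 2.0, 3.0):
    chernoff = np.exp(-u**2 / 2)   # optimized Chernoff bound
    exact = norm.sf(u)             # exact tail probability prob(X >= u)
    print(f"u = {u}:  Chernoff bound = {chernoff:.4f},  exact = {exact:.4f}")
```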
7.4.3 Example
7.5 Experiment design
We consider the problem of estimating a vector $x \in \mathbf{R}^n$ from measurements or experiments

$$y_i=a_i^Tx+w_i, \ \ i=1,\dots,m$$

where $w_i$ is measurement noise. For IID noise $w_i \sim \mathcal{N}(0,1)$, the maximum likelihood estimate $\hat{x}$ is the least-squares solution, and the associated estimation error $e = \hat{x} - x$ has zero mean and covariance matrix

$$E=\mathbb{E}\,ee^T=\Big(\sum_{i=1}^{m}a_ia_i^T\Big)^{-1}$$

We suppose that the vectors $a_1, \dots, a_m$, which characterize the measurements, can be chosen among $p$ possible test vectors $v_1, \dots, v_p \in \mathbf{R}^n$. The goal of experiment design is to choose the vectors $a_i$, from among the possible choices, so that the error covariance $E$ is small (in some sense).
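A quick Monte Carlo check of the error covariance formula, with an illustrative random design (the sizes and data below are arbitrary assumptions):

```python
# For y_i = a_i^T x + w_i with w_i ~ N(0, 1), the empirical covariance of
# the least-squares error e = x_hat - x should match (sum_i a_i a_i^T)^{-1}.
import numpy as np

rng = np.random.default_rng(3)
n, m, trials = 3, 20, 20_000
A = rng.standard_normal((m, n))            # rows are a_i^T (a fixed design)
x = rng.standard_normal(n)
E_pred = np.linalg.inv(A.T @ A)            # predicted error covariance

errs = []
for _ in range(trials):
    y = A @ x + rng.standard_normal(m)
    x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    errs.append(x_hat - x)
E_emp = np.cov(np.array(errs).T)           # empirical error covariance
print(np.round(E_pred, 3))
print(np.round(E_emp, 3))
```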
7.5.1 The relaxed experiment design problem
Let $m_j$ denote the number of measurements for which $a_i = v_j$, so that $m_1 + \cdots + m_p = m$. In the case when $m$ is large compared to $n$, a good approximate solution can be found by ignoring, or relaxing, the constraint that the $m_j$ are integers. Writing $\lambda_j = m_j/m$ for the fraction of experiments of type $j$, the relaxed experiment design problem is

$$\begin{aligned} {\rm minimize \ (w.r.t. \ } \mathbf{S}^{n}_+) \ \ \ \ & E=\frac{1}{m}\Big(\sum_{i=1}^p\lambda_iv_iv_i^T\Big)^{-1} \\ {\rm subject \ to \ } \ \ \ \ & \lambda\succeq0, \ \ \mathbf{1}^T\lambda=1 \end{aligned}$$
7.5.2 Scalarizations
$D$-optimal design

The most widely used scalarization is called $D$-optimal design, in which we minimize the determinant of the error covariance matrix $E$:

$$\begin{aligned} {\rm minimize} \ \ \ \ & \log \det \Big(\sum_{i=1}^p\lambda_iv_iv_i^T\Big)^{-1} \\ {\rm subject \ to \ } \ \ \ \ & \lambda\succeq0, \ \ \mathbf{1}^T\lambda=1 \end{aligned}$$
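A minimal sketch of this problem, assuming cvxpy is installed and using randomly generated test vectors $v_1,\dots,v_p$ (not from the book); minimizing $\log\det$ of the inverse is written equivalently as maximizing $\log\det$ of the information matrix:

```python
# Relaxed D-optimal design: maximize log det(sum_i lambda_i v_i v_i^T)
# over the probability simplex.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, p = 3, 10
V = rng.standard_normal((p, n))            # rows are the test vectors v_i

lam = cp.Variable(p, nonneg=True)
info = sum(lam[i] * np.outer(V[i], V[i]) for i in range(p))
problem = cp.Problem(cp.Maximize(cp.log_det(info)),
                     [cp.sum(lam) == 1])
problem.solve()
print("optimal experiment fractions:", np.round(lam.value, 3))
```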
$E$-optimal design

In $E$-optimal design, we minimize the norm of the error covariance matrix, i.e., the maximum eigenvalue of $E$. The $E$-optimal experiment design problem can be cast as an SDP

$$\begin{aligned} {\rm maximize} \ \ \ \ & t \\ {\rm subject \ to \ } \ \ \ \ & \sum_{i=1}^p\lambda_iv_iv_i^T \succeq tI \\ &\lambda\succeq0, \ \ \mathbf{1}^T\lambda=1 \end{aligned}$$
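A sketch of this SDP under the same assumed random test vectors as in the $D$-optimal sketch above (cvxpy again assumed available):

```python
# E-optimal design: maximize t such that the information matrix dominates
# t * I, i.e., maximize its smallest eigenvalue.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, p = 3, 10
V = rng.standard_normal((p, n))

lam = cp.Variable(p, nonneg=True)
t = cp.Variable()
info = sum(lam[i] * np.outer(V[i], V[i]) for i in range(p))
constraints = [info >> t * np.eye(n), cp.sum(lam) == 1]
cp.Problem(cp.Maximize(t), constraints).solve()
print("optimal fractions:", np.round(lam.value, 3), " t =", round(t.value, 4))
```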
$A$-optimal design

In $A$-optimal experiment design, we minimize $\mathbf{tr}\,E$, the trace of the covariance matrix:

$$\begin{aligned} {\rm minimize} \ \ \ \ & \mathbf{tr} \Big(\sum_{i=1}^p\lambda_iv_iv_i^T\Big)^{-1} \\ {\rm subject \ to \ } \ \ \ \ & \lambda\succeq0, \ \ \mathbf{1}^T\lambda=1 \end{aligned}$$
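A sketch of the $A$-optimal problem under the same assumed setup: the trace of the inverse information matrix is written as a sum of matrix fractional terms $e_k^T\big(\sum_i\lambda_iv_iv_i^T\big)^{-1}e_k$, which cvxpy's `matrix_frac` atom handles directly.

```python
# A-optimal design: minimize tr((sum_i lambda_i v_i v_i^T)^{-1}) over the
# probability simplex, written via matrix fractional terms.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, p = 3, 10
V = rng.standard_normal((p, n))

lam = cp.Variable(p, nonneg=True)
info = sum(lam[i] * np.outer(V[i], V[i]) for i in range(p))
trace_inv = sum(cp.matrix_frac(np.eye(n)[:, k], info) for k in range(n))
cp.Problem(cp.Minimize(trace_inv), [cp.sum(lam) == 1]).solve()
print("optimal fractions:", np.round(lam.value, 3))
```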