Convex Optimization Reading Notes (6)
Chapter 7: Statistical estimation
7.1 Parametric distribution estimation
7.1.1 Maximum likelihood estimation
The log-likelihood function, denoted $l$, is defined as

$$l(x)=\log p_x(y)$$

A widely used method, called maximum likelihood (ML) estimation, is to estimate $x$ as

$$\hat{x}_{\rm ml}=\arg\max_x p_x(y)=\arg\max_x l(x)$$
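As a quick illustration (an assumed special case, not spelled out in this note): for a linear measurement model $y = Ax + w$ with IID Gaussian noise, $l(x)$ is a negative scaled least-squares residual plus a constant, so the ML estimate is the least-squares solution. A minimal numpy sketch:

```python
# Minimal sketch: ML estimation for an assumed linear Gaussian model
# y = A x + w, w ~ N(0, sigma^2 I).  Here l(x) = -||A x - y||^2 / (2 sigma^2)
# + const, so maximizing l(x) is ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 50                                   # parameter dim, number of measurements
x_true = rng.standard_normal(n)                # unknown parameter (simulation only)
A = rng.standard_normal((m, n))                # known measurement matrix
y = A @ x_true + 0.1 * rng.standard_normal(m)  # noisy observations

x_ml, *_ = np.linalg.lstsq(A, y, rcond=None)   # maximizes l(x)
print("ML estimate:", x_ml)
```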
7.1.2 Maximum a posteriori probability estimation
Maximum a posteriori probability (MAP) estimation can be considered a Bayesian version of maximum likelihood estimation, with a prior probability density on the underlying parameter
$x$. We assume that $x$ (the vector to be estimated) and $y$ (the observation) are random variables with a joint probability density $p(x,y)$.
The prior density of $x$ is given by

$$p_x(x)=\int p(x,y)\,dy$$

Similarly,

$$p_y(y)=\int p(x,y)\,dx$$

The conditional density of $y$, given $x$, is given by

$$p_{y\mid x}(x,y)=\frac{p(x,y)}{p_x(x)}$$

In the MAP estimation method, our estimate of $x$, given the observation $y$, is given by

$$\begin{aligned} \hat{x}_{\rm map} &= \arg\max_x p_{x\mid y}(x,y)\\ &= \arg\max_x p_{y\mid x}(x,y)\,p_x(x)\\ &= \arg\max_x p(x,y) \end{aligned}$$
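Continuing the assumed linear Gaussian example, adding a Gaussian prior $x \sim \mathcal{N}(0,\tau^2 I)$ turns MAP estimation into ridge-regularized least squares; a minimal sketch:

```python
# Minimal sketch, assuming the same linear Gaussian model as above plus a
# Gaussian prior x ~ N(0, tau^2 I).  Maximizing p(x, y) then amounts to
# minimizing ||A x - y||^2 / sigma^2 + ||x||^2 / tau^2.
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma, tau = 3, 50, 0.1, 1.0
x_true = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = A @ x_true + sigma * rng.standard_normal(m)

reg = (sigma / tau) ** 2                       # ratio of noise to prior variance
x_map = np.linalg.solve(A.T @ A + reg * np.eye(n), A.T @ y)
print("MAP estimate:", x_map)
```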
7.2 Nonparametric distribution estimation
7.3 Optimal detector design and hypothesis testing
Suppose $X$ is a random variable with values in $\{1, \dots, n\}$, with a distribution that depends on a parameter $\theta \in \{1, \dots, m\}$. The distributions of $X$, for the $m$ possible values of $\theta$, can be represented by a matrix $P \in \mathbf{R}^{n\times m}$, with elements

$$p_{kj}=\mathbf{prob}(X=k\mid \theta=j)$$

The $j$th column of $P$ gives the probability distribution associated with the parameter value $\theta = j$. The $m$ values of $\theta$ are called hypotheses, and guessing which hypothesis is correct is called hypothesis testing.
7.3.1 Deterministic and randomized detectors
A (deterministic) estimator or detector is a function $\psi$ from $\{1, \dots, n\}$ (the set of possible observed values) into $\{1, \dots, m\}$ (the set of hypotheses).

A randomized detector of $\theta$ is a random variable $\hat{\theta} \in \{1, \dots, m\}$. A randomized detector can be defined in terms of a matrix $T\in\mathbf{R}^{m\times n}$ with

$$t_{ik} = \mathbf{prob}(\hat{\theta}=i\mid X=k)$$
7.3.2 Detection probability matrix
For the randomized detector defined by the matrix $T$, we define the detection probability matrix as $D = TP$. We have

$$D_{ij}=(TP)_{ij}=\mathbf{prob}(\hat{\theta}=i\mid \theta=j)$$

so $D_{ij}$ is the probability of guessing $\hat{\theta} = i$ when in fact $\theta = j$.
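A tiny numerical illustration (with made-up numbers) of $D = TP$:

```python
# Two hypotheses (m = 2), three possible observed values (n = 3).
# Columns of P are the distributions of X under each hypothesis;
# columns of T give the (randomized) guess distribution for each observation.
import numpy as np

P = np.array([[0.70, 0.10],     # prob(X = k | theta = j), shape (n, m)
              [0.20, 0.10],
              [0.10, 0.80]])
T = np.array([[1.0, 0.5, 0.0],  # prob(theta_hat = i | X = k), shape (m, n)
              [0.0, 0.5, 1.0]])

D = T @ P                       # detection probability matrix
print(D)                        # D[i, j] = prob(theta_hat = i+1 | theta = j+1)
print("correct-detection probabilities:", np.diag(D))
```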
7.3.3 Optimal detector design
7.3.4 Multicriterion formulation and scalarization
The optimal detector design problem can be considered a multicriterion problem, with the $m(m-1)$ objectives given by the off-diagonal entries of $D$, which are the probabilities of the different types of detection error:

$$\begin{aligned} {\rm minimize \ (w.r.t. \ } \mathbf{R}^{m(m-1)}_+) \ \ \ \ & D_{ij}, \ \ i,j=1,\dots,m, \ \ i\ne j \\ {\rm subject \ to \ } \ \ \ \ & t_k\succeq0, \ \ \mathbf{1}^Tt_k=1, \ \ k=1,\dots,n \end{aligned}$$

Here the variables are the columns $t_1, \dots, t_n$ of $T$.
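One common scalarization is to minimize a nonnegative weighted sum of the error probabilities $D_{ij}$, $i\ne j$, which is a linear program in $T$. A sketch assuming cvxpy is available; the weight matrix $W$ and the matrix $P$ below are made-up examples, not from the book:

```python
# Scalarized detector design: minimize sum_{i != j} W[i, j] * D[i, j]
# over column-stochastic T.  This is an LP.
import cvxpy as cp
import numpy as np

P = np.array([[0.70, 0.10],
              [0.20, 0.10],
              [0.10, 0.80]])            # (n, m): distributions of X per hypothesis
n, m = P.shape
W = np.ones((m, m)) - np.eye(m)         # equal weight on every error type

T = cp.Variable((m, n), nonneg=True)    # t_ik = prob(theta_hat = i | X = k)
D = T @ P
objective = cp.Minimize(cp.sum(cp.multiply(W, D)))
constraints = [cp.sum(T, axis=0) == 1]  # each column of T is a distribution
cp.Problem(objective, constraints).solve()
print("optimal T:\n", T.value)
print("detection probability matrix:\n", (T.value @ P).round(3))
```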
7.3.5 Binary hypothesis testing
As an illustration, we consider the special case $m = 2$, which is called binary hypothesis testing.
7.3.6 Robust detectors
Here the matrix $P$ is only known to lie in a set $\mathcal{P}$ of possible values. We define the worst-case detection probability matrix $D^{\rm wc}$ as

$$D^{\rm wc}_{ij}=\sup_{P\in\mathcal{P}}D_{ij}, \ \ i,j=1,\dots,m, \ \ i\neq j$$

and

$$D^{\rm wc}_{ii}=\inf_{P \in \mathcal{P}}D_{ii}, \ \ i=1,\dots,m$$
7.4 Chebyshev and Chernoff bounds
7.4.1 Chebyshev bounds
Chebyshev bounds give an upper bound on the probability of a set based on known expected values of certain functions. If $X$ is a random variable on $\mathbf{R}$ with $\mathbb{E}X=\mu$ and $\mathbb{E}(X-\mu)^2 =\sigma^2$, then we have $\mathbf{prob}(|X-\mu|\geq 1)\leq\sigma^2$, no matter what the distribution of $X$ is.
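A quick numerical sanity check of this bound (the exponential distribution below is an arbitrary choice for illustration):

```python
# For any distribution with mean mu and variance sigma^2,
# prob(|X - mu| >= 1) should not exceed sigma^2.
import numpy as np

rng = np.random.default_rng(2)
samples = rng.exponential(scale=0.5, size=1_000_000)  # mean 0.5, variance 0.25
mu, sigma2 = samples.mean(), samples.var()

empirical = np.mean(np.abs(samples - mu) >= 1.0)
print(f"empirical prob = {empirical:.4f}  <=  sigma^2 = {sigma2:.4f}")
```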
7.4.2 Chernoff bounds
Let $X$ be a random variable on $\mathbf{R}$. The Chernoff bound states that

$$\mathbf{prob}(X\geq u)\leq \inf_{\lambda\geq 0}\mathbb{E}\,e^{\lambda(X-u)}$$

which can be expressed as

$$\log\mathbf{prob}(X\geq u)\leq \inf_{\lambda\geq 0} \left\{-\lambda u +\log \mathbb{E}\, e^{\lambda X} \right\}$$
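As a concrete (assumed) example, for a standard normal $X$ we have $\log \mathbb{E}\,e^{\lambda X} = \lambda^2/2$, so the infimum over $\lambda\geq 0$ is attained at $\lambda = u$ (for $u \geq 0$), giving $\mathbf{prob}(X\geq u) \leq e^{-u^2/2}$:

```python
# Compare the optimized Chernoff bound exp(-u^2 / 2) for X ~ N(0, 1)
# with the exact Gaussian tail probability.
import numpy as np
from scipy.stats import norm

for u in (1.0, 2.0, 3.0):
    chernoff = np.exp(-u**2 / 2)   # optimized Chernoff bound
    exact = norm.sf(u)             # exact tail probability prob(X >= u)
    print(f"u = {u}:  Chernoff bound = {chernoff:.4f},  exact = {exact:.4f}")
```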
7.4.3 Example
7.5 Experiment design
We consider the problem of estimating a vector $x \in \mathbf{R}^n$ from measurements or experiments

$$y_i=a_i^Tx+w_i, \ \ i=1,\dots,m$$

where $w_i$ is measurement noise. For IID noise $w_i \sim \mathcal{N}(0,1)$, the maximum likelihood estimate $\hat{x}$ is the least-squares solution, and the associated estimation error $e = \hat{x} - x$ has zero mean and covariance matrix

$$E=\mathbb{E}\,ee^T=\Big(\sum_{i=1}^{m}a_ia_i^T\Big)^{-1}$$

We suppose that the vectors $a_1, \dots, a_m$, which characterize the measurements, can be chosen among $p$ possible test vectors $v_1, \dots, v_p \in \mathbf{R}^n$. The goal of experiment design is to choose the vectors $a_i$, from among the possible choices, so that the error covariance $E$ is small (in some sense).
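A quick Monte Carlo check of the error covariance formula, with an illustrative random design (the sizes and data below are arbitrary assumptions):

```python
# For y_i = a_i^T x + w_i with w_i ~ N(0, 1), the empirical covariance of
# the least-squares error e = x_hat - x should match (sum_i a_i a_i^T)^{-1}.
import numpy as np

rng = np.random.default_rng(3)
n, m, trials = 3, 20, 20_000
A = rng.standard_normal((m, n))            # rows are a_i^T (a fixed design)
x = rng.standard_normal(n)
E_pred = np.linalg.inv(A.T @ A)            # predicted error covariance

errs = []
for _ in range(trials):
    y = A @ x + rng.standard_normal(m)
    x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    errs.append(x_hat - x)
E_emp = np.cov(np.array(errs).T)           # empirical error covariance
print(np.round(E_pred, 3))
print(np.round(E_emp, 3))
```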
7.5.1 The relaxed experiment design problem
Let $m_j$ denote the number of measurements for which $a_i = v_j$, so that $m_1 + \cdots + m_p = m$. In the case when $m$ is large compared to $n$, a good approximate solution can be found by ignoring, or relaxing, the constraint that the $m_j$ are integers. Writing $\lambda_j = m_j/m$ for the fraction of experiments of type $j$, the relaxed experiment design problem is

$$\begin{aligned} {\rm minimize \ (w.r.t. \ } \mathbf{S}^{n}_+) \ \ \ \ & E=\frac{1}{m}\Big(\sum_{i=1}^p\lambda_iv_iv_i^T\Big)^{-1} \\ {\rm subject \ to \ } \ \ \ \ & \lambda\succeq0, \ \ \mathbf{1}^T\lambda=1 \end{aligned}$$
7.5.2 Scalarizations
$D$-optimal design

The most widely used scalarization is called $D$-optimal design, in which we minimize the determinant of the error covariance matrix $E$:

$$\begin{aligned} {\rm minimize} \ \ \ \ & \log \det \Big(\sum_{i=1}^p\lambda_iv_iv_i^T\Big)^{-1} \\ {\rm subject \ to \ } \ \ \ \ & \lambda\succeq0, \ \ \mathbf{1}^T\lambda=1 \end{aligned}$$
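A minimal sketch of this problem, assuming cvxpy is installed and using randomly generated test vectors $v_1,\dots,v_p$ (not from the book); minimizing $\log\det$ of the inverse is written equivalently as maximizing $\log\det$ of the information matrix:

```python
# Relaxed D-optimal design: maximize log det(sum_i lambda_i v_i v_i^T)
# over the probability simplex.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, p = 3, 10
V = rng.standard_normal((p, n))            # rows are the test vectors v_i

lam = cp.Variable(p, nonneg=True)
info = sum(lam[i] * np.outer(V[i], V[i]) for i in range(p))
problem = cp.Problem(cp.Maximize(cp.log_det(info)),
                     [cp.sum(lam) == 1])
problem.solve()
print("optimal experiment fractions:", np.round(lam.value, 3))
```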
$E$-optimal design

In $E$-optimal design, we minimize the norm of the error covariance matrix, i.e., the maximum eigenvalue of $E$. The $E$-optimal experiment design problem can be cast as an SDP

$$\begin{aligned} {\rm maximize} \ \ \ \ & t \\ {\rm subject \ to \ } \ \ \ \ & \sum_{i=1}^p\lambda_iv_iv_i^T \succeq tI \\ &\lambda\succeq0, \ \ \mathbf{1}^T\lambda=1 \end{aligned}$$
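A sketch of this SDP under the same assumed random test vectors as in the $D$-optimal sketch above (cvxpy again assumed available):

```python
# E-optimal design: maximize t such that the information matrix dominates
# t * I, i.e., maximize its smallest eigenvalue.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, p = 3, 10
V = rng.standard_normal((p, n))

lam = cp.Variable(p, nonneg=True)
t = cp.Variable()
info = sum(lam[i] * np.outer(V[i], V[i]) for i in range(p))
constraints = [info >> t * np.eye(n), cp.sum(lam) == 1]
cp.Problem(cp.Maximize(t), constraints).solve()
print("optimal fractions:", np.round(lam.value, 3), " t =", round(t.value, 4))
```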
$A$-optimal design

In $A$-optimal experiment design, we minimize $\mathbf{tr}\,E$, the trace of the covariance matrix:

$$\begin{aligned} {\rm minimize} \ \ \ \ & \mathbf{tr} \Big(\sum_{i=1}^p\lambda_iv_iv_i^T\Big)^{-1} \\ {\rm subject \ to \ } \ \ \ \ & \lambda\succeq0, \ \ \mathbf{1}^T\lambda=1 \end{aligned}$$
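A sketch of the $A$-optimal problem under the same assumed setup: the trace of the inverse information matrix is written as a sum of matrix fractional terms $e_k^T\big(\sum_i\lambda_iv_iv_i^T\big)^{-1}e_k$, which cvxpy's `matrix_frac` atom handles directly.

```python
# A-optimal design: minimize tr((sum_i lambda_i v_i v_i^T)^{-1}) over the
# probability simplex, written via matrix fractional terms.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, p = 3, 10
V = rng.standard_normal((p, n))

lam = cp.Variable(p, nonneg=True)
info = sum(lam[i] * np.outer(V[i], V[i]) for i in range(p))
trace_inv = sum(cp.matrix_frac(np.eye(n)[:, k], info) for k in range(n))
cp.Problem(cp.Minimize(trace_inv), [cp.sum(lam) == 1]).solve()
print("optimal fractions:", np.round(lam.value, 3))
```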