Classifiers Based on Bayes Decision Theory
Reference:
Sections 2.1-2.6 of Pattern Recognition, by Sergios Theodoridis and Konstantinos Koutroumbas (2009)
Slides of CS4220, TUD
Bayes Decision Theory
We will initially focus on the two-class case.
Let $\omega_1, \omega_2$ be the two classes to which our patterns may belong.
- Prior probabilities: $P(\omega_1), P(\omega_2)$

  Assumed to be known. If not, they can easily be estimated as $P(\omega_1)\approx N_1/N$, $P(\omega_2)\approx N_2/N$, where $N_i$ is the number of training samples in class $\omega_i$ and $N$ is the total number of training samples.
- Likelihood functions: $p(\mathbf x|\omega_1), p(\mathbf x|\omega_2)$

  Assumed to be known. If not, they can also be estimated from the available training data.
- Posterior probabilities: $P(\omega_1|\mathbf x), P(\omega_2|\mathbf x)$

  Can be computed via the Bayes rule:

$$P(\omega_i|\mathbf x)=\frac{p(\mathbf x| \omega_i)P(\omega_i)}{p(\mathbf x)}\tag{BD.1}$$

where $p(\mathbf x)$ is the pdf of $\mathbf x$, for which we have

$$p(\mathbf x)=\sum_{i=1}^{2} p(\mathbf x|\omega_i)P(\omega_i)\tag{BD.2}$$
The Bayes classification rule can now be stated as
$$\begin{aligned} &\text{If } P(\omega_{1}|\mathbf x)>P(\omega_{2}|\mathbf x), \quad \mathbf x \text{ is classified to } \omega_{1}\\ &\text{If } P(\omega_{1}|\mathbf x)<P(\omega_{2}|\mathbf x), \quad \mathbf x \text{ is classified to } \omega_{2} \end{aligned}\tag{BD.3}$$
Using (BD.1), and noting that $p(\mathbf x)$ is the same for both classes and therefore cancels, the decision can equivalently be based on the inequalities
$$p(\mathbf x| \omega_1)P(\omega_1)\gtrless p(\mathbf x| \omega_2)P(\omega_2)\tag{BD.4}$$
The rule can equivalently be expressed in any of the following forms:
$$p(\mathbf x| \omega_1)P(\omega_1)-p(\mathbf x| \omega_2)P(\omega_2)\gtrless 0$$

$$\frac{p(\mathbf x| \omega_1)P(\omega_1)}{p(\mathbf x| \omega_2)P(\omega_2)}\gtrless 1$$

$$\log(p(\mathbf x| \omega_1)P(\omega_1))-\log(p(\mathbf x| \omega_2)P(\omega_2))\gtrless 0$$
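As a concrete illustration, here is a minimal Python sketch of the rule (BD.4). The specific priors and Gaussian likelihoods are made-up assumptions for the example, not part of the notes:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem: the class-conditional densities are
# assumed to be known 1-D Gaussians (illustration only).
priors = {1: 0.6, 2: 0.4}                    # P(w1), P(w2)
likelihoods = {1: norm(loc=0.0, scale=1.0),  # p(x|w1)
               2: norm(loc=2.0, scale=1.0)}  # p(x|w2)

def bayes_classify(x):
    """Return 1 or 2 according to p(x|w1)P(w1) >< p(x|w2)P(w2), i.e. (BD.4)."""
    score1 = likelihoods[1].pdf(x) * priors[1]
    score2 = likelihoods[2].pdf(x) * priors[2]
    return 1 if score1 > score2 else 2

print([bayes_classify(x) for x in np.linspace(-1.0, 3.0, 5)])
```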
Minimizing the classification error probability
We will show that the Bayesian classifier is optimal with respect to minimizing the classification error probability.
Define the classification error as
$$P(\text{error})=\sum_{i=1}^C P(\text{error}\mid\omega_i)P(\omega_i)\tag{BD.5}$$
For a given decision boundary (the green line in the figure), it can happen that $\mathbf x\in \omega_2$ is assigned to $\omega_1$, and that $\mathbf x\in \omega_1$ is assigned to $\omega_2$. Let $R_i$ be the region of the feature space in which we decide in favor of $\omega_i$. The classification error probability can then be written as
$$\begin{aligned} P_{e}&=P(\omega_1)\int_{R_{2}} p\left(\mathbf x|\omega_1\right) d \mathbf x+P(\omega_2)\int_{R_{1}} p\left(\mathbf x|\omega_2\right) d \mathbf x \\&=\int_{R_{2}} P\left(\omega_{1} |\mathbf x\right) p(\mathbf x)\, d \mathbf x+\int_{R_{1}} P\left(\omega_{2} | \mathbf x\right) p(\mathbf x)\, d \mathbf x \end{aligned} \tag{BD.6}$$
which corresponds to the yellow-shaded area in the figure above. It is then intuitive that the optimal decision boundary should be chosen as
$$\mathbf x_0:\ P\left(\omega_{1} | \mathbf x_0\right)=P\left(\omega_{2} | \mathbf x_0\right)\tag{BD.7}$$
so that
$$\begin{array}{l} R_{1}: P\left(\omega_{1} | \mathbf x\right)>P\left(\omega_{2} | \mathbf x\right) \\ R_{2}: P\left(\omega_{2} |\mathbf x\right)>P\left(\omega_{1} | \mathbf x\right) \end{array}\tag{BD.8}$$
By combining (BD.6) and (BD.7), we obtain the Bayes error $\varepsilon^*$, which is the minimum classification error achievable. It does not depend on the classification rule we apply, but only on the distribution of the data. In practice we cannot obtain $\varepsilon^*$, since we do not know the true distributions and the high-dimensional integrals are difficult to evaluate.
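For a synthetic one-dimensional problem with known densities, however, (BD.6) can be approximated numerically. The sketch below is an illustration only; the two Gaussian likelihoods and equal priors are assumptions:

```python
import numpy as np
from scipy.stats import norm

# Assumed 1-D example: two Gaussian classes with equal priors.
p1, p2 = 0.5, 0.5
pdf1 = norm(loc=0.0, scale=1.0).pdf   # p(x|w1)
pdf2 = norm(loc=2.0, scale=1.0).pdf   # p(x|w2)

x = np.linspace(-8.0, 10.0, 20001)
dx = x[1] - x[0]

# Bayes decision regions: R1 where p(x|w1)P(w1) > p(x|w2)P(w2), R2 otherwise.
in_R1 = pdf1(x) * p1 > pdf2(x) * p2

# (BD.6): integrate each class density over the region where it is misclassified.
bayes_error = p1 * np.sum(pdf1(x)[~in_R1]) * dx + p2 * np.sum(pdf2(x)[in_R1]) * dx
print(f"approximate Bayes error: {bayes_error:.4f}")
```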
Minimizing the average risk
Sometimes misclassifying a sample of class A as class B is much more costly than misclassifying class B as class A. It is then more appropriate to assign a penalty term to weigh each type of error:
$$r=\lambda_{12}P(\omega_1)\int_{R_{2}} p\left(\mathbf x|\omega_1\right) d \mathbf x+\lambda_{21}P(\omega_2)\int_{R_{1}} p\left(\mathbf x|\omega_2\right) d \mathbf x\tag{BD.9}$$
which is very similar to (BD.6).
For an $M$-class problem, the risk associated with class $\omega_k$ is defined as
$$r_k=\sum_{i=1}^M \lambda_{ki}\int_{R_i}p(\mathbf x|\omega_k)\,d \mathbf x \tag{BD.10}$$
where $\lambda_{ki}$ is the penalty term incurred when a feature vector $\mathbf x$ that belongs to class $\omega_k$ lies in $R_i$, $i\ne k$. The matrix $L$, which has at its $(k,i)$ location the corresponding penalty term $\lambda_{ki}$, is known as the loss matrix. The diagonal entries $\lambda_{kk}$, which correspond to correct decisions, are usually set to zero.
The average risk is defined as
$$r=\sum_{k=1}^M r_k P(\omega_k)=\sum_{i=1}^M\int_{R_i}\left( \sum_{k=1}^M \lambda_{ki}\,p(\mathbf x|\omega_k)P(\omega_k)\right)d \mathbf x \tag{BD.11}$$
which is the generalization of (BD.9).
Our goal is to choose the partitioning regions $R_i$ so that the average risk is minimized. This is achieved if each of the integrals is minimized, which is equivalent to selecting the partitioning regions so that
$$\mathbf{x} \in R_{i} \quad \text{if} \quad l_{i} \equiv \sum_{k=1}^{M} \lambda_{k i} p\left(\mathbf{x} | \omega_{k}\right) P\left(\omega_{k}\right)<l_{j} \equiv \sum_{k=1}^{M} \lambda_{k j} p\left(\mathbf{x} | \omega_{k}\right) P\left(\omega_{k}\right) \quad \forall j \neq i \tag{BD.12}$$
which is the generalization of (BD.7)-(BD.8).
It is obvious that if $\lambda_{ki}=1-\delta_{ki}$, where $\delta_{ki}$ is Kronecker's delta ($0$ if $k\ne i$ and $1$ if $k=i$), then minimizing the average risk becomes equivalent to minimizing the classification error probability.
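A minimal sketch of the minimum-risk rule (BD.12), assuming the class-conditional densities are available as callables; the densities, priors, and loss matrix below mirror the example in the next subsection:

```python
import numpy as np
from scipy.stats import norm

# Assumed setup: M = 2 classes with known 1-D Gaussian likelihoods.
priors = np.array([0.5, 0.5])
pdfs = [norm(0.0, np.sqrt(0.5)).pdf, norm(1.0, np.sqrt(0.5)).pdf]
L = np.array([[0.0, 0.5],      # loss matrix: L[k, i] = lambda_ki
              [1.0, 0.0]])

def min_risk_classify(x):
    """Pick the class i minimizing l_i = sum_k lambda_ki p(x|w_k) P(w_k), as in (BD.12)."""
    p = np.array([pdf(x) for pdf in pdfs]) * priors   # p(x|w_k) P(w_k), k = 1..M
    l = L.T @ p                                       # l_i = sum_k lambda_ki * p_k
    return int(np.argmin(l)) + 1                      # classes numbered from 1

print(min_risk_classify(0.3), min_risk_classify(0.05))
```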
Example
In a two-class problem with a single feature $x$, the pdfs are Gaussians with variance $\sigma^{2}=1/2$ for both classes and mean values $0$ and $1$, respectively, that is,
$$\begin{array}{l} p\left(x | \omega_{1}\right)=\frac{1}{\sqrt{\pi}} \exp \left(-x^{2}\right) \\ p\left(x | \omega_{2}\right)=\frac{1}{\sqrt{\pi}} \exp \left(-(x-1)^{2}\right) \end{array}$$
If $P(\omega_{1})=P(\omega_{2})=1/2$, compute the threshold value $x_{0}$ (a) for minimum error probability and (b) for minimum risk if the loss matrix is

$$L=\left[\begin{array}{cc} 0 & 0.5 \\ 1.0 & 0 \end{array}\right]$$
Taking into account the shape of the Gaussian function graph (Appendix A), the threshold for the minimum error probability case will be
$$x_{0}: \exp \left(-x^{2}\right)=\exp \left(-(x-1)^{2}\right)$$
Taking the logarithm of both sides, we end up with
$x_{0}=1/2$. In the minimum risk case we get
$$x_{0}: \exp \left(-x^{2}\right)=2 \exp \left(-(x-1)^{2}\right)$$
or
$x_{0}=(1-\ln 2)/2<1/2$; that is, the threshold moves to the left of $1/2$. If the two classes are not equiprobable, it is easily verified that if $P(\omega_{1})>(<)\,P(\omega_{2})$ the threshold moves to the right (left). That is, we expand the region in which we decide in favor of the more probable class, since it is better to make fewer errors for the more probable class.
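Both thresholds of this example can be checked numerically; the following sketch is illustrative only:

```python
import numpy as np

# Minimum error probability threshold: exp(-x^2) = exp(-(x-1)^2)   =>  x0 = 1/2
# Minimum risk threshold:              exp(-x^2) = 2 exp(-(x-1)^2) =>  x0 = (1 - ln 2)/2
x0_error = 0.5
x0_risk = (1 - np.log(2)) / 2
print(x0_error, x0_risk)          # 0.5 and roughly 0.153

# Sanity check: both sides of each threshold equation agree at the computed values.
assert np.isclose(np.exp(-x0_error**2), np.exp(-(x0_error - 1)**2))
assert np.isclose(np.exp(-x0_risk**2), 2 * np.exp(-(x0_risk - 1)**2))
```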
Discriminant functions and decision surfaces
If regions $R_i, R_j$ happen to be contiguous, they are separated by a decision surface in the multidimensional feature space. For the minimum error probability case, the decision surface between $\omega_i$ and $\omega_j$ is described by the equation
$$P(\omega_i|\mathbf x)-P(\omega_j|\mathbf x)=0\tag{BD.13}$$
On one side of the surface this difference is positive, and on the other it is negative.
Sometimes, instead of working directly with probabilities (or risk functions), it may be more convenient, from a mathematical point of view, to work with an equivalent function of them, for example, $g_i(\mathbf x)=f(P(\omega_i|\mathbf x))$, where $f(\cdot)$ is a monotonically increasing function.
$g_i(\mathbf x)$ is known as a discriminant function. The decision test is now stated as
$$\text{classify }\mathbf x\text{ in }\omega_i\quad \text{if}\quad g_i(\mathbf x)>g_j(\mathbf x) \quad \forall j\ne i\tag{BD.14}$$
The decision surfaces, separating contiguous regions, are described by
$$g_{ij}(\mathbf x)=g_i(\mathbf x)-g_j(\mathbf x)=0,\quad i,j=1,2,\cdots,M,\quad i\ne j\tag{BD.15}$$
Bayesian Classification for Normal Distributions
The normal distribution can adequately model a large number of cases due to the central limit theorem: if a random variable is the sum of a number of independent random variables, its pdf approaches the Gaussian as the number of summands tends to infinity.
The Gaussian Probability Density Function
- One-dimensional Gaussian: $\mathcal{N}(\mu,\sigma^2)$

$$p(x)=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right)\tag{ND.1}$$
where

$$\begin{aligned} \mu&=E[x] \equiv \int_{-\infty}^{+\infty} x\, p(x)\, d x\\ \sigma^{2}&=E\left[(x-\mu)^{2}\right]\equiv\int_{-\infty}^{+\infty}(x-\mu)^{2} p(x)\, d x \end{aligned}\tag{ND.2}$$

- $l$-dimensional Gaussian: $\mathcal{N}(\boldsymbol{\mu},\Sigma)$

$$p(\mathbf x)=\frac{1}{(2 \pi)^{l / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}(\mathbf x-\boldsymbol{\mu})^{T} \Sigma^{-1}(\mathbf x-\boldsymbol{\mu})\right)\tag{ND.3}$$
where

$$\begin{aligned} \boldsymbol{\mu}&=E[\mathbf x]\\ \Sigma&=E\left[(\mathbf x-\boldsymbol{\mu})(\mathbf x-\boldsymbol{\mu})^T\right] \end{aligned}\tag{ND.4}$$
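As a quick sketch, the $l$-dimensional Gaussian density (ND.3) can be evaluated directly with NumPy; the parameter values below are assumed for illustration:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the l-dimensional Gaussian density (ND.3) at x."""
    l = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (l / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# Example with assumed parameters:
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```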
The Bayesian Classifier for Normally Distributed Classes
Our goal is to study the optimal Bayesian classifier when the involved pdfs are normal, $p(\mathbf x|\omega_i)\sim \mathcal{N}(\boldsymbol \mu_i,\Sigma_i),\ i=1,2,\cdots,M$. Because of the exponential form of the involved densities, it is preferable to work with the following discriminant functions, which involve the logarithmic function $\ln (\cdot)$:
$$g_i(\mathbf x)=\ln (p(\mathbf x|\omega_i)P(\omega_i))=\ln p(\mathbf x|\omega_i)+\ln P(\omega_i)\tag{ND.5}$$
or
$$g_i(\mathbf x)=-\frac{1}{2}(\mathbf x-\boldsymbol \mu_i)^T\Sigma_i^{-1}(\mathbf x-\boldsymbol \mu_i)+\ln P(\omega_i)+c_i \tag{ND.6}$$
where $c_i$ is a constant equal to $-(l/2)\ln 2\pi-(1/2)\ln |\Sigma_i|$. Expanding, we obtain
$$g_{i}(\mathbf x)=-\frac{1}{2} \mathbf x^{T} \Sigma_{i}^{-1} \mathbf x+\frac{1}{2}\mathbf x^{T} \Sigma_{i}^{-1} \boldsymbol\mu_{i}-\frac{1}{2}\boldsymbol \mu_{i}^{T} \Sigma_{i}^{-1} \boldsymbol\mu_{i}+\frac{1}{2} \boldsymbol\mu_{i}^{T} \Sigma_{i}^{-1}\mathbf x+\ln P\left(\omega_{i}\right)+c_{i} \tag{ND.7}$$
Now define the quadratic classifier
$$f(\mathbf x)=g_1(\mathbf x)-g_2(\mathbf x)=\mathbf x^T \mathbf W \mathbf x+\mathbf w^T \mathbf x+w_0\tag{ND.8}$$
where
$$\begin{aligned} \mathbf{W}&= \frac{1}{2}\left(\Sigma_{2}^{-1}-\Sigma_{1}^{-1}\right) \\ \mathbf{w}&= \Sigma_{1}^{-1}\boldsymbol\mu_{1}-\Sigma_{2}^{-1}\boldsymbol\mu_{2} \\ w_{0}&=-\frac{1}{2} \ln |\Sigma_{1}|-\frac{1}{2} \boldsymbol\mu_{1}^{T} \Sigma_{1}^{-1} \boldsymbol\mu_{1}+\ln P\left(\omega_{1}\right) +\frac{1}{2} \ln |\Sigma_{2}|+\frac{1}{2}\boldsymbol \mu_{2}^{T} \Sigma_{2}^{-1}\boldsymbol \mu_{2}-\ln P\left(\omega_{2}\right) \end{aligned}\tag{ND.9}$$
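A minimal sketch of the quadratic classifier (ND.8)-(ND.9), assuming the class parameters $\boldsymbol\mu_i$, $\Sigma_i$ and the priors are given (the numbers below are made up):

```python
import numpy as np

def quadratic_classifier_params(mu1, Sigma1, P1, mu2, Sigma2, P2):
    """Compute W, w, w0 of f(x) = x^T W x + w^T x + w0 as in (ND.8)-(ND.9)."""
    S1inv, S2inv = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    W = 0.5 * (S2inv - S1inv)
    w = S1inv @ mu1 - S2inv @ mu2
    w0 = (-0.5 * np.log(np.linalg.det(Sigma1)) - 0.5 * mu1 @ S1inv @ mu1 + np.log(P1)
          + 0.5 * np.log(np.linalg.det(Sigma2)) + 0.5 * mu2 @ S2inv @ mu2 - np.log(P2))
    return W, w, w0

# Assumed example parameters:
mu1, Sigma1, P1 = np.array([0.0, 0.0]), np.eye(2), 0.5
mu2, Sigma2, P2 = np.array([2.0, 2.0]), 2 * np.eye(2), 0.5
W, w, w0 = quadratic_classifier_params(mu1, Sigma1, P1, mu2, Sigma2, P2)

x = np.array([1.0, 1.0])
f = x @ W @ x + w @ x + w0
print("assign to w1" if f > 0 else "assign to w2")
```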
In case the parameters $\boldsymbol \mu$ and $\Sigma$ are unknown, we can estimate them from the training data of each class as
$$\hat {\boldsymbol \mu}_k =\frac{1}{N_k}\sum_{i=1}^{N_k} \mathbf x_{ki},\quad \hat \Sigma_k=\frac{1}{N_k}\sum_{i=1}^{N_k}(\mathbf x_{ki}-\hat{\boldsymbol \mu}_k)(\mathbf x_{ki}-\hat{\boldsymbol \mu}_k)^T\tag{ND.10}$$

where $\mathbf x_{ki},\ i=1,\cdots,N_k$ are the training vectors of class $\omega_k$.
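A short sketch of the estimates in (ND.10) for one class, assuming the class's training vectors are stacked as rows of a matrix (the data are randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X_k = rng.normal(loc=[1.0, -1.0], scale=1.0, size=(50, 2))  # assumed samples of class w_k

N_k = X_k.shape[0]
mu_hat = X_k.mean(axis=0)                                   # (ND.10), sample mean
diff = X_k - mu_hat
Sigma_hat = diff.T @ diff / N_k                             # (ND.10), 1/N_k normalization
print(mu_hat, Sigma_hat, sep="\n")
```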
However, when there is insufficient data, the covariance matrix may be singular and cannot be inverted. An alternative is to average over all class covariance matrices
$$\hat \Sigma=\frac{1}{C}\sum_{k=1}^C \hat \Sigma_k \tag{ND.11}$$

where $C$ is the number of classes.
Since all class covariance matrices are now the same, $\mathbf W=\mathbf 0$, and thus

$$f(\mathbf x)=\mathbf w^T \mathbf x+w_0\tag{ND.12}$$
which is a linear classifier, with

$$\begin{aligned} \mathbf{w}&= \hat \Sigma ^{-1}(\hat{\boldsymbol\mu}_{1} -\hat{\boldsymbol\mu}_{2} )\\ w_{0}&=\frac{1}{2}\hat{\boldsymbol\mu}_{2}^{T} \hat\Sigma^{-1}\hat{\boldsymbol\mu}_{2}-\frac{1}{2}\hat{\boldsymbol\mu}_{1}^{T} \hat\Sigma^{-1} \hat{\boldsymbol\mu}_{1}+\ln \frac{P\left(\omega_{1}\right)}{P\left(\omega_{2}\right)} \end{aligned}\tag{ND.13}$$
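A sketch of the linear classifier (ND.12)-(ND.13) built from estimated means and the pooled covariance of (ND.11); the estimated quantities and priors below are assumptions:

```python
import numpy as np

def linear_classifier(mu1, mu2, Sigma_pooled, P1, P2):
    """Return (w, w0) of f(x) = w^T x + w0 as in (ND.13)."""
    Sinv = np.linalg.inv(Sigma_pooled)
    w = Sinv @ (mu1 - mu2)
    w0 = 0.5 * mu2 @ Sinv @ mu2 - 0.5 * mu1 @ Sinv @ mu1 + np.log(P1 / P2)
    return w, w0

# Assumed estimated quantities:
mu1_hat, mu2_hat = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma_hat = np.array([[1.0, 0.2], [0.2, 1.5]])   # average of the class covariances (ND.11)
w, w0 = linear_classifier(mu1_hat, mu2_hat, Sigma_hat, P1=0.5, P2=0.5)

x = np.array([1.0, 0.5])
print("w1" if w @ x + w0 > 0 else "w2")
```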
In some cases even a full averaged covariance matrix is too much to estimate. As a simplification, one could assume that all features have the same variance and are uncorrelated:
$$\hat \Sigma=\sigma^2 \mathbf I\tag{ND.14}$$
The linear classifier then further simplifies to the nearest mean classifier:
$$f(\mathbf x)=\mathbf w^T \mathbf x+w_0$$
where
$$\begin{aligned} \mathbf{w}&= \hat{\boldsymbol\mu}_{1} -\hat{\boldsymbol\mu}_{2} \\ w_{0}&=\frac{1}{2}\hat{\boldsymbol\mu}_{2}^{T} \hat{\boldsymbol\mu}_{2}-\frac{1}{2}\hat{\boldsymbol\mu}_{1}^{T} \hat{\boldsymbol\mu}_{1}+\sigma^2 \ln \frac{P\left(\omega_{1}\right)}{P\left(\omega_{2}\right)} \end{aligned}\tag{ND.15}$$
It classifies a sample according to its distance to the class means; for equal priors, the decision boundary is the hyperplane that perpendicularly bisects the line segment connecting $\boldsymbol \mu_1$ and $\boldsymbol \mu_2$.
Non-parametric Classification
Histogram method
Divide the $x$-axis into successive bins of length $h$. Then the probability of a sample $x$ being located in a bin is estimated for each of the bins. If $N$ is the total number of samples and $k_N$ of these are located inside a bin, the corresponding probability is approximated by the frequency ratio

$$P\approx k_N/N \tag{NP.1}$$
The corresponding PDF value is assumed constant throughout the bin and is approximated by
$$\hat p(x)=\hat p(\hat x)\approx \frac{1}{h}\frac{k_N}{N}, \quad |x-\hat x|\le \frac{h}{2}\tag{NP.2}$$
where $\hat x$ is the midpoint of the bin.
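A minimal sketch of the histogram estimate (NP.1)-(NP.2) using NumPy; the sample data and bin width are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)          # assumed 1-D samples from an unknown pdf

h = 0.5                            # bin width
bins = np.arange(-4, 4 + h, h)
k_N, edges = np.histogram(x, bins=bins)

p_hat = k_N / (len(x) * h)         # (NP.2): constant pdf value inside each bin
midpoints = 0.5 * (edges[:-1] + edges[1:])
print(list(zip(midpoints[:3], p_hat[:3])))
```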
Parzen classifier
In the multidimensional case, instead of bins of size $h$, the $l$-dimensional space is divided into hypercubes with side length $h$ and volume $h^l$. Let $\mathbf x_i,\ i=1,2,\cdots,N$ be the available feature vectors. Define the function $\phi(\mathbf x)$ so that
$$\phi(\mathbf x_i)=\begin{cases}1, & \text{for }|x_{ij}|\le 1/2,\ j=1,\cdots,l\\ 0, & \text{otherwise}\end{cases}\tag{NP.3}$$
where $x_{ij},\ j=1,\cdots,l$ are the components of $\mathbf x_i$. In words, the function is equal to $1$ for all points inside the unit-side hypercube centered at the origin and $0$ outside it.
Then (NP.2) can be rephrased as
$$\hat p(\mathbf x)=\frac{1}{h^l}\left(\frac{1}{N}\sum_{i=1}^N \phi\left(\frac{\mathbf x_i-{\mathbf x}}{h}\right) \right)\tag{NP.4}$$
The interpretation of this is straightforward. We consider a hypercube with side length $h$ centered at $\mathbf x$, the point where the pdf is to be estimated. The number of training points falling inside this hypercube equals $k_N$. The pdf estimate then results from dividing $k_N$ by $N$ and by the respective hypercube volume $h^l$, which ensures that $\int_V\hat p(\mathbf x)\,dV=1$. In fact, we can prove that for any
$$\phi(\mathbf x)\ge 0\quad \text{and}\quad\int_{\mathbf x}\phi(\mathbf x)\,d\mathbf x=1\tag{NP.5}$$
the resulting estimate is a legitimate pdf, since $\hat p(\mathbf x)\ge 0$ and
$$\begin{aligned} \int_V\hat p(\mathbf x)dV&=\int_V\frac{1}{h^l}\left(\frac{1}{N}\sum_{i=1}^N \phi\left(\frac{\mathbf x_i-{\mathbf x}}{h}\right) \right)dV\\ &=\frac{1}{N}\sum_{i=1}^N\frac{1}{h^l}\left( \int_{V}\phi\left(\frac{\mathbf x_i-{\mathbf x}}{h}\right)dV \right)\\ &=\frac{1}{N}\sum_{i=1}^N\frac{1}{h^l}\cdot h^l=1 \end{aligned}$$
Functions $\phi(\mathbf x)$ of this kind are known as kernels, potential functions, or Parzen windows. A typical example is the Gaussian kernel $\mathcal{N}(\mathbf 0,\mathbf I)$, which smooths the estimated distribution compared with the step function used before (although the expression may not look intuitive at first). For the Gaussian kernel, the approximate expansion of the unknown $p(\mathbf x)$ becomes
$$\hat p(\mathbf x)=\frac{1}{h^l}\left(\frac{1}{N}\sum_{i=1}^N \frac{1}{(2\pi)^{l/2}}\exp \left(-\frac{(\mathbf x-\mathbf x_i)^T(\mathbf x-\mathbf x_i)}{2h^2}\right) \right)\tag{NP.6}$$
In other words, the unknown pdf is approximated as an average of $N$ Gaussians, each one centered at a different point of the training set. Recall that as the parameter $h$ becomes smaller, the shape of the Gaussians becomes narrower and more "spiky," and the influence of each individual Gaussian is more localized in the feature space around its mean value. On the other hand, the larger the value of $h$, the broader the shape becomes and the more global in space its influence is.
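A small sketch of the Parzen estimate with a Gaussian kernel (NP.6); the training data and the value of $h$ are assumptions for illustration:

```python
import numpy as np

def parzen_gaussian(x, X_train, h):
    """Parzen estimate (NP.6) at point x, Gaussian kernel, smoothing parameter h."""
    N, l = X_train.shape
    diff = X_train - x                                  # (N, l)
    sq_dist = np.sum(diff ** 2, axis=1)                 # (x - x_i)^T (x - x_i)
    kernels = np.exp(-sq_dist / (2 * h ** 2)) / (2 * np.pi) ** (l / 2)
    return kernels.mean() / h ** l

rng = np.random.default_rng(2)
X_train = rng.normal(size=(200, 2))                     # assumed training set
print(parzen_gaussian(np.zeros(2), X_train, h=0.5))
```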
We can estimate $p(\mathbf x|\omega_i)$ with Parzen windows using the training data of class $\omega_i$, and define a classifier according to (BD.12):
$$\text{assign }\mathbf x \text{ to } \omega_{1}\left(\omega_{2}\right) \quad \text{if} \quad l_{12} \approx\frac{\frac{1}{N_{1} h^{l}} \sum_{i=1}^{N_{1}} \phi\left(\frac{\mathbf x_{i}-\mathbf x}{h}\right)}{\frac{1}{N_{2} h^{l}} \sum_{i=1}^{N_{2}} \phi\left(\frac{\mathbf x_{i}-\mathbf x}{h}\right)}>(<) \frac{P\left(\omega_{2}\right)}{P\left(\omega_{1}\right)} \frac{\lambda_{21}}{\lambda_{12}}\tag{NP.7}$$
where $N_1, N_2$ are the numbers of training vectors in classes $\omega_1$ and $\omega_2$, respectively, and each sum runs over the training vectors of the corresponding class.
$k$-nearest neighbor classifier
$$\begin{array}{c} h^l:\text{volume}\qquad k_N:\text{the number of points falling inside the volume}\\ \text{Parzen: }h^l\text{ is fixed, }k_N\text{ varies from point to point}\\ k\text{-nearest neighbor: }k_N\text{ is fixed, }h^l\text{ varies from point to point} \end{array}$$
For the $k$-nearest neighbor estimator, we can also consider more general types of regions besides the hypercube. The estimator can now be written as
$$\hat p(\mathbf x)=\frac{k}{N\,V(\mathbf x)}\tag{NP.8}$$

where $V(\mathbf x)$ is the volume of the region around $\mathbf x$ that contains its $k$ nearest training points.
From a practical point of view, upon receiving an unknown feature vector $\mathbf x$, we compute its distance $d$ (for example, Euclidean) from all the training vectors of the various classes, for example, $\omega_{1}, \omega_{2}$. Let $r_{1}$ be the radius of the hypersphere centered at $\mathbf x$ that contains $k$ points from $\omega_{1}$, and $r_{2}$ the corresponding radius of the hypersphere containing $k$ points from class $\omega_{2}$ ($k$ may not necessarily be the same for all classes). If we denote by $V_{1}, V_{2}$ the respective hypersphere volumes, the likelihood ratio test becomes
$$\text{assign } \mathbf x \text{ to } \omega_{1}\left(\omega_{2}\right) \quad \text{if} \quad l_{12} \approx \frac{ N_{2} V_{2}}{ N_{1} V_{1}}>(<) \frac{P\left(\omega_{2}\right)}{P\left(\omega_{1}\right)} \frac{\lambda_{21}}{\lambda_{12}}\tag{NP.9}$$
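A sketch of this volume-based $k$-NN test for two classes, assuming Euclidean distances, hypersphere volumes proportional to $r^l$ (the constant cancels in the ratio), and equal losses $\lambda_{12}=\lambda_{21}$; the training data are made up:

```python
import numpy as np

def knn_volume_classify(x, X1, X2, k, P1=0.5, P2=0.5):
    """Assign x to w1 or w2 by l_12 ~ (N2 V2)/(N1 V1) >< P2/P1, i.e. (NP.9) with equal losses."""
    l = X1.shape[1]
    r1 = np.sort(np.linalg.norm(X1 - x, axis=1))[k - 1]   # radius containing k points of w1
    r2 = np.sort(np.linalg.norm(X2 - x, axis=1))[k - 1]   # radius containing k points of w2
    V1, V2 = r1 ** l, r2 ** l                              # volumes up to a common constant
    l12 = (len(X2) * V2) / (len(X1) * V1)
    return 1 if l12 > P2 / P1 else 2

rng = np.random.default_rng(3)
X1 = rng.normal(loc=0.0, size=(100, 2))    # assumed training data of w1
X2 = rng.normal(loc=2.0, size=(100, 2))    # assumed training data of w2
print(knn_volume_classify(np.array([0.2, 0.1]), X1, X2, k=5))
```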
At some optimum $k$ the classification error reaches a minimum.
In practice, remember to scale the features.
Summary
- You can build classifiers using parametric density estimates + Bayes' rule: the quadratic classifier, the linear classifier, and the nearest mean classifier.
- You can build classifiers using nonparametric density estimates + Bayes' rule: the Parzen classifier and the $k$-nearest neighbor classifier.
- When you have many features, beware of the curse of dimensionality.