Classifiers Based on Bayes Decision Theory

Reference:

Sections 2.1–2.6 of Pattern Recognition, by Sergios Theodoridis and Konstantinos Koutroumbas (2009)

Slides of CS4220, TUD

Bayes Decision Theory

We will initially focus on the two-class case.

Let $\omega_1,\omega_2$ be the two classes to which our patterns belong.

  • prior probabilities: $P(\omega_1), P(\omega_2)$

    Assumed to be known. If not, they can easily be estimated as $P(\omega_1)\approx N_1/N,\ P(\omega_2)\approx N_2/N$, where $N_i$ is the number of training samples from class $\omega_i$ and $N$ is the total number of training samples.

  • likelihood functions: $p(\mathbf x|\omega_1), p(\mathbf x|\omega_2)$

    Assumed to be known. If not, they can also be estimated from the available training data.

  • posterior probabilities: $P(\omega_1|\mathbf x), P(\omega_2|\mathbf x)$

    Can be computed via the Bayes rule:
    $$P(\omega_i|\mathbf x)=\frac{p(\mathbf x|\omega_i)P(\omega_i)}{p(\mathbf x)}\tag{BD.1}$$
    where $p(\mathbf x)$ is the pdf of $\mathbf x$, for which we have
    $$p(\mathbf x)=\sum_{i=1}^{2} p(\mathbf x|\omega_i)P(\omega_i)\tag{BD.2}$$

The Bayes classification rule can now be stated as
$$\begin{aligned} &\text{If } P(\omega_{1}|\mathbf x)>P(\omega_{2}|\mathbf x), \quad \mathbf x \text{ is classified to } \omega_{1}\\ &\text{If } P(\omega_{1}|\mathbf x)<P(\omega_{2}|\mathbf x), \quad \mathbf x \text{ is classified to } \omega_{2} \end{aligned}\tag{BD.3}$$
Using $(BD.1)$ and noting that $p(\mathbf x)$ is positive and common to both sides, the decision can equivalently be based on the inequalities
$$p(\mathbf x|\omega_1)P(\omega_1)\gtrless p(\mathbf x|\omega_2)P(\omega_2)\tag{BD.4}$$
The same test can also be written as
$$p(\mathbf x|\omega_1)P(\omega_1)-p(\mathbf x|\omega_2)P(\omega_2)\gtrless 0$$
$$\frac{p(\mathbf x|\omega_1)P(\omega_1)}{p(\mathbf x|\omega_2)P(\omega_2)}\gtrless 1$$
$$\log\left(p(\mathbf x|\omega_1)P(\omega_1)\right)-\log\left(p(\mathbf x|\omega_2)P(\omega_2)\right)\gtrless 0$$
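To make the rule concrete, here is a minimal Python sketch of $(BD.4)$, assuming the priors and class-conditional pdfs are known; the Gaussian likelihoods below are placeholder choices, not part of the reference.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical priors and class-conditional pdfs p(x|w1), p(x|w2).
priors = np.array([0.5, 0.5])
likelihoods = [norm(loc=0.0, scale=1.0).pdf,   # p(x|w1)
               norm(loc=1.0, scale=1.0).pdf]   # p(x|w2)

def bayes_classify(x):
    """Rule (BD.4): pick the class with the largest p(x|wi) * P(wi)."""
    scores = np.array([lik(x) * p for lik, p in zip(likelihoods, priors)])
    return int(np.argmax(scores)) + 1          # 1 -> omega_1, 2 -> omega_2

print(bayes_classify(0.2))   # -> 1 (left of the threshold x = 0.5)
print(bayes_classify(0.9))   # -> 2
```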


Minimizing the classification error probability

We will show that the Bayesian classifier is optimal with respect to minimizing the classification error probability.

Define the classification error probability as
$$P(\text{error})=\sum_{i=1}^M P(\text{error}|\omega_i)P(\omega_i)\tag{BD.5}$$


For a given decision boundary, it can happen that an $\mathbf x\in \omega_2$ is assigned to $\omega_1$, and that an $\mathbf x\in \omega_1$ is assigned to $\omega_2$. Let $R_i$ be the region of the feature space in which we decide in favor of $\omega_i$. The classification error probability can then be written as
$$\begin{aligned} P_{e}&=P(\omega_1)\int_{R_{2}} p(\mathbf x|\omega_1)\, d\mathbf x+P(\omega_2)\int_{R_{1}} p(\mathbf x|\omega_2)\, d\mathbf x \\&=\int_{R_{2}} P(\omega_{1}|\mathbf x)\, p(\mathbf x)\, d\mathbf x+\int_{R_{1}} P(\omega_{2}|\mathbf x)\, p(\mathbf x)\, d\mathbf x \end{aligned}\tag{BD.6}$$
that is, the total probability mass of the class-conditional pdfs that falls on the "wrong" side of the boundary. It is then intuitive that the optimal decision boundary is the point (surface) where the two posteriors are equal,
$$\mathbf x_0:\ P(\omega_{1}|\mathbf x_0)=P(\omega_{2}|\mathbf x_0)\tag{BD.7}$$


so that
$$\begin{array}{l} R_{1}: P(\omega_{1}|\mathbf x)>P(\omega_{2}|\mathbf x) \\ R_{2}: P(\omega_{2}|\mathbf x)>P(\omega_{1}|\mathbf x) \end{array}\tag{BD.8}$$
By combining $(BD.6)$ and $(BD.7)$, we obtain the Bayes error $\varepsilon^*$, the minimum classification error probability that can be achieved. It does not depend on the classification rule we apply, only on the distribution of the data. In practice we cannot compute $\varepsilon^*$ exactly, since we do not know the true distributions and the high-dimensional integrals are hard to evaluate.


Minimizing the average risk

Sometimes misclassifying a pattern from class A as class B is much more costly than the reverse. Thus, it is more appropriate to assign a penalty term to weight each type of error:
$$r=\lambda_{12}P(\omega_1)\int_{R_{2}} p(\mathbf x|\omega_1)\, d\mathbf x+\lambda_{21}P(\omega_2)\int_{R_{1}} p(\mathbf x|\omega_2)\, d\mathbf x\tag{BD.9}$$
which is very similar to $(BD.6)$.

For an $M$-class problem, the risk associated with class $\omega_k$ is defined as
$$r_k=\sum_{i=1}^M \lambda_{ki}\int_{R_i}p(\mathbf x|\omega_k)\, d\mathbf x \tag{BD.10}$$
where $\lambda_{ki}$ is the penalty incurred when a feature vector $\mathbf x$ that belongs to class $\omega_k$ lies in $R_i,\ i\ne k$. The matrix $L$, which has the penalty term $\lambda_{ki}$ at its $(k,i)$ location, is known as the loss matrix. The diagonal entries $\lambda_{kk}$ are usually set to zero, since correct decisions incur no penalty.

The average risk is defined as
$$r=\sum_{k=1}^M r_k P(\omega_k)=\sum_{i=1}^M\int_{R_i}\left(\sum_{k=1}^M \lambda_{ki}\, p(\mathbf x|\omega_k)P(\omega_k)\right)d\mathbf x \tag{BD.11}$$
which is the generalization of $(BD.9)$.

Our goal is to choose the partitioning regions $R_i$ so that the average risk is minimized. This is achieved if each of the integrals is minimized, which is equivalent to selecting the partitioning regions so that
$$\mathbf{x} \in R_{i} \quad \text{if} \quad l_{i} \equiv \sum_{k=1}^{M} \lambda_{ki}\, p(\mathbf{x}|\omega_{k}) P(\omega_{k})<l_{j} \equiv \sum_{k=1}^{M} \lambda_{kj}\, p(\mathbf{x}|\omega_{k}) P(\omega_{k}) \quad \forall j \neq i \tag{BD.12}$$
which is the generalization of $(BD.7)$.

It is obvious that if $\lambda_{ki}=1-\delta_{ki}$, where $\delta_{ki}$ is Kronecker's delta ($0$ if $k\ne i$ and $1$ if $k=i$), then minimizing the average risk becomes equivalent to minimizing the classification error probability.
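A minimal Python sketch of the minimum-risk rule $(BD.12)$; the likelihoods, priors, and the way they are passed in are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def min_risk_classify(x, likelihoods, priors, loss):
    """Implement (BD.12): choose the class i that minimizes
    l_i = sum_k loss[k, i] * p(x|w_k) * P(w_k)."""
    weighted = np.array([lik(x) * p for lik, p in zip(likelihoods, priors)])
    risks = loss.T @ weighted      # risks[i] = sum_k loss[k, i] * p(x|w_k) * P(w_k)
    return int(np.argmin(risks))   # 0-based class index

# Hypothetical two-class setup (same pdfs and loss matrix as the example below).
likelihoods = [norm(0.0, np.sqrt(0.5)).pdf, norm(1.0, np.sqrt(0.5)).pdf]
priors = [0.5, 0.5]
loss = np.array([[0.0, 0.5],
                 [1.0, 0.0]])
print(min_risk_classify(0.4, likelihoods, priors, loss))  # -> 1, i.e. omega_2
```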



Example

In a two-class problem with a single feature $x$, the pdfs are Gaussians with variance $\sigma^{2}=1/2$ for both classes and mean values $0$ and $1$, respectively, that is,
$$\begin{array}{l} p(x|\omega_{1})=\frac{1}{\sqrt{\pi}} \exp(-x^{2}) \\ p(x|\omega_{2})=\frac{1}{\sqrt{\pi}} \exp(-(x-1)^{2}) \end{array}$$
If $P(\omega_{1})=P(\omega_{2})=1/2$, compute the threshold value $x_{0}$ (a) for minimum error probability and (b) for minimum risk if the loss matrix is
$$L=\begin{bmatrix} 0 & 0.5 \\ 1.0 & 0 \end{bmatrix}$$
Taking into account the shape of the Gaussian function graph (Appendix A of the book), the threshold for the minimum error probability case satisfies
$$x_{0}: \exp(-x^{2})=\exp(-(x-1)^{2})$$
Taking the logarithm of both sides, we end up with $x_{0}=1/2$. In the minimum risk case we get
$$x_{0}: \exp(-x^{2})=2 \exp(-(x-1)^{2})$$
where the factor $2$ is the ratio $\lambda_{21}/\lambda_{12}=1.0/0.5$, or $x_{0}=(1-\ln 2)/2<1/2$; that is, the threshold moves to the left of $1/2$. If the two classes are not equiprobable, it is easily verified that the threshold moves to the right (left) if $P(\omega_{1})>(<)\,P(\omega_{2})$. That is, we expand the region in which we decide in favor of the more probable class, since it is better to make fewer errors for the more probable class.
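A short numerical check of this example (a sketch, not from the reference), confirming that the two closed-form thresholds satisfy their defining equations:

```python
import numpy as np

# Class-conditional pdfs from the example (variance 1/2, means 0 and 1).
p1 = lambda x: np.exp(-x**2) / np.sqrt(np.pi)
p2 = lambda x: np.exp(-(x - 1)**2) / np.sqrt(np.pi)

# (a) minimum error probability: p(x|w1) = p(x|w2)        ->  x0 = 1/2
# (b) minimum risk, lambda_12 = 0.5, lambda_21 = 1.0:
#     0.5 * p(x|w1) = 1.0 * p(x|w2)                        ->  x0 = (1 - ln 2) / 2
x0_error = 0.5
x0_risk = (1 - np.log(2)) / 2

print(np.isclose(p1(x0_error), p2(x0_error)))             # True
print(np.isclose(0.5 * p1(x0_risk), 1.0 * p2(x0_risk)))   # True
print(x0_risk)                                             # approx 0.1534
```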


Discriminant functions and decision surfaces

If the regions $R_i, R_j$ happen to be contiguous, they are separated by a decision surface in the multidimensional feature space. For the minimum error probability case, the decision surface between $\omega_i$ and $\omega_j$ is described by the equation
$$P(\omega_i|\mathbf x)-P(\omega_j|\mathbf x)=0\tag{BD.13}$$
On one side of the surface this difference is positive, and on the other it is negative.

Sometimes, instead of working directly with probabilities (or risk functions), it may be more convenient, from a mathematical point of view, to work with an equivalent function of them, for example, $g_i(\mathbf x)=f(P(\omega_i|\mathbf x))$, where $f(\cdot)$ is a monotonically increasing function. $g_i(\mathbf x)$ is known as a discriminant function. The decision test is now stated as
$$\text{classify }\mathbf x\text{ in }\omega_i\quad \text{if}\quad g_i(\mathbf x)>g_j(\mathbf x),\ \forall j\ne i\tag{BD.14}$$
The decision surfaces, separating contiguous regions, are described by
$$g_{ij}(\mathbf x)=g_i(\mathbf x)-g_j(\mathbf x)=0,\quad i,j=1,2,\cdots,M,\quad i\ne j\tag{BD.15}$$

Bayesian Classification For Normal Distribution

The normal distribution can adequately model a large number of cases due to the central limit theorem: if a random variable is the sum of a number of independent random variables, its pdf approaches the Gaussian as the number of summands tends to infinity.

The Gaussian Probability Density Function

  • one-dimensional Gaussian $\mathcal{N}(\mu,\sigma^2)$:
    $$p(x)=\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)\tag{ND.1}$$
    where
    $$\begin{aligned} \mu=E[x] &\equiv \int_{-\infty}^{+\infty} x\, p(x)\, dx\\ \sigma^{2}=E\left[(x-\mu)^{2}\right]&\equiv\int_{-\infty}^{+\infty}(x-\mu)^{2} p(x)\, dx \end{aligned}\tag{ND.2}$$

  • $l$-dimensional Gaussian $\mathcal{N}(\boldsymbol{\mu},\Sigma)$:
    $$p(\mathbf x)=\frac{1}{(2\pi)^{l/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf x-\boldsymbol{\mu})^{T} \Sigma^{-1}(\mathbf x-\boldsymbol{\mu})\right)\tag{ND.3}$$
    where
    $$\begin{aligned} \boldsymbol{\mu}&=E[\mathbf x]\\ \Sigma&=E\left[(\mathbf x-\boldsymbol{\mu})(\mathbf x-\boldsymbol{\mu})^T\right] \end{aligned}\tag{ND.4}$$
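As a small sketch, $(ND.3)$ can be evaluated directly with numpy; the mean and covariance below are arbitrary example values (scipy's multivariate_normal gives the same result).

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the l-dimensional Gaussian density (ND.3) at x."""
    l = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (l / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```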


The Bayesian Classifier for Normally Distributed Classes

Our goal is to study the optimal Bayesian classifier when the involved pdfs are normal, i.e., $p(\mathbf x|\omega_i)\sim \mathcal{N}(\boldsymbol\mu_i,\Sigma_i),\ i=1,2,\cdots,M$. Because of the exponential form of the involved densities, it is preferable to work with the following discriminant functions, which involve the (monotonic) logarithmic function $\ln(\cdot)$:

$$g_i(\mathbf x)=\ln\left(p(\mathbf x|\omega_i)P(\omega_i)\right)=\ln p(\mathbf x|\omega_i)+\ln P(\omega_i)\tag{ND.5}$$
or
$$g_i(\mathbf x)=-\frac{1}{2}(\mathbf x-\boldsymbol\mu_i)^T\Sigma_i^{-1}(\mathbf x-\boldsymbol\mu_i)+\ln P(\omega_i)+c_i \tag{ND.6}$$
where $c_i$ is a constant equal to $-(l/2)\ln 2\pi-(1/2)\ln|\Sigma_i|$. Expanding, we obtain
$$g_{i}(\mathbf x)=-\frac{1}{2}\mathbf x^{T}\Sigma_{i}^{-1}\mathbf x+\frac{1}{2}\mathbf x^{T}\Sigma_{i}^{-1}\boldsymbol\mu_{i}-\frac{1}{2}\boldsymbol\mu_{i}^{T}\Sigma_{i}^{-1}\boldsymbol\mu_{i}+\frac{1}{2}\boldsymbol\mu_{i}^{T}\Sigma_{i}^{-1}\mathbf x+\ln P(\omega_{i})+c_{i}\tag{ND.7}$$
Now define, for the two-class case, the quadratic classifier
$$f(\mathbf x)=g_1(\mathbf x)-g_2(\mathbf x)=\mathbf x^T \mathbf W \mathbf x+\mathbf w^T \mathbf x+w_0\tag{ND.8}$$
where
$$\begin{aligned} \mathbf{W}&=\frac{1}{2}\left(\Sigma_{2}^{-1}-\Sigma_{1}^{-1}\right) \\ \mathbf{w}&=\Sigma_{1}^{-1}\boldsymbol\mu_{1}-\Sigma_{2}^{-1}\boldsymbol\mu_{2} \\ w_{0}&=-\frac{1}{2}\ln|\Sigma_{1}|-\frac{1}{2}\boldsymbol\mu_{1}^{T}\Sigma_{1}^{-1}\boldsymbol\mu_{1}+\ln P(\omega_{1}) +\frac{1}{2}\ln|\Sigma_{2}|+\frac{1}{2}\boldsymbol\mu_{2}^{T}\Sigma_{2}^{-1}\boldsymbol\mu_{2}-\ln P(\omega_{2}) \end{aligned}\tag{ND.9}$$
$\mathbf x$ is assigned to $\omega_1$ if $f(\mathbf x)>0$ and to $\omega_2$ otherwise.
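The discriminant $(ND.8)$–$(ND.9)$ translates directly into code; below is a minimal sketch assuming the class means, covariances, and priors are given (the example values are arbitrary).

```python
import numpy as np

def quadratic_classifier(mu1, Sigma1, P1, mu2, Sigma2, P2):
    """Build f(x) = x^T W x + w^T x + w0 as in (ND.8)-(ND.9)."""
    S1i, S2i = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    W = 0.5 * (S2i - S1i)
    w = S1i @ mu1 - S2i @ mu2
    w0 = (-0.5 * np.log(np.linalg.det(Sigma1)) - 0.5 * mu1 @ S1i @ mu1 + np.log(P1)
          + 0.5 * np.log(np.linalg.det(Sigma2)) + 0.5 * mu2 @ S2i @ mu2 - np.log(P2))
    return lambda x: x @ W @ x + w @ x + w0    # > 0 -> omega_1, < 0 -> omega_2

# Hypothetical parameters.
f = quadratic_classifier(np.array([0.0, 0.0]), np.eye(2), 0.5,
                         np.array([2.0, 2.0]), 2 * np.eye(2), 0.5)
x = np.array([0.3, 0.2])
print("omega_1" if f(x) > 0 else "omega_2")    # near mu_1 -> omega_1
```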


In case the parameters $\boldsymbol\mu_k$ and $\Sigma_k$ are unknown, we can estimate them from the $N_k$ training samples $\mathbf x_{ki}$ of class $\omega_k$ simply as
$$\hat{\boldsymbol\mu}_k=\frac{1}{N_k}\sum_{i=1}^{N_k}\mathbf x_{ki},\quad \hat\Sigma_k=\frac{1}{N_k}\sum_{i=1}^{N_k}(\mathbf x_{ki}-\hat{\boldsymbol\mu}_k)(\mathbf x_{ki}-\hat{\boldsymbol\mu}_k)^T\tag{ND.10}$$
However, when there is insufficient data, the estimated covariance matrices may be singular and cannot be inverted. An alternative is to average over all class covariance matrices,
$$\hat\Sigma=\frac{1}{M}\sum_{k=1}^M \hat\Sigma_k\tag{ND.11}$$
Since all class covariance matrices are now the same, $\mathbf W=\mathbf 0$, and thus
$$f(\mathbf x)=\mathbf w^T \mathbf x+w_0\tag{ND.12}$$
which is a linear classifier, with
$$\begin{aligned} \mathbf{w}&=\hat\Sigma^{-1}(\hat{\boldsymbol\mu}_{1}-\hat{\boldsymbol\mu}_{2})\\ w_{0}&=\frac{1}{2}\hat{\boldsymbol\mu}_{2}^{T}\hat\Sigma^{-1}\hat{\boldsymbol\mu}_{2}-\frac{1}{2}\hat{\boldsymbol\mu}_{1}^{T}\hat\Sigma^{-1}\hat{\boldsymbol\mu}_{1}+\ln\frac{P(\omega_{1})}{P(\omega_{2})} \end{aligned}\tag{ND.13}$$
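A sketch of fitting this linear classifier from two hypothetical training sets, using the class means, the averaged covariance $(ND.11)$, and $(ND.13)$:

```python
import numpy as np

def fit_linear_classifier(X1, X2, P1=0.5, P2=0.5):
    """Estimate means and an averaged covariance, then build (ND.12)-(ND.13)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False, bias=True)   # (1/N_k) sum (x - mu)(x - mu)^T
    S2 = np.cov(X2, rowvar=False, bias=True)
    S = 0.5 * (S1 + S2)                        # averaged covariance (ND.11)
    Si = np.linalg.inv(S)
    w = Si @ (mu1 - mu2)
    w0 = 0.5 * mu2 @ Si @ mu2 - 0.5 * mu1 @ Si @ mu1 + np.log(P1 / P2)
    return lambda x: w @ x + w0                # > 0 -> omega_1

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, size=(100, 2))    # hypothetical omega_1 samples
X2 = rng.normal([2, 2], 1.0, size=(100, 2))    # hypothetical omega_2 samples
f = fit_linear_classifier(X1, X2)
print(f(np.array([0.2, 0.1])) > 0)             # expected: True (omega_1 side)
```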


In some cases even a full averaged covariance matrix is too much to estimate. As a further simplification, one can assume that all features have the same variance and are uncorrelated:
$$\hat\Sigma=\sigma^2\mathbf I\tag{ND.14}$$
The linear classifier then simplifies (up to a positive scaling by $\sigma^2$) to the nearest mean classifier
$$f(\mathbf x)=\mathbf w^T \mathbf x+w_0$$
where
$$\begin{aligned} \mathbf{w}&=\hat{\boldsymbol\mu}_{1}-\hat{\boldsymbol\mu}_{2}\\ w_{0}&=\frac{1}{2}\hat{\boldsymbol\mu}_{2}^{T}\hat{\boldsymbol\mu}_{2}-\frac{1}{2}\hat{\boldsymbol\mu}_{1}^{T}\hat{\boldsymbol\mu}_{1}+\sigma^2\ln\frac{P(\omega_{1})}{P(\omega_{2})} \end{aligned}\tag{ND.15}$$
It only uses the distances to the class means, and for equal priors the decision boundary is the perpendicular bisector of the segment joining $\boldsymbol\mu_1$ and $\boldsymbol\mu_2$.
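And a tiny self-contained sketch of the nearest mean classifier $(ND.15)$, with arbitrary example means:

```python
import numpy as np

def nearest_mean_classifier(mu1, mu2, sigma2=1.0, P1=0.5, P2=0.5):
    """Nearest mean classifier (ND.15): compare (scaled) distances to the class means."""
    w = mu1 - mu2
    w0 = 0.5 * (mu2 @ mu2 - mu1 @ mu1) + sigma2 * np.log(P1 / P2)
    return lambda x: w @ x + w0                # > 0 -> omega_1

f = nearest_mean_classifier(np.array([0.0, 0.0]), np.array([2.0, 2.0]))
print("omega_1" if f(np.array([0.5, 0.5])) > 0 else "omega_2")   # closer to mu_1
```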


Non-parametric Classification

Histogram method


Divide the $x$-axis into successive bins of length $h$. Then the probability of a sample $x$ being located in a bin is estimated for each of the bins. If $N$ is the total number of samples and $k_N$ of these are located inside a bin, the corresponding probability is approximated by the frequency ratio
$$P\approx k_N/N\tag{NP.1}$$
The corresponding pdf value is assumed constant throughout the bin and is approximated by
$$\hat p(x)=\hat p(\hat x)\approx \frac{1}{h}\frac{k_N}{N},\quad |x-\hat x|\le \frac{h}{2}\tag{NP.2}$$
where $\hat x$ is the midpoint of the bin.
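A minimal numpy sketch of the histogram estimate $(NP.2)$; the data, bin width, and query point are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)        # hypothetical 1-D samples
h = 0.5                                     # bin width
bins = np.arange(-4, 4 + h, h)

counts, edges = np.histogram(x, bins=bins)
p_hat = counts / (len(x) * h)               # (NP.2): k_N / (N * h), constant in each bin

# pdf estimate at an arbitrary point x0: look up its bin
x0 = 0.3
idx = np.searchsorted(edges, x0, side="right") - 1
print(p_hat[idx])                           # roughly the standard normal pdf near 0.3
```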


Parzen classifier

In the multidimensional case, instead of bins of size $h$, the $l$-dimensional space is divided into hypercubes with side length $h$ and volume $h^l$. Let $\mathbf x_i,\ i=1,2,\cdots,N$, be the available feature vectors. Define the function $\phi(\mathbf x)$ so that
$$\phi(\mathbf x)=\left\{\begin{aligned}&1, && \text{for }|x_{j}|\le 1/2,\ j=1,\cdots,l\\&0, && \text{otherwise}\end{aligned}\right.\tag{NP.3}$$
where $x_{j},\ j=1,\cdots,l$, are the components of $\mathbf x$. In words, the function is equal to $1$ for all points inside the unit-side hypercube centered at the origin and $0$ outside it.


Then $(NP.2)$ can be rephrased as
$$\hat p(\mathbf x)=\frac{1}{h^l}\left(\frac{1}{N}\sum_{i=1}^N \phi\left(\frac{\mathbf x_i-\mathbf x}{h}\right)\right)\tag{NP.4}$$
The interpretation of this is straightforward. We consider a hypercube with side length $h$ centered at $\mathbf x$, the point where the pdf is to be estimated. The number of training points falling inside this hypercube equals $k_N$. The pdf estimate then results from dividing $k_N$ by $N$ and by the hypercube volume $h^l$, which ensures that $\int_V\hat p(\mathbf x)\, dV=1$. In fact, we can prove that for any
$$\phi(\mathbf x)\ge 0\quad \text{and}\quad\int_{\mathbf x}\phi(\mathbf x)\, d\mathbf x=1\tag{NP.5}$$
the resulting estimate is a legitimate pdf, since $\hat p(\mathbf x)\ge 0$ and
$$\begin{aligned} \int_V\hat p(\mathbf x)\, dV&=\int_V\frac{1}{h^l}\left(\frac{1}{N}\sum_{i=1}^N \phi\left(\frac{\mathbf x_i-\mathbf x}{h}\right)\right)dV\\ &=\frac{1}{N}\sum_{i=1}^N\frac{1}{h^l}\left(\int_{V}\phi\left(\frac{\mathbf x_i-\mathbf x}{h}\right)dV\right)\\ &=\frac{1}{N}\sum_{i=1}^N\frac{1}{h^l}\cdot h^l=1 \end{aligned}$$
Such functions $\phi(\mathbf x)$ are known as kernels, potential functions, or Parzen windows. A typical example is the Gaussian kernel $\mathcal{N}(\mathbf 0,\mathbf I)$, which smooths the estimated distribution compared with the step function used above (although the expression may not look intuitive at first). For the Gaussian kernel, the approximate expansion of the unknown $p(\mathbf x)$ becomes
$$\hat p(\mathbf x)=\frac{1}{h^l}\left(\frac{1}{N}\sum_{i=1}^N \frac{1}{(2\pi)^{l/2}}\exp\left(-\frac{(\mathbf x-\mathbf x_i)^T(\mathbf x-\mathbf x_i)}{2h^2}\right)\right)\tag{NP.6}$$
In other words, the unknown pdf is approximated as an average of $N$ Gaussians, each one centered at a different point of the training set. Recall that as the parameter $h$ becomes smaller, the shape of the Gaussians becomes narrower and more "spiky," and the influence of each individual Gaussian is more localized in the feature space around its mean value. Conversely, the larger the value of $h$, the broader their shape and the more global in space their influence.
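A sketch of the Gaussian-kernel Parzen estimate $(NP.6)$; the training data and the value of $h$ are hypothetical:

```python
import numpy as np

def parzen_gaussian(X, h):
    """Return a function estimating the pdf as in (NP.6); X has shape (N, l)."""
    N, l = X.shape
    const = 1.0 / (N * h**l * (2 * np.pi) ** (l / 2))
    def p_hat(x):
        sq_dists = np.sum((X - x) ** 2, axis=1)   # (x - x_i)^T (x - x_i)
        return const * np.sum(np.exp(-sq_dists / (2 * h**2)))
    return p_hat

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))    # hypothetical 2-D training samples
p_hat = parzen_gaussian(X, h=0.5)
print(p_hat(np.zeros(2)))                   # a smoothed estimate of the true density
                                            # 1/(2*pi) ~= 0.159 at the origin
```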


We can estimate $p(\mathbf x|\omega_i)$ with Parzen windows using the training data of class $\omega_i$, and define a classifier according to $(BD.12)$:
$$\text{assign }\mathbf x\text{ to }\omega_{1}\ (\omega_{2})\quad \text{if}\quad l_{12}\approx\left(\frac{\frac{1}{N_{1}h^{l}}\sum_{i=1}^{N_{1}}\phi\left(\frac{\mathbf x_{i}-\mathbf x}{h}\right)}{\frac{1}{N_{2}h^{l}}\sum_{i=1}^{N_{2}}\phi\left(\frac{\mathbf x_{i}-\mathbf x}{h}\right)}\right)>(<)\,\frac{P(\omega_{2})}{P(\omega_{1})}\frac{\lambda_{21}}{\lambda_{12}}\tag{NP.7}$$
where $N_1, N_2$ are the numbers of training vectors in classes $\omega_1, \omega_2$, respectively, and the $\mathbf x_i$ in the numerator (denominator) run over the training vectors of $\omega_1$ ($\omega_2$).
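Combining two class-wise Parzen estimates into the test $(NP.7)$ could look like the following sketch, assuming equal priors and a symmetric loss ($\lambda_{12}=\lambda_{21}$); the data are hypothetical:

```python
import numpy as np

def parzen_density(X, h):
    """Gaussian-kernel Parzen estimate (NP.6) built from the samples X of one class."""
    N, l = X.shape
    const = 1.0 / (N * h**l * (2 * np.pi) ** (l / 2))
    return lambda x: const * np.sum(np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h**2)))

def parzen_classify(x, X1, X2, h=0.5, P1=0.5, P2=0.5, lam12=1.0, lam21=1.0):
    """Assign x to omega_1 if l_12 > (P2/P1)*(lam21/lam12), as in (NP.7)."""
    l12 = parzen_density(X1, h)(x) / parzen_density(X2, h)(x)
    return 1 if l12 > (P2 / P1) * (lam21 / lam12) else 2

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, size=(200, 2))   # hypothetical omega_1 data
X2 = rng.normal([2, 2], 1.0, size=(200, 2))   # hypothetical omega_2 data
print(parzen_classify(np.array([0.2, 0.1]), X1, X2))   # expected: 1
```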


$k$-nearest neighbor classifier

Let $h^l$ denote the volume of the region around $\mathbf x$ and $k_N$ the number of training points falling inside it:

  • Parzen: $h^l$ is fixed, $k_N$ varies from point to point
  • $k$-nearest neighbor: $k_N$ is fixed, $h^l$ varies from point to point

For the $k$-nearest neighbor estimator we can also consider more general types of regions besides the hypercube. The estimator can then be written as
$$\hat p(\mathbf x)=\frac{k}{N\,V(\mathbf x)}\tag{NP.8}$$
where $V(\mathbf x)$ is the volume of the region around $\mathbf x$ that contains its $k$ nearest training points.
From a practical point of view, at the reception of an unknown feature vector $\mathbf x$, we compute its distance $d$, for example, Euclidean, from all the training vectors of the various classes, for example, $\omega_1, \omega_2$. Let $r_1$ be the radius of the hypersphere, centered at $\mathbf x$, that contains $k$ points from $\omega_1$, and $r_2$ the corresponding radius of the hypersphere containing $k$ points from class $\omega_2$ ($k$ may not necessarily be the same for all classes). If we denote by $V_1, V_2$ the respective hypersphere volumes, the likelihood ratio test becomes
$$\text{assign }\mathbf x\text{ to }\omega_{1}\ (\omega_{2})\quad \text{if}\quad l_{12}\approx\frac{N_{2}V_{2}}{N_{1}V_{1}}>(<)\,\frac{P(\omega_{2})}{P(\omega_{1})}\frac{\lambda_{21}}{\lambda_{12}}\tag{NP.9}$$
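A minimal sketch of the test $(NP.9)$ in two dimensions, where the hypersphere containing the $k$ nearest neighbors of $\mathbf x$ from a class has "volume" (disc area) $\pi r^2$; the data, $k$, priors, and loss terms are illustrative assumptions:

```python
import numpy as np

def knn_ratio_classify(x, X1, X2, k=5, P1=0.5, P2=0.5, lam12=1.0, lam21=1.0):
    """Assign x to omega_1 if N2*V2 / (N1*V1) > (P2/P1)*(lam21/lam12), as in (NP.9)."""
    r1 = np.sort(np.linalg.norm(X1 - x, axis=1))[k - 1]   # radius enclosing k points of omega_1
    r2 = np.sort(np.linalg.norm(X2 - x, axis=1))[k - 1]   # radius enclosing k points of omega_2
    V1, V2 = np.pi * r1**2, np.pi * r2**2                  # hypersphere (disc) volumes in 2-D
    l12 = (len(X2) * V2) / (len(X1) * V1)
    return 1 if l12 > (P2 / P1) * (lam21 / lam12) else 2

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, size=(200, 2))   # hypothetical omega_1 data
X2 = rng.normal([2, 2], 1.0, size=(200, 2))   # hypothetical omega_2 data
print(knn_ratio_classify(np.array([0.2, 0.1]), X1, X2))   # expected: 1
```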


At some optimal $k$ the classification error is minimal: a very small $k$ gives a noisy estimate, while a very large $k$ oversmooths it.


In practice, remember to scale the features: distance-based methods such as the $k$-nearest neighbor classifier are sensitive to features with very different ranges.


Summary

  • You can build classifiers using parametric density estimates + Bayes' rule:
    quadratic classifier, linear classifier, nearest mean classifier
  • You can build classifiers using nonparametric density estimates + Bayes' rule:
    Parzen classifier, $k$-nearest neighbor classifier
  • When you have many features, beware of the curse of dimensionality