Gaussian Discriminant Analysis (GDA)


Multidimensional Gaussian Model

$z \sim N(\vec\mu, \Sigma)$
$z \in R^n,\ \vec\mu \in R^n,\ \Sigma \in R^{n \times n}$
$z$ – random variable
$\vec\mu = \begin{bmatrix} \mu_1\\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}$ – mean vector
$\Sigma$ – covariance matrix
(In GDA below, all the class-conditional Gaussians share one covariance matrix.)

$E(z) = \vec\mu,\quad Cov(z) = E[(z-\vec\mu)(z-\vec\mu)^T] = E(zz^T) - (E(z))(E(z))^T$
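
As a quick sanity check of these identities, here is a minimal NumPy sketch (the parameter values are made up for illustration) that samples from a multivariate Gaussian and estimates the mean and the covariance via $E(zz^T) - (E(z))(E(z))^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Draw many samples z ~ N(mu, Sigma); each row of z is one sample.
z = rng.multivariate_normal(mu, Sigma, size=100_000)

mean = z.mean(axis=0)
print(mean)                                 # close to mu
# Cov(z) = E[zz^T] - E[z]E[z]^T, estimated from the samples:
print(z.T @ z / len(z) - np.outer(mean, mean))  # close to Sigma
```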

Intro

GDA assumes:
$x|y=0 \sim N(\mu_0, \Sigma)$
$x|y=1 \sim N(\mu_1, \Sigma)$
$y \sim Ber(\phi),\ \phi = P(y=1)$
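
These three assumptions describe a complete generative story: first draw the label from a Bernoulli, then draw the features from the corresponding Gaussian. A minimal sketch of that process (all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
phi = 0.4                                   # P(y = 1)
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])              # one covariance shared by both classes

m = 500
y = rng.binomial(1, phi, size=m)            # y ~ Ber(phi)
means = np.where(y[:, None] == 1, mu1, mu0) # mu_{y^(i)} for each sample
# x | y ~ N(mu_y, Sigma): shift zero-mean Gaussian noise by the class mean.
x = means + rng.multivariate_normal(np.zeros(2), Sigma, size=m)
```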

GDA model (binary classification)

Multivariate Gaussian distribution:
$P(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
$|\Sigma|$ is the determinant of $\Sigma$.
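
A direct translation of the density formula into NumPy, checked against SciPy (a minimal sketch; the helper name gaussian_pdf and the test values are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, straight from the formula."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([0.5, -1.0])
print(gaussian_pdf(x, mu, Sigma))             # from the formula
print(multivariate_normal(mu, Sigma).pdf(x))  # SciPy gives the same value
```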


Parameters: $\mu_0, \mu_1, \Sigma, \phi$
$P(y) = \phi^y(1-\phi)^{1-y}$
$\phi$ is the prior probability $P(y=1)$; it reflects the proportion of the two classes.


Joint likelihood:
$L(\phi, \mu_0, \mu_1, \Sigma) = \prod\limits_{i=1}^m P(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma) = \prod\limits_{i=1}^m P(x^{(i)}|y^{(i)})\,P(y^{(i)})$
MLE: $\arg\max\limits_{\phi, \mu_0, \mu_1, \Sigma} \ell(\phi, \mu_0, \mu_1, \Sigma)$, where $\ell = \log L$. Setting the derivatives of $\ell$ to zero gives closed-form solutions:
$\phi = \frac{\sum\limits_{i=1}^m y^{(i)}}{m} = \frac{\sum\limits_{i=1}^m 1\{y^{(i)}=1\}}{m}$
$\mu_k = \frac{\sum\limits_{i=1}^m 1\{y^{(i)}=k\}\,x^{(i)}}{\sum\limits_{i=1}^m 1\{y^{(i)}=k\}},\quad k \in \{0,1\}$
$\Sigma = \frac1m\sum\limits_{i=1}^m (x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T$
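
The estimators are just a class proportion, per-class means, and a pooled scatter matrix, so fitting needs no iterative optimization. A minimal NumPy sketch (the function name fit_gda and the synthetic data are my own), assuming X has shape (m, n) and y holds 0/1 labels:

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for binary GDA with a shared covariance."""
    m = len(y)
    phi = y.mean()                                   # fraction of y = 1
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    diff = X - np.where(y[:, None] == 1, mu1, mu0)   # x^(i) - mu_{y^(i)}
    Sigma = diff.T @ diff / m                        # pooled covariance
    return phi, mu0, mu1, Sigma

# Example: recover the parameters from synthetic data.
rng = np.random.default_rng(2)
y = rng.binomial(1, 0.4, size=2000)
X = rng.normal(size=(2000, 2)) + np.where(y[:, None] == 1, 2.0, 0.0)
print(fit_gda(X, y))  # phi ~ 0.4, mu0 ~ [0,0], mu1 ~ [2,2], Sigma ~ I
```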

Based on the two fitted Gaussians, we can draw a decision boundary between the classes.
(Figure: contours of the two class-conditional Gaussians and the resulting decision boundary; image source credited in the original post.)

Prediction

$\arg\max\limits_y P(y|x) = \arg\max\limits_y \frac{P(x|y)P(y)}{P(x)} = \arg\max\limits_y P(x|y)P(y)$
($P(x)$ does not depend on $y$, so it can be dropped.)
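
In code, the rule amounts to comparing $\log P(x|y) + \log P(y)$ for the two classes; working in log space avoids numerical underflow. A sketch (the name predict_gda is my own), assuming parameters as returned by the fit_gda sketch above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_gda(X, phi, mu0, mu1, Sigma):
    """arg max_y P(x|y) P(y), computed in log space for each row of X."""
    log_p0 = multivariate_normal(mu0, Sigma).logpdf(X) + np.log(1 - phi)
    log_p1 = multivariate_normal(mu1, Sigma).logpdf(X) + np.log(phi)
    return (log_p1 > log_p0).astype(int)
```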

GDA & Logistic Regression

(Figure: a page from my notes, plotting $P(y=1|x)$ for 1D data.)
The picture shows that when the data is 1D, the posterior $P(y=1|x)$ looks like the sigmoid function. In fact it is exactly a sigmoid of a linear function of $x$, and this also holds in higher dimensions; I won't prove it here, but see the numerical check at the end of this section.
GDA makes stronger assumptions than logistic regression, because the data in each class has to follow a Gaussian distribution.
When the data follows a Gaussian distribution, or when the dataset is very large (so that the central limit theorem makes the Gaussian assumption approximately hold), GDA works better than logistic regression.
Also, since the MLE has the closed-form solution above, fitting GDA requires no iterative optimization, so there is no issue of local optima.
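
As a sanity check of the sigmoid claim (a numerical demonstration, not a proof; parameter values are made up), the sketch below computes $P(y=1|x)$ two ways: directly via Bayes' rule, and as $\sigma(\theta^T x + \theta_0)$ with $\theta = \Sigma^{-1}(\mu_1-\mu_0)$, the standard shared-covariance result:

```python
import numpy as np
from scipy.stats import multivariate_normal

phi = 0.4
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
x = np.array([0.7, -1.2])

# Posterior P(y=1|x) via Bayes' rule...
p1 = multivariate_normal(mu1, Sigma).pdf(x) * phi
p0 = multivariate_normal(mu0, Sigma).pdf(x) * (1 - phi)
posterior = p1 / (p0 + p1)

# ...and as a sigmoid of a linear function of x.
theta = np.linalg.solve(Sigma, mu1 - mu0)
theta0 = (-0.5 * mu1 @ np.linalg.solve(Sigma, mu1)
          + 0.5 * mu0 @ np.linalg.solve(Sigma, mu0)
          + np.log(phi / (1 - phi)))
sigmoid = 1 / (1 + np.exp(-(theta @ x + theta0)))

print(posterior, sigmoid)  # identical up to floating-point error
```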