Gaussian mixture model (GMM)
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.

Interpretation from geometry
$p(x)$ is a weighted sum of multiple Gaussian distributions:
$$p(x) = \sum_{k=1}^{K} \alpha_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
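As a quick sanity check on this formula, here is a minimal sketch (my own toy example, assuming 1-D components and using `numpy` and `scipy.stats`) that evaluates the weighted sum directly:

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D GMM with K = 2 components (made-up parameters, purely illustrative).
alphas = np.array([0.3, 0.7])   # mixing weights, must sum to 1
mus    = np.array([-2.0, 3.0])  # component means
sigmas = np.array([0.5, 1.5])   # component standard deviations

def gmm_pdf(x):
    """p(x) = sum_k alpha_k * N(x | mu_k, sigma_k^2)."""
    return sum(a * norm.pdf(x, loc=m, scale=s)
               for a, m, s in zip(alphas, mus, sigmas))

print(gmm_pdf(0.0))  # density of the mixture at x = 0
```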
Interpretation from mixture model
Setup:
- The total number of Gaussian distributions, $K$.
- $x$, a sample (observed variable).
- $z$, the component (class) that the sample $x$ is drawn from (a latent variable), where
  - $z \in \{c_1, c_2, \dots, c_K\}$,
  - $\sum_{k=1}^K p(z = c_k) = 1$. We denote $p(z = c_k)$ by $p_k$.
Mixture models are usually generative models, which means new data can be drawn from the model's distribution. Specifically, in the Gaussian mixture model (GMM), a new data point is generated by first selecting a class $c_k$ according to the probability distribution over the classes $c$, and then drawing a value from the Gaussian distribution of that class. Therefore, we can write $p(x)$ as follows:
$$p(x) = \sum_z p(x, z) = \sum_{k=1}^K p(x, z = c_k) = \sum_{k=1}^K p(z = c_k) \cdot p(x \mid z = c_k) = \sum_{k=1}^K p_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
We see that the two interpretations lead to the same result.
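To make the generative story concrete, here is a minimal sampling sketch (again with made-up 1-D parameters; names are my own): first pick a class according to $(p_1, \dots, p_K)$, then draw $x$ from that class's Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D GMM (illustrative parameters).
p      = np.array([0.3, 0.7])   # p_k = p(z = c_k)
mus    = np.array([-2.0, 3.0])
sigmas = np.array([0.5, 1.5])

def sample_gmm(n):
    # Step 1: pick a component c_k for each sample according to p.
    z = rng.choice(len(p), size=n, p=p)
    # Step 2: draw from the chosen component's Gaussian N(mu_z, sigma_z^2).
    return rng.normal(loc=mus[z], scale=sigmas[z])

samples = sample_gmm(1000)
```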
GMM Derivation
Setup:
- $X$: observed data, where $X = (x_1, x_2, \dots, x_N)$.
- $\theta$: parameters of the model, where $\theta = \{p_1, p_2, \dots, p_K, \mu_1, \mu_2, \dots, \mu_K, \Sigma_1, \Sigma_2, \dots, \Sigma_K\}$.
- $p(x) = \sum_{k=1}^K p_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k)$.
- $p(x, z) = p(z) \cdot p(x \mid z) = p_z \cdot \mathcal{N}(x \mid \mu_z, \Sigma_z)$.
- $p(z \mid x) = \dfrac{p(x, z)}{p(x)} = \dfrac{p_z \cdot \mathcal{N}(x \mid \mu_z, \Sigma_z)}{\sum_{k=1}^K p_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k)}$.
Solve by MLE
$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta\, p(X) = \arg\max_\theta\, \log p(X) = \arg\max_\theta \sum_{i=1}^N \log p(x_i) = \arg\max_\theta \sum_{i=1}^N \log \left[ \sum_{k=1}^K p_k \cdot \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right]$$
I have mentioned multiple times in previous posts that the log of a summation is very hard to optimize directly. Therefore, we need an approximate, iterative method to solve for the optimal $\theta$. Since $Z$ is a hidden variable, it is natural to use the EM algorithm.
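Concretely, the objective itself is easy to evaluate even though its argmax has no closed form. A small sketch of the log-likelihood for 1-D data (the helper function is my own; it uses `scipy.special.logsumexp` for numerical stability, which is not discussed in this post):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def gmm_log_likelihood(X, p, mus, sigmas):
    """sum_i log( sum_k p_k * N(x_i | mu_k, sigma_k^2) ) for 1-D data X of shape (N,)."""
    # log_terms[i, k] = log p_k + log N(x_i | mu_k, sigma_k^2)
    log_terms = np.log(p) + norm.logpdf(X[:, None], loc=mus, scale=sigmas)
    # Sum over k inside the log (logsumexp), then sum over data points i.
    return logsumexp(log_terms, axis=1).sum()
```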
Solve by EM Algorithm
Check my previous post to see how the EM algorithm is derived. Here is a brief review of the algorithm:
1. Initialize parameters $\theta^{(0)}$.
Iterate between steps 2 and 3 until convergence:
2. Expectation (E) step:
$$Q(\theta, \theta^{(t)}) = \sum_Z P(Z \mid X, \theta^{(t)}) \cdot \log p(X, Z \mid \theta) = \mathbb{E}_{Z \sim P(Z \mid X, \theta^{(t)})}\left[\log p(X, Z \mid \theta)\right]$$
3. Maximization (M) step:
Compute the parameters that maximize $Q(\theta, \theta^{(t)})$ found in the E step, and update the parameters to $\theta^{(t+1)}$. That is,
$$\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)})$$
The derivation is the following:
E step:
$$\begin{aligned}
Q(\theta, \theta^{(t)}) &= \mathbb{E}_{Z \sim P(Z \mid X, \theta^{(t)})}\left[\log p(X, Z \mid \theta)\right] \\
&= \sum_Z \log p(X, Z \mid \theta) \cdot P(Z \mid X, \theta^{(t)}) \\
&= \sum_Z \left[\log \prod_{i=1}^N p(x_i, z_i \mid \theta)\right] \prod_{i=1}^N p(z_i \mid x_i, \theta^{(t)}) \\
&= \sum_Z \left[\sum_{i=1}^N \log p(x_i, z_i \mid \theta)\right] \prod_{i=1}^N p(z_i \mid x_i, \theta^{(t)}) && (1)
\end{aligned}$$
We can expand equation (1) and simplify its first term:
$$\begin{aligned}
\sum_Z \log p(x_1, z_1 \mid \theta) \cdot \prod_{i=1}^N p(z_i \mid x_i, \theta^{(t)})
&= \sum_{z_1} \sum_{z_2} \cdots \sum_{z_N} \log p(x_1, z_1 \mid \theta) \cdot \prod_{i=1}^N p(z_i \mid x_i, \theta^{(t)}) \\
&= \sum_{z_1} \sum_{z_2} \cdots \sum_{z_N} \log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)}) \cdot \prod_{i=2}^N p(z_i \mid x_i, \theta^{(t)}) && (2) \\
&= \sum_{z_1} \log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)}) \sum_{z_2} \cdots \sum_{z_N} \prod_{i=2}^N p(z_i \mid x_i, \theta^{(t)}) && (3) \\
&= \sum_{z_1} \log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)}) \sum_{z_2} \cdots \sum_{z_N} p(z_2 \mid x_2, \theta^{(t)}) \cdots p(z_N \mid x_N, \theta^{(t)}) && (4) \\
&= \sum_{z_1} \log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)}) \sum_{z_2} p(z_2 \mid x_2, \theta^{(t)}) \cdots \sum_{z_N} p(z_N \mid x_N, \theta^{(t)}) && (5) \\
&= \sum_{z_1} \log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)}) && (6)
\end{aligned}$$
Remark:
- (2) → (3): since the term $\log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)})$ depends only on $z_1$, we can pull it out in front of the remaining summations.
- (4) → (5): we use the same trick as in (2) → (3).
- (5) → (6): every summation $\sum_{z_i} p(z_i \mid x_i, \theta^{(t)})$ equals one except the first summation.
Therefore, the rest of the terms can be simplified in the same way, so we have
$$\begin{aligned}
Q(\theta, \theta^{(t)}) &= \sum_Z \left[\sum_{i=1}^N \log p(x_i, z_i \mid \theta)\right] \prod_{i=1}^N p(z_i \mid x_i, \theta^{(t)}) \\
&= \sum_{z_1} \log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)}) + \dots + \sum_{z_N} \log p(x_N, z_N \mid \theta) \cdot p(z_N \mid x_N, \theta^{(t)}) \\
&= \sum_{i=1}^N \sum_{z_i} \log p(x_i, z_i \mid \theta) \cdot p(z_i \mid x_i, \theta^{(t)}) \\
&= \sum_{k=1}^K \sum_{i=1}^N \log p(x_i, z_i = c_k \mid \theta) \cdot p(z_i = c_k \mid x_i, \theta^{(t)}) \\
&= \sum_{k=1}^K \sum_{i=1}^N \left[\log p_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right] \cdot p(z_i = c_k \mid x_i, \theta^{(t)})
\end{aligned}$$
where
$$p(z_i = c_k \mid x_i, \theta^{(t)}) = \frac{p_k^{(t)}\, \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{k'=1}^K p_{k'}^{(t)}\, \mathcal{N}(x_i \mid \mu_{k'}^{(t)}, \Sigma_{k'}^{(t)})}$$
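In code, the E step boils down to computing these posteriors (often called responsibilities) for every data point and every component. A minimal numpy sketch for 1-D data (function and variable names are my own):

```python
import numpy as np
from scipy.stats import norm

def e_step(X, p, mus, sigmas):
    """Return gamma[i, k] = p(z_i = c_k | x_i, theta^(t)) for 1-D data X of shape (N,)."""
    # Unnormalized posterior: p_k * N(x_i | mu_k, sigma_k^2), shape (N, K).
    weighted = p * norm.pdf(X[:, None], loc=mus, scale=sigmas)
    # Normalize each row so that it sums to one over k.
    return weighted / weighted.sum(axis=1, keepdims=True)
```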
M step:
We can update $p^{(t+1)}$, $\mu^{(t+1)}$, and $\Sigma^{(t+1)}$ separately. We first find $p^{(t+1)}$, where $p^{(t+1)} = (p_1^{(t+1)}, p_2^{(t+1)}, \dots, p_K^{(t+1)})$:
$$p^{(t+1)} = \arg\max_p \sum_{k=1}^K \sum_{i=1}^N \log p_k \cdot p(z_i = c_k \mid x_i, \theta^{(t)})$$
with the constraint $\sum_{k=1}^K p_k = 1$. Clearly, this is a constrained optimization problem, so we solve it by introducing a Lagrange multiplier.
$$\begin{aligned}
L(p, \lambda) &= \sum_{k=1}^K \sum_{i=1}^N \log p_k \cdot p(z_i = c_k \mid x_i, \theta^{(t)}) + \lambda \left(\sum_{k=1}^K p_k - 1\right) \\
\frac{\partial L}{\partial p_k} &= \sum_{i=1}^N \frac{1}{p_k}\, p(z_i = c_k \mid x_i, \theta^{(t)}) + \lambda = 0 && (7)
\end{aligned}$$
Then multiply both sides by $p_k$, and sum over all $k$:
$$\begin{aligned}
\sum_{i=1}^N p(z_i = c_k \mid x_i, \theta^{(t)}) + p_k \cdot \lambda &= 0 \\
\sum_{i=1}^N \sum_{k=1}^K p(z_i = c_k \mid x_i, \theta^{(t)}) + \sum_{k=1}^K p_k \cdot \lambda &= 0
\end{aligned}$$
Since $\sum_{k=1}^K p(z_i = c_k \mid x_i, \theta^{(t)}) = 1$ and $\sum_{k=1}^K p_k = 1$, the second equation becomes $N + \lambda = 0$, so
$$\lambda = -N$$
Putting $\lambda = -N$ back into (7), we have
$$p_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^N p(z_i = c_k \mid x_i, \theta^{(t)})$$
Note that solving for $\mu^{(t+1)}$ and $\Sigma^{(t+1)}$ is essentially the same as solving for $p^{(t+1)}$, except that they do not have any constraint.
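For completeness, setting the derivatives of $Q(\theta, \theta^{(t)})$ with respect to $\mu_k$ and $\Sigma_k$ to zero yields the standard textbook updates (stated here without the full derivation), where $\gamma_{ik} = p(z_i = c_k \mid x_i, \theta^{(t)})$:
$$\mu_k^{(t+1)} = \frac{\sum_{i=1}^N \gamma_{ik}\, x_i}{\sum_{i=1}^N \gamma_{ik}}, \qquad
\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^N \gamma_{ik}\, \bigl(x_i - \mu_k^{(t+1)}\bigr)\bigl(x_i - \mu_k^{(t+1)}\bigr)^T}{\sum_{i=1}^N \gamma_{ik}}$$
A compact 1-D numpy sketch of the corresponding M step, meant to pair with the `e_step` sketch above (illustrative only; names are my own):

```python
import numpy as np

def m_step(X, gamma):
    """Update (p, mus, sigmas) for 1-D data X from responsibilities gamma[i, k]."""
    Nk = gamma.sum(axis=0)                        # effective number of points per component
    p = Nk / len(X)                               # p_k^(t+1)
    mus = (gamma * X[:, None]).sum(axis=0) / Nk   # mu_k^(t+1)
    # Sigma_k^(t+1) reduces to a weighted variance in the 1-D case.
    sigmas = np.sqrt((gamma * (X[:, None] - mus) ** 2).sum(axis=0) / Nk)
    return p, mus, sigmas
```

Alternating `e_step` and `m_step` until the log-likelihood stops improving gives the full EM loop for the GMM.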
K-means

K-means is one of the most popular “clustering” algorithms. K-means stores $K$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster’s centroid than to any other centroid.
The Algorithm:
- Initialize $K$ centroids.
- Iterate until convergence:
  a. Hard-assign each data point to its closest centroid.
  b. Move each centroid to the center (mean) of the data points assigned to it.
Notice that this process is very similar to the way we update the parameters of a GMM, and we call it an EM-style method for approximating the optimal parameters.
The objective function to minimize is
$$L = \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \cdot \lVert x_i - \mu_k \rVert_2^2$$
where $\gamma_{ik} = 1$ if $x_i \in c_k$ and $0$ otherwise. Note that $\gamma_{ik}$ here is a hard label, which can only be 0 or 1, so GMM is a more general model than K-means. In K-means, we can think of $\gamma_{ik}$ as the latent variable and $\mu$ as the parameter we want to optimize.
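A minimal numpy sketch of this two-step loop (my own illustrative implementation: centroids initialized from random data points, no handling of empty clusters):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means for data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K distinct data points at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: hard-assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (N, K)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (assumes no cluster becomes empty).
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```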
Reference:
- http://people.csail.mit.edu/dsontag/courses/ml12/slides/lecture21.pdf
- https://www.cs.toronto.edu/~jlucas/teaching/csc411/lectures/tut8_handout.pdf
- https://space.bilibili.com/97068901
- https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective.