GMM & K-means: A Detailed Explanation of Gaussian Mixture Models and K-means Clustering


Gaussian mixture model (GMM)

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.


Interpretation from geometry

$p(x)$ is a weighted sum of multiple Gaussian distributions.

$$p(x)=\sum_{k=1}^{K} \alpha_{k} \cdot \mathcal{N}\left(x | \mu_{k}, \Sigma_{k}\right)$$
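
As a quick numerical illustration of this weighted-sum view, the sketch below evaluates a 1-D mixture density on a grid. The weights `alphas`, means `mus`, and standard deviations `sigmas` are made-up values chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D mixture parameters (alpha_k must sum to 1).
alphas = np.array([0.3, 0.5, 0.2])
mus    = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([0.5, 1.0, 0.8])

def gmm_density(x):
    """p(x) = sum_k alpha_k * N(x | mu_k, sigma_k^2)."""
    return sum(a * norm.pdf(x, loc=m, scale=s)
               for a, m, s in zip(alphas, mus, sigmas))

xs = np.linspace(-5.0, 6.0, 200)
print(gmm_density(xs).shape)   # (200,) -- one density value per grid point
```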

Interpretation from mixture model

Setup:

  • The total number of Gaussian distributions, $K$.

  • $x$, a sample (observed variable).

  • $z$, the class (mixture component) that the sample $x$ comes from (a latent variable), where

    • $z \in \{c_1, c_2, ..., c_K\}$.

    • $\sum_{k=1}^K p(z=c_k)= 1$. We denote $p(z=c_k)$ by $p_k$.

Mixture models are usually generative models, which means new data can be drawn from the model's distribution. Specifically, in a Gaussian mixture model (GMM), a new data point is generated by first selecting a class $c_k$ according to the probability distribution over the classes $c$, and then drawing a value from the Gaussian distribution of that class. Therefore, we can write $p(x)$ as follows:

$$\begin{aligned} p(x) &= \sum_z p(x,z) \\ &= \sum_{k=1}^{K} p(x, z=c_k) \\ &= \sum_{k=1}^{K} p(z=c_k) \cdot p(x|z=c_k) \\ &= \sum_{k=1}^{K} p_k \cdot \mathcal{N}(x | \mu_{k}, \Sigma_{k}) \end{aligned}$$

We see that the two interpretations lead to the same result.
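
To make the generative view concrete, here is a minimal sampling sketch: first draw a class $z$ from $p(z)$, then draw $x$ from that class's Gaussian. The 2-D parameters (`p_k`, `mu_k`, `Sigma_k`) are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component, 2-D GMM parameters.
p_k     = np.array([0.4, 0.6])                    # class priors p(z = c_k)
mu_k    = np.array([[0.0, 0.0], [4.0, 4.0]])      # component means
Sigma_k = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances

def sample_gmm(n):
    """Draw n samples: z ~ p(z), then x ~ N(mu_z, Sigma_z)."""
    zs = rng.choice(len(p_k), size=n, p=p_k)      # step 1: pick a class c_k
    xs = np.array([rng.multivariate_normal(mu_k[z], Sigma_k[z]) for z in zs])  # step 2
    return xs, zs

X, Z = sample_gmm(500)
print(X.shape, Z.shape)   # (500, 2) (500,)
```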

GMM Derivation

Setup:

  • $X$: observed data, where $X = (x_1, x_2, ..., x_N)$.

  • $\theta$: parameters of the model, where $\theta=\left\{p_{1}, p_{2}, \cdots, p_{K}, \mu_{1}, \mu_{2}, \cdots, \mu_{K}, \Sigma_{1}, \Sigma_{2}, \cdots, \Sigma_{K}\right\}$.

  • $p(x) = \sum_{k=1}^{K} p_k \cdot \mathcal{N}(x | \mu_{k}, \Sigma_{k})$.

  • $p(x,z) = p(z) \cdot p(x|z) = p_z \cdot \mathcal{N}(x | \mu_{z}, \Sigma_{z})$.

  • $p(z|x) = \frac{p(x,z)}{p(x)} = \frac{p_z \cdot \mathcal{N}(x | \mu_{z}, \Sigma_{z})}{\sum_{k=1}^K p_k \cdot \mathcal{N}(x | \mu_{k}, \Sigma_{k})}$.

Solve by MLE

$$\begin{aligned} \hat{\theta}_{MLE} &= \underset{\theta}{\operatorname{argmax}} \, p(X) \\ &=\underset{\theta}{\operatorname{argmax}} \, \log p(X) \\ &=\underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{N} \log p\left(x_{i}\right) \\ &=\underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{N} \log \left[\sum_{k=1}^{K} p_{k} \cdot \mathcal{N}\left(x_{i} | \mu_{k}, \Sigma_{k}\right)\right] \end{aligned}$$

I have mentioned multiple times in previous posts that the log of a summation is very hard to maximize directly. Therefore, we need an iterative approximation method to solve for the optimal $\theta$. Since $Z$ is a hidden variable, it is natural to use the EM algorithm.
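
Note that evaluating this log-likelihood is easy; only its closed-form maximization is intractable. Below is a small evaluation sketch (the toy data, parameters, and the helper name `gmm_log_likelihood` are my own, for illustration only).

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, p, mus, Sigmas):
    """log p(X) = sum_i log sum_k p_k * N(x_i | mu_k, Sigma_k), computed stably."""
    # log_pdf[i, k] = log N(x_i | mu_k, Sigma_k)
    log_pdf = np.stack([multivariate_normal.logpdf(X, mean=m, cov=S)
                        for m, S in zip(mus, Sigmas)], axis=1)
    return logsumexp(np.log(p) + log_pdf, axis=1).sum()

# Toy call, just to show the shapes involved.
X = np.random.default_rng(0).normal(size=(100, 2))
p = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [2.0, 2.0]])
Sigmas = np.array([np.eye(2), np.eye(2)])
print(gmm_log_likelihood(X, p, mus, Sigmas))
```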

Solve by EM Algorithm

Check my previous post to see how the EM algorithm is derived. Here is a brief review of the algorithm:

  1. Initialize parameters $\theta_0$.

Iterate between steps 2 and 3 until convergence:

  2. Expectation (E) step:

$$\begin{aligned} Q(\theta, \theta^{(t)}) &= \sum_Z P(Z|X,\theta^{(t)}) \cdot \log p(X,Z|\theta) \\ &= E_{Z \sim P(Z|X,\theta^{(t)})}[\log p(X,Z|\theta)] \end{aligned}$$

  3. Maximization (M) step:

Compute the parameters that maximize $Q(\theta, \theta^{(t)})$ found in the E step, and update the parameters to $\theta^{(t+1)}$. That is,

$$\theta^{(t+1)} = \underset{\theta}{\operatorname{argmax}} \, Q(\theta, \theta^{(t)})$$

The derivation is as follows.

E step:

$$\begin{aligned} Q(\theta, \theta^{(t)}) &= E_{Z \sim P(Z|X,\theta^{(t)})}[\log p(X,Z|\theta)] \\ &= \sum_Z \log p(X,Z|\theta) \cdot P(Z|X,\theta^{(t)})\\ &= \sum_{Z}\left[\log \prod_{i=1}^{N} p\left(x_{i}, z_{i} | \theta\right)\right] \prod_{i=1}^{N} p\left(z_{i} | x_{i}, \theta^{(t)}\right) \\ &= \sum_{Z}\left[\sum_{i=1}^{N} \log p\left(x_{i}, z_{i} | \theta\right)\right] \prod_{i=1}^{N} p\left(z_{i} | x_{i}, \theta^{(t)}\right) &&(1)\\ \end{aligned}$$

We can expand equation $(1)$ and first simplify its leading term:

$$\begin{aligned} & \quad \sum_{Z} \log p\left(x_{1}, z_{1} | \theta\right) \cdot \prod_{i=1}^{N} p\left(z_{i} | x_{i}, \theta^{(t)}\right) \\ &= \sum_{z_1}\sum_{z_2}...\sum_{z_N} \log p\left(x_{1}, z_{1} | \theta\right) \cdot \prod_{i=1}^{N} p\left(z_{i} | x_{i}, \theta^{(t)}\right) \\ &= \sum_{z_1}\sum_{z_2}...\sum_{z_N} \log p\left(x_{1}, z_{1} | \theta\right) \cdot p\left(z_{1} | x_{1}, \theta^{(t)}\right) \cdot \prod_{i=2}^{N} p\left(z_{i} | x_{i}, \theta^{(t)}\right) && (2)\\ &= \sum_{z_1} \log p\left(x_{1}, z_{1} | \theta\right) \cdot p\left(z_{1} | x_{1}, \theta^{(t)}\right) \sum_{z_2}...\sum_{z_N} \, \prod_{i=2}^{N} p\left(z_{i} | x_{i}, \theta^{(t)}\right) && (3)\\ &= \sum_{z_1} \log p\left(x_{1}, z_{1} | \theta\right) \cdot p\left(z_{1} | x_{1}, \theta^{(t)}\right) \sum_{z_2}...\sum_{z_N} \, p\left(z_{2} | x_{2}, \theta^{(t)}\right) ... p\left(z_{N} | x_{N}, \theta^{(t)}\right) && (4)\\ &= \sum_{z_1} \log p\left(x_{1}, z_{1} | \theta\right) \cdot p\left(z_{1} | x_{1}, \theta^{(t)}\right) \sum_{z_2} p\left(z_{2} | x_{2}, \theta^{(t)}\right) ...\sum_{z_N} p\left(z_{N} | x_{N}, \theta^{(t)}\right) && (5)\\ &= \sum_{z_1} \log p\left(x_{1}, z_{1} | \theta\right) \cdot p\left(z_{1} | x_{1}, \theta^{(t)}\right) && (6) \\ \end{aligned}$$

Remark:

  • $(2) \to (3)$: since the term $\log p\left(x_{1}, z_{1} | \theta\right) \cdot p\left(z_{1} | x_{1}, \theta^{(t)}\right)$ depends only on $z_1$, we can pull it out in front of the remaining summations.

  • $(4) \to (5)$: we use the same trick as in $(2) \to (3)$.

  • $(5) \to (6)$: every summation except the first equals one, since $\sum_{z_i} p\left(z_{i} | x_{i}, \theta^{(t)}\right) = 1$ for each $i$.

The remaining terms in $(1)$ can be simplified in the same way, so we have

$$\begin{aligned} Q(\theta, \theta^{(t)}) &= \sum_{Z}\left[\sum_{i=1}^{N} \log p\left(x_{i}, z_{i} | \theta\right)\right] \prod_{i=1}^{N} p\left(z_{i} | x_{i}, \theta^{(t)}\right)\\ &= \sum_{z_1} \log p\left(x_{1}, z_{1} | \theta\right) \cdot p\left(z_{1} | x_{1}, \theta^{(t)}\right) + ... + \sum_{z_N} \log p\left(x_{N}, z_{N} | \theta\right) \cdot p\left(z_{N} | x_{N}, \theta^{(t)}\right) \\ &= \sum_{i=1}^N \sum_{z_i} \log p\left(x_{i}, z_{i} | \theta\right) \cdot p\left(z_{i} | x_{i}, \theta^{(t)}\right) \\ &= \sum_{k=1}^K \sum_{i=1}^N \log p\left(x_{i}, z_{i}=c_k | \theta\right) \cdot p\left(z_{i} = c_k | x_{i}, \theta^{(t)}\right) \\ &= \sum_{k=1}^K \sum_{i=1}^N [\log p_k + \log \mathcal{N}(x_i | \mu_{k}, \Sigma_{k})] \cdot p\left(z_{i} = c_k | x_{i}, \theta^{(t)}\right) \\ \end{aligned}$$

where

$$p\left(z_{i} = c_k | x_{i}, \theta^{(t)}\right) = \frac{p_{k}^{(t)} \, \mathcal{N}\left(x_{i} | \mu_{k}^{(t)}, \Sigma_{k}^{(t)}\right)}{\sum_{j=1}^{K} p_{j}^{(t)} \, \mathcal{N}\left(x_{i} | \mu_{j}^{(t)}, \Sigma_{j}^{(t)}\right)}$$
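
In code, the E step simply evaluates this responsibility for every data point and every component. A minimal NumPy sketch (the function name `e_step` and the array layout are my own choices, not from the original post):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, p, mus, Sigmas):
    """Return gamma[i, k] = p(z_i = c_k | x_i, theta^(t))."""
    N, K = X.shape[0], len(p)
    gamma = np.zeros((N, K))
    for k in range(K):
        # Unnormalized posterior: p_k * N(x_i | mu_k, Sigma_k)
        gamma[:, k] = p[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize over components k
    return gamma
```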

M step:

We can update $p^{(t+1)}$, $\mu^{(t+1)}$, and $\Sigma^{(t+1)}$ separately. We first find $p^{(t+1)}$, where $p^{(t+1)} = (p_1^{(t+1)}, p_2^{(t+1)}, ..., p_K^{(t+1)})$:

$$p^{(t+1)} = \underset{p}{\operatorname{argmax}} \sum_{k=1}^K \sum_{i=1}^N \log p_k \cdot p\left(z_{i} = c_k | x_{i}, \theta^{(t)}\right)$$

subject to the constraint $\sum_{k=1}^K p_k = 1$. This is a constrained optimization problem, so we solve it by introducing a Lagrange multiplier.

$$\begin{aligned} L\left(p, \lambda \right) &=\sum_{k=1}^{K} \sum_{i=1}^{N} \log p_{k} \cdot p\left(z_{i}=c_k | x_{i}, \theta^{(t)}\right) + \lambda\left(\sum_{k=1}^{K} p_{k} - 1\right) && (7)\\ \frac{\partial L}{\partial p_{k}} &= \sum_{i=1}^{N} \frac{1}{p_{k}} \, p\left(z_{i}=c_k | x_{i}, \theta^{(t)}\right)+\lambda =0 \end{aligned}$$

Multiplying both sides by $p_k$ and then summing the resulting equations over $k$, we have

$$\sum_{i=1}^{N} p\left(z_{i}=c_k | x_{i}, \theta^{(t)}\right) + p_k \cdot \lambda = 0$$

$$\sum_{i=1}^{N} \sum_{k=1}^{K} p\left(z_{i}=c_k | x_{i}, \theta^{(t)}\right) + \sum_{k=1}^{K} p_k \cdot \lambda = 0$$

Since $\sum_{k=1}^{K} p\left(z_{i}=c_k | x_{i}, \theta^{(t)}\right) = 1$ and $\sum_{k=1}^{K} p_k = 1$, we have

$$\lambda = -N$$

Substituting $\lambda = -N$ back into the stationarity condition $\frac{\partial L}{\partial p_k} = 0$ from $(7)$, we have

$$p_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^{N} p\left(z_{i}=c_k | x_{i}, \theta^{(t)}\right)$$

Note that solving for $\mu^{(t+1)}$ and $\Sigma^{(t+1)}$ follows the same procedure as solving for $p^{(t+1)}$, except that they have no constraint, so no Lagrange multiplier is needed.
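
For completeness (these closed-form updates are standard results, not derived above): writing $\gamma_{ik}^{(t)} = p\left(z_{i}=c_k | x_{i}, \theta^{(t)}\right)$ and $N_k = \sum_{i=1}^{N} \gamma_{ik}^{(t)}$, setting the corresponding derivatives of $Q(\theta, \theta^{(t)})$ to zero gives

$$\mu_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}^{(t)} \, x_i, \qquad \Sigma_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}^{(t)} \, \left(x_i - \mu_k^{(t+1)}\right)\left(x_i - \mu_k^{(t+1)}\right)^{\top}$$

A compact M-step sketch that consumes the responsibilities `gamma` produced by the `e_step` code above (again a sketch with my own naming, not the post's original code):

```python
import numpy as np

def m_step(X, gamma):
    """Update (p, mus, Sigmas) from responsibilities gamma[i, k]."""
    N, K = gamma.shape
    Nk = gamma.sum(axis=0)                # effective number of points per component
    p = Nk / N                            # p_k^(t+1) = N_k / N
    mus = (gamma.T @ X) / Nk[:, None]     # mu_k^(t+1): responsibility-weighted means
    Sigmas = np.zeros((K, X.shape[1], X.shape[1]))
    for k in range(K):
        diff = X - mus[k]
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
    return p, mus, Sigmas
```

One EM iteration then alternates `gamma = e_step(X, p, mus, Sigmas)` with `p, mus, Sigmas = m_step(X, gamma)`, repeated until the log-likelihood stops improving.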

K-means


K-means is one of the most popular clustering algorithms. It stores $K$ centroids that it uses to define clusters: a point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid.

The Algorithm:

  1. Initialize $K$ centroids.

  2. Iterate until convergence:

    a. Hard-assign each data point to its closest centroid.

    b. Move each centroid to the center of the data points assigned to it.

Notice that this process is very similar to the way we update the parameters of a GMM, so we call it an EM-style method for approximating the optimal parameters.

The objective function to minimize is

$$L = \sum_{i=1}^{N} \sum_{k=1}^K \gamma_{ik} \cdot \| x_i - \mu_k\|^2_2$$

where $\gamma_{ik} = 1$ if $x_i \in c_k$ and $0$ otherwise. Note that $\gamma_{ik}$ here is a hard label, which can only be $0$ or $1$, whereas a GMM assigns soft responsibilities; in this sense, GMM is a more general model than K-means. In K-means, we can think of $\gamma_{ik}$ as the latent variable and $\mu$ as the parameter we want to optimize.
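
Putting the two steps together, below is a minimal NumPy K-means sketch (random-sample initialization; the function name and return values are my own choices, not a specific library API):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: hard-assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]          # step 1: initialize
    for _ in range(n_iters):
        # step 2a: hard assignment (gamma_ik stored as an index array)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2b: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):                     # converged
            break
        centroids = new_centroids
    # Final assignment and the objective L from above.
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    loss = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, loss

# Usage on toy data:
X = np.random.default_rng(1).normal(size=(300, 2))
centroids, labels, loss = kmeans(X, K=3)
print(centroids.shape, loss)   # (3, 2) and the final objective value
```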


References:

  • http://people.csail.mit.edu/dsontag/courses/ml12/slides/lecture21.pdf
  • https://www.cs.toronto.edu/~jlucas/teaching/csc411/lectures/tut8_handout.pdf
  • https://space.bilibili.com/97068901
  • https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
  • Kevin P. Murphy. Machine Learning: A Probabilistic Perspective.
