Gaussian mixture model (GMM)
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.

Interpretation from geometry
$p(x)$ is a weighted sum of multiple Gaussian distributions:
$$p(x) = \sum_{k=1}^{K} \alpha_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
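As a quick sanity check on this formula, here is a minimal sketch (my own toy example, assuming 1-D components and using `numpy` and `scipy.stats`) that evaluates the weighted sum directly:

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D GMM with K = 2 components (made-up parameters, purely illustrative).
alphas = np.array([0.3, 0.7])   # mixing weights, must sum to 1
mus    = np.array([-2.0, 3.0])  # component means
sigmas = np.array([0.5, 1.5])   # component standard deviations

def gmm_pdf(x):
    """p(x) = sum_k alpha_k * N(x | mu_k, sigma_k^2)."""
    return sum(a * norm.pdf(x, loc=m, scale=s)
               for a, m, s in zip(alphas, mus, sigmas))

print(gmm_pdf(0.0))  # density of the mixture at x = 0
```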
Interpretation from mixture model
Setup:
- The total number of Gaussian distributions, $K$.
- $x$, a sample (observed variable).
- $z$, the component (class) that the sample $x$ is drawn from (a latent variable), where
  - $z \in \{c_1, c_2, \dots, c_K\}$,
  - $\sum_{k=1}^K p(z = c_k) = 1$. We denote $p(z = c_k)$ by $p_k$.
Mixture models are usually generative models, which means new data can be drawn from the model's distribution. Specifically, in the Gaussian mixture model (GMM), a new data point is generated by first selecting a class $c_k$ according to the probability distribution over the classes $c$, and then drawing a value from the Gaussian distribution of that class. Therefore, we can write $p(x)$ as follows:
$$p(x) = \sum_z p(x, z) = \sum_{k=1}^K p(x, z = c_k) = \sum_{k=1}^K p(z = c_k) \cdot p(x \mid z = c_k) = \sum_{k=1}^K p_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
We see that the two interpretations lead to the same result.
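To make the generative story concrete, here is a minimal sampling sketch (again with made-up 1-D parameters; names are my own): first pick a class according to $(p_1, \dots, p_K)$, then draw $x$ from that class's Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D GMM (illustrative parameters).
p      = np.array([0.3, 0.7])   # p_k = p(z = c_k)
mus    = np.array([-2.0, 3.0])
sigmas = np.array([0.5, 1.5])

def sample_gmm(n):
    # Step 1: pick a component c_k for each sample according to p.
    z = rng.choice(len(p), size=n, p=p)
    # Step 2: draw from the chosen component's Gaussian N(mu_z, sigma_z^2).
    return rng.normal(loc=mus[z], scale=sigmas[z])

samples = sample_gmm(1000)
```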
GMM Derivation
Setup:
- $X$: observed data, where $X = (x_1, x_2, \dots, x_N)$.
- $\theta$: parameters of the model, where $\theta = \{p_1, p_2, \dots, p_K, \mu_1, \mu_2, \dots, \mu_K, \Sigma_1, \Sigma_2, \dots, \Sigma_K\}$.
- $p(x) = \sum_{k=1}^K p_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k)$.
- $p(x, z) = p(z) \cdot p(x \mid z) = p_z \cdot \mathcal{N}(x \mid \mu_z, \Sigma_z)$.
- $p(z \mid x) = \dfrac{p(x, z)}{p(x)} = \dfrac{p_z \cdot \mathcal{N}(x \mid \mu_z, \Sigma_z)}{\sum_{k=1}^K p_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k)}$.
Solve by MLE
$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta\, p(X) = \arg\max_\theta\, \log p(X) = \arg\max_\theta \sum_{i=1}^N \log p(x_i) = \arg\max_\theta \sum_{i=1}^N \log \left[ \sum_{k=1}^K p_k \cdot \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right]$$
I have mentioned multiple times in previous posts that the log of a summation is very hard to optimize directly. Therefore, we need an approximate, iterative method to solve for the optimal $\theta$. Since $Z$ is a hidden variable, it is natural to use the EM algorithm.
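Concretely, the objective itself is easy to evaluate even though its argmax has no closed form. A small sketch of the log-likelihood for 1-D data (the helper function is my own; it uses `scipy.special.logsumexp` for numerical stability, which is not discussed in this post):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def gmm_log_likelihood(X, p, mus, sigmas):
    """sum_i log( sum_k p_k * N(x_i | mu_k, sigma_k^2) ) for 1-D data X of shape (N,)."""
    # log_terms[i, k] = log p_k + log N(x_i | mu_k, sigma_k^2)
    log_terms = np.log(p) + norm.logpdf(X[:, None], loc=mus, scale=sigmas)
    # Sum over k inside the log (logsumexp), then sum over data points i.
    return logsumexp(log_terms, axis=1).sum()
```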
Solve by EM Algorithm
Check my previous post to see how the EM algorithm is derived. Here is a brief review of the algorithm:
1. Initialize parameters $\theta^{(0)}$.
Iterate between steps 2 and 3 until convergence:
2. Expectation (E) step:
$$Q(\theta, \theta^{(t)}) = \sum_Z P(Z \mid X, \theta^{(t)}) \cdot \log p(X, Z \mid \theta) = \mathbb{E}_{Z \sim P(Z \mid X, \theta^{(t)})}\left[\log p(X, Z \mid \theta)\right]$$
3. Maximization (M) step:
Compute the parameters that maximize $Q(\theta, \theta^{(t)})$ found in the E step, and update the parameters to $\theta^{(t+1)}$. That is,
$$\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)})$$
The derivation is the following:
E step:
$$\begin{aligned}
Q(\theta, \theta^{(t)}) &= \mathbb{E}_{Z \sim P(Z \mid X, \theta^{(t)})}\left[\log p(X, Z \mid \theta)\right] \\
&= \sum_Z \log p(X, Z \mid \theta) \cdot P(Z \mid X, \theta^{(t)}) \\
&= \sum_Z \left[\log \prod_{i=1}^N p(x_i, z_i \mid \theta)\right] \prod_{i=1}^N p(z_i \mid x_i, \theta^{(t)}) \\
&= \sum_Z \left[\sum_{i=1}^N \log p(x_i, z_i \mid \theta)\right] \prod_{i=1}^N p(z_i \mid x_i, \theta^{(t)}) && (1)
\end{aligned}$$
We can expand equation (1) and simplify its first term:
$$\begin{aligned}
\sum_Z \log p(x_1, z_1 \mid \theta) \cdot \prod_{i=1}^N p(z_i \mid x_i, \theta^{(t)})
&= \sum_{z_1} \sum_{z_2} \cdots \sum_{z_N} \log p(x_1, z_1 \mid \theta) \cdot \prod_{i=1}^N p(z_i \mid x_i, \theta^{(t)}) \\
&= \sum_{z_1} \sum_{z_2} \cdots \sum_{z_N} \log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)}) \cdot \prod_{i=2}^N p(z_i \mid x_i, \theta^{(t)}) && (2) \\
&= \sum_{z_1} \log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)}) \sum_{z_2} \cdots \sum_{z_N} \prod_{i=2}^N p(z_i \mid x_i, \theta^{(t)}) && (3) \\
&= \sum_{z_1} \log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)}) \sum_{z_2} \cdots \sum_{z_N} p(z_2 \mid x_2, \theta^{(t)}) \cdots p(z_N \mid x_N, \theta^{(t)}) && (4) \\
&= \sum_{z_1} \log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)}) \sum_{z_2} p(z_2 \mid x_2, \theta^{(t)}) \cdots \sum_{z_N} p(z_N \mid x_N, \theta^{(t)}) && (5) \\
&= \sum_{z_1} \log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)}) && (6)
\end{aligned}$$
Remark:
- (2) → (3): since the term $\log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)})$ depends only on $z_1$, we can pull it out in front of the remaining summations.
- (4) → (5): we use the same trick as in (2) → (3).
- (5) → (6): every summation $\sum_{z_i} p(z_i \mid x_i, \theta^{(t)})$ equals one except the first summation.
Therefore, the rest of the terms can be simplified in the same way, so we have
$$\begin{aligned}
Q(\theta, \theta^{(t)}) &= \sum_Z \left[\sum_{i=1}^N \log p(x_i, z_i \mid \theta)\right] \prod_{i=1}^N p(z_i \mid x_i, \theta^{(t)}) \\
&= \sum_{z_1} \log p(x_1, z_1 \mid \theta) \cdot p(z_1 \mid x_1, \theta^{(t)}) + \dots + \sum_{z_N} \log p(x_N, z_N \mid \theta) \cdot p(z_N \mid x_N, \theta^{(t)}) \\
&= \sum_{i=1}^N \sum_{z_i} \log p(x_i, z_i \mid \theta) \cdot p(z_i \mid x_i, \theta^{(t)}) \\
&= \sum_{k=1}^K \sum_{i=1}^N \log p(x_i, z_i = c_k \mid \theta) \cdot p(z_i = c_k \mid x_i, \theta^{(t)}) \\
&= \sum_{k=1}^K \sum_{i=1}^N \left[\log p_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right] \cdot p(z_i = c_k \mid x_i, \theta^{(t)})
\end{aligned}$$
where
$$p(z_i = c_k \mid x_i, \theta^{(t)}) = \frac{p_k^{(t)}\, \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{k'=1}^K p_{k'}^{(t)}\, \mathcal{N}(x_i \mid \mu_{k'}^{(t)}, \Sigma_{k'}^{(t)})}$$
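In code, the E step boils down to computing these posteriors (often called responsibilities) for every data point and every component. A minimal numpy sketch for 1-D data (function and variable names are my own):

```python
import numpy as np
from scipy.stats import norm

def e_step(X, p, mus, sigmas):
    """Return gamma[i, k] = p(z_i = c_k | x_i, theta^(t)) for 1-D data X of shape (N,)."""
    # Unnormalized posterior: p_k * N(x_i | mu_k, sigma_k^2), shape (N, K).
    weighted = p * norm.pdf(X[:, None], loc=mus, scale=sigmas)
    # Normalize each row so that it sums to one over k.
    return weighted / weighted.sum(axis=1, keepdims=True)
```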
M step:
We can update $p^{(t+1)}$, $\mu^{(t+1)}$, and $\Sigma^{(t+1)}$ separately. We first find $p^{(t+1)}$, where $p^{(t+1)} = (p_1^{(t+1)}, p_2^{(t+1)}, \dots, p_K^{(t+1)})$:
$$p^{(t+1)} = \arg\max_p \sum_{k=1}^K \sum_{i=1}^N \log p_k \cdot p(z_i = c_k \mid x_i, \theta^{(t)})$$
with the constraint $\sum_{k=1}^K p_k = 1$. Clearly, this is a constrained optimization problem, so we solve it by introducing a Lagrange multiplier.
$$\begin{aligned}
L(p, \lambda) &= \sum_{k=1}^K \sum_{i=1}^N \log p_k \cdot p(z_i = c_k \mid x_i, \theta^{(t)}) + \lambda \left(\sum_{k=1}^K p_k - 1\right) \\
\frac{\partial L}{\partial p_k} &= \sum_{i=1}^N \frac{1}{p_k}\, p(z_i = c_k \mid x_i, \theta^{(t)}) + \lambda = 0 && (7)
\end{aligned}$$
Then multiply both sides by $p_k$, and sum over all $k$:
$$\begin{aligned}
\sum_{i=1}^N p(z_i = c_k \mid x_i, \theta^{(t)}) + p_k \cdot \lambda &= 0 \\
\sum_{i=1}^N \sum_{k=1}^K p(z_i = c_k \mid x_i, \theta^{(t)}) + \sum_{k=1}^K p_k \cdot \lambda &= 0
\end{aligned}$$
Since $\sum_{k=1}^K p(z_i = c_k \mid x_i, \theta^{(t)}) = 1$ and $\sum_{k=1}^K p_k = 1$, the second equation becomes $N + \lambda = 0$, so
$$\lambda = -N$$
Putting $\lambda = -N$ back into (7), we have
$$p_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^N p(z_i = c_k \mid x_i, \theta^{(t)})$$
Note that solving for $\mu^{(t+1)}$ and $\Sigma^{(t+1)}$ is essentially the same as solving for $p^{(t+1)}$, except that they do not have any constraint.
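For completeness, setting the derivatives of $Q(\theta, \theta^{(t)})$ with respect to $\mu_k$ and $\Sigma_k$ to zero yields the standard textbook updates (stated here without the full derivation), where $\gamma_{ik} = p(z_i = c_k \mid x_i, \theta^{(t)})$:
$$\mu_k^{(t+1)} = \frac{\sum_{i=1}^N \gamma_{ik}\, x_i}{\sum_{i=1}^N \gamma_{ik}}, \qquad
\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^N \gamma_{ik}\, \bigl(x_i - \mu_k^{(t+1)}\bigr)\bigl(x_i - \mu_k^{(t+1)}\bigr)^T}{\sum_{i=1}^N \gamma_{ik}}$$
A compact 1-D numpy sketch of the corresponding M step, meant to pair with the `e_step` sketch above (illustrative only; names are my own):

```python
import numpy as np

def m_step(X, gamma):
    """Update (p, mus, sigmas) for 1-D data X from responsibilities gamma[i, k]."""
    Nk = gamma.sum(axis=0)                        # effective number of points per component
    p = Nk / len(X)                               # p_k^(t+1)
    mus = (gamma * X[:, None]).sum(axis=0) / Nk   # mu_k^(t+1)
    # Sigma_k^(t+1) reduces to a weighted variance in the 1-D case.
    sigmas = np.sqrt((gamma * (X[:, None] - mus) ** 2).sum(axis=0) / Nk)
    return p, mus, sigmas
```

Alternating `e_step` and `m_step` until the log-likelihood stops improving gives the full EM loop for the GMM.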
K-means

K-means is one of the most popular “clustering” algorithms. K-means stores $K$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster’s centroid than to any other centroid.
The Algorithm:
- Initialize $K$ centroids.
- Iterate until convergence:
  a. Hard-assign each data point to its closest centroid.
  b. Move each centroid to the center (mean) of the data points assigned to it.
Notice that this process is very similar to the way we update the parameters of a GMM, and we call it an EM-style method for approximating the optimal parameters.
The objective function to minimize is
$$L = \sum_{i=1}^N \sum_{k=1}^K \gamma_{ik} \cdot \lVert x_i - \mu_k \rVert_2^2$$
where $\gamma_{ik} = 1$ if $x_i \in c_k$ and $0$ otherwise. Note that $\gamma_{ik}$ here is a hard label, which can only be 0 or 1, so GMM is a more general model than K-means. In K-means, we can think of $\gamma_{ik}$ as the latent variable and $\mu$ as the parameter we want to optimize.
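A minimal numpy sketch of this two-step loop (my own illustrative implementation: centroids initialized from random data points, no handling of empty clusters):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means for data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K distinct data points at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: hard-assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (N, K)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (assumes no cluster becomes empty).
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```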
Reference:
- http://people.csail.mit.edu/dsontag/courses/ml12/slides/lecture21.pdf
- https://www.cs.toronto.edu/~jlucas/teaching/csc411/lectures/tut8_handout.pdf
- https://space.bilibili.com/97068901
- https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective.