Optimizing GMM parameters using EM
A Gaussian Mixture Model (GMM) models data as a finite mixture of Gaussians. It is often used to perform clustering. In this case, the number of Gaussians is the number of clusters and is initially selected. Compared to means, GMM allows clusters with uneven variance and density.
Parameters are usually estimated using an ExpectationMaximization (EM) algorithm, its aim being to iteratively increase likelihood of the dataset. GMM is actually a perfect model to understand how EM is working.
Fig. 1. Clustering a 2D dataset into 3 clusters using GMM. Left: Dataset before clustering. Right: Clustered data with position of fitted mean for each Gaussian.
We begin by describing GMM and list parameters of the model. We continue by applying EM to GMM and derive update formulas. Then, the algorithm to cluster data is given. We finally illustrate the clustering process on a simple example. This post does not detail how EM is working. If you are interested about theoretical considerations, please read this post.
What is GMM?
Let a dataset of .
Let any element of . We assume that has been sampled from a random variable . We assume that follows a probability distribution with a certain density. The density of in any is written as follows:
In addition, we assume that is labeled with a certain , where is a fixed integer. Those labels exist (and are fixed), but we only observe , without explicit knowledge of the corresponding label . The underlying random variable to model the label is noted , and the probability to be labeled is written as follows:
We say that is a latent variable. Using the law of total probability, we can reveal the latent variable (in the formula, ):
GMM assumes that three hypotheses are verified:
 The vector of couples forms an independent vector over ,
 Each record belongs to a cluster with probability (with ),
 Each conditional variable follows a Gaussian distribution with mean and covariance matrix .
We let the density function of a Gaussian with parameters and on . Using hypotheses 2 and 3, the last equation is rewritten as follows (for all ):
Unknown (fixed) parameters of the model are grouped together into:
The chosen strategy to estimate is to maximize the loglikelihood of observed data , as defined by the density of probability to observe given :
Using the three hypotheses of GMM, we obtain:
However, this loglikelihood function is nonconvex (as a function of ) and direct optimization is intractable (see this post for a discussion). We introduce EM to circumvent this problem (other methods could work, see this post for a discussion).
Applying EM to GMM
We assume that some parameters have been selected (for a certain ). We would like to update parameters and find out using EM algorithm.
We define for all :
The aim of EM is to maximize the function in . Please read section “The EM algorithm” of the EM post to understand why we have selected this function. From the last paragraph of the EM post, we also have:
We recall how parameters decompose into 3 terms for GMM:
In the following equalities, we use hypothesis 1 and then hypotheses 2 and 3 of GMM:
We define:
We use explicit formula for the Gaussian distribution (for all )
and obtain:
From this shape, we can separate maximization of each couple (for ) and maximization of the set .
For the mean
From previous expression, we can perform maximization for each fixed . Some terms have no dependence on , so we need to maximize:
We take the gradient with respect to (see formula (86) of the matrix cookbook and this post to remember how gradient is calculated in this case):
We have if and only if: from which we deduce:
Furthermore, Hessian matrix of is given by:
which is negativedefinite.
Conclusion: We select and is maximized in .
For the matrix of variancecovariance
From previous expression, we can also perform maximization of for each fixed .
First, it is easier to differentiate with respect to , so we let and maximize:
Then, we want to differentiate with respect to the matrix of shape . This is quite ambiguous: We can either decide to see the matrix as a vector of length , or to see it as a vector of length (because is symmetric and many coefficients are identical). Choosing one or another way to differentiate will change the formula for , which are both valid and give the same maximized variancecovariance matrix
Let’s do the most simple computations (seeing as a vector of length ).
Using formula (57) of the matrix cookbook and that is symmetric and positivedefinite:
Using formula (70) of the matrix cookbook:
We select and obtain:
We have if and only if:
And so:
This matrix is positivedefinite.
Furthermore, Hessian matrix of is given by:
which is negativedefinite.
Conclusion: We select and is maximized in .
For the probabilities
Probabilities are considered all together since there is a constraint: The sum of over must be . We remove this constraint using: . From previous expression, we only need to maximize:
We let and rewrite:
For all :
And if and only if .
Summing on all , we obtain:
and so:
It follows for all :
Furthermore, Hessian matrix of is a diagonal matrix with diagonal coefficients given by:
which is a negativedefinite matrix.
Conclusion: We select for all and is maximized in .
Algorithm to cluster data
Let a dataset of and an integer.
Step
We define initial parameters. For :
 ,
 the identity matrix of size , and
 some initial positions obtained with means.
Step to
Let the density function of a Gaussian with parameters and on .
Let for all :
Let for all :
We repeat this step until convergence (see this article for theoretical results of convergence). In general there is no problem for convergence to a local maxima, however it is possible to build some pathological cases.
Clustering
Given estimated parameters ,
we compute the density for to be labeled (for all ):
Hard label for is estimated by taking
Illustration of the clustering process
We propose to cluster a twodimensional dataset into 3 clusters using GMM. The dataset is plotted in Fig. 2 (a). We initialize parameters with means (related clustering shown in Fig. 2 (b)). We update parameters (see Fig. 2 (c) for clustering related to ) until convergence (see Fig. 2 (d)).
Fig. 2. Clustering dataset into 3 clusters using GMM. From left to right: (a) Dataset before clustering; (b) Initialization with Kmeans; (c) Step 1; (d) GMM clustering after convergence. On each figure from (b) to (d), one color represents one cluster, and mean position of each cluster is represented with a cross.
We summarize evolution of the parameters along steps. Cluster 1 is the green one on the left, cluster 2 is the orange one on the top, cluster 3 is the purple one on the right.
In this example, mean positions of clusters do not move a lot between means and GMM clustering:
$$t$$  $$~~~~~~~~~~~m_1^{(t)}~~~~~~~~~~~$$  $$~~~~~~~~~~~m_2^{(t)}~~~~~~~~~~~$$  $$~~~~~~~~~~~m_3^{(t)}~~~~~~~~~~~$$ 
$$0$$  $$\begin{bmatrix} 0.95 \\ 2.94 \end{bmatrix}$$  $$\begin{bmatrix} 1.65 \\ 2.93 \end{bmatrix}$$  $$\begin{bmatrix} 2.97 \\ 2.03 \end{bmatrix}$$ 
$$1$$  $$\begin{bmatrix} 0.94 \\ 2.93 \end{bmatrix}$$  $$\begin{bmatrix} 1.66 \\ 2.93 \end{bmatrix}$$  $$\begin{bmatrix} 2.96 \\ 2.03 \end{bmatrix}$$ 
$$\infty$$  $$\begin{bmatrix} 0.88 \\ 2.01 \end{bmatrix}$$  $$\begin{bmatrix} 1.94 \\ 3.00 \end{bmatrix}$$  $$\begin{bmatrix} 2.98 \\ 2.03 \end{bmatrix}$$ 
GMM successfully considers uneven variance in each cluster. For example, variance of the second axis for cluster 1 has increased a lot (from to ), contrary to the second axis of cluster 2 (from to ):
$$t$$  $$\Sigma_1^{(t)}$$  $$\Sigma_2^{(t)}$$  $$\Sigma_3^{(t)}$$ 
$$0$$  $$ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $$  $$ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $$  $$ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $$ 
$$1$$  $$ \begin{bmatrix} 0.91 & 0.14 \\ 0.14 & 4.90 \end{bmatrix} $$  $$ \begin{bmatrix} 1.64 & 0.13 \\ 0.13 & 0.32 \end{bmatrix} $$  $$ \begin{bmatrix} 0.30 & 0.01 \\ 0.01 & 0.18 \end{bmatrix} $$ 
$$\infty$$  $$ \begin{bmatrix} 0.97 & 0.09 \\ 0.09 & 8.28 \end{bmatrix} $$  $$ \begin{bmatrix} 0.93 & 0.02 \\ 0.02 & 0.04 \end{bmatrix} $$  $$ \begin{bmatrix} 0.26 & 0.01 \\ 0.01 & 0.16 \end{bmatrix} $$ 
GMM successfully considers uneven density in each cluster. For example, estimated proportion of elements in cluster 3 has increased from to :
$$t$$  $$~~~~~~~~~~~~\pi_1^{(t)}~~~~~~~~~~~~$$  $$~~~~~~~~~~~~\pi_2^{(t)}~~~~~~~~~~~~$$  $$~~~~~~~~~~~~\pi_3^{(t)}~~~~~~~~~~~~$$ 
$$0$$  $$0.33$$  $$0.33$$  $$0.33$$ 
$$1$$  $$0.15$$  $$0.30$$  $$0.55$$ 
$$\infty$$  $$0.18$$  $$0.27$$  $$0.55$$ 
Likelihood of the dataset has increased from to after convergence. In this case, EM algorithm has reached MLE.
Evolution of likelihood as the number of steps is shown in Fig. 3.
Fig. 3. Evolution of likelihood as the number of steps until convergence using EM algorithm.
Note 1. Dataset has been simulated as mixture of Gaussians, where true means for clusters are , , ; true matrices of variancecovariance are , , ; and true proportions are , and . The likelihood of the set using true parameters is .
Note 2. Since we are using means for initialization, it may be useful to normalize data before using GMM clustering.
References

The Matrix Cookbook. I’ve just discovered it, and it is really useful for reference,

English wikipedia about EM. Wikipedia gives concise formulas,