# Optimizing GMM parameters using EM

A Gaussian Mixture Model (GMM) models data as a finite mixture of Gaussians. It is often used to perform clustering. In this case, the number of Gaussians $K$ is the number of clusters and is initially selected. Compared to $K$-means, GMM allows clusters with uneven variance and density.

Parameters are usually estimated using an Expectation-Maximization (EM) algorithm, its aim being to iteratively increase likelihood of the dataset. GMM is actually a perfect model to understand how EM is working.

Fig. 1. Clustering a 2D dataset into 3 clusters using GMM. Left: Dataset before clustering. Right: Clustered data with position of fitted mean for each Gaussian.

We begin by describing GMM and list parameters of the model. We continue by applying EM to GMM and derive update formulas. Then, the algorithm to cluster data is given. We finally illustrate the clustering process on a simple example. This post does not detail how EM is working. If you are interested about theoretical considerations, please read this post.

## What is GMM?

Let $\mathbf{x} = (x_i)_{i \in \lbrace 1, \ldots n \rbrace}$ a dataset of $\mathbb{R}^d$.

Let $i$ any element of $\lbrace 1, \ldots n \rbrace$. We assume that $x_i \in \mathbb{R}^d$ has been sampled from a random variable $X_i$. We assume that $X_i$ follows a probability distribution with a certain density. The density of $X_i$ in any $x \in \mathbb{R}^d$ is written as follows:

In addition, we assume that $x_i$ is labeled with a certain $z_i^{(\text{true})} \in \lbrace 1, \ldots, K \rbrace$, where $K$ is a fixed integer. Those labels exist (and are fixed), but we only observe $x_i$, without explicit knowledge of the corresponding label $z_i^{(\text{true})}$. The underlying random variable to model the label is noted $Z_i$, and the probability to be labeled $k \in \lbrace 1, \ldots, K \rbrace$ is written as follows:

We say that $Z_i$ is a latent variable. Using the law of total probability, we can reveal the latent variable (in the formula, $x \in \mathbb{R}^d$):

GMM assumes that three hypotheses are verified:

1. The vector of couples $(X_i, Z_i)_i$ forms an independent vector over $i$,
2. Each record belongs to a cluster $Z_i = k$ with probability $\pi_k$ (with $\pi_k > 0$),
3. Each conditional variable $(X_i \mid Z_i = k)$ follows a Gaussian distribution with mean $m_k$ and covariance matrix $\Sigma_k$.

We let $f_{(m, \Sigma)}$ the density function of a Gaussian with parameters $m$ and $\Sigma$ on $\mathbb{R}^d$. Using hypotheses 2 and 3, the last equation is rewritten as follows (for all $i, x$):

Unknown (fixed) parameters of the model are grouped together into:

The chosen strategy to estimate $\theta^{(\text{true})}$ is to maximize the log-likelihood of observed data $\mathbf{x} := (x_1, \ldots, x_n)$, as defined by the density of probability to observe $\mathbf{x}$ given $\theta$:

Using the three hypotheses of GMM, we obtain:

However, this log-likelihood function is non-convex (as a function of $\theta$) and direct optimization is intractable (see this post for a discussion). We introduce EM to circumvent this problem (other methods could work, see this post for a discussion).

## Applying EM to GMM

We assume that some parameters $\theta^{(t)}$ have been selected (for a certain $t \geq 0$). We would like to update parameters and find out $\theta^{(t+1)}$ using EM algorithm.

We define for all $\theta$:

The aim of EM is to maximize the function $Q$ in $\theta$. Please read section “The EM algorithm” of the EM post to understand why we have selected this function. From the last paragraph of the EM post, we also have:

We recall how parameters decompose into 3 terms for GMM:

In the following equalities, we use hypothesis 1 and then hypotheses 2 and 3 of GMM:

We define:

We use explicit formula for the Gaussian distribution (for all $x \in \mathbb{R}^{d}$)

and obtain:

From this shape, we can separate maximization of each couple $(m_k, \Sigma_k)$ (for $k \in \lbrace 1, \ldots, K \rbrace$) and maximization of the set $(\pi_k)_k$.

### For the mean $m_k$

From previous expression, we can perform maximization for each fixed $k$. Some terms have no dependence on $m_k \in \mathbb{R}^{d}$, so we need to maximize:

We take the gradient with respect to $m_k$ (see formula (86) of the matrix cookbook and this post to remember how gradient is calculated in this case):

We have $\nabla_{m_k} A(m_k) = 0$ if and only if: $\sum_{i = 1}^{N} (x_i - m_k) T_{k, i}^{(t)} = 0$ from which we deduce:

Furthermore, Hessian matrix of $A(m_k)$ is given by:

which is negative-definite.

Conclusion: We select $m_k^{(t+1)} := \frac{\sum_{i = 1}^{N} x_i T_{k, i}^{(t)}}{\sum_{i = 1}^{N} m_k T_{k, i}^{(t)}}$ and $A(.)$ is maximized in $m_k^{(t+1)}$.

### For the matrix of variance-covariance $\Sigma_k$

From previous expression, we can also perform maximization of $\Sigma_k$ for each fixed $k$.

First, it is easier to differentiate with respect to $\Sigma_k^{-1}$, so we let $\Lambda_k := \Sigma_k^{-1}$ and maximize:

Then, we want to differentiate with respect to the matrix $\Lambda_k$ of shape $d \times d$. This is quite ambiguous: We can either decide to see the matrix as a vector of length $d^2$, or to see it as a vector of length $d(d+1)/2$ (because $\Lambda_k$ is symmetric and many coefficients are identical). Choosing one or another way to differentiate will change the formula for $\nabla_{\Lambda_k} B(\Lambda_k)$, which are both valid and give the same maximized variance-covariance matrix $\Lambda_k^{(t+1)}.$

Let’s do the most simple computations (seeing $\Lambda_k$ as a vector of length $d^2$).

Using formula (57) of the matrix cookbook and that $\Lambda_k$ is symmetric and positive-definite:

Using formula (70) of the matrix cookbook:

We select $z:= x_i - m_k^{(t+1)}$ and obtain:

We have $\nabla_{\Sigma_k} B(\Sigma_k) = 0$ if and only if:

And so:

This matrix is positive-definite.

Furthermore, Hessian matrix of $B(\Lambda_k)$ is given by:

which is negative-definite.

Conclusion: We select $\Sigma_k^{(t+1)} := \frac{\sum_{i = 1}^{N} (x_i - m_k^{(t+1)})(x_i - m_k^{(t+1)})^{T} T_{k, i}^{(t)}}{ \sum_{i = 1}^{N} T_{k, i}^{(t)}}.$ and $B(.)$ is maximized in $\Sigma_k^{(t+1)}$.

### For the probabilities $(\pi_k)_k$

Probabilities $(\pi_k)_k$ are considered all together since there is a constraint: The sum of $\pi_k$ over $k$ must be $1$. We remove this constraint using: $\pi_K = 1 - (\pi_1 + \ldots + \pi_{K-1})$. From previous expression, we only need to maximize:

We let $S_k := \sum_{i = 1}^{N} T_{k, i}^{(t)}$ and rewrite:

For all $k \in \lbrace 1, \ldots, K-1 \rbrace$:

And $\nabla_{\pi_k} C((\pi_k)_{k \in \lbrace 1, \ldots, K-1 \rbrace}) = 0$ if and only if $\pi_k = \frac{S_k}{S_K} \pi_K$.

Summing on all $k \in \lbrace 1, \ldots, K-1 \rbrace$, we obtain:

$1 - \pi_K = \frac{\sum_{k=1}^{K-1} S_k}{S_K} \pi_K$ and so: $\pi_K = \frac{S_K}{\sum_{k=1}^{K} S_k}.$

It follows for all $k \in \lbrace 1, \ldots, K \rbrace$:

Furthermore, Hessian matrix of $C((\pi_k)_{k \in \lbrace 1, \ldots, K-1 \rbrace})$ is a diagonal matrix with diagonal coefficients given by:

which is a negative-definite matrix.

Conclusion: We select $\pi_k^{(t+1)} := \frac{\sum_{i = 1}^{N} T_{k, i}^{(t)}}{\sum_{k'=1}^{K} \sum_{i = 1}^{N} T_{k', i}^{(t)}}$ for all $k$ and $C(.)$ is maximized in $(\pi_k^{(t+1)})_{k}$.

## Algorithm to cluster data

Let $\mathbf{x} = (x_i)_{i \in \lbrace 1, \ldots n \rbrace}$ a dataset of $\mathbb{R}^d$ and $K$ an integer.

### Step $0$

We define initial parameters. For $k \in \lbrace 1, \ldots, K \rbrace$:

• $\pi_k^{(0)} = 1 / K$,
• $\Sigma_k^{(0)}$ the identity matrix of size $K \times K$, and
• $(m_k^{(0)})_{k \in \lbrace 1, \ldots, K \rbrace}$ some initial positions obtained with $K$-means.

### Step $t$ to $t+1$

Let $f_{(m, \Sigma)}$ the density function of a Gaussian with parameters $m$ and $\Sigma$ on $\mathbb{R}^d$.

Let for all $k, i$:

Let for all $k$:

We repeat this step until convergence (see this article for theoretical results of convergence). In general there is no problem for convergence to a local maxima, however it is possible to build some pathological cases.

### Clustering

Given estimated parameters $\theta^{(\infty)} = \left( m_k^{(\infty)}, \Sigma_k^{(\infty)}, \pi_k^{(\infty)} \right)_k$,

we compute the density for $x_i$ to be labeled $k$ (for all $x_i, k$):

Hard label for $x_i$ is estimated by taking $\text{argmax}_k p_{\theta^{(\infty)}}(X_i = x_i, Z_i = k).$

## Illustration of the clustering process

We propose to cluster a two-dimensional dataset into 3 clusters using GMM. The dataset is plotted in Fig. 2 (a). We initialize parameters $\theta^{(0)}$ with $K$-means (related clustering shown in Fig. 2 (b)). We update parameters (see Fig. 2 (c) for clustering related to $\theta^{(1)}$) until convergence $\theta^{(\infty)}$ (see Fig. 2 (d)).

Fig. 2. Clustering dataset into 3 clusters using GMM. From left to right: (a) Dataset before clustering; (b) Initialization with K-means; (c) Step 1; (d) GMM clustering after convergence. On each figure from (b) to (d), one color represents one cluster, and mean position of each cluster is represented with a cross.

We summarize evolution of the parameters along steps. Cluster 1 is the green one on the left, cluster 2 is the orange one on the top, cluster 3 is the purple one on the right.

In this example, mean positions of clusters do not move a lot between $K$-means and GMM clustering:

 $$t$$ $$~~~~~~~~~~~m_1^{(t)}~~~~~~~~~~~$$ $$~~~~~~~~~~~m_2^{(t)}~~~~~~~~~~~$$ $$~~~~~~~~~~~m_3^{(t)}~~~~~~~~~~~$$ $$0$$ $$\begin{bmatrix} -0.95 \\ -2.94 \end{bmatrix}$$ $$\begin{bmatrix} 1.65 \\ 2.93 \end{bmatrix}$$ $$\begin{bmatrix} 2.97 \\ -2.03 \end{bmatrix}$$ $$1$$ $$\begin{bmatrix} -0.94 \\ -2.93 \end{bmatrix}$$ $$\begin{bmatrix} 1.66 \\ 2.93 \end{bmatrix}$$ $$\begin{bmatrix} 2.96 \\ -2.03 \end{bmatrix}$$ $$\infty$$ $$\begin{bmatrix} -0.88 \\ -2.01 \end{bmatrix}$$ $$\begin{bmatrix} 1.94 \\ 3.00 \end{bmatrix}$$ $$\begin{bmatrix} 2.98 \\ -2.03 \end{bmatrix}$$

GMM successfully considers uneven variance in each cluster. For example, variance of the second axis for cluster 1 has increased a lot (from $1$ to $8.28$), contrary to the second axis of cluster 2 (from $1$ to $0.04$):

 $$t$$ $$\Sigma_1^{(t)}$$ $$\Sigma_2^{(t)}$$ $$\Sigma_3^{(t)}$$ $$0$$ $$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$ $$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$ $$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$ $$1$$ $$\begin{bmatrix} 0.91 & -0.14 \\ -0.14 & 4.90 \end{bmatrix}$$ $$\begin{bmatrix} 1.64 & 0.13 \\ 0.13 & 0.32 \end{bmatrix}$$ $$\begin{bmatrix} 0.30 & -0.01 \\ -0.01 & 0.18 \end{bmatrix}$$ $$\infty$$ $$\begin{bmatrix} 0.97 & -0.09 \\ -0.09 & 8.28 \end{bmatrix}$$ $$\begin{bmatrix} 0.93 & -0.02 \\ -0.02 & 0.04 \end{bmatrix}$$ $$\begin{bmatrix} 0.26 & -0.01 \\ -0.01 & 0.16 \end{bmatrix}$$

GMM successfully considers uneven density in each cluster. For example, estimated proportion of elements in cluster 3 has increased from $1/K$ to $0.55$:

 $$t$$ $$~~~~~~~~~~~~\pi_1^{(t)}~~~~~~~~~~~~$$ $$~~~~~~~~~~~~\pi_2^{(t)}~~~~~~~~~~~~$$ $$~~~~~~~~~~~~\pi_3^{(t)}~~~~~~~~~~~~$$ $$0$$ $$0.33$$ $$0.33$$ $$0.33$$ $$1$$ $$0.15$$ $$0.30$$ $$0.55$$ $$\infty$$ $$0.18$$ $$0.27$$ $$0.55$$

Likelihood of the dataset has increased from $-4144.924$ to $-2966.941$ after convergence. In this case, EM algorithm has reached MLE.

Evolution of likelihood as the number of steps is shown in Fig. 3.

Fig. 3. Evolution of likelihood as the number of steps until convergence using EM algorithm.

Note 1. Dataset has been simulated as mixture of Gaussians, where true means for clusters are $\begin{bmatrix} -1 \\ -2 \end{bmatrix}$, $\begin{bmatrix} 2 \\ 3 \end{bmatrix}$, $\begin{bmatrix} 3 \\ -2 \end{bmatrix}$; true matrices of variance-covariance are $% $, $% $, $% $; and true proportions are $0.18$, $0.27$ and $0.55$. The likelihood of the set using true parameters is $-2979.822$.

Note 2. Since we are using $K$-means for initialization, it may be useful to normalize data before using GMM clustering.

## References

Written on February 11, 2017