Rediscover the EM algorithm from scratch

The Expectation-Maximization (EM) algorithm is used to estimate the parameters of statistical models involving latent variables. It applies when direct computation of the maximum likelihood estimate (MLE) is impractical. EM updates the model parameters iteratively, in order to increase the likelihood of the observed data. It generally gives good estimates, although there is no guarantee of convergence to the MLE.

Many introductions to EM exist on the web. This one starts from the likelihood computation problem and uses inductive reasoning to bring out EM. No full implementation is provided, only a few short illustrative sketches along the way.

We begin by describing a general model involving latent variables. We then explain why direct computation of the MLE is intractable. After that, ideas are introduced one by one and lead us naturally towards EM. Those ideas are then formalized, making EM a practical algorithm for estimating parameters.

[Figure: decomposition of the log-likelihood in EM]

Definition of the model

Let $x = (x_1, \ldots, x_n)$ be the set of observed data points, where each $x_i$ stands in $\mathbb{R}^d$.

Let $z = (z_1, \ldots, z_n)$ be the set of unobserved latent data, where each $z_i$ stands in $\{1, \ldots, K\}$ with $K$ a fixed integer.

The couple $(x, z)$ is a realization of a random variable $(X, Z)$.

The distribution of $(X, Z)$ depends on an unknown but fixed parameter $\theta$.

We use the letter $p$ for densities and $\mathbb{P}$ for probabilities. For the sake of conciseness, we discard the notation of the random variable: For example, we write $p(x \mid \theta)$ for $p_X(x \mid \theta)$, and $\mathbb{P}(z \mid \theta)$ for $\mathbb{P}(Z = z \mid \theta)$.

The EM algorithm can always be derived, but it may be impractical. A good rule of thumb to consider an EM algorithm is when $p(x \mid \theta)$ is “difficult to compute” directly, but both $p(x \mid z, \theta)$ and $\mathbb{P}(z \mid \theta)$ are “easy to compute” (for all $x$, $z$ and $\theta$). In the following, we assume that we are in this configuration.

In GMM specifically, 3 hypotheses are set and allow practical use of the EM algorithm: The vector of marginal variables $(X_i, Z_i)_{i=1,\ldots,n}$ of $(X, Z)$ forms an independent vector over $i$; Each record belongs to a cluster $k$ with a fixed probability; And each conditional variable $X_i \mid \{Z_i = k\}$ follows a Gaussian distribution with fixed parameters. If you are only interested in understanding how to use and implement the EM algorithm for GMM clustering, please check out this specific post.
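
To make these three hypotheses concrete, here is a small generative sketch (not from the original post; the name `sample_gmm` and the parameter values are ours and purely illustrative):

```python
import numpy as np

def sample_gmm(n, pi, mu, sigma, rng):
    """Draw n points from a 1D GMM according to the three hypotheses:
    records are independent, each z_i is drawn with fixed probabilities pi,
    and x_i | {z_i = k} is Gaussian with fixed parameters (mu[k], sigma[k])."""
    z = rng.choice(len(pi), size=n, p=pi)       # latent cluster labels
    x = rng.normal(loc=mu[z], scale=sigma[z])   # one Gaussian draw per record
    return x, z

rng = np.random.default_rng(0)
pi = np.array([0.4, 0.6])        # cluster probabilities
mu = np.array([-2.0, 3.0])       # cluster means
sigma = np.array([1.0, 0.5])     # cluster standard deviations
x, z = sample_gmm(5, pi, mu, sigma, rng)
print(x, z)
```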

From now on, we go back to the general setting.

Direct calculation of the MLE is intractable

The common strategy to estimate $\theta$ is to maximize the log-likelihood of the observed data $x$, defined from the probability density of observing $x$ given $\theta$:

$$\ell(\theta) := \log p(x \mid \theta).$$

Direct maximization of the likelihood over $\theta$ is not possible because $p(x \mid \theta)$ is “difficult to compute”. Using the law of total probability, we can reveal the latent variable $z$ by summing over all possible latent data (the sum in the following equation has $K^n$ terms):

$$\log p(x \mid \theta) = \log \sum_{z \in \{1, \ldots, K\}^n} p(x, z \mid \theta) = \log \sum_{z \in \{1, \ldots, K\}^n} p(x \mid z, \theta)\, \mathbb{P}(z \mid \theta).$$

We have retrieved $p(x, z \mid \theta) = p(x \mid z, \theta)\, \mathbb{P}(z \mid \theta)$, which is “easy to compute”. However, the problem in this decomposition is the presence of the sum (of $K^n$ terms) inside the $\log$. This makes the log-likelihood non-convex as a function of $\theta$, and direct optimization is intractable (see this post for a discussion and the simplest example of non-convexity).
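
To see the $K^n$ blow-up concretely, the sketch below (our own toy example, with hypothetical names) computes $\log p(x \mid \theta)$ for a tiny 1D GMM by summing $p(x \mid z, \theta)\, \mathbb{P}(z \mid \theta)$ over every latent configuration $z$, and checks the result against the usual per-point mixture formula; with $K = 2$ and $n = 5$ the enumeration already has $2^5 = 32$ terms, and it becomes hopeless for realistic $n$:

```python
import itertools
import numpy as np
from scipy.stats import norm

# Toy 1D GMM with K = 2 components (arbitrary parameters) and n = 5 points.
pi = np.array([0.4, 0.6])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 0.5])
x = np.array([-1.5, -2.3, 2.8, 3.1, 0.2])
n, K = len(x), len(pi)

# Brute force: sum p(x | z, theta) * P(z | theta) over all K^n latent configurations.
total = 0.0
for z in itertools.product(range(K), repeat=n):
    z = np.array(z)
    p_x_given_z = np.prod(norm.pdf(x, loc=mu[z], scale=sigma[z]))
    p_z = np.prod(pi[z])
    total += p_x_given_z * p_z
log_lik_brute = np.log(total)

# Direct GMM formula: log p(x | theta) = sum_i log sum_k pi_k N(x_i; mu_k, sigma_k).
log_lik_direct = np.sum(np.log(norm.pdf(x[:, None], loc=mu, scale=sigma) @ pi))

print(log_lik_brute, log_lik_direct)  # the two values agree
```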

Step by step from likelihood towards EM

We would like to put the $\log$ inside the sum, to retrieve terms which are “easy to compute” (this is not possible directly because $\log$ is not linear!). See for example:

$$\sum_{z} \log \big[ p(x \mid z, \theta)\, \mathbb{P}(z \mid \theta) \big] = \sum_{z} \big[ \log p(x \mid z, \theta) + \log \mathbb{P}(z \mid \theta) \big].$$

This expression would be easier to maximize over $\theta$ (at least for GMM, see explicit computations here).

Basically, EM will find a means to include the $\log$ inside the sum. To do this, we let $z$ be any element of $\{1, \ldots, K\}^n$ and write:

$$p(x \mid \theta) = \frac{p(x, z \mid \theta)}{\mathbb{P}(z \mid x, \theta)}.$$

Taking the $\log$, we obtain:

$$\log p(x \mid \theta) = \log p(x, z \mid \theta) - \log \mathbb{P}(z \mid x, \theta).$$

The previous formula is valid for all $z \in \{1, \ldots, K\}^n$. This is a great step, because the term of interest $\log p(x, z \mid \theta)$ is here. A first idea would be to sum over all $z$:

$$\sum_{z} \log p(x \mid \theta) = \sum_{z} \log p(x, z \mid \theta) - \sum_{z} \log \mathbb{P}(z \mid x, \theta).$$

On the left, there is no dependence on $z$, so we end up with:

$$\log p(x \mid \theta) = \frac{1}{K^n} \sum_{z} \log p(x, z \mid \theta) - \frac{1}{K^n} \sum_{z} \log \mathbb{P}(z \mid x, \theta).$$

There is a new problem: The term containing $\mathbb{P}(z \mid x, \theta)$ on the right. We have no assumption on the conditional variable $Z \mid \{X = x\}$, and Bayes’ formula does not help (using it, the denominator is $p(x \mid \theta)$ and we’re stuck).

So what can we do? We come back to this formula, which is valid for all $z$:

$$\log p(x \mid \theta) = \log p(x, z \mid \theta) - \log \mathbb{P}(z \mid x, \theta).$$

We’ve seen that summing over all $z$ and dividing by $K^n$ is not suitable to solve our problem (because of the term $\mathbb{P}(z \mid x, \theta)$ on the right). There is another perspective: Summing over all $z$ and dividing by $K^n$ corresponds to selecting each $z$ with the uniform weight $1/K^n$. But we could also select any other distribution to weight each element $z$.

Let $q$ be some distribution over $\{1, \ldots, K\}^n$. Since $\log p(x \mid \theta)$ does not depend on $z$, and $q$ sums to one over $z$, we get:

$$\log p(x \mid \theta) = \sum_{z} q(z) \log p(x, z \mid \theta) - \sum_{z} q(z) \log \mathbb{P}(z \mid x, \theta).$$

This is interpreted as an expectation over a variable $Z$ following $q$:

$$\log p(x \mid \theta) = \mathbb{E}_{q}\big[\log p(x, Z \mid \theta)\big] - \mathbb{E}_{q}\big[\log \mathbb{P}(Z \mid x, \theta)\big].$$
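
This identity can be checked numerically on a model small enough to enumerate; the sketch below (ours, reusing the toy 1D GMM above, with hypothetical names) draws an arbitrary distribution $q$ and verifies that the two weighted sums reproduce $\log p(x \mid \theta)$:

```python
import itertools
import numpy as np
from scipy.stats import norm

# Same toy 1D GMM as before (arbitrary parameters, n small enough to enumerate z).
pi = np.array([0.4, 0.6])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 0.5])
x = np.array([-1.5, -2.3, 2.8, 3.1, 0.2])
n, K = len(x), len(pi)
configs = [np.array(z) for z in itertools.product(range(K), repeat=n)]

def log_p_xz(z):
    """log p(x, z | theta) = log p(x | z, theta) + log P(z | theta)."""
    return np.sum(norm.logpdf(x, loc=mu[z], scale=sigma[z])) + np.sum(np.log(pi[z]))

log_joint = np.array([log_p_xz(z) for z in configs])
log_lik = np.log(np.sum(np.exp(log_joint)))        # log p(x | theta)
log_post = log_joint - log_lik                     # log P(z | x, theta), by Bayes

rng = np.random.default_rng(1)
q = rng.random(len(configs)); q /= q.sum()         # an arbitrary distribution q over z

lhs = log_lik
rhs = np.sum(q * log_joint) - np.sum(q * log_post)  # E_q[log p(x,Z|th)] - E_q[log P(Z|x,th)]
print(lhs, rhs)  # identical up to floating-point error
```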

We’ve said that the second term on the right is the problem. We name it $H(q, \theta)$:

$$H(q, \theta) := -\sum_{z} q(z) \log \mathbb{P}(z \mid x, \theta).$$

We assume that we have selected some current parameters $\theta_t$. We would like to select $q$ such that, for all choices of $\theta$, $H(q, \theta) \geq H(q, \theta_t)$. If we can do this, $H$ would not be a problem anymore (try to see why now; Hint: we still cannot compute $H(q, \theta)$, but we would know it cannot decrease; Answer in a subsequent section).

We define two distributions $q_{\theta}$ and $q_{\theta_t}$:

$$q_{\theta}(z) := \mathbb{P}(z \mid x, \theta) \quad \text{and} \quad q_{\theta_t}(z) := \mathbb{P}(z \mid x, \theta_t).$$

The inequality $H(q, \theta) \geq H(q, \theta_t)$ can be rewritten as follows:

$$-\sum_{z} q(z) \log q_{\theta}(z) \geq -\sum_{z} q(z) \log q_{\theta_t}(z).$$

We group terms to obtain:

$$0 \leq \sum_{z} q(z) \log \frac{q_{\theta_t}(z)}{q_{\theta}(z)}.$$

The sum on the right is not easy to compute, but $-\log$ is convex, so we try Jensen’s inequality. The following inequality is valid for any distribution $q$:

$$\sum_{z} q(z) \log \frac{q_{\theta_t}(z)}{q_{\theta}(z)} = \sum_{z} q(z) \left( -\log \frac{q_{\theta}(z)}{q_{\theta_t}(z)} \right) \geq -\log \left( \sum_{z} q(z)\, \frac{q_{\theta}(z)}{q_{\theta_t}(z)} \right).$$

If we can select $q$ such that $\sum_{z} q(z)\, \frac{q_{\theta}(z)}{q_{\theta_t}(z)} \leq 1$, then we will end up with

$$\sum_{z} q(z) \log \frac{q_{\theta_t}(z)}{q_{\theta}(z)} \geq -\log \left( \sum_{z} q(z)\, \frac{q_{\theta}(z)}{q_{\theta_t}(z)} \right) \geq 0,$$

which is exactly the desired inequality $H(q, \theta) \geq H(q, \theta_t)$.

Now the inequality to satisfy can be rewritten as:

$$\sum_{z} \frac{q(z)}{q_{\theta_t}(z)}\, q_{\theta}(z) \leq 1.$$

A natural choice is to select $q(z) = q_{\theta_t}(z)$ for all $z$: the left-hand side then becomes $\sum_{z} q_{\theta}(z) = 1$, and the inequality is satisfied.

This choice has a nice interpretation in terms of entropy. By rewriting the difference $H(q_{\theta_t}, \theta) - H(q_{\theta_t}, \theta_t)$ with $q = q_{\theta_t}$, we get (where $KL$ stands for the Kullback-Leibler divergence):

$$H(q_{\theta_t}, \theta) - H(q_{\theta_t}, \theta_t) = \sum_{z} q_{\theta_t}(z) \log \frac{q_{\theta_t}(z)}{q_{\theta}(z)} = KL\big(q_{\theta_t} \,\|\, q_{\theta}\big) \geq 0.$$

We put together our advances. For all $\theta$, the log-likelihood of $x$ given parameters $\theta$ is expressed as:

$$\log p(x \mid \theta) = \sum_{z} q_{\theta_t}(z) \log p(x, z \mid \theta) \;-\; \sum_{z} q_{\theta_t}(z) \log \mathbb{P}(z \mid x, \theta).$$

The first term of the sum is named $Q(\theta, \theta_t)$:

$$Q(\theta, \theta_t) := \sum_{z} q_{\theta_t}(z) \log p(x, z \mid \theta).$$

[Note that this can be seen as an expectation. Let $Z$ be a random variable following distribution $q_{\theta_t}$. Then: $Q(\theta, \theta_t) = \mathbb{E}_{q_{\theta_t}}\big[\log p(x, Z \mid \theta)\big]$.]

The second term is rewritten $H(\theta, \theta_t)$:

$$H(\theta, \theta_t) := -\sum_{z} q_{\theta_t}(z) \log \mathbb{P}(z \mid x, \theta).$$

[Note that, with the previous notations, $H(\theta, \theta_t) = H(q_{\theta_t}, \theta)$ and $H(\theta_t, \theta_t) = H(q_{\theta_t}, \theta_t)$.]

On the whole, for all $\theta$, the log-likelihood of $x$ given parameters $\theta$ is:

$$\log p(x \mid \theta) = Q(\theta, \theta_t) + H(\theta, \theta_t),$$

and, for all $\theta$,

$$H(\theta, \theta_t) \geq H(\theta_t, \theta_t).$$

This induces a method to increase the log-likelihood of the dataset, described in the next section.
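
As a sanity check before moving on, the sketch below (our own, on the same toy data as earlier; all names are hypothetical) verifies the two displayed facts numerically: $\log p(x \mid \theta) = Q(\theta, \theta_t) + H(\theta, \theta_t)$ for an arbitrary pair $(\theta, \theta_t)$, and $H(\theta, \theta_t) \geq H(\theta_t, \theta_t)$:

```python
import itertools
import numpy as np
from scipy.stats import norm

# Same toy data; theta = (pi, mu, sigma) is a full parameter of a 1D GMM with K = 2.
x = np.array([-1.5, -2.3, 2.8, 3.1, 0.2])
K = 2
configs = [np.array(z) for z in itertools.product(range(K), repeat=len(x))]

def log_joint(theta):
    """log p(x, z | theta) for every latent configuration z."""
    pi, mu, sigma = theta
    return np.array([np.sum(norm.logpdf(x, mu[z], sigma[z])) + np.sum(np.log(pi[z]))
                     for z in configs])

def decompose(theta, theta_t):
    """Return (log p(x | theta), Q(theta, theta_t), H(theta, theta_t)) by enumeration."""
    lj_t = log_joint(theta_t)
    q_t = np.exp(lj_t - np.log(np.sum(np.exp(lj_t))))   # q_{theta_t}(z) = P(z | x, theta_t)
    lj = log_joint(theta)
    log_lik = np.log(np.sum(np.exp(lj)))                # log p(x | theta)
    log_post = lj - log_lik                             # log P(z | x, theta)
    return log_lik, np.sum(q_t * lj), -np.sum(q_t * log_post)

theta_t = (np.array([0.5, 0.5]), np.array([-1.0, 2.0]), np.array([1.0, 1.0]))  # current parameters
theta   = (np.array([0.4, 0.6]), np.array([-2.0, 3.0]), np.array([1.0, 0.5]))  # any other parameters

ll, Q, H = decompose(theta, theta_t)
_, Q_t, H_t = decompose(theta_t, theta_t)
print(np.isclose(ll, Q + H))   # log p(x | theta) = Q(theta, theta_t) + H(theta, theta_t)
print(H >= H_t)                # H(theta, theta_t) >= H(theta_t, theta_t)
```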

The EM algorithm

The EM algorithm computes parameters $\theta_0, \theta_1, \theta_2, \ldots$ such that the corresponding log-likelihoods are nondecreasing:

$$\log p(x \mid \theta_0) \leq \log p(x \mid \theta_1) \leq \log p(x \mid \theta_2) \leq \ldots$$

We describe how to update the parameters before showing that the corresponding log-likelihoods are nondecreasing.

Updating parameters

We begin with some initial parameters $\theta_0$. This initialization step is important, and different $\theta_0$ can lead to different estimations.

Now assume that parameter $\theta_t$ has been defined (for a certain $t \geq 0$). We explain how to get $\theta_{t+1}$.

We define, for all $\theta$:

$$Q(\theta, \theta_t) := \mathbb{E}_{q_{\theta_t}}\big[\log p(x, Z \mid \theta)\big] = \sum_{z} \mathbb{P}(z \mid x, \theta_t)\, \log p(x, z \mid \theta),$$

where $Z$ is a random variable following the distribution $q_{\theta_t}(\cdot) = \mathbb{P}(\cdot \mid x, \theta_t)$.

We let:

$$\theta_{t+1} := \arg\max_{\theta} Q(\theta, \theta_t).$$

The step of defining $Q(\cdot, \theta_t)$ is the expectation step; the step of maximizing it over $\theta$ is the maximization step.
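
Put together, the two steps look like the following sketch in the 1D GMM case (a minimal illustration, not the post’s implementation: the closed-form M-step formulas below are the standard GMM updates, whose derivation is deferred to the companion GMM post, and all names such as `em_gmm_1d` are ours):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=50, seed=0):
    """Minimal EM sketch for a 1D Gaussian mixture (hypothetical helper).
    E step: compute the posterior weights P(z_i = k | x_i, theta_t), i.e. the distribution q_{theta_t}.
    M step: maximize Q(theta, theta_t); for a GMM this maximization has a closed form."""
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)     # initialization theta_0 (it matters!)
    sigma = np.full(K, np.std(x))
    for _ in range(n_iter):
        # E step: responsibilities r[i, k] = P(z_i = k | x_i, theta_t).
        dens = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)     # shape (n, K)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form argmax of Q(theta, theta_t) for the GMM.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

x = np.concatenate([np.random.default_rng(1).normal(-2, 1, 200),
                    np.random.default_rng(2).normal(3, 0.5, 300)])
print(em_gmm_1d(x, K=2))
```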

Improvement of the likelihood using EM algorithm

We now show theoretically that the likelihood is nondecreasing when the EM algorithm is used. This paragraph formalizes the previous section.

At step $t$, we have, for all $\theta$:

$$\log p(x \mid \theta) = Q(\theta, \theta_t) + H(\theta, \theta_t),$$

with:

$$Q(\theta, \theta_t) = \sum_{z} q_{\theta_t}(z) \log p(x, z \mid \theta) \quad \text{and} \quad H(\theta, \theta_t) = -\sum_{z} q_{\theta_t}(z) \log \mathbb{P}(z \mid x, \theta).$$

We compute:

$$\log p(x \mid \theta_{t+1}) - \log p(x \mid \theta_t) = \big[ Q(\theta_{t+1}, \theta_t) - Q(\theta_t, \theta_t) \big] + \big[ H(\theta_{t+1}, \theta_t) - H(\theta_t, \theta_t) \big].$$

Using Jensen’s inequality or Gibbs’ inequality, we obtain, for all $\theta$:

$$H(\theta, \theta_t) - H(\theta_t, \theta_t) = KL\big(q_{\theta_t} \,\|\, q_{\theta}\big) \geq 0,$$

and so:

$$\log p(x \mid \theta_{t+1}) - \log p(x \mid \theta_t) \geq Q(\theta_{t+1}, \theta_t) - Q(\theta_t, \theta_t).$$

By choosing $\theta_{t+1} \in \arg\max_{\theta} Q(\theta, \theta_t)$, we have:

$$Q(\theta_{t+1}, \theta_t) \geq Q(\theta_t, \theta_t),$$

and finally:

$$\log p(x \mid \theta_{t+1}) \geq \log p(x \mid \theta_t).$$

This ends the proof.

Warning: The previous proof only ensures that the likelihood is not decreasing. It does not say how the parameters $\theta_t$ compare with the true parameter (which is really unknown) or even with the MLE (obtained by maximizing the likelihood, but intractable to compute directly). Under mild conditions, EM converges to a local maximum though, see this article for a review.

How to compute and maximize $Q(\theta, \theta_t)$ in practice?

We make a full decomposition of $Q(\theta, \theta_t)$:

$$Q(\theta, \theta_t) = \sum_{z} \mathbb{P}(z \mid x, \theta_t) \big[ \log p(x \mid z, \theta) + \log \mathbb{P}(z \mid \theta) \big].$$

It should be easier to maximize this function of $\theta$, compared to the initial log-likelihood function, since it only involves $p(x \mid z, \theta)$ and $\mathbb{P}(z \mid \theta)$, which are “easy to compute”.
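
For instance, on the toy enumerable model used earlier, $Q(\theta, \theta_t)$ can be evaluated exactly with this decomposition, calling only the “easy” quantities $p(x \mid z, \theta)$ and $\mathbb{P}(z \mid \theta)$ (a sketch of ours, with hypothetical names; in practice one exploits the model’s structure instead of enumerating):

```python
import itertools
import numpy as np
from scipy.stats import norm

# Toy 1D GMM with K = 2, small enough that the sum over z can be enumerated.
x = np.array([-1.5, -2.3, 2.8, 3.1, 0.2])
configs = [np.array(z) for z in itertools.product(range(2), repeat=len(x))]

def easy_terms(theta, z):
    """The two 'easy to compute' quantities: log p(x | z, theta) and log P(z | theta)."""
    pi, mu, sigma = theta
    return np.sum(norm.logpdf(x, mu[z], sigma[z])), np.sum(np.log(pi[z]))

def Q(theta, theta_t):
    """Q(theta, theta_t) = sum_z P(z | x, theta_t) [log p(x | z, theta) + log P(z | theta)]."""
    lj_t = np.array([sum(easy_terms(theta_t, z)) for z in configs])  # log p(x, z | theta_t)
    w = np.exp(lj_t - np.log(np.sum(np.exp(lj_t))))                  # posterior weights P(z | x, theta_t)
    return sum(w_z * sum(easy_terms(theta, z)) for w_z, z in zip(w, configs))

theta_t = (np.array([0.5, 0.5]), np.array([-1.0, 2.0]), np.array([1.0, 1.0]))
theta   = (np.array([0.4, 0.6]), np.array([-2.0, 3.0]), np.array([1.0, 0.5]))
print(Q(theta, theta_t))
```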

For explicit calculations for GMM, please follow the next post.

References

Written on January 11, 2017