Rediscover EM algorithm from scratch

The Expectation-Maximization (EM) algorithm is used to estimate the parameters of statistical models involving latent variables. It is applied when direct computation of the maximum likelihood estimate (MLE) is impractical. EM updates the parameters of the model iteratively, in order to increase the likelihood of the data set. It generally gives good estimation results, although there is no guarantee of convergence to the MLE.

Many introductions to EM exist on the web. This one starts from the likelihood computation problem and uses inductive reasoning to bring out EM. No full implementation is provided; only small illustrative code sketches appear along the way.

We begin by describing a general model involving latent variables. We then explain why direct computation of the MLE is intractable. After that, ideas are introduced one by one, leading us naturally towards EM. Those ideas are then formalized, making EM a practical algorithm for parameter estimation.

[Figure: illustration of the decomposition of the log-likelihood in EM]

Definition of the model

Let $x = (x_i)_{i \in \{1, \ldots, n\}}$ be the set of observed data points, where each $x_i$ lies in $\mathbb{R}^d$.

Let $z^{(\mathrm{true})} = (z^{(\mathrm{true})}_i)_{i \in \{1, \ldots, n\}}$ be the set of unobserved latent data, where each $z^{(\mathrm{true})}_i$ lies in $\{1, \ldots, K\}$ with $K$ a fixed integer.

The pair $(x, z^{(\mathrm{true})})$ is a realization of a random variable $(X, Z)$.

The distribution of $(X, Z)$ depends on an unknown but fixed parameter $\theta^{(\mathrm{true})}$.

We use the letter $p$ for densities and $P$ for probabilities. For the sake of conciseness, we drop the random variable from the notation: for example, we write $P(z)$ for $P(Z = z)$, and $p(x)$ for $p(X = x)$.

The EM algorithm can always be derived, but may be impractical. A good rule of thumb is to consider an EM algorithm when it is “difficult to compute” $p_\theta(x)$ directly, but “easy to compute” both $\log p_\theta(x \mid z)$ and $P_\theta(z)$ (for all $x$, $z$ and $\theta$). In the following, we assume that we are in this configuration.

In GMM specifically, three hypotheses are made which allow practical use of the EM algorithm: the marginal variables $(X_i, Z_i)_i$ of $(X, Z)$ are independent over $i$; each record belongs to a cluster $Z_i = k$ with a fixed probability; and each conditional variable $(X_i \mid Z_i = k)$ follows a Gaussian distribution with fixed parameters. If you are only interested in understanding how to use and implement the EM algorithm for GMM clustering, please check out this specific post.
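To make these hypotheses concrete, here is a minimal sketch of how data could be generated under the GMM assumptions (the parameter values `weights`, `means` and `covs` below are arbitrary illustrations, not taken from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative GMM parameters theta = (weights, means, covs), with K = 2 and d = 2
weights = np.array([0.3, 0.7])               # P(Z_i = k)
means = np.array([[0.0, 0.0], [3.0, 3.0]])   # mean of (X_i | Z_i = k)
covs = np.array([np.eye(2), np.eye(2)])      # covariance of (X_i | Z_i = k)

n = 500
z_true = rng.choice(len(weights), size=n, p=weights)  # latent cluster labels (unobserved)
x = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z_true])  # observed data

# Only x is observed; z_true stays hidden, and theta is what EM will estimate.
```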

From now on, we return to the general setting.

Direct calculation of MLE is intractable

The common strategy to estimate $\theta^{(\mathrm{true})}$ is to maximize the log-likelihood of the observed data $x$, defined from the probability density of observing $x$ given $\theta$:

$$\log L(\theta; x) := \log p_\theta(x).$$

Direct maximization of the likelihood over $\theta$ is not possible because $p_\theta(x)$ is “difficult to compute”. Using the law of total probability, we can reveal the latent variable by summing over all possible latent data (the sum in the following equation has $K^n$ terms):

$$\log L(\theta; x) = \log \left[ \sum_z p_\theta(x, z) \right] = \log \left[ \sum_z p_\theta(x \mid z) \, P_\theta(z) \right].$$

We have retrieved $P_\theta(z)$, which is “easy to compute”. However, the problem in this decomposition is the sum over $z$ sitting inside the log. This makes the log-likelihood non-convex as a function of $\theta$, and direct optimization is intractable (see this post for a discussion and the simplest example of non-convexity).
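To see why this sum is intractable, here is a brute-force computation of the marginal likelihood that literally enumerates all $K^n$ latent assignments (the function name and structure are ours, for illustration only; it assumes the GMM hypotheses and the toy parameters defined above):

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood_brute_force(x, weights, means, covs):
    """Brute-force log p_theta(x): enumerate all K^n latent assignments z."""
    n, K = len(x), len(weights)
    total = 0.0
    for z in itertools.product(range(K), repeat=n):   # K^n terms in the sum
        term = 1.0
        for i, k in enumerate(z):
            # p_theta(x, z) = prod_i P_theta(Z_i = k) * p_theta(x_i | Z_i = k)
            term *= weights[k] * multivariate_normal.pdf(x[i], means[k], covs[k])
        total += term
    return np.log(total)
```

With $K = 2$ and $n = 10$ points this already sums $2^{10} = 1024$ terms; with $n$ in the hundreds the enumeration is hopeless, which is why another route is needed.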

Step by step from likelihood towards EM

We would like to move the log inside the sum, in order to retrieve terms $\log p_\theta(x \mid z)$ which are “easy to compute” (this is not possible directly, because log is not linear!). See for example: $\sum_z \log p_\theta(x, z) = \sum_z \log p_\theta(x \mid z) + \sum_z \log P_\theta(z)$. This expression would be easier to maximize over $\theta$ (at least for GMM, see explicit computations here).

Basically, EM will find a way to bring the log inside the sum. To do this, we let $z$ be any element of $\{1, \ldots, K\}^n$ and write:

$$p_\theta(x, z) = P_\theta(z \mid x) \, p_\theta(x).$$

Taking the log, we obtain:

$$\log L(\theta; x) = \log p_\theta(x) = \log p_\theta(x, z) - \log P_\theta(z \mid x).$$

The previous formula is valid for all $z$. This is a great step, because the term of interest $\log p_\theta(x, z)$ is here. A first idea would be to sum over all $z \in \{1, \ldots, K\}^n$:

$$\sum_z \log L(\theta; x) = \sum_z \log p_\theta(x, z) - \sum_z \log P_\theta(z \mid x).$$

On the left, there is no dependence on $z$, so the sum is just $K^n \log L(\theta; x)$; dividing by $K^n$, we end up with:

$$\log L(\theta; x) = \frac{1}{K^n} \sum_z \log p_\theta(x, z) - \frac{1}{K^n} \sum_z \log P_\theta(z \mid x).$$

There is a new problem: the term containing $P_\theta(z \mid x)$ on the right. We have no assumption on the conditional variable $(Z \mid X)$, and Bayes’ formula does not help (using it, the denominator is $p_\theta(x)$ and we are stuck).

So what can we do then? We come back to this formula, which is valid for all $z$:

$$\log L(\theta; x) = \log p_\theta(x, z) - \log P_\theta(z \mid x).$$

We have seen that summing over all $z$ and dividing by $K^n$ does not solve our problem (because of the term on the right). There is another perspective: summing over all $z$ and dividing by $K^n$ corresponds to selecting each $z$ with the uniform weight $1/K^n$. But we could also select any other distribution to weight each element $z$.

Let $(r_z)_z$ be some distribution over $z$. Since $\log L(\theta; x)$ does not depend on $z$, and $r_z$ sums to one over $z$, we get:

$$\log L(\theta; x) = \sum_z r_z \log p_\theta(x, z) - \sum_z r_z \log P_\theta(z \mid x).$$

This can be interpreted as an expectation over a variable $\hat{Z}$ following the distribution $(r_z)_z$:

$$\log L(\theta; x) = \mathbb{E}\left(\log p_\theta(x, \hat{Z})\right) - \mathbb{E}\left(\log P_\theta(\hat{Z} \mid x)\right).$$

We have said that the second term on the right is the problem. We name it $H$:

$$H(\theta, r) := -\sum_z r_z \log P_\theta(z \mid x).$$

We assume that we have selected some current parameters $\theta_0$. We would like to select $r$ such that, for every choice of $\theta$, $H(\theta, r) \geq H(\theta_0, r)$. If we can do this, $H$ would not be a problem anymore (try to see why now; hint: we still cannot compute $H$, but we would know it cannot decrease; the answer comes in a subsequent section).

We define two distributions $(p_z)_z$ and $(q_z)_z$:

$$p_z := P_{\theta_0}(z \mid x),$$
$$q_z := P_\theta(z \mid x).$$

The inequality $H(\theta, r) \geq H(\theta_0, r)$ can be rewritten as follows:

$$-\sum_z r_z \log q_z \ \geq\ -\sum_z r_z \log p_z.$$

We group terms to obtain:

$$\sum_z r_z \times \left(-\log \frac{q_z}{p_z}\right) \geq 0.$$

This sum is not easy to compute directly, but $-\log$ is convex, so we try Jensen’s inequality. The following inequality is valid for any distribution $r$:

$$\sum_z r_z \times \left(-\log \frac{q_z}{p_z}\right) \geq -\log \sum_z r_z \frac{q_z}{p_z}.$$

If we can select $r$ such that $-\log \sum_z r_z \frac{q_z}{p_z} \geq 0$, then we will end up with $H(\theta, r) \geq H(\theta_0, r)$.

Now the inequality $-\log \sum_z r_z \frac{q_z}{p_z} \geq 0$ can be rewritten as: $\sum_z r_z \frac{q_z}{p_z} \leq 1$.

A natural choice is to select $r_z := p_z$ for all $z$.
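This choice works because $(q_z)_z$ is itself a probability distribution, so the constraint is satisfied (with equality):

$$\sum_z r_z \frac{q_z}{p_z} = \sum_z p_z \frac{q_z}{p_z} = \sum_z q_z = 1.$$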

This choice has a nice interpretation in terms of entropy. By rewriting the difference $H(\theta, r) - H(\theta_0, r)$ with $r = p$, we get (where KL stands for the Kullback-Leibler divergence):

$$H(\theta, p) - H(\theta_0, p) = -\sum_z p_z \times \log \frac{q_z}{p_z} =: \mathrm{KL}(p \,\|\, q) \geq 0.$$
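As a quick numerical sanity check of this inequality (this snippet is ours, not part of the original derivation):

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(5); p /= p.sum()   # arbitrary distribution (p_z)
q = rng.random(5); q /= q.sum()   # arbitrary distribution (q_z)
kl = np.sum(p * np.log(p / q))    # KL(p || q) = H(theta, p) - H(theta_0, p)
assert kl >= 0                    # Gibbs' inequality
```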

We now put our advances together. For all $\theta, \theta_0$, the log-likelihood of $x$ given parameters $\theta$ can be expressed as:

$$\log L(\theta; x) = \sum_z P_{\theta_0}(z \mid x) \log p_\theta(x, z) - \sum_z P_{\theta_0}(z \mid x) \log P_\theta(z \mid x).$$

The first term on the right is named $Q$:

$$Q(\theta \mid \theta_0) := \sum_z P_{\theta_0}(z \mid x) \log p_\theta(x, z).$$

[Note that this can be seen as an expectation. Let $\hat{Z}_{\theta_0}$ be a random variable following the distribution $P_{\theta_0}(\cdot \mid x)$. Then: $Q(\theta \mid \theta_0) = \mathbb{E}\left(\log p_\theta(x, \hat{Z}_{\theta_0})\right)$.]

The second term (including the minus sign) is rewritten as $H$:

$$H(\theta \mid \theta_0) := -\sum_z P_{\theta_0}(z \mid x) \log P_\theta(z \mid x).$$

[Note that with the previous notations, $H(\theta, p) = H(\theta \mid \theta_0)$ and $H(\theta, q) = H(\theta \mid \theta)$.]

On the whole, for all $\theta, \theta_0$, the log-likelihood of $x$ given parameters $\theta$ is:

$$\log L(\theta; x) = Q(\theta \mid \theta_0) + H(\theta \mid \theta_0),$$

and, for all $\theta, \theta_0$,

$$H(\theta \mid \theta_0) \geq H(\theta_0 \mid \theta_0).$$

This induces a method to increase the log-likelihood of the data set, described in the next section.

The EM algorithm

The EM algorithm computes parameters $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots$ such that the corresponding log-likelihoods are non-decreasing:

$$\log L(\theta^{(0)}; x) \leq \log L(\theta^{(1)}; x) \leq \log L(\theta^{(2)}; x) \leq \cdots$$

We describe how to update the parameters before showing that the corresponding log-likelihoods are non-decreasing.

Updating parameters

We begin with some initial parameters $\theta^{(0)}$. This initialization step is important, and different choices of $\theta^{(0)}$ can lead to different estimates.

Now assume that the parameter $\theta^{(t)}$ has been defined (for a certain $t \geq 0$). We explain how to get $\theta^{(t+1)}$.

We define, for all $\theta$:

$$Q(\theta \mid \theta^{(t)}) := \mathbb{E}\left(\log p_\theta(x, \hat{Z}_{\theta^{(t)}})\right) = \sum_z \log p_\theta(x, z) \, P_{\theta^{(t)}}(z \mid x),$$

where $\hat{Z}_{\theta^{(t)}}$ is a random variable following the distribution $P_{\theta^{(t)}}(\cdot \mid x)$.

We let:

$$\theta^{(t+1)} := \arg\max_\theta \, Q(\theta \mid \theta^{(t)}).$$

The step of defining $Q$ is the expectation step; the step of maximizing $Q$ is the maximization step.
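In code, the whole procedure is a short loop. The sketch below is a generic skeleton under our assumptions; `e_step` and `m_step` are hypothetical model-specific functions (for a GMM, they would compute the posterior probabilities $P_{\theta^{(t)}}(z \mid x)$ and the closed-form parameter updates):

```python
def em(x, theta0, e_step, m_step, n_iter=100):
    """Generic EM skeleton; the model-specific work lives in e_step / m_step.

    e_step(x, theta) -> quantities describing P_theta(z | x), enough to form Q(. | theta)
    m_step(x, stats) -> argmax over theta of Q(theta | theta_t)
    """
    theta = theta0
    for _ in range(n_iter):
        stats = e_step(x, theta)   # expectation step: condition on current parameters
        theta = m_step(x, stats)   # maximization step: maximize Q(. | theta_t)
    return theta
```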

Improvement of the likelihood using EM algorithm

We now show that the likelihood is non-decreasing when the EM algorithm is used. This paragraph formalizes the previous section.

At step $t$, we have, for all $\theta$:

$$\log L(\theta; x) = Q(\theta \mid \theta^{(t)}) + H(\theta \mid \theta^{(t)}),$$

with:

$$H(\theta \mid \theta^{(t)}) = -\sum_z P_{\theta^{(t)}}(z \mid x) \log P_\theta(z \mid x).$$

We compute:

$$\log L(\theta; x) - \log L(\theta^{(t)}; x) = Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + H(\theta \mid \theta^{(t)}) - H(\theta^{(t)} \mid \theta^{(t)}).$$

Using Jensen’s inequality or Gibbs’ inequality, we obtain, for all $\theta$:

$$H(\theta \mid \theta^{(t)}) \geq H(\theta^{(t)} \mid \theta^{(t)}),$$

and so:

$$\log L(\theta; x) - \log L(\theta^{(t)}; x) \geq Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}).$$

By choosing $\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$, we have:

$$Q(\theta^{(t+1)} \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)}),$$

and finally:

$$\log L(\theta^{(t+1)}; x) - \log L(\theta^{(t)}; x) \geq 0.$$

This ends the proof.

Warning: The previous proof ensures that the likelihood is non-decreasing. It does not say how the estimated parameters compare with $\theta^{(\mathrm{true})}$ (truly unknown) or even with $\theta^{(\mathrm{MLE})}$ (the maximum likelihood estimate, which is intractable to compute). Under mild conditions, EM does converge to a local maximum of the likelihood, though; see this article for a review.

How to compute and maximize Q in practice?

We make a full decomposition of $Q$:

$$
\begin{aligned}
Q(\theta \mid \theta^{(t)})
&= \sum_z \log p_\theta(x, z) \, P_{\theta^{(t)}}(z \mid x) \\
&= \sum_z \log \left[ p_\theta(x \mid z) \, P_\theta(z) \right] \frac{p_{\theta^{(t)}}(z, x)}{p_{\theta^{(t)}}(x)} \\
&= \sum_z \left[ \log p_\theta(x \mid z) + \log P_\theta(z) \right] \frac{p_{\theta^{(t)}}(z, x)}{\sum_{z'} p_{\theta^{(t)}}(z', x)}.
\end{aligned}
$$

It should be easier to maximize this function of $\theta$ than the initial log-likelihood, since the log now acts directly on the “easy to compute” terms $p_\theta(x \mid z)$ and $P_\theta(z)$.
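For concreteness, here is a minimal sketch of what the E and M steps could look like for a GMM, to be plugged into the generic `em` skeleton above (the closed-form M-step updates are the standard GMM ones; their derivation is deferred to the next post):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, theta):
    """Responsibilities r[i, k] = P_theta(Z_i = k | x_i), computed point by point
    (thanks to independence, the sum over z factorizes over i)."""
    weights, means, covs = theta
    r = np.column_stack([
        weights[k] * multivariate_normal.pdf(x, means[k], covs[k])
        for k in range(len(weights))
    ])
    return r / r.sum(axis=1, keepdims=True)

def m_step(x, r):
    """Closed-form maximizer of Q(. | theta_t) for a GMM."""
    n, _ = x.shape
    nk = r.sum(axis=0)                       # effective cluster sizes
    weights = nk / n
    means = (r.T @ x) / nk[:, None]
    covs = np.array([
        ((x - means[k]).T * r[:, k]) @ (x - means[k]) / nk[k]
        for k in range(len(nk))
    ])
    return weights, means, covs
```

These plug directly into the skeleton, e.g. `em(x, (weights0, means0, covs0), e_step, m_step)`.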

For explicit calculations for GMM, please follow the next post.

References

Written on January 11, 2017