Maximizing likelihood is equivalent to minimizing KL-divergence

This post explains why maximizing likelihood is equivalent to minimizing KL-divergence. This can already be found here and here, but I restate this in my “own” words.

More generally, I encourage you to read Section 3.13 of the Deep Learning book for insights on information theory.

Let $\mathcal{D} = \{x_1, \ldots, x_n\}$ be a dataset of $n$ elements.

We assume that each $x_i$ has been sampled independently from a random variable $X$ with density $p_{\theta^*}$ corresponding to a true (unknown and fixed) parameter $\theta^*$.

We let $p_\theta$ be the density function corresponding to another parameter $\theta$.

The likelihood of $\theta$ given $\mathcal{D}$ is $\mathcal{L}(\theta \mid \mathcal{D}) = \prod_{i=1}^{n} p_\theta(x_i)$.
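As a concrete (and entirely hypothetical) illustration, here is one way this likelihood could be evaluated for a unit-variance Gaussian model; the sample, the true mean of 2.0 and the candidate parameter 1.5 are arbitrary choices made for the sketch.

```python
import numpy as np

# Hypothetical data: n = 5 points drawn from N(2, 1); theta = 1.5 is a candidate parameter.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=5)
theta = 1.5

def gaussian_pdf(x, mean, std=1.0):
    # Density of N(mean, std^2) evaluated at x.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

likelihood = np.prod(gaussian_pdf(data, theta))             # product of densities
log_likelihood = np.sum(np.log(gaussian_pdf(data, theta)))  # sum of log-densities

print(likelihood, log_likelihood)  # same information; the log form avoids underflow for large n
```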

The negative log-likelihood divided by $n$ is:

$$-\frac{1}{n} \log \mathcal{L}(\theta \mid \mathcal{D}) = -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i),$$

and, by the law of large numbers, as $n \to \infty$,

$$-\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i) \longrightarrow \mathbb{E}_{X \sim p_{\theta^*}}\!\left[-\log p_\theta(X)\right] = -\int p_{\theta^*}(x) \log p_\theta(x) \, dx = H(p_{\theta^*}, p_\theta).$$

In the previous equation, $H(p_{\theta^*}, p_\theta)$ stands for the (continuous) cross-entropy between $p_{\theta^*}$ and $p_\theta$. We let $H(p_{\theta^*})$ be the (continuous) entropy of $p_{\theta^*}$ and $D_{\mathrm{KL}}(p_{\theta^*} \,\|\, p_\theta)$ the Kullback-Leibler divergence between $p_{\theta^*}$ and $p_\theta$.
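For reference, the decomposition used in the last step follows directly from these definitions (assuming the integrals are finite):

$$
\begin{aligned}
H(p_{\theta^*}, p_\theta)
  &= -\int p_{\theta^*}(x) \log p_\theta(x) \, dx \\
  &= -\int p_{\theta^*}(x) \log p_{\theta^*}(x) \, dx
     + \int p_{\theta^*}(x) \log \frac{p_{\theta^*}(x)}{p_\theta(x)} \, dx \\
  &= H(p_{\theta^*}) + D_{\mathrm{KL}}(p_{\theta^*} \,\|\, p_\theta).
\end{aligned}
$$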

Since $H(p_{\theta^*}, p_\theta) = H(p_{\theta^*}) + D_{\mathrm{KL}}(p_{\theta^*} \,\|\, p_\theta)$ and the entropy $H(p_{\theta^*})$ does not depend on $\theta$, maximizing the likelihood is equivalent to minimizing the KL-divergence.
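As a numerical sketch of this equivalence (the unit-variance Gaussian model, the true mean of 2.0 and the grid of candidate means below are arbitrary assumptions made for the illustration), the average negative log-likelihood and the closed-form KL-divergence should differ only by the constant entropy term, and therefore share the same minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 2.0                                   # true (normally unknown) mean
data = rng.normal(loc=theta_star, scale=1.0, size=100_000)

thetas = np.linspace(0.0, 4.0, 81)                 # candidate parameters

# Average negative log-likelihood of N(theta, 1) over the sample.
nll = np.array([
    np.mean(0.5 * (data - t) ** 2 + 0.5 * np.log(2 * np.pi)) for t in thetas
])

# Closed-form quantities for two unit-variance Gaussians:
# KL(N(theta*, 1) || N(theta, 1)) = (theta* - theta)^2 / 2 and H(N(theta*, 1)) = 0.5 * log(2*pi*e).
kl = 0.5 * (theta_star - thetas) ** 2
entropy = 0.5 * np.log(2 * np.pi * np.e)

print(thetas[np.argmin(nll)], thetas[np.argmin(kl)])  # both minimizers land near theta* = 2.0
print(np.max(np.abs(nll - (entropy + kl))))           # small: NLL ≈ H(p*) + KL, up to sampling noise
```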

Written on March 11, 2017