Maximizing likelihood is equivalent to minimizing KL-divergence

This post explains why maximizing likelihood is equivalent to minimizing KL-divergence. This can already be found here and here, but I restate this in my “own” words.

More generally, I encourage you to read Section 3.13 of Deep Learning book for insights on information theory.

Let $\mathbf{x} = (x_1, \ldots x_n)$ a dataset of $n$ elements.

We assume that each $x_i$ has been sampled independently from a random variable $X$ with density $p = p_{\theta_0}$ and corresponding to a true (unknown and fixed) parameter $\theta_0$.

We let $q = p_{\theta}$ the density function corresponding to another parameter $\theta$.

The likelihood of $\mathbf{x}$ given $\theta$ is $L_{\theta}(\mathbf{x}) = \prod_{i = 1}^n p_{\theta}(x_i)$.

The opposite of log-likelihood divided by $n$ is:

and

In the previous equation, $H(p, q)$ stands for the (continuous) cross-entropy between $p$ and $q$. We let $H(p)$ the (continuous) entropy of $p$ and $\text{KL}(p \mid \mid q)$ the Kullback-Leibler divergence between $p$ and $q$.

Since $H(p, q) = H(p) + \text{KL}(p \mid \mid q)$, maximizing likelihood is equivalent to minimizing KL-divergence.

Written on March 11, 2017