# Maximizing likelihood is equivalent to minimizing KL-divergence

This post explains why maximizing likelihood is equivalent to minimizing KL-divergence. This can already be found here and here, but I restate it in my own words.

More generally, I encourage you to read Section 3.13 of the Deep Learning book for insights on information theory.

Let $$\mathbf{x} = (x_1, \ldots, x_n)$$ be a dataset of $$n$$ elements.

We assume that each $$x_i$$ has been sampled independently from a random variable $$X$$ with density $$p = p_{\theta_0}$$ corresponding to a true (unknown and fixed) parameter $$\theta_0$$.

We let $$q = p_{\theta}$$ be the density function corresponding to another parameter $$\theta$$.

The likelihood of $$\mathbf{x}$$ given $$\theta$$ is $$L_{\theta}(\mathbf{x}) = \prod_{i = 1}^n p_{\theta}(x_i)$$.
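As a quick numerical sketch of this definition, here is the log-likelihood for a hypothetical unit-variance Gaussian model $$p_{\theta} = \mathcal{N}(\theta, 1)$$ (the model, the true mean `theta_0 = 2.0`, and the sample size are all illustrative choices, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: X ~ N(theta_0, 1) with true parameter theta_0 = 2.0.
theta_0 = 2.0
x = rng.normal(loc=theta_0, scale=1.0, size=1000)

def log_likelihood(theta, x):
    """Log of prod_i p_theta(x_i), i.e. sum_i log p_theta(x_i), for N(theta, 1)."""
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

# The log-likelihood is larger near the true parameter than far from it.
print(log_likelihood(2.0, x) > log_likelihood(0.0, x))  # True
```

Working with the log turns the product over samples into a sum, which is numerically far more stable than multiplying a thousand small densities.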

The negative log-likelihood divided by $$n$$ converges, by the law of large numbers:

$-\frac{1}{n} \log L_{\theta}(\mathbf{x}) = -\frac{1}{n} \sum_{i=1}^n \log p_{\theta}(x_i) \xrightarrow[n \rightarrow +\infty]{} E_{\theta_0} \left[ - \log p_{\theta}(X) \right],$

and

$E_{\theta_0} \left[ - \log p_{\theta}(X) \right] = - \int \log p_{\theta}(x) dp_{\theta_0}(x) = - \int \log q(x) dp(x) =: H(p, q).$
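This convergence can be checked numerically. In the sketch below (my own illustrative choice, not from the post), $$p = \mathcal{N}(0, 1)$$ and $$q = \mathcal{N}(1, 1)$$, for which the cross-entropy has the closed form $$H(p, q) = \tfrac{1}{2}\log(2\pi) + \tfrac{1}{2}\left(1 + (\mu_p - \mu_q)^2\right)$$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: p = N(0, 1), q = N(1, 1).
mu_p, mu_q = 0.0, 1.0
x = rng.normal(loc=mu_p, scale=1.0, size=200_000)

# Empirical average of -log q(x_i) over samples drawn from p.
nll = np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (x - mu_q) ** 2)

# Closed-form cross-entropy H(p, q) for two unit-variance Gaussians.
H_pq = 0.5 * np.log(2 * np.pi) + 0.5 * (1 + (mu_p - mu_q) ** 2)

# For large n the two agree closely.
print(abs(nll - H_pq) < 0.05)  # True
```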

In the previous equation, $$H(p, q)$$ stands for the (continuous) cross-entropy between $$p$$ and $$q$$. We let $$H(p)$$ denote the (continuous) entropy of $$p$$ and $$\text{KL}(p \mid \mid q)$$ the Kullback-Leibler divergence between $$p$$ and $$q$$.

Since $$H(p, q) = H(p) + \text{KL}(p \mid \mid q)$$, and $$H(p)$$ does not depend on $$\theta$$, minimizing the cross-entropy over $$\theta$$ only acts on the KL term: maximizing likelihood is equivalent to minimizing KL-divergence.
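The decomposition $$H(p, q) = H(p) + \text{KL}(p \mid \mid q)$$ is easy to verify numerically in the discrete case (the two distributions below are made-up example values):

```python
import numpy as np

# Two example discrete distributions over three outcomes (hypothetical values).
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])

H_p = -np.sum(p * np.log(p))    # entropy H(p)
H_pq = -np.sum(p * np.log(q))   # cross-entropy H(p, q)
kl = np.sum(p * np.log(p / q))  # KL(p || q)

# The identity H(p, q) = H(p) + KL(p || q) holds up to floating point.
print(np.isclose(H_pq, H_p + kl))  # True
```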

Written on March 11, 2017