Maximizing likelihood is equivalent to minimizing KL-divergence
More generally, I encourage you to read Section 3.13 of the Deep Learning book for insights on information theory.
Let $x_1, \dots, x_n$ be a dataset of $n$ elements.
We assume that each $x_i$ has been sampled independently from a random variable $X$ with density $p_{\theta^*}$ corresponding to a true (unknown and fixed) parameter $\theta^*$.
We let $p_\theta$ be the density function corresponding to another parameter $\theta$.
The likelihood of $\theta$ given $x_1, \dots, x_n$ is $\mathcal{L}(\theta) = \prod_{i=1}^{n} p_\theta(x_i)$.
The negative log-likelihood divided by $n$ is:

$$-\frac{1}{n} \log \mathcal{L}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i) \xrightarrow[n \to \infty]{} -\mathbb{E}_{X \sim p_{\theta^*}}\left[\log p_\theta(X)\right] = H(p_{\theta^*}, p_\theta),$$

where the convergence holds by the law of large numbers.
In the previous equation, $H(p_{\theta^*}, p_\theta)$ stands for the (continuous) cross-entropy between $p_{\theta^*}$ and $p_\theta$. We let $H(p_{\theta^*})$ be the (continuous) entropy of $p_{\theta^*}$ and $D_{\mathrm{KL}}(p_{\theta^*} \,\|\, p_\theta)$ the Kullback-Leibler divergence between $p_{\theta^*}$ and $p_\theta$.
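To see why the conclusion below follows, note that the cross-entropy decomposes into entropy plus KL-divergence (a standard identity, spelled out here with the symbols defined above):

$$H(p_{\theta^*}, p_\theta) = -\int p_{\theta^*}(x) \log p_\theta(x) \, dx = \underbrace{-\int p_{\theta^*}(x) \log p_{\theta^*}(x) \, dx}_{H(p_{\theta^*})} + \underbrace{\int p_{\theta^*}(x) \log \frac{p_{\theta^*}(x)}{p_\theta(x)} \, dx}_{D_{\mathrm{KL}}(p_{\theta^*} \,\|\, p_\theta)}.$$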
Since $H(p_{\theta^*}, p_\theta) = H(p_{\theta^*}) + D_{\mathrm{KL}}(p_{\theta^*} \,\|\, p_\theta)$ and the entropy term $H(p_{\theta^*})$ does not depend on $\theta$, maximizing likelihood is equivalent to minimizing KL-divergence.
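As a quick numerical illustration, here is a minimal sketch (my own, not part of the argument above) that fits a Gaussian by maximum likelihood and checks that the KL-divergence to the true density shrinks as $n$ grows; the Gaussian model and all names (e.g. `kl_gaussians`, `mu_star`) are assumptions made for the example, using the closed-form KL between two Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown, fixed) parameters theta* of a Gaussian model
# (hypothetical values chosen for the example).
mu_star, sigma_star = 2.0, 1.5


def kl_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2))."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)


for n in (10, 100, 10_000):
    x = rng.normal(mu_star, sigma_star, size=n)
    # Maximum-likelihood estimates for a Gaussian: sample mean and
    # (biased) sample standard deviation.
    mu_hat, sigma_hat = x.mean(), x.std()
    # Average negative log-likelihood at the MLE: an empirical estimate
    # of the cross-entropy H(p_theta*, p_theta_hat).
    nll = (0.5 * np.log(2 * np.pi * sigma_hat ** 2)
           + ((x - mu_hat) ** 2).mean() / (2 * sigma_hat ** 2))
    kl = kl_gaussians(mu_star, sigma_star, mu_hat, sigma_hat)
    print(f"n={n:>6}  NLL/n={nll:.4f}  KL(p*||p_hat)={kl:.5f}")
```

As $n$ grows, $D_{\mathrm{KL}}(p_{\theta^*} \,\|\, p_{\hat{\theta}})$ goes to zero while the average negative log-likelihood approaches the entropy $H(p_{\theta^*})$, which is exactly the equivalence stated above.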