KL Divergence vs. Cross-Entropy in Practice#
Conceptual Overview#
In probabilistic modeling and machine learning, the discrepancy between a true data distribution \(P\) and a model distribution \(Q\) is commonly formalized using Kullback–Leibler (KL) divergence.
However, in practical training pipelines, cross-entropy is almost always used instead.
This section explains the mathematical relationship between these two quantities and clarifies why cross-entropy is preferred in practice.
KL Divergence#
The KL divergence from \(P\) to \(Q\) is defined as:
\(\mathrm{KL}(P \| Q) = \sum_x p(x)\,\log\frac{p(x)}{q(x)}\)
Key properties of KL divergence include:

- \(\mathrm{KL}(P \| Q) \ge 0\) (Gibbs' inequality)
- \(\mathrm{KL}(P \| Q) = 0\) if and only if \(P = Q\)
- It is not symmetric: in general, \(\mathrm{KL}(P \| Q) \ne \mathrm{KL}(Q \| P)\)
- Individual terms \(p(x)\,\log\frac{p(x)}{q(x)}\) can be negative (wherever \(q(x) > p(x)\)), but they never outweigh the positive ones: the sum is always non-negative
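To make these properties concrete, here is a minimal NumPy sketch of the definition; the helper name `kl_divergence` and the example distributions are illustrative, not from the text:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(P || Q) = sum_x p(x) * log(p(x) / q(x)).

    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) == 0
    contribute nothing, following the convention 0 * log(0) = 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

print(kl_divergence(p, q))  # ~0.085: non-negative despite some negative terms
print(kl_divergence(q, p))  # ~0.092: a different value, so KL is not symmetric
print(kl_divergence(p, p))  # 0.0: zero exactly when the distributions match
```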
Intuitively, KL divergence measures the expected number of extra nats (or bits, with base-2 logarithms) needed to encode data drawn from \(P\) using a code optimized for \(Q\) rather than for \(P\).
Cross-Entropy#
The cross-entropy between distributions \(P\) and \(Q\) is defined as:
\(H(P, Q) = -\sum_x p(x)\,\log q(x)\)
Cross-entropy quantifies how well the model distribution \(Q\) assigns probability mass to outcomes that occur under the true distribution \(P\).
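A matching sketch for cross-entropy, under the same illustrative setup:

```python
import numpy as np

def cross_entropy(p, q):
    """Discrete H(P, Q) = -sum_x p(x) * log(q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

print(cross_entropy(p, q))  # ~0.887: grows when q puts little mass where p puts much
print(cross_entropy(p, p))  # ~0.802: equals H(P), the minimum over all q
```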
Fundamental Relationship#
KL divergence and cross-entropy are related through the identity:
\(\mathrm{KL}(P \| Q) = H(P, Q) - H(P)\)
where the entropy of the true distribution is: \(H(P) = -\sum_x p(x)\,\log p(x)\)
If \(P\) is fixed, then \(H(P)\) is constant with respect to \(Q\). Therefore, \(\arg\min_Q \mathrm{KL}(P \| Q) \;\equiv\; \arg\min_Q H(P, Q)\)
Thus, minimizing KL divergence is equivalent to minimizing cross-entropy.
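The identity is easy to verify numerically; the following sketch reuses the same illustrative distributions as above:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution (illustrative)
q = np.array([0.5, 0.3, 0.2])  # model distribution (illustrative)

kl = np.sum(p * np.log(p / q))   # KL(P || Q)
h_pq = -np.sum(p * np.log(q))    # cross-entropy H(P, Q)
h_p = -np.sum(p * np.log(p))     # entropy H(P)

# KL(P || Q) = H(P, Q) - H(P), up to floating-point error
assert np.isclose(kl, h_pq - h_p)
```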
Why Cross-Entropy Is Used in Practice#
The entropy \(H(P)\) is unknown. In real-world problems, the true distribution \(P\) is not accessible; only samples from it are available. As a result, \(H(P)\) cannot be computed.
Cross-entropy is directly estimable from data. Given samples \(\{x_i\}_{i=1}^N \sim P\), cross-entropy can be approximated as:
\(H(P, Q) \approx -\frac{1}{N}\sum_{i=1}^N \log q(x_i)\)
which is exactly the average negative log-likelihood of the model on the observed data.
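As a sketch of this estimator, with an illustrative categorical \(P\) and model \(Q\): drawing samples from \(P\) and averaging \(-\log q(x_i)\) recovers \(H(P, Q)\) without ever evaluating \(p(x)\):

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.7, 0.2, 0.1])  # true distribution: used only to draw samples
q = np.array([0.5, 0.3, 0.2])  # model distribution: all we can evaluate

samples = rng.choice(len(p), size=100_000, p=p)  # x_i ~ P
nll = -np.mean(np.log(q[samples]))               # average negative log-likelihood

print(nll)  # ~0.887, converging to the exact H(P, Q) as N grows
```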
Efficient and stable optimization. Modern deep learning frameworks implement numerically stable versions of cross-entropy, leading to well-behaved gradients and reliable convergence.
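To illustrate what "numerically stable" means here, a hypothetical implementation can fold the softmax and the logarithm together with the log-sum-exp trick instead of computing `log(softmax(z))` explicitly:

```python
import numpy as np

def stable_cross_entropy(logits, labels):
    """Cross-entropy from raw logits via the log-sum-exp trick.

    logits: (N, C) unnormalized scores; labels: (N,) integer class indices.
    Uses log q(y) = z_y - logsumexp(z) rather than log(softmax(z)),
    which avoids underflow for large-magnitude logits.
    """
    z = logits - logits.max(axis=1, keepdims=True)  # shift: logsumexp is shift-invariant
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

logits = np.array([[10.0, 0.0, -10.0], [2.0, 5.0, 1.0]])
labels = np.array([0, 1])
print(stable_cross_entropy(logits, labels))
```

PyTorch's `torch.nn.functional.cross_entropy`, for example, accepts raw logits and integer class targets and performs this fusion internally.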
Clear statistical interpretation. Minimizing cross-entropy is equivalent to maximum likelihood estimation, providing a strong statistical foundation.
Special Case: One-Hot Labels#
In classification tasks with one-hot encoded true labels, all probability mass sits on a single class, so the entropy of the true distribution vanishes: \(H(P) = 0\)
Consequently, \(\mathrm{KL}(P \| Q) = H(P, Q)\)
In this common setting, KL divergence and cross-entropy are numerically identical, although cross-entropy remains the preferred formulation.
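A small self-contained check of this special case, with illustrative values:

```python
import numpy as np

p = np.array([0.0, 1.0, 0.0])  # one-hot "true" label over 3 classes
q = np.array([0.2, 0.5, 0.3])  # model probabilities (illustrative)

h_pq = -np.sum(p * np.log(q))                        # H(P, Q)
kl = np.sum(p[p > 0] * np.log(p[p > 0] / q[p > 0]))  # KL(P || Q)

print(h_pq, kl)  # both equal -log(0.5) ~ 0.693, because H(P) = 0
```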
Summary#
- KL divergence is the fundamental information-theoretic measure of discrepancy between probability distributions.
- Cross-entropy differs from KL divergence only by the additive constant \(H(P)\).
- Since this constant is unknown and irrelevant for optimization, it is omitted in practice.
- As a result, minimizing cross-entropy is equivalent to minimizing KL divergence, while remaining simpler, computable, and statistically interpretable.