Cross-Entropy Minimization Equals MLE for Categorical Outputs
Claim
Let a classifier predict a categorical distribution \(\hat p_\theta(\cdot \mid x) = \sigma_{\text{softmax}}(f_\theta(x))\) over \(K\) classes, given training data \(\{(x_i, y_i)\}_{i=1}^N\) with one-hot targets. The cross-entropy objective (Cover and Thomas 2005; Goodfellow et al. 2016)
\[ J_{\text{CE}}(\theta) = \frac{1}{N} \sum_{i=1}^N -\log \hat p_\theta(y_i \mid x_i) \]
is identical to the (negative, normalized) log-likelihood under the model \(y \mid x \sim \text{Categorical}(\hat p_\theta(\cdot \mid x))\). Minimizing \(J_{\text{CE}}\) is therefore exactly maximum-likelihood estimation.
The same holds for binary classification with sigmoid + binary cross-entropy under a Bernoulli likelihood model.
Proof
The Categorical likelihood of a single example is
\[ p_\theta(y_i \mid x_i) = \prod_{k=1}^K \hat p_\theta(k \mid x_i)^{y_{i,k}}, \]
where \(y_{i,k} \in \{0,1\}\) is the one-hot indicator. Taking the negative log,
\[ -\log p_\theta(y_i \mid x_i) = -\sum_{k=1}^K y_{i,k} \log \hat p_\theta(k \mid x_i) = -\log \hat p_\theta(y_i \mid x_i), \]
since exactly one \(y_{i,k}\) is \(1\). The full negative log-likelihood is
\[ -\log \prod_{i=1}^N p_\theta(y_i \mid x_i) = \sum_{i=1}^N -\log \hat p_\theta(y_i \mid x_i). \]
Dividing by \(N\) gives \(J_{\text{CE}}(\theta)\). So \(J_{\text{CE}}\) is exactly \(-\frac{1}{N}\) times the log-likelihood of the data under the model. Maximizing the log-likelihood and minimizing \(J_{\text{CE}}\) are the same problem. \(\square\)
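A quick numerical check of this identity (a minimal NumPy sketch; the logits and labels are made up stand-ins for \(f_\theta(x_i)\) and \(y_i\)): the averaged per-example \(-\log \hat p_\theta(y_i \mid x_i)\) and the negative normalized log of the product-form likelihood agree to floating point.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3
logits = rng.normal(size=(N, K))          # stand-in for f_theta(x_i)
labels = rng.integers(0, K, size=N)       # true classes y_i

# Softmax probabilities \hat p_theta(k | x_i)
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

one_hot = np.eye(K)[labels]               # y_{i,k}

# Cross-entropy objective: average of -log \hat p_theta(y_i | x_i)
J_ce = -np.log(probs[np.arange(N), labels]).mean()

# Negative normalized log-likelihood under Categorical(\hat p_theta):
# per-example likelihood is prod_k p_k^{y_k}, exactly as in the proof.
loglik = np.log(np.prod(probs ** one_hot, axis=1)).sum()
nll = -loglik / N

assert np.isclose(J_ce, nll)              # identical up to floating point
print(J_ce, nll)
```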
Equivalent Formulation in Terms of KL Divergence
Let \(\hat q(x) = \frac{1}{N} \sum_i \delta_{x_i}(x)\) be the empirical distribution over inputs, and let \(\hat p_{\text{data}}(y \mid x)\) be the empirical conditional (a one-hot for each \(x_i\)). Then
\[ J_{\text{CE}}(\theta) = \mathbb{E}_{x \sim \hat q}\!\left[\mathbb{E}_{y \sim \hat p_{\text{data}}(\cdot \mid x)}[-\log \hat p_\theta(y \mid x)]\right] = \mathbb{E}_{x \sim \hat q}\!\left[H(\hat p_{\text{data}}(\cdot \mid x), \hat p_\theta(\cdot \mid x))\right], \]
where \(H(q, p) = -\sum_y q(y) \log p(y)\) is the cross-entropy. Using \(H(q, p) = H(q) + \mathrm{KL}(q \,\|\, p)\),
\[ J_{\text{CE}}(\theta) = \mathbb{E}_{x \sim \hat q}\!\left[H(\hat p_{\text{data}}(\cdot \mid x))\right] + \mathbb{E}_{x \sim \hat q}\!\left[\mathrm{KL}(\hat p_{\text{data}}(\cdot \mid x) \,\|\, \hat p_\theta(\cdot \mid x))\right]. \]
The first term does not depend on \(\theta\), so minimizing \(J_{\text{CE}}\) in \(\theta\) is also equivalent to minimizing the average KL divergence from the empirical conditional to the model conditional. With one-hot targets the empirical conditional has zero entropy, so the cross-entropy and the KL divergence are in fact equal.
Three names for the same procedure:
- maximum-likelihood estimation under a categorical model,
- cross-entropy minimization,
- KL divergence minimization from the empirical to the model distribution.
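A short numerical check of the decomposition \(H(q, p) = H(q) + \mathrm{KL}(q \,\|\, p)\) (a sketch with arbitrary made-up target distributions \(q\); the one-hot empirical conditional is the zero-entropy special case, in which the KL term is all of \(J_{\text{CE}}\)):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 4, 3

# Arbitrary per-example target distributions q_i
# (one-hot targets are the zero-entropy special case)
q = rng.dirichlet(np.ones(K), size=N)
# Model distributions \hat p_theta(. | x_i)
p = rng.dirichlet(np.ones(K), size=N)

cross_entropy = -(q * np.log(p)).sum(axis=1)
entropy = -(q * np.log(q)).sum(axis=1)
kl = (q * np.log(q / p)).sum(axis=1)

# H(q, p) = H(q) + KL(q || p), per example and on average over inputs
assert np.allclose(cross_entropy, entropy + kl)
print(cross_entropy.mean(), entropy.mean() + kl.mean())
```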
The Gradient at the Logits
Write the pre-softmax logits as \(z = f_\theta(x)\) with \(\hat p_k = \sigma_{\text{softmax}}(z)_k = e^{z_k} / \sum_j e^{z_j}\) and one-hot target \(y\). The single-example loss is \(L = -\log \hat p_{y^*}\) where \(y^*\) is the true class. Using \(\partial \hat p_k / \partial z_j = \hat p_k (\delta_{jk} - \hat p_j)\),
\[ \frac{\partial L}{\partial z_j} = -\frac{1}{\hat p_{y^*}} \cdot \frac{\partial \hat p_{y^*}}{\partial z_j} = -\frac{\hat p_{y^*} (\delta_{j, y^*} - \hat p_j)}{\hat p_{y^*}} = \hat p_j - \delta_{j, y^*} = \hat p_j - y_j. \]
The gradient at the logits is simply the predicted distribution minus the one-hot target. Its magnitude is \(\Theta(1)\) whenever the prediction is wrong, no matter how confidently wrong, which is why softmax + cross-entropy trains well while softmax + MSE does not: the MSE gradient is attenuated by the softmax Jacobian and can vanish when the model saturates on the wrong class.
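To sanity-check the derivation, a minimal sketch (NumPy only, made-up logits) comparing the analytic gradient \(\hat p - y\) against central finite differences of \(L = -\log \hat p_{y^*}\):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y_star):
    return -np.log(softmax(z)[y_star])

rng = np.random.default_rng(2)
K = 4
z = rng.normal(size=K)                     # made-up logits
y_star = 2                                 # made-up true class
y = np.eye(K)[y_star]

analytic = softmax(z) - y                  # \hat p - y

# Central finite differences as an independent check
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(K)[j], y_star)
     - loss(z - eps * np.eye(K)[j], y_star)) / (2 * eps)
    for j in range(K)
])

assert np.allclose(analytic, numeric, atol=1e-6)
print(analytic, numeric)
```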
Binary Case
Sigmoid + binary cross-entropy gives the same equivalence under a Bernoulli model. With \(\hat p = \sigma_{\text{sigmoid}}(z)\) and \(y \in \{0, 1\}\),
\[ -\log p_\theta(y \mid x) = -y \log \hat p - (1 - y) \log(1 - \hat p) = L_{\text{BCE}}, \]
and \(\partial L_{\text{BCE}} / \partial z = \hat p - y\). Same clean residual gradient.
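The same check for the binary case (a sketch with a made-up logit and label): the analytic gradient \(\hat p - y\) matches finite differences of \(L_{\text{BCE}}\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y = 0.7, 1.0                           # made-up logit and label
analytic = sigmoid(z) - y                 # \hat p - y

eps = 1e-6
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)

assert np.isclose(analytic, numeric, atol=1e-6)
print(analytic, numeric)
```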