Loss Functions

Motivation

A neural network defines a parameterized function \(f_\theta : \mathcal{X} \to \mathcal{Y}\) but does not on its own specify what makes a good \(\theta\). The loss function \(L(\hat y, y)\) measures the discrepancy between the prediction \(\hat y = f_\theta(x)\) and the target \(y\), and the training objective averages it over data. Two losses dominate practice: mean-squared error (MSE) for real-valued targets and cross-entropy for categorical targets. Both have an interpretation as the negative log-likelihood of a probabilistic model — MSE corresponds to a Gaussian likelihood, cross-entropy to a categorical (or Bernoulli) likelihood (Goodfellow et al. 2016). Recognizing this connection clarifies what the loss is really optimizing and explains why no other choice is normally needed.

Diagram: three classification losses on the same axes

For binary classification with target class \(y = +1\) and logit \(z\) (so the predicted probability of the correct class is \(\sigma(z)\)):

[Figure: hinge, cross-entropy, and MSE-on-\(\sigma(z)\) losses plotted against the logit \(z\) for target class \(+1\). Hinge is piecewise-linear; cross-entropy decays exponentially; MSE saturates at 1 for very wrong predictions.]

The MSE curve flattens for very confident wrong predictions, which is why MSE on softmax outputs trains poorly compared to cross-entropy.

Mean-Squared Error

For real-valued targets \(y \in \mathbb{R}^d\),

\[ L_{\text{MSE}}(\hat y, y) = \tfrac{1}{2} \|\hat y - y\|_2^2 = \tfrac{1}{2} \sum_{i=1}^d (\hat y_i - y_i)^2. \]

The gradient at the output is \(\hat y - y\) — the residual.

Probabilistic interpretation. Model the conditional distribution as \(y \mid x \sim \mathcal{N}(f_\theta(x), \sigma^2 I)\) for fixed \(\sigma^2\). The negative log-likelihood is

\[ -\log p_\theta(y \mid x) = \tfrac{1}{2 \sigma^2} \|y - f_\theta(x)\|_2^2 + \text{const}, \]

so minimizing MSE is exactly maximum-likelihood estimation under a homoscedastic Gaussian model. The factor \(1/(2\sigma^2)\) only rescales the loss and so does not change the minimizer; the additive constant has no effect on \(\theta\).

Robustness. MSE is sensitive to outliers because the gradient grows linearly with the error magnitude. Alternatives — Huber loss, \(L^1\) loss, log-cosh — interpolate between MSE and a more robust criterion at the cost of a different probabilistic interpretation (Laplace, Student-t).
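
As a quick illustration, here is a minimal PyTorch sketch (an arbitrary residual of 10, and F.huber_loss with its default \(\delta = 1\)) comparing how the gradients behave for a large error:

    import torch
    import torch.nn.functional as F

    # One prediction far from its target: a residual of 10.
    pred = torch.tensor([10.0], requires_grad=True)
    target = torch.tensor([0.0])

    for name, loss_fn in [("MSE", F.mse_loss),
                          ("L1", F.l1_loss),
                          ("Huber", F.huber_loss)]:
        loss = loss_fn(pred, target)
        (grad,) = torch.autograd.grad(loss, pred)
        print(f"{name:6s} loss = {loss.item():7.2f}   dL/dpred = {grad.item():6.2f}")

    # MSE's gradient grows with the residual (20 here; F.mse_loss omits the
    # 1/2 factor), while L1 and Huber cap it, so one outlier cannot dominate.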

Worked example: one regression residual

Suppose a one-dimensional model predicts \(\hat y = 2.4\) when the target is \(y = 3.0\). The residual is \(\hat y - y = -0.6\), so

\[ L_{\text{MSE}} = \tfrac{1}{2}(2.4 - 3.0)^2 = \tfrac{1}{2}(0.36) = 0.18, \qquad \frac{\partial L}{\partial \hat y} = -0.6. \]

If the prediction moves to \(2.7\), the residual halves to \(-0.3\) and the loss falls by a factor of four to \(0.045\). Squaring the residual is what makes large mistakes dominate the average loss.
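
The same arithmetic, checked with autograd; the \(\tfrac{1}{2}\) factor is written out explicitly because PyTorch's built-in MSE omits it:

    import torch

    y_hat = torch.tensor(2.4, requires_grad=True)
    y = torch.tensor(3.0)

    loss = 0.5 * (y_hat - y) ** 2   # L = 1/2 (y_hat - y)^2
    loss.backward()

    print(loss.item())         # ~0.18
    print(y_hat.grad.item())   # ~-0.6, the residual y_hat - y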

Cross-Entropy

For multiclass classification with \(K\) classes and one-hot target \(y\), predict a distribution \(\hat p = \sigma_{\text{softmax}}(z)\) over the classes. The cross-entropy loss is

\[ L_{\text{CE}}(\hat p, y) = -\sum_{k=1}^K y_k \log \hat p_k = -\log \hat p_{y^*}, \]

where \(y^*\) is the true class index. The right-hand form makes clear that only the predicted probability of the true class matters — the loss is the negative log-probability of getting the example right.

For binary classification with \(y \in \{0, 1\}\) and \(\hat p = \sigma_{\text{sigmoid}}(z)\),

\[ L_{\text{BCE}}(\hat p, y) = -y \log \hat p - (1 - y) \log(1 - \hat p). \]
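
A minimal PyTorch sketch of the binary case on a toy batch of three logits; binary_cross_entropy_with_logits fuses the sigmoid and the log into one stable call:

    import torch
    import torch.nn.functional as F

    z = torch.tensor([2.0, -1.0, 0.5])   # raw logits
    y = torch.tensor([1.0, 0.0, 1.0])    # binary targets

    # Manual formula on probabilities (fine here, unstable for extreme z).
    p = torch.sigmoid(z)
    manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

    # Fused, numerically stable form that takes the logits directly.
    fused = F.binary_cross_entropy_with_logits(z, y)

    print(manual.item(), fused.item())   # the two values agree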

In both cases the gradient at the pre-activation \(z\) simplifies dramatically. For softmax + multiclass cross-entropy,

\[ \frac{\partial L_{\text{CE}}}{\partial z_k} = \hat p_k - y_k. \]

The gradient is simply the predicted distribution minus the one-hot target, the same residual form as MSE despite the very different setup. This clean gradient is one reason cross-entropy and softmax are paired by default.
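
A quick autograd check of this identity on arbitrary toy logits:

    import torch
    import torch.nn.functional as F

    z = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)   # logits, K = 3
    target = torch.tensor(1)                                 # true class index

    loss = F.cross_entropy(z.unsqueeze(0), target.unsqueeze(0))
    loss.backward()

    p = F.softmax(z.detach(), dim=0)
    y = F.one_hot(target, num_classes=3).float()
    print(z.grad)    # equals p - y
    print(p - y)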

Probabilistic interpretation. With \(y \mid x \sim \text{Categorical}(\sigma_{\text{softmax}}(f_\theta(x)))\), the negative log-likelihood of a single example is exactly \(L_{\text{CE}}\). So cross-entropy minimization is maximum-likelihood estimation under a categorical model. The same is true for binary cross-entropy under a Bernoulli model.

Worked example: one classification example

For a 3-class example with true class \(2\) and predicted probabilities

\[ \hat p = (0.1, 0.7, 0.2), \qquad y = (0, 1, 0), \]

the cross-entropy is

\[ L_{\text{CE}} = -\sum_{k=1}^3 y_k \log \hat p_k = -\log 0.7 \approx 0.357. \]

If the model assigned only \(0.01\) probability to the true class, the loss would be \(-\log 0.01 \approx 4.605\). Cross-entropy therefore strongly punishes confident wrong predictions instead of merely asking which class has the largest score.
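
The same numbers through the library call; F.nll_loss expects log-probabilities, so the example's distribution is passed through a log first (with raw logits you would call F.cross_entropy instead):

    import torch
    import torch.nn.functional as F

    p_hat = torch.tensor([[0.1, 0.7, 0.2]])
    target = torch.tensor([1])   # 0-based index of the true class

    loss = F.nll_loss(torch.log(p_hat), target)   # -log p_hat[true class]
    print(loss.item())                            # ~0.357

    # If the true class had only probability 0.01:
    print(-torch.log(torch.tensor(0.01)).item())  # ~4.605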

Why Not MSE for Classification

Two reasons.

  1. Wrong likelihood. Treating a one-hot vector as a Gaussian-distributed real vector is a model misspecification; MSE still yields a usable objective, but it is not the maximum-likelihood objective for categorical data.
  2. Vanishing gradients. With softmax + MSE, \(\partial L / \partial z\) is proportional to the Jacobian of softmax times the residual. The softmax Jacobian saturates near a one-hot output, killing the gradient when the prediction is confidently wrong. With softmax + cross-entropy, the gradient at \(z\) is just \(\hat p - y\), which has magnitude \(\Theta(1)\) when wrong — no saturation regardless of how confident the prediction is.

The second is the practical reason. Empirically, cross-entropy networks train substantially faster than MSE-on-softmax networks, and the gap widens with depth.
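
A small sketch of the saturation effect on a single confidently wrong prediction (toy logits; the class indices are chosen only for illustration):

    import torch
    import torch.nn.functional as F

    # Logits putting almost all mass on class 0 while the true class is 2.
    z = torch.tensor([10.0, 0.0, 0.0])
    y = F.one_hot(torch.tensor(2), num_classes=3).float()

    def grad_wrt_logits(loss_fn):
        logits = z.clone().requires_grad_(True)
        loss_fn(logits).backward()
        return logits.grad

    ce_grad = grad_wrt_logits(
        lambda t: F.cross_entropy(t.unsqueeze(0), torch.tensor([2])))
    mse_grad = grad_wrt_logits(
        lambda t: F.mse_loss(F.softmax(t, dim=0), y))

    print(ce_grad.norm().item())    # O(1): about 1.4 here
    print(mse_grad.norm().item())   # ~1e-4: the softmax Jacobian has saturated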

Numerical Stability

Computing \(\log \sigma_{\text{softmax}}(z)_k\) directly is unstable: the softmax can underflow to \(0\) before the log, producing \(-\infty\). Frameworks fuse the operation into a single log-softmax:

\[ \log \sigma_{\text{softmax}}(z)_k = z_k - \log \sum_j e^{z_j} = z_k - z^* - \log \sum_j e^{z_j - z^*}, \qquad z^* = \max_j z_j. \]

The combined form cross_entropy_with_logits (or nn.CrossEntropyLoss in PyTorch) takes raw logits \(z\), not probabilities — mixing an explicit softmax with this loss applies the softmax twice. Use logits everywhere except where you actually need a probability for downstream consumption.
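
A sketch of the stable path and the two pitfalls, assuming a toy batch with deliberately extreme logits:

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([[200.0, 0.0, -200.0]])   # extreme but legal logits
    target = torch.tensor([1])                      # the unlikely class is true

    # Correct: hand the raw logits to the fused loss (log-softmax inside).
    ok = F.cross_entropy(logits, target)

    # Unstable: explicit softmax then log underflows to log(0) = -inf here.
    unstable = -torch.log(F.softmax(logits, dim=1)[0, 1])

    # Wrong: an extra softmax before the fused loss applies softmax twice
    # and silently changes the objective.
    double = F.cross_entropy(F.softmax(logits, dim=1), target)

    print(ok.item(), unstable.item(), double.item())   # 200.0  inf  ~1.55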

Other Losses Worth Knowing

  • Hinge loss \(L = \max(0, 1 - y \hat y)\) for binary \(\pm 1\) targets — the SVM loss. Largely superseded by cross-entropy in deep networks.
  • KL divergence between a target distribution and a predicted distribution — used in VAEs and as a component of label-smoothing cross-entropy. See KL divergence.
  • Contrastive losses (InfoNCE, triplet, NT-Xent) — core to representation learning. The targets are constructed from data structure (positive/negative pairs) rather than provided as labels.
  • Perceptual and adversarial losses (feature-space perceptual losses, LPIPS, GAN losses) — used in image generation, where per-pixel MSE is a poor proxy for perceptual quality.

The MSE/cross-entropy default covers the vast majority of supervised tasks. Reach for an alternative loss only when you have a specific reason — robustness, structured outputs, ranking — that the default does not address.

References

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. https://www.deeplearningbook.org/.