Activation Functions
Motivation
The nonlinearity in a neural network is the activation function: a fixed elementwise function \(\sigma\) applied to the pre-activation \(z = Wa + b\). Without it, a stack of linear layers collapses into a single linear layer: \(W_2 (W_1 x) = (W_2 W_1) x\). The activation is what makes a deep network express functions a shallow linear model cannot, and what gives it the universal-approximation property (Hornik et al. 1989).
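A quick numerical check of the collapse (a minimal sketch assuming NumPy; the layer widths are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # input vector
W1 = rng.normal(size=(5, 4))       # first linear layer
W2 = rng.normal(size=(3, 5))       # second linear layer

deep = W2 @ (W1 @ x)               # two linear layers, no activation in between
shallow = (W2 @ W1) @ x            # one equivalent linear layer

assert np.allclose(deep, shallow)  # the stack collapses to a single matrix
```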
Beyond expressivity, activations have direct consequences for trainability. A function whose derivative is near zero over most of its domain (saturating activations such as sigmoid) starves backpropagation of gradient signal; one whose derivative grows without bound (such activations have fallen out of use) causes activations and gradients to explode. Modern practice is dominated by ReLU and its variants because they are cheap, non-saturating in the positive direction, and easy to train.
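To make the saturation point concrete, a small sketch (assuming NumPy; the depth of 20 layers is illustrative) showing that the sigmoid's derivative never exceeds 0.25, so the backpropagated factor shrinks geometrically with depth, while the ReLU derivative is exactly 1 wherever the pre-activation is positive:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6.0, 6.0, 7)
d_sigmoid = sigmoid(z) * (1.0 - sigmoid(z))  # peaks at 0.25, near zero for large |z|
d_relu = (z > 0).astype(float)               # exactly 1 wherever z > 0

# Backpropagation multiplies one such factor per layer; through 20 layers:
print(d_sigmoid.max() ** 20)   # 0.25**20 ~ 9e-13: the gradient has vanished
print(d_relu.max() ** 20)      # 1.0: ReLU passes the signal through unchanged
```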
Output Activations
The output activation is dictated by the loss and the target distribution, not chosen for representational reasons.
Identity (regression)
For real-valued targets trained with mean-squared error, the output activation is the identity: \(\hat y = z_L\). This corresponds to assuming a Gaussian likelihood with fixed variance.
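To make the correspondence explicit (a one-line derivation; here \(\sigma^2\) denotes the fixed noise variance, not an activation): minimizing
\[ -\log \mathcal{N}(y \mid \hat y, \sigma^2) = \frac{(y - \hat y)^2}{2\sigma^2} + \tfrac{1}{2}\log(2\pi\sigma^2) \]
over \(\hat y\) is the same as minimizing the squared error, since the remaining terms do not depend on \(\hat y\).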
Sigmoid (binary classification)
For a single binary label \(y \in \{0, 1\}\), output a probability via the sigmoid: \(\hat p = \sigma_{\text{sigmoid}}(z_L)\). Combined with the binary cross-entropy loss, the gradient with respect to the logit \(z_L\) simplifies to \(\hat p - y\).
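A minimal NumPy sketch (logit and label values are arbitrary) that checks this simplification against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    # Binary cross-entropy of the sigmoid probability against label y.
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, eps = 0.7, 1.0, 1e-6
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y          # the claimed gradient with respect to the logit

assert np.isclose(numeric, analytic, atol=1e-6)
```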
Softmax (multiclass classification)
For \(K\) mutually exclusive classes, the softmax maps a vector \(z \in \mathbb{R}^K\) to a categorical distribution:
\[ \sigma_{\text{softmax}}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}. \]
Softmax outputs are non-negative and sum to \(1\). Combined with cross-entropy, the loss reduces to \(-\log \sigma_{\text{softmax}}(z)_y\) for the true class \(y\), and the gradient with respect to the logits is \(\sigma_{\text{softmax}}(z) - e_y\) (the predicted distribution minus the one-hot truth): clean, well-conditioned, and non-zero unless the prediction is exactly right.
Numerical stability. Computing softmax directly from the formula overflows for large \(z_k\). The standard fix subtracts the max: \(\sigma_{\text{softmax}}(z)_k = e^{z_k - z^*} / \sum_j e^{z_j - z^*}\) with \(z^* = \max_j z_j\). Frameworks always use this internally; one should never implement softmax from the raw formula.
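A minimal NumPy sketch of the max-subtraction trick, plus a finite-difference check that the cross-entropy gradient with respect to the logits is the predicted distribution minus the one-hot target (the logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating so exp never overflows.
    e = np.exp(z - z.max())
    return e / e.sum()

# Stability: the naive formula would overflow on exp(1000).
assert np.isfinite(softmax(np.array([1000.0, 2.0, -5.0]))).all()

# Gradient check on moderate logits: d/dz of -log softmax(z)[y]
# equals softmax(z) - one_hot(y).
z = np.array([0.5, -1.2, 2.0])
y = 2
one_hot = np.eye(3)[y]
eps = 1e-6
numeric = np.array([
    (-np.log(softmax(z + eps * np.eye(3)[k])[y])
     + np.log(softmax(z - eps * np.eye(3)[k])[y])) / (2 * eps)
    for k in range(3)
])
assert np.allclose(numeric, softmax(z) - one_hot, atol=1e-4)
```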
How to Pick
Default rules for current practice (a minimal sketch putting them together appears at the end of this section):
- Hidden layers: ReLU. Use GELU in transformers.
- Output for regression: identity.
- Output for binary classification: sigmoid + binary cross-entropy.
- Output for multiclass classification: softmax + cross-entropy.
Do not mix activations across hidden layers without a reason. Do not use sigmoid or tanh in hidden layers of a deep MLP unless required by the architecture (e.g., LSTM gates).
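Putting the defaults together, a minimal sketch (assuming PyTorch; the layer sizes and batch are illustrative) of a multiclass MLP: ReLU in the hidden layers, raw logits at the output, and the softmax folded into the cross-entropy loss by the framework:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # hidden layers use ReLU
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),               # output raw logits; no activation here
)
loss_fn = nn.CrossEntropyLoss()        # applies log-softmax internally (stable)

x = torch.randn(32, 784)               # a dummy batch of inputs
y = torch.randint(0, 10, (32,))        # dummy class labels
loss = loss_fn(model(x), y)
loss.backward()
```

For binary classification the analogous pattern is a single-logit output paired with nn.BCEWithLogitsLoss, which fuses the sigmoid into the loss for the same numerical-stability reason.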