Activation Functions

Motivation

The nonlinearity in a neural network is the activation function: a fixed elementwise function \(\sigma\) applied to the pre-activation \(z = Wa + b\). Without it, a stack of linear layers collapses into a single linear layer: \(W_2 (W_1 x) = (W_2 W_1) x\). The activation is what makes a deep network express functions a shallow linear model cannot, and what gives it the universal-approximation property (Hornik et al. 1989).
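
A quick numerical check of this collapse, sketched in NumPy with arbitrary layer sizes: two stacked weight matrices applied in sequence give exactly the same output as the single merged matrix \(W_2 W_1\).

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)            # arbitrary input vector
    W1 = rng.normal(size=(4, 5))      # first linear layer (no activation)
    W2 = rng.normal(size=(3, 4))      # second linear layer (no activation)

    two_layers = W2 @ (W1 @ x)        # stack of two linear maps
    one_layer = (W2 @ W1) @ x         # single merged linear map
    print(np.allclose(two_layers, one_layer))   # True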

Beyond expressivity, the choice of activation has direct consequences for trainability. A function whose derivative is zero over most of its domain (a saturating activation such as sigmoid) starves backpropagation of gradient signal; one whose derivative is unbounded (a choice no longer used in practice) causes activations to explode. Modern practice is dominated by ReLU and its variants because they are cheap, non-saturating in the positive direction, and easy to train.

Hidden-Layer Activations

Diagram: the four hidden-layer activations (sigmoid, tanh, ReLU, GELU) plotted on the same axes for \(z \in [-3, 3]\). Sigmoid saturates at 0/1 with vanishing slope; tanh likewise at \(\pm 1\). ReLU and GELU are unbounded above with non-vanishing slope for large positive \(z\), which is why they are the modern defaults.

ReLU

\[ \sigma(z) = \max(0, z), \qquad \sigma'(z) = \mathbb{1}[z > 0]. \]

Linear for positive inputs, zero for negative. The gradient is exactly \(1\) in the active region — no saturation, no shrinking gradients with depth — and exactly \(0\) in the inactive region. Training is fast and stable.
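
A minimal NumPy sketch of ReLU and the gradient it passes back during backpropagation (function names are illustrative):

    import numpy as np

    def relu(z):
        # Elementwise max(0, z).
        return np.maximum(0.0, z)

    def relu_backward(grad_out, z):
        # Derivative is 1 where z > 0 and 0 elsewhere, so the upstream
        # gradient passes through unchanged in the active region and is
        # blocked in the inactive one.
        return grad_out * (z > 0)

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(z))                             # [0.  0.  0.  0.5 2. ]
    print(relu_backward(np.ones_like(z), z))   # [0. 0. 0. 1. 1.]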

Dead-ReLU problem. A unit whose pre-activation is negative for every input in the dataset has zero gradient and never updates. This can happen when a large update, often from too high a learning rate, drives the bias strongly negative; once dead, the unit receives no gradient with which to recover. The fraction of dead units is a useful diagnostic; a few percent is normal, \(>50\%\) is a red flag. Mitigations include lower learning rates, careful initialization, and the variants below.
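
One way to measure this diagnostic, sketched in NumPy: run a representative batch through the layer and count the units whose pre-activation is negative for every example. The array shapes and the toy data below are assumptions for illustration.

    import numpy as np

    def dead_fraction(Z):
        # Z: pre-activations with shape (num_examples, num_units).
        # A unit is dead on this batch if ReLU outputs zero for every example.
        return np.all(Z <= 0, axis=0).mean()

    rng = np.random.default_rng(0)
    # Toy pre-activations with a strongly negative mean, so that some
    # units never activate on this batch.
    Z = rng.normal(loc=-3.0, size=(1024, 256))
    print(f"dead units: {dead_fraction(Z):.1%}")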

Leaky ReLU and PReLU

\[ \sigma(z) = \max(\alpha z, z), \qquad \alpha \in (0, 1). \]

Leaky ReLU fixes \(\alpha\) (typically \(0.01\)); parametric ReLU (PReLU) learns it per-channel. Eliminates the dead-ReLU pathology by giving negative inputs a small non-zero gradient. Modest empirical gains; popular in image models when ReLU has been observed to underperform.
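
Both variants in a short NumPy sketch; the fixed slope of 0.01 follows the text, and the \(\alpha\) passed to the PReLU version stands in for a learned per-channel parameter.

    import numpy as np

    def leaky_relu(z, alpha=0.01):
        # max(alpha * z, z): identity for z > 0, small fixed slope for z < 0.
        return np.where(z > 0, z, alpha * z)

    def prelu(z, alpha):
        # Same function, but alpha is a learnable parameter,
        # typically one value per channel (broadcast over the batch).
        return np.where(z > 0, z, alpha * z)

    z = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
    print(leaky_relu(z))              # [-0.02  -0.001  0.     0.1    2.   ]
    print(prelu(z, alpha=0.25))       # [-0.5   -0.025  0.     0.1    2.   ]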

GELU

\[ \sigma(z) = z \, \Phi(z), \]

where \(\Phi\) is the standard Gaussian CDF. Smooth, looks like ReLU for \(|z| \gtrsim 2\) and rolls off more gently around \(0\). Slightly more expensive than ReLU; standard in transformer architectures (BERT, GPT-2 onward) where the empirical advantage is consistent.
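
A sketch of the exact form via the error function, alongside the tanh-based approximation many implementations use (the approximation constants below are the commonly quoted ones; treat them as indicative):

    import numpy as np
    from scipy.special import erf

    def gelu(z):
        # Exact GELU: z * Phi(z), with Phi the standard Gaussian CDF.
        return z * 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

    def gelu_tanh(z):
        # Widely used tanh approximation (cheaper to evaluate).
        return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

    z = np.linspace(-4.0, 4.0, 9)
    print(np.max(np.abs(gelu(z) - gelu_tanh(z))))   # difference stays small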

Sigmoid and Tanh

\[ \sigma_{\text{sigmoid}}(z) = \frac{1}{1 + e^{-z}} \in (0, 1), \qquad \sigma_{\text{tanh}}(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} \in (-1, 1). \]

Both are smooth and bounded. Both saturate: \(\sigma'(z) \to 0\) as \(|z| \to \infty\). Sigmoid’s derivative tops out at \(1/4\) at \(z = 0\); tanh’s at \(1\). Stacking saturating activations causes vanishing gradients because the chain-rule product of layer Jacobians shrinks geometrically.
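
The shrinkage is easy to see in a toy scalar chain: each layer contributes a sigmoid derivative of at most \(1/4\), so the chain-rule product collapses toward zero with depth (weights are ignored here for simplicity).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1.0 - s)          # maximum value 1/4, attained at z = 0

    rng = np.random.default_rng(0)
    z = rng.normal(size=50)           # pre-activations of a 50-layer scalar chain
    print(np.prod(sigmoid_grad(z)))   # chain-rule product: vanishingly small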

Hidden layers in modern feedforward networks should not use sigmoid; it was the dominant choice in the 1990s and a major reason deep MLPs were considered untrainable until around 2010. Tanh is occasionally still used in LSTMs and GRUs, where the gating structure tames the saturation problem.

Output Activations

The output activation is dictated by the loss and the target distribution, not chosen for representational reasons.

Identity (regression)

For real-valued targets trained with mean-squared error, the output activation is the identity: \(\hat y = z_L\). This corresponds to a Gaussian likelihood model.

Sigmoid (binary classification)

For a single binary label \(y \in \{0, 1\}\), output a probability via sigmoid: \(\hat p = \sigma_{\text{sigmoid}}(z_L)\). Combined with binary cross-entropy loss, the gradient at the output simplifies to \(\hat p - y\).
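
A quick check of that simplification, sketched in NumPy: the analytic gradient \(\hat p - y\) matches a finite-difference estimate of the binary cross-entropy with respect to the logit.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def bce(z, y):
        # Binary cross-entropy as a function of the logit z.
        p = sigmoid(z)
        return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    z, y, eps = 0.7, 1.0, 1e-6
    analytic = sigmoid(z) - y                                  # simplified gradient
    numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)  # finite difference
    print(analytic, numeric)                                   # both close to -0.332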

Softmax (multiclass classification)

For \(K\) mutually exclusive classes, the softmax maps a vector \(z \in \mathbb{R}^K\) to a categorical distribution:

\[ \sigma_{\text{softmax}}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}. \]

Softmax outputs are non-negative and sum to \(1\). Combined with cross-entropy, the loss reduces to \(-\log \sigma_{\text{softmax}}(z)_y\) for the true class \(y\), and the gradient at the output is \(\sigma_{\text{softmax}}(z) - e_y\) (the predicted distribution minus the one-hot truth) — clean, well-conditioned, and always non-zero unless prediction is exactly right.

Numerical stability. Computing softmax directly from the formula overflows for large \(z_k\). The standard fix subtracts the max: \(\sigma_{\text{softmax}}(z)_k = e^{z_k - z^*} / \sum_j e^{z_j - z^*}\) with \(z^* = \max_j z_j\). Frameworks always use this internally; one should never implement softmax from the raw formula.
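
A NumPy sketch of the stabilized computation, together with the gradient from the previous paragraph; the logits below are deliberately extreme so that the naive formula would overflow.

    import numpy as np

    def softmax(z):
        # Subtracting the max makes the largest exponent 0; the result
        # is mathematically identical to the raw formula.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def cross_entropy(z, y):
        # -log softmax(z)[y], computed via the same shift (log-sum-exp trick).
        shifted = z - np.max(z)
        return np.log(np.sum(np.exp(shifted))) - shifted[y]

    z = np.array([1000.0, 2.0, -1.0])      # naive exp(1000.0) would overflow
    y = 2                                  # index of the true class
    print(cross_entropy(z, y))             # finite: about 1001
    print(softmax(z) - np.eye(len(z))[y])  # gradient: prediction minus one-hot truth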

How to Pick

Default rules for current practice:

  • Hidden layers: ReLU. Use GELU in transformers.
  • Output for regression: identity.
  • Output for binary classification: sigmoid + binary cross-entropy.
  • Output for multiclass classification: softmax + cross-entropy.

Do not mix activations across hidden layers without a reason. Do not use sigmoid or tanh in hidden layers of a deep MLP unless required by the architecture (e.g., LSTM gates).

References

Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2 (5): 359–66. https://doi.org/10.1016/0893-6080(89)90020-8.