Multilayer Perceptron
Motivation
A multilayer perceptron (MLP) is the simplest neural network architecture: a stack of fully connected linear layers separated by elementwise nonlinearities. Rooted in the perceptron (Rosenblatt 1958), it is the workhorse component inside larger architectures: every transformer block contains an MLP as its position-wise feedforward layer, and CNN classifiers typically end with an MLP head. Understanding the MLP is the foundation for understanding everything else in deep learning.
The key facts: an MLP with one hidden layer is already a universal approximator for continuous functions on a compact set; depth gives it parameter efficiency that single-hidden-layer networks lack; and gradients with respect to all of its parameters can be computed in time comparable to a single forward pass via backpropagation (Rumelhart et al. 1986).
Architecture
An \(L\)-layer MLP maps an input \(x \in \mathbb{R}^{d_0}\) to an output \(\hat y \in \mathbb{R}^{d_L}\) through \(L\) layers. Layer \(\ell \in \{1, \ldots, L\}\) has parameters \(W_\ell \in \mathbb{R}^{d_\ell \times d_{\ell-1}}\) and \(b_\ell \in \mathbb{R}^{d_\ell}\) and computes
\[ z_\ell = W_\ell a_{\ell-1} + b_\ell, \qquad a_\ell = \sigma_\ell(z_\ell), \]
with \(a_0 = x\) and \(\hat y = a_L\). The function \(\sigma_\ell\) is an activation function — typically ReLU for hidden layers and softmax (for classification) or identity (for regression) at the output.
Total parameter count: \(\sum_{\ell=1}^{L} (d_\ell d_{\ell-1} + d_\ell)\). Both the parameter count and the compute cost are dominated by the weight matrices: the matrix-vector product at layer \(\ell\) takes \(O(d_\ell d_{\ell-1})\) operations per example.
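As a concrete illustration of these equations, here is a minimal NumPy sketch of the forward pass and the parameter count. The layer sizes, the ReLU hidden activations, and the identity output are illustrative choices, not part of the definition.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases):
    """Forward pass: a_0 = x, z_l = W_l a_{l-1} + b_l, a_l = sigma_l(z_l)."""
    a = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b
        a = relu(z) if l < len(weights) - 1 else z  # identity at the output layer
    return a

# Illustrative sizes: d_0 = 8 inputs, two hidden layers of 16 units, d_L = 4 outputs.
dims = [8, 16, 16, 4]
rng = np.random.default_rng(0)
weights = [rng.normal(0, np.sqrt(2 / d_in), size=(d_out, d_in))
           for d_in, d_out in zip(dims[:-1], dims[1:])]
biases = [np.zeros(d_out) for d_out in dims[1:]]

x = rng.normal(size=dims[0])
y_hat = mlp_forward(x, weights, biases)

# Parameter count: sum over layers of d_l * d_{l-1} + d_l.
n_params = sum(W.size + b.size for W, b in zip(weights, biases))
print(y_hat.shape, n_params)  # (4,) and (8*16+16) + (16*16+16) + (16*4+4) = 484
```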
Why Depth and Why Width
A single hidden layer with enough units can already approximate any continuous function on a compact set (universal approximation theorem). So why use depth?
Two answers:
- Parameter efficiency. Functions with hierarchical or compositional structure can be represented with exponentially fewer parameters by deep networks than by shallow ones. The classic illustration is the parity function on \(n\) bits, which a depth-\(O(\log n)\) network computes with \(O(n)\) units as a tree of pairwise XORs, while constant-depth circuits for it require exponentially many gates (see the sketch after this list).
- Trainability. At matched parameter count, wide shallow networks are empirically harder to train to good solutions than deeper ones. The reasons are subtle and partly empirical, but residual connections, normalization layers, and modern optimizers were all developed around keeping deep networks trainable.
Width matters too: too narrow a network bottlenecks information flow; too wide wastes compute. Modern practice picks both based on the task and the available data.
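To make the tree-of-XORs picture concrete, here is a small sketch that computes parity exactly with ReLU units arranged in \(O(\log n)\) depth and roughly \(O(n)\) units in total. It only demonstrates the deep construction; it says nothing about lower bounds for shallow networks.

```python
import itertools
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def xor_gate(a, b):
    """Exact XOR of two {0,1} values using a single 2-unit ReLU layer:
    XOR(a, b) = relu(a + b) - 2 * relu(a + b - 1)."""
    return relu(a + b) - 2.0 * relu(a + b - 1.0)

def deep_parity(bits):
    """Parity of n bits as a balanced tree of XOR gates: depth O(log n), O(n) units total."""
    layer = list(bits)
    while len(layer) > 1:
        nxt = [xor_gate(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:          # an odd leftover element passes through to the next level
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

# Check against the reference parity on all 8-bit inputs.
n = 8
for bits in itertools.product([0.0, 1.0], repeat=n):
    assert deep_parity(bits) == sum(bits) % 2
print("parity tree matches on all", 2 ** n, "inputs")
```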
What MLPs Are Used For
In their pure form, MLPs are good for:
- Tabular data with no spatial or sequential structure.
- The classifier head on top of a CNN or transformer backbone.
- The position-wise feedforward block inside a transformer (typically a 2-layer MLP with hidden dimension \(4\times\) the input; see the sketch at the end of this section).
- Function approximation for value functions and policies in reinforcement learning.
For images, use convolutional networks; for sequences, use recurrent networks or transformers. The relevant inductive biases (translation equivariance, locality, sequence structure) are missing from a plain MLP, which must compensate with more data and more parameters.
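Here is a minimal sketch of the position-wise feedforward block mentioned above, assuming ReLU and a \(4\times\) hidden width; real transformer implementations also add residual connections and normalization, and often use GELU instead of ReLU.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ffn_block(X, W1, b1, W2, b2):
    """Position-wise feedforward block: the same 2-layer MLP applied at every position.
    X has shape (sequence_length, d_model); the hidden width is typically 4 * d_model."""
    return relu(X @ W1 + b1) @ W2 + b2

d_model, d_hidden, seq_len = 64, 4 * 64, 10   # illustrative sizes
rng = np.random.default_rng(0)
W1 = rng.normal(0, np.sqrt(2 / d_model), size=(d_model, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(0, np.sqrt(2 / d_hidden), size=(d_hidden, d_model))
b2 = np.zeros(d_model)

X = rng.normal(size=(seq_len, d_model))
print(ffn_block(X, W1, b1, W2, b2).shape)  # (10, 64): one output vector per position
```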
Training
Training an MLP minimizes a loss function over a dataset by stochastic gradient descent (or a variant such as Adam). The gradients are computed by backpropagation, which is reverse-mode automatic differentiation applied to the MLP's computational graph. Each iteration proceeds as follows (a minimal sketch follows the list):
- Sample a minibatch.
- Forward pass: compute activations \(a_1, \ldots, a_L\) and the loss.
- Backward pass: compute \(\partial L / \partial W_\ell\) and \(\partial L / \partial b_\ell\) for every layer.
- Update parameters: \(W_\ell \leftarrow W_\ell - \eta \, \partial L / \partial W_\ell\) and \(b_\ell \leftarrow b_\ell - \eta \, \partial L / \partial b_\ell\), where \(\eta\) is the learning rate.
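A minimal sketch of this loop for a one-hidden-layer MLP on a synthetic regression task, with the backward pass written out by hand; the task, layer sizes, and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2 = 4, 32, 1                  # illustrative layer sizes
lr, batch, steps = 1e-2, 64, 2000      # illustrative hyperparameters

# He initialization for the ReLU hidden layer, fan-in scaling for the linear output.
W1 = rng.normal(0, np.sqrt(2 / d0), size=(d0, d1)); b1 = np.zeros(d1)
W2 = rng.normal(0, np.sqrt(1 / d1), size=(d1, d2)); b2 = np.zeros(d2)

for step in range(steps):
    # Sample a minibatch from a synthetic regression task: y = sum(x^2).
    X = rng.normal(size=(batch, d0))
    y = (X ** 2).sum(axis=1, keepdims=True)

    # Forward pass: keep intermediate activations for the backward pass.
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0)
    y_hat = a1 @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass: chain rule from the loss back to every parameter.
    g = 2.0 * (y_hat - y) / y.size     # dL/dy_hat
    dW2 = a1.T @ g
    db2 = g.sum(axis=0)
    da1 = g @ W2.T
    dz1 = da1 * (z1 > 0)               # ReLU gradient
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # SGD update: W <- W - lr * dL/dW.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final minibatch loss:", loss)
```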
The loss surface is highly non-convex, but in practice gradient descent reliably finds good solutions for over-parameterized networks. Why this works at all, given that arbitrarily bad local minima are theoretically possible, remains an active research question; the empirical fact that it does work is what makes deep learning practical.
Initialization
Naive initialization (e.g., all zeros, which leaves every unit in a layer identical, or a small Gaussian with no variance scaling) breaks training. The standard schemes are:
- He / Kaiming initialization for ReLU networks: \(W_{ij} \sim \mathcal{N}(0, 2/d_{\ell-1})\). Preserves activation variance through layers.
- Xavier / Glorot initialization for tanh/sigmoid networks: \(W_{ij} \sim \mathcal{N}(0, 2/(d_{\ell-1} + d_\ell))\). Same idea, but balancing fan-in and fan-out so that both activations and gradients stay well scaled.
The scaling matters: too large and activations explode through the layers; too small and they vanish. A correctly initialized MLP starts training with activations of \(O(1)\) at every layer, which is what gradient-based optimization needs.
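A small sketch of this effect, assuming ReLU layers of equal (illustrative) width: with He scaling the activation scale stays \(O(1)\) across many layers, while a fixed small standard deviation makes activations collapse toward zero.

```python
import numpy as np

def forward_std(init_std_fn, depth=30, width=256, n=1000, seed=0):
    """Push Gaussian inputs through `depth` ReLU layers and return the final activation std."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n, width))
    for _ in range(depth):
        W = rng.normal(0, init_std_fn(width), size=(width, width))
        a = np.maximum(a @ W, 0.0)
    return a.std()

# He: std = sqrt(2 / fan_in) keeps activations at O(1); a fixed small std makes them vanish.
print("He init    :", forward_std(lambda fan_in: np.sqrt(2.0 / fan_in)))
print("Naive 0.01 :", forward_std(lambda fan_in: 0.01))
```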