Universal Approximation

Motivation

A neural network is a parametric family of functions, and using one as a function approximator is only justified if the family is rich enough to express the target function. The universal approximation theorem (Cybenko 1989; Hornik et al. 1989) answers this: a feedforward network with a single hidden layer and a non-trivial activation function can approximate any continuous function on a compact set to arbitrary accuracy, given enough hidden units.

This is a representational statement — the target function is expressible — not a learnability statement: nothing about the theorem says SGD will find the right weights, or that the network will generalize beyond training data, or that the required network is small.

Statement

(Cybenko 1989; Hornik et al. 1989). Let \(\sigma : \mathbb{R} \to \mathbb{R}\) be any non-polynomial continuous function. For every continuous function \(f : K \to \mathbb{R}\) on a compact set \(K \subset \mathbb{R}^n\) and every \(\varepsilon > 0\), there exist a positive integer \(N\), weights \(w_1, \ldots, w_N \in \mathbb{R}^n\), biases \(b_1, \ldots, b_N \in \mathbb{R}\), and output coefficients \(\alpha_1, \ldots, \alpha_N \in \mathbb{R}\) such that

\[ \sup_{x \in K}\, \left|\, f(x) - \sum_{i=1}^{N} \alpha_i\, \sigma(w_i^\top x + b_i) \,\right| < \varepsilon. \]

In other words, single-hidden-layer networks with activation \(\sigma\) are dense in \(C(K)\) under the supremum norm.
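
The theorem is purely an existence statement, but it is easy to probe numerically. Below is a minimal sketch (NumPy assumed; the target function, the tanh activation, and the random-feature least-squares fit are illustrative choices, not part of the theorem) that measures the sup-norm error on a grid as the width \(N\) grows:

    import numpy as np

    rng = np.random.default_rng(0)

    # Target: a continuous function on the compact set K = [-3, 3].
    f = lambda x: np.sin(2 * x) + 0.5 * np.abs(x)
    x = np.linspace(-3.0, 3.0, 2000)

    def sup_error(N):
        # One hidden layer: sum_i alpha_i * tanh(w_i * x + b_i).
        # Hidden weights and biases are drawn at random; only the output
        # coefficients alpha are fit, by least squares on the grid.
        w = rng.normal(scale=3.0, size=N)
        b = rng.uniform(-9.0, 9.0, size=N)
        H = np.tanh(np.outer(x, w) + b)           # grid points x N hidden units
        alpha, *_ = np.linalg.lstsq(H, f(x), rcond=None)
        return np.max(np.abs(f(x) - H @ alpha))   # sup-norm error on the grid

    for N in (5, 20, 100, 500):
        print(f"N = {N:3d}   sup-norm error on grid ~ {sup_error(N):.4f}")

The error typically falls as \(N\) grows; the theorem only asserts that suitable weights exist, and this random-feature fit is just one convenient way to exhibit them.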

Diagram: two opposing sigmoids approximate a rectangular pulse

A pair of sigmoids \(\sigma(c(x - a))\) with opposite-signed output coefficients builds a rising edge on one side and a falling edge on the other; their sum approximates a rectangular pulse. With more units, more elaborate targets follow.

Figure: the sigmoids \(f_1 = \sigma(5(x+1))\) and \(f_2 = \sigma(5(x-1))\), their difference \(f_1 - f_2\), and the rectangular target, plotted for \(x \in [-3, 3]\). A two-unit network, with output \(\alpha_1\sigma(w_1 x + b_1) + \alpha_2\sigma(w_2 x + b_2)\), already produces a rectangular pulse. Stacking many such bumps approximates any continuous target on a bounded interval, which is the content of the universal approximation theorem.
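
To make the figure concrete, here is a small numerical sketch (NumPy assumed; the slope 5 and the pulse edges at \(\pm 1\) are read off the figure) evaluating the two-unit output at a few points:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Two-unit network with alpha = (+1, -1), w = (5, 5), b = (+5, -5):
    # output(x) = sigma(5(x + 1)) - sigma(5(x - 1)).
    def two_unit(x):
        return sigmoid(5 * (x + 1)) - sigmoid(5 * (x - 1))

    for xi in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x = {xi:+.1f}   output ~ {two_unit(xi):.3f}")
    # Roughly 1 inside [-1, 1] and roughly 0 outside; increasing the
    # slope sharpens the edges of the pulse.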

The non-polynomial condition is necessary and sufficient: if \(\sigma\) is a polynomial, the entire network is a polynomial of bounded degree and cannot approximate non-polynomial functions. Standard activations — sigmoid, tanh, ReLU, GELU, Swish — all qualify.
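
The necessity direction is a degree count: if \(\sigma\) is a polynomial of degree \(k\), then for any weights and biases

\[ \sum_{i=1}^{N} \alpha_i\, \sigma(w_i^\top x + b_i) \]

is itself a polynomial in \(x\) of degree at most \(k\), no matter how large \(N\) is. Such sums lie in a fixed finite-dimensional subspace of \(C(K)\), which is closed and proper, so they cannot be dense.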

What the Theorem Does Not Say

The universal approximation theorem is widely cited as a justification for “neural networks can do anything,” but its scope is narrower than the slogan suggests:

  • It does not bound \(N\). The required width may be exponential in the input dimension or in \(1/\varepsilon\). There is no efficiency guarantee.
  • It does not address optimization. Existence of weights is not the same as gradient descent finding them.
  • It does not imply generalization. The theorem talks about approximating a known function on a compact set, not about learning from finitely many samples.
  • It does not say one hidden layer is best. Many natural functions require exponentially fewer units in deep networks than in shallow ones — the theorem is silent on this.

In modern deep learning, universal approximation is rarely the binding constraint; depth, optimization, and generalization are.

Variants

Bounded-width networks. Lu et al. (2017) showed that deep ReLU networks of width \(n + 4\), where \(n\) is the input dimension, are universal approximators. The width-depth trade-off thus runs both ways: wide shallow networks and sufficiently deep narrow networks are both universal.

Convolutional networks. With appropriate architectural choices, convolutional networks are universal approximators on shift-invariant function classes. The translation-equivariance constraint of convolution is not a barrier to expressivity for shift-invariant targets.

Approximation rates. Refinements (Barron, 1993; Yarotsky, 2017) give explicit rates: how does \(N\) scale with \(\varepsilon\) for various function classes? For functions with bounded Fourier moments (Barron’s theorem), the required width is \(O(1/\varepsilon^2)\), independent of input dimension — a rare escape from the curse of dimensionality.
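
One version of Barron's bound makes the rate explicit: if \(f\) has Fourier transform \(\hat f\) with finite first moment \(C_f = \int_{\mathbb{R}^n} \|\omega\|\, |\hat f(\omega)|\, d\omega\), then for any probability measure \(\mu\) supported on a ball of radius \(r\) there is a single-hidden-layer sigmoidal network \(f_N\) with \(N\) units such that

\[ \int \big( f(x) - f_N(x) \big)^2 \, d\mu(x) \;\le\; \frac{(2 r C_f)^2}{N}, \]

so an \(L^2(\mu)\) error of \(\varepsilon\) needs \(N = O\big((r C_f / \varepsilon)^2\big)\) units, with the dimension entering only through \(C_f\) and \(r\).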

Probabilistic approximation. Measuring error in \(L^2(\mu)\) for a probability measure \(\mu\), rather than uniformly over \(K\), opens additional results and applies more naturally to learning settings, where error is averaged over a data distribution.
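
Concretely, the quantity driven below \(\varepsilon\) is

\[ \|f - f_N\|_{L^2(\mu)} = \left( \int_K \big| f(x) - f_N(x) \big|^2 \, d\mu(x) \right)^{1/2}, \]

the root-mean-square error under \(\mu\), in contrast to the worst-case error \(\sup_{x \in K} |f(x) - f_N(x)|\) of the main theorem.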

Why Depth Matters Anyway

For natural function classes, deep networks need exponentially fewer parameters than shallow networks to achieve the same approximation error. Examples:

  • Compositional functions. A function of the form \(f_d \circ f_{d-1} \circ \cdots \circ f_1\) where each \(f_i\) is “simple” can be represented exactly by a depth-\(d\) network with linear total width, but requires width exponential in \(d\) in a depth-\(1\) network (Telgarsky, 2016); a sketch follows this list.
  • Symmetric functions. Parity, sorting, and related Boolean functions exhibit similar exponential depth-width gaps.
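
The sketch promised above: Telgarsky-style depth separation builds on a two-unit ReLU "tent" map whose \(d\)-fold composition has \(2^d\) linear pieces. A depth-\(d\) network realizes it with a constant number of units per layer, while a single-hidden-layer ReLU network with \(N\) units is piecewise linear with at most \(N + 1\) pieces, so matching it requires width on the order of \(2^d\). A minimal demonstration (NumPy assumed; the piece count is estimated from the discrete slope):

    import numpy as np

    def tent(x):
        # Tent map written as a two-unit ReLU layer: 2*relu(x) - 4*relu(x - 0.5).
        relu = lambda z: np.maximum(z, 0.0)
        return 2 * relu(x) - 4 * relu(x - 0.5)

    def deep_tent(x, depth):
        # Depth-d composition: one small tent layer per level, O(d) units in total.
        for _ in range(depth):
            x = tent(x)
        return x

    x = np.linspace(0.0, 1.0, 100001)
    for depth in (1, 2, 4, 8):
        y = deep_tent(x, depth)
        # Count linear pieces via sign changes of the discrete slope
        # (zero slopes at samples straddling a kink are dropped first).
        s = np.sign(np.diff(y))
        s = s[s != 0]
        pieces = 1 + np.count_nonzero(np.diff(s))
        print(f"depth = {depth}   linear pieces = {pieces}")

Each extra tent layer doubles the piece count at constant added width, which is exactly the kind of compositional structure that depth exploits.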

This is the practical justification for deep architectures: not whether they can express a target (the universal approximation theorem says any reasonable family can) but how efficiently they express targets that have inherent compositional structure.

Connection to the Practice of Deep Learning

In day-to-day deep learning, universal approximation is invoked rarely and casually — the theorem is well-known enough to function as background reassurance (“of course the network can express this; it’s universal”), but the actual engineering choices are guided by other considerations: gradient flow, parameter efficiency, computational cost, inductive biases that aid generalization. The expressive-power bottleneck described by approximation theory is not the bottleneck most practitioners face.

References

Cybenko, George. 1989. “Approximation by Superpositions of a Sigmoidal Function.” Mathematics of Control, Signals, and Systems 2 (4): 303–14. https://doi.org/10.1007/bf02551274.
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2 (5): 359–66. https://doi.org/10.1016/0893-6080(89)90020-8.