The Universal Approximation Theorem

Claim

Let \(\sigma : \mathbb{R} \to \mathbb{R}\) be a continuous, bounded, monotone function with \(\lim_{t \to -\infty} \sigma(t) = 0\) and \(\lim_{t \to +\infty} \sigma(t) = 1\) (a “sigmoidal” activation). Let \(K \subset \mathbb{R}^n\) be compact. Then the set

\[ \mathcal{N} = \!\left\{ x \mapsto \sum_{i=1}^N \alpha_i\, \sigma(w_i^\top x + b_i) : N \in \mathbb{N},\, \alpha_i \in \mathbb{R},\, w_i \in \mathbb{R}^n,\, b_i \in \mathbb{R} \right\} \]

is dense in \(C(K)\) — the space of continuous real-valued functions on \(K\) — under the supremum norm (Cybenko 1989; Hornik et al. 1989).

This is Cybenko’s (1989) form of the universal approximation theorem. Hornik (1991) generalized to arbitrary non-polynomial continuous activations.

Strategy

The proof is by contradiction via the Hahn–Banach theorem and Riesz representation. Suppose \(\mathcal{N}\) is not dense in \(C(K)\). Then by Hahn–Banach there is a nonzero continuous linear functional \(\Lambda\) on \(C(K)\) that vanishes on \(\overline{\mathcal{N}}\). By Riesz representation, \(\Lambda\) corresponds to integration against a finite signed Borel measure \(\mu\) on \(K\), with

\[ \int_K \varphi(x)\, d\mu(x) = 0 \quad \text{for every } \varphi \in \mathcal{N}. \]

We will show this forces \(\mu = 0\), contradicting the hypothesis that \(\Lambda \ne 0\).

Step 1: \(\sigma\) Is Discriminatory

A function \(\sigma\) is discriminatory with respect to a finite signed Borel measure \(\mu\) if

\[ \int_K \sigma(w^\top x + b)\, d\mu(x) = 0 \quad \forall\, w \in \mathbb{R}^n,\, b \in \mathbb{R} \implies \mu = 0. \]

We show every continuous bounded sigmoidal \(\sigma\) is discriminatory with respect to every finite signed Borel measure on a compact set.

Limiting form of \(\sigma\). For any \(w, b, \varphi\) and \(\lambda > 0\), consider \(\sigma(\lambda(w^\top x + b) + \varphi)\). As \(\lambda \to \infty\):

\[ \sigma\bigl(\lambda(w^\top x + b) + \varphi\bigr) \to \begin{cases} 1 & \text{if } w^\top x + b > 0, \\ 0 & \text{if } w^\top x + b < 0, \\ \sigma(\varphi) & \text{if } w^\top x + b = 0. \end{cases} \]

The limit is the indicator of the open half-space \(H_{w,b}^+ = \{x : w^\top x + b > 0\}\), plus an offset on the hyperplane \(\Pi_{w,b} = \{x : w^\top x + b = 0\}\).

By dominated convergence (each \(\sigma(\cdot)\) is bounded by a constant), if \(\int_K \sigma(w^\top x + b)\, d\mu = 0\) for every \(w, b\), then for every \(w, b, \varphi\),

\[ \mu(H_{w,b}^+) + \sigma(\varphi)\, \mu(\Pi_{w,b}) = 0. \]

Since \(\sigma(\varphi)\) varies continuously and takes more than one value, the only way this can hold for all \(\varphi\) is

\[ \mu(H_{w,b}^+) = 0 \quad \text{and} \quad \mu(\Pi_{w,b}) = 0 \]

for every \(w, b\).

\(\mu\) vanishes on every measurable set. A finite signed Borel measure on a compact set is determined by its values on half-spaces (this follows from the fact that the Fourier transform of the measure restricted to a coordinate is determined by integrals over half-spaces, and the Fourier transform uniquely determines the measure). Concretely:

For fixed \(w\), define the signed measure \(F_w(t) = \mu(\{x : w^\top x \le t\})\) on \(\mathbb{R}\). The above gives \(F_w(t) - F_w(s) = \mu(\{x : s < w^\top x \le t\}) = 0\) for every interval. So \(F_w\) is constant; combined with \(F_w(\infty) = \mu(K)\) being finite and \(F_w\) tending to \(0\) at \(-\infty\), we get \(F_w \equiv 0\).

This means the pushforward of \(\mu\) under \(x \mapsto w^\top x\) is the zero measure for every \(w\). By the Cramér–Wold device (the joint distribution of \((X_1, \ldots, X_n)\) is determined by all linear projections \(w^\top X\)), \(\mu = 0\).

So \(\sigma\) is discriminatory. \(\square\)

Step 2: Combine with Hahn–Banach

Suppose for contradiction \(\overline{\mathcal{N}} \ne C(K)\). By Hahn–Banach, there exists a nonzero continuous linear functional \(\Lambda\) on \(C(K)\) that vanishes on \(\overline{\mathcal{N}}\). By the Riesz representation theorem for compact metric spaces, \(\Lambda\) has the form

\[ \Lambda(\varphi) = \int_K \varphi(x)\, d\mu(x) \]

for some nonzero finite signed Borel measure \(\mu\) on \(K\). Since every \(\varphi \in \mathcal{N}\) is a finite linear combination of functions \(x \mapsto \sigma(w^\top x + b)\), vanishing on all of \(\mathcal{N}\) in particular gives

\[ \int_K \sigma(w^\top x + b)\, d\mu(x) = 0 \quad \forall\, w, b. \]

By Step 1, \(\mu = 0\). Contradiction. Therefore \(\overline{\mathcal{N}} = C(K)\). \(\blacksquare\)

Notes

Hornik’s extension to non-polynomial \(\sigma\). Cybenko’s theorem requires \(\sigma\) to be sigmoidal. Hornik (1991) and Leshno et al. (1993) extended the result: a single-hidden-layer network with continuous activation \(\sigma\) is universal on \(C(K)\) if and only if \(\sigma\) is non-polynomial. The proof uses similar density arguments together with a polynomial-approximation lemma: derivatives of \(\sigma\) at appropriate points produce polynomials of every degree, and these together generate \(C(K)\) via Stone–Weierstrass.

ReLU. \(\sigma(t) = \max(0, t)\) is non-polynomial. ReLU networks are universal approximators by Hornik’s theorem; they additionally have a constructive interpretation as piecewise-linear functions, and any continuous function can be approximated to any accuracy by piecewise-linear functions on a compact set.

No efficiency. Both proofs are non-constructive (Hahn–Banach is) and give no bound on \(N\). Quantitative refinements — Barron’s theorem, Yarotsky’s results — bound \(N\) in terms of the smoothness of \(f\), but the dimension dependence is generally unfavorable.

Beyond compact \(K\). The theorem requires compactness; on \(\mathbb{R}^n\), no finite network can uniformly approximate every continuous function. The compactness assumption is essentially necessary, not a technical convenience.

References

Cybenko, G. 1989. “Approximation by Superpositions of a Sigmoidal Function.” Mathematics of Control, Signals, and Systems 2 (4): 303–14. https://doi.org/10.1007/bf02551274.

Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2 (5): 359–66. https://doi.org/10.1016/0893-6080(89)90020-8.