Defining Model Hypotheses
Motivation
Before any learning algorithm can pick a model, somebody has to decide what kinds of models are allowed. The set of candidate functions a learner is willing to consider is called the hypothesis class (or hypothesis space). Choosing it is the most consequential modeling decision: it determines what patterns can be expressed, how much data will be needed, and where the failure modes lie. Statistical learning theory is largely the study of how hypothesis classes interact with data (Mitchell 1997; Vapnik 1991; Shalev-Shwartz and Ben-David 2014).
This article is about defining the hypothesis class — the what. The companion article on finding better hypotheses covers the how: searching within a class for a good fit.
The Setup
A supervised learning problem comes with:
- An input space \(\mathcal{X}\) (the kind of thing the model takes in — vectors, images, sentences).
- An output space \(\mathcal{Y}\) (the kind of thing the model predicts — real numbers, class labels, probability distributions).
- An unknown target function \(f^* : \mathcal{X} \to \mathcal{Y}\) (or a conditional distribution \(p^*(y \mid x)\)) generating the data.
- A training sample \(\{(x_i, y_i)\}_{i=1}^N\) drawn i.i.d. from a joint distribution over \(\mathcal{X} \times \mathcal{Y}\).
A hypothesis is a candidate function \(h : \mathcal{X} \to \mathcal{Y}\). A hypothesis class \(\mathcal{H}\) is a set of such functions. Learning means picking some \(\hat h \in \mathcal{H}\) that approximates \(f^*\) well in some sense.
What a Hypothesis Class Looks Like
A hypothesis class is typically specified by a parameterization. The class consists of all functions you get by sweeping the parameters over their allowed values.
Linear hypotheses
The most fundamental class. For regression:
\[ \mathcal{H} = \{ h_{\mathbf{w}, b}(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b \mid \mathbf{w} \in \mathbb{R}^d,\, b \in \mathbb{R} \}. \]
For binary classification, \(h_{\mathbf{w}, b}(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b)\). The parameters are \((\mathbf{w}, b)\) with \(d + 1\) degrees of freedom.
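In code, a hypothesis class is just a function of its parameters: each setting of \((\mathbf{w}, b)\) picks out one member of \(\mathcal{H}\). A minimal sketch in Python (the helper names here are ours, for illustration):

```python
import numpy as np

# A hypothesis class in code is a function of its parameters.
# Each choice of (w, b) picks out one member h of H.
def linear_hypothesis(w, b):
    """Return the regression hypothesis h(x) = w^T x + b."""
    return lambda x: np.dot(w, x) + b

def linear_classifier(w, b):
    """Return the binary classifier h(x) = sign(w^T x + b)."""
    return lambda x: np.sign(np.dot(w, x) + b)

# Two parameter settings, two different members of the class.
h1 = linear_hypothesis(np.array([1.0, -2.0]), 0.5)
h2 = linear_hypothesis(np.array([0.0, 3.0]), -1.0)
print(h1(np.array([1.0, 1.0])), h2(np.array([1.0, 1.0])))  # -0.5 2.0
```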
Polynomial hypotheses
For one-dimensional input:
\[ \mathcal{H}_p = \{ h_{\boldsymbol{\theta}}(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_p x^p \mid \boldsymbol{\theta} \in \mathbb{R}^{p+1} \}. \]
Increasing the degree \(p\) enlarges the class: \(\mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3 \subset \cdots\).
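The nesting is easy to see concretely: any member of \(\mathcal{H}_p\) reappears in \(\mathcal{H}_{p+1}\) with its top coefficient set to zero. A minimal sketch:

```python
# A member of H_p is indexed by its coefficient vector
# theta = (theta_0, ..., theta_p).
def poly_hypothesis(theta):
    """Return h_theta(x) = theta_0 + theta_1 x + ... + theta_p x^p."""
    return lambda x: sum(t * x**k for k, t in enumerate(theta))

# Nesting: any member of H_2 is also in H_3 (set its cubic term to zero).
h_quad  = poly_hypothesis([1.0, 0.0, 2.0])        # 1 + 2x^2, in H_2
h_in_H3 = poly_hypothesis([1.0, 0.0, 2.0, 0.0])   # same function, seen in H_3
assert h_quad(1.5) == h_in_H3(1.5) == 5.5
```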
Decision trees
A binary decision tree of depth \(d\) corresponds to a piecewise-constant function on a partition of \(\mathcal{X}\) into axis-aligned boxes. The class is
\[ \mathcal{H}_d = \{ \text{all decision trees of depth} \leq d \}. \]
The class is finite (assuming finitely many split candidates) but grows combinatorially in \(d\).
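A short illustration of the piecewise-constant structure, using scikit-learn's depth knob to select the class (the data here is synthetic and illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# max_depth selects the class H_d. A fitted tree is piecewise constant,
# predicting at most 2^d distinct values (one per leaf box).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
grid = np.linspace(-3, 3, 500).reshape(-1, 1)
print(len(np.unique(tree.predict(grid))))  # at most 2**3 = 8 distinct values
```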
Neural networks
A multilayer perceptron with architecture \(A\) (specifying widths, depths, activations) gives
\[ \mathcal{H}_A = \{ h_{\boldsymbol{\theta}}(\mathbf{x}) : \boldsymbol{\theta} \in \mathbb{R}^P \}, \]
where \(P\) is the total parameter count. Each architectural choice — depth, width, residual connections, attention, normalization — defines a different class.
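Counting \(P\) for a fully connected architecture is a one-liner; a minimal sketch, assuming a plain MLP with one weight matrix and one bias vector per layer:

```python
# Count the parameters P that index H_A for a plain fully connected
# architecture A, given as a list of layer widths.
def mlp_param_count(widths):
    """One weight matrix (n_in x n_out) and one bias vector per layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(widths[:-1], widths[1:]))

# e.g. 784-dimensional input, two hidden layers of width 256, scalar output
print(mlp_param_count([784, 256, 256, 1]))  # 267009
```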
What Choosing a Class Determines
What can be learned at all
A hypothesis class with no \(h\) close to \(f^*\) has nothing to offer. Fitting a linear model to data generated by a sine wave will never achieve low error, regardless of how much data you give it. The approximation error
\[ \inf_{h \in \mathcal{H}} \mathbb{E}\!\left[\ell(h(x), y)\right] \]
is the floor — the best any element of \(\mathcal{H}\) can do. It is a property of \(\mathcal{H}\) and the data distribution, not the learner.
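A quick demonstration of the floor, fitting the best least-squares line to abundant, noiseless data from \(y = \sin(x)\) (the values in the comments are approximate):

```python
import numpy as np

# The approximation-error floor: fit the best least-squares line to
# noiseless data from y = sin(x). No amount of data pushes the error
# below what the best line achieves.
rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=100_000)
y = np.sin(X)

A = np.column_stack([X, np.ones_like(X)])       # design matrix for w*x + b
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
mse = np.mean((w * X + b - y) ** 2)
print(f"best line: w={w:.3f}, b={b:.3f}, mse={mse:.3f}")
# -> w ~ 0.304 (= 3/pi^2), b ~ 0, mse ~ 0.196: the floor for this class
```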
How much data is needed
Richer classes need more data to estimate reliably. The classical inequality (informally):
\[ \text{generalization error} \lesssim \text{training error} + \sqrt{\frac{\text{complexity of } \mathcal{H}}{N}}. \]
“Complexity” can be measured several ways: VC dimension (counts the patterns the class can shatter), Rademacher complexity (measures how well the class can fit random labels), or simply parameter count (a rough proxy that is wrong in the over-parameterized regime). All capture the same intuition — more flexible classes give weaker generalization guarantees per unit of data.
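The Rademacher intuition can be checked empirically: generate random labels and see how well each class fits them. A minimal sketch using depth-limited trees as the family of classes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# How well can a class fit purely random labels? Deeper trees
# (richer classes) track the noise; shallow ones cannot.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)   # random labels: nothing real to learn

for depth in (1, 3, 10, None):     # None = unlimited depth
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    print(f"max_depth={depth}: accuracy on random labels = {clf.score(X, y):.2f}")
```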
Where the failure modes are
The choice of class biases what the model can and cannot represent:
- Linear models cannot represent interactions: \(\mathbf{w}^\top \mathbf{x}\) has no \(x_1 x_2\) term (demonstrated in the sketch below).
- Polynomial models can extrapolate wildly outside the training range.
- Decision trees produce piecewise-constant predictions with sharp boundaries.
- Convolutional networks bake in translation equivariance — useful for images, wrong for tabular data.
- Transformers assume tokens, positions, and self-attention — useful for sequences, awkward for raw image pixels.
These are inductive biases: assumptions about \(f^*\) smuggled in by the class itself. A good inductive bias is one that matches the structure of the real problem.
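The first bullet above takes only a few lines to demonstrate: when the target is a pure interaction \(y = x_1 x_2\), the best linear fit explains essentially none of the variance, and adding the interaction as a feature (changing the class) removes the floor entirely. A sketch with illustrative synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# An inductive-bias failure: the target is a pure interaction
# y = x1 * x2, which no function of the form w^T x + b can fit.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 2))
y = X[:, 0] * X[:, 1]

print(f"R^2, best linear fit: {LinearRegression().fit(X, y).score(X, y):.3f}")  # ~0.0

# Adding the interaction as an explicit feature changes the class
# and removes the floor.
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])
print(f"R^2 with interaction: {LinearRegression().fit(X_aug, y).score(X_aug, y):.3f}")  # 1.0
```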
The Bias–Capacity Trade-off
The space of hypothesis classes is itself a spectrum, parameterized by capacity. Viewed through the lens of class choice, this is the bias–variance trade-off:
- Too small a class → high bias. The truth is not in the class, so even infinite data leaves residual error.
- Too large a class → high variance. There are too many candidate \(h\) that fit the training data; the algorithm cannot reliably pick the right one.
The right class for a problem is the smallest one that still contains a good approximation to \(f^*\). In modern deep learning this folk wisdom is complicated by double descent and over-parameterization — large classes can generalize well when paired with regularization and implicit bias from gradient descent — but the basic instinct (match capacity to the task) remains useful.
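The trade-off is easy to reproduce with the polynomial classes \(\mathcal{H}_p\) from earlier: as the degree grows, held-out error typically falls and then rises. A minimal sketch (the target function and sample sizes are illustrative):

```python
import numpy as np

# Sweep polynomial degree p and watch held-out error fall
# (bias shrinks), then rise (variance grows).
rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(3 * x) + 0.3 * rng.normal(size=n)

x_train, y_train = sample(15)      # small training set: variance will bite
x_val, y_val = sample(500)

for p in (1, 3, 5, 9):
    coeffs = np.polyfit(x_train, y_train, deg=p)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {p}: validation mse = {val_mse:.3f}")
```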
How Practitioners Actually Choose
In practice the hypothesis class is rarely picked from first principles. The workflow:
- Start with a strong baseline. Linear or logistic regression as a first pass; gradient-boosted trees for tabular data; a pre-trained transformer for text; a CNN or vision transformer for images. These are the classes the field has found to work across problems.
- Stress-test the baseline. Is it underfitting? Overfitting? Does it ignore feature interactions you know matter?
- Adjust the class. Add features, kernel-ize, deepen the network, switch to a tree ensemble, fine-tune a larger pre-trained model.
- Compare on validation. Fit several plausible classes and keep the one with the best held-out performance (see the sketch below).
The choice is informed by domain knowledge (what symmetries does the problem have? what feature interactions matter?), compute budget (how big a class can you afford to train?), and prior experience with similar problems.
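A sketch of the compare-on-validation step, with illustrative candidate classes and scikit-learn's built-in diabetes dataset standing in for a real problem:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Fit several plausible classes and compare on held-out folds.
X, y = load_diabetes(return_X_y=True)
candidates = {
    "ridge": Ridge(alpha=1.0),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}
for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:18s} mean CV R^2 = {score:.3f}")
```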
What Comes Next
Defining the hypothesis class fixes which functions are eligible. The next questions are:
- How do you score candidate hypotheses? — a loss function.
- How do you search the class for a good one? — an optimization algorithm.
- How do you avoid picking one that overfits? — regularization, validation, model selection.
These are the topics of finding better hypotheses. Together with the hypothesis class itself, they make up the full machinery of supervised learning.