Finding Better Hypotheses

Motivation

A hypothesis class is the set of candidate functions the learner is willing to consider — linear models, decision trees of depth 5, a transformer of a particular shape. Picking the class fixes what is possible. The next problem is picking the right hypothesis from inside that class: the search problem of supervised learning.

The naive view — “just minimize training error” — misses two pitfalls:

  1. The training error is not what we care about. We care about how well the model performs on new data. A hypothesis with zero training error can be useless on test data.
  2. The search itself is rarely tractable. Even finite classes are usually too large to enumerate, since their size typically grows exponentially with the number of parameters. For continuous classes, gradient methods dominate but can get stuck in local minima or saddle points.

Doing this well combines a principled objective (empirical risk plus regularization), a workable optimization procedure, and a model-selection protocol that catches overfitting (Shalev-Shwartz and Ben-David 2014; Hastie et al. 2009).

What “Better” Means

A hypothesis \(h\) is evaluated against a loss function \(\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}\) that scores how wrong its prediction is. Standard choices, sketched in code after this list:

  • Squared error \(\ell(\hat y, y) = (\hat y - y)^2\) — regression.
  • Cross-entropy \(\ell(\hat p, y) = -\log \hat p_y\) — probabilistic classification.
  • Zero-one loss \(\ell(\hat y, y) = \mathbb{1}[\hat y \neq y]\) — classification error rate (non-differentiable; usually replaced by a convex surrogate during training).
  • Hinge loss \(\ell(s, y) = \max(0, 1 - y s)\) — SVM-style margin loss, with labels \(y \in \{-1, +1\}\) and real-valued score \(s\).
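To make these concrete, here is a minimal NumPy sketch of the four losses. The function names and vectorized signatures are illustrative choices, not fixed conventions:

```python
import numpy as np

def squared_error(y_hat, y):
    # Regression: penalizes residuals quadratically.
    return (y_hat - y) ** 2

def cross_entropy(p_hat, y):
    # p_hat: predicted probabilities, shape (n, classes); y: integer labels.
    # Picks out -log of the probability assigned to the true class.
    return -np.log(p_hat[np.arange(len(y)), y])

def zero_one(y_hat, y):
    # Classification error indicator; non-differentiable, so used for
    # evaluation rather than training.
    return (y_hat != y).astype(float)

def hinge(s, y):
    # SVM margin loss; y in {-1, +1}, s is the real-valued score.
    return np.maximum(0.0, 1.0 - y * s)
```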

The quantity we actually want to minimize is the expected loss (population risk):

\[ R(h) = \mathbb{E}_{(x, y) \sim p^*}\!\left[\ell(h(x), y)\right]. \]

We cannot compute \(R(h)\) because \(p^*\) is unknown. The available proxy is the empirical risk:

\[ \hat R(h) = \frac{1}{N} \sum_{i=1}^N \ell(h(x_i), y_i). \]

Empirical risk minimization (ERM) picks \(\hat h = \arg\min_{h \in \mathcal{H}} \hat R(h)\). This is the dominant principle but is fragile when \(\mathcal{H}\) is rich enough to memorize the training set (Vapnik 1991).
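For a small finite class, ERM can be carried out literally as a minimum over candidates. A toy sketch, where `candidates` is a hypothetical list of callables mapping inputs to predictions:

```python
import numpy as np

def empirical_risk(h, X, y, loss):
    # \hat R(h): average loss over the N training pairs.
    return np.mean(loss(h(X), y))

def erm(candidates, X, y, loss):
    # Exhaustive ERM over a finite class: return the candidate with the
    # lowest empirical risk. Feasible only when `candidates` is small.
    return min(candidates, key=lambda h: empirical_risk(h, X, y, loss))
```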

Regularization

If \(\mathcal{H}\) can fit the training data perfectly, ERM has multiple optima — including some that generalize badly. The fix is to bias the search toward simpler hypotheses. Regularized empirical risk minimizes

\[ \hat R_\lambda(h) = \hat R(h) + \lambda \, \Omega(h), \]

where \(\Omega(h)\) measures “complexity” and \(\lambda > 0\) controls the trade-off. Common \(\Omega\):

  • \(\ell_2\) regularization: \(\Omega(h) = \|\mathbf{w}\|_2^2\) — shrinks weights toward zero; the dominant choice (ridge regression, weight decay in neural nets).
  • \(\ell_1\) regularization: \(\Omega(h) = \|\mathbf{w}\|_1\) — drives weights exactly to zero (LASSO; produces sparse models).
  • Early stopping: implicit regularization via stopping optimization before convergence.
  • Dropout, weight noise, data augmentation: stochastic regularizers used in deep learning.

Regularization buys variance reduction at the cost of some bias — the bias-variance trade-off is the formal statement of this exchange.
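As an illustration, for a linear model with an explicit weight vector the regularized objective can be evaluated directly. The squared loss here is an assumption for concreteness:

```python
import numpy as np

def regularized_risk(w, X, y, lam, penalty="l2"):
    # Empirical risk term: mean squared error of the linear predictor Xw.
    data_term = np.mean((X @ w - y) ** 2)
    # Complexity term Omega: squared l2 norm (ridge) or l1 norm (LASSO).
    omega = np.sum(w ** 2) if penalty == "l2" else np.sum(np.abs(w))
    return data_term + lam * omega
```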

Searching the Class

How you minimize \(\hat R_\lambda\) depends on the class.

Closed form (linear least squares)

For squared loss with a linear class and \(\ell_2\) regularization, the solution is closed-form:

\[ \hat{\mathbf{w}} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}. \]

No iteration. Use this when the class admits it.
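A direct sketch of that formula. Using a linear solve rather than forming the inverse explicitly is the numerically safer route:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Solve (X^T X + lam*I) w = X^T y; a linear solve is cheaper and more
    # stable than computing the matrix inverse.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```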

Convex optimization

Many useful problems — logistic regression, SVMs, LASSO — have convex training objectives, so any local minimum is global and methods like gradient descent, Newton’s method, or coordinate descent converge to the unique optimum.

Gradient descent and its variants

For differentiable but nonconvex objectives (deep networks), iterative gradient methods are the only option in practice. The basic update is

\[ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \, \nabla_{\boldsymbol{\theta}} \hat R_\lambda(h_{\boldsymbol{\theta}_t}), \]

with \(\eta\) the learning rate. Stochastic variants (SGD, Adam) replace the full-batch gradient by a noisy estimate from a mini-batch — required at scale because exact gradients are too expensive.
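A minimal full-batch loop for this update. The `grad` callable, fixed step count, and learning rate are illustrative assumptions:

```python
import numpy as np

def gradient_descent(theta, grad, eta=0.1, steps=1000):
    # Full-batch gradient descent on the regularized empirical risk.
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

def make_ridge_grad(X, y, lam):
    # Gradient of mean squared error plus lam * ||w||_2^2 for a linear model.
    n = len(y)
    return lambda w: (2.0 / n) * X.T @ (X @ w - y) + 2.0 * lam * w
```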

Model Selection: Choosing Hyperparameters

The regularizer’s strength \(\lambda\), the network architecture, the tree depth, the kernel — none of these can be selected by minimizing training error (which would always prefer the most flexible setting). They must be chosen by their performance on held-out data.

The three-way split

Partition the data:

  • Training set. Used to fit \(\hat h\) for a given hyperparameter setting.
  • Validation set. Used to compare hyperparameter settings and pick the best.
  • Test set. Used once, at the end, to report the final performance estimate.

Reusing the test set for selection inflates the reported performance — it has been informally “fit” by the analyst’s decisions.
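A minimal sketch of the protocol; the 60/20/20 proportions and the fixed seed are arbitrary choices:

```python
import numpy as np

def three_way_split(X, y, seed=0):
    # Shuffle once, then carve out 60% train / 20% validation / 20% test.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr, n_val = int(0.6 * len(y)), int(0.2 * len(y))
    tr, val, te = idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]
    return (X[tr], y[tr]), (X[val], y[val]), (X[te], y[te])
```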

\(k\)-fold cross-validation

When data is scarce, splitting off a validation set wastes signal. \(k\)-fold cross-validation partitions the training data into \(k\) folds, trains on \(k-1\) of them, validates on the remaining one, and averages the validation error across all \(k\) choices of held-out fold. Standard \(k\) is 5 or 10. The output is a more reliable validation score per hyperparameter setting, at \(k\times\) the compute cost.

[Figure: 5-fold cross-validation. Each of the five folds serves once as the validation set while the remaining four are used for training.]
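A bare-bones version of the procedure, assuming a hypothetical `fit(X, y)` that returns a predictor and a vectorized `loss`:

```python
import numpy as np

def k_fold_cv(X, y, fit, loss, k=5, seed=0):
    # Average validation loss over k held-out folds.
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        h = fit(X[tr], y[tr])                            # train on k-1 folds
        scores.append(np.mean(loss(h(X[val]), y[val])))  # validate on the rest
    return np.mean(scores)
```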

Hyperparameter search strategies

Once a validation criterion is defined, the search over hyperparameters becomes its own optimization problem:

  • Grid search. Discretize each hyperparameter and try every combination. Exhaustive, parallel-friendly, but combinatorially expensive.
  • Random search. Sample combinations randomly. Often finds a good setting faster than grid search, especially when only a few hyperparameters matter; see the sketch after this list.
  • Bayesian optimization. Model validation performance as a function of hyperparameters and use the model to choose promising settings to try next. Sample-efficient but more complex to set up.
  • Manual. A practitioner adjusts hyperparameters based on observed training curves. Still the workflow most often used in deep learning.
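For instance, random search over a regularization strength and a learning rate might look like this. The log-uniform sampling ranges and the `validation_score` callable are assumptions for illustration:

```python
import numpy as np

def random_search(validation_score, n_trials=50, seed=0):
    # Sample hyperparameters log-uniformly; keep the setting with the
    # lowest validation score.
    rng = np.random.default_rng(seed)
    best, best_score = None, np.inf
    for _ in range(n_trials):
        hp = {"lam": 10 ** rng.uniform(-6, 1), "eta": 10 ** rng.uniform(-4, 0)}
        score = validation_score(**hp)
        if score < best_score:
            best, best_score = hp, score
    return best, best_score
```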

Diagnosing What Went Wrong

A trained model that performs poorly is diagnosed by comparing training and validation errors — the standard bias-variance protocol:

  • Both errors high. Underfitting. The hypothesis class is too restrictive or the optimization stopped too early. Use a richer class, train longer, reduce regularization, add features.
  • Training error low, validation error much higher. Overfitting. Add data, increase regularization, simplify the class, ensemble, augment data.
  • Training error low, validation error close to it. Likely fine — though check for distribution shift between validation and deployment.

Learning curves (error vs. training set size) and validation curves (error vs. a hyperparameter) make these diagnoses visual and are routine tools.
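One way to compute a learning curve, reusing the same hypothetical `fit` and `loss` interfaces as above:

```python
import numpy as np

def learning_curve(X_tr, y_tr, X_val, y_val, fit, loss, sizes):
    # Train on growing prefixes of the training set; record both errors.
    rows = []
    for n in sizes:
        h = fit(X_tr[:n], y_tr[:n])
        rows.append((n,
                     np.mean(loss(h(X_tr[:n]), y_tr[:n])),  # training error
                     np.mean(loss(h(X_val), y_val))))       # validation error
    return rows
```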

Ensembles: Combining Several Hypotheses

A trained model is one element of \(\mathcal{H}\). Often a combination of several elements is better than any single one. The standard ensembling methods:

  • Bagging. Train many copies on bootstrap resamples and average. Reduces variance without changing bias. Random forests are bagged decision trees; see the sketch at the end of this section.
  • Boosting. Train weak learners sequentially, each one focused on the errors of the previous. Chiefly reduces bias. Gradient boosting machines are the dominant tabular learner.
  • Stacking. Train several heterogeneous models and combine their outputs with a meta-model.

Ensembles do not search \(\mathcal{H}\) for a single hypothesis — they search a richer class of combinations of hypotheses. They are almost always among the top performers on tabular benchmarks.
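A sketch of bagging for regression, under the same hypothetical `fit` interface, averaging the base models' predictions:

```python
import numpy as np

def bagged_predictor(X, y, fit, n_models=25, seed=0):
    # Fit each base model on a bootstrap resample (sampling with
    # replacement), then average their predictions.
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))
        models.append(fit(X[idx], y[idx]))
    return lambda X_new: np.mean([h(X_new) for h in models], axis=0)
```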

When Finding a Better Hypothesis Helps

It helps when:

  • The training error is much lower than the validation error (overfitting) — try regularization, simpler classes, more data.
  • The training error is high (underfitting) — try richer classes or better optimization.
  • Multiple candidate models pass the validation step — ensemble them.

It does not help when:

  • The hypothesis class is wrong — search inside \(\mathcal{H}\) can only return elements of \(\mathcal{H}\). If the truth is outside, no amount of search recovers it. Revisit defining the hypothesis class.
  • The data is the bottleneck — no amount of optimization manufactures information that was never collected. More data or better data is the only fix.
  • The objective is wrong — minimizing the wrong loss produces models that are correct for the wrong question. Reconsider what “better” should mean.

Search refines the model; class choice and data choice set the ceiling.

References

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning. 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/.
Shalev-Shwartz, Shai, and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. https://doi.org/10.1017/cbo9781107298019.
Vapnik, Vladimir N. 1991. “Principles of Risk Minimization for Learning Theory.” Advances in Neural Information Processing Systems (NeurIPS), 831–38. https://proceedings.neurips.cc/paper/1991/hash/ff4d5fbbafdf976cfdc032e3bde78de5-Abstract.html.