Segmented Least Squares
Motivation
Standard linear regression fits a single line to all data points. But real signals often consist of several distinct linear regimes — say, a stock price that rallies, then plateaus, then falls. Forcing one line through all of them produces a poor fit; using too many lines overfits the noise.
Segmented least squares formalizes this trade-off: divide the points into contiguous groups, fit a separate line to each group, and minimize the total squared error plus a penalty per segment. The penalty \(C\) controls model complexity — small \(C\) favours many tight segments, large \(C\) favours fewer, looser ones.
The problem looks like it requires deciding both the number and boundaries of segments. Yet a clean dynamic programming algorithm finds the optimum in \(\Theta(n^2)\) time. It is the standard textbook example showing that DP can solve problems whose obvious search space is exponential (Kleinberg and Tardos 2005).
Problem
Given \(n\) points \((x_1, y_1), \dots, (x_n, y_n)\) sorted by \(x\), and a per-segment penalty \(C > 0\), a segmentation partitions \(\{1, \dots, n\}\) into contiguous segments \([i_1, j_1], [i_2, j_2], \dots, [i_k, j_k]\) with \(i_1 = 1\), \(j_k = n\), and \(j_t + 1 = i_{t+1}\).
For a segment \([i, j]\), let \(e_{ij}\) be the minimum squared error of any line fit to the points \((x_i, y_i), \dots, (x_j, y_j)\). The cost of a segmentation is
\[ \sum_t e_{i_t j_t} + k C. \]
Find the segmentation of minimum cost.
Key Ideas
Why brute force fails
A segmentation is determined by which indices begin a new segment, so there are \(2^{n-1}\) candidates; trying each is exponential. The DP exploits the fact that these choices decouple: once we fix the index \(i\) where the last segment starts, the optimal cost of the prefix \(\{1, \ldots, i-1\}\) is independent of how the last segment is fit.
The locality that unlocks the DP
Fix the rightmost segment \([i, n]\). Its cost is \(e_{i, n} + C\), depending only on \(i\). The cost of the segmentation to the left is the same problem on a smaller prefix. This Markov-like locality — the optimal sub-segmentation of the prefix is independent of everything in the last segment — collapses the search space from \(2^{n-1}\) to a 1D DP indexed by the right endpoint.
Per-segment fits are closed-form
For a segment from index \(i\) to \(j\), the least-squares line \(y = a x + b\) minimizes \(\sum_{k=i}^{j} (y_k - a x_k - b)^2\). The solution is: \[ a = \frac{m \sum x_k y_k - \sum x_k \sum y_k}{m \sum x_k^2 - (\sum x_k)^2}, \qquad b = \frac{\sum y_k - a \sum x_k}{m}, \qquad m = j - i + 1, \] with sums taken over \(k = i, \dots, j\). The minimum squared error \(e_{ij}\) is obtained by plugging \(a\) and \(b\) back into the residual sum. With prefix sums of \(x\), \(y\), \(x^2\), \(xy\), and \(y^2\) pre-computed in \(O(n)\), every \(e_{ij}\) takes \(O(1)\) (the \(y^2\) sums are needed to evaluate the residual itself).
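A minimal Python sketch of this computation (the helper names `build_prefix_sums` and `segment_error` are illustrative, not from any library); note how the \(O(1)\) residual uses the \(y^2\) prefix sums:

```python
def build_prefix_sums(xs, ys):
    """Prefix sums of x, y, x^2, x*y, and y^2; slot 0 holds the empty sum."""
    n = len(xs)
    P = {key: [0.0] * (n + 1) for key in ("x", "y", "xx", "xy", "yy")}
    for k in range(n):
        P["x"][k + 1] = P["x"][k] + xs[k]
        P["y"][k + 1] = P["y"][k] + ys[k]
        P["xx"][k + 1] = P["xx"][k] + xs[k] * xs[k]
        P["xy"][k + 1] = P["xy"][k] + xs[k] * ys[k]
        P["yy"][k + 1] = P["yy"][k] + ys[k] * ys[k]
    return P


def segment_error(P, i, j):
    """Minimum squared error e_{ij} of a line fit to points i..j (1-indexed, inclusive)."""
    m = j - i + 1
    sx = P["x"][j] - P["x"][i - 1]
    sy = P["y"][j] - P["y"][i - 1]
    sxx = P["xx"][j] - P["xx"][i - 1]
    sxy = P["xy"][j] - P["xy"][i - 1]
    syy = P["yy"][j] - P["yy"][i - 1]
    denom = m * sxx - sx * sx
    if denom == 0:                      # single point, or all x equal: best fit is the mean of y
        return max(0.0, syy - sy * sy / m)
    a = (m * sxy - sx * sy) / denom     # slope from the closed form above
    b = (sy - a * sx) / m               # intercept
    # Residual sum of squares, expanded so only the prefix sums are needed (O(1)).
    rss = syy - 2 * a * sxy - 2 * b * sy + a * a * sxx + 2 * a * b * sx + m * b * b
    return max(0.0, rss)                # clamp tiny negatives from round-off
```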
Deriving the Solution
Let \(\text{OPT}(j)\) be the minimum cost of segmenting \(\{1, \dots, j\}\).
The last segment in any optimal segmentation of \(\{1, \dots, j\}\) ends at \(j\) and starts at some index \(i \le j\). Conditioning on \(i\):
\[ \text{OPT}(j) = \min_{1 \le i \le j} \left( e_{ij} + C + \text{OPT}(i-1) \right), \]
with \(\text{OPT}(0) = 0\). The minimum ranges over all \(j\) choices of where the last segment begins; the rest is exactly the same subproblem on \(\{1, \ldots, i-1\}\).
Algorithm
compute prefix sums of x, y, x^2, xy, y^2
compute e[i][j] for all 1 <= i <= j <= n   # O(n^2) with prefix sums
M[0] = 0
for j = 1 to n:
    M[j] = min over i = 1..j of (e[i][j] + C + M[i-1])
return M[n]
To recover the segmentation, store the argmin \(i^*(j)\) at each \(j\) and back-trace from \(n\).
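A runnable Python sketch of the pseudocode above, assuming the hypothetical `build_prefix_sums` and `segment_error` helpers from the earlier sketch. It computes each \(e_{ij}\) on the fly, so it needs only \(\Theta(n)\) extra space, and it records the argmins for the back-trace:

```python
def segmented_least_squares(xs, ys, C):
    """Return (optimal cost, segments as 1-based (i, j) pairs) for penalty C.

    Points are assumed sorted by x. Implements
    OPT(j) = min_{1<=i<=j} ( e_{ij} + C + OPT(i-1) ),  OPT(0) = 0.
    """
    n = len(xs)
    P = build_prefix_sums(xs, ys)          # helpers from the earlier sketch
    M = [0.0] * (n + 1)                    # M[j] = optimal cost for points 1..j
    start = [0] * (n + 1)                  # start[j] = argmin i of the last segment ending at j
    for j in range(1, n + 1):
        best_cost, best_i = float("inf"), 1
        for i in range(1, j + 1):
            cost = segment_error(P, i, j) + C + M[i - 1]
            if cost < best_cost:
                best_cost, best_i = cost, i
        M[j], start[j] = best_cost, best_i
    # Back-trace the stored argmins to recover the segmentation.
    segments, j = [], n
    while j > 0:
        segments.append((start[j], j))
        j = start[j] - 1
    segments.reverse()
    return M[n], segments
```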
Walkthrough
One segment vs. two
Eight points with \(x = 1, \dots, 8\) and corresponding \(y = 1, 3, 5, 7, 6, 5, 4, 3\). The first four lie exactly on \(y = 2x - 1\); the last four lie exactly on \(y = -x + 11\).
Two competing segmentations:
- One segment: ordinary least squares on all eight points gives \(\hat y \approx 0.214\,x + 3.286\) with \(e_{1,8} \approx 23.6\), so the cost is \(23.6 + C\).
- Two segments: split between \(j = 4\) and \(i = 5\). Each segment fits its line exactly, so \(e_{1,4} = e_{5,8} = 0\) and the cost is \(2C\).
The two-segment solution wins whenever \(2C < 23.6 + C\), i.e., for any \(C < 23.6\). So this same data yields the split shown for moderate \(C\) and collapses to one line as \(C\) grows past about \(23.6\) — the trade-off the recurrence makes precise.
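The comparison can be reproduced by running the sketch above (hypothetical `segmented_least_squares`) on these eight points:

```python
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 3, 5, 7, 6, 5, 4, 3]

# Moderate penalty: two zero-error segments, total cost 2C = 2.0.
# (Point 4 happens to lie on both lines, so the DP may report the equally
#  optimal split (1, 3), (4, 8) instead of (1, 4), (5, 8).)
print(segmented_least_squares(xs, ys, C=1.0))

# Large penalty: a single segment, total cost about 23.6 + 30 = 53.6.
print(segmented_least_squares(xs, ys, C=30.0))
```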
Correctness
By induction on \(j\). The base \(\text{OPT}(0) = 0\) is the empty segmentation. Inductively, the optimal segmentation of \(\{1, \ldots, j\}\) ends in a last segment \([i^*, j]\) for some \(i^*\). Removing that segment leaves an optimal segmentation of \(\{1, \ldots, i^* - 1\}\) (otherwise the original wasn’t optimal), which by hypothesis has cost \(M[i^* - 1]\). So \(\text{OPT}(j) = e_{i^*, j} + C + M[i^* - 1]\), and the recurrence takes the min over all \(i\), including \(i^*\), so \(M[j] \le \text{OPT}(j)\). Conversely, every term in the min is the cost of some valid segmentation of \(\{1, \ldots, j\}\), so \(M[j] \ge \text{OPT}(j)\); hence \(M[j] = \text{OPT}(j)\).
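For extra assurance, the recurrence can be spot-checked against brute force on tiny inputs; the sketch below (hypothetical `brute_force_cost`, reusing the earlier helpers) enumerates all \(2^{n-1}\) segmentations and asserts agreement:

```python
import random
from itertools import combinations

def brute_force_cost(xs, ys, C):
    """Exhaustively score all 2^(n-1) segmentations; only feasible for tiny n."""
    n = len(xs)
    P = build_prefix_sums(xs, ys)
    best = float("inf")
    for r in range(n):                              # number of internal boundaries
        for cuts in combinations(range(1, n), r):   # cut after these 1-based indices
            bounds = [0, *cuts, n]
            cost = sum(segment_error(P, bounds[t] + 1, bounds[t + 1]) + C
                       for t in range(len(bounds) - 1))
            best = min(best, cost)
    return best

# Spot-check the DP against brute force on random small instances.
for _ in range(20):
    n = random.randint(1, 8)
    xs = sorted(random.uniform(0, 10) for _ in range(n))
    ys = [random.uniform(0, 10) for _ in range(n)]
    dp_cost, _ = segmented_least_squares(xs, ys, C=1.0)
    assert abs(dp_cost - brute_force_cost(xs, ys, 1.0)) < 1e-6
```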
Complexity and Tradeoffs
- Computing every \(e_{ij}\) naively: \(\Theta(n^3)\) (each segment is \(\Theta(n)\) work, and there are \(\Theta(n^2)\) segments).
- With prefix sums of \(x\), \(y\), \(x^2\), \(xy\), and \(y^2\) pre-computed in \(O(n)\), each \(e_{ij}\) takes \(O(1)\), so all errors are computed in \(O(n^2)\).
- Filling the DP table: \(O(n^2)\) (each \(j\) checks up to \(j\) choices of \(i\)).
Total: \(\Theta(n^2)\) time, \(\Theta(n^2)\) space (or \(\Theta(n)\) with on-the-fly error computation).
Choosing the penalty
Different choices of \(C\) produce different optimal segmentations:
- \(C \to 0\): the penalty becomes negligible, so segments proliferate until the fit passes exactly through the data; at \(C = 0\) nothing prevents every point from becoming its own segment.
- \(C \to \infty\): the entire input becomes one segment.
In practice \(C\) is selected by cross-validation, by an information criterion such as BIC, or by domain knowledge of how many regime changes are plausible.
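A quick exploratory complement is to sweep \(C\) on the data at hand and watch the segment count respond; a sketch reusing the hypothetical `segmented_least_squares` from above:

```python
# Walkthrough data; any sorted-by-x data set works the same way.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 3, 5, 7, 6, 5, 4, 3]

for C in (0.01, 0.1, 1.0, 10.0, 30.0, 100.0):
    cost, segments = segmented_least_squares(xs, ys, C)
    print(f"C = {C:>6}: {len(segments)} segment(s), cost {cost:.2f}")
```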
When to Use It
| Situation | Approach |
|---|---|
| Piecewise-linear signal, segment count not known | Segmented least squares with penalty \(C\), \(\Theta(n^2)\). |
| Known segment count \(k\), free boundaries | Same DP with state \(\text{OPT}(j, k')\) — \(O(n^2 k)\) time (sketched below the table). |
| Piecewise-constant signal (changepoint detection) | Same DP, replace line fits with constant fits. |
| Streaming / online data | DP is offline; use online changepoint methods. |
| Non-linear segment fits (polynomial, exponential) | Same DP, swap the closed-form per-segment fit. |
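For the fixed-segment-count row, a minimal sketch of the \(O(n^2 k)\) variant (hypothetical name, reusing the earlier helpers); the penalty term disappears and the state gains a segment-count index:

```python
def segmented_least_squares_fixed_k(xs, ys, k):
    """Minimum total error using exactly k segments (no penalty term).

    M[j][t] = least error achievable covering points 1..j with exactly t segments.
    """
    n = len(xs)
    P = build_prefix_sums(xs, ys)
    INF = float("inf")
    M = [[INF] * (k + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for j in range(1, n + 1):
        for t in range(1, min(j, k) + 1):
            M[j][t] = min(segment_error(P, i, j) + M[i - 1][t - 1]
                          for i in range(t, j + 1))
    return M[n][k]
```

Back-tracing works exactly as before: also store the argmin \(i\) for each state \((j, t)\) and follow it from \((n, k)\).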
Variants
- Multiple variables. Replace each line by a hyperplane; the closed form generalizes to standard multivariate least squares.
- Different fit shapes. Replace least-squares lines with any function class (constants, polynomials, exponentials) provided each segment’s fit is computable.
- Length-dependent penalty. Replace the flat \(C\) in the recurrence by a term \(C_{j-i+1}\) that depends on the segment’s length — useful when shorter segments should be cheaper.