Kalman Filter
Motivation
A linear dynamical system driven by noise cannot be observed exactly — the hidden state must be estimated from noisy measurements. The Kalman filter (Kalman 1960) is the optimal estimator for this setting: it maintains a Gaussian belief over the state and updates it recursively as each measurement arrives, minimizing mean squared error among all linear estimators.
The filter underpins GPS receivers, aircraft autopilots, robot localization, and financial time-series smoothing. It also forms one half of the linear quadratic Gaussian (LQG) controller, which pairs the Kalman filter with the linear quadratic regulator.
Problem
The system is described by two equations. The process model evolves the hidden state:
\[ x_{t+1} = A x_t + B u_t + w_t, \qquad w_t \sim \mathcal{N}(0, W), \]
and the observation model generates measurements:
\[ y_t = C x_t + v_t, \qquad v_t \sim \mathcal{N}(0, V), \]
where \(x_t \in \mathbb{R}^n\) is the hidden state, \(u_t \in \mathbb{R}^m\) is a known control input, \(y_t \in \mathbb{R}^p\) is the measurement, and \(A\), \(B\), \(C\) are known matrices. The noise covariances \(W \succeq 0\) and \(V \succ 0\) are known, and \(\{w_t\}\), \(\{v_t\}\) are mutually independent.
Given measurements \(y_1, \ldots, y_t\), compute (online, one step at a time) the posterior \(p(x_t \mid y_1, \ldots, y_t)\).
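To make the setup concrete, here is a minimal simulation of the two model equations. The specific matrices — a 2-D constant-velocity state (position, velocity) with position-only measurements — are illustrative choices, not fixed by the problem statement:

```python
import numpy as np

# Illustrative model: 2-D constant-velocity state, position-only measurements.
rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])          # state transition
B = np.zeros((2, 1))                # no control input in this example
C = np.array([[1.0, 0.0]])          # measure position only
W = 0.01 * np.eye(2)                # process noise covariance (PSD)
V = np.array([[0.25]])              # measurement noise covariance (PD)

x = np.array([0.0, 1.0])            # true initial state
states, measurements = [], []
for t in range(50):
    w = rng.multivariate_normal(np.zeros(2), W)   # process noise w_t
    v = rng.multivariate_normal(np.zeros(1), V)   # measurement noise v_t
    x = A @ x + w                                 # x_{t+1} = A x_t + w_t  (u_t = 0)
    y = C @ x + v                                 # y_t = C x_t + v_t
    states.append(x.copy())
    measurements.append(y)
```

The filter's job is to recover `states` from `measurements` alone.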
Key Ideas
A Gaussian stays Gaussian under linear updates
Each step of the model is a linear function of the previous state plus Gaussian noise. Linear-Gaussian transformations preserve normality: if \(x_{t-1} \mid y_{1:t-1}\) is Gaussian, then so is \(x_t \mid y_{1:t-1}\) (after the dynamics), and so is \(x_t \mid y_{1:t}\) (after conditioning on the new measurement). The posterior never leaves the Gaussian family, so the filter only has to track a mean vector and a covariance matrix.
Alternate predict and update
Each new step splits into two stages:
- Predict. Push the previous posterior through the dynamics. The mean shifts by the deterministic part; the covariance grows by the process noise.
- Update. Condition on the new measurement. The mean shifts toward the observation; the covariance shrinks because the measurement provides information.
This alternation is exactly the chain decomposition of the joint \(p(x_t, y_t \mid y_{1:t-1}) = p(x_t \mid y_{1:t-1}) \cdot p(y_t \mid x_t)\), followed by Bayes’ rule.
The Kalman gain is the optimal trust trade-off
The update reweights the prediction by the innovation \(y_t - C\hat{x}_{t|t-1}\) — the part of the measurement not predicted by the prior. The Kalman gain \(K_t\) is the optimal weight: when the measurement is precise (\(V\) small) \(K_t\) is large and the posterior trusts the observation; when the prediction is precise (\(\Sigma_{t|t-1}\) small) \(K_t\) is small and the posterior trusts the prior. The gain minimizes the posterior covariance over all linear update rules.
Same recursion as forward-backward, in continuous state
The Kalman filter is exactly the forward pass of forward-backward for a continuous-state HMM with linear-Gaussian transitions and observations. The discrete sum-product becomes a matrix recurrence on means and covariances because Gaussians close under marginalization and conditioning. Add a backward pass (the RTS smoother) and you get smoothing.
Deriving the Solution
The Kalman filter maintains the Gaussian posterior \(p(x_t \mid y_1, \ldots, y_t) = \mathcal{N}(\hat{x}_{t|t},\, \Sigma_{t|t})\). Apply the dynamics, then condition on the measurement.
Predict (propagate through dynamics before seeing \(y_t\)):
\[ \hat{x}_{t|t-1} = A \hat{x}_{t-1|t-1} + B u_t, \]
\[ \Sigma_{t|t-1} = A \Sigma_{t-1|t-1} A^\top + W. \]
The mean transforms by the deterministic dynamics; the covariance transforms by the linear-Gaussian variance formula plus the new process noise \(W\).
Update (incorporate measurement \(y_t\) via Bayes’ rule for Gaussians):
\[ K_t = \Sigma_{t|t-1} C^\top \bigl( C \Sigma_{t|t-1} C^\top + V \bigr)^{-1}, \]
\[ \hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t \bigl( y_t - C \hat{x}_{t|t-1} \bigr), \]
\[ \Sigma_{t|t} = (I - K_t C)\, \Sigma_{t|t-1}. \]
The matrix \(K_t\) is the Kalman gain, derived by minimizing the trace of \(\Sigma_{t|t}\) over all linear updates of this form.
Initialization. Set \(\hat{x}_{0|0} = \mu_0\) and \(\Sigma_{0|0} = \Sigma_0\) to the prior mean and covariance of \(x_0\). If the initial state is unknown, \(\Sigma_0 = \sigma^2 I\) for large \(\sigma\) reflects diffuse ignorance.
Algorithm
```
initialize x̂[0] = μ₀, Σ[0] = Σ₀
for t = 1, 2, ...:
    # Predict
    x̂_pred = A · x̂[t-1] + B · u[t]
    Σ_pred = A · Σ[t-1] · Aᵀ + W
    # Update
    innovation = y[t] - C · x̂_pred
    S = C · Σ_pred · Cᵀ + V        # innovation covariance
    K = Σ_pred · Cᵀ · S⁻¹          # Kalman gain
    x̂[t] = x̂_pred + K · innovation
    Σ[t] = (I - K · C) · Σ_pred
```
Each step requires inverting the \(p \times p\) matrix \(S\) (where \(p\) is the measurement dimension) and a few matrix multiplies in \(\mathbb{R}^{n \times n}\).
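The pseudocode translates line for line into NumPy. A minimal sketch (the function name `kalman_step` is ours; a production version would solve against \(S\) rather than form its inverse):

```python
import numpy as np

def kalman_step(x_hat, Sigma, y, u, A, B, C, W, V):
    """One predict-update cycle; returns the new posterior mean and covariance."""
    # Predict: push the previous posterior through the dynamics.
    x_pred = A @ x_hat + B @ u
    Sigma_pred = A @ Sigma @ A.T + W
    # Update: condition on the measurement y.
    innovation = y - C @ x_pred
    S = C @ Sigma_pred @ C.T + V              # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ innovation
    Sigma_new = (np.eye(len(x_hat)) - K @ C) @ Sigma_pred
    return x_new, Sigma_new
```

On the one-dimensional walkthrough below (\(A = C = 1\), \(W = 1\), \(V = 2/3\), prior mean 4, prior variance 1, measurement 5), this returns mean 4.75 and variance 0.5.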
Walkthrough
One-dimensional predict and update
In one dimension, the covariance is just a variance, so the recursion can be read as moving and reshaping bell curves. The prediction step shifts the previous posterior through the dynamics and widens it by process noise; the update step combines that predicted belief with the new measurement and produces a narrower posterior.
Numerically: suppose \(A = 1\) (random walk), \(B = 0\), \(C = 1\), \(W = 1\), \(V = 2/3\). Start with \(\hat{x}_{t-1|t-1} = 4\), \(\Sigma_{t-1|t-1} = 1\).
| Step | Computation | Result |
|---|---|---|
| Predict mean | \(\hat{x}_{t|t-1} = 1 \cdot 4 + 0\) | \(\hat{x}_{t|t-1} = 4\) |
| Predict variance | \(\Sigma_{t|t-1} = 1 \cdot 1 \cdot 1 + 1\) | \(\Sigma_{t|t-1} = 2\) |
| Observe | \(y_t = 5\) | innovation \(= 5 - 4 = 1\) |
| Kalman gain | \(K_t = 2 \cdot 1 / (1 \cdot 2 \cdot 1 + 2/3)\) | \(K_t = 3/4\) |
| Update mean | \(\hat{x}_{t|t} = 4 + (3/4) \cdot 1\) | \(\hat{x}_{t|t} = 4.75\) |
| Update variance | \(\Sigma_{t|t} = (1 - 3/4) \cdot 2\) | \(\Sigma_{t|t} = 0.5\) |
The posterior mean shifts \(3/4\) of the way from the prior to the observation, weighted by the relative precision of the two sources. The variance ends smaller than either the prior (\(1\)) or the predicted variance (\(2\)) — because the measurement provides new information.
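The table can be checked with plain scalar arithmetic (the values in the comments are exact in real arithmetic):

```python
# Scalar walkthrough: A = 1, B = 0, C = 1, W = 1, V = 2/3.
A, C, W, V = 1.0, 1.0, 1.0, 2.0 / 3.0
x_prev, P_prev = 4.0, 1.0                # previous posterior mean and variance

x_pred = A * x_prev                      # predict mean: 4
P_pred = A * P_prev * A + W              # predict variance: 2
y = 5.0
innovation = y - C * x_pred              # 5 - 4 = 1
K = P_pred * C / (C * P_pred * C + V)    # 2 / (8/3) = 3/4
x_post = x_pred + K * innovation         # 4 + 3/4 = 4.75
P_post = (1 - K * C) * P_pred            # (1/4) * 2 = 0.5
```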
Correctness
Theorem (optimality). Among all linear estimators of \(x_t\) from \(y_{1:t}\), the Kalman filter minimizes the mean squared error. When the process and measurement noise are Gaussian, the Kalman filter is also the optimal estimator among all (not just linear) estimators — it is the exact Bayesian posterior mean and the MAP estimate.
The proof reduces to a direct computation of the Gaussian posterior under the chain decomposition \(p(x_t, y_t \mid y_{1:t-1}) = p(x_t \mid y_{1:t-1}) \cdot p(y_t \mid x_t)\). The Kalman gain falls out of the closed-form formula for a Gaussian conditional.
Complexity and Tradeoffs
Each step costs \(O(n^3)\) for the \(n \times n\) matrix multiplies plus \(O(p^3)\) for inverting the innovation covariance \(S\). Memory is \(O(n^2)\) for \(\Sigma_{t|t}\) — the dominant storage.
Steady-state Kalman filter
When the system is time-invariant and \((A, C)\) is detectable (every unstable mode of the state shows up in the measurements), the predicted covariance \(\Sigma_{t|t-1}\) converges as \(t \to \infty\) to a fixed point \(\Sigma_\infty\) satisfying the dual algebraic Riccati equation:
\[ \Sigma = A \Sigma A^\top + W - A \Sigma C^\top \bigl( C \Sigma C^\top + V \bigr)^{-1} C \Sigma A^\top. \]
The corresponding steady-state gain \(K_\infty = \Sigma_\infty C^\top (C \Sigma_\infty C^\top + V)^{-1}\) can be precomputed, reducing online computation to a single matrix-vector multiply per step. Standard in embedded applications where computation must be cheap.
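One way to find \(\Sigma_\infty\) is fixed-point iteration of the Riccati recursion, sketched here on the scalar walkthrough system (production code would use a dedicated DARE solver such as `scipy.linalg.solve_discrete_are`):

```python
import numpy as np

# Fixed-point iteration of the Riccati recursion for the scalar system
# A = 1, C = 1, W = 1, V = 2/3 (the walkthrough system).
A = np.array([[1.0]]); C = np.array([[1.0]])
W = np.array([[1.0]]); V = np.array([[2.0 / 3.0]])

Sigma = np.eye(1)                         # any PSD starting point
for _ in range(200):
    G = np.linalg.inv(C @ Sigma @ C.T + V)
    # Right-hand side of the algebraic Riccati equation above.
    Sigma = A @ Sigma @ A.T + W - A @ Sigma @ C.T @ G @ C @ Sigma @ A.T

K_inf = Sigma @ C.T @ np.linalg.inv(C @ Sigma @ C.T + V)
```

For this scalar system the fixed point solves \(\Sigma^2 = \Sigma + 2/3\), so \(\Sigma_\infty = (1 + \sqrt{11/3})/2 \approx 1.457\) and \(K_\infty \approx 0.686\).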
Numerical issues
The covariance update \((I - K_t C) \Sigma_{t|t-1}\) can become non-symmetric or non-PSD under floating-point arithmetic, leading to filter divergence. The Joseph form \((I - K_t C) \Sigma_{t|t-1} (I - K_t C)^\top + K_t V K_t^\top\) is symmetric by construction and PSD even with imprecise \(K_t\) — preferred in safety-critical systems.
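A sketch of the Joseph-form update (algebraically equal to \((I - K_t C)\,\Sigma_{t|t-1}\) when \(K_t\) is the exact gain, but symmetric and PSD by construction):

```python
import numpy as np

def joseph_update(Sigma_pred, K, C, V):
    """Joseph-form covariance update: (I - KC) Sigma (I - KC)^T + K V K^T."""
    n = Sigma_pred.shape[0]
    IKC = np.eye(n) - K @ C
    Sigma = IKC @ Sigma_pred @ IKC.T + K @ V @ K.T
    return 0.5 * (Sigma + Sigma.T)   # re-symmetrize against round-off
```

On the walkthrough numbers (\(\Sigma_{t|t-1} = 2\), \(K = 3/4\), \(C = 1\), \(V = 2/3\)) it reproduces the standard update's \(\Sigma_{t|t} = 0.5\).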
When to Use It
| Situation | Approach |
|---|---|
| Linear-Gaussian state-space model | Kalman filter — optimal and closed-form. |
| Mildly nonlinear dynamics or observation | Extended Kalman filter (EKF) — linearize about the current estimate. |
| Strongly nonlinear, smooth | Unscented Kalman filter (UKF) — propagate sigma points through the true nonlinearity. |
| Multi-modal posterior, severe nonlinearity | Particle filter — Monte Carlo, not Gaussian. |
| Smoothing (offline, two-sided) | Run Kalman forward + RTS smoother backward. |
| Discrete state, no continuous structure | Forward-backward / Viterbi on an HMM. |
| Time-invariant system, steady-state acceptable | Precompute steady-state gain, run a single matrix-vector multiply per step. |
Variants
- Extended Kalman filter (EKF). For nonlinear \(f, h\), linearize about the current estimate using Jacobians. Accurate for mild nonlinearities, fragile for strong ones.
- Unscented Kalman filter (UKF). Propagate a set of deterministically chosen sigma points through the nonlinear functions and refit a Gaussian. Matches the mean and covariance to second order without computing Jacobians.
- Information filter. Maintain \(\Sigma^{-1}\) and \(\Sigma^{-1}\hat{x}\) (the information matrix and information vector) instead of \(\Sigma\) and \(\hat{x}\). Simpler to merge multiple sensors but more expensive to read off the state.
- Square-root filter. Maintain Cholesky factors instead of \(\Sigma\). Numerically stable and used when precision matters (e.g., NASA flight software).
- Duality with LQR. The Kalman filter and the LQR problem are mathematically dual. The Riccati recursion for the filter covariance \(\Sigma_t\) is obtained from the LQR Riccati recursion for \(P_t\) by substituting \((A, B, Q, R) \to (A^\top, C^\top, W, V)\). Every result about LQR gains has a mirror result about Kalman gains.
- Separation principle for LQG control (Anderson and Moore 1990). With both process and measurement noise, the optimal LQG controller decomposes: design the LQR gain \(K\) as if the state were fully observed, design the Kalman filter gain independently, and set \(u_t^* = -K\hat{x}_{t|t}\). The two designs are optimal together.