Kalman Filter
Motivation
A linear dynamical system driven by noise cannot be observed exactly — the hidden state must be estimated from noisy measurements. The Kalman filter (Kalman 1960) is the optimal estimator for this setting: it maintains a Gaussian belief over the state and updates it recursively as each measurement arrives, minimizing mean squared error among all linear estimators.
The filter underpins GPS receivers, aircraft autopilots, robot localization, and financial time-series smoothing. It also forms one half of the linear quadratic Gaussian (LQG) controller, which pairs the Kalman filter with the linear quadratic regulator.
Problem
The system is described by two equations. The process model evolves the hidden state:
\[ x_{t+1} = A x_t + B u_t + w_t, \qquad w_t \sim \mathcal{N}(0, W), \]
and the observation model generates measurements:
\[ y_t = C x_t + v_t, \qquad v_t \sim \mathcal{N}(0, V), \]
where \(x_t \in \mathbb{R}^n\) is the hidden state, \(u_t \in \mathbb{R}^m\) is a known control input, \(y_t \in \mathbb{R}^p\) is the measurement, and \(A\), \(B\), \(C\) are known matrices. The noise covariances \(W \succeq 0\) and \(V \succ 0\) are known, and \(\{w_t\}\), \(\{v_t\}\) are mutually independent.
Given measurements \(y_1, \ldots, y_t\), compute (online, one step at a time) the posterior \(p(x_t \mid y_1, \ldots, y_t)\).
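To make the setup concrete, here is a minimal simulation of the two model equations. The specific matrices — a 2-D constant-velocity state (position, velocity) with position-only measurements — are illustrative choices, not fixed by the problem statement:

```python
import numpy as np

# Illustrative model: 2-D constant-velocity state, position-only measurements.
rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])          # state transition
B = np.zeros((2, 1))                # no control input in this example
C = np.array([[1.0, 0.0]])          # measure position only
W = 0.01 * np.eye(2)                # process noise covariance (PSD)
V = np.array([[0.25]])              # measurement noise covariance (PD)

x = np.array([0.0, 1.0])            # true initial state
states, measurements = [], []
for t in range(50):
    w = rng.multivariate_normal(np.zeros(2), W)   # process noise w_t
    v = rng.multivariate_normal(np.zeros(1), V)   # measurement noise v_t
    x = A @ x + w                                 # x_{t+1} = A x_t + w_t  (u_t = 0)
    y = C @ x + v                                 # y_t = C x_t + v_t
    states.append(x.copy())
    measurements.append(y)
```

The filter's job is to recover `states` from `measurements` alone.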
Key Ideas
A Gaussian stays Gaussian under linear updates
Each step of the model is a linear function of the previous state plus Gaussian noise. Linear-Gaussian transformations preserve normality: if \(x_{t-1} \mid y_{1:t-1}\) is Gaussian, then so is \(x_t \mid y_{1:t-1}\) (after the dynamics), and so is \(x_t \mid y_{1:t}\) (after conditioning on the new measurement). The posterior never leaves the Gaussian family, so the filter only has to track a mean vector and a covariance matrix.
Alternate predict and update
Each new step splits into two stages:
- Predict. Push the previous posterior through the dynamics. The mean shifts by the deterministic part; the covariance grows by the process noise.
- Update. Condition on the new measurement. The mean shifts toward the observation; the covariance shrinks because the measurement provides information.
This alternation is exactly the chain decomposition of the joint \(p(x_t, y_t \mid y_{1:t-1}) = p(x_t \mid y_{1:t-1}) \cdot p(y_t \mid x_t)\), followed by Bayes’ rule.
The Kalman gain is the optimal trust trade-off
The update reweights the prediction by the innovation \(y_t - C\hat{x}_{t|t-1}\) — the part of the measurement not predicted by the prior. The Kalman gain \(K_t\) is the optimal weight: when the measurement is precise (\(V\) small) \(K_t\) is large and the posterior trusts the observation; when the prediction is precise (\(\Sigma_{t|t-1}\) small) \(K_t\) is small and the posterior trusts the prior. The gain minimizes the posterior covariance over all linear update rules.
Same recursion as forward-backward, in continuous state
The Kalman filter is exactly the forward pass of forward-backward for a continuous-state HMM with linear-Gaussian transitions and observations. The discrete sum-product becomes a matrix recurrence on means and covariances because Gaussians close under marginalization and conditioning. Add a backward pass (the RTS smoother) and you get smoothing.
Deriving the Solution
The Kalman filter maintains the Gaussian posterior \(p(x_t \mid y_1, \ldots, y_t) = \mathcal{N}(\hat{x}_{t|t},\, \Sigma_{t|t})\). Apply the dynamics, then condition on the measurement.
Predict (propagate through dynamics before seeing \(y_t\)):
\[ \hat{x}_{t|t-1} = A \hat{x}_{t-1|t-1} + B u_t, \]
\[ \Sigma_{t|t-1} = A \Sigma_{t-1|t-1} A^\top + W. \]
The mean transforms by the deterministic dynamics; the covariance transforms by the linear-Gaussian variance formula plus the new process noise \(W\).
Update (incorporate measurement \(y_t\) via Bayes’ rule for Gaussians):
\[ K_t = \Sigma_{t|t-1} C^\top \bigl( C \Sigma_{t|t-1} C^\top + V \bigr)^{-1}, \]
\[ \hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t \bigl( y_t - C \hat{x}_{t|t-1} \bigr), \]
\[ \Sigma_{t|t} = (I - K_t C)\, \Sigma_{t|t-1}. \]
The matrix \(K_t\) is the Kalman gain, derived by minimizing the trace of \(\Sigma_{t|t}\) over all linear updates of this form.
Initialization. Set \(\hat{x}_{0|0} = \mu_0\) and \(\Sigma_{0|0} = \Sigma_0\) to the prior mean and covariance of \(x_0\). If the initial state is unknown, \(\Sigma_0 = \sigma^2 I\) for large \(\sigma\) reflects diffuse ignorance.
Algorithm
```
initialize x̂[0] = μ₀, Σ[0] = Σ₀
for t = 1, 2, ...:
    # Predict
    x̂_pred = A · x̂[t-1] + B · u[t]
    Σ_pred = A · Σ[t-1] · Aᵀ + W
    # Update
    innovation = y[t] - C · x̂_pred
    S = C · Σ_pred · Cᵀ + V        # innovation covariance
    K = Σ_pred · Cᵀ · S⁻¹          # Kalman gain
    x̂[t] = x̂_pred + K · innovation
    Σ[t] = (I - K · C) · Σ_pred
```
Each step requires inverting the \(p \times p\) matrix \(S\) (where \(p\) is the measurement dimension) and a few matrix multiplies in \(\mathbb{R}^{n \times n}\).
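The pseudocode translates line for line into NumPy. A minimal sketch (the function name `kalman_step` is ours; a production version would solve against \(S\) rather than form its inverse):

```python
import numpy as np

def kalman_step(x_hat, Sigma, y, u, A, B, C, W, V):
    """One predict-update cycle; returns the new posterior mean and covariance."""
    # Predict: push the previous posterior through the dynamics.
    x_pred = A @ x_hat + B @ u
    Sigma_pred = A @ Sigma @ A.T + W
    # Update: condition on the measurement y.
    innovation = y - C @ x_pred
    S = C @ Sigma_pred @ C.T + V              # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ innovation
    Sigma_new = (np.eye(len(x_hat)) - K @ C) @ Sigma_pred
    return x_new, Sigma_new
```

On the one-dimensional walkthrough below (\(A = C = 1\), \(W = 1\), \(V = 2/3\), prior mean 4, prior variance 1, measurement 5), this returns mean 4.75 and variance 0.5.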
Walkthrough
One-dimensional predict and update
In one dimension, the covariance is just a variance, so the recursion can be read as moving and reshaping bell curves. The prediction step shifts the previous posterior through the dynamics and widens it by process noise; the update step combines that predicted belief with the new measurement and produces a narrower posterior.
Numerically: suppose \(A = 1\) (random walk), \(B = 0\), \(C = 1\), \(W = 1\), \(V = 2/3\). Start with \(\hat{x}_{t-1|t-1} = 4\), \(\Sigma_{t-1|t-1} = 1\).
| Step | Computation | Result |
|---|---|---|
| Predict mean | \(\hat{x}_{t|t-1} = 1 \cdot 4 + 0\) | \(\hat{x}_{t|t-1} = 4\) |
| Predict variance | \(\Sigma_{t|t-1} = 1 \cdot 1 \cdot 1 + 1\) | \(\Sigma_{t|t-1} = 2\) |
| Observe | \(y_t = 5\) | innovation \(= 5 - 4 = 1\) |
| Kalman gain | \(K_t = 2 \cdot 1 / (1 \cdot 2 \cdot 1 + 2/3)\) | \(K_t = 3/4\) |
| Update mean | \(\hat{x}_{t|t} = 4 + (3/4) \cdot 1\) | \(\hat{x}_{t|t} = 4.75\) |
| Update variance | \(\Sigma_{t|t} = (1 - 3/4) \cdot 2\) | \(\Sigma_{t|t} = 0.5\) |
The posterior mean shifts \(3/4\) of the way from the prior to the observation, weighted by the relative precision of the two sources. The variance ends smaller than either the prior (\(1\)) or the predicted variance (\(2\)) — because the measurement provides new information.
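The table can be checked with plain scalar arithmetic (the values in the comments are exact in real arithmetic):

```python
# Scalar walkthrough: A = 1, B = 0, C = 1, W = 1, V = 2/3.
A, C, W, V = 1.0, 1.0, 1.0, 2.0 / 3.0
x_prev, P_prev = 4.0, 1.0                # previous posterior mean and variance

x_pred = A * x_prev                      # predict mean: 4
P_pred = A * P_prev * A + W              # predict variance: 2
y = 5.0
innovation = y - C * x_pred              # 5 - 4 = 1
K = P_pred * C / (C * P_pred * C + V)    # 2 / (8/3) = 3/4
x_post = x_pred + K * innovation         # 4 + 3/4 = 4.75
P_post = (1 - K * C) * P_pred            # (1/4) * 2 = 0.5
```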
Correctness
Theorem (optimality). Among all linear estimators of \(x_t\) from \(y_{1:t}\), the Kalman filter minimizes the mean squared error. When the process and measurement noise are Gaussian, the Kalman filter is also the optimal estimator among all (not just linear) estimators — it is the exact Bayesian posterior mean and the MAP estimate.
The proof reduces to a direct computation of the Gaussian posterior under the chain decomposition \(p(x_t, y_t \mid y_{1:t-1}) = p(x_t \mid y_{1:t-1}) \cdot p(y_t \mid x_t)\). The Kalman gain falls out of the closed-form formula for a Gaussian conditional.
Complexity and Tradeoffs
Each step costs \(O(n^3)\) for the \(n \times n\) matrix multiplies plus \(O(p^3)\) for inverting the innovation covariance \(S\). Memory is \(O(n^2)\) for \(\Sigma_{t|t}\) — the dominant storage.
Steady-state Kalman filter
When the system is time-invariant and \((A, C)\) is detectable (every unstable mode of the state shows up in the measurements), the predicted covariance \(\Sigma_{t|t-1}\) converges as \(t \to \infty\) to a fixed point \(\Sigma_\infty\) satisfying the dual algebraic Riccati equation:
\[ \Sigma = A \Sigma A^\top + W - A \Sigma C^\top \bigl( C \Sigma C^\top + V \bigr)^{-1} C \Sigma A^\top. \]
The corresponding steady-state gain \(K_\infty = \Sigma_\infty C^\top (C \Sigma_\infty C^\top + V)^{-1}\) can be precomputed, reducing online computation to a single matrix-vector multiply per step. Standard in embedded applications where computation must be cheap.
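One way to find \(\Sigma_\infty\) is fixed-point iteration of the Riccati recursion, sketched here on the scalar walkthrough system (production code would use a dedicated DARE solver such as `scipy.linalg.solve_discrete_are`):

```python
import numpy as np

# Fixed-point iteration of the Riccati recursion for the scalar system
# A = 1, C = 1, W = 1, V = 2/3 (the walkthrough system).
A = np.array([[1.0]]); C = np.array([[1.0]])
W = np.array([[1.0]]); V = np.array([[2.0 / 3.0]])

Sigma = np.eye(1)                         # any PSD starting point
for _ in range(200):
    G = np.linalg.inv(C @ Sigma @ C.T + V)
    # Right-hand side of the algebraic Riccati equation above.
    Sigma = A @ Sigma @ A.T + W - A @ Sigma @ C.T @ G @ C @ Sigma @ A.T

K_inf = Sigma @ C.T @ np.linalg.inv(C @ Sigma @ C.T + V)
```

For this scalar system the fixed point solves \(\Sigma^2 = \Sigma + 2/3\), so \(\Sigma_\infty = (1 + \sqrt{11/3})/2 \approx 1.457\) and \(K_\infty \approx 0.686\).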
Numerical issues
The covariance update \((I - K_t C) \Sigma_{t|t-1}\) can become non-symmetric or non-PSD under floating-point arithmetic, leading to filter divergence. The Joseph form \((I - K_t C) \Sigma_{t|t-1} (I - K_t C)^\top + K_t V K_t^\top\) is symmetric by construction and PSD even with imprecise \(K_t\) — preferred in safety-critical systems.
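A sketch of the Joseph-form update (algebraically equal to \((I - K_t C)\,\Sigma_{t|t-1}\) when \(K_t\) is the exact gain, but symmetric and PSD by construction):

```python
import numpy as np

def joseph_update(Sigma_pred, K, C, V):
    """Joseph-form covariance update: (I - KC) Sigma (I - KC)^T + K V K^T."""
    n = Sigma_pred.shape[0]
    IKC = np.eye(n) - K @ C
    Sigma = IKC @ Sigma_pred @ IKC.T + K @ V @ K.T
    return 0.5 * (Sigma + Sigma.T)   # re-symmetrize against round-off
```

On the walkthrough numbers (\(\Sigma_{t|t-1} = 2\), \(K = 3/4\), \(C = 1\), \(V = 2/3\)) it reproduces the standard update's \(\Sigma_{t|t} = 0.5\).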
When to Use It
| Situation | Approach |
|---|---|
| Linear-Gaussian state-space model | Kalman filter — optimal and closed-form. |
| Mildly nonlinear dynamics or observation | Extended Kalman filter (EKF) — linearize about the current estimate. |
| Strongly nonlinear, smooth | Unscented Kalman filter (UKF) — propagate sigma points through the true nonlinearity. |
| Multi-modal posterior, severe nonlinearity | Particle filter — Monte Carlo, not Gaussian. |
| Smoothing (offline, two-sided) | Run Kalman forward + RTS smoother backward. |
| Discrete state, no continuous structure | Forward-backward / Viterbi on an HMM. |
| Time-invariant system, steady-state acceptable | Precompute steady-state gain, run a single matrix-vector multiply per step. |
Variants
- Extended Kalman filter (EKF). For nonlinear \(f, h\), linearize about the current estimate using Jacobians. Accurate for mild nonlinearities, fragile for strong ones.
- Unscented Kalman filter (UKF). Propagate a set of deterministically chosen sigma points through the nonlinear functions and refit a Gaussian. Matches the mean and covariance to second order without computing Jacobians.
- Information filter. Maintain \(\Sigma^{-1}\) and \(\Sigma^{-1}\hat{x}\) (the information matrix and information vector) instead of \(\Sigma\) and \(\hat{x}\). Simpler to merge multiple sensors but more expensive to read off the state.
- Square-root filter. Maintain Cholesky factors instead of \(\Sigma\). Numerically stable and used when precision matters (e.g., NASA flight software).
- Duality with LQR. The Kalman filter and the LQR problem are mathematically dual. The Riccati recursion for the filter covariance \(\Sigma_t\) is obtained from the LQR Riccati recursion for \(P_t\) by substituting \((A, B, Q, R) \to (A^\top, C^\top, W, V)\). Every result about LQR gains has a mirror result about Kalman gains.
- Separation principle for LQG control (Anderson and Moore 1990). With both process and measurement noise, the optimal LQG controller decomposes: design the LQR gain \(K\) as if the state were fully observed, design the Kalman filter gain independently, and set \(u_t^* = -K\hat{x}_{t|t}\). The two designs are optimal together.