Intrinsic Motivation
Motivation
Upper confidence bound and Thompson sampling explore by tracking per-arm uncertainty. In a tabular setting that uncertainty is a function of the visit count \(N(s, a)\), which is cheap to maintain. In a deep reinforcement learning setting with millions of states — Atari frames, robot images, sentence prefixes — the count is identically zero almost everywhere, and uncertainty estimates have to be approximated.
Intrinsic motivation addresses this gap. The agent augments its environment (“extrinsic”) reward \(r^e_t\) with an internally computed intrinsic reward \(r^i_t\) that is high in novel or uncertain states:
\[ r_t = r^e_t + \beta \, r^i_t. \]
The intrinsic term encourages the agent to seek out parts of the state space it has not yet mastered. It is a generalization of the optimism principle: in tabular settings the right intrinsic bonus reduces to a count-based UCB term (Bellemare et al. 2016); in deep RL it becomes whatever scalable proxy for novelty the designer can compute.
See exploration vs. exploitation for the broader context. The general idea of building exploration into the reward dates to early work on artificial curiosity (Schmidhuber 1991).
Count-based bonuses
The simplest intrinsic reward in a tabular MDP is the UCB-style bonus
\[ r^i_t = \frac{c}{\sqrt{N(s_t)}}. \]
States visited rarely receive a large bonus, states visited often receive a small one. Combined with any value-based learner, this produces optimistic value estimates and thus directed exploration.
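As a concrete illustration, here is a minimal tabular Q-learning loop with the count bonus folded into the reward. It is a sketch, not a reference implementation: the environment is assumed to follow the Gymnasium API with discrete observation and action spaces, and the hyperparameters are illustrative.

```python
import numpy as np

def q_learning_with_count_bonus(env, n_episodes=500, c=0.1,
                                alpha=0.1, gamma=0.99):
    # Tabular Q-learning where the UCB-style count bonus is added to
    # the extrinsic reward, producing optimistic value estimates.
    n_s, n_a = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_s, n_a))
    N = np.zeros(n_s)  # per-state visit counts

    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            N[s] += 1
            a = int(np.argmax(Q[s]))          # greedy w.r.t. optimistic Q
            s2, r_ext, term, trunc, _ = env.step(a)
            r = r_ext + c / np.sqrt(N[s])     # extrinsic + intrinsic bonus
            target = r + gamma * (0.0 if term else np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s, done = s2, term or trunc
    return Q
```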
In high-dimensional settings, exact counts are uninformative because essentially no state is visited twice. Pseudo-counts (Bellemare et al. 2016) replace \(N(s)\) with \(\hat{N}(s)\) derived from a density model: if a generative model assigns \(s\) a probability that increases substantially after one observation of it, the implied pseudo-count is small. Pseudo-count bonuses recover meaningful exploration in Atari games like Montezuma’s Revenge where extrinsic reward is sparse.
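Concretely (Bellemare et al. 2016): let \(\rho(s)\) be the probability the density model assigns to \(s\) before observing it, and \(\rho'(s)\) the probability after one more observation of \(s\). Demanding that these behave like empirical frequencies, \(\rho(s) = \hat{N}(s)/\hat{n}\) and \(\rho'(s) = (\hat{N}(s)+1)/(\hat{n}+1)\) for some pseudo-count total \(\hat{n}\), and solving the pair for \(\hat{N}(s)\) gives
\[ \hat{N}(s) = \frac{\rho(s)\,\bigl(1 - \rho'(s)\bigr)}{\rho'(s) - \rho(s)}, \]
which plugs directly into the count bonus above.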
Prediction-error / curiosity
Instead of estimating densities, predict the next state from the current state and action, and reward the agent for being surprised — that is, for transitions its model fails to predict.
The Intrinsic Curiosity Module (ICM) (Pathak et al. 2017) trains a forward model \(\hat{\phi}(s_{t+1}) = f(\phi(s_t), a_t)\) in a learned feature space \(\phi\) and sets
\[ r^i_t = \tfrac{1}{2} \, \| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \|^2. \]
Transitions the agent has practiced are predictable; novel ones are not. The feature space \(\phi\) is learned with an inverse-dynamics objective (\(a_t\) predicted from \(\phi(s_t), \phi(s_{t+1})\)) so that \(\phi\) encodes only features the agent’s actions can affect, shielding the bonus from unpredictable but action-irrelevant noise (the “noisy TV” problem).
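A minimal PyTorch sketch of the two ICM heads, assuming vector observations and discrete actions; the layer sizes and single-layer heads are illustrative, not the paper’s architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Minimal ICM sketch for vector observations and discrete actions."""

    def __init__(self, obs_dim, n_actions, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(                      # phi
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)
        self.inverse_model = nn.Linear(2 * feat_dim, n_actions)
        self.n_actions = n_actions

    def forward(self, s, a, s_next):
        phi, phi_next = self.encoder(s), self.encoder(s_next)
        a_onehot = F.one_hot(a, self.n_actions).float()
        # Forward model predicts phi(s') from (phi(s), a). Inputs are
        # detached so only the inverse-dynamics loss shapes the encoder.
        phi_pred = self.forward_model(
            torch.cat([phi.detach(), a_onehot], dim=-1))
        r_int = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(-1)
        # Inverse model predicts a from (phi(s), phi(s')), training phi
        # to keep only features the agent's actions can influence.
        a_logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        loss = r_int.mean() + F.cross_entropy(a_logits, a)
        return r_int.detach(), loss
```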
Random network distillation
Random network distillation (RND) (Burda et al. 2019) sidesteps forward modeling entirely. A fixed, randomly initialized network \(f^*\) maps each state to a feature vector, and a predictor network \(\hat{f}_\theta\) is trained to match \(f^*\) on visited states. The intrinsic reward is the prediction error
\[ r^i_t = \| \hat{f}_\theta(s_t) - f^*(s_t) \|^2. \]
Because \(f^*\) is fixed and random, the prediction error is high on states the learner has rarely encountered and low on familiar ones. RND is simpler and more stable than ICM, since there are no environment dynamics to model, and it was the first method to exceed average human performance on Montezuma’s Revenge without demonstrations.
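The whole mechanism fits in a few lines. A sketch, again assuming vector observations; the layer sizes are illustrative, not the paper’s:

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Minimal RND sketch: a frozen random target and a trained predictor."""

    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim))
        self.target = mlp()       # f*: random, never trained
        self.predictor = mlp()    # f_theta: trained on visited states
        for p in self.target.parameters():
            p.requires_grad_(False)

    def forward(self, s):
        # Per-state squared prediction error. Detached, this is the
        # intrinsic reward; its mean is the predictor's training loss.
        return (self.predictor(s) - self.target(s)).pow(2).sum(-1)
```

The same squared error serves double duty: detached, it is the bonus \(r^i_t\); averaged over a minibatch, it is the loss that trains \(\hat{f}_\theta\).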
Practical considerations
- Reward scaling and decay. The bonus is non-stationary by construction — it shrinks as the agent learns — which complicates value learning. Standard practice is to normalize \(r^i\) by a running estimate of its standard deviation and to decay \(\beta\) over training (see the sketch after this list).
- Bias. A persistent intrinsic reward biases the optimal policy. Treating \(r^i\) as a separate “exploration” reward stream with its own value head, then combining them at action-selection time, is a common workaround.
- Stochasticity hazard. A naive prediction-error bonus rewards stochastic transitions even when the agent has nothing left to learn from them. Inverse-dynamics features (ICM) and fixed-random-target distillation (RND, whose target is a deterministic function of the state and therefore learnable to zero error) are the two standard fixes.
- Relation to optimism. In tabular MDPs the pseudo-count bonus reduces to the count-based bonus above, recovering UCB-style optimistic Q-learning. Intrinsic motivation is the deep-RL avatar of optimism in the face of uncertainty.
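A common way to implement the running normalization from the first item above is a Welford-style running variance. A minimal sketch; the class name and interface are hypothetical, not from any library:

```python
import numpy as np

class RewardNormalizer:
    """Running-std normalizer for the non-stationary intrinsic reward."""

    def __init__(self, eps=1e-8):
        self.mean, self.var, self.count = 0.0, 1.0, eps
        self.eps = eps

    def update(self, x):
        # Welford-style parallel update from a batch of rewards x.
        x = np.asarray(x, dtype=np.float64)
        b_mean, b_var, n = x.mean(), x.var(), x.size
        delta = b_mean - self.mean
        tot = self.count + n
        self.mean += delta * n / tot
        m2 = self.var * self.count + b_var * n + delta**2 * self.count * n / tot
        self.var, self.count = m2 / tot, tot

    def normalize(self, x):
        # Divide by the running std only (no centering), so the bonus
        # keeps its sign and relative scale across training.
        return np.asarray(x) / (np.sqrt(self.var) + self.eps)
```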