Notes on Predictive Coding and Reinforcement Learning

These are rough notes — the kind I write to organize my own thinking rather than to make a formal argument. The question I keep returning to is: why do predictive coding and modern deep RL feel like they’re describing the same thing from different angles?

The Basic Idea

Predictive coding, in the neuroscience sense, proposes that the brain is constantly generating predictions about incoming sensory signals, and that what actually gets propagated up the hierarchy are prediction errors — the residuals between what was expected and what arrived. Karl Friston’s free energy principle extends this: agents minimize surprise (or technically, a bound on surprise called free energy) by either updating their internal model or acting on the world to make it conform to predictions.
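
To make the mechanics concrete, here is a minimal single-level sketch in NumPy. It assumes a linear generative model, and the dimensions, learning rates, and update rules are illustrative rather than a faithful implementation of the full free energy scheme — just enough to show "top-down prediction, bottom-up error" as code.

```python
import numpy as np

# Minimal one-level predictive coding sketch (illustrative only).
# A latent estimate mu generates a prediction of the sensory input through weights W;
# only the prediction error (the residual) drives updates to the latent estimate
# and, more slowly, to the generative weights themselves.

rng = np.random.default_rng(0)
obs_dim, latent_dim = 8, 4
W = rng.normal(scale=0.1, size=(obs_dim, latent_dim))  # generative (top-down) weights
mu = np.zeros(latent_dim)                              # current latent estimate

def pc_step(x, mu, W, lr_mu=0.1, lr_w=0.01):
    pred = W @ mu                          # top-down prediction of the input
    err = x - pred                         # bottom-up prediction error
    mu = mu + lr_mu * (W.T @ err)          # inference: adjust the latent to explain the error
    W = W + lr_w * np.outer(err, mu)       # learning: slowly adapt the generative model
    return mu, W, err

x = rng.normal(size=obs_dim)               # a stand-in sensory input
for _ in range(50):
    mu, W, err = pc_step(x, mu, W)
print("remaining error norm:", np.linalg.norm(err))
```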

Reinforcement learning, at its core, is also about learning predictions. The Bellman equation is a statement about the self-consistency of value predictions across time. Temporal difference learning is literally about propagating prediction errors backwards through time. Actor-critic methods decompose the problem into a prediction component (the critic) and a decision component (the actor), a split that looks a lot like the two-stream architecture in predictive coding.
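
As a concrete instance of "learning predictions from prediction errors," here is a tabular TD(0) sketch on a toy deterministic chain. The environment and constants are made up for illustration; the point is that the only learning signal is the TD error.

```python
import numpy as np

# TD(0) on a toy right-moving chain: the value update is driven entirely by a
# prediction error (the TD error), which is the sense in which RL "learns predictions".

n_states, gamma, alpha = 5, 0.9, 0.1   # state 4 is terminal; reward 1 on entering it
V = np.zeros(n_states)

for episode in range(500):
    s = 0
    while s < n_states - 1:
        s_next = s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        target = r + gamma * V[s_next]    # bootstrapped prediction of the return
        td_error = target - V[s]          # prediction error, analogous to a PC residual
        V[s] += alpha * td_error          # move the prediction toward self-consistency
        s = s_next

print(np.round(V, 3))   # roughly [0.729, 0.81, 0.9, 1.0, 0.0]
```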

Where They Converge

Model-based RL makes the connection even sharper. If you have a learned world model that predicts the next state given the current state and action, you can plan by imagining rollouts — essentially doing inference in the model. This is structurally similar to active inference, where an agent selects actions that minimize expected free energy under its generative model of the world.
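
A rough sketch of what "planning by imagining rollouts" means in code: score candidate action sequences inside the model and execute only the best first action. The world and reward models below are hand-coded stand-ins for learned ones, and every name is a placeholder rather than a reference to any particular library; this is random-shooting MPC, one of the simplest possible planners.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(state, action):
    # stand-in for a learned transition model: s' = s + a, plus a little model noise
    return state + action + rng.normal(scale=0.01, size=state.shape)

def reward_model(state, goal):
    return -np.linalg.norm(state - goal)       # closer to the goal is better

def plan(state, goal, horizon=5, n_candidates=64):
    best_seq, best_return = None, -np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1, 1, size=(horizon, state.shape[0]))
        s, ret = state.copy(), 0.0
        for a in seq:                          # imagined rollout, no environment interaction
            s = world_model(s, a)
            ret += reward_model(s, goal)
        if ret > best_return:
            best_seq, best_return = seq, ret
    return best_seq[0]                         # execute only the first action (MPC-style)

state, goal = np.zeros(2), np.array([3.0, -1.0])
print("first planned action:", plan(state, goal))
```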

The key insight: both frameworks are trying to learn a compressed, hierarchical generative model of the world, and to use that model to guide action.

The difference is largely in emphasis. Predictive coding emphasizes the architecture of the hierarchy: top-down predictions, bottom-up errors, lateral connections. Deep RL emphasizes the objective: maximize cumulative reward, with the world model as a means to that end.

Hierarchical Structure

One place where the neuroscience perspective offers something genuinely useful is in thinking about timescales. A real agent needs predictions at multiple timescales: millisecond-level motor predictions, second-level task predictions, minute-level goal predictions. Predictive coding models this as a hierarchy where higher levels predict slower, more abstract dynamics.

Modern deep RL largely lacks this structured temporal hierarchy. We have options and skill discovery, but these are hard to scale and train. I think there’s real value in borrowing the hierarchical architecture from predictive coding and asking: what would it mean to have a world model that is explicitly structured at multiple timescales, with the same error-propagation mechanism that makes cortical hierarchies efficient?
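
As a toy sketch of what "explicitly structured at multiple timescales" could mean, here is a two-level hierarchy in which the bottom level predicts the raw input every step while the top level updates only every k steps and predicts the bottom level's latents. Weight learning is omitted for brevity and all dimensions are arbitrary; this is a cartoon of the idea, not a proposal.

```python
import numpy as np

# Two-level predictive hierarchy with different timescales: the fast level tracks the
# input each step, the slow level tracks the fast level's latents every k steps.
# Only the latent estimates are updated here; weight learning is left out.

rng = np.random.default_rng(0)
obs_dim, lo_dim, hi_dim, k = 6, 4, 2, 10

W_lo = rng.normal(scale=0.1, size=(obs_dim, lo_dim))   # bottom level predicts the input
W_hi = rng.normal(scale=0.1, size=(lo_dim, hi_dim))    # top level predicts bottom latents
mu_lo, mu_hi = np.zeros(lo_dim), np.zeros(hi_dim)

def level_step(x, mu, W, lr=0.1):
    err = x - W @ mu                   # prediction error at this level
    return mu + lr * (W.T @ err), err

for t in range(200):
    x = np.sin(0.05 * t) * np.ones(obs_dim) + rng.normal(scale=0.05, size=obs_dim)
    mu_lo, err_lo = level_step(x, mu_lo, W_lo)          # fast: every step
    if t % k == 0:                                      # slow: every k steps
        mu_hi, err_hi = level_step(mu_lo, mu_hi, W_hi)

print("fast error:", np.linalg.norm(err_lo), "slow error:", np.linalg.norm(err_hi))
```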

Open Questions

A few things I don’t know yet:

  • How do you define “surprise” in a way that’s tractable for complex, partially observed environments?
  • Can compositional structure — the kind that lets the brain recombine known concepts to understand new situations — emerge from predictive coding architectures, or does it require explicit symbolic structure?
  • What is the right way to combine extrinsic reward (from the environment) with intrinsic reward (prediction error, novelty) in a single unified objective?

What I’m Working On

My current work tries to close some of these gaps by building hierarchical world models that explicitly use predictive-coding-style architectures, trained with both reconstruction objectives and task-relevant RL objectives. The hope is that the architectural inductive bias helps with compositional generalization in ways that flat world models don't.

More to come as the ideas solidify.

