Learning to Anticipate:
The Temporal Difference Model

Watch prediction errors shift from the reward to the cue—the signature of dopamine neuron activity (Schultz, Dayan & Montague, 1997).

[Interactive demo: adjust the parameters (discount \(\gamma\), reward probability; set reward probability below 1 to see omission dips) and the CS/US timing (t = 10 and t = 40 timesteps), then run trials. Panels show the current trial's prediction error \(\delta(t)\) and value function \(V(t)\), plus trial-by-trial heatmaps of \(\delta(t)\) and \(V(t)\) over time, newest trial at the bottom.]
🧠
The Dopamine Insight

Run trials to see the prediction error shift backward from the US (reward) to the CS (cue)—exactly like dopamine neuron recordings.

How Temporal Difference Learning Works

Dopamine neurons fire when something surprising happens. Schultz, Dayan & Montague (1997) showed that early in conditioning, dopamine neurons fire at reward delivery; after learning, they fire instead at the CS (the cue that predicts the reward) and no longer at the reward itself. Temporal difference (TD) learning explains this shift.

The TD Learning Rule

Time is divided into small steps (microstates). At each timestep \(t\), the agent computes a prediction error:

\[\delta(t) = r(t) + \gamma \cdot V(t+1) - V(t)\]
Prediction error = Reward received + Discounted next value − Current value

Then update the value at \(t\):

\[V(t) \leftarrow V(t) + \alpha \cdot \delta(t)\]
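The two equations above amount to a single forward sweep over the microstates of one trial. A minimal sketch in Python (the function name and defaults are illustrative, not the demo's actual code):

```python
def run_trial(V, rewards, alpha=0.1, gamma=0.95):
    """One TD(0) trial: sweep t = 0..T-1, computing delta(t)
    and nudging V(t) toward its bootstrapped target."""
    T = len(V)
    deltas = []
    for t in range(T):
        next_v = V[t + 1] if t + 1 < T else 0.0      # V beyond the trial end is 0
        delta = rewards[t] + gamma * next_v - V[t]   # delta(t) = r(t) + gamma*V(t+1) - V(t)
        V[t] += alpha * delta                        # V(t) <- V(t) + alpha*delta(t)
        deltas.append(delta)
    return deltas
```

On the first trial all values are zero, so \(\delta\) is nonzero only at the reward itself; each later trial starts from the values the previous one left behind, which is what lets the error creep earlier.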

The Backward Shift

Over trials, the prediction error shifts backward. On the first trial, \(\delta\) spikes only at the reward, which raises \(V\) there; on the next trial, the \(\gamma \cdot V(t+1)\) term produces a spike one step earlier, and so on, until the error has migrated all the way back to the cue.

This is exactly what Schultz observed in dopamine neurons: early in learning, dopamine fires at reward delivery; after learning, it fires at the predictive cue.
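The backward shift can be reproduced in a few lines. This sketch makes one simplifying assumption: value is only learned from the cue onward (before the CS there is no stimulus to attach value to, so those timesteps keep \(V = 0\)). All names and parameters are illustrative:

```python
def simulate(trials=500, T=50, cs=10, us=40, alpha=0.2, gamma=0.95):
    """Return the delta trace of every trial; a reward of 1 arrives at t = us."""
    V = [0.0] * (T + 1)                  # V[T] = 0 pads the end of the trial
    history = []
    for _ in range(trials):
        deltas = [0.0] * T
        for t in range(T):
            r = 1.0 if t == us else 0.0
            deltas[t] = r + gamma * V[t + 1] - V[t]
            if t >= cs:                  # no value is attached before the cue
                V[t] += alpha * deltas[t]
        history.append(deltas)
    return history

peaks = [max(range(50), key=d.__getitem__) for d in simulate()]
# peaks[0] is the reward time; over trials the peak migrates toward the cue
```

On the first trial the \(\delta\) peak sits at the US; after training it sits at the timestep just before the CS, because the unpredicted cue onset is now where the value jump occurs, while the fully predicted reward produces no error at all.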

💡
Try it!

1. Run 50 trials and watch the green peak shift from US to CS in the heatmap.
2. Set reward probability to 0.5 — red bands appear when expected rewards are omitted.
3. Try γ = 0.8 vs γ = 0.98 — lower discount slows the backward propagation.
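The omission dip in step 2 can be checked numerically: train with the reward always delivered, then withhold it once. At the usual reward time, \(\delta\) goes sharply negative. A sketch with illustrative names, not the demo's code:

```python
def td_trial(V, reward_time, reward_size, alpha=0.2, gamma=0.95):
    """One TD(0) sweep over a trial; the last entry of V acts as the terminal 0."""
    T = len(V) - 1
    deltas = [0.0] * T
    for t in range(T):
        r = reward_size if t == reward_time else 0.0
        deltas[t] = r + gamma * V[t + 1] - V[t]
        V[t] += alpha * deltas[t]
    return deltas

V = [0.0] * 51
for _ in range(300):                # reward of 1 delivered on every training trial
    td_trial(V, 40, 1.0)
dip = td_trial(V, 40, 0.0)          # omission trial: the expected reward is withheld
# dip[40] is close to -1: the predicted reward failed to arrive
```

This negative \(\delta\) is the counterpart of the dopamine pause Schultz observed when an expected reward was omitted, and it is what paints the red bands in the heatmap.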