The Critic learns how good things are (state value). The Actor learns what to do (action preferences). Dopamine carries the prediction error that teaches both.
State value learning rate
Action preference learning rate
Run trials to see how the Critic learns the average reward (state value) while the Actor learns which arm to prefer.
The brain doesn't just evaluate situations—it also decides what to do. The Actor-Critic model captures this division of labor with two separate systems that cooperate through a shared prediction error signal.
Learns the state value \(V\) — "How good is it to be in this situation overall?" In a bandit, \(V\) converges to the average reward rate across both arms, weighted by how often each is chosen.
Learns action preferences \(m(a)\) — "Which action should I favor?" These preferences are converted to a policy via softmax. The actor adjusts preferences based on whether the chosen action led to better or worse outcomes than expected.
Step 1: Compute the Prediction Error (δ)
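The original omits the equation, so stating the standard form here as an assumption: in a bandit there is no successor state, so the prediction error reduces to the reward minus the current state value,
\[ \delta = r - V \]
A positive \(\delta\) means the outcome was better than expected; a negative one, worse.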
Step 2: Update the Critic
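Presumably the standard delta rule, with \(\alpha_v\) the state value learning rate from the controls above:
\[ V \leftarrow V + \alpha_v \, \delta \]
This nudges \(V\) toward the rewards actually received, so it converges to the policy-weighted average reward.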
Step 3: Update the Actor
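Assuming the simplest preference update consistent with the description above, only the chosen action \(a\) is adjusted, scaled by the action preference learning rate \(\alpha_m\):
\[ m(a) \leftarrow m(a) + \alpha_m \, \delta \]
Better-than-expected outcomes (\(\delta > 0\)) strengthen the chosen action; worse-than-expected outcomes weaken it.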
Step 4: Policy via Softmax
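The softmax converts preferences into choice probabilities:
\[ \pi(a) = \frac{e^{m(a)}}{\sum_b e^{m(b)}} \]
Actions with higher preferences are chosen more often, but every action keeps some probability, so exploration never fully stops.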
The Critic's job is to evaluate. It answers "how good is this state?" but doesn't specify an action. The Actor's job is to decide. It uses the same δ signal to strengthen or weaken its preference for whichever action produced the better- or worse-than-expected outcome.
This separation means the model can generalize: the Critic can evaluate novel states, while the Actor can transfer learned preferences. It also explains why striatal lesions in different regions produce different behavioral deficits.
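The four update steps can be sketched in a few lines of Python. This is a minimal illustration, not the simulation's actual code: the two-armed Bernoulli bandit, its reward probabilities, and the function names are assumptions.

```python
import math
import random

def softmax(prefs):
    # Step 4: convert action preferences m(a) into a policy.
    exps = [math.exp(m) for m in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def run_bandit(p_reward=(0.7, 0.3), n_trials=100,
               alpha_v=0.1, alpha_m=0.1, seed=0):
    # Hypothetical two-armed Bernoulli bandit; arm reward
    # probabilities p_reward are illustrative assumptions.
    rng = random.Random(seed)
    V = 0.0          # Critic: state value estimate
    m = [0.0, 0.0]   # Actor: action preferences m(A), m(B)
    for _ in range(n_trials):
        pi = softmax(m)
        a = 0 if rng.random() < pi[0] else 1             # sample action from policy
        r = 1.0 if rng.random() < p_reward[a] else 0.0   # Bernoulli reward
        delta = r - V              # Step 1: prediction error
        V += alpha_v * delta       # Step 2: update Critic
        m[a] += alpha_m * delta    # Step 3: update Actor (chosen arm only)
    return V, m

V, m = run_bandit()
```

With enough trials, `V` settles near the policy-weighted average reward while `m[0]` pulls ahead of `m[1]`, mirroring what the demo shows.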
1. Run 100 trials — watch V converge to the weighted average reward, and m(A) pull ahead of m(B).
2. Set αv = 0.5, αm = 0.05 — the Critic learns fast but the Actor is sluggish.
3. Set αv = 0.05, αm = 0.5 — the Actor over-reacts to noisy prediction errors.
4. Make arms close (0.55 vs 0.45) — δ fluctuates more and convergence is slower.