The Actor-Critic Model
Separating Evaluation from Action

The Critic learns how good things are (state value). The Actor learns what to do (action preferences). Dopamine carries the prediction error that teaches both.

[Interactive demo: sliders set the state-value learning rate \(\alpha_v\), the action-preference learning rate \(\alpha_m\), the (hidden) true reward probabilities of two bandit arms (A and B), and the number of trials. Readouts show the Critic's state value \(V\) and last \(\delta\), the Actor's preferences \(m(a)\), the softmax policy \(P(a)\), and per-arm chosen/win counts. Charts plot the prediction error \(\delta\), the state value \(V\), and the action preferences \(m(a)\) over trials.]
🧠
The Actor-Critic Insight

Run trials to see how the Critic learns the average reward (state value) while the Actor learns which arm to prefer.

How the Actor-Critic Model Works

The brain doesn't just evaluate situations—it also decides what to do. The Actor-Critic model captures this division of labor with two separate systems that cooperate through a shared prediction error signal.

The Two Components

🎭 The Critic

Learns the state value \(V\) — "How good is it to be in this situation overall?" In a bandit, \(V\) converges to the average reward rate across both arms, weighted by how often each is chosen.

🎯 The Actor

Learns action preferences \(m(a)\) — "Which action should I favor?" These preferences are converted to a policy via softmax. The actor adjusts preferences based on whether the chosen action led to better or worse outcomes than expected.

The Learning Rules

Step 1: Compute the Prediction Error (δ)

\[\delta = r - V\]
Was the reward better (+δ) or worse (−δ) than expected?

Step 2: Update the Critic

\[V \leftarrow V + \alpha_v \cdot \delta\]
Move the state value toward the actual reward

Step 3: Update the Actor

\[m(a_{\text{chosen}}) \leftarrow m(a_{\text{chosen}}) + \alpha_m \cdot \delta\]
If δ > 0: strengthen the chosen action preference
If δ < 0: weaken the chosen action preference

Step 4: Policy via Softmax

\[P(a) = \frac{e^{m(a)/\tau}}{\sum_b e^{m(b)/\tau}}\]
Convert preferences into choice probabilities; the temperature \(\tau\) controls how strongly preferences bias choice (low \(\tau\) exploits, high \(\tau\) explores).
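The four steps above can be sketched as a single trial update. This is a minimal illustration, not the demo's actual implementation; the parameter defaults and reward probabilities used below are assumptions.

```python
import math
import random

def softmax(prefs, tau=1.0):
    """Step 4: convert action preferences m(a) into a policy P(a)."""
    exps = [math.exp(m / tau) for m in prefs.values()]
    total = sum(exps)
    return {a: e / total for a, e in zip(prefs, exps)}

def run_trial(V, m, reward_probs, alpha_v=0.1, alpha_m=0.1, tau=1.0):
    """One Actor-Critic trial on a two-armed bandit (illustrative sketch)."""
    policy = softmax(m, tau)
    # Sample an action from the softmax policy
    a = random.choices(list(policy), weights=list(policy.values()))[0]
    # Bernoulli reward from the chosen arm's true probability
    r = 1.0 if random.random() < reward_probs[a] else 0.0
    delta = r - V            # Step 1: prediction error
    V += alpha_v * delta     # Step 2: update the Critic
    m[a] += alpha_m * delta  # Step 3: update the Actor (chosen action only)
    return V, m, delta
```

Running this loop for a few hundred trials reproduces the demo's behavior: \(V\) climbs toward the average reward while the preference for the richer arm pulls ahead.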

Neural Implementation

Why Separate Actor and Critic?

The Critic's job is to evaluate. It answers "how good is this state?" but doesn't specify an action. The Actor's job is to decide. It uses the same \(\delta\) signal to strengthen or weaken the preference for whichever action produced the better- or worse-than-expected outcome.

This separation means the model can generalize: the Critic can evaluate novel states, while the Actor can transfer learned preferences. It also explains why striatal lesions in different regions produce different behavioral deficits.

💡
Try it!

1. Run 100 trials — watch \(V\) converge to the weighted average reward, and \(m(A)\) pull ahead of \(m(B)\).
2. Set \(\alpha_v = 0.5\), \(\alpha_m = 0.05\) — the Critic learns fast but the Actor is sluggish.
3. Set \(\alpha_v = 0.05\), \(\alpha_m = 0.5\) — the Actor over-reacts to noisy prediction errors.
4. Make the arms close (0.55 vs 0.45) — \(\delta\) fluctuates more and convergence is slower.
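The "weighted average" target in the first experiment can be checked directly: under a roughly fixed policy, \(V\) settles at the policy-weighted average reward rate, \(V^* = \sum_a P(a)\, p_{\text{reward}}(a)\). The probabilities below are illustrative assumptions, not the demo's hidden values.

```python
# Assumed true reward probabilities and a policy the Actor might converge to.
p_reward = {"A": 0.7, "B": 0.3}
policy = {"A": 0.9, "B": 0.1}

# Policy-weighted average reward rate: where the Critic's V settles.
v_star = sum(policy[a] * p_reward[a] for a in p_reward)
# v_star = 0.9 * 0.7 + 0.1 * 0.3 = 0.66
```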