Actor-Critic Model | State Values & Action Preferences

How the Actor-Critic Model Works

The brain doesn't just evaluate situations—it also decides what to do. The Actor-Critic model captures this division of labor with two separate systems that cooperate through a shared prediction error signal.

The Two Components

🎭 The Critic

Learns the state value \(V\) — "How good is it to be in this situation overall?" In a bandit, \(V\) converges to the average reward rate across both arms, weighted by how often each is chosen.

🎯 The Actor

Learns action preferences \(m(a)\) — "Which action should I favor?" These preferences are converted to a policy via softmax. The actor adjusts preferences based on whether the chosen action led to better or worse outcomes than expected.

The Learning Rules

Step 1: Compute the Prediction Error (δ)

\[\delta = r - V\]

Was the reward better (+δ) or worse (−δ) than expected?

Step 2: Update the Critic

\[V \leftarrow V + \alpha_v \cdot \delta\]

Move the state value toward the actual reward

Step 3: Update the Actor

\[m(a_{\text{chosen}}) \leftarrow m(a_{\text{chosen}}) + \alpha_m \cdot \delta\]

If δ > 0: strengthen the chosen action preference
If δ < 0: weaken the chosen action preference

Step 4: Policy via Softmax

\[P(a) = \frac{e^{m(a)/\tau}}{\sum_b e^{m(b)/\tau}}\]

Neural Implementation

Critic ≈ Ventral Striatum / OFC: Encodes expected value of the current state
Actor ≈ Dorsal Striatum: Encodes action preferences and habits
δ ≈ Dopamine (VTA/SNc): The prediction error signal that teaches both systems

Why Separate Actor and Critic?

The Critic's job is to evaluate. It answers "how good is this state?" but doesn't specify an action. The Actor's job is to decide. It uses the same δ signal to adjust which action led to the better-or-worse-than-expected outcome.

This separation means the model can generalize: the Critic can evaluate novel states, while the Actor can transfer learned preferences. It also explains why striatal lesions in different regions produce different behavioral deficits.

💡

Try it!

1. Run 100 trials — watch V converge to the weighted average reward, and m(A) pull ahead of m(B).
2. Set α_v = 0.5, α_m = 0.05 — the Critic learns fast but the Actor is sluggish.
3. Set α_v = 0.05, α_m = 0.5 — the Actor over-reacts to noisy prediction errors.
4. Make arms close (0.55 vs 0.45) — δ fluctuates more and convergence is slower.

The Actor-Critic Model
Separating Evaluation from Action

Critic — \(V\)

Actor — \(m(a)\)

Policy — \(P(a)\)

Prediction Error \(\delta\) Over Trials

State Value \(V\) Over Trials

Action Preferences \(m(a)\) Over Trials

How the Actor-Critic Model Works

The Two Components

The Learning Rules

Neural Implementation

Why Separate Actor and Critic?