Explore or Exploit?
The Softmax Bandit

Two slot machines, unknown payoffs. How does temperature control the balance between exploring new options and exploiting the best known one?

[Interactive simulation. Parameters: temperature (low = exploit, high = explore) and the true reward probabilities of the two arms (hidden). Readouts: trial count, estimated value \(Q(a)\) and softmax probability \(P(a)\) for Arm A and Arm B, and the number of times each arm was chosen and won. Charts: estimated values \(Q(a)\) over trials, choice proportion (rolling 20-trial window), and cumulative reward vs. optimal.]
🧠
The Explore/Exploit Insight

Run trials to see how softmax balances exploration and exploitation. Watch the Q-values converge and the choice proportion shift.

The Explore/Exploit Dilemma

Imagine you're at a casino with two slot machines. One might pay out more often than the other, but you don't know which. Every pull you spend on the worse machine is lost reward. But how do you know which is better without trying both?

This is the explore/exploit dilemma: you must balance exploiting the arm you believe is best with exploring the other to improve your estimates.

The Softmax (Boltzmann) Policy

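Rather than choosing uniformly at random (pure exploration) or always picking the highest estimate (pure exploitation), softmax converts the estimated values into choice probabilities, with a temperature parameter \(\tau\) controlling how strongly higher values are favored: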

\[P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b} e^{Q(b)/\tau}}\]
Probability of choosing arm \(a\) given current value estimates and temperature \(\tau\)
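
A minimal Python sketch of this formula may make it more concrete. The function name `softmax_probs` and the example values 0.6 and 0.4 are illustrative assumptions, not taken from the simulation. Subtracting the largest scaled value before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import math


def softmax_probs(q_values, tau):
    """Return softmax choice probabilities for a list of value estimates.

    q_values: estimated values Q(a), one per arm.
    tau: temperature; must be positive.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [q / tau for q in q_values]
    # Shift by the maximum so the largest exponent is exp(0) = 1;
    # the shift cancels in the ratio, so the result is the same.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical estimates for Arm A and Arm B.
    for tau in (0.01, 0.3, 5.0):
        probs = softmax_probs([0.6, 0.4], tau)
        print(tau, [round(p, 3) for p in probs])
```

Equal estimates always give equal probabilities, whatever the temperature, which is why the simulation starts at 50% per arm.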

What Temperature Controls
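
As a short worked sketch for the two-arm case: the softmax formula above reduces to a logistic function of the value difference, which makes the role of temperature easier to see. The symbol \(\Delta\) is notation introduced here for illustration.

\[P(A) = \frac{e^{Q(A)/\tau}}{e^{Q(A)/\tau} + e^{Q(B)/\tau}} = \frac{1}{1 + e^{-\Delta/\tau}}, \qquad \Delta = Q(A) - Q(B)\]

As \(\tau \to 0\), \(P(A) \to 1\) whenever \(\Delta > 0\), so the policy becomes greedy. As \(\tau \to \infty\), \(P(A) \to 1/2\) regardless of \(\Delta\), so the policy becomes uniformly random. For example, with \(\Delta = 0.2\), \(\tau = 0.3\) gives \(P(A) \approx 0.66\), while \(\tau = 5\) gives \(P(A) \approx 0.51\).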

Value Update Rule

After each trial, the estimated value of the chosen arm is updated, with the learning rate \(\alpha\) setting the step size:

\[Q(a) \leftarrow Q(a) + \alpha \cdot \left[ r - Q(a) \right]\]
Move the estimate toward the received reward \(r\) (1 for a win, 0 for a loss)
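
The update rule can be sketched in a few lines of Python. The function name `update_value` and the default learning rate of 0.1 are assumptions chosen for illustration; the value used by the simulation is not stated.

```python
def update_value(q, reward, alpha=0.1):
    """Move estimate q a fraction alpha of the way toward the reward.

    alpha=0.1 is an assumed learning rate, not a value from the source.
    """
    return q + alpha * (reward - q)


if __name__ == "__main__":
    q = 0.0
    # A hypothetical sequence of outcomes: win, win, loss, win.
    for r in [1, 1, 0, 1]:
        q = update_value(q, r)
        print(round(q, 4))
```

With these assumed settings the estimate moves through 0.1, 0.19, 0.171, and 0.2539: each win pulls it up by a tenth of the remaining gap to 1, and each loss shrinks it by a tenth.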

Greedy vs. ε-Greedy vs. Softmax
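
The three policies named in this heading can be contrasted with a rough Python sketch. The function names, the exploration rate `epsilon=0.1`, and the temperature `tau=0.3` are illustrative assumptions. The point of the comparison is that ε-greedy explores uniformly at random regardless of the estimates, whereas softmax explores in proportion to how good each arm currently looks.

```python
import math
import random


def greedy(q_values, rng):
    """Always pick the arm with the highest estimate (ties broken randomly)."""
    best = max(q_values)
    return rng.choice([i for i, q in enumerate(q_values) if q == best])


def epsilon_greedy(q_values, rng, epsilon=0.1):
    """With probability epsilon pick a uniformly random arm, else be greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values, rng)


def softmax_choice(q_values, rng, tau=0.3):
    """Sample an arm with probability proportional to exp(Q(a) / tau)."""
    m = max(q_values)  # shift for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.6, 0.4]  # hypothetical estimates for Arm A and Arm B
    n = 10000
    for name, policy in [("greedy", greedy),
                         ("epsilon-greedy", epsilon_greedy),
                         ("softmax", softmax_choice)]:
        picks_a = sum(policy(q, rng) == 0 for _ in range(n))
        print(name, picks_a / n)
```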

💡
Try it!

1. Run 100 trials with τ = 0.3: the agent soon favors the better arm while still sampling the other.
2. Set τ = 5: choices stay near 50/50. The Q-values still converge, but the agent barely exploits them.
3. Set τ = 0.01: nearly greedy. If early trials are unlucky, it can lock onto the wrong arm.
4. Make the arms close (0.55 vs 0.45): the better arm becomes much harder to identify.
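
For readers without access to the interactive version, the experiments above can be approximated with a short Python sketch that combines the softmax policy and the value update rule. It is not the simulation's actual code: the function name `run_bandit`, the learning rate `alpha=0.1`, the zero initial estimates, and the reward probabilities 0.7 and 0.3 used for the first three settings are all assumptions.

```python
import math
import random


def run_bandit(true_probs, tau, alpha=0.1, trials=100, seed=0):
    """Simulate a softmax agent on a Bernoulli bandit.

    Returns (final Q estimates, times each arm was chosen, total reward).
    alpha and the zero initial estimates are assumed, not from the source.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    q = [0.0] * n_arms
    counts = [0] * n_arms
    total_reward = 0
    for _ in range(trials):
        # Softmax choice; subtracting the max keeps exp() well-behaved.
        m = max(q)
        weights = [math.exp((v - m) / tau) for v in q]
        arm = rng.choices(range(n_arms), weights=weights, k=1)[0]
        # Bernoulli reward: 1 for a win, 0 for a loss.
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Move the chosen arm's estimate toward the reward.
        q[arm] += alpha * (reward - q[arm])
        counts[arm] += 1
        total_reward += reward
    return q, counts, total_reward


if __name__ == "__main__":
    # Hypothetical arms for settings 1 to 3, then the close arms of setting 4.
    for probs, tau in [((0.7, 0.3), 0.3), ((0.7, 0.3), 5.0),
                       ((0.7, 0.3), 0.01), ((0.55, 0.45), 0.3)]:
        q, counts, reward = run_bandit(probs, tau)
        print(probs, tau, [round(v, 2) for v in q], counts, reward)
```

Results vary with the seed, so it is worth rerunning each setting with several seeds before drawing conclusions, especially at very low temperatures where the outcome depends heavily on the first few trials.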