Two slot machines, unknown payoffs. How does temperature control the balance between exploring new options and exploiting the best known one?
Low temperature = exploit, high temperature = explore
Run trials to see how softmax balances exploration and exploitation. Watch the Q-values converge and the choice proportion shift.
Imagine you're at a casino with two slot machines. One might pay out more often than the other, but you don't know which. Every pull you spend on the worse machine is lost reward. But how do you know which is better without trying both?
This is the explore/exploit dilemma: you must balance exploiting the arm you believe is best with exploring the other to improve your estimates.
Rather than choosing randomly (pure exploration) or greedily (pure exploitation), softmax converts estimated values into choice probabilities through a temperature parameter τ:

P(a) = exp(Q(a) / τ) / Σ_b exp(Q(b) / τ)

With low τ the probability concentrates on the highest-valued arm; with high τ the choices approach uniform.
After each trial, the estimated value of the chosen arm is nudged toward the observed reward:

Q(a) ← Q(a) + α (r − Q(a))

where r is the reward received and α is the learning rate.
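The whole loop fits in a few lines of Python. This is a minimal sketch, not the demo's actual implementation: the function names, the learning rate α = 0.1, the Bernoulli payoffs, and the seed are all illustrative assumptions.

```python
import math
import random

def softmax_probs(q, tau):
    """Convert Q-value estimates into choice probabilities at temperature tau."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(q)
    exps = [math.exp((qi - m) / tau) for qi in q]
    total = sum(exps)
    return [e / total for e in exps]

def run_bandit(payoffs, tau, alpha=0.1, trials=100, seed=0):
    """Simulate a two-armed Bernoulli bandit with softmax action selection.

    payoffs: true reward probabilities per arm (unknown to the agent).
    Returns the final Q-value estimates and the fraction of pulls per arm.
    """
    rng = random.Random(seed)
    q = [0.0] * len(payoffs)     # estimated value of each arm
    pulls = [0] * len(payoffs)
    for _ in range(trials):
        probs = softmax_probs(q, tau)
        arm = rng.choices(range(len(q)), weights=probs)[0]
        reward = 1.0 if rng.random() < payoffs[arm] else 0.0
        # Incremental update: Q(a) <- Q(a) + alpha * (r - Q(a))
        q[arm] += alpha * (reward - q[arm])
        pulls[arm] += 1
    return q, [p / trials for p in pulls]
```

At a moderate temperature such as τ = 0.3, the pull fractions returned by `run_bandit` shift toward whichever arm has the higher true payoff as its Q-value estimate pulls ahead.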
1. Run 100 trials with τ = 0.3 — the agent converges on the better arm quickly.
2. Set τ = 5 — choices stay near 50/50; the Q-values still converge, but the agent barely exploits what it has learned.
3. Set τ = 0.01 — pure greedy; if early trials are unlucky, it locks onto the wrong arm.
4. Make the arms close (0.55 vs 0.45) — much harder to tell apart.