Two slot machines, unknown payoffs. How does temperature control the balance between exploring new options and exploiting the best known one?
Low temperature = exploit, high temperature = explore
Run trials to see how softmax balances exploration and exploitation. Watch the Q-values converge and the choice proportion shift.
Imagine you're at a casino with two slot machines. One might pay out more often than the other, but you don't know which. Every pull you spend on the worse machine is lost reward. But how do you know which is better without trying both?
This is the explore/exploit dilemma: you must balance exploiting the arm you believe is best with exploring the other to improve your estimates.
Rather than choosing randomly (pure exploration) or greedily (pure exploitation), softmax converts estimated values into choice probabilities through a temperature parameter τ:

P(a) = exp(Q(a) / τ) / Σ_b exp(Q(b) / τ)

With low τ the probability concentrates on the highest-valued arm; with high τ the choices approach uniform.
After each trial, the estimated value of the chosen arm is nudged toward the observed reward:

Q(a) ← Q(a) + α (r − Q(a))

where r is the reward received and α is the learning rate.
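The whole loop fits in a few lines of Python. This is a minimal sketch, not the demo's actual implementation: the function names, the learning rate α = 0.1, the Bernoulli payoffs, and the seed are all illustrative assumptions.

```python
import math
import random

def softmax_probs(q, tau):
    """Convert Q-value estimates into choice probabilities at temperature tau."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(q)
    exps = [math.exp((qi - m) / tau) for qi in q]
    total = sum(exps)
    return [e / total for e in exps]

def run_bandit(payoffs, tau, alpha=0.1, trials=100, seed=0):
    """Simulate a two-armed Bernoulli bandit with softmax action selection.

    payoffs: true reward probabilities per arm (unknown to the agent).
    Returns the final Q-value estimates and the fraction of pulls per arm.
    """
    rng = random.Random(seed)
    q = [0.0] * len(payoffs)     # estimated value of each arm
    pulls = [0] * len(payoffs)
    for _ in range(trials):
        probs = softmax_probs(q, tau)
        arm = rng.choices(range(len(q)), weights=probs)[0]
        reward = 1.0 if rng.random() < payoffs[arm] else 0.0
        # Incremental update: Q(a) <- Q(a) + alpha * (r - Q(a))
        q[arm] += alpha * (reward - q[arm])
        pulls[arm] += 1
    return q, [p / trials for p in pulls]
```

At a moderate temperature such as τ = 0.3, the pull fractions returned by `run_bandit` shift toward whichever arm has the higher true payoff as its Q-value estimate pulls ahead.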
1. Run 100 trials with τ = 0.3 — the agent converges on the better arm quickly.
2. Set τ = 5 — choices stay near 50/50; the Q-values still converge, but the agent barely exploits what it has learned.
3. Set τ = 0.01 — pure greedy; if early trials are unlucky, it locks onto the wrong arm.
4. Make the arms close (0.55 vs 0.45) — much harder to tell apart.