Watch an agent learn that being close to coffee is valuable. Values propagate backward from the goal, teaching each state how "good" it is to be there.
How quickly values are updated
How much future rewards decay with distance
How rewarding getting coffee is
Chance the shop is closed (no reward)
Every morning, you walk from Home → Park → Street → Coffee Shop. At the end, you get your delicious coffee reward. Over time, your brain learns that even being at the Park is valuable, because it means coffee is coming soon!
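The walk can be sketched as a tiny deterministic chain. This snippet computes the theoretical value of each state by working backwards from the coffee reward; the state names, reward size, and the convention that the reward arrives at the Coffee Shop itself are illustrative assumptions:

```python
# A minimal sketch of the morning walk as a deterministic chain.
GAMMA = 0.9    # discount factor
REWARD = 1.0   # reward for coffee at the final state (assumed value)

path = ["Home", "Park", "Street", "Coffee Shop"]

# Working backwards: each state's theoretical value is gamma^d * R,
# where d is the number of steps remaining until coffee.
values = {}
for d, state in enumerate(reversed(path)):
    values[state] = GAMMA ** d * REWARD

for state in path:
    print(f"{state}: {values[state]:.3f}")
# Home: 0.729, Park: 0.810, Street: 0.900, Coffee Shop: 1.000
```

Notice the value gradient: each step closer to coffee multiplies the value by 1/γ.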
The value of a state is the expected total discounted reward you'll receive from that state onward. For a deterministic path like ours, the theoretical value of each state follows:
Working backwards from the Coffee Shop:
The discount factor controls how much you value future rewards compared to immediate ones:
Start with \(\gamma = 0.9\) and run 20 walks. Watch values propagate backward from the Coffee Shop. Then try \(\gamma = 0.5\) vs \(\gamma = 1.0\) to see how discounting affects the value gradient!
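What the demo animates can be sketched as a plain TD(0) loop; the learning rate, discount factor, and reward size below are assumed settings, not the demo's exact internals:

```python
# Sketch of TD(0) value learning over repeated walks to the coffee shop.
ALPHA, GAMMA, REWARD = 0.2, 0.9, 1.0   # assumed settings
path = ["Home", "Park", "Street", "Coffee Shop"]
V = {s: 0.0 for s in path}             # learned values start at zero

for walk in range(20):
    for i, state in enumerate(path):
        if state == "Coffee Shop":
            # Terminal step: the reward arrives, no successor value.
            delta = REWARD - V[state]
        else:
            # Prediction error: zero reward mid-walk plus discounted next value.
            delta = 0.0 + GAMMA * V[path[i + 1]] - V[state]
        V[state] += ALPHA * delta      # nudge the value toward the target

print(V)  # values climb toward the theoretical 0.729, 0.81, 0.9, 1.0
```

Because each state's update pulls from its successor, value information propagates backward from the Coffee Shop one link per walk, which is exactly the ripple you see in the animation.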
Set the "Shop Closed Probability" above 0 to introduce uncertainty. Sometimes you'll walk all the way there and get nothing! This creates negative prediction errors (disappointment) and lowers the learned values.
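The same loop with a closed-shop chance shows the disappointment effect; the probability, learning rate, and seed here are illustrative assumptions:

```python
import random

# TD(0) on the walk, but sometimes the shop is closed and pays no reward.
ALPHA, GAMMA, REWARD, P_CLOSED = 0.1, 0.9, 1.0, 0.3   # assumed settings
path = ["Home", "Park", "Street", "Coffee Shop"]
V = {s: 0.0 for s in path}
rng = random.Random(0)   # fixed seed for reproducibility

for walk in range(500):
    closed = rng.random() < P_CLOSED   # no coffee today?
    for i, state in enumerate(path):
        if state == "Coffee Shop":
            r = 0.0 if closed else REWARD
            delta = r - V[state]       # negative when the shop is closed
        else:
            delta = GAMMA * V[path[i + 1]] - V[state]
        V[state] += ALPHA * delta

# Learned values settle near the expected values, (1 - P_CLOSED) * gamma^d,
# so every state is worth less than in the deterministic case.
print(V)
```

The occasional negative prediction error at the Coffee Shop drags its value down toward the expected reward of about 0.7, and that discount ripples back through Street, Park, and Home.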