The Coffee Walk:
Learning State Values

Watch an agent learn that being close to coffee is valuable. Values propagate backward from the goal, teaching each state how "good" it is to be there.

Parameters

- Learning rate: how quickly values are updated
- Discount factor: how much future rewards decay with distance
- Coffee reward: how rewarding getting coffee is
- Shop closed probability: chance the shop is closed (no reward)


The Journey

The agent repeats the same four-state route every morning:

- 🏠 Home (State 0): value 0.00
- 🌳 Park (State 1): value 0.00
- 🚗 Street (State 2): value 0.00
- ☕ Coffee Shop (Goal): reward +10

Learned State Values \(V(s)\)

Learned values for 🏠 Home, 🌳 Park, 🚗 Street, and ☕ Goal are plotted alongside the theoretical values (after convergence).

Last Walk Summary

Shows the steps taken, reward received, and average TD error for the most recent walk ("Ready to walk!" until the first walk).

Understanding State Values

Every morning, you walk from Home → Park → Street → Coffee Shop. At the end, you get your delicious coffee reward. Over time, your brain learns that even being at the Park is valuable: it means coffee is coming soon!

The Bellman Equation

The value of a state is the expected total reward you'll receive from that state onward. For a deterministic path like ours, the theoretical value of each state follows:

\[V(s) = R(s) + \gamma \cdot V(s')\]
Value of state = Immediate reward + Discounted value of next state

Working backwards from the Coffee Shop:
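With the suggested \(\gamma = 0.9\), the +10 reward at the Coffee Shop, and zero reward at every other state, the recursion gives:

\[
\begin{aligned}
V(\text{Shop}) &= 10 \\
V(\text{Street}) &= 0 + 0.9 \times 10 = 9 \\
V(\text{Park}) &= 0 + 0.9 \times 9 = 8.1 \\
V(\text{Home}) &= 0 + 0.9 \times 8.1 = 7.29
\end{aligned}
\]

Each step back from the goal multiplies the value by \(\gamma\), which is exactly the gradient the demo learns.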

The Discount Factor \(\gamma\)

The discount factor controls how much you value future rewards compared to immediate ones: with \(\gamma\) near 1, distant rewards count almost as much as immediate ones, while small \(\gamma\) makes the agent myopic, so states far from the goal lose most of their value.
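To make the gradient concrete, the backward recursion can be computed directly. This is a sketch, not the demo's actual code; `chain_values` is a made-up helper name, and it assumes the only reward is +10 at the shop:

```python
# Theoretical values on the 4-state chain for several discount factors,
# assuming the only reward is +10 on reaching the Coffee Shop.
def chain_values(gamma, reward=10.0, n_states=4):
    values = [0.0] * n_states
    values[-1] = reward                      # V(Shop) = 10
    for s in range((n_states - 2), -1, -1):  # work backward toward Home
        values[s] = gamma * values[s + 1]
    return values

for gamma in (0.5, 0.9, 1.0):
    print(gamma, [round(v, 2) for v in chain_values(gamma)])
```

With \(\gamma = 1.0\) every state is worth the full 10; with \(\gamma = 0.5\), Home is worth only 1.25. That flattening versus steepening is the value gradient described above.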

β˜•
Try it!

Start with \(\gamma = 0.9\) and run 20 walks. Watch values propagate backward from the Coffee Shop. Then try \(\gamma = 0.5\) vs \(\gamma = 1.0\) to see how discounting affects the value gradient!
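The learning process itself can be sketched as a TD(0) update applied once per state on each walk. This is a minimal, self-contained approximation of what the demo animates, not its actual source; the names `run_walks`, `alpha`, and `p_closed` are assumptions:

```python
import random

# TD(0) sketch of the coffee walk: on each walk, every state's value moves
# toward (its reward + discounted value of the next state) by step size alpha.
def run_walks(n_walks, alpha=0.1, gamma=0.9, reward=10.0, p_closed=0.0, seed=0):
    rng = random.Random(seed)
    V = [0.0, 0.0, 0.0, 0.0]                  # Home, Park, Street, Shop
    for _ in range(n_walks):
        shop_open = rng.random() >= p_closed
        R = [0.0, 0.0, 0.0, reward if shop_open else 0.0]
        for s in range(3):                    # non-terminal states
            td_error = R[s] + gamma * V[s + 1] - V[s]
            V[s] += alpha * td_error
        V[3] += alpha * (R[3] - V[3])         # terminal state: no successor
    return V
```

With `p_closed=0`, the values climb toward the theoretical 7.29, 8.1, 9.0, and 10.0; after only 20 walks they are still partway there, which is the backward propagation the demo visualizes.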

What If the Shop Is Closed?

Set the "Shop Closed Probability" above 0 to introduce uncertainty. Sometimes you'll walk all the way there and get nothing! This creates negative prediction errors (disappointment) and lowers the learned values.
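The effect on the converged values is easy to compute: if the shop is closed with probability \(p\), the expected reward drops to \((1 - p) \times 10\), and every state's value scales by the same factor. A small sketch (the helper name `expected_values` is my own):

```python
# Expected converged values when the shop is closed with probability p_closed:
# the shop's expected payout is (1 - p) * reward, and each earlier state
# is one more discount step away.
def expected_values(gamma=0.9, reward=10.0, p_closed=0.3, n_states=4):
    expected_reward = (1 - p_closed) * reward  # shop pays out with prob 1 - p
    V = [0.0] * n_states
    V[-1] = expected_reward
    for s in range((n_states - 2), -1, -1):
        V[s] = gamma * V[s + 1]
    return V

print(expected_values())  # every value is 70% of the always-open case
```

So a 30% closure rate does not change the shape of the value gradient, only its scale; the individual disappointments average out into uniformly lower expectations.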