Watch an agent learn that being close to coffee is valuable. Values propagate backward from the goal, teaching each state how "good" it is to be there.
How quickly values are updated
How much future rewards decay with distance
How rewarding getting coffee is
Chance the shop is closed (no reward)
Every morning, you walk from Home → Park → Street → Coffee Shop. At the end, you get your delicious coffee reward. Over time, your brain learns that even being at the Park is valuable, because it means coffee is coming soon!
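The walk can be sketched as a tiny deterministic chain. This snippet computes the theoretical value of each state by working backwards from the coffee reward; the state names, reward size, and the convention that the reward arrives at the Coffee Shop itself are illustrative assumptions:

```python
# A minimal sketch of the morning walk as a deterministic chain.
GAMMA = 0.9    # discount factor
REWARD = 1.0   # reward for coffee at the final state (assumed value)

path = ["Home", "Park", "Street", "Coffee Shop"]

# Working backwards: each state's theoretical value is gamma^d * R,
# where d is the number of steps remaining until coffee.
values = {}
for d, state in enumerate(reversed(path)):
    values[state] = GAMMA ** d * REWARD

for state in path:
    print(f"{state}: {values[state]:.3f}")
# Home: 0.729, Park: 0.810, Street: 0.900, Coffee Shop: 1.000
```

Notice the value gradient: each step closer to coffee multiplies the value by 1/γ.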
The value of a state is the expected total discounted reward you'll receive from that state onward. For a deterministic path like ours, the theoretical value of each state follows:
Working backwards from the Coffee Shop:
The discount factor controls how much you value future rewards compared to immediate ones:
Start with \(\gamma = 0.9\) and run 20 walks. Watch values propagate backward from the Coffee Shop. Then try \(\gamma = 0.5\) vs \(\gamma = 1.0\) to see how discounting affects the value gradient!
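What the demo animates can be sketched as a plain TD(0) loop; the learning rate, discount factor, and reward size below are assumed settings, not the demo's exact internals:

```python
# Sketch of TD(0) value learning over repeated walks to the coffee shop.
ALPHA, GAMMA, REWARD = 0.2, 0.9, 1.0   # assumed settings
path = ["Home", "Park", "Street", "Coffee Shop"]
V = {s: 0.0 for s in path}             # learned values start at zero

for walk in range(20):
    for i, state in enumerate(path):
        if state == "Coffee Shop":
            # Terminal step: the reward arrives, no successor value.
            delta = REWARD - V[state]
        else:
            # Prediction error: zero reward mid-walk plus discounted next value.
            delta = 0.0 + GAMMA * V[path[i + 1]] - V[state]
        V[state] += ALPHA * delta      # nudge the value toward the target

print(V)  # values climb toward the theoretical 0.729, 0.81, 0.9, 1.0
```

Because each state's update pulls from its successor, value information propagates backward from the Coffee Shop one link per walk, which is exactly the ripple you see in the animation.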
Set the "Shop Closed Probability" above 0 to introduce uncertainty. Sometimes you'll walk all the way there and get nothing! This creates negative prediction errors (disappointment) and lowers the learned values.
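The same loop with a closed-shop chance shows the disappointment effect; the probability, learning rate, and seed here are illustrative assumptions:

```python
import random

# TD(0) on the walk, but sometimes the shop is closed and pays no reward.
ALPHA, GAMMA, REWARD, P_CLOSED = 0.1, 0.9, 1.0, 0.3   # assumed settings
path = ["Home", "Park", "Street", "Coffee Shop"]
V = {s: 0.0 for s in path}
rng = random.Random(0)   # fixed seed for reproducibility

for walk in range(500):
    closed = rng.random() < P_CLOSED   # no coffee today?
    for i, state in enumerate(path):
        if state == "Coffee Shop":
            r = 0.0 if closed else REWARD
            delta = r - V[state]       # negative when the shop is closed
        else:
            delta = GAMMA * V[path[i + 1]] - V[state]
        V[state] += ALPHA * delta

# Learned values settle near the expected values, (1 - P_CLOSED) * gamma^d,
# so every state is worth less than in the deterministic case.
print(V)
```

The occasional negative prediction error at the Coffee Shop drags its value down toward the expected reward of about 0.7, and that discount ripples back through Street, Park, and Home.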