202509281550
Status: #idea
Tags: #reinforcement_learning #ai
# Learning state-value functions
Value functions predict the "return-to-go" of a state in a reinforcement learning problem. That is, if I start from state $s$ and follow policy $\pi$, what will my expected cumulative discounted reward be?
Without a full model of the environment (i.e., an exact transition function and reward function), we cannot compute this value function exactly. So we need to generate an estimate of it.
The value function is typically written as $V(s)$, where $s$ is the current state.
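Written out with the return notation used below, this is
$v_{\pi}(s) = \mathbb{E}_{\pi}\left[G_{t:T} \mid S_t = s\right] = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T \mid S_t = s\right]$
where $\gamma \in [0, 1]$ is the discount factor and $T$ is the terminal time.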
## Monte Carlo Methods
The most straightforward way to learn a value function is called first-visit Monte Carlo. In Monte Carlo-based methods, we generate hundreds or thousands of trajectories (that is, sequences of states going from an initial state to a final state, where each sequence is generated by sampling actions from the policy $\pi$). Once we have a full trajectory, we can compute the cumulative discounted reward for each state $s_t$ in the trajectory by starting from $s_t$ and summing up all of the discounted rewards from $t+1$ to $T$ (where $T$ is the terminal time/state). Now, to obtain an estimate of the state-value function $v_{\pi}(s)$, we just look for the first occurrence of $s$ in each trajectory and average the cumulative discounted rewards of these first occurrences. This is why it is called first-visit Monte Carlo - we ignore all occurrences of $s$ in a trajectory after the first one.
Another approach is to use every-visit Monte Carlo. This is just as the name sounds - instead of only using the first occurrence of $s$ in each trajectory to estimate $v_{\pi}(s)$, we use all occurrences of $s$ to produce the average.
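As a rough sketch of both variants (in Python; `sample_trajectory` is a hypothetical helper that rolls out the policy and returns a list of `(state, reward)` pairs, where the pair at index $t$ holds $S_t$ and $R_{t+1}$):
```python
from collections import defaultdict

def mc_prediction(sample_trajectory, policy, gamma, num_episodes, first_visit=True):
    """Estimate v_pi(s) by averaging sampled returns per state."""
    returns_sum = defaultdict(float)   # sum of sampled returns per state
    returns_count = defaultdict(int)   # number of returns averaged per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        trajectory = sample_trajectory(policy)   # [(S_0, R_1), (S_1, R_2), ...]
        states = [s for s, _ in trajectory]

        G = 0.0
        # Walk the trajectory backwards so G accumulates the discounted
        # return-to-go for each visited state.
        for t in reversed(range(len(trajectory))):
            s, r = trajectory[t]
            G = r + gamma * G
            # First-visit: skip this occurrence if S_t appears earlier in the
            # trajectory. Every-visit: always include it in the average.
            if first_visit and s in states[:t]:
                continue
            returns_sum[s] += G
            returns_count[s] += 1
            V[s] = returns_sum[s] / returns_count[s]

    return V
```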
Monte Carlo methods have good convergence properties because they update the value function estimate $v_{\pi}(s)$ toward the actual return $G_{t:T}$, which is an unbiased estimate of the true state-value function.
## Temporal Difference Learning
Monte Carlo methods have one key drawback - no updates can happen to the state-value function until we reach the end of the trajectory (since we need to compute cumulative discounted rewards, we can only do this once we know all of the rewards received!).
In addition, although MC updates are very accurate (i.e. they approximate the true state-value function well), they are not precise (that is, the updates have high variance). The actual returns $G_{t:T}$ accumulate many random actions, states, and rewards from time $t$ to $T$, so there can be significant differences in the return for state $s$ from one trajectory to the next.
Given this high variance, MC can be fairly sample inefficient. The noise can only be alleviated with a lot of data, requiring lots of full trajectories and actual return samples.
To reduce the trajectory-to-trajectory variance, we can use an estimated return instead of the actual return $G_{t:T}$. Why is this lower variance? Because the estimated return stands in for the expected value of the random variable $G_{t:T}$, and expected values have $0$ variance (i.e. they are constants).
Furthermore, by using the expected return, we can actually update the state-value function after every single reward step! Once we have the reward $R_{t+1}$ for transitioning from $s_t$ to $s_{t+1}$ as well as the estimate $\mathbb{E}(G_{t+1:T})$, we have an estimate of the full trajectory return: $R_{t+1} + \gamma \mathbb{E}(G_{t+1:T})$, where $\gamma$ is the discount factor. But wait, we know that the state-value function is already estimating the full trajectory return starting from a given state $s$, so we can use $v_{\pi}(s_{t+1})$ in place of $\mathbb{E}(G_{t+1:T})$. This yields a one-step estimate of the full trajectory return: $R_{t+1} + \gamma v_{\pi}(s_{t+1})$
This is called the **temporal difference (TD) target**.
Given a learning rate $\alpha_t$, we can then write the update rule for TD learning as follows: $v_{t+1}(s_t) = v_t(s_t) + \alpha_t \left[R_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \right]$
That is, the updated value of the state-value function in state $s_t$ is given by the current estimate plus the difference between the new TD target and our current estimate (times a learning rate). So, if our TD target is higher than the existing estimate, we increase the value estimate for $s_t$ by a small amount. If the TD target is lower than the existing estimate, we decrease the value estimate for $s_t$ by a small amount.
Thus, TD methods estimate $v_{\pi}(s)$ using an estimate of $v_{\pi}(s)$ - that is, they bootstrap, making a guess from a guess.
The TD target is a biased estimate of the true state-value function, but it is much lower variance than the actual return $G_{t:T}$. This is because the TD target depends only on a single action, a single transition, and a single reward, so there's much less randomness accumulated. As a result, TD methods usually learn much faster than MC methods.
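A minimal sketch of tabular TD(0) prediction under assumed interfaces (a hypothetical Gym-style `env` with `reset()` and `step(action)` returning `(next_state, reward, done)`, and a `policy(state)` function that samples an action):
```python
from collections import defaultdict

def td0_prediction(env, policy, gamma, alpha, num_episodes):
    """Tabular TD(0): move V(S_t) toward the TD target after every step."""
    V = defaultdict(float)

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target: one observed reward plus the discounted bootstrap
            # estimate of the remaining return (zero if the episode ended).
            td_target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (td_target - V[state])
            state = next_state

    return V
```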
## n-step Temporal Difference Learning
Extending the reasoning from both Monte Carlo and TD learning, we can ask - what if, instead of doing 1-step updates or full-trajectory updates, we do something in between? It seems like, if we roll out for $n$ steps, where $n$ is less than $T$ (the full trajectory length), we can balance the variance seen in MC against the bias seen in TD learning.
Suppose we have an $n$ step trajectory given by $S_t, A_t, R_{t+1}, S_{t+1}, \ldots, R_{t+n}, S_{t+n}$. Then the mathematical update for $n$-step TD learning is given by:
$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$
$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha_t \left[G_{t:t+n} - V_{t+n-1}(S_t) \right]$
So we take $n$ actions, visit $n$ new states, and use the $n$ rewards received plus the discounted value estimate for $S_{t+n}$ to update the estimate for $S_t$.
Note that when $n=1$, we have standard TD learning. When $n$ reaches the end of the episode (i.e., $n \geq T - t$), we recover Monte Carlo.
An intermediate $n$ value typically performs better than either extreme.
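A sketch of $n$-step TD prediction under the same assumed `env`/`policy` interface; the main complication relative to 1-step TD is the bookkeeping needed so that the update for time $\tau = t - n + 1$ happens only once its $n$ rewards (or the end of the episode) are available:
```python
from collections import defaultdict

def n_step_td_prediction(env, policy, gamma, alpha, n, num_episodes):
    """Tabular n-step TD: update V(S_tau) using n rewards plus a bootstrap."""
    V = defaultdict(float)

    for _ in range(num_episodes):
        states = [env.reset()]
        rewards = [0.0]          # dummy entry so rewards[t+1] lines up with R_{t+1}
        T = float("inf")         # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                action = policy(states[t])
                next_state, reward, done = env.step(action)
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1      # the time step whose estimate is updated now
            if tau >= 0:
                # n-step return: rewards R_{tau+1} .. R_{min(tau+n, T)} ...
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                # ... plus a bootstrap if the episode hasn't ended by tau + n.
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1

    return V
```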
## TD($\lambda$)
Now that we have seen how $n$-step TD works, how do we choose a good value of $n$? Maybe instead of setting a specific $n$, we could do some weighted combination of all possible $n$-step targets? This is precisely the idea behind TD($\lambda$).
In **forward-view TD($\lambda$)**, we combine multiple $n$-step targets into a single update. In this version, the agent waits until the end of an episode before it can update the state-value function.
Mathematically, we have:
$G^{\lambda}_{t:T} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_{t:T}$
$V_T(S_t) = V_{T-1}(S_t) + \alpha_t \left [ G^{\lambda}_{t:T} - V_{T-1}(S_t) \right]$
That is, inside the summation we take partial returns $G_{t:t+n}$, each weighted by $\lambda^{n-1}$, with $n$ ranging from 1 to $T-t-1$. As a result, the exponent of $\lambda$ in the summation ranges from $0$ to $T-t-2$, while the subscript of $G$ ranges from $t:t+1$ to $t:T-1$. This gives us a weighted sum of partial returns where the shorter partial returns are weighted more highly:
$G_{t:t+1} + \lambda G_{t:t+2} + \lambda^2 G_{t:t+3} + \cdots + \lambda^{T-t-2} G_{t:T-1}$
This sum is scaled by $(1 - \lambda)$, and the full return $G_{t:T}$ receives the remaining weight $\lambda^{T-t-1}$. The resulting value $G^{\lambda}_{t:T}$ is called the $\lambda$ return. We then compute the difference between the $\lambda$ return and the current estimate for the value of state $S_t$ to produce the $\lambda$ error. The $\lambda$ error then determines the update to be made to the value function (as seen in the second equation).
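A sketch of the forward-view computation for one finished trajectory; `n_step_return` and `lambda_return` are illustrative helper names (assuming `rewards[k]` holds $R_{k+1}$ and `values[k]` holds the current estimate $V(S_k)$):
```python
def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n}: up to n discounted rewards plus a bootstrap from V(S_{t+n})."""
    T = len(rewards)                       # terminal time
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:                          # bootstrap only if we stop early
        G += gamma ** n * values[t + n]
    return G

def lambda_return(rewards, values, t, gamma, lam):
    """G^lambda_{t:T}: weighted mix of all n-step returns from time t."""
    T = len(rewards)
    G_lam = (1 - lam) * sum(
        lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
        for n in range(1, T - t)           # n = 1, ..., T - t - 1
    )
    # Full (Monte Carlo) return term, weighted by lam^(T - t - 1).
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lam
```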
With forward-view TD($\lambda$), we're back at the MC time restrictions - that is, we need to wait until the end of the trajectory to update the value function. In **backward-view TD($\lambda$)**, we split the corresponding updates into partial updates and apply those partial updates to the state-value function on every step.
The mechanism that provides backward-view TD($\lambda$), or TD($\lambda$) for short, with this advantage is known as eligibility traces. An eligibility trace is a memory vector that keeps track of recently visited states. The basic idea is to track the states that are eligible for an update on every step. We keep track, not only of whether a state is eligible or not, but also by how much, so that the corresponding update is applied correctly to eligible states.
The steps look like the following:
1. We set the eligibility vector to 0 at the start of a new episode (i.e. no state is eligible yet): $E_0 = 0$
2. We interact with the environment for one cycle: $S_t, A_t, R_{t+1}, S_{t+1} \sim \pi_{t:t+1}$
3. When we encounter a state $S_t$, make it eligible for an update. $E_t(S_t) = E_t(S_t) + 1$
4. Calculate the TD error $\delta^{TD}_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)$
5. Now, unlike before, we update the *entire* estimated state-value function $V$. That is, we don't update a single state value estimate, but *all* state-value estimates. The *entire* function is updated at once on every time step! $V_{t+1} = V_t + \alpha_t \delta^{TD}_{t:t+1}(S_t) E_t$
6. Finally, decay the eligibility: $E_{t+1} = E_t \gamma \lambda$
So, from the above, we see that on every time step $t$, we use the TD error of transitioning from state $S_t$ to state $S_{t+1}$ to update the value estimates of *all* eligible states in the trajectory. Older states have lower eligibility scores, decreasing the size of the update applied. Newer states have higher eligibility scores, increasing the size of the update.
Note that when $\lambda = 0$, TD($\lambda$) reduces to 1-step TD learning. This is because the eligibility vector gets zeroed out after every step. This ensures that updates are only applied to the current state $S_t$ using the one-step TD error.
When $\lambda = 1$, TD($\lambda$) reduces to Monte Carlo. This is because the eligibility of a state last visited $n$ steps ago has decayed to $\gamma^n$ (times its value at that visit), which is precisely the discount applied to a reward received $n$ steps after visiting a state in MC.
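A sketch of this backward-view loop with accumulating traces, following the six steps above (same assumed `env`/`policy` interface as the earlier sketches):
```python
from collections import defaultdict

def td_lambda_prediction(env, policy, gamma, lam, alpha, num_episodes):
    """Tabular backward-view TD(lambda) with accumulating eligibility traces."""
    V = defaultdict(float)

    for _ in range(num_episodes):
        E = defaultdict(float)    # eligibility traces, reset each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            E[state] += 1.0       # mark the current state as (more) eligible
            # One-step TD error for this transition.
            td_error = reward + gamma * V[next_state] * (not done) - V[state]
            # Update every eligible state in proportion to its trace, then decay.
            for s in list(E):
                V[s] += alpha * td_error * E[s]
                E[s] *= gamma * lam
            state = next_state

    return V
```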
---
# References
[[Grokking Deep Reinforcement Learning]]