202509282249
Status: #idea
Tags: #reinforcement_learning #ai
# Q-Learning
Q-Learning modifies [[SARSA]] to become an off-policy method. That is, during the policy evaluation step of the [[Generalized policy iteration (GPI)]] loop, Q-Learning uses one policy to generate data and a different policy to evaluate and improve, whereas SARSA uses the same policy for both data generation and evaluation.
Let's compare the update methods of Q-Learning and SARSA to fully understand the difference. For SARSA:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha_t \left [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right ]$
As you can see, the SARSA update rule has the same form as one-step TD learning (discussed in [[Learning state-value functions]]), just applied to action values instead of state values.
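A minimal sketch of the SARSA update in Python/NumPy, assuming a tabular action-value array `Q`, a learning rate `alpha`, and a discount factor `gamma` (the function and variable names here are illustrative, not from the book; terminal-state handling is omitted):
```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA update: the target uses the action actually taken next (a_next)."""
    td_target = r + gamma * Q[s_next, a_next]
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q
```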
Now we examine the Q-Learning update rule:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha_t \left [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right ]$
So, the core difference is that the Q-Learning target uses the maximum-valued action in the next state (regardless of the action the agent actually selects), whereas the SARSA target uses the action the agent actually selects.
You can think of Q-Learning as always using the greedy policy with respect to the current action-value function to form its target. Rather than waiting for a separate policy improvement step, it immediately makes the target greedy with respect to the value function. That is, we generate data with the current policy $\pi$ but evaluate and improve the greedy policy with respect to $Q$.
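For contrast, a sketch of the Q-Learning update under the same illustrative assumptions as the SARSA snippet above; the only change is that the target maxes over the next state's actions instead of using the action the behavior policy actually takes:
```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-Learning update: the target is greedy with respect to the current Q,
    regardless of which action the behavior policy selects next."""
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q
```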
---
# References
[[Grokking Deep Reinforcement Learning]]