202509282212
Status: #idea
Tags: #reinforcement_learning #ai

# SARSA

[[Monte Carlo Control]] is limited because it is offline in an episode-to-episode sense: we must wait for an episode to end before we can update the value function. SARSA improves on this by replacing Monte Carlo methods with temporal-difference (TD) prediction for the policy evaluation step. Concretely, fitting SARSA into the [[Generalized policy iteration (GPI)]] framework, we have:

1. Policy Improvement - done using $\epsilon$-greedy strategies
2. Policy Evaluation - done using one-step TD learning

---
# References

[[Grokking Deep Reinforcement Learning]]
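The two GPI steps can be sketched as tabular SARSA: after each single step, update $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$, bootstrapping from the next state-action pair instead of waiting for the episode return. A minimal sketch below, assuming a toy chain environment; the `env_step` interface and `chain_step` example are illustrative, not from the source:

```python
import random
from collections import defaultdict

def sarsa(env_step, n_actions, episodes=500, alpha=0.1, gamma=0.99,
          epsilon=0.1, seed=0):
    """One-step SARSA: epsilon-greedy improvement + TD(0) evaluation."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)] -> estimated value

    def policy(s):
        # Policy improvement: epsilon-greedy over current Q estimates
        if rng.random() < epsilon:
            return rng.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, a, done = 0, policy(0), False
        while not done:
            s2, r, done = env_step(s, a)
            a2 = policy(s2)
            # Policy evaluation: one-step TD update, no need to wait
            # for the episode to finish (unlike Monte Carlo control)
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

def chain_step(s, a):
    # Hypothetical 4-state chain: action 0 moves right, action 1 stays.
    # Reward 1.0 on reaching the terminal state 3, else 0.
    s2 = s + 1 if a == 0 else s
    done = s2 == 3
    return s2, (1.0 if done else 0.0), done

Q = sarsa(chain_step, n_actions=2)
```

Note the defining detail: the target uses the action $A_{t+1}$ actually selected by the current ($\epsilon$-greedy) policy, which is what makes SARSA on-policy.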