202509281757
Status: #idea
Tags: #reinforcement_learning #ai

# Learning action-value function

Action-value functions predict the "return-to-go" of a (state, action) pair in a reinforcement learning problem. That is, if I take action $a$ in state $s$ and follow policy $\pi$ for all actions thereafter, what will my expected cumulative discounted reward be?

Without a full model of the environment (i.e., an exact transition function and reward function), we cannot compute this action-value function exactly, so we need to estimate it.

The action-value function is typically written as $Q(s, a)$, where $s$ is the current state and $a$ is the current action.

[[Monte Carlo Control]]
[[SARSA]]
[[Q-Learning]]
[[Double Q-Learning]]

---
# References

[[Grokking Deep Reinforcement Learning]]
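The "average the discounted returns" idea can be sketched with a minimal Monte Carlo estimate of $Q^\pi(s, a)$. Everything here is a hypothetical toy example, not from the source: a 3-state chain environment, a fixed "always advance" policy, and the function names `step`, `policy`, and `mc_q_estimate`.

```python
GAMMA = 0.9  # discount factor

def step(s, a):
    """Toy chain MDP: action 1 advances one state for reward 1,
    action 0 stays put for reward 0; state 2 is terminal."""
    if a == 1:
        s_next = s + 1
        return s_next, 1.0, s_next == 2
    return s, 0.0, False

def policy(s):
    """Fixed policy pi: always advance."""
    return 1

def mc_q_estimate(s, a, episodes=1000, horizon=50):
    """Monte Carlo estimate of Q^pi(s, a): take action a in state s,
    follow pi thereafter, and average the discounted returns."""
    total = 0.0
    for _ in range(episodes):
        g, discount = 0.0, 1.0
        state, action = s, a
        for _ in range(horizon):
            state, reward, done = step(state, action)
            g += discount * reward
            discount *= GAMMA
            if done:
                break
            action = policy(state)
        total += g
    return total / episodes

# In this deterministic toy, Q(0, advance) = 1 + 0.9 * 1 = 1.9.
```

With a stochastic environment the same loop works unchanged; the average over episodes then converges to the expectation rather than hitting it exactly.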