202509281757
Status: #idea
Tags: #reinforcement_learning #ai
# Learning action-value function
Action-value functions predict the "return-to-go" of a (state, action) pair in a reinforcement learning problem. That is, if I have taken action $a$ in state $s$ and follow policy $\pi$ for all actions thereafter, what will my expected cumulative discounted reward be?
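Written out under the standard definition (using $\gamma$ for the discount factor and $R_{t+k+1}$ for the reward at each subsequent step):

$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$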
Without a full model of the environment (i.e., an exact transition function and reward function), we cannot compute this action-value function exactly. So we need to estimate it from sampled experience.
The action-value function is typically written as $Q(s,a)$, where $s$ is the current state and $a$ is the current action.
[[Monte Carlo Control]]
[[SARSA]]
[[Q-Learning]]
[[Double Q-Learning]]
---
# References
[[Grokking Deep Reinforcement Learning]]