202509282145
Status: #idea
Tags: #reinforcement_learning #ai
# The value function is limited without knowledge of the MDP
When we learn a value function $V_{\pi}(S)$ for a given policy $\pi$, we are learning a mapping from a state $S_t$ to the expected cumulative discounted return $V_{\pi}(S_t)$ obtained by following $\pi$ from that state.
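As a reminder, the standard definition of this expectation (with discount factor $\gamma$) is:
$$
V_{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \right]
$$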
Now, suppose that we do *not* have access to the full Markov Decision Process (MDP) that models the environment. There are three scenarios to consider: we are missing the transition function, the reward function, or both. We'll look at the first two in turn (missing both simply combines their problems).
Suppose we don't have access to the transition function. Let's also suppose we start in $S_t$ and policy $\pi$ tells us to take action $A_t$. Since we don't have the transition function, we don't know which states we might end up in, or with what probability. As a result, we cannot compute the expected value of the next state, $\sum_{s'} P(s' \mid S_t, A_t)\, V_{\pi}(s')$, which is exactly what a one-step lookahead with $V_{\pi}$ requires.
Now suppose we don't have access to the reward function. When we take action $A_t$, we end up in state $S_{t+1}$, but we don't know the associated reward $R_{t+1}$, and therefore cannot form the target $R_{t+1} + \gamma V_{\pi}(S_{t+1})$ needed to evaluate the action or update the value function.
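Putting the two problems together, here is a minimal sketch of what acting greedily with only $V_{\pi}$ would require. The names (`greedy_action_from_v`) and the representation of the model (`P[s][a][s']` and `R[s][a][s']` as nested dicts of transition probabilities and rewards) are illustrative assumptions; the point is that both pieces of the MDP appear in the computation, and these are precisely the pieces we said we don't have.

```python
def greedy_action_from_v(V, P, R, state, actions, gamma=0.99):
    """One-step lookahead with a state-value function.

    Picking the best action requires BOTH the transition probabilities P
    and the reward function R -- i.e. full knowledge of the MDP.
    """
    best_action, best_value = None, float("-inf")
    for a in actions:
        # Expected value of taking action a in `state`:
        # sum over s' of P(s'|state,a) * (R(state,a,s') + gamma * V[s'])
        q_sa = sum(
            P[state][a][s_next] * (R[state][a][s_next] + gamma * V[s_next])
            for s_next in P[state][a]
        )
        if q_sa > best_value:
            best_action, best_value = a, q_sa
    return best_action
```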
As a result, it is better to estimate the action-value function $Q(S, A)$ when we don't have access to the MDP: $Q_{\pi}(S_t, A_t)$ already folds the transition and reward dynamics into its estimate, so a greedy action can be chosen by simply taking $\arg\max_a Q(S_t, a)$, with no model required.
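By contrast, a sketch of acting greedily from $Q$ (here `Q[s][a]` is assumed to be a nested dict of estimated action values) consults no model at all:

```python
def greedy_action_from_q(Q, state):
    """With an action-value function, acting greedily is just an argmax
    over the actions available in `state`; no transition or reward model needed."""
    return max(Q[state], key=Q[state].get)
```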
---
# References
[[Grokking Deep Reinforcement Learning]]