202410042327
Status: #idea
Tags: #reinforcement_learning #deep_learning #ai
# Samples are not independent in value-based deep reinforcement learning
When we generate experiences in reinforcement learning, we typically start in some random initial state and then follow a policy $\pi$ to explore the space. Because these experiences come from a single trajectory, they are highly correlated and concentrated in whatever section of the state space that trajectory happens to explore.
For example, suppose we are exploring the integers on a number line, randomly select -5 as our starting point, and follow a random policy that steps left or right with equal probability. The states visited in this trajectory are highly likely to cluster around -5. We therefore explore only a tiny fraction of the total state space and provide biased gradient updates to our function approximator.
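A quick simulation makes this concrete (a hypothetical sketch, not taken from the source): a random walk started at -5 visits only a narrow band of integers, even after a thousand steps.

```python
import random
from collections import Counter

random.seed(0)

state = -5
visits = Counter()
for _ in range(1_000):
    visits[state] += 1
    state += random.choice([-1, 1])  # random policy: step left or right with equal probability

# The visited states form a narrow band around the starting point -5,
# a tiny fraction of the full integer number line.
print("range visited:", min(visits), "to", max(visits))
print("distinct states visited:", len(visits))
```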
We can see this in the image below, where we want to learn a cubic polynomial (the blue function) using another cubic polynomial with randomly initialized coefficients (the orange line) and gradient descent. With a biased sample restricted to the range $3.5 \leq x \leq 4.5$, even after 10,000 iterations of gradient descent, the learned function (the green line) still has significant errors relative to the true function.
![[Figure_1.png]]
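The experiment behind this figure can be sketched roughly as follows (the target coefficients, learning rate, and sample size are assumptions, not necessarily the values used to produce the plot): fit a cubic by gradient descent on mean squared error using inputs drawn only from $3.5 \leq x \leq 4.5$, then evaluate it on a wider range.

```python
import numpy as np

rng = np.random.default_rng(0)

true_coeffs = np.array([1.0, -2.0, 0.5, 3.0])  # target cubic (assumed coefficients)
coeffs = rng.normal(size=4)                    # randomly initialized cubic

def cubic(c, x):
    return c[0] * x**3 + c[1] * x**2 + c[2] * x + c[3]

# Biased sample: inputs restricted to a narrow slice of the input space.
x_biased = rng.uniform(3.5, 4.5, size=256)
y_biased = cubic(true_coeffs, x_biased)

lr = 1e-5
for _ in range(10_000):
    err = cubic(coeffs, x_biased) - y_biased
    # Gradient of the mean squared error with respect to each coefficient.
    grad = np.array([
        np.mean(2 * err * x_biased**3),
        np.mean(2 * err * x_biased**2),
        np.mean(2 * err * x_biased),
        np.mean(2 * err),
    ])
    coeffs -= lr * grad

# The fit is reasonable inside [3.5, 4.5] but can be far off elsewhere.
x_test = np.linspace(-5, 5, 11)
print(np.round(cubic(coeffs, x_test) - cubic(true_coeffs, x_test), 2))
```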
As a result of this phenomenon, the neural network approximating the value function makes parameter updates that can degrade its fit outside the biased region it is currently training on. Consequently, we may never converge to the optimal value function.
One common method to mitigate this problem is to use experience replay, as in the [[Deep Q-Network (DQN)]].
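A minimal sketch of such a replay buffer (the class and method names here are illustrative, not DQN's actual implementation): transitions collected over many time steps, possibly from many trajectories, are stored and later sampled uniformly at random, which breaks up the correlation within any single trajectory.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling mixes transitions from different trajectories and
        # different parts of the state space into a single update batch.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Because each minibatch is drawn from a large buffer rather than a contiguous slice of one trajectory, each gradient step sees a broader mixture of states, reducing the bias described above.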
---
# References
[[Grokking Deep Reinforcement Learning]]