202410042327
Status: #idea
Tags:

# Samples are not identically distributed in value-based deep reinforcement learning

When we generate experiences in reinforcement learning, we typically start in some random starting state and then follow a policy $\pi$ that lets us explore the space. In value-based deep reinforcement learning, this policy is usually something like $\epsilon$-greedy: we choose the highest-value action with probability $1-\epsilon$ and a random action with probability $\epsilon$. Because which action is highest-value depends on our deep neural network's estimate of its value, and the network's parameters change during training, the sampling distribution of actions (and therefore of the experiences we collect) shifts as the network updates. This can hurt the convergence of our reinforcement learning algorithm. One common way to mitigate this problem is to use experience replay, as in the [[Deep Q-Network (DQN)]].
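To make this concrete, here is a minimal PyTorch sketch (my own illustration, not code from the book): an $\epsilon$-greedy action selector whose greedy choice depends on the current Q-network parameters, plus a uniform replay buffer of the kind the [[Deep Q-Network (DQN)]] uses. The `QNetwork` architecture, the buffer capacity, and the helper names are assumptions made for the example.

```python
# Minimal sketch, assuming a small PyTorch MLP Q-network and a toy state vector;
# names (QNetwork, epsilon_greedy, ReplayBuffer) and sizes are illustrative.
import random
from collections import deque

import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Small MLP mapping a state vector to one Q-value per action."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def epsilon_greedy(q_net: QNetwork, state: torch.Tensor, epsilon: float) -> int:
    """Random action with probability epsilon, otherwise the greedy action.

    The greedy branch depends on q_net's current parameters, so the action
    distribution (and the states we subsequently visit) changes every time
    the network is updated -- the collected samples are not identically
    distributed.
    """
    if random.random() < epsilon:
        return random.randrange(q_net.n_actions)
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))  # shape: (1, n_actions)
    return int(q_values.argmax(dim=1).item())


class ReplayBuffer:
    """Uniform experience replay: store transitions, sample minibatches later."""

    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)


if __name__ == "__main__":
    q_net = QNetwork(state_dim=4, n_actions=2)
    state = torch.randn(4)
    print(epsilon_greedy(q_net, state, epsilon=0.1))
```

Because the buffer holds transitions collected under many earlier versions of the network and minibatches are drawn from it uniformly at random, the training batches look much closer to identically distributed than the stream of consecutive on-policy transitions does.

---
# References
[[Grokking Deep Reinforcement Learning]]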