202410042326
Status: #idea
Tags: #reinforcement_learning #deep_learning #ai
# The target moves in value-based deep reinforcement learning
In standard value-based deep reinforcement learning with a single value network, every gradient descent step updates the parameters of the neural network that is approximating our state-value (or action-value) function. At the same time, that same network is the one *providing* the target that we are calculating the loss against. For example, using the standard temporal difference (TD) learning setup, the gradient of the loss is:
$\nabla_{\theta_i}L_i(\theta_i) = \mathbb{E}_{s,a,r,s'} \left [(r + \gamma \max_{a'} Q(s', a'; \theta_i) - Q(s,a;\theta_i)) \; \nabla_{\theta_i}Q(s,a;\theta_i) \right]$
where $\theta_i$ are the network parameters at iteration $i$, $r$ is the reward received for taking action $a$ in the current state $s$, $s'$ is the state we transition to, $\gamma$ is the discount factor, and $Q$ is the action-value function approximated by the deep neural network.
We can see that the loss evaluates the difference between our current estimate, given by the term $Q(s,a;\theta_i)$, and the target, given by the reward received plus the discounted value estimate of taking the greedy action in the next state, i.e. the term $r + \gamma \max_{a'} Q(s', a'; \theta_i)$.
Hence, the model itself forms the target in our loss function (the equivalent of a label in supervised learning), and so the *target is changing* after every update.
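As a rough illustration, here is a minimal PyTorch sketch of a single-network TD update (layer sizes, names such as `q_net`, and hyperparameters are illustrative assumptions, not from the source). The target is computed from the same parameters $\theta_i$ that the optimizer just changed, so it shifts after every step; `torch.no_grad()` mirrors the semi-gradient in the equation above, where only $\nabla_{\theta_i} Q(s,a;\theta_i)$ appears.

```python
import torch
import torch.nn as nn

# Minimal single-network TD update; sizes, names, and hyperparameters are
# illustrative assumptions, not taken from the source note.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99  # discount factor

def td_update(s, a, r, s_prime, done):
    # Current estimate Q(s, a; theta_i)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Target r + gamma * max_a' Q(s', a'; theta_i), computed with the SAME
    # parameters. no_grad() stops gradients flowing through the target
    # (matching the semi-gradient above), but the target still moves because
    # theta_i changes on every optimizer step.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(s_prime).max(dim=1).values

    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # theta_{i+1}: the very next target has already shifted
    return loss.item()
```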
One common method is to use two networks, with one acting as a periodically updated target network, as in the [[Deep Q-Network (DQN)]].
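Continuing the sketch above (and reusing the assumed `q_net`, `optimizer`, and `gamma`), a hedged sketch of the target-network idea: a frozen copy $\theta^-$ provides the target and is only synced every so often, so the regression target stays fixed between syncs.

```python
import copy
import torch

# Target-network fix (DQN-style), reusing q_net, optimizer, and gamma from the
# sketch above. theta^- is a frozen copy that only changes at periodic syncs.
target_net = copy.deepcopy(q_net)
target_net.requires_grad_(False)

SYNC_EVERY = 1_000  # illustrative sync interval

def td_update_with_target(step, s, a, r, s_prime, done):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Target uses the frozen parameters theta^-, not the online theta_i.
        target = r + gamma * (1.0 - done) * target_net(s_prime).max(dim=1).values

    loss = torch.nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())  # hard sync: theta^- <- theta_i
    return loss.item()
```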
See also [[Large neural networks can help solve the non-stationarity problem]].
---
# References
[[Grokking Deep Reinforcement Learning]]