202509292216
Status: #idea
Tags: #reinforcement_learning #deep_learning #ai
# Double DQN (DDQN)
[[Deep Q-Network (DQN)]] suffers from the same problem as [[Q-Learning]], and DDQN applies the same solution as [[Double Q-Learning]]. That is, since we take the max action value as our target, we have a positive bias that tends to overestimate the action-value function. This occurs because our Q-function is an estimate of (state, action) values. If we assume the estimates have random noise around the true values, then taking the max increases the probability that we select an estimate that happened to receive positive noise (rather than negative noise). Hence, our max values will tend to be positively biased.
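A quick simulation makes the bias concrete (a minimal sketch with made-up numbers; all values are hypothetical):
```python
import numpy as np

rng = np.random.default_rng(0)

true_q = np.zeros(5)                             # every action is truly worth 0
noise = rng.normal(0.0, 1.0, size=(10_000, 5))   # zero-mean noise on each estimate
estimates = true_q + noise                       # noisy Q estimates over 10,000 trials

# The true max is 0, but the max of the noisy estimates averages well above 0.
print(estimates.max(axis=1).mean())              # roughly +1.16: a positive bias
```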
To resolve this, we need to separate identifying the max-value action $a'$ from actually estimating the value of $a'$. In [[Deep Q-Network (DQN)]], both of these are done by the same Q-function. DDQN instead uses the online Q-function (the one we're currently updating) to select the action, and the frozen Q-function (the target Q-function) to estimate that action's value.
We use this ordering (online Q-function for action selection, frozen Q-function for action-value estimation) because it keeps our targets from changing between updates. If we swapped the order, the online Q-function's values would be used as the action-value estimates, and those change after every gradient update. This would reintroduce the problem: [[The target moves in value-based deep reinforcement learning]]
Mathematically, the DDQN gradient looks like the following:
$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a,r,s' \, \sim \, \mathcal{U}(D)}\left [ \left ( r + \gamma Q(s', \text{argmax}_{a'} Q(s', a' \, ; \, \theta_i) ; \theta^-) - Q(s,a;\theta_i) \right ) \nabla_{\theta_i} Q(s,a;\theta_i) \right ] $
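In code, the target term inside that expectation might look like the following (a minimal PyTorch sketch; `online_net`, `target_net`, and the batch tensors are hypothetical stand-ins for networks mapping a batch of states to per-action Q-values):
```python
import torch

def ddqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute r + gamma * Q(s', argmax_a' Q(s', a'; theta), theta^-) for a batch."""
    with torch.no_grad():
        # Online network selects the greedy next action (argmax over a').
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Frozen/target network evaluates the value of that action.
        next_values = target_net(next_states).gather(1, best_actions).squeeze(1)
        # Zero out the bootstrap term for terminal transitions.
        return rewards + gamma * next_values * (1.0 - dones)
```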
DDQN also improves on the loss function from DQN, which used mean squared error (MSE). MSE is very unforgiving for large errors since it squares them. This makes sense for supervised learning, where we have access to ground-truth labels, so large deviations from them deserve a large penalty. However, in RL, our targets are themselves just estimates. It's not clear that the loss function should penalize large "errors" so heavily, because they may simply indicate that the value estimate truly does need a substantial update.
Another approach would be Mean Absolute Error (MAE), which scales linearly with the error (rather than quadratically). However, MAE is not differentiable at 0, and its gradient has the same magnitude for small errors as for large ones. This can cause training instability, as the optimizer struggles to converge to a minimum.
To balance these two competing ideals (not harshly penalizing large errors while ensuring that gradients decay as errors go to 0), we can use the [[Huber Loss]].
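For concreteness, here is a sketch of the Huber loss alongside PyTorch's built-in equivalent (`smooth_l1_loss` matches the Huber loss when its threshold is 1; the sample errors below are made up):
```python
import torch
import torch.nn.functional as F

def huber_loss(td_errors, delta=1.0):
    # Quadratic for |error| <= delta (gradient shrinks toward 0 near the minimum),
    # linear beyond delta (large errors are not over-penalized).
    abs_err = td_errors.abs()
    quadratic = 0.5 * td_errors ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return torch.where(abs_err <= delta, quadratic, linear).mean()

errors = torch.tensor([0.1, 0.5, 2.0, 10.0])
print(huber_loss(errors))                                  # tensor(2.7825)
print(F.smooth_l1_loss(errors, torch.zeros_like(errors)))  # same value
```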
---
# References
[[Grokking Deep Reinforcement Learning]]