202509282303
Status: #idea
Tags: #reinforcement_learning #ai
# Double Q-Learning
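[[Q-Learning]] often overestimates action values because its update target takes a maximum over estimated values. Even when each individual estimate is merely noisy, the largest of the estimates tends to exceed the true maximum value. Using the maximum of the estimates as the estimate of the maximum value therefore introduces a positive bias, a problem known as *maximization bias*.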
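A small simulation can make the bias visible. This is only an illustrative sketch: the setup (ten actions whose true value is 0, Gaussian reward noise, five samples per action) and the function name `max_of_noisy_estimates` are assumptions chosen for the demo, not something from the reference.

```python
import random
import statistics

def max_of_noisy_estimates(n_actions=10, n_samples=5, rng=None):
    """Return max over actions of a sample-mean estimate.

    Assumption for the demo: every action has true value 0 and rewards
    are drawn from a standard normal, so the true maximum value is 0.
    """
    rng = rng or random.Random()
    estimates = []
    for _ in range(n_actions):
        rewards = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        estimates.append(statistics.fmean(rewards))
    return max(estimates)

rng = random.Random(0)
trials = [max_of_noisy_estimates(rng=rng) for _ in range(2000)]
avg_max = statistics.fmean(trials)

# The true maximum is 0, yet the average of the max estimate is clearly positive.
print(f"true max value: 0.0, average max estimate: {avg_max:.3f}")
```

Each individual estimate here is unbiased; the positive offset comes purely from taking the maximum over them.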
One way of resolving this is to use two Q functions. At each time step, we randomly choose one Q function to select the maximizing action (i.e. determine the most valuable action in a given state), and use the other Q function to generate the estimate of that action's value. To select an action for interacting with the environment, we use the average of the two Q functions.
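The procedure above can be sketched for the tabular case as follows. This is a minimal sketch under stated assumptions: the names `double_q_update` and `behavior_action`, the dictionary-based tables keyed by `(state, action)`, the epsilon-greedy behavior policy, and the default `alpha`, `gamma`, and `epsilon` values are all hypothetical. It also assumes the common formulation in which the table that selects the action is the one that gets updated.

```python
import random
from collections import defaultdict

def double_q_update(q_a, q_b, state, action, reward, next_state, actions,
                    done=False, alpha=0.1, gamma=0.99, rng=random):
    """Apply one double Q-learning update to one of the two tables."""
    # Flip a fair coin: one table selects the action, the other evaluates it.
    if rng.random() < 0.5:
        q_select, q_eval = q_a, q_b
    else:
        q_select, q_eval = q_b, q_a

    if done:
        target = reward
    else:
        # Selection: greedy next action according to q_select.
        best = max(actions, key=lambda a: q_select[(next_state, a)])
        # Estimation: value of that action according to q_eval.
        target = reward + gamma * q_eval[(next_state, best)]

    # Assumption: the selecting table is the one updated on this step.
    q_select[(state, action)] += alpha * (target - q_select[(state, action)])

def behavior_action(q_a, q_b, state, actions, epsilon=0.1, rng=random):
    """Epsilon-greedy action using the average of the two Q functions."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: (q_a[(state, a)] + q_b[(state, a)]) / 2)

# Tiny usage example with two states and two actions.
q_a, q_b = defaultdict(float), defaultdict(float)
actions = [0, 1]
double_q_update(q_a, q_b, "s0", 0, reward=1.0, next_state="s1", actions=actions)
print(behavior_action(q_a, q_b, "s0", actions, epsilon=0.0))
```

Keeping the two tables as separate dictionaries makes the selection and estimation roles explicit; a function-approximation version would swap the tables for two networks but keep the same split.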
The reason this helps resolve the maximization bias is that the action with the highest estimate in one Q function may not have the highest estimate in the other Q function. When we separate responsibilities for selection and estimation, the noise that made an action look best is not reused to value it, so we no longer systematically pick the overestimated maximum.
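The argument can be written as a short inequality. As a sketch, assume (this is not stated in the note) that for a fixed state the two estimates $Q_A(a)$ and $Q_B(a)$ are independent and each is an unbiased estimate of the true value $q(a)$, and let $a^* = \arg\max_a Q_A(a)$ be the action selected by $Q_A$:

$$
\mathbb{E}\big[Q_B(a^*)\big] \;=\; \sum_a \Pr(a^* = a)\, q(a) \;\le\; \max_a q(a) \;\le\; \mathbb{E}\big[\max_a Q_A(a)\big]
$$

Under these assumptions the double estimate on the left never overshoots the true maximum in expectation (it may undershoot), while the single-estimator target on the right never undershoots it.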
---
# References
[[Grokking Deep Reinforcement Learning]]