202509282303

Status: #idea

Tags: #reinforcement_learning #ai

# Double Q-Learning

[[Q-Learning]] tends to overestimate action values because its update target takes a maximum over noisy estimates. Using the maximum of the estimates as an estimate of the maximum true value is biased upward, a problem known as *maximization bias*.

One way of resolving this is to maintain two Q functions. At each time step, we randomly choose one of them to select the greedy action (i.e. determine the most valuable action in the next state), and use the other to estimate the value of that action. Only the Q function that did the selecting is updated on that step. To select an action for interacting with the environment, we use the average of the two Q functions.

This reduces maximization bias because the action with the highest estimate under one Q function may not have the highest estimate under the other. An action that one function overestimates by chance is unlikely to be overestimated by the other as well, so separating selection from estimation means the target is no longer always the inflated maximum.

---

# References

[[Grokking Deep Reinforcement Learning]]
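
# Sketch

The update described in this note can be sketched in a few lines for the tabular case. This is a minimal illustration, not the reference implementation from the book: the function names `double_q_update` and `select_action`, the nested-list tables `q_a` and `q_b`, the default values of `alpha`, `gamma` and `epsilon`, and the epsilon-greedy exploration are all assumptions made for the example.

```python
import random


def argmax(values):
    """Return the index of the largest value (first index on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def double_q_update(q_a, q_b, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.99, rng=random):
    """Apply one tabular Double Q-learning update in place.

    q_a and q_b are hypothetical tables indexed as q[state][action].
    """
    # Coin flip: one table selects the greedy action and is the one updated;
    # the other table only supplies the value estimate for that action.
    if rng.random() < 0.5:
        selector, evaluator = q_a, q_b
    else:
        selector, evaluator = q_b, q_a

    # Selection: greedy action in the next state according to the selector.
    best_action = argmax(selector[next_state])

    # Estimation: value of that action according to the other table.
    bootstrap = 0.0 if done else evaluator[next_state][best_action]
    td_target = reward + gamma * bootstrap

    selector[state][action] += alpha * (td_target - selector[state][action])


def select_action(q_a, q_b, state, epsilon=0.1, rng=random):
    """Epsilon-greedy behaviour policy over the average of both tables."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_a[state]))
    averaged = [(a + b) / 2.0 for a, b in zip(q_a[state], q_b[state])]
    return argmax(averaged)
```

Because the action is chosen with one table and valued with the other, an action that is overestimated in only one table does not automatically inflate the target.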