202509302051
Status: #idea
Tags: #reinforcement_learning #deep_learning #ai

# Dueling DDQN

Dueling DDQN improves the sample efficiency (i.e., the amount of learning extracted from each experience) of [[Double DQN (DDQN)]] by splitting the online Q-function network into two streams: one that approximates the $V$-function (state-value function) and one that approximates the $A$-function (action-advantage function).

Recall that $A(s,a) = Q(s,a) - V(s)$ for state $s$ and action $a$. That is, how much more return do we get by taking action $a$ in state $s$ and following policy $\pi$ thereafter, versus just following policy $\pi$ from state $s$?

By splitting the network this way, any visit to state $s$ (regardless of the action taken during that experience) improves the $V$ estimate shared by *all* actions in that state. Previously, when approximating the $Q$-function directly, $(s,a)$ and $(s,a')$ were treated as entirely separate estimates. So we're decomposing $Q$ into two functions: one that's shared across all actions in a state ($V$) and one that is specific to a state-action pair ($A$). When we combine the $V$ and $A$ functions to recover the $Q$-function, we get

$$Q(s,a) = V(s) + A(s,a)$$

In practice, Dueling DDQN implements these two streams as a single neural network with a shared trunk and two prediction heads: one for $V(s)$ and one for $A(s, a)$. Below is a graphical example of a Dueling DDQN network trained on images.

![[Pasted image 20250930215323.png]]

The output of the $V(s)$ head is a scalar. The output of the $A(s,a)$ head is a vector of size $|\mathcal{A}|$. In the notation below, $\theta$ denotes the shared trunk parameters, while $\alpha$ and $\beta$ denote the parameters of the advantage and value heads, respectively.

The decomposition $Q = V + A$ is not unique: adding a constant to $V(s)$ and subtracting it from every $A(s,a)$ yields the same $Q$. To remove this extra degree of freedom and stabilize optimization, in practice we subtract the mean advantage over actions before recombining. Thus, the formula for $Q$ in this model becomes

$$Q(s, a \, ; \, \theta, \alpha, \beta) = V(s \, ; \, \theta, \beta) + \left( A(s, a \, ; \, \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a' \, ; \, \theta, \alpha) \right)$$

Dueling DDQN further improves on [[Double DQN (DDQN)]] by reducing how stale the frozen target network gets. Staleness is a problem because, shortly before the next target-weight update, the predictions from the frozen target network can be so out of date as to be irrelevant (or even harmful). Then, when we hit the update step, the target network's weights jump to match the online network's weights, causing a massive change in the targets all at once. So instead of freezing the target network, Dueling DDQN gradually mixes the online network weights into the target network on every step:

$$\begin{align*}
\theta^-_i &= \tau \theta_i + (1-\tau)\theta^-_i \\
\alpha^-_i &= \tau \alpha_i + (1-\tau)\alpha^-_i \\
\beta^-_i &= \tau \beta_i + (1-\tau)\beta^-_i
\end{align*}$$

So a fraction $\tau$ of the online network's weights is mixed into the target network at every step. This technique is called *Polyak averaging*. (A minimal sketch of the dueling head and this update is included after the references.)

Other than these changes, Dueling DDQN is the same as [[Double DQN (DDQN)]].

---
# References

[[Grokking Deep Reinforcement Learning]]
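
---
To make the two ideas above concrete, here is a minimal PyTorch sketch (not code from the book) of a dueling Q-network with mean-advantage subtraction, plus a Polyak-averaging target update. The names `DuelingQNet` and `polyak_update`, the layer sizes, and the value `tau = 0.005` are illustrative assumptions, not from the source.

```python
import torch
import torch.nn as nn


class DuelingQNet(nn.Module):
    """Q-network with a shared trunk and separate V and A heads."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        # Shared trunk (theta)
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Value head (beta): scalar V(s)
        self.value_head = nn.Linear(hidden, 1)
        # Advantage head (alpha): one A(s, a) per action
        self.adv_head = nn.Linear(hidden, num_actions)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        features = self.trunk(state)
        v = self.value_head(features)        # shape: (batch, 1)
        a = self.adv_head(features)          # shape: (batch, |A|)
        # Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))
        return v + a - a.mean(dim=1, keepdim=True)


def polyak_update(online: nn.Module, target: nn.Module, tau: float = 0.005) -> None:
    """Mix a fraction tau of the online weights into the target weights."""
    with torch.no_grad():
        for p_online, p_target in zip(online.parameters(), target.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)


# Usage: after every optimization step on the online network, call
# polyak_update(...) instead of periodically hard-copying the weights.
online_net = DuelingQNet(state_dim=4, num_actions=2)
target_net = DuelingQNet(state_dim=4, num_actions=2)
target_net.load_state_dict(online_net.state_dict())
polyak_update(online_net, target_net, tau=0.005)
```

With a small $\tau$, the target network trails the online network smoothly instead of jumping to it all at once, which is exactly the staleness fix described above.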