202410042327
Status: #idea
Tags: #reinforcement_learning #deep_learning #ai

# Deep Q-Network

The Deep Q-Network (DQN) improves upon value-based deep reinforcement learning in two ways:

## 1. **By reducing target movement during training.** ([[The target moves in value-based deep reinforcement learning]])

It accomplishes this by maintaining two networks: a target network, which acts as our target (or evaluation label) in the loss function, and the current network that we improve in order to learn the action-value function. The target network is typically a frozen version of the learned network from some time in the past. After a fixed number of steps (as few as 10 or as many as 10,000), we update the target network to reflect the most recent learned network before freezing it again.

In particular, the gradient update in the Deep Q-Network is

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left [ \left ( r + \gamma\max_{a'}Q(s', a'; \theta^-) - Q(s,a;\theta_i) \right ) \nabla_{\theta_i} Q(s,a;\theta_i) \right ] \tag{1}$$

Here, the core difference is that the target uses a frozen set of weights $\theta^-$. We then update the current set of parameters $\theta_i$ based on the difference between the current state-action value estimate and the target value (namely, $r + \gamma\max_{a'}Q(s', a'; \theta^-)$). This scalar difference scales the gradient vector to produce the update of the weights.

## 2. **By making experiences more like independent and identically distributed (IID) samples.** ([[Samples are not identically-distributed in value-based deep reinforcement learning]] & [[Samples are not independent in value-based deep reinforcement learning]])

This is accomplished through experience replay, which maintains a buffer of previously experienced $(s, a, r, s')$ tuples that can be sampled from to train the network. Thus, the samples drawn for training are likely not from the same trajectory (and therefore are more likely to be independent). The samples also appear more identically distributed, since we are sampling from data generated by multiple past policies at once. This helps make the optimization more stable. In practice, the replay buffer needs considerable capacity to perform optimally, from 10,000 to 1,000,000 experiences depending on the problem.

Now let's update equation (1) to reflect the replay buffer:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a,r,s' \, \sim \, \mathcal{U}(D)}\left [ \left ( r + \gamma\max_{a'}Q(s', a'; \theta^-) - Q(s,a;\theta_i) \right ) \nabla_{\theta_i} Q(s,a;\theta_i) \right ] \tag{2}$$

The only difference is in the subscript of the expectation operator $\mathbb{E}$: the state, action, reward, and next state are sampled uniformly at random from the replay buffer $D$, rather than taken from the online experience stream as in equation (1). A minimal code sketch of both mechanisms appears after the references.

---
# References

[[Grokking Deep Reinforcement Learning]]
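---
# Appendix: Minimal update sketch

To make equations (1) and (2) concrete, here is a minimal PyTorch sketch of a single DQN update that bootstraps from a frozen target network and samples uniformly from a replay buffer. The `QNetwork` architecture, the hyperparameters, and the `done` termination flag are illustrative assumptions, not details taken from the source.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Illustrative Q-network; the architecture is an assumption, not from the source.
class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # Q(s, ·): one value per action

state_dim, num_actions, gamma = 4, 2, 0.99  # assumed problem sizes and discount

online_net = QNetwork(state_dim, num_actions)   # parameters θ_i (trained)
target_net = QNetwork(state_dim, num_actions)   # parameters θ⁻ (frozen)
target_net.load_state_dict(online_net.state_dict())

optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)  # D: stores (s, a, r, s', done) tuples

def dqn_update(batch_size=32):
    """One gradient step corresponding to equation (2)."""
    batch = random.sample(replay_buffer, batch_size)  # uniform sampling from D
    states = torch.stack([torch.as_tensor(t[0], dtype=torch.float32) for t in batch])
    actions = torch.tensor([t[1] for t in batch], dtype=torch.int64)
    rewards = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    next_states = torch.stack([torch.as_tensor(t[3], dtype=torch.float32) for t in batch])
    dones = torch.tensor([t[4] for t in batch], dtype=torch.float32)

    # Target value r + γ max_a' Q(s', a'; θ⁻), computed with the frozen weights.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    # Current estimate Q(s, a; θ_i) for the actions actually taken.
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Minimizing the squared TD error yields a gradient proportional to equation (2).
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Every fixed number of steps, refresh the target network and re-freeze it."""
    target_net.load_state_dict(online_net.state_dict())
```

During interaction, the agent appends each $(s, a, r, s', \text{done})$ tuple to `replay_buffer`, calls `dqn_update` once enough experiences have accumulated, and calls `sync_target` every fixed number of steps.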