202509302223
Status: #idea
Tags: #reinforcement_learning #ai #deep_learning

# Prioritized Experience Replay (PER)

In other approaches I've described in RL, we've been using a replay buffer (described in [[Deep Q-Network (DQN)]]). This stores past experience tuples encountered during the learning process, which we then sample from to form each batch (this helps solve the two problems [[Samples are not identically-distributed in value-based deep reinforcement learning]] & [[Samples are not independent in value-based deep reinforcement learning]]).

In our earlier implementations, we sampled uniformly from this replay buffer, so all past experiences had the same chance of being chosen. Intuitively, this doesn't seem right - aren't some experiences more informative than others?

This is the high-level goal of PER: to allocate more training resources to the experience tuples with the greatest potential for learning. What we want to learn from are experiences with very *unexpected* values - that is, experiences where our estimate deviated strongly from our target (i.e. we had a very large absolute error).

The absolute error of [[Dueling DDQN]] (which is just the absolute TD error of the Dueling DDQN function) looks like the following:

$|\delta_i| = |r + \gamma Q(s', \text{argmax}_{a'} Q(s', a' \, ; \, \theta_i, \alpha_i, \beta_i) \, ; \, \theta^-, \alpha^-, \beta^-) - Q(s, a \, ; \, \theta_i, \alpha_i, \beta_i)|$

In plain language, we are taking the absolute value of the difference between the Dueling DDQN target ($r + \gamma Q(s', \text{argmax}_{a'} Q(s', a' \, ; \, \theta_i, \alpha_i, \beta_i) \, ; \, \theta^-, \alpha^-, \beta^-)$) and the Dueling DDQN estimate ($Q(s, a \, ; \, \theta_i, \alpha_i, \beta_i)$). See [[Dueling DDQN]] if you need to understand these concepts better.

Now, instead of placing $(s, a, r, s', d)$ into the replay buffer ($d$ is the done flag), we place $(s, a, r, s', d, \text{absolute\_td\_error})$ in the buffer. With these TD errors in the replay buffer, we now need to use them to form priorities for how we should do our sampling.

## Prioritization Methods

### Proportional Prioritization

A simple way to set the priority of sample $i$ is to use the TD error itself, plus a small constant $\epsilon$ to ensure $0$-error samples still get replayed:

$p_i = |\delta_i| + \epsilon$

We can convert this priority score into a sampling probability by dividing by the sum of all priority scores:

$P(i) = \frac{p_i}{\sum_k p_k}$

And finally, we can further tune this probability by introducing a hyperparameter $\alpha$ that allows us to smoothly interpolate between uniform sampling ($\alpha = 0$) and sampling completely based on the priorities ($\alpha = 1$):

$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$

### Rank-based Prioritization

Proportional prioritization is sensitive to outliers: experiences with much higher TD error than the rest will be sampled far more often than those with low magnitudes. Instead, we can set priorities based on each experience's rank when sorted by absolute TD error:

$p_i = \frac{1}{\text{rank}(i)}$

Then we can compute probabilities from these rank-based priorities, just as we did before:

$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$
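To make the two schemes concrete, here is a minimal sketch (in NumPy, with hypothetical names like `abs_td_errors`) of turning a buffer of stored absolute TD errors into sampling probabilities under both prioritization methods:

```python
import numpy as np

def proportional_probabilities(abs_td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization: p_i = |delta_i| + eps, P(i) = p_i^alpha / sum_k p_k^alpha."""
    priorities = np.abs(abs_td_errors) + eps
    scaled = priorities ** alpha
    return scaled / scaled.sum()

def rank_based_probabilities(abs_td_errors, alpha=0.6):
    """Rank-based prioritization: p_i = 1 / rank(i), where rank 1 is the largest |delta|."""
    # argsort of -|delta| orders samples from largest to smallest error;
    # invert that ordering to get each sample's 1-based rank.
    order = np.argsort(-np.abs(abs_td_errors))
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(abs_td_errors) + 1)
    priorities = 1.0 / ranks
    scaled = priorities ** alpha
    return scaled / scaled.sum()

# Example: sample a batch of buffer indices according to P(i)
abs_td_errors = np.array([0.05, 2.3, 0.4, 0.0, 1.1])
probs = proportional_probabilities(abs_td_errors)
batch_indices = np.random.choice(len(abs_td_errors), size=3, p=probs)
```

Note that this sketch recomputes the full distribution for clarity; practical implementations of the proportional variant typically store priorities in a sum-tree so sampling doesn't require renormalizing over the whole buffer.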
### Prioritization Bias

Note that, in order to produce accurate updates of the action-value function, we need to sample from the same distribution the expectation is taken over. If we were to ignore this fact and update a sample more often than it appears in expectation, we'd create a bias toward that value.

We can mitigate this with weighted importance sampling. First, we compute a weight for each sample $i$ by multiplying the probability of sampling $i$ by the number of samples in the replay buffer $N$, then raising the product to the power $-\beta$:

$w_i = (N \cdot P(i))^{-\beta}$

The hyperparameter $\beta$ allows us to tune the degree of the correction: when $\beta$ is $0$, there is no correction; when $\beta$ is $1$, the bias is fully corrected.

Then, we can downscale the weights so that the max weight is $1$ and everything else is lower:

$w_i = \frac{w_i}{\max_j(w_j)}$

## PER Gradient Update

Putting all of this together, we have the following PER gradient update:

$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{(w,s,a,r,s') \sim \mathcal{P}(D)} \left[ w \left( r + \gamma Q(s', \text{argmax}_{a'} Q(s', a' \, ; \, \theta_i) \, ; \, \theta^-) - Q(s, a \, ; \, \theta_i) \right) \nabla_{\theta_i} Q(s, a \, ; \, \theta_i) \right]$

Note that we're using the normalized importance-sampling weights $w$ to modify the magnitude of the TD error. We're also sampling from $\mathcal{P}(D)$ in the subscript of the expectation rather than from the uniform distribution $\mathcal{U}(D)$. Note that $\mathcal{P}$ can be either the proportional or the rank-based prioritization discussed previously. A short sketch tying the sampling, weighting, and update steps together is included after the references.

---
# References
[[Grokking Deep Reinforcement Learning]]
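Closing sketch (a minimal NumPy illustration, not the book's implementation): assuming a buffer of stored absolute TD errors and using the proportional scheme from above, one prioritized sampling-and-weighting step might look like this. The name `per_batch` and the dummy TD errors are hypothetical; in a real agent the new TD errors would come from the online and target networks.

```python
import numpy as np

def per_batch(abs_td_errors, batch_size=32, alpha=0.6, beta=0.4, eps=1e-6):
    """Sample a prioritized batch and compute normalized importance-sampling weights.

    Returns the sampled buffer indices and the weights w_i = (N * P(i))^{-beta},
    rescaled so that max_j w_j = 1.
    """
    n = len(abs_td_errors)

    # Proportional prioritization: P(i) = p_i^alpha / sum_k p_k^alpha
    priorities = np.abs(abs_td_errors) + eps
    scaled = priorities ** alpha
    probs = scaled / scaled.sum()

    # Sample indices according to P(i)
    indices = np.random.choice(n, size=batch_size, p=probs)

    # Importance-sampling weights, normalized by the max weight
    weights = (n * probs[indices]) ** (-beta)
    weights /= weights.max()
    return indices, weights

# --- Hypothetical usage inside a training loop ---
# abs_td_errors holds the stored |delta_i| for every experience in the buffer.
abs_td_errors = np.random.rand(10_000)
indices, weights = per_batch(abs_td_errors, batch_size=64, beta=0.4)

# new_td_errors would come from the current networks for the sampled transitions;
# random values are used here only to show the shape of the update.
new_td_errors = np.random.randn(64)

# The IS weights scale each sample's squared TD error in the loss,
# and the sampled experiences get their stored priorities refreshed.
loss = np.mean(weights * new_td_errors ** 2)
abs_td_errors[indices] = np.abs(new_td_errors)
```

In practice $\beta$ is usually annealed from a small initial value toward $1$ over the course of training, so the bias correction becomes full as the value estimates converge.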