202509282338
Status: #idea
Tags: #reinforcement_learning #ai

# The key decision points when creating a deep RL approach

1. **Select a value function to approximate** - we can use deep learning to approximate various types of value functions:
	- State-value functions $V(S)$
	- Action-value functions $Q(S, A)$
	- Action-advantage functions $A(S, A)$ - defined as $A(S, A) = Q(S, A) - V(S)$. That is, how much better is taking action $A$ in state $S$ and following policy $\pi$ thereafter vs. just following $\pi$ immediately from state $S$?
2. **Select a neural network architecture** - we can construct many different neural network architectures to approximate the value function we select. One of the most consequential decisions is the design of the input and output layers. A straightforward architecture takes the state along with the action to evaluate as input and outputs a single $Q(S, A)$ estimate. However, a better approach (especially when doing $\epsilon$-greedy sampling or softmax) is to input only the state and have the neural network estimate the $Q$-values for all actions in that state (see the first sketch at the end of this note).
3. **Select what to optimize** - since we don't have access to the optimal action-value function $q^*(s,a)$, we need to run the loop described in [[Generalized policy iteration (GPI)]] - that is, we improve the value function to be consistent with the policy, then we make the policy greedy with respect to the new value function. We loop through this process several times.
4. **Select the targets for policy evaluation** - for the target (i.e. the "ground truth" label in the loss), we can use any of the targets discussed previously in [[Learning state-value functions]] or [[Learning action-value function]]. The simplest target is the one-step temporal difference target, using either on-policy one-step TD ([[SARSA]]) or off-policy one-step TD ([[Q-Learning]]).
5. **Select an exploration strategy** - now we need to choose what strategy we will use for the data-generation policy. Note that I say data-generation policy because, in off-policy methods like [[Q-Learning]], we actually use one policy to generate data and learn about another policy. The exploration strategy determines how we sample from the data-generation policy to choose actions and produce new experiences. A common exploration strategy is decaying $\epsilon$-greedy.
6. **Select a loss function** - the loss function tells our neural network how good (or bad) its predictions of the action-value function are. In RL, this is a bit more difficult to interpret than in supervised learning because our targets are predictions that come from the network rather than true values. A common loss for action-value function approximation is mean squared error (MSE). In practice, for Q-Learning, this looks like: $\text{MSE} = \frac{1}{n} \sum_{i=1}^n \left[Q(S_i, A_i) - \left(R_i + \gamma \max_{A_i'} Q(S_i', A_i') \right) \right]^2$ (see the second sketch at the end of this note).
7. **Select an optimization method** - the optimization method determines exactly *how* we compute gradients of the loss function and use them to update the weights of the deep value-function network.

---
# References
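
A minimal sketch of the "state in, all Q-values out" architecture (step 2) and a decaying $\epsilon$-greedy exploration strategy (step 5), assuming a discrete action space and PyTorch. The class name, hidden sizes, and epsilon schedule are illustrative choices, not part of the original note.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state to Q-value estimates for every action (one output unit per action)."""
    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),  # Q(S, a) for every action a
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)

def select_action(q_net: QNetwork, state: torch.Tensor, step: int,
                  eps_start: float = 1.0, eps_end: float = 0.05,
                  decay_steps: int = 10_000) -> int:
    """Decaying epsilon-greedy: explore with probability epsilon, else act greedily w.r.t. Q."""
    # Linear decay from eps_start to eps_end over decay_steps (illustrative schedule).
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / decay_steps)
    n_actions = q_net.net[-1].out_features
    if torch.rand(1).item() < eps:
        return torch.randint(n_actions, (1,)).item()           # exploratory action
    with torch.no_grad():
        return q_net(state.unsqueeze(0)).argmax(dim=1).item()  # greedy action
```

Because the network outputs all Q-values in one forward pass, a single call is enough to act greedily or to compute softmax/$\epsilon$-greedy probabilities over actions.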
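
A minimal sketch of one gradient step combining the one-step Q-learning target (step 4), the MSE loss (step 6), and a gradient-based optimizer (step 7). The batch layout, `dones` mask, and function name are assumptions for illustration; as in the note, the bootstrapped target comes from the network itself and is treated as a fixed label.

```python
import torch
import torch.nn.functional as F

def q_learning_update(q_net, optimizer, batch, gamma: float = 0.99):
    """One gradient step on the MSE between Q(S, A) and the one-step Q-learning target."""
    # Tensors shaped (batch, ...); actions are int64 indices, dones is 0/1.
    states, actions, rewards, next_states, dones = batch

    # Q(S_i, A_i): predicted value of the action actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target: R_i + gamma * max_a' Q(S'_i, a'), with no bootstrap past terminal states.
    # No gradient flows through the target; it acts as the "ground truth" label.
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1.0 - dones)

    loss = F.mse_loss(q_pred, target)  # the MSE from step 6

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # step 7: e.g. Adam or RMSprop applies the gradients
    return loss.item()
```

Usage would pair this with something like `optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)` and batches sampled from collected experience.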