202510082211
Status: #idea
Tags: #reinforcement_learning #ai #deep_learning

# Policy-Gradient Methods Overview

Policy-gradient methods, at their core, seek to parameterize a policy directly and optimize it to maximize expected returns. This differs from value-based agents such as [[Q-Learning]] (or [[Deep Q-Network (DQN)]] for a deep RL example), in that we don't learn a value function and then derive the policy from it. We go straight for the goal of learning the policy.

Now, why can this be a better approach than learning value functions directly? A few reasons:

1. In continuous action spaces, our standard way of deriving a policy from a value function doesn't really make sense. What is the argmax of the action-value function in a given state when there are infinitely many possible actions? We could discretize the action space, but that may not work for problems that require fine-grained control. By parameterizing the policy directly, we can learn to output any continuous value in the action space.
2. In some problems, the value function is much harder to learn than the policy. For example, suppose we are in a 1-D grid world with equally rewarding goal states at the extreme left and right. Clearly the policy should be: if you're in the center square, go randomly left or right; in any other square, go toward the closest goal state. This is an easy policy to learn, but learning whether the value of the state just left of the middle is 1.0001 or 1.001 may be quite hard. By using policy gradients, we directly attack the easier problem and avoid the harder value problem.
3. In some problems (particularly those where the full state is not observable), it can make sense to model the next action as a probability distribution. In the face of measurement uncertainty, we can't be certain about which action to take next, so we introduce some informed randomization. Policy-gradient methods can easily learn and output a distribution over actions.
4. Policy-gradient methods, in general, can have better convergence properties. This is because a tiny change in the parameter values results in only a tiny change in the action probabilities. By contrast, with value-based methods, a tiny change in the parameters of the value function can completely flip which action the argmax selects.

## General Formulation

Since policy-gradient methods seek to maximize returns directly, they perform gradient *ascent*. This contrasts with value-based methods, which perform gradient *descent* to minimize the error between the value function estimate and its target. Functionally, this looks like

$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$

where $\widehat{\nabla J(\theta_t)} \in \mathbb{R}^d$ is a stochastic estimate whose expectation approximates the gradient of the performance measure $J$ with respect to the policy parameters $\theta_t$.

In discrete action spaces, in order to ensure that the policy $\pi$ produces a valid probability distribution over the possible actions, a softmax is typically applied to parameterized numerical preferences $h(s, a, \theta) \in \mathbb{R}$ for each state-action pair:

$\pi(a \; | \; s, \theta) = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}$

This ensures that $\sum_{a' \in A} \pi(a' \; | \; s, \theta) = 1$ for all $s$ and $\pi(a \; | \; s, \theta) \geq 0$ for all $a, s$, yielding a valid probability distribution for each fixed state $s$. The values $h(s,a,\theta)$ are output by the function used to parameterize the policy (typically a deep neural network).
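As a purely illustrative sketch of both pieces above, the snippet below implements a softmax policy over linear action preferences and takes a single gradient-ascent step using a score-function estimate of $\nabla J(\theta)$. The feature map `phi`, the step size `alpha`, and the return `G` are placeholder assumptions, not values from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

num_actions, feature_dim = 3, 4
alpha = 0.01                                   # step size (illustrative choice)
theta = np.zeros((num_actions, feature_dim))   # policy parameters


def phi(state):
    """Hypothetical state-feature map; stands in for any feature extractor."""
    return np.asarray(state, dtype=float)


def action_probs(state, theta):
    """Softmax over linear preferences h(s, a, theta) = theta[a] . phi(s)."""
    h = theta @ phi(state)          # one preference per action
    h -= h.max()                    # numerical stability
    e = np.exp(h)
    return e / e.sum()              # pi(a | s, theta)


def grad_log_pi(state, action, theta):
    """Gradient of log pi(a | s, theta) for the linear-softmax case:
    d/d theta[a'] log pi(a | s) = (1[a' = a] - pi(a' | s)) * phi(s)."""
    probs = action_probs(state, theta)
    indicator = np.zeros(num_actions)
    indicator[action] = 1.0
    return np.outer(indicator - probs, phi(state))


# One stochastic gradient-ascent step on a single (state, return) sample.
state = rng.normal(size=feature_dim)           # stand-in observation
probs = action_probs(state, theta)
action = rng.choice(num_actions, p=probs)      # sample a ~ pi(. | s, theta)
G = 1.0                                        # placeholder return for the sampled action
grad_estimate = G * grad_log_pi(state, action, theta)   # score-function estimate of grad J
theta = theta + alpha * grad_estimate          # theta_{t+1} = theta_t + alpha * grad-hat J(theta_t)
```

With linear preferences, the gradient of $\log \pi(a \; | \; s, \theta)$ has the closed form used above; with a neural network producing $h(s,a,\theta)$, the same update would be computed by backpropagating through $\log \pi$. This score-function estimator is the idea behind [[REINFORCE]].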
## Policy Gradient Theorem

## See Also

Some policy gradient models for reference:

- [[REINFORCE]]
- [[Vanilla Policy Gradient (VPG)]]

Policy gradients also play a core role in Actor-Critic Methods: [[Actor-Critic Methods Overview]].

---

# References

[[Grokking Deep Reinforcement Learning]]