202510082211
Status: #idea
Tags: #reinforcement_learning #ai #deep_learning

# Policy-Gradient Methods Overview

Policy-gradient methods, at their core, seek to parameterize a policy directly and optimize it to maximize expected returns. This differs from value-based agents such as [[Q-Learning]] (or [[Deep Q-Network (DQN)]] for a deep RL example), in that we don't learn a value function and then derive the policy from that value function. We just go straight for the goal of learning the policy.

Now, why can this be a better approach than learning value functions? A few reasons:

1. In continuous action spaces, our standard way of deriving a policy from a value function doesn't really make sense. What is the argmax of the action-value function in a given state when there are infinitely many possible actions? We could discretize the action space, but this may not work in areas where we need fine-grained control. By parameterizing the policy directly, we can learn to output any continuous value in the action space.
2. In some problems, the value function is much harder to learn than the policy. For example, suppose we are in a 1-D grid world with goal states of equal reward at the extreme left and right. Clearly the policy should be: if you're in the center square, go randomly left or right; if you're in any other square, go toward the closest goal state. This is an easy policy to learn, but learning whether the value of the state just left of the middle is 1.0001 or 1.001 may be quite hard. By using policy gradients, we directly attack the easier problem and avoid the harder value problem.
3. In some problems (particularly those where the full state is not observable), it can make sense to model the next action as a probability distribution. In the face of measurement uncertainty, we can't be certain about which action to take next, so we introduce some informed randomization. Policy-gradient methods can easily learn and output a distribution over actions.
4. Policy-gradient methods, in general, can have better convergence properties. This occurs because a tiny change in parameter values results in a tiny change in the action probabilities. By contrast, when using value-based methods, a tiny change in the parameters of the value function can completely flip the action that is selected by the argmax.

## General Formulation

Since policy-gradient methods seek to maximize returns directly, they perform gradient *ascent*. This contrasts with value-based methods, which perform gradient *descent* to minimize the error between the value function estimate and the target. Functionally, this looks like

$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$

where $\widehat{\nabla J(\theta_t)} \in \mathbb{R}^d$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to its argument $\theta_t$.

In discrete-action settings, in order to ensure that the policy $\pi$ produces a valid probability distribution over the possible actions, the softmax is typically applied to parameterized numerical preferences $h(s, a, \theta) \in \mathbb{R}$ for each state-action pair:

$\pi(a \; | \; s, \theta) = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}$

This ensures that $\sum_{a' \in A} \pi(a' \; | \; s, \theta) = 1$ for all $s$ and $\pi(a \; | \; s, \theta) \geq 0$ for all $a, s$, yielding a valid probability distribution for each fixed state $s$. The values $h(s,a,\theta)$ are output by the function used to parameterize the policy (typically a deep neural network).
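To make the formulation concrete, here is a minimal sketch (a toy example of mine, not taken from either reference): a softmax policy over tabular preferences $h(s,a,\theta)$ plus one stochastic gradient-ascent step $\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$, where the estimate is the return-weighted $\nabla \ln \pi(a \; | \; s, \theta)$ used by [[REINFORCE]]. The state/action counts, step size, and placeholder return are arbitrary choices for illustration.

```python
import numpy as np

def softmax(prefs):
    """Turn action preferences h(s, ., theta) into probabilities pi(. | s, theta)."""
    z = prefs - prefs.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Tabular preferences: theta[s, a] = h(s, a, theta) for a toy problem
# with 5 states and 2 actions (sizes are arbitrary).
n_states, n_actions = 5, 2
theta = np.zeros((n_states, n_actions))
alpha = 0.1                              # step size for gradient ascent

def act(state, rng):
    """Sample an action from pi(. | state, theta)."""
    return rng.choice(n_actions, p=softmax(theta[state]))

def gradient_ascent_step(state, action, G):
    """theta <- theta + alpha * G * grad ln pi(action | state, theta).

    For a softmax over tabular preferences, the gradient of ln pi(a | s)
    with respect to h(s, .) is (one_hot(a) - pi(. | s)), so the update
    only touches the row of theta belonging to `state`.
    """
    probs = softmax(theta[state])
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta[state] += alpha * G * grad_log_pi

# Usage: pretend we observed one (state, action, return) sample.
rng = np.random.default_rng(0)
s = 2
a = act(s, rng)
G = 1.0                                  # placeholder return following (s, a)
gradient_ascent_step(s, a, G)
print(softmax(theta[s]))                 # for this positive return, the sampled
                                         # action's probability has increased
```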
## Policy Gradient Theorem

The policy gradient theorem gives us an analytic expression for the gradient of performance with respect to the policy parameter $\theta$ that does not involve the derivative of the state distribution. For the episodic case, the theorem establishes that:

$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_{\pi} (s,a) \nabla \pi(a \; | \; s, \theta)$

Why is this important? Well, since we are now learning a policy, our choice of parameters for the policy influences two things - the actions we take (and thus the rewards we receive) as well as the states we visit. Thus, in order to calculate the gradient of performance with respect to the policy's parameters, it would seem on the surface that we need to calculate how the action probabilities change in response to a change in $\theta$ as well as how the states we visit change in response to a change in $\theta$. Calculating the actions we take given the parameterization $\theta$ is straightforward - we can easily compute the action probabilities (and the "most likely" action) in any given state using the policy function. However, calculating which states we will visit (that is, the distribution over states) requires a model of the environment. Namely, we would need to know the transition function so that we could determine, from the actions we take, exactly what our state distribution will look like. Since we typically do not have a model of the environment (the transition function is typically hidden and very complex to model), if we needed this information to compute the policy gradient, we would be unable to compute it and policy-gradient methods would be intractable. Thanks to the Policy Gradient Theorem, we are able to do this seemingly intractable calculation, because the expression above removes the dependence on the derivative of the state distribution. The proof of the Policy Gradient Theorem is on p. 325 of Sutton & Barto.

## See Also

Some policy-gradient methods for reference:

- [[REINFORCE]]
- [[Vanilla Policy Gradient (VPG)]]

Policy gradients also play a core role in actor-critic methods: [[Actor-Critic Methods Overview]].

---
# References

[[Grokking Deep Reinforcement Learning]]
[[Reinforcement learning_ an introduction]]