Policy Gradient
A class of RL algorithms that directly optimize the policy by estimating the gradient of expected reward.
Overview
Policy gradient methods are a family of reinforcement learning algorithms that directly optimize a parameterized policy by computing the gradient of expected cumulative reward with respect to the policy parameters. Unlike value-based methods such as Q-learning, which derive a policy from a learned value function, they learn the action-selection strategy itself.
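The gradient mentioned above is usually written via the policy gradient theorem in its score-function form (standard notation, not taken from this entry: τ is a trajectory sampled from the policy π_θ, and G_t is the return from time t):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
    \right]
```

Because the expectation is over trajectories generated by the current policy, it can be estimated from sampled episodes without differentiating through the environment dynamics.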
Key Details
The REINFORCE algorithm is the simplest policy gradient method, using Monte Carlo returns to estimate the gradient. More advanced variants reduce variance and stabilize training: actor-critic methods like A2C subtract a learned value baseline, while TRPO and PPO constrain each update with trust regions or clipped objectives. Policy gradients naturally handle continuous action spaces and stochastic policies, making them essential for robotics, game playing, and RLHF in language models.
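The REINFORCE update can be sketched in a few lines. Below is a minimal, self-contained toy (a hypothetical 2-armed bandit with a softmax policy, chosen here only for illustration): each step samples an action, observes a reward as the Monte Carlo return, and ascends the score-function gradient.

```python
import numpy as np

# REINFORCE sketch on a toy 2-armed bandit (assumed setup, not from the entry).
# Policy: softmax over one logit per action. Update: theta += lr * G * grad_log_pi.
rng = np.random.default_rng(0)

true_means = np.array([0.2, 0.8])  # arm 1 pays more on average
theta = np.zeros(2)                # policy logits (the parameters being learned)
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    G = rng.normal(true_means[a], 0.1)  # one-step episode: return = reward
    # Gradient of log pi(a) w.r.t. softmax logits: one-hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * G * grad_log_pi       # gradient ascent on expected return

print(softmax(theta))  # the policy should come to favor the better arm
```

Note the high variance of this estimator: every sampled return reinforces whichever action was taken, which is why the baseline and critic variants mentioned above matter in practice.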
Related Concepts
reinforcement learning • proximal policy optimization • actor critic