AI Glossary

Policy Gradient

A class of RL algorithms that directly optimize the policy by estimating the gradient of expected reward.

Overview

Policy gradient methods are a family of reinforcement learning algorithms that directly optimize a parameterized policy by computing the gradient of expected cumulative reward with respect to the policy parameters. Unlike value-based methods such as Q-learning, which derive a policy indirectly from learned value estimates, policy gradient methods learn the action-selection strategy itself.
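The gradient being estimated can be written via the policy gradient theorem; a common (REINFORCE-style) form is:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
    \right]
```

where J(θ) is the expected return under policy π_θ, trajectories τ are sampled by following the policy, and G_t is the return from time step t onward. Because the expectation is over the policy's own trajectories, it can be estimated from sampled rollouts without a model of the environment.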

Key Details

The REINFORCE algorithm is the simplest policy gradient method, using Monte Carlo returns to estimate the gradient. More advanced variants improve stability: A2C uses a learned critic as a baseline to reduce variance, while TRPO and PPO constrain each policy update with trust regions or clipped objectives. Policy gradients naturally handle continuous action spaces and stochastic policies, making them essential for robotics, game playing, and RLHF in language models.
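As a minimal sketch of the REINFORCE update, the following toy example trains a softmax policy on a hypothetical 3-armed bandit using only numpy. The arm reward means, learning rate, and running baseline are illustrative assumptions, not part of any standard benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])  # assumed expected reward per arm
theta = np.zeros(3)                     # policy parameters (logits)
alpha = 0.1                             # learning rate (assumed)
baseline = 0.0                          # running average reward, reduces variance

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)              # sample action from the policy
    r = rng.normal(true_means[a], 0.1)      # one-step Monte Carlo return
    baseline += 0.01 * (r - baseline)       # update baseline estimate
    grad_log = -probs                       # gradient of log softmax ...
    grad_log[a] += 1.0                      # ... is (one-hot(a) - probs)
    theta += alpha * (r - baseline) * grad_log  # REINFORCE update

print(np.argmax(softmax(theta)))  # the policy should come to favor the best arm
```

Note the update pushes up the log-probability of actions whose reward beat the baseline and pushes it down otherwise; this is the same signal REINFORCE uses with full episodic returns in larger problems.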

Related Concepts

reinforcement learning, proximal policy optimization, actor critic


Last updated: March 5, 2026