AI Glossary

Proximal Policy Optimization

A stable and efficient policy gradient algorithm that keeps each policy update close to the previous policy, approximating a trust region constraint through clipping.

Overview

Proximal Policy Optimization (PPO), introduced by OpenAI in 2017, is a policy gradient reinforcement learning algorithm. It achieves reliable performance by clipping the probability ratio between the new and old policies, which removes the incentive for destructively large policy changes. PPO approximates the trust region constraint of Trust Region Policy Optimization (TRPO) but is much simpler to implement, since it needs only first-order optimization.
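For reference, the clipped objective is usually written as follows. The symbols are not defined in this entry and are introduced here for illustration: r_t(θ) is the probability ratio between the new and old policies, Â_t is an advantage estimate, and ε is the clip range (values around 0.1 to 0.2 are common choices).

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[
  \min\left( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right)
\right]
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound: the policy gains nothing by pushing the ratio outside [1 − ε, 1 + ε] in the direction the advantage favors.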

Key Details

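PPO uses a clipped surrogate objective that discourages the new policy from moving too far from the old policy. Once the probability ratio between the two policies leaves a small interval around 1, further movement in that direction no longer improves the objective. This makes training more stable without the computational overhead of exact trust region methods, which require second-order optimization. PPO has been one of the most widely used algorithms for RLHF in large language model training (used in ChatGPT, Claude, etc.) and is also widely used in robotics, game AI, and autonomous systems.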
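The clipping step can be illustrated with a minimal sketch in plain Python. The function name `clipped_surrogate`, its parameters, and the default `clip_eps=0.2` are assumptions made for this example and are not taken from any particular library. A real implementation would compute this over batched tensors and add value-function and entropy terms.

```python
import math


def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    Hypothetical helper for illustration: inputs are plain lists of floats,
    one entry per sampled action.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic choice: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# A ratio of 1.5 with a positive advantage is capped at 1 + clip_eps.
gain_capped = clipped_surrogate([math.log(1.5)], [0.0], [1.0])

# A ratio of 0.5 with a negative advantage is evaluated at 1 - clip_eps.
loss_capped = clipped_surrogate([math.log(0.5)], [0.0], [-1.0])
```

In both cases the policy has already moved past the clip range in the direction the advantage favors, so the objective stops rewarding further movement. When the ratio is exactly 1, the objective reduces to the plain advantage.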

Related Concepts

RLHF, policy gradient, actor-critic

Last updated: March 5, 2026