Group Relative Policy Optimization
A reinforcement learning algorithm for LLMs that uses group-level relative rewards without a critic model.
Overview
Group Relative Policy Optimization (GRPO), introduced by DeepSeek, is a reinforcement learning algorithm designed for training large language models. Unlike PPO, which requires a separate critic (value) model to estimate baselines, GRPO samples a group of outputs for each prompt and computes each output's advantage relative to the group's mean reward, normalized by the group's standard deviation.
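The group-relative baseline described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation; the function name and the zero-std guard are choices made here for clarity:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage estimate: normalize each sampled output's
    reward by the mean and standard deviation of its group, so no
    learned critic/value model is needed as a baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one prompt, scored by a reward model:
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Outputs scoring above the group mean get positive advantages, those below get negative ones, and the normalization keeps the scale comparable across prompts with very different reward magnitudes.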
Key Details
Dropping the critic model saves significant memory and compute while maintaining stable training. GRPO was a key technique in training DeepSeek-R1, which achieved strong reasoning performance. Because rewards are normalized within each group, the algorithm naturally handles varying reward scales across different prompts. It reflects a broader trend toward simpler, more efficient RL algorithms for LLM post-training.
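GRPO plugs these group-relative advantages into a PPO-style clipped surrogate objective. The sketch below shows that per-token loss under assumed log-probabilities; the function name and default `clip_eps` value are illustrative choices, not taken from the source:

```python
import math

def grpo_token_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate loss for one token, as in PPO, but with the
    advantage coming from group normalization rather than a critic.
    Returns a value to be minimized (hence the leading negation)."""
    ratio = math.exp(logp_new - logp_old)          # importance ratio pi_new / pi_old
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    return -min(unclipped, clipped)                # pessimistic (clipped) objective

# If the policy is unchanged (ratio = 1), the loss is just -advantage:
loss = grpo_token_loss(logp_new=-1.5, logp_old=-1.5, advantage=1.0)  # -> -1.0
```

The clipping keeps each update close to the old policy; in practice GRPO also adds a KL penalty toward a reference model, omitted here for brevity.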
Related Concepts
proximal policy optimization • rlhf • direct preference optimization