Group Relative Policy Optimization
A reinforcement learning algorithm for LLMs that uses group-level relative rewards without a critic model.
Overview
Group Relative Policy Optimization (GRPO), introduced by DeepSeek, is a reinforcement learning algorithm designed for training large language models. Unlike PPO, which requires a separate critic (value) model to estimate baselines, GRPO samples a group of outputs for each prompt and computes each output's advantage relative to the group's mean reward, normalized by the group's standard deviation.
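The group-relative baseline described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation; the function name and the zero-std guard are choices made here for clarity:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage estimate: normalize each sampled output's
    reward by the mean and standard deviation of its group, so no
    learned critic/value model is needed as a baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one prompt, scored by a reward model:
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Outputs scoring above the group mean get positive advantages, those below get negative ones, and the normalization keeps the scale comparable across prompts with very different reward magnitudes.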
Key Details
Dropping the critic model saves significant memory and compute while maintaining stable training. GRPO was a key technique in training DeepSeek-R1, which achieved strong reasoning performance. Because rewards are normalized within each group, the algorithm naturally handles varying reward scales across different prompts. It reflects a broader trend toward simpler, more efficient RL algorithms for LLM post-training.
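GRPO plugs these group-relative advantages into a PPO-style clipped surrogate objective. The sketch below shows that per-token loss under assumed log-probabilities; the function name and default `clip_eps` value are illustrative choices, not taken from the source:

```python
import math

def grpo_token_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate loss for one token, as in PPO, but with the
    advantage coming from group normalization rather than a critic.
    Returns a value to be minimized (hence the leading negation)."""
    ratio = math.exp(logp_new - logp_old)          # importance ratio pi_new / pi_old
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    return -min(unclipped, clipped)                # pessimistic (clipped) objective

# If the policy is unchanged (ratio = 1), the loss is just -advantage:
loss = grpo_token_loss(logp_new=-1.5, logp_old=-1.5, advantage=1.0)  # -> -1.0
```

The clipping keeps each update close to the old policy; in practice GRPO also adds a KL penalty toward a reference model, omitted here for brevity.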
Related Concepts
proximal policy optimization • rlhf • direct preference optimization