RLHF (Reinforcement Learning from Human Feedback)
A training methodology that aligns language models with human preferences: human feedback is used to train a reward model, and the language model is then optimized against that reward model with reinforcement learning.
The Three Stages
1. Supervised fine-tuning (SFT): Fine-tune a base model on high-quality demonstrations.
2. Reward modeling: Train a model to predict human preferences from comparison data.
3. RL optimization: Use PPO to optimize the SFT model against the reward model.
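The core math of stages 2 and 3 can be sketched in a few lines. Stage 2 typically uses the pairwise Bradley-Terry objective (the loss is low when the reward model scores the chosen response above the rejected one), and stage 3 optimizes a reward that subtracts a KL penalty to keep the policy close to the SFT reference. A minimal scalar sketch, with illustrative function names; real implementations operate on batched tensors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    """Stage 2: Bradley-Terry pairwise loss, -log sigmoid(r_chosen - r_rejected).

    Minimized when the reward model scores the human-preferred
    response well above the rejected one.
    """
    return -math.log(sigmoid(r_chosen - r_rejected))

def rl_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    """Stage 3: reward fed to PPO, the RM score minus a KL-style penalty
    (beta is a tunable coefficient) that discourages the policy from
    drifting far from the SFT reference model.
    """
    return rm_score - beta * (logp_policy - logp_ref)
```

Note how the two pieces connect: the reward model trained with the first loss supplies `rm_score` in the second, and the penalty term is what prevents the policy from "reward hacking" its way into degenerate text that the reward model happens to score highly.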
Why RLHF Changed Everything
Raw language models are good at predicting text but not at being helpful, harmless, and honest. RLHF bridged this gap: it is the technique that turned GPT-3-class base models into instruction-following assistants such as InstructGPT and ChatGPT, and it remains central to making LLMs useful as assistants.
Alternatives
DPO (optimizes preferences directly, no explicit reward model), RLAIF (AI-generated feedback instead of human labels), Constitutional AI (principle-based self-critique and revision), and KTO (learns from unpaired good/bad labels rather than preference pairs). The field is actively exploring which alignment methods work best.
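DPO's simplification is that the preference loss is expressed directly in terms of policy and reference log-probabilities, so no separate reward model or RL loop is needed. A minimal scalar sketch of the DPO loss; the argument names are illustrative, and real implementations use batched sequence log-probabilities:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on a single preference pair.

    The implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x));
    the loss is the Bradley-Terry loss on the margin between the implicit
    rewards of the chosen and rejected responses. No explicit reward model
    is ever trained.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy already favors the chosen response more than the reference does, the margin is positive and the loss drops below log 2; gradient descent on this loss shifts probability mass toward preferred responses while the reference terms anchor the policy, playing the same role as RLHF's KL penalty.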