RLHF (Reinforcement Learning from Human Feedback)
A training methodology that aligns language models with human preferences: human feedback is used to train a reward model, and the language model is then optimized against that reward model with reinforcement learning.
The Three Stages
1. Supervised fine-tuning (SFT): Fine-tune a base model on high-quality demonstrations.
2. Reward modeling: Train a model to predict human preferences from comparison data.
3. RL optimization: Use PPO to optimize the SFT model against the reward model.
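The core math of stages 2 and 3 can be sketched in a few lines. Stage 2 typically uses the pairwise Bradley-Terry objective (the loss is low when the reward model scores the chosen response above the rejected one), and stage 3 optimizes a reward that subtracts a KL penalty to keep the policy close to the SFT reference. A minimal scalar sketch, with illustrative function names; real implementations operate on batched tensors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    """Stage 2: Bradley-Terry pairwise loss, -log sigmoid(r_chosen - r_rejected).

    Minimized when the reward model scores the human-preferred
    response well above the rejected one.
    """
    return -math.log(sigmoid(r_chosen - r_rejected))

def rl_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    """Stage 3: reward fed to PPO, the RM score minus a KL-style penalty
    (beta is a tunable coefficient) that discourages the policy from
    drifting far from the SFT reference model.
    """
    return rm_score - beta * (logp_policy - logp_ref)
```

Note how the two pieces connect: the reward model trained with the first loss supplies `rm_score` in the second, and the penalty term is what prevents the policy from "reward hacking" its way into degenerate text that the reward model happens to score highly.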
Why RLHF Changed Everything
Raw language models are good at predicting text but not at being helpful, harmless, and honest. RLHF bridged this gap: it is the technique that turned GPT-3-class base models into instruction-following assistants such as InstructGPT and ChatGPT, and it remains central to making LLMs useful as assistants.
Alternatives
DPO (optimizes preferences directly, no explicit reward model), RLAIF (AI-generated feedback instead of human labels), Constitutional AI (principle-based self-critique and revision), and KTO (learns from unpaired good/bad labels rather than preference pairs). The field is actively exploring which alignment methods work best.
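DPO's simplification is that the preference loss is expressed directly in terms of policy and reference log-probabilities, so no separate reward model or RL loop is needed. A minimal scalar sketch of the DPO loss; the argument names are illustrative, and real implementations use batched sequence log-probabilities:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on a single preference pair.

    The implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x));
    the loss is the Bradley-Terry loss on the margin between the implicit
    rewards of the chosen and rejected responses. No explicit reward model
    is ever trained.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy already favors the chosen response more than the reference does, the margin is positive and the loss drops below log 2; gradient descent on this loss shifts probability mass toward preferred responses while the reference terms anchor the policy, playing the same role as RLHF's KL penalty.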