RLHF Overview
A training technique that aligns language models with human preferences by using human feedback to train a reward model that guides further model optimization.
The Three Steps
1. Supervised fine-tuning (SFT): train the base model on high-quality human demonstrations.
2. Reward modeling: humans rank model outputs; a reward model is trained to predict these preferences.
3. RL optimization: use PPO or a similar algorithm to update the policy to maximize the reward model's score.
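The reward-modeling step (step 2) is commonly fit with a Bradley-Terry pairwise objective: the model is penalized whenever the human-preferred output does not score higher than the rejected one. A minimal sketch in plain Python (the function name is illustrative, not from any particular library):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Minimized when the reward model assigns a clearly higher score
    to the output that humans preferred.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the preferred output's reward pulls ahead:
tied = preference_loss(0.0, 0.0)    # no margin -> loss = log 2
clear = preference_loss(3.0, 0.0)   # clear margin -> much lower loss
assert clear < tied
```

In practice this loss is averaged over a batch of ranked pairs and backpropagated through a neural reward model; the scalar version above just shows the shape of the objective.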
Why RLHF Matters
Pre-trained LLMs can generate harmful, biased, or unhelpful text. RLHF teaches models to be helpful, harmless, and honest. It bridges the gap between 'can generate text' and 'generates text humans actually want'.
Alternatives
DPO (Direct Preference Optimization): reframes the preference objective so the policy is trained directly on ranked pairs, skipping the explicit reward model and the RL loop. Constitutional AI: replaces human feedback with AI feedback guided by a written set of principles (a "constitution"). RLAIF (RL from AI feedback): runs the RLHF pipeline with preference labels generated by an AI model instead of humans. These approaches target RLHF's main drawbacks: the complexity of RL training and the cost of human labeling.
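DPO's per-pair objective can be written directly in terms of policy and frozen-reference log-probabilities; the implicit reward is the beta-scaled log-ratio against the reference model. A sketch under those definitions (argument names are illustrative):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))).

    No separate reward model or PPO loop: the policy is optimized
    directly, with beta controlling how far it may drift from the
    frozen reference model.
    """
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy matches the reference exactly, the margin is zero
# and the loss sits at log 2; favoring the chosen response lowers it.
assert dpo_loss(-0.5, -2.5, -1.0, -2.0) < dpo_loss(-1.0, -2.0, -1.0, -2.0)
```

Real implementations compute the sequence log-probabilities from the two models and average this loss over a batch, but the pairwise form above is the whole optimization target.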