RLHF Overview
A training technique that aligns language models with human preferences by using human feedback to train a reward model that guides further model optimization.
The Three Steps
1. Supervised fine-tuning (SFT): train the base model on high-quality human demonstrations.
2. Reward modeling: humans rank model outputs; a reward model is trained to predict these preferences.
3. RL optimization: use PPO or a similar algorithm to update the policy to maximize the reward model's score.
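The reward-modeling step (step 2) is commonly fit with a Bradley-Terry pairwise objective: the model is penalized whenever the human-preferred output does not score higher than the rejected one. A minimal sketch in plain Python (the function name is illustrative, not from any particular library):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Minimized when the reward model assigns a clearly higher score
    to the output that humans preferred.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the preferred output's reward pulls ahead:
tied = preference_loss(0.0, 0.0)    # no margin -> loss = log 2
clear = preference_loss(3.0, 0.0)   # clear margin -> much lower loss
assert clear < tied
```

In practice this loss is averaged over a batch of ranked pairs and backpropagated through a neural reward model; the scalar version above just shows the shape of the objective.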
Why RLHF Matters
Pre-trained LLMs can generate harmful, biased, or unhelpful text. RLHF teaches models to be helpful, harmless, and honest. It bridges the gap between 'can generate text' and 'generates text humans actually want'.
Alternatives
DPO (Direct Preference Optimization): reframes the preference objective so the policy is trained directly on ranked pairs, skipping the explicit reward model and the RL loop. Constitutional AI: replaces human feedback with AI feedback guided by a written set of principles (a "constitution"). RLAIF (RL from AI feedback): runs the RLHF pipeline with preference labels generated by an AI model instead of humans. These approaches target RLHF's main drawbacks: the complexity of RL training and the cost of human labeling.
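DPO's per-pair objective can be written directly in terms of policy and frozen-reference log-probabilities; the implicit reward is the beta-scaled log-ratio against the reference model. A sketch under those definitions (argument names are illustrative):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))).

    No separate reward model or PPO loop: the policy is optimized
    directly, with beta controlling how far it may drift from the
    frozen reference model.
    """
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy matches the reference exactly, the margin is zero
# and the loss sits at log 2; favoring the chosen response lowers it.
assert dpo_loss(-0.5, -2.5, -1.0, -2.0) < dpo_loss(-1.0, -2.0, -1.0, -2.0)
```

Real implementations compute the sequence log-probabilities from the two models and average this loss over a batch, but the pairwise form above is the whole optimization target.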