Reinforcement Learning from Human Feedback (RLHF) is the technique that transformed raw language models into the conversational AI assistants we use today. It is the reason ChatGPT feels like an assistant rather than a text autocomplete engine. RLHF aligns model behavior with human preferences, teaching the model not just what to say, but how to say it in a way humans find helpful, harmless, and honest.
Why RLHF Is Needed
Pre-training teaches a model to predict text. Instruction tuning teaches it to follow instructions. But neither directly optimizes for what humans actually want. Consider two responses to "Explain quantum mechanics":
- Response A: A technically accurate but dense paragraph copied from a textbook
- Response B: A clear, well-structured explanation with analogies, starting with the basics and building to more complex concepts
Both are "correct" from a next-token prediction perspective. But humans overwhelmingly prefer Response B. RLHF teaches the model to consistently produce the kind of output humans find most useful.
RLHF optimizes for what humans want, not just what is statistically likely. This subtle but crucial distinction is what makes AI assistants actually helpful rather than merely coherent.
The Three Steps of RLHF
Step 1: Supervised Fine-Tuning (SFT)
Before RLHF, the model is fine-tuned on high-quality human-written demonstrations. This gives the model a baseline of good behavior and response formatting. The SFT model serves as the starting point for RLHF optimization.
Step 2: Training the Reward Model
The core of RLHF is the reward model -- a neural network that predicts human preferences. Training it involves three steps:
- Generating multiple responses to the same prompt from the SFT model
- Presenting pairs of responses to human raters, who indicate which is better
- Training the reward model on these preference pairs using a ranking loss
The reward model learns to assign a scalar score to any (prompt, response) pair that predicts how much a human would prefer that response. It captures nuanced preferences: helpfulness, safety, truthfulness, appropriate tone, and good formatting.
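The ranking loss can be made concrete with a small sketch. The following is a minimal, illustrative Bradley-Terry-style pairwise loss, using plain scalar scores in place of a real neural reward model; the function name and toy values are assumptions for illustration:

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise ranking loss for one preference pair:
    -log(sigmoid(s_chosen - s_rejected)).
    The loss shrinks as the reward model scores the
    human-preferred response further above the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy scores: the loss is log(2) when the model cannot tell the
# responses apart, and approaches 0 as the margin grows.
tied = preference_loss(1.0, 1.0)
confident = preference_loss(2.0, 0.0)
```

Minimizing this loss over many labeled pairs pushes the scalar scores to agree with human rankings, without ever requiring humans to assign absolute scores.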
Building good preference data is expensive and requires careful annotation guidelines. Raters must be trained to evaluate not just factual accuracy but also helpfulness, potential harms, and response quality. Disagreements between raters are common and must be handled systematically.
Key Takeaway
The reward model is the bridge between human preferences and model optimization. It learns to score responses the way humans would, enabling automated evaluation at scale.
Step 3: Policy Optimization with PPO
The language model is then optimized using Proximal Policy Optimization (PPO), a reinforcement learning algorithm. The model generates responses, the reward model scores them, and the model's parameters are updated to produce higher-scoring responses.
A critical component is the KL divergence penalty, which prevents the model from straying too far from the SFT model. Without this constraint, the model might find ways to "hack" the reward model -- producing responses that score highly but are not actually helpful (a phenomenon called reward hacking).
objective = reward_model_score - beta * KL(policy || SFT_model)
The beta parameter controls the trade-off between optimizing the reward and staying close to the original model. Too little KL penalty leads to reward hacking; too much prevents meaningful improvement.
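The trade-off can be sketched numerically. This toy illustration assumes sequence-level log-probabilities are already available from both models and estimates the KL term as the sum of per-token log-probability differences; the function name and numbers are assumptions, not a production implementation:

```python
def kl_penalized_reward(reward, logprobs_policy, logprobs_sft, beta=0.1):
    """RLHF objective for one response:
    reward model score minus beta times an estimate of
    KL(policy || SFT), computed from per-token log-probs."""
    kl = sum(lp_p - lp_s for lp_p, lp_s in zip(logprobs_policy, logprobs_sft))
    return reward - beta * kl

# Toy two-token response: the policy assigns higher probability to
# its own tokens than the SFT model does, so the KL penalty is positive.
score = kl_penalized_reward(2.0, [-1.0, -0.5], [-1.5, -1.2], beta=0.1)  # 1.88
```

Raising beta shrinks the contribution of the raw reward score, keeping the policy closer to the SFT model; lowering it lets the policy chase the reward model more aggressively.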
DPO: Simplifying RLHF
Direct Preference Optimization (DPO), proposed by Rafailov et al. in 2023, eliminated the need for a separate reward model and RL training entirely. DPO directly optimizes the language model on preference data using a clever reformulation of the RLHF objective.
The insight is that the RLHF objective has a closed-form optimal policy, which lets the reward be rewritten as a function of the policy itself. Substituting this into the preference loss yields a simple classification objective over pairs of preferred and dispreferred responses, avoiding the instability and complexity of PPO.
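A minimal sketch of the resulting loss for a single preference pair, assuming sequence log-probabilities under the policy being trained and a frozen reference (SFT) model are given as plain floats; the function name is an assumption for illustration:

```python
import math

def dpo_loss(lp_chosen_pi, lp_rejected_pi,
             lp_chosen_ref, lp_rejected_ref, beta=0.1):
    """DPO loss for one (chosen, rejected) pair. The implicit reward
    of a response is beta * (log pi(y|x) - log pi_ref(y|x)); the loss
    is a logistic loss on the reward margin between the pair."""
    margin = ((lp_chosen_pi - lp_chosen_ref)
              - (lp_rejected_pi - lp_rejected_ref))
    logits = beta * margin
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy still matches the reference exactly, the margin is
# zero and the loss is log(2); it falls as the policy shifts
# probability mass toward the preferred response.
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

Note that no reward model appears anywhere: the preference pair and the two models' log-probabilities are all that is needed, which is the source of DPO's simplicity.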
DPO's advantages include:
- No reward model training needed
- No RL training loop (much simpler to implement)
- More stable optimization
- Lower compute requirements
DPO has been widely adopted and is now used by many organizations as an alternative to traditional RLHF.
Constitutional AI
Anthropic's Constitutional AI (CAI) offers yet another approach. Instead of relying on human preference ratings, CAI defines a set of principles (the "constitution") and uses the model itself to generate preference data based on those principles.
This has several advantages: the principles are explicit and can be publicly debated, the process scales without additional human labeling, and the model's values are derived from articulated principles rather than implicit preferences of individual raters.
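The AI-feedback step can be sketched as follows. This is a hypothetical illustration -- the `generate` function, the prompt format, and the two-principle constitution are assumptions for exposition, not Anthropic's actual implementation:

```python
import random

# A toy "constitution": explicit, debatable principles.
CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that is less likely to cause harm.",
]

def ai_preference_label(prompt, response_a, response_b, generate):
    """Ask the model itself which response better follows a principle
    drawn from the constitution. The answer ("A" or "B") becomes a
    preference label for training, replacing a human rating.
    `generate` is a hypothetical stand-in for a language model call."""
    principle = random.choice(CONSTITUTION)
    judgment = generate(
        f"Principle: {principle}\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    return judgment.strip()
```

The labels produced this way feed the same preference-learning machinery as RLHF or DPO; only the source of the preference signal changes.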
RLHF, DPO, and Constitutional AI are all solutions to the same fundamental problem: teaching language models to produce outputs that align with human values. Each approach has trade-offs in terms of cost, scalability, and the values it encodes.
Challenges and Open Problems
Despite their success, RLHF and its variants face several challenges:
- Reward hacking: Models finding ways to achieve high reward scores without genuinely helpful behavior
- Rater disagreement: Different humans have different preferences, and averaging them may not capture any individual's values well
- Evaluation difficulty: Measuring alignment is harder than measuring accuracy on benchmarks
- Sycophancy: RLHF can make models overly agreeable, telling users what they want to hear rather than what is true
- Cultural bias: Preferences from a specific group of raters may not generalize to all users
Key Takeaway
RLHF and its variants (DPO, Constitutional AI) are the key techniques that make LLMs helpful and safe. While challenges remain around reward hacking and preference representation, these alignment methods are what transform powerful but raw language models into the useful AI assistants we interact with daily.
The Future of Alignment
The alignment field is advancing rapidly. Research into scalable oversight explores how to align models that may become smarter than their human evaluators. Debate and recursive reward modeling propose methods where AI systems help humans evaluate AI outputs. Interpretability research aims to understand what models have actually learned, making alignment more verifiable.
RLHF was the innovation that made the AI assistant revolution possible. Understanding it is essential for anyone building, deploying, or thinking critically about AI systems.
