Direct Preference Optimization (DPO)
An alignment technique that trains language models directly on human preference data without needing a separate reward model, simplifying the RLHF pipeline.
How It Differs from RLHF
Traditional RLHF requires training a separate reward model on preference data, then using reinforcement learning (typically PPO) to optimize the policy against it. DPO reformulates alignment as a simple classification problem: given pairs of preferred and rejected responses, it directly updates the model to increase the relative likelihood of preferred outputs, with the reward implicitly defined by the policy itself.
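The classification-style objective above can be sketched in a few lines. This is a minimal illustration of the DPO loss for a single preference pair, assuming you already have sequence-level log-probabilities from the policy being trained and from a frozen reference model; the function name and `beta` default are illustrative, not from any specific library.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the total log-probability of a full response
    under the policy or the frozen reference model.
    """
    # Implicit rewards: log-ratio of policy to reference probabilities.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # Binary-classification margin, scaled by beta (KL-strength knob).
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)): small when the policy favors the chosen
    # response more than the reference does, large otherwise.
    return math.log(1.0 + math.exp(-margin))

# When policy and reference agree exactly, the loss is log(2) (~0.693);
# shifting probability mass toward the chosen response lowers it.
neutral = dpo_loss(-11.0, -11.0, -11.0, -11.0)
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

Minimizing this loss over a dataset of preference pairs is the entire training loop: no reward-model fitting and no RL rollout step.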
Advantages
Simpler implementation (no reward model or RL loop). More stable training. Lower computational cost. Achieves results comparable to or better than RLHF on many benchmarks.
Usage
DPO and its variants (IPO, KTO, ORPO) have become the preferred alignment method for many open-source LLM projects due to their simplicity and effectiveness.