AI Glossary

Direct Preference Optimization (DPO)

An alignment technique that trains language models directly on human preference data without needing a separate reward model, simplifying the RLHF pipeline.

How It Differs from RLHF

Traditional RLHF requires training a separate reward model on preference data, then using reinforcement learning (typically PPO) to optimize the policy against it. DPO reformulates this as a simple classification problem: given a preferred and a rejected response, it directly updates the model to increase the likelihood of the preferred output relative to the rejected one, while a frozen reference model keeps the policy from drifting too far from its starting point.
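The classification view above can be sketched as a loss over per-pair log-probabilities. This is a minimal illustration, not a production implementation: `beta` is the standard DPO temperature controlling deviation from the reference model, and the four log-probability arguments are assumed to be summed token log-probs for each full response.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    # Implicit rewards: how much more likely each response is under the
    # policy than under the frozen reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Binary classification loss: push the margin to favor the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy still matches the reference, the margin is zero and the loss is log 2; as the policy learns to prefer the chosen response, the margin grows and the loss falls toward zero. In practice this is computed over batches with a framework such as PyTorch, but the gradient signal is the same.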

Advantages

Simpler implementation (no separate reward model or RL loop), more stable training, and lower computational cost, while achieving results comparable to or better than RLHF on many benchmarks.

Usage

DPO and its variants (IPO, KTO, ORPO) have become the preferred alignment method for many open-source LLM projects due to their simplicity and effectiveness.

Last updated: March 5, 2026