Direct Preference Optimization
An alignment technique that optimizes language models from human preferences without training a separate reward model.
Overview
Direct Preference Optimization (DPO) is an alternative to RLHF for aligning language models with human preferences. Instead of training a separate reward model and then fine-tuning the policy with reinforcement learning (typically PPO), DPO directly optimizes the language model policy using a simple classification loss on preference pairs (chosen vs. rejected responses).
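A preference pair couples one prompt with two candidate responses, one labeled as preferred. A minimal sketch of such a record is shown below; the field names (prompt, chosen, rejected) are illustrative, since datasets vary in their exact schema.

```python
# Illustrative preference-pair record; field names are assumptions,
# not a fixed standard -- real datasets use varying schemas.
preference_pair = {
    "prompt": "Explain photosynthesis in one sentence.",
    "chosen": (
        "Photosynthesis is the process by which plants convert light "
        "energy into chemical energy stored as sugars."
    ),
    "rejected": "Plants eat sunlight.",
}
```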
Key Details
DPO reformulates the RLHF objective as a simple binary classification loss over log-probability ratios: it increases the probability of preferred responses and decreases the probability of dispreferred ones, with an implicit KL constraint (controlled by a temperature parameter, usually written β) that prevents the model from deviating too far from the reference policy. DPO is simpler to implement, more stable to train, and computationally cheaper than RLHF, while achieving comparable alignment quality. It has become widely adopted for post-training alignment of LLMs.
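The per-pair loss described above can be sketched in a few lines. This is a minimal scalar version, assuming each argument is the summed log-probability of a full response under either the trainable policy or the frozen reference model; a real implementation would operate on batched tensors.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    Each argument is a summed log-probability of a full response under
    the policy or the frozen reference model. `beta` sets the strength
    of the implicit KL constraint toward the reference policy.
    """
    # Log-ratios of policy vs. reference for each response.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Margin is positive when the policy has shifted probability mass
    # toward the chosen response relative to the rejected one.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x)).
    return math.log1p(math.exp(-margin))
```

When the policy still matches the reference, the margin is zero and the loss equals log 2; as the policy learns to favor chosen over rejected responses, the margin grows and the loss falls toward zero.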