alignment
Published: May 29, 2023
Direct preference optimization: Your language model is secretly a reward model
By Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Research TL;DR
"Proposes Direct Preference Optimization (DPO) as an alternative to PPO-based RLHF. Simplifies alignment by optimizing the policy directly from human preference data."
Abstract
We present Direct Preference Optimization (DPO), a stable, performant, and computationally lightweight algorithm for steering LLMs to align with human preferences. DPO avoids the instability of traditional RLHF by optimizing the policy directly on preference data with a simple classification-style loss, without training an explicit reward model.
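To make "optimizing the policy directly from preference data" concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses are already available from both the policy being trained and a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from this page.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (illustrative sketch, hypothetical names).

    Each *_logp argument is the total log-probability a model assigns to
    one response given the prompt. beta scales how strongly the policy is
    pushed away from the reference model; 0.1 is an assumed default.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin: positive when the policy favors the
    # chosen response more than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a
    # numerically stable way for both signs of the margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy raises the chosen response and lowers the rejected one:
# the margin becomes positive and the loss drops below the baseline.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

No separate reward model appears anywhere in this computation: the log-ratio between the policy and the reference plays that role implicitly, which is the sense of the paper's subtitle. In practice the same expression would be averaged over a batch and minimized with a gradient-based optimizer in a tensor library.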