alignment

Published: May 29, 2023

Direct preference optimization: Your language model is secretly a reward model

By Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Research TL;DR

"Proposes Direct Preference Optimization (DPO) as an alternative to PPO-based RLHF. Simplifies alignment by optimizing the policy directly from human preference data."

Abstract

We present Direct Preference Optimization (DPO), a stable, performant, and computationally lightweight algorithm for aligning LLMs with human preferences. DPO avoids the instability of traditional RLHF by optimizing the policy directly on preference data with a simple classification loss, without training an explicit reward model.
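To make "optimizing the policy directly from preference data" concrete, here is a minimal sketch of the per-pair DPO loss, assuming the summed log-probabilities of the preferred and dispreferred responses have already been computed under both the trainable policy and a frozen reference model. The function names, argument names, and the default `beta=0.1` are illustrative assumptions, not taken from this page.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z)) for a scalar z."""
    # Branch on the sign so that exp() is only called on non-positive values.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; controls deviation from the reference
) -> float:
    """DPO loss for one (prompt, chosen, rejected) preference pair.

    Each *_logp argument is the log-probability of a full response given the
    prompt. The implicit reward of a response is beta times the log-ratio of
    policy to reference, so no separate reward model is needed.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Difference of implicit rewards between chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)
    # Binary classification loss: push the margin to be positive.
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

In a real training loop the same expression would be computed on batched tensors and averaged, with gradients flowing only through the policy's log-probabilities; the reference model stays fixed.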

Read full paper on arXiv →
