// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM
DPO
A simpler and more stable alternative to RLHF for training AI models on human preferences: it directly optimizes the model to prefer good outputs over bad ones, with no separate reward model.
TECHNICAL DEFINITION
Direct Preference Optimization (DPO) is a reinforcement-learning-free method for aligning large language models (LLMs) with human preferences. It optimizes the policy model directly with a binary classification-style loss over a dataset of preferred and dispreferred response pairs, measured relative to a frozen reference model, and needs no explicit reward model.
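For reference, the loss described above is usually written as follows. The notation is the common convention rather than anything defined in this entry: \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} is a frozen reference model, (x, y_w, y_l) is a prompt with its preferred and dispreferred responses drawn from the preference dataset \mathcal{D}, \sigma is the logistic function, and \beta is a temperature-like hyperparameter that controls how far the policy may drift from the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this raises the policy's log-probability of the preferred response relative to the reference while lowering that of the dispreferred one.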
BACKGROUND
Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., doing business as DeepSeek, is a Chinese artificial intelligence (AI) company that develops large language models (LLMs). Based in Hangzhou, Zhejiang, DeepSeek is owned and funded by High-Flyer, a Chinese hedge fund. DeepSeek was founded in July 2023 by Liang Wenfeng, the co-founder of High-Flyer, who also serves as CEO of both companies. The company launched an eponymous chatbot alongside its DeepSeek-R1 model in January 2025.
SYNONYMS & ALIASES
- Preference-based fine-tuning
- Alignment without RL
- Simplified alignment
USAGE NOTE
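DPO is a computationally lighter and more stable alternative to RLHF for preference alignment: training needs only the policy and a frozen reference model, with no separate reward model and no sampling from the policy during optimization.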
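To make the efficiency point concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the summed sequence log-probabilities of each response have already been computed under both the policy and the frozen reference model. The function names `log_sigmoid`, `dpo_loss`, and `mean_dpo_loss` and the default `beta=0.1` are illustrative choices, not part of any particular library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign of z so that exp() never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default; tuned per task in practice
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The scaled margin plays the role of an implicit reward difference.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


def mean_dpo_loss(pairs, beta: float = 0.1) -> float:
    """Average the loss over an iterable of 4-tuples of log-probabilities."""
    losses = [dpo_loss(*pair, beta=beta) for pair in pairs]
    return sum(losses) / len(losses)
```

When the policy equals the reference, the margin is zero and the loss is log 2. It falls below that as the policy shifts probability toward the preferred response. In a real training loop the same arithmetic would run on batched tensors with gradients flowing only through the policy terms.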
DEVELOPERS
Organizations developing technology related to DPO.
They were key contributors to the original DPO research paper and continue to advance LLM alignment techniques, integrating them into their cutting-edge AI models.
Hugging Face's `trl` (Transformer Reinforcement Learning) library offers an open-source implementation of DPO that is widely used to align large language models with human preferences.
Meta actively researches and develops large language models (e.g., the Llama series) and their alignment strategies, making DPO and similar preference optimization techniques a key area of focus for improving model behavior.
As a leader in AI development and safety, OpenAI extensively researches and employs advanced alignment techniques, including preference-based optimization methods like DPO, to ensure their models behave as intended.
Known for its focus on AI safety and alignment, Anthropic develops and applies sophisticated fine-tuning methods, including DPO-like approaches, to build helpful, harmless, and honest AI systems.
Researchers at Stanford were among the primary authors of the foundational paper on Direct Preference Optimization, actively developing and contributing to the theoretical and practical advancements of DPO.
Through Microsoft Research and Azure AI, the company invests in and integrates state-of-the-art LLM fine-tuning and alignment techniques, including DPO, for its enterprise AI offerings and internal models.
Offers cloud platforms and services for training and fine-tuning open-source large language models, providing the infrastructure and tools necessary to implement and scale advanced alignment techniques such as DPO.