// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

DPO

A simpler and more stable way to train AI models using human preferences, directly optimizing the model to prefer good outputs over bad ones without needing a separate reward model.

TECHNICAL DEFINITION

Direct Preference Optimization (DPO) is a method for aligning large language models (LLMs) with human preferences without a reinforcement learning loop. It optimizes the policy model directly, using a simple classification-style loss computed on a dataset of preferred and dispreferred response pairs, and so bypasses the need for an explicit reward model.
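For reference, the loss mentioned above is usually written as follows. The notation is not from this entry and is introduced here for illustration: x is a prompt, y_w and y_l are the preferred and dispreferred responses, \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} is a frozen reference model (typically the supervised fine-tuned checkpoint), \sigma is the logistic function, and \beta is a positive temperature hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the policy's log-probability of the preferred response, relative to the reference model, above that of the dispreferred response.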

BACKGROUND

Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., doing business as DeepSeek, is a Chinese artificial intelligence (AI) company that develops large language models (LLMs). Based in Hangzhou, Zhejiang, DeepSeek is owned and funded by High-Flyer, a Chinese hedge fund. DeepSeek was founded in July 2023 by Liang Wenfeng, who serves as the CEO for both of the companies. The company launched an eponymous chatbot alongside its DeepSeek-R1 model in January 2025.

SYNONYMS & ALIASES

Preference-based fine-tuning
Alignment without RL
Simplified alignment

USAGE NOTE

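DPO is typically cheaper and more stable to train than PPO-based RLHF, because it needs neither a separately trained reward model nor sampling from the policy during training. It does still require a frozen reference model and a dataset of preference pairs.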
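To make the efficiency claim concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes the summed log-probabilities of each response under the policy and under a frozen reference model have already been computed; the function names, argument names, and the default beta of 0.1 are illustrative choices, not part of any particular library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; tune per task
) -> float:
    """Per-example DPO loss from sequence log-probabilities.

    Each argument is log pi(y | x) summed over the response tokens,
    for the chosen (preferred) or rejected (dispreferred) response.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The scaled margin between the two log-ratios.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Binary cross-entropy with the preferred response as the positive class.
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The whole objective is one forward pass through each model followed by a log-sigmoid, with no reward model and no generation step. In practice the loss would be averaged over a batch and differentiated with an autodiff framework; this scalar version only shows the arithmetic.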

DEVELOPERS

Organizations developing technology related to DPO.

Google / Google DeepMind
Google DeepMind researchers have published follow-up work on preference optimization and continue to advance LLM alignment techniques, integrating them into the company's AI models.
Hugging Face
Their `trl` (Transformer Reinforcement Learning) library offers an open-source implementation of DPO, widely used for aligning large language models with human preferences in the broader AI community.
Meta AI
Actively researches and develops large language models (e.g., Llama series) and their alignment strategies, making DPO or similar preference optimization techniques a critical area of focus for improving model behavior.
OpenAI
As a leader in AI development and safety, OpenAI extensively researches and employs advanced alignment techniques, including preference-based optimization methods like DPO, to ensure their models behave as intended.
Anthropic
Known for its focus on AI safety and alignment, Anthropic develops and applies sophisticated fine-tuning methods, including DPO-like approaches, to build helpful, harmless, and honest AI systems.
Stanford University (Stanford AI Lab/NLP Group)
Researchers at Stanford were among the primary authors of the foundational paper on Direct Preference Optimization, actively developing and contributing to the theoretical and practical advancements of DPO.
Microsoft
Through Microsoft Research and Azure AI, the company invests in and integrates state-of-the-art LLM fine-tuning and alignment techniques, including DPO, for its enterprise AI offerings and internal models.
Together AI
Offers cloud platforms and services for training and fine-tuning open-source large language models, providing the infrastructure and tools necessary to implement and scale advanced alignment techniques such as DPO.
