// MODEL OPTIMIZATION AND PROMPT SYNTAX TERM

DPO

A simpler and more stable alternative to RLHF for training AI models on human preferences: it directly optimizes the model to prefer good outputs over bad ones, without needing a separate reward model.

TECHNICAL DEFINITION

Direct Preference Optimization (DPO) is a reinforcement-learning-free method for aligning large language models (LLMs) with human preferences. It optimizes the policy model directly, using a simple classification-style loss computed over a dataset of preferred and dispreferred response pairs, and so bypasses the need for an explicit reward model.
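
For reference, the loss mentioned above is usually written as follows. The notation is taken from the original DPO paper, not from this entry: \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} is a frozen reference model (typically the supervised fine-tuned starting checkpoint), (x, y_w, y_l) is a prompt with its preferred and dispreferred responses, \sigma is the logistic function, and \beta is a temperature-like hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the policy's likelihood of the preferred response relative to the reference model and lowers it for the dispreferred one.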

BACKGROUND

Direct Preference Optimization was introduced in 2023 by Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn in the paper "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model", presented at NeurIPS 2023. The paper shows that the KL-constrained reward-maximization objective used in reinforcement learning from human feedback (RLHF) has a closed-form optimal policy, so the language model itself can act as an implicit reward model and be trained on preference pairs with a single supervised-style loss.

SYNONYMS & ALIASES

  • Preference-based fine-tuning
  • Alignment without RL
  • Simplified alignment

USAGE NOTE

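DPO offers a computationally lighter and more stable alternative to RLHF for preference alignment, because training requires neither a separately trained reward model nor sampling from the policy inside a reinforcement learning loop.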
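
To illustrate why DPO is comparatively light, here is a minimal sketch of the per-pair loss in plain Python. It assumes the summed token log-probabilities of each response have already been computed under both the policy and a frozen reference model. The `PairLogProbs` container, the function names, and the default `beta=0.1` are illustrative assumptions, not part of any particular library.

```python
import math
from dataclasses import dataclass


@dataclass
class PairLogProbs:
    """Summed token log-probs for one preference pair (hypothetical container)."""
    policy_chosen: float    # log pi_theta(y_w | x)
    policy_rejected: float  # log pi_theta(y_l | x)
    ref_chosen: float       # log pi_ref(y_w | x)
    ref_rejected: float     # log pi_ref(y_l | x)


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pair: PairLogProbs, beta: float = 0.1) -> float:
    """DPO loss for a single preference pair."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = pair.policy_chosen - pair.ref_chosen
    rejected_logratio = pair.policy_rejected - pair.ref_rejected
    # Scaled margin between the preferred and dispreferred response.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


def batch_dpo_loss(pairs: list[PairLogProbs], beta: float = 0.1) -> float:
    """Mean DPO loss over a batch of preference pairs."""
    return sum(dpo_loss(p, beta) for p in pairs) / len(pairs)
```

Only four log-probabilities per pair are needed, and all of them come from ordinary forward passes. When the policy still equals the reference model the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the preferred response.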

DEVELOPERS

Organizations developing technology related to DPO.

  • Google / Google DeepMind

    Researches and advances LLM alignment techniques, including preference-optimization methods related to DPO, and integrates them into its AI models.

  • Hugging Face

    Their `trl` (Transformer Reinforcement Learning) library offers an open-source implementation of DPO, widely used for aligning large language models with human preferences in the broader AI community.

  • Meta AI

    Actively researches and develops large language models (e.g., Llama series) and their alignment strategies, making DPO or similar preference optimization techniques a critical area of focus for improving model behavior.

  • OpenAI

    Researches and applies alignment techniques, including preference-based optimization methods like DPO, with the aim of making its models behave as intended.

  • Anthropic

    Known for its focus on AI safety and alignment, Anthropic develops and applies sophisticated fine-tuning methods, including DPO-like approaches, to build helpful, harmless, and honest AI systems.

  • Stanford University (Stanford AI Lab/NLP Group)

    Researchers at Stanford were among the primary authors of the foundational paper on Direct Preference Optimization, actively developing and contributing to the theoretical and practical advancements of DPO.

  • Microsoft

    Through Microsoft Research and Azure AI, the company invests in and integrates state-of-the-art LLM fine-tuning and alignment techniques, including DPO, for its enterprise AI offerings and internal models.

  • Together AI

    Offers cloud platforms and services for training and fine-tuning open-source large language models, providing the infrastructure and tools necessary to implement and scale advanced alignment techniques such as DPO.

RELATED TERMS IN PROMPTING & LOGIC