Direct Preference Optimization, Intuitively Explained

January 30, 2024

The Top Secret Behind Effective LLM Training in 2024

Large-scale unsupervised language models (LMs) have shown remarkable capabilities in understanding and generating human-like text. However, achieving precise control over their behavior poses a significant challenge due to the unsupervised nature of their training. Existing methods rely on collecting human labels to fine-tune LMs, often through reinforcement learning from human feedback (RLHF). Here’s what this article contains:

  1. The Limitations of RLHF — Reinforcement Learning from Human Feedback
  2. The DPO Architecture & Why It’s So Useful
  3. A 5-Step Guide to Building Your DPO LLM

Current State of LLM Development

Who should read this?

Who is this blog post useful for? ML engineers working with LLMs, tech enthusiasts, VCs, etc.

How advanced is this post? Anyone already acquainted with basic ML terminology should be able to follow along.

Replicate my code here: https://github.com/Timothy102/llm-course or through Google Colab

PPO = Separate Reward and LLM Policies

PPO stands for Proximal Policy Optimization, a standard algorithm for solving RL problems. In RLHF, it works by first fitting a reward model to human preference feedback and then fine-tuning the unsupervised LLM to maximize that learned reward.

  1. The LLM is used to generate completions for a batch of prompts.
  2. The policy loss is calculated from the reward model’s scores on those completions.
  3. Small updates to the model are applied and checked.
  4. The updates are kept within the “trust” region, so a single step can never move the policy too far.

Image by: GenAI with LLMs Course

The fundamental principle of PPO centers on implementing gradual, incremental adjustments to the policy. This approach aims to avoid instability or the emergence of suboptimal solutions that might result from larger updates. However, practical experience indicates that this technique remains susceptible to instability, with potential divergence in loss. Additionally, it poses challenges in terms of reproducibility due to the involvement of numerous hyperparameters and sensitivity to random seeds. Moreover, its computational demands are considerable, further complicating its practical implementation.
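To make the “trust region” idea concrete, here is a minimal sketch of PPO’s clipped surrogate objective in PyTorch. This is my own illustration rather than code from a particular RLHF library; the tensor names and the clip range of 0.2 are assumptions.

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss from PPO.

    log_probs_new: log-probabilities of the generated tokens under the current policy
    log_probs_old: log-probabilities under the policy that generated the completions
    advantages:    advantage estimates derived from the reward model (plus a baseline)
    """
    # Probability ratio between the new and old policy.
    ratio = torch.exp(log_probs_new - log_probs_old)

    # Unclipped and clipped surrogate objectives.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Taking the element-wise minimum keeps updates inside the "trust" region.
    return -torch.min(unclipped, clipped).mean()
```

Clipping the probability ratio means that even when the advantage estimate is large, a single update cannot push the policy far away from the one that produced the data.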

A Human Model — At Last! — DPO = Implicit Human Preferences Built into the Trained Model

Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses and then use RL to find a policy that maximizes the learned reward.

Image by: DPO Paper

In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.

This is where Direct Preference Optimization (DPO) comes into play. DPO simplifies control by treating the task as a classification problem. Concretely, it uses two models: the trained model (or policy model) and a copy of it called the reference model. During training, the goal is to make sure the trained model outputs higher probabilities for preferred answers than the reference model. Conversely, we also want it to output lower probabilities for rejected answers. It means we’re penalizing the LLM for bad answers and rewarding it for good ones.

Image by Author

Unlike traditional RLHF:

  1. DPO optimizes for human preferences while avoiding reinforcement learning.
  2. DPO does not require extensive sampling from the LLM during fine-tuning, streamlining the entire process.
  3. Direct Preference Optimization (DPO) simplifies the RLHF process and makes it stable, efficient, and computationally lightweight.

By using the language model itself as an implicit reward model and a simple binary cross-entropy objective, Direct Preference Optimization (DPO) aligns the model’s outputs with human preferences. This is achieved without exhaustive sampling, separate reward model fitting, or intricate hyperparameter adjustments. The outcome is a notably more stable, efficient, and computationally lightweight process.
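To see what that binary cross-entropy-style objective looks like in practice, here is a simplified sketch of the DPO loss in PyTorch. It is my own illustration operating on pre-computed sequence log-probabilities, not the TRL implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Simplified DPO loss over summed sequence log-probabilities.

    Each argument is a tensor with one entry per (prompt, answer) pair.
    beta controls the strength of the implicit KL constraint to the reference model.
    """
    # Implicit rewards: how much more likely each answer is under the policy
    # than under the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary cross-entropy style objective: push chosen answers above rejected ones.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Maximizing the log-sigmoid of the reward gap is exactly a binary classification objective: the “label” is simply which of the two answers the human preferred.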

The 3-Step RLHF Pipeline That DPO Simplifies

  1. Supervised fine-tuning (SFT): RLHF typically begins by fine-tuning a pre-trained LM with supervised learning on high-quality data for the downstream task(s) of interest (dialogue, summarization, etc.) to obtain a model π^SFT.
  2. Reward Modelling Phase: In the second phase, the SFT model is prompted with prompts x to produce pairs of answers (y1, y2) ∼ π^SFT(y | x). These are then presented to human labelers who express a preference for one answer, denoted yw ≻ yl | x, where yw and yl denote the preferred and dispreferred completion amongst (y1, y2).
  3. RL Optimization: During the RL phase, we use the learned reward function to provide feedback to the language model. In particular, we formulate the following optimization problem:

Image by: DPO Paper
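For reference, the optimization problem referenced above is the standard KL-constrained reward maximization; transcribed into LaTeX (my rendering of the equation from the DPO paper), it reads:

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(y \mid x)}\!\left[ r_\phi(x, y) \right]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \right]
```

Here β controls how far the fine-tuned policy π_θ may drift from the reference policy π_ref while chasing the learned reward r_φ.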

Intuitively, the gradient of the loss function L_DPO increases the likelihood of the preferred completions yw and decreases the likelihood of the dispreferred completions yl. Importantly, the examples are weighted by how much higher the implicit reward model r̂_θ rates the dispreferred completions, scaled by β, i.e., by how incorrectly the implicit reward model orders the completions, accounting for the strength of the KL constraint.

Image by: DPO Paper
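In symbols (my LaTeX transcription of the corresponding equations from the DPO paper, with the implicit reward defined as r̂_θ(x, y) = β log π_θ(y | x) / π_ref(y | x)), the loss and its gradient are:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]

\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\beta\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\left[ \sigma\!\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big)
\big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \right]
```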

Build Your DPO LLM in 5 Steps

Lastly, stand by as we train our own Mistral-7B transformer using DPO in Google Colab, or follow along by cloning the GitHub repo.

1. Install Dependencies

Image by Author
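In Colab this boils down to a handful of pip installs. The package list below is an educated guess at what the notebook needs (trl for DPOTrainer, peft and bitsandbytes for memory-efficient fine-tuning), so pin versions to match the repo:

```python
# Run in a Colab cell; the package list and versions are illustrative.
!pip install -q transformers datasets accelerate trl peft bitsandbytes tensorboard
```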

2. Load the Preference Dataset and Define the Chosen/Rejected Responses

Image by Author
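A minimal version of this step could look like the snippet below. The dataset name and column names are assumptions for illustration; what matters is that each row ends up with a prompt, a chosen answer, and a rejected answer, which is the format DPOTrainer expects:

```python
from datasets import load_dataset

# Example preference dataset with "question", "chosen" and "rejected" columns
# (swap in whichever preference dataset you want to train on).
dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

def to_dpo_format(example):
    # DPOTrainer expects "prompt", "chosen" and "rejected" text fields.
    return {
        "prompt": example["question"],
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }

dataset = dataset.map(to_dpo_format, remove_columns=dataset.column_names)
```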

3. Create the DPOTrainer from the SFT Model and the Reference Model

Image by Author
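Here is a sketch of the trainer setup, assuming the Mistral-7B base checkpoint and TRL’s DPOTrainer as of early 2024. The hyperparameters are placeholders, and keyword arguments have shifted between trl releases, so check the signature of the version you install (in practice you would also add 4-bit loading and a LoRA config to fit in Colab memory):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed base/SFT checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Policy to be trained and a frozen copy that serves as the reference model.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
ref_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

training_args = TrainingArguments(
    output_dir="./dpo-mistral-7b",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    max_steps=200,
    logging_steps=10,
    report_to="tensorboard",
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    beta=0.1,                 # strength of the implicit KL penalty
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_prompt_length=512,
    max_length=1024,
)
trainer.train()
```

The beta value trades off fitting the preferences against staying close to the reference model; 0.1 is a common default.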

4. Inspect Model Performance in TensorBoard

Image by Author
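In Colab, the built-in TensorBoard magics are enough; the log directory below assumes the output_dir used in the previous step:

```python
# Point TensorBoard at the trainer's logging directory.
%load_ext tensorboard
%tensorboard --logdir ./dpo-mistral-7b
```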

5. (Optional) — Expose for Inference

Image by Author
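One simple way to expose the result is a plain transformers text-generation pipeline. The snippet assumes the trained weights were saved with trainer.save_model("./dpo-mistral-7b"); both the path and the generation settings are illustrative:

```python
from transformers import pipeline

# Load the DPO-tuned checkpoint saved by the trainer.
generator = pipeline("text-generation", model="./dpo-mistral-7b", device_map="auto")

prompt = "Explain direct preference optimization in one paragraph."
output = generator(prompt, max_new_tokens=200, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```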

The comparative results between a fine-tuned RLHF model and a DPO-based LLM are outstanding:

Image by Author

In Conclusion

Direct Preference Optimization represents a breakthrough in achieving control over large language models. Its stable, efficient, and computationally lightweight approach simplifies the fine-tuning process, eliminating the need for extensive sampling and intricate hyperparameter tuning. The use of binary cross-entropy objectives and the LM itself as a reward model adds to the simplicity and effectiveness of the algorithm.

Image by: DPO Paper

Hey, thanks for reading!

Thanks for getting to the end of this article. My name is 🇸🇮 Tim, and I love breaking down ML research papers and applications with an emphasis on business use cases. Give me a follow if you’d like to read more! :)