Large-scale unsupervised language models (LMs) have shown remarkable capabilities in understanding and generating human-like text. However, achieving precise control over their behavior poses a significant challenge due to the unsupervised nature of their training. Existing methods rely on collecting human labels to fine-tune LMs, often through reinforcement learning from human feedback (RLHF). Here’s what this article contains:
Who is this blog post useful for? ML engineers working with LLMs, tech enthusiasts, VCs, etc.
How advanced is this post? Anybody previously acquainted with ML terms should be able to follow along.
Replicate my code here: https://github.com/Timothy102/llm-course or through Google Colab
PPO stands for Proximal Policy Optimization, an algorithm for solving reinforcement learning (RL) problems. In RLHF, the pipeline first fits a reward model to human preference feedback and then fine-tunes the unsupervised LLM with PPO to maximize that learned reward.
The fundamental principle of PPO is to make small, incremental updates to the policy, since larger updates risk instability or collapse into suboptimal solutions. In practice, however, the technique is still prone to instability (the loss can diverge), is hard to reproduce because of its many hyperparameters and sensitivity to random seeds, and is computationally expensive.
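For context, the mechanism behind those small steps is PPO's clipped surrogate objective (Schulman et al., 2017), which bounds how far the probability ratio between the new and old policy can move in a single update:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$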
Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses and then use RL to find a policy that maximizes the learned reward.
In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.
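Spelled out in the DPO paper's notation (with $r_\phi$ the learned reward model and $\pi_{\text{ref}}$ the frozen SFT model), the RL stage maximizes a KL-constrained reward, and the optimal policy for that objective has a closed form:

$$
\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\text{ref}}(y \mid x)\big]
$$

$$
\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)
$$

Inverting this relationship lets the reward be written purely in terms of the policy and the reference model, which is the implicit reward DPO trains against.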
This is where Direct Preference Optimization (DPO) comes into play. DPO simplifies control by treating the task as a classification problem. Concretely, it uses two models: the trained model (or policy model) and a copy of it called the reference model. During training, the goal is for the trained model to assign higher probabilities to preferred answers than the reference model does, and lower probabilities to rejected answers. In effect, we're penalizing the LLM for bad answers and rewarding it for good ones.
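As a rough sketch of that idea in code (a toy computation, not the trl implementation; the log-probabilities below are made up for illustration), the loss compares how much more the policy prefers the chosen answer over the rejected one, relative to the reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Toy DPO objective: the implicit reward is beta * (policy logp - reference logp),
    and the loss pushes the margin between chosen and rejected rewards through a sigmoid."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Made-up sequence log-probabilities for a batch of two preference pairs
policy_chosen = torch.tensor([-12.3, -10.1])
policy_rejected = torch.tensor([-14.0, -11.5])
ref_chosen = torch.tensor([-12.8, -10.6])
ref_rejected = torch.tensor([-13.5, -11.2])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```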
Unlike traditional RLHF, DPO uses the language model itself as an implicit reward model and trains it with a binary cross-entropy objective, aligning the model's outputs with human preferences without extensive sampling from the policy, without fitting a separate reward model, and without intricate hyperparameter tuning. The result is a notably more stable, efficient, and computationally frugal process.
Intuitively, the gradient of the loss function $\mathcal{L}_{\text{DPO}}$ increases the likelihood of the preferred completions $y_w$ and decreases the likelihood of the dispreferred completions $y_l$. Importantly, the examples are weighted by how much higher the implicit reward model $\hat{r}_\theta$ rates the dispreferred completions, scaled by $\beta$, i.e., how incorrectly the implicit reward model orders the completions, accounting for the strength of the KL constraint.
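For completeness, here are the loss and its gradient from the DPO paper, with $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ the implicit reward:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)\Big]
$$

$$
\nabla_\theta \mathcal{L}_{\text{DPO}} = -\,\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\,\big[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big]\Big]
$$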
Lastly, stand by as we train our own Mistral-7B transformer using DPO in Google Colab, or clone the repo from GitHub.
2. Load the Preference Dataset and Define the Chosen and Rejected Responses
3. Create the DPOTrainer from the SFT Model and the Reference Model (sketched below)
4. Inspect Model Performance in TensorBoard
5. (Optional) — Expose for Inference
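As a rough outline of steps 2–5 (a minimal sketch: the dataset name is a placeholder, and the exact DPOTrainer arguments vary between trl versions, so check the documentation of the version you install):

```python
# Sketch of DPO fine-tuning with Hugging Face trl. The dataset name below is a
# placeholder: any dataset with "prompt", "chosen" and "rejected" text columns works.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name)          # policy (SFT) model
ref_model = AutoModelForCausalLM.from_pretrained(model_name)      # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Step 2: load a preference dataset (placeholder name)
dataset = load_dataset("your-org/your-preference-dataset", split="train")

# Step 3: create the DPOTrainer from the SFT model and the reference model
training_args = TrainingArguments(
    output_dir="./mistral-7b-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    logging_dir="./logs",          # step 4: these logs can be inspected in TensorBoard
    report_to="tensorboard",
)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    beta=0.1,                      # strength of the KL constraint
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./mistral-7b-dpo")   # step 5: save, then serve for inference
```

Training curves land in `./logs`, so `tensorboard --logdir ./logs` is enough to watch the loss and reward margins; the saved checkpoint can then be loaded with `AutoModelForCausalLM.from_pretrained("./mistral-7b-dpo")` and served like any other Hugging Face model.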
The comparative results between a PPO-based RLHF model and a DPO-trained LLM are striking: in the DPO paper's experiments, DPO matches or exceeds RLHF's ability to control the sentiment of generations and matches or improves response quality in summarization and single-turn dialogue, while being substantially simpler to implement and train.
Direct Preference Optimization represents a breakthrough in achieving control over large language models. Its stable, efficient, and computationally lightweight approach simplifies the fine-tuning process, eliminating the need for extensive sampling and intricate hyperparameter tuning. The use of binary cross-entropy objectives and the LM itself as a reward model adds to the simplicity and effectiveness of the algorithm.
Thanks for getting to the end of this article. My name is 🇸🇮 Tim, and I love breaking down ML research papers and ML applications with an emphasis on business use cases. Give me a follow if you’d like to read more! :)