Introduction
In 2026, aligning large language models (LLMs) with human preferences is crucial for helpful and safe responses. Direct Preference Optimization (DPO) changes the game: unlike RLHF (Reinforcement Learning from Human Feedback), DPO skips training a separate reward model, making the process simpler, more stable, and efficient.
The key idea: instead of optimizing a proxy reward, DPO optimizes the LLM policy directly with a loss over 'chosen/rejected' pairs. This can cut compute costs by 50-70% while often outperforming RLHF on benchmarks like MT-Bench.
This beginner tutorial guides you through implementing DPO on GPT-2 with a real dataset (stack-exchange-paired). By the end, you'll have an aligned model ready for inference. Ideal for AI developers wanting to master alignment without a PhD. Estimated time: 30min on GPU, 2h on CPU.
Prerequisites
- Python 3.10+ installed (use pyenv for management).
- Virtual environment: python -m venv dpo-env && source dpo-env/bin/activate
- NVIDIA GPU recommended (CUDA 12+), but CPU works for this small demo.
- Internet access to download Hugging Face datasets/models.
- Basic Python and PyTorch knowledge (no advanced ML required).
Installing Dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.44.0 datasets==2.21.0
pip install trl==0.9.6 accelerate==0.33.0 peft==0.12.0
pip install bitsandbytes==0.43.1 safetensors==0.4.3

This script installs PyTorch with CUDA (adapt the index-url for CPU if needed), followed by the essential libraries: Transformers for models, Datasets for data, TRL for DPOTrainer, and Accelerate/PEFT for optimization. Bitsandbytes enables 8-bit quantization to save VRAM. Run it once; check with pip list.
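To confirm the environment is ready, a quick sanity check (a minimal sketch that just prints whatever versions you installed and whether a CUDA GPU is visible):

import torch, transformers, datasets, trl, peft

# Report installed versions and GPU availability before starting the tutorial
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__, "| datasets:", datasets.__version__)
print("trl:", trl.__version__, "| peft:", peft.__version__)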
Understanding DPO Basics
DPO relies on an elegant mathematical trick: the loss derives directly from human preferences, without a reward model. For a pair (prompt, chosen_response, rejected_response), the formula is:
$$\mathcal{L}_{DPO} = -\mathbb{E} \log \sigma \left( \beta \log \frac{\pi_\theta (y_w | x)}{\pi_{ref} (y_w | x)} - \beta \log \frac{\pi_\theta (y_l | x)}{\pi_{ref} (y_l | x)} \right)$$
Here $\pi_\theta$ is the model being trained, $\pi_{ref}$ a frozen reference model (often a copy of the starting checkpoint), $\beta$ a hyperparameter (typically 0.1-1.0), $y_w$ the chosen response, and $y_l$ the rejected one. Analogy: it is like picking the better of two routes directly, without first building a reward map.
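To make the formula concrete, here is a minimal PyTorch sketch of the loss, assuming you already have per-sequence log-probabilities for the chosen and rejected responses under the policy and the reference model (the tensor names are illustrative; DPOTrainer computes all of this for you):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of policy vs. reference for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen - rejected)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

# Toy usage with random numbers, just to show the shapes (batch of 4)
fake_logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*fake_logps))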
We use the 'lvwerra/stack-exchange-paired' dataset: pairs of preferred and rejected answers to Stack Exchange questions. The full dataset is large, so we take a small random subset and map its raw columns (question, response_j, response_k) to the prompt/chosen/rejected format DPOTrainer expects.
Loading and Preparing the Dataset
from datasets import load_dataset

# Load the ready-made preference dataset (the "rl" subset is the one used
# for preference training in the StackLLaMA recipe)
dataset = load_dataset("lvwerra/stack-exchange-paired", split="train", data_dir="data/rl")

# Take a small random subset for a quick test run (1000 examples)
dataset = dataset.shuffle(seed=42).select(range(1000))

# Map the raw columns (question, response_j, response_k) to the
# prompt/chosen/rejected keys that DPOTrainer expects
def to_dpo_format(sample):
    return {
        "prompt": "Question: " + sample["question"] + " Answer:",
        "chosen": sample["response_j"],
        "rejected": sample["response_k"],
    }
dataset = dataset.map(to_dpo_format, remove_columns=dataset.column_names)

# Check the structure
print(f"Dataset size: {len(dataset)}")
print("Example prompt:", dataset[0]['prompt'])
print("Example chosen:", dataset[0]['chosen'])
print("Example rejected:", dataset[0]['rejected'])

# Save locally for reuse
dataset.save_to_disk("./dpo_dataset")
print("Dataset ready! Run this script first.")

This code downloads the Stack Exchange preference data, keeps 1000 examples for fast training (5-10 min on GPU), and maps it into the standard 'prompt', 'chosen', 'rejected' structure. Run it first; it saves the result locally to avoid re-downloads.
Loading the Model and Tokenizer
Why GPT-2? It is a lightweight model (124M params) that trains fast for beginners, but DPO scales to billions of parameters. GPT-2 has no pad token, so we set pad_token to eos_token to avoid warnings.
Tip: Use device_map="auto" with Accelerate for seamless multi-GPU/CPU support.
Initializing the Base Model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_from_disk
# Load tokenizer and base model (small GPT-2)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
# Load the prepared dataset
dataset = load_from_disk("./dpo_dataset")
print("Model and dataset loaded. Ready for DPO.")
print(f"Model params: {model.num_parameters():,}")
print(f"Vocab size: {tokenizer.vocab_size}")

This loads GPT-2 in bfloat16 to save memory. device_map="auto" places it on the GPU or CPU automatically. Left padding keeps the prompt directly before the response tokens, which is what causal LMs expect. Run this after prepare_dataset.py.
Configuring the DPO Trainer
DPOTrainer (from TRL) handles everything: the loss, the reference model, and LoRA via PEFT for efficient fine-tuning (only a few percent of parameters trained). Because we wrap the model with PEFT, we can pass ref_model=None: TRL then uses the frozen base weights (adapters disabled) as the reference, so no second copy is needed. Hyperparams: beta=0.1 (standard), 1 epoch (for the demo), per-device batch 2 with 4 accumulation steps (effective batch 8; adjust for your VRAM).
Setting Up DPOTrainer with LoRA
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_from_disk
import torch

# Load tokenizer, model, and dataset (same as in the previous scripts)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(
model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
dataset = load_from_disk("./dpo_dataset")
# LoRA config for efficiency (r=16, alpha=32)
peft_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.1,
target_modules=["c_attn", "c_proj"]
)
model = get_peft_model(model, peft_config)
# DPO configuration
beta = 0.1
dpo_config = DPOConfig(
beta=beta,
output_dir="./dpo-gpt2",
max_length=256,
max_prompt_length=128,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=1,
learning_rate=1e-5,
logging_steps=10,
save_steps=100,
remove_unused_columns=False,
)
trainer = DPOTrainer(
model=model,
ref_model=None,  # with a PEFT model, TRL uses the base weights (adapters disabled) as the reference
args=dpo_config,
train_dataset=dataset,
tokenizer=tokenizer,
)
print("Trainer configuré. Lancez trainer.train() ensuite.")
trainer.model.print_trainable_parameters()LoRA cuts VRAM usage (from 1GB to 200MB). DPOConfig sets hyperparams; ref_model=None uses beta approximation for simplicity (no clone needed). Effective batch=8 with accumulation. print_trainable_parameters() confirms ~1% active params.
Running DPO Training
from setup_trainer import trainer  # import the trainer configured in the previous script

# Start training
train_result = trainer.train()
# Save the fine-tuned model
trainer.save_model("./dpo-gpt2-final")
trainer.tokenizer.save_pretrained("./dpo-gpt2-final")
print("Entraînement terminé ! Modèle sauvé dans ./dpo-gpt2-final")
print(f"Résultats: {train_result.metrics}")This final script starts training (1 epoch ~5min on A10G). Auto-save plus manual. Metrics show descending DPO loss. Import from setup_trainer.py (or copy full code into one file for production).
Testing and Inference on the DPO Model
After training, test with a Stack Exchange-style prompt. Compare before/after to see alignment: more helpful, less verbose responses.
Inference and Model Comparison
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the fine-tuned DPO model (the folder holds a LoRA adapter; transformers loads it onto the base model automatically when peft is installed)
model_path = "./dpo-gpt2-final"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
prompt = "Question: What is the best way to learn Python? Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
response_dpo = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Réponse DPO:", response_dpo[len(prompt):])
# Comparaison base (téléchargez gpt2 si besoin)
model_base = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16, device_map="auto")
outputs_base = model_base.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
response_base = tokenizer.decode(outputs_base[0], skip_special_tokens=True)
print("Réponse base:", response_base[len(prompt):])Generates 50 tokens with sampling. Compare: DPO favors concise/relevant responses. do_sample=True adds variability; tweak temperature. Works standalone after training.
Best Practices
- Choose beta wisely: 0.1 for clean datasets, 0.5+ if noisy (test on val set).
- Always use LoRA/QLoRA: they save up to ~90% of VRAM and let you scale to Llama-7B-class models on a 16GB GPU.
- Validate the dataset: make sure every pair has a non-empty prompt, chosen, and rejected, and that chosen and rejected actually differ (see the sketch after this list).
- Monitor the loss: it starts around 0.69 (ln 2); if it is not clearly trending down by the end of epoch 1, lower the learning rate or adjust beta.
- Push to the HF Hub for sharing: set hub_model_id="your-name/dpo-gpt2" in DPOConfig, then call trainer.push_to_hub().
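As noted in the dataset-validation point above, a small check before training can save a failed run. A minimal sketch, assuming the prompt/chosen/rejected format used in this tutorial:

def validate_pairs(dataset):
    """Basic sanity checks for a prompt/chosen/rejected preference dataset."""
    required = {"prompt", "chosen", "rejected"}
    missing = required - set(dataset.column_names)
    assert not missing, f"Missing columns: {missing}"
    bad = 0
    for ex in dataset:
        # Flag empty fields and pairs where chosen and rejected are identical
        if not ex["prompt"] or not ex["chosen"] or not ex["rejected"] or ex["chosen"] == ex["rejected"]:
            bad += 1
    print(f"{bad} suspicious examples out of {len(dataset)}")

validate_pairs(dataset)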
Common Errors to Avoid
- Misconfigured padding: forgetting padding_side="left" can produce wrong logits at generation time and confusing loss values.
- Reference model confusion: with ref_model=None, TRL clones the policy for full fine-tuning or disables the LoRA adapters for PEFT models; don't pass the same live model object as both policy and reference.
- Batch too large: OOM? Drop per_device_train_batch_size to 1 and set gradient_accumulation_steps=8 (see the sketch after this list).
- Malformed dataset: Ensure exact 'prompt','chosen','rejected' keys, no NaNs.
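For the OOM case above, the fix is two fields in DPOConfig. A sketch that keeps the effective batch size at 8 (1 x 8) while lowering peak VRAM:

from trl import DPOConfig

dpo_config = DPOConfig(
    beta=0.1,
    output_dir="./dpo-gpt2",
    per_device_train_batch_size=1,   # smaller per-step memory footprint
    gradient_accumulation_steps=8,   # same effective batch size as before
    num_train_epochs=1,
    learning_rate=1e-5,
)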
Next Steps
- Original paper: Direct Preference Optimization (Rafailov et al., 2023).
- TRL repo: huggingface/trl for advanced examples (SFT+DPO).
- Datasets: UltraFeedback.
- Scale to Llama: swap "gpt2" for "meta-llama/Meta-Llama-3-8B" (a gated model: accept the license and log in with your HF token first); see the sketch below.
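To scale the same recipe to an 8B model, the main change is loading the base model in 4-bit so it fits in consumer VRAM. A hedged sketch (the model id and target_modules are Llama-specific assumptions; adjust for your checkpoint):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# QLoRA-style 4-bit quantization so an 8B model fits on a single consumer GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_name = "meta-llama/Meta-Llama-3-8B"  # gated: accept the license on the Hub first
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
# For Llama-style models, LoRA typically targets the attention projections,
# e.g. target_modules=["q_proj", "k_proj", "v_proj", "o_proj"] in LoraConfig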