
How to Fine-Tune an LLM with LoRA in 2026


Introduction

LoRA (Low-Rank Adaptation) has become the standard way to adapt large language models (LLMs) in 2026. Unlike full fine-tuning, which updates every weight and needs hundreds of gigabytes of GPU memory even for an 8B model, LoRA injects small low-rank matrices into the attention layers and trains only about 0.1-1% of the parameters. The result: much lighter memory use and faster training, enough to fine-tune an 8B model on a single A100 GPU.

This tutorial walks you step by step through fine-tuning Llama-3-8B on the databricks-dolly-15k instruction-following dataset. We use Hugging Face PEFT, Transformers, and TRL for production-ready Supervised Fine-Tuning (SFT). At the end, you merge the LoRA adapters into the base model for efficient inference. It is ideal for customizing an AI assistant on your business data without a massive cluster.

Prerequisites

  • Python 3.10+ and Git installed
  • NVIDIA GPU with ≥16GB VRAM (A100/H100 recommended; tested on RTX 4090)
  • Advanced knowledge: PyTorch, Transformers, tokenizers
  • Hugging Face account (token for gated models like Llama-3)
  • ≥50GB SSD storage for datasets/caches

Install Dependencies

setup.sh
#!/bin/bash
set -e

# Update pip and torch CUDA
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Core HF libs + PEFT/TRL
pip install transformers==4.45.1
pip install peft==0.12.0
pip install trl==0.9.6
pip install datasets==2.21.0
pip install accelerate==0.33.0
pip install bitsandbytes==0.43.1
pip install wandb

# Flash Attention 2 (used later via attn_implementation="flash_attention_2")
pip install flash-attn --no-build-isolation

# HF Login (replace with your token)
huggingface-cli login

# CUDA Check
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Devices: {torch.cuda.device_count()}')"

This script installs the essential libraries with pinned versions so the stack stays reproducible. The CUDA 12.1 PyTorch build targets Ampere/Ada GPUs; bitsandbytes enables 4/8-bit quantization, and flash-attn provides Flash Attention 2 for the model loading step below. Run it once; it also prompts for the Hugging Face login needed to access the gated Llama-3 weights. Pitfall: if you plan to train on multiple GPUs, run accelerate config after the install.

Prepare the Dataset

We use databricks-dolly-15k, a human-written instruction dataset of roughly 15k instruction/response pairs. Load it via datasets, then wrap each example in the Llama-3 chat template:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{response}<|eot_id|>

This aligns the model with the instruction-chat format it will see at inference; SFTTrainer then tokenizes the formatted text in batches for efficiency.
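If you prefer not to hard-code the template string, the Llama-3-Instruct tokenizer ships with a chat template that renders the same format. A minimal sketch (the filename and example messages are illustrative):

check_template.py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "user", "content": "What is low-rank adaptation?"},
    {"role": "assistant", "content": "LoRA adds small trainable matrices to frozen weights."},
]

# Renders <|begin_of_text|>, the header ids, and <|eot_id|> for you
print(tokenizer.apply_chat_template(messages, tokenize=False))

Comparing this output against the hand-written template in prepare_dataset.py below is a quick way to catch formatting drift between training and inference.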

Dataset Preparation Script

prepare_dataset.py
from datasets import load_dataset

# Load Dolly-15k (train split only for simplicity)
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
dataset = dataset.train_test_split(test_size=0.1)  # 90/10 split

# Llama-3 template for SFTTrainer (Dolly's answer column is named "response")
llama_prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n{response}<|eot_id|>"

def formatting_prompts_func(example):
    return {'text': llama_prompt.format(instruction=example['instruction'], response=example['response'])}

train_dataset = dataset['train'].map(formatting_prompts_func)
eval_dataset = dataset['test'].map(formatting_prompts_func)

# Save locally for reuse
train_dataset.save_to_disk('dolly_train')
eval_dataset.save_to_disk('dolly_eval')
print(f'Train: {len(train_dataset)}, Eval: {len(eval_dataset)}')

This script loads, splits, and formats the dataset into Llama-3-ready prompts for SFT. The trailing <|eot_id|> token is critical: it teaches the model where to stop, otherwise generations run on past the answer. Datasets are saved in Arrow format for fast reloading. Pitfall: Dolly's answer column is named response, not output; skipping the split leaves you without a validation set; map() runs eagerly and caches, while save_to_disk writes a standalone copy you can reload with load_from_disk.
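Before launching training, it is worth eyeballing one formatted example and its token length. A short sanity-check sketch (the filename is illustrative; it assumes the datasets saved above):

inspect_sample.py
from datasets import load_from_disk
from transformers import AutoTokenizer

train_dataset = load_from_disk('dolly_train')
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

sample = train_dataset[0]['text']
print(sample)  # should start with <|begin_of_text|> and end with <|eot_id|>
print(f"Tokens: {len(tokenizer(sample)['input_ids'])}")  # most Dolly examples fit well under 2048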

Model and LoRA Configuration

Load meta-llama/Meta-Llama-3-8B-Instruct in 4-bit (QLoRA) to fit on 16GB VRAM. Apply PEFT LoRA to the attention projections q_proj, k_proj, v_proj, and o_proj (r=16, alpha=32). This freezes the base model and trains only on the order of 14M parameters (about 0.2% of the 8B total). Use TRL's SFTTrainer for packing and automatic loss masking.

Load Model + LoRA Config

model_setup.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# 4-bit Quantization for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # requires the flash-attn package
)

# Tokenizer with EOS padding
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Expert LoRA config: target attention, r=16 for perf/memory balance
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Prepare the quantized model for k-bit training (casts norms, enables input grads
# so gradient checkpointing works), then wrap it with the LoRA adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # roughly 0.2% trainable (~14M params)

This loads the model with 4-bit NF4 quantization for QLoRA, cutting the weight footprint from roughly 16GB in bf16 to about 5GB. LoRA targets the four attention projections; r=16 is a common sweet spot (much higher invites overfitting and costs memory, much lower can underfit). print_trainable_parameters() confirms the tiny trainable fraction. Pitfall: flash_attention_2 needs the flash-attn package and an Ampere-or-newer GPU, otherwise attention falls back to a slower implementation; the pad_token must be set, here reusing EOS.
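To check the quantized footprint on your own hardware, a few lines you can append to the end of model_setup.py (exact numbers vary by GPU, driver, and library versions):

# Appended to model_setup.py: rough check of the 4-bit model's GPU footprint
import torch

print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
print(f"GPU memory peak:      {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")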

SFTTrainer Setup

Configure SFTTrainer with packing (multiple examples are concatenated into sequences of max_seq_length=2048), gradient checkpointing, and Weights & Biases logging to monitor loss and perplexity. The config below trains for one full epoch; for a quick smoke test, set max_steps to something small (e.g. 100) instead and adjust for full runs.

Trainer Config and Training Launch

train_lora.py
from datasets import load_from_disk
from trl import SFTTrainer, SFTConfig

# Reuse the quantized PEFT model and tokenizer defined in model_setup.py
from model_setup import model, tokenizer

# Prepared datasets
train_dataset = load_from_disk('dolly_train')
eval_dataset = load_from_disk('dolly_eval')

# Expert 2026 hyperparameters
sft_config = SFTConfig(
    output_dir="./lora-llama3-dolly",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    save_steps=50,
    eval_strategy="steps",
    eval_steps=50,
    max_seq_length=2048,
    packing=True,               # Pack multiple examples per sequence for efficiency
    dataset_text_field="text",  # Column produced by prepare_dataset.py
    dataset_num_proc=4,
    report_to="wandb",
    push_to_hub=True,
    hub_model_id="your-username/lora-llama3-dolly",
    gradient_checkpointing=True,
    remove_unused_columns=False,
    warmup_steps=100
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    args=sft_config
)

trainer.train()
trainer.save_model()
trainer.push_to_hub()  # Pushes the adapter to hub_model_id set above

SFTTrainer handles tokenization, loss masking, and packing automatically. Batch size 2 with 4 accumulation steps gives an effective batch of 8 per GPU; 2e-4 is a typical LoRA learning rate. Packing substantially boosts throughput by filling each 2048-token sequence with several short examples. Pitfall: without remove_unused_columns=False the text column can be dropped before the trainer sees it; push_to_hub requires a valid Hugging Face login, and the repo named in hub_model_id is created if it does not already exist.
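To turn the logged eval loss into the perplexity figure mentioned earlier, a short snippet you can append after trainer.train() in train_lora.py:

# Appended to train_lora.py: perplexity from the final evaluation loss
import math

metrics = trainer.evaluate()
print(f"Eval loss:  {metrics['eval_loss']:.3f}")
print(f"Perplexity: {math.exp(metrics['eval_loss']):.2f}")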

Merge and Inference

After training, merge the LoRA adapters into the base model (loaded in bfloat16) for fast inference without the PEFT adapter overhead. Then test generation with the same prompt template used in training, either directly with generate() or through the transformers pipeline (see the sketch after the code).

Merge LoRA and Test Inference

merge_inference.py
import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

# Path to trained LoRA
peft_model_id = "./lora-llama3-dolly"

# Load base + adapter
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, peft_model_id)

# Merge and unload
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged-lora-llama3")

# Tokenizer (saved alongside the merged weights so the folder is self-contained)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.save_pretrained("merged-lora-llama3")

# Test inference
prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nExplain LoRA in 3 sentences.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
inputs = tokenizer(prompt, return_tensors="pt").to(merged_model.device)
outputs = merged_model.generate(**inputs, max_new_tokens=128, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

merge_and_unload() folds the low-rank updates into the base weight matrices, so the merged checkpoint is the same size as the base model but needs no PEFT code at inference. Matching the training prompt template exactly keeps generations coherent. Pitfall: load the base model in bfloat16 before merging (merging into a 4-bit quantized model degrades quality); without do_sample=True, generation is greedy and tends to be flat.
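The introduction to this step mentions testing with pipeline; the same check through the transformers text-generation pipeline, appended to merge_inference.py and reusing the in-memory merged model, looks roughly like this:

# Appended to merge_inference.py: same test via the text-generation pipeline
from transformers import pipeline

generator = pipeline("text-generation", model=merged_model, tokenizer=tokenizer)
result = generator(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])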

Deployment Script (vLLM)

deploy_vllm.py
from vllm import LLM, SamplingParams

llm = LLM(model="merged-lora-llama3", tensor_parallel_size=1, dtype="bfloat16")

prompts = [
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat does LoRA do?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

vLLM (install with pip install vllm) serves the merged model with continuous batching for high-throughput inference; raise tensor_parallel_size to shard across multiple GPUs. Sampling with temperature=0.7 and top_p=0.95 gives varied but coherent answers. Pitfall: the model path here is the local merged folder, which must contain the tokenizer files saved above; if you load from the Hub instead, the repo must be public or you must be logged in, and dtype should match the merged checkpoint (bfloat16).

Best Practices

  • Choose r dynamically: r=8 for small datasets, r=64 for >1M examples (test via validation loss).
  • Quantize aggressively: Always use QLoRA 4-bit + double_quant for <10GB VRAM on 7B models.
  • Pack and checkpoint: Enable packing + gradient_accum to scale batch without OOM.
  • Monitor overfitting: watch the eval loss/perplexity curve; stop early once eval loss plateaus or starts rising.
  • Version adapters: push the LoRA adapter separately (a few dozen MB) to the Hub and merge at load time in production, as sketched below.
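For the adapter-versioning point, a minimal sketch of pushing only the LoRA weights and attaching them to the stock base model at load time (the adapter repo name is illustrative):

version_adapters.py
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Push only the adapter (a few dozen MB) once after training, e.g. from train_lora.py:
# trainer.model.push_to_hub("your-username/lora-llama3-dolly-adapter")

# Later, attach the versioned adapter to the base model and merge on the fly
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "your-username/lora-llama3-dolly-adapter")
model = model.merge_and_unload()  # ready for production serving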

Common Errors to Avoid

  • Template mismatch: Train prompt ≠ inference → incoherent generations. Check <|eot_id|> everywhere.
  • Forgot pad_token: Llama-3's tokenizer ships without a pad token; set tokenizer.pad_token = tokenizer.eos_token before training or padding fails.
  • LR too high: above ~5e-4 the loss can spike; use cosine decay plus ~100 warmup steps.
  • No merge: serving through the PEFT adapter adds inference latency; always merge into the base model for production.

Next Steps