How to Fine-Tune Phi-3 with LoRA in 2026

Introduction

Phi-3, Microsoft's family of open-source Small Language Models (SLMs), remains a practical choice for embedded AI in 2026 thanks to its compact size (3.8B parameters for Phi-3-mini) and performance that Microsoft reports as competitive with 7B+ LLMs. Why choose it? It runs locally on an 8GB GPU, cuts cloud costs, and keeps data private. This advanced tutorial walks through each step: Ollama installation, inference via CLI and API, LoRA fine-tuning with Unsloth (which its authors report as roughly 2x faster on RTX GPUs), GGUF conversion, and custom deployment. It suits personalized AI agents in RAG pipelines and specialized chatbots.

Prerequisites

  • Hardware: NVIDIA GPU with ≥8GB VRAM (RTX 3060+), CUDA 12.1+
  • OS: Ubuntu 22.04 LTS or WSL2 on Windows
  • Software: Python 3.11+, Git, NVIDIA drivers 535+
  • Knowledge: PyTorch, Hugging Face Transformers, PEFT/LoRA
  • Disk space: 20GB free for models and datasets

Install Ollama and Phi-3

install-ollama.sh
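# Install Ollama (on Linux the installer usually registers and starts a background service)
curl -fsSL https://ollama.com/install.sh | sh
# Start the server manually only if it is not already running
ollama serve &
sleep 5
# Download the model, then run a one-off prompt from the CLI
ollama pull phi3:mini
ollama run phi3:mini "Hello, inference test!"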

This script installs Ollama with a single command, starts the server in the background, downloads Phi-3-mini (3.8B, 4k context), and tests CLI inference. The & returns control to the shell; sleep gives the server time to start. Pitfall: without a working NVIDIA driver and runtime, Ollama falls back to CPU inference, which is far slower.
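A fixed sleep can be too short on a slow machine and wasteful on a fast one. As a minimal sketch, assuming the server answers plain HTTP GET requests on its default address, the wait can be replaced by polling. The helper names server_is_up and wait_until are hypothetical, not part of Ollama.

```python
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this tutorial


def server_is_up(url: str = OLLAMA_URL) -> bool:
    """Return True if an HTTP GET on the server root succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # covers connection refused, timeouts and URLError
        return False


def wait_until(probe, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False


# Usage (requires a running or starting server):
#     if not wait_until(server_is_up):
#         raise SystemExit("Ollama did not start in time")
```

Passing the probe as a callable keeps the retry loop independent of the network, which makes it easy to exercise with a fake probe.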

First Inference via API

Ollama exposes a REST API at http://localhost:11434. Use it for scalable apps. Think of it like a lightweight FastAPI server for LLMs.
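To make the REST interface concrete, here is a minimal sketch that calls the /api/chat endpoint with only the standard library. The helper names build_chat_request and chat are hypothetical; the option values mirror the ones used later in this tutorial.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt: str, model: str = "phi3:mini") -> urllib.request.Request:
    """Build a non-streaming chat request for a single user message."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        "options": {"temperature": 0.7, "num_predict": 200},
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "phi3:mini") -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        body = json.load(resp)
    return body["message"]["content"]


# Usage (requires a running Ollama server with phi3:mini pulled):
#     print(chat("Explain LoRA fine-tuning in 3 points."))
```

Separating request construction from sending keeps the payload easy to inspect and log before anything goes over the wire.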

Python Client for Inference

inference_phi.py
import ollama

# Send one user message; options tune sampling and cap the reply length
response = ollama.chat(model='phi3:mini',
                      messages=[
                        {'role': 'user', 'content': 'Explain LoRA fine-tuning in 3 points.'}
                      ],
                      options={'temperature': 0.7, 'num_predict': 200})
print(response['message']['content'])

This code uses the ollama library (pip install ollama) for structured inference with options: temperature controls randomness, and num_predict caps the number of generated tokens. The library also provides an AsyncClient for async code. Pitfall: without num_predict, response length is left to the model and can run longer than expected.

Launch Custom API Server

custom-modelfile.sh
# Quote the heredoc delimiter so the shell leaves the template untouched
cat > Modelfile << 'EOF'
FROM phi3:mini
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>
"""
PARAMETER temperature 0.8
PARAMETER stop "<|end|>"
EOF
ollama create phi-custom -f Modelfile
ollama run phi-custom "Your role: LoRA expert."

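Creates a custom Modelfile that overrides Phi-3's prompt template (role tags such as <|user|> and <|assistant|>, with each turn closed by <|end|>) and its sampling parameters. The stop parameter ends generation at <|end|> so the model does not run on into a new turn. Useful for RAG: add a SYSTEM instruction. Pitfall: a malformed template (for example an {{ if }} without its {{ end }}) breaks model creation or inference; always test from the CLI first.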
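To see what such a template produces, the following sketch renders a list of chat messages into a Phi-3 style prompt string. It assumes the newline-separated layout described on the Phi-3 model card; render_phi3_prompt is a hypothetical helper for illustration, not an Ollama function.

```python
def render_phi3_prompt(messages):
    """Render chat messages as a Phi-3 style prompt ending with an open assistant turn."""
    parts = []
    for msg in messages:
        # Each turn: role tag, newline, content, end-of-turn marker
        parts.append(f"<|{msg['role']}|>\n{msg['content']}<|end|>\n")
    parts.append("<|assistant|>\n")  # the model continues from here
    return "".join(parts)


example = render_phi3_prompt([
    {"role": "system", "content": "You are a LoRA expert."},
    {"role": "user", "content": "What does rank mean?"},
])
print(example)
```

Printing the rendered prompt is a quick way to spot a missing <|end|> or a misplaced role tag before blaming the model for odd output.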

Preparing for Fine-Tuning

For fine-tuning, switch to Unsloth, which its authors report as about 2x faster for LoRA on a single GPU than a plain Hugging Face setup. Example dataset: Alpaca (cleaned) for instruction tuning, downloaded with HF Datasets.

Fine-Tune with LoRA Using Unsloth

finetune_lora.py
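from unsloth import FastLanguageModel  # import Unsloth first so its patches apply
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the base model in 4-bit to fit an 8GB GPU
model, tokenizer = FastLanguageModel.from_pretrained("microsoft/Phi-3-mini-4k-instruct",
                                                     max_seq_length=2048,
                                                     dtype=torch.float16,
                                                     load_in_4bit=True)
# Attach LoRA adapters; target_modules must match the layer names of the loaded checkpoint
model = FastLanguageModel.get_peft_model(model,
                                         r=16,
                                         target_modules=["qkv_proj", "o_proj"],
                                         lora_alpha=16,
                                         lora_dropout=0,
                                         bias="none",
                                         use_gradient_checkpointing=True)

# alpaca-cleaned has instruction/input/output columns, so build the "text" field explicitly
def to_text(example):
    prompt = f"{example['instruction']}\n{example['input']}".strip()
    messages = [{"role": "user", "content": prompt},
                {"role": "assistant", "content": example["output"]}]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]").map(to_text)
# Note: argument names differ across TRL versions (newer releases move some into SFTConfig)
trainer = SFTTrainer(model=model,
                     tokenizer=tokenizer,
                     train_dataset=dataset,
                     dataset_text_field="text",
                     max_seq_length=2048,
                     args=TrainingArguments(per_device_train_batch_size=2,
                                            gradient_accumulation_steps=4,
                                            warmup_steps=5,
                                            max_steps=60,
                                            learning_rate=2e-4,
                                            fp16=True,
                                            logging_steps=1,
                                            output_dir="phi-lora-tuned",
                                            optim="adamw_8bit"))
trainer.train()
# Save merged 16-bit weights so the GGUF converter receives a full model, not just the adapter
model.save_pretrained_merged("phi-lora-final", tokenizer, save_method="merged_16bit")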

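Complete Unsloth script for LoRA on Phi-3-mini. 4-bit quantization keeps the base model within 8GB of VRAM, and rank r=16 trains only a small fraction of the weights, typically at a small quality cost relative to full fine-tuning. The Alpaca dataset provides the instruction-tuning examples. The run lasts 60 steps (about 10 minutes on an RTX 4070, depending on setup). Pitfall: a per_device_train_batch_size that is too high causes OOM; lower it and raise gradient_accumulation_steps to keep the same effective batch size.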
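The arithmetic behind these settings can be checked by hand. The sketch below counts the trainable LoRA parameters and the number of examples the run actually sees. It assumes Phi-3-mini's published shape (hidden size 3072, 32 layers, a fused qkv_proj of width 3 x 3072); verify these against the checkpoint's config.json before relying on the numbers.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A LoRA pair is A (r x in) plus B (out x r), so r * (in + out) weights."""
    return r * (in_features + out_features)


# Assumed Phi-3-mini dimensions (check config.json of the checkpoint)
HIDDEN = 3072
LAYERS = 32
R = 16

per_layer = (
    lora_params(HIDDEN, 3 * HIDDEN, R)  # qkv_proj: 3072 -> 9216
    + lora_params(HIDDEN, HIDDEN, R)    # o_proj:   3072 -> 3072
)
trainable = per_layer * LAYERS
print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of 3.8B base weights: {trainable / 3.8e9:.2%}")

# Training budget from the script's TrainingArguments
effective_batch = 2 * 4              # per_device_train_batch_size * gradient_accumulation_steps
examples_seen = effective_batch * 60  # max_steps
print(f"effective batch size: {effective_batch}, examples seen: {examples_seen} of 1000")
```

Under these assumptions the 60-step run covers roughly half of the 1000-example subset, which is worth knowing before judging the result.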

Convert LoRA to GGUF for Ollama

convert-gguf.sh
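git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Install the Python dependencies of the conversion script
pip install -r requirements.txt
# Convert the merged Hugging Face model directory to a 16-bit GGUF file
python convert_hf_to_gguf.py ../phi-lora-final --outtype f16 --outfile phi-lora.gguf
# Write the Modelfile first, then register the model with Ollama
cat > Modelfile.gguf << 'EOF'
FROM ./phi-lora.gguf
EOF
ollama create phi-lora-ollama -f Modelfile.gguf
ollama run phi-lora-ollama "Post-fine-tune test."

Converts the merged model to GGUF, the format Ollama loads. The conversion script is pure Python, so no CUDA build of llama.cpp is needed for this step; Ollama handles GPU inference itself. The Modelfile points to the local GGUF file. Pitfall: convert_hf_to_gguf.py expects a full model directory, so merge the LoRA adapter first (Unsloth's model.save_pretrained_merged) rather than pointing it at an adapter-only checkpoint.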
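Before handing the file to Ollama, a quick sanity check of the output can catch a truncated or mis-converted file. This sketch reads the fixed-size GGUF header, assuming the layout used by GGUF version 2 and later (4-byte magic, 32-bit version, then 64-bit tensor and metadata counts, little-endian). read_gguf_header is a hypothetical helper, not part of llama.cpp.

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Read the magic, version and counts from the start of a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path} is not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {"version": version, "tensors": tensor_count, "metadata_keys": kv_count}


# Usage (after running the conversion script):
#     print(read_gguf_header("phi-lora.gguf"))
```

A tensor count of zero or a ValueError here usually means the conversion did not finish, which is cheaper to find out now than after ollama create.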

Benchmark and Monitoring

benchmark.py
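import ollama

total_tokens = 0
total_seconds = 0.0
for _ in range(10):
    resp = ollama.chat(model='phi-lora-ollama',
                       messages=[{'role': 'user', 'content': 'Generate Python code for quicksort.'}])
    # Ollama reports generated tokens and generation time (nanoseconds) per response
    total_tokens += resp['eval_count']
    total_seconds += resp['eval_duration'] / 1e9
print(f"Tokens/s: {total_tokens / total_seconds:.2f}")
print(resp['message']['content'])

Measures generation throughput over 10 queries, using the eval_count (generated tokens) and eval_duration (nanoseconds) fields that Ollama returns with each response. Compare before and after fine-tuning: a merged LoRA keeps the same architecture, so throughput should stay roughly the same while output quality changes. Pass an eval_dataset to the trainer to track validation loss and derive perplexity. Pitfall: the first request includes model load time; send a warm-up request before measuring.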
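Throughput is easiest to get right when computed from the token counts and timings in each response rather than from string lengths. As a minimal sketch, assuming each response carries eval_count (generated tokens) and eval_duration (nanoseconds) as in Ollama's chat responses, the hypothetical helper below also drops the first, cold-start request.

```python
def tokens_per_second(responses, skip_warmup: int = 1) -> float:
    """Aggregate generation throughput, ignoring the first skip_warmup responses."""
    measured = responses[skip_warmup:]
    tokens = sum(r["eval_count"] for r in measured)
    seconds = sum(r["eval_duration"] for r in measured) / 1e9  # ns -> s
    return tokens / seconds if seconds > 0 else 0.0


# Synthetic responses standing in for real ollama.chat() results
fake_responses = [
    {"eval_count": 10, "eval_duration": 5_000_000_000},   # cold start, skipped
    {"eval_count": 100, "eval_duration": 2_000_000_000},
    {"eval_count": 50, "eval_duration": 1_000_000_000},
]
print(f"{tokens_per_second(fake_responses):.1f} tokens/s")
```

Summing tokens and durations before dividing weights long replies more than short ones, which is usually what a throughput figure should reflect.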

Best Practices

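  • Quantize first: Use 4-bit or 8-bit loading when VRAM is under 12GB; merge the adapter into the base weights before export.
  • Dataset curation: 500-5000 high-quality examples beat sheer volume; filter out duplicates and malformed rows (for example with Dataset.filter from HF Datasets).
  • LoRA hyperparameters: r=16 or 32, alpha=2*r, dropout=0.05 for generalization (the script above uses alpha=16 and dropout=0; treat these as starting points).
  • Eval loop: Run LM-Eval or a custom perplexity check after training.
  • Versioning: Give each GGUF build a distinct model tag; ollama push can publish it to a registry.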
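The alpha=2*r rule of thumb is easier to read once the LoRA update is written out: the merged weight is W + (alpha / r) * B A, so alpha=2*r scales the learned update by 2. The toy sketch below uses plain Python lists and made-up 2x2 values purely for illustration; real adapters are merged by PEFT or Unsloth.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(W, A, B, alpha: float, r: int):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    delta = matmul(B, A)  # (out x r) @ (r x in) -> (out x in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy values: rank-1 update on a 2x2 identity matrix, alpha = 2 * r
W = [[1, 0], [0, 1]]
A = [[1, 2]]      # r x in  = 1 x 2
B = [[1], [0]]    # out x r = 2 x 1
merged = lora_merge(W, A, B, alpha=2, r=1)
print(merged)
```

Doubling r while keeping alpha fixed halves the scale, which is why the two values are usually tuned together.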

Common Errors to Avoid

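  • GPU OOM: Reduce max_seq_length to 1024 or set the batch size to 1; monitor with nvidia-smi.
  • Unmerged LoRA: Slower inference, and the GGUF converter needs full weights; merge with PEFT's model.merge_and_unload() or Unsloth's model.save_pretrained_merged().
  • Poorly formatted dataset: Phi-3 expects its chat format (<|user|>...<|end|>); build it with tokenizer.apply_chat_template.
  • Overfitting: Short runs (<100 steps) without a validation set hide overfitting; split 80/20 and stop early when validation loss stops improving.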
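The 80/20 split and early-stop advice can be sketched without any training framework. The helpers below are hypothetical and work on plain Python lists; in a real run the same roles are played by the dataset library's split utilities and the trainer's evaluation hooks.

```python
import random


def train_val_split(examples, val_fraction: float = 0.2, seed: int = 42):
    """Shuffle deterministically, then hold out val_fraction of the examples."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def should_stop(val_losses, patience: int = 3) -> bool:
    """Stop once the best validation loss is at least `patience` evaluations old."""
    if len(val_losses) <= patience:
        return False
    best_index = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_index >= patience


train, val = train_val_split(range(1000))
print(len(train), len(val))
print(should_stop([1.0, 0.9, 0.8, 0.85, 0.9, 0.95]))
```

A fixed seed keeps the validation set identical across runs, so loss curves from different hyperparameter settings stay comparable.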

Next Steps