How to Fine-Tune Phi-3 with LoRA in 2026 (Advanced Guide)

Introduction

Phi-3, Microsoft's family of open-source Small Language Models (SLMs), is reshaping on-device AI in 2026 thanks to its compact size (3.8B parameters for Phi-3-mini) and performance that rivals 7B+ LLMs. Why adopt it? It runs locally on an 8 GB GPU, cuts cloud costs, and keeps data private. This advanced tutorial walks you through each step: installation via Ollama, optimized inference, LoRA fine-tuning with Unsloth (about 2x faster on RTX GPUs), GGUF conversion, and custom deployment. It suits personalized AI agents for RAG or specialized chatbots. By the end, you will want to bookmark this guide for your production projects.

Prerequisites

Hardware: NVIDIA GPU with ≥8 GB VRAM (RTX 3060 or better), CUDA 12.1+
OS: Ubuntu 22.04 LTS or WSL2 on Windows
Software: Python 3.11+, Git, NVIDIA drivers 535+
Knowledge: PyTorch, Hugging Face Transformers, PEFT/LoRA
Disk space: 20 GB free for models and datasets

Installing Ollama and Phi-3

install-ollama.sh

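# Install Ollama with the official installer script
curl -fsSL https://ollama.com/install.sh | sh
# Start the server in the background (skip this if the installer already runs it as a service)
ollama serve &
# Give the server a few seconds to come up
sleep 5
# Download Phi-3-mini and run a quick CLI test
ollama pull phi3:mini
ollama run phi3:mini "Hello, inference test!"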

This script installs Ollama with a single command, starts the server in the background, downloads Phi-3-mini (3.8B, 4k context), and tests inference from the CLI. The & lets the script continue while the server runs; sleep gives it time to start. Pitfall: without the NVIDIA runtime, Ollama falls back to the CPU, which is much slower (roughly 10x).
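A fixed sleep is only a guess at the startup time. As a minimal sketch, assuming the server answers a plain HTTP GET on its root URL once it is ready, a script can poll instead of sleeping. The names server_is_up and wait_for_server and the timeout values are illustrative, not part of Ollama.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url="http://localhost:11434"):
    """Return True if an HTTP GET on the server root succeeds."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(probe=server_is_up, timeout_s=30.0, interval_s=0.5):
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Calling wait_for_server() before the first request returns as soon as the server responds and gives up after the timeout. The probe is a parameter so the loop can be exercised without a running server.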

First inference via the API

Ollama exposes a REST API at http://localhost:11434. Use it to build applications that scale. Analogy: think of it as a lightweight FastAPI server for LLMs.
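To make the REST interface concrete, here is a minimal sketch that builds a non-streaming request for the /api/chat endpoint using only the standard library. The helper names build_chat_request and send_chat are illustrative. send_chat assumes a server is running at the default address and that the model has already been pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address of the local server


def build_chat_request(prompt, model="phi3:mini", base_url=OLLAMA_URL):
    """Build (but do not send) a non-streaming POST to /api/chat."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request):
    """Send the request and return the assistant's text (needs a running server)."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Typical usage would be send_chat(build_chat_request("Summarize LoRA in one sentence.")). Building the request separately from sending it makes the payload easy to inspect or log.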

Python client for inference

inference_phi.py

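import ollama

# Send a single-turn chat request to the local Ollama server
response = ollama.chat(model='phi3:mini',
                      messages=[
                        {'role': 'user', 'content': 'Explain LoRA fine-tuning in 3 points.'}
                      ],
                      # temperature controls randomness; num_predict caps the generated tokens
                      options={'temperature': 0.7, 'num_predict': 200})
print(response['message']['content'])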

This code uses the ollama library (pip install ollama) for structured inference with advanced options: temperature controls creativity and num_predict caps the number of generated tokens. The example is complete, and the library also provides an AsyncClient for async code. Pitfall: without explicit options, responses can come out too short or too verbose.
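To see why temperature affects creativity, the sketch below applies temperature scaling to a softmax over three made-up token scores. It illustrates the general sampling idea and is not Ollama's internal code. The function name and the logits are illustrative.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, temperature=0.2)
warm = softmax_with_temperature(logits, temperature=0.7)
print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])
```

A lower temperature concentrates probability on the top-scoring token, so output becomes more deterministic. A higher one spreads probability across the alternatives.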

Launching a custom API server

custom-modelfile.sh

# Write a Modelfile that overrides the chat template and sampling parameters
cat > Modelfile << EOF
FROM phi3:mini
TEMPLATE "{{ if .System }}<|system|>{{ .System }}<|end|>{{ end }}{{ if .Prompt }}<|user|>{{ .Prompt }}<|end|>{{ end }}<|assistant|>"
PARAMETER temperature 0.8
PARAMETER stop "<|end|>"
EOF
# Build the custom model and test it from the CLI
ollama create phi-custom -f Modelfile
ollama run phi-custom "Your role: LoRA expert."

This creates a custom Modelfile that overrides the Phi template (ChatML-like) and the parameters. The stop parameter ends generation at <|end|>, which prevents runaway output. Useful for RAG: add a SYSTEM prompt. Pitfall: a malformed template breaks inference, so always test from the CLI first.
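To check what such a template is meant to produce, the sketch below reproduces the same layout in plain Python. It is a stand-in for the template logic under the assumption of a single system and user turn, not Ollama's template engine. render_phi3_prompt is an illustrative name.

```python
def render_phi3_prompt(prompt, system=""):
    """Mimic the single-turn prompt layout described by the Modelfile template."""
    parts = []
    if system:
        parts.append(f"<|system|>{system}<|end|>")
    if prompt:
        parts.append(f"<|user|>{prompt}<|end|>")
    parts.append("<|assistant|>")  # the model continues from here
    return "".join(parts)


print(render_phi3_prompt("What is LoRA?", system="You are a LoRA expert."))
```

Comparing this string with what the model actually receives is a quick way to spot a missing tag before debugging the model itself.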

Preparing for fine-tuning

For advanced fine-tuning, switch to Unsloth, which speeds up LoRA by about 2x on a single GPU compared with the Hugging Face baseline. Example dataset: Alpaca (instruction tuning). Download it via Hugging Face Datasets.
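Alpaca-style records need to be turned into chat turns before training. The sketch below assumes each record has instruction, input, and output fields, which is the usual Alpaca layout. The function name alpaca_to_messages is illustrative.

```python
def alpaca_to_messages(example):
    """Turn one Alpaca-style record into a list of chat messages."""
    user = example["instruction"].strip()
    extra = example.get("input", "").strip()
    if extra:
        # Append the optional context below the instruction
        user = f"{user}\n\n{extra}"
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": example["output"].strip()},
    ]


sample = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
print(alpaca_to_messages(sample))
```

The resulting list can then be passed to the tokenizer's apply_chat_template method to produce training text in the format the model expects.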

LoRA fine-tuning with Unsloth

finetune_lora.py

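from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load Phi-3-mini in 4-bit so it fits in 8 GB of VRAM
model, tokenizer = FastLanguageModel.from_pretrained("microsoft/Phi-3-mini-4k-instruct",
                                                     max_seq_length=2048,
                                                     dtype=torch.float16,
                                                     load_in_4bit=True)
# Attach LoRA adapters to the attention projections.
# Module names depend on the checkpoint layout; check model.named_modules() if these do not match.
model = FastLanguageModel.get_peft_model(model,
                                         r=16,
                                         target_modules=["qkv_proj", "o_proj"],
                                         lora_alpha=16,
                                         lora_dropout=0,
                                         bias="none",
                                         use_gradient_checkpointing=True)

def to_text(example):
    # alpaca-cleaned has instruction/input/output columns; build the "text" field SFTTrainer reads
    user = example["instruction"] + ("\n\n" + example["input"] if example["input"] else "")
    messages = [{"role": "user", "content": user},
                {"role": "assistant", "content": example["output"]}]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]").map(to_text)
# Signature of older TRL releases; newer ones take dataset_text_field and max_seq_length via SFTConfig
trainer = SFTTrainer(model=model,
                     tokenizer=tokenizer,
                     train_dataset=dataset,
                     dataset_text_field="text",
                     max_seq_length=2048,
                     args=TrainingArguments(per_device_train_batch_size=2,
                                            gradient_accumulation_steps=4,
                                            warmup_steps=5,
                                            max_steps=60,
                                            learning_rate=2e-4,
                                            fp16=True,
                                            logging_steps=1,
                                            output_dir="phi-lora-tuned",
                                            optim="adamw_8bit"))
trainer.train()
# Save merged 16-bit weights so the GGUF conversion gets a full model, not just the adapter
model.save_pretrained_merged("phi-lora-final", tokenizer, save_method="merged_16bit")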

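A complete Unsloth script for LoRA on Phi-3-mini: 4-bit quantization lets the model fit in 8 GB of VRAM, and a low rank of r=16 costs an estimated 1-2% in quality. The Alpaca dataset provides the instruction-tuning examples. Training runs for 60 steps (about 10 minutes on an RTX 4070). Pitfall: a batch size that is too large triggers an out-of-memory (OOM) error; lower per_device_train_batch_size and raise gradient_accumulation_steps to compensate.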
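A quick back-of-the-envelope calculation shows how small a rank-16 adapter is and how much data 60 steps actually cover. The layer shapes below (hidden size 3072, 32 layers, fused qkv_proj with three times the hidden width) are assumptions about Phi-3-mini. Check the model's config.json before relying on them.

```python
def lora_param_count(in_features, out_features, rank):
    """A LoRA adapter adds A (rank x in) and B (out x rank) to one linear layer."""
    return rank * (in_features + out_features)


# Assumed Phi-3-mini shapes (verify against the model's config.json).
HIDDEN = 3072
LAYERS = 32
RANK = 16

per_layer = (
    lora_param_count(HIDDEN, 3 * HIDDEN, RANK)  # fused qkv_proj
    + lora_param_count(HIDDEN, HIDDEN, RANK)    # o_proj
)
trainable = per_layer * LAYERS
share = trainable / 3.8e9  # 3.8B total parameters, as quoted in the introduction

effective_batch = 2 * 4               # per_device_train_batch_size * gradient_accumulation_steps
examples_seen = effective_batch * 60  # max_steps

print(f"trainable LoRA parameters: {trainable:,} ({share:.2%} of the model)")
print(f"effective batch: {effective_batch}, examples seen: {examples_seen}")
```

Under these assumptions the adapter trains well under 1% of the weights, and 60 steps cover fewer than half of the 1000 selected examples. That is enough for a smoke test but not for a full pass over the data.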

Converting the LoRA model to GGUF for Ollama

convert-gguf.sh

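git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Python dependencies for the conversion script
pip install -r requirements.txt
# Build with CUDA support (older checkouts used: make -j LLAMA_CUDA=1)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Convert the merged Hugging Face checkpoint to a 16-bit GGUF file
python convert_hf_to_gguf.py ../phi-lora-final --outtype f16 --outfile phi-lora.gguf
# Write a Modelfile that points at the local GGUF, then register it with Ollama
cat > Modelfile.gguf << EOF
FROM ./phi-lora.gguf
EOF
ollama create phi-lora-ollama -f Modelfile.gguf
ollama run phi-lora-ollama "Post-fine-tune test."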

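This converts the merged LoRA model to GGUF (the format Ollama loads) using a CUDA-enabled llama.cpp build. The CUDA build flag (LLAMA_CUDA=1 with the older Makefile, -DGGML_CUDA=ON with CMake) enables GPU inference in llama.cpp. The Modelfile points at the local GGUF file. Pitfall: the conversion script expects full model weights, so merge the LoRA adapter after training (Unsloth's model.save_pretrained_merged) before converting; skipping the merge degrades the result.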
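Merging is plain arithmetic, as the toy example below illustrates. It uses the standard LoRA update W' = W + (alpha / r) * B A on a 2x2 matrix with made-up numbers and pure-Python helpers. It shows the idea only and is not Unsloth's implementation.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter (all numbers are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]   # rank x in_features
B = [[2.0], [1.0]]  # out_features x rank
x = [[3.0], [4.0]]  # one input column vector

merged = merge_lora(W, A, B, alpha=2, rank=1)

# Unmerged path: base output plus the scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]

print(matmul(merged, x) == unmerged)
```

Both paths give the same output, but the merged matrix needs a single multiplication per layer. This is why a merged checkpoint converts to one standalone GGUF file.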

Benchmarking and monitoring

benchmark.py

import ollama

total_tokens = 0
total_seconds = 0.0
for _ in range(10):
    resp = ollama.chat(model='phi-lora-ollama',
                       messages=[{'role': 'user', 'content': 'Generate Python code for quicksort.'}])
    # Ollama reports the generated token count and the generation time (in nanoseconds)
    total_tokens += resp['eval_count']
    total_seconds += resp['eval_duration'] / 1e9
print(f"Tokens/s: {total_tokens / total_seconds:.2f}")
print(resp['message']['content'])

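This measures real-world performance: tokens/s over 10 queries. Compare before and after fine-tuning (expect up to about +20% speed with an optimized LoRA setup, and measure rather than assume, since merged LoRA weights keep the base architecture). To track perplexity, add an evaluation set to the trainer. Pitfall: do not let cold-start latency skew the numbers; send a warm-up request once ollama serve is running before you start measuring.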
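Perplexity follows directly from the evaluation loss: it is the exponential of the mean per-token negative log-likelihood. A minimal sketch, with hypothetical loss values standing in for real evaluation results:

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_losses:
        raise ValueError("need at least one loss value")
    return math.exp(sum(token_losses) / len(token_losses))


# Hypothetical per-token losses (natural log) before and after fine-tuning.
before = [2.1, 1.9, 2.3, 2.0]
after = [1.6, 1.4, 1.7, 1.5]
print(f"before: {perplexity(before):.2f}  after: {perplexity(after):.2f}")
```

If the trainer already logs an average evaluation loss, math.exp of that value gives the same number. Lower is better.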

Best practices

Quantization first: always use 4-bit or 8-bit with less than 12 GB of VRAM; Unsloth handles the merge on export.
Dataset curation: 500-5000 high-quality examples beat raw volume; filter them with Hugging Face Datasets.
LoRA hyperparameters: r=16 or 32, alpha=2*r, dropout=0.05 for better generalization.
Eval loop: integrate LM-Eval or a custom perplexity check after training.
Versioning: tag GGUF builds and publish them to a private registry with ollama push.
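The dataset curation advice above can start with a very simple filter. The sketch below removes duplicates and outputs of implausible length. The function name curate and the length thresholds are illustrative assumptions to adjust per dataset.

```python
def curate(examples, min_chars=20, max_chars=2000):
    """Drop duplicates and examples whose output is too short or too long."""
    seen = set()
    kept = []
    for ex in examples:
        # Normalize case and whitespace so near-identical pairs count as duplicates
        key = (ex["instruction"].strip().lower(), ex["output"].strip().lower())
        if key in seen:
            continue
        if not (min_chars <= len(ex["output"]) <= max_chars):
            continue
        seen.add(key)
        kept.append(ex)
    return kept


data = [
    {"instruction": "Say hi", "output": "Hello there, nice to meet you!"},
    {"instruction": "say hi ", "output": "hello there, nice to meet you!"},
    {"instruction": "Too short", "output": "ok"},
]
print(len(curate(data)))
```

The same predicate can be applied to a Hugging Face dataset through its filter method once the thresholds look right on a sample.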

Common mistakes to avoid

GPU OOM: reduce max_seq_length to 1024 or set the batch size to 1; monitor with nvidia-smi.
Unmerged LoRA: inference is slow; merge the adapter with PEFT's merge_and_unload() or Unsloth's save_pretrained_merged().
Badly formatted dataset: Phi-3 expects its ChatML-like chat format (<|user|>...); build prompts with tokenizer.apply_chat_template.
Overfitting: hard to detect in runs under 100 steps with no validation set; use an 80/20 split and early stopping.
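The 80/20 split and early stopping from the last item can be sketched in a few lines. The helper names, the seed, and the patience value are illustrative. A real training run would more likely use the trainer's own evaluation and callback options.

```python
import random


def split_train_val(examples, val_fraction=0.2, seed=42):
    """Shuffle deterministically and hold out `val_fraction` for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def should_stop(val_losses, patience=3):
    """Stop when the best loss has not improved during the last `patience` evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


train, val = split_train_val(range(100))
print(len(train), len(val))
print(should_stop([1.0, 0.8, 0.85, 0.9, 0.81]))
```

Validation loss that stalls or rises while training loss keeps falling is the usual sign of overfitting, which is the situation should_stop detects.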

Going further

Official docs: Hugging Face Phi-3, Unsloth GitHub
Advanced datasets: Dolly15k
Tools: LM Studio for a UI, vLLM for production serving
