Introduction
Phi-3, Microsoft's family of open-source Small Language Models (SLMs), packs strong performance into a compact size: Phi-3-mini has 3.8B parameters and is positioned as competitive with some 7B+ LLMs. Why choose it? It runs locally on an 8GB GPU, cuts cloud costs, and keeps data private. This advanced tutorial walks through each step: Ollama installation, inference through the CLI and the API, LoRA fine-tuning with Unsloth (which reports roughly 2x faster training on RTX GPUs), GGUF conversion, and custom deployment. It is aimed at personalized AI agents for RAG and specialized chatbots.
Prerequisites
- Hardware: NVIDIA GPU with ≥8GB VRAM (RTX 3060+), CUDA 12.1+
- OS: Ubuntu 22.04 LTS or WSL2 on Windows
- Software: Python 3.11+, Git, NVIDIA drivers 535+
- Knowledge: PyTorch, Hugging Face Transformers, PEFT/LoRA
- Disk space: 20GB free for models and datasets
Install Ollama and Phi-3
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
sleep 5
ollama pull phi3:mini
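ollama run phi3:mini "Hello, inference test!"
This script installs Ollama with a single command, starts the server in the background, downloads Phi-3-mini (3.8B parameters, 4k context), and runs a test prompt from the CLI. The & puts the server in the background so the script can continue, and sleep 5 gives it time to start. If the installer already registered Ollama as a system service, ollama serve reports that the port is in use and can be skipped. Pitfall: without a working NVIDIA driver and runtime, Ollama falls back to CPU inference, which is much slower (on the order of 10x).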
First Inference via API
Ollama exposes a REST API at http://localhost:11434. Use it for scalable apps. Think of it like a lightweight FastAPI server for LLMs.
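To make the REST API concrete, here is a minimal sketch that talks to it with only the Python standard library. It assumes the /api/generate endpoint with "stream": false, which returns one JSON object whose "response" field holds the text. The helper names build_generate_request and generate are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address stated above


def build_generate_request(prompt, model="phi3:mini", num_predict=64):
    """Build (but do not send) a POST request for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object instead of a stream
        "options": {"num_predict": num_predict},
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("Say hello in one sentence.") should return the model's reply as a string. Separating request construction from sending makes the payload easy to inspect or log before any network call.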
Python Client for Inference
import ollama

# Send a single-turn chat request to the local Ollama server.
response = ollama.chat(
    model='phi3:mini',
    messages=[
        {'role': 'user', 'content': 'Explain LoRA fine-tuning in 3 points.'}
    ],
    # temperature controls randomness; num_predict caps the number of generated tokens.
    options={'temperature': 0.7, 'num_predict': 200},
)
print(response['message']['content'])
This code uses the ollama library (pip install ollama) for structured inference with explicit options: temperature controls randomness and num_predict caps the number of generated tokens. The library also provides an AsyncClient for asynchronous use. Pitfall: without options, the model's defaults apply, and responses may be shorter or longer than you want.
Launch Custom API Server
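# Write the Modelfile; quoting EOF keeps the shell from touching the template text.
cat > Modelfile << 'EOF'
FROM phi3:mini
TEMPLATE "{{ if .System }}<|system|>{{ .System }}<|end|>{{ end }}{{ if .Prompt }}<|user|>{{ .Prompt }}<|end|>{{ end }}<|assistant|>"
PARAMETER temperature 0.8
PARAMETER stop "<|end|>"
EOF
ollama create phi-custom -f Modelfile
ollama run phi-custom "Your role: LoRA expert."
This creates a custom Modelfile that overrides Phi-3's prompt template (a tag-based chat format, similar in spirit to ChatML) and its parameters. Each {{ if }} in the template needs a matching {{ end }}. The stop parameter ends generation at the <|end|> token so the model does not run past its turn. Useful for RAG: add a SYSTEM instruction to the Modelfile. Pitfall: a malformed template breaks inference, so always test from the CLI first.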
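To see what the TEMPLATE above produces, the following sketch mimics its conditional logic in plain Python. It is only an illustration of the resulting prompt string; render_phi3_prompt is a hypothetical helper, and Ollama itself renders the template with Go's template engine.

```python
def render_phi3_prompt(prompt, system=None):
    """Mimic the Modelfile TEMPLATE: optional system turn, user turn, assistant tag."""
    parts = []
    if system:  # corresponds to {{ if .System }} ... {{ end }}
        parts.append(f"<|system|>{system}<|end|>")
    if prompt:  # corresponds to {{ if .Prompt }} ... {{ end }}
        parts.append(f"<|user|>{prompt}<|end|>")
    parts.append("<|assistant|>")  # the model continues from here
    return "".join(parts)


print(render_phi3_prompt("Explain LoRA briefly.", system="You are a LoRA expert."))
```

Because generation starts right after <|assistant|> and the model closes its turn with <|end|>, declaring <|end|> as a stop sequence cuts the output at the end of the answer.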
Preparing for Fine-Tuning
For fine-tuning, switch to Unsloth, which reports roughly 2x faster LoRA training on a single GPU compared to a plain Hugging Face setup. The example dataset is Alpaca (yahma/alpaca-cleaned), an instruction-tuning set loaded through Hugging Face Datasets.
Fine-Tune with LoRA Using Unsloth
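from unsloth import FastLanguageModel  # import Unsloth before transformers/trl
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load Phi-3-mini in 4-bit so the model fits in about 8GB of VRAM.
model, tokenizer = FastLanguageModel.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    max_seq_length=2048,
    dtype=torch.float16,
    load_in_4bit=True,
)

# Attach LoRA adapters. Module names depend on the checkpoint Unsloth loads;
# print(model) and adjust target_modules if these names are not found.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["qkv_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
)

def to_text(example):
    # alpaca-cleaned has instruction/input/output columns; build the "text"
    # field the trainer reads, using the tokenizer's own chat template.
    user = (example["instruction"] + "\n" + example["input"]).strip()
    messages = [{"role": "user", "content": user},
                {"role": "assistant", "content": example["output"]}]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]").map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="phi-lora-tuned",
        optim="adamw_8bit",
    ),
)
trainer.train()

# Save merged 16-bit weights (not just the adapter) so the model can be converted to GGUF.
model.save_pretrained_merged("phi-lora-final", tokenizer, save_method="merged_16bit")
Complete Unsloth script for LoRA on Phi-3-mini: 4-bit loading fits in 8GB of VRAM, and a low rank of r=16 trains only a small fraction of the weights, at a quality cost the author puts at 1-2%. The Alpaca columns (instruction, input, output) are first folded into a single text field, since the dataset has no such column by default. The script targets the older TRL API in which SFTTrainer accepts tokenizer, dataset_text_field, and max_seq_length directly; newer TRL releases move these settings into SFTConfig. Training runs for 60 steps (about 10 minutes on an RTX 4070). Pitfall: too large a batch size causes OOM; lower per_device_train_batch_size and raise gradient_accumulation_steps to keep the effective batch size.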
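The hyperparameters in the script above translate into a few numbers worth checking by hand. The sketch below estimates the trainable LoRA parameter count and the number of examples seen. The layer dimensions are assumptions about Phi-3-mini that the tutorial does not state (hidden size 3072, 32 layers, a fused qkv_proj of 3072 to 9216 and an o_proj of 3072 to 3072), so treat the result as an estimate.

```python
def lora_params(r, shapes, n_layers):
    """Trainable LoRA parameters: each adapted (d_in, d_out) layer adds r * (d_in + d_out)."""
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)


# Assumed Phi-3-mini dimensions (not given in the tutorial).
HIDDEN, LAYERS = 3072, 32
shapes = [(HIDDEN, 3 * HIDDEN),  # qkv_proj (fused query/key/value projection)
          (HIDDEN, HIDDEN)]      # o_proj

trainable = lora_params(r=16, shapes=shapes, n_layers=LAYERS)
effective_batch = 2 * 4                 # per_device_train_batch_size * gradient_accumulation_steps
examples_seen = 60 * effective_batch    # max_steps * effective batch size

print(f"trainable LoRA params: {trainable:,}")
print(f"share of 3.8B params: {trainable / 3.8e9:.2%}")
print(f"effective batch: {effective_batch}, examples seen: {examples_seen}")
```

Under these assumptions the adapters hold about 9.4 million parameters, roughly 0.25% of the model, and 60 steps cover 480 of the 1000 loaded examples, which is less than one epoch.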
Convert LoRA to GGUF for Ollama
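git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Install the Python dependencies of the conversion script.
pip install -r requirements.txt
# Optional: build the llama.cpp binaries with CUDA (older releases used make LLAMA_CUDA=1).
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Convert the merged Hugging Face model directory to a 16-bit GGUF file.
python convert_hf_to_gguf.py ../phi-lora-final --outtype f16 --outfile phi-lora.gguf
# Write a Modelfile that points at the local GGUF, then register it with Ollama.
cat > Modelfile.gguf << EOF
FROM ./phi-lora.gguf
EOF
ollama create phi-lora-ollama -f Modelfile.gguf
ollama run phi-lora-ollama "Test post-fine-tune."
Converts the merged model to GGUF, the format Ollama loads. The conversion script is pure Python; the CUDA build only matters if you also run llama.cpp's own binaries (for example to quantize or test the file), since Ollama ships its own GPU runtime. The Modelfile points to the local GGUF. Pitfall: the conversion needs full merged weights, not a bare LoRA adapter directory; merge after training with Unsloth's model.save_pretrained_merged.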
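Merging, as mentioned in the pitfall above, folds the adapter into the base weights: W' = W + (alpha / r) * B A. The toy example below is a sketch in pure Python on 2x2 matrices, not Unsloth's implementation. It checks that the merged weight gives the same output as running the base layer and the adapter side by side.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy values: W is 2x2, A is r x d_in (1x2), B is d_out x r (2x1), rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [-1.0]]
alpha, r = 2, 1
x = [[3.0], [4.0]]  # input as a column vector

merged = merge_lora(W, A, B, alpha, r)

# Unmerged path: base output plus the scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]

assert matmul(merged, x) == unmerged
print(merged)
```

After merging, inference needs a single matrix multiply per layer and no adapter files, which is why export tools expect merged weights.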
Benchmark and Monitoring
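import ollama

PROMPT = 'Generate Python code for quicksort.'
messages = [{'role': 'user', 'content': PROMPT}]

# Warm-up request so model loading is not counted in the measurement.
ollama.chat(model='phi-lora-ollama', messages=messages)

total_tokens = 0
total_ns = 0
for _ in range(10):
    resp = ollama.chat(model='phi-lora-ollama', messages=messages)
    total_tokens += resp['eval_count']     # generated tokens reported by Ollama
    total_ns += resp['eval_duration']      # generation time in nanoseconds
print(f"Tokens/s: {total_tokens / (total_ns / 1e9):.2f}")
print(resp['message']['content'])
Measures generation throughput in tokens/s over 10 queries, using the token count and generation time that Ollama reports rather than character counts. Compare the base and fine-tuned models at the same quantization level: a merged LoRA has the same architecture as the base model, so throughput should be similar, and the more useful comparison is output quality. Pass an eval_dataset to the trainer to track evaluation loss, from which perplexity follows. Pitfall: cold-start latency skews the numbers, so send a warm-up request before timing.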
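Perplexity, mentioned above as an evaluation metric, is the exponential of the mean per-token negative log-likelihood. A minimal sketch follows; the perplexity helper is illustrative, and the comment about trainer.evaluate() assumes a Hugging Face Trainer configured with an evaluation set.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every correct token has perplexity 4.
uniform_over_four = [math.log(4)] * 3
print(round(perplexity(uniform_over_four), 6))

# With a Hugging Face Trainer, the mean loss is already available:
#   ppl = math.exp(trainer.evaluate()["eval_loss"])
```

Lower is better: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.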
Best Practices
- Quantize first: Always use 4bit/8bit for <12GB VRAM; Unsloth handles auto-merge.
- Dataset curation: 500-5000 quality examples > sheer volume; use HF quality_filter.
- LoRA hyperparameters: r=16/32, alpha=2*r, dropout=0.05 for generalization.
- Eval loop: Integrate LM-Eval or custom perplexity post-training.
- Versioning: Tag GGUF with ollama push to a private registry.
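As an illustration of the dataset curation point, here is a small sketch that drops duplicate prompts and outputs of implausible length from Alpaca-style records. The curate function and its thresholds are hypothetical; tune them for your own data.

```python
def curate(examples, min_chars=20, max_chars=4000):
    """Keep the first occurrence of each prompt whose output length is within bounds."""
    seen = set()
    kept = []
    for ex in examples:
        # Normalize the prompt so trivial variations count as duplicates.
        key = (ex["instruction"].strip().lower(), ex["input"].strip().lower())
        n_chars = len(ex["output"].strip())
        if key in seen or not (min_chars <= n_chars <= max_chars):
            continue
        seen.add(key)
        kept.append(ex)
    return kept


examples = [
    {"instruction": "Sort a list", "input": "", "output": "Use sorted(items) to get a new sorted list."},
    {"instruction": "sort a list ", "input": "", "output": "Call items.sort() to sort the list in place."},
    {"instruction": "Say hi", "input": "", "output": "Hi"},
]
print(len(curate(examples)))
```

Here the second record is dropped as a duplicate prompt and the third because its output is too short, leaving one example.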
Common Errors to Avoid
- GPU OOM: Reduce max_seq_length to 1024 or batch=1; monitor with nvidia-smi.
- Unmerged LoRA: Slow inference; merge before export with Unsloth's model.save_pretrained_merged() or PEFT's model.merge_and_unload().
- Poorly formatted dataset: Phi-3 expects its chat format (<|user|>...); build it with tokenizer.apply_chat_template.
- Overfitting: Easy to miss without a validation set, even under 100 steps; split 80/20 and early-stop.
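The 80/20 split and early stopping advice can be sketched in a few lines of plain Python. Both helpers are hypothetical illustrations, and the patience value is an assumption rather than a recommendation from the text.

```python
import random


def split_80_20(examples, seed=0):
    """Shuffle deterministically and return (train, validation) with an 80/20 split."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]


def should_stop(val_losses, patience=3):
    """True when the last `patience` evaluations failed to beat the earlier best loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


train, val = split_80_20(range(10))
print(len(train), len(val))
print(should_stop([1.0, 0.9, 0.8, 0.85, 0.9, 0.95]))
```

In the sample history the validation loss bottoms out at 0.8 and then rises for three evaluations, so should_stop returns True. A history that is still improving returns False.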
Next Steps
- Official docs: Hugging Face Phi-3, Unsloth GitHub
- Advanced datasets: Dolly15k
- Tools: LM Studio for UI, vLLM for production serving