Introduction
Triton Inference Server, developed by NVIDIA, is a powerful open-source solution for deploying and serving machine learning models in production. Unlike custom servers, Triton natively supports multiple frameworks (TensorFlow, PyTorch, ONNX, etc.), optimizes inference on GPU/CPU, and handles dynamic scaling via Kubernetes or Docker.
Why use it in 2026? With the rise of generative AI and MLOps pipelines, Triton excels in heterogeneous environments: automatic batching, model ensembles, and built-in metrics. This beginner tutorial guides you through a complete setup: generate a simple ONNX model (linear regression), configure the model repository, launch via Docker, and query via Python client.
By the end, you'll have a working inference server ready for real applications. Estimated time: 20 minutes. No GPU needed: everything runs on CPU.
Prerequisites
- Docker installed (version 20+)
- Python 3.10+ with pip
- Internet access to pull Docker images and pip installs
- 2 GB free RAM
- Basic terminal and Python knowledge (no advanced ML required)
Run docker --version and python --version to verify.
Create the project structure
#!/bin/bash
mkdir -p triton-tutorial/models/simple_regression/1
cd triton-tutorial
# Install Python dependencies
pip install scikit-learn==1.5.1 skl2onnx==1.17.0 onnxruntime==1.19.2 numpy==2.1.1
# Create an empty file for the model (will be generated next)
touch models/simple_regression/1/model.onnx
echo "Structure created: models/simple_regression/1/ ready for config and model."This script initializes the project folder with the required Triton hierarchy: models/. It installs the libraries needed to generate an ONNX model via scikit-learn. Avoid spaces in folder names to prevent Docker mount errors.
Generate the simple ONNX model
import os
import numpy as np
from sklearn.linear_model import LinearRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
# Sample data: y = 2*x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
# Train the model
model = LinearRegression()
model.fit(X, y)
# Export to ONNX: dynamic batch dimension, one feature per sample
initial_type = [('input', FloatTensorType([None, 1]))]
onnx_model = convert_sklearn(model, initial_types=initial_type, target_opset=15)
# Save into the Triton model repository
os.makedirs('models/simple_regression/1', exist_ok=True)
with open('models/simple_regression/1/model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())
print('ONNX model generated: y ≈ 2*x + 1')
print('Local test:', model.predict([[6.0]]))  # Should be ≈13
This code creates a simple linear regression model (the function y = 2x + 1), trains it on 5 points, and exports it to ONNX for Triton compatibility. Use target_opset=15 for stability. Pitfall: check input/output shapes carefully, as Triton is strict about dimensions; the skl2onnx converter names the regression output "variable", which the config below must match.
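Before wiring the model into Triton, it can help to sanity-check the exported file with onnxruntime (already installed in the prerequisites step). This is a minimal sketch, assuming the model was saved to models/simple_regression/1/model.onnx as above:
import numpy as np
import onnxruntime as ort
# Load the exported model with the CPU execution provider
session = ort.InferenceSession(
    'models/simple_regression/1/model.onnx',
    providers=['CPUExecutionProvider'],
)
# Inspect the graph: these names must match config.pbtxt
print('Input name :', session.get_inputs()[0].name)    # 'input'
print('Output name:', session.get_outputs()[0].name)   # 'variable'
# Run a local prediction for x = 6.0 (expected ~13)
result = session.run(None, {'input': np.array([[6.0]], dtype=np.float32)})
print('ONNX prediction:', result[0])
If the printed names differ from what you expect, adjust config.pbtxt in the next step rather than the client code.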
Configure the model (config.pbtxt)
name: "simple_regression"
platform: "onnxruntime_onnx"
max_batch_size: 128
input {
name: "input"
data_type: TYPE_FP32
dims: [ 1 ]
}
output {
name: "variable"
data_type: TYPE_FP32
dims: [ 1 ]
}
instance_group [
{
count: 1
kind: KIND_CPU
}
]
optimization {
cpu {
separate: true # Optimize for CPU
}
}
This config.pbtxt file defines the model name, the ONNX Runtime platform, the inputs/outputs (shape [1] for a scalar x per sample; the batch dimension is handled separately), and one CPU instance via instance_group. max_batch_size: 128 enables batching. Avoid dimension mismatches: the input 'input' and output 'variable' are FP32 vectors of length 1, and the names must match the ONNX graph exactly.
Launch the Triton server
#!/bin/bash
cd triton-tutorial
# Pull the Triton server image (release 24.08, runs on CPU or GPU)
docker pull nvcr.io/nvidia/tritonserver:24.08-py3
# Launch server: exposes HTTP, gRPC, metrics
# Mounts the model repo
docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v $(pwd)/models:/models \
nvcr.io/nvidia/tritonserver:24.08-py3 \
tritonserver --model-repository=/models --log-verbose=1
# Ctrl+C to stop. In the logs, look for 'Started HTTPService' and the model table showing simple_regression as READY.
This script pulls the official Triton image and launches the server with a volume mount for the models. Ports: 8000 (HTTP), 8001 (gRPC), 8002 (metrics). --log-verbose=1 aids debugging. Pitfall: ensure models/ is accessible to Docker; for GPU, add --gpus all and install the NVIDIA Container Toolkit.
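While the container starts, a short Python loop can wait for the server to come up before you run the client. This is a hedged sketch using tritonclient's is_server_ready() and is_model_ready(), assuming the default HTTP port 8000:
import time
import tritonclient.http as httpclient
client = httpclient.InferenceServerClient(url="localhost:8000")
# Poll readiness for up to ~30 seconds
for _ in range(30):
    try:
        if client.is_server_ready() and client.is_model_ready("simple_regression"):
            print("Triton is ready, model loaded.")
            break
    except Exception:
        pass  # Server not reachable yet
    time.sleep(1)
else:
    print("Timed out waiting for Triton.")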
Python client to test inference
import tritonclient.http as httpclient
import numpy as np
# Connect to local HTTP server
triton_client = httpclient.InferenceServerClient(url="localhost:8000")
# Prepare input: x=6.0 → expected ~13
input_data = np.array([[6.0]], dtype=np.float32)  # Shape [1, 1]: batch of 1, one feature
input_tensor = httpclient.InferInput("input", [1, 1], "FP32")
input_tensor.set_data_from_numpy(input_data)
output_tensor = httpclient.InferRequestedOutput("variable")
# Inference
results = triton_client.infer(model_name="simple_regression", inputs=[input_tensor], outputs=[output_tensor])
prediction = results.as_numpy("variable")
print(f"Prediction for x=6: {prediction[0][0]:.2f} (expected ~13)")
# Batch test
batch_inputs = np.array([[6.0], [7.0]], dtype=np.float32)
batch_input = httpclient.InferInput("input", [2, 1], "FP32")
batch_input.set_data_from_numpy(batch_inputs)
batch_results = triton_client.infer(model_name="simple_regression", inputs=[batch_input], outputs=[output_tensor])
print(f"Batch : {batch_results.as_numpy('variable')} (attendu ~13,15)")This client uses tritonclient[http] (pip install tritonclient[all]) to query HTTP. Input/output matches config.pbtxt. Tests single and batch inference. Pitfall: install pip install tritonclient[all]==2.48.0 first; verify server is up with curl localhost:8000/v2/health/ready.
Verification and monitoring
Once the server is running, open http://localhost:8002/metrics for Prometheus-format metrics. Test health with curl http://localhost:8000/v2/health/ready (returns HTTP 200 when ready). The Docker logs show the model as READY. Monitor resource usage with docker stats, and move to Kubernetes for production scaling.
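As a quick illustration, a few lines of Python can pull the metrics endpoint and filter the inference counters. This sketch assumes the requests package is installed and the server exposes metrics on the default port 8002:
import requests
# Fetch the Prometheus-format metrics exposed by Triton
text = requests.get("http://localhost:8002/metrics", timeout=5).text
# Print only the inference-related counters
for line in text.splitlines():
    if line.startswith("nv_inference_request_success") or line.startswith("nv_inference_count"):
        print(line)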
Best practices
- Version models: use numeric version folders such as models/my_model/2/ for updates without downtime.
- Secure: run with --model-control-mode explicit and TLS auth in production (see the sketch after this list).
- Optimize: enable dynamic_batching in config.pbtxt for much higher throughput under load.
- Multi-models: add multiple folders under models/ for model ensembles.
- CI/CD: integrate with GitHub Actions for automated model repository rebuilds.
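As an illustration of explicit model control, the HTTP client exposes load_model() and unload_model(). This sketch assumes the server was started with --model-control-mode explicit (otherwise these calls are rejected):
import tritonclient.http as httpclient
client = httpclient.InferenceServerClient(url="localhost:8000")
# With --model-control-mode explicit, models are loaded on demand
client.load_model("simple_regression")
print("Loaded:", client.is_model_ready("simple_regression"))
# Unload when no longer needed (e.g. before swapping in a new version)
client.unload_model("simple_regression")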
Common errors to avoid
- Dims mismatch: the input shape sent by the client does not match config.pbtxt → 'INVALID_ARGUMENT' (see the sketch after this list).
- Model not found: check the volume mount and the exact model name (case-sensitive).
- Port occupied: kill processes on 8000/8001/8002 or change the -p mappings (e.g. -p 9000:8000).
- No backend: for ONNX, the py3 image suffices; for GPU you also need the NVIDIA drivers and Container Toolkit.
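To see what a dims mismatch looks like from the client side, this hedged sketch sends a deliberately wrong shape and catches the resulting InferenceServerException from tritonclient.utils:
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException
client = httpclient.InferenceServerClient(url="localhost:8000")
# Wrong shape on purpose: config.pbtxt expects [batch, 1], we send [batch, 3]
bad_input = httpclient.InferInput("input", [1, 3], "FP32")
bad_input.set_data_from_numpy(np.array([[1.0, 2.0, 3.0]], dtype=np.float32))
try:
    client.infer(model_name="simple_regression", inputs=[bad_input])
except InferenceServerException as e:
    print("Inference failed as expected:", e)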
Next steps
To master advanced Triton features, see the official NVIDIA documentation.
Check out our Learni training courses on MLOps and AI to scale with Kubernetes and Triton.
Resources: Triton GitHub, NGC examples.