How to Install and Use Triton Inference Server in 2026

Introduction

Triton Inference Server, developed by NVIDIA, is a powerful open-source solution for deploying and serving machine learning models in production. Unlike custom servers, Triton natively supports multiple frameworks (TensorFlow, PyTorch, ONNX, etc.), optimizes inference on both GPU and CPU, and integrates cleanly with Docker and Kubernetes for scaling.

Why use it in 2026? With the rise of generative AI and MLOps pipelines, Triton excels in heterogeneous environments: automatic batching, model ensembles, and built-in metrics. This beginner tutorial guides you through a complete setup: generate a simple ONNX model (linear regression), configure the model repository, launch via Docker, and query via Python client.

By the end, you'll have a working inference server ready for real applications. Estimated time: 20 minutes. No GPU needed; everything in this tutorial runs on CPU.

Prerequisites

  • Docker installed (version 20+)
  • Python 3.10+ with pip
  • Internet access to pull Docker images and pip installs
  • 2 GB free RAM
  • Basic terminal and Python knowledge (no advanced ML required)
Run docker --version and python --version to verify.

Create the project structure

setup.sh
#!/bin/bash
mkdir -p triton-tutorial/models/simple_regression/1
cd triton-tutorial

# Install Python dependencies
pip install scikit-learn==1.5.1 skl2onnx==1.17.0 onnxruntime==1.19.2 numpy==2.1.1

# Create an empty file for the model (will be generated next)
touch models/simple_regression/1/model.onnx

echo "Structure created: models/simple_regression/1/ ready for config and model."

This script initializes the project folder with the hierarchy Triton requires: models/<model-name>/<version>/. It also installs the libraries needed to generate an ONNX model with scikit-learn. Avoid spaces in folder names to prevent Docker mount errors.
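Once the next two steps are done, the repository should have exactly this layout (Triton scans it at startup):

triton-tutorial/
└── models/
    └── simple_regression/
        ├── config.pbtxt
        └── 1/
            └── model.onnx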

Generate the simple ONNX model

generate_model.py
import numpy as np
from sklearn.linear_model import LinearRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Sample data: y = 2*x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Train the model
model = LinearRegression()
model.fit(X, y)

# Export to ONNX: dynamic batch dimension, one float feature
initial_type = [('input', FloatTensorType([None, 1]))]
onnx_model = convert_sklearn(model, initial_types=initial_type, target_opset=15)

# Save into the Triton model repository
import os
os.makedirs('models/simple_regression/1', exist_ok=True)
with open('models/simple_regression/1/model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())

print('ONNX model generated: y ≈ 2*x + 1')
print('Local test:', model.predict([[6.0]]))  # Should be ≈ 13

This code creates a simple linear regression model (the function y = 2x + 1), trains it on 5 points, and exports it to ONNX for Triton compatibility. FloatTensorType([None, 1]) declares a dynamic batch dimension, which Triton's batching requires, and target_opset=15 is a stable choice. Pitfall: check input/output shapes carefully, as Triton is strict about dimensions.
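To double-check what the exported model actually exposes, you can open it with onnxruntime (installed by setup.sh). A minimal inspection sketch:

inspect_model.py
import onnxruntime as ort

# Load the exported model on CPU and list its inputs/outputs
sess = ort.InferenceSession('models/simple_regression/1/model.onnx',
                            providers=['CPUExecutionProvider'])
for i in sess.get_inputs():
    print('input :', i.name, i.shape, i.type)
for o in sess.get_outputs():
    print('output:', o.name, o.shape, o.type)

The names printed here ('input' and, by skl2onnx's default for regressors, 'variable') are exactly what config.pbtxt must declare in the next step.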

Configure the model (config.pbtxt)

models/simple_regression/config.pbtxt
name: "simple_regression"
platform: "onnxruntime_onnx"

max_batch_size: 128

input {
  name: "input"
  data_type: TYPE_FP32
  dims: [ 1 ]
}

output {
  name: "variable"
  data_type: TYPE_FP32
  dims: [ 1 ]
}

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

This config.pbtxt file defines the model name, the ONNX Runtime platform, and the inputs/outputs: the per-request shape is [ 1 ] for a scalar x, with the batch dimension left implicit because max_batch_size is set. max_batch_size: 128 enables batching, and instance_group pins one model instance to the CPU. Avoid dimension mismatches: the input 'input' is an FP32 vector of size 1, and the output name 'variable' matches what skl2onnx assigns by default.
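If several clients hit the server concurrently, Triton can merge their requests into larger batches with dynamic batching. A minimal addition to config.pbtxt (the 100 µs queue delay is an illustrative starting value; tune it against your latency budget):

dynamic_batching {
  max_queue_delay_microseconds: 100
}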

Launch the Triton server

run_server.sh
#!/bin/bash
cd triton-tutorial

# Pull the Triton image (pinned release; the -py3 image runs on CPU or GPU)
docker pull nvcr.io/nvidia/tritonserver:24.08-py3

# Launch server: exposes HTTP, gRPC, metrics
# Mounts the model repo

docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$(pwd)/models:/models" \
  nvcr.io/nvidia/tritonserver:24.08-py3 \
  tritonserver --model-repository=/models --log-verbose=1

# Ctrl+C to stop. Check the logs for 'Started HTTPService' and the model
# status table showing simple_regression as READY.

This script pulls the official Triton image and launches the server with a volume mount for the models. Ports: 8000 (HTTP), 8001 (gRPC), 8002 (metrics). --log-verbose=1 aids debugging. Pitfall: make sure the models/ folder is accessible to Docker; for GPU inference, add --gpus all and install the NVIDIA Container Toolkit.
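The server takes a few seconds to load the model. To avoid querying too early, you can poll the readiness endpoints; here is a minimal sketch using the tritonclient package (installed in the client step below):

wait_ready.py
import time
import tritonclient.http as httpclient

# Poll until both the server and the model report ready (up to ~30 s)
client = httpclient.InferenceServerClient(url="localhost:8000")
for _ in range(30):
    try:
        if client.is_server_ready() and client.is_model_ready("simple_regression"):
            print("Server and model are ready.")
            break
    except Exception:
        pass  # server not accepting connections yet
    time.sleep(1)
else:
    raise RuntimeError("Triton did not become ready in time.")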

Python client to test inference

test_client.py
import tritonclient.http as httpclient
import numpy as np

# Connect to local HTTP server
triton_client = httpclient.InferenceServerClient(url="localhost:8000")

# Prepare input: x=6.0 → expected ~13
# max_batch_size is set in config.pbtxt, so the shape includes a batch dim: [1, 1]
input_data = np.array([[6.0]], dtype=np.float32)
input_tensor = httpclient.InferInput("input", [1, 1], "FP32")
input_tensor.set_data_from_numpy(input_data)

output_tensor = httpclient.InferRequestedOutput("variable")

# Inference
results = triton_client.infer(model_name="simple_regression", inputs=[input_tensor], outputs=[output_tensor])

prediction = results.as_numpy("variable")
print(f"Prediction for x=6: {prediction[0][0]:.2f} (expected ~13)")

# Batch test
batch_inputs = np.array([[6.0], [7.0]], dtype=np.float32)
batch_input = httpclient.InferInput("input", [2, 1], "FP32")
batch_input.set_data_from_numpy(batch_inputs)
batch_results = triton_client.infer(model_name="simple_regression", inputs=[batch_input], outputs=[output_tensor])
print(f"Batch : {batch_results.as_numpy('variable')} (attendu ~13,15)")

This client uses the tritonclient package to query the HTTP endpoint. Input/output names and shapes match config.pbtxt, and both single and batch inference are tested. Pitfall: install the client first with pip install "tritonclient[all]==2.48.0", and verify the server is up with curl localhost:8000/v2/health/ready.

Verification and monitoring

Once the server is running, Prometheus-format metrics are available at http://localhost:8002/metrics. Test health with curl http://localhost:8000/v2/health/ready (returns HTTP 200 when the server is ready). The Docker logs show the model status table with READY. For production, monitor resources with docker stats or deploy on Kubernetes.
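For a quick look at the counters without a Prometheus server, this minimal sketch fetches the metrics endpoint and prints the inference-related series (Triton prefixes them with nv_inference_):

metrics_check.py
from urllib.request import urlopen

# Fetch the Prometheus-format metrics and keep the inference counters
with urlopen("http://localhost:8002/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith("nv_inference_"):
            print(line)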

Best practices

  • Version models: use numeric version folders (models/my_model/2/) to roll out updates without downtime.
  • Secure: run with --model-control-mode=explicit so only requested models load (see the sketch after this list), and put TLS and authentication in front of the server in production.
  • Optimize: enable dynamic_batching in config.pbtxt (shown earlier) for much higher throughput under concurrent load.
  • Multi-model: add more folders under models/ to serve several models, or combine them into ensembles.
  • CI/CD: integrate with GitHub Actions to rebuild and redeploy the model repository automatically.
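With --model-control-mode=explicit, nothing loads at startup and models are managed through the client API. A minimal sketch:

model_control.py
import tritonclient.http as httpclient

# Load, check, and unload a model on a server started with
# --model-control-mode=explicit
client = httpclient.InferenceServerClient(url="localhost:8000")
client.load_model("simple_regression")
print(client.is_model_ready("simple_regression"))  # True once loaded
client.unload_model("simple_regression")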

Common errors to avoid

  • Dims mismatch: Input shape in config.pbtxt != client → 'INVALID_ARGUMENT'.
  • Model not found: Check volume mount and exact name (case-sensitive).
  • Port occupied: kill the processes using 8000/8001/8002 or remap the host ports (e.g. -p 9000:8000).
  • No backend: for ONNX models the -py3 image suffices; GPU inference additionally requires the NVIDIA Container Toolkit.

Next steps

Master advanced Triton: NVIDIA official docs.

Check out our Learni training courses on MLOps and AI to scale with Kubernetes and Triton.

Resources: Triton GitHub, NGC examples.