
How to Master ONNX Runtime for ML Inference in 2026


Introduction

ONNX Runtime is a high-performance inference engine for ONNX models, supporting CPU, GPU (CUDA, DirectML), and even WebAssembly. Unlike training frameworks such as TensorFlow or PyTorch, which drag heavy runtime dependencies into deployment, ONNX Runtime optimizes the model graph at a low level, often cutting latency several-fold on standard hardware. For expert ML engineers, it's an essential production tool: it handles dynamic input shapes, batching, and hardware-specific execution providers. Think of it as a turbo engine for your models: it parses the ONNX graph once, applies graph-level optimizations, then runs with fused, vectorized kernels. This tutorial walks you through generating an Iris model (sklearn to ONNX), running inference on CPU/GPU, optimizing sessions, and benchmarking real performance. By the end, you'll be ready to deploy scalable pipelines for Kubernetes or edge devices.

Prerequisites

  • Python 3.10 or higher
  • Up-to-date pip
  • scikit-learn, numpy for model generation
  • NVIDIA GPU with CUDA 12+ for the GPU section (optional, otherwise CPU only)
  • Advanced ML knowledge: ONNX basics, computational graphs, and profiling

Install Dependencies

install.sh
pip install onnxruntime scikit-learn skl2onnx numpy pandas matplotlib
# For NVIDIA GPUs (install instead of the CPU package):
pip install onnxruntime-gpu

This command installs ONNX Runtime for CPU along with skl2onnx to convert sklearn models to ONNX. For GPU, install onnxruntime-gpu instead of onnxruntime (the two packages conflict if both are present); it uses CUDA automatically when a compatible driver is found. Work inside a virtualenv to avoid dependency clashes, and verify with python -c 'import onnxruntime as ort; print(ort.__version__)'.

Generating a Reference ONNX Model

We'll use the Iris dataset for a simple DecisionTree classifier, easily convertible to ONNX. This perfectly demonstrates dynamic shapes ([None, 4] for variable batch sizes). The resulting model is lightweight (a few KB), ideal for testing optimizations without external dependencies.

Create the Iris ONNX Model

generate_iris_onnx.py
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import numpy as np

# Load and train the model
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)

# Declare a dynamic input type: [None, 4] allows any batch size
initial_type = [('float_input', FloatTensorType([None, 4]))]

# Convert to ONNX
model_onnx = convert_sklearn(
    clf,
    initial_types=initial_type,
    target_opset=15
)

# Save to disk
with open("iris.onnx", "wb") as f:
    f.write(model_onnx.SerializeToString())

print("Model iris.onnx generated successfully.")
print(f"Inputs: {model_onnx.graph.input}")
print(f"Outputs: {model_onnx.graph.output}")

This script trains a decision tree on Iris and exports it to ONNX opset 15. FloatTensorType([None, 4]) enables variable batch sizes, which matters for production serving. Pitfall: always pin target_opset for runtime compatibility, and validate the exported graph with onnx.checker.check_model(model_onnx).

Basic CPU Inference

inference_cpu.py
import onnxruntime as ort
import numpy as np

# Default CPU session
session = ort.InferenceSession("iris.onnx")
input_name = session.get_inputs()[0].name

# Example input: a setosa sample
input_data = np.array([[5.1, 3.5, 1.4, 0.2]], dtype=np.float32)

# sklearn classifiers export two outputs: predicted label and per-class probabilities
label, probabilities = session.run(None, {input_name: input_data})

class_names = ["setosa", "versicolor", "virginica"]
print(f"Prediction: {label[0]} ({class_names[label[0]]})")
print(f"Probabilities: {probabilities[0]}")

The session uses the default CPU provider, parses the ONNX graph, and runs inference. Use get_inputs()/get_outputs() to discover tensor names instead of hard-coding them. Common pitfall: inputs must be np.float32, or the run fails with a type error. Passing None as the output list fetches every output, which is how you retrieve both the label and the probabilities here.
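The dtype pitfall is easy to guard against with a small helper before every run. A minimal sketch (prepare_input is a hypothetical name, not part of the ONNX Runtime API):

```python
import numpy as np

def prepare_input(x):
    """Coerce array-like input to a float32 batch of shape (N, 4).

    ONNX Runtime rejects float64 tensors for a float32 graph input,
    so always cast before calling session.run().
    """
    arr = np.asarray(x, dtype=np.float32)
    if arr.ndim == 1:  # single sample -> add a batch dimension
        arr = arr[np.newaxis, :]
    if arr.ndim != 2 or arr.shape[1] != 4:
        raise ValueError(f"expected shape (N, 4), got {arr.shape}")
    return arr

batch = prepare_input([5.1, 3.5, 1.4, 0.2])
print(batch.shape, batch.dtype)  # (1, 4) float32
```

Calling session.run with the returned array then works for single samples and batches alike.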

GPU Acceleration with Providers

Analogy: providers are like hardware drivers. CUDAExecutionProvider targets the GPU when available, with CPU fallback. On larger CNN/Transformer workloads the speedup can be an order of magnitude; on a few-KB tree like this Iris model, expect no gain. Check what is active with session.get_providers().
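The fallback logic can be made explicit. A minimal sketch (pick_providers is a hypothetical helper; in real code you would feed it ort.get_available_providers()):

```python
def pick_providers(available, preferred=("CUDAExecutionProvider",)):
    """Keep only the preferred providers that this build actually offers,
    always appending the CPU provider as a last-resort fallback."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

# On a CPU-only build, the CUDA entry is silently dropped:
print(pick_providers(["CPUExecutionProvider"]))
# -> ['CPUExecutionProvider']
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
# -> ['CUDAExecutionProvider', 'CPUExecutionProvider']
```

Passing the resulting list to ort.InferenceSession keeps the session constructible on machines without a GPU.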

Optimized GPU Inference

inference_gpu.py
import onnxruntime as ort
import numpy as np

providers = [
    'CUDAExecutionProvider',
    'CPUExecutionProvider'
]
session = ort.InferenceSession("iris.onnx", providers=providers)

input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Same input as before
input_data = np.array([[5.1, 3.5, 1.4, 0.2]], dtype=np.float32)
outputs = session.run([output_name], {input_name: input_data})

print(f"Provider utilisé: {session.get_providers()[0]}")
print(f"Prédiction GPU: {np.argmax(outputs[0])}")

Providers are tried in order, so CUDA runs first when available; use ort.get_device() to confirm your build supports GPU. Gains appear on larger models, not on a few-KB tree. Pitfall: when mixing frameworks, synchronize CUDA (e.g. torch.cuda.synchronize()) before timing; for multi-GPU setups, pass provider options with a device_id.

Advanced Session Optimization

optimized_session.py
import onnxruntime as ort
import numpy as np

# Session options for optimizations
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
session_options.intra_op_num_threads = 4  # CPU parallelism
session_options.enable_cpu_mem_arena = False  # For small models

providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("iris.onnx", sess_options=session_options, providers=providers)

input_name = session.get_inputs()[0].name
input_data = np.random.rand(1, 4).astype(np.float32)
outputs = session.run(None, {input_name: input_data})

print("Optimized session ready.")

ORT_ENABLE_EXTENDED fuses nodes beyond the basic optimization level; intra_op_num_threads controls parallelism within an operator across CPU cores. Note that enable_mem_pattern is on by default and applies to CPU execution. Pitfall: aggressive options can add session-creation overhead that never pays off on small models; always profile.

Handling Batches and Dynamic Shapes

Dynamic shapes [batch_size, features] let one session serve any request size. For experts, IO binding skips redundant CPU-GPU copies, which can noticeably raise throughput in production.
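Variable batch sizes pair naturally with simple chunking: split incoming data into fixed-size slices and feed each slice to the same session. A minimal sketch (iter_batches is a hypothetical helper, independent of ONNX Runtime):

```python
import numpy as np

def iter_batches(data, batch_size):
    """Yield contiguous row slices of at most batch_size rows, so a single
    session with a [None, 4] input can serve requests of any size."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

samples = np.random.rand(100, 4).astype(np.float32)
shapes = [chunk.shape[0] for chunk in iter_batches(samples, 32)]
print(shapes)  # [32, 32, 32, 4]
```

In a serving loop, each yielded chunk would go straight into session.run with no re-export or session rebuild.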

Batch Inference with IO Binding

batch_io_binding.py
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("iris.onnx")

# Batch of 32 samples
batch_data = np.random.rand(32, 4).astype(np.float32)
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# IO binding for zero-copy transfers (matters most on GPU)
io_binding = session.io_binding()
io_binding.bind_cpu_input(input_name, batch_data)
io_binding.bind_output(output_name)

session.run_with_iobinding(io_binding)
# copy_outputs_to_cpu() returns a list of arrays, one per bound output
labels = io_binding.copy_outputs_to_cpu()[0]

print(f"Batch shape: {labels.shape}")
print(f"Mean predicted class: {labels.mean():.2f}")

IO binding maps tensors directly into the runtime without intermediate copies, which matters for large batches. bind_cpu_input keeps the input in host memory; the runtime moves it to the device only if the active provider needs it. Pitfall: binding GPU-resident inputs requires device buffer pointers (bind_input or bind_ortvalue_input); size your batches to fit VRAM.
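The core idea behind IO binding, writing results into a preallocated buffer instead of allocating a fresh array per call, can be illustrated with plain numpy (fake_run is a stand-in for session.run_with_iobinding, not a real API):

```python
import numpy as np

# Reuse one preallocated output buffer across calls instead of
# letting each run allocate a new array.
out_buffer = np.empty(32, dtype=np.intp)

def fake_run(inp, out):
    # Stand-in for session.run_with_iobinding: writes results in place.
    np.argmax(inp, axis=1, out=out)
    return out

batch = np.random.rand(32, 4).astype(np.float32)
fake_run(batch, out_buffer)
print(out_buffer.shape)  # (32,)
```

With real IO binding the same principle applies, except the buffer can live in GPU memory and never touch the host at all.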

Performance Benchmarking

benchmark.py
import onnxruntime as ort
import numpy as np
import time

session = ort.InferenceSession("iris.onnx")
input_name = session.get_inputs()[0].name
batch_size = 1024
warmup = 10
iters = 100

batch_data = np.random.rand(batch_size, 4).astype(np.float32)

# Warmup
for _ in range(warmup):
    session.run(None, {input_name: batch_data})

start = time.perf_counter()
for _ in range(iters):
    session.run(None, {input_name: batch_data})
end = time.perf_counter()

latency_ms = (end - start) * 1000 / iters
throughput = batch_size / ((end - start) / iters)

print(f"Mean latency: {latency_ms:.2f} ms")
print(f"Throughput: {throughput:.0f} samples/s")

This benchmark measures steady-state latency/throughput after warmup (the first runs include lazy initialization and caching). Use it to compare providers. Pitfall: always include warmup, as the first run can be many times slower; on NVIDIA hardware, the TensorrtExecutionProvider can bring further gains on larger models.
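Mean latency hides tail behavior, which is what SLOs care about. A generic timing helper that reports percentiles instead (time_fn is hypothetical; in practice you would pass it a closure over session.run):

```python
import time
import statistics

def time_fn(fn, iters=100, warmup=10):
    """Return per-call latencies in milliseconds, discarding warmup runs."""
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - t0) * 1000)
    return latencies

lat = time_fn(lambda: sum(range(10_000)))
p95 = sorted(lat)[int(0.95 * len(lat))]
print(f"p50={statistics.median(lat):.3f} ms  p95={p95:.3f} ms")
```

Comparing p50 against p95 across providers reveals jitter that an average over 100 iterations would mask.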

Best Practices

  • Always profile: Enable session_options.enable_profiling to dump JSON traces, then analyze bottlenecks in chrome://tracing.
  • Prep inputs: Normalize to float32; keep batch memory well below available VRAM to avoid OOM.
  • Multi-threading: Set inter_op_num_threads for parallel pipelines in serving.
  • Cache sessions: Reuse one session per model; run() is safe to call from multiple threads.
  • Version ONNX: Pin target_opset at export and make sure your runtime version supports it.
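The "cache sessions" advice can be sketched with a memoized loader (the body below is a stand-in; swap in ort.InferenceSession(model_path) in real code):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_session(model_path):
    # Stand-in for ort.InferenceSession(model_path). Caching means each
    # model is parsed and optimized only once per process; later calls
    # reuse the same session object.
    return object()

s1 = get_session("iris.onnx")
s2 = get_session("iris.onnx")
print(s1 is s2)  # True
```

Because run() is thread-safe, one cached session per model path is enough even for concurrent request handlers.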

Common Errors to Avoid

  • Wrong dtype/shape: Float64 inputs fail against a float32 graph; export with [None, ...] dims and always feed np.float32.
  • No fallback providers: GPU down? Add 'CPUExecutionProvider' last.
  • Skip warmup: Fake benchmarks; first run includes parse/optim.
  • No IO binding: CPU-GPU copies kill perf; mandate io_binding in prod.

Next Steps

Integrate ONNX Runtime into FastAPI for scalable serving, or explore TensorRT provider for RTX 50xx. Check the official ONNX Runtime docs and our Learni ML deployment courses. Test with HuggingFace ONNX models for LLMs.