How to Master Weights & Biases in Production in 2026

Introduction

Weights & Biases (W&B) has become the go-to tool for tracking machine learning experiments. Beyond basic logging, its power lies in versioned artifact management, distributed sweeps, and CI/CD pipeline integration. This tutorial walks you through expert configuration for a real production environment, including parallel run management, model versioning, and automated report generation. You will learn how to structure experiments in a reproducible and scalable way.

Prerequisites

  • Python 3.10+
  • PyTorch or TensorFlow
  • W&B account with a configured team
  • Advanced knowledge of MLOps and metadata management
  • Access to a GPU cluster or Kubernetes

Advanced project initialization

wandb_init.py
import wandb
import os

os.environ["WANDB_PROJECT"] = "production-ml"
os.environ["WANDB_ENTITY"] = "learni-ml-team"

run = wandb.init(
    project="production-ml",
    entity="learni-ml-team",
    config={
        "learning_rate": 0.001,
        "batch_size": 64,
        "architecture": "resnet50",
        "dataset": "imagenet-subset"
    },
    tags=["production", "resnet", "2026"],
    notes="Baseline run for production benchmark"
)

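This initialization attaches the run to the team's project, records the hyperparameters in the run config, and adds tags and notes that make the run searchable; artifacts logged later are linked to this run. Because project and entity are passed explicitly to wandb.init, the two environment variables are redundant here. They are mainly useful as defaults for scripts that call wandb.init() without arguments.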
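
For the parallel runs mentioned in the introduction, one possible approach is to pass group and job_type to wandb.init so that workers of the same job appear together in the UI. The sketch below is only an illustration: the helper names build_init_kwargs and start_run, the RANK environment variable, and the group name are assumptions, not part of the original setup.

```python
import os


def build_init_kwargs(group, rank=None):
    """Return keyword arguments for wandb.init for one worker of a parallel job.

    Hypothetical helper: the group name and the RANK variable are assumptions.
    """
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    return {
        "project": "production-ml",
        "entity": "learni-ml-team",
        "group": group,            # runs that share a group are displayed together
        "job_type": "train",
        "name": f"{group}-rank{rank}",
        "tags": ["production", "resnet", "2026"],
    }


def start_run(group, rank=None):
    """Start a W&B run for this worker (requires the wandb package)."""
    import wandb  # imported lazily so build_init_kwargs works without wandb

    return wandb.init(**build_init_kwargs(group, rank))
```

Calling start_run("resnet50-baseline") on each worker would then create one run per rank under a single group.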

Logging metrics and gradients

train.py
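import wandb

# `model` is assumed to be defined elsewhere and to expose train_step(),
# validate(), get_gradients(), and generate_sample().
run = wandb.init(project="production-ml", entity="learni-ml-team")

for epoch in range(10):
    train_loss = model.train_step()
    val_acc = model.validate()

    metrics = {
        "train/loss": train_loss,
        "val/accuracy": val_acc,
        "epoch": epoch,
        "gradients": wandb.Histogram(model.get_gradients()),
    }
    # Attach a sample image every 5 epochs, in the same call so it shares the step.
    if epoch % 5 == 0:
        metrics["examples"] = wandb.Image(model.generate_sample())

    wandb.log(metrics, step=epoch)

run.finish()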

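Each epoch logs the scalar metrics, a gradient histogram, and, every five epochs, a generated image. Pass an explicit step (here the epoch) so that everything logged during one epoch lands on the same x-axis value and curves from different runs line up when compared. Steps must increase monotonically within a run; W&B ignores data logged to an earlier step.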
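
One way to keep a single log call per epoch is to build the payload first and log it once. The following is a minimal sketch under stated assumptions: build_epoch_payload and log_epoch are hypothetical helper names, `run` is any object with a log(data, step=...) method (such as the object returned by wandb.init), and `sample` is whatever media object the caller has already built (for example a wandb.Image).

```python
def build_epoch_payload(epoch, train_loss, val_acc, sample=None, sample_every=5):
    """Collect everything that belongs to one epoch into a single dict."""
    payload = {
        "train/loss": train_loss,
        "val/accuracy": val_acc,
        "epoch": epoch,
    }
    # Only attach the (potentially large) sample on selected epochs.
    if sample is not None and epoch % sample_every == 0:
        payload["examples"] = sample
    return payload


def log_epoch(run, epoch, train_loss, val_acc, sample=None):
    """Log one epoch with a single call so all values share the same step."""
    payload = build_epoch_payload(epoch, train_loss, val_acc, sample)
    run.log(payload, step=epoch)
    return payload
```

Keeping the payload construction separate from the logging call also makes it easy to unit-test what would be sent without contacting the W&B server.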

Distributed sweeps configuration

sweep.yaml
program: train.py
method: bayes
metric:
  name: val/accuracy
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 1e-5
    max: 1e-1
  batch_size:
    values: [32, 64, 128]
  optimizer:
    values: ["adam", "sgd"]
command:
  - ${env}
  - python
  - ${program}
  - ${args}

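The YAML file defines a Bayesian search that maximizes val/accuracy. Register it with wandb sweep sweep.yaml, which prints a sweep ID, then start wandb agent <entity>/<project>/<sweep_id> on each machine. Every agent requests its next set of hyperparameters from the W&B server, so the search is distributed without any extra coordination between machines.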
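
On the training side, the script has to accept the hyperparameters chosen by the sweep. The sketch below assumes the sweep command includes the ${args} macro, in which case the agent appends each sampled value as a --name=value flag. The function names parse_sweep_args and main are hypothetical, and the defaults simply mirror the values used earlier in this tutorial.

```python
import argparse


def parse_sweep_args(argv=None):
    """Parse the hyperparameters that a sweep agent passes on the command line."""
    parser = argparse.ArgumentParser(description="Training entry point for sweep agents")
    parser.add_argument("--learning_rate", type=float, default=1e-3)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--optimizer", choices=["adam", "sgd"], default="adam")
    # parse_known_args tolerates extra flags the command may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


def main(argv=None):
    """Start a run whose config holds the sweep's hyperparameters."""
    import wandb  # imported lazily; requires the wandb package

    args = parse_sweep_args(argv)
    run = wandb.init(config=vars(args))
    try:
        learning_rate = run.config["learning_rate"]
        batch_size = run.config["batch_size"]
        # The training loop would go here and should log "val/accuracy",
        # the metric named in sweep.yaml.
        print(f"training with lr={learning_rate}, batch_size={batch_size}")
    finally:
        run.finish()


if __name__ == "__main__":
    main()
```

Reading the values back from run.config rather than from args keeps the script working when it is launched by an agent, since the sweep's values are also delivered through the run config.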

Versioned artifacts management

artifacts.py
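import wandb

# Producer run: log the trained weights as a versioned artifact.
run = wandb.init(project="production-ml", entity="learni-ml-team", job_type="train")

artifact = wandb.Artifact(
    name="resnet50-weights",
    type="model",
    metadata={"epoch": 50, "val_acc": 0.94}
)
artifact.add_file("model.pth")
run.log_artifact(artifact)
run.finish()

# Usage in another run: declaring the artifact as an input records the lineage.
run = wandb.init(project="production-ml", entity="learni-ml-team", job_type="inference")
model_artifact = run.use_artifact("resnet50-weights:latest")
model_path = model_artifact.download()  # local directory that contains model.pth
run.finish()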

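Artifacts version models and datasets: logging an artifact under an existing name creates a new version (v0, v1, ...) and moves the latest alias to it. Because each run declares which artifacts it logs and which it uses, W&B builds the lineage graph automatically, which gives full traceability in production.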
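
In production it is often safer to pin an exact version than to follow latest. The sketch below shows one way to build artifact references of the form entity/project/name:alias. The helper names artifact_ref and download_pinned are hypothetical, and the version label "v3" in the usage note is purely illustrative.

```python
def artifact_ref(name, alias="latest", entity=None, project=None):
    """Build 'entity/project/name:alias', or 'name:alias' when unqualified."""
    if (entity is None) != (project is None):
        raise ValueError("pass both entity and project, or neither")
    prefix = f"{entity}/{project}/" if entity else ""
    return f"{prefix}{name}:{alias}"


def download_pinned(run, name, version):
    """Fetch one exact artifact version through an existing run.

    `run` is assumed to be the object returned by wandb.init.
    """
    artifact = run.use_artifact(artifact_ref(name, alias=version))
    return artifact.download()  # path of the local download directory
```

For example, download_pinned(run, "resnet50-weights", "v3") would resolve the reference "resnet50-weights:v3" instead of whatever latest currently points to.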

PyTorch Lightning integration

lightning_trainer.py
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning import Trainer

# `model` (a LightningModule) and `datamodule` are assumed to be defined elsewhere.
wandb_logger = WandbLogger(
    project="production-ml",
    log_model=True,
    save_dir="./wandb"
)

# watch() registers gradient logging on the model; it is not a Trainer callback.
wandb_logger.watch(model, log_freq=100)

trainer = Trainer(
    logger=wandb_logger,
    max_epochs=100
)
trainer.fit(model, datamodule)

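Native integration with PyTorch Lightning simplifies metric logging: values passed to self.log in the LightningModule are sent to W&B. With log_model=True, checkpoints are uploaded as W&B artifacts at the end of training; use log_model="all" to upload them as they are written during training.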

Best practices

  • Always use structured tags and notes to facilitate search in large projects
  • Systematically version datasets and models via artifacts
  • Configure Slack/Email alerts on critical metrics
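  • Regularly clean up obsolete runs from the project UI or through the public API (wandb.Api().run(...).delete()), and end sweeps that are still running with wandb sweep --stop or --cancel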
  • Document configurations via YAML files versioned in Git
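
The alerting practice above can be sketched with the run's alert method. This is a minimal illustration under assumptions: alert_on_low_accuracy and the 0.90 threshold are hypothetical, `run` is the object returned by wandb.init, and whether the alert arrives by Slack or email depends on the alert settings configured for the account or team.

```python
def alert_on_low_accuracy(run, val_acc, threshold=0.90):
    """Send a W&B alert when validation accuracy drops below the threshold.

    Returns True when an alert was sent, False otherwise.
    """
    if val_acc >= threshold:
        return False
    run.alert(
        title="Validation accuracy below threshold",
        text=f"val/accuracy={val_acc:.3f} is under the {threshold:.2f} threshold",
    )
    return True
```

Calling it once per validation pass is usually enough; alerting on every training step would likely produce more noise than signal.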

Common mistakes to avoid

  • Forgetting to call wandb.finish() in batch scripts, leaving runs unfinished
  • Logging large raw tensors at every step instead of reducing them first (to a scalar or a wandb.Histogram) or lowering the logging frequency
  • Incorrectly configuring team permissions, exposing sensitive data
  • Running sweeps without setting a seed, making experiments non-reproducible
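
Two of the mistakes above, unfinished runs and unseeded experiments, can be guarded against with small wrappers. The sketch below is an illustration, not a prescribed pattern: seed_everything and run_and_finish are hypothetical names, `run` is assumed to be the object returned by wandb.init, and only Python's own random module is seeded here.

```python
import random


def seed_everything(seed):
    """Seed Python's random module.

    NumPy and PyTorch keep their own generators and need their own calls
    (np.random.seed, torch.manual_seed) if they are used.
    """
    random.seed(seed)


def run_and_finish(run, train_fn):
    """Run train_fn(run) and always close the run, even when training fails."""
    try:
        result = train_fn(run)
    except Exception:
        run.finish(exit_code=1)  # mark the run as failed instead of leaving it open
        raise
    run.finish()
    return result
```

Storing the seed in the run config (for example config={"seed": 42}) keeps it next to the other hyperparameters, so a run can be reproduced from its own metadata.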

Going further

Deepen your MLOps skills with our advanced training on Weights & Biases and production ML pipelines.