Introduction
Weights & Biases (W&B) is a widely used tool for tracking machine learning experiments. Beyond basic logging, its power lies in versioned artifact management, distributed sweeps, and CI/CD pipeline integration. This tutorial walks you through an advanced configuration for a production environment, including parallel run management, model versioning, and automated report generation. You will learn how to structure experiments so that they are reproducible and scale to many runs.
Prerequisites
- Python 3.10+
- PyTorch or TensorFlow
- W&B account with a configured team
- Advanced knowledge of MLOps and metadata management
- Access to a GPU cluster or Kubernetes
Advanced project initialization
import os

import wandb

# Defaults for the project and team, set through environment variables
os.environ["WANDB_PROJECT"] = "production-ml"
os.environ["WANDB_ENTITY"] = "learni-ml-team"

run = wandb.init(
    project="production-ml",
    entity="learni-ml-team",
    config={
        "learning_rate": 0.001,
        "batch_size": 64,
        "architecture": "resnet50",
        "dataset": "imagenet-subset",
    },
    tags=["production", "resnet", "2026"],
    notes="Reference run for production benchmark",
)

This initialization attaches the run to the team entity, records the hyperparameters in config so they are versioned with the run, and adds tags and notes as searchable metadata. The returned run object is what later steps use to log artifacts.
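To keep the config and tags consistent from one run to the next, the arguments can be built from a single typed object. The following is a minimal sketch: the TrainConfig class and the build_init_kwargs helper are hypothetical names introduced for illustration, and the field values mirror the initialization example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical config container (not part of the wandb API)."""
    learning_rate: float = 0.001
    batch_size: int = 64
    architecture: str = "resnet50"
    dataset: str = "imagenet-subset"


def build_init_kwargs(cfg: TrainConfig, year: int) -> dict:
    """Return keyword arguments that can be passed to wandb.init(**kwargs)."""
    return {
        "project": "production-ml",
        "entity": "learni-ml-team",
        "config": asdict(cfg),
        # Tags are derived from the config so they cannot drift from it
        "tags": ["production", cfg.architecture, str(year)],
    }


kwargs = build_init_kwargs(TrainConfig(batch_size=128), year=2026)
print(kwargs["config"])
print(kwargs["tags"])
# In a training script (assumption): run = wandb.init(**kwargs)
```

Deriving tags from the config is a design choice rather than a W&B requirement; it simply removes one place where metadata can be mistyped.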
Logging metrics and gradients
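import wandb

# Assumes `model` is defined elsewhere and exposes the methods used below
for epoch in range(10):
    train_loss = model.train_step()
    val_acc = model.validate()
    wandb.log({
        "train/loss": train_loss,
        "val/accuracy": val_acc,
        "epoch": epoch,
        "gradients": wandb.Histogram(model.get_gradients()),
    }, step=epoch)
    if epoch % 5 == 0:
        # Reuse the same step so the image is recorded alongside the metrics
        wandb.log({"examples": wandb.Image(model.generate_sample())}, step=epoch)

Logging includes gradient histograms and generated images. Pass the same explicit step to every wandb.log call within an epoch so that all values for that epoch share one x-axis position; this keeps parallel runs directly comparable and avoids metrics landing on mismatched steps. Steps must never decrease within a run.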
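Metric names such as train/loss and val/accuracy use a prefix before the slash, which the W&B UI uses to group charts into sections. The sketch below shows one way to build such a payload; namespaced and build_log_payload are hypothetical helpers, not wandb functions.

```python
def namespaced(prefix: str, metrics: dict) -> dict:
    """Prefix every key with '<prefix>/' so related metrics share a section."""
    return {f"{prefix}/{name}": value for name, value in metrics.items()}


def build_log_payload(epoch: int, train: dict, val: dict) -> dict:
    """Merge training and validation metrics into one dictionary per epoch."""
    payload = {"epoch": epoch}
    payload.update(namespaced("train", train))
    payload.update(namespaced("val", val))
    return payload


payload = build_log_payload(3, {"loss": 0.42}, {"accuracy": 0.91, "loss": 0.50})
print(payload)
# In a training loop (assumption): wandb.log(payload, step=epoch)
```

Building one dictionary per epoch also means a single wandb.log call per step, which sidesteps questions about which call commits the step.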
Distributed sweeps configuration
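program: train.py
method: bayes
metric:
  name: val/accuracy
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    # Written in decimal form: some YAML loaders read 1e-5 as a string
    min: 0.00001
    max: 0.1
  batch_size:
    values: [32, 64, 128]
  optimizer:
    values: ["adam", "sgd"]
command:
  - ${env}
  - python
  - ${program}
  # Passes each hyperparameter to the program as --name=value
  - ${args}

The YAML file defines a Bayesian search that maximizes val/accuracy. Create the sweep with wandb sweep sweep.yaml, then start wandb agent <entity>/<project>/<sweep_id> on each machine. Every agent requests its next set of hyperparameters from the W&B server, so the search is distributed without extra coordination. Inside train.py, the sweep ID is available as run.sweep_id.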
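When the sweep command includes the ${args} macro (as the default command does), the agent starts the program with one --name=value flag per hyperparameter. The following sketch shows how a train.py entry point might parse them; the parse_sweep_args function and the default values are assumptions made for illustration.

```python
import argparse


def parse_sweep_args(argv=None) -> argparse.Namespace:
    """Parse the hyperparameters defined in the sweep configuration."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--optimizer", choices=["adam", "sgd"], default="adam")
    return parser.parse_args(argv)


# Simulates the flags an agent would pass for one trial
args = parse_sweep_args(
    ["--learning_rate=0.0003", "--batch_size=128", "--optimizer=sgd"]
)
print(vars(args))
```

Restricting optimizer with choices makes the script fail fast if the sweep file and the training code disagree on the allowed values.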
Versioned artifacts management
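import wandb

# Producer run: log the trained weights as a versioned artifact
run = wandb.init(project="production-ml", job_type="train")
artifact = wandb.Artifact(
    name="resnet50-weights",
    type="model",
    metadata={"epoch": 50, "val_acc": 0.94},
)
artifact.add_file("model.pth")
run.log_artifact(artifact)
run.finish()

# Consumer run: declare the dependency and download the files
run = wandb.init(project="production-ml", job_type="evaluate")
model_artifact = run.use_artifact("resnet50-weights:latest")
model_dir = model_artifact.download()  # returns the local directory path
run.finish()

Artifacts version models and datasets: logging changed content under the same name creates a new version (v0, v1, and so on), and the latest alias points to the newest one. Because use_artifact records which run consumed which version, W&B builds a lineage graph that links runs together for full traceability in production.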
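Artifact metadata is a plain dictionary, so it can carry extra traceability fields next to epoch and val_acc. The sketch below adds a SHA-256 checksum and a file size; this is an illustrative convention rather than something W&B requires (W&B tracks its own content digests), and file_sha256 and build_artifact_metadata are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_artifact_metadata(path: Path, epoch: int, val_acc: float) -> dict:
    """Assemble a metadata dictionary for wandb.Artifact(metadata=...)."""
    return {
        "epoch": epoch,
        "val_acc": val_acc,
        "sha256": file_sha256(path),
        "size_bytes": path.stat().st_size,
    }


# Demo on a throwaway file standing in for real model weights
with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "model.pth"
    weights.write_bytes(b"fake weights for illustration")
    metadata = build_artifact_metadata(weights, epoch=50, val_acc=0.94)

print(metadata)
```

A consumer can recompute the checksum after download() and compare it with the stored value before loading the weights.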
PyTorch Lightning integration
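from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# Assumes `model` (a LightningModule) and `datamodule` are defined elsewhere
wandb_logger = WandbLogger(
    project="production-ml",
    log_model=True,  # upload checkpoints as artifacts at the end of training
    save_dir="./wandb",
)
# watch() is not a callback: call it directly to log gradients every 100 steps
wandb_logger.watch(model, log_freq=100)

trainer = Trainer(
    logger=wandb_logger,
    max_epochs=100,
)
trainer.fit(model, datamodule=datamodule)

Native integration with PyTorch Lightning simplifies metric logging: values passed to self.log inside the module are forwarded to W&B, and with log_model=True the checkpoints are saved as W&B artifacts.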
Best practices
- Always use structured tags and notes to facilitate search in large projects
- Systematically version datasets and models via artifacts
- Configure Slack/Email alerts on critical metrics
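- Regularly clean up: stop or cancel sweeps that are no longer needed with wandb sweep --stop or wandb sweep --cancel, and delete obsolete runs from the UI or through the public API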
- Document configurations via YAML files versioned in Git
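The alerting practice above can be sketched with run.alert, which sends a notification through the channels enabled in the W&B user settings. The helper name, threshold, and message wording below are assumptions; FakeRun is a stand-in so the example runs without a W&B account.

```python
def alert_on_low_accuracy(run, val_acc: float, threshold: float = 0.90) -> bool:
    """Send an alert when validation accuracy falls below the threshold.

    `run` is expected to be the object returned by wandb.init().
    Returns True if an alert was sent.
    """
    if val_acc >= threshold:
        return False
    run.alert(
        title="Validation accuracy below threshold",
        text=f"val/accuracy={val_acc:.3f} is below {threshold:.2f}",
    )
    return True


class FakeRun:
    """Stand-in that records alerts instead of sending them."""

    def __init__(self):
        self.alerts = []

    def alert(self, title, text):
        self.alerts.append((title, text))


fake = FakeRun()
sent = alert_on_low_accuracy(fake, val_acc=0.85)
print(sent, fake.alerts)
```

In a real script, the same function would be called with the run returned by wandb.init() after each validation pass.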
Common mistakes to avoid
- Forgetting to call wandb.finish() in batch scripts, leaving runs unfinished
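- Logging large raw tensors at every step instead of summarizing them first (for example with wandb.Histogram, or with reduce_fx in Lightning's self.log) or logging them less often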
- Incorrectly configuring team permissions, exposing sensitive data
- Running sweeps without setting a seed, making experiments non-reproducible
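Two of the mistakes above, unfinished runs and missing seeds, can be guarded against with small helpers. This is a minimal sketch under stated assumptions: seed_everything and run_with_finish are hypothetical names, NumPy and PyTorch are seeded only if they are installed, and FakeRun stands in for the object returned by wandb.init().

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the standard library RNG, plus NumPy and PyTorch when available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def run_with_finish(run, train_fn):
    """Call train_fn(run) and always finish the run, even if training fails."""
    try:
        return train_fn(run)
    finally:
        run.finish()


class FakeRun:
    """Stand-in that only records whether finish() was called."""

    def __init__(self):
        self.finished = False

    def finish(self):
        self.finished = True


# Same seed, same sequence of random numbers
seed_everything(1234)
first = [random.random() for _ in range(3)]
seed_everything(1234)
second = [random.random() for _ in range(3)]
print(first == second)

# finish() is still called when training raises
fake = FakeRun()
try:
    run_with_finish(fake, lambda r: 1 / 0)
except ZeroDivisionError:
    pass
print(fake.finished)
```

In a sweep, the seed can be added as a regular parameter in the configuration so that it is recorded with every trial.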
Going further
Deepen your MLOps skills with our advanced training on Weights & Biases and production ML pipelines.