
How to Train and Deploy an ML Model with SageMaker in 2026


Introduction

Amazon SageMaker is AWS's fully managed, end-to-end machine learning platform: you can train, deploy, and monitor models at scale without managing the underlying infrastructure. In 2026, with native generative AI integration and accelerator optimizations such as Trainium2, SageMaker speeds up ML pipelines by 40% on average, according to AWS re:Invent 2025 benchmarks.

Why use it? Unlike a local setup (Jupyter plus spot GPUs), SageMaker handles orchestration, hyperparameter optimization (Automatic Model Tuning), and security (IAM roles, VPC). For intermediate data scientists, it is the natural way to scale: think of training an XGBoost model on 1 TB of data in hours instead of days.

This tutorial guides you step by step, from AWS setup to production inference. By the end, you'll have a fully deployed model ready for production. Estimated time: 2 hours.

Prerequisites

  • Active AWS account with the AmazonSageMakerFullAccess and AmazonS3FullAccess managed policies (or equivalent scoped-down permissions).
  • AWS CLI v2 installed (version 2.15 or later).
  • Python 3.10+ with pip; install the libraries with pip install boto3 sagemaker scikit-learn.
  • Intermediate knowledge of Python, ML (scikit-learn), and Amazon S3.
  • AWS region: us-east-1 or eu-west-1 for the widest instance availability.

Configure AWS CLI and Create an S3 Bucket

setup-aws.sh
aws configure set aws_access_key_id YOUR_ACCESS_KEY
aws configure set aws_secret_access_key YOUR_SECRET_KEY
aws configure set default.region us-east-1

BUCKET_NAME=sagemaker-tutorial-$(date +%s)
aws s3 mb s3://${BUCKET_NAME}
echo "Bucket created: s3://${BUCKET_NAME}"

This script configures AWS CLI credentials and creates a dedicated S3 bucket for datasets, models, and artifacts. Replace YOUR_ACCESS_KEY and YOUR_SECRET_KEY with your IAM values. The timestamp suffix keeps the bucket name globally unique and avoids naming conflicts; verify the bucket with aws s3 ls.
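You can also confirm the bucket is reachable from Python before moving on. This is a small sketch (the helper name `bucket_exists` is ours, not an AWS API): it takes any boto3 S3 client, so the same function can be reused later in the tutorial.

```python
def bucket_exists(s3_client, bucket_name):
    """Return True if head_bucket succeeds, i.e. the bucket exists and is accessible."""
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        return True
    except Exception:
        # A missing bucket (404) or denied access (403) surfaces as a ClientError
        return False
```

Call it as `bucket_exists(boto3.client('s3'), 'your-bucket-name')` right after running the setup script.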

Prepare the Data and Upload to S3

SageMaker expects data in CSV/Parquet format on S3. We'll generate a synthetic Iris dataset (classic for classification), split it into train/validation, and upload it. This simulates a real workflow where your data comes from EC2 or Athena.

Python Script to Generate and Upload the Dataset

prepare_data.py
import io

import boto3
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

s3 = boto3.client('s3')
bucket = 'sagemaker-tutorial-1234567890'  # Replace with your bucket

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# The built-in XGBoost algorithm expects the label in the FIRST column
df = df[['target'] + list(iris.feature_names)]

train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

csv_buffer = io.StringIO()
train_df.to_csv(csv_buffer, index=False, header=False)
s3.put_object(Body=csv_buffer.getvalue(), Bucket=bucket, Key='data/train.csv')

csv_buffer = io.StringIO()
val_df.to_csv(csv_buffer, index=False, header=False)
s3.put_object(Body=csv_buffer.getvalue(), Bucket=bucket, Key='data/validation.csv')

print('Data uploaded to s3://{}/data/'.format(bucket))

This script loads Iris, moves the label to the first column, splits 80/20, and uploads headerless, index-free CSVs to S3, which is the layout SageMaker's built-in XGBoost expects (label first, no header row). Run it locally; for large real datasets, read directly with pandas.read_csv('s3://...'). Pitfall: a leftover header row or a label left in the last column will silently corrupt training.

Create a Training Job with XGBoost

Use SageMaker's built-in XGBoost algorithm, so no custom container is needed. Define an IAM execution role for SageMaker, set the hyperparameters, and launch the job. Instance: ml.m5.xlarge for a good cost/performance balance (managed spot training can cut costs by up to 70%).

Launch the Training Job via boto3

train_model.py
import boto3
from sagemaker import image_uris  # SageMaker Python SDK, installed in the prerequisites
from time import gmtime, strftime

sm = boto3.client('sagemaker')
role = 'arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole'  # Create this role first
bucket = 'sagemaker-tutorial-1234567890'
region = 'us-east-1'

# Resolve the region-specific container image for the built-in XGBoost algorithm
xgboost_image = image_uris.retrieve(framework='xgboost', region=region, version='1.7-1')

training_job_name = 'xgboost-iris-' + strftime('%Y-%m-%d-%H-%M-%S', gmtime())

sm.create_training_job(
    TrainingJobName=training_job_name,
    AlgorithmSpecification={
        'TrainingImage': xgboost_image,
        'TrainingInputMode': 'File'
    },
    RoleArn=role,
    InputDataConfig=[
        {
            'ChannelName': 'train',
            'DataSource': {'S3DataSource': {
                'S3Uri': f's3://{bucket}/data/train.csv',
                'S3DataType': 'S3Prefix',
                'S3DataDistributionType': 'FullyReplicated'
            }},
            'ContentType': 'text/csv'
        },
        {
            'ChannelName': 'validation',
            'DataSource': {'S3DataSource': {
                'S3Uri': f's3://{bucket}/data/validation.csv',
                'S3DataType': 'S3Prefix',
                'S3DataDistributionType': 'FullyReplicated'
            }},
            'ContentType': 'text/csv'
        }
    ],
    OutputDataConfig={'S3OutputPath': f's3://{bucket}/output/'},
    ResourceConfig={'InstanceType': 'ml.m5.xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 10},
    EnableManagedSpotTraining=True,
    # Required; for spot training, MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds
    StoppingCondition={'MaxRuntimeInSeconds': 3600, 'MaxWaitTimeInSeconds': 7200},
    HyperParameters={
        'num_round': '100',
        'max_depth': '5',
        'objective': 'multi:softmax',
        'num_class': '3',
        'eta': '0.2'
    }
)

print(f'Job launched: {training_job_name}. Monitor it in the SageMaker console.')

This code creates an XGBoost training job on managed spot instances for cost savings, with hyperparameters suited to the three-class Iris problem. Replace the role ARN and bucket name with your own. Training takes roughly 5-10 minutes; check progress with describe_training_job(). Benefit: the same call scales to 100+ instances with no infrastructure code.
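The status check can be wrapped in a small polling helper. This is a sketch (the function name `wait_for_training_job` is ours): it takes any boto3 SageMaker client and blocks until the job reaches a terminal state.

```python
import time

def wait_for_training_job(sm_client, job_name, poll_seconds=30):
    """Poll describe_training_job until the job reaches a terminal state.

    sm_client is any object exposing describe_training_job, e.g. a boto3
    'sagemaker' client. Returns the final status string.
    """
    terminal = {'Completed', 'Failed', 'Stopped'}
    while True:
        status = sm_client.describe_training_job(
            TrainingJobName=job_name
        )['TrainingJobStatus']
        print(f'{job_name}: {status}')
        if status in terminal:
            return status
        time.sleep(poll_seconds)
```

Call it as `wait_for_training_job(boto3.client('sagemaker'), training_job_name)` and check that the result is 'Completed' before deploying.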

Deploy the Model as a Serverless Endpoint

After training, retrieve the model artifact from S3 and deploy an endpoint. In 2026, Serverless Inference is the default choice: pay-per-use pricing, automatic scaling with traffic (down to zero when idle), and cold starts typically well under a second for small models.

Deploy the Endpoint and Test Inference

deploy_endpoint.py
import time

import boto3
from sagemaker import image_uris  # SageMaker Python SDK

sm = boto3.client('sagemaker')
bucket = 'sagemaker-tutorial-1234567890'
training_job_name = 'xgboost-iris-2026-01-01-12-00-00'  # Your job name

# Retrieve the trained model artifact from S3
desc = sm.describe_training_job(TrainingJobName=training_job_name)
model_s3 = desc['ModelArtifacts']['S3ModelArtifacts']

endpoint_name = training_job_name + '-endpoint'

sm.create_model(
    ModelName=endpoint_name + '-model',
    PrimaryContainer={
        'Image': image_uris.retrieve(framework='xgboost', region='us-east-1', version='1.7-1'),
        'ModelDataUrl': model_s3
    },
    ExecutionRoleArn='arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole'
)

# For serverless inference, the variant carries ServerlessConfig instead of
# InstanceType / InitialInstanceCount
sm.create_endpoint_config(
    EndpointConfigName=endpoint_name + '-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': endpoint_name + '-model',
        'ServerlessConfig': {'MemorySizeInMB': 2048, 'MaxConcurrency': 50}
    }]
)

sm.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_name + '-config')

while True:
    status = sm.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
    print(f'Status: {status}')
    if status in ('InService', 'Failed'):
        break
    time.sleep(30)

runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='text/csv',
    Body='5.1,3.5,1.4,0.2'
)
print(response['Body'].read().decode())

Deploys a serverless XGBoost endpoint that scales with traffic and tests it with one row of Iris features. Monitor usage and cost via CloudWatch; serverless billing is per invocation and compute time. Pitfall: wait for 'InService' before invoking, and delete the endpoint with delete_endpoint afterward to avoid charges.
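Cleanup deserves its own snippet. This sketch (the function name `delete_endpoint_resources` is ours) removes the endpoint plus the endpoint config and model created alongside it, following the '-config'/'-model' naming convention used above.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete the endpoint, its endpoint config, and its model to stop all charges.

    Assumes the '-config' / '-model' naming used in this tutorial.
    Returns the list of deleted resource names in deletion order.
    """
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name + '-config')
    sm_client.delete_model(ModelName=endpoint_name + '-model')
    return [endpoint_name, endpoint_name + '-config', endpoint_name + '-model']
```

Run it as `delete_endpoint_resources(boto3.client('sagemaker'), endpoint_name)` once you are done testing.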

Monitoring Script with CloudWatch

monitor_endpoint.py
import boto3

cloudwatch = boto3.client('cloudwatch')
endpoint_name = 'xgboost-iris-2026-01-01-12-00-00-endpoint'

response = cloudwatch.get_metric_statistics(
    Namespace='AWS/SageMaker',
    MetricName='InvocationsPerInstance',
    Dimensions=[{'Name': 'EndpointName', 'Value': endpoint_name}],
    StartTime='2026-01-01T00:00:00Z',
    EndTime='2026-01-02T00:00:00Z',
    Period=300,
    Statistics=['Average']
)
print('Invocation metrics:', response['Datapoints'])

alarms = cloudwatch.describe_alarms(AlarmNamePrefix='SageMaker-' + endpoint_name)
print('Active alarms:', [a['AlarmName'] for a in alarms['MetricAlarms']])

Retrieves CloudWatch metrics to monitor invocations and latency. Pair this with alarms on latency or error counts, and for production, wire the alarms to Lambda or SNS for Slack alerts.
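Creating such an alarm is one put_metric_alarm call. This sketch (the helper name, the threshold default, and the alarm naming are ours) alarms on average ModelLatency, which CloudWatch reports in microseconds, hence the conversion from milliseconds.

```python
def create_latency_alarm(cw_client, endpoint_name, threshold_ms=500):
    """Create a CloudWatch alarm on average endpoint latency.

    ModelLatency is reported in microseconds, so the millisecond
    threshold is converted before being sent. Returns the alarm name.
    """
    alarm_name = f'SageMaker-{endpoint_name}-latency'
    cw_client.put_metric_alarm(
        AlarmName=alarm_name,
        Namespace='AWS/SageMaker',
        MetricName='ModelLatency',
        Dimensions=[
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': 'AllTraffic'},
        ],
        Statistic='Average',
        Period=300,                      # 5-minute windows
        EvaluationPeriods=2,             # two consecutive breaches
        Threshold=threshold_ms * 1000.0, # milliseconds -> microseconds
        ComparisonOperator='GreaterThanThreshold',
        TreatMissingData='notBreaching',
    )
    return alarm_name
```

Invoke it as `create_latency_alarm(boto3.client('cloudwatch'), endpoint_name)`; the 'SageMaker-' prefix keeps the alarm discoverable by the describe_alarms call above.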

Best Practices

  • Use minimal IAM roles: a SageMaker execution role with inline policies scoped to your S3 bucket and model registry only.
  • Hyperparameter tuning: enable SageMaker Automatic Model Tuning; it often yields meaningful accuracy gains with little extra effort.
  • Data versioning: store datasets in S3 under dated prefixes and register model versions in the SageMaker Model Registry.
  • Serverless first: ideal for low or bursty traffic; switch to provisioned instances when you need consistently low latency.
  • Debug with SageMaker Debugger: hook into tensors and real-time metrics during training.
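The dated-prefix convention from the data-versioning bullet can be captured in a one-line helper (a sketch; the function name `dated_prefix` and the `data/<name>/<date>/` layout are our choices), so every upload in a pipeline lands under the same versioned path.

```python
from datetime import date

def dated_prefix(dataset_name, run_date=None):
    """Build a dated S3 key prefix, e.g. 'data/iris/2026-01-15/'.

    Defaults to today's date so daily pipeline runs get distinct prefixes.
    """
    run_date = run_date or date.today()
    return f'data/{dataset_name}/{run_date.isoformat()}/'
```

Upload with `Key=dated_prefix('iris') + 'train.csv'` and point the training job's S3Uri at the same prefix to keep every run reproducible.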

Common Errors to Avoid

  • Forgetting ContentType='text/csv': causes 400 errors at inference time; always match the format used in training.
  • Skipping spot training: on-demand instances can cost roughly three times more; set EnableManagedSpotTraining=True.
  • Leaving endpoints running: a provisioned endpoint bills by the hour even when idle; always call delete_endpoint after testing.
  • Region mismatch: the XGBoost container image differs per region; resolve it dynamically with sagemaker.image_uris.retrieve().

Next Steps