Introduction
Amazon SageMaker is AWS's fully managed end-to-end machine learning platform, letting you train, deploy, and monitor models at scale without handling the underlying infrastructure. In 2026, with native generative AI integration and purpose-built accelerators such as Trainium2, SageMaker speeds up ML pipelines by around 40% on average, per AWS re:Invent 2025 benchmarks.
Why use it? Unlike local setups (Jupyter + spot GPUs), SageMaker handles orchestration, AutoML (SageMaker Autopilot), automatic hyperparameter tuning (Automatic Model Tuning), and security (IAM roles, VPC). For intermediate data scientists, it's the perfect tool to scale: imagine training an XGBoost model on 1 TB of data in hours instead of days.
This tutorial guides you step by step, from AWS setup to production inference. By the end, you'll have a fully deployed model ready for production. Estimated time: 2 hours.
Prerequisites
- Active AWS account with SageMakerFullAccess and S3FullAccess permissions.
- AWS CLI v2 installed (version 2.15+).
- Python 3.10+ with pip; install the SDKs via pip install boto3 sagemaker scikit-learn (a quick verification sketch follows this list).
- Intermediate knowledge of Python, ML (scikit-learn), and AWS S3.
- AWS region: us-east-1 or eu-west-1 for optimal availability.
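To confirm the environment is ready, a minimal check like this (assuming your AWS credentials are already configured) prints the SDK versions and the account your credentials resolve to:

import boto3
import sagemaker

# Print the SDK versions and the AWS account the default credentials resolve to
print('boto3:', boto3.__version__, '| sagemaker:', sagemaker.__version__)
print('Account:', boto3.client('sts').get_caller_identity()['Account'])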
Configure AWS CLI and Create an S3 Bucket
aws configure set aws_access_key_id YOUR_ACCESS_KEY
aws configure set aws_secret_access_key YOUR_SECRET_KEY
aws configure set default.region us-east-1
BUCKET_NAME=sagemaker-tutorial-$(date +%s)
aws s3 mb s3://${BUCKET_NAME}
echo "Bucket créé: s3://${BUCKET_NAME}"This script configures AWS CLI credentials and creates a dedicated S3 bucket for datasets, models, and artifacts. Replace YOUR_ACCESS_KEY and YOUR_SECRET_KEY with your IAM values. Use a unique bucket with a timestamp to avoid conflicts; verify with aws s3 ls.
Prepare the Data and Upload to S3
SageMaker expects data in CSV/Parquet format on S3. We'll generate a synthetic Iris dataset (classic for classification), split it into train/validation, and upload it. This simulates a real workflow where your data comes from EC2 or Athena.
Python Script to Generate and Upload the Dataset
import boto3
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import io

s3 = boto3.client('s3')
bucket = 'sagemaker-tutorial-1234567890'  # replace with your bucket

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
# The built-in XGBoost algorithm expects the label in the first column, with no header or index
df = df[['target'] + list(iris.feature_names)]
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

csv_buffer = io.StringIO()
train_df.to_csv(csv_buffer, index=False, header=False)
s3.put_object(Body=csv_buffer.getvalue(), Bucket=bucket, Key='data/train.csv')

csv_buffer = io.StringIO()
val_df.to_csv(csv_buffer, index=False, header=False)
s3.put_object(Body=csv_buffer.getvalue(), Bucket=bucket, Key='data/validation.csv')

print('Data uploaded to s3://{}/data/'.format(bucket))

This script loads Iris, moves the target to the first column, splits 80/20, and uploads headerless, index-free CSV files to S3 (the format the built-in SageMaker XGBoost algorithm requires). Run it locally; for large, real datasets you can read and write S3 directly with pandas, as shown in the sketch below. Pitfall: always write with header=False and the label first, otherwise training will fail or misread the columns.
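This sketch skips the in-memory buffer and lets pandas handle the S3 I/O; it assumes the s3fs package is installed and uses a hypothetical raw-data key, so adjust the paths to your own layout:

import pandas as pd

bucket = 'sagemaker-tutorial-1234567890'  # replace with your bucket
# pandas can read and write s3:// paths directly when s3fs is installed
df = pd.read_csv(f's3://{bucket}/raw/full_dataset.csv')  # hypothetical source key
train_df = df.sample(frac=0.8, random_state=42)
val_df = df.drop(train_df.index)
train_df.to_csv(f's3://{bucket}/data/train.csv', index=False, header=False)
val_df.to_csv(f's3://{bucket}/data/validation.csv', index=False, header=False)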
Create a Training Job with XGBoost
Use SageMaker's built-in XGBoost algorithm, so no custom container is needed. Define an IAM role for SageMaker, set the hyperparameters, and launch the job. Instance: ml.m5.xlarge for a good cost/performance balance (managed spot training can cut costs by up to ~70%).
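If the SageMakerRole referenced below does not exist yet, a one-time sketch like this creates it (the role name is just an example, and the two managed policies are broader than you would keep in production):

import json
import boto3

iam = boto3.client('iam')
# Trust policy letting the SageMaker service assume the role
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'sagemaker.amazonaws.com'},
        'Action': 'sts:AssumeRole'
    }]
}
role = iam.create_role(RoleName='SageMakerRole',
                       AssumeRolePolicyDocument=json.dumps(trust_policy))
# Managed policies for the tutorial; scope these down for real workloads
iam.attach_role_policy(RoleName='SageMakerRole',
                       PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess')
iam.attach_role_policy(RoleName='SageMakerRole',
                       PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess')
print(role['Role']['Arn'])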
Launch the Training Job via boto3
import boto3
from time import gmtime, strftime
from sagemaker import image_uris  # SageMaker Python SDK, used only to resolve the container image

sagemaker = boto3.client('sagemaker')
role = 'arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole'  # create this role first (see the IAM sketch above)
bucket = 'sagemaker-tutorial-1234567890'
training_job_name = 'xgboost-iris-' + strftime('%Y-%m-%d-%H-%M-%S', gmtime())
# Resolve the built-in XGBoost container image for your region
container = image_uris.retrieve('xgboost', region='us-east-1', version='1.7-1')
sagemaker.create_training_job(
    TrainingJobName=training_job_name,
    AlgorithmSpecification={
        'TrainingImage': container,
        'TrainingInputMode': 'File'
    },
    RoleArn=role,
    InputDataConfig=[
        {'ChannelName': 'train', 'DataSource': {'S3DataSource': {'S3Uri': f's3://{bucket}/data/train.csv', 'S3DataType': 'S3Prefix'}}, 'ContentType': 'text/csv'},
        {'ChannelName': 'validation', 'DataSource': {'S3DataSource': {'S3Uri': f's3://{bucket}/data/validation.csv', 'S3DataType': 'S3Prefix'}}, 'ContentType': 'text/csv'}
    ],
    OutputDataConfig={'S3OutputPath': f's3://{bucket}/output/'},
    ResourceConfig={'InstanceType': 'ml.m5.xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 10},
    EnableManagedSpotTraining=True,
    StoppingCondition={'MaxRuntimeInSeconds': 3600, 'MaxWaitTimeInSeconds': 7200},  # required; wait time must cover spot interruptions
    HyperParameters={
        'num_round': '100',
        'max_depth': '5',
        'objective': 'multi:softmax',
        'num_class': '3',
        'eta': '0.2'
    }
)
print(f'Job launched: {training_job_name}. Monitor it in the SageMaker console.')

This code creates an XGBoost training job on managed spot instances for cost savings, with hyperparameters tuned for the multi-class Iris problem. Replace the role ARN and bucket with your own. The job takes roughly 5-10 minutes; check its status with sagemaker.describe_training_job(). Benefit: the same call scales to 100+ instances without any infrastructure code.
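If you prefer to block until the job finishes rather than polling the console, the boto3 waiter below (reusing the sagemaker client and training_job_name from the script above) waits and then reports the outcome:

# Wait for the training job to finish, then inspect the result
waiter = sagemaker.get_waiter('training_job_completed_or_stopped')
waiter.wait(TrainingJobName=training_job_name)
desc = sagemaker.describe_training_job(TrainingJobName=training_job_name)
print('Final status:', desc['TrainingJobStatus'])
if desc['TrainingJobStatus'] == 'Completed':
    print('Model artifact:', desc['ModelArtifacts']['S3ModelArtifacts'])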
Deploy the Model as a Serverless Endpoint
After training, retrieve the model artifact from S3 and deploy an endpoint. In 2026, Serverless Inference is a good default: it is pay-per-use, scales automatically (down to zero when idle), and requires no instance management; expect a brief cold start after idle periods.
Deploy the Endpoint and Test Inference
import boto3
import time
from sagemaker import image_uris

sagemaker = boto3.client('sagemaker')
bucket = 'sagemaker-tutorial-1234567890'
training_job_name = 'xgboost-iris-2026-01-01-12-00-00'  # name of your training job
# Retrieve the model artifact produced by the training job
desc = sagemaker.describe_training_job(TrainingJobName=training_job_name)
model_s3 = desc['ModelArtifacts']['S3ModelArtifacts']
endpoint_name = training_job_name + '-endpoint'
sagemaker.create_model(
    ModelName=endpoint_name + '-model',
    PrimaryContainer={
        'Image': image_uris.retrieve('xgboost', region='us-east-1', version='1.7-1'),
        'ModelDataUrl': model_s3
    },
    ExecutionRoleArn='arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole'
)
# For serverless inference, the variant takes a ServerlessConfig instead of an instance type/count
sagemaker.create_endpoint_config(
    EndpointConfigName=endpoint_name + '-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': endpoint_name + '-model',
        'ServerlessConfig': {'MemorySizeInMB': 2048, 'MaxConcurrency': 50}
    }]
)
sagemaker.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_name + '-config')
while True:
    status = sagemaker.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
    print(f'Status: {status}')
    if status in ('InService', 'Failed'):
        break
    time.sleep(30)
runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='text/csv',
    Body='5.1,3.5,1.4,0.2'
)
print(response['Body'].read().decode())

This deploys the trained model behind a serverless XGBoost endpoint and tests it with a single row of Iris features. Monitor usage and cost via CloudWatch (serverless inference is billed per invocation and compute time used). Pitfall: wait for the endpoint to reach InService before invoking, and delete it with delete_endpoint when you're done to avoid charges.
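When testing is done, tear everything down in reverse order, reusing the names from the deployment script above:

# Delete the endpoint first, then its config and the model, to stop all charges
sagemaker.delete_endpoint(EndpointName=endpoint_name)
sagemaker.delete_endpoint_config(EndpointConfigName=endpoint_name + '-config')
sagemaker.delete_model(ModelName=endpoint_name + '-model')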
Monitoring Script with CloudWatch
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')
endpoint_name = 'xgboost-iris-2026-01-01-12-00-00-endpoint'
now = datetime.now(timezone.utc)
# Pull invocation counts for the last 24 hours in 5-minute buckets
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/SageMaker',
    MetricName='Invocations',
    Dimensions=[
        {'Name': 'EndpointName', 'Value': endpoint_name},
        {'Name': 'VariantName', 'Value': 'AllTraffic'}
    ],
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=300,
    Statistics=['Sum']
)
print('Invocation metrics:', response['Datapoints'])
alarms = cloudwatch.describe_alarms(AlarmNamePrefix='SageMaker-' + endpoint_name)
print('Active alarms:', [a['AlarmName'] for a in alarms['MetricAlarms']])

This retrieves invocation metrics from CloudWatch to monitor the endpoint (note that endpoint metrics are dimensioned by both EndpointName and VariantName). Set up alarms for high latency or error rates; they are essential in production, and you can route them through SNS or Lambda for Slack alerts.
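Building on the script above, a minimal alarm sketch on model latency might look like this (the threshold, periods, and alarm name are illustrative; attach SNS or Lambda actions for notifications):

# Alarm if average model latency exceeds ~1 second (ModelLatency is reported in microseconds)
cloudwatch.put_metric_alarm(
    AlarmName='SageMaker-' + endpoint_name + '-high-latency',
    Namespace='AWS/SageMaker',
    MetricName='ModelLatency',
    Dimensions=[
        {'Name': 'EndpointName', 'Value': endpoint_name},
        {'Name': 'VariantName', 'Value': 'AllTraffic'}
    ],
    Statistic='Average',
    Period=300,
    EvaluationPeriods=3,
    Threshold=1000000,
    ComparisonOperator='GreaterThanThreshold'
)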
Best Practices
- Use minimal IAM roles: SageMakerExecutionRole with inline policies for S3/model registry only.
- Automatic hyperparameter tuning: Enable SageMaker Automatic Model Tuning; it often improves accuracy with little manual effort (see the tuning sketch after this list).
- Data versioning: Store datasets in S3 with dated prefixes and use SageMaker Model Registry.
- Serverless first: For <1000 req/day; switch to provisioned for <100ms latency.
- Debug with SageMaker Debugger: hooks capture tensors and real-time metrics during training.
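As a sketch of the tuning bullet above, the call below reuses the XGBoost training definition from earlier; the job name, parameter ranges, and resource limits are example values:

import boto3
from sagemaker import image_uris

sagemaker = boto3.client('sagemaker')
bucket = 'sagemaker-tutorial-1234567890'  # replace with your bucket
role = 'arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole'
container = image_uris.retrieve('xgboost', region='us-east-1', version='1.7-1')

# Bayesian search over max_depth and eta, minimizing the validation multi-class error
sagemaker.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName='xgboost-iris-tuning',
    HyperParameterTuningJobConfig={
        'Strategy': 'Bayesian',
        'HyperParameterTuningJobObjective': {'Type': 'Minimize', 'MetricName': 'validation:merror'},
        'ResourceLimits': {'MaxNumberOfTrainingJobs': 10, 'MaxParallelTrainingJobs': 2},
        'ParameterRanges': {
            'IntegerParameterRanges': [{'Name': 'max_depth', 'MinValue': '3', 'MaxValue': '8'}],
            'ContinuousParameterRanges': [{'Name': 'eta', 'MinValue': '0.05', 'MaxValue': '0.5'}]
        }
    },
    TrainingJobDefinition={
        'StaticHyperParameters': {'num_round': '100', 'objective': 'multi:softmax', 'num_class': '3'},
        'AlgorithmSpecification': {'TrainingImage': container, 'TrainingInputMode': 'File'},
        'RoleArn': role,
        'InputDataConfig': [
            {'ChannelName': 'train', 'DataSource': {'S3DataSource': {'S3Uri': f's3://{bucket}/data/train.csv', 'S3DataType': 'S3Prefix'}}, 'ContentType': 'text/csv'},
            {'ChannelName': 'validation', 'DataSource': {'S3DataSource': {'S3Uri': f's3://{bucket}/data/validation.csv', 'S3DataType': 'S3Prefix'}}, 'ContentType': 'text/csv'}
        ],
        'OutputDataConfig': {'S3OutputPath': f's3://{bucket}/output/'},
        'ResourceConfig': {'InstanceType': 'ml.m5.xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 10},
        'StoppingCondition': {'MaxRuntimeInSeconds': 3600}
    }
)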
Common Errors to Avoid
- Forgetting ContentType='text/csv': Causes 400 errors during inference; always match training.
- No spot training: Triples costs; enable EnableManagedSpotTraining=True.
- Endpoint not deleted: an idle instance-backed endpoint keeps accruing hourly charges; always call delete_endpoint after testing.
- Region mismatch: the XGBoost image URI varies by region; resolve it dynamically with sagemaker.image_uris.retrieve() (see the sketch below).
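For the region-mismatch point, resolving the image for whatever region your session is using looks like this:

import boto3
from sagemaker import image_uris

# Resolve the built-in XGBoost image for the current session's region
region = boto3.session.Session().region_name
print(image_uris.retrieve('xgboost', region=region, version='1.7-1'))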
Next Steps
- Official docs: AWS SageMaker Developer Guide.
- Advanced examples: JumpStart for pre-trained models (Llama3).
- Training: Check out our Learni AWS ML courses.
- Next: SageMaker Pipelines for ML CI/CD or Canvas for no-code.