Introduction
AWS Glue is a serverless ETL service from AWS that simplifies data extraction, transformation, and loading. It lets companies catalog their data and run jobs without managing infrastructure. In 2026, it is a useful skill for any beginner data engineer who wants to automate pipelines. This tutorial walks you step by step through creating a working ETL job: an IAM policy, a crawler, a Python script, and the job itself.
Prerequisites
- AWS account with Glue and S3 permissions
- AWS CLI installed and configured
- Basic knowledge of Python
- An S3 bucket containing CSV data
IAM Configuration
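{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket",
        "glue:*"
      ],
      "Resource": "*"
    }
  ]
}
This IAM policy lets Glue read and write to S3 and manage catalog metadata. It is intentionally broad to keep the demo simple: "glue:*" on "Resource": "*" is more than the minimum, so narrow the resources to your own bucket before using it outside a test account.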
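As a narrower alternative, the S3 statements can be scoped to a single bucket. The sketch below builds such a policy document in Python. The helper name build_glue_s3_policy is hypothetical, the bucket name is the one used later in this tutorial, and the sketch assumes the Glue permissions themselves come from elsewhere (for example the AWS managed policy AWSGlueServiceRole attached to the role).

```python
import json


def build_glue_s3_policy(bucket: str) -> dict:
    """Return an IAM policy document limited to one S3 bucket (illustrative helper)."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # GetObject/PutObject are object-level actions, so they target the keys.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


policy = build_glue_s3_policy("mon-bucket-donnees")
print(json.dumps(policy, indent=2))
```

The role also needs read access to the bucket that holds the job script, so in practice a second call for that bucket (or an extra statement) would be required.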
Creating the Crawler
aws glue create-crawler \
  --name mon-premier-crawler \
  --role arn:aws:iam::123456789012:role/GlueServiceRole \
  --database-name glue-demo-db \
  --targets '{"S3Targets":[{"Path":"s3://mon-bucket-donnees/raw/"}]}' \
  --table-prefix demo_
This CLI command creates a crawler that analyzes the CSV data in S3 and generates the schema in the Glue catalog. The command only defines the crawler: run it with aws glue start-crawler --name mon-premier-crawler to populate the catalog.
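Since the crawler has to finish before the ETL script can read its table, it can help to start it and wait from Python. This is a minimal sketch, assuming a boto3 Glue client is passed in as glue. The function names run_crawler and list_table_names are hypothetical; start_crawler, get_crawler, and get_tables are the boto3 calls they rely on.

```python
import time


def run_crawler(glue, name: str, poll_seconds: float = 15.0, sleep=time.sleep) -> str:
    """Start a crawler, wait until it is READY again, and return its last crawl status."""
    glue.start_crawler(Name=name)
    while True:
        # Wait before each check so the crawler has time to leave the READY state.
        sleep(poll_seconds)
        crawler = glue.get_crawler(Name=name)["Crawler"]
        if crawler["State"] == "READY":
            # LastCrawl may be missing if the crawler has never completed a run.
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")


def list_table_names(glue, database: str) -> list:
    """Return the table names in a catalog database (first page of results only)."""
    response = glue.get_tables(DatabaseName=database)
    return [table["Name"] for table in response["TableList"]]
```

With boto3 installed and credentials configured, this would be called as run_crawler(boto3.client("glue"), "mon-premier-crawler"), followed by list_table_names(boto3.client("glue"), "glue-demo-db") to confirm the exact table name to use in the ETL script.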
Python ETL Script
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve the job name passed by Glue and initialize the job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the table created by the crawler (check its exact name in the catalog)
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="glue-demo-db",
    table_name="demo_raw_data"
)

# Keep id as is and rename name to nom; source types should match what the crawler inferred
transformed = ApplyMapping.apply(
    frame=datasource,
    mappings=[("id", "int", "id", "int"), ("name", "string", "nom", "string")]
)

# Write the result to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://mon-bucket-donnees/processed/"},
    format="parquet"
)

job.commit()
This complete script reads data via the catalog, applies a simple transformation (renaming the name column to nom), and writes the result as Parquet to S3.
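To make the mappings argument concrete: each tuple is (source field, source type, target field, target type). The plain-Python sketch below imitates that behaviour on a list of dictionaries, without Spark or Glue. The names apply_mapping and CASTS are hypothetical and exist only for illustration, and the sketch assumes, as ApplyMapping does, that fields absent from the mapping are dropped.

```python
# Illustrative casts for the target types used in this tutorial (not Glue's full type list).
CASTS = {"int": int, "long": int, "double": float, "string": str}


def apply_mapping(records, mappings):
    """Rename and cast fields following (source, source_type, target, target_type) tuples."""
    result = []
    for record in records:
        mapped = {}
        for source, _source_type, target, target_type in mappings:
            if source in record:
                mapped[target] = CASTS[target_type](record[source])
        # Fields that appear in no mapping tuple are left out of the output.
        result.append(mapped)
    return result


rows = [{"id": "1", "name": "Alice", "extra": "ignored"}]
mappings = [("id", "int", "id", "int"), ("name", "string", "nom", "string")]
mapped_rows = apply_mapping(rows, mappings)
print(mapped_rows)  # [{'id': 1, 'nom': 'Alice'}]
```

The real transform works on a DynamicFrame and handles nested and missing values; this toy version only shows the rename-and-cast idea.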
Creating the Glue Job
aws glue create-job \
  --name mon-job-etl \
  --role arn:aws:iam::123456789012:role/GlueServiceRole \
  --command Name=glueetl,ScriptLocation=s3://mon-bucket-scripts/etl_job.py,PythonVersion=3 \
  --glue-version 4.0 \
  --default-arguments '{"--job-language":"python"}'
This command registers the ETL job in AWS Glue by pointing to the Python script stored in S3.
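The same job can be registered from Python. The sketch below only builds the keyword arguments for the boto3 Glue client's create_job call, so it runs without AWS access. The helper name build_job_definition is hypothetical, and the worker type and worker count are assumptions for a small demo that do not appear in the CLI command.

```python
def build_job_definition(name: str, role_arn: str, script_location: str) -> dict:
    """Build keyword arguments for the boto3 Glue client's create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type, as in the CLI command
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "DefaultArguments": {"--job-language": "python"},
        # Assumed sizing for a small demo; not part of the original command.
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
    }


job_definition = build_job_definition(
    name="mon-job-etl",
    role_arn="arn:aws:iam::123456789012:role/GlueServiceRole",
    script_location="s3://mon-bucket-scripts/etl_job.py",
)
print(job_definition["Command"])
```

With boto3 installed and credentials configured, boto3.client("glue").create_job(**job_definition) would register the job. In both the CLI and the Python variant, the script has to be uploaded first, for example with aws s3 cp etl_job.py s3://mon-bucket-scripts/etl_job.py.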
Running the Job
aws glue start-job-run --job-name mon-job-etl
This command starts the ETL job run and returns a JobRunId. Monitor progress in the AWS Glue console or via CloudWatch.
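Monitoring can also be scripted. This is a minimal sketch, assuming a boto3 Glue client is passed in as glue. The function name run_job_and_wait and the polling interval are hypothetical choices; start_job_run and get_job_run are the boto3 calls used.

```python
import time

# Job run states after which the run will not change any more.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue, job_name: str, poll_seconds: float = 30.0, sleep=time.sleep) -> str:
    """Start a job run and poll until it reaches a terminal state; return that state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            if state != "SUCCEEDED":
                # ErrorMessage is only present when the run reported an error.
                print(f"Run {run_id} ended as {state}: {run.get('ErrorMessage', 'no message')}")
            return state
        sleep(poll_seconds)
```

Called as run_job_and_wait(boto3.client("glue"), "mon-job-etl"), it returns "SUCCEEDED" on success, which makes it easy to chain the job into a larger script.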
Best Practices
- Always use the data catalog to avoid hard-coded schemas
- Prefer the Parquet format for transformed data
- Enable job bookmarks to process only new data
- Monitor costs with tags and CloudWatch alerts
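Several of these practices are set on the job definition rather than in the script. The sketch below assembles job settings with bookmarks, CloudWatch logging, and cost tags. The helper name build_job_settings and the tag keys are hypothetical; the argument names are Glue's special job parameters, passed through DefaultArguments.

```python
def build_job_settings(project: str, enable_bookmarks: bool = True) -> dict:
    """Return DefaultArguments and Tags to merge into a create_job request."""
    bookmark_option = "job-bookmark-enable" if enable_bookmarks else "job-bookmark-disable"
    return {
        "DefaultArguments": {
            "--job-language": "python",
            # Process only data that previous runs have not seen yet.
            "--job-bookmark-option": bookmark_option,
            # Stream driver and executor logs to CloudWatch while the job runs.
            "--enable-continuous-cloudwatch-log": "true",
            # Publish job metrics to CloudWatch for alerting.
            "--enable-metrics": "true",
        },
        # Hypothetical tag keys; use whatever your cost reports group by.
        "Tags": {"project": project, "managed-by": "tutorial"},
    }


settings = build_job_settings("glue-demo")
print(settings["DefaultArguments"]["--job-bookmark-option"])  # job-bookmark-enable
```

Bookmarks only take effect if the script also passes a transformation_ctx to its sources and sinks (for example transformation_ctx="datasource" in from_catalog) and calls job.init and job.commit, as the script in this tutorial already does for the latter two.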
Common Errors to Avoid
- Forgetting to grant correct IAM permissions to the Glue role
- Not configuring job bookmarks on S3 sources
- Using outdated Glue versions (prefer 4.0+)
- Ignoring CloudWatch logging for debugging
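Most of these mistakes can be spotted by inspecting the job definition before running it. The following is a rough sketch of such a check, assuming a dictionary shaped like the "Job" entry returned by the boto3 Glue client's get_job call. The function name audit_job, the wording of the findings, and the rules themselves are illustrative, not an official validation.

```python
def audit_job(job: dict, minimum_glue_version: float = 4.0) -> list:
    """Return a list of findings for a job definition shaped like get_job()["Job"]."""
    findings = []
    arguments = job.get("DefaultArguments", {})
    if not job.get("Role"):
        findings.append("no IAM role set")
    if arguments.get("--job-bookmark-option") != "job-bookmark-enable":
        findings.append("job bookmarks not enabled")
    # Glue versions are strings such as "3.0" or "4.0".
    if float(job.get("GlueVersion", "0")) < minimum_glue_version:
        findings.append("outdated Glue version")
    if arguments.get("--enable-continuous-cloudwatch-log") != "true":
        findings.append("continuous CloudWatch logging not enabled")
    return findings


example_job = {
    "Role": "arn:aws:iam::123456789012:role/GlueServiceRole",
    "GlueVersion": "2.0",
    "DefaultArguments": {"--job-language": "python"},
}
findings = audit_job(example_job)
print(findings)
```

An empty list means none of the checked issues were found; it does not prove that the role's IAM permissions are sufficient, which still has to be verified by running the job.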
To Go Further
Deepen your ETL skills with our Learni training courses dedicated to AWS and modern data pipelines.