How to Create an ETL Job with AWS Glue in 2026

18 min · Beginner

Introduction

AWS Glue is a serverless ETL service from AWS that simplifies data extraction, transformation, and loading. It lets you catalog your data and run jobs without managing infrastructure, which makes it a practical starting point for a beginner data engineer who wants to automate pipelines. This tutorial guides you step by step through creating a working ETL job: an IAM policy, a crawler, a PySpark script, and the job itself.

Prerequisites

  • AWS account with Glue and S3 permissions
  • AWS CLI installed and configured
  • Basic knowledge of Python
  • An S3 bucket containing CSV data

IAM Configuration

glue-role-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket",
        "glue:*"
      ],
      "Resource": "*"
    }
  ]
}

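This IAM policy lets Glue read and write S3 objects and manage catalog metadata. It is deliberately broad for a demo: glue:* on "Resource": "*" is not least privilege, so in production restrict the S3 actions to your bucket ARNs and the Glue actions to the ones the job actually needs. The role that carries this policy must also trust the glue.amazonaws.com service principal so that Glue can assume it.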
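
As a rough sketch of how the policy could be narrowed, the Python below builds a policy document scoped to a single bucket, together with the trust policy that lets the Glue service assume the role. The helper names (build_scoped_policy, build_trust_policy) and the list of Glue actions are illustrative assumptions, not an official minimal set; the bucket name is the one used in the crawler command of this tutorial.

```python
import json


def build_scoped_policy(bucket: str) -> dict:
    """Return an IAM policy document limited to one S3 bucket (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level actions target the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Assumption: a small set of catalog actions; extend as the job requires.
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:GetPartitions",
                    "glue:CreateTable",
                    "glue:UpdateTable",
                ],
                "Resource": "*",
            },
        ],
    }


def build_trust_policy() -> dict:
    """Return the trust policy that allows the Glue service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Print the scoped policy as JSON, ready to save next to glue-role-policy.json.
print(json.dumps(build_scoped_policy("mon-bucket-donnees"), indent=2))
```

Generating the document from a function keeps the bucket name in one place; the printed JSON can be reviewed before it is attached to the role.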

Creating the Crawler

terminal
aws glue create-crawler \
  --name mon-premier-crawler \
  --role arn:aws:iam::123456789012:role/GlueServiceRole \
  --database-name glue-demo-db \
  --targets '{"S3Targets":[{"Path":"s3://mon-bucket-donnees/raw/"}]}' \
  --table-prefix demo_

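This CLI command creates a crawler that, once started, scans the CSV data under s3://mon-bucket-donnees/raw/ and writes the inferred schema to the glue-demo-db database in the Glue Data Catalog. Creating the crawler does not run it: start it with aws glue start-crawler --name mon-premier-crawler. The table name is derived from the S3 folder name plus the demo_ prefix, so check the name in the catalog before referencing it in the script.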
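
For readers who prefer to script this step, here is a minimal sketch that starts the crawler, waits for it to finish, and lists the tables it produced. It assumes a boto3 Glue client is passed in; the helper names, polling interval, and poll limit are hypothetical choices, and only the first page of get_tables results is read.

```python
import time
from typing import List


def run_crawler_and_wait(glue, name: str, poll_seconds: float = 15.0, max_polls: int = 80) -> None:
    """Start a crawler and block until it reports the READY state again."""
    glue.start_crawler(Name=name)
    for _ in range(max_polls):
        # Assumption: the state reads RUNNING or STOPPING while the crawl is in progress.
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {name!r} did not finish after {max_polls} polls")


def list_table_names(glue, database: str) -> List[str]:
    """Return the table names of a catalog database (first result page only)."""
    response = glue.get_tables(DatabaseName=database)
    return [table["Name"] for table in response["TableList"]]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   glue = boto3.client("glue")
#   run_crawler_and_wait(glue, "mon-premier-crawler")
#   print(list_table_names(glue, "glue-demo-db"))
```

Passing the client as a parameter keeps the helpers easy to exercise with a stub before pointing them at a real account.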

Python ETL Script

etl_job.py
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Glue passes JOB_NAME on the command line when it starts the job.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the table that the crawler registered in the Data Catalog.
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="glue-demo-db",
    table_name="demo_raw_data"
)

# Each mapping is (source column, source type, target column, target type).
transformed = ApplyMapping.apply(
    frame=datasource,
    mappings=[("id", "int", "id", "int"), ("name", "string", "nom", "string")]
)

# Write the mapped columns to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://mon-bucket-donnees/processed/"},
    format="parquet"
)
# Mark the run as complete.
job.commit()

This complete script reads the data through the catalog, keeps the id column, renames name to nom (any column not listed in the mappings is dropped), and writes the result as Parquet to S3.
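
To make the mapping step concrete without a Glue environment, the sketch below imitates the rename-and-cast behavior on plain Python dictionaries. It is a local stand-in for illustration only, not how ApplyMapping is implemented; the apply_mapping helper, the CASTERS table, and the sample row are assumptions.

```python
from typing import Any, Callable, Dict, List, Tuple

# Casters for the two target types used in this tutorial (illustrative subset).
CASTERS: Dict[str, Callable[[Any], Any]] = {"int": int, "string": str}

# (source column, source type, target column, target type)
Mapping = Tuple[str, str, str, str]


def apply_mapping(rows: List[Dict[str, Any]], mappings: List[Mapping]) -> List[Dict[str, Any]]:
    """Rename and cast the mapped columns of each row; unmapped columns are dropped."""
    result = []
    for row in rows:
        out = {}
        for source, _source_type, target, target_type in mappings:
            if source in row:
                out[target] = CASTERS[target_type](row[source])
        result.append(out)
    return result


# Hypothetical sample row, as it might look after reading a CSV line.
rows = [{"id": "1", "name": "Alice", "extra": "x"}]
mappings = [("id", "int", "id", "int"), ("name", "string", "nom", "string")]
print(apply_mapping(rows, mappings))  # [{'id': 1, 'nom': 'Alice'}]
```

The extra column disappears from the output because it has no entry in the mappings, which mirrors what happens to unmapped columns in the job.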

Creating the Glue Job

terminal
aws glue create-job \
  --name mon-job-etl \
  --role arn:aws:iam::123456789012:role/GlueServiceRole \
  --command Name=glueetl,ScriptLocation=s3://mon-bucket-scripts/etl_job.py,PythonVersion=3 \
  --glue-version 4.0 \
  --default-arguments '{"--job-language":"python"}'

This command registers the ETL job in AWS Glue by pointing to the Python script stored in S3.

Running the Job

terminal
aws glue start-job-run --job-name mon-job-etl

This command starts a run of the ETL job and returns a JobRunId. Monitor progress in the AWS Glue console or via CloudWatch.
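
Monitoring can also be scripted. The following sketch starts a run and polls its state until it reaches a terminal value. It assumes a boto3 Glue client is passed in; the helper name, polling interval, and poll limit are hypothetical.

```python
import time

# States after which a job run does not change again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue, job_name: str, poll_seconds: float = 30.0, max_polls: int = 120) -> str:
    """Start a job run and return its final state once it is terminal."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    for _ in range(max_polls):
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_name!r} run {run_id} still active after {max_polls} polls")


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   final_state = run_job_and_wait(boto3.client("glue"), "mon-job-etl")
#   print(final_state)
```

Returning the final state rather than raising on failure leaves the decision about retries or alerts to the caller.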

Best Practices

  • Always use the data catalog to avoid hard-coded schemas
  • Prefer the Parquet format for transformed data
  • Enable job bookmarks to process only new data (set --job-bookmark-option to job-bookmark-enable and give each source a transformation_ctx)
  • Monitor costs with tags and CloudWatch alerts
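
As one possible way to apply the bookmark and tagging practices above, the sketch below builds the keyword arguments for boto3's create_job call with bookmarks enabled and cost tags attached. The helper name and the tag keys are hypothetical. Bookmarks only take effect if the script also passes a transformation_ctx to its sources, which the script in this tutorial does not yet do.

```python
from typing import Dict, Optional


def build_job_definition(
    name: str,
    role_arn: str,
    script_location: str,
    tags: Optional[Dict[str, str]] = None,
) -> dict:
    """Return keyword arguments for glue.create_job with bookmarks and tags."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "DefaultArguments": {
            "--job-language": "python",
            # Ask Glue to track processed data between runs.
            "--job-bookmark-option": "job-bookmark-enable",
        },
        # Tags let cost reports be grouped per project or environment.
        "Tags": dict(tags or {}),
    }


definition = build_job_definition(
    name="mon-job-etl",
    role_arn="arn:aws:iam::123456789012:role/GlueServiceRole",
    script_location="s3://mon-bucket-scripts/etl_job.py",
    tags={"project": "glue-demo", "env": "dev"},  # hypothetical tag keys
)

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_job(**definition)
```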

Common Errors to Avoid

  • Forgetting to grant correct IAM permissions to the Glue role
  • Not configuring bookmarks on S3 sources
  • Using outdated Glue versions (prefer 4.0+)
  • Ignoring CloudWatch logging for debugging

To Go Further

Deepen your ETL skills with our Learni training courses dedicated to AWS and modern data pipelines.