
How to Orchestrate Advanced ETL Pipelines with AWS Glue in 2026


Introduction

AWS Glue is Amazon's serverless service for large-scale Extract, Transform, Load (ETL) jobs, well suited to processing terabytes of structured or unstructured data without managing infrastructure. In 2026, with the rise of generative AI and unified data lakes, Glue shines at orchestrating complex workflows using Glue Studio, automated crawlers, and scalable PySpark jobs.

This advanced tutorial guides you through deploying a full pipeline: S3 ingestion, automatic cataloging via a crawler, PySpark transformations (cleaning, aggregation, deduplication), and orchestration with conditional triggers. We use AWS CDK in Python for reproducible infrastructure as code (IaC), so the whole pipeline can be rebuilt on demand instead of assembled by hand in the console.

Why it matters: this setup supports Delta Lake ACID tables, CloudWatch monitoring, and auto scaling across hundreds of DPUs. By the end, you'll have a production-ready workflow capable of handling 1 TB/day. Estimated deployment time: 45 minutes.

Prerequisites

  • AWS account with admin rights (or scoped managed policies such as AWSGlueConsoleFullAccess, AmazonS3FullAccess, CloudWatchLogsFullAccess).
  • Python 3.10+ installed.
  • AWS CLI v2 configured (aws configure).
  • AWS CDK v2 CLI: npm install -g aws-cdk (the Python libraries aws-cdk-lib and constructs are installed per project).
  • Node.js 18+ (required by the CDK CLI, even for CDK Python projects).
  • Advanced knowledge: PySpark, Boto3, Data Catalog.

Bootstrap CDK and Initialize Project

bootstrap.sh
mkdir glue-etl-pipeline && cd glue-etl-pipeline
cdk init app --language python
source .venv/bin/activate
pip install -r requirements.txt
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
cdk bootstrap "aws://$ACCOUNT/$REGION"

This script initializes a Python CDK project, activates its virtualenv, installs the project dependencies (including aws-cdk-lib and constructs), and bootstraps the CDK environment in your account and default region. Bootstrapping creates an S3 staging bucket for CDK assets. Pro tip: always work inside the project's venv, and double-check the region reported by aws configure get region before bootstrapping.

Prepare Source Data

Before diving into Glue, set up an S3 bucket with sample data: a messy CSV (duplicates, nulls) simulating IoT logs. Download a public dataset like NYC Taxi or generate one.

Structure: s3://glue-demo-input-<account>-<region>/input/nyc_taxi_sample.csv (the input bucket created by the CDK stack in the next section). This lets you test the crawler, which will infer the schema (string, double, timestamp).
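If you'd rather not download a public dataset, you can generate the messy CSV locally. A minimal sketch (the file name and column set are assumptions chosen to match the transforms used later in the tutorial; the values are synthetic):

```python
# Generate a deliberately messy sample CSV: some null fares and some
# duplicate rows, so the cleaning steps in the Glue job have work to do.
import csv
import random

random.seed(42)
columns = ["pickup_datetime", "pickup_longitude", "passenger_count", "total_amount"]

with open("nyc_taxi_sample.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    for i in range(100):
        row = [
            f"2026-01-01 08:{i % 60:02d}:00",
            round(-73.9 + random.random() / 10, 6),
            random.randint(1, 4),
            "" if i % 17 == 0 else round(random.uniform(5, 80), 2),  # null fares
        ]
        writer.writerow(row)
        if i % 10 == 0:
            writer.writerow(row)  # inject duplicate rows
```

Running this yields 110 data rows (100 originals plus 10 duplicates), with a handful of empty total_amount values for the null-handling step to catch.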

CDK Stack: S3 Buckets and IAM Role

glue_pipeline/glue_pipeline_stack.py
from aws_cdk import App, Stack, RemovalPolicy, CfnOutput
from aws_cdk import aws_s3 as s3, aws_iam as iam
from constructs import Construct

class GluePipelineStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # S3 buckets for input/output
        self.input_bucket = s3.Bucket(self, "GlueInputBucket",
            bucket_name=f"glue-demo-input-{self.account}-{self.region}",
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            versioned=True,
            removal_policy=RemovalPolicy.DESTROY
        )
        self.output_bucket = s3.Bucket(self, "GlueOutputBucket",
            bucket_name=f"glue-demo-output-{self.account}-{self.region}",
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            removal_policy=RemovalPolicy.DESTROY
        )
        CfnOutput(self, "InputBucketName", value=self.input_bucket.bucket_name)
        CfnOutput(self, "OutputBucketName", value=self.output_bucket.bucket_name)

        # IAM role for Glue
        self.glue_role = iam.Role(self, "GlueETLRole",
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
            managed_policies=[
                iam.ManagedPolicy.from_aws_managed_policy_name("service-role/AWSGlueServiceRole"),
            ]
        )
        self.glue_role.add_to_policy(iam.PolicyStatement(
            actions=["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            resources=[self.input_bucket.bucket_arn + "/*", self.output_bucket.bucket_arn + "/*"]
        ))
        self.glue_role.add_to_policy(iam.PolicyStatement(
            actions=["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            resources=["*"]
        ))
        CfnOutput(self, "GlueRoleArn", value=self.glue_role.role_arn)

app = App()
GluePipelineStack(app, "GluePipelineStack")
app.synth()

This CDK stack creates two versioned S3 buckets (input/output) and a custom IAM role for Glue with S3/CloudWatch access. Outputs make it easy to reference in later steps. Common pitfall: Don't forget removal_policy=DESTROY for easy cleanup during tests; use unique names with account/region to avoid global conflicts.

Upload Data and PySpark Script

upload_data.sh
#!/bin/bash
BUCKET_INPUT=$(aws cloudformation describe-stacks --stack-name GluePipelineStack --query 'Stacks[0].Outputs[?OutputKey==`InputBucketName`].OutputValue' --output text)
aws s3 cp nyc_taxi_sample.csv s3://$BUCKET_INPUT/input/
cat > glue_etl_job.py << 'EOF'
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'OUTPUT_BUCKET'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read raw CSV from the Data Catalog
ds = glueContext.create_dynamic_frame.from_catalog(database="glue_demo_db", table_name="nyc_taxi_raw")

# Transform: convert to DataFrame, drop duplicates/nulls, aggregate
spark_df = ds.toDF()
spark_df_clean = spark_df.dropDuplicates(["pickup_datetime", "pickup_longitude"]).na.drop(subset=["total_amount"])
spark_df_agg = spark_df_clean.groupBy("pickup_datetime").agg({"total_amount": "sum", "passenger_count": "avg"}).withColumnRenamed("sum(total_amount)", "total_revenue").withColumnRenamed("avg(passenger_count)", "avg_passengers")

# Write Parquet output (coalesced to one file for this small demo)
dynamic_frame_agg = DynamicFrame.fromDF(spark_df_agg.coalesce(1), glueContext, "agg_df")
glueContext.write_dynamic_frame.from_options(dynamic_frame_agg,
    connection_type="s3",
    connection_options={"path": f"s3://{args['OUTPUT_BUCKET']}/output/parquet/"},
    format="parquet")

job.commit()
EOF
aws s3 cp glue_etl_job.py s3://$BUCKET_INPUT/scripts/

This bash script uploads the CSV to the CDK input bucket and writes out a complete Glue PySpark script: it reads from the Data Catalog, cleans (duplicates, nulls), aggregates, and writes Parquet. The output bucket path is resolved from the --OUTPUT_BUCKET job argument supplied by the job stack. Note: coalesce(1) is for small tests; in production, repartition for parallelism instead.
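To see exactly what the transform does, here is the same dedup + aggregate logic in plain Python on a few hypothetical sample rows (no Spark required; the rows are invented for illustration):

```python
# Mirror of the PySpark steps: dropDuplicates on (pickup_datetime,
# pickup_longitude), drop null total_amount, then groupBy with sum/avg.
rows = [
    {"pickup_datetime": "2026-01-01 08:00", "pickup_longitude": -73.98, "total_amount": 12.5, "passenger_count": 1},
    {"pickup_datetime": "2026-01-01 08:00", "pickup_longitude": -73.98, "total_amount": 12.5, "passenger_count": 1},  # duplicate
    {"pickup_datetime": "2026-01-01 08:00", "pickup_longitude": -73.90, "total_amount": None, "passenger_count": 2},  # null fare
    {"pickup_datetime": "2026-01-01 08:00", "pickup_longitude": -73.85, "total_amount": 7.5,  "passenger_count": 3},
]

# dropDuplicates + na.drop(subset=["total_amount"])
seen, clean = set(), []
for r in rows:
    key = (r["pickup_datetime"], r["pickup_longitude"])
    if key not in seen and r["total_amount"] is not None:
        seen.add(key)
        clean.append(r)

# groupBy("pickup_datetime").agg(sum(total_amount), avg(passenger_count))
agg = {}
for r in clean:
    g = agg.setdefault(r["pickup_datetime"], {"total_revenue": 0.0, "passengers": []})
    g["total_revenue"] += r["total_amount"]
    g["passengers"].append(r["passenger_count"])
result = {k: {"total_revenue": v["total_revenue"],
              "avg_passengers": sum(v["passengers"]) / len(v["passengers"])}
          for k, v in agg.items()}
print(result)  # {'2026-01-01 08:00': {'total_revenue': 20.0, 'avg_passengers': 2.0}}
```

The duplicate and the null-fare row are dropped, leaving two rows that aggregate to 20.0 total revenue and 2.0 average passengers.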

Deploy the Glue Crawler

Glue crawlers automate schema discovery in S3 and populate the Data Catalog (Hive Metastore compatible). For advanced use, configure custom classifiers (JSON/Avro) and scheduling. Here, via CDK, we target the input folder to infer the NYC Taxi schema.

CDK Stack: Crawler and Database

glue_pipeline/crawler_stack.py
from aws_cdk import App, Stack, aws_glue as glue, CfnOutput
from constructs import Construct

class CrawlerStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, input_bucket_name: str, glue_role_arn: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Database Glue
        database = glue.CfnDatabase(self, "GlueDemoDB",
            catalog_id=self.account,
            database_input=glue.CfnDatabase.DatabaseInputProperty(
                name="glue_demo_db",
                description="DB for NYC Taxi ETL"
            )
        )

        # Crawler
        crawler = glue.CfnCrawler(self, "NYCTaxiCrawler",
            role=glue_role_arn,
            database_name="glue_demo_db",
            description="Crawler for the NYC Taxi S3 input",
            targets=glue.CfnCrawler.TargetsProperty(
                s3_targets=[glue.CfnCrawler.S3TargetProperty(
                    path=f"s3://{input_bucket_name}/input/"
                )]
            ),
            schedule=glue.CfnCrawler.ScheduleProperty(schedule_expression="cron(0 2 * * ? *)"),  # daily at 02:00 UTC; crawlers take cron, not rate()
            configuration="{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}",
            recrawl_policy=glue.CfnCrawler.RecrawlPolicyProperty(
                recrawl_behavior="CRAWL_EVENT_MODE"
            )
        )
        crawler.add_dependency(database)
        CfnOutput(self, "CrawlerName", value=crawler.ref)

# Wire into the app
app = App()
# Assume props from previous stack via SSM or manual
CrawlerStack(app, "CrawlerStack", input_bucket_name="glue-demo-input-XXXX-us-east-1", glue_role_arn="arn:aws:iam::XXXX:role/GlueETLRole")
app.synth()

This stack adds a Data Catalog database and a daily scheduled S3 crawler with a JSON configuration for schema merging; the database dependency is enforced explicitly. Pitfall: CRAWL_EVENT_MODE only works once the bucket publishes S3 event notifications to an SQS queue the crawler can read; and replace the placeholders with real CDK outputs (Fn.import_value or direct construct references) in production.

CDK Stack: PySpark ETL Job

glue_pipeline/job_stack.py
from aws_cdk import App, Stack, aws_glue as glue, CfnOutput
from constructs import Construct

class JobStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, input_bucket_name: str, output_bucket_name: str, glue_role_arn: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        job = glue.CfnJob(self, "NYCTaxiETLJob",
            role_arn=glue_role_arn,
            command=glue.CfnJob.CommandProperty(
                name="glueetl",
                script_location=f"s3://{input_bucket_name}/scripts/glue_etl_job.py",
                python_version="3"
            ),
            default_arguments={
                "--TempDir": f"s3://{input_bucket_name}/temp/",
                "--OUTPUT_BUCKET": output_bucket_name,
                "--job-bookmark-option": "job-bookmark-enable",
                "--enable-job-insights": "true",
                "--datalake-formats": "delta"
            },
            glue_version="4.0",  # Spark 3.3 / Python 3.10
            number_of_workers=10,
            worker_type="G.2X",
            timeout=60,
            max_retries=3
        )
        CfnOutput(self, "JobName", value=job.ref)

app = App()
JobStack(app, "JobStack", "glue-demo-input-XXXX-us-east-1", "glue-demo-output-XXXX-us-east-1", "arn:aws:iam::XXXX:role/GlueETLRole")
app.synth()

Defines a Glue 4.0 job (Spark 3.3) with 10 G.2X workers, bookmarks for resumability, and Job Insights for performance analysis. Arguments are passed through to the script. Note: max_capacity cannot be combined with worker_type/number_of_workers, so capacity is expressed in workers here; raise number_of_workers and test timeouts on large datasets.

Orchestration with Workflows and Triggers

For an advanced pipeline, connect crawler -> job via workflow: a trigger on crawler success launches the job. Add conditionals (e.g., >100 new partitions).

CDK Stack: Workflow and Triggers

glue_pipeline/workflow_stack.py
from aws_cdk import App, Stack, aws_glue as glue, CfnOutput
from constructs import Construct

class WorkflowStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, crawler_name: str, job_name: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        workflow = glue.CfnWorkflow(self, "NYCTaxiWorkflow",
            description="NYC Taxi ETL workflow: Crawler -> Job",
            default_run_properties={"conf": "spark.sql.adaptive.enabled=true"}
        )

        trigger = glue.CfnTrigger(self, "CrawlerSuccessTrigger",
            name="trigger-crawler-to-job",
            workflow_name=workflow.ref,
            type="CONDITIONAL",
            start_on_creation=True,
            actions=[glue.CfnTrigger.ActionProperty(job_name=job_name)],
            predicate=glue.CfnTrigger.PredicateProperty(
                conditions=[glue.CfnTrigger.ConditionProperty(
                    logical_operator="EQUALS",
                    crawler_name=crawler_name,
                    crawl_state="SUCCEEDED"
                )]
            )
        )
        CfnOutput(self, "WorkflowName", value=workflow.ref)

app = App()
WorkflowStack(app, "WorkflowStack", "NYCTaxiCrawler", "NYCTaxiETLJob")
app.synth()

Creates a workflow with a conditional trigger: the job starts only if the crawler run succeeds. Note that default run properties are shared metadata that jobs in the run can read via the API; they are not applied as Spark configuration automatically. Pitfall: type=CONDITIONAL requires a predicate; monitor runs with aws glue get-workflow-run in the CLI.

Full CDK Deployment

deploy.sh
cd glue-etl-pipeline
cdk deploy GluePipelineStack CrawlerStack JobStack WorkflowStack --require-approval never
WORKFLOW_NAME=$(aws cloudformation describe-stacks --stack-name WorkflowStack --query 'Stacks[0].Outputs[?OutputKey==`WorkflowName`].OutputValue' --output text)
RUN_ID=$(aws glue start-workflow-run --name "$WORKFLOW_NAME" --query 'RunId' --output text)
aws glue get-workflow-run --name "$WORKFLOW_NAME" --run-id "$RUN_ID" --query 'Run.Status'

Deploys all stacks in sequence, starts the workflow once, and polls the run status. --require-approval never suits CI/CD; in production, integrate with CodePipeline and manual approvals.

Best Practices

  • Dynamic scaling: Enable Glue auto scaling (--enable-auto-scaling=true, Glue 3.0+) so idle workers are released and you pay only for DPUs actually used.
  • Delta Lake: Add spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension for ACID transactions on output.
  • Monitoring: Enable Job Insights and CloudWatch metrics (DPU-seconds, OOM errors).
  • Security: Use Lake Formation for Data Catalog governance; encrypt S3 with KMS.
  • CI/CD: CDK in GitHub Actions, unit tests with LocalStack for PySpark.

Common Errors to Avoid

  • PySpark script missing job.commit(): bookmarks are never committed, so the next run reprocesses everything, and the run may not terminate cleanly and keeps billing until its timeout.
  • Crawler without recrawl_policy: Stale schemas on data drift; enforce CRAWL_EVENT_MODE.
  • Overly permissive IAM: Limit to s3://bucket/*; audit with Access Analyzer.
  • No bookmarks: Full reprocessing on incremental runs; enable --job-bookmark-option=job-bookmark-enable.

Next Steps

Dive deeper with Glue Streaming for real-time Kafka/S3, or Ray on Glue for ML pipelines. Check the AWS Glue 2026 docs.

Explore our Learni AWS Data training: Pro Data Engineer certification, CDK Glue workshops. Join the Discord community for live Q&A.