Introduction
AWS Glue is a fully managed, serverless ETL service from Amazon Web Services that automates data discovery, cataloging, and transformation. Unlike traditional tools like Talend or Informatica, Glue integrates natively with S3, Athena, and Redshift—no infrastructure to manage.
Why use it in 2026? Data pipelines are exploding with AI and big data: Glue auto-generates PySpark or Scala code from your catalog, cuts costs with pay-as-you-go pricing, and scales automatically. Imagine transforming 1TB of CSV files into optimized Parquet in minutes, without EMR clusters.
This beginner tutorial walks you through creating an S3 bucket, crawling CSV data, generating a catalog, and launching an ETL job that reads, transforms, and writes data. At the end, you'll have a working pipeline—bookmark it for any junior data engineer. Estimated time: 30 minutes.
Prerequisites
- Free AWS account (free tier eligible for Glue basics).
- AWS CLI v2 installed.
- Python 3.9+ for local testing (optional).
- AWS region: us-east-1 (default; change if needed).
- IAM permissions: the AWSGlueServiceRole managed policy, or a custom policy covering glue:*, s3:*, and iam:PassRole.
Install and Configure AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws configure
# Enter: AWS Access Key ID, Secret Access Key, region us-east-1, output json
echo '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "glue.amazonaws.com"},"Action": "sts:AssumeRole"}]}' > glue-role-trust-policy.json
aws iam create-role --role-name GlueBeginnerRole --assume-role-policy-document file://glue-role-trust-policy.json
aws iam attach-role-policy --role-name GlueBeginnerRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam attach-role-policy --role-name GlueBeginnerRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

This script installs AWS CLI v2, configures your credentials, and creates an IAM role that Glue can assume, with the Glue and S3 managed policies attached. Enter your own keys during aws configure. Pitfall: without iam:PassRole, jobs will fail on launch.
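If the one-line echo above is easy to mistype, the same trust policy can be generated from Python instead (a convenience sketch; the resulting file is identical):

```python
import json

# Same trust policy the echo command writes, built as a Python dict
# so the JSON is harder to mistype.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

with open("glue-role-trust-policy.json", "w") as f:
    json.dump(trust_policy, f, indent=2)
```

Pass the generated file to `aws iam create-role` exactly as in the commands above.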
Prepare Source Data in S3
Create an S3 bucket to simulate raw data. We'll upload a sample CSV with sales data: id,name,price,region. Glue will crawl this bucket to infer the schema (types, partitions).
Create S3 Bucket and CSV Data
BUCKET_NAME=glue-tutorial-$(date +%s)
# mb creates the bucket only; the raw/ prefix is created by the cp below
aws s3 mb s3://${BUCKET_NAME}
cat > sales.csv << EOF
id,name,price,region
1,ProductA,29.99,US
2,ProductB,49.99,EU
3,ProductC,19.99,ASIA
4,ProductA,29.99,US
EOF
aws s3 cp sales.csv s3://${BUCKET_NAME}/raw/
aws s3 ls s3://${BUCKET_NAME}/raw/

This generates a uniquely named bucket, creates a sample CSV, and uploads it. Reuse $BUCKET_NAME in every later command. Think of the bucket as a Dropbox folder for raw data. Pitfall: bucket names must be globally unique, hence the timestamp suffix.
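The $(date +%s) suffix is what keeps the bucket name globally unique; the same idea in Python, with a rough check against S3's basic naming rules (make_bucket_name is an illustrative helper, not an AWS API):

```python
import re
import time

def make_bucket_name(prefix: str) -> str:
    """Append a unix timestamp for global uniqueness, then check the
    basic S3 rules: 3-63 chars, lowercase letters, digits, hyphens."""
    name = f"{prefix}-{int(time.time())}"
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name):
        raise ValueError(f"invalid bucket name: {name}")
    return name

print(make_bucket_name("glue-tutorial"))
```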
Create the Glue Database
aws glue create-database --database-input '{"Name": "tutorial_db"}' --region us-east-1
aws glue get-database --name tutorial_db --region us-east-1

This creates a logical Data Catalog database to hold metadata; think of it as a registry for inferred tables. Verify with get-database. Without a database, crawlers have nowhere to store schemas.
Discover Data with a Crawler
Crawlers scan S3, infer schemas (string/int/date) and partitions, and populate the catalog. For beginners: a crawler is a serverless job that generates a virtual 'table'.
Create and Run the Crawler
cat > crawler-config.json << EOF
{"Name": "tutorial-crawler","Role": "GlueBeginnerRole","DatabaseName": "tutorial_db","Targets": {"S3Targets": [{"Path": "s3://${BUCKET_NAME}/raw/"}]},"SchemaChangePolicy": {"UpdateBehavior": "UPDATE_IN_DATABASE","DeleteBehavior": "DEPRECATE_IN_DATABASE"}}
EOF
aws glue create-crawler --cli-input-json file://crawler-config.json
aws glue start-crawler --name tutorial-crawler
aws glue get-crawler --name tutorial-crawler
# Wait 1-2 min, then:
aws glue get-table --database-name tutorial_db --name sales

This JSON defines a crawler that scans the raw/ prefix and registers a table in the catalog. start-crawler launches it (billed at about $0.44 per DPU-hour). Check the generated table; note that crawlers typically name the table after the innermost S3 folder, so it may appear as raw rather than sales. If get-table comes back empty, list everything with aws glue get-tables --database-name tutorial_db. Pitfall: a missing or under-privileged IAM role blocks the scan.
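To build intuition for what the crawler does, here is a toy version of CSV type inference (an illustration of the idea, not Glue's actual classifier):

```python
import csv
import io

def infer_type(values):
    """Classify a column the way a naive crawler might:
    try bigint, then double, else fall back to string."""
    for caster, type_name in ((int, "bigint"), (float, "double")):
        try:
            for v in values:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "string"

sample = "id,name,price,region\n1,ProductA,29.99,US\n2,ProductB,49.99,EU\n"
rows = list(csv.DictReader(io.StringIO(sample)))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)  # {'id': 'bigint', 'name': 'string', 'price': 'double', 'region': 'string'}
```

The real crawler also detects headers, delimiters, compression, and partition columns, but the principle is the same: sample the data, then record an inferred schema in the catalog.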
Write the PySpark ETL Script
Core of the tutorial: The Glue job uses PySpark with glueContext. It reads the cataloged table, filters region='US', aggregates average price by name, and writes partitioned Parquet.
Complete Glue Job Script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read cataloged data source (use the table name your crawler actually
# created; crawlers often name it after the S3 folder, e.g. "raw")
ds = glueContext.create_dynamic_frame.from_catalog(database="tutorial_db", table_name="sales")
df = ds.toDF()
# Transform: filter US, aggregate average price by name
us_df = df.filter(df.region == "US")
agg_df = us_df.groupBy("name").avg("price").withColumnRenamed("avg(price)", "avg_price")
# Write partitioned Parquet. BUCKET_NAME is not defined inside the job:
# set it here, or pass it in via --BUCKET_NAME in the job arguments.
BUCKET_NAME = "<your-bucket-name>"  # replace with the bucket created earlier
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(agg_df, glueContext, "agg_df"),
    connection_type="s3",
    connection_options={"path": f"s3://{BUCKET_NAME}/processed/", "partitionKeys": []},
    format="parquet"
)
job.commit()

This is the full PySpark script: it reads from the catalog, transforms (filter plus aggregate), and writes optimized Parquet. create_dynamic_frame leverages the crawler-inferred metadata. Save it as glue_job.py or paste it into the Glue console. Pitfall: forget job.commit() and the run never completes.
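Before paying for a run, you can rehearse the transform on the sample rows in plain Python (this mirrors the filter-and-average logic of the PySpark job; it is not Glue code):

```python
# The four sample rows from sales.csv.
rows = [
    {"id": 1, "name": "ProductA", "price": 29.99, "region": "US"},
    {"id": 2, "name": "ProductB", "price": 49.99, "region": "EU"},
    {"id": 3, "name": "ProductC", "price": 19.99, "region": "ASIA"},
    {"id": 4, "name": "ProductA", "price": 29.99, "region": "US"},
]

# Filter region == "US", then average price per product name,
# like df.filter(...) followed by groupBy("name").avg("price").
totals = {}
for r in (r for r in rows if r["region"] == "US"):
    s, n = totals.get(r["name"], (0.0, 0))
    totals[r["name"]] = (s + r["price"], n + 1)
avg_price = {name: s / n for name, (s, n) in totals.items()}
print(avg_price)  # {'ProductA': 29.99}
```

Only ProductA survives the US filter, so the job should produce a single aggregated row.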
Create and Run the Glue Job
# Upload the script first:
aws s3 cp glue_job.py s3://${BUCKET_NAME}/scripts/
# Then create the job (double quotes so the shell expands ${BUCKET_NAME} inside the JSON):
aws glue create-job --job-name tutorial-job --role GlueBeginnerRole \
  --command "{\"Name\": \"glueetl\",\"ScriptLocation\": \"s3://${BUCKET_NAME}/scripts/glue_job.py\",\"PythonVersion\": \"3\"}" \
  --default-arguments "{\"--TempDir\": \"s3://${BUCKET_NAME}/temp/\"}" \
  --max-capacity 2.0 --timeout 10 --max-retries 0
aws glue start-job-run --job-name tutorial-job
aws glue get-job-run --job-name tutorial-job --run-id $(aws glue get-job-runs --job-name tutorial-job --query 'JobRuns[0].Id' --output text)
# Check output:
aws s3 ls s3://${BUCKET_NAME}/processed/

This uploads the script, creates a job pointing at it, and starts a run (2 DPUs, roughly $1 per run). get-job-run tracks the status until SUCCEEDED; then verify the generated Parquet. Pitfalls: single quotes around the JSON would stop the shell from expanding ${BUCKET_NAME}, and --TempDir is needed when jobs spill temporary data or pull in extra dependencies.
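Instead of re-running get-job-run by hand, the status check can sit in a small polling loop. A sketch with the AWS call injected as a callable (wait_for_job is illustrative; in real use get_status would wrap aws glue get-job-run or boto3's get_job_run):

```python
import time

def wait_for_job(get_status, poll_seconds=0.0, max_polls=100):
    """Poll get_status() until Glue reports a terminal job state."""
    terminal = {"SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED"}
    for _ in range(max_polls):
        state = get_status()
        if state in terminal:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state")
```

With boto3 you would pass something like lambda: glue.get_job_run(JobName=..., RunId=...)["JobRun"]["JobRunState"] and a poll interval of 15-30 seconds.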
Verify and Query with Athena
Connect Athena to the tutorial_db catalog and run SELECT * FROM tutorial_db.sales LIMIT 10; to validate the source table (set an S3 query result location in Athena first). The Parquet output is columnar, so Athena scans far less data than it would for the equivalent CSV.
Best Practices
- Use DynamicFrames: they handle schema evolution better than DataFrames.
- S3 partitions: add partitionKeys=["region"] for fast scans (e.g., WHERE region='US').
- DPU scaling: start at 2 DPUs and monitor CloudWatch for costs.
- Script versioning: store scripts in S3 with tags; don't hardcode paths.
- Security: use least-privilege IAM and encrypt S3 with KMS.
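To see why the partitionKeys=["region"] tip pays off, here is a sketch of the Hive-style key layout partitioned writes produce (partitioned_key is illustrative, not a Glue API). Engines like Athena can skip every prefix whose region value the WHERE clause rules out:

```python
def partitioned_key(base: str, row: dict, partition_keys: list) -> str:
    """Hive-style layout: one key=value folder per partition column."""
    parts = "/".join(f"{k}={row[k]}" for k in partition_keys)
    return f"{base}/{parts}/part-0000.parquet"

print(partitioned_key("processed", {"name": "ProductA", "region": "US"}, ["region"]))
# processed/region=US/part-0000.parquet
```

A query with WHERE region='US' then only reads objects under processed/region=US/, never touching the EU or ASIA prefixes.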
Common Errors to Avoid
- Crawler fails: ensure the IAM role has s3:GetObject on the bucket.
- Job timeout: increase --max-capacity or the job timeout for large data.
- Schema drift: set SchemaChangePolicy to LOG to alert without breaking tables.
- Hidden costs: crawlers are billed per DPU-hour; test on small data first.
Next Steps
- AWS Docs: AWS Glue Developer Guide.
- Advanced: Streaming jobs, Lake Formation integrations.
- Training: check our certified AWS training courses at Learni for professional data engineers.