Introduction
AWS Glue is a fully managed, serverless ETL service from Amazon Web Services that automates data discovery, cataloging, and transformation. Unlike traditional tools like Talend or Informatica, Glue integrates natively with S3, Athena, and Redshift—no infrastructure to manage.
Why use it in 2026? Data pipelines are exploding with AI and big data: Glue auto-generates PySpark or Scala code from your catalog, cuts costs with pay-as-you-go pricing, and scales automatically. Imagine transforming 1TB of CSV files into optimized Parquet in minutes, without EMR clusters.
This beginner tutorial walks you through creating an S3 bucket, crawling CSV data, generating a catalog, and launching an ETL job that reads, transforms, and writes data. At the end, you'll have a working pipeline—bookmark it for any junior data engineer. Estimated time: 30 minutes.
Prerequisites
- Free AWS account (free tier eligible for Glue basics).
- AWS CLI v2 installed.
- Python 3.9+ for local testing (optional).
- AWS region: us-east-1 (default; change if needed).
- IAM permissions: the AWSGlueServiceRole managed policy, or a custom policy covering glue:*, s3:*, and iam:PassRole.
Install and Configure AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws configure
# Enter: AWS Access Key ID, Secret Access Key, region us-east-1, output json
echo '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "glue.amazonaws.com"},"Action": "sts:AssumeRole"}]}' > glue-role-trust-policy.json
aws iam create-role --role-name GlueBeginnerRole --assume-role-policy-document file://glue-role-trust-policy.json
aws iam attach-role-policy --role-name GlueBeginnerRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam attach-role-policy --role-name GlueBeginnerRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

This script installs AWS CLI v2, configures your credentials, and creates an IAM role that Glue can assume, with the Glue and S3 managed policies attached. Enter your own keys during aws configure. Pitfall: without iam:PassRole, jobs will fail on launch.
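If the one-line echo above is easy to mistype, the same trust policy can be generated from Python instead (a convenience sketch; the resulting file is identical):

```python
import json

# Same trust policy the echo command writes, built as a Python dict
# so the JSON is harder to mistype.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

with open("glue-role-trust-policy.json", "w") as f:
    json.dump(trust_policy, f, indent=2)
```

Pass the generated file to `aws iam create-role` exactly as in the commands above.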
Prepare Source Data in S3
Create an S3 bucket to simulate raw data. We'll upload a sample CSV with sales data: id,name,price,region. Glue will crawl this bucket to infer the schema (types, partitions).
Create S3 Bucket and CSV Data
BUCKET_NAME=glue-tutorial-$(date +%s)
# mb creates the bucket only; the raw/ prefix is created by the cp below
aws s3 mb s3://${BUCKET_NAME}
cat > sales.csv << EOF
id,name,price,region
1,ProductA,29.99,US
2,ProductB,49.99,EU
3,ProductC,19.99,ASIA
4,ProductA,29.99,US
EOF
aws s3 cp sales.csv s3://${BUCKET_NAME}/raw/
aws s3 ls s3://${BUCKET_NAME}/raw/

This generates a uniquely named bucket, creates a sample CSV, and uploads it. Reuse $BUCKET_NAME in every later command. Think of the bucket as a Dropbox folder for raw data. Pitfall: bucket names must be globally unique, hence the timestamp suffix.
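The $(date +%s) suffix is what keeps the bucket name globally unique; the same idea in Python, with a rough check against S3's basic naming rules (make_bucket_name is an illustrative helper, not an AWS API):

```python
import re
import time

def make_bucket_name(prefix: str) -> str:
    """Append a unix timestamp for global uniqueness, then check the
    basic S3 rules: 3-63 chars, lowercase letters, digits, hyphens."""
    name = f"{prefix}-{int(time.time())}"
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name):
        raise ValueError(f"invalid bucket name: {name}")
    return name

print(make_bucket_name("glue-tutorial"))
```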
Create the Glue Database
aws glue create-database --database-input '{"Name": "tutorial_db"}' --region us-east-1
aws glue get-database --name tutorial_db --region us-east-1

This creates a logical Data Catalog database to hold metadata; think of it as a registry for inferred tables. Verify with get-database. Without a database, crawlers have nowhere to store schemas.
Discover Data with a Crawler
Crawlers scan S3, infer schemas (string/int/date) and partitions, and populate the catalog. For beginners: a crawler is a serverless job that generates a virtual 'table'.
Create and Run the Crawler
cat > crawler-config.json << EOF
{"Name": "tutorial-crawler","Role": "GlueBeginnerRole","DatabaseName": "tutorial_db","Targets": {"S3Targets": [{"Path": "s3://${BUCKET_NAME}/raw/"}]},"SchemaChangePolicy": {"UpdateBehavior": "UPDATE_IN_DATABASE","DeleteBehavior": "DEPRECATE_IN_DATABASE"}}
EOF
aws glue create-crawler --cli-input-json file://crawler-config.json
aws glue start-crawler --name tutorial-crawler
aws glue get-crawler --name tutorial-crawler
# Wait 1-2 min, then:
aws glue get-table --database-name tutorial_db --name sales

This JSON defines a crawler that scans the raw/ prefix and registers a table in the catalog. start-crawler launches it (billed at about $0.44 per DPU-hour). Check the generated table; note that crawlers typically name the table after the innermost S3 folder, so it may appear as raw rather than sales. If get-table comes back empty, list everything with aws glue get-tables --database-name tutorial_db. Pitfall: a missing or under-privileged IAM role blocks the scan.
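To build intuition for what the crawler does, here is a toy version of CSV type inference (an illustration of the idea, not Glue's actual classifier):

```python
import csv
import io

def infer_type(values):
    """Classify a column the way a naive crawler might:
    try bigint, then double, else fall back to string."""
    for caster, type_name in ((int, "bigint"), (float, "double")):
        try:
            for v in values:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "string"

sample = "id,name,price,region\n1,ProductA,29.99,US\n2,ProductB,49.99,EU\n"
rows = list(csv.DictReader(io.StringIO(sample)))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)  # {'id': 'bigint', 'name': 'string', 'price': 'double', 'region': 'string'}
```

The real crawler also detects headers, delimiters, compression, and partition columns, but the principle is the same: sample the data, then record an inferred schema in the catalog.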
Write the PySpark ETL Script
Core of the tutorial: The Glue job uses PySpark with glueContext. It reads the cataloged table, filters region='US', aggregates average price by name, and writes partitioned Parquet.
Complete Glue Job Script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read cataloged data source (use the table name your crawler actually
# created; crawlers often name it after the S3 folder, e.g. "raw")
ds = glueContext.create_dynamic_frame.from_catalog(database="tutorial_db", table_name="sales")
df = ds.toDF()
# Transform: filter US, aggregate average price by name
us_df = df.filter(df.region == "US")
agg_df = us_df.groupBy("name").avg("price").withColumnRenamed("avg(price)", "avg_price")
# Write partitioned Parquet. BUCKET_NAME is not defined inside the job:
# set it here, or pass it in via --BUCKET_NAME in the job arguments.
BUCKET_NAME = "<your-bucket-name>"  # replace with the bucket created earlier
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(agg_df, glueContext, "agg_df"),
    connection_type="s3",
    connection_options={"path": f"s3://{BUCKET_NAME}/processed/", "partitionKeys": []},
    format="parquet"
)
job.commit()

This is the full PySpark script: it reads from the catalog, transforms (filter plus aggregate), and writes optimized Parquet. create_dynamic_frame leverages the crawler-inferred metadata. Save it as glue_job.py or paste it into the Glue console. Pitfall: forget job.commit() and the run never completes.
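Before paying for a run, you can rehearse the transform on the sample rows in plain Python (this mirrors the filter-and-average logic of the PySpark job; it is not Glue code):

```python
# The four sample rows from sales.csv.
rows = [
    {"id": 1, "name": "ProductA", "price": 29.99, "region": "US"},
    {"id": 2, "name": "ProductB", "price": 49.99, "region": "EU"},
    {"id": 3, "name": "ProductC", "price": 19.99, "region": "ASIA"},
    {"id": 4, "name": "ProductA", "price": 29.99, "region": "US"},
]

# Filter region == "US", then average price per product name,
# like df.filter(...) followed by groupBy("name").avg("price").
totals = {}
for r in (r for r in rows if r["region"] == "US"):
    s, n = totals.get(r["name"], (0.0, 0))
    totals[r["name"]] = (s + r["price"], n + 1)
avg_price = {name: s / n for name, (s, n) in totals.items()}
print(avg_price)  # {'ProductA': 29.99}
```

Only ProductA survives the US filter, so the job should produce a single aggregated row.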
Create and Run the Glue Job
# Upload the script first:
aws s3 cp glue_job.py s3://${BUCKET_NAME}/scripts/
# Then create the job (double quotes so the shell expands ${BUCKET_NAME} inside the JSON):
aws glue create-job --job-name tutorial-job --role GlueBeginnerRole \
  --command "{\"Name\": \"glueetl\",\"ScriptLocation\": \"s3://${BUCKET_NAME}/scripts/glue_job.py\",\"PythonVersion\": \"3\"}" \
  --default-arguments "{\"--TempDir\": \"s3://${BUCKET_NAME}/temp/\"}" \
  --max-capacity 2.0 --timeout 10 --max-retries 0
aws glue start-job-run --job-name tutorial-job
aws glue get-job-run --job-name tutorial-job --run-id $(aws glue get-job-runs --job-name tutorial-job --query 'JobRuns[0].Id' --output text)
# Check output:
aws s3 ls s3://${BUCKET_NAME}/processed/

This uploads the script, creates a job pointing at it, and starts a run (2 DPUs, roughly $1 per run). get-job-run tracks the status until SUCCEEDED; then verify the generated Parquet. Pitfalls: single quotes around the JSON would stop the shell from expanding ${BUCKET_NAME}, and --TempDir is needed when jobs spill temporary data or pull in extra dependencies.
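Instead of re-running get-job-run by hand, the status check can sit in a small polling loop. A sketch with the AWS call injected as a callable (wait_for_job is illustrative; in real use get_status would wrap aws glue get-job-run or boto3's get_job_run):

```python
import time

def wait_for_job(get_status, poll_seconds=0.0, max_polls=100):
    """Poll get_status() until Glue reports a terminal job state."""
    terminal = {"SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED"}
    for _ in range(max_polls):
        state = get_status()
        if state in terminal:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state")
```

With boto3 you would pass something like lambda: glue.get_job_run(JobName=..., RunId=...)["JobRun"]["JobRunState"] and a poll interval of 15-30 seconds.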
Verify and Query with Athena
Connect Athena to the tutorial_db catalog and run SELECT * FROM tutorial_db.sales LIMIT 10; to validate the source table (set an S3 query result location in Athena first). The Parquet output is columnar, so Athena scans far less data than it would for the equivalent CSV.
Best Practices
- Use DynamicFrames: they handle schema evolution better than DataFrames.
- S3 partitions: add partitionKeys=["region"] for fast scans (e.g., WHERE region='US').
- DPU scaling: start at 2 DPUs and monitor CloudWatch for costs.
- Script versioning: store scripts in S3 with tags; don't hardcode paths.
- Security: use least-privilege IAM and encrypt S3 with KMS.
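To see why the partitionKeys=["region"] tip pays off, here is a sketch of the Hive-style key layout partitioned writes produce (partitioned_key is illustrative, not a Glue API). Engines like Athena can skip every prefix whose region value the WHERE clause rules out:

```python
def partitioned_key(base: str, row: dict, partition_keys: list) -> str:
    """Hive-style layout: one key=value folder per partition column."""
    parts = "/".join(f"{k}={row[k]}" for k in partition_keys)
    return f"{base}/{parts}/part-0000.parquet"

print(partitioned_key("processed", {"name": "ProductA", "region": "US"}, ["region"]))
# processed/region=US/part-0000.parquet
```

A query with WHERE region='US' then only reads objects under processed/region=US/, never touching the EU or ASIA prefixes.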
Common Errors to Avoid
- Crawler fails: ensure the IAM role has s3:GetObject on the bucket.
- Job timeout: increase --max-capacity or the job timeout for large data.
- Schema drift: set SchemaChangePolicy to LOG to alert without breaking tables.
- Hidden costs: crawlers are billed per DPU-hour; test on small data first.
Next Steps
- AWS Docs: AWS Glue Developer Guide.
- Advanced: Streaming jobs, Lake Formation integrations.
- Training: check our certified AWS training courses at Learni for professional data engineers.