How to Orchestrate AWS Step Functions Workflows in 2026

Introduction

AWS Step Functions lets you orchestrate distributed serverless workflows as state machines. Event-driven architectures need explicit state management, retries, and parallelism, and Step Functions expresses all three declaratively. This expert tutorial walks from an Amazon States Language (ASL) definition to an AWS CDK deployment, then covers named-error handling and the task-token callback pattern. Every step includes working code.

Prerequisites

  • AWS account with IAM permissions to create Lambda functions, Step Functions state machines, SNS topics, and IAM roles
  • Node.js 20+ and AWS CDK v2
  • Strong knowledge of TypeScript and Python
  • AWS CLI configured (v2)

Create the Lambda Functions

lambdas/validate_order.py
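class InvalidOrder(Exception):
    """Raised when the order payload has no id."""

def lambda_handler(event, context):
    # Step Functions matches ErrorEquals against the exception class name,
    # so the class must be called InvalidOrder, not a generic Exception.
    order = event.get('order', {})
    if not order.get('id'):
        raise InvalidOrder('Order is missing an id')
    return {'status': 'validated', 'order': order}

This Lambda validates an order and raises a custom InvalidOrder exception. Step Functions reports the exception class name as the error name, which is what a Retry or Catch rule in the state machine matches against.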
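The link between a Python exception and the error name seen by Step Functions can be checked locally. The sketch below is only an approximation of what the Lambda Python runtime reports (it uses the exception class name as the error type); the helper name reported_error_name is hypothetical and not part of any AWS SDK.

```python
class InvalidOrder(Exception):
    """Custom exception whose class name doubles as the Step Functions error name."""


def reported_error_name(exc: Exception) -> str:
    # Approximation: the Lambda Python runtime reports the exception class name,
    # and Step Functions compares that name with the entries in ErrorEquals.
    return type(exc).__name__


# A generic Exception carrying the text "InvalidOrder" is still reported as "Exception".
print(reported_error_name(Exception("InvalidOrder")))  # Exception
# Only a dedicated class produces the name that "ErrorEquals": ["InvalidOrder"] expects.
print(reported_error_name(InvalidOrder("missing id")))  # InvalidOrder
```

The message passed to the exception does not influence matching; only the class name does.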

Define the ASL State Machine

statemachine/workflow.asl.json
{
  "Comment": "Expert e-commerce workflow 2026",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "${ValidateOrderLambdaArn}",
      "Catch": [{
        "ErrorEquals": ["InvalidOrder"],
        "Next": "NotifyFailure"
      }],
      "Next": "ProcessPayment"
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "${PaymentLambdaArn}",
        "Payload.$": "$"
      },
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "${FailureTopicArn}",
        "Message": "Order failed"
      },
      "End": true
    }
  }
}

ASL definition with named error handling, a direct SNS service integration, and ${...} placeholders that CDK replaces with real ARNs at deploy time.
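Because every ${...} placeholder needs a matching substitution at deploy time, a small pre-deployment check can catch omissions early. The following is a minimal sketch, assuming placeholder names contain only letters, digits, and underscores; the helper names and the sample ARN are hypothetical, and the embedded document is a trimmed stand-in for statemachine/workflow.asl.json.

```python
import json
import re

# Matches ${Name} placeholders of the kind used for ARN injection.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_]+)\}")


def find_placeholders(asl_text: str) -> list[str]:
    """Return the distinct placeholder names found in an ASL document."""
    return sorted(set(PLACEHOLDER.findall(asl_text)))


def missing_substitutions(asl_text: str, provided: dict[str, str]) -> list[str]:
    """Return placeholder names that have no entry in `provided`."""
    return [name for name in find_placeholders(asl_text) if name not in provided]


# Trimmed-down stand-in for the workflow definition.
sample_asl = json.dumps({
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "${ValidateOrderLambdaArn}",
            "Next": "NotifyFailure",
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": "${FailureTopicArn}", "Message": "Order failed"},
            "End": True,
        },
    },
})

# Hypothetical ARN, used only to show a partially filled substitution map.
provided = {"ValidateOrderLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:validate-order"}
print(missing_substitutions(sample_asl, provided))  # ['FailureTopicArn']
```

Running such a check in CI against the real file and the keys passed to the stack would flag a placeholder that nobody substitutes before it reaches a deployed state machine.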

TypeScript CDK Stack

lib/stepfunctions-stack.ts
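import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as sns from 'aws-cdk-lib/aws-sns';

export class StepFunctionsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    const validateFn = new lambda.Function(this, 'ValidateOrder', {
      runtime: lambda.Runtime.PYTHON_3_12,
      handler: 'validate_order.lambda_handler',
      code: lambda.Code.fromAsset('lambdas'),
    });
    // Assumption: lambdas/process_payment.py exists; it is not shown in this tutorial.
    const paymentFn = new lambda.Function(this, 'ProcessPayment', {
      runtime: lambda.Runtime.PYTHON_3_12,
      handler: 'process_payment.lambda_handler',
      code: lambda.Code.fromAsset('lambdas'),
    });
    const failureTopic = new sns.Topic(this, 'FailureTopic');
    const definition = sfn.DefinitionBody.fromFile('statemachine/workflow.asl.json');
    const stateMachine = new sfn.StateMachine(this, 'OrderWorkflow', {
      definitionBody: definition,
      // Every ${...} placeholder in the ASL file needs a matching entry here.
      definitionSubstitutions: {
        ValidateOrderLambdaArn: validateFn.functionArn,
        PaymentLambdaArn: paymentFn.functionArn,
        FailureTopicArn: failureTopic.topicArn,
      },
      stateMachineType: sfn.StateMachineType.STANDARD,
    });
    // fromFile() cannot infer permissions from the ASL, so grant them explicitly.
    validateFn.grantInvoke(stateMachine);
    paymentFn.grantInvoke(stateMachine);
    failureTopic.grantPublish(stateMachine);
  }
}

CDK stack that deploys the Lambda functions, the SNS topic, and the state machine, substitutes their ARNs into the ASL file, and grants the state machine permission to invoke the functions and publish to the topic. The payment handler (process_payment.lambda_handler) is assumed to exist in lambdas/; it is not shown here.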

Deploy via CDK

terminal
cdk bootstrap
cdk deploy StepFunctionsStack --require-approval never

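Deployment commands: cdk bootstrap provisions the CDK bootstrap resources once per account and Region, and cdk deploy deploys the stack. The --require-approval never flag skips the confirmation prompt for IAM and security-related changes, so reserve it for CI pipelines or throwaway environments.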
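Once the stack is deployed, the workflow can be exercised by starting an execution. The sketch below is one possible way to do that from Python; the function name start_order_execution is hypothetical, and the client is passed in as an argument (expected to behave like boto3.client("stepfunctions")) so the function can be tried without AWS credentials.

```python
import json


def start_order_execution(sfn_client, state_machine_arn: str, order: dict) -> str:
    """Start one execution of the order workflow and return its execution ARN.

    `sfn_client` is expected to behave like boto3.client("stepfunctions").
    """
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        # The validation Lambda reads event["order"], so wrap the order accordingly.
        input=json.dumps({"order": order}),
    )
    return response["executionArn"]


# Usage against a real account (requires boto3 and configured credentials):
#   import boto3
#   arn = start_order_execution(
#       boto3.client("stepfunctions"),
#       "<ARN of the deployed OrderWorkflow state machine>",
#       {"id": "A-1001"},
#   )
```

Starting an execution with an order that has no id is a quick way to confirm that the Catch rule routes to NotifyFailure.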

Advanced Callback Token Pattern

statemachine/callback.asl.json
{
  "StartAt": "WaitForApproval",
  "States": {
    "WaitForApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "${ApprovalLambdaArn}",
        "Payload": {
          "taskToken.$": "$$.Task.Token"
        }
      },
      "TimeoutSeconds": 3600,
      "End": true
    }
  }
}

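Callback token pattern for human approvals: the execution pauses at WaitForApproval until SendTaskSuccess or SendTaskFailure is called with the task token, and the state fails with States.Timeout if no callback arrives within 3600 seconds (one hour). This integration pattern requires a STANDARD state machine.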
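The other half of the callback pattern is the code that resumes the paused execution once a reviewer has decided. The following is a minimal sketch of that step, assuming the task token was stored when the approval Lambda received it; the function name complete_approval, the output shape, and the error name ApprovalRejected are hypothetical. The client is injected and expected to behave like boto3.client("stepfunctions").

```python
import json


def complete_approval(sfn_client, task_token: str, approved: bool, reviewer: str) -> str:
    """Resume the execution waiting on `task_token` with the reviewer's decision."""
    if approved:
        # The JSON string passed as output becomes the result of WaitForApproval.
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "reviewer": reviewer}),
        )
        return "success"
    # A rejection fails the state with a named error that a Catch rule could match.
    sfn_client.send_task_failure(
        taskToken=task_token,
        error="ApprovalRejected",
        cause=f"Rejected by {reviewer}",
    )
    return "failure"
```

A token can be completed only once, so the caller should treat a second decision for the same approval as an error rather than retrying blindly.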

Best Practices

  • Always use custom error names for precise debugging
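  • Prefer EXPRESS StateMachineType for high-throughput, short-lived workloads (five minutes at most); keep STANDARD for long-running flows and for .waitForTaskToken callbacks, which Express workflows do not support
  • Enable ALL-level execution logging to CloudWatch Logs during development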
  • Externalize configuration via Parameter Store
  • Systematically test individual states with the TestState API or Step Functions Local

Common Errors to Avoid

  • Forgetting IAM permissions between Step Functions and integrated services
  • Using overly short timeouts on long-running tasks
  • Not versioning ASL definitions in production
  • Ignoring the 25,000-event history limit per Standard execution; split long loops into child executions or a Distributed Map state

Further Reading

Deepen these concepts with our advanced serverless architecture training: https://learni-group.com/formations