AWS Lambda Durable Functions: Building Reliable Stateful Workflows Without Step Functions

Introduction

AWS Lambda changed how we think about backend compute when it launched in 2014. Twelve years on, the service has matured enormously — but one constraint has remained stubbornly fixed: functions are expected to be short-lived and stateless. Any workflow that needed to span minutes, hours, or await human input required bolting on external machinery — DynamoDB checkpointing tables, Step Functions state machines, SQS retry loops, or Temporal clusters running on EC2. All of that works, but it adds operational weight that erodes the original serverless value proposition.

Lambda Durable Functions, announced at AWS re:Invent 2025 and reaching general availability in early 2026, changes the equation fundamentally. It lets you write multi-step, long-running, stateful logic directly inside a Lambda function handler — in sequential imperative code, in Python, TypeScript, or JavaScript — while the platform transparently handles checkpointing, failure recovery, and cost-efficient pausing during wait periods. No external state store. No orchestration service. No idle compute charges while waiting for a webhook or human approval.

This article is a deep technical examination of how Durable Functions work, when they are the right architectural choice, and how to deploy them with production-grade reliability. We cover the checkpoint-and-replay engine, the steps and waits primitives, AI agent orchestration patterns, comparison with AWS Step Functions, cost modelling, observability, and the operational constraints you need to plan around before going to production.

The Statefulness Problem in Serverless Architecture

To appreciate why Durable Functions matter, it helps to understand the problem space they solve. Consider a straightforward e-commerce order fulfillment workflow: validate payment credentials, reserve inventory, wait up to 72 hours for fraud review, trigger shipment if approved or issue a refund if rejected, then notify the customer. On paper, five steps. In a Lambda-based microservices architecture, this becomes a distributed systems coordination challenge involving at least three queues, two DynamoDB tables, a Step Functions state machine with six states, and carefully designed idempotency keys to prevent double-charges on retries.

The cognitive and operational overhead is significant. Engineers building these systems spend more time reasoning about retry semantics, dead-letter queue routing, and checkpoint consistency than on the actual business logic. When a Lambda function times out mid-workflow — after debiting the customer but before reserving inventory — the recovery path must be explicitly designed, tested, and monitored. For small teams or high-velocity feature development, this is a meaningful tax on engineering throughput.

The Durable Execution programming model addresses this by making the runtime responsible for fault tolerance rather than the developer. Your code expresses intent sequentially; the infrastructure ensures that intent is carried out exactly once, across however many retries, restarts, or suspensions are required.

What Lambda Durable Functions Actually Are

Lambda Durable Functions are not a new compute primitive — they are a new programming model layered on top of the existing Lambda execution environment. You enable the feature at function creation time, include the open-source Durable Execution SDK in your deployment package, and structure your handler around the SDK’s primitives. The Lambda runtime, AWS infrastructure, and an AWS-managed journal handle the rest.

The feature is built on the concept of durable execution, a pattern pioneered by workflow engines like Microsoft’s Durable Functions for Azure, Temporal, and Cadence. The fundamental idea is that a workflow’s execution history is the source of truth, not the in-memory state of a running process. Progress is checkpointed to a journal after each logical step. If the function or underlying host fails, the workflow resumes by replaying the execution history from the journal — fast-forwarding through completed steps and re-executing only the steps that had not yet succeeded.

ℹ

Availability and Runtime SupportLambda Durable Functions are available in all commercial AWS regions where Lambda is available. Supported runtimes at GA: Python 3.12+, Node.js 22+, TypeScript 5+. Java and .NET 10 support is on the GA roadmap. Functions must be created with the feature enabled — existing functions cannot be retroactively converted.

The Checkpoint-and-Replay Engine: Under the Hood

Understanding the replay mechanism is essential before writing production Durable Functions code, because it shapes how you reason about side effects, timing, and external interactions.

When a Durable Function executes for the first time, it runs sequentially top-to-bottom. Each time it reaches a step() call, the SDK executes the provided function, waits for the result, and atomically checkpoints that result to the AWS-managed journal before proceeding. If the Lambda function is interrupted at any point — timeout, crash, or deliberate suspension via a wait() call — the journal preserves everything that completed.

On the next invocation (triggered by the runtime when the function is resumed or retried), the function handler runs from line one again. As it encounters each step() call, the SDK checks the journal: has this step already completed? If yes, it returns the recorded result immediately without re-executing the step’s logic. The handler effectively fast-forwards through history. Only the step that was interrupted executes again.

⚠

Critical Design ConstraintYour handler code and step functions must be deterministic. Do not call random.uuid4(), datetime.now(), or any other non-deterministic operation directly in the handler or inside replayed steps. The SDK provides replay-safe alternatives: dc.uuid(), dc.now(). Violating this constraint causes divergence between the original execution and the replay, leading to silent data corruption or incorrect workflow behaviour.

sequenceDiagram
    participant H as Lambda Handler
    participant SDK as Durable Execution SDK
    participant J as AWS-Managed Journal
    participant E as External System

    Note over H,J: First execution
    H->>SDK: step("validate-payment", fn)
    SDK->>E: Execute fn() - call payment API
    E-->>SDK: Result: {txn_id: "T-9921"}
    SDK->>J: Checkpoint step result
    H->>SDK: wait("fraud-review", 72h)
    SDK->>J: Record WAIT state
    Note over H,J: Function suspended - zero compute billing

    Note over H,J: Resumed after fraud team approves
    H->>SDK: step("validate-payment", fn)
    SDK->>J: Journal has result - SKIP re-execution
    SDK-->>H: Return {txn_id: "T-9921"} from journal
    H->>SDK: step("fulfill-order", fn)
    SDK->>E: Execute fn() - trigger shipment
    E-->>SDK: Result: {shipment_id: "SHP-441"}
    SDK->>J: Checkpoint step result
    H-->>H: Return final result

Core Primitives: steps, waits, and Replay-Safe Utilities

The Durable Execution SDK exposes a small, composable API surface. Mastering these two primitives and the replay-safe utilities covers 95% of production use cases.

The steps Primitive

A step wraps any callable — a function, a lambda, a coroutine — making it automatically retried on failure and checkpointed on success. Steps are the atomic unit of durable execution. Each step is identified by a string name that must be unique within the workflow. This name becomes the journal key for checkpoint lookup during replay.

import boto3
import json
from durable_execution import DurableContext

def handler(event, context):
    dc = DurableContext(context)

    # Each step has a unique name, a callable, and optional retry config
    payment = dc.step(
        name="validate-payment",
        fn=lambda: call_payment_service(event["card_token"], event["amount_cents"]),
        retry_config={
            "max_attempts": 5,
            "initial_interval_seconds": 2,
            "backoff_coefficient": 2.0,      # Exponential backoff
            "max_interval_seconds": 30,
            "non_retryable_errors": ["CardDeclinedError", "InvalidCardError"]
        }
    )

    # Steps can depend on results of prior steps
    inventory = dc.step(
        name="reserve-inventory",
        fn=lambda: reserve_stock(
            sku=event["sku"],
            quantity=event["qty"],
            reservation_ref=payment["txn_id"]  # Use prior step result
        )
    )

    return {
        "status": "reserved",
        "txn_id": payment["txn_id"],
        "reservation_id": inventory["reservation_id"]
    }

The waits Primitive

A wait suspends the function execution until an external signal arrives or a timeout expires. Crucially, no compute resources are consumed during the wait period. The function is fully de-allocated. When the signal arrives — via a callback URL, an SQS message, an SNS notification, or an SDK API call — Lambda resumes the function from the checkpoint immediately preceding the wait.

def order_workflow(event, context):
    dc = DurableContext(context)

    dc.step("charge-payment", lambda: charge_card(event["card_token"], event["total"]))

    # Suspend for up to 72 hours - zero compute cost during wait
    # The wait returns whatever payload the external system sends on resume
    fraud_decision = dc.wait(
        name="fraud-review-approval",
        timeout={"hours": 72},
        on_timeout="reject"   # If timeout expires, resume with this signal name
    )

    if fraud_decision.signal_name == "reject":
        dc.step("issue-refund", lambda: refund_payment(event["card_token"], event["total"]))
        return {"status": "rejected", "reason": "fraud_timeout_or_rejection"}

    # Approved - continue fulfillment
    dc.step("fulfill-order", lambda: trigger_fulfillment(event["order_id"]))
    dc.step("send-confirmation", lambda: send_email(event["customer_email"]))

    return {"status": "fulfilled"}

# In your fraud review system - send approval signal to resume the workflow
def approve_fraud_review(execution_id: str, reviewer: str):
    import boto3
    lambda_client = boto3.client("lambda")
    lambda_client.invoke(
        FunctionName="order-workflow",
        InvocationType="Event",
        Payload=json.dumps({
            "_durable_signal": {
                "execution_id": execution_id,
                "signal_name": "approve",
                "payload": {"reviewer": reviewer, "approved_at": dc.now().isoformat()}
            }
        })
    )

AI Agent Orchestration: The Killer Use Case for 2026

The most compelling production use case for Lambda Durable Functions in 2026 is AI agent orchestration. Agentic AI systems — LLM-based agents that iteratively call tools, accumulate conversation context across multiple turns, escalate to human reviewers, and retry on reasoning failures — map almost perfectly onto the Durable Execution model.

Traditional approaches to agentic AI pipelines require either keeping a process alive for the full agent execution (expensive and fragile) or implementing custom checkpointing (error-prone). Durable Functions provide a clean, infrastructure-managed solution: each LLM call and each tool execution is a checkpointed step. If the Lambda times out mid-agent-loop, the next invocation replays from the last checkpoint and skips the completed tool calls.

import boto3, json
from durable_execution import DurableContext

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

TOOL_DEFINITIONS = [
    {"name": "search_knowledge_base", "description": "Query internal knowledge base"},
    {"name": "execute_sql",           "description": "Run read-only SQL against data warehouse"},
    {"name": "send_report_email",     "description": "Send formatted report to stakeholders"},
]

def ai_analyst_agent(event, context):
    dc = DurableContext(context)
    task         = event["task"]
    max_iters    = event.get("max_iterations", 12)
    conversation = [{"role": "user", "content": task}]

    for i in range(max_iters):
        # Each LLM call is a checkpointed step - replays skip completed calls
        llm_out = dc.step(
            name=f"llm-iter-{i}",
            fn=lambda msgs=list(conversation): bedrock.invoke_model(
                modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
                body=json.dumps({
                    "messages": msgs,
                    "tools": TOOL_DEFINITIONS,
                    "max_tokens": 4096,
                    "temperature": 0.1
                })
            )
        )
        body = json.loads(llm_out["body"].read())

        if body["stop_reason"] == "end_turn":
            return {
                "result":     body["content"][0]["text"],
                "iterations": i + 1,
                "escalated":  False
            }

        # Execute each tool call as its own checkpointed step
        tool_results = []
        for tool_call in [b for b in body["content"] if b["type"] == "tool_use"]:
            result = dc.step(
                name=f"tool-{tool_call['name']}-iter-{i}",
                fn=lambda tc=tool_call: dispatch_tool(tc["name"], tc["input"]),
                retry_config={"max_attempts": 3}
            )
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_call["id"],
                "content": json.dumps(result)
            })

        conversation.append({"role": "assistant", "content": body["content"]})
        conversation.append({"role": "user",      "content": tool_results})

    # Max iterations reached - escalate to human analyst
    human_input = dc.wait("human-escalation", timeout={"hours": 48})
    return {
        "result":    human_input.payload.get("analyst_response", "No response"),
        "escalated": True
    }

💡

Performance TipFor AI agent workflows that call Bedrock repeatedly, enable Prompt Caching on Claude models. Combined with Durable Functions checkpointing, this means replayed iterations do not re-incur Bedrock input token costs for the portions of the conversation that hit the cache prefix. At scale, this yields 60-80% token cost savings on agent retry paths.

Lambda Durable Functions vs. AWS Step Functions: The Complete Decision Framework

Lambda Durable Functions vs AWS Step Functions decision matrix — Decision Matrix: Lambda Durable Functions for code-first orchestration vs Step Functions for cross-service visual workflows

The most common architectural question teams ask when evaluating Durable Functions is how it compares to AWS Step Functions. The answer is nuanced: these are complementary services, not competing ones. The right choice depends on three axes — orchestration scope, team skillset, and debugging preferences.

Choose Lambda Durable Functions when: your entire workflow logic lives within a single function’s codebase, you want code-first workflow definition without a separate JSON/YAML state machine, your team is fluent in Python or TypeScript and prefers debuggable unit-testable code, and your workflows involve tight loops with variable iteration counts (like AI agent loops).

Choose AWS Step Functions when: your workflow orchestrates multiple distinct AWS services (ECS tasks, SageMaker endpoints, Glue jobs, DynamoDB operations) as first-class participants, not just API calls from inside Lambda; when you need the visual workflow designer for stakeholder communication; when you require the 200+ SDK integrations that allow Step Functions to call services directly without Lambda intermediaries; or when your compliance posture requires an externally auditable state machine definition.

flowchart TD
    A["New Workflow Requirement"] --> B{"Multi-service orchestration?"}
    B -->|"Yes: ECS, Glue, SageMaker as direct steps"| C["AWS Step Functions"]
    B -->|"No: logic lives within one function"| D{"Long-running or stateful?"}
    D -->|"Yes: approvals, AI agent loops"| E["Lambda Durable Functions"]
    D -->|"No: sub-15 min stateless"| F["Standard Lambda"]
    C --> G["Visual Workflow Designer\nPer-state-transition pricing"]
    E --> H["Code-first sequential logic\nCompute-only pricing"]
    F --> I["Simple event handler\nLowest overhead"]

ℹ

Cost ComparisonStep Functions charges per state transition ($0.025 per 1,000 transitions for Standard Workflows). A 20-step workflow processing 1M orders/month costs $500/month in Step Functions transitions alone. Lambda Durable Functions charges only for compute time during active execution — waits cost zero. For high-volume workflows with long human-approval waits, Durable Functions can be significantly cheaper.

Error Handling, Retries, and Idempotency at Scale

Production durable workflows demand careful thought about failure modes. The SDK’s retry configuration handles transient failures (network timeouts, throttling, temporary unavailability), but non-transient failures — invalid inputs, permanent service errors, business rule violations — require explicit handling in your step logic.

from durable_execution import DurableContext, StepError

class InsufficientInventoryError(Exception):
    """Non-retryable business logic error"""
    pass

def fulfillment_workflow(event, context):
    dc = DurableContext(context)

    try:
        inventory = dc.step(
            name="reserve-inventory",
            fn=lambda: reserve_stock_with_validation(event["sku"], event["qty"]),
            retry_config={
                "max_attempts": 4,
                "non_retryable_errors": [
                    "InsufficientInventoryError",  # Don't retry stockouts
                    "InvalidSKUError"
                ]
            }
        )
    except StepError as e:
        if e.cause_type == "InsufficientInventoryError":
            # Compensate - notify customer, release payment hold
            dc.step("notify-stockout", lambda: send_stockout_notification(event))
            dc.step("release-payment-hold", lambda: void_authorization(event["auth_code"]))
            return {"status": "cancelled", "reason": "stockout"}
        raise  # Unexpected error - propagate for dead-letter handling

    dc.step("finalize-shipment", lambda: book_carrier(inventory["reservation_id"]))
    return {"status": "fulfilled"}

Deployment: CDK, SAM, and CloudFormation

Durable Functions must be enabled at function creation time. The feature cannot be retrofitted onto an existing Lambda function. Plan your deployment accordingly — treat Durable Function enablement as an infrastructure change requiring a new function resource, not an in-place update.

# AWS SAM template for Lambda Durable Function
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  OrderWorkflow:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: order-workflow-durable
      Handler: handler.order_workflow
      Runtime: python3.13
      Timeout: 900        # Maximum Lambda timeout; waits extend beyond this
      MemorySize: 1024
      Architectures: [arm64]   # Graviton3 - 20% cheaper per GB-second

      # Enable Durable Functions
      DurableConfig:
        Enabled: true
        JournalRetentionDays: 90   # How long to keep execution journals

      Environment:
        Variables:
          POWERTOOLS_SERVICE_NAME: order-workflow
          LOG_LEVEL: INFO

      Policies:
        - AWSLambdaBasicExecutionRole
        - DynamoDBReadPolicy:
            TableName: !Ref OrdersTable
        - Statement:
            - Effect: Allow
              Action: [secretsmanager:GetSecretValue]
              Resource: !Sub "arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:payment-*"

Observability and Debugging in Production

Lambda Durable Functions emit structured execution events to CloudWatch Logs with full step names, durations, retry counts, and journal checkpoint metadata. AWS Lambda Powertools for Python integrates natively with Durable Functions to add correlation IDs, structured JSON logging, and X-Ray trace segments per step.

from aws_lambda_powertools import Logger, Tracer
from durable_execution import DurableContext

logger = Logger(service="order-workflow")
tracer = Tracer(service="order-workflow")

@logger.inject_lambda_context(correlation_id_path="headers.x-correlation-id")
@tracer.capture_lambda_handler
def handler(event, context):
    dc = DurableContext(context)

    with tracer.capture_method("validate-payment"):
        payment = dc.step(
            name="validate-payment",
            fn=lambda: validate_payment(event["card_token"])
        )
        logger.info("Payment validated", txn_id=payment["txn_id"],
                    step_duration_ms=dc.last_step_duration_ms)

    # dc.execution_id is stable across replays - use as correlation ID
    logger.append_keys(execution_id=dc.execution_id)
    return {"execution_id": dc.execution_id}

Cost Modelling: When Durable Functions Save Money

The cost profile of Durable Functions is distinctive: you pay for compute only during active step execution. Wait periods — whether 30 seconds or 30 days — cost nothing. Compare this to alternatives:

EC2-based workflow engine (Temporal self-hosted): 24/7 instance costs for the history service, frontend, matching, and worker hosts. For a small cluster: ~$800-1,200/month minimum, regardless of workflow volume.

AWS Step Functions Standard: $0.025 per 1,000 state transitions. A 10-step workflow at 1M executions/month = $250/month in transitions, plus Lambda compute for each state.

Lambda Durable Functions: Lambda compute charges only during active execution. A workflow that processes for 2 seconds and waits 48 hours for human review costs the same as a 2-second-only Lambda invocation — the 48-hour wait is free. At 1M executions/month with 2 seconds active compute per execution on a 1 GB ARM64 function: ~$33.40/month in compute plus the usual Lambda request charges.

💡

Optimization TipStructure your steps to minimize active compute time before a long wait. Move all preprocessing work into steps before the wait call so the function suspends as quickly as possible after triggering the external action. This maximises the idle-wait period during which no compute charges accrue.

Limitations and Constraints You Need to Plan Around

Durable Functions are a powerful addition to the toolkit, but they have real constraints that shape architectural decisions:

New functions only: The feature cannot be enabled on existing Lambda functions. You must create new function resources. Plan blue-green migration for existing workflows.

Supported runtimes at GA: Python 3.12+, Node.js 22+, TypeScript via Node.js. Java and .NET support is on the roadmap but not yet GA.

Step result size: Individual step return values must be serialisable to JSON and are subject to the journal record size limits (currently 256 KB per step result). For larger outputs — large LLM responses, binary data — store the result in S3 and record only the S3 key as the step result.

Cold starts on resume: When a function is resumed after a wait, it incurs a cold start just like any other Lambda invocation. For latency-sensitive resume paths, configure Provisioned Concurrency on the function alias.

Debugging replays locally: The durable_execution SDK provides a local journal implementation for unit testing. Use it extensively — testing replay behaviour in isolation is essential and cannot be adequately done by simply running the handler once.

Key Takeaways

Durable Execution SDK brings checkpoint-and-replay stateful orchestration natively into Lambda — no external state store, no Step Functions state machine required for single-function workflows
The steps primitive checkpoints progress and auto-retries with configurable backoff; the waits primitive suspends execution at zero compute cost for hours or days
AI agent loops, multi-step payment flows, and long-running approval processes are the sweet spots — scenarios where the workflow is naturally sequential, code-owned, and interspersed with external waits
Step Functions remains the right choice for cross-service orchestration with visual state machine definitions and 200+ SDK integrations
Durable Functions costs dramatically less than EC2-hosted workflow engines and can outperform Step Functions on cost for high-volume workflows with long idle-wait periods
Plan around the new-function-only constraint, JSON step result size limits, and the need for deterministic handler code before committing architectures to the feature

Discover more from C4: Container, Code, Cloud & Context

Subscribe to get the latest posts sent to your email.

Searching in