Multi-Region ARC Failover

A multi-region payment processing gateway is deployed across us-east-1 (primary) and us-west-2 (standby). Customer transactions flow through an ALB-fronted Lambda function that validates cards against DynamoDB and records charges in an RDS PostgreSQL instance. Route 53 weighted routing directs all traffic to the primary region. When a single AZ in the primary region is impaired, ARC Zonal Shift removes the degraded AZ from the ALB's target pool. When a full regional failure occurs, an ARC Region Switch plan executes a coordinated failover: updating Route 53 health checks, shifting DNS to the standby region, and activating the secondary stack.

Services

Service            Role
-----------------  ------------------------------------------------------------
EC2                VPCs, subnets (3 AZs x 2 regions), IGW, route tables, security groups
ELBv2              ALB with Lambda target group per region, HTTPS listener, target health
Lambda             PayGate function per region, VPC-attached, environment variables for region/DB endpoints
DynamoDB           Cards table per region for card token validation, KMS encrypted
RDS                PostgreSQL ledger instance per region for charge records, KMS encrypted
Route 53           Hosted zone, weighted A records (primary 100 / secondary 0), health checks
IAM                Lambda execution role, ARC execution role, ARC alarm role
KMS                CMK per region for DynamoDB SSE and RDS encryption
CloudWatch         ALB HealthyHostCount alarm, Lambda error metrics
ARC Zonal Shift    Managed resource registration, zonal shift start/cancel, practice run config
ARC Region Switch  Failover plan with Route53 health check steps, plan execution, cancel
ACM                TLS certificate for ALB HTTPS listeners

Architecture

                Route 53 (arc-payments.simfra.dev)
             weighted: 100 -> primary, 0 -> secondary
                + health checks per region
                        |
        +---------------+------------------+
        |                                  |
  us-east-1 (primary)             us-west-2 (standby)
        |                                  |
   +----+----+                        +----+----+
   |   ALB   |  <- ARC Zonal Shift    |   ALB   |
   +----+----+                        +----+----+
    1a  1b  1c                         2a  2b  2c
        |                                  |
    Lambda (paygate)                  Lambda (paygate)
        |                                  |
  DynamoDB (cards)                   DynamoDB (cards)
  RDS (ledger)                       RDS (ledger)

          +----------------------------+
          |     ARC Region Switch      |
          |  Plan: "payments-failover" |
          |  Step 1: Route53 HC flip   |
          |  Step 2: DNS weight swap   |
          +----------------------------+
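
The effect of the 100/0 weights and of the "DNS weight swap" step can be illustrated with a small model. This is a simplified sketch, not Route 53 itself: the function names are hypothetical, and the fallback to healthy zero-weight records when no nonzero-weight record is healthy is an assumption of the model. It also returns None when nothing is healthy, which a real resolver would not do.

```python
import random

def eligible_records(weights, healthy):
    """Return the records that may receive traffic in this simplified model."""
    # First consider only healthy records with a nonzero weight.
    nonzero = {r: w for r, w in weights.items() if w > 0 and healthy.get(r, False)}
    if nonzero:
        return nonzero
    # Assumption: otherwise healthy zero-weight records share traffic equally.
    return {r: 1 for r, w in weights.items() if w == 0 and healthy.get(r, False)}

def pick_region(weights, healthy, rng=random):
    """Pick a region with probability proportional to its weight."""
    pool = eligible_records(weights, healthy)
    if not pool:
        return None  # model simplification: nothing healthy, nothing returned
    regions = list(pool)
    return rng.choices(regions, weights=[pool[r] for r in regions], k=1)[0]

def swap_weights(weights, primary, secondary):
    """Return a copy of the weights with the two regions' values exchanged."""
    swapped = dict(weights)
    swapped[primary], swapped[secondary] = weights[secondary], weights[primary]
    return swapped
```

With weights of 100 and 0 the model always answers with the primary region; after `swap_weights` it always answers with the standby, which is the observable outcome the failover tests look for.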

Application

PayGate is a simplified PCI-style payment gateway. Merchants call POST /v1/charges with a card token and amount. The Lambda handler validates the card token in DynamoDB, inserts a charge record into RDS with status pending, simulates authorization, and updates the charge to captured. The region field in responses proves which region served the request - critical for verifying that failover moved traffic.
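
The charge flow described above can be sketched as follows. This is a minimal illustration, not the actual PayGate handler: plain dictionaries stand in for the DynamoDB cards table and the RDS ledger, and the field names (`card_token`, `amount_cents`) and the always-approving authorization are assumptions.

```python
import uuid

class InvalidCardToken(Exception):
    """Raised when the card token is not present in the cards store."""

def create_charge(cards, ledger, card_token, amount_cents, region):
    """Validate the token, record a pending charge, then capture it."""
    # Step 1: validate the card token (a DynamoDB lookup in the real system).
    if card_token not in cards:
        raise InvalidCardToken(card_token)
    # Step 2: insert the charge record with status "pending".
    charge_id = str(uuid.uuid4())
    ledger[charge_id] = {
        "id": charge_id,
        "card_token": card_token,
        "amount_cents": amount_cents,
        "status": "pending",
        "region": region,  # shows which region served the request
    }
    # Step 3: simulate authorization (this sketch always approves).
    authorized = True
    # Step 4: update the charge to "captured".
    if authorized:
        ledger[charge_id]["status"] = "captured"
    return dict(ledger[charge_id])

def get_charge(ledger, charge_id):
    """Return a copy of the stored charge, or None if the ID is unknown."""
    record = ledger.get(charge_id)
    return dict(record) if record is not None else None
```

Because every stored record carries `region`, a round-trip retrieval after failover shows both where a charge was created and which region is now answering.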

Endpoints:

Method  Path              Description
------  ----------------  ------------------------------------------------------
GET     /health           Returns {"status":"ok","region":"<region>","az":"<az>"}
POST    /v1/charges       Create a charge
GET     /v1/charges/{id}  Retrieve a charge by ID
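
One way to use the /health response when checking failover is to compare the reported region before and after the switch. The helper names below are hypothetical, and the sample bodies only follow the response shape shown above; the AZ values are illustrative.

```python
import json

def serving_region(health_body):
    """Extract the region from a /health response body."""
    payload = json.loads(health_body)
    if payload.get("status") != "ok":
        raise ValueError("unhealthy response: %r" % payload)
    return payload["region"]

def failover_moved_traffic(before_body, after_body):
    """True when the region reported after failover differs from the one before."""
    return serving_region(before_body) != serving_region(after_body)

# Illustrative bodies following the documented response shape.
before = '{"status":"ok","region":"us-east-1","az":"us-east-1a"}'
after = '{"status":"ok","region":"us-west-2","az":"us-west-2b"}'
moved = failover_moved_traffic(before, after)
```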

What This Validates

  • ARC Zonal Shift removing a degraded AZ from an ALB target pool
  • ARC Region Switch executing a multi-step failover plan
  • Route 53 weighted routing between primary and standby regions
  • Practice run configuration for ARC managed resources
  • Concurrent execution prevention (only one active plan execution per plan)
  • Plan lifecycle: tag, update version, delete with active execution guard
  • Two-region deployment with aliased Terraform providers (aws.primary, aws.secondary)
  • KMS encryption across DynamoDB and RDS in both regions
  • CloudWatch alarms driving ARC awareness
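
The concurrent execution and delete guards in the list above can be pictured with a small in-memory model. This is an assumption-laden sketch of the expected behavior, not the ARC Region Switch API: the class, method names, and the rule that an update increments the version by one are all hypothetical.

```python
class ExecutionConflict(Exception):
    """Raised when an operation is blocked by an active plan execution."""

class FailoverPlan:
    def __init__(self, name):
        self.name = name
        self.version = 1
        self.tags = {}
        self.active_execution = None
        self.deleted = False

    def start_execution(self, execution_id):
        # Only one active execution per plan is allowed.
        if self.active_execution is not None:
            raise ExecutionConflict("execution already active: %s" % self.active_execution)
        self.active_execution = execution_id

    def finish_or_cancel(self):
        # Completion and mid-flight cancellation both release the plan.
        self.active_execution = None

    def tag(self, key, value):
        self.tags[key] = value

    def update(self):
        # Assumption: each update produces a new plan version.
        self.version += 1

    def delete(self):
        # Deleting is refused while an execution is still active.
        if self.active_execution is not None:
            raise ExecutionConflict("cannot delete plan with active execution")
        self.deleted = True
```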

Test Coverage

Tests run in seven phases:

  • Infrastructure provisioning: Terraform apply, DynamoDB seeding
  • Smoke checks: both ALBs active, Route 53 records present, health checks healthy
  • Application integration tests: charges from primary region, invalid card token handling, round-trip charge retrieval
  • Zonal shift tests: shift start/cancel/update, duplicate shift rejection, practice run config lifecycle
  • Region switch failover tests: full plan execution, mid-flight cancellation, concurrent execution prevention, plan tagging and versioning
  • Security validation: least-privilege IAM roles, KMS encryption, security group scope
  • Observability tests: ALB HealthyHostCount and Lambda Invocations metrics
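
The zonal shift checks (start, cancel, duplicate rejection) amount to a small state machine. The sketch below is a hypothetical model of that behavior rather than the ARC Zonal Shift API; it assumes at most one active shift per resource, and the zone names are illustrative.

```python
class ShiftConflict(Exception):
    """Raised when a shift is requested while another one is active."""

class ZonalShiftModel:
    def __init__(self, zones):
        self.zones = list(zones)
        self.shifted = None  # zone currently shifted away from, if any

    def start_shift(self, away_from):
        if away_from not in self.zones:
            raise ValueError("unknown zone: %s" % away_from)
        # Assumption: a second shift on the same resource is rejected.
        if self.shifted is not None:
            raise ShiftConflict("shift already active for %s" % self.shifted)
        self.shifted = away_from

    def cancel_shift(self):
        # Cancelling restores the full set of zones.
        self.shifted = None

    def serving_zones(self):
        """Zones still receiving traffic, in their original order."""
        return [z for z in self.zones if z != self.shifted]
```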