Multi-Region ARC Failover
This is a multi-region payment processing gateway deployed across us-east-1 (primary) and us-west-2 (standby). Customer transactions flow through an ALB-fronted Lambda function that validates cards against DynamoDB and records charges in an RDS PostgreSQL instance. Route 53 weighted routing directs all traffic to the primary region. When the primary region experiences an AZ impairment, ARC Zonal Shift removes the degraded AZ from the ALB's target pool. When a full regional failure occurs, an ARC Region Switch plan executes a coordinated failover: it updates Route 53 health checks, shifts DNS to the standby region, and activates the secondary stack.
Services
| Service | Role |
|---|---|
| EC2 | VPCs, subnets (3 AZs x 2 regions), IGW, route tables, security groups |
| ELBv2 | ALB with Lambda target group per region, HTTPS listener, target health |
| Lambda | PayGate function per region, VPC-attached, environment variables for region/DB endpoints |
| DynamoDB | Cards table per region for card token validation, KMS encrypted |
| RDS | PostgreSQL ledger instance per region for charge records, KMS encrypted |
| Route 53 | Hosted zone, weighted A records (primary 100 / secondary 0), health checks |
| IAM | Lambda execution role, ARC execution role, ARC alarm role |
| KMS | CMK per region for DynamoDB SSE and RDS encryption |
| CloudWatch | ALB HealthyHostCount alarm, Lambda error metrics |
| ARC Zonal Shift | Managed resource registration, zonal shift start/cancel, practice run config |
| ARC Region Switch | Failover plan with Route 53 health check steps, plan execution, cancel |
| ACM | TLS certificate for ALB HTTPS listeners |
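As a rough illustration of the zonal shift start/cancel flow listed in the table, the sketch below wraps the two boto3 `arc-zonal-shift` client calls involved. The helper names, the default expiry, and the comment text are illustrative assumptions and not part of this project. The client is passed in as a parameter so the helpers can be exercised with a stub instead of real AWS credentials.

```python
from typing import Any


def start_shift(client: Any, resource_arn: str, away_from: str,
                expires_in: str = "30m", comment: str = "AZ impairment") -> str:
    """Start a zonal shift and return its ID.

    `client` is expected to behave like boto3.client("arc-zonal-shift").
    """
    resp = client.start_zonal_shift(
        resourceIdentifier=resource_arn,  # ARN of the managed resource (here, the ALB)
        awayFrom=away_from,               # AZ that traffic should move away from
        expiresIn=expires_in,             # duration string such as "30m" or "2h"
        comment=comment,
    )
    return resp["zonalShiftId"]


def cancel_shift(client: Any, zonal_shift_id: str) -> str:
    """Cancel an active zonal shift and return the status reported by the API."""
    resp = client.cancel_zonal_shift(zonalShiftId=zonal_shift_id)
    return resp["status"]
```

With real credentials the call would look something like `start_shift(boto3.client("arc-zonal-shift", region_name="us-east-1"), alb_arn, "use1-az1")`, where `alb_arn` and the AZ value are placeholders.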
Architecture
                   Route 53 (arc-payments.simfra.dev)
                weighted: 100 -> primary, 0 -> secondary
                       + health checks per region
                                    |
                    +---------------+------------------+
                    |                                  |
           us-east-1 (primary)                us-west-2 (standby)
                    |                                  |
               +----+----+                        +----+----+
               |   ALB   | <- ARC Zonal Shift     |   ALB   |
               +----+----+                        +----+----+
                1a  1b  1c                         2a  2b  2c
                    |                                  |
            Lambda (paygate)                   Lambda (paygate)
                    |                                  |
            DynamoDB (cards)                   DynamoDB (cards)
              RDS (ledger)                       RDS (ledger)
                     +----------------------------+
                     | ARC Region Switch          |
                     | Plan: "payments-failover"  |
                     | Step 1: Route53 HC flip    |
                     | Step 2: DNS weight swap    |
                     +----------------------------+
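To make the "DNS weight swap" step more concrete, here is a minimal sketch of what such a swap amounts to at the Route 53 API level. It only builds the `ChangeBatch` payload. It assumes the weighted A records are alias records pointing at each region's ALB and that the set identifiers are `primary` and `secondary`; neither detail is confirmed by this document. In the scenario itself the Region Switch plan drives this step, so the code is explanatory rather than something the project necessarily contains.

```python
from typing import Any, Dict


def weighted_alias_change(record_name: str, set_id: str, weight: int,
                          alb_dns_name: str, alb_zone_id: str) -> Dict[str, Any]:
    """Build one UPSERT for a weighted alias A record that targets an ALB."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": set_id,   # assumed identifier, distinguishes the two records
            "Weight": weight,          # Route 53 accepts 0-255
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,   # the ALB's canonical hosted zone ID
                "DNSName": alb_dns_name,
                "EvaluateTargetHealth": True,
            },
        },
    }


def weight_swap_batch(record_name: str, primary: Dict[str, str],
                      secondary: Dict[str, str]) -> Dict[str, Any]:
    """ChangeBatch that moves all weight from the primary to the standby region."""
    return {
        "Comment": "fail over to standby region",
        "Changes": [
            weighted_alias_change(record_name, "primary", 0, **primary),
            weighted_alias_change(record_name, "secondary", 100, **secondary),
        ],
    }
```

The resulting dictionary has the shape expected by the `ChangeBatch` argument of boto3's `change_resource_record_sets`; `primary` and `secondary` are dictionaries with `alb_dns_name` and `alb_zone_id` keys.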
Application
PayGate is a simplified PCI-style payment gateway. Merchants call POST /v1/charges with a card token and amount. The Lambda handler validates the card token in DynamoDB, inserts a charge record into RDS with status pending, simulates authorization, and updates the charge to captured. The region field in responses shows which region served the request, which is critical for verifying that failover moved traffic.
Endpoints:
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Returns `{"status":"ok","region":"<region>","az":"<az>"}` |
| POST | `/v1/charges` | Create a charge |
| GET | `/v1/charges/{id}` | Retrieve a charge by ID |
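The three endpoints can be sketched as a single Lambda handler behind an ALB. This is a simplified stand-in, not the project's actual code: in-memory dictionaries replace DynamoDB and RDS, and the request field names (`card_token`, `amount`), the sample token, the `PAYGATE_AZ` environment variable, and the chosen status codes are all assumptions made for illustration. The event and response shapes follow the ALB-to-Lambda integration format.

```python
import json
import os
import uuid

REGION = os.environ.get("AWS_REGION", "us-east-1")  # set by the Lambda runtime
AZ = os.environ.get("PAYGATE_AZ", "unknown")         # hypothetical variable

CARDS = {"tok_test_visa": {"status": "active"}}  # stand-in for the DynamoDB cards table
CHARGES = {}                                     # stand-in for the RDS ledger


def _response(status, body):
    """Shape a response the way an ALB expects from a Lambda target."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "isBase64Encoded": False,
        "body": json.dumps(body),
    }


def handler(event, context=None):
    method, path = event.get("httpMethod"), event.get("path", "")

    if method == "GET" and path == "/health":
        return _response(200, {"status": "ok", "region": REGION, "az": AZ})

    if method == "POST" and path == "/v1/charges":
        try:
            req = json.loads(event.get("body") or "{}")
        except json.JSONDecodeError:
            return _response(400, {"error": "invalid JSON"})
        token = req.get("card_token") if isinstance(req, dict) else None
        if not isinstance(token, str) or token not in CARDS:
            return _response(400, {"error": "invalid card token"})
        charge_id = str(uuid.uuid4())
        # Insert as pending, then capture once the simulated authorization passes.
        CHARGES[charge_id] = {"id": charge_id, "amount": req.get("amount"),
                              "status": "pending", "region": REGION}
        CHARGES[charge_id]["status"] = "captured"  # authorization always succeeds here
        return _response(201, CHARGES[charge_id])

    if method == "GET" and path.startswith("/v1/charges/"):
        charge = CHARGES.get(path.rsplit("/", 1)[-1])
        if charge is None:
            return _response(404, {"error": "charge not found"})
        return _response(200, charge)

    return _response(404, {"error": "not found"})
```

Because every charge carries the `region` value of the function that created it, a client can compare that field before and after a failover to see which region handled each request.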
What This Validates
- ARC Zonal Shift removing a degraded AZ from an ALB target pool
- ARC Region Switch executing a multi-step failover plan
- Route 53 weighted routing between primary and standby regions
- Practice run configuration for ARC managed resources
- Concurrent execution prevention (only one active plan execution per plan)
- Plan lifecycle: tag, update version, delete with active execution guard
- Two-region deployment with aliased Terraform providers (`aws.primary`, `aws.secondary`)
- KMS encryption across DynamoDB and RDS in both regions
- CloudWatch alarms driving ARC awareness
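Two of the behaviors in this list, concurrent execution prevention and the delete guard, can be pictured with a small in-memory model. The class below is a toy and does not use or mirror the ARC Region Switch API; all names in it (`PlanRegistry`, `ExecutionConflict`, the `exec-N` IDs) are invented for illustration. It only shows the rule being validated: a plan has at most one active execution, and it cannot be deleted while that execution is active.

```python
import threading


class ExecutionConflict(Exception):
    """Raised when an operation is blocked by an active plan execution."""


class PlanRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._plans = set()
        self._active = {}   # plan name -> active execution ID
        self._counter = 0

    def create_plan(self, name):
        with self._lock:
            self._plans.add(name)

    def start_execution(self, plan):
        with self._lock:
            if plan not in self._plans:
                raise KeyError(plan)
            if plan in self._active:
                raise ExecutionConflict(f"{plan} already has an active execution")
            self._counter += 1
            self._active[plan] = f"exec-{self._counter}"
            return self._active[plan]

    def finish_execution(self, plan, execution_id):
        """Mark an execution as completed or cancelled, releasing the plan."""
        with self._lock:
            if self._active.get(plan) != execution_id:
                raise KeyError(execution_id)
            del self._active[plan]

    def delete_plan(self, plan):
        with self._lock:
            if plan in self._active:
                raise ExecutionConflict(f"{plan} has an active execution")
            self._plans.remove(plan)
```

The lock makes the check-then-set in `start_execution` atomic, so two threads racing to start the same plan cannot both succeed.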
Test Coverage
Tests run in seven phases:
1. Infrastructure provisioning: Terraform apply, DynamoDB seeding.
2. Smoke checks: both ALBs active, Route 53 records present, health checks healthy.
3. Application integration tests: charges from the primary region, invalid card token handling, round-trip charge retrieval.
4. Zonal shift tests: shift start/cancel/update, duplicate shift rejection, practice run config lifecycle.
5. Region switch failover tests: full plan execution, mid-flight cancellation, concurrent execution prevention, plan tagging and versioning.
6. Security validation: least-privilege IAM roles, KMS encryption, security group scope.
7. Observability tests: ALB HealthyHostCount and Lambda Invocations metrics.
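A failover test ultimately has to confirm that traffic really moved. One possible way to check this, sketched below, is to poll `/health` until the `region` field reports the standby region. The function name, retry counts, and the injected `fetch_health` and `sleep` callables are assumptions for illustration; the project's test suite may verify failover differently.

```python
import json
import time
from typing import Callable


def wait_for_region(fetch_health: Callable[[], str], expected_region: str,
                    attempts: int = 30, delay_s: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll /health until it reports `expected_region`.

    `fetch_health` returns the raw response body as a string. Returns the
    number of attempts used, or raises TimeoutError if the region never matches.
    """
    for attempt in range(1, attempts + 1):
        try:
            body = json.loads(fetch_health())
        except (OSError, ValueError):
            body = {}  # network errors and bad JSON count as "not yet"
        if isinstance(body, dict) and body.get("region") == expected_region:
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    raise TimeoutError(f"/health never reported region {expected_region}")
```

In a real run `fetch_health` could wrap `urllib.request.urlopen` against the Route 53 name's `/health` path. Resolver caching and record TTLs can delay the observed switch, so the attempt count and delay would need tuning.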