Data Lake Analytics

Voidwatch is a deep-space fleet monitoring system that ingests starship telemetry through Kinesis, delivers raw data to S3 via Firehose, enriches it with Lambda, catalogs it with Glue crawlers, and queries it with Athena SQL. The scenario exercises Simfra's full analytics stack, from streaming ingestion through schema inference to interactive queries.

Services

Service                 Role
Kinesis Data Streams    2-shard stream for telemetry ingestion, KMS encrypted
Kinesis Data Firehose   Delivers from Kinesis to S3 with JSONL formatting and time-based partitioning
Lambda                  Python 3.12 producer (generates telemetry) and enricher (adds fleet sector and threat level)
S3                      Four buckets: raw data, enriched data, Athena results, pipeline artifacts - all SSE-KMS
Glue                    Two crawlers (raw and enriched) for schema inference and Data Catalog registration
Athena                  Workgroup and named queries for SQL analytics against cataloged tables
EventBridge             Scheduled rule triggering producer Lambda on 1-minute intervals
CloudWatch Logs         Lambda execution logs
KMS                     Single customer-managed key for all encryption
IAM                     Seven least-privilege roles scoped per function
CodeCommit              Source repository for Lambda functions
CodeBuild               Packages Python functions
CodePipeline            Orchestrates deployment
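
To make the producer's role concrete, here is a minimal sketch of how a telemetry record might be generated and shaped for Kinesis. It is not the scenario's actual producer: the field names other than faction (ship_id, hull_integrity, timestamp), the faction values, and both function names are assumptions made for illustration.

```python
import json
import random
from datetime import datetime, timezone

# Hypothetical faction values; the scenario does not list them.
FACTIONS = ["terran", "martian", "belter"]


def make_telemetry(ship_id: str, rng: random.Random) -> dict:
    """Build one telemetry record with assumed, illustrative fields."""
    return {
        "ship_id": ship_id,
        "faction": rng.choice(FACTIONS),
        "hull_integrity": rng.randint(0, 100),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def to_put_records_entries(records: list[dict]) -> list[dict]:
    """Shape records as Kinesis PutRecords entries.

    Using ship_id as the partition key spreads ships across the shards
    while keeping each ship's records on a single shard.
    """
    return [
        {"Data": json.dumps(r).encode("utf-8"), "PartitionKey": r["ship_id"]}
        for r in records
    ]
```

With boto3, the resulting list could be passed as the Records argument of the Kinesis client's put_records call, alongside the StreamName of the telemetry stream.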

Architecture

EventBridge (1-min schedule) --> Producer Lambda
                                    |
                                    v
                                 Kinesis Stream (2 shards, KMS)
                                /                  \
                               v                    v
                         Firehose                Lambda Enricher (ESM)
                            |                       |
                            v                       v
                    S3 raw/ (JSONL,           S3 enriched/ (Hive-partitioned
                     time-partitioned)         by faction)
                            |                       |
                            v                       v
                     Glue Crawler (raw)      Glue Crawler (enriched)
                            |                       |
                            v                       v
                         Glue Data Catalog (tables + schemas)
                                    |
                                    v
                                 Athena SQL Queries
                                    |
                                    v
                                 S3 athena-results/

The scenario uses a dual-consumer architecture: Firehose and Lambda both consume from the same Kinesis stream simultaneously. Firehose handles raw archival with time-based S3 prefixes (raw/YYYY/MM/DD/HH/). The Lambda enricher adds derived fields (fleet_sector, threat_level) and writes Hive-partitioned output (enriched/faction={faction}/). Glue crawlers infer JSON schemas and detect partitions automatically.
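
The enricher path can be sketched as follows. This is an illustration under stated assumptions, not the scenario's real function: the enrichment rules, the input fields x and hull_integrity, and the use of the first sequence number as a file name are all invented here. Only the derived field names (fleet_sector, threat_level) and the enriched/faction={faction}/ layout come from the description above. The sketch returns the objects it would write rather than calling S3, so it runs without AWS access.

```python
import base64
import json
from collections import defaultdict


def enrich(record: dict) -> dict:
    """Add the two derived fields; both rules are placeholders."""
    enriched = dict(record)
    # Assumed rule: a made-up x coordinate decides the sector.
    enriched["fleet_sector"] = "core" if record.get("x", 0) < 100 else "rim"
    # Assumed rule: low hull integrity means a high threat level.
    enriched["threat_level"] = (
        "high" if record.get("hull_integrity", 100) < 30 else "low"
    )
    return enriched


def partition_key(faction: str, batch_id: str) -> str:
    """Build a Hive-style key that Glue can detect as a 'faction' partition."""
    return f"enriched/faction={faction}/{batch_id}.jsonl"


def handler(event: dict, context=None) -> dict:
    """Decode a Kinesis batch and return {s3_key: jsonl_body} per faction."""
    by_faction = defaultdict(list)
    for item in event["Records"]:
        # Lambda delivers Kinesis payloads base64-encoded.
        payload = base64.b64decode(item["kinesis"]["data"])
        record = enrich(json.loads(payload))
        by_faction[record.get("faction", "unknown")].append(json.dumps(record))
    if not event["Records"]:
        return {}
    batch_id = event["Records"][0]["kinesis"]["sequenceNumber"]
    return {
        partition_key(faction, batch_id): "\n".join(lines) + "\n"
        for faction, lines in by_faction.items()
    }
```

Grouping by faction before writing means one object per partition per batch, which keeps the object count low for the crawler and for Athena scans.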

What This Validates

  • Kinesis stream ingestion with KMS encryption and multi-shard parallelism
  • Firehose delivery from Kinesis source to S3 with AppendDelimiterToRecord processing
  • Dual consumption from the same Kinesis stream (Firehose + Lambda ESM)
  • Lambda event source mapping with Kinesis, processing batches of up to 50 records
  • Glue crawler schema inference and Hive partition detection on S3 data
  • Glue Data Catalog table registration with inferred column types
  • Athena SQL queries with aggregations (GROUP BY, COUNT, AVG) and filtering (WHERE)
  • EventBridge scheduled Lambda invocation on a recurring interval
  • Hive-partitioned S3 data layout recognized by Glue and queryable by Athena
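
As an illustration of the kind of Athena query this list describes, the sketch below builds a request that combines WHERE, GROUP BY, COUNT, and AVG. The database, table, workgroup, and bucket names passed in are hypothetical, as is the hull_integrity column; faction and threat_level are the fields named in the architecture description.

```python
# Aggregation over the enriched table; hull_integrity is an assumed column.
QUERY = """
SELECT faction,
       COUNT(*) AS ships_reporting,
       AVG(hull_integrity) AS avg_hull_integrity
FROM {table}
WHERE threat_level = 'high'
GROUP BY faction
ORDER BY ships_reporting DESC
"""


def athena_request(database: str, table: str, workgroup: str,
                   output_location: str) -> dict:
    """Build keyword arguments for Athena's StartQueryExecution call.

    The table name is interpolated directly, so it must come from
    trusted configuration, never from user input.
    """
    return {
        "QueryString": QUERY.format(table=table).strip(),
        "QueryExecutionContext": {"Database": database},
        "WorkGroup": workgroup,
        "ResultConfiguration": {"OutputLocation": output_location},
    }
```

With boto3, the returned dictionary could be unpacked into the Athena client's start_query_execution call; results would then land under the configured output location.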

Test Coverage

Tests include:

  • Smoke checks for all 12 resources
  • Integration tests for the full pipeline: producer invocation, Firehose flush and S3 delivery, Glue crawler runs with schema and partition verification, and Athena queries with result validation
  • Security tests for KMS encryption across all services and IAM role scoping
  • Performance tests with burst ingestion (5 parallel producer invocations) and concurrent Athena queries (3 parallel)
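
A burst test of the kind mentioned here, with several producer invocations in flight at once, could be driven by a small thread-pool helper. This is a sketch only: run_burst and fake_invoke are hypothetical names, and fake_invoke stands in for whatever call actually invokes the producer Lambda.

```python
from concurrent.futures import ThreadPoolExecutor


def run_burst(invoke, parallelism: int = 5) -> list:
    """Call invoke(i) `parallelism` times concurrently.

    Results come back in submission order because pool.map preserves it.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(invoke, range(parallelism)))


def fake_invoke(i: int) -> dict:
    """Stand-in for a real producer invocation; always reports success."""
    return {"invocation": i, "status": 200}


results = run_burst(fake_invoke)
assert all(r["status"] == 200 for r in results)
```

The same helper would cover the concurrent Athena case by passing parallelism=3 and a function that starts a query.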