Data Lake Analytics

"Voidwatch" is a deep-space fleet monitoring system that ingests starship telemetry through Kinesis, delivers raw data to S3 via Firehose, enriches it with Lambda, catalogs it with Glue crawlers, and queries it with Athena SQL. The scenario exercises Simfra's full analytics stack, from streaming ingestion through schema inference to interactive queries.

Services

Service	Role
Kinesis Data Streams	2-shard stream for telemetry ingestion, KMS encrypted
Kinesis Data Firehose	Delivers from Kinesis to S3 with JSONL formatting and time-based partitioning
Lambda	Python 3.12 producer (generates telemetry) and enricher (adds fleet sector and threat level)
S3	Four buckets: raw data, enriched data, Athena results, pipeline artifacts - all SSE-KMS
Glue	Two crawlers (raw and enriched) for schema inference and Data Catalog registration
Athena	Workgroup and named queries for SQL analytics against cataloged tables
EventBridge	Scheduled rule triggering producer Lambda on 1-minute intervals
CloudWatch Logs	Lambda execution logs
KMS	Single customer-managed key for all encryption
IAM	Seven least-privilege roles scoped per function
CodeCommit	Source repository for Lambda functions
CodeBuild	Packages Python functions
CodePipeline	Orchestrates deployment
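
As a rough illustration of the Firehose row above, the sketch below mimics the resulting S3 layout: a time-based key prefix and a newline-delimited (JSONL) body. Firehose produces this itself in the real pipeline; the helper names `raw_prefix` and `to_jsonl` and the sample fields are hypothetical and are not taken from the scenario's source.

```python
import json
from datetime import datetime, timezone


def raw_prefix(ts):
    """Return the hour-level prefix for a timestamp, evaluated in UTC."""
    return ts.astimezone(timezone.utc).strftime("raw/%Y/%m/%d/%H/")


def to_jsonl(records):
    """Serialize records one JSON object per line, each ending in a newline."""
    return "".join(json.dumps(r) + "\n" for r in records)


if __name__ == "__main__":
    # Sample telemetry fields are assumptions made for this example.
    sample = [{"ship_id": "s1", "faction": "terran"}, {"ship_id": "s2", "faction": "vex"}]
    print(raw_prefix(datetime(2024, 3, 5, 7, 30, tzinfo=timezone.utc)))
    print(to_jsonl(sample), end="")
```

The trailing newline on every record is what lets a crawler or Athena treat each line as a separate row.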

Architecture

EventBridge (1-min schedule) --> Producer Lambda
                                    |
                                    v
                                 Kinesis Stream (2 shards, KMS)
                                /                  \
                               v                    v
                         Firehose                Lambda Enricher (ESM)
                            |                       |
                            v                       v
                    S3 raw/ (JSONL,           S3 enriched/ (Hive-partitioned
                     time-partitioned)         by faction)
                            |                       |
                            v                       v
                     Glue Crawler (raw)      Glue Crawler (enriched)
                            |                       |
                            v                       v
                         Glue Data Catalog (tables + schemas)
                                    |
                                    v
                                 Athena SQL Queries
                                    |
                                    v
                                 S3 athena-results/

The scenario uses a dual-consumer architecture: Firehose and a Lambda event source mapping (ESM) both consume from the same Kinesis stream at the same time. Firehose archives raw records under time-based S3 prefixes (raw/YYYY/MM/DD/HH/). The Lambda enricher adds derived fields (fleet_sector, threat_level) and writes Hive-partitioned output (enriched/faction={faction}/). Glue crawlers then infer the JSON schemas and detect the partitions automatically.
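
The enricher path described above can be sketched roughly as follows. This is not the scenario's actual Lambda code: the input fields `hull_integrity` and `distance_au`, the thresholds, and the `batch_id` naming are all assumptions chosen for illustration. Only the output field names (`fleet_sector`, `threat_level`) and the `enriched/faction={faction}/` layout come from the description. The S3 write is left out so the sketch runs without AWS credentials.

```python
import base64
import json
from collections import defaultdict


def decode_kinesis_records(event):
    """Decode the base64 payloads of a Kinesis event source mapping batch."""
    return [
        json.loads(base64.b64decode(rec["kinesis"]["data"]))
        for rec in event.get("Records", [])
    ]


def enrich(record):
    """Return a copy of the record with the two derived fields added."""
    enriched = dict(record)
    # Hypothetical rule: bucket an assumed distance field into a sector.
    enriched["fleet_sector"] = "inner" if record.get("distance_au", 0) < 50 else "outer"
    # Hypothetical rule: map an assumed hull percentage to a threat level.
    hull = record.get("hull_integrity", 100)
    enriched["threat_level"] = "high" if hull < 30 else "medium" if hull < 70 else "low"
    return enriched


def build_objects(records, batch_id):
    """Group enriched records by faction and return {s3_key: jsonl_body}."""
    by_faction = defaultdict(list)
    for rec in records:
        by_faction[rec.get("faction", "unknown")].append(json.dumps(enrich(rec)))
    return {
        f"enriched/faction={faction}/{batch_id}.jsonl": "\n".join(lines) + "\n"
        for faction, lines in by_faction.items()
    }
```

A handler built on this would call `decode_kinesis_records(event)`, pass the result to `build_objects`, and upload each key/body pair to the enriched bucket.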

What This Validates

Kinesis stream ingestion with KMS encryption and multi-shard parallelism
Firehose delivery from Kinesis source to S3 with AppendDelimiterToRecord processing
Dual consumption from the same Kinesis stream (Firehose + Lambda ESM)
Lambda event source mapping with Kinesis, processing batches of up to 50 records
Glue crawler schema inference and Hive partition detection on S3 data
Glue Data Catalog table registration with inferred column types
Athena SQL queries with aggregations (GROUP BY, COUNT, AVG) and filtering (WHERE)
EventBridge scheduled Lambda invocation on a recurring interval
Hive-partitioned S3 data layout recognized by Glue and queryable by Athena
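
To make the Hive partition items above concrete, the following sketch extracts `name=value` path segments from an S3 key, which is the convention that lets a crawler register `faction` as a partition column. The function name `parse_hive_partitions` and the example keys are hypothetical, and this only approximates what a crawler does internally.

```python
def parse_hive_partitions(key):
    """Return {partition_name: value} for every name=value directory in a key."""
    partitions = {}
    # The last segment is the object name, so only directory segments count.
    for segment in key.split("/")[:-1]:
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions


if __name__ == "__main__":
    print(parse_hive_partitions("enriched/faction=terran/batch-0001.jsonl"))
    print(parse_hive_partitions("raw/2024/03/05/07/part-0000"))
```

The raw prefix contains no `name=value` segments, so the sketch returns an empty mapping for it. For layouts like that, a Glue crawler typically falls back to generic partition names such as `partition_0`.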

Test Coverage

Tests cover four areas:

Smoke checks for all 12 resources
Integration tests for the full pipeline: producer invocation, Firehose flush and S3 delivery, Glue crawler runs with schema and partition verification, and Athena queries with result validation
Security tests for KMS encryption across all services and IAM role scoping
Performance tests with burst ingestion (5 parallel producer invocations) and concurrent Athena queries (3 parallel)
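
The burst-ingestion test could be structured along these lines. This is a minimal sketch, not the scenario's test code: `burst_invoke` and `all_succeeded` are hypothetical helpers, and the `invoke` callable is injected so the example runs without AWS access. In a real test it might wrap a boto3 Lambda client's `invoke` call, whose response carries a `StatusCode` and, on function failure, a `FunctionError` key.

```python
from concurrent.futures import ThreadPoolExecutor


def burst_invoke(invoke, count=5):
    """Call invoke(i) for i in range(count) in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=count) as pool:
        return list(pool.map(invoke, range(count)))


def all_succeeded(responses):
    """True if every response has StatusCode 200 and no FunctionError."""
    return all(
        r.get("StatusCode") == 200 and "FunctionError" not in r
        for r in responses
    )


if __name__ == "__main__":
    # Stub standing in for a real producer invocation (assumption).
    def fake_invoke(i):
        return {"StatusCode": 200, "index": i}

    responses = burst_invoke(fake_invoke)
    print(len(responses), all_succeeded(responses))
```

The same helper could cover the concurrent Athena case by passing `count=3` and a callable that starts a query and waits for its result.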