
System Design: Data Warehouse

Learn how to design a scalable data warehouse that supports petabyte-scale analytics, OLAP workloads, and multi-team BI reporting. Covers columnar storage, partitioning, and query optimization strategies used in systems such as Snowflake, BigQuery, and Redshift.

14 min read · Updated Jan 15, 2025
Tags: system-design, data-warehouse, olap, columnar-storage, data-engineering

Requirements

Functional Requirements:

  • Ingest structured data from OLTP databases, event streams, and flat files
  • Support complex analytical queries (aggregations, window functions, joins across billions of rows)
  • Organize data into schemas: raw, staging, and presentation layers
  • Provide role-based access control at table and column level
  • Support time-travel queries and snapshot isolation
  • Expose a SQL interface compatible with BI tools (Tableau, Looker, Power BI)

Non-Functional Requirements:

  • 90th-percentile query latency under 10 seconds for analytical queries on 100 TB datasets
  • Support 500 concurrent query sessions without degradation
  • 99.9% availability for read workloads; ingestion can tolerate brief outages
  • Data freshness SLA of 15 minutes for near-real-time tables
  • Comply with SOC 2 Type II and GDPR data residency requirements

Scale Estimation

A mid-size enterprise warehouse ingests ~5 TB/day across 300 source tables. At roughly 3x compression with columnar encoding, storage grows ~1.7 TB/day, so after 3 years the corpus reaches ~1.8 PB. Peak query concurrency hits 500 sessions during business hours. Each query scans ~50 GB of compressed data on average, so 500 concurrent queries touch ~25 TB; if each is expected to finish within about a minute, the I/O subsystem must sustain ~25 TB/min (~420 GB/s) of aggregate read throughput during peak.
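As a sanity check, these figures can be reproduced with a few lines of back-of-envelope arithmetic (Python). The ~1-minute average query duration is an assumption introduced here, not part of the stated requirements:

    # Back-of-envelope sizing for the warehouse described above.
    daily_raw_tb = 5.0
    compression = 3.0
    daily_stored_tb = daily_raw_tb / compression          # ~1.7 TB/day on disk
    corpus_pb = daily_stored_tb * 365 * 3 / 1000          # ~1.8 PB after 3 years

    concurrent_queries = 500
    scan_per_query_gb = 50
    inflight_tb = concurrent_queries * scan_per_query_gb / 1000   # ~25 TB per wave
    avg_query_minutes = 1.0                                # assumption
    peak_read_tb_per_min = inflight_tb / avg_query_minutes        # ~25 TB/min

    print(f"{daily_stored_tb:.1f} TB/day, {corpus_pb:.1f} PB after 3 years, "
          f"{peak_read_tb_per_min:.0f} TB/min peak read throughput")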

High-Level Architecture

The architecture follows a medallion layering pattern: data lands in a raw zone (Bronze), is cleansed and standardized into a conformed layer (Silver), and finally aggregated into business-facing star schemas (Gold). An orchestration layer (Apache Airflow or dbt Cloud) schedules transformations between tiers on a 15-minute cadence for hot tables and nightly for cold ones.
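A minimal Airflow sketch of the 15-minute cadence for hot tables might look like the following. The dbt commands, tags, and task names are illustrative placeholders, not a prescribed project layout:

    # Hypothetical Airflow DAG: refresh hot Silver/Gold models every 15 minutes.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="warehouse_hot_tables",
        schedule_interval="*/15 * * * *",    # 15-minute cadence for hot tables
        start_date=datetime(2025, 1, 1),
        catchup=False,
    ) as dag:
        silver = BashOperator(
            task_id="refresh_silver",
            bash_command="dbt run --select tag:silver,tag:hot",
        )
        gold = BashOperator(
            task_id="refresh_gold",
            bash_command="dbt run --select tag:gold,tag:hot",
        )
        silver >> gold   # Gold aggregates depend on conformed Silver tables

Cold tables would live in a separate DAG on a nightly schedule, so a slow historical rebuild never blocks the 15-minute loop.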

The storage engine decouples compute from storage. Object storage (S3 or GCS) holds Parquet files partitioned by date and tenant. A catalog service (Apache Iceberg or Delta Lake) maintains metadata—partition manifests, schema evolution history, and snapshot pointers—so that multiple compute clusters can read consistent snapshots concurrently without locking.

A query router sits in front of multiple MPP compute clusters. Interactive dashboards are routed to a reserved cluster with auto-scaling; ad hoc analyst queries hit a separate elastic cluster; ETL jobs run on dedicated worker pools to prevent resource contention. Query results are cached in a distributed result cache keyed by query hash and data snapshot ID, allowing repeated dashboard refreshes to return in milliseconds.
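One way to key the result cache, sketched below, is to hash the normalized SQL text together with the snapshot ID of every table the query reads, so a new commit to any input table naturally produces a new key; the function and its arguments are illustrative, not a specific product's API:

    import hashlib

    def result_cache_key(sql: str, table_snapshots: dict[str, int]) -> str:
        """Key = normalized query text + snapshot ID of every input table.

        Any commit to an input table changes its snapshot ID, so stale
        entries are never served; they simply stop being referenced.
        """
        normalized = " ".join(sql.lower().split())            # crude normalization
        snapshot_part = ",".join(
            f"{table}@{snap}" for table, snap in sorted(table_snapshots.items())
        )
        return hashlib.sha256(f"{normalized}|{snapshot_part}".encode()).hexdigest()

    # The same dashboard query against the same snapshots reuses the cached result.
    key = result_cache_key(
        "SELECT region, SUM(amount) FROM fact_orders GROUP BY region",
        {"fact_orders": 8123, "dim_customer": 442},
    )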

Core Components

Columnar Storage Engine

Data is stored in Parquet or ORC format with dictionary encoding, run-length encoding, and Zstandard compression, achieving 5–10x compression over raw CSV. Files are divided into row groups of roughly 128 MB, each holding one column chunk per column; min/max statistics per row group let the query planner skip groups that cannot satisfy a predicate, cutting I/O by 80–95% for selective queries. Micro-partitioning (Snowflake's approach) further clusters data on commonly filtered columns such as user_id and event_date so that range predicates prune effectively.
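Row-group pruning can be observed directly with PyArrow, which exposes per-row-group min/max statistics. A sketch, assuming a local Parquet file with a flat schema and a string event_date column:

    # Sketch: skip row groups whose min/max statistics exclude the predicate.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("fact_orders.parquet")      # hypothetical local file
    target = "2025-01-15"
    date_idx = pf.schema_arrow.get_field_index("event_date")

    keep = []
    for rg in range(pf.metadata.num_row_groups):
        stats = pf.metadata.row_group(rg).column(date_idx).statistics
        # If statistics are missing we must read the group; otherwise prune by range.
        if stats is None or not stats.has_min_max or (stats.min <= target <= stats.max):
            keep.append(rg)

    table = pf.read_row_groups(keep)   # only the surviving groups are scanned
    print(f"scanned {len(keep)} of {pf.metadata.num_row_groups} row groups")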

Metadata Catalog & Table Format

Apache Iceberg maintains a tree of snapshot files: each commit writes a new manifest list pointing to manifest files listing data file paths, record counts, and column-level statistics. This enables atomic multi-file commits, full schema evolution (add/drop/rename columns without rewriting data), and partition evolution. Time-travel queries reference older snapshot IDs. The catalog (Hive Metastore or a REST catalog) maps table names to the current snapshot pointer.
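The snapshot/manifest hierarchy can be pictured as a small tree of immutable metadata objects. The toy structures below illustrate the idea only; they are not Iceberg's actual file formats or APIs:

    # Toy model of an Iceberg-style metadata tree (illustration only).
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class DataFile:
        path: str
        record_count: int
        column_stats: dict          # e.g. {"event_date": ("2025-01-01", "2025-01-15")}

    @dataclass(frozen=True)
    class Manifest:                 # lists a batch of data files plus their stats
        data_files: tuple

    @dataclass(frozen=True)
    class Snapshot:                 # one atomic commit = one manifest list
        snapshot_id: int
        manifests: tuple

    @dataclass
    class Table:
        snapshots: list = field(default_factory=list)

        def commit(self, manifests) -> int:
            """Atomically publish a new snapshot; older snapshots stay readable."""
            snap = Snapshot(snapshot_id=len(self.snapshots) + 1, manifests=tuple(manifests))
            self.snapshots.append(snap)
            return snap.snapshot_id

        def scan(self, snapshot_id=None):
            """Time travel: read any retained snapshot, newest by default."""
            snap = self.snapshots[-1] if snapshot_id is None else self.snapshots[snapshot_id - 1]
            return [f for m in snap.manifests for f in m.data_files]

Because every commit only appends new metadata and data files, readers pinned to an older snapshot ID see a consistent view without any locking.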

Query Execution Engine

A distributed MPP engine (Trino, Spark SQL, or a native warehouse engine) receives SQL, generates a distributed query plan, and dispatches tasks to worker nodes co-located with data. Vectorized execution processes 1024-row batches using SIMD instructions, achieving 10–50x speedup over row-at-a-time execution. The planner pushes predicates and projections to the storage layer (predicate pushdown) and uses cost-based optimization to choose join strategies (broadcast vs. shuffle hash join).
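The gap between row-at-a-time and vectorized execution is easy to demonstrate with NumPy. The sketch below filters and aggregates 1024-row batches, the same shape of work a vectorized operator performs; the batch size and data are illustrative:

    # Sketch: row-at-a-time vs. vectorized (1024-row batch) filter + aggregate.
    import numpy as np

    BATCH = 1024
    amounts = np.random.default_rng(0).uniform(0, 500, size=1_000_000)

    def row_at_a_time(values):
        total = 0.0
        for v in values:               # one value per iteration, heavy per-row overhead
            if v > 100.0:
                total += v
        return total

    def vectorized(values):
        total = 0.0
        for start in range(0, len(values), BATCH):
            batch = values[start:start + BATCH]    # whole batch per operator call
            mask = batch > 100.0                   # SIMD-friendly comparison
            total += batch[mask].sum()             # SIMD-friendly reduction
        return total

    assert np.isclose(row_at_a_time(amounts), vectorized(amounts))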

Database Design

The Gold layer uses Kimball-style star schemas: a central fact table (e.g., fact_orders) with foreign keys into dimension tables (dim_customer, dim_product, dim_date). Fact tables are partitioned by event_date and clustered by the most selective join key. Slowly Changing Dimensions (SCD Type 2) track historical attribute changes by adding valid_from, valid_to, and is_current columns. Aggregated summary tables pre-compute common GROUP BY combinations using materialized views refreshed on a schedule.
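A minimal sketch of an SCD Type 2 merge in plain Python, assuming a dim_customer table keyed by customer_id with the valid_from/valid_to/is_current columns described above; the tracked attributes (tier, region) are illustrative:

    # Sketch of an SCD Type 2 merge: close out changed rows, insert new versions.
    from datetime import date

    OPEN_END = date(9999, 12, 31)      # sentinel meaning "still current"

    def scd2_merge(dim_rows, incoming, today, tracked=("tier", "region")):
        """dim_rows: existing dimension rows (dicts). incoming: latest source
        snapshot keyed by customer_id with the tracked attribute values."""
        current = {r["customer_id"]: r for r in dim_rows if r["is_current"]}
        out = list(dim_rows)
        for cust_id, src in incoming.items():
            prev = current.get(cust_id)
            if prev and all(prev[c] == src[c] for c in tracked):
                continue                           # unchanged: keep the open row
            if prev:                               # attribute changed: close old version
                prev["valid_to"] = today
                prev["is_current"] = False
            out.append({                           # open a new current version
                "customer_id": cust_id,
                **{c: src[c] for c in tracked},
                "valid_from": today,
                "valid_to": OPEN_END,
                "is_current": True,
            })
        return out

    history = scd2_merge(
        dim_rows=[{"customer_id": 1, "tier": "gold", "region": "EU",
                   "valid_from": date(2024, 1, 1), "valid_to": OPEN_END, "is_current": True}],
        incoming={1: {"tier": "platinum", "region": "EU"}},
        today=date(2025, 1, 15),
    )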

The Silver layer stores normalized, type-corrected tables with _source_system, _loaded_at, and _row_hash columns; the row hash is used for deduplication. Unique constraints are enforced in the ETL layer (not the warehouse engine), since MPP systems typically skip constraint enforcement for performance.
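The _row_hash column can be computed deterministically from the business columns; a sketch, where the hash algorithm, delimiter, and NULL sentinel are implementation choices:

    import hashlib

    def row_hash(row: dict, business_columns: list[str]) -> str:
        """Deterministic hash over business columns, used to deduplicate
        reloaded or replayed records in the Silver layer."""
        # Normalize to strings in a fixed column order; NULLs get a sentinel.
        parts = ["<NULL>" if row.get(c) is None else str(row[c]) for c in business_columns]
        return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

    # Two identical source records map to the same hash and collapse to one row.
    cols = ["order_id", "customer_id", "amount", "event_date"]
    a = row_hash({"order_id": 7, "customer_id": 1, "amount": 19.99,
                  "event_date": "2025-01-15"}, cols)
    b = row_hash({"order_id": 7, "customer_id": 1, "amount": 19.99,
                  "event_date": "2025-01-15"}, cols)
    assert a == b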

API Design

  • POST /queries — Submit an async SQL query; returns query_id and estimated queue position.
  • GET /queries/{query_id}/status — Poll execution state (QUEUED, RUNNING, SUCCEEDED, FAILED) with progress percentage and bytes scanned.
  • GET /queries/{query_id}/results?page=1&page_size=1000 — Paginated result set retrieval in Arrow IPC or JSON format.
  • POST /ingest/tables/{table_name} — Trigger a batch load from a specified S3 prefix into the target table with schema auto-detection.
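A client interaction with these endpoints might look like the sketch below; the base URL, auth header, polling interval, and response fields other than query_id and the states are assumptions for illustration:

    # Hypothetical client for the async query API sketched above.
    import time
    import requests

    BASE = "https://warehouse.example.com/api/v1"     # placeholder
    HEADERS = {"Authorization": "Bearer <token>"}     # placeholder

    def run_query(sql: str) -> list:
        # 1. Submit the query; the service returns an ID and a queue position.
        resp = requests.post(f"{BASE}/queries", json={"sql": sql}, headers=HEADERS)
        resp.raise_for_status()
        query_id = resp.json()["query_id"]

        # 2. Poll until the query leaves the QUEUED/RUNNING states.
        while True:
            status = requests.get(f"{BASE}/queries/{query_id}/status", headers=HEADERS).json()
            if status["state"] in ("SUCCEEDED", "FAILED"):
                break
            time.sleep(2)
        if status["state"] == "FAILED":
            raise RuntimeError(status.get("error", "query failed"))

        # 3. Page through the result set.
        rows, page = [], 1
        while True:
            r = requests.get(
                f"{BASE}/queries/{query_id}/results",
                params={"page": page, "page_size": 1000},
                headers=HEADERS,
            ).json()
            rows.extend(r["rows"])
            if not r.get("next_page"):     # assumed pagination flag
                break
            page += 1
        return rows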

Scaling & Bottlenecks

The primary bottleneck is query concurrency: a single MPP cluster has finite memory and CPU slots. The solution is workload management (WLM) queues that assign priority and resource limits per user group. Auto-scaling adds worker nodes in ~60 seconds when queue depth exceeds a threshold, and removes idle nodes after 10 minutes. Result caching eliminates repeated scans for dashboard queries, which typically account for 60% of query volume.
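The auto-scaling policy described above reduces to a small control loop; a sketch with illustrative, untuned thresholds:

    # Sketch of the WLM auto-scaling decision described above.
    from dataclasses import dataclass

    @dataclass
    class ClusterState:
        workers: int
        queued_queries: int
        idle_minutes: float

    def scale_decision(state: ClusterState,
                       min_workers: int = 2,
                       max_workers: int = 32,
                       queue_threshold: int = 10,
                       idle_limit_min: float = 10.0) -> int:
        """Return the target worker count for the next ~60-second control interval."""
        if state.queued_queries > queue_threshold and state.workers < max_workers:
            return min(max_workers, state.workers * 2)     # scale out quickly under backlog
        if state.idle_minutes >= idle_limit_min and state.workers > min_workers:
            return max(min_workers, state.workers - 1)     # drain idle capacity slowly
        return state.workers

    # Example: a deep queue during business hours doubles the worker pool.
    assert scale_decision(ClusterState(workers=8, queued_queries=25, idle_minutes=0)) == 16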

Ingestion throughput can bottleneck at the transformation layer. Micro-batch loading via Structured Streaming (Spark) or Kafka Connect JDBC sink handles high-frequency source tables. For bulk historical loads, parallel COPY commands across multiple compute nodes saturate object-storage bandwidth. Schema drift (unexpected source column changes) is handled by schema-on-read in the Bronze layer and explicit validation in the Silver transform.
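Schema drift handling in the Silver transform amounts to comparing the incoming columns against an expected contract; a minimal sketch in which the column lists and type names are illustrative:

    # Sketch: validate an incoming Bronze batch against the Silver table contract.
    EXPECTED = {"order_id": "bigint", "customer_id": "bigint",
                "amount": "double", "event_date": "date"}

    def check_schema_drift(incoming: dict) -> dict:
        """incoming maps column name -> source type (schema-on-read from Bronze)."""
        added = sorted(set(incoming) - set(EXPECTED))
        missing = sorted(set(EXPECTED) - set(incoming))
        retyped = sorted(c for c in EXPECTED
                         if c in incoming and incoming[c] != EXPECTED[c])
        return {"added": added, "missing": missing, "retyped": retyped}

    drift = check_schema_drift({"order_id": "bigint", "customer_id": "bigint",
                                "amount": "double", "event_date": "date",
                                "coupon_code": "string"})   # unexpected new source column
    # Added columns are logged and kept in Bronze; missing or retyped columns fail the load.
    assert drift["added"] == ["coupon_code"]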

Key Trade-offs

  • Compute-storage separation vs. colocation: Separating compute and storage enables independent scaling and multi-cluster access but adds network I/O latency vs. local NVMe. Caching hot data on compute nodes (Alluxio or local SSD cache) bridges the gap.
  • Star schema vs. wide flat tables: Star schemas reduce storage and simplify dimension updates but require multi-table joins; denormalized flat tables speed up query execution at the cost of data duplication and slower SCD updates.
  • Batch vs. micro-batch ingestion: 15-minute micro-batches improve data freshness but increase metadata overhead (more small files); compaction jobs merge small files hourly to maintain scan efficiency.
  • Result caching vs. freshness: Aggressive result caching improves p99 dashboard latency by 100x but can serve stale data; TTL-based cache invalidation tied to the table's last commit timestamp balances freshness and performance.
