System Design: Data Lineage System
Design a comprehensive data lineage system that tracks the origin, movement, and transformation of data across an entire data platform. The design covers automated lineage extraction, impact analysis, compliance reporting, and integration with orchestration tools.
Requirements
Functional Requirements:
- Automatically capture lineage from SQL transformations, Spark jobs, dbt models, and Airflow DAGs
- Track lineage at both dataset and column granularity
- Enable impact analysis: given a source column change, identify all downstream consumers
- Support root cause analysis: trace an anomalous metric back to its source data
- Generate compliance reports showing data flow for GDPR/CCPA subject access requests
- Provide a visual lineage graph explorer in the UI
Non-Functional Requirements:
- Lineage capture adds less than 50ms overhead to pipeline execution
- Lineage graph queries for impact analysis return within 2 seconds for depth up to 10 hops
- Support 100 million lineage edges across the entire platform
- 99.9% lineage capture completeness: no transformation should silently bypass lineage tracking
- Lineage data retained for 7 years for regulatory compliance
Scale Estimation
A large data platform runs 5,000 pipeline jobs per day, each producing on average 20 lineage events (table reads, writes, column derivations). That's 100,000 lineage events per day, or ~700,000 new edges per week. At 500 bytes per edge record: ~350 MB/week of new graph data. Over 7 years the store accumulates ~256 million events, for a total graph of ~128 GB, manageable in a dedicated graph database.
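A quick sanity check of the arithmetic, assuming every lineage event materializes one 500-byte edge record:

```python
# Back-of-envelope capacity estimate (numbers from the text above).
jobs_per_day = 5_000
events_per_job = 20
bytes_per_edge = 500

events_per_day = jobs_per_day * events_per_job        # 100,000
edges_per_week = events_per_day * 7                   # 700,000
weekly_mb = edges_per_week * bytes_per_edge / 1e6     # ~350 MB/week
events_7y = events_per_day * 365 * 7                  # ~255.5 million
total_gb = events_7y * bytes_per_edge / 1e9           # ~128 GB

print(f"{events_7y:,} events, {total_gb:.0f} GB over 7 years")
```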
High-Level Architecture
Lineage is captured through three mechanisms: passive SQL parsing, active instrumentation, and metadata-level integration. Passive parsing intercepts SQL queries submitted to warehouse engines (via query log hooks or JDBC interceptors), parses ASTs to extract column-level input/output relationships, and emits lineage events. Active instrumentation wraps ETL framework operators (Spark DataFrameWriter, Airflow operators) to emit OpenLineage events at job start and completion. Metadata-level integration reads static lineage from dbt manifest.json and Airflow task dependencies.
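As a concrete illustration of the passive-parsing path, here is a minimal table-level extractor built on the open-source sqlglot parser. sqlglot is an illustrative choice, not part of the design; any parser that exposes a SQL AST works the same way:

```python
# Minimal sketch: extract table-level lineage from one SQL statement.
import sqlglot
from sqlglot import exp

def fqn(t: exp.Table) -> str:
    # Fully qualified name: catalog.db.table, skipping missing parts.
    return ".".join(p for p in (t.catalog, t.db, t.name) if p)

def table_lineage(sql: str, dialect: str = "bigquery") -> dict:
    """Return the write target and the read inputs of a single statement."""
    tree = sqlglot.parse_one(sql, read=dialect)
    # The write target hangs off an INSERT or CREATE node, if any.
    write = tree.find(exp.Insert) or tree.find(exp.Create)
    output = fqn(write.this.find(exp.Table)) if write else None
    inputs = {fqn(t) for t in tree.find_all(exp.Table)} - {output}
    return {"output": output, "inputs": sorted(inputs)}

print(table_lineage(
    "INSERT INTO mart.daily_rev "
    "SELECT o.day, SUM(o.amt) FROM raw.orders AS o "
    "JOIN raw.fx AS f ON o.ccy = f.ccy GROUP BY 1"))
# -> {'output': 'mart.daily_rev', 'inputs': ['raw.fx', 'raw.orders']}
```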
All lineage events conform to the OpenLineage specification (a vendor-neutral open standard), enabling integration with tools like Marquez, Apache Atlas, and Amundsen. A Lineage Ingestion Service validates, deduplicates, and writes events to the lineage store. The store uses a property graph model: nodes represent datasets and columns; edges represent PRODUCES and CONSUMES relationships with metadata (transform name, SQL snippet, job run ID, timestamp).
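For reference, a minimal RunEvent as the OpenLineage spec defines it, posted to the ingestion endpoint described later in this design (host, job, and dataset names are placeholders):

```python
# Hand-rolled OpenLineage RunEvent; the spec version in schemaURL is
# illustrative.
import uuid
import datetime
import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "producer": "https://example.com/lineage-agent/1.0",
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "airflow", "name": "daily_revenue"},
    "inputs": [{"namespace": "bigquery", "name": "project.raw.orders"}],
    "outputs": [{"namespace": "bigquery", "name": "project.mart.daily_rev"}],
}
requests.post("https://lineage.example.com/lineage/events",
              json=event, timeout=5)
```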
A query service exposes graph traversal APIs. For impact analysis (downstream), it performs a BFS from the target node. For root cause analysis (upstream), it performs a DFS toward source nodes. Results are cached in Redis with TTL of 1 hour, invalidated when new lineage edges are added for nodes in the cached subgraph.
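In Cypher, the downstream BFS maps naturally onto a variable-length path match. A sketch using the official Neo4j Python driver, with placeholder connection details and dataset URI:

```python
# Downstream impact query: every dataset reachable within 10 hops.
from neo4j import GraphDatabase

CYPHER = """
MATCH (src:Dataset {uri: $uri})-[:PRODUCES*1..10]->(down:Dataset)
RETURN DISTINCT down.uri AS uri
"""

driver = GraphDatabase.driver("bolt://lineage-db:7687",
                              auth=("neo4j", "secret"))
with driver.session() as session:
    affected = [r["uri"] for r in
                session.run(CYPHER, uri="bigquery://project/raw/orders")]
print(f"{len(affected)} downstream datasets affected")
```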
Core Components
OpenLineage Event Collector
The collector is a horizontally scalable service that receives OpenLineage RunEvents over HTTP. It validates event schema, normalizes dataset names to canonical URIs (e.g., bigquery://project/dataset/table), resolves aliases (views, CTEs) to their base tables, and forwards events to a Kafka topic for async processing. Rate limiting prevents a single noisy pipeline from overwhelming the collector.
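A sketch of the collector's normalize-and-forward step, assuming the confluent-kafka client and a simple lowercase-and-slash naming rule (both illustrative):

```python
# Normalize dataset names to canonical URIs, then forward to Kafka.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})  # placeholder brokers

def canonical_uri(namespace: str, name: str) -> str:
    # e.g. ("bigquery", "Project.Dataset.Table") -> "bigquery://project/dataset/table"
    return f"{namespace.lower()}://" + "/".join(name.lower().split("."))

def handle_event(event: dict) -> None:
    for ds in event.get("inputs", []) + event.get("outputs", []):
        ds["uri"] = canonical_uri(ds["namespace"], ds["name"])
    # Key by run ID so all events of one run land in the same partition.
    producer.produce("lineage-events", key=event["run"]["runId"],
                     value=json.dumps(event))
    producer.poll(0)  # serve delivery callbacks without blocking
```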
Graph Storage Engine
Neo4j (or Amazon Neptune) stores the lineage graph. The property graph schema: (:Dataset {uri, name, source_system}), (:Column {uri, name, type, dataset_uri}), (:Job {id, name, run_id, timestamp}), with relationships (:Column)-[:DERIVED_FROM {sql, job_id}]->(:Column) and (:Dataset)-[:PRODUCES {job_id}]->(:Dataset). Unique indexes on Dataset.uri and Column.uri turn node lookups into fast indexed point reads. The graph engine partitions by source system for multi-tenant isolation.
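The uniqueness constraints (which also create the backing indexes) and the column-edge upsert can be expressed in Cypher like this; statement names are illustrative:

```python
# Run once at deploy time via the Neo4j driver; each constraint creates
# the backing index used for point lookups by URI.
SCHEMA_STATEMENTS = [
    "CREATE CONSTRAINT dataset_uri IF NOT EXISTS "
    "FOR (d:Dataset) REQUIRE d.uri IS UNIQUE",
    "CREATE CONSTRAINT column_uri IF NOT EXISTS "
    "FOR (c:Column) REQUIRE c.uri IS UNIQUE",
]

# Upsert one column-level edge: the derived column points at its source.
UPSERT_COLUMN_EDGE = """
MERGE (d:Column {uri: $derived_uri})
MERGE (s:Column {uri: $source_uri})
MERGE (d)-[e:DERIVED_FROM {job_id: $job_id}]->(s)
SET e.sql = $sql
"""
```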
Impact Analysis Engine
Given a change to a source column (rename, type change, deprecation), the impact analyzer runs a BFS traversal downstream to depth 10, collecting all affected datasets, pipelines, dashboards, and ML features. Results are grouped by consumer team (via catalog ownership lookup) and formatted as an impact report with contact information for each affected team. The analyzer integrates with JIRA to automatically create tickets for high-impact changes.
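The grouping step might look like the following, where get_owner stands in for the real catalog ownership lookup:

```python
# Group impact-analysis hits by owning team for the impact report.
from collections import defaultdict

def build_impact_report(affected_uris: list[str], get_owner) -> dict[str, list[str]]:
    """Map each owning team to the affected assets it must be notified about."""
    by_team: dict[str, list[str]] = defaultdict(list)
    for uri in affected_uris:
        by_team[get_owner(uri) or "unowned"].append(uri)
    return dict(by_team)
```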
Database Design
Lineage events are also archived in an append-only event store (S3 + Parquet) for long-term retention and replay. The graph database holds only the current state (latest lineage per job). A time-travel query returns the lineage graph as it existed at any past timestamp by replaying events from the archive up to that point. The event store schema: (event_id, event_type, run_id, job_name, inputs JSON, outputs JSON, column_lineage JSON, captured_at TIMESTAMP).
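A time-travel read over the Parquet archive can be a single query; the sketch below uses DuckDB with a placeholder bucket path (any engine that reads Parquet and supports window functions works):

```python
# Reconstruct the lineage state as of a past timestamp: keep the latest
# event per job among all events captured before the cutoff.
import duckdb

def lineage_as_of(cutoff: str):
    con = duckdb.connect()
    con.execute("INSTALL httpfs")  # S3 access for read_parquet
    con.execute("LOAD httpfs")
    return con.execute("""
        SELECT *
        FROM read_parquet('s3://lineage-archive/events/*.parquet')
        WHERE captured_at <= ?
        QUALIFY ROW_NUMBER() OVER (
            PARTITION BY job_name ORDER BY captured_at DESC) = 1
    """, [cutoff]).fetchall()
```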
API Design
GET /lineage/datasets/{uri}/impact?depth=5 — Return all downstream datasets and jobs affected by a change to this dataset.
GET /lineage/columns/{uri}/upstream?depth=10 — Trace the full upstream lineage of a column back to raw source data.
POST /lineage/events — OpenLineage-compatible endpoint for submitting RunEvent, DatasetEvent, and JobEvent payloads.
GET /lineage/jobs/{job_id}/runs/{run_id} — Return the full input/output lineage for a specific job run.
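An example client call against the impact-analysis endpoint above; the host and the shape of the response body (an "affected" key) are assumptions for illustration:

```python
# Query downstream impact for one dataset via the REST API.
import requests
from urllib.parse import quote

uri = quote("bigquery://project/raw/orders", safe="")
resp = requests.get(
    f"https://lineage.example.com/lineage/datasets/{uri}/impact",
    params={"depth": 5}, timeout=10)
resp.raise_for_status()
for hit in resp.json().get("affected", []):  # response shape assumed
    print(hit)
```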
Scaling & Bottlenecks
Graph database write throughput is the primary bottleneck during mass ingestion (e.g., first-time full platform scan). Neo4j's MERGE operations for node creation are CPU-intensive; batching MERGE statements into 1,000-node transactions and running parallel import threads achieves 50,000 edges/second write throughput. For read-heavy impact analysis during incident response, a read replica cluster handles query traffic without impacting write availability.
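The batched-write pattern amounts to sending edge rows through a single UNWIND ... MERGE statement per transaction rather than one round-trip per edge. A sketch with the Neo4j Python driver; the batch size mirrors the 1,000-row figure above:

```python
# Batched edge ingestion: one transaction per 1,000 rows.
from neo4j import GraphDatabase

BATCHED_UPSERT = """
UNWIND $rows AS row
MERGE (d:Column {uri: row.derived_uri})
MERGE (s:Column {uri: row.source_uri})
MERGE (d)-[:DERIVED_FROM {job_id: row.job_id}]->(s)
"""

def write_edges(driver, rows, batch_size=1_000):
    with driver.session() as session:
        for i in range(0, len(rows), batch_size):
            session.execute_write(
                lambda tx, chunk: tx.run(BATCHED_UPSERT, rows=chunk).consume(),
                rows[i:i + batch_size])
```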
SQL parsing for complex queries (multi-CTE, nested subqueries) can take 100–500ms per query, creating overhead in query execution paths. Moving parsing to an async post-processing pipeline (parse from warehouse query logs rather than inline) removes this from the critical path. Column-level lineage accuracy degrades for dynamically generated SQL (SELECT * expansions, EXECUTE statements); a fallback to table-level lineage with a flag indicating incomplete column resolution maintains correctness signals.
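The fallback can be a thin wrapper around the parser, where parse_column_lineage and parse_table_lineage stand in for the real parsing routines:

```python
# Try column-level parsing; on failure, emit table-level lineage flagged
# as incomplete so downstream consumers know resolution is partial.
def extract_lineage(sql: str) -> dict:
    try:
        return {"granularity": "column",
                "edges": parse_column_lineage(sql),  # may raise on dynamic SQL
                "column_resolution_complete": True}
    except Exception:
        return {"granularity": "table",
                "edges": parse_table_lineage(sql),
                "column_resolution_complete": False}
```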
Key Trade-offs
- Inline vs. async lineage capture: Inline capture (within query execution) guarantees 100% completeness but adds latency; async capture from query logs achieves high completeness (>99%) with zero execution overhead but may miss jobs that don't log queries.
- Column-level vs. table-level lineage: Column lineage provides precise impact analysis but requires SQL parsing which is brittle for dynamic SQL; table lineage is universally available and covers all pipelines.
- Active OpenLineage integration vs. passive query log mining: Active integration is accurate but requires instrumentation of every pipeline; passive log mining requires no changes to pipelines but depends on query log availability and completeness.
- Graph DB vs. relational adjacency list: Graph DB excels at multi-hop traversals natively; relational with recursive CTEs works up to 10 million edges but becomes slow at 100 million+ edges without specialized graph extensions.