
System Design: Data Mesh Architecture

Design a data mesh architecture that decentralizes data ownership to domain teams while providing federated computational governance, self-service infrastructure, and interoperability standards. Covers domain data products, governance federation, and the platform layer.

14 min read · Updated Jan 15, 2025
system-design · data-mesh · data-governance · domain-driven-design · data-engineering

Requirements

Functional Requirements:

  • Enable domain teams to independently own, publish, and operate data products
  • Provide a self-service data infrastructure platform that abstracts compute, storage, and pipeline tooling
  • Enforce global standards: schema registry, data contracts, SLA declarations, and access control policies
  • Enable data product discovery and consumption through a global marketplace
  • Support federated governance: policy enforcement happens at the platform level, policy definition is federated to domain stewards
  • Facilitate data product interoperability: any consumer can read any data product using standard interfaces

Non-Functional Requirements:

  • A new data product must be publishable by a domain team in under 1 hour using self-service tooling
  • Cross-domain data product reads must have SLA-backed freshness guarantees
  • Platform infrastructure must be available 99.9%; domain teams are responsible for their own product SLAs
  • Policy enforcement overhead must be transparent to domain teams (automated, not manual gates)
  • Support 100 domains with up to 50 data products per domain = 5,000 total data products

Scale Estimation

With 5,000 data products, each producing an average of 100 GB/day, total data platform throughput is 500 TB/day. Cross-domain consumption: 20% of data products are consumed by other domains, generating 100 TB/day of cross-domain reads. The global marketplace catalog indexes 5,000 products with 1,000 daily searches. Policy enforcement runs 10,000 access control checks per hour across all data product reads.
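The arithmetic above can be sanity-checked with a short back-of-envelope script (the per-product average of 100 GB/day and the 20% cross-domain share are the figures assumed in the text):

```python
# Back-of-envelope check of the scale estimate above.
PRODUCTS = 5_000
GB_PER_PRODUCT_PER_DAY = 100
CROSS_DOMAIN_SHARE = 0.20

total_tb_per_day = PRODUCTS * GB_PER_PRODUCT_PER_DAY / 1_000
cross_domain_tb_per_day = total_tb_per_day * CROSS_DOMAIN_SHARE

print(total_tb_per_day)         # 500.0 TB/day across the platform
print(cross_domain_tb_per_day)  # 100.0 TB/day of cross-domain reads
```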

High-Level Architecture

The data mesh architecture rests on four principles: domain ownership, data as a product, self-service infrastructure, and federated computational governance. In practice, this translates to three layers: the Domain Layer (where teams own and operate data products), the Platform Layer (shared infrastructure services consumed by all domains), and the Governance Layer (federated policy enforcement and global standards).

Each domain operates its own pipeline stack within a standardized envelope: teams provision compute and storage via Infrastructure-as-Code templates provided by the platform team, which ensures all domain pipelines run within the approved security boundary, use the approved storage format (Iceberg), and emit the required metadata (lineage events, quality scores). The platform team provides golden-path templates in Terraform/Pulumi that create a new domain data product in one command.

The Global Data Marketplace is a federated catalog (based on DataHub or custom) that indexes data products from all domain catalogs. Each domain publishes a dataproduct.yaml manifest (name, owner, SLA, schema, access policy, output ports) to a central registry. The marketplace ingests these manifests, indexes them for search, and provides a uniform consumption interface regardless of which domain produced the data.

Core Components

Data Product Specification

Each data product is defined by a contract: dataproduct.yaml specifies the product name, domain owner, output ports (each port is a named interface with a schema, format, SLA, and access endpoint), input dependencies (upstream data products consumed), data classification, and retention policy. A CI/CD pipeline validates the contract against the global schema registry and policy engine before publishing. Versioning follows semantic versioning: breaking schema changes require a major version bump and a deprecation period.
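A registration pipeline could validate this contract roughly as follows (a minimal sketch: the field names mirror the dataproduct.yaml structure described above, but the helper functions and checks are illustrative, not the actual CI/CD implementation):

```python
# Sketch of contract validation at registration time. Field names follow
# the dataproduct.yaml described in the text; the checks are illustrative.
REQUIRED_FIELDS = {"name", "owner", "version", "output_ports",
                   "classification", "retention_days"}

def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of validation errors (empty list = valid)."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS - manifest.keys()]
    for port in manifest.get("output_ports", []):
        if "schema" not in port or "sla_freshness_minutes" not in port:
            errors.append(f"port {port.get('name', '?')} lacks schema or SLA")
    return errors

def requires_major_bump(old_columns: set[str], new_columns: set[str]) -> bool:
    """Breaking change under semver: a previously published column was removed."""
    return bool(old_columns - new_columns)

manifest = {"name": "orders", "owner": "sales-team", "version": "1.2.0",
            "output_ports": [{"name": "daily", "schema": "orders_v1",
                              "sla_freshness_minutes": 60}],
            "classification": "internal", "retention_days": 365}
```

Under this rule, dropping a column forces a major version bump (and the deprecation period), while adding a column does not.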

Self-Service Infrastructure Platform

The platform team provides Terraform modules that a domain team calls with their data product configuration. The module provisions: an S3 prefix under the domain's namespace, an Iceberg catalog registration, an Airflow DAG template (or dbt project scaffold), IAM roles scoped to the domain's data, and monitoring/alerting defaults. Domain teams extend the template with their business logic but cannot override the governance envelope (encryption settings, audit log hooks, metadata emission).
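The non-overridable governance envelope can be pictured as a merge in which platform-enforced settings always win over the domain team's configuration (a sketch; all keys here are illustrative, not the real module variables):

```python
# Sketch of the governance envelope: domain config is merged underneath
# platform-enforced settings, so teams cannot override them. Keys are
# illustrative.
GOVERNANCE_ENVELOPE = {
    "encryption": "aws:kms",
    "audit_logging": True,
    "metadata_emission": "required",
}

def render_product_config(domain_config: dict) -> dict:
    """Domain settings apply only where the envelope is silent."""
    return {**domain_config, **GOVERNANCE_ENVELOPE}
```

A domain team can set its own scheduling or business-logic options, but an attempt to disable encryption is silently replaced by the platform default.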

Federated Governance Engine

A policy engine (Open Policy Agent/Rego) evaluates every data product access request against policies defined at three levels: global (enforced by the platform team — e.g., no unencrypted PII cross-domain), domain (enforced by domain stewards — e.g., Finance data requires VP approval), and product (defined by the product owner — e.g., this table is public within the company). OPA policies are stored in a Git repository with PR-based review for global and domain policies; product-level policies are configurable via the marketplace UI.
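The three-level evaluation order can be sketched as follows (in practice each level would be a compiled Rego bundle evaluated by OPA; here each policy is just a predicate over the access request, and the example policies are hypothetical):

```python
# Sketch of federated policy evaluation: global, then domain, then
# product policies are checked, and an explicit deny at any level wins.
def evaluate(request: dict, global_policies, domain_policies, product_policies) -> bool:
    for level in (global_policies, domain_policies, product_policies):
        for policy in level:
            if not policy(request):
                return False  # deny at any level blocks access
    return True

# Hypothetical policies mirroring the examples in the text:
no_unencrypted_pii = lambda r: not (r["pii"] and not r["encrypted"] and r["cross_domain"])
finance_needs_approval = lambda r: r["domain"] != "finance" or r.get("vp_approved", False)
```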

Database Design

The Global Registry stores three tables:

  • data_products (product_id, domain_id, name, version, status, manifest_yaml, registered_at)
  • output_ports (port_id, product_id, port_name, schema_id, sla_freshness_minutes, access_endpoint, format)
  • consumption_contracts (contract_id, consumer_domain_id, producer_product_id, port_id, agreed_sla, created_at)

SLA compliance is tracked in a time-series store as (product_id, port_id, measured_at, actual_freshness_minutes, sla_freshness_minutes, compliant). Breach events trigger automated alerts to the product owner.
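The breach-detection pass over the SLA time-series can be sketched like this (field names mirror the time-series schema above; the function names are illustrative):

```python
# Sketch of SLA compliance evaluation over the time-series rows
# described above.
def check_compliance(measurement: dict) -> dict:
    """Mark a measurement compliant if actual freshness is within the SLA."""
    compliant = (measurement["actual_freshness_minutes"]
                 <= measurement["sla_freshness_minutes"])
    return {**measurement, "compliant": compliant}

def breaches(measurements: list[dict]) -> list[dict]:
    """Rows that would trigger an automated alert to the product owner."""
    return [m for m in map(check_compliance, measurements) if not m["compliant"]]
```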

API Design

  • POST /products: Register a new data product by uploading dataproduct.yaml; triggers validation, provisioning, and catalog indexing.
  • GET /products/{product_id}/ports/{port_name}/read: Consume data from a specific output port; returns a pre-signed S3 URL or query endpoint with access-controlled credentials.
  • GET /marketplace/search?q={term}&domain={domain}&sla_minutes={n}: Search for data products by keyword and quality filters.
  • POST /governance/policy-check: Evaluate whether a consumer is authorized to access a data product port under current policies.

Scaling & Bottlenecks

The platform team's self-service infrastructure becomes a bottleneck if domain teams must wait for manual platform approvals or if the Terraform module suite cannot keep pace with new requirements. The solution is a product mindset for the platform team: maintain the platform as a versioned product with a public roadmap, SLAs for module updates, and a self-service extension mechanism allowing domain teams to add custom modules within governance guardrails.

Cross-domain data product reads create network topology complexity: a consumer in Domain A reading from Domain B must resolve the output port endpoint, obtain access credentials, and read from Domain B's storage. A data access broker service caches resolved endpoints and credentials (short-lived tokens refreshed every hour), reducing cross-domain read overhead to a single credential check rather than a full policy evaluation chain on every read.
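The broker's credential cache can be sketched as follows (illustrative names; the expensive `issue_token` callback stands in for the full policy-evaluation chain plus token minting):

```python
import time

# Sketch of the data access broker's cache: resolved credentials are
# short-lived tokens refreshed hourly, so repeated cross-domain reads
# skip the full policy-evaluation chain.
TOKEN_TTL_SECONDS = 3600

class AccessBroker:
    def __init__(self, issue_token):
        self._issue_token = issue_token  # full policy check + token mint
        self._cache = {}  # (consumer, product, port) -> (token, expires_at)

    def get_token(self, consumer: str, product: str, port: str) -> str:
        key = (consumer, product, port)
        cached = self._cache.get(key)
        if cached and cached[1] > time.monotonic():
            return cached[0]  # cache hit: single credential check, no policy chain
        token = self._issue_token(consumer, product, port)
        self._cache[key] = (token, time.monotonic() + TOKEN_TTL_SECONDS)
        return token
```

Within the TTL, a repeat read from the same consumer reuses the cached token; only the first read (or a read after expiry) pays for the full evaluation.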

Key Trade-offs

  • Decentralized ownership vs. governance enforcement: Giving domains full ownership increases autonomy and reduces bottlenecks but risks data silos and inconsistent quality; the self-service platform enforces a minimum governance envelope automatically, preserving autonomy within guardrails.
  • Federated catalogs vs. centralized catalog: Federated domain catalogs reduce the platform team's operational burden but fragment discovery; a global marketplace layer that aggregates domain catalogs provides unified discovery without requiring domains to surrender ownership.
  • Data product immutability vs. mutability: Immutable data products (append-only) simplify consumer subscription and SLA guarantees; mutable products (MERGE/UPDATE) are more expressive but require versioned snapshots and more complex consumer contracts.
  • Schema standards enforcement at registration vs. at read: Enforcing schema standards at registration (blocking incompatible products from being published) prevents consumer issues but slows domain iteration; read-time validation catches issues but only when data is already consumed.
