
System Design: Metadata Management System

Design a centralized metadata management system that captures, stores, and serves technical and business metadata for all data assets in an enterprise. Covers schema registries, metadata ingestion, versioning, and integration with data governance workflows.

12 min read · Updated Jan 15, 2025
Tags: system-design, metadata, schema-registry, data-governance, data-engineering

Requirements

Functional Requirements:

  • Centralize metadata from databases, data lakes, warehouses, APIs, and BI tools
  • Track schema versions: capture every schema change with diff, author, and timestamp
  • Enforce schema compatibility rules (backward, forward, full compatibility) for event schemas
  • Provide a REST/GraphQL API for consumers to fetch schema definitions and validate messages
  • Maintain business metadata: ownership, SLAs, data classification, and retention policies
  • Trigger downstream notifications when schemas change (webhook, Kafka event)

Non-Functional Requirements:

  • Schema fetch latency under 10ms for cached schemas (hot path for message validation)
  • Support 100,000 registered schemas across all source systems
  • Schema changes versioned indefinitely for audit and compliance
  • 99.99% availability for schema validation API (on critical message ingestion path)
  • API handles 50,000 schema validation requests per second

Scale Estimation

With 100,000 schemas averaging 5 versions each = 500,000 schema versions. Average schema size 10 KB → 5 GB total schema storage (trivial). Validation API at 50,000 requests/second with schemas cached in memory: throughput dominated by deserialization and compatibility check compute, not storage. A schema validation worker can process 10,000 requests/second; five workers meet peak load exactly, so provision six or more for headroom.
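The arithmetic above can be checked with a quick back-of-envelope script (all figures are the article's stated assumptions, not measurements):

```python
import math

# Assumptions from the estimation above.
schemas = 100_000
versions_per_schema = 5
avg_schema_kb = 10
peak_rps = 50_000
worker_rps = 10_000

total_versions = schemas * versions_per_schema        # 500,000 schema versions
storage_gb = total_versions * avg_schema_kb / 1_000_000  # 5.0 GB, trivial
workers_at_peak = math.ceil(peak_rps / worker_rps)    # 5 workers, no headroom

print(total_versions, storage_gb, workers_at_peak)
# 500000 5.0 5
```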

High-Level Architecture

The metadata management system has two distinct planes: the Schema Registry plane (low-latency, high-throughput schema validation for real-time pipelines) and the Metadata Catalog plane (search, governance, and human-facing metadata management). The Schema Registry is inspired by the Confluent Schema Registry design: a REST service backed by Kafka as the source of truth and an in-memory cache for fast reads.

Metadata is ingested from sources via crawlers (scheduled), event hooks (real-time schema changes), and manual API calls (business metadata). All metadata changes are published to a Kafka metadata changelog topic, enabling downstream consumers (lineage systems, quality monitors, catalog search indexers) to react to changes in real time. The changelog is also the source of truth; the metadata store is a projection that can be rebuilt from the changelog.
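Because the changelog is the source of truth, the metadata store can be rebuilt by folding the ordered event stream into a latest-state view. A minimal sketch (the event shape with `asset_id` and `attributes` keys is illustrative, not the system's actual wire format):

```python
def rebuild_projection(changelog_events):
    """Fold an ordered stream of metadata change events into the
    latest-state view that the metadata store materializes."""
    projection = {}
    for event in changelog_events:
        asset = projection.setdefault(event["asset_id"], {})
        asset.update(event["attributes"])  # later events win, preserving changelog order
    return projection

events = [
    {"asset_id": "orders", "attributes": {"owner": "team-a"}},
    {"asset_id": "orders", "attributes": {"owner": "team-b", "retention_days": 90}},
]
print(rebuild_projection(events))
# {'orders': {'owner': 'team-b', 'retention_days': 90}}
```

The same fold is what downstream consumers (search indexers, quality monitors) run incrementally as new events arrive.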

A GraphQL API serves the UI and programmatic consumers. It federates queries across the relational metadata store (for business context), the schema registry (for technical schemas), and the lineage graph (for dataset relationships), composing a unified metadata response from multiple specialized stores.

Core Components

Schema Registry

The Schema Registry stores Avro, Protobuf, and JSON Schema definitions per subject (topic or entity type). Schema registration validates compatibility against the latest registered version using the configured compatibility mode. Confluent's compatibility algorithm checks backward compatibility (consumers using the new schema can read data written with the old schema) and forward compatibility (consumers using the old schema can read data written with the new schema). Schemas are stored in Kafka (durable, ordered changelog) and cached in-memory in each registry instance.
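A simplified backward-compatibility check illustrates the rule: a reader using the new schema must be able to decode data written with the old one. Real Avro resolution also handles type promotion, unions, and aliases; this sketch covers only added/removed fields and exact type matches:

```python
def is_backward_compatible(old_fields, new_fields):
    """Fields are dicts of name -> {"type": ..., optional "default": ...}.
    Returns True if a reader on new_fields can decode old_fields data."""
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Field absent from old data: the reader needs a default to fill it in.
            if "default" not in spec:
                return False
        elif spec["type"] != old_fields[name]["type"]:
            return False  # no type promotion in this sketch
    return True  # fields removed in the new schema are simply ignored by the reader

old = {"id": {"type": "long"}, "email": {"type": "string"}}
new_ok = {"id": {"type": "long"}, "country": {"type": "string", "default": ""}}
new_bad = {"id": {"type": "long"}, "country": {"type": "string"}}
print(is_backward_compatible(old, new_ok), is_backward_compatible(old, new_bad))
# True False
```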

Schema Versioning & Diff Engine

Every schema change creates a new version record: (subject, version_number, schema_id, schema_text, compatible_with_previous, author, created_at, change_description). A diff engine computes structural diffs between schema versions: added fields, removed fields, type changes, and constraint changes. Diffs are stored as JSON patch documents and displayed in the UI as a side-by-side view, enabling data engineers to understand the impact of a proposed schema change before registering it.
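A minimal sketch of the diff computation, emitting JSON-Patch-style operations (field representation here is deliberately flattened to name → type; the real engine also diffs constraints):

```python
def schema_diff(old_fields, new_fields):
    """Compute a JSON-Patch-style structural diff between two versions,
    where each version is a dict of field name -> type."""
    ops = []
    for name in old_fields.keys() - new_fields.keys():
        ops.append({"op": "remove", "path": f"/fields/{name}"})
    for name in new_fields.keys() - old_fields.keys():
        ops.append({"op": "add", "path": f"/fields/{name}", "value": new_fields[name]})
    for name in old_fields.keys() & new_fields.keys():
        if old_fields[name] != new_fields[name]:
            ops.append({"op": "replace", "path": f"/fields/{name}", "value": new_fields[name]})
    return sorted(ops, key=lambda o: (o["op"], o["path"]))  # stable order for storage

v1 = {"id": "long", "email": "string"}
v2 = {"id": "long", "email": "bytes", "country": "string"}
print(schema_diff(v1, v2))
# [{'op': 'add', 'path': '/fields/country', 'value': 'string'},
#  {'op': 'replace', 'path': '/fields/email', 'value': 'bytes'}]
```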

Business Metadata Store

PostgreSQL stores business metadata in an entity-attribute-value (EAV) model for flexibility: metadata_assets (asset_id, asset_type, external_id, source_system), metadata_attributes (asset_id, attribute_key, attribute_value, data_type, set_by, set_at). Standard attribute keys are enumerated (owner, sensitivity_class, retention_days, sla_freshness_minutes) for type safety. Custom attributes are supported for domain-specific tagging. All attribute changes are audited in an immutable metadata_audit_log table.
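The EAV layout can be sketched in SQLite for illustration (column names follow the text; constraints, enums, and the audit trigger are simplified):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metadata_assets (
    asset_id      INTEGER PRIMARY KEY,
    asset_type    TEXT NOT NULL,
    external_id   TEXT NOT NULL,
    source_system TEXT NOT NULL,
    UNIQUE (source_system, external_id)       -- one row per source asset
);
CREATE TABLE metadata_attributes (
    asset_id        INTEGER REFERENCES metadata_assets(asset_id),
    attribute_key   TEXT NOT NULL,            -- e.g. owner, retention_days
    attribute_value TEXT NOT NULL,
    data_type       TEXT NOT NULL,
    set_by          TEXT NOT NULL,
    set_at          TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (asset_id, attribute_key)     -- latest value per key
);
""")
conn.execute("INSERT INTO metadata_assets VALUES (1, 'table', 'db.orders', 'postgres')")
conn.execute(
    "INSERT INTO metadata_attributes (asset_id, attribute_key, attribute_value, data_type, set_by)"
    " VALUES (1, 'owner', 'team-payments', 'string', 'alice')"
)
row = conn.execute(
    "SELECT attribute_value FROM metadata_attributes"
    " WHERE asset_id = 1 AND attribute_key = 'owner'"
).fetchone()
print(row[0])  # team-payments
```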

Database Design

Schema definitions are stored in PostgreSQL: schemas (schema_id UUID, subject VARCHAR, version INT, schema_format ENUM, schema_text TEXT, compatibility_mode ENUM, created_at TIMESTAMP, author VARCHAR). An index on (subject, version) makes version lookups a single index probe. The Kafka metadata changelog uses a compacted topic keyed by (subject, version), so every schema version survives compaction while repeated writes of the same version collapse to a single record. A Redis cache with schema:{schema_id} keys holds deserialized schema objects for sub-millisecond validation.
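The hot read path is a read-through cache. In this sketch Redis and PostgreSQL are modelled as plain dicts; a real deployment would use redis-py and a schemas-table query behind the same schema:{schema_id} key convention:

```python
import json

redis_like = {}  # stand-in for Redis
database = {42: '{"type": "record", "name": "Order", "fields": []}'}  # stand-in for PostgreSQL

def get_schema(schema_id):
    key = f"schema:{schema_id}"
    if key in redis_like:
        return redis_like[key]              # cache hit: the sub-millisecond path
    schema = json.loads(database[schema_id])  # cache miss: fetch and deserialize once
    redis_like[key] = schema                # subsequent reads skip parsing entirely
    return schema

print(get_schema(42)["name"])  # Order (miss, populates cache)
print(get_schema(42)["name"])  # Order (hit)
```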

API Design

  • POST /subjects/{subject}/versions — Register a new schema version; validates compatibility and returns schema_id or a compatibility error.
  • GET /subjects/{subject}/versions/latest — Return the latest schema version for a subject (cached, <10ms).
  • POST /compatibility/subjects/{subject} — Test whether a proposed schema is compatible with the registered schema without registering it.
  • GET /subjects/{subject}/versions/{version}/diff — Return the structural diff between version N and N-1.
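The semantics behind these routes can be sketched in-process, with the HTTP layer omitted and the compatibility check stubbed (the class and method names are illustrative, not the system's actual interface):

```python
class SchemaRegistry:
    def __init__(self):
        self.subjects = {}  # subject -> list of schema_text; index i holds version i+1
        self.next_id = 1

    def register(self, subject, schema_text):           # POST /subjects/{subject}/versions
        if not self.test_compatibility(subject, schema_text):
            raise ValueError("incompatible schema")
        versions = self.subjects.setdefault(subject, [])
        versions.append(schema_text)
        schema_id, self.next_id = self.next_id, self.next_id + 1
        return {"schema_id": schema_id, "version": len(versions)}

    def latest(self, subject):                          # GET /subjects/{subject}/versions/latest
        return self.subjects[subject][-1]

    def test_compatibility(self, subject, schema_text): # POST /compatibility/subjects/{subject}
        return True  # stub: plug in the configured compatibility algorithm here

    def diff(self, subject, version):                   # GET .../versions/{version}/diff
        old = self.subjects[subject][version - 2]
        new = self.subjects[subject][version - 1]
        return {"old": old, "new": new}  # a real implementation returns a structural diff

reg = SchemaRegistry()
reg.register("orders-value", '{"v": 1}')
print(reg.register("orders-value", '{"v": 2}'))  # {'schema_id': 2, 'version': 2}
print(reg.latest("orders-value"))                # {"v": 2}
```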

Scaling & Bottlenecks

The schema validation hot path (50,000 requests/second) is served entirely from in-memory cache; the bottleneck is CPU for Avro deserialization and compatibility checking. Caching pre-compiled schema validators (Avro GenericRecord parsers) eliminates recompilation per request, improving throughput by 5x. If single-node throughput is insufficient, a stateless registry cluster behind a load balancer scales linearly.
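Pre-compiled validator caching can be as simple as memoizing the compile step keyed by schema text. Here `json.loads` plus a required-keys check stands in for the expensive Avro parser compilation:

```python
import functools
import json

@functools.lru_cache(maxsize=None)
def compiled_validator(schema_text):
    """Parse the schema once and return a closure that validates records.
    Repeated calls with the same schema_text return the cached closure."""
    schema = json.loads(schema_text)   # the expensive step, done once per schema
    required = set(schema["required"])
    def validate(record):
        return required.issubset(record)
    return validate

schema = '{"required": ["id", "amount"]}'
v = compiled_validator(schema)             # compiled on first use
assert v is compiled_validator(schema)     # cache hit: same object, no recompilation
print(v({"id": 1, "amount": 5}), v({"id": 1}))
# True False
```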

Schema crawlers for large source systems (crawling a 10,000-table database) can generate thousands of schema events in a short burst, overwhelming the metadata ingestion pipeline. Rate limiting the crawler to 100 schema events/second and using a queue for burst absorption prevents downstream overload. Idempotent metadata writes (upsert by external_id + version_hash) prevent duplicate schema records from repeated crawl runs.
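Both safeguards are standard patterns; a minimal sketch of a token-bucket limiter for the crawler and an idempotency key derived from external_id plus a hash of the schema text:

```python
import hashlib
import time

class TokenBucket:
    """Cap crawler emit rate; bursts draw down the bucket, which refills at `rate`/s."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def idempotency_key(external_id, schema_text):
    version_hash = hashlib.sha256(schema_text.encode()).hexdigest()[:16]
    return f"{external_id}:{version_hash}"

store = {}
def upsert(external_id, schema_text):
    # Repeated crawl runs produce the same key, so they overwrite, not duplicate.
    store[idempotency_key(external_id, schema_text)] = schema_text

upsert("db.orders", '{"v": 1}')
upsert("db.orders", '{"v": 1}')  # duplicate crawl: same key, no new record
print(len(store))  # 1
```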

Key Trade-offs

  • Kafka as source of truth vs. database: Kafka provides durable ordering, replay capability, and event-driven integration but requires schema compaction management; a PostgreSQL-primary with event publishing on writes is simpler operationally.
  • Centralized registry vs. per-team schema ownership: Centralized enforcement ensures global compatibility but creates a bottleneck for schema registration approval; per-team ownership with periodic compatibility validation reconciles autonomy and governance.
  • Strict compatibility enforcement vs. advisory-only: Blocking incompatible schema registration prevents consumer breakage but may impede development velocity; advisory mode alerts on incompatibility but allows registration, relying on teams to coordinate changes.
  • Full schema storage vs. delta storage: Storing full schema text per version is simple and enables fast retrieval; storing only diffs saves space but requires replaying deltas from version 1, adding latency for old version lookups.
