
System Design: Feature Store

Design a production feature store that serves machine learning features with low latency for online inference and high throughput for offline training, while maintaining consistency between the two access patterns. Covers point-in-time joins, feature pipelines, and serving architecture.

15 min read · Updated Jan 15, 2025
Tags: system-design, feature-store, mlops, online-inference, data-engineering

Requirements

Functional Requirements:

  • Define feature groups with schema, entity key, and time-to-live
  • Compute features via batch pipelines (daily/hourly) and streaming pipelines (seconds)
  • Serve features online: given entity_id(s), return the latest feature vector in <10ms
  • Serve features offline: return a point-in-time correct feature table for a labeled training dataset
  • Version feature definitions; old pipelines continue serving until explicitly deprecated
  • Detect and alert on training-serving skew: statistical drift between offline training features and online serving features

Non-Functional Requirements:

  • Online serving: 1 million requests/second, p99 latency under 10ms
  • Offline serving: generate training dataset of 100 million rows in under 2 hours
  • Feature freshness SLA: streaming features no more than 30 seconds stale; batch features updated hourly
  • 99.99% availability for online serving
  • Storage efficiency: avoid redundant feature computation across teams

Scale Estimation

1 million entities (users), each with 500 features averaging 8 bytes = 4 KB per entity online store entry. Total online store: 4 GB for 1 million users — small enough for a single Redis instance. For 100 million users (global scale): 400 GB online store, requiring a Redis Cluster with 10 shards of 40 GB each. Offline store: 100 million users * 500 features * 8 bytes = 400 GB per daily snapshot written to S3; retaining 30 days of snapshots comes to roughly 12 TB before Parquet compression.
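A small back-of-envelope script keeps these assumptions explicit and easy to rerun with different parameters; the constants are the estimates above, not measurements:

    # Back-of-envelope sizing for the estimates quoted above.
    USERS = 100_000_000              # global scale
    FEATURES_PER_USER = 500
    BYTES_PER_FEATURE = 8
    RETENTION_DAYS = 30

    entry_bytes = FEATURES_PER_USER * BYTES_PER_FEATURE       # ~4 KB per entity
    online_store_bytes = USERS * entry_bytes                  # 400 GB
    daily_snapshot_bytes = online_store_bytes                 # one full snapshot per day
    retained_bytes = daily_snapshot_bytes * RETENTION_DAYS    # ~12 TB over 30 days

    print(f"per-entity entry: {entry_bytes / 1e3:.0f} KB")
    print(f"online store:     {online_store_bytes / 1e9:.0f} GB")
    print(f"30-day offline:   {retained_bytes / 1e12:.1f} TB")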

High-Level Architecture

The feature store has three layers: the Feature Catalog (metadata), the Feature Pipeline layer (compute), and the Feature Serving layer (storage + API). The catalog stores feature group definitions, entity schemas, and pipeline configurations. The pipeline layer runs batch (Spark) and streaming (Flink) jobs to compute features and write to both the offline store (S3/Parquet) and online store (Redis). The serving layer provides a unified API that routes online requests to Redis and offline requests to the offline store.

Feature pipelines are defined as transformations over source data (event streams, database snapshots, API enrichments). A pipeline SDK generates both the Spark batch job (computing historical feature values) and the Flink streaming job (computing real-time feature values) from the same feature definition DSL, ensuring consistency between the two compute paths. This symmetry is critical for eliminating training-serving skew.
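As a sketch, a feature definition in such a DSL might look like the following. The module, decorator, and aggregation helpers (feature_sdk, feature_group, Field, Aggregation) are hypothetical names for illustration, not a real library:

    # Hypothetical pipeline SDK: one declarative definition, two compile targets.
    from feature_sdk import feature_group, Field, Aggregation  # hypothetical SDK

    @feature_group(
        name="user_purchase_stats",
        entity="user_id",
        ttl_seconds=3600,
        batch_source="s3://events/purchases/",     # Spark reads historical Parquet
        stream_source="kafka://purchase-events",   # Flink reads the live topic
    )
    def user_purchase_stats():
        return [
            Field("purchase_count_7d", Aggregation.count(window="7d")),
            Field("avg_order_value_7d", Aggregation.mean("order_value", window="7d")),
            Field("seconds_since_last_purchase", Aggregation.time_since_last()),
        ]

    # The SDK compiles this one definition into a Spark batch job (backfill and
    # daily snapshots) and a Flink streaming job (incremental updates), so both
    # compute paths apply identical transformation logic.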

Point-in-time correct joins are the signature capability of the offline store. Given a training dataset with (entity_id, label_timestamp) rows, the feature store retrieves the feature value that was current at label_timestamp for each entity, not the latest value. This is implemented using an AS-OF join on the offline store's time-partitioned feature snapshots, ensuring the model trains on feature values that would have been available at prediction time.
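Declaratively, the AS-OF join selects, for each (entity_id, label_timestamp) pair, the newest feature row whose feature_timestamp does not exceed the label timestamp. A minimal Spark SQL formulation, assuming an active SparkSession named spark and the training dataset registered as a temp view called labels:

    # Point-in-time (AS-OF) join expressed as a window function in Spark SQL.
    training_df = spark.sql("""
        SELECT entity_id, label_timestamp, feature_1  -- remaining feature columns elided
        FROM (
            SELECT l.entity_id, l.label_timestamp, f.feature_1,
                   ROW_NUMBER() OVER (
                       PARTITION BY l.entity_id, l.label_timestamp
                       ORDER BY f.feature_timestamp DESC
                   ) AS rn
            FROM labels l
            JOIN feature_group f
              ON f.entity_id = l.entity_id
             AND f.feature_timestamp <= l.label_timestamp
        ) AS t
        WHERE rn = 1
    """)

This window-function formulation is correct but naive; the Scaling & Bottlenecks section below covers why it degrades at scale and how a sorted merge join avoids the quadratic blow-up.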

Core Components

Online Store (Redis Cluster)

Redis Cluster stores the latest feature vector per entity. Key: {feature_group_id}:{entity_id}. Value: serialized Avro or MessagePack binary of the feature vector with a version tag. TTL is set per feature group based on data freshness requirements. Batch updates use Redis Pipeline commands for bulk writes (50,000 entities/second). Streaming updates arrive via Flink's Redis Sink connector, updating individual keys as new feature values are computed. Read-replicas handle read traffic; writes go only to primary shards.
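A sketch of the online read path under these conventions. The key format and MessagePack encoding follow the description above; the cluster endpoint and the use of redis-py's cluster client are assumptions:

    import msgpack
    from redis.cluster import RedisCluster  # redis-py cluster client

    rc = RedisCluster(host="feature-redis", port=6379)  # hypothetical endpoint

    def get_online_features(feature_group_id: str, entity_ids: list[str]) -> dict:
        """Batch-read the latest feature vectors for a set of entities."""
        keys = [f"{feature_group_id}:{eid}" for eid in entity_ids]
        # mget_nonatomic groups keys by hash slot: one round trip per shard.
        raw = rc.mget_nonatomic(keys)
        return {
            eid: msgpack.unpackb(blob) if blob is not None else None
            for eid, blob in zip(entity_ids, raw)
        }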

Offline Store (S3 + Parquet + Iceberg)

The offline store is an Iceberg table per feature group: (entity_id, feature_timestamp, feature_1, ..., feature_N). Batch pipelines append new snapshots daily; streaming pipelines write micro-batches every 5 minutes. Partition by feature_date, with Iceberg hidden partitioning (e.g., a bucket transform on entity_id) for efficient entity lookups. Iceberg time travel (SELECT * FROM feature_group FOR SYSTEM_TIME AS OF '{snapshot_timestamp}') pins a consistent table snapshot so dataset generation is reproducible; note that time travel selects by commit time, so per-row point-in-time correctness still comes from the AS-OF join on feature_timestamp described above. For training jobs reading 100 million rows, Spark reads the Iceberg snapshot in parallel across 1,000 tasks.
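A sketch of the daily batch append and a snapshot-pinned read, assuming a SparkSession with an Iceberg catalog configured; catalog, table, and path names are illustrative:

    # Daily batch append: write today's computed snapshot to the Iceberg table.
    features_df = spark.read.parquet("s3://pipelines/user_purchase_stats/dt=2025-01-15/")

    (features_df
        .writeTo("catalog.features.user_purchase_stats")
        .append())  # atomic, snapshot-isolated Iceberg commit

    # Time travel pins a consistent table snapshot for reproducible reads;
    # per-row point-in-time correctness still comes from the AS-OF join.
    snap_df = spark.sql("""
        SELECT * FROM catalog.features.user_purchase_stats
        FOR SYSTEM_TIME AS OF '2025-01-15 00:00:00'
    """)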

Training-Serving Skew Detector

A nightly job computes distribution statistics (mean, stddev, histogram) for each feature in both the offline training set (last 30 days) and online store (sample of 10,000 recent online reads). Jensen-Shannon divergence is computed per feature; features with JS divergence > 0.1 trigger a skew alert to the feature owner and the downstream model owner. Common causes of skew are identified automatically: null rate differences (pipeline failures), value range shifts (upstream data changes), or temporal mismatch (label timestamp incorrectly set).
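A sketch of the per-feature divergence check. The 0.1 threshold comes from the description above; the bin count and function names are illustrative:

    import numpy as np

    def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
        """Jensen-Shannon divergence between two histograms (base 2, in [0, 1])."""
        p = p / p.sum()
        q = q / q.sum()
        m = 0.5 * (p + q)
        def kl(a, b):
            mask = a > 0  # 0 * log(0) contributes nothing
            return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def check_skew(offline_values, online_values, threshold=0.1):
        """Histogram both samples on shared bins; flag features that drifted."""
        edges = np.histogram_bin_edges(
            np.concatenate([offline_values, online_values]), bins=50)
        p, _ = np.histogram(offline_values, bins=edges)
        q, _ = np.histogram(online_values, bins=edges)
        d = js_divergence(p.astype(float), q.astype(float))
        return d, d > threshold  # (divergence, alert?)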

Database Design

Feature Catalog (PostgreSQL): feature_groups (group_id, name, entity_type, batch_source_config, streaming_source_config, ttl_seconds, created_by, version), features (feature_id, group_id, name, dtype, default_value, description), entity_types (type_id, name, join_key_column). Feature freshness metadata is tracked in Redis: {group_id}:last_updated → timestamp, enabling staleness checks on every online read. A feature_usage table logs which models consume which feature groups, enabling impact analysis for feature deprecations.
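The core catalog tables written out as a bootstrap sketch, applied via psycopg2; column types are assumptions, and the entity_types and feature_usage tables are omitted for brevity:

    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS feature_groups (
        group_id                BIGSERIAL PRIMARY KEY,
        name                    TEXT NOT NULL UNIQUE,
        entity_type             TEXT NOT NULL,
        batch_source_config     JSONB,
        streaming_source_config JSONB,
        ttl_seconds             INTEGER NOT NULL,
        created_by              TEXT NOT NULL,
        version                 INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE IF NOT EXISTS features (
        feature_id    BIGSERIAL PRIMARY KEY,
        group_id      BIGINT REFERENCES feature_groups(group_id),
        name          TEXT NOT NULL,
        dtype         TEXT NOT NULL,
        default_value TEXT,
        description   TEXT
    );
    """

    with psycopg2.connect("dbname=feature_catalog") as conn:  # hypothetical DSN
        with conn.cursor() as cur:
            cur.execute(DDL)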

API Design

  • GET /feature-groups/{group_id}/online?entity_ids=id1,id2 — Batch online feature lookup; returns feature vectors for up to 1,000 entity IDs in a single call.
  • POST /feature-groups/{group_id}/training-dataset — Submit a training dataset request with entity IDs and label timestamps; returns a job ID for async dataset generation.
  • POST /feature-groups — Register a new feature group with entity type, feature schema, and pipeline configuration.
  • GET /feature-groups/{group_id}/skew-report — Return the latest training-serving skew report with per-feature divergence scores.
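A sketch of how a model service might call these endpoints with the requests library; the host name, payload shapes, and timeout are assumptions:

    import requests

    BASE = "https://feature-store.internal"  # hypothetical host

    # Online lookup: one HTTP call, up to 1,000 entity IDs per request.
    resp = requests.get(
        f"{BASE}/feature-groups/user_purchase_stats/online",
        params={"entity_ids": "u123,u456"},
        timeout=0.05,  # fail fast; the serving budget is 10 ms at p99
    )
    vectors = resp.json()  # e.g. {"u123": {...features...}, "u456": {...}}

    # Offline: submit an async training-dataset job, then poll by job ID.
    job = requests.post(
        f"{BASE}/feature-groups/user_purchase_stats/training-dataset",
        json={"entities": [
            {"entity_id": "u123", "label_timestamp": "2025-01-10T00:00:00Z"},
        ]},
    ).json()
    print(job["job_id"])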

Scaling & Bottlenecks

Online store throughput at 1 million requests/second requires a Redis Cluster with 20+ shards. The bottleneck is network bandwidth: at 4 KB per feature vector and 1 million reads/second = 4 GB/s. Redis Cluster spreads this across 20 nodes at 200 MB/s each — within 10GbE network capacity. Client-side caching (Caffeine or Guava cache in the model gateway, 10,000-entity LRU with 5-second TTL) eliminates 40% of Redis reads for hot entities.
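The client-side cache can be as simple as a TTL-bounded LRU wrapped around the Redis read path; cachetools is shown here as one option, with the size and TTL from the text:

    from cachetools import TTLCache, cached

    # 10,000-entity LRU with a 5-second TTL, as sized above.
    _cache = TTLCache(maxsize=10_000, ttl=5)

    @cached(_cache)
    def cached_feature_vector(feature_group_id: str, entity_id: str):
        # On a miss, fall through to the Redis read path
        # (get_online_features from the Online Store sketch above).
        return get_online_features(feature_group_id, [entity_id])[entity_id]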

Offline point-in-time join performance depends on the AS-OF join implementation. Naive Spark implementations scan the entire offline store for each label timestamp; optimized implementations sort by (entity_id, feature_timestamp) and use a merge-join strategy that processes each entity's history in a single pass, reducing complexity from O(NM) to O(N+M). Iceberg's partition pruning further reduces data read by filtering to only the relevant time range.
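A minimal single-pass merge over one entity's sorted history illustrates the O(N+M) strategy; a production version would run as a Spark job over data sorted by (entity_id, feature_timestamp):

    def as_of_join(labels, snapshots):
        """Single pass over one entity's sorted data.

        labels:    [(label_timestamp, label)], sorted ascending
        snapshots: [(feature_timestamp, feature_vector)], sorted ascending
        Returns the feature vector current at each label timestamp.
        """
        out, i, current = [], 0, None
        for label_ts, label in labels:
            # Advance to the last snapshot at or before this label timestamp.
            while i < len(snapshots) and snapshots[i][0] <= label_ts:
                current = snapshots[i][1]
                i += 1
            out.append((label_ts, label, current))  # None if no snapshot exists yet
        return out

Because both inputs are consumed in order and the snapshot pointer never rewinds, each row is touched exactly once, which is where the O(N+M) bound comes from.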

Key Trade-offs

  • Redis vs. DynamoDB for online store: Redis provides sub-millisecond latency with rich data structures; DynamoDB offers managed scaling and multi-region replication at higher latency (5–10ms) — the right choice depends on latency requirements and operational preferences.
  • Batch vs. streaming feature computation: Streaming provides fresh features in seconds but requires a more complex pipeline (Flink); batch features are simpler but hours stale, degrading models that depend on recent user behavior.
  • Shared vs. dedicated online store: A shared Redis cluster for all feature groups reduces operational overhead; dedicated clusters per feature group provide isolation and independent scaling.
  • Pre-joining features at write time vs. join at read time: Pre-joined feature vectors reduce online latency (single key lookup) but require recomputing the full vector when any feature changes; joining at read time allows independent feature updates but adds latency proportional to the number of feature groups joined.
