System Design: Fraud Detection System
Design a real-time fraud detection system for financial transactions using ML scoring, rule engines, behavioral analytics, and device fingerprinting to block fraudulent payments while minimizing false positives.
Requirements
Functional Requirements:
- Score every payment transaction in real-time for fraud risk before authorization
- Apply rule-based policies (velocity checks, amount thresholds, geo-fencing) configurable by risk analysts
- Behavioral analytics comparing current transaction against user's historical patterns
- Device fingerprinting and session risk scoring to detect account takeover
- Case management system for manual review of flagged transactions
- Feedback loop: fraud investigators mark transactions as confirmed fraud or false positive to retrain models
Non-Functional Requirements:
- Score 15,000 transactions/sec with p99 latency under 100ms (synchronous in the payment path)
- False positive rate below 0.5% to minimize legitimate customer friction
- Fraud detection rate above 95% for known fraud patterns
- Model update deployment within 4 hours of new fraud pattern detection
- 99.99% availability — fraud service downtime means either blocking all payments or letting fraud through
Scale Estimation
At 15,000 TPS, that is 1.3B transactions/day requiring fraud scoring. Each scoring request involves: feature extraction (20-30 features computed in real-time), model inference (an ensemble of specialized models combined by a meta-model), rule engine evaluation (200+ rules), and decision logging. Feature extraction queries user history (last 30 days of transactions = ~100 rows average per user), the device fingerprint database (~500M device records), and the IP reputation database. CPU-optimized model inference takes ~15ms and rule engine evaluation ~10ms, against a total latency budget of 100ms. The Feature Store must serve 15K lookups/sec with sub-10ms p99 latency. Decision logs produce 1.3B records/day = ~2TB/day at 1.5KB per record.
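These figures follow directly from the stated rates; a quick arithmetic check in plain Python, using no assumptions beyond the numbers above:

```python
# Back-of-the-envelope check for the estimates above.
TPS = 15_000
SECONDS_PER_DAY = 86_400
RECORD_BYTES = 1_500                          # ~1.5KB per decision log record

daily_txns = TPS * SECONDS_PER_DAY            # 1,296,000,000 ~= 1.3B/day
daily_log_bytes = daily_txns * RECORD_BYTES   # ~1.94e12 bytes ~= 2TB/day

print(f"{daily_txns / 1e9:.2f}B transactions/day")
print(f"{daily_log_bytes / 1e12:.2f}TB of decision logs/day")
```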
High-Level Architecture
The fraud detection system sits inline in the payment authorization path — the payment gateway calls the Fraud Service synchronously before sending the authorization request to the card network. The architecture has two planes: the Real-Time Scoring Plane (handles live transaction scoring) and the Offline Training Plane (handles model training, feature engineering, and analytics).
The Real-Time Scoring Plane receives a transaction scoring request and orchestrates three parallel evaluations: (1) the ML Scoring Service runs the transaction features through an ensemble model (gradient-boosted trees via XGBoost for tabular features + a neural network for sequence features like transaction history patterns); (2) the Rule Engine evaluates deterministic rules configured by fraud analysts (e.g., "block if >5 transactions in 1 minute from same card" or "flag if transaction country differs from cardholder country"); (3) the Device Intelligence Service checks the device fingerprint against known fraud device clusters. The results are combined by a Decision Aggregator that produces a final decision: APPROVE, DECLINE, or REVIEW (sent to manual queue).
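A minimal sketch of this fan-out, assuming stand-in stubs for the three component calls and illustrative decision thresholds; the real aggregator would call the ML, rule, and device services over RPC:

```python
import asyncio
import random

# Stand-in stubs for the three component services; in production these
# would be RPC clients for the ML Scoring, Rule Engine, and Device
# Intelligence services.
async def score_ml(txn):
    await asyncio.sleep(0.02)            # ~15-25ms model inference
    return ("ml", random.random())

async def evaluate_rules(txn):
    await asyncio.sleep(0.01)            # ~10ms rule evaluation
    return ("rules", 0.0)

async def check_device(txn):
    await asyncio.sleep(0.01)            # device-fingerprint lookup
    return ("device", 0.1)

async def score_transaction(txn, timeout_s=0.08):
    tasks = [asyncio.create_task(c(txn))
             for c in (score_ml, evaluate_rules, check_device)]
    # Components that miss the 80ms budget are cancelled and excluded;
    # whatever finished in time still yields a decision.
    done, pending = await asyncio.wait(tasks, timeout=timeout_s)
    for t in pending:
        t.cancel()
    signals = dict(t.result() for t in done if t.exception() is None)
    risk = max(signals.values(), default=0.0)
    if risk > 0.9:                       # thresholds are illustrative
        return "DECLINE", signals
    return ("REVIEW" if risk > 0.6 else "APPROVE"), signals

print(asyncio.run(score_transaction({"amount": 120.0})))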
The Offline Training Plane runs on Spark/Databricks. Daily batch jobs compute aggregate features (user spending patterns, merchant fraud rates, card velocity profiles) and write them to the Feature Store (Redis cluster for online serving, Delta Lake for offline training). Model retraining runs weekly on labeled data (confirmed fraud + confirmed legitimate transactions) with continuous evaluation against a holdout set. Challenger models are deployed via A/B testing with shadow scoring before full rollout.
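A sketch of what one daily batch-feature job might look like, assuming a Delta-enabled Spark session and a transactions table with user_id, amount, and ts columns (paths and column names are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-user-features").getOrCreate()

# Assumed input location and schema; adjust to the actual lake layout.
txns = spark.read.format("delta").load("/lake/transactions")

features = (
    txns.where(F.col("ts") >= F.date_sub(F.current_date(), 30))
        .groupBy("user_id")
        .agg(F.avg("amount").alias("avg_amount_30d"),
             F.count("*").alias("txn_count_30d"))
)

# Written to the offline store; a separate loader pushes rows into Redis.
features.write.mode("overwrite").format("delta").save("/lake/user_features")
```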
Core Components
Feature Store & Real-Time Feature Engine
The Feature Store is the backbone of the scoring system, serving pre-computed and real-time features at sub-10ms latency. Pre-computed features (user's average transaction amount, preferred merchants, typical transaction times) are calculated by daily Spark batch jobs and loaded into a Redis cluster keyed by user_id. Real-time features (transaction count in last 5 minutes, cumulative amount in last hour) are maintained by a Flink streaming job consuming transaction events from Kafka. Flink uses sliding-window aggregations and writes results to Redis with TTL-based expiration. The Feature Engine assembles a 30-dimensional feature vector for each transaction by querying Redis in a single pipelined round trip.
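A sketch of the assembly step, assuming the key patterns from Database Design below (a user:{user_id}:features hash plus velocity:{user_id}:{window} sorted sets) and illustrative field names; all reads share one pipelined round trip:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def assemble_features(user_id: str, now_ms: int) -> dict:
    pipe = r.pipeline(transaction=False)  # batch all reads in one round trip
    pipe.hgetall(f"user:{user_id}:features")                      # batch features
    pipe.zcount(f"velocity:{user_id}:5m", now_ms - 300_000, now_ms)
    pipe.zcount(f"velocity:{user_id}:1h", now_ms - 3_600_000, now_ms)
    batch, txns_5m, txns_1h = pipe.execute()
    return {**batch, "txn_count_5m": txns_5m, "txn_count_1h": txns_1h}
```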
ML Scoring Service
The scoring service runs an ensemble of models: (1) an XGBoost model trained on tabular features (amount, merchant category, time of day, velocity metrics) providing a calibrated fraud probability; (2) an LSTM neural network processing the user's last 50 transactions as a sequence to detect anomalous behavior shifts; (3) a graph neural network analyzing the transaction network (sender-receiver relationships) to detect fraud rings. Each model runs independently on dedicated CPU-optimized pods (ONNX Runtime for XGBoost, TensorFlow Serving for neural models). The ensemble combines scores using a logistic regression meta-model. Model serving infrastructure uses Kubernetes with autoscaling based on request queue depth, maintaining 3x headroom for traffic spikes.
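A sketch of the stacking step, with synthetic training data standing in for held-out base-model scores on labeled transactions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
labels = rng.random(n) < 0.01                       # ~1% fraud base rate
# Fake base-model scores, noisily correlated with the label; in production
# these would be the XGBoost, LSTM, and GNN outputs on a holdout set.
scores = np.clip(labels[:, None] * 0.6 + rng.random((n, 3)) * 0.5, 0, 1)

meta = LogisticRegression()
meta.fit(scores, labels)

def ensemble_score(xgb_p: float, lstm_p: float, gnn_p: float) -> int:
    p = meta.predict_proba([[xgb_p, lstm_p, gnn_p]])[0, 1]
    return int(p * 1000)        # map to the API's 0-1000 fraud_score

print(ensemble_score(0.8, 0.7, 0.9))
```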
Rule Engine
The Rule Engine evaluates deterministic fraud rules written in a DSL (domain-specific language) by fraud analysts via a web-based rule editor. Rules are compiled into an optimized decision tree at deployment time for fast evaluation. Examples: velocity rules ("decline if card used at >3 different merchants in 10 minutes"), amount rules ("flag if transaction >10x user's average"), geo rules ("decline if transaction in high-risk country and card was used domestically 30 minutes ago"). Rules support A/B testing — new rules run in shadow mode (score but don't block) for 48 hours before enforcement. The rule engine evaluates 200+ rules in <10ms using short-circuit evaluation on the compiled decision tree.
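An illustrative sketch of post-compilation rule evaluation with short-circuiting; in this simplified form each compiled rule is a plain predicate over the feature dict rather than a node in the shared decision tree:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    rule_id: str
    action: str                         # DECLINE / FLAG / SHADOW
    predicate: Callable[[dict], bool]   # compiled from the DSL expression

RULES = [
    Rule("velocity_3_merchants_10m", "DECLINE",
         lambda f: f["distinct_merchants_10m"] > 3),
    Rule("amount_10x_average", "FLAG",
         lambda f: f["amount"] > 10 * f["avg_amount_30d"]),
]

def evaluate(features: dict) -> tuple[str, list[str]]:
    triggered = []
    for rule in RULES:                   # assumed sorted by priority
        if rule.predicate(features):
            triggered.append(rule.rule_id)
            if rule.action == "DECLINE":  # short-circuit on a hard decline
                return "DECLINE", triggered
    return ("REVIEW" if triggered else "APPROVE"), triggered

print(evaluate({"distinct_merchants_10m": 1, "amount": 900.0,
                "avg_amount_30d": 50.0}))
```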
Database Design
The Feature Store uses Redis Cluster (30 nodes) with two key patterns: user features stored as Redis hashes user:{user_id}:features containing 20+ pre-computed fields, and sliding window counters stored as sorted sets velocity:{user_id}:{window} with timestamp scores for real-time aggregation. TTL on velocity keys is set to 24 hours. The decision log uses Kafka as the primary store with a ClickHouse consumer for analytical queries. ClickHouse schema: transaction_id, user_id, merchant_id, amount, fraud_score, model_scores (Array Float32), rules_triggered (Array String), decision, latency_ms, timestamp. Partitioned by day with a 2-year retention policy.
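The write side of the velocity pattern might look like the following sketch (redis-py; member = transaction id, score = timestamp, stale entries trimmed and the 24-hour TTL refreshed on each write):

```python
import time
import redis

r = redis.Redis(decode_responses=True)

def record_transaction(user_id: str, txn_id: str) -> None:
    now_ms = int(time.time() * 1000)
    key = f"velocity:{user_id}:24h"
    pipe = r.pipeline(transaction=False)
    pipe.zadd(key, {txn_id: now_ms})                          # score = timestamp
    pipe.zremrangebyscore(key, 0, now_ms - 24 * 3600 * 1000)  # drop stale entries
    pipe.expire(key, 24 * 3600)                               # refresh 24h TTL
    pipe.execute()
```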
Labeled fraud cases are stored in PostgreSQL: case_id, transaction_id, reported_by (CUSTOMER/INTERNAL/CHARGEBACK), fraud_type (CARD_STOLEN/ACCOUNT_TAKEOVER/FRIENDLY_FRAUD), investigation_status, resolved_at, analyst_id. This labeled dataset is the training data source for model retraining.
API Design
- POST /v1/fraud/score — Synchronous scoring endpoint called by the payment gateway; body contains transaction details (amount, currency, merchant, card_token, device_fingerprint, ip_address); returns decision (APPROVE/DECLINE/REVIEW), fraud_score (0-1000), triggered_rules array, latency_ms
- POST /v1/fraud/feedback — Submit fraud confirmation or false positive; body contains transaction_id, label (FRAUD/LEGITIMATE), fraud_type, reporter
- GET /v1/fraud/cases?status=open&analyst={id}&page={n} — List fraud cases for the manual review queue with filtering
- PUT /v1/fraud/rules/{rule_id} — Create or update a fraud rule; body contains the rule DSL expression, action (DECLINE/FLAG/SHADOW), priority
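A hypothetical call to the scoring endpoint; the host and field values are placeholders matching the shapes listed above:

```python
import requests

resp = requests.post(
    "https://fraud.internal.example.com/v1/fraud/score",
    json={
        "amount": 249.99,
        "currency": "USD",
        "merchant": "m_48211",
        "card_token": "tok_example",
        "device_fingerprint": "fp_example",
        "ip_address": "203.0.113.7",
    },
    timeout=0.1,   # the caller enforces the 100ms budget as well
)
decision = resp.json()
# e.g. {"decision": "APPROVE", "fraud_score": 112,
#       "triggered_rules": [], "latency_ms": 43}
```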
Scaling & Bottlenecks
The 100ms latency budget is the binding constraint. Network calls to Redis for feature retrieval consume 5-10ms; model inference 15-25ms; rule evaluation 5-10ms; overhead (serialization, logging) 10-15ms. To stay within budget, all three scoring components (ML, rules, device) run in parallel with an 80ms timeout — if any component exceeds the timeout, its result is excluded and the remaining components make the decision. The Feature Store Redis cluster is the most latency-sensitive dependency — a Redis node failure increases p99 latency by 20ms due to cluster redirections. This is mitigated with read replicas and client-side caching of hot keys (top 10K most active users).
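A minimal sketch of the hot-key cache, assuming a short TTL keeps cached features acceptably fresh; a library such as cachetools.TTLCache provides equivalent behavior:

```python
import time

class HotKeyCache:
    """Bounded in-process cache for the hottest user feature keys."""

    def __init__(self, max_items: int = 10_000, ttl_s: float = 5.0):
        self.max_items, self.ttl_s = max_items, ttl_s
        self._d = {}                       # key -> (expires_at, value)

    def get(self, key):
        hit = self._d.get(key)
        if hit and hit[0] > time.monotonic():
            return hit[1]
        self._d.pop(key, None)             # expired or missing
        return None

    def put(self, key, value):
        if len(self._d) >= self.max_items:
            self._d.pop(next(iter(self._d)))   # evict oldest insertion
        self._d[key] = (time.monotonic() + self.ttl_s, value)
```

On a cache hit the Redis round trip is skipped entirely, which is what shields p99 latency from cluster redirections during node failures.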
Model serving scalability is managed by horizontal pod autoscaling on Kubernetes. XGBoost inference on CPU is efficient (~5ms per prediction), but the LSTM sequence model requires more compute (~15ms). To reduce latency, the LSTM model input is pre-computed: the user's recent transaction embedding is updated incrementally after each transaction rather than recomputing from the full sequence at scoring time.
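To make the incremental update concrete: the per-user hidden state is cached and advanced once per new transaction, so scoring reads a precomputed state instead of re-running the network over 50 transactions. The hand-rolled GRU-like cell and dimensions below are illustrative, not the production network:

```python
import numpy as np

D_IN, D_H = 16, 32                      # toy transaction-embedding / state sizes
rng = np.random.default_rng(1)
W_z, U_z = rng.standard_normal((D_H, D_IN)), rng.standard_normal((D_H, D_H))
W_h, U_h = rng.standard_normal((D_H, D_IN)), rng.standard_normal((D_H, D_H))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def step(h_prev: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Advance the cached user state by one transaction embedding x."""
    z = sigmoid(W_z @ x + U_z @ h_prev)        # update gate
    h_cand = np.tanh(W_h @ x + U_h @ h_prev)   # candidate state
    return (1 - z) * h_prev + z * h_cand

# The O(sequence) work happens once per transaction, off the critical
# path; scoring only reads the cached state h.
h = np.zeros(D_H)
for _ in range(3):                      # three new transactions arrive
    h = step(h, rng.standard_normal(D_IN))
```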
Key Trade-offs
- Inline synchronous scoring over async post-authorization review: Scoring before authorization prevents fraud in real-time but adds latency to every legitimate payment — the 100ms budget minimizes this impact while still catching fraud before money moves
- Ensemble of specialized models over single large model: Multiple models capture different fraud signals (tabular anomalies, behavioral sequences, network patterns) but increase inference latency and operational complexity — the parallel execution and timeout strategy bounds the impact
- Shadow mode for new rules before enforcement: Prevents false-positive spikes from untested rules, but delays the response to new fraud patterns by 48 hours — critical rules can bypass shadow mode with manager approval
- Pre-computed features over real-time computation: Batch-computed features are cheaper and allow complex aggregations, but can be up to 24 hours stale — supplemented with real-time Flink-computed features for time-sensitive signals like velocity