System Design: ML-based Fraud Detection
Design a real-time ML-based fraud detection system that evaluates transactions in milliseconds, combining rule engines with gradient boosting and neural network models. Covers feature engineering, online learning, explainability, and the feedback loop for model improvement.
Requirements
Functional Requirements:
- Score every transaction in real time and return a fraud probability score within 100ms
- Block high-confidence fraud (score > 0.95) and apply a step-up challenge for medium-confidence scores (0.6–0.95)
- Compute real-time features: transaction velocity, device fingerprint, behavioral biometrics
- Maintain a rule engine for hard-coded business rules (e.g., block transactions from sanctioned countries)
- Provide human review queue for borderline cases with model explanations
- Feed fraud labels (confirmed fraud, confirmed legitimate) back into the model retraining pipeline
Non-Functional Requirements:
- Decision latency under 100ms at the 99th percentile (payment networks often require under 50ms)
- False positive rate under 0.5% (blocking 1 in 200 legitimate transactions harms UX)
- False negative rate under 0.1% (missing at most 1 in 1,000 fraudulent transactions)
- System must process 10,000 transactions per second at peak (Black Friday, Cyber Monday)
- Model retrained at least daily with last 30 days of labeled transactions
Scale Estimation
10,000 TPS peak. Each transaction evaluation: fetch ~100 real-time features from Redis (velocity counters, device signals) in 5ms, score via gradient boosting model in 2ms, return decision. Redis must sustain 1 million reads/second (100 features * 10,000 TPS). Feature computation for velocity (transactions per user per hour) uses Redis INCR commands on sliding window counters. Model inference: an XGBoost model exported to ONNX scores 100,000 samples/second on a single CPU core, so 10,000 TPS needs only a fraction of one core, leaving ample headroom.
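The capacity figures above can be sanity-checked with simple arithmetic. A minimal sketch, using the article's assumed throughput numbers (not measurements):

```python
# Back-of-envelope capacity check using the figures assumed above.
PEAK_TPS = 10_000
FEATURES_PER_TX = 100
REDIS_OPS_PER_NODE = 100_000     # assumed per-node Redis throughput
XGB_SCORES_PER_CORE = 100_000    # ONNX-exported XGBoost, single core

redis_reads_per_sec = PEAK_TPS * FEATURES_PER_TX               # 1,000,000
redis_nodes_needed = redis_reads_per_sec // REDIS_OPS_PER_NODE  # 10 nodes
scoring_cores_needed = PEAK_TPS / XGB_SCORES_PER_CORE           # 0.1 cores

print(redis_reads_per_sec, redis_nodes_needed, scoring_cores_needed)
```

The feature store, not model inference, dominates the hardware budget: scoring needs a tenth of a core while feature reads need a ten-node Redis cluster.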
High-Level Architecture
The fraud decision pipeline has three parallel paths that execute concurrently: (1) Rule Engine — hard-coded deterministic rules (blocked card, impossible geography, sanctioned merchant), (2) ML Scoring — gradient boosting model on tabular features, and (3) Network Graph Engine — real-time graph analytics detecting account network anomalies. Results from all three are combined by a Decision Fusion layer that applies business logic (max score, rule overrides) to produce the final decision.
A real-time feature computation layer maintains sliding window aggregates (transaction counts, amounts, unique merchants in the last 1/5/24 hours) in Redis. When a transaction arrives, feature fetching and rule checking run in parallel; ML scoring follows immediately after features are assembled. A response is guaranteed within 100ms; if the network graph engine (which is slower) hasn't responded, the decision is made without graph features and the graph result is used for model retraining only.
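The fusion logic described above can be sketched as a small function: rules act as hard stops, ML and graph scores combine via max, and a graph score that missed the deadline is simply absent. Thresholds match the requirements (block > 0.95, challenge 0.6–0.95); the function and field names are illustrative, not a prescribed interface:

```python
from typing import Optional

def fuse_decision(rule_hit: bool, ml_score: float,
                  graph_score: Optional[float]) -> tuple[str, float]:
    """Combine rule engine, ML, and (optional) graph results into a decision."""
    if rule_hit:                      # deterministic rules override everything
        return "BLOCK", 1.0
    scores = [ml_score]
    if graph_score is not None:       # graph engine may miss the 100ms deadline
        scores.append(graph_score)
    score = max(scores)               # business logic: take the max risk signal
    if score > 0.95:
        return "BLOCK", score
    if score >= 0.6:
        return "CHALLENGE", score
    return "APPROVE", score
```

Passing `None` for the graph score models the timeout path: the decision falls back to rules plus ML alone.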
A feedback pipeline processes chargeback events, customer disputes, and manual review outcomes to produce fraud labels. These labels feed into the training dataset. A champion-challenger framework runs a new model version at 5% traffic for live evaluation before full promotion. Model drift detection (PSI on feature distributions, AUC monitoring on labeled transactions) triggers automatic retraining when degradation is detected.
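PSI, the drift metric mentioned above, compares a feature's binned distribution in production against the training baseline. A minimal implementation (the 0.2 alert threshold is a common rule of thumb, not a value from this design):

```python
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index over matched histogram bins.

    expected: bin proportions from the training (baseline) window.
    actual:   bin proportions from recent production traffic.
    Zero-mass bins are floored to avoid log(0). PSI > 0.2 is a
    common rule-of-thumb signal of significant drift.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, 1e-6)
        a = max(a, 1e-6)
        total += (a - e) * math.log(a / e)
    return total
```

An identical distribution yields PSI of exactly 0; a shift of mass toward the upper bins, e.g. `psi([0.25]*4, [0.1, 0.2, 0.3, 0.4])`, lands above 0.2 and would trigger retraining.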
Core Components
Real-Time Feature Store
Velocity features (counts and amounts in sliding windows) are maintained in Redis using a combination of INCR (exact counts for short windows) and Count-Min Sketch (approximate counts for long-horizon large cardinality windows). Features include: transaction count per user in last 1h/24h, total spend per user in last 7 days, number of distinct merchants in last 24h, number of distinct countries in last 30 days. Device fingerprint features (browser hash, IP reputation score) are looked up from a separate Redis hash updated by a device intelligence service.
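The bucketed-counter pattern behind the short-window velocity features can be sketched in-process. In production the counters live in Redis under keys like `vel:{user_id}:{bucket}` with INCR and a TTL; this hedged stand-in mimics that key scheme with a dict so the logic is visible:

```python
from collections import defaultdict

class BucketedVelocityCounter:
    """In-process sketch of Redis sliding-window counters: one counter
    per (user, time-bucket); a window count sums the live buckets.
    Production equivalent: INCR vel:{user}:{bucket} + EXPIRE."""

    def __init__(self, bucket_secs: int = 60):
        self.bucket_secs = bucket_secs
        self.counters: dict[tuple[str, int], int] = defaultdict(int)

    def record(self, user_id: str, now: float) -> None:
        bucket = int(now) // self.bucket_secs
        self.counters[(user_id, bucket)] += 1   # Redis: INCR

    def count(self, user_id: str, window_secs: int, now: float) -> int:
        current = int(now) // self.bucket_secs
        n_buckets = window_secs // self.bucket_secs
        return sum(self.counters.get((user_id, current - i), 0)
                   for i in range(n_buckets))
```

Bucketing trades a little precision at the window edge for O(1) writes and O(window/bucket) reads, which is why exact INCR counters work for short windows while long-horizon, high-cardinality windows fall back to Count-Min Sketch.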
ML Scoring Service
The primary model is an XGBoost gradient boosting classifier trained on 30 days of labeled transactions with ~300 features. The model is exported to ONNX format for language-agnostic serving at maximum throughput. A secondary neural network model (TabNet or a shallow MLP) runs in parallel for ensemble scoring; the ensemble reduces false negatives by 15% vs. XGBoost alone. SHAP values are computed for the top 10 most influential features per transaction, providing the explanation text displayed to fraud analysts in the review queue.
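The ensemble combination itself can be as simple as a weighted average of the two model probabilities. A minimal sketch; the 0.7/0.3 weights are illustrative assumptions that would in practice be tuned on a validation set:

```python
def ensemble_score(xgb_prob: float, nn_prob: float,
                   xgb_weight: float = 0.7) -> float:
    """Weighted average of the XGBoost and neural-net fraud
    probabilities; the weight reflects relative validation AUC."""
    return xgb_weight * xgb_prob + (1.0 - xgb_weight) * nn_prob
```

The fused probability then feeds the same decision thresholds as a single-model score, so the serving path is unchanged when a model is added to or dropped from the ensemble.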
Graph Anomaly Detection
A real-time graph engine (TigerGraph or a custom graph backed by Redis adjacency lists) maintains entity relationships: user → device, user → IP, user → bank_account, merchant → acquiring_bank. Fraud rings exhibit dense subgraphs (many new users sharing devices or IPs). A streaming graph algorithm computes degree centrality and community membership for new nodes as they join the graph. Sudden increases in node degree (a device used by 50 new accounts in 1 hour) score high on the graph risk signal.
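The degree-spike signal described above reduces to tracking distinct accounts per device and normalizing against a threshold. A hedged in-memory sketch (the 50-account threshold comes from the text; windowing and decay are omitted for brevity):

```python
from collections import defaultdict

class DeviceDegreeMonitor:
    """Sketch of the graph risk signal: a device suddenly linked to
    many distinct accounts scores high. Production would bound this
    to a time window (e.g. 1 hour) and decay old edges."""

    def __init__(self, spike_threshold: int = 50):
        self.spike_threshold = spike_threshold
        self.device_accounts: dict[str, set[str]] = defaultdict(set)

    def observe(self, device_id: str, user_id: str) -> float:
        """Record the user->device edge; return a risk signal in [0, 1]."""
        self.device_accounts[device_id].add(user_id)
        degree = len(self.device_accounts[device_id])
        return min(degree / self.spike_threshold, 1.0)
```

A device seen with one account scores 0.02; one seen with 50+ accounts saturates at 1.0 and would dominate the max-based decision fusion.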
Database Design
PostgreSQL stores transaction records and fraud labels: transactions (tx_id, user_id, merchant_id, amount, currency, device_id, ip_address, timestamp, fraud_score, decision, model_version), fraud_labels (tx_id, label_type ENUM, labeled_by, labeled_at, confidence). Redis stores real-time features: vel:{user_id}:{window} → count/amount. A separate Cassandra cluster stores the full transaction history for offline feature computation (30-day lookback for training).
API Design
POST /transactions/score — Submit a transaction for fraud scoring; returns fraud_probability, decision (APPROVE/CHALLENGE/BLOCK), and top-3 risk reasons.
POST /labels — Submit a fraud label (FRAUD/LEGITIMATE) for a transaction after manual review or chargeback confirmation.
GET /transactions/{tx_id}/explanation — Return SHAP-based feature importance explanation for the fraud decision.
GET /models/current/performance — Return real-time AUC, precision, recall, and false-positive rate metrics for the production model.
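To make the scoring endpoint concrete, here are illustrative request and response payloads for POST /transactions/score. The field names are assumptions consistent with the schema and endpoints above, not a fixed contract:

```python
# Hypothetical payload shapes for POST /transactions/score.
score_request = {
    "tx_id": "tx_18f3a",
    "user_id": "u_9921",
    "merchant_id": "m_105",
    "amount": 249.99,
    "currency": "USD",
    "device_id": "d_7ac2",
    "ip_address": "203.0.113.7",
}
score_response = {
    "fraud_probability": 0.72,
    "decision": "CHALLENGE",     # one of APPROVE / CHALLENGE / BLOCK
    "risk_reasons": [            # top-3 SHAP-derived feature reasons
        "high_velocity_24h",
        "new_device_for_user",
        "ip_reputation_low",
    ],
}
```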
Scaling & Bottlenecks
Redis throughput at 1 million reads/second is well within Redis Cluster capabilities (each node handles 100,000 ops/second; a 10-node cluster handles 1 million). The bottleneck is network round-trip time for feature fetching: 100 Redis keys fetched in a single pipelined command return in under 2ms on a local network. Pipelining all feature fetches into one Redis PIPELINE command (vs. 100 individual GETs) reduces round trips from 100 to 1, cutting feature-fetch latency by roughly 50x.
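The ~50x figure follows from simple latency arithmetic: sequential GETs pay one round trip each, while a pipelined batch pays one round trip total plus per-operation server time. A sketch with illustrative (assumed) numbers:

```python
# Latency model for 100 feature fetches, sequential vs. pipelined.
RTT_MS = 0.5            # assumed intra-datacenter round trip
N_FEATURES = 100
SERVER_US_PER_OP = 5    # assumed Redis-side cost per GET, microseconds

# Sequential: every GET pays a full round trip.
sequential_ms = N_FEATURES * (RTT_MS + SERVER_US_PER_OP / 1000)
# Pipelined: one round trip, server still processes all 100 ops.
pipelined_ms = RTT_MS + N_FEATURES * SERVER_US_PER_OP / 1000

print(f"sequential: {sequential_ms:.1f} ms, pipelined: {pipelined_ms:.1f} ms")
```

With these assumptions the sequential path costs ~50ms, which alone would blow the 100ms budget, while the pipelined path stays around 1ms.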
Model retraining on 30 days of labeled data (~300 million transactions) takes 4 hours on a 32-core machine for XGBoost. Incremental learning (warm-starting from the previous model and training only on recent data) reduces retraining time to 30 minutes while maintaining model quality. An A/B testing framework ensures new models improve on the production baseline before full deployment.
Key Trade-offs
- Rule engine vs. pure ML: Rules are interpretable, fast, and immediately enforceable for known fraud patterns; ML generalizes to unknown patterns but is a black box — a hybrid approach uses rules as hard stops and ML for nuanced probabilistic scoring.
- False positives vs. false negatives: Tuning the decision threshold trades off customer friction (false positive blocks) vs. fraud losses (false negative misses); the optimal threshold depends on the unit economics of each error type.
- Batch vs. online model updates: Batch retraining daily is simpler but allows concept drift during the day; online learning (incremental gradient updates after each label) adapts in real time but risks instability from adversarial feedback loops where fraudsters probe the model.
- Global vs. per-merchant models: A global model benefits from cross-merchant signal sharing; per-merchant models capture merchant-specific fraud patterns but require sufficient per-merchant data volume and increase model management overhead.