System Design: GitHub (Code Hosting Platform)

Design a code hosting platform like GitHub that manages hundreds of millions of Git repositories, supporting pull requests, code review, CI/CD integration, and collaboration for millions of developers.


Requirements

Functional Requirements:

  • Host Git repositories with push, pull, clone, and fork operations
  • Pull requests with inline code review, comments, and approvals
  • Issue tracking with labels, milestones, and assignments
  • Branch protection rules and required status checks
  • GitHub Actions integration (CI/CD triggered by repository events)
  • Package registry for code dependencies

Non-Functional Requirements:

  • Support 100 million repositories and 100 million users
  • Clone/push throughput: 10 GB/s aggregate
  • API latency: sub-200ms for most operations
  • 99.95% availability
  • Repository storage: petabyte scale
  • Git operations: strong consistency (no data loss on push)

Scale Estimation

With 100 million repositories at an average size of 50 MB each (compressed Git objects), total storage is 5 PB. With 3x replication, physical storage is 15 PB. GitHub serves ~2 billion Git operations per day (~23,000 ops/sec average). Large monorepos (e.g., the Linux kernel at ~3 GB, Chromium at ~15 GB) require special handling for clone operations. A git clone of a 1 GB repository over 1 Gbps takes ~8 seconds, which is acceptable for interactive use. With 10 million daily active developers each pushing roughly once a day, push traffic averages ~115 pushes/sec, with peaks several times higher during working hours. Pull requests number on the order of 100 million platform-wide (open and closed), with metadata stored in a relational database.
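
These figures can be sanity-checked in a few lines; the constants below are simply the assumptions stated in this section.

```go
package main

import "fmt"

func main() {
	const (
		repos        = 100_000_000    // repositories
		avgRepoBytes = 50 * 1_000_000 // 50 MB of compressed Git objects each
		replication  = 3              // copies of every repository
		gitOpsPerDay = 2_000_000_000  // clone/fetch/push operations per day
		pushesPerDay = 10_000_000     // ~1 push per daily active developer
	)

	logicalPB := float64(repos) * avgRepoBytes / 1e15
	physicalPB := logicalPB * replication

	fmt.Printf("logical storage:  %.1f PB\n", logicalPB)                       // 5.0 PB
	fmt.Printf("physical storage: %.1f PB\n", physicalPB)                      // 15.0 PB
	fmt.Printf("git ops:   %.0f ops/sec\n", float64(gitOpsPerDay)/86_400)      // ~23,148
	fmt.Printf("pushes:    %.0f pushes/sec\n", float64(pushesPerDay)/86_400)   // ~116
}
```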

High-Level Architecture

GitHub's architecture separates Git protocol handling from web application logic. Git operations (clone, push, pull, fetch) are handled by a Git service layer that speaks the Git pack protocol. Web operations (viewing files, PRs, issues) are handled by a Rails application backed by MySQL. A repository routing layer maps (user/repo) to the specific storage server where the repository lives. Repository data is stored as bare Git repositories on disk on dedicated storage nodes, fronted by a Git storage service (this design uses Gitaly, the open-source storage service built for GitLab, as its reference; GitHub's in-house equivalent is Spokes, formerly DGit).

For a git push, the client connects to a load-balanced Git frontend (HAProxy → Gitaly proxy). The proxy authenticates the push, identifies the repository, routes to the primary Gitaly node for that repository, and streams the pack data. Gitaly unpacks and validates the objects (checking object connectivity and ref updates), acquires a write lock on the repository, and writes the new objects to the Git object store. After a successful write, replication events are sent to replica Gitaly nodes. The push is confirmed to the client once the primary has durably stored the objects.
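
A minimal sketch of that proxy-side flow follows. Every type and method name here (Proxy, Router, ReceivePack, and so on) is hypothetical, standing in for the components described above rather than any actual Gitaly or Praefect API.

```go
package push

import (
	"context"
	"fmt"
	"io"
)

type Authorizer interface {
	CanPush(ctx context.Context, user, repo string) error
}

type Shard interface {
	ID() string
	// ReceivePack streams pack data to the primary, which unpacks the
	// objects, validates connectivity and ref updates, takes a repo
	// write lock, and stores everything durably before returning.
	ReceivePack(ctx context.Context, repo string, pack io.Reader) error
}

type Router interface {
	PrimaryFor(ctx context.Context, repo string) (Shard, error)
}

type Replicator interface {
	Enqueue(repo, shardID string) // asynchronous fan-out to secondaries
}

type Proxy struct {
	auth       Authorizer
	routing    Router
	replicator Replicator
}

// HandlePush sketches the proxy-side flow for a git push.
func (p *Proxy) HandlePush(ctx context.Context, user, repo string, pack io.Reader) error {
	if err := p.auth.CanPush(ctx, user, repo); err != nil {
		return fmt.Errorf("push rejected: %w", err)
	}
	shard, err := p.routing.PrimaryFor(ctx, repo) // route to the repo's primary
	if err != nil {
		return err
	}
	if err := shard.ReceivePack(ctx, repo, pack); err != nil {
		return err // nothing was committed; the client can retry safely
	}
	// Durable on the primary: acknowledge the client, replicate async.
	p.replicator.Enqueue(repo, shard.ID())
	return nil
}
```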

The web frontend is a Ruby on Rails monolith (with increasing service extraction over time). It serves rendered HTML and JSON API responses. Repository metadata (name, description, visibility, stars, watchers), PR/issue data, and user data are stored in MySQL (sharded by organization/user ID). File content and commit history for web rendering (blame, tree view, diff) are served by Gitaly RPC calls. A search service (Elasticsearch) indexes repository code and metadata for GitHub's search feature.

Core Components

Gitaly (Git Storage Service)

Gitaly is a purpose-built Git storage service, developed by GitLab and borrowed here as the storage-layer design, that replaces direct NFS mounts of Git repositories with a gRPC-based service. Gitaly exposes Git operations as RPC calls: GetBlob, GetTree, GetCommit, CreateBranch, PostReceivePack (for pushes). Internally it invokes the git binary (or libgit2). Gitaly Cluster (Praefect) provides replication: a Praefect proxy sits in front of multiple Gitaly nodes, routing writes to a primary and replicating to secondaries. Read requests can be served from any replica (primary or secondary) for load balancing. Praefect uses a PostgreSQL database to track replication state and elect new primaries on failure.
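
The RPC surface can be pictured as an interface. The method set mirrors the RPC names above, but the Go signatures and helper types are illustrative, not Gitaly's actual protobuf definitions.

```go
package gitaly

import (
	"context"
	"io"
)

// GitStorage sketches the storage service's RPC surface.
type GitStorage interface {
	GetBlob(ctx context.Context, repo, oid string) (io.ReadCloser, error)
	GetTree(ctx context.Context, repo, treeOID string) ([]TreeEntry, error)
	GetCommit(ctx context.Context, repo, revision string) (*Commit, error)
	CreateBranch(ctx context.Context, repo, name, targetOID string) error
	// PostReceivePack is the server side of `git push`.
	PostReceivePack(ctx context.Context, repo string, pack io.Reader) error
}

type TreeEntry struct {
	Mode int32  // file mode, e.g. 100644 for a regular file
	Path string // path within the tree
	OID  string // object ID of the blob or subtree
}

type Commit struct {
	OID     string
	Parents []string
	Author  string
	Message string
}
```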

Repository Sharding

With 100 million repositories, no single server can hold them all. Repositories are sharded across thousands of Gitaly nodes (storage shards), with each shard hosting tens of thousands of repositories. Shard assignment is stored in a routing table (database) mapping repository_id → shard_id. When a repository is created, it is assigned to the shard with the most available capacity. Repository migrations (moving a repo to a different shard) are handled by a background process: objects are copied to the destination shard, then traffic is atomically switched to the new shard via the routing table update.
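
A sketch of the routing table's two core operations, assuming a single hypothetical routing(repository_id, shard_id) table. The cutover is one UPDATE statement, so a reader resolves either the old shard or the new one, never a half-migrated state.

```go
package routing

import (
	"context"
	"database/sql"
)

// Table wraps the routing database described above.
type Table struct{ db *sql.DB }

// ShardFor resolves a repository to its storage shard.
func (t *Table) ShardFor(ctx context.Context, repoID int64) (string, error) {
	var shardID string
	err := t.db.QueryRowContext(ctx,
		`SELECT shard_id FROM routing WHERE repository_id = ?`, repoID,
	).Scan(&shardID)
	return shardID, err
}

// SwitchShard atomically cuts a repository over to a new shard after the
// background copy has finished.
func (t *Table) SwitchShard(ctx context.Context, repoID int64, newShard string) error {
	_, err := t.db.ExecContext(ctx,
		`UPDATE routing SET shard_id = ? WHERE repository_id = ?`,
		newShard, repoID)
	return err
}
```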

Pull Request & Code Review

Pull requests are metadata records in MySQL: (pr_id, repo_id, source_branch, target_branch, author_id, state, created_at, merge_commit_sha). PR diffs are computed on demand via Gitaly RPC (CommitDiff). Inline comments are stored in MySQL with (pr_id, commit_sha, file_path, line_number, body). PR approvals follow a state machine: a required number of approving reviews and all required CI checks must pass before merging is enabled. The merge operation is a Gitaly RPC (MergeBranch) that performs the Git merge server-side, creating the merge commit atomically. Conflict detection uses a 3-way merge between the source branch, the target branch, and their common ancestor (the merge base).
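
A compact sketch of that merge gate; the field names are illustrative, but the logic is the state machine described above.

```go
package pr

// PullRequest holds the merge-relevant state of a PR.
type PullRequest struct {
	State             string            // "open", "closed", or "merged"
	Approvals         int               // approving reviews received
	RequiredApprovals int               // from branch protection rules
	RequiredChecks    map[string]string // check name -> "pending"|"success"|"failure"
	HasConflicts      bool              // from a 3-way merge against the merge base
}

// CanMerge reports whether the merge button should be enabled: the PR is
// open, conflict-free, sufficiently approved, and all required checks pass.
func (pr *PullRequest) CanMerge() bool {
	if pr.State != "open" || pr.HasConflicts {
		return false
	}
	if pr.Approvals < pr.RequiredApprovals {
		return false
	}
	for _, status := range pr.RequiredChecks {
		if status != "success" {
			return false
		}
	}
	return true
}
```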

Database Design

MySQL stores all structured metadata: users (id, login, email, created_at), repositories (id, owner_id, name, visibility, fork_parent_id, default_branch, disk_path), pull_requests (id, repo_id, source_ref, target_ref, state, author_id, merge_sha), issues (id, repo_id, number, title, body, state, author_id, labels), and commits (sha, repo_id, message, author_id, committed_at; a partial copy kept for the web UI). MySQL is sharded by org/user ID, with Vitess providing transparent sharding: a connection-pooling and query-routing layer over the MySQL shards. Repository file trees and blobs are never stored in MySQL; they are always read from Gitaly.
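
A condensed DDL sketch for two of these tables. The column sets follow the text; the types, key names, and index choices are assumptions for illustration (with Vitess, the sharding key is configured per keyspace, not in the DDL).

```go
package schema

// CreateTables is a condensed migration for two central tables.
const CreateTables = `
CREATE TABLE repositories (
    id             BIGINT PRIMARY KEY,
    owner_id       BIGINT NOT NULL,
    name           VARCHAR(100) NOT NULL,
    visibility     ENUM('public','private') NOT NULL,
    fork_parent_id BIGINT NULL,
    default_branch VARCHAR(255) NOT NULL DEFAULT 'main',
    disk_path      VARCHAR(255) NOT NULL,
    UNIQUE KEY by_owner_name (owner_id, name)
);

CREATE TABLE pull_requests (
    id         BIGINT PRIMARY KEY,
    repo_id    BIGINT NOT NULL,
    source_ref VARCHAR(255) NOT NULL,
    target_ref VARCHAR(255) NOT NULL,
    state      ENUM('open','closed','merged') NOT NULL,
    author_id  BIGINT NOT NULL,
    merge_sha  CHAR(40) NULL,
    created_at DATETIME NOT NULL,
    KEY by_repo_state (repo_id, state, created_at)
);`
```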

API Design

The public API is REST over HTTPS, with a GraphQL endpoint for clients that need flexible queries. Resources follow the repository hierarchy: GET /repos/{owner}/{repo} returns repository metadata, GET /repos/{owner}/{repo}/pulls lists pull requests, POST /repos/{owner}/{repo}/pulls opens one, PUT /repos/{owner}/{repo}/pulls/{number}/merge merges it, and GET /repos/{owner}/{repo}/issues covers issue tracking. List endpoints are paginated (page/per_page, capped at 100 items per page), responses carry ETags so clients can make conditional requests, and authenticated clients are rate-limited per token.
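
As an illustration, a list endpoint might look like the sketch below. PRStore, the PR shape, and the route pattern are hypothetical; the pagination defaults mirror the public API's (30 per page, capped at 100). It assumes Go 1.22+ ServeMux patterns, e.g. mux.HandleFunc("GET /repos/{owner}/{repo}/pulls", listPulls(store)).

```go
package api

import (
	"context"
	"encoding/json"
	"net/http"
	"strconv"
)

type PR struct {
	Number int    `json:"number"`
	Title  string `json:"title"`
	State  string `json:"state"`
}

type PRStore interface {
	List(ctx context.Context, owner, repo string, page, perPage int) ([]PR, error)
}

// listPulls sketches a handler for GET /repos/{owner}/{repo}/pulls.
func listPulls(store PRStore) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		page, _ := strconv.Atoi(r.URL.Query().Get("page"))
		if page < 1 {
			page = 1
		}
		perPage, _ := strconv.Atoi(r.URL.Query().Get("per_page"))
		if perPage < 1 || perPage > 100 {
			perPage = 30 // public API default
		}
		prs, err := store.List(r.Context(), r.PathValue("owner"), r.PathValue("repo"), page, perPage)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(prs)
	}
}
```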

Scaling & Bottlenecks

Large monorepos are the most acute scaling challenge. A git clone of a 15 GB repository (Chromium) over 1 Gbps takes about 2 minutes, which is unacceptable for frequent CI runs. Solutions: (1) shallow clones (--depth=1) fetch only the objects reachable from the latest commit, discarding history; (2) partial clones (--filter=blob:none) skip fetching file contents until checkout needs them; (3) repository forks share objects: a fork of a popular repository does not copy all objects but instead points at a shared object store (Git's alternates mechanism) with copy-on-write for new objects, reducing the storage cost of a fresh fork to near zero.
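
The near-free fork in point (3) can be sketched with the alternates mechanism, which is how object sharing works in stock Git; paths and error handling are simplified for illustration.

```go
package forks

import (
	"os"
	"os/exec"
	"path/filepath"
)

// CreateFork makes a new bare repository whose object database points at
// the parent's via objects/info/alternates, so shared history is never
// copied; only objects new to the fork are written locally.
func CreateFork(parentPath, forkPath string) error {
	// Equivalent to: git init --bare <forkPath>
	if err := exec.Command("git", "init", "--bare", forkPath).Run(); err != nil {
		return err
	}
	// The alternates file lists extra object directories Git will search.
	alternates := filepath.Join(forkPath, "objects", "info", "alternates")
	parentObjects := filepath.Join(parentPath, "objects")
	return os.WriteFile(alternates, []byte(parentObjects+"\n"), 0o644)
}
```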

MySQL scalability for PRs and issues: with 10 million new PRs per month, the PR table grows by 120 million rows per year. MySQL copes with tables on the order of a billion rows on modern hardware, provided indexes match the query patterns. PR queries are dominated by single-repo lookups (list PRs for a repo, fetch a specific PR), served by an index on (repo_id, state, created_at); issues are similarly shaped. Sharding by repo_id (co-locating all metadata for a repository on one shard) avoids cross-shard joins for repo-level queries and simplifies consistency for PR state transitions.
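
A sketch of that dominant query shape; it is served entirely by the (repo_id, state, created_at) composite index, so latency stays flat as the table grows. The caller iterates and closes the returned rows.

```go
package queries

import (
	"context"
	"database/sql"
)

// OpenPRs lists a repository's open pull requests, newest first, matching
// the composite index (repo_id, state, created_at) described above.
func OpenPRs(ctx context.Context, db *sql.DB, repoID int64, limit int) (*sql.Rows, error) {
	return db.QueryContext(ctx, `
		SELECT id, source_ref, target_ref, author_id, created_at
		  FROM pull_requests
		 WHERE repo_id = ? AND state = 'open'
		 ORDER BY created_at DESC
		 LIMIT ?`, repoID, limit)
}
```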

Key Trade-offs

  • NFS vs. Gitaly for repository storage: Direct NFS mounts are simpler but create network latency for every Git file access and are operationally fragile at scale; Gitaly adds a service layer but enables controlled, monitored, horizontally scaled Git access
  • Monolith vs. microservices for GitHub itself: most web logic lives in a Rails monolith, with performance-critical pieces (Git storage, search) extracted into services; this balances development speed against deployment independence
  • Merge commit vs. squash vs. rebase merge: Merge commits preserve full history but clutter commit log; squash merges produce clean history but lose per-commit attribution; rebase merge produces linear history but rewrites commits; GitHub supports all three, with team preference dictating policy
  • Strong consistency on push vs. availability: Push operations use strict primary-write to prevent data loss; this means a Gitaly primary failure blocks pushes for affected repositories until failover completes (~30 seconds), prioritizing durability over availability
