Bloom Filters Explained: Probabilistic Set Membership Testing

How Bloom filters work — hash functions, false positives, sizing formulas, and how Google, Cassandra, and CDNs use them to avoid expensive lookups.

Tags: bloom-filters, data-structures, distributed-systems, probabilistic, performance

Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set, with the property that false positives are possible but false negatives are not.

What It Really Means

Imagine you have a database with 100 million usernames and you need to check if a given username exists before allowing registration. You could query the database every time, but that is a disk I/O operation for every single registration attempt. A Bloom filter lets you answer "is this username taken?" from memory instead: at a 1% false positive rate, 100 million usernames need roughly 120 MB, and each lookup takes on the order of a microsecond.

The catch: a Bloom filter can tell you "definitely not in the set" or "probably in the set." It never produces false negatives — if it says "not in the set," you can trust it completely. But it can produce false positives — it might say "probably in the set" when the element is actually not there. The false positive rate is configurable and depends on the filter's size and the number of hash functions used.

This makes Bloom filters well suited as a first-pass filter. Check the Bloom filter first. If it says "no," skip the expensive lookup entirely. If it says "maybe," do the full lookup to confirm. A filter sized for a 1% false positive rate eliminates about 99% of the lookups for elements that are not in the set.

How It Works in Practice

The Mechanics

  1. Initialize: Create a bit array of m bits, all set to 0
  2. Insert: Hash the element with k different hash functions. Each hash produces an index in [0, m). Set those k bits to 1
  3. Query: Hash the element with the same k hash functions. If ALL k bits are 1, the element is "probably in the set." If ANY bit is 0, the element is "definitely not in the set"

False positives occur when other elements have collectively set all k bits for an element that was never inserted.
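That intuition can be made quantitative. The block below is the standard approximation, assuming the k hash functions behave as independent, uniform random functions; n (the number of inserted elements) and p (the false positive rate) are symbols introduced here, not in the text above.

```latex
% Probability that one particular bit is still 0 after n insertions
% (each insertion sets k bits chosen uniformly among m):
\Pr[\text{bit} = 0] = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}

% A query for an element that was never inserted reports "probably in the set"
% only if all k of its bits are 1:
p \approx \left(1 - e^{-kn/m}\right)^{k}
```

For a fixed m, p grows as n grows, which is why a filter has to be sized for the number of elements it is expected to hold.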

Apache Cassandra

Cassandra uses Bloom filters on every SSTable (immutable data file). When a read request arrives, Cassandra checks the Bloom filter for each SSTable before reading from disk. If the Bloom filter says the key is not in that SSTable, Cassandra skips it entirely — avoiding an expensive disk seek.

With a false positive rate of 1%, Cassandra skips about 99% of the SSTables that do not contain the requested key, and only the remaining 1% cost a wasted disk read. This is a large part of why Cassandra can serve reads quickly despite having data spread across many SSTables.
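The read path described above can be sketched in a few lines. This is an illustration only, not Cassandra's actual code: `TinyBloom`, `SSTable`, `read_from_disk`, and `read` are hypothetical names, the in-memory dict stands in for the on-disk file, and the small filter is defined inline just to keep the sketch self-contained.

```python
import hashlib


class TinyBloom:
    """Small fixed-size Bloom filter, defined inline to keep the sketch runnable."""

    def __init__(self, num_bits=4096, num_hashes=5):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, key):
        d = hashlib.sha256(key.encode("utf-8")).digest()
        h1, h2 = int.from_bytes(d[:8], "big"), int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for i in self._indexes(key):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, key):
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(key))


class SSTable:
    """Hypothetical stand-in for an immutable data file plus its Bloom filter."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = TinyBloom()
        for key in self.rows:
            self.bloom.add(key)
        self.disk_reads = 0

    def read_from_disk(self, key):
        self.disk_reads += 1  # stands in for the expensive disk seek
        return self.rows.get(key)


def read(sstables, key):
    for table in sstables:
        if not table.bloom.might_contain(key):
            continue  # definitely absent from this file: skip the disk read
        value = table.read_from_disk(key)
        if value is not None:
            return value
    return None


# Five files of 100 keys each; the requested key lives in the third one.
tables = [SSTable({f"user{i}": i for i in range(t * 100, t * 100 + 100)}) for t in range(5)]
print(read(tables, "user250"))  # 250
print(sum(t.disk_reads for t in tables))  # total disk reads performed
```

A real storage engine also has to merge results across files and handle deletes; the sketch only shows where the filter check sits relative to the disk read.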

Google Chrome Safe Browsing

Earlier versions of Chrome used a Bloom filter to check whether a URL was on the list of known malicious websites. The full list was too large to download to every browser, but a Bloom filter representation fit in a few megabytes. When you visited a URL, Chrome checked the local Bloom filter. If it said "not in the list," the URL was not a known threat and no network call was needed. If it said "maybe malicious," Chrome contacted Google's servers for a definitive check.

Content Delivery Networks (CDNs)

Akamai and Cloudflare use Bloom filters to decide if a URL should be cached. On the first request, the URL is added to a Bloom filter but the content is not cached. On the second request, the Bloom filter reports that the URL has probably been seen before, and only then is the content cached. This prevents "one-hit wonders" (URLs accessed only once) from polluting the cache.
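The cache-on-second-request rule is easy to express in code. The following is a minimal sketch of the idea, not any CDN's implementation: `SeenFilter`, `handle_request`, and `fetch_from_origin` are hypothetical names, and a plain dict stands in for the real cache.

```python
import hashlib


class SeenFilter:
    """Tiny Bloom filter used only to remember URLs that were requested once."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, url):
        d = hashlib.sha256(url.encode("utf-8")).digest()
        h1, h2 = int.from_bytes(d[:8], "big"), int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for i in self._indexes(url):
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, url):
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(url))


seen_once = SeenFilter()
cache = {}


def handle_request(url, fetch_from_origin):
    if url in cache:
        return cache[url]  # cache hit
    body = fetch_from_origin(url)
    if url in seen_once:
        cache[url] = body  # second (or later) request: admit to the cache
    else:
        seen_once.add(url)  # first request: remember the URL, do not cache
    return body


origin = lambda url: f"<content of {url}>"
handle_request("/a.png", origin)
print("/a.png" in cache)  # False: first request is only recorded
handle_request("/a.png", origin)
print("/a.png" in cache)  # True: second request is cached
```

A false positive here only means that an occasional one-hit URL gets cached on its first request, which is a cheap mistake.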

HBase

HBase uses Bloom filters similarly to Cassandra — each HFile has an associated Bloom filter. A row-level Bloom filter checks if a specific row key exists, while a row-column Bloom filter checks if a specific row+column combination exists, further reducing disk I/O.

Implementation
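The original code listing for this section is not available, so what follows is a minimal sketch rather than a reference implementation. It assumes string items, sizes the bit array with the standard formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, and derives the k indexes from a single SHA-256 digest using double hashing. The class name `BloomFilter` and its parameters are choices made for this example; production code would more likely use a fast non-cryptographic hash.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: an m-bit array probed at k positions per item."""

    def __init__(self, expected_items, fp_rate=0.01):
        if expected_items <= 0 or not 0.0 < fp_rate < 1.0:
            raise ValueError("expected_items must be > 0 and fp_rate in (0, 1)")
        # Standard sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2
        self.m = math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _indexes(self, item):
        # Double hashing: index_i = (h1 + i * h2) mod m, with h1 and h2 taken
        # from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(item))


usernames = BloomFilter(expected_items=1000, fp_rate=0.01)
for name in ("alice", "bob", "carol"):
    usernames.add(name)

print("alice" in usernames)  # True: inserted items are always reported
print(usernames.m, usernames.k)  # chosen bit-array size and hash count
```

In the registration example, a `False` result means the username is definitely free and the database query can be skipped, while a `True` result still needs the database to confirm.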

Trade-offs

Advantages

  • Extreme space efficiency: A Bloom filter for 1 billion items with a 1% FP rate uses about 1.2 GB (9.6 bits per item). Storing the items directly would require 10-100 GB, assuming 10-100 bytes per item.
  • Constant-time operations: Both insert and query are O(k), where k is the number of hash functions (typically 3-10), regardless of how many items the filter holds
  • No false negatives: If the filter says "no," the answer is definitively no
  • Parallelizable: Hash computations are independent and can run in parallel

Disadvantages

  • False positives: Cannot give a definitive "yes" — only "maybe"
  • Cannot delete elements: Setting bits to 0 would affect other elements that hash to the same positions. Use Counting Bloom filters for deletion support.
  • Fixed capacity: Once the filter is "full" (too many items for the chosen size), the false positive rate increases. You must rebuild with a larger filter.
  • No enumeration: You cannot list the items in a Bloom filter
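The deletion limitation is the one most often worked around, so here is a minimal sketch of the Counting Bloom filter idea mentioned above: each position holds a small counter instead of a single bit. The class name, the default sizes, and the saturate-at-15 rule (modelling a 4-bit counter) are assumptions made for this example, and the counters are kept in a plain Python list rather than packed into 4 bits.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant with a small counter per position, allowing removal."""

    def __init__(self, num_counters=1024, num_hashes=4, max_count=15):
        self.m, self.k = num_counters, num_hashes
        self.max_count = max_count  # 15 models a 4-bit counter
        self.counters = [0] * num_counters

    def _indexes(self, item):
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1, h2 = int.from_bytes(d[:8], "big"), int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            if self.counters[idx] < self.max_count:  # saturate instead of overflowing
                self.counters[idx] += 1

    def remove(self, item):
        # Only safe for items that were really added; removing anything else
        # can introduce false negatives for other items.
        if item not in self:
            return
        for idx in self._indexes(item):
            if 0 < self.counters[idx] < self.max_count:  # saturated counters stay stuck
                self.counters[idx] -= 1

    def __contains__(self, item):
        return all(self.counters[idx] > 0 for idx in self._indexes(item))


cbf = CountingBloomFilter()
cbf.add("alice")
cbf.add("bob")
cbf.remove("alice")
print("bob" in cbf)  # True: removing "alice" cannot clear a slot "bob" still uses
print("alice" in cbf)  # False unless every one of its slots is shared with "bob"
```

Leaving saturated counters untouched trades a slightly higher false positive rate for the guarantee that a removal never creates a false negative.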

Sizing Guidelines

Items    FP Rate    Bits per item    Memory
1M       1%         9.6              1.2 MB
1M       0.1%       14.4             1.8 MB
100M     1%         9.6              120 MB
1B       1%         9.6              1.2 GB
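The rows above follow from the standard sizing formulas: bits per item = -ln(p) / (ln 2)^2, and the optimal number of hash functions is k = (bits per item) × ln 2. The helper below is a small sketch for computing other configurations; `bloom_size` is a name chosen for this example, and memory is reported in decimal megabytes.

```python
import math


def bloom_size(n_items, fp_rate):
    """Return (bits_per_item, total_bits, num_hashes) for a target false positive rate."""
    bits_per_item = -math.log(fp_rate) / math.log(2) ** 2
    total_bits = math.ceil(n_items * bits_per_item)
    num_hashes = max(1, round(bits_per_item * math.log(2)))
    return bits_per_item, total_bits, num_hashes


for n, p in [(1_000_000, 0.01), (1_000_000, 0.001), (100_000_000, 0.01), (1_000_000_000, 0.01)]:
    bpi, bits, k = bloom_size(n, p)
    print(f"{n:>13,} items  p={p:<6} {bpi:5.1f} bits/item  k={k:<2} {bits / 8 / 1e6:10.1f} MB")
```

Note that bits per item depends only on the target false positive rate, not on the number of items, which is why the 1% rows all show 9.6.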

Common Misconceptions

  • "Bloom filters have a fixed false positive rate" — The false positive rate increases as more items are added. The configured rate only applies when the number of inserted items does not exceed the expected capacity.
  • "You need many different hash functions" — In practice, you can use one hash function (like MurmurHash3) with k different seeds. Alternatively, double hashing derives all k indexes from just two hash values h1 and h2 by computing (h1 + i × h2) mod m for i = 0 to k-1.
  • "Bloom filters are only useful for large datasets" — Even for small datasets, a Bloom filter can prevent expensive network calls or disk I/O. The overhead is negligible.
  • "Counting Bloom filters solve all deletion problems" — Counting filters typically use 4 bits per position instead of 1, quadrupling memory usage. They also introduce the possibility of counter overflow, and removing an item that was never added can create false negatives.
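The first point above can be checked numerically with the usual approximation p ≈ (1 - e^(-kn/m))^k. The sketch below assumes a filter sized for 1 million items at about 1% (m = 9,585,059 bits, k = 7) and evaluates the expected false positive rate at different fill levels; `expected_fp_rate` is a helper name chosen for this example.

```python
import math


def expected_fp_rate(m_bits, k_hashes, n_items):
    """Approximate false positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


m, k, capacity = 9_585_059, 7, 1_000_000  # sized for 1M items at about 1%
for load in (0.5, 1.0, 2.0, 4.0):
    n = int(capacity * load)
    print(f"{load:>4}x capacity -> {expected_fp_rate(m, k, n):.2%}")
```

At the design capacity the rate is about 1%, but at twice the capacity it is already roughly 16%, so an overfilled filter degrades quickly rather than gracefully.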

How This Appears in Interviews

Bloom filters appear in system design and data structures interviews:

  • "Design a web crawler that avoids revisiting URLs" — Use a Bloom filter to track visited URLs. Check before crawling each URL. A false positive means a URL that was never visited gets skipped, which is usually acceptable for a small fraction of pages.
  • "How does Cassandra serve reads efficiently with many SSTables?" — Bloom filters on each SSTable. Explain the false positive trade-off and how it reduces disk I/O.
  • "How would you implement spell-check for a dictionary?" — Bloom filter for the dictionary. If the word is not in the filter, it is definitely misspelled. If it is, verify against the full dictionary.
  • "Design a system that prevents duplicate message processing" — Bloom filter as a fast first check. If the Bloom filter says "not seen," process immediately. If "maybe seen," check the database.
