Bloom Filters
A probabilistic set that answers "definitely not" or "maybe yes" using a bit array and k hash functions — orders of magnitude smaller than a hash set, with a tunable false-positive rate and mathematically guaranteed zero false negatives.
The problem this solves — and why you care
Here's a concrete one. Cassandra stores data in immutable files called SSTables. When you read a key, Cassandra might have to check dozens of SSTables on disk before finding the one that holds it (or confirming it's missing). Each disk read is potentially milliseconds. Multiply that by millions of reads per second and you have a disaster.
Cassandra's solution: every SSTable has a bloom filter. Before any disk read, it queries the filter. If the filter says "definitely not here," skip the file entirely. If it says "maybe," go look. False positives mean occasional unnecessary reads; that's fine. False negatives — saying "not here" when the key actually is — would corrupt results, so they cannot happen. The bloom filter structure makes them mathematically impossible.
Same pattern in Chrome Safe Browsing: millions of known-malicious URLs need to be checked locally, without sending every URL you visit to Google. A bloom filter fits the entire database in RAM and answers in microseconds.
The shape of the problem is always the same: "Is this element definitely absent from a very large set? I can tolerate occasionally checking unnecessarily, but I cannot tolerate missing a real member."
How it works: the bit array and k hashes
A bloom filter is a bit array of length m, initially all zeros, and k independent hash functions, each mapping an element to one of the m positions.
Insert an element:
- Run it through all k hash functions.
- Set each of the k resulting bit positions to 1.
Query an element:
- Run it through all k hash functions.
- Check each of the k resulting bit positions.
- If any bit is 0 → definitely not in the set. Zero false negatives.
- If all bits are 1 → probably in the set. Those bits might all be 1 due to other elements.
{ "type": "array", "title": "After inserting 'cat' (bits 1,5,9) and 'dog' (bits 3,7,12)", "data": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0] }
This is the state of a 16-bit filter after inserting "cat" and "dog" with k=3 functions each. Now query "bird": if its 3 hash positions include any 0, it's definitively absent. Query "cat": all 3 of its positions (1, 5, 9) are 1 — correct, it was inserted.
The false positive happens when you query an element that was never inserted but all k of its positions happen to be 1 due to other insertions. This gets more likely as the bit array fills up.
flowchart TD
A["Insert element X"] --> B["Hash 1(X) → position p1"]
A --> C["Hash 2(X) → position p2"]
A --> D["Hash 3(X) → position p3"]
B --> E["Set bit[p1] = 1"]
C --> F["Set bit[p2] = 1"]
D --> G["Set bit[p3] = 1"]
H["Query element Y"] --> I["Hash 1(Y) → position q1"]
H --> J["Hash 2(Y) → position q2"]
H --> K["Hash 3(Y) → position q3"]
I --> L{"bit[q1] == 0?"}
J --> M{"bit[q2] == 0?"}
K --> N{"bit[q3] == 0?"}
L -->|yes| O["Definitely NOT in set"]
M -->|yes| O
N -->|yes| O
L -->|no| P{"All bits = 1?"}
M -->|no| P
N -->|no| P
P -->|yes| Q["Probably in set (may be false positive)"]
style O fill:#00ff9d,color:#0a0a0f
style Q fill:#7c5cff,color:#fff
The math: why 10 bits/item gives you ~1% fpp
This is the part most tutorials skip, but understanding it tells you exactly what knobs to turn when you need a specific false-positive rate.
After inserting n elements into an m-bit array with k hash functions, the probability that any given bit is still 0 is approximately:
P(bit = 0) ≈ (1 - 1/m)^(kn) ≈ e^(-kn/m)
The false-positive probability for a query is the chance that all k positions happen to be 1:
fpp ≈ (1 - e^(-kn/m))^k
This looks ugly, but it collapses to a simple rule. The optimal number of hash functions for a given m/n ratio is:
k_optimal = (m/n) × ln(2) ≈ 0.693 × (m/n)
Substituting that back, the minimum false-positive probability achievable with m bits and n elements is:
fpp_min ≈ (0.6185)^(m/n)
Which gives you this table:
| Bits per item (m/n) | Optimal k | False-positive rate |
|---|---|---|
| 6 | 4 | ~5.7% |
| 8 | 6 | ~2.1% |
| 10 | 7 | ~0.82% |
| 12 | 8 | ~0.30% |
| 14 | 10 | ~0.11% |
| 16 | 11 | ~0.041% |
The practical takeaway: 10 bits per item with k=7 gives you just under 1% fpp. For a 10-million-item set, that's 100 million bits = ~12 MB. The equivalent hash set storing 10-character string keys would be roughly 400 MB. That's the 30–40× compression the bloom filter buys you.
If you don't want to do the math yourself, you also don't have to — every production bloom filter library accepts n (expected elements) and desired_fpp as constructor arguments and computes m and k automatically.
Why deletion is broken
This is the question everyone asks, and the answer is short: bit positions are shared.
Say you insert "alice" and it sets bits {2, 7, 14}. Then you insert "bob" and it also sets bit 7 (plus others). Now you "delete" alice by clearing bits {2, 7, 14}. You just cleared bit 7, which bob's membership depends on. Query bob and you get "definitely not in set." That's a false negative — the one thing bloom filters are not supposed to do.
There's no way around this in the standard design. You don't know which element set any given bit. Every bit is a lossy aggregation of all the elements whose hash functions pointed there.
The workaround is a counting bloom filter: replace each 1-bit slot with a small counter (usually 4 bits). Insert increments the k counters; delete decrements them. A slot reads as "set" if its counter > 0. This prevents the false-negative problem at the cost of 4× the space. In practice, if you need deletions, you should ask whether you actually need a bloom filter at all — you may just want a hash set or a hash table.
Core implementation
Here's a minimal, correct bloom filter in Python that shows the mechanics clearly:
import hashlib
import math
class BloomFilter:
def __init__(self, expected_elements: int, false_positive_rate: float = 0.01):
# Derive optimal m and k from the math above
self.m = self._optimal_bits(expected_elements, false_positive_rate)
self.k = self._optimal_hashes(self.m, expected_elements)
self.bits = bytearray(math.ceil(self.m / 8)) # bit array, packed into bytes
@staticmethod
def _optimal_bits(n: int, fpp: float) -> int:
# m = -n * ln(fpp) / ln(2)^2
return math.ceil(-n * math.log(fpp) / (math.log(2) ** 2))
@staticmethod
def _optimal_hashes(m: int, n: int) -> int:
# k = (m/n) * ln(2)
return max(1, round((m / n) * math.log(2)))
def _hash_positions(self, item: str):
"""Yield k bit positions using double-hashing to avoid k real hash calls."""
# double-hashing: h_i(x) = (h1(x) + i * h2(x)) mod m
h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
h2 = int(hashlib.sha256(item.encode()).hexdigest(), 16)
for i in range(self.k):
yield (h1 + i * h2) % self.m
def _get_bit(self, pos: int) -> bool:
return bool(self.bits[pos >> 3] & (1 << (pos & 7)))
def _set_bit(self, pos: int):
self.bits[pos >> 3] |= (1 << (pos & 7))
def add(self, item: str):
for pos in self._hash_positions(item):
self._set_bit(pos)
def __contains__(self, item: str) -> bool:
"""Returns False → definitely not present. True → probably present."""
return all(self._get_bit(pos) for pos in self._hash_positions(item))
# Usage
bf = BloomFilter(expected_elements=1_000_000, false_positive_rate=0.01)
bf.add("user:alice")
bf.add("user:bob")
print("alice" in bf) # True (was inserted)
print("charlie" in bf) # False (probably — definitely not, in this case)
# Real-world check: fpp at 1% means ~1 in 100 queries for non-members returns True
Two notes on implementation:
Double-hashing: running k independent cryptographic hash functions is slow. The standard trick is to compute just two base hashes (h1 and h2) and derive k positions as (h1 + i*h2) % m. This has the same theoretical properties as k independent functions and is much faster.
Bit packing: the filter stores bits packed into a bytearray (8 bits per byte), not a list of booleans. A list of Python bools would use 28 bytes per bool (CPython object overhead). The packed array uses 1 bit per bit. At millions of items, this matters a lot.
Where bloom filters show up in production
Cassandra and Bigtable SSTable checks
Both Cassandra and HBase/Bigtable store data in sorted, immutable SSTable files. A read for a key might need to check several SSTables to find the right one. Each SSTable carries a bloom filter. The read path queries the filter before touching the disk:
for each SSTable (newest to oldest):
if key NOT IN sstable.bloom_filter:
skip # definitely not here, zero disk I/O
else:
check index → read data block # might be a false positive, that's OK
The false-positive rate controls how often you do a wasted disk read. Cassandra defaults to 1% and lets you tune per-table. Drop it to 0.1% and you save more disk reads at the cost of 50% more memory for the filter.
Chrome Safe Browsing
Chrome's browser keeps a locally-stored list of known malicious URLs — millions of them — and checks every URL you visit before it loads. Sending every URL to Google's servers would be a privacy nightmare and a latency disaster. Instead, Chrome downloads a compact filter-like structure (technically a hash prefix scheme that evolved from bloom filters), checks URLs locally in microseconds, and only phones home when it gets a "maybe." False positives mean an occasional unnecessary server round-trip. False negatives would mean missing a malicious URL — the property bloom filters guarantee you'll never have.
CDN cache admission
A CDN wants to cache popular content and avoid caching "one-hit wonders" — URLs requested exactly once and never again. Caching a one-hitter wastes a cache slot that a popular item could use.
The trick: keep a bloom filter of recently-seen URLs. When a URL comes in:
- If NOT in filter → add to filter, don't cache (it's a first-time request)
- If in filter → cache it (it's been requested at least twice)
This is sometimes called a "LRU-on-write" filter and dramatically improves cache hit rates without storing a full set of URLs. The false-positive rate means you occasionally cache a one-hitter — that's a minor inefficiency, not a correctness problem.
Redis Bloom Filter module
Redis ships a RedisBloom module that adds native BF.ADD and BF.EXISTS commands. It's useful when you need a shared bloom filter across multiple application instances — one filter in Redis, queried from N application servers. The module handles serialization and supports BF.RESERVE to pre-configure m and k.
Pitfalls people actually hit
Not sizing correctly upfront. A bloom filter has a fixed m set at creation. As you fill it past the expected n, the actual false-positive rate rises above what you designed for — sometimes dramatically. If your "million-element" filter is holding 5 million elements, your 1% fpp is now closer to 25%. Audit your actual element counts. If you can't predict them, look at scalable bloom filters (a chain of bloom filters that add new stages when the current one fills up).
Using it as a cache. The filter tells you whether something might be in your primary store. It does not cache the value. You still need to go to the primary store on a "maybe." This sounds obvious, but I've seen engineers confuse "membership filter" with "read-through cache" and be confused when the filter hit didn't give them the data.
Assuming independent hash functions. A naive implementation that calls a random hash function k separate times might not produce truly independent values, which inflates your actual fpp above the theoretical value. Use the double-hashing technique or a well-tested library — pybloom-live in Python, Guava's BloomFilter in Java.
Not hashing the serialized form consistently. If you insert "user:alice" from a Python service and later query b"user:alice" from a Go service, you'll get wrong answers. Standardize on a serialization format (usually UTF-8 bytes) across all writers and readers.
Using a bloom filter when you actually need exact membership. This seems obvious, but there's a category of engineers who see "it uses way less memory" and reach for the bloom filter when the application cannot tolerate any false positives — like access control, deduplication of financial transactions, or any case where a false "present" has a meaningful consequence. If you can't tolerate false positives, use a hash set.
When NOT to use a bloom filter
| You need this | Use this instead |
|---|---|
| Zero false positives | Hash set — exact membership |
| Deletions | Counting bloom filter, or just a hash set |
| Retrieve the stored element or value | Hash table — maps key to value |
| Ordered iteration | Sorted array or BST |
| Exact count of occurrences | Hash table with integer values, or Count-Min Sketch |
| The set changes unpredictably in size | Scalable/dynamic bloom filter |
The clearest signal to reach for a bloom filter: you have a read-heavy workload where the dominant case is elements that are absent, you want to avoid expensive I/O (disk, network, database) on those negative lookups, and you can tolerate a small fpp that you'll tune consciously. If those three conditions don't all hold, you probably want exact membership (hash set) or a value lookup (hash table).
If you're still learning how hash functions underpin all of this, the hashing crash course is the right foundation — bloom filters are essentially k independent hash functions racing to falsify each other, and understanding collision probability there gives you the intuition for why the fpp formula works.
The bigger picture
Bloom filters belong to a class of structures called probabilistic data structures — they trade exactness for dramatically lower resource usage. The Count-Min Sketch does the same for frequency estimation. HyperLogLog approximates cardinality ("how many unique visitors?") in kilobytes where an exact set would need gigabytes.
The design philosophy is the same in all of them: identify which guarantee you actually need, figure out which guarantee you can relax, and exploit the relaxation ruthlessly. For bloom filters, the relaxed guarantee is "I might have a small number of false positives." The exploited consequence is that you never need to store the elements at all — just the fingerprint of where they hashed to.
Once you've seen this pattern, you see it everywhere in systems design. The bloom filter is the simplest and oldest example, which is why it's the one worth understanding cold.
Frequently asked questions
▸Can a bloom filter ever give a false negative?
No — and this is the property that makes bloom filters genuinely useful in practice. If an element was inserted, all k of its bit positions were set to 1. A query checks those same k positions. They can only go from 0 to 1, never back to 0, so a previously inserted element will always get a "maybe" result. False negatives are mathematically impossible in a standard bloom filter. Only false positives can occur — when all k bits for an element that was never inserted happen to be 1 due to other insertions.
▸What is a good false-positive rate and how many bits per item do I need?
For most practical use cases, 1% false-positive rate (fpp) is the sweet spot. It requires roughly 10 bits per item with k=7 hash functions. At 0.1% fpp you need ~15 bits per item with k=10. The formula is m/n = -log₂(fpp) / ln(2), where m is the total bit count and n is the expected element count. The optimal k is (m/n) × ln(2). Most production libraries accept n and fpp as constructor arguments and compute m and k for you.
▸Why can't you delete from a standard bloom filter?
Because multiple elements share bit positions. If element A set bit 42 and element B also set bit 42, then deleting A by clearing bit 42 would make B appear absent — a false negative. You can work around this with a counting bloom filter, where each position stores a small counter (typically 4 bits) instead of a single bit. Incrementing on insert and decrementing on delete preserves correctness as long as no counter overflows. The tradeoff: 4× the space of a plain filter.
▸How is a bloom filter different from a hash set?
A hash set stores the actual elements (or their hashes with collision chaining) and gives you exact membership answers — no false positives, and deletions work. A bloom filter stores only k bits per element across a shared bit array and gives you probabilistic answers. The space difference is dramatic: a hash set holding 10 million strings might use hundreds of megabytes; a bloom filter with 1% fpp needs only ~12 MB. You trade away exactness and deletability for that compression.
▸Where are bloom filters actually used in production systems?
Everywhere you want to avoid expensive I/O for elements that almost certainly do not exist. Cassandra and HBase use bloom filters to skip SSTables that cannot contain a key before doing a disk read. Chrome uses a bloom filter-like structure for its Safe Browsing database — millions of known-malicious URLs checked locally without sending every URL to Google. CDN caches use them to avoid caching one-hit-wonder content. Bitcoin full nodes have used them to filter transaction broadcasts. Redis has a native bloom filter module.
You may also like
Union-Find (Disjoint Set Union)
The near-magic data structure for connectivity queries — are these two nodes in the same component? Nearly O(1) per operation with union by rank and path compression. Kruskal's MST, redundant connections, account merging, and more.
Tries (Prefix Trees)
The data structure powering autocomplete, spell-check, and IP routing — a tree where each edge is a character and every path from the root spells a prefix. Insert, search, and prefix queries all run in O(L) on word length alone, completely independent of how many words you store.
Strings
Strings look like simple text, but they hide a brutal trap: naive concatenation in a loop is O(n²). Learn why, how encoding actually works, and the handful of patterns — sliding window, two pointers, hashing — that solve 80% of string interview problems.