MODULE 08 / 12crash course

~/roadmap/07-hashing

◆Beginner

Hashing: The O(1) Superpower

How hash functions and buckets turn lookup into O(1) average time — the intuition behind collisions, chaining, and load factor that every engineer should own.

9 min read2026-01-15Ironclad Academy

#hashing #hash-tables #fundamentals

You've already hit this, even if you didn't know it

You had a list of 10 million user IDs and needed to check, for each incoming event, whether the user was banned. The obvious approach — loop through the list — would take 10 million comparisons per event. Even binary search on a sorted list costs log₂(10,000,000) ≈ 23 steps per check, and you'd have to keep the list sorted.

Someone on the team said "just throw it in a set" and the whole thing suddenly ran in microseconds. That's hashing. Understanding why it works means you'll know when to trust it, when it can fail you, and how to tune it when it does.

The core idea: turning a key into a slot number

Lookup is slow when you don't know where to look. Sequential search is O(n). Binary search is O(log n). But what if the key told you exactly where it lived?

That is the idea. A hash function takes your key and returns an integer — call it the hash code. You then take that integer modulo the number of buckets to get a slot:

slot = hash(key) % num_buckets

To store "alice", compute hash("alice") — say it returns 9,271,345 — mod 7 buckets, land in slot 2. To look up "alice" later, same computation, same slot. One step. Done.

The underlying storage is just a plain array. Constant-time array access by index does the heavy lifting; the hash function is just a scheme for turning your key into an index.

flowchart LR
    K["key: &quot;alice&quot;"] --> HF["hash(&quot;alice&quot;)\n= 9,271,345"]
    HF --> MOD["9,271,345 % 7 = 2"]
    MOD --> B["buckets[ 2 ]"]
    style HF fill:#7c5cff,color:#0a0a0f
    style MOD fill:#22d3ee,color:#0a0a0f
    style B fill:#00ff9d,color:#0a0a0f

Collisions: two keys, one bucket

Here is the catch. You have 7 buckets and potentially infinite keys. No matter how clever your hash function is, eventually two keys will land in the same slot. This is not a bug — it is a mathematical certainty (pigeonhole principle). The question is how you handle it.

The most common answer is chaining: each bucket holds a linked list. When two keys hash to the same slot, both go into that bucket's list. Lookup checks the bucket, then walks the (short) chain to find the exact key.

flowchart TD
    B2["bucket[2]"] --> N1["alice → 'banned'"]
    N1 --> N2["carol → 'premium'"]
    B5["bucket[5]"] --> N3["bob → 'free'"]
    B0["bucket[0]"] --> N4["dave → 'admin'"]
    style B2 fill:#7c5cff,color:#0a0a0f
    style B5 fill:#7c5cff,color:#0a0a0f
    style B0 fill:#7c5cff,color:#0a0a0f

If "alice" and "carol" both hash to bucket 2, that bucket's chain has two entries. A lookup for "carol" lands on bucket 2, checks the first entry ("alice" — not it), checks the second ("carol" — found it). Two steps instead of one, but still tiny.

Now try the interactive version — watch what happens when a collision occurs:

{ "type": "hash-table", "title": "collisions and chaining", "buckets": 5, "data": ["cat", "dog", "bird", "fish", "ant"] }

Notice some buckets get multiple entries while others stay empty. That uneven distribution is normal. The question is whether the chains stay short — and that is exactly what load factor controls.

Load factor: the number that determines whether you keep your O(1)

Load factor is simple: number of stored entries divided by number of buckets.

load_factor = n / m

With 100 entries in 200 buckets, load factor is 0.5. With 150 entries in those same 200 buckets, it is 0.75.

Why does this matter? Picture a table with load factor 3.0 — three times as many entries as buckets. Every bucket averages three entries in its chain. Every lookup now scans an average of three nodes. At 10.0, you are scanning ten nodes per lookup. At some point a "hash table" is just a slow linked list wearing a costume.

The standard response is rehashing. When load factor crosses a threshold (Python uses 0.67, Java uses 0.75), the table:

Allocates a new bucket array, usually double the size.
Reinserts every existing entry (because the modulus changed, all slots shift).
Discards the old array.

This is O(n) work — same cost as a dynamic array resize. And like a dynamic array, it is amortized away. Most insertions are O(1); the occasional rehash spreads its cost across all the insertions that led up to it.

Load factor	Average chain length	Lookup cost
0.5	~0.5 nodes	Fast — near-constant
0.75	~0.75 nodes	Still fine, typical resize threshold
1.0	~1 node	Getting dicier
2.0	~2 nodes	Noticeably degraded
5.0+	~5+ nodes	You now have a slow linked list

Keep load factor below 1.0 and collisions stay rare enough that average lookup stays O(1). Let it climb toward 5 or 10 and you are paying O(n) for "O(1)" lookups. This is the most common way hash tables silently betray you in production.

What makes a hash function good (and what makes one dangerous)

The hash function is the engine of the whole scheme. A good one distributes keys uniformly across buckets — which means collisions are spread around randomly rather than clustering. A bad one clumps everything into a handful of buckets, causing long chains regardless of load factor.

An extreme example: imagine a hash function that always returns 0 for string keys. Every string key lands in bucket 0. Your "hash table" is a single linked list, and lookups are O(n). That is not a contrived failure mode — a naive hash function that only looks at the first character of a string will cluster on common prefixes like "user_", causing exactly this problem at scale.

Production code uses functions like MurmurHash3, xxHash, or SipHash (Python's default for strings). They are engineered for uniform distribution, speed, and (in SipHash's case) resistance to collision attacks — a real concern when external input controls the keys in your server's dictionaries.

What you want from a hash function:

Deterministic — same input always yields the same hash.
Fast — you call it on every lookup, so it should cost a few nanoseconds, not microseconds.
Uniform — outputs should spread across the integer space with no obvious clusters.
Avalanche — one bit of input change should flip roughly half the output bits, so similar keys land in very different buckets.

You almost never write your own. The runtime provides one. But knowing these properties lets you spot when something has gone wrong.

What this unlocks in practice

Once you have O(1) insert and O(1) lookup, a whole class of problems that looked like they needed nested loops collapses to a single pass. A few patterns that show up constantly:

Frequency counting. Counting how many times each word appears in a document? One pass: for each word, counts[word] += 1. No sorting, no O(n log n) overhead. O(n) total.

Deduplication. Seen a value before? if val in seen: skip. One set, one pass, O(n). Without hashing, you would need O(n²) comparisons or O(n log n) sorting.

Two-sum and its cousins. For each element x, check if target - x is already in a hash set. One O(n) pass, O(n) space — instead of the O(n²) brute force. This is the hash table accelerating the two pointers style of thinking, except you do not even need the array to be sorted.

Caching / memoization. The entire premise of memoization is "have I computed this input before?" — which is a hash set membership test. No hashing means no fast memoization.

A worked example: first non-repeating character

Given a string, find the first character that appears exactly once. The brute-force approach checks every character against every other — O(n²). Two passes with a hash map: O(n).

def first_unique_char(s: str) -> int:
    # Pass 1: count every character's frequency
    freq = {}
    for ch in s:
        freq[ch] = freq.get(ch, 0) + 1

    # Pass 2: find the first with frequency 1
    for i, ch in enumerate(s):
        if freq[ch] == 1:
            return i

    return -1  # all characters repeat

# "leetcode" → 0  ('l' appears once)
# "aabb"    → -1  (no unique character)

Two O(n) passes, O(1) extra space in the best case (26 possible lowercase letters means a constant-size map), O(k) space where k is the alphabet size. The hash map is what makes pass 2 a lookup rather than a rescan.

The pitfalls people actually hit

Mutable objects as keys. Python dicts and sets require keys to be hashable — meaning immutable. Lists fail silently with a TypeError. Use tuples. This trips up everyone once.

Assuming O(1) under adversarial input. If an attacker controls the keys in your hash table and knows your hash function, they can craft input that all hashes to the same bucket, forcing O(n) lookup. This is a real denial-of-service vector. Python randomizes its string hash seed per process startup to prevent it. Be aware of this when your hash table is exposed to external input.

Ignoring the constant factor. Hash lookup is O(1) but that constant is not free — you compute the hash, index into the array, and potentially follow a pointer chain. For tiny collections (under ~20 elements), a linear scan over an array or a small sorted list beats a hash table in wall-clock time because cache behavior dominates. Big-O hides constants; profilers reveal them.

Forgetting the space cost. A hash table with a 0.5 load factor is using twice the memory its entries require, just to keep collision rates low. For small, read-only lookup tables over a known fixed key set, a sorted array with binary search or a perfect hash uses less memory and can be faster.

Using floats as keys. Floating-point equality is treacherous. hash(0.1) == hash(0.10000000000001) is probably false. If your hash table keys are floats derived from computation, you will miss lookups silently. Map float keys to rounded integers or avoid them entirely.

When NOT to use a hash table

Hash tables are the right default for unordered key-value storage and membership testing. But they are not universal:

You need sorted order. A hash table has no concept of order. If you need to iterate keys in sorted order, or find the minimum key, or do range queries ("all keys between 'alice' and 'carol'"), use a balanced BST (Python's sortedcontainers.SortedDict, Java's TreeMap).
You need predecessor/successor. Same — ordered trees.
Memory is extremely tight. Hash tables trade memory for speed. A bit array, bloom filter, or trie may serve better.
Keys are dense small integers. If your keys are integers 0–999, a plain array is faster and simpler — direct indexing with zero overhead.

For everything else — counting, deduplication, fast membership tests, key-value caching — the hash table is almost always the answer. Python's dict, JavaScript's Map/plain objects, Java's HashMap, and Go's map are all hash tables under the hood, and they all give you the same O(1) average guarantee for the same reasons.

The full deep dive — open addressing vs. chaining, how Python's dict actually works, and the interview patterns that rely on hash tables — is in Hash Table. This lesson is the foundation. Go there when you're ready for the whole picture.

// FAQ

Frequently asked questions

▸How does hashing achieve O(1) lookup?

A hash function converts a key into an integer, and that integer (modulo the bucket count) tells you exactly which slot to look in — no scanning required. As long as collisions stay rare and the table stays under a reasonable load, you jump straight to the right bucket in constant time.

▸What is a hash collision and how is it handled?

A collision happens when two different keys hash to the same bucket. The most common fix is chaining: each bucket holds a short linked list of all the key-value pairs that landed there. A good hash function spreads keys so evenly that each chain stays near length 1 on average, keeping lookups O(1).

▸What is load factor and why does it matter?

Load factor is the number of stored entries divided by the number of buckets (n/m). When it climbs above roughly 0.7–0.75, collisions multiply and chains grow long, dragging lookup toward O(n). Hash tables watch their load factor and rehash — double the buckets and reinsert everything — before it gets there.

▸What makes a hash function "good"?

A good hash function distributes keys uniformly across buckets, is fast to compute, and is deterministic — same input always produces the same output. If it clusters keys, many buckets go empty while a few get long chains, destroying the O(1) guarantee. Production hash functions like MurmurHash3 and xxHash are engineered for both speed and distribution quality.

▸When should I use a hash set instead of a sorted array for membership tests?

Use a hash set when you only need to know whether a key exists and order does not matter. A hash set gives O(1) average for insert and lookup; a sorted array gives O(log n) for binary search but requires O(n) for insert. If you need order — sorted iteration, range queries — keep the sorted structure. Otherwise the hash set wins every time.

← previous module

Searching: Linear vs Binary

next module →

Stacks and Queues: Order Matters