~/data structures/hash-table
◆◆Intermediateasked at Googleasked at Amazonasked at Metaasked at Microsoft

Hash Tables

The interview MVP. How hash functions turn arbitrary keys into O(1) lookups, what actually happens during a collision, and why "just use a hash map" is usually the right instinct.

10 min read2026-01-15Ironclad Academy
Complexity cheat sheet
Operation
Average
Worst
InsertO(n) only if every key hashes to the same bucket
O(1)
O(n)
Lookupworst case is a degenerate hash function
O(1)
O(n)
Delete
O(1)
O(n)
Resize (rehash)amortized away across many inserts
O(n)
O(n)
// DEPTH
the full breakdown — requirements, capacity, evolution, trade-offs

The hook: you already know the problem

You have a list of 10 million log lines and need to count how often each user ID appears. The brute-force answer is O(n²): for each user ID, scan the whole list. That runs for hours.

Or you keep a dictionary as you go. Each user ID maps to a counter. One pass through the list, one lookup and increment per entry, done in seconds. You just used a hash table. The question is: what actually makes that O(1)?

The mental model: a very clever array

A hash table is an array underneath. That's it. The trick is a function that converts your key — which might be a string like "alice@example.com", a tuple like (3, 7), or anything else — into an integer that indexes the array.

flowchart LR
    K["key: 'cat'"] --> H["hash('cat') = 52740"]
    H --> M["52740 % 7 = 3"]
    M --> B["buckets[3]"]
    style K fill:#7c5cff,color:#0a0a0f
    style B fill:#00ff9d,color:#0a0a0f

The math: bucket_index = hash(key) % num_buckets. If you have 7 buckets, hash("cat") might be 52740, and 52740 % 7 = 3. So "cat" lives in bucket 3. When you look it up later, you run the same function, land on bucket 3, and check there. No scanning required.

This is why the hash function is the beating heart of the structure. A good one spreads keys uniformly across all buckets. A bad one — or one you design without thinking — can dump most keys into a handful of slots and quietly turn your O(1) into something much uglier. More on that in a moment.

Collisions: inevitable, not catastrophic

Here's the math that trips people up: if you have 7 buckets and insert 8 keys, at least one bucket must hold two keys. That's just the pigeonhole principle. And even with plenty of buckets, birthday-problem statistics mean collisions show up way sooner than you'd expect — with 23 keys spread across 365 buckets, you'd bet against it, but there's already a 50% chance of a collision.

So you need a strategy.

Separate chaining

Each bucket holds a linked list (or small dynamic array). When a collision happens, the new entry is appended to the list at that bucket. Lookup walks the list, comparing keys until it finds a match.

With a good hash function and a low load factor, those lists stay short — often zero or one entry per bucket. A scan of a list of length 1 or 2 is still constant time in practice.

flowchart TD
    subgraph Buckets
        B0["[0] → ∅"]
        B1["[1] → ∅"]
        B2["[2] → ∅"]
        B3["[3] → 'cat' → 'fish'"]
        B4["[4] → 'dog'"]
        B5["[5] → 'bird'"]
        B6["[6] → ∅"]
    end
    style B3 fill:#7c5cff,color:#0a0a0f

The visualizer above shows exactly this — two keys sharing bucket 3, chained together.

Open addressing

Instead of a list, you store everything directly in the array. On a collision, probe the next slot (linear probing: index + 1, + 2, ...), or use a smarter pattern like quadratic probing or double hashing. Lookup probes in the same sequence until it finds the key or an empty slot.

Open addressing has better cache behavior — everything stays in one contiguous array — but deletion is tricky (you can't just clear a slot; it might break a probe chain) and it degrades faster at high load factors. Python's dict uses open addressing with a variant of quadratic probing. Java's HashMap uses separate chaining.

Separate chainingOpen addressing
MemoryExtra pointer per entryOne flat array
Cache behaviorPoor (pointer hops)Good (contiguous)
DeletionEasyRequires tombstones
High load factorDegrades gracefullyDegrades sharply
Common usersJava HashMap, C++ unordered_mapPython dict, Java IdentityHashMap

For most practical purposes: if you're using a language's built-in map, you don't choose. If you're implementing one, separate chaining is simpler and easier to reason about.

Load factor and resizing

Load factor is n / capacity — entries divided by bucket count. It's the single number that predicts your average chain length. When load factor is 0.5, each bucket holds 0.5 entries on average. At 2.0, chains average 2 entries long and every lookup does extra work.

Most implementations resize when load factor crosses ~0.75. That means allocating a new, larger array (typically 2× the size), computing hash(key) % new_capacity for every existing entry, and inserting them into the new positions. This is called rehashing.

Rehashing is O(n) — you touch every entry. But it happens exponentially less often as the table grows (the same doubling argument that makes array appends amortized O(1)). So inserts are O(1) amortized even accounting for rehashes.

The practical implication: if you know you're going to insert a million entries, pre-size the map. In Python that's not directly exposed, but in Java you can do new HashMap<>(2_000_000) and skip a handful of rehashes.

The embed: watch the hash function in action

{ "type": "hash-table", "title": "insert, collision, and lookup", "buckets": 7, "data": ["cat", "dog", "bird", "fish"] }

Step through the inserts one at a time. Notice which bucket each key lands in. When two keys hit the same bucket, watch the chain form. Then run a lookup on one of the chained keys — see it land on the right bucket and walk the short list. This is the core loop: hash → bucket → (short) scan. The "short" part is load factor working for you.

Why worst case is O(n) — and why you rarely see it

If every key hashes to bucket 0, you have a linked list disguised as a hash table. A lookup for the last key scans all n entries. That's O(n), identical to a linear search.

Two scenarios cause this:

  1. A degenerate hash function. If hash(key) = 0 for every key, you've built a very expensive linked list.
  2. Adversarial input. An attacker who knows your hash function can craft inputs that all collide, turning your O(1) map into O(n) and potentially causing a denial-of-service. This is why Python randomizes its hash seed per process, and why production hash tables often use SipHash or similar cryptographic-strength functions.

Java 8+ addresses this by converting a bucket's chain to a balanced BST when it exceeds 8 entries. That caps the worst case at O(log n) per lookup rather than O(n). Clever, but also a sign of how seriously the engineers took this problem.

For your purposes: use the language built-in. Its hash function is good. If you're building something yourself, make sure hash(a) == hash(b) if and only if a == b (well, modulo collisions), and that a == b implies hash(a) == hash(b) — the two invariants any hash table needs to be correct.

The "use a hash map" instinct

Here's the real skill. Three problem shapes that should immediately make you reach for a hash map:

1. Deduplication / "have I seen this?"

Keep a set (backed by a hash table). First occurrence: add it. Subsequent occurrences: skip it. One pass, O(n) total. The alternative is sorting first — which is O(n log n) and changes the output order.

2. Frequency counting

from collections import Counter

def most_frequent(words):
    counts = Counter(words)          # one pass, O(n)
    return counts.most_common(1)[0]  # O(n) for all-time most common

Counter is just a hash map with a more convenient API. You could do this with dict too — counts[word] = counts.get(word, 0) + 1. The key move is swapping the O(n) scan-per-lookup for an O(1) hash lookup.

3. Two-sum and its variants

This is the canonical hash-map interview problem. Given a list of numbers and a target, find two that sum to the target.

def two_sum(nums, target):
    seen = {}                          # value → index
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i
    return []

The brute force is two nested loops — O(n²). This version makes one pass: for each number, check whether its complement already exists in the map. If yes, done. If not, record the current number and move on. O(n) time, O(n) space.

The two pointers approach solves the same problem in O(n) time and O(1) space, but only when the input is sorted. If you can't sort (because you need original indices, or the input isn't sortable), the hash map is the right tool.

Pitfalls people actually hit

Mutating a key after insertion. Hash tables compute the bucket index from the key at insert time. If you change the key afterward, the object now lives in the wrong bucket — lookups will miss it. This is why Python dicts require keys to be hashable (immutable). Lists are not hashable; tuples are.

Assuming order is preserved. Python's dict preserves insertion order as of 3.7, but that's a language guarantee, not a property of hash tables in general. Java's HashMap makes no ordering promise. If you need sorted order, use a TreeMap (Java) or sort the items separately.

Using a hash map when an array suffices. If your keys are small non-negative integers — say, character counts in a string — an array of size 26 (or 128 for ASCII) is faster, uses less memory, and has better cache behavior. hash_map[char] trades constant-factor overhead for no real benefit over arr[ord(char) - ord('a')]. See the deep dive on hashing fundamentals for when to go lower-level.

Building a hash map where a set would do. If you only care whether a key exists and not what it maps to, use a set — it's a hash map without the values, and it communicates intent clearly.

Counting on O(1) for large string keys. Hashing a key takes time proportional to the key's length. For a 10 KB string, hashing is not instant. Most of the time this doesn't matter, but if your keys are huge and you're hashing billions of them, it can show up in profiling.

When not to use a hash table

Hash tables are the right default for key-value lookups. But they have real limitations:

  • You need sorted order or range queries. "Find all users with ID between 1000 and 2000" requires scanning every bucket or maintaining a separate structure. A balanced BST (or a B-tree in a database) handles this in O(log n) per query.
  • Memory is tight. Hash tables carry overhead: the bucket array (often sparse), the chain pointers, the load-factor headroom. A sorted array with binary search uses less memory and gives O(log n) lookup — often fine.
  • You need cache-locality and contiguous scans. Separate chaining follows pointers off the main array into heap-allocated lists. If you're scanning millions of entries in a tight loop, a flat sorted array or a direct-address table (array keyed by integer) will be faster.
  • Keys aren't hashable. If you can't define a stable hash and equality for your keys, you can't use a hash table. You'd need a structure that compares keys directly, like a BST.

When sorted order matters, look at balanced BSTs. When memory is the constraint and keys are integers, a direct-address array is the fastest thing you can build. The hash table is the right tool for the enormous middle of "I need fast keyed lookup and I don't need ordering" — which is most of the time.

// FAQ

Frequently asked questions

How does a hash table achieve O(1) average lookup?

A hash function converts any key into an integer, which gets reduced modulo the bucket count to pick a slot. If the hash function distributes keys evenly, each bucket holds only a handful of entries on average, so finding the right one takes constant time. The key word is "average" — a terrible hash function can pile everything into one bucket and degrade to O(n), which is why the quality of the hash function matters.

What is a hash collision and how is it handled?

A collision happens when two different keys hash to the same bucket. There are two main strategies. Separate chaining keeps a linked list (or small array) at each bucket — colliding keys are simply appended to the list, and lookup scans the list. Open addressing stores everything in the array itself and probes nearby slots when a collision occurs. Both work; separate chaining is simpler and more common in general-purpose maps.

What is load factor and why does it matter for hash tables?

Load factor is the ratio of stored entries to total buckets (n / capacity). As it rises, buckets fill up, collisions become more frequent, and average lookup time creeps up. Most implementations resize — rehash into a larger array — when the load factor exceeds a threshold, typically 0.7 to 0.75. Keeping the load factor low is what preserves O(1) average performance.

Why is the worst case for a hash table O(n)?

If every key maps to the same bucket — whether from a bad hash function or a deliberate adversarial input — all n entries stack up in one place. A lookup then has to scan all n of them. Real-world implementations defend against this with randomized or cryptographic hash functions and by using balanced trees instead of linked lists when a bucket exceeds a certain size (Java 8+ does this at 8 entries per bucket).

When should I use a hash map versus a sorted map or an array?

Use a hash map when you need O(1) average insert and lookup and do not care about key ordering. Use a sorted map (BST-backed, O(log n) operations) when you need range queries, successor/predecessor, or iteration in sorted order. Use an array when your keys are small non-negative integers — you can index directly, which is even faster and uses less memory.

// RELATED

You may also like