Hash Sets
A hash set is a hash table with the values ripped out — pure O(1) membership, dedup, and "have I seen this?" tracking. Master it and a whole class of O(n²) problems collapse to O(n).
The hook: you are already writing O(n²) set logic without knowing it
Here's a bug that ships every week somewhere. The function checks whether a list of usernames contains any duplicates:
def has_duplicate(names):
for i in range(len(names)):
for j in range(i + 1, len(names)):
if names[i] == names[j]:
return True
return False
That's O(n²). A list of 100,000 names means ten billion comparisons. You can rewrite the entire function in two lines using a set and it becomes O(n):
def has_duplicate(names):
return len(names) != len(set(names))
That's what a hash set does to this class of problems. Anywhere you're doing nested loops to check for existence or track what you've seen — the set collapses it.
Mental model: a bucket of stamps, no labels
A hash table is like a filing cabinet where each folder has a name and contents. A hash set is like a rubber-stamp collection: each stamp exists or it doesn't. There's no folder, no contents, just the stamp itself.
When you call add("alice"), the set hashes "alice" to a bucket index, checks if that value is already in the bucket, and — if not — stores it. When you call contains("alice"), it hashes again, jumps to the same bucket, and answers in constant time.
flowchart LR
K["key: 'carol'"] --> H["hash('carol') → 42"]
H --> M["42 % 7 = 0"]
M --> B["bucket 0"]
B --> C{"already\nthere?"}
C -->|yes| N["no-op"]
C -->|no| A["store it"]
style K fill:#7c5cff,color:#fff
style B fill:#22d3ee,color:#0a0a0f
style A fill:#00ff9d,color:#0a0a0f
The hashing mechanics are explained in depth in crash course: hashing — the short version is that a good hash function spreads keys evenly across buckets, which keeps each bucket short and the average operation O(1). Sets and maps share all of that.
Set vs. Map: one question decides it
Do I need to retrieve a value later, or do I just need to know if the key exists?
That's it. That's the whole decision.
| Scenario | Use |
|---|---|
| "Have I visited this node before?" | Set |
| "How many times have I visited this node?" | Map (key → count) |
| "Is this word in the dictionary?" | Set |
| "What is the definition of this word?" | Map (word → definition) |
| "Which users are in the admin group?" | Set |
| "What role does each user have?" | Map (user → role) |
| "Does this graph have a cycle?" | Set (visited nodes) |
| "What is the shortest path to each node?" | Map (node → distance) |
A common mistake: reaching for a map and setting the value to True everywhere, because you're used to maps. seen = {"alice": True, "bob": True} works, but seen = {"alice", "bob"} is cleaner, uses less memory, and communicates intent clearly. If the value is always the same throwaway boolean, you don't need the map.
The dedup pattern
The single most-used hash set idiom:
unique = list(set(original_list)) # O(n), duplicates gone
Or if you want to preserve first-occurrence order (Python 3.7+ dict preserves insertion order, which you can exploit):
def dedup_ordered(lst):
seen = set()
result = []
for item in lst:
if item not in seen:
seen.add(item)
result.append(item)
return result
Two data structures doing two jobs: the set answers "have I seen this?" in O(1), and the list preserves order. Together they give you order-preserving dedup in O(n) time and O(n) space. Trying to do this with sorted arrays costs O(n log n); doing it with nested loops costs O(n²). The set wins.
Visited tracking: sets and graph traversal
Every graph traversal — BFS, DFS, Dijkstra — needs to track visited nodes. Without it, you loop forever. With an array you'd need to scan O(n) to check if a node was visited. With a set it's O(1).
from collections import deque
def bfs(graph, start):
visited = set() # O(1) membership check each step
queue = deque([start])
visited.add(start)
while queue:
node = queue.popleft()
for neighbor in graph[node]:
if neighbor not in visited: # O(1)
visited.add(neighbor)
queue.append(neighbor)
The visited set is doing something subtle here: it transforms the problem from "have I been here before across potentially millions of paths?" into a single hash lookup. Without it, on a dense graph you'd explore the same nodes exponentially many times.
The same pattern applies to linked list cycle detection. You can use Floyd's algorithm (two pointers, O(1) space), but for an interview warm-up the set approach is immediate to write and O(n) time:
def has_cycle(head):
seen = set()
node = head
while node:
if id(node) in seen:
return True
seen.add(id(node))
node = node.next
return False
See two pointers for the Floyd's cycle-detection version — it's worth knowing both, and the interviewer might ask you to optimize space after you nail this one.
Set operations: union, intersection, difference
These are where sets really pull away from any manual approach. You need the set of customers who bought product A and product B? A & B. You need customers who bought A but not B? A - B. You need everyone who bought either? A | B.
bought_a = {"alice", "bob", "carol", "dave"}
bought_b = {"bob", "dave", "eve", "frank"}
both = bought_a & bought_b # intersection: {'bob', 'dave'}
either = bought_a | bought_b # union: all 6
only_a = bought_a - bought_b # difference: {'alice', 'carol'}
one_not_both = bought_a ^ bought_b # symmetric diff: {'alice', 'carol', 'eve', 'frank'}
Each of these costs O(m + n) or better. The intersection A & B actually runs in O(min(m, n)) — Python iterates the smaller set and checks each element against the larger one using O(1) lookups.
Here's a concrete problem: given two lists of integers, find the values that appear in both. The nested-loop approach is O(m × n). The set approach:
def intersection(list_a, list_b):
set_a = set(list_a) # O(m) to build
return [x for x in list_b if x in set_a] # O(n) to check
O(m + n) total. For lists of 10,000 elements each, that's the difference between 100 million operations and 20,000. The set doesn't just make this code shorter — it changes the problem's character entirely.
Three worked problems
1. Contains Duplicate (LeetCode 217) — warm-up
Given an integer array, return True if any value appears at least twice.
def contains_duplicate(nums):
seen = set()
for n in nums:
if n in seen:
return True
seen.add(n)
return False
# One-liner: return len(nums) != len(set(nums))
Time O(n), space O(n). The loop version returns early on the first duplicate — better in practice even though they're asymptotically the same.
2. Longest Consecutive Sequence (LeetCode 128) — medium
Given an unsorted array of integers, find the length of the longest sequence of consecutive integers. Must run in O(n).
The non-obvious key: you only want to start counting from the beginning of a run. An element n is the start of a run if n - 1 is not in the set. This avoids redundant counting.
def longest_consecutive(nums):
num_set = set(nums) # O(n) build, dedup included
best = 0
for n in num_set:
if n - 1 not in num_set: # only start counting from run boundaries
length = 1
while n + length in num_set:
length += 1
best = max(best, length)
return best
Every number gets "visited" at most twice (once in the outer loop, once as part of a run). Total work is O(n) despite the nested while. The set membership check is what makes the inner loop O(1) per step rather than O(n).
{ "type": "hash-table", "title": "num_set for [100,4,200,1,3,2]", "buckets": 7, "data": ["100", "4", "200", "1", "3", "2"] }
Play with adding numbers to see how the buckets fill. Notice 1, 2, 3, 4 all hash to different buckets — the set doesn't "know" they're consecutive, but membership checks for n - 1 and n + 1 tell you exactly that in O(1).
3. Two Sum — the set variant
You know the map-based Two Sum. Here's when a set is enough: given an array and a target, return True if any two numbers sum to the target (you don't need to return the indices).
def two_sum_exists(nums, target):
seen = set()
for n in nums:
complement = target - n
if complement in seen:
return True
seen.add(n)
return False
O(n) time, O(n) space. If you need the indices, you need a map. If you only need a boolean, the set is cleaner and slightly lighter in memory. This distinction matters — reading a problem carefully to see if you need indices or just existence is itself a signal you can show an interviewer.
Pitfalls
Checking membership on a list by accident. This is the single most common performance mistake I see in code reviews:
# O(n) per check — O(n²) total if inside a loop
if x in my_list:
...
# O(1) per check — O(n) total
lookup = set(my_list)
if x in lookup:
...
The fix: whenever a collection is only used for membership checks and you're not going to mutate it mid-loop, convert it to a set before the loop. One line, done.
Mutating a set while iterating it. This raises a RuntimeError in Python. If you need to remove elements while iterating, collect the removals in a separate list and apply them after.
# RuntimeError: Set changed size during iteration
for item in my_set:
if should_remove(item):
my_set.remove(item)
# Correct
to_remove = {item for item in my_set if should_remove(item)}
my_set -= to_remove
Storing unhashable types. Sets and hash tables both require their keys to be hashable. In Python, list, dict, and set are not hashable — you'll get a TypeError trying to add them. Use tuple instead of list when you need a sequence as a set element.
my_set = set()
my_set.add([1, 2, 3]) # TypeError: unhashable type: 'list'
my_set.add((1, 2, 3)) # fine — tuples are hashable
Assuming insertion order. Before Python 3.7, even dict didn't guarantee order; set still doesn't in any version. If your code depends on iterating a set in the order elements were added, you have a bug waiting for the runtime or Python version to expose it. Use a list to track order, a set to track membership — they're complementary tools.
Using a set when you need counts. A set tracks presence, not frequency. If you need to know how many times "alice" appears, you need a Counter (Python) or a HashMap<String, Integer> (Java). Adding the same element 100 times to a set is the same as adding it once.
When NOT to use a hash set
Sets are fantastic for their job. Their job is narrow.
| You need | Don't use a set — use this instead |
|---|---|
| Membership check in O(1) | Set (you're in the right place) |
| Ordered iteration | Sorted array or TreeSet (Java) — O(log n) ops |
| Counting occurrences | Counter / HashMap |
| The k-th element | Array with sorting, or a heap |
| Predecessor/successor queries | BST — see binary search tree |
| Key-to-value mapping | Hash table |
| Range queries ("all elements between 3 and 9") | Sorted array or BST |
If your queries are "is x in this collection?" and you don't need ordering, a set is nearly always correct. The moment you need to answer "what's the rank of x?" or "what's the next-largest after x?", the hash set has no answer and you need a tree-backed sorted set.
There's also a memory angle. A set of 10 million 64-bit integers holds the integers themselves plus bucket pointers and load-factor overhead — roughly 3–5× the raw data size. If you're working with dense integer ranges (say, "mark which integers from 0 to N have been seen"), a bitset or a boolean array is dramatically more compact and just as fast.
And finally: if you're checking membership against a truly huge static dataset (a dictionary of 500,000 words), a Bloom filter can give you probabilistic O(1) membership with a fraction of the memory — at the cost of false positives. Sets are exact; Bloom filters trade a tiny error rate for space. Worth knowing the tradeoff exists.
Quick reference
# Python — built-in set
s = set()
s = {1, 2, 3} # literal syntax (NOT an empty set — {} is a dict)
s.add(4) # O(1)
s.remove(2) # O(1), KeyError if missing
s.discard(99) # O(1), no error if missing
print(3 in s) # O(1)
# Set operations
a = {1, 2, 3}
b = {2, 3, 4}
a | b # union → {1, 2, 3, 4}
a & b # intersection → {2, 3}
a - b # difference → {1}
a ^ b # symmetric diff → {1, 4}
# Dedup a list (order NOT preserved)
unique = list(set(my_list))
# Dedup preserving order
def dedup_ordered(lst):
seen = set()
return [x for x in lst if not (x in seen or seen.add(x))]
// Java
Set<String> set = new HashSet<>(); // O(1) ops, unordered
Set<String> sorted = new TreeSet<>(); // O(log n) ops, sorted iteration
set.add("alice");
set.contains("bob"); // false
set.remove("alice");
// Set operations
Set<Integer> intersection = new HashSet<>(setA);
intersection.retainAll(setB); // modifies in-place
Set<Integer> union = new HashSet<>(setA);
union.addAll(setB);
The HashSet is your everyday tool. Reach for TreeSet only when you actually need sorted iteration — the O(log n) overhead is real, and for pure membership checks it's unnecessary.
If a problem has you scanning a list to check "have I seen this?" more than once, that scan is the smell. Swap the list for a set, and what was O(n²) becomes O(n). It's one of the most reliable upgrades in the toolkit — not fancy, just right.
Frequently asked questions
▸What is the difference between a hash set and a hash map?
A hash map (or hash table) stores key-value pairs — you look up a key to retrieve an associated value. A hash set stores only keys — there are no values. You use a set when you only care whether something is present, not what it maps to. Under the hood they are built the same way: the same hash function, the same collision resolution, the same O(1) average operations. A set is just a map where the value is always a meaningless constant (Python literally does this internally).
▸When should I use a set instead of a list for membership checks?
Almost always. Checking "is x in this collection?" costs O(n) on a list because it has to scan every element. The same check costs O(1) average on a set because the set hashes x and jumps directly to the right bucket. The only time you should keep a list is when you need indexed access, ordering, or duplicates — none of which a set provides. If you find yourself writing `if x in my_list` inside a loop, convert it to a set first.
▸Are sets ordered?
Standard hash sets are unordered — iteration order is not guaranteed and can differ between runs. If you need sorted iteration, Python has `sorted(my_set)` (O(n log n)) or you can use a sorted set backed by a balanced BST (like Java's `TreeSet`) which maintains O(log n) insert/lookup but loses O(1) membership. For interview problems the unordered hash set is almost always what you want.
▸Can a set contain duplicate values?
No, and that is the point. When you insert a value that is already present, the set simply does nothing. This makes sets the fastest way to deduplicate a collection: convert a list to a set in O(n), all duplicates gone. Convert back to a list if you need list semantics again — O(n) total, far cheaper than any sort-and-scan approach.
▸How is set intersection implemented efficiently?
The standard approach iterates over the smaller set and checks each element against the larger set using O(1) membership. Total cost is O(min(m, n)) average, not O(m × n) — which is why you should always use set operations instead of nested loops when finding common elements. Most languages implement this natively: Python has `a & b` or `a.intersection(b)`, Java has `retainAll`, JavaScript requires a manual loop but each check is still O(1).
You may also like
Union-Find (Disjoint Set Union)
The near-magic data structure for connectivity queries — are these two nodes in the same component? Nearly O(1) per operation with union by rank and path compression. Kruskal's MST, redundant connections, account merging, and more.
Tries (Prefix Trees)
The data structure powering autocomplete, spell-check, and IP routing — a tree where each edge is a character and every path from the root spells a prefix. Insert, search, and prefix queries all run in O(L) on word length alone, completely independent of how many words you store.
Strings
Strings look like simple text, but they hide a brutal trap: naive concatenation in a loop is O(n²). Learn why, how encoding actually works, and the handful of patterns — sliding window, two pointers, hashing — that solve 80% of string interview problems.