MODULE 13 / 12crash course

~/roadmap/12-choosing-the-right-structure

◆◆Intermediate

Choosing the Right Data Structure

The meta-skill every engineer needs — a decision framework that maps your dominant operation to the right structure, so you stop guessing and start choosing deliberately.

11 min read2026-01-15Ironclad Academy

#data-structures #decision-framework #fundamentals

You've made this mistake. So has everyone.

You're three hours into a coding problem. You have a list of tasks, each with a priority. Every second you need to pull the highest-priority task, process it, and add new ones. Your first instinct: sort the list when you insert. You've written something like this:

tasks.append(new_task)
tasks.sort(key=lambda t: t.priority, reverse=True)
next_task = tasks.pop(0)

It works. On 50 tasks it's instant. On 50,000 tasks it's... slower. You add a sort after every insert and a pop from the front — that's O(n log n) plus O(n) per iteration. On ten million events, this quietly kills your system.

The fix is a heap. Two lines of Python, O(log n) per operation, runs ten thousand times faster at scale. But you had to know to reach for it. That's the gap this lesson closes.

The framework: start with your dominant operation

Structures aren't better or worse in the abstract. They're better or worse for specific operations. The move is to identify your dominant operation — the thing your code does most frequently, the thing that needs to be fast — and then match it to the structure purpose-built for that operation.

Here's how to think about it:

flowchart TD
    A["What do you need most?"] --> B["Lookup by key"]
    A --> C["Always need the min or max"]
    A --> D["Sorted iteration or range queries"]
    A --> E["Ordered sequence with fast access"]
    A --> F["FIFO / LIFO access"]

    B --> G["Hash table → O(1) avg"]
    C --> H["Heap → O(1) peek, O(log n) push/pop"]
    D --> I["BST → O(log n) all ops, ordered"]
    E --> J["Array → O(1) index, O(1) append"]
    F --> K["Queue (FIFO) or Stack (LIFO) → O(1)"]

    style G fill:#7c5cff,color:#fff
    style H fill:#22d3ee,color:#0a0a0f
    style I fill:#00ff9d,color:#0a0a0f
    style J fill:#7c5cff,color:#fff
    style K fill:#22d3ee,color:#0a0a0f

Walk through each branch with a real scenario.

Branch 1: Fast lookup by key → hash table

You have a user ID. You need to know their subscription tier. You have 50 million users.

A hash table is the only sensible answer. It gives you O(1) average lookup regardless of whether you have 50 users or 50 million. The mechanism is simple: a hash function converts the key to an index, and you jump directly to that slot — no scanning, no comparison loop.

Where it fails you: order. A hash table doesn't know what "next" or "previous" means. If you need to iterate users alphabetically, or find all users whose IDs fall between two values, a hash table is useless. You'd have to dump every key into a list and sort it, which costs O(n log n) and defeats the point.

Also watch the load factor. A hash table that's more than ~70% full starts degrading because collisions pile up and chains get long. A well-tuned implementation rehashes automatically, but if you're rolling your own or using a low-level one, keep an eye on it.

Use it for: membership tests, frequency counts, caching, deduplication, any lookup that's purely "does this key exist and what's its value?"

Don't use it for: anything that requires order.

Branch 2: Repeatedly need the min or max → heap

You're building a task scheduler. A stream of jobs arrives continuously, each with a priority. At each tick you want to run the highest-priority job available.

The temptation is to keep a sorted list. But sorting after every insert is O(n log n) per insert, and you'll do this thousands of times. The correct tool is a heap — specifically a max-heap (or a min-heap if lower priority number means higher urgency).

A heap maintains one invariant: the root is always the extreme element. That's it. You don't get sorted order; the rest of the heap is a partial mess. But you get:

Peek at the top: O(1) — just look at the root.
Pop the top: O(log n) — remove the root, sift the replacement down.
Push a new element: O(log n) — add it at the bottom, sift it up.

For the task-scheduler problem, every tick is O(log n) instead of O(n log n). At a million ticks, that's a difference between about 20 million operations and 20 billion.

The other canonical heap use: k-th largest/smallest. Maintain a min-heap of size k over a stream. Every new element: push, then pop if size exceeds k. The root is always the k-th largest. O(n log k) total — far cheaper than sorting n elements when k is small.

import heapq

def kth_largest(stream, k):
    # min-heap of the k largest elements seen so far
    heap = []
    for val in stream:
        heapq.heappush(heap, val)
        if len(heap) > k:
            heapq.heappop(heap)   # evict the smallest of the top-k
    return heap[0]                # root = k-th largest

Use it for: priority queues, scheduling, top-k problems, Dijkstra's shortest path, median-of-a-stream (two heaps).

Don't use it for: arbitrary lookup by key, sorted traversal, anything other than the extreme element.

Branch 3: Sorted order or range queries → BST

You're building autocomplete. The user types "pro" and you need all words that start with "pro" — efficiently, in alphabetical order.

A binary search tree (or its practical variants — red-black trees, AVL trees, B-trees) maintains your data in sorted order at all times. Every insertion places the new value in its correct spot. Every search, insert, or delete is O(log n) on a balanced tree.

What you get that a hash table can't give you:

Range queries: all values between a and b — walk the right subtree starting at a, stop at b. O(k + log n) where k is the number of results.
Successor / predecessor: what's the next value after x? Trivially answered by the tree structure.
Sorted iteration: in-order traversal visits every node in sorted order in O(n).

The catch is "balanced." A naive BST built by inserting already-sorted data degenerates into a linked list — O(n) for everything. In production you almost always use a self-balancing variant (Python's sortedcontainers.SortedList, Java's TreeMap, C++'s std::map). These guarantee O(log n) in the worst case by rebalancing after inserts.

Use it for: any problem that combines lookup with sorted traversal, range queries, or order statistics.

Don't use it for: pure key-value lookup with no ordering requirement — a hash table is faster.

Branch 4: Ordered sequence with fast access → array

Most of your data will live in an array. If you're reading sequentially, appending at the end, or accessing by index, an array beats every other structure:

Index access: O(1), and CPU-cache-friendly because the memory is contiguous.
Append: O(1) amortized.
Sequential scan: faster than a linked list even at the same O(n), because the cache can prefetch the next element.

The array starts to hurt when you insert or delete in the middle — that forces a shift of every later element, O(n). And if you frequently ask "is x in this array?", you're doing O(n) scans. At that point you've outgrown it.

Knowing when to stop using an array is as important as knowing when to use one. A classic sign: you keep calling if x in my_list on a large list. Every call is O(n). Convert it to a set once and every call drops to O(1).

Use it for: sequential access, index-based access, appending, anything small enough that simplicity beats complexity.

Branch 5: FIFO and LIFO → queue and stack

These feel like interview trivia, but they encode something real: traversal strategy.

A queue (FIFO — first in, first out) processes nodes in the order they arrive. That's exactly what breadth-first search needs — add neighbors to the back, process from the front, and you automatically explore level by level. Task queues, message brokers, and rate limiters all share this structure.

A stack (LIFO — last in, first out) processes the most recently added item first. That's depth-first search, recursive call stacks, undo history, and expression evaluation. In fact, DFS with an explicit stack and DFS with recursion are the same thing — the recursive version just uses the call stack instead of an explicit one.

Both run in O(1) for their core operations. The choice between them isn't about performance — it's about which order you want to visit things.

The cheat sheet

Need	Structure	Lookup	Insert	Notes
Key → value, fast	Hash table	O(1) avg	O(1) avg	No ordering
Always want the min/max	Heap	O(1) peek	O(log n)	Only the extreme is fast
Sorted order + range	BST (balanced)	O(log n)	O(log n)	Must stay balanced
Indexed sequence	Array	O(1)	O(1) end / O(n) mid	Cache-friendly
FIFO	Queue	—	O(1)	BFS, task scheduling
LIFO	Stack	—	O(1)	DFS, undo, parsing

The visualizer: see the complexity curves, not just the tables

Cheat sheets tell you which is faster in theory. The curves tell you how much it matters at a given scale.

{ "type": "bigo", "title": "your structure's operation cost at scale" }

Push the slider to n = 100,000 and look at the gap between O(1) and O(n). That's the difference between your hash table and a linear scan through an unsorted array. Now look at O(log n) vs O(n log n). That's the difference between a single BST lookup and re-sorting on every insert. The curves aren't decoration — they're the reason these structures exist.

A worked decision: the median of a data stream

Here's a problem that forces you to combine two structures. Given a stream of numbers, return the median after each insertion in O(log n) per insertion.

The naive approach: maintain a sorted array, binary search for the insertion point, keep the middle element. Inserting into a sorted array is O(n) because of the shift, so this is O(n) per step.

The right structure pairing: two heaps.

A max-heap for the lower half of the numbers.
A min-heap for the upper half.

Keep them balanced (sizes differ by at most 1). The median is either the top of one heap (odd count) or the average of both tops (even count) — O(1) to read. Each insertion is O(log n): push to one heap, possibly rebalance by moving the top of one heap to the other.

import heapq

class MedianFinder:
    def __init__(self):
        self.lo = []   # max-heap (negate values for Python's min-heap)
        self.hi = []   # min-heap

    def add_num(self, num):
        # push to max-heap (lower half)
        heapq.heappush(self.lo, -num)

        # ensure every element in lo <= every element in hi
        if self.hi and (-self.lo[0]) > self.hi[0]:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))

        # rebalance sizes: lo can be at most 1 larger than hi
        if len(self.lo) > len(self.hi) + 1:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))
        elif len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def find_median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

This is what "choosing the right structure" looks like in practice: you asked yourself "what operation must be fast?" — in this case, O(1) access to the extreme element of each half — and that question pointed straight to heaps.

The anti-patterns worth naming

Sorting when you should be heaping. If you sort a list to get the minimum, then insert a new element and sort again, you're paying O(n log n) for something a heap does in O(log n). Spot the pattern: repeated insertion + repeated extraction of the extreme = heap.

Scanning a list that should be a set. if item in big_list inside a loop is O(n) per check. Convert big_list to a set once and every check drops to O(1). This is one of the most common O(n²) → O(n) fixes in real code.

Using a BST when you just need lookup. If you'll never iterate in order or do range queries, a BST's ordering guarantee is a cost you're paying for nothing. Use a hash table and get O(1) instead of O(log n).

Over-engineering small data. At n = 100, an O(n²) algorithm runs in 10,000 operations. That's fast. A hash table with load-factor tuning and a balanced BST are both overkill. Use the simplest thing that works. Complexity classes matter at scale — below a few thousand elements, readability usually beats cleverness.

The real meta-skill

The structures are the vocabulary. The meta-skill is knowing when to use which word.

You've now seen arrays, hash tables, heaps, BSTs, stacks, and queues — not as isolated topics but as a set of trade-offs laid side by side. Every one of them is the right answer in some context and the wrong answer in another. The engineers who pick correctly aren't smarter than the ones who don't. They've just internalized one question well enough that it becomes a reflex: what does my code do most, and what structure makes that operation cheapest?

Ask that question first. Everything else follows.

From here, each structure has its own deep dive. Start with whichever feels fuzziest: arrays if the shift tax doesn't feel visceral yet, hash tables if you're not sure when O(1) lookup breaks down, heaps if the two-heap median problem felt abstract, or BSTs if "balanced" is still a hand-wave. The details are there when you need them.

// FAQ

Frequently asked questions

▸How do you choose the right data structure for a problem?

Start by identifying your dominant operation — the thing your code does most often. Fast lookups by key suggest a hash table. Maintaining sorted order suggests a BST or sorted array. Always needing the min or max suggests a heap. First-in-first-out access suggests a queue. Once you name the operation, the structure almost picks itself.

▸When should you use a hash table vs. a binary search tree?

Use a hash table when you only need lookup, insert, and delete by key — it gives O(1) average for all three. Switch to a BST (or a balanced variant like a red-black tree) when you need sorted order: range queries, iteration in order, finding the predecessor or successor of a key. Hash tables have no ordering; BSTs maintain it at the cost of O(log n) per operation.

▸When should you use a heap instead of sorting the array?

Use a heap when you repeatedly need the minimum or maximum but the dataset keeps changing — new elements arrive, old ones are consumed. Sorting solves the one-time case, but a heap maintains the invariant dynamically. Pushing to a heap is O(log n), popping the extreme is O(log n), and peeking is O(1). That beats re-sorting after every update by a wide margin.

▸What data structure should you use for FIFO and LIFO access?

A queue for FIFO (first-in-first-out): enqueue at the back, dequeue from the front — BFS and task scheduling are the canonical uses. A stack for LIFO (last-in-first-out): push and pop from the same end — DFS, undo history, and expression parsing all live here. Both run O(1) for their core operations.

▸Is it ever wrong to default to an array?

An array is a perfectly reasonable default for small data or when the access pattern is mostly reads and end-appends. It falls short when you need O(1) membership tests (use a hash set), O(log n) sorted operations (use a BST), constant access to the minimum (use a heap), or O(1) front deletions (use a deque). The cost of a wrong default is usually just a constant factor — until the dataset hits a threshold where the wrong complexity class starts to hurt.

← previous module

Dynamic Programming for Beginners

course complete →

Explore the data structures