MODULE 04 / 12crash course
~/roadmap/03-arrays-and-memory
Beginner

Arrays and How Memory Really Works

Go under the hood of the most fundamental data structure — contiguous bytes, O(1) index addressing, cache lines, and why a sequential scan beats a random walk every time.

8 min read2026-01-15Ironclad Academy

Why memory layout is the whole story

You already know what an array looks like on paper — a row of indexed boxes. What the textbook usually skips is what that means in hardware, and skipping it means you'll never really understand why performance works out the way it does.

So let's go one level down.

Your computer's RAM is a flat address space: byte 0, byte 1, byte 2, all the way to however many gigabytes you have. Every variable, every object, every function call lives at some address in that space. When you declare an array of five 64-bit integers in C, you're asking the runtime to find you 40 consecutive bytes5 × 8 — and hand you the address of the first one. Call that address base.

Here's what that looks like on the strip:

flowchart LR
    subgraph "40 consecutive bytes in RAM"
      direction LR
      B0["base+0<br/>arr[0]=12"] --- B1["base+8<br/>arr[1]=47"] --- B2["base+16<br/>arr[2]=3"] --- B3["base+24<br/>arr[3]=99"] --- B4["base+32<br/>arr[4]=28"]
    end
    style B2 fill:#7c5cff,color:#0a0a0f
    style B0 fill:#22d3ee,color:#0a0a0f

Notice the offsets: 0, 8, 16, 24, 32. They are exactly i × 8. That pattern is the whole secret. To retrieve arr[2], the CPU computes base + 2 × 8 = base + 16 and loads from that address. One instruction. It does not look at arr[0] or arr[1] on the way. The index i could be 2 or it could be 2,000,000 — the cost is identical.

This is why arrays describe arr[i] as O(1). It is not a shorthand or an approximation. It is literally one arithmetic step plus one memory read. Compare that to a linked list, where reaching the ith node means following i pointers in sequence. Same Big-O class on the chart (O(n) for linear traversal), wildly different hardware story.

The formula you should internalize

address_of(arr[i])  =  base_address  +  i × element_size

Three things feed it:

  • base_address — where the array starts in memory. The runtime hands this to you when you allocate.
  • i — the index. Zero-based is not arbitrary; it is because the first element lives at base + 0 × size = base.
  • element_size — the number of bytes per element. For int32 that's 4, for float64 it's 8, for a struct with three 64-bit fields it's 24. Whatever it is, it's fixed at compile time, which is the only reason the formula works.

If elements could be different sizes, you would not know where arr[7] starts without summing up the sizes of arr[0] through arr[6]. That would make indexing O(n), and the whole advantage collapses. Contiguous + fixed size = O(1) indexing. The two properties are inseparable.

Enter the cache — where theory meets reality

Here is the number that changes how you should think about arrays:

Accessing RAM takes roughly 60–100 nanoseconds. Accessing L1 cache takes roughly 1–4 nanoseconds.

That is a 30–100× gap, and it exists on every machine you will ever run code on. Your CPU is fast; your RAM is comparatively slow; and the cache is the bridge between them.

The way the cache works: when your CPU reads from RAM, it does not fetch a single byte. It fetches an entire cache line — almost universally 64 bytes on modern hardware — and stores the whole line in L1 cache. The next time you read from anywhere in that same 64-byte window, you get the cached version at 1–4 ns instead of going back to RAM.

Now look at what this means for an array of 32-bit integers (4 bytes each). One cache line holds 16 of them. When you read arr[0], the hardware automatically loads arr[0] through arr[15] into cache. Reading arr[1] through arr[15] is then essentially free — the data is already sitting in the CPU.

{ "type": "array", "title": "cache line loads neighbors for free", "data": [12, 47, 3, 99, 28, 61, 8, 74] }

Step through this — each element is already in the neighborhood of the last one. The "free" transfers happen every time you access sequentially. For a linked list, each next pointer points to a node allocated independently somewhere else in the heap. Each hop is a fresh address, almost certainly not in cache. Each hop is a probable cache miss at ~100 ns. At 10 million nodes, that difference is not subtle.

Comparing the patterns

Let's make this concrete with real numbers. Suppose you have 10 million integers and you're summing them.

Access patternCache behaviorReal time (approx.)
Sequential array scanNearly all cache hits after first miss per line~10 ms
Random index access (array)One cache miss per access~1,000 ms
Linked-list traversalOne cache miss per node~1,000 ms

The sequential array and the linked list are both O(n) in Big-O notation. Hardware gives the sequential array a 100× advantage. Big-O tells you the shape of the curve; cache behavior tells you where the curve actually lives.

This is why "cache-friendly" matters in performance-sensitive code. It's also why algorithms textbooks sometimes mislead you — they assume every memory access costs the same. It does not.

What "contiguous" actually costs you

Nothing is free. The property that makes array indexing O(1) and array scanning cache-friendly is also the property that makes inserting in the middle expensive: you cannot have gaps.

If you need to insert a value at index i, you have to shift everything from i to the end one slot to the right first. That's up to n moves — O(n). Deleting at i means shifting everything from i+1 to the end one slot left. Same story.

flowchart LR
    A0["12"] --- A1["47"] --- A2["??"] --- A3["3"] --- A4["99"] --- A5["28"]
    style A2 fill:#00ff9d,color:#0a0a0f
    N["insert 55 at index 2"] -.->|"shift 3, 99, 28 right first"| A2

The contiguous constraint is what gives the array its performance profile — and it is also what makes it not the right tool for every job. If your workload involves constant insertions and deletions in the middle, you're paying O(n) repeatedly. A linked list inserts in O(1) once you have a pointer to the spot — it just trades that gain for O(n) indexing and cache-hostile traversal.

Most of the time you should still use the array. But knowing why helps you recognize the rare case when you shouldn't.

The hardware prefetcher: one more free gift

Modern CPUs do not just cache reactively. They predict. The hardware prefetcher watches your memory access patterns and, when it detects a stride — you read address X, then X+64, then X+128 — it starts loading the next cache line before you ask for it, hiding the latency entirely.

A sequential array scan triggers this almost perfectly. The prefetcher sees the stride, gets ahead of your loop, and your CPU almost never stalls waiting for memory. A random-order traversal — or a linked list — presents no predictable pattern. The prefetcher cannot help. Every access stalls.

This is the full picture of why for i in range(n): total += arr[i] is as fast as it is. It is not just that it avoids pointer-chasing. The hardware is actively working with you because the access pattern is predictable.

Putting it together: what to take away

The deep dive on arrays covers operations, dynamic resizing, amortized cost, and the interview patterns. This article is about the one-level-deeper question: why does any of that work the way it does?

The answer is always memory layout:

  1. Contiguous + fixed size = O(1) index. The formula base + i × size is the whole thing.
  2. Contiguous = cache-friendly. You load neighbors for free. Sequential iteration is fast not just asymptotically but in wall-clock time.
  3. Cache lines are 64 bytes. One access loads the next several elements. This is the hardware gift arrays get and scattered structures don't.
  4. Contiguous = no gaps = expensive middle inserts. Every trade-off in data structures is a trade-off in memory layout.

When you look at a linked list and wonder why you'd ever tolerate its O(n) indexing, the answer is: you want O(1) insert/delete at the cost of cache performance. When you look at a hash table, you're looking at an array of buckets with a function to map keys to indices — still relying on that same O(1) indexing at the base. Every data structure you will learn in this course is a different arrangement of memory, trading one property for another.

Start with arrays. They are not the simplest structure because they're boring — they're the simplest because they are the direct expression of what memory actually is.

One worked example: why stride matters

To make the cache-line story visceral, here's a classic trick question. You have a 1000×1000 matrix stored in row-major order (C, Python, JavaScript all do this). Which is faster?

# Version A: row-major — stride of 1 element
total = 0
for row in range(1000):
    for col in range(1000):
        total += matrix[row][col]

# Version B: column-major — stride of 1000 elements
total = 0
for col in range(1000):
    for row in range(1000):
        total += matrix[row][col]

Version A is faster — sometimes 10× faster. Both are O(n²). But Version A accesses matrix[row][0], matrix[row][1], matrix[row][2] — consecutive memory addresses, cache-line friendly. Version B accesses matrix[0][col], matrix[1][col], matrix[2][col] — each one 1000 elements (4000 bytes) away from the last. Every access in Version B blows through the cache line and likely causes a miss.

This matters outside of matrix problems too. Any time you have nested loops over two-dimensional data, the order of iteration can mean the difference between fast and painfully slow. And the reason is always the same: memory is a strip of bytes, and you want your accesses to walk it sequentially.

Now you know why.

// FAQ

Frequently asked questions

How does array indexing work at the memory level?

Every element in an array is the same size, and all elements sit in one contiguous block of memory. To find element i, the CPU computes base_address + i × element_size — a single multiply and a single add. It jumps straight to the right byte without scanning anything, which is why arr[i] is always O(1) regardless of i or the array length.

What is a cache line and why does it matter for arrays?

Modern CPUs do not fetch one byte at a time from RAM — they fetch 64 bytes at once into a cache line. Because array elements are contiguous, reading one element automatically loads the next several neighbors into the L1 or L2 cache. Sequential scans of an array therefore hit that cache almost every time, making them far faster than the Big-O complexity alone suggests.

Why are sequential array scans faster than random-index lookups, even though both are O(n)?

Both are O(n) on paper, but hardware tells a different story. A sequential scan strides through memory predictably, so the CPU prefetcher loads upcoming cache lines before you ask. Random-order access scatters reads across memory, causing cache misses that each stall the CPU for ~100 nanoseconds while it waits for RAM. On 10 million elements that gap is the difference between milliseconds and seconds.

How does a linked list compare to an array in terms of memory and cache performance?

A linked list scatters its nodes across the heap — each node is allocated wherever malloc finds space. Following a pointer to the next node almost always causes a cache miss, because the CPU must fetch a fresh 64-byte cache line for every single element. Arrays keep everything together, so most accesses are served from cache. This is why a cache-friendly array scan can beat a linked list even when the Big-O says they are equal.

What happens when you allocate an array in memory?

The runtime asks the operating system for a contiguous block of bytes large enough to hold all elements. For a C array of 10 ints, that is 40 bytes in a row (4 bytes × 10). In most languages those bytes are uninitialized on the stack or in a heap allocation. The key constraint is that the block must be contiguous — there can be no gaps between elements — because the index formula only works if every element starts at a predictable offset from the base address.