Say we have an unsorted array and linked list.
The worst case when searching for an element for both data structures would be O( n ), but my question is:
Would the array still be way faster because of spatial locality within the cache, or will the cache make use of branch locality, allowing linked lists to be just as fast as an array?
My understanding for an array is that if an element is accessed, that block of memory and many of the surrounding blocks are then brought into the cache allowing for much faster memory accesses.
My understanding for a linked list is that since the path that will be taken to traverse the list is predictable, then the cache will exploit that and still store the appropriate blocks of memory even though the nodes of the list can be far apart within the heap.
Your understanding of the array case is mostly correct. If an array is accessed sequentially, many processors will not only fetch the block containing the element, but will also prefetch subsequent blocks to minimize cycles spent waiting on cache misses. If you are using an Intel x86 processor, you can find details about this in the Intel x86 optimization manual. Also, if the array elements are small enough, loading a block containing an element means the next element is likely in the same block.
Unfortunately, for linked lists the pattern of loads is unpredictable from the processor's point of view. It doesn't know that when loading an element at address X that the next address is the contents of (X + 8).
As a concrete example, the sequence of load addresses for a sequential array access is nice and predictable.
For example, 1000, 1016, 1032, 1048, etc.
For a linked list it will look like:
1000, 3048, 5040, 7888, etc. Very hard to predict the next address.
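The difference in how the next address is obtained can be sketched in C. This is only an illustration: the node layout and the function names are assumptions made for the example, not something taken from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node layout: the link comes first, the payload after it. */
struct node {
    struct node *next;
    int value;
};

/* Array: the address of element i follows from the base address by
   arithmetic alone, so no memory has to be read to compute it. */
uintptr_t array_elem_addr(const int *base, size_t i)
{
    assert(base != NULL);
    return (uintptr_t)(base + i);
}

/* List: reaching node i takes i loads, and each load needs the result
   of the one before it. */
const struct node *list_nth(const struct node *head, size_t i)
{
    while (head != NULL && i > 0) {
        head = head->next;   /* the next address is whatever is stored here */
        i--;
    }
    return head;
}
```

Consecutive calls to array_elem_addr differ by exactly sizeof(int), which is the kind of fixed stride a prefetcher can pick up. Nothing comparable holds for the addresses returned by list_nth.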
Related
In what ways would finding an element (traverse) in a BST be slower than linearly scanning for it within an array?
The answer supposedly has to do with caching. Can someone explain what exactly this means and why it holds true?
How exactly do you "cache-this," using an array rather than caching with BST?
Thanks
My guess is that using a BST doesn't give you any advantage, since, even if you're caching data (which means there's some kind of locality, you may access the same element later for example), the insert operation and find operation always cost O(h) where h is the height of the tree. In the worst case, even O(n).
Whereas using an array for caching means that of course, at first it may be linear, but whenever you access the same element of the array afterwards, if there's spatial and temporal locality, you may find yourself accessing directly the same chunks of contiguous memory repeatedly, because you already know its index, which means you have a constant time access.
I assume caching relates to CPU caches, which come with a prefetcher which predicts your next memory accesses. So if you search sequentially in an array your prefetcher recognizes the memory access pattern and loads the memory into the CPU cache before your CPU accesses it. When the CPU actually accesses the next memory element, it is already in the cache and can be accessed quickly.
Without caches & prefetchers your CPU would have to wait for the memory controller to fetch the data from the RAM, which is quite slow in comparison to the CPU cache.
In a BST you don't do sequential access. In the worst case your BST does not reside in contiguous memory, but each node is at some arbitrary location in memory. Your prefetcher cannot predict this. The CPU then has to wait for each element to be fetched from memory.
Another thought, which holds even without prefetchers, concerns the cache line. On x86_64 a cache line is 64 bytes long. Each integer is either 4 or 8 bytes, so you can scan 16 or 8 array entries per cache line. The first access to the memory location fetches the whole line, so you only pay for the memory access once per 16 or 8 comparisons.
For the BST the same argument as above applies. The node memory is likely not on the same cache line, so you have to do a memory access for each comparison.
To summarize: A) The memory access takes significantly more time than the comparison; B) whether searching through an array or a BST is faster depends on the number of items.
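The cache line arithmetic above can be written down as a small C sketch. The 64-byte line size is the figure from the answer and is an assumption about the target machine. The BST node layout is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size; 64 bytes is typical for x86_64 but not universal. */
enum { CACHE_LINE_BYTES = 64 };

/* Hypothetical BST node: two child pointers plus a key. */
struct bst_node {
    struct bst_node *left;
    struct bst_node *right;
    int32_t key;
};

/* How many whole elements of elem_bytes fit in one line of line_bytes. */
size_t elems_per_line(size_t line_bytes, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    return line_bytes / elem_bytes;
}
```

With these assumptions, elems_per_line(CACHE_LINE_BYTES, sizeof(int32_t)) is 16 and the 8-byte case gives 8. On a typical 64-bit platform sizeof(struct bst_node) is 24, so at most two nodes could share a line even if they happened to be adjacent, and nothing guarantees that they are.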
I know that an array is allocated as a contiguous block of memory and we can therefore access its elements by calculating the byte/word offset from the beginning of the array very easily.
I'm aware that linked list traversal is less efficient than array traversal due to cache inefficiency, where branch prediction won't work well in the way it would for an array. However, I've also heard that it's quicker to iterate from one element of an array to the next than it is to access the pointer of the next element in a linked list due to the way we access the array using an offset.
How is the pointer access in the linked list slower than the offset access in the array?
cache inefficiency, where branch prediction won't work well
These are different things. Linked lists suffer from cache inefficiency:
Nodes are usually not allocated contiguously and in order, which is bad for spatial locality. You can sometimes avoid this, for example with custom allocators. With generational garbage collection, allocating nodes closely together in time also tends to put them close together in space, but that's probably not a very common thing to actually happen when using a linked list.
Having a pointer (and potentially other junk, like an object header and padding) in the node wastes space. Wasting a bunch of space is not inherently super bad, but it is bad when the wasted space is touched, which loads it into the cache. That actually happens here: that pointer to the next node is definitely needed, and the other junk is likely in the same cache line so it gets pulled in as well. This wastes both cache space and bandwidth (both to higher level caches and maybe to memory), which is pretty bad.
Linked lists don't really suffer from branch misprediction inherently. Yes, if you iterate over one, the last branch (the one that exits the loop) has a decent chance of being mispredicted, but that is not specific to linked lists.
How is the pointer access in the linked list slower than the offset access in the array?
Loading a pointer at all is slower than calculating the next address of an element in an array, both in terms of latency and in terms of throughput. For a quick comparison, typical on a modern machine is that loading that pointer takes around 4 cycles (at best! if there is a cache miss, it takes much longer) and could be done twice per cycle. Adding the size of an array element to the current address takes 1 cycle and can be done 4 times per cycle, and you (or the compiler) may be able to re-use the increment of the loop counter for this with some clever coding. For example, maybe you can use indexed addressing with the loop counter (which is incremented anyway) as index, or you can "steal" the loop counter entirely and increment it by the size of an element (scaling the loop-end correspondingly), or have no loop counter and directly compare the current address to the address just beyond the end of the array. Compilers like to use tricks like these automatically.
It's actually much worse than that makes it sound, because loading those pointers in a linked list is completely serial. Yes, the CPU can load two things per cycle, but it takes 4 cycles until it knows where the next node is so that it can start loading the next pointer, so realistically it can find the address of a node only once every 4th cycle. Computing the addresses of array elements has no such problem, maybe there will be a latency of 1 between the computation of successive addresses but (because actual loops cannot be faster than that anyway) that only hurts when the loop is unrolled, and if necessary the address of the element k steps ahead can be computed just by adding k*sizeof(element) (so several addresses can be computed independently, and compilers do this too when they unroll loops).
Doing a sufficient amount of work per "step" through a linked list can hide the latency problem.
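The contrast between independent and serial address computation can be sketched in C. The names below are hypothetical, and whether a particular compiler produces these shapes on its own depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
    long value;
};

/* Array sum, unrolled by four. The addresses of a[i], a[i + 1], a[i + 2]
   and a[i + 3] are all derived from i by arithmetic, so the four loads
   do not depend on each other. */
long sum_array_unrolled(const long *a, size_t n)
{
    assert(a != NULL || n == 0);
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

/* List sum. The address of each node is only known once the previous
   node's next field has been loaded, so the loads form a serial chain. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```

Both functions do the same amount of arithmetic per element. The difference is only in how the next address becomes available.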
Accessing the pointer requires an additional memory read (which is slow compared to calculations): To read the value of the next element, first the pointer needs to be read from memory, then the contents of the referenced address need to be read. For an array, there is only one memory read access for the value (assuming the base address is kept in a register during the iteration).
I was wondering what were the advantages and disadvantages of linked-list compared to contiguous arrays in C. Therefore I read a wikipedia article about linked-lists.
https://en.wikipedia.org/wiki/Linked_list#Disadvantages
According to this article, the disadvantages are the following:
They use more memory than arrays because of the storage used by their pointers.
Nodes in a linked list must be read in order from the beginning as linked lists are inherently sequential access.
Difficulties arise in linked lists when it comes to reverse traversing. For instance, singly linked lists are cumbersome to navigate backwards and while doubly linked lists are somewhat easier to read, memory is wasted in allocating.
Nodes are stored incontiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.
I understand the first 3 points but I am having a hard time with the last one:
Nodes are stored incontiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.
The article about CPU cache does not mention anything about non-contiguous memory arrays. As far as I know, CPU caches just cache frequently used addresses for a total 10^-6 cache miss.
Therefore, I do not understand why the CPU cache should be less efficient when it comes to non contiguous memory arrays.
CPU caches actually do two things.
The one you mentioned is caching recently used memory.
The other, however, is predicting which memory is going to be used in the near future. The algorithm is usually quite simple: it assumes that the program is processing a big array of data, and whenever the program accesses some memory it prefetches a few more bytes after it.
This doesn't work for linked list as the nodes are randomly placed in memory.
Additionally, the CPU loads memory in bigger blocks (64 or 128 bytes). So for an int64 array, a single read brings in the data for 8 or 16 elements. For a linked list it reads one block and the rest may be wasted, as the next node can be in a completely different chunk of memory.
And last but not least, related to previous section - linked list takes more memory for its management, the most simple version will take at least additional sizeof(pointer) bytes for the pointer to the next node. But it's not so much about CPU cache anymore.
The article is only scratching the surface, and gets some things wrong (or at least questionable), but the overall outcome is usually about the same: linked lists are much slower.
One thing to note is that "nodes are stored incontiguously [sic]" is an overly strong claim. It is true that in general nodes returned by, for example, malloc may be spread around in memory, especially if nodes are allocated at different times or from different threads. However, in practice, many nodes are often allocated on the same thread, at the same time, and these will often end up quite contiguous in memory, because good malloc implementations are, well, good! Furthermore, when performance is a concern, you may often use special allocators on a per-object basis, which allocate the fixed-size nodes from one or more contiguous chunks of memory, which will provide great spatial locality.
So you can assume that in at least some scenarios, linked lists will give you reasonable to good spatial locality. It largely depends on whether you are adding most or all of your list elements at once (linked lists do OK), or are constantly adding elements over a longer period of time (linked lists will have poor spatial locality).
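As a rough illustration of the "special allocator" idea, here is a minimal sketch of a fixed-size node pool in C. All names are hypothetical, and a production allocator would also handle growth and reuse of freed nodes, which this sketch deliberately leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
    struct node *next;
    int item;
};

/* Hypothetical bump-style pool: all nodes come from one contiguous chunk. */
struct node_pool {
    struct node *slots;
    size_t capacity;
    size_t used;
};

/* Returns 0 on success, -1 if the chunk could not be allocated. */
int pool_init(struct node_pool *pool, size_t capacity)
{
    assert(pool != NULL);
    pool->slots = malloc(capacity * sizeof *pool->slots);
    pool->capacity = pool->slots != NULL ? capacity : 0;
    pool->used = 0;
    return pool->slots != NULL ? 0 : -1;
}

/* Hands out the next free slot, or NULL when the pool is exhausted.
   Individual nodes are never freed; the whole pool is released at once. */
struct node *pool_alloc(struct node_pool *pool)
{
    if (pool->used == pool->capacity)
        return NULL;
    return &pool->slots[pool->used++];
}

void pool_destroy(struct node_pool *pool)
{
    free(pool->slots);
    pool->slots = NULL;
    pool->capacity = 0;
    pool->used = 0;
}
```

Nodes handed out back to back by pool_alloc sit next to each other in memory, so a list built in allocation order is traversed in address order. That ordering is lost again if the list is later reordered heavily.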
Now, on the side of lists being slow, one of the main issues glossed over with linked lists is the large constant factors associated with some operations relative to the array variant. Everyone knows that accessing an element given its index is O(n) in a linked list and O(1) in an array, so you don't use the linked list if you are going to do a lot of accesses by index. Similarly, everyone knows that adding an element to the middle of a list takes O(1) time in a linked list, and O(n) time in an array, so the former wins in that scenario.
What they don't address is that even operations that have the same algorithmic complexity can be much slower in practice in one implementation...
Let's take iterating over all the elements in a list (looking for a particular value, perhaps). That's an O(n) operation regardless if you use a linked or array representation. So it's a tie, right?
Not so fast! The actual performance can vary a lot! Here is what typical find() implementations would look like when compiled at -O2 optimization level in x86 gcc, thanks to godbolt which makes this easy.
Array
C Code
int find_array(int val, int *array, unsigned int size) {
    for (unsigned int i = 0; i < size; i++) {
        if (array[i] == val)
            return i;
    }
    return -1;
}
Assembly (loop only)1
.L6:
    add rsi, 4
    cmp DWORD PTR [rsi-4], edi
    je .done
    add eax, 1
    cmp edx, eax
    jne .L6
Linked List
C Code
struct Node {
    struct Node *next;
    int item;
};
struct Node *find_list(int val, struct Node *listptr) {
    while (listptr) {
        if (listptr->item == val)
            return listptr;
        listptr = listptr->next;
    }
    return 0;
}
Assembly (loop only)
.L20:
    cmp DWORD PTR [rax+8], edi
    je .done
    mov rax, QWORD PTR [rax]
    test rax, rax
    jne .L20
Just eyeballing the C code, both methods look competitive. The array method is going to have an increment of i, a couple of comparisons, and one memory access to read the value from the array. The linked list version is going to have a couple of (adjacent) memory accesses to read the Node.item and Node.next members, and a couple of comparisons.
The assembly seems to bear that out: the linked list version has 5 instructions and the array version2 has 6. All of the instructions are simple ones that have a throughput of 1 per cycle or more on modern hardware.
If you test it though, with both lists fully resident in L1, you'll find that the array version executes at about 1.5 cycles per iteration, while the linked list version takes about 4! That's because the linked list version is limited by its loop-carried dependency on listptr. The one line listptr = listptr->next boils down to one instruction, but that one instruction will never execute more than once every 4 cycles, because each execution depends on the completion of the prior one (you need to finish reading listptr->next before you can calculate listptr->next->next). Even though modern CPUs can execute something like 2 loads every cycle, these loads take ~4 cycles to complete, so you get a serial bottleneck here.
The array version also has loads, but the address doesn't depend on the prior load:
add rsi, 4
cmp DWORD PTR [rsi-4], edi
It depends only on rsi, which is simply calculated by adding 4 each iteration. An add has a latency of one cycle on modern hardware, so this doesn't create a bottleneck (unless you get below 1 cycle/iteration). So the array loop is able to use the full power of the CPU, executing many instructions in parallel. The linked list version is not.
This isn't unique to "find": any linked list operation that needs to iterate over many elements will have this pointer chasing behavior, which is inherently slow on modern hardware.
1I omitted the epilogue and prologue for each assembly function because it really isn't doing anything interesting. Both versions had no epilogue at all really, and the prologue was very similar for both, peeling off the first iteration and jumping into the middle of the loop. The full code is available for inspection in any case.
2It's worth noting that gcc didn't really do as well as it could have here, since it maintains both rsi as the pointer into the array, and eax as the index i. This means two separate cmp instructions, and two increments. Better would have been to maintain only the pointer rsi in the loop, and to compare against (array + 4*size) as the "not found" condition. That would eliminate one increment. Additionally, you could eliminate one cmp by having rsi run from -4*size up to zero, and indexing into array using [rdi + rsi] where rdi is array + 4*size. Shows that even today optimizing compilers aren't getting everything right!
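The first improvement suggested in footnote 2, keeping only a moving pointer and comparing it against the end address, can also be written at the C level. This is a sketch with a hypothetical function name. Whether a given compiler then emits the tighter loop is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Variant of the array search that keeps only a moving pointer and
   compares it against the one-past-the-end address, with no index i. */
int find_array_ptr(int val, const int *array, unsigned int size)
{
    assert(array != NULL || size == 0);
    if (size == 0)
        return -1;
    const int *end = array + size;
    for (const int *p = array; p != end; p++) {
        if (*p == val)
            return (int)(p - array);   /* recover the index only on a hit */
    }
    return -1;
}
```

The index is reconstructed from the pointer difference only when a match is found, so the loop body itself needs one increment and one loop-end comparison.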
CPU cache usually takes in a page of a certain size, for example (the common one) 4096 bytes or 4kB, and accesses the information it needs from there. Fetching a page takes a considerable amount of time, let's say 1000 cycles. If, say, we have a contiguous array of 4096 bytes, we will fetch a 4096-byte page from memory into the cache and most of the data will probably be there. If not, we may need to fetch another page to get the rest of the data.
Example: We have 2 pages covering 0-8191 and the array lies between 2048 and 6143, so we fetch page #1 (0-4095) to get the first elements and then page #2 (4096-8191) to get all the array elements we want. This results in fetching 2 pages from memory into our cache to get our data.
What happens in a list though? In a list the data are non-contiguous which means that the elements are not in contiguous places in memory so they are probably scattered through various pages. This means that a CPU has to fetch a lot of pages from memory to the cache to get the desired data.
Example: Node#1 mem_address = 1000, Node#2 mem_address = 5000, Node#3 mem_address = 18000. If the CPU is able to see in 4k pages sizes then it has to fetch 3 different pages from memory to find the data it wants.
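The block counting in these examples can be checked with a few lines of C. The function names are made up for this sketch, and the block size is a parameter on purpose: the answer talks about 4096-byte pages, but hardware caches commonly transfer much smaller cache lines (64 bytes is a typical figure), while 4096 bytes is the usual virtual memory page size. The counting works the same way for either granularity.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the block of block_bytes that contains addr. */
size_t block_index(size_t addr, size_t block_bytes)
{
    assert(block_bytes > 0);
    return addr / block_bytes;
}

/* Number of distinct blocks touched by len bytes starting at start. */
size_t blocks_spanned(size_t start, size_t len, size_t block_bytes)
{
    assert(len > 0);
    return block_index(start + len - 1, block_bytes)
         - block_index(start, block_bytes) + 1;
}
```

With 4096-byte blocks, a 4096-byte array starting at address 2048 spans 2 blocks, and the node addresses 1000, 5000 and 18000 fall into blocks 0, 1 and 4. With 64-byte blocks the same array spans 64 blocks that are all adjacent, while the three nodes still land in three unrelated blocks.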
Also, the memory uses prefetch techniques to fetch pages of memory before they are needed so if the linked list is small let's say A -> B -> C, then the first cycle will be slow because the prefetcher can't predict the next block to fetch. But, on the next cycle we say that the prefetcher is warmed up and it can start predicting the path of the linked list and fetch the correct blocks on time.
To summarize, arrays are easily predictable by the hardware and are in one place, so they are easy to fetch, while linked lists are unpredictable and are scattered throughout memory, which makes the life of the predictor and CPU harder.
BeeOnRope's answer is good and highlights the cycle count overheads of traversing a linked list vs iterating through an array, but as he explicitly says, that's assuming "both lists fully resident in L1". However, it's far more likely that an array will fit better in L1 than a linked list, and the moment you start thrashing your cache the performance difference becomes huge. RAM can be more than 100x slower than L1, with L2 and L3 (if your CPU has any) being between 3x and 14x slower.
On a 64 bit architecture, each pointer takes 8 bytes, and a doubly linked list needs two of them or 16 bytes of overhead. If you only want a single 4 byte uint32 per entry, that means you need 5x as much storage for the dlist as you need for an array. Arrays guarantee locality, and although malloc can do OK at locality if you allocate stuff together in the right order, you often can't. Let's approximate poor locality by saying it takes 2x the space, so a dlist uses 10x as much "locality space" as an array. That's enough to push you from fitting in L1 to overflowing into L3, or even worse from L2 into RAM.
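The 5x figure can be reproduced with a small C sketch. The node layout and the helper name are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical doubly linked list node holding one 4-byte value. */
struct dnode {
    struct dnode *prev;
    struct dnode *next;
    uint32_t value;
};

/* Bytes stored per entry divided by payload bytes, rounded up. */
size_t storage_ratio(size_t bytes_per_entry, size_t payload_bytes)
{
    assert(payload_bytes > 0);
    return (bytes_per_entry + payload_bytes - 1) / payload_bytes;
}
```

storage_ratio(8 + 8 + 4, 4) gives the 5x from the text. On a typical 64-bit platform the compiler pads struct dnode to 24 bytes for alignment, so storage_ratio(sizeof(struct dnode), sizeof(uint32_t)) is more likely to be 6, and the allocator's own per-block bookkeeping comes on top of that.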
I was reviewing an interview question and comparing notes with a friend, and we have different ideas on one with respect to CPU caching.
Consider a large set of data, such as a large array of double, ie:
double data[1024];
Consider using a dynamically allocated on-the-fly linked list to store the same number of elements. The question asked for a description of a number of trade-offs:
Which allows for quicker random access: we both agreed the array was quicker, since you didn't have to traverse the list in a linear fashion (O(n)), just provide an index (O(1)).
Which is quicker for comparing two lists of the same length: we both decided that if it was just primitive data types, the array would allow for a memcmp(), while the linked list required element-wise comparison plus dereferencing overhead.
Which allowed for more efficient caching if you were accessing the same element several times?
In point 3, this is where our opinions differed. I contended that the CPU is going to try and cache the entire array, and if the array is obscenely large, it can't be stored in cache, and therefore there will be no caching benefit. With the linked list, individual elements can be cached. Therefore, linked lists lend themselves to cache "hits" more than static arrays do when dealing with a very large number of elements.
To the question: Which of the two is better for cache "hits", and can modern systems cache part of an array, or do they need the whole array or it won't try? Any sort of references to technical documents or standards I could also use to provide a definitive answer would help a lot.
Thanks!
The CPU doesn't know about your data structures. It caches more-or-less raw blocks of memory. Therefore, if you suppose you can access the same one element multiple times without traversing the list each time, then neither linked list nor array has a caching advantage over the other.
HOWEVER, arrays have a big advantage over dynamically-allocated linked lists for accessing multiple elements in sequence. Because CPU caches operate on blocks of memory rather larger than one double, when one array element is in the cache, it is likely that several others that reside at adjacent addresses are in the cache, too. Thus one (slow) read from main memory gives access to (fast) cached access to multiple adjacent array elements. The same is not true of linked lists, as nodes may be allocated anywhere in memory, and even a single node has at minimum the overhead of a next pointer to dilute the number of data elements that may be cached at the same time.
Caches don't know about arrays, they just see memory accesses and store a little bit of the memory near that address. Once you've accessed something at an address it should stick around in the cache a while, regardless of whether that address belongs to an array or a linked list. But the cache controllers don't really know what's being accessed.
When you traverse an array, the cache system may pre-fetch the next bit of an array. This is usually heuristically driven (maybe with some compiler hints).
Some hardware and toolchains offer intrinsics that let you control cache residency (through pre-fetches, explicit flushes and so forth). Normally you don't need this kind of control, but for things like DSP code, resource-constrained game consoles and OS-level stuff that needs to worry about cache coherency it's pretty common to see people use this functionality.
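As one concrete example of such an intrinsic, GCC and Clang provide __builtin_prefetch. The sketch below wraps it in a macro that is local to this example and falls back to a no-op on other compilers. The list layout and function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
    long value;
};

/* PREFETCH expands to the GCC/Clang builtin where available and to
   nothing elsewhere. The macro name is local to this sketch. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)0)
#endif

long sum_list_prefetch(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        PREFETCH(p->next);   /* hint only; the result does not depend on it */
        sum += p->value;
    }
    return sum;
}
```

For a plain list the hint can only be issued once the current node has already been loaded, so it is likely to help only when there is a fair amount of work per node. Any gain has to be measured on the target hardware.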
I was trying to understand this paper about cache timing issues
In Section 3.6, the authors explain a technique that allows you to populate a contiguous cache region and measure the time for this populating process. They mentioned:
"A naive implementation of the prime and probe steps (i.e., scanning the memory buffer in fixed strides) gives poor results due to two optimizations implemented in modern CPUs: reordering of memory accesses, and automatic read-ahead of memory by the “hardware prefetcher”. Our attack code works around both disruptions by using the following “pointer chasing” technique. During initialization, the attacker’s memory is organized into a linked list (optionally, randomly permuted); later, priming and probing are done by traversing this list (see Figure 7). To minimize cache thrashing (self-eviction), we use a doubly-linked list and traverse it forward for priming but backward for probing. Moreover, to avoid “polluting” its own samples, the probe code stores each obtained sample into the same cache set it has just finished measuring. On some platforms one can improve the timing gap by using writes instead of reads, or more than W reads."
I have a question about this paragraph:
How does linked list avoid the timing variation due to hardware prefetch and reordering?
Currently implemented hardware prefetchers only handle fixed stride access patterns, so accesses that are ordered arbitrarily with respect to memory address are not recognized as prefetchable. Such hardware prefetchers could still detect and use accidental stride patterns. E.g., if address[X+5] = address[X] + N and address[X+7] = address[X+5] + 2N (note that the addresses do not have to be separated by a constant access count), a prefetcher might predict a stride of N. This could have the effect of modestly reducing apparent miss latency (some prefetches would be valid) or introducing some cache pollution.
(If the detected/predicted N is small and the linked list is contained within a modest region of memory relative to cache size (so that most of the stride prefetches will be to addresses that will be used soonish), stride-based prefetching might have a noticeable positive effect.)
By using data-dependent accesses (traversing a link requires access to the data), access reordering is prevented. Hardware cannot load the next node until the next pointer from the previous node is available.
There have been academic proposals for supporting prefetching of such patterns and even generic correlations (e.g., see this Google Scholar search for Markov prefetching), but such have not yet (to my knowledge) been implemented in commercial hardware.
As a side note, by traversing the probing accesses in reverse order from the priming accesses, excessive misses are avoided under common, LRU-oriented replacement policies.
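The pointer chasing idea can be sketched in C as follows. This is a simplified illustration, not the paper's code: it uses an index array instead of a doubly linked list, it picks Sattolo's algorithm to get a random permutation that forms a single cycle (the paper only says "randomly permuted"), and the small generator and all names are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small deterministic generator so the sketch is reproducible. Any
   random source would do. */
static uint64_t lcg_next(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return *state >> 33;
}

/* Fills next[0..n-1] so that following next[] from any slot visits every
   slot exactly once before returning to the start (Sattolo's algorithm). */
void build_chain(size_t *next, size_t n, uint64_t seed)
{
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n; i > 1; i--) {
        /* j is drawn from [0, i - 2], never i - 1 itself */
        size_t j = (size_t)(lcg_next(&seed) % (i - 1));
        size_t tmp = next[i - 1];
        next[i - 1] = next[j];
        next[j] = tmp;
    }
}

/* Follows the chain for the given number of hops. Each index comes out
   of the previous load, so every access depends on the one before it. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t pos = start;
    while (steps-- > 0)
        pos = next[pos];
    return pos;
}
```

After build_chain(next, n, seed), a call such as chase(next, 0, n) walks all n slots in a permuted order and ends where it started. Timing such a walk is what the prime and probe steps rely on, with slots spaced and sized to match the cache sets being measured.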