Linked List vs. Array Traversal Efficiency - arrays

I know that an array is allocated as a contiguous block of memory and we can therefore access its elements by calculating the byte/word offset from the beginning of the array very easily.
I'm aware that linked list traversal is less efficient than array traversal due to cache inefficiency, where branch prediction won't work well in the way it would for an array. However, I've also heard that it's quicker to iterate from one element of an array to the next than it is to access the pointer of the next element in a linked list, due to the way we access the array using an offset.
How is the pointer access in the linked list slower than the offset access in the array?

cache inefficiency, where branch prediction won't work well
These are different things. Linked lists suffer from cache inefficiency:
Nodes are generally not allocated contiguously or in order, which is bad for spatial locality. You can sometimes avoid this, for example with custom allocators. With generational garbage collection, allocating nodes close together in time also tends to put them close together in space, but that's probably not a very common situation when using a linked list.
Having a pointer (and potentially other junk, like an object header and padding) in the node wastes space. Wasting a bunch of space is not inherently super bad, but it is bad when the wasted space is touched, which loads it into the cache. That actually happens here: that pointer to the next node is definitely needed, and the other junk is likely in the same cache line so it gets pulled in as well. This wastes both cache space and bandwidth (both to higher level caches and maybe to memory), which is pretty bad.
Linked lists don't really suffer from branch misprediction inherently. Yes, if you iterate over one, the last branch (the one that exits the loop) has a decent chance of being mispredicted, but that is not specific to linked lists.
How is the pointer access in the linked list slower than the offset access in the array?
Loading a pointer at all is slower than calculating the next address of an element in an array, both in terms of latency and in terms of throughput. For a quick comparison, a typical figure on a modern machine is that loading that pointer takes around 4 cycles (at best; if there is a cache miss, it takes much longer), and two such loads can be started per cycle. Adding the size of an array element to the current address takes 1 cycle and can be done 4 times per cycle, and you (or the compiler) may be able to reuse the increment of the loop counter for this with some clever coding. For example, you might use indexed addressing with the loop counter (which is incremented anyway) as the index, or you can "steal" the loop counter entirely and increment it by the size of an element (scaling the loop bound correspondingly), or have no loop counter at all and directly compare the current address to the address just beyond the end of the array. Compilers like to apply tricks like these automatically.
It's actually much worse than that makes it sound, because loading those pointers in a linked list is completely serial. Yes, the CPU can load two things per cycle, but it takes 4 cycles until it knows where the next node is before it can even start loading the next pointer, so realistically it can find the address of a new node only once every 4 cycles. Computing the addresses of array elements has no such problem. There may be a latency of 1 cycle between the computation of successive addresses, but (because actual loops cannot run faster than that anyway) it only hurts when the loop is unrolled, and if necessary the address of the element k steps ahead can be computed independently, just by adding k*sizeof(element), so several addresses can be in flight at once (compilers do this too when they unroll loops).
Doing a sufficient amount of work per "step" through a linked list can hide the latency problem.
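To make that concrete, here is a minimal C sketch (the node layout is an assumption for illustration, not taken from the question). In the list loop each iteration must wait for the previous load of p->next before the next node's address is even known; in the manually unrolled array loop the four addresses are independent additions, so the loads can all be in flight at once.

#include <stddef.h>

struct node { struct node *next; int item; };   /* hypothetical node type */

/* List traversal: p = p->next is a serial chain of loads (~4 cycles each). */
long sum_list(const struct node *p) {
    long sum = 0;
    while (p) {
        sum += p->item;
        p = p->next;              /* next address known only after this load */
    }
    return sum;
}

/* Array traversal, unrolled by 4: &a[i], &a[i+1], ... are independent,
   so the CPU can issue these loads in parallel. */
long sum_array(const int *a, size_t n) {
    long sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)            /* leftover elements */
        sum += a[i];
    return sum;
}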

Accessing the pointer requires an additional memory read (which is slow compared to calculations): To read the value of the next element, first the pointer needs to be read from memory, then the contents of the referenced address need to be read. For an array, there is only one memory read access for the value (assuming the base address is kept in a register during the iteration).
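As a tiny sketch of that difference (hypothetical names, nothing from the question):

struct node { struct node *next; int item; };

/* Array: one load; the base address and index stay in registers. */
int array_get(const int *a, int i) {
    return a[i];                        /* single memory read */
}

/* Linked list: two dependent loads -- the pointer, then the data it names. */
int list_next_item(const struct node *n) {
    const struct node *next = n->next;  /* read 1: the pointer */
    return next->item;                  /* read 2: the referenced value */
}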

Related

Scanning Binary Search Tree vs Array

In what ways would finding an element (traverse) in a BST be slower than linearly scanning for it within an array?
The answer supposedly has to do with caching. Can someone explain what exactly that means and why it holds true?
How exactly does caching favour the array here rather than the BST?
Thanks
My guess is that using a BST doesn't give you any advantage, since, even if you're caching data (which means there's some kind of locality, you may access the same element later for example), the insert operation and find operation always cost O(h) where h is the height of the tree. In the worst case, even O(n).
Whereas with an array, the first scan may be linear, but whenever you access the same element afterwards, spatial and temporal locality mean you tend to hit the same chunks of contiguous memory repeatedly, and since you already know the index, the access itself is constant time.
I assume caching relates to CPU caches, which come with a prefetcher that predicts your next memory accesses. So if you search sequentially in an array, your prefetcher recognizes the memory access pattern and loads the memory into the CPU cache before your CPU accesses it. When the CPU actually accesses the next memory element, it is already in the cache and can be accessed quickly.
Without caches & prefetchers your CPU would have to wait for the memory controller to fetch the data from the RAM, which is quite slow in comparison to the CPU cache.
In a BST you don't do sequential access. In the worst case your BST does not reside in contiguous memory, but each node is at some arbitrary location in memory. Your prefetcher cannot predict this. The CPU then has to wait for each element to be fetched from memory.
A further thought, setting prefetchers aside, concerns the cache line. On x86_64 a cache line is 64 bytes long. Each integer is either 4 or 8 bytes, so you can scan 16 or 8 array entries per cache line. The first access to the memory location fetches the whole line, so you pay the memory access only once per 8 (or 16) comparisons.
For the BST the same argument as above applies. The node memory is likely not on the same cache line, so you have to do a memory access for each comparison.
To summarize: A) the memory access takes significantly more time than the comparison; B) whether searching through an array or a BST is faster depends on the number of items.
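As a rough C sketch of the two searches being compared (the node layout here is an assumption): the array scan walks sequential addresses, so one cache line serves many comparisons and the prefetcher can run ahead, while the BST search does fewer comparisons but each one may sit on its own cache line.

#include <stddef.h>

struct bst_node { int key; struct bst_node *left, *right; };

/* Linear scan: sequential addresses, prefetcher-friendly. */
int scan_array(const int *a, size_t n, int key) {
    for (size_t i = 0; i < n; i++)
        if (a[i] == key)
            return 1;
    return 0;
}

/* BST search: O(log n) steps, but each step chases a pointer to a node
   that is typically on a different cache line. */
int search_bst(const struct bst_node *t, int key) {
    while (t) {
        if (key == t->key)
            return 1;
        t = (key < t->key) ? t->left : t->right;
    }
    return 0;
}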

CPU Cache disadvantages of using linked lists in C

I was wondering what the advantages and disadvantages of linked lists are compared to contiguous arrays in C, so I read the Wikipedia article about linked lists.
https://en.wikipedia.org/wiki/Linked_list#Disadvantages
According to this article, the disadvantages are the following:
They use more memory than arrays because of the storage used by their pointers.
Nodes in a linked list must be read in order from the beginning as linked lists are inherently sequential access.
Difficulties arise in linked lists when it comes to reverse traversal. For instance, singly linked lists are cumbersome to navigate backwards, and while doubly linked lists are somewhat easier to read, memory is wasted in allocating space for a back pointer.
Nodes are stored incontiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.
I understand the first 3 points but I am having a hard time with the last one:
Nodes are stored incontiguously, greatly increasing the time required to access individual elements within the list, especially with a CPU cache.
The article about the CPU cache does not mention anything about non-contiguous memory arrays. As far as I know, CPU caches just cache frequently used addresses, giving something like a 10^-6 cache-miss rate.
Therefore, I do not understand why the CPU cache should be less efficient when it comes to non-contiguous memory arrays.
CPU caches actually do two things.
The one you mentioned is caching recently used memory.
The other however is predicting which memory is going to be used in the near future. The algorithm is usually quite simple - it assumes that the program processes a big array of data, and whenever it accesses some memory it prefetches a few more bytes ahead.
This doesn't work for linked list as the nodes are randomly placed in memory.
Additionally, the CPU loads bigger blocks of memory (64, 128 bytes). Again, for an int64 array a single read gives it data for processing 8 or 16 elements. For a linked list it reads one block and the rest may be wasted, as the next node can be in a completely different chunk of memory.
And last but not least, related to the previous point: a linked list takes more memory for its management; the simplest version needs at least an additional sizeof(pointer) bytes for the pointer to the next node. But that is not so much about the CPU cache anymore.
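To get a feel for that overhead, a small sketch (exact sizes depend on the target ABI and padding; on a typical 64-bit system the node below is 16 bytes, so a 64-byte cache line holds 4 nodes' worth of payload versus 16 plain ints):

#include <stdio.h>

struct node { struct node *next; int value; };   /* 8 + 4 (+ 4 padding) bytes */

int main(void) {
    printf("sizeof(int)         = %zu\n", sizeof(int));
    printf("sizeof(struct node) = %zu\n", sizeof(struct node));
    return 0;
}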
The article is only scratching the surface, and gets some things wrong (or at least questionable), but the overall outcome is usually about the same: linked lists are much slower.
One thing to note is that "nodes are stored incontiguously [sic]" is an overly strong claim. It is true that in general nodes returned by, for example, malloc may be spread around in memory, especially if nodes are allocated at different times or from different threads. However, in practice, many nodes are often allocated on the same thread, at the same time, and these will often end up quite contiguous in memory, because good malloc implementations are, well, good! Furthermore, when performance is a concern, you may often use special allocators on a per-object basis, which allocate the fixed-size nodes from one or more contiguous chunks of memory, which will provide great spatial locality.
So you can assume that in at least some scenarios, linked lists will give you reasonable to good spatial locality. It largely depends on whether you are adding most or all of your list elements at once (linked lists do OK), or are constantly adding elements over a longer period of time (linked lists will have poor spatial locality).
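For illustration, a minimal sketch of such a per-object pool (names made up; no per-node freeing or growth, just the idea of carving fixed-size nodes out of one contiguous chunk):

#include <stdlib.h>

struct node { struct node *next; int item; };

struct pool {
    struct node *slab;    /* one contiguous chunk of nodes */
    size_t used, cap;
};

static int pool_init(struct pool *p, size_t cap) {
    p->slab = malloc(cap * sizeof *p->slab);
    p->used = 0;
    p->cap  = cap;
    return p->slab != NULL;
}

static struct node *pool_alloc(struct pool *p) {
    /* nodes allocated in sequence are adjacent in memory */
    return (p->used < p->cap) ? &p->slab[p->used++] : NULL;
}

static void pool_destroy(struct pool *p) {
    free(p->slab);
}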
Now, on the side of lists being slow, one of the main issues glossed over with linked lists is the large constant factors associated with some operations relative to the array variant. Everyone knows that accessing an element given its index is O(n) in a linked list and O(1) in an array, so you don't use the linked list if you are going to do a lot of accesses by index. Similarly, everyone knows that adding an element to the middle of a list takes O(1) time in a linked list (given a pointer to the insertion point), and O(n) time in an array, so the former wins in that scenario.
What they don't address is that even operations that have the same algorithmic complexity can be much slower in practice in one implementation...
Let's take iterating over all the elements in a list (looking for a particular value, perhaps). That's an O(n) operation regardless if you use a linked or array representation. So it's a tie, right?
Not so fast! The actual performance can vary a lot! Here is what typical find() implementations would look like when compiled at -O2 optimization level in x86 gcc, thanks to godbolt which makes this easy.
Array
C Code
int find_array(int val, int *array, unsigned int size) {
    for (unsigned int i = 0; i < size; i++) {
        if (array[i] == val)
            return i;
    }
    return -1;
}
Assembly (loop only)1
.L6:
    add     rsi, 4
    cmp     DWORD PTR [rsi-4], edi
    je      .done
    add     eax, 1
    cmp     edx, eax
    jne     .notfound
Linked List
C Code
struct Node {
    struct Node *next;
    int item;
};

struct Node *find_list(int val, struct Node *listptr) {
    while (listptr) {
        if (listptr->item == val)
            return listptr;
        listptr = listptr->next;
    }
    return 0;
}
Assembly (loop only)
.L20:
    cmp     DWORD PTR [rax+8], edi
    je      .done
    mov     rax, QWORD PTR [rax]
    test    rax, rax
    jne     .notfound
Just eyeballing the C code, both methods look competitive. The array method is going to have an increment of i, a couple of comparisons, and one memory access to read the value from the array. The linked list version is going to have a couple of (adjacent) memory accesses to read the Node.item and Node.next members, and a couple of comparisons.
The assembly seems to bear that out: the linked list version has 5 instructions and the array version2 has 6. All of the instructions are simple ones that have a throughput of 1 per cycle or more on modern hardware.
If you test it though - with both lists fully resident in L1 - you'll find that the array version executes at about 1.5 cycles per iteration, while the linked list version takes about 4! That's because the linked list version is limited by its loop-carried dependency on listptr. The one line listptr = listptr->next boils down to one instruction, but that one instruction will never execute more than once every 4 cycles, because each execution depends on the completion of the prior one (you need to finish reading listptr->next before you can calculate listptr->next->next). Even though modern CPUs can execute something like 2 loads every cycle, these loads take ~4 cycles to complete, so you get a serial bottleneck here.
The array version also has loads, but the address doesn't depend on the prior load:
add rsi, 4
cmp DWORD PTR [rsi-4], edi
It depends only on rsi, which is simply calculated by adding 4 each iteration. An add has a latency of one cycle on modern hardware, so this doesn't create a bottleneck (unless you get below 1 cycle/iteration). So the array loop is able to use the full power of the CPU, executing many instructions in parallel. The linked list version is not.
This isn't unique to "find" - any linked-list operation that needs to iterate over many elements will have this pointer-chasing behavior, which is inherently slow on modern hardware.
1 I omitted the epilogue and prologue for each assembly function because they really aren't doing anything interesting. Both versions had essentially no epilogue at all, and the prologue was very similar for both, peeling off the first iteration and jumping into the middle of the loop. The full code is available for inspection in any case.
2It's worth noting that gcc didn't really do as well as it could have here, since it maintains both rsi as the pointer into the array, and eax as the index i. This means two separate cmp instructions, and two increments. Better would have been to maintain only the pointer rsi in the loop, and to compare against (array + 4*size) as the "not found" condition. That would eliminate one increment. Additionally, you could eliminate one cmp by having rsi run from -4*size up to zero, and indexing into array using [rdi + rsi] where rdi is array + 4*size. Shows that even today optimizing compilers aren't getting everything right!
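For reference, here is a hedged C version of roughly the first improvement described above - keeping only a pointer in the loop and comparing it against the one-past-the-end address (whether a given gcc version emits exactly the hoped-for assembly is not guaranteed):

int find_array_ptr(int val, int *array, unsigned int size) {
    int *p   = array;
    int *end = array + size;            /* one-past-the-end sentinel */
    for (; p != end; p++) {
        if (*p == val)
            return (int)(p - array);    /* recover the index */
    }
    return -1;
}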
Memory is cached in fixed-size chunks. Strictly speaking the hardware cache works in 64-byte lines, but for this illustration think in terms of pages of, say, 4096 bytes (4 KB). Fetching a chunk that isn't cached costs a considerable amount of time, let's say 1000 cycles. If we have a contiguous array of 4096 bytes, we fetch one 4096-byte page and most of the data we need will be in it; at worst we need to fetch one more page to get the rest.
Example: we have two pages covering addresses 0-8191, and the array sits between 2048 and 6144. We fetch page #1 (0-4095) to get the first elements and then page #2 (4096-8191) to get the rest, so two pages are brought from memory into the cache to cover all the data.
What happens with a list, though? In a list the data is non-contiguous, which means the elements are not in adjacent places in memory, so they are probably scattered across many pages. This means the CPU has to fetch a lot more chunks from memory into the cache to get the same amount of data.
Example: Node #1 at address 1000, Node #2 at address 5000, Node #3 at address 18000. With 4 KB chunks, the CPU has to fetch three different pages from memory to find the data it wants.
Also, the memory system uses prefetch techniques to fetch blocks before they are needed. For a small linked list, say A -> B -> C, the first pass will be slow because the prefetcher has no pattern to go on; on later passes the data may already be cached, and if the nodes happen to be laid out at regular intervals the prefetcher may pick up that stride, but in general it cannot follow pointers.
Summarizing: arrays are easily predictable by the hardware and sit in one place, so they are easy to fetch, while linked lists are unpredictable and scattered throughout memory, which makes life harder for both the prefetcher and the CPU.
BeeOnRope's answer is good and highlights the cycle-count overheads of traversing a linked list vs iterating through an array, but as he explicitly says, that's assuming "both lists fully resident in L1". However, it's far more likely that an array will fit in L1 than a linked list, and the moment you start thrashing your cache the performance difference becomes huge. RAM can be more than 100x slower than L1, with L2 and L3 (if your CPU has them) being between 3x and 14x slower.
On a 64-bit architecture, each pointer takes 8 bytes, and a doubly linked list needs two of them, or 16 bytes of overhead. If you only want a single 4-byte uint32 per entry, that means you need 5x as much storage for the dlist as you need for an array. Arrays guarantee locality, and although malloc can do OK at locality if you allocate things together in the right order, you often can't. Let's approximate poor locality by saying it takes 2x the space, so a dlist uses 10x as much "locality space" as an array. That's enough to push you from fitting in L1 to overflowing into L3, or even worse, from L2 into RAM.
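The 5x figure counts only the raw pointer bytes; with alignment padding the per-node footprint is usually a bit larger still, as this small sketch shows (exact numbers depend on the ABI):

#include <stdint.h>
#include <stdio.h>

/* Doubly linked node carrying one 4-byte payload: two 8-byte pointers
   plus the value plus padding, typically 24 bytes on a 64-bit target. */
struct dnode { struct dnode *prev, *next; uint32_t value; };

int main(void) {
    printf("array bytes per entry: %zu\n", sizeof(uint32_t));
    printf("dlist bytes per entry: %zu\n", sizeof(struct dnode));
    return 0;
}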

Storing arrays in cache

I was reviewing an interview question and comparing notes with a friend, and we have different ideas on one with respect to CPU caching.
Consider a large set of data, such as a large array of double, ie:
double data[1024];
Consider using a dynamically allocated on-the-fly linked list to store the same number of elements. The question asked for a description of a number of trade-offs:
Which allows for quicker random access: we both agreed the array was quicker, since you didn't have to traverse the list in a linear fashion (O(n)), just provide an index (O(1)).
Which is quicker for comparing two lists of the same length: we both decided that if it was just primitive data types, the array would allow for a memcmp(), while the linked list required element-wise comparison plus dereferencing overhead.
Which allowed for more efficient caching if you were accessing the same element several times?
In point 3, this is where our opinions differed. I contended that the CPU is going to try and cache the entire array, and if the array is obscenely large, it can't be stored in cache, and therefore there will be no caching benefit. With the linked list, individual elements can be cached. Therefore, linked lists lend themselves to cache "hits" more than static arrays do when dealing with a very large number of elements.
To the question: which of the two is better for cache "hits", and can modern systems cache part of an array, or do they need the whole array, or will they not try to cache it at all? Any sort of references to technical documents or standards I could use to provide a definitive answer would help a lot.
Thanks!
The CPU doesn't know about your data structures. It caches more-or-less raw blocks of memory. Therefore, if you suppose you can access the same one element multiple times without traversing the list each time, then neither linked list nor array has a caching advantage over the other.
HOWEVER, arrays have a big advantage over dynamically-allocated linked lists for accessing multiple elements in sequence. Because CPU caches operate on blocks of memory rather larger than one double, when one array element is in the cache, it is likely that several others that reside at adjacent addresses are in the cache, too. Thus one (slow) read from main memory gives access to (fast) cached access to multiple adjacent array elements. The same is not true of linked lists, as nodes may be allocated anywhere in memory, and even a single node has at minimum the overhead of a next pointer to dilute the number of data elements that may be cached at the same time.
Caches don't know about arrays, they just see memory accesses and store a little bit of the memory near that address. Once you've accessed something at an address it should stick around in the cache a while, regardless of whether that address belongs to an array or a linked list. But the cache controllers don't really know what's being accessed.
When you traverse an array, the cache system may pre-fetch the next bit of an array. This is usually heuristically driven (maybe with some compiler hints).
Some hardware and toolchains offer intrinsics that let you control cache residency (through pre-fetches, explicit flushes and so forth). Normally you don't need this kind of control, but for things like DSP code, resource-constrained game consoles and OS-level stuff that needs to worry about cache coherency it's pretty common to see people use this functionality.
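As one concrete example, GCC and Clang expose __builtin_prefetch. Whether a manual prefetch like the sketch below actually helps is very workload- and hardware-dependent; a plain sequential scan is usually handled fine by the hardware prefetcher on its own.

/* Prefetch a fixed distance ahead while summing (a distance of 64 elements
   is an arbitrary choice for illustration). */
double sum_with_prefetch(const double *a, long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], 0, 1);  /* read-only, low temporal locality */
        sum += a[i];
    }
    return sum;
}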

Arrays vs Linked Lists in terms of locality

Say we have an unsorted array and linked list.
The worst case when searching for an element for both data structures would be O( n ), but my question is:
Would the array still be way faster because of the use of spatial locality within the cache, or will the cache make use of branch locality, allowing linked lists to be just as fast as any array?
My understanding for an array is that if an element is accessed, that block of memory and many of the surrounding blocks are then brought into the cache allowing for much faster memory accesses.
My understanding for a linked list is that, since the path that will be taken to traverse the list is predictable, the cache will exploit that and still store the appropriate blocks of memory even though the nodes of the list can be far apart within the heap.
Your understanding of the array case is mostly correct. If an array is accessed sequentially, many processors will not only fetch the block containing the element, but will also prefetch subsequent blocks to minimize cycles spent waiting on cache misses. If you are using an Intel x86 processor, you can find details about this in the Intel x86 optimization manual. Also, if the array elements are small enough, loading a block containing an element means the next element is likely in the same block.
Unfortunately, for linked lists the pattern of loads is unpredictable from the processor's point of view. It doesn't know that, when loading an element at address X, the next address is the contents of (X + 8).
As a concrete example, the sequence of load addresses for a sequential array access is nice and predictable.
For example, 1000, 1016, 1032, 1048, etc.
For a linked list it will look like:
1000, 3048, 5040, 7888, etc. Very hard to predict the next address.

Array Performance very similar to LinkedList - What gives?

So the title is somewhat misleading... I'll keep this simple: I'm comparing these two data structures:
An array, whereby it starts at size 1, and for each subsequent addition there is a realloc() call to expand the memory, and then the new (malloced) element is appended at the n-1 position.
A linked list, whereby I keep track of the head, tail, and size. And addition involves mallocing for a new element and updating the tail pointer and size.
Don't worry about any of the other details of these data structures. This is the only functionality I'm concerned with for this testing.
In theory, the LL should be performing better. However, they're near identical in time tests involving 10, 100, 1000... up to 5,000,000 elements.
My gut feeling is that the heap is large. I think the data segment defaults to 10 MB on Redhat? I could be wrong. Anyway, realloc() is first checking to see if space is available at the end of the already-allocated contiguous memory location (0-[n-1]). If the n-th position is available, there is not a relocation of the elements. Instead, realloc() just reserves the old space + the immediately following space. I'm having a hard time finding evidence of this, and I'm having a harder time proving that this array should, in practice, perform worse than the LL.
Here is some further analysis, after reading posts below:
[Update #1]
I've modified the code to have a separate list that mallocs memory every 50th iteration for both the LL and the Array. For 1 million additions to the array, there are almost consistently 18 moves. There's no concept of moving for the LL. I've done a time comparison, they're still nearly identical. Here's some output for 10 million additions:
(Array)
time ./a.out a 10,000,000
real 0m31.266s
user 0m4.482s
sys 0m1.493s
(LL)
time ./a.out l 10,000,000
real 0m31.057s
user 0m4.696s
sys 0m1.297s
I would expect the times to be drastically different with 18 moves. The array addition requires 1 more assignment and 1 more comparison, to get and check the return value of realloc to ensure a move occurred.
[Update #2]
I ran an ltrace on the testing that I posted above, and I think this is an interesting result... It looks like realloc (or some memory manager) is preemptively moving the array to larger contiguous locations based on the current size.
For 500 iterations, a memory move was triggered on iterations:
1, 2, 4, 7, 11, 18, 28, 43, 66, 101, 154, 235, 358
Which is pretty close to a summation sequence. I find this to be pretty interesting - thought I'd post it.
You're right, realloc will just increase the size of the allocated block unless it is prevented from doing so. In a real-world scenario you will most likely have other objects allocated on the heap in between subsequent additions to the list; in that case realloc will have to allocate a completely new chunk of memory and copy the elements already in the list.
Try allocating another object on the heap using malloc for every ten insertions or so, and see if they still perform the same.
So you're testing how quickly you can expand an array versus a linked list?
In both cases you're calling a memory allocation function. Generally memory allocation functions grab a chunk of memory (perhaps a page) from the operating system, then divide that up into smaller pieces as required by your application.
The other assumption is that, from time to time, realloc() will spit the dummy and allocate a large chunk of memory elsewhere because it could not get contiguous chunks within the currently allocated page. If you're not making any other calls to memory allocation functions in between your list expansions then this won't happen. And perhaps your operating system's use of virtual memory means that your program heap is expanding contiguously regardless of where the physical pages are coming from. In which case the performance will be identical to a bunch of malloc() calls.
Expect performance to change where you mix up malloc() and realloc() calls.
Assuming your linked list is a pointer to the first element, if you want to add an element to the end, you must first walk the list. This is an O(n) operation.
Assuming realloc has to move the array to a new location, it must traverse the array to copy it. This is an O(n) operation.
In terms of complexity, both operations are equal. However, as others have pointed out, realloc may be avoiding relocating the array, in which case adding the element to the array is O(1). Others have also pointed out that the vast majority of your program's time is probably spent in malloc/realloc, which both implementations call once per addition.
Finally, another reason the array is probably faster is cache locality and the generally high performance of linear copies. Jumping around to erratic addresses with significant gaps between them (both the larger elements and the malloc bookkeeping) is not usually as fast as doing a bulk copy of the same volume of data.
The performance of an array-based solution expanded with realloc() will depend on your strategy for creating more space.
If you increase the amount of space by adding a fixed amount of storage on each re-allocation, you'll end up with an expansion that, on average, depends on the number of elements you have stored in the array. This is on the assumption that realloc will need to (occasionally) allocate space elsewhere and copy the contents, rather than just expanding the existing allocation.
If you increase the amount of space by adding a proportion of your current number of elements (doubling is pretty standard), you'll end up with an expansion that, on average, takes constant time.
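A minimal sketch of that doubling strategy (hypothetical names): the occasional realloc copy is spread over many appends, so the amortized cost per append is constant.

#include <stdlib.h>

struct vec { int *data; size_t len, cap; };

static int vec_push(struct vec *v, int x) {
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 16;    /* grow geometrically */
        int *p = realloc(v->data, new_cap * sizeof *p);
        if (!p)
            return 0;                                 /* old block still valid */
        v->data = p;
        v->cap  = new_cap;
    }
    v->data[v->len++] = x;
    return 1;
}

Starting from a zero-initialized struct vec, each vec_push either writes into spare capacity or doubles the capacity first.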
Will the compiler output be much different in these two cases?
This is not a real life situation. Presumably, in real life, you are interested in looking at or even removing items from your data structures as well as adding them.
If you allow removal, but only from the head, the linked list becomes better than the array because removing an item is trivial and, if instead of freeing the removed item, you put it on a free list to be recycled, you can eliminate a lot of the mallocs needed when you add items to the list.
On the other hand, if you need random access to the structure, clearly an array beats the linked list.
(Updated.)
As others have noted, if there are no other allocations in between reallocs, then no copying is needed. Also as others have noted, the risk of memory copying lessens (and so does its impact, of course) for very small blocks, smaller than a page.
Also, if all you ever do in your test is to allocate new memory space, I am not very surprised you see little difference, since the syscalls to allocate memory are probably taking most of the time.
Instead, choose your data structures depending on how you want to actually use them. A framebuffer is for instance probably best represented by a contiguous array.
A linked list is probably better if you have to reorganise or sort data within the structure quickly.
Then these operations will be more or less efficient depending on what you want to do.
(Thanks for the comments below, I was initially confused myself about how these things work.)
What's the basis of your theory that the linked list should perform better for insertions at the end? I would not expect it to, for exactly the reason you stated. realloc will only copy when it has to in order to maintain contiguity; in other cases it may have to combine free chunks and/or increase the chunk size.
However, every linked list node requires a fresh allocation and (assuming a doubly linked list) two writes. If you want evidence of how realloc works, you can just compare the pointer before and after realloc. You should find that it usually doesn't change.
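A quick sketch of that check (illustrative only; how often the block actually moves depends entirely on the allocator and on what else is on the heap):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *p = NULL;
    long moves = 0;
    for (long n = 1; n <= 1000000; n++) {
        int *q = realloc(p, n * sizeof *q);   /* grow by one element each time */
        if (!q) { free(p); return 1; }
        if (q != p)
            moves++;                          /* realloc handed back a new address */
        p = q;
    }
    printf("reallocs that moved the block: %ld\n", moves);
    free(p);
    return 0;
}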
I suspect that since you're calling realloc for every element (obviously not wise in production), the realloc/malloc call itself is the biggest bottleneck for both tests, even though realloc often doesn't provide a new pointer.
Also, you're confusing the heap and data segment. The heap is where malloced memory lives. The data segment is for global and static variables.
