Scanning a Binary Search Tree vs. an Array

In what ways would finding an element (by traversal) in a BST be slower than linearly scanning for it in an array?
The answer supposedly has to do with caching. Can someone explain what exactly this means and why it holds true?
How exactly do you "cache this" using an array, rather than caching with a BST?
Thanks

My guess is that using a BST doesn't give you any advantage. Even if you're caching data (which implies some kind of locality, e.g. you may access the same element again later), the insert and find operations always cost O(h), where h is the height of the tree; in the worst case that is even O(n).
With an array, on the other hand, the first access may be linear, but whenever you access the same element again, spatial and temporal locality mean you touch the same chunk of contiguous memory repeatedly; and since you already know its index, you get constant-time access.

I assume caching relates to CPU caches, which come with a prefetcher which predicts your next memory accesses. So if you search sequentially in an array your prefetcher recognizes the memory access pattern and loads the memory into the CPU cache before your CPU accesses it. When the CPU actually accesses the next memory element, it is already in the cache and can be accessed quickly.
Without caches & prefetchers your CPU would have to wait for the memory controller to fetch the data from the RAM, which is quite slow in comparison to the CPU cache.
In a BST you don't do sequential access. In the worst case the BST does not reside in contiguous memory, and each node sits at some arbitrary location in memory. Your prefetcher cannot predict this, so the CPU has to wait for each element to be fetched from memory.
Another thought, even without prefetchers, concerns the cache line. On x86_64 a cache line is 64 bytes long. Each integer is 4 or 8 bytes, so you can scan 16 or 8 array entries per cache line. The first access to the memory location fetches the whole line, so you only pay for the memory access once per 8 (or 16) comparisons.
For the BST the same argument as above applies: the node memory is likely not on the same cache line, so you have to do a memory access for each comparison.
To summarize: A) the memory access takes significantly more time than the comparison; B) whether searching through an array or a BST is faster depends on the number of items.
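To make this concrete, here is a minimal sketch in C (the node layout and element type are my own assumptions, not something from the question): the array scan does O(n) comparisons over contiguous, prefetchable memory, while the BST search does O(h) comparisons but each step is a dependent pointer load that may miss the cache.

    #include <stddef.h>

    /* Hypothetical BST node: each node lives at some arbitrary heap
       address, so following left/right is a dependent load that the
       prefetcher cannot predict. */
    struct node {
        int key;
        struct node *left;
        struct node *right;
    };

    /* O(h) comparisons, but potentially one cache miss per node visited. */
    int bst_contains(const struct node *root, int key) {
        while (root != NULL) {
            if (key == root->key)
                return 1;
            root = (key < root->key) ? root->left : root->right;
        }
        return 0;
    }

    /* O(n) comparisons, but the access pattern is sequential: one 64-byte
       cache line holds 16 ints, and the hardware prefetcher can keep
       fetching ahead of the scan. */
    int array_contains(const int *a, size_t n, int key) {
        for (size_t i = 0; i < n; i++)
            if (a[i] == key)
                return 1;
        return 0;
    }

For a small number of items the sequential scan often wins in practice, even though it does more comparisons.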

Related

Linked List vs. Array Traversal Efficiency

I know that an array is allocated as a contiguous block of memory and we can therefore access its elements by calculating the byte/word offset from the beginning of the array very easily.
I'm aware that linked list traversal is less efficient than array traversal due to cache inefficiency, where branch prediction won't work well in the way it would for an array. However, I've also heard that it's quicker to iterate from one element of an array to the next than it is to access the pointer of the next element in a linked list, due to the way we access the array using an offset.
How is the pointer access in the linked list slower than the offset access in the array?
"cache inefficiency, where branch prediction won't work well"
These are different things. Linked lists suffer from cache inefficiency:
Nodes are not necessarily allocated contiguously and in order, which is bad for spatial locality. You can sometimes avoid this, for example with custom allocators. With generational garbage collection, allocating nodes closely together in time also tends to put them close together in space, but that's probably not a very common thing to actually happen when using a linked list.
Having a pointer (and potentially other junk, like an object header and padding) in the node wastes space. Wasting a bunch of space is not inherently super bad, but it is bad when the wasted space is touched, which loads it into the cache. That actually happens here: that pointer to the next node is definitely needed, and the other junk is likely in the same cache line so it gets pulled in as well. This wastes both cache space and bandwidth (both to higher level caches and maybe to memory), which is pretty bad.
Linked lists don't really suffer from branch misprediction inherently. Yes, if you iterate over one, the last branch (the one that exits the loop) has a decent chance of being mispredicted, but that is not specific to linked lists.
"How is the pointer access in the linked list slower than the offset access in the array?"
Loading a pointer at all is slower than calculating the next address of an element in an array, both in terms of latency and in terms of throughput. For a quick comparison: typical on a modern machine is that loading that pointer takes around 4 cycles (at best! if there is a cache miss, it takes much longer) and can be done twice per cycle. Adding the size of an array element to the current address takes 1 cycle and can be done 4 times per cycle, and you (or the compiler) may be able to re-use the increment of the loop counter for this with some clever coding. For example, maybe you can use indexed addressing with the loop counter (which is incremented anyway) as index, or you can "steal" the loop counter entirely and increment it by the size of an element (scaling the loop-end correspondingly), or have no loop counter and directly compare the current address to the address just beyond the end of the array. Compilers like to use tricks like these automatically.
It's actually much worse than that makes it sound, because loading those pointers in a linked list is completely serial. Yes, the CPU can load two things per cycle, but it takes 4 cycles until it knows where the next node is so that it can start loading the next pointer, so realistically it can find the address of a node only once every 4th cycle. Computing the addresses of array elements has no such problem, maybe there will be a latency of 1 between the computation of successive addresses but (because actual loops cannot be faster than that anyway) that only hurts when the loop is unrolled, and if necessary the address of the element k steps ahead can be computed just by adding k*sizeof(element) (so several addresses can be computed independently, and compilers do this too when they unroll loops).
Doing a sufficient amount of work per "step" through a linked list can hide the latency problem.
Accessing the pointer requires an additional memory read (which is slow compared to calculations): To read the value of the next element, first the pointer needs to be read from memory, then the contents of the referenced address need to be read. For an array, there is only one memory read access for the value (assuming the base address is kept in a register during the iteration).
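As a rough sketch of the two traversals (the element and node types are assumptions for illustration): in the array loop the next address comes from an addition, so the CPU can have several loads in flight at once; in the list loop the next address is itself data that must be loaded before the next step can begin.

    #include <stddef.h>

    struct list_node {
        int value;
        struct list_node *next;   /* the extra pointer stored per element */
    };

    long sum_array(const int *a, size_t n) {
        long sum = 0;
        /* next address = current address + sizeof(int); this is pure
           arithmetic, so loads of a[i+1], a[i+2], ... can already be in
           flight while a[i] is being added. */
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    long sum_list(const struct list_node *head) {
        long sum = 0;
        /* the address of the next node is data (p->next); the load of the
           next node cannot start until this load has completed, so the
           traversal is serialized on load latency. */
        for (const struct list_node *p = head; p != NULL; p = p->next)
            sum += p->value;
        return sum;
    }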

How much time to initialize an array to 0?

I am confused whether
int arr[n] = {0} takes constant time, i.e. O(1), or O(n)?
You should expect O(N) time, but there are caveats:
It is O(1) if the memory occupied by the array is smaller than the word size (and it may effectively be O(1) all the way up to the cache-line size on modern CPUs).
It is O(N) if the array fits within a single tier of the memory hierarchy.
It is complicated if the array pushes through the tiers: there are multiple tiers on all modern computers (registers, L1/L2/L3 caches, NUMA on multi-CPU machines, virtual memory mapping to swap, ...). If the array can't fit in one, there will be a severe performance penalty.
CPU cache architecture impacts the time needed to zero out memory quite severely. In practice calling it O(N) is somewhat misleading, given that going from 100 to 101 elements may increase the time 10x if it crosses a cache boundary (a line, or the whole cache). It can be even more dramatic if swapping is involved. Beware of the tiered memory model...
Generally initialization to zero of non-static storage is linear in the size of the storage. If you are reusing the array, it will need to be zero'ed each time. There are computer architectures that try to make this free via maintaining bit masks on pages or cache lines and returning zeroes at some point in the cache fill or load unit machinery. These are somewhat rare, but the effect can often be replicated in software if performance is paramount.
Arguably zeroed static storage is free but in practice the pages to back it up will be faulted in and there will be some cost to zero them on most architectures.
(One can end up in situations where the cost of the faults to provide zero-filled pages is quite noticeable. E.g. repeated malloc/free of blocks bigger than some threshold can result in the address space backing the allocation being returned to the OS at each deallocation. The OS then has to zero it for security reasons even though malloc isn't guaranteed to return zero-filled storage. In the worst case the program then writes zeroes into the same block after it is returned from malloc, so it ends up being twice the cost.)
For cases where large arrays of mostly zeroes will be accessed in a sparse fashion, the zero fill on demand behavior mentioned above can reduce the cost to linear in the number of pages actually used, not the total size. Arranging to do this usually requires using mmap or similar directly, not just allocating an array in C and zero initializing it.
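Here is a minimal, POSIX-specific sketch of that zero-fill-on-demand behaviour (the size is arbitrary, and this assumes an OS that provides anonymous mappings lazily, as Linux and the BSDs do): only the pages that are actually touched get faulted in and zeroed.

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t n = (size_t)1 << 28;              /* 256 Mi ints = 1 GiB, mostly untouched */
        int *big = mmap(NULL, n * sizeof *big,
                        PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (big == MAP_FAILED) { perror("mmap"); return 1; }

        /* The OS hands out zero pages lazily: only the pages touched here
           are actually faulted in and zeroed. */
        big[0] = 1;
        big[n / 2] = 2;
        big[n - 1] = 3;

        printf("%d %d\n", big[1], big[n - 2]);   /* still 0, never written */
        munmap(big, n * sizeof *big);
        return 0;
    }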

Cache locality of an array and amortized complexity

An array is usually faster in access than a linked list.
This is primarily due to cache locality of an array.
I had a few doubts:
On what factors does the amount of data that is brought into the cache depend? Is it exactly equal to the cache size of the system? How can we know how much memory this is?
The first access to an array is usually costlier, as the array has to be located in memory and brought into the cache. Subsequent accesses are comparatively faster. How do we calculate the amortized complexity of the access operation?
What are cache misses? Does it mean (in the case of a linked list) that the required item (current pointer->next) has not been loaded into the cache, and hence memory has to be accessed again to fetch it?
In reality, it is a bit more complex than the simple model you present in the question.
First, you may have multiple caching layers (L1,L2,L3), each of them with different characteristics. Especially, the replacement policy for each cache may use different algorithms as a tradeoff between efficiency and complexity (i.e. cost).
Then, all modern operating systems implement virtual memory mechanisms. It is not enough to cache the data and the instructions (which is what L1..L3 are used for); it is also required to cache the association between virtual and physical addresses (in the TLB, the translation lookaside buffer).
To understand the impact of locality, you need to consider all these mechanisms.
Question 1
The minimum unit exchanged between the memory and a cache is the cache line. Typically, it is 64 bytes (but it depends on the CPU model). Let's imagine the caches are empty.
If you iterate on an array, you will pay for a cache miss every 64 bytes. A smart CPU (and a smart program) could analyze the memory access pattern and decide to prefetch contiguous blocks of memory in the caches to increase the throughput.
If you iterate on a list, the access pattern will be random, and you will likely pay a cache miss for each item.
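A quick way to observe the cache-line granularity yourself (a rough sketch; absolute times depend on the machine, and the gallery of cache effects linked under "Useful links" below has more variations): a loop that touches every int and a loop that touches one int per 64-byte line walk the same number of cache lines, so their running times are usually close despite a 16x difference in work.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024)   /* 64 Mi ints = 256 MiB, far larger than any cache */

    static double touch(int *a, size_t stride) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i += stride)
            a[i] *= 3;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void) {
        int *a = calloc(N, sizeof *a);
        if (!a) return 1;
        /* stride 1 touches every int; stride 16 touches one int per 64-byte
           cache line.  Both loops pull in the same number of cache lines. */
        printf("stride  1: %.3f s\n", touch(a, 1));
        printf("stride 16: %.3f s\n", touch(a, 16));
        free(a);
        return 0;
    }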
Question 2
The whole array is not searched and brought into the cache on the first access. Only the first cache line is.
However, there is another factor to consider: the TLB. The page size is system dependent; a typical value is 4 KB. The first time the array is accessed, an address translation will occur (and its result will be stored in the TLB). If the array is smaller than 4 KB (the page size), no other address translation will have to be done. If it is bigger, then one translation per page will be done.
Compare this to a list. The probability that multiple items fit in the same page (4 KB) is much lower than for the array. The probability that they fit in the same cache line (64 bytes) is extremely low.
I think it is difficult to calculate a complexity because there are probably other factors to consider. But in this complexity, you have to take into account the cache line size (for cache misses) and the page size (for TLB misses).
Question 3
A cache miss is when a given cache line is not in the cache. It can happen at the L1, L2, or L3 level. The higher the level, the more expensive the miss.
A TLB miss occurs when the virtual address is not in the TLB. In that case, a conversion to a physical address is done using the page tables (costly) and the result is stored in the TLB.
So yes, with a linked list, you will likely pay for a higher number of cache and TLB misses than for an array.
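As a back-of-the-envelope worked example (assuming 4-byte ints, 64-byte cache lines and 4 KB pages, as above): scanning an array of a million ints touches roughly N/16 cache lines and N/1024 pages, whereas a badly scattered linked list can touch one cache line, and in the worst case one page, per node.

    #include <stdio.h>

    int main(void) {
        const unsigned long N = 1000000;                 /* one million 4-byte ints */
        const unsigned long ELEM = 4, LINE = 64, PAGE = 4096;

        unsigned long array_lines = (N * ELEM + LINE - 1) / LINE;   /* 62,500 */
        unsigned long array_pages = (N * ELEM + PAGE - 1) / PAGE;   /* 977 */

        printf("array: ~%lu cache lines, ~%lu pages\n", array_lines, array_pages);
        printf("list (worst case): up to %lu cache lines and %lu pages\n", N, N);
        return 0;
    }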
Useful links:
Wikipedia article on caches in CPUs: https://en.wikipedia.org/wiki/CPU_cache
Another good article on this topic: http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
An oldish, but excellent article on the same topic: http://arstechnica.com/gadgets/2002/07/caching/
A gallery of various caching effects: http://igoro.com/archive/gallery-of-processor-cache-effects/

Does size of array in C affect time to access or change a piece of data in the array

Title says it all:
I create an array, say ARRAY[10000],
and I set bounds so that I only need to access data at indices 1-100 and 900-2000.
Will accessing and/or changing a piece of data take more time in this case than if I had declared the array as ARRAY[2001]?
Will access and/or change time differ if I have an array that only holds the data from 1-100 and 900-2000?
I have seen some papers on this but they have not been clear to me and was hoping I could get a more concise answer here.
If the array is accessed infrequently, then the size of the array probably won't make much difference because you'd be getting a cache miss anyway. In that case the time will depend on how quickly the CPU can do any "virtual address to physical address" conversion and fetch the data from RAM.
The more frequently you access something in the array, the more cache effects matter. These cache effects depend heavily on the number, sizes and speeds of different caches.
However, it also depends on your access patterns. For example, if you have a 1 GiB array and frequently access 5 bytes of it, then the size won't make much difference as the 5 bytes you're accessing frequently will be in cache. For another example, if you use a sequential access pattern (e.g. "for each element in array { ... }") then it's likely the CPU might do hardware pre-fetching and you won't pay the full cost of cache misses.
For a typical 80x86 system, with a random access pattern and frequent accesses, there are about 5 different sizes that matter. The first (smallest) is the L1 data cache size: if the array fits in the L1 data cache, then it's going to be relatively fast regardless of whether the array is 20 bytes or 20 KiB. The next size is the L2 cache size: as the array gets larger, the ratio of "L1 hits to L2 hits" decreases and performance gets worse, until (at maybe twice the size of L1) the L1 hits become negligible. Then (for CPUs that have L3 caches) the same happens with the L3 cache size, where as the array gets larger the ratio of "L2 hits to L3 hits" decreases.
Once you go larger than the largest cache, the ratio of "cache hits to cache misses" decreases.
The next size that can matter is TLB size. Most modern operating systems use paging, and most modern CPUs cache "virtual address to physical address conversions" in something typically called a Translation Look-aside Buffer. If the array is huge then you start getting TLB misses (in addition to cache misses) which makes performance worse (as the CPU can't avoid extra work when converting your virtual address into a physical address).
Finally, the last size that can matter is how much RAM you actually have. If you have a 10 TiB array, 4 GiB of RAM and 20 TiB of swap space; then the OS is going to be swapping data to/from disk and the overhead of disk IO is going to dominate performance.
Of course often you're creating an executable for many different computers (e.g. "64-bit 80x86, ranging from ancient Athlon to modern Haswell"). In that case you can't know most of the details that matter (like cache sizes) and it becomes a compromise between guesses (estimated overhead from cache misses due to "array too large" vs. estimated overhead from other things caused by "array too small").
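If you want to see those size thresholds on your own machine, here is a hedged sketch (the working-set sizes, iteration count and the simple index generator are arbitrary choices, not tuned for accuracy): it times random accesses over progressively larger working sets, and the nanoseconds per access typically step up near the L1, L2, L3 and RAM boundaries.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Time random accesses over a working set of `bytes` bytes. */
    static double ns_per_access(size_t bytes, size_t iters) {
        size_t n = bytes / sizeof(int);
        int *a = malloc(n * sizeof *a);
        if (!a) return -1.0;
        for (size_t i = 0; i < n; i++) a[i] = (int)i;

        unsigned seed = 12345;
        volatile int sink = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++) {
            seed = seed * 1103515245u + 12345u;   /* cheap pseudo-random index */
            sink += a[seed % n];
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        free(a);
        (void)sink;
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        return s * 1e9 / (double)iters;
    }

    int main(void) {
        /* 16 KiB (fits in L1) up to 256 MiB (RAM-bound on most machines). */
        for (size_t kb = 16; kb <= 256 * 1024; kb *= 4)
            printf("%8zu KiB: %6.1f ns/access\n", kb, ns_per_access(kb * 1024, 10000000));
        return 0;
    }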
No, at least normally the time to access any item in an array will be constant, regardless of the array's size.
This could change if (for example) you define an array larger than memory, so the OS (assuming you're using one) ends up doing some extra paging to support the larger array. In most cases, even that's unlikely to have much impact though.
Maybe yes, but it depends on the size.
Cache size
The size of the access range changes the latency.
int ARRAY[10000] is about 40 KB, which does not quite fit in a 32 KB L1 data cache,
but an access range that does fit in the L1 cache (32 KB) costs about 4 clocks per access on Haswell,
while an L2 cache access costs about 12 clocks.
See the details here:
http://www.7-cpu.com/cpu/Haswell.html
Cache conflicts (coherence)
If another CPU core modifies some data in the array, the corresponding line in the local cache is invalidated (its coherence state becomes Invalid).
Accessing an invalidated cache line costs much more latency, because it has to be fetched again.
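One common way to run into this in practice is false sharing. A minimal sketch with POSIX threads (the layout and iteration count are assumptions): two threads write adjacent counters that live in the same 64-byte cache line, so the line bounces between the cores' caches; padding the counters onto separate lines typically makes the same loops several times faster.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 100000000L

    /* a and b are adjacent, so they almost certainly share one 64-byte
       cache line; two cores writing them keep invalidating each other's
       copy of the line (false sharing).  Inserting e.g. char pad[64];
       between the fields usually removes the effect. */
    static struct { long a; long b; } counters;

    static void *bump_a(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) counters.a++; return NULL; }
    static void *bump_b(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) counters.b++; return NULL; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", counters.a, counters.b);
        return 0;
    }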
NUMA
Some environments have multiple CPU sockets on the motherboard; this is called a non-uniform memory access (NUMA) environment.
Such a system may have a huge memory capacity, but some addresses may be resident in memory attached to CPU1 while other addresses are resident in memory attached to CPU2.

    int huge_array[SOME_HUGE_SIZE]; // it straddles multiple CPUs' DIMMs
    // suppose that the entire huge_array is out of cache.
    huge_array[ADDRESS_OF_CPU1] = 1; // maybe fast (local to CPU1)
    huge_array[ADDRESS_OF_CPU2] = 1; // maybe slow (remote, resident on CPU2)

But whether a huge array really straddles multiple CPUs' memory is another question; allocating such a huge array may simply fail. It depends on the OS.
In theory, as stated by others, array access takes constant time and thus does not cost more or less depending on the array's size. This question seems to be about real-life performance, though, and there the array size definitely matters. How so is well explained in the accepted answer by @Brendan.
Things to consider in practice:
* How big are the elements of your array: bool[1000], MyStruct[1000] and MyStruct*[1000] may differ a lot in access performance
* Try writing the code both ways, once using the big array and once keeping the required data in a smaller array. Then run the code on your target hardware(s) and compare performance. You will often be surprised to see that an optimization attempt makes performance worse, and you learn a lot about hardware and its quirks in the process.
I don't believe it should.
When you access an element, you are going to memory location base address + (element offset); therefore, regardless of the array's size, it takes the same time to reach that memory location.

Arrays vs Linked Lists in terms of locality

Say we have an unsorted array and linked list.
The worst case when searching for an element in both data structures would be O(n), but my question is:
Would the array still be way faster because of spatial locality within the cache, or will the cache make use of branch locality, allowing linked lists to be just as fast as any array?
My understanding for an array is that if an element is accessed, that block of memory and many of the surrounding blocks are then brought into the cache allowing for much faster memory accesses.
My understanding for a linked list is that since the path that will be taken to traverse the list is predictable, then the cache will exploit that and still store the appropriate blocks of memory even though the nodes of the list can be far apart within the heap.
Your understanding of the array case is mostly correct. If an array is accessed sequentially, many processors will not only fetch the block containing the element, but will also prefetch subsequent blocks to minimize cycles spent waiting on cache misses. If you are using an Intel x86 processor, you can find details about this in the Intel x86 optimization manual. Also, if the array elements are small enough, loading a block containing an element means the next element is likely in the same block.
Unfortunately, for linked lists the pattern of loads is unpredictable from the processor's point of view. It doesn't know that, when loading an element at address X, the address of the next element is the contents of (X + 8).
As a concrete example, the sequence of load addresses for a sequential array access is nice and predictable.
For example, 1000, 1016, 1032, 1048, etc.
For a linked list it will look like:
1000, 3048, 5040, 7888, etc. Very hard to predict the next address.
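A tiny sketch of why (the types and the 8-byte offset are illustrative): the array's next address is computed with arithmetic, while the list's next address has to be loaded from memory first.

    #include <stddef.h>

    struct node { int value; struct node *next; };

    /* Array: the next address is just the current address plus
       sizeof(int), pure arithmetic, so the whole sequence of load
       addresses is predictable up front. */
    const int *array_next(const int *p) {
        return p + 1;
    }

    /* Linked list: the next address is the contents of p->next (at
       offset 8 in this layout on a 64-bit machine), so it has to be
       loaded before the processor knows where to go next. */
    const struct node *list_next(const struct node *p) {
        return p->next;
    }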

Resources