Cache locality of an array and amortized complexity - arrays

An array is usually faster to access than a linked list. This is primarily due to the cache locality of an array.
I have three questions:
On what does the amount of data brought into the cache memory depend? Is it always equal to the size of the system's cache? How can we find out how much memory this is?
The first access to an array is usually costlier, since the array has to be located in memory and brought into the cache; subsequent operations are comparatively faster. How do we calculate the amortized complexity of the access operation?
What are cache misses? Does it mean (in the case of a linked list) that the required item (current->next) has not been loaded into the cache memory, and hence memory has to be accessed again at its address?

In reality, it is a bit more complex than the simple model you present in the question.
First, you may have multiple caching layers (L1, L2, L3), each of them with different characteristics. In particular, the replacement policy of each cache may use different algorithms as a tradeoff between efficiency and complexity (i.e. cost).
Then, all modern operating systems implement virtual memory mechanisms. It is not enough to cache the data and the instructions (which is what L1..L3 are used for); it is also required to cache the association between virtual and physical addresses (in the TLB, the translation lookaside buffer).
To understand the impact of locality, you need to consider all these mechanisms.
Question 1
The minimum unit exchanged between the memory and a cache is the cache line. Typically, it is 64 bytes (but it depends on the CPU model). Let's imagine the caches are empty.
If you iterate on an array, you will pay for a cache miss every 64 bytes. A smart CPU (and a smart program) could analyze the memory access pattern and decide to prefetch contiguous blocks of memory in the caches to increase the throughput.
If you iterate on a list, the access pattern will be random, and you will likely pay a cache miss for each item.
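As a rough back-of-the-envelope sketch (the 64-byte line size and 4-byte int are assumptions that vary by platform), you can compare the number of cache-line fetches a cold, sequential array scan needs with the worst case for a scattered list:

#include <stdio.h>

int main(void) {
    const size_t n = 100000;            /* number of int elements            */
    const size_t line_size = 64;        /* assumed cache-line size in bytes  */
    /* Contiguous array: one miss per cache line actually touched. */
    size_t array_misses = (n * sizeof(int) + line_size - 1) / line_size;
    /* Scattered list: worst case, every node sits on its own cache line. */
    size_t list_misses = n;
    printf("array scan: ~%zu cache-line fetches\n", array_misses);   /* ~6250  */
    printf("list scan : up to %zu cache-line fetches\n", list_misses);
    return 0;
}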
Question 2
The whole array is not searched and brought into the cache on first access. Only the first cache line is.
However, there is also another factor to consider: the TLB. The page size is system dependent; a typical value is 4 KB. The first time the array is accessed, an address translation will occur (and its result will be stored in the TLB). If the array is smaller than 4 KB (the page size), no other address translation will have to be done. If it is bigger, then one translation per page will be done.
Compare this to a list. The probability that multiple items fit in the same page (4 KB) is much lower than for the array. The probability that they fit in the same cache line (64 bytes) is extremely low.
I think it is difficult to calculate a complexity because there are probably other factors to consider. But in this complexity, you have to take into account the cache line size (for cache misses) and the page size (for TLB misses).
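The same kind of estimate works for the TLB (again assuming 4 KB pages): a cold scan of the array needs one translation per page touched, whereas a scattered list may need one per node.

#include <stdio.h>

int main(void) {
    const size_t n = 100000;            /* number of int elements        */
    const size_t page_size = 4096;      /* assumed page size in bytes    */
    /* Contiguous array: one translation per page actually touched. */
    size_t array_translations = (n * sizeof(int) + page_size - 1) / page_size;
    /* Scattered list: worst case, every node sits on its own page. */
    size_t list_translations = n;
    printf("array scan: ~%zu TLB fills\n", array_translations);   /* ~98 */
    printf("list scan : up to %zu TLB fills\n", list_translations);
    return 0;
}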
Question 3
A cache miss is when a given cache line is not in the cache. It can happen at the L1, L2, or L3 level. The higher the level, the more expensive the miss.
A TLB miss occurs when the virtual address is not in the TLB. In that case, a conversion to a physical address is done using the page tables (costly) and the result is stored in the TLB.
So yes, with a linked list, you will likely pay for a higher number of cache and TLB misses than for an array.
Useful links:
Wikipedia article on caches in CPUs: https://en.wikipedia.org/wiki/CPU_cache
Another good article on this topic: http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
An oldish, but excellent article on the same topic: http://arstechnica.com/gadgets/2002/07/caching/
A gallery of various caching effects: http://igoro.com/archive/gallery-of-processor-cache-effects/

Related

linked lists vs arrays: which is more contiguous in physical memory?

Arrays are not necessarily contiguous in physical memory, though they are contiguous in virtual address space. But can it be said that the "tidiness" of arrays in physical memory is significantly higher compared to linked lists? So, which is a better option for a cache-friendly program?
There are two reasons why contiguous memory is more cache-friendly than non-contiguous memory:
If the data is stored contiguously, then the data will likely be stored in fewer cache lines (which are 64-byte blocks on most platforms). In that case, there is a higher chance that all the data will fit in the cache, and new cache lines will have to be loaded less often. If the data is not stored contiguously and is scattered across many random memory locations, then it is possible that only a small fraction of every cache line contains important data and that the rest of the cache line contains unimportant data. In that case, more cache lines would be required to cache all the important data, and if the cache is not large enough to store all these cache lines, then the cache efficiency will decrease.
The hardware cache prefetcher will do a better job at predicting the next cache line to prefetch, because it is easy to predict a sequential access pattern. Depending on whether the elements of the linked list are scattered or not, the access pattern to a linked list may be random and unpredictable, whereas the access pattern to an array is often sequential.
You are right that even if an array is stored contiguously in the virtual address space, this does not necessarily mean that the array is also contiguous in the physical address space.
However, this is irrelevant with regard to the statements made in #1 of my answer, because a cache line cannot overlap the boundary of a memory page. The content of a single memory page is always contiguous, both in the virtual address space and in the physical address space.
But you are right that it can be relevant with regard to the statements made in #2 of my answer. Assuming a memory page size of 4096 bytes (which is standard on the x64 platform) and a cache line size of 64 bytes, there are 64 cache lines per memory page. This means that every 64th cache line could be at the edge of a "jump" in the physical address space. As a result, every 64th cache line could be mispredicted by the hardware cache prefetcher. Also, the cache prefetcher may not be able to adapt itself immediately to this new situation, so it may fail to prefetch several cache lines before it is able to reliably predict the next cache lines again and preload them in time.
However, as an application programmer, you should not have to worry about this. It is the responsibility of the operating system to arrange the mapping of the virtual memory space to the physical memory space in such a way that there are not too many "jumps" which could have a negative performance impact. If you want to read more on this topic, you might want to read this research paper: Analysis of hardware prefetching across virtual page boundaries.
Generally, arrays are better than linked lists in terms of cache efficiency, because they are always contiguous (in the virtual address space).
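To see the effect in practice, here is a minimal benchmark sketch (the element count, node layout, and the stride used to scatter the list are arbitrary assumptions; absolute numbers depend on your hardware and compiler). It sums the same values once through a contiguous array and once by chasing pointers through a list whose nodes are linked in a scattered order:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)   /* ~4M elements, chosen to be larger than a typical L3 cache */

struct node { long value; struct node *next; };

int main(void) {
    long *arr = malloc(N * sizeof *arr);
    struct node *nodes = malloc(N * sizeof *nodes);
    if (!arr || !nodes) return 1;

    /* Link the nodes as a permutation with a large stride, so that each
       ->next hop jumps roughly 124 KB through memory instead of moving
       to the adjacent element the way the array scan does. */
    size_t prev = 0;
    for (size_t i = 1; i < N; i++) {
        size_t cur = (i * 7919u) % N;   /* 7919 is coprime to N, so this visits every node */
        nodes[prev].next = &nodes[cur];
        prev = cur;
    }
    nodes[prev].next = NULL;
    for (size_t i = 0; i < N; i++) { arr[i] = 1; nodes[i].value = 1; }

    clock_t t0 = clock();
    long sum_array = 0;
    for (size_t i = 0; i < N; i++) sum_array += arr[i];                      /* sequential   */
    clock_t t1 = clock();
    long sum_list = 0;
    for (struct node *p = &nodes[0]; p; p = p->next) sum_list += p->value;   /* pointer chasing */
    clock_t t2 = clock();

    printf("array: sum=%ld, %.3f s\n", sum_array, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("list : sum=%ld, %.3f s\n", sum_list, (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(arr);
    free(nodes);
    return 0;
}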

Scanning Binary Search Tree vs Array

In what ways would finding an element (traverse) in a BST be slower than linearly scanning for it within an array?
The answer supposedly has to do with caching. Can someone explain what exactly this means and why it holds true?
How exactly does an array "cache" better than a BST?
Thanks
My guess is that using a BST doesn't give you any advantage, since, even if you're caching data (which implies there is some kind of locality; you may access the same element later, for example), the insert and find operations always cost O(h), where h is the height of the tree. In the worst case this is even O(n).
Using an array, on the other hand, means that the first access may be linear, but whenever you access the same element of the array afterwards, if there is spatial and temporal locality, you may find yourself repeatedly accessing the same chunks of contiguous memory, because you already know its index, which means you have constant-time access.
I assume caching relates to CPU caches, which come with a prefetcher which predicts your next memory accesses. So if you search sequentially in an array your prefetcher recognizes the memory access pattern and loads the memory into the CPU cache before your CPU accesses it. When the CPU actually accesses the next memory element, it is already in the cache and can be accessed quickly.
Without caches & prefetchers your CPU would have to wait for the memory controller to fetch the data from the RAM, which is quite slow in comparison to the CPU cache.
In a BST you don't do sequential access. In the worst case your BST does not reside in contiguous memory, but each node is at some arbitrary location in memory. Your prefetcher cannot predict this. The CPU then has to wait for each element to be fetched from memory.
Another thought, even without prefetchers, concerns the cache line. On x86_64 a cache line is 64 bytes long. Each integer is either 4 or 8 bytes, so you can scan 16 or 8 array entries per cache line. The first access to the memory location fetches the whole line, so you only pay the memory access once per 8 (or 16) comparisons.
For the BST the same argument as above applies. The node memory is likely not on the same cache line, so you have to do a memory access for each comparison.
To summarize: A) a memory access takes significantly more time than a comparison; B) whether searching through an array or a BST is faster depends on the number of items.
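Here is a hedged sketch of that trade-off (N, the number of lookups, and the balanced-tree construction are arbitrary choices; the crossover point depends heavily on the hardware). It times repeated membership queries against a linear scan of a sorted array and against a pointer-based BST built from the same keys; try different values of N to see how the balance shifts.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct bst { int key; struct bst *left, *right; };

/* Build a balanced BST from sorted keys; each node comes from malloc,
   so the nodes end up wherever the allocator puts them. */
static struct bst *build(const int *keys, int lo, int hi) {
    if (lo > hi) return NULL;
    int mid = lo + (hi - lo) / 2;
    struct bst *n = malloc(sizeof *n);
    if (!n) exit(1);
    n->key = keys[mid];
    n->left = build(keys, lo, mid - 1);
    n->right = build(keys, mid + 1, hi);
    return n;
}

static int bst_contains(const struct bst *n, int key) {
    while (n) {
        if (key == n->key) return 1;
        n = (key < n->key) ? n->left : n->right;   /* one potential cache miss per hop */
    }
    return 0;
}

static int array_contains(const int *a, int n, int key) {
    for (int i = 0; i < n; i++)                    /* sequential, prefetch-friendly */
        if (a[i] == key) return 1;
    return 0;
}

int main(void) {
    enum { N = 1000, LOOKUPS = 200000 };
    int *keys = malloc(N * sizeof *keys);
    if (!keys) return 1;
    for (int i = 0; i < N; i++) keys[i] = 2 * i;   /* sorted, even keys: ~50% of queries hit */
    struct bst *root = build(keys, 0, N - 1);      /* tree nodes are leaked; fine for a throwaway benchmark */

    srand(1);
    int hits = 0;
    clock_t t0 = clock();
    for (int i = 0; i < LOOKUPS; i++) hits += array_contains(keys, N, rand() % (2 * N));
    clock_t t1 = clock();
    for (int i = 0; i < LOOKUPS; i++) hits += bst_contains(root, rand() % (2 * N));
    clock_t t2 = clock();

    printf("array scan: %.3f s, BST search: %.3f s (hits=%d)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, hits);
    free(keys);
    return 0;
}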

How much time to initialize an array to 0?

I am confused about whether int arr[n] = {0} takes constant time, i.e. O(1), or O(n)?
You should expect O(N) time, but there are caveats:
It is O(1) if the memory occupied by the array is smaller than the word size (it may be O(1) all the way up to the cache line size on modern CPUs).
It is O(N) if the array fits within a single tier of the memory hierarchy.
It is complicated if the array pushes through the tiers: there are multiple tiers on all modern computers (registers, L1/L2/L3 caches, NUMA on multi-CPU machines, virtual memory mapping to swap, ...). If the array can't fit in one, there will be a severe performance penalty.
CPU cache architecture impacts the time needed to zero out memory quite severely. In practice, calling it O(N) is somewhat misleading, given that going from 100 to 101 elements may increase the time 10x if it falls on a cache boundary (a line, or the whole cache). It may be even more dramatic if swapping is involved. Beware of the tiered memory model.
Generally, initialization to zero of non-static storage is linear in the size of the storage. If you are reusing the array, it will need to be zeroed each time. There are computer architectures that try to make this free by maintaining bit masks on pages or cache lines and returning zeroes at some point in the cache fill or load unit machinery. These are somewhat rare, but the effect can often be replicated in software if performance is paramount.
Arguably, zeroed static storage is free, but in practice the pages to back it up will be faulted in, and there will be some cost to zero them on most architectures.
(One can end up in situations where the cost of the faults to provide zero-filled pages is quite noticeable. E.g. repeated malloc/free of blocks bigger than some threshold can result in the address space backing the allocation being returned to the OS at each deallocation. The OS then has to zero it for security reasons, even though malloc isn't guaranteed to return zero-filled storage. Worst case, the program then writes zeroes into the same block after it is returned from malloc, so it ends up being twice the cost.)
For cases where large arrays of mostly zeroes will be accessed in a sparse fashion, the zero fill on demand behavior mentioned above can reduce the cost to linear in the number of pages actually used, not the total size. Arranging to do this usually requires using mmap or similar directly, not just allocating an array in C and zero initializing it.
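Here is a hedged, Linux/POSIX-specific sketch of that idea (the 1 GiB region and 4 KB page size are assumptions): an anonymous mmap region is zero-filled lazily on first touch, so sparse use only pays for the pages actually touched, while explicitly zeroing the whole region would fault in and write every page.

#include <stddef.h>
#include <sys/mman.h>

int main(void) {
    size_t bytes = (size_t)1 << 30;                  /* 1 GiB of logically zeroed memory */
    int *big = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (big == MAP_FAILED) return 1;

    /* Sparse use: touch one int per page on 1000 pages; only those
       pages are faulted in and zero-filled by the kernel. */
    size_t ints_per_page = 4096 / sizeof(int);       /* assumed 4 KB pages */
    for (size_t i = 0; i < 1000; i++)
        big[i * ints_per_page] = 1;

    /* By contrast, memset(big, 0, bytes) would be O(total size) and
       would force every page of the region into physical memory. */

    munmap(big, bytes);
    return 0;
}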

Does size of array in C affect time to access or change a piece of data in the array

Title says it all:
I create an array, say ARRAY[10000].
I set bounds where I only need to access data from 1-100 and 900-2000.
Will the time to access and/or change a piece of data be greater in this case than if I had declared the array as ARRAY[2001]?
Will access and/or change time be changed if I have an array that only has the data from 1-100 and 900-2000?
I have seen some papers on this but they have not been clear to me and was hoping I could get a more concise answer here.
If the array is accessed infrequently, then the size of the array probably won't make much difference because you'd be getting a cache miss anyway. In that case the time will depend on how quickly the CPU can do any "virtual address to physical address" conversion and fetch the data from RAM.
The more frequently you access something in the array, the more cache effects matter. These cache effects depend heavily on the number, sizes and speeds of different caches.
However, it also depends on your access patterns. For example, if you have a 1 GiB array and frequently access 5 bytes of it, then the size won't make much difference as the 5 bytes you're accessing frequently will be in cache. For another example, if you use a sequential access pattern (e.g. "for each element in array { ... }") then it's likely the CPU might do hardware pre-fetching and you won't pay the full cost of cache misses.
For a typical 80x86 system, with a random access pattern and frequent accesses, there are about 5 different sizes that matter. The first (smallest) is the L1 data cache size - if the array fits in the L1 data cache, then it's going to be relatively fast regardless of whether the array is 20 bytes or 20 KiB. The next size is the L2 data cache size - as the array gets larger, the ratio of "L1 hits to L2 hits" decreases and performance gets worse until (at maybe "twice as large as L1") the L1 hits become negligible. Then (for CPUs that do have L3 caches) the same happens with the L3 cache size, where as the array gets larger the ratio of "L2 hits to L3 hits" decreases.
Once you go larger than the largest cache, the ratio of "cache hits to cache misses" decreases.
The next size that can matter is TLB size. Most modern operating systems use paging, and most modern CPUs cache "virtual address to physical address conversions" in something typically called a Translation Look-aside Buffer. If the array is huge then you start getting TLB misses (in addition to cache misses) which makes performance worse (as the CPU can't avoid extra work when converting your virtual address into a physical address).
Finally, the last size that can matter is how much RAM you actually have. If you have a 10 TiB array, 4 GiB of RAM and 20 TiB of swap space; then the OS is going to be swapping data to/from disk and the overhead of disk IO is going to dominate performance.
Of course often you're creating an executable for many different computers (e.g. "64-bit 80x86, ranging from ancient Athlon to modern Haswell"). In that case you can't know most of the details that matter (like cache sizes) and it becomes a compromise between guesses (estimated overhead from cache misses due to "array too large" vs. estimated overhead from other things caused by "array too small").
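As a small illustration of the access-pattern point above (array size, iteration count, and the PRNG are arbitrary choices), this sketch contrasts a sequential pass, which the hardware prefetcher handles well, with random accesses into the same large array, which mostly miss in the caches and the TLB:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = (size_t)1 << 26;        /* 64M ints = 256 MiB, assumed larger than any cache */
    int *a = malloc(n * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < n; i++) a[i] = 1;

    long sum = 0;
    clock_t t0 = clock();
    for (size_t i = 0; i < n; i++) sum += a[i];      /* sequential: prefetch-friendly */
    clock_t t1 = clock();

    unsigned x = 123456789u;                          /* xorshift32: cheap pseudo-random indices */
    for (size_t i = 0; i < n; i++) {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;
        sum += a[x % n];                              /* random: mostly cache/TLB misses */
    }
    clock_t t2 = clock();

    printf("sequential: %.3f s, random: %.3f s (sum=%ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(a);
    return 0;
}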
No, at least normally the time to access any item in an array will be constant, regardless of the array's size.
This could change if (for example) you define an array larger than memory, so the OS (assuming you're using one) ends up doing some extra paging to support the larger array. In most cases, even that's unlikely to have much impact though.
Maybe yes, but it depends on the size.
cache size
The size of the access range changes the latency.
int ARRAY[10000] is about 40 KB, which is slightly larger than a typical 32 KB L1 data cache, but it fits easily in L2.
An access that hits in the L1 cache (32 KB) costs 4 clock cycles on Haswell.
But an L2 cache access costs 12 clock cycles.
See details here:
http://www.7-cpu.com/cpu/Haswell.html
cache conflict
If another CPU core modifies some data in the array, the local cache-line state is changed to the Invalid state.
An invalidated cache line costs much more latency.
NUMA
Some environments with multiple CPU sockets on the motherboard are non-uniform memory access (NUMA) environments.
They may have a huge memory capacity, but some addresses of memory may be resident on CPU1's memory while other addresses are resident on CPU2's.
int huge_array[SOME_HUGE_SIZE]; // it spans multiple CPUs' DIMMs
// suppose that the entire huge_array is out of cache.
huge_array[ADDRESS_OF_CPU1] = 1; // maybe fast for CPU1
huge_array[ADDRESS_OF_CPU2] = 1; // maybe slow for CPU2
But I wonder whether a huge array really ends up spanning multiple CPUs' memory.
Allocation of a huge array may simply fail.
It depends on the OS.
In theory, as others have stated, array access is constant time, and thus does not cost more or less depending on the array's size. This question seems to be about real-life performance though, and there the array size definitely matters. How so is well explained by the accepted answer by Brendan.
Things to consider in practice:
* How big are the elements of your array: bool[1000], MyStruct[1000] and MyStruct*[1000] may differ a lot in access performance (see the sketch after this list)
* Try writing code for both ways, once using the big array, and once keeping the required data in a smaller array. Then run the code on your target hardware(s), and compare performance. You will often be surprised to see that optimization attempts make performance worse, and you learn a lot about hardware and its quirks in the process.
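A small sketch of the first point (MyStruct here is a hypothetical 64-byte struct, purely for illustration): the amount of data per element, and therefore the number of cache lines that 1000 elements occupy, differs a lot between these declarations.

#include <stdio.h>
#include <stdbool.h>

struct MyStruct { double payload[8]; };    /* hypothetical 64-byte element */

int main(void) {
    bool flags[1000];
    struct MyStruct values[1000];
    struct MyStruct *pointers[1000];        /* the structs themselves live elsewhere in memory */

    printf("bool[1000]      : %zu bytes (~%zu cache lines of 64 bytes)\n",
           sizeof flags, (sizeof flags + 63) / 64);
    printf("MyStruct[1000]  : %zu bytes (~%zu cache lines)\n",
           sizeof values, (sizeof values + 63) / 64);
    printf("MyStruct*[1000] : %zu bytes of pointers, plus one scattered struct per element\n",
           sizeof pointers);
    return 0;
}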
I don't believe it should.
When you access an element, you go to memory location base + (element index × element size); therefore, regardless of the array's size, you reach that memory location in the same amount of time.

Programmatically find the number of cache levels

I am a newbie in C programming. I have an assignment to find the number of data cache levels in the CPU and also the hit time of each level. I am looking at "C Program to determine Levels & Size of Cache" but finding it difficult to interpret the results. How is the number of cache levels revealed?
Any pointers will be helpful.
Assuming you don't have a way to cheat (like some way of getting that information from the operating system or some CPU identification register):
The basic idea is that (by design), your L1 cache is faster than your L2 cache which is faster than your L3 cache... In any normal design, your L1 cache is also smaller than your L2 cache which is smaller than your L3 cache...
So you want to allocate a large-ish block of memory and then access (read and write) it sequentially[1] until you notice that the time taken to perform X accesses has risen sharply. Then keep going until you see the same thing again. You would need to allocate a memory block larger than the largest cache you are hoping to discover.
This requires access to some low-overhead timestamp counter for the actual measurement (as pointed out in the referred-to answer).
[1] Or, if you want to try to fool any clever prefetching that may skew the results, randomly within a sequentially progressing N-byte block.
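A minimal sketch of the approach described above (a variant that grows the working set in powers of two; the sizes and the use of clock() instead of a low-overhead cycle counter are simplifying assumptions): perform the same number of pseudo-random accesses while the working set grows, and look for sharp jumps in the time per access, which tend to line up with the cache capacities.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const size_t max_bytes = (size_t)64 << 20;     /* probe working sets up to 64 MiB */
    const size_t accesses = (size_t)1 << 24;       /* same amount of work at every size */
    char *buf = malloc(max_bytes);
    if (!buf) return 1;
    for (size_t i = 0; i < max_bytes; i++) buf[i] = (char)i;

    for (size_t size = 4 << 10; size <= max_bytes; size *= 2) {
        unsigned x = 2463534242u;                  /* xorshift32 for pseudo-random offsets */
        volatile char sink = 0;                    /* volatile keeps the loads from being optimized away */
        clock_t t0 = clock();
        for (size_t i = 0; i < accesses; i++) {
            x ^= x << 13; x ^= x >> 17; x ^= x << 5;
            sink += buf[x % size];                 /* working set limited to `size` bytes */
        }
        clock_t t1 = clock();
        double ns_per_access = (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / accesses;
        printf("%8zu KiB: %6.2f ns/access\n", size >> 10, ns_per_access);
    }
    free(buf);
    return 0;
}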
