Programmatically find the number of cache levels - c

I am a newbie in C programming. I have an assignment to find the number of data cache levels in the CPU and also the hit time of each level. I am looking at "C Program to determine Levels & Size of Cache" but finding it difficult to interpret the results. How is the number of cache levels revealed?
Any pointers will be helpful.

Assuming you don't have a way to cheat (like some way of getting that information from the operating system or some CPU identification register):
The basic idea is that (by design), your L1 cache is faster than your L2 cache which is faster than your L3 cache... In any normal design, your L1 cache is also smaller than your L2 cache which is smaller than your L3 cache...
So you want to allocate a large-ish block of memory and then access (read and write) it sequentially[1] until you notice that the time taken to perform X accesses has risen sharply. Then keep going until you see the same thing again. You would need to allocate a memory block larger than the largest cache you are hoping to discover.
This requires access to some low-overhead timestamp counter for the actual measurement (as pointed out in the referred-to answer).
[1] or depending on whether you want to try to fool any clever prefetching that may skew the results, randomly within a sequentially progressing N-byte block.
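To make that concrete, here is a minimal sketch of the measurement loop (the buffer sizes, access count, and use of clock_gettime() are my own choices, not something given in the question): you would expect the ns-per-access figure to step up each time the working set outgrows another cache level.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time a fixed number of accesses over a working set of the given size. */
static double time_accesses(volatile char *buf, size_t size, size_t accesses)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < accesses; i++)
        buf[(i * 64) % size]++;               /* touch one 64-byte line per access */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
    const size_t accesses = 1 << 24;
    for (size_t size = 1 << 12; size <= (size_t)1 << 26; size <<= 1) {  /* 4 KiB .. 64 MiB */
        volatile char *buf = calloc(size, 1);
        if (!buf) return 1;
        double ns = time_accesses(buf, size, accesses);
        printf("%8zu KiB: %.2f ns/access\n", size / 1024, ns / accesses);
        free((void *)buf);
    }
    return 0;
}

(As footnote [1] says, with a plain sequential stride like this the hardware prefetcher can hide much of the jump, so a randomized or pointer-chased order within the block gives sharper steps.)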

Related

Cache locality of an array and amortized complexity

An array is usually faster in access than a linked list.
This is primarily due to cache locality of an array.
I have a few doubts:
On what factor does the amount of data that is brought into the cache memory depend? Is it always equal to the total cache memory of the system? How can we know how much memory this is?
The first access to an array is usually costlier, as the array has to be located in memory and brought into the cache. The subsequent operations are comparatively faster. How do we calculate the amortized complexity of the access operation?
What are cache misses? Does it mean (in reference to a linked list) that the required item (current pointer->next) has not been loaded into the cache memory, and hence memory has to be searched again for its address?
In reality, it is a bit more complex than the simple model you present in the question.
First, you may have multiple caching layers (L1,L2,L3), each of them with different characteristics. Especially, the replacement policy for each cache may use different algorithms as a tradeoff between efficiency and complexity (i.e. cost).
Then, all modern operating systems implement virtual memory mechanisms. It is not enough to cache the data and the instructions (which is what L1..L3 are used for); it is also required to cache the association between virtual and physical addresses (in the TLB, the translation lookaside buffer).
To understand the impact of locality, you need to consider all these mechanisms.
Question 1
The minimum unit exchanged between the memory and a cache is the cache line. Typically, it is 64 bytes (but it depends on the CPU model). Let's imagine the caches are empty.
If you iterate on an array, you will pay for a cache miss every 64 bytes. A smart CPU (and a smart program) could analyze the memory access pattern and decide to prefetch contiguous blocks of memory in the caches to increase the throughput.
If you iterate on a list, the access pattern will be random, and you will likely pay a cache miss for each item.
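For illustration (this sketch is mine, not part of the answer), compare how the two traversals touch memory: the array walk brings in a new 64-byte line only once per 16 ints, while each list hop can land on a cold line if the nodes are scattered.

#include <stddef.h>

struct node { int value; struct node *next; };

long sum_array(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                    /* sequential: roughly one miss per 64-byte line */
    return s;
}

long sum_list(const struct node *head)
{
    long s = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        s += p->value;                /* scattered nodes: potentially one miss per node */
    return s;
}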
Question 2
The whole array is not searched and brought in the cache on first access. Only the first cache line is.
However, there is also another factor to consider: the TLB. The page size is system dependent. A typical value is 4 KB. The first time the array is accessed, an address translation will occur (and its result will be stored in the TLB). If the array is smaller than 4 KB (the page size), no other address translation will have to be done. If it is bigger, then one translation per page will be done.
Compare this to a list. The probability that multiple items fit in the same page (4 KB) is much lower than for the array. The probability that they fit in the same cache line (64 bytes) is extremely low.
I think it is difficult to calculate a complexity, because there are probably other factors to consider. But in this complexity you have to take into account the cache line size (for cache misses) and the page size (for TLB misses).
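As a rough back-of-the-envelope check (my own arithmetic, using the 64-byte line and 4 KB page sizes assumed above, and 4-byte ints):

#include <stdio.h>

int main(void)
{
    size_t n = 100000;                                   /* 100,000 4-byte ints, ~400 KB */
    size_t lines = (n * sizeof(int) + 63) / 64;          /* ~6250 cache lines */
    size_t pages = (n * sizeof(int) + 4095) / 4096;      /* ~98 pages, i.e. ~98 TLB fills */
    printf("array: about %zu cache lines, %zu pages\n", lines, pages);
    printf("list : up to %zu cache lines and %zu pages for the same %zu items\n", n, n, n);
    return 0;
}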
Question 3
A cache miss is when a given cache line is not in the cache. It can happen at the L1, L2, or L3 level. The higher the level, the more expensive the miss.
A TLB miss occurs when the virtual address is not in the TLB. In that case, a conversion to a physical address is done using the page tables (costly) and the result is stored in the TLB.
So yes, with a linked list, you will likely pay for a higher number of cache and TLB misses than for an array.
Useful links:
Wikipedia article on caches in CPUs: https://en.wikipedia.org/wiki/CPU_cache
Another good article on this topic: http://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
An oldish, but excellent article on the same topic: http://arstechnica.com/gadgets/2002/07/caching/
A gallery of various caching effects: http://igoro.com/archive/gallery-of-processor-cache-effects/

Why can pointer chasing in double-linked list avoid cache thrashing (self-eviction)?

I was trying to understand this paper about cache timing issues
In Section 3.6, the authors explain a technique that allows you to populate a contiguous cache region and measure the time for this populating process. They mentioned:
"A naive implementation of the prime and probe steps (i.e., scanning the memory buffer in fixed strides) gives poor results due to two optimizations implemented in modern CPUs: reordering of memory accesses, and automatic read-ahead of memory by the “hardware prefetcher”. Our
attack code works around both disruptions by using the following “pointer chasing” technique. During initialization, the attacker’s memory is organized into a linked list (optionally, randomly permuted); later, priming and probing are done by traversing this list (see Figure 7). To minimize cache thrashing (self-eviction), we use a doubly-linked list and traverse it forward for priming but backward for probing. Moreover, to avoid “polluting” its own samples, the probe code stores each obtained sample into the same cache set it has just finished measuring. On some platforms one can improve the timing gap by using writes instead of reads, or more than W reads."
I have a question about this paragraph:
How does a linked list avoid the timing variation due to hardware prefetching and reordering?
Currently implemented hardware prefetchers only handle fixed stride access patterns, so accesses that are ordered arbitrarily with respect to memory address are not recognized as prefetchable. Such hardware prefetchers could still detect and use accidental stride patterns. E.g., if address[X+5] = address[X] + N and address[X+7] = address[X+5] + 2N (note that the addresses do not have to be separated by a constant access count), a prefetcher might predict a stride of N. This could have the effect of modestly reducing apparent miss latency (some prefetches would be valid) or introducing some cache pollution.
(If the detected/predicted N is small and the linked list is contained within a modest region of memory relative to cache size (so that most of the stride prefetches will be to addresses that will be used soonish), stride-based prefetching might have a noticeable positive effect.)
By using data-dependent accesses (traversing a link requires access to the data), access reordering is prevented. Hardware cannot load the next node until the next pointer from the previous node is available.
There have been academic proposals for supporting prefetching of such patterns and even generic correlations (e.g., see this Google Scholar search for Markov prefetching), but such have not yet (to my knowledge) been implemented in commercial hardware.
As a side note, traversing the probe accesses in the reverse order of the priming accesses avoids excessive misses under common, LRU-oriented replacement policies.
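A rough sketch of the pointer-chasing structure described in the quoted paragraph (the node layout, the identity-order linking, and the 8-byte-pointer assumption are my simplifications; the paper's actual prime+probe code is more involved and would permute the order and time each step):

#include <stddef.h>

struct line {                       /* one node per 64-byte cache line (assumes 8-byte pointers) */
    struct line *next;
    struct line *prev;
    char pad[48];
};

void build_chain(struct line *buf, size_t n)   /* a real attack would link in a random permutation */
{
    for (size_t i = 0; i < n; i++) {
        buf[i].next = (i + 1 < n) ? &buf[i + 1] : NULL;
        buf[i].prev = (i > 0)     ? &buf[i - 1] : NULL;
    }
}

struct line *prime(struct line *head)          /* forward walk: each load depends on the previous */
{                                              /* one, so the CPU cannot reorder or run ahead     */
    struct line *p = head, *tail = head;
    while (p) { tail = p; p = p->next; }
    return tail;
}

size_t probe(struct line *tail)                /* backward walk; a real probe times each step */
{
    size_t steps = 0;
    for (struct line *p = tail; p != NULL; p = p->prev)
        steps++;
    return steps;
}

Probing in the reverse order means that any line the probe has to reload tends to evict lines that have already been measured rather than ones still waiting to be probed, which is the self-eviction point made in the side note above.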

Does size of array in C affect time to access or change a piece of data in the array

Title says it all:
I create an array, say ARRAY[10000].
I set bounds where I only need to access data from 1-100 and 900-2000.
Will accessing and/or changing a piece of data take a greater amount of time in this manner than if I had declared the array as ARRAY[2001]?
Will access and/or change time be changed if I have an array that only has the data from 1-100 and 900-2000?
I have seen some papers on this but they have not been clear to me and was hoping I could get a more concise answer here.
If the array is accessed infrequently, then the size of the array probably won't make much difference because you'd be getting a cache miss anyway. In that case the time will depend on how quickly the CPU can do any "virtual address to physical address" conversion and fetch the data from RAM.
The more frequently you access something in the array, the more cache effects matter. These cache effects depend heavily on the number, sizes and speeds of different caches.
However, it also depends on your access patterns. For example, if you have a 1 GiB array and frequently access 5 bytes of it, then the size won't make much difference as the 5 bytes you're accessing frequently will be in cache. For another example, if you use a sequential access pattern (e.g. "for each element in array { ... }") then it's likely the CPU might do hardware pre-fetching and you won't pay the full cost of cache misses.
For a typical 80x86 system with a random access pattern and frequent accesses, there are about five different sizes that matter. The first (smallest) is the L1 data cache size - if the array fits in the L1 data cache, then it's going to be relatively fast regardless of whether the array is 20 bytes or 20 KiB. The next size is the L2 data cache size - as the array gets larger, the ratio of "L1 hits to L2 hits" decreases and performance gets worse until (at maybe "twice as large as L1") the L1 hits become negligible. Then (for CPUs that do have L3 caches) the same happens with the L3 cache size, where as the array gets larger the ratio of "L2 hits to L3 hits" decreases.
Once you go larger than the largest cache, the ratio of "cache hits to cache misses" decreases.
The next size that can matter is TLB size. Most modern operating systems use paging, and most modern CPUs cache "virtual address to physical address conversions" in something typically called a Translation Look-aside Buffer. If the array is huge then you start getting TLB misses (in addition to cache misses) which makes performance worse (as the CPU can't avoid extra work when converting your virtual address into a physical address).
Finally, the last size that can matter is how much RAM you actually have. If you have a 10 TiB array, 4 GiB of RAM and 20 TiB of swap space; then the OS is going to be swapping data to/from disk and the overhead of disk IO is going to dominate performance.
Of course often you're creating an executable for many different computers (e.g. "64-bit 80x86, ranging from ancient Athlon to modern Haswell"). In that case you can't know most of the details that matter (like cache sizes) and it becomes a compromise between guesses (estimated overhead from cache misses due to "array too large" vs. estimated overhead from other things caused by "array too small").
No, at least normally the time to access any item in an array will be constant, regardless of the array's size.
This could change if (for example) you define an array larger than memory, so the OS (assuming you're using one) ends up doing some extra paging to support the larger array. In most cases, even that's unlikely to have much impact though.
Maybe yes, but it depends on the size.
Cache size
The size of the access range changes the latency.
int ARRAY[10000] is about 40 KB, slightly larger than a 32 KB L1 cache, but the ranges you actually access may well fit.
A very small access range that fits in the L1 cache (32 KB) costs 4 clocks on Haswell, but an L2 cache access costs 12 clocks.
See details here:
http://www.7-cpu.com/cpu/Haswell.html
Cache conflict
If another core of the CPU modifies some data in the array, the local cache line's state will be changed to the Invalid state.
A cache line in the Invalid state costs much more latency to access.
NUMA
Some environments have multiple CPU sockets on the motherboard; this is called a non-uniform memory access (NUMA) environment.
Such a machine may have a huge memory capacity, but some memory addresses may be resident near CPU1 while other addresses are resident near CPU2.
int huge_array[SOME_HUGE_SIZE]; // it spans multiple CPUs' DIMMs
// suppose that the entire huge_array is out of cache.
huge_array[ADDRESS_OF_CPU1] = 1; // maybe fast (local memory for CPU1)
huge_array[ADDRESS_OF_CPU2] = 1; // maybe slow (remote memory for CPU1)
But whether a huge array would actually span multiple CPUs' memory is uncertain.
Allocation of such a huge array may simply fail.
It depends on the OS.
In theory, as was stated by others, array access is constant time, and thus does not cost more or less depending on the array's size. This question seems to be about real-life performance though, and there the array size definitely matters. How so is well explained by the accepted answer by @Brendan.
Things to consider in practice:
* How big are the elements of your array: bool[1000], MyStruct[1000] and MyStruct*[1000] may differ a lot in access performance
* Try writing code both ways: once using the big array, and once keeping the required data in a smaller array. Then run the code on your target hardware(s) and compare performance. You will often be surprised to see that optimization attempts make performance worse, and you learn a lot about hardware and its quirks in the process.
I don't believe it should.
When you access an element, you are going to memory location base + (element offset); therefore, regardless of the array's size, it will get to the same memory location in the same time.

force some data on L1 cache

Apologies about this simple question. Still struggling with some of the memory concepts here. Question is: Suppose I have a pre-computed array A that I want to access repeatedly. Is there a way to tell a C program to keep this array as close as possible to the CPU cache for fastest access? Thanks.
There is no way to force an array to L1/L2 cache on most architectures; it is not needed usually, if you access it frequently it is unlikely to be evicted from cache.
On some architectures there is a set of instructions that allows you to give the processor a hint that the memory location will soon be needed, so that it can start loading it to L1/L2 cache early - this is called prefetching, see _mm_prefetch instruction for example ( http://msdn.microsoft.com/en-us/library/84szxsww(v=vs.80).aspx ). Still this is unlikely to be needed if you're accessing a small array.
The general advice is - make your data structures cache-efficient first (put related data together, pack data, etc.), try prefetching later if the profiler tells you that you're still spending time on cache misses and you can't improve the data layout any further.
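As a small illustration of the prefetch hint mentioned above (a sketch only: PREFETCH_DISTANCE is a made-up tuning knob, and for a simple sequential loop like this the hardware prefetcher usually already does the job):

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 (x86; GCC/Clang also offer __builtin_prefetch) */
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* elements to look ahead; illustrative value only */

double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            _mm_prefetch((const char *)&a[i + PREFETCH_DISTANCE], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}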

Measure size and way-order of L1 and L2 caches

How can I programmatically measure (not query the OS) the size and order of associativity of L1 and L2 caches (data caches)?
Assumptions about system:
It has L1 and L2 caches (maybe L3 too, maybe with cache sharing),
It may have a hardware prefetch unit (just like P4+),
It has a stable clocksource (tickcounter or good HPET for gettimeofday).
There are no assumptions about OS (it can be Linux, Windows, or something else), and we can't use POSIX queries.
Language is C, and compiler optimizations may be disabled.
I think all you need to do is repeatedly access memory in ever-increasing chunks (to determine cache size), and I think you can vary the strides to determine associativity.
So you would start out trying to access very short segments of memory and keep doubling the size until access slows down. Every time access slows down you've determined the size of another level of cache.
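For the associativity part, one sketch of the "vary the strides" idea (my own, with the timing loop left out): once you know a candidate cache size, access k blocks spaced exactly that many bytes apart so that they all compete for the same set; when k exceeds the number of ways, every pass starts missing and the per-access time jumps.

#include <stddef.h>

/* buf must be at least k * cache_size bytes; cache_size is assumed to be
   known already (e.g. from the size measurement described above). */
void touch_same_set(volatile char *buf, size_t cache_size, int k, long passes)
{
    for (long p = 0; p < passes; p++)
        for (int i = 0; i < k; i++)
            buf[(size_t)i * cache_size] += 1;   /* k lines all mapping to one set */
}

Time this for k = 1, 2, 3, ...; the first k at which the per-access cost rises sharply is one more than the associativity (ignoring complications such as physical indexing of L2/L3 and adaptive replacement policies).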
Here is the code from ATLAS; it determines the L1 cache size:
ATLAS/tune/sysinfo/L1CacheSize.c
(https://github.com/vtjnash/atlas-3.10.0/blob/master/tune/sysinfo/L1CacheSize.c)
int GetL1Size(int MaxSize, double tol)
{
    int L1Size, tmp, correct=1;
    fprintf(stderr, "\n Calculating L1 cache size:\n");
    /* ... (rest of the ATLAS routine omitted) ... */
But it covers only the L1 cache, and only its size, not the way count.
You might find the STREAM benchmark useful or interesting or both.
The question is a little outdated, but the answer is here.
