C - cache lines and process performance?

Context
I am doing some experiments with memory caching and have read a lot of papers.
The problem is not how to write cache-friendly code within a single process; I have mostly got that.
My main concern is: how will the cache behave when, say, hundreds of running processes hit the L1 cache?
Since L1 space is scarce, should I expect a lot of cache evictions that slow the other processes down, since all the processes will fight over the L1 cache?
Assume a CPU with 64-byte cache lines, a 64 KB L1 cache, and a 64-bit word size.
This is the point I don't understand.
Edit: the hundreds of processes are per core.

First off, you'll likely be using a multi-core CPU, which means you have far fewer processes per core. Modern OSes also try to keep processes somewhat associated with particular cores.
But that said, you do indeed lose the L1 cache contents when your process is switched out. It doesn't even make sense to hold on to them: your address 0x04000000 doesn't have the same content as the same address in another process, because they're virtual addresses.
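A quick way to see this on a POSIX system is a fork() sketch: parent and child print the same virtual address for a local variable yet hold different values there, so a cached line belonging to one process is useless to the other. (The variable name and output format are just illustrative.)

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int value = 1;                  /* lives at some virtual address &value */
        pid_t pid = fork();
        if (pid == 0) {                 /* child: same virtual address, private copy */
            value = 2;
            printf("child : addr=%p value=%d\n", (void *)&value, value);
            return 0;
        }
        wait(NULL);
        printf("parent: addr=%p value=%d\n", (void *)&value, value);
        return 0;
    }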

Related

Does the distance between read and write locations have an effect on cache performance?

I have a buffer of size n that is full, and a successor buffer of size n that is empty. I want to insert a value into the first buffer at position i, but since the buffer is full I would need to move a range of memory forward in order to do that (i.e. a sequential insert). I have two options here:
Prefer write close to read (adjacent):
Push the last value of the first buffer into the second.
Move the elements between i and n - 1 in the first buffer forward by one.
Insert at i.
Prefer fewer steps:
Copy the range i to n - 1 from the first into the second buffer.
Insert at i.
Most of what I can find only talks about locality in a read context, and I am wondering whether the distance between the read and write locations should also be considered.
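For concreteness, here is a minimal C sketch of the two options described above; the buffer names buf and next, the plain-int element type, and the function names are illustrative only.

    #include <stddef.h>
    #include <string.h>

    /* Option 1: prefer a write close to the read (shift in place). */
    void insert_adjacent(int *buf, int *next, size_t n, size_t i, int value) {
        next[0] = buf[n - 1];                                     /* push the last value */
        memmove(&buf[i + 1], &buf[i], (n - 1 - i) * sizeof *buf); /* shift i..n-2 forward */
        buf[i] = value;                                           /* insert at i */
    }

    /* Option 2: prefer fewer steps (one bulk copy into the successor buffer). */
    void insert_split(int *buf, int *next, size_t n, size_t i, int value) {
        memcpy(next, &buf[i], (n - i) * sizeof *buf);             /* copy the tail wholesale */
        buf[i] = value;                                           /* insert at i */
    }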
Does the distance between read and write locations have an effect on cache performance?
Yes. Normally (excluding rare situations where the CPU can write an entire cache line with new data) the CPU has to fetch the most recent version of a cache line into its cache before doing the write. If the cache line is already in the cache (e.g. due to a previous read of some other data that happened to be in the same cache line), then the CPU won't need to fetch it before doing the write.
Note that there are also various other quirks (cache aliasing, TLB misses, etc.), and all of it depends on the specific situation and the specific CPU (e.g. if all of the process's data fits in the CPU's cache, there's no shared memory involved, and there are no task switches or other processes using the CPU, then you can assume everything will always be in the cache anyway).
I want to insert a value within the first buffer at position i, but I would need to move a range of memory forward in order to do that, since the buffer is full (i.e. a sequential insert).
Without more information (how often this happens, how much data is involved, etc.) I can't really make any suggestions. However, at first glance the whole idea seems bad. More specifically, it sounds like you're adding a lot of hassle just to make two smaller arrays behave exactly the same as one larger array would (and then worrying about the cost of insertion, even though arrays aren't good for insertion in general).
This is a component deep within a data structure, at the lowest level, where n is small and constant.
By small I assume you mean smaller than the CPU's L1 cache (somewhere under 1 MB) or its L2/L3 cache (up to 10-20 MB, depending on your CPU). If so, then no.
I am wondering whether the distance between the read and the write memory should be considered.
Sometimes. If all the data fits into the L1/L2/L3 cache of the CPU the process is running on, then what you think of as random access all has the same latency. You can get into the nitty-gritty differences between the L1, L2, and L3 caches, but for the sake of brevity (and because I simply take it for granted), anywhere within one memory boundary the access latency is the same. So in your case, where n is small and everything fits into the CPU cache (the first of many boundaries), what affects performance (time to complete) is the manner and efficiency in which you choose to move/change values, and the number of times you end up doing it.
Now, if n were big, for example on a system with two or more sockets (connected over Intel QPI or UPI) where the data resides in DDR RAM attached to the memory controller of the other CPU, then definitely yes, there is a big performance hit (relatively speaking), because a boundary has been crossed. Whatever could not fit into the cache of the CPU the process is running on (and was initially fetched from the DIMMs local to that CPU's memory controller) now incurs the overhead of talking to the other CPU over the QPI or UPI link (still very fast compared to previous architectures); that other CPU then fetches the data from its own set of DIMMs and sends it back over QPI or UPI to the CPU your process is running on.
So, when you exceed the L1 cache and spill into L2 there is a performance hit, and likewise when you spill into L3, all within one CPU. When a process has to repeatedly fetch from its local set of DIMMs more data than fits into the cache, that is another performance hit. When that data is not on DIMMs local to that CPU, it gets slower. When the data is not on the same motherboard and goes across some kind of high-speed RDMA fabric, slower still; across Ethernet, even slower; and so on.
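A rough way to see those boundaries on your own machine is to time dependent loads while the working set grows. The sketch below is only illustrative: the sizes, iteration count, and the fixed one-line stride are assumptions, and a serious benchmark would randomize the chain to defeat the hardware prefetcher. It prints an approximate nanoseconds-per-access figure that steps up as the working set outgrows L1, L2, and L3.

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        for (size_t bytes = 16 << 10; bytes <= (size_t)64 << 20; bytes *= 2) {
            size_t n = bytes / sizeof(size_t);
            size_t *chain = malloc(n * sizeof *chain);
            size_t stride = 64 / sizeof(size_t);          /* one element per 64-byte line */
            for (size_t i = 0; i < n; i++)
                chain[i] = (i + stride) % n;              /* each load depends on the previous one */
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            size_t p = 0;
            for (long i = 0; i < 10000000; i++)
                p = chain[p];                             /* pointer chase */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            printf("%8zu KB: %.2f ns/access (p=%zu)\n", bytes >> 10, ns / 1e7, p);
            free(chain);
        }
        return 0;
    }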

Is there a way to avoid cache misses _completely_?

I read the very basics of how the cache works here: How and when to align to cache line size? and here: What is "cache-friendly" code?, but neither of those posts answered my question: is there a way to execute some code entirely within the cache, i.e. without any accesses to RAM (beyond, perhaps, the initial process of reading the file from the HDD)? As far as I understand, the bottleneck in computation nowadays is mostly memory bandwidth, and "as long as you stay within the CPU, you are just fine".
Is there a way to load a program into the cache and keep it there until it terminates? Say I have a 1 MB compiled C program that does some scientific computation with a memory requirement of another 1 MB and runs for 5 days. Is there a way to flag this code so that it does not get evicted from the cache during execution? I am thinking of giving this code a higher priority, or something similar, during execution.
In other words, how much cache is used by an idling computer that has loaded its OS (say Ubuntu) and then does nothing? Is there much cache use during idling? Should I expect my small program to stay in the cache if the OS does nothing besides executing it? Say that after 5 minutes the screensaver starts: does this lead to massive cache misses (and hence a drastic drop in performance), since it now competes with my program for cache space? My experience is that running several non-demanding programs (like a screensaver, a simple audio player, a PDF reader, etc.) at the same time does not significantly decrease the performance of my scientific program, even though I would expect it to be going in and out of the cache all the time. Why is its speed not affected? Would it make sense to use an absolutely minimalistic OS (if so, which one?) to improve, or rather maintain, the speed of the computation?
Just for clarity, we can assume that the code is something very simple, say a bunch of nested for loops where the innermost part sums up all the loop counters modulo 97. The point is that it is small enough to be held and executed entirely in the cache.
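For reference, a toy kernel of the kind described (the loop bounds are arbitrary): a few KB of code and essentially no data, so both easily fit in L1.

    #include <stdio.h>

    int main(void) {
        unsigned long sum = 0;
        for (int i = 0; i < 1000; i++)
            for (int j = 0; j < 1000; j++)
                for (int k = 0; k < 1000; k++)
                    sum = (sum + i + j + k) % 97;   /* sum of the counters, modulo 97 */
        printf("%lu\n", sum);
        return 0;
    }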
There are different types of CPU cache misses: compulsory, conflict, capacity, coherence.
Compulsory misses can't be avoided, as they happen on the first reference to a location in memory. So no, you definitely can't avoid cache misses completely.
Besides that, typical L1 cache sizes today are 32 KB or 64 KB per core, and L2 cache sizes are around 256 KB per core. So 1 MB of data would also cause either capacity or conflict misses, depending on the cache's associativity.
No, on most standard architectures, CPU cache is not addressable.*
And even if you could, what kind of performance improvement are you anticipating here? What percentage of your program's execution time do you believe is being spent loading from main memory into (L3) cache? You should profile your program to determine where it's actually spending its time, rather than dreaming up solutions to problems that don't exist!
* I think x86 CPUs might have a hardware configuration which allows them to operate without attached RAM, but that's basically irrelevant.
Short answer: no. The cache is managed by the OS/CPU, and it is a bad idea to let programs force themselves to stay in the cache. Say you have two programs running at the same time and both try to force themselves to stay in the cache: chaos would ensue, wouldn't it?
Newer Intel CPUs have added Cache Allocation Technology (CAT) under the general rubric of their Resource Director Technology. This allows software to reserve certain cache (and other) resources for particular computational units (an application, container, VM, etc.). So, if the process in question has enough cache space set aside for it under CAT, it should experience only its initial compulsory misses (to bring its code and data into the cache) and self-induced conflict misses, avoiding capacity misses and conflict misses created by other processes.
I am not sure whether this will answer your questions.
is there a way to execute some code entirely within the cache, i.e., without using any access to RAM?
Is there a way to load a program into the cache, and keep it there until it terminates?
It is possible to use fully associative caches (for example tightly coupled memories, TCMs), which have single-cycle access times (this is realistic only in very small embedded systems). It is common practice in embedded systems to use TCMs for time-critical code, as they provide predictability.
In the case of set-associative caches it is possible to lock cache lines or ways (for example using CP15 on ARM), so that the eviction algorithm does not consider them as victims for a cache fill.
As a side note, it is also sometimes useful to use cache-as-RAM for bring-up of non-booting boards, when the caches are in debug mode.
(http://www.asset-intertech.com/Products/Processor-Controlled-Test/PCT-Software/Cache-as-RAM-for-board-bring-up-of-non-boothing-ci)

Programming to fill our system's L1 or L2 cache

How can we systematically write code that loads data into the L1 or L2 cache?
I am specifically trying to fill the L1 instruction cache of my system for some further analysis.
Any suggestions will do, with respect to writing assembly code or simple C. Related articles on this topic would be even more helpful.
A cache stores recently accessed data. To fill the cache, just access the data, or in this case, instructions. Fill a block of memory with no-op instructions (and a looping branch instruction at the end) and jump to it.
The tricky part is keeping data in there once it's loaded. You can't access anything outside the 32K (or whatever) data set as long as your benchmark is running.
I can't imagine what you get from artificially filling a cache and then keeping it filled with the same data set, but there you go.
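A hedged sketch of that idea for x86-64 Linux: a RET stands in for the looping branch, with the loop in the caller. The buffer size and repetition count are guesses, and hardened systems that forbid writable-and-executable mappings will reject the mmap.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t size = 32 * 1024;                       /* roughly one L1 instruction cache */
        unsigned char *code = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (code == MAP_FAILED) { perror("mmap"); return 1; }
        memset(code, 0x90, size);                      /* 0x90 = NOP */
        code[size - 1] = 0xC3;                         /* 0xC3 = RET */
        void (*fn)(void) = (void (*)(void))code;
        for (int i = 0; i < 1000; i++)
            fn();                                      /* execute the NOP sled repeatedly */
        puts("done");
        return 0;
    }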
You will need to find out the cache associativity of your CPU and the replacement policy. I can't think of a general solution to this problem that would work on all the CPUs I've worked with. Even caches advertised as fully associative with an LRU replacement policy aren't exactly that in reality and it can be very hard to figure out a pattern of memory access that completely fills the cache.
If you want this for some very specific benchmark (which is a bad idea for other reasons), I'd recommend you trying to figure out how to flush the cache instead. That is actually doable.
I performed this task just last week, for ECC filling of an L1 and L2 cache.
Basically, if you have, say, a 64 KB cache in total (x ways, y cache lines, etc.), then for the data cache just access that much data linearly through the cache (you may need the MMU on to enable caching): start on a 64 KB boundary and read 64 KB worth of data, ideally in cache-line-sized reads (or multiples thereof) if possible. For the icache you need that many bytes worth of instructions (NOPs, or add reg, 1, or something similar); remember there is probably a prefetch at the end, so you might have to pull the final return back a few instructions so that the prefetch takes you all the way to the end (this might take some practice, and if you don't have visibility into the logic, e.g. a chip simulation, you might never figure it out).
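For the data-cache half of that, a minimal sketch of the linear fill. The 64 KB cache size and 64-byte line size are the figures used above; the contents read are irrelevant, the point is to touch one word per line across the whole aligned block.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    volatile uint64_t sink;                            /* keeps the reads from being optimized out */

    int main(void) {
        enum { CACHE_SIZE = 64 * 1024, LINE = 64 };
        unsigned char *buf = aligned_alloc(CACHE_SIZE, CACHE_SIZE);  /* starts on a 64 KB boundary */
        memset(buf, 0, CACHE_SIZE);                    /* contents don't matter */
        uint64_t sum = 0;
        for (size_t off = 0; off < CACHE_SIZE; off += LINE)
            sum += *(const uint64_t *)(buf + off);     /* one read per cache line */
        sink = sum;
        free(buf);
        return 0;
    }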
You can use the MMU, or other tricks your logic might have, to reduce the amount of memory required. For example, if you have an MMU whose entry size covers, say, 4 KB, you could fill 4 KB of real memory with data, then use 16 different MMU entries (with 16 different virtual addresses) and, for each of the 16, read through that 4 KB. Of course, that only works if your cache is on the virtual-address side of the MMU.
Overall it is kind of an ugly thing to do. If your MMU lets you prevent instruction caching, you could put the code performing the test in a non-cached space so that it doesn't disturb the icache, and only the instructions used to fill the cache sit in a cached address space.
Good luck...

How to saturate memory bus

I want to test a program with various memory bus usage levels. For example, I would like to find out if my program works as expected when other processes use 50% of the memory bus.
How would I simulate this kind of disturbance?
My attempt was to run a process with multiple threads, each thread doing random reads from a big block of memory. This didn't appear to have a big impact on my program. My program has a lot of memory operations, so I would expect that a significant disturbance will be noticeable.
I want to saturate the bus but without using too many CPU cycles, so that any performance degradation will be caused only by bus contention.
Notes:
I'm using a Xeon E5645 processor, DDR3 memory
The mental model of "processes using 50% of the memory bus" is not a great one. A thread that has acquired a core and accesses memory that is not in the caches uses the memory bus.
Getting a thread to saturate the bus is simple: just use memcpy(). Copy several times the amount that fits in the last-level cache, and warm it up by running it multiple times so there are no page faults to slow the code down.
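A minimal sketch of that approach; the buffer size and thread count are guesses to tune for your machine, and the copy loop runs forever so you can start it alongside the program under test.

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUF_SIZE (256UL * 1024 * 1024)     /* several times the last-level cache */

    static void *hammer(void *arg) {
        (void)arg;
        char *src = malloc(BUF_SIZE), *dst = malloc(BUF_SIZE);
        memset(src, 1, BUF_SIZE);              /* fault the pages in up front */
        memset(dst, 2, BUF_SIZE);
        for (;;)
            memcpy(dst, src, BUF_SIZE);        /* keep the memory bus busy */
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)            /* a handful of copier threads */
            pthread_create(&t[i], NULL, hammer, NULL);
        pthread_join(t[0], NULL);              /* blocks forever by design */
        return 0;
    }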
My first instinct would be to set up a bunch of DMA operations to bounce data around without using the CPU too much. This all depends on what operating system you're running and what hardware. Is this an embedded system? I'd be glad to give more detail in the comments.
I'd use movntps (non-temporal SSE store) instructions to stream the data, to avoid cache conflicts with the other thread on the same core. Maybe unroll that loop 16 times to minimize the number of instructions per memory transfer. While the DMA idea sounds good, the linked manual is old and written for 32-bit Linux, and your processor model makes me think you probably have a 64-bit OS, which makes me wonder how much of it is still correct. Also, in the worst case a bug in your test code could wreck your hard drive.
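A hedged sketch of that using the corresponding intrinsic, _mm_stream_ps; the buffer size and repetition count are arbitrary, and the destination must be 16-byte aligned.

    #include <stdlib.h>
    #include <xmmintrin.h>                       /* SSE: _mm_stream_ps, _mm_sfence */

    #define N (64UL * 1024 * 1024)               /* floats: about 256 MB */

    int main(void) {
        float *dst = aligned_alloc(16, N * sizeof(float));
        __m128 v = _mm_set1_ps(1.0f);
        for (int rep = 0; rep < 100; rep++)
            for (size_t i = 0; i < N; i += 4)
                _mm_stream_ps(&dst[i], v);       /* movntps: stores bypass the caches */
        _mm_sfence();                            /* order the streaming stores */
        free(dst);
        return 0;
    }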

Multiple threads and CPU cache

I am implementing an image filtering operation in C using multiple threads and making it as optimized as possible. One question though: if a memory location is accessed by thread 0, and concurrently the same location is accessed by thread 1, will it be served from the cache? This question stems from the possibility that these two threads could be running on two different cores of the CPU. So another way of putting it is: do all the cores share the same cache memory?
Suppose I have a memory layout like the following:
int output[100];
Assume there are 2 CPU cores, so I spawn two threads to work concurrently. One scheme could be to divide the memory into two chunks, 0-49 and 50-99, and let each thread work on one chunk. Another way could be to let thread 0 work on even indices (0, 2, 4, and so on) while the other thread works on odd indices (1, 3, 5, ...). The latter technique is easier to implement (especially for 3D data), but I am not sure whether it uses the cache efficiently.
The answer to this question strongly depends upon the architecture and the cache level, along with where the threads are actually running.
For example, recent Intel multi-core CPUs have L1 caches that are per-core and an L2 cache that is shared among the cores in the same CPU package; different CPU packages have their own L2 caches.
Even when your threads are running on two cores within one package, though, if both threads access data within the same cache line, that cache line will bounce between the two L1 caches. This is very inefficient, and you should design your algorithm to avoid this situation.
A few comments have asked about how to go about avoiding this problem.
At heart, it's really not particularly complicated: you just want to prevent two threads from simultaneously accessing data that is located on the same cache line, where at least one of the threads is writing to the data. (As long as all the threads are only reading the data, there's no problem; on most architectures, read-only data can be present in multiple caches.)
To do this, you need to know the cache line size - this varies by architecture, but currently most x86 and x86-64 family chips use a 64 byte cache line (consult your architecture manual for other architectures). You will also need to know the size of your data structures.
If you ask your compiler to align the shared data structure of interest to a 64-byte boundary (for example, your array output), then you know it will start at the beginning of a cache line, and you can also calculate where the subsequent cache-line boundaries are. If your int is 4 bytes, each cache line will contain exactly 16 int values. As long as the array starts on a cache-line boundary, output[0] through output[15] will be on one cache line, and output[16] through output[31] on the next. In this case, you would design your algorithm so that each thread works on a block of adjacent int values whose size is a multiple of 16.
If you are storing complicated struct types rather than plain int, the pahole utility will be of use. It analyses the struct types in your compiled binary and shows you their layout (including padding) and total size. You can then adjust your structs using this output; for example, you may want to manually add some padding so that your struct is a multiple of the cache line size.
On POSIX systems, the posix_memalign() function is useful for allocating a block of memory with a specified alignment.
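Pulling those pieces together, a minimal sketch (not a real filter; the array size, the squaring work, and the two-thread split are illustrative): the array is allocated on a 64-byte boundary with posix_memalign(), and each thread gets a contiguous block whose length is a multiple of 16 ints, so the two threads never write to the same cache line.

    #define _POSIX_C_SOURCE 200112L
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    enum { N = 128, LINE = 64 };                 /* 128 ints = 8 whole cache lines */

    struct job { int *out; int begin, end; };

    static void *work(void *arg) {
        struct job *j = arg;
        for (int i = j->begin; i < j->end; i++)
            j->out[i] = i * i;                   /* stand-in for the real filter */
        return NULL;
    }

    int main(void) {
        int *output;
        if (posix_memalign((void **)&output, LINE, N * sizeof(int)) != 0)
            return 1;                            /* output now starts on a cache-line boundary */
        struct job jobs[2] = {
            { output, 0,     N / 2 },            /* 64 ints: a multiple of 16 */
            { output, N / 2, N     },
        };
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, work, &jobs[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("output[0]=%d output[%d]=%d\n", output[0], N - 1, output[N - 1]);
        free(output);
        return 0;
    }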
In general it is a bad idea to share overlapping memory regions, e.g. one thread processing 0, 2, 4, ... while the other processes 1, 3, 5, .... Although some architectures may support this, most will not, and you usually cannot specify which machines your code will run on. The OS is also free to assign your code to any cores it likes (a single core, two cores on the same physical processor, or two cores on separate processors). Each CPU core usually also has a separate first-level cache, even when the cores are on the same processor.
In most situations the 0, 2, 4... / 1, 3, 5... scheme will slow performance down dramatically, possibly to the point of being slower than a single CPU.
Herb Sutter's "Eliminate False Sharing" demonstrates this very well.
Using the scheme [0 ... n/2-1] and [n/2 ... n] will scale much better on most systems. It may even lead to superlinear speedup, since the combined cache size of all the CPUs can potentially be used. The number of threads used should always be configurable and should default to the number of processor cores found.
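The standard fix from Sutter's article, sketched in C11 (the counter struct, thread count, and iteration count are illustrative): align each thread's hot data to its own 64-byte cache line so that two threads never write into the same line.

    #include <pthread.h>
    #include <stdalign.h>
    #include <stdio.h>

    struct padded { alignas(64) long counter; };   /* one cache line per slot */
    static struct padded counters[2];

    static void *bump(void *arg) {
        long idx = (long)arg;
        for (long i = 0; i < 100000000; i++)
            counters[idx].counter++;               /* no false sharing: lines are separate */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, bump, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", counters[0].counter, counters[1].counter);
        return 0;
    }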
I might be mistaken, but whether the cores' cache is shared or not depends on the implementation of the CPU. You'd have to look up the technical sheets on the manufacturer's page to check whether each core in your CPU has its own cache or whether the cache is shared.
I was working on image manipulation as well, for a security company, and sometimes we got corrupted images after running batch operations on threads. After long investigations we came to the conclusion that the cache was shared between CPU cores and that, in rare cases, data was being overwritten or replaced with incorrect data.
Whether this is something to take into account or rather a rare event, I cannot answer.
Intel documentation
Intel publishes per-generation datasheets which may contain this kind of information.
For example, for the i5-3210M processor I had in my older computer, I look up the 3rd-generation datasheet; Datasheet Volume 1, section 3.3, "Intel Hyper-Threading Technology (Intel HT Technology)", says:
The processor supports Intel Hyper-Threading Technology (Intel HT Technology)
that allows an execution core to function as two logical processors. While some
execution resources such as caches, execution units, and buses are shared, each
logical processor has its own architectural state with its own set of general-purpose registers and control registers.
which confirms that, for that generation of CPUs, caches are shared between the hyperthreads of a given core.
See also:
similar question for cache sharing across cores: How are cache memories shared in multicore Intel CPUs?
further analysis of threads vs cores: https://superuser.com/questions/133082/what-is-the-difference-between-hyper-threading-and-multiple-cores/995858#995858
the architecture spec itself also has a section about the sharing of certain resources that must be valid across all implementations, although it does not mention caches: What does multicore assembly language look like?
