How many cycles does false sharing cost? - c

My target platforms are Windows and Linux on x86-64 (Coffee Lake or newer, Zen 2 or newer) and Mac M2. Is there a penalty when multiple threads access the same data at the same time, and how large is it if one thread changes one variable once? Do other cores stall immediately if they have that cache line loaded? How many cycles does the update take? As I understand it, false sharing happens when you change a byte on a cache line; is the affected region strictly 128 bytes or less? Can I ignore the TLB here? Here's my situation:
I have a few objects that are the source of truth for some data. I can't remember whether they're 64 bytes or larger. There is a 32-bit status flag in the first 64 bytes. Many threads may access it and the bytes next to it, sometimes 100 times, sometimes just once from one other thread. I'm not sure how many nanoseconds will pass between a write and a read, but only one write will happen.
The C++ thread sanitizer complained that I change the flag in one thread and read it in another without using atomic operations in either. The other threads don't need to see the update, since I simply set a bit they don't care about. I was thinking I could use atomic load/store with memory_order_relaxed.
Another option is to keep a pointer and go through it to update the data. If 1K objects are each written once and every other thread happens to read them within 10ns, would that be a problem? How many cycles would that stall? This assumes there are no penalties when many cores read the same data. I have a bit of a memory bandwidth problem (I'm writing a lot of data), so I'm concerned about using more memory than I need to.
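The relaxed-atomic idea from the question could look roughly like the sketch below, written with C11 <stdatomic.h>. The struct layout, the bit name STATUS_BIT_DONE and the helper names are assumptions made up for illustration; only the 32-bit status word in the first 64 bytes comes from the question. It removes the data race the sanitizer reports, but it does not change which cache line the flag lives on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical object: a 32-bit status word inside the first 64 bytes. */
struct object {
    _Atomic uint32_t status;   /* accessed only through atomic operations */
    uint32_t payload[15];      /* neighbouring data, likely on the same line */
};

#define STATUS_BIT_DONE (1u << 3)   /* hypothetical bit other threads ignore */

/* Writer: set one bit without ordering any other memory accesses. */
static inline void object_mark_done(struct object *o)
{
    atomic_fetch_or_explicit(&o->status, STATUS_BIT_DONE, memory_order_relaxed);
}

/* Reader: a relaxed load is race-free but gives no ordering guarantees. */
static inline uint32_t object_status(struct object *o)
{
    return atomic_load_explicit(&o->status, memory_order_relaxed);
}
```

If only one thread ever writes the word, a relaxed load followed by a relaxed store is enough and avoids the locked read-modify-write that atomic_fetch_or compiles to on x86.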

Related

Random Memory Reads vs Random Memory Writes

In low-level languages like C, I know you should try to use the CPU cache to your benefit as much as possible, since a cache miss means your program temporarily has to wait for RAM when it dereferences a pointer. However, are writes to memory also affected by this? If you write to memory, it would seem that the CPU does not need to wait for a response.
I'm trying to decide whether reordering an array of items would truly be worth it when I need to access items in the array in certain groups repeatedly (so sorting it based on those groups). However, those groups will change frequently, so I would need to keep reordering the array if I do this.
Depending on your architecture, random memory writes can be expensive for at least two reasons.
On today's multi-core machines, almost all writes will require some kind of cache coherence protocol to be run so that the corresponding cache lines on other caches will be invalidated.
Ordinary writes either always cost a memory operation (write-through cache) or only sometimes cause one (write-back cache).
You can read more details about the possible behaviors of caches on Wikipedia.
This is a very broad question, so my answer is nearly as broad.
The source code, the compiled code, and the underlying hardware are not necessarily all in sync when it comes to reading and writing memory. Your C/C++ code simply references variables. The compiled code will turn that into appropriate machine language which is close to the source code but can vary in the case of optimization, volatile keyword, etc. Finally the hardware will optimize the 3 main levels of storage: CPU cache (fastest), RAM, and hard disk (yes, your program variables can actually be stored on the hard disk, in the case of swapping).
Whether the CPU waits or not depends partially on what's going on at the hardware layer combined with the machine code (again for example consider data specified as volatile).

Layout of shared memory and cache coherency

I have a number of processes, about 16 (but design limit is a few hundred) that communicate via shared memory. Each process has a reserved area in the shared memory where it places requests. When a request is ready for others, it sets a per process bit "RequestReady". Other processes are reading the RequestReady bits inside short run spin loops and take action if bits are set. This is specific to our needs and kernel semaphores are used frequently too, but the spin loops are faster for our highly specific need (tested)
I am especially interested in effects for cross socket cache coherency on x86 platforms rather than shared L2/L3 cache designs. This is not a question about affinity, I am aware of using that.
Currently, I have the "RequestReady" bit spaced out so that each process has its bit on a separate cache line. Logically something like
struct {
    unsigned long RequestReady;
    char DataArea[5000];
} ProcessSlots[MAX_SLOTS];
This means that I essentially have up to MAX_SLOTS of cachelines, being invalidated. The advantage is that only one core will be writing to the cacheline, and other sockets will simply need to revalidate. This means that writer will not (should not?) write stall, but readers will at some stage when they are scanning for work. A disadvantage is that I have used a number of cachelines that might be better for something else.
An alternative layout would be
unsigned long RequestReady[MAX_SLOTS];
char DataArea[5000*MAX_SLOTS];
This means that all the RequestReady flags are together in a cacheline, so I only have one cache invalidate message, but I am worried about creating a hotspot like this, one that will be essentially shared across every socket/core. Will this degrade so that every read/write to this cacheline will be going to main memory?
The RequestReady bits are toggled often (ca. 10,000/sec) and scanned frequently, (ca 100K to 10M/sec). I know I will be going to main memory access speeds often, but want a layout that minimises this for such a hot area.
What is a good approach for laying out these RequestReady bits? Is there an alternative layout I haven't considered? Does my first approach have any advantage for the writer over the second approach?
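One way to make the "one flag per cache line" layout explicit is to let the compiler enforce it. The sketch below assumes C11, a 64-byte line and a MAX_SLOTS of 16; the struct tag, the CACHE_LINE macro and find_ready_slot() are hypothetical names, and the atomic type assumes unsigned long atomics are lock-free on the target so they work across processes in shared memory.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64      /* assumed line size; typical for x86-64 */
#define MAX_SLOTS  16      /* hypothetical value for illustration */

struct process_slot {
    /* The flag gets a whole cache line to itself ... */
    alignas(CACHE_LINE) atomic_ulong RequestReady;
    /* ... and the request data starts on the next line. */
    alignas(CACHE_LINE) char DataArea[5000];
};

/* Compile-time checks that the layout matches the comments above. */
static_assert(offsetof(struct process_slot, DataArea) == CACHE_LINE,
              "DataArea must not share a line with RequestReady");
static_assert(sizeof(struct process_slot) % CACHE_LINE == 0,
              "consecutive slots must start on a line boundary");

/* Scan all slots; returns the index of the first ready slot, or -1. */
static int find_ready_slot(struct process_slot *slots, int n)
{
    for (int i = 0; i < n; i++) {
        if (atomic_load_explicit(&slots[i].RequestReady, memory_order_acquire))
            return i;
    }
    return -1;
}
```

With these assumptions each slot occupies 5120 bytes, so the padding costs 56 bytes per flag plus the rounding of DataArea up to a line multiple.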

Does a cache write take longer with more caches to invalidate?

Can you please help me find out whether a cache write takes longer to finish when more cores/caches hold a copy of that line?
I also want to measure/quantify how much longer it actually takes.
I couldn't find anything useful on Google, and I have trouble measuring it myself and interpreting what I measure because of the many things that can happen on a modern processor (reordering, prefetching, buffering and god knows what).
Details:
My basic process of measuring it is roughly as follows:
write something to the cacheline on processor 0
read it on processors 1 to n.
rdtsc
write it on processor 0
rdtsc
I am not even sure which instructions to actually use for the read/write on processor 0 in order to make sure the write/invalidate is finished before the final time measurement.
At the moment I fiddle with an atomic read-modify-write (__sync_fetch_and_add()), but it seems that the number of threads is itself important for the length of this operation (not the number of threads to invalidate), which is probably not what I want to measure.
I also tried a read, then write, then memory barrier (__sync_synchronize()). This looks more like what I expect to see,
but here I am also not sure if the write is finished when the final rdtsc takes place.
As you can guess my knowledge of CPU internals is somewhat limited.
Any help is very much appreciated!
ps:
* I use linux, gcc and pthreads for the measurements.
* I want to know this for modeling a parallel algorithm of mine.
Edit:
In a week or so (going on vacation tomorrow) I'll do some more research and post my code and notes and link it here (In case someone is interested), because the time I can spend on this is limited.
I started writing a very long answer, describing exactly how this works, then realized, I probably don't know enough about the exact details. So I'll do a shorter answer....
So, when you write something on one processor, if it's not already in that processor's cache, it will have to be fetched in, and after the processor has read the data, it will perform the actual write. In doing so, it will send a cache-invalidate message to ALL other processors in the system. These will then throw away their copies. If another processor has "dirty" content, it will itself write out the data and ask for an invalidation, in which case the first processor will have to RELOAD the data before finishing its write (otherwise, some other element in the same cacheline may get destroyed).
Reading it back into the cache will be required on every other processor that is interested in that cache-line.
The __sync_fetch_and_add() will use a "lock" prefix [on x86, other processors may vary, but the general idea on processors that support "per instruction" locks is roughly the same]. This will issue an "I want this cacheline EXCLUSIVELY, everyone else please give it up and invalidate it". Just like the first case, the processor may well have to re-read anything that another processor may have made dirty.
A memory barrier will not ensure that data is updated "safely". It will just make sure that "whatever happened (to memory) before now is visible to all processors by the time this instruction finishes".
The best way to optimize the use of processors is to share as little as possible, and in particular, avoid "false sharing". In a benchmark many years ago, there was a structure like this [simplified]:
struct stuff {
    int x[2];
    ... other data ... total data a few cachelines.
} data;
void thread1()
{
    for( ... big number ...)
        data.x[0]++;
}
void thread2()
{
    for( ... big number ...)
        data.x[1]++;
}
int main()
{
    start = timenow();
    create(thread1);
    create(thread2);
    ... wait for both threads to finish ...
    end = timenow() - start;
}
Since EVERY time thread1 wrote to x[0], thread2's processor had to get rid of its copy of x[1], and vice versa, the result was that the SMP test [vs just running thread1] ran about 15 times slower. By altering the struct like this:
struct stuff {
    int x;
    ... other data ...
} data[2];
and
void thread1()
{
    for( ... big number ...)
        data[0].x++;
}
we got 200% of the 1 thread variant [give or take a few percent]
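The difference between the two layouts can be checked without running a benchmark. The sketch below is not the original benchmark code; it assumes a 64-byte line, and the struct names and same_cache_line() helper are made up for illustration. It only tests whether two addresses fall into the same line-sized block.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size */

/* Both counters live in one line: the layout that was slow. */
struct shared_counters {
    alignas(CACHE_LINE) int x[2];
};

/* One counter per line: the layout that scaled. */
struct padded_counter {
    alignas(CACHE_LINE) int x;
};

/* True when two addresses fall within the same CACHE_LINE-sized block. */
static int same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

In the original answer the separation came from the "other data" filling a few cache lines; the alignas here achieves the same effect for a struct that holds nothing but the counter.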
Right, so the processor has queues of buffers where write operations are stored while the processor is writing to memory. A memory barrier instruction (mfence, sfence or lfence) is there to ensure that any outstanding operations of the relevant type (reads and writes, writes only, or reads only, respectively) have completely finished before the processor proceeds to the next instruction. Normally, the processor would just continue on its jolly way through any following instructions, and eventually the memory operation is fulfilled one way or another. Since modern processors have a lot of parallel operations and buffers all over the place, it can take quite some time before something ACTUALLY trickles through to where it will eventually end up. So a barrier is for when it's CRITICAL to make sure that something has ACTUALLY been done before proceeding. For example, if we have written a bunch of instructions to the video memory and now want to kick off the run of those instructions, we need to make sure that the 'instruction' writing has actually finished and that some other part of the processor isn't still working on it, so we use an sfence to make sure that the write has really happened. That may not be a very realistic example, but I think you get the idea.
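In portable C, the "make sure the earlier write is done before the flag is seen" intent is usually expressed with C11 atomics rather than raw fence instructions. The sketch below is a swapped-in illustration of that idea, not the sfence example from the answer; the variable and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int payload;            /* ordinary data written before publishing */
static atomic_bool ready;      /* flag other threads poll */

/* Producer: the release store keeps the payload write ahead of the flag. */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: returns true and fills *out once the flag has been observed. */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;            /* ordered after the acquire load */
    return true;
}
```

On x86 the release store and acquire load compile to ordinary moves, so this ordering costs nothing beyond the cache-line transfer itself.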
Cache writes have to get line ownership before dirtying the cache line. Depending on the cache coherence model implemented in the processor architecture, the time taken for this step varies. The most common coherence protocols that I know are:
Snooping Coherence Protocol: all caches monitor the address lines for cached memory lines, so all memory requests have to be broadcast to all CPUs. This does not scale as the number of CPUs increases.
Directory-based Coherence Protocol: all cache lines shared among many CPUs are tracked in a directory, so invalidating/gaining ownership is a point-to-point CPU request rather than a broadcast. This is more scalable, but latency suffers because the directory is a single point of contention.
Most CPU architectures support something called a PMU (performance monitoring unit). This unit exports counters for many things like cache hits, misses, cache write latency, read latency, TLB hits, etc. Please consult the CPU manual to see if this info is available.

How to saturate memory bus

I want to test a program with various memory bus usage levels. For example, I would like to find out if my program works as expected when other processes use 50% of the memory bus.
How would I simulate this kind of disturbance?
My attempt was to run a process with multiple threads, each thread doing random reads from a big block of memory. This didn't appear to have a big impact on my program. My program has a lot of memory operations, so I would expect that a significant disturbance will be noticeable.
I want to saturate the bus but without using too many CPU cycles, so that any performance degradation will be caused only by bus contention.
Notes:
I'm using a Xeon E5645 processor, DDR3 memory
The mental model of "processes use 50% of the memory bus" is not a great one. A thread that has acquired a core and accesses memory that's not in the caches uses the memory bus.
Getting a thread to saturate the bus is simple, just use memcpy(). Copy several times the amount that fits in the last-level cache, and warm it up by running it multiple times so there are no page faults to slow the code down.
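A minimal sketch of that memcpy() approach, under the assumption that the caller picks a buffer size several times larger than the last-level cache (the function name and parameters are made up for illustration):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy `bytes` from one heap buffer to another `passes` times.
 * Returns the total number of bytes copied, or 0 if nothing was copied.
 * `bytes` should be several times the last-level cache size to reach DRAM.
 */
static size_t hammer_memory(size_t bytes, unsigned passes)
{
    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    size_t total = 0;

    if (src && dst && bytes > 0) {
        memset(src, 1, bytes);           /* touch pages so they are mapped */
        memset(dst, 0, bytes);
        for (unsigned i = 0; i < passes; i++) {
            memcpy(dst, src, bytes);
            /* Read one byte back so the result depends on the copy. */
            total += bytes * (size_t)(dst[bytes - 1] == 1);
        }
    }
    free(src);
    free(dst);
    return total;
}
```

Running one such loop per core, for example hammer_memory((size_t)256 << 20, 1000) in each of several threads, is one way to generate background memory traffic; how close that gets to saturation depends on the machine.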
My first instinct would be to set up a bunch of DMA operations to bounce data around without using the CPU too much. This all depends on what operating system you're running and what hardware. Is this an embedded system? I'd be glad to give more detail in the comments.
I'd use SSE movntps instructions (non-temporal stores) to stream data, to avoid cache conflicts for the other thread on the same core. Maybe unroll that loop 16 times to minimize the number of instructions per memory transfer. While the DMA idea sounds good, the linked manual is old and written for 32-bit Linux, and your processor model makes me think you probably have a 64-bit OS, which makes me wonder how much of it is still correct. And a bug in your test code may, in the worst case, trash your hard drive.

Multiple threads and CPU cache

I am implementing an image filtering operation in C using multiple threads and making it as optimized as possible. I have one question though: if a memory location is accessed by thread-0, and concurrently the same location is accessed by thread-1, will thread-1 get it from the cache? This question stems from the possibility that these two threads could be running on two different cores of the CPU. So another way of putting this is: do all the cores share the same common cache memory?
Suppose I have a memory layout like the following
int output[100];
Assume there are 2 CPU cores and hence I spawn two threads to work concurrently. One scheme could be to divide the memory into two chunks, 0-49 and 50-99, and let each thread work on one chunk. Another way could be to let thread-0 work on even indices, like 0 2 4 and so on, while the other thread works on odd indices like 1 3 5 .... This latter technique is easier to implement (especially for 3D data), but I am not sure if I could use the cache efficiently this way.
The answer to this question strongly depends upon the architecture and the cache level, along with where the threads are actually running.
For example, Intel multi-core CPUs have L1 caches that are per-core. On older designs such as Core 2, the L2 cache is shared among cores in the same CPU package; on newer designs each core also has its own L2, and an L3 cache is shared within the package. Different CPU packages always have their own caches.
Even in the case when your threads are running on two cores within the one package though, if both threads access data within the same cacheline you will have that cacheline bouncing between the two L1 caches. This is very inefficient, and you should design your algorithm to avoid this situation.
A few comments have asked about how to go about avoiding this problem.
At heart, it's really not particularly complicated - you just want to avoid two threads from simultaneously trying to access data that is located on the same cache line, where at least one thread is writing to the data. (As long as all the threads are only reading the data, there's no problem - on most architectures, read-only data can be present in multiple caches).
To do this, you need to know the cache line size - this varies by architecture, but currently most x86 and x86-64 family chips use a 64 byte cache line (consult your architecture manual for other architectures). You will also need to know the size of your data structures.
If you ask your compiler to align the shared data structure of interest to a 64 byte boundary (for example, your array output), then you know that it will start at the start of a cache line, and you can also calculate where the subsequent cache line boundaries are. If your int is 4 bytes, then each cacheline will contain exactly 16 int values. As long as the array starts on a cacheline boundary, then output[0] through output[15] will be on one cache line, and output[16] through output[31] on the next. In this case, you would design your algorithm such that each thread works on a block of adjacent int values that is a multiple of 16.
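The block-size arithmetic above can be sketched as a small helper. This assumes a 64-byte line, a 4-byte int and an array that starts on a line boundary; chunk_bounds() and the macros are hypothetical names, not part of any library.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64                               /* assumed line size */
#define INTS_PER_LINE (CACHE_LINE / sizeof(int))    /* 16 when int is 4 bytes */

/*
 * Compute the half-open range [*begin, *end) of indices for thread `tid`
 * out of `nthreads`, with every interior boundary a multiple of INTS_PER_LINE.
 * Assumes the array itself starts on a cache-line boundary.
 */
static void chunk_bounds(size_t n, size_t nthreads, size_t tid,
                         size_t *begin, size_t *end)
{
    size_t per_thread = (n + nthreads - 1) / nthreads;          /* ceil */
    size_t chunk = (per_thread + INTS_PER_LINE - 1)
                   / INTS_PER_LINE * INTS_PER_LINE;             /* round up */
    size_t b = tid * chunk;
    size_t e = b + chunk;

    *begin = b < n ? b : n;
    *end   = e < n ? e : n;
}
```

For the 100-element output array and two threads, this gives indices 0 to 63 to the first thread and 64 to 99 to the second, so no cache line is written by both.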
If you are storing complicated struct types rather than plain int, the pahole utility will be of use. It will analyse the struct types in your compiled binary, and show you the layout (including padding) and total size. You can then adjust your structs using this output - for example, you may want to manually add some padding so that your struct is a multiple of the cache line size.
On POSIX systems, the posix_memalign() function is useful for allocating a block of memory with a specified alignment.
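A minimal sketch of such an allocation, assuming a 64-byte line; the wrapper name alloc_line_aligned_ints() is made up for illustration, while posix_memalign() itself is the standard POSIX call.

```c
#define _POSIX_C_SOURCE 200112L   /* expose posix_memalign() in <stdlib.h> */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed line size */

/*
 * Allocate `count` ints starting on a cache-line boundary.
 * Returns NULL on failure; release the block with free().
 */
static int *alloc_line_aligned_ints(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, count * sizeof(int)) != 0)
        return NULL;
    return p;
}
```

On Windows, where posix_memalign() is not available, _aligned_malloc() plays a similar role but its memory must be released with _aligned_free().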
In general it is a bad idea to share overlapping memory regions, as when one thread processes 0,2,4... and the other processes 1,3,5... Although some architectures may handle this well, most will not, and you probably cannot specify which machines your code will run on. Also, the OS is free to assign your code to any core it likes (a single one, two on the same physical processor, or two cores on separate processors). Also, each core usually has a separate first-level cache, even if it's on the same processor.
In most situations 0,2,4.../1,3,5... will degrade performance severely, possibly to the point of being slower than a single CPU.
Herb Sutter's "Eliminate False Sharing" demonstrates this very well.
Using the scheme [0...n/2-1] and [n/2...n-1] will scale much better on most systems. It may even lead to superlinear performance, as the combined cache size of all CPUs can possibly be used. The number of threads used should always be configurable and should default to the number of processor cores found.
I might be mistaken, but whether the core's cache is shared or not depends on the implementation of the CPU. You'd have to look up the technical sheets on the manufacturer's page to check whether each core in your CPU has its own cache or whether the cache is shared.
I was working on image manipulation as well for a security company, and sometimes we got corrupted images after running batch operations on threads. After long investigations we came to the conclusion that the cache was shared between CPU cores and that in rare cases the data was being overwritten or replaced with incorrect data.
Whether this is something to take into account or is rather a rare event I cannot answer.
Intel documentation
Intel publishes per-generation datasheets which may contain this kind of information.
For example, for the i5-3210M processor in my older computer, the 3rd generation datasheet, Volume 1, section 3.3 "Intel Hyper-Threading Technology (Intel HT Technology)" says:
The processor supports Intel Hyper-Threading Technology (Intel HT Technology) that allows an execution core to function as two logical processors. While some execution resources such as caches, execution units, and buses are shared, each logical processor has its own architectural state with its own set of general-purpose registers and control registers.
which confirms that caches are shared between the two hyperthreads of a given core for that generation of CPUs.
See also:
similar question for cache sharing across cores: How are cache memories shared in multicore Intel CPUs?
further analysis of threads vs cores: https://superuser.com/questions/133082/what-is-the-difference-between-hyper-threading-and-multiple-cores/995858#995858
the architecture spec itself also has a section about the sharing of certain resources that must be valid across all implementations, although it does not mention caches: What does multicore assembly language look like?

Resources