Multiple threads and CPU cache - c

I am implementing an image filtering operation in C using multiple threads and making it as optimized as possible. I have one question though: if a memory location is accessed by thread-0, and the same location is concurrently accessed by thread-1, will thread-1 get it from the cache? This question stems from the possibility that these two threads could be running on two different cores of the CPU. So another way of putting this is: do all the cores share the same common cache memory?
Suppose I have a memory layout like the following:
int output[100];
Assume there are 2 CPU cores, and hence I spawn two threads to work concurrently. One scheme could be to divide the memory into two chunks, 0-49 and 50-99, and let each thread work on one chunk. Another way could be to let thread-0 work on even indices, like 0, 2, 4 and so on, while the other thread works on odd indices like 1, 3, 5, .... This latter technique is easier to implement (especially for 3D data) but I am not sure if I could use the cache efficiently this way.

The answer to this question strongly depends upon the architecture and the cache level, along with where the threads are actually running.
For example, recent Intel multi-core CPUs have L1 caches that are per-core, and an L2 cache that is shared among cores in the same CPU package; different CPU packages, however, have their own L2 caches.
Even when your threads are running on two cores within the same package, though, if both threads access data within the same cache line you will have that cache line bouncing between the two L1 caches. This is very inefficient, and you should design your algorithm to avoid this situation.
A few comments have asked about how to go about avoiding this problem.
At heart, it's really not particularly complicated - you just want to avoid having two threads simultaneously access data located on the same cache line, where at least one thread is writing the data. (As long as all the threads are only reading the data, there's no problem - on most architectures, read-only data can be present in multiple caches.)
To do this, you need to know the cache line size - this varies by architecture, but currently most x86 and x86-64 family chips use a 64 byte cache line (consult your architecture manual for other architectures). You will also need to know the size of your data structures.
If you ask your compiler to align the shared data structure of interest to a 64-byte boundary (for example, your array output), then you know that it will start at the start of a cache line, and you can also calculate where the subsequent cache line boundaries are. If your int is 4 bytes, then each cache line will contain exactly 16 int values. As long as the array starts on a cache line boundary, output[0] through output[15] will be on one cache line, and output[16] through output[31] on the next. In this case, you would design your algorithm such that each thread works on a block of adjacent int values that is a multiple of 16.
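For concreteness, here is a minimal C11 sketch of that block assignment, assuming a 64-byte line and a 4-byte int (check both on your target); the array name matches the question, everything else is illustrative:
#include <stdio.h>

#define CACHE_LINE    64                            /* assumed cache line size in bytes */
#define N             100
#define NUM_THREADS   2
#define INTS_PER_LINE (CACHE_LINE / sizeof(int))    /* 16 when int is 4 bytes */

/* Align the array so that output[0] starts on a cache line boundary. */
static _Alignas(CACHE_LINE) int output[N];

/* Hand thread t a contiguous block whose boundaries fall on whole cache
   lines, so no two threads ever write into the same line. */
static void thread_range(int t, size_t *begin, size_t *end)
{
    size_t lines      = (N + INTS_PER_LINE - 1) / INTS_PER_LINE;  /* lines covering the array     */
    size_t per_thread = (lines + NUM_THREADS - 1) / NUM_THREADS;  /* lines per thread, rounded up */

    *begin = (size_t)t * per_thread * INTS_PER_LINE;
    *end   = (size_t)(t + 1) * per_thread * INTS_PER_LINE;
    if (*begin > N) *begin = N;
    if (*end   > N) *end   = N;
}

int main(void)
{
    for (int t = 0; t < NUM_THREADS; t++) {
        size_t b, e;
        thread_range(t, &b, &e);
        for (size_t i = b; i < e; i++)
            output[i] = t;                          /* pretend this is the filter work */
        printf("thread %d -> output[%zu .. %zu)\n", t, b, e);
    }
    return 0;
}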
If you are storing complicated struct types rather than plain int, the pahole utility will be of use. It will analyse the struct types in your compiled binary, and show you the layout (including padding) and total size. You can then adjust your structs using this output - for example, you may want to manually add some padding so that your struct is a multiple of the cache line size.
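As an illustration of that kind of padding, here is a hypothetical per-thread struct padded out to an assumed 64-byte line, the sort of layout pahole lets you verify:
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size */

/* Hypothetical per-thread accumulator. Without the trailing padding, two
   adjacent elements of an array of these would share a cache line and the
   threads updating them would false-share. */
struct per_thread_stats {
    uint64_t pixels_done;
    uint64_t checksum;
    char     pad[CACHE_LINE - 2 * sizeof(uint64_t)];
};

_Static_assert(sizeof(struct per_thread_stats) % CACHE_LINE == 0,
               "per_thread_stats should be a multiple of the cache line size");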
On POSIX systems, the posix_memalign() function is useful for allocating a block of memory with a specified alignment.
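A minimal sketch of posix_memalign for the question's buffer, assuming a 64-byte cache line; note that memory obtained this way is released with ordinary free():
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *p = NULL;
    /* Ask for 100 ints starting on an (assumed) 64-byte cache line boundary. */
    int err = posix_memalign(&p, 64, 100 * sizeof(int));
    if (err != 0) {
        fprintf(stderr, "posix_memalign failed: %d\n", err);
        return 1;
    }
    int *output = p;
    output[0] = 42;                           /* use the buffer as usual */
    printf("buffer aligned at %p\n", (void *)output);
    free(p);
    return 0;
}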

In general it is a bad idea to interleave memory regions between threads, as when one thread processes indices 0, 2, 4, ... and the other processes 1, 3, 5, .... Although some architectures may tolerate this, most will not, and you probably cannot specify which machines your code will run on. The OS is also free to assign your threads to any cores it likes (a single one, two on the same physical processor, or two cores on separate processors). And each core usually has its own first-level cache, even when it sits on the same processor as another core.
In most situations the 0,2,4.../1,3,5... split will slow performance down dramatically, possibly to the point of being slower than a single CPU.
Herb Sutter's "Eliminate False Sharing" demonstrates this very well.
Using the scheme [...n/2-1] and [n/2...n] will scale much better on most systems. It may even lead to super-linear performance, since the combined cache of all CPUs can potentially be used. The number of threads should always be configurable and should default to the number of processor cores found.
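A hedged sketch of that contiguous-block scheme with pthreads; the array size, the thread count, and the trivial per-element "filter" are placeholders for the real workload (compile and link with -pthread):
#include <pthread.h>
#include <stddef.h>

#define N        1000000
#define NTHREADS 4   /* should be configurable; default to the core count,
                        e.g. sysconf(_SC_NPROCESSORS_ONLN) on POSIX */

static int input[N], output[N];

struct chunk { size_t begin, end; };

/* Stand-in for the real filter kernel. */
static void *worker(void *arg)
{
    struct chunk *c = arg;
    for (size_t i = c->begin; i < c->end; i++)
        output[i] = input[i] / 2;
    return NULL;
}

int main(void)
{
    pthread_t    tid[NTHREADS];
    struct chunk chunks[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        chunks[t].begin = (size_t)t * N / NTHREADS;        /* contiguous blocks:         */
        chunks[t].end   = (size_t)(t + 1) * N / NTHREADS;  /* [0..N/T), [N/T..2N/T), ... */
        pthread_create(&tid[t], NULL, worker, &chunks[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}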

I might be mistaken, but whether the cores' cache is shared or not depends on the implementation of the CPU. You'd have to look up the technical sheets on the manufacturer's page to check whether each core in your CPU has its own cache or whether the cache is shared.
I was working on image manipulation as well, for a security company, and sometimes we got corrupted images after running batch operations on threads. After long investigations we came to the conclusion that the cache was shared between CPU cores and that in rare cases the data was being overwritten or replaced with incorrect data.
Whether this is something to take into account or rather a rare event, I cannot answer.

Intel documentation
Intel publishes per-generation datasheets which may contain this kind of information.
For example, for the processor i5-3210M which I had on my older computer, I looked up the 3rd generation Datasheet Volume 1. Section 3.3 "Intel Hyper-Threading Technology (Intel HT Technology)" says:
The processor supports Intel Hyper-Threading Technology (Intel HT Technology) that allows an execution core to function as two logical processors. While some execution resources such as caches, execution units, and buses are shared, each logical processor has its own architectural state with its own set of general-purpose registers and control registers.
which confirms that caches are shared between the two logical processors (hyperthreads) of a core for that generation of CPUs.
See also:
similar question for cache sharing across cores: How are cache memories shared in multicore Intel CPUs?
further analysis of threads vs cores: https://superuser.com/questions/133082/what-is-the-difference-between-hyper-threading-and-multiple-cores/995858#995858
the architecture spec itself also has a section about the sharing of certain resources that must be valid across all implementations, although it does not mention caches: What does multicore assembly language look like?

Related

How many cycles does false sharing cost?

My target platforms are Windows and Linux on x86-64 (Coffee Lake or higher, Zen 2 or higher) and Apple's M2 Macs. I'm wondering: is there a penalty for multiple threads accessing the same data at the same time, and how much of a penalty is there if one thread changes one variable once? Do other cores stall immediately if that cache line is loaded? How many cycles does it take to update? From my understanding, false sharing happens when you change a byte on a line; is this strictly 128 bytes and less? Do I have to worry about the TLB? Here's my situation.
I have a few objects which are a source of truth for some data. I can't remember if they're 64 bytes or over. I have a 32-bit status flag in the first 64 bytes. Many threads may access this flag and the bytes next to it, sometimes 100 times, and sometimes one other thread accesses it once. I'm not sure how many nanoseconds there will be between a write and a read, but only one write will happen.
The C++ thread sanitizer complained that I'm changing the flag in one thread and reading it in another, with neither using atomic operations. The other threads don't need to see the update, since I simply set a bit they don't care about. I was thinking I could use an atomic load/store with relaxed memory ordering.
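If it helps, here is what that would look like in C11's stdatomic.h (the C++ equivalent is std::atomic with std::memory_order_relaxed); the struct and function names are purely hypothetical stand-ins for the objects described above:
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout standing in for the "source of truth" object: a 32-bit
   status word that one thread writes once and other threads may read.
   Relaxed atomics remove the data race the sanitizer reports; they only make
   the flag itself atomic, they do not order any neighbouring data with it. */
struct source_of_truth {
    _Atomic uint32_t status;
    /* ... the fields the readers actually care about ... */
};

void set_ready_bit(struct source_of_truth *obj, uint32_t bit)
{
    atomic_fetch_or_explicit(&obj->status, bit, memory_order_relaxed);
}

uint32_t read_status(struct source_of_truth *obj)
{
    return atomic_load_explicit(&obj->status, memory_order_relaxed);
}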
Another option is having a pointer and going through that to update the data. I was thinking: if 1K objects are each written once and every other thread happens to read them within 10 ns, would that be a problem? How many cycles would that stall? This assumes there are no penalties when many cores are reading the same data. I have a bit of a memory bandwidth problem (I'm writing a lot of data), so I'm concerned about using more data than I need to.

Is multi-thread memory access faster than single threaded memory access?

Is multi-thread memory access faster than single threaded memory access?
Assume we are using C. A simple example is as follows: if I have a gigantic array A and I want to copy A to an array B of the same size, is using multithreading to do the memory copy faster than doing it with a single thread? How many threads are suitable for this kind of memory operation?
EDIT:
Let me make the question narrower. First of all, we do not consider the GPU case. Memory access optimization is very important and effective when we do GPU programming; in my experience, we always need to be careful about memory operations there. On the other hand, that is not always the case when we work on CPUs. In addition, let's not consider SIMD instructions, such as AVX and SSE. Those will also show memory performance issues when the program has many memory access operations relative to computational operations. Assume that we work on an x86 architecture with 1-2 CPUs. Each CPU has multiple cores and a quad-channel memory interface. The main memory is DDR4, as is common today.
My array is an array of double-precision floating point numbers with a size similar to the L3 cache of a CPU, roughly 50 MB. Now, I have two cases: 1) copy this array to another array of the same size, either element-wise or with memcpy; 2) combine a lot of small arrays into this gigantic array. Both are real-time operations, meaning they need to be done as fast as possible. Does multi-threading give a speedup or a slowdown? What factors affect the performance of memory operations in this case?
Someone said it will mostly depend on DMA performance. I think that applies when we do memcpy. What if we do an element-wise copy: does the data pass through the CPU cache first?
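For reference, a sketch of what the multi-threaded variant of case 1 might look like (pthreads, contiguous slices, one memcpy per thread); whether it actually helps is exactly what the answers below discuss:
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* One thread's share of the copy. */
struct copy_job {
    const double *src;
    double       *dst;
    size_t        count;      /* number of doubles in this slice */
};

static void *copy_worker(void *arg)
{
    struct copy_job *job = arg;
    memcpy(job->dst, job->src, job->count * sizeof(double));
    return NULL;
}

/* Split one big copy into nthreads contiguous slices. Whether this beats a
   single memcpy depends on whether one core can already saturate the
   memory interface. */
void parallel_copy(double *dst, const double *src, size_t n, int nthreads)
{
    pthread_t       tid[nthreads];
    struct copy_job jobs[nthreads];

    for (int t = 0; t < nthreads; t++) {
        size_t begin = (size_t)t * n / (size_t)nthreads;
        size_t end   = (size_t)(t + 1) * n / (size_t)nthreads;
        jobs[t] = (struct copy_job){ src + begin, dst + begin, end - begin };
        pthread_create(&tid[t], NULL, copy_worker, &jobs[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}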
It depends on many factors. One factor is the hardware you use. On modern PC hardware, multithreading will most likely not lead to performance improvement, because CPU time is not the limiting factor of copy operations. The limiting factor is the memory interface. The CPU will most likely use the DMA controller to do the copying, so the CPU will not be too busy when copying data.
Over the years, CPU performance increased greatly - exponentially, in fact - while RAM performance couldn't catch up. That made the cache more important, especially after the Celeron era.
So you can have an increase or a decrease in performance, depending heavily on:
the number of memory fetch and memory store units per core
the memory controller modules
the pipeline depths of the memory modules and the number of memory banks
the memory access patterns of each thread (software)
the alignment of data chunks and instruction blobs
sharing of common hardware resources and their datapaths
the operating system doing too much preemption of the threads
Simply optimize the code for the cache; then the quality of the CPU will decide the performance.
Example:
The FX-8150 has weaker cores than an i7-4700:
FX cores can scale with extra threads, but the i7 tops out with just a single thread (I mean for memory-heavy code)
the FX has more L3 cache, but it is slower
the FX can work with higher-frequency RAM, but the i7 has better inter-core data bandwidth (in case one thread sends data to another thread)
the FX pipeline is too long, too long to recover after a branch
It looks like AMD shares performance among threads in a more fine-grained way, while Intel gives its power to a single thread (council assembly vs monarchy). Maybe that's why AMD is better at GPUs and HBM.
If I had to stop speculating, I would care only about the cache, as it cannot be altered in the CPU, while RAM can have many combinations on a motherboard.
Assuming an AMD64/Intel 64 architecture:
One core is not capable of saturating the memory bandwidth, but that does not mean multi-threaded is automatically faster. For that, the threads must be on different cores; launching as many threads as there are physical cores should give a speedup, since the OS will most likely assign the threads to different cores. However, your threading library should have a function for binding a thread to a specific core, and using it is the best option for speed - a sketch follows below. Another thing to think about is NUMA, if you have a multi-socket system. For maximum speed you should also consider using AVX instructions.
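Here is a sketch of that pinning idea using the Linux/glibc-specific pthread_setaffinity_np; Windows and macOS expose different affinity APIs, so treat this as an illustration rather than portable code:
#define _GNU_SOURCE            /* pthread_setaffinity_np, CPU_SET (Linux/glibc) */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    /* ... memory-heavy work for this thread ... */
    return NULL;
}

int main(void)
{
    enum { NTHREADS = 4 };     /* ideally the number of physical cores */
    pthread_t tid[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        pthread_create(&tid[t], NULL, worker, NULL);

        /* Pin thread t to core t so the scheduler does not migrate it. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(t, &set);
        if (pthread_setaffinity_np(tid[t], sizeof(set), &set) != 0)
            fprintf(stderr, "could not pin thread %d\n", t);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}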

Is it possible to assign parts of the shared L2 caches to different cores

Let's say 4 threads are running on 4 separate cores of a multicore x86 processor, and they do not share any data. Is it possible to programmatically make the 4 cores use separate, predefined portions of the shared L2 cache?
Let's use two terms, exclusive and shared caches, instead of L1, L2, L3, L4 caches, since different CPU families start to share cache at different levels. In these terms the original question is: is it possible to split a shared cache into parts, each of which is used exclusively by one of the CPUs/cores? There is no clear answer. Furthermore, there are two answers that are opposite to each other.
1) First and general answer: NO.
The cache is, by design, managed in hardware. There are only a few control levers for the cache accessible in software, such as enabling/disabling the cache for the whole memory or for a defined memory region, or applying a specified policy for cache flushing (write-through/write-back). The answer is NO basically because the cache was designed to be managed in hardware, so there is no useful interface that would allow managing it gracefully in software.
2) Second answer: Yes.
In fact, a cache is designed in such a way that each line of the cache can hold data only from a specific set of memory lines. Because of this, if the memory manager guarantees that a single CPU/core exclusively owns and uses all the memory lines that map to a given cache line, then that cache line will in effect be used exclusively by that CPU. This is a very tricky workaround, usually called page colouring. It has very limited benefits and serious drawbacks: the memory layout becomes very fragmented, cache usage is unbalanced, memory management gets complicated, and it is very hardware-dependent (details can be found in the paper provided by "MetallicPriest").
To sum up: it is possible in theory and almost impossible in practice.
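To make the set mapping behind the second answer concrete, here is a toy calculation of which cache set an address falls into, under an assumed geometry; the real line size and set count have to be read from the hardware:
#include <stdint.h>
#include <stdio.h>

/* Hypothetical shared-cache geometry: 2 MiB, 16-way, 64-byte lines gives
   2 MiB / (16 * 64) = 2048 sets. Real values come from CPUID or, on Linux,
   /sys/devices/system/cpu/cpu0/cache. */
enum { LINE_SIZE = 64, NUM_SETS = 2048 };

static unsigned cache_set_of(uintptr_t phys_addr)
{
    return (unsigned)((phys_addr / LINE_SIZE) % NUM_SETS);
}

int main(void)
{
    /* Addresses exactly NUM_SETS * LINE_SIZE bytes apart land in the same set.
       A page-colouring allocator exploits this mapping to hand each core
       pages that fall into disjoint groups of sets. */
    uintptr_t a = 0x100000;
    uintptr_t b = a + (uintptr_t)NUM_SETS * LINE_SIZE;
    printf("set(a) = %u, set(b) = %u\n", cache_set_of(a), cache_set_of(b));
    return 0;
}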

pthread offer no performance increase when using virtual cores

I am playing around with pthreads for the first time and have noticed something strange when running on my machine.
I have an Intel i5 with 2 physical cores and 4 virtual cores.
When running my program with 2 threads, I get roughly double the performance, yet when running with 4 threads, I get the same performance as two threads. Why is this the case?
Results with 2 threads:
real 0m9.335s
user 0m18.233s
sys 0m0.132s
Results with 4 threads:
real 0m9.427s
user 0m34.130s
sys 0m0.180s
Edit: The code is fully parallelizable and the threads are running independently without any shared resources.
Because you only really have 2 cores. Hyper-threading will not magically create 2 more cores for you. Hyper-threading makes it possible to run 4 threads on the CPU but not simultaneously. It will still allocate the threads on the two physical cores and switch the threads back and forth in the execution pipeline.
The performance increase you may expect is at BEST 30%.
Keep in mind that hyperthreading is basically a way of reusing spare execution units on the CPU for a separate thread of execution. You're still working with the horsepower of two cores, it's just split four ways.
If your code is optimized such that it fully utilizes most of the available EUs, there's no spare resources left once it's running on both physical cores, so the hyperthreaded cores can't do any better.
This old article from when HyperThreading (HT) was first introduced provides a lot of details on how it works (though I'm sure many improvements have been made over the last 10 years). http://www.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf:
Each logical processor maintains a complete set of the architecture state. The architecture state consists of registers including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors to store the architecture state is an extremely small fraction of the total.
However, the following sentence shows where HT can bottleneck:
Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses.
If each thread's execution keeps one or more of those shared resources (such as the execution units or buses) 100% busy, then hyperthreading will not improve throughput. Since benchmarks often exercise one aspect of a system (intentionally or not), it's not surprising that one of these shared processor resources ends up being a bottleneck and prevents HT from showing a benefit.
The performance gain from using multiple threads is very difficult to determine. Hyperthreading is certainly "less than one extra core" in performance.
Besides that, you may run into memory throughput issues, or your code may now be contending over locks or similar resources because you have more threads - even if your own code is lock-less, that doesn't mean that, for example, I/O or the functions you call can run completely in parallel; there are sometimes "hidden" shared resources.
But most likely, your processor just can't go any faster.

Out of Order Execution and Memory Fences

I know that modern CPUs can execute out of order; however, they always retire the results in order, as described by Wikipedia:
"Out of Oder processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal."
Now, memory fences are said to be required on multicore platforms because, owing to out-of-order execution, a wrong value of x can be printed here.
Processor #1:
while (f == 0)
    ;
print(x); // x might not be 42 here
Processor #2:
x = 42;
// Memory fence required here
f = 1;
Now my question is: since out-of-order processors (cores, in the case of multicore processors, I assume) always retire results in order, what is the need for memory fences? Do the cores of a multicore processor see only results that other cores have retired, or do they also see results that are still in flight?
I mean, in the example I gave above, when processor 2 eventually retires its results, the result of x should come before f, right? I know that during out-of-order execution it might have modified f before x, but it must not have retired f before x, right?
Now with In-Order retiring of results and cache coherence mechanism in place, why would you ever need memory fences in x86?
This tutorial explains the issues: http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf
FWIW, where memory ordering issues happen on modern x86 processors, the reason is that while the x86 memory consistency model offers quite strong consistency, explicit barriers are needed to handle read-after-write consistency. This is due to something called the "store buffer".
That is, x86 is sequentially consistent (nice and easy to reason about) except that loads may be reordered wrt earlier stores. That is, if the processor executes the sequence
store x
load y
then on the processor bus this may be seen as
load y
store x
The reason for this behavior is the afore-mentioned store buffer, which is a small buffer for writes before they go out on the system bus. Load latency is, OTOH, a critical issue for performance, and hence loads are permitted to "jump the queue".
See Section 8.2 in http://download.intel.com/design/processor/manuals/253668.pdf
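That reordering is usually demonstrated with the classic store-buffer litmus test; here is a sketch in C11 atomics, where the relaxed stores and loads stand in for plain x86 accesses and a seq_cst fence corresponds roughly to MFENCE:
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;
static int r0, r1;

static void *t0(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);    /* store x */
    /* atomic_thread_fence(memory_order_seq_cst); */       /* a full fence here forbids r0 == r1 == 0 */
    r0 = atomic_load_explicit(&y, memory_order_relaxed);   /* load y  */
    return NULL;
}

static void *t1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);    /* store y */
    /* atomic_thread_fence(memory_order_seq_cst); */
    r1 = atomic_load_explicit(&x, memory_order_relaxed);   /* load x  */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* With the fences commented out, r0 == 0 && r1 == 0 is a permitted outcome:
       each store can sit in its core's store buffer while the following load
       "jumps the queue". Run it in a loop to have a chance of observing it. */
    printf("r0=%d r1=%d\n", r0, r1);
    return 0;
}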
The memory fence ensures that all changes to variables before the fence are visible to all other cores, so that all cores have an up to date view of the data.
If you don't use a memory fence, the cores might be working with stale data; this shows up especially in scenarios where multiple cores are working on the same data sets. With a fence you can ensure that when CPU 0 has done some action, all changes made to the data set are visible to all other cores, which can then work with up-to-date information.
Some architectures, including the ubiquitous x86/x64, provide several memory barrier instructions, including an instruction sometimes called "full fence". A full fence ensures that all load and store operations prior to the fence will have been committed prior to any loads and stores issued following the fence.
If a core were to start working with outdated data from the data set, how could it ever get the correct results? It couldn't, no matter whether the end result is presented as if everything had been done in the right order.
The key is in the store buffer, which sits between the cache and the CPU, and does this:
Store buffer invisible to remote CPUs
Store buffer allows writes to memory and/or caches to be saved to optimize interconnect accesses
That means things are written to this buffer first, and at some point the buffer is written back to the cache. So the cache can contain a view of the data that is not the most recent, and therefore another CPU, even with cache coherency, will also not see the latest data. A store buffer flush is necessary for the latest data to become visible; that, I think, is essentially what the memory fence causes to happen at the hardware level.
EDIT:
For the code you used as an example, Wikipedia says this:
A memory barrier can be inserted before processor #2's assignment to f to ensure that the new value of x is visible to other processors at or prior to the change in the value of f.
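In portable C11 that barrier is normally expressed as a release store paired with an acquire load rather than a raw fence; a minimal sketch of the same x/f example (the function names are just for illustration):
#include <stdatomic.h>

static int x;              /* the payload                */
static atomic_int f;       /* the flag that publishes it */

/* Processor #2's side: the release store plays the role of the barrier the
   quote describes, ordering the store to x before the store to f. */
void producer(void)
{
    x = 42;
    atomic_store_explicit(&f, 1, memory_order_release);
}

/* Processor #1's side: the acquire load pairs with the release store, so
   once f is seen as 1, the read of x is guaranteed to see 42. */
int consumer(void)
{
    while (atomic_load_explicit(&f, memory_order_acquire) == 0)
        ;                  /* spin until the flag is set */
    return x;
}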
Just to make explicit what is implicit in the previous answers, this is correct, but is distinct from memory accesses:
CPUs can execute out of order, However they always retire the results in-order
Retirement of an instruction is separate from performing its memory access; the memory access may complete at a different time from instruction retirement.
Each core acts as if its own memory accesses occur at retirement, but other cores may see those accesses at different times.
(On x86 and ARM, I think only stores are observably subject to this, but e.g. Alpha may load an old value from memory. x86 SSE2 has instructions with weaker guarantees than normal x86 behaviour.)
PS. From memory, the abandoned Sparc ROCK could in fact retire out of order; it spent power and transistors determining when this was harmless. It was abandoned because of power consumption and transistor count... I don't believe any general-purpose CPU has been brought to market with out-of-order retirement.
