Can memset be parallelized on 4 cores? - c

I am not sure about this. Can I split a large memset (for example 10 MB) across four cores to gain a speedup?
Is such RAM-level parallelization possible at all, and how big is the cost of firing up the extra threads - is it more than a millisecond or less?

You are asking the right question, but it is difficult to give a simple answer. There are several aspects involved:
Overhead of starting new threads (or picking them from some pool);
Contention on the memory bus.
The costs of both differ widely between platforms.
Bigger PCs have several memory buses; smaller ones have only one. On a single-bus system the parallelization makes no sense. If your system has several memory buses (channels), your array of data may be split arbitrarily between memory banks. If the whole array happens to sit in the same memory bank, the parallelization will be useless. Figuring out the layout of your array is an overhead again. In other words, before splitting the operation between cores it is necessary to figure out whether it is worth doing at all.
The simple answer is that these hard-to-predict overheads will most likely consume the benefit and make the overall result worse.
At the same time, for a really huge memory area it can make sense on some architectures.
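If you want to measure it on your own hardware anyway, a minimal sketch of the idea using POSIX threads might look like this; the thread count and the way the buffer is split are arbitrary choices for illustration, not a recommendation:

    /* Sketch: split a large memset across 4 POSIX threads.
     * Whether this beats a single memset depends entirely on the memory
     * subsystem (number of channels, bank layout) - measure it yourself. */
    #include <pthread.h>
    #include <stddef.h>
    #include <string.h>

    #define NTHREADS 4

    struct chunk {
        unsigned char *start;
        size_t         len;
    };

    static void *fill_chunk(void *arg)
    {
        struct chunk *c = arg;
        memset(c->start, 0, c->len);    /* each thread clears its own slice */
        return NULL;
    }

    void parallel_memset(unsigned char *buf, size_t size)
    {
        pthread_t    tid[NTHREADS];
        struct chunk part[NTHREADS];
        size_t       slice = size / NTHREADS;

        for (int i = 0; i < NTHREADS; i++) {
            part[i].start = buf + (size_t)i * slice;
            /* the last thread takes any remainder */
            part[i].len = (i == NTHREADS - 1) ? size - (size_t)i * slice : slice;
            pthread_create(&tid[i], NULL, fill_chunk, &part[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
    }

Compare it against a plain memset of the same buffer: on a single-channel machine the threaded version will usually be the same speed or slower once the thread start-up cost is included.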

Related

Cache pre-fetching while traversing an array: what if some memory pages have been swapped out?

The biggest advantage of an array of, for example, int is that, if you read it sequentially, it can be fully preloaded into cache: the CPU checks the memory access patterns and pre-fetches the locations about to be read, so the "next" element of a vector is always in cache.
To what extent is that statement just "theory"? Thinking about timing, for it to be true the pre-fetcher must know how long it takes to bring the next cache line into cache (which implies knowing how "slow" the RAM is), and how much time is left before that data is the next to be read by the CPU (which implies knowing how time-consuming the remaining instructions are), so that the first sequence of actions takes no longer than the second.
The specific case I have in mind: let's assume that the first 5 pages of a 10-page array are in RAM and the last 5 in swap. If the next address the pre-fetcher wants to load is the first address of the sixth page, that pre-loading time will be unpredictably long and the pre-fetcher will fail in its mission.
I know CPUs do a lot of guessing and speculation about the future of a process, such as cache pre-fetching, branch prediction and probably a lot of other techniques I'm not aware of, some of them probably cooperating with the OS (and I'm eager to know more about this, it surprises me every time).
So, do CPUs and/or OSes try to solve this kind of timing-guessing problem, for example by trying to answer the pre-fetcher's question: how far in advance do I need to issue my speculative pre-fetch for it to cause zero delay?
So, do CPUs and/or OSes try to solve this kind of timing-guessing problem
No.
This is handled solely by the CPU hardware. If you read some memory location, the CPU simply brings the whole cache line containing that location into the cache.
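For completeness, GCC and Clang do expose a manual hint, __builtin_prefetch, that lets the programmer request a cache line ahead of time. The sketch below is only an illustration: the prefetch distance is an arbitrary guess, a plain sequential walk like this is normally handled fine by the hardware prefetcher anyway, and the hint does nothing for pages that are swapped out.

    /* Sketch: manual software prefetching while summing an array. */
    #include <stddef.h>

    #define PREFETCH_DISTANCE 16   /* elements ahead; an arbitrary, tunable guess */

    long sum_with_prefetch(const int *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DISTANCE < n)
                __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0 /* read */, 3);
            sum += a[i];
        }
        return sum;
    }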

Use of malloc in Real Time application

We had an issue with one of our real-time applications. The idea was to run one of the threads every 2 ms (500 Hz). After the application ran for half an hour or so, we noticed that the thread was falling behind.
After a few discussions, people pointed to the malloc allocations in the real-time thread as the root cause.
I am wondering: is it always a good idea to avoid all dynamic memory allocation in real-time threads?
The internet has very few resources on this. If you can point to some discussion, that would be great too.
Thanks
The first step is to profile the code and make sure you understand exactly where the bottleneck is. People are often bad at guessing bottlenecks in code, and you might be surprised by the findings. You can simply instrument several parts of this routine yourself and dump min/avg/max durations at regular intervals, as sketched below. You want to see the worst case (max), and whether the average duration increases as time goes by.
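A minimal version of that instrumentation, assuming a POSIX system with clock_gettime; do_cycle() is a hypothetical stand-in for the actual 2 ms routine:

    /* Sketch: track min/avg/max duration of the real-time routine. */
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    extern void do_cycle(void);   /* hypothetical: the real 2 ms work goes here */

    static uint64_t ns_now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    }

    void instrumented_loop(unsigned iterations)
    {
        uint64_t min = UINT64_MAX, max = 0, total = 0;

        for (unsigned i = 0; i < iterations; i++) {
            uint64_t t0 = ns_now();
            do_cycle();
            uint64_t dt = ns_now() - t0;

            if (dt < min) min = dt;
            if (dt > max) max = dt;
            total += dt;
        }
        printf("min %llu ns  avg %llu ns  max %llu ns\n",
               (unsigned long long)min,
               (unsigned long long)(total / iterations),
               (unsigned long long)max);
    }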
I doubt that malloc will take any significant portion of these 2 ms on a reasonable microcontroller capable of running Linux; it's more likely you would run out of memory due to fragmentation than run into performance issues. If you have any other syscalls in your function, they will easily take an order of magnitude longer than malloc.
But if malloc is really the problem, depending on how short-lived your objects are, how much memory you can afford to waste, and how much your requirements are known in advance, there are several approaches you can take:
General purpose allocation (malloc from your standard library, or any third party implementation): best approach if you have "more than enough" RAM, many short-lived objects, and no strict latency requirements
PROS: works for any object size out of the box, familiar interface, memory is shared dynamically, no need to "plan ahead" if memory is not an issue
CONS: slight performance penalty during allocation and/or deallocation, memory fragmentation issues when doing lots of allocations/deallocations of objects of different sizes, whether a run-time allocation will fail is less deterministic and cannot be easily mitigated at runtime
Memory pool: best approach in most cases where memory is limited, low latency is required, and the object needs to live longer than a single block scope (a minimal sketch follows this list)
PROS: allocation/deallocation time is guaranteed to be O(1) in any reasonable implementation, does not suffer from fragmentation, easier to plan its size in advance, failure to allocate at run-time is (likely) easier to mitigate
CONS: works only for a single specific object size - memory is not shared with other parts of the program, and it requires planning the right pool size in advance or risking wasted memory
Stack based (automatic-duration) objects: best for smaller, short-lived objects (single block scope)
PROS: allocation and deallocation is done automatically, allows having optimum usage of RAM for the object's lifetime, there are tools which can sometimes do a static analysis of your code and estimate the stack size
CONS: objects limited to a single block scope - cannot share objects between interrupt invocations
Individual statically allocated objects: best approach for long lived objects
PROS: no allocation whatsoever - all needed objects exist throughout the application life-cycle, no problems with allocation/deallocation
CONS: wastes memory if the objects should be short-lived
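As an illustration of the memory-pool option above, here is a minimal fixed-block pool sketch; the block size and count are arbitrary example values and it is not thread-safe:

    /* Sketch: fixed-size block pool with O(1) alloc/free via a free list. */
    #include <stddef.h>

    #define BLOCK_SIZE  64
    #define BLOCK_COUNT 128

    typedef union block {
        union block   *next;                /* valid while the block is free */
        unsigned char  data[BLOCK_SIZE];    /* payload while the block is in use */
    } block_t;

    static block_t  pool_storage[BLOCK_COUNT];
    static block_t *free_list;

    void pool_init(void)
    {
        for (size_t i = 0; i < BLOCK_COUNT - 1; i++)
            pool_storage[i].next = &pool_storage[i + 1];
        pool_storage[BLOCK_COUNT - 1].next = NULL;
        free_list = &pool_storage[0];
    }

    void *pool_alloc(void)
    {
        block_t *b = free_list;
        if (b != NULL)
            free_list = b->next;
        return b;                           /* NULL means the pool is exhausted */
    }

    void pool_free(void *p)
    {
        block_t *b = p;
        b->next = free_list;
        free_list = b;
    }

Because both alloc and free are a couple of pointer operations, the worst case is as predictable as the average case, which is exactly what a real-time thread needs.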
Even if you decide to go for memory pools all over the program, make sure you add profiling/instrumentation to your code. And then leave it there forever to see how the performance changes over time.
Being a realtime software engineer in the aerospace industry, we see this question a lot. Even among our own engineers, we see software engineers attempt to use non-realtime programming techniques they learned elsewhere, or to use open-source code in their programs. Never allocate from the heap during realtime. One of our engineers created a tool that intercepts malloc and records the overhead. You can see from the numbers that you cannot predict when an allocation attempt will take a long time. Even on very high-end computers (72-core, 256 GB RAM servers) running a realtime hybrid of Linux, we record mallocs taking hundreds of milliseconds. It ends up in a system call, which is cross-ring and therefore high overhead, and you don't know when you will get hit by garbage collection, or when the allocator decides it must request another large chunk of memory for the task from the OS.

Random Memory Reads vs Random Memory Writes

In low-level languages like C, I know you should try to use the CPU cache to your benefit as much as possible, since a cache miss means your program will temporarily have to wait for the RAM when dereferencing a pointer. However, are writes to memory also affected by this? If you write to memory, it would seem that the CPU does not need to wait for a response.
I'm trying to decide whether reordering an array of items would really be worth it when I need to access items in the array in certain groups repeatedly (i.e. sorting it based on those groups). However, those groups will change frequently, so I would need to keep reordering the array if I do this.
Depending on your architecture, random memory writes can be expensive for at least two reasons.
On today's multi-core machines, almost all writes will require some kind of cache coherence protocol to be run so that the corresponding cache lines on other caches will be invalidated.
In terms of ordinary writes, they will either always cost a memory operation (if the cache is write-through) or only sometimes (if the cache is write-back, where the write reaches memory only when the line is evicted).
You can read more details about the possible behaviors of caches on Wikipedia.
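If you want to see the effect on your own machine, a rough sketch of such a comparison is below. The array size is arbitrary, and a serious benchmark would need warm-up runs and care to keep the compiler from optimizing the stores away:

    /* Rough sketch: sequential vs random writes to a large array. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64u * 1024 * 1024)   /* 64M ints (256 MB), far larger than the caches */

    static uint64_t rng_state = 88172645463325252ull;

    static size_t next_index(void)
    {
        /* xorshift64: a simple PRNG, good enough for scattering the writes */
        rng_state ^= rng_state << 13;
        rng_state ^= rng_state >> 7;
        rng_state ^= rng_state << 17;
        return (size_t)(rng_state % N);
    }

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        int *a = malloc((size_t)N * sizeof *a);
        if (!a) return 1;

        double t0 = seconds();
        for (size_t i = 0; i < N; i++)      /* sequential writes */
            a[i] = (int)i;
        double t1 = seconds();

        for (size_t i = 0; i < N; i++)      /* random writes */
            a[next_index()] = (int)i;
        double t2 = seconds();

        printf("sequential: %.3f s, random: %.3f s\n", t1 - t0, t2 - t1);
        free(a);
        return 0;
    }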
This is a very broad question, so my answer is nearly as broad.
The source code, the compiled code, and the underlying hardware are not necessarily all in sync when it comes to reading and writing memory. Your C/C++ code simply references variables. The compiler turns that into appropriate machine language that is close to the source code but can vary in the case of optimization, the volatile keyword, etc. Finally, the hardware manages the three main levels of storage: CPU cache (fastest), RAM, and hard disk (yes, your program variables can actually end up on the hard disk, in the case of swapping).
Whether the CPU waits or not depends partially on what's going on at the hardware layer combined with the machine code (again for example consider data specified as volatile).

Is multi-thread memory access faster than single threaded memory access?

Is multi-thread memory access faster than single threaded memory access?
Assume we are in the C language. A simple example is as follows: I have a gigantic array A and I want to copy A to an array B of the same size. Is using multithreading for the memory copy faster than doing it with a single thread? How many threads are suitable for this kind of memory operation?
EDIT:
Let me put the question more narrowly. First of all, we do not consider the GPU case. Memory access optimization is very important and effective when we do GPU programming; in my experience we always need to be careful about memory operations there. On the other hand, that is not always the case when we work on a CPU. In addition, let's not consider SIMD instructions such as AVX and SSE - those also show memory performance issues when the program has many memory access operations as opposed to computational operations. Assume that we work on an x86 architecture with 1-2 CPUs. Each CPU has multiple cores and a quad-channel memory interface. The main memory is DDR4, as is common today.
My array is an array of double-precision floating-point numbers with a size similar to the L3 cache of a CPU, roughly 50 MB. Now, I have two cases: 1) copy this array to another array of the same size, either by doing an element-wise copy or by using memcpy; 2) combine a lot of small arrays into this gigantic array. Both are real-time operations, meaning that they need to be done as fast as possible. Does multi-threading give a speedup or a slowdown? What is the factor that affects the performance of memory operations in this case?
Someone said it will mostly depend on DMA performance. I think that applies when we do memcpy. What if we do an element-wise copy - does the data pass through the CPU cache first?
It depends on many factors. One factor is the hardware you use. On modern PC hardware, multithreading will most likely not lead to performance improvement, because CPU time is not the limiting factor of copy operations. The limiting factor is the memory interface. The CPU will most likely use the DMA controller to do the copying, so the CPU will not be too busy when copying data.
Over the years, CPU performance has increased greatly, roughly exponentially, while RAM performance couldn't keep up. That actually made the cache more important. Especially after the Celeron era.
So you can have an increase or a decrease in performance, depending heavily on:
the number of memory fetch and memory store units per core
the number of memory controller modules
the pipeline depths of the memory modules and the number of memory banks
the memory access patterns of each thread (software)
the alignment of data chunks and instruction blobs
sharing of common hardware resources and their datapaths
the operating system doing too much preemption across all threads
Simply optimize the code for the cache, and then the quality of the CPU will decide the performance.
Example:
Example: the FX8150 has weaker cores than an i7-4700:
FX cores can scale with extra threads, but the i7 tops out with just a single thread (I mean for memory-heavy code)
FX has more L3 cache, but it is slower
FX can work with higher-frequency RAM, but the i7 has better inter-core data bandwidth (in case one thread sends data to another)
The FX pipeline is too long - too long to recover after a branch
It looks like AMD spreads performance more finely across threads, while Intel gives the power to a single thread (council assembly vs monarchy). Maybe that's why AMD is better at GPUs and HBM.
If I had to stop speculating, I would care only about the cache, as it cannot be altered in the CPU, while RAM can come in many combinations on a motherboard.
Assuming an AMD64/Intel 64 architecture:
One core is not capable of saturating the memory bandwidth, but that alone does not mean multi-threaded is faster. For that, the threads must run on different cores. Launching as many threads as there are physical cores should give a speedup, since the OS will most likely assign the threads to different cores; however, your threading library should have a function for binding a thread to a specific core, and using it is best for speed (a sketch follows below). Another thing to think about is NUMA, if you have a multi-socket system. For maximum speed you should also think about using AVX instructions.
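A sketch of that idea on Linux, using the GNU extension pthread_attr_setaffinity_np to pin each worker to its own core; the thread count and the copy granularity are illustrative only:

    /* Sketch: copy a large array with one thread per physical core,
     * each pinned to its own core (Linux, GNU extensions). */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <string.h>

    #define NTHREADS 4   /* illustrative: set to the number of physical cores */

    struct job {
        double       *dst;
        const double *src;
        size_t        count;
    };

    static void *copy_part(void *arg)
    {
        struct job *j = arg;
        memcpy(j->dst, j->src, j->count * sizeof(double));
        return NULL;
    }

    void parallel_copy(double *dst, const double *src, size_t n)
    {
        pthread_t  tid[NTHREADS];
        struct job job[NTHREADS];
        size_t     slice = n / NTHREADS;

        for (int i = 0; i < NTHREADS; i++) {
            job[i].dst   = dst + (size_t)i * slice;
            job[i].src   = src + (size_t)i * slice;
            job[i].count = (i == NTHREADS - 1) ? n - (size_t)i * slice : slice;

            pthread_attr_t attr;
            cpu_set_t      cpus;
            pthread_attr_init(&attr);
            CPU_ZERO(&cpus);
            CPU_SET(i, &cpus);               /* bind thread i to core i */
            pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);
            pthread_create(&tid[i], &attr, copy_part, &job[i]);
            pthread_attr_destroy(&attr);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
    }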

How to saturate memory bus

I want to test a program with various memory bus usage levels. For example, I would like to find out if my program works as expected when other processes use 50% of the memory bus.
How would I simulate this kind of disturbance?
My attempt was to run a process with multiple threads, each thread doing random reads from a big block of memory. This didn't appear to have a big impact on my program. My program has a lot of memory operations, so I would expect that a significant disturbance will be noticeable.
I want to saturate the bus but without using too many CPU cycles, so that any performance degradation will be caused only by bus contention.
Notes:
I'm using a Xeon E5645 processor, DDR3 memory
The mental model of "processes use 50% of the memory bus" is not a great one. A thread that has acquired a core and accesses memory that's not in the caches uses the memory bus.
Getting a thread to saturate the bus is simple: just use memcpy(). Copy several times the amount that fits in the last-level cache, and warm it up by running it multiple times so there are no page faults to slow the code down.
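A minimal version of that suggestion; the buffer size is an arbitrary multiple of a typical last-level cache, and in practice you would run one such process (or thread) per memory channel you want to load:

    /* Sketch: keep the memory bus busy with large memcpy passes.
     * 256 MB is arbitrary, chosen to be much larger than the last-level
     * cache so every pass actually goes out to DRAM. */
    #include <stdlib.h>
    #include <string.h>

    #define BUF_SIZE (256u * 1024 * 1024)

    int main(void)
    {
        char *src = malloc(BUF_SIZE);
        char *dst = malloc(BUF_SIZE);
        if (!src || !dst) return 1;

        memset(src, 1, BUF_SIZE);   /* touch the pages once so later passes see no page faults */
        memset(dst, 0, BUF_SIZE);

        for (;;)                    /* run until killed; this is purely a load generator */
            memcpy(dst, src, BUF_SIZE);
    }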
My first instinct would be to set up a bunch of DMA operations to bounce data around without using the CPU too much. This all depends on what operating system you're running and what hardware. Is this an embedded system? I'd be glad to give more detail in the comments.
I'd use non-temporal store instructions (e.g. SSE's movntps) to stream data, to avoid cache pollution for the other thread on the same core. Maybe unroll that loop 16 times to minimize the number of instructions per memory transfer. While the DMA idea sounds good, the linked manual is old and for 32-bit Linux, and your processor model makes me think you probably have a 64-bit OS, which makes me wonder how much of it is still correct. And in the worst case, a bug in your test code could corrupt your hard drive.
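A sketch of the non-temporal store idea using the SSE intrinsic _mm_stream_ps; it assumes the destination pointer is 16-byte aligned and the element count is a multiple of 4:

    /* Sketch: stream data to memory with non-temporal stores so the
     * load generator does not pollute the caches shared with your program.
     * Assumes dst is 16-byte aligned and n is a multiple of 4. */
    #include <stddef.h>
    #include <xmmintrin.h>

    void stream_fill(float *dst, size_t n, float value)
    {
        __m128 v = _mm_set1_ps(value);

        for (size_t i = 0; i < n; i += 4)
            _mm_stream_ps(&dst[i], v);   /* store that bypasses the cache hierarchy */

        _mm_sfence();                    /* make the streamed stores globally visible */
    }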
