About global memory access method - c

In general, on a GPU, which access pattern is faster when reading data from a contiguous block of global memory?
(1) a for-loop in which a single thread, or a very small number of threads, reads the block;
(2) many threads, possibly from different blocks, reading from global memory concurrently.
e.g.
if (threadIdx.x == 0)
{
    for (int i = 0; i < 1000; ++i)
        buffer[i] = data[i]; // data is stored in global memory
}
OR:
buffer[threadIdx.x]=data[threadIdx.x];//there are 1000 threads in this thread block

In short, the second should generally be faster. The justification is as follows:
There are two kinds of parallelism: Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP). Your first code (the loop) targets ILP and the second exploits TLP.
When TLP is exploited, many memory requests are issued concurrently, free of any control-flow dependencies. In this situation, the hardware can take advantage of locality among threads to reduce the total number of memory transactions (where possible). Moreover, the hardware can serve the concurrent requests in parallel through L2-cache bank parallelism, memory-controller parallelism, DRAM bank parallelism, and other levels of parallelism.
However, in the ILP case, the existing control dependency limits the number of concurrently issued memory requests. This is true even with loop unrolling (hardware resources such as scoreboard size and instruction window size limit the total number of outstanding instructions). So many of the memory requests are serialized unnecessarily. Moreover, the hardware's memory access coalescing capability is not exploited.

Solution one is faster, because 1000 threads are actually 1000 tasks which share one address space. Scheduling them costs the OS a lot of CPU resources, so the CPU is constantly interrupted.
If you do the work in one task, the CPU processes just that one task.
A multi-core CPU can handle this better, but 1000 threads is too many.

Related

Is multi-thread memory access faster than single threaded memory access?

Is multi-thread memory access faster than single threaded memory access?
Assume we are working in C. A simple example: I have a gigantic array A and I want to copy it to an array B of the same size. Is a multithreaded memory copy faster than a single-threaded one? How many threads are suitable for this kind of memory operation?
EDIT:
Let me narrow the question. First of all, we do not consider the GPU case. Memory access optimization is very important and effective in GPU programming; in my experience, we always need to be careful about memory operations there. On the CPU, that is not always the case. In addition, let's not consider SIMD instructions such as AVX and SSE. Those also run into memory performance issues when the program has many memory accesses relative to computation. Assume that we work on an x86 architecture with 1-2 CPUs. Each CPU has multiple cores and a quad-channel memory interface. The main memory is DDR4, as is common today.
My array is an array of double-precision floating-point numbers with a size similar to that of a CPU's L3 cache, roughly 50MB. Now, I have two cases: 1) copy this array to another array of the same size, either element by element or with memcpy; 2) combine a lot of small arrays into this gigantic array. Both are real-time operations, meaning that they need to be done as fast as possible. Does multi-threading give a speedup or a slowdown? What are the factors that affect the performance of memory operations in this case?
Someone said it will mostly depend on DMA performance. I think that applies when we do memcpy. What if we do an element-wise copy: does the data pass through the CPU cache first?
It depends on many factors. One factor is the hardware you use. On modern PC hardware, multithreading will most likely not lead to performance improvement, because CPU time is not the limiting factor of copy operations. The limiting factor is the memory interface. The CPU will most likely use the DMA controller to do the copying, so the CPU will not be too busy when copying data.
Over the years, CPU performance has increased greatly, roughly exponentially. RAM performance could not keep up, which made the cache more important, especially since the Celeron era.
So you can see an increase or a decrease in performance, depending heavily on:
memory fetch and memory store units per core
memory controller modules
pipeline depths of memory modules and enumeration of memory banks
memory accessing patterns of each thread(software)
Alignments of data chunks, instruction blobs
Sharing and its datapaths of common hardware resources
Operating system doing too much preemption for all threads
Simply optimize the code for cache, then the quality of cpu will decide the performance.
Example:
FX8150 has weaker cores than a i7-4700:
FX cores can scale with extra threads, but the i7 tops out with a single thread (I mean for memory-heavy code)
FX has more L3 but it is slower
FX can work with higher-frequency RAM, but the i7 has better inter-core data bandwidth (in the case of one thread sending data to another thread)
The FX pipeline is too long, and takes too long to recover after a branch
It looks like AMD shares performance among threads at a finer granularity, while Intel gives its power to a single thread (council assembly vs. monarchy). Maybe that's why AMD is better at GPUs and HBM.
If I had to stop speculation, I would care only for cache as it is not alterable in cpu while RAM can have many combinations on a motherboard.
Assuming AMD/Intel64 architecture.
One core is not capable of saturating the memory bandwidth. But this does not mean that multi-threading is automatically faster. For that, the threads must run on different cores. Launching as many threads as there are physical cores should give a speedup, as the OS will most likely assign the threads to different cores, but your threading library should have a function for binding a thread to a specific core, and using it is best for speed. Another thing to think about is NUMA, if you have a multi-socket system. For maximum speed you should also think about using AVX instructions.
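To measure this on your own machine, a minimal sketch of a copy split across several POSIX threads could look like the following. The names parallel_copy, copy_slice and MAX_COPY_THREADS are made up for illustration, core pinning is left out, and whether this beats a plain memcpy depends on the hardware, so treat it as something to benchmark rather than a recommendation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_COPY_THREADS 64   /* arbitrary upper bound for this sketch */

/* Hypothetical helper type: one contiguous slice of the copy per thread. */
struct copy_slice {
    double *dst;
    const double *src;
    size_t count;
};

static void *copy_worker(void *arg)
{
    struct copy_slice *s = arg;
    memcpy(s->dst, s->src, s->count * sizeof *s->src);
    return NULL;
}

/* Copy n doubles from src to dst using nthreads threads.
 * Returns 0 on success, -1 on a bad thread count or if a thread
 * could not be created (in which case the copy is incomplete). */
int parallel_copy(double *dst, const double *src, size_t n, int nthreads)
{
    pthread_t tid[MAX_COPY_THREADS];
    struct copy_slice slice[MAX_COPY_THREADS];
    int started = 0, rc = 0;

    assert(dst != NULL && src != NULL);
    if (nthreads < 1 || nthreads > MAX_COPY_THREADS)
        return -1;

    size_t chunk = n / (size_t)nthreads;
    for (int t = 0; t < nthreads; ++t) {
        size_t begin = (size_t)t * chunk;
        /* The last thread also takes the remainder. */
        size_t count = (t == nthreads - 1) ? n - begin : chunk;
        slice[t] = (struct copy_slice){ dst + begin, src + begin, count };
        if (pthread_create(&tid[t], NULL, copy_worker, &slice[t]) != 0) {
            rc = -1;
            break;
        }
        ++started;
    }
    /* Wait for every thread that was actually started. */
    for (int t = 0; t < started; ++t)
        pthread_join(tid[t], NULL);
    return rc;
}
```

Calling parallel_copy(B, A, n, 4) and timing it against memcpy(B, A, n * sizeof *A) for a few thread counts is the simplest way to see where the memory interface saturates on a given system.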

mmap thread safety in a multi-core and multi-cpu environment

I am a little confused as to the real issues between multi-core and multi-cpu environments when it comes to shared memory, with particular reference to mmap in C.
I have an application that utilizes mmap to share multiple segments of memory between 2 processes. Each process has access to:
A Status and Control memory segment
Raw data (up to 8 separate raw data buffers)
The Status and Control segment is used essentially as an IPC. IE, it may convey that buffer 1 is ready to receive data, or buffer 3 is ready for processing or that the Status and Control memory segment is locked whilst being updated by either parent or child etc etc.
My understanding is, and PLEASE correct me if I am wrong, is that in a multi-core CPU environment on a single boarded PC type infrastructure, mmap is safe. That is, regardless of the number of cores in the CPU, RAM is only ever accessed by a single core (or process) at any one time.
Does this assumption of single-process RAM access also apply to multi-cpu systems? That is, a single PC style board with multiple CPU's (and I guess, multiple cores within each CPU).
If not, I will need to seriously rethink my logic to allow for multi-cpu'd single-boarded machines!
Any thoughts would be greatly appreciated!
PS - by single boarded I mean a single, standalone PC style system. This excludes mainframes and the like ... just to clarify :)
RAM is only ever accessed by a single core (or process) at any one time.
Take a step back and think about what your assumption means. Theoretically, yes, this statement is true, but I don't think it means what you think it means. There are no practical conclusions you can draw from this other than maybe "the memory will not catch fire if two CPUs write to the same address at the same time". Let me explain.
If one CPU/process writes to a memory location, and then a different CPU/process writes to the same location, the memory writes will not happen at the same time; they will happen one at a time. You can't generally reason about which write will happen before the other, and you can't reason about whether a read from one CPU will happen before the write from the other CPU. On some older CPUs you can't even reason about whether multi-byte (multi-word, actually) values will be stored/accessed one byte at a time or multiple bytes at a time (which means that reads and writes to multi-byte values can get interleaved between CPUs or processes).
The only thing multiple CPUs change here is the order of memory reads and writes. On a single CPU you can be pretty sure that your reads from memory will see earlier writes to the same memory (unless other hardware is reading/writing the memory, in which case all bets are off). On multiple CPUs the order of reads and writes to different memory locations will surprise you (CPU 1 writes to address 1 and then 2, but CPU 2 might see the new value at address 2 and the old value at address 1).
So unless you have specific documentation from your operating system and/or CPU manufacturer, you can't make any assumptions (except that when two writes to the same memory location happen, one will happen before the other). This is why you should use libraries like pthreads or stdatomic.h from C11 for proper locking and synchronization, or really dig deep into the most complex parts of the CPU documentation to understand what will actually happen. The locking primitives in pthreads not only provide locking, they also guarantee that memory is properly synchronized. stdatomic.h is another way to guarantee memory synchronization, but you should carefully read the C11 standard to see what it promises and what it doesn't promise.
One potential issue is that each core has its own cache (usually just level 1, as level 2 and level 3 caches are usually shared). Each CPU would also have its own cache. However, most systems ensure cache coherency, so this isn't the issue (except for the performance impact of constantly invalidating caches due to writes to memory that shares a cache line across cores or processors).
The real issue is that there is no guarantee against reordering of reads and writes due to optimizations by the compiler and/or the hardware. You need to use a Memory Barrier to flush out any pending memory operations to synchronize the state of the threads or shared memory of processes. The memory barrier will occur if you use one of the synchronization types such as an event, mutex, semaphore, ... . Not all of the shared memory reads and writes need to be atomic, but you need to use synchronization between threads and/or processes before accessing any shared memory possibly updated by another thread and/or process.
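As a rough illustration of the ordering problem and one way to deal with it, here is a sketch using C11 stdatomic.h with release/acquire ordering. It uses two threads instead of two processes to stay short, and the names payload, ready and publish_demo are invented for the example. The same pattern is often used on an atomic flag inside a shared mapping, but whether that is valid on your platform (lock-free atomics, address-free behaviour) is an assumption you would have to check.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;        /* plain, non-atomic shared data */
static atomic_int ready;   /* flag that publishes the payload */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;  /* ordinary write */
    /* Release: the payload write cannot be reordered after this store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    int *out = arg;
    /* Acquire: once ready == 1 is observed, the payload write is visible. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  /* busy-wait; a real program would block on a semaphore or mutex */
    *out = payload;
    return NULL;
}

/* Runs one producer and one consumer; returns the value the consumer saw. */
int publish_demo(void)
{
    pthread_t p, c;
    int seen = -1;

    payload = 0;
    atomic_store(&ready, 0);

    int rc = pthread_create(&c, NULL, consumer, &seen);
    assert(rc == 0);
    rc = pthread_create(&p, NULL, producer, NULL);
    assert(rc == 0);

    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

If both accesses to ready used memory_order_relaxed instead, the C11 standard would no longer guarantee that the consumer reads the new payload, which is exactly the "new value at address 2, old value at address 1" situation described above.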
This does not sound right to me. Two processes on two different cores can both load and store data to RAM at the same time. In addition to this, caching strategies can result in all kinds of strange behaviour. So please make sure all access to shared memory is properly synchronized using (interprocess) synchronization objects.
My understanding is, and PLEASE correct me if I am wrong, is that in a multi-core CPU environment on a single boarded PC type infrastructure, mmap is safe. That is, regardless of the number of cores in the CPU, RAM is only ever accessed by a single core (or process) at any one time.
Even if this holds true for some particular architecture, such an assumption is entirely wrong in general. You should have proper synchronisation between the processes that modify the shared memory segment, unless atomic intrinsics are used and the algorithm itself is lock-free.
I would advise you to put a pthread_mutex_t in the shared memory segment (shared across all processes). You will have to initialise it with the PTHREAD_PROCESS_SHARED attribute:
pthread_mutexattr_t mutex_attr;
pthread_mutexattr_init(&mutex_attr);
pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED);
pthread_mutex_init(mutex, &mutex_attr);
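To put the four calls above in context, here is a small sketch in which a parent and a forked child both increment a counter in an anonymous shared mapping, guarded by a process-shared mutex stored in the same mapping. The struct layout and the function name shared_counter_demo are assumptions for illustration, MAP_ANONYMOUS is assumed to be available (Linux and most Unix-like systems), and most error handling is omitted.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS with glibc in strict modes */
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout of the shared segment: the lock lives next to the data. */
struct shared_region {
    pthread_mutex_t lock;
    long counter;
};

/* Parent and child each add `iterations` to a counter in shared memory.
 * Returns the final counter value, or -1 on error. */
long shared_counter_demo(long iterations)
{
    assert(iterations >= 0);

    struct shared_region *r = mmap(NULL, sizeof *r, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (r == MAP_FAILED)
        return -1;

    /* Initialise the mutex so that it works across process boundaries. */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&r->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    r->counter = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(r, sizeof *r);
        return -1;
    }

    /* Both processes run this loop on the same mapping. */
    for (long i = 0; i < iterations; ++i) {
        pthread_mutex_lock(&r->lock);
        r->counter++;
        pthread_mutex_unlock(&r->lock);
    }

    if (pid == 0)
        _exit(0);            /* child is done */

    waitpid(pid, NULL, 0);   /* parent waits, then reads the total */
    long result = r->counter;
    pthread_mutex_destroy(&r->lock);
    munmap(r, sizeof *r);
    return result;
}
```

With the lock in place, shared_counter_demo(n) should come back as 2 * n; removing the lock/unlock pair is a quick way to see lost updates on a multi-core machine.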

Openmp not speeding up parallel loop

I have the following embarrassingly parallel loop
//#pragma omp parallel for
for(i=0; i<tot; i++)
    pointer[i] = val;
Why does uncommenting the #pragma line cause performance to drop? I'm getting a slight increase in program run time when I use openmp to parallelize this for loop. Since each access is independent, shouldn't it greatly increase the speed of the program?
Is it possible that if this for loop isn't run for large values of tot, the overhead is slowing things down?
Achieving performance with multiple threads in a Shared Memory environment usually depends on:
The task granularity;
Load balance between parallel tasks;
The number of parallel task/number of cores used;
The amount of synchronization among parallel tasks;
The type of bound of the algorithm;
The machine architecture.
I will give a brief overview of each of the aforementioned points.
You need to check whether the granularity of the parallel tasks is large enough to overcome the overhead of parallelization (e.g., thread creation and synchronization). Maybe the number of iterations of your loop, combined with the trivial computation pointer[i] = val;, is not enough to justify the overhead of thread creation. It is worth noting, however, that too coarse a task granularity can also cause problems, for instance load imbalance.
You have to check the load balance (the amount of work per thread). Ideally, each thread should do the same amount of work. In your code example this is not a problem.
Are you using hyper-threading? Are you using more threads than cores? If you are, threads will start competing for resources, and this can lead to a drop in performance.
Usually, one wants to reduce the amount of synchronization among threads. Consequently, sometimes one uses finer-grained synchronization mechanisms and even data redundancy (among other approaches) to achieve that. Your code does not have this issue.
Before attempting to parallelize your code, you should analyze whether it is memory-bound, CPU-bound, and so on. If it is memory-bound, you may want to start by improving cache usage before tackling the parallelization. A profiler is highly recommended for this task.
To extract the most out of the underlying architecture, the multi-threaded approach needs to take that architecture's constraints into account. For example, implementing an efficient multi-threaded approach for an SMP architecture is different from implementing one for a NUMA architecture, since in the latter one has to take memory affinity into account.
EDIT: Suggestion from @Hristo Iliev
Thread affinity: "Binding threads to cores improves performance in general and even more on NUMA systems since it improves data locality."
By the way, I recommend reading the Intel Guide for Developing Multithreaded Applications.
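On the granularity point, one option worth trying is OpenMP's if clause, which keeps the loop serial when the trip count is small. The sketch below assumes the element type is double, and the cutoff PARALLEL_THRESHOLD is a made-up number that would have to be tuned by measurement. For a loop that only stores to memory, it may well stay memory-bound even above the cutoff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff; the right value has to be measured on the target machine. */
#define PARALLEL_THRESHOLD 100000L

/* Set pointer[0..tot-1] to val, going parallel only for large tot. */
void fill(double *pointer, long tot, double val)
{
    long i;

    assert(pointer != NULL || tot == 0);

    /* The if clause makes the region run serially (no team of threads)
     * when tot is at or below the threshold. Without -fopenmp the pragma
     * is ignored and the loop is simply serial. */
    #pragma omp parallel for if(tot > PARALLEL_THRESHOLD)
    for (i = 0; i < tot; i++)
        pointer[i] = val;
}
```

Compile with -fopenmp (GCC/Clang) to enable the pragma, then compare run times for small and large values of tot with and without the if clause.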

if using shared memory, are there still advantages for processes over threading?

I have written a Linux application in which the main 'consumer' process forks off a bunch of 'reader' processes (~16) which read data from the disk and pass it to the 'consumer' for display. The data is passed over a socket which was created before the fork using socketpair.
I originally wrote it with this process boundary for 3 reasons:
The consumer process has real-time constraints, so I wanted to avoid any memory allocations in the consumer. The readers are free to allocate memory as they wish, or even be written in another language (e.g. with garbage collection), and this doesn't interrupt the consumer, which has FIFO priority. Also, disk access or other IO in the reader process won't interrupt the consumer. I figured that with threads I couldn't get such guarantees.
Using processes will stop me, the programmer, from doing stupid things like using global variables and clobbering other processes' memory.
I figured forking off a bunch of workers would be the best way to utilize multiple CPU architectures, and I figured using processes instead of threads would generally be safer.
Not all readers are always active, however, those that are active are constantly sending large amounts of data. Lately I was thinking that to optimize this by avoiding memory copies associated with writing and reading the socket, it would be nice to just read the data directly into a shared memory buffer (shm_open/mmap). Then only an index into this shared memory would be passed over the socket, and the consumer would read directly from it before marking it as available again.
Anyways, one of the biggest benefits of processes over threads is to avoid clobbering another thread's memory space. Do you think that switching to shared memory would destroy any advantages I have in this architecture? Is there still any advantage to using processes in this context, or should I just switch my application to using threads?
Your assumption that you cannot meet your realtime constraints with threads is mistaken. IO or memory allocation in the reader threads cannot stall the consumer thread as long as the consumer thread is not using malloc itself (which could of course lead to lock contention). I would recommend reading what POSIX has to say on the matter if you're unsure.
As for the other reasons to use processes instead of threads (safety, possibility of writing the readers in a different language, etc.), these are perfectly legitimate. As long as your consumer process treats the shared memory buffer as potentially-unsafe external data, I don't think you lose any significant amount of safety by switching from pipes to shared memory.
Yes, exactly for the reason you gave. It's better to have each process's memory protected and to share only what really needs to be shared. That way each process can allocate and use its resources without bothering with locking.
As for the index communication between your tasks, note that you could use an area in your shared memory for that, guarded by a mutex, as it is likely less heavy than socket communication. File descriptor communication (sockets, pipes, files, etc.) always involves the kernel; shared memory with mutex locks or semaphores involves it only when there is contention.
One point to be aware of when programming with shared memory in a multiprocessor environment is to avoid false dependencies between variables (false sharing). This happens when two unrelated objects share the same cache line. When one is modified, it also "dirties" the other, which means that if another processor accesses the other object, it triggers a cache synchronisation between the CPUs. This can lead to bad scaling. By aligning the objects to the cache line size (usually 64 bytes, but it can differ from architecture to architecture) one can easily avoid that.
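A sketch of the alignment trick in C11, assuming a 64-byte cache line (CACHE_LINE, padded_counter and counter_stride are names invented here): each counter is forced onto its own line, so two processes or threads updating neighbouring counters do not invalidate each other's cache line.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: typical for x86, check your target */

/* alignas on the member raises the alignment (and therefore the size)
 * of the whole struct to one cache line. */
struct padded_counter {
    alignas(CACHE_LINE) long value;
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "each counter should occupy exactly one cache line");

/* One slot per writer, for example one per reader process. */
static struct padded_counter counters[2];

/* Distance in bytes between two neighbouring counters. */
size_t counter_stride(void)
{
    return (size_t)((uintptr_t)&counters[1] - (uintptr_t)&counters[0]);
}
```

The cost is memory: each long now takes 64 bytes, which is usually fine for a handful of per-process slots but not for large arrays.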
The main reason I met in my experience to replace processes by threads was efficiency.
If your processes are using a lot of code or unshared memory that could be shared in multithreading, then you could win a lot of performance on highly threaded CPUs like SUN Sparc CPUs having 64 or more threads per CPU. In this case, the CPU cache, especially for the code, will be much more efficient with multithreaded process (cache is small on Sparc).
If you see that your software is not running faster when running on new hardware with more CPU threads, then you should consider multi-threading. Otherwise, your arguments to avoid it seem good to me.
I did not meet this issue on Intel processors yet, but it could happen in the future when they add more cores per CPU.

Thread-safety of read-only memory access

I've implemented the Barnes-Hut gravity algorithm in C as follows:
Build a tree of clustered stars.
For each star, traverse the tree and apply the gravitational forces from each applicable node.
Update the star velocities and positions.
Stage 2 is the most expensive stage, and so is implemented in parallel by dividing the set of stars. E.g. with 1000 stars and 2 threads, I have one thread processing the first 500 stars and the second thread processing the second 500.
In practice this works: it speeds the computation by about 30% with two threads on a two-core machine, compared to the non-threaded version. Additionally, it yields the same numerical results as the original non-threaded version.
My concern is that the two threads are accessing the same resource (namely, the tree) simultaneously. I have not added any synchronisation to the thread workers, so it's likely they will attempt to read from the same location at some point. Although access to the tree is strictly read-only I am not 100% sure it's safe. It has worked when I've tested it but I know this is no guarantee of correctness!
Questions
Do I need to make a private copy of the tree for each thread?
Even if it is safe, are there performance problems of accessing the same memory from multiple threads?
Update Benchmark results for the curious:
Machine: Intel Atom CPU N270 @ 1.60GHz, cpu MHz 800, cache size 512 KB
Threads  real    user    sys
0        69.056  67.324   1.720
1        76.821  66.268   5.296
2        50.272  63.608  10.585
3        55.510  55.907  13.169
4        49.789  43.291  29.838
5        54.245  41.423  31.094
0 means no threading at all; 1 and above means spawn that many worker threads and for the main thread to wait for them. I would not expect much of an improvement for anything beyond 2 threads, since it's entirely CPU bound and that's how many cores there are. It's interesting that an odd number of threads is slightly worse than an even number.
Looking at sys it's apparent that there's a cost with making threads. Currently it's making the threads for each frame (so N*1000 thread creations). This was easy to program (during my 15 minutes on the train this morning). I'll need to think a bit about how to reuse threads...
Update #2 I've made it use a pool of threads, synchronised with two barriers. This has no noticeable performance advantage over recreating the threads each frame.
You don't specify how your data is structured, but in general reading memory from multiple threads simultaneously is safe and does not introduce any performance issues. You only get problems if someone is writing.
It is interesting that you say you're only getting 30% speedup out of two threads. If you have an otherwise idle machine, two or more CPUs and only readonly shared data (i.e. no synchronization) I would expect to see much closer to 50% speed improvement. This suggests that your operation is actually completing so quickly that the overhead of creating the thread is becoming significant in your numbers. Are you running on a hyperthreaded CPU?
If your data is read-only, then no, you do not need to make a private copy of the tree for each thread. This is the biggest advantage that a shared memory threading model offers!
I'm not aware of any performance problems with such a model. If anything, it should be faster depending on if your CPUs can share some of their cache.
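For illustration, the pattern can be sketched like this: both threads read the same input arrays with no locking, and each writes only to its own range of the output. This is not the asker's Barnes-Hut code. The 1D "pull" formula is a stand-in for the real force calculation, and star_job and accumulate_two_threads are invented names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct star_job {
    const double *pos;    /* shared, read-only */
    const double *mass;   /* shared, read-only */
    double *acc;          /* shared array, but each thread writes only its own range */
    size_t n;             /* total number of stars */
    size_t begin, end;    /* half-open range of stars owned by this thread */
};

static void *accumulate(void *arg)
{
    const struct star_job *job = arg;
    for (size_t i = job->begin; i < job->end; ++i) {
        double a = 0.0;
        /* Every thread reads the whole pos/mass arrays; j == i adds zero. */
        for (size_t j = 0; j < job->n; ++j)
            a += job->mass[j] * (job->pos[j] - job->pos[i]);
        job->acc[i] = a;
    }
    return NULL;
}

/* Splits the stars into two halves; returns 0 on success, -1 on failure. */
int accumulate_two_threads(const double *pos, const double *mass,
                           double *acc, size_t n)
{
    assert(pos != NULL && mass != NULL && acc != NULL);

    struct star_job jobs[2] = {
        { pos, mass, acc, n, 0,     n / 2 },
        { pos, mass, acc, n, n / 2, n     },
    };
    pthread_t tid;

    if (pthread_create(&tid, NULL, accumulate, &jobs[1]) != 0)
        return -1;
    accumulate(&jobs[0]);      /* the calling thread handles the first half */
    pthread_join(tid, NULL);   /* join before anyone reads acc */
    return 0;
}
```

The only synchronisation is the pthread_join at the end, which is what makes the results in acc safe to read afterwards. The inputs need none as long as nothing writes to them while the workers run.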
