Can heavy memory access make multithreading slow? I use pthreads to parallelize a large function that performs a lot of memory accesses, yet the CPU time is higher than when I call the function with a single thread, and CPU utilization sits between 50% and 70%.
Don't guess; measure.
You don't say what OS you're using, but given pthreads I'm going to guess Linux. Use tools like Valgrind's callgrind and cachegrind to analyse where your program is spending its time. LTTng and perf may also help.
Yes, if your program is maxing out your memory bandwidth, or thrashing your cache, then multithreading could certainly slow down performance. This is especially true if the threads are trying to share any resources. BUT, you won't know if you don't look.
Aside (since you seem to be talking about memory access and not allocation), the default malloc has poor performance if you are allocating memory in parallel.
If you are looking for higher performance you may want to consider TCMalloc which scales significantly better with multithreaded allocations.
In general, keeping shared memory synchronised between threads is a nightmare that should probably be avoided if possible. See if you can avoid cache invalidations by adopting a message-passing paradigm (this may not be possible for your use-case).
Message passing with shared read-only memory is a good compromise for lowering cache traffic.
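As a rough illustration of that compromise, here is a minimal pthreads sketch (the queue size, function names and data layout are all illustrative, not taken from the question): workers receive small index messages through a locked queue, while the large buffer itself stays shared and read-only, so only the tiny messages ever bounce between caches.

    #include <pthread.h>
    #include <stdio.h>

    #define QSIZE 64

    /* Shared read-only data: filled once before the threads start, never modified after. */
    static double input[1024];

    /* Small message queue: workers receive indices, not the data itself.
       No overflow check here; this is a sketch only. */
    static int queue[QSIZE];
    static int head = 0, tail = 0, count = 0;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

    static void send_index(int idx) {
        pthread_mutex_lock(&qlock);
        queue[tail] = idx;
        tail = (tail + 1) % QSIZE;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&qlock);
    }

    static int recv_index(void) {
        pthread_mutex_lock(&qlock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &qlock);
        int idx = queue[head];
        head = (head + 1) % QSIZE;
        count--;
        pthread_mutex_unlock(&qlock);
        return idx;
    }

    static void *worker(void *arg) {
        (void)arg;
        int idx = recv_index();
        /* Only the small message crossed threads; the big buffer is read-only. */
        printf("processing input[%d] = %f\n", idx, input[idx]);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        input[3] = 1.5;
        pthread_create(&t, NULL, worker, NULL);
        send_index(3);
        pthread_join(t, NULL);
        return 0;
    }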
I have written a CUDA program that already gets a 40x speedup compared to a serial version (2600k vs GTX 780). Now I am thinking about using several streams to run several kernels in parallel. My questions are: how can I measure the free resources on my GPU (because if there are no free resources left, using streams would make no sense, am I right?), and in which cases does using streams make sense?
If asked I can provide my code of course, but at the moment I think that it is not needed for the question.
Running kernels concurrently will only happen if the resources are available for it. A single kernel call that "uses up" the GPU will prevent other kernels from executing in a meaningful way, as you've already indicated, until that kernel has finished executing.
The key resources to think about initially are SMs, registers, shared memory, and threads. Most of these are also related to occupancy, so studying occupancy (both theoretical, i.e. occupancy calculator, as well as measured) of your existing kernels will give you a good overall view of opportunities for additional benefit through concurrent kernels.
In my opinion, concurrent kernels are only likely to show much overall benefit in your application if you are launching a large number of very small kernels, i.e. kernels that encompass only one or a small number of threadblocks and that make very limited use of shared memory, registers, and other resources.
The best optimization approach (in my opinion) is analysis-driven optimization. This tends to avoid premature or possibly misguided optimization strategies, such as "I heard about concurrent kernels, I wonder if I can make my code run faster with it?" Analysis-driven optimization starts out by asking basic utilization questions, using the profiler to answer those questions, and then focusing your optimization effort on improving metrics such as memory utilization or compute utilization. Concurrent kernels and various other techniques are among the strategies you might use to address the findings from profiling your code.
You can get started with analysis-driven optimization with presentations such as this one.
If you specify no stream, stream 0 is used. According to Wikipedia (you can also find it in the cudaDeviceProp structure), your GTX 780 GPU has 12 streaming multiprocessors, which means there could be an improvement if you use multiple streams. The asyncEngineCount property will tell you how many concurrent asynchronous memory copies can run.
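For reference, those figures can be read directly from cudaDeviceProp at runtime; a minimal sketch (device index 0 is assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 // assume device 0
        printf("SMs:                 %d\n", prop.multiProcessorCount);
        printf("async copy engines:  %d\n", prop.asyncEngineCount);
        printf("concurrent kernels:  %d\n", prop.concurrentKernels);
        return 0;
    }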
The idea of using streams is to use an async memcopy engine (aka DMA engine) to overlap kernel executions and device-to-host transfers. The number of streams you should use for best performance is hard to guess because it depends on the number of DMA engines you have, the number of SMs, and the balance between synchronization and the amount of concurrency. To get an idea you can read this presentation (for instance slides 5 and 6 explain the idea very well).
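A minimal sketch of that overlap pattern with the CUDA runtime API, using pinned host memory and one chunk of work per stream (the kernel, chunk size and stream count are placeholders, not taken from the question):

    #include <cuda_runtime.h>

    __global__ void my_kernel(float *d, int n) {           // placeholder kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        const int nStreams = 4, chunk = 1 << 20;
        float *h, *d;
        cudaMallocHost((void **)&h, nStreams * chunk * sizeof(float)); // pinned memory: needed for async copies to overlap
        cudaMalloc((void **)&d, nStreams * chunk * sizeof(float));

        cudaStream_t s[nStreams];
        for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&s[i]);

        for (int i = 0; i < nStreams; ++i) {
            float *hp = h + i * chunk, *dp = d + i * chunk;
            cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, s[i]);
            my_kernel<<<(chunk + 255) / 256, 256, 0, s[i]>>>(dp, chunk);
            cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
        }
        cudaDeviceSynchronize();

        for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(s[i]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }

With a single copy engine, the host-to-device and device-to-host copies in different streams still serialize against each other but can overlap with kernel execution; with two engines, both copy directions can overlap as well.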
Edit: I agree that using a profiler is needed as a first step.
In modern multicore processors, we normally have a local L1 cache but a shared L2 cache. Is it possible to bypass the L1 cache for some portion of memory while still using the L2 cache for it? I want to do this to improve timing predictability, even if it costs some performance.
As far as I know, there is no way to bypass the L1 cache on mainstream CPUs.
However, to achieve your goal (i.e. avoiding cache misses that cause variation in timing measurements), you can ask the compiler to prefetch the data into the cache ahead of time.
If you use GCC or LLVM, see __builtin_prefetch.
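A minimal sketch of what that looks like (the prefetch distance of 8 elements is an assumption that would need tuning for a real workload):

    #include <stddef.h>

    /* Prefetch a few elements ahead of where the loop is currently reading. */
    double sum_with_prefetch(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i) {
            if (i + 8 < n)
                __builtin_prefetch(&a[i + 8], 0 /* read */, 3 /* high temporal locality */);
            s += a[i];
        }
        return s;
    }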
However, your question is quite vague, and I am unsure that your proposal will suit your needs.
Caches
I strongly suspect that you have misunderstood what a cache does and what it is for.
Caches are transparent from the point of view of memory contents. If one core writes to a memory location then every other core whose caches (L1, L2, L3 etc), shared or not, happen to be caching that location will get updated also.
Note that that does not mean that the cores aren't able to race for the value. You can still have a race condition whereby one core reading a location fractionally before another writes it 'gets the wrong value'. Furthermore that will happen whether or not your CPU has caches of any sort. To solve that 'ordering' problem you have to use semaphores or other IPC primitives in your source code.
Some cache systems do allow you to 'drop hints' to them. Matthieu Rouget gave an example of that with __builtin_prefetch. These sorts of things allow the programmer to tell the cache system that it might well be worth getting some data in advance. Some systems (e.g. PowerPC 7450) sort of allowed the programmer to use part of the cache as memory instead of cache, kind of the ultimate in programmer cache control.
However, none of these things make any difference to the view of memory that all the caches have. If one cache's contents get updated, the rest are also updated.
Caches and Performance Programming
The very best programmers are able to extract peak performance from a CPU by coding around the behaviour of the cache. In that realm one generally finds oneself wishing that the cache wasn't there at all. The ultimate embodiment of this is the Cell processor in the PS3. The maths cores on that have no cache at all. Instead you have to in effect do all your own data fetching and write back yourself in your source code, rather than leave it up to some cache to second guess what data your program is going to ask for. Get it right and the performance is still blisteringly good.
Bus Snooping
Some CPUs don't have cache bus snooping, which can be a particular problem when writing device drivers. Bus snooping is a mechanism whereby the CPU caches spot the content of memory being updated by something other than the CPU cores (e.g. by a DMA controller reading data from a device). And the same the other way round - DMAs from memory get values currently stuck in cache. AFAIK almost all CPUs these days do bus snooping, so that is not likely to be a problem.
On systems with IO as well as memory address spaces (e.g. Intel) I don't think that I/O address space is cached anyway. For systems with memory mapped devices their memory is generally not cached either, and the OS sets up the CPU that way (see this).
Timing Predictability
To return to the reason for your question - timing predictability. You may be using the wrong technology. If your system has timing constraints whereby the problem is variations in main memory write times, then frankly using a multicore CPU sounds like the wrong thing in the first place. @Griwes is quite right on that point (and indeed the entire comment). You'll more likely need to resort to a pure hardware design, something along the lines of an FPGA (no comments about whether firmware is really software please!).
If, as I suspect, you're actually trying to avoid using semaphores and other IPC primitives to synchronise two threads in your system then you're not going to succeed, shared caches or not. You need to use semaphores and such to make your code work properly.
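For completeness, here is a minimal sketch of the kind of primitive meant here: a POSIX semaphore enforcing the write-before-read ordering that the caches alone cannot give you (all names are illustrative):

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static int shared_value;          /* written by one thread, read by another */
    static sem_t ready;               /* enforces the ordering the caches cannot */

    static void *producer(void *arg) {
        (void)arg;
        shared_value = 42;
        sem_post(&ready);             /* publish: the reader may now proceed */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        sem_wait(&ready);             /* block until the value has been written */
        printf("%d\n", shared_value);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        sem_init(&ready, 0, 0);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        sem_destroy(&ready);
        return 0;
    }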
Tesla (Fermi or Kepler) in TCC mode compared to GeForce (same generations) in WDDM mode?
The program I wrote has some very serious problems with kernel overhead because it has to launch kernels repeatedly. The overhead is so large that I have to merge many kernels together and trade memory space for fewer kernel launches, but that only goes so far given the limited size of GPU memory.
I heard TCC mode has lower overhead, but can it bring the launch overhead down to the level of a CPU function call?
From benchmarks I have read, at least on the GeForce GTX 280 the kernel-launch overhead is thousands of times larger than a CPU function-call overhead, and for methods that require a large number of repeated iterations this makes a huge performance difference.
The WDDM driver will batch kernel launches together to reduce overhead. So if you are able to merge kernels together to reduce launch overhead, the WDDM driver can achieve much the same effect by batching (unless you use CUDA calls in between that prevent batching). Thus switching to TCC mode will not gain you much in this specific use case.
Are you sure the problem is launch overhead and not something else? How many separate kernels are you launching and how long does this take?
It could well be (particularly in the case of very small kernels, where launch overhead would be noticeable) that merging the kernels together allows the compiler to better optimize them, e.g. by eliminating the writing out and reading back of intermediate results to global memory.
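One way to check whether launch overhead really is the bottleneck is to time a large batch of empty kernel launches with CUDA events; a minimal sketch (the launch count is arbitrary):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void empty_kernel() {}        // does no work: measures launch cost only

    int main() {
        const int N = 10000;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        empty_kernel<<<1, 1>>>();            // warm-up launch
        cudaDeviceSynchronize();

        cudaEventRecord(start);
        for (int i = 0; i < N; ++i)
            empty_kernel<<<1, 1>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("average time per launch: %f us\n", 1000.0f * ms / N);
        return 0;
    }

Comparing that figure against the average runtime of your real kernels tells you how much of the total time launch overhead can possibly account for.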
I was launching 16 kernels and the speed was X; when I merged all the kernels into a single launch, the speed was 10X. Merging the kernels added some overhead of its own, but the results were great.
This is a many-core architecture: if you cannot make use of that (i.e. launch the largest job size you can), then you are wasting the overhead you paid to launch the kernel.
I hope this helps you.
I have an application level (PThreads) question regarding choice of hardware and its impact on software development.
I have working multi-threaded code tested well on a multi-core single CPU box.
I am trying to decide what to purchase for my next machine:
A 6-core single CPU box
A 4-core dual CPU box
My question is, if I go for the dual CPU box, will that impact the porting of my code in a serious way? Or can I just allocate more threads and let the OS handle the rest?
In other words, is multiprocessor programming any different from (single CPU) multithreading in the context of a PThreads application?
I thought it would make no difference at this level, but when configuring a new box, I noticed that one has to buy separate memory for each CPU. That's when I hit some cognitive dissonance.
More Detail Regarding the Code (for those who are interested): I read a ton of data from disk into a huge chunk of memory (~24GB soon to be more), then I spawn my threads. That initial chunk of memory is "read-only" (enforced by my own code policies) so I don't do any locking for that chunk. I got confused as I was looking at 4-core dual CPU boxes - they seem to require separate memory. In the context of my code, I have no idea what will happen "under the hood" if I allocate a bunch of extra threads. Will the OS copy my chunk of memory from one CPU's memory bank to another? This would impact how much memory I would have to buy (raising the cost for this configuration). The ideal situation (cost-wise and ease-of-programming-wise) is to have the dual CPU share one large bank of memory, but if I understand correctly, this may not be possible on the new Intel dual core MOBOs (like the HP ProLiant ML350e)?
Modern CPUs [1] handle RAM locally and use a separate channel [2] to communicate between them. This is a consumer-level version of the NUMA architecture, created for supercomputers more than a decade ago.
The idea is to avoid a shared bus (the old FSB) that can cause heavy contention because it's used by every core to access memory. As you add more NUMA cells, you get higher bandwidth. The downside is that memory becomes non-uniform from the point of view of the CPU: some RAM is faster than others.
Of course, modern OS schedulers are NUMA-aware, so they try to reduce the migration of a task from one cell to another. Sometimes it's okay to move from one core to another in the same socket; sometimes there's a whole hierarchy specifying which resources (1-,2-,3-level cache, RAM channel, IO, etc) are shared and which aren't, and that determines if there would be a penalty or not by moving the task. Sometimes it can determine that waiting for the right core would be pointless and it's better to shovel the whole thing to another socket....
In the vast majority of cases, it's best to let the scheduler do what it knows best. If not, you can play around with numactl.
As for the specific case of a given program: the best architecture depends heavily on the level of resource sharing between threads. If each thread has its own playground and mostly works alone within it, a smart enough allocator will prioritize local RAM, making it less important which cell each thread happens to be on.
If, on the other hand, objects are allocated by one thread, processed by another and consumed by a third, performance will suffer if they're not on the same cell. You could try to create small thread groups and limit heavy sharing to within the group; then each group could go on a different cell without problem.
The worst case is when all threads participate in a great orgy of data sharing. Even if you have all your locks and processes well debugged, there won't be any way to optimize it to use more cores than what are available on a cell. It might even be best to limit the whole process to just use the cores in a single cell, effectively wasting the rest.
[1] By modern, I mean any 64-bit AMD chip, and Nehalem or better for Intel.
[2] AMD calls this channel HyperTransport; Intel's name for it is QuickPath Interconnect.
EDIT:
You mention that you initialize "a big chunk of read-only memory" and then spawn a lot of threads to work on it. If each thread works on its own part of that chunk, then it would be a lot better if you initialize each part on the thread that will use it, after spawning the threads. That would allow the threads to spread to several cores, and the allocator would choose local RAM for each, a much more effective layout. Maybe there's some way to hint the scheduler to migrate the threads away as soon as they're spawned, but I don't know the details.
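A minimal sketch of that idea, relying on Linux's default first-touch page placement policy (sizes, thread count and names are illustrative):

    #include <pthread.h>
    #include <stdlib.h>

    #define NTHREADS 8
    #define N (1UL << 26)                 /* total number of doubles */

    static double *data;

    static void *init_and_work(void *arg) {
        long t = (long)arg;
        size_t begin = t * (N / NTHREADS), end = begin + N / NTHREADS;

        /* First touch: with the default Linux policy, these pages end up
           on the NUMA node of the core running this thread. */
        for (size_t i = begin; i < end; ++i)
            data[i] = 0.0;

        /* ... later work on the same [begin, end) range stays node-local ... */
        return NULL;
    }

    int main(void) {
        /* malloc only reserves address space; pages are placed on first touch. */
        data = malloc(N * sizeof(double));

        pthread_t th[NTHREADS];
        for (long t = 0; t < NTHREADS; ++t)
            pthread_create(&th[t], NULL, init_and_work, (void *)t);
        for (long t = 0; t < NTHREADS; ++t)
            pthread_join(th[t], NULL);

        free(data);
        return 0;
    }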
EDIT 2:
If your data is read verbatim from disk, without any processing, it might be advantageous to use mmap instead of allocating a big chunk and read()ing. There are some common advantages:
No need to preallocate RAM.
The mmap operation is almost instantaneous and you can start using it. The data will be read lazily as needed.
The OS can be way smarter than you when choosing between application, mmaped RAM, buffers and cache.
It's less code!
Data that isn't needed won't be read and won't use up RAM.
You can specifically mark the mapping as read-only. Any bug that tries to write to it will cause a coredump.
Since the OS knows it's read-only, it can't be 'dirty', so if the RAM is needed, it will simply discard it, and reread when needed.
but in this case, you also get:
Since data is read lazily, each RAM page would be chosen after the threads have spread on all available cores; this would allow the OS to choose pages close to the process.
So, I think that if two conditions hold:
the data isn't processed in any way between disk and RAM
each part of the data is read (mostly) by one single thread, not touched by all of them.
then, just by using mmap, you should be able to take advantage of machines of any size.
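A minimal sketch of the mmap approach under those assumptions (the file name is a placeholder):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("huge_input.dat", O_RDONLY);    /* placeholder file name */
        struct stat st;
        fstat(fd, &st);

        /* Read-only, private mapping: pages are faulted in lazily by
           whichever thread touches them first. */
        const char *data = (const char *)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED)
            return 1;

        /* ... hand 'data' and 'st.st_size' to the worker threads here ... */

        munmap((void *)data, st.st_size);
        close(fd);
        return 0;
    }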
If each part of the data is read by more than one single thread, maybe you could identify which threads will (mostly) share the same pages, and try to hint the scheduler to keep these in the same NUMA cell.
For the x86 boxes you're looking at, the fact that memory is physically wired to different CPU sockets is an implementation detail. Logically, the total memory of the machine appears as one large pool - you wouldn't need to change your application code for it to run correctly across both CPUs.
Performance, however, is another matter. There is a speed penalty for cross-socket memory access, so the unmodified program may not run to its full potential.
Unfortunately, it's hard to say ahead of time whether your code will run faster on the 6-core, one-node box or the 8-core, two-node box. Even if we could see your code, it would ultimately be an educated guess. A few things to consider:
The cross-socket memory access penalty only kicks in on a cache miss, so if your program has good cache behaviour then NUMA won't hurt you much;
If your threads are all writing to private memory regions and you're limited by write bandwidth to memory, then the dual-socket machine will end up helping;
If you're compute-bound rather than memory-bandwidth-bound then 8 cores is likely better than 6;
If your performance is bounded by cache read misses then the 6 core single-socket box starts to look better;
If you have a lot of lock contention or writes to shared data then again this tends to advise towards the single-socket box.
There are a lot of variables, so the best thing to do is to ask your HP reseller for loaner machines matching the configurations you're considering. You can then test your application out, see where it performs best, and order your hardware accordingly.
Without more details, it's hard to give a detailed answer. However, hopefully the following will help you frame the problem.
If your thread code is proper (e.g. you properly lock shared resources), you should not experience any bugs introduced by the change of hardware architecture. Improper threading code can sometimes be masked by the specifics of how a specific platform handles things like CPU cache access/sharing.
You may experience a change in application performance per equivalent core due to differing approaches to memory and cache management in the single chip, multi core vs. multi chip alternatives.
Specifically if you are looking at hardware that has separate memory per CPU, I would assume that each thread is going to be locked to the CPU it starts on (otherwise, the system would have to incur significant overhead to move a thread's memory to memory dedicated to a different core). That may reduce overall system efficiency depending on your specific situation. However, separate memory per core also means that the different CPUs do not compete with each other for a given cache line (the 4 cores on each of the dual CPUs will still potentially compete for cache lines, but that is less contention than if 6 cores are competing for the same cache lines).
This type of cache line contention is called false sharing. I suggest the following read to understand whether that may be an issue you are facing:
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206?pgno=3
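To make the idea concrete, here is a minimal sketch of the usual fix: padding per-thread data out to a full cache line so that threads never write to the same line (the 64-byte line size is an assumption; check your CPU):

    #include <pthread.h>

    /* Each thread increments its own counter. Without the padding, several
       counters would share one cache line and the cores would fight over it. */
    struct padded_counter {
        long value;
        char pad[64 - sizeof(long)];     /* assume 64-byte cache lines */
    };

    static struct padded_counter counters[8];

    static void *work(void *arg) {
        long id = (long)arg;
        for (long i = 0; i < 100000000L; ++i)
            counters[id].value++;        /* one cache line per thread: no false sharing */
        return NULL;
    }

    int main(void) {
        pthread_t th[8];
        for (long id = 0; id < 8; ++id)
            pthread_create(&th[id], NULL, work, (void *)id);
        for (long id = 0; id < 8; ++id)
            pthread_join(th[id], NULL);
        return 0;
    }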
Bottom line is, application behavior should be stable (other than things that naturally depend on the details of thread scheduling) if you followed proper thread development practices, but performance could go either way depending on exactly what you are doing.
In multicore systems, such as those with 2, 4 or 8 cores, we typically use mutexes and semaphores to control access to shared memory. However, I can foresee that these methods would incur high overhead on future systems with many cores. Are there alternative methods that would work better for future many-core systems accessing shared memory?
Transactional memory is one such method.
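As a concrete (if simplified) taste of the model, GCC's -fgnu-tm extension lets you mark a block as a transaction; a minimal sketch (the account variables are illustrative):

    /* Compile with: gcc -fgnu-tm (software transactional memory via libitm).
       Hardware TM, e.g. Intel TSX, exposes a similar model through other means. */
    static long account_a = 100, account_b = 0;

    void transfer(long amount) {
        __transaction_atomic {           /* both updates commit atomically, no explicit lock */
            account_a -= amount;
            account_b += amount;
        }
    }

    int main(void) {
        transfer(25);
        return 0;
    }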
I'm not sure how far into the future you want to go. But in the very long run, shared memory as we know it right now (a single address space accessible by any core) is not scalable. So the programming model will have to change at some point, and that will make programmers' lives harder, as the move to multi-core did.
But for now (perhaps for another 10 years) you can get away with transactional memory and other hardware/software tricks.
The reason I say shared memory is not scalable in the long run is simply physics (similar to how single-core, high-frequency designs hit a barrier).
In short, transistors can't shrink to less than the size of an atom (barring new technology), and signals can't propagate faster than the speed of light. Therefore, memory will get slower and slower (with respect to the processor) and at some point, it becomes infeasible to share memory.
We can already see this effect right now with NUMA on the multi-socket systems. Large-scale supercomputers are neither shared-memory nor cache-coherent.
1) Lock only the part of memory you are accessing, not the entire table! This can be done with a big hash table of locks: the bigger the table, the finer-grained the locking (see the sketch after point 2).
2) If you can, lock only on writing, not on reading (this requires that there is no problem in reading the "previous value" while it is being updated, which is very often a valid assumption).
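A minimal sketch combining both ideas, per-bucket locks plus a reader/writer distinction (all names and sizes are illustrative):

    #include <pthread.h>

    #define NBUCKETS 1024                     /* more buckets = finer-grained locking */

    struct bucket {
        pthread_rwlock_t lock;
        long value;                           /* stand-in for the real per-bucket data */
    };

    static struct bucket table[NBUCKETS];

    void table_init(void) {
        for (int i = 0; i < NBUCKETS; ++i)
            pthread_rwlock_init(&table[i].lock, NULL);
    }

    static unsigned hash(const char *key) {
        unsigned h = 5381;
        while (*key)
            h = h * 33 + (unsigned char)*key++;
        return h % NBUCKETS;
    }

    long lookup(const char *key) {            /* readers don't block each other */
        struct bucket *b = &table[hash(key)];
        pthread_rwlock_rdlock(&b->lock);
        long v = b->value;
        pthread_rwlock_unlock(&b->lock);
        return v;
    }

    void update(const char *key, long v) {    /* a writer locks only its own bucket */
        struct bucket *b = &table[hash(key)];
        pthread_rwlock_wrlock(&b->lock);
        b->value = v;
        pthread_rwlock_unlock(&b->lock);
    }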
Access to shared memory at the lowest level in any multi-processor/core/threaded application synchronization depends on the bus lock. Such a lock may incur hundreds of (CPU) wait states as it also encompasses locking those I/O buses that have bus-mastering devices including DMA. Theoretically it is possible to envision a medium-level lock that can be invoked in situations when the programmer is certain that the memory area being locked won't be affected by any I/O bus. Such a lock would be much faster because it only needs to synchronize the CPU caches with main memory which is fast, at least in comparison to latency of the slowest I/O buses. Whether programmers in general would be competent to determine when to use which bus lock adds worrying implications to its mainstream feasibility. Such a lock could also require its own dedicated external pins for synchronization with other processors.
In multi-processor Opteron systems each processor has its own memory which becomes part of the entire memory that all installed processors can "see". A processor trying to access memory which turns out to be attached to another processor will transparently complete the access - albeit more slowly - through a high-speed interconnect bus (called HyperTransport) to the processor in charge of that memory (the NUMA concept). As long as a processor and its cores are working with the memory physically connected to it processing will be fast. In addition, many processors are equipped with several external memory buses to multiply their overall memory bandwidth.
A theoretical medium-level lock could, on Opteron systems, be implemented using the HyperTransport interconnections.
For any foreseeable future, the classic approach still holds: lock as seldom as possible and for as short a time as possible, with efficient algorithms (and associated data structures) for the work done while the locks are held.