Tesla (Fermi or Kepler) in TCC mode compared to GeForce (same generations) with WDDM?
The program I wrote has some very serious problems with kernel launch overhead, because it has to launch kernels repeatedly. The overhead is so large that I have had to merge many kernels together and trade memory space for fewer kernel launches; however, this only goes so far given the limited size of GPU memory.
I heard TCC mode has lower overhead, but can it bring the launch overhead down to the level of a CPU function call?
From benchmarks I have read, at least on the GeForce GTX 280 the kernel-launch overhead is thousands of times longer than a CPU function call, and for methods that require a large number of repeated iterations this makes a huge performance difference.
The WDDM driver batches kernel launches together to reduce overhead. So if you are able to merge kernels together to reduce launch overhead, the WDDM driver can do the same (unless you use CUDA calls in between that prevent batching). Thus switching to TCC mode will not gain you much in this specific use case.
Are you sure the problem is launch overhead and not something else? How many separate kernels are you launching and how long does this take?
It could well be (particularly in the case of very small kernels, where launch overhead would be noticeable) that merging the kernels together allows the compiler to better optimize them, e.g. to eliminate the writing out and reading back of intermediate results to global memory.
I was launching 16 kernels and the speed was X; when I merged all the kernels to be launched at once, the speed was 10X. Merging the kernels added some overhead of its own, but the results were great.
This is a many-core architecture; if you cannot make use of that (launch the largest job size), then you are wasting the overhead you paid to launch the kernel.
I hope this helps you.
Related
I have written a CUDA program which already gets a speedup of 40 compared to a serial version (2600K vs GTX 780). Now I am thinking about using several streams to run several kernels in parallel. My questions are: How can I measure the free resources on my GPU (because if there are no free resources on my GPU, the use of streams would make no sense, am I right?), and in which cases does the use of streams make sense?
If asked I can provide my code of course, but at the moment I think that it is not needed for the question.
Running kernels concurrently will only happen if the resources are available for it. A single kernel call that "uses up" the GPU will prevent other kernels from executing in a meaningful way, as you've already indicated, until that kernel has finished executing.
The key resources to think about initially are SMs, registers, shared memory, and threads. Most of these are also related to occupancy, so studying occupancy (both theoretical, i.e. occupancy calculator, as well as measured) of your existing kernels will give you a good overall view of opportunities for additional benefit through concurrent kernels.
In my opinion, concurrent kernels is only likely to show much overall benefit in your application if you are launching a large number of very small kernels, i.e. kernels that encompass only one or a small number of threadblocks, and which make very limited use of shared memory, registers, and other resources.
The best optimization approach (in my opinion) is analysis-driven optimization. This tends to avoid premature or possibly misguided optimization strategies, such as "I heard about concurrent kernels, I wonder if I can make my code run faster with it?" Analysis driven optimization starts out by asking basic utilization questions, using the profiler to answer those questions, and then focusing your optimization effort at improving metrics, such as memory utilization or compute utilization. Concurrent kernels, or various other techniques are some of the strategies you might use to address the findings from profiling your code.
You can get started with analysis-driven optimization with presentations such as this one.
If you specify no stream, stream 0 is used. According to Wikipedia (you may also find it in the cudaDeviceProp structure), your GTX 780 GPU has 12 streaming multiprocessors, which means there could be an improvement if you use multiple streams. The asyncEngineCount property will tell you how many concurrent asynchronous memory copies can run.
The idea of using streams is to use an async memcopy engine (aka DMA engine) to overlap kernel executions with device-to-host transfers. The number of streams you should use for best performance is hard to guess, because it depends on the number of DMA engines you have, the number of SMs, and the balance between synchronizations and the amount of concurrency. To get an idea you can read this presentation (for instance, slides 5 and 6 explain the idea very well).
Edit: I agree that using a profiler is needed as a first step.
Can a lot of memory access make multithreading slow? I use pthreads to multithread a big function that performs a lot of memory accesses, and the CPU time is greater than when I call the function with a single thread. CPU utilization is between 50% and 70%.
Don't guess; measure.
You don't say what OS you're using, but given pthreads I'm going to guess Linux. Use tools like Valgrind's callgrind and cachegrind to analyse where your program is spending its time. LTTng could also help you. Maybe perf also.
Yes, if your program is maxing out your memory bandwidth, or thrashing your cache, then multithreading could certainly slow down performance. This is especially true if the threads are trying to share any resources. BUT, you won't know if you don't look.
Aside (since you seem to be talking about memory access and not allocation), the default malloc has poor performance if you are allocating memory in parallel.
If you are looking for higher performance you may want to consider TCMalloc which scales significantly better with multithreaded allocations.
In general, keeping shared memory synchronised between threads is a nightmare that should probably be avoided if possible. See if you can avoid cache invalidations by adopting a message-passing paradigm (this may not be possible for your use-case).
Message passing with shared read-only memory is a good compromise for lowering cache traffic.
For example, in a dual-socket system with two quad-core processors, does the thread scheduler try to keep the threads of the same process on the same processor? Because interleaving threads of different processes across different processors would slow down performance in the case where the threads of a process have a lot of shared memory accesses.
It depends.
On current Intel platforms the BIOS default seems to be that memory is interleaved between the sockets in the system, page by page. Allocate 1Mbyte and half will be on one socket, half on the other. That means that wherever your threads are they have equal access to the data.
This makes it very simple for OSes - anywhere will do.
This can work against you. The SMP hardware environment presented to the OS is synthesised by the CPUs cooperating over QPI. If there are a lot of threads all accessing the same data then those links can get really busy. If they're too busy then that limits the performance, and you're I/O bound. That's where I am; Z80 cores with Intel's memory subsystem design would be just as quick as the Nehalem cores I've actually got (ok, I might be exaggerating...).
At the end of the day the real problem is that memory just isn't quick enough. Intel and AMD have both done some impressive things with memory recently, but we're still hampered by its slowness. Ideally memory would be quick enough so that all cores had clock rate access times to it. The Cell processor sort of did this - each SPE has a bit of SRAM instead of a cache, and once you get your head round them you can make them really sing.
===EDIT===
There is more to it. As Basile Starynkevitch hints the alternate approach is to embrace NUMA.
NUMA is what modern CPUs actually embody, the memory access being non-uniform because the memory on the other CPU sockets is not accessible directly by addressing a bus. The CPUs instead make a request for data over the QPI link (or Hypertransport in AMD's case) to ask the other CPU to fetch data out of its memory and send it back. Because the CPU is doing all this for you in hardware it ends up looking like a conventional SMP environment. And QPI / Hypertransport are very fast, so most of the time it's plenty quick enough.
If you write your code so as to mirror the architecture of the hardware, you can in theory make improvements. So this might involve (for example) having two copies of your data in the system, one on each socket. There are memory affinity routines in Linux to specifically allocate memory that way instead of interleaving it across all sockets. There are also CPU affinity routines that allow you to control which CPU core a thread runs on, the idea being that you run it on a core close to the data buffer it will be processing.
Ok, so that might mean a lot of investment in the source code to make that work for you (especially if that data duplication doesn't fit well with the program's flow), but if the QPI has become a problematic bottleneck it's the only thing you can do.
I've fiddled with this to some extent. In a way it's a right faff. The whole mindset of Intel and AMD (and thus the OSes and libraries too) is to give you an SMP environment which, most of the time, is pretty good. However they let you play with NUMA by having a load of library functions you have to call to get the deployment of threads and memory that you want.
However, for the edge cases where you want that little bit of extra speed, it'd be easier if the architecture and OS were rigidly NUMA, no SMP at all. Just like the Cell processor in fact. Easier, not because it'd be simple to write (in fact it would be harder), but because if you got it running at all you'd then know for sure that it was as quick as the hardware could ever possibly achieve. With the faked SMP that we have right now you experiment with NUMA but you're mostly left wondering if it's as fast as it possibly could be. It's not as if the libraries tell you that you're accessing memory that is actually resident on another socket; they just let you do it with no hint that there's room for improvement.
I have an application level (PThreads) question regarding choice of hardware and its impact on software development.
I have working multi-threaded code tested well on a multi-core single CPU box.
I am trying to decide what to purchase for my next machine:
A 6-core single CPU box
A 4-core dual CPU box
My question is, if I go for the dual CPU box, will that impact the porting of my code in a serious way? Or can I just allocate more threads and let the OS handle the rest?
In other words, is multiprocessor programming any different from (single CPU) multithreading in the context of a PThreads application?
I thought it would make no difference at this level, but when configuring a new box, I noticed that one has to buy separate memory for each CPU. That's when I hit some cognitive dissonance.
More Detail Regarding the Code (for those who are interested): I read a ton of data from disk into a huge chunk of memory (~24GB soon to be more), then I spawn my threads. That initial chunk of memory is "read-only" (enforced by my own code policies) so I don't do any locking for that chunk. I got confused as I was looking at 4-core dual CPU boxes - they seem to require separate memory. In the context of my code, I have no idea what will happen "under the hood" if I allocate a bunch of extra threads. Will the OS copy my chunk of memory from one CPU's memory bank to another? This would impact how much memory I would have to buy (raising the cost for this configuration). The ideal situation (cost-wise and ease-of-programming-wise) is to have the dual CPU share one large bank of memory, but if I understand correctly, this may not be possible on the new Intel dual core MOBOs (like the HP ProLiant ML350e)?
Modern CPUs1 handle RAM locally and use a separate channel2 to communicate between them. This is a consumer-level version of the NUMA architecture, created for supercomputers more than a decade ago.
The idea is to avoid a shared bus (the old FSB) that can cause heavy contention because it's used by every core to access memory. As you add more NUMA cells, you get higher bandwidth. The downside is that memory becomes non-uniform from the point of view of the CPU: some RAM is faster than others.
Of course, modern OS schedulers are NUMA-aware, so they try to reduce the migration of a task from one cell to another. Sometimes it's okay to move from one core to another in the same socket; sometimes there's a whole hierarchy specifying which resources (1-,2-,3-level cache, RAM channel, IO, etc) are shared and which aren't, and that determines if there would be a penalty or not by moving the task. Sometimes it can determine that waiting for the right core would be pointless and it's better to shovel the whole thing to another socket....
In the vast majority of cases, it's best to let the scheduler do what it knows best. If not, you can play around with numactl.
As for the specific case of a given program: the best architecture depends heavily on the level of resource sharing between threads. If each thread has its own playground and mostly works alone within it, a smart enough allocator will prioritize local RAM, making it less important which cell each thread happens to be on.
If, on the other hand, objects are allocated by one thread, processed by another and consumed by a third, performance will suffer if they're not on the same cell. You could try to create small thread groups and limit heavy sharing to within each group; then each group can go on a different cell without problem.
The worst case is when all threads participate in a great orgy of data sharing. Even if you have all your locks and processes well debugged, there won't be any way to optimize it to use more cores than are available on a cell. It might even be best to limit the whole process to just the cores in a single cell, effectively wasting the rest.
1 by modern, I mean any AMD-64bit chip, and Nehalem or better for Intel.
2 AMD calls this channel HyperTransport, and Intel's name for it is QuickPath Interconnect
EDIT:
You mention that you initialize "a big chunk of read-only memory" and then spawn a lot of threads to work on it. If each thread works on its own part of that chunk, then it would be a lot better if you initialized it in the thread, after spawning it. That would allow the threads to spread across several cores, and the allocator would choose local RAM for each, a much more effective layout. Maybe there's some way to hint the scheduler to migrate the threads away as soon as they're spawned, but I don't know the details.
EDIT 2:
If your data is read verbatim from disk, without any processing, it might be advantageous to use mmap instead of allocating a big chunk and read()ing. There are some common advantages:
No need to preallocate RAM.
The mmap operation is almost instantaneous and you can start using it. The data will be read lazily as needed.
The OS can be way smarter than you when choosing between application, mmaped RAM, buffers and cache.
It's less code!
Data that isn't needed won't be read and won't use up RAM.
You can specifically mark the memory as read-only. Any bug that tries to write to it will cause a core dump.
Since the OS knows it's read-only, it can't be 'dirty', so if the RAM is needed, it will simply discard it, and reread when needed.
but in this case, you also get:
Since data is read lazily, each RAM page would be chosen after the threads have spread on all available cores; this would allow the OS to choose pages close to the process.
So, I think that if two conditions hold:
the data isn't processed in any way between disk and RAM
each part of the data is read (mostly) by one single thread, not touched by all of them.
then, just by using mmap, you should be able to take advantage of machines of any size.
If each part of the data is read by more than one single thread, maybe you could identify which threads will (mostly) share the same pages, and try to hint the scheduler to keep these in the same NUMA cell.
For the x86 boxes you're looking at, the fact that memory is physically wired to different CPU sockets is an implementation detail. Logically, the total memory of the machine appears as one large pool - you wouldn't need to change your application code for it to run correctly across both CPUs.
Performance, however, is another matter. There is a speed penalty for cross-socket memory access, so the unmodified program may not run to its full potential.
Unfortunately, it's hard to say ahead of time whether your code will run faster on the 6-core, one-node box or the 8-core, two-node box. Even if we could see your code, it would ultimately be an educated guess. A few things to consider:
The cross-socket memory access penalty only kicks in on a cache miss, so if your program has good cache behaviour then NUMA won't hurt you much;
If your threads are all writing to private memory regions and you're limited by write bandwidth to memory, then the dual-socket machine will end up helping;
If you're compute-bound rather than memory-bandwidth-bound then 8 cores is likely better than 6;
If your performance is bounded by cache read misses then the 6 core single-socket box starts to look better;
If you have a lot of lock contention or writes to shared data then again this tends to advise towards the single-socket box.
There are a lot of variables, so the best thing to do is to ask your HP reseller for loaner machines matching the configurations you're considering. You can then test your application, see where it performs best, and order your hardware accordingly.
Without more details, it's hard to give a detailed answer. However, hopefully the following will help you frame the problem.
If your thread code is proper (e.g. you properly lock shared resources), you should not experience any bugs introduced by the change of hardware architecture. Improper threading code can sometimes be masked by the specifics of how a specific platform handles things like CPU cache access/sharing.
You may experience a change in application performance per equivalent core due to differing approaches to memory and cache management in the single chip, multi core vs. multi chip alternatives.
Specifically if you are looking at hardware that has separate memory per CPU, I would assume that each thread is going to be locked to the CPU it starts on (otherwise, the system would have to incur significant overhead to move a thread's memory to memory dedicated to a different core). That may reduce overall system efficiency depending on your specific situation. However, separate memory per core also means that the different CPUs do not compete with each other for a given cache line (the 4 cores on each of the dual CPUs will still potentially compete for cache lines, but that is less contention than if 6 cores are competing for the same cache lines).
This type of cache line contention is called False Sharing. I suggest the following read to understand if that may be an issue you are facing
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206?pgno=3
Bottom line is, application behavior should be stable (other than things that naturally depend on the details of thread scheduling) if you followed proper thread development practices, but performance could go either way depending on exactly what you are doing.
I am starting to learn OpenMP, running examples (with gcc 4.3) from https://computing.llnl.gov/tutorials/openMP/exercise.html in a cluster. All the examples work fine, but I have some questions:
How do I know on which nodes (or cores of each node) the different threads have run?
In the case of nodes, what is the average transfer time in microseconds or nanoseconds for sending the info and getting it back?
What are the best tools for debugging OpenMP programs?
What is the best advice for speeding up real programs?
Typically your OpenMP program does not know, nor does it care, on which cores it is running. If you have a job management system that may provide the information you want in its log files. Failing that, you could probably insert calls to the environment inside your threads and check the value of some environment variable. What that is called and how you do this is platform dependent, I'll leave figuring it out up to you.
How the heck should I (or any other SOer) know? For an educated guess you'd have to tell us a lot more about your hardware, o/s, run-time system, etc. The best answer to the question is the one you determine from your own measurements. I fear that you may also be mistaken in thinking that information is sent around the computer - in shared-memory programming, variables usually stay in one place (or at least you should think of them as staying in one place; the reality may be a lot messier, but also impossible to discern) and are not sent or received.
Parallel debuggers such as TotalView or DDT are probably the best tools. I haven't yet used Intel's debugger's parallel capabilities but they look promising. I'll leave it to less well-funded programmers than me to recommend FOSS options, but they are out there.
i) Select the fastest parallel algorithm for your problem. This is not necessarily the fastest serial algorithm made parallel.
ii) Test and measure. You can't optimise without data so you have to profile the program and understand where the performance bottlenecks are. Don't believe any advice along the lines that 'X is faster than Y'. Such statements are usually based on very narrow, and often out-dated, cases and have become, in the minds of their promoters, 'truths'. It's almost always possible to find counter-examples. It's YOUR code YOU want to make faster, there's no substitute for YOUR investigations.
iii) Know your compiler inside out. The rate of return (measured in code speed improvements) on the time you spent adjusting compilation options is far higher than the rate of return from modifying the code 'by hand'.
iv) One of the 'truths' that I cling to is that compilers are not terrifically good at optimising for use of the memory hierarchy on current processor architectures. This is one area where code modification may well be worthwhile, but you won't know this until you've profiled your code.
You cannot know; the distribution of threads across cores is handled entirely by the OS. You speak about nodes, but OpenMP is a multi-thread (not multi-process) parallelization model that only allows parallelization within one machine containing several cores. If you need parallelization across different machines you have to use a multi-process system like OpenMPI.
The orders of magnitude of communication times are:
very fast for communication between cores inside the same CPU; it can be considered instantaneous
~10 GB/s for communication between two CPUs across a motherboard
~100-1000 MB/s for network communication between nodes, depending on the hardware
All the theoretical speeds should be specified in your hardware documentation. You should also run small benchmarks to find out what you will really get.
For OpenMP, gdb does the job well, even with many threads.
I work on extreme physics simulations on supercomputers; here are our daily aims:
use as little communication as possible between the threads/processes; 99% of the time it is communication that kills performance in parallel jobs
split the tasks optimally; the machine load should be as close as possible to 100% all the time
test, tune, re-test, re-tune... Parallelization is not at all a generic "miracle solution"; it generally needs some practical work to be efficient.