When does using more than one stream provide a benefit in CUDA? - c

I have written a CUDA program that already achieves a 40x speedup over the serial version (2600K vs. GTX 780). Now I am thinking about using several streams to run several kernels in parallel. My questions are: how can I measure the free resources on my GPU (if there are no free resources on the GPU, using streams would make no sense, am I right?), and in which cases does using streams make sense?
If asked I can provide my code of course, but at the moment I think that it is not needed for the question.

Running kernels concurrently will only happen if the resources are available for it. A single kernel call that "uses up" the GPU will prevent other kernels from executing in a meaningful way, as you've already indicated, until that kernel has finished executing.
The key resources to think about initially are SMs, registers, shared memory, and threads. Most of these are also related to occupancy, so studying occupancy (both theoretical, i.e. occupancy calculator, as well as measured) of your existing kernels will give you a good overall view of opportunities for additional benefit through concurrent kernels.
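To make the "theoretical occupancy" part concrete, here is a minimal C sketch of the arithmetic an occupancy calculator performs. It is a simplification, not the real calculator: it ignores register and shared-memory allocation granularity, so actual tools may report lower numbers. The struct and function names are made up for illustration, and the per-SM limits used in the note below are assumed values for a compute capability 3.5 part such as the GTX 780, not values queried from the device.

```c
#include <assert.h>

/* Per-SM resource limits. Fill these in from cudaDeviceProp or the
 * programming guide; nothing here is queried from real hardware. */
typedef struct {
    int max_threads_per_sm;
    int max_blocks_per_sm;
    int regs_per_sm;
    int smem_per_sm;            /* bytes */
} sm_limits;

static int min_int(int a, int b) { return a < b ? a : b; }

/* How many blocks of one kernel fit on a single SM at the same time
 * (simplified: allocation granularity is ignored). */
int resident_blocks_per_sm(sm_limits lim, int threads_per_block,
                           int regs_per_thread, int smem_per_block)
{
    assert(threads_per_block > 0);
    int by_threads = lim.max_threads_per_sm / threads_per_block;
    int by_regs = regs_per_thread > 0
        ? lim.regs_per_sm / (regs_per_thread * threads_per_block)
        : lim.max_blocks_per_sm;
    int by_smem = smem_per_block > 0
        ? lim.smem_per_sm / smem_per_block
        : lim.max_blocks_per_sm;

    int blocks = min_int(lim.max_blocks_per_sm, by_threads);
    blocks = min_int(blocks, by_regs);
    return min_int(blocks, by_smem);
}

/* Resident threads divided by the maximum resident threads per SM. */
double theoretical_occupancy(sm_limits lim, int threads_per_block,
                             int regs_per_thread, int smem_per_block)
{
    int blocks = resident_blocks_per_sm(lim, threads_per_block,
                                        regs_per_thread, smem_per_block);
    return (double)(blocks * threads_per_block) / lim.max_threads_per_sm;
}
```

With the assumed limits {2048 threads, 16 blocks, 65536 registers, 49152 bytes}, a kernel with 256 threads per block and 32 registers per thread reaches an occupancy of 1.0, while 64 registers per thread halves it to 0.5. A kernel whose grid already fills every SM at high occupancy leaves little room for a second kernel to run alongside it.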
In my opinion, concurrent kernels are only likely to show much overall benefit in your application if you are launching a large number of very small kernels, i.e. kernels that encompass only one or a small number of threadblocks, and which make very limited use of shared memory, registers, and other resources.
The best optimization approach (in my opinion) is analysis-driven optimization. This tends to avoid premature or possibly misguided optimization strategies, such as "I heard about concurrent kernels, I wonder if I can make my code run faster with it?" Analysis-driven optimization starts out by asking basic utilization questions, using the profiler to answer those questions, and then focusing your optimization effort on improving metrics such as memory utilization or compute utilization. Concurrent kernels, or various other techniques, are some of the strategies you might use to address the findings from profiling your code.
You can get started with analysis-driven optimization with presentations such as this one.

If you specify no stream, stream 0 (the default stream) is used. According to Wikipedia (you can also find it in the cudaDeviceProp structure), your GTX 780 GPU has 12 streaming multiprocessors, which means there could be an improvement if you use multiple streams. The asyncEngineCount property tells you how many asynchronous copy engines the device has, i.e. how many memory copies can run concurrently with kernel execution.
The idea of using streams is to use an asynchronous memory-copy engine (a.k.a. DMA engine) to overlap kernel execution with device-to-host transfers. The number of streams you should use for best performance is hard to guess because it depends on the number of DMA engines you have, the number of SMs, and the balance between synchronization and the amount of concurrency. To get an idea you can read this presentation (for instance, slides 5 and 6 explain the idea very well).
Edit: I agree that using a profiler is needed as a first step.

Why do we even need cache coherence?

In languages like C, unsynchronized reads and writes to the same memory location from different threads are undefined behavior. But in the CPU, cache coherence says that if one core writes to a memory location and later another core reads it, the other core has to read the written value.
Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away? Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?
The acquire and release semantics required for C++11 std::mutex (and equivalents in other languages, and earlier stuff like pthread_mutex) would be very expensive to implement if you didn't have coherent cache. You'd have to write back every dirty line every time you released a lock, and evict every clean line every time you acquired a lock, if you couldn't count on the hardware to make your stores visible, and to make your loads not take stale data from a private cache.
But with cache coherency, acquire and release are just a matter of ordering this core's accesses to its own private cache which is part of the same coherency domain as the L1d caches of other cores. So they're local operations and pretty cheap, not even needing to drain the store buffer. The cost of a mutex is just in the atomic RMW operation it needs to do, and of course in cache misses if the last core to own the mutex wasn't this one.
C11 and C++11 added stdatomic and std::atomic respectively, which make it well-defined to access shared _Atomic int variables, so it's not true that higher level languages don't expose this. It would hypothetically be possible to implement on a machine that required explicit flushes/invalidates to make stores visible to other cores, but that would be very slow. The language model assumes coherent caches, not providing explicit flushes of ranges but instead having release operations that make every older store visible to other threads that do an acquire load that syncs-with the release store in this thread. (See When to use volatile with multi threading? for some discussion, although that answer is mainly debunking the misconception that caches could have stale data, from people mixed up by the fact that the compiler can "cache" non-atomic non-volatile values in registers.)
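As an illustration of that release/acquire pairing, here is a minimal C11 sketch using stdatomic.h and pthreads. The names (payload, ready, run_publish_demo) are hypothetical. Note what is absent: there is no explicit flush or invalidate of any address range, only an ordering constraint on one atomic flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;         /* plain, non-atomic data */
static atomic_int ready;    /* flag that publishes the payload */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                   /* ordinary store */
    /* Release: every older store in this thread becomes visible to a
     * thread whose acquire load reads the value written here. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    int *seen = arg;
    /* Acquire: once this load reads 1, it synchronizes with the release
     * store, so the read of payload below is not a data race. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                           /* spin */
    *seen = payload;
    return NULL;
}

/* Runs one producer and one consumer; returns what the consumer read. */
int run_publish_demo(void)
{
    pthread_t p, c;
    int seen = -1;

    payload = 0;
    atomic_store(&ready, 0);

    int rc_c = pthread_create(&c, NULL, consumer, &seen);
    int rc_p = pthread_create(&p, NULL, producer, NULL);
    assert(rc_c == 0 && rc_p == 0);
    (void)rc_c; (void)rc_p;

    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

Calling run_publish_demo() returns 42 under the C11 memory model. On cache-coherent hardware the release store and acquire load only constrain the ordering of this core's own accesses, which is what keeps them cheap.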
In fact, some of the guarantees on C++ atomic are actually described by the standard as exposing HW coherence guarantees to software, like "write-read coherence" and so on, ending with the note:
http://eel.is/c++draft/intro.races#19
[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note
(Long before C11 and C++11, SMP kernels and some user-space multithreaded programs were hand-rolling atomic operations, using the same hardware support that C11 and C++11 finally exposed in a portable way.)
Also, as pointed out in comments, coherent cache is essential for writes to different parts of the same line by other cores to not step on each other.
ISO C11 guarantees that a char arr[16] can have arr[0] written by one thread while another writes arr[1]. If those are both in the same cache line, and two conflicting dirty copies of the line exist, only one can "win" and be written back. (See: C++ memory model and race conditions on char arrays.)
ISO C effectively requires char to be as large as the smallest unit you can write without disturbing surrounding bytes. On almost all machines (not early Alpha and not some DSPs), that's a single byte, even if a byte store might take an extra cycle to commit to L1d cache vs. an aligned word on some non-x86 ISAs.
The language didn't officially require this until C11, but that just standardized what "everyone knew" the only sane choice had to be, i.e. how compilers and hardware already worked.
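The adjacent-byte guarantee can be sketched in C like this. Two threads each do plain, non-atomic read-modify-writes on their own element of the same small array, and neither update may be lost. The function name and iteration count are arbitrary choices for illustration. With so few iterations the two threads may not even overlap in time, so this shows what the language promises; it is not a stress test of the hardware.

```c
#include <assert.h>
#include <pthread.h>

enum { INCREMENTS = 100 };

/* Neighbouring bytes, very likely sharing one cache line. */
static unsigned char arr[16];

static void *bump(void *arg)
{
    unsigned char *slot = arg;
    for (int i = 0; i < INCREMENTS; i++)
        (*slot)++;      /* non-atomic RMW, but only on this thread's byte */
    return NULL;
}

/* Returns 1 if both bytes hold INCREMENTS afterwards, i.e. no write to
 * one byte clobbered its neighbour. */
int adjacent_bytes_survive(void)
{
    pthread_t t0, t1;

    arr[0] = 0;
    arr[1] = 0;

    int rc0 = pthread_create(&t0, NULL, bump, &arr[0]);
    int rc1 = pthread_create(&t1, NULL, bump, &arr[1]);
    assert(rc0 == 0 && rc1 == 0);
    (void)rc0; (void)rc1;

    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return arr[0] == INCREMENTS && arr[1] == INCREMENTS;
}
```

Because arr[0] and arr[1] are distinct memory locations, C11 treats this as race-free. On a machine with incoherent caches, each core could hold its own dirty copy of the line, and one write-back could overwrite the other thread's byte.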
Ah, a very deep topic indeed!
Cache coherency between cores is used to synthesise (as closely as possible) a Symmetric Multi-Processing (SMP) environment. This harks back to the days when multiple single-core CPUs were simply tagged on to the same single memory bus, circa mid 1990s, caches weren't really a thing, etc. With multiple CPUs with multiple cores, each with multiple caches and multiple memory interfaces per CPU, the synthesis of an SMP-like environment is a lot more complicated, and cache coherency is a big part of that.
So, when one asks, "Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away?", one is really asking "Do we still need an SMP environment?".
The answer is software. An awful lot of software, including all major OSes, has been written around the assumption that they're running on an SMP environment. Take away the SMP, and we'd have to re-write literally everything.
There are now various sage commentators beginning to wonder in articles whether SMP is in fact a dead end, and that we should start worrying about how to get out of that dead end. I think that it won't happen for a good long while yet; the CPU manufacturers have likely got a few more tricks to play to get ever increasing performance, and whilst that keeps being delivered no one will want to suffer the pain of software incompatibility. Security is another reason to avoid SMP - Meltdown and Spectre exploit weaknesses in the way SMP has been synthesised - but I'd guess that whilst other mitigations (however distasteful) are available security alone will not be sufficient reason to ditch SMP.
"Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?" Why not, indeed? We have been there before. Transputers (1980s, early 1990s) implemented Communicating Sequential Processes (CSP), where if the application needed a different CPU to process some data, the application would have to purposefully transfer data to that CPU. The transfers are (in CSP speak) through "Channels", which are more like network sockets or IPC pipes and not at all like shared memory spaces.
CSP is having something of a resurgence - as a multiprocessing paradigm it has some very beneficial features - and languages such as Go, Rust, Erlang implement it. The thing about those languages' implementations of CSP is that they're having to synthesise CSP on top of an SMP environment, which in turn is synthesised on top of an electronic architecture much more reminiscent of Transputers!
Having had a lot of experience with CSP, my view is that every multi-process piece of software should use CSP; it's a lot more reliable. The "performance hit" of "copying" data (which is what you have to do to do CSP properly on top of SMP) isn't so bad; it's about the same amount of traffic over the cache-coherency connections to copy data from one CPU to another as it is to access the data in an SMP-like way.
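To give a feel for what "CSP on top of SMP" looks like, here is a small C sketch of a channel built from a pthread mutex and condition variable. It is a simplification: a true CSP channel is a rendezvous (the sender blocks until the receiver takes the value), whereas this one has a single buffer slot. All names (channel, chan_send, chan_recv, sum_via_channels) are hypothetical. The point is that threads exchange copies of values and never touch each other's variables directly.

```c
#include <assert.h>
#include <pthread.h>

/* One-slot channel carrying ints by value. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int             full;
    int             value;
} channel;

static void chan_init(channel *c)
{
    pthread_mutex_init(&c->mu, NULL);
    pthread_cond_init(&c->cv, NULL);
    c->full = 0;
    c->value = 0;
}

static void chan_send(channel *c, int v)
{
    pthread_mutex_lock(&c->mu);
    while (c->full)                     /* wait until the slot is free */
        pthread_cond_wait(&c->cv, &c->mu);
    c->value = v;                       /* the data is copied in */
    c->full = 1;
    pthread_cond_broadcast(&c->cv);
    pthread_mutex_unlock(&c->mu);
}

static int chan_recv(channel *c)
{
    pthread_mutex_lock(&c->mu);
    while (!c->full)                    /* wait until a value arrives */
        pthread_cond_wait(&c->cv, &c->mu);
    int v = c->value;                   /* and copied out */
    c->full = 0;
    pthread_cond_broadcast(&c->cv);
    pthread_mutex_unlock(&c->mu);
    return v;
}

typedef struct { channel *in; channel *out; int count; } worker_args;

/* Worker: receives `count` values, then sends back their sum. */
static void *worker(void *p)
{
    worker_args *a = p;
    int sum = 0;
    for (int i = 0; i < a->count; i++)
        sum += chan_recv(a->in);
    chan_send(a->out, sum);
    return NULL;
}

/* Sends 1..n to a worker thread and returns the sum it reports. */
int sum_via_channels(int n)
{
    channel in, out;
    chan_init(&in);
    chan_init(&out);

    worker_args args = { &in, &out, n };
    pthread_t t;
    int rc = pthread_create(&t, NULL, worker, &args);
    assert(rc == 0);
    (void)rc;

    for (int i = 1; i <= n; i++)
        chan_send(&in, i);
    int total = chan_recv(&out);
    pthread_join(t, NULL);
    return total;
}
```

sum_via_channels(10) returns 55. The mutex and condition variable underneath still rely on the coherent shared memory the hardware provides, which is the layering described above.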
Rust is very interesting, because with its syntax strongly expressing data ownership I suspect that it doesn't have to copy data to implement CSP; it can transfer ownership between threads (processes). Thus it may be getting the benefits of CSP, but without having to copy the data. Therefore it could be very efficient CSP, even if every thread is running on a single CPU core. I've not yet explored Rust deeply enough to know that that is what it's doing, but I have hopes.
One of the nice things about CSP is that with Channels being like network sockets or IPC pipes, one can readily implement CSP across actual network sockets. Raw sockets are not in themselves ideal: they're asynchronous and so more akin to Actor Model (as is ZeroMQ). Actor Model is fairly OK, and I've used it, but it's not as guaranteed to be devoid of runtime problems as CSP is. So one has to implement the CSP bit oneself or find a library. However, with that in place CSP becomes a software architecture that can more easily span arbitrary networks of computers without having to change the software architecture; a local channel and a network channel are "the same", except the network one is a bit slower.
It's a lot harder to take a multithreaded piece of software that assumes SMP, uses semaphores, etc to scale up across multiple machines on a network. In fact, it can't, and has to be re-written.
More recently than Transputers, the Cell processor (of PlayStation 3 fame) was a multi-core device that did exactly as you suggest. It had a single CPU core, and 8 SPE maths cores each with 256k of on-chip core-speed static RAM. To use the SPEs you had to write software to ship code and data in and out of that 256k (there was a monster-fast internal ring bus for doing this, and a very fast external memory interface). The result was that, with the right developer, very good results could be attained.
It took Intel about a further 10 years to usefully get x64 up to about the same performance; adding a Fused Multiply-Add instruction into SSE was what finally got them there, an instruction they'd been keeping in Itanium's repertoire in the vain hope of boosting its appeal. Cell (the SPEs were based on the PowerPC equivalent of SSE, AltiVec) had had an FMA instruction from the get-go.
Cache coherency is not needed if the developer takes care of issuing lock (+ memory barrier) / (memory barrier +) unlock regardless of it.
Cache coherency is of little value, or even has negative value, in terms of cost, power, performance, validation, etc.
Today, software is more and more distributed. In any case, coherency can't help two processes running on two different machines. Even for multi-threaded software, we end up using IPC.
Only a small part of the data is shared in multi-threaded software. A large part is not shared; where it is shared, memory barriers should take care of cache syncing.
In practice, software developers depend on explicit locks to access shared data. These locks can flush the caches efficiently with hardware assistance ("efficiently" meaning only the cache lines that are modified AND also cached elsewhere). This is already exactly what happens when we lock/unlock. Since everyone does this lock/unlock anyway, cache coherency is redundant and a waste of silicon area and power, and of hardware engineers' sleep.
All compilers (at least C/C++ and the Python VM) generate code for single-threaded, non-shared data. If I need to share data, I just declare that it is shared, but not how or why (volatile?). Developers need to take care of managing it (again with lock/unlock across hardware cores, software threads, and hardware threads). Most of the time we write in a high-level language with non-atomic data. Cache coherency does not add any value for developers, so they fall back to managing sharing through locks, which instruct the cache system to flush efficiently. All cache systems keep logs of cache lines so they can flush efficiently with or without coherency support. (Think of cached but non-coherent memory: this memory still has logs that can be used for efficient flushing.)
Cache coherency is very complex on silicon, consuming area and power.
In any case, software developers take care of issuing memory barriers (via locks).
So I think it would be good to get rid of coherency and tell developers to own it.
But I see that the trend is the opposite. Look at CXL memory etc.: it is coherent.
I am looking for a system call with which I can just turn off cache coherency for my threads and experiment.

How much less kernel launch overhead does Tesla have compared to GeForce?

That is, a Tesla (Fermi or Kepler) in TCC mode compared to a GeForce (same generation) with WDDM?
The program I wrote has some very serious problems with kernel launch overhead because it has to launch kernels repeatedly. The overhead is so large that I have had to merge many kernels together, trading memory space for fewer kernel launches; however, this only works up to a point, thanks to the large size of GPU memory.
I have heard that TCC mode can have less overhead, but can it bring the overhead down to CPU level?
I have read some benchmarks showing that, at least for the GeForce GTX 280, kernel-call overhead is thousands of times longer than the function-call overhead on a CPU, and for methods that require a large number of repeated iterations this makes a huge performance difference.
The WDDM driver batches kernel launches together to reduce overhead. So if you are able to merge kernels to reduce launch overhead, the WDDM driver can batch those launches as well (unless you use CUDA calls in between that prevent batching). Thus switching to TCC mode will not gain you much in this specific use case.
Are you sure the problem is launch overhead and not something else? How many separate kernels are you launching and how long does this take?
It could well be (particularly in the case of very small kernels, where launch overhead would be noticeable) that merging the kernels together allows the compiler to better optimize them, e.g. to eliminate the writing out and reading back of intermediate results to global memory.
I was launching 16 kernels and the speed was X; when I merged them all into a single launch, the speed was 10X. Merging the kernels added some overhead of its own, but the results were great.
This is a many-core architecture: if you cannot make use of that (by launching the largest job size you can), you are wasting the overhead you paid to launch the kernel.
I hope this helps you.

Is there a difference between a real time system and one that is just deterministic?

At work we're discussing the design of a new platform and one of the upper management types said it needed to run our current code base (C on Linux) but be real time because it needed to respond in less than a second to various inputs. I pointed out that:
1. That point doesn't mean it needs to be "real time", just that it needs a faster clock and more streamlining in its interrupt handling.
2. One of the key points to consider is the OS that's being used. They wanted to stick with embedded Linux; I pointed out we need an RTOS. Using Linux will prevent "real time" because of the kernel/user space memory split, thus I/O is done via files and sockets, which introduce a delay.
3. What we really need to determine is if it needs to be deterministic (needs to respond to input in <200ms 90% of the time, for example).
Really in my mind if point 3 is true, then it needs to be a real time system, and then point 2 is the biggest consideration.
I felt confident answering, but then I was thinking about it later... What do others think? Am I on the right track here or am I missing something?
Is there any difference that I'm missing between a "real time" system and one that is just "deterministic"? And besides a RTC and a RTOS, am I missing anything major that is required to execute a true real time system?
Look forward to some great responses!
EDIT:
Got some good responses so far, looks like there's a little curiosity about my system and requirements so I'll add a few notes for those who are interested:
My company sells units in the tens of thousands, so I don't want to go overkill on the price
Typically we sell a main processor board and an independent display. There's also an attached network of other CAN devices.
The board (currently) runs the devices and also acts as a webserver sending basic XML docs to the display for end users
The requirements come in here where management wants the display to be updated "quickly" (<1s), however the true constraints IMO come from the devices that can be attached over CAN. These devices are frequently motor controlled devices with requirements including "must respond in less than 200ms".
You need to distinguish between:
Hard realtime: there is an absolute limit on response time that must not be breached (a breach counts as a failure). This is appropriate, for example, when you are controlling robotic motors or medical devices, where failure to meet a deadline could be catastrophic.
Soft realtime: there is a requirement to respond quickly most of the time (perhaps 99.99%+), but it is acceptable for the time limit to be breached occasionally, provided the response is very fast on average. This is appropriate, for example, when performing realtime animation in a computer game: missing a deadline might cause a skipped frame but won't fundamentally ruin the gaming experience.
Soft realtime is readily achievable in most systems as long as you have adequate hardware and pay sufficient attention to identifying and optimising the bottlenecks. With some tuning, it's even possible to achieve in systems that have non-deterministic pauses (e.g. the garbage collection in Java).
Hard realtime requires dedicated OS support (to guarantee scheduling) and deterministic algorithms (so that once scheduled, a task is guaranteed to complete within the deadline). Getting this right is hard and requires careful design over the entire hardware/software stack.
It is important to note that most business apps don't require either: in particular I think that targeting a <1sec response time is far away from what most people would consider a "realtime" requirement. Having said that, if a response time is explicitly specified in the requirements then you can regard it as soft realtime with a fairly loose deadline.
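The hard/soft distinction can be stated as a check over measured response times. The sketch below is illustrative only: the function names are hypothetical, and the samples, the 200 ms deadline and the 10% tolerance in the note are example numbers taken from the question. Measurement can refute a hard guarantee but never prove one, since proof needs worst-case analysis of the whole stack.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of samples that exceeded the deadline (0.0 = no misses). */
double deadline_miss_ratio(const double *latency_ms, size_t n,
                           double deadline_ms)
{
    assert(n > 0);
    size_t misses = 0;
    for (size_t i = 0; i < n; i++)
        if (latency_ms[i] > deadline_ms)
            misses++;
    return (double)misses / (double)n;
}

/* Hard realtime: a single miss in the observed data is a failure. */
int meets_hard_deadline(const double *latency_ms, size_t n,
                        double deadline_ms)
{
    return deadline_miss_ratio(latency_ms, n, deadline_ms) == 0.0;
}

/* Soft realtime: misses are tolerated up to max_miss_ratio. */
int meets_soft_deadline(const double *latency_ms, size_t n,
                        double deadline_ms, double max_miss_ratio)
{
    return deadline_miss_ratio(latency_ms, n, deadline_ms) <= max_miss_ratio;
}
```

For ten samples of which one is 250 ms, a 200 ms deadline gives a miss ratio of 0.1. That satisfies a soft requirement of "within 200 ms 90% of the time", but it would already count as a failure for a hard requirement.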
From the definition of the real-time tag:
A task is real-time when the timeliness of the activities' completion is a functional requirement and correctness condition, rather than merely a performance metric. A real-time system is one where some (though perhaps not all) of the tasks are real-time tasks.
In other words, if something bad will happen if your system responds too slowly to meet a deadline, the system needs to be real-time and you will need a RTOS.
A real-time system does not need to be deterministic: if the response time randomly varies between 50ms and 150ms but the response time never exceeds 150ms then the system is non-deterministic but it is still real-time.
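The 50 ms to 150 ms example can be written down as two separate properties, bounded and variable. This is a hedged sketch with hypothetical names (latency_range, is_bounded, jitter_ms), and the sample values in the note are invented to match the example above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double min_ms;
    double max_ms;
} latency_range;

/* Smallest and largest observed response time. */
latency_range measure_range(const double *latency_ms, size_t n)
{
    assert(n > 0);
    latency_range r = { latency_ms[0], latency_ms[0] };
    for (size_t i = 1; i < n; i++) {
        if (latency_ms[i] < r.min_ms) r.min_ms = latency_ms[i];
        if (latency_ms[i] > r.max_ms) r.max_ms = latency_ms[i];
    }
    return r;
}

/* "Real-time" in the sense above: the worst case stays within the deadline. */
int is_bounded(latency_range r, double deadline_ms)
{
    return r.max_ms <= deadline_ms;
}

/* Spread between best and worst case; 0 would mean fully repeatable timing. */
double jitter_ms(latency_range r)
{
    return r.max_ms - r.min_ms;
}
```

For samples {50, 150, 75, 120} the worst case is 150 ms, so a 150 ms deadline is met, yet the jitter is 100 ms. The timing is bounded but far from repeatable.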
Maybe you could try RTLinux or RTAI if you have sufficient time to experiment. With these, you can keep the non-realtime applications on Linux, while the realtime applications are moved to the RTOS part. In that case, you might achieve the <1 second response time.
The advantages are -
Large amount of code can be re-used
You can manually partition realtime and non-realtime tasks and try to achieve the response <1s as you desire.
I think migration time will not be very high, since most of the code will be in linux
Just on a sidenote be careful about the hardware drivers that you might need to run on the realtime part.
The following architecture of RTLinux might help you to understand how this can be possible.
It sounds like you're on the right track with the RTOS. Different RTOSs prioritize different things, such as robustness or speed. You will need to figure out whether you need a hard or soft RTOS and, based on what you need, how your scheduler is going to be driven. One thing is for sure: there is a serious difference between using a regular OS and an RTOS.
Note: perhaps for the truest real-time system you will need hard event-based resolution so that you can guarantee that your processes will execute when you expect them to.
An RTOS, or real-time operating system, is designed for embedded applications. In a multitasking system that handles critical applications, the operating system must
1. be deterministic in memory allocation,
2. allocate CPU time to different threads, tasks, and processes,
3. have a preemptive kernel, which means a higher-priority task can take over the CPU as soon as it becomes ready instead of waiting for the running task to finish, etc.
So normal Windows or Linux cannot be used.
Examples of an RTOS in an embedded system: satellites, Formula 1 cars, car navigation systems.
Embedded system: a system that is designed to perform a single or a few dedicated functions.
System with an RTOS: this can also be an embedded system, but an RTOS will naturally be used in a real-time system that needs to perform many functions.
Real-time system: a system that can provide its output in a definite/predictable amount of time. This does not mean that real-time systems are faster.
Difference between the two:
1. Normal embedded systems are not real-time systems.
2. Systems with an RTOS are real-time systems.

Multiprocessors vs Multithreading in the context of PThreads

I have an application level (PThreads) question regarding choice of hardware and its impact on software development.
I have working multi-threaded code tested well on a multi-core single CPU box.
I am trying to decide what to purchase for my next machine:
A 6-core single CPU box
A 4-core dual CPU box
My question is, if I go for the dual CPU box, will that impact the porting of my code in a serious way? Or can I just allocate more threads and let the OS handle the rest?
In other words, is multiprocessor programming any different from (single CPU) multithreading in the context of a PThreads application?
I thought it would make no difference at this level, but when configuring a new box, I noticed that one has to buy separate memory for each CPU. That's when I hit some cognitive dissonance.
More Detail Regarding the Code (for those who are interested): I read a ton of data from disk into a huge chunk of memory (~24GB soon to be more), then I spawn my threads. That initial chunk of memory is "read-only" (enforced by my own code policies) so I don't do any locking for that chunk. I got confused as I was looking at 4-core dual CPU boxes - they seem to require separate memory. In the context of my code, I have no idea what will happen "under the hood" if I allocate a bunch of extra threads. Will the OS copy my chunk of memory from one CPU's memory bank to another? This would impact how much memory I would have to buy (raising the cost for this configuration). The ideal situation (cost-wise and ease-of-programming-wise) is to have the dual CPU share one large bank of memory, but if I understand correctly, this may not be possible on the new Intel dual core MOBOs (like the HP ProLiant ML350e)?
Modern CPUs [1] handle RAM locally and use a separate channel [2] to communicate between them. This is a consumer-level version of the NUMA architecture, created for supercomputers more than a decade ago.
The idea is to avoid a shared bus (the old FSB) that can cause heavy contention because it's used by every core to access memory. As you add more NUMA cells, you get higher bandwidth. The downside is that memory becomes non-uniform from the point of view of the CPU: some RAM is faster than others.
Of course, modern OS schedulers are NUMA-aware, so they try to reduce the migration of a task from one cell to another. Sometimes it's okay to move from one core to another in the same socket; sometimes there's a whole hierarchy specifying which resources (1-,2-,3-level cache, RAM channel, IO, etc) are shared and which aren't, and that determines if there would be a penalty or not by moving the task. Sometimes it can determine that waiting for the right core would be pointless and it's better to shovel the whole thing to another socket....
In the vast majority of cases, it's best to let the scheduler do what it knows best. If not, you can play around with numactl.
As for the specific case of a given program, the best architecture depends heavily on the level of resource sharing between threads. If each thread has its own playground and mostly works alone within it, a smart enough allocator would prioritize local RAM, making it less important on which cell each thread happens to be.
If, on the other hand, objects are allocated by one thread, processed by another and consumed by a third; performance would suffer if they're not on the same cell. You could try to create small thread groups and limit heavy sharing within the group, then each group could go on a different cell without problem.
The worst case is when all threads participate in a great orgy of data sharing. Even if you have all your locks and processes well debugged, there won't be any way to optimize it to use more cores than what are available on a cell. It might even be best to limit the whole process to just use the cores in a single cell, effectively wasting the rest.
[1] By modern, I mean any 64-bit AMD chip, and Nehalem or better for Intel.
[2] AMD calls this channel HyperTransport; Intel's name for it is QuickPath Interconnect.
EDIT:
You mention that you initialize "a big chunk of read-only memory". And then spawn a lot of threads to work on it. If each thread works on its own part of that chunk, then it would be a lot better if you initialize it on the thread, after spawning it. That would allow the threads to spread to several cores, and the allocator would choose local RAM for each, a much more effective layout. Maybe there's some way to hint the scheduler to migrate away the threads as soon as they're spawned, but I don't know the details.
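A minimal pthreads sketch of "initialize it on the thread" might look like the following. It assumes the Linux default first-touch policy, where a page is placed on the NUMA node of the thread that first writes to it, and that malloc of a large block does not touch the pages itself. The memset stands in for whatever actually fills the data (in the question, reading from disk). The function name and the 64-thread cap are arbitrary.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

enum { MAX_THREADS = 64 };

typedef struct {
    unsigned char *base;
    size_t         len;
    unsigned char  fill;
} slice;

/* Runs on the worker: the first write to each page happens here, so under
 * a first-touch policy the page lands near this thread. */
static void *touch_slice(void *p)
{
    slice *s = p;
    memset(s->base, s->fill, s->len);
    return NULL;
}

/* Allocates n bytes and lets each of nthreads threads initialise its own
 * contiguous slice. Returns NULL if malloc fails; the caller frees. */
unsigned char *alloc_first_touch(size_t n, int nthreads)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    unsigned char *buf = malloc(n);
    if (buf == NULL)
        return NULL;

    pthread_t tid[MAX_THREADS];
    slice     sl[MAX_THREADS];
    size_t    chunk = n / (size_t)nthreads;

    for (int i = 0; i < nthreads; i++) {
        sl[i].base = buf + (size_t)i * chunk;
        /* the last slice also takes the remainder */
        sl[i].len  = (i == nthreads - 1) ? n - (size_t)i * chunk : chunk;
        sl[i].fill = (unsigned char)(i + 1);
        int rc = pthread_create(&tid[i], NULL, touch_slice, &sl[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return buf;
}
```

In a real program the same threads would go on to process the slices they initialized, rather than being joined straight away as in this sketch. Whether the threads actually end up on different sockets is still up to the scheduler.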
EDIT 2:
If your data is read verbatim from disk, without any processing, it might be advantageous to use mmap instead of allocating a big chunk and read()ing. There are some common advantages:
No need to preallocate RAM.
The mmap operation is almost instantaneous and you can start using it. The data will be read lazily as needed.
The OS can be way smarter than you when choosing between application, mmaped RAM, buffers and cache.
It's less code!
Data that isn't needed won't be read and won't use up RAM.
You can specifically mark as read-only. Any bug that tries to write will cause a coredump.
Since the OS knows it's read-only, it can't be 'dirty', so if the RAM is needed, it will simply discard it, and reread when needed.
but in this case, you also get:
Since data is read lazily, each RAM page would be chosen after the threads have spread on all available cores; this would allow the OS to choose pages close to the process.
So, I think that if two conditions hold:
the data isn't processed in any way between disk and RAM
each part of the data is read (mostly) by one single thread, not touched by all of them.
then, just by using mmap, you should be able to take advantage of machines of any size.
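For reference, mapping a whole file read-only takes only a few POSIX calls. This is a sketch under assumptions: the helper name map_file_readonly is made up, error handling is minimal, and a 64-bit build is assumed so a multi-gigabyte file fits in the address space.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps the whole file read-only. Returns NULL on failure (including an
 * empty file); on success stores the mapping length in *len. Release the
 * mapping with munmap when done. */
const unsigned char *map_file_readonly(const char *path, size_t *len)
{
    assert(path != NULL && len != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return NULL;
    }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    /* PROT_READ: any stray write through this mapping faults at once.
     * Pages are read from disk lazily, on first access. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);      /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

    *len = (size_t)st.st_size;
    return p;
}
```

Each worker thread can then index into the returned pointer directly. Because pages are faulted in on first access, the thread that first reads a page determines when it is loaded.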
If each part of the data is read by more than one single thread, maybe you could identify which threads will (mostly) share the same pages, and try to hint the scheduler to keep these in the same NUMA cell.
For the x86 boxes you're looking at, the fact that memory is physically wired to different CPU sockets is an implementation detail. Logically, the total memory of the machine appears as one large pool: you wouldn't need to change your application code for it to run correctly across both CPUs.
Performance, however, is another matter. There is a speed penalty for cross-socket memory access, so the unmodified program may not run to its full potential.
Unfortunately, it's hard to say ahead of time whether your code will run faster on the 6-core, one-node box or the 8-core, two-node box. Even if we could see your code, it would ultimately be an educated guess. A few things to consider:
The cross-socket memory access penalty only kicks in on a cache miss, so if your program has good cache behaviour then NUMA won't hurt you much;
If your threads are all writing to private memory regions and you're limited by write bandwidth to memory, then the dual-socket machine will end up helping;
If you're compute-bound rather than memory-bandwidth-bound then 8 cores is likely better than 6;
If your performance is bounded by cache read misses then the 6 core single-socket box starts to look better;
If you have a lot of lock contention or writes to shared data then again this tends to favor the single-socket box.
There are a lot of variables, so the best thing to do is to ask your HP reseller for loaner machines matching the configurations you're considering. You can then test your application out, see where it performs best and order your hardware accordingly.
Without more details, it's hard to give a detailed answer. However, hopefully the following will help you frame the problem.
If your thread code is proper (e.g. you properly lock shared resources), you should not experience any bugs introduced by the change of hardware architecture. Improper threading code can sometimes be masked by the specifics of how a specific platform handles things like CPU cache access/sharing.
You may experience a change in application performance per equivalent core due to differing approaches to memory and cache management in the single chip, multi core vs. multi chip alternatives.
Specifically, if you are looking at hardware that has separate memory per CPU, I would assume that each thread is going to be locked to the CPU it starts on (otherwise, the system would have to incur significant overhead to move a thread's memory to memory dedicated to a different CPU). That may reduce overall system efficiency depending on your specific situation. However, separate memory per CPU also means that the different CPUs do not compete with each other for a given cache line (the 4 cores on each of the dual CPUs will still potentially compete for cache lines, but that is less contention than if 6 cores are competing for the same cache lines).
Cache line contention of this kind, where threads on different cores write to distinct data that happens to share a cache line, is called false sharing. I suggest the following read to understand whether that may be an issue you are facing:
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206?pgno=3
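To make the false sharing point concrete, here is a minimal sketch of the usual workaround: give each thread's counter its own cache line. The 64-byte line size, the `MAX_THREADS` value and the names `padded_counter` and `count_events` are assumptions made for illustration, not something from the question. The `_Alignas` keyword needs a C11 compiler, and the OpenMP pragma is simply ignored if OpenMP is not enabled.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumption: 64-byte cache lines */
#define MAX_THREADS 8        /* hypothetical number of worker slots */

/* Aligning the member to the cache line size also rounds the struct size
 * up to that size, so two slots never share a cache line. */
struct padded_counter {
    _Alignas(CACHE_LINE_SIZE) long count;
};

/* Every slot is incremented events_per_thread times by exactly one
 * loop iteration, so no two threads ever write to the same slot. */
long count_events(long events_per_thread)
{
    assert(events_per_thread >= 0);

    struct padded_counter slots[MAX_THREADS] = {{0}};

    #pragma omp parallel for
    for (int t = 0; t < MAX_THREADS; ++t) {
        for (long i = 0; i < events_per_thread; ++i)
            slots[t].count += 1;   /* touches only this slot's cache line */
    }

    /* Combine the private counters once, after the parallel part. */
    long total = 0;
    for (int t = 0; t < MAX_THREADS; ++t)
        total += slots[t].count;
    return total;
}
```

With a plain `long count[MAX_THREADS]` array the result would be the same, but neighbouring counters would sit in one cache line and the cores could end up passing that line back and forth. Whether this matters for your application is something only a measurement will tell you.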
Bottom line is, application behavior should be stable (other than things that naturally depend on the details of thread scheduling) if you followed proper thread development practices, but performance could go either way depending on exactly what you are doing.

OpenMP debug newbie questions

I am starting to learn OpenMP, running examples (with gcc 4.3) from https://computing.llnl.gov/tutorials/openMP/exercise.html in a cluster. All the examples work fine, but I have some questions:
How do I know on which nodes (or cores of each node) the different threads have been run?
In the case of nodes, what is the average transfer time, in microseconds or nanoseconds, for sending the data and getting it back?
What are the best tools for debugging OpenMP programs?
What is the best advice for speeding up real programs?
Typically your OpenMP program does not know, nor does it care, on which cores it is running. If you have a job management system, it may provide the information you want in its log files. Failing that, you could probably insert calls to the environment inside your threads and check the value of some environment variable. What that is called and how you do this is platform dependent; I'll leave figuring it out up to you.
How the heck should I (or any other SOer) know? For an educated guess you'd have to tell us a lot more about your hardware, o/s, run-time system, etc. The best answer to the question is the one you determine from your own measurements. I fear that you may also be mistaken in thinking that information is sent around the computer: in shared-memory programming, variables usually stay in one place and are not sent or received (or at least you should think of them as staying in one place; the reality may be a lot messier but is also impossible to discern).
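As one platform-dependent possibility, on Linux with glibc a thread can ask which CPU it is currently running on with `sched_getcpu()`. The sketch below assumes exactly that platform; the function name `report_thread_placement` is made up for the example. The answer is only a snapshot, since the OS may move the thread right after the call unless you pin it.

```c
#define _GNU_SOURCE            /* glibc only exposes sched_getcpu() with this */
#include <sched.h>
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Prints one line per thread with the CPU it is on at this moment.
 * Returns the number of threads that reported. */
int report_thread_placement(FILE *out)
{
    assert(out != NULL);
    int reported = 0;

    #pragma omp parallel reduction(+:reported)
    {
        int tid = 0;               /* stays 0 when built without OpenMP */
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        int cpu = sched_getcpu();  /* -1 if the call is not supported */
        fprintf(out, "thread %d is on cpu %d\n", tid, cpu);
        reported += 1;
    }
    return reported;
}
```

Calling `report_thread_placement(stdout)` from `main`, built with `gcc -fopenmp`, prints one line per OpenMP thread; built without the flag it prints a single line for the main thread.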
Parallel debuggers such as TotalView or DDT are probably the best tools. I haven't yet used Intel's debugger's parallel capabilities but they look promising. I'll leave it to less well-funded programmers than me to recommend FOSS options, but they are out there.
i) Select the fastest parallel algorithm for your problem. This is not necessarily the fastest serial algorithm made parallel.
ii) Test and measure. You can't optimise without data so you have to profile the program and understand where the performance bottlenecks are. Don't believe any advice along the lines that 'X is faster than Y'. Such statements are usually based on very narrow, and often outdated, cases and have become, in the minds of their promoters, 'truths'. It's almost always possible to find counter-examples. It's YOUR code YOU want to make faster, there's no substitute for YOUR investigations.
iii) Know your compiler inside out. The rate of return (measured in code speed improvements) on the time you spend adjusting compilation options is far higher than the rate of return from modifying the code 'by hand'.
iv) One of the 'truths' that I cling to is that compilers are not terrifically good at optimising for use of the memory hierarchy on current processor architectures. This is one area where code modification may well be worthwhile, but you won't know this until you've profiled your code.
By default you cannot know in advance: the placement of threads on the different cores is handled entirely by the OS. You speak about nodes, but OpenMP is a multi-thread (not multi-process) parallelization model that allows parallelization within one machine containing several cores. If you need parallelization across different machines you have to use a multi-process system such as MPI (Open MPI is one implementation).
The orders of magnitude of communication speeds are:
very high for communication between cores inside the same CPU; it can be considered practically instantaneous
~10 GB/s for communication between two CPUs across a motherboard
~100-1000 MB/s for network communication between nodes, depending on the hardware
All the theoretical speeds should be given in your hardware specifications. You should also run small benchmarks to see what you actually get.
For OpenMP, gdb does the job well, even with many threads.
I work in extreme physics simulation on supercomputers; here are our daily aims:
use as little communication as possible between the threads/processes; 99% of the time it is communication that kills performance in parallel jobs
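A small benchmark of this kind does not need much. The following is only a rough sketch that times repeated `memcpy` calls with the POSIX `clock_gettime` call and reports the copy rate; the names `now_seconds` and `copy_bandwidth_gbs` are invented for the example, and the figure you get depends heavily on buffer size, caches and compiler flags.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Wall-clock time in seconds from a monotonic clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Copies a buffer of `bytes` bytes `repeats` times and returns the
 * measured rate in GB/s, or -1.0 if the measurement could not be made. */
double copy_bandwidth_gbs(size_t bytes, int repeats)
{
    assert(bytes > 0 && repeats > 0);

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 1, bytes);   /* touch the pages before timing */
    memset(dst, 0, bytes);

    volatile unsigned char sink = 0;   /* keeps the copies from being dropped */
    double start = now_seconds();
    for (int r = 0; r < repeats; ++r) {
        memcpy(dst, src, bytes);
        sink ^= dst[(size_t)r % bytes];
    }
    double elapsed = now_seconds() - start;

    free(src);
    free(dst);
    if (elapsed <= 0.0)
        return -1.0;         /* run too short for the timer resolution */
    return ((double)bytes * (double)repeats) / elapsed / 1e9;
}
```

For example, `copy_bandwidth_gbs((size_t)1 << 28, 10)` copies a 256 MiB buffer ten times. Run it several times and with different sizes before trusting any single number.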
split the tasks optimally, machine load should be as close as possible to 100% all the time
test, tune, re-test, re-tune... Parallelization is not at all a generic "miracle solution"; it generally needs some practical work to be efficient.
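To illustrate the first aim in OpenMP terms, a minimal sketch (the function name `sum_array` is just for the example): instead of having every thread update one shared total, the `reduction` clause gives each thread a private partial sum and combines the partial sums once at the end, so the threads do not communicate inside the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n values. With OpenMP enabled each thread accumulates into its own
 * private copy of `total`; the copies are added together once at the end.
 * Without OpenMP the pragma is ignored and the loop runs serially. */
double sum_array(const double *values, long n)
{
    assert(n >= 0);
    assert(values != NULL || n == 0);

    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i)
        total += values[i];
    return total;
}
```

The alternative of protecting a shared `total` with a critical section or an atomic update inside the loop gives the same result, but makes the threads synchronize on every iteration, which is the kind of communication that tends to hurt scaling.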

Resources