How to get "sum" of parallel arrays in cuda? - arrays

My problem is about getting the "sum" of several same-length arrays. For example, I have an M*N (100 * 2000) float array in total, and I would like to get M (100) sum values, one for every N (2000) floats. I found two ways to do this job. One is calling a cuBLAS function such as cublasSasum in a for loop over M. The other is a self-written kernel that adds the numbers in a loop. My problem is the speed of these two ways and how to choose between them.
For the cuBLAS method, no matter how big N is (4000~2E6), the time consumed depends mainly on M, the number of loop iterations.
For the self-written kernel, the speed varies a lot with N. In detail, if N is small (below 5000) it runs much faster than the cuBLAS way, and then the time consumption increases as N increases.
N = 4000  | 10000 | 40000  | 80000  | 1E6    | 2E6
t = 254ms | 422ms | 1365ms | 4361ms | 5399ms | 10635ms
If N is big enough, it runs much slower than the cuBLAS way. My problem is how I could make a prediction from M or N to decide which way to use. My code may run on different GPU devices. Must I compare the speed in a parameter sweep and then "guess" the right choice on every GPU device, or can I infer it from the GPU device information?
Besides, for the kernel-function way I also have a problem deciding the blockSize and gridSize. I must note here that what I care about more is speed, not efficiency, because memory is limited. For example, with 8 GB of memory and a data format of 4-byte floats, if N is 1E5 then M is at most 2E4, which is smaller than the MaxGridSize. So I have the two launches below. I found that a bigger gridSize is always better, and I don't know the reason. Is it about the number of registers used per thread? I don't think this kernel needs many registers per thread.
Any suggestion or information would be appreciated. Thank you.
Cublas way
for (int j = 0; j < M; j++)
    cublasStatus = cublasSasum(cublasHandle, N, d_in + N*j, 1, d_out + j);
self-written kernel way
__global__ void getSum(int M, int N, float* in, float* out)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // one thread per row
    if (i < M) {
        float tmp = 0;
        for (int ii = 0; ii < N; ii++) {
            tmp += *(in + N*i + ii);                 // thread i walks through row i
        }
        out[i] = tmp;
    }
}
Bigger gridSize is faster. I don't know the reason.
getSum<<<M,1>>>(M, N, d_in, d_out); //faster
getSum<<<1,M>>>(M, N, d_in, d_out);
This is a blockSize-time parameter sweep result, with M = 1E4 and N = 1E5.
cudaEventRecord(start, 0);
// blockSize swept from 1 to 1024
int gridSize = (M + blockSize - 1) / blockSize;
getSum<<<gridSize, blockSize>>>(M, N, d_in, d_out);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
It seems I should choose a relatively small blockSize, like 10~200. I would just like to know why full occupancy (blockSize 1024) is slower. I post it here along with some possible reasons: the number of registers? Latency?

Using cuBLAS is generally a very good idea and should be preferred if there is a dedicated function doing what you want directly, especially for large datasets. That being said, your timings are very bad for a GPU kernel working on such a small dataset. Let us understand why.
Bigger gridSize is faster. I don't know the reason.
getSum<<<M,1>>>(M, N, d_in, d_out);
getSum<<<1,M>>>(M, N, d_in, d_out);
The syntax for calling a CUDA kernel is kernel<<<numBlocks, threadsPerBlock>>>. Thus the first line submits a kernel with M blocks of 1 thread. Don't do that: it is very inefficient. Indeed, the CUDA programming manual says:
The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors. [...]
The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. [...]
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
As a result, the first call creates M blocks of 1 thread, wasting 31 of the 32 CUDA cores available in each warp. It means that you will likely reach only about 3% of the peak performance of your GPU...
The second call creates one block of M threads. Because M is not a multiple of 32, a few CUDA cores are wasted. Moreover, it uses only 1 SM of the many available on your GPU because you have only one block. Modern GPUs have dozens of SMs (my GTX 1660S has 22 SMs). This means that you will use only a tiny fraction of your GPU's capability (a few percent). Not to mention that the memory access pattern is not contiguous, slowing down the computation even more...
If you want to use your GPU much more efficiently, you need to provide more parallelism and to waste fewer resources. You can start by writing a kernel that works on a 2D grid and performs a reduction using atomics. This is not perfect, but much better than your initial code. You should also take care to read memory contiguously (threads sharing the same warp should read/write a contiguous block of memory); see the sketch below.
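A minimal sketch of that idea (an illustration, not the definitive fix): one block per row, with consecutive threads of the block reading consecutive elements of the row so the accesses are coalesced, followed by a shared-memory reduction instead of atomics. It assumes the kernel is launched with exactly 256 threads per block.
// Sketch: one block per row; threads stride over the row so that consecutive
// threads touch consecutive addresses (coalesced), then reduce within the block.
// Assumes blockDim.x == 256 (a power of two) and out sized for M floats.
__global__ void rowSum(int N, const float* __restrict__ in, float* __restrict__ out)
{
    int row = blockIdx.x;
    const float* rowPtr = in + (size_t)row * N;

    float partial = 0.0f;
    for (int j = threadIdx.x; j < N; j += blockDim.x)
        partial += rowPtr[j];                      // coalesced reads along N

    __shared__ float sdata[256];
    sdata[threadIdx.x] = partial;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) { // block-wide tree reduction
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[row] = sdata[0];
}
// launch: rowSum<<<M, 256>>>(N, d_in, d_out);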
Please read the CUDA manual or some tutorials carefully before writing CUDA code. They describe all of this very well and accurately.
UPDATE:
Based on the new information, the problem you are experiencing with the blockSize is likely due to the strided memory accesses in the kernel (more specifically the N*i). Strided memory access patterns are slow and generally get slower as the stride gets bigger. In your kernel, each thread accesses a different block of memory. GPUs (and actually most hardware computing units) are optimized for accessing contiguous chunks of data, as previously said. If you want to solve this problem and get faster results, you need to work on the other dimension in parallel (so not M but N).
Furthermore, the BLAS calls are inefficient because each iteration of the loop on the CPU calls a kernel on the GPU. Calling a kernel introduces quite a big overhead (typically from a few microseconds up to ~100 us), so doing this in a loop executed tens of thousands of times will be very slow.
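One possible alternative (my suggestion, not part of the answer above): express all M row sums as a single matrix-vector product against a vector of ones, so only one cuBLAS call is launched instead of M. Here d_ones is assumed to be a device buffer holding N floats set to 1.0f.
// Viewed column-major, the row-major M x N data is an N x M matrix A whose
// column j is row j of the data, so out = A^T * ones gives the M row sums.
const float alpha = 1.0f, beta = 0.0f;
cublasStatus = cublasSgemv(cublasHandle, CUBLAS_OP_T,
                           N, M,        // A is N x M in the column-major view
                           &alpha,
                           d_in, N,     // leading dimension = N
                           d_ones, 1,
                           &beta,
                           d_out, 1);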

Related

what does STREAM memory bandwidth benchmark really measure?

I have a few questions on the STREAM (http://www.cs.virginia.edu/stream/ref.html#runrules) benchmark.
Below is the comment from stream.c. What is the rationale behind the requirement that the arrays should be 4 times the size of the cache?
* (a) Each array must be at least 4 times the size of the
* available cache memory. I don't worry about the difference
* between 10^6 and 2^20, so in practice the minimum array size
* is about 3.8 times the cache size.
I originally assumed STREAM measures the peak memory bandwidth. But I later found that when I add extra arrays and array accesses, I can get larger bandwidth numbers. So it looks to me that STREAM doesn't guarantee to saturate memory bandwidth. Then my question is: what does STREAM really measure, and how do you use the numbers reported by STREAM?
For example, I added two extra arrays and made sure to access them together with the original a/b/c arrays, and I modified the byte accounting accordingly. With these two extra arrays, my bandwidth number is bumped up by ~11.5%.
> diff stream.c modified_stream.c
181c181,183
< c[STREAM_ARRAY_SIZE+OFFSET];
---
> c[STREAM_ARRAY_SIZE+OFFSET],
> e[STREAM_ARRAY_SIZE+OFFSET],
> d[STREAM_ARRAY_SIZE+OFFSET];
192,193c194,195
< 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
< 3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE
---
> 5 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
> 5 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE
270a273,274
> d[j] = 3.0;
> e[j] = 3.0;
335c339
< c[j] = a[j]+b[j];
---
> c[j] = a[j]+b[j]+d[j]+e[j];
345c349
< a[j] = b[j]+scalar*c[j];
---
> a[j] = b[j]+scalar*c[j] + d[j]+e[j];
CFLAGS = -O2 -fopenmp -D_OPENMP -DSTREAM_ARRAY_SIZE=50000000
My last level cache is around 35MB.
Any comment?
Thanks!
This is for a Skylake Linux server.
Memory accesses in modern computers are a lot more complex than one might expect, and it is very hard to tell when the "high-level" model falls apart because of some "low-level" detail that you did not know about before....
The STREAM benchmark code only measures execution time -- everything else is derived. The derived numbers are based on both decisions about what I think is "reasonable" and assumptions about how the majority of computers work. The run rules are the product of trial and error -- attempting to balance portability with generality.
The STREAM benchmark reports "bandwidth" values for each of the kernels. These are simple calculations based on the assumption that each array element on the right hand side of each loop has to be read from memory and each array element on the left hand side of each loop has to be written to memory. Then the "bandwidth" is simply the total amount of data moved divided by the execution time.
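As a concrete illustration of that accounting (a sketch with made-up numbers, not output from the benchmark): for the Add kernel c[j] = a[j] + b[j], two arrays are read and one is written per iteration, so the reported figure is computed like this.
/* STREAM-style accounting for the Add kernel; the timing below is hypothetical. */
#include <stdio.h>

int main(void)
{
    const long   n       = 50000000L;                 /* STREAM_ARRAY_SIZE        */
    const double bytes   = 3.0 * sizeof(double) * n;  /* 2 loads + 1 store per j  */
    const double seconds = 0.1;                       /* best (minimum) kernel time */

    printf("Add bandwidth: %.1f MB/s\n", 1.0e-6 * bytes / seconds);
    return 0;
}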
There are a surprising number of assumptions involved in this simple calculation.
The model assumes that the compiler generates code to perform all the loads, stores, and arithmetic instructions that are implied by the memory traffic counts. The approach used in STREAM to encourage this is fairly robust, but an advanced compiler might notice that all the array elements in each array contain the same value, so only one element from each array actually needs to be processed. (This is how the validation code works.)
Sometimes compilers move the timer calls out of their source-code locations. This is a (subtle) violation of the language standards, but is easy to catch because it usually produces nonsensical results.
The model assumes a negligible number of cache hits. (With cache hits, the computed value is still a "bandwidth", it is just not the "memory bandwidth".) The STREAM Copy and Scale kernels only load one array (and store one array), so if the stores bypass the cache, the total amount of traffic going through the cache in each iteration is the size of one array. Cache addressing and indexing are sometimes very complex, and cache replacement policies may be dynamic (either pseudo-random or based on run-time utilization metrics). As a compromise between size and accuracy, I picked 4x as the minimum array size relative to the cache size to ensure that most systems have a very low fraction of cache hits (i.e., low enough to have negligible influence on the reported performance).
The data traffic counts in STREAM do not "give credit" to additional transfers that the hardware does, but that were not explicitly requested. This primarily refers to "write allocate" traffic -- most systems read each store target address from memory before the store can update the corresponding cache line. Many systems have the ability to skip this "write allocate", either by allocating a line in the cache without reading it (POWER) or by executing stores that bypass the cache and go straight to memory (x86). More notes on this are at http://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/
Multicore processors with more than 2 DRAM channels are typically unable to reach asymptotic bandwidth using only a single core. The OpenMP directives that were originally provided for large shared-memory systems now must be enabled on nearly every processor with more than 2 DRAM channels if you want to reach asymptotic bandwidth levels.
Single-core bandwidth is still important, but is typically limited by the number of cache misses that a single core can generate, and not by the peak DRAM bandwidth of the system. The issues are presented in http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/
For the single-core case, the number of outstanding L1 Data Cache misses is much too small to get full bandwidth -- for your Xeon Scalable processor about 140 concurrent cache misses are required for each socket, but a single core can only support 10-12 L1 Data Cache misses. The L2 hardware prefetchers can generate additional memory concurrency (up to ~24 cache misses per core, if I recall correctly), but reaching average values near the upper end of this range requires simultaneous accesses to multiple 4KiB pages. Your additional array reads give the L2 hardware prefetchers more opportunity to generate (close to) the maximum number of concurrent memory accesses. An increase of 11%-12% is completely reasonable.
Increasing the fraction of reads is also expected to increase the performance when using all the cores. In this case the benefit is primarily by reducing the number of "read-write turnaround stalls" on the DDR4 DRAM interface. With no stores at all, sustained bandwidth should reach 90% peak on this processor (using 16 or more cores per socket).
Additional notes on avoiding "write allocate" traffic:
In x86 architectures, cache-bypassing stores typically invalidate the corresponding address from the local caches and hold the data in a "write-combining buffer" until the processor decides to push the data to memory. Other processors are allowed to keep and use "stale" copies of the cache line during this period. When the write-combining buffer is flushed, the cache line is sent to the memory controller in a transaction that is very similar to an IO DMA write. The memory controller has the responsibility of issuing "global" invalidations on the address before updating memory. Care must be taken when these streaming stores are used to update memory that is shared across cores. The general model is to execute the streaming stores, execute a store fence, then execute an "ordinary" store to a "flag" variable. The store fence will ensure that no other processor can see the updated "flag" variable until the results of all of the streaming stores are globally visible. (With a sequence of "ordinary" stores, results always become visible in program order, so no store fence is required.)
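To make the x86 pattern above concrete, here is a minimal sketch (my illustration, not taken from the answer) using AVX non-temporal stores and a store fence before publishing a flag. It assumes 32-byte-aligned buffers and a length that is a multiple of 8 floats.
/* Copy src -> dst with cache-bypassing (streaming) stores, then publish a flag. */
#include <immintrin.h>

void nt_copy_and_publish(float *dst, const float *src, long n, volatile int *flag)
{
    for (long i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(src + i);   /* aligned load                    */
        _mm256_stream_ps(dst + i, v);         /* store that bypasses the caches  */
    }
    _mm_sfence();  /* streaming stores become globally visible ...               */
    *flag = 1;     /* ... before any other core can observe the flag             */
}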
In the PowerPC/POWER architecture, the DCBZ (or DCLZ) instruction can be used to avoid write allocate traffic. If the line is in cache, its contents are set to zero. If the line is not in cache, a line is allocated in the cache with its contents set to zero. One downside of this approach is that the cache line size is exposed here. DCBZ on a PowerPC with 32-Byte cache lines will clear 32 Bytes. The same instruction on a processor with 128-Byte cache lines will clear 128 Bytes. This was irritating to a vendor who used both. I don't remember enough of the details of the POWER memory ordering model to comment on how/when the coherence transactions become visible with this instruction.
The key point here, as pointed out by Dr. Bandwidth's answer (he is the author of the benchmark), is that STREAM only counts the useful bandwidth seen by the source code.
In practice the write stream will incur read bandwidth costs as well, for the RFO (Read For Ownership) requests. When a CPU wants to write 16 bytes (for example) to a cache line, it first has to load the original cache line and then modify it in L1d cache.
(Unless your compiler auto-vectorized with NT stores that bypass the cache and avoid that RFO. Some compilers will do that for loops they expect to write an array too large for the cache before any of it is re-read.)
See Enhanced REP MOVSB for memcpy for more about cache-bypassing stores that avoid an RFO.
So increasing the number of read streams vs. write streams will bring software-observed bandwidth closer to actual hardware bandwidth. (Also a mixed read/write workload for the memory may not be perfectly efficient.)
The purpose of the STREAM benchmark is not to measure the peak memory bandwidth (i.e., the maximum memory bandwidth that can be achieved on the system), but to measure the "memory bandwidth" of a number of kernels (COPY, SCALE, SUM, and TRIAD) that are important to the HPC community. So when the bandwidth reported by STREAM is higher, it means that HPC applications will probably run faster on the system.
It's also important to understand the meaning of the term "memory bandwidth" in the context of the STREAM benchmark, which is explained in the last section of the documentation. As mentioned in that section, there are at least three ways to count the number of bytes for a benchmark. The STREAM benchmark uses the STREAM method, which counts the number of bytes read and written at the source-code level. For example, in the SUM kernel (a(i) = b(i) + c(i)), two elements are read and one element is written. Therefore, assuming that all accesses go to memory, the number of bytes accessed from memory per iteration is equal to the number of arrays multiplied by the size of an element (which is 8 bytes). STREAM calculates the bandwidth by multiplying the total number of elements accessed (counted using the STREAM method) by the element size and dividing that by the execution time of the kernel. To take run-to-run variations into account, each kernel is run multiple times and the arithmetic average, minimum, and maximum bandwidths are reported.
As you can see, the bandwidth reported by STREAM is not the real memory bandwidth (at the hardware level), so it doesn't even make sense to say that it is the peak bandwidth. In addition, it's almost always much lower than the peak bandwidth. For example, this article shows how ECC and 2MB pages impact the bandwidth reported by STREAM. Writing a benchmark that actually achieves the maximum possible memory bandwidth (at the hardware level) on modern Intel processors is a major challenge and may be a good problem for a whole Ph.D. thesis. In practice, though, the peak bandwidth is less important than the STREAM bandwidth in the HPC domain. (Related: See my answer for information on the issues involved in measuring the memory bandwidth at the hardware level.)
Regarding your first question, notice that STREAM just assumes that all reads and writes are satisfied by the main memory and not by any cache. Allocating an array that is much larger than the size of the LLC helps in making it more likely that this is the case. Essentially, complex and undocumented aspects of the LLC including the replacement policy and the placement policy need to be defeated. It doesn't have to be exactly 4x larger than the LLC. My understanding is that this is what Dr. Bandwidth found to work in practice.

Best practices to ensure low power consumption [closed]

Assume I have two programs P1 and P2 which perform the same functionality, but P1 consumes lesser power than P2 when they run. What are some best practices in coding that help me write good (in terms of low power consumption) programs like P1? You can assume C or any other popular language.
I am asking from a battery saving point of view (say, for a smartphone).
To start off, let's consider what consumes power on a modern CPU (most to least):
running at higher frequencies
keeping more cores online
doing any kind of work
If a particular thread is taking a while to do something, the kernel might boost the CPU frequency to ensure smooth performance, thereby increasing power consumption. In fact, power consumption increases with CPU frequency - exponentially (!) so - which makes it a really good idea to reduce how long it takes to get any particular thing done as much as possible.
If multiple tasks are active and doing enough work that they cannot easily and/or performantly share a single core, the kernel will bring additional cores online (well, technically they're just not sleeping anymore - they were never offline) if available, again in order to ensure smooth performance. This scales roughly linearly, especially on mobile ARM processors, according to NVIDIA.
When the processor doesn't have any work to do, the kernel will put it to sleep if possible, which usually consumes ridiculously small amounts of power, thus vastly increasing how long the device can run on its battery.
So far, we have essentially established that we should do as little work as possible, do whatever we need to do as fast as possible, and minimize any overhead we incur via threads. The neat thing about these attributes is that optimizing for them will also likely increase performance! So without further ado, let's actually start seeing what we can do:
Block / No Event Loops
When we use nonblocking calls, we usually end up doing a lot of polling. This means that we are just burning through CPU cycles like an insane madman until something happens. Event loops are the usual way that people go about doing this and are an excellent example of what not to do.
Instead, use blocking calls. Often, with things such as IO, it may take quite a while for a request to complete. In this time, the kernel can allow another thread or process to use the CPU (thus reducing the overall usage of the processor) or can sleep the processor.
In other words, turn something like this:
while (!event) {
    event = getEvent (read);
}
into something like this:
read ();
Vectorize
Sometimes you have a lot of data that you need to process. Vector operations allow you to process more data faster (usually - on rare occasions they can be much slower and just exist for compatibility). Therefore, vectorizing your code can often allow it to complete its task faster, thus using fewer processing resources.
Today, many compilers can auto-vectorize with the appropriate flags. For instance, on gcc the flag -ftree-vectorize enables auto-vectorization (where available), which can accelerate code massively by processing more data at a time. It often frees up registers in the process (thus reducing register pressure), which has the beneficial side effect of reducing loads and stores and can in turn further increase performance.
On some platforms, vendors may provide libraries for processing certain kinds of data that may help with this. For instance, the Accelerate framework by Apple includes functions for dealing with vector and matrix data.
However, in certain cases you may want to do the vectorization yourself, such as when the compiler does not see the opportunity to vectorize or does not fully exploit it. This is often done in assembly, but if you use gcc or clang, you can simply use vector extensions to write portable vectorized code (albeit only for platforms with the specified vector size):
typedef float v4f __attribute__ ((vector_size (16)));

// calculates (r = a * b + c) four floats at a time
void vmuladd (v4f *r, const v4f *a, const v4f *b, const v4f *c, int n) {
    int x;
    for (x = 0; x < n; x++) {
        r[x] = a[x] * b[x];
        r[x] = r[x] + c[x];
    }
}
This may not be useful on older platforms, but this could seriously improve performance on ARM64 and other modern 64-bit platforms (x86_64, etc.).
Parallelization
Remember how I said that keeping more cores online is bad because it consumes power? Well:
Parallelization via multiple threads doesn't necessarily mean using more cores. If you paid attention to the whole thing I said about using blocking functions, threads could allow you to get work done while other threads wait on IO. That being said, you should not use those extra threads as "IO worker" threads that simply wait on IO - you'll just end up polling all over again. Instead, divide up the individual, atomic tasks that you need to get done among the threads so that for the most part, they can work independently.
It's better to consume more cores than to have to boost clock frequency (linear vs exponential). If you have a task that needs to do a shit ton of processing, it might be useful to break up that processing among a few threads so that they can utilize the available cores. If you do this, take care to ensure that only minimal synchronization is required across the threads; we don't want to waste even more cycles just waiting for synchronization.
When possible, try to combine both of the approaches - parallelize tasks when you have a lot of things to do and parallelize computation when you have a lot of a single thing to do. If you do end up using threads, try to make them block when waiting for work (pthreads - POSIX threads available on both Android and iOS have POSIX semaphores that can help with this) and try to make them long running.
If you have a situation in which you will often need to create and destroy threads, it might be worthwhile to use a thread pool. How you accomplish this varies based on the task at hand, but a set of queues is a common way to do it. If you use a pool, ensure that its threads block when there is no work (again, this can be accomplished using the above-mentioned POSIX semaphores); a minimal sketch follows.
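A minimal sketch of such a pool (an illustration under simple assumptions: a fixed-size queue, no overflow or shutdown handling, and a hypothetical process() task handler):
/* Workers block on a POSIX semaphore instead of polling an empty queue. */
#include <pthread.h>
#include <semaphore.h>

#define QUEUE_SIZE 64

extern void process (void *work);     /* hypothetical task handler */

static void *queue_items[QUEUE_SIZE];
static int head, tail;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t items;                   /* counts queued work; sem_init(&items, 0, 0) at startup */

void submit (void *work)
{
    pthread_mutex_lock (&lock);
    queue_items[tail] = work;
    tail = (tail + 1) % QUEUE_SIZE;
    pthread_mutex_unlock (&lock);
    sem_post (&items);                /* wake one sleeping worker */
}

void *worker (void *arg)
{
    for (;;) {
        sem_wait (&items);            /* sleeps while there is nothing to do */
        pthread_mutex_lock (&lock);
        void *work = queue_items[head];
        head = (head + 1) % QUEUE_SIZE;
        pthread_mutex_unlock (&lock);
        process (work);
    }
    return NULL;
}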
Minimize Work
Try to do as little as you can get away with doing. When possible, offload work to external servers in the cloud, where power consumption isn't as critical a concern (for most people - this changes once you are at scale).
In situations where you must poll, reducing the frequency of the polling by calling a sleep function can often help - turn something like this:
while (!event) {
    event = getEvent ();
}
into something like this:
event = getEvent ();
while (!event) {
    sleep (25); // in ms
    event = getEvent ();
}
Also, batch processing can work well if you don't have real time requirements (although this may be a good case to push it to the cloud) or if you get lots of independent data rapidly - change something like this:
while (!exit) {
    event = getEventBlocking ();
    process (event);
}
into something more like this:
while (!exit) {
    int x;
    event_type *events[16];
    for (x = 0; (x < 16) && availableEvents (); x++) {
        events[x] = getEventBlocking ();
    }
    int y;
    for (y = 0; y < x; y++) {
        process (events[y]);
    }
}
This can increase performance by increasing the speed via instruction and data cache locality. If possible, it'd be nice to take this a step further (when such functionality is available on your platform of choice):
while (!exit) {
    int x;
    event_type **events = getEventsAllBlocking (&x);
    int y;
    for (y = 0; y < x; y++) {
        process (events[y]);
    }
}
This will increase performance and waste fewer cycles on waiting and performing function calls. Furthermore, this speedup can become quite noticeable with large amounts of data.
Optimize
This one is pretty easy: crank up the optimization settings on your compiler. Check out the documentation for relevant optimizations that you can enable and benchmark to see if they increase performance and/or reduce power consumption.
On GCC and clang, you can enable recommended safe optimizations by using the flag -O2. Bear in mind that this can make debugging slightly harder, so only use it on production releases.
All in all:
do as little work as possible
don't waste time in event loops
optimize to get shit done in less time
vectorize to get more data processed faster
parallelize to use available resources more efficiently

CUDA concurrent execution

I hope answering my question would not require a lot of time, because it is about my understanding of this topic.
So, the question is about block and grid sizes for concurrent kernels execution.
First, let me tell you about my card: it is a GeForce GTX TITAN, and here are some of its characteristics which I think are important for this question.
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 6144 MBytes (6442123264 bytes)
(14) Multiprocessors, (192) CUDA Cores/MP: 2688 CUDA Cores
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Now, the main problem: I have a kernel (it performs sparse matrix multiplication, but that is not so important) and I want to launch it simultaneously(!) in several streams on one GPU, computing different matrix multiplications.
Please notice again the simultaneity requirement - I want all the kernels to start at one moment and to finish at another moment (all of them!), so a solution where these kernels only partly overlap doesn't satisfy me.
It is also very important that I want to maximize the number of parallel kernels, even if we lose some performance because of it.
Ok, let's assume we already have the kernel and we want to specify its grid and block sizes in the best way.
Looking at the card characteristics, we see it has 14 SMs and compute capability 3.5, which allows running 32 concurrent kernels.
So the conclusion I make here is that launching 28 concurrent kernels (two per each of the 14 SMs) would be the best decision. The first question - am I right here?
Now, again, we want to optimize each kernel's block and grid sizes. Let's look at this characteristic:
Maximum number of threads per multiprocessor: 2048
I understand it this way: if we launch a kernel with 1024 threads and 2 blocks, these two blocks will be computed simultaneously. If we launch a kernel with 1024 threads and 4 blocks, then two pairs of blocks will be computed one after another.
So the next conclusion I make is that launching 28 kernels, each with 1024 threads, would also be the best solution - because this is the only way they can be executed simultaneously on each SM. The second question - am I right here? Or is there a better solution for getting simultaneous execution?
It would be very nice if you could just say whether I am right or not, and I would be very grateful if you explained where I am mistaken or proposed a better solution.
Thank you for reading this!
There are a number of questions on concurrent kernels already. You might search and review some of them. You must consider register usage, blocks, threads, and shared memory usage, amongst other things. Your question is not precisely answerable when you don't provide information about register usage or shared memory usage. Maximizing concurrent kernels is partly an occupancy question, so you should study that as well.
Nevertheless, you want to observe maximum concurrent kernels. As you've already pointed out, that is 32.
You have 14 SMs, each of which can have a maximum of 2048 threads. 14x2048/32 = 896 threads per kernel (ie. blocks * threads per block)
With a threadblock size of 128, that would be 7 blocks per kernel. 7 blocks * 32 kernels = 224 blocks total. When we divide this by 14 SMs we get 16 blocks per SM, which just happens to exactly match the spec limit.
So the above analysis, 32 kernels, 7 blocks per kernel, 128 threads per block, would be the extent of the analysis that could be done taking into account only the data you have provided.
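To make that configuration concrete, the launch pattern could look roughly like this (a sketch only; spmv_kernel and d_args are placeholders for your real kernel and its per-matrix arguments):
// 32 kernels submitted to 32 streams, each with 7 blocks of 128 threads.
const int NKERNELS = 32;
cudaStream_t streams[NKERNELS];

for (int i = 0; i < NKERNELS; i++)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < NKERNELS; i++)
    spmv_kernel<<<7, 128, 0, streams[i]>>>(d_args[i]);   // one matrix per stream

cudaDeviceSynchronize();                                 // wait for all of them

for (int i = 0; i < NKERNELS; i++)
    cudaStreamDestroy(streams[i]);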
If that does not work for you, I'd be sure to make sure I have addressed the requirements for concurrent execution and then focus on registers per thread or shared memory to see if those are limiters for "occupancy" in this case.
Honestly, I don't hold out much hope for you witnessing the perfect scenario you describe, but have at it - I'd enjoy being surprised. FYI, if I were trying to do something like this, I would certainly try it on Linux rather than Windows, especially considering your card is a GeForce card subject to WDDM limitations under Windows.
Your understanding seems flawed. Statements like this:
if we launch a kernel with 1024 threads and 2 blocks, these two blocks will be computed simultaneously. if we launch a kernel with 1024 threads and 4 blocks, then two pairs of block will be computed one after another
don't make sense to me. Blocks will be computed in whatever order the scheduler deems appropriate, but there is no rule that says two blocks will be computed simultaneously, but four blocks will be computed two by two.

Questions about parallelism on GPU (CUDA)

I need to give some details about what I am doing before asking my question. I hope my English and my explanations are clear and concise enough.
I am currently working on a massive parallelization of a code initially written in C. The reason I became interested in CUDA is the large size of the arrays I am dealing with: the code is a fluid mechanics simulation and I need to run a "time loop" with five to six successive operations on arrays as big as 3*10^9 or 19*10^9 double variables. I went through various tutorials and documentation and I finally managed to write a not-so-bad CUDA code.
Without going into the details of the code, I used relatively small 2D blocks. The number of threads is 18 or 57 (which is awkward, since my warps are not fully occupied).
The kernels are launched on a "big" 3D grid, which describes my physical geometry (the maximal desired size is 1000 values per dimension, which means I want to deal with a 3D grid of 1 billion blocks).
Okay, so now my five to six kernels, which are doing the job correctly, make good use of the shared memory advantages, since global memory is read once and written once for each kernel (the size of my blocks was actually determined according to the amount of shared memory needed).
Some of my kernels are launched concurrently, asynchronously called, but most of them need to be successive. There are several memcpy from device to host, but the ratio of memcpys over kernels calls is significantly low. I am mostly executing operations on my arrays values.
Here is my question :
If I understood correctly, all of my blocks are doing the job on the arrays at the same time. That would mean that dealing with a 10-block grid, a 100-block grid or a billion-block grid should take the same amount of time. The answer is obviously no, since the computation time is significantly larger when I am dealing with large grids. Why is that?
I am using a relatively modest NVIDIA device (NVS 5200M). I was trying to get used to CUDA before getting bigger/more efficient devices.
Since I went through all the optimization and CUDA programming advice/guides by myself, I may have completely misunderstood some points. I hope my question is not too naive...
Thanks!
If I understood correctly, all of my blocks are doing the job on the arrays at the same time.
No, they don't run at the same time! How many thread blocks can run concurrently depends on several things, all affected by the compute capability of your device - the NVS 5200M should be cc2.1.
A CUDA-enabled GPU has an internal scheduler that manages where and when each thread block (and its warps) will run. "Where" means on which streaming multiprocessor (SM) the block will be launched.
Every SM has a limited amount of resources - shared memory and registers, for example. A good overview of these limitations is given in the Programming Guide or the Occupancy Calculator.
The first limitation is that for cc2.1 an SM can run up to 8 thread blocks at the same time. Depending on your usage of registers, shared memory and so on, that number will possibly decrease.
If I remember right, an SM of cc2.1 consists of 48 CUDA cores, so your NVS 5200M (96 CUDA cores in total) should have two SMs. Let's assume that with your kernel setup N thread blocks fit on the device at the same time. The internal scheduler will launch the first N blocks and queue up all the other thread blocks. When a thread block has finished its work, the next one from the queue is launched. So if you launch between 1 and N blocks in total, the time used for the kernel will be nearly equal; if you run the kernel with N+1 blocks, the time used will increase.

Efficiently update an identical array on all tasks with MPI

I'd like to improve the efficiency of a code that includes updates to every value of an array which is identical on all processors run with MPI. The basic structure I have now is to memcpy chunks of the data into a local array on each processor, operate on those, and Allgatherv the results (I have to use "v" because the sizes of the local blocks aren't strictly identical).
In C this would look something like:
/* counts gives the parallelization, counts[RANK] is the local memory size */
/* offsets gives the index in the global array to the local processors */
memcpy (&local_memory[0], &total_vector[0], counts[RANK] * sizeof (double));

for (i = 0; i < counts[RANK]; i++)
    local_memory[i] = new_value;

MPI_Allgatherv (&local_memory[0], counts[RANK], MPI_DOUBLE,
                &total_vector[0], counts, offsets, MPI_DOUBLE,
                MPI_COMM_WORLD);
As it turns out, this isn't very efficient. In fact, it's really freaking slow - so bad that for most system sizes I'm interested in, the parallelization doesn't lead to any increase in speed.
I suppose an alternative to this would be to update just the local chunks of the global vector on each processor and then broadcast the correct chunk of memory from the correct task to all other tasks. While this avoids the explicit memory handling, the communication cost of the broadcast has to be pretty high. It's effectively all-to-all.
EDIT: I just went and tried this solution, where you have to loop over the number of tasks and execute that number of broadcast statements. This method is even worse.
Anyone have a better solution?
The algorithm you describe is "all-to-all": each rank updates part of a larger array, and all ranks must sync that array from time to time.
If the updates happen at controlled points in the program flow, a Gather/Scatter pattern might be beneficial: all ranks send their update to rank 0, and rank 0 sends the updated array to everyone else. Depending on the array size, the number of ranks, the interconnect between ranks, and so on, this pattern may offer less overhead than the Allgatherv.
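A rough sketch of that pattern (reusing the counts/offsets arrays from the question; total_size is assumed to be the full length of total_vector):
/* Each rank updates its local chunk, rank 0 gathers the chunks, and the
   assembled vector is then broadcast back to every rank. */
for (i = 0; i < counts[RANK]; i++)
    local_memory[i] = new_value;

MPI_Gatherv (&local_memory[0], counts[RANK], MPI_DOUBLE,
             &total_vector[0], counts, offsets, MPI_DOUBLE,
             0, MPI_COMM_WORLD);                 /* collect on rank 0   */

MPI_Bcast (&total_vector[0], total_size, MPI_DOUBLE,
           0, MPI_COMM_WORLD);                   /* redistribute to all */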
