Questions about parallelism on GPU (CUDA)

I need to give some details about what I am doing before asking my question. I hope my English and my explanations are clear and concise enough.
I am currently working on massively parallelizing a code initially written in C. The reason I became interested in CUDA is the large size of the arrays I was dealing with: the code is a fluid mechanics simulation, and I need to run a "time loop" with five to six successive operations on arrays as big as 3×10^9 or 19×10^9 double variables. I went through various tutorials and documentation and I finally managed to write a not-so-bad CUDA code.
Without going into the details of the code, I used relatively small 2D blocks. The number of threads per block is 18 or 57 (which is awkward, since my warps are not fully occupied).
The kernels are launched on a "big" 3D grid, which describes my physical geometry (the maximal desired size is 1000 values per dimension; that means I want to deal with a 3D grid of about a billion blocks).
Okay, so now my five to six kernels, which do the job correctly, make good use of the advantages of shared memory, since global memory is read once and written once for each kernel (the size of my blocks was actually determined according to the amount of shared memory needed).
Some of my kernels are launched concurrently, called asynchronously, but most of them need to be successive. There are several memcpys from device to host, but the ratio of memcpys to kernel calls is significantly low. I am mostly executing operations on my array values.
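Schematically, each kernel follows this read-once/write-once shared-memory pattern (a simplified, generic sketch, not my actual code; the kernel name and tile size are hypothetical):

    __global__ void step(const double *in, double *out, int nx, int ny)
    {
        /* Tile must be at least blockDim.y x blockDim.x; sizes here are
           hypothetical. */
        __shared__ double tile[6][16];

        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        bool inside = (x < nx && y < ny);

        if (inside)
            tile[threadIdx.y][threadIdx.x] = in[y * nx + x];  /* one global read */
        __syncthreads();

        /* ... arithmetic on the tile in shared memory ... */

        if (inside)
            out[y * nx + x] = tile[threadIdx.y][threadIdx.x]; /* one global write */
    }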
Here is my question:
If I understood correctly, all of my blocks are doing the job on the arrays at the same time. So does that mean that dealing with a 10-block grid, a 100-block grid, or a billion-block grid will take the same amount of time? The answer is obviously no, since the computation time is significantly longer when I am dealing with large grids. Why is that?
I am using a relatively modest NVIDIA device (NVS 5200M). I was trying to get used to CUDA before getting bigger/more efficient devices.
Since I went through all the optimization and CUDA programming guides/advice by myself, I may have completely misunderstood some points. I hope my question is not too naive...
Thanks!

If I understood correctly, all of my blocks are doing the job on the arrays at the same time.
No, they don't all run at the same time! How many thread blocks can run concurrently depends on several things, all tied to the compute capability of your device; the NVS 5200M should be cc 2.1.
A CUDA-enabled GPU has an internal scheduler that manages where and when each thread block, and the warps of those blocks, will run. "Where" means on which streaming multiprocessor (SM) the block will be launched.
Every SM has a limited amount of resources, for example shared memory and registers. The Programming Guide and the Occupancy Calculator give a good overview of these limitations.
The first limitation is that for cc 2.1 an SM can run up to 8 thread blocks at the same time. Depending on your usage of registers, shared memory, and so on, the number will possibly decrease.
If I remember correctly, a cc 2.1 SM has 48 CUDA cores, so your NVS 5200M (96 cores in total) should have two SMs. Let's assume that with your kernel setup, N (N <= 8) thread blocks fit on each SM at the same time. The internal scheduler launches the first blocks and queues up all the other thread blocks. When a thread block finishes its work, the next one from the queue is launched. So launching anywhere from 1 block up to that many resident blocks in total takes roughly the same time for the kernel; with even one block more, the time increases.
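If you want to check that per-SM limit for your actual kernel, the runtime API can report it directly. A minimal sketch (assumes a CUDA 6.5+ toolkit; my_kernel and the block size of 57 are placeholders):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(double *a) { /* ... */ }

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("SMs: %d, max threads/SM: %d\n",
               prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);

        /* The "N" from above: resident blocks per SM for this kernel,
           given its register and shared memory footprint. */
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, my_kernel, 57 /* block size */, 0 /* dynamic smem */);
        printf("resident blocks per SM: %d\n", blocksPerSM);
        return 0;
    }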

Related

At what point does adding more threads stop helping?

I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
I think the best way to answer is to give a quick overview of how threads are managed by the system. Nowadays all processors are multi-core, often with several hardware threads per core, but for the sake of simplicity let's first imagine a single-core processor with a single hardware thread. Such a processor is physically limited to performing only a single task at a time, yet we are still able to run multitasking programs.
So how is this possible? Well, it is simply an illusion!
The CPU is still performing a single task at a time, but it switches between one and the other so quickly that it gives the illusion of multitasking. This process of changing from one task to another is called a context switch.
During a context switch, all the data related to the running task is saved and the data related to the next task is loaded. Depending on the architecture of the CPU, this data can be kept in registers, cache, RAM, etc. As the technology has advanced, faster and faster solutions for this have been found. When the task is resumed, its data is fetched back and the task continues its operations.
This concept introduces many issues in managing tasks, like:
Race condition
Synchronization
Starvation
Deadlock
There are other points, but this is just a quick list since the question does not focus on this.
Getting back to your question:
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
Short answer: It depends!
As said before, switching from one task to another requires a context switch. That involves storing and fetching data, and these operations are pure overhead: they don't directly advance your computation at all. So having too many tasks requires a large amount of context switching, which means a lot of computational time wasted, and in the end your program might run slower than it would with fewer tasks.
Also, since you tagged this question with pthreads, it is worth checking that the code is compiled to run on multiple hardware cores. Having a multi-core CPU does not guarantee that your multithreaded code will actually run on multiple hardware cores!
In your particular application:
I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
This is a good example of concurrent and data-independent computing. This sort of task runs great on a GPU, since the operations have no data correlation and concurrent computing is performed in hardware (modern GPUs have thousands of computing cores!).
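As a rule of thumb on the CPU side, the sweet spot is one worker per online core: enough to use all cores, few enough to keep context switching between your own threads rare. A minimal pthreads sketch of that idea (the grid size and the busy work are placeholders):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define GRID_CELLS (1 << 20)   /* hypothetical N*M total */
    #define MAX_THREADS 64

    static double grid[GRID_CELLS];

    struct slice { size_t begin, end; };

    static void *work(void *arg)
    {
        struct slice *s = (struct slice *)arg;
        for (size_t i = s->begin; i < s->end; i++)
            grid[i] = grid[i] * 0.5 + 1.0;   /* placeholder busy work */
        return NULL;
    }

    int main(void)
    {
        /* One worker per online core. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        pthread_t tid[MAX_THREADS];
        struct slice s[MAX_THREADS];

        if (n < 1) n = 1;
        if (n > MAX_THREADS) n = MAX_THREADS;

        for (long t = 0; t < n; t++) {
            s[t].begin = (size_t)(GRID_CELLS * t / n);
            s[t].end   = (size_t)(GRID_CELLS * (t + 1) / n);
            pthread_create(&tid[t], NULL, work, &s[t]);
        }
        for (long t = 0; t < n; t++)
            pthread_join(tid[t], NULL);

        printf("done with %ld threads\n", n);
        return 0;
    }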

OpenCL large global size or for loops per work item?

I'm learning OpenCL in order to implement a relatively complex image processing algorithm which includes several subroutines that should be implemented as kernels.
The implementation is intended to be on Mali T-6xx GPU.
I read the "OpenCL Programming by Example" book and the "Optimizing OpenCL kernels on the Mali-T600 GPUs" document.
In the book's examples, some global work size of work items is used, and each work item processes several pixels in for loops.
In the document, the kernels are written without loops: there is a single execution per work item in the kernel.
Since the maximum number of work items the Mali T-600 GPUs can run at once is 256 (and that's for simple kernels), and there are clearly more pixels than that to process in most images, my understanding is that the loop-free kernel will keep spawning work-item threads as fast as possible until the whole global size of work items has executed the kernel, and the global size might simply be the number of pixels in the image. Is that right? Is it a kind of thread-spawning loop in itself?
In the book, on the other hand, the global work size is smaller than the number of pixels to process, but the kernel has loops that make each work item process several pixels while executing the kernel code.
So I want to know which is the proper way to write image-processing kernels, or any OpenCL kernels for that matter, and in what situations one way might be better than the other, assuming I understood both ways correctly...
Is that right? Is it a kind of thread-spawning loop in itself?
Yes.
So I want to know which is the proper way to write image-processing kernels, or any OpenCL kernels for that matter, and in what situations one way might be better than the other
I suspect there isn't a "right" answer in general; there are multiple hardware vendors and multiple drivers, so the "best" approach will likely vary from vendor to vendor.
For Mali in particular, thread spawning is handled entirely in hardware, so it will in general be faster than explicit loops in the shader code, which take instructions to process.
There is normally some advantage to at least some vectorization, e.g. processing vec4 or vec8 vectors of pixels per work item rather than just one, as the Mali-T600/700/800 GPU cores use a vector arithmetic architecture.
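For what it's worth, the same two styles exist in CUDA, where the looping form is the well-known "grid-stride loop" idiom. A minimal CUDA sketch of both (the pixel operation is a placeholder):

    /* Style 1: one thread per pixel; the launch supplies one index each. */
    __global__ void per_pixel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f;   /* placeholder work */
    }

    /* Style 2: a fixed number of threads, each looping over several pixels. */
    __global__ void grid_stride(const float *in, float *out, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            out[i] = in[i] * 2.0f;
    }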

CUDA concurrent execution

I hope answering my question would not require a lot of time, because it is about my understanding of this topic.
So, the question is about block and grid sizes for concurrent kernels execution.
First, let me tell you about my card: it is a GeForce GTX TITAN, and here are some of its characteristics, which I think are important for this question.
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 6144 MBytes (6442123264 bytes)
(14) Multiprocessors, (192) CUDA Cores/MP: 2688 CUDA Cores
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Now, the main problem: I have a kernel (it performs sparse matrix multiplication, but that is not so important) and I want to launch it simultaneously(!) in several streams on one GPU, computing different matrix multiplications.
Please notice again the simultaneity requirement: I want all the kernels to start at one moment and finish at another (all of them!), so a solution where these kernels only partly overlap doesn't satisfy me.
It is also very important that I want to maximize the number of parallel kernels, even if we lose some performance because of it.
Ok, let's assume we already have the kernel and we want to specify its grid and block sizes in the best way.
Looking at the card characteristics, we see it has 14 SMs and capability 3.5, which allows 32 concurrent kernels to run.
So the conclusion I draw here is that launching 28 concurrent kernels (two for each of the 14 SMs) would be the best decision. The first question: am I right here?
Now, again, we want to optimize each kernel's block and grid sizes. Ok, let's look at this characteristic:
Maximum number of threads per multiprocessor: 2048
I understand it this way: if we launch a kernel with 1024 threads and 2 blocks, these two blocks will be computed simultaneously; if we launch a kernel with 1024 threads and 4 blocks, then the two pairs of blocks will be computed one after another.
So the next conclusion I draw is that launching 28 kernels, each with 1024 threads, would also be the best solution, because this is the only way they can all be executed simultaneously on each SM. The second question: am I right here? Or is there a better solution for achieving simultaneous execution?
It would be very nice if you simply said whether or not I am right, and I would be very grateful if you explained where I am mistaken or proposed a better solution.
Thank you for reading this!
There are a number of questions on concurrent kernels already. You might search and review some of them. You must consider register usage, blocks, threads, and shared memory usage, amongst other things. Your question is not precisely answerable when you don't provide information about register usage or shared memory usage. Maximizing concurrent kernels is partly an occupancy question, so you should study that as well.
Nevertheless, you want to observe maximum concurrent kernels. As you've already pointed out, that is 32.
You have 14 SMs, each of which can have a maximum of 2048 threads. 14 x 2048 / 32 = 896 threads per kernel (i.e. blocks * threads per block).
With a threadblock size of 128, that would be 7 blocks per kernel. 7 blocks * 32 kernels = 224 blocks total. When we divide this by 14 SMs we get 16 blocks per SM, which just happens to exactly match the spec limit.
So the above analysis, 32 kernels, 7 blocks per kernel, 128 threads per block, would be the extent of the analysis that could be done taking into account only the data you have provided.
If that does not work for you, I'd make sure the requirements for concurrent execution have been addressed, and then focus on registers per thread or shared memory to see whether those are the limiters for "occupancy" in this case.
Honestly I don't hold out much hope for you witnessing the perfect scenario you describe, but have at it. I'd enjoy being surprised. FYI, if I were trying to do something like this, I would certainly try it on linux rather than windows, especially considering your card is a GeForce card subject to WDDM limitations under windows.
Your understanding seems flawed. Statements like this:
if we launch a kernel with 1024 threads and 2 blocks, these two blocks will be computed simultaneously; if we launch a kernel with 1024 threads and 4 blocks, then the two pairs of blocks will be computed one after another
don't make sense to me. Blocks will be computed in whatever order the scheduler deems appropriate, and there is no rule that says two blocks will be computed simultaneously while four blocks will be computed two by two.
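For reference, a minimal multi-stream launch sketch following the 32-kernel / 7-block / 128-thread analysis above (the kernel body is a trivial placeholder for the sparse-matrix code; whether the launches truly overlap is up to the hardware scheduler):

    #include <cuda_runtime.h>

    /* Placeholder body standing in for the sparse-matrix kernel. */
    __global__ void spmv_kernel(const double *in, double *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0;
    }

    int main(void)
    {
        const int nKernels = 32, blocks = 7, threads = 128;
        const size_t bytes = (size_t)blocks * threads * sizeof(double);
        cudaStream_t streams[nKernels];
        double *in[nKernels], *out[nKernels];

        for (int i = 0; i < nKernels; i++) {
            cudaStreamCreate(&streams[i]);
            cudaMalloc(&in[i], bytes);
            cudaMalloc(&out[i], bytes);
        }

        /* Issue all launches back to back into distinct streams. */
        for (int i = 0; i < nKernels; i++)
            spmv_kernel<<<blocks, threads, 0, streams[i]>>>(in[i], out[i]);

        cudaDeviceSynchronize();
        return 0;
    }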

CUDA - limiting number of SMs being used

Is there any way I could EXPLICITLY limit the number of GPU multiprocessors being used during the runtime of my program? I would like to calculate how my algorithm scales up with a growing number of multiprocessors.
If it helps: I am using CUDA 4.0 and device with compute capability 2.0.
Aaahhh... I know the problem. I played with it myself when writing a paper.
There is no explicit way to do it; however, you can try "hacking" it by having some of the blocks do nothing.
If you never launch more blocks than there are multiprocessors, your work is easy: just launch even fewer blocks. Some SMs are guaranteed to have no work, because a block cannot split across multiple SMs.
If you launch many more blocks and just rely on the driver to schedule them, use a different approach: launch as many blocks as your GPU can hold at once, and when one of the blocks finishes its work, instead of terminating it, loop back to the beginning and fetch another piece of data to work on. Most likely, the performance of your program will not fall; it might even get better if you schedule your work carefully :)
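A minimal sketch of that second approach, with a global counter as the work queue (next_item must be zeroed, e.g. via cudaMemcpyToSymbol, before each launch; the per-chunk work is left as a comment):

    __device__ int next_item;   /* zero this before each launch */

    __global__ void persistent_kernel(double *data, int n_items)
    {
        __shared__ int item;
        for (;;) {
            if (threadIdx.x == 0)
                item = atomicAdd(&next_item, 1);  /* thread 0 fetches the next chunk */
            __syncthreads();
            if (item >= n_items)
                return;
            /* ... process chunk `item` of `data` with the whole block ... */
            __syncthreads();   /* keep `item` stable until everyone is done */
        }
    }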
The biggest problem is when all your blocks are running on the GPU at once, but you have more than one block per SM. Then you need to launch normally, but manually "disable" some of the blocks and order other blocks to do the work for them. The problem is which blocks to disable so as to guarantee that one SM is working and another is not.
From my own experiments, the 1.3 devices (I had a GTX 285) schedule the blocks in sequence. So, if I launch 60 blocks onto 30 SMs, blocks 1-30 are scheduled onto SMs 1-30 and then blocks 31-60 again onto SMs 1 to 30. So, by disabling blocks 5 and 35, SM number 5 is doing practically nothing.
Note however, this is my private, experimental observation I made 2 years ago. It is in no way confirmed, supported, maintained, what-not by NVIDIA and may change (or already has changed) with the new GPUs and/or drivers.
I would suggest playing with some simple kernels that do a lot of pointless work and seeing how long they take to compute in various "enabled"/"disabled" configurations. If you are lucky, you will catch a performance drop, indicating that two blocks are in fact being executed by a single SM.

Thread-safety of read-only memory access

I've implemented the Barnes-Hut gravity algorithm in C as follows:
Build a tree of clustered stars.
For each star, traverse the tree and apply the gravitational forces from each applicable node.
Update the star velocities and positions.
Stage 2 is the most expensive stage, and so it is implemented in parallel by dividing the set of stars. E.g. with 1000 stars and 2 threads, I have one thread process the first 500 stars and the second thread process the remaining 500.
In practice this works: it speeds the computation by about 30% with two threads on a two-core machine, compared to the non-threaded version. Additionally, it yields the same numerical results as the original non-threaded version.
My concern is that the two threads access the same resource (namely, the tree) simultaneously. I have not added any synchronisation to the thread workers, so it's likely they will attempt to read from the same location at some point. Although access to the tree is strictly read-only, I am not 100% sure it's safe. It has worked when I've tested it, but I know that is no guarantee of correctness!
Questions
Do I need to make a private copy of the tree for each thread?
Even if it is safe, are there performance problems of accessing the same memory from multiple threads?
Update: Benchmark results for the curious:
Machine: Intel Atom CPU N270 @ 1.60GHz, cpu MHz 800, cache size 512 KB
Threads real user sys
0 69.056 67.324 1.720
1 76.821 66.268 5.296
2 50.272 63.608 10.585
3 55.510 55.907 13.169
4 49.789 43.291 29.838
5 54.245 41.423 31.094
0 means no threading at all; 1 and above means spawn that many worker threads, with the main thread waiting for them. I would not expect much of an improvement beyond 2 threads, since the work is entirely CPU-bound and that's how many cores there are. It's interesting that an odd number of threads is slightly worse than an even number.
Looking at sys, it's apparent that there's a cost to creating threads. Currently the code creates the threads anew for each frame (so N*1000 thread creations). This was easy to program (during my 15 minutes on the train this morning). I'll need to think a bit about how to reuse threads...
Update #2: I've made it use a pool of threads, synchronised with two barriers. This has no noticeable performance advantage over recreating the threads each frame.
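For reference, the pool boils down to something like this sketch (compute_forces is a hypothetical stand-in for stage 2 on one slice of the stars):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NTHREADS 2
    #define NFRAMES  3

    static pthread_barrier_t start_barrier, done_barrier;
    static bool quit = false;

    /* Hypothetical stand-in for the force calculation on one slice. */
    static void compute_forces(long slice)
    {
        printf("thread %ld working\n", slice);
    }

    static void *worker(void *arg)
    {
        long slice = (long)arg;
        for (;;) {
            pthread_barrier_wait(&start_barrier);  /* wait for frame start */
            if (quit)
                return NULL;
            compute_forces(slice);
            pthread_barrier_wait(&done_barrier);   /* report frame done */
        }
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        pthread_barrier_init(&start_barrier, NULL, NTHREADS + 1);
        pthread_barrier_init(&done_barrier,  NULL, NTHREADS + 1);

        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);

        for (int frame = 0; frame < NFRAMES; frame++) {
            pthread_barrier_wait(&start_barrier);  /* release the workers */
            pthread_barrier_wait(&done_barrier);   /* wait for them to finish */
        }

        quit = true;
        pthread_barrier_wait(&start_barrier);      /* let the workers see quit */
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }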
You don't specify how your data is structured, but in general reading memory from multiple threads simultaneously is safe and does not introduce any performance issues. You only get problems if someone is writing.
It is interesting that you say you're only getting a 30% speedup out of two threads. If you have an otherwise idle machine, two or more CPUs and only read-only shared data (i.e. no synchronisation), I would expect to see something much closer to a 50% speed improvement. This suggests that your operation is actually completing so quickly that the overhead of creating the threads is becoming significant in your numbers. Are you running on a hyperthreaded CPU?
If your data is read-only, then no, you do not need to make a private copy of the tree for each thread. This is the biggest advantage that a shared memory threading model offers!
I'm not aware of any performance problems with such a model. If anything, it should be faster, depending on whether your CPUs can share some of their cache.
