OpenCL: How to distribute a calculation on different devices without multithreading - c

Following my former post about comparing the time required to do a simple array addition job (C[i]=A[i]+B[i]) on different devices, I improved the code a little to repeat the process for different array lengths and report the time required:
The X axis is the array length on a log2 scale and the Y axis is the time on a log10 scale. As can be seen, somewhere between 2^13 and 2^14 the GPUs become faster than the CPU. I guess it is because the memory allocation becomes negligible in comparison to the calculation. (GPI1 is a typo; I meant GPU1.)
Now, assuming my C-OpenCL code is correct, I can estimate the time required to do an array addition on each device: f1(n) for the CPU, f2(n) for the first GPU and f3(n) for the second GPU. If I have an array job of length n, I should theoretically be able to divide it into three parts with n1+n2+n3=n such that f1(n1)=f2(n2)=f3(n3), and distribute it over the three devices in my system for the fastest possible calculation. I think I could do it using, let's say, OpenMP or some other multithreading method, and use the cores of my CPU to host three different OpenCL tasks. That's not what I'd like to do, because:
It is a waste of resources. Two of the cores would just be hosting while they could be used for calculation.
It makes the code more complicated.
I'm not sure how to do it. I'm currently using the Apple Clang compiler with -framework OpenCL to compile the code, but for OpenMP I would have to use the GNU compiler. I don't know how to use both OpenMP and OpenCL with either of these compilers.
So I'm wondering: is there any way to do this distribution without multithreading? For example, one of the CPU cores could assign the subtasks to the three devices sequentially, then collect the results in the same (or a different) order and concatenate them. It would probably need a little experimenting to tune the timing of the subtask assignment, but I guess it should be possible.
I'm a total beginner with OpenCL, so I would appreciate your help in working out whether this is possible and how to do it. If there are already some examples doing so, please let me know. Thanks in advance.
P.S. I have also posted this question here and here on Reddit.

The problem, as stated, implicitly tells you the solution should be concurrent (asynchronous): you need the three devices to be computing their parts at the same time, otherwise what you will do is run the job first on device A, then on device B and then on device C (in which case it would be better to run the whole thing on the fastest device). If you plan to exploit OpenCL programming efficiently (on multi-core CPUs or GPUs), you should get comfortable with asynchronous programming, which is indeed multithreaded under the hood.
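That said, the single-host-thread idea from the question is workable, because clEnqueueNDRangeKernel() is non-blocking: one thread can enqueue a sub-range on a separate command queue per device and then block once for all of them. A heavily abbreviated host-side fragment (not compilable on its own: buffer setup, kernel arguments and error checking are omitted, and queues[], kernels[], bufA[], bufC[] and the n1/n2/n3 split are assumed to already exist):

```c
/* One host thread drives all three devices. */
size_t offset[3] = {0, n1, n1 + n2};
size_t count[3]  = {n1, n2, n3};
cl_event done[3];

for (int i = 0; i < 3; i++) {
    clSetKernelArg(kernels[i], 0, sizeof(cl_mem), &bufA[i]); /* ... */
    /* non-blocking: returns as soon as the work is queued */
    clEnqueueNDRangeKernel(queues[i], kernels[i], 1, NULL,
                           &count[i], NULL, 0, NULL, &done[i]);
    clFlush(queues[i]);              /* start the work now */
}
clWaitForEvents(3, done);            /* single blocking point */
for (int i = 0; i < 3; i++)          /* gather the sub-results */
    clEnqueueReadBuffer(queues[i], bufC[i], CL_TRUE, 0,
                        count[i] * sizeof(float), C + offset[i],
                        0, NULL, NULL);
```

The runtime and drivers do the actual threading for you; the host code itself stays single-threaded.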

Related

Should I also measure clCreateContext() when profiling OpenCL code?

Recently I'm programming OpenCL code which handles some images.
After completing the code, I need to benchmark OpenCL code and native C(or C++) code which does same job.
My question arises from the above: specifically, which steps should I include in the time measurement?
The majority of books and questions on Stack Overflow only measure the execution time of clEnqueueNDRangeKernel(), using clGetEventProfilingInfo() and clWaitForEvents().
My senior said I need to include the buffer-copying jobs (C memory to cl_mem), since the native C code doesn't have such steps.
Then should I also include the program-creation and kernel-building step, the argument-setting step, the *.cl source file reading step, and (the part I'm most curious about) the clCreateContext() step?
According to [this paper], clCreateContext() consumes the largest amount of time compared with the other steps, as shown below.
IMAGE
SONY's Android OpenCL code example also only measures the elapsed time of clEnqueueNDRangeKernel(). Check here -> developer.sonymobile.com/downloads/code-example-module/opencl-code-example/
If the above is right, is it correct that I should only measure the native C code that does the same job as the OpenCL kernel code?
Or are there various perspectives on profiling and comparing OpenCL and native C code?
PLUS: My program is going to handle a continuous stream of images (like video), so there will be frequent memory copies between the GPU and other memory. In that case I should also measure the memory-copy time in both the OpenCL code and the native C code, right?
I mean, that obviously depends on what you need to measure.
Generally, if you care about the total run time of your program, measure the total runtime, including context creation.
In reality, you usually don't use OpenCL for workloads that, over the whole lifetime of a program, take less time than the context creation. If that is the case, I'd check whether using OpenCL makes sense at all. OpenCL is a "single instruction, much much much data" architecture. Hence, I think you might be constructing testbenches with simply too little work to be done to ever get statistically sufficient data.
For example, the timers you use to measure how long something takes to execute have some granularity, typically multiples of microseconds. If your workload takes less than, say, 500 µs, then what you're measuring is practically unusable as a benchmark. This is a common problem when comparing the performance of many things!

Basic GPU application, integer calculations

Long story short, I have done several prototypes of interactive software. I use pygame now (a Python SDL wrapper) and everything is done on the CPU. I am starting to port it to C and, at the same time, looking for existing ways to use some GPU power to offload redundant operations from the CPU. However, I cannot find a good "guideline" for exactly which technology/tools I should pick in my situation. I have read a plethora of docs and it drains my mental powers very fast. I am not even sure whether it is possible at all, so I'm puzzled.
Here I've made a very rough sketch of my typical application skeleton, but assuming it now uses the GPU (note that I have almost zero practical knowledge of GPU programming). What's important is that the data types and functionality must be exactly preserved. Here it is:
So F(A,R,P) is some custom function, for example element substitution or repetition. The function is presumably constant over the program's lifetime, and the rectangle's shape is generally not equal to A's shape, so it is not an in-place calculation: the results are simply generated by my functions. Examples of F: repeat rows and columns of A; substitute values with values from substitution tables; compose some tiles into a single array; apply any math function to the values of A; etc. As said, all this can easily be done on the CPU, but the app must be really smooth. BTW, in pure Python it became simply unusable after adding several visual features based on numpy arrays. Cython helps to make fast custom functions, but then the source code is already kind of a salad.
Question:
Does this scheme reflect some (standard) technology/dev tools?
Is CUDA what I am looking for? If yes, some links/examples that coincide with my application structure would be great.
I realise this is a big question, so I will give more details if it helps.
Update
Here is a concrete example of two typical calculations for my prototype of a bitmap editor. The editor works with indices, and the data consists of layers with corresponding bit masks. I can fix the sizes: the masks are the same size as the layers and, say, all layers are the same size (1024^2 pixels = 4 MB for 32-bit values). And my palette is, say, 1024 elements (4 kilobytes in 32 bpp format).
Consider I want to do two things now:
Step 1. I want to flatten all layers into one. Say A1 is the default layer (background) and layers 'A2' and 'A3' have masks 'm2' and 'm3'. In Python I'd write:
from numpy import logical_not
...
Result = (A1 * logical_not(m2) + A2 * m2) * logical_not(m3) + A3 * m3
Since the data is independent, I believe this must give a speedup proportional to the number of parallel blocks.
Step 2. Now I have an array and want to 'colorize' it with some palette, which will be my lookup table. As I see it now, there is a problem with simultaneous reads of a lookup-table element.
But my idea is that one can probably just duplicate the palette for all blocks, so each block can read its own palette? Like this:
When your code is highly parallel (i.e. there are small or no data dependencies between stages of processing) then you can go for CUDA (more finegrained control over synching) or OpenCL (very similar AND portable OpenGL-like API to interface with the GPU for kernel processing). Most of the acceleration work we do happens in OpenCL, which has excellent interop with both OpenGL and DirectX, but we also have the same setup working with CUDA. One big difference between CUDA and OpenCL is that in CUDA you can compile kernels once and delay-load (and/or link) them in your app, whereas in OpenCL the compiler plays nice with the OpenCL driver stack to ensure the kernel is compiled when the app starts.
One alternative that is often overlooked, if you're using Microsoft Visual Studio, is C++ AMP, a C++-syntax-friendly and intuitive API for those who do not want to dig into the logic twists and turns of the OpenCL/CUDA APIs. The big advantage here is that the code also works if you do not have a GPU in the system, but then you do not have as many options to tweak performance. Still, in a lot of cases this is a fast and efficient way to write proof-of-concept code, re-implementing bits and parts in CUDA or OpenCL later.
OpenMP and Threading Building Blocks are only good alternatives when you have synching issues and lots of data dependencies. Native threading using worker threads is also a viable solution, but only if you have a good idea of how synch points can be set up between the different processes in such a way that threads do not starve each other out when fighting for priority. This is a lot harder to get right, and tools such as Parallel Studio are a must. But then, so is NVIDIA Nsight if you're writing GPU code.
Appendix:
A new platform called Quasar (http://quasar.ugent.be/blog/) is being developed that enables you to write your math problems in a syntax very similar to Matlab, with full support for C/C++/C# or Java integration, and it cross-compiles (LLVM, Clang) your "kernel" code to any underlying hardware configuration. It generates CUDA PTX files, or runs on OpenCL, or even on your CPU using TBB, or a mixture of them. Using a few monikers, you can decorate the algorithm so that the underlying compiler can infer the types (you can also explicitly use strict typing), leaving the type-heavy stuff entirely up to the compiler. To be fair, at the time of writing the system is still a work in progress and the first OpenCL-compiled programs are only just being tested, but the most important benefit is fast prototyping with almost identical performance compared to optimized CUDA.
What you want to do is send values really fast to the GPU using the high frequency dispatch and then display the result of a function which is basically texture lookups and some parameters.
I would say this problem will only be worth solving on the GPU if two conditions are met:
The size of A[] is large enough to make the transfer times irrelevant (see http://blog.theincredibleholk.org/blog/2012/11/29/a-look-at-gpu-memory-transfer/).
The lookup table is not too big and/or the lookup values are organized in a way that maximally utilizes the cache; random lookups on the GPU can generally be slow, so ideally you can pre-load the R[] values into a shared-memory buffer for each chunk of the A[] buffer.
If both of those conditions are met, then and only then consider having a go at using the GPU for your problem; otherwise those two factors will overpower the computational speed-up that the GPU can provide.
Another thing you can look at is overlapping the transfer and compute times as much as you can, to hide as much as possible of the slow CPU->GPU transfer rate.
Regarding your F(A, R, P) function, you need to make sure that you do not need to know the value of F(A, R, P)[0] in order to know the value of F(A, R, P)[1], because if you do, then you need to rewrite F(A, R, P) to work around this issue using some parallelization technique. If you have a limited number of F() functions, this can be solved by writing a parallel version of each F() function for the GPU to use; but if F() is user-defined, your problem becomes a bit trickier.
I hope this is enough information to make an informed guess as to whether or not you should use a GPU to solve your problem.
EDIT
Having read your edit, I would say yes. The palette could fit in shared memory (see "GPU shared memory size is very small - what can I do about it?"), which is very fast. If you have more than one palette, you could fit 16 KB (the size of shared memory on most cards) / 4 KB per palette = 4 palettes per block of threads.
One last warning: integer operations are not the fastest on the GPU. Once your algorithm is implemented and working, consider switching to floating point if possible, as a cheap optimization.
There is not much difference between OpenCL and CUDA, so choose whichever works better for you. Just remember that CUDA will limit you to NVIDIA GPUs.
If I understand your problem correctly, the kernel (the function executed on the GPU) should be simple. It should follow this pseudocode:
kernel main(shared A, shared outA, const struct R, const struct P, const int maxOut, const int sizeA)
int index := getIndex() // get offset in input array
if(index >= sizeA) return // extra threads exit early; GPUs often work better when the number of threads is 2^n
int outIndex := index*maxOut // offset in output array
outA[outIndex] := F(A[index], R, P)
end
The function F should be inlined, and you can use switch or if for the different functions. Since the size of F's output is not known in advance, you have to use more memory. Each kernel instance must know the positions for correct memory reads and writes, so there has to be some maximum output size (if there is none, then all of this is useless and you have to use the CPU!). If the differing sizes are sparse, I would do something like this: after getting the array back to RAM, find the elements with different sizes and compute those few on the CPU, while filling outA with zeros or indicator values for them.
The sizes of the arrays are obviously length(A) * maxOut = length(outA).
I forgot to mention: if the execution of F is not the same in most cases (i.e. not the same source path), then the GPU will serialize it. GPU multiprocessors consist of groups of cores connected to the same instruction cache, so the hardware has to serialize code paths that are not the same for all cores! OpenMP or threads are a better choice for this kind of problem!

Avoiding CUDA thread divergence for MISD type operation

As part of a bigger code, I have a CUDA RK4 solver that integrates a large number of ODEs (Can be 1000+) in parallel. One step of this operation is calculating 'xdot', which is different for each equation (or data element). As of now, I have a switch-case branching setup to calculate the value for each data element in the kernel. All the different threads use the same 3-6 data elements to calculate their output, but in a different way. For example, for thread 1, it could be
xdot = data[0]*data[0] + data[1];
while for thread 2 it could be,
xdot = -2*data[0] + data[2];
and so on.
So if I have a hundred data elements, the execution path is different for each of them.
Is there any way to avoid/decrease the thread-divergence penalty in such a scenario?
Would running only one thread per block be of any help ?
Running one thread per block simply idles 31 of the 32 threads in the single warp you launch and wastes a lot of cycles and opportunities to hide latency. I would never recommend it, no matter how much branch-divergence penalty your code incurs.
Your application sounds pretty orthogonal to the basic CUDA programming paradigm, and there really isn't much you can do to avoid branch-divergence penalties. One approach which could slightly improve things would be to perform some prior analysis of the expressions for each equation and group those with common arithmetic terms together. Recent hardware can run a number of kernels simultaneously, so it might be profitable to group calculations sharing like terms into different kernels and launch them simultaneously, rather than as a single large kernel. CUDA supports C++ templating, which can be a good way of generating a lot of kernel code from a relatively narrow base and making a lot of logic statically evaluable, which can help the compiler. But don't expect miracles: your problem is probably better suited to a different architecture than the GPU (Intel's Xeon Phi, for example).

Is there a simple way to run a C/C++ program in parallel without recoding?

I have a multi-core machine, but when I tried to run this old C program (http://www.statmt.org/moses/giza/mkcls.html) it only utilizes one core. Is there a way to run the C code and send the cycles/threads to the other cores?
Is recoding the code into CUDA the only way?
I have a multi-core machine, but when I tried to run this old C
program (http://www.statmt.org/moses/giza/mkcls.html) it only utilizes
one core. Is there a way to run the C code and send the cycles/threads
to the other cores?
Without recompiling, definitely not.
You may be able to make some minor tweaks and use a tool that takes your source and parallelizes it automatically. But since the cores are quite separate, you can't just spread the instructions between two of them: the code has to be compiled in such a way that there are two "streams of instructions". If you were to simply send every other instruction to the other core in a dual-core system, it would probably run 10-100 times slower than running all the code on one core, because of all the extra communication overhead between the cores that this would need. (Each core already has the ability to run several instructions in parallel internally; the main reason for multi-core processors in the first place is that this ability only goes so far at making things faster - there are only so many instructions that can be run before you need the result of a previous instruction, and so on.)
Is recoding the code into CUDA the only way?
No, there are many other alternatives: OpenMP, or hand-coding using multiple threads. Or, the simplest approach: start the program two or four times over with different input data, and let the instances run completely separately. This obviously only works if there is something you can run multiple variants of at the same time...
A word on "making things parallel": it's not a magical fix that makes all code faster. Calculating something where you need the result of the previous calculation is pretty hopeless - say you want to calculate the Fibonacci series, f(n) = f(n-1) + f(n-2); you can't do that with parallel calculations, because you need the result of the previous calculation(s) to proceed. On the other hand, if you have a dozen really large numbers that you want to check for primality, you'd be able to do that about four times faster with a 4-core processor and four threads.
If you have a large matrix that needs to be multiplied by another large matrix or vector, that would be ideal to split up so you do part of the calculation on each core.
I haven't looked at the code for your particular project, but just from the description, I think it may parallelise quite well.
Yes, this is called automatic parallelization and it is an active area of research.
However, I know of no free tools for this. The Wikipedia article "automatic parallelization" has a list of tools. You will need access to the original source code, and you might have to add parallelization directives to the code.
You can run it in multiple processes and write another program that forwards tasks to either of those processes.
CUDA? You only need that if you want it to run on your graphics card, and in this case that makes no sense.

How to allocate more CPU and RAM to a C program in Linux

I am running a simple C program which performs a lot of calculations (CFD) and hence takes a long time to run. However, I still have a lot of unused CPU and RAM. How can I allocate some of my processing power to this one program?
I'm guessing that CFD means Computational Fluid Dynamics (but CFD has also a lot of other meanings, so I might guess wrong).
You definitely should first profile your code. At the very least, compile it with gcc -Wall -pg -O and learn how to use gprof. You might also use strace to find out the system calls done by your code.
I'm not an expert in CFD (even though, in the previous century, I worked with CFD experts). But such code uses a lot of finite-element analysis and other vector computation.
If you are writing the code yourself, you might consider using OpenMP (by carefully adding OpenMP pragmas to your source code you might speed it up), or even consider targeting GPGPUs by writing OpenCL kernels that run on the GPU.
You could also learn more about pthreads programming and change your code to use threads.
If you are using important numerical libraries such as BLAS, note that they come with a lot of tuning and even specialized variants (e.g. multi-core, OpenMP-ed, or even OpenCL).
In all cases, parallelizing your code is a lot of work. You'll spend weeks or months improving it, if it is possible at all.
Linux doesn't keep programs waiting with the CPU free when they need to do calculations.
Either you have a multicore CPU and only a single thread running (as suggested by @Pankrates), or you are blocking on some I/O.
You could nice the process with a negative increment, but you need to be superuser for that. See
man nice
This would increase the scheduling priority of the process. If it is competing with other processes for CPU time, it would get more CPU time and therefore "run faster".
As for increasing the amount of RAM used by the program: you'd need to rewrite or reconfigure the program to use more RAM. It is difficult to say more given the information available in the question.
To use multiple CPUs at once, you either need to run multiple copies of your program or run multiple threads within the program. Neither is terribly hard to get started on.
However, it's much easier to do a parallel version of "I've got 10000 large numbers and I want to find out for each of them whether it is prime or not" than it is to do lots of "A = A + B"-type calculations in parallel, because you need the new A before you can make the next step. CFD calculations tend to be the latter kind [as far as I understand it], but with large arrays. You may be able to split large vector calculations into a set of smaller vector calculations [say you have a 1000 x 1000 matrix; you could split it into 4 sets of 250 x 1000 matrices, or 4 sets of 500 x 500 matrices, and perform each of those in its own thread].
If it's your own code, then hopefully you know what it does and how it works. If it's someone else's code, then you need to talk to whoever owns it.
There is no magical way to "automatically make use of more CPUs". 30% CPU usage on a quad-core processor probably means that your system is basically using one core, with 5% or so of overhead for other things going on in the system - or maybe there is a second thread somewhere in your application that uses a little CPU doing whatever it does. Or the application is multithreaded but doesn't use the multiple cores to their full extent because of contention between the threads over some shared resource... It's impossible for us to say which of these [or several other] alternatives applies.
Asking for more RAM isn't going to help unless you have something useful to put into that memory. If there is free memory, your application already gets as much memory as it needs.
