Why is my OpenMP implementation slower than a single threaded implementation?

I am learning about OpenMP concurrency, and tried my hand at some existing code of mine. In this code, I tried to make all the for loops parallel. However, this seems to make the program MUCH slower - at least 10x slower than the single threaded version, or even worse.
Here is the code: http://pastebin.com/zyLzuWU2
I also used pthreads, which turns out to be faster than the single threaded version.
Now the question is, what am I doing wrong in my OpenMP implementation that is causing this slowdown?
Thanks!
edit: the single threaded version is just the one without all the #pragmas

One problem I see with your code is that you are using OpenMP across loops that are very small (8 or 64 iterations, for example). This will not be efficient due to overheads. If you want to use OpenMP for the n-queens problem, look at OpenMP 3.0 tasks and thread parallelism for branch-and-bound problems.
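As a rough sketch of the task-based approach (this bitmask n-queens counter is illustrative, not your pasted code; the cutoff depth of 3 is a guess you would tune):

#include <stdio.h>

/* count n-queens solutions; cols/d1/d2 are bitmasks of attacked squares */
long solve(int n, unsigned cols, unsigned d1, unsigned d2, int row)
{
    if (row == n)
        return 1;
    long count = 0;
    unsigned avail = ~(cols | d1 | d2) & ((1u << n) - 1);
    while (avail) {
        unsigned bit = avail & -avail;    /* lowest available column */
        avail -= bit;
        /* spawn tasks only near the root, where subtrees are large
           enough to amortize the tasking overhead */
        #pragma omp task shared(count) if (row < 3)
        {
            long sub = solve(n, cols | bit, (d1 | bit) << 1,
                             (d2 | bit) >> 1, row + 1);
            #pragma omp atomic
            count += sub;
        }
    }
    #pragma omp taskwait
    return count;
}

int main(void)
{
    long total = 0;
    #pragma omp parallel
    {
        #pragma omp single          /* one thread seeds the task tree */
        total = solve(12, 0, 0, 0, 0);
    }
    printf("%ld solutions\n", total);
    return 0;
}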

I think your code is much too complex to be reviewed here in full, but one error I saw immediately is that it is not even correct: wherever you use an omp parallel for to compute a sum, you must add reduction(+: yourcountervariable) so that the results of the different threads are correctly combined. Otherwise one thread may overwrite the results of the others.
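For example (a minimal sketch; a and n stand in for whatever you are summing over):

long sum = 0;
/* each thread accumulates into a private copy of sum; OpenMP adds the
   private copies together at the end of the loop */
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++)
    sum += a[i];

Without the reduction clause, all threads race on the single shared sum and updates get lost.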

At least two reasons:
You're only doing 8 iterations of a very simple loop. Your runtime will be completely dominated by the overhead involved in setting up all the threads.
In some places, the critical section will cause contention: all the threads continually try to enter the critical section and block each other (see the sketch below).
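A minimal illustration of that second point (f, n and total are made up):

/* contended: every iteration serializes on one lock */
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    #pragma omp critical
    total += f(i);
}

/* uncontended: each thread sums privately; combined once at the end */
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < n; i++)
    total += f(i);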

Related

Avoiding CUDA thread divergence for MISD type operation

As part of a bigger code, I have a CUDA RK4 solver that integrates a large number of ODEs (Can be 1000+) in parallel. One step of this operation is calculating 'xdot', which is different for each equation (or data element). As of now, I have a switch-case branching setup to calculate the value for each data element in the kernel. All the different threads use the same 3-6 data elements to calculate their output, but in a different way. For example, for thread 1, it could be
xdot = data[0]*data[0] + data[1];
while for thread 2 it could be,
xdot = -2*data[0] + data[2];
and so on.
So if I have a hundred data elements, the execution path is different for each of them.
Is there any way to avoid/decrease the thread-divergence penalty in such a scenario?
Would running only one thread per block be of any help ?
Running one thread per block simply idles 31 of the 32 threads in the single warp you launch and wastes a lot of cycles and opportunities to hide latency. I would never recommend it, no matter how much branch divergence penalty your code incurs.
Your application sounds pretty orthogonal to the basic CUDA programming paradigm, and there really isn't going to be much you can do to avoid branch divergence penalties. One approach which could slightly improve things would be to perform some prior analysis of the expressions for each equation and group those with common arithmetic terms together. Recent hardware can run a number of kernels simultaneously, so it might be profitable to group calculations sharing like terms into different kernels and launch them simultaneously, rather than as a single large kernel. CUDA supports C++ templating, and that can be a good way of generating a lot of kernel code from a relatively narrow base and making a lot of logic statically evaluable, which can help the compiler. But don't expect miracles - your problem is probably better suited to a different architecture than the GPU (Intel's Xeon Phi, for example).
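As a rough sketch of the grouping idea (the names perm and eq_type are hypothetical, and this only helps if equation forms repeat across elements): sort the element indices by equation type on the host, so that adjacent threads in a warp take the same switch case.

__global__ void xdot_kernel(const int *perm, const int *eq_type,
                            const double *data, double *xdot, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int e = perm[i];             /* element indices pre-sorted by type  */
    switch (eq_type[e]) {        /* a warp now sees (mostly) one case   */
    case 0: xdot[e] =  data[0] * data[0] + data[1]; break;
    case 1: xdot[e] = -2.0 * data[0] + data[2];     break;
    /* ... one case per distinct equation form ... */
    }
}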

Writing For loops efficiently

I am constructing the partial derivative of a function in C. The process consists mainly of a large number of small loops, each responsible for filling a column of the matrix. Because the matrix is huge, the code must be written efficiently. I have a number of plans in mind for the implementation, whose details I won't get into.
I know that smart compilers try to take advantage of the cache automatically, but I would like to know more about the details of using the cache and writing efficient code and efficient loops. I would appreciate some resources or websites where I can learn more about writing efficient code in terms of reducing memory access time and taking advantage of the cache.
I know that my request may look sloppy, but I am not a computer guy. I did some research but with no success.
So, any help is appreciated.
Thanks
Well written code tends to be efficient (though not always optimal). Start by writing good, clean code; if you then actually have a performance problem, it can be isolated and addressed.
It is probably best to write the code in the most readable and understandable way you can, and then profile it to see where the bottlenecks really are. Often your conception of where you need efficiency doesn't match up with reality.
Modern compilers do a decent job with many aspects of optimization and it seems unlikely that the process of looping will itself be a problem. Perhaps you should consider focusing on simplifying the calculation done by each loop.
Otherwise, you'll be looking at things such as accessing your matrix row by row so that you take advantage of the row-major storage order C uses (see this question).
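Concretely (a minimal sketch; the 1000 x 1000 matrix is made up):

double m[1000][1000];

/* good: the inner index is the rightmost one, so memory is walked
   sequentially and each cache line is used in full */
for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++)
        m[i][j] *= 2.0;

/* bad: each access jumps 1000 doubles ahead, defeating the cache */
for (int j = 0; j < 1000; j++)
    for (int i = 0; i < 1000; i++)
        m[i][j] *= 2.0;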
You'll want to build your for loops without if statements inside, because if statements create what is called "branching". The processor essentially guesses which path will be taken and pays a sometimes hefty penalty if it guesses wrong.
To extend that theme, you want to do as little inside the for loop as possible. You'll also want to define it with static limits, e.g.:
for(int i=1;i<100;i++) //This is better than
for(int i=1;i<N/i;i++) //this
Static limits mean that very little effort is expended determining whether the for loop should keep going. They also permit you to use OpenMP to divvy up the work in the loops, which can sometimes speed things up considerably. This is simple to do:
#pragma omp parallel for
for(int i=0;i<100;i++)
And, voilà! The code is parallelized.
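A complete, compilable version of that sketch (the loop body is made up; remember to build with gcc -fopenmp, otherwise the pragma is silently ignored):

#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    /* the 100 iterations are divided among the available cores */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100; i++)
        sum += i * 0.5;
    printf("%f\n", sum);
    return 0;
}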

Is there a simple way to run a C/C++ program in parallel without recoding?

I have a multi-core machine but when I tried to run this old C program (http://www.statmt.org/moses/giza/mkcls.html) it only utilizes one core. Is there a way to run the C code and send the cycles/threads to the other cores?
Is recoding the code into CUDA the only way?
I have a multi-core machine but when I tried to run this old C program (http://www.statmt.org/moses/giza/mkcls.html) it only utilizes one core. Is there a way to run the C code and send the cycles/threads to the other cores?
Without recompiling, definitely not.
You may be able to make some minor tweaks and use a tool that takes your source and parallelizes it automatically, but since the cores are quite separate from each other, you can't just spread the instructions between them. The code has to be compiled in such a way that there are two "streams of instructions". If you were to just send every other instruction to every other core in a dual-core system, it would probably run 10-100 times slower than running all the code on one core, because of all the extra communication between the cores that would be needed. Each core already has the ability to run several instructions in parallel, and a main reason multi-core processors exist in the first place is that this ability only goes so far at making things faster: there are only so many instructions that can be run before you need the result of a previous instruction.
Is recoding the code into CUDA the only way?
No, there are many other alternatives: OpenMP, or hand-coding using multiple threads. Or, the simplest approach: start the program two or four times over with different input data, and let the instances run completely separately. This obviously only works if there is something you can run multiple variants of at the same time...
A word on "making things parallel". It's not a magical thing that will make all code faster. Calculating something where you need the result of the previous calculation would be pretty hopeless - say you want to calculate Fibonacci series for example - f(n) = f(n-1) + f(n-2) - you can't do that with parallel calculations, because you need the result from the other calculation(s) to proceed this. On the other hand, if you have a dozen really large numbers that you want to check if they are prime-numbers, then you'd be able to do that about four times faster with a 4 core processor and four threads.
If you have a large matrix that needs to be multiplied by another large matrix or vector, that would be ideal to split up so you do part of the calculation on each core.
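For instance, a matrix-vector product splits naturally by rows (a minimal sketch; the size N is made up):

#define N 1000

void matvec(double A[N][N], const double x[N], double y[N])
{
    /* rows are independent, so threads share nothing but read-only data */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            s += A[i][j] * x[j];
        y[i] = s;
    }
}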
I haven't looked at the code for your particular project, but just looking at the description, I think it may parallelise quite well.
Yes, this is called automatic parallelization and it is an active area of research.
However, I know of no free tools for this. The Wikipedia article "automatic parallelization" has a list of tools. You will need access to the original source code, and you might have to add parallelization directives to the code.
You can run it in multiple processes and write another program that forwards tasks to either of those processes.
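A minimal sketch of the multiple-processes idea, assuming the input can be split up front (a task-forwarding broker would be more involved; ./worker and the file names are hypothetical):

#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *inputs[] = { "part1.txt", "part2.txt",
                             "part3.txt", "part4.txt" };
    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {                 /* child: runs on its own core */
            execlp("./worker", "./worker", inputs[i], (char *)NULL);
            _exit(127);                    /* only reached if exec failed */
        }
    }
    while (wait(NULL) > 0)                 /* parent: wait for all children */
        ;
    return 0;
}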
CUDA? You only need that if you want it to run on your graphics-card, so in this case that makes no sense.

How to allocate more CPU and RAM to a C program in Linux

I am running a simple C program which performs a lot of calculations (CFD) and hence takes a long time to run. However, I still have a lot of unused CPU and RAM. So how can I allocate more of my processing power to one program?
I'm guessing that CFD means Computational Fluid Dynamics (but CFD has also a lot of other meanings, so I might guess wrong).
You definitely should first profile your code. At the very least, compile it with gcc -Wall -pg -O and learn how to use gprof. You might also use strace to find out the system calls done by your code.
I'm not an expert of CFD (even if in the previous century I did work with CFD experts). But such code uses a lot of finite elements analysis and other vector computation.
If you are writing the code, you might perhaps consider using OpenMP (so by carefully adding OpenMP pragmas in your source code, you might speed it up), or even consider using GPGPUs by coding OpenCL kernels that run on the GPU.
You could also learn more about pthreads programming and change your code to use threads.
If you are using major numerical libraries such as BLAS, note that they are heavily tuned and come in specialized variants (e.g. multi-core, OpenMP-ed, or even OpenCL).
In all cases, parallelizing your code is a lot of work. You'll spend weeks or months on improving it, if it is possible.
Linux doesn't keep programs waiting with the CPU sitting idle when they have calculations to do.
Either you have a multicore CPU and only a single thread running (as suggested by @Pankrates), or you are blocking on some I/O.
You could nice the process with a negative increment, but you need to be superuser for that. See
man nice
This would increase the scheduling priority of the process. If it is competing with other processes for CPU time, it would get more CPU time and therefore "run faster".
As for increasing the amount of RAM used by the program: you'd need to rewrite or reconfigure the program to use more RAM. It is difficult to say more given the information available in the question.
To use multiple CPU's at once, you either need to run multiple copies of your program, or run multiple threads within the program. Neither is terribly hard to get started on.
However, it's much easier to do a parallel version of "I've got 10000 large numbers, I want to find out for each of them whether they are prime or not" than it is to do lots of "A = A + B" type calculations in parallel - because you need the new A before you can make the next step. CFD calculations tend to do the latter [as far as I understand it], but with large arrays. You may be able to split large vector calculations into a set of smaller vector calculations [say you have a 1000 x 1000 matrix: you could split it into 4 sets of 250 x 1000 matrices, or 4 sets of 500 x 500 matrices, and perform each of those in its own thread].
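A minimal pthreads sketch of that row-splitting idea (sizes and names are made up; link with -pthread):

#include <pthread.h>

#define N 1000
#define NTHREADS 4

static double A[N][N], x[N], y[N];

struct span { int lo, hi; };

static void *rows(void *arg)
{
    struct span *s = arg;
    for (int i = s->lo; i < s->hi; i++) {   /* each thread owns a row block */
        double t = 0.0;
        for (int j = 0; j < N; j++)
            t += A[i][j] * x[j];
        y[i] = t;
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct span sp[NTHREADS];
    for (int k = 0; k < NTHREADS; k++) {
        sp[k].lo = k * N / NTHREADS;
        sp[k].hi = (k + 1) * N / NTHREADS;
        pthread_create(&tid[k], NULL, rows, &sp[k]);
    }
    for (int k = 0; k < NTHREADS; k++)
        pthread_join(tid[k], NULL);
    return 0;
}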
If it's your own code, then you hopefully know what it does and how it works. If it's someone elses code, then you need to talk to whoever owns the code.
There is no magical way to "automatically make use of more CPUs". 30% CPU usage on a quad-core processor probably means that your system is basically using one core, with 5% or so of overhead for other things going on in the system - or maybe there is a second thread somewhere in your application that uses a little bit of CPU doing whatever it does. Or the application is multithreaded but doesn't use the multiple cores to their full extent because of contention between the threads over some shared resource... It's impossible for us to say which of these [or several other] alternatives applies.
Asking for more RAM isn't going to help unless you have something useful to put into that memory. If there is free memory, your application gets as much memory as it needs.

Why is my OpenMP implementation slower than a single threaded implementation? (Followup)

This is a follow up to Why is my OpenMP implementation slower than a single threaded implementation? .
I have adhered to the answer provided, and used tasking instead of for pragmas to speed up the code. However, compared to an otherwise identical sequential program, the tasking version runs no faster; I see no speedup at all.
The reworked code is here: http://pastebin.com/3SFaNEc4
I simply removed all the for pragmas and replaced them with task pragmas for the recursive procedures.
Am I doing anything wrong? I should be seeing an almost linear speed up. What do you guys think?
Thanks!
First - you still have an "#pragma end critical", which should be removed. It isn't causing a problem, but it is incorrect. Second - as I said in answer to your other question, you may have to rethink how you are parallelizing the code to see a speedup; just replacing the other pragmas with task pragmas won't necessarily speed it up. Third - you haven't put the tasks inside a parallel region, so you are not running in parallel at all. And you can't just wrap a parallel region around the tasks, or every thread in the team will execute the same tasks multiple times.
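The usual pattern is to create the team once and let a single thread seed the recursion; the child tasks then run on the whole team (solve here is a stand-in for your recursive procedure):

#pragma omp parallel       /* create the thread team once            */
{
    #pragma omp single     /* one thread starts the recursion...     */
    solve(0);              /* ...the tasks it spawns run on the team */
}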
