OpenMP low performance - c

I am trying to parallelize a loop in my program, so I read up on multithreading. I first looked at a POSIX threads tutorial, but it was too complicated, so I tried something easier: OpenMP. I have successfully parallelized my code, but the execution time is now worse than in the serial case. Below is a portion of my program. Can you tell me what the problem is? Should I specify which variables are shared and which are private? And how can I know which kind each variable should be? I have searched many forums and I still don't know what to do.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#define D 0.215 // magnetic dipolar constant
main()
{
int i,j,n,p,NTOT = 1600,Nc = NTOT-1;
float r[2],spin[2*NTOT],w[2],d;
double E,F,V,G,dU;
.
.
.
for(n = 1; n <= Nc; n++){
fscanf(voisins,"%d%d%f%f%f",&i,&j,&r[0],&r[1],&d);
V = 0.0;E = 0.0;F = 0.0;
#pragma omp parallel num_threads(4)
{
#pragma omp for schedule(auto)
for(p = 0;p < 2;p++)
{
V += (D/pow(d,3.0))*(spin[2*i-2+p]-w[p])*spin[2*j-2+p];
E += (spin[2*i-2+p]-w[p])*r[p];
F += spin[2*j-2+p]*r[p];
}
}
G = -3*(D/pow(d,5.0))*E*F;
dU += (V+G);
}
.
.
.
}//End of main()

You are parallelizing a loop with only 2 iterations: p=0 and p=1. The way that OpenMP's omp for works is by splitting up the loop iterations among your threads in the parallel team (which you've defined as 4 threads) and letting them work through their part of the problem in parallel.
With only 2 iterations, 2 of your threads will be sitting idle. On top of that, actually figuring out which threads will work on which part of the problem takes overhead. And if your actual loop doesn't take long (which in this case it clearly doesn't), the overhead will cost more than the benefits you've gained from parallelization.
A better strategy is usually to parallelize the outermost loops with OpenMP whenever possible in order to solve both the problems of splitting up the work evenly and reducing the (relative) overhead. Alternatively, you can parallelize at the lowest loop level using OpenMP 4.0's omp simd directive.
Lastly, you are not computing the variables V, E, and F correctly. Because they are summed from iteration to iteration while being shared by all threads, the updates race with each other; you should declare all three as reduction variables with reduction(+:V,E,F). I would be surprised if you are currently getting the correct answer as is.
(Also as High Performance Mark says: make sure you're timing the wall time execution of your program and not the CPU time execution of your program. This is typically done with omp_get_wtime().)
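A minimal sketch of the outer-loop strategy described above, under some assumptions: the records are first read from the file into arrays (the fscanf call inside the loop cannot simply be shared between threads), and the function name pair_energy and the array names idx_i, idx_j, rx, ry, dist are hypothetical. The powers of d are computed by multiplication instead of pow(), which is my substitution and not part of the original code.

```c
#define D 0.215 /* magnetic dipolar constant, as in the question */

/* Hypothetical helper: returns the sum of V + G over all pairs.
 * Assumes the file was already read into idx_i[], idx_j[], rx[], ry[], dist[]. */
double pair_energy(int npairs, const int *idx_i, const int *idx_j,
                   const float *rx, const float *ry, const float *dist,
                   const float *spin, const float *w)
{
    double dU = 0.0;
    int n;

    /* Parallelize the outer loop. Each thread accumulates a private copy
     * of dU, and OpenMP adds the copies together at the end of the loop. */
    #pragma omp parallel for reduction(+:dU)
    for (n = 0; n < npairs; n++) {
        const float r[2] = { rx[n], ry[n] };
        const int i = idx_i[n], j = idx_j[n];
        const double d = dist[n];
        const double d3 = d * d * d, d5 = d3 * d * d;
        double V = 0.0, E = 0.0, F = 0.0; /* declared in the loop body, so private */

        for (int p = 0; p < 2; p++) {     /* only 2 iterations: left serial */
            V += (D / d3) * (spin[2*i-2+p] - w[p]) * spin[2*j-2+p];
            E += (spin[2*i-2+p] - w[p]) * r[p];
            F += spin[2*j-2+p] * r[p];
        }
        dU += V - 3.0 * (D / d5) * E * F;
    }
    return dU;
}
```

Because V, E and F are declared inside the loop body, every iteration gets its own copies and only dU needs a reduction clause. Compiled without -fopenmp the pragma is ignored and the function runs serially with the same result.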

Related

Why this OpenMP parallel for loop doesn't work properly?

I would like to implement OpenMP to parallelize my code. I am starting from a very basic example to understand how it works, but I am missing something...
So, my example looks like this, without parallelization:
int main() {
...
for (i = 0; i < n-1; i++) {
u[i+1] = (1+h)*u[i]; // Euler
v[i+1] = v[i]/(1-h); // implicit Euler
}
...
return 0;
}
Where I omitted some parts in the "..." because they are not relevant. It works, and if I print the u[] and v[] arrays to a file, I get the expected results.
Now, if I try to parallelize it just by adding:
#include <omp.h>
int main() {
...
omp_set_num_threads(2);
#pragma omp parallel for
for (i = 0; i < n-1; i++) {
u[i+1] = (1+h)*u[i]; // Euler
v[i+1] = v[i]/(1-h); // implicit Euler
}
...
return 0;
}
The code compiles and the program runs, BUT the u[] and v[] arrays are half full of zeros.
If I set omp_set_num_threads( 4 ), I get three quarters of zeros.
If I set omp_set_num_threads( 1 ), I get the expected result.
So it looks like only the first thread is being executed, while not the other ones...
What am I doing wrong?
OpenMP assumes that each iteration of a loop is independent of the others. When you write this:
for (i = 0; i < n-1; i++) {
u[i+1] = (1+h)*u[i]; // Euler
v[i+1] = v[i]/(1-h); // implicit Euler
}
Iteration i of the loop writes the values that iteration i+1 reads. Meanwhile, iteration i+1 might be running at the same time on another thread. This also explains the zeros: each thread typically gets a contiguous block of iterations, and every thread but the first starts from an element that has not been computed yet and is (apparently) still zero.
Unless you can make the iterations independent, this isn't a good use-case for parallelism.
And, if you think about what Euler's method does, it should be obvious that it is not possible to parallelize the code you're working on in this way. Euler's method calculates the state of a system at time t+1 based on information at time t. Since you cannot know what's at t+1 without first knowing what's at t, there's no way to parallelize across the iterations of Euler's method.
u[i+1] = (1+h)*u[i];
v[i+1] = v[i]/(1-h);
is equivalent to
u[i] = pow((1+h), i)*u[0];
v[i] = v[0]*pow(1.0/(1-h), i);
therefore you can parallelize your code like this
#pragma omp parallel for
for (int i = 0; i < n; i++) {
u[i] = pow((1+h), i)*u[0];
v[i] = v[0]*pow(1.0/(1-h), i);
}
If you want to mitigate the cost of the pow function, you can call it once per thread rather than once per iteration, like this (since the number of threads nt << n).
#pragma omp parallel
{
int nt = omp_get_num_threads();
int t = omp_get_thread_num();
int s = (t+0)*n/nt;
int f = (t+1)*n/nt;
u[s] = pow((1+h), s)*u[0];
v[s] = v[0]*pow(1.0/(1-h), s);
for(int i=s; i<f-1; i++) {
u[i+1] = (1+h)*u[i];
v[i+1] = v[i]/(1-h);
}
}
You can also write your own pow(double, int) function optimized for integer powers.
Note that the relationship I used is not in fact 100% equivalent because floating point arithmetic is not associative. That's not usually a problem but it's something one should be aware of.
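The integer-power helper mentioned above could look like the following sketch. It uses exponentiation by squaring; the name ipow is hypothetical, and handling negative exponents by inverting the base is my own choice.

```c
/* Hypothetical helper: base raised to an integer power by repeated
 * squaring, using O(log |exp|) multiplications. */
double ipow(double base, int exp)
{
    /* Work on the unsigned magnitude so that INT_MIN is handled too. */
    unsigned int e = (exp < 0) ? 0u - (unsigned int)exp : (unsigned int)exp;
    double result = 1.0;

    if (exp < 0)
        base = 1.0 / base;
    while (e != 0) {
        if (e & 1u)        /* this bit of the exponent is set */
            result *= base;
        base *= base;      /* base^(2^k) for the next bit */
        e >>= 1;
    }
    return result;
}
```

In the loops above, pow((1+h), i) would then become ipow(1+h, i). As with the closed-form rewrite itself, the result can differ from the step-by-step recurrence in the last bits.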
Before parallelizing your code you must identify its concurrency, i.e. the set of tasks that are logically happening at the same time and then figure out a way to make them actually happen in parallel.
As mentioned above, this is not a good example to apply parallelism to, because there is no concurrency in its nature. Attempting to use parallelism like that will lead to wrong results due to so-called race conditions.
If you just want to learn how OpenMP works, try to come up with examples where you can clearly identify conceptually independent tasks. One of the simplest I can think of is computing the area under a curve by numerical integration.
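As an illustration of such an independent-task problem, here is a small sketch that estimates the area under 4/(1+x^2) on [0,1], which equals pi, with the midpoint rule. The function name integrate_pi and the choice of integrand are mine, not part of the answer above.

```c
/* Midpoint-rule estimate of the integral of 4/(1+x^2) over [0,1] (= pi).
 * Every slice is independent of the others, so the loop can be split
 * among threads; only the running sum needs a reduction. */
double integrate_pi(long steps)
{
    const double h = 1.0 / (double)steps;  /* width of one slice */
    double sum = 0.0;
    long i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < steps; i++) {
        const double x = (i + 0.5) * h;    /* midpoint of slice i */
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * h;
}
```

Unlike the Euler loop, no iteration here reads anything another iteration writes, which is what makes the work-sharing safe.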
Welcome to the parallel ( or "just"-concurrent ) plurality of computing realities.
Why?
Any non-sequential schedule of processing the loop will have problems with a hidden ( not correctly handled ) breach of data integrity in time, both in the order of data accesses and in the values read.
A pure-[SERIAL] flow of processing is free from such dangers, as executing one step after another imposes a rigid order in which there is no chance to "touch" the same memory location twice or more at the same time.
This "peace-of-mind" is lost once a process goes into a "just"-[CONCURRENT] or a true-[PARALLEL] mode of processing.
Suddenly there is an almost random order ( in the "just"-[CONCURRENT] case ) or no order at all, as everything happens at once ( in the true-[PARALLEL] case ). Think of a robot arm with 6 degrees of freedom: it reaches each trajectory point by driving all 6 axes in a true-[PARALLEL] fashion, not one after another ( pure-[SERIAL] ) and not some now and the rest whenever they get their turn ( "just"-[CONCURRENT] ), because then the 3D trajectory of the arm would be hard to predict and collisions on a car assembly line would be frequent.
Solution:
Use either a defensive tool, atomic operations, or a principled approach: design a lock-free ( non-blocking ) algorithm where possible, or explicitly signal and coordinate reads and writes ( at a cost in extra time and degraded performance ), so that all "old" writes get safely through before any "next" read goes ahead to grab the right value. Without such protective steps ( as in the code above ), the values get damaged into inconsistent digital trash.
Epilogue:
Using a tool like OpenMP on a problem where it cannot bring any advantage costs development time and decreases performance ( all the tool-related overheads still have to be paid, while the net effect of parallelism is literally zero when the algorithm does not allow any ), so one finally pays way more than one gets.
A good place to learn about OpenMP best practices is, for example, the material from Lawrence Livermore National Laboratory ( indeed very competent ) and similar publications on using OpenMP.

OpenMP for beginners

I just got started with openMP; I wrote a little C code in order to check if what I have studied is correct. However I found some troubles; here is the main.c code
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
float msec_kernel;
const int N = 1000000;
int i, a[N + 1]; /* N + 1 elements: the loop below writes a[1]..a[N] */
clock_t start = clock(), diff;
#pragma omp parallel for private(i)
for (i = 1; i <= N; i++){
a[i] = 2 * i;
}
diff = clock() - start;
msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
printf("Kernel Time: %e s\n",msec_kernel*1e-03);
printf("a[N] = %d\n",a[N]);
return 0;
}
My goal is to see how long it takes the PC to do this operation using 1 and 2 CPUs. To compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2; however I don't get the right execution time; my results in fact are:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
corresponding to N=1 and N=2 respectively. As you can see, when I use 2 CPUs it takes slightly more time than using just one! What am I doing wrong? How can I fix this problem?
First of all, using multiple cores doesn't automatically mean that you're going to get better performance.
OpenMP has to manage the distribution of data among your cores, which takes time as well. Especially for very basic operations, such as the single multiplication you are doing, the sequential (single-core) program will perform better.
Second, by going through every element of your array only once and not doing anything else, you make no use of cache memory, and certainly not of the cache shared between CPUs.
So you should start reading about general algorithm performance. In my opinion, making good use of the shared cache is the essence of using multiple cores.
Today's computers have reached a stage where the CPU is much faster than a memory allocation, read or write. This means that when using multiple cores, you'll only benefit if you exploit things like the shared cache, because distributing the data and initializing and managing the threads take time as well. To really see a speedup (an essential term in parallel computing), you should program an algorithm that is heavy on computation rather than on memory access; this has to do with locality (another important term).
So if you want to experience a big performance boost from using multiple cores, test it on a matrix-matrix multiplication of big matrices, such as 10'000*10'000. Plot graphs of input size (matrix size) against time and against GFLOPS, and compare the multicore version with the sequential one.
Also make yourself comfortable with complexity analysis (Big O notation).
Matrix-matrix multiplication performs O(n^3) operations on O(n^2) data, i.e. O(n) operations per matrix element, so each element loaded into cache can be reused many times.
Hope this helps :-)
I suggest setting the number of threads within the code itself, either directly on the #pragma line, as in #pragma omp parallel for num_threads(2), or with the omp_set_num_threads function, as in omp_set_num_threads(2);
Further, when doing timing or performance analysis it is really important to run the program multiple times and then take the mean of the run times (or a similar statistic). Running a program only once will not give you a meaningful reading. Always run it several times in a row, and don't forget to vary the input data as well.
I suggest writing a test.c file that calls your actual function in a loop and then calculates the time per execution:
int executiontimes = 20;
clock_t initial_time = clock();
for(int i = 0; i < executiontimes; i++){
function_multiplication(values);
}
clock_t final_time = clock();
clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;
Improve this test algorithm: generate your values with rand(), seed them with srand(), and so on. If you have more questions on the subject or about my answer, leave a comment and I'll try to explain further.
The function clock() returns elapsed CPU time, which includes ticks from all cores. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total CPU time will generally be longer than the serial time.
If you want the real time (wall-clock time), use the OpenMP runtime library function omp_get_wtime(), declared in omp.h. It is portable across platforms and should be the preferred way to do wall timing.
You can also use the POSIX functions defined in time.h:
struct timespec start, stop;
clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);
double elapsed_time = (stop.tv_sec - start.tv_sec) +
1e-9 * (stop.tv_nsec - start.tv_nsec);

OpenMP with 1 thread slower than sequential version

I have implemented knapsack using OpenMP (gcc version 4.6.3)
#define MAX(x,y) ((x)>(y) ? (x) : (y))
#define table(i,j) table[(i)*(C+1)+(j)]
for(i=1; i<=N; ++i) {
#pragma omp parallel for
for(j=1; j<=C; ++j) {
if(weights[i]>j) {
table(i,j) = table(i-1,j);
}else {
table(i,j) = MAX(profits[i]+table(i-1,j-weights[i]), table(i-1,j));
}
}
}
Execution time for the sequential program = 1s
Execution time for the OpenMP version with 1 thread = 1.7s (0.7s of overhead, about 40% of the run time)
I used the same compiler optimization flags (-O3) in both cases.
Can someone explain the reason behind this behavior?
Thanks.
Enabling OpenMP inhibits certain compiler optimisations, e.g. it could prevent loops from being vectorised or shared variables from being kept in registers. Therefore OpenMP-enabled code run on a single thread is usually slower than the serial version, and one has to utilise the available parallelism to offset this.
That being said, your code contains a parallel region nested inside the outer loop. This means that the overhead of entering and exiting the parallel region is incurred N times. This only makes sense if N is relatively small and C is significantly larger (like orders of magnitude larger) than N, so that the work being done inside the region greatly outweighs the OpenMP overhead.
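One way to avoid paying the region overhead N times is to enter the parallel region once and keep only a work-sharing loop inside the outer loop. The sketch below assumes the table is stored as in the question; the wrapper name knapsack_fill and its parameter list are hypothetical, and the table macro is written out by hand. Whether this removes the slowdown depends on the OpenMP runtime, since each row still ends in a barrier.

```c
#define MAX(x,y) ((x)>(y) ? (x) : (y))

/* Hypothetical wrapper around the question's loops.
 * table has (N+1)*(C+1) entries; row 0 and column 0 must already be
 * initialised (typically to 0). weights[] and profits[] are 1-based. */
void knapsack_fill(int N, int C, const int *weights, const int *profits,
                   int *table)
{
    /* One parallel region for the whole computation: the thread team is
     * set up once instead of once per row. */
    #pragma omp parallel
    {
        for (int i = 1; i <= N; ++i) {  /* every thread runs the i loop */
            /* Work-sharing loop: the columns of row i are split among the
             * threads. The implicit barrier at its end guarantees that
             * row i is complete before any thread starts row i+1. */
            #pragma omp for
            for (int j = 1; j <= C; ++j) {
                const int above = table[(i-1)*(C+1) + j];
                if (weights[i] > j)
                    table[i*(C+1) + j] = above;
                else
                    table[i*(C+1) + j] =
                        MAX(profits[i] + table[(i-1)*(C+1) + j - weights[i]],
                            above);
            }
        }
    }
}
```

The row-to-row dependency is preserved because row i only reads row i-1, and the barrier after each work-sharing loop keeps the threads in step.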

OpenMP and C parallel for loop: why does my code slow down when using OpenMP?

I'm new here and a beginner-level programmer in C. I'm having a problem using OpenMP to speed up a for loop. Below is a simple example:
#include <stdlib.h>
#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <omp.h>
gsl_rng *rng;
main()
{
int i, M=100000000;
double tmp;
/* initialize RNG */
gsl_rng_env_setup();
rng = gsl_rng_alloc (gsl_rng_taus);
gsl_rng_set (rng,(unsigned long int)791526599);
// option 1: parallel
#pragma omp parallel for default(shared) private( i, tmp ) schedule(dynamic)
for(i=0;i<=M-1;i++){
tmp=gsl_ran_gamma_mt(rng, 4, 1./3 );
}
// option 2: sequential
for(i=0;i<=M-1;i++){
tmp=gsl_ran_gamma_mt(rng, 4, 1./3 );
}
}
The code draws from a gamma distribution M times. It turns out that the parallel approach with OpenMP (option 1) takes about 1 minute, while the sequential approach (option 2) takes only 20 seconds. While running with OpenMP, I can see that the CPU usage is 800% (the server I'm using has 8 CPUs). The system is Linux with GCC 4.1.3. The compile command I'm using is gcc -fopenmp -lgsl -lgslcblas -lm (I'm using GSL).
Am I doing something wrong? Please help me! Thanks!
P.S. As pointed out by some users, it might be caused by rng. But even if I replace
tmp=gsl_ran_gamma_mt(rng, 4, 1./3 );
by say
tmp=1000*10000;
the problem is still there...
gsl_ran_gamma_mt probably locks on rng to prevent concurrency issues (if it didn’t, your parallel code probably contains a race condition and thus yields wrong results). The solution then would be to have a separate rng instance for each thread, thus avoiding locking.
Your rng variable is shared, so the threads are spending all their time waiting to be able to use the random number generator. Give each thread a separate instance of the RNG. This will probably mean making the RNG initialization code run in parallel as well.
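The "one generator per thread" idea can be sketched as follows. To keep the example self-contained it does not use GSL: next_uniform is a stand-in xorshift64 generator of my own choosing, and mean_uniform is a hypothetical driver. With GSL, the same structure would presumably call gsl_rng_alloc and gsl_rng_set once per thread inside the parallel region, with a different seed for each thread.

```c
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Stand-in generator (xorshift64), NOT a GSL function: advances *state
 * and returns a double in [0, 1). The state must never be zero. */
static double next_uniform(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return (double)(x >> 11) / 9007199254740992.0; /* top 53 bits / 2^53 */
}

/* Hypothetical driver: mean of n uniform draws, where every thread owns
 * its generator state, so no locking or sharing is needed. */
double mean_uniform(long n)
{
    double sum = 0.0;

    #pragma omp parallel reduction(+:sum)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        /* Private state, seeded differently in each thread. */
        uint64_t state = 791526599ULL
                       + 0x9E3779B97F4A7C15ULL * (uint64_t)(tid + 1);
        long i;

        #pragma omp for
        for (i = 0; i < n; i++)
            sum += next_uniform(&state);
    }
    return sum / (double)n;
}
```

Because the seeds depend on the thread number, the exact stream of numbers changes with the number of threads; that is a property of this sketch, not a requirement of the approach.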
Again thanks everyone for helping. I just found out that if I get rid of
schedule(dynamic)
in the code, the problem disappears. But why is that?

Why is my computer not showing a speedup when I use parallel code?

So I realize this question sounds stupid (and yes I am using a dual core), but I have tried two different libraries (Grand Central Dispatch and OpenMP), and when using clock() to time the code with and without the lines that make it parallel, the speed is the same. (for the record they were both using their own form of parallel for). They report being run on different threads, but perhaps they are running on the same core? Is there any way to check? (Both libraries are for C, I'm uncomfortable at lower layers.) This is super weird. Any ideas?
EDIT: Added detail for Grand Central Dispatch in response to OP comment.
While the other answers here are useful in general, the specific answer to your question is that you shouldn't be using clock() to compare the timing. clock() measures CPU time which is added up across the threads. When you split a job between cores, it uses at least as much CPU time (usually a bit more due to threading overhead). Search for clock() on this page, to find "If process is multi-threaded, cpu time consumed by all individual threads of process are added."
It's just that the job is split between threads, so the overall time you have to wait is less. You should be using the wall time (the time on a wall clock). OpenMP provides a routine omp_get_wtime() to do it. Take the following routine as an example:
#include <omp.h>
#include <time.h>
#include <math.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
int i, nthreads;
clock_t clock_timer;
double wall_timer;
for (nthreads = 1; nthreads <=8; nthreads++) {
clock_timer = clock();
wall_timer = omp_get_wtime();
#pragma omp parallel for private(i) num_threads(nthreads)
for (i = 0; i < 100000000; i++) cos(i);
printf("%d threads: time on clock() = %.3f, on wall = %.3f\n", \
nthreads, \
(double) (clock() - clock_timer) / CLOCKS_PER_SEC, \
omp_get_wtime() - wall_timer);
}
}
The results are:
1 threads: time on clock() = 0.258, on wall = 0.258
2 threads: time on clock() = 0.256, on wall = 0.129
3 threads: time on clock() = 0.255, on wall = 0.086
4 threads: time on clock() = 0.257, on wall = 0.065
5 threads: time on clock() = 0.255, on wall = 0.051
6 threads: time on clock() = 0.257, on wall = 0.044
7 threads: time on clock() = 0.255, on wall = 0.037
8 threads: time on clock() = 0.256, on wall = 0.033
You can see that the clock() time doesn't change much. I get 0.254 without the pragma, so it's a little slower using openMP with one thread than not using openMP at all, but the wall time decreases with each thread.
The improvement won't always be this good due to, for example, parts of your calculation that aren't parallel (see Amdahl's law) or different threads fighting over the same memory.
EDIT: For Grand Central Dispatch, the GCD reference states, that GCD uses gettimeofday for wall time. So, I create a new Cocoa App, and in applicationDidFinishLaunching I put:
struct timeval t1,t2;
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
for (int iterations = 1; iterations <= 8; iterations++) {
int stride = 1e8/iterations;
gettimeofday(&t1,0);
dispatch_apply(iterations, queue, ^(size_t i) {
for (int j = 0; j < stride; j++) cos(j);
});
gettimeofday(&t2,0);
NSLog(@"%d iterations: on wall = %.3f\n",iterations, \
t2.tv_sec+t2.tv_usec/1e6-(t1.tv_sec+t1.tv_usec/1e6));
}
and I get the following results on the console:
2010-03-10 17:33:43.022 GCDClock[39741:a0f] 1 iterations: on wall = 0.254
2010-03-10 17:33:43.151 GCDClock[39741:a0f] 2 iterations: on wall = 0.127
2010-03-10 17:33:43.236 GCDClock[39741:a0f] 3 iterations: on wall = 0.085
2010-03-10 17:33:43.301 GCDClock[39741:a0f] 4 iterations: on wall = 0.064
2010-03-10 17:33:43.352 GCDClock[39741:a0f] 5 iterations: on wall = 0.051
2010-03-10 17:33:43.395 GCDClock[39741:a0f] 6 iterations: on wall = 0.043
2010-03-10 17:33:43.433 GCDClock[39741:a0f] 7 iterations: on wall = 0.038
2010-03-10 17:33:43.468 GCDClock[39741:a0f] 8 iterations: on wall = 0.034
which is about the same as I was getting above.
This is a very contrived example. In fact, you need to be sure to keep the optimization at -O0, or else the compiler will realize we don't keep any of the calculations and not do the loop at all. Also, the integer that I'm taking the cos of is different in the two examples, but that doesn't affect the results too much. See the STRIDE on the manpage for dispatch_apply for how to do it properly and for why iterations is broadly comparable to num_threads in this case.
EDIT: I note that Jacob's answer includes
"I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on... This way you can be sure that it's running on both cores."
which is not correct (it has been partly fixed by an edit). Using omp_get_thread_num() is indeed a good way to ensure that your code is multithreaded, but it doesn't show "which core it's working on", just which thread. For example, the following code:
#include <omp.h>
#include <stdio.h>
int main() {
int i;
#pragma omp parallel for private(i) num_threads(50)
for (i = 0; i < 50; i++) printf("%d\n", omp_get_thread_num());
}
prints out that it's using threads 0 to 49, but this doesn't show which core it's working on, since I only have eight cores. By looking at the Activity Monitor (the OP mentioned GCD, so must be on a Mac - go Window/CPU Usage), you can see jobs switching between cores, so core != thread.
Most likely your execution time isn't bound by those loops you parallelized.
My suggestion is that you profile your code to see what is taking most of the time. Most engineers will tell you that you should do this before doing anything drastic to optimize things.
It's hard to guess without any details. Maybe your application isn't even CPU bound. Did you watch CPU load while your code was running? Did it hit 100% on at least one core?
Your question is missing some very crucial details such as what the nature of your application is, what portion of it are you trying to improve, profiling results (if any), etc...
Having said that you should remember several critical points when approaching a performance improvement effort:
Efforts should always concentrate on the code areas which have been proven, by profiling, to be inefficient
Parallelizing CPU bound code will almost never improve performance (on a single core machine). You will be losing precious time on unnecessary context switches and gaining nothing. You can very easily worsen performance by doing this.
Even if you are parallelizing CPU bound code on a multicore machine, you must remember you never have any guarantee of parallel execution.
Make sure you are not going against these points, because an educated guess (barring any additional details) will say that's exactly what you're doing.
If you are using a lot of memory inside the loop, that might prevent it from being faster. Also you could look into pthread library, to manually handle threading.
I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on if you don't specify num_threads. For e.g.,
printf("Computing bla %d on core %d/%d ...\n",i+1,omp_get_thread_num()+1,omp_get_max_threads());
The above will work for this pragma
#pragma omp parallel for default(none) shared(a,b,c)
This way you can be sure that it's running on both cores since only 2 threads will be created.
Btw, is OpenMP enabled when you're compiling? In Visual Studio you have to enable it in the Property Pages, C++ -> Language and set OpenMP Support to Yes
