OpenMP not working in C [closed]

I use a function with a multithreaded section inside a while loop. The code runs fine sometimes, but other times it just stops. I have identified that the problem is in the multithreaded part. I'm a newbie with multithreading and OpenMP... if someone has a tip to solve this, I'll be very grateful. xP
void paralelSim(PetriNet *p, Matrix *mconflit, int steps){
    Matrix ring;
    choicesRing(p, mconflit, &ring);

    clock_t t;
    t = clock();
    while((ring.row != 0) && ((steps > 0) || (steps == CONTINUE))){
        omp_set_dynamic(0);
        omp_set_num_threads(ring.col);
        #pragma omp parallel shared(p)
        {
            firethread(p, ring.m[0][omp_get_thread_num()]);
        }
        if(steps != CONTINUE){
            steps--;
        }
        choicesRing(p, mconflit, &ring);
    }
    t = clock() - t;
    printf("%f seconds.\n", t, ((float)t)/CLOCKS_PER_SEC);
    printf("m:\n");
    showMatrix(&p->m);
    printf("steps: %d\n", p->steps);
}

Too much information is missing from your code snippet to be sure of anything, but I can at least give you some hints about what could go wrong...
You set the number of threads to an arbitrary value that, I assume, can be quite large. But that's not really in OpenMP's philosophy. Indeed, in OpenMP, you don't expect the number of threads to go far beyond the number of cores or hardware threads available on the machine. I'm sure the runtime library can handle many more, but I'm also sure the performance penalty can be severe, and I even suspect there is a limit to what it can manage.
You repeatedly disable dynamic adjustment of the number of threads (that is what omp_set_dynamic(0) does). Doing it once before the loop should be enough, unless firethread() sets it back.
You time your runs with clock(), which is a bad idea: it measures the CPU time of the current thread and all its children and sums them, instead of reporting the wall time. So you'll never see any speed-up, and you'll even see apparent slow-downs. Use omp_get_wtime() instead.
Your time-printing statement is wrong: it passes two values for only one conversion in the format string.
Here is a tentative rewrite of your code, which I can't compile or test, but which I feel is more in line with what one would expect of OpenMP code. Maybe it will improve / solve your issue.
void paralelSim(PetriNet *p, Matrix *mconflit, int steps){
    Matrix ring;

    omp_set_dynamic(0);

    choicesRing(p, mconflit, &ring);

    double t = omp_get_wtime();
    while((ring.row != 0) && ((steps > 0) || (steps == CONTINUE))){
        #pragma omp parallel for
        for(int i = 0; i < ring.col; ++i)
            firethread(p, ring.m[0][i]);
        if(steps != CONTINUE){
            steps--;
        }
        choicesRing(p, mconflit, &ring);
    }
    t = omp_get_wtime() - t;
    printf("%f seconds.\n", t);
    printf("m:\n");
    showMatrix(&p->m);
    printf("steps: %d\n", p->steps);
}
Again, I didn't even compile this, so there might (likely) be some typos. Moreover, should it work but not give the expected performance, you could consider moving the omp parallel region outside of the while loop and using omp single where needed.
Finally, since I don't know how many cores you plan to run this on, I didn't explicitly set the number of threads. This will, for the moment, rely on either your environment's default or your setting of the OMP_NUM_THREADS environment variable.
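For example, setting the thread count via the environment could look like this (the commented-out binary name is hypothetical):

```shell
# Pick the thread count via the environment before launching the program
export OMP_NUM_THREADS=4    # e.g. one thread per physical core
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# ./paralel_sim             # hypothetical binary name
```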


What is the C way to report progress of computation? [closed]

This is a follow-up question to Using a thread in C++ to report progress of computations.
Suppose that I have a for loop which executes run_difficult_task() many times, and I would like to infer how far the loop has advanced. I used to write:
int i;
for (i = 0; i < 10000; ++i) {
    run_difficult_task(i);
    if (i % 100 == 0) {
        printf("i = %d\n", i);
    }
}
but the main problem with this approach is that executing run_difficult_task() might literally take forever (by being stuck in an infinite loop, etc.), so I would like to get a progress report every k seconds by printing out the value of the loop variable i.
I found quite a rich literature on this site regarding object-oriented multithreading (which I am not really familiar with) in various programming languages, but the questions I found doing this in C style seem quite outdated. Is there a platform-independent, C11 way to do what I want? If there is not, then I would be interested in methods that work on unix with gcc.
Note: I do not wish to run various instances of run_difficult_task in parallel (with, for example, OpenMP), but I want to run the for loop and the reporting mechanism in parallel.
Related: How to "multithread" C code and How do I start threads in plain C?
Linux (and also POSIX systems) provide the alarm library call. This allows you to do something after an interval of seconds without interrupting your main thread, and without bothering with multi-threading when you don't really need it. It was very much created for use cases like yours.
You can try using one thread (the worker thread) or possibly two (one that does computations and one that displays output while main is doing something else or just waiting) and some global variables (ugh).
The first thread will be your workhorse doing computations and updating some global variable. The second one (maybe simply the main thread) will then check whether this variable has changed or not and then print the stats (perhaps, that variable will hold the stats, for example, percentage).
What you can try:
int ping = 0, working = 0, data;  /* in real code, make these _Atomic (or at least volatile) */

// in main thread
for (/* something */) {
    // spawn worker thread
    while (working) {
        if (ping) {
            printf("%d\n", data);
            ping = 0;
        }
    }
}

// in worker thread
working = 1;
while (/* something */) {
    // do a lot of computations
    if (/* some condition */) {
        if (!ping) {
            data = /* data */;
            ping = 1;
        }
    }
}
working = 0;
Here's a simple time based progress indicator that I've often used:
void
progress(int i)
{
    time_t tvnow;
    static time_t tvlast;
    static time_t tvbeg;

    if (tvbeg == 0) {
        tvbeg = time(NULL);
        tvlast = tvbeg - 2;
    }

    tvnow = time(NULL);
    if ((tvnow - tvlast) >= 1) {
        printf("\r%ld: i = %d", (long)(tvnow - tvbeg), i);
        fflush(stdout);
        tvlast = tvnow;
    }
}
int i;
for (i = 0; i < 10000; ++i) {
    run_difficult_task(i);
    progress(i);
}
UPDATE:
Does this update if run_difficult_task(i) runs for longer than 2 seconds?
No, but I've updated the example to put the progress code in a separate function, which is what I normally do in my own code.
You'll have to add calls to the progress function within run_difficult_task to get finer grain progress--this is something I also do in my own code.
But, notice that I added an elapsed time [in seconds] to the progress.
If you didn't do that: when run_difficult_task takes longer than 2 seconds to run, there is no progress report until it returns, because progress as you define it is the incrementing of i, which is done by the outer loop.
For my own stuff, the progress function can handle an arbitrary number of progress indicators from an arbitrary number of worker threads.
So, if that would be of interest to you, and [say] run_difficult_task has some inner loop variables like j, k, l, these could be added to the progress. Or, whatever you wish to report on.

OpenMP low performance

I am trying to parallelize a loop in my program, so I searched about multithreading. First I took a look at a POSIX multithreaded programming tutorial; it was so complicated that I tried to do something easier. I tried OpenMP. I have successfully parallelized my code, but the problem is that the execution time got worse than in the serial case. Below is a portion of my program. I hope you can tell me what the problem is. Should I specify which variables are shared and which are private? And how can I know the kind of each variable? I hope you can answer me, because I have searched many forums and I still don't know what to do.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#define D 0.215 // magnetic dipolar constant
main()
{
    int i,j,n,p,NTOT = 1600,Nc = NTOT-1;
    float r[2],spin[2*NTOT],w[2],d;
    double E,F,V,G,dU;
    .
    .
    .
    for(n = 1; n <= Nc; n++){
        fscanf(voisins,"%d%d%f%f%f",&i,&j,&r[0],&r[1],&d);
        V = 0.0; E = 0.0; F = 0.0;
        #pragma omp parallel num_threads(4)
        {
            #pragma omp for schedule(auto)
            for(p = 0; p < 2; p++)
            {
                V += (D/pow(d,3.0))*(spin[2*i-2+p]-w[p])*spin[2*j-2+p];
                E += (spin[2*i-2+p]-w[p])*r[p];
                F += spin[2*j-2+p]*r[p];
            }
        }
        G = -3*(D/pow(d,5.0))*E*F;
        dU += (V+G);
    }
    .
    .
    .
} // End of main()
You are parallelizing a loop with only 2 iterations: p=0 and p=1. The way that OpenMP's omp for works is by splitting up the loop iterations among your threads in the parallel team (which you've defined as 4 threads) and letting them work through their part of the problem in parallel.
With only 2 iterations, 2 of your threads will be sitting idle. On top of that, actually figuring out which threads will work on which part of the problem takes overhead. And if your actual loop doesn't take long (which in this case it clearly doesn't), the overhead will cost more than the benefits you've gained from parallelization.
A better strategy is usually to parallelize the outermost loops with OpenMP whenever possible in order to solve both the problems of splitting up the work evenly and reducing the (relative) overhead. Alternatively, you can parallelize at the lowest loop level using OpenMP 4.0's omp simd command.
Lastly, you are not computing the variables V, E, and F correctly. Because they are summed from iteration to iteration, you should define them all as reduction variables with reduction(+:V). I would be surprised if you are currently getting the correct answer as is.
(Also as High Performance Mark says: make sure you're timing the wall time execution of your program and not the CPU time execution of your program. This is typically done with omp_get_wtime().)

OpenMP for beginners

I just got started with OpenMP; I wrote a little C code in order to check whether what I have studied is correct. However, I ran into some trouble; here is the main.c code:
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
    float msec_kernel;
    const int N = 1000000;
    int i, a[N];

    clock_t start = clock(), diff;

    #pragma omp parallel for private(i)
    for (i = 1; i <= N; i++){
        a[i] = 2 * i;
    }

    diff = clock() - start;
    msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
    printf("Kernel Time: %e s\n", msec_kernel*1e-03);
    printf("a[N] = %d\n", a[N]);
    return 0;
}
My goal is to see how long it takes the PC to do such an operation using 1 and 2 CPUs. In order to compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2; however I don't get the right execution time; my results in fact are:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
corresponding to N=1 and N=2 respectively. As you can see, when I use 2 CPUs it takes slightly more time than using just one! What am I doing wrong? How can I fix this problem?
First of all, using multiple cores doesn't automatically mean that you're going to get better performance.
OpenMP has to manage the data distribution among your cores, which takes time as well. Especially for very basic operations such as the single multiplication you are doing, the performance of a sequential (single-core) program will be better.
Second, by going through every element of your array only once and not doing anything else, you make no use of cache memory, and most certainly not of the shared cache between CPUs.
So you should start reading up on general algorithm performance. Making use of shared cache across multiple cores is, in my opinion, the essence.
Today's computers have reached a stage where the CPU is much faster than a memory allocation, read or write. This means that when using multiple cores, you'll only see a benefit if you exploit things like shared cache, because distributing the data, initializing the threads and managing them takes time as well. To really see a performance speedup (an essential term in parallel computing) you should program an algorithm with a heavy accent on computation, not on memory; this has to do with locality (another important term).
So if you want to experience a big performance boost from multiple cores, test it on a matrix-matrix multiplication of big matrices, such as 10,000 × 10,000. And plot some graphs of input size (matrix size) versus time, and matrix size versus GFLOPS, and compare the multicore version with the sequential one.
Also make yourself comfortable with the complexity analysis (Big O notation).
Matrix-matrix-multiplication has a locality of O(n).
Hope this helps :-)
I suggest setting the number of cores/threads within the code itself, either directly on the pragma line (#pragma omp parallel for num_threads(2)) or using the omp_set_num_threads function (omp_set_num_threads(2);).
Further, when doing time/performance analysis, it is really important to run the program multiple times and then take the mean of all the runtimes. Running the respective programs only once will not give you a meaningful reading of the time used. Always run multiple times in a row, and don't forget to also vary the input data.
I suggest writing a test.c file, which takes your actual program function within a loop and then calculates the time per execution of the function:
int executiontimes = 20;

clock_t initial_time = clock();
for (int i = 0; i < executiontimes; i++) {
    function_multiplication(values);
}
clock_t final_time = clock();

clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;
Improve this test algorithm: add some rand() calls for your values, seed them with srand(), etc. If you have more questions on the subject or about my answer, leave a comment and I'll try to explain further.
The function clock() returns elapsed CPU time, which includes ticks from all cores. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total CPU time will always be longer than the serial time.
If you want the real time (wall clock time), try to use the OMP Runtime Library function omp_get_wtime() defined in omp.h. It is cross platform portable and should be the preferred way to do wall timing.
You can also use the POSIX functions defined in time.h:
struct timespec start, stop;

clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);

double elapsed_time = (stop.tv_sec - start.tv_sec) +
                      1e-9 * (stop.tv_nsec - start.tv_nsec);

OpenMP Parallel for-loop showing little performance increase

I am in the process of learning how to use OpenMP in C, and as a HelloWorld exercise I am writing a program to count primes. I then parallelise this as follows:
int numprimes = 0;
#pragma omp parallel for reduction (+:numprimes)
for (i = 1; i <= n; i++)
{
    if (is_prime(i) == true)
        numprimes++;
}
I compile this code using gcc -g -Wall -fopenmp -o primes primes.c -lm (-lm for the math.h functions I am using). Then I run this code on an Intel® Core™2 Duo CPU E8400 @ 3.00GHz × 2, and as expected, the performance is better than for a serial program.
The problem, however, comes when I try to run this on a much more powerful machine. (I have also tried to manually set the number of threads to use with num_threads, but this did not change anything.) Counting all the primes up to 10 000 000 gives me the following times (using time):
8-core machine:
real 0m8.230s
user 0m50.425s
sys 0m0.004s
dual-core machine:
real 0m10.846s
user 0m17.233s
sys 0m0.004s
And this pattern continues for counting more primes, the machine with more cores shows a slight performance increase, but not as much as I would expect for having so many more cores available. (I would expect 4 times more cores to imply almost 4 times less running time?)
Counting primes up to 50 000 000:
8-core machine:
real 1m29.056s
user 8m11.695s
sys 0m0.017s
dual-core machine:
real 1m51.119s
user 2m50.519s
sys 0m0.060s
If anyone can clarify this for me, it would be much appreciated.
EDIT
This is my prime-checking function.
static int is_prime(int n)
{
    /* handle special cases */
    if (n == 0) return 0;
    else if (n == 1) return 0;
    else if (n == 2) return 1;

    int i;
    for (i = 2; i <= (int)(sqrt((double)n)); i++)
        if (n % i == 0) return 0;

    return 1;
}
This performance is happening because:
is_prime(i) takes longer the higher i gets, and
Your OpenMP implementation uses static scheduling by default for parallel for constructs without the schedule clause, i.e. it chops the for loop into equal sized contiguous chunks.
In other words, the highest-numbered thread is doing all of the hardest operations.
Explicitly selecting a more appropriate scheduling type with the schedule clause allows you to divide work among the threads fairly.
This version will divide the work better:
int numprimes = 0;
#pragma omp parallel for schedule(dynamic, 1) reduction(+:numprimes)
for (i = 1; i <= n; i++)
{
    if (is_prime(i) == true)
        numprimes++;
}
Information on scheduling syntax is available via MSDN and Wikipedia.
schedule(dynamic, 1) may not be optimal, as High Performance Mark notes in his answer. There is a more in-depth discussion of scheduling granularity in this OpenMP whitepaper.
Thanks also to Jens Gustedt and Mahmoud Fayez for contributing to this answer.
The reason for the apparently poor scaling of your program is, as @naroom has suggested, the variability in the run time of each call to your is_prime function. The run time does not simply increase with the value of i. Your code shows that the test terminates as soon as the first factor of i is found, so the longest run times will be for numbers with few (and large) factors, including the prime numbers themselves.
As you've already been told, the default schedule for your parallelisation will parcel out the iterations of the master loop a chunk at a time to the available threads. For your case of 5*10^7 integers to test and 8 cores to use, the first thread will get the integers 1..6250000 to test, the second will get 6250001..12500000 and so on. This will lead to a severely unbalanced load across the threads because, of course, the prime numbers are not uniformly distributed.
Rather than using the default scheduling you should experiment with dynamic scheduling. The following statement tells the run-time to parcel out the iterations of your master loop m iterations at a time to the threads in your computation:
#pragma omp parallel for schedule(dynamic,m)
Once a thread has finished its m iterations it will be given m more to work on. The trick for you is to find the sweet spot for m. Too small and your computation will be dominated by the work that the run time does in parcelling out iterations, too large and your computation will revert to the unbalanced loads that you have already seen.
Take heart though, you will learn some useful lessons about the costs, and benefits, of parallel computation by working through all of this.
I think your code needs to use dynamic scheduling, so that each thread can consume a different number of iterations. Your iterations have different workloads, while the current static schedule hands every thread an equal chunk, which won't help in your case. Try this out, please:
int numprimes = 0;
#pragma omp parallel for reduction (+:numprimes) schedule(dynamic,1)
for (i = 1; i <= n; i++){
    if (is_prime(i) == true)
        ++numprimes;
}

problems when creating many plans and executing plans

I am a little confused about creating a many_plan by calling fftwf_plan_many_dft_r2c() and executing it with OpenMP. What I am trying to achieve here is to see if explicitly using OpenMP and organizing the FFTW data could work together. (I know I "should" use the multithreaded version of FFTW, but I failed to get the expected speedup from it.)
My code looks like this:
/* I ignore some helper APIs */
#define N 1024*1024 // N is the total size of the 1d fft

fftwf_plan p;
float *in;
fftwf_complex *out;

omp_set_num_threads(threadNum); // Suppose threadNum is 2 here

in = fftwf_alloc_real(2*(N/2+1));
std::fill(in, in+2*(N/2+1), 1.1f);  // just try with random real floating numbers
out = (fftwf_complex *)&in[0];      // for in-place transformation

/* Problems start from here */
int n[] = {N/threadNum};  // according to the manual, n is the size of each "howmany" transformation
p = fftwf_plan_many_dft_r2c(1, n, threadNum, in, NULL, 1, 1, out, NULL, 1, 1, FFTW_ESTIMATE);

#pragma omp parallel for
for (int i = 0; i < threadNum; i++)
{
    fftwf_execute(p);
    // fftwf_execute_dft_r2c(p, in+i*N/threadNum, out+i*N/threadNum);
}
What I got is like this:
If I use fftwf_execute(p), the program executes successfully, but the result seems incorrect. (I compared the result with the version not using many_plan and OpenMP.)
If I use fftwf_execute_dft_r2c(), I get a segmentation fault.
Can somebody help me here? How should I partition the data across multiple threads? Or is it not correct in the first place?
Thank you in advance.
flyree
Do you properly allocate memory for out? Does this:
out = (fftwf_complex *)&in[0]; // for in-place transformation
do the same as this:
out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfOutputColumns);
You are trying to access p inside your parallel block without specifically telling OpenMP how to use it. It should be:
#pragma omp parallel for shared(p)
If you are going to split the work up for n threads, I would think you'd explicitly want to tell omp to use n threads:
#pragma omp parallel for shared(p) num_threads(n)
Does this code work without multithreading? If you remove the for loop and the OpenMP call and execute fftwf_execute(p) just once, does it work?
I don't know much about FFTW's many-plan interface, but it seems like p is really many plans, not one single plan. So when you "execute" p, you are executing all of the plans at once, right? You don't really need to execute p iteratively.
I'm still learning about OpenMP + FFTW, so I could be wrong on these. (Note that the pragma lines above need a leading #.)
