I have a problem with OpenMP. I need to compute pi with OpenMP and Monte Carlo. I wrote a simple program that reads the number of threads from the command line. Its timing is unstable: sometimes 1 thread is faster than 16. Does anyone have an idea what I am doing wrong?
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char*argv[])
{
    int niter, watki;
    watki = strtol(argv[1], NULL, 0);
    niter = strtol(argv[2], NULL, 0);
    omp_set_num_threads(watki);
    int i;
    int count = 0;
    double x, y, z;
    double pi;
    unsigned int myseed = omp_get_thread_num();
    double start = omp_get_wtime();
    #pragma omp parallel for private(i,x,y,z) reduction(+:count)
    for ( i=0; i<niter; i++) {
        x = (double)rand_r(&myseed)/RAND_MAX;
        y = (double)rand_r(&myseed)/RAND_MAX;
        z = x*x+y*y;
        if (z<=1) count++;
    }
    pi=(double)count/ niter*4;
    printf("# of trials= %d, threads %d , estimate of pi is %g \n",niter, watki,pi);
    double end = omp_get_wtime();
    printf("%f \n", (end - start));
    return 0;
}
I compile it with gcc -fopenmp pi.c -o pi
And run it with ./pi 1 10000
Thanks in advance
You're calling omp_get_thread_num outside of the parallel region, which will always return 0.
Then all your rand_r calls will access the same shared seed, which is a data race and probably the source of your problem. You should declare myseed inside the parallel region, so that it is private to each thread and omp_get_thread_num returns the real thread number. Declare it before the loop rather than in the loop body, otherwise the seed is reset on every iteration and each thread keeps drawing the same point:
#pragma omp parallel private(x,y,z) reduction(+:count)
{
    unsigned int myseed = omp_get_thread_num();
    #pragma omp for
    for ( i=0; i<niter; i++) {
        x = (double)rand_r(&myseed)/RAND_MAX;
        y = (double)rand_r(&myseed)/RAND_MAX;
        z = x*x+y*y;
        if (z<=1) count++;
    }
}
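For reference, the advice above can be put together as one self-contained function. This is only a sketch under a few assumptions: the name estimate_pi, the seed constants, and the serial fallback for builds without -fopenmp are illustrative choices, not part of the original program.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes rand_r */
#endif
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without -fopenmp. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Estimate pi from niter random points in the unit square
 * (estimate_pi is a hypothetical name used for illustration). */
double estimate_pi(long niter)
{
    assert(niter > 0);
    long count = 0;

    #pragma omp parallel reduction(+:count)
    {
        /* One seed per thread, set once before the loop.
         * The constants are arbitrary; any distinct values work. */
        unsigned int myseed = 12345u + 7919u * (unsigned int)omp_get_thread_num();

        #pragma omp for
        for (long i = 0; i < niter; i++) {
            double x = (double)rand_r(&myseed) / RAND_MAX;
            double y = (double)rand_r(&myseed) / RAND_MAX;
            if (x * x + y * y <= 1.0) count++;
        }
    }
    /* Fraction of points inside the quarter circle, times 4. */
    return 4.0 * (double)count / (double)niter;
}
```

Calling estimate_pi(1000000) should return a value close to 3.14. The exact digits depend on the number of threads, because each thread draws from its own stream.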
I am on a Windows 10 machine with an Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz, 1800 MHz, 4 Core(s), 8 Logical Processor(s) and 8 GB RAM. I have been running this small OpenMP code to compare the performance of a normal sequential program and an OpenMP program.
void normal(unsigned int num_steps){
    double step = 1.0/(double)(num_steps);
    double sum = 0.0;
    double start=omp_get_wtime();
    for (long i = 0; i < num_steps;i++){
        double x = i * step;
        sum += (4.0 / (1.0 + x * x));
    }
    double pi = step * sum;
    double end=omp_get_wtime();
    printf("Time taken : %0.9lf\n",end-start);
    printf("The value of pi is : %0.9lf\n",pi);
}
void parallel(unsigned int num_steps,unsigned int thread_cnt){
double pi=0.0;
double sum[thread_cnt];
for(unsigned int i=0;i<thread_cnt;i++)
double start=omp_get_wtime();
#pragma omp parallel
double x;
double sum_temp=0.0;
double step = 1.0 / (double)(num_steps);
int num_threads = omp_get_num_threads();
int thread_no = omp_get_thread_num();
thread_cnt = num_threads;
printf("Number of threads assigned is : %d\n",num_threads);
for (unsigned int i = thread_no; i < num_steps;i+=thread_cnt){
#pragma omp critical
double end=omp_get_wtime();
printf("Time taken : %0.9lf\n",end-start);
for(unsigned int i=0;i<thread_cnt;i++){
printf("The value of pi is : %0.9lf\n",pi);
int main(){
unsigned int num_steps=1000000;
unsigned int thread_cnt=4;
return 0;
I am using MinGW's GCC compiler, and because OpenMP programs require the pthread library I downloaded the mingw32-pthreads-w32 library. Is it not working? I can't seem to beat the normal sequential execution despite using several threads and handling race conditions and false sharing with the critical pragma.
I am trying to write a parallel program that takes an error tolerance (e.g. 0.01) and uses Monte Carlo simulation to return an estimate of pi whose relative error is below that tolerance.
I wrote a simple function; however, it does not terminate because the error always stays around 11.
I appreciate your comments.
#include "stdio.h"
#include "omp.h"
#include <stdlib.h>
#include <unistd.h>
#include <math.h>
double drand48(void);
double monte_carlo(double epsilon){
double x,y, pi_estimate = 0.0;
double drand48(void);
double error = 10000.0;
int n = 0; // total number of points
int i = 0; // total numbers of points inside circle
int p = omp_get_num_threads();
#pragma omp parallel private(x, y) reduction(+:i)//OMP parallel directive
x = drand48();
y = drand48();
printf("%lf\n", error);
error = fabs(M_PI-pi_estimate)/M_PI;
return pi_estimate;
int main(int argc, char* argv[]) {
double epsilon = 0.01;
printf("PI estimate: %lf",monte_carlo(epsilon));
return 0;
Calling omp_get_num_threads() outside a parallel region always returns 1, as there is only one active thread at the moment the function is called. The following code should give a correct result, but will be much slower than the serial version because of the large parallelization and synchronization overhead spent on a very simple operation.
#pragma omp parallel private(x, y) reduction(+:i)//OMP parallel directive
x = drand48();
y = drand48();
#pragma omp master
The following avoids repeatedly spawning threads and may be more efficient, but still probably slower.
#pragma omp parallel private(x, y)
x = drand48();
y = drand48();
#pragma omp atomic
#pragma omp barrier
#pragma omp single
error = fabs(M_PI-pi_estimate)/M_PI;
printf("%lf\n", error);
} // implicit barrier here
To really go faster, each round should run a minimum number of iterations before the error is checked, such as:
#define ITER 1000
#pragma omp parallel private(x, y)
#pragma omp for reduction(+:i)
for (int j=1;j<ITER;j++){
x = drand48();
y = drand48();
if((x*x+y*y)<=1.0) i+=1;
/* implicit barrier + implicit atomic addition
 * of thread-private accumulator to shared variable i
 */
#pragma omp single
error = fabs(M_PI-pi_estimate)/M_PI;
printf("%lf\n", error);
} // implicit barrier
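The batched approach can be sketched as a complete function. Several things here are assumptions made for illustration: the name monte_carlo with extra batch and max_batches parameters (the cap guarantees termination), rand_r with a per-thread seed in place of drand48 (whose shared state is not safe to update from several threads), and the PI_REF constant standing in for M_PI.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes rand_r */
#endif
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without -fopenmp. */
static int omp_get_thread_num(void) { return 0; }
#endif

#define PI_REF 3.14159265358979323846  /* reference value, stands in for M_PI */

/* Draw points in batches until the relative error drops below epsilon
 * or max_batches batches have been used (hypothetical signature). */
double monte_carlo(double epsilon, long batch, long max_batches)
{
    assert(epsilon > 0.0 && batch > 0 && max_batches > 0);
    long inside = 0, total = 0;   /* shared counters */
    double pi_estimate = 0.0;
    int done = 0;                 /* shared stop flag, written only in single */

    #pragma omp parallel
    {
        /* The thread number is queried inside the region, so it is real. */
        unsigned int seed = 2024u + 104729u * (unsigned int)omp_get_thread_num();

        for (long b = 0; b < max_batches && !done; b++) {
            #pragma omp for reduction(+:inside)
            for (long j = 0; j < batch; j++) {
                double x = (double)rand_r(&seed) / RAND_MAX;
                double y = (double)rand_r(&seed) / RAND_MAX;
                if (x * x + y * y <= 1.0) inside++;
            }   /* implicit barrier: inside holds the combined count */

            #pragma omp single
            {
                total += batch;
                pi_estimate = 4.0 * (double)inside / (double)total;
                double err = (PI_REF - pi_estimate) / PI_REF;
                if (err < 0.0) err = -err;
                done = err < epsilon;
            }   /* implicit barrier: all threads see the same done flag */
        }
    }
    return pi_estimate;
}
```

The threads are created once and stay alive across batches. The two implicit barriers per batch are the only synchronization, so a larger batch amortizes that cost better.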
I'm attempting to implement block matrix multiplication and making it more parallelized.
This is my code :
int i,j,jj,k,kk;
float sum;
int en = 4 * (2048/4);
#pragma omp parallel for collapse(2)
for(i=0;i<2048;i++) {
for(j=0;j<2048;j++) {
for (kk=0;kk<en;kk+=4) {
for(jj=0;jj<en;jj+=4) {
for(i=0;i<2048;i++) {
for(j=jj;j<jj+4;j++) {
sum = C[i][j];
for(k=kk;k<kk+4;k++) {
C[i][j] = sum;
I've been playing around with OpenMP but have had no luck figuring out the best way to get this done in the least amount of time.
Getting good performance from matrix multiplication is a big job. Since "The best code is the code I don't have to write", a much better use of your time would be to understand how to use a BLAS library.
If you are using X86 processors, the Intel Math Kernel Library (MKL) is available free, and includes optimized, parallelized, matrix multiplication operations.
(FWIW, I work for Intel, but not on MKL :-))
I recently started looking into dense matrix multiplication (GEMM) again. It turns out the Clang compiler is really good at optimizing GEMM without needing any intrinsics (GCC still needs intrinsics). The following code gets 60% of the peak FLOPS of my four core/eight hardware thread Skylake system. It uses block matrix multiplication.
Hyper-threading gives worse performance, so make sure you use only as many threads as there are physical cores, and bind the threads to prevent thread migration.
export OMP_PROC_BIND=true
Then compile like this
clang -Ofast -march=native -fopenmp -Wall gemm_so.c
The code
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>
#include <x86intrin.h>
#define SM 80
typedef __attribute((aligned(64))) float * restrict fast_float;
/* Copy one SM x SM block of a (row stride n) into the contiguous buffer b. */
static void reorder2(fast_float a, fast_float b, int n) {
  for(int i=0; i<SM; i++) memcpy(&b[i*SM], &a[i*n], sizeof(float)*SM);
}
/* c block += a block * b block, with b already packed by reorder2. */
static void kernel(fast_float a, fast_float b, fast_float c, int n) {
  for(int i=0; i<SM; i++) {
    for(int k=0; k<SM; k++) {
      for(int j=0; j<SM; j++) {
        c[i*n + j] += a[i*n + k]*b[k*SM + j];
      }
    }
  }
}
void gemm(fast_float a, fast_float b, fast_float c, int n) {
  int bk = n/SM;
  #pragma omp parallel
  {
    float *b2 = _mm_malloc(sizeof(float)*SM*SM, 64);
    /* collapse(2), not 3: all k blocks of one (i, j) tile add into the
       same part of c, so they have to stay on one thread. */
    #pragma omp for collapse(2)
    for(int i=0; i<bk; i++) {
      for(int j=0; j<bk; j++) {
        for(int k=0; k<bk; k++) {
          reorder2(&b[SM*(k*n + j)], b2, n);
          kernel(&a[SM*(i*n+k)], b2, &c[SM*(i*n+j)], n);
        }
      }
    }
    _mm_free(b2);
  }
}
static int doublecmp(const void *x, const void *y) { return *(double*)x < *(double*)y ? -1 : *(double*)x > *(double*)y; }
/* Sorts x in place and returns its median. */
double median(double *x, int n) {
  qsort(x, n, sizeof(double), doublecmp);
  return 0.5f*(x[n/2] + x[(n-1)/2]);
}
int main(void) {
  int cores = 4;
  double frequency = 3.1; // i7-6700HQ turbo 4 cores
  double peak = 32*cores*frequency;
  int n = SM*10*2;
  int mem = sizeof(float) * n * n;
  float *a = _mm_malloc(mem, 64);
  float *b = _mm_malloc(mem, 64);
  float *c = _mm_malloc(mem, 64);
  memset(a, 1, mem), memset(b, 1, mem);
  printf("%dx%d matrix\n", n, n);
  printf("memory of matrices: %.2f MB\n", 3.0*mem*1E-6);
  printf("peak SP GFLOPS %.2f\n", peak);
  while(1) {
    int r = 10;
    double times[r];
    for(int j=0; j<r; j++) {
      times[j] = -omp_get_wtime();
      gemm(a, b, c, n);
      times[j] += omp_get_wtime();
    }
    double flop = 2.0*1E-9*n*n*n; //GFLOP
    double time_mid = median(times, r); // also sorts times in ascending order
    double flops_low = flop/times[r-1], flops_mid = flop/time_mid, flops_high = flop/times[0];
    printf("%.2f %.2f %.2f %.2f\n", 100*flops_low/peak, 100*flops_mid/peak, 100*flops_high/peak, flops_high);
  }
}
This does GEMM 10 times per iteration of an infinite loop and prints the low, median, and high ratio of FLOPS to peak FLOPS as percentages, and finally the highest GFLOPS measured.
You will need to adjust the following lines
int cores = 4;
double frequency = 3.1; // i7-6700HQ turbo 4 cores
double peak = 32*cores*frequency;
to the number of physical cores, the frequency for all cores (with turbo if enabled), and the number of floating-point operations per cycle per core, which is 16 for Core2-Ivy Bridge, 32 for Haswell-Kaby Lake, and 64 for the Xeon Phi Knights Landing.
This code may be less efficient on NUMA systems. It does not do nearly as well on Knights Landing (I just started looking into this).
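The peak and efficiency arithmetic used by the benchmark can be written out as two small helpers. This is just a sketch: peak_gflops and percent_of_peak are hypothetical names, and the formulas are the ones the benchmark's main already uses.

```c
#include <assert.h>

/* Theoretical single-precision peak in GFLOPS:
 * FLOPs per cycle per core * physical cores * clock in GHz. */
double peak_gflops(int cores, double ghz, int flops_per_cycle)
{
    assert(cores > 0 && ghz > 0.0 && flops_per_cycle > 0);
    return (double)flops_per_cycle * cores * ghz;
}

/* Percentage of the peak reached by an n x n GEMM that took `seconds`. */
double percent_of_peak(int n, double seconds, double peak)
{
    assert(n > 0 && seconds > 0.0 && peak > 0.0);
    /* 2*n^3 operations: one multiply and one add per inner step. */
    double gflop = 2.0 * 1e-9 * n * n * n;
    return 100.0 * (gflop / seconds) / peak;
}
```

With the values from the listing, peak_gflops(4, 3.1, 32) is 396.8 GFLOPS, so 60% of peak corresponds to roughly 238 GFLOPS.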
What I am looking for is the best way to gather all the data from the parallel for loops into one variable. OpenMP seems to take a different approach than I am used to, as I started learning OpenMPI first, which has scatter and gather routines.
Calculating PI (embarrassingly parallel routine)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_STEPS 100
#define CHUNKSIZE 20
int main(int argc, char *argv[])
double step, x, pi, sum=0.0;
int i, chunk;
chunk = CHUNKSIZE;
step = 1.0/(double)NUM_STEPS;
#pragma omp parallel shared(chunk) private(i,x,sum,step)
#pragma omp for schedule(dynamic,chunk)
for(i = 0; i < NUM_STEPS; i++)
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
printf("Thread %d: i = %i sum = %f \n",tid,i,sum);
pi = step * sum;
EDIT: It seems that I could use an array sum[NUM_STEPS / CHUNKSIZE] and sum the array into one value, or would it be better to use some sort of blocking routine to sum the product of each iteration?
Add this clause to your #pragma omp parallel ... statement:
reduction(+ : pi)
Then just do pi += step * sum; at the end of the parallel region. (Notice the plus!) OpenMP will then automagically sum up the partial sums for you.
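A minimal sketch of that suggestion, written as a standalone function. The name integrate_pi and the num_steps parameter are assumptions for illustration, and sum is declared inside the region so each thread starts from a zeroed partial sum instead of an uninitialized private copy.

```c
#include <assert.h>

/* Midpoint-rule integral of 4/(1+x^2) on [0,1], which equals pi. */
double integrate_pi(int num_steps)
{
    assert(num_steps > 0);
    const double step = 1.0 / (double)num_steps;
    double pi = 0.0;

    /* Each thread gets a private pi initialised to 0; the private
     * copies are added into the shared pi when the region ends. */
    #pragma omp parallel reduction(+:pi)
    {
        double sum = 0.0;  /* thread-local partial sum */

        #pragma omp for schedule(dynamic, 20)
        for (int i = 0; i < num_steps; i++) {
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
        pi += step * sum;  /* note the plus */
    }
    return pi;
}
```

With 100 steps, as in the question, the result already agrees with pi to about five decimal places.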
Let's see. I am not quite sure what happens, because I haven't got deterministic behaviour from the finished application, but I have something that resembles π. I removed the #pragma omp parallel shared(chunk) and changed the #pragma omp for schedule(dynamic,chunk) to #pragma omp parallel for schedule(dynamic) reduction(+:sum).
#pragma omp parallel for schedule(dynamic) reduction(+:sum)
This requires some explanation. I removed the schedule's chunk size just to make it all simpler (for me). The part that you are interested in is the reduction(+:sum), which is a normal reduce operation with the operator + on the variable sum.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_STEPS 100
int main(int argc, char *argv[])
{
    double step, pi, sum=0.0;
    int i;
    step = 1.0/(double)NUM_STEPS;
    #pragma omp parallel for schedule(dynamic) reduction(+:sum)
    for(i = 0; i < NUM_STEPS; i++)
    {
        /* x is declared in the loop body so that each thread has its own copy */
        double x = (i+0.5)*step;
        sum +=4.0/(1.0+x*x);
        printf("Thread %d: i = %i sum = %f \n",omp_get_thread_num(),i,sum);
    }
    pi = step * sum;
    printf("pi=%lf\n", pi);
    return 0;
}
I've got 3 simple functions: one is a control function and the other 2 do the same computation in slightly different ways using OpenMP. But thread1 gives a different score than thread2 and control, and I have no idea why.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
float function(float x){
    return pow(x,pow(x,sin(x)));
}
float integrate(float begin, float end, int count){
    float score = 0 , width = (end-begin)/(1.0*count), i=begin, y1, y2;
    for(i = 0; i<count; i++){
        score += (function(begin+(i*width)) + function(begin+(i+1)*width)) * width/2.0;
    }
    return score;
}
float thread1(float begin, float end, int count){
    float score = 0 , width = (end-begin)/(1.0*count), y1, y2;
    int i;
    #pragma omp parallel for reduction(+:score) private(y1,i) shared(count)
    for(i = 0; i<count; i++){
        y1 = ((function(begin+(i*width)) + function(begin+(i+1)*width)) * width/2.0);
        score = score + y1;
    }
    return score;
}
float thread2(float begin, float end, int count){
    float score = 0 , width = (end-begin)/(1.0*count), y1, y2;
    int i;
    float * tab = (float*)malloc(count * sizeof(float));
    #pragma omp parallel for
    for(i = 0; i<count; i++){
        tab[i] = (function(begin+(i*width)) + function(begin+(i+1)*width)) * width/2.0;
    }
    for(i=0; i<count; i++)
        score += tab[i];
    return score;
}
unsigned long long int rdtsc(void){
    unsigned long long int x;
    unsigned a, d;
    __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((unsigned long long)a) | (((unsigned long long)d) << 32);
}
int main(int argc, char** argv){
    unsigned long long counter = 0;
    counter = rdtsc();
    printf("control: %f \n ",integrate (atof(argv[1]), atof(argv[2]), atoi(argv[3])));
    printf("control count: %lld \n",rdtsc()-counter);
    counter = rdtsc();
    printf("thread1: %f \n ",thread1(atof(argv[1]), atof(argv[2]), atoi(argv[3])));
    printf("thread1 count: %lld \n",rdtsc()-counter);
    counter = rdtsc();
    printf("thread2: %f \n ",thread2(atof(argv[1]), atof(argv[2]), atoi(argv[3])));
    printf("thread2 count: %lld \n",rdtsc()-counter);
    return 0;
}
Here are sample results:
gcc -fopenmp zad2.c -o zad -pg -lm
env OMP_NUM_THREADS=2 ./zad 3 13 100000
control: 5407308.500000
control count: 138308058
thread1: 5407494.000000
thread1 count: 96525618
thread2: 5407308.500000
thread2 count: 104770859
OK, I tried to make this faster by not evaluating the function twice at each interval boundary.
double thread3(double begin, double end, int count){
    double score = 0 , width = (end-begin)/(1.0*count), yp, yk;
    int i,j, k;
    #pragma omp parallel private (yp,yk)
    {
        int thread_num = omp_get_num_threads();
        k = count / thread_num;
        #pragma omp for private(i) reduction(+:score)
        for(i=0; i<thread_num; i++){
            yp = function(begin + i*k*width);
            yk = function(begin + (i*k+1)*width);
            score += (yp + yk) * width / 2.0;
            for(j=i*k +1; j<(i+1)*k; j++){
                yp = yk;
                yk = function(begin + (j+1)*width);
                score += (yp + yk) * width / 2.0;
            }
        }
        #pragma omp for private(i) reduction(+:score)
        for(i = k*thread_num; i<count; i++)
            score += (function(begin+(i*width)) + function(begin+(i+1)*width)) * width/2.0;
    }
    return score;
}
But after a few tests I found that the scores are near the right value, but not equal to it. Sometimes one of the threads doesn't start. When I'm not using OpenMP, the value is correct.
You're integrating a very strongly peaked function, x^(x^sin(x)), which covers over 7 orders of magnitude in the range you're integrating it. That's about the limit for a 32-bit floating point number, so there are going to be issues depending on the order in which you sum the numbers. This isn't an OpenMP thing; it's just a numerical sensitivity thing.
So for instance, consider this completely serial code doing the same integral:
#include <stdio.h>
#include <math.h>
float function(float x){
    return pow(x,pow(x,sin(x)));
}
int main(int argc, char **argv) {
    const float begin=3., end=13.;
    const int count = 100000;
    const float width=(end-begin)/(1.*count);
    float integral1=0., integral2=0., integral3=0.;
    /* left to right */
    for (int i=0; i<count; i++) {
        integral1 += (function(begin+(i*width)) + function(begin+(i+1)*width)) * width/2.0;
    }
    /* right to left */
    for (int i=count-1; i>=0; i--) {
        integral2 += (function(begin+(i*width)) + function(begin+(i+1)*width)) * width/2.0;
    }
    /* centre outwards, first right-to-left, then left-to-right */
    for (int i=count/2; i<count; i++) {
        integral3 += (function(begin+(i*width)) + function(begin+(i+1)*width)) * width/2.0;
    }
    for (int i=count/2-1; i>=0; i--) {
        integral3 += (function(begin+(i*width)) + function(begin+(i+1)*width)) * width/2.0;
    }
    printf("Left to right: %lf\n", integral1);
    printf("Right to left: %lf\n", integral2);
    printf("Centre outwards: %lf\n", integral3);
    return 0;
}
Running this, we get:
$ ./reduce
Left to right: 5407308.500000
Right to left: 5407430.000000
Centre outwards: 5407335.500000
-- the same sort of differences you see. Doing the summation with two threads necessarily changes the order of the summation, and so your answer changes.
There are a few options here. If this was just a test problem, and this function doesn't actually represent what you'll be integrating, you might be fine already. Otherwise, using a different numerical method may help.
But there is also a simple solution here. The spread in magnitude of the terms exceeds the precision of a float (about 7 significant decimal digits), making the answer very sensitive to summation order, but it fits comfortably within the precision of a double (about 16 digits), making the problem much less severe. Note that changing to doubles is not a magic solution to everything; in some cases it just postpones the problem or allows you to paper over a flaw in your numerical method. But here it actually addresses the underlying problem fairly well. Changing all the floats above to doubles gives:
$ ./reduce
Left to right: 5407589.272885
Right to left: 5407589.272885
Centre outwards: 5407589.272885
On the other hand, even doubles wouldn't save you if you needed to integrate this function in the range (18,23).
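To make the order sensitivity concrete, here is a small sketch comparing three ways of summing float terms. Compensated (Kahan) summation is not something used in the code above; it is brought in here as one example of a "different numerical method", and the function names are illustrative. It assumes IEEE single precision and a build without -ffast-math, which could optimize the compensation away.

```c
#include <assert.h>
#include <stddef.h>

/* Plain float accumulation: small terms can be rounded away. */
float sum_naive(const float *v, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum;
}

/* Kahan (compensated) summation: c carries the rounding error of
 * each addition and feeds it back into the next term. */
float sum_kahan(const float *v, size_t n)
{
    float sum = 0.0f, c = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y = v[i] - c;   /* term corrected by the previous error */
        float t = sum + y;    /* low-order bits of y may be lost here */
        c = (t - sum) - y;    /* recover what was lost */
        sum = t;
    }
    return sum;
}

/* Same float terms, accumulated in a double. */
double sum_double(const float *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += (double)v[i];
    return sum;
}
```

For example, for the five terms 16777216, 1, 1, 1, 1 (the first is 2^24, where neighbouring floats are 2 apart), sum_kahan and sum_double both return the exact total 16777220, whereas with strict single-precision arithmetic sum_naive stays at 16777216 because each added 1 is rounded away.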