I am on a Windows 10 machine with an Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz, 1800 MHz, 4 Core(s), 8 Logical Processor(s) and 8 GB RAM. I have been running this small OpenMP code to compare the performance of a normal sequential program and an OpenMP program.
void normal(unsigned int num_steps){
    double step = 1.0/(double)(num_steps);
    double sum = 0.0;
    double start=omp_get_wtime();
    for (long i = 0; i < num_steps;i++){
        double x = i * step;
        sum += (4.0 / (1.0 + x * x));
    }
    double pi = step * sum;
    double end=omp_get_wtime();
    printf("Time taken : %0.9lf\n",end-start);
    printf("The value of pi is : %0.9lf\n",pi);
}
void parallel(unsigned int num_steps,unsigned int thread_cnt){
double pi=0.0;
double sum[thread_cnt];
for(unsigned int i=0;i<thread_cnt;i++)
double start=omp_get_wtime();
#pragma omp parallel
double x;
double sum_temp=0.0;
double step = 1.0 / (double)(num_steps);
int num_threads = omp_get_num_threads();
int thread_no = omp_get_thread_num();
thread_cnt = num_threads;
printf("Number of threads assigned is : %d\n",num_threads);
for (unsigned int i = thread_no; i < num_steps;i+=thread_cnt){
#pragma omp critical
double end=omp_get_wtime();
printf("Time taken : %0.9lf\n",end-start);
for(unsigned int i=0;i<thread_cnt;i++){
printf("The value of pi is : %0.9lf\n",pi);
int main(){
unsigned int num_steps=1000000;
unsigned int thread_cnt=4;
return 0;
I am using MinGW's GCC compiler, and because OpenMP programs require the pthread library I downloaded the mingw32-pthreads-w32 library. Is it not working? I can't seem to beat the normal sequential execution despite using several threads and handling race conditions and false sharing with the critical pragma.
Reference:
I have been following the OpenMP playlist on YouTube by Intel.
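One way to avoid both the critical section and the shared sum[] array is OpenMP's reduction clause, which gives each thread a private partial sum and combines them once at the end. The sketch below is only an illustration under assumptions: the function name pi_reduction is hypothetical, and it uses the midpoint rule rather than the left-edge rule from the question. With only 1000000 steps the loop may still be too short for thread start-up costs to pay off, so a larger num_steps is worth trying when timing.

```c
#include <assert.h>

/* Hypothetical helper: midpoint-rule estimate of pi using an OpenMP reduction.
   Without -fopenmp the pragma is ignored and the loop simply runs serially. */
double pi_reduction(long num_steps)
{
    assert(num_steps > 0);
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    /* Each thread accumulates into its own private copy of sum;
       OpenMP adds the copies together when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;   /* private: declared inside the loop body */
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

Timing a call such as pi_reduction(100000000) with omp_get_wtime() before and after, for 1, 2 and 4 threads, should show whether the runtime is actually spreading the work.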
This is my solution using OpenMP, which I used to parallelize the code that calculates Pi. The floating-point value of Pi changes each time it is executed. Could someone explain why?
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define THREAD_NUM 20
static long num_steps = 100000;
double step;
int main(){
    int i;
    double x;
    double pi;
    double sum = 0.0;
    double t1 = 0.0;
    double t2 = 0.0;
    step = 1.0/(double) num_steps;
    t1 = omp_get_wtime();
    #pragma omp parallel
    {
        double p_sum = 0.0;
        #pragma omp for
        for(i=0; i<num_steps; i++){
            x = (i+0.5)*step;
            p_sum = p_sum + 4.0/(1.0+x*x);
        }
        #pragma omp atomic
        sum += p_sum;
    }
    t2 = omp_get_wtime();
    pi = step*sum;
    printf("value of pi = %lf\n", pi);
    printf("time = %lf ms\n", (t2-t1)*1000);
}
Floating-point addition is not associative! (It is commutative, but that does not help here: rounding happens after every addition, so (a + b) + c can differ from a + (b + c).) This means that the exact value you obtain depends on the order in which the components of p_sum/sum are added up, and that order depends on which thread reaches the atomic update first. To understand precisely why, you have to understand how floating-point addition works in practice. I would recommend reading What Every Computer Scientist Should Know About Floating-Point Arithmetic.
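A small sketch of the grouping effect, assuming IEEE 754 doubles and a build without -ffast-math (which would allow the compiler to regroup the additions). The helper names add_left, add_right and check_grouping are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Hypothetical helpers: the same three values added with two different groupings. */
double add_left(double a, double b, double c)  { return (a + b) + c; }
double add_right(double a, double b, double c) { return a + (b + c); }

/* 1e20 is far larger than 2^53, so adding 1.0 to -1e20 rounds straight back
   to -1e20 and the 1.0 is lost in the second grouping. */
void check_grouping(void)
{
    assert(add_left(1e20, -1e20, 1.0) == 1.0);   /* (1e20 - 1e20) + 1 */
    assert(add_right(1e20, -1e20, 1.0) == 0.0);  /* 1e20 + (-1e20 + 1) */
}
```

The partial sums in the Pi program differ by far less than this, but the mechanism is the same: a different summation order can change the last few digits.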
As @Gilles pointed out in the comments under the question, the problem was the x variable, which was declared as a shared variable. It should be declared as a private variable.
#pragma omp parallel
double x;
double p_sum = 0.0;
#pragma omp for
for(i=0; i<num_steps; i++){
I'm using the "read" benchmark from Why is writing to memory much slower than reading it?, and I added just two lines:
#pragma omp parallel for
for(unsigned dummy = 0; dummy < 1; ++dummy)
They should have no effect, because OpenMP should only parallelize the outer loop, but the code now consistently runs twice as fast.
Update: These lines aren't even necessary. Simply adding
(implicitly declared) in the same place has the same effect.
Complete code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
unsigned long do_xor(const unsigned long* p, unsigned long n)
{
    unsigned long i, x = 0;
    for(i = 0; i < n; ++i)
        x ^= p[i];
    return x;
}
int main()
{
    unsigned long n, r, i;
    unsigned long *p;
    clock_t c0, c1;
    double elapsed;
    n = 1000 * 1000 * 1000; /* GB */
    r = 100; /* repeat */
    p = calloc(n/sizeof(unsigned long), sizeof(unsigned long));
    c0 = clock();
#pragma omp parallel for
    for(unsigned dummy = 0; dummy < 1; ++dummy)
    for(i = 0; i < r; ++i) {
        p[0] = do_xor(p, n / sizeof(unsigned long)); /* "use" the result */
        printf("%4ld/%4ld\r", i, r);
    }
    c1 = clock();
    elapsed = (c1 - c0) / (double)CLOCKS_PER_SEC;
    printf("Bandwidth = %6.3f GB/s (Giga = 10^9)\n", (double)n * r / elapsed / 1e9);
}
Compiled and executed with
gcc -O3 -Wall -fopenmp single_iteration.c && time taskset -c 0 ./a.out
The wall time reported by time is 3.4s vs 7.5s.
GCC 7.3.0 (Ubuntu)
The reason for the performance difference is not actually any difference in code, but in how memory is mapped. In the fast case you are reading from zero-pages, i.e. all virtual addresses are mapped to a single physical page - so nothing has to be read from memory. In the slow case, it is not zeroed. For details see this answer from a slightly different context.
On the other hand, it is not caused by calling omp_get_num_threads or by the pragma itself, but merely by linking to the OpenMP runtime library. You can confirm that by using -Wl,--no-as-needed -fopenmp. If you just specify -fopenmp but don't use it at all, the linker will omit it.
Unfortunately, I am still missing the final puzzle piece: why does linking to OpenMP change the behavior of calloc regarding zeroed pages?
I'm attempting to implement block matrix multiplication and making it more parallelized.
This is my code :
int i,j,jj,k,kk;
float sum;
int en = 4 * (2048/4);
#pragma omp parallel for collapse(2)
for(i=0;i<2048;i++) {
    for(j=0;j<2048;j++) {
        C[i][j] = 0;
    }
}
for (kk=0;kk<en;kk+=4) {
    for(jj=0;jj<en;jj+=4) {
        for(i=0;i<2048;i++) {
            for(j=jj;j<jj+4;j++) {
                sum = C[i][j];
                for(k=kk;k<kk+4;k++) {
                    sum += A[i][k]*B[k][j];
                }
                C[i][j] = sum;
            }
        }
    }
}
I've been playing around with OpenMP but have had no luck figuring out the best way to get this done in the least amount of time.
Getting good performance from matrix multiplication is a big job. Since "The best code is the code I don't have to write", a much better use of your time would be to understand how to use a BLAS library.
If you are using X86 processors, the Intel Math Kernel Library (MKL) is available free, and includes optimized, parallelized, matrix multiplication operations.
(FWIW, I work for Intel, but not on MKL :-))
I recently started looking into dense matrix multiplication (GEMM) again. It turns out the Clang compiler is really good at optimizing GEMM without needing any intrinsics (GCC still needs intrinsics). The following code gets 60% of the peak FLOPS of my four core/eight hardware thread Skylake system. It uses block matrix multiplication.
Hyper-threading gives worse performance, so make sure you only use as many threads as there are physical cores, and bind the threads to prevent thread migration.
export OMP_PROC_BIND=true
Then compile like this
clang -Ofast -march=native -fopenmp -Wall gemm_so.c
The code
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>
#include <x86intrin.h>
#define SM 80
typedef __attribute((aligned(64))) float * restrict fast_float;
static void reorder2(fast_float a, fast_float b, int n) {
    for(int i=0; i<SM; i++) memcpy(&b[i*SM], &a[i*n], sizeof(float)*SM);
}
static void kernel(fast_float a, fast_float b, fast_float c, int n) {
    for(int i=0; i<SM; i++) {
        for(int k=0; k<SM; k++) {
            for(int j=0; j<SM; j++) {
                c[i*n + j] += a[i*n + k]*b[k*SM + j];
            }
        }
    }
}
void gemm(fast_float a, fast_float b, fast_float c, int n) {
    int bk = n/SM;
    #pragma omp parallel
    {
        float *b2 = _mm_malloc(sizeof(float)*SM*SM, 64);
        #pragma omp for collapse(3)
        for(int i=0; i<bk; i++) {
            for(int j=0; j<bk; j++) {
                for(int k=0; k<bk; k++) {
                    reorder2(&b[SM*(k*n + j)], b2, n);
                    kernel(&a[SM*(i*n+k)], b2, &c[SM*(i*n+j)], n);
                }
            }
        }
        _mm_free(b2); /* release the per-thread scratch block */
    }
}
static int doublecmp(const void *x, const void *y) { return *(double*)x < *(double*)y ? -1 : *(double*)x > *(double*)y; }
double median(double *x, int n) {
    qsort(x, n, sizeof(double), doublecmp);
    return 0.5f*(x[n/2] + x[(n-1)/2]);
}
int main(void) {
    int cores = 4;
    double frequency = 3.1; // i7-6700HQ turbo 4 cores
    double peak = 32*cores*frequency;
    int n = SM*10*2;
    int mem = sizeof(float) * n * n;
    float *a = _mm_malloc(mem, 64);
    float *b = _mm_malloc(mem, 64);
    float *c = _mm_malloc(mem, 64);
    memset(a, 1, mem), memset(b, 1, mem);
    printf("%dx%d matrix\n", n, n);
    printf("memory of matrices: %.2f MB\n", 3.0*mem*1E-6);
    printf("peak SP GFLOPS %.2f\n", peak);
    while(1) {
        int r = 10;
        double times[r];
        for(int j=0; j<r; j++) {
            times[j] = -omp_get_wtime();
            gemm(a, b, c, n);
            times[j] += omp_get_wtime();
        }
        double flop = 2.0*1E-9*n*n*n; //GFLOP
        double time_mid = median(times, r);
        double flops_low = flop/times[r-1], flops_mid = flop/time_mid, flops_high = flop/times[0];
        printf("%.2f %.2f %.2f %.2f\n", 100*flops_low/peak, 100*flops_mid/peak, 100*flops_high/peak, flops_high);
    }
}
This runs GEMM 10 times per iteration of an infinite loop and prints the low, median, and high ratios of FLOPS to peak FLOPS, followed by the highest GFLOPS value.
You will need to adjust the following lines
int cores = 4;
double frequency = 3.1; // i7-6700HQ turbo 4 cores
double peak = 32*cores*frequency;
to the number of physical cores, the frequency for all cores (with turbo if enabled), and the number of floating-point operations per cycle per core, which is 16 for Core2-Ivy Bridge, 32 for Haswell-Kaby Lake, and 64 for the Xeon Phi Knights Landing.
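The peak and percent-of-peak arithmetic can be pulled out into small helpers, sketched below. The function names are hypothetical and not part of the program above. The figure 32 for Haswell-Kaby Lake corresponds to two FMA units, each working on eight single-precision values, with each FMA counted as two operations.

```c
#include <assert.h>

/* Theoretical single-precision peak in GFLOPS.
   flops_per_cycle: floating-point operations one core can complete per cycle
   cores:           number of physical cores
   ghz:             all-core frequency in GHz */
double peak_gflops(int flops_per_cycle, int cores, double ghz)
{
    assert(flops_per_cycle > 0 && cores > 0 && ghz > 0.0);
    return flops_per_cycle * cores * ghz;
}

/* GFLOP performed by one n x n matrix product: n*n*n multiply-add pairs. */
double gemm_gflop(int n)
{
    return 2.0 * 1E-9 * n * n * n;
}

/* Percentage of peak reached when one product took `seconds`. */
double percent_of_peak(int n, double seconds, double peak)
{
    assert(seconds > 0.0 && peak > 0.0);
    return 100.0 * gemm_gflop(n) / seconds / peak;
}
```

With the values used above (32 operations per cycle, 4 cores, 3.1 GHz) the peak comes out at roughly 396.8 GFLOPS, and a 1600x1600 product is about 8.19 GFLOP.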
This code may be less efficient on NUMA systems. It does not do nearly as well on Knights Landing (I just started looking into this).
I have optimized my function as much as I could for sequential running.
When I use OpenMP I see no gain in performance.
I tried my program on a machine with 1 core and on a machine with 8 cores, and the performance is the same.
With year set to 20, I have
1 core: 1 sec.
8 core: 1 sec.
With year set to 25 I have
1 core: 40 sec.
8 core: 40 sec.
1 core machine: my laptop's intel core 2 duo 1.8 GHz, ubuntu linux
8 core machine: 3.25 GHz, ubuntu linux
My program enumerates all the possible paths of a binomial tree and does some work on each path. So my loop size increases exponentially, and I would expect the footprint of the OpenMP threads to be negligible. In my loop, I only do a reduction on one variable. All other variables are read-only. I only use functions I wrote, and I think they are thread-safe.
I also ran Valgrind's cachegrind on my program. I don't fully understand the output, but there seem to be no cache misses or false sharing.
I compile with
gcc -O3 -g3 -Wall -c -fmessage-length=0 -lm -fopenmp -ffast-math
My complete program is below. Sorry for posting a lot of code. I'm not familiar with OpenMP or C, and I couldn't shorten my code further without losing the main task.
How can I improve performance when I use OpenMP?
Are there some compiler flags or C tricks that will make the program run faster?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include "test.h"
int main(){
int year=20;
int tradingdate0=1;
int i;
float v=0;
long n=pow(tradingdate0+1,year);
#pragma omp parallel for reduction(+:v)
return 0;
//***function on which openMP is applied
float pathvalue(long pathindex) {
float value = -ctx.firstpremium;
float personalaccount = ctx.personalaccountat0;
float account = ctx.firstpremium;
int i;
for (i = 0; i < ctx.year-1; i++) {
value *= ctx.accumulationfactor;
double index = getindex(i,pathindex);
account = account * index;
double death = fmaxf(account,ctx.guarantee[i]);
value += qx(i) * death;
if (haswithdraw(i)){
double withdraw = personalaccount*ctx.allowed;
value += px(i) * withdraw;
personalaccount = fmaxf(personalaccount-withdraw,0);
account = fmaxf(account-withdraw,0);
//last year
double index = getindex(ctx.year-1,pathindex);
account = account * index;
return value * ctx.discountfactor;
int haswithdraw(int period){
return 1;
float getindex(int period, long pathindex){
int ndx = (pathindex/ctx.chunksize[period])%ctx.tradingdate;
return ctx.stock[ndx];
float qx(int period){
return 0;
float px(int period){
return 1;
struct context ctx;
void globalinit(int year, int tradingdate0){
ctx.year = year;
ctx.tradingdate0 = tradingdate0;
ctx.firstpremium = 1;
ctx.riskfreerate = 0.06;
ctx.personalaccountat0 = 1;
ctx.allowed = 0.07;
ctx.guaranteerate = 0.03;
ctx.beta = 1;
ctx.discountfactor = exp(-ctx.riskfreerate * ctx.year);
ctx.accumulationfactor = exp(ctx.riskfreerate);
ctx.guaranteefactor = 1+ctx.guaranteerate;
int i;
void globaldel(){
float pathvalue(long pathindex);
int haswithdraw(int period);
float getindex(int period, long pathindex);
float qx(int period);
float px(int period);
struct context{
int year;
int tradingdate0;
float firstpremium;
float riskfreerate;
float volatility;
float personalaccountat0;
float allowed;
float guaranteerate;
float alpha;
float beta;
int tradingdate;
float discountfactor;
float accumulationfactor;
float guaranteefactor;
float upmove;
float downmove;
float* stock;
long* chunksize;
float* guarantee;
struct context ctx;
void globalinit();
void globaldel();
EDIT: I simplified all global variables to constants. For 20 years, the program runs twice as fast (great!). I tried to set the number of threads with OMP_NUM_THREADS=4 ./test, for example, but it didn't give me any performance gain.
Could my gcc have some problem?
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <omp.h>
#include "test.h"
int main(){
int i;
float v=0;
#pragma omp parallel for reduction(+:v)
return 0;
//function on which openMP is applied
float pathvalue(long pathindex) {
float value = -firstpremium;
float personalaccount = personalaccountat0;
float account = firstpremium;
int i;
for (i = 0; i < year-1; i++) {
value *= accumulationfactor;
double index = getindex(i,pathindex);
account = account * index;
double death = fmaxf(account,guarantee[i]);
value += death;
double withdraw = personalaccount*allowed;
value += withdraw;
personalaccount = fmaxf(personalaccount-withdraw,0);
account = fmaxf(account-withdraw,0);
//last year
double index = getindex(year-1,pathindex);
account = account * index;
return value * discountfactor;
float getindex(int period, long pathindex){
int ndx = (pathindex/chunksize[period])%tradingdate;
return stock[ndx];
clock_t begin;
void starttimer(){
begin = clock();
void endtimer(){
clock_t end = clock();
double elapsed = (double)(end - begin) / CLOCKS_PER_SEC;
printf("\nelapsed: %f\n",elapsed);
float pathvalue(long pathindex);
int haswithdraw(int period);
float getindex(int period, long pathindex);
float qx(int period);
float px(int period);
void starttimer();
void endtimer();
const int year= 20 ;
const int tradingdate0= 1 ;
const float firstpremium= 1 ;
const float riskfreerate= 0.06 ;
const float volatility= 0.25 ;
const float personalaccountat0= 1 ;
const float allowed= 0.07 ;
const float guaranteerate= 0.03 ;
const float alpha= 1 ;
const float beta= 1 ;
const int tradingdate= 2 ;
const int numberofpath= 1048576 ;
const float discountfactor= 0.301194211912 ;
const float accumulationfactor= 1.06183654655 ;
const float guaranteefactor= 1.03 ;
const float upmove= 1.28402541669 ;
const float downmove= 0.778800783071 ;
const float stock[2]={1.2840254166877414, 0.7788007830714049};
const long chunksize[20]={524288, 262144, 131072, 65536, 32768, 16384, 8192, 4096, 2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1};
const float guarantee[20]={1.03, 1.0609, 1.092727, 1.1255088100000001, 1.1592740743, 1.1940522965290001, 1.2298738654248702, 1.2667700813876164, 1.304773183829245, 1.3439163793441222, 1.384233870724446, 1.4257608868461793, 1.4685337134515648, 1.512589724855112, 1.557967416600765, 1.6047064390987882, 1.6528476322717518, 1.7024330612399046, 1.7535060530771016, 1.8061112346694148};
Even if your program benefits from using OpenMP, you won't see it because you are measuring the wrong time.
clock() returns the total CPU time spent in all threads. If you run with four threads and each runs for 1/4 of the time, clock() will still return the same value since 4*(1/4) = 1. You should be measuring the wall-clock time instead.
Replace calls to clock() with omp_get_wtime() or gettimeofday(). They both provide high precision wall-clock timing.
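A minimal sketch of the two kinds of timer side by side, assuming a POSIX system where gettimeofday() is available. The helper names wall_seconds and cpu_seconds are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Wall-clock time in seconds, via gettimeofday(). */
double wall_seconds(void)
{
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);
    assert(rc == 0);
    (void)rc;   /* keeps the build warning-free when NDEBUG removes the assert */
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* CPU time consumed by the whole process, summed over all of its threads. */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}
```

Taking both readings before and after the parallel loop makes the difference visible: with four busy threads, the cpu_seconds() difference should be close to four times the wall_seconds() difference.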
P.S. Why are there so many people around SO using clock() for timing?
It seems as if it should work. Probably you need to specify the number of threads to use. You can do so by setting the OMP_NUM_THREADS variable. For instance, for using 4 threads:
EDIT: I just compiled the code and I observe significant speedups when changing the number of threads.
I don't see any section in which you're specifying the number of cores OpenMP will use. It's supposed to, by default, use the number of CPUs it sees, but for my purposes, I've always forced it to use as many as I specified.
Add this line before your parallel for construct:
#pragma omp parallel num_threads(num_threads)
// Your parallel for follows here
...where num_threads is an integer between 1 and the number of cores on your machine.
EDIT: Here's the makefile used to build the code. Place this in a text file named Makefile in the same directory.
test: test.c test.h
	cc -o $@ $< -O3 -g3 -fmessage-length=0 -lm -fopenmp -ffast-math
I have a problem with OpenMP. I need to compute Pi with OpenMP and Monte Carlo. I wrote a simple program that reads the number of threads from the command line. It is not stable: sometimes 1 thread is faster than 16. Does anyone have an idea what I am doing wrong?
int main(int argc, char*argv[])
{
    int niter, watki;
    watki = strtol(argv[1], NULL, 0);
    niter = strtol(argv[2], NULL, 0);
    int i;
    int count = 0;
    double x, y, z;
    double pi;
    unsigned int myseed = omp_get_thread_num();
    double start = omp_get_wtime();
    #pragma omp parallel for private(i,x,y,z) reduction(+:count)
    for ( i=0; i<niter; i++) {
        x = (double)rand_r(&myseed)/RAND_MAX;
        y = (double)rand_r(&myseed)/RAND_MAX;
        z = x*x+y*y;
        if (z<=1) count++;
    }
    pi=(double)count/ niter*4;
    printf("# of trials= %d, threads %d , estimate of pi is %g \n",niter, watki,pi);
    double end = omp_get_wtime();
    printf("%f \n", (end - start));
}
I compile it with gcc -fopenmp pi.c -o pi
And run it with ./pi 1 10000
Thanks in advance
You're calling omp_get_thread_num outside of the parallel region, which will always return 0.
Then all your rand_r calls will access the same shared seed, which is probably the source of your problem. You should declare myseed inside the parallel region, but outside the loop, so that each thread gets its own private seed and omp_get_thread_num returns the correct value. Declaring it inside the loop body would reset the seed on every iteration, so every sample would be the same point.
#pragma omp parallel private(i,x,y,z)
{
    unsigned int myseed = omp_get_thread_num();
    #pragma omp for reduction(+:count)
    for ( i=0; i<niter; i++) {
        x = (double)rand_r(&myseed)/RAND_MAX;
        y = (double)rand_r(&myseed)/RAND_MAX;
        z = x*x+y*y;
        if (z<=1) count++;
    }
}
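For reference, a per-thread seed can be sketched as a complete function along the following lines. This is an illustration under assumptions, not the asker's program: the name estimate_pi and the seed constants are made up, and the #ifdef guards only exist so the same file also builds without -fopenmp.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* makes rand_r visible under strict -std modes */
#endif
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: Monte Carlo estimate of pi with one private seed per thread. */
double estimate_pi(long niter)
{
    assert(niter > 0);
    long count = 0;

    #pragma omp parallel reduction(+:count)
    {
        /* The seed is created once per thread, outside the loop, so the
           random sequence advances instead of restarting every iteration. */
#ifdef _OPENMP
        unsigned int myseed = 1234u + 7919u * (unsigned int)omp_get_thread_num();
#else
        unsigned int myseed = 1234u;
#endif
        #pragma omp for
        for (long i = 0; i < niter; i++) {
            double x = (double)rand_r(&myseed) / RAND_MAX;
            double y = (double)rand_r(&myseed) / RAND_MAX;
            if (x * x + y * y <= 1.0) count++;   /* point fell inside the quarter circle */
        }
    }
    return 4.0 * (double)count / (double)niter;
}
```

With only 10000 trials, as in the command line above, the loop is probably too short for the thread start-up cost to be amortised, so a much larger niter is worth using when comparing thread counts.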