Varying run time of an OpenMP parallel region - c

Whenever I run this code it reports a different run time for the parallel section. I tried using a constant number of threads matching my core count, but the run time still varies. The program calculates the value of pi. Compiled with gcc -fopenmp.
#include <stdio.h>
#include <omp.h>

static long num_steps = 100000;
double step;
//double omp_get_wtime(void);

int main() {
    int i;
    double x, pi, max_threads, start, time;
    double sum = 0.0;

    step = 1.0 / (double) num_steps;
    //omp_set_num_threads(4);
    omp_get_max_threads();
    start = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp for reduction(+:sum) schedule(static) private(x) // reduction to get local copy
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
        //max_threads = omp_get_max_threads();
    }
    time = omp_get_wtime() - start;
    pi = step * sum;
    printf("pi=(%f)\t run_time(%f)\n", pi, time); //, max_threads);
    return 0;
}

The code runs for only a few milliseconds (2-6 ms on my system), so the time is dominated by overhead, e.g. for thread creation. The serial version runs in under 1 ms. With such short execution times it is normal for the measurements to vary a lot, since they depend on the current state of the system; some warm-up is needed.
In this case, just increase num_steps to get meaningful stable results. E.g. with num_steps = 1000000000, 10 executions are all between 4.332 and 4.399 seconds on my system.
Generally, if you do performance measurements, you should compile with the -O3 flag.
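For illustration, here is a minimal sketch (my own, not from the answer above) that times both a serial and a parallel loop over a larger num_steps in one program, so the thread-creation overhead is amortized; compile with something like gcc -O3 -fopenmp:

#include <stdio.h>
#include <omp.h>

static long num_steps = 1000000000;   /* large enough that the loop dominates the overhead */

int main(void) {
    double step = 1.0 / (double) num_steps;
    double sum, start;

    /* serial baseline */
    sum = 0.0;
    start = omp_get_wtime();
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    printf("serial:   pi=%f time=%f s\n", step * sum, omp_get_wtime() - start);

    /* parallel version with a reduction */
    sum = 0.0;
    start = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    printf("parallel: pi=%f time=%f s\n", step * sum, omp_get_wtime() - start);

    return 0;
}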

Related

OpenMP: No Speedup in parallel workloads

I can't really figure this out for my fairly simple OpenMP-parallelized for loop. On the same input size, P=1 runs in ~50 seconds, but P=2 takes almost 300 seconds and P=4 about 250 seconds.
Here's the parallelized loop:
double time = omp_get_wtime();
printf("Input Size: %d\n", n);

#pragma omp parallel for private(i) reduction(+:in)
for (i = 0; i < n; i++) {
    double x = (double)(rand() % 10000) / 10000;
    double y = (double)(rand() % 10000) / 10000;
    if (inCircle(x, y)) {
        in++;
    }
}

double ratio = (double)in / (double)n;
double est_pi = ratio * 4.0;
time = omp_get_wtime() - time;
Runtimes:
p=1, n=1073741824 - 52.764 seconds
p=2, n=1073741824 - 301.66 seconds
p=4, n=1073741824 - 274.784 seconds
p=8, n=1073741824 - 188.224 seconds
Running in an Ubuntu 20.04 VM with 8 cores of a Xeon 5650 and 16 GB of DDR3 ECC RAM, on top of a FreeNAS installation on a dual Xeon 5650 system with 70 GB of RAM.
Partial Solution:
The rand() function inside the loop causes the time to jump when running on multiple threads.
Since rand() uses state saved from the previous call to generate the next pseudorandom number, it can't run in multiple threads at the same time: multiple threads would need to read and write the PRNG state concurrently.
POSIX states that rand() need not be thread safe. This means your code could simply not work right, or the C library might put in a mutex so that only one thread can call rand() at a time. The latter is what's happening here, and it slows the code down considerably: the threads spend almost all their time contending for the rand() critical section, because nothing else they do takes any significant time.
To solve this, try using rand_r(), which does not use shared state, but instead is passed the seed value it should use for state.
Keep in mind that using the same seed for every thread will defeat the purpose of increasing the number of trials in your Monte Carlo simulation. Each thread would just use the exact same pseudo-random sequence. Try something like this:
unsigned int seed;
#pragma omp parallel private(seed)
{
    seed = omp_get_thread_num();
    #pragma omp for private(i) reduction(+:in)
    for (i = 0; i < n; i++) {
        double x = (double)(rand_r(&seed) % 10000) / 10000;
        double y = (double)(rand_r(&seed) % 10000) / 10000;
        if (inCircle(x, y)) {
            in++;
        }
    }
}
BTW, you might notice your estimate is off: x and y need to be uniformly distributed in the range [0, 1], and with the modulus they are not.
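For instance (my own sketch, not part of the original answer), dividing by RAND_MAX + 1.0 instead of taking a modulus gives values that are approximately uniform in [0, 1); the loop body above would become:

// rand_r() returns a value in [0, RAND_MAX]; dividing by RAND_MAX + 1.0
// maps it into [0, 1) without the truncation bias of % 10000
double x = (double)rand_r(&seed) / ((double)RAND_MAX + 1.0);
double y = (double)rand_r(&seed) / ((double)RAND_MAX + 1.0);
if (inCircle(x, y)) {
    in++;
}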

Why does the time for this simple program to run double if run quickly in succession?

I have been working through the introductory OpenMP examples, and on the first multithreaded one - numerical integration to compute pi - I knew the bit about false sharing would be coming, so I implemented the following:
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "omp.h"

#define STEPS 100000000.0
#define MAX_THREADS 4

void pi(double start, double end, double **sum);

int main() {
    double *sum[MAX_THREADS];
    omp_set_num_threads(MAX_THREADS);
    double inc;
    bool set_inc = false;
    double start = omp_get_wtime();
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        #pragma omp critical
        if (!set_inc) {
            int num_threads = omp_get_num_threads();
            printf("Using %d threads.\n", num_threads);
            inc = 1.0 / num_threads;
            set_inc = true;
        }
        pi(ID * inc, (ID + 1) * inc, &sum[ID]);
    }
    double end = omp_get_wtime();
    double tot = 0.0;
    for (int i = 0; i < MAX_THREADS; i++) {
        tot = tot + *sum[i];
    }
    tot = tot / STEPS;
    printf("The value of pi is: %.8f. Took %f secs.\n", tot, end - start);
    return 0;
}

void pi(double start, double end, double **sum_ptr) {
    double *sum = (double *) calloc(1, sizeof(double));
    for (double i = start; i < end; i = i + 1 / STEPS) {
        *sum = *sum + 4.0 / (1.0 + i * i);
    }
    *sum_ptr = sum;
}
My idea was that by using calloc, the chance of the returned pointers being contiguous, and thus pulled into the same cache line, was virtually nil (though I'm a tad unsure why there would be false sharing anyway, since a double is 64 bits here and I thought my cache lines were 8 bytes as well, so if you can enlighten me there too...). -- now I realize cache lines are typically 64 bytes, not bits
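For reference, the usual way to rule out false sharing without separate heap allocations is to pad each per-thread accumulator to a full cache line. A minimal sketch of that approach, assuming 64-byte cache lines (this is my own illustration, not part of the original post):

#include <stdio.h>
#include <omp.h>

#define STEPS 100000000.0
#define MAX_THREADS 4
#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* one accumulator per thread, padded so neighbouring sums never share a cache line */
struct padded_sum {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
};

int main(void) {
    struct padded_sum sums[MAX_THREADS] = {{0}};
    omp_set_num_threads(MAX_THREADS);
    double start = omp_get_wtime();
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        double inc = 1.0 / omp_get_num_threads();
        for (double i = ID * inc; i < (ID + 1) * inc; i += 1 / STEPS)
            sums[ID].value += 4.0 / (1.0 + i * i);
    }
    double tot = 0.0;
    for (int i = 0; i < MAX_THREADS; i++)
        tot += sums[i].value;
    printf("The value of pi is: %.8f. Took %f secs.\n", tot / STEPS, omp_get_wtime() - start);
    return 0;
}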
For fun, after compiling I ran the program in quick succession, and here's a short example of what I got (I was definitely pressing up-arrow and Enter faster than one run per half second):
user@user-kubuntu:~/git/openmp-practice$ ./pi_mp.exe
Using 4 threads.
The value of pi is: 3.14159273. Took 0.104703 secs.
user@user-kubuntu:~/git/openmp-practice$ ./pi_mp.exe
Using 4 threads.
The value of pi is: 3.14159273. Took 0.196900 secs.
I thought that maybe something was happening because of the way I tried to avoid false sharing, and since I'm still ignorant of exactly what goes on across the levels of memory, I chalked it up to that. So I followed the tutorial's prescribed method of using a critical section, like so:
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "omp.h"

#define STEPS 100000000.0
#define MAX_THREADS 4

double pi(double start, double end);

int main() {
    double sum = 0.0;
    omp_set_num_threads(MAX_THREADS);
    double inc;
    bool set_inc = false;
    double start = omp_get_wtime();
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        #pragma omp critical
        if (!set_inc) {
            int num_threads = omp_get_num_threads();
            printf("Using %d threads.\n", num_threads);
            inc = 1.0 / num_threads;
            set_inc = true;
        }
        double temp = pi(ID * inc, (ID + 1) * inc);
        #pragma omp critical
        sum += temp;
    }
    double end = omp_get_wtime();
    sum = sum / STEPS;
    printf("The value of pi is: %.8f. Took %f secs.\n", sum, end - start);
    return 0;
}

double pi(double start, double end) {
    double sum = 0.0;
    for (double i = start; i < end; i = i + 1 / STEPS) {
        sum = sum + 4.0 / (1.0 + i * i);
    }
    return sum;
}
The doubling in run time is virtually identical between the two versions. What's the explanation for this? Does it have anything to do with low-level memory? And can you answer my intermediate question about cache lines?
Thanks a lot.
Edit:
The compiler is gcc 7 on Kubuntu 17.10; the options used were -fopenmp -W -o (in that order).
The system has an i5 6500 @ 3.2 GHz and 16 GB of DDR4 RAM (though I forget its clock speed).
As some have asked, the program time does not continue to double if run more than twice in quick succession. After the initial doubling, it remains at around the same time (~.2 secs) for as many successive runs as I have tested (5+). Waiting a second or two, the time to run returns to the lesser amount. However, when the runs are not run manually in succession but rather in one command line such as ./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe; I get:
The value of pi is: 3.14159273. Took 0.100528 secs.
Using 4 threads.
The value of pi is: 3.14159273. Took 0.097707 secs.
Using 4 threads.
The value of pi is: 3.14159273. Took 0.098078 secs.
...
Adding gcc optimization options (-O3) had no change on any of the results.

OpenMP for beginners

I just got started with OpenMP; I wrote a little C program to check whether what I have studied is correct. However, I ran into some trouble; here is the main.c code:
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
float msec_kernel;
const int N = 1000000;
int i, a[N];
clock_t start = clock(), diff;
#pragma omp parallel for private(i)
for (i = 1; i <= N; i++){
a[i] = 2 * i;
}
diff = clock() - start;
msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
printf("Kernel Time: %e s\n",msec_kernel*1e-03);
printf("a[N] = %d\n",a[N]);
return 0;
}
My goal is to see how long it takes the PC to do such an operation using 1 and 2 CPUs. To compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2; however, I don't get the expected execution times. My results are:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
corresponding to N=1 and N=2 respectively. As you can see, when I use 2 CPUs it takes slightly more time than using just one! What am I doing wrong? How can I fix this problem?
First of all, using multiple cores doesn't automatically mean you're going to get better performance.
OpenMP has to manage the distribution of work and data among your cores, which takes time as well. Especially for a very basic operation such as the single multiplication you are doing, a sequential (single-core) program will perform better.
Second, by going through every element of your array only once and not doing anything else with it, you make no use of cache reuse, and certainly not of the cache shared between CPUs.
So you should start reading about general algorithm performance. To make good use of multiple cores, using the shared cache is, in my opinion, the essence.
Today's computers have reached a stage where the CPU is much faster than a memory allocation, read or write. This means that when using multiple cores, you will only see a benefit if you exploit things like shared cache, because distributing the data, initializing the threads and managing them costs time as well. To really see a performance speedup (an essential term in parallel computing) you should program an algorithm with a heavy accent on computation rather than memory; this has to do with locality (another important term).
So if you want to experience a big performance boost from multiple cores, test it on a matrix-matrix multiplication with big matrices such as 10'000 x 10'000, and plot some graphs of input size (matrix size) against time and matrix size against GFLOPS, comparing the multicore version with the sequential one.
Also make yourself comfortable with complexity analysis (Big O notation).
Matrix-matrix multiplication has a locality of O(n).
Hope this helps :-)
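As an illustration of such a benchmark, here is a minimal sketch (my own, not part of the answer above) of a timed parallel matrix-matrix multiplication; the size N and the i-k-j loop order are my own choices, and you would compile it with something like gcc -O3 -fopenmp:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1500   /* matrix dimension; large enough to see a speedup */

int main(void) {
    /* allocate on the heap: three N*N matrices do not fit on the stack */
    double *A = malloc((size_t)N * N * sizeof *A);
    double *B = malloc((size_t)N * N * sizeof *B);
    double *C = calloc((size_t)N * N, sizeof *C);
    if (!A || !B || !C) return 1;

    for (long i = 0; i < (long)N * N; i++) {
        A[i] = (double)rand() / RAND_MAX;
        B[i] = (double)rand() / RAND_MAX;
    }

    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {          /* i-k-j order is cache friendly */
            double a = A[(long)i * N + k];
            for (int j = 0; j < N; j++)
                C[(long)i * N + j] += a * B[(long)k * N + j];
        }
    double elapsed = omp_get_wtime() - start;

    /* roughly 2*N^3 floating-point operations */
    printf("time = %f s, %.2f GFLOP/s\n", elapsed, 2.0 * N * N * N / elapsed / 1e9);

    free(A); free(B); free(C);
    return 0;
}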
I suggest setting the number of threads within the code itself, either directly on the pragma line, #pragma omp parallel for num_threads(2), or using the omp_set_num_threads function, omp_set_num_threads(2);
Further, when doing time/performance analysis it is really important to run the program multiple times and then take the mean of the runtimes, or something like that. Running the respective programs only once will not give you a meaningful reading of the time used. Always run multiple times in a row, and don't forget to also vary the input data.
I suggest writing a test.c file which calls your actual function in a loop and then calculates the time per execution:
int executiontimes = 20;
clock_t initial_time = clock();
for (int i = 0; i < executiontimes; i++) {
    function_multiplication(values);
}
clock_t final_time = clock();
clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;
Improve this test harness, add some rand() values, seed them with srand(), etc. If you have more questions on the subject or about my answer, leave a comment and I'll try to explain further.
The function clock() returns elapsed CPU time, which includes ticks from all cores. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total CPU time will always be longer than the serial time.
If you want the real time (wall clock time), try to use the OMP Runtime Library function omp_get_wtime() defined in omp.h. It is cross platform portable and should be the preferred way to do wall timing.
You can also use the POSIX functions defined in time.h:
struct timespec start, stop;
clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);
double elapsed_time = (stop.tv_sec - start.tv_sec) +
                      1e-9 * (stop.tv_nsec - start.tv_nsec);
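Applied to the loop from the question, a minimal sketch using omp_get_wtime() might look like this (my own illustration, not part of the answers; the array is heap-allocated and indexed 0..N-1 to stay within bounds):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const int N = 1000000;
    int i;
    int *a = malloc(N * sizeof *a);   /* heap allocation avoids a huge stack array */
    if (!a) return 1;

    double start = omp_get_wtime();   /* wall-clock time, not summed CPU time */
    #pragma omp parallel for private(i)
    for (i = 0; i < N; i++)
        a[i] = 2 * (i + 1);           /* same values as the original 2*i for i = 1..N */
    double elapsed = omp_get_wtime() - start;

    printf("Kernel Time: %e s\n", elapsed);
    printf("a[N-1] = %d\n", a[N - 1]);
    free(a);
    return 0;
}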

OpenMP: parallel program not faster (or not very faster) then serial. What am I doing wrong?

Look at this code:
#include <stdio.h>
#include <omp.h>

int main()
{
    long i, j;
    #pragma omp for
    for (i = 0; i <= 100000; i++)
    {
        for (j = 0; j <= 100000; j++)
        {
            if ((i ^ j) == 5687)
            {
                //printf("%ld ^ %ld\n", i, j);
                break;
            }
        }
    }
}
So, result:
robotex@robotex-work:~/Projects$ gcc test.c -fopenmp -o test_openmp
robotex@robotex-work:~/Projects$ gcc test.c -o test_noopenmp
robotex@robotex-work:~/Projects$ time ./test_openmp
real    0m11.785s
user    0m11.613s
sys     0m0.008s
robotex@robotex-work:~/Projects$ time ./test_noopenmp
real    0m13.364s
user    0m13.253s
sys     0m0.008s
robotex@robotex-work:~/Projects$ time ./test_noopenmp
real    0m11.955s
user    0m11.853s
sys     0m0.004s
robotex@robotex-work:~/Projects$ time ./test_openmp
real    0m15.048s
user    0m14.949s
sys     0m0.004s
What's wrong? Why is the OpenMP program slower? How can I fix it?
I tested it on several computers (an Intel Core i5 at work, an Intel Core2 Duo T7500 at home), all running Ubuntu, and always got the same result: OpenMP doesn't give a significant performance gain.
I also tested example from Wikipedia and got the same result.
There are two issues in your code:
You're missing the parallel in your pragma. So it's only using 1 thread.
You have a race condition on j because it's declared outside the parallel region.
First, you need parallel to actually make OpenMP run in parallel:
#pragma omp parallel for
Secondly, you are declaring j outside the parallel region. This will make it shared among all the threads. So all the threads read and modify it inside the parallel region.
So not only do you have a race-condition, but the cache coherence traffic caused by all the invalidations is killing your performance.
What you need to do is to make j local to each thread. This can be done by either:
Declaring j inside the parallel region.
Or adding private(j) to the pragma: #pragma omp parallel for private(j) (as pointed out by @ArjunShankar in the comments).
Try this instead:
#include <stdio.h>
#include <omp.h>

int main()
{
    double start = omp_get_wtime();
    long i;
    #pragma omp parallel for
    for (i = 0; i <= 100000; i++)
    {
        long j;
        for (j = 0; j <= 100000; j++)
        {
            if ((i ^ j) == 5687)
            {
                //printf("%ld ^ %ld\n", i, j);
                break;
            }
        }
    }
    double end = omp_get_wtime();
    printf("%f\n", end - start);
    return 0;
}
No OpenMP: 6.433378
OpenMP with global j: 9.634591
OpenMP with local j: 2.266667

How to generate random numbers in parallel?

I want to generate pseudorandom numbers in parallel using OpenMP, something like this:
int i;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
    printf("%d %d %d\n", i, omp_get_thread_num(), rand());
}
return 0;
I've tested it on Windows and I got a huge speedup, but each thread generated exactly the same numbers. I've also tested it on Linux and I got a huge slowdown - the parallel version on an 8-core processor was about 10 times slower than the sequential one - but each thread generated different numbers.
Is there any way to have both speedup and different numbers?
Edit 27.11.2010
I think I've solved it using an idea from Jonathan Dursi's post. It seems that the following code works fast on both Linux and Windows, and the numbers are also pseudorandom. What do you think about it?
int seed[10];

int main(int argc, char **argv)
{
    int i, s;
    for (i = 0; i < 10; i++)
        seed[i] = rand();
    #pragma omp parallel private(s)
    {
        s = seed[omp_get_thread_num()];
        #pragma omp for
        for (i = 0; i < 1000; i++)
        {
            printf("%d %d %d\n", i, omp_get_thread_num(), s);
            s = (s * 17931 + 7391); // those numbers should be chosen more carefully
        }
        seed[omp_get_thread_num()] = s;
    }
    return 0;
}
PS.: I haven't accepted any answer yet, because I need to be sure that this idea is good.
I'll post here what I posted to Concurrent random number generation:
I think you're looking for rand_r(), which explicitly takes the current RNG state as a parameter. Each thread should then have its own copy of the seed data (whether you want each thread to start off with the same seed or different ones depends on what you're doing; here you want them to be different or you'd get the same row again and again). There's some discussion of rand_r() and thread safety here: whether rand_r is real thread safe?
So say you wanted each thread to have its seed start off with its thread number (which is probably not what you want, as it would give the same results every time you ran with the same number of threads, but just as an example):
#pragma omp parallel default(none)
{
    int i;
    unsigned int myseed = omp_get_thread_num();
    #pragma omp for
    for (i = 0; i < 100; i++)
        printf("%d %d %d\n", i, omp_get_thread_num(), rand_r(&myseed));
}
Edit: Just on a lark, I checked to see if the above would get any speedup. The full code was:
#define NRANDS 1000000

int main(int argc, char **argv) {
    struct timeval t;
    int a[NRANDS];

    tick(&t);
    #pragma omp parallel default(none) shared(a)
    {
        int i;
        unsigned int myseed = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < NRANDS; i++)
            a[i] = rand_r(&myseed);
    }
    double sum = 0.;
    double time = tock(&t);

    for (long int i = 0; i < NRANDS; i++) {
        sum += a[i];
    }

    printf("Time = %lf, sum = %lf\n", time, sum);
    return 0;
}
where tick and tock are just wrappers around gettimeofday(), and tock() returns the difference in seconds. The sum is printed just to make sure nothing gets optimized away, and to demonstrate a small point: you will get different numbers with different numbers of threads, because each thread uses its thread number as a seed; if you run the same code again and again with the same number of threads you'll get the same sum, for the same reason. Anyway, timing (running on an 8-core Nehalem box with no other users):
$ export OMP_NUM_THREADS=1
$ ./rand
Time = 0.008639, sum = 1074808568711883.000000
$ export OMP_NUM_THREADS=2
$ ./rand
Time = 0.006274, sum = 1074093295878604.000000
$ export OMP_NUM_THREADS=4
$ ./rand
Time = 0.005335, sum = 1073422298606608.000000
$ export OMP_NUM_THREADS=8
$ ./rand
Time = 0.004163, sum = 1073971133482410.000000
So there is a speedup, if not a great one; as @ruslik points out, this is not really a compute-intensive process, and other issues like memory bandwidth start to play a role. Hence only a shade over 2x speedup on 8 cores.
You cannot use the C rand() function from multiple threads; this results in undefined behavior. Some implementations might give you locking (which will make it slow); others might allow threads to clobber each other's state, possibly crashing your program or just giving "bad" random numbers.
To solve the problem, either write your own PRNG implementation or use an existing one that allows the caller to store and pass the state to the PRNG iterator function.
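As an illustration (my own sketch, not from the answer above), a tiny per-thread generator with caller-held state could look like this; xorshift64 is just a placeholder for a properly chosen PRNG:

#include <stdio.h>
#include <stdint.h>
#include <omp.h>

/* xorshift64: all state is passed in by the caller, so nothing is shared between threads */
static inline uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

int main(void) {
    #pragma omp parallel
    {
        /* seed must be nonzero and different for each thread */
        uint64_t state = 0x9E3779B97F4A7C15ULL + omp_get_thread_num();
        int i;
        #pragma omp for
        for (i = 0; i < 20; i++)
            printf("%d [%d] %llu\n", i, omp_get_thread_num(),
                   (unsigned long long)xorshift64(&state));
    }
    return 0;
}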
Get each thread to set a different seed based on its thread id, e.g. srand(omp_get_thread_num() * 1000);
It seems that rand() has a single global state shared between all threads on Linux, and thread-local state on Windows. The shared state on Linux is causing your slowdown because of the necessary synchronization.
I don't think there is a portable way in the C library to use an RNG in parallel from multiple threads, so you need another one. You could use a Mersenne Twister. As marcog said, you need to initialize the seed differently for each thread.
On Linux/Unix you can use
long jrand48(unsigned short xsubi[3]);
where xsubi[3] encodes the state of the random number generator, like this:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>       // needed for omp_get_thread_num()
#include <algorithm>

int main() {
    unsigned short *xsub;
    #pragma omp parallel private(xsub)
    {
        xsub = new unsigned short[3];
        xsub[0] = xsub[1] = xsub[2] = 3 + omp_get_thread_num();
        int j;
        #pragma omp for
        for (j = 0; j < 10; j++)
            printf("%d [%d] %ld\n", j, omp_get_thread_num(), jrand48(xsub));
    }
}
compile with
g++-mp-4.4 -Wall -Wextra -O2 -march=native -fopenmp -D_GLIBCXX_PARALLEL jrand.cc -o jrand
(replace g++-mp-4.4 with whatever you need to call g++ version 4.4 or 4.3)
and you get
$ ./jrand
0 [0] 1344229389
1 [0] 1845350537
2 [0] 229759373
3 [0] 1219688060
4 [0] -553792943
5 [1] 360650087
6 [1] -404254894
7 [1] 1678400333
8 [1] 1373359290
9 [1] 171280263
i.e. 10 different pseudorandom numbers without any mutex locking or race conditions.
Random numbers can be generated very fast, so usually memory is the bottleneck. By dividing this task among several threads you create additional communication and synchronization overheads (and synchronization of the caches of different cores is not cheap).
It would be better to use a single thread with a better random() function.