OpenMP: No Speedup in parallel workloads - c

I can't figure out what is going on with my fairly simple OpenMP-parallelized for loop. On the same input size, P=1 runs in ~50 seconds, but P=2 takes almost 300 seconds and P=4 takes ~250 seconds.
Here's the parallelized loop:
double time = omp_get_wtime();
printf("Input Size: %d\n", n);
#pragma omp parallel for private(i) reduction(+:in)
for(i = 0; i < n; i++) {
double x = (double)(rand() % 10000)/10000;
double y = (double)(rand() % 10000)/10000;
if(inCircle(x, y)) {
in++;
}
}
double ratio = (double)in/(double)n;
double est_pi = ratio * 4.0;
time = omp_get_wtime() - time;
Runtimes:
p=1, n=1073741824 - 52.764 seconds
p=2, n=1073741824 - 301.66 seconds
p=4, n=1073741824 - 274.784 seconds
p=8, n=1073741824 - 188.224 seconds
Running in an Ubuntu 20.04 VM with 8 cores of a Xeon 5650 and 16 GB of DDR3 ECC RAM, on top of a FreeNAS installation on a dual Xeon 5650 system with 70 GB of RAM.
Partial Solution:
The rand() function inside of the loop causes the time to jump when running on multiple threads.

Since rand() uses state saved from the previous call to generate the next pseudo-random number, it can't safely run in multiple threads at the same time: the threads would all need to read and write the same PRNG state concurrently.
POSIX states that rand() need not be thread safe. This means your code could just not work right. Or the C library might put in a mutex so that only one thread could call rand() at a time. This is what's happening, but it slows the code down considerably. The threads are nearly entirely consumed trying to get access to the rand critical section as nothing else they are doing takes any significant time.
To solve this, try using rand_r(), which does not use shared state, but instead is passed the seed value it should use for state.
Keep in mind that using the same seed for every thread will defeat the purpose of increasing the number of trials in your Monte Carlo simulation. Each thread would just use the exact same pseudo-random sequence. Try something like this:
unsigned int seed;
#pragma omp parallel private(seed)
{
seed = omp_get_thread_num();
#pragma omp for private(i) reduction(+:in)
for(i = 0; i < n; i++) {
double x = (double)(rand_r(&seed) % 10000)/10000;
double y = (double)(rand_r(&seed) % 10000)/10000;
if(inCircle(x, y)) {
in++;
}
}
}
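BTW, you might notice your estimate is off. x and y need to be evenly distributed in the range [0, 1], and they are not: rand() % 10000 can only produce 10000 distinct values, never reaches 1, and is slightly biased toward small values whenever RAND_MAX + 1 is not a multiple of 10000.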
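A minimal sketch of how the two points of this answer could be combined: a per-thread rand_r() state plus a mapping to [0, 1) that divides by RAND_MAX + 1 instead of using % 10000. The names uniform01 and estimate_pi and the seed constants are assumptions made for illustration, and the inline x*x + y*y test stands in for the asker's inCircle(), which is not shown in the question.

```c
#define _POSIX_C_SOURCE 200809L  /* makes rand_r() visible under strict -std modes */
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num(void) { return 0; }  /* serial fallback */
#endif

/* Map one rand_r() draw to a double in [0, 1). */
static double uniform01(unsigned int *seed) {
    return rand_r(seed) / ((double)RAND_MAX + 1.0);
}

/* Monte Carlo estimate of pi from n points in the unit square. */
double estimate_pi(long n) {
    long in = 0;
    #pragma omp parallel reduction(+:in)
    {
        /* Hypothetical seeding scheme: distinct per thread, fixed across runs. */
        unsigned int seed = 12345u + 7919u * (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++) {
            double x = uniform01(&seed);
            double y = uniform01(&seed);
            if (x * x + y * y <= 1.0)   /* point falls inside the quarter circle */
                in++;
        }
    }
    return 4.0 * (double)in / (double)n;
}
```

Calling estimate_pi(n) from main() and compiling with gcc -O3 -fopenmp should give an estimate whose error shrinks roughly as 1/sqrt(n), limited in the end by the quality of rand_r() itself.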

Related

Why is there no speed up when using OpenMP to generate random numbers?

I am looking to run a type of monte carlo simulations which require the generation of random numbers, and a set of instructions based on those random numbers.
I wish to make use of parallel processing, but when testing my code (written in C) there seems to be an inverse speedup with more cores! I'm not sure what I could be doing wrong. I then copied the code from another answer and still get this effect.
The code, slightly modified from the answer, is
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NRANDS 1000000
int main() {
int a[NRANDS];
#pragma omp parallel default(none) shared(a)
{
int i;
unsigned int myseed = omp_get_thread_num();
#pragma omp for
for(i=0; i<NRANDS; i++)
a[i] = rand_r(&myseed);
}
double sum = 0.;
for (long int i=0; i<NRANDS; i++) {
sum += a[i];
}
printf("sum = %lf\n", sum);
return 0;
}
I timed it by running the time command in a terminal, and varied the number of threads with export OMP_NUM_THREADS=2 (and so on). The output of my terminal is:
Thread total: 1
sum = 1074808568711883.000000
real 0m0,041s
user 0m0,036s
sys 0m0,004s
Thread total: 2
sum = 1074093295878604.000000
real 0m0,037s
user 0m0,058s
sys 0m0,008s
Thread total: 3
sum = 1073700114076905.000000
real 0m0,032s
user 0m0,061s
sys 0m0,010s
Thread total: 4
sum = 1073422298606608.000000
real 0m0,035s
user 0m0,074s
sys 0m0,024s
Note that the time command adds up the time spent on all cores when it prints the user and sys values. Observe that your wall time (real) is nearly constant.
Also, your benchmark is too small. There is a significant cost of creating and managing threads. This overhead may be overshadowing the actual execution time of the random number generation. A million values isn't that many. In other words, the time taken to actually compute the random numbers is so small that it's lost in the noise and dwarfed by the setup/teardown costs. If you generate a whole lot more, you may start to see the advantage due to parallelism.

Varying run time of an OpenMP parallel region

Whenever I run this code, it reports a different run time for the parallel section. I tried fixing the number of threads to match my core count, but the timings still vary. The program calculates the value of pi. Compiled with gcc -fopenmp.
#include <stdio.h>
#include <omp.h>
static long num_steps = 100000; double step;
//double omp_get_wtime(void);
int main (){
int i;
double x,pi,max_threads,start,time;
double sum=0.0;
step = 1.0/(double) num_steps;
//omp_set_num_threads(4);
omp_get_max_threads();
start=omp_get_wtime();
#pragma omp parallel
{
#pragma omp for reduction(+:sum) schedule(static) private(x) //reduction to get local copy
for (i=0;i<num_steps;i++){
x=(i+0.5)*step;
sum += 4.0/(1.0+x*x);
}
//max_threads=omp_get_max_threads();
}
time=omp_get_wtime()-start;
pi=step*sum;
printf("pi=(%f)\t run_time(%f)\n",pi,time);//,max_threads);
return 0;
}
The code runs for only a few milliseconds (2-6 ms on my system), so the time is dominated by overhead, e.g. thread creation. The serial version runs in <1 ms. It is normal for such short execution times to vary widely, since they depend on the current state of the system, e.g. some 'warm-up' is needed.
In this case, just increase num_steps to get meaningful stable results. E.g. with num_steps = 1000000000, 10 executions are all between 4.332 and 4.399 seconds on my system.
Generally, if you do performance measurements, you should compile with the -O3 flag.
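As a sketch of the advice above, here is the same midpoint-rule integration written as a function that takes num_steps as a parameter, so that a large value can be passed in when timing. The function name integrate_pi is an assumption; the loop body follows the question's code.

```c
/* Midpoint-rule estimate of pi = integral of 4/(1+x^2) over [0, 1]. */
double integrate_pi(long num_steps) {
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;   /* declared inside the loop, so private */
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

Wrapping a call such as integrate_pi(1000000000) between two omp_get_wtime() calls, in a program built with gcc -O3 -fopenmp, should give timings that are long enough to be stable from run to run.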

Can the producer/consumer ( bounded-buffer ) be sped up through openMP?

I'm implementing the producer/consumer problem for homework, and I have to compare the sequential algorithm with the parallel one, and my parallel one seems to only be able to run either at the same speed or slower than the sequential one. I've come to the conclusion that using a queue is a limiting factor and it won't speed up my algorithm.
Is this the case or am I just coding it wrong?
int main() {
long sum = 0;
unsigned long serial = ::GetTickCount();
for(int i = 0; i < test; i++){
enqueue(rand()%54354);
sum+= dequeue();
}
printf("%ld \n",sum);
serial = (::GetTickCount() - serial);
printf("Serial Program took: %f seconds\n", serial * .001);
sum = 0;
unsigned long omp = ::GetTickCount();
#pragma omp parallel for num_threads(128) default(shared)
for(int i = 0; i < test; i++){
enqueue(rand()%54354);
sum+= dequeue();
}
#pragma omp barrier //joins all threads
omp = (::GetTickCount() - omp);
printf("%ld \n",sum);
printf("OpenMP Program took: %f seconds\n", omp * .001);
getchar();
}
Problem #1:
You have rand() inside the parallel region.
rand() is not thread-safe. It uses global/static variables. So calling it concurrently from multiple threads will lead to unexpected (possibly undefined) behavior.
That aside, the data-races resulting from concurrent calls to rand() will lead to a lot of cache coherency stalls. This is likely the source of the slowdown.
Problem #2:
Are enqueue() and dequeue() thread-safe?
If it isn't, then you need to fix that first. If it is, how are you synchronizing it?
If it's just a critical region that allows only one thread at a time to access the queue, then that kind of defeats the whole purpose of parallelism.
Problem #3:
This line modifies the sum variable in each iteration:
sum += dequeue();
Note that all the threads will be doing this concurrently. So you need to declare sum as a reduction variable.
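A minimal sketch of the reduction mentioned in Problem #3, assuming the queue itself has already been made thread-safe. The function name sum_with_reduction is hypothetical, and i % 54354 merely stands in for the value that dequeue() would return.

```c
/* Sum of (i % 54354) for i in [0, n), accumulated with an OpenMP reduction. */
long sum_with_reduction(int n) {
    long sum = 0;
    /* Each thread adds into a private copy of sum; the copies are combined
       once when the loop ends, so no two threads write the same variable. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += i % 54354;   /* stands in for the value returned by dequeue() */
    }
    return sum;
}
```

This only removes the race on sum. It does nothing about the contention on the queue or on rand() described in Problems #1 and #2.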

What could be some possible problems with this use of OpenMP?

I was trying to figure out how to parallelize a segment of code in OpenMP, where the inside of the for loop is independent from the rest of it.
Basically the project is dealing with particle systems, but I don't think that should be relevant to the parallelization of the code. Is it a caching problem, where the for loop divides the work among threads in such a way that the particles are not cached efficiently in each core?
Edit: As mentioned by an answer below, I'm wondering why I'm not getting speedup.
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
s->particles[i].pos = s->particles[i].pos + dt * s->particles[i].vel;
s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
// printf("%d", omp_get_thread_num());
}
If you're asking whether it's parallelized correctly, it looks fine. I don't see any data-races or loop-dependencies that could break it.
But I think you're wondering why you aren't getting any speedup with parallelism.
You mentioned that the trip count, psize-n_dead, will be on the order of 4000. I'd say that's actually pretty small given the amount of work in the loop.
In other words, you don't have much total work to be worth parallelizing. So threading overhead is probably eating up any speedup that you should be gaining. If possible, you should try parallelizing at a higher level.
EDIT: You updated your comment to include up to 200000.
For larger values, it's likely that you'll be memory bound in some way. Your loop merely iterates through all the data doing very little work. So using more threads probably won't help much (if at all).
There are no correctness issues such as data races in this piece of code.
Assuming that the number of particles to process is big enough to warrant parallelism, I do not see OpenMP related performance issues in this code. By default, OpenMP will split the loop iterations statically in equal portions across all threads, so any cache conflicts may only occur at the boundaries of these portions, i.e. just in a few iterations of the loop.
Unrelated to OpenMP (and so to the parallel speedup problem), a performance improvement can possibly be achieved by switching from array-of-structs to struct-of-arrays, as this might help the compiler vectorize the code (i.e. use the SIMD instructions of the target processor):
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
s->particles.pos[i] = s->particles.pos[i] + dt * s->particles.vel[i];
s->particles.vel[i] = (1 - dt*.1) * s->particles.vel[i] + dt*s->force;
}
Such a reorganization assumes that most of the time all particles are processed in a loop like this one. Working with an individual particle requires more cache lines to be loaded, but if you process them all in a loop, the net number of cache lines loaded is nearly the same.
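For illustration, a struct-of-arrays layout could be declared roughly as follows. This is a sketch under assumptions: the type name particles_soa and the function update_soa are hypothetical, the layout is flattened into a single struct rather than nested under s->particles, and position and velocity are taken to be scalar doubles.

```c
/* Struct-of-arrays layout: all positions are contiguous, as are all velocities. */
typedef struct {
    double *pos;    /* pos[i] is the position of particle i */
    double *vel;    /* vel[i] is the velocity of particle i */
    double force;
} particles_soa;

/* Same update as the array-of-structs loop, on the flattened layout. */
void update_soa(particles_soa *p, unsigned n, double dt) {
    #pragma omp parallel for
    for (unsigned i = 0; i < n; ++i) {
        p->pos[i] = p->pos[i] + dt * p->vel[i];
        p->vel[i] = (1 - dt * .1) * p->vel[i] + dt * p->force;
    }
}
```

Whether this layout is actually faster depends on the compiler and the machine, so it is worth measuring both variants.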
How sure are you that you're not getting speedup?
Trying it both ways - array of structs and struct of arrays, compiled with gcc -O3 (gcc 4.6), on a dual quad-core nehalem, I get for psize-n_dead = 200000, running 100 iterations for better timer accuracy:
Struct of arrays (reported times are in milliseconds):
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 90.984000
Took time 45.992000
Took time 22.996000
Took time 11.998000
Array of structs:
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 58.989000
Took time 28.995000
Took time 14.997000
Took time 8.999000
However, because the operation is so short (sub-ms), I didn't see any speedup without doing 100 iterations, because of timer accuracy. Also, you'd have to have a machine with good memory bandwidth to get this sort of behaviour; you're only doing ~3 FMAs and another multiplication for every two pieces of data you read in.
Code for array-of-structs follows.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
typedef struct particle_struct {
double pos;
double vel;
} particle;
typedef struct simulation_struct {
particle *particles;
double force;
} simulation;
void tick(struct timeval *t) {
gettimeofday(t, NULL);
}
/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
struct timeval now;
gettimeofday(&now, NULL);
return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}
void update(simulation *s, unsigned psize, double dt) {
#pragma omp parallel for
for (unsigned i = 0; i < psize; ++i)
{
s->particles[i].pos = s->particles[i].pos+ dt * s->particles[i].vel;
s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
}
}
void init(simulation *s, unsigned np) {
s->force = 1.;
s->particles = malloc(np*sizeof(particle));
for (unsigned i=0; i<np; i++) {
s->particles[i].pos = 1.;
s->particles[i].vel = 1.;
}
}
int main(void)
{
const unsigned np=200000;
simulation s;
struct timeval clock;
init(&s, np);
tick(&clock);
for (int iter=0;iter< 100; iter++)
update(&s, np, 0.75);
double elapsed=tock(&clock)*1000.;
printf("Took time %lf\n", elapsed);
free(s.particles);
}

How to generate random numbers in parallel?

I want to generate pseudorandom numbers in parallel using openMP, something like this:
int i;
#pragma omp parallel for
for (i=0;i<100;i++)
{
printf("%d %d %d\n",i,omp_get_thread_num(),rand());
}
return 0;
I've tested it on Windows and I got a huge speedup, but each thread generated exactly the same numbers. I've also tested it on Linux and I got a huge slowdown: the parallel version on an 8-core processor was about 10 times slower than the sequential one, but each thread generated different numbers.
Is there any way to have both speedup and different numbers?
Edit 27.11.2010
I think I've solved it using an idea from Jonathan Dursi's post. It seems that the following code works fast on both Linux and Windows. The numbers are also pseudorandom. What do you think about it?
int seed[10];
int main(int argc, char **argv)
{
int i,s;
for (i=0;i<10;i++)
seed[i] = rand();
#pragma omp parallel private(s)
{
s = seed[omp_get_thread_num()];
#pragma omp for
for (i=0;i<1000;i++)
{
printf("%d %d %d\n",i,omp_get_thread_num(),s);
s=(s*17931+7391); // those numbers should be chosen more carefully
}
seed[omp_get_thread_num()] = s;
}
return 0;
}
PS.: I haven't accepted any answer yet, because I need to be sure that this idea is good.
I'll post here what I posted to Concurrent random number generation :
I think you're looking for rand_r(), which explicitly takes the current RNG state as a parameter. Then each thread should have its own copy of seed data (whether you want each thread to start off with the same seed or different ones depends on what you're doing, here you want them to be different or you'd get the same row again and again). There's some discussion of rand_r() and thread-safety here: whether rand_r is real thread safe? .
So say you wanted each thread to have its seed start off with its thread number (which is probably not what you want, as it would give the same results every time you ran with the same number of threads, but just as an example):
#pragma omp parallel default(none)
{
int i;
unsigned int myseed = omp_get_thread_num();
#pragma omp for
for(i=0; i<100; i++)
printf("%d %d %d\n",i,omp_get_thread_num(),rand_r(&myseed));
}
Edit: Just on a lark, checked to see if the above would get any speedup. Full code was
#define NRANDS 1000000
int main(int argc, char **argv) {
struct timeval t;
int a[NRANDS];
tick(&t);
#pragma omp parallel default(none) shared(a)
{
int i;
unsigned int myseed = omp_get_thread_num();
#pragma omp for
for(i=0; i<NRANDS; i++)
a[i] = rand_r(&myseed);
}
double sum = 0.;
double time=tock(&t);
for (long int i=0; i<NRANDS; i++) {
sum += a[i];
}
printf("Time = %lf, sum = %lf\n", time, sum);
return 0;
}
where tick and tock are just wrappers to gettimeofday(), and tock() returns the difference in seconds. Sum is printed just to make sure that nothing gets optimized away, and to demonstrate a small point; you will get different numbers with different numbers of threads because each thread gets its own threadnum as a seed; if you run the same code again and again with the same number of threads you'll get the same sum, for the same reason. Anyway, timing (running on a 8-core nehalem box with no other users):
$ export OMP_NUM_THREADS=1
$ ./rand
Time = 0.008639, sum = 1074808568711883.000000
$ export OMP_NUM_THREADS=2
$ ./rand
Time = 0.006274, sum = 1074093295878604.000000
$ export OMP_NUM_THREADS=4
$ ./rand
Time = 0.005335, sum = 1073422298606608.000000
$ export OMP_NUM_THREADS=8
$ ./rand
Time = 0.004163, sum = 1073971133482410.000000
So there is some speedup, if not a great one; as @ruslik points out, this is not really a compute-intensive process, and other issues like memory bandwidth start playing a role. Thus, only a shade over 2x speedup on 8 cores.
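On the seeding caveat in this answer (using the bare thread number gives identical results on every run), one possible variation is to mix a per-run value into each thread's seed. The helper name thread_seed and the mixing constant are assumptions for illustration; this keeps the seeds distinct but makes no claim about the statistical independence of the resulting rand_r() streams.

```c
/* Hypothetical helper: combine a per-run base value with the thread number. */
unsigned int thread_seed(unsigned int base, int tid) {
    /* 0x9E3779B9 is odd, so multiplying by it modulo 2^32 is one-to-one:
       distinct thread numbers always yield distinct seeds for the same base. */
    return base ^ (0x9E3779B9u * (unsigned int)(tid + 1));
}
```

Inside the parallel region this could be used as unsigned int myseed = thread_seed((unsigned int)time(NULL), omp_get_thread_num()); with time() coming from <time.h>. Passing a fixed base instead restores reproducible runs.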
You cannot use the C rand() function from multiple threads; this results in undefined behavior. Some implementations might give you locking (which will make it slow); others might allow threads to clobber each other's state, possibly crashing your program or just giving "bad" random numbers.
To solve the problem, either write your own PRNG implementation or use an existing one that allows the caller to store and pass the state to the PRNG iterator function.
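As a rough sketch of the "caller stores and passes the state" idea, here is Marsaglia's 32-bit xorshift generator with one state variable per thread. The function names and the seeding scheme (a fixed constant plus the thread number) are assumptions made for illustration, and this generator is not suitable for anything that needs high-quality or cryptographic randomness.

```c
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num(void) { return 0; }  /* serial fallback */
#endif

/* One xorshift32 step. *state must be nonzero and is updated in place. */
uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill out[0..n) using an independent generator state in each thread. */
void fill_random(uint32_t *out, long n) {
    #pragma omp parallel
    {
        /* Simplistic seeding, assumed for the example: nonzero and per-thread. */
        uint32_t state = 2463534242u + (uint32_t)omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++)
            out[i] = xorshift32(&state);
    }
}
```

Because the state lives in a variable private to each thread, no locking is needed and the threads cannot clobber each other's state.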
Get each thread to set a different seed based on its thread id, e.g. srand(omp_get_thread_num() * 1000);
It seems that rand() has a global state shared between all threads on Linux, and a thread-local state on Windows. The shared state on Linux is causing your slowdown because of the necessary synchronization.
I don't think there is a portable way in the C library to use the RNG in parallel on multiple threads, so you need another one. You could use a Mersenne Twister. As marcog said, you need to initialize the seed for each thread differently.
On linux/unix you can use
long jrand48(unsigned short xsubi[3]);
where xsubi[3] encodes the state of the random number generator, like this:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main() {
#pragma omp parallel
{
// each thread keeps its own generator state on its stack
unsigned short xsub[3];
xsub[0]=xsub[1]=xsub[2]= 3+omp_get_thread_num();
int j;
#pragma omp for
for(j=0;j<10;j++)
printf("%d [%d] %ld\n", j, omp_get_thread_num(), jrand48(xsub));
}
}
compile with
g++-mp-4.4 -Wall -Wextra -O2 -march=native -fopenmp -D_GLIBCXX_PARALLEL jrand.cc -o jrand
(replace g++-mp-4.4 with whatever you need to call g++ version 4.4 or 4.3)
and you get
$ ./jrand
0 [0] 1344229389
1 [0] 1845350537
2 [0] 229759373
3 [0] 1219688060
4 [0] -553792943
5 [1] 360650087
6 [1] -404254894
7 [1] 1678400333
8 [1] 1373359290
9 [1] 171280263
i.e. 10 different pseudorandom numbers without any mutex locking or race conditions.
Random numbers can be generated very fast, so usually memory would be the bottleneck. By dividing this task between several threads you create additional communication and synchronization overheads (and synchronizing the caches of different cores is not cheap).
It would be better to use a single thread with a better random() function.
