I'm trying to parallelize a ray tracer in C, but the execution time is not dropping as the number of threads increases. The code I have so far is:
main2(thread function):
float **result=malloc(width * sizeof(float*));
int count=0;
for (int px=0; px<width; ++px)
for (int py=0; py<height; ++py)
float *scaled_color=malloc(3*sizeof(float));
return (void *) result;
pthread_t threads[nthreads];
for (i=0;i<nthreads;i++)
pthread_create(&threads[i], NULL, main2, &i);
float** result_handler;
for (i=0; i<nthreads; i++)
pthread_join(threads[i], (void *) &result_handler);
int count=0;
for(j=0; j<width;j++)
float* scaled_color=result_handler[count];
count ++;
main2 returns a float ** so that the picture can be printed in order in the main function. Does anyone know why the execution time is not dropping (e.g. it runs longer with 8 threads than with 4 threads, when it should be the other way around)?

It's not enough to add threads, you need to actually split the task as well. Looks like you're doing the same job in every thread, so you get n copies of the result with n threads.
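A minimal sketch of what splitting the task could look like, assuming each thread renders its own contiguous range of columns into one shared image buffer. The names `slice_job`, `trace_pixel`, `render_slice` and `render_parallel` are hypothetical and not from the question, and `trace_pixel` is only a stand-in for the real ray tracing work. Each thread gets its own job struct, which also avoids handing every thread a pointer to the same loop variable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread job description (not from the original post). */
typedef struct {
    int first_col;   /* first column this thread renders (inclusive) */
    int last_col;    /* one past the last column (exclusive) */
    int height;
    float *image;    /* shared output: width*height*3 floats */
} slice_job;

/* Stand-in for the real per-pixel ray tracing work. */
static void trace_pixel(int px, int py, float rgb[3]) {
    rgb[0] = (float)px;
    rgb[1] = (float)py;
    rgb[2] = (float)(px + py);
}

/* Thread function: renders only the columns assigned to this job. */
static void *render_slice(void *arg) {
    slice_job *job = arg;
    for (int px = job->first_col; px < job->last_col; ++px)
        for (int py = 0; py < job->height; ++py)
            trace_pixel(px, py, &job->image[3 * (px * job->height + py)]);
    return NULL;
}

/* Renders the whole image with nthreads threads; returns 0 on success. */
int render_parallel(float *image, int width, int height, int nthreads) {
    assert(nthreads > 0);
    pthread_t *threads = malloc(nthreads * sizeof *threads);
    slice_job *jobs = malloc(nthreads * sizeof *jobs);
    if (!threads || !jobs) { free(threads); free(jobs); return -1; }
    int started = 0;
    for (int t = 0; t < nthreads; ++t) {
        /* Each thread owns a distinct, contiguous range of columns. */
        jobs[t].first_col = (int)((long)width * t / nthreads);
        jobs[t].last_col = (int)((long)width * (t + 1) / nthreads);
        jobs[t].height = height;
        jobs[t].image = image;
        if (pthread_create(&threads[t], NULL, render_slice, &jobs[t]) != 0)
            break;
        ++started;
    }
    /* Join only after every thread has been started. */
    for (int t = 0; t < started; ++t)
        pthread_join(threads[t], NULL);
    free(threads);
    free(jobs);
    return started == nthreads ? 0 : -1;
}
```

Because every pixel has a fixed slot in the shared buffer, the main thread can print the picture in order after the joins without collecting a separate result from each thread.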

Parallelizing programs and algorithms is usually nontrivial and doesn't come without some investment.
I don't think that working directly with threads is the right tool for you. Try looking into OpenMP; it is much more high-level.

Two things are working against you here. (1) Unless you can allocate threads to more than one core, you can't expect a speedup in the first place; using a single core, that core has the same amount of work to do whether you parallelize the code or not. (2) Even with multiple cores, parallel performance is exquisitely sensitive to the ratio of computation done on-core to the amount of communication necessary between cores. With pthread_join() inside the loop, you're incurring a lot of these 'stop and wait for the other guy' performance hits.


Why would executing a function in parallel significantly slow down the program?

I am trying to parallelize some code using OpenMP. The serial time for my current input size is around 9 seconds. I have code of the following form:
int main()
{
    /* do some stuff*/
}
void myfunction()
{
    for (int i=0; i<n; i++)
    {
        //it has some parameters but that is beyond the point I guess
        int rand = custom_random_generator();
    }
}
Here the random generator can be executed in parallel since there are no dependencies, and the same goes for the compute function, so I attempted to parallelize this piece, but all my attempts failed. My first thought was to make these functions tasks so they get executed in parallel, but that turned out slower. Here is what I did:
void myfunction()
{
    for (int i=0; i<n; i++)
    {
        #pragma omp task
        {
            //it has some parameters but that is beyond the point I guess
            int rand=custom_random_generator();
        }
    }
}
Result: 23 seconds, more than double the serial time
Putting task on compute() only resulted in the same
Even worse attempt:
void myfunction()
{
    #pragma omp parallel for
    for (int i=0; i<n; i++)
    {
        //it has some parameters but that is beyond the point I guess
        int rand=custom_random_generator();
    }
}
Result: 45 seconds
Theoretically speaking, why could this happen? I know that for anyone to identify my exact problem they would need a minimal reproducible example, but my goal with this question is to understand the different theories that could explain my problem and apply them myself. Why would parallelizing an "embarrassingly parallel" piece of code result in much worse performance?
One theory could be the overhead that is associated with creating and maintaining multiple threads.
The advantages of parallel programming can only be seen when each iteration performs a more complicated, processor-intensive task.
A simple for loop with some simple routine inside would not take advantage of it.
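As a rough illustration of that theory, here is a minimal sketch, assuming the per-iteration work is cheap and independent. `cheap_work`, `sum_work` and the cutoff `MIN_PARALLEL_ITERS` are made-up names and values, not from the question. The loop opens a single parallel region, uses a reduction so threads do not contend on one shared variable, and the `if` clause keeps small loops serial so they do not pay the threading overhead.

```c
#include <assert.h>

/* Hypothetical cutoff below which threading overhead is assumed to dominate;
 * the right value has to be measured for the real workload. */
#define MIN_PARALLEL_ITERS 100000

/* Stand-in for a cheap, independent per-iteration computation. */
static unsigned long cheap_work(unsigned long i) {
    return (i * 2654435761UL) % 1000UL;
}

unsigned long sum_work(long n) {
    assert(n >= 0);
    unsigned long total = 0;
    /* One parallel region for the whole loop. Each thread accumulates a
     * private partial sum that is combined once at the end (reduction).
     * The if() clause runs the loop serially when n is small. */
    #pragma omp parallel for reduction(+:total) schedule(static) if(n >= MIN_PARALLEL_ITERS)
    for (long i = 0; i < n; i++)
        total += cheap_work((unsigned long)i);
    return total;
}
```

When compiled without OpenMP support the pragma is ignored and the loop simply runs serially, which makes it easy to compare timings of the two builds.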

Tasks run in threads take longer than in serial?

So I'm doing some computation on 4 million nodes.
The very basic serial version just has a for loop which iterates 4 million times and does the computation each time. This takes roughly 1.2 sec.
When I split the for loop into, say, 4 for loops that each do 1/4 of the computation, the total time becomes 1.9 sec.
I guess there is some overhead in creating for loops, and maybe it has to do with the CPU liking to compute data in chunks.
What really bothers me is that when I put the 4 loops on 4 threads on an 8-core machine, each thread takes 0.9 seconds to finish.
I expect each of them to take only 1.9/4 seconds instead.
I don't think there is any race condition or synchronization issue, since all I do is have a for loop that creates the 4 threads, which takes 200 microseconds, and then a for loop that joins them.
The computation reads from a shared array and writes to a different shared array.
I am sure they are not writing to the same byte.
Where could the overhead come from?
main: ncores: number of cores. node_size: size of graph (4 million nodes)
for(i = 0 ; i < ncores ; i++){
    int *t = (int*)malloc(sizeof(int));
    *t = i;
    int iret = pthread_create( &thread[i], NULL, calculate_rank_p, (void*)(t));
}
for (i = 0; i < ncores; i++)
    pthread_join(thread[i], NULL);
calculate_rank_p: vector is the rank vector for page rank calculation
void *calculate_rank_p(void *argument) {
    int index = *(int*)argument;
    for(i = index; i < node_size ; i+=ncores)
        current_vector[i] = calc_r(i, vector);
    return NULL;
}
calc_r: this is just a page rank calculation using compressed row format.
double calc_r(int i, double *vector){
    double prank = 0;
    int j;
    for(j = row_ptr[i]; j < row_ptr[i+1]; j++){
        prank += vector[col_ind[j]] * val[j];
    }
    return prank;
}
Everything that is not declared is a global variable.
The computation reads from a shared array and writes to a different shared array. I am sure they are not writing to the same byte.
It's impossible to be sure without seeing relevant code and having some more details, but this sounds like it could be due to false sharing, or ...
the performance issue of false sharing (aka cache line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.
This looks like it could very well trigger false sharing, depending on the size of a vector (though there is still not enough information in the post to be sure, as we don't see how the various vectors are allocated).
for(i = index; i < node_size ; i+=ncores)
Instead of interleaving which core works on which data (i += ncores), give each of them a contiguous range of data to work on.
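A minimal sketch of that contiguous split, assuming the same `index`, `ncores` and `node_size` values as in the question. The helper name `block_range` is hypothetical. It divides `n` items among `nparts` workers as evenly as possible, so neighbouring elements of the output array are written by the same thread rather than by different ones.

```c
#include <assert.h>

/* Computes the half-open range [*start, *end) owned by worker `index`
 * when n items are divided among nparts workers as evenly as possible.
 * Hypothetical helper, not part of the original code. */
void block_range(int index, int nparts, int n, int *start, int *end) {
    assert(nparts > 0 && index >= 0 && index < nparts && n >= 0);
    /* 64-bit intermediate avoids overflow of n * index for large n. */
    *start = (int)((long long)n * index / nparts);
    *end = (int)((long long)n * (index + 1) / nparts);
}
```

In the thread function the loop would then read something like `block_range(index, ncores, node_size, &start, &end); for (int i = start; i < end; i++) current_vector[i] = calc_r(i, vector);`, with `i`, `start` and `end` declared as local variables so the threads do not share a loop counter.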
I saw the same surprise when building and running in Debug (with different test code, though).
In Release everything is as expected ;)

How to change the count of a pthread_barrier?

The problem is that we have to implement a kind of "running contest" using pthreads. After one track we have to wait until all runners/threads have reached this point, so we use a barrier for that.
But now we also have to implement the probability of injuries. So we wrote a function which sometimes reduces the number of runners and reinitializes the barrier with a smaller count. Now the problem is that the program does not always terminate. I guess the reason is that some of the threads are already waiting at the barrier, and after reinitializing it the required number never arrives.
The code for the simulation of the injury looks like this:
void simulateInjury(int number) {
    int totalRunners = 0;
    int i = 0;
    if (rand() % 10 < 1) {
        printf("Runner of Team %i injured!\n", number);
        for (i = 0; i < teams; i++) {
            totalRunners += standings.teamSize[i];
        }
        pthread_barrier_init(&barrier_track1, NULL, totalRunners);
        pthread_barrier_init(&barrier_track4[number], NULL, standings.teamSize[number]);
    }
}
Or is there maybe a way to just change the count argument of the barrier?
I see two errors:
You should not re-initialize a barrier while some thread is using it.
You should not execute the re-initialization of the barrier simultaneously in several threads.
For the first you can create a second barrier that you use in alternation with the first.
For the second you should use the return value of the wait function to designate one particular thread that will do the re-initialization.

MPI wrapper that imitates OpenMP's for-loop pragma

I am thinking about implementing a wrapper for MPI that imitates OpenMP's way
of parallelizing for loops.
begin_parallel_region( chunk_size=100 , num_proc=10 );
for( int i=0 ; i<1000 ; i++ )
{
    //some computation
}
end_parallel_region();
The code above distributes computation inside the for loop to 10 slave MPI processors.
Upon entering the parallel region, the chunk size and number of slave processors are provided.
Upon leaving the parallel region, the MPI processors are synched and are put idle.
EDITED in response to High Performance Mark.
I have no intention to simulate the OpenMP's shared memory model.
I propose this because I need it.
I am developing a library that is required to build graphs from mathematical functions.
In these mathematical functions, there often exist for loops like the one below.
for( int i=0 ; i<n ; i++ )
s = s + sin(x[i]);
So I want to first be able to distribute sin(x[i]) to slave processors and at the end reduce to the single variable, just like in OpenMP.
I was wondering if there is such a wrapper out there so that I don't have to reinvent the wheel.
There is no such wrapper out there which has escaped from the research labs into widespread use. What you propose is not so much re-inventing the wheel as inventing the flying car.
I can see how you propose to write MPI code which simulates OpenMP's approach to sharing the burden of loops. What is much less clear is how you propose to have MPI simulate OpenMP's shared memory model.
In a simple OpenMP program one might have, as you suggest, 10 threads each perform 10% of the iterations of a large loop, perhaps updating the values of a large (shared) data structure. To simulate that inside your cunning wrapper in MPI you'll either have to (i) persuade single-sided communications to behave like shared memory (this might be doable and will certainly be difficult) or (ii) distribute the data to all processes, have each process independently compute 10% of the results, then broadcast the results all-to-all so that at the end of execution each process has all the data that the others have.
Simulating shared memory computing on distributed memory hardware is a hot topic in parallel computing, always has been, always will be. Google for distributed shared memory computing and join the fun.
Well, if you've distributed x across processes then individual processes can compute sin(x[i]) and you can reduce the sum on to one process using MPI_Reduce.
I must be missing something about your requirements because I just can't see why you want to build any superstructure on top of what MPI already provides. Nevertheless, my answer to your original question remains No, there is no such wrapper as you seek and all the rest of my answer is mere commentary.
Yes, you could do this, for specific tasks. But you shouldn't.
Consider how you might implement this; the begin part would distribute the data, and the end part would bring the answer back:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>
typedef struct state_t {
    int globaln;
    int localn;
    int *locals;
    int *offsets;
    double *localin;
    double *localout;
    double (*map)(double);
} state;
state *begin_parallel_mapandsum(double *in, int n, double (*map)(double)) {
    state *s = malloc(sizeof(state));
    s->globaln = n;
    s->map = map;
    /* figure out decomposition */
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    s->locals = malloc(size * sizeof(int));
    s->offsets = malloc(size * sizeof(int));
    s->offsets[0] = 0;
    for (int i=0; i<size; i++) {
        s->locals[i] = (n+i)/size;
        if (i < size-1) s->offsets[i+1] = s->offsets[i] + s->locals[i];
    }
    /* allocate local arrays */
    s->localn = s->locals[rank];
    s->localin = malloc(s->localn*sizeof(double));
    s->localout = malloc(s->localn*sizeof(double));
    /* distribute from rank 0; the send buffer is ignored on other ranks */
    MPI_Scatterv( in, s->locals, s->offsets, MPI_DOUBLE,
                  s->localin, s->locals[rank], MPI_DOUBLE,
                  0, MPI_COMM_WORLD);
    return s;
}
double end_parallel_mapandsum(state **s) {
    double localanswer=0., answer;
    /* sum up local answers */
    for (int i=0; i<((*s)->localn); i++) {
        localanswer += ((*s)->localout)[i];
    }
    /* and get global result. Everyone gets answer */
    MPI_Allreduce(&localanswer, &answer, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    free( (*s)->localin );
    free( (*s)->localout );
    free( (*s)->locals );
    free( (*s)->offsets );
    free( (*s) );
    return answer;
}
int main(int argc, char **argv) {
    int rank;
    double *inputs = NULL;   /* only allocated on rank 0 */
    double result;
    int n=100;
    const double pi=4.*atan(1.);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        inputs = malloc(n * sizeof(double));
        for (int i=0; i<n; i++) {
            inputs[i] = 2.*pi/n*i;
        }
    }
    state *s=begin_parallel_mapandsum(inputs, n, sin);
    /* the "parallel region": each rank maps its own chunk */
    for (int i=0; i<s->localn; i++) {
        s->localout[i] = (s->map)(s->localin[i]);
    }
    result = end_parallel_mapandsum(&s);
    if (rank == 0) {
        printf("Calculated result: %lf\n", result);
        double trueresult = 0.;
        for (int i=0; i<n; i++) trueresult += sin(inputs[i]);
        printf("True result: %lf\n", trueresult);
        free(inputs);
    }
    MPI_Finalize();
    return 0;
}
That constant distribute/gather is a terrible communications burden to sum up a few numbers, and is antithetical to the entire distributed-memory computing model.
To a first approximation, shared memory approaches - OpenMP, pthreads, IPP, what have you - are about scaling computations faster; about throwing more processors at the same chunk of memory. On the other hand, distributed-memory computing is about scaling a computation bigger; about using more resources, particularly memory, than can be found on a single computer. The big win of using MPI is when you're dealing with problem sets which can't fit in any one node's memory, ever. So when doing distributed-memory computing, you avoid having all the data in any one place.
It's important to keep that basic approach in mind even when you are just using MPI on-node to use all the processors. The above scatter/gather approach will just kill performance. The more idiomatic distributed-memory computing approach is for the logic of the program to already have distributed the data - that is, your begin_parallel_region and end_parallel_region above would have already been built into the code above your loop at the very beginning. Then, every loop is just
for( int i=0 ; i<localn ; i++ )
s = s + sin(x[i]);
and when you need to exchange data between tasks (or reduce a result, or what have you) then you call the MPI functions to do those specific tasks.
Is MPI a must, or are you just trying to run your OpenMP-like code on a cluster? In the latter case, I suggest you take a look at Intel's Cluster OpenMP:

problems when creating many plans and executing plans

I am a little confused about creating a many-plan by calling fftwf_plan_many_dft_r2c() and executing it with OpenMP. What I am trying to achieve here is to see whether explicitly using OpenMP and organizing the FFTW data myself could work together. (I know I "should" use the multithreaded version of FFTW, but I failed to get the expected speedup from it.)
My code looks like this:
/* I ignore some helper APIs */
#define N 1024*1024 //N is the total size of 1d fft
fftwf_plan p;
float * in;
fftwf_complex *out;
omp_set_num_threads(threadNum); // Suppose threadNum is 2 here
in = fftwf_alloc_real(2*(N/2+1));
std::fill(in,in+2*(N/2+1),1.1f); // just try with a random real floating numbers
out = (fftwf_complex *)&in[0]; // for in-place transformation
/* Problems start from here */
int n[] = {N/threadNum}; // according to the manual, n is the size of each "howmany" transformation
p = fftwf_plan_many_dft_r2c(1, n, threadNum, in, NULL,1 ,1, out, NULL, 1, 1, FFTW_ESTIMATE);
#pragma omp parallel for
for (int i = 0; i < threadNum; i ++)
{
    fftwf_execute(p);
    // fftwf_execute_dft_r2c(p,in+i*N/threadNum,out+i*N/threadNum);
}
What I got is like this:
If I use fftwf_execute(p), the program executes successfully, but the result seems not correct. ( I compare the result with the version of not using many_plan and openmp )
If I use fftwf_execute_dft_r2c(), I got segmentation fault.
Can somebody help me here? How should I partition the data across multiple threads? Or is this approach not correct in the first place?
Thank you in advance.
Do you properly allocate memory for out? Does this:
out = (fftwf_complex *)&in[0]; // for in-place transformation
do the same as this:
out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfOutputColumns);
You are accessing 'p' inside your parallel block without explicitly telling OpenMP how to treat it. Variables declared outside the parallel region are shared by default, so this is optional, but you can make it explicit:
pragma omp parallel for shared(p)
If you are going to split the work up for n threads, I would think you'd explicitly want to tell omp to use n threads:
pragma omp parallel for shared(p) num_threads(n)
Does this code work without multithreading? If you removed the for loop and openMP call and executed fftwf_execute(p) just once does it work?
I don't know much about FFTW's plans for many, but it seems like p is really many plans, not one single plan. So, when you "execute" p, you are executing all plans at once, right? You don't really need to iteratively execute p.
I'm still learning about OpenMP + FFTW, so I could be wrong on these. Stack Overflow doesn't like it when I put a # in front of pragma, but you need one.
