So this is just a basic program that reads values from a file and then calculates for each value the amount of fuel it would take. f = (m/4)-3. I have it working but I feel I would have a better timing if more threads could be doing some of these things at the same time.
Currently without that critical section, I am getting incorrect or changing results. I am wondering if theres a way I can reduce the amount in the critical section or otherwise optimize this section more. Thanks in advance for the help!
#pragma omp parallel for reduction(+:fuelUnits) schedule(dynamic)
for (int j = 0; j < count; j++) {
#pragma omp critical
{
i = arr[j];
//printf("i = %d\n", i);
mass += i;
printf("i = %d\n", i);
fuel2 = fuel(i);
}
fuelUnits += fuel2;
printf("fuel is %d\n", fuel2);
}
What happens if you remove mass+=i? Does it need to be included in your reduction operation: (ex: reduction(+:fuelUnits, mass)?
This should work without the critical section:
// somewhere globally:
int fuelUnits = 0;
// your fuel function here:
int fuel(int m) {
return (m/4)-3
}
-- snip --
#pragma omp parallel for reduction(+:fuelUnits) schedule(dynamic)
for (int j = 0; j < count; j++)) {
int i = arr[j];
fuelUnits += fuel(i);
}
Related
I made this parallel code to share the iterations like first and last, fisrst+1 and last-1,... But I don't know how to improve the code in every one of the two parallel sections because I have an inner loop in the sections and I can't think of any way to simplify it, thanks.
This isn't about which values are stored in x or y, I use this sections design because the requisite is execute the iterations from 0 to N like: 0 N, 1 N-1, 2 N-2 but I would like to know if I can optimize the inner loops maintaining this model
int x = 0, y = 0,k,i,j,h;
#pragma omp parallel private(i, h) reduction(+:x, y)
{
#pragma omp sections
{
#pragma omp section
{
for (i=0; i<N/2; i++)
{
C[i] = 0;
for (j=0; j<N; j++)
{
C[i] += MAT[i][j] * B[j];
}
x += C[i];
}
}
#pragma omp section
{
for (h=N-1; h>=N/2; h--)
{
C[h] = 0;
for (k=0; k<N; k++)
{
C[h] += MAT[h][k] * B[k];
}
y += C[h];
}
}
}
}
x = x + y;
Using sections seems like the wrong approach. A pragma omp for seems more appropriate. Also note that you forgot to declare j private.
int x = 0, y = 0,k,i,j;
#pragma omp parallel private(i,j) reduction(+:x, y)
{
# pragma omp for nowait
for(i=0; i<N/2; i++) {
// local variable to make the life easier on the compiler
int ci = 0;
for(j=0; j<N; j++)
ci += MAT[i][j] * B[j];
x += ci;
C[i] = ci;
}
# pragma omp for nowait
for(i=N/2; i < N; i++) {
int ci = 0;
for(j=0; j<N; j++)
ci += MAT[i][j] * B[j];
y += ci;
C[i] = ci;
}
}
x = x + y;
Also, I'm not sure but if you just want x as your final output, you can simplify the code even further:
int x=0, i, j;
#pragma omp parallel for reduction(+:x) private(i,j)
for(i=0; i < N; ++i)
for(j=0; j < N; ++j)
x += MAT[i][j] * B[j];
The section construct is to distribute different tasks to different threads and each section block marks a different task so you will not be able to do that iterations in the order you want I answered you here:
Distribution of loop iterations between threads with a specific order
But I want to clarify that the requirement to use sections is that each block must be independent of the other blocks.
A section gets only one thread, so you can't make the loops parallel. How about
Make a parallel loop to N at the top level,
then inside each iteration use a conditional to decide whether to accumulate into x,y?
Although #Homer512 's solution looks correct to me too.
Unable to reduce the execution time of multiple FFTs using OpenMP.
Tried parallelizing the outermost loop, but thsi degraded the performance
typedef struct{float r; float i;}cmplx_f32_t;
double src[2*128];
double dst[2*128];
double w[128];
cmplx_f32_t data[128][4][256];
cffti(128, w);
for (k = 0; k < 128; k++)
{
for (j = 0; j < 4; j++)
{
for (i = 0; i < 2*32; i++)
{
src[i] = data[i/2][j][k].r;
src[i+1] = data[i/2][j][k].i;
}
cfft2(128, src, dst, w, 1);
}
}
cffti and cfft2 and as given in the example at https://people.sc.fsu.edu/~jburkardt/c_src/fft_openmp/fft_openmp.html
If I disable the #pragma omp directives from the fft_openmp.c files, the run time is about 11ms. If we use #pragma omp, the total execution time is about 220 ms
The following function reads data from a file in loops and processes each loaded chunk at a time. To speed up this process, I thought to use openmp in the for loop so that this job is divided between the threads as the following:
void read_process(FILE *fp_read, double *centroids, int total) {
int i, j, c, dim = 16, chunk_size = 10000, num_itr;
double *buffer = calloc(total * dim, sizeof(double));
num_itr = total / chunk_size;
for (c = 0; c < total; ++c) {
fread(buffer, sizeof(double), chunk_size * dim, fp_read);
#pragma omp parallel private(i, j)
{
#pragma omp for
for (i = 0; i < chunk_size; i++) {
for (j = 0; j < dim; j++) {
#pragma omp atomic update
centroids[j] += buffer[i * dim + j];
}
}
}
}
free(buffer);
fclose(fp_read);
}
Without using openmp, my code works fine. However, adding #pragma section causes the code to stop and show the word Hangup in the terminal without further explanation of what was it hanged for. Some folks in StackOverflow answered other issues related to this error message that it is probably because of race condition but I think it won't be the case here because I am using atomic which serializes the access of the buffer. Am I right? Do you guys see an issue with my code? How can I enhance this code?
Thank you very much.
What you want to do is an array reduction. If you have a compiler that supports OpenMP 4.5 then you don't need to change your serial code. You can do
#pragma omp parallel for private (j) reduction(+:centroids[:dim])
for(i=0; i <chunck_size; i++) {
for(j=0; j < dim; j++) {
centroids[j] += buffer[i*dim +j];
}
}
Otherwise you can do the array reduction by hand. Here is one solution
#pragma omp parallel private(j)
{
double tmp[dim] = {0};
#pragma omp for
for(i=0; i < chunck_size; i++) {
for(j=0; j < dim; j++) {
tmp[j] += buffer[i*dim +j];
}
}
#pragma omp critical
for(int i=0; i < dim; i++) centroids[i] += tmp[i];
}
Your current solution is causing massive false sharing as each thread is writing to the same cache line. Both of the solutions above fix this problem by making private versions of centroid for each thread.
As long as dim << chunck_size then these are good solutions.
Why is the parallel application taking more time to execute than the one with the single thread? I am using an 8 CPU computer with Ubuntu 14.04. The code is just my simple way to test omp parallel sections, the aim later is to run two different functions in two different threads at the same time, so I do not want to use #pragma omp parallel for.
parallel:
int main()
{
int k = 0;
int m = 0;
omp_set_num_threads(2);
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
for( k = 0; k < 1e9; k++ ){};
}
#pragma omp section
{
for( m = 0; m < 1e9; m++ ){};
}
}
}
return 0;
}
and the single thread:
int main()
{
int m = 0;
int k = 0;
for( k = 0; k < 1e9; k++ ){};
for( m = 0; m < 1e9; m++ ){};
return 0;
}
If the compiler would not optimise the loops, then the parallel code would suffer from false sharing because m and k are very likely to end up in the same cache line. Make the variables private:
#pragma omp parallel private(k,m)
{
#pragma omp sections
{
#pragma omp section
{
for( k = 0; k < 1e9; k++ ){};
}
#pragma omp section
{
for( m = 0; m < 1e9; m++ ){};
}
}
}
At high optimisation levels, the compiler could drop the loops altogether. But then the parallel version will still have the added overhead of spawning the OpenMP worker threads and joining them afterwards, which will make it slower than the sequential version.
In above test code compiler itself optimizing the code. You need to change your test code. Depending on number of thread you are creating also add an overhead.
Also refer, Amdahl’s Law.
Disclaimer: following example is just an dummy example to quickly understand the problem. If you are thinking about real world problem, think anything dynamic programming.
The problem:
We have an n*m matrix, and we want to copy elements from previous row as in the following code:
for (i = 1; i < n; i++)
for (j = 0; j < m; j++)
x[i][j] = x[i-1][j];
Approach:
Outer loop iterations have to be executed in order, they would be executed sequentially.
Inner loop can be parallelized. We would want to minimize overhead of creating and killing threads, so we would want to create team of threads just once, however, this seems like an impossible task in OpenMP.
#pragma omp parallel private(j)
{
for (i = 1; i < n; i++)
{
#pragma omp for scheduled(dynamic)
for (j = 0; j < m; j++)
x[i][j] = x[i-1][j];
}
}
When we apply ordered option on the outer loop, the code will be executed sequential way, so there will be no performance gain.
I am looking to solution for the scenario above, even if I had to use some workaround.
I am adding my actual code. This is is actually slower than seq. version. Please review:
/* load input */
for (i = 1; i <= n; i++)
scanf ("%d %d", &in[i][W], &in[i][V]);
/* init */
for (i = 0; i <= wc; i++)
a[0][i] = 0;
/* compute */
#pragma omp parallel private(i,w)
{
for(i = 1; i <= n; ++i) // 1 000 000
{
j=i%2;
jn = j == 1 ? 0 : 1;
#pragma omp for
for(w = 0; w <= in[i][W]; w++) // 1000
a[j][w] = a[jn][w];
#pragma omp for
for(w = in[i][W]+1; w <= wc; w++) // 350 000
a[j][w] = max(a[jn][w], in[i][V] + a[jn][w-in[i][W]]);
}
}
As for measuring, I am using something like this:
double t;
t = omp_get_wtime();
// ...
t = omp_get_wtime() - t;
To sum up the parallelization in OpenMP for this particular case: It is not worth it.
Why?
Operations in the inner loops are simple. Code was compiled with -O3, so max() call was probably substituted with the code of function body.
Overhead in implicit barrier is probably high enough, to compensate the performance gain, and overall overhead is high enough to make the parallel code even slower than the sequential code was.
I also found out, there is no real performance gain in such construct:
#pragma omp parallel private(i,j)
{
for (i = 1; i < n; i++)
{
#pragma omp for
for (j = 0; j < m; j++)
x[i][j] = x[i-1][j];
}
}
because it's performance is similar to this one
for (i = 1; i < n; i++)
{
#pragma omp parallel for private(j)
for (j = 0; j < m; j++)
x[i][j] = x[i-1][j];
}
thanks to built-in thread reusing in GCC libgomp, according to this article: http://bisqwit.iki.fi/story/howto/openmp/
Since the outer loop cannot be paralellized (without ordered option) it looks there is no way to significantly improve performance of the program in question using OpenMP. If someone feels I did something wrong, and it is possible, I'll be glad to see and test the solution.