I want to parallelize the for loops and I can't seem to grasp the concept, every time I try to parallelize them it still works but it slows down dramatically.
for(i=0; i<nbodies; ++i){
for(j=i+1; j<nbodies; ++j) {
d2 = 0.0;
for(k=0; k<3; ++k) {
rij[k] = pos[i][k] - pos[j][k];
d2 += rij[k]*rij[k];
if (d2 <= cut2) {
d = sqrt(d2);
d3 = d*d2;
for(k=0; k<3; ++k) {
double f = -rij[k]/d3;
forces[i][k] += f;
forces[j][k] -= f;
}
ene += -1.0/d;
}
}
}
}
I tried using synchronization with barrier and critical in some cases but nothing happens or the processing simply does not end.
Update, this is the state I'm at right now. Working without crashes but calculation times worsen the more threads I add. (Ryzen 5 2600 6/12)
#pragma omp parallel shared(d,d2,d3,nbodies,rij,pos,cut2,forces) private(i,j,k) num_threads(n)
{
clock_t begin = clock();
#pragma omp for schedule(auto)
for(i=0; i<nbodies; ++i){
for(j=i+1; j<nbodies; ++j) {
d2 = 0.0;
for(k=0; k<3; ++k) {
rij[k] = pos[i][k] - pos[j][k];
d2 += rij[k]*rij[k];
}
if (d2 <= cut2) {
d = sqrt(d2);
d3 = d*d2;
#pragma omp parallel for shared(d3) private(k) schedule(auto) num_threads(n)
for(k=0; k<3; ++k) {
double f = -rij[k]/d3;
#pragma omp atomic
forces[i][k] += f;
#pragma omp atomic
forces[j][k] -= f;
}
ene += -1.0/d;
}
}
}
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
#pragma omp single
printf("Calculation time %lf sec\n",time_spent);
}
I incorporated the timer in the actual parallel code (I think it is some milliseconds faster this way). Also I think I got most of the shared and private variables right. In the file it outputs the forces.
Using barriers or other synchronizations will slow down your code, if the amount of unsynchronized work is not larger by a good factor. That is not the case with you. You probably need to reformulate your code to remove synchronization.
You are doing something like an N-body simulation. I've worked out a couple of solutions here: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-examples.html#N-bodyproblems
Also: your d2 loop is a reduction, so you can treat it like that, but it is probably enough if that variable is private to the i,j iterations.
You should always define your variables in their minimal required scope, especially if performance is an issue. (Note that if you do so your compiler can create more efficient code). Besides performance it also helps to avoid data race.
I think you have misplaced a curly brace and the condition in the first for loop should be i<nbodies-1. Variable ene can be summed up using reduction and to avoid data race atomic operations have to be used to increase array forces, so you do not need to use slow barriers or critical sections. Your code should look something like this (assuming int for indices and double for calculation):
#pragma omp parallel for reduction(+:ene)
for(int i=0; i<nbodies-1; ++i){
for(int j=i+1; j<nbodies; ++j) {
double d2 = 0.0;
double rij[3];
for(int k=0; k<3; ++k) {
rij[k] = pos[i][k] - pos[j][k];
d2 += rij[k]*rij[k];
}
if (d2 <= cut2) {
double d = sqrt(d2);
double d3 = d*d2;
for(int k=0; k<3; ++k) {
double f = -rij[k]/d3;
#pragma omp atomic
forces[i][k] += f;
#pragma omp atomic
forces[j][k] -= f;
}
ene += -1.0/d;
}
}
}
}
Solved, turns out all I needed was
#pragma omp parallel for nowait
Doesn't need the "atomic" either.
Weird solution, I don't fully understand how it works but it does also the output file has 0 corrupt results whatsoever.
Related
I am trying to optimize a C program using openMP. I am able to parallelize the simple for loop like
#pragma omp parallel
#pragma omp for
for (i = 0; i < size; i++)
{
Y[i] = i * 0.3;
Z[i] = -i * 0.4;
}
Where X, Y and Z are float* of size "size" This works fine, but there is another loop immediately after it:
for (i = 0; i < size; i++)
{
X[i] += Z[i] * Y[i] * Y[i] * 10.0;
sum += X[i];
}
printf("Sum =%d\n",sum);
I am not sure how to parallelize the above for the loop. I am compiling the program with the command gcc -fopenmp filename and running the executable ./a.out I hope that is enough to reflect performance improvement.
I added #pragma omp parallel for reduction(+ \ : sum) at top of the second loop, it indeed is running faster and producing correct output. Need expert input to parallelize the above and avoid false sharing? Is the above directive correct or any better alternative to parallelize and make it faster?
Lets say I have the following code:
#pragma omp parallel for
for (i = 0; i < array.size; i++ ) {
int temp = array[i];
for (p = 0; p < array2.size; p++) {
array2[p] = array2[p] + temp;
How can I divide the array2.size between the threads that I call when I do the #pragma omp parallel for in the first line? For what I understood when I do the #pragma omp parallel for I'll spawn several threads in such a way that each thread will have a part of the array.size so that the i will never be the same between threads. But in this case I also want those same threads to have a different part of the array2.size (their p will also never be the same between them) so that I dont have all the threads doing the same calculation in the second for.
I've tried the collapse notation but it seems that this is only used for perfect for statements since I couldn't get the result I wanted.
Any help is appreciated! Thanks in advance
The problem with your code is that multiple threads will try to modify array2 at the same time (race condition). This can easily be avoided by reordering the loops. If array2.size doesn't provide enough parallelism, you may apply the collapse clause, as the loops are now in canonical form.
#pragma omp parallel for
for (p = 0; p < array2.size; p++) {
for (i = 0; i < array.size; i++ ) {
array2[p] += array[i];
}
}
You shouldn't expect too much of this though as the ratio between loads/stores and computation is very bad. This is without a doubt memory-bound and not compute-bound.
EDIT: If this is really your problem and not just a minimal example, I would also try the following:
#pragma omp parallel
{
double sum = 0.;
#pragma omp for reduction(+: sum)
for (i = 0; i < array.size; i++) {
sum += array[i];
}
#pragma omp for
for (p = 0; p < array2.size; p++) {
array2[p] += sum;
}
}
I've implemented a version of the Travelling Salesman with xmmintrin.h SSE instructions, received a decent speedup. But now I'm also trying to implement OpenMP threading on top of it, and I'm seeing a pretty drastic slow down. I'm getting the correct answer in both cases (i.e. (i) with SSE only, or (ii) with SSE && OpenMP).
I know I am probably doing something wildly wrong, and maybe someone much more experienced than me can spot the issue.
The main loop of my program has the following (brief) pseudocode:
int currentNode;
for(int i = 0; i < numNodes; i++) {
minimumDistance = DBL_MAX;
minimumDistanceNode;
for(int j = 0; j < numNodes; j++) {
// find distance between 'currentNode' to j-th node
// ...
if(jthNodeDistance < minimumDistance) {
minimumDistance = jthNodeDistance;
minimumDistanceNode = jthNode;
}
}
currentNode = minimumDistanceNode;
}
And here is my implementation, that is still semi-pseudocode as I've still brushed over some parts that I don't think have an impact on performance, I think the issues to be found with my code can be found in the following code snippet. If you just omit the #pragma lines, then the following is pretty much identical to the SSE only version of the same program, so I figure I should only include the OpenMP version:
int currentNode = 0;
#pragma omp parallel
{
#pragma omp single
{
for (int i = 1; i < totalNum; i++) {
miniumum = DBL_MAX;
__m128 currentNodeX = _mm_set1_ps(xCoordinates[currentNode]);
__m128 currentNodeY = _mm_set1_ps(yCoordinates[currentNode]);
#pragma omp parallel num_threads(omp_get_max_threads())
{
float localMinimum = DBL_MAX;
float localMinimumNode;
#pragma omp for
for (int j = 0; j < loopEnd; j += 4) {
// a number of SSE vector calculations to find distance
// between the current node and the four nodes we're looking
// at in this iteration of the loop:
__m128 subXs_0 = _mm_sub_ps(currentNodeX, _mm_load_ps(&xCoordinates[j]));
__m128 squareSubXs_0 = _mm_mul_ps(subXs_0, subXs_0);
__m128 subYs_0 = _mm_sub_ps(currentNodeY, _mm_load_ps(&yCoordinates[j]));
__m128 squareSubYs_0 = _mm_mul_ps(subYs_0, subYs_0);
__m128 addXY_0 = _mm_add_ps(squareSubXs_0, squareSubYs_0);
float temp[unroll];
_mm_store_ps(&temp[0], addXY_0);
// skipping stuff here that is about getting the minimum distance and
// it's equivalent node, don't think it's massively relevant but
// each thread will have its own
// localMinimum
// localMinimumNode
}
// updating the global minimumNode in a thread-safe way
#pragma omp critical (update_minimum)
{
if (localMinimum < minimum) {
minimum = localMinimum;
minimumNode = localMinimumNode;
}
}
}
// within the 'omp single'
ThisPt = minimumNode;
}
}
}
So my logic is:
omp single for the top-level for(int i) for loop, and I only want 1 thread dedicated to this
omp parallel num_threads(omp_get_max_threads()) for the inner for(int j) for-loop, as I want all cores working on this part of the code at the same time.
omp critical at the end of the full for(int j) loop, as I want to thread-safely update the current node.
In terms of run-time, the OpenMP version is typically twice as slow as the SSE-only version.
Does anything jump out at you as particularly bad in my code, that is causing this drastic slow-down for OpenMP?
Does anything jump out at you as particularly bad in my code, that is
causing this drastic slow-down for OpenMP?
First:
omp single for the top-level for(int i) for loop, and I only want 1
thread dedicated to this
In your code you have the following:
#pragma omp parallel
{
#pragma omp single
{
for (int i = 1; i < totalNum; i++)
{
#pragma omp parallel num_threads(omp_get_max_threads())
{
//....
}
// within the 'omp single'
ThisPt = minimumNode;
}
}
}
The #pragma omp parallel creates a team of threads, but then only one thread executes a parallel task (i.e., #pragma omp single) while the other threads don't do anything. You can simplified to:
for (int i = 1; i < totalNum; i++)
{
#pragma omp parallel num_threads(omp_get_max_threads())
{
//....
}
ThisPt = minimumNode;
}
The inner only is still executed by only one thread.
Second :
omp parallel num_threads(omp_get_max_threads()) for the inner for(int
j) for-loop, as I want all cores working on this part of the code at
the same time.
The problem is that this might return the number of logic-cores and not physical cores, and some codes might perform worse with hyper-threading. So, I would first test with a different number of threads, starting from 2, 4 and so on, until you find a number to which the code stops scaling.
omp critical at the end of the full for(int j) loop, as I want to
thread-safely update the current node.
// updating the global minimumNode in a thread-safe way
#pragma omp critical (update_minimum)
{
if (localMinimum < minimum) {
minimum = localMinimum;
minimumNode = localMinimumNode;
}
}
this can be replaced by creating an array where each thread save its local minimum in a position reserved to that thread, and outside the parallel region the initial thread extract the minimum and minimumNode:
int total_threads = /..;
float localMinimum[total_threads] = {DBL_MAX};
float localMinimumNode[total_threads] = {DBL_MAX};
#pragma omp parallel num_threads(total_threads)
{
/...
}
for(int i = 0; i < total_threads; i++){
if (localMinimum[i] < minimum) {
minimum = localMinimum[i];
minimumNode = localMinimumNode[i];
}
}
Finally, after those changes are done, you try to check if it is possible to replace this parallelization by the following:
#pragma omp parallel for
for (int i = 1; i < totalNum; i++)
{
...
}
I am trying to understand how OMP treats different for loop declarations. I have:
int main()
{
int i, A[10000]={...};
double ave = 0.0;
#pragma omp parallel for reduction(+:ave)
for(i=0;i<10000;i++){
ave += A[i];
}
ave /= 10000;
printf("Average value = %0.4f\n",ave);
return 0;
}
where {...} are the numbers form 1 to 10000. This code prints the correct value. If instead of #pragma omp parallel for reduction(+:ave) I use #pragma omp parallel for private(ave) the result of the printf is 0.0000. I think I understand what reduction(oper:list) does, but was wondering if it can be substituted for private and how.
So yes, you can do reductions without the reduction clause. But that has a few downsides that you have to understand:
You have to do things by hand, which is more error-prone:
declare local variables to store local accumulations;
initialize them correctly;
accumulate into them;
do the final reduction in the initial variable using a critical construct.
This is harder to understand and maintain
This is potentially less effective...
Anyway, here is an example of that using your code:
int main() {
int i, A[10000]={...};
double ave = 0.0;
double localAve;
#pragma omp parallel private( i, localAve )
{
localAve = 0;
#pragma omp for
for( i = 0; i < 10000; i++ ) {
localAve += A[i];
}
#pragma omp critical
ave += localAve;
}
ave /= 10000;
printf("Average value = %0.4f\n",ave);
return 0;
}
This is a classical method for doing reductions by hand, but notice that the variable that would have been declared reduction isn't declared private here. What becomes private is a local substitute of this variable while the global one must remain shared.
This question already has answers here:
Why OpenMP under ubuntu 12.04 is slower than serial version
(3 answers)
Parallelizing matrix times a vector by columns and by rows with OpenMP
(2 answers)
Closed 8 years ago.
I'm trying to write Matrix by vector multiplication in C (OpenMP)
but my program slows when I add processors...
1 proc - 1,3 s
2 proc - 2,6 s
4 proc - 5,47 s
I tested this on my PC (core i5) and our school's cluster and the result is the same (program slows)
here is my code (matrix is 10000 x 10000) and vector is 10000:
double start_time = clock();
#pragma omp parallel private(i) num_threads(4)
{
tid = omp_get_thread_num();
world_size = omp_get_num_threads();
printf("Threads: %d\n",world_size);
for(y = 0; y < matrix_size ; y++){
#pragma omp parallel for private(i) shared(results, vector, matrix)
for(i = 0; i < matrix_size; i++){
results[y] = results[y] + vector[i]*matrix[i][y];
}
}
}
double end_time = clock();
double result_time = (end_time - start_time) / CLOCKS_PER_SEC;
printf("Time: %f\n", result_time);
My question is: is there any mistake? For me it seems pretty straightforward and should speed up
I essentially already answer this question parallelizing-matrix-times-a-vector-by-columns-and-by-rows-with-openmp.
You have a race condition when you write to results[y]. To fix this, and still parallelize the inner loop, you have to make private versions of results[y], fill them in parallel, and then merge them in a critical section.
In the code below I assume you're using double, replace it with float or int or whatever datatype you're using (note that your inner loop goes over the first index of matrix[i][y] which is cache unfriendly).
#pragma omp parallel num_threads(4)
{
int y,i;
double* results_private = (double*)calloc(matrix_size, sizeof(double));
for(y = 0; y < matrix_size ; y++) {
#pragma omp for
for(i = 0; i < matrix_size; i++) {
results_private[y] += vector[i]*matrix[i][y];
}
}
#pragma omp critical
{
for(y=0; y<matrix_size; y++) results[y] += results_private[y];
}
free(results_private);
}
If this is homework assignment and you want to really impress your instructor then it's possible to do the merging without a critical section. See this link to get an idea on what to do fill-histograms-array-reduction-in-parallel-with-openmp-without-using-a-critic though I can't promise it will be faster.
I've not done any parallel programming for a while now, nor any Maths for that matter, but don't you want to split the rows of the matrix in parallel, rather than the columns?
What happens if you try this:
double start_time = clock();
#pragma omp parallel private(i) num_threads(4)
{
tid = omp_get_thread_num();
world_size = omp_get_num_threads();
printf("Threads: %d\n",world_size);
#pragma omp parallel for private(y) shared(results, vector, matrix)
for(y = 0; y < matrix_size ; y++){
for(i = 0; i < matrix_size; i++){
results[y] = results[y] + vector[i]*matrix[i][y];
}
}
}
double end_time = clock();
double result_time = (end_time - start_time) / CLOCKS_PER_SEC;
printf("Time: %f\n", result_time);
Also, are you sure everything's OK compiling and linking with openMP?
You have a typical case of cache conflicts.
Consider that a cache line on your CPU is probably 64 bytes long. Having one processor/core write to the first 4 bytes (float) causes that cache line to be invalidated on every other L1/L2 and maybe L3. This is a lot of overhead.
Partition your data better!
#pragma omp parallel for private(i) shared(results, vector, matrix) schedule(static,16)
should do the trick. Increase the chunksize if this does not help.
Another optimisation is to store the result locally before you flush it down to memory.
Also, this is an OpenMP thing, but you don't need to start a new parallel region for the loop (each mention of parallel starts a new team):
#pragma omp parallel default(none) \
shared(vector, matrix) \
firstprivate(matrix_size) \
num_threads(4)
{
int i, y;
#pragma omp for schedule(static,16)
for(y = 0; y < matrix_size ; y++){
double result = 0;
for(i = 0; i < matrix_size; i++){
results += vector[i]*matrix[i][y];
}
result[y] = result;
}
}