I'm trying to write matrix-by-vector multiplication in C (OpenMP),
but my program slows down when I add processors:
1 proc - 1,3 s
2 proc - 2,6 s
4 proc - 5,47 s
I tested this on my PC (Core i5) and on our school's cluster, and the result is the same (the program slows down).
Here is my code (the matrix is 10000 x 10000 and the vector has 10000 elements):
double start_time = clock();
#pragma omp parallel private(i) num_threads(4)
{
tid = omp_get_thread_num();
world_size = omp_get_num_threads();
printf("Threads: %d\n",world_size);
for(y = 0; y < matrix_size ; y++){
#pragma omp parallel for private(i) shared(results, vector, matrix)
for(i = 0; i < matrix_size; i++){
results[y] = results[y] + vector[i]*matrix[i][y];
}
}
}
double end_time = clock();
double result_time = (end_time - start_time) / CLOCKS_PER_SEC;
printf("Time: %f\n", result_time);
My question is: is there any mistake? To me it seems pretty straightforward and it should speed up.
I essentially already answered this question in parallelizing-matrix-times-a-vector-by-columns-and-by-rows-with-openmp.
You have a race condition when you write to results[y]. To fix this, and still parallelize the inner loop, you have to make private versions of results[y], fill them in parallel, and then merge them in a critical section.
In the code below I assume you're using double; replace it with float, int, or whatever datatype you're using. (Note that your inner loop goes over the first index of matrix[i][y], which is cache-unfriendly.)
#pragma omp parallel num_threads(4)
{
int y,i;
double* results_private = (double*)calloc(matrix_size, sizeof(double));
for(y = 0; y < matrix_size ; y++) {
#pragma omp for
for(i = 0; i < matrix_size; i++) {
results_private[y] += vector[i]*matrix[i][y];
}
}
#pragma omp critical
{
for(y=0; y<matrix_size; y++) results[y] += results_private[y];
}
free(results_private);
}
If this is a homework assignment and you really want to impress your instructor, it's possible to do the merging without a critical section. See this link to get an idea of what to do: fill-histograms-array-reduction-in-parallel-with-openmp-without-using-a-critic. I can't promise it will be faster, though.
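One way to merge without a critical section, different from the approach in that link, is to let OpenMP do the merge with an array-section reduction. The following is only a sketch under some assumptions: it needs a compiler supporting OpenMP 4.5 or later, the function name matvec_by_columns is made up for illustration, and the matrix is stored as one flat row-major array instead of matrix[i][y]. It parallelizes the loop over i, so each thread walks the matrix row by row.

```c
#include <assert.h>
#include <stddef.h>

/* Computes results[y] += sum over i of vector[i] * matrix[i*n + y].
 * Assumption: matrix is a flat row-major n*n array (the question
 * uses matrix[i][y]); this keeps the sketch self-contained. */
void matvec_by_columns(size_t n, const double *matrix,
                       const double *vector, double *results)
{
    assert(matrix && vector && results);
    /* Each thread gets a private, zero-initialized copy of
     * results[0..n); OpenMP adds the copies into the original
     * array when the loop ends, so no critical section is needed. */
    #pragma omp parallel for reduction(+:results[:n])
    for (size_t i = 0; i < n; i++) {
        for (size_t y = 0; y < n; y++) {
            results[y] += vector[i] * matrix[i * n + y];
        }
    }
}
```

Since every thread gets its own copy of results, memory use grows with the number of threads, and whether this beats the critical-section version is something to measure rather than assume.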
I've not done any parallel programming for a while now, nor any Maths for that matter, but don't you want to split the rows of the matrix in parallel, rather than the columns?
What happens if you try this:
double start_time = clock();
#pragma omp parallel private(i) num_threads(4)
{
tid = omp_get_thread_num();
world_size = omp_get_num_threads();
printf("Threads: %d\n",world_size);
#pragma omp for
for(y = 0; y < matrix_size ; y++){
for(i = 0; i < matrix_size; i++){
results[y] = results[y] + vector[i]*matrix[i][y];
}
}
}
double end_time = clock();
double result_time = (end_time - start_time) / CLOCKS_PER_SEC;
printf("Time: %f\n", result_time);
Also, are you sure everything's OK compiling and linking with openMP?
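A quick way to check that, sketched here with a made-up helper name count_team_threads: the _OPENMP macro is only defined when the compiler was really run with OpenMP enabled (for example gcc -fopenmp), and omp_get_num_threads() reports how many threads the parallel region actually got.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the number of threads that ran the parallel region,
 * or 1 if the file was compiled without OpenMP support. */
int count_team_threads(void)
{
    int threads = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* one thread of the team records the team size */
        #pragma omp single
        threads = omp_get_num_threads();
    }
    printf("OpenMP %d enabled, %d thread(s)\n", _OPENMP, threads);
#else
    printf("compiled without OpenMP; the pragmas are ignored\n");
#endif
    assert(threads >= 1);
    return threads;
}
```

If this prints the "compiled without OpenMP" line, or reports a single thread when more were expected, the build flags or OMP_NUM_THREADS are the first things to look at.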
You have a typical case of cache conflicts.
Consider that a cache line on your CPU is probably 64 bytes long. Having one processor/core write to the first 4 bytes (float) causes that cache line to be invalidated on every other L1/L2 and maybe L3. This is a lot of overhead.
Partition your data better!
#pragma omp parallel for private(i) shared(results, vector, matrix) schedule(static,16)
should do the trick. Increase the chunksize if this does not help.
Another optimisation is to store the result locally before you flush it down to memory.
Also, this is an OpenMP thing, but you don't need to start a new parallel region for the loop (each mention of parallel starts a new team):
#pragma omp parallel default(none) \
shared(results, vector, matrix) \
firstprivate(matrix_size) \
num_threads(4)
{
    int i, y;
    #pragma omp for schedule(static,16)
    for(y = 0; y < matrix_size ; y++){
        double result = 0; /* accumulate locally, write to memory once */
        for(i = 0; i < matrix_size; i++){
            result += vector[i]*matrix[i][y];
        }
        results[y] = result;
    }
}
I am trying to optimize a C program using openMP. I am able to parallelize the simple for loop like
#pragma omp parallel
#pragma omp for
for (i = 0; i < size; i++)
{
Y[i] = i * 0.3;
Z[i] = -i * 0.4;
}
where X, Y and Z are float* of size "size". This works fine, but there is another loop immediately after it:
for (i = 0; i < size; i++)
{
X[i] += Z[i] * Y[i] * Y[i] * 10.0;
sum += X[i];
}
printf("Sum =%d\n",sum);
I am not sure how to parallelize the above loop. I am compiling the program with the command gcc -fopenmp filename and running the executable with ./a.out. I hope that is enough to show a performance improvement.
I added #pragma omp parallel for reduction(+:sum) at the top of the second loop; it is indeed running faster and producing correct output. I need expert input on parallelizing the above while avoiding false sharing. Is the above directive correct, or is there a better alternative to parallelize it and make it faster?
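For what it's worth, here is a minimal sketch of how the two loops could share one parallel region, assuming X, Y and Z do not overlap and that accumulating sum in a double is acceptable (the function name update_and_sum is hypothetical). The reduction clause gives each thread its own private sum, so there is no false sharing on sum, and with schedule(static) each thread works on one contiguous chunk, so threads only share cache lines of the arrays at chunk boundaries.

```c
#include <assert.h>
#include <stddef.h>

/* Fills Y and Z, updates X in place and returns the sum of the
 * updated X. Assumption: the three arrays do not overlap. */
double update_and_sum(float *X, float *Y, float *Z, int size)
{
    assert(X && Y && Z && size >= 0);
    double sum = 0.0;
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (int i = 0; i < size; i++) {
            Y[i] = i * 0.3;
            Z[i] = -i * 0.4;
        }
        /* The implicit barrier after the first loop guarantees that
         * Y and Z are complete before any thread reads them here. */
        #pragma omp for schedule(static) reduction(+:sum)
        for (int i = 0; i < size; i++) {
            X[i] += Z[i] * Y[i] * Y[i] * 10.0;
            sum += X[i];
        }
    }
    return sum;
}
```

Using one parallel region for both loops avoids starting the thread team twice. Note that sum printed with %d would be wrong for a floating-point value; %f is the matching format.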
I want to parallelize the for loops but I can't seem to grasp the concept: every time I try to parallelize them the code still works, but it slows down dramatically.
for(i=0; i<nbodies; ++i){
for(j=i+1; j<nbodies; ++j) {
d2 = 0.0;
for(k=0; k<3; ++k) {
rij[k] = pos[i][k] - pos[j][k];
d2 += rij[k]*rij[k];
if (d2 <= cut2) {
d = sqrt(d2);
d3 = d*d2;
for(k=0; k<3; ++k) {
double f = -rij[k]/d3;
forces[i][k] += f;
forces[j][k] -= f;
}
ene += -1.0/d;
}
}
}
}
I tried using synchronization with barrier and critical in some cases but nothing happens or the processing simply does not end.
Update, this is the state I'm at right now. Working without crashes but calculation times worsen the more threads I add. (Ryzen 5 2600 6/12)
#pragma omp parallel shared(d,d2,d3,nbodies,rij,pos,cut2,forces) private(i,j,k) num_threads(n)
{
clock_t begin = clock();
#pragma omp for schedule(auto)
for(i=0; i<nbodies; ++i){
for(j=i+1; j<nbodies; ++j) {
d2 = 0.0;
for(k=0; k<3; ++k) {
rij[k] = pos[i][k] - pos[j][k];
d2 += rij[k]*rij[k];
}
if (d2 <= cut2) {
d = sqrt(d2);
d3 = d*d2;
#pragma omp parallel for shared(d3) private(k) schedule(auto) num_threads(n)
for(k=0; k<3; ++k) {
double f = -rij[k]/d3;
#pragma omp atomic
forces[i][k] += f;
#pragma omp atomic
forces[j][k] -= f;
}
ene += -1.0/d;
}
}
}
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
#pragma omp single
printf("Calculation time %lf sec\n",time_spent);
}
I incorporated the timer in the actual parallel code (I think it is some milliseconds faster this way). Also I think I got most of the shared and private variables right. In the file it outputs the forces.
Using barriers or other synchronization will slow down your code unless the amount of unsynchronized work is larger by a good factor. That is not the case here. You probably need to reformulate your code to remove synchronization.
You are doing something like an N-body simulation. I've worked out a couple of solutions here: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-examples.html#N-bodyproblems
Also: your d2 loop is a reduction, so you can treat it like that, but it is probably enough if that variable is private to the i,j iterations.
You should always define your variables in their minimal required scope, especially if performance is an issue (if you do so, your compiler can create more efficient code). Besides performance, it also helps to avoid data races.
I think you have misplaced a curly brace, and the condition in the first for loop should be i<nbodies-1. The variable ene can be summed up using a reduction, and to avoid a data race, atomic operations have to be used to update the array forces, so you do not need slow barriers or critical sections. Your code should look something like this (assuming int for indices and double for calculation):
#pragma omp parallel for reduction(+:ene)
for(int i=0; i<nbodies-1; ++i){
for(int j=i+1; j<nbodies; ++j) {
double d2 = 0.0;
double rij[3];
for(int k=0; k<3; ++k) {
rij[k] = pos[i][k] - pos[j][k];
d2 += rij[k]*rij[k];
}
if (d2 <= cut2) {
double d = sqrt(d2);
double d3 = d*d2;
for(int k=0; k<3; ++k) {
double f = -rij[k]/d3;
#pragma omp atomic
forces[i][k] += f;
#pragma omp atomic
forces[j][k] -= f;
}
ene += -1.0/d;
}
}
}
Solved, turns out all I needed was
#pragma omp parallel for nowait
Doesn't need the "atomic" either.
Weird solution; I don't fully understand how it works, but it does, and the output file has no corrupt results whatsoever.
I've implemented a version of the Travelling Salesman with xmmintrin.h SSE instructions, received a decent speedup. But now I'm also trying to implement OpenMP threading on top of it, and I'm seeing a pretty drastic slow down. I'm getting the correct answer in both cases (i.e. (i) with SSE only, or (ii) with SSE && OpenMP).
I know I am probably doing something wildly wrong, and maybe someone much more experienced than me can spot the issue.
The main loop of my program has the following (brief) pseudocode:
int currentNode;
for(int i = 0; i < numNodes; i++) {
minimumDistance = DBL_MAX;
minimumDistanceNode;
for(int j = 0; j < numNodes; j++) {
// find distance between 'currentNode' to j-th node
// ...
if(jthNodeDistance < minimumDistance) {
minimumDistance = jthNodeDistance;
minimumDistanceNode = jthNode;
}
}
currentNode = minimumDistanceNode;
}
And here is my implementation. It is still semi-pseudocode, as I've brushed over some parts that I don't think have an impact on performance; I think the issues with my code can be found in the following snippet. If you omit the #pragma lines, the following is pretty much identical to the SSE-only version of the same program, so I figure I should only include the OpenMP version:
int currentNode = 0;
#pragma omp parallel
{
#pragma omp single
{
for (int i = 1; i < totalNum; i++) {
minimum = DBL_MAX;
__m128 currentNodeX = _mm_set1_ps(xCoordinates[currentNode]);
__m128 currentNodeY = _mm_set1_ps(yCoordinates[currentNode]);
#pragma omp parallel num_threads(omp_get_max_threads())
{
float localMinimum = DBL_MAX;
float localMinimumNode;
#pragma omp for
for (int j = 0; j < loopEnd; j += 4) {
// a number of SSE vector calculations to find distance
// between the current node and the four nodes we're looking
// at in this iteration of the loop:
__m128 subXs_0 = _mm_sub_ps(currentNodeX, _mm_load_ps(&xCoordinates[j]));
__m128 squareSubXs_0 = _mm_mul_ps(subXs_0, subXs_0);
__m128 subYs_0 = _mm_sub_ps(currentNodeY, _mm_load_ps(&yCoordinates[j]));
__m128 squareSubYs_0 = _mm_mul_ps(subYs_0, subYs_0);
__m128 addXY_0 = _mm_add_ps(squareSubXs_0, squareSubYs_0);
float temp[unroll];
_mm_store_ps(&temp[0], addXY_0);
// skipping stuff here that is about getting the minimum distance and
// it's equivalent node, don't think it's massively relevant but
// each thread will have its own
// localMinimum
// localMinimumNode
}
// updating the global minimumNode in a thread-safe way
#pragma omp critical (update_minimum)
{
if (localMinimum < minimum) {
minimum = localMinimum;
minimumNode = localMinimumNode;
}
}
}
// within the 'omp single'
ThisPt = minimumNode;
}
}
}
So my logic is:
omp single for the top-level for(int i) for loop, and I only want 1 thread dedicated to this
omp parallel num_threads(omp_get_max_threads()) for the inner for(int j) for-loop, as I want all cores working on this part of the code at the same time.
omp critical at the end of the full for(int j) loop, as I want to thread-safely update the current node.
In terms of run-time, the OpenMP version is typically twice as slow as the SSE-only version.
Does anything jump out at you as particularly bad in my code, that is causing this drastic slow-down for OpenMP?
Does anything jump out at you as particularly bad in my code, that is
causing this drastic slow-down for OpenMP?
First:
omp single for the top-level for(int i) for loop, and I only want 1
thread dedicated to this
In your code you have the following:
#pragma omp parallel
{
#pragma omp single
{
for (int i = 1; i < totalNum; i++)
{
#pragma omp parallel num_threads(omp_get_max_threads())
{
//....
}
// within the 'omp single'
ThisPt = minimumNode;
}
}
}
The #pragma omp parallel creates a team of threads, but then only one thread executes the enclosed block (because of #pragma omp single) while the other threads do nothing. You can simplify it to:
for (int i = 1; i < totalNum; i++)
{
#pragma omp parallel num_threads(omp_get_max_threads())
{
//....
}
ThisPt = minimumNode;
}
The outer loop is still executed by only one thread.
Second:
omp parallel num_threads(omp_get_max_threads()) for the inner for(int
j) for-loop, as I want all cores working on this part of the code at
the same time.
The problem is that omp_get_max_threads() might return the number of logical cores rather than physical cores, and some codes perform worse with hyper-threading. So I would first test with different numbers of threads, starting from 2, then 4 and so on, until you find the number at which the code stops scaling.
omp critical at the end of the full for(int j) loop, as I want to
thread-safely update the current node.
// updating the global minimumNode in a thread-safe way
#pragma omp critical (update_minimum)
{
if (localMinimum < minimum) {
minimum = localMinimum;
minimumNode = localMinimumNode;
}
}
This can be replaced by creating an array where each thread saves its local minimum in a position reserved for that thread; outside the parallel region, the initial thread then extracts the minimum and minimumNode:
int total_threads = /* ... */;
float localMinimum[total_threads];
float localMinimumNode[total_threads];
/* a variable-length array cannot have an initializer, so fill it in a loop */
for(int t = 0; t < total_threads; t++){
    localMinimum[t] = FLT_MAX;
    localMinimumNode[t] = -1;
}
#pragma omp parallel num_threads(total_threads)
{
    /* ... */
}
for(int i = 0; i < total_threads; i++){
    if (localMinimum[i] < minimum) {
        minimum = localMinimum[i];
        minimumNode = localMinimumNode[i];
    }
}
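A self-contained sketch of the per-thread-slot idea, reduced to finding the index of the smallest value in an array of distances. The function name argmin_per_thread_slots, the dist array and the serial fallbacks for the two omp_ functions are assumptions made for illustration; the SSE distance computation is left out.

```c
#include <assert.h>
#include <float.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void) { return 0; }
#endif

/* Returns the index of the smallest element of dist[0..n),
 * or -1 if n <= 0. */
int argmin_per_thread_slots(const float *dist, int n)
{
    int total_threads = omp_get_max_threads();
    float *localMinimum = malloc(total_threads * sizeof *localMinimum);
    int *localMinimumNode = malloc(total_threads * sizeof *localMinimumNode);
    assert(localMinimum && localMinimumNode);
    for (int t = 0; t < total_threads; t++) {
        localMinimum[t] = FLT_MAX;
        localMinimumNode[t] = -1;
    }

    #pragma omp parallel num_threads(total_threads)
    {
        /* Work on plain locals inside the loop and store to the
         * thread's slot once, so the slots are not written repeatedly. */
        float best = FLT_MAX;
        int bestNode = -1;
        #pragma omp for
        for (int j = 0; j < n; j++) {
            if (dist[j] < best) {
                best = dist[j];
                bestNode = j;
            }
        }
        int tid = omp_get_thread_num();
        localMinimum[tid] = best;
        localMinimumNode[tid] = bestNode;
    }

    /* Serial merge by the initial thread, no critical section. */
    float minimum = FLT_MAX;
    int minimumNode = -1;
    for (int t = 0; t < total_threads; t++) {
        if (localMinimum[t] < minimum) {
            minimum = localMinimum[t];
            minimumNode = localMinimumNode[t];
        }
    }
    free(localMinimum);
    free(localMinimumNode);
    return minimumNode;
}
```

When several elements tie for the minimum, which index is returned can depend on how the iterations were split between threads.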
Finally, after those changes are done, check whether it is possible to replace this parallelization with the following:
#pragma omp parallel for
for (int i = 1; i < totalNum; i++)
{
...
}
I want to evaluate the integral of a function by summation with OpenMP in two ways: by using an array to hold the value computed in every step and then taking the sum of all values, and by taking the sum without the array.
The code is:
double f(double x)
{
return sin(x)*sin(x)/(x*x+1);
}
METHOD 1
long i = 0;
const long NUM_STEP = 100000;
double sum[NUM_STEP];
double from = 0.0, to = 1.0;
double step = (to - from)/NUM_STEP;
double result = 0;
#pragma omp parallel for shared(sum) num_threads(4)
for(i=0; i<NUM_STEP; i++)
sum[i] = step*f(from+i*step);
for(i=0; i<NUM_STEP; i++)
result += sum[i];
printf("%lf", result);
METHOD 2
long i = 0;
const long NUM_STEP = 100000;
double from = 0.0, to = 1.0;
double step = (to - from)/NUM_STEP;
double result = 0;
#pragma omp parallel for shared(result) num_threads(4)
for(i=0; i<NUM_STEP; i++)
result += step*f(from+i*step);
printf("%lf", result);
But the results are too different. METHOD 1 gives a stable value, but METHOD 2 gives a value that changes from run to run. Here is an example:
METHOD 1: 0.178446
METHOD 2: 0.158738
The value of METHOD 1 is correct (checked with another tool).
TL;DR The first method does not have a race condition, whereas the second method does.
Namely, in the first method:
#pragma omp parallel for shared(sum) num_threads(4)
for(i=0; i<NUM_STEP; i++)
sum[i] = step*f(from+i*step);
for(i=0; i<NUM_STEP; i++)
result += sum[i];
each thread saves the result of the operation step*f(from+i*step) in a different position of the array sum. After that, the master thread sequentially reduces the values saved in the array sum, namely:
for(i=0; i<NUM_STEP; i++)
result += sum[i];
Actually, there is a slight improvement you can make to this version: instead of allocating the array sum with NUM_STEP elements, you can allocate it with one element per thread, and each thread accumulates into the position equal to its ID, namely:
int total_threads = 4;
double sum[total_threads];
for(i=0; i<total_threads; i++)
    sum[i] = 0.0;
#pragma omp parallel num_threads(total_threads)
{
    int thread_id = omp_get_thread_num();
    #pragma omp for
    for(i=0; i<NUM_STEP; i++)
        sum[thread_id] += step*f(from+i*step);
}
for(i=0; i<total_threads; i++)
    result += sum[i];
Nonetheless, the best approach is to actually fix the second method.
In the second method there is a race condition on the update of the variable result:
#pragma omp parallel for shared(result) num_threads(4)
for(i=0; i<NUM_STEP; i++)
result += step*f(from+i*step);
because the result variable is being updated concurrently by multiple threads in a non-thread-safe manner.
To solve this race condition you need to use the reduction clause:
#pragma omp parallel for reduction(+:result) num_threads(4)
for(i=0; i<NUM_STEP; i++)
result += step*f(from+i*step);
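Put together as a function, a minimal sketch could look like this. The name integrate and the function-pointer parameter are my own choices, and three_x_squared is just a stand-in integrand with a known integral (3x^2 integrates to 1 over [0,1]), so the result is easy to check; passing the f from the question works the same way.

```c
#include <assert.h>

/* Stand-in integrand (an assumption for illustration, not the f
 * from the question): the integral of 3x^2 over [0,1] is 1. */
static double three_x_squared(double x)
{
    return 3.0 * x * x;
}

/* Left Riemann sum of f over [from, to] using num_step steps. */
double integrate(double (*f)(double), double from, double to, long num_step)
{
    assert(f && num_step > 0);
    double step = (to - from) / num_step;
    double result = 0.0;
    /* Each thread accumulates into its own private copy of result;
     * the copies are added together when the loop ends. */
    #pragma omp parallel for reduction(+:result)
    for (long i = 0; i < num_step; i++)
        result += step * f(from + i * step);
    return result;
}
```

Because floating-point addition is not associative, the last digits of the result can differ slightly between runs or thread counts; that is rounding, not a race.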
I'd like to get to know OpenMP a bit, because I'd like to have a huge loop parallelized. After some reading (SO, Common OMP mistakes, tutorial, etc.), I've taken as a first step the basically working C/mex code given below (which yields different results for the first test case).
The first test sums up result values (functions serial and parallel);
the second takes values from an input array and writes the processed values to an output array (functions serial_a and parallel_a).
My questions are:
Why do the results of the first test differ, i.e. the results of serial and parallel?
Surprisingly, the second test succeeds. My concern is how to handle memory (array locations) that may be read by multiple threads. In the example this is emulated by sin(a[i]) / cos(a[n-i]+1.0).
Are there some easy rules for determining which variables to declare as private, shared and reduction?
In both cases int i is outside the pragma, yet the second test appears to yield correct results. So is that okay, or does i have to be moved into the pragma omp parallel region, as is said here?
Any other hints on spotted mistakes?
Code
#include "mex.h"
#include <math.h>
#include <omp.h>
#include <time.h>
double serial(int x)
{
double sum=0;
int i;
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
return sum;
}
double parallel(int x)
{
double sum=0;
int i;
#pragma omp parallel num_threads(6) shared(sum) //default(none)
{
//printf(" I'm thread no. %d\n", omp_get_thread_num());
#pragma omp for private(i, x) reduction(+: sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
return sum;
}
void serial_a(double* a, int n, double* y2)
{
int i;
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
void parallel_a(double* a, int n, double* y2)
{
int i;
#pragma omp parallel num_threads(6)
{
#pragma omp for private(i)
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
}
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
double sum, *y1, *y2, *a, s, p;
int x, n, *d;
/* Check for proper number of arguments. */
if(nrhs!=2) {
mexErrMsgTxt("Two inputs required.");
} else if(nlhs>2) {
mexErrMsgTxt("Too many output arguments.");
}
/* Get pointer to first input */
x = (int)mxGetScalar(prhs[0]);
/* Get pointer to second input */
a = mxGetPr(prhs[1]);
d = (int*)mxGetDimensions(prhs[1]);
n = (int)d[1]; // row vector
/* Create space for output */
plhs[0] = mxCreateDoubleMatrix(2,1, mxREAL);
plhs[1] = mxCreateDoubleMatrix(n,2, mxREAL);
/* Get pointer to output array */
y1 = mxGetPr(plhs[0]);
y2 = mxGetPr(plhs[1]);
{ /* Do the calculation */
clock_t tic = clock();
y1[0] = serial(x);
s = (double) clock()-tic;
printf("serial....: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
y1[1] = parallel(x);
p = (double) clock()-tic;
printf("parallel..: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
mexEvalString("drawnow");
tic = clock();
serial_a(a, n, y2);
s = (double) clock()-tic;
printf("serial_a..: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
parallel_a(a, n, &y2[n]);
p = (double) clock()-tic;
printf("parallel_a: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
}
}
Output
>> mex omp1.c
>> [a, b] = omp1(1e8, 1:1e8);
serial....: 13399 ms
parallel..: 2810 ms
ratio.....: 0.21
serial_a..: 12840 ms
parallel_a: 2740 ms
ratio.....: 0.21
>> a(1) == a(2)
ans =
0
>> all(b(:,1) == b(:,2))
ans =
1
System
MATLAB Version: 8.0.0.783 (R2012b)
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Microsoft Visual Studio 2005 Version 8.0.50727.867
In your function parallel you have a few mistakes. The one that changes the result is private(x): a private variable is a new, uninitialized copy in each thread, so the loop bound and the arguments of sin and cos are undefined inside the loop. It is also cleaner to declare the reduction and the private and shared variables on the parallel directive itself. When you do a reduction there is no need to declare the reduced variable as shared; the reduction clause takes care of this.
To know what to declare private or shared, ask yourself which variables are being written to. If a variable is not written to, you normally want it to be shared. In your case the variable x does not change, so you should declare it shared. The variable i, however, does change, so normally you should declare it private. To fix your function you could do
#pragma omp parallel reduction(+:sum) private(i) shared(x)
{
#pragma omp for
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
However, OpenMP automatically makes the iterator of a parallel for loop private, and variables declared outside of parallel regions are shared by default, so for your parallel function you can simply do
#pragma omp parallel for reduction(+:sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
Notice that the only difference between this and your serial code is the pragma statement. OpenMP is designed so that you don't have to change your code except for pragma statements.
When it comes to arrays, as long as each iteration of a parallel for loop acts on a different array element, you don't have to worry about shared and private. So you can write your parallel_a function simply as
#pragma omp parallel for
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
and once again it is the same as your serial_a function except for the pragma statement.
But be careful with assuming iterators are private. Consider the following double loop
for(i=0; i<n; i++) {
for(j=0; j<m; j++) {
//
}
}
If you use #pragma omp parallel for with that, the i iterator will be made private but the j iterator will be shared. This is because the parallel for only applies to the outer loop over i, and since j is shared by default it is not made private. In this case you would need to explicitly declare j private, like this: #pragma omp parallel for private(j).
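A small sketch of that double loop with j made private explicitly. The function fill_table and the table it fills are made up for illustration; the point is only the private(j) clause.

```c
#include <assert.h>

/* Fills an n-by-m row-major table with out[i*m + j] = i * j. */
void fill_table(int n, int m, long *out)
{
    assert(out && n >= 0 && m >= 0);
    int i, j;   /* declared outside the loops, as in the code above */
    /* i is the parallel loop's iterator and is private automatically;
     * j is an ordinary variable and must be listed in private(). */
    #pragma omp parallel for private(j)
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            out[(long)i * m + j] = (long)i * j;
        }
    }
}
```

If your compiler accepts C99, declaring the counters in the loop headers (for (int j = 0; ...)) has the same effect, because variables declared inside the parallel region are private to each thread.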