I want to evaluate the integral of a function by summation with OpenMP, in two ways: using an array to hold the value computed at every step and then taking the sum of all values, and taking the sum directly without the array.
The code is:
double f(double x)
{
return sin(x)*sin(x)/(x*x+1);
}
METHOD 1
long i = 0;
const long NUM_STEP = 100000;
double sum[NUM_STEP];
double from = 0.0, to = 1.0;
double step = (to - from)/NUM_STEP;
double result = 0;
#pragma omp parallel for shared(sum) num_threads(4)
for(i=0; i<NUM_STEP; i++)
sum[i] = step*f(from+i*step);
for(i=0; i<NUM_STEP; i++)
result += sum[i];
printf("%lf", result);
METHOD 2
long i = 0;
const long NUM_STEP = 100000;
double from = 0.0, to = 1.0;
double step = (to - from)/NUM_STEP;
double result = 0;
#pragma omp parallel for shared(result) num_threads(4)
for(i=0; i<NUM_STEP; i++)
result += step*f(from+i*step);
printf("%lf", result);
But the results differ significantly. METHOD 1 gives a stable value, while METHOD 2 gives a value that changes from run to run. Here is an example:
METHOD 1: 0.178446
METHOD 2: 0.158738
The value from METHOD 1 is correct (checked with another tool).
TL;DR: The first method does not have a race condition, whereas the second one does.
Namely, in the first method:
#pragma omp parallel for shared(sum) num_threads(4)
for(i=0; i<NUM_STEP; i++)
sum[i] = step*f(from+i*step);
for(i=0; i<NUM_STEP; i++)
result += sum[i];
each thread saves the result of the operation step*f(from+i*step) in a different position of the array sum. Afterwards, the master thread sequentially reduces the values saved in that array, namely:
for(i=0; i<NUM_STEP; i++)
result += sum[i];
Actually, there is a slight improvement you can make to this version: instead of allocating the array sum with NUM_STEP elements, you can allocate it with as many elements as there are threads and have each thread accumulate into the position equal to its ID, reducing those partial sums sequentially after the parallel region, namely:
int total_threads = 4;
double sum[total_threads];
#pragma omp parallel num_threads(total_threads)
{
    int thread_id = omp_get_thread_num();
    sum[thread_id] = 0.0;
    #pragma omp for
    for(i=0; i<NUM_STEP; i++)
        sum[thread_id] += step*f(from+i*step);
}
for(i=0; i<total_threads; i++)
    result += sum[i];
Nonetheless, the best approach is to actually fix the second method.
In the second method there is a race condition on the update of the variable result:
#pragma omp parallel for shared(result) num_threads(4)
for(i=0; i<NUM_STEP; i++)
result += step*f(from+i*step);
because the result variable is updated concurrently by multiple threads in a non-thread-safe manner.
To solve this race condition you can use the reduction clause:
#pragma omp parallel for reduction(+:result) num_threads(4)
for(i=0; i<NUM_STEP; i++)
result += step*f(from+i*step);
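For reference, here is a minimal self-contained sketch of the corrected METHOD 2; it assumes compilation with OpenMP enabled (e.g. gcc -fopenmp -lm), and the expected output is the value quoted above:
#include <stdio.h>
#include <math.h>

double f(double x)
{
    return sin(x)*sin(x)/(x*x + 1);
}

int main(void)
{
    const long NUM_STEP = 100000;
    double from = 0.0, to = 1.0;
    double step = (to - from)/NUM_STEP;
    double result = 0.0;

    /* each thread accumulates a private copy of result; the copies are
       summed into the shared result at the end of the loop */
    #pragma omp parallel for reduction(+:result) num_threads(4)
    for(long i = 0; i < NUM_STEP; i++)
        result += step*f(from + i*step);

    printf("%lf\n", result);   /* ~0.178446 */
    return 0;
}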
Related
I want to parallelize the for loops, but I can't seem to grasp the concept; every time I try to parallelize them, the code still works but slows down dramatically.
for(i=0; i<nbodies; ++i){
for(j=i+1; j<nbodies; ++j) {
d2 = 0.0;
for(k=0; k<3; ++k) {
rij[k] = pos[i][k] - pos[j][k];
d2 += rij[k]*rij[k];
if (d2 <= cut2) {
d = sqrt(d2);
d3 = d*d2;
for(k=0; k<3; ++k) {
double f = -rij[k]/d3;
forces[i][k] += f;
forces[j][k] -= f;
}
ene += -1.0/d;
}
}
}
}
I tried using synchronization with barrier and critical in some cases, but either nothing happens or the processing simply does not end.
Update: this is the state I'm at right now. It works without crashes, but calculation times get worse the more threads I add. (Ryzen 5 2600, 6 cores / 12 threads)
#pragma omp parallel shared(d,d2,d3,nbodies,rij,pos,cut2,forces) private(i,j,k) num_threads(n)
{
clock_t begin = clock();
#pragma omp for schedule(auto)
for(i=0; i<nbodies; ++i){
for(j=i+1; j<nbodies; ++j) {
d2 = 0.0;
for(k=0; k<3; ++k) {
rij[k] = pos[i][k] - pos[j][k];
d2 += rij[k]*rij[k];
}
if (d2 <= cut2) {
d = sqrt(d2);
d3 = d*d2;
#pragma omp parallel for shared(d3) private(k) schedule(auto) num_threads(n)
for(k=0; k<3; ++k) {
double f = -rij[k]/d3;
#pragma omp atomic
forces[i][k] += f;
#pragma omp atomic
forces[j][k] -= f;
}
ene += -1.0/d;
}
}
}
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
#pragma omp single
printf("Calculation time %lf sec\n",time_spent);
}
I incorporated the timer into the actual parallel code (I think it is a few milliseconds faster this way). Also, I think I got most of the shared and private variables right. The forces are written to an output file.
Using barriers or other synchronization will slow down your code unless the amount of unsynchronized work is larger by a good factor, and that is not the case here. You probably need to reformulate your code to remove the synchronization.
You are doing something like an N-body simulation. I've worked out a couple of solutions here: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-examples.html#N-bodyproblems
Also: your d2 loop is a reduction, so you can treat it as one, but it is probably enough for that variable to be private to each i,j iteration.
You should always define your variables in their minimal required scope, especially if performance is an issue. (Note that if you do so, your compiler can create more efficient code.) Besides performance, it also helps to avoid data races.
I think you have misplaced a curly brace, and the condition in the first for loop should be i<nbodies-1. The variable ene can be summed using a reduction, and to avoid a data race, atomic operations have to be used to update the forces array, so you do not need slow barriers or critical sections. Your code should look something like this (assuming int for indices and double for the calculation):
#pragma omp parallel for reduction(+:ene)
for(int i=0; i<nbodies-1; ++i){
    for(int j=i+1; j<nbodies; ++j) {
        double d2 = 0.0;
        double rij[3];
        for(int k=0; k<3; ++k) {
            rij[k] = pos[i][k] - pos[j][k];
            d2 += rij[k]*rij[k];
        }
        if (d2 <= cut2) {
            double d = sqrt(d2);
            double d3 = d*d2;
            for(int k=0; k<3; ++k) {
                double f = -rij[k]/d3;
                #pragma omp atomic
                forces[i][k] += f;
                #pragma omp atomic
                forces[j][k] -= f;
            }
            ene += -1.0/d;
        }
    }
}
Solved, turns out all I needed was
#pragma omp parallel for nowait
Doesn't need the "atomic" either.
It is a weird solution; I don't fully understand how it works, but it does, and the output file has no corrupt results whatsoever.
I am trying to understand how OMP treats different for loop declarations. I have:
int main()
{
int i, A[10000]={...};
double ave = 0.0;
#pragma omp parallel for reduction(+:ave)
for(i=0;i<10000;i++){
ave += A[i];
}
ave /= 10000;
printf("Average value = %0.4f\n",ave);
return 0;
}
where {...} are the numbers from 1 to 10000. This code prints the correct value. If instead of #pragma omp parallel for reduction(+:ave) I use #pragma omp parallel for private(ave), the result of the printf is 0.0000. I think I understand what reduction(oper:list) does, but I was wondering if it can be replaced with private, and how.
So yes, you can do reductions without the reduction clause. But that has a few downsides that you have to understand:
You have to do things by hand, which is more error-prone:
declare local variables to store local accumulations;
initialize them correctly;
accumulate into them;
do the final reduction in the initial variable using a critical construct.
It is harder to understand and maintain.
It is potentially less efficient...
Anyway, here is an example of that using your code:
int main() {
int i, A[10000]={...};
double ave = 0.0;
double localAve;
#pragma omp parallel private( i, localAve )
{
localAve = 0;
#pragma omp for
for( i = 0; i < 10000; i++ ) {
localAve += A[i];
}
#pragma omp critical
ave += localAve;
}
ave /= 10000;
printf("Average value = %0.4f\n",ave);
return 0;
}
This is the classical method for doing reductions by hand, but notice that the variable that would have been declared in the reduction clause isn't declared private here. What becomes private is a local substitute for this variable, while the global one must remain shared.
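For comparison, here is a sketch (my annotation, not code from the question) of why the private(ave) variant prints 0.0000:
#include <stdio.h>

int main() {
    int i, A[10000]={...};   /* the numbers from 1 to 10000, as in the question */
    double ave = 0.0;
    #pragma omp parallel for private(ave)
    for(i=0;i<10000;i++){
        /* each thread accumulates into its own, uninitialized private copy of
           ave; the outer ave is never touched inside the parallel region */
        ave += A[i];
    }
    /* the private copies are discarded here, so the outer ave is still 0.0 */
    ave /= 10000;
    printf("Average value = %0.4f\n",ave);   /* prints 0.0000 */
    return 0;
}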
I'd like to get to know OpenMP a bit, because I'd like to parallelize a huge loop. After some reading (SO, Common OMP mistakes, tutorial, etc.), I've taken as a first step the basically working C/MEX code given below (which yields different results for the first test case).
The first test sums up result values (functions serial and parallel);
the second takes values from an input array and writes the processed values to an output array (functions serial_a and parallel_a).
My questions are:
Why do the results of the first test differ, i.e. the results of serial and parallel?
Surprisingly, the second test succeeds. My concern is how to handle memory (array locations) that may be read by multiple threads; in the example this should be emulated by sin(a[i]) / cos(a[n-i]+1.0).
Are there some easy rules for determining which variables to declare private, shared, or reduction?
In both cases int i is declared outside the pragma, yet the second test appears to yield correct results. So is that okay, or does i have to be moved into the omp parallel region, as is said here?
Any other hints on spotted mistakes?
Code
#include "mex.h"
#include <math.h>
#include <omp.h>
#include <time.h>
double serial(int x)
{
double sum=0;
int i;
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
return sum;
}
double parallel(int x)
{
double sum=0;
int i;
#pragma omp parallel num_threads(6) shared(sum) //default(none)
{
//printf(" I'm thread no. %d\n", omp_get_thread_num());
#pragma omp for private(i, x) reduction(+: sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
return sum;
}
void serial_a(double* a, int n, double* y2)
{
int i;
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
void parallel_a(double* a, int n, double* y2)
{
int i;
#pragma omp parallel num_threads(6)
{
#pragma omp for private(i)
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
}
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
double sum, *y1, *y2, *a, s, p;
int x, n, *d;
/* Check for proper number of arguments. */
if(nrhs!=2) {
mexErrMsgTxt("Two inputs required.");
} else if(nlhs>2) {
mexErrMsgTxt("Too many output arguments.");
}
/* Get pointer to first input */
x = (int)mxGetScalar(prhs[0]);
/* Get pointer to second input */
a = mxGetPr(prhs[1]);
d = (int*)mxGetDimensions(prhs[1]);
n = (int)d[1]; // row vector
/* Create space for output */
plhs[0] = mxCreateDoubleMatrix(2,1, mxREAL);
plhs[1] = mxCreateDoubleMatrix(n,2, mxREAL);
/* Get pointer to output array */
y1 = mxGetPr(plhs[0]);
y2 = mxGetPr(plhs[1]);
{ /* Do the calculation */
clock_t tic = clock();
y1[0] = serial(x);
s = (double) clock()-tic;
printf("serial....: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
y1[1] = parallel(x);
p = (double) clock()-tic;
printf("parallel..: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
mexEvalString("drawnow");
tic = clock();
serial_a(a, n, y2);
s = (double) clock()-tic;
printf("serial_a..: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
parallel_a(a, n, &y2[n]);
p = (double) clock()-tic;
printf("parallel_a: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
}
}
Output
>> mex omp1.c
>> [a, b] = omp1(1e8, 1:1e8);
serial....: 13399 ms
parallel..: 2810 ms
ratio.....: 0.21
serial_a..: 12840 ms
parallel_a: 2740 ms
ratio.....: 0.21
>> a(1) == a(2)
ans =
0
>> all(b(:,1) == b(:,2))
ans =
1
System
MATLAB Version: 8.0.0.783 (R2012b)
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Microsoft Visual Studio 2005 Version 8.0.50727.867
In your function parallel you have a few mistakes. The reduction should be declared where you use parallel. Private and shared variables should also be declared there. But when you do a reduction, you should not declare the variable that is being reduced as shared; the reduction clause takes care of this.
To know what to declare private or shared, you have to ask yourself which variables are being written to. If a variable is not written to, you normally want it to be shared. In your case the variable x does not change, so you should declare it shared (declaring it private, as your for clause does, gives each thread its own uninitialized copy of x, which is why the first test yields different results). The variable i, however, does change, so normally you should declare it private. So to fix your function you could do
#pragma omp parallel reduction(+:sum) private(i) shared(x)
{
#pragma omp for
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
However, OpenMP automatically makes the loop iterator of a parallel for construct private, and variables declared outside parallel regions are shared by default, so for your parallel function you can simply do
#pragma omp parallel for reduction(+:sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
Notice that the only difference between this and your serial code is the pragma statement. OpenMP is designed so that you don't have to change your code except for pragma statements.
When it comes to arrays, as long as each iteration of a parallel for loop acts on a different array element, you don't have to worry about shared and private. So you can write your parallel_a function simply as
#pragma omp parallel for
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
and once again it is the same as your serial_a function except for the pragma statement.
But be careful about assuming iterators are private. Consider the following double loop:
for(i=0; i<n; i++) {
for(j=0; j<m; j++) {
//
}
}
If you use #pragma omp parallel for with that, the i iterator will be made private but the j iterator will be shared. This is because the parallel for only applies to the outer loop over i, and since j is shared by default it is not made private. In this case you would need to explicitly declare j private, like this: #pragma omp parallel for private(j).
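For illustration, here is a small sketch of the two usual fixes (n, m and the empty loop bodies are placeholders):
/* Fix 1: keep the outer declarations, but explicitly mark j private. */
#pragma omp parallel for private(j)
for(i=0; i<n; i++) {
    for(j=0; j<m; j++) {
        //
    }
}

/* Fix 2: declare the iterators in their minimal scope,
   which makes the inner one private automatically. */
#pragma omp parallel for
for(int i=0; i<n; i++) {
    for(int j=0; j<m; j++) {
        //
    }
}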
This question already has answers here:
Why OpenMP under ubuntu 12.04 is slower than serial version
Parallelizing matrix times a vector by columns and by rows with OpenMP
Closed 8 years ago.
I'm trying to write matrix-by-vector multiplication in C (OpenMP), but my program slows down when I add processors...
1 proc - 1.3 s
2 proc - 2.6 s
4 proc - 5.47 s
I tested this on my PC (Core i5) and on our school's cluster, and the result is the same (the program slows down).
Here is my code (the matrix is 10000 x 10000 and the vector has 10000 elements):
double start_time = clock();
#pragma omp parallel private(i) num_threads(4)
{
tid = omp_get_thread_num();
world_size = omp_get_num_threads();
printf("Threads: %d\n",world_size);
for(y = 0; y < matrix_size ; y++){
#pragma omp parallel for private(i) shared(results, vector, matrix)
for(i = 0; i < matrix_size; i++){
results[y] = results[y] + vector[i]*matrix[i][y];
}
}
}
double end_time = clock();
double result_time = (end_time - start_time) / CLOCKS_PER_SEC;
printf("Time: %f\n", result_time);
My question is: is there any mistake? It seems pretty straightforward to me and should speed up.
I essentially already answered this question in parallelizing-matrix-times-a-vector-by-columns-and-by-rows-with-openmp.
You have a race condition when you write to results[y]. To fix this, and still parallelize the inner loop, you have to make private versions of results[y], fill them in parallel, and then merge them in a critical section.
In the code below I assume you're using double; replace it with float or int or whatever data type you're using (note that your inner loop runs over the first index of matrix[i][y], which is cache-unfriendly).
#pragma omp parallel num_threads(4)
{
int y,i;
double* results_private = (double*)calloc(matrix_size, sizeof(double));
for(y = 0; y < matrix_size ; y++) {
#pragma omp for
for(i = 0; i < matrix_size; i++) {
results_private[y] += vector[i]*matrix[i][y];
}
}
#pragma omp critical
{
for(y=0; y<matrix_size; y++) results[y] += results_private[y];
}
free(results_private);
}
If this is a homework assignment and you want to really impress your instructor, then it's possible to do the merging without a critical section. See fill-histograms-array-reduction-in-parallel-with-openmp-without-using-a-critic to get an idea of what to do, though I can't promise it will be faster.
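For reference, a sketch of how such a merge could look (this is my own illustration of the idea, not code from that link; the scratch buffer results_all and the variable nthreads are names I introduce here, while matrix_size, vector, matrix and results are as in the question):
#include <stdlib.h>
#include <omp.h>

int nthreads;
double *results_all;   /* nthreads * matrix_size scratch buffer */

#pragma omp parallel num_threads(4)
{
    const int tid = omp_get_thread_num();

    #pragma omp single
    {
        nthreads = omp_get_num_threads();
        results_all = (double*)calloc((size_t)nthreads * matrix_size, sizeof(double));
    }   /* implicit barrier: the buffer is ready for every thread */

    /* each thread accumulates into its own row of the scratch buffer */
    double *mine = &results_all[tid * matrix_size];
    for(int y = 0; y < matrix_size; y++) {
        #pragma omp for
        for(int i = 0; i < matrix_size; i++)
            mine[y] += vector[i]*matrix[i][y];
    }

    /* merge without a critical section: each thread owns a slice of y */
    #pragma omp for
    for(int y = 0; y < matrix_size; y++)
        for(int t = 0; t < nthreads; t++)
            results[y] += results_all[t*matrix_size + y];
}
free(results_all);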
I've not done any parallel programming for a while now, nor any Maths for that matter, but don't you want to split the rows of the matrix in parallel, rather than the columns?
What happens if you try this:
double start_time = clock();
#pragma omp parallel private(i) num_threads(4)
{
tid = omp_get_thread_num();
world_size = omp_get_num_threads();
printf("Threads: %d\n",world_size);
#pragma omp parallel for private(y) shared(results, vector, matrix)
for(y = 0; y < matrix_size ; y++){
for(i = 0; i < matrix_size; i++){
results[y] = results[y] + vector[i]*matrix[i][y];
}
}
}
double end_time = clock();
double result_time = (end_time - start_time) / CLOCKS_PER_SEC;
printf("Time: %f\n", result_time);
Also, are you sure everything's OK when compiling and linking with OpenMP?
You have a typical case of cache conflicts (false sharing).
Consider that a cache line on your CPU is probably 64 bytes long. Having one processor/core write to the first few bytes of it causes that cache line to be invalidated in every other L1/L2 and maybe L3 cache. This is a lot of overhead.
Partition your data better!
#pragma omp parallel for private(i) shared(results, vector, matrix) schedule(static,16)
should do the trick. Increase the chunksize if this does not help.
Another optimisation is to store the result locally before you flush it down to memory.
Also, this is an OpenMP thing, but you don't need to start a new parallel region for the loop (each mention of parallel starts a new team):
#pragma omp parallel default(none) \
        shared(vector, matrix, results) \
        firstprivate(matrix_size) \
        num_threads(4)
{
    int i, y;
    #pragma omp for schedule(static,16)
    for(y = 0; y < matrix_size; y++){
        double result = 0;
        for(i = 0; i < matrix_size; i++){
            result += vector[i]*matrix[i][y];
        }
        results[y] = result;
    }
}
I'm writing a parallel version of the Longest Common Subsequence algorithm using OpenMP.
The sequential version is the following (and it works correctly):
// Preparing first row and first column with zeros
for(j=0; j < (len2+1); j++)
score[0][j] = 0;
for(i=0; i < (len1+1); i++)
score[i][0] = 0;
// Calculating scores
for(i=1; i < (len1+1); i++) {
for(j=1; j < (len2+1) ;j++) {
if (seq1[i-1] == seq2[j-1]) {
score[i][j] = score[i-1][j-1] + 1;
}
else {
score[i][j] = max(score[i-1][j], score[i][j-1]);
}
}
}
The critical part is filling in the score matrix, and this is the part I'm mostly trying to parallelize.
One way to do it (which I chose) is to fill the matrix by anti-diagonals, so the left, top and top-left dependencies are always satisfied. In a nutshell, I keep track of the diagonal (third loop, variable i below) and threads fill up that diagonal in parallel.
For this purpose, I've written this code:
void parallelCalculateLCS(int len1, int len2, char *seq1, char *seq2) {
int score[len1 + 1][len2 + 1];
int i, j, k, iam;
char *lcs = NULL;
for(i=0;i<len1+1;i++)
for(j=0;j<len2+1;j++)
score[i][j] = -1;
#pragma omp parallel default(shared) private(iam)
{
iam = omp_get_thread_num();
// Preparing first row and first column with zeros
#pragma omp for
for(j=0; j < (len2+1); j++)
score[0][j] = iam;
#pragma omp for
for(i=0; i < (len1+1); i++)
score[i][0] = iam;
// Calculating scores
for(i=1; i < (len1+1); i++) {
k=i;
#pragma omp for
for(j=1; j <= i; j++) {
if (seq1[k-1] == seq2[j-1]) {
// score[k][j] = score[k-1][j-1] + 1;
score[k][j] = iam;
}
else {
// score[k][j] = max(score[k-1][j], score[k][j-1]);
score[k][j] = iam;
}
#pragma omp atomic
k--;
}
}
}
}
The first two loops (first row and column) work correctly and threads fill up cells in a balanced way.
When it comes to filling in the matrix (diagonally), nothing works well. I tried to debug it, but the threads seem to act and write things randomly.
I can't figure out what's going wrong, since in the first two loops there were no problems at all.
Any idea?
P.S. I know that accessing the matrix diagonally is very cache-unfriendly and the threads could be unbalanced, but for now I only need it to work.
P.S. #2: I don't know if it is useful, but my CPU supports up to 8 threads.
#pragma omp atomic means that the threads will perform the operation one at a time. You are looking for #pragma omp for private(k): the threads will then no longer share the same value. Bye, Francis
The following nested for loop
#pragma omp for
for(j=1; j <= i; j++)
will be executed in parallel, each thread with a different value of j in no specific order.
As nothing is specified in the omp for clause, k will be shared by default among all threads. So, depending on the order of the threads, k will be decremented at an unknown time (even with the omp atomic). So, for a fixed j, the value of k might change during the execution of the body of the loop (between the if clauses, ...).
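To make that concrete, here is a sketch of the score-filling part with the row index derived from j instead of a shared, decremented k (my sketch, assuming the mapping k = i - j + 1 implied by the original loop and the same max as in the sequential version; it would replace the score-calculation loops inside the existing parallel region):
// Calculating scores, one anti-diagonal per outer iteration.
// Every thread runs the outer loop; the barrier implied by "omp for"
// guarantees a diagonal is finished before the next one starts.
for(int i = 1; i < (len1+1); i++) {
    #pragma omp for
    for(int j = 1; j <= i; j++) {
        int k = i - j + 1;   /* derived from j: no shared counter to decrement */
        if (seq1[k-1] == seq2[j-1])
            score[k][j] = score[k-1][j-1] + 1;
        else
            score[k][j] = max(score[k-1][j], score[k][j-1]);
    }
}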