I am working with a signal matrix and my goal is to calculate the sum of all elements of a row. The matrix is represented by the following struct:
typedef struct matrix {
float *data;
int rows;
int cols;
int leading_dim;
} matrix;
I have to mention the matrix is stored in column-major order (http://en.wikipedia.org/wiki/Row-major_order#Column-major_order), which should explain the formula column * tan_hd.rows + row for retrieving the correct indices.
for(int row = 0; row < tan_hd.rows; row++) {
float sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for(int column = 0; column < tan_hd.cols; column++) {
sum += tan_hd.data[column * tan_hd.rows + row];
}
printf("row %d: %f", row, sum);
}
Without the OpenMP pragma, the result is correct and looks like this:
row 0: 8172539.500000 row 1: 8194582.000000
As soon as I add the #pragma omp... as described above, a different (wrong) result is returned:
row 0: 8085544.000000 row 1: 8107186.000000
In my understanding, reduction(+:sum) creates private copies of sum for each thread, and after completing the loop these partial results are summed up and written back to the global variable sum again. What is it that I am doing wrong?
I appreciate your suggestions!
Use the Kahan summation algorithm
It has the same algorithmic complexity as a naive summation
It will greatly increase the accuracy of a summation, without requiring you to switch data types to double.
By rewriting your code to implement it:
for(int row = 0; row < tan_hd.rows; row++) {
float sum = 0.0, c = 0.0;
#pragma omp parallel for reduction(+:sum,c)
for(int column = 0; column < tan_hd.cols; column++) {
float y = tan_hd.data[column * tan_hd.rows + row] - c;
float t = sum + y;
c = (t - sum) - y;
sum = t;
}
sum = sum - c;
printf("row %d: %f", row, sum);
}
You can additionally switch all float to double to achieve a higher precision, but since your array is a float array, there should only be differences in the last few significant digits.
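For reference, here is a minimal sketch of that second option (my illustration, not a drop-in replacement for the Kahan version above): keep the float data, but accumulate into a double so that the order in which the partial sums are combined matters far less:
for(int row = 0; row < tan_hd.rows; row++) {
    double sum = 0.0; /* double accumulator; the float data stays untouched */
    #pragma omp parallel for reduction(+:sum)
    for(int column = 0; column < tan_hd.cols; column++) {
        sum += (double) tan_hd.data[column * tan_hd.rows + row];
    }
    printf("row %d: %f", row, sum);
}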
I'm trying to parallelize this piece of code that searches for a max on a column.
The problem is that the parallelized version runs slower than the serial one.
Probably the search for the pivot (max on a column) is slower due to the synchronization on the maximum value and the index, right?
int i,j,t,k;
// Decrease the dimension by 1 and iterate each time
for (i=0, j=0; i < rwA && j < cwA; i++, j++) {
int i_max = i; // max index set as i
double matrixA_maxCw_value = fabs(matrixA[i_max][j]);
#pragma omp parallel for reduction(max:matrixA_maxCw_value,i_max) //OVERHEAD
for (t = i+1; t < rwA; t++) {
if (fabs(matrixA[t][j]) > matrixA_maxCw_value) {
matrixA_maxCw_value = matrixA[t][j];
i_max = t;
}
}
if (matrixA[i_max][j] == 0) {
j++; //Check if there is a pivot in the column, if not pass to the next column
}
else {
//Swap the rows, of A, L and P
#pragma omp parallel for //OVERHEAD
for (k = 0; k < cwA; k++) {
swapRows(matrixA, i, k, i_max);
swapRows(P, i, k, i_max);
if(k < i) {
swapRows(L, i, k, i_max);
}
}
lupFactorization(matrixA,L,i,j,rwA);
}
}
void swapRows(double **matrixA, int i, int j, int i_max) {
double temp_val = matrixA[i][j];
matrixA[i][j] = matrixA[i_max][j];
matrixA[i_max][j] = temp_val;
}
I do not want different code; I only want to know why this happens. On a 1000x1000 matrix, the serial version takes 4.1 s and the parallelized version 4.28 s.
The same thing happens (the overhead is very small, but it is there) for the row swap, which in theory can be done in parallel without any problem. Why does this happen?
There are at least two things wrong with your parallelization
#pragma omp parallel for reduction(max:matrixA_maxCw_value,i_max) //OVERHEAD
for (t = i+1; t < rwA; t++) {
if (fabs(matrixA[t][j]) > matrixA_maxCw_value) {
matrixA_maxCw_value = matrixA[t][j];
i_max = t;
}
}
You are getting the biggest index of all of them, but that does not mean it belongs to the max value. For instance, looking at the following array:
[8, 7, 6, 5, 4, 3, 2, 1]
If you parallelize it with two threads, the first thread will have max=8 and index=0, and the second thread will have max=4 and index=4. After the reduction is done, the max will be 8 but the index will be 4, which is obviously wrong.
OpenMP has built-in reduction operators that each consider a single target value; in your case, however, you want to reduce while taking two values into account: the max and the array index. Since OpenMP 4.0 you can create your own reduction (i.e., a User-Defined Reduction).
You can have a look at a full example implementing such logic here
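As a rough sketch of what such a user-defined reduction could look like here (the struct and reduction names below are mine, not from your code):
#include <float.h>
#include <math.h>

/* value and row index reduced together, so the index always belongs to the max value */
struct maxloc { double val; int loc; };

#pragma omp declare reduction(maxloc : struct maxloc :               \
        omp_out = (omp_in.val > omp_out.val ? omp_in : omp_out))     \
        initializer(omp_priv = (struct maxloc){ -DBL_MAX, -1 })

/* ... inside the outer factorization loop ... */
struct maxloc best = { fabs(matrixA[i][j]), i };
#pragma omp parallel for reduction(maxloc : best)
for (int t = i + 1; t < rwA; t++) {
    double v = fabs(matrixA[t][j]);
    if (v > best.val) { best.val = v; best.loc = t; }
}
/* best.val is the pivot magnitude and best.loc plays the role of i_max */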
The other issue is this part:
#pragma omp parallel for //OVERHEAD
for (k = 0; k < cwA; k++) {
swapRows(matrixA, i, k, i_max);
swapRows(P, i, k, i_max);
if(k < i) {
swapRows(L, i, k, i_max);
}
}
You are swapping those elements in parallel, which leads to an inconsistent state.
First you need to fix those issues before analyzing why your code is not getting any speedup.
First correctness, then efficiency. But don't expect much speedup with the current implementation: the amount of computation performed in parallel is not enough to justify the overhead of the parallelism.
I'd like to generate a random matrix with OpenMP as if it were generated by a sequential program, i.e. if a sequential matrix generator outputs a matrix like the following one:
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 0.0 1.0 2.0
3.0 4.0 5.0 6.0
I want the parallel OpenMP version of the same program to generate the same matrix with no interleaved rows.
Here is how I gradually approached the problem.
Given my serial generator C function generating a matrix as a 1D array:
void generate_matrix_array(
double *v,
int rows,
int columns,
double min,
double max,
int seed
) {
srand(seed);
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
}
}
}
First, I naively applied the #pragma omp parallel for directive to the outer for loop; however, there's no guarantee about row ordering, since thread execution gets interleaved, so the rows are generated in a non-deterministic order.
Adding the ordered option would solve the issue, at the price of making multithreading useless in this particular case.
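Just to make explicit what I mean, the ordered variant would look roughly like this (a sketch; the rand() calls end up serialized inside the ordered block, so the threads mostly wait on each other):
void generate_matrix_array_ord(
    double *v,
    int rows,
    int columns,
    double min,
    double max,
    int seed
) {
    srand(seed);
    #pragma omp parallel for ordered shared(v)
    for (int i = 0; i < rows; i++) {
        #pragma omp ordered
        for (int j = 0; j < columns; j++) {
            /* same indexing as in my serial generator */
            v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
        }
    }
}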
In order to solve the issue, I tried to partition by hand the matrix array so that thread i would generate the i-th slice of it:
void generate_matrix_array_par(
double *v,
int rows,
int columns,
double min,
double max,
int seed
) {
srand(seed);
#pragma omp parallel \
shared(v)
{
int tid = omp_get_thread_num();
int nthreads = omp_get_num_threads();
int rows_per_thread = round(rows / (double) nthreads);
int rem_rows = rows % (nthreads - 1) != 0?
rows % (nthreads - 1):
rows_per_thread;
int local_rows = (tid == 0)?
rows_per_thread:
rem_rows;
int lower_row = tid * local_rows;
int upper_row = ((tid + 1) * local_rows);
printf(
"[T%d] receiving %d of %d rows from row %d to %d\n",
tid,
local_rows,
rows,
lower_row,
upper_row - 1
);
printf("\n");
fflush(stdout);
for (int i = lower_row; i < upper_row; i++) {
for (int j = 0; j < columns; j++) {
v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
}
}
}
}
However, despite the matrix array being properly divided among threads, for some reason unknown to me every thread generates its rows in a non-deterministic position: e.g., if I generate an 8x8 matrix with 4 threads and thread 3 is assigned rows 4 and 5, it generates two contiguous rows in the matrix array but in the wrong position every time, as if I hadn't performed any partitioning and the omp parallel for directive were in place.
At last, I skeptically tried to get back to the naive approach by adding the shared(v) and schedule(static, chunk_size) options to the omp parallel for directive, and it 'magically' happens to work:
void generate_matrix_array_par(
double *v,
int rows,
int columns,
double min,
double max,
int seed
) {
srand(seed);
int nthreads = omp_get_max_threads();
int chunk_size = (rows * columns) / nthreads;
#pragma omp parallel for \
shared(v) \
schedule(static, chunk_size)
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
}
}
}
The schedule option was added because I read somewhere else that it gets rid of cache conflicts. Edit: it looks like schedule(static, chunk_size) hands out chunks of iterations to threads in a round-robin fashion according to the given chunk size; so if I hand out N/nthreads-sized chunks, the data is assigned in a single round.
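A quick throwaway test I used to convince myself of this mapping (not part of the generator):
#include <stdio.h>
#include <omp.h>

int main(void) {
    int rows = 8;
    int chunk = rows / omp_get_max_threads();
    if (chunk < 1) chunk = 1;
    /* with chunk = rows/nthreads each thread gets exactly one contiguous
       block of rows, i.e. the round-robin distribution has a single round */
    #pragma omp parallel for schedule(static, chunk)
    for (int i = 0; i < rows; i++) {
        printf("row %d -> thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}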
Any question? YES!!!
Now, I'd like to know whether I missed or got wrong some consideration about the problem, since I'm not convinced about the fairness of my last version of the program, despite the fact that it works.
I'm working through CUDA C Programming by Cheng, and came across this piece of code:
void sumMatrixOnHost (float *A, float *B, float *C, const int nx, const int ny) {
float *ia = A;
float *ib = B;
float *ic = C;
for (int iy=0; iy<ny; iy++) {
for (int ix=0; ix<nx; ix++) {
ic[ix] = ia[ix] + ib[ix];
}
ia += nx; ib += nx; ic += nx;
}
}
This is for matrix addition whereby the matrices are stored in a row-major format.
As I understand, the inner for loop is iterating over a row and performing element addition, and the outer for loop is then used to increment the pointers to the start of the next row.
Why is this approach better than using pointers over the whole matrix, i.e.
for (int i=0; i<ny*nx; i++) {
ic[i] = ia[i] + ib[i];
}
or dual for loops, i.e.
for (int iy=0; iy<ny; iy++) {
for (int ix=0; ix<nx; ix++) {
ic[iy*nx+ix] = ia[iy*nx+ix] + ib[iy*nx+ix];
}
}
Is this something to do with how it is optimized by the compiler?
The simplest approach is always the best approach:
for (int i=0; i<ny*nx; i++) {
C[i] = A[i] + B[i];
}
This will be faster than the first solution. The problem with splitting the matrix up by row is that the vectoriser will do:
Process each line in batches of 32 bytes (the size of a YMM register).
Process the remaining handful of values at the end of the line.
Now repeat for each line!
If however you do it with a single loop, the code generated will be:
Process all data in batches of 32 bytes (the size of a YMM register).
Process the remaining handful of values at the end of the matrix that don't align to 32-byte blocks.
The first version just adds pointless code to process the inner loop. None of that code is needed; it just breaks the ability to vectorise the entire matrix.
The approach in sumMatrixOnHost is better for optimization, and it should (generally) execute faster than the two approaches you have suggested.
For the ALU, multiplication takes more time than addition.
So in sumMatrixOnHost there is no multiplication, while in
for (int i=0; i<ny*nx; i++) {
ic[i] = ia[i] + ib[i];
}
there is multiplication in each iteration of the loop.
in
for (int iy=0; iy<ny; iy++) {
for (int ix=0; ix<nx; ix++) {
ic[iy*nx+ix] = ia[iy*nx+ix] + ib[iy*nx+ix];
}
}
there are three multiplications in each iteration of the loop.
A simpler approach can be
int n = ny*nx;
for (int i=0; i<n; i++) {
ic[i] = ia[i] + ib[i];
}
but with the last approach we lose another thing that is good about sumMatrixOnHost: the ability to do the operation on matrix blocks rather than on the whole matrix.
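For instance, a rough sketch (my own naming, not from the book) of how the sumMatrixOnHost style extends to a block of the matrix:
/* Add only the block of rows [r0, r1) and columns [c0, c1) of nx-wide,
   row-major matrices; the pointers advance one row at a time as before. */
void sumBlockOnHost(float *A, float *B, float *C, const int nx,
                    int r0, int r1, int c0, int c1) {
    float *ia = A + r0 * nx;
    float *ib = B + r0 * nx;
    float *ic = C + r0 * nx;
    for (int iy = r0; iy < r1; iy++) {
        for (int ix = c0; ix < c1; ix++) {
            ic[ix] = ia[ix] + ib[ix];
        }
        ia += nx; ib += nx; ic += nx;
    }
}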
Every algorithm I've come across always uses a square matrix for tiling matrix multiplication. Is it possible to tile MMM when the two matrices are of completely different sizes?
This is the code I'm using at the moment and I would like to try and add tiling to improve my performance.
#pragma omp parallel shared(x, R, r, row) private(i, j, k, temp)
{
#pragma omp for
for (k = 0; k < x->n; k++){
for (i = row; i < row+2; i++){
temp = x->v[i*x->n+k];
for (j = 0; j < R->n; j++){
r[i*R->n+j] += temp*R->v[k*R->n+j];
}
}
}
}
The variables x, R, and r are matrix structures with members m, n, and v holding the sizes and the data. Matrix x is a rotation matrix, so it is an identity matrix with data in two rows; that's why I am limiting the multiplication to two rows only.
Matrix x therefore has a size of 2xn and R has a size of mxn. Note: It has been defined that x's n is always R's m so there is no size mismatch.
Is it possible to tile this type of mismatched matrix multiplication and still get a performance gain?
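The kind of blocking I have in mind (just an untested sketch, assuming v holds doubles; TILE would be a tuning parameter) splits the j loop into tiles and parallelizes over the tiles instead of over k:
#define TILE 64

#pragma omp parallel for shared(x, R, r, row)
for (int jb = 0; jb < R->n; jb += TILE) {
    int jend = jb + TILE < R->n ? jb + TILE : R->n;
    for (int k = 0; k < x->n; k++) {
        for (int i = row; i < row + 2; i++) {
            double temp = x->v[i * x->n + k];
            for (int j = jb; j < jend; j++) {
                /* each tile of columns is touched by exactly one thread,
                   so the += accumulation over k never collides */
                r[i * R->n + j] += temp * R->v[k * R->n + j];
            }
        }
    }
}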
I'm relatively new to R programming, and this website has been very helpful for me so far, but I was unable to find a question that already covered what I want to know. So I decided to post a question myself.
My problem is the following: I want to find efficient ways to compute cumulative sums over four-dimensional arrays, i.e. I have data in a four-dimensional array x and want to write a function that computes an array x_sum such that
x_sum[i,j,k,l] = sum_{ind1 <= i, ind2 <= j, ind3 <= k, ind4 <=l} x[ind1, ind2, ind3, ind4].
I want to use this function billions of times, which makes it very important that it be as efficient as possible. Although I have come up with several ways to calculate the sums (see below), I suspect more experienced R programmers might be able to find a more efficient solution. So if anyone can suggest a better way of doing this, I would be very grateful.
Here's what I've tried so far:
I have found three different implementations (each of which brought a gain in speed) that work (see code below):
One in R using the cumsum() function (cumsum_4R) and two implementations where the „heavy lifting“ is done in C (using the .C() interface).
The first implementation in C is merely a naive attempt to write the sums using nested for-loops and pointer arithmetic (cumsumC_4_old).
In the second C-implementation (cumsumC_4) I tried to adapt my code using the ideas in the following article
As you can see in the source code below, the adaptation is somewhat lopsided: for some dimensions I was able to replace all of the nested for-loops, but not for others. Do you have ideas on how to do that?
Using microbenchmark on the three implementations, I get the following result for arrays of size 40x40x40x40:
Unit: milliseconds
             expr       min         lq       mean     median         uq       max neval
     cumsum_4R(x) 976.13258 1029.33371 1064.35100 1051.37782 1074.23234 1381.5832    50
 cumsumC_4_old(x) 174.72868  177.95875  192.75392  184.11121  203.18141  283.2384    50
     cumsumC_4(x)  56.87169   57.73512   67.34714   63.20269   68.80326  105.7099    50
Additional information:
1) Since this made it easier to install any needed packages, I ran the benchmarks on my personal computer under Windows, but I plan on running the finished simulations on a computer from my university, which runs under Linux.
EDIT: 2) The four-dimensional data x[i,j,k,l] is actually the result of the product of two applications of the outer function: first, the outer product of a matrix with itself (i.e. outer(mat, mat)), and second, the pairwise minima of another matrix with itself (i.e. outer(mat2, mat2, pmin)). The data is then the product
x = outer(mat, mat) * outer(mat2, mat2, pmin),
i.e.
x[i,j,k,l] = mat[i,j] * mat[k,l] * min(mat2[i,j], mat2[k,l])
The four-dimensional array has the corresponding symmetries.
3) The reason I need these cumulative sums in the first place is that I want to run simulations of a test for which I need partial sums over „rectangles“ of indices: I want to iterate over all sums of the form
sum_{k1<=i1 <= m1,k2<=i2 <= m2, k1 <= i3 <= m1, k2 <= i4 <=m2} x[i1, i2, i3, i4],
where 1<=k1<=m1<=n, 1<=k2<=m2<=n. In order to avoid calculating the sum of the same variables over and over again, I first calculate all the cumulative sums and then calculate the sums over rectangles as functions of the cumulative sums. Do you know of a more efficient way to do this?
EDIT to 3): In order to include all potentially important aspects: I also want to calculate sums of the form
sum_{k1<=i1 <= m1,k2<=i2 <= m2, 1 <= i3 <= n, 1 <= i4 <=n} x[i1, i2, i3, i4].
(Since I can trivially obtain them using the cumulative sums, I had not included this specification before).
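(To spell out how I compute a rectangle sum from the cumulative sums, writing S for the array of cumulative sums: in one dimension, sum_{k <= i <= m} x[i] = S[m] - S[k-1]; in two dimensions, sum_{k1 <= i1 <= m1, k2 <= i2 <= m2} x[i1,i2] = S[m1,m2] - S[k1-1,m2] - S[m1,k2-1] + S[k1-1,k2-1], with the convention that S is 0 whenever an index is 0. The four-dimensional case works the same way, just with 16 signed terms.)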
Here is the C code I use (which I save as „cumsumC.c“):
#include<R.h>
#include<math.h>
#include <stdio.h>
int min(int a, int b){
if(a <= b) return a;
else return b;
}
void cumsumC_4_old(double* x, int* nv){
int n = *nv;
int n2 = n*n;
int n3 = n*n*n;
//Dim 1
for(int i=0; i<n; i++){
for(int j=0; j<n; j++){
for(int k=0; k<n; k++){
for(int l=1; l<n; l++){
x[i+j*n+k*n2+l*n3] += x[i + j*n +k*n2+(l-1)*n3];
}
}
}
}
//Dim 2
for(int i=0; i<n; i++){
for(int j=0; j<n; j++){
for(int k=1; k<n; k++){
for(int l=0; l<n; l++){
x[i+j*n+k*n2+l*n3] += x[i + j*n +(k-1)*n2+l*n3];
}
}
}
}
//Dim 3
for(int i=0; i<n; i++){
for(int j=1; j<n; j++){
for(int k=0; k<n; k++){
for(int l=0; l<n; l++){
x[i+j*n+k*n2+l*n3] += x[i + (j-1)*n +k*n2+l*n3];
}
}
}
}
//Dim 4
for(int i=1; i<n; i++){
for(int j=0; j<n; j++){
for(int k=0; k<n; k++){
for(int l=0; l<n; l++){
x[i+j*n+k*n2+l*n3] += x[i-1 + j*n +k*n2+l*n3];
}
}
}
}
}
void cumsumC_4(double* x, int* nv){
int n = *nv;
int n2 = n*n;
int n3 = n*n*n;
long ind1, ind2;
long index, indexges = n +(n-1)*n+(n-1)*n2+(n-1)*n3, indexend;
//Dim 1
index = n3;
while(index != indexges){
x[index] += x[index-n3];
index++;
}
//Dim 2
long teilind = n+(n-1)*n;
for(int k=1; k<n; k++){
ind1 = k*n2;
ind2 = ind1 - n2;
for(int l=0; l<n; l++){
index = l*n3;
indexend = teilind+index;
while(index != indexend){
x[index+ind1] += x[index+ind2];
index++;
}
}
}
//Dim 3
ind1 = n;
while(ind1 < n+(n-1)*n){
index = 0;
indexend = indexges - ind1;
ind2 = ind1-n;
while(index < indexend){
x[ind1+index] += x[ind2+index];
index += n2;
}
ind1++;
}
//Dim 4
index = 0;
int i;
long minind;
while(index < indexges){
i = 1;
minind = min(indexges, index+n);
while(index+i < minind){
x[index+i] += x[index+i-1];
i++;
}
index+=n;
}
}
Here is the R function „cumsum_4R“ and code used to call and compare the cumulative sum functions in R (under Windows; for Linux, the commands dyn.load/dyn.unload need to be adjusted; ideally, I want to use the functions on 50^4 size arrays, but since the call to microbenchmark would then take a while, I have chosen n=40 here):
library("microbenchmark")
# dyn.load("cumsumC.so")
dyn.load("cumsumC.dll")
cumsum_4R <- function(x){
return(aperm(apply(apply(aperm(apply(apply(x, 2:4,function(a) cumsum(as.numeric(a))), c(1,3,4) , function(a) cumsum(as.numeric(a))), c(2,1,3,4)), c(1,2,4), function(a) cumsum(as.numeric(a))), 1:3, function(a) cumsum(as.numeric(a))), c(3,4,2,1)))
}
cumsumC_4_old <- function(x){
n <- dim(x)[1]
arr <- array(.C("cumsumC_4_old", res=as.double(x), as.integer(n))$res, dim=c(n,n,n,n))
return(arr)
}
cumsumC_4 <- function(x){
n <- dim(x)[1]
arr <- array(.C("cumsumC_4", res=as.double(x), as.integer(n))$res, dim=c(n,n,n,n))
return(arr)
}
set.seed(1234)
n <- 40
x <- array(rnorm(n^4),dim=c(n,n,n,n))
r <- 6 #parameter for rounding results for comparison
res1 <- cumsum_4R(x)
res2 <- cumsumC_4_old(x)
res3 <- cumsumC_4(x)
print(c("Identical R and C1:", identical(round(res1,r),round(res2,r))))
print(c("Identical R and C2:",identical(round(res1,r),round(res3,r))))
times <- microbenchmark(cumsum_4R(x), cumsumC_4_old(x),cumsumC_4(x),times=50)
print(times)
dyn.unload("cumsumC.dll")
# dyn.unload("cumsumC.so")
Thank you for your help!
You can indeed use points 2 and 3 in your original question to solve the problem more efficiently. Actually, this makes the problem separable. By separable I mean that the limits of the 4 sums in point 3 do not depend on the variables you sum over. This, together with the fact that x is an outer product of 2 matrices, enables you to separate the 4-fold sum in point 3 into an outer product of two 2-fold sums. Even better: the 2 matrices used to define x are the same (denoted by mat by you), so the two 2-fold sums give the same matrix, which has to be calculated only once.
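In formulas: since x = outer(mat, mat) means x[i1,i2,i3,i4] = mat[i1,i2] * mat[i3,i4], the cumulative sum factors as
sum_{i1<=i, i2<=j, i3<=k, i4<=l} x[i1,i2,i3,i4] = (sum_{i1<=i, i2<=j} mat[i1,i2]) * (sum_{i3<=k, i4<=l} mat[i3,i4]),
i.e. the four-dimensional cumulative sum array is just the outer product of the two-dimensional cumulative sum of mat with itself.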
Here is the code:
set.seed(1234)
n=40
mat=array(rnorm(n^2),dim=c(n,n))
x=outer(mat,mat)
cumsum_sep=function(x) {
#calculate matrix corresponding to 2-fold sums
#actually it's just one matrix because x is an outer product of mat with itself
tmp=t(apply(apply(x,2,cumsum),1,cumsum))
#outer product of two-fold sums
outer(tmp,tmp)
}
y1=cumsum_4R(x)
#note that cumsum_sep operates on the original matrix mat!
y2=cumsum_sep(mat)
Check whether the results are the same:
all.equal(y1,y2)
[1] TRUE
This gives the following benchmark results:
microbenchmark(cumsum_4R(x),cumsum_sep(mat),times=10)
Unit: milliseconds
            expr         min          lq       mean     median         uq       max neval cld
    cumsum_4R(x) 2084.454155 2135.852305 2226.59692 2251.95928 2270.15198 2402.2724    10   b
 cumsum_sep(mat)    6.844939    7.145546   32.75852   14.45762   34.94397  120.0846    10  a
Quite a difference! :)