I have this code that transposes a matrix using a loop tiling strategy.
void transposer(int n, int m, double *dst, const double *src) {
    int blocksize = 32; // tile size (chosen here for illustration; assumed to divide n and m evenly)
    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < m; j += blocksize) {
            // transpose the block beginning at [i,j]
            for (int k = i; k < i + blocksize; ++k) {
                for (int l = j; l < j + blocksize; ++l) {
                    dst[k + l*n] = src[l + k*m];
                }
            }
        }
    }
}
I want to optimize this with multi-threading using OpenMP; however, I am not sure what to do with so many nested for loops. I thought about just adding #pragma omp parallel for, but doesn't this just parallelize the outer loop?
When you try to parallelize a loop nest, you should ask yourself how many levels are conflict free, meaning that every iteration writes to a different location. If two iterations (potentially) write to the same location, you need to
1. use a reduction,
2. use a critical section or other synchronization,
3. decide that this loop is not worth parallelizing, or
4. rewrite your algorithm.
In your case, the write location dst[k + l*n] depends on k and l. Since k < n, two distinct pairs (k,l) and (k',l') can never map to the same location. Furthermore, no two inner iterations use the same (k,l) combination. So all four loops are parallel, and they are perfectly nested, so you can use collapse(4).
You could also have drawn this conclusion by considering the algorithm in the abstract: in a matrix transposition each target location is written exactly once, so no matter how you traverse the target data structure, it's completely parallel.
You can use the collapse clause to parallelize over the two outer loops:
# pragma omp parallel for collapse(2)
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < m; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*m];
}
}
}
}
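For completeness, the collapse(4) variant mentioned above would look like the sketch below (note that the bounds of the k and l loops depend on i and j, so this non-rectangular collapse needs a compiler that implements OpenMP 5.0 or later):
#pragma omp parallel for collapse(4)
for (int i = 0; i < n; i += blocksize)
    for (int j = 0; j < m; j += blocksize)
        for (int k = i; k < i + blocksize; ++k)
            for (int l = j; l < j + blocksize; ++l)
                dst[k + l*n] = src[l + k*m];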
As a side note, I think you should swap the two innermost loops. Usually, when you have a choice between writing sequentially and reading sequentially, writing sequentially is more important for performance.
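That swap would look like this inside the block kernel (a minimal sketch; only the k and l loops are exchanged so that consecutive iterations write consecutive dst elements):
for (int l = j; l < j + blocksize; ++l) {
    for (int k = i; k < i + blocksize; ++k) {
        dst[k + l*n] = src[l + k*m]; // k runs fastest: unit-stride writes to dst
    }
}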
I thought about just adding #pragma omp parallel for, but doesn't this just parallelize the outer loop?
Yes. To parallelize multiple nested loops you can use OpenMP's collapse clause. Bear in mind, however, that:
As pointed out by Victor Eijkhout, even though it does not directly apply to your code snippet, for each additional loop that you parallelize you should reason about the potential new race conditions that the parallelization might introduce, for example different threads writing concurrently to the same dst position.
In some cases, parallelizing nested loops may result in slower execution times than parallelizing a single loop, because the implementation of the collapse clause uses a more complex heuristic (than simple loop parallelization) to divide the iterations of the loops among threads, which can result in an overhead higher than the gains it provides.
You should benchmark with a single parallel loop, then with two, and so on, and compare the results.
void transposer(int n, int m, double *dst, const double *src) {
    int blocksize = 32; // tile size (as in the question; assumed to divide n and m evenly)
    #pragma omp parallel for collapse(...)
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < m; j += blocksize)
            for (int k = i; k < i + blocksize; ++k)
                for (int l = j; l < j + blocksize; ++l)
                    dst[k + l*n] = src[l + k*m];
}
Depending on the number of threads, cores, and matrix sizes, among other factors, the sequential version might actually run faster than the parallel versions. This is especially true because your code is not very CPU intensive (i.e., dst[k + l*n] = src[l + k*m];).
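To compare the variants as suggested above, a minimal timing sketch could look like this (benchmark_transpose is a made-up name for illustration; omp_get_wtime is OpenMP's wall-clock timer):
#include <omp.h>

void transposer(int n, int m, double *dst, const double *src); /* variant under test */

/* Returns the best-of-'repeats' wall-clock time, in seconds, of one transposer() call. */
double benchmark_transpose(int n, int m, double *dst, const double *src, int repeats) {
    double best = 1e300;
    for (int r = 0; r < repeats; ++r) {
        double t0 = omp_get_wtime();
        transposer(n, m, dst, src);
        double t1 = omp_get_wtime();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}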
I'm trying to parallelize this piece of code that searches for the max on a column.
The problem is that the parallelized version runs slower than the serial one.
Probably the search for the pivot (the max on a column) is slower due to the synchronization on the maximum value and the index, right?
int i,j,t,k;
// Decrease the dimension of a factor 1 and iterate each time
for (i=0, j=0; i < rwA && j < cwA; i++, j++) {
int i_max = i; // max index set as i
double matrixA_maxCw_value = fabs(matrixA[i_max][j]);
#pragma omp parallel for reduction(max:matrixA_maxCw_value,i_max) //OVERHEAD
for (t = i+1; t < rwA; t++) {
if (fabs(matrixA[t][j]) > matrixA_maxCw_value) {
matrixA_maxCw_value = matrixA[t][j];
i_max = t;
}
}
if (matrixA[i_max][j] == 0) {
j++; //Check if there is a pivot in the column, if not pass to the next column
}
else {
//Swap the rows, of A, L and P
#pragma omp parallel for //OVERHEAD
for (k = 0; k < cwA; k++) {
swapRows(matrixA, i, k, i_max);
swapRows(P, i, k, i_max);
if(k < i) {
swapRows(L, i, k, i_max);
}
}
lupFactorization(matrixA,L,i,j,rwA);
}
}
void swapRows(double **matrixA, int i, int j, int i_max) {
double temp_val = matrixA[i][j];
matrixA[i][j] = matrixA[i_max][j];
matrixA[i_max][j] = temp_val;
}
I do not want different code; I only want to know why this happens. On a matrix of dimension 1000x1000 the serial version takes 4.1s and the parallelized version 4.28s.
The same thing happens (the overhead is very small, but it is there) on the swap of the rows, which theoretically can be done in parallel without problems. Why does this happen?
There are at least two things wrong with your parallelization:
#pragma omp parallel for reduction(max:matrixA_maxCw_value,i_max) //OVERHEAD
for (t = i+1; t < rwA; t++) {
if (fabs(matrixA[t][j]) > matrixA_maxCw_value) {
matrixA_maxCw_value = matrixA[t][j];
i_max = t;
}
}
You are getting the biggest index of all of them, but that does not mean that it belongs to the max value. For instance, look at the following array:
[8, 7, 6, 5, 4 ,3, 2 , 1]
If you parallelize it with two threads, the first thread will have max=8 and index=0, and the second thread will have max=4 and index=4. After the reduction is done, the max will be 8 but the index will be 4, which is obviously wrong.
OpenMP has built-in reduction operators that handle a single target value; in your case, however, you want to reduce over two values at once, the max and its array index. Since OpenMP 4.0 you can create your own reduction (i.e., a user-defined reduction).
You can have a look at a full example implementing such logic here
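For illustration, here is a minimal sketch of such a user-defined reduction for the pivot search (the maxloc_t type and the maxloc identifier are made-up names; this requires an OpenMP 4.0-capable compiler):
#include <math.h>

typedef struct { double value; int index; } maxloc_t;

// Combiner keeps whichever operand has the larger value; the private copies
// start below any possible fabs() result, so they never win by accident.
#pragma omp declare reduction(maxloc : maxloc_t : \
        omp_out = (omp_in.value > omp_out.value ? omp_in : omp_out)) \
    initializer(omp_priv = (maxloc_t){ -1.0, -1 })
and then, inside the factorization loop, the pivot search becomes:
maxloc_t best = { fabs(matrixA[i][j]), i };
#pragma omp parallel for reduction(maxloc : best)
for (int t = i + 1; t < rwA; t++) {
    double v = fabs(matrixA[t][j]);
    if (v > best.value) {
        best.value = v;
        best.index = t;
    }
}
i_max = best.index; // row of the pivot; best.value holds the corresponding fabs()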
The other issue is this part:
#pragma omp parallel for //OVERHEAD
for (k = 0; k < cwA; k++) {
swapRows(matrixA, i, k, i_max);
swapRows(P, i, k, i_max);
if(k < i) {
swapRows(L, i, k, i_max);
}
}
You are swapping those elements in parallel, which leads to an inconsistent state.
First you need to solve those issues before analyzing why your code is not getting any speedup.
First correctness, then efficiency. But don't expect much speedup with the current implementation: the amount of computation performed in parallel is not enough to justify the overhead of the parallelism.
I'm relatively new to R programming, and this website has been very helpful for me so far, but I was unable to find a question that already covered what I want to know. So I decided to post a question myself.
My problem is the following: I want to find efficient ways to compute cumulative sums over four-dimensional arrays, i.e. I have data in a four-dimensional array x and want to write a function that computes an array x_sum such that
x_sum[i,j,k,l] = sum_{ind1 <= i, ind2 <= j, ind3 <= k, ind4 <=l} x[ind1, ind2, ind3, ind4].
I want to use this function billions of times, which makes it very important that it be as efficient as possible. Although I have come up with several ways to calculate the sums (see below), I suspect more experienced R programmers might be able to find a more efficient solution. So if anyone can suggest a better way of doing this, I would be very grateful.
Here's what I've tried so far:
I have found three different implementations (each of which brought a gain in speed) that work (see code below):
One in R using the cumsum() function (cumsum_4R) and two implementations where the „heavy lifting“ is done in C (using the .C() interface).
The first implementation in C is merely a naive attempt to write the sums using nested for-loops and pointer arithmetic (cumsumC_4_old).
In the second C-implementation (cumsumC_4) I tried to adapt my code using the ideas in the following article
As you can see in the source code below, the adaptation is relatively lopsided: For some dimensions, I was able to replace all the nested for-loops but not for others. Do you have ideas on how to do that?
Using microbenchmark on the three implementations, I get the following result for arrays of size 40x40x40x40:
Unit: milliseconds
             expr       min         lq       mean     median         uq       max neval
     cumsum_4R(x) 976.13258 1029.33371 1064.35100 1051.37782 1074.23234 1381.5832    50
 cumsumC_4_old(x) 174.72868  177.95875  192.75392  184.11121  203.18141  283.2384    50
     cumsumC_4(x)  56.87169   57.73512   67.34714   63.20269   68.80326  105.7099    50
Additional information:
1) Since this made it easier to install any needed packages, I ran the benchmarks on my personal computer under Windows, but I plan on running the finished simulations on a computer from my university, which runs under Linux.
EDIT: 2) The four-dimensional data x[i,j,k,l] is actually the result of the product of two applications of the outer function: First, the outer product of a matrix with itself (i.e. outer(mat,mat)) and then taking the pairwise minima of another matrix (i.e. outer(mat2, mat2, pmin)). Then the data is the product
x = outer(mat, mat) * outer(mat2, mat2, pmin),
i.e.
x[i,j,k,l] = mat[i,j] * mat[k,l] * min(mat2[i,j], mat2[k,l])
The four-dimensional array has the corresponding symmetries.
3)The reason I need these cumulative sums in the first place is that I want to run simulations of a test for which I need partial sums over „rectangles“ of indices: I want to iterate over all sums of the form
sum_{k1<=i1 <= m1,k2<=i2 <= m2, k1 <= i3 <= m1, k2 <= i4 <=m2} x[i1, i2, i3, i4],
where 1<=k1<=m1<=n, 1<=k2<=m2<=n. In order to avoid calculating the sum of the same variables over and over again, I first calculate all the cumulative sums and then calculate the sums over rectangles as functions of the cumulative sums. Do you know of a more efficient way to do this?
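For concreteness, in two dimensions this step looks as follows (an illustrative C sketch in the same spirit as the C code further below, using 0-based, column-major indexing; the 4D case works the same way, just with 16 inclusion-exclusion terms):
/* S is the cumulative-sum matrix of x, i.e. S[i + j*n] = sum of x over
 * rows 0..i and columns 0..j; out-of-range indices count as 0. */
static double S_at(const double *S, int n, int i, int j) {
    return (i < 0 || j < 0) ? 0.0 : S[i + j * n];
}

/* Sum of x over the rectangle k1 <= row <= m1, k2 <= col <= m2,
 * obtained from the cumulative sums by inclusion-exclusion. */
double rect_sum_2d(const double *S, int n, int k1, int m1, int k2, int m2) {
    return S_at(S, n, m1, m2) - S_at(S, n, k1 - 1, m2)
         - S_at(S, n, m1, k2 - 1) + S_at(S, n, k1 - 1, k2 - 1);
}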
EDIT to 3): In order to include all potentially important aspects: I also want to calculate sums of the form
sum_{k1<=i1 <= m1,k2<=i2 <= m2, 1 <= i3 <= n, 1 <= i4 <=n} x[i1, i2, i3, i4].
(Since I can trivially obtain them using the cumulative sums, I had not included this specification before).
Here is the C code I use (which I save as „cumsumC.c“):
#include<R.h>
#include<math.h>
#include <stdio.h>
int min(int a, int b){
if(a <= b) return a;
else return b;
}
void cumsumC_4_old(double* x, int* nv){
int n = *nv;
int n2 = n*n;
int n3 = n*n*n;
//Dim 1
for(int i=0; i<n; i++){
for(int j=0; j<n; j++){
for(int k=0; k<n; k++){
for(int l=1; l<n; l++){
x[i+j*n+k*n2+l*n3] += x[i + j*n +k*n2+(l-1)*n3];
}
}
}
}
//Dim 2
for(int i=0; i<n; i++){
for(int j=0; j<n; j++){
for(int k=1; k<n; k++){
for(int l=0; l<n; l++){
x[i+j*n+k*n2+l*n3] += x[i + j*n +(k-1)*n2+l*n3];
}
}
}
}
//Dim 3
for(int i=0; i<n; i++){
for(int j=1; j<n; j++){
for(int k=0; k<n; k++){
for(int l=0; l<n; l++){
x[i+j*n+k*n2+l*n3] += x[i + (j-1)*n +k*n2+l*n3];
}
}
}
}
//Dim 4
for(int i=1; i<n; i++){
for(int j=0; j<n; j++){
for(int k=0; k<n; k++){
for(int l=0; l<n; l++){
x[i+j*n+k*n2+l*n3] += x[i-1 + j*n +k*n2+l*n3];
}
}
}
}
}
void cumsumC_4(double* x, int* nv){
int n = *nv;
int n2 = n*n;
int n3 = n*n*n;
long ind1, ind2;
long index, indexges = n +(n-1)*n+(n-1)*n2+(n-1)*n3, indexend;
//Dim 1
index = n3;
while(index != indexges){
x[index] += x[index-n3];
index++;
}
//Dim 2
long teilind = n+(n-1)*n;
for(int k=1; k<n; k++){
ind1 = k*n2;
ind2 = ind1 - n2;
for(int l=0; l<n; l++){
index = l*n3;
indexend = teilind+index;
while(index != indexend){
x[index+ind1] += x[index+ind2];
index++;
}
}
}
//Dim 3
ind1 = n;
while(ind1 < n+(n-1)*n){
index = 0;
indexend = indexges - ind1;
ind2 = ind1-n;
while(index < indexend){
x[ind1+index] += x[ind2+index];
index += n2;
}
ind1++;
}
//Dim 4
index = 0;
int i;
long minind;
while(index < indexges){
i = 1;
minind = min(indexges, index+n);
while(index+i < minind){
x[index+i] += x[index+i-1];
i++;
}
index+=n;
}
}
Here is the R function „cumsum_4R“ and code used to call and compare the cumulative sum functions in R (under Windows; for Linux, the commands dyn.load/dyn.unload need to be adjusted; ideally, I want to use the functions on 50^4 size arrays, but since the call to microbenchmark would then take a while, I have chosen n=40 here):
library("microbenchmark")
# dyn.load("cumsumC.so")
dyn.load("cumsumC.dll")
cumsum_4R <- function(x){
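  # take cumulative sums along each of the four dimensions in turn;
  # the aperm() calls restore the dimension order that apply() permutes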
return(aperm(apply(apply(aperm(apply(apply(x, 2:4,function(a) cumsum(as.numeric(a))), c(1,3,4) , function(a) cumsum(as.numeric(a))), c(2,1,3,4)), c(1,2,4), function(a) cumsum(as.numeric(a))), 1:3, function(a) cumsum(as.numeric(a))), c(3,4,2,1)))
}
cumsumC_4_old <- function(x){
n <- dim(x)[1]
arr <- array(.C("cumsumC_4_old", res=as.double(x), as.integer(n))$res, dim=c(n,n,n,n))
return(arr)
}
cumsumC_4 <- function(x){
n <- dim(x)[1]
arr <- array(.C("cumsumC_4", res=as.double(x), as.integer(n))$res, dim=c(n,n,n,n))
return(arr)
}
set.seed(1234)
n <- 40
x <- array(rnorm(n^4),dim=c(n,n,n,n))
r <- 6 #parameter for rounding results for comparison
res1 <- cumsum_4R(x)
res2 <- cumsumC_4_old(x)
res3 <- cumsumC_4(x)
print(c("Identical R and C1:", identical(round(res1,r),round(res2,r))))
print(c("Identical R and C2:",identical(round(res1,r),round(res3,r))))
times <- microbenchmark(cumsum_4R(x), cumsumC_4_old(x),cumsumC_4(x),times=50)
print(times)
dyn.unload("cumsumC.dll")
# dyn.unload("cumsumC.so")
Thank you for your help!
You can indeed use points 2 and 3 in your original question to solve the problem more efficiently. Actually, this makes the problem separable. By separable I mean that the limits of the 4 sums in Equation 3 do not depend on the variables you sum over. This, and the fact that x is an outer product of 2 matrices, enables you to separate the 4-fold sum in Eq. 3 into an outer product of two 2-fold sums. Even better: the 2 matrices used to define x are the same (denoted by mat by you), so the two 2-fold sums give the same matrix, which has to be calculated only once.
Here is the code:
set.seed(1234)
n=40
mat=array(rnorm(n^2),dim=c(n,n))
x=outer(mat,mat)
cumsum_sep=function(x) {
#calculate matrix corresponding to 2-fold sums
#actually it's just one matrix because x is an outer product of mat with itself
tmp=t(apply(apply(x,2,cumsum),1,cumsum))
#outer product of two-fold sums
outer(tmp,tmp)
}
y1=cumsum_4R(x)
#note that cumsum_sep operates on the original matrix mat!
y2=cumsum_sep(mat)
Check whether results are the same
all.equal(y1,y2)
[1] TRUE
This gives the benchmark results
microbenchmark(cumsum_4R(x),cumsum_sep(mat),times=10)
Unit: milliseconds
            expr         min          lq       mean     median         uq       max neval cld
    cumsum_4R(x) 2084.454155 2135.852305 2226.59692 2251.95928 2270.15198 2402.2724    10   b
 cumsum_sep(mat)    6.844939    7.145546   32.75852   14.45762   34.94397  120.0846    10  a
Quite a difference! :)
I am working with a signal matrix and my goal is to calculate the sum of all elements of a row. The matrix is represented by the following struct:
typedef struct matrix {
float *data;
int rows;
int cols;
int leading_dim;
} matrix;
I have to mention the matrix is stored in column-major order (http://en.wikipedia.org/wiki/Row-major_order#Column-major_order), which should explain the formula column * tan_hd.rows + row for retrieving the correct indices.
for(int row = 0; row < tan_hd.rows; row++) {
float sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for(int column = 0; column < tan_hd.cols; column++) {
sum += tan_hd.data[column * tan_hd.rows + row];
}
printf("row %d: %f", row, sum);
}
Without the OpenMP pragma, the delivered result is correct and looks like this:
row 0: 8172539.500000 row 1: 8194582.000000
As soon as I add the #pragma omp... as described above, a different (wrong) result is returned:
row 0: 8085544.000000 row 1: 8107186.000000
In my understanding, reduction(+:sum) creates private copies of sum for each thread, and after completing the loop these partial results are summed up and written back to the global variable sum again. What is it that I am doing wrong?
I appreciate your suggestions!
Use the Kahan summation algorithm
It has the same algorithmic complexity as a naive summation
It will greatly increase the accuracy of a summation, without requiring you to switch data types to double.
By rewriting your code to implement it:
for(int row = 0; row < tan_hd.rows; row++) {
float sum = 0.0, c = 0.0;
#pragma omp parallel for reduction(+:sum, c)
for(int column = 0; column < tan_hd.cols; column++) {
float y = tan_hd.data[column * tan_hd.rows + row] - c;
float t = sum + y;
c = (t - sum) - y;
sum = t;
}
sum = sum - c;
printf("row %d: %f", row, sum);
}
You can additionally switch all float to double to achieve higher precision, but since your array is a float array, there should only be differences in the last few significant digits.
I have a matrix that I want to cycle through in blocks. The matrix could be of many different sizes, but I know the size. Is there a way to cycle through the blocks quickly?
I.e., to quickly output the indexes of the blocks; for example, for a 4*4 matrix I should have:
Block1: (0,0),(0,1)(1,0)(1,1)
Block2: (0,2),(0,3)(1,2)(1,3)
Block3: (2,0),(2,1)(3,0)(3,1)
Block4: (2,2),(2,3)(3,2)(3,3)
Where the indexes are (row,col).
By a block I mean a submatrix of size sqrt(matrixSize) * sqrt(matrixSize), where the matrix is of size matrixSize * matrixSize.
For example, a 4*4 matrix has 4 blocks of 2*2, a 9*9 matrix has 9 blocks of 3*3...
I am working in C, but pseudocode is also useful; I only need the loop over the indexes.
I managed to do it; it's coded in C.
The variable 'matrix_size' contains the number of rows (or columns) of the matrix.
The variable 'block' contains the precomputed size of one block of the matrix, i.e., the square root of matrix_size.
for(int i = 0; i < matrix_size; i++)
{
    fprintf(stdout, "Block %d: ", i);
    // rows covered by block i
    for(int k = (i/block)*block; k < ((i/block)*block) + block; k++)
    {
        // columns covered by block i
        for(int j = (i%block)*block; j < ((i%block)*block) + block; j++)
        {
            fprintf(stdout, "(%d,%d)", k, j);
        }
    }
    fprintf(stdout, "\n");
}
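For example, with matrix_size = 4 and block = 2, the iteration i = 2 gives rows (2/2)*2 = 2..3 and columns (2%2)*2 = 0..1, so it prints (2,0)(2,1)(3,0)(3,1), i.e., Block3 of the question (the code numbers blocks from 0).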