I have two vectors, a[n] and b[n], where n is a large number.
a[0] = b[0];
for (i = 1; i < size; i++) {
a[i] = a[i-1] + b[i];
}
The goal of this code is for a[i] to contain the sum of all the numbers in b[] up to b[i]. I need to parallelise this loop using OpenMP.
The main problem is that a[i] depends on a[i-1], so the only direct way that comes to my mind would be waiting for each a[i-1] value to be ready, which takes a lot of time and makes no sense. Is there any approach in OpenMP for solving this problem?
You're Carl Friedrich Gauss in the 18th century and your grade-school teacher has decided to punish the class with a homework problem that requires a lot of mundane repeated arithmetic. The previous week your teacher told you to add up the first 100 counting numbers, and because you're clever you came up with a quick solution. Your teacher did not like this, so he came up with a new problem, one which he thinks can't be optimized. In your own notation you rewrite the problem like this
a[0] = b[0];
for (int i = 1; i < size; i++) a[i] = a[i-1] + b[i];
then you realize that
a[0] = b[0]
a[1] = (b[0]) + b[1]
a[2] = ((b[0]) + b[1]) + b[2]
...
a[n] = b[0] + b[1] + b[2] + ... + b[n]
using your notation again you rewrite the problem as
int sum = 0;
for (int i = 0; i < size; i++) sum += b[i], a[i] = sum;
How to optimize this? First you define
int sum(int n0, int n) {
int sum = 0;
for (int i = n0; i < n; i++) sum += b[i], a[i] = sum;
return sum;
}
Then you realize that
a[n+1]   = sum(0, n) + sum(n, n+1)
a[n+2]   = sum(0, n) + sum(n, n+2)
a[n+m]   = sum(0, n) + sum(n, n+m)
a[n+m+k] = sum(0, n) + sum(n, n+m) + sum(n+m, n+m+k)
So now you know what to do. Find t classmates. Have each one work on a subset of the numbers. To keep it simple you choose size = 100 and four classmates t0, t1, t2, t3; then each one does
t0 t1 t2 t3
s0 = sum(0,25) s1 = sum(25,50) s2 = sum(50,75) s3 = sum(75,100)
at the same time. Then define
void fix(int n0, int n, int offset) {
    for (int i = n0; i < n; i++) a[i] += offset;
}
and then each classmate goes back over their subset, again at the same time, like this
t0 t1 t2 t3
fix(0, 25, 0) fix(25, 50, s0) fix(50, 75, s0+s1) fix(75, 100, s0+s1+s2)
You realize that with t classmates each taking about the same K seconds per number, you can finish the job in 2*K*size/t seconds, whereas one person would take K*size seconds. It's clear you're going to need at least two classmates just to break even. So with four classmates they should finish in half the time as one classmate.
Now you write up your algorithm in your own notation
int *suma; // array of partial results from each classmate
#pragma omp parallel
{
    int ithread = omp_get_thread_num();   // label of classmate
    int nthreads = omp_get_num_threads(); // number of classmates
    #pragma omp single
    suma = malloc(sizeof *suma * (nthreads+1)), suma[0] = 0;
    // now have each classmate calculate their partial result s = sum(n0, n)
    int s = 0;
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < size; i++) s += b[i], a[i] = s;
    suma[ithread+1] = s;
    // now wait for every classmate to finish
    #pragma omp barrier
    // now each classmate sums the results of the previous classmates
    int offset = 0;
    for (int i = 0; i < (ithread+1); i++) offset += suma[i];
    // now each classmate corrects their result
    #pragma omp for schedule(static)
    for (int i = 0; i < size; i++) a[i] += offset;
}
free(suma);
You realize that you could optimize the part where each classmate has to add the result of the previous classmate but since size >> t you don't think it's worth the effort.
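For completeness, here is a self-contained, compilable version of the sketch above. The harness in main (the array size, the all-ones b, and the final check) is an illustrative assumption, not part of the original story. Build with, e.g., gcc -fopenmp:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int size = 1000000;
    int *a = malloc(sizeof *a * size);
    int *b = malloc(sizeof *b * size);
    for (int i = 0; i < size; i++) b[i] = 1; // with all ones, a[i] must equal i+1

    int *suma;
    #pragma omp parallel
    {
        int ithread = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        #pragma omp single
        suma = malloc(sizeof *suma * (nthreads+1)), suma[0] = 0;
        int s = 0;
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < size; i++) s += b[i], a[i] = s;
        suma[ithread+1] = s;
        #pragma omp barrier
        int offset = 0;
        for (int i = 0; i < ithread+1; i++) offset += suma[i];
        #pragma omp for schedule(static)
        for (int i = 0; i < size; i++) a[i] += offset;
    }
    free(suma);

    printf("a[size-1] = %d (expected %d)\n", a[size-1], size);
    free(a); free(b);
    return 0;
}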
Your solution is not nearly as fast as your trick for adding the counting numbers, but nevertheless your teacher is not happy that several of his students finished much earlier than the others. So now he decides that one student has to read the b array slowly to the class, and when you report the result a it has to be read back slowly as well. You call this being read/write bandwidth bound. This severely limits the effectiveness of your algorithm. What are you going to do now?
The only thing you can think of is to get multiple classmates to read and record different subsets of the numbers to the class at the same time.
I have this code that transposes a matrix using a loop-tiling strategy.
void transposer(int n, int m, double *dst, const double *src) {
int blocksize = 16; // assumed tile size; must divide n and m
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < m; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*m];
}
}
}
}
}
I want to optimize this with multi-threading using OpenMP, however I am not sure what to do when having so many nested for loops. I thought about just adding #pragma omp parallel for but doesn't this just parallelize the outer loop?
When you try to parallelize a loop nest, you should ask yourself how many levels are conflict free, as in: every iteration writing to a different location. If two iterations write (potentially) to the same location, you need to:
1. use a reduction,
2. use a critical section or other synchronization,
3. decide that this loop is not worth parallelizing, or
4. rewrite your algorithm.
In your case, the write location depends on k and l: the write index is k + l*n with k < n, so no two distinct pairs (k,l) and (k',l') write to the same location. Furthermore, there are no two inner iterations that have the same pair of k and l values. So all four loops are parallel, and they are perfectly nested, so you can use collapse(4).
You could also have drawn this conclusion by considering the algorithm in the abstract: in a matrix transposition each target location is written exactly once, so no matter how you traverse the target data structure, it's completely parallel.
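In code, that would look something like the sketch below. Note that the inner loop bounds depend on i and j (a non-rectangular loop nest), which collapse only supports from OpenMP 5.0 on, so this needs a reasonably recent compiler:

#pragma omp parallel for collapse(4)
for (int i = 0; i < n; i += blocksize)
    for (int j = 0; j < m; j += blocksize)
        for (int k = i; k < i + blocksize; ++k)
            for (int l = j; l < j + blocksize; ++l)
                dst[k + l*n] = src[l + k*m];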
You can use the collapse specifier to parallelize over two loops.
# pragma omp parallel for collapse(2)
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < m; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*m];
}
}
}
}
As a side-note, I think you should swap the two innermost loops. Usually, when you have a choice between writing sequentially and reading sequentially, writing is more important for performance.
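Concretely, that suggestion amounts to the following sketch; only the two innermost loops change:

for (int l = j; l < j + blocksize; ++l)      // was the inner loop
    for (int k = i; k < i + blocksize; ++k)  // now innermost: dst[k + l*n] is written sequentially
        dst[k + l*n] = src[l + k*m];         // the reads of src become strided instead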
I thought about just adding #pragma omp parallel for but doesn't this just parallelize the outer loop?
Yes. To parallelize multiple nested loops one can use OpenMP's collapse clause. Bear in mind, however, that:
(as pointed out by Victor Eijkhout) even though it does not directly apply to your code snippet, for each additional loop to be parallelized one should reason about the potential new race conditions that the parallelization might introduce, e.g., different threads writing concurrently to the same dst position;
in some cases parallelizing nested loops may result in slower execution than parallelizing a single loop, since the implementation of the collapse clause uses a more complex heuristic than simple loop parallelization to divide the iterations among threads, and this can add more overhead than the gains it provides.
You should benchmark with a single parallel loop, then with two, and so on, and compare the results.
void transposer(int n, int m, double *dst, const double *src) {
    int blocksize = 16; // assumed tile size; must divide n and m
    #pragma omp parallel for collapse(...)
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < m; j += blocksize)
            for (int k = i; k < i + blocksize; ++k)
                for (int l = j; l < j + blocksize; ++l)
                    dst[k + l*n] = src[l + k*m];
}
Depending on the number of threads, cores, and matrix sizes, among other factors, it might be that running sequentially is actually faster than the parallel versions. This is especially true because your code is not very CPU intensive (a single assignment, dst[k + l*n] = src[l + k*m];).
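If it helps, here is a minimal timing harness for such a benchmark (a sketch with made-up sizes; transpose2 is the collapse(2) variant, and you can duplicate it with collapse(3) or collapse(4) to compare):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 64 // assumed tile size; must divide n and m

// Variant under test: collapse(2).
void transpose2(int n, int m, double *dst, const double *src) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i += BLOCK)
        for (int j = 0; j < m; j += BLOCK)
            for (int k = i; k < i + BLOCK; ++k)
                for (int l = j; l < j + BLOCK; ++l)
                    dst[k + l*n] = src[l + k*m];
}

int main(void) {
    int n = 2048, m = 2048;
    double *src = malloc(sizeof *src * n * m);
    double *dst = malloc(sizeof *dst * n * m);
    for (long i = 0; i < (long)n * m; i++) src[i] = (double)i;

    double t0 = omp_get_wtime();
    transpose2(n, m, dst, src);
    printf("collapse(2): %.3f s\n", omp_get_wtime() - t0);

    free(src); free(dst);
    return 0;
}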
I'm working through CUDA C Programming by Cheng, and came across this piece of code:
void sumMatrixOnHost (float *A, float *B, float *C, const int nx, const int ny) {
float *ia = A;
float *ib = B;
float *ic = C;
for (int iy=0; iy<ny; iy++) {
for (int ix=0; ix<nx; ix++) {
ic[ix] = ia[ix] + ib[ix];
}
ia += nx; ib += nx; ic += nx;
}
}
This is for matrix addition whereby the matrices are stored in a row-major format.
As I understand, the inner for loop is iterating over a row and performing element addition, and the outer for loop is then used to increment the pointers to the start of the next row.
Why is this approach better than using pointers over the whole matrix, i.e.
for (int i=0; i<ny*nx; i++) {
ic[i] = ia[i] + ib[i];
}
or dual for loops, i.e.
for (int iy=0; iy<ny; iy++) {
for (int ix=0; ix<nx; ix++) {
ic[iy*nx+ix] = ia[iy*nx+ix] + ib[iy*nx+ix];
}
}
Is this something to do with how it is optimized by the compiler?
The simplest approach is always the best approach:
for (int i=0; i<ny*nx; i++) {
C[i] = A[i] + B[i];
}
This will be faster than the first solution. The problem with splitting the matrix up by row is that the vectoriser will:
process each line in batches of 32 bytes (the size of a YMM register);
process the remaining handful of values at the end of the line;
now repeat for each line!
If however you do it with a single loop, the code generated will:
process all the data in batches of 32 bytes (the size of a YMM register);
process the remaining handful of values at the end of the matrix that don't fill a full 32-byte block.
The first version just adds pointless code to process the inner loop. None of that code is needed; it just breaks the ability to vectorise the entire matrix.
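As an aside, you can make the single-loop version even more vectoriser-friendly by promising the compiler that the arrays don't alias. A sketch (the restrict qualifiers and the function name are my additions, not the book's code):

// With restrict the compiler knows A, B and C don't overlap
// and can vectorise the whole ny*nx range in one pass.
void sumMatrixFlat(const float *restrict A, const float *restrict B,
                   float *restrict C, const int nx, const int ny) {
    for (int i = 0; i < ny * nx; i++)
        C[i] = A[i] + B[i];
}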
The approach in sumMatrixOnHost is better for optimization, and it should (generally) execute faster than the two approaches you have suggested.
For the ALU, multiplication takes more time than addition.
sumMatrixOnHost contains no multiplication, whereas in
for (int i=0; i<ny*nx; i++) {
ic[i] = ia[i] + ib[i];
}
there is a multiplication (ny*nx in the loop condition) in each iteration of the loop, and in
for (int iy=0; iy<ny; iy++) {
for (int ix=0; ix<nx; ix++) {
ic[iy*nx+ix] = ia[iy*nx+ix] + ib[iy*nx+ix];
}
}
there are three multiplications in each iteration of the loop.
A simpler approach would be
int n = ny*nx;
for (int i=0; i<n; i++) {
ic[i] = ia[i] + ib[i];
}
but with the last approach we lose another thing that is good in sumMatrixOnHost: the ability to do the operation on blocks of the matrix rather than the whole matrix.
Find a tight upper bound on the complexity of this program.
I tried. I think the time complexity of this code is O(n^2).
void function(int n)
{
int count = 0;
for (int i=0; i<n; i++)
for (int j=i; j< i*i; j++)
if (j%i == 0)
{
for (int k=0; k<j; k++)
printf("*");
}
}
But the answer given is O(n^5). How?
EDIT: Yikes, after spending the time to answer this question, I discovered that this is a duplicate of a previous question that I had also answered three years ago. Oops!
The tightest bound you can get on this function's runtime is Θ(n^4). Here's the derivation.
This is a great place to illustrate a great general strategy for determining the big-O of a piece of code:
"When in doubt, work inside out!"
Let's take your code:
for (int i=0; i<n; i++)
{
for (int j=i; j< i*i; j++)
{
if (j%i == 0)
{
for (int k=0; k<j; k++)
{
printf("*");
}
}
}
}
Our approach for analyzing the runtime complexity will be to repeatedly take the innermost loop and replace it with the amount of work that it does. When we're done, we'll have our final time complexity.
Let's begin with this innermost loop:
for (int k=0; k<j; k++)
{
printf("*");
}
The amount of work done here is Θ(j), since the number of loop iterations is directly proportional to j and we do a constant amount of work per loop iteration. So let's replace this loop with the simpler "do Θ(j) work," giving us this simplified loop nest:
for (int i=0; i<n; i++)
{
for (int j=i; j< i*i; j++)
{
if (j%i == 0)
{
do Θ(j) work
}
}
}
Now, let's take aim at what's now the innermost loop:
for (int j=i; j < i*i; j++)
{
if (j%i == 0)
{
do Θ(j) work
}
}
This loop is unusual in that the amount of work that it does varies pretty dramatically from one iteration to the next. Specifically:
most iterations will do only O(1) work, but
one out of every i iterations will do Θ(j) work.
To analyze this loop, we'll therefore split the work apart into these two constituent pieces and see how much each contributes to the total.
First, let's look at the "easy" iterations of the loop, which do only O(1) work. There are a total of Θ(i^2) iterations of the loop (the loop starts counting at j = i and stops when j = i^2, and i^2 - i = Θ(i^2)). We can therefore bound the contribution of these "easy" loop iterations at O(i^2) work.
Now, what about the "hard" loop iterations? These iterations occur when j = i, when j = 2i, when j = 3i, j = 4i, etc. Moreover, each of these "hard" iterations does work directly proportional to j during the iteration. This means that, if we add up the work across all of these iterations, the total work done is given by
i + 2i + 3i + 4i + ... + (i - 1)i.
We can simplify this as follows:
i + 2i + 3i + 4i + ... + (i - 1)i
= i(1 + 2 + 3 + ... + i-1)
= i · Θ(i^2)
= Θ(i^3).
This uses the fact that 1 + 2 + 3 + ... + k = k(k + 1) / 2 = Θ(k^2), which is Gauss's famous sum.
So now we see that the work done by the inner loop here is given by
O(i^2) work for the easy iterations, and
Θ(i^3) work for the hard iterations.
Summing this up, we see that the total work done by this inner loop is Θ(i^3). Continuing our process of working inside out, we can replace this inner loop with "do Θ(i^3) work" to get the following:
for (int i=0; i<n; i++)
{
do Θ(i^3) work
}
From here, we see that the work done is
1^3 + 2^3 + 3^3 + ... + (n - 1)^3,
and that sum is Θ(n^4). (Specifically, it's n^2(n - 1)^2 / 4.)
So overall, the theory predicts that the runtime should be Θ(n^4), which is a factor of n lower than the O(n^5) bound you mentioned above. How does the theory match the practice?
I ran this code for various values of n and counted how many times a star was printed. Here are the values I got back:
n = 500: 7760510375
n = 1000: 124583708250
n = 1500: 631407093625
n = 2000: 1996668166500
n = 2500: 4876304426875
n = 3000: 10113753374750
n = 3500: 18739952510125
n = 4000: 31973339333000
n = 4500: 51219851343375
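(The counting harness isn't shown above; here is a reconstructed sketch that reproduces these counts. It replaces the printf with a counter, using the fact that each "hard" iteration at index j runs the k loop j times.)

#include <stdio.h>

// Count the stars function(n) would print, instead of printing them.
unsigned long long count_stars(int n) {
    unsigned long long count = 0;
    for (int i = 0; i < n; i++)
        for (long long j = i; j < (long long)i * i; j++)
            if (j % i == 0)
                count += (unsigned long long)j; // the k loop runs j times
    return count;
}

int main(void) {
    for (int n = 500; n <= 4500; n += 500)
        printf("n = %d: %llu\n", n, count_stars(n));
    return 0;
}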
If the runtime is Θ(n^4), then doubling the size of the input should scale the output by a factor of 16. If the runtime is Θ(n^5), then doubling the input size should scale the output by a factor of 32. Here's what I found:
Ratio of n = 1000 to n = 500: 16.0535
Ratio of n = 2000 to n = 1000: 16.0267
Ratio of n = 3000 to n = 1500: 16.0178
Ratio of n = 4000 to n = 2000: 16.0133
This strongly suggests that the runtime of this function is indeed Θ(n^4) rather than Θ(n^5).
Hope this helps!
I agree with the other poster, except the innermost loop is O(n^2), as k spans from 0 to j, which itself spans up to n^2. This gives us the desired answer of O(n^5).
The first loop
for (int i=0; i<n; i++)
Gives your first O(n) multiplier.
Next, the second loop
for (int j=i; j< i*i; j++)
itself multiplies the complexity by O(n^2), because i is "like" n here.
The third loop
for (int k=0; k<j; k++)
Multiplies by another O(n^2) because j is "like" n^2 here.
So you get complexity O(n^5).
I have this piece of code and I would like to find its time complexity. I am preparing for interviews and I think this one is a bit tough.
int foo (int n)
{
int sum = 0;
int k, i, j;
int t = 2;
for (i=n/2; i>0; i/=2)
{
for(j=0; j<i; j++)
{
for(k=0; k<log2(t-1); k++)
{
sum += bar(sum);
// bar time-complexity for all inputs is O(1)
}
}
t = pow(2, i);
}
    return sum;
}
I don't know why, but I am unable to bound this expression and find its complexity.
Any help on how to resolve this?
Since you've shown no progress, I'll give you top-level hints:
let log_t = log2(t). What is log_t as a function of i?
Note that the outer loop is executed log2(n) times.
How many total times is the j loop executed?
Pick a sample value of n, such as 32. For each value of i, how many times is the sum += statement executed? Can you generalize an equation for this based on n? (A counting sketch follows below.)
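For example, here is a hypothetical instrumentation of the original function that just counts executions of the innermost statement. (Note: t is kept as a double because the original int t would overflow as soon as pow(2, i) exceeds INT_MAX, which already happens at n = 64.)

#include <math.h>
#include <stdio.h>

// Count how many times the innermost statement would execute.
long count_inner(int n) {
    long count = 0;
    double t = 2;
    for (int i = n / 2; i > 0; i /= 2) {
        for (int j = 0; j < i; j++)
            for (int k = 0; k < log2(t - 1); k++)
                count++;
        t = pow(2, i);
    }
    return count;
}

int main(void) {
    for (int n = 32; n <= 1024; n *= 2)
        printf("n = %4d: %ld\n", n, count_inner(n));
    return 0;
}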
Let's write the total work down as
(n/2)*log(1) + (n/4)*log(3) + ... + 1*log(n-1),
which is less than
n * [log(2)/2 + log(4)/4 + log(8)/8 + ... + log(n)/n] = n * sum of log(2^i)/2^i for i in 1...log(n).
Since this series converges to a constant, the total yields O(n).
I'm relatively new to R programming, and this website has been very helpful for me so far, but I was unable to find a question that already covered what I want to know. So I decided to post a question myself.
My problem is the following: I want to find efficient ways to compute cumulative sums over four-dimensional arrays, i.e. I have data in a four-dimensional array x and want to write a function that computes an array x_sum such that
x_sum[i,j,k,l] = sum_{ind1 <= i, ind2 <= j, ind3 <= k, ind4 <=l} x[ind1, ind2, ind3, ind4].
I want to use this function billions of times, which makes it very important that it be as efficient as possible. Although I have come up with several ways to calculate the sums (see below), I suspect more experienced R programmers might be able to find a more efficient solution. So if anyone can suggest a better way of doing this, I would be very grateful.
Here's what I've tried so far:
I have found three different implementations (each of which brought a gain in speed) that work (see code below):
One in R using the cumsum() function (cumsum_4R) and two implementations where the "heavy lifting" is done in C (using the .C() interface).
The first implementation in C is merely a naive attempt to write the sums using nested for-loops and pointer arithmetic (cumsumC_4_old).
In the second C-implementation (cumsumC_4) I tried to adapt my code using the ideas in the following article
As you can see in the source code below, the adaptation is somewhat lopsided: for some dimensions I was able to replace all the nested for-loops, but not for others. Do you have ideas on how to do that?
Using microbenchmark on the three implementations, I get the following result for arrays of size 40x40x40x40:
Unit: milliseconds
             expr       min         lq       mean     median         uq       max neval
     cumsum_4R(x) 976.13258 1029.33371 1064.35100 1051.37782 1074.23234 1381.5832    50
 cumsumC_4_old(x) 174.72868  177.95875  192.75392  184.11121  203.18141  283.2384    50
     cumsumC_4(x)  56.87169   57.73512   67.34714   63.20269   68.80326  105.7099    50
Additional information:
1) Since this made it easier to install any needed packages, I ran the benchmarks on my personal computer under Windows, but I plan on running the finished simulations on a computer from my university, which runs under Linux.
EDIT: 2) The four-dimensional data x[i,j,k,l] is actually the result of the product of two applications of the outer function: First, the outer product of a matrix with itself (i.e. outer(mat,mat)) and then taking the pairwise minima of another matrix (i.e. outer(mat2, mat2, pmin)). Then the data is the product
x = outer(mat, mat) * outer(mat2, mat2, pmin),
i.e.
x[i,j,k,l] = mat[i,j] * mat[k,l] * min(mat2[i,j], mat2[k,l])
The four-dimensional array has the corresponding symmetries.
3) The reason I need these cumulative sums in the first place is that I want to run simulations of a test for which I need partial sums over "rectangles" of indices: I want to iterate over all sums of the form
sum_{k1<=i1 <= m1,k2<=i2 <= m2, k1 <= i3 <= m1, k2 <= i4 <=m2} x[i1, i2, i3, i4],
where 1<=k1<=m1<=n, 1<=k2<=m2<=n. In order to avoid calculating the sum of the same variables over and over again, I first calculate all the cumulative sums and then calculate the sums over rectangles as functions of the cumulative sums. Do you know of a more efficient way to do this?
EDIT to 3): In order to include all potentially important aspects: I also want to calculate sums of the form
sum_{k1<=i1 <= m1,k2<=i2 <= m2, 1 <= i3 <= n, 1 <= i4 <=n} x[i1, i2, i3, i4].
(Since I can trivially obtain them using the cumulative sums, I had not included this specification before).
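For reference, these rectangle sums follow from the cumulative sums by inclusion-exclusion. In two dimensions (the four-dimensional case is analogous, with 16 signed terms, one per choice of lower/upper limit in each index, and with the convention x_sum[0,.] = x_sum[.,0] = 0):

sum_{k1 <= i1 <= m1, k2 <= i2 <= m2} x[i1, i2]
    = x_sum[m1, m2] - x_sum[k1-1, m2] - x_sum[m1, k2-1] + x_sum[k1-1, k2-1]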
Here is the C code I use (which I save as "cumsumC.c"):
#include<R.h>
#include<math.h>
#include <stdio.h>
int min(int a, int b){
if(a <= b) return a;
else return b;
}
void cumsumC_4_old(double* x, int* nv){
int n = *nv;
int n2 = n*n;
int n3 = n*n*n;
//Dim 1
for(int i=0; i<n; i++){
for(int j=0; j<n; j++){
for(int k=0; k<n; k++){
for(int l=1; l<n; l++){
x[i+j*n+k*n2+l*n3] += x[i + j*n +k*n2+(l-1)*n3];
}
}
}
}
//Dim 2
for(int i=0; i<n; i++){
for(int j=0; j<n; j++){
for(int k=1; k<n; k++){
for(int l=0; l<n; l++){
x[i+j*n+k*n2+l*n3] += x[i + j*n +(k-1)*n2+l*n3];
}
}
}
}
//Dim 3
for(int i=0; i<n; i++){
for(int j=1; j<n; j++){
for(int k=0; k<n; k++){
for(int l=0; l<n; l++){
x[i+j*n+k*n2+l*n3] += x[i + (j-1)*n +k*n2+l*n3];
}
}
}
}
//Dim 4
for(int i=1; i<n; i++){
for(int j=0; j<n; j++){
for(int k=0; k<n; k++){
for(int l=0; l<n; l++){
x[i+j*n+k*n2+l*n3] += x[i-1 + j*n +k*n2+l*n3];
}
}
}
}
}
void cumsumC_4(double* x, int* nv){
int n = *nv;
int n2 = n*n;
int n3 = n*n*n;
long ind1, ind2;
long index, indexges = n +(n-1)*n+(n-1)*n2+(n-1)*n3, indexend;
//Dim 1
index = n3;
while(index != indexges){
x[index] += x[index-n3];
index++;
}
//Dim 2
long teilind = n+(n-1)*n;
for(int k=1; k<n; k++){
ind1 = k*n2;
ind2 = ind1 - n2;
for(int l=0; l<n; l++){
index = l*n3;
indexend = teilind+index;
while(index != indexend){
x[index+ind1] += x[index+ind2];
index++;
}
}
}
//Dim 3
ind1 = n;
while(ind1 < n+(n-1)*n){
index = 0;
indexend = indexges - ind1;
ind2 = ind1-n;
while(index < indexend){
x[ind1+index] += x[ind2+index];
index += n2;
}
ind1++;
}
//Dim 4
index = 0;
int i;
long minind;
while(index < indexges){
i = 1;
minind = min(indexges, index+n);
while(index+i < minind){
x[index+i] += x[index+i-1];
i++;
}
index+=n;
}
}
Here is the R function "cumsum_4R" and the code used to call and compare the cumulative-sum functions in R (under Windows; on Linux, the dyn.load/dyn.unload commands need to be adjusted). Ideally, I want to use the functions on arrays of size 50^4, but since the call to microbenchmark would then take a while, I have chosen n = 40 here:
library("microbenchmark")
# dyn.load("cumsumC.so")
dyn.load("cumsumC.dll")
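# cumsum_4R computes cumulative sums along each of the four dimensions in turn;
# the aperm() calls restore the dimension order that apply() permutes.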
cumsum_4R <- function(x){
return(aperm(apply(apply(aperm(apply(apply(x, 2:4,function(a) cumsum(as.numeric(a))), c(1,3,4) , function(a) cumsum(as.numeric(a))), c(2,1,3,4)), c(1,2,4), function(a) cumsum(as.numeric(a))), 1:3, function(a) cumsum(as.numeric(a))), c(3,4,2,1)))
}
cumsumC_4_old <- function(x){
n <- dim(x)[1]
arr <- array(.C("cumsumC_4_old", res=as.double(x), as.integer(n))$res, dim=c(n,n,n,n))
return(arr)
}
cumsumC_4 <- function(x){
n <- dim(x)[1]
arr <- array(.C("cumsumC_4", res=as.double(x), as.integer(n))$res, dim=c(n,n,n,n))
return(arr)
}
set.seed(1234)
n <- 40
x <- array(rnorm(n^4),dim=c(n,n,n,n))
r <- 6 #parameter for rounding results for comparison
res1 <- cumsum_4R(x)
res2 <- cumsumC_4_old(x)
res3 <- cumsumC_4(x)
print(c("Identical R and C1:", identical(round(res1,r),round(res2,r))))
print(c("Identical R and C2:",identical(round(res1,r),round(res3,r))))
times <- microbenchmark(cumsum_4R(x), cumsumC_4_old(x),cumsumC_4(x),times=50)
print(times)
dyn.unload("cumsumC.dll")
# dyn.unload("cumsumC.so")
Thank you for your help!
You can indeed use points 2 and 3 in your original question to solve the problem more efficiently. Actually, this makes the problem separable. By separable I mean that the limits of the 4 sums in Equation 3 do not depend on the variables you sum over. This and the fact that x is an outer product of 2 matrices enable you to separate the 4-fold sum in Eq. 3 into an outer product of two 2-fold sums. Even better: the 2 matrices used to define x are the same (denoted by mat by you), so the two 2-fold sums give the same matrix, which has to be calculated only once.
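In the question's notation (and ignoring the mat2/pmin factor, which the test below also drops by using x = outer(mat, mat)): since x[i,j,k,l] = mat[i,j] * mat[k,l], the 4-fold cumulative sum factors as

x_sum[i,j,k,l] = sum_{a<=i, b<=j} sum_{c<=k, d<=l} mat[a,b] * mat[c,d]
               = ( sum_{a<=i, b<=j} mat[a,b] ) * ( sum_{c<=k, d<=l} mat[c,d] )
               = tmp[i,j] * tmp[k,l],

where tmp is the two-dimensional cumulative sum of mat.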
Here is the code:
set.seed(1234)
n=40
mat=array(rnorm(n^2),dim=c(n,n))
x=outer(mat,mat)
cumsum_sep=function(x) {
#calculate matrix corresponding to 2-fold sums
#actually it's just one matrix because x is an outer product of mat with itself
tmp=t(apply(apply(x,2,cumsum),1,cumsum))
#outer product of two-fold sums
outer(tmp,tmp)
}
y1=cumsum_4R(x)
#note that cumsum_sep operates on the original matrix mat!
y2=cumsum_sep(mat)
Check whether the results are the same:
all.equal(y1,y2)
[1] TRUE
This gives the benchmark results
microbenchmark(cumsum_4R(x),cumsum_sep(mat),times=10)
Unit: milliseconds
            expr         min          lq       mean     median         uq       max neval cld
    cumsum_4R(x) 2084.454155 2135.852305 2226.59692 2251.95928 2270.15198 2402.2724    10   b
 cumsum_sep(mat)    6.844939    7.145546   32.75852   14.45762   34.94397  120.0846    10  a
Quite a difference! :)