R - apply function on each element of array in parallel

I have measurements of maximum and minimum temperature and precipitation that are organized as arrays of size
100 x 96 x 50769, where the first two dimensions (i, j) index grid cells with associated coordinates and the third dimension (z) holds the measurements over time.
Conceptually, each grid cell (i, j) holds a time series of measurements.
I am using the climdex.pcic package to calculate indices of extreme weather events. Given a time series of maximum and minimum temperature and precipitation, the climdexInput.raw function will return a climdexInput object that can be used to determine several indices: number of frost days, number of summer days, consecutive dry days, etc.
The call for the function is pretty simple:
ci <- climdexInput.raw(tmax=x, tmin=y, prec=z,
                       t, t, t, base.range=c(1961,1990))
where x is a vector of maximum temperatures, y is a vector of minimum temperatures, z is a vector of precipitation, and t is a vector of the dates on which x, y and z were measured.
What I would like to do is to extract the time series for each element of my array (i.e. each grid cell) and use it to run the climdexInput.raw function.
Because the real data have a large number of elements, I want to run this task in parallel on my 4-core Linux server. However, I have no experience with parallelization in R.
Here's one example of my program (with intentionally reduced dimensions to make execution faster on your computer):
library(climdex.pcic)
# Create some dates
t <- seq(as.Date('2000-01-01'), as.Date('2010-12-31'), 'day')
# Parse the dates into PCICt
t <- as.PCICt(strftime(t), cal='gregorian')
# Create some dummy weather data, with dimensions `# of lat`, `# of lon` and `# of timesteps`
nc.min <- array(runif(10*9*4018, min=0, max=15), c(10, 9, 4018))
nc.max <- array(runif(10*9*4018, min=25, max=40), c(10, 9, 4018))
nc.prc <- array(runif(10*9*4018, min=0, max=25), c(10, 9, 4018))
# Create "ci" object
ci <- climdexInput.raw(tmax=nc.max[1,1,], tmin=nc.min[1,1,], prec=nc.prc[1,1,],
                       t, t, t, base.range=c(2000,2005))
# Once you have "ci", you can compute any of the indices provided by the climdex.pcic package.
# The example below computes the annual maximum number of consecutive dry days (CDD):
cdd <- climdex.cdd(ci, spells.can.span.years = TRUE)
Now, please note that in the example above I used only the first element of my array ([1,1,]) in the climdexInput.raw function.
How can I do the same for all elements, taking advantage of parallel processing, possibly by looping over the dimensions i and j of my array?

You can use foreach to do that:
library(doParallel)
registerDoParallel(cl <- makeCluster(3))
res <- foreach(j = seq_len(ncol(nc.min))) %:%
  foreach(i = seq_len(nrow(nc.min))) %dopar% {
    ci <- climdex.pcic::climdexInput.raw(
      tmax = nc.max[i, j, ],
      tmin = nc.min[i, j, ],
      prec = nc.prc[i, j, ],
      t, t, t,
      base.range = c(2000, 2005)
    )
  }
stopCluster(cl)
See my guide on parallelism using foreach: https://privefl.github.io/blog/a-guide-to-parallelism-in-r/.
Then, to compute an index, just use climdex.cdd(res[[1]][[1]], spells.can.span.years = TRUE) (the outer index j comes first, the inner index i second, so res[[j]][[i]] is grid cell (i, j)).
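If you want an index value for every grid cell, a minimal sketch (assuming the nested list res from above and the same spells.can.span.years setting) is to loop over the result list:
# cdd[[j]][[i]] holds the vector of annual CDD values for grid cell (i, j)
cdd <- lapply(res, function(col)
  lapply(col, climdex.pcic::climdex.cdd, spells.can.span.years = TRUE))
# e.g. the cell in row 1, column 1:
cdd[[1]][[1]]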

Related

"Diagonalize" each row of a matrix

I have an n x p matrix that looks like this:
n = 100
p = 10
x <- matrix(sample(c(0,1), size = p*n, replace = TRUE), n, p)
I want to create an n x p x p array A whose kth item along the 1st dimension is a p x p diagonal matrix containing the elements of x[k,]. What is the most efficient way to do this in R? I'm looking for a way that uses outer (or some other vectorized approach) rather than one of the apply functions.
Solution using lapply:
A <- aperm(simplify2array(lapply(1:nrow(x), function(i) diag(x[i,]))), c(3,2,1))
I'm looking for something more efficient than this.
Thanks.
As a starting point, here is a humble for loop method with pre-allocation of the output array.
# pre-allocate array of desired size
myArray <- array(0, dim=c(ncol(x), ncol(x), nrow(x)))
# fill in array
for(i in seq_len(nrow(x))) myArray[,,i] <- diag(x[i,])
It should run relatively fast. On my machine, for a 1000 x 100 matrix, the lapply method took 0.87 seconds, while the for loop (including the array pre-allocation) took 0.25 seconds to transform the matrix into your desired array. So the for loop was about 3.5 times faster.
Transpose your original matrix
Note also that row operations on R matrices tend to be slower than column operations. This is because matrices are stored in memory by column. If you transpose your matrix and perform the operation on the resulting 100 x 1000 matrix, the time drops to 0.14 seconds, half that of the first for loop and 7 times faster than the lapply method.
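For reference, a minimal sketch of that transposed variant (it produces the same array as myArray above; only the indexing direction changes):
# work on the transpose so each diag() call reads a column, which is
# contiguous in memory, instead of a row
tx <- t(x)
myArray2 <- array(0, dim = c(ncol(x), ncol(x), nrow(x)))
for (i in seq_len(nrow(x))) myArray2[,,i] <- diag(tx[,i])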

MATLAB pairwise differences in Nth dimension

Say I have an N-dimensional matrix A that can be of any size. For example:
A = rand([2,5,3]);
I want to calculate all possible pairwise differences between elements of the matrix, along a given dimension. For example, if I wanted to calculate the differences along dimension 3, a shortcut would be to create a matrix like so:
B = cat(3, A(:,:,2) - A(:,:,1), A(:,:,3) - A(:,:,1), A(:,:,3) - A(:,:,2));
However, I want this to be able to operate along any dimension, with a matrix of any size. So, ideally, I'd like to either create a function that takes in a matrix A and calculates all pairwise differences along dimension DIM, or find a built-in MATLAB function that does the same thing.
The diff function seems like it could be useful, but it only calculates differences between adjacent elements, not all possible differences.
Doing my research on this issue, I have found a couple of posts about getting all possible differences, but most of these are for items in a vector (and ignore the dimensionality issue). Does anyone know of a quick fix?
Specific Dimension Cases
If you don't care about a general solution, the dim=3 case is as simple as a couple of lines of code -
dim = 3
idx = fliplr(nchoosek(1:size(A,dim),2))
B = A(:,:,idx(:,1)) - A(:,:,idx(:,2))
You can move those idx(..) around to specific dimension positions, if you happen to know the dimension beforehand. So, let's say dim = 4, then just do -
B = A(:,:,:,idx(:,1)) - A(:,:,:,idx(:,2))
Or let's say dim = 3, but A is a 4D array, then do -
B = A(:,:,idx(:,1),:) - A(:,:,idx(:,2),:)
Generic Case
For a Nth dim case, it seems you need to welcome a party of reshapes and permutes -
function out = pairwise_diff(A,dim)
    %// New permuting dimensions
    new_permute = [dim setdiff(1:ndims(A),dim)];
    %// Permuted A and its 2D reshaped version
    A_perm = permute(A,new_permute);
    A_perm_2d = reshape(A_perm,size(A,dim),[]);
    %// Get pairwise indices for that dimension
    N = size(A,dim);
    [Y,X] = find(bsxfun(@gt,[1:N]',[1:N])); %// or: fliplr(nchoosek(1:N,2))
    %// Get size of the new permuted array that would have the length of its
    %// first dimension equal to the number of such pairwise combinations
    sz_A_perm = size(A_perm);
    sz_A_perm(1) = numel(Y);
    %// Get the pairwise differences; reshape to a multidimensional array with
    %// the same number of dimensions as the input array
    diff_mat = reshape(A_perm_2d(Y,:) - A_perm_2d(X,:),sz_A_perm);
    %// Permute back to the original dimension sequence as the final output
    [~,return_permute] = sort(new_permute);
    out = permute(diff_mat,return_permute);
return
So much for a generalization, huh!

Efficient, concise approach to convert array dimension to list (and back) in R

I convert between data formats a lot. I'm sure this is quite common. In particular, I switch between arrays and lists. I'm trying to figure out if I'm doing it right, or if I'm missing any schemas that would greatly improve quality of life. Below I'll give some examples of how to achieve desired results in a couple situations.
Begin with the following array:
dat <- array(1:60, c(5,4,3))
Then, convert one or more of the dimensions of that array to a list. For clarification and current approaches, see the following:
1 dimension, array to list
# Convert 1st dim
dat_list1 <- unlist(apply(dat, 1, list),F,F) # this is what I usually do
# Convert 1st dim, (alternative approach)
library(plyr) # I don't use this approach often b/c I try to go base if I can
dat_list1a <- alply(dat, 1) # points for being concise!
# minus points to alply for being slow (in this case)
> microbenchmark(unlist(apply(dat, 1, list),F,F), alply(dat, 1))
Unit: microseconds
                              expr      min       lq      mean    median       uq      max neval
 unlist(apply(dat, 1, list), F, F)   40.515   43.519   50.6531   50.4925   53.113   88.412   100
                     alply(dat, 1) 1479.418 1511.823 1684.5598 1595.4405 1842.693 2605.351   100
1 dimension, list to array
# Convert elements of list into new array dimension
# bonus points for converting to original array
dat_array1_0 <- simplify2array(dat_list1)
aperm.key1 <- sapply(dim(dat), function(x)which(dim(dat_array1_0)==x))
dat_array1 <- aperm(dat_array1_0,aperm.key1)
In general, these are the tasks I'm trying to accomplish, although sometimes it's in multiple dimensions or the lists are nested, or some such other complication. So I'm asking if anyone has a "better" (concise, efficient) way of doing either of these things, but bonus points if a suggested approach can handle other related scenarios too.
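Not an answer to the efficiency question as such, but worth noting as an aside (a sketch assuming base R >= 3.6.0, which is newer than the approaches above): asplit() does the array-to-list step directly, and the same simplify2array()/aperm() trick converts back.
# asplit() (base R >= 3.6.0) splits an array along a margin into a list
dat_list1b <- asplit(dat, 1)   # list of 5 matrices, each 4 x 3
# back to the original 5 x 4 x 3 array
dat_array1b <- aperm(simplify2array(dat_list1b), c(3, 1, 2))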

How to find the maximum of multiple arrays in MATLAB?

Let's say we have an array x. We can find the maximum value of this array as follows:
maximum = max(x);
If I have two arrays, let's say x and y, I can compute their element-wise maximum by using the command
maximum_array = max(x, y);
Then, I can find the overall maximum value by using the max command on that result, as before with x:
maximum_value = max(maximum_array);
This two-step procedure could be performed with the following compact, one-liner command:
maximum_value = max(max(x, y));
But what happens when we have more than 2 arrays? As far as I know, the max function does not allow comparing more than two arrays at once. Therefore, I have to use max on pairs of arrays, and then find the max among the intermediate results (which also involves additional variables). Of course, if I have, let's say, 50 arrays, this would be - and it really is - a tedious process.
Is there a more efficient approach?
Approach #1
Concatenate column-vector versions of the arrays along dim-2 with cat and then take the maximum along dim-2 with max to get the result.
Thus, assuming x, y and z to be the input arrays, do something like this -
%// Reshape all arrays to column vectors with (:) and then use cat
M = cat(2,x(:),y(:),z(:))
%// Use max along dim-2 with `max(..,[],2)` to get column vector
%// version and then reshape back to the shape of input arrays
max_array = reshape(max(M,[],2),size(x))
Approach #2
You can use ndims to find the number of dimensions of the input arrays, concatenate along the next dimension (ndims + 1), and finally take max along that dimension to get the array of maximum values. This avoids all of the reshaping back and forth and could thus be more efficient, as well as more compact -
ndimsp1 = ndims(x)+1 %// no. of dimensions plus 1
maxarr = max(cat(ndimsp1,x,y,z),[],ndimsp1) %// concatenate and find max
I think the easiest approach for a small set of arrays is to column-ify and concatenate:
maxValue = max([x(:);y(:)]);
For a large number of arrays in some data structure (e.g. a cell array or a struct), a simple loop would be best:
maxValue = max(cellOfMats{1}(:));
for k = 2:length(cellOfMats)
    maxValue = max([maxValue;cellOfMats{k}(:)]);
end
For the pathological case of a large number of separate arrays with differing names, I say "don't do that" and put them in a data structure or use eval with a loop.

Parallel `for` loop with an array as output

How can I run a for loop in parallel (so I can use all the processors on my Windows machine) with the result being a 3-dimensional array? The code I have now takes about an hour to run and is something like:
guad = array(NA, c(1680,170,15))
for (r in 1:15)
{
  name = paste("P:/......", r, ".csv", sep="")
  pp = read.table(name, sep=",", header=T)
  # lots of stuff to calculate x (which is a matrix)
  guad[,,r] = x
}
I have been looking at related questions and thought I could use foreach but I couldn't find a way to combine the matrices into an array.
I am new to parallel programming so any help will be very much appreciated!
You could do that with foreach using the abind function. Here's an example using the doParallel package as the parallel backend, which is fairly portable:
library(doParallel)
library(abind)
cl <- makePSOCKcluster(3)
registerDoParallel(cl)
acomb <- function(...) abind(..., along=3)
guad <- foreach(r=1:4, .combine='acomb', .multicombine=TRUE) %dopar% {
  x <- matrix(rnorm(16), 4)  # compute x somehow
  x                          # return x as the task result
}
This uses a combine function called acomb that uses the abind function from the abind package to combine the matrices generated by the cluster workers into a 3-dimensional array.
In this case, you can also combine the results using cbind and then modify the dim attribute afterwards to convert the resulting matrix into a 3-dimensional array:
guad <- foreach(r=1:4, .combine='cbind') %dopar% {
  x <- matrix(rnorm(16), 4)  # compute x somehow
  x                          # return x as the task result
}
dim(guad) <- c(4,4,4)
dim(guad) <- c(4,4,4)
The abind function is handy since it can combine matrices and arrays in a variety of ways. Also, be aware that resetting the dim attribute may cause the matrix to be duplicated, which could be a problem for large arrays.
Note that it's a good idea to shutdown the cluster at the end of the script using stopCluster(cl).
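Applied to the loop from the question, a sketch might look like the following (the truncated path and the computation of x are placeholders copied from the question, not working code):
library(doParallel)
library(abind)

cl <- makePSOCKcluster(3)
registerDoParallel(cl)

# combine the per-task matrices along a new third dimension
acomb <- function(...) abind(..., along = 3)

guad <- foreach(r = 1:15, .combine = 'acomb', .multicombine = TRUE) %dopar% {
  name <- paste("P:/......", r, ".csv", sep = "")
  pp <- read.table(name, sep = ",", header = TRUE)
  # ...lots of stuff to calculate x (a 1680 x 170 matrix) from pp...
  x   # return x as the task result
}

stopCluster(cl)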
