Parallel `for` loop with an array as output

How can I run a for loop in parallel (so I can use all the processors on my Windows machine) with the result being a 3-dimensional array? The code I have now takes about an hour to run and is something like:
guad <- array(NA, c(1680,170,15))
for (r in 1:15) {
  name <- paste("P:/......", r, ".csv", sep="")
  pp <- read.table(name, sep=",", header=TRUE)
  # lots of stuff to calculate x (which is a matrix)
  guad[,,r] <- x
}
I have been looking at related questions and thought I could use foreach but I couldn't find a way to combine the matrices into an array.
I am new to parallel programming so any help will be very much appreciated!

You could do that with foreach using the abind function. Here's an example using the doParallel package as the parallel backend, which is fairly portable:
library(doParallel)
library(abind)
cl <- makePSOCKcluster(3)
registerDoParallel(cl)
acomb <- function(...) abind(..., along=3)
guad <- foreach(r=1:4, .combine='acomb', .multicombine=TRUE) %dopar% {
  x <- matrix(rnorm(16), 4)  # compute x somehow
  x                          # return x as the task result
}
This uses a combine function called acomb that uses the abind function from the abind package to combine the matrices generated by the cluster workers into a 3-dimensional array.
In this case, you can also combine the results using cbind and then modify the dim attribute afterwards to convert the resulting matrix into a 3-dimensional array:
guad <- foreach(r=1:4, .combine='cbind') %dopar% {
  x <- matrix(rnorm(16), 4)  # compute x somehow
  x                          # return x as the task result
}
dim(guad) <- c(4,4,4)
abind is useful here since it can combine matrices and arrays in a variety of ways. Also, be aware that resetting the dim attribute may cause the matrix to be duplicated, which could be a problem for large arrays.
Note that it's a good idea to shut down the cluster at the end of the script using stopCluster(cl).
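Applied to the loop from the question, a minimal sketch might look like the following (the CSV path and the computation of x are the placeholders from the question, and the worker count is arbitrary):
library(doParallel)
library(abind)
cl <- makePSOCKcluster(3)
registerDoParallel(cl)
acomb <- function(...) abind(..., along=3)
guad <- foreach(r=1:15, .combine='acomb', .multicombine=TRUE) %dopar% {
  name <- paste("P:/......", r, ".csv", sep="")
  pp <- read.table(name, sep=",", header=TRUE)
  # lots of stuff to calculate x (which is a matrix)
  x  # return x as the task result
}
stopCluster(cl)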

Repeat array rows specified number of times

New to Julia, so this is probably very easy.
I have an n-by-m array and a vector of length n and want to repeat each row of the array the number of times in the corresponding element of the vector. For example:
mat = rand(3,6)
v = vec([2 3 1])
The result should be a 6-by-6 array. I tried the repeat function but
repeat(mat, inner = v)
yields a 6×18×1 Array{Float64,3} instead, so it takes v to be the dimensions along which to repeat the elements. In Matlab I would use repelem(mat, v, 1) and I hope Julia offers something similar. My actual matrix is a lot bigger and I will have to call the function many times, so this operation needs to be as fast as possible.
Adding something similar to Julia Base has been discussed, but AFAIK it is not implemented yet. You can achieve what you want using the inverse_rle function from StatsBase.jl:
julia> row_idx = inverse_rle(axes(v, 1), v)
6-element Array{Int64,1}:
1
1
2
2
2
3
and now you can write:
mat[row_idx, :]
or
@view mat[row_idx, :]
(The second option creates a view, which might be relevant in your use case since you say that mat is large and you need to do such indexing many times; which option is faster will depend on your exact use case.)
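Putting the pieces together on the example data from the question, a quick end-to-end sketch:
using StatsBase
mat = rand(3, 6)
v = [2, 3, 1]
row_idx = inverse_rle(axes(v, 1), v)  # [1, 1, 2, 2, 2, 3]
out = mat[row_idx, :]                 # each row of mat repeated v[i] times
size(out)                             # (6, 6)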

R - apply function on each element of array in parallel

I have measurements of maximum and minimum temperature and precipitation that are organized as arrays of size 100x96x50769, where the first two dimensions (i, j) index grid cells with associated coordinates and the third (z) indexes the measurements over time.
Conceptually, each (i, j) grid cell holds a time series of z measurements.
I am using the climdex.pcic package to calculate indices of extreme weather events. Given a time series of maximum and minimum temperature and precipitation, the climdexInput.raw function will return a climdexInput object that can be used to determine several indices: number of frost days, number of summer days, consecutive dry days, etc.
The call for the function is pretty simple:
ci <- climdexInput.raw(tmax=x, tmin=y, prec=z,
t, t, t, base.range=c(1961,1990))
where x is a vector of maximum temperatures, y is a vector of minimum temperatures, z is a vector of precipitation and t is a vector with dates under which x, y and z were measured.
What I would like to do is to extract the time series for each element of my array (i.e. each grid cell) and use it to run the climdexInput.raw function.
Because of the large number of elements of real data, I want to run this task in parallel on my 4-core Linux server. However, I have no experience with parallelization in R.
Here's one example of my program (with intentionally reduced dimensions to make execution faster on your computer):
library(climdex.pcic)
# Create some dates
t <- seq(as.Date('2000-01-01'), as.Date('2010-12-31'), 'day')
# Parse the dates into PCICt
t <- as.PCICt(strftime(t), cal='gregorian')
# Create some dummy weather data, with dimensions `# of lat`, `# of lon` and `# of timesteps`
nc.min <- array(runif(10*9*4018, min=0, max=15), c(10, 9, 4018))
nc.max <- array(runif(10*9*4018, min=25, max=40), c(10, 9, 4018))
nc.prc <- array(runif(10*9*4018, min=0, max=25), c(10, 9, 4018))
# Create "ci" object
ci <- climdexInput.raw(tmax=nc.max[1,1,], tmin=nc.min[1,1,], prec=nc.prc[1,1,],
t, t, t, base.range=c(2000,2005))
# Once you have "ci", you can compute any of the indices provided by the climdex.pcic package.
# The example below is for cumulative # of dry days per year:
cdd <- climdex.cdd(ci, spells.can.span.years = TRUE)
Now, please note that in the example above I used only the first element of my array ([1,1,]) as an example in the climdexInput.raw function.
How can I do the same for all elements, taking advantage of parallel processing, possibly by looping over the dimensions i and j of my array?
You can use foreach to do that:
library(doParallel)
registerDoParallel(cl <- makeCluster(3))
res <- foreach(j = seq_len(ncol(nc.min))) %:%
  foreach(i = seq_len(nrow(nc.min))) %dopar% {
    ci <- climdex.pcic::climdexInput.raw(
      tmax = nc.max[i,j,],
      tmin = nc.min[i,j,],
      prec = nc.prc[i,j,],
      t, t, t,
      base.range = c(2000,2005)
    )
  }
stopCluster(cl)
See my guide on parallelism using foreach: https://privefl.github.io/blog/a-guide-to-parallelism-in-r/.
Then, to compute an index for a given cell, just use climdex.cdd(res[[1]][[1]], spells.can.span.years = TRUE) (the nested list is indexed j first, then i).
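To run an index computation over every grid cell while keeping that layout, a small sketch (assuming the res object from above):
# cdd.all[[j]][[i]] holds the yearly CDD series for grid cell (i, j)
cdd.all <- lapply(res, function(col)
  lapply(col, climdex.cdd, spells.can.span.years = TRUE))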

Build Dictionary of Arrays Efficiently in Julia

I want to save the (x,y) coordinates in a grid network that are visited by different individuals. Let's say I have 1000 individuals and the network size is x = 1:100 and y = 1:100. I am using Dict() and here is sample code showing what I want to do:
individuals = 1:1000
x = 1:100
y = 1:100
function Visited_nodes()
    nodes_of_inds = Dict{Int64, Array{Tuple{Int64, Int64}}}()
    for ind in individuals
        dum_array = Array{Tuple{Int64, Int64}}(0)
        for i in x
            for j in y
                if rand() < 0.2 # some conditions
                    push!(dum_array, (i,j))
                end
            end
        end
        nodes_of_inds[ind] = unique(dum_array)
    end
    return nodes_of_inds
end
@time nodes_of_inds = Visited_nodes()
# result: 1.742297 seconds (12.31 M allocations: 607.035 MB, 6.72% gc time)
But this is not efficient. I would appreciate any advice on how to make it more efficient.
Please see the performance tips. Very first piece of advice there: avoid global variables. individuals, x, and y are all non-constant global variables. Make them arguments to your function instead. That change alone speeds up your function by an order of magnitude.
By construction, you're not going to have any duplicate tuples in your dum_array, so you don't need to call unique. That shaves off another factor of two.
Finally, Array{T} isn't a concrete type. Julia's arrays also encode the dimensionality as a type parameter, which must be included for the dictionary of arrays to be efficient. Use Array{T, 1} or Vector{T} instead. This isn't a major consideration within the time of this function, though.
The major thing that's left is just the O(length(individuals)*length(x)*length(y)) computational complexity. Doing anything ten million times will add up quickly, no matter how efficient it is.
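A sketch of the function revised along those lines (keeping the Julia 0.6-era syntax of the question; the 0.2 threshold is the question's placeholder condition):
function visited_nodes(individuals, x, y)
    nodes_of_inds = Dict{Int64, Vector{Tuple{Int64, Int64}}}()
    for ind in individuals
        dum_array = Vector{Tuple{Int64, Int64}}(0)  # Vector{T} is a concrete type
        for i in x, j in y
            if rand() < 0.2 # some condition
                push!(dum_array, (i, j))
            end
        end
        nodes_of_inds[ind] = dum_array  # (i, j) pairs are unique by construction, no unique() needed
    end
    return nodes_of_inds
end

@time nodes_of_inds = visited_nodes(1:1000, 1:100, 1:100)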
@Matt B., thanks for your response. About the global variables, I tried a simplified version of my code and it did not help the performance.
Let's say I read my input data from a couple of csv files and I have three functions with different arguments:
function Read_input_data()
    # read input data
    individuals = readcsv("file1")
    x = readcsv("file2")
    y = readcsv("file3")
    A = readcsv("file4")
    B = readcsv("file5") # and a few other files
    # call different functions
    result_1 = Function1(individuals, x, y)
    result_2 = Function2(result_1, y, A, B)
    result_3 = Function3(result_2, individuals, A, B)
    return result_1, result_2, result_3
end
result_1, result_2, result_3 = Read_input_data()
I do not know why the performance is not better compared to when I define everything as a global! I would appreciate any comments on this!

Parallelize nested for-loop on 3 dimensional array in R

Using R on a Windows machine, I am currently running a nested loop on a 3D array (720x360x1368) which cycles through d1 and d2 to apply a function over d3 and assemble the output into a new array of the same dimensionality.
In the following reproducible example, I have reduced the dimensions by a factor of 10 to make execution faster.
library(SPEI)
old.array <- array(abs(rnorm(50)), dim=c(72,36,136))
new.array <- array(dim=c(72,36,136))
for (i in 1:72) {
  for (j in 1:36) {
    new.listoflists <- spi(ts(old.array[i,j,], freq=12, start=c(1901,1)), 1, na.rm = TRUE)
    new.array[i,j,] <- new.listoflists$fitted
  }
}
where spi() is a function from the SPEI package returning a list of lists, of which one particular element, $fitted, of length 1368 is used from each loop increment to construct the new array.
While this loop works flawlessly, it takes quite a long time to compute. I have read that foreach can be used to parallelize for loops.
However, I do not understand how the nesting and the assembling of the new array can be achieved such that the dimnames of the old and the new array are consistent.
(In the end, what I want to be able to do is transform both the old and the new array into a "flat" long panel data frame using as.data.frame.table() and merge them along their three dimensions.)
Any help on how I can achieve the desired output using parallel computing will be highly appreciated!
Cheers
CubicTom
It would have been better with a reproducible example; here is what I came up with:
First, create the cluster to use:
library(doSNOW)
cl <- makeCluster(6, type = "SOCK")
registerDoSNOW(cl)
Then run the loop and close the cluster:
zz <- foreach(i = 1:720, .combine = c) %:%
  foreach(j = 1:360, .combine = c) %dopar% {
    new.listoflists <- FUN(old.array[i,j,])
    new.listoflists$list  # return the value; assignments into new.array inside %dopar% do not propagate back to the master
  }
stopCluster(cl)
This will create a list zz containing every iteration's result for new.array[i,j,]; then you can bind them together with:
new.obj <- plyr::ldply(zz, data.frame)
Hope this helps you!
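If what you ultimately want is the 720x360x1368 array itself, one option (a sketch, assuming each iteration returns the length-1368 $fitted vector and both loops use .combine = c as above, so the combined vector runs over d3 fastest, then j, then i) is to pour zz into an array and permute the dimensions:
new.array <- aperm(array(zz, dim = c(1368, 360, 720)), c(3, 2, 1))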
I did not use dimensions as large as those in your question because I wanted to ensure the behavior was correct.
Here I use mapply, which takes multiple arguments. The result is a list of the results, which I then wrap with matrix() to get the dimensions you hoped for.
Please note that i is repeated using times and j is repeated using each. This is critical, as matrix() fills entries down the rows first and wraps to the next column when the number of rows is reached.
new.array <- array(1:(5*10*4), dim=c(5,10,4))
# FUN: a function which returns a list
FUN <- function(x){
  list(lapply(x, rep, times=3))
}
# result of the computation
result <- matrix(
  mapply(
    function(i, j){
      FUN(new.array[i,j,])
    },
    i = rep(1:nrow(new.array), times=ncol(new.array)),
    j = rep(1:ncol(new.array), each=nrow(new.array))
  ),
  nrow = nrow(new.array),
  ncol = ncol(new.array)
)
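Each element result[[i, j]] then holds the list FUN produced from the time series of cell (i, j), so the row/column layout of result matches the first two dimensions of new.array.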

In MATLAB: How should nested fields of a struct be converted to a cell array?

In MATLAB, I would like to extract a nested field for each index of a 1 x n struct (a nonscalar struct) and receive the output as a 1 x n cell array. As a simple example, suppose I start with the following struct s:
s(1).f1.fa = 'foo';
s(2).f1.fa = 'yedd';
s(1).f1.fb = 'raf';
s(2).f1.fb = 'da';
s(1).f2 = 'bok';
s(2).f2 = 'kemb';
I can produce my desired 1 x 2 cell array c using a for-loop:
n = length(s);
c = cell(1,n);
for k = 1:n
    c{k} = s(k).f1.fa;
end
If I wanted to do analogously for a non-nested field, for example f2, then I could "vectorize" the operation (see this question), writing simply:
c = {s.f2};
However the same approach does not appear to work for nested fields. What then are possible ways to vectorize the above for-loop?
You cannot really vectorize it. The problem is that Matlab does not allow most forms of nested indexing; in particular, you cannot chain a field access onto a bracketed expression, so [s.f1].fa is a syntax error.
The most concise / readable option would be to concatenate the s.f1 results into a structure array using [...], and then index into the new structure array with a separate call:
x = [s.f1]; c = {x.fa};
If you have the Mapping Toolbox, you could use extractfield to perform the second indexing in one expression:
c = extractfield([s.f1], 'fa');
Alternatively, you could write a one-liner using arrayfun; here are a couple of options:
c = arrayfun(@(x) x.f1.fa, s, 'uni', false);
c = arrayfun(@(x) x.fa, [s.f1], 'uni', false);
Note that arrayfun and similar functions are generally slower than explicit for loops. So if the performance is critical, time / profile your code, before making a decision to get rid of the loop.
