R: Convert 3darray[i, j, ] to columns of df, fast and readable

I'm working with 3-dimensional arrays and want to have slices along the
third dimension for each position in the first two dimensions as columns in a data frame.
I also want my code to be readable for people who don't use R regularly.
Looping over the first two dimensions is very readable but slow (30 secs for the example below), while the permute-flatten-reshape-to-matrix approach
is faster (14 secs) but less readable.
Any suggestions for a nice solution?
Reproducible example here:
# Create data
d1 <- 200
d2 <- 100
d3 <- 50
data <- array(rnorm(n=d1*d2*d3), dim=c(d1, d2, d3))
# Idea 1: Loop
df <- data.frame(var1 = rep(0, d3))
i <- 1
system.time(
for (c in 1:d2) {
for(r in 1:d1){
i <- i + 1
df[[i]] <- data[r, c, ]
}
})
# Idea 2: Permute dimension of array first
df2 <- data.frame(var1 = rep(0, d3))
system.time({
data.perm <- aperm(data, c(3, 1, 2))
df2[, 2:(d1*d2 + 1)] <- matrix(c(data.perm), nrow = d3, ncol = d1*d2)}
)
identical(df, df2)

I would suggest a much simpler approach:
t(apply(data, 3, c))
I hope it meets your expectations of being fast and readable.
It is fast, as demonstrated in the timings below.
It is readable because it's a basic apply statement. All that is being done is using c to flatten the matrix at each index of the third dimension into a single vector; apply then simplifies those vectors into a two-dimensional array with one column per slice. The result just needs to be transposed.
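To see what the one-liner does to the column order, here is a small sketch on a toy array (the name small and its dimensions are made up for illustration). Column k of the result should hold the slice at position (r, c) with k = r + (c - 1) * d1, which is the same order the nested loop in the question produces.

```r
# Toy array, small enough to check by eye (sizes chosen arbitrarily)
small <- array(1:24, dim = c(2, 3, 4))

# Flatten each third-dimension slice with c(), then transpose
flat <- t(apply(small, 3, c))
dim(flat)          # 4 rows (third dimension), 6 columns (2 * 3 positions)

# Column 1 is position (1, 1), column 2 is (2, 1), column 3 is (1, 2), ...
flat[, 1]
small[1, 1, ]

# Convert to a data frame if columns of a df are what is needed
df_small <- as.data.frame(flat)
```

If the default V1, V2, ... column names are not wanted, they can be set afterwards with names(df_small).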
Here's your sample data:
set.seed(1)
d1 <- 200
d2 <- 100
d3 <- 50
data <- array(rnorm(n=d1*d2*d3), dim=c(d1, d2, d3))
Here are a few functions to compare:
funam <- function() t(apply(data, 3, c))
funrl <- function() {
myl <- vector("list", d1 * d2) # one slot per (r, c) position
i <- 0
for (c in 1:d2) {
for(r in 1:d1){
i <- i + 1
myl[[i]] <- data[r, c, ]
}
}
do.call(cbind, myl)
}
funop <- function() {
df <- data.frame(var1 = rep(0, d3))
i <- 1
for (c in 1:d2) {
for(r in 1:d1){
i <- i + 1
df[[i]] <- data[r, c, ]
}
}
df[-1]
}
Here are the results of the timing:
system.time(am <- funam())
# user system elapsed
# 0.000 0.000 0.062
system.time(rl <- funrl())
# user system elapsed
# 3.980 0.000 1.375
system.time(op <- funop())
# user system elapsed
# 21.496 0.000 21.355
... and a comparison for equality:
all.equal(am, as.matrix(unname(op)), check.attributes = FALSE)
# [1] TRUE
all.equal(am, rl, check.attributes = FALSE)
# [1] TRUE

Here's an idea. Recommended read would be The R Inferno by Patrick Burns (pun intended?).
myl <- vector("list", d1 * d2) # preallocate one empty slot per (r, c) position
i <- 0
system.time(
for (c in 1:d2) {
for(r in 1:d1){
i <- i + 1
myl[[i]] <- data[r, c, ]
}
})
user system elapsed
1.8 0.0 1.8
# bind each list element into a matrix, column-wise
do.call("cbind", myl)[1:5, 1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] -0.3394909 0.1266012 -0.4240452 0.2277654 -2.04943585
[2,] 1.6788653 -2.9381127 0.5781967 -0.7248759 -0.19482647
[3,] -0.6002371 -0.3132874 1.0895175 -0.2766891 -0.02109013
[4,] 0.5215603 -0.2805730 -1.0325867 -1.5373842 -0.14034565
[5,] 0.6063638 1.6027835 0.5711185 0.5410889 -1.77109124

Related

Product of two 3D array and a 2D matrix

I'm trying to find a much more efficient way to compute the following matrix in R:
Let A and C be two 3D arrays of dimension (n, n, m) and B a matrix of dimension (m, m), then M is an (n, n) matrix such that:
M_ij = SUM_kl A_ijk * B_kl * C_ijl
for (i in seq(n)) {
for (j in seq(n)) {
M[i, j] <- A[i,j,] %*% B %*% C[i,j,]
}
}
It is possible to write this with the tensorA package using i and j as parallel dimensions, but I'd rather stay with base R objects.
einstein.tensor(A %e% log(B), C, by = c("i", "j"))
Thanks!
I don't know if this would be faster, but it would avoid one level of looping:
for (i in seq(n))
M[i,] <- diag(A[i,,] %*% B %*% t(C[i,,]))
It gives the same answer as yours in this example:
n <- 2
m <- 3
A <- array(1:(n^2*m), c(n, n, m))
C <- A + 1
B <- matrix(1:(m^2), m, m)
M <- matrix(NA, n, n)
for (i in seq(n))
M[i,] <- diag(A[i,,] %*% B %*% t(C[i,,]))
M
# [,1] [,2]
# [1,] 1854 3216
# [2,] 2490 4032
Edited to add: Based on https://stackoverflow.com/a/42569902/2554330, here's a slightly faster version:
for (i in seq(n))
M[i,] <- rowSums((A[i,,] %*% B) * C[i,,])
I did some timing with n <- 200 and m <- 300, and this was the fastest at 3.1 sec, versus my original solution at 4.7 sec, and the one in the question at 17.4 sec.
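Taking the rowSums idea one step further, the remaining loop over i can also be dropped by reshaping A and C into (n*n) x m matrices. This is only a sketch that has not been timed here; the names A_flat, C_flat, M_flat and M_loop are made up for illustration, and the intermediate (n*n) x m product may be costly in memory for large n.

```r
n <- 2
m <- 3
A <- array(1:(n^2 * m), c(n, n, m))
C <- A + 1
B <- matrix(1:(m^2), m, m)

# Row i + (j - 1) * n of the flattened matrix holds A[i, j, ]
A_flat <- matrix(A, nrow = n * n, ncol = m)
C_flat <- matrix(C, nrow = n * n, ncol = m)

# sum_l (sum_k A_ijk * B_kl) * C_ijl for every (i, j) at once
M_flat <- matrix(rowSums((A_flat %*% B) * C_flat), nrow = n, ncol = n)

# Reference: the double loop from the question
M_loop <- matrix(NA_real_, n, n)
for (i in seq(n)) {
  for (j in seq(n)) {
    M_loop[i, j] <- A[i, j, ] %*% B %*% C[i, j, ]
  }
}
all.equal(M_flat, M_loop)
```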

Quickest way to find closest elements in an array in R

I would like to find the fastest way in R to identify the indexes of the elements in the Ytimes array which are closest to the given Xtimes values.
So far I have been using a simple for-loop, but there must be a better way to do it:
Xtimes <- c(1,5,8,10,15,19,23,34,45,51,55,57,78,120)
Ytimes <- seq(0,120,length.out = 1000)
YmatchIndex = array(0,length(Xtimes))
for (i in 1:length(Xtimes)) {
YmatchIndex[i] = which.min(abs(Ytimes - Xtimes[i]))
}
print(Ytimes[YmatchIndex])
Obligatory Rcpp solution. Takes advantage of the fact that your vectors are sorted and don't contain duplicates to turn an O(n^2) into an O(n). May or may not be practical for your application ;)
C++:
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector closest_pts(NumericVector Xtimes, NumericVector Ytimes) {
int xsize = Xtimes.size();
int ysize = Ytimes.size();
int y_ind = 0;
double minval = R_PosInf;
IntegerVector output(xsize);
for(int x_ind = 0; x_ind < xsize; x_ind++) {
while(y_ind < ysize && std::abs(Ytimes[y_ind] - Xtimes[x_ind]) < minval) { // stop at the end of Ytimes
minval = std::abs(Ytimes[y_ind] - Xtimes[x_ind]);
y_ind++;
}
output[x_ind] = y_ind;
minval = R_PosInf;
}
return output;
}
R:
microbenchmark::microbenchmark(
for_loop = {
for (i in 1:length(Xtimes)) {
which.min(abs(Ytimes - Xtimes[i]))
}
},
apply = sapply(Xtimes, function(x){which.min(abs(Ytimes - x))}),
fndIntvl = {
Y2 <- c(-Inf, Ytimes + c(diff(Ytimes)/2, Inf))
Ytimes[ findInterval(Xtimes, Y2) ]
},
rcpp = closest_pts(Xtimes, Ytimes),
times = 100
)
Unit: microseconds
expr min lq mean median uq max neval cld
for_loop 3321.840 3422.51 3584.452 3492.308 3624.748 10458.52 100 b
apply 68.365 73.04 106.909 84.406 93.097 2345.26 100 a
fndIntvl 31.623 37.09 50.168 42.019 64.595 105.14 100 a
rcpp 2.431 3.37 5.647 4.301 8.259 10.76 100 a
identical(closest_pts(Xtimes, Ytimes), findInterval(Xtimes, Y2))
# TRUE
R is vectorized, so skip the for loop. This saves time in scripting and computation. Simply replace the for loop with an apply function. Since we're returning a 1D vector, we use sapply.
YmatchIndex <- sapply(Xtimes, function(x){which.min(abs(Ytimes - x))})
Proof that apply is faster:
library(microbenchmark)
library(ggplot2)
# set up data
Xtimes <- c(1,5,8,10,15,19,23,34,45,51,55,57,78,120)
Ytimes <- seq(0,120,length.out = 1000)
# time it
mbm <- microbenchmark(
for_loop = for (i in 1:length(Xtimes)) {
YmatchIndex[i] = which.min(abs(Ytimes - Xtimes[i]))
},
apply = sapply(Xtimes, function(x){which.min(abs(Ytimes - x))}),
times = 100
)
# plot
autoplot(mbm)
See ?apply for more.
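As a small variation, vapply does the same job as sapply but lets you declare the type and length of each result, which some people find safer. This is a sketch of that substitution, not a claim about speed.

```r
Xtimes <- c(1, 5, 8, 10, 15, 19, 23, 34, 45, 51, 55, 57, 78, 120)
Ytimes <- seq(0, 120, length.out = 1000)

# integer(1) declares that each call must return exactly one integer
YmatchIndex <- vapply(Xtimes,
                      function(x) which.min(abs(Ytimes - x)),
                      integer(1))

# The matched Ytimes values
Ytimes[YmatchIndex]
```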
We can use findInterval to do this efficiently. (cut will also work, with a little more work).
First, let's shift the Ytimes values so that we can find the nearest value and not the next-lesser one. I'll demonstrate on fake data first:
y <- c(1,3,5,10,20)
y2 <- c(-Inf, y + c(diff(y)/2, Inf))
cbind(y, y2[-1])
# y
# [1,] 1 2.0
# [2,] 3 4.0
# [3,] 5 7.5
# [4,] 10 15.0
# [5,] 20 Inf
findInterval(c(1, 1.9, 2.1, 8), y2)
# [1] 1 1 2 4
The second column (prepended with a -Inf) will give us the breaks. Notice that each is half-way between the corresponding value and its follower.
Okay, let's apply this to your vectors:
Y2 <- Ytimes + c(diff(Ytimes)/2, Inf)
head(cbind(Ytimes, Y2))
# Ytimes Y2
# [1,] 0.0000000 0.06006006
# [2,] 0.1201201 0.18018018
# [3,] 0.2402402 0.30030030
# [4,] 0.3603604 0.42042042
# [5,] 0.4804805 0.54054054
# [6,] 0.6006006 0.66066066
Y2 <- c(-Inf, Ytimes + c(diff(Ytimes)/2, Inf))
cbind(Xtimes, Y2[ findInterval(Xtimes, Y2) ])
# Xtimes
# [1,] 1 0.9009009
# [2,] 5 4.9849850
# [3,] 8 7.9879880
# [4,] 10 9.9099099
# [5,] 15 14.9549550
# [6,] 19 18.9189189
# [7,] 23 22.8828829
# [8,] 34 33.9339339
# [9,] 45 44.9849850
# [10,] 51 50.9909910
# [11,] 55 54.9549550
# [12,] 57 56.9969970
# [13,] 78 77.8978979
# [14,] 120 119.9399399
(I'm using cbind just for side-by-side demonstration, not that it's necessary.)
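Note that findInterval returns positions, so indexing the original vector with them gives the nearest values, while indexing the break vector gives the lower edge of each interval instead. A minimal sketch on the fake data, with made-up query points x chosen to avoid exact midpoints (where the two methods may break ties differently):

```r
y <- c(1, 3, 5, 10, 20)
x <- c(1, 1.9, 2.1, 8, 25)

# Breaks half-way between neighbouring y values, padded with -Inf and Inf
y2 <- c(-Inf, y + c(diff(y) / 2, Inf))

idx <- findInterval(x, y2)   # positions in y of the nearest element
nearest <- y[idx]            # the nearest values themselves

# Cross-check against the brute-force which.min approach
brute <- sapply(x, function(v) which.min(abs(y - v)))
identical(idx, brute)
```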
Benchmark:
mbm <- microbenchmark::microbenchmark(
for_loop = {
YmatchIndex <- array(0,length(Xtimes))
for (i in 1:length(Xtimes)) {
YmatchIndex[i] = which.min(abs(Ytimes - Xtimes[i]))
}
},
apply = sapply(Xtimes, function(x){which.min(abs(Ytimes - x))}),
fndIntvl = {
Y2 <- c(-Inf, Ytimes + c(diff(Ytimes)/2, Inf))
Ytimes[ findInterval(Xtimes, Y2) ]
},
times = 100
)
mbm
# Unit: microseconds
# expr min lq mean median uq max neval
# for_loop 2210.5 2346.8 2823.678 2444.80 3029.45 7800.7 100
# apply 48.8 58.7 100.455 65.55 91.50 2568.7 100
# fndIntvl 18.3 23.4 34.059 29.80 40.30 83.4 100
ggplot2::autoplot(mbm)

matrix addition from an array r

I have an array with 272 matrices, each one 2 by 2. I now want to sum these matrices using matrix addition, so the result should be a single 2 by 2 matrix. Here is the code I have used.
y <- as.matrix(faithful)
B <- matrix(c(0,0,0,0),nrow = 2)
sigma <- function(n = 272,u_new) {
vec = replicate(272,B)
for (i in 1:n) {
w <- (y-u_new)[i,]
x <- ptilde1[i]*(w%*%t(w))
vec[,,i][1,1] <- x[1,1]
vec[,,i][1,2] <- x[1,2]
vec[,,i][2,1] <- x[2,1]
vec[,,i][2,2] <- x[2,2]}
vec
}
Here vec is the array with 272 matrices. Thank you in advance.
Here is code which loops a number of times (272) and adds a matrix to the same list.
B <- matrix(c(0,0,0,0),nrow = 2)
mats <- list(B) # named mats so that base::list is not masked
for (i in 2:272) {
mats[[i]] <- B
}
To add them all together, you can use the Reduce() function:
total <- Reduce('+', mats) # named total so that base::sum is not masked
> total
[,1] [,2]
[1,] 0 0
[2,] 0 0
This is a contrived example because all the matrices are the zero matrix. I will leave it to you as a homework assignment to use the matrices you actually want to sum together.
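If the matrices are already stored in a 2 x 2 x 272 array, as in the question, the sum over the third dimension can also be taken directly. The sketch below uses random stand-in data (the name arr is made up, since ptilde1 from the question is not available) and compares three base R ways of doing it.

```r
set.seed(1)
# Stand-in for the question's array: 272 random 2 x 2 matrices
arr <- array(rnorm(2 * 2 * 272), dim = c(2, 2, 272))

# Sum over the third dimension, keeping the first two
total_rowsums <- rowSums(arr, dims = 2)
total_apply   <- apply(arr, c(1, 2), sum)

# Same thing via a list of slices and Reduce
slices <- lapply(seq_len(dim(arr)[3]), function(k) arr[, , k])
total_reduce <- Reduce(`+`, slices)

all.equal(total_rowsums, total_apply)
all.equal(total_rowsums, total_reduce)
```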

Divide every slice of a matrix in an array by its own vector?

Suppose I have two arrays (or tensors if tensor package is needed)
dim(Xbeta)
products draws Households
13 20 10
dim(denom)
1 20 10
set.seed(1)
Xbeta=array(rnorm(13*20*10,0,1),dim=c(13,20,10))
denom=array(rnorm(1*20*10,0,1),dim=c(1,20,10))
Without looping, I want to do the following:
for(i in 1:10){
Xbeta[,,i]=t(t(Xbeta[,,i]) / denom[,,i])
}
I want to divide each column in the Xbeta[,,i] slice by the corresponding number in denom[,,i].
For example...Xbeta[,1,i]/denom[,1,i]...etc
You can avoid looping and replication by (1) 3-dimensionally transposing the numerator array and (2) flattening the denominator array to a vector, such that the division operation will naturally cycle the incomplete denominator vector across the entirety of the transposed numerator array in such a way that the data lines up the way you want. You then must 3-dimensionally "untranspose" the result to get back the original transposition.
aperm(aperm(Xbeta,c(2,3,1))/c(denom),c(3,1,2));
The first call to aperm() transposes columns to rows, z-slices to columns, and rows to z-slices. The c() call on denom flattens the denominator array to a vector, because when cycling, we don't care about dimensionality. The final call to aperm() reverses the original transposition.
To go into more detail about the logic of this solution, what you have with your inputs is basically a vector of divisors per z-slice of the numerator array, and you want to apply each divisor to every row of the corresponding z-slice and column. This means the vector of divisors must be applied across columns, first-and-foremost, and then, as each denominator z-slice is exhausted, applied across numerator z-slices. After a complete row (covering all z-slices in the row) of the numerator array has been exhausted, the entirety of the denominator vector has been exhausted, causing it to be cycled back to the beginning for the next row of the numerator array.
See https://stat.ethz.ch/R-manual/R-devel/library/base/html/aperm.html.
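For readers who find the double aperm() hard to follow, base R's sweep() expresses the same operation by naming the margins the divisors belong to. This is offered as a possibly more readable alternative, not a faster one; the names by_sweep, by_aperm and by_loop are made up for the comparison.

```r
set.seed(1)
Xbeta <- array(rnorm(13 * 20 * 10, 0, 1), dim = c(13, 20, 10))
denom <- array(rnorm(1 * 20 * 10, 0, 1), dim = c(1, 20, 10))

# denom[1, , ] is a 20 x 10 matrix: one divisor per (column, slice) pair,
# which matches margins 2 and 3 of Xbeta
by_sweep <- sweep(Xbeta, c(2, 3), denom[1, , ], "/")

# The aperm-based version from above, for comparison
by_aperm <- aperm(aperm(Xbeta, c(2, 3, 1)) / c(denom), c(3, 1, 2))

# The loop from the question, written into a separate result array
by_loop <- array(NA_real_, dim = dim(Xbeta))
for (i in 1:10) {
  by_loop[, , i] <- t(t(Xbeta[, , i]) / denom[, , i])
}

all.equal(by_sweep, by_aperm)
all.equal(by_sweep, by_loop)
```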
For a rough idea on performance:
r> set.seed(1);
r> Xbeta <- array(rnorm(13*20*10,0,1), dim=c(13,20,10) );
r> denom <- array(rnorm(1*20*10,0,1), dim=c(1,20,10) );
r> robert <- function() { result <- array(NA, dim=c(13,20,10) ); for (i in 1:10) { result[,,i] <- t(t(Xbeta[,,i]) / denom[,,i]); }; };
r> andre <- function() { denom_myVersion <- array(rep(c(denom), each=13 ), c(13,20,10) ); result <- Xbeta / denom_myVersion; };
r> bgoldst <- function() { result <- aperm(aperm(Xbeta,c(2,3,1))/c(denom),c(3,1,2)); };
r> N <- 99999;
r> system.time({ replicate(N, robert() ); });
user system elapsed
25.421 0.000 25.440
r> system.time({ replicate(N, andre() ); });
user system elapsed
12.578 0.594 13.283
r> system.time({ replicate(N, bgoldst() ); });
user system elapsed
8.484 0.594 9.142
Also, as a general recommendation, it is helpful (for both questioners and answerers) to present these kinds of problems using minimal sample input, e.g.:
r> n <- array(1:12,dim=c(2,3,2)); n;
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
, , 2
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
r> d <- array(1:6,dim=c(1,3,2)); d;
, , 1
[,1] [,2] [,3]
[1,] 1 2 3
, , 2
[,1] [,2] [,3]
[1,] 4 5 6
r> aperm(aperm(n,c(2,3,1))/c(d),c(3,1,2));
, , 1
[,1] [,2] [,3]
[1,] 1 1.5 1.666667
[2,] 2 2.0 2.000000
, , 2
[,1] [,2] [,3]
[1,] 1.75 1.8 1.833333
[2,] 2.00 2.0 2.000000
# Is this what you're looking for?
Xbeta <- array(rnorm(13*20*10,0,1),dim=c(13,20,10))
denom <- array(rnorm(1*20*10,0,1),dim=c(1,20,10))
div.list <- sapply(1:10, FUN = function(x) t(Xbeta[,,x]) / denom[,,x], simplify = FALSE)
result <- array(do.call(c, div.list), dim = dim(Xbeta)[c(2,1,3)])
I'm not sure why you chose a 3-dimensional array for denom. Anyway, this can be done by paying close attention to how these numbers are stored in memory. In an array the first dimension "moves the fastest". By replicating the denom values 13 times with "each", you create an array with exactly the same dimensions as your numerator.
So, let's test it out:
Let's save the random values so we can use them for both methods:
set.seed(1)
Num_2600 <- rnorm(13*20*10,0,1)
Denom_200 <- rnorm(20*10,0,1)
Xbeta=array(Num_2600,dim=c(13,20,10))
denom=array(Denom_200,dim=c(1,20,10))
Your_result <- array(NA, dim=c(13,20,10))
Your code gives:
for(i in 1:10){
Your_result[,,i] <- t(t(Xbeta[,,i]) / denom[,,i])
}
My code:
denom_myVersion <- array(rep( Denom_200 , each=13), c(13,20,10))
> all(Your_result == Xbeta / denom_myVersion)
[1] TRUE
>
So we get the same results. The hard part is how to decide how to replicate so the numbers fall in the right spot. Notice:
denom_myVersion <- array(rep( Denom_200 , times=13), c(13,20,10))
> all(Your_result == Xbeta / denom_myVersion)
[1] FALSE
>
With 'each' as a parameter in rep each element is repeated 13 times before going to the next element. With times, the whole vector is repeated 13 times. Compare:
> rep(1:3, each =3)
[1] 1 1 1 2 2 2 3 3 3
> rep(1:3, times=3)
[1] 1 2 3 1 2 3 1 2 3

R: Fill a matrix with a covariance function

I'm experimenting with spectral simulation for generating unconditional Gaussian realizations of a spatial variable. The variable has a covariance function c(h) = exp(-h/a), where a is the range of the covariance function and h is distance. In the first step, I need to discretize the covariance function into an array/matrix. The entries in the matrix correspond to physical locations in space (i.e. the matrix indices correspond to x and y coordinates):
cov(i,j) = exp(-sqrt((i-64)^2 + (j-64)^2) / 20) for i,j = 1 to 128
I am looking to generate a matrix in R and fill it with the covariance function related to the indices of the array. As a total beginner with R, I'm a bit lost.
Stuff that expression into a function:
myfun <- function(i, j) {
exp(-sqrt((i-64)^2 + (j-64)^2) / 20)
}
Then make your "matrix" of possible i, j combinations:
n <- 128
combos <- expand.grid(i=1:n, j=1:n)
Then call your function with those two vectors:
matrix(myfun(combos$i, combos$j), nrow=n)
Using a smaller number:
> n <- 5
> combos <- expand.grid(i=1:n, j=1:n)
> matrix(myfun(combos$i, combos$j), nrow=n)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.01162296 0.01203954 0.01246747 0.01290681 0.01335761
[2,] 0.01203954 0.01247458 0.01292166 0.01338085 0.01385221
[3,] 0.01246747 0.01292166 0.01338860 0.01386840 0.01436113
[4,] 0.01290681 0.01338085 0.01386840 0.01436960 0.01488451
[5,] 0.01335761 0.01385221 0.01436113 0.01488451 0.01542247
>
You could also use outer:
f <- function(i, j) {
exp(-sqrt((i-64)^2 + (j-64)^2) / 20)
}
n <- 5
outer(1:n, 1:n, f)
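Both approaches should give the same matrix. As a sketch, the centre (64) and range (20) from the question can be made arguments so they are easy to change; the function name cov_fun and the argument names center and a are made up for illustration.

```r
# Covariance as a function of grid indices; center and a are assumed
# parameter names standing in for the 64 and 20 in the question
cov_fun <- function(i, j, center = 64, a = 20) {
  exp(-sqrt((i - center)^2 + (j - center)^2) / a)
}

n <- 128

# outer() evaluates cov_fun on every (i, j) combination
cov_outer <- outer(1:n, 1:n, cov_fun)

# Same result via expand.grid and a reshape
combos <- expand.grid(i = 1:n, j = 1:n)
cov_grid <- matrix(cov_fun(combos$i, combos$j), nrow = n)

all.equal(cov_outer, cov_grid)
cov_outer[64, 64]   # zero distance from the centre
```

Extra arguments can be passed through outer, for example outer(1:n, 1:n, cov_fun, a = 10) for a shorter range.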

Resources