I have a formula that creates matrices. Later I have to do some time-consuming work with every single matrix in the set. So far I have been bundling these matrices into a list with lapply(). Now I assume that operating on an array of matrices would be much faster. The thing is, I don't know how to have the matrices generated into an array the way lapply() generates them into a list.
Here is an example:
# matrix generating function
mxSim <- function(X, n) {
  mx = matrix(NA, nrow = n, ncol = 3,
              dimnames = list(NULL, c("d", "alpha", "beta")))
  mx[,1] = rbinom(n, 1, .375)
  mx[,2] = rnorm(n, 0, 2)
  mx[,3] = .42 * rnorm(n, 0, 6)
  return(mx)
}
# bundle matrices together
mx.lst <- lapply(1:1e1, mxSim, n = 1e4)
# some stuff to be done after, like e. g.:
lapply(mx.lst, function(m) lm(d ~ alpha + beta, as.data.frame(m)))
Could anybody give me some advice on how to do this with an array?
I've been looking into this answer, but there the matrices already have to be generated, and the only way I could make it work was to put them into a list first again.
Enough with the hooha. Let's time it.
library(microbenchmark)
# matrix generating function
mxSim <- function(X, n) {
  mx = matrix(NA, nrow = n, ncol = 3,
              dimnames = list(NULL, c("d", "alpha", "beta")))
  mx[,1] = rbinom(n, 1, .375)
  mx[,2] = rnorm(n, 0, 2)
  mx[,3] = .42 * rnorm(n, 0, 6)
  return(mx)
}
# bundle matrices together
mx.lst <- lapply(1:1e1, mxSim, n = 1e4)
mx.array <- array(mx.lst,dim=c(2,5))
# some stuff to be done after, like e. g.:
#Timing...
some.fnc <- function(m) lm(d ~ alpha + beta, as.data.frame(m))
list.test <- microbenchmark(lapply(mx.lst, some.fnc))
array.test <- microbenchmark(apply(mx.array, MARGIN = c(1, 2), some.fnc))
expr min lq mean median uq max neval
lapply: 74.8953 101.9424 173.8733 146.7186 234.7577 397.2494 100
apply: 77.2362 101.0338 174.4178 137.153 264.6854 418.7297 100
Naively applying a function over a list, as opposed to an array, gives almost identical performance.
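For reference, here is a sketch (not benchmarked here) of what the question actually asks about: bundling the simulated matrices into a true three-dimensional numeric array, rather than the 2 x 5 array of list elements built above, and then iterating over its third margin.
# Sketch: stack the 1e4 x 3 matrices into a single 1e4 x 3 x 10 numeric array
mx.arr3d <- array(unlist(mx.lst), dim = c(1e4, 3, length(mx.lst)),
                  dimnames = list(NULL, c("d", "alpha", "beta"), NULL))
# fit one model per matrix by applying over the third margin
fits <- apply(mx.arr3d, 3, function(m) lm(d ~ alpha + beta, as.data.frame(m)))
Given the timings above, this is unlikely to change the picture much: the expensive part is the model fitting, not the container holding the matrices.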
For the sake of completeness, I ran some further benchmarks with n = 1e3, as mentioned in the comment on @SeldomSeenSlim's answer. In addition, I ran them with a list of data.frames, and this was a bit surprising.
Here is the function for data.frames; for the matrix function see above.
dfSim <- function(X, n) {
  d <- rbinom(n, 1, .375)
  alpha <- rnorm(n, 0, 2)
  beta <- .42 * rnorm(n, 0, 6)
  data.frame(d, alpha, beta)
}
Bundling
mx.lst <- lapply(1:1e3, mxSim, n = 1e4)
mx.array <- array(mx.lst, dim = c(2, 500))
df.lst <- lapply(1:1e3, dfSim, n = 1e4)
And the microbenchmarks:
some.fnc <- function(m) lm(d ~ alpha + beta, as.data.frame(m))
list.test <- microbenchmark(lapply(mx.lst, some.fnc))
array.test <- microbenchmark(apply(mx.array, MARGIN = c(1, 2), some.fnc))
df.list.test <- microbenchmark(lapply(df.lst, some.fnc))
Results
Unit: seconds
expr min lq mean median uq max neval
lapply 9.658568 9.742613 9.831577 9.784711 9.911466 10.30035 100
apply 9.727057 9.951213 9.994986 10.00614 10.06847 10.22178 100
lapply(df) 9.121293 9.229912 9.286592 9.277967 9.327829 10.12548 100
Now, what does this tell us?
But, okay, as a bold side note:
microbenchmark((lapply(1:1e3, mxSim, n = 1e4)), (lapply(1:1e3, dfSim, n = 1e4)))
expr min lq mean median uq max neval cld
(lapply(mxSim)) 2.533466 2.551199 2.563864 2.555421 2.559234 2.693383 100 a
(lapply(dfSim)) 2.676869 2.695826 2.718454 2.701161 2.706249 3.293431 100 b
I am trying to optimise code designed to compute double sums of products of the elements of two square matrices. Let's say we have two square matrices of size n, W and V. The object that needs to be computed is a vector B with elements
B_i = \sum_{j \ne i} \sum_{k \notin \{i, j\}} w_{ik} v_{jk}
In simple terms: compute element-by-element products of two different rows in two different matrices and take their sum, then take an extra sum over all rows of the second matrix (sans identical indices).
The problem is, the computational complexity of this task is seemingly O(n^3), because the length of the object we are creating, B, is n, and each element requires two summations. This is what I have come up with:
For given i and j (i≠j), start with the inner sum over k. Sum for all k, then subtract the terms for k=i and k=j, and multiply by the indicator of j≠i.
Since the restriction j≠i has been taken care of in the inner sum, the outer sum is taken just for j=1,...,n.
If we denote S_{ij} = \sum_k w_{ik} v_{jk}, then the two steps will look like
B_{ij} = (S_{ij} - w_{ii} v_{ji} - w_{ij} v_{jj}) \cdot \mathbf{1}(i \ne j)
and
B_i = \sum_j B_{ij}.
However, writing a loop turned out to be very inefficient. n=100 works quickly (0.05 seconds). But, for instance, when n=500 (we are talking about real-world applications here), the average computation time is 3 seconds, and for n=1000, it jumps to 22 s.
The inner loop over k can be easily replaced by a sum, but the outer one... In this question, the suggested solution is sapply, but it implies that the summation must be done over all elements.
This is the code I am trying to evaluate before the heat death of the Universe for large n.
set.seed(1)
N <- 500
x1 <- rnorm(N)
x2 <- rchisq(N, df=3)
bw1 <- bw.nrd(x1)
bw2 <- bw.nrd(x2)
w <- outer(x1, x1, function(x, y) dnorm((x-y)/bw1) )
w <- w/rowSums(w)
v <- outer(x2, x2, function(x, y) dnorm((x-y)/bw2) )
v <- v/rowSums(v)
Bij <- matrix(NA, ncol=N, nrow=N)
for (i in 1:N) { # Around 22 secs for N=1000
  for (j in 1:N) {
    Bij[i, j] <- (sum(w[i, ]*v[j, ]) - w[i, i]*v[j, i] - w[i, j]*v[j, j]) * (i!=j)
  }
}
Bi <- rowSums(Bij)
How would an expert R programmer vectorise such kind of loops?
Update:
In fact, given your expression for B_{ij}, we may also do the following
diag(w) <- diag(v) <- 0
BBij <- tcrossprod(w, v)
diag(BBij) <- 0
range(rowSums(BBij) - Bi)
# [1] -2.220446e-16 0.000000e+00
range(BBij - Bij)
# [1] -6.938894e-18 5.204170e-18
Hence, while somewhat obvious, it may also be an interesting observation for your purposes that neither B_{ij} nor B_i depend on the diagonals of W and V.
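In symbols (restating the observation above, with the definition of B_{ij} from the question): once the diagonals of W and V are zeroed, the restriction k \notin \{i, j\} holds automatically, so for i \ne j
B_{ij} = \sum_{k \notin \{i,j\}} w_{ik} v_{jk} = \sum_k w_{ik} v_{jk} = (W V^\top)_{ij},
which is exactly the (i, j) entry of tcrossprod(w, v); zeroing the diagonal of the result then handles the i = j case.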
Initial answer:
Since
B_i = \sum_{j \ne i} \sum_{k \notin \{i,j\}} w_{ik} v_{jk} = \sum_k w_{ik} V_{.k} - \sum_k w_{ik} v_{ik} + w_{ii} v_{ii},
where the diagonals of W and V can be set to zero and V_{.k} denotes the sum of the k-th column of V, we have
diag(w) <- diag(v) <- 0
A <- w * v
rowSums(sweep(w, 2, colSums(v), `*`)) - rowSums(A) + diag(A)
where
range(rowSums(sweep(w, 2, colSums(v), `*`)) - rowSums(A) + diag(A) - Bi)
# [1] -1.110223e-16 1.110223e-16
Without looking into the content of your matrices w and v, your double for-loop can be replaced with simple matrix operations, using one matrix multiplication (tcrossprod), transpose (t) and diagonal extraction:
Mat.ij <- tcrossprod(w, v) -
matrix(rep(diag(w), times = N), nrow = N) * t(v) -
w * matrix(rep(diag(v), each = N), nrow = N)
diag(Mat.ij) <- 0
all.equal(Bij, Mat.ij)
[1] TRUE
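If the end goal is the vector B_i rather than the full matrix, the same row-sum step as in the question should carry over to this formulation as well; a minimal sketch:
# row sums of the B_ij matrix recover the vector B_i from the question
Bi.vec <- rowSums(Mat.ij)
all.equal(Bi.vec, Bi)  # should be TRUE up to floating-point error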
It seems I am stuck on the following problem with numpy.
I have an array X with shape: X.shape = (nexp, ntime, ndim, npart)
I need to compute binned statistics on this array along npart dimension, according to the values in binvals (and some bins), but keeping all the other dimensions there, because I have to use the binned statistic to remove some bias in the original array X. Binning values have shape binvals.shape = (nexp, ntime, npart).
A complete, minimal example to explain what I am trying to do. Note that, in reality, I am working on large arrays and with several hundred bins (so this implementation takes forever):
import numpy as np
np.random.seed(12345)
X = np.random.randn(24).reshape(1,2,3,4)
binvals = np.random.randn(8).reshape(1,2,4)
bins = [-np.inf, 0, np.inf]
nexp, ntime, ndim, npart = X.shape
cleanX = np.zeros_like(X)
for ne in range(nexp):
    for nt in range(ntime):
        indices = np.digitize(binvals[ne, nt, :], bins)
        for nd in range(ndim):
            for nb in range(1, len(bins)):
                inds = indices==nb
                cleanX[ne, nt, nd, inds] = X[ne, nt, nd, inds] - \
                    np.mean(X[ne, nt, nd, inds], axis = -1)
Looking at the results of this may make it clearer?
In [8]: X
Out[8]:
array([[[[-0.20470766, 0.47894334, -0.51943872, -0.5557303 ],
[ 1.96578057, 1.39340583, 0.09290788, 0.28174615],
[ 0.76902257, 1.24643474, 1.00718936, -1.29622111]],
[[ 0.27499163, 0.22891288, 1.35291684, 0.88642934],
[-2.00163731, -0.37184254, 1.66902531, -0.43856974],
[-0.53974145, 0.47698501, 3.24894392, -1.02122752]]]])
In [10]: cleanX
Out[10]:
array([[[[ 0. , 0.67768523, -0.32069682, -0.35698841],
[ 0. , 0.80405255, -0.49644541, -0.30760713],
[ 0. , 0.92730041, 0.68805503, -1.61535544]],
[[ 0.02303938, -0.02303938, 0.23324375, -0.23324375],
[-0.81489739, 0.81489739, 1.05379752, -1.05379752],
[-0.50836323, 0.50836323, 2.13508572, -2.13508572]]]])
In [12]: binvals
Out[12]:
array([[[ -5.77087303e-01, 1.24121276e-01, 3.02613562e-01,
5.23772068e-01],
[ 9.40277775e-04, 1.34380979e+00, -7.13543985e-01,
-8.31153539e-01]]])
Is there a vectorized solution? I thought of using scipy.stats.binned_statistic, but I seem to be unable to understand how to use it for this aim. Thanks!
import numpy as np
np.random.seed(100)
nexp = 3
ntime = 4
ndim = 5
npart = 100
nbins = 4
binvals = np.random.rand(nexp, ntime, npart)
X = np.random.rand(nexp, ntime, ndim, npart)
bins = np.linspace(0, 1, nbins + 1)
d = np.digitize(binvals, bins)[:, :, np.newaxis, :]
r = np.arange(1, len(bins)).reshape((-1, 1, 1, 1, 1))
m = d[np.newaxis, ...] == r
counts = np.sum(m, axis=-1, keepdims=True).clip(min=1)
means = np.sum(X[np.newaxis, ...] * m, axis=-1, keepdims=True) / counts
cleanX = X - np.choose(d - 1, means)
Ok, I think I got it, mainly based on the answer by @jdehesa.
clean2 = np.zeros_like(X)
d = np.digitize(binvals, bins)
for i in range(1, len(bins)):
    m = d == i
    minds = np.where(m)
    sl = [*minds[:2], slice(None), minds[2]]
    msum = m.sum(axis=-1)
    clean2[sl] = (X - \
                  (np.sum(X * m[..., np.newaxis, :], axis=-1) /
                   msum[..., np.newaxis])[..., np.newaxis])[sl]
This gives the same results as my original code.
On the small arrays I have in the example here, this solution is approximately three times as fast as the original code. I expect it to be way faster on larger arrays.
Update:
Indeed it's faster on larger arrays (I didn't do any formal test), but despite this it just reaches an acceptable level of performance... any further suggestions on extra vectorizations would be very welcome.
Consider a simple toy problem.
#build array, A, with its last dimension filled with randomly selected rows from matrix G
set.seed(1)
n=2
G = matrix(c(0,1,0,1,1,1,1,1,0), 3, 3)
inds = matrix( ceiling( 3 * runif( n^2 ) ), n, n )
A = array(0, dim = c(n, n, 3))
for(i in 1:n){
  for(j in 1:n){
    A[i, j, ] = G[inds[i,j],]
  }
}
I would like to create a matrix, R, with each entry R[i,j] containing the row r of G such that A[i,j,] == G[r,].
The answer to the above is
R = [1, 2;
     2, 3]
There are lots of questions here on matching rows of a matrix to a vector, but I thought I'd ask the more general question.
Using row.match from the package prodlim, I came up with the following, which I believe could be generalized to find any subarray of an array equal to a specified array:
apply(A, c(1,2), function(x) row.match(x,G))
But perhaps there are other, better ways?
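One package-free alternative (a sketch, assuming every slice A[i, j, ] has an exact match among the rows of G) would be to collapse rows into key strings and use match():
# collapse each row of G into a key, then match each A[i, j, ] against those keys
g.keys <- apply(G, 1, paste, collapse = "\r")
R2 <- apply(A, c(1, 2), function(x) match(paste(x, collapse = "\r"), g.keys))
For the toy data above this should reproduce the same R as the row.match approach.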
I'm doing some optimization in R and in connection with that I need to write a function that returns a jacobian. It's a very simple jacobian -- just zeros and ones -- but I'd like to populate it quickly and cleanly. My current code works but is very sloppy.
I have a four-dimensional array of probabilities. Index the dimensions by i, j, k, l. My constraint is that, for each i, j, k, the sum of probabilities over index l must equal 1.
I compute my constraint vector like this:
get_prob_array_from_vector <- function(prob_vector, array_dim) {
  return(array(prob_vector, array_dim))
}
constraint_function <- function(prob_vector, array_dim) {
  prob_array <- get_prob_array_from_vector(prob_vector, array_dim)
  prob_array_sums <- apply(prob_array, MARGIN=c(1, 2, 3), FUN=sum)
  return(as.vector(prob_array_sums) - 1) # Should equal zero
}
My question is: what is a clean, fast way of computing the jacobian of as.vector(apply(array(my_input_vector, array_dim), MARGIN=c(1, 2, 3), FUN=sum)) -- i.e., my constraint_function in the code above -- with respect to my_input_vector?
Here is my sloppy solution (which I check for correctness against the jacobian function from the numDeriv package):
library(numDeriv)
array_dim <- c(5, 4, 3, 3)
get_prob_array_from_vector <- function(prob_vector, array_dim) {
  return(array(prob_vector, array_dim))
}
constraint_function <- function(prob_vector, array_dim) {
  prob_array <- get_prob_array_from_vector(prob_vector, array_dim)
  prob_array_sums <- apply(prob_array, MARGIN=c(1, 2, 3), FUN=sum)
  return(as.vector(prob_array_sums) - 1)
}
constraint_function_jacobian <- function(prob_vector, array_dim) {
  prob_array <- get_prob_array_from_vector(prob_vector, array_dim)
  jacobian <- matrix(0, Reduce("*", dim(prob_array)[1:3]), length(prob_vector))
  ## Must be a faster, cleaner way of populating jacobian
  for(i in seq_along(prob_vector)) {
    dummy_vector <- rep(0, length(prob_vector))
    dummy_vector[i] <- 1
    dummy_array <- get_prob_array_from_vector(dummy_vector, array_dim)
    dummy_array_sums <- apply(dummy_array, MARGIN=c(1, 2, 3), FUN=sum)
    jacobian_row_idx <- which(dummy_array_sums != 0, arr.ind=FALSE)
    stopifnot(length(jacobian_row_idx) == 1)
    jacobian[jacobian_row_idx, i] <- 1
  } # Is there a fast, readable one-liner that does the same as this for loop?
  stopifnot(sum(jacobian) == length(prob_vector))
  stopifnot(all(jacobian == 0 | jacobian == 1))
  return(jacobian)
}
## Example of a probability array satisfying my constraint
my_prob_array <- array(0, array_dim)
for(i in seq_len(array_dim[1])) {
  for(j in seq_len(array_dim[2])) {
    my_prob_array[i, j, , ] <- diag(array_dim[3])
  }
}
my_prob_array[1, 1, , ] <- 1 / array_dim[3]
my_prob_array[2, 1, , ] <- 0.25 * (1 / array_dim[3]) + 0.75 * diag(array_dim[3])
my_prob_vector <- as.vector(my_prob_array) # Flattened representation of my_prob_array
should_be_zero_vector <- constraint_function(my_prob_vector, array_dim)
is.vector(should_be_zero_vector)
all(should_be_zero_vector == 0) # Constraint is satisfied
## Check constraint_function_jacobian for correctness using numDeriv
jacobian_analytical <- constraint_function_jacobian(my_prob_vector, array_dim)
jacobian_numerical <- jacobian(constraint_function, my_prob_vector, array_dim=array_dim)
max(abs(jacobian_analytical - jacobian_numerical)) # Very small
My functions take prob_vector as input -- i.e., a flattened representation of my probability array -- because optimization functions require vector arguments.
It took me some time to understand what you were trying to do, but here is a proposed replacement for your constraint_function_jacobian:
enhanced <- function(prob_vector, array_dim) {
  firstdim <- Reduce("*", array_dim[1:3])
  seconddim <- length(prob_vector)
  jacobian <- matrix(0, firstdim, seconddim)
  idxs <- split(1:seconddim, cut(1:seconddim, array_dim[4], labels=FALSE))
  for (i in seq_along(idxs)) {
    diag(jacobian[, idxs[[i]] ]) <- 1
  }
  stopifnot(sum(jacobian) == length(prob_vector))
  stopifnot(all(jacobian == 0 | jacobian == 1))
  jacobian
}
Unless I'm wrong, the jacobian construction amounts to filling diagonals with 1; since it is not a square matrix, we have to split it into array_dim[4] square blocks and fill each of their diagonals with 1.
I also got rid of transforming prob_vector into an array just to take its dim, since that will be the same as array_dim. Skipping this step is not a huge improvement, but it simplifies the code IMO.
The results are identical according to this test:
identical(constraint_function_jacobian(my_prob_vector, array_dim),
enhanced(my_prob_vector, array_dim))
# [1] TRUE
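Given that block structure, the whole matrix could presumably also be built in a single readable line, for example with kronecker; a sketch (hypothetical helper name, same [I | I | ... | I] layout as enhanced() above):
# array_dim[4] identity blocks of size prod(array_dim[1:3]) bound side by side;
# prob_vector is kept only to preserve the original signature
oneliner <- function(prob_vector, array_dim) {
  kronecker(matrix(1, nrow = 1, ncol = array_dim[4]), diag(prod(array_dim[1:3])))
}
identical(oneliner(my_prob_vector, array_dim), enhanced(my_prob_vector, array_dim))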
According to the benchmark it gives a great speedup:
microbenchmark::microbenchmark(
original=constraint_function_jacobian(my_prob_vector, array_dim),
enhanced=enhanced(my_prob_vector, array_dim), times=100)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# original 16946.979 18466.491 20150.304 19066.7410 19671.4100 28148.035 100 b
# enhanced 678.222 737.948 799.005 796.3905 834.5925 1141.773 100 a
I have the following function which takes 4 vectors. The T vector has a given length, and the 3 other vectors (pga, Sa5Hz and Sa1Hz) all share another given length (identical to each other, but not necessarily equal to the length of T).
The output is a matrix with length(T) rows and length(pga) columns.
My code below seems like the perfect example of what NOT to do; however, I could not figure out a way to optimize it using an apply function. Can anyone help?
designSpectrum <- function (T, pga, Sa5Hz, Sa1Hz){
  Ts <- Sa1Hz / Sa5Hz
  #By convention, if Sa5Hz is null, set Ts as 0.
  Ts[is.nan(Ts)] <- 0
  res <- matrix(NA, nrow = length(T), ncol = length(pga))
  for (i in 1:nrow(res))
  {
    for (j in 1:ncol(res))
    {
      res[i,j] <- if(T[i] <= 0) {pga[j]}
      else if (T[i] <= 0.2 * Ts[j]) {pga[j] + T[i] * (Sa5Hz[j] - pga[j]) / (0.2 * Ts[j])}
      else if (T[i] <= Ts[j]) {Sa5Hz[j]}
      else Sa1Hz[j] / T[i]
    }
  }
  return(res)
}
Instead of doing a double for loop and processing each i and j value separately, you could use the outer function to process all of them in one shot. Since you're now processing multiple i and j values simultaneously, you could switch to the vectorized ifelse statement instead of the non-vectorized if and else statements:
designSpectrum2 <- function (T, pga, Sa5Hz, Sa1Hz) {
  Ts <- Sa1Hz / Sa5Hz
  Ts[is.nan(Ts)] <- 0
  outer(1:length(T), 1:length(pga), function(i, j) {
    ifelse(T[i] <= 0, pga[j],
           ifelse(T[i] <= 0.2 * Ts[j], pga[j] + T[i] * (Sa5Hz[j] - pga[j]) / (0.2 * Ts[j]),
                  ifelse(T[i] <= Ts[j], Sa5Hz[j], Sa1Hz[j] / T[i])))
  })
}
identical(designSpectrum(T, pga, Sa5Hz, Sa1Hz), designSpectrum2(T, pga, Sa5Hz, Sa1Hz))
# [1] TRUE
Data:
T <- -1:3
pga <- 1:3
Sa5Hz <- 2:4
Sa1Hz <- 3:5
You can see the efficiency gains by testing on rather large vectors (here I'll use an output matrix with 1 million entries):
# Larger vectors
set.seed(144)
T2 <- runif(1000, -1, 3)
pga2 <- runif(1000, -1, 3)
Sa5Hz2 <- runif(1000, -1, 3)
Sa1Hz2 <- runif(1000, -1, 3)
# Runtime comparison
all.equal(designSpectrum(T2, pga2, Sa5Hz2, Sa1Hz2), designSpectrum2(T2, pga2, Sa5Hz2, Sa1Hz2))
# [1] TRUE
system.time(designSpectrum(T2, pga2, Sa5Hz2, Sa1Hz2))
# user system elapsed
# 4.038 1.011 5.042
system.time(designSpectrum2(T2, pga2, Sa5Hz2, Sa1Hz2))
# user system elapsed
# 0.517 0.138 0.652
The approach with outer is roughly 8x faster here (5.042 s vs 0.652 s elapsed).
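Another way to express the same vectorisation (a sketch, not benchmarked here, with a hypothetical designSpectrum3 name) is to keep T fully vectorised and only loop over the columns with vapply, avoiding the index-matrix indirection of outer():
designSpectrum3 <- function(T, pga, Sa5Hz, Sa1Hz) {
  Ts <- Sa1Hz / Sa5Hz
  Ts[is.nan(Ts)] <- 0
  # one column per (pga, Sa5Hz, Sa1Hz) triple; each column is computed
  # with ifelse vectorised over the whole T vector at once
  vapply(seq_along(pga), function(j) {
    ifelse(T <= 0, pga[j],
           ifelse(T <= 0.2 * Ts[j], pga[j] + T * (Sa5Hz[j] - pga[j]) / (0.2 * Ts[j]),
                  ifelse(T <= Ts[j], Sa5Hz[j], Sa1Hz[j] / T)))
  }, numeric(length(T)))
}
It should produce the same matrix as designSpectrum2, and since both versions evaluate the same vectorised ifelse chains, the performance should be in the same ballpark.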