Loop in C to make an R script more efficient

I am trying to compute the number of pairwise differences between each row in a table of 100 rows x 2500 Columns.
I have a small RScript that does this but the run time is (obviously) extremely high!
I am trying to write a loop in C but I keep getting errors (compileCode).
Do you have any idea of how I can "convert" the following loop to C?
pw.dist <- function (vec1, vec2) {
return( length(which(vec1!=vec2)) )
}
N.row <- dim(table)[1]
pw.dist.table <- array( dim = c(dim(table)[1], dim(table)[1]))
for (i in 1:N.row) {
for (j in 1:N.row) {
pw.dist.table[i,j] <- pw.dist(table[i,-c(1)], table[j,-c(1)])
}
}
I am trying something like:
sig <- signature(N.row="integer", table="integer", pw.dist.table="integer")
code <- "
for( int i = 0; i < (*N.row) - 1; i++ ) {
for( int j = i + 1; j < *N.row; j++ ) {
int pw.dist.table = table[j] - table[i];
}
}
"
f <- cfunction( sig, code, convention=".C" )
I am a complete newbie when it comes to programming!
Thanks in advance.
JMFA

Before trying to optimize the code,
it is always a good idea to check where the time is spent.
Rprof()
... # Your loops
Rprof(NULL)
summaryRprof()
In your case, the loop is not slow, but your distance function is.
$by.total
total.time total.pct self.time self.pct
"pw.dist" 37.98 98.85 0.54 1.41
"which" 37.44 97.45 34.02 88.55
"!=" 3.12 8.12 3.12 8.12
You can rewrite it as follows (it takes 1 second).
# Sample data
n <- 100
k <- 2500
d <- matrix(sample(1:10, n*k, replace=TRUE), nr=n, nc=k)
# Function to compute the number of differences
f <- function(i,j) sum(d[i,]!=d[j,])
# You could use a loop, instead of outer,
# it should not make a big difference.
d2 <- outer( 1:n, 1:n, Vectorize(f) )

Vincent above has the right idea. In addition, you can take advantage of how matrices work in R and get even faster results:
n <- 100
k <- 2500
d <- matrix(sample(1:10, n*k, replace=TRUE), nr=n, nc=k)
system.time(d2 <- outer( 1:n, 1:n, Vectorize(f) ))
#precompute transpose of matrix - you can just replace
#dt with t(d) if you want to avoid this
system.time(dt <- t(d))
system.time(sapply(1:n, function(i) colSums( dt[,i] != dt)))
Output:
#> system.time(d2 <- outer( 1:n, 1:n, Vectorize(f) ))
# user system elapsed
# 0.4 0.0 0.4
#> system.time(dt <- t(d))
# user system elapsed
# 0 0 0
#> system.time(sapply(1:n, function(i) colSums( dt[,i] != dt)))
# user system elapsed
# 0.08 0.00 0.08
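To see why the colSums version gives the same matrix as the double loop, here is a minimal sketch on a small made-up matrix (the sizes n = 6, k = 20 and the names naive and fast are only for illustration, they are not from the question). Comparing the vector dt[, i] with the matrix dt recycles the vector down every column, so column j of the comparison marks where row i of d differs from row j.

```r
# Small illustrative data (assumed sizes, not the 100 x 2500 table)
set.seed(1)
n <- 6
k <- 20
d <- matrix(sample(1:3, n * k, replace = TRUE), nrow = n, ncol = k)

# Reference: explicit double loop over all pairs of rows
naive <- matrix(0, n, n)
for (i in 1:n) {
  for (j in 1:n) {
    naive[i, j] <- sum(d[i, ] != d[j, ])
  }
}

# Transposed approach: rows of d become columns of dt, so dt[, i] is
# recycled column by column and colSums counts the mismatches per row of d
dt <- t(d)
fast <- sapply(1:n, function(i) colSums(dt[, i] != dt))

all(naive == fast)
```

The result is symmetric with zeros on the diagonal, so only half of the pairs strictly need to be computed if that ever matters.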

Related

In R what is an efficient way to fill a vector of unknown size?

v <- c()
i <- 1
while (some_condition) {
v[i] <- some_value
i <- i + 1
}
So I am aware that each time v is modified a copy is made and v is moved. This is obviously very slow. This appears to be a trivial problem in other languages. What is the efficient way to do this in R?
As far as I'm aware there is no native method for populating a vector of unknown length without constantly rewriting it to memory. Maybe data.table has some tricks?
Lacking any sophisticated solutions, something as simple as preallocating an oversized vector, as 12b345b6b78 suggests, can save you a fair bit of time.
unk <- 1e5
f1 <- function(unk) {
v <- c()
for (i in 1:unk) {
v[i] <- i
}
v
}
f2 <- function(unk) {
v <- vector(length=unk*2)
for (i in 1:unk) {
v[i] <- i
}
v[1:i]
}
f10 <- function(unk) {
v <- vector(length=unk*10)
for (i in 1:unk) {
v[i] <- i
}
v[1:i]
}
library(microbenchmark)
mb <- microbenchmark(f1(unk), f2(unk), f10(unk), times=50)
mb
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(unk) 27.177678 28.364024 32.65618 29.896673 36.18994 48.37088 50 c
# f2(unk) 8.075867 9.025156 10.87335 9.271589 10.07932 35.29222 50 a
# f10(unk) 11.132773 13.071857 20.46808 15.059086 21.53610 187.00786 50 b
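If you cannot even guess an upper bound, one common alternative is to double the capacity whenever the vector fills up and trim it at the end. This is only a sketch under my own assumptions: grow_vector, the starting capacity of 16 and the for loop standing in for while (some_condition) are all made up for illustration, and I have not benchmarked it against f2 above.

```r
# Hypothetical helper: fill a vector whose final length is not known up front
grow_vector <- function(n_items) {
  v <- numeric(16)  # initial capacity, an arbitrary choice
  n <- 0            # number of slots actually used
  for (value in seq_len(n_items)) {  # stands in for while (some_condition)
    if (n == length(v)) {
      length(v) <- 2 * length(v)     # double the capacity (pads with NA)
    }
    n <- n + 1
    v[n] <- value
  }
  v[seq_len(n)]     # drop the unused tail
}

res <- grow_vector(1000)
length(res)
```

Because the capacity doubles, the vector is reallocated only about log2(n) times rather than once per element.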

Vectorising (or speeding up) a double loop with summation over non-identical indices in R

I am trying to optimise the code designed to compute double sums of products of the elements of two square matrices. Let’s say we have two square matrices of size n, W and V. The object that needs to be computed is a vector B with elements
B_i = \sum_{j \ne i} \sum_{k \ne i, j} W_{ik} V_{jk}, i = 1, ..., n.
In simple terms: compute element-by-element products of two different rows in two different matrices and take their sum, then take an extra sum over all rows of the second matrix (sans identical indices).
The problem is, the computational complexity of this task is seemingly O(n^3), because the length of the object we are creating, B, is n, and each element requires two summations. This is what I have come up with:
For given i and j (i≠j), start with the inner sum over k. Sum for all k, then subtract the terms for k=i and k=j, and multiply by the indicator of j≠i.
Since the restriction j≠i has been taken care of in the inner sum, the outer sum is taken just for j=1,...,n.
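If we denote by B_{ij} the inner sum for a given pair (i, j), then the two steps will look like
B_{ij} = (\sum_{k=1}^{n} W_{ik} V_{jk} - W_{ii} V_{ji} - W_{ij} V_{jj}) \cdot 1(j \ne i)
and B_i = \sum_{j=1}^{n} B_{ij}.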
However, writing a loop turned out to be very inefficient. n=100 works quickly (0.05 seconds). But, for instance, when n=500 (we are talking about real-world applications here), the average computation time is 3 seconds, and for n=1000, it jumps to 22 s.
The inner loop over k can be easily replaced by a sum, but the outer one... In this question, the suggested solution is sapply, but it implies that the summation must be done over all elements.
This is the code I am trying to evaluate before the heat death of the Universe for large n.
set.seed(1)
N <- 500
x1 <- rnorm(N)
x2 <- rchisq(N, df=3)
bw1 <- bw.nrd(x1)
bw2 <- bw.nrd(x2)
w <- outer(x1, x1, function(x, y) dnorm((x-y)/bw1) )
w <- w/rowSums(w)
v <- outer(x2, x2, function(x, y) dnorm((x-y)/bw2) )
v <- v/rowSums(v)
Bij <- matrix(NA, ncol=N, nrow=N)
for (i in 1:N) { # Around 22 secs for N=1000
for (j in 1:N) {
Bij[i, j] <- (sum(w[i, ]*v[j, ]) - w[i, i]*v[j, i] - w[i, j]*v[j, j]) * (i!=j)
}
}
Bi <- rowSums(Bij)
How would an expert R programmer vectorise such kind of loops?
Update:
In fact, given your expression for B_{ij}, we may also do the following
diag(w) <- diag(v) <- 0
BBij <- tcrossprod(w, v)
diag(BBij) <- 0
range(rowSums(BBij) - Bi)
# [1] -2.220446e-16 0.000000e+00
range(BBij - Bij)
# [1] -6.938894e-18 5.204170e-18
Hence, while somewhat obvious, it may also be an interesting observation for your purposes that neither B_{ij} nor B_i depend on the diagonals of W and V.
Initial answer:
Since
B_i = \sum_{k} W_{ik} V_{.k} - \sum_{k} W_{ik} V_{ik} + W_{ii} V_{ii},
where the diagonals of W and V can be set to zero and V_{.k} denotes the sum of the k-th column of V, we have
diag(w) <- diag(v) <- 0
A <- w * v
rowSums(sweep(w, 2, colSums(v), `*`)) - rowSums(A) + diag(A)
where
range(rowSums(sweep(w, 2, colSums(v), `*`)) - rowSums(A) + diag(A) - Bi)
# [1] -1.110223e-16 1.110223e-16
Without looking into the content of your matrices w and v, your double for-loop can be replaced with simple matrix operations, using one matrix multiplication (tcrossprod), transpose (t) and diagonal extraction:
Mat.ij <- tcrossprod(w, v) -
matrix(rep(diag(w), times = N), nrow = N) * t(v) -
w * matrix(rep(diag(v), each = N), nrow = N)
diag(Mat.ij) <- 0
all.equal(Bij, Mat.ij)
# [1] TRUE
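For reuse, the tcrossprod idea can be wrapped in a small function and checked against a plain loop on a small case. This is a sketch, not part of the answers above: double_sum_B is a hypothetical name, and the random 30 x 30 matrices are made-up test data rather than the kernel weights from the question.

```r
# Hypothetical wrapper around the tcrossprod approach
double_sum_B <- function(w, v) {
  diag(w) <- 0               # removes the k = i terms
  diag(v) <- 0               # removes the k = j terms
  Bij <- tcrossprod(w, v)    # Bij[i, j] = sum_k w[i, k] * v[j, k]
  diag(Bij) <- 0             # enforces j != i
  rowSums(Bij)
}

# Made-up test data
set.seed(1)
N <- 30
w <- matrix(runif(N * N), N, N)
v <- matrix(runif(N * N), N, N)

# Reference: the double loop from the question, accumulated per row
B_loop <- numeric(N)
for (i in 1:N) {
  for (j in 1:N) {
    if (i != j) {
      B_loop[i] <- B_loop[i] + sum(w[i, ] * v[j, ]) -
        w[i, i] * v[j, i] - w[i, j] * v[j, j]
    }
  }
}

check <- all.equal(double_sum_B(w, v), B_loop)
```

The function modifies only local copies of w and v, so the caller's matrices keep their diagonals.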

Altering arrays to add/remove entries at each time-step in R

This question probably has a simple solution, but I cannot think of how to do it...
So I have a script as follows:
# ------------------ MODEL SETUP ----------------------------------------
# simulation length
t_max <- 50
# arena
arena_x <- 100
arena_y <- 100
# plant parameters
a <- 0.1
b <- 0.1
g <- 1
# starting loop
sep <- 10
# Original matrix
plantLocsX <- matrix(rep(seq(0,arena_x,sep), 1+arena_y/sep),
nrow=1+arena_x/sep,
ncol=1+arena_y/sep)
plantLocsY <- t(plantLocsX)
# list of plant locations and initial sizes
# (computed after plantLocsX exists, otherwise nplants cannot be evaluated)
nplants <-dim(plantLocsX)[1]*dim(plantLocsX)[2]
iterations<-5
totalBiomass<-matrix(0,nrow=iterations,ncol=1)
plantSizes <- matrix(1,nrow=nplants,ncol=1)
# Plot the plants
radius <- sqrt( plantSizes/ pi )
symbols(plantLocsX, plantLocsY, radius, xlim = c(0,100), ylim=c(0,100), inches=0.05, fg = "green",
xlab = "x domain (m)", ylab = "y domain (m)", main = "Random Plant Locations", col.main = 51)
# Calculate distances between EACH POSSIBLE PAIR of plants
distances <- matrix(0,nrow=nplants,ncol=nplants)
for (i in 1:nplants){
for (j in 1:nplants){
distances[i,j] <- sqrt( (plantLocsX[i]-plantLocsX[j])^2 + (plantLocsY[i]-plantLocsY[j])^2 )
}
}
# ------------------ MODEL RUNNING ---------------------------------------
I need to alter the arrays containing plant locations and plant sizes so that at each time step, entries are removed and added (simulating mortality/reproduction, respectively). The "distances" must be updated with plant locations and sizes after each iteration...I can only think of complex ways to do this: destructing and constructing new matrices at each time step to fit the new number of elements but there must be functions to make this simpler....any advice?
Many thanks!!
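One possible way to avoid rebuilding matrices by hand is to keep one row per plant in a data frame, so that mortality is row subsetting and reproduction is rbind(), and to recompute the distances with dist() after each step. The sketch below is only an illustration under assumptions of mine: step_population, p_death and n_births are hypothetical names, and the mortality and reproduction rules (independent death with fixed probability, newborns at uniform random positions) are placeholders for whatever the real model uses.

```r
set.seed(42)
arena_x <- 100
arena_y <- 100
sep <- 10

# One row per plant: location and size
grid <- expand.grid(x = seq(0, arena_x, sep), y = seq(0, arena_y, sep))
plants <- data.frame(x = grid$x, y = grid$y, size = 1)

# Hypothetical update rule for one time step
step_population <- function(plants, p_death = 0.1, n_births = 5) {
  # mortality: each plant dies independently with probability p_death (assumed)
  survivors <- plants[runif(nrow(plants)) >= p_death, , drop = FALSE]
  # reproduction: n_births new plants at uniform random positions (assumed)
  newborns <- data.frame(x = runif(n_births, 0, arena_x),
                         y = runif(n_births, 0, arena_y),
                         size = 1)
  out <- rbind(survivors, newborns)
  rownames(out) <- NULL
  out
}

for (t in 1:5) {
  plants <- step_population(plants)
  # all pairwise Euclidean distances for the current population
  distances <- as.matrix(dist(plants[, c("x", "y")]))
}
```

dist() also replaces the double loop used to fill distances in the setup code, and the matrix always has as many rows and columns as there are plants at that step.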

Clean way to compute jacobian of array summation

I'm doing some optimization in R and in connection with that I need to write a function that returns a jacobian. It's a very simple jacobian -- just zeros and ones -- but I'd like to populate it quickly and cleanly. My current code works but is very sloppy.
I have a four-dimensional array of probabilities. Index the dimensions by i, j, k, l. My constraint is that, for each i, j, k, the sum of probabilities over index l must equal 1.
I compute my constraint vector like this:
get_prob_array_from_vector <- function(prob_vector, array_dim) {
return(array(prob_vector, array_dim))
}
constraint_function <- function(prob_vector, array_dim) {
prob_array <- get_prob_array_from_vector(prob_vector, array_dim)
prob_array_sums <- apply(prob_array, MARGIN=c(1, 2, 3), FUN=sum)
return(as.vector(prob_array_sums) - 1) # Should equal zero
}
My question is: what is a clean, fast way of computing the jacobian of as.vector(apply(array(my_input_vector, array_dim), MARGIN=c(1, 2, 3), FUN=sum)) -- i.e., my constraint_function in the code above -- with respect to my_input_vector?
Here is my sloppy solution (which I check for correctness against the jacobian function from the numDeriv package):
library(numDeriv)
array_dim <- c(5, 4, 3, 3)
get_prob_array_from_vector <- function(prob_vector, array_dim) {
return(array(prob_vector, array_dim))
}
constraint_function <- function(prob_vector, array_dim) {
prob_array <- get_prob_array_from_vector(prob_vector, array_dim)
prob_array_sums <- apply(prob_array, MARGIN=c(1, 2, 3), FUN=sum)
return(as.vector(prob_array_sums) - 1)
}
constraint_function_jacobian <- function(prob_vector, array_dim) {
prob_array <- get_prob_array_from_vector(prob_vector, array_dim)
jacobian <- matrix(0, Reduce("*", dim(prob_array)[1:3]), length(prob_vector))
## Must be a faster, cleaner way of populating jacobian
for(i in seq_along(prob_vector)) {
dummy_vector <- rep(0, length(prob_vector))
dummy_vector[i] <- 1
dummy_array <- get_prob_array_from_vector(dummy_vector, array_dim)
dummy_array_sums <- apply(dummy_array, MARGIN=c(1, 2, 3), FUN=sum)
jacobian_row_idx <- which(dummy_array_sums != 0, arr.ind=FALSE)
stopifnot(length(jacobian_row_idx) == 1)
jacobian[jacobian_row_idx, i] <- 1
} # Is there a fast, readable one-liner that does the same as this for loop?
stopifnot(sum(jacobian) == length(prob_vector))
stopifnot(all(jacobian == 0 | jacobian == 1))
return(jacobian)
}
## Example of a probability array satisfying my constraint
my_prob_array <- array(0, array_dim)
for(i in seq_len(array_dim[1])) {
for(j in seq_len(array_dim[2])) {
my_prob_array[i, j, , ] <- diag(array_dim[3])
}
}
my_prob_array[1, 1, , ] <- 1 / array_dim[3]
my_prob_array[2, 1, , ] <- 0.25 * (1 / array_dim[3]) + 0.75 * diag(array_dim[3])
my_prob_vector <- as.vector(my_prob_array) # Flattened representation of my_prob_array
should_be_zero_vector <- constraint_function(my_prob_vector, array_dim)
is.vector(should_be_zero_vector)
all(should_be_zero_vector == 0) # Constraint is satisfied
## Check constraint_function_jacobian for correctness using numDeriv
jacobian_analytical <- constraint_function_jacobian(my_prob_vector, array_dim)
jacobian_numerical <- jacobian(constraint_function, my_prob_vector, array_dim=array_dim)
max(abs(jacobian_analytical - jacobian_numerical)) # Very small
My functions take prob_vector as input -- i.e., a flattened representation of my probability array -- because optimization functions require vector arguments.
It took me some time to understand what you were trying to do, but here is a proposal to replace your constraint_function_jacobian:
enhanced <- function(prob_vector, array_dim) {
firstdim <- Reduce("*", array_dim[1:3])
seconddim <- length(prob_vector)
jacobian <- matrix(0, firstdim, seconddim)
idxs <- split(1:seconddim, cut(1:seconddim, array_dim[4], labels=FALSE))
for (i in seq_along(idxs)) {
diag(jacobian[, idxs[[i]] ]) <- 1
}
stopifnot(sum(jacobian) == length(prob_vector))
stopifnot(all(jacobian == 0 | jacobian == 1))
jacobian
}
Unless I'm wrong, building the jacobian amounts to filling diagonals with 1. Since it is not a square matrix, we have to split it into array_dim[4] square blocks and fill each block's diagonal with 1.
I got rid of the conversion of prob_vector to an array just to get its dim, since it is the same as array_dim. Skipping this step is not a huge improvement, but it simplifies the code IMO.
Results are ok according to test:
identical(constraint_function_jacobian(my_prob_vector, array_dim),
enhanced(my_prob_vector, array_dim))
# [1] TRUE
According to the benchmark, it gives a large speedup (roughly 24x on the median):
microbenchmark::microbenchmark(
original=constraint_function_jacobian(my_prob_vector, array_dim),
enhanced=enhanced(my_prob_vector, array_dim), times=100)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# original 16946.979 18466.491 20150.304 19066.7410 19671.4100 28148.035 100 b
# enhanced 678.222 737.948 799.005 796.3905 834.5925 1141.773 100 a
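Since the jacobian is just array_dim[4] identity blocks placed side by side, it can probably also be written as a Kronecker product, which gets close to the one-liner asked for. This is a sketch rather than a benchmarked replacement: jacobian_kron is a hypothetical name, and the check below uses random input and the fact that the constraint is linear, so the jacobian times the input must equal the sums over the last index.

```r
# Hypothetical one-liner: [I_m I_m ... I_m] with m = prod(array_dim[1:3])
jacobian_kron <- function(array_dim) {
  kronecker(matrix(1, nrow = 1, ncol = array_dim[4]),
            diag(prod(array_dim[1:3])))
}

array_dim <- c(5, 4, 3, 3)
J <- jacobian_kron(array_dim)

# The constraint is linear, so J %*% x must reproduce the sums over l
set.seed(1)
x <- runif(prod(array_dim))
sums_over_l <- as.vector(apply(array(x, array_dim), MARGIN = c(1, 2, 3), FUN = sum))
check <- all.equal(as.vector(J %*% x), sums_over_l)
```

For large arrays a sparse representation would likely be preferable, since almost all entries are zero.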

R - avoid nested for loops

I have the following function which takes 4 vectors. The T vector has a given length, and the 3 other vectors (pga, Sa5Hz and Sa1Hz) share a common length (not necessarily equal to that of T).
The output is a matrix with length(T) rows and length(pga) columns.
My code below seems like the perfect example of what NOT to do, however, I could not figure out a way to optimize it using an apply function. Can anyone help?
designSpectrum <- function (T, pga, Sa5Hz, Sa1Hz){
Ts <- Sa1Hz / Sa5Hz
#By convention, if Sa5Hz is null, set Ts as 0.
Ts[is.nan(Ts)] <- 0
res <- matrix(NA, nrow = length(T), ncol = length(pga))
for (i in 1:nrow(res))
{
for (j in 1:ncol(res))
{
res[i,j] <- if(T[i] <= 0) {pga[j]}
else if (T[i] <= 0.2 * Ts[j]) {pga[j] + T[i] * (Sa5Hz[j] - pga[j]) / (0.2 * Ts[j])}
else if (T[i] <= Ts[j]) {Sa5Hz[j]}
else Sa1Hz[j] / T[i]
}
}
return(res)
}
Instead of doing a double for loop and processing each i and j value separately, you could use the outer function to process all of them in one shot. Since you're now processing multiple i and j values simultaneously, you could switch to the vectorized ifelse statement instead of the non-vectorized if and else statements:
designSpectrum2 <- function (T, pga, Sa5Hz, Sa1Hz) {
Ts <- Sa1Hz / Sa5Hz
Ts[is.nan(Ts)] <- 0
outer(1:length(T), 1:length(pga), function(i, j) {
ifelse(T[i] <= 0, pga[j],
ifelse(T[i] <= 0.2 * Ts[j], pga[j] + T[i] * (Sa5Hz[j] - pga[j]) / (0.2 * Ts[j]),
ifelse(T[i] <= Ts[j], Sa5Hz[j], Sa1Hz[j] / T[i])))
})
}
identical(designSpectrum(T, pga, Sa5Hz, Sa1Hz), designSpectrum2(T, pga, Sa5Hz, Sa1Hz))
# [1] TRUE
Data:
T <- -1:3
pga <- 1:3
Sa5Hz <- 2:4
Sa1Hz <- 3:5
You can see the efficiency gains by testing on rather large vectors (here I'll use an output matrix with 1 million entries):
# Larger vectors
set.seed(144)
T2 <- runif(1000, -1, 3)
pga2 <- runif(1000, -1, 3)
Sa5Hz2 <- runif(1000, -1, 3)
Sa1Hz2 <- runif(1000, -1, 3)
# Runtime comparison
all.equal(designSpectrum(T2, pga2, Sa5Hz2, Sa1Hz2), designSpectrum2(T2, pga2, Sa5Hz2, Sa1Hz2))
# [1] TRUE
system.time(designSpectrum(T2, pga2, Sa5Hz2, Sa1Hz2))
# user system elapsed
# 4.038 1.011 5.042
system.time(designSpectrum2(T2, pga2, Sa5Hz2, Sa1Hz2))
# user system elapsed
# 0.517 0.138 0.652
The approach with outer is roughly 8x faster here (5.04 s vs 0.65 s elapsed).
