Optimizing function speed on 3D array

I am applying a user-defined function to individual cells of a 3D array. The contents of each cell are one of the following possibilities, all of which are character vectors because of prior formatting:
"N"
"A"
""
"1"
"0"
I want to create a new 3D array of the same dimensions, where each cell contains either NA or the number 1 or 0. Thus, I wrote a function named Numericize and used aaply to apply it to the entire array. However, applying it takes forever.
Numericize <- function(x){
  if(!is.na(x)){
    x[x=="N"] <- NA
    x[x=="A"] <- NA
    x[x==""] <- NA
    x <- as.integer(x)
  }
  return(x)
}
The dimensions of the original array are 480x866x366. The function takes forever to apply using the following code:
Final.Daily.Array <- aaply(.data = Complete.Daily.Array,
                           .margins = c(1,2,3),
                           .fun = Numericize,
                           .progress = "text")
I am unsure whether the speed issue comes from an inefficient Numericize, an inefficient aaply, or something else entirely. I considered setting up parallel computing with the plyr package, but I wouldn't have thought that such a simple operation should require parallel processing.
On one hand I am concerned that I created a stack overflow for myself (see this for more), but I have applied other functions to similar arrays without problems.
ex.array <- array(dim = c(3,3,3))
ex.array[,,1] <- c("N","A","","1","0","N","A","","1")
ex.array[,,2] <- c("0","N","A","","1","0","N","A","")
ex.array[,,3] <- c("1","0","N","A","","1","0","N","A")
desired.array <- array(dim = c(3,3,3))
desired.array[,,1] <- c(NA,NA,NA,1,0,NA,NA,NA,1)
desired.array[,,2] <- c(0,NA,NA,NA,1,0,NA,NA,NA)
desired.array[,,3] <- c(1,0,NA,NA,NA,1,0,NA,NA)
ex.array
desired.array
Any suggestions?

You can just use a vectorized approach:
ex.array[ex.array %in% c("", "N", "A")] <- NA
storage.mode(ex.array) <- "integer"
In fact, the second line alone is enough: coercing the storage mode to integer turns "N", "A", and "" into NA, with an "NAs introduced by coercion" warning.
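Applied to the example array from the question, this reproduces the desired result (a quick check; note that desired.array holds doubles while the coerced array holds integers, so all.equal is the appropriate comparison rather than identical):
ex.array[ex.array %in% c("", "N", "A")] <- NA   # blank out all non-numeric codes
storage.mode(ex.array) <- "integer"             # "1"/"0" become integers in place
all.equal(ex.array, desired.array)              # TRUE
Because this operates on the whole array at once instead of cell by cell, it avoids the per-element function-call overhead that makes aaply so slow on a 480x866x366 array.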

Related

How to structure these objects to easily index and flatten when multiple non-independent lists are involved?

For the sake of illustrating the problem at hand with a fully reproducible set of data, I have the following simplified objects:
set.seed(7)
animalL <- list(mittens=list(what='cat', paws=4), lassie=list(what='dog', paws=4))
timeL <- list(list(foo=rnorm(2)), list(foo=rnorm(2)), list(foo=rnorm(2)))
houseL <- list(houseA=list(zip=123, roomL=list(den='some:blueprint', den2='another:blueprint')), houseB=list(zip=456, roomL=list(cove='a:blueprint', cove2='oneMore:blueprint')))
So there is a list of animals, each with its own features and associated data. There is a list of "time periods", each of which again has various features associated with it (simplified here to just a numeric vector of size 2). Finally, there is a list of houses, each with a list of rooms of possibly different lengths. From this data I can evaluate, over the aggregate of all those time periods, at each house, what each animal's "experience" is like. I can only do so, however, after letting each animal, during each time period, hang out in every room of every house. For the latter I have this function:
evaluateHangOutInRoom <- function(animal, timePeriod, house, room) {
  list(x=rnorm(1), y=rnorm(1))
}
In real life this function is very slow, so I need to run it in parallel. I have an internal library for this but its interface is only that of lapply. Suppose it is this:
fastLapply <- function(X, FUN) {
  # in reality this is massively distributed
  lapply(X, FUN)
}
I can try to do this by doing index math, something like this:
AA <- length(animalL)
TT <- length(timeL)
RR <- sum(vapply(houseL, function(house) length(house$roomL), numeric(1)))
fastLapply(1:(AA*TT*RR), function(idx) 0)
But backing the arguments to the hang-out function out of idx is error-prone.
I can try to structure an array of each "scenario" I need to work on:
hangOutArgArray <- array('', dim=c(AA, TT, RR))
dimnames(hangOutArgArray)[[1]] <- as.list(names(animalL))
dimnames(hangOutArgArray)[[3]] <- as.list(names(unlist(lapply(houseL, function(house) lapply(house$roomL, function(room) NULL)), recursive=FALSE)))
where maybe something like Map over the array dimnames and the array itself can produce a flat list, which I could then restructure back into the array shape afterwards - but this is nasty...
How can I flatten out these different cases to pass on to fastLapply, so that I can take the flat output of the same length and easily restructure it back into an easy-to-work-with multi-dimensional layout for subsequent processing? At the moment this is the aforementioned nasty (but working) approach:
argumentArray <- array('', dim=c(AA, TT, RR))
dimOneNames <- as.list(names(animalL))
dimThreeNames <- as.list(names(unlist(lapply(houseL, function(house) lapply(house$roomL, function(room) NULL)), recursive=FALSE)))
dimnames(argumentArray)[[1]] <- dimOneNames
dimnames(argumentArray)[[3]] <- dimThreeNames
invisible(lapply(dimOneNames, function(d1) lapply(1:TT, function(d2) lapply(dimThreeNames, function(d3) {
  argumentArray[d1, d2, d3] <<- paste(d1, d2, d3, sep=';')
}))))
resultL <- fastLapply(argumentArray, function(argStr) {
  argV <- unlist(strsplit(argStr, ';', fixed=TRUE))
  animal <- argV[1]
  timePrd <- timeL[[ as.integer(argV[2]) ]]
  house <- houseL[[ unlist(strsplit(argV[3], '.', fixed=TRUE))[1] ]]
  room <- house$roomL[[ unlist(strsplit(argV[3], '.', fixed=TRUE))[2] ]]
  evaluateHangOutInRoom(animal, timePrd, house, room)
})
resultArray <- array(resultL, dim=dim(argumentArray))
dimnames(resultArray) <- dimnames(argumentArray)
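One way to avoid both the raw index math and the string round-trip is to enumerate every scenario as a row of a data frame built with expand.grid, and have each worker index into that table. A minimal sketch against the toy data above (the roomKeys and scenarios names are hypothetical helpers, not part of the original code):
# pair each room with its house up front, so no string parsing is needed later
roomKeys <- do.call(rbind, lapply(names(houseL), function(h)
  data.frame(house = h, room = names(houseL[[h]]$roomL), stringsAsFactors = FALSE)))
# one row per (animal, time period, room) combination
scenarios <- expand.grid(animal = names(animalL),
                         time = seq_along(timeL),
                         roomIdx = seq_len(nrow(roomKeys)),
                         stringsAsFactors = FALSE)
resultL <- fastLapply(seq_len(nrow(scenarios)), function(i) {
  s <- scenarios[i, ]
  h <- houseL[[ roomKeys$house[s$roomIdx] ]]
  evaluateHangOutInRoom(s$animal, timeL[[s$time]], h, h$roomL[[ roomKeys$room[s$roomIdx] ]])
})
# expand.grid varies its first column fastest, which matches array()'s
# column-major filling order, so the flat result reshapes directly
resultArray <- array(resultL, dim = c(length(animalL), length(timeL), nrow(roomKeys)))
dimnames(resultArray) <- list(names(animalL), NULL, paste(roomKeys$house, roomKeys$room, sep = '.'))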

R: Array changing type in a loop

I have created a 3D array and want to fill it with data from two other data.frames. Those data.frames have different colnames and rownames, so sometimes a NULL pops out when I address a non-existent cell. Both data.frames hold lists of 'lm' output in their cells.
But the problem is I keep getting this error:
Error in diff_models[i, j, "cont"] <- cont :
incorrect number of subscripts
I have also noticed that upon creation "diff_models" has logical type (also strange, btw), but by the time the error pops up it has become a list. So I guess the problem is that [i,j,'cont'] does not exist in a list. But why does the loop change the type of "diff_models"?
cont_col <- colnames(temp1)
cont_row <- rownames(temp1)
dis_col <- colnames(temp2)
dis_row <- rownames(temp2)
cols <- unique(c(cont_col,dis_col))
rows <- unique(c(cont_row,dis_row))
diff_models <- array(NA, c(length(rows), length(cols), 2),
                     dimnames = list('predictor'=rows, 'response'=cols,
                                     'condition'=c('dis','cont')))
for (j in cols) {
  for (i in rows) {
    cont <- cont_models[i,j]
    dis <- dis_models[i,j]
    diff_models[i,j,"dis"] <- ifelse(is.null(dis),NA,dis)
    diff_models[i,j,"cont"] <- ifelse(is.null(cont),NA,cont)
  }
}
Using
diff_models[i][j]["dis"] <- ifelse(is.null(dis),NA,dis)
diff_models[i][j]["cont"] <- ifelse(is.null(cont),NA,cont)
does not end up in an error but turns "diff_models" into an empty list.
Saving numerics into the array, however, works perfectly well.
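As the question suspects, assigning list values (the lm fits) into cells of a logical array pushes the whole array over to list storage, and the single-bracket [i, j, "cont"] assignment then fails. A minimal sketch of one common workaround, assuming cont_models and dis_models can be indexed by row and column name: create diff_models as a list-array from the start, so each cell can legitimately hold an lm object, and assign single cells with [[ ]]:
diff_models <- array(vector("list", length(rows) * length(cols) * 2),
                     dim = c(length(rows), length(cols), 2),
                     dimnames = list('predictor'=rows, 'response'=cols,
                                     'condition'=c('dis','cont')))
for (j in cols) {
  for (i in rows) {
    # tryCatch guards against row/column names that exist in only one data.frame
    cont <- tryCatch(cont_models[[i, j]], error = function(e) NULL)
    dis <- tryCatch(dis_models[[i, j]], error = function(e) NULL)
    # assign only non-NULL values; assigning NULL via [[ would delete the cell
    if (!is.null(dis)) diff_models[[i, j, "dis"]] <- dis
    if (!is.null(cont)) diff_models[[i, j, "cont"]] <- cont
  }
}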

Transform two arrays into one data frame in R

I have two arrays coming from a PostgreSQL database, as follows.
iarray
{9.467182035,9.252423958,9.179368178,9.142931845,9.118895803,9.098669713,9.093398102,9.092035392,9.091328028,9.090594437,9.090000456,9.089253543......keeps going on}
varray
{-1.025945126,-0.791203874,-0.506481774,-0.255416444,-0.028424464,0.188855034,0.390787963,0.579327969,0.761521769 ...keeps going on}
Both arrays have an equal number of entries. I want to convert these to a data frame so that I can draw a graph of i over v.
How should I proceed?
I tried n <- gsub("^\\{+(.+)\\}+$", '\\1', iarray) to remove the braces and
n2 <- strsplit(n, ",") to split on the commas.
Assuming you are getting iarray & varray as strings:
iarray = "{9.467182035,9.252423958,9.179368178,9.142931845}"
varray = "{-1.025945126,-0.791203874,-0.506481774,-0.255416444}"
n <- gsub("^\\{+(.+)\\}+$", '\\1', iarray)
n1 <- strsplit(n, ",")
n1 <- as.numeric(unlist(n1))  # convert the pieces from character to numeric
df <- as.data.frame(n1)
n <- gsub("^\\{+(.+)\\}+$", '\\1', varray)
n2 <- strsplit(n, ",")
n2 <- as.numeric(unlist(n2))
df <- cbind(df, n2)
This seems like one of the few occasions where eval(parse()) is the right tool:
df <- list(iarray, varray)
df <- data.frame(lapply(df, function(x)
  eval(parse(text = sub("\\}$", ")", sub("^\\{", "c(", x))))
))
names(df) <- c("iarray", "varray")
We replace the opening { with c( and the closing } with ), so iarray and varray become expressions that create vectors, which we then parse and eval.
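With either approach the columns end up numeric, so the graph of i over v that the question asks for is then a one-liner (a sketch; the plot type is just a guess):
plot(iarray ~ varray, data = df, type = "l")  # i on the y-axis, v on the x-axis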

lapply and rbind not properly appending the results

SimNo <- 10
for (i in 1:SimNo){
  z1<-rnorm(1000,0,1)
  z2<-rnorm(1000,0,1)
  z3<-rnorm(1000,0,1)
  z4<-rnorm(1000,0,1)
  z5<-rnorm(1000,0,1)
  z6<-rnorm(1000,0,1)
  X<-cbind(z1,z2,z3,z4,z5,z6)
  sx<-scale(X)/sqrt(999)
  det1<-det(t(sx)%*%sx)
  detans<-do.call(rbind,lapply(1:SimNo, function(x) ifelse(det1<1,det1,0)))
}
When I run the commands inside the loop one at a time (all except the last line), I get a different determinant value each time, but when I run the loop as a whole I get the last determinant repeated for every iteration.
Please help me handle situations like this.
Is there a short and efficient way to write this code, so that each individual variable can also be accessed?
Whenever you are repeating the same operation multiple times with no varying inputs, think about using replicate. Here you can use it twice:
SimNo <- 10
det1 <- replicate(SimNo, {
X <- replicate(6, rnorm(1000, 0, 1))
sx <- scale(X) / sqrt(999)
det(t(sx) %*% sx)
})
detans <- ifelse(det1 < 1, det1, 0)
Otherwise, this is what your code should have looked like with your for loop. You needed to create a vector for storing your outputs at each loop iteration:
SimNo <- 10
detans <- numeric(SimNo)
for (i in 1:SimNo) {
  z1 <- rnorm(1000, 0, 1)
  z2 <- rnorm(1000, 0, 1)
  z3 <- rnorm(1000, 0, 1)
  z4 <- rnorm(1000, 0, 1)
  z5 <- rnorm(1000, 0, 1)
  z6 <- rnorm(1000, 0, 1)
  X <- cbind(z1, z2, z3, z4, z5, z6)
  sx <- scale(X)/sqrt(999)
  det1 <- det(t(sx) %*% sx)
  detans[i] <- ifelse(det1 < 1, det1, 0)
}
Edit: you asked in the comments how to access X using replicate. You would have to make replicate create and store all your X matrices in a list. Then use the *apply family of functions to loop throughout that list to finish the computations:
X <- replicate(SimNo, replicate(6, rnorm(1000, 0, 1)), simplify = FALSE)
det1 <- sapply(X, function(x) {
sx <- scale(x) / sqrt(999)
det(t(sx) %*% sx)
})
detans <- ifelse(det1 < 1, det1, 0)
Here, X is now a list of matrices, so you can get e.g. the matrix for the second simulation by doing X[[2]].
SimNo <- 10
matdet <- matrix(data = NA, nrow = SimNo, ncol = 1, byrow = TRUE)
for (i in 1:SimNo){
  z1 <- rnorm(1000, 0, 1)
  z2 <- rnorm(1000, 0, 1)
  z3 <- rnorm(1000, 0, 1)
  z4 <- rnorm(1000, 0, 1)
  z5 <- rnorm(1000, 0, 1)
  z6 <- rnorm(1000, 0, 1)
  X <- cbind(z1, z2, z3, z4, z5, z6)
  sx <- scale(X)/sqrt(999)
  det1 <- det(t(sx) %*% sx)
  matdet[i] <- ifelse(det1 < 1, det1, 0)  # store only this iteration's value
}
matdet

How to read multiple files into a multi-dimensional array

I want to make an array in 3 dimensions.
Here is what I tried:
z<-c(160,720,420)
first_data_set <- array(dim = length(file_1), dimnames = z)
The data I am reading are on one level only (just x and y).
There are other data sets in the same format, and I need to put them into the same array as the first one, so that once I finish reading everything, all of the data sit in one array with nothing overwritten.
So I think the array has to be 3-dimensional; otherwise I cannot keep all the data that I read in the loop.
Say that you have two matrices of size 3x4:
m1 <- matrix(rnorm(12), nrow = 3, ncol = 4)
m2 <- matrix(rnorm(12), nrow = 3, ncol = 4)
If you want to place them in an array, first make an array of NA's:
A <- array(as.numeric(NA), dim = c(3,4,2))
Then populate the layers with data:
A[,,1] <- m1
A[,,2] <- m2
As suggested by #Justin, you could also just put the matrices together in a list:
A2 <- list()
A2[['m1']] <- m1
A2[['m2']] <- m2
To read matrices from files: using a list makes it easier to get these matrices from files in a directory, without having to specify the dimensions in advance. Assume you want all files with extension csv:
myfiles <- dir(pattern = "\\.csv$")  # escape the dot so it matches a literal "."
for (i in seq_along(myfiles)){
  A2[[myfiles[i]]] <- read.table(myfiles[i], sep = ',')
}
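If every file has the same dimensions, the resulting list can then be collapsed into the 3-dimensional array the question asks for, one file per layer (a sketch; assumes all files parse to equal-sized tables):
A3 <- simplify2array(lapply(A2, as.matrix))  # stack the matrices: rows x cols x files
dim(A3)  # third dimension indexes the files; layer names come from names(A2)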
