I am new to R. I have an R array (or at least I think so) which gives the following output:
> head(x)
[,1] [,2]
199 3.40 3.50
What does the 199 at the front mean? How do I extract the elements of this array?
The 199 is a row name. You extract elements from a matrix or data frame in R by using square brackets:
x[,1] # gives a column
x[1,] # gives a row
x[1:2,] # gives several rows
You can also use column names like so:
x <- data.frame(col1 = c(1,2,3), col2 = c("A", "B", "C"))
x$col1 # 1 2 3
You'll figure out more as you start to play around in R and do some R tutorials.
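To make the row-name point concrete, here is a minimal sketch. It assumes x is a one-row matrix whose row name is "199"; the question does not show how x was built, so the construction below is only a guess:

```r
# hypothetical reconstruction of x: one row named "199", no column names
x <- matrix(c(3.40, 3.50), nrow = 1, dimnames = list("199", NULL))
rownames(x)    # the label printed in front of the row
x["199", ]     # select the row by its name
x[1, 2]        # select a single element by position
```

Selecting by name and selecting by position reach the same data; the name is just a label attached to the row.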
I'd like to find the nearest value to x in a large two-dimensional array (my.array) and return the i and j indexes, where i is in [1:175] and j is in [1:120].
x <- 1.863
my.array <- array(rnorm(21000), dim=c(175,120))
Searching Stack Overflow and other sites, I've found that I can find the nearest value like so:
nearest <- which.min(abs(my.array - x))
However, this returns a single linear index, whereas I would like to get the i and j index values.
> nearest
[1] 13229
Thanks in advance.
help(which.min)
says, near the bottom:
...
Use arrayInd(), if you need array/matrix indices instead of 1D vector ones.
Aha! Well then:
# make the example reproducible
set.seed(123)
x <- 1.863
my.array <- array(rnorm(21000), dim=c(175,120))
nearest <- which.min(abs(my.array - x))
idx <- arrayInd(nearest, .dim=dim(my.array))
idx
[,1] [,2]
[1,] 46 62
Dropping the unused dimensions is not necessary, but it keeps me from getting confused, so I do it. The example works the same way if you skip the drop() statement.
# drop unused dimensions:
idx <- drop(idx)
idx
[1] 46 62
# check:
my.array[idx[1], idx[2]]
[1] 1.863453
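As an alternative sketch (not part of the answer above), which() with arr.ind = TRUE returns row and column indices directly. Unlike which.min(), it returns every position that ties for the minimum, so the result can have more than one row:

```r
set.seed(123)
x <- 1.863
my.array <- array(rnorm(21000), dim = c(175, 120))

# absolute distance of every cell from x
d <- abs(my.array - x)

# one row per cell that attains the minimum distance; columns are "row" and "col"
ij <- which(d == min(d), arr.ind = TRUE)
ij
my.array[ij[1, "row"], ij[1, "col"]]
```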
Let me start with an example.
Suppose I have 3 arrays a, b and c such as
a = c(3,5)
b = c(6,1,8,7)
c = c(4,2,9)
I need to extract the consecutive triplets among them, i.e.,
c(1,2,3),c(4,5,6)
This is just an example, though. I will have a larger data set with more than 10 arrays, so I need to be able to find, for instance, the consecutive series of length ten.
Could anyone provide an algorithm that finds the consecutive series of length 'n' among 'n' arrays in general?
I am doing this in R, so code in R is preferable, but an algorithm in any language is more than welcome.
First reorganize the data into a list of pairs containing the value and the array number.
Sort the list by value; you'd have something like:
1-2
2-3
3-1 (i.e. "there's a three in array 1")
4-3
5-1
6-2
7-2
8-2
9-3
Then loop over the list: for each window of n entries, check whether the values are actually n consecutive numbers, then check whether they come from n different arrays.
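The steps above can be sketched in R roughly as follows. The function name find_runs is made up for illustration, and the sketch assumes no value occurs in more than one vector (duplicates would break the fixed-width window):

```r
# find_runs: hypothetical helper, not from the original answer
find_runs <- function(...) {
  vecs <- list(...)
  n <- length(vecs)
  vals <- unlist(vecs)                          # all values in one vector
  src <- rep(seq_along(vecs), lengths(vecs))    # which input each value came from
  o <- order(vals)
  vals <- vals[o]
  src <- src[o]
  out <- list()
  if (length(vals) < n) return(out)
  for (i in seq_len(length(vals) - n + 1)) {
    w <- i:(i + n - 1)                          # window of n sorted entries
    consecutive <- all(diff(vals[w]) == 1)
    distinct <- length(unique(src[w])) == n
    if (consecutive && distinct) out[[length(out) + 1]] <- vals[w]
  }
  out
}

res <- find_runs(c(3, 5), c(6, 1, 8, 7), c(4, 2, 9))
res
```

For the three example vectors this yields the two triplets named in the question, c(1,2,3) and c(4,5,6).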
Here's one approach. It assumes there are no breaks in the sequence of observations, i.e. the sorted values are all consecutive. Here is the data.
N <- 3
a <- c(3,5)
b <- c(6,1,8,7)
c <- c(4,2,9)
Then I combine them and order by the observations
dd <- lattice::make.groups(a,b,c)
dd <- dd[order(dd$data),]
Now I look for rows in this table where all three groups are represented
idx <- apply(embed(as.numeric(dd$which),N), 1, function(x) {
length(unique(x))==N
})
Then we can see the triplets with
lapply(which(idx), function(i) {
dd[i:(i+N-1),]
})
# [[1]]
# data which
# b2 1 b
# c2 2 c
# a1 3 a
#
# [[2]]
# data which
# c1 4 c
# a2 5 a
# b1 6 b
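The key step above is embed(), which lays out every run of N neighbouring entries as one row (most recent entry first), so that apply() can test each window. A small sketch on toy input, where the labels vector is made up for illustration:

```r
# each row of w is one sliding window of length 3, newest element first
w <- embed(1:5, 3)
w

# hypothetical group labels, already sorted by observed value
labels <- c(2, 3, 1, 3, 1)

# a window covers all 3 groups when it contains 3 distinct labels
covered <- apply(embed(labels, 3), 1, function(x) length(unique(x)) == 3)
covered
```

Note that this test only looks at group membership, which is why the answer needs the no-breaks assumption stated at its start.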
Here is a brute-force method using expand.grid, with the three vectors from the example.
# get all combinations
df <- expand.grid(a,b,c)
Use combn to calculate the absolute difference for each pair of columns.
# get all pairwise differences (columns are looked up by name)
myDiffs <- combn(names(df), 2, FUN=function(x) abs(df[[x[1]]] - df[[x[2]]]))
# subset data using `rowSums` and `which`
df[which(rowSums(myDiffs == 1) == ncol(myDiffs)-1), ]
Var1 Var2 Var3
2 5 6 4
11 3 1 2
I have hacked together a little recursive function that will find all the consecutive triplets among as many vectors as you pass it (you need to pass at least three). It is probably a little crude, but it seems to work.
The function uses the ellipsis, ..., for passing arguments. It therefore takes however many arguments (i.e. numeric vectors) you provide and puts them in the list items. The smallest value in each passed vector is then located, along with its index.
Next, the indices of the vectors that hold the smallest triplet are collected and iterated over in a for() loop, where the values are appended to the output vector out. The input vectors in items are pruned and passed to the function again in a recursive fashion.
Only when all vectors are NA, i.e. there are no more values left in the vectors, does the function return the final result.
library(magrittr)
# define function to find the triplets
tripl <- function(...){
items <- list(...)
# find the smallest number in each passed vector, along with its index
# output is a matrix of n-by-2, where n is the number of passed arguments
triplet.id <- lapply(items, function(x){
if(is.na(x) %>% prod) id <- c(NA, NA)
else id <- c(which(x == min(x)), x[which(x == min(x))])
}) %>% unlist %>% matrix(., ncol=2, byrow=T)
# find the smallest triplet from the passed vectors
index <- order(triplet.id[,2])[1:3]
# create empty vector for output
out <- vector()
# go through the smallest triplet's indices
for(i in index){
# .. append the corresponding item from the input vector to the out vector
# .. and remove the value from the input vector
if(length(items[[i]]) == 1) {
out <- append(out, items[[i]])
# .. if the input vector has no value left fill with NA
items[[i]] <- NA
}
else {
out <- append(out, items[[i]][triplet.id[i,1]])
items[[i]] <- items[[i]][-triplet.id[i,1]]
}
}
# recurse until all vectors are empty (NA)
if(!prod(unlist(is.na(items)))) out <- append(list(out),
do.call("tripl", c(items), quote = F))
else(out <- list(out))
# return result
return(out)
}
The function can be called by passing the input vectors as arguments.
# input vectors
a = c(3,5)
b = c(6,1,8,7)
c = c(4,2,9)
# find all the triplets using our function
y <- tripl(a,b,c)
The result is a list, which contains all the necessary information, albeit unordered.
print(y)
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 4 5 6
#
# [[3]]
# [1] 7 9 NA
#
# [[4]]
# [1] 8 NA NA
Ordering everything can be done using sapply():
# put everything in order
sapply(y, function(x){x[order(x)]}) %>% t
# [,1] [,2] [,3]
# [1,] 1 2 3
# [2,] 4 5 6
# [3,] 7 9 NA
# [4,] 8 NA NA
Note that the function uses only one value per vector for each triplet.
It will therefore not find the consecutive triplet c(6,7,8) among e.g. c(6,7,11), c(8,9,13) and c(10,12,14).
In this instance it would return c(6,8,10) instead (see below).
a<-c(6,7,11)
b<-c(8,9,13)
c<-c(10,12,14)
y <- tripl(a,b,c)
sapply(y, function(x){x[order(x)]}) %>% t
# [,1] [,2] [,3]
# [1,] 6 8 10
# [2,] 7 9 12
# [3,] 11 13 14
I am new to R, so my question might be a simple one, but I have spent a lot of time trying to figure out what I am doing wrong, to no avail. I have found a lot of help on this site over the past week by searching through other questions and answers (thank you!), but as someone new, I often find it difficult to interpret other people's code.
I am trying to build a 3-dimensional array from multiple data files, each with the same dimensions, 57x57.
# read in 100 files
Files = lapply(Sys.glob('File*.txt'), read.table, sep='\t', as.is=TRUE)
# convert to dataframes
Files = lapply(Files[1:100], as.data.frame)
# check dimensions of first file (it's the same for all)
dim(Files[[1]])
[1] 57 57
# build empty array
Array = array(dim=c(57,57,100))
# read in the first data frame
Array[,,1] = Files[1]
# read in the second data frame
Array[,,2] = Files[2]
Error in Array[, , 2] = Files[2] : incorrect number of subscripts
# if I check...
Array[,,1] = Files[1]
Error in Array[, , 1] : incorrect number of dimensions
# The same thing happens when I do it in a loop:
x = 0
for(i in 1:100){
Array[,,x+1] = Files[[i]]
x = x + 1
}
Error in Array[, , 1] = Files[[1]] :
incorrect number of subscripts
You need to convert your data frames into matrices before you do the assignment:
l <- list(data.frame(x=1:2, y=3:4), data.frame(x=5:6, y=7:8))
arr <- array(dim=c(2, 2, 2))
arr[,,1] <- as.matrix(l[[1]])
arr[,,2] <- as.matrix(l[[2]])
arr
# , , 1
#
# [,1] [,2]
# [1,] 1 3
# [2,] 2 4
#
# , , 2
#
# [,1] [,2]
# [1,] 5 7
# [2,] 6 8
You can actually build the array in one line with the unlist function applied to a list of the matrices you want to combine:
arr2 <- array(unlist(lapply(l, as.matrix)), dim=c(dim(l[[1]]), length(l)))
all.equal(arr, arr2)
# [1] TRUE
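The same conversion also works inside a loop, which is closer to what the question attempted. A minimal sketch using the small list l from above rather than the 100 real files:

```r
l <- list(data.frame(x = 1:2, y = 3:4), data.frame(x = 5:6, y = 7:8))

# size the array from the first data frame and the number of data frames
arr <- array(dim = c(dim(l[[1]]), length(l)))

for (i in seq_along(l)) {
  # as.matrix() turns the data frame (a list of columns) into a plain matrix
  arr[, , i] <- as.matrix(l[[i]])
}
dim(arr)
```

Using seq_along(l) as the loop index removes the need for a separate counter such as x in the question.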
A normal matrix is a 2-dimensional array. But I can also initialise:
a<-array(0,dim=c(2,3,4,5))
This is a 2*3*4*5 array.
The command
apply(a,c(2,3),sum)
will give a 3*4 array containing the sums over the elements in the 1st and 4th dimensions.
Why is that? As far as I know, in the apply function 1 indicates rows and 2 indicates columns, but what does 3 mean here?
We need some abstraction here.
The easiest way to understand apply on an array is to try some examples. Here's some data modified from the last example object in the documentation:
> z <- array(1:24, dim = 2:4)
> dim(z)
[1] 2 3 4
> apply(z, 1, function(x) sum(x))
[1] 144 156
> apply(z, 2, function(x) sum(x))
[1] 84 100 116
> apply(z, 3, function(x) sum(x))
[1] 21 57 93 129
What's going on here? Well, we create a three-dimensional array z. If you use apply with MARGIN=1 you get row sums (two values because there are two rows), if you use MARGIN=2 you get column sums (three values because there are three columns), and if you use MARGIN=3 you get sums across the array's third dimension (four values because there are four levels to the third dimension of the array).
If you specify a vector for MARGIN, like c(2,3), you get the sum over the rows for each column and level of the third dimension. Note how in the above examples, the results from apply with MARGIN=2 are the row sums and with MARGIN=3 the column sums, respectively, of the matrix seen in the result below:
> apply(z, c(2,3), function(x) sum(x))
[,1] [,2] [,3] [,4]
[1,] 3 15 27 39
[2,] 7 19 31 43
[3,] 11 23 35 47
If you specify all of the dimensions as MARGIN=c(1,2,3) you simply get the original three-dimensional object:
> all.equal(z, apply(z, c(1,2,3), function(x) sum(x)))
[1] TRUE
The best way to learn here is just to start playing around with some real arrays. Your example data aren't helpful for looking at sums because all of the array entries are zero.
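As a sketch of that advice applied to the four-dimensional case from the question, the array below uses the same dimensions but is filled with 1:120 instead of zeros (that fill is an assumption made here so the sums are informative):

```r
a <- array(1:120, dim = c(2, 3, 4, 5))

# keep dimensions 2 and 3, collapse dimensions 1 and 4 with sum()
s <- apply(a, c(2, 3), sum)
dim(s)

# each cell of s is the sum over every index of dimensions 1 and 4
s[1, 1] == sum(a[, 1, 1, ])
```

So the numbers in MARGIN name the dimensions that survive in the result; everything else is handed to the function.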
My array is
x <- array(1:24, dim=c(3,4,3))
My first task is to find the max value over the third dimension for each position in the first two dimensions:
x.max <- apply(x,c(1,2), function(x) ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE)))
(the ifelse handles the case where there is NA data).
My second task is to find the position of that max value along the third dimension.
I tried
x.max.position = apply(x, c(1,2),which.max(x))
But this only gives me the position in the first two dimensions.
Can anyone help me?
It's not totally clear, but if you want to find the max for each matrix of the third dimension (is that even a technically right thing to say?), then you need to use apply across the third dimension. The description of the MARGIN argument under ?apply states:
a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns.
So for this example where you have a 3D array, 3 is the third dimension. So...
t( apply( x , 3 , function(x) which( x == max(x) , arr.ind = TRUE ) ) )
[,1] [,2]
[1,] 3 4
[2,] 3 4
[3,] 3 4
This returns a matrix where each row contains the row index and then the column index of the max value of each 2D matrix along the third dimension.
If you want the max across all dimensions you can use which and the arr.ind argument like this:
which( x==max(x,na.rm=T) , arr.ind = T )
dim1 dim2 dim3
[1,] 3 4 2
This tells us the max value is in the third row, fourth column, second matrix.
EDIT
To find the position along dim 3 where the values on dims 1 and 2 are max, try:
which.max( apply( x , 3 , max ) )
# [1] 2
This tells us that position 2 of the third dimension contains the maximal value.
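If the goal is instead the position along the third dimension for every (row, column) cell, one possible sketch is to pass the function which.max itself (not a call to it) to apply over the first two dimensions. This reading of task 2 is an assumption, since the question is not fully clear:

```r
x <- array(1:24, dim = c(3, 4, 3))

# for each (i, j), apply() hands x[i, j, ] to which.max,
# which returns the index of the first maximum along dim 3
pos <- apply(x, c(1, 2), which.max)
pos
```

which.max skips NA values, but a cell that is NA at every level of the third dimension has no maximum and would need separate handling.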