R two dimensional array containing dataframe rows - arrays

So, I recently started working with R for a research project, and I'm trying to create a multidimensional array that would contain data frame rows.
I have a large data frame containing many columns, each of which is either numeric or a string. For the sake of simplicity, let's work with 3 columns:
thread_id: an integer number between 1 and 10100.
user_id: an integer number given to users.
post_name: a string that gives us the title of the post.
I would like to create a data structure, preferably a two-dimensional array, where the first dimension is the thread_id and the second is a row from the data frame.
So, for example:
DataSet[1][1], I'd get thread_id: 1, user_id: 100, post_name: "some name 1"
DataSet[1][2], I'd get thread_id: 1, user_id: 101, post_name: "some name 2"
DataSet[5][10], I'd get thread_id: 5, user_id: 900, post_name: "some name 3"
Is this possible to do in R? My only previous experience is with Java, where this can be solved with an array of Objects.
Thanks for all the help!

If, say, thread_id took on values 1 to 5, you could use:
mylist <- list()
for (i in 1:5)
  mylist[[i]] <- myData[myData$thread_id == i, ]
You could of course use max(myData$thread_id) instead of 5.
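The loop above can also be written with base R's split(), which groups the rows by thread_id in one call. This is only a minimal sketch: the three-row myData below is made-up sample data following the columns in the question, and by_thread is a hypothetical name.

```r
# Made-up sample data with the three columns described in the question
myData <- data.frame(
  thread_id = c(1, 1, 5),
  user_id   = c(100, 101, 900),
  post_name = c("some name 1", "some name 2", "some name 3"),
  stringsAsFactors = FALSE
)

# One data frame per thread_id; list elements are named after the ids
by_thread <- split(myData, myData$thread_id)

# Second row of thread 1 (the question's DataSet[1][2])
by_thread[["1"]][2, ]
```

Note that the list is indexed by name ("1", "5"), so missing thread ids do not leave empty slots, and that a row is picked with [[thread]][row, ] rather than Java-style [thread][row].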

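Here is an alternative for you.
Assumption: df is a data.frame
# Collapse one row into a single "name: value, name: value" string
convert.to.str <- function(df){
  df_col <- names(df)
  val <- unlist(df)
  ans <- paste(df_col, val, sep=': ')
  paste(ans, collapse=', ')
}
# Position of each row within its thread, so that column k holds the k-th post of every thread
nrow2 <- ave(seq_len(nrow(df)), df$thread_id, FUN = seq_along)
int_ans <- data.frame(thread_id = df$thread_id, ans = apply(df, 1, convert.to.str), nrow2 = nrow2, stringsAsFactors = FALSE)
library(reshape2)
# One row per thread_id, one column per position within the thread
int_ans2 <- dcast(int_ans, thread_id ~ nrow2, value.var='ans')
DataSet <- int_ans2[2:ncol(int_ans2)]
rownames(DataSet) <- int_ans2$thread_id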

Related

How to add a continuous sequence (unique identifier) to an array?

I have an array, similar to this:
ar1<- array(rep(1, 91*5*4), dim=c(91, 5, 4))
I want to add an extra column at the end of each component (n = 4) that is sequential across all components (I'm not sure if component is the right word).
In this case it would be a sequence from 1 to 364.
The idea behind this is that if the rows are scrambled when I'm messing around with joining data or anything else I would be able to see it and rectify it.
How do I achieve this please?
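Maybe the following is what you want.
It uses sapply to bind an extra column onto each slice taken along the 2nd dimension of the array, and then sets the dimensions of the result. Note that in ar2 the third dimension indexes the original second dimension, so ar2[, , i] holds ar1[, i, ] plus the new column.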
ar2 <- sapply(1:5, function(i){
new <- seq_len(NROW(ar1[, i, ])) + (i - 1)*NROW(ar1[, i, ])
cbind(ar1[, i, ], new)
})
dim(ar2) <- c(91, 5, 5)
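The code above creates a new array; if you want, you can overwrite the original one instead.
To get the original back, drop the new column from each slice and restore the original order of the dimensions:
n <- dim(ar2)[2]
ar1_back <- sapply(1:5, function(i){
  ar2[, -n, i]
})
# each column of the sapply result is a 91 x 4 slice, so reshape to 91 x 4 x 5 first
dim(ar1_back) <- c(91, 4, 5)
# then swap the last two dimensions to get back to 91 x 5 x 4
ar1_back <- aperm(ar1_back, c(1, 3, 2))
identical(ar1, ar1_back)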
#[1] TRUE
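If the goal is literally what the question describes (one extra column in each of the 4 components, numbered 1 to 364 across them), a base R sketch could look like the following. It assumes the identifier should go in a new 6th column along the second dimension; d and ar2 are just illustrative names.

```r
ar1 <- array(rep(1, 91 * 5 * 4), dim = c(91, 5, 4))
d <- dim(ar1)

# New array with one extra column in every component
ar2 <- array(NA_real_, dim = c(d[1], d[2] + 1, d[3]))
ar2[, seq_len(d[2]), ] <- ar1

# Fill the extra column; values run down the rows of component 1,
# then component 2, and so on (column-major order)
ar2[, d[2] + 1, ] <- seq_len(d[1] * d[3])

# Inspect the ids stored in the last component
range(ar2[, d[2] + 1, d[3]])
```

Dropping the identifier again is then a plain negative index, ar2[, -(d[2] + 1), ], with no reshaping needed.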

Concatenate two columns from table in R

I have a table stored in a csv file that looks like this:
"","",""
"1",50.7109704392639,598.945216481663
"2",88.4551431247316,432.427671968179
"3",146.142850442859,558.077250358249
"4",67.5287612139969,283.50009457641
"5",28.8212787088875,355.3292769956
I am trying to concatenate the second and the third columns from this table into an array by doing this:
data <- read.table("testecase3.csv", header = TRUE, sep = ",")
before <- data[2];
after <- data[3];
merge <- c(before, after);
When I print this new array, this is what I get:
$`X.1`
[1] 50.71097 88.45514 146.14285 67.52876 28.82128
$X.2
[1] 598.9452 432.4277 558.0773 283.5001 355.3293
How can I fix this problem? I would like something like this:
[1] 50.71097 88.45514 146.14285 67.52876 28.82128 598.9452 432.4277 558.0773 283.5001 355.3293
The correct way to do this is to use:
before <- data[, 2]
after <- data[, 3]
merge <- c(before, after)
As Darren explained above, data[2] extracts column 2 as a one-column data.frame, whereas data[, 2] extracts the elements of column 2 as a vector.
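The difference between the two forms of indexing is easy to check on a small example. The data frame below is made up for illustration (it is not the asker's CSV), and merged and alt are hypothetical names.

```r
# Made-up stand-in for the table read from the CSV file
data <- data.frame(id = 1:3, a = c(1.5, 2.5, 3.5), b = c(10, 20, 30))

class(data[2])    # single-bracket, one index: a one-column data.frame
class(data[, 2])  # row/column indexing: a plain numeric vector

# Concatenating two vectors gives one long vector
merged <- c(data[, 2], data[, 3])

# Alternative: flatten the two columns directly
alt <- unlist(data[2:3], use.names = FALSE)
```

Both merged and alt hold the values of column 2 followed by the values of column 3.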

How do you dynamically create difference- or delta- columns in a data.frame?

My data frame has outstanding-balance columns named Balance, Balance1, Balance2, ..., Balance36.
I want to add a column for the delta between each pair of consecutive months, i.e. Delta2 = Balance2 - Balance1.
How can I simplify my method below?
dataset$delta1 = apply(dataset[, c("Balance1","Balance")], 1, function(x){x[2]-x[1]})
dataset$delta2 = apply(dataset[, c("Balance2","Balance1")], 1, function(x){x[2]-x[1]})
...
dataset$delta35 = apply(dataset[, c("Balance35","Balance34")], 1, function(x){x[2]-x[1]})
dataset$delta36 = apply(dataset[, c("Balance36","Balance35")], 1, function(x){x[2]-x[1]})
It boils down to a one-liner. First, name your dataset something short, df is the usual name. Then, use direct subtraction; there's zero need to call apply() to subtract one column from another:
df$delta1 <- df[,"Balance1"] - df[,"Balance"]
df$delta2 <- df[,"Balance2"] - df[,"Balance1"]
...
df$delta35 <- df[,"Balance35"] - df[,"Balance34"]
df$delta36 <- df[,"Balance36"] - df[,"Balance35"]
But since the whole computation has a regular structure, we're really only talking about generating an Nx36 array of differences, so use numeric column indices. Say your 37 "Balance*" columns (Balance, Balance1, ..., Balance36) sit at indices 50:86 and your delta_cols are 100:135, or whatever. Then the earlier month of each pair is balance_lhs <- 50:85 and the later month is 51:86, or just balance_lhs + 1 (remember that most operators like addition vectorize in R).
So your Nx36 array can be generated by just the one-liner:
df[,delta_cols] <- df[,(balance_lhs+1)] - df[,balance_lhs]
And you can look the column indices up by name, e.g. balance_cols <- match(paste0("Balance", c("", 1:36)), colnames(df)), and likewise for the delta columns once they exist, to avoid magic-number column indices in your code.
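As a rough sketch of the same idea using column names instead of numeric positions: the sample data, balance_names and delta_names below are assumptions made for illustration, not part of the answer above.

```r
# Made-up sample data: 5 rows, columns Balance, Balance1, ..., Balance36
set.seed(1)
df <- as.data.frame(matrix(runif(37 * 5), ncol = 37))
balance_names <- paste0("Balance", c("", 1:36))
names(df) <- balance_names

delta_names <- paste0("delta", 1:36)

# Later month minus earlier month, for all 36 pairs at once:
# columns Balance1..Balance36 minus columns Balance..Balance35
df[delta_names] <- df[balance_names[-1]] - df[balance_names[-37]]
```

Subtracting two data frames of the same shape works element by element, so no apply() or explicit loop is needed, and the result is assigned to 36 new columns in one step.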
Use lapply to calculate delta for all 36 comparisons in one line.
# Sample data (37 columns, labelled Balance, Balance1, ...)
set.seed(2017);
df <- as.data.frame(matrix(runif(37 * 100), ncol = 37));
colnames(df) <- paste("Balance", c("", seq(1:36)), sep = "");
# List of difference vectors (36 distance vectors, labelled delta1, ...)
lst <- lapply(2:ncol(df), function(i) df[, i] - df[, i - 1]);
names(lst) <- paste("delta", seq(1:36), sep = "");
# Combine with original dataframe
df <- cbind.data.frame(
df,
as.data.frame(lst));

R Array subsetting: flexible use of drop

As noted in "Subsetting R array: dimension lost when its length is 1",
R drops every dimension whose length is 1 when subsetting.
The drop argument helps avoid that.
I need a more flexible way to subset:
> arr = array(1, dim= c(1,2,3,4))
> dim(arr[,,1,])
[1] 2 4
> dim(arr[,,1,,drop=F])
[1] 1 2 1 4
I want a way to subset that drops the 3rd dimension (that is, the dimension where I select the single index 1) while keeping the 1st dimension (the dimensions on which no subset is applied).
It should return an array with dimension = 1 2 4
My issue is that I started coding with an array with no dimension = 1, but when coming to deal with some cases where a dimension is 1, it crashes. The function I need provides a way to deal with the array as if the dimension is not 1.
There are two ways to do this: either use adrop from the abind package, or build a new array with the dimensions you choose after doing the subsetting.
library(abind)
arr <- array(sample(100, 24), dim=c(1, 2, 3, 4))
arr2 <- adrop(arr[ , , 1, , drop=FALSE], drop=3)
dim(arr2)
arr3 <- array(arr[ , , 1 , ], dim=c(1,2,4))
identical(arr2, arr3)
If you want a function that takes a single specified margin, and a single index of that margin, and drops that margin to create a new array with exactly one fewer margin, here is how to do it with abind:
specialsubset <- function(ARR, whichMargin, whichIndex) {
library(abind)
stopifnot(length(whichIndex) == 1, length(whichMargin) == 1, is.numeric(whichMargin))
return(adrop(x = asub(ARR, idx = whichIndex, dims = whichMargin, drop = FALSE), drop = whichMargin))
}
arr4 <- specialsubset(arr, whichMargin=3, whichIndex=1)
identical(arr4, arr2)
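For comparison, the same behaviour can be sketched in base R without abind by subsetting with drop = FALSE and then rebuilding the array with the chosen margin removed. The function name drop_margin is made up for this example.

```r
# Select one index along one margin and drop only that margin
drop_margin <- function(arr, margin, index) {
  stopifnot(length(margin) == 1, length(index) == 1)
  d <- dim(arr)
  # TRUE selects everything along a dimension
  idx <- rep(list(TRUE), length(d))
  idx[[margin]] <- index
  # Equivalent to arr[ , , index, , drop = FALSE] for margin = 3
  sub <- do.call(`[`, c(list(arr), idx, list(drop = FALSE)))
  # Rebuild with the selected margin removed, keeping length-1 dimensions
  array(sub, dim = d[-margin], dimnames = dimnames(arr)[-margin])
}

arr <- array(1:24, dim = c(1, 2, 3, 4))
res <- drop_margin(arr, margin = 3, index = 1)
dim(res)
```

Because the dimensions are set explicitly from dim(arr)[-margin], any other dimension of length 1 survives the subsetting.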

Accessing elements in matlab, get pixels of color image (array) from indices stored in another array

A and B are mask indices (row and column respectively) and C is an image and I want to note the color values stored in C for which the indices are stored in A and B. If A and B would be something like [1, 2, 3] and [20, 30, 40] so I would like to find C(1, 20, :), C(2, 30, :) and C(3, 40, :).
If I do D = C(A, B, :), I get an array of size 3x3x3 in this case, however I want an array of size 3x1x3. I know I am messing with the indexing, is there a simple way to do this without writing a loop?
Simply stating, is there a way to do the following without a loop:
for i = 1:10
D(i, :) = C(A(i), B(i), :)
end
You need to convert the subscripts to linear indices. You can use sub2ind for that:
r = C(sub2ind(size(C), A, B, 1*ones(size(A))));
g = C(sub2ind(size(C), A, B, 2*ones(size(A))));
b = C(sub2ind(size(C), A, B, 3*ones(size(A))));
The n x 1 x 3 result you want would then simply be cat(3, r(:), g(:), b(:)).
Why not something like
C = C(A,B(i),:);
You could use a for statement to get the value of i or set it some other way.
It sounds like everything is working as it should. In your example you've indexed 9 elements of C using A and B. Then D is a 3x3x3 array with the dimensions corresponding to [row x col x color_mask(RGB)]. Why would the second dimension be reduced to 1 unless B only contained one value (signifying you only want to take elements from one column)? Of course A and B must not contain values higher than the number of rows and columns in C.
A = [1 2 3];
B = [20];
D = C(A,B,:);
size(D)
>> 3 1 3
EDIT: Ok, I see what you mean now. You want to specify N number of points using A(Nx1) and B(Nx1). Not NxN number of points which is what you are currently getting.

Resources