I have a table stored in a csv file that looks like this:
"","",""
"1",50.7109704392639,598.945216481663
"2",88.4551431247316,432.427671968179
"3",146.142850442859,558.077250358249
"4",67.5287612139969,283.50009457641
"5",28.8212787088875,355.3292769956
I am trying to concatenate the second and the third columns from this table into an array by doing this:
data <- read.table("testecase3.csv", header = TRUE, sep = ",")
before <- data[2];
after <- data[3];
merge <- c(before, after);
When I print this new array, this is what I get:
$`X.1`
[1] 50.71097 88.45514 146.14285 67.52876 28.82128
$X.2
[1] 598.9452 432.4277 558.0773 283.5001 355.3293
How can I fix this problem? I would like something like this:
[1] 50.71097 88.45514 146.14285 67.52876 28.82128 598.9452 432.4277 558.0773 283.5001 355.3293
The correct way to do this is to index with a comma and then concatenate:
before <- data[, 2]
after <- data[, 3]
merge <- c(before, after)
As Darren explained above, data[2] extracts column 2 as a one-column data.frame, whereas data[, 2] extracts the elements of column 2 as a vector.
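To see the difference for yourself, here is a minimal sketch that builds a small stand-in data frame in memory instead of reading testecase3.csv. The column names id, x1 and x2 and the rounded values are placeholders, not what read.table would generate.

```r
# Small stand-in for the CSV (values rounded; column names are placeholders)
data <- data.frame(id = 1:5,
                   x1 = c(50.71, 88.46, 146.14, 67.53, 28.82),
                   x2 = c(598.95, 432.43, 558.08, 283.50, 355.33))

class(data[2])      # "data.frame": a one-column data frame, i.e. a list
class(data[, 2])    # "numeric": a plain vector

# c() on vectors gives one long vector; c() on data frames gives a list
merged <- c(data[, 2], data[, 3])
length(merged)      # 10

# Same result without the intermediate variables
merged2 <- unlist(data[2:3], use.names = FALSE)
```

data[[2]] would work just as well as data[, 2] here; both return the column itself rather than a data frame wrapping it.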
Related
I am getting a "ValueError: setting an array element with a sequence." error when I try to run my random forest classifier on heterogeneous data. The text data has been fed to a word2vec model, and I extracted a one-dimensional numpy array for each text row by taking the mean of the word2vec vectors of its words.
Here is a sample of the data I am working with:
col-A col-B ..... col-z
100 230 ...... [0.016612869501113892, -0.04279713928699493, .....]
where col-z is a numpy array with a fixed size of 300 in each row.
The following is the code for calculating the mean of the word2vec vectors and creating the numpy arrays:
final_data = []
for i, row in df.iterrows():
    text_vectorized = []
    text = row['col-z']
    for word in text:
        try:
            text_vectorized.append(list(w2v_model[word]))
        except Exception as e:
            pass
    try:
        text_vectorized = np.asarray(text_vectorized, dtype='object')
        text_vectorized_mean = list(np.mean(text_vectorized, axis=0))
    except Exception as e:
        text_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(text_vectorized_mean)
    except:
        text_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(text_vectorized_mean, dtype='object')
    final_data.append(temp_row)
text_array = np.asarray(final_data, dtype='object')
After this, I convert text_array to pandas dataframe and concatenate it with my original dataframe with other numeric columns. But as soon as I try to feed this data into a classifier, it gives me the above error at this line:
--> array = np.array(array, dtype=dtype, order=order, copy=copy)
Why am I getting this error?
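You are trying to create a numeric array from a mixed structure in which some elements are plain numbers and others are sequences (lists or arrays). NumPy cannot store a whole sequence in a single numeric cell, which is what the error says. Instead of keeping an object array, stack the per-row vectors into a regular 2-D float array so that every vector component becomes its own column. All rows must have the same length for this to work; note that your fallback uses np.zeros(100) while you describe the vectors as having 300 elements.
For example,
text_array = np.vstack(final_data).astype(float)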
My dataframe has outstanding-balance columns named Balance, Balance1, Balance2, ..., Balance36.
I want to add a column for the delta between each pair of consecutive months, i.e. delta2 = Balance2 - Balance1.
How can I simplify my method below?
dataset$delta1 = apply(dataset[, c("Balance1","Balance")], 1, function(x){x[2]-x[1]})
dataset$delta2 = apply(dataset[, c("Balance2","Balance1")], 1, function(x){x[2]-x[1]})
...
dataset$delta35 = apply(dataset[, c("Balance35","Balance34")], 1, function(x){x[2]-x[1]})
dataset$delta36 = apply(dataset[, c("Balance36","Balance35")], 1, function(x){x[2]-x[1]})
It boils down to a one-liner. First, name your dataset something short, df is the usual name. Then, use direct subtraction; there's zero need to call apply() to subtract one column from another:
df$delta1 <- df[,"Balance1"] - df[,"Balance"]
df$delta2 <- df[,"Balance2"] - df[,"Balance1"]
...
df$delta35 <- df[,"Balance35"] - df[,"Balance34"]
df$delta36 <- df[,"Balance36"] - df[,"Balance35"]
But since the whole computation has a regular structure, we're really only talking about generating an N x 36 block of differences, so use numeric column indices. There are 37 "Balance*" columns (Balance, Balance1, ..., Balance36); say their indices are 50:86 and your delta columns are 100:135, or whatever. Then the columns being subtracted are balance_lhs <- 50:85, and the columns they are subtracted from are 51:86, or just balance_lhs + 1 (remember that most operators like addition vectorize in R).
So your N x 36 block can be generated by just the one-liner:
df[, delta_cols] <- df[, balance_lhs + 1] - df[, balance_lhs]
And you can compute delta_cols <- match(paste0("delta", 1:36), colnames(df)) programmatically (assuming the delta columns already exist), to avoid magic-number column indices in your code.
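If you would rather not track column positions at all, the same subtraction can be written with column names. This is only a sketch on toy data: it uses 4 balance columns instead of 37, and the helper names bal_names and delta_names are made up for the example.

```r
# Toy data: Balance, Balance1, Balance2, Balance3 (4 columns instead of 37)
set.seed(1)
n_months <- 3
df <- as.data.frame(matrix(runif(5 * (n_months + 1)), nrow = 5))
bal_names <- paste0("Balance", c("", seq_len(n_months)))
colnames(df) <- bal_names

# Later month minus earlier month, for every consecutive pair at once;
# assigning to names that do not exist yet creates the delta columns
delta_names <- paste0("delta", seq_len(n_months))
df[delta_names] <- df[bal_names[-1]] - df[bal_names[-length(bal_names)]]
```

With the real data, setting n_months to 36 should be the only change needed, provided the columns are named exactly Balance, Balance1, ..., Balance36.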
Use lapply to calculate delta for all 36 comparisons in one line.
# Sample data (37 columns, labelled Balance, Balance1, ...)
set.seed(2017);
df <- as.data.frame(matrix(runif(37 * 100), ncol = 37));
colnames(df) <- paste("Balance", c("", seq(1:36)), sep = "");
# List of difference vectors (36 vectors, labelled delta1, ..., delta36)
lst <- lapply(2:ncol(df), function(i) df[, i] - df[, i - 1]);
names(lst) <- paste("delta", seq(1:36), sep = "");
# Combine with original dataframe
df <- cbind.data.frame(
df,
as.data.frame(lst));
I recently started working with R for a research project I'm interested in, and I'm trying to create a multidimensional array that would contain data frame rows.
I have a large data frame containing many columns, that are either numeric, or strings. For the sake of simplicity, let's work with 3 columns:
thread_id: an integer number between 1 and 10100.
user_id: an integer number given to users.
post_name: a string that gives us the title of the post
I would like to create a data structure, preferably a two-dimensional array, where the first dimension is the thread_id and the second indexes a row from the data frame.
So, as a result, for
DataSet[1][1], I'd get thread_id: 1, user_id: 100, post_name: "some name 1"
DataSet[1][2], I'd get thread_id: 1, user_id: 101, post_name: "some name 2"
DataSet[5][10], I'd get thread_id: 5, user_id: 900, post_name: "some name 3"
Is this possible to do in R? My only previous experience is with Java, where this could be solved with an array of Objects.
Thanks for all the help!
If, say, thread_id took on values 1 to 5, you could use:
mylist <- list()
for (i in 1:5)
  mylist[[i]] <- myData[myData$thread_id == i, ]
You could of course use max(myData$thread_id) instead of 5...
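A loop-free sketch of the same idea uses split(), which returns a list with one data frame per thread_id. The sample rows below are made up to mirror the question. Note that the list is indexed by the thread_id as a character name, so the ids do not have to be contiguous.

```r
# Made-up rows mirroring the question's example
myData <- data.frame(
  thread_id = c(1, 1, 5),
  user_id   = c(100, 101, 900),
  post_name = c("some name 1", "some name 2", "some name 3"),
  stringsAsFactors = FALSE
)

# One data frame per thread, named by thread_id
by_thread <- split(myData, myData$thread_id)

# Second post of thread 1, roughly what DataSet[1][2] was meant to return
second_post <- by_thread[["1"]][2, ]
second_post$user_id   # 101
```

A list of data frames is probably the closest R equivalent to a Java array of Objects here, since each thread can hold a different number of rows.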
Here is an alternative for you.
Assumption: df is a data.frame
convert.to.str <- function(df){
  df_col <- names(df)
  val <- unlist(df)
  ans <- paste(df_col, val, sep = ': ')
  paste(ans, collapse = ', ')
}
int_ans <- data.frame(thread_id = df$thread_id, ans = apply(df,1,convert.to.str), nrow2=1:nrow(df))
library(reshape2)
int_ans2 <- dcast(int_ans,thread_id ~ nrow2,value.var='ans')
DataSet <- int_ans2[2:ncol(int_ans2)]
dimnames(DataSet)[[1]] <- int_ans2$thread_id
I have an array that can have one or more pages or sheets (my names for the third dimension). I am attempting to perform operations on the array. When there is only one sheet or page the result of the operation is a matrix. I would like the result to be an array. Is there a way to retain the class array even when the result of the operation has only 1 sheet or page?
Here is an example. I would like my.var.2 and my.var.3 to be arrays. The variable my.pages is set to 1 here, which seems to be causing the problem. However, my.pages can be >1. If my.pages <- 2 then my.var.2 and my.var.3 are arrays.
set.seed(1234)
my.rows <- 10
my.columns <- 4
my.pages <- 1
my.var.1 <- array( rnorm((my.rows*my.columns*my.pages), 10, 2),
c(my.rows,my.columns,my.pages))
my.var.1
my.var.2 <- 2 * my.var.1[,-my.columns,]
my.var.3 <- 10 * my.var.1[,-1,]
class(my.var.2)
class(my.var.3)
my.var.2 <- as.array(my.var.2)
my.var.3 <- as.array(my.var.3)
class(my.var.2)
class(my.var.3)
my.var.2 <- as.array( 2 * my.var.1[,-my.columns,])
my.var.3 <- as.array(10 * my.var.1[,-1,] )
class(my.var.2)
class(my.var.3)
The switch to matrix causes problems when I try to use my.var.1 and my.var.2 in nested for-loops.
The following if statement seems to solve the problem, but also seems a little clunky. Is there a more elegant solution?
if(my.pages == 1) {my.var.2 <- array(my.var.2, c(my.rows,(my.columns-1),my.pages))}
From help("["):
Usage:
x[i, j, ... , drop = TRUE]
...
drop: For matrices and arrays. If 'TRUE' the result is coerced to
the lowest possible dimension (see the examples). This only
works for extracting elements, not for the replacement. See
'drop' for further details.
Your code, revisited:
set.seed(1234)
my.rows <- 10
my.columns <- 4
my.pages <- 1
my.var.1 <- array( rnorm((my.rows*my.columns*my.pages), 10, 2),
c(my.rows,my.columns,my.pages))
my.var.2 <- 2 * my.var.1[,-my.columns,,drop=FALSE]
my.var.3 <- 10 * my.var.1[,-1,,drop=FALSE]
class(my.var.2)
## [1] "array"
class(my.var.3)
## [1] "array"
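Checking dim() rather than class() makes the effect easier to see. The following is a small sketch with a made-up single-page array; the names a, dropped, kept and page_sums are only for illustration.

```r
a <- array(1:40, dim = c(10, 4, 1))   # a single "page"

dropped <- a[, -1, ]                   # third dimension silently dropped
kept    <- a[, -1, , drop = FALSE]     # third dimension retained

dim(dropped)   # 10 3
dim(kept)      # 10 3 1

# A loop over pages now works whether there is one page or many
page_sums <- numeric(dim(kept)[3])
for (k in seq_len(dim(kept)[3])) {
  page_sums[k] <- sum(kept[, , k])
}
```

Using seq_len(dim(kept)[3]) in the loop keeps the code valid for any number of pages, including one.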
I want to make an array with 3 dimensions.
Here is what I tried:
z<-c(160,720,420)
first_data_set <-array(dim = length(file_1), dimnames = z)
The data that I am reading has one level (only x and y).
There are other data in the same format, and I need to put them in the same array as the first data set, so that once I finish reading everything, all the data are in one array and nothing has been overwritten.
So I think the array has to be 3-dimensional; otherwise I cannot keep all the data that I read in the loop.
Say that you have two matrices of size 3x4:
m1 <- matrix(rnorm(12), nrow = 3, ncol = 4)
m2 <- matrix(rnorm(12), nrow = 3, ncol = 4)
If you want to place them in an array, first make an array of NA's:
A <- array(as.numeric(NA), dim = c(3,4,2))
Then populate the layers with data:
A[,,1] <- m1
A[,,2] <- m2
As suggested by @Justin, you could also just put the matrices together in a list:
A2 <- list()
A2[['m1']] <- m1
A2[['m2']] <- m2
To read matrices from files: using a list makes it easier to get these matrices from files in a directory, without having to specify the dimensions in advance. Assume you want all files with extension csv:
myfiles <- dir(pattern = "\\.csv$")
for (i in seq_along(myfiles)) {
  A2[[myfiles[i]]] <- read.table(myfiles[i], sep = ',')
}
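If you still want a 3-dimensional array at the end, a list of equally sized matrices can be stacked afterwards. This is a sketch under the assumption that every element has the same dimensions. It uses in-memory matrices rather than files, and the name mats is made up for the example.

```r
set.seed(42)
mats <- list(m1 = matrix(rnorm(12), nrow = 3, ncol = 4),
             m2 = matrix(rnorm(12), nrow = 3, ncol = 4))

# All matrices must share the same dimensions for stacking to make sense
stopifnot(all(sapply(mats, function(m) identical(dim(m), dim(mats[[1]])))))

# unlist() concatenates the matrices column by column, in list order,
# which is the same order array() uses to fill its layers
A <- array(unlist(mats, use.names = FALSE),
           dim = c(dim(mats[[1]]), length(mats)),
           dimnames = list(NULL, NULL, names(mats)))

dim(A)   # 3 4 2
```

Data frames returned by read.table would need to go through as.matrix() first, for example with lapply(A2, as.matrix), before being stacked this way.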