I have a data frame with a horrible schema where many of the dimensions have values that are comma-separated arrays. Instead of keeping these arrays, I wish to apply operations such as count, sum, mean, etc. to their values.
e.g.
ColA ColB
A [0.0,0.0,0.0,2177.0068,0.0,0.0,0.0,0.0,0.0,0.0]
B [0.0,0.0,650.2635,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
C [0.0,0.0,406.3296,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
D \N
E [0.0,0.0,982.2527,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
F [0.0,0.0,0.0,163.6882,0.0,0.0,0.0,0.0,0.0,0.0]
Does anyone have an elegant way of summing/counting/mean of each array?
Thanks
Convert this to long form, in which case it is easy to perform aggregations.
1) Assuming DF as shown reproducibly in the Note at the end, remove the square brackets in ColB and separate ColB into rows, converting to numeric along the way. Then group by ColA and take the sum and mean of ColB (and possibly use other aggregation functions as well). If you don't want an NA for D, filter out the rows for which ColB does not start with [. See the filter statement in (2).
library(dplyr)
library(tidyr)
DF %>%
mutate(ColB = gsub("[][]", "", ColB)) %>%
separate_rows(ColB, sep = "[^-0-9.]", convert = TRUE) %>%
group_by(ColA) %>%
summarize(Sum = sum(ColB), Mean = mean(ColB)) %>%
ungroup
giving:
# A tibble: 6 x 3
ColA Sum Mean
<chr> <dbl> <dbl>
1 A 2177. 218.
2 B 650. 65.0
3 C 406. 40.6
4 D NA NA
5 E 982. 98.2
6 F 164. 16.4
2) Alternatively, use the fact that the ColB strings that start with [ are valid JSON. Here we first filter out the non-JSON elements of ColB.
library(dplyr)
library(jsonlite)
library(tidyr)
DF %>%
filter(substring(ColB, 1, 1) == "[") %>%
rowwise() %>%
mutate(ColB = list(fromJSON(ColB))) %>%
ungroup %>%
unnest(ColB) %>%
group_by(ColA) %>%
summarize(Sum = sum(ColB), Mean = mean(ColB)) %>%
ungroup
giving:
# A tibble: 5 x 3
ColA Sum Mean
<chr> <dbl> <dbl>
1 A 2177. 218.
2 B 650. 65.0
3 C 406. 40.6
4 E 982. 98.2
5 F 164. 16.4
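If pulling in packages is not desirable, the same idea can be sketched in base R. This is only an illustration under assumptions: the two-row DF below is a cut-down stand-in for the data in the Note, and parse_array is a hypothetical helper name, not part of any package.

```r
# Cut-down stand-in for the question's data (assumption, not the full DF)
DF <- data.frame(
  ColA = c("A", "D"),
  ColB = c("[0.0,0.0,0.0,2177.0068,0.0]", "\\N"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn "[a,b,c]" into a numeric vector, NA for non-arrays
parse_array <- function(s) {
  if (!startsWith(s, "[")) return(NA_real_)
  as.numeric(strsplit(gsub("[][]", "", s), ",", fixed = TRUE)[[1]])
}

vals <- lapply(DF$ColB, parse_array)

# One summary value per row; rows that were not arrays stay NA
DF$Sum   <- vapply(vals, sum, numeric(1))
DF$Mean  <- vapply(vals, mean, numeric(1))
DF$Count <- vapply(vals, function(v) sum(!is.na(v)), integer(1))
DF
```

Keeping the parsed vectors in a list avoids reshaping to long form, which may be convenient when only per-row summaries are needed.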
Note
Lines <- "ColA ColB
A [0.0,0.0,0.0,2177.0068,0.0,0.0,0.0,0.0,0.0,0.0]
B [0.0,0.0,650.2635,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
C [0.0,0.0,406.3296,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
D \\N
E [0.0,0.0,982.2527,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
F [0.0,0.0,0.0,163.6882,0.0,0.0,0.0,0.0,0.0,0.0]"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
I am using the code below to fill a 3D array from another 3D array. I have used the sapply function to apply the code lines at each individual (3rd dimension) as in Efficient way to fill a 3D array.
Here is my code.
ind <- 1000
individuals <- as.character(seq(1, ind, by = 1))
maxCol <- 7
col <- 4
line <- 0
a <- 0
b <- 0
c <- 0
col_array <- c("year","time", "ID", "age", as.vector(outer(c(paste(seq(0, 1, by = 1), "year", sep="_"), paste(seq(2, maxCol, by = 1), "years", sep="_")), c("S_F", "I_F", "R_F"), paste, sep="_")))
array1 <- array(sample(1:100, length(col_array), replace = T), dim=c(2, length(col_array), ind), dimnames=list(NULL, col_array, individuals)) ## 3rd dimension = individual ID
## print(array1)
col_array <- c("year","time", "ID", "age", as.vector(outer(c(paste(seq(0, 1, by = 1), "year", sep="_"), paste(seq(2, maxCol, by = 1), "years", sep="_")), c("S_M", "I_M", "R_M"), paste, sep="_")))
array2 <- array(NA, dim=c(2, length(col_array), ind), dimnames=list(NULL, col_array, individuals)) ## 3rd dimension = individual ID
## print(array2)
library(tictoc) # provides tic() and toc()
tic("array2")
array2 <- sapply(individuals, function(i){
## Fill the first columns
array2[line + 1, c("year", "time", "ID", "age"), i] <- c(a, b, i, c)
## Define column indexes for individuals S
col_start_S_F <- which(colnames(array1[,,i])=="0_year_S_F")
col_end_S_F <- which(colnames(array1[,,i])==paste(maxCol,"years_S_F", sep="_"))
col_start_S_M <- which(colnames(array2[,,i])=="0_year_S_M")
col_end_S_M <- which(colnames(array2[,,i])==paste(maxCol,"years_S_M", sep="_"))
## Fill the columns for individuals S
p_S_M <- sapply(0:maxCol, function(x){pnorm(x, 4, 1)})
array2[line + 1, col_start_S_M:col_end_S_M, i] <- round(as.numeric(as.vector(array1[line + 1, col_start_S_F:col_end_S_F, i]))*p_S_M)
## Define column indexes for individuals I
col_start_I_F <- which(colnames(array1[,,i])=="0_year_I_F")
col_end_I_F <- which(colnames(array1[,,i])==paste(maxCol,"years_I_F", sep="_"))
col_start_I_M <- which(colnames(array2[,,i])=="0_year_I_M")
col_end_I_M <- which(colnames(array2[,,i])==paste(maxCol,"years_I_M", sep="_"))
## Fill the columns for individuals I
p_I_M <- sapply(0:maxCol, function(x){pnorm(x, 2, 1)})
array2[line + 1, col_start_I_M:col_end_I_M, i] <- round(as.numeric(as.vector(array1[line + 1, col_start_I_F:col_end_I_F, i]))*p_I_M)
## Define column indexes for individuals R
col_start_R_M <- which(colnames(array2[,,i])=="0_year_R_M")
col_end_R_M <- which(colnames(array2[,,i])==paste(maxCol,"years_R_M", sep="_"))
## Fill the columns for individuals R
array2[line + 1, col_start_R_M:col_end_R_M, i] <- as.numeric(as.vector(array2[line + 1, col_start_S_M:col_end_S_M, i])) +
as.numeric(as.vector(array2[line + 1, col_start_I_M:col_end_I_M, i]))
return(array2[,,i])
## print(array2[,,i])
}, simplify = "array")
## print(array2)
toc()
Is there a way to increase the performance/speed of my code (i.e., < 1 sec)? There are 500000 observations for the 3rd dimension. Any suggestions?
TL;DR: Here's a tidyverse solution that transforms the sample array into a dataframe and applies the requested changes. EDIT: I've added steps 1+2 to transform the original post's sample data into the format I used in step 3. The actual calculation in Step 3 is very fast (<0.1 sec), but the bottleneck is step 2, which takes 10 seconds for 500k rows.
Step 0: Create sample data for 500k individuals
ind <- 500000
individuals <- as.character(seq(1, ind, by = 1))
maxCol <- 7
col <- 4
line <- 0
a <- 0
b <- 0
c <- 0
col_array <- c("year","time", "ID", "age", as.vector(outer(c(paste(seq(0, 1, by = 1), "year", sep="_"), paste(seq(2, maxCol, by = 1), "years", sep="_")), c("S_F", "I_F", "R_F"), paste, sep="_")))
array1 <- array(sample(1:100, length(col_array), replace = T), dim=c(2, length(col_array), ind), dimnames=list(NULL, col_array, individuals)) ## 3rd dimension = individual ID
dim(array1)
# [1] 2 28 500000 # Two rows x 28 measures x 500k individuals
Step 1: Subset array and convert to data frame.
library(tidyverse)
# OP only uses first line of array1. If other rows needed, replace with "array1 %>%"
# and adjust renaming below to account for different Var1.
array1_dt <- array1[1,,] %>%
as.data.frame.table(stringsAsFactors = FALSE)
Step 2: Break out the stats into different columns, with one row for each individual-year. This is the slowest step (especially the spread line), and takes 0.05 sec for 1000 individuals but 10 seconds for 500k. I expect a data.table solution could make it much faster, if needed.
array1_dt_reshape <- array1_dt %>%
rename(stat = Var1, ID = Var2) %>%
filter(!stat %in% c("year", "time", "ID", "age")) %>%
mutate(year = stat %>% str_sub(end = 1),
col = stat %>% str_sub(start = -3)) %>%
select(-stat) %>%
spread(col, Freq) %>%
arrange(ID)
Step 3: Apply requested transformation. This function calculates the distribution with two sets of parameters, and uses these to scale the input table's columns. It takes 0.03 sec for 500k individuals.
array_transform <- function(input_data = array1_dt_reshape,
max_yr = 7, S_M_mean = 4, I_M_mean = 2) {
tictoc::tic()
# First calculate the distribution function values to apply to all individuals,
# depending on year.
p_S_M_vals <- sapply(0:max_yr, function(x){pnorm(x, S_M_mean, 1)})
p_I_M_vals <- sapply(0:max_yr, function(x){pnorm(x, I_M_mean, 1)})
# For each year, scale S_M + I_M by the respective distribution functions.
# This solution relies on the fact that each ID has 8 rows every time,
# so we can recycle the 8 values in the distribution functions.
output <- input_data %>%
# group_by(ID) %>% <-- Not needed
mutate(S_M = S_F * p_S_M_vals,
I_M = I_F * p_I_M_vals,
R_M = S_M + I_M) # %>% ungroup <-- Not needed
tictoc::toc()
return(output)
}
array1_output <- array_transform(array1_dt_reshape)
Results
head(array1_output)
ID year I_F R_F S_F S_M I_M R_M
1 1 0 16 76 23 7.284386e-04 0.3640021 0.3647305
2 1 1 46 96 80 1.079918e-01 7.2981417 7.4061335
3 1 2 27 57 76 1.729010e+00 13.5000000 15.2290100
4 1 3 42 64 96 1.523090e+01 35.3364793 50.5673837
5 1 4 74 44 57 2.850000e+01 72.3164902 100.8164902
6 1 5 89 90 64 5.384606e+01 88.8798591 142.7259228
7 1 6 23 16 44 4.299899e+01 22.9992716 65.9982658
8 1 7 80 46 90 8.987851e+01 79.9999771 169.8784862
9 2 0 16 76 23 7.284386e-04 0.3640021 0.3647305
10 2 1 46 96 80 1.079918e-01 7.2981417 7.406133
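The recycling trick from Step 3 can also be applied to the array directly, without reshaping first. The following is a rough sketch under assumptions: array1 here is a reduced stand-in holding only the S_F and I_F columns for 1000 individuals, and the rounding follows the original post's code rather than the tidyverse version above.

```r
set.seed(1)
ind <- 1000
maxCol <- 7
ages <- c("0_year", "1_year", paste(2:maxCol, "years", sep = "_"))
cols_F <- as.vector(outer(ages, c("S_F", "I_F"), paste, sep = "_"))

# Reduced stand-in for the question's array1 (assumption: only S_F/I_F columns)
array1 <- array(sample(1:100, 2 * length(cols_F) * ind, replace = TRUE),
                dim = c(2, length(cols_F), ind),
                dimnames = list(NULL, cols_F, as.character(seq_len(ind))))

# pnorm() is vectorised, so no sapply() is needed
p_S <- pnorm(0:maxCol, 4, 1)
p_I <- pnorm(0:maxCol, 2, 1)

# Slicing one row of the array gives an 8 x ind matrix (ages x individuals)
S_F <- array1[1, paste(ages, "S_F", sep = "_"), ]
I_F <- array1[1, paste(ages, "I_F", sep = "_"), ]

# The length-8 probability vectors are recycled down each column
S_M <- round(S_F * p_S)
I_M <- round(I_F * p_I)
R_M <- S_M + I_M
dim(R_M)
```

Because the whole calculation is three matrix operations, it should scale to 500k individuals without a per-individual loop, although the timing has not been measured here.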
I would like to convert a matrix/array (with dimnames) into a data frame. This can be done very easily using reshape2::melt but seems harder with tidyr, and in fact not really possible in the case of an array. Am I missing something? (In particular since reshape2 describes itself as being retired; see https://github.com/hadley/reshape).
For example, given the following matrix
MyScores <- matrix(runif(2*3), nrow = 2, ncol = 3,
dimnames = list(Month = month.name[1:2], Class = LETTERS[1:3]))
we can turn it into a data frame as follows
reshape2::melt(MyScores, value.name = 'Score') # perfect
or, using tidyr as follows:
as_tibble(MyScores, rownames = 'Month') %>%
gather(Class, Score, -Month)
In this case reshape2 and tidyr seem similar (although reshape2 is shorter if you are looking for a long-format data frame).
However for arrays, it seems harder. Given
EverybodyScores <- array(runif(2*3*5), dim = c(2,3,5),
dimnames = list(Month = month.name[1:2], Class = LETTERS[1:3], StudentID = 1:5))
we can turn it into a data frame as follows:
reshape2::melt(EverybodyScores, value.name = 'Score') # perfect
but using tidyr it's not clear how to do it:
as_tibble(EverybodyScores, rownames = 'Month') # loses Month information, and Class and StudentID need to be disentangled
Is this a situation where the right solution is to stick to using reshape2?
One way I just found by playing around is to coerce via tbl_cube. I have never really used the class but it seems to do the trick in this instance.
EverybodyScores <- array(
runif(2 * 3 * 5),
dim = c(2, 3, 5),
dimnames = list(Month = month.name[1:2], Class = LETTERS[1:3], StudentID = 1:5)
)
library(tidyverse)
library(cubelyr)
EverybodyScores %>%
as.tbl_cube(met_name = "Score") %>%
as_tibble
#> # A tibble: 30 x 4
#> Month Class StudentID Score
#> <chr> <chr> <int> <dbl>
#> 1 January A 1 0.366
#> 2 February A 1 0.254
#> 3 January B 1 0.441
#> 4 February B 1 0.562
#> 5 January C 1 0.313
#> 6 February C 1 0.192
#> 7 January A 2 0.799
#> 8 February A 2 0.277
#> 9 January B 2 0.631
#> 10 February B 2 0.101
#> # ... with 20 more rows
Created on 2018-08-15 by the reprex package (v0.2.0).
Making a tibble drops the row names, but instead of going straight to a tibble, you can first convert the array to a base R data.frame and then use tibble::rownames_to_column to make a column for months. Notice that converting to a data frame creates columns with names like A.1, sticking the class and ID together; you can separate these again with tidyr::separate. Calling as_tibble is optional, only needed if you want a tibble in the end, and it can come at any point in the workflow once you've made a column from the row names.
library(tidyverse)
EverybodyScores <- array(runif(2*3*5), dim = c(2,3,5),
dimnames = list(Month = month.name[1:2], Class = LETTERS[1:3], StudentID = 1:5))
EverybodyScores %>%
as.data.frame() %>%
rownames_to_column("Month") %>%
gather(key = class_id, value = value, -Month) %>%
separate(class_id, into = c("Class", "StudentID"), sep = "\\.") %>%
as_tibble()
#> # A tibble: 30 x 4
#> Month Class StudentID value
#> <chr> <chr> <chr> <dbl>
#> 1 January A 1 0.576
#> 2 February A 1 0.229
#> 3 January B 1 0.930
#> 4 February B 1 0.547
#> 5 January C 1 0.761
#> 6 February C 1 0.468
#> 7 January A 2 0.631
#> 8 February A 2 0.893
#> 9 January B 2 0.638
#> 10 February B 2 0.735
#> # ... with 20 more rows
Created on 2018-08-15 by the reprex package (v0.2.0).
Here is the new tidyr way to do the same:
library(tidyr)
EverybodyScores <- array(
runif(2 * 3 * 5),
dim = c(2, 3, 5),
dimnames = list(Month = month.name[1:2], Class = LETTERS[1:3], StudentID = 1:5)
)
as_tibble(EverybodyScores, rownames = "Month") %>%
pivot_longer(
cols = matches("^A|^B|^C"),
names_sep = "\\.",
names_to = c("Class", "StudentID")
)
#> # A tibble: 30 x 4
#> Month Class StudentID value
#> <chr> <chr> <chr> <dbl>
#> 1 January A 1 0.0325
#> 2 January B 1 0.959
#> 3 January C 1 0.593
#> 4 January A 2 0.0702
#> 5 January B 2 0.882
#> 6 January C 2 0.918
#> 7 January A 3 0.459
#> 8 January B 3 0.849
#> 9 January C 3 0.901
#> 10 January A 4 0.328
#> # … with 20 more rows
Created on 2021-02-23 by the reprex package (v1.0.0)
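For completeness, base R can also melt an array with named dimnames through as.data.frame.table, which is the same function used for array1 in the earlier thread. A minimal sketch, assuming the dimnames are named as in the question:

```r
EverybodyScores <- array(
  runif(2 * 3 * 5),
  dim = c(2, 3, 5),
  dimnames = list(Month = month.name[1:2], Class = LETTERS[1:3], StudentID = 1:5)
)

# One row per array cell; the dimnames names become the column names
long <- as.data.frame.table(EverybodyScores,
                            responseName = "Score",
                            stringsAsFactors = FALSE)
head(long)
```

Note that StudentID comes back as character, because dimnames are always stored as character vectors; convert with as.integer() if a numeric ID is needed.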
I have the following dataframe:
x <- data.frame("A"=c(rep(4,3),rep(7,4),rep(2,2)),
"B"=c("Q","Y"," ","F","Q"," ","Z","Q","C"),
"C"=seq(1:9))
A being my grouping variable / factor
B is the status flag; I'll start subsetting after its first appearance.
I've managed to subset a fixed number of rows after it by adding the number of rows to subset to B's indices where "Q" appears:
Something like this: x[c(which(x$B=="Q"),which(x$B=="Q")+1),]
"+1" because I was just interested in the first row after "Q" occurs.
Now I need the rest of the rows within each group after "Q" occurs (including Q's row), and I've been banging my head against my desk trying to figure out how to do this with the dplyr package with grouped tibbles... hence I'm here.
Please help?
------ EDIT -----
This seemed to have worked
x %>% group_by(A) %>% filter(row_number()>=which.max(B=="Q"))
Using dplyr and tidyr you may run the following code:
mydf <- data.frame("A"=c(rep(4,3),rep(7,4),rep(2,2)),
"B"=c("Q","Y"," ","F","Q"," ","Z","Q","C"),
"C"=seq(1:9))
library(tidyverse)
mydf %>%
group_by(A) %>%
mutate(selector=case_when(
B=="Q" ~ 1
)) %>%
fill(selector, .direction = "down") %>%
filter(selector == 1) %>%
select(-selector) %>%
arrange(C,A)
and this is the result:
A B C
<dbl> <fct> <int>
1 4.00 Q 1
2 4.00 Y 2
3 4.00 " " 3
4 7.00 Q 5
5 7.00 " " 6
6 7.00 Z 7
7 2.00 Q 8
8 2.00 C 9
Is this what you're looking for?
q_rows <- row.names(subset(x, B == 'Q')) # rows where Q occurs
list_of_frames <- list()
for(i in seq_along(q_rows)) {
q_start <- as.numeric(q_rows[i])
q_group <- as.numeric(x[q_start,c('A')])
group_frame <- subset(x[q_start:nrow(x),], A == q_group)
list_of_frames[i] <- list(group_frame)
}
list_of_frames
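Another way to express "keep everything from the first Q onward within each group" is a running count of Q per group. The sketch below uses base R's ave(); seen_q is just an illustrative name. Unlike the which.max() filter in the question's edit, a group with no "Q" at all is dropped entirely, since which.max() of an all-FALSE vector returns 1 and would keep the whole group.

```r
x <- data.frame(A = c(rep(4, 3), rep(7, 4), rep(2, 2)),
                B = c("Q", "Y", " ", "F", "Q", " ", "Z", "Q", "C"),
                C = 1:9)

# Running count of "Q" within each group of A; > 0 once the first Q is reached
seen_q <- ave(as.integer(x$B == "Q"), x$A, FUN = cumsum) > 0

result <- x[seen_q, ]
result
```

In a dplyr pipeline the same condition could be written as group_by(A) %>% filter(cumany(B == "Q")).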
I'm trying to solve a problem in which, for a new route (of a truck) that I just found, I check whether that route is already part of a previous route I have. For instance, assume my stored routes are in the data.table routelist, where node_list holds the stored routes. I want to find the rows that route (5,6,7,8) is part of.
library(data.table)
routelist=data.table(id=c(1:3),node_list=list(c(1:6),c(4:7),c(1:10)))
item<-c(5:8)
routelist[sum(item%in%unlist(routelist$node_list))==length(item)]
For the above check, all three rows are returned, although only the third row should be. I could do it with the following for loop, but it's not fast and does not take order into account (and there should be a better way). The order of the nodes in item is important, and they do not need to be consecutive: item could be c(5,7,8) and should match the 3rd row, while c(5,8,7) shouldn't match.
for(i in 1:3)
{
if(sum(item%in%unlist(routelist[i]$node_list))==length(item))
print(routelist[i])
}
There are two issues with OP's data.table approach here.
Missing by clause
routelist = data.table(id = 1:3, node_list = list(1:6, 4:7, 1:10))
item <- 5:8
routelist[, sum(item %in% unlist(node_list)) == length(item)]
returns a single TRUE value because
routelist[, unlist(node_list)]
returns a single vector
[1] 1 2 3 4 5 6 4 5 6 7 1 2 3 4 5 6 7 8 9 10
If grouped by id, we do get the desired result:
routelist[, sum(item %in% unlist(node_list)) == length(item), by = id]
id V1
1: 1 FALSE
2: 2 FALSE
3: 3 TRUE
or
routelist[routelist[, sum(item %in% unlist(node_list)) == length(item), by = id]$V1]
id node_list
1: 3 1,2,3,4,5,6,
%in% checks only appearance but not the order
The expression sum(item %in% unlist(node_list)) == length(item) doesn't take care of the order of elements in item.
As order of the elements is important, the expression
isTRUE(all(diff(match(item, unlist(node_list))) > 0))
accounts for the order. match() returns the positions of the elements of item in node_list (or NA if not found). If the order in item is the same as in node_list then all differences in position must be positive. isTRUE() is required to cover the NA case.
Thus,
item <- c(5, 7, 8)
routelist[routelist[, isTRUE(all(diff(match(item, unlist(node_list))) > 0)), by = id]$V1]
returns
id node_list
1: 3 1,2,3,4,5,6,
despite the gap while
item <- c(5, 8, 7)
routelist[routelist[, isTRUE(all(diff(match(item, unlist(node_list))) > 0)), by = id]$V1]
returns
Empty data.table (0 rows) of 2 cols: id,node_list
as requested due to the wrong order.
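The match()/diff() check can be pulled out into a small standalone function, which makes it easy to test on plain vectors. This is a sketch only: is_ordered_subset is a hypothetical name, and it assumes no node appears twice within a route, because match() only reports the first position of each element.

```r
# TRUE if every element of item occurs in route, in the same relative order
# (assumes route has no repeated nodes, since match() uses first positions)
is_ordered_subset <- function(item, route) {
  pos <- match(item, route)
  !anyNA(pos) && all(diff(pos) > 0)
}

routes <- list(1:6, 4:7, 1:10)

# Gaps are allowed, as long as the order is preserved
vapply(routes, is_ordered_subset, logical(1), item = c(5, 7, 8))

# Wrong order is rejected for every route
vapply(routes, is_ordered_subset, logical(1), item = c(5, 8, 7))
```

Checking anyNA() explicitly plays the same role as isTRUE() in the expression above: a missing node yields NA positions, which must count as "no match".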
Solutions from dplyr and tidyr.
If the order is not important, the following approach may work. By examining the id column in routelist2, it is clear that id 3 is the one with the right condition.
# Create example dataset
library(data.table)
routelist=data.table(id=c(1:3),node_list=list(c(1:6),c(4:7),c(1:10)))
item<-c(5:8)
# Solution 1
library(dplyr)
library(tidyr)
routelist2 <- routelist %>%
unnest(node_list) %>%
group_by(id) %>%
filter(all(item %in% node_list)) %>%
nest()
routelist2
# A tibble: 1 x 2
id data
<int> <list>
1 3 <tibble [10 x 1]>
If the order is important, we may have to convert the route numbers to a string and then search for the matching string pattern. The following approach should work when the nodes in item are consecutive.
# Solution 2
item_str <- toString(item)
routelist3 <- routelist %>%
rowwise() %>%
mutate(node_list = toString(node_list)) %>%
filter(grepl(item_str, node_list)) %>%
ungroup()
routelist3
# A tibble: 1 x 2
id node_list
<int> <chr>
1 3 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
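One thing to watch with the string approach is partial number matches: "5, 6, 7, 8" is also a substring of "15, 6, 7, 8". A possible guard, sketched here with a hypothetical helper has_run, is to wrap every node in delimiters before searching:

```r
# Hypothetical helper: does item occur as a consecutive run inside route?
# Each number is surrounded by commas so that 5 cannot match inside 15.
has_run <- function(item, route) {
  wrap <- function(v) paste0(",", paste(v, collapse = ","), ",")
  grepl(wrap(item), wrap(route), fixed = TRUE)
}

has_run(5:8, 1:10)              # genuine consecutive run
has_run(5:8, c(15, 6, 7, 8))    # no node 5 in this route

# The plain toString() comparison accepts the second case by accident
grepl(toString(5:8), toString(c(15, 6, 7, 8)))
```

With fixed = TRUE the pattern is treated as a literal string, so no regular-expression escaping is required.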
Update
The following handles the situation where the nodes in item2 are not consecutive.
# Solution 3
library(dplyr)
library(tidyr)
item2 <- c(5, 7, 8)
routelist4 <- routelist %>%
unnest(node_list) %>%
group_by(id) %>%
filter(all(item2 %in% node_list)) %>%
filter(node_list %in% item2) %>%
summarise(node_list = toString(node_list)) %>%
filter(node_list == toString(item2))
routelist4
# A tibble: 1 x 2
id node_list
<int> <chr>
1 3 5, 7, 8
Using a loop (which is not elegant), it is possible to use the following check in the body. It takes order into account:
library(data.table)
routelist=data.table(id=c(1:3),node_list=list(c(1:6),c(4:7),c(1:10)))
item<-c(5,8,7)
for(i in 1:nrow(routelist))
{
if(identical(intersect(unlist(routelist[i]$node_list),item),item)){
print(routelist[i])
}
}
I have two questions.
1) I have a dataset (df) where the last column, which begins with "09", is on every other row. Like so:
a <- c("01+0135.","09-6999","01+0135.","09-6999")
b <- c("02+2015.", "", "02+2015.", "")
c <- c("03+0349.","", "03+0349.", "")
d <- c("04+0537.","","04+0542.","")
e <- c("05+170.1","","05+170.1","")
f <- c("06+0.033","","06+0.384","")
g <- c("07+0.001","","07+0.395","")
h <- c("08+0.000","","08+0.000","")
df <- data.frame(a,b,c,d,e,f,g,h)
> df
1 01+0135. 02+2015. 03+0349. 04+0537. 05+170.1 06+0.033 07+0.001 08+0.000
2 09-6999
3 01+0135. 02+2015. 03+0349. 04+0542. 05+170.1 06+0.384 07+0.395 08+0.000
4 09-6999
Is there a simple way to get every other row (perhaps using regex) to appear in a ninth column?
2) How do I remove the first three characters of each column (i.e. 06+) and the period at the end of each column for columns beginning with 01, 02, 03, and 04?
For the first issue, you can split and recombine:
df1 <- df[(1:nrow(df))%%2==1,]
df2 <- df[(1:nrow(df))%%2==0,]
df1$i <- df2$a
df1
# a b c d e f g h i
# 1 01+0135. 02+2015. 03+0349. 04+0537. 05+170.1 06+0.033 07+0.001 08+0.000 09-6999
# 3 01+0135. 02+2015. 03+0349. 04+0542. 05+170.1 06+0.384 07+0.395 08+0.000 09-6999
For the second issue, the simplest way might be to use substr() (but that's assuming that you really only want characters 4 through 7 of each, otherwise we'll want to do something with regular expressions):
for(i in 1:4) df1[,i] <- substr(df1[,i],4,7)
df1
# a b c d e f g h i
# 1 0135 2015 0349 0537 05+170.1 06+0.033 07+0.001 08+0.000 09-6999
# 3 0135 2015 0349 0542 05+170.1 06+0.384 07+0.395 08+0.000 09-6999
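If the values are not always four characters wide, a regular-expression version may be more robust than fixed positions. A minimal sketch on a plain character vector (vals is an illustrative sample, not the full data frame):

```r
vals <- c("01+0135.", "04+0542.", "05+170.1", "09-6999")

# Drop the two-digit field code and the sign that follows it
stripped <- sub("^[0-9]{2}[+-]", "", vals)

# Drop a trailing period, if there is one
cleaned <- sub("\\.$", "", stripped)
cleaned
```

To apply this to selected columns of df1, the two sub() calls could be wrapped in a function and used with lapply() over those columns. Note that the sign is discarded along with the prefix, as the question asks for the first three characters to be removed.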