Tukey post hoc letters extraction with a loop

I need to apply a Tukey post hoc test to a dataset with 80 columns/variables based on 3 groups/treatments. Is there an automated way (e.g. with a loop) to get a table of all variables in which common letters identify levels or groups that are not significantly different (based on p-values)?

Does this work for you?
# packages and function conflicts
library(conflicted)
library(emmeans)
library(multcomp)
library(multcompView)
library(tidyverse)
conflict_prefer("select", winner = "dplyr")
#> [conflicted] Will prefer dplyr::select over any other package
# Create example data
dat <- PlantGrowth %>%
  transmute(
    group = group,
    y1 = weight,
    y2 = weight * runif(30, 0.8, 1.2),
    y3 = weight * runif(30, 0.8, 1.2)
  ) %>%
  as_tibble()
dat
#> # A tibble: 30 x 4
#> group y1 y2 y3
#> <fct> <dbl> <dbl> <dbl>
#> 1 ctrl 4.17 4.22 3.53
#> 2 ctrl 5.58 6.19 6.46
#> 3 ctrl 5.18 5.95 5.66
#> 4 ctrl 6.11 5.36 7.19
#> 5 ctrl 4.5 4.41 5.11
#> 6 ctrl 4.61 3.93 4.89
#> 7 ctrl 5.17 4.36 4.67
#> 8 ctrl 4.53 4.53 4.72
#> 9 ctrl 5.33 4.64 5.86
#> 10 ctrl 5.14 5.34 4.89
#> # ... with 20 more rows
# Loop setup
var_names <- names(dat)[-1]
loop_out <- list()
# Loop
for (var_i in var_names) {
  dat_i <- dat %>%
    rename(y_i = !!var_i) %>%
    select(group, y_i)
  mod_i <- lm(y_i ~ group, data = dat_i)
  emm_i <- emmeans(mod_i, "group") %>%
    cld(Letters = letters)
  loop_out[[var_i]] <- emm_i %>%
    as_tibble() %>%
    select(group, emmean, .group) %>%
    rename_with(.cols = -group,
                .fn = ~ paste(., var_i, sep = "_"))
}
# Join loop results
loop_out %>% reduce(full_join, by='group')
#> # A tibble: 3 x 7
#> group emmean_y1 .group_y1 emmean_y2 .group_y2 emmean_y3 .group_y3
#> <fct> <dbl> <chr> <dbl> <chr> <dbl> <chr>
#> 1 trt1 4.66 " a " 4.46 " a " 4.64 " a"
#> 2 ctrl 5.03 " ab" 4.89 " ab" 5.30 " a"
#> 3 trt2 5.53 " b" 5.74 " b" 5.39 " a"
Created on 2022-08-08 by the reprex package (v2.0.1)
Check out my summary on the compact letter display for more background.
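If you would rather avoid an explicit for loop, the same table can be built with purrr; here is a minimal sketch, assuming the same dat tibble and packages as above (reformulate() builds the var_i ~ group formula for each column):
# purrr-based sketch of the same workflow (assumes dat and the packages loaded above)
var_names <- names(dat)[-1]
map(var_names, function(var_i) {
  mod_i <- lm(reformulate("group", response = var_i), data = dat)
  emmeans(mod_i, "group") %>%
    cld(Letters = letters) %>%
    as_tibble() %>%
    select(group, emmean, .group) %>%
    rename_with(.cols = -group, .fn = ~ paste(., var_i, sep = "_"))
}) %>%
  reduce(full_join, by = "group")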

Related

How to replace reshape2::melt for an array with tidyr?

I would like to convert a matrix/array (with dimnames) into a data frame. This can be done very easily using reshape2::melt but seems harder with tidyr, and in fact not really possible in the case of an array. Am I missing something? (In particular since reshape2 describes itself as being retired; see https://github.com/hadley/reshape).
For example, given the following matrix
MyScores <- matrix(runif(2 * 3), nrow = 2, ncol = 3,
                   dimnames = list(Month = month.name[1:2], Class = LETTERS[1:3]))
we can turn it into a data frame as follows
reshape2::melt(MyScores, value.name = 'Score') # perfect
or, using tidyr as follows:
as_tibble(MyScores, rownames = 'Month') %>%
  gather(Class, Score, -Month)
In this case reshape2 and tidyr seem similar (although reshape2 is shorter if you are looking for a long-format data frame).
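As an aside, with current tidyr the matrix case can also be written with pivot_longer() instead of gather(); a minimal sketch, assuming MyScores as above:
as_tibble(MyScores, rownames = 'Month') %>%
  pivot_longer(-Month, names_to = 'Class', values_to = 'Score')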
However for arrays, it seems harder. Given
EverybodyScores <- array(runif(2 * 3 * 5), dim = c(2, 3, 5),
                         dimnames = list(Month = month.name[1:2], Class = LETTERS[1:3], StudentID = 1:5))
we can turn it into a data frame as follows:
reshape2::melt(EverybodyScores, value.name = 'Score') # perfect
but using tidyr it's not clear how to do it:
as_tibble(EverybodyScores, rownames = 'Month') # loses the month information, and Class and StudentID still need to be disentangled
Is this a situation where the right solution is to stick to using reshape2?
One way I just found by playing around is to coerce via tbl_cube. I have never really used the class but it seems to do the trick in this instance.
EverybodyScores <- array(
  runif(2 * 3 * 5),
  dim = c(2, 3, 5),
  dimnames = list(Month = month.name[1:2], Class = LETTERS[1:3], StudentID = 1:5)
)
library(tidyverse)
library(cubelyr)
EverybodyScores %>%
  as.tbl_cube(met_name = "Score") %>%
  as_tibble()
#> # A tibble: 30 x 4
#> Month Class StudentID Score
#> <chr> <chr> <int> <dbl>
#> 1 January A 1 0.366
#> 2 February A 1 0.254
#> 3 January B 1 0.441
#> 4 February B 1 0.562
#> 5 January C 1 0.313
#> 6 February C 1 0.192
#> 7 January A 2 0.799
#> 8 February A 2 0.277
#> 9 January B 2 0.631
#> 10 February B 2 0.101
#> # ... with 20 more rows
Created on 2018-08-15 by the reprex package (v0.2.0).
Making a tibble drops the row names, but instead of going straight to a tibble you can turn the array into a base R data.frame and then use tibble::rownames_to_column to make a column for the months. Notice that converting to a data frame creates columns with names like A.1, sticking the class and ID together; you can separate these again with tidyr::separate. Calling as_tibble is optional, only needed if you care about ending up with a tibble, and it can come at any point in the workflow once you have made a column from the row names.
library(tidyverse)
EverybodyScores <- array(runif(2 * 3 * 5), dim = c(2, 3, 5),
                         dimnames = list(Month = month.name[1:2], Class = LETTERS[1:3], StudentID = 1:5))
EverybodyScores %>%
  as.data.frame() %>%
  rownames_to_column("Month") %>%
  gather(key = class_id, value = value, -Month) %>%
  separate(class_id, into = c("Class", "StudentID"), sep = "\\.") %>%
  as_tibble()
#> # A tibble: 30 x 4
#> Month Class StudentID value
#> <chr> <chr> <chr> <dbl>
#> 1 January A 1 0.576
#> 2 February A 1 0.229
#> 3 January B 1 0.930
#> 4 February B 1 0.547
#> 5 January C 1 0.761
#> 6 February C 1 0.468
#> 7 January A 2 0.631
#> 8 February A 2 0.893
#> 9 January B 2 0.638
#> 10 February B 2 0.735
#> # ... with 20 more rows
Created on 2018-08-15 by the reprex package (v0.2.0).
Here is the new tidyr way to do the same:
library(tidyr)
EverybodyScores <- array(
  runif(2 * 3 * 5),
  dim = c(2, 3, 5),
  dimnames = list(Month = month.name[1:2], Class = LETTERS[1:3], StudentID = 1:5)
)
as_tibble(EverybodyScores, rownames = "Month") %>%
  pivot_longer(
    cols = matches("^A|^B|^C"),
    names_sep = "\\.",
    names_to = c("Class", "StudentID")
  )
#> # A tibble: 30 x 4
#> Month Class StudentID value
#> <chr> <chr> <chr> <dbl>
#> 1 January A 1 0.0325
#> 2 January B 1 0.959
#> 3 January C 1 0.593
#> 4 January A 2 0.0702
#> 5 January B 2 0.882
#> 6 January C 2 0.918
#> 7 January A 3 0.459
#> 8 January B 3 0.849
#> 9 January C 3 0.901
#> 10 January A 4 0.328
#> # … with 20 more rows
Created on 2021-02-23 by the reprex package (v1.0.0)

R - Data frame containing an array of comma separated values

I have a data frame with a horrible schema where many of the dimensions have values which are comma-separated arrays. Instead of these arrays, I wish to apply operations to the values, like count, sum, mean, etc.
e.g.
colA ColB
A [0.0,0.0,0.0,2177.0068,0.0,0.0,0.0,0.0,0.0,0.0]
B [0.0,0.0,650.2635,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
C [0.0,0.0,406.3296,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
D \N
E [0.0,0.0,982.2527,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
F [0.0,0.0,0.0,163.6882,0.0,0.0,0.0,0.0,0.0,0.0]
Does anyone have an elegant way of summing/counting/mean of each array?
Thanks
Convert this to long form in which case it is easy to perform aggregations.
1) Assuming DF is shown reproducibly in the Note at the end: remove the square brackets in ColB, separate ColB into rows (converting appropriately), then group by ColA and take the sum and mean of ColB (and possibly use other aggregation functions as well). If you don't want an NA for D, filter out the rows for which ColB does not start with [. See the filter statement in (2).
library(dplyr)
library(tidyr)
DF %>%
  mutate(ColB = gsub("[][]", "", ColB)) %>%
  separate_rows(ColB, sep = "[^-0-9.]", convert = TRUE) %>%
  group_by(ColA) %>%
  summarize(Sum = sum(ColB), Mean = mean(ColB)) %>%
  ungroup
giving:
# A tibble: 6 x 3
ColA Sum Mean
<chr> <dbl> <dbl>
1 A 2177. 218.
2 B 650. 65.0
3 C 406. 40.6
4 D NA NA
5 E 982. 98.2
6 F 164. 16.4
2) Alternatively, use the fact that the ColB strings that start with [ are JSON. In this case we filter out the non-JSON elements of ColB first.
library(dplyr)
library(jsonlite)
library(tidyr)
DF %>%
  filter(substring(ColB, 1, 1) == "[") %>%
  rowwise() %>%
  mutate(ColB = list(fromJSON(ColB))) %>%
  ungroup %>%
  unnest(ColB) %>%
  group_by(ColA) %>%
  summarize(Sum = sum(ColB), Mean = mean(ColB)) %>%
  ungroup
giving:
# A tibble: 5 x 3
ColA Sum Mean
<chr> <dbl> <dbl>
1 A 2177. 218.
2 B 650. 65.0
3 C 406. 40.6
4 E 982. 98.2
5 F 164. 16.4
Note
Lines <- "ColA ColB
A [0.0,0.0,0.0,2177.0068,0.0,0.0,0.0,0.0,0.0,0.0]
B [0.0,0.0,650.2635,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
C [0.0,0.0,406.3296,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
D \\N
E [0.0,0.0,982.2527,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
F [0.0,0.0,0.0,163.6882,0.0,0.0,0.0,0.0,0.0,0.0]"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
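If you prefer to stay in base R, a sketch of the same aggregation (assuming the DF built in the Note above) could look like this:
# parse each ColB string into a numeric vector, then aggregate per row
vals <- lapply(DF$ColB, function(s) {
  if (!startsWith(s, "[")) return(NA_real_)            # non-array rows such as \N become NA
  as.numeric(strsplit(gsub("[][]", "", s), ",")[[1]])  # strip brackets, split on commas
})
data.frame(ColA = DF$ColA,
           Sum  = sapply(vals, sum),
           Mean = sapply(vals, mean))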

Extract the rest of the rows within a group / factor in R

I have the following dataframe:
x <- data.frame("A" = c(rep(4, 3), rep(7, 4), rep(2, 2)),
                "B" = c("Q", "Y", " ", "F", "Q", " ", "Z", "Q", "C"),
                "C" = seq(1:9))
A being my grouping variable / factor
B the status flag I'll start subsetting after its first appearance.
I've managed to subset a fixed amount of rows after by just adding the amount of rows to subset to B's indices where "Q" appears:
Something like this: x[c(which(x$B == "Q"), which(x$B == "Q") + 1), ]
"+1" because I was just interested in the first row after "Q" occurs.
Now I need the rest of the rows within each group after "Q" occurs (including Q's row), and I've been banging my head against my desk trying to figure out how to do this with the dplyr package and grouped tibbles... hence I'm here.
Please help?
------ EDIT -----
This seemed to have worked
x %>% group_by(A) %>% filter(row_number()>=which.max(B=="Q"))
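A slightly more compact variant of the same idea is dplyr::cumany(), which keeps a row once the condition has been TRUE earlier in the group; a minimal sketch, assuming x as defined above:
library(dplyr)
x %>%
  group_by(A) %>%
  filter(cumany(B == "Q")) %>%  # keep Q's row and everything after it within each group
  ungroup()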
Using dplyr and tidyr you may run the following code:
mydf <- data.frame("A" = c(rep(4, 3), rep(7, 4), rep(2, 2)),
                   "B" = c("Q", "Y", " ", "F", "Q", " ", "Z", "Q", "C"),
                   "C" = seq(1:9))
library(tidyverse)
mydf %>%
  group_by(A) %>%
  mutate(selector = case_when(
    B == "Q" ~ 1
  )) %>%
  fill(selector, .direction = "down") %>%
  filter(selector == 1) %>%
  select(-selector) %>%
  arrange(C, A)
and this is the result:
A B C
<dbl> <fct> <int>
1 4.00 Q 1
2 4.00 Y 2
3 4.00 " " 3
4 7.00 Q 5
5 7.00 " " 6
6 7.00 Z 7
7 2.00 Q 8
8 2.00 C 9
Is this what you're looking for?
q_rows <- row.names(subset(x, B == 'Q')) # rows where Q occurs
list_of_frames <- list()
for (i in 1:length(q_rows)) {
  q_start <- as.numeric(q_rows[i])
  q_group <- as.numeric(x[q_start, c('A')])
  group_frame <- subset(x[q_start:nrow(x), ], A == q_group)
  list_of_frames[i] <- list(group_frame)
}
list_of_frames

visualise daily time series data using dygraphs in R

I have a dataframe similar to 'df1'. After converting the value column to a daily time series, I fit it using the Holt-Winters method and predict 120 days into the future. I want to be able to visualise the actual and predicted values using dygraphs.
library(dygraphs)
> head(df1)
timestamp value
1 2017-03-29 534.4571
2 2017-03-30 536.4350
3 2017-03-31 534.6661
4 2017-04-01 535.9185
5 2017-04-02 532.6998
6 2017-04-03 534.8282
convert_to_daily_ts <- function(x){
  x <- x[order(x$timestamp), ]
  x$value_ts <- ts(x$value, frequency = 7)
  return(x)
}
df1 <- convert_to_daily_ts(df1)
hw <- tryCatch(HoltWinters(df1$value_ts), error = function(e) NA)
p <- predict(hw, n.ahead = 120, prediction.interval = TRUE, level=0.95)
act <- df1$value_ts
all <- cbind(act, p)
> class(all)
[1] "mts" "ts" "matrix"
> head(all)
Time Series:
Start = c(1, 1)
End = c(1, 6)
Frequency = 7
actual p.fit p.upr p.lwr
1.000000 534.4571 NA NA NA
1.142857 536.4350 NA NA NA
1.285714 534.6661 NA NA NA
1.428571 535.9185 NA NA NA
1.571429 532.6998 NA NA NA
1.714286 534.8282 NA NA NA
> tail(all)
Time Series:
Start = c(115, 2)
End = c(115, 7)
Frequency = 7
actual p.fit p.upr p.lwr
115.1429 NA 386.2924 581.7568 190.8279
115.2857 NA 384.4614 580.0625 188.8603
115.4286 NA 383.4728 579.2104 187.7352
115.5714 NA 381.3159 577.1900 185.4418
115.7143 NA 383.3130 579.3234 187.3025
115.8571 NA 384.2098 580.3565 188.0631
> str(all)
mts [1:805, 1:4] 534 536 535 536 533 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:4] "actual" "p.fit" "p.upr" "p.lwr"
- attr(*, "tsp")= num [1:3] 1 116 7
- attr(*, "class")= chr [1:3] "mts" "ts" "matrix"
dygraph(all, main = "Daily Predictions") %>%
  dySeries("act", label = "Actual") %>%
  dySeries(c("p.lwr", "p.fit", "p.upr"), label = "Predicted") %>%
  dyOptions(drawGrid = F) %>%
  dyRangeSelector()
I get Error: Unsupported type passed to argument 'data'. But the class of 'all' is as expected for dygraph. Any help to visualise the above data (actual & predicted) would be appreciated. Also, I need the x-axis values to show month-year (e.g. Jun 2017, Jul 2017) instead of 1, 2, 3 and so on. Is that possible?
It looks like the ts object needs start and end dates for dygraph to figure things out. Could you add the appropriate start and end dates when you create the ts object? You'll need to adjust the start and end dates as appropriate. There's a post about that here.
convert_to_daily_ts <- function(x){
  x <- x[order(x$timestamp), ]
  x$value_ts <- ts(x$value, start = c(2017, 3), end = c(2017, 7), frequency = 7)
  return(x)
}
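As an alternative, dygraph() also accepts xts objects, and their date-based index gives you month-year labels on the x-axis automatically. A minimal sketch for plotting just the observed series this way, assuming df1$timestamp can be parsed by as.Date():
library(xts)
library(dygraphs)
actual_xts <- xts(df1$value, order.by = as.Date(df1$timestamp))  # date-indexed series
dygraph(actual_xts, main = "Daily values") %>%
  dyRangeSelector()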

Convert API JSON format to data frame

I'm trying to create a new subset data frame from this site using R.
#load libraries
library(dplyr)
library(jsonlite)
library(tidyr)
#source file
url = "http://api.us.socrata.com/api/catalog/v1 q=nasa&domains=data.nasa.gov&offset=0&limit=500"
metadata <- fromJSON(url)
#Create a new data frame
nasa_api <- data.frame(id = metadata$results$resource$id,
                       title = metadata$results$resource$name,
                       description = metadata$results$resource$description,
                       download_count = metadata$results$resource$download_count,
                       domain_category = metadata$results$classification$domain_category,
                       link = metadata$results$link,
                       permlink = metadata$results$permalink)
I notice that the metadata object contains nested lists. I need to create a new dataset for classifications, which is a data frame nested inside metadata. Ideally I want this new data frame to contain "id" so that I can join these two datasets later.
I think it will be an easy task but I am new to R. Please can you help?
I noticed there is a problem in your URL (v1 q=nasa should be v1?q=nasa). As such, I have illustrated how you might solve this problem with the tidyjson package. It can be a lot of typing, but it gives you a solid tidy data_frame afterwards. I recommend the development version from devtools::install_github('jeremystan/tidyjson'), which has some features not yet on CRAN.
In any case, since you did not articulate which nested arrays you are interested in, I just picked one (classification/domain_metadata).
## devtools::install_github('jeremystan/tidyjson')
library(dplyr)
library(tidyjson)
j <- as.tbl_json("http://api.us.socrata.com/api/catalog/v1?q=nasa&domains=data.nasa.gov&offset=0&limit=500")
base <- j %>% enter_object(results) %>% gather_array()
nasa_api <- base %>% spread_values(
  id = jstring(resource, id),
  title = jstring(resource, name),
  description = jstring(resource, description),
  download_count = jstring(resource, download_count),
  domain_category = jstring(classification, domain_category),
  link = jstring(link),
  permlink = jstring(permlink)
)
print(nasa_api)
#> # A tbl_json: 500 x 9 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id array.index id
#> <chr> <int> <int> <chr>
#> 1 "{\"resource\":{\"d..." 1 1 gvk9-iz74
#> 2 "{\"resource\":{\"d..." 1 2 scmi-np9r
#> 3 "{\"resource\":{\"d..." 1 3 gquh-watm
#> 4 "{\"resource\":{\"d..." 1 4 dtgb-tk9p
#> 5 "{\"resource\":{\"d..." 1 5 j6wr-4xhn
#> 6 "{\"resource\":{\"d..." 1 6 357b-ra7j
#> 7 "{\"resource\":{\"d..." 1 7 e2ud-kf5m
#> 8 "{\"resource\":{\"d..." 1 8 uwnx-gns8
#> 9 "{\"resource\":{\"d..." 1 9 fzmj-dfnj
#> 10 "{\"resource\":{\"d..." 1 10 szzb-kefa
#> # ... with 490 more rows, and 6 more variables: title <chr>,
#> # description <chr>, download_count <chr>, domain_category <chr>,
#> # link <chr>, permlink <chr>
## explore the json_types of one of the objects
base %>% enter_object("classification") %>% .[1, ] %>% gather_object() %>% json_types()
#> # A tbl_json: 5 x 4 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id array.index name
#> <chr> <int> <int> <chr>
#> 1 [] 1 1 categories
#> 2 [] 1 1 tags
#> 3 "\"Management/Ope..." 1 1 domain_category
#> 4 [] 1 1 domain_tags
#> 5 "[{\"value\":\"\",\"k..." 1 1 domain_metadata
#> # ... with 1 more variables: type <fctr>
## example of an ancillary table
base %>%
  spread_values(id = jstring(resource, id)) %>%
  enter_object("classification") %>%
  enter_object("domain_metadata") %>%
  gather_array("domain_metadata_id") %>%
  spread_values(key = jstring(key), value = jstring(value)) %>%
  select(document.id, array.index, id, key, value) %>%
  as_data_frame()
#> # A tibble: 6,343 x 5
#> document.id array.index id key
#> * <int> <int> <chr> <chr>
#> 1 1 1 gvk9-iz74 Common-Core_Contact-Email
#> 2 1 1 gvk9-iz74 Common-Core_License
#> 3 1 1 gvk9-iz74 Common-Core_System-of-Records
#> 4 1 1 gvk9-iz74 Common-Core_Program-Code
#> 5 1 1 gvk9-iz74 Common-Core_Described-By
#> 6 1 1 gvk9-iz74 Common-Core_Public-Access-Level
#> 7 1 1 gvk9-iz74 Common-Core_Temporal-Applicability
#> 8 1 1 gvk9-iz74 Common-Core_Is-Quality-Data
#> 9 1 1 gvk9-iz74 Common-Core_Language
#> 10 1 1 gvk9-iz74 Common-Core_References
#> # ... with 6,333 more rows, and 1 more variables: value <chr>
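If you only need a handful of top-level fields plus id, a plainer jsonlite-only route is to flatten the nested data frames that fromJSON() returns; a minimal sketch, assuming the corrected URL (with the ? restored) and that the flattened column names use dot separators such as resource.id:
library(jsonlite)
url <- "http://api.us.socrata.com/api/catalog/v1?q=nasa&domains=data.nasa.gov&offset=0&limit=500"
results <- fromJSON(url, flatten = TRUE)$results  # nested data frames become dotted columns
nasa_flat <- data.frame(
  id = results$resource.id,
  domain_category = results$classification.domain_category,
  stringsAsFactors = FALSE
)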
