How to use iteration in R to simplify my code for GLM? - loops

I've just started using R and am having some issues when trying to simplify my code. I can't share my real data, but have used an open dataset to ask my question (Breed to represent my IV and Age to represent a DV).
In my dataset, I have all factor variables: my independent variable has 3 levels and my dependent variables all have 2 levels (0/1). Out of a larger dataset, I have six dependent variables and would like to run some descriptive stats and a GLM for each. I have figured out working code for running each DV independently (see below). However, I am currently just copying and pasting this code and replacing the DV each time. I would instead like to create a function that I can apply to simplify my code.
I have attempted to do this using the purrr package (map) but have had no luck. If someone could provide an example of how to do this using the sample data below, it would help me out a lot (though I know the sample data only includes one DV).
install.packages("GLMsData")
library(GLMsData)
data(butterfat)
library(tidyverse)
library(dplyr)
#Descriptive summaries
butterfat %>%
  group_by(Breed, Age) %>%
  summarise(n())
prop.table(table(butterfat$Breed, butterfat$Age), 1)
#Model
Age_model1 <- glm(Age ~ Breed, family=binomial, data=butterfat, na.action = na.omit)
#Get summary, including coefficients and p-values
summary(Age_model1)
#See coefficients, get odds ratio and confidence intervals
Age_model1$coefficients
exp(Age_model1$coefficients)
exp(confint(Age_model1))

See the Many Models chapter of R for Data Science:
library(tidyverse)
iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(lm = map(data,
                  ~ glm(Sepal.Length ~ Sepal.Width, data = .))) %>%
  mutate(tidy = map(lm, broom::tidy)) %>%
  unnest(tidy)
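The same pattern can be turned around to iterate over outcomes instead of groups. Below is a sketch of one way to do that for several binary DVs; dv1/dv2/dv3 and your_data are placeholders for your six 0/1 outcome columns and your data frame, and broom::tidy() with exponentiate/conf.int returns the odds ratios and confidence intervals in one table.
library(tidyverse)
library(broom)

dvs <- c("dv1", "dv2", "dv3")  # placeholders: replace with your six DV names

results <- tibble(dv = dvs) %>%
  mutate(
    model  = map(dv, ~ glm(reformulate("Breed", response = .x),
                           family = binomial, data = your_data,
                           na.action = na.omit)),
    tidied = map(model, ~ tidy(.x, exponentiate = TRUE, conf.int = TRUE))
  ) %>%
  select(dv, tidied) %>%
  unnest(tidied)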

Related

Do I need to use future_map or map to parallelize fable forecast?

I've created a tsibble of ~75K time series in RStudio on my local machine.
I'm looking for ways to speed up the processing time before I migrate the process to a VM with more processing power.
Does fable handle all of the parallel processing in the background, or are there more opportunities to make the code more efficient?
Here is an example of my code:
library(fable)     # attaches fabletools (model(), forecast())
library(tsibble)   # group_by_key()
library(dplyr)     # %>% and ungroup()
library(future)    # plan()
library(tictoc)    # tic()/toc()

plan(multisession, gc = TRUE)
tic()
results <- train %>%
  group_by_key() %>%
  model(my_dcmp_spec) %>%
  forecast(h = "10 weeks") %>%
  ungroup()
toc()
Thank you in advance!
Currently fable will model each of the series in parallel (via model()) according to your plan(). The forecasts are not yet done in parallel, but this is planned for an upcoming release: https://github.com/tidyverts/fabletools/issues/268
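In the meantime, one possible workaround is to parallelise the forecasting step yourself, e.g. with furrr. This is only a rough sketch, assuming that splitting the fitted mable into row chunks works for your model spec; the chunk size is arbitrary and bind_rows() returns a plain tibble rather than a fable.
library(fable)    # also attaches fabletools (model(), forecast())
library(furrr)
library(dplyr)

plan(multisession, gc = TRUE)

fit <- train %>%
  model(my_dcmp_spec)  # model() is already parallelised via plan()

# Forecast chunks of series in separate workers (~1000 series per chunk here).
chunks <- split(fit, (seq_len(nrow(fit)) - 1) %/% 1000)
fc <- future_map(chunks, forecast, h = "10 weeks") %>%
  bind_rows()  # note: the combined result is a tibble, not a fable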

Turn Multiple Columns into a JSON Array in R?

Say I have the mtcars dataset and I want to take three columns and turn them into a JSON array. How do I convert them to a JSON array, and is it possible to pass that into a PostgreSQL database?
library(jsonlite)
df <- mtcars
attach(mtcars)
json.column <- cbind(mpg,cyl,disp)
Do I use toJSON() ?
mtcars.json <- toJSON(json.column)
https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html
Array of objects [{"name":"Erik", "age":43}, {"name":"Anna", "age":32}] Data Frame simplifyDataFrame
Keep your data as a data.frame, not a matrix. Use
json.column <- data.frame(mpg,cyl,disp)
toJSON(json.column)
# [{"mpg":21,"cyl":6,"disp":160},{"mpg":21,"cyl":6,"disp":160}, ...
Also, you should avoid the use of attach(). It can cause lots of problems if you forget detach(). Plus, you can often use with() to avoid it:
json.column <- with(mtcars, data.frame(mpg,cyl,disp))
(For starters, never use attach! It's dangerous! Use with instead, typically.)
There are a bunch of ways to do it. Here's how to create the values using dplyr:
library(dplyr)
qq <- rowwise(mtcars) %>%
  mutate(newcol = as.character(jsonlite::toJSON(list(mpg = mpg, cyl = cyl, disp = disp))))
> qq$newcol
[1] "{\"mpg\":[21],\"cyl\":[6],\"disp\":[160]}" "{\"mpg\":[21],\"cyl\":[6],\"disp\":[160]}"
[3] "{\"mpg\":[22.8],\"cyl\":[4],\"disp\":[108]}" "{\"mpg\":[21.4],\"cyl\":[6],\"disp\":[258]}"
...
From there, if your Postgres database is set up with newcol as a JSON type, I think just writing that table as usual should work.
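A sketch of that last step, assuming the DBI and RPostgres packages and a reachable database; the connection details, the cars_json table name, and the jsonb cast are all illustrative.
library(DBI)
library(dplyr)
library(jsonlite)

con <- dbConnect(RPostgres::Postgres(), dbname = "mydb", host = "localhost")  # placeholder connection

cars_json <- mtcars %>%
  rowwise() %>%
  mutate(newcol = as.character(toJSON(list(mpg = mpg, cyl = cyl, disp = disp)))) %>%
  ungroup()

# Write the table; newcol arrives as text, then cast it to jsonb server-side.
dbWriteTable(con, "cars_json", cars_json, overwrite = TRUE)
dbExecute(con, "ALTER TABLE cars_json ALTER COLUMN newcol TYPE jsonb USING newcol::jsonb")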

SparkR - extracting dataframe's array<int> for an R function

I have 1000s of sensors. I need to partition the data (i.e. per sensor, per day), then submit each list of data points to an R algorithm. Using Spark, a simplified sample looks like:
// Spark
val rddData = List(
  ("1:3", List(1,1,456,1,1,2,480,0,1,3,425,0)),
  ("1:4", List(1,4,437,1,1,5,490,0)),
  ("1:6", List(1,6,500,0,1,7,515,1,1,8,517,0,1,9,522,0,1,10,525,0)),
  ("1:11", List(1,11,610,1))
)

case class DataPoint(
  key: String,
  value: List[Int]) // 4-value pattern: sensorID:seq#, seq#, value, state
I convert the data to a Parquet file and save it. Loading the Parquet file in SparkR is no problem; the schema says:
#SparkR
df <- read.df(sqlContext, filespec, "parquet")
schema(df)
StructType
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value", type = "ArrayType(IntegerType,true)", nullable = TRUE
So in SparkR, I have a dataframe where each record has all of the data I want (df$value). I want to extract that array into something R can consume then mutate my original dataframe(df) with a new column holding the resultant array. Logically something like results = function(df$value). Then I need to get results (for all rows) back into a SparkR dataframe for output.
How do I extract an array from the SparkR dataframe and then mutate with the results?
Let the Spark data frame be df and the R data frame be df_r.
To convert the SparkR df to an R df, use:
df_r <- collect(df)
With the R data frame df_r, you can do all the computations you want in R.
Let's say you have the result in the column df_r$result.
Then, to convert back to a SparkR data frame, use:
# this is a new SparkR data frame, df_1
df_1 <- createDataFrame(sqlContext, df_r)
To add the result back to the SparkR data frame df, use:
# this adds df_1$result to a new column df$result
# note that the number of rows should be the same in df and df_1; if not, use a join operation
df$result <- df_1$result
Hope this solves your problem.
I had this problem too. The way I got around it was by adding a row index into the spark DataFrame and then using explode inside a select statement. Make sure to select the index and then the row you want in your select statement. That will get you a "long" dataframe. If each of the nested lists in the DataFrame column has the same amount of information in it (for example if you are exploding a list-column of x,y coordinates), you would expect each row index in the long DataFrame to occur twice.
After doing the above, I typically do a groupBy(index) on the exploded DataFrame, filter where the n() of each index is not equal to the expected number of items in the list and proceed with additional groupBy, merge, join, filter, etc. operations on the Spark DataFrame.
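A rough sketch of that index-plus-explode approach is below. The function names follow the SparkR column API as I understand it (explode(), alias(), monotonically_increasing_id() need a reasonably recent Spark), so treat it as an outline rather than tested code.
library(SparkR)

df <- read.df(sqlContext, filespec, "parquet")

# Tag each original row, then turn the array column into one row per element.
df_idx  <- withColumn(df, "idx", monotonically_increasing_id())
df_long <- select(df_idx, df_idx$idx, df_idx$key,
                  alias(explode(df_idx$value), "value_element"))

head(df_long)  # "long" DataFrame: groupBy/filter/join from here as described above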
There are some excellent guides on the Urban Institute's GitHub page. Good luck. -nate

create hash value for each row of data in dataframe in R

I am exploring how to compare two data frames in R more efficiently, and I came up with hashing.
My plan is to create a hash for each row of data in two data frames with the same columns, using digest() from the digest package; the hash should be the same for any two identical rows of data.
I tried to give a unique hash to each row of data, using the code below:
library(digest)

for (loop.ssi in 1:nrow(ssi.10q3.v1)) {
  ssi.10q3.v1[loop.ssi, "hash"] <- digest(as.character(ssi.10q3.v1[loop.ssi, ]))
  print(paste(loop.ssi, nrow(ssi.10q3.v1), sep = "/"))
  flush.console()
}
But this is very slow.
Is my approach in comparing dataframe correct? If yes, any suggestion for speeding up the code above? Thanks.
UPDATE
I have updated the code as below:
ssi.10q3.v1[,"uid"] <- 1:nrow(ssi.10q3.v1)
ssi.10q3.v1.hash <- ddply(ssi.10q3.v1,
c("uid"),
function(df)
{df[,"uid"]<- NULL
hash <- digest(as.character(df))
data.frame(hash=hash)
},
.progress="text")
I generated the uid column myself to serve as a unique identifier.
If I understand what you want correctly, digest will work directly with apply:
library(digest)
ssi.10q3.v1.hash <- data.frame(uid = 1:nrow(ssi.10q3.v1), hash = apply(ssi.10q3.v1, 1, digest))
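A quick usage sketch of how those row hashes could then drive the comparison; df1 and df2 are placeholders for two data frames with the same columns and row order.
library(digest)
hash1 <- apply(df1, 1, digest)  # one hash per row of df1
hash2 <- apply(df2, 1, digest)  # one hash per row of df2
which(hash1 != hash2)           # indices of rows that differ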
I know this answer doesn't match the title of the question, but if you just want to see when rows are different you can do it directly:
rowSums(df2 == df1) == ncol(df1)
Assuming both data.frames have the same dimensions, that will evaluate to FALSE for every row that is not identical. If you need to test rownames as well, that could be managed separately and combined with the test of contents, and similarly for colnames (and attributes, and strict tests on column types).
rowSums(df2 == df1) == ncol(df1) & rownames(df2) == rownames(df1)

Storing R Objects in a relational database

I frequently create nonparametric statistics (loess, kernel densities, etc) on data I pull out of a relational database. To make data management easier I would like to store R output back inside my DB. This is easy with simple data frames of numbers or text, but I have not figured out how to store R objects back in my relational database. So is there a way to store a vector of kernel densities, for example, back into a relational database?
Right now I work around this by saving the R objects to a network drive space so others can load the objects as needed.
Use the serialization feature to turn any R object into a (raw or character) string, then store that string. See help(serialize).
Reverse this for retrieval: get the string, then unserialize() it into an R object.
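For example, a minimal sketch using a base64-encoded string (base64enc is my own choice here, not a requirement; a raw/BLOB column works just as well, as the answers below show):
library(base64enc)

obj <- density(rnorm(100))                 # any R object
txt <- base64encode(serialize(obj, NULL))  # character string, safe for a text column

# ... store txt in the database, read it back later as txt_back ...
txt_back <- txt
obj_back <- unserialize(base64decode(txt_back))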
An example R variable, that's fairly complex:
library(nlme)
model <- lme(uptake ~ conc + Treatment, CO2, random = ~ 1 | Plant / Type)
The best storage database method for R variables depends upon how you want to use it.
I need to do in-database analytics on the values
In this case, you need to break the object down into values that the database can handle natively. This usually means converting it into one or more data frames. The easiest way to do this is to use the broom package.
library(broom)
coefficients_etc <- tidy(model)
model_level_stats <- glance(model)
row_level_stats <- augment(model)
I just want storage
In this case you want to serialize your R variables, that is, convert them to a string or a binary blob. There are several methods for this.
My data has to be accessible by programs other than R, and needs to be human-readable
You should store your data in a cross-platform text format; probably JSON or YAML. JSON doesn't support some important concepts like Inf; YAML is more general but the support in R isn't as mature. XML is also possible, but is too verbose to be useful for storing large arrays.
library(RJSONIO)
model_as_json <- toJSON(model)
nchar(model_as_json) # 17916
library(yaml)
# yaml package doesn't yet support conversion of language objects,
# so preprocessing is needed
model2 <- within(
  model,
  {
    call  <- as.character(call)
    terms <- as.character(terms)
  }
)
model_as_yaml <- as.yaml(model2)
nchar(model_as_yaml) # 14493
My data has to be accessible by programs other than R, and doesn't need to be human-readable
You could write your data to an open, cross-platform binary format like HDF5. Currently support for HDF5 files (via rhdf5) is limited, so complex objects are not supported. (You'll probably need to unclass everything.)
library(rhdf5)
h5save(rapply(model2, unclass, how = "replace"), file = "model.h5")
bin_h5 <- readBin("model.h5", "raw", 1e6)
length(bin_h5) # 88291 not very efficient in this case
The feather package lets you save data frames in a format readable by both R and Python. To use this, you would first have to convert the model object into data frames, as described in the broom section earlier in the answer.
library(feather)
library(broom)
write_feather(augment(model), "co2_row.feather") # 5474 bytes
write_feather(tidy(model), "co2_coeff.feather") # 2093 bytes
write_feather(glance(model), "co2_model.feather") # 562 bytes
Another alternative is to save a text version of the variable (see previous section) to a zipped file and store its bytes in the database.
writeLines(model_as_json, "model.txt")  # write the JSON to a file for tar() below
tar("model.tar.bz", "model.txt", compression = "bzip2")
bin_bzip <- readBin("model.tar.bz", "raw", 1e6)
length(bin_bzip) # only 42 bytes!
My data only needs to be accessible by R, and needs to be human-readable
There are two options for turning a variable into a string: serialize and deparse.
p <- function(x)
{
  paste0(x, collapse = "\n")
}
serialize needs to be sent to a text connection, and rather than writing to file, you can write to the console and capture it.
model_serialized <- p(capture.output(serialize(model, stdout())))
nchar(model_serialized) # 23830
Use deparse with control = "all" to maximise the reversibility when re-parsing later.
model_deparsed <- p(deparse(model, control = "all"))
nchar(model_deparsed) # 22036
My data only needs to be accessible by R, and doesn't need to be human-readable
The same sorts of techniques shown in the previous sections can be applied here. You can zip a serialized or deparsed variable and re-read it as a raw vector.
serialize can also write variables in a binary format. In this case, it is most easily used with its wrapper saveRDS.
saveRDS(model, "model.rds")
bin_rds <- readBin("model.rds", "raw", 1e6)
length(bin_rds) # 6350
For sqlite (and possibly others):
CREATE TABLE data (blob BLOB);
Now in R:
RSQLite::dbGetQuery(db.conn, 'INSERT INTO data VALUES (:blob)', params = list(blob = list(serialize(some_object, NULL))))
Note the list wrapper around some_object. The output of serialize is a raw vector. Without list, the INSERT statement would be executed for each vector element. Wrapping it in a list allows RSQLite::dbGetQuery to see it as one element.
To get the object back from the database:
some_object <- unserialize(RSQLite::dbGetQuery(db.conn, 'SELECT blob FROM data LIMIT 1')$blob[[1]])
What happens here is you take the field blob (which is a list since RSQLite doesn't know how many rows will be returned by the query). Since LIMIT 1 assures only 1 row is returned, we take it with [[1]], which is the original raw vector. Then you need to unserialize the raw vector to get your object.
Using textConnection / saveRDS / readRDS is perhaps the most versatile and high-level approach:
zz <- textConnection('tempConnection', 'wb')
saveRDS(myData, zz, ascii = TRUE)
TEXT <- paste(textConnectionValue(zz), collapse = '\n')
# write TEXT into SQL
...
closeAllConnections()  # if the connection persists, new data will be appended

# reading back:
# 1. pull from SQL into queryResult
...
# 2. recover the object
recoveredData <- readRDS(textConnection(queryResult$TEXT))
[100% WORKING - 27 Feb 2020]
Description:
Here are the steps if you want to store your model in a Postgres table, then query it and load it back. An important part is the ascii = TRUE argument; without it, serializing would produce errors.
library(DBI)

db <- pgsql_connect  # connection to your database

serialized_model <- rawToChar(serialize(model_fit, NULL, ascii = TRUE))
insert_query <- 'INSERT INTO table (model) VALUES ($1)'
rs <- dbSendQuery(db, insert_query, list(serialized_model))
dbClearResult(rs)

serialized_model <- dbGetQuery(db, "select model from table order by created_at desc limit 1")
model_fit2 <- unserialize(charToRaw(as.character(serialized_model[, c('model')])))
model_fit2
