Say I have the mtcars dataset and I want to take three columns and turn them into a JSON array. How do I convert them to a JSON array, and is it possible to pass the result into a PostgreSQL database?
library(jsonlite)
df <- mtcars
attach(mtcars)
json.column <- cbind(mpg,cyl,disp)
Do I use toJSON() ?
mtcars.json <- toJSON(json.column)
https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html
From the jsonlite vignette: an array of objects like [{"name":"Erik", "age":43}, {"name":"Anna", "age":32}] maps to a data frame (simplifyDataFrame).
Keep your data as a data.frame, not a matrix. Use
json.column <- data.frame(mpg,cyl,disp)
toJSON(json.column)
# [{"mpg":21,"cyl":6,"disp":160},{"mpg":21,"cyl":6,"disp":160}, ...
Also, you should avoid attach(). It can cause lots of problems if you forget detach(), and you can often use with() to avoid it:
json.column <- with(mtcars, data.frame(mpg,cyl,disp))
(For starters, never use attach! It's dangerous! Use with instead, typically.)
There are a bunch of ways to do it. Here's how to create the values using dplyr:
library(dplyr)
qq <- rowwise(mtcars) %>%
  mutate(newcol = as.character(jsonlite::toJSON(list(mpg = mpg, cyl = cyl, disp = disp))))
> qq$newcol
[1] "{\"mpg\":[21],\"cyl\":[6],\"disp\":[160]}" "{\"mpg\":[21],\"cyl\":[6],\"disp\":[160]}"
[3] "{\"mpg\":[22.8],\"cyl\":[4],\"disp\":[108]}" "{\"mpg\":[21.4],\"cyl\":[6],\"disp\":[258]}"
...
From there, if your Postgres database is set up with newcol as a JSON type, I think just writing that table as usual should work.
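For example, here is a minimal hedged sketch of that last step, assuming a DBI/RPostgres connection; the connection details, table name mtcars_json, and column name doc are all made up:
library(DBI)
# con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")   # placeholder connection details
dbExecute(con,
          "INSERT INTO mtcars_json (doc) VALUES ($1::json)", # cast the text to json on the Postgres side
          params = list(qq$newcol))                          # binds one INSERT per JSON string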
Related
I want to store an array in the database. I use Ruby, Sequel, and PostgreSQL, but my question is about the right approach, not a particular implementation.
I see three paths:
Use special adapters. They let you save an array like [1,2,3,4] in the database, where it is stored as "{1,2,3,4}". But then I don't know how to cleanly convert that entry back to [1,2,3,4].
Convert the array to JSON and save that. An array like [1,2,3,4] is stored as "[1,2,3,4]", and I can easily convert it back with JSON.parse.
Serialize the array with Marshal and save that. The array [1,2,3,4] is stored as "\x04\b[\ti\x06i\ai\bi\t", but this seems risky because the format can change between Ruby versions (or am I wrong?).
Can anybody tell me the right way?
Sequel supports Postgres array columns natively, assuming you don't need to normalize your schema further.
How do I define an ARRAY column in a Sequel Postgresql migration?
http://sequel.jeremyevans.net/rdoc-plugins/files/lib/sequel/extensions/pg_array_rb.html
IMHO, the clean approach would be an extra table. Ideally, each row in the DB represents a single observation, and having an array or JSON column most probably contradicts that. I suggest looking at that extra table as good design, not as overkill.
Make use of serialize provided by ActiveRecord
class User < ApplicationRecord
serialize :preferences, Array
end
# In a migration, add a string column
class AddSerializePreferencesToUsers < ActiveRecord::Migration[5.1]
def change
add_column :users, :preferences, :string
end
end
# In the Rails console:
u = User.first
u.preferences # => []
u.preferences << 'keep coding'
u.preferences # => ['keep coding']
u.save
u.reload.preferences # ['keep coding']
Hope this helps
I have 1000s of sensors. I need to partition the data (i.e. per sensor per day), then submit each list of data points to an R algorithm. Using Spark, a simplified sample looks like:
//Spark
val rddData = List(
("1:3", List(1,1,456,1,1,2,480,0,1,3,425,0)),
("1:4", List(1,4,437,1,1,5,490,0)),
("1:6", List(1,6,500,0,1,7,515,1,1,8,517,0,1,9,522,0,1,10,525,0)),
("1:11", List(1,11,610,1))
)
case class DataPoint(
key: String,
value: List[Int]) // 4 value pattern, sensorID:seq#, seq#, value, state
I convert this to a Parquet file and save it.
Loading the Parquet file in SparkR is no problem; the schema says:
#SparkR
df <- read.df(sqlContext, filespec, "parquet")
schema(df)
StructType
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value", type = "ArrayType(IntegerType,true)", nullable = TRUE
So in SparkR I have a DataFrame where each record has all of the data I want (df$value). I want to extract that array into something R can consume, then mutate my original DataFrame (df) with a new column holding the resulting array. Logically, something like results = function(df$value). Then I need to get the results (for all rows) back into a SparkR DataFrame for output.
How do I extract an array from the SparkR DataFrame and then mutate the DataFrame with the results?
Let the Spark data frame be df and the R data frame be df_r.
To convert the SparkR df to an R data frame, use:
df_r <- collect(df)
With the R data frame df_r, you can do all the computations you want in R.
Let's say you have the result in the column df_r$result.
Then, to convert back to a SparkR data frame, use:
#this is a new SparkR data frame, df_1
df_1 <- createDataFrame(sqlContext, df_r)
To add the result back to the SparkR data frame df, use:
# this adds df_1$result as a new column df$result
# note that the number of rows should be the same in df and df_1; if not, use a join operation
df$result <- df_1$result
Hope this solves your problem
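Putting those pieces together, a minimal sketch; the function f is only a stand-in for the actual R algorithm:
df_r <- collect(df)                                  # the ArrayType column becomes a list column in R
f <- function(v) mean(unlist(v))                     # placeholder for the real per-sensor computation
df_r$result <- vapply(df_r$value, f, numeric(1))     # apply it to each array
out <- createDataFrame(sqlContext, df_r[, c("key", "result")])  # back into SparkR for output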
I had this problem too. The way I got around it was by adding a row index to the Spark DataFrame and then using explode inside a select statement. Make sure to select the index and then the row you want in your select statement. That will get you a "long" DataFrame. If each of the nested lists in the DataFrame column has the same amount of information in it (for example, if you are exploding a list-column of x,y coordinates), you would expect each row index in the long DataFrame to occur twice.
After doing the above, I typically do a groupBy(index) on the exploded DataFrame, filter where the n() of each index is not equal to the expected number of items in the list, and proceed with additional groupBy, merge, join, filter, etc. operations on the Spark DataFrame.
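A rough sketch of that explode step in SparkR, using the existing key column in place of an added row index (the alias "v" is made up for the example):
long <- select(df, df$key, alias(explode(df$value), "v"))  # one row per array element
counts <- count(groupBy(long, "key"))                      # elements per original row, for sanity checks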
There are some excellent guides on the Urban Institute's GitHub page. Good luck. -nate
I am trying to convert an array produced by the table() function into a normal data frame within a function. I've tried as.data.frame, and is.data.frame() returns TRUE; however, the object is still an array with an extra dim. This causes problems when I merge it with a data frame, and I end up with a single column of the data frame having array dimensions. How can I coerce the object to a simple data frame? The extra dim contains only the rownames. I've tried setting the rownames to NULL to no avail.
A reproducible example will help if you can provide one, but I've had this problem before too. There are a number of solutions to this, none of them perfect.
DF <- as.data.frame(as.matrix(table(whatev))) ## makes a data frame w/ one column
## and rownames = names(table)
DF$V2 <- rownames(DF) ## or as.numeric(rownames(DF)) if you want
rownames(DF) <- NULL ## no need for them anymore
Also
DF <- data.frame(V1 = as.vector(names(table(whatev))), V2 = as.numeric(table(whatev)))
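A minimal reproducible illustration of the first approach, with whatev as a made-up vector:
whatev <- c("a", "b", "b", "c", "c", "c")
DF <- as.data.frame(as.matrix(table(whatev)))  # one column of counts, rownames = levels
DF$V2 <- rownames(DF)
rownames(DF) <- NULL
DF
#   V1 V2
# 1  1  a
# 2  2  b
# 3  3  c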
I am exploring how to compare two data frames in R more efficiently, and I came up with hashing.
My plan is to create a hash for each row of data in two data frames with the same columns, using digest from the digest package; the hash should be the same for any two identical rows of data.
I tried to give a unique hash to each row of data, using the code below:
library(digest)
for (loop.ssi in 1:nrow(ssi.10q3.v1)) {
  ssi.10q3.v1[loop.ssi, "hash"] <- digest(as.character(ssi.10q3.v1[loop.ssi, ]))
  print(paste(loop.ssi, nrow(ssi.10q3.v1), sep = "/"))
  flush.console()
}
But this is very slow.
Is my approach in comparing dataframe correct? If yes, any suggestion for speeding up the code above? Thanks.
UPDATE
I have updated the code as below:
library(plyr)
ssi.10q3.v1[, "uid"] <- 1:nrow(ssi.10q3.v1)
ssi.10q3.v1.hash <- ddply(ssi.10q3.v1,
                          c("uid"),
                          function(df) {
                            df[, "uid"] <- NULL
                            hash <- digest(as.character(df))
                            data.frame(hash = hash)
                          },
                          .progress = "text")
I generated a uid column myself so that each row is treated as unique.
If I understand what you want correctly, digest will work directly with apply:
library(digest)
ssi.10q3.v1.hash <- data.frame(uid = 1:nrow(ssi.10q3.v1), hash = apply(ssi.10q3.v1, 1, digest))
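For instance, a quick check on a tiny made-up data frame that identical rows do hash identically:
d <- data.frame(a = c(1, 2, 1), b = c("x", "y", "x"))
h <- apply(d, 1, digest)
h[1] == h[3]  # TRUE: rows 1 and 3 are identical, so their hashes match
h[1] == h[2]  # FALSE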
I know this answer doesn't match the title of the question, but if you just want to see when rows are different you can do it directly:
rowSums(df2 == df1) == ncol(df1)
Assuming both data.frames have the same dimensions, that will evaluate to FALSE for every row that is not identical. If you need to test rownames as well, that could be managed separately and combined with the test of contents, and similarly for colnames (and attributes, and strict tests on column types).
rowSums(df2 == df1) == ncol(df1) & rownames(df2) == rownames(df1)
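A tiny worked example of the direct comparison (both data frames are made up):
df1 <- data.frame(a = 1:3, b = c("x", "y", "z"))
df2 <- data.frame(a = c(1, 9, 3), b = c("x", "y", "z"))
rowSums(df2 == df1) == ncol(df1)
# [1]  TRUE FALSE  TRUE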
I frequently create nonparametric statistics (loess, kernel densities, etc) on data I pull out of a relational database. To make data management easier I would like to store R output back inside my DB. This is easy with simple data frames of numbers or text, but I have not figured out how to store R objects back in my relational database. So is there a way to store a vector of kernel densities, for example, back into a relational database?
Right now I work around this by saving the R objects to a network drive space so others can load the objects as needed.
Use the serialization feature to turn any R object into a (raw or character) string, then store that string. See help(serialize).
Reverse this for retrieval: get the string, then unserialize() it back into an R object.
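A minimal sketch of that round trip in base R, with a kernel density as the example object (the database read/write itself is omitted):
dens <- density(mtcars$mpg)                          # any R object, e.g. a kernel density
s <- rawToChar(serialize(dens, NULL, ascii = TRUE))  # object -> character string to store in the DB
dens2 <- unserialize(charToRaw(s))                   # string pulled back from the DB -> R object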
An example R variable that's fairly complex:
library(nlme)
model <- lme(uptake ~ conc + Treatment, CO2, random = ~ 1 | Plant / Type)
The best storage database method for R variables depends upon how you want to use it.
I need to do in-database analytics on the values
In this case, you need to break the object down into values that the database can handle natively. This usually means converting it into one or more data frames. The easiest way to do this is to use the broom package.
library(broom)
coefficients_etc <- tidy(model)
model_level_stats <- glance(model)
row_level_stats <- augment(model)
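From there, a hedged sketch of pushing those tidy frames into the database, assuming an open DBI connection con (the table names are placeholders):
library(DBI)
dbWriteTable(con, "model_coefficients", coefficients_etc)
dbWriteTable(con, "model_fit_statistics", model_level_stats)
dbWriteTable(con, "model_row_statistics", row_level_stats)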
I just want storage
In this case you want to serialize your R variables, that is, convert them to a string or a binary blob. There are several methods for this.
My data has to be accessible by programs other than R, and needs to be human-readable
You should store your data in a cross-platform text format; probably JSON or YAML. JSON doesn't support some important concepts like Inf; YAML is more general but the support in R isn't as mature. XML is also possible, but is too verbose to be useful for storing large arrays.
library(RJSONIO)
model_as_json <- toJSON(model)
nchar(model_as_json) # 17916
library(yaml)
# yaml package doesn't yet support conversion of language objects,
# so preprocessing is needed
model2 <- within(
model,
{
call <- as.character(call)
terms <- as.character(terms)
}
)
model_as_yaml <- as.yaml(model2)
nchar(model_as_yaml) # 14493
My data has to be accessible by programs other than R, and doesn't need to be human-readable
You could write your data to an open, cross-platform binary format like HDF5. Currently support for HDF5 files (via rhdf5) is limited, so complex objects are not supported. (You'll probably need to unclass everything.)
library(rhdf5)
h5save(rapply(model2, unclass, how = "replace"), file = "model.h5")
bin_h5 <- readBin("model.h5", "raw", 1e6)
length(bin_h5) # 88291 not very efficient in this case
The feather package lets you save data frames in a format readable by both R and Python. To use it, you would first have to convert the model object into data frames, as described in the broom section earlier in this answer.
library(feather)
library(broom)
write_feather(augment(model), "co2_row.feather") # 5474 bytes
write_feather(tidy(model), "co2_coeff.feather") # 2093 bytes
write_feather(glance(model), "co2_model.feather") # 562 bytes
Another alternative is to save a text version of the variable (see previous section) to a zipped file and store its bytes in the database.
writeLines(model_as_json, "model.txt")
tar("model.tar.bz", "model.txt", compression = "bzip2")
bin_bzip <- readBin("model.tar.bz", "raw", 1e6)
length(bin_bzip) # size of the compressed archive
My data only needs to be accessible by R, and needs to be human-readable
There are two options for turning a variable into a string: serialize and deparse.
p <- function(x)
{
paste0(x, collapse = "\n")
}
serialize needs to be sent to a text connection, and rather than writing to file, you can write to the console and capture it.
model_serialized <- p(capture.output(serialize(model, stdout())))
nchar(model_serialized) # 23830
Use deparse with control = "all" to maximise the reversibility when re-parsing later.
model_deparsed <- p(deparse(model, control = "all"))
nchar(model_deparsed) # 22036
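To reverse the deparse later, parse and evaluate the stored text (a sketch; the packages defining the object's classes, here nlme, need to be loaded first):
library(nlme)
model_reparsed <- eval(parse(text = model_deparsed))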
My data only needs to be accessible by R, and doesn't need to be human-readable
The same sorts of techniques shown in the previous sections can be applied here. You can zip a serialized or deparsed variable and re-read it as a raw vector.
serialize can also write variables in a binary format. In this case, it is most easily used with its wrapper saveRDS.
saveRDS(model, "model.rds")
bin_rds <- readBin("model.rds", "raw", 1e6)
length(bin_rds) # 6350
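Going the other way, once the raw vector has been pulled back out of the database (a sketch):
writeBin(bin_rds, "model_from_db.rds")           # write the retrieved blob back to a file
model_restored <- readRDS("model_from_db.rds")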
For sqlite (and possibly others):
CREATE TABLE data (blob BLOB);
Now in R:
RSQLite::dbGetQuery(db.conn, 'INSERT INTO data VALUES (:blob)', params = list(blob = list(serialize(some_object, NULL))))
Note the list wrapper around the serialized object. The output of serialize is a raw vector. Without list, the INSERT statement would be executed once for each element of that vector. Wrapping it in a list lets RSQLite::dbGetQuery treat it as a single value.
To get the object back from the database:
some_object <- unserialize(RSQLite::dbGetQuery(db.conn, 'SELECT blob FROM data LIMIT 1')$blob[[1]])
What happens here is that you take the field blob (which is a list, since RSQLite doesn't know how many rows the query will return). Since LIMIT 1 ensures only one row is returned, we take it with [[1]], which gives the original raw vector. Then you unserialize the raw vector to get your object back.
Using textConnection / saveRDS / readRDS is perhaps the most versatile and high-level approach:
zz <- textConnection('tempConnection', 'wb')
saveRDS(myData, zz, ascii = TRUE)
TEXT <- paste(textConnectionValue(zz), collapse = '\n')
#write TEXT into SQL
...
closeAllConnections() #if the connection persists, new data will be appended
#reading back:
#1. pull from SQL into queryResult
...
#2. recover the object
recoveredData <- readRDS(textConnection(queryResult$TEXT))
[Tested and working as of 27 Feb 2020]
Description:
Here are the steps to store your model in a POSTGRES table, then query it and load it back. The important part is ascii = TRUE; without it, converting the serialized model to a character string produces errors (the binary serialization contains embedded nulls).
db <- pgsql_connect #connection to your database
serialized_model <- rawToChar(serialize(model_fit, NULL, ascii=TRUE))
insert_query <-'INSERT INTO table (model) VALUES ($1)'
rs <- dbSendQuery(db, insert_query, list(serialized_model))
dbClearResult(rs)
serialized_model <- dbGetQuery(db, "select model from table order by created_at desc limit 1")
model_fit2 <- unserialize(charToRaw(as.character(serialized_model[,c('model')])))
model_fit2