How to save an R list object to a database?

Suppose I have a list of R objects which are themselves lists. Each list has a defined structure: data, a model which fits the data, and some attributes for identifying the data. One example would be time series of certain economic indicators in particular countries. So my list object has the following elements:
data - the historical time series for economic indicator
country - the name of the country, USA for example
name - the indicator name, GDP for example
model - the ARIMA orders found by auto.arima in a suitable format; this again may be a list.
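For concreteness, one element of such a list might be built like this (a small sketch: the series is simulated and the fitted model is only illustrative):

library(forecast)  # provides auto.arima

series <- ts(cumsum(rnorm(60)), start = c(2005, 1), frequency = 4)  # stand-in indicator

obj <- list(
  data    = series,             # the historical time series
  country = "USA",              # identifying attributes
  name    = "GDP",
  model   = auto.arima(series)  # fitted model, itself a list-like object
)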
This is just an example. As I said, suppose I have a number of such objects combined into a list. I would like to save it in some suitable format. The obvious solution is simply to use save, but this does not scale very well for a large number of objects. For example, if I only wanted to inspect a subset of the objects, I would still need to load all of them into memory.
If my data were a data.frame I could save it to a database. If I wanted to work with a particular subset of the data I would use SELECT and rely on the database to deliver the required subset. SQLite served me well in this regard. Is it possible to replicate this for my described list object with some fancy database like MongoDB? Or should I simply think about how to convert my list into several related tables?
My motivation for this is to be able to easily generate various reports on the fitted models. I can write a bunch of functions which produce a report on a given object and then just use lapply on my list of objects. Ideally I would like to parallelise this process, but that is another problem.

I think I explained the basics of this somewhere once before---the gist of it is that
R has complete serialization and deserialization support built in, so you can in fact take any existing R object and turn it into either a binary or textual serialization. My digest package uses that to turn the serialization into a hash using different hash functions.
R has all the db connectivity you need.
Now, what a suitable format and db schema is ... will depend on your specifics. But there is (as usual) nothing in R stopping you :)
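As a minimal sketch of combining the two, assuming the DBI and RSQLite packages (the table and column names below are made up): the identifying attributes go into ordinary columns, and the whole R object is stored as a serialized BLOB.

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "models.sqlite")
dbExecute(con, "CREATE TABLE IF NOT EXISTS models
                (country TEXT, indicator TEXT, object BLOB)")

store_object <- function(con, x) {
  # serialize() turns the whole list (data, model, attributes) into a raw vector
  dbExecute(con,
    "INSERT INTO models (country, indicator, object) VALUES (:c, :i, :o)",
    params = list(c = x$country, i = x$name, o = list(serialize(x, NULL))))
}

load_objects <- function(con, country) {
  rows <- dbGetQuery(con,
    "SELECT object FROM models WHERE country = :c",
    params = list(c = country))
  lapply(rows$object, unserialize)  # back to the original R lists
}

Because the attributes live in plain columns, SELECT can narrow down the set of objects before anything is unserialized.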

This question has been inactive for a long time. Since I had a similar concern recently, I want to add the pieces of information that I've found out. I recognise these three demands in the question:
to have the data stored in a suitable structure
scalability in terms of size and access time
the possibility to efficiently read only subsets of the data
Besides the option to use a relational database, one can also use the HDF5 file format, which is designed to store a large number of possibly large objects. The choice depends on the type of data and the intended way of accessing it.
Relational databases should be favoured if:
the atomic data items are small-sized
the different data items possess the same structure
it cannot be anticipated in which subsets the data will be read out
convenient transfer of the data from one computer to another is not an issue, or the computers where the data is needed have access to the database.
The HDF5 format should be preferred if:
the atomic data items are themselves large objects (e.g. matrices)
the data items are heterogeneous and it is not possible to combine them into a table-like representation
most of the time the data is read out in groups which are known in advance
moving the data from one computer to another should not require much effort
Furthermore, one can distinguish between relational and hierarchical relationships, where the latter is contained in the former. Within an HDF5 file, the information chunks can be arranged in a hierarchical way, e.g.:
/Germany/GDP/model/...
/Germany/GNP/data
/Austria/GNP/model/...
/Austria/GDP/data
The rhdf5 package for handling HDF5 files is available on Bioconductor. General information on the HDF5 format is available from the HDF Group.
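A small sketch of that hierarchy with rhdf5 (file, group, and object names are only illustrative, and gdp_series stands in for a real series):

library(rhdf5)

h5createFile("indicators.h5")
h5createGroup("indicators.h5", "Germany")
h5createGroup("indicators.h5", "Germany/GDP")

# store the raw series; a fitted model could be serialized to a raw vector and stored alongside it
h5write(as.numeric(gdp_series), "indicators.h5", "Germany/GDP/data")

# later, read back only the piece that is needed
gdp <- h5read("indicators.h5", "Germany/GDP/data")
h5ls("indicators.h5")  # lists the group/dataset hierarchy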

Not sure if it is the same, but I had some good experience with time series objects with:
str()
Maybe you can look into that.

Related

How to store datapoints in the magnitudes of 1+ trillion?

So, I have astronomical spectroscopy data in the following format:
{
  "molecule": "CO2",
  "blahblah": "...",
  (5 more simple fields),
  "arrayofvalues": [lengths can go up to 2 million]
}
Of this data, I have 600,000 files, so that means there are about 1 trillion individual datapoints that I want to search through and do computations with.
So can someone please direct me to a resource, maybe on big data or BigQuery, on how I can efficiently look up this data for computations and graphing? For example, I want to search for certain molecules under certain conditions and see what data they show, and so on.
I want to make a website where people can pick some variables, and a value range, and get graphical or textual data.
Now I tried to put some of this data into PostgreSQL, but when I do a GET request (even having stored just 5 files) it crashes Postman, because it's too much data.
Without knowing more details, you can take advantage of the data modeling options available in BigQuery, such as:
nested data
arrays and structs
partitioned tables
clustering
Take a look at the data types: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
And also the partitioning and clustering techniques.
https://towardsdatascience.com/how-to-use-partitions-and-clusters-in-bigquery-using-sql-ccf84c89dd65?gi=cd1bc7f704cc
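As a hedged sketch of how those features could fit this data (dataset, table, and column names are invented; the DDL could be run in the BigQuery console or from R with the bigrquery package):

library(bigrquery)

ddl <- "
CREATE TABLE spectra.observations (
  molecule   STRING,
  observed   DATE,
  conditions STRUCT<temperature FLOAT64, pressure FLOAT64>,
  intensity  ARRAY<FLOAT64>   -- the long array of values, one row per file
)
PARTITION BY observed
CLUSTER BY molecule
"

bq_project_query("my-gcp-project", ddl)  # project id is a placeholder

Partitioning on the date and clustering on the molecule mean that a query filtering on a particular molecule only scans the relevant blocks, which keeps lookups for graphing cheap.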

How to store user defined data structures?

I'm building a mobile application that records information about items and then outputs an automatically generated report.
Each item may be of various types, and each type requires different information to be recorded. The user needs to be able to specify what is to be stored for each type.
Is there a "best" way to store this type of information in a relational database?
My current plan is to have a Type table that maps Types to Attributes that need to be recorded for that Type. Does this sound sensible? I imagine that it may get messy when I come to produce reports from this data.
I guess I need a way of generalising the information that needs to be recorded?
I think I just need some pointers in the right direction.
Thanks!
Only a suggestion, might not be an answer: use JSON and go for a NoSQL database. Today it is often more convenient to operate on and play around with data in a format that is not strictly relational.
That way you can define a model (or several), or create your own data structure as mentioned, and store it easily as a collection of documents of that model. NoSQL also allows structure changes without obliging you to define an entire "column" for all "rows" present ;)
Check out this explanation of MongoDB and NoSQL. There is also a beautiful post that I love about data modeling in NoSQL.
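A small sketch of that idea, using R's mongolite package purely for illustration (collection name, item types, and attributes are all made up):

library(mongolite)

items <- mongo(collection = "items", db = "inspections")

# two items of different types, each carrying only the attributes its type needs
items$insert('{"type": "bicycle", "attributes": {"frame_size_cm": 54, "colour": "red"}}')
items$insert('{"type": "laptop",  "attributes": {"ram_gb": 16, "screen_inches": 14}}')

# pull back every item of one type when producing a report
items$find('{"type": "bicycle"}')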

Relational database versus R/Python data frames

I was exposed to the world of tables and data structures in R before RDBMS systems and other database systems. It is quite elegant in R/Python to create tables and lists from structured data (.csv or other formats) and then do data manipulations programmatically.
Last year, I attended a course in database management and learnt all about structured and unstructured databases. I also noticed that it is the norm to feed data from multiple sources into databases rather than use it directly in R (for convenience and discipline?).
For research purposes, R seems to suffice for joining, appending, or even complicated data manipulations.
The question that keeps arising is:
When should I use R directly, with commands such as read.csv, and when should I put the data into a database and query tables from R through an SQL interface?
For instance, if I have multi-source data, like (a) person-level information (age, gender, smoking habits), (b) outcome variables (such as surveys taken by them in real time), (c) covariate information (environment characteristics), (d) treatment input (occurrence of an event that modifies the outcome - survey response), (e) time and space information of participants taking the survey
How should I approach the data collection and processing in this case? There may be standard industry procedures, but I put this question forward here to understand the range of feasible and optimal approaches that individuals and small groups of researchers can adopt.
What you're describing when you say "that it is the norm to feed data from multiple sources of data into databases" sounds more specifically like a data warehouse. Databases are used for many reasons, and in plenty of situations they will hold data from one source - for instance, a database used as the data store of a transactional system will often only hold the data needed to run that system, and the data produced by that system.
The process you're describing is commonly called Extract, Transform, Load (ETL), and you might find looking up information about ETL and data warehousing helpful if you decide to go in the direction of combining your data prior to working with it in R.
I can't tell you which you should choose, or the optimal way of accomplishing it, because it will vary in different situations and might even come down to opinion. What I can tell you are some of the reasons why people create data warehouses, and you can decide for yourself whether it might be useful in your situation:
A data warehouse can provide a central location to hold combined data. This means that people do not need to combine the data themselves each time they need to use that specific combination of data. Unlike something like a simple one-off report or extract of combined data, it should provide some flexibility, letting people obtain the combined set of data they need for a specific task. Very often, in enterprise situations, multiple things are then run on top of the same combined set of data - multidimensional data analysis tools (cubes), reports, data mining, etc.
Some of the benefits of this might include:
Individuals saving time when they otherwise would have needed to combine the data themselves.
If the data which needs to be combined is complex, or some people do not have proficiency at handling that part of the process, then there is less risk of data being combined incorrectly; you can be sure that different pieces of work have used the same source data.
If the data suffers from data quality issues, you resolve this once in the data warehouse, rather than working around it or resolving it repeatedly in code.
If new data is constantly being received, collection and integration of this into the data warehouse can be carried out automatically.
Like I say, I can't decide for you whether this is a useful direction or not - as with any decision of this kind you'll need to weigh up the costs of implementing such a solution against the benefits, and both will be specific to your individual case. But hopefully this answers your core question of why someone might choose to do this work in a database instead of in their code, and gives you a starting point to work from.
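For concreteness, the two access patterns the question contrasts might look roughly like this in R (file, table, and column names below are invented; SQLite stands in for whatever database or warehouse is used):

# working directly from files: R does the combining in memory
persons  <- read.csv("persons.csv")
surveys  <- read.csv("surveys.csv")
combined <- merge(persons, surveys, by = "person_id")

# working against a database: the database combines, R receives only the result
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "study.sqlite")
combined <- dbGetQuery(con, "
  SELECT p.age, p.gender, s.response, s.taken_at
  FROM persons p
  JOIN surveys s ON s.person_id = p.person_id")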

How to save and load a giant hash-table to-n-fro from disk?

I am trying to write a search-engine for a large collection, for learning purposes. I started with my own intuitions. Then I researched and am finally arriving at a working model.
I am constructing a giant hash-table to hold all the terms in my collection. It is very expensive to construct this from the collection. Once I have computed the table I want to save it to disk, so that whenever I want to access this hash-table in my program later, I can load it again from disk.
Is there any standard way of doing it or do I have to invent my own file-format and hacks to do this?
Note: The hash-table is only for storing all term occurrences; I am planning to store the main ranking data in a postings file and have its pointer set in the corresponding term of the hash-table.
I am working in C.
BDB (Berkeley DB) is a library for efficiently managing flat-file databases. In particular, a hash table format is supported. B-Trees are also available, in case ordered access is required.

Evaluating HDF5: What limitations/features does HDF5 provide for modelling data?

We are evaluating technologies that we'll use to store data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20 MB per TU.
After reading the following SO answer it made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide I understand that HDF5 is similar to XML or a DB, in that information is associated with a tag/column, and so a tool built to read an older structure will just ignore the fields it is not concerned with? Is my understanding of this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach that object. HDF5 groups are analogous to folders/directories, except that folders/directories are hierarchical, i.e. a unique path describes each one's location (in filesystems without hard links, at least), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute (or anywhere else as a string; you could have lookup tables galore if you wanted).
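As an illustration of the path-as-attribute idea (sketched with R's rhdf5 package only because it is convenient here; the object and attribute names are made up):

library(rhdf5)

h5createFile("code.h5")
h5createGroup("code.h5", "scopes")
h5write(1:3, "code.h5", "scopes/child_scope")

# store the path of a related object as a string attribute, i.e. a "pointer by path"
fid <- H5Fopen("code.h5")
did <- H5Dopen(fid, "scopes/child_scope")
h5writeAttribute("/scopes/parent_scope", did, "parent_path")
H5Dclose(did)
H5Fclose(fid)

h5readAttributes("code.h5", "scopes/child_scope")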
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write once, read many times model and the format seems to handle this well. I know of a project that used to write both to an Oracle database and to HDF5. Eventually they removed the Oracle output since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited for the task. Based on that one data point, an RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.
