How to store datapoints on the order of 1+ trillion? - database

So, I have astronomical spectroscopy data in the following format:
{
  "molecule": "CO2",
  "blahblah": "...",
  ... 5 more simple fields ...
  "arrayofvalues": [lengths can go up to 2 million]
}
Of this data I have 600,000 files, which means there are roughly 1 trillion individual datapoints that I want to search through and do computations with.
So can someone please direct me to a resource, maybe on big data or BigQuery, on how I can efficiently look up this data for computations and graphing? I want to, for example, search for certain molecules under certain conditions and see what data they show.
I want to make a website where people can pick some variables, and a value range, and get graphical or textual data.
Now I tried to put some of this data into PostgreSQL, but when I do a GET request (even with just 5 files stored) it crashes Postman, because it returns too much data.

Without knowing more details, you can take advantage of the data modeling options available in BigQuery, such as:
nested data
arrays and structs
partitioned tables
clustering
Take a look at the data types: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
And also at the partitioning and clustering techniques; a minimal sketch using both follows the link below.
https://towardsdatascience.com/how-to-use-partitions-and-clusters-in-bigquery-using-sql-ccf84c89dd65?gi=cd1bc7f704cc
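For illustration, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, table, and date field names are invented for the example; only "molecule" and "arrayofvalues" come from the question.

from google.cloud import bigquery

client = bigquery.Client()  # picks up your default GCP project and credentials

schema = [
    bigquery.SchemaField("molecule", "STRING"),
    bigquery.SchemaField("observed_date", "DATE"),  # assumed field, used for partitioning
    # one row per file; the up-to-2-million measurements live in a REPEATED (array) column
    bigquery.SchemaField("arrayofvalues", "FLOAT", mode="REPEATED"),
]

table = bigquery.Table("your-project.spectra.readings", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="observed_date")  # prune scans by date
table.clustering_fields = ["molecule"]  # co-locate rows for the same molecule
client.create_table(table)

# searching stays server-side; UNNEST flattens the array so you can range-filter values
sql = """
SELECT molecule, v
FROM `your-project.spectra.readings`, UNNEST(arrayofvalues) AS v
WHERE molecule = 'CO2' AND v BETWEEN 0.1 AND 0.9
"""
for row in client.query(sql).result():
    print(row.molecule, row.v)

With this layout, a filter on molecule touches only the matching clustered blocks and a date filter touches only the matching partitions, so the website's queries never pull a trillion datapoints to the client.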

Related

How to save R list object to a database?

Suppose I have a list of R objects which are themselves lists. Each list has a defined structure: data, a model which fits the data, and some attributes for identifying the data. One example would be time series of certain economic indicators in particular countries. So my list object has the following elements:
data - the historical time series for economic indicator
country - the name of the country, USA for example
name - the indicator name, GDP for example
model - the ARIMA orders found by auto.arima, in a suitable format; this again may be a list.
This is just an example. As I said, suppose I have a number of such objects combined into a list. I would like to save it in some suitable format. The obvious solution is simply to use save, but this does not scale very well for a large number of objects. For example, if I only wanted to inspect a subset of objects, I would need to load all of them into memory.
If my data were a data.frame, I could save it to a database. If I wanted to work with a particular subset of the data, I would use SELECT and rely on the database to deliver the required subset. SQLite served me well in this regard. Is it possible to replicate this for my described list object with some fancy database like MongoDB? Or should I simply think about how to convert my list into several related tables?
My motivation for this is to be able to easily generate various reports on the fitted models. I can write a bunch of functions which produce a report on a given object and then just use lapply on my list of objects. Ideally I would like to parallelise this process, but that is another problem.
I think I explained the basics of this somewhere once before; the gist of it is that
R has complete serialization and deserialization support built in, so you can in fact take any existing R object and turn it into either a binary or textual serialization. My digest package uses that to turn the serialization into a hash using different functions.
R has all the db connectivity you need.
Now, what a suitable format and db schema is ... will depend on your specifics. But there is (as usual) nothing in R stopping you :)
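For what it's worth, here is an analogous sketch in Python rather than R, with pickle playing the role of serialize(), hashlib the role of digest, and sqlite3 standing in for the DB connectivity; the table and column names are invented:

import hashlib, pickle, sqlite3

con = sqlite3.connect("models.db")
con.execute("""CREATE TABLE IF NOT EXISTS objects (
    country TEXT, name TEXT, hash TEXT, payload BLOB,
    PRIMARY KEY (country, name))""")

def save_object(country, name, obj):
    blob = pickle.dumps(obj)              # cf. R's serialize()
    h = hashlib.sha256(blob).hexdigest()  # cf. digest::digest()
    con.execute("INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?)",
                (country, name, h, blob))
    con.commit()

def load_object(country, name):
    row = con.execute("SELECT payload FROM objects WHERE country = ? AND name = ?",
                      (country, name)).fetchone()
    return pickle.loads(row[0]) if row else None

save_object("USA", "GDP", {"data": [1.0, 2.0], "model": ("ARIMA", 1, 0, 1)})
print(load_object("USA", "GDP"))

The identifying attributes (country, name) become indexed columns, so you can SELECT just the subset you want to inspect without deserializing everything else.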
This question has been inactive for a long time. Since I had a similar concern recently, I want to add the pieces of information that I've found out. I recognise these three demands in the question:
to have the data stored in a suitable structure
scalability in terms of size and access time
the possibility to efficiently read only subsets of the data
Besides the option to use a relational database, one can also use the HDF5 file format, which is designed to store a large number of possibly large objects. The choice depends on the type of data and the intended way of accessing it.
Relational databases should be favoured if:
the atomic data items are small-sized
the different data items possess the same structure
it cannot be anticipated which subsets of the data will be read out
convenient transfer of the data from one computer to another is not an issue, or the computers where the data is needed have access to the database.
The HDF5 format should be preferred if:
the atomic data items are themselves large objects (e.g. matrices)
the data items are heterogeneous and it is not possible to combine them into a table-like representation
most of the time the data is read out in groups which are known in advance
moving the data from one computer to another should not require much effort
Furthermore, one can distinguish between relational and hierarchical relationships, where the latter is contained in the former. Within an HDF5 file, the information chunks can be arranged in a hierarchical way, e.g.:
/Germany/GDP/model/...
/Germany/GNP/data
/Austria/GNP/model/...
/Austria/GDP/data
The rhdf5 package for handling HDF5 files is available on Bioconductor. General information on the HDF5 format is available from the HDF Group.
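For a non-R illustration of that layout, here is a minimal sketch with h5py (the Python counterpart of rhdf5); the file, series, and attribute names are invented:

import h5py
import numpy as np

with h5py.File("indicators.h5", "w") as f:
    # intermediate groups (/Germany, /Germany/GDP) are created automatically
    f.create_dataset("Germany/GDP/data", data=np.random.rand(500))
    f["Germany/GDP"].attrs["model"] = "ARIMA(1,0,1)"  # small metadata as an attribute
    f.create_dataset("Austria/GNP/data", data=np.random.rand(500))

# reading a subset touches only the requested group, not the whole file
with h5py.File("indicators.h5", "r") as f:
    gdp = f["Germany/GDP/data"][:]
    print(f["Germany/GDP"].attrs["model"], gdp[:5])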
Not sure if it is the same, but I had some good experience with time series objects using:
str()
Maybe you can look into that.

Determining the Similarity Between Items in a Database

We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say with XX% certainty that Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say, based on your content, you're most like Record X as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however for us we have these two additional challenges:
The content is machine generated - not human generated
As opposed to a search-engine approach, where we determine results for a given query, we're trying to classify a giant repository and group its records by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
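As a concrete, if naive, Python sketch of that idea, assume for illustration that any token containing a digit is a variable part; everything else becomes the fixed part of a learned "type":

import re

known_types = []  # compiled regexes for the types discovered so far

def generalize(record):
    # assumed heuristic: tokens with digits (IDs, server codes) become wildcards
    tokens = [r"\S+" if any(c.isdigit() for c in t) else re.escape(t)
              for t in record.split()]
    return re.compile(r"\s+".join(tokens) + r"$")

def classify(record):
    for rx in known_types:
        if rx.match(record):
            return rx.pattern                  # match against a known type
    known_types.append(generalize(record))     # no match: learn a new type
    return None

classify("Change Transaction ABC123 Assigned To Server US91")         # learns the type
print(classify("Change Transaction XYZ789 Assigned To Server GB47"))  # matches it

A scored variant could count how many fixed tokens matched instead of requiring a full regex match, which gives you the high/low-confidence distinction described above.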
I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. As your data is machine generated, also use word bigrams and trigrams, and even higher-order n-grams. A bigram is just a sequence of consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare queries in a similar way; the search engine should give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
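A tiny Python sketch of the n-gram step, reproducing the bigrams above:

def ngrams(text, n=2):
    words = text.split()
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

line = "Change Transaction XYZ789 Assigned To Server GB47"
print(ngrams(line))
# ['Change_Transaction', 'Transaction_XYZ789', 'XYZ789_Assigned',
#  'Assigned_To', 'To_Server', 'Server_GB47']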
Two main strategies come to my mind here:
the ad-hoc one. Use an information retrieval approach: build an index for the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard people do this with Xapian and Lucene. Then you can "search" for a new log record and the text search engine will (hopefully) return some related log entries to compare it with. The "information retrieval" approach, however, is usually only interested in finding the 10 most similar results.
the clustering approach. You will usually need to turn the data into numerical vectors (which may be sparse), e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above) and investigate their nature. You might need to tweak this a little, so it doesn't e.g. cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast, but it will always just return some similar existing log lines, without much quantitative information about how common each line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
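A sketch of the clustering strategy with scikit-learn; n_clusters is a guess you would have to tune, and on hundreds of millions of records you would test this on a sample first:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

logs = [
    "Change Transaction ABC123 Assigned To Server US91",
    "Change Transaction XYZ789 Assigned To Server GB47",
    "User jdoe Logged In From Host H7",  # invented, unrelated line
]

vec = TfidfVectorizer(ngram_range=(1, 2))  # unigrams plus the bigrams mentioned earlier
X = vec.fit_transform(logs)                # sparse TF-IDF matrix

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # the two "Change Transaction" lines should share a label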
It sounds like you could take the Lucene approach mentioned above, then use that as a source of input vectors for the machine learning library Mahout (http://mahout.apache.org/). Once there you can train a classifier, or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().

Suggestions for a database with good support for set operations

I'm looking for a database with good support for set operations (more specifically: unions).
What I want is something that can store sets of short strings and calculate the union of such sets. For example, I want to add A, B, and C to a set, then D and A to another, and then get the cardinality of the union of those sets (4), but scaled up a million times or so.
The values are 12 character strings and the set sizes range from single elements to millions.
I have experimented with Redis, and it's fantastic in every respect except that, for the amount of data I have, it's tricky to use something memory-based. I've tried using the VM feature, but that makes it use even more memory; it's geared more towards large values, and I have small values (or so say the helpful people on the Redis mailing list). The jury is still out, though; I might get it to work.
I've also sketched out implementing it on top of a relational database, which would probably work, but what I'm asking for is something I wouldn't have to hack to make work. Redis would be a good answer, but as I mentioned above, I've tried it.
My current, Redis-based, implementation works more or less like this: I parse log files, and for each line I extract an API key, a user ID, and the values of a number of properties like site domain, time of day, etc. I then formulate keys that look somewhat like this (each line results in many keys, one for each property):
APIKEY:20101001:site_domain:stackoverflow.com
The key points to a set, and to this set I add the user ID. When I've parsed all the log files, I want to know the total number of unique user IDs for a property over all time, so I ask Redis for the cardinality of the union of all keys that match
APIKEY:*:site_domain:stackoverflow.com
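For concreteness, that flow sketched with the redis-py client (the key names come from the example above; the user IDs are made up):

import redis

r = redis.Redis()

# while parsing log lines: each line adds its user ID to one set per property
r.sadd("APIKEY:20101001:site_domain:stackoverflow.com", "user42")
r.sadd("APIKEY:20101002:site_domain:stackoverflow.com", "user42", "user99")

# afterwards: unique user IDs for this property over all time.
# SUNIONSTORE returns the cardinality of the stored union (2 here).
keys = r.keys("APIKEY:*:site_domain:stackoverflow.com")  # SCAN is safer on a live server
print(r.sunionstore("tmp:union", keys))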
Is there a database, besides Redis, that has good support for this use case?
It sounds like you need something like boost::disjoint_sets, a data structure specifically optimized for taking unions of large sets.

How do you verify the correct data is in a data mart?

I'm working on a data warehouse and I'm trying to figure out how best to verify that data from our data cleansing (normalized) database makes it into our data marts correctly. I've done some searches, but the results so far talk more about ensuring things like constraints are in place and that you need to do data validation during the ETL process (e.g. that dates are valid). The dimensions were pretty easy, as I could either leverage the primary key or write a very simple, verifiable query to get the data. The fact tables are more complex.
Any thoughts? We're trying to make this very easy for a subject matter expert to run a couple of queries, see some data from both the data cleansing database and the data marts, and visually compare the two to ensure they are correct.
You test your fact table loads by implementing a simplified, pared-down subset of the same data manipulation elsewhere, and comparing the results.
You calculate the same totals, counts, or other figures at least twice. Once from the fact table itself, after it has finished loading, and once from some other source:
the source data directly, controlling for all the scrubbing steps in between source and fact
a source system report that is known to be correct
etc.
If you are doing this in the database, you could write each test as a query that returns no records if everything is correct. Any records that get returned are exceptions: count of x by (y,z) does not match.
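A skeleton of such an exception test, with invented table and column names (source_staging and fact_sales stand in for your real tables), here against SQLite for self-containment:

import sqlite3

con = sqlite3.connect("dw.db")
exceptions = con.execute("""
    SELECT s.y, s.z, s.total AS source_total, f.total AS fact_total
    FROM  (SELECT y, z, SUM(x) AS total FROM source_staging GROUP BY y, z) s
    LEFT JOIN
          (SELECT y, z, SUM(x) AS total FROM fact_sales GROUP BY y, z) f
      ON s.y = f.y AND s.z = f.z
    WHERE f.total IS NULL OR s.total <> f.total
""").fetchall()

assert not exceptions, f"count of x by (y,z) does not match: {exceptions[:5]}"

The mirror-image query (fact LEFT JOIN source) catches rows that appear only in the fact table.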
See this excellent post by ConcernedOfTunbridgeWells for more recommendations.
Although it has some drawbacks and potential problems if you do a lot of cleansing or transforming, I've found you can round-trip an input file by re-generating the input file from the star schema(s) and then simply comparing the input file to the regenerated file. It might require some massaging to make them match (one is left-padded, the other right-padded).
Typically, I had a program which used the same layout the ETL used and did a compare, ignoring alignment within a field. Also, the files might have to be sorted; I used a command-line sort for that.
If your ETL performs a transform incorrectly and your reverse transform makes the same mistake, it's still possible that this method won't show every problem in the DW, and I wouldn't claim it has complete coverage, but it's a pretty good first whack at a regression unit test for each load.
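A small Python sketch of that compare step; the file names are invented, and collapsing runs of spaces stands in for "ignoring alignment within a field":

def normalize(path):
    with open(path) as f:
        # strip per-field padding, then sort so row order doesn't matter
        return sorted(" ".join(line.split()) for line in f)

original    = normalize("input_extract.txt")
regenerated = normalize("regenerated_from_star.txt")

mismatches = [(a, b) for a, b in zip(original, regenerated) if a != b]
if len(original) != len(regenerated) or mismatches:
    print("round trip failed:", mismatches[:5])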

Strategy in storing ad-hoc numbers/constants?

I have a need to store a number of ad-hoc figures and constants for calculation.
These numbers change periodically, but they are different types of values. One might be a balance or a money amount, another might be an interest rate, and yet another might be a ratio of some kind.
These numbers are then used in a calculation that involve other more structured figures.
I'm not certain what the best way to store these in a relational DB is - a relational DB being the app's chosen storage.
One way, which I've done before, is to create a very generic table that stores the values as text. I might store the data type along with it, but the consumer knows what type it is, so in some situations I didn't even need to store the data type. This kind of works, but I am not very fond of the solution.
Should I break down each of the numbers into specific categories and create tables that way? For example, create a Rates table, a Balances table, etc.?
Yes, you should definitely structure your database accordingly. Having a generic table holding text values is not a great solution, and it also adds overhead when using those values in programs that may pull that data for some calculations.
Keeping the tables and values separated by category allows you to do things like adding dates and statuses to your values (perhaps some are active while others aren't?) and also allows you to keep an accurate history (what if I want to see a particular rate from last year?). It also makes things easier for those who come behind you to sift through your data.
I suggest reading this article on database normalization.
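A sketch of the kind of per-category tables suggested above (all names invented), with effective dates for history and a status flag, using SQLite for self-containment:

import sqlite3

con = sqlite3.connect("constants.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS interest_rates (
    rate_id        INTEGER PRIMARY KEY,
    name           TEXT NOT NULL,        -- e.g. 'prime_rate'
    rate           NUMERIC NOT NULL,     -- stored as a number, not text
    effective_from DATE NOT NULL,
    effective_to   DATE,                 -- NULL means still active
    status         TEXT DEFAULT 'active'
);
CREATE TABLE IF NOT EXISTS balances (
    balance_id     INTEGER PRIMARY KEY,
    name           TEXT NOT NULL,
    amount_cents   INTEGER NOT NULL,     -- money kept as integer cents
    effective_from DATE NOT NULL
);
""")

# "what was this rate on a given date last year?"
row = con.execute("""
    SELECT rate FROM interest_rates
    WHERE name = 'prime_rate'
      AND effective_from <= '2012-06-30'
      AND (effective_to IS NULL OR effective_to > '2012-06-30')
""").fetchone()

Typed columns let the database validate the values, and the effective-date pair answers the "rate from last year" question directly.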
