Storing complex data as a key in a NoSQL database?

I have data that has multiple dimensions, each of which is a string. For example, a Person is described by position, id, email, etc.
I want to use one multi-dimensional datum as a key into my NoSQL database. I don't need to do any complex querying, just periodic full table scans (the table will be small). What are some ways / best practices to format this data as a key?
I have considered colon-delimiting (i.e. position:id:email), but it hurts readability and isn't very flexible. I've also considered hashing the colon-delimited string. Is there a good hash function for this type of thing? Or any completely different suggestions?
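For concreteness, the two variants I'm weighing look roughly like this (a quick Python sketch; the field names are just examples):
import hashlib

person = {"position": "engineer", "id": "42", "email": "jane@example.com"}

# Option 1: delimited composite key (escape the delimiter inside values)
def composite_key(fields, sep=":"):
    return sep.join(str(v).replace(sep, "\\" + sep) for v in fields)

plain_key = composite_key([person["position"], person["id"], person["email"]])
# -> "engineer:42:jane@example.com"

# Option 2: hash the composite key to get a fixed-length, opaque key
hashed_key = hashlib.sha256(plain_key.encode("utf-8")).hexdigest()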
Thanks in advance!

Storing multi-dimensional data under a one-dimensional key is a challenging task in key-value stores / NoSQL databases. Projects like MD-HBase or GeoMESA do exist; they place the multi-dimensional data into an n-dimensional space and use a space-filling curve to encode the location of the data in a one-dimensional key. However, most of these projects are limited to 2-dimensional spatial data and cannot handle string attributes.
Shameless Plug: I have started a new open-source-project called BBoxDB. BBoxDB is a distributed storage manager that is capable of handling multi-dimensional data. In BBoxDB a bounding box is used to describe the location of multi-dimensional data in the n-dimensional space. You could map the string attributes of your Person entity to a point in the n-dimensional space and use this point as the bounding box for your data. Then BBoxDB can run queries on your data (e.g., full table scans or scans that are restricted to some dimensions). The project is at an early stage, but maybe it is interesting for you.
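To make the mapping idea a bit more concrete, here is a rough sketch in plain Python (this is not BBoxDB's actual API, just an illustration of turning string attributes into a point that can serve as a degenerate bounding box):
import hashlib

def coordinate(value):
    # Map a string attribute to a number in [0, 1) via hashing (illustrative only)
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

person = {"position": "engineer", "id": "42", "email": "jane@example.com"}

# One coordinate per attribute; the point doubles as a zero-volume bounding box
point = tuple(coordinate(person[attr]) for attr in ("position", "id", "email"))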

Related

Storing sparse matrices in SQL for quick retrieval and deserialization

I'm interested in storing a sparse vector (technically the sparse array is a key-value dictionary, but let's just say that the values are an array for simplicity) and retrieving it from Postgres efficiently. The table schema would be:
id
sparse_array
A few options being considered:
Store the array using the ARRAY type in Postgres
Store it as JSON
Encrypt/decrypt the array using some sort of two-way encryption scheme (along the lines of the JWT concept?)
Convert it into a character-separated string (comma-separated, for example) and serialize/deserialize it using C
Are there industry practices / good ways of doing this? Big tech companies can deliver images to users super quickly; how is that made possible, and how does the storage/retrieval work?
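To make the JSON option concrete, the round trip I have in mind is roughly this (a Python sketch; the column itself would be json/jsonb in Postgres):
import json

# Sparse array as an {index: value} dictionary
sparse = {3: 0.25, 17: 1.5, 240: -0.7}

# Serialize for a json/jsonb column (JSON object keys must be strings)
payload = json.dumps({str(k): v for k, v in sparse.items()})

# Deserialize after SELECT
restored = {int(k): v for k, v in json.loads(payload).items()}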

How to save R list object to a database?

Suppose I have a list of R objects which are themselves lists. Each list has a defined structure: data, a model which fits the data, and some attributes for identifying the data. One example would be time series of certain economic indicators in particular countries. So my list object has the following elements:
data - the historical time series for the economic indicator
country - the name of the country, USA for example
name - the indicator name, GDP for example
model - the ARIMA orders found by auto.arima in a suitable format; this again may be a list.
This is just an example. As I said, suppose I have a number of such objects combined into a list. I would like to save it in some suitable format. The obvious solution is simply to use save, but this does not scale very well for a large number of objects. For example, if I only wanted to inspect a subset of the objects, I would need to load all of them into memory.
If my data were a data.frame I could save it to a database. If I wanted to work with a particular subset of the data I would use SELECT and rely on the database to deliver the required subset. SQLite has served me well in this regard. Is it possible to replicate this for my described list object with some fancy database like MongoDB? Or should I simply think about how to convert my list into several related tables?
My motivation for this is to be able to easily generate various reports on the fitted models. I can write a bunch of functions which produce a report on a given object and then just use lapply on my list of objects. Ideally I would like to parallelise this process, but that is another problem.
I think I explained the basics of this somewhere once before. The gist of it is that:
R has complete serialization and deserialization support built in, so you can in fact take any existing R object and turn it into either a binary or textual serialization. My digest package uses that to turn the serialization into a hash using different hash functions.
R has all the db connectivity you need.
Now, what a suitable format and db schema is ... will depend on your specifics. But there is (as usual) nothing in R stopping you :)
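To illustrate the pattern described above (serialize each object, key it by a hash of the serialization, and store the blobs in a database), here is the analogous idea sketched in Python with pickle, hashlib and sqlite3; in R the same roles would be played by serialize()/unserialize(), digest and a DBI backend such as RSQLite:
import hashlib, pickle, sqlite3

obj = {"country": "USA", "name": "GDP", "data": [1.2, 1.4, 1.7], "model": {"order": (1, 1, 0)}}

blob = pickle.dumps(obj)                  # serialize the object
key = hashlib.sha256(blob).hexdigest()    # hash of the serialization

con = sqlite3.connect("models.db")
con.execute("CREATE TABLE IF NOT EXISTS objects (hash TEXT PRIMARY KEY, country TEXT, name TEXT, payload BLOB)")
con.execute("INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?)", (key, obj["country"], obj["name"], blob))
con.commit()

# Later: load only the subset you need, then deserialize
row = con.execute("SELECT payload FROM objects WHERE country = ? AND name = ?", ("USA", "GDP")).fetchone()
restored = pickle.loads(row[0])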
This question has been inactive for a long time. Since I had a similar concern recently, I want to add the pieces of information that I've found out. I recognise these three demands in the question:
to have the data stored in a suitable structure
scalability in terms of size and access time
the possibility to efficiently read only subsets of the data
Besides the option of using a relational database, one can also use the HDF5 file format, which is designed to store a large number of possibly large objects. The choice depends on the type of data and the intended way of accessing it.
Relational databases should be favoured if:
the atomic data items are small-sized
the different data items possess the same structure
it cannot be anticipated in which subsets the data will be read out
convenient transfer of the data from one computer to another is not an issue, or the computers where the data is needed have access to the database.
The HDF5 format should be preferred if:
the atomic data items are themselves large objects (e.g. matrices)
the data items are heterogeneous and cannot be combined into a table-like representation
most of the time the data is read out in groups which are known in advance
moving the data from one computer to another should not require much effort
Furthermore, one can distinguish between relational and hierarchical relationships, where the latter is contained in the former. Within an HDF5 file, the chunks of information can be arranged in a hierarchical way, e.g.:
/Germany/GDP/model/...
/Germany/GNP/data
/Austria/GNP/model/...
/Austria/GDP/data
The rhdf5 package for handling HDF5 files is available on Bioconductor. General information on the HDF5 format is available here.
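As an illustration of this hierarchical layout, here is a sketch using Python's h5py (Python is chosen here only to keep the sketch self-contained; rhdf5 provides the equivalent in R):
import h5py
import numpy as np

with h5py.File("indicators.h5", "w") as f:
    # Groups form the /country/indicator hierarchy; datasets hold the payloads
    f.create_dataset("Germany/GDP/data", data=np.array([3.1, 3.4, 3.6]))
    f.create_dataset("Germany/GDP/model", data=np.array([1, 1, 0]))   # e.g. an ARIMA order
    f.create_dataset("Austria/GNP/data", data=np.array([0.4, 0.5]))

# Read back only one subset without touching the rest of the file
with h5py.File("indicators.h5", "r") as f:
    gdp = f["Germany/GDP/data"][...]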
Not sure if it is the same, but I had some good experience with time series objects using:
str()
Maybe you can look into that.

Efficient Database Table Structure

Consider Microsoft SQL Server 2008
I need to create a table which can be designed in two different ways, as follows.
Structure Columnwise
StudentId number, Name Varchar, Age number, Subject varchar
eg.(1,'Dharmesh',23,'Science')
(2,'David',21,'Maths')
Structure Rowwise
AttributeName varchar,AttributeValue varchar
eg.('StudentId','1'),('Name','Dharmesh'),('Age','23'),('Subject','Science')
('StudentId','2'),('Name','David'),('Age','21'),('Subject','Maths')
In the first case there will be fewer records, but in the second approach there will be four times as many records, although with two fewer columns.
So which approach is better in terms of performance, disk storage, and data retrieval?
Your second approach is commonly known as an EAV design - Entity-Attribute-Value.
IMHO, the first approach all the way. It allows you to type your columns properly, allowing for the most efficient storage of data, and it greatly helps with the ease and efficiency of queries.
In my experience, the EAV approach usually causes a world of pain. Here's one example of a previous question about this, with good links to best practices. If you do a search, you'll find more - well worth a sift through.
A common reason why people head down the EAV route is to model a flexible schema, which is relatively difficult to do efficiently in an RDBMS. Other approaches include storing data in XML fields. This is one area where NoSQL (non-relational) databases can come in very handy due to their schemaless nature (e.g. MongoDB).
The first one will have better performance, and disk storage and data retrieval will be better as well.
Having attribute names stored as varchars makes it impossible to enforce names or datatypes, or to apply any kind of validation
It will be impossible to index the desired search operations
Saving integers as varchars will use more space
Ordering, adding or summing the integers will be a headache and will perform badly
The programming language using this database will have no way of working with strongly typed data
There are many more reasons for using the first approach.
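As a small illustration of the difference, here is a sketch using SQLite via Python, purely to keep it self-contained; the question is about SQL Server, so treat the DDL as schematic. Note that an EntityId column is added to the EAV table so the attribute rows belonging to one student can be tied together.
import sqlite3

con = sqlite3.connect(":memory:")

# Approach 1: column-wise, with a real type per column
con.execute("CREATE TABLE Student (StudentId INTEGER PRIMARY KEY, Name TEXT, Age INTEGER, Subject TEXT)")
con.execute("INSERT INTO Student VALUES (1, 'Dharmesh', 23, 'Science')")

# Approach 2: EAV, everything squeezed into varchars
con.execute("CREATE TABLE StudentEAV (EntityId INTEGER, AttributeName TEXT, AttributeValue TEXT)")
con.executemany("INSERT INTO StudentEAV VALUES (?, ?, ?)",
                [(1, 'Name', 'Dharmesh'), (1, 'Age', '23'), (1, 'Subject', 'Science')])

# A simple question ("students older than 22") stays simple column-wise ...
con.execute("SELECT Name FROM Student WHERE Age > 22")
# ... but needs a self-join plus a string-to-number cast with EAV
con.execute("""SELECT n.AttributeValue
               FROM StudentEAV a JOIN StudentEAV n ON a.EntityId = n.EntityId
               WHERE a.AttributeName = 'Age' AND CAST(a.AttributeValue AS INTEGER) > 22
                 AND n.AttributeName = 'Name'""")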

Store array of numbers in database field

Context: SQL Server 2008, C#
I have an array of integers (0-10 elements). Data doesn't change often, but is retrieved often.
I could create a separate table to store the numbers, but for some reason it feels like that wouldn't be optimal.
Question #1: Should I store my array in a separate table? Please give reasons for one way or the other.
Question #2: (regardless of what the answer to Q#1 is), what's the "best" way to store int[] in a database field? XML? JSON? CSV?
EDIT:
Some background: the numbers being stored are just coefficients that don't participate in any relationship and are always used as an array (i.e. a value is never retrieved or used in isolation).
Separate table, normalized
Not as XML or JSON, but as separate numbers in separate rows.
No matter what you think, it's the best way. You can thank me later.
The "best" way to store data in a database is the way that is most conducive to the operations that will be performed on it and the one which makes maintenance easiest. It is this latter requirement which should lead you to a normalized solution, which means storing the integers in a table with a relationship. Beyond being easier to update, it is easier for the next developer who comes after you to understand what information is stored and how.
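A minimal sketch of that normalized layout (SQLite via Python just to keep it runnable; the table and column names are made up for illustration):
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Item (ItemId INTEGER PRIMARY KEY)")
con.execute("""CREATE TABLE ItemCoefficient (
                   ItemId  INTEGER REFERENCES Item(ItemId),
                   Ordinal INTEGER,   -- position in the array
                   Value   INTEGER,
                   PRIMARY KEY (ItemId, Ordinal))""")

con.execute("INSERT INTO Item VALUES (1)")
con.executemany("INSERT INTO ItemCoefficient VALUES (?, ?, ?)",
                [(1, i, v) for i, v in enumerate([10, 2, 44, 1])])

# Rebuild the int[] in array order
coeffs = [row[0] for row in con.execute(
    "SELECT Value FROM ItemCoefficient WHERE ItemId = ? ORDER BY Ordinal", (1,))]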
Store it as a JSON array, but know that all accesses will now be for the entire array - no individual reads/writes to specific coefficients.
In our case, we're storing them as a JSON array. As in your case, there is no relationship between the individual array numbers - the array only makes sense as a unit, and as a unit it DOES have a relationship with other columns in the table. By the way, everything else IS normalized. I liken it to this: if you were going to store a 10-byte chunk, you'd save it packed in a single column of VARBINARY(10). You wouldn't shard it into 10 bytes, store each in a column of VARBINARY(1), and then stitch them together with a foreign key. I mean you could - but it wouldn't make any sense.
YOU as the developer will need to understand how 'monolithic' that array of ints really is.
A separate table would be the most "normalized" way to do this. And it is better in the long run, probably, since you won't have to parse the value of the column to extract each integer.
If you want you could use an XML column to store the data, too.
Sparse columns may be another option for you, too.
If you want to keep it really simple you could just delimit the values: 10;2;44;1
I think the fact that you are talking about SQL Server indicates that your app may be a data-driven application. If that is the case, I would definitely keep the array in the database as a separate table with a record for each value. It will be normalized and optimized for retrieval. Even if you only have a few values in the array, you may need to combine that data with other retrieved data that may need to be "joined" with your array values, which is what SQL is optimized for by using indexes, foreign keys, etc. (normalization).
That being said, you can always hard code the 10 values in your code and save the round trip to the DB if you don't need to change the values. It depends on how your application works and what this array is going to be used for.
I agree with all the others that the best option is a separate normalized table. But if you insist on having it all in the same table, don't place the array in one single column. Instead, create 10 columns and store each array value in a different column. It will save you the parsing and update problems.

Multidimensional databases and online analytical processing (OLAP): how are they related?

http://en.wikipedia.org/wiki/Online_Analytical_Processing
How are these two related? How do we know that we are dealing with this type of program?
The two are often conflated but are not exactly equivalent.
A multi-dimensional database - i.e. a star schema:
http://en.wikipedia.org/wiki/Star_schema
(or arguably also a snowflake schema) is a way of organising data into fact tables and dimension tables - the former typically hold the numeric data (i.e. measurements) while the latter hold descriptive data. A star schema may be implemented using relational database technology or using specialised storage formats that have been optimised for manipulating dimensional data.
OLAP is normally implemented using specialised storage formats that have been optimised for manipulating dimensional data, and features precalculation of summarised values.
Both are normally used as part of datawarehousing. OLAP is likely to be implemented where performance from a non-aggregated SQL database is judged to be inadequate for aggregated reporting requirements.
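As a toy illustration of that split (plain Python with made-up data): the fact table holds the numeric measurements plus keys into the dimensions, the dimension tables hold the descriptive attributes, and an OLAP-style summary pre-aggregates the facts along a chosen dimension.
from collections import defaultdict

# Dimension tables: descriptive attributes keyed by a surrogate id
dim_product = {1: {"name": "Widget", "category": "Hardware"},
               2: {"name": "Gadget", "category": "Hardware"}}
dim_date    = {101: {"year": 2023, "quarter": "Q1"},
               102: {"year": 2023, "quarter": "Q2"}}

# Fact table: numeric measurements plus foreign keys into the dimensions
fact_sales = [
    {"product_id": 1, "date_id": 101, "amount": 120.0},
    {"product_id": 2, "date_id": 101, "amount": 80.0},
    {"product_id": 1, "date_id": 102, "amount": 150.0},
]

# The kind of summary an OLAP engine would precalculate: amount by quarter
sales_by_quarter = defaultdict(float)
for row in fact_sales:
    sales_by_quarter[dim_date[row["date_id"]]["quarter"]] += row["amount"]
# -> {"Q1": 200.0, "Q2": 150.0}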
What multidimensional usually means in the context of OLAP systems is actually a database design based on "Dimensional Modelling" or software that supports dimensionally modelled data.
The word "Multidimensional" used in that sense is not really very informative, because any relational database is inherently multidimensional. (A relation is fundamentally an N-dimensional data structure, with the number of dimensions limited only by the constraints of software and hardware.) Personally, therefore, I would prefer to avoid the term multidimensional altogether. It is just too ambiguous to be useful.
