What database to use for fast data records with a fixed number of columns

I have a fixed number of columns: they correspond to the real-time coordinates of a few, possibly up to a few hundred, points in space (a constant id plus the x, y coordinates of a pose detected in an OpenCV image). The image is analyzed grid by grid, so a lot of data arrives at once.
I have read that Redis runs in RAM and lets you set an expiry time after which data is deleted.
Cassandra stores data column by column, so it should be a good fit for a fixed set of coordinate columns.
It would also be nice to be able to perform operations on the values, such as subtraction or multiplication.
I'm looking for a database that can write and read this data quickly without being resource-intensive.
Thanks.
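
As a rough illustration of the Redis option mentioned in the question, here is a minimal sketch using the redis-py client; the key naming, one-second expiry, and sample points are assumptions, not anything from the question:

import time
import redis  # redis-py client, assumed to be installed and a local Redis server running

r = redis.Redis(host="localhost", port=6379)

def store_frame(frame_id, points, ttl_seconds=1):
    """Store one frame of detected points (id, x, y) in a Redis hash.

    Each frame lives under its own key and expires after ttl_seconds,
    mirroring the "set a time to delete the data" idea from the question.
    """
    key = f"pose:frame:{frame_id}"
    mapping = {}
    for point_id, x, y in points:
        mapping[f"{point_id}:x"] = x
        mapping[f"{point_id}:y"] = y
    pipe = r.pipeline()          # batch the writes into one round trip
    pipe.hset(key, mapping=mapping)
    pipe.expire(key, ttl_seconds)
    pipe.execute()

# Example: one frame with three detected points.
store_frame(frame_id=int(time.time() * 1000),
            points=[(1, 12.5, 48.0), (2, 101.2, 33.7), (3, 55.0, 60.1)])

Arithmetic such as the subtraction or multiplication mentioned above would still happen client-side after reading the hashes back; Redis itself only offers increment-style commands (e.g. HINCRBYFLOAT) on numeric fields.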

Related

Recommended approach to store multi-dimensional data (e.g. spectra) in InfluxDB

I am trying to integrate a time-series database with our laboratory real-time monitoring equipment. For scalar data such as temperature the line protocol works well:
temperature,site=reactor temperature=20.0 1556892576842902000
For 1D (e.g., an IR spectrum) or higher-dimensional data, I came up with two approaches to writing the data.
Write each element of the spectrum as a field in the field set, as shown below. This way I can query individual frequencies and perform analysis or visualization using existing software. However, each record will easily contain thousands of fields due to the high resolution of the spectrometer. My concern is whether the line protocol becomes too bulky and the storage gets inefficient.
ir_spectrum,site=reactor w1=10.0,w2=11.2,w3=11.3,......,w4000=2665.2 1556892576842902000
Store the vector as a serialized string (e.g., JSON). This way I may need some plugins to adapt the data for visualization tools such as Grafana, but the protocol will look cleaner. I am not sure whether the storage layout is better than the first approach or not.
ir_spectrum,site=reactor data="[10.0, 11.2, 11.3, ......, 2665.2]" 1556892576842902000
I wonder whether there is any recommended way to store the high dimensional data? Thanks!
The first approach is better from a performance and disk-space point of view. InfluxDB stores each field in a separate column. If a column contains similar numeric values, it can be compressed better than a column of JSON strings. This also improves query speed when selecting or filtering on only a subset of fields.
P.S. InfluxDB may need large amounts of RAM when there are many fields and many tag combinations (a.k.a. high cardinality). In that case there are alternative solutions that support the InfluxDB line protocol and require less RAM for high-cardinality time series; see, for example, VictoriaMetrics.
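
To make the first approach concrete, here is a small sketch that builds such a line-protocol record with one field per spectral point; the measurement name and tag mirror the question, while the helper function itself is hypothetical:

import time

def spectrum_to_line_protocol(site, intensities, measurement="ir_spectrum"):
    """Build an InfluxDB line-protocol record with one field per spectral point.

    Field keys w1, w2, ... mirror the naming used in the question; any client
    that accepts line protocol (or an HTTP POST to InfluxDB's write endpoint)
    can send the resulting line.
    """
    fields = ",".join(f"w{i + 1}={v}" for i, v in enumerate(intensities))
    timestamp_ns = time.time_ns()  # line protocol expects nanosecond precision by default
    return f"{measurement},site={site} {fields} {timestamp_ns}"

# Example with a tiny spectrum; a real one would have thousands of points.
print(spectrum_to_line_protocol("reactor", [10.0, 11.2, 11.3, 2665.2]))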

Database table design for storing different-sized datasets

I am designing a Microsoft Access database to store results from lab equipment. They are in the form of hundreds of frequency-vs.-response curves, which I have previously stored rather easily, but inefficiently, in Excel.
The difficulty comes from the fact that the frequency can vary from 1 to 50E9 Hz, the step size between data points can vary from 1 to 1E9 Hz, and the number of points can vary from ~100 to 40,000. This has made table design a challenge, because everything I try seems very inefficient.
I have considered using links to external text files to store the data points, which solves the table design but seems to violate good database design. I've also considered tables of arrays (i.e. Start Freq, Stop Freq, Freq Step Size, and an Array of Responses), but the array sizes could vary greatly, which seems just as inefficient.
Is there a recommended practice for storing this type of data? It seems like a common task when storing instrument data, but I can't seem to find anything in web searches. Any assistance will be greatly appreciated.
Looks like a classic 1:N relationship to me. "1" is the measurement session and "N" is all the measurements (i.e. data points) taken in that session. This is modeled by two tables and one foreign key between them.
Tweak the fields to suit your needs, but this general design should be more than able to handle large amounts of data and varying numbers of measurements per session.
That being said, MS Access has historically had significant limitations on the size of the data that can be stored in a single database. If you hit these limits, consider using a "real" DBMS.
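
As a sketch of the two-table, foreign-key design described above (shown with SQLite for concreteness rather than Access, and with made-up field names and values):

import sqlite3

# In-memory database just to demonstrate the 1:N layout; in Access the
# equivalent would be two linked tables with a relationship on session_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurement_session (
        session_id   INTEGER PRIMARY KEY,
        instrument   TEXT,
        started_at   TEXT
    );
    CREATE TABLE measurement (
        measurement_id INTEGER PRIMARY KEY,
        session_id     INTEGER NOT NULL REFERENCES measurement_session(session_id),
        frequency_hz   REAL NOT NULL,
        response       REAL NOT NULL
    );
""")

# One session with a varying number of data points -- the row count per
# session can range from ~100 to 40,000 without changing the schema.
conn.execute("INSERT INTO measurement_session VALUES (1, 'analyzer', '2024-01-01T12:00:00')")
conn.executemany(
    "INSERT INTO measurement (session_id, frequency_hz, response) VALUES (1, ?, ?)",
    [(1.0e9, -3.2), (2.0e9, -3.5), (3.0e9, -4.1)],
)
print(conn.execute("SELECT COUNT(*) FROM measurement WHERE session_id = 1").fetchone())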

How to calculate the row size of unstructured data?

In a classical RDBMS it's relatively easy to calculate the maximum row size by adding up the maximum size of each field defined in a table. That value multiplied by the predicted number of rows gives the maximum table size, excluding indexes, logs, etc.
Today, in the era of storing unstructured data in structured ways, it's relatively hard to tell what the table size will be.
Is there any way to calculate or predict table or even database growth and storage requirements without a sample data load?
What are your ways of calculating row size and planning storage capacity for an unstructured database?
It is pretty much the same: find the average size of the data you need to persist and multiply it by your estimated transaction count per time unit.
Database engines may allocate data-file chunks exponentially (first 16 MB, then 32 MB, etc.), so you need to know how your DBMS engine works to translate the data size into physical storage size.
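
A back-of-the-envelope version of that estimate; every number below is an assumed placeholder:

# Rough capacity estimate: average record size x write rate x retention,
# plus headroom for indexes, padding, and the engine's chunk preallocation.
avg_record_bytes = 2_048          # assumed average size of one unstructured record
writes_per_second = 500           # assumed sustained insert rate
retention_days = 90               # assumed retention period
overhead_factor = 1.5             # assumed allowance for indexes and preallocation

raw_bytes = avg_record_bytes * writes_per_second * 86_400 * retention_days
planned_bytes = raw_bytes * overhead_factor
print(f"raw data: {raw_bytes / 1e9:.1f} GB, planned capacity: {planned_bytes / 1e9:.1f} GB")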

What is meant by sparse data / datastore / database?

Have been reading up on Hadoop and HBase lately, and came across this term-
HBase is an open-source, distributed, sparse, column-oriented store...
What do they mean by sparse? Does it have something to do with a sparse matrix? I am guessing it is a property of the type of data it can store efficiently, and hence, would like to know more about it.
In a regular database, rows are sparse but columns are not. When a row is created, storage is allocated for every column, irrespective of whether a value exists for that field (a field being the storage allocated for the intersection of a row and a column).
This allows fixed length rows greatly improving read and write times. Variable length data types are handled with an analogue of pointers.
Sparse columns incur a performance penalty and are unlikely to save you much disk space, because the space required to indicate NULL is smaller than the 64-bit pointer required for the linked-list style of chained-pointer architecture typically used to implement very large non-contiguous storage.
Storage is cheap. Performance isn't.
Sparse in respect to HBase is indeed used in the same context as a sparse matrix. It basically means that fields that are null are free to store (in terms of space).
I found a couple of blog posts that touch on this subject in a bit more detail:
http://blog.rapleaf.com/dev/2008/03/11/matching-impedance-when-to-use-hbase/
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
At the storage level, all data is stored as a key-value pair. Each storage file contains an index so that it knows where each key-value starts and how long it is.
As a consequence of this, if you have very long keys (e.g. a full URL), and a lot of columns associated with that key, you could be wasting some space. This is ameliorated somewhat by turning compression on.
See:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
for more information on HBase storage
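
As a small sketch of what "free to store" means in practice, assuming the happybase client, a running HBase Thrift gateway, and a made-up table with a single column family:

import happybase  # thin HBase client over the Thrift gateway, assumed installed

connection = happybase.Connection("localhost")
# Hypothetical table, assumed to already exist with one column family 'd'.
table = connection.table("employee_sales")

# Only the cells that actually have values are written; a row with missing
# columns simply has no key-value pairs for them, which is what "sparse,
# column-oriented" means -- absent cells cost no storage at all.
table.put(b"emp:1234", {b"d:name": b"Mike", b"d:product": b"Hbase",
                        b"d:date": b"2014/12/01", b"d:quantity": b"1"})
table.put(b"emp:5678", {b"d:name": b"Anna"})  # sparse row: most columns absent

print(table.row(b"emp:5678"))   # -> {b'd:name': b'Anna'}; no NULL placeholders are stored or returned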
There are two ways data can be stored in tables: as sparse data or as dense data.
Example of sparse data:
Suppose we query a table containing sales data for transactions by employee between Jan 2015 and Nov 2015. After running the query we get the data that satisfies the timestamp condition, but if an employee made no transactions, that employee's row comes back blank.
e.g.
EMPNo  Name  Product  Date        Quantity
1234   Mike  Hbase    2014/12/01  1
5678
3454   Jole  Flume    2015/09/12  3
The row with EMPNo 5678 has no data while the rest of the rows do. If we consider the whole table, blank rows and populated rows together, we can call it sparse data.
If we take only the populated data, it is termed dense data.
The best article I have seen, which explains many database terms as well:
http://jimbojw.com/#understanding%20hbase

Flexible storage and retrieval of motion capture data

I want to flexibly access motion capture data from C/C++ code. We currently have a bunch of separate files (.c3d format). We can expect the full set of data to be several hours long, tracking about 50 markers (4 floats each) per frame, sampled at 60 Hz, so we're probably looking at a couple of gigabytes of data.
I'd like to have a database that can hold the data, allowing it to be relatively rapidly retrieved, augmented, and modified. I'd also like to be able to apply labels to the data and retrieve sequences of frames by label, by time indices (e.g., frames 400-2000, or every 30th frame), or by other criteria.
Does such a thing already exist? Could I do it with SQLite for example? Does anyone have an intuition for what kind of performance I might get?
Currently, I'm just loading one .c3d file at a time and processing it. I haven't yet begun to apply meta-data/labels to sequences. I'll be accessing the sequences for visualization, statistical analysis, and training for machine-learning.
If you need to store multiple gigabytes of data with a known schema, you might want to look into a binary flat-file database. Of those available, I would recommend HDF5. It is not a relational database like SQLite, but it provides rich support for array and matrix data with excellent performance. It also includes MPI support, should you ever expand your machine learning onto a cluster.
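
As a rough sketch of how the frame-range and every-Nth-frame access patterns from the question might look with HDF5, using the h5py bindings; the dataset layout, chunk size, and label scheme are assumptions:

import numpy as np
import h5py

N_FRAMES, N_MARKERS, N_VALUES = 60 * 60 * 10, 50, 4   # ten minutes at 60 Hz, 50 markers, 4 floats each

with h5py.File("mocap.h5", "w") as f:
    # One chunked, compressed dataset per capture session: frames x markers x floats.
    dset = f.create_dataset("session_001/markers",
                            shape=(N_FRAMES, N_MARKERS, N_VALUES),
                            dtype="float32",
                            chunks=(600, N_MARKERS, N_VALUES),
                            compression="gzip")
    dset[:] = np.random.rand(N_FRAMES, N_MARKERS, N_VALUES).astype("float32")  # placeholder data
    # Labels can ride along as attributes on the dataset or its group.
    dset.attrs["label"] = "walking"
    dset.attrs["sample_rate_hz"] = 60

with h5py.File("mocap.h5", "r") as f:
    dset = f["session_001/markers"]
    clip = dset[400:2000]        # frames 400-2000, read without loading the whole file
    thinned = dset[::30]         # every 30th frame
    print(clip.shape, thinned.shape, dset.attrs["label"])

The chunked, compressed layout is what keeps these partial reads cheap compared to loading whole .c3d files into memory.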
