Data Store Design for NxN Data Aggregation - database

I am trying to come up with a theoretical solution to an NxN problem for data aggregation and storage. As an example I have a huge amount of data that comes in via a stream. The stream sends the data in points. Each point has 5 dimensions:
Location
Date
Time
Name
Statistics
This data then needs to be aggregated and stored to allow another user to come along and query the data for both location and time. The user should be able to query like the following (pseudo-code):
Show me aggregated statistics for Location 1,2,3,4,....N between Dates 01/01/2011 and 01/03/2011 between times 11am and 4pm
Unfortunately due to the scale of the data it is not possible to aggregate all this data from the points on the fly and so aggregation prior to this needs to be done. As you can see though there are multiple dimensions that the data could be aggregated on.
They can query for any number of days or locations and so finding all the combinations would require huge pre-aggregation:
Record for Locations 1 Today
Record for Locations 1,2 Today
Record for Locations 1,3 Today
Record for Locations 1,2,3 Today
etc... up to N
Preprocessing all of these combinations prior to querying could result in an amount of processing that is not viable. If we have 200 different locations then we have 2^200 combinations, which would be nearly impossible to precompute in any reasonable amount of time.
I did think about creating records aggregated on one dimension and then merging them on the fly when requested, but this would also take time at scale.
Questions:
How should I go about choosing the right dimension and/or combination of dimensions, given that the user is equally likely to query on any of the dimensions?
Are there any case studies I could refer to, books I could read or anything else you can think of that would help?
Thank you for your time.
EDIT 1
When I say aggregating the data together I mean combining the statistics and name (dimensions 4 & 5) for the other dimensions. So for example if I request data for Locations 1,2,3,4..N then I must merge the statistics and counts of name together for those N Locations before serving it up to the user.
Similarly, if I request the data for dates 01/01/2015 - 01/12/2015 then I must aggregate all data between those periods (by summing the name counts and statistics).
Finally, if I ask for data between dates 01/01/2015 - 01/12/2015 for Locations 1,2,3,4..N then I must aggregate all data between those dates for all of those locations.
For the sake of this example, let's say that combining the statistics requires some sort of nested loop and does not scale well, especially on the fly.
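For concreteness, the kind of on-the-fly aggregation that is assumed to be too expensive could be expressed roughly like this in SQL (the points table and column names here are hypothetical, just to pin down the semantics):
-- Hypothetical raw table: one row per incoming point.
-- Running this directly over the raw points for every user query is what does not scale.
SELECT location,
       name,
       COUNT(*)       AS name_count,
       SUM(statistic) AS statistic_total
FROM points
WHERE location IN (1, 2, 3, 4)
  AND point_date BETWEEN '2011-01-01' AND '2011-03-01'
  AND point_time BETWEEN '11:00' AND '16:00'
GROUP BY location, name;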

Try a time-series database!
From your description it seems that your data is a time-series dataset.
The user seems to be mostly concerned about the time when querying and after selecting a time frame, the user will refine the results by additional conditions.
With this in mind, I suggest you try a time-series database like InfluxDB or OpenTSDB.
For example, Influx provides a query language that is capable of handling queries like the following, which comes quite close to what you are trying to achieve:
SELECT count(location) FROM events
WHERE time > '2013-08-12 22:32:01.232' AND time < '2013-08-13'
GROUP BY time(10m);
I am not sure exactly what scale you mean, but time-series DBs have been designed to be fast for lots of data points.
I'd definitely suggest giving them a try before rolling your own solution!
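As a rough, untested sketch of how the location and date-range parts of the example query could look in InfluxQL, assuming a measurement named events with location stored as a tag and a statistic field:
SELECT SUM(statistic) FROM events
WHERE (location = '1' OR location = '2' OR location = '3')
  AND time >= '2011-01-01' AND time <= '2011-03-01'
GROUP BY location, time(1d);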

Denormalization is a means of addressing performance or scalability in relational databases.
IMO, having some new tables to hold aggregated data and using them for reporting will help you.
I have a huge amount of data that comes in via a stream. The stream sends the data in points.
There are multiple ways to achieve denormalization in this case:
Adding a new parallel endpoint for data aggregation at the streaming level
Scheduling a job to aggregate data at the DBMS level
Using the DBMS triggering mechanism (less efficient)
In an ideal scenario, when a message reaches the streaming level, two copies of the data message (containing the location, date, time, name and statistics dimensions) are dispatched for processing: one goes to the OLTP side (the current application logic), the second goes to an OLAP (BI) process.
The BI process will create denormalized aggregated structures for reporting.
I would suggest having one aggregated data record per (location, date) group.
So the end user will query preprocessed data that won't need heavy recalculation, with some acceptable inaccuracy.
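As a rough sketch of what such a pre-aggregated structure could look like (the table and column names are my own assumption, not the asker's schema):
-- One pre-aggregated row per (location, date, name), maintained by the BI/ETL process.
CREATE TABLE agg_location_date (
    location_id INT          NOT NULL,
    agg_date    DATE         NOT NULL,
    name        VARCHAR(100) NOT NULL,
    name_count  BIGINT       NOT NULL,
    stat_total  BIGINT       NOT NULL,
    PRIMARY KEY (location_id, agg_date, name)
);

-- Reporting then only merges a handful of small, pre-aggregated rows:
SELECT name, SUM(name_count) AS name_count, SUM(stat_total) AS stat_total
FROM agg_location_date
WHERE location_id IN (1, 2, 3, 4)
  AND agg_date BETWEEN '2011-01-01' AND '2011-03-01'
GROUP BY name;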
How should I go about choosing the right dimension and/or combination
of dimensions given that the user is as likely to query on all
dimensions?
That depends on your application logic. If possible, limit the user to predefined queries whose values the user can fill in (like the dates from 01/01/2015 to 01/12/2015). In more complex systems, using a report generator on top of the BI warehouse would be an option.
I'd recommend Kimball's The Data Warehouse ETL Toolkit.

You can at least reduce Date and Time to a single dimension, and pre-aggregate your data based on your minimum granularity, e.g. 1-second or 1-minute resolution. It could be useful to cache and chunk your incoming stream for the same resolution, e.g. append totals to the datastore every second instead of updating for every point.
What's the size and likelihood of change of the name and location domains? Is there any relation between them? You said there could be as many as 200 locations. I'm thinking that if name is a very small set and unlikely to change, you could hold counts of names in per-name columns in a single record, reducing the scale of the table to 1 row per location per unit of time.
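A minimal sketch of that idea, assuming name really is a small, stable set (the column names below are invented for illustration):
-- One row per location per minute; the incoming stream is cached and
-- appended as totals once each minute closes, instead of per point.
CREATE TABLE minute_totals (
    location_id  INT      NOT NULL,
    bucket_start DATETIME NOT NULL,            -- Date and Time collapsed into one dimension
    donut_count  INT      NOT NULL DEFAULT 0,  -- one count column per known name
    coffee_count INT      NOT NULL DEFAULT 0,
    stat_total   BIGINT   NOT NULL DEFAULT 0,
    PRIMARY KEY (location_id, bucket_start)
);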

You have a lot of data. It will take a lot of time with any method due to the amount of data you're trying to scan through.
I have two methods to suggest.
The first one is the brute-force one you probably already thought of:
id | location | date | time | name | statistics
0 | blablabl | blab | blbl | blab | blablablab
1 | blablabl | blab | blbl | blab | blablablab
etc.
With this one you can easily scan and get elements, since they are all in the same table, but the scanning is slow and the table is enormous.
The second one is better, I think:
Multiple tables:
id | location
0 | blablabl
id | date
0 | blab
id | time
0 | blab
id | name
0 | blab
id | statistics
0 | blablablab
With this you could scan (a lot) faster, getting the IDs and then fetching all the needed information.
It also allows you to pre-sort all the data:
You can have the locations sorted by location, the times sorted by time, the names sorted alphabetically, etc., because we don't care about how the IDs are mixed:
If the IDs are 1 2 3 or 1 3 2, no one actually cares, and you would go a lot faster with scanning if your data is already sorted in its respective tables.
So, if you use the second method I gave: the moment you receive a data point, give an ID to each of its columns:
You receive:
London 12/12/12 02:23:32 donut verygoodstatsblablabla
You attach the ID to each part and store them in their respective tables:
42 | London ==> goes with London location in the location table
42 | 12/12/12 ==> goes with 12/12/12 dates in the date table
42 | ...
With this, if you want to get all the London data, it is all side by side: you just have to take all the IDs and get the other data with them. If you want all the data between 11/11/11 and 12/12/12, it is also all side by side; you just have to take the IDs, etc.
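A sketch of the second method as actual tables (the naming is mine, just to make the layout concrete):
CREATE TABLE locations  (id BIGINT NOT NULL, location VARCHAR(100) NOT NULL);
CREATE TABLE dates      (id BIGINT NOT NULL, point_date DATE NOT NULL);
CREATE TABLE times      (id BIGINT NOT NULL, point_time TIME NOT NULL);
CREATE TABLE names      (id BIGINT NOT NULL, name VARCHAR(100) NOT NULL);
CREATE TABLE statistics (id BIGINT NOT NULL, stats VARCHAR(255) NOT NULL);

-- Point 42 (London, 12/12/12, 02:23:32, donut, ...) becomes one row with
-- id = 42 in each table; indexing on the value keeps rows for the same
-- location (or date range) physically "side by side" for cheap lookups.
CREATE INDEX idx_locations_value ON locations (location, id);
CREATE INDEX idx_dates_value     ON dates (point_date, id);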
Hope this helps.

You should check out Apache Flume and Hadoop
http://hortonworks.com/hadoop/flume/#tutorials
The Flume agent can be used to capture and aggregate the data into HDFS, and you can scale this as needed. Once it is in HDFS there are many options to visualize it, and you can even use MapReduce or Elasticsearch to view the data sets you are looking for in the examples provided.

I have worked with a point-of-sale database with a hundred thousand products and ten thousand stores (typically week-level aggregated sales, but also receipt-level stuff for basket analysis, cross sales etc.). I would suggest you have a look at these:
Amazon Redshift, highly scalable and relatively simple to get started, cost-efficient
Microsoft Columnstore Indexes, compresses data and has a familiar SQL interface, quite expensive (a 1-year reserved r3.2xlarge instance at AWS is about 37,000 USD), no experience on how it scales within a cluster
ElasticSearch is my personal favourite, highly scalable, very efficient searches via inverted indexes, nice aggregation framework, no license fees, has its own query language but simple queries are simple to express
In my experiments ElasticSearch was faster than Microsoft's column store or clustered index tables for small and medium-size queries, by 20-50% on the same hardware. To have fast response times you must have a sufficient amount of RAM to keep the necessary data structures in memory.
I know I'm missing many other DB engines and platforms, but I am most familiar with these. I have also used Apache Spark, but not in a data aggregation context, rather for distributed mathematical model training.
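For reference, trying the columnstore option on an existing table is a one-liner (a sketch, assuming SQL Server 2014 or later and a hypothetical events table):
-- Converts the table to column-oriented storage: data is compressed per column
-- and aggregations that touch only a few columns become much cheaper to scan.
CREATE CLUSTERED COLUMNSTORE INDEX cci_events ON dbo.events;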

Is there really likely to be a way of doing this without brute forcing it in some way?
I'm only familiar with relational databases, and I think that the only real way to tackle this is with a flat table as suggested before, i.e. all your data points as fields in a single table. I guess you just have to decide how to do this, and how to optimize it.
Unless you have to maintain accuracy down to the single record, I think the question really needs to be: what can we throw away?
I think my approach would be to:
Work out what the smallest time fragment would be and quantise the time domain on that, e.g. each analysable record is 15 minutes long.
Collect raw records together into a raw table as they come in, but as the quantising window passes, summarize the rows into the analytical table (for the 15 minute window).
Deletion of old raw records can be done by a less time-sensitive routine.
Location looks like a restricted set, so use a table to convert these to integers.
Index all the columns in the summary table.
Run queries.
Obviously I'm betting that quantising the time domain in this way is acceptable. You could supply interactive drill-down by querying back onto the raw data by time domain too, but that would still be slow.
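A sketch of the first two steps in SQL (the table and column names are mine; 15 minutes is the example granularity from above):
-- Summary (analytical) table: one row per location per 15-minute window.
CREATE TABLE summary_15min (
    window_start DATETIME NOT NULL,
    location_id  INT      NOT NULL,
    name_count   BIGINT   NOT NULL,
    stat_total   BIGINT   NOT NULL,
    PRIMARY KEY (window_start, location_id)
);

-- Run as each window closes: roll the raw rows up into the analytical table.
INSERT INTO summary_15min (window_start, location_id, name_count, stat_total)
SELECT '2011-01-01 11:00:00', location_id, COUNT(*), SUM(statistic)
FROM raw_points
WHERE point_ts >= '2011-01-01 11:00:00'
  AND point_ts <  '2011-01-01 11:15:00'
GROUP BY location_id;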
Hope this helps.
Mark

Related

Speed up ETL transformation - Pentaho Kettle

For a project, I have to deal with Time Series data from many sensors.
I have an industrial machine that produces some artifacts. For each work (max 20 mins in time), sensors record oil pressure and temperature, and some other vibrational data (very high frequencies). All these Time Series are recorded in .csv files, one for each sensor and each work. Each file is named:
yyyy_mm_dd_hh_mm_ss_sensorname.csv
and contains just a sequence of real numbers.
I have to store this kind of data somehow. I am benchmarking many solutions, relational and not, like MySQL, Cassandra, Mongo, etc.
In particular, for Cassandra and Mongo, I am using Pentaho Data Integration as ETL tool.
I have designed a common schema for both DBs (a single column family/collection):
---------------------------------------
id | value | timestamp | sensor | order
---------------------------------------
The problem is that I am forced to extract the timestamp and sensor information from the filenames, and I have to apply many transformations to get the desired formats.
This slows my whole job down: uploading a single work (with just a single high-frequency metric, for a total of 3M rows, more or less) takes 3 mins for MongoDB, 8 mins for Cassandra.
I am running both DBs on a single node (for now), with 16 GB RAM and a 15-core CPU.
I am sure I am doing the transformation wrong, so the question is: how can I speed things up?
Here is my KTR file: https://imgur.com/a/UZu4kYv (not enough rep to post images)
Unfortunately you cannot use the filename which is on the Additional output field tab, because this field is populated in parallel and there is a chance it is not yet known when you use it in computations.
However, in your case, you can put the filename in a field, for example with a data grid, and use it for the computations of timestamp and sensor. In parallel, you make the needed transforms on id, value and order. When finished, you put them together again. I added a Unique Row on the common flow, just in case the input is buggy and has more than one timestamp/sensor.

Large db table with many rows or many columns

I have tried to have a normalized table design. The problem (maybe) is that we are generating a lot of data, and therefore a lot of rows. Currently the database is increasing in size by 0.25 GB per day.
The main tables are Samples and Boxes. There is a one-to-many relation from Samples to Boxes.
Sample table:
ID | Timestamp | CamId
Boxes table:
ID | SampleID | Volume | ...
We analyse 19 samples every 5 seconds, and each sample on average has 7 boxes. That's 19*7*12 = 1596 boxes per minute and 1596*60*24 = 2,298,240 rows in the Boxes table per day on average.
This setup might run for months. At this time the Boxes table has about 25 million rows.
Question is: should I be worried about database size, table size and table design with so much data?
Or should I have a table like
ID | SampleID | CamId | Volume1 | Volume2 | ... | Volume9 | ...
Depending on the validity of your data, you could implement a purge of your data.
What I mean is: do you really need data from days ago, months ago, years ago? If there is a time limit on the usefulness of your data, purge it, and your data table should stop growing (or nearly so) after a set amount of time.
This way you wouldn't need to care that much about either architecture for the sake of size.
Otherwise the answer is yes, you should care. Separating notions into a lot of tables could give you a good performance tweak, but may not be sufficient in terms of access time in the long run. Consider looking at NoSQL solutions or the like in order to store heavy rows.
There is one simple rule: whenever you think you have to put a number in a column's name, you probably need a related table.
The amount of data will be roughly the same, no wins here.
I'd try to partition the table. AFAIK this feature was bound to the Enterprise Editions, but - according to this document - with SQL Server 2016 SP1 table and index partitioning is coming down even to Express!
The main question is: What are you going to do with this data?
If you have to run analytical scripts over everything, there won't be a much better hint than to buy better hardware.
If your needs mostly refer to data from the last 3 weeks, you will be fine with partitioning.
If you cannot use this feature yet (due to your server's version), you can create an archive table and move older data into it with regular jobs. A UNION ALL view would still allow you to grab the whole lot. With SCHEMA BINDING you might even get the advantages of indexed views.
In this case it is clever to hold your working data on your fastest drive and put the archive table in a separate file on larger storage somewhere else.
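A sketch of the archive-plus-view idea in T-SQL (the names are made up, and the SCHEMA BINDING / indexed-view refinement is left out for brevity):
-- Older rows are moved here by a regular job; the file/filegroup placement
-- can point at cheaper storage.
CREATE TABLE dbo.Boxes_Archive (
    ID       INT   NOT NULL,
    SampleID INT   NOT NULL,
    Volume   FLOAT NOT NULL
);
GO

-- One view over both tables still lets reports grab the whole lot.
CREATE VIEW dbo.Boxes_All AS
    SELECT ID, SampleID, Volume FROM dbo.Boxes
    UNION ALL
    SELECT ID, SampleID, Volume FROM dbo.Boxes_Archive;
GO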
Question is, should I be worried about database size, table size and table design with so much data?
My answer is YES:
1. A huge amount of data (daily) will affect your storage at the hardware level.
2. Table normalization is a must, especially if you are storing bytes or images.

Database design: storing many large reports for frequent historical analysis

I'm a long time programmer who has little experience with DBMSs or designing databases.
I know there are similar posts regarding this, but am feeling quite discombobulated tonight.
I'm working on a project which will require that I store large reports, multiple times per day, and have not dealt with storage or tables of this magnitude. Allow me to frame my problem in a generic way:
The process:
A script collects roughly 300 rows of information, set A, 2-3 times per day.
The structure of these rows never changes. The rows contain two columns, both integers.
The script also collects roughly 100 rows of information, set B, at the same time. The structure of these rows does not change either. The rows contain eight columns, all strings.
I need to store all of this data. Set A will be used frequently, and daily for analytics. Set B will be used frequently on the day that it is collected and then sparingly in the future for historical analytics. I could theoretically store each row with a timestamp for later query.
If stored linearly, with both sets of data in their own tables in a DBMS, the data will reach ~300k rows per year. Having little experience with DBMSs, this sounds like a lot for two tables to manage.
I feel as though throwing this information into a database with each pass of the script will lead to slow read times and poor general responsiveness. For example, generating an Access database and tossing this information into two tables seems like too easy of a solution.
I suppose my question is: how many rows is too many rows for a table in terms of performance? I know that it would be in very poor taste to create tables for each day or month.
Of course this only leads into my next, but similar, issue: audit logs...
300 rows about 50 times a day for 6 months is not a big blocker for any DB. Which DB are you going to use? Most will handle this load very easily. There are a couple of techniques for handling data fragmentation if the rows exceed more than a few hundred million per table. But with effective indexing and cleaning you can achieve the performance you desire. I myself deal with heavy data tables with more than 200 million rows every week.
Make sure you have indexes in place for the queries you will issue to fetch that data. Whatever you have in the WHERE clause should have an appropriate index in the DB for it.
If your row counts per table exceed many millions, you should look at partitioning of tables. DBs actually store data in the filesystem as files, so partitioning helps by making smaller groups of data files based on some predicate, e.g. a date or some unique column type. You would see it as a single table, but on the file system the DB would store the data in different file groups.
Then you can also try table sharding, which is actually what you mentioned: different tables based on some predicate like date.
Hope this helps.
You are overthinking this. 300k rows is not significant. Just about any relational database or NoSQL database will not have any problems.
Your design sounds fine; however, I highly advise that you utilize the facility of the database to add a primary key for each row, using whatever mechanism is available to you. Typically this involves using AUTO_INCREMENT or a sequence, depending on the database. If you use a NoSQL database like Mongo, it will add an id for you. Relational theory depends on having a primary key, and it's often helpful to have one for diagnostics.
So your basic design would be:
Table A tableA_id | A | B | CreatedOn
Table B tableB_id | columns… | CreatedOn
The CreatedOn will facilitate date range queries that limit data for summarization purposes and allow you to GROUP BY on date boundaries (Days, Weeks, Months, Years).
Make sure you have an index on CreatedOn, if you will be doing this type of grouping.
Also, use the smallest data types you can for any of the columns. For example, if the range of the integers falls below a particular limit, or is non-negative, you can usually choose a datatype that will reduce the amount of storage required.
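A sketch of that design in MySQL-flavoured SQL (the names are placeholders):
CREATE TABLE tableA (
    tableA_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    A         SMALLINT     NOT NULL,   -- small integer types keep storage down
    B         SMALLINT     NOT NULL,
    CreatedOn DATETIME     NOT NULL,
    INDEX idx_createdon (CreatedOn)
);

-- Daily summary for analytics, limited by a range on CreatedOn.
SELECT DATE(CreatedOn) AS day, SUM(A) AS total_a, SUM(B) AS total_b
FROM tableA
WHERE CreatedOn >= '2015-01-01' AND CreatedOn < '2015-02-01'
GROUP BY DATE(CreatedOn);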

How to store plant data and where to store it?

Plant data is real-time data from the plant process, such as pressure, temperature, gas flow and so on. The data model for this data is typically like this:
(Point Name, Time stamps, value(float or integer), state(int))
We have thousands of points and a long time span to store. And importantly, we want to be able to search them easily and quickly when needed.
A typical search request looks like:
get data order by time stamp
from database
where Point name is P001_Press
between 2010-01-01 and 2010-01-02
A database like MySQL does not seem suitable for us, because there are too many records and the queries are too slow.
So, how should we store data like the above, and where should we store it? Any NoSQL databases? Thanks!
This data and query pattern actually fits pretty well into a flat table in a SQL database, which means implementing it with NoSQL will be significantly more work than fixing your query performance in SQL.
If your data is inserted in real time, you can remove the ORDER BY clause, as the data will already be sorted by timestamp and there is no need to waste time re-sorting it. An index on point name and timestamp should get you good performance on the rest of the query.
If you are really getting to the limits of what a SQL table can hold (many millions of records) you have the option of sharding - a table for each data point may work fairly well.
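A sketch of the flat table and index (the column names are adapted from the data model above):
CREATE TABLE plant_data (
    point_name VARCHAR(50) NOT NULL,
    ts         DATETIME    NOT NULL,
    value      FLOAT       NOT NULL,
    state      INT         NOT NULL
);

-- Composite index so the typical query becomes an index range scan.
CREATE INDEX idx_point_ts ON plant_data (point_name, ts);

SELECT ts, value, state
FROM plant_data
WHERE point_name = 'P001_Press'
  AND ts BETWEEN '2010-01-01' AND '2010-01-02'
ORDER BY ts;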

Recommendations for database structure with huge dataset

It seems to me this question will have no precise answer, since it requires too complex an analysis and a deep dive into the details of our system.
We have a distributed network of sensors. Information is gathered in one database and further processed.
The current DB design is one huge table partitioned per month. We try to keep it at around 1 billion records (usually 600-800 million), so the fill rate is 20-50 million records per day.
The DB server is currently MS SQL 2008 R2, but we started with 2005 and upgraded during project development.
The table itself contains SensorId, MessageTypeId, ReceiveDate and Data fields. The current solution is to preserve the sensor data in the Data field (binary, 16-byte fixed length), partially decoding its type and storing that in MessageTypeId.
We have different kinds of message types sent by the sensors (currently approx. 200), and the number can be further increased on demand.
The main processing is done on an application server which fetches records on demand (by type, sensorId and date range), decodes them and carries out the required processing. The current speed is enough for this amount of data.
We have a request to increase the capacity of our system 10-20 times, and we worry whether our current solution is capable of that.
We also have 2 ideas to "optimise" the structure which I want to discuss.
1 Sensor data can be split into types; I'll use 2 primary ones for simplicity: level data (analog data with a range of values) and state data (a fixed set of values).
So we can redesign our table into a bunch of small ones using the following rules:
for each fixed type value (state type), create its own table with SensorId and ReceiveDate (so we avoid storing the type and the binary blob); all dependent (extended) states will be stored in their own tables in a similar foreign-key fashion, so if we have a State with values A and B, and dependent (or additional) states 1 and 2 for it, we end up with the tables StateA_1, StateA_2, StateB_1, StateB_2. So the table name consists of the fixed states it represents.
for each analog data type we create a separate table; it will be similar to the first type but contains an additional field with the sensor value;
Pros:
Only the required amount of data is stored (currently our binary blob Data reserves space for the longest value), reducing the DB size;
To get data of a particular type we access the right table instead of filtering by type;
Cons:
AFAIK, it violates recommended practices;
Requires framework development to automate table management, since it would be a DBA's hell to maintain manually;
The number of tables can be considerably large, since it requires full coverage of the possible values;
The DB schema changes when new sensor data, or even a new state value for an already defined state, is introduced, and thus can require complex changes;
Complex management is error-prone;
It may be DB-engine hell to insert values with such a table organisation;
The DB structure is not fixed (it changes constantly);
Probably all the cons outweigh the few pros, but if we get significant performance gains and/or (less preferred but valuable too) storage savings, maybe we follow that way.
2 Maybe just split the table per sensor (that would be about 100,000 tables), or better by sensor range, and/or move to different databases with dedicated servers, but we want to avoid a hardware sprawl if possible.
3 Leave it as it is.
4 Switch to a different kind of DBMS, e.g. a column-oriented DBMS (HBase and similar).
What do you think? Maybe you can suggest resources for further reading?
Update:
The nature of the system is that some data from sensors can arrive even with a month's delay (usually a 1-2 week delay), some sensors are always online, and some kinds of sensor have on-board memory and go online eventually. Each sensor message has an associated event-raised date and a server-received date, so we can distinguish recent data from data gathered some time ago. The processing includes some statistical calculations, parameter deviation detection, etc. We built aggregated reports for quick viewing, but when data arriving from a sensor updates old (already processed) data we have to rebuild some reports from scratch, since they depend on all available data and the aggregated values can't be used. So we usually keep 3 months of data for quick access and archive the rest. We try hard to reduce the amount of data we need to store, but decided that we need it all to keep the results accurate.
Update2:
Here is the table with the primary data. As I mentioned in the comments, we removed all dependencies and constraints from it during the "need for speed" push, so it is used for storage only.
CREATE TABLE [Messages](
[id] [bigint] IDENTITY(1,1) NOT NULL,
[sourceId] [int] NOT NULL,
[messageDate] [datetime] NOT NULL,
[serverDate] [datetime] NOT NULL,
[messageTypeId] [smallint] NOT NULL,
[data] [binary](16) NOT NULL
)
Sample data from one of servers:
id sourceId messageDate serverDate messageTypeId data
1591363304 54 2010-11-20 04:45:36.813 2010-11-20 04:45:39.813 257 0x00000000000000D2ED6F42DDA2F24100
1588602646 195 2010-11-19 10:07:21.247 2010-11-19 10:08:05.993 258 0x02C4ADFB080000CFD6AC00FBFBFBFB4D
1588607651 195 2010-11-19 10:09:43.150 2010-11-19 10:09:43.150 258 0x02E4AD1B280000CCD2A9001B1B1B1B77
Just going to throw some ideas out there, hope they are useful - they're some of the things I'd be considering/thinking about/researching into.
Partitioning - you mention the table is partitioned by month. Is that partitioned manually yourself, or are you making use of the partitioning functionality available in Enterprise Edition? If manual, consider using the built-in partitioning functionality to partition your data out more, which should give you increased scalability/performance. This "Partitioned Tables and Indexes" article on MSDN by Kimberly Tripp is great - lots of great info in there, I won't do it an injustice by paraphrasing! It is worth considering this over manually creating 1 table per sensor, which could be more difficult to maintain/implement and therefore adds complexity (simple = good). Of course, only if you have Enterprise Edition.
Filtered Indexes - check out this MSDN article
There is of course the hardware element - goes without saying that a meaty server with oodles of RAM/fast disks etc will play a part.
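As a sketch of what built-in monthly partitioning could look like for the Messages table shown in the question (Enterprise Edition; the boundary dates are examples only):
-- Monthly partition function and scheme on serverDate.
CREATE PARTITION FUNCTION pfMonthly (datetime)
    AS RANGE RIGHT FOR VALUES ('2010-10-01', '2010-11-01', '2010-12-01');

CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);

-- Place the table on the scheme via its clustered index.
CREATE CLUSTERED INDEX cix_Messages
    ON dbo.Messages (serverDate, id)
    ON psMonthly (serverDate);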
One technique, not so much related to databases, is to switch to recording only changes in values, while keeping a minimum of n records per minute or so. So, for example, if sensor no. 1 is sending something like:
Id Date Value
-----------------------------
1 2010-10-12 11:15:00 100
1 2010-10-12 11:15:02 100
1 2010-10-12 11:15:03 100
1 2010-10-12 11:15:04 105
then only the first and last records would end up in the DB. To make sure that the sensor is "live", a minimum of 3 records would be entered per minute. This way the volume of data would be reduced.
Not sure if this helps, or if it would be feasible in your application -- just an idea.
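If the filtering had to happen on the database side rather than at the sensor, a rough sketch with a window function could look like this (this needs SQL Server 2012+, not the 2008 R2 mentioned above; the RawReadings/Readings tables are hypothetical, and the minimum-3-records-per-minute liveness rule is not shown):
-- Keep only rows whose Value differs from the previous reading of the same sensor.
WITH ordered AS (
    SELECT Id, ReadDate, Value,
           LAG(Value) OVER (PARTITION BY Id ORDER BY ReadDate) AS PrevValue
    FROM RawReadings
)
INSERT INTO Readings (Id, ReadDate, Value)
SELECT Id, ReadDate, Value
FROM ordered
WHERE PrevValue IS NULL OR Value <> PrevValue;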
EDIT
Is it possible to archive data based on the probability of access? Would it be correct to say that old data is less likely to be accessed than new data? If so, you may want to take a look at Bill Inmon's DW 2.0 Architecture for The Next Generation of Data Warehousing, where he discusses a model for moving data through different DW zones (Interactive, Integrated, Near-Line, Archival) based on the probability of access. Access times vary from very fast (Interactive zone) to very slow (Archival). Each zone has different hardware requirements. The objective is to prevent large amounts of data clogging the DW.
Storage-wise you are probably going to be fine. SQL Server will handle it.
What worries me is the load your server is going to take. If you are receiving transactions constantly, you would have some ~400 transactions per second today. Increase this by a factor of 20 and you are looking at ~8,000 transactions per second. That's not a small number considering you are doing reporting on the same data...
Btw, do I understand you correctly in that you are discarding the sensor data when you have processed it? So your total data set will be a "rolling" 1 billion rows? Or do you just append the data?
You could store the datetime stamps as integers. I believe datetime stamps use 8 bytes and integers only use 4 within SQL. You'd have to leave off the year, but since you are partitioning by month it might not be a problem.
So '12/25/2010 23:22:59' would be stored as 1225232259 (MMDDHHMMSS).
Just a thought...
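A sketch of that encoding in T-SQL (the result fits in a 4-byte INT; @d is just an example value):
DECLARE @d datetime = '2010-12-25 23:22:59';

SELECT MONTH(@d) * 100000000
     + DAY(@d)   * 1000000
     + DATEPART(HOUR,   @d) * 10000
     + DATEPART(MINUTE, @d) * 100
     + DATEPART(SECOND, @d) AS MMDDHHMMSS;   -- 1225232259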
