Speed up ETL transformation - Pentaho Kettle - database

For a project, I have to deal with time series data from many sensors.
I have an industrial machine that produces some artifacts. For each work (at most 20 minutes long), sensors record oil pressure and temperature, plus some vibrational data at very high frequencies. Each time series is recorded in a .csv file, one per sensor per work. Each file is named:
yyyy_mm_dd_hh_mm_ss_sensorname.csv
and contains just a sequence of real numbers.
I have to store this kind of data somehow. I am benchmarking many solutions, relational and not, such as MySQL, Cassandra, Mongo, etc.
In particular, for Cassandra and Mongo, I am using Pentaho Data Integration as the ETL tool.
I have designed a common schema for both DBs (a single column family/collection):
---------------------------------------
id | value | timestamp | sensor | order
---------------------------------------
The problem is that I am forced to extract the timestamp and sensor information from the filenames, and I have to apply many transformations to get the desired formats.
This slows my whole job down: uploading a single work (with just a single high-frequency metric, about 3M rows in total) takes 3 minutes for MongoDB and 8 minutes for Cassandra.
I am running both DBs on a single node (for now), with 16 GB of RAM and a 15-core CPU.
I am sure I am doing the transformation wrong, so the question is: how can I speed things up?
Here is my KTR file: https://imgur.com/a/UZu4kYv (not enough rep to post images)

Unfortunately, you cannot use the filename from the Additional output fields tab, because that field is populated in parallel and there is a chance it is not yet known when you use it in computations.
However, in your case you can put the filename in a field, for example with a Data Grid step, and use it to compute the timestamp and sensor. In parallel, you apply the needed transforms to id, value and order. When finished, you put them together again. I added a Unique rows step on the common flow, just in case the input is buggy and has more than one (timestamp, sensor) pair.
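As a rough illustration of the same idea outside Pentaho, in plain Python: the timestamp and sensor can be derived once per file and then attached to every row, instead of being recomputed for each row. The function names and the sample filenames are hypothetical; only the yyyy_mm_dd_hh_mm_ss_sensorname.csv pattern and the column names come from the question.

```python
from datetime import datetime
from pathlib import Path


def parse_sensor_filename(path):
    """Split 'yyyy_mm_dd_hh_mm_ss_sensorname.csv' into (timestamp, sensor)."""
    stem = Path(path).stem
    # Six date/time fields, then the sensor name (which may itself contain '_').
    parts = stem.split("_", 6)
    if len(parts) != 7:
        raise ValueError(f"unexpected filename: {path!r}")
    timestamp = datetime.strptime("_".join(parts[:6]), "%Y_%m_%d_%H_%M_%S")
    return timestamp, parts[6]


def rows_for_file(path, values):
    """Yield one row per value; the filename is parsed once, not per row."""
    timestamp, sensor = parse_sensor_filename(path)
    for order, value in enumerate(values):
        yield {
            "value": value,
            "timestamp": timestamp,
            "sensor": sensor,
            "order": order,
        }
```

The per-row work is then reduced to building the row itself, which is the part that has to scale to millions of values.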

Related

Data Store Design for NxN Data Aggregation

I am trying to come up with a theoretical solution to an NxN problem for data aggregation and storage. As an example I have a huge amount of data that comes in via a stream. The stream sends the data in points. Each point has 5 dimensions:
Location
Date
Time
Name
Statistics
This data then needs to be aggregated and stored to allow another user to come along and query the data for both location and time. The user should be able to query like the following (pseudo-code):
Show me aggregated statistics for Location 1,2,3,4,....N between Dates 01/01/2011 and 01/03/2011 between times 11am and 4pm
Unfortunately due to the scale of the data it is not possible to aggregate all this data from the points on the fly and so aggregation prior to this needs to be done. As you can see though there are multiple dimensions that the data could be aggregated on.
They can query for any number of days or locations and so finding all the combinations would require huge pre-aggregation:
Record for Locations 1 Today
Record for Locations 1,2 Today
Record for Locations 1,3 Today
Record for Locations 1,2,3 Today
etc... up to N
Preprocessing all of these combinations prior to querying could result in an amount of processing that is not viable. If we have 200 different locations then we have 2^200 combinations, which would be nearly impossible to precompute in any reasonable amount of time.
I did think about creating records on 1 dimension and then merging could be done on the fly when requested, but this would also take time at scale.
Questions:
How should I go about choosing the right dimension and/or combination of dimensions given that the user is as likely to query on all dimensions?
Are there any case studies I could refer to, books I could read or anything else you can think of that would help?
Thank you for your time.
EDIT 1
When I say aggregating the data together I mean combining the statistics and name (dimensions 4 & 5) for the other dimensions. So for example if I request data for Locations 1,2,3,4..N then I must merge the statistics and counts of name together for those N Locations before serving it up to the user.
Similarly, if I request the data for dates 01/01/2015 - 01/12/2015 then I must aggregate all data between those dates (by summing name/statistics).
Finally, if I ask for data between dates 01/01/2015 - 01/12/2015 for Locations 1,2,3,4..N then I must aggregate all data between those dates for all those locations.
For the sake of this example, let's say that going through statistics requires some sort of nested loop and does not scale well, especially on the fly.
Try a time-series database!
From your description it seems that your data is a time-series dataset.
The user seems to be mostly concerned about the time when querying and after selecting a time frame, the user will refine the results by additional conditions.
With this in mind, I suggest you try a time-series database like InfluxDB or OpenTSDB.
For example, Influx provides a query language that is capable of handling queries like the following, which comes quite close to what you are trying to achieve:
SELECT count(location) FROM events
WHERE time > '2013-08-12 22:32:01.232' AND time < '2013-08-13'
GROUP BY time(10m);
I am not sure what you mean by scale, but the time-series DBs have been designed to be fast for lots of data points.
I'd definitely suggest giving them a try before rolling your own solution!
Denormalization is a means of addressing performance or scalability in a relational database.
IMO, having some new tables to hold aggregated data and using them for reporting will help you.
I have a huge amount of data that comes in via a stream. The stream
sends the data in points.
There are multiple ways to achieve denormalization in this case:
Adding a new parallel endpoint for data aggregation at the streaming level
Scheduling a job to aggregate data at the DBMS level
Using the DBMS triggering mechanism (less efficient)
In an ideal scenario, when a message reaches the streaming level, two copies of the data message (containing the location, date, time, name and statistics dimensions) are dispatched for processing: one goes to OLTP (the current application logic), the second goes to an OLAP (BI) process.
The BI process will create denormalized aggregated structures for reporting.
I suggest having one aggregated data record per location and date group.
The end user will then query preprocessed data that won't need heavy recalculation, at the cost of some acceptable inaccuracy.
How should I go about choosing the right dimension and/or combination
of dimensions given that the user is as likely to query on all
dimensions?
That will depend on your application logic. If possible, limit the user to predefined queries whose values the user can fill in (such as dates from 01/01/2015 to 01/12/2015). In more complex systems, using a report generator on top of the BI warehouse is an option.
I'd recommend Kimball's The Data Warehouse ETL Toolkit.
You can at least reduce Date and Time to a single dimension, and pre-aggregate your data based on your minimum granularity, e.g. 1-second or 1-minute resolution. It could be useful to cache and chunk your incoming stream for the same resolution, e.g. append totals to the datastore every second instead of updating for every point.
What's the size and likelihood of change of the name and location domains? Is there any relation between them? You said that there could be as many as 200 locations. I'm thinking that if name is a very small set and unlikely to change, you could hold counts of names in per-name columns in a single record, reducing the scale of the table to 1 row per location per unit of time.
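A minimal sketch of that layout, assuming a 1-minute resolution and in-memory storage purely for illustration: each (location, minute) pair holds one record with per-name counts, and a query merges only the records for the requested locations and time range. The class and method names are made up, and the statistics dimension is left out to keep the example short.

```python
from collections import Counter, defaultdict
from datetime import datetime


def bucket_start(ts):
    """Truncate a timestamp to the start of its 1-minute bucket."""
    return ts.replace(second=0, microsecond=0)


class PreAggregatedStore:
    """One record per (location, minute), holding a count per name."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def add_point(self, location, ts, name):
        # Incoming points only bump a counter; raw points are not kept.
        self._counts[(location, bucket_start(ts))][name] += 1

    def query(self, locations, start, end):
        """Merge name counts for the given locations with start <= bucket < end."""
        wanted = set(locations)
        total = Counter()
        for (location, bucket), counts in self._counts.items():
            if location in wanted and start <= bucket < end:
                total.update(counts)
        return total
```

The merge cost at query time grows with the number of locations times the number of buckets in the range, not with the number of raw points or with the 2^N possible location combinations. A real store would index by location and time rather than scan every record as this sketch does.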
You have a lot of data. Any method will take a lot of time, given the amount of data you're trying to parse.
I have two methods to offer.
The first one is a brute-force one that you have probably thought of:
id | location | date | time | name | statistics
0 | blablabl | blab | blbl | blab | blablablab
1 | blablabl | blab | blbl | blab | blablablab
etc.
With this one, you can easily parse and get elements, they are all in the same table, but the parsing is long and the table is enormous.
Second one is better I think:
Multiple tables:
id | location
0 | blablabl
id | date
0 | blab
id | time
0 | blab
id | name
0 | blab
id | statistics
0 | blablablab
With this you could parse (a lot) faster, getting the IDs and then fetching all the needed information.
It also allows you to pre-sort all the data:
You can have the locations sorted by location, the times sorted by time, the names sorted alphabetically, etc., because we don't care how the IDs are mixed:
Whether the IDs are 1 2 3 or 1 3 2, no one actually cares, and parsing would go a lot faster if your data is already sorted in its respective tables.
So, if you use the second method I gave: the moment you receive a point of data, give an ID to each of its columns:
You receive:
London 12/12/12 02:23:32 donut verygoodstatsblablabla
You add the ID to each part of this and go parse them in their respective columns:
42 | London ==> goes with London location in the location table
42 | 12/12/12 ==> goes with 12/12/12 dates in the date table
42 | ...
With this, if you want all the London data, it is all side by side; you just have to take all the ids and get the other data with them. If you want all the data between 11/11/11 and 12/12/12, it is all side by side; you just have to take the ids, etc.
Hope I helped, sorry for my poor english.
You should check out Apache Flume and Hadoop
http://hortonworks.com/hadoop/flume/#tutorials
The flume agent can be used to capture and aggregate the data into HDFS, and you can scale this as needed. Once it is in HDFS there are many options to visualize and even use map reduce or elastic search to view the data sets you are looking for in the examples provided.
I have worked with a point-of-sale database with a hundred thousand products and ten thousand stores (typically week-level aggregated sales but also receipt-level stuff for basket analysis, cross sales etc.). I would suggest you have a look at these:
Amazon Redshift, highly scalable and relatively simple to get started, cost-efficient
Microsoft Columnstore Indexes, compresses data and has familiar SQL interface, quite expensive (1 year reserved instance r3.2xlarge at AWS is about 37.000 USD), no experience on how it scales within a cluster
ElasticSearch is my personal favourite, highly scalable, very efficient searches via inverted indexes, nice aggregation framework, no license fees, has its own query language but simple queries are simple to express
In my experiments ElasticSearch was faster than Microsoft's column store or clustered index tables for small and medium-size queries by 20 - 50% on the same hardware. To get fast response times you must have a sufficient amount of RAM to keep the necessary data structures loaded in memory.
I know I'm missing many other DB engines and platforms but I am most familiar with these. I have also used Apache Spark, though not in a data aggregation context but for distributed mathematical model training.
Is there really likely to be a way of doing this without brute forcing it in some way?
I'm only familiar with relational databases, and I think that the only real way to tackle this is with a flat table as suggested before i.e. all your datapoints as fields in a single table. I guess that you just have to decide how to do this, and how to optimize it.
Unless you have to maintain 100% accuracy down to the single record, I think the question really needs to be: what can we throw away?
I think my approach would be to:
Work out what the smallest time fragment would be and quantise the time domain on that. e.g. each analyseable record is 15 minutes long.
Collect raw records together into a raw table as they come in, but as the quantising window passes, summarize the rows into the analytical table (for the 15 minute window).
Deletion of old raw records can be done by a less time-sensitive routine.
Location looks like a restricted set, so use a table to convert these to integers.
Index all the columns in the summary table.
Run queries.
Obviously I'm betting that quantising the time domain in this way is acceptable. You could supply interactive drill-down by querying back onto the raw data by time domain too, but that would still be slow.
Hope this helps.
Mark

Hadoop on periodically generated files

I would like to use Hadoop to process input files which are generated every n minutes. How should I approach this problem? For example, I have temperature measurements of cities in the USA received every 10 minutes, and I want to compute average temperatures per day, per week and per month.
PS: So far I have considered Apache Flume to get the readings. It will get data from multiple servers and write the data periodically to HDFS, from where I can read and process them.
But how can I avoid working on the same files again and again?
You should consider a Big Data stream processing platform like Storm (which I'm very familiar with, there are others, though) which might be better suited for the kinds of aggregations and metrics you mention.
Either way, however, you're going to implement something which has the entire set of processed data in a form that makes it very easy to apply the delta of just-gathered data to give you your latest metrics. Another output of this merge is a new set of data to which you'll apply the next hour's data. And so on.
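The "apply the delta to the processed state" idea can be sketched independently of Storm or Hadoop. The example below is only an assumption about how such state might look: it keeps a running sum and count per (city, day) and remembers which files were already applied, so a file seen twice is skipped. All names are hypothetical, and in a real system the state would live in durable storage rather than in memory.

```python
from collections import defaultdict


class RunningAverages:
    """Running sum/count per (city, day), updated one input file at a time."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)
        self._seen_files = set()

    def apply_file(self, file_name, readings):
        """Merge one file of (city, day, temperature) readings.

        Returns False, without touching the state, if the file was
        already applied.
        """
        if file_name in self._seen_files:
            return False
        for city, day, temperature in readings:
            key = (city, day)
            self._sum[key] += temperature
            self._count[key] += 1
        self._seen_files.add(file_name)
        return True

    def average(self, city, day):
        """Return the current average, or None if nothing was recorded."""
        key = (city, day)
        count = self._count.get(key, 0)
        if count == 0:
            return None
        return self._sum[key] / count
```

Weekly and monthly averages can be derived from the same sums and counts by adding up the daily entries, so the raw files never need to be read again.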

How to store 7.3 billion rows of market data (optimized to be read)?

I have a dataset of 1 minute data of 1000 stocks since 1998, that total around (2012-1998)*(365*24*60)*1000 = 7.3 Billion rows.
Most (99.9%) of the time I will perform only read requests.
What is the best way to store this data in a db?
1 big table with 7.3B rows?
1000 tables (one for each stock symbol) with 7.3M rows each?
any recommendation of database engine? (I'm planning to use Amazon RDS' MySQL)
I'm not used to dealing with datasets this big, so this is an excellent opportunity for me to learn. I would appreciate your help and advice a lot.
Edit:
This is a sample row:
'XX', 20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0
Column 1 is the stock symbol, column 2 is the date, column 3 is the minute, the rest are open-high-low-close prices, volume, and 3 integer columns.
Most of the queries will be like "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
About the hardware: I plan to use Amazon RDS so I'm flexible on that
So databases are for situations where you have a large complicated schema that is constantly changing. You only have one "table" with a handful of simple numeric fields. I would do it this way:
Prepare a C/C++ struct to hold the record format:
struct StockPrice
{
char ticker_code[2];
double stock_price;
timespec when;
// etc.
};
Then calculate sizeof(StockPrice[N]) where N is the number of records. (On a 64-bit system) It should only be a few hundred gig, and fit on a $50 HDD.
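Then truncate a file to that size and mmap it into memory (on Linux; on Windows use CreateFileMapping):
// POSIX sketch; needs <fcntl.h>, <sys/mman.h>, <unistd.h>; error checks omitted
int file = open("my.data", O_RDWR | O_CREAT, 0644);
ftruncate(file, sizeof(StockPrice) * N);
void* p = mmap(NULL, sizeof(StockPrice) * N, PROT_READ | PROT_WRITE, MAP_SHARED, file, 0);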
Cast the mmaped pointer to StockPrice*, and make a pass of your data filling out the array. Close the mmap, and now you will have your data in one big binary array in a file that can be mmaped again later.
StockPrice* stocks = (StockPrice*) p;
for (size_t i = 0; i < N; i++)
{
stocks[i] = ParseNextStock(stock_indata_file);
}
munmap(p, sizeof(StockPrice) * N);
close(file);
You can now mmap it again read-only from any program and your data will be readily available:
int file = open("my.data", O_RDONLY);
StockPrice* stocks = (StockPrice*) mmap(NULL, sizeof(StockPrice) * N, PROT_READ, MAP_SHARED, file, 0);
// do stuff with stocks;
So now you can treat it just like an in-memory array of structs. You can create various kinds of index data structures depending on what your "queries" are. The kernel will deal with swapping the data to/from disk transparently so it will be insanely fast.
If you expect to have a certain access pattern (for example contiguous date) it is best to sort the array in that order so it will hit the disk sequentially.
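For readers who want to try the idea without writing C, here is a small Python sketch of the same technique using the standard struct and mmap modules. The record layout (8-byte ticker, 64-bit timestamp, one double price) is an assumption chosen for the example and is not the layout from the answer above; the function names are likewise made up.

```python
import mmap
import struct

# Hypothetical fixed-width layout: ticker (8 bytes), epoch seconds (int64),
# price (float64). "<" means little-endian with no padding: 24 bytes each.
RECORD = struct.Struct("<8sqd")


def write_records(path, records):
    """Write (ticker, when, price) tuples as fixed-size binary records."""
    with open(path, "wb") as f:
        for ticker, when, price in records:
            f.write(RECORD.pack(ticker.encode("ascii"), when, price))


def open_readonly(path):
    """Map the whole file read-only; the OS pages data in on demand."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def read_record(mapped, index):
    """Fetch record number `index` directly by its byte offset."""
    ticker, when, price = RECORD.unpack_from(mapped, index * RECORD.size)
    return ticker.rstrip(b"\0").decode("ascii"), when, price
```

Because every record has the same size, record i always starts at byte i * RECORD.size, so a sorted file can be searched by position (for example with a binary search on the timestamp) without any separate index.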
I have a dataset of 1 minute data of 1000 stocks [...] most (99.9%) of the time I will perform only read requests.
Storing once and reading many times time-based numerical data is a use case termed "time series". Other common time series are sensor data in the Internet of Things, server monitoring statistics, application events etc.
This question was asked in 2012, and since then, several database engines have been developing features specifically for managing time series. I've had great results with InfluxDB, which is open source, written in Go, and MIT-licensed.
InfluxDB has been specifically optimized to store and query time series data. Much more so than Cassandra, which is often touted as great for storing time series:
Optimizing for time series involved certain tradeoffs. For example:
Updates to existing data are a rare occurrence and contentious updates never happen. Time series data is predominantly new data that is never updated.
Pro: Restricting access to updates allows for increased query and write performance
Con: Update functionality is significantly restricted
In open sourced benchmarks,
InfluxDB outperformed MongoDB in all three tests with 27x greater write throughput, while using 84x less disk space, and delivering relatively equal performance when it came to query speed.
Queries are also very simple. If your rows look like <symbol, timestamp, open, high, low, close, volume>, with InfluxDB you can store just that, then query easily. Say, for the date range from the question:
SELECT open, close FROM market_data WHERE symbol = 'AAPL' AND time > '2012-04-12 12:15' AND time < '2012-04-13 12:52'
There are no IDs, no keys, and no joins to make. You can do a lot of interesting aggregations. You don't have to vertically partition the table as with PostgreSQL, or contort your schema into arrays of seconds as with MongoDB. Also, InfluxDB compresses really well, while PostgreSQL won't be able to perform any compression on the type of data you have.
Tell us about the queries, and your hardware environment.
I would be very very tempted to go NoSQL, using Hadoop or something similar, as long as you can take advantage of parallelism.
Update
Okay, why?
First of all, notice that I asked about the queries. You can't -- and we certainly can't -- answer these questions without knowing what the workload is like. (I'll co-incidentally have an article about this appearing soon, but I can't link it today.) But the scale of the problem makes me think about moving away from a Big Old Database because
My experience with similar systems suggests the access will either be big sequential (computing some kind of time series analysis) or very very flexible data mining (OLAP). Sequential data can be handled better and faster sequentially; OLAP means computing lots and lots of indices, which either will take lots of time or lots of space.
If you're doing what are effectively big runs against a lot of data in an OLAP world, however, a column-oriented approach might be best.
If you want to do random queries, especially making cross-comparisons, a Hadoop system might be effective. Why? Because
you can better exploit parallelism on relatively small commodity hardware.
you can also better implement high reliability and redundancy
many of those problems lend themselves naturally to the MapReduce paradigm.
But the fact is, until we know about your workload, it's impossible to say anything definitive.
Okay, so this is somewhat away from the other answers, but... it feels to me like if you have the data in a file system (one stock per file, perhaps) with a fixed record size, you can get at the data really easily: given a query for a particular stock and time range, you can seek to the right place, fetch all the data you need (you'll know exactly how many bytes), transform the data into the format you need (which could be very quick depending on your storage format) and you're away.
I don't know anything about Amazon storage, but if you don't have anything like direct file access, you could basically have blobs - you'd need to balance large blobs (fewer records, but probably reading more data than you need each time) with small blobs (more records giving more overhead and probably more requests to get at them, but less useless data returned each time).
Next you add caching - I'd suggest giving different servers different stocks to handle for example - and you can pretty much just serve from memory. If you can afford enough memory on enough servers, bypass the "load on demand" part and just load all the files on start-up. That would simplify things, at the cost of slower start-up (which obviously impacts failover, unless you can afford to always have two servers for any particular stock, which would be helpful).
Note that you don't need to store the stock symbol, date or minute for each record - because they're implicit in the file you're loading and the position within the file. You should also consider what accuracy you need for each value, and how to store that efficiently - you've given 6SF in your question, which you could store in 20 bits. Potentially store three 20-bit integers in 64 bits of storage: read it as a long (or whatever your 64-bit integer value will be) and use masking/shifting to get it back to three integers. You'll need to know what scale to use, of course - which you could probably encode in the spare 4 bits, if you can't make it constant.
You haven't said what the other three integer columns are like, but if you could get away with 64 bits for those three as well, you could store a whole record in 16 bytes. That's only ~110GB for the whole database, which isn't really very much...
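The packing described above can be sketched as follows. This is one possible bit layout, chosen for illustration: three 20-bit values in the low 60 bits and a 4-bit scale in the top 4 bits. Interpreting the scale as a number of decimal places is also an assumption, as are the function names.

```python
MASK20 = (1 << 20) - 1  # 1,048,575: enough for six significant figures


def pack3(a, b, c, scale=0):
    """Pack three 20-bit integers and a 4-bit scale into one 64-bit word."""
    for value in (a, b, c):
        if not 0 <= value <= MASK20:
            raise ValueError(f"{value} does not fit in 20 bits")
    if not 0 <= scale <= 0xF:
        raise ValueError(f"scale {scale} does not fit in 4 bits")
    return (scale << 60) | (a << 40) | (b << 20) | c


def unpack3(word):
    """Reverse pack3: return (a, b, c, scale)."""
    a = (word >> 40) & MASK20
    b = (word >> 20) & MASK20
    c = word & MASK20
    scale = word >> 60
    return a, b, c, scale


def to_price(value, scale):
    """Turn a stored integer back into a price with `scale` decimal places."""
    return value / 10 ** scale
```

For example, the prices 43.7444, 43.7541 and 43.735 from the sample row could be stored as the integers 437444, 437541 and 437350 with a scale of 4.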
EDIT: The other thing to consider is that presumably the stock doesn't change over the weekend - or indeed overnight. If the stock market is only open 8 hours per day, 5 days per week, then you only need 40 hours' worth of values per week instead of 168. At that point you could end up with only about 28GB of data in your files... which sounds a lot smaller than you were probably originally thinking. Having that much data in memory is very reasonable.
EDIT: I think I've missed out the explanation of why this approach is a good fit here: you've got a very predictable aspect for a large part of your data - the stock ticker, date and time. By expressing the ticker once (as the filename) and leaving the date/time entirely implicit in the position of the data, you're removing a whole bunch of work. It's a bit like the difference between a String[] and a Map<Integer, String> - knowing that your array index always starts at 0 and goes up in increments of 1 up to the length of the array allows for quick access and more efficient storage.
It is my understanding that HDF5 was designed specifically with the time-series storage of stock data as one potential application. Fellow stackers have demonstrated that HDF5 is good for large amounts of data: chromosomes, physics.
I think any major RDBMS would handle this. At the atomic level, one table with correct partitioning seems reasonable (partition based on your data usage if fixed; this is likely to be either symbol or date).
You can also look into building aggregated tables for faster access above the atomic level. For example, if your data is at day level but you often get data back at the week or even month level, then this can be pre-calculated in an aggregate table. In some databases this can be done through a cached view (various names in different DB solutions, but basically it's a view on the atomic data that, once run, is cached/hardened into a fixed temp table, which is queried for subsequent matching queries. This can be dropped at intervals to free up memory/disk space).
I guess we could help you more with some idea as to the data usage.
Here is an attempt to create a Market Data Server on top of the Microsoft SQL Server 2012 database which should be good for OLAP analysis, a free open source project:
http://github.com/kriasoft/market-data
First, there aren't 365 trading days in the year. With holidays and 52 weekends (104 days), say 250 days, times the actual hours per day the market is open, as someone said. So your estimate of 7.3 billion is vastly overcalculated, since it's only about 1.7 million rows per symbol for 14 years. Also, using the symbol as the primary key is not a good idea, since symbols change; use a k_equity_id (numeric) with a symbol (char), since symbols can look like A or GAC-DB-B.TO. Then in your data tables of price info you have:
k_equity_id
k_date
k_minute
and for the EOD table (which will be viewed 1000x more than the other data):
k_equity_id
k_date
Second, don't store your by-minute OHLC data in the same DB table as the EOD (end of day) table, since anyone wanting to look at a pnf or line chart over a year-long period has zero interest in the by-the-minute information.
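The two row-count estimates in this thread can be checked with a few lines of arithmetic. The 250 trading days come from this answer; the 8 trading hours per day is an assumption borrowed from another answer in the thread, not a fact about any particular exchange. The function names are made up.

```python
YEARS = 2012 - 1998  # 14 years, as in the question
SYMBOLS = 1000


def rows_round_the_clock(years=YEARS, symbols=SYMBOLS):
    """The question's estimate: one row per minute, every day, all day."""
    return years * 365 * 24 * 60 * symbols


def rows_per_symbol_trading_hours(years=YEARS, trading_days=250, hours_per_day=8):
    """One row per minute, but only while the market is assumed to be open."""
    return years * trading_days * hours_per_day * 60


print(rows_round_the_clock())            # 7358400000, i.e. about 7.3 billion
print(rows_per_symbol_trading_hours())   # 1680000, i.e. about 1.7 million
```

With 1000 symbols the trading-hours estimate comes to about 1.7 billion rows in total, roughly a quarter of the original figure.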
Let me recommend that you take a look at Apache Solr, which I think would be ideal for your particular problem. Basically, you would first index your data (each row being a "document"). Solr is optimized for searching and natively supports range queries on dates. Your nominal query,
"Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
would translate to something like:
?q=stock:AAPL AND date:[2012-04-12T12:15:00Z TO 2012-04-13T12:52:00Z]
Assuming "stock" is the stock name and "date" is a "DateField" created from the "date" and "minute" columns of your input data on indexing. Solr is incredibly flexible and I really can't say enough good things about it. So, for example, if you needed to maintain the fields in the original data, you can probably find a way to dynamically create the "DateField" as part of the query (or filter).
You should compare the slow solutions with a simple optimized in-memory model. Uncompressed, it fits in a 256 GB RAM server. A snapshot fits in 32 K and you just index it positionally on datetime and stock. Then you can make specialized snapshots, as the open of one often equals the close of the previous.
[edit] Why do you think it makes sense to use a database at all (rdbms or nosql)? This data doesn't change, and it fits in memory. That is not a use case where a dbms can add value.
If you have the hardware, I recommend MySQL Cluster. You get the MySQL/RDBMS interface you are so familiar with, and you get fast and parallel writes. Reads will be slower than regular MySQL due to network latency, but you have the advantage of being able to parallelize queries and reads due to the way MySQL Cluster and the NDB storage engine works.
Make sure that you have enough MySQL Cluster machines and enough memory/RAM for each of those though - MySQL Cluster is a heavily memory-oriented database architecture.
Or Redis, if you don't mind a key-value / NoSQL interface to your reads/writes. Make sure that Redis has enough memory: it's super-fast for reads and writes, and you can do basic queries with it (non-RDBMS though), but it is also an in-memory database.
Like others have said, knowing more about the queries you will be running will help.
You will want the data stored in a columnar table / database. Database systems like Vertica and Greenplum are columnar databases, and I believe SQL Server now allows for columnar tables. These are extremely efficient for SELECTing from very large datasets. They are also efficient at importing large datasets.
A free columnar database is MonetDB.
If your use case is to simply read rows without aggregation, you can use an Aerospike cluster. It's an in-memory database with file system support for persistence. It's also SSD-optimized.
If your use case needs aggregated data, go for a MongoDB cluster with date-range sharding. You can group year-wise data in shards.

Recommendations for database structure with huge dataset

It seems to me this question will have no precise answer, since it requires complex analysis and a deep dive into the details of our system.
We have a distributed network of sensors. The information is gathered in one database and further processed.
The current DB design is one huge table partitioned per month. We try to keep it at about 1 billion records (usually 600-800 million), so the fill rate is 20-50 million records per day.
The DB server is currently MS SQL 2008 R2, but we started with 2005 and upgraded during project development.
The table itself contains SensorId, MessageTypeId, ReceiveDate and Data fields. The current solution is to keep the sensor data in the Data field (binary, fixed length of 16 bytes), partially decoding its type and storing that in MessageTypeId.
Sensors send different kinds of message types (currently about 200), and the number can be further increased on demand.
The main processing is done on an application server, which fetches records on demand (by type, sensorId and date range), decodes them and carries out the required processing. The current speed is sufficient for this amount of data.
We have a request to increase the capacity of our system 10-20 times, and we worry whether our current solution is capable of that.
We also have 2 ideas to "optimise" the structure, which I want to discuss.
1 Sensor data can be split into types. I'll use the 2 primary ones for simplicity: (value) level data (analog data with a range of values) and state data (a fixed set of values).
So we can redesign our table into a bunch of small ones using the following rules:
for each fixed type value (state type), create its own table with SensorId and ReceiveDate (so we avoid storing the type and the binary blob); all dependent (extended) states will be stored in their own tables, similar to a foreign key. So if we have a State with values A and B, and dependent (or additional) states 1 and 2 for it, we end up with tables StateA_1, StateA_2, StateB_1, StateB_2. The table name thus consists of the fixed states it represents.
for each analog data type we create a separate table; it will be similar to the first type but contains an additional field with the sensor value;
Pros:
Store only the required amount of data (currently our binary blob Data is sized for the longest value), reducing DB size;
To get data of a particular type we access the right table instead of filtering by type;
Cons:
AFAIK, it violates recommended practices;
Requires framework development to automate table management, since maintaining it manually would be DBA hell;
The number of tables can be considerably large, since full coverage of the possible values is required;
The DB schema changes when new sensor data, or even a new state value for already defined states, is introduced, which can require complex changes;
Complex management is error-prone;
It may be DB engine hell to insert values into such a table organisation?
The DB structure is not fixed (constantly changing);
Probably all the cons outweigh the few pros, but if we get significant performance gains and/or (less preferred but valuable too) storage savings, maybe we will go that way.
2 Maybe just split the table per sensor (about 100 000 tables), or better by sensor range, and/or move to different databases with dedicated servers, but we want to avoid spreading across more hardware if possible.
3 Leave as it is.
4 Switch to different kind of DBMS, e.g. column oriented DBMS (HBase and similar).
What do you think? Maybe you can suggest resources for further reading?
Update:
The nature of the system is that some data from sensors can arrive even with a month's delay (usually 1-2 weeks); some sensors are always online, while some kinds of sensor have on-board memory and go online eventually. Each sensor message has an associated event-raised date and a server-received date, so we can distinguish recent data from data gathered some time ago. The processing includes some statistical calculations, parameter deviation detection, etc. We build aggregated reports for quick viewing, but when data arriving from a sensor updates old (already processed) data, we have to rebuild some reports from scratch, since they depend on all available data and the aggregated values can't be reused. So we usually keep 3 months of data for quick access and archive the rest. We try hard to reduce the data we need to store, but decided that we need it all to keep the results accurate.
Update2:
Here is the table with the primary data. As I mentioned in the comments, we removed all dependencies and constraints from it during the "need for speed", so it is used for storage only.
CREATE TABLE [Messages](
[id] [bigint] IDENTITY(1,1) NOT NULL,
[sourceId] [int] NOT NULL,
[messageDate] [datetime] NOT NULL,
[serverDate] [datetime] NOT NULL,
[messageTypeId] [smallint] NOT NULL,
[data] [binary](16) NOT NULL
)
Sample data from one of the servers:
id sourceId messageDate serverDate messageTypeId data
1591363304 54 2010-11-20 04:45:36.813 2010-11-20 04:45:39.813 257 0x00000000000000D2ED6F42DDA2F24100
1588602646 195 2010-11-19 10:07:21.247 2010-11-19 10:08:05.993 258 0x02C4ADFB080000CFD6AC00FBFBFBFB4D
1588607651 195 2010-11-19 10:09:43.150 2010-11-19 10:09:43.150 258 0x02E4AD1B280000CCD2A9001B1B1B1B77
Just going to throw some ideas out there, hope they are useful - they're some of the things I'd be considering/thinking about/researching into.
Partitioning - you mention the table is partitioned by month. Is that manually partitioned yourself, or are you making use of the partitioning functionality available in Enterprise Edition? If manual, consider using the built-in partitioning functionality to partition your data out more, which should give you increased scalability / performance. The "Partitioned Tables and Indexes" article on MSDN by Kimberly Tripp is great - lots of great info in there, I won't do it an injustice by paraphrasing! Worth considering this over manually creating 1 table per sensor, which could be more difficult to maintain/implement and would therefore add complexity (simple = good). Of course, only if you have Enterprise Edition.
Filtered Indexes - check out this MSDN article
There is of course the hardware element - goes without saying that a meaty server with oodles of RAM/fast disks etc will play a part.
One technique, not so much related to databases, is to switch to recording changes in values, with a minimum of n records per minute or so. So, for example, if sensor no. 1 is sending something like:
Id Date Value
-----------------------------
1 2010-10-12 11:15:00 100
1 2010-10-12 11:15:02 100
1 2010-10-12 11:15:03 100
1 2010-10-12 11:15:04 105
then only the first and last records would end up in the DB. To make sure that the sensor is "live", a minimum of 3 records would be entered per minute. This way the volume of data would be reduced.
Not sure if this helps, or if it would be feasible in your application -- just an idea.
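The change-only recording idea above could be sketched roughly as follows. This is only an illustration, not a tested ingestion component: the function name `thin_readings` and the 20-second heartbeat (which gives about 3 records per minute for a constant signal) are assumptions made for the example.

```python
from datetime import datetime, timedelta

def thin_readings(readings, heartbeat=timedelta(seconds=20)):
    """Keep a reading only if its value changed or the heartbeat interval elapsed.

    readings: iterable of (timestamp, value) pairs in time order.
    heartbeat: assumed maximum gap between stored rows, so a silent
               sensor can be told apart from a constant one.
    """
    kept = []
    for ts, value in readings:
        if not kept:
            kept.append((ts, value))  # always store the first reading
            continue
        last_ts, last_value = kept[-1]
        if value != last_value or ts - last_ts >= heartbeat:
            kept.append((ts, value))
    return kept

# The four sample rows from the answer above.
sample = [
    (datetime(2010, 10, 12, 11, 15, 0), 100),
    (datetime(2010, 10, 12, 11, 15, 2), 100),
    (datetime(2010, 10, 12, 11, 15, 3), 100),
    (datetime(2010, 10, 12, 11, 15, 4), 105),
]
kept = thin_readings(sample)  # keeps the 11:15:00 and 11:15:04 rows
```

Whether this is acceptable depends on whether the reports need every raw sample or only the points where the value changes.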
EDIT
Is it possible to archive data based on the probability of access? Would it be correct to say that old data is less likely to be accessed than new data? If so, you may want to take a look at Bill Inmon's DW 2.0 Architecture for The Next Generation of Data Warehousing, where he discusses a model for moving data through different DW zones (Interactive, Integrated, Near-Line, Archival) based on the probability of access. Access times vary from very fast (Interactive zone) to very slow (Archival). Each zone has different hardware requirements. The objective is to prevent large amounts of data clogging the DW.
Storage-wise you are probably going to be fine. SQL Server will handle it.
What worries me is the load your server is going to take. If you are receiving transactions constantly, you would have some ~400 transactions per second today. Increase this by a factor of 20 and you are looking at ~8,000 transactions per second. That's not a small number considering you are doing reporting on the same data...
Btw, do I understand you correctly in that you are discarding the sensor data when you have processed it? So your total data set will be a "rolling" 1 billion rows? Or do you just append the data?
You could store the datetime stamps as integers. I believe datetime stamps use 8 bytes and integers only use 4 within SQL. You'd have to leave off the year, but since you are partitioning by month it might not be a problem.
So '12/25/2010 23:22:59' would get stored as 1225232259 (MMDDHHMMSS).
Just a thought...
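A minimal sketch of that MMDDHHMMSS encoding, in Python for illustration only. The helper names are made up for this example, and the year has to come from somewhere else (for instance the monthly partition), since it is not stored in the integer.

```python
from datetime import datetime

INT32_MAX = 2**31 - 1  # largest value of a 4-byte signed integer

def encode_mmddhhmmss(dt):
    """Pack month, day, hour, minute and second into one integer (year is dropped)."""
    return int(dt.strftime("%m%d%H%M%S"))

def decode_mmddhhmmss(code, year):
    """Rebuild the datetime; the caller must supply the year."""
    text = f"{code:010d}"  # restore the leading zero for months 01-09
    return datetime.strptime(f"{year}{text}", "%Y%m%d%H%M%S")

stamp = datetime(2010, 12, 25, 23, 22, 59)
code = encode_mmddhhmmss(stamp)          # 1225232259
restored = decode_mmddhhmmss(code, 2010)
```

The largest possible code, 1231235959, still fits in a 4-byte signed integer, and within a single year the integers sort in the same order as the timestamps. The trade-off is that sub-second precision and the year are lost, and date arithmetic has to be done after decoding.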

Storing large number of sensor data records

I need to create a database that saves sensor data that will be queried to generate reports later on (Display a graph and AVG/MAX/MIN values for a given timeframe).
The data points look like this:
CREATE TABLE [dbo].[Table_1](
[time] [datetime] NOT NULL,
[sensor] [int] NOT NULL,
[value] [decimal](18, 0) NULL
)
Data can be added in intervals ranging from seconds to minutes (depending on the sensor).
Should I worry about my Database growing too big when several years of data accumulate (The DB will run on a MS SQL Server 2008 workgroup edition)?
There are specialized historian databases, such as OSIsoft's PI Historian, that handle this type of data a lot better than a relational database. With PI you can configure a compression deviation for each data point, such that the data will not be archived unless it changes by at least that compression deviation. When you query the historical data for a given point, you can ask PI to interpolate what the value would have been at the specified time, even if that time falls between archived values.
It's capable of a whole lot more, but you will have to explore that on your own because I don't intend on becoming an OSIsoft salesman. However, this is definitely the way you want to go for storing large quantities of sensor data.
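To make the two ideas concrete (archive only on a sufficiently large change, then interpolate on read), here is a simplified sketch. It is not PI's API or its actual compression algorithm; the function names, the plain deadband test, and the linear interpolation are all assumptions chosen for illustration.

```python
from bisect import bisect_left

def archive_with_deadband(points, deviation):
    """Keep a point only if it differs from the last archived value by >= deviation.

    points: list of (time, value) sorted by time; times are plain numbers
            (for example epoch seconds).
    """
    archived = []
    for t, v in points:
        if not archived or abs(v - archived[-1][1]) >= deviation:
            archived.append((t, v))
    return archived

def interpolate(points, t):
    """Linearly interpolate the value at time t between archived points."""
    times = [p[0] for p in points]
    if not points or t < times[0] or t > times[-1]:
        raise ValueError("t is outside the archived range")
    i = bisect_left(times, t)
    if times[i] == t:
        return points[i][1]  # exact hit on an archived point
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

raw = [(0, 10.0), (1, 10.2), (2, 10.4), (3, 12.0), (4, 12.1)]
archived = archive_with_deadband(raw, deviation=1.0)  # keeps t=0 and t=3
estimate = interpolate(archived, 1.5)
```

Note that the interpolated value is an estimate reconstructed from the archived points, so it can differ from the raw sample that was dropped.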
It all depends on what resources and effort you want to expend on it. At 1 row per second that table would still be less than 0.5GB per sensor per year, which is very small. If you have thousands of sensors then you might want to consider whether to create summary tables to help with the reporting and analysis of the data.
Sensor data like this is often very repetitive. There are more convenient ways to store repeated values - for example by storing one row with a range of times rather than multiple rows with different times.
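As a rough sketch of the "one row with a range of times" idea, consecutive readings with the same value can be collapsed into (start, end, value) rows. The function name `to_ranges` and the tuple layout are assumptions for this example only.

```python
def to_ranges(readings):
    """Collapse consecutive equal values into (start_time, end_time, value) rows.

    readings: iterable of (time, value) pairs in time order.
    """
    ranges = []
    for ts, value in readings:
        if ranges and ranges[-1][2] == value:
            start, _, _ = ranges[-1]
            ranges[-1] = (start, ts, value)  # extend the current run
        else:
            ranges.append((ts, ts, value))   # start a new run
    return ranges

readings = [(1, 5), (2, 5), (3, 5), (4, 7), (5, 5)]
ranges = to_ranges(readings)
```

Here five readings become three range rows; the saving grows with how long the sensor stays at one value, and disappears for noisy signals that change on every sample.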
There are many software packages that can help with storing and managing this kind of time series data. There is also a significant body of research and literature on the subject, which might help you. If you aren't already familiar with it then Google for terms like "Process Historian", "Complex Event Processing" and "SCADA".
It depends on how you're going to use the data, what indexes you add in addition, how many sensors, etc.
That table, as shown, could store 150 million rows (~ 1 sensor x 1 recording per second x 5 years) in ~6GB of space (assuming a heap). The file size limit is 16 terabytes, and I'm not aware of any restrictions on this for Workgroup edition.
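The row count behind that estimate is simple arithmetic, sketched below. The 40 bytes per row used here is an assumed round figure chosen to be in line with the ~6GB estimate above, not a measured SQL Server value; real per-row cost depends on row and page overhead and on any indexes.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap days

def row_count(sensors, rows_per_second, years):
    """Number of rows accumulated at a constant recording rate."""
    return int(sensors * rows_per_second * years * SECONDS_PER_YEAR)

def size_gb(rows, bytes_per_row):
    """Approximate table size in GB (10**9 bytes) for an assumed per-row cost."""
    return rows * bytes_per_row / 1e9

rows = row_count(sensors=1, rows_per_second=1, years=5)  # 157,680,000
estimate = size_gb(rows, bytes_per_row=40)               # assumed 40 bytes per row
```

Plugging in your own sensor count and recording interval gives a quick upper bound to compare against the edition's limits before worrying about archiving.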
If you are worried about the database growing too big, I would suggest having an Archive_Table with the same structure and archiving data at a regular interval, such as once a month or every 6 months (entirely based on the volume of data).
This would let you keep a check on the number of records in your table. And, of course, the archive tables would still be available for report generation when you need them.
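A rough sketch of such an archive step is shown below. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for SQL Server, and the table names `readings` and `readings_archive` are hypothetical; on SQL Server the same copy-then-delete pattern would be written in T-SQL and scheduled as a job.

```python
import sqlite3

def archive_older_than(conn, cutoff):
    """Move rows older than cutoff into the archive table; return rows moved."""
    with conn:  # one transaction: copy first, then delete
        conn.execute(
            "INSERT INTO readings_archive "
            "SELECT time, sensor, value FROM readings WHERE time < ?",
            (cutoff,),
        )
        cur = conn.execute("DELETE FROM readings WHERE time < ?", (cutoff,))
    return cur.rowcount

conn = sqlite3.connect(":memory:")
ddl = "(time TEXT NOT NULL, sensor INTEGER NOT NULL, value REAL)"
conn.execute("CREATE TABLE readings " + ddl)
conn.execute("CREATE TABLE readings_archive " + ddl)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("2010-09-30 23:59:59", 1, 20.5),
        ("2010-10-01 00:00:00", 1, 20.7),
        ("2010-10-15 12:00:00", 2, 3.1),
    ],
)
conn.commit()

# ISO-formatted timestamps compare correctly as text.
moved = archive_older_than(conn, "2010-10-01 00:00:00")
```

Doing the copy and the delete in one transaction means a failure part-way through leaves the live table untouched rather than losing rows.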

Resources