Time series database "metrics limit"?

Time series database "metrics limit"? - database

I'd like to know if a time-series database will crumble with this scenario:
I have tens of thousands of IoTs sending 4 different values each 5min.
I will query those values for each IoT, for certain time spans. My question is:
Is a tsdb approach feasible and scalable up to, e.g., a million IoTs, having metrics like:
iot.key1.value1
iot.key1.value2
iot.key1.value3
iot.key1.value4
iot.key2.value1
.
.
.
iot.key1000000.value4
? Or are they way too much "amount of metrics"?
The retention policy will be 2 years, with possible roll ups maybe after (TBA) months. But I think this consideration only matters for disk size afaik.
Right now I'm using graphite

A reporting frequency of five minutes that should be fairly manageable, just be sure to set your storage schema to five-minutes being the smallest resolution data in order to save space, as you won't be needing to hold on to data at shorter periods.
With that said, scaling a graphite cluster to meet your needs isn't easy as Whisper isn't optimized for this. There are several resources/stories where others have shared their dismay trying to achieve this, for example: here and here
There are other limitations to consider too, Whisper is configured in such a way that it can record only one datapoint per timestamp, and the last datapoint received "wins". This might not be an issue to you now, but later down the road you might find that you need to increase the datapoint reporting requency to get a better insight into your data.
Therein comes the question, how can I get around that? Often, StatsD is the answer - it's an aggregator that takes your individual metrics over a defined period of time, and churns out a histogram-like set of metrics with different statistical derivatives of your data (minimum, maximum, X-percentile, and so on). Suddenly you're then faced with the prospect of managing a Graphite instance or cluster, one (or more) StatsD service, and that's before you even get to the fun part of visualising your data: Grafana is often used here and also requires you to set up and maintain.
Conversely, assuming you will maintain that reporting frequency, but increase the number of devices (as you mentioned), you might find another component of your Graphite stack - Carbon-relay - running into some bottlenecking issues (as described here).
I work at MetricFire, formerly Hosted Graphite, where we had a lot of these considerations in mind when building our product/service. Collectively we process millions of datapoints per second across hundreds of accounts. Data is rolled up and stored at four resolutions: 5-seconds, 30-seconds, 5-minute, 1-hour, where each resolution is available for 24 hours, 3 days, six months and two years, respectively.
A key component of our set-up is that our storage is not built on the typical Whisper backend - instead we use a custom-built data store using Riak allowing us to do many things: scale easily and aggregate datapoints per metric into Data Views, to name a few. That article about Data Views was written by one of our engineers and goes into some detail about the decisions we made when building our storage layer.

Related

database design for large streaming data with minimal latency

Following is the scenario:
Customer places an order.
Order has type: Physical / Downloadable.
Order is placed from: Web / App.
Order is placed from a Location: UK,AUS,etc.
Can have more dimensions in future.
Consider that all of the dimensions change frequently in every order. And the data is quite huge, approximately 1.3 million records per hour.
Want to design this in a way that reports should be able able to drill down with any requested dimension for each customer.
Example:
- Customer 'A' has placed how many orders of type 'Physical' from 'AUS'
- Customer 'A' has placed how many orders in all.
- Customer 'A' has placed how many orders from of type 'Downloadable' from'APP'.
etc.
Need these reports on realtime, hence low latency writes and reads are a must. What nosql database can be a good fit. And how can this data be well structured to be able to sliced and diced in any required dimension as well as combination of more than one dimension.

If you need high performance then I would recommend ScyllaDB which can handle over 1M ops/s per node (on a good hardware). It shares data model with Cassandra so you can model and query your data using CQL. You can give it a free test drive with just couple of clicks here.
Regarding modeling: A useful technique is to model around your queries. So if you have a particular query you should prepare a table that will serve this query in most efficient way. In this technique you duplicate data by creating as many tables with the same data as many different types of queries you have. Duplicating data comes with a price so you need to trade off the performance and cost depending on your needs. You can read more about it here.

Suggestions on how to store and retrieve time-series data

I am currently working on a project that requires us to store a large amount of time series data, but more importantly, retrieve large amounts of it quick.
There will be N devices (>10,000) which will periodically send data to the system, lets say every 5 seconds. This data will quickly build up, but we are generally only interested in the most recent data, and want to compact the older data. We don't want to remove it, as it is still useful, but instead of having thousands of data point for a day, we might save just 5 or 10 after N days/weeks/months have passed.
Specifically we want to be able to fetch sampled data over a large time period, say a year or two. There might be millions of points here, but we just want a small, linearly distributed, sample of this data.
Today we are experimenting with influxdb, which initially seemed like an alright solution. It was fast enough and allows us to store our data in a reasonable structure, but we have found that it is not completely satisfactory. We were unable to perform the sample query described above and in general the system does not feel mature enough for us.
Any advice on how we can proceed, or alternative solutions, is much appreciated.

You might be interested in looking at TimescaleDB:
https://github.com/timescale/timescaledb
It builds a time-series DB on top of Postgres and so offers full SQL support, as well as generally the Postgres ecosystem/reliability. This can give you a lot greater query flexibility, which sounds like you want.
In terms of your specific use case, there would really be two solutions.
First, what people typically would do is to create two "hypertables", one for raw data, another for sampled data. These hypertables look like standard tables to the user, although heavily partitioned under the covers for much better scalability (e.g., 20x insert throughput vs. postgres for large table sizes).
Then you basically do a roll-up from the raw to the sampled table, and use a different data retention policy on each (so you keep raw data for say 1 month, with sampled data for years).
http://docs.timescale.com/getting-started/setup/starting-from-scratch
http://docs.timescale.com/api/data-retention
Second, you can go with a single hypertable, and then just schedule a normal SQL query to delete individual rows from data that's older than a certain time period.
We might even in the future add better first-class support for this latter approach if it becomes a common-enough requested feature, although most use cases we've encountered to date seemed more focused on #1, esp. in order to to keep statistical data about removed data-points, as opposed to just straight samples.
(Disclaimer: I'm one of the authors of TimescaleDB.)

Metrics vs Events

We are in the process of evaluating time series databases (TSDB) for our project.
My use case is to store historical events emanating from various sensors. The events can contain one or more attributes of different data types(e.g., strings, float, int etc).
As part of this evaluation exercise we came across few online materials where people say that certain type of TSDBs are suitable for metric stores, certain types are suitable for ,event stores and certain others are for both. Am a bit confused about the differences between metrics and events. Aren't metrics some kind of events? Can someone please help in understanding the difference in this context?

Metrics and events are two different types of time series data: regular and irregular, respectively. Regular data (metrics) are evenly distributed across time and can be used for processes like forecasting. Irregular data (events) are unpredictable, and while they still occur in temporal order, the intervals between events are inconsistent, which means that using them for forecasting or averaging could lead to unreliable results.
The basic difference is metrics occur at regular intervals and events don’t. Imagine I’m monitoring my personal website — I want to track the response codes to make sure the site is available, so I collect them at frequent intervals. I could then query those response code metrics to figure out what percentage of the time my site was down (because it was too popular). But I also want to know when a user clicks on an ad. I don’t know when or if this click will happen, so collecting at a regular interval doesn’t make sense. If I have 12 clicks for the past year, the average will be one click a month regardless if they could have all happened October (the peak of my popularity).

How to store and retrieve large numbers of data points for graphical visualization?

I'm thinking about building a web-based data logging and visualization service. The basic idea is that at some timed interval something (e.g. a sensor) reports a value (e.g. temperature) to the server. The server records this value into a database. There would be a web-based UI that allows me to view this data on a time-based graph. Ideally this graph would have various resolutions (last 30 seconds, last week, last year, etc). In a super ideal world, I would be able to zoom into the data for any point in time.
The problem is that the sensors are going to generate enormous amounts of data. For example, a sensor that reports a value every 5 seconds will generate about 18k values a day. I'm imagining a system that has thousands of sensors. Over time, this becomes lots of data.
The naive solution is to throw this data into a relational database and retrieve it in the various ways I want, but that won't scale.
The simple solution is to reduce the amount of data by performing periodic roll-ups of the data. New data might go into a table that has data points every 5 seconds. Every hour, some system pumps this data into another table that has data points every minute and the original data is deleted. This repeats for a few levels. The downside to this is that the further back in time you go, the less detailed the data is. That's probably fine. I would imagine that I would need enormous amounts of hardware to support full resolution of data over all time as compared to a system with this sort of rollup.
Is there a better way to do this? Is there an existing solution? I have to imagine this is a fairly common problem.

You probably want a fixed sized database like RRDTool: http://oss.oetiker.ch/rrdtool/
Also Graphite is built on top of a similar datastore implementation: http://graphite.wikidot.com/

Storing time-series data, relational or non?

I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.
I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:
What is better, a relational database (such as MySQL or PostgreSQL) or a non-relational or NoSQL database (such as MongoDB or Redis) with regard to performance when querying data for graphing.
Relational
Given a relational database, I would use a data_instances table, in which would be stored every instance of data captured for every metric being measured for all devices, with the following fields:
Fields: id fk_to_device fk_to_metric metric_value timestamp
When I want to draw a graph for a particular metric on a particular device, I must query this singular table filtering out the other devices, and the other metrics being analysed for this device:
SELECT metric_value, timestamp FROM data_instances
WHERE fk_to_device=1 AND fk_to_metric=2
The number of rows in this table would be:
d * m_d * f * t
where d is the number of devices, m_d is the accumulative number of metrics being recorded for all devices, f is the frequency at which data is polled for and t is the total amount of time the system has been collecting data.
For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have just under 5 million records.
Indexes
Without indexes on fk_to_device and fk_to_metric scanning this continuously expanding table would take too much time. So indexing the aforementioned fields and also timestamp (for creating graphs with localised periods) is a requirement.
Non-Relational (NoSQL)
MongoDB has the concept of a collection, unlike tables these can be created programmatically without setup. With these I could partition the storage of data for each device, or even each metric recorded for each device.
I have no experience with NoSQL and do not know if they provide any query performance enhancing features such as indexing, however the previous paragraph proposes doing most of the traditional relational query work in the structure by which the data is stored under NoSQL.
Undecided
Would a relational solution with correct indexing reduce to a crawl within the year? Or does the collection based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?

Definitely Relational. Unlimited flexibility and expansion.
Two corrections, both in concept and application, followed by an elevation.
Correction
It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).
Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:
(Device, Metric, DateTime)
The Id column does nothing, it is totally and completely redundant.
An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.
You can get rid of it. Please.
Elevation
Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the What is Sixth Normal Form ? heading onwards.
(I have one index only, not three; on the Non-SQLs you may need three indices).
I have the exact same table (without the Id "key", of course). I have an additional column Server. I support multiple customers remotely.
(Server, Device, Metric, DateTime)
The table can be used to Pivot the data (ie. Devices across the top and Metrics down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.
Monitor Statistics Data Model.
(Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)
It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.
One More Thing
Last but not least, SQL is a IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.

Found very interesting the above answers.
Trying to add a couple more considerations here.
1) Data aging
Time-series management usually need to create aging policies. A typical scenario (e.g. monitoring server CPU) requires to store:
1-sec raw samples for a short period (e.g. for 24 hours)
5-min detail aggregate samples for a medium period (e.g. 1 week)
1-hour detail over that (e.g. up to 1 year)
Although relational models make it possible for sure (my company implemented massive centralized databases for some large customers with tens of thousands of data series) to manage it appropriately, the new breed of data stores add interesting functionalities to be explored like:
automated data purging (see Redis' EXPIRE command)
multidimensional aggregations (e.g. map-reduce jobs a-la-Splunk)
2) Real-time collection
Even more importantly some non-relational data stores are inherently distributed and allow for a much more efficient real-time (or near-real time) data collection that could be a problem with RDBMS because of the creation of hotspots (managing indexing while inserting in a single table). This problem in the RDBMS space is typically solved reverting to batch import procedures (we managed it this way in the past) while no-sql technologies have succeeded in massive real-time collection and aggregation (see Splunk for example, mentioned in previous replies).

You table has data in single table. So relational vs non relational is not the question. Basically you need to read a lot of sequential data. Now if you have enough RAM to store a years worth data then nothing like using Redis/MongoDB etc.
Mostly NoSQL databases will store your data on same location on disk and in compressed form to avoid multiple disk access.
NoSQL does the same thing as creating the index on device id and metric id, but in its own way. With database even if you do this the index and data may be at different places and there would be a lot of disk IO.
Tools like Splunk are using NoSQL backends to store time series data and then using map reduce to create aggregates (which might be what you want later). So in my opinion to use NoSQL is an option as people have already tried it for similar use cases. But will a million rows bring the database to crawl (maybe not , with decent hardware and proper configurations).

Create a file, name it 1_2.data. weired idea? what you get:
You save up to 50% of space because you don't need to repeat the fk_to_device and fk_to_metric value for every data point.
You save up even more space because you don't need any indices.
Save pairs of (timestamp,metric_value) to the file by appending the data so you get a order by timestamp for free. (assuming that your sources don't send out of order data for a device)
=> Queries by timestamp run amazingly fast because you can use binary search to find the right place in the file to read from.
if you like it even more optimized start thinking about splitting your files like that;
1_2_january2014.data
1_2_february2014.data
1_2_march2014.data
or use kdb+ from http://kx.com because they do all this for you:) column-oriented is what may help you.
There is a cloud-based column-oriented solution popping up, so you may want to have a look at: http://timeseries.guru

You should look into Time series database. It was created for this purpose.
A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).
Popular example of time-series database InfluxDB

I think that the answer for this kind of question should mainly revolve about the way your Database utilize storage.
Some Database servers use RAM and Disk, some use RAM only (optionally Disk for persistency), etc.
Most common SQL Database solutions are using memory+disk storage and writes the data in a Row based layout (every inserted raw is written in the same physical location).
For timeseries stores, in most cases the workload is something like: Relatively-low interval of massive amount of inserts, while reads are column based (in most cases you want to read a range of data from a specific column, representing a metric)
I have found Columnar Databases (google it, you'll find MonetDB, InfoBright, parAccel, etc) are doing terrific job for time series.
As for your question, which personally I think is somewhat invalid (as all discussions using the fault term NoSQL - IMO):
You can use a Database server that can talk SQL on one hand, making your life very easy as everyone knows SQL for many years and this language has been perfected over and over again for data queries; but still utilize RAM, CPU Cache and Disk in a Columnar oriented way, making your solution best fit Time Series

5 Millions of rows is nothing for today's torrential data. Expect data to be in the TB or PB in just a few months. At this point RDBMS do not scale to the task and we need the linear scalability of NoSql databases. Performance would be achieved for the columnar partition used to store the data, adding more columns and less rows kind of concept to boost performance. Leverage the Open TSDB work done on top of HBASE or MapR_DB, etc.

I face similar requirements regularly, and have recently started using Zabbix to gather and store this type of data. Zabbix has its own graphing capability, but it's easy enough to extract the data out of Zabbix's database and process it however you like. If you haven't already checked Zabbix out, you might find it worth your time to do so.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight