We are in the process of evaluating time series databases (TSDBs) for our project.
My use case is to store historical events emanating from various sensors. The events can contain one or more attributes of different data types (e.g., string, float, int).
As part of this evaluation exercise we came across a few online resources where people say that certain types of TSDBs are suitable for metric stores, certain types are suitable for event stores, and certain others are suitable for both. I am a bit confused about the difference between metrics and events. Aren't metrics some kind of events? Can someone please help me understand the difference in this context?
Metrics and events are two different types of time series data: regular and irregular, respectively. Regular data (metrics) are evenly distributed across time and can be used for processes like forecasting. Irregular data (events) are unpredictable, and while they still occur in temporal order, the intervals between events are inconsistent, which means that using them for forecasting or averaging could lead to unreliable results.
The basic difference is that metrics occur at regular intervals and events don't. Imagine I'm monitoring my personal website - I want to track the response codes to make sure the site is available, so I collect them at frequent, regular intervals. I could then query those response code metrics to figure out what percentage of the time my site was down (because it was too popular). But I also want to know when a user clicks on an ad. I don't know when, or if, this click will happen, so collecting at a regular interval doesn't make sense. And if I have 12 clicks for the past year, the average will be one click a month, regardless of whether they all happened in October (the peak of my popularity).
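To make that pitfall concrete, here's a toy sketch (all data invented): a naive average over a year of irregular click events reports "1 click/month", hiding that every click actually landed in October.

```java
import java.time.LocalDate;
import java.time.Month;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClickStats {
    public static void main(String[] args) {
        // 12 clicks, all on the same (made-up) October day.
        List<LocalDate> clicks = Collections.nCopies(12, LocalDate.of(2023, 10, 15));

        double naiveAverage = clicks.size() / 12.0;
        System.out.println("Naive average: " + naiveAverage + " clicks/month");

        // Bucketing the irregular events into a regular monthly series
        // reveals the real distribution: 12 in October, 0 everywhere else.
        Map<Month, Long> byMonth = new TreeMap<>();
        for (LocalDate c : clicks) byMonth.merge(c.getMonth(), 1L, Long::sum);
        System.out.println("Clicks per month: " + byMonth);
    }
}
```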
My Flink application generates (complex) output events based on the processing of (simple) input events. The generated output events are to be consumed by other external services. My application uses event-time semantics, so I am a bit unsure about what I should use as the output events' timestamp.
Should I use:
the processing time at the moment of generating them?
the event time (given by the watermark value)?
both? (*)
For my use case, I am using both for now. But maybe you can come up with examples/justifications for each of the given options.
(*) In the case of using both, what naming would you use for the two fields? Something along the lines of event_time and processing_time seems to leak implementation details of my app to the external services...
There is no general answer to your question. It often depends on downstream requirements. Let's look at two simple cases:
A typical data processing pipeline ingests some kind of movement event (e.g., sensor data, a click on a web page, a search request) and enriches it with master data (e.g., sensor calibration data, user profiles, geographic information) through joins. The resulting event should then clearly have the same time as the input event.
A second pipeline aggregates the events from the first pipeline in a 15-minute tumbling window and simply counts them. Then fair options would be the start of the window, the end of the window, the time of the first event, the time of the last event, or some combination of these. Using the start/end of a window means the resulting signal is always defined. Using the first/last event timestamp is more precise when you actually want to see in the aggregates when things happened, although that usually also means you want a finer window resolution (1 min instead of 15 min). Whether you use the start or the end of a window is often more a matter of taste, and you are usually safer including both.
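As a rough sketch of that second pipeline in Flink's Java API (the Event POJO with its key and timestamp fields is invented for illustration): the 15-minute count carries the window boundaries and the first/last event timestamps downstream, so consumers can pick whichever notion of event time suits them.

```java
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class WindowTimestamps {
    public static class Event {
        public String key;
        public long timestamp; // event time, millis
    }

    // Output: (windowStart, windowEnd, firstEventTs, lastEventTs, count)
    public static DataStream<Tuple5<Long, Long, Long, Long, Long>> countPer15Min(
            DataStream<Event> events) {
        return events
            .keyBy(e -> e.key)
            .window(TumblingEventTimeWindows.of(Time.minutes(15)))
            .process(new ProcessWindowFunction<Event, Tuple5<Long, Long, Long, Long, Long>, String, TimeWindow>() {
                @Override
                public void process(String key, Context ctx, Iterable<Event> elements,
                                    Collector<Tuple5<Long, Long, Long, Long, Long>> out) {
                    long first = Long.MAX_VALUE, last = Long.MIN_VALUE, count = 0;
                    for (Event e : elements) {
                        first = Math.min(first, e.timestamp);
                        last = Math.max(last, e.timestamp);
                        count++;
                    }
                    // Emit both the window boundaries and the observed
                    // first/last event times alongside the count.
                    out.collect(Tuple5.of(ctx.window().getStart(), ctx.window().getEnd(),
                                          first, last, count));
                }
            });
    }
}
```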
In neither of these cases is processing time relevant at all. In fact, if your input carries event time, I'd argue that there is no good reason to use processing time: you cannot do meaningful reprocessing with it.
You can still add processing time, but for a different reason: to measure the end-to-end latency of a very complex data analytics pipeline including multiple technologies and jobs.
I'd like to know if a time-series database will crumble with this scenario:
I have tens of thousands of IoT devices, each sending 4 different values every 5 minutes.
I will query those values for each device, for certain time spans. My question is:
Is a TSDB approach feasible and scalable up to, e.g., a million devices, with metrics like:
iot.key1.value1
iot.key1.value2
iot.key1.value3
iot.key1.value4
iot.key2.value1
.
.
.
iot.key1000000.value4
? Or is that way too many metrics?
The retention policy will be 2 years, with possible roll-ups after (TBA) months. But I think this consideration only matters for disk size.
Right now I'm using Graphite.
A reporting frequency of five minutes should be fairly manageable; just be sure to set your storage schema with five minutes as the smallest resolution, in order to save space, since you won't need to hold on to data at shorter intervals.
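For reference, a minimal storage-schemas.conf entry along those lines might look like this (the match pattern and retention windows are illustrative, not a recommendation):

```
[iot]
pattern = ^iot\.
retentions = 5m:30d,1h:2y
```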
With that said, scaling a Graphite cluster to meet your needs isn't easy, as Whisper isn't optimized for this. There are several resources/stories where others have shared their dismay trying to achieve this, for example: here and here
There are other limitations to consider too. Whisper is configured in such a way that it can record only one datapoint per timestamp, and the last datapoint received "wins". This might not be an issue for you now, but later down the road you might find that you need to increase the datapoint reporting frequency to get better insight into your data.
Therein comes the question: how can I get around that? Often, StatsD is the answer - it's an aggregator that takes your individual metrics over a defined period of time and churns out a histogram-like set of metrics with different statistical derivatives of your data (minimum, maximum, X-percentile, and so on). Suddenly you're faced with the prospect of managing a Graphite instance or cluster, one (or more) StatsD services, and that's before you even get to the fun part of visualising your data: Grafana is often used here, and it also requires setup and maintenance.
Conversely, assuming you will maintain that reporting frequency, but increase the number of devices (as you mentioned), you might find another component of your Graphite stack - Carbon-relay - running into some bottlenecking issues (as described here).
I work at MetricFire, formerly Hosted Graphite, where we had a lot of these considerations in mind when building our product/service. Collectively we process millions of datapoints per second across hundreds of accounts. Data is rolled up and stored at four resolutions: 5 seconds, 30 seconds, 5 minutes, and 1 hour, with each resolution available for 24 hours, 3 days, 6 months, and 2 years, respectively.
A key component of our set-up is that our storage is not built on the typical Whisper backend - instead we use a custom-built data store on top of Riak that allows us to do many things: scale easily and aggregate datapoints per metric into Data Views, to name a few. That article about Data Views was written by one of our engineers and goes into some detail about the decisions we made when building our storage layer.
I am currently working on a project that requires us to store a large amount of time series data, but more importantly, to retrieve large amounts of it quickly.
There will be N devices (>10,000) which will periodically send data to the system, let's say every 5 seconds. This data will build up quickly, but we are generally only interested in the most recent data and want to compact the older data. We don't want to remove it, as it is still useful, but instead of having thousands of data points for a day, we might save just 5 or 10 after N days/weeks/months have passed.
Specifically we want to be able to fetch sampled data over a large time period, say a year or two. There might be millions of points here, but we just want a small, linearly distributed, sample of this data.
Today we are experimenting with InfluxDB, which initially seemed like a decent solution. It was fast enough and allows us to store our data in a reasonable structure, but we have found that it is not completely satisfactory: we were unable to perform the sampling query described above, and in general the system does not feel mature enough for us.
Any advice on how we can proceed, or alternative solutions, is much appreciated.
You might be interested in looking at TimescaleDB:
https://github.com/timescale/timescaledb
It builds a time-series DB on top of Postgres and so offers full SQL support, as well as the general Postgres ecosystem/reliability. This can give you a lot more query flexibility, which it sounds like you want.
In terms of your specific use case, there would really be two solutions.
First, what people typically do is create two "hypertables": one for raw data, another for sampled data. These hypertables look like standard tables to the user, although they are heavily partitioned under the covers for much better scalability (e.g., 20x insert throughput vs. vanilla Postgres at large table sizes).
Then you basically do a roll-up from the raw to the sampled table, and use a different data retention policy on each (so you keep raw data for, say, 1 month, and sampled data for years).
http://docs.timescale.com/getting-started/setup/starting-from-scratch
http://docs.timescale.com/api/data-retention
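Concretely, the first approach might look roughly like the JDBC sketch below. The table and column names are invented for illustration, and the exact drop_chunks signature varies across TimescaleDB versions, so check the docs linked above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TwoHypertables {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/tsdb", "postgres", "postgres");
             Statement st = conn.createStatement()) {

            // Raw readings at full resolution.
            st.execute("CREATE TABLE raw_data (" +
                       "  time      TIMESTAMPTZ NOT NULL," +
                       "  device_id TEXT        NOT NULL," +
                       "  value     DOUBLE PRECISION)");
            st.execute("SELECT create_hypertable('raw_data', 'time')");

            // Downsampled copy, kept for years.
            st.execute("CREATE TABLE sampled_data (" +
                       "  time      TIMESTAMPTZ NOT NULL," +
                       "  device_id TEXT        NOT NULL," +
                       "  avg_value DOUBLE PRECISION)");
            st.execute("SELECT create_hypertable('sampled_data', 'time')");

            // Roll-up job (run periodically, e.g. from cron): one averaged
            // point per device per hour.
            st.execute("INSERT INTO sampled_data " +
                       "SELECT time_bucket('1 hour', time), device_id, avg(value) " +
                       "FROM raw_data " +
                       "WHERE time > now() - interval '1 day' " +
                       "GROUP BY 1, 2");

            // Retention: drop raw chunks older than a month (signature
            // varies by TimescaleDB version).
            st.execute("SELECT drop_chunks(interval '1 month', 'raw_data')");
        }
    }
}
```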
Second, you can go with a single hypertable, and then just schedule a normal SQL query to delete individual rows from data that's older than a certain time period.
We might even in the future add better first-class support for this latter approach if it becomes a commonly-requested feature, although most use cases we've encountered to date seemed more focused on #1, especially in order to keep statistical data about removed data points, as opposed to just straight samples.
(Disclaimer: I'm one of the authors of TimescaleDB.)
I'm thinking about building a web-based data logging and visualization service. The basic idea is that at some timed interval something (e.g. a sensor) reports a value (e.g. temperature) to the server. The server records this value into a database. There would be a web-based UI that allows me to view this data on a time-based graph. Ideally this graph would have various resolutions (last 30 seconds, last week, last year, etc). In a super ideal world, I would be able to zoom into the data for any point in time.
The problem is that the sensors are going to generate enormous amounts of data. For example, a sensor that reports a value every 5 seconds will generate about 18k values a day. I'm imagining a system that has thousands of sensors. Over time, this becomes lots of data.
The naive solution is to throw this data into a relational database and retrieve it in the various ways I want, but that won't scale.
The simple solution is to reduce the amount of data by performing periodic roll-ups of the data. New data might go into a table that has data points every 5 seconds. Every hour, some system pumps this data into another table that has data points every minute and the original data is deleted. This repeats for a few levels. The downside to this is that the further back in time you go, the less detailed the data is. That's probably fine. I would imagine that I would need enormous amounts of hardware to support full resolution of data over all time as compared to a system with this sort of rollup.
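For concreteness, here's a minimal in-memory sketch of one such roll-up level (5-second samples averaged into 1-minute points). The Point type is invented for illustration; a real job would read from and write to the tables described above.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class Rollup {
    public record Point(Instant time, double value) {}

    public static List<Point> rollUpToMinutes(List<Point> fiveSecondPoints) {
        Map<Long, double[]> buckets = new TreeMap<>(); // minute epoch -> {sum, count}
        for (Point p : fiveSecondPoints) {
            long minute = p.time().getEpochSecond() / 60;
            double[] acc = buckets.computeIfAbsent(minute, k -> new double[2]);
            acc[0] += p.value();
            acc[1] += 1;
        }
        List<Point> out = new ArrayList<>();
        buckets.forEach((minute, acc) ->
            out.add(new Point(Instant.ofEpochSecond(minute * 60), acc[0] / acc[1])));
        return out;
    }
}
```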
Is there a better way to do this? Is there an existing solution? I have to imagine this is a fairly common problem.
You probably want a fixed-size database like RRDTool: http://oss.oetiker.ch/rrdtool/
Graphite is also built on top of a similar datastore implementation: http://graphite.wikidot.com/
I am working with Google App Engine, using the low-level Java API to access Bigtable. I'm building a SaaS application with 4 layers:
Client web browser
RESTful resources layer
Business layer
Data access layer
I'm building an application to help manage my mobile auto detailing company (and others like it). I have to represent these four separate concepts, but am unsure if my current plan is a good one:
Appointments
Line Items
Invoices
Payments
Appointment: An "Appointment" is a place and time where employees are expected to be in order to deliver a service.
Line Item: A "Line Item" is a service, fee or discount and its associated information. An example of line items that might go into an appointment:
Name                          Price   Commission   Time estimate
Full Detail, Regular Size      $160          $75       3.5 hours
$10 Off Full Detail Coupon     -$10           $0         0 hours
Premium Detail                 $220         $110       4.5 hours

Derived totals (not a line item): $370, $185, 8.0 hours
Invoice: An "Invoice" is a record of one or more line items that a customer has committed to pay for.
Payment: A "Payment" is a record of what payments have come in.
In a previous implementation of this application, life was simpler and I treated all four of these concepts as one table in a SQL database: "Appointment." One "Appointment" could have multiple line items, multiple payments, and one invoice. The invoice was just an e-mail or print out that was produced from the line items and customer record.
9 out of 10 times, this worked fine. When one customer made one appointment for one or a few vehicles and paid for it themselves, all was grand. But this system didn't work under a lot of conditions. For example:
When one customer made one appointment, but the appointment got rained out halfway through, resulting in the detailer having to come back the next day, I needed two appointments, but only one line item, one invoice and one payment.
When a group of customers at an office all decided to have their cars done the same day in order to get a discount, I needed one appointment, but multiple invoices and multiple payments.
When one customer paid for two appointments with one check, I needed two appointments, but only one invoice and one payment.
I was able to handle all of these outliers by fudging things a little. For example, if a detailer had to come back the next day, I'd just make another appointment on the second day with a line item that said "Finish Up" with a cost of $0. Or if I had one customer pay for two appointments with one check, I'd put split payment records in each appointment. The problem with this is that it creates a huge opportunity for data inconsistency. Data inconsistency can be a serious problem, especially for cases involving financial information such as the third example, where the customer paid for two appointments with one check. Payments must be matched up directly with goods and services rendered in order to properly keep track of accounts receivable.
Proposed structure:
Below is a normalized structure for organizing and storing this data. Perhaps because of my inexperience, I place a lot of emphasis on data normalization because it seems like a great way to avoid data inconsistency errors. With this structure, changes to the data can be done with one operation, without having to worry about updating other tables. Reads, however, can require multiple queries coupled with in-memory organization of the data. I figure later on, if there are performance issues, I can add some denormalized fields to "Appointment" for faster querying while keeping the "safe" normalized structure intact. Denormalization could potentially slow down writes, but I was thinking that I might be able to make asynchronous calls to other resources or add to the task queue so that the client does not have to wait for the extra writes that update the denormalized portions of the data.
Tables:
Appointment
start_time
etc...
Invoice
due_date
etc...
Payment
invoice_key_list
amount_paid
etc...
Line_Item
appointment_key_list
invoice_key
name
price
etc...
The following is the series of queries and operations required to tie all four entities (tables) together for a given list of appointments. This would include information on what services were scheduled for each appointment, the total cost of each appointment, and whether or not payment has been received for each appointment. This would be a common query when loading the calendar for appointment scheduling, or for a manager to get an overall view of operations.
QUERY for the list of "Appointments" whose "start_time" field lies within the given range.
Add each key from the returned appointments into a List.
QUERY for all "Line_Items" whose appointment_key_list field includes any of the returned appointment keys.
Add each invoice_key from all of the line items into a Set collection.
QUERY for all "Invoices" in the invoice key set (this can be done in one asynchronous operation using App Engine).
Add each key from the returned invoices into a List.
QUERY for all "Payments" whose invoice_key_list field contains a key matching any of the returned invoices.
Reorganize in memory so that each appointment reflects the line_items that are scheduled for it, the total price, the total estimated time, and whether or not it has been paid for.
...As you can see, this operation requires 4 datastore queries as well as some in-memory organization (hopefully the in-memory part will be pretty fast).
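For reference, the first of those queries might look like this with the low-level datastore API (the kind and property names follow the tables above; everything else is illustrative):

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.CompositeFilterOperator;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class AppointmentQueries {
    static List<Key> appointmentKeysInRange(Date rangeStart, Date rangeEnd) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        // Range filter on a single property (start_time).
        Query q = new Query("Appointment").setFilter(CompositeFilterOperator.and(
            new FilterPredicate("start_time", FilterOperator.GREATER_THAN_OR_EQUAL, rangeStart),
            new FilterPredicate("start_time", FilterOperator.LESS_THAN, rangeEnd)));

        List<Key> keys = new ArrayList<>();
        for (Entity appt : ds.prepare(q).asIterable(FetchOptions.Builder.withDefaults())) {
            keys.add(appt.getKey()); // feeds queries #2 through #4
        }
        return keys;
    }
}
```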
Can anyone comment on this design? This is the best I could come up with, but I suspect there might be better options or completely different designs that I'm not thinking of that might work better in general or specifically under GAE's (google app engine) strengths, weaknesses, and capabilities.
Thanks!
Usage clarification
Most applications are more read-intensive; some are more write-intensive. Below, I describe a typical use case and break down the operations that the user would want to perform:
Manager gets a call from a customer:
Read - Manager loads the calendar and looks for a time that is available
Write - Manager queries the customer for their information. I pictured this as a succession of asynchronous writes as the manager enters each piece of information, such as phone number, name, e-mail, address, etc. Or, if necessary, perhaps one write at the end after the client application has gathered all of the information.
Write - Manager takes down customer's credit card info and adds it to their record as a separate operation
Write - Manager charges credit card and verifies that the payment went through
Manager makes an outgoing phone call:
Read - Manager loads the calendar
Read - Manager loads the appointment for the customer he wants to call
Write - Manager clicks the "Call" button; a call is initiated and a new CallRecord entity is written
Read - Call server responds to the call request and reads the CallRecord to find out how to handle the call
Write - Call server writes updated information to the CallRecord
Write - When the call is closed, the call server makes another request to update the CallRecord resource (note: this request is not time-critical)
Accepted answer:
Both of the top two answers were very thoughtful and appreciated. I accepted the one with fewer votes in order to imperfectly equalize their exposure as much as possible.
You specified two specific "views" your website needs to provide:
Scheduling an appointment. Your current scheme should work just fine for this - you'll just need to do the first query you mentioned.
Overall view of operations. I'm not really sure what this entails, but if you need to do the string of four queries you mentioned above to get this, then your design could use some improvement. Details below.
Four datastore queries in and of itself isn't necessarily overboard. The problem in your case is that two of the queries are expensive and probably even impossible. I'll go through each query:
Getting a list of appointments - no problem. This query will be able to scan an index to efficiently retrieve the appointments in the date range you specify.
Get all line items for each appointment from #1 - this is a problem. This query requires that you do an IN query. IN queries are transformed into N sub-queries behind the scenes - so you'll end up with one query per appointment key from #1! These will be executed in parallel, so that isn't so bad. The main problem is that IN queries are limited to only a small list of values (up to just 30 values). If #1 returns more than 30 appointment keys, then this query will fail to execute!
Get all invoices referenced by line items - no problem. You are correct that this query is cheap because you can simply fetch all of the relevant invoices directly by key. (Note: this query is still synchronous - I don't think asynchronous was the word you were looking for).
Get all payments for all invoices returned by #3 - this is a problem. Like #2, this query will be an IN query and will fail if #3 returns even a moderate number of invoices which you need to fetch payments for.
If the number of items returned by #1 and #3 is small enough, then GAE will almost certainly be able to do this within the allowed limits. And that should be good enough for your personal needs - it sounds like you mostly need it to work, and don't need it to scale to huge numbers of users (it won't).
Suggestions for improvement:
Denormalization! Try storing the keys for the Line_Item, Invoice, and Payment entities relevant to a given appointment in lists on the appointment itself. Then you can eliminate your IN queries; a rough sketch follows below. Make sure these new list properties are not indexed, to avoid problems with exploding indices.
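A sketch of that denormalization with the low-level API (property names are illustrative): store the related keys unindexed on the appointment, then batch-get them by key instead of issuing IN queries.

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import java.util.List;
import java.util.Map;

public class DenormalizedAppointment {
    static final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    static void attachKeys(Entity appointment, List<Key> lineItemKeys,
                           List<Key> invoiceKeys, List<Key> paymentKeys) {
        // Unindexed, to avoid exploding indices on the list properties.
        appointment.setUnindexedProperty("line_item_keys", lineItemKeys);
        appointment.setUnindexedProperty("invoice_keys", invoiceKeys);
        appointment.setUnindexedProperty("payment_keys", paymentKeys);
        ds.put(appointment);
    }

    @SuppressWarnings("unchecked")
    static Map<Key, Entity> loadLineItems(Entity appointment) {
        List<Key> keys = (List<Key>) appointment.getProperty("line_item_keys");
        return ds.get(keys); // one batch fetch by key, no IN query
    }
}
```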
Other less specific ideas for improvement:
Depending on what your "overall view of operations" is going to show, you might be able to split up the retrieval of all this information. For example, perhaps you start by showing a list of appointments, and then when the manager wants more information about a particular appointment you go ahead and fetch the information relevant to that appointment. You could even do this via AJAX if you want this interaction to take place on a single page.
Memcache is your friend - use it to cache the results of datastore queries (or even higher level results) so that you don't have to recompute it from scratch on every access.
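A minimal sketch of that caching idea; buildCalendarView stands in for the 4-query assembly described in the question, and the key format and 60-second TTL are illustrative choices.

```java
import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import java.util.Date;

public class CalendarCache {
    private static final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    static Object calendarView(Date rangeStart, Date rangeEnd) {
        String cacheKey = "calendar:" + rangeStart.getTime() + ":" + rangeEnd.getTime();
        Object view = cache.get(cacheKey);
        if (view == null) {
            view = buildCalendarView(rangeStart, rangeEnd); // the expensive path
            cache.put(cacheKey, view, Expiration.byDeltaSeconds(60));
        }
        return view;
    }

    private static Object buildCalendarView(Date start, Date end) {
        // ... run the queries and in-memory assembly described above ...
        return null; // placeholder
    }
}
```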
As you've noticed, this design doesn't scale. It requires 4 (!!!) DB queries to render the page. That's 3 too many :)
The prevailing notion of working with the App Engine Datastore is that you want to do as much work as you possibly can when something is written, so that almost nothing needs to be done when something is retrieved and rendered. You presumably write the data very few times, compared to how many times it's rendered.
Normalization is similarly something that you seem to be striving for. The Datastore doesn't place any value in normalization -- it may mean less data incongruity, but it also means reading data is muuuuuch slower (4 reads?!!). Since your data is read much more often than it's written, optimize for reads, even if that means your data will occasionally be duplicated or out of sync for a short amount of time.
Instead of thinking about how the data looks when it's stored, think about how you want the data to look when it's displayed to the user. Store as close to that format as you can, even if that means literally storing pre-rendered HTML in the datastore. Reads will be lightning-fast, and that's a good thing.
So since you should optimize for reads, oftentimes your writes will grow to gigantic proportions, so gigantic that you can't fit them in the 30-second time limit for requests. Well, that's what the task queue is for. Store what you consider the "bare necessities" of your model in the datastore, then fire off a task queue task to pull it back out, generate the HTML to be rendered, and put it in there in the background. This might mean your model isn't immediately ready to display until the task has finished with it, so you'll need graceful degradation in this case, even if that means rendering it "the slow way" until the data is fully populated. Any further reads will be lightning-quick.
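A sketch of that write-then-render pattern: save the bare entity, then hand the expensive rendering to the task queue. The task URL and parameter name are invented; a handler at that URL would load the entity, render the display-ready version, and write it back.

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class SaveAndRender {
    static Key saveAppointment(Entity appointment) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Key key = ds.put(appointment); // the "bare necessities" write

        // Background task regenerates the pre-rendered view.
        Queue queue = QueueFactory.getDefaultQueue();
        queue.add(TaskOptions.Builder
            .withUrl("/tasks/render-appointment")        // hypothetical handler
            .param("key", KeyFactory.keyToString(key)));
        return key;
    }
}
```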
In summary, I don't have any specific advice directly related to your database -- that's dependent on what you want the data to look like when the user sees it.
What I can give you are some links to some super helpful videos about the datastore:
Brett Slatkin's 2008 and 2009 talks on building scalable, complex apps on App Engine, and a great one from this year about data pipelines (which isn't directly applicable I think, but really useful in general)
App Engine Under the Covers: How App Engine does what it does, behind the scenes
AppStats: a great way to see how many datastore reads you're performing, and some tips on reducing that number
Here are a few App Engine-specific factors that I think you'll have to contend with:
When querying using an inequality, you can only use an inequality on one property. For example, if you are filtering on an appointment date being between July 1st and July 4th, you couldn't also filter by price > 200.
Transactions on App Engine are a bit tricky compared to the SQL database you are probably used to. You can only do transactions on entities that are in the same "entity group".
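To illustrate the entity-group point, a minimal sketch with the low-level API: making a Line_Item a child of its Appointment puts both in one entity group, so they can be updated in a single transaction. (This is one possible modelling choice, not necessarily the right one for the multi-appointment cases described above.)

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Transaction;

public class EntityGroupExample {
    public static void main(String[] args) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        Entity appointment = new Entity("Appointment");
        Key apptKey = ds.put(appointment);

        Transaction txn = ds.beginTransaction();
        try {
            // Parent key places the line item in the appointment's entity
            // group, making the transactional put legal.
            Entity lineItem = new Entity("Line_Item", apptKey);
            lineItem.setProperty("name", "Full Detail, Regular Size");
            lineItem.setProperty("price", 160L);
            ds.put(txn, lineItem);
            txn.commit();
        } finally {
            if (txn.isActive()) txn.rollback();
        }
    }
}
```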