I'm having a problem that's limiting me quite a bit. We are trying to sample our data by grouping on time: we have millions of points and want to fetch every Nth point in a given interval. We have implemented a solution that calculates the time span of the interval and then groups by it to return the correct number of points.
SELECT last(value) as value FROM measurement WHERE time >= '...' AND time <= '...' GROUP BY time(calculated_time) fill(none)
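The interval is calculated roughly along these lines (a sketch only; the target point count and variable names are illustrative, not our actual implementation):
from datetime import datetime

# Derive the GROUP BY interval from the query range and the number of
# points we want back (values below are purely illustrative).
start = datetime(2016, 1, 1)
end = datetime(2017, 1, 1)
target_points = 23000

interval_minutes = max(1, round((end - start).total_seconds() / 60 / target_points))
group_by = "time({}m)".format(interval_minutes)   # -> "time(23m)" for this range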
The number of points returned seems to be correct, but the dates are not.
See the results below:
Without sampling
> SELECT value FROM "measurement" WHERE time >= '2016-01-01T00:00:00Z' AND time <= '2017-01-01T00:00:00Z' LIMIT 5;
name: measurement
time value
---- -----
2016-01-01T00:00:00Z 61.111
2016-01-01T01:00:00Z 183.673
2016-01-01T02:00:00Z 200
2016-01-01T03:00:00Z 66.667
2016-01-01T04:00:00Z 97.959
With Sampling
> SELECT last(value) as value FROM "measurement" WHERE time >= '2016-01-01T00:00:00Z' AND time <= '2017-01-01T00:00:00Z' GROUP BY time(23m) fill(none) LIMIT 5;
name: measurement
time value
---- -----
2015-12-31T23:44:00Z 61.111
2016-01-01T00:53:00Z 183.673
2016-01-01T01:39:00Z 200
2016-01-01T02:48:00Z 66.667
2016-01-01T03:57:00Z 97.959
I expect the returned data to keep the timestamps stored in the database, regardless of the interval used in the aggregation function. Instead, the times returned are multiples of the aggregation interval; that is, if my aggregation is GROUP BY time(7m), the returned points are a multiple of 7 minutes apart.
If there is no solution to my problem with Influx, is there an alternative database I can use to accomplish this? The data in this example is uniform and evenly spaced, but that is not always the case; more often than not it will be irregularly spaced (gaps ranging from seconds to minutes).
Related
We have an API that queries an Influx database and a report functionality was implemented so the user can query data using a start and end date.
The problem is that when a longer period is chosen (usually more than 8 weeks), we get a timeout from Influx; the query takes around 13 seconds to run. When the query returns a dataset successfully, we store it in a cache.
The most time-consuming part of the query is probably the comparisons and averages we compute, something like this:
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
FROM $MEASUREMENT
WHERE time >= $startDate AND time < $endDate
AND ("field" = 'myFieldValue' )
GROUP BY "tagname"
What would be the best approach to fix this? I can of course limit the number of weeks the user can choose, but I guess that's not the ideal fix.
How would you approach this? Increase the timeout? Batch the query? Is there any database optimization that would make this run faster?
In cases like this, where you let the user select a range in days, I would suggest keeping another table that stores the result (min, max and mean) of each day as a document. This table can be populated by a job that runs after the end of each day.
You can also consider changing the granularity from one document per day to one per week or per month, based on how you plot the values, and you can add more fields, such as tagname in your case.
The reason this is superior to a cache: with a cache you only store the result of one specific query, so every different combination still has to be computed in real time. Here, the cumulative results are already available, and there is a much smaller dataset to compute over.
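As a rough illustration (the field names here are hypothetical, not taken from your schema), one pre-aggregated daily document might look like:
{
  "day": "2016-01-01",
  "tagname": "sensor-42",
  "min": 12.4,
  "max": 87.9,
  "mean": 45.2
}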
Based on your query, I assume you are using InfluxDB 1.x. You could try Continuous Queries (CQs), which are InfluxQL queries that run automatically and periodically on real-time data and store their results in a specified measurement.
In your case, you could create a CQ for each report and let your users query it.
e.g.:
Step 1: create a CQ
CREATE CONTINUOUS QUERY "cq_basic_rp" ON "db"
BEGIN
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "mean_min_max"
FROM $MEASUREMENT
WHERE "field" = 'myFieldValue' // note that the time filter is not here
GROUP BY time(1h), "tagname" // here you can define the job interval
END
Step 2: Query against that CQ
SELECT * FROM "mean_min_max"
WHERE time >= $startDate AND time < $endDate -- here you can pass the user's time filter
Since you already ask InfluxDB to run these aggregates continuously based on the specified interval, you should be able to trade space for time.
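One caveat: a basic CQ only computes results for intervals after it is created, so existing history needs a one-off backfill. A rough sketch, using the same placeholders as above ($historyStart stands in for however far back your data goes):
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "mean_min_max"
FROM $MEASUREMENT
WHERE "field" = 'myFieldValue' AND time >= $historyStart AND time < now()
GROUP BY time(1h), "tagname"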
I previously touched on this problem in my post "What's the best way to return a sample of data over a period?", but while the potential solutions offered there were promising, ultimately they didn't solve the problem I have (which has since changed slightly, too).
The problem
The problem I'm trying to solve is quickly returning aggregated data, split into time-bound chunks, from a large dataset. For example, from a collection of temperature readings, return the average hourly reading for the last 24 hours.
For this example, let's say we have a collection named observations that contains temperature data collected from many devices at a rate of one reading per second. This collection holds a large amount of data (the dataset I'm working with has 120M documents). Each document contains the following three fields: deviceId, timestamp, temperature.
For clarity, this gives us:
200 devices
3,600 documents per device per hour
86,400 documents per device per day
17,280,000 documents per day (all devices)
120,960,000 documents per week (all devices)
Retrieving data for a device for a period of time is trivial:
FOR o IN observations
FILTER o.deviceId == #deviceId
FILTER o.timestamp >= #start AND o.timestamp <= #end
RETURN o
The issue comes when trying to return aggregated data. Let's say that we want to return three sets of data for a specific deviceId:
The average daily reading for the last week (7 records from 17,280,000)
The average hourly readings for the last day (24 records from 86,400)
The average minutely readings for the last hour (60 records from 3,600)
(In case it affects the potential solutions: some of the data may not be per second but at a lower rate, for example every 15 seconds, perhaps even per minute. There may also be missing data for certain periods. While I've assumed ideal conditions for the figures above, we all know reality is rarely that simple.)
What I've tried so far
I've looked into using WINDOW (example below), but the query ran very slowly; whether this was an issue with the query or with the volume of data, I don't know, and I couldn't find much information on it. Also, this still requires a way to perform multiple readings, one per period of time (what I think of as a step).
FOR o IN observations
FILTER o.deviceId = #deviceId
WINDOW DATE_TIMESTAMP(o.timestamp) WITH { preceding: "PT60M" }
AGGREGATE temperature = AVG(o.temperature)
RETURN {
timestamp: o.timestamp,
temperature
}
I also looked at ways of filtering the timestamps based on a modulus as suggested in the previous thread but that didn't account for averages or for data that may be missing (a missed update, so no record with an exact timestamp, for example).
I've also pulled the data out and filtered it outside of ArangoDB, but this isn't really a solution: it's slow, and for large volumes of data (17M documents for a week of readings) it simply wasn't workable.
So I looked at recreating some of that logic in the query by stepping through each chunk of data and returning the average values. This works, but isn't very performant (taking roughly 10 s for me, admittedly on a relatively low-powered box, but still slow):
LET steps = 24
LET stepsRange = 0..23
LET diff = #end - #start
LET interval = diff / steps
LET filteredObservations = (
FOR o IN observations
FILTER o.deviceId == #deviceId
FILTER o.timestamp >= #start AND o.timestamp <= #end
RETURN o
)
FOR step IN stepsRange
RETURN (
LET stepStart = #start + (interval * step)
LET stepEnd = stepStart + interval
FOR f IN filteredObservations
FILTER f.timestamp >= stepStart AND f.timestamp <= stepEnd
COLLECT AGGREGATE temperature = AVG(f.temperature)
RETURN { step, temperature }
)
I've also tried variations of the above using WINDOW, but without much luck. I'm not massively across the graph functionality of ArangoDB, coming from a relational/document database background, so I wonder if there is something there that could make querying this data quicker and easier.
Summary
I expect this query to be run simultaneously for several different time ranges and devices by many users, so the performance really needs to be < 1 s. In terms of compromises, if this could be achieved by picking one record from each time chunk, that would be okay.
I figured out the answer to this at about 2:00 am last night, waking up and scribbling down a general idea. I've just tested it and it seems to run quite quickly.
My thought was this: grabbing an average between two timestamps is a quick query, so if we simplify the overall query to run one filtered aggregate per time step, then the cost of the query is simply linear in the number of data points required.
It isn't much different from the example above:
# Slow method
LET steps = 24
LET stepsRange = 0..23
LET diff = #end - #start
LET interval = diff / steps
LET filteredObservations = (
FOR o IN observations
FILTER o.deviceId == #deviceId
FILTER o.timestamp >= #start AND o.timestamp <= #end
RETURN o
)
FOR step IN stepsRange
RETURN (
LET stepStart = #start + (interval * step)
LET stepEnd = stepStart + interval
FOR f IN filteredObservations
FILTER f.timestamp >= stepStart AND f.timestamp <= stepEnd
COLLECT AGGREGATE temperature = AVG(f.temperature)
RETURN { step, temperature }
)
Instead of filtering the observations first and then filtering that subset again and again to collect the aggregates, we just loop from 0 to n (23 in this case) and run a query that filters and aggregates each step's result directly:
# Quick method
LET steps = 24
LET stepsRange = 0..23
LET diff = #end - #start
LET interval = diff / steps
FOR step IN stepsRange
RETURN FIRST(
LET stepStart = #start + (interval * step)
LET stepEnd = stepStart + interval
RETURN FIRST(
FOR o IN observations
FILTER o.deviceId == #deviceId
FILTER o.timestamp >= stepStart AND o.timestamp <= stepEnd
COLLECT AGGREGATE temperature = AVG(o.temperature)
RETURN temperature
)
)
In my case, the total query time is around 75 ms for 24 data points (hourly averages from 24 hours' worth of data). Increasing this to 48 points only increases the run time by about 25%, and returning 1,440 steps (per-minute averages) runs in 132 ms.
I'd say this qualifies as a performant query.
(It's worth noting that I have a persistent index on this collection, without which the query is very slow.)
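For completeness, such an index can be created from arangosh; here is a minimal sketch, assuming it covers deviceId and timestamp (the exact fields of my index aren't shown above):
// persistent index so the deviceId + timestamp range filters don't scan the whole collection
db.observations.ensureIndex({
  type: "persistent",
  fields: ["deviceId", "timestamp"]
});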
I want to store trades as well as best ask/bid data, where the latter updates much more rapidly than the former, in InfluxDB.
I want to, if possible, use a schema that allows me to query: "for each trade on market X, find the best ask/bid on market Y whose timestamp is <= the timestamp of the trade".
(I'll use any version of Influx.)
For example, trades might look like this:
Time Price Volume Direction Market
00:01.000 100 5 1 foo-bar
00:03.000 99 50 0 bar-baz
00:03.050 99 25 0 foo-bar
00:04.000 101 15 1 bar-baz
And tick data might look more like this:
Time Ask Bid Market
00:00.763 100 99 bar-baz
00:01.010 101 99 foo-bar
00:01.012 101 98 bar-baz
00:01.012 101 99 foo-bar
00:01.238 100 99 bar-baz
...
00:03.021 101 98 bar-baz
I would want to be able to somehow join each trade for some market, e.g. foo-bar, with only the most recent ask/bid data point on some other market, e.g. bar-baz, and get a result like:
Time Trade Price Ask Bid
00:01.000 100 100 99
00:03.050 99 101 98
Such that I could compute the difference between the trade price on market foo-bar and the most recently quoted ask or bid on market bar-baz.
Right now, I store trades in one time series and ask/bid data points in another and merge them on the client side, with logic along the lines of:
function merge(trades, quotes, data_points)
next_trade, more_trades = first(trades), rest(trades)
quotes = drop-while (quote.timestamp < next_trade.timestamp) quotes
data_point = join(next_trade, first(quotes))
if more_trades
return merge(more_trades, quotes, data_points + data_point)
return data_points + data_point
The problem is that the client has to discard tons of ask/bid data points because they update so frequently, and only the most recent update before the trade is relevant.
There are tens of markets whose most recent ask/bid I might want to compare a trade with, otherwise I'd simply store the most recent ask/bid in the same series as the trades.
Is it possible to do what I want to do with Influx, or with another time series database? An alternative solution that produces lower quality results is to group the ask/bid data by some time interval, say 250ms, and take the last from each interval, to at least impose an upper bound on the amount of quotes the client has to drop before finding the one that's closest to the next trade.
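That fallback might look something like this (the measurement name tick_data and the field names are assumptions on my part):
SELECT last(ask) AS ask, last(bid) AS bid
FROM tick_data
WHERE time >= '...' AND time <= '...'
GROUP BY time(250ms), "Market" fill(none)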
NB. Just a clarification on InfluxDB terminology: you're probably storing trade and tick data in different measurements (analogous to tables). A series is a subdivision within a measurement based on tag values, e.g.
Time Ask Bid Market
00:00.763 100 99 bar-baz
is one series
Time Ask Bid Market
00:01.010 101 99 foo-bar
is another series (assuming you are storing the market name/id as a tag and not a field).
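In line protocol terms (a sketch; the measurement name tick_data and the exact field layout are assumptions), those two rows would be written as:
tick_data,Market=bar-baz ask=100,bid=99
tick_data,Market=foo-bar ask=101,bid=99
Both points land in the same measurement, but they belong to two different series because the Market tag value differs.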
Answer
InfluxQL (https://docs.influxdata.com/influxdb/v1.7/query_language/spec/) - I can't think of a way to achieve what you need with InfluxQL (Influx Query Language), as it does not support joins.
Perhaps what you could do on the client side, instead of requesting all tick data for a period and discarding most of it, is make one request per trade and market to fetch exactly the ask/bid data point you need (the most recent with respect to the trade). Something like:
function merge(trades, market)
points = <empty list>
for next_trade in trades
quote = db.query("select last(ask), last(bid) from tick_data where time<=next_trade.timestamp and Market=market and time>next_trade.timestamp - 1m")
// or to get a list per market with one query
// quote_per_market = db.query("select last(ask), last(bid) from tick_data where time<=next_trade.timestamp group by Market")
points = points + join(next_trade, quote)
return points
Of course you'd have the overhead of querying the database more frequently, but depending on the number of trades and your resource constraints it may be more efficient. NB: a potential pitfall here is that the ask and bid retrieved this way are not fetched as a pair but independently, and although they are returned together, they could have different timestamps. If for some reason you only have an ask or only a bid at a given timestamp, you might run into this problem. However, as long as you write them in pairs and have no missing data, it should be okay.
Flux (https://www.influxdata.com/products/flux/) - Flux is a more sophisticated query language, available in InfluxDB 1.7 and 2.x, that supports joins and operations across different measurements. I can't give you any examples yet, but it's worth having a look at.
Other (relational) time series DBs that you could look at, which would also allow you to do joins, are CrateDB (https://crate.io/) or Postgres + TimescaleDB (https://www.timescale.com/products).
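For example, in plain Postgres (and therefore TimescaleDB) the "most recent quote at or before each trade" lookup can be expressed as a lateral join. This is only a rough sketch, assuming hypothetical trades and quotes tables shaped like the data above:
-- join each foo-bar trade to the single most recent bar-baz quote at or before it
SELECT t.time, t.price AS trade_price, q.ask, q.bid
FROM trades t
LEFT JOIN LATERAL (
    SELECT ask, bid
    FROM quotes
    WHERE market = 'bar-baz' AND time <= t.time
    ORDER BY time DESC
    LIMIT 1
) q ON true
WHERE t.market = 'foo-bar'
ORDER BY t.time;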
I'm converting something from SQL Server to PostgreSQL. There's a table with a calculated field, MidTime, that lies between a BeginTime and an EndTime. The times are offsets from the beginning of a video clip and will never be more than about 6 minutes long. In SQL Server, BeginTime, EndTime, and MidTime are all TimeSpans, and you can use this as the function:
DATEADD(ms, DATEDIFF(ms,BeginTime, EndTime)/2, BeginTime)
Which takes the difference of the two timespans in milliseconds, divides it by 2, and adds it to the BeginTime. Super straightforward. The result looks like this:
ID BeginTime EndTime MidTime
10137 00:00:05.0000000 00:00:07.0000000 00:00:06.0000000
10138 00:00:08.5000000 00:00:09.6660000 00:00:09.0830000
10139 00:00:12.1660000 00:00:13.4000000 00:00:12.7830000
10140 00:00:14.6000000 00:00:15.7660000 00:00:15.1830000
10141 00:00:17.1330000 00:00:18.3000000 00:00:17.7160000
10142 00:00:19.3330000 00:00:21.5000000 00:00:20.4160000
10143 00:00:23.4000000 00:00:25.4000000 00:00:24.4000000
10144 00:00:25.4330000 00:00:26.8330000 00:00:26.1330000
I've looked at all of the different things available to me in PostgreSQL and don't see anything like this. I'm storing BeginTime and EndTime as "time without time zone" time(6) values, and they look right in the database. I can subtract one from the other, but I can't halve the resulting value in milliseconds (division of times is not allowed), and then there's no obvious way to add the milliseconds back onto the BeginTime.
I've looked at EXTRACT which when you ask for milliseconds gives you the value of second and milliseconds, but just that part of the time. I can't seem to get a representation of the time that I can subtract, divide, and then add the result back into another time.
I'm using Postgres 9.4, and I don't see any simple way of doing this without breaking the time into its component parts to get overall milliseconds (it seems like it would work, but I don't want to do something that ugly if I don't need to), or converting everything to a Unix datetime, doing the calculations, and then facing the non-obvious step of getting it back into a "time without time zone".
I'm hoping there's something elegant that I'm just missing? Or maybe a better way to store these so this work is easier? I'm only interested in the time part, so time(6) seemed closest to SQL Server's TimeSpan.
Just subtract one from the other, divide the result by two, and add it to begintime:
begintime + (endtime - begintime)/2
It is correct that you can't divide a time value. But the result of endtime - begintime is not a time but an interval. And you can divide an interval by 2.
The above expression works with time, timestamp or interval columns.
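As a usage sketch against data like the sample above (the table name clips is made up):
SELECT id,
       begintime,
       endtime,
       begintime + (endtime - begintime) / 2 AS midtime
FROM clips;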
What am I doing wrong in this query?
SELECT * FROM TreatmentPlanDetails
WHERE
accountId = 'ag5zfmRvbW9kZW50d2ViMnIRCxIIQWNjb3VudHMYtcjdAQw' AND
status = 'done' AND
category = 'chirurgia orale' AND
setDoneCalendarEventStartTimestamp >= [timestamp for 6 june 2012] AND
setDoneCalendarEventStartTimestamp <= [timestamp for 11 june 2012] AND
deleteStatus = 'notDeleted'
ORDER BY setDoneCalendarEventStartTimestamp ASC
I am not getting any records, and I am sure there are records meeting the WHERE clause conditions. To get the correct records I have to widen the timestamp interval by 1 millisecond. Is this normal? Furthermore, if I modify the query by removing the category filter, I get the correct results. This is definitely weird.
I also asked on google groups, but I got no answer. Anyway, for details:
https://groups.google.com/forum/?fromgroups#!searchin/google-appengine/query/google-appengine/ixPIvmhCS3g/d4OP91yTkrEJ
Let's talk specifically about how the timestamps that go into the query are created. What code are you using to create the timestamp values? Apparently that's important, because fuzzing with them a little affects the query. It may be relevant that in the datastore, timestamps are recorded as integers representing POSIX timestamps with microseconds, i.e. the number of microseconds since 1/1/1970 UTC (not counting leap seconds). It's also relevant that dates (i.e. without a time) are represented as midnight, i.e. the earliest time on that day. But please show us the exact code. (It may also be important to show the actual content of the record that you're attempting to retrieve.)
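For example, a hedged sketch in Python, assuming the bounds are built from plain dates as just described (your actual code may differ):
from datetime import datetime

# Dates without a time component become midnight UTC, so an upper bound
# built from "11 June 2012" is really 2012-06-11T00:00:00 and excludes
# everything later that day.
start = datetime(2012, 6, 6)            # 2012-06-06 00:00:00 UTC
end = datetime(2012, 6, 11)             # 2012-06-11 00:00:00 UTC
end_exclusive = datetime(2012, 6, 12)   # use with "<" to cover all of 11 June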
An aside that is not specific to your question: Entity property names count as part of your storage quota. If this is going to be a huge dataset, you might pay more $$ than you'd like for property names like setDoneCalendarEventStartTimestamp.
Because you write:
"if I modify this query by removing the category filter, I am getting the correct results"
this probably means that the category property was not indexed at the time you wrote the matching records to the datastore. You have to re-write those records to the datastore if you want them added to the newly created index.