indexed query to decimate time series results - database

The context here is that I'm scoping a design that slices time-series data at user-defined intervals - there are too many permutations to simply roll up the data (e.g., a second hourly table, rolled up during or after ingest). I am not a database expert, so I'm hoping to learn whether there are standard approaches to this that rely on table indexes, rather than duplicating the data into a new table/collection or otherwise encoding it a priori in the file structure, and especially whether there is a particular DB suited to or supporting this load. A prototype of the backend is in Mongo, but we can easily pivot to a better-suited store at this point.
QUESTION: In any mainstream database (or similar), is it possible to query a time series over a given time window and (efficiently) return only data points at a consistent interval? Which DB, and what would an example query look like, specifically?
My data is a few tens of GB today, but growing if we're successful. I'd expect indexes against the timestamp to continue to fit roughly in memory. A custom file-based schema such as Parquet might work, but an off-the-shelf DB would be ideal. By consistent interval, I mean some "skip" meaningful to a human, such as
"every nth value", if the data could be assumed to be at a reliable cadence
or perhaps "first sample per hour", if not and the timestamps were epoch times
E.g., if my data is:
ts      value
1001    1
1002    2
1003    3
...     (continuous)
1010    10
1011    11
...     (etc., large data set)
and the query had the parameters:
- skip_value: 10
- ts:
  - after: 1000
  - before: 2000
the returned set would be:
[ (1001, 1), (1011, 11), (1021, 21), ..., (1991, 991) ]
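For illustration, a "first sample per skip-sized bucket" query of that shape could be sketched in the current Mongo prototype as an aggregation pipeline like the one below (collection and field names are placeholders; I don't know whether this is the efficient or idiomatic way, hence the question):
# A sketch of server-side decimation with a MongoDB aggregation pipeline,
# assuming a collection "samples" with integer epoch fields "ts"/"value"
# and an index on {ts: 1}. All names here are placeholders.
from pymongo import MongoClient

samples = MongoClient().mydb.samples

after, before, skip = 1000, 2000, 10
start = after + 1  # first possible ts in the (exclusive) window

pipeline = [
    {"$match": {"ts": {"$gt": after, "$lt": before}}},  # index-backed range scan
    {"$sort": {"ts": 1}},                               # so $first below is the earliest sample
    {"$group": {
        # bucket key: which skip-sized interval this sample falls in
        "_id": {"$floor": {"$divide": [{"$subtract": ["$ts", start]}, skip]}},
        "ts": {"$first": "$ts"},
        "value": {"$first": "$value"},
    }},
    {"$sort": {"ts": 1}},
    {"$project": {"_id": 0, "ts": 1, "value": 1}},
]

decimated = list(samples.aggregate(pipeline))
# -> [{'ts': 1001, 'value': 1}, {'ts': 1011, 'value': 11}, ...]
The $match stage can use the ts index, but the grouping still touches every document in the window, which is the kind of work I'd hope an index-aware or purpose-built time-series store could avoid.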

Related

AnyLogic: How to create plot from database table?

In my AnyLogic model I successfully create plots of datasets that count the number of trucks arriving from terminals each hour in my simulation. Now I want to add the actual/"observed" number of trucks arriving at a terminal, to compare my simulation to these numbers. I added these numbers in a database table (see picture below). Is there a simple way of adding this data to the plot?
I tried creating a variable that reads the database table every hour and adding that to a dataset (as can be seen in the pictures below), but unfortunately this did not work (the plot was empty).
Maybe simply delete the variable and fill the dataset at the start of the model by looping through the database table data. Use the database query wizard to create a for-loop. Something like this should work:
// Size the DataSet to the number of rows in the observed_arrivals table
int numEntries = (int) selectFrom(observed_arrivals).count();
DataSet myDataSet = new DataSet(numEntries);
// Read every row and add (hour, observed arrivals at terminal A) pairs
List<Tuple> rows = selectFrom(observed_arrivals).list();
for (Tuple row : rows) {
    myDataSet.add(row.get(observed_arrivals.hour), row.get(observed_arrivals.terminal_a));
}
myChart.addDataSet(myDataSet);
You don't explain why it "didn't work" (what errors/problems did you get?), nor where you defined these elements.
(1) Since you want both observed (empirical) and simulated arrivals per terminal, datasets for each should be in the Terminal agent. And then the replicated plot (in Main) can have two data entries referring to data sets terminals(index).observedArrivals and terminals(index).simulatedArrivals or whatever you name them.
(2) Using getHourOfDay to add to the observed dataset is wrong because that just returns 0-23 (i.e., the hour in the current day for the current model date). Your database table looks like it has hours since model start, so you just want time(HOUR) to get the model time in elapsed hours (irrespective of what the model time unit is). Or possibly time(HOUR) - 1 if you only want to update the empirical arrivals for the hour at the end of that hour (i.e., at the same time that you updated the simulated arrivals).
(3) Using a Variable to get the database value each hour doesn't work because a variable's initial value is only evaluated once at model initialisation. You want an hourly cyclic Event in Terminal instead which adds the relevant row's value. (You need to use the Insert Database Query wizard to generate the relevant Java code for the query you need in the event's action.)
(4) Because you have a database table with specifically-named columns for each terminal (columns terminal_a and presumably terminal_b, etc.), things are slightly more awkward. (This isn't proper relational table design where, instead of 4 columns for the 4 terminals, you'd have two columns for terminal_id and observed_value, with a row for each time period and terminal combination.)
So your database query expression (in your Terminal agents) will need to use the SQL format (not the QueryDSL format) so that you can 'stitch in' the correct column name into the SQL.

Merging different granularity time series in InfluxDB

I want to store trades as well as best ask/bid data, where the latter updates much more rapidly than the former, in InfluxDB.
I want to, if possible, use a schema that allows me to query: "for each trade on market X, find the best ask/bid on market Y whose timestamp is <= the timestamp of the trade".
(I'll use any version of Influx.)
For example, trades might look like this:
Time Price Volume Direction Market
00:01.000 100 5 1 foo-bar
00:03.000 99 50 0 bar-baz
00:03.050 99 25 0 foo-bar
00:04.000 101 15 1 bar-baz
And tick data might look more like this:
Time Ask Bid Market
00:00.763 100 99 bar-baz
00:01.010 101 99 foo-bar
00:01.012 101 98 bar-baz
00:01.012 101 99 foo-bar
00:01.238 100 99 bar-baz
...
00:03.021 101 98 bar-baz
I would want to be able to somehow join each trade for some market, e.g. foo-bar, with only the most recent ask/bid data point on some other market, e.g. bar-baz, and get a result like:
Time Trade Price Ask Bid
00:01.000 100 100 99
00:03.050 99 101 98
Such that I could compute the difference between the trade price on market foo-bar and the most recently quoted ask or bid on market bar-baz.
Right now, I store trades in one time series and ask/bid data points in another and merge them on the client side, with logic along the lines of:
function merge(trades, quotes, data_points)
    next_trade, more_trades = first(trades), rest(trades)
    quotes = drop-while (quote.timestamp < next_trade.timestamp) quotes
    data_point = join(next_trade, first(quotes))
    if more_trades
        return merge(more_trades, quotes, data_points + data_point)
    return data_points + data_point
The problem is that the client has to discard tons of ask/bid data points because they update so frequently, and only the most recent update before the trade is relevant.
There are tens of markets whose most recent ask/bid I might want to compare a trade with, otherwise I'd simply store the most recent ask/bid in the same series as the trades.
Is it possible to do what I want to do with Influx, or with another time series database? An alternative solution that produces lower quality results is to group the ask/bid data by some time interval, say 250ms, and take the last from each interval, to at least impose an upper bound on the amount of quotes the client has to drop before finding the one that's closest to the next trade.
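To make that fallback concrete, the kind of downsampling query I have in mind (shown here via the InfluxDB 1.x Python client; the measurement name "tick_data" and the "Market" tag are assumptions matching the tables above) would be roughly:
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="markets")

# Last quote in each 250 ms bucket, per market, over the last hour.
# fill(none) drops the empty buckets so the client only sees real quotes.
result = client.query(
    'SELECT LAST(ask) AS ask, LAST(bid) AS bid FROM tick_data '
    'WHERE time >= now() - 1h '
    'GROUP BY time(250ms), "Market" fill(none)'
)
for quote in result.get_points(tags={"Market": "bar-baz"}):
    print(quote["time"], quote["ask"], quote["bid"])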
NB. Just a clarification on InfluxDB terminology: you're probably storing trade and tick data in different measurements (analogous to tables). A series is a subdivision within a measurement based on tag values. E.g.
Time Ask Bid Market
00:00.763 100 99 bar-baz
is one series
Time Ask Bid Market
00:01.010 101 99 foo-bar
is another series (assuming you are storing the market name/id as a tag and not a field).
Answer
InfluxQL https://docs.influxdata.com/influxdb/v1.7/query_language/spec/ - I can't think of a way to achieve what you need with InfluxQL (Influx Query Language) as it does not support joins.
Perhaps what you could do on the client side is, instead of requesting all tick data for a period and discarding most of it, make a request per trade and market to get exactly the ask/bid data point that you need (the most recent with respect to the trade). Something like:
function merge(trades, market)
    points = <empty list>
    for next_trade in trades
        quote = db.query("select last(ask), last(bid) from tick_data where time <= next_trade.timestamp and Market = market and time > next_trade.timestamp - 1m")
        // or, to get a list per market with one query:
        // quote_per_market = db.query("select last(ask), last(bid) from tick_data where time <= next_trade.timestamp group by Market")
        points = points + join(next_trade, quote)
    return points
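For reference, a rough Python version of that loop with the influxdb 1.x client library might look like the following (measurement, tag, and field names are assumed from the examples above, and trades are assumed to carry RFC3339 timestamps):
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="markets")

def merge(trades, market):
    """trades: iterable of dicts with at least an RFC3339 'time' key."""
    points = []
    for trade in trades:
        # Most recent quote at or before the trade, bounded to the last
        # minute so each lookup stays a small range scan.
        q = (
            "SELECT LAST(ask) AS ask, LAST(bid) AS bid FROM tick_data "
            f"WHERE \"Market\" = '{market}' "
            f"AND time <= '{trade['time']}' AND time > '{trade['time']}' - 1m"
        )
        quote = next(client.query(q).get_points(), None)
        if quote is not None:
            points.append({**trade, "ask": quote["ask"], "bid": quote["bid"]})
    return points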
Of course, you'd have the overhead of querying the database more frequently, but depending on the number of trades and your resource constraints it may be more efficient. NB. A potential pitfall here is that the ask and bid retrieved this way are not retrieved as a pair but independently, and although they are returned together, they could come from different timestamps. If for some reason you only have an ask or only a bid price at a given timestamp, you might run into this problem. However, as long as you write them in pairs and have no missing data it should be OK.
Flux https://www.influxdata.com/products/flux/ - Flux is a more sophisticated query language, part of InfluxDB 1.7 and 2, that allows you to do joins and operations across different measurements. I can't give you any examples yet, but it's worth having a look at.
Other (relational) time series DBs that would also allow you to do joins and are worth a look are CrateDB https://crate.io/ and Postgres + TimescaleDB https://www.timescale.com/products

1. What type of database, if any, for this information arrangement? 2. Is it faster with int instead of string values in the database?

I am collecting data from the internet at regular intervals and putting it in text files:
Columns: Datetime data1 data2 data3
Values: 20170717 1800 text1 text2 int
What kind of database would be best, given that the data is predictable? I could rearrange the data for a column-based store if needed. Then there are those saying you should not use databases at all (referring to Google).
Would it be faster and cheaper on disk space if I translated string values into simple ints and kept a separate translation table for when queries are needed?
For instance, writing 1, 2 and 3 into the database instead of hockey stick, football and frisbee, or 1, 2, 3 instead of Newport, Kuala Lumpur, Lumpini Stadium, and when searching doing something like (in pseudocode) INNER JOIN database AND translation-table WHERE 1 = Newport.
Perhaps there would be trade-offs between saving disk space and increasing the workload on the CPU when querying the database with INNER JOIN-type queries?
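As a concrete sketch of that integer-plus-translation-table idea (table and column names are invented for illustration; SQLite is used here only because it ships with Python):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE observation (ts TEXT, item_id INTEGER REFERENCES item(item_id), value INTEGER);
    INSERT INTO item (name) VALUES ('hockey stick'), ('football'), ('frisbee');
    INSERT INTO observation VALUES ('20170717 1800', 1, 42);
""")

# Search by the human-readable name; the JOIN resolves it to the stored int key.
rows = conn.execute("""
    SELECT o.ts, i.name, o.value
    FROM observation AS o
    JOIN item AS i ON i.item_id = o.item_id
    WHERE i.name = ?
""", ("hockey stick",)).fetchall()
print(rows)  # [('20170717 1800', 'hockey stick', 42)]
The integer keys keep the main table small; the cost is the extra join at query time, which is exactly the trade-off asked about above.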
The answer is to look further into the subject matter. Data formats I did not know about popped up in the following video, and then in a good text on the matter.
VIDEO: More parallelization file formats matter
TEXT: How to choose data type

How do RRD values in a database dump translate to the input values?

I am having trouble understanding the values that I have saved in my Round Robin Database. I do a dump with rrdtool dump mydatabase and I get a dump of the data. I found the most recent update, and matched it to my rrd update command:
$rrdupdate --template=var1:var2:var3:var4:var5 N:15834740:839964:247212:156320:13493356
In my dump at the matching timestamp, I find these values:
<!-- 2016-12-01 10:30:00 CST / 1480609800 --> <row><v>9.0950245287e+04</v><v>4.8264158237e+03</v><v>1.4182428703e+03</v><v>8.9785764359e+02</v><v>7.7501969607e+04</v></row>
The first value is supposed to be var1. Out of scientific notation, that's 90,950.245287, which does not match up at all to my input value. (None of them are decimal.)
Is there something special I have to do to be able to convert values from my dump to get the standard value that I entered?
I can't give you specifics for your case, as you have not shown the full definition of your RRD file (internals, DS definition, etc), however...
Values stored in an RRDTool database are converted to rates (unless the DS is of type GAUGE, in which case they are assumed to be rates already) and are subject to Data Normalisation.
Normalisation is when the values are adjusted on a linear basis to make them fit exactly into the time sequence as defined by the Interval (which is often 300 seconds).
If you want to see the values stored exactly as you write them, you need to set the DS type to 'gauge', and make Normalisation a null step. The only way to do the latter is to store the values exactly on a time boundary. So, if the Interval is 300s, then store at 12:00:00, 12:05:00, and so on - otherwise the values will be adjusted.
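To make the arithmetic concrete, here is a rough Python illustration of those two effects (this is not RRDTool's actual code, just the idea):
# Non-GAUGE updates are turned into per-second rates, and each step's stored
# value is then the time-weighted average of the rates covering that step.
def counter_rate(prev_value, prev_ts, value, ts):
    # A COUNTER DS stores change per second, not the raw number you wrote.
    return (value - prev_value) / (ts - prev_ts)

def normalised_step(segments, step_start, step_end):
    # segments: list of (seg_start, seg_end, rate) tuples overlapping the step.
    total = 0.0
    for seg_start, seg_end, rate in segments:
        overlap = max(0, min(seg_end, step_end) - max(seg_start, step_start))
        total += rate * overlap
    return total / (step_end - step_start)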
There is a lot more information about Normalisation - what it is, and why it is done - in Alex van den Bogaerdt's tutorial

MongoDB and Arctic

I intend to analyse multiple data sets on the same time series (daily EOD). I will need to use computed columns: use column A + column B to create column C (store the net result of the calculation in column C). Is this functionality available using the MongoDB / Arctic database?
I would also intend to search the data... for example: What happens when the advance decline thrust pushes over 70 when the cumulative TICK was below -100,000 in the past 'n days'
Two data sets: Cumulative TICK and the Advance Decline Thrust (uses advancers / decliners data). So they would be stored in the database, and then I would want the capability to search for the above condition. Is this achievable with the MongoDB / Arctic database structure?
Just looking for some general information before I move to a DB format. Currently everything I have created is in Excel / VBA and it has already been outgrown!
Any information greatly appreciated.
Note: I will use the same database for weekly, monthly, yearly and 1 minute, 3 minute, 5 minute, 60 minute TICK/TIME based bars - not feeding live, but updated EOD.
Yes, this can be done with Arctic. Arctic can store pandas DataFrames, and an operation like the one you mention is trivial in pandas. Arctic is just a store, so you'd read the data out of Arctic (data is stored under symbols), perform your transform, and then write the data back. Any of the storage engines (VersionStore, TickStore, or ChunkStore) should work for this.
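A minimal sketch of that read / transform / write-back cycle with the default VersionStore (the library name "EOD", the symbol "BREADTH", and the column names are placeholders, and a DatetimeIndex on the stored DataFrame is assumed):
from arctic import Arctic

store = Arctic("localhost")                # Arctic sits on top of MongoDB
if "EOD" not in store.list_libraries():
    store.initialize_library("EOD")        # one-off setup
lib = store["EOD"]

df = lib.read("BREADTH").data              # the pandas DataFrame stored earlier

# Computed column: C = A + B, done in pandas, then written back as a new version.
df["C"] = df["A"] + df["B"]
lib.write("BREADTH", df)

# Condition search, e.g. thrust over 70 while cumulative TICK dipped below
# -100,000 at some point in the past n days (requires a DatetimeIndex).
n = 5
hits = df[(df["adv_decl_thrust"] > 70) &
          (df["cum_tick"].rolling(f"{n}D").min() < -100_000)]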
