key-value store for time series data? - database

I've been using SQL Server to store historical time series data for a couple hundred thousand objects, observed about 100 times per day. I'm finding that queries (give me all values for object XYZ between time t1 and time t2) are too slow (for my needs, slow is more then a second). I'm indexing by timestamp and object ID.
I've entertained the thought of using somethings a key-value store like MongoDB instead, but I'm not sure if this is an "appropriate" use of this sort of thing, and I couldn't find any mentions of using such a database for time series data. ideally, I'd be able to do the following queries:
retrieve all the data for object XYZ between time t1 and time t2
do the above, but return one date point per day (first, last, closed to time t...)
retrieve all data for all objects for a particular timestamp
the data should be ordered, and ideally it should be fast to write new data as well as update existing data.
it seems like my desire to query by object ID as well as by timestamp might necessitate having two copies of the database indexed in different ways to get optimal performance...anyone have any experience building a system like this, with a key-value store, or HDF5, or something else? or is this totally doable in SQL Server and I'm just not doing it right?

It sounds like MongoDB would be a very good fit. Updates and inserts are super fast, so you might want to create a document for every event, such as:
{
object: XYZ,
ts : new Date()
}
Then you can index the ts field and queries will also be fast. (By the way, you can create multiple indexes on a single database.)
How to do your three queries:
retrieve all the data for object XYZ
between time t1 and time t2
db.data.find({object : XYZ, ts : {$gt : t1, $lt : t2}})
do the above, but return one date
point per day (first, last, closed to
time t...)
// first
db.data.find({object : XYZ, ts : {$gt : new Date(/* start of day */)}}).sort({ts : 1}).limit(1)
// last
db.data.find({object : XYZ, ts : {$lt : new Date(/* end of day */)}}).sort({ts : -1}).limit(1)
For closest to some time, you'd probably need a custom JavaScript function, but it's doable.
retrieve all data for all objects for
a particular timestamp
db.data.find({ts : timestamp})
Feel free to ask on the user list if you have any questions, someone else might be able to think of an easier way of getting closest-to-a-time events.

This is why databases specific to time series data exist - relational databases simply aren't fast enough for large time series.
I've used Fame quite a lot at investment banks. It's very fast but I imagine very expensive. However if your application requires the speed it might be worth looking it.

There is an open source timeseries database under active development (.NET only for now) that I wrote. It can store massive amounts (terrabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for the stock ticks storage and analysis at our company.
I am not sure this will be exactly what you need, but it will allow you to get the first two points - get values from t1 to t2 for any series (one series per file) or just take one data point.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
file.UniqueIndexes = true; // enforces index uniqueness
file.InitializeNewFile(); // create file and write header
file.AppendData(data); // append data (stream of ArraySegment<>)
}
// Read needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStrut>) BinaryFile.Open("data.bts", false))
{
// Enumerate one item at a time maxitum 10 items starting at 2011-1-1
// (can also get one segment at a time with StreamSegments)
foreach (var val in file.Stream(new UtcDateTime(2011,1,1), maxItemCount = 10)
Console.WriteLine(val);
}

I recently tried something similar in F#. I started with the 1 minute bar format for the symbol in question in a Space delimited file which has roughly 80,000 1 minute bar readings. The code to load and parse from disk was under 1ms. The code to calculate a 100 minute SMA for every period in the file was 530ms. I can pull any slice I want from the SMA sequence once calculated in under 1ms. I am just learning F# so there are probably ways to optimize. Note this was after multiple test runs so it was already in the windows Cache but even when loaded from disk it never adds more than 15ms to the load.
date,time,open,high,low,close,volume
01/03/2011,08:00:00,94.38,94.38,93.66,93.66,3800
To reduce the recalculation time I save the entire calculated indicator sequence to disk in a single file with \n delimiter and it generally takes less than 0.5ms to load and parse when in the windows file cache. Simple iteration across the full time series data to return the set of records inside a date range in a sub 3ms operation with a full year of 1 minute bars. I also keep the daily bars in a separate file which loads even faster because of the lower data volumes.
I use the .net4 System.Runtime.Caching layer to cache the serialized representation of the pre-calculated series and with a couple gig's of RAM dedicated to cache I get nearly a 100% cache hit rate so my access to any pre-computed indicator set for any symbol generally runs under 1ms.
Pulling any slice of data I want from the indicator is typically less than 1ms so advanced queries simply do not make sense. Using this strategy I could easily load 10 years of 1 minute bar in less than 20ms.
// Parse a \n delimited file into RAM then
// then split each line on space to into a
// array of tokens. Return the entire array
// as string[][]
let readSpaceDelimFile fname =
System.IO.File.ReadAllLines(fname)
|> Array.map (fun line -> line.Split [|' '|])
// Based on a two dimensional array
// pull out a single column for bar
// close and convert every value
// for every row to a float
// and return the array of floats.
let GetArrClose(tarr : string[][]) =
[| for aLine in tarr do
//printfn "aLine=%A" aLine
let closep = float(aLine.[5])
yield closep
|]

I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages.
I use boost::date_time for the timestamp field.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library features to be able to efficiently search for specific values or ranges of time-based records.

Related

NDB Queries Exceeding GAE Soft Private Memory Limit

I currently have a an application running in the Google App Engine Standard Environment, which, among other things, contains a large database of weather data and a frontend endpoint that generates graph of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored for every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, stored in each of my WeatherData entities are the relevant fields that I want to graph, in addition to a very large json string containing forecast data that I do not need for this graphing application. So, the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties in the entity, such as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast json string for graphing, only a few other fields stored in the entity. The other approach I tried to run was to only fetch a couple of entities out at a time and split the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),start=start_date,end=end_date)
for acct in qry.fetch():
d = [acct.time.strftime(date_string)]
for attr in wData.keys():
d.append(str(acct.dict_access(attr)))
wData[attr].append([acct.time.strftime(date_string),acct.dict_access(attr)])
wDataCsv += '\\n' + ','.join(d)
# Children Entity - log of a weather at parent location
class WeatherData(ndb.Model):
# model for data to save
...
# Function for querying data below a given ancestor between two optional
# times
#classmethod
def time_ordered_query(cls, ancestor_key, start=None, end=None):
return cls.query(cls.time>=start, cls.time<=end,ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),start=start_date,end=end_date)
cursor = None
while True:
gc.collect()
fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
if fetched:
for acct in fetched:
d = [acct.time.strftime(date_string)]
for attr in wData.keys():
d.append(str(acct.dict_access(attr)))
wData[attr].append([acct.time.strftime(date_string),acct.dict_access(attr)])
wDataCsv += '\\n' + ','.join(d)
if more and next_cursor:
cursor = next_cursor
else:
break
where FETCHNUM=500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be with Python's garbage collector not deleting the already used information that is re-referenced, but even when I include gc.collect() I see no improvement there.
EDIT:
Following the advice below, I fixed the problem using Projection Queries. Rather than have a separate projection for each custom query, I simply ran the same projection each time: namely querying all properties of the entity excluding the JSON string. While this is not ideal as it still pulls gratuitous information from the database each time, generating individual queries of each specific query is not scalable due to the exponential growth of necessary indices. For this application, as each additional property is negligible additional memory (aside form that json string), it works!
You can use projection queries to fetch only the properties of interest from each entity. Watch out for the limitations, though. And this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple (you can fetch 500 at a time) and cursors. Check out examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally you could try to create and store just portions of the graphs and just stitch them together in the end (only if it comes down to that, I'm not sure how exactly it would be done, I imagine it wouldn't be trivial).

Continuous queries in Influxdb ignoring where clause?

I'm having a bit of a trouble with the continuous queries in influxdb 0.8.8.
I'm trying to create a continuous query but it seems that the where clauses are ignored. I'm aware about the restrictions mentioned here: http://influxdb.com/docs/v0.8/api/continuous_queries.html but I don't consider that this would be the case here.
One row in the time series would contain data like this:
{"hex":"06a0b6", "squawk":"3421", "flight":"QTR028 ", "lat":99.867630, "lon":66.447365, "validposition":1, "altitude":39000, "vert_rate":-64,"track":125, "validtrack":1,"speed":482, "messages":201, "seen":219}
The query I'm running and works is the following:
select * from flight_series where time > now() - 30m and flight !~ /^$/ and validtrack = 1 and validposition = 1;
Trough it I'm trying to take the last 30 minutes from the current time, check that the flight field is no whitespaces and that the track/position are valid.
The query returns successfully but when I'm adding the
into filtered_log
part the 'where' clause is ignored.
How can I create a continuous query which takes the above-mentioned conditions into consideration? At least, how could I extract with one continuous query only the rows which have the valid track/heading set to 1 and the flight is not whitespace/empty string? The time constraint I could eliminate from the query and translate into shard retention/duration.
Also, could I specify to in the continuous query to save the data into a time-series which is located into another database (which has a more relaxed retention/duration policy)?
Thank you!
Later edit:
I've managed to do something closer to my need by using the following cq:
"select time, sequence_number, altitude, vert_rate, messages, squawk, lon, lat, speed, hex, seen from current_flights where ((flight !~ /^$/) AND (validtrack = 1)) AND (validposition = 1) into flight.[flight]"
This creates a series for each 'flight' even for those which have a whitespace in the 'flight' field -- for which a flight. series is built.
How could I specify the retention/duration policies for the series generated by the cq above? Can I do something like:
"spaces": [
{
"name": "flight",
"retentionPolicy": "1h",
"shardDuration": "30m",
"regex": "/.*/",
"replicationFactor": 1,
"split": 1
},
...
which would give me a retention of 1h and shard duration of 30m?
I'm a bit confused about where those series are stored, which shard space?
Thanks!
P.S.: My final goal would be the following:
Have a 'window' of 15-30min max with all the flights around, process some data from them and after that period is over discard the data but in the same time move/copy it to another long-term db/series which can be used for historical purposes.
You cannot put time restrictions into the WHERE clause of a continuous query. The server will generate the time restrictions as needed when the CQ runs and must ignore all others. I suspect if you leave out the time restriction the rest of the WHERE clause will be fine.
I don't believe CQs in 0.8 require an aggregation in the SELECT, but you do need to have GROUP BY clause to tell the CQ how often to run. I'm not sure what you would GROUP BY, perhaps the flight?
You can specify a different retention policy when writing to the new series but not a new database. In 0.8 the retention policy for a series is determined by regex matching on the series name. As long as you select a series name correctly it will go into your desired retention policy.
EDIT: updates for new questions
How could I specify the retention/duration policies for the series
generated by the cq above?
In 0.8.x, the shard space to which a series belongs controls the retention policy. The regex on the shard space determines which series belong to that shard. The shard space regex is evaluated newest to oldest, meaning the first created shard space will be the last regex evaluated. Unfortunately, I do know if it is possible to create new shard spaces once the database exists. See this discussion on the mailing list for more: https://groups.google.com/d/msgid/influxdb/ce3fc641-fbf2-4b39-9ce7-77e65c67ea24%40googlegroups.com
Can I do something like:
"spaces": [
{
"name": "flight",
"retentionPolicy": "1h",
"shardDuration": "30m",
"regex": "/.*/",
"replicationFactor": 1,
"split": 1
}, ... which would give me a retention of 1h and shard duration of 30m?
That shard space would have a shard duration of 30 minutes, retaining data for 1 hour, meaning any series would only exist in three shards, the current hot shard, the current cold shard, and the shard waiting for deletion.
The regex is /./, meaning it would match any series, not just the 'flight.' series. Perhaps /flight../ is a better regex if you only want those series generated by the CQ in that shard space.

increase performance of a linq query using contains

I have a winforms app where I have a Telerik dropdownchecklist that lets the user select a group of state names.
Using EF and the database is stored in Azure SQL.
The code then hits a database of about 17,000 records and filters the results to only include states that are checked.
Works fine. I am wanting to update a count on the screen whenever they change the list box.
This is the code, in the itemCheckChanged event:
var states = stateDropDownList.CheckedItems.Select(i => i.Value.ToString()).ToList();
var filteredStops = (from stop in aDb.Stop_address_details where states.Contains(stop.Stop_state) select stop).ToArray();
ExportInfo_tb.Text = "Current Stop Count: " + filteredStops.Count();
It works, but it is slow.
I tried to load everything into a memory variable then querying that vs the database but can't seem to figure out how to do that.
Any suggestions?
Improvement:
I picked up a noticeable improvement by limiting the amount of data coming down by:
var filteredStops = (from stop in aDb.Stop_address_details where states.Contains(stop.Stop_state) select stop.Stop_state).ToList();
And better yet --
int count = (from stop in aDb.Stop_address_details where
states.Contains(stop.Stop_state)
select stop).Count();
ExportInfo_tb.Text = "Current Stop Count: " + count.ToString();
The performance of you query, actually, has nothing to do with Contiains, in this case. Contains is pretty performant. The problem, as you picked up on in your third solution, is that you are pulling far more data over the network than required.
In your first solution you are pulling back all of the rows from the server with the matching stop state and performing the count locally. This is the worst possible approach. You are pulling back data just to count it and you are pulling back far more data than you need.
In your second solution you limited the data coming back to a single field which is why the performance improved. This could have resulted in a significant improvement if your table is really wide. The problem with this is that you are still pulling back all the data just to count it locally.
In your third solution EF will translate the .Count() method into a query that performs the count for you. So the count will happen on the server and the only data returned is a single value; the result of count. Since network latency CAN often be (but is not always) the longest step when performing a query, returning less data can often result in significant gains in query speed.
The query translation of your final solution should look something like this:
SELECT COUNT(*) AS [value]
FROM [Stop_address_details] AS [t0]
WHERE [t0].[Stop_state] IN (#p0)

Why use an external database with Matlab?

Why should you use an external database (e.g. Mysql) when working with (large/growing) Data?
I know of some projects which use SQL databases, but I can't see the advantage you get from doing this in contrast to just storing everything in .mat files (as for example stated here: http://www.matlabtips.com/how-to-store-large-datasets/)
Where is this necessary? Where does this approach simplify things?
Regarding growing data, let's take an example where, on a production line, you would measure different sources with different sensors:
Experiment.Date = '2014-07-18 # 07h28';
Experiment.SensorType = 'A';
Experiment.SensorSerial = 'SENSOR-00012-A';
Experiment.SourceType = 'C';
Experiment.SourceSerial = 'SOURCE-00143-C';
Experiment.SensorPositions = 180 * linspace(0, 359, 360) / pi;
Experiment.SensorResponse = rand(1, 360);
And store these experiments on disk using .mat files:
experiment.2013-01-02.0001.mat
experiment.2013-01-02.0002.mat
experiment.2013-01-02.0003.mat
experiment.2013-01-03.0004.mat
...
experiment.2014-07-18.0001.mat
experiment.2014-07-18.0002.mat
So now, if I ask you:
"what is the typical response of sensors of type B when the source is of type E" ?
Or:
"Which sensor has best performances to measure sources of type C ? Sensors A or sensors B ?"
"How performances of these sensors degrade with time ?"
"Did modification we made last july to production line improved lifetime of sensors A ?"
Loading in memory all these .mat files, to check if date, sensor and source type are correct and then calculate min,mean,max responses, and other statistics is gonna be very painful and time consuming + writing custom code for file selection!
Building a data-base on top of these .mat files can be very useful to "SELECT/JOIN/..." elements of interest and then perform further statistic or operations.
NB: The database does not replace .mat files (i.e. the information), it just a practical and standard way to quickly select some of them upon conditions without having to load and parse everything.

How to generate large files (PDF and CSV) using AppEngine and Datastore?

When I first started developing this project, there was no requirement for generating large files, however it is now a deliverable.
Long story short, GAE just doesn't play nice with any large scale data manipulation or content generation. The lack of file storage aside, even something as simple as generating a pdf with ReportLab with 1500 records seems to hit a DeadlineExceededError. This is just a simple pdf comprised of a table.
I am using the following code:
self.response.headers['Content-Type'] = 'application/pdf'
self.response.headers['Content-Disposition'] = 'attachment; filename=output.pdf'
doc = SimpleDocTemplate(self.response.out, pagesize=landscape(letter))
elements = []
dataset = Voter.all().order('addr_str')
data = [['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE']]
i = 0
r = 1
s = 100
while ( i < 1500 ):
voters = dataset.fetch(s, offset=i)
for voter in voters:
data.append([voter.addr_num, voter.addr_str, voter.addr_unit_num, '', voter.phone, voter.firstname+' '+voter.middlename+' '+voter.lastname ])
r = r + 1
i = i + s
t=Table(data, '', r*[0.4*inch], repeatRows=1 )
t.setStyle(TableStyle([('ALIGN',(0,0),(-1,-1),'CENTER'),
('INNERGRID', (0,0), (-1,-1), 0.15, colors.black),
('BOX', (0,0), (-1,-1), .15, colors.black),
('FONTSIZE', (0,0), (-1,-1), 8)
]))
elements.append(t)
doc.build(elements)
Nothing particularly fancy, but it chokes. Is there a better way to do this? If I could write to some kind of file system and generate the file in bits, and then rejoin them that might work, but I think the system precludes this.
I need to do the same thing for a CSV file, however the limit is obviously a bit higher since it's just raw output.
self.response.headers['Content-Type'] = 'application/csv'
self.response.headers['Content-Disposition'] = 'attachment; filename=output.csv'
dataset = Voter.all().order('addr_str')
writer = csv.writer(self.response.out,dialect='excel')
writer.writerow(['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE'])
i = 0
s = 100
while ( i < 2000 ):
last_cursor = memcache.get('db_cursor')
if last_cursor:
dataset.with_cursor(last_cursor)
voters = dataset.fetch(s)
for voter in voters:
writer.writerow([voter.addr_num, voter.addr_str, voter.addr_unit_num, '', voter.phone, voter.firstname+' '+voter.middlename+' '+voter.lastname])
memcache.set('db_cursor', dataset.cursor())
i = i + s
memcache.delete('db_cursor')
Any suggestions would be very much appreciated.
Edit:
Above I had documented three possible solutions based on my research, plus suggestions etc
They aren't necessarily mutually exclusive, and could be a slight variation or combination of any of the three, however the gist of the solutions are there. Let me know which one you think makes the most sense, and might perform the best.
Solution A: Using mapreduce (or tasks), serialize each record, and create a memcache entry for each individual record keyed with the keyname. Then process these items individually into the pdf/xls file. (use get_multi and set_multi)
Solution B: Using tasks, serialize groups of records, and load them into the db as a blob. Then trigger a task once all records are processed that will load each blob, deserialize them and then load the data into the final file.
Solution C: Using mapreduce, retrieve the keynames and store them as a list, or serialized blob. Then load the records by key, which would be faster than the current loading method. If I were to do this, which would be better, storing them as a list (and what would the limitations be...I presume a list of 100,000 would be beyond the capabilities of the datastore) or as a serialized blob (or small chunks which I then concatenate or process)
Thanks in advance for any advice.
Here is one quick thought, assuming it is crapping out fetching from the datastore. You could use tasks and cursors to fetch the data in smaller chunks, then do the generation at the end.
Start a task which does the initial query and fetches 300 (arbitrary number) records, then enqueues a named(!important) task that you pass the cursor to. That one in turn queries [your arbitrary number] records, and then passes the cursor to a new named task as well. Continue that until you have enough records.
Within each task process the entities, then store the serialized result in a text or blob property on a 'processing' model. I would make the model's key_name the same as the task that created it. Keep in mind the serialized data will need to be under the API call size limit.
To serialize your table pretty fast you could use:
serialized_data = "\x1e".join("\x1f".join(voter) for voter in data)
Have the last task (when you get enough records) kick of the PDf or CSV generation. If you use key_names for you models you, should be able to grab all of the entities with encoded data by key. Fetches by key are pretty fast, you'll know the model's keys since you know the last task name. Again, you'll want to be mindful size of your fetches from the datastore!
To deserialize:
list(voter.split('\x1f') for voter in serialized_data.split('\x1e'))
Now run your PDF / CSV generation on the data. If splitting up the datastore fetches alone does not help you'll have to look into doing more of the processing in each task.
Don't forget in the 'build' task you'll want to raise an exception if any of the interim models are not yet present. Your final task will automatically retry.
Some time ago I faced the same problem with GAE. After many attempts I just moved to another web hosting since I could do it. Nevertheless, before moving I had 2 ideas how to resolve it. I haven't implemented them, but you may try to.
First idea is to use SOA/RESTful service on another server, if it is possible. You can even create another application on GAE in Java, do all the work there (I guess with Java's PDFBox it will take much less time to generate PDF), and return result to Python. But this option needs you to know Java and also to divide your app to several parts with terrible modularity.
So, there's another approach: you can create a "ping-pong" game with a user's browser. The idea is that if you cannot make everything in a single request, force browser to send you several. During first request make only a part of work which fits 30 seconds limit, then save the state and generate 'ticket' - unique identifier of a 'job'. Finally, send the user response which is simple page with redirect back to your app, parametrized by a job ticket. When you get it. just restore state and proceed with the next part of job.

Resources