Continuous queries in InfluxDB ignoring WHERE clause?

I'm having a bit of trouble with continuous queries in InfluxDB 0.8.8.
I'm trying to create a continuous query, but it seems that the WHERE clauses are ignored. I'm aware of the restrictions mentioned here: http://influxdb.com/docs/v0.8/api/continuous_queries.html but I don't think they apply in this case.
One row in the time series would contain data like this:
{"hex":"06a0b6", "squawk":"3421", "flight":"QTR028 ", "lat":99.867630, "lon":66.447365, "validposition":1, "altitude":39000, "vert_rate":-64,"track":125, "validtrack":1,"speed":482, "messages":201, "seen":219}
The query I'm running and works is the following:
select * from flight_series where time > now() - 30m and flight !~ /^$/ and validtrack = 1 and validposition = 1;
Through it I'm trying to take the last 30 minutes before the current time and check that the flight field is not whitespace and that the track/position are valid.
The query returns successfully, but when I add the
into filtered_log
part, the WHERE clause is ignored.
How can I create a continuous query which takes the above-mentioned conditions into account? At the very least, how could I extract with one continuous query only the rows which have the valid track/position flags set to 1 and a flight field that is not whitespace/an empty string? The time constraint I could drop from the query and translate into shard retention/duration instead.
Also, could I specify in the continuous query that the data should be saved into a time series located in another database (one with a more relaxed retention/duration policy)?
Thank you!
Later edit:
I've managed to get something closer to what I need by using the following CQ:
"select time, sequence_number, altitude, vert_rate, messages, squawk, lon, lat, speed, hex, seen from current_flights where ((flight !~ /^$/) AND (validtrack = 1)) AND (validposition = 1) into flight.[flight]"
This creates a series for each 'flight', even for those which have whitespace in the 'flight' field, for which a 'flight.' series is built.
How could I specify the retention/duration policies for the series generated by the CQ above? Can I do something like:
"spaces": [
{
"name": "flight",
"retentionPolicy": "1h",
"shardDuration": "30m",
"regex": "/.*/",
"replicationFactor": 1,
"split": 1
},
...
which would give me a retention of 1h and shard duration of 30m?
I'm a bit confused about where those series are stored - in which shard space?
Thanks!
P.S.: My final goal would be the following:
Have a 'window' of 15-30 min max with all the flights around, process some data from them, and once that period is over discard the data while at the same time moving/copying it to another long-term DB/series which can be used for historical purposes.

You cannot put time restrictions into the WHERE clause of a continuous query. The server will generate the time restrictions as needed when the CQ runs and must ignore all others. I suspect if you leave out the time restriction the rest of the WHERE clause will be fine.
I don't believe CQs in 0.8 require an aggregation in the SELECT, but you do need a GROUP BY clause to tell the CQ how often to run. I'm not sure what you would GROUP BY; perhaps the flight?
You can specify a different retention policy when writing to the new series but not a new database. In 0.8 the retention policy for a series is determined by regex matching on the series name. As long as you select a series name correctly it will go into your desired retention policy.
EDIT: updates for new questions
How could I specify the retention/duration policies for the series generated by the cq above?
In 0.8.x, the shard space to which a series belongs controls the retention policy. The regex on the shard space determines which series belong to that shard. The shard space regexes are evaluated newest to oldest, meaning the first created shard space will be the last regex evaluated. Unfortunately, I do not know if it is possible to create new shard spaces once the database exists. See this discussion on the mailing list for more: https://groups.google.com/d/msgid/influxdb/ce3fc641-fbf2-4b39-9ce7-77e65c67ea24%40googlegroups.com
Can I do something like:
"spaces": [
{
"name": "flight",
"retentionPolicy": "1h",
"shardDuration": "30m",
"regex": "/.*/",
"replicationFactor": 1,
"split": 1
}, ... which would give me a retention of 1h and shard duration of 30m?
That shard space would have a shard duration of 30 minutes and retain data for 1 hour, meaning any series would only exist in three shards: the current hot shard, the current cold shard, and the shard waiting for deletion.
The regex is /.*/, meaning it would match any series, not just the 'flight.*' series. Perhaps /flight\..*/ is a better regex if you only want the series generated by the CQ in that shard space.

Related

Flink CEP cannot get correct results on a unioned table

I use Flink SQL and CEP to recognize some really simple patterns. However, I found a weird thing (likely a bug). I have two example tables password_change and transfer as below.
transfer
transid,accountnumber,sortcode,value,channel,eventtime,eventtype
1,123,1,100,ONL,2020-01-01T01:00:01Z,transfer
3,123,1,100,ONL,2020-01-01T01:00:02Z,transfer
4,123,1,200,ONL,2020-01-01T01:00:03Z,transfer
5,456,1,200,ONL,2020-01-01T01:00:04Z,transfer
password_change
accountnumber,channel,eventtime,eventtype
123,ONL,2020-01-01T01:00:05Z,password_change
456,ONL,2020-01-01T01:00:06Z,password_change
123,ONL,2020-01-01T01:00:08Z,password_change
123,ONL,2020-01-01T01:00:09Z,password_change
Here are my SQL queries.
First create a temporary view event as
(SELECT accountnumber,rowtime,eventtype FROM password_change WHERE channel='ONL')
UNION ALL
(SELECT accountnumber,rowtime, eventtype FROM transfer WHERE channel = 'ONL' )
The rowtime column is the event time extracted directly from the original eventtime column, with a periodic bounded watermark of 1 second.
Then output the query result of
SELECT * FROM `event`
MATCH_RECOGNIZE (
PARTITION BY accountnumber
ORDER BY rowtime
MEASURES
transfer.eventtype AS event_type,
transfer.rowtime AS transfer_time
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (transfer password_change ) WITHIN INTERVAL '5' SECOND
DEFINE
password_change AS eventtype='password_change',
transfer AS eventtype='transfer'
)
It should output
123,transfer,2020-01-01T01:00:03Z
456,transfer,2020-01-01T01:00:04Z
But I got nothing when running Flink 1.11.1 (also no output for 1.10.1).
What's more, if I change the pattern to only password_change, it still outputs nothing, but if I change the pattern to transfer, it outputs several rows (though not all transfer rows). If I swap the event times of the two tables so that the password_change events happen first, then the pattern password_change outputs several rows while transfer does not.
On the other hand, if I extract those columns from the two tables, merge them into one table manually and then emit them into Flink, the result is correct.
I searched and tried a lot to get it right, including changing the SQL statement, watermark, buffer timeout and so on, but nothing helped. Hope someone here can help. Thanks.
10/10/2020 update:
I use Kafka as the table source. tEnv is the StreamTableEnvironment.
Kafka kafka=new Kafka()
.version("universal")
.property("bootstrap.servers", "localhost:9092");
tEnv.connect(
kafka.topic("transfer")
).withFormat(
new Json()
.failOnMissingField(true)
).withSchema(
new Schema()
.field("rowtime",DataTypes.TIMESTAMP(3))
.rowtime(new Rowtime()
.timestampsFromField("eventtime")
.watermarksPeriodicBounded(1000)
)
.field("channel",DataTypes.STRING())
.field("eventtype",DataTypes.STRING())
.field("transid",DataTypes.STRING())
.field("accountnumber",DataTypes.STRING())
.field("value",DataTypes.DECIMAL(38,18))
).createTemporaryTable("transfer");
tEnv.connect(
kafka.topic("pchange")
).withFormat(
new Json()
.failOnMissingField(true)
).withSchema(
new Schema()
.field("rowtime",DataTypes.TIMESTAMP(3))
.rowtime(new Rowtime()
.timestampsFromField("eventtime")
.watermarksPeriodicBounded(1000)
)
.field("channel",DataTypes.STRING())
.field("accountnumber",DataTypes.STRING())
.field("eventtype",DataTypes.STRING())
).createTemporaryTable("password_change");
Thanks to @Dawid Wysakowicz's answer. To confirm it, I added 4,123,1,200,ONL,2020-01-01T01:00:10Z,transfer to the end of the transfer table, and the output became correct, which means it really is a problem with the watermarks.
So now the question is how to fix it. Since a user will not change his/her password frequently, the time gap between these two tables is unavoidable. I just need the UNION ALL table to behave the same way as the one I merged manually.
Update Nov. 4th 2020:
WatermarkStrategy with idle sources may help.
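For instance, a hedged sketch of that idea using the Table API's source idle-timeout option (PyFlink shown only for brevity; the question's code uses the Java Table API, where the same configuration key can be set on the TableConfig):
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)
# Mark a source as idle after 10 s without records so that its stalled watermark
# no longer holds back the watermark of the UNION ALL.
# (Assumption: "table.exec.source.idle-timeout" is available in the Flink 1.11+ build used.)
t_env.get_config().get_configuration().set_string("table.exec.source.idle-timeout", "10 s")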
Most likely the problem is somewhere around watermark generation in conjunction with the UNION ALL operator. Could you share how you create the two tables, including how you define the time attributes and which connectors you use? That would let me confirm my suspicions.
I think the problem is that one of the sources stops emitting Watermarks. If the transfer table (or whichever table has the lower timestamps) does not finish and produces no more records, it emits no further Watermarks. After emitting the fourth row it will emit Watermark = 3 (4 minus the 1-second bound). The Watermark of a union of inputs is the smaller of the two input values, so the first table holds the combined Watermark back at Watermark = 3; that is why you see no progress for the original query, while you do see some records emitted for the table with the smaller timestamps.
If you manually join the two tables, you have just a single input with a single source of Watermarks and thus it progresses further and you see some results.

MongoDB grab last versions from specified version

I have a set of test results in my mongodb database. Each document in the database contains version information, test data, date, test run information etc...
The version is broken up in the document and stored as individual values. For example: { VER_MAJOR : "0", VER_MINOR : "2", VER_REVISION : "3", VER_PATCH : "20" }
My application wants the ability to specify a specific version and grab the document as well as the previous N documents based on the version.
For example:
If version = 0.2.3.20 and n = 5 then the result would return documents with version 0.2.3.20, 0.2.3.19, 0.2.3.18, 0.2.3.17, 0.2.3.16, 0.2.3.15
The solutions that come to my mind are:
Create a new database that contains documents with version information and is kept sorted, which can be used to obtain the previous N versions and, from those, the corresponding N documents in the test results database.
Perform the sorting in the test results database itself, as in option 1. Though if the test results database is large, this will take a very long time. It also means inserting in order every time.
Creating another database as in option 1 doesn't seem like the right way. But sorting the test results database seems like it will produce lots of overhead - am I mistaken to be worried about option 2 producing lots of overhead? I have the impression I'd have to query the entire database and then sort it on the application side. Querying the entire database seems like overkill...
db.collection_name.find().sort([parameters for sorting])
You are quite correct that querying and sorting the entire data set would be very excessive. I probably went overboard on this, but I tried to break everything down in detail below.
Terminology
First things first, a couple of terminology nitpicks. I think you're using the term Database when you mean to use the word Collection. Differentiating between these two concepts will help with navigating the documentation and allow for a better understanding of MongoDB.
Collections and Sorting
Second, it is important to understand that documents in a Collection have no inherent ordering. The order in which documents are returned to your app is only applied when retrieving documents from the Collection, such as when specifying .sort() on a query. This means we won't need to copy all of the documents to some other collection; we just need to query the data so that only the desired data is returned in the order we want.
Query
Now to the fun part. The query will look like the following:
db.test_results.find({
"VER_MAJOR" : "0",
"VER_MINOR" : "2",
"VER_REVISION" : "3",
"VER_PATCH" : { "$lte" : 20 }
}).sort({
"VER_PATCH" : -1
}).limit(N)
Our query has a direct match on the three leading version fields to limit results to only those values, i.e. the specific version "0.2.3". A range $lte filter is applied on VER_PATCH since we will want more than a single patch revision.
We then sort results by VER_PATCH to return results descending by the patch version. Finally, the limit operator is used to restrict the number of documents being returned.
Index
We're not done yet! Remember how you said that querying the entire collection and sorting it on the app side felt like overkill? Well, the database would be doing exactly that if an index did not exist for this query.
You should follow the equality-sort-range rule when determining the order of fields in an index. In this case, this would give us the index:
{ "VER_MAJOR" : 1, "VER_MINOR" : 1, "VER_REVISION" : 1, "VER_PATCH" : 1 }
Creating this index will allow the query to complete by scanning only the results it would return, while avoiding an in-memory sort. More information can be found here.
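If you are driving this from Python, a minimal pymongo sketch of the same index and query could look like the following (the client, database, and collection names are assumptions):
from pymongo import MongoClient, ASCENDING, DESCENDING

test_results = MongoClient().testdb.test_results  # database name assumed

# Compound index: the three equality fields first, then the field used for sort/range.
test_results.create_index([("VER_MAJOR", ASCENDING), ("VER_MINOR", ASCENDING),
                           ("VER_REVISION", ASCENDING), ("VER_PATCH", ASCENDING)])

N = 5
docs = list(
    test_results.find({"VER_MAJOR": "0", "VER_MINOR": "2", "VER_REVISION": "3",
                       "VER_PATCH": {"$lte": 20}})
                .sort("VER_PATCH", DESCENDING)
                .limit(N)  # use N + 1 if the requested version itself should be counted separately
)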

Is it possible to re-order query results in memory?

and thanks in advance for any and all help!!
I'm running a query on the datastore that looks like this:
forks = Thing.query(ancestor=user.subscriber_key).filter(
Thing.status==True,
Thing.fork_of==thing_key,
Thing.start_date <= user.day_threshold(),
Thing.level.IN([1,2,3,4,5])).order(
Thing.level)
This query works and returns the results I expect. However, I would like to sort it on one additional field (Thing.last_touched). If I add this to the sort, it won't work because Thing.last_touched is not the property to which the inequality filter is applied. I can't add an additional inequality filter, since we're only allowed one, plus it's not needed (actually, that's why Thing.level.IN is there... not needed as a filter, but required for the sort).
So, what I'm wondering is, could I run the query with the filters that I want, and then run code to sort the query results myself? I know I could pull all the parameters I want to sort and store them in dictionaries and sort them that way, but it seems to me there ought to be a way to handle this with the query.
I've searched for days for this but have had no luck.
Just in case you need it, here's the class definition of Thing:
class Thing(ndb.Model):
    title = ndb.StringProperty()
    level = ndb.IntegerProperty()
    fork = ndb.BooleanProperty()
    recursion_level = ndb.IntegerProperty()
    fork_of = ndb.KeyProperty()
    creation_date = ndb.DateTimeProperty(auto_now_add=True)
    last_touched = ndb.DateTimeProperty(auto_now=True)
    status = ndb.BooleanProperty()
    description = ndb.StringProperty()
    owner_id = ndb.StringProperty()
    frequency = ndb.IntegerProperty()
    start_date = ndb.DateTimeProperty(auto_now_add=True)
    due_date = ndb.DateTimeProperty()
One of the main reasons that Google AppEngine is so fast even when dealing with insane amounts of data is its very limited query options. All standard queries are "scans" over an index, i.e. there is some table (index) that keeps references to your actual data entries, sorted by ONE of the data's properties. So, let's say you add the following entries:
Thing A: start-date = Wednesday (I'm just going to use weekdays for simplicity)
Thing B: start-date = Friday
Thing C: start-date = Monday
Thing D: start-date = Thursday
Then, AppEngine will create an index that looks like this:
1 - Monday -> Thing C
2 - Wednesday -> Thing A
3 - Thursday -> Thing D
4 - Friday -> Thing B
Now, any query will correspond to a continuous block in this (or another) index. If you, for example, say "All Things with start-date >= Tuesday", it will return the entries in rows 2 through 4 (i.e. Thing A, Thing D, and Thing B, in that exact order!). If you query for "< Thursday", you get rows 1-2. If you say "> Tuesday and <= Thursday" you get rows 2-3.
And if you are doing inequality filters on a different property, AppEngine will use a different index.
This is why you can only do one inequality filter and why the sort order is always determined by the property you apply the inequality filter to: AppEngine is not designed to be able to return items 1, 2, 4 (with a gap*) out of an index, or items 4, 2, 3 (no gap, but out of order).
So, if you need to sort your entries on a different property other than the one you use for inequality filtering, you basically have 3 choices:
Perform your query with the inequality filter, read all results into memory, and sort them in your code afterwards (I think this is what you mean by storing them in a dictionary)
Perform your query WITHOUT the inequality filter, but sorted on the right property. Then, as you loop over the returned entries, simply check the inequality yourself and drop the ones that don't match
Perform your query with the inequality filter and just return the items in the wrong order, and let the client-application worry about sorting them! ;)
Generally I would assume that you have many more unused resources available client-side to do the sorting, so I would probably go for option 3 in most cases. But if you need to sort the entries server-side (e.g. for a mobile app targeted at older smartphones), it will depend on the size of your database and the fraction of entries that usually match your inequality filter whether option 1 or option 2 is better. If your inequality filter only removes a small fraction of the entries, option 2 might be much faster (as it doesn't require any O(>n) sorting), but if you have a huge database of entries and only a very small number of them will match the inequality, definitely go for option 1.
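For option 1, a minimal sketch (reusing the query from the question and assuming the result set fits comfortably in memory) could look like this:
forks = Thing.query(ancestor=user.subscriber_key).filter(
    Thing.status == True,
    Thing.fork_of == thing_key,
    Thing.start_date <= user.day_threshold(),
    Thing.level.IN([1, 2, 3, 4, 5])).fetch()

# Sort the fetched entities in memory: level first, then last_touched.
forks.sort(key=lambda t: (t.level, t.last_touched))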
BTW: The talk "App Engine Datastore Under the Covers" from Google I/O 2008 might be a very helpful resource. It's a bit technical, but it gives a great overview of this topic and I consider it must-know information if you want to do anything in AppEngine. Note, though, that this talk is a bit outdated. There are a bunch more things that you can do with queries nowadays. But ALL of these extra things (if I understand correctly) are API functions that in the end just generate a set of several simple queries (exactly like the ones described in this talk) and then just combine the results of these in memory in your application (just like you would if you did your own sorting).
*There are some exceptions where AppEngine can generate the intersection of two (or more?) index-scans to drop items from the results, but I don't think that you could use that to change the order of the returned entries.

Django Query Optimisation

I am currently working on a telecom analytics project and am a newbie in query optimisation. Showing the result in the browser takes a full minute, while only 45,000 records need to be accessed. Could you please suggest ways to reduce the time it takes to show the results?
I wrote the following query to find the call duration of people in an age group:
sigma=0
popn=len(Demo.objects.filter(age_group=age))
card_list=[Demo.objects.filter(age_group=age)[i].card_no
           for i in range(popn)]
for card in card_list:
    dic=Fact_table.objects.filter(card_no=card).aggregate(Sum('duration'))
    sigma+=dic['duration__sum']
avgDur=sigma/popn
Above code is within for loop to iterate over age-groups.
Model is as follows:
class Demo(models.Model):
    card_no=models.CharField(max_length=20,primary_key=True)
    gender=models.IntegerField()
    age=models.IntegerField()
    age_group=models.IntegerField()

class Fact_table(models.Model):
    pri_key=models.BigIntegerField(primary_key=True)
    card_no=models.CharField(max_length=20)
    duration=models.IntegerField()
    time_8bit=models.CharField(max_length=8)
    time_of_day=models.IntegerField()
    isBusinessHr=models.IntegerField()
    Day_of_week=models.IntegerField()
    Day=models.IntegerField()
Thanks
Try that:
sigma=0
demo_by_age = Demo.objects.filter(age_group=age);
popn=demo_by_age.count() #One
card_list = demo_by_age.values_list('card_no', flat=True) # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration')) #Three
sigma = dic['duration__sum']
avgDur=sigma/popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn separate queries and database hits. The query in the for-loop will also hit the database popn times. As a general rule, you should try to minimize the number of queries you use, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
    card_no = models.SlugField(max_length=20, unique=True)
    ...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, it also makes a lot of queries faster by using JOIN clauses in the SQL query. So for your example:
class Fact_table(models.Model):
    card = models.ForeignKey(Demo, related_name='facts')
    ...
The related_name field allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
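For example, a small (hypothetical) use of that reverse relation, assuming the adjusted models above:
from django.db.models import Sum

demo = Demo.objects.get(card_no='1234')  # the card number is just an example value
total_duration = demo.facts.aggregate(Sum('duration'))['duration__sum']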
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
    print "Age group: %s - Average duration: %s" % (group['age_group'], group['duration_avg'])
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.

key-value store for time series data?

I've been using SQL Server to store historical time series data for a couple hundred thousand objects, observed about 100 times per day. I'm finding that queries (give me all values for object XYZ between time t1 and time t2) are too slow (for my needs, slow is more than a second). I'm indexing by timestamp and object ID.
I've entertained the thought of using something like a key-value store such as MongoDB instead, but I'm not sure whether this is an "appropriate" use of this sort of thing, and I couldn't find any mention of using such a database for time series data. Ideally, I'd be able to do the following queries:
retrieve all the data for object XYZ between time t1 and time t2
do the above, but return one data point per day (first, last, closest to time t...)
retrieve all data for all objects for a particular timestamp
the data should be ordered, and ideally it should be fast to write new data as well as update existing data.
It seems like my desire to query by object ID as well as by timestamp might necessitate having two copies of the database indexed in different ways to get optimal performance... Does anyone have any experience building a system like this, with a key-value store, or HDF5, or something else? Or is this totally doable in SQL Server and I'm just not doing it right?
It sounds like MongoDB would be a very good fit. Updates and inserts are super fast, so you might want to create a document for every event, such as:
{
object: XYZ,
ts : new Date()
}
Then you can index the ts field and queries will also be fast. (By the way, you can create multiple indexes on a single collection.)
How to do your three queries:
retrieve all the data for object XYZ between time t1 and time t2
db.data.find({object : XYZ, ts : {$gt : t1, $lt : t2}})
do the above, but return one data point per day (first, last, closest to time t...)
// first
db.data.find({object : XYZ, ts : {$gt : new Date(/* start of day */)}}).sort({ts : 1}).limit(1)
// last
db.data.find({object : XYZ, ts : {$lt : new Date(/* end of day */)}}).sort({ts : -1}).limit(1)
For closest to some time, you'd probably need a custom JavaScript function, but it's doable.
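If you are querying from Python, one alternative to a custom JavaScript function is to take the nearest neighbour on each side of t and keep whichever is closer. A hedged pymongo sketch (collection and field names follow the examples above; the database name is an assumption):
from pymongo import MongoClient, ASCENDING, DESCENDING

data = MongoClient().tsdb.data
data.create_index([("object", ASCENDING), ("ts", ASCENDING)])  # per-object range queries
data.create_index("ts")  # lookups by timestamp alone

def closest_to(obj, t):
    after = list(data.find({"object": obj, "ts": {"$gte": t}}).sort("ts", ASCENDING).limit(1))
    before = list(data.find({"object": obj, "ts": {"$lte": t}}).sort("ts", DESCENDING).limit(1))
    candidates = after + before
    # pick the candidate with the smallest absolute time difference from t
    return min(candidates, key=lambda d: abs(d["ts"] - t)) if candidates else None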
retrieve all data for all objects for a particular timestamp
db.data.find({ts : timestamp})
Feel free to ask on the user list if you have any questions, someone else might be able to think of an easier way of getting closest-to-a-time events.
This is why databases specific to time series data exist - relational databases simply aren't fast enough for large time series.
I've used Fame quite a lot at investment banks. It's very fast but I imagine very expensive. However, if your application requires the speed it might be worth looking into.
There is an open source time series database under active development (.NET only for now) that I wrote. It can store massive amounts (terabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for stock tick storage and analysis at our company.
I am not sure this will be exactly what you need, but it will allow you to get the first two points - get values from t1 to t2 for any series (one series per file) or just take one data point.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
file.UniqueIndexes = true; // enforces index uniqueness
file.InitializeNewFile(); // create file and write header
file.AppendData(data); // append data (stream of ArraySegment<>)
}
// Read needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStruct>) BinaryFile.Open("data.bts", false))
{
// Enumerate one item at a time, maximum 10 items, starting at 2011-1-1
// (can also get one segment at a time with StreamSegments)
foreach (var val in file.Stream(new UtcDateTime(2011, 1, 1), maxItemCount: 10))
Console.WriteLine(val);
}
I recently tried something similar in F#. I started with the 1-minute bar format for the symbol in question in a space-delimited file which has roughly 80,000 1-minute bar readings. The code to load and parse from disk was under 1 ms. The code to calculate a 100-minute SMA for every period in the file was 530 ms. I can pull any slice I want from the SMA sequence once calculated in under 1 ms. I am just learning F#, so there are probably ways to optimize. Note this was after multiple test runs, so it was already in the Windows cache, but even when loaded from disk it never adds more than 15 ms to the load.
date,time,open,high,low,close,volume
01/03/2011,08:00:00,94.38,94.38,93.66,93.66,3800
To reduce the recalculation time I save the entire calculated indicator sequence to disk in a single file with a \n delimiter, and it generally takes less than 0.5 ms to load and parse when it is in the Windows file cache. Simple iteration across the full time series data to return the set of records inside a date range is a sub-3 ms operation with a full year of 1-minute bars. I also keep the daily bars in a separate file, which loads even faster because of the lower data volumes.
I use the .NET 4 System.Runtime.Caching layer to cache the serialized representation of the pre-calculated series, and with a couple of gigs of RAM dedicated to the cache I get nearly a 100% cache hit rate, so my access to any pre-computed indicator set for any symbol generally runs under 1 ms.
Pulling any slice of data I want from the indicator is typically less than 1 ms, so advanced queries simply do not make sense. Using this strategy I could easily load 10 years of 1-minute bars in less than 20 ms.
// Parse a \n delimited file into RAM, then
// split each line on space into an
// array of tokens. Return the entire array
// as string[][]
let readSpaceDelimFile fname =
    System.IO.File.ReadAllLines(fname)
    |> Array.map (fun line -> line.Split [|' '|])

// Based on a two dimensional array,
// pull out a single column for bar
// close and convert every value
// for every row to a float
// and return the array of floats.
let GetArrClose(tarr : string[][]) =
    [| for aLine in tarr do
           //printfn "aLine=%A" aLine
           let closep = float(aLine.[5])
           yield closep
    |]
I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages.
I use boost::date_time for the timestamp field.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library features to be able to efficiently search for specific values or ranges of time-based records.
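As a rough illustration of that kind of layout, here is a hedged sketch using h5py (not the poster's actual code; the dataset layout and field names are made up):
import h5py
import numpy as np

# One compound, gzip-compressed dataset per instrument; timestamps stored as int64 epochs.
trade_dtype = np.dtype([("ts", "i8"), ("price", "f8"), ("size", "i4")])
with h5py.File("ticks.h5", "w") as f:
    f.create_dataset("XYZ/trades", shape=(0,), maxshape=(None,),
                     dtype=trade_dtype, compression="gzip", chunks=True)

def read_range(path, dataset, t1, t2):
    # Rows are appended in timestamp order, so searchsorted gives the slice bounds directly.
    with h5py.File(path, "r") as f:
        data = f[dataset][...]
    lo, hi = np.searchsorted(data["ts"], [t1, t2])
    return data[lo:hi]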
