TitanDB - Build a property index in descending order by timestamp - database

TitanDB 1.0.0 (on top of DynamoDB)
Gremlin 3
I've got a set of vertices with a label a. I have a property of type long on those vertices which corresponds to the time in milliseconds from 1970 UTC (timestamp of when the vertex was created.) When I pull back those vertices I want to be able to pull them back in decsending order.
How can I create an index on that property in the decr order in Titan Management System?
Documentation seems vague on that.
Closest thing I found is
public RelationTypeIndex buildPropertyIndex(PropertyKey key,
String name,
Order sortOrder,
PropertyKey... sortKeys)
But what do I put in as the key and sortKeys? I want to be able to pull the whole vertex ordered by the timestamp property
Edit: The only way I know of doing this at the minute is by duplicating that property on the edge and using a vertex centric index on the edge to increase the performance.

I do something similar as I order by the release date of a particular product. If you want to execute order().by("Value", decr) efficiently then I highly recommend reading into mixed indices specified here. Mixed indices allow these operations to be done quickly.
An example of what you may be looking for:
TitanGraph graph = TitanFactory.open(config);
TitanManagement mgmt = graph.openManagement();
PropertyKey key = mgmt.makePropertyKey("TimeStamp").dataType(Long.class).make();
mgmt.buildIndex("timeStampIndex", Vertex.class).addKey(key).buildMixedIndex("search");
mgmt.commit();
The above index lets me execute the following query very quickly:
g.V().order().by("TimeStamp", decr);
Which gets me the time stamps in descending order. The above traversal will work without indexing but can be slow in large graphs.

Related

Apache Flink: How to group every n rows with the Table API?

Recently I am trying to use Apache Flink for fast batch processing.
I have a table with a column:value and an irrelevant index column
Basically I want to calculate the mean and range of every 5 rows of value. Then I am going to calculate the mean and standard deviation based on those mean I just calculated. So I guess the best way is to use Tumble window.
It looks like this
DataSet<Tuple2<Double, Integer>> rawData = {get the source data};
Table table = tableEnvironment.fromDataSet(rawData);
Table groupedTable = table
.window(Tumble.over("5.rows").on({what should I write?}).as("w")
.groupBy("w")
.select("f0.avg, f0.max-f0.min");
{The next step is to use groupedTable to calculate overall mean and stdDev}
But I don't know what to write in .on(). I have tried "proctime" but it said there is no such input. I just want it to group by the order as it reads from the source. But it has to be a time attribute so I cannot use "f2" - the index column as ordering as well.
Do I have to add a timestamp to do this? Is it necessary in batch processing and will it slow down the calculation? What is the best way to solve this?
Update :
I tried to use a sliding window in the table API and it gets me Exception.
// Calculate mean value in each group
Table groupedTable = table
.groupBy("f0")
.select("f0.cast(LONG) as groupNum, f1.avg as avg")
.orderBy("groupNum");
//Calculate moving range of group Mean using sliding window
Table movingRangeTable = groupedTable
.window(Slide.over("2.rows").every("1.rows").on("groupNum").as("w"))
.groupBy("w")
.select("groupNum.max as groupNumB, (avg.max - avg.min) as MR");
The Exception is:
Exception in thread "main" java.lang.UnsupportedOperationException: Count sliding group windows on event-time are currently not supported.
at org.apache.flink.table.plan.nodes.dataset.DataSetWindowAggregate.createEventTimeSlidingWindowDataSet(DataSetWindowAggregate.scala:456)
at org.apache.flink.table.plan.nodes.dataset.DataSetWindowAggregate.translateToPlan(DataSetWindowAggregate.scala:139)
...
Does that mean that sliding window is not supported in Table API? If I recall correctly there is no window function in DataSet API. Then how do I calculate moving range in batch process?
The window clause is used to define a grouping based on a window function, such as Tumble or Session. Grouping every 5 rows is not well defined in the Table API (or SQL) unless you specify the order of the rows. This is done in the on clause of the Tumble function. Since this feature originates from stream processing, the on clause expects a timestamp attribute.
You can fetch the timestamp of the current time using the currentTimestamp() function. However, I should point out that Flink will sort the data as it is not aware of the monotonic property of the function. Moreover, all of that will with a parallelism of 1 because there is no clause that would allow for partitioning.
Alternatively, you can also implement a user-defined scalar function that converts the index attribute into a timestamp (effectively a Long value). But again, Flink will do a full sort of the data.

Referential Integrity with Neo4j

I am working on a project that uses a graph database to hold click data for a search engine. The nodes can be search terms or urls, and the edges hold a weight attribute, and a percentage of times that search led to someone clicking that URL.
Number of times the URL was clicked / Number of times term was searched
My issue is that when I update the edges, the percentage will be accurate, but if I later update the search term node and the searched count changes, the edge will no longer have the correct percentage. Is there a way in Neo4j to keep referential integrity? like a foreign key type thing?
The following info might be helpful.
If you stored the number of clicks instead of the percentage, there is no way to get inconsistent data. For example:
(:Term {id: 1, nSearches: 123})-[:HAS_URL {weight: 2, nClicks: 17}]->(:Url {id: 2})
With this data model, you'd calculate the percentage whenever you needed it.
For example, to find the 10 terms that have the highest percentage of visits to a specific URL:
MATCH (term:Term)-[r:HAS_URL]->(url:Url {id: 2})
RETURN url, term
ORDER BY r.nClicks/term.nSearches DESC
LIMIT 10;
But notice that the inverse query (find the 10 URLs that have the highest percentage of visits from a specific term) does not even require that you calculate the percentage! This is because in this case the percentages all have the same denominator. So, you can just use nClicks for sorting:
MATCH (term:Term {id: 1})-[r:HAS_URL]->(url:Url)
RETURN term, url
ORDER BY r.nClicks DESC
LIMIT 10;
Unfortunately no, neo4j doesn't support this. You can still do it, with one of two methods. I'll tell you what they both are, then make a recommendation.
Relative to your relational database, I don't think you're looking for a foreign key or "referential integrity" -- I think what you're looking for is more like a trigger. A trigger is like a function or procedure that executes when data changes. In your case, it'd probably be good to have trigger functions that re-calculated all of the weight percentages on incident edges.
Option 1 - The capable Max De Marzi has got you covered there with a description of how you can do triggers in neo4j. Spoiling the surprise, there's a TransactionEventHandler in the java API. When the right kind of transaction comes through, you can catch that and do extra stuff.
Option 2 - the server provides an extension/plugin mechanism so that you could write this on your own. This is a big hammer, it can do just about anything, but it's harder to wield, too.
I'd recommend you look into Max's post and the TransactionEventHandler. You might then implement public void afterCommit(TransactionData transactionData, Object o). In that method, you'd check out the transaction data to see if it was something of interest (not all transactions would be of interest). If the transaction updated a search term node or searched count changes, then I'd go do your recomputation, fix your weights, and you should be good.

Multi-location entity query solution with geographic distance calculation

in my project we have an entity called Trip. This trip has two points: start and finish. Start and finish are geo coordinates with some added properties like address atc.
what i need is to query for all Trips that satisifies search criteria for both start and finish.
smth like
select from trips where start near 16,16 and finish near 18,20 where type = type
So my question is: which database can offer such functionality?
what i have tried
i have explored mongodb which has support for geo indexes but does not support this use case. current solution stores the points as separate documents which have a reference to a Trip. we run two separate quesries for starts and finishes, then extract ids of their associated trips and then select trip ids that are found both in starts and finishes and finally return a collection of trips.
on a small sample it works fine but with a larger collection it gets slow and it's like scratching my left ear with my right hand.
so i am looking for a better solution.
i know about neo4j and its spatial plugin but i couldn't even make it work on windows. would it support our use case?
or are there any better solutions? preferably with a object mapper written in php.
like edze already said Postgres (PostGIS) or SQLite(SpatiaLite) is what your looking for
SELECT
*
FROM
trips
WHERE
ST_Distance(ST_StartPoint(way), ST_GeomFromText('POINT(16 16)',4326) < 5
AND ST_Distance(ST_EndPoint(way), ST_GeomFromText('POINT(18 20)',4326) < 5
AND type = 'type'

MDX MEMBER causing NON EMPTY to not filter

I'm using an MDX query to pull information to support a set of reports. A high degree of detail is required for the reports so they take some time to generate. To speed up the access time we pull the data we need and store it in a flat Oracle table and then connect to the table in Excel. This makes the reports refresh in seconds instead of minutes.
Previously the MDX was generated and run by department for 100 departments and then for a number of other filters. All this was done in VB.Net. The requirements for filters have grown to the point where this method is not sustainable (and probably isn't the best approach regardless).
I've built the entire dataset into one MDX query that works perfectly. One of my sets that I cross join includes members from three different levels of hierarchy, it looks like this:
(
Descendants([Merch].[Merch CHQ].[All], 2),
Descendants([Merch].[Merch CHQ].[All], 3),
[Merch].[Merch CHQ].[Department].&[1].Children
)
The problem for me is in our hierarchy (which I can't change), each group (first item) and each department (second item) have the same structure to their naming, ie 15-DeptName and it's confusing to work with.
To address it I added a member:
MEMBER
[Measures].[Merch Level] AS
(
[Merch].[Merch CHQ].CurrentMember.Level.Name
)
Which returns what type the member is and it works perfectly.
The problem is that it updates for every member so none of the rows get filtered by NON BLANK, instead of 65k rows I have 130k rows which will hurt my access performance.
Can my query be altered to still filter out the non blanks short of using IIF to check each measurement for null?
You can specify Null for your member based on your main measure like:
MEMBER
[Measures].[Merch Level] AS
IIf(IsEmpty([Measures].[Normal Measure]),null,[Merch].[Merch CHQ].CurrentMember.Level.Name)
That way it will only generate when there is data. You can go further and add additional dimensions to the empty check if you need to get more precise.

key-value store for time series data?

I've been using SQL Server to store historical time series data for a couple hundred thousand objects, observed about 100 times per day. I'm finding that queries (give me all values for object XYZ between time t1 and time t2) are too slow (for my needs, slow is more then a second). I'm indexing by timestamp and object ID.
I've entertained the thought of using somethings a key-value store like MongoDB instead, but I'm not sure if this is an "appropriate" use of this sort of thing, and I couldn't find any mentions of using such a database for time series data. ideally, I'd be able to do the following queries:
retrieve all the data for object XYZ between time t1 and time t2
do the above, but return one date point per day (first, last, closed to time t...)
retrieve all data for all objects for a particular timestamp
the data should be ordered, and ideally it should be fast to write new data as well as update existing data.
it seems like my desire to query by object ID as well as by timestamp might necessitate having two copies of the database indexed in different ways to get optimal performance...anyone have any experience building a system like this, with a key-value store, or HDF5, or something else? or is this totally doable in SQL Server and I'm just not doing it right?
It sounds like MongoDB would be a very good fit. Updates and inserts are super fast, so you might want to create a document for every event, such as:
{
object: XYZ,
ts : new Date()
}
Then you can index the ts field and queries will also be fast. (By the way, you can create multiple indexes on a single database.)
How to do your three queries:
retrieve all the data for object XYZ
between time t1 and time t2
db.data.find({object : XYZ, ts : {$gt : t1, $lt : t2}})
do the above, but return one date
point per day (first, last, closed to
time t...)
// first
db.data.find({object : XYZ, ts : {$gt : new Date(/* start of day */)}}).sort({ts : 1}).limit(1)
// last
db.data.find({object : XYZ, ts : {$lt : new Date(/* end of day */)}}).sort({ts : -1}).limit(1)
For closest to some time, you'd probably need a custom JavaScript function, but it's doable.
retrieve all data for all objects for
a particular timestamp
db.data.find({ts : timestamp})
Feel free to ask on the user list if you have any questions, someone else might be able to think of an easier way of getting closest-to-a-time events.
This is why databases specific to time series data exist - relational databases simply aren't fast enough for large time series.
I've used Fame quite a lot at investment banks. It's very fast but I imagine very expensive. However if your application requires the speed it might be worth looking it.
There is an open source timeseries database under active development (.NET only for now) that I wrote. It can store massive amounts (terrabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for the stock ticks storage and analysis at our company.
I am not sure this will be exactly what you need, but it will allow you to get the first two points - get values from t1 to t2 for any series (one series per file) or just take one data point.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
file.UniqueIndexes = true; // enforces index uniqueness
file.InitializeNewFile(); // create file and write header
file.AppendData(data); // append data (stream of ArraySegment<>)
}
// Read needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStrut>) BinaryFile.Open("data.bts", false))
{
// Enumerate one item at a time maxitum 10 items starting at 2011-1-1
// (can also get one segment at a time with StreamSegments)
foreach (var val in file.Stream(new UtcDateTime(2011,1,1), maxItemCount = 10)
Console.WriteLine(val);
}
I recently tried something similar in F#. I started with the 1 minute bar format for the symbol in question in a Space delimited file which has roughly 80,000 1 minute bar readings. The code to load and parse from disk was under 1ms. The code to calculate a 100 minute SMA for every period in the file was 530ms. I can pull any slice I want from the SMA sequence once calculated in under 1ms. I am just learning F# so there are probably ways to optimize. Note this was after multiple test runs so it was already in the windows Cache but even when loaded from disk it never adds more than 15ms to the load.
date,time,open,high,low,close,volume
01/03/2011,08:00:00,94.38,94.38,93.66,93.66,3800
To reduce the recalculation time I save the entire calculated indicator sequence to disk in a single file with \n delimiter and it generally takes less than 0.5ms to load and parse when in the windows file cache. Simple iteration across the full time series data to return the set of records inside a date range in a sub 3ms operation with a full year of 1 minute bars. I also keep the daily bars in a separate file which loads even faster because of the lower data volumes.
I use the .net4 System.Runtime.Caching layer to cache the serialized representation of the pre-calculated series and with a couple gig's of RAM dedicated to cache I get nearly a 100% cache hit rate so my access to any pre-computed indicator set for any symbol generally runs under 1ms.
Pulling any slice of data I want from the indicator is typically less than 1ms so advanced queries simply do not make sense. Using this strategy I could easily load 10 years of 1 minute bar in less than 20ms.
// Parse a \n delimited file into RAM then
// then split each line on space to into a
// array of tokens. Return the entire array
// as string[][]
let readSpaceDelimFile fname =
System.IO.File.ReadAllLines(fname)
|> Array.map (fun line -> line.Split [|' '|])
// Based on a two dimensional array
// pull out a single column for bar
// close and convert every value
// for every row to a float
// and return the array of floats.
let GetArrClose(tarr : string[][]) =
[| for aLine in tarr do
//printfn "aLine=%A" aLine
let closep = float(aLine.[5])
yield closep
|]
I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages.
I use boost::date_time for the timestamp field.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library features to be able to efficiently search for specific values or ranges of time-based records.

Resources