Neo4j Create new label based on other labels with certain properties

I need to create Trajectories based on Points.
A Trajectory can contain any number of Points that meet certain criteria.
Criteria are: cameraSid, trajectoryId, classType and classQual should be equal.
The difference in time (at) of each point must be less than or equal to 1 hour.
In order to create a Trajectory we need at least one point.
In order to associate a new Point with an existing Trajectory, the latest associated Point of the trajectory must be no more than 1 hour older than the new point.
If the new Point has the exact same properties but is more than 1 hour older, then a new Trajectory needs to be created.
I've been reading a lot, but I cannot make this work as it should.
This is what I have tried so far:
MATCH (p:Point)
WHERE NOT (:Trajectory)-[:CONTAINS]->(p)
WITH p.cameraSid AS cameraSid, p.trajectoryId AS trajectoryId, p.classType AS classType, p.classQual AS classQual, COLLECT(p) AS points
UNWIND points AS point
MERGE (trajectory:Trajectory{trajectoryId:point.trajectoryId, cameraSid: point.cameraSid, classType: point.classType, classQual: point.classQual, date: date(datetime(point.at))})
MERGE (trajectory)-[:CONTAINS{at:point.at}]->(point)
I have no idea how to create this sort of condition (1hr or less) in the MERGE clause.
Here are the Neo4j queries to create some data:
// Create points
LOAD CSV FROM 'https://uca54485eb4c5d2a6869053af475.dl.dropboxusercontent.com/cd/0/get/AmR2pn0hC0c-CQW_mSS-TDqHQyi7MNVjPvqffQHhSIyMP37D7UMtfODdHDkNWi6-HqzQdp4ob2Q3326g6imEd26F3sdNJyJuAeNa8wJA2o_E6A/file?dl=1#' AS line
CREATE (:Point{trajectoryId: line[0],at: line[1],cameraSid: line[2],activity: line[3],x: line[4],atEpochMilli: line[5],y: line[6],control: line[7],classQual: line[8],classType: line[9],uniqueIdentifier: line[10]})
// Create Trajectory based on Points
MATCH (p:Point)
WHERE NOT (:Trajectory)-[:CONTAINS]->(p)
WITH p.cameraSid AS cameraSid, p.trajectoryId AS trajectoryId, p.classType AS classType, p.classQual AS classQual, COLLECT(p) AS points
UNWIND points AS point
MERGE (trajectory:Trajectory{trajectoryId:point.trajectoryId, cameraSid: point.cameraSid, classType: point.classType, classQual: point.classQual, date: date(datetime(point.at))})
MERGE (trajectory)-[:CONTAINS{at:point.at}]->(point)
If the link to the CSV file does not work, here is an alternative; in this case, you will have to download the file and then import it locally from your Neo4j instance.

I think this is one of those "just because you can do it in one Cypher statement doesn't mean you should" situations, and you will almost certainly find this easier to do in application code.
Regardless, it can be done using APOC and by introducing an instanceId unique property on your Trajectory nodes.
Possible solution
This almost certainly won't scale, and you'll want indexes (discussed later based on educated guesswork).
First we need to change your import script to make sure that the at property is a datetime and not just a string (otherwise we end up peppering the queries with datetime() calls):
LOAD CSV FROM 'file:///export.csv' AS line
CREATE (:Point{trajectoryId: line[0], at: datetime(line[1]), cameraSid: line[2], activity: line[3],x: line[4], atEpochMilli: line[5], y: line[6], control: line[7], classQual: line[8], classType: line[9], uniqueIdentifier: line[10]})
The following then appears to take your sample data set and add Trajectories per your requirements (and can be run whenever new Points are added).
CALL apoc.periodic.iterate(
'
MATCH (p: Point)
WHERE NOT (:Trajectory)-[:CONTAINS]->(p)
RETURN p
ORDER BY p.at
',
'
OPTIONAL MATCH (t: Trajectory { trajectoryId: p.trajectoryId, cameraSid: p.cameraSid, classQual: p.classQual, classType: p.classType })-[:CONTAINS]-(trajPoint:Point)
WITH p, t, max(trajPoint.at) as maxAt, min(trajPoint.at) as minAt
WITH p, max(case when t is not null AND (
(p.at <= datetime(maxAt) + duration({ hours: 1 }))
AND
(p.at >= datetime(minAt) - duration({ hours: 1 }))
)
THEN t.instanceId ELSE NULL END) as instanceId
MERGE (tActual: Trajectory { trajectoryId: p.trajectoryId, cameraSid: p.cameraSid, classQual: p.classQual, classType: p.classType, instanceId: COALESCE(instanceId, randomUUID()) })
ON CREATE SET tActual.date = date(datetime(p.at))
MERGE (tActual)-[:CONTAINS]->(p)
RETURN instanceId
',
{ parallel: false, batchSize: 1 })
Explanation
The problem as posed is tricky because the decision on whether or not to create a new Trajectory or add the point to an existing one depends entirely on how we handled all prior Points. That means two things:
We need to process the Points in order to make sure we create reliable Trajectories - we start with the earliest, and work up
We need each creation or amend of a Trajectory to be immediately visible for the processing of the next Point - that is to say that we need to handle each Point in isolation, as though it were a mini-transaction
We'll use apoc.periodic.iterate with a batchSize of 1 to give us the behaviour we need.
The first parameter builds the set of nodes to be processed - all those Points which aren't currently part of a Trajectory, sorted by their timestamp.
The second parameter to apoc.periodic.iterate is where the magic's happening so let's break that down - given a point p that isn't part of a Trajectory so far:
OPTIONAL MATCH (t: Trajectory { trajectoryId: p.trajectoryId, cameraSid: p.cameraSid, classQual: p.classQual, classType: p.classType })-[:CONTAINS]-(trajPoint:Point)
WITH p, t, max(trajPoint.at) as maxAt, min(trajPoint.at) as minAt
WITH p, max(case when t is not null AND (
(p.at <= datetime(maxAt) + duration({ hours: 1 }))
AND
(p.at >= datetime(minAt) - duration({ hours: 1 }))
)
THEN t.instanceId ELSE NULL END) as instanceId
Finds any Trajectory that matches the key fields and that contains a Point within an hour of the incoming point p, and picks out its instanceId property if we find a suitable match (or the biggest instanceId we found if there are multiple matches - we just want to ensure there are zero or one rows by this point)
We'll see what instanceId is all about in a minute, but consider it a unique identifier for a given Trajectory
MERGE (tActual: Trajectory { trajectoryId: p.trajectoryId, cameraSid: p.cameraSid, classQual: p.classQual, classType: p.classType, instanceId: COALESCE(instanceId, randomUUID()) })
ON CREATE SET tActual.date = date(datetime(p.at))
Ensure that there's a Trajectory that matches the incoming point's key fields - if the previous code found a matching Trajectory then the MERGE has no work to do. Otherwise, create the new Trajectory - we set its instanceId property to a new random UUID if we didn't match a Trajectory earlier, to compel the MERGE to create a new node (since no other node will exist with that UUID, even if one matches all the other key fields)
MERGE (tActual)-[:CONTAINS]->(p)
RETURN instanceId
tActual is now the Trajectory that the incoming point p should belong to - create the :CONTAINS relationship
The third parameter is vital:
{ parallel: false, batchSize: 1 })
Important: For this to work, each iteration of the 'inner' Cypher statement has to happen exactly in order, so we force a batchSize of 1 and disable parallelism to prevent APOC from scheduling the batches in any way other than one-at-a-time.
Indexing
I think the performance of the above is going to degrade quickly as the size of the import grows and the number of Trajectories increases. At a minimum I think you'll want composite indexes on the following (example index statements are sketched after this list):
:Trajectory(trajectoryId, cameraSid, classQual, classType) - so that the initial match to find candidate trajectories for a given Point is quick
:Trajectory(trajectoryId, cameraSid, classQual, classType, instanceId) - so that the MERGE at the end finds the existing Trajectory to add to quickly, if one exists
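As a rough sketch of those index definitions (the index names are just placeholders; on Neo4j 3.x the older CREATE INDEX ON :Trajectory(...) syntax applies instead):
CREATE INDEX trajectory_key IF NOT EXISTS
FOR (t:Trajectory) ON (t.trajectoryId, t.cameraSid, t.classQual, t.classType);
CREATE INDEX trajectory_key_instance IF NOT EXISTS
FOR (t:Trajectory) ON (t.trajectoryId, t.cameraSid, t.classQual, t.classType, t.instanceId);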
However - that's guesswork from eyeballing the query, and unfortunately you can't see into the query properly to tell what the execution plan is because we're using apoc.periodic.iterate - an EXPLAIN or PROFILE will just tell you that there's one procedure call costing 1 db hit, which is true but not helpful.

Related

How to find a MoveTo destination filled by database?

I could use some help with an AnyLogic model.
Model (short): a manufacturing scenario in which orders move along individual routes. The workplaces (WP) are created dynamically at simulation start. Their names, quantity and other parameters are stored in a database (Excel import). The orders are also created according to an import. The agent population "order" has a collection routing which contains the workplaces it has to stop at, in the specific order.
Target: I want a moveTo block in main which finds the next destination of the agent order.
Problem and solution paths:
I set the destination type to agent and in the Agent field I typed a function agent.getDestination(). This function is in Order and returns the next entry of the collection: WP destinationName = routing.get(i). With this I get a datatype error (at runtime, not while compiling). I guess it's because the database does not save the entries as the WP type but only as String.
Is there a possibility to create a collection with agents from an Excel file?
After this I tried to use the same getDestination as a String and then find, via findFirst, the WP matching the returned name and return it as a WP: WP targetWP = findFirst(wps, w -> w.name == destinationName);
Of course wps (the population of Workplaces) couldn't be found.
How can I search the population?
Maybe with an Agentlink?
I think it is not that difficult, but I can't find an answer or a solution. As you can tell, I'm a beginner... I hope the description is good and someone can help me or give me a hint :)
Thanks
Is there a possibility to create a collection with agents from an Excel file?
Not directly using the collection's properties and, as you've seen, you can't have database (DB) column types which are agent types.¹
But this is relatively simple to do directly via Java code (and you can use the Insert Database Query wizard to construct the skeleton code for you).
After this I tried to use the same getDestination as a String and then find, via findFirst, the WP matching the returned name and return it as a WP
Yes, this is one approach. If your order details are in Excel/the database, they are presumably referring to workplaces via some String ID (which will be a parameter of the workplace agents you've created from a separate Excel worksheet/database table). You need to use the Java equals method to compare strings though, not == (which is for comparing numbers or whether two objects are the same object).
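For example, the findFirst line from the question would then be (same wps population and destinationName variable as in the question):
WP targetWP = findFirst(wps, w -> w.name.equals(destinationName));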
I want a moveTo block in main which finds the next destination of the agent order
So the general overall solution is
Create a population of Workplace agents (let's say called workplaces in Main) from the DB, each with a String parameter id or similar mapped from a DB column.
Create a population of Order agents (let's say called orders in Main) from the DB and then, in their on-startup action, set up their collection of workplace IDs (type ArrayList, element class String; let's say called workplaceIDsList) using data from another DB table.
Order probably also needs a working variable storing the next index in the list that it needs to go to (so let's say an int variable nextWorkplaceIndex which starts at 0).
Write a function in Main called getWorkplaceByID that has a single String argument id and returns a Workplace. This gets the workplace from the population that matches the ID; a one-line way similar to yours is findFirst(workplaces, w -> w.id.equals(id)) (a minimal sketch of this function appears after this list).
The MoveTo block (which I presume is in Main) needs to move the Order to an agent defined by getWorkplaceByID(agent.workplaceIDsList.get(nextWorkplaceIndex++)). (The ++ bit increments the index after evaluating the expression so it is ready for the next workplace to go to.)
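A minimal sketch of that getWorkplaceByID function body (assuming the population is called workplaces and the Workplace agents have a String parameter id, as above):
Workplace getWorkplaceByID(String id) {
    // Return the first workplace whose id parameter matches, or null if none does
    return findFirst(workplaces, w -> w.id.equals(id));
}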
For populating the collection, you'd have two tables, something like the below (assuming using strings as IDs for workplaces and orders):
orders table: columns for parameters of your orders (including some String id column) other than the workplace-list. (Create one Order agent per row.)
order_workplaces table: columns order_id, sequence_num and workplace_id (so with multiple rows specifying the sequence of workplace IDs for an order ID).
In the On startup action of Order, set up the skeleton query code via the Insert Database Query wizard as below (where we want to loop through all rows for this order's ID and do something --- we'll change the skeleton code to add entries to the collection instead of just printing stuff via traceln like the skeleton code does).
Then we edit the skeleton code to look like the below. (Note we add an orderBy clause to the initial query so we ensure we get the rows in ascending sequence number order.)
List<Tuple> rows = selectFrom(order_workplaces)
.where(order_workplaces.order_id.eq(id))
.orderBy(order_workplaces.sequence_num.asc())
.list();
for (Tuple row : rows) {
workplaceIDsList.add(row.get(order_workplaces.workplace_id));
}
¹ The AnyLogic database is a normal relational database --- HSQLDB in fact --- and databases only understand their own specific data types like VARCHAR, with AnyLogic and the libraries it uses translating these to Java types like String. In the user interface, AnyLogic makes it look like you set the column types as int, String, etc. but these are really the Java types that the columns' contents will ultimately be translated into.
AnyLogic does support columns which have option list types (and the special Code type column for columns containing executable Java code) but these are special cases using special logic under the covers to translate the column data (which is ultimately still a string of characters) into the appropriate option list instance or (for Code columns) into compiled-on-the-fly-and-then-executed Java.
Welcome to Stack Overflow :) To create a population via Excel import you have to create a function and call code like this. You also need an empty population.
int n = excelFile.getLastRowNum(YOUR_SHEET_NAME);
for(int i = FIRST_ROW; i <= n; i++){
String name = excelFile.getCellStringValue(YOUR_SHEET_NAME, i, 1);
double SEC_PARAMETER_TO_READ= excelFile.getCellNumericValue(YOUR_SHEET_NAME, i, 2);
WP workplace = add_wps(name, SEC_PARAMETER_TO_READ);
}
Now if you want to get a workplace by name, you have to create a function similar to your attempt.
Function body:
WP workplaceToFind = wps.findFirst(w -> w.name.equals(destinationName));
if(workplaceToFind != null){
//do what ever you want
}

Cypher query in neo4j to find specific node with most paths matching pattern

I have a neo4j database with statistical information on water and waste. In this database are data points linked with the facts that are relevant, including mappings to internal definitions. Here in the attached screenshot is an example of a data point and the related metadata. The node in the center is the value, and the immediate nodes linked by "HAS_DIMENSION" are the dimensions that came with the data provider. These are not fixed and change depending on the provider. Each dimension of interest is mapped to an internal definition. Currently this is my query:
MATCH (o:Observation {uq_id:'e__ABS_AGR_AQ__FSW__MIO_M3__BG__1970____9f07c7a629625e5ae00e35838fcd4f824a3593dd'})-[:HAS_DIMENSION]->()
MATCH (o)-[:HAS_DIMENSION]->()-[:HAS_SYNONYM_FROM]->()-[:WITH_TARGET_DEF]->(v:Variable)<-[:HAS_UNIT]-(u:Unit)
MATCH (o)-[vl0:HAS_DIMENSION]->()-[:HAS_SYNONYM_FROM]->()-[:WITH_TARGET_DEF]->(l:Location)
MATCH (o)-[vc0:HAS_DIMENSION]->()-[:HAS_SYNONYM_FROM]->()-[:WITH_TARGET_DEF]->(c:Country)
MATCH (o)-[vy0:HAS_DIMENSION]->()-[:HAS_SYNONYM_FROM]->()-[:WITH_TARGET_DEF]->(y:Year)
MATCH (o)-[:HAS_DIMENSION]->(unk0)
MATCH (o)-[sr0:CAME_FROM_FILE]->(ds0)-[sr1:BELONGS_TO]->(s0)
OPTIONAL MATCH (o)-[dtr0:HAS_DIMENSION]->()-[:HAS_SYNONYM_FROM]->()-[:WITH_TARGET_DEF]->(d:DataType)
RETURN *
The issue I have is exemplified by the pink circles. I want only one pink circle (which is a node with label Variable) in the query; in particular, I want the variable found as follows:
MATCH (v:Variable)<-[:MAPS_TO]-()<-[:HAS_DIMENSION]-(o:Observation)
By this I want to force it to observe a pattern where it identifies the single variable that matches the pattern above for the greatest number of intermediate nodes. So the "Fresh surface water abstracted" variable would match this pattern, since it has two paths that match. But the "Fresh groundwater abstracted" variable would not, since it only has one. How could I accomplish this?
It sounds like you want to return the Variable node with the most number of paths leading to it. Would something like this roughly return the results you are after? You will need to adapt according to your matching statements.
MATCH p=(o:Observation {uq_id:'<your_id>'})-[:HAS_DIMENSION]->()<-[:MAPS_TO]-(v:Variable)
RETURN v.name, COUNT(p) as p ORDER BY p DESC LIMIT 1

Neo4j add huge number of relationships to already existing nodes

I have labels Person and Company with millions of nodes.
I am trying to create a relationship:
(person)-[:WORKS_AT]->(company) based on a unique company number property that exists in both labels.
I am trying to do that with the following query:
MATCH (company:Company), (person:Person)
WHERE company.companyNumber=person.comp_number
CREATE (person)-[:WORKS_AT]->(company)
but the query takes too long to execute and eventually fails.
I have indexes on companyNumber and comp_number.
So, my question is: is there a way to create the relationships in segments, for example (50000, then another 50000, etc.)?
Use a temporary label to mark things as completed, and add a limit step before creating the relationship. When you are all done, just remove the label from everyone.
MATCH (company:Company)
WITH company
MATCH (p:Person {comp_number: company.companyNumber} )
WHERE NOT p:Processed
WITH company, p
LIMIT 50000
MERGE (p) - [:WORKS_AT] -> (company)
SET p:Processed
RETURN COUNT(*) AS processed
That will return the number (usually 50000) of rows that were processed; when it returns less than 50000 (or whatever you set the limit to), you are all done. Run this guy then:
MATCH (n:Processed)
WITH n LIMIT 50000
REMOVE n:Processed
RETURN COUNT(*) AS processed
until you get a result less than 50000. You can probably turn all of these numbers up to 100000 or maybe more, depending on your db setup.
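If the APOC procedures are available, roughly the same batching can alternatively be expressed with apoc.periodic.iterate (the procedure used in the trajectory answer earlier on this page), which drives the batches for you - a sketch under that assumption:
CALL apoc.periodic.iterate(
'MATCH (person:Person) MATCH (company:Company {companyNumber: person.comp_number}) RETURN person, company',
'MERGE (person)-[:WORKS_AT]->(company)',
{ batchSize: 50000, parallel: false })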

TitanDB - Build a property index in descending order by timestamp

TitanDB 1.0.0 (on top of DynamoDB)
Gremlin 3
I've got a set of vertices with a label a. I have a property of type long on those vertices which corresponds to the time in milliseconds from 1970 UTC (the timestamp of when the vertex was created). When I pull back those vertices, I want to be able to pull them back in descending order.
How can I create an index on that property in decr (descending) order in the Titan management system?
Documentation seems vague on that.
Closest thing I found is
public RelationTypeIndex buildPropertyIndex(PropertyKey key,
String name,
Order sortOrder,
PropertyKey... sortKeys)
But what do I put in as the key and sortKeys? I want to be able to pull the whole vertex ordered by the timestamp property.
Edit: The only way I know of doing this at the moment is by duplicating that property on the edge and using a vertex-centric index on the edge to increase the performance.
I do something similar, as I order by the release date of a particular product. If you want to execute order().by("Value", decr) efficiently then I highly recommend reading up on mixed indices in the Titan documentation. Mixed indices allow these operations to be done quickly.
An example of what you may be looking for:
TitanGraph graph = TitanFactory.open(config);
TitanManagement mgmt = graph.openManagement();
PropertyKey key = mgmt.makePropertyKey("TimeStamp").dataType(Long.class).make();
mgmt.buildIndex("timeStampIndex", Vertex.class).addKey(key).buildMixedIndex("search");
mgmt.commit();
The above index lets me execute the following query very quickly:
g.V().order().by("TimeStamp", decr);
Which gets me the time stamps in descending order. The above traversal will work without indexing but can be slow in large graphs.

key-value store for time series data?

I've been using SQL Server to store historical time series data for a couple hundred thousand objects, observed about 100 times per day. I'm finding that queries (give me all values for object XYZ between time t1 and time t2) are too slow (for my needs, slow is more than a second). I'm indexing by timestamp and object ID.
I've entertained the thought of using something like a key-value store such as MongoDB instead, but I'm not sure if this is an "appropriate" use of this sort of thing, and I couldn't find any mentions of using such a database for time series data. Ideally, I'd be able to do the following queries:
retrieve all the data for object XYZ between time t1 and time t2
do the above, but return one data point per day (first, last, closest to time t...)
retrieve all data for all objects for a particular timestamp
the data should be ordered, and ideally it should be fast to write new data as well as update existing data.
It seems like my desire to query by object ID as well as by timestamp might necessitate having two copies of the database indexed in different ways to get optimal performance... Anyone have any experience building a system like this, with a key-value store, or HDF5, or something else? Or is this totally doable in SQL Server and I'm just not doing it right?
It sounds like MongoDB would be a very good fit. Updates and inserts are super fast, so you might want to create a document for every event, such as:
{
object: XYZ,
ts : new Date()
}
Then you can index the ts field and queries will also be fast. (By the way, you can create multiple indexes on a single collection.)
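For example, a compound index covering both fields could be created roughly like this (createIndex in current shells; very old shells call it ensureIndex):
db.data.createIndex({object : 1, ts : 1})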
How to do your three queries:
retrieve all the data for object XYZ between time t1 and time t2
db.data.find({object : XYZ, ts : {$gt : t1, $lt : t2}})
do the above, but return one data point per day (first, last, closest to time t...)
// first
db.data.find({object : XYZ, ts : {$gt : new Date(/* start of day */)}}).sort({ts : 1}).limit(1)
// last
db.data.find({object : XYZ, ts : {$lt : new Date(/* end of day */)}}).sort({ts : -1}).limit(1)
For closest to some time, you'd probably need a custom JavaScript function, but it's doable.
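As a rough alternative to a custom function, you could also fetch the nearest document on each side of a target time t and compare the two in application code:
// nearest at or before t
db.data.find({object : XYZ, ts : {$lte : t}}).sort({ts : -1}).limit(1)
// nearest at or after t
db.data.find({object : XYZ, ts : {$gte : t}}).sort({ts : 1}).limit(1)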
retrieve all data for all objects for a particular timestamp
db.data.find({ts : timestamp})
Feel free to ask on the user list if you have any questions; someone else might be able to think of an easier way of getting closest-to-a-time events.
This is why databases specific to time series data exist - relational databases simply aren't fast enough for large time series.
I've used Fame quite a lot at investment banks. It's very fast but I imagine very expensive. However, if your application requires the speed, it might be worth looking into.
There is an open source time series database under active development (.NET only for now) that I wrote. It can store massive amounts (terabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for stock tick storage and analysis at our company.
I am not sure this will be exactly what you need, but it will cover the first two points - getting values from t1 to t2 for any series (one series per file), or just taking one data point.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
file.UniqueIndexes = true; // enforces index uniqueness
file.InitializeNewFile(); // create file and write header
file.AppendData(data); // append data (stream of ArraySegment<>)
}
// Read needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStruct>) BinaryFile.Open("data.bts", false))
{
// Enumerate one item at a time, maximum 10 items, starting at 2011-1-1
// (can also get one segment at a time with StreamSegments)
foreach (var val in file.Stream(new UtcDateTime(2011, 1, 1), maxItemCount: 10))
Console.WriteLine(val);
}
I recently tried something similar in F#. I started with the 1 minute bar format for the symbol in question, in a space-delimited file which has roughly 80,000 1 minute bar readings. The code to load and parse from disk was under 1ms. The code to calculate a 100 minute SMA for every period in the file was 530ms. I can pull any slice I want from the SMA sequence, once calculated, in under 1ms. I am just learning F#, so there are probably ways to optimize. Note this was after multiple test runs, so it was already in the Windows cache, but even when loaded from disk it never adds more than 15ms to the load.
date,time,open,high,low,close,volume
01/03/2011,08:00:00,94.38,94.38,93.66,93.66,3800
To reduce the recalculation time I save the entire calculated indicator sequence to disk in a single file with a \n delimiter, and it generally takes less than 0.5ms to load and parse when in the Windows file cache. Simple iteration across the full time series data to return the set of records inside a date range is a sub-3ms operation with a full year of 1 minute bars. I also keep the daily bars in a separate file, which loads even faster because of the lower data volumes.
I use the .NET 4 System.Runtime.Caching layer to cache the serialized representation of the pre-calculated series, and with a couple of gigs of RAM dedicated to cache I get nearly a 100% cache hit rate, so my access to any pre-computed indicator set for any symbol generally runs under 1ms.
Pulling any slice of data I want from the indicator is typically less than 1ms, so advanced queries simply do not make sense. Using this strategy I could easily load 10 years of 1 minute bars in less than 20ms.
// Parse a \n delimited file into RAM then
// then split each line on space to into a
// array of tokens. Return the entire array
// as string[][]
let readSpaceDelimFile fname =
System.IO.File.ReadAllLines(fname)
|> Array.map (fun line -> line.Split [|' '|])
// Based on a two dimensional array
// pull out a single column for bar
// close and convert every value
// for every row to a float
// and return the array of floats.
let GetArrClose(tarr : string[][]) =
[| for aLine in tarr do
//printfn "aLine=%A" aLine
let closep = float(aLine.[5])
yield closep
|]
I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages.
I use boost::date_time for the timestamp field.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library features to be able to efficiently search for specific values or ranges of time-based records.
