Gremlin - Move multiple edges in single traversal

I am using Gremlin to access data in AWS Neptune. I need to redirect two edges going out from a single vertex so that they point to two vertices different from the ones they currently point to.
For instance, if the current relation is as shown below:

(X)---(A)---(Y)

(B)         (C)

I want it modified to:

(X)   (A)   (Y)
     /   \
  (B)     (C)
To ensure the whole operation is done in a single transaction, I need this done in a single traversal (because manual transaction logic using tx.commit() and tx.rollback() is not supported in AWS Neptune).
I tried the following queries to get this done but failed:
1) Add the new edges and drop the previous ones by selecting them using an alias:
g.V(<id of B>).as('B').V(<id of C>).as('C').V(<id of A>).as('A').
  outE('LINK1','LINK2').as('oldEdges').
  addE('LINK1').from('A').to('B').
  addE('LINK2').from('A').to('C').
  select('oldEdges').drop();
Here, since outE('LINK1','LINK2') returns two edges, the addE() steps that follow execute twice, so I end up with double the expected number of edges from A to B and from A to C.
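To illustrate (using the same <id of A> placeholder as above), counting the traversers that come out of outE() shows why everything appended after it runs twice:

g.V(<id of A>).outE('LINK1','LINK2').count()
// ==> 2, so each addE() appended after outE() executes once per traverser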
2) Add the new edges and drop the existing edges whose ids are not equal to the newly added ones:
g.V(<id of B>).as('B').V(<id of C>).as('C').V(<id of A>).as('A').
  addE('LINK1').from('A').to('B').as('edge1').
  addE('LINK2').from('A').to('C').as('edge2').
  select('A').outE().
  where(__.hasId(neq(select('edge1').id())).
    and(hasId(neq(select('edge2').id())))).drop();
Here I get the following exception in my Gremlin console:
could not be serialized by org.apache.tinkerpop.gremlin.driver.ser.AbstractGryoMessageSerializerV3d0.
java.lang.IllegalArgumentException: Class is not registered: org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal
Note: To register this class use: kryo.register(org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal.class);
at org.apache.tinkerpop.shaded.kryo.Kryo.getRegistration(Kryo.java:488)
at org.apache.tinkerpop.gremlin.structure.io.gryo.AbstractGryoClassResolver.writeClass(AbstractGryoClassResolver.java:110)
at org.apache.tinkerpop.shaded.kryo.Kryo.writeClass(Kryo.java:517)
at org.apache.tinkerpop.shaded.kryo.Kryo.writeClassAndObject(Kryo.java:622)
at org.apache.tinkerpop.gremlin.structure.io.gryo.kryoshim.shaded.ShadedKryoAdapter.writeClassAndObject(ShadedKryoAdapter.java:49)
at org.apache.tinkerpop.gremlin.structure.io.gryo.kryoshim.shaded.ShadedKryoAdapter.writeClassAndObject(ShadedKryoAdapter.java:24)
...
Please help.

You can try:
g.V(<id of A>).
  union(
    addE('LINK1').to(V(<id of B>)),
    addE('LINK2').to(V(<id of C>)),
    outE('LINK1','LINK2').where(inV().hasId(<id of X>, <id of Y>)).drop())
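For reference, a fully spelled-out form of that traversal might look like the sketch below; the vertex ids ('a1', 'b1', 'c1', 'x1', 'y1') are hypothetical placeholders, and the where(inV().hasId(...)) filter keeps the drop() from touching the two edges that were just added:

// a minimal sketch, assuming hypothetical vertex ids; end with iterate() so the
// mutations are actually applied when running from a driver
g.V('a1').
  union(
    addE('LINK1').to(V('b1')),
    addE('LINK2').to(V('c1')),
    outE('LINK1','LINK2').where(inV().hasId('x1', 'y1')).drop()).
  iterate()

Because all three child traversals run inside one traversal, Neptune applies the whole operation as a single implicit transaction, which is what the question requires.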

Related

Delete duplicate nodes between two nodes in Neo4j

Due to unwanted script execution my database has some duplicate nodes, and it looks like this:
From the image, there are multiple nodes with 'see' and 'execute' from p1 to m1.
I tried to eliminate them using this:
MATCH (ur:Role)-[c:CAN]->(e:Entitlement{action:'see'})-[o:ON]->(s:Role {id:'msci'})
WITH collect(e) AS rels WHERE size(rels) > 1
FOREACH (n IN TAIL(rels) | DETACH DELETE n)
Resulting in this:
As you can see here, it deletes all the nodes with the 'see' action.
I think I am missing something in the query, but I am not sure what.
The good graph should be like this:
EDIT: Added one more scenario with extra relations
This works if there is more than one extra :) and cleanses your entire graph.
// get all the patterns where you have > 1 entitlement of the same "action" (see or execute)
MATCH (n:Role)-->(e:Entitlement)-->(m:Role)
WITH n,m,e.action AS EntitlementAction,
COLLECT(e) AS Entitlements
WHERE size(Entitlements) > 1
// delete all entitlements, except the first one
FOREACH (e IN tail(Entitlements) |
DETACH DELETE e
)
Your query is pretty explicitly matching the 'see' action.
MATCH (ur:Role)-[c:CAN]->(e:Entitlement{action:'see'})
You might try the query without specifying the action type.
Edit:
I went back and played with this and this worked for me:
MATCH (ur:Role)-[c:CAN]->(e:Entitlement {action: "see"})-[o:ON]->(s:Role {id:'msci'})
with collect(e) as rels where size(rels) > 1
with tail(rels) as tail
match(n:Entitlement {id: tail[0].id})
detach delete n
You'd need to run two queries, one for each action, but as long as there is only the one extra relationship it should work.

How to get combined vertex results in gremlin using union/project/select?

Below is my example graph:
g.addV('user').property('userId','user1').as('u1').
addV('user').property('userId','user2').as('u2').
addV('user').property('userId','user3').as('u3').
addV('group').property('groupId','group1').as('g1').
addV('group').property('groupId','group2').as('g2').
addV('group').property('groupId','group3').as('g3').
addV('folder').property('folderId','folder1').property('folderName','director').as('f1').
addV('folder').property('folderId','folder2').property('folderName','asstDirector').as('f2').
addV('folder').property('folderId','folder3').property('folderName','editor').as('f3').
addV('file').property('fileId','file1').property('fileName','scene1').
addE('in_folder').to('f3').
addE('in_folder').from('f2').to('f1').
addE('in_folder').from('f3').to('f2').
addE('member_of').from('u1').to('g1').
addE('member_of').from('u2').to('g2').
addE('member_of').from('u3').to('g3').
addE('member_of').from('g3').to('g1').
addE('has_permission').property('prm','view').from('g1').to('f1').
addE('has_permission').property('prm','view').from('g2').to('f2').
addE('has_permission').property('prm','view').from('g3').to('f3').
addE('has_permission').property('prm','view').from('u2').to('f1').iterate()
The use case: get both the folder and file vertices that user3 has access to.
I am able to get either the folders or the files the user has access to, but I am not able to combine them.
Below is the query which came close, but does not return exactly the expected result set:
g.V().has('user','userId','user3').emit().until(__.not(outE('member_of'))).repeat(out('member_of'))
.outE('has_permission').has('prm','view').inV().as('f').inE('in').outV().as('a').select('a','f')
which results in:
==>[a:v[6144475224],f:v[164175920]]
==>[a:v[5202170056],f:v[204857480]]
The expectation is to get 3 vertices with their value maps, not just node ids:
f:v[164175920]
a:v[6144475224]
a:v[5202170056]
Could someone check and let me know how I can utilize union here?
I tried the query below:
g.V().has('user','userId','user3').emit().until(__.not(outE('member_of'))).repeat(out('member_of'))
.outE('has_permission').has('prm','view').inV().inE('in').outV().map(union(valueMap()))
This gives only the file information, not the folder values.
You got it right in your comment query, only with a few spelling mistakes which caused the error:
g.V().has('user','userId','user3').emit().repeat(out('member_of'))
.outE('has_permission').has('prm','view').inV().dedup()
.union(identity(),__.repeat(__.in('in_folder')).emit()).dedup()
This can be optimized a bit by moving the emit before the repeat and dropping the union:
g.V().has('user','userId','user3').emit().repeat(out('member_of'))
.outE('has_permission').has('prm','view').inV().dedup()
.emit().repeat(__.in('in_folder')).dedup()
Note that I used 'in_folder' instead of 'in' to match your sample data.
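If you want the property maps rather than just the vertex ids in the output, you could append valueMap(true) (or elementMap() on newer TinkerPop versions) at the end; a sketch under the same sample data:

g.V().has('user','userId','user3').emit().repeat(out('member_of')).
  outE('has_permission').has('prm','view').inV().dedup().
  emit().repeat(__.in('in_folder')).dedup().
  valueMap(true)   // true includes the id and label in each returned map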

Gremlin - Update values of multiple edges

I am using AWS Neptune and I have to modify a certain property of a set of edges with specific values. I also need this done in a single transaction. In AWS Neptune, manual transaction logic using tx.commit() and tx.rollback() is not supported, which means I have to do this operation in a single traversal.
If I was to modify properties of vertices instead of edges, I could have got it done with a query similar to the following one:
g.V(<id 1>).property('name', 'Marko').V(<id 2>).property('name', 'Stephen');
This is because it is possible to select vertices by id in mid traversal, i.e. the GraphTraversal class has V(String ... vertexIds) as a member function.
But this is not the same for the case of edges. I am not able to select edges this way because E(String ... edgeIds) is not a member function of the GraphTraversal class.
Can somebody suggest the correct way I can solve this problem?
Thank you.
Amazon Neptune engine 1.0.1.0.200463.0 added support for Gremlin Sessions, enabling execution of multiple Gremlin traversals in a single transaction.
However, you can do it also with a single query like this:
g.E('id1', 'id2', 'id3').coalesce(
has(id, 'id1').property('name','marko'),
has(id, 'id2').property('name','stephan'),
has(id, 'id3').property('name','vadas'))
You can get the same result as a mid-traversal E() by using g.V().outE().hasId(<list of IDs>).
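Combining that hasId() tip with the coalesce() pattern above, a sketch might look like the following (the edge ids and property values are placeholders carried over from the earlier example):

// same idea as the query above, but reaching the edges mid traversal via outE().hasId();
// in practice you would likely narrow g.V() to the known out-vertices first
g.V().outE().hasId('id1', 'id2', 'id3').
  coalesce(
    has(id, 'id1').property('name','marko'),
    has(id, 'id2').property('name','stephan'),
    has(id, 'id3').property('name','vadas')).
  iterate()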

How to query for multiple vertices and counts of their relationships in Gremlin/Tinkerpop 3?

I am using Gremlin/Tinkerpop 3 to query a graph stored in TitanDB.
The graph contains user vertices with properties, for example, "description", and edges denoting relationships between users.
I want to use Gremlin to obtain 1) users by properties and 2) the number of relationships (in this case of any kind) to some other user (e.g., with id = 123). To realize this, I make use of the match operation in Gremlin 3 like so:
g.V().match('user',__.as('user').has('description',new P(CONTAINS,'developer')),
__.as('user').out().hasId(123).values('name').groupCount('a').cap('a').as('relationships'))
.select()
This query works fine, unless there are multiple user vertices returned, for example, because multiple users have the word "developer" in their description. In this case, the count in relationships is the sum of all relationships between all returned users and the user with id 123, and not, as desired, the individual count for every returned user.
Am I doing something wrong or is this maybe an error?
PS: This question is related to one I posted some time ago about a similar query in Tinkerpop 2, where I had another issue: How to select optional graph structures with Gremlin?
Here's the sample data I used:
graph = TinkerGraph.open()
g = graph.traversal()
v123=graph.addVertex(id,123,"description","developer","name","bob")
v124=graph.addVertex(id,124,"description","developer","name","bill")
v125=graph.addVertex(id,125,"description","developer","name","brandy")
v126=graph.addVertex(id,126,"description","developer","name","beatrice")
v124.addEdge('follows',v125)
v124.addEdge('follows',v123)
v124.addEdge('likes',v126)
v125.addEdge('follows',v123)
v125.addEdge('likes',v123)
v126.addEdge('follows',v123)
v126.addEdge('follows',v124)
My first thought was: "Do we really need the match step?" Secondarily, of course, I wanted to write this in TP3 fashion and not use a lambda/closure. I tried all manner of things in the first iteration, and the closest I got was stuff like this from Daniel Kuppitz:
gremlin> g.V().as('user').local(out().hasId(123).values('name')
.groupCount()).as('relationships').select()
==>[relationships:[:]]
==>[relationships:[bob:1]]
==>[relationships:[bob:2]]
==>[relationships:[bob:1]]
So here we used the local step to restrict the inner traversal to the current element. This works, but we lost the "user" tag in the select. Why? groupCount is a ReducingBarrierStep, and paths are lost after those steps.
Well, let's go back to match. I figured I could try to make the match step traverse using local:
gremlin> g.V().match('user',__.as('user').has('description','developer'),
           __.as('user').local(out().hasId(123).values('name').groupCount()).as('relationships')).select()
==>[relationships:[:], user:v[123]]
==>[relationships:[bob:1], user:v[124]]
==>[relationships:[bob:2], user:v[125]]
==>[relationships:[bob:1], user:v[126]]
Ok - success - that's what we wanted: no lambdas and local counts. But it still left me feeling like: "Do we really need the match step?" That's when Mr. Kuppitz closed in on the final answer, which makes copious use of the by step:
gremlin> g.V().has('description','developer').as("user","relationships").select().by()
.by(out().hasId(123).values("name").groupCount())
==>[user:v[123], relationships:[:]]
==>[user:v[124], relationships:[bob:1]]
==>[user:v[125], relationships:[bob:2]]
==>[user:v[126], relationships:[bob:1]]
As you can see, by can be chained (on some steps). The first by() returns the "user" vertex itself, and the second by() computes the "relationships" value with a "local" groupCount.
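For what it's worth, on TinkerPop versions that include the project() step (3.2 and later, so possibly not the Titan build in the question), the same shape can be written without the as()/select() pair; a sketch against the same sample data:

// project() declares the map keys up front; each by() fills in the matching value
g.V().has('description','developer').
  project('user','relationships').
    by().                                             // the vertex itself
    by(out().hasId(123).values('name').groupCount())  // per-user counts of edges to v[123]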

What DBMS should I use to store openstreetmap as a graph?

Background:
I need to store the following data in a database:
osm nodes with tags;
osm edges with weights (that is an edge between two nodes extracted from 'way' from an .osm file).
Nodes that form edges belonging to the same 'way' should have the same tags as that way, i.e. every node in a 'way' that is a highway should have a 'highway' tag.
I need this structure to easily generate a graph based on various filters, e.g. a graph consisting only of nodes and edges which are highways, or a 'foot paths' graph, etc.
Problem:
I had not heard about spatial indexes before, so I just parsed an .osm file into a MySQL database:
all nodes into a 'nodes' table (with respective coordinate columns) - OK, about 9,000,000 rows in my case:
INSERT INTO nodes VALUES (node_id, lat, lon);
all ways into an 'edges' table (usually one way creates a few edges) - OK, about 9,000,000 rows as well:
INSERT INTO edges VALUES (edge_id, from_node_id, to_node_id);
add tags to nodes, calculate weights for edges - Problem:
Here is the problematic php script:
$query = mysql_query('SELECT * FROM edges');
$i = 0;
while ($res = mysql_fetch_object($query)) {
    $i++;
    echo "$i\n";
    // fetch both endpoint nodes of the edge
    $node1 = mysql_query('SELECT * FROM nodes WHERE id='.$res->from);
    $node1 = mysql_fetch_object($node1);
    $tag1 = $node1->tags;
    $node2 = mysql_query('SELECT * FROM nodes WHERE id='.$res->to);
    $node2 = mysql_fetch_object($node2);
    $tag2 = $node2->tags;
    // append the edge's tags to each endpoint node's tags
    mysql_query('UPDATE nodes SET tags="'.$tag1.$res->tags.'" WHERE nodes.id='.$res->from);
    mysql_query('UPDATE nodes SET tags="'.$tag2.$res->tags.'" WHERE nodes.id='.$res->to);
}
nohup shows the output of echo "$i\n" every 55-60 seconds (at that rate it would take more than 17 years to finish, since the 'edges' table has more than 9,000,000 rows, as in my case).
Htop shows a /usr/bin/mysqld process which takes 40-60% of CPU.
The same problem exists for the script which tries to calculate the weight (the distance) of an edge (select all edges, take an edge, then select two nodes of this edge from 'nodes' table, then calculate the distance, then update the edges table).
Question:
How can I make this SQL updates faster? Should I tweak any of MySQL config settings? Or should I use PostgreSQL with PostGIS extension? Should I use another structure for my data? Or should I somehow utilize the spatial index?
If I understand you right, there are two things to discuss.
First, your idea of putting the highway tag on the start and stop nodes. A node can have more than one edge connected; where do you put the tag from the second edge? Or the third or fourth if it is a crossing? The reason the highway tag is put in the edges table in the first place is that, from a relational point of view, that is where it belongs.
Second, getting the whole table and processing it outside the database is not the right way. Taking care of this whole process is exactly what a relational database is good at.
I have not worked with MySQL, but from what I have heard I fully agree that you will probably have a lot more fun if you migrate to PostGIS, since PostGIS has much better spatial capabilities (even if you don't need any spatial capabilities for this particular task).
So if we ignore the first problem and, just to show the concept, say that there are only two edges connected to one node and that each node has two tag fields, tag1 and tag2, then it could look something like this in PostGIS:
UPDATE nodes set tag1=edges.tags from edges where nodes.id=edges.from;
UPDATE nodes set tag2=edges.tags from edges where nodes.id=edges.to;
If you disable the indexes, that should be very fast.
Again, if I have understood you right.
PostgreSQL
Openstreetmap itself uses PostgreSQL, so I guess that's recommended.
See: http://wiki.openstreetmap.org/wiki/PostgreSQL
You can see OSM's database schema at: http://wiki.openstreetmap.org/wiki/Database_Schema
So you can use the same fields, field types and indexes that OSM uses for maximum compatibility.
MySQL
If you want to import .osm files into a MySQL database, have a look at:
http://wiki.openstreetmap.org/wiki/OsmDB.pm
There you will find Perl code that will create MySQL tables, parse an OSM file and import it into your MySQL database.
Making it faster
If you are updating in bulk, you don't need to update the indexes after every update.
You can just disable the indexes, do all your updates and re-enable the indexes.
I'm guessing that should be a whole lot faster.
Good luck
