I am tasked with writing a query for a front-end application that visualizes a Neptune graph database. Suppose one vertex type is ITEM and another is USER. A user can create an item. There are also item-to-item relationships that show items derived from another item, as in the case of media clips cut out of an original media clip. The first set of items created is stored in a vertex such as a SERVER, by which the items are grouped in the UI.
The requirements are as follows:
Find (Y) seed nodes that are not connected by any ITEM-ITEM relationships on the graph (relationships via USERs, etc. are fine).
Populate the graph with all relationships from these (Y) seed nodes, with no limits on the relationships that are followed (relationships through USERs, for example, are fine).
Stop populating the graph once the number of nodes (not the number of records) hits the limit specified by (X).
Here is a visual representation of the graph.
https://drive.google.com/file/d/1YNzh4wbzcdC0JeloMgD2C0oS6MYvfI4q/view?usp=sharing
Sample code to reproduce this graph is below. This is just a simple example; the real graph could be deeper. Please see the diagram:
g.addV('SERVER').property(id, 'server1')
g.addV('SERVER').property(id, 'server2')
g.addV('ITEM').property(id, 'item1')
g.addV('ITEM').property(id, 'item2')
g.addV('ITEM').property(id, 'item3')
g.addV('ITEM').property(id, 'item4')
g.addV('ITEM').property(id, 'item5')
g.addV('USER').property(id, 'user1')
g.V('item1').addE('STORED IN').to(g.V('server1'))
g.V('item2').addE('STORED IN').to(g.V('server2'))
g.V('item2').addE('RELATED TO').to(g.V('item1'))
g.V('item3').addE('DERIVED FROM').to(g.V('item2'))
g.V('item3').addE('CREATED BY').to(g.V('user1'))
g.V('user1').addE('CREATED').to(g.V('item4'))
g.V('item4').addE('RELATED TO').to(g.V('item5'))
The result should be in the form below if possible:
[
[
{
"V1": {},
"E": {},
"V2": {}
}
]
]
We have an API with an endpoint that allows for open-ended gremlin queries. We call this endpoint in our client app to fetch the data that is rendered visually. I have written a query that I do not think is quite right. Moreover, I would like to know how to filter the number of nodes traversed and stop at X nodes.
g.V().hasLabel('USER','SERVER').sample(5).aggregate('v1').repeat(__.as('V1').bothE().dedup().as('E').otherV().hasLabel('USER','SERVER').as('V2').aggregate('x').by(select('V1', 'E', 'V2'))).until(out().count().is(0)).as('V1').bothE().dedup().as('E').otherV().hasLabel(without('ITEM')).as('V2').aggregate('x').by(select('V1', 'E', 'V2')).cap('v1','x','v1').coalesce(select('x').unfold(),select('v1').unfold().project('V1'))
I would appreciate a single query that fetches this dataset, if that is possible. If vertices in the result are not connected to anything, I still want to retrieve them and render them as unconnected nodes in the UI.
I have looked at this again and come up with this query:
g.V().hasLabel(without('ITEM')).sample(2).aggregate('v1').
repeat(__.as('V1').bothE().dedup().as('E').otherV().as('V2').
aggregate('x').by(select('V1', 'E', 'V2'))).
until(out().count().is(0)).
as('V1').bothE().dedup().as('E').otherV().as('V2').
aggregate('x').
by(select('V1', 'E', 'V2')).
cap('v1','x','v1').
coalesce(select('x').unfold(),select('v1').unfold().project('V1')).limit(5)
To meet the requirement of limiting by node count rather than record count, I can pass to limit half the node count that the user supplies, and then exclude the edge E and vertex V2 of the last record from what is rendered on the UI.
I would appreciate any suggestions for a better approach.
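To pin down what "stop at X nodes" means independently of Gremlin, here is a minimal plain-Python sketch of the intended semantics, run against the sample graph above. It is not a Neptune query. It assumes that a seed is any vertex with no ITEM-ITEM edge attached, that edges are followed in both directions (as bothE() does), and that edges between already-collected nodes are still returned once the limit is reached. The function names seed_nodes and expand are made up for illustration.

```python
from collections import deque

# Sample graph from the question: vertex id -> label, and (from, label, to) edges.
VERTICES = {
    "server1": "SERVER", "server2": "SERVER", "user1": "USER",
    "item1": "ITEM", "item2": "ITEM", "item3": "ITEM",
    "item4": "ITEM", "item5": "ITEM",
}
EDGES = [
    ("item1", "STORED IN", "server1"),
    ("item2", "STORED IN", "server2"),
    ("item2", "RELATED TO", "item1"),
    ("item3", "DERIVED FROM", "item2"),
    ("item3", "CREATED BY", "user1"),
    ("user1", "CREATED", "item4"),
    ("item4", "RELATED TO", "item5"),
]


def seed_nodes(vertices, edges, y):
    """Return up to y vertices that have no ITEM-ITEM edge attached (assumed reading)."""
    touched = set()
    for src, _, dst in edges:
        if vertices[src] == "ITEM" and vertices[dst] == "ITEM":
            touched.update((src, dst))
    return [v for v in sorted(vertices) if v not in touched][:y]


def expand(vertices, edges, seeds, x):
    """Breadth-first expansion from the seeds that stops adding nodes at x distinct nodes."""
    adjacency = {v: [] for v in vertices}
    for edge in edges:
        src, _, dst = edge
        adjacency[src].append((edge, dst))
        adjacency[dst].append((edge, src))

    nodes, node_set = [], set()      # distinct nodes, in discovery order
    triples, seen_edges = [], set()  # emitted V1/E/V2 records
    queue = deque()
    for seed in seeds:
        if len(nodes) < x and seed not in node_set:
            node_set.add(seed)
            nodes.append(seed)
            queue.append(seed)

    while queue:
        current = queue.popleft()
        for edge, other in adjacency[current]:
            if edge in seen_edges:
                continue
            if other not in node_set:
                if len(nodes) >= x:
                    continue  # node limit reached: skip edges that lead to new nodes
                node_set.add(other)
                nodes.append(other)
                queue.append(other)
            seen_edges.add(edge)
            triples.append({"V1": edge[0], "E": edge[1], "V2": edge[2]})
    return nodes, triples


seeds = seed_nodes(VERTICES, EDGES, y=2)
nodes, triples = expand(VERTICES, EDGES, seeds, x=4)
print(seeds)         # ['server1', 'server2']
print(nodes)         # ['server1', 'server2', 'item1', 'item2']
print(len(triples))  # 3
```

Seeds with no edges at all simply appear in nodes without any triple, which matches the wish to render unconnected vertices on their own.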
I have a question about Neo4j. I need to show the labels in my graph database as nodes. For example, if I have only two labels in my database (Thing and Person), I want to have 2 extra nodes, Thing and Person, with relationships to the normal nodes.
Example - I have this:
Orange node is Person, red is Thing. So I want to have extra label nodes for every label in graph. So I want this:
Can this be created automatically?
You do not really want to do that, since a visualization with N nodes would then have N extraneous relationships to the special "label" nodes, making it hard (or even impossible) to see the actual data. Using different colors for different labels is a good compromise.
In any case, the top of the result panel (in the neo4j Browser) tells you which color belongs to which label, so you can already easily get the information you want.
[UPDATE]
However, if you really need to do something like that, there is no "automated" way. But you could use some APOC procedures to create virtual nodes and relationships that are not stored in the DB, but which can be visualized.
For example, if your original Cypher query is:
MATCH path=(p:Person)-[r:RELTYPE]->(t:Thing)
RETURN *
you can use this query to generate the appropriate virtual nodes and relationships:
MATCH path=(p:Person)-[r:RELTYPE]->(t:Thing)
WITH COLLECT(path) AS paths, COLLECT(DISTINCT p) AS ps, COLLECT(DISTINCT t) AS ts
CALL apoc.create.vNode(['V_Label'], {label: 'Person'}) YIELD node AS pLabel
CALL apoc.create.vNode(['V_Label'], {label: 'Thing'}) YIELD node AS tLabel
UNWIND ps AS person
CALL apoc.create.vRelationship(person, 'IS', {}, pLabel) YIELD rel AS pRel
WITH paths, ts, pLabel, tLabel, COLLECT(pRel) AS pRels
UNWIND ts AS thing
CALL apoc.create.vRelationship(thing, 'IS', {}, tLabel) YIELD rel AS tRel
RETURN *
A sample resulting visualization:
I have a graph containing two vertex collections: Attraction (green) and Hotel (orange).
I want to query for a certain combination of Attractions and Hotels, such as the one given below:
Attraction (start vertex) ---> Attraction ---> Hotel
|
|
v
Attraction
Graph has directed edges as shown.
The query I have now (below) gives any part of the above combination, instead of four nodes connected exactly as above.
FOR document IN Attraction
FOR vertex, edge, path IN 1..2 OUTBOUND document GRAPH "LondonAttractionDB"
FILTER path.vertices[0].entityTypes[0] == "Attraction"
FILTER path.vertices[1].entityTypes[0] == "Attraction"
FILTER path.vertices[2].entityTypes[0] == "Hotel" OR path.vertices[2].entityTypes[0] == "Attraction"
RETURN path
Above query gives all combinations containing two, three or four nodes as shown above. How can I get only the results (combinations of exactly four nodes) shown within circles?
Any help is much appreciated.
Do you mean that the result contains duplicates?
If so, you can use DISTINCT in the RETURN statement.
Otherwise, try a breadth-first traversal with unique vertices and unique edges (the bfs, uniqueVertices and uniqueEdges traversal options):
https://docs.arangodb.com/3.3/AQL/Graphs/Traversals.html
I created a large neo4j graph connecting users to the videos they watch like user -> video in a social graph or network type of graph. There are about 9000 user nodes and 20000 video nodes.
If I try:
MATCH (u)-[:VIEW]->(v)
RETURN u,v
The graph says "Displaying 300000 nodes, 0 relationships." Neither the graph nor any nodes or relationships show up.
If I try:
MATCH (u)-[:VIEW]->(v)
RETURN u,v
LIMIT 1000
The graph says "Displaying 1000 nodes, 1000 relationships (completed with 1000 additional relationships)." The graph, with all its nodes and relationships, shows up.
If I try:
MATCH (u)-[:VIEW]->(v)
RETURN u,v
LIMIT 10000
Again, neither the graph nor any nodes or relationships show up.
Is the first graph too large to show? How can I get it to show up?
Thank you in advance.
Are you doing this in the web console? I suspect when you do the LIMIT 10000 that the result is just too big to be handled in the web browser. I'm actually a bit surprised that 1000 showed up (again, if you're in the web console).
What are you trying to get? If you want to get a table you can do this (I'm making up properties here):
MATCH (u)-[:VIEW]->(v)
RETURN u.username,v.title
If you want something else, then I'd need more information ;)
I am using Gremlin/Tinkerpop 3 to query a graph stored in TitanDB.
The graph contains user vertices with properties, for example, "description", and edges denoting relationships between users.
I want to use Gremlin to obtain 1) users by properties and 2) the number of relationships (in this case of any kind) to some other user (e.g., with id = 123). To realize this, I make use of the match operation in Gremlin 3 like so:
g.V().match('user',__.as('user').has('description',new P(CONTAINS,'developer')),
__.as('user').out().hasId(123).values('name').groupCount('a').cap('a').as('relationships'))
.select()
This query works fine, unless there are multiple user vertices returned, for example, because multiple users have the word "developer" in their description. In this case, the count in relationships is the sum of all relationships between all returned users and the user with id 123, and not, as desired, the individual count for every returned user.
Am I doing something wrong or is this maybe an error?
PS: This question is related to one I posted some time ago about a similar query in Tinkerpop 2, where I had another issue: How to select optional graph structures with Gremlin?
Here's the sample data I used:
graph = TinkerGraph.open()
g = graph.traversal()
v123=graph.addVertex(id,123,"description","developer","name","bob")
v124=graph.addVertex(id,124,"description","developer","name","bill")
v125=graph.addVertex(id,125,"description","developer","name","brandy")
v126=graph.addVertex(id,126,"description","developer","name","beatrice")
v124.addEdge('follows',v125)
v124.addEdge('follows',v123)
v124.addEdge('likes',v126)
v125.addEdge('follows',v123)
v125.addEdge('likes',v123)
v126.addEdge('follows',v123)
v126.addEdge('follows',v124)
My first thought was: "Do we really need match step"? Secondly, of course, I wanted to write this in TP3 fashion and not use a lambda/closure. I tried all manner of things in the first iteration, and the closest I got was something like this from Daniel Kuppitz:
gremlin> g.V().as('user').local(out().hasId(123).values('name')
.groupCount()).as('relationships').select()
==>[relationships:[:]]
==>[relationships:[bob:1]]
==>[relationships:[bob:2]]
==>[relationships:[bob:1]]
So here we used the local step to restrict the inner traversal to the current element. This works, but we lost the "user" tag in the select. Why? groupCount is a ReducingBarrierStep, and paths are lost after those steps.
Well, let's go back to match. I figured I could try to make the match step traverse using local:
gremlin> g.V().match('user',__.as('user').has('description','developer'),
gremlin> __.as('user').local(out().hasId(123).values('name').groupCount()).as('relationships')).select()
==>[relationships:[:], user:v[123]]
==>[relationships:[bob:1], user:v[124]]
==>[relationships:[bob:2], user:v[125]]
==>[relationships:[bob:1], user:v[126]]
Ok - success - that's what we wanted: no lambdas and local counts. But, it still left me feeling like: "Do we really need match step"? That's when Mr. Kuppitz closed in on the final answer which makes copious use of the by step:
gremlin> g.V().has('description','developer').as("user","relationships").select().by()
.by(out().hasId(123).values("name").groupCount())
==>[user:v[123], relationships:[:]]
==>[user:v[124], relationships:[bob:1]]
==>[user:v[125], relationships:[bob:2]]
==>[user:v[126], relationships:[bob:1]]
As you can see, by can be chained (on some steps). The first by groups by vertex and the second by processes the grouped elements with a "local" groupCount.
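To make the difference between the global and the "local" count concrete, here is a small plain-Python sketch over the same sample data. It is only an illustration of the counting semantics, not TinkerPop code, and the helper names global_count and local_counts are made up for this example.

```python
from collections import Counter

# Same sample data as above: vertex id -> name, and (out vertex, label, in vertex) edges.
NAMES = {123: "bob", 124: "bill", 125: "brandy", 126: "beatrice"}
EDGES = [
    (124, "follows", 125),
    (124, "follows", 123),
    (124, "likes", 126),
    (125, "follows", 123),
    (125, "likes", 123),
    (126, "follows", 123),
    (126, "follows", 124),
]
TARGET = 123


def global_count(edges, target):
    """One counter shared by all users: what the original match query effectively returned."""
    return Counter(NAMES[dst] for _, _, dst in edges if dst == target)


def local_counts(edges, target):
    """One counter per user: the result that local() or by(...groupCount()) produces."""
    counts = {user: Counter() for user in NAMES}
    for src, _, dst in edges:
        if dst == target:
            counts[src][NAMES[dst]] += 1
    return counts


print(dict(global_count(EDGES, TARGET)))  # {'bob': 4}
for user, counter in local_counts(EDGES, TARGET).items():
    print(user, dict(counter))
# 123 {}
# 124 {'bob': 1}
# 125 {'bob': 2}
# 126 {'bob': 1}
```

The per-user counters match the console output above, while the single shared counter shows the summed value that the question complained about.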
Background:
I need to store the following data in a database:
osm nodes with tags;
osm edges with weights (that is an edge between two nodes extracted from 'way' from an .osm file).
Nodes that form edges, which are in the same 'way' sets should have the same tags as those ways, i.e. every node in a 'way' set of nodes which is a highway should have a 'highway' tag.
I need this structure to easily generate a graph based on various filters, e.g. a graph consisting only of nodes and edges which are highways, or a 'foot paths' graph, etc.
Problem:
I have not heard about the spatial index before, so I just parsed an .osm file into a MySQL database:
all nodes to a 'nodes' table (with respective coordinate columns) - OK, about 9,000,000 rows in my case:
INSERT INTO nodes VALUES (node_id, lat, lon); -- pseudocode
all ways to an 'edges' table (usually one way creates a few edges) - OK, about 9,000,000 rows as well:
INSERT INTO edges VALUES (edge_id, from_node_id, to_node_id); -- pseudocode
add tags to nodes, calculate weights for edges - Problem:
Here is the problematic php script:
$query = mysql_query('SELECT * FROM edges');
$i=0;
while ($res = mysql_fetch_object($query)) {
$i++;
echo "$i\n";
$node1 = mysql_query('SELECT * FROM nodes WHERE id='.$res->from);
$node1 = mysql_fetch_object($node1);
$tag1 = $node1->tags;
$node2 = mysql_query('SELECT * FROM nodes WHERE id='.$res->to);
$node2 = mysql_fetch_object($node2);
$tag2 = $node2->tags;
mysql_query('UPDATE nodes SET tags="'.$tag1.$res->tags.'" WHERE nodes.id='.$res->from);
mysql_query('UPDATE nodes SET tags="'.$tag2.$res->tags.'" WHERE nodes.id='.$res->to);
}
Nohup shows the output of 'echo "$i\n"' every 55-60 seconds (which means it could take more than 17 years to finish if the 'edges' table has more than 9,000,000 rows, as in my case).
Htop shows a /usr/bin/mysqld process which takes 40-60% of CPU.
The same problem exists for the script which tries to calculate the weight (the distance) of an edge (select all edges, take an edge, then select two nodes of this edge from 'nodes' table, then calculate the distance, then update the edges table).
Question:
How can I make these SQL updates faster? Should I tweak any of the MySQL config settings? Or should I use PostgreSQL with the PostGIS extension? Should I use another structure for my data? Or should I somehow utilize a spatial index?
If I understand you right, there are two things to discuss.
First, your idea of putting the highway tag on the start and stop nodes. A node can have more than one edge connected, so where do you put the tag from the second edge? Or the third or fourth, if it is a crossing? The reason the highway tag is put in the edges table in the first place is that, from a relational point of view, that is where it belongs.
Second, fetching the whole table and processing it outside the database is not the right way. Taking care of this whole process is exactly what a relational database is really good at.
I have not worked with MySQL, but from what I have heard you will probably be better off migrating to PostGIS, since it has much better spatial capabilities (even if you don't need any spatial capabilities for this particular task).
So, ignoring the first problem and just to show the concept, suppose there are only two edges connected to each node and each node has two tag fields, tag1 and tag2. Then it could look something like this in PostGIS:
UPDATE nodes set tag1=edges.tags from edges where nodes.id=edges.from;
UPDATE nodes set tag2=edges.tags from edges where nodes.id=edges.to;
If you disable the indexes that should be very fast.
Again, this assumes I have understood you right.
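As a rough, self-contained illustration of letting the database do the work in one statement instead of looping over rows in a script, here is a sketch using SQLite through Python's sqlite3 module. SQLite is used only so the example runs anywhere; it is neither MySQL nor PostGIS, and the syntax may need adjusting there. The column names from_node and to_node and the idea of joining all edge tags of a node with commas are assumptions made for this example, not part of the original schema.

```python
import sqlite3


def build_demo():
    """Create a tiny in-memory stand-in for the nodes/edges tables (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE nodes (id INTEGER PRIMARY KEY, tags TEXT);
        CREATE TABLE edges (
            id INTEGER PRIMARY KEY,
            from_node INTEGER,
            to_node INTEGER,
            tags TEXT
        );
        CREATE INDEX edges_from ON edges (from_node);
        CREATE INDEX edges_to ON edges (to_node);
        INSERT INTO nodes (id) VALUES (1), (2), (3), (4);
        INSERT INTO edges (from_node, to_node, tags)
        VALUES (1, 2, 'highway'), (2, 3, 'footway');
    """)
    return conn


def tag_nodes_in_bulk(conn):
    """Copy edge tags onto their end nodes with a single set-based UPDATE."""
    conn.execute("""
        UPDATE nodes
        SET tags = (
            SELECT group_concat(DISTINCT e.tags)
            FROM edges AS e
            WHERE e.from_node = nodes.id OR e.to_node = nodes.id
        )
        WHERE EXISTS (
            SELECT 1 FROM edges AS e
            WHERE e.from_node = nodes.id OR e.to_node = nodes.id
        )
    """)
    conn.commit()


conn = build_demo()
tag_nodes_in_bulk(conn)
# The order inside group_concat is not guaranteed, so compare tags as sets.
tags_by_node = {
    node_id: set(tags.split(",")) if tags else set()
    for node_id, tags in conn.execute("SELECT id, tags FROM nodes")
}
print(tags_by_node[1])  # {'highway'}
```

Aggregating all tags per node also sidesteps the question of where to put the tag from a second or third edge, at the cost of storing a delimited list in one column.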
PostgreSQL
OpenStreetMap itself uses PostgreSQL, so I guess that's the recommended choice.
See: http://wiki.openstreetmap.org/wiki/PostgreSQL
You can see OSM's database schema at: http://wiki.openstreetmap.org/wiki/Database_Schema
So you can use the same fields, fieldtypes and indexes that OSM uses for maximum compatibility.
MySQL
If you want to import .osm files into a MySQL database, have a look at:
http://wiki.openstreetmap.org/wiki/OsmDB.pm
Here you will find Perl code that will create MySQL tables, parse an OSM file and import it into your MySQL database.
Making it faster
If you are updating in bulk, you don't need to update the indexes after every update.
You can just disable the indexes, do all your updates and re-enable the index.
I'm guessing that should be a whole lot faster.
Good luck