Modelling GRAPH query with scenario - graph-databases

I have two vertex types, USER and PLACE. Any user can review a place, and any user can like or comment on that review. In this scenario there are two edges: one storing the review a user wrote for a place, and one storing any activity on that review. Suppose I need to fetch all reviews of a place together with the like and comment counts of each review in a single query. How do I write such a query?

Assuming that you use the collections user and place to store your vertices, and the edge collections activity and review (with the labels comment, like, or review) to store the activity, you could use graph traversals of depth 1 that take every vertex of your place collection as a starting vertex.
The following query iterates through all place documents and returns the likes, comments, and reviews for each one.
FOR vertex IN place
  LET likes = LENGTH(FOR v, e, p IN 1..1 ANY vertex review, activity FILTER p.edges[0].label == 'like' RETURN 1)
  LET reviews = (FOR v, e, p IN 1..1 ANY vertex review, activity FILTER p.edges[0].label == 'review' RETURN p.edges[0].rv)
  LET comments = LENGTH(FOR v, e, p IN 1..1 ANY vertex review, activity FILTER p.edges[0].label == 'comment' RETURN 1)
  RETURN {place: vertex.name, likes: likes, reviews: reviews, comments: comments}
The filters select the activity type of the traversed edge. The number of edges with the label like or comment is the number of likes or comments for this place, while the reviews traversal returns the review text stored in the edge attribute rv.
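The query above counts per place, whereas the question asks for counts per review. As a rough illustration of that grouping, here is a small in-memory Python sketch (not AQL). It assumes each activity record refers to the review it targets by key; the field names _key, user, place and review, as well as the sample records, are hypothetical. Only label and rv come from the answer above.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory stand-ins for the review and activity edges.
# Each activity is assumed to reference the review it targets by key.
reviews = [
    {"_key": "r1", "user": "u1", "place": "p1", "rv": "Great coffee"},
    {"_key": "r2", "user": "u2", "place": "p1", "rv": "Too crowded"},
    {"_key": "r3", "user": "u1", "place": "p2", "rv": "Nice view"},
]
activities = [
    {"user": "u2", "review": "r1", "label": "like"},
    {"user": "u3", "review": "r1", "label": "like"},
    {"user": "u3", "review": "r1", "label": "comment"},
    {"user": "u1", "review": "r2", "label": "comment"},
]


def reviews_with_counts(place, reviews, activities):
    """Return each review of `place` with its like and comment counts."""
    # Tally activity labels per review key in a single pass.
    counts = defaultdict(Counter)
    for act in activities:
        counts[act["review"]][act["label"]] += 1
    return [
        {
            "review": r["rv"],
            "likes": counts[r["_key"]]["like"],
            "comments": counts[r["_key"]]["comment"],
        }
        for r in reviews
        if r["place"] == place
    ]


result = reviews_with_counts("p1", reviews, activities)
```

Grouping the activity records by review key first, and only then joining them to the reviews of one place, is the same shape a per-review subquery would take in a database query.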

Related

Traverse graph database from random seed nodes

I am tasked with writing a query for a front-end application that visualizes a Neptune graph database. Let us say that one vertex type is ITEM and another is USER. A user can create an item. There are item-to-item relationships that show items derived from another item, as in the case of media clips cut out of an original media clip. The first set of items created should be created under a vertex such as a SERVER, which they are grouped by in the UI.
The following is the requirement:
Find (Y) seed nodes that are not connected by any ITEM-ITEM relationships on the graph (relationships via USERs etc... are fine)
Populate the graph with all relationships from these (Y) seed nodes with no limits on the relationships that are followed (relationships through USERs, for example, are fine).
Stop populating the graph once the number of nodes (not records limit) hits the limit specified by (X)
Here is a visual representation of the graph.
https://drive.google.com/file/d/1YNzh4wbzcdC0JeloMgD2C0oS6MYvfI4q/view?usp=sharing
Sample code to reproduce this graph is below. The graph could get deeper; this is just a simple example. Kindly see the diagram:
g.addV('SERVER').property(id, 'server1')
g.addV('SERVER').property(id, 'server2')
g.addV('ITEM').property(id, 'item1')
g.addV('ITEM').property(id, 'item2')
g.addV('ITEM').property(id, 'item3')
g.addV('ITEM').property(id, 'item4')
g.addV('ITEM').property(id, 'item5')
g.addV('USER').property(id, 'user1')
g.V('item1').addE('STORED IN').to(g.V('server1'))
g.V('item2').addE('STORED IN').to(g.V('server2'))
g.V('item2').addE('RELATED TO').to(g.V('item1'))
g.V('item3').addE('DERIVED FROM').to(g.V('item2'))
g.V('item3').addE('CREATED BY').to(g.V('user1'))
g.V('user1').addE('CREATED').to(g.V('item4'))
g.V('item4').addE('RELATED TO').to(g.V('item5'))
The result should be in the form below if possible:
[
  [
    {
      "V1": {},
      "E": {},
      "V2": {}
    }
  ]
]
We have an API with an endpoint that allows for open-ended gremlin queries. We call this endpoint in our client app to fetch the data that is rendered visually. I have written a query that I do not think is quite right. Moreover, I would like to know how to filter the number of nodes traversed and stop at X nodes.
g.V().hasLabel('USER','SERVER').sample(5).aggregate('v1').
  repeat(__.as('V1').bothE().dedup().as('E').otherV().hasLabel('USER','SERVER').as('V2').
    aggregate('x').by(select('V1', 'E', 'V2'))).
  until(out().count().is(0)).
  as('V1').bothE().dedup().as('E').otherV().hasLabel(without('ITEM')).as('V2').
  aggregate('x').by(select('V1', 'E', 'V2')).
  cap('v1','x','v1').
  coalesce(select('x').unfold(),select('v1').unfold().project('V1'))
I would appreciate if I can get a single query that will fetch this dataset if it is possible. If vertices in the result are not connected to anything, I would want to retrieve them and render them like that on the UI.
I have looked at this again and came up with this query:
g.V().hasLabel(without('ITEM')).sample(2).aggregate('v1').
repeat(__.as('V1').bothE().dedup().as('E').otherV().as('V2').
aggregate('x').by(select('V1', 'E', 'V2'))).
until(out().count().is(0)).
as('V1').bothE().dedup().as('E').otherV().as('V2').
aggregate('x').
by(select('V1', 'E', 'V2')).
cap('v1','x','v1').
coalesce(select('x').unfold(),select('v1').unfold().project('V1')).limit(5)
To meet the criterion of a node count rather than a record count (or limit), I can pass to limit half the number the user supplied as the node count, and then exclude the edge E and vertex V2 of the last record from what is rendered on the UI.
I would appreciate any suggestions on a better way.
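One way to reason about the "stop at X nodes" requirement, independent of Gremlin, is a breadth-first expansion that tracks distinct nodes rather than records. Below is a minimal in-memory Python sketch of that idea, using the sample edges from this question. The function name expand and its parameters are hypothetical, and this is not a Neptune query; it only illustrates the counting logic that a client or a server-side traversal would need to reproduce.

```python
from collections import deque

# Edges from the sample graph in the question: (source, label, target).
EDGES = [
    ("item1", "STORED IN", "server1"),
    ("item2", "STORED IN", "server2"),
    ("item2", "RELATED TO", "item1"),
    ("item3", "DERIVED FROM", "item2"),
    ("item3", "CREATED BY", "user1"),
    ("user1", "CREATED", "item4"),
    ("item4", "RELATED TO", "item5"),
]


def expand(seeds, edges, max_nodes):
    """Breadth-first expansion from seeds that stops at max_nodes distinct nodes."""
    # Build an undirected adjacency map so edges are followed in both directions.
    adjacent = {}
    for edge in edges:
        src, _, dst = edge
        adjacent.setdefault(src, []).append((edge, dst))
        adjacent.setdefault(dst, []).append((edge, src))

    visited = list(seeds)[:max_nodes]  # nodes in discovery order
    seen_nodes = set(visited)
    seen_edges = set()
    triples = []
    queue = deque(visited)
    while queue:
        node = queue.popleft()
        for edge, other in adjacent.get(node, []):
            if edge in seen_edges:
                continue
            if other not in seen_nodes:
                if len(seen_nodes) >= max_nodes:
                    continue  # node budget exhausted: skip edges to new nodes
                seen_nodes.add(other)
                visited.append(other)
                queue.append(other)
            seen_edges.add(edge)
            src, label, dst = edge
            triples.append({"V1": src, "E": label, "V2": dst})
    return visited, triples


nodes, triples = expand(["server1", "user1"], EDGES, max_nodes=4)
```

Because the budget is checked against the set of distinct nodes, the result never needs the "half the limit" workaround, and a seed with no edges still appears in nodes even though it contributes no triples.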

How to query for a specific combination of nodes in an ArangoDB graph

I have a graph containing two vertex collections: Attraction (green) and Hotel (orange).
I want to query for a certain combination of Attractions and Hotels, such as the one given below:
Attraction (start vertex) ---> Attraction ---> Hotel
|
|
v
Attraction
Graph has directed edges as shown.
The query I have now (below) gives any part of the above combination, instead of four nodes connected exactly as above.
FOR document IN Attraction
  FOR vertex, edge, path IN 1..2 OUTBOUND document GRAPH "LondonAttractionDB"
    FILTER path.vertices[0].entityTypes[0] == "Attraction"
    FILTER path.vertices[1].entityTypes[0] == "Attraction"
    FILTER path.vertices[2].entityTypes[0] == "Hotel" OR path.vertices[2].entityTypes[0] == "Attraction"
    RETURN path
Above query gives all combinations containing two, three or four nodes as shown above. How can I get only the results (combinations of exactly four nodes) shown within circles?
Any help is much appreciated.
Do you mean that the result contains duplicates?
If so, you can use RETURN DISTINCT for the return value.
Otherwise, try a breadth-first traversal with the uniqueVertices and uniqueEdges options:
https://docs.arangodb.com/3.3/AQL/Graphs/Traversals.html

Apache Spark GraphX: Source and destination share the same VertexId but represent different things

I have a file with srcId -> dstId values that represent the edges of a graph, which I load with GraphLoader.edgeListFile. The source represents users and the destination represents items. On some occasions the srcId and the dstId are equal, so some algorithms produce errors, for example when I want to collect the neighbors of each vertex. Can I do something to separate the users from the items without losing any information?
Each GraphX vertex must be identified by a unique Long value. If the source and destination IDs represent different things, you need to transform them with some operation to make sure they are distinct. For example, assuming you have read your data into an RDD[(Long, Long)], you could do:
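import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.{Edge, Graph}

// rdd: RDD[(Long, Long)] of (userID, itemID) pairs, assumed to be loaded already
val userMaxID = rdd.map(_._1).distinct.max
val edges: RDD[Edge[Int]] = rdd.map {
  // Shift item IDs past the largest user ID so the two ID ranges do not overlap
  case (userID, itemID) => Edge(userID, itemID + userMaxID, 0)
}
val g = Graph.fromEdges(edges, 0)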
Then you will have a graph where every item ID is its original ID plus the maximum user ID (if the IDs can be 0, you need to add an extra 1).
Note that this is just a suggestion; the idea is that you need to transform your IDs in a way that no item can have the same ID as a user. Also, you may want to keep a way to know whether a given vertex is a user or an item; in my suggestion, all vertices with ID <= userMaxID would be users, whereas all vertices with ID > userMaxID would be items.
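To make the offsetting idea concrete outside of Spark, here is a small plain-Python sketch. It includes the "extra 1" mentioned above, so an item ID of 0 stays distinct from the largest user ID. It assumes non-negative integer IDs; the helper names offset_item_ids, is_user and original_item_id are hypothetical and not part of GraphX.

```python
def offset_item_ids(pairs):
    """Shift item IDs past the largest user ID so user and item IDs never collide."""
    user_max_id = max(user_id for user_id, _ in pairs)
    # The extra 1 keeps item ID 0 distinct from the largest user ID.
    offset = user_max_id + 1
    edges = [(user_id, item_id + offset) for user_id, item_id in pairs]
    return edges, offset


def is_user(vertex_id, offset):
    """Vertices below the offset are users; the rest are shifted items."""
    return vertex_id < offset


def original_item_id(vertex_id, offset):
    """Recover the item ID as it appeared in the input file."""
    return vertex_id - offset


# (userID, itemID) pairs; the first pair has equal source and destination IDs.
pairs = [(1, 1), (2, 0), (3, 2)]
edges, offset = offset_item_ids(pairs)
```

Keeping the offset around is what preserves the information: it tells you which side of the bipartite graph a vertex belongs to and lets you map an item vertex back to its original ID.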

How to query for multiple vertices and counts of their relationships in Gremlin/Tinkerpop 3?

I am using Gremlin/Tinkerpop 3 to query a graph stored in TitanDB.
The graph contains user vertices with properties, for example, "description", and edges denoting relationships between users.
I want to use Gremlin to obtain 1) users by properties and 2) the number of relationships (in this case of any kind) to some other user (e.g., with id = 123). To realize this, I make use of the match operation in Gremlin 3 like so:
g.V().match('user',__.as('user').has('description',new P(CONTAINS,'developer')),
__.as('user').out().hasId(123).values('name').groupCount('a').cap('a').as('relationships'))
.select()
This query works fine, unless there are multiple user vertices returned, for example, because multiple users have the word "developer" in their description. In this case, the count in relationships is the sum of all relationships between all returned users and the user with id 123, and not, as desired, the individual count for every returned user.
Am I doing something wrong or is this maybe an error?
PS: This question is related to one I posted some time ago about a similar query in Tinkerpop 2, where I had another issue: How to select optional graph structures with Gremlin?
Here's the sample data I used:
graph = TinkerGraph.open()
g = graph.traversal()
v123=graph.addVertex(id,123,"description","developer","name","bob")
v124=graph.addVertex(id,124,"description","developer","name","bill")
v125=graph.addVertex(id,125,"description","developer","name","brandy")
v126=graph.addVertex(id,126,"description","developer","name","beatrice")
v124.addEdge('follows',v125)
v124.addEdge('follows',v123)
v124.addEdge('likes',v126)
v125.addEdge('follows',v123)
v125.addEdge('likes',v123)
v126.addEdge('follows',v123)
v126.addEdge('follows',v124)
My first thought was: "Do we really need the match step?" Secondly, of course, I wanted to write this in TP3 fashion and not use a lambda/closure. I tried all manner of things in the first iteration, and the closest I got was something like this from Daniel Kuppitz:
gremlin> g.V().as('user').local(out().hasId(123).values('name')
.groupCount()).as('relationships').select()
==>[relationships:[:]]
==>[relationships:[bob:1]]
==>[relationships:[bob:2]]
==>[relationships:[bob:1]]
So here we used the local step to restrict the traversal inside local to the current element. This works, but we lost the "user" tag in the select. Why? groupCount is a ReducingBarrierStep and paths are lost after those steps.
Well, let's go back to match. I figured I could try to make the match step traverse using local:
gremlin> g.V().match('user',__.as('user').has('description','developer'),
           __.as('user').local(out().hasId(123).values('name').groupCount()).as('relationships')).select()
==>[relationships:[:], user:v[123]]
==>[relationships:[bob:1], user:v[124]]
==>[relationships:[bob:2], user:v[125]]
==>[relationships:[bob:1], user:v[126]]
Ok - success - that's what we wanted: no lambdas and local counts. But, it still left me feeling like: "Do we really need match step"? That's when Mr. Kuppitz closed in on the final answer which makes copious use of the by step:
gremlin> g.V().has('description','developer').as("user","relationships").select().by()
.by(out().hasId(123).values("name").groupCount())
==>[user:v[123], relationships:[:]]
==>[user:v[124], relationships:[bob:1]]
==>[user:v[125], relationships:[bob:2]]
==>[user:v[126], relationships:[bob:1]]
As you can see, by can be chained (on some steps). The first by groups by vertex and the second by processes the grouped elements with a "local" groupCount.
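For readers less familiar with Gremlin, the difference between a global and a per-user ("local") count can be shown with a plain-Python sketch over the same sample data. This is not TinkerPop code; the function name relationships_per_user and the data layout are assumptions made only for illustration.

```python
from collections import Counter

# Sample data from the question: vertex id -> properties.
VERTICES = {
    123: {"description": "developer", "name": "bob"},
    124: {"description": "developer", "name": "bill"},
    125: {"description": "developer", "name": "brandy"},
    126: {"description": "developer", "name": "beatrice"},
}
# Directed edges as (out vertex, label, in vertex).
EDGES = [
    (124, "follows", 125),
    (124, "follows", 123),
    (124, "likes", 126),
    (125, "follows", 123),
    (125, "likes", 123),
    (126, "follows", 123),
    (126, "follows", 124),
]


def relationships_per_user(target_id):
    """For each developer, count outgoing edges to target_id, keyed by target name."""
    rows = []
    for user_id, props in VERTICES.items():
        if props["description"] != "developer":
            continue
        # A fresh Counter per user is the "local" count; one shared Counter
        # across all users would reproduce the summed result from the question.
        counts = Counter(
            VERTICES[dst]["name"]
            for src, _, dst in EDGES
            if src == user_id and dst == target_id
        )
        rows.append({"user": user_id, "relationships": dict(counts)})
    return rows


rows = relationships_per_user(123)
```

The per-user rows match the Gremlin output above: an empty map for vertex 123, bob:1 for 124 and 126, and bob:2 for 125.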

How to handle the Set Similarity in Database

I have an interesting problem. We receive feed files from our customers, which contain products along with their information. We log each feed request received from our customers in a database.
The problem is that, given a feed file, we need to get all the feed requests that have the same list of products as the given feed file. Every feed request has nearly 2 million candidate feeds for matching.
Let me summarize the problem, just to make sure that we are on the same page.
The application may get a Feed Request, which contains list of products. Every time it happens, you log FR in db, and in addition you want to check for all FRs in the past which contained the same products set, is that right?
If so, an idea is to generate a hash key for a list of products within a FR. In that way every FR in db, has its own hash - which corresponds to list of products this FR contained.
E.g., a Feed Request comes to the app, and it contains products 2, 1, 3. The app sorts the product identities: [1, 2, 3], and then generates the hash: h([1, 2, 3]) = abc. Then, all you need to do to look for previous FRs with the same product set is to run a query: "get all records from feed requests where hash is equal to "abc"".
Such a comparison is not very expensive if you index the data in the right way, even if there are millions of records.
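A minimal Python sketch of this hashing idea follows. It assumes product IDs are integers and that duplicates within one feed should be ignored (the deduplication is my assumption, based on the "same products set" wording). SHA-256 is just one possible choice of hash, and the dictionary stands in for an indexed hash column in the database. The names product_set_hash and log_feed_request are hypothetical.

```python
import hashlib
from collections import defaultdict


def product_set_hash(product_ids):
    """Hash a product list so that order and duplicates do not matter."""
    # Sort and deduplicate, then join with a separator so that
    # [1, 23] and [12, 3] produce different canonical strings.
    canonical = ",".join(str(p) for p in sorted(set(product_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical stand-in for an indexed hash column: hash -> feed request IDs.
index = defaultdict(list)


def log_feed_request(request_id, product_ids):
    """Store the request and return earlier requests with the same product set."""
    key = product_set_hash(product_ids)
    matches = list(index[key])
    index[key].append(request_id)
    return matches


first = log_feed_request("fr1", [2, 1, 3])   # nothing logged yet
second = log_feed_request("fr2", [1, 2, 3])  # same set as fr1, different order
third = log_feed_request("fr3", [1, 2])      # different set
```

In a real table the equivalent lookup would be an equality match on the indexed hash column; if hash collisions are a concern, the matching rows can be confirmed by comparing the actual product lists.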
