I am using gremlin with AWS Neptune, and for certain reasons, I want to traverse the graph in either depth-first or breadth-first manner (doesn't matter). This is what I am doing currently:
g.V('0').repeat(out('connected_to').dedup().where(without('z')).aggregate('z')).until(out('connected_to').dedup().where(without('z')).count().is(0)).select('z').limit(1).unfold()
I know that a path exists from vertex '0' to every other vertex in the graph, but there may be cycles in the graph and so, I use the Collection 'z' to keep track of visited nodes, making sure I do not revisit such a node.
If this were to work, I would have all 1000 vertices of the graph in 'z' at the end. But that isn't the case. I get 600 vertices and some vertices are missing even though they have clear incoming edges from other vertices that are in 'z'. What's wrong with my logic here?
I am pretty new to GraphDBs so excuse me if this is trivial.
I have two groups of vertices, a and b, in an Azure CosmosDB graph that I query with Gremlin. Every vertex of group a can connect to any vertex of group b, at most once. I am looking to find every vertex in group b that is connected to at least two vertices in group a.
So in this example I would like to get back vertices [b2, b3].
In case it's relevant: there would usually be a lot more vertices in group b.
Creating a graph from your diagram, you can do something like this:
g.addV('person').property('group','A').property('name','a1').as('a1').
addV('person').property('group','A').property('name','a2').as('a2').
addV('person').property('group','A').property('name','a3').as('a3').
addV('person').property('group','B').property('name','b1').as('b1').
addV('person').property('group','B').property('name','b2').as('b2').
addV('person').property('group','B').property('name','b3').as('b3').
addV('person').property('group','B').property('name','b4').as('b4').
addE('knows').from('b3').to('a2').
addE('knows').from('b3').to('a3').
addE('knows').from('a2').to('b2').
addE('knows').from('b2').to('a1').
addE('knows').from('a1').to('b1').
addE('knows').from('a3').to('b4').iterate()
gremlin> g.V().has('group','B').
......1> filter(both().has('group','A').count().is(gte(2))).
......2> values('name')
==>b2
==>b3
If any of the vertices may have a lot of connected edges, it is probably more efficient to change the test part of the query to:
filter(both().has('group','A').limit(2).count().is(2))
I am very new to graph databases, and I have started with ArangoDB. For this project I am not sure which queries I will encounter in the future, and I don't want to create bottlenecks, so I wanted to create undirected or bidirectional edges everywhere.
However, as only directed edges are supported, my current understanding is that if some vertex is not reachable by a directed traversal then I'll hit a bottleneck later. So whenever I create an edge a -> b, I also create b -> a in the same edge collection.
Are my assumptions correct? And is the design decision acceptable?
While edges are always directed, you can choose to ignore the edge direction in a traversal by using ANY: https://www.arangodb.com/docs/stable/aql/graphs-traversals.html
OUTBOUND to follow an edge in its defined direction (_from → _to)
INBOUND to follow in the opposite direction (_from ← _to)
ANY to follow regardless of the edge direction, inbound and outbound (_from ↔ _to)
Is it possible to implement the Christofides algorithm for a directed graph?
Suppose you have a graph in which every vertex has an edge in both directions to every other vertex in the graph (but not to itself), and the weights of the two edges between a pair of vertices don't necessarily have to be the same (asymmetric).
For example, think of a street map in which there are a lot of one-way streets.
We now want to find an approximation for the traveling salesman tour through all the vertices.
First of all, the Christofides algorithm is not defined for such a TSP, because the minimum spanning tree is not defined for a directed graph.
But we still start the algorithm by finding the optimum branching with Edmonds' algorithm, using the start point of the tour as the root.
Then we find a minimal perfect matching for the branching, so that it becomes an Eulerian graph. This is done with the Hungarian algorithm, which finds a minimal matching such that every vertex in the branching afterwards has the same number of edges coming in and going out.
In the last step we find the Euler tour and optimize the tour by taking shortcuts.
I have two questions:
Is the way I want to implement the algorithm right, or did I make a mistake so that it can't work?
If it works, is it still bounded by 1.5 times the optimal solution for the TSP?
The deadline for this project is closing in very quickly and I don't have much time to deal with what's left. So, instead of looking for the best (and probably more complicated/time consuming) algorithms, I'm looking for the easiest algorithms to implement a few operations on a Graph structure.
The operations I'll need to do are as follows:
List all users in the graph network given a distance X
List all users in the graph network given a distance X and the type of relation
Calculate the shortest path between 2 users on the graph network given a type of relation
Calculate the maximum distance between 2 users on the graph network
Calculate the most distant connected users on the graph network
A few notes about my Graph implementation:
Each edge node has 2 properties, one of type char and the other of type int. They represent the type of relation and the weight, respectively.
The Graph is implemented with linked lists, for both the vertices and edges. I mean, each vertex points to the next one and each vertex also points to the head of a different linked list, the edges for that specific vertex.
What I know about what I need to do:
I don't know if this is the easiest as I said above, but for the shortest path between 2 users, I believe the Dijkstra algorithm is what people seem to recommend pretty often so I think I'm going with that.
I've been searching and searching and I'm finding it hard to implement this algorithm. Does anyone know of a tutorial or something easy to understand so I can implement it myself? If possible, C source code examples would help a lot. I see many examples with math notation, but that just confuses me even more.
Do you think it would help if I "converted" the graph to an adjacency matrix to represent the links weight and relation type? Would it be easier to perform the algorithm on that instead of the linked lists? I could easily implement a function to do that conversion when needed. I'm saying this because I got the feeling it would be easier after reading a couple of pages about the subject, but I could be wrong.
I don't have any ideas about the other 4 operations, suggestions?
List all users in the graph network given a distance X
A distance X from what? from a starting node or a distance X between themselves? Can you give an example? This may or may not be as simple as doing a BF search or running Dijkstra.
Assuming you start at a certain node and want to list all nodes at distance X from it, just run BFS from the starting node. When you are about to insert a new node into the queue, check whether the distance from the starting node to the current node, plus the weight of the edge from the current node to the new node, is <= X. If it's strictly lower, insert the new node; if it is equal, just print the new node (and only insert it if you can also have 0 as an edge weight).
List all users in the graph network given a distance X and the type of relation
See above. Just factor the type of relation into the BFS: if the edge leading to the node you are trying to insert into the queue is not of the requested relation type, don't insert it.
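A minimal sketch of the two listings above, in Python rather than C for brevity. It swaps the plain FIFO queue for a heap (i.e. Dijkstra), because with weighted edges a FIFO queue does not guarantee that the first distance found for a node is the shortest one. The graph layout (a dict mapping each user to a list of `(neighbor, relation, weight)` tuples) and the names `users_within` and `network` are assumptions made for illustration, not the asker's linked-list structure.

```python
import heapq

def users_within(graph, start, max_dist, relation=None):
    """Return {user: shortest distance} for every user reachable from
    `start` with total weight <= max_dist. If `relation` is given, only
    edges of that relation type are followed. Assumes weights >= 0."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry, a shorter route was already found
        for neighbor, rel, weight in graph.get(node, []):
            if relation is not None and rel != relation:
                continue  # wrong relation type: do not follow this edge
            nd = d + weight
            # Only keep nodes that stay within the distance budget.
            if nd <= max_dist and nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

# Hypothetical example network: 'f' and 'w' are made-up relation types.
network = {
    "ann": [("bob", "f", 2), ("cid", "w", 1)],
    "bob": [("dan", "f", 2)],
    "cid": [("dan", "w", 5)],
    "dan": [],
}
print(users_within(network, "ann", 4))
print(users_within(network, "ann", 4, relation="w"))
```

To list only the users at exactly distance X, filter the returned dict for values equal to X.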
Calculate the shortest path between 2 users on the graph network given a type of relation
The algorithm depends on a number of factors:
How often will you need to calculate this?
How many nodes do you have?
Since you want easy, the easiest are Roy-Floyd and Dijkstra's.
Roy-Floyd is cubic in the number of nodes, so it is inefficient. Only use it if you can afford to run it once and then answer each query in O(1), and if you can afford to keep an adjacency matrix in memory.
Dijkstra's is quadratic in the number of nodes if you want to keep it simple, but you'll have to run it each time you want to calculate the distance between two nodes. If you want to use Dijkstra's, use an adjacency list.
Here are C implementations: Roy-Floyd and Dijkstra_1, Dijkstra_2. You can find a lot on google with "<algorithm name> c implementation".
Edit: Roy-Floyd is out of the question for 18 000 nodes, as is an adjacency matrix. It would take way too much time to build and way too much memory. Your best bet is to use Dijkstra's algorithm for each query, preferably implemented with a heap: in the links I provided, use a heap to find the minimum. If you run the classical Dijkstra on each query, that could also take a very long time.
Another option is to use the Bellman-Ford algorithm on each query, which will give you O(Nodes*Edges) runtime per query. However, this is a big overestimate IF you don't implement it as Wikipedia tells you to. Instead, use a queue similar to the one used in BFS. Whenever a node updates its distance from the source, insert that node back into the queue. This will be very fast in practice, and will also work for negative weights (as long as there is no negative cycle). I suggest you use either this or the Dijkstra with heap, since classical Dijkstra might take a long time on 18 000 nodes.
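The queue-based Bellman-Ford variant described above can be sketched roughly as follows, in Python for brevity. The dict-of-lists graph layout and the name `bellman_ford_queue` are assumptions for illustration, and the sketch assumes the graph has no negative cycle (it would loop forever on one).

```python
from collections import deque

def bellman_ford_queue(graph, source):
    """Shortest distances from `source` using a FIFO queue of nodes whose
    distance recently improved. `graph` maps node -> [(neighbor, weight)].
    Negative weights are fine; negative cycles are NOT handled."""
    dist = {source: 0}
    queue = deque([source])
    in_queue = {source}  # avoids putting the same node in the queue twice
    while queue:
        node = queue.popleft()
        in_queue.discard(node)
        for neighbor, weight in graph.get(node, []):
            nd = dist[node] + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                # The neighbor's distance changed, so its own outgoing
                # edges must be relaxed again: put it back in the queue.
                if neighbor not in in_queue:
                    queue.append(neighbor)
                    in_queue.add(neighbor)
    return dist

# Hypothetical example with one negative edge (c -> b).
example = {"a": [("b", 4), ("c", 1)], "c": [("b", -2)], "b": [("d", 3)]}
print(bellman_ford_queue(example, "a"))
```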
Calculate the maximum distance between 2 users on the graph network
The simplest way is to use backtracking: try all possibilities and keep the longest path found. The longest-path problem is NP-hard, so no polynomial-time algorithm is known for it.
This is really bad if you have 18 000 nodes; I don't know any algorithm (simple or otherwise) that will work reasonably fast for so many nodes. Consider approximating it using greedy algorithms. Or maybe your graph has certain properties that you could take advantage of. For example, is it a DAG (Directed Acyclic Graph)?
Calculate the most distant connected users on the graph network
Meaning you want to find the diameter of the graph. The simplest way to do this is to find the distance between every pair of nodes (all-pairs shortest paths: either run Roy-Floyd, or run Dijkstra from each node) and pick the pair with the maximum distance.
Again, this is very hard to do fast with your number of nodes and edges. I'm afraid you're out of luck on these last two questions, unless your graph has special properties that can be exploited.
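For reference, the "Dijkstra from each node" approach to the diameter might look like the sketch below (Python for brevity; the graph layout and the names `shortest_distances` and `most_distant_pair` are assumptions). It costs one heap-based Dijkstra per node, so it only illustrates the idea and does not make the problem cheap on a large graph. Pairs with no path between them are ignored.

```python
import heapq

def shortest_distances(graph, source):
    """Heap-based Dijkstra. `graph` maps node -> [(neighbor, weight)],
    with non-negative weights. Returns {node: distance} for reachable nodes."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale entry
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

def most_distant_pair(graph):
    """Return (distance, source, target) for the connected pair whose
    shortest-path distance is largest."""
    best = (0, None, None)
    for source in graph:
        for target, d in shortest_distances(graph, source).items():
            if d > best[0]:
                best = (d, source, target)
    return best

# Hypothetical undirected triangle, stored as edges in both directions.
triangle = {
    "a": [("b", 1), ("c", 5)],
    "b": [("a", 1), ("c", 2)],
    "c": [("b", 2), ("a", 5)],
}
print(most_distant_pair(triangle))
```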
Do you think it would help if I "converted" the graph to an adjacency matrix to represent the links weight and relation type? Would it be easier to perform the algorithm on that instead of the linked lists? I could easily implement a function to do that conversion when needed. I'm saying this because I got the feeling it would be easier after reading a couple of pages about the subject, but I could be wrong.
No, adjacency matrix and Roy-Floyd are a very bad idea unless your application targets supercomputers.
This assumes O(E log V) is an acceptable running time. If you're doing something online, it might not be, and you would need some higher-powered machinery.
List all users in the graph network given a distance X
Dijkstra's algorithm is good for this, for one-time use. You can save the result for future use, with a linear scan through all the vertices (or better yet, sort and binary search).
List all users in the graph network given a distance X and the type of relation
Might be nearly the same as above -- just use some function where the weight would be infinity if it is not of the correct relation.
Calculate the shortest path between 2 users on the graph network given a type of relation
Same as above, essentially; just stop early once you reach the second user. (Alternatively, you can "meet in the middle" by searching from both users, and terminate early once you find someone who is in both shortest-path trees.)
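A rough sketch of the early-exit idea, in Python for brevity: a heap-based Dijkstra that skips edges of the wrong relation type (equivalent to giving them infinite weight) and stops as soon as the target user is popped from the heap. The graph layout, the name `shortest_path`, and the example data are assumptions for illustration; weights are assumed non-negative.

```python
import heapq

def shortest_path(graph, start, goal, relation):
    """Return (distance, [start, ..., goal]) using only edges of type
    `relation`, or None if `goal` is unreachable.
    `graph` maps node -> [(neighbor, relation, weight)]."""
    dist = {start: 0}
    parent = {start: None}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            # First time the goal is popped its distance is final: stop early
            # and rebuild the path by walking the parent links backwards.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return d, path[::-1]
        if d > dist[node]:
            continue  # stale entry
        for neighbor, rel, weight in graph.get(node, []):
            if rel != relation:
                continue  # treat edges of other relation types as absent
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                parent[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    return None

# Hypothetical example network: 'f' and 'w' are made-up relation types.
network = {
    "ann": [("bob", "f", 2), ("cid", "w", 1)],
    "bob": [("dan", "f", 2)],
    "cid": [("dan", "w", 5)],
    "dan": [],
}
print(shortest_path(network, "ann", "dan", "f"))
```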
Calculate the maximum distance between 2 users on the graph network
Longest path is an NP-complete problem.
Calculate the most distant connected users on the graph network
This is the diameter of the graph, which you can read about on Math World.
As for the adjacency list vs adjacency matrix question, it depends on how densely populated your graph is. Also, if you want to cache results, then the matrix might be the way to go.
The simplest algorithm to compute the shortest path between two nodes is Floyd-Warshall. It's just triple-nested for loops; that's it.
It computes ALL-pairs shortest path in O(N^3), so it may do more work than necessary, and will take a while if N is huge.
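To show how short it is, here is a sketch of the triple loop in Python. It assumes nodes are numbered 0..n-1 and that edges come as `(from, to, weight)` tuples; the function name `floyd_warshall` and the example edges are made up for illustration.

```python
def floyd_warshall(n, edges):
    """All-pairs shortest distances for nodes 0..n-1.
    `edges` is a list of directed (u, v, weight) tuples.
    Returns an n x n matrix; unreachable pairs stay at infinity."""
    INF = float("inf")
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)  # keep the cheapest parallel edge
    # Allow each node k in turn to be used as an intermediate stop.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Hypothetical 4-node example.
d = floyd_warshall(4, [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5)])
print(d[0][1], d[0][3])
```

The matrix alone needs O(N^2) memory, which is the other reason it stops being practical when N is huge.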