Depth/Breadth First Search Traversal - database

I am using Gremlin with AWS Neptune and, for certain reasons, I want to traverse the graph in either a depth-first or a breadth-first manner (it doesn't matter which). This is what I am doing currently:
g.V('0').repeat(out('connected_to').dedup().where(without('z')).aggregate('z')).until(out('connected_to').dedup().where(without('z')).count().is(0)).select('z').limit(1).unfold()
I know that a path exists from vertex '0' to every other vertex in the graph, but there may be cycles in the graph and so, I use the Collection 'z' to keep track of visited nodes, making sure I do not revisit such a node.
If this were to work, I would have all 1000 vertices of the graph in 'z' at the end. But that isn't the case. I get 600 vertices and some vertices are missing even though they have clear incoming edges from other vertices that are in 'z'. What's wrong with my logic here?
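For reference, the result the query is meant to produce can be written down outside Gremlin. The following is a minimal Python sketch over an in-memory adjacency dict (the `adj` data is made up for illustration); it expands one level per loop and keeps a visited set. It only pins down the expected output and says nothing about how Neptune evaluates the traversal.

```python
def reachable(adj, start):
    """Return every vertex reachable from start, expanding one level per loop."""
    visited = {start}
    frontier = [start]
    while frontier:
        next_frontier = []
        for v in frontier:
            for w in adj.get(v, ()):
                if w not in visited:      # skip vertices seen on any earlier path
                    visited.add(w)
                    next_frontier.append(w)
        frontier = next_frontier
    return visited

# Hypothetical graph with cycles (0 -> 2 -> 0 and 1 -> 3 -> 1).
adj = {"0": ["1", "2"], "1": ["2", "3"], "2": ["0"], "3": ["1", "4"], "4": []}
print(sorted(reachable(adj, "0")))
```

Comparing the vertex set returned by a sketch like this with the contents of 'z' would show exactly which vertices the query drops.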

Related

creating a cost function in jgrapht

JGraphT supports putting a weight (a cost) on an edge between two nodes. This can be achieved using the class DefaultWeightedEdge.
In my graph I need to find not the shortest path but the cheapest one. The cheapest path might be longer (have more hops) than the shortest path.
Therefore, one can use the DijkstraShortestPath algorithm to achieve this.
However, my use case is a bit more complex: It needs to also evaluate costs on actions that need to be executed when arriving at a node.
Let's say you have a graph like a chess board (8x8 fields, each field being a node). All the edges have a weight of 1. To move in a car from the bottom-left corner to the diagonally opposite corner (top right), there are many paths with a cost of 16. You can take a diagonal path in a zigzag style, or you can first travel all nodes to the right and then all nodes upwards.
The difference is: when taking the zigzag, you need to rotate yourself in the direction of movement each time. You rotate 16 times.
When moving first all the way to the right and then upwards, you need to rotate only once (maybe twice, depending on your start orientation).
So the zigzag path is, from Dijkstra's point of view, perfect. From a logical point of view, it's the worst.
Long story short: How can I put some costs on a node or edge depending on the previous edge/node in that path? I did not find anything related in the source code of jgrapht.
Or is there a better algorithm to use?
This is not a JGraphT issue but a graph algorithm issue. You need to think about how to encode this problem and formalize it in more detail.
Incorporating weights on vertices is in general easy. Say that every vertex represents visiting a customer, which takes a_i time. This can be encoded in the graph by adding a_i/2 to the cost of every incoming arc in node i, as well as a_i/2 to the cost of every outgoing arc.
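A minimal Python sketch of this folding step, assuming arcs are stored as a dict keyed by (tail, head) and the per-vertex times a_i live in a dict called `service` (both names are made up for illustration):

```python
def fold_vertex_costs(arcs, service):
    """Add half of each endpoint's vertex cost to every arc.

    arcs:    {(i, j): arc cost}
    service: {vertex: a_i}, the time spent at that vertex
    """
    return {(i, j): cost + service[i] / 2 + service[j] / 2
            for (i, j), cost in arcs.items()}

arcs = {("s", "x"): 1.0, ("x", "t"): 2.0}
service = {"s": 0.0, "x": 4.0, "t": 0.0}
folded = fold_vertex_costs(arcs, service)
print(folded)  # {('s', 'x'): 3.0, ('x', 't'): 4.0}
```

A path through x now costs 3.0 + 4.0 = 7.0, which is the two arc costs plus the full a_x = 4.0. Note that the first and last vertex of a path only contribute half of their own cost, which is a constant offset when source and target are fixed.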
A cost function where the cost of traveling from j to k depends on the arc (i,j) you used to travel to j is more complicated.
Approach a.: Use a dynamic programming (labeling) algorithm. This is perhaps the easiest. You can define your cost function as a recursive function, where the cost of traversing an arc depends on the cost of the previous arc.
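One way to read approach a. is to run Dijkstra over labels (position, heading) instead of over plain nodes, so the cost of a step can depend on the heading used to arrive. The sketch below is plain Python, not JGraphT; the grid layout, the four headings and the single `turn_cost` (charged for any change of heading, including a U-turn) are assumptions made for illustration.

```python
import heapq

MOVES = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

def cheapest_with_turns(size, start, goal, start_heading, step_cost=1, turn_cost=1):
    """Dijkstra over labels (x, y, heading); changing heading adds turn_cost."""
    best = {(start[0], start[1], start_heading): 0}
    heap = [(0, start[0], start[1], start_heading)]
    while heap:
        cost, x, y, heading = heapq.heappop(heap)
        if (x, y) == goal:
            return cost                    # first time the goal is popped is optimal
        if cost > best.get((x, y, heading), float("inf")):
            continue                       # stale heap entry
        for new_heading, (dx, dy) in MOVES.items():
            nx, ny = x + dx, y + dy
            if not (0 <= nx < size and 0 <= ny < size):
                continue
            turn = turn_cost if new_heading != heading else 0
            new_cost = cost + step_cost + turn
            state = (nx, ny, new_heading)
            if new_cost < best.get(state, float("inf")):
                best[state] = new_cost
                heapq.heappush(heap, (new_cost, nx, ny, new_heading))
    return None

# Small 4x4 grid, corner to corner, initially facing east.
print(cheapest_with_turns(4, (0, 0), (3, 3), "E"))  # 7 = 6 steps + 1 turn
```

With `turn_cost=0` the same call returns 6, so every monotone path ties, which is the situation described in the question. With a positive turn cost the straight-then-up route wins over the zigzag.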
Approach b.: With some tricks you may be able to encode the costs in the graph by adding extra nodes to it. Here's an example:
Given a graph with vertices {a,b,c,d,e}, with arcs: (a,e), (e,b), (c,e), (e,d). This graph represents a crossroad with vertex e being in the middle. Going from a->e->b (straight) is free, however, a turn from a->e->d takes additional time. Similar for c->e->d (straight) is free and c->e->b (turning) should be penalized.
Split vertex e into 4 new vertices: e1, e2, e3, e4.
Add the following arcs:
(a,e1), (e3,b), (c,e2), (e4,d), (e2, e3), (e1, e3), (e1, e4), (e2, e4).
(e1,e4) and (e2,e3) can have a positive weight to penalize turning.
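The expanded crossroad can be written out and checked by hand. In this Python sketch the concrete weights (`ROAD = 1`, `TURN_PENALTY = 5`) are assumptions; the answer only says the two turning arcs get a positive weight.

```python
TURN_PENALTY = 5  # assumed value; any positive weight penalizes the turn
ROAD = 1          # assumed weight for the ordinary road arcs

arcs = {
    ("a", "e1"): ROAD, ("c", "e2"): ROAD,
    ("e3", "b"): ROAD, ("e4", "d"): ROAD,
    ("e1", "e3"): 0,             # a -> e -> b, straight
    ("e2", "e4"): 0,             # c -> e -> d, straight
    ("e1", "e4"): TURN_PENALTY,  # a -> e -> d, turn
    ("e2", "e3"): TURN_PENALTY,  # c -> e -> b, turn
}

def path_cost(path):
    """Sum the arc weights along a path given as a list of vertices."""
    return sum(arcs[(u, v)] for u, v in zip(path, path[1:]))

print(path_cost(["a", "e1", "e3", "b"]))  # straight: 2
print(path_cost(["a", "e1", "e4", "d"]))  # turn: 7
```

Because the turn penalty is now an ordinary arc weight, an unmodified shortest-path algorithm such as Dijkstra can be run on the expanded graph.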

ArangoDB: Traversals where edges are connected to other edges

I recently read that ArangoDB is capable of connecting edges to other edges in a graph. In this situation, how would querying the path work? For example:
car <-------- part
^
|
|
installationEvidence
In this case, installationEvidence is a node connecting to the edge between the part to the car. Starting from the car node, what is the AQL to return installationEvidence but not part? Are both installationEvidence and part considered at the p.vertices[1] layer?
In ArangoDB edges are a special type of Documents.
That is why you can store edges pointing to other edges.
From a query point of view there are two directions for these edges:
A) The traversal leads to the target edge. In this case the target edge is treated as an ordinary document, and the traversal will not follow it in either direction.
This means you would have to write two traversal steps in the statement:
the first ending in the edge,
the second starting at _from or _to of the edge.
In your case the query could look like this:
FOR edge IN 1 OUTBOUND #installationEvidence ##edges1
LET car = DOCUMENT(edge._to)
RETURN car
B) A traversal walks through an edge which has other edges pointing to it.
This case is more complicated. In ArangoDB's architecture the "vertex" does not know anything about its attached edges; the edges know their vertices.
What you could do in this case is to again write two traversal statements where the second starts with the edge encountered, e.g.:
FOR part,edge IN 1 INBOUND #car ##edges1
FOR installationEvidence IN 1 INBOUND edge ##edges2
[...]
For the time being we have not encountered any customer use case that would require making the above traversal more transparent. If this is critical for you, please contact us and we can increase the priority of making this kind of query easier to formulate.

What edges are not in any MST

This is a homework question. I do not want the solution - I'm offering the solution I've been thinking of and wish to know whether it is good or why it is flawed.
My motivation is to find which edges of a weighted, undirected graph are not a part of any MST. This problem only makes sense when several edges have the same values, otherwise the MST is unique.
My idea comes from Prim's Algorithm with a slight change: instead of adding only the minimum edge from S to T on every step (where S and T are the two sets of vertices), look for the minimum edge and also any other edges of the same value going from S to the vertex the minimum edge leads to. By doing that (so I suppose) we will obtain a graph containing all the edges which appear in any MST. If this is right, I can simply XOR the edge list with the original graph's edge list to find which edges are not in any MST.
Thanks in advance.
Do you add all the edges you find (=those with equal weight)? If so, you will lose some edges:
Consider a pentagon with equal edge costs. You start with 1 node and add the 2 edges to the 2 adjacent nodes. In your next step you would add the 2 edges going from those 2 adjacent nodes to the 2 disconnected nodes and you would be done. However, all edges are of equal cost and they are all valid to be in an MST. The edge between the last 2 nodes is not included by your algorithm but could be part of an MST.
It's even worse. Suppose that last edge is of lower cost. Your algorithm still doesn't include it, yet it's present in every MST. You're adding several edges per step to account for all the possibilities but adding those edges changes the next steps.
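The pentagon claim is easy to check by brute force on small inputs. The following Python sketch is only a test harness, not an efficient algorithm: it enumerates every spanning tree of a small connected graph and collects the edges that occur in at least one minimum-weight tree. The function names and the vertex numbering 0..n-1 are assumptions for illustration.

```python
from itertools import combinations

def is_spanning_tree(n, tree):
    """True if the n-1 given edges contain no cycle (and hence span all n vertices)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for u, v, _ in tree:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False          # this edge would close a cycle
        parent[ru] = rv
    return True

def edges_in_some_mst(n, edges):
    """Brute force: union of all minimum-weight spanning trees (connected graphs only)."""
    trees = [t for t in combinations(edges, n - 1) if is_spanning_tree(n, t)]
    best = min(sum(w for _, _, w in t) for t in trees)
    return {e for t in trees if sum(w for _, _, w in t) == best for e in t}

pentagon = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 0, 1)]
print(edges_in_some_mst(5, pentagon) == set(pentagon))  # True: every edge is in some MST
```

Any candidate algorithm can be compared against this on small graphs; the equal-cost pentagon already separates the Prim variant from the expected answer, since all five edges belong to some MST.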

Suggestions of the easiest algorithms for some Graph operations

The deadline for this project is closing in very quickly and I don't have much time to deal with what's left. So, instead of looking for the best (and probably more complicated/time consuming) algorithms, I'm looking for the easiest algorithms to implement a few operations on a Graph structure.
The operations I'll need to do are as follows:
List all users in the graph network given a distance X
List all users in the graph network given a distance X and the type of relation
Calculate the shortest path between 2 users on the graph network given a type of relation
Calculate the maximum distance between 2 users on the graph network
Calculate the most distant connected users on the graph network
A few notes about my Graph implementation:
The edge node has 2 properties, one is of type char and another int. They represent the type of relation and weight, respectively.
The Graph is implemented with linked lists, for both the vertices and edges. I mean, each vertex points to the next one and each vertex also points to the head of a different linked list, the edges for that specific vertex.
What I know about what I need to do:
I don't know if this is the easiest as I said above, but for the shortest path between 2 users, I believe the Dijkstra algorithm is what people seem to recommend pretty often so I think I'm going with that.
I've been searching and searching and I'm finding it hard to implement this algorithm, does anyone know of any tutorial or something easy to understand so I can implement this algorithm myself? If possible, with C source code examples, it would help a lot. I see many examples with math notations but that just confuses me even more.
Do you think it would help if I "converted" the graph to an adjacency matrix to represent the links weight and relation type? Would it be easier to perform the algorithm on that instead of the linked lists? I could easily implement a function to do that conversion when needed. I'm saying this because I got the feeling it would be easier after reading a couple of pages about the subject, but I could be wrong.
I don't have any ideas about the other 4 operations, suggestions?
List all users in the graph network given a distance X
A distance X from what? from a starting node or a distance X between themselves? Can you give an example? This may or may not be as simple as doing a BF search or running Dijkstra.
Assuming you start at a certain node and want to list all nodes at distance X from the starting node, just run BFS from the starting node. When you are about to insert a new node into the queue, take the distance from the starting node to the node you are currently expanding, add the weight of the edge from that node to the new node, and check whether the sum is <= X. If it is strictly lower, insert the new node; if it is equal, just print the new node (and only insert it if you can also have 0 as an edge weight).
List all users in the graph network given a distance X and the type of relation
See above. Just factor the type of relation into the BFS: if the type of the parent is different from that of the node you are trying to insert into the queue, don't insert it.
Calculate the shortest path between 2 users on the graph network given a type of relation
The algorithm depends on a number of factors:
How often will you need to calculate this?
How many nodes do you have?
Since you want easy, the easiest are Roy-Floyd (also known as Floyd-Warshall) and Dijkstra's.
Roy-Floyd is cubic in the number of nodes, so inefficient. Only use it if you can afford to run it once and then answer each query in O(1), and if you can afford to keep an adjacency matrix in memory.
Dijkstra's is quadratic in the number of nodes if you want to keep it simple, but you'll have to run it each time you want to calculate the distance between two nodes. If you want to use Dijkstra's, use an adjacency list.
Here are C implementations: Roy-Floyd and Dijkstra_1, Dijkstra_2. You can find a lot on google with "<algorithm name> c implementation".
Edit: Roy-Floyd is out of the question for 18 000 nodes, as is an adjacency matrix. It would take way too much time to build and way too much memory. Your best bet is to run Dijkstra's algorithm for each query, preferably implemented with a heap: in the links I provided, use a heap to find the minimum. If you run the classical Dijkstra on each query, that could also take a very long time.
Another option is to use the Bellman-Ford algorithm on each query, which will give you O(Nodes*Edges) runtime per query. However, this is a big overestimate IF you don't implement it as Wikipedia tells you to. Instead, use a queue similar to the one used in BFS. Whenever a node updates its distance from the source, insert that node back into the queue. This will be very fast in practice, and will also work for negative weights. I suggest you use either this or the Dijkstra with heap, since classical Dijkstra might take a long time on 18 000 nodes.
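A minimal Python sketch of the queue-based Bellman-Ford variant described above. The adjacency format (a dict of (neighbour, weight, relation type) triples), the user names and the optional `relation` filter are assumptions for illustration. The sketch has no negative-cycle detection, so it assumes the graph contains none.

```python
from collections import deque

def queue_bellman_ford(adj, source, relation=None):
    """Shortest distances from source; a node is re-queued whenever its distance improves.

    adj: {u: [(v, weight, relation_type), ...]}
    relation: if given, only edges of that relation type are followed.
    """
    dist = {source: 0}
    queue = deque([source])
    in_queue = {source}
    while queue:
        u = queue.popleft()
        in_queue.discard(u)
        for v, w, rel in adj.get(u, []):
            if relation is not None and rel != relation:
                continue
            if dist[u] + w < dist.get(v, float("inf")):
                dist[v] = dist[u] + w
                if v not in in_queue:      # avoid duplicate queue entries
                    queue.append(v)
                    in_queue.add(v)
    return dist

# Hypothetical network: 'F' and 'W' are made-up relation types.
adj = {
    "ann": [("bob", 4, "F"), ("cat", 1, "F")],
    "cat": [("bob", 2, "F"), ("dan", 7, "W")],
    "bob": [("dan", 1, "F")],
}
print(queue_bellman_ford(adj, "ann", "F"))
```

The adjacency dict stands in for the linked lists of the original implementation; the loop over `adj[u]` corresponds to walking the edge list hanging off vertex u.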
Calculate the maximum distance between 2 users on the graph network
The simplest way is to use backtracking: try all possibilities and keep the longest path found. The longest path problem is NP-hard, so no polynomial-time algorithm is known for it.
This is really bad if you have 18 000 nodes; I don't know any algorithm (simple or otherwise) that will work reasonably fast for so many nodes. Consider approximating it using greedy algorithms. Or maybe your graph has certain properties that you could take advantage of. For example, is it a DAG (Directed Acyclic Graph)?
Calculate the most distant connected users on the graph network
Meaning you want to find the diameter of the graph. The simplest way to do this is to find the distances between each two nodes (all pairs shortest paths - either run Roy-Floyd or Dijkstra between each two nodes and pick the two with the maximum distance).
Again, this is very hard to do fast with your number of nodes and edges. I'm afraid you're out of luck on these last two questions, unless your graph has special properties that can be exploited.
Do you think it would help if I "converted" the graph to an adjacency matrix to represent the links weight and relation type? Would it be easier to perform the algorithm on that instead of the linked lists? I could easily implement a function to do that conversion when needed. I'm saying this because I got the feeling it would be easier after reading a couple of pages about the subject, but I could be wrong.
No, adjacency matrix and Roy-Floyd are a very bad idea unless your application targets supercomputers.
This assumes O(E log V) is an acceptable running time. If you're doing something online, it might not be, and that would require some higher-powered machinery.
List all users in the graph network given a distance X
Dijkstra's algorithm is good for this, for one-time use. You can save the result for future use, with a linear scan through all the vertices (or better yet, sort and binary search).
List all users in the graph network given a distance X and the type of relation
Might be nearly the same as above -- just use some function where the weight would be infinity if it is not of the correct relation.
Calculate the shortest path between 2 users on the graph network given a type of relation
Same as above, essentially; just stop early once you reach the second user. (Alternatively, you can "meet in the middle", and terminate early if you find someone on both shortest-path spanning trees.)
Calculate the maximum distance between 2 users on the graph network
Longest path is an NP-complete problem.
Calculate the most distant connected users on the graph network
This is the diameter of the graph, which you can read about on Math World.
As for the adjacency list vs adjacency matrix question, it depends on how densely populated your graph is. Also, if you want to cache results, then the matrix might be the way to go.
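The first two operations in this answer (all users within distance X, optionally restricted to one relation type) can be sketched with a heap-based Dijkstra that simply ignores edges of the wrong type, which has the same effect as giving them infinite weight. The Python below is an illustration under assumptions: non-negative weights, an adjacency dict of (neighbour, weight, relation type) triples, and made-up names.

```python
import heapq

def users_within(adj, source, max_dist, relation=None):
    """Dijkstra with a heap, pruned at max_dist. Returns {user: distance}.

    adj: {u: [(v, weight, relation_type), ...]} with non-negative weights.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                       # stale heap entry
        for v, w, rel in adj.get(u, []):
            if relation is not None and rel != relation:
                continue                   # wrong relation type: treat as missing
            nd = d + w
            if nd <= max_dist and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

adj = {
    "ann": [("bob", 4, "F"), ("cat", 1, "F")],
    "cat": [("bob", 2, "F"), ("dan", 7, "W")],
    "bob": [("dan", 1, "F")],
}
print(users_within(adj, "ann", 3, "F"))  # {'ann': 0, 'cat': 1, 'bob': 3}
```

The function returns the distances rather than just the names, so the caller can keep users at exactly X or at most X, whichever the assignment means.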
The simplest algorithm to compute shortest path between two nodes is Floyd-Warshall. It's just triple-nested for loops; that's it.
It computes ALL-pairs shortest path in O(N^3), so it may do more work than necessary, and will take a while if N is huge.
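For illustration, here are the triple-nested loops in Python. Vertices are assumed to be numbered 0..n-1 and edges are treated as directed (u, v, weight) triples; both are choices made for this sketch.

```python
def floyd_warshall(n, edges):
    """All-pairs shortest distances as an n x n matrix (inf where unreachable)."""
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)    # keep the cheapest parallel edge
    for k in range(n):                     # allow k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

dist = floyd_warshall(4, [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 1)])
print(dist[0][1], dist[0][3])  # 3 4
```

The matrix needs n*n entries, which is the memory cost that makes this impractical when N is huge.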

Best and easiest algorithm to search for a vertex on a Graph?

After implementing most of the common and needed functions for my Graph implementation, I realized that a couple of functions (remove vertex, search vertex and get vertex) don't have the "best" implementation.
I'm using adjacency lists with linked lists for my Graph implementation and I was searching one vertex after the other until I find the one I want. Like I said, I realized I was not using the "best" implementation. I can have 10000 vertices and need to search for the last one, but that vertex could have a link to the first one, which would speed up things considerably. But that's just a hypothetical case, it may or may not happen.
So, what algorithm do you recommend for search lookup? Our teachers talked about Breadth-first and Depth-first mostly (and Dijkstra's algorithm, but that's a completely different subject). Between those two, which one do you recommend?
It would be perfect if I could implement both but I don't have time for that, I need to pick one and implement it as the first phase deadline is approaching...
My guess is to go with Depth-first: it seems easier to implement and, looking at the way they work, it seems the best bet. But that really depends on the input.
But what do you guys suggest?
If you’ve got an adjacency list, searching for a vertex simply means traversing that list. You could perhaps even order the list to decrease the needed lookup operations.
A graph traversal (such as DFS or BFS) won’t improve this from a performance point of view.
Finding and deleting nodes in a graph is a "search" problem, not a graph problem, so to do better than O(n) (linear search, BFS, DFS), you need to store your nodes in a different data structure optimized for searching, or sort them. This gives you O(log n) for find and delete operations with a sorted or tree structure, and O(1) on average with a hash table. Candidates are tree structures like B-trees, or hash tables. If you want to code the stuff yourself I would go for a hash table, which normally gives very good performance and is reasonably easy to implement.
I think BFS would usually be faster on average. Read the wiki pages for DFS and BFS.
The reason I say BFS is faster is because it has the property of reaching nodes in order of their distance from your starting node. So if your graph has N nodes and you want to search for node N, and node 1, which is the node you start your search from, is linked to N, then you will find it immediately. DFS might expand the whole graph before this happens, however. DFS will only be faster if you get lucky, while BFS will be faster if the nodes you search for are close to your starting node. In short, they both depend on the input, but I would choose BFS.
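As a rough illustration of the hash-table idea, here is a Python sketch in which a dict (a hash table) maps each vertex id to its set of neighbours. The `Graph` class and its method names are made up for this example, and the graph is assumed to be undirected.

```python
class Graph:
    def __init__(self):
        self.adj = {}  # vertex id -> set of neighbouring vertex ids

    def add_edge(self, u, v):
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)

    def has_vertex(self, v):
        return v in self.adj               # hash lookup, no list walk

    def remove_vertex(self, v):
        # Only the neighbours of v need to be touched.
        for u in self.adj.pop(v, set()):
            if u in self.adj:
                self.adj[u].discard(v)

g = Graph()
g.add_edge(1, 2)
g.add_edge(2, 3)
g.remove_vertex(2)
print(g.has_vertex(2), g.adj)  # False {1: set(), 3: set()}
```

In C the same role would be played by a hand-written hash table keyed on the vertex id, with each bucket entry pointing at the existing vertex node.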
DFS is also harder to code without recursion, which makes BFS a bit faster in practice, since it is an iterative algorithm.
If you can normalize your nodes (number them from 1 to 10 000 and access them by number), then you can easily keep Exists[i] = true if node i is in the graph and false otherwise, giving you O(1) lookup time. Otherwise, consider using a hash table if normalization is not possible or you don't want to do it.
Depth-first search is best because
It uses much less memory
Easier to implement
The depth-first and breadth-first algorithms are almost identical, except for the use of a stack in one (DFS), a queue in the other (BFS), and a few required member variables. Implementing them both shouldn't take you much extra time.
Additionally, if you have an adjacency list of the vertices then your lookup will be O(V) anyway. So little to nothing will be gained by using one of the two other searches.
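To show how small the difference is, here is a Python sketch where one flag switches between the queue (BFS) and the stack (DFS). The function name and the adjacency dict are assumptions for illustration; vertices are marked as visited when they are taken out of the container.

```python
from collections import deque

def search(adj, start, target, breadth_first=True):
    """Return the visit order up to and including target, or None if unreachable."""
    pending = deque([start])
    visited = set()
    order = []
    while pending:
        # The only difference: take from the front (queue) or the back (stack).
        v = pending.popleft() if breadth_first else pending.pop()
        if v in visited:
            continue
        visited.add(v)
        order.append(v)
        if v == target:
            return order
        for w in adj.get(v, []):
            if w not in visited:
                pending.append(w)
    return None

adj = {1: [2, 3], 2: [4], 3: [4], 4: []}
print(search(adj, 1, 4, breadth_first=True))   # [1, 2, 3, 4]
print(search(adj, 1, 4, breadth_first=False))  # [1, 3, 4]
```

On this tiny graph the stack version happens to reach vertex 4 sooner; on other inputs the queue version wins, which matches the point that both depend on the input.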
I'd comment on Konrad's post but I can't comment yet so... I'd like to second that it doesn't make a difference in performance if you implement DFS or BFS over a simple linear search through your list. Your search for a particular node in the graph doesn't depend on the structure of the graph, hence it's not necessary to confine yourself to graph algorithms. In terms of coding time, the linear search is the best choice; if you want to brush up your skills in graph algorithms, implement DFS or BFS, whichever you feel like.
If you are searching for a specific vertex and terminating when you find it, I would recommend using A*, which is a best-first search.
The idea is that you calculate the distance from the source vertex to the current vertex you are processing, and then "guess" the distance from the current vertex to the target.
You start at the source, calculate the distance (0) plus the guess (whatever that might be) and add it to a priority queue where the priority is distance + guess. At each step, you remove the element with the smallest distance + guess, do the calculation for each vertex in its adjacency list and stick those in the priority queue. Stop when you find the target vertex.
If your heuristic (your "guess") is admissible, that is, if it's always an under-estimate, then you are guaranteed to find the shortest path to your target vertex the first time you visit it. If your heuristic is not admissible, then you will have to run the algorithm to completion to find the shortest path (although it sounds like you don't care about the shortest path, just any path).
It's not really any more difficult to implement than a breadth-first search (you just have to add the heuristic, really) but it will probably yield faster results. The only hard part is figuring out your heuristic. For vertices that represent geographical locations, a common heuristic is to use an "as-the-crow-flies" (direct distance) heuristic.
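A minimal Python sketch of A* with the as-the-crow-flies heuristic. It assumes every vertex has (x, y) coordinates and that each edge costs exactly the straight-line distance between its endpoints, so the heuristic never over-estimates; the function name and the data are made up for illustration.

```python
import heapq
import math

def a_star(coords, adj, source, target):
    """Length of a shortest path from source to target, or None if unreachable.

    coords: {v: (x, y)}; adj: {v: [neighbour, ...]};
    each edge costs the straight-line distance between its endpoints.
    """
    def crow(u, v):
        return math.dist(coords[u], coords[v])

    dist = {source: 0.0}
    heap = [(crow(source, target), source)]      # priority = distance + guess
    while heap:
        _, u = heapq.heappop(heap)
        if u == target:
            return dist[u]
        for v in adj.get(u, []):
            nd = dist[u] + crow(u, v)
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd + crow(v, target), v))
    return None

coords = {"A": (0, 0), "B": (3, 0), "C": (3, 4), "D": (0, 4)}
adj = {"A": ["B", "D"], "B": ["C"], "D": ["C"], "C": []}
print(a_star(coords, adj, "A", "C"))  # 7.0
```

Without coordinates or some other source for the guess, there is nothing to build the heuristic from, and with a guess of zero everywhere this reduces to Dijkstra's algorithm.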
Linear search is faster than BFS and DFS. But faster than linear search would be A* with the step cost set to zero. When the step cost is zero, A* will only expand the nodes that are closest to a goal node. If the step cost is zero then every node's path cost is zero and A* won't prioritize nodes with a shorter path. That's what you want since you don't need the shortest path.
A* is faster than linear search because linear search will most likely complete after O(n/2) iterations (each node has an equal chance of being a goal node) but A* prioritizes nodes that have a higher chance of being a goal node.
