Best way to deal with multiple intermediate nodes in a path? - graph-databases

Basically, I have a scenario like the following:
vertex --- vertex* --- vertex
However, vertex* could stand for a variable number of vertices at this point in the path, resulting in
vertex --- vertex1 --- vertex
vertex --- vertex2 --- vertex
vertex --- vertexN --- vertex
I won't know what N is until I traverse to this vertex. When I traverse to this node for the first time, an arbitrary function determines how many instances of this vertex exist at this point in the path.
Do I just record N as a property, or do I create N additional paths, each with a middle vertex carrying an incremented index?
A real-world example would be a file directory with an unknown number of folders (unknown until you open the parent directory), with each folder containing one file, where you need to traverse each file path.
Update:
This is what I expect:
(first traversal, runs into a vertex with special property *)
A --- X* --- B
This generates additional instances of the same X vertex, each connected to the parent A and the child B.
A --- X1 --- B
 \--- X2 ---/
  \-- X3 --/
or
A --- X1 --- B
A --- X2 --- B
A --- X3 --- B
so that now the traversal will happen like
A, X1, B
A, X2, B
A, X3, B
The X vertex instances are exactly the same as each other, other than that each has an integer index. The number of instances is determined by the initial traversal (A, X*, B). X* may generate 3 or 50 or 100 additional instances.
For storage, what I meant was to store this index value at X* and increment it every time until the maximum integer N is reached. So for the above example, it would have a starting index of 1 and a maximum of 3. This would bypass the need to insert additional vertices in the middle and connect each of them to both A and B. However, I'm not sure this is best for my case, in which I need to traverse every generated path.

So I think I now understand your use case.
You are right, you basically have two options:
Option 1: Replace vertex X* with the generated vertices
First, I would execute a simple query searching for all vertices that have the special attribute (I would not use a traversal in this step but an index on this special attribute, which should be faster).
Second, I would replace all of them in a transaction with the corresponding number of real vertices (remember to delete the X* vertex if you want to execute this query again).
Third, you can use all built-in traversal statements, since the graph itself now reflects the structure you want to query.
Pro:
Simple to implement.
The data is exactly as you expect it to be, with no parsing of attributes necessary: if there are 5 paths from A to B in your application, there are 5 paths from A to B stored in your DB.
Can make heavy use of built-in features (ArangoDB expects all edges to be physically present by default)
Con:
Redundant data (X1 - Xn are copies of each other) so if you store some data here you have to take care to keep it in sync
Higher memory consumption.
More paths in your graph => more traversal steps
Will be less performant than Option 2.
Option 2: Store only one intermediate vertex and make use of the special attribute
Just store the X* vertices.
Implement your own visitor which checks for the special attribute (from your description, I think you want to check at vertex B whether the last vertex on the path (X*) has the special attribute). If so, you add the values for (A X B) n times to the result.
Pro:
Performant
No redundancy
Con:
You have to implement the logic to replace the X* with X1 - Xn in your application
You have to implement your own visitor
There is a slight mismatch between your domain model and the content of your database
I would make the decision based on the size of your dataset.
If you have a fairly small dataset and redundancy/performance is not an issue, I would go for Option 1, which is much simpler and less effort.
If you have a large dataset and need high performance, Option 2 would probably be better.
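To make Option 2 more concrete, here is a minimal sketch in plain Python. It is not ArangoDB's visitor API; the adjacency dict `edges`, the `instances` mapping (standing in for the special attribute that stores N) and both function names are assumptions made only for illustration. Only one X vertex is stored, and the indexed copies X1..XN are generated while the paths are being read out.

```python
from itertools import product


def find_paths(edges, start, goal, path=None):
    """Yield every simple path from start to goal in a stored graph."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in edges.get(start, []):
        if nxt not in path:  # avoid walking in circles
            yield from find_paths(edges, nxt, goal, path)


def expand_path(path, instances):
    """Replace each vertex that has an instance count N by N indexed copies."""
    choices = [
        [f"{v}{i}" for i in range(1, instances[v] + 1)] if v in instances else [v]
        for v in path
    ]
    return [list(combo) for combo in product(*choices)]


# Stored graph: a single X vertex between A and B, with N = 3 recorded on X.
edges = {"A": ["X"], "X": ["B"]}
instances = {"X": 3}

for stored_path in find_paths(edges, "A", "B"):
    for logical_path in expand_path(stored_path, instances):
        print(logical_path)
```

With Option 1 the same three paths would exist physically in the database and no expansion step would be needed.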
Hope that helps ;)

I am a bit confused about what you are actually looking for ;)
So first of all, could you elaborate on your use case?
Are you searching for the list of all vertices between two vertices A and B?
A --- vertex1 --- B
A --- vertex2 --- B
A --- vertexN --- B
Or are you searching for all vertices you can reach from A in a specific depth (e.g.: 2):
A --- vertex1 --- B
A --- vertex2 --- C
A --- vertexN --- D
Second, are you looking for the best way to store it?
Or do you already have it stored and are looking for a way to query it?
If you want to query it, what do you expect as a result? The number of paths?
Or the list of in-between vertices?
I think we can answer all of the questions above ;)

Related

Can XOR metric be used to implement DHT without Kademlia?

So we can establish that the XOR distance metric is a real metric (it's symmetric, satisfies the triangle inequality, etc.).
Before reading about Kademlia and its k-buckets, I was thinking that each node would simply look up its own id and store its k closest neighbors, and vice versa. Nodes would periodically ping their neighbors and evict them from the list if they didn't respond.
Now if I want to find some key X, I simply issue the request to whichever of my neighbors is closest to X, and this continues recursively until it reaches a node that is closer to X than all of its neighbors. This node would be among those that store the value for X, and the nodes would then just reverse the steps (i.e. unwind the stack) to return the value to the requester.
A node would simply look up its own id when joining the network, and then add each of its neighbors.
This seems much more straightforward than Kademlia. Would this work? Is it just much slower because each lookup may take many more hops?
No.
Without Kademlia's routing table you would have no guarantee that any node's neighbor list actually contains contacts that are closer to the target key and could thus help your query converge towards the target.
This can even happen at the 0th hop, i.e. your local routing table may only contain neighbors that are further away from the target node than you are. You will have no better contacts to query. You would actually have to go backwards on the distance metric, but XOR distance does not allow negative distances since it is just a ring of positive integers modulo N, so negative distances wrap around to the most distant nodes, which is equivalent to Kademlia's bucket with 0 shared prefix bits.
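A small experiment illustrates the failure mode described above. The sketch below assumes a toy network of eight nodes with 4-bit ids, k = 2, and hypothetical helper names; it is not Kademlia, only the "k closest neighbors plus greedy forwarding" scheme from the question.

```python
def build_neighbor_lists(node_ids, k):
    """Each node keeps only the k nodes closest to itself by XOR distance."""
    return {
        n: sorted((m for m in node_ids if m != n), key=lambda m: n ^ m)[:k]
        for n in node_ids
    }


def greedy_lookup(start, key, neighbors):
    """Forward to the neighbor closest to key until no neighbor is closer."""
    current = start
    while True:
        best = min(neighbors[current], key=lambda m: m ^ key)
        if best ^ key >= current ^ key:
            return current  # local minimum: nobody known here is closer
        current = best


# Two clusters of ids: 0b00xx and 0b10xx.
node_ids = [0, 1, 2, 3, 8, 9, 10, 11]
neighbors = build_neighbor_lists(node_ids, k=2)

print(neighbors[0])                    # only ids from node 0's own cluster
print(greedy_lookup(0, 9, neighbors))  # stops at node 1, although node 9 exists
print(greedy_lookup(8, 9, neighbors))  # works when started inside the right cluster
```

In this toy network every neighbor list points into the node's own cluster, so a lookup that starts in the wrong cluster stalls at a local minimum. Kademlia's buckets avoid this by keeping contacts for every shared-prefix length, including the most distant half of the id space.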

If one were to code a digital map such as Google Maps using Dijkstra's algorithm to find the shortest path, how will the nodes be represented/coded?

How are the nodes going to be represented? In a map, is every point on it a node? I need to know more about the nodes in Dijkstra's algorithm and how to implement them.
What the nodes and paths stand for is up to you: they represent whatever you declare them to represent. Is a node an intersection (when working at city scale)? Is a node a city (when working at a provincial/state, country or continent scale)?
The paths would need some sort of weighting between them as well. So if you're picking intersections as nodes, what is the weight between nodes A and B vs. the weight between nodes B and C? I assume you'd want to use time, but you don't have that data readily available, do you?
There are more challenges than just the nodes, I think... without a database of weighting data to run a shortest-path algorithm against, you're dead in the water.
Looking at it more abstractly, you need to determine your nodes (points, probably using lat/long coordinates) and then determine the weighting between each pair of nodes that can reach each other. So, using the A B C idea again: can node A reach node C directly? Can it reach it through B?
     4          3
A ------- B --------
 \                  \
  ----------------- C
          8
A -> B = 4
B -> C = 3
A -> C = 8
A -> B -> C = 7
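As a rough illustration of how such nodes and weights could be coded, the sketch below stores the three-node example as a dictionary of dictionaries and runs Dijkstra's algorithm with a priority queue. The representation and the function name are assumptions for this example; a real map would have far more nodes, and weights derived from distance or travel time.

```python
import heapq


def dijkstra(graph, source, target):
    """Return (total_weight, path) of a cheapest route, or None if unreachable."""
    queue = [(0, source, [source])]  # (cost so far, node, path taken)
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in settled:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None


# Node -> {reachable neighbour: edge weight}, taken from the diagram above.
graph = {
    "A": {"B": 4, "C": 8},
    "B": {"C": 3},
    "C": {},
}

print(dijkstra(graph, "A", "C"))  # (7, ['A', 'B', 'C'])
```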

Write a function to cycle through subsets by adding/deleting elements

Problem: Start with a set S of size 2n+1 and a subset A of S of size n. You have functions addElement(A,x) and removeElement(A,x) that can add or remove an element of A. Write a function that cycles through all the subsets of S of size n or n+1 using just these two operations on A.
I figured out that there are (2n+1 choose n) + (2n+1 choose n+1) = 2 * (2n+1 choose n) subsets that I need to find. So here's the structure for my function:
for (int k = 0; k < 2 * binomial(2*n + 1, n); ++k) {
    if (k % 2 == 0) {
        // A has n elements: somehow choose x from S-A
        A = addElement(A, x);
        printSet(A, n + 1);
    } else {
        // A has n+1 elements: somehow choose x from A
        A = removeElement(A, x);
        printSet(A, n);
    }
}
The function binomial(2n+1,n) just gives the binomial coefficient, and the function printSet prints the elements of A so that I can see if I hit all the sets.
I don't know how to choose the element to add or remove, though. I tried lots of different things, but I didn't get anything that worked in general.
For n=1, here's a solution that I found that works:
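for (int k = 0; k < 6; ++k) {
    if (k % 2 == 0) {
        // add the element that follows A[0] cyclically in S
        x = S[A[0] % 3];
        A = addElement(A, x);
        printSet(A, 2);
    } else {
        // remove the older of the two elements
        x = A[0];
        A = removeElement(A, x);
        printSet(A, 1);
    }
}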
and the output for S = [1,2,3] and A=[1] is:
[1,2]
[2]
[2,3]
[3]
[3,1]
[1]
But I can't get this to work even for n=2. Can someone give me some help on this one?
This isn't a solution so much as it's another way to think about the problem.
Make the following graph:
Vertices are all subsets of S of sizes n or n+1.
There is an edge between v and w if the two sets differ by one element.
For example, for n=1, you get the following cycle:
 {1}  --- {1,3} ---  {3}
  |                   |
  |                   |
{1,2} ---  {2}  --- {2,3}
Your problem is to find a Hamiltonian cycle:
A Hamiltonian cycle (or Hamiltonian circuit) is a cycle in an
undirected graph which visits each vertex exactly once and also
returns to the starting vertex. Determining whether such paths and
cycles exist in graphs is the Hamiltonian path problem which is
NP-complete.
In other words, this problem is hard.
There are a handful of theorems giving sufficient conditions for a Hamiltonian cycle to exist in a graph (e.g. if all vertices have degree at least N/2 where N is the number of vertices), but none that I know immediately implies that this graph has a Hamiltonian cycle.
You could try one of the myriad algorithms to determine if a Hamiltonian cycle exists. For example, from the wikipedia article on the Hamiltonian path problem:
A trivial heuristic algorithm for locating hamiltonian paths is to
construct a path abc... and extend it until no longer possible; when
the path abc...xyz cannot be extended any longer because all
neighbours of z already lie in the path, one goes back one step,
removing the edge yz and extending the path with a different neighbour
of y; if no choice produces a hamiltonian path, then one takes a
further step back, removing the edge xy and extending the path with a
different neighbour of x, and so on. This algorithm will certainly
find an hamiltonian path (if any) but it runs in exponential time.
Hope this helps.
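For small n, the backtracking heuristic quoted above is easy to try out. The following is a minimal Python sketch under the assumption that S = {1, ..., 2n+1}; the function name is hypothetical, and the search runs in exponential time, so it is only meant for small cases such as n = 2.

```python
from itertools import combinations


def middle_levels_cycle(n):
    """Backtracking search for a Hamiltonian cycle through all subsets of
    {1, ..., 2n+1} of size n or n+1, where neighbours differ by one element."""
    elements = range(1, 2 * n + 2)
    vertices = [
        frozenset(c) for size in (n, n + 1) for c in combinations(elements, size)
    ]

    def neighbours(v):
        if len(v) == n:
            return [v | {x} for x in elements if x not in v]  # add one element
        return [v - {x} for x in v]                           # remove one element

    start = vertices[0]
    path, visited = [start], {start}

    def extend():
        if len(path) == len(vertices):
            return start in neighbours(path[-1])  # can the cycle be closed?
        for w in neighbours(path[-1]):
            if w not in visited:
                visited.add(w)
                path.append(w)
                if extend():
                    return True
                visited.remove(w)  # dead end: go back one step
                path.pop()
        return False

    return path if extend() else None


cycle = middle_levels_cycle(2)
for subset in cycle:
    print(sorted(subset))
```

Each consecutive pair in the returned list differs by exactly one element, so the list translates directly into an alternating sequence of addElement and removeElement calls.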
Good News: Though the Hamiltonian cycle problem is difficult in general, this graph is very nice: it's bipartite and (n+1)-regular. This means there may be a nice solution for this particular graph.
Bad News: After a bit of searching, it turns out that this problem is known as the Middle Levels Conjecture, and it seems to have originated around 1980. As best I could tell when writing this, the problem was still open in general, though it had been computer verified for n <= 17 (and I found a preprint from 12/2009 claiming to verify n=18); a proof of the general case was later announced by Torsten Mütze in 2014. These two pages have additional information about the problem and references:
http://www.math.uiuc.edu/~west/openp/revolving.html
http://garden.irmacs.sfu.ca/?q=op/middle_levels_problem
This sort of thing is covered in Knuth Vol 4A (which, despite Charles Stross's excellent Laundry novels, is now openly available). I think your request is satisfied by a section of a monotonic binary Gray code described in section 7.2.1.1. There is an online preprint with a PDF version at http://www.kcats.org/csci/464/doc/knuth/fascicles/fasc2a.pdf

How does the winged-edge structure for meshes work?

I'm implementing an algorithm in which I need to manipulate a mesh, adding and deleting edges quickly and iterating quickly over the edges adjacent to a vertex in CCW or CW order.
The winged-edge structure is used in the description of the algorithm I'm working from, but I can't find any concise descriptions of how to perform those operations on this data structure.
I've learned about it in University but that was a while ago.
In response to this question I also searched the web for good documentation and found none, but we can go through a quick example of CCW and CW order and of insertion/deletion here.
Have a look at this table and graphic:
from this page:
http://www.cs.mtu.edu/~shene/COURSES/cs3621/NOTES/model/winged-e.html
The table gives only the entry for one edge, a; in a real table you have such a row for every edge. You can see that you get the:
left predecessor,
left successor,
right predecessor,
right successor
But here comes the critical point: it gives them relative to the direction of the edge, which is X->Y in this case, and for when it is right-traversed (e->a->c).
So for walking through the graph in CW order this is very easy to read: edge a has right successor c, and then you look into the row for edge c.
OK, so this table is easy to read for CW-order traversal; for CCW you have to think "which edge did I come from when I walked this edge backwards?". Effectively, you get the next edge in CCW order by taking the left-traverse predecessor, in this case b, and then continue with the row entry for edge b in the same manner.
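To show what "look into the row for the next edge" could look like in code, here is a small Python sketch of an edge table. The field names, the tiny two-sided triangle used as data, and the orientation convention (successor means the next edge on the face's boundary, walking along the edge direction for its left face and against it for its right face) are all assumptions for this example; the linked page may orient its predecessor/successor columns differently.

```python
from dataclasses import dataclass


@dataclass
class WingedEdge:
    start: str        # start vertex
    end: str          # end vertex
    left_face: str
    right_face: str
    left_pred: str    # names of the four "wing" edges
    left_succ: str
    right_pred: str
    right_succ: str


def edges_around_face(table, first_edge, face):
    """Walk the boundary of a face by repeatedly following the successor
    stored on whichever side of the current edge the face lies."""
    result, name = [], first_edge
    while True:
        result.append(name)
        row = table[name]
        if row.left_face == face:
            name = row.left_succ
        elif row.right_face == face:
            name = row.right_succ
        else:
            raise ValueError(f"edge {name} does not border face {face}")
        if name == first_edge:
            return result


# A single triangle P, Q, R with an inner face F and an outer face O.
table = {
    "a": WingedEdge("P", "Q", "F", "O", "c", "b", "b", "c"),
    "b": WingedEdge("Q", "R", "F", "O", "a", "c", "c", "a"),
    "c": WingedEdge("R", "P", "F", "O", "b", "a", "a", "b"),
}

print(edges_around_face(table, "a", "F"))  # ['a', 'b', 'c']
print(edges_around_face(table, "a", "O"))  # ['a', 'c', 'b']
```

The check on `left_face` versus `right_face` is the code equivalent of the point made above: the columns are relative to the stored direction of each edge, so every step has to ask on which side of the edge the current face lies.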
Now insertion and deletion: it is clear that you can't just remove the edge and expect the graph to still consist only of triangles; during deletion you have to join two vertices, for example X and Y in the graphic. To do this, you first have to fix every reference to edge a.
So where can a be referred to? Only in the edges b, c, d and e (all other edges are too far away to know a), plus in the vertex->edge table if you have one (but let's only consider the edge table in this example).
As an example of how to fix the edges, let's take a look at c. Like a, c has a left and right predecessor and successor (so 4 edges). Which one of those is a? We cannot know without checking, because the table entry for c can have the node Y as either its start node or its end node. So we have to check which one it is. Let's assume we find that c has Y as its start node; we then have to check whether a is c's right predecessor (which it is, and which we find out by looking at c's entry and comparing it to a) or whether it is c's right successor. "Successor??" you might ask. Yes, because remember that the two "left-traverse" columns are relative to walking the edge backward. So now we have found that a is c's right predecessor, and we can fix that reference by inserting a's right predecessor in its place. Continue with the other 3 edges and you are done with the edge table. Fixing an additional vertex->edge table is trivial, of course: just look at the entries for X and Y and delete a there.
Adding edges is basically the reverse of this fix-up of 4 other edges, but with a little twist. Let's call the node that we want to split Z (it will be split into X and Y). You have to take care to split it in the right direction, because you can end up with either d and e together on one node or e and c (as would happen if the new edge were horizontal instead of the vertical a in the graphic)! You first have to find out between which 2 edges of the soon-to-be X and between which 2 edges of Y the new edge is added. You simply choose which edges shall be on one node and which on the other. In this example graphic, choose that you want b, c and the 2 edges to the north in between them on one node; it follows that the other edges are on the other node, which will become X. You then find by vector subtraction that the new edge a has to be between b and c, and not between, say, c and one of the 2 edges to the north. The vector subtraction is the desired position of the new X minus the desired position of Y.

Lowest Common Ancestor of Binary Tree(Not Binary Search Tree)

I tried working out the problem using Tarjan's algorithm and one algorithm from this website: http://discuss.techinterview.org/default.asp?interview.11.532716.6, but neither is clear to me. Maybe my understanding of recursion is not solid enough. Please give a small demonstration to explain the above two approaches. I have some idea of the union-find data structure.
It looks like a very interesting problem, so I want to work it out one way or another. I'm preparing for interviews.
If any other logic or algorithm exists, please share.
The LCA algorithm tries to do a simple thing: Figure out paths from the two nodes in question to the root. Now, these two paths would have a common suffix (assuming that the path ends at the root). The LCA is the first node where the suffix begins.
Consider the following tree:
          r *
           / \
        s *   *
         / \
      u *   * t
       /   / \
      *  v *   *
          / \
         *   *
In order to find the LCA(u, v) we proceed as follows:
Path from u to root: Path(u, r) = usr
Path from v to root: Path(v, r) = vtsr
Now, we check for the common suffix:
Common suffix: 'sr'
Therefore LCA(u, v) = first node of the suffix = s
Note the actual algorithms do not go all the way up to the root. They use Disjoint-Set data structures to stop when they reach s.
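The common-suffix idea can be sketched in a few lines of Python. The sketch assumes that every node can reach its parent through a `parent` dictionary (the root has no entry); the node names follow the tree above, with `x`, `c1`, `c2`, `c3` and `c4` as made-up names for the unlabelled nodes.

```python
def path_to_root(node, parent):
    """Return the list of nodes from node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca_by_common_suffix(u, v, parent):
    """Compare both root paths from the root end; the last match is the LCA."""
    lca = None
    for a, b in zip(reversed(path_to_root(u, parent)),
                    reversed(path_to_root(v, parent))):
        if a != b:
            break
        lca = a
    return lca


# child -> parent for the example tree
parent = {
    "s": "r", "x": "r",
    "u": "s", "t": "s",
    "c1": "u", "v": "t", "c2": "t",
    "c3": "v", "c4": "v",
}

print(path_to_root("u", parent))               # ['u', 's', 'r']
print(path_to_root("v", parent))               # ['v', 't', 's', 'r']
print(lca_by_common_suffix("u", "v", parent))  # s
```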
An excellent set of alternative approaches are explained here.
Since you mentioned job interviews, I thought of the variation of this problem where you are limited to O(1) memory usage.
In this case, consider the following algorithm:
1) Scan the tree from node u up to the root, finding the path length L(u)
2) Scan the tree from node v up to the root, finding the path length L(v)
3) Calculate the path length difference D = |L(u)-L(v)|
4) Skip D nodes at the start of the longer path, i.e. walk up D steps from the deeper of the two nodes, so that both walks start at the same depth
5) Walk up the tree in parallel from the two nodes, until you hit the same node
6) Return this node as the LCA
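The six steps above can be sketched as follows, again assuming parent pointers stored in a `parent` dictionary (a hypothetical representation; the unlabelled child `x` of the root is also made up). Apart from the tree itself, only a few counters and two node references are kept.

```python
def depth(node, parent):
    """Number of steps from node up to the root."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d


def lca_constant_memory(u, v, parent):
    du, dv = depth(u, parent), depth(v, parent)  # steps 1 and 2
    while du > dv:                               # steps 3 and 4: level the deeper node
        u, du = parent[u], du - 1
    while dv > du:
        v, dv = parent[v], dv - 1
    while u != v:                                # step 5: walk up in parallel
        u, v = parent[u], parent[v]
    return u                                     # step 6


parent = {"s": "r", "x": "r", "u": "s", "t": "s", "v": "t"}
print(lca_constant_memory("u", "v", parent))  # s
```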
Assuming you only need to solve the problem once (per data set) then a simple approach is to collect the set of ancestors from one node (along with itself), and then walk the list of ancestors from the other until you find a member of the above set, which is necessarily the lowest common ancestor. Pseudocode for that is given:
Let A and B begin as the nodes in question.
seen := set containing the root node
while A is not root:
    add A to seen
    A := A's parent
while B is not in seen:
    B := B's parent
B is now the lowest common ancestor.
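A runnable Python version of this pseudocode might look like the sketch below. It assumes a `parent` dictionary in which the root has no entry; that representation, the function name and the example node names are illustrative choices, not part of the pseudocode.

```python
def lca_with_seen_set(a, b, parent):
    """Collect a and all of its ancestors, then climb from b until we hit one."""
    seen = set()
    while a is not None:
        seen.add(a)
        a = parent.get(a)  # None once we step above the root
    while b not in seen:
        b = parent[b]
    return b


parent = {"s": "r", "x": "r", "u": "s", "t": "s", "v": "t"}
print(lca_with_seen_set("u", "v", parent))  # s
```

The set costs memory proportional to the depth of the first node, which is the trade-off against the O(1)-memory variant.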
Another method is to compute the entire path-to-root for each node, then scan from the right looking for a common suffix. Its first element is the LCA. Which one of these is faster depends on your data.
If you will be needing to find LCAs of many pairs of nodes, then you can make various space/time trade-offs:
You could, for instance, pre-compute the depth of each node, which would allow you to avoid re-creating the sets (or paths) each time: first walk from the deeper node up to the depth of the shallower node, and then walk the two nodes toward the root in lock step. Where these paths meet, you have the LCA.
Another approach annotates nodes with their next ancestor at depth-mod-H, so that you first solve a similar-but-H-times-smaller problem and then an H-sized instance of the first problem. This is good on very deep trees, and H is generally chosen as the square root of the average depth of the tree.

Resources