I have a weighted graph with (in practice) up to 50,000 vertices. Given a vertex, I want to randomly choose an adjacent vertex based on the relative weights of all adjacent edges.
How should I store this graph in memory so that making the selection is efficient? What is the best algorithm? It could be as simple as a key-value store for each vertex, but that might not lend itself to the most efficient algorithm. I'll also need to be able to update the network.
Note that I'd like to take only one "step" at a time.
More formally: given a weighted, directed, and potentially complete graph, let W(a,b) be the weight of edge a->b and let Wa be the sum of the weights of all edges out of a. Given an input vertex v, I want to choose a vertex x randomly with probability W(v,x) / Wv.
Example:
Say W(v,a) = 2, W(v,b) = 1, W(v,c) = 1.
Given input v, the function should return a with probability 0.5 and b or c with probability 0.25.
If you are concerned about the performance of generating the random walk, you can use the alias method to build a data structure that fits your requirement of choosing a random outgoing edge quite well. The only overhead is that you have to assign each directed edge a probability weight and a so-called alias edge.
So for each node you store a vector of its outgoing edges together with the weight and the alias edge. Then you can choose a random edge in constant time (only the construction of the data structure is linear in the total number of edges, or in the number of edges per node). In the example an edge is denoted by ->[NODE] and node v corresponds to the example given above:
Node v
->a (p=1, alias= ...)
->b (p=3/4, alias= ->a)
->c (p=3/4, alias= ->a)
Node a
->c (p=1/2, alias= ->b)
->b (p=1, alias= ...)
...
If you want to choose an outgoing edge (i.e. the next node) you just have to generate a single random number r, uniform on the interval [0,1).
You then compute no = floor(N[v] * r) and pv = frac(N[v] * r), where N[v] is the number of outgoing edges. That is, you pick each edge slot with exactly the same probability (namely 1/3 in the example of node v).
Then you compare the assigned probability p of this edge with the generated value pv. If pv is less than p you keep the edge selected in the first step; otherwise you take its alias edge.
If for example we have r=0.6 from our random number generator we have
no = floor(0.6*3) = 1
pv = frac(0.6*3) = 0.8
Therefore we look at the second outgoing edge (note the index starts at zero), which is
->b (p=3/4, alias= ->a)
and switch to its alias edge ->a, since pv = 0.8 >= p = 3/4.
For the example of node v we therefore:
choose edge b with probability 1/3*3/4 (i.e. whenever no=1 and pv<3/4)
choose edge c with probability 1/3*3/4 (i.e. whenever no=2 and pv<3/4)
choose edge a with probability 1/3 + 1/3*1/4 + 1/3*1/4 (i.e. whenever no=0 or pv>=3/4)
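If it helps, here is a minimal Python sketch of the whole scheme; the answer doesn't prescribe a particular construction, so the build step below uses the usual small/large worklist approach, and names like build_alias_table are mine:

    import random

    def build_alias_table(weights):
        """Build (prob, alias) entries for a list of edge weights.  O(n) time."""
        n = len(weights)
        total = sum(weights)
        scaled = [w * n / total for w in weights]          # average bucket mass becomes 1.0
        small = [i for i, s in enumerate(scaled) if s < 1.0]
        large = [i for i, s in enumerate(scaled) if s >= 1.0]
        prob = [1.0] * n
        alias = list(range(n))
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s] = scaled[s]                            # keep bucket s with probability scaled[s]
            alias[s] = l                                   # otherwise fall through to edge l
            scaled[l] -= 1.0 - scaled[s]                   # l donates the missing mass to bucket s
            (small if scaled[l] < 1.0 else large).append(l)
        return prob, alias

    def sample_edge(prob, alias, rng=random):
        """Choose an outgoing edge index in O(1): one uniform draw gives bucket + fraction."""
        r = rng.random() * len(prob)
        no, pv = int(r), r - int(r)                        # no = floor(N*r), pv = frac(N*r)
        return no if pv < prob[no] else alias[no]

    # Node v from the example: edges to a, b, c with weights 2, 1, 1.
    targets, weights = ["a", "b", "c"], [2, 1, 1]
    prob, alias = build_alias_table(weights)               # prob == [1.0, 0.75, 0.75], alias == [0, 0, 0]
    counts = {t: 0 for t in targets}
    for _ in range(100_000):
        counts[targets[sample_edge(prob, alias)]] += 1
    print(counts)                                          # roughly 50% a, 25% b, 25% c

Note that the resulting table for node v matches the one listed above: edge a keeps p = 1, while b and c get p = 3/4 with alias ->a.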
In theory the most efficient thing to do is to store, for each node, the moral equivalent of a balanced binary tree (a red-black tree, B-tree, or skip list all fit) of the connected nodes and their weights, together with the total weight on each side. Then you can pick a random number between 0 and 1, multiply by the total weight of the connected nodes, and do a binary search to find the corresponding edge.
However, traversing a binary tree like that involves a lot of branching, which tends to create pipeline stalls, and those are very expensive. So in practice, if you're programming in an efficient language (e.g. C++) and you have fewer than a couple of hundred edges per node, a linear list of edges (with a pre-computed total) that you walk in a loop may prove to be faster.
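For the simpler per-node layouts, here is a rough Python sketch of both options, assuming each node stores its out-edges as (target, weight) pairs; names are mine, and the balanced-tree variant you'd want for frequent weight updates is omitted:

    import bisect
    import random
    from itertools import accumulate

    def pick_linear(edges, rng=random):
        """Walk the (target, weight) list with a running sum.  O(degree) per pick."""
        total = sum(w for _, w in edges)
        r = rng.random() * total
        running = 0.0
        for target, w in edges:
            running += w
            if r < running:
                return target
        return edges[-1][0]                     # guard against floating-point round-off

    def pick_binary_search(edges, prefix, rng=random):
        """Binary search over prefix sums (cache `prefix` per node).  O(log degree) per pick."""
        r = rng.random() * prefix[-1]
        return edges[bisect.bisect_right(prefix, r)][0]

    edges_v = [("a", 2), ("b", 1), ("c", 1)]
    prefix_v = list(accumulate(w for _, w in edges_v))      # [2, 3, 4]
    print(pick_linear(edges_v), pick_binary_search(edges_v, prefix_v))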
The problem: given an undirected graph implemented using an adjacency list, I'm looking for an algorithm that transforms it into a regular graph (every vertex has the same degree) through one vertex deletion.
Iterate over all vertices and partition them by degree.
If they all have the same degree, a solution is only possible if there is a vertex of degree n - 1 (removing it lowers every other vertex's degree by exactly one); the trivial case where every vertex has degree 0 also works.
If you can partition them into 2 different degree sets: let X be the set with the lower degree and Y the one with the higher degree, and let dg(X) and dg(Y) be the degrees of the vertices in those sets.
If one of the sets has only 1 vertex and its degree is either 0 or the number of vertices in the other set, remove it.
If dg(Y) - dg(X) > 1, it's not possible.
If dg(Y) - dg(X) = 1 and |Y| = dg(X), check whether some vertex of X is connected to every vertex of Y, and remove it.
If dg(Y) - dg(X) = 1 and |X| = dg(Y), check whether some vertex of Y is connected to every vertex of X, and remove it.
Any other case is not possible with 2 sets.
If you can partition them into 3 sets:
One of them must contain only 1 vertex, and that vertex has to be connected to every vertex of the higher-degree remaining set and to none of the other set. The degree difference between the higher-degree remaining set and the other set must also be 1.
Any other case is not possible.
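For reference, a brute-force baseline in Python (not the case analysis above, just trying every vertex) that you can use to sanity-check it; the adjacency-list representation and names are my own assumptions:

    def removal_makes_regular(adj):
        """adj maps each vertex to the set of its neighbours (undirected graph).
        Return some vertex whose deletion leaves every remaining vertex with the
        same degree, or None.  Brute force, O(V * (V + E))."""
        for v in adj:
            degrees = {len(adj[u]) - (1 if v in adj[u] else 0)
                       for u in adj if u != v}
            if len(degrees) <= 1:               # empty graph, or all remaining degrees equal
                return v
        return None

    # Path a - b - c: removing an endpoint leaves a single edge, which is 1-regular.
    adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(removal_makes_regular(adj))           # "a" (or "c")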
I am working on an assignment for an Algorithms and Data Structures class. I am having trouble understanding the instructions given. I will do my best to explain the problem.
The input I am given is a positive integer n followed by n positive integers which represent the frequencies (or weights) of the symbols in an ordered character set. The first goal is to construct a tree that gives an approximate order-preserving Huffman code for each character of the ordered character set. We are to accomplish this by "greedily merging the two adjacent trees whose weights have the smallest sum."
In the assignment we are shown that a conventional Huffman code tree is constructed by first inserting the weights into a priority queue. Then, by using a delmin() function to "pop" the root off the priority queue, I can obtain the two nodes with the lowest frequencies and merge them into one node whose left and right children are these two lowest-frequency nodes and whose priority is the sum of the priorities of its children. This merged node is then inserted back into the min-heap. The process is repeated until all input nodes have been merged. I have implemented this using an array of size 2n - 1, with the input nodes at indices 0...n-1 and the merged nodes at indices n...2n-2.
I do not understand how I can greedily merge the two adjacent trees whose weights have the smallest sum. My input has basically been organized into a min-heap and from there I must find the two adjacent nodes that have the smallest sum and merge them. By adjacent I assume my professor means that they are next to each other in the input.
Example Input:
9
1
2
3
3
2
1
1
2
3
Then my min-heap would look like so:
           1
         /   \
       2       1
      / \     / \
     2   2   3   1
    / \
   3   3
The two adjacent trees (or nodes) with the smallest sum, then, are the two consecutive 1's that appear near the end of the input. What logic can I apply to start with these nodes? I seem to be missing something but I can't quite grasp it. Please, let me know if you need any more information. I can elaborate myself or provide the entire assignment page if something is unclear.
I think this can be done with a small modification to the conventional algorithm. Instead of storing single trees in your priority queue, store pairs of adjacent trees. Then, at each step, you remove the minimum pair (t1, t2) as well as the up to two other pairs that contain those trees, i.e. (u, t1) and (t2, r). Then merge t1 and t2 into a new tree t', re-insert the pairs (u, t') and (t', r) into the heap with updated weights, and repeat.
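An easy way to see the greedy rule in action is the quadratic sketch below, which keeps the trees in a list and rescans it for the cheapest adjacent pair before every merge; the heap of adjacent pairs described above is the faster refinement of the same idea (the Node class and function names are mine):

    class Node:
        def __init__(self, weight, left=None, right=None, symbol=None):
            self.weight, self.left, self.right, self.symbol = weight, left, right, symbol

    def order_preserving_huffman(weights):
        """Greedily merge the two *adjacent* trees whose weights have the smallest sum.
        Simple O(n^2) rescanning version; a heap of adjacent pairs gives O(n log n)."""
        trees = [Node(w, symbol=i) for i, w in enumerate(weights)]
        while len(trees) > 1:
            i = min(range(len(trees) - 1),
                    key=lambda k: trees[k].weight + trees[k + 1].weight)
            merged = Node(trees[i].weight + trees[i + 1].weight, trees[i], trees[i + 1])
            trees[i:i + 2] = [merged]           # the merged tree keeps the pair's position
        return trees[0]

    def codes(node, prefix=""):
        """Read the codeword for each symbol off the tree (left = 0, right = 1)."""
        if node.symbol is not None:
            return {node.symbol: prefix or "0"}
        return {**codes(node.left, prefix + "0"), **codes(node.right, prefix + "1")}

    root = order_preserving_huffman([1, 2, 3, 3, 2, 1, 1, 2, 3])
    print(codes(root))                          # codewords come out in symbol order

Because only adjacent trees are ever merged and the merged tree stays in the same position, the leaves remain in input order, which is what makes the code order-preserving.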
You need to pop two trees and make a third tree. Join the tree with the smaller sum to its left node and the second tree to its right node. Put this new tree back into the heap. From your example:
Pop 2 trees from the heap:
1 1
Make tree
?
/ \
? ?
Put the smaller tree in the left node
min(1, 1) = 1
?
/ \
1 ?
Put the second tree in the right node
?
/ \
1 1
The tree you made has sum = sum of left node + sum of right node
2
/ \
1 1
Put the new tree (sum 2) back into the heap.
Finally you will have one tree; it's the Huffman tree.
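For comparison, the conventional construction walked through above maps almost directly onto a binary heap; a rough Python sketch (the tuple encoding of trees is just my choice):

    import heapq

    def conventional_huffman(weights):
        """Pop the two lightest trees, merge them (lighter one on the left), push back.
        Trees are nested tuples; the counter breaks ties so trees never get compared."""
        heap = [(w, i, ("leaf", i)) for i, w in enumerate(weights)]
        heapq.heapify(heap)
        counter = len(weights)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)     # smaller sum -> left child
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, counter, ("node", t1, t2)))
            counter += 1
        return heap[0][2]

    print(conventional_huffman([1, 2, 3, 3, 2, 1, 1, 2, 3]))

Note that this ignores the adjacency requirement from the assignment; it is the standard Huffman construction the question starts from.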
I am having issues understanding the algorithm. Here is the most popular version seen online:
for all members of population
    sum += fitness of this individual
end for

for all members of population
    probability = sum of probabilities + (fitness / sum)
    sum of probabilities += probability
end for

loop until new population is full
    do this twice
        number = Random between 0 and 1
        for all members of population
            if number > probability but less than next probability
                then you have been selected
        end for
    end
    create offspring
end loop
for all members of population
probability = sum of probabilities + (fitness / sum)
sum of probabilities += probability
end for
^^^this piece in particular confuses me. What are the "sum of probabilities" and even "probability" in the context of an individual in a population? Are these like values individuals have on inception?
That is a very obfuscated piece of code.
In that second block of code, probability is a variable attached to each member of the population, and sum of probabilities is a global variable for the whole population.
Now, what the roulette wheel metaphor is saying is that the entire population can be represented as a roulette wheel, with each member of the population owning a slice of that wheel proportional to its relative fitness. That code is doing the dirty work behind the metaphor: instead of wedges on a wheel, the members are now represented by proportional intervals on the line segment [0,1], which is a customary way to represent probabilities.
To do that, you technically need two numbers, a start and an end, for each member. But the first member's start is going to be 0; the second member's start is going to be the end of the first member; and so on, until the last member, whose end is 1.
That's what that code is doing. The sum of probabilities starts out at 0, and the first time through the loop the probability is what you intuitively thought it would be: it marks the end point of the first member. Then the sum of probabilities is updated. The second time through the loop, the probability is again what you intuitively expected... shifted over by the sum of probabilities. And so it goes.
So the first loop is summing fitness values as a prelude to normalizing things. The second loop, that you ask about, is normalizing and arranging those normalized probabilities in the unit interval. And the third (most complex) loop is picking two random numbers, matching them up with two members of the population, and mating them. (Note that the assumption is that those members are in some array-like format so that you can sequentially check their endpoints against the random number you've rolled.)
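The same normalise-then-search idea written out in Python, as a hedged sketch rather than a faithful transcription of the pseudocode (the function name and the example fitness values are mine):

    import bisect
    import random
    from itertools import accumulate

    def roulette_select(fitnesses, rng=random):
        """Return the index of one individual, picked with probability fitness[i] / sum.
        The cumulative endpoints play the role of the running 'sum of probabilities'."""
        total = sum(fitnesses)
        endpoints = [c / total for c in accumulate(fitnesses)]   # last endpoint is 1.0
        return bisect.bisect_left(endpoints, rng.random())       # or the linear scan from the pseudocode

    fitnesses = [4.0, 1.0, 3.0, 2.0]
    picks = [roulette_select(fitnesses) for _ in range(100_000)]
    print([round(picks.count(i) / len(picks), 2) for i in range(4)])   # ~ [0.4, 0.1, 0.3, 0.2]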
The key is in
probability = sum of probabilities + (fitness / sum)
and
if number > probability but less than next probability
then you have been selected
Probability is a measurement of the individual's chance to create offspring; the size of its slice on the roulette wheel. The sum of probabilities is the total size of the roulette wheel.
Each individual's probability is a function of its fitness.
I found this link helpful while trying to understand the algorithm.
We often use stacks or queues in our algorithms, but are there any cases where we use a doubly linked list to implement both a stack and a queue in the algorithm? For example, at one stage, we push() 6 items onto the stack, pop() 2 items, and then dequeue() the rest of the items (4) from the tail of the doubly linked list. What I am looking for are obscure, interesting algorithms that implement something in this method, or even stranger. Pseudocode, links and explanations would be nice.
The Melkman algorithm (for computing the convex hull of a simple polygonal chain in linear time) uses a double-ended queue (a.k.a. deque) to store an incremental hull of the vertices already processed.
Input: a simple polyline W with n vertices V[i]

Put first 3 vertices onto deque D so that:
    a) 3rd vertex V[2] is at bottom and top of D
    b) on D they form a counterclockwise (ccw) triangle

While there are more polyline vertices of W to process
    Get the next vertex V[i]
    {
        Note that:
            a) D is the convex hull of already processed vertices
            b) D[bot] = D[top] = the last vertex added to D

        // Test if V[i] is inside D (as a polygon)
        If V[i] is left of D[bot]D[bot+1] and D[top-1]D[top]
            Skip V[i] and Continue with the next vertex

        // Get the tangent to the bottom
        While V[i] is right of D[bot]D[bot+1]
            Remove D[bot] from the bottom of D
        Insert V[i] at the bottom of D

        // Get the tangent to the top
        While V[i] is right of D[top-1]D[top]
            Pop D[top] from the top of D
        Push V[i] onto the top of D
    }

Output: D = the ccw convex hull of W.
Source: http://softsurfer.com/Archive/algorithm_0203/algorithm_0203.htm
Joe Mitchell: Melkman’s Convex Hull Algorithm (PDF)
This structure is called a deque: a queue where elements can be added to or removed from either the head or the tail.
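In Python, collections.deque gives you exactly that; a toy run of the push-6 / pop-2 / dequeue-4 sequence from the question could look like this:

    from collections import deque

    d = deque()
    for item in range(1, 7):                     # push() 6 items
        d.append(item)

    popped = [d.pop(), d.pop()]                  # pop() 2 items from the same end (stack behaviour)
    dequeued = [d.popleft() for _ in range(4)]   # dequeue() the remaining 4 from the other end (queue behaviour)

    print(popped)      # [6, 5]
    print(dequeued)    # [1, 2, 3, 4]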
I'm not sure whether this qualifies, but you can use a double-ended priority queue to apply quicksort to a file too large to fit into memory. The idea is that in a regular quicksort, you pick an element as a pivot, then partition the elements into three groups: elements less than the pivot, elements equal to the pivot, and elements greater than the pivot.

If you can't fit all of the items into memory at once, you can adapt this as follows. Instead of choosing a single element as the pivot, pick some huge number of elements (as many as you can fit into RAM, say) and insert them into a double-ended priority queue. Then scan across the rest of the elements one at a time. If the element is less than the smallest element of the double-ended priority queue, put it into the group of elements smaller than all the pivots. If it's bigger than the largest element of the priority queue, put it into the group of elements greater than the pivots. Otherwise, insert the element into the double-ended priority queue and then kick out either the smallest or largest element from the queue and put it into the appropriate group.

Once you've finished, you'll have split the elements into three pieces: a group of small elements that can then be recursively sorted, a group of middle elements that are now fully sorted (if you dequeue them all from the double-ended priority queue, they come out in sorted order), and a group of elements greater than the middle elements that can be recursively sorted as well.
For more info on this algorithm and double-ended priority queues in general, consider looking into this link to a chapter on the subject.
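Here is a toy in-memory sketch of that partitioning step, with a plain sorted list (via bisect.insort) standing in for the double-ended priority queue; it only illustrates the three-way split, not the file handling, and it always evicts the maximum rather than choosing an end:

    import bisect

    def three_way_partition(items, pivot_window):
        """Split items into (smaller, middle, larger) around a window of pivots.
        A sorted list stands in for the double-ended priority queue:
        index 0 is its minimum, index -1 its maximum."""
        pivots = sorted(items[:pivot_window])      # "as many pivots as fit into RAM"
        smaller, larger = [], []
        for x in items[pivot_window:]:
            if x < pivots[0]:
                smaller.append(x)                  # below every pivot
            elif x > pivots[-1]:
                larger.append(x)                   # above every pivot
            else:
                bisect.insort(pivots, x)           # insert, then kick one element out ...
                larger.append(pivots.pop())        # ... here always the current maximum
        return smaller, pivots, larger             # sort smaller/larger recursively; pivots is already sorted

    lo, mid, hi = three_way_partition([5, 1, 9, 3, 7, 2, 8, 6, 4, 0], pivot_window=3)
    print(lo, mid, hi)                             # [0] [1, 2, 3] [9, 7, 5, 8, 6, 4]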
We can modify Breadth First Search (usually used to find shortest paths in graphs with unit-weight edges) to work with 0-1 graphs (i.e. graphs whose edge weights are all 0 or 1). We can do it the following way: when we traverse a 1-edge we add the vertex to the back of the deque, and when we traverse a 0-edge we add the vertex to the front.
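A compact sketch of that 0-1 BFS, assuming the graph is an adjacency list of (neighbour, weight) pairs with weights in {0, 1} (the representation and names are mine):

    from collections import deque

    def zero_one_bfs(adj, source):
        """Shortest distances from source when every edge weight is 0 or 1.
        adj[u] is a list of (v, w) pairs with w in {0, 1}.  Runs in O(V + E)."""
        dist = {u: float("inf") for u in adj}
        dist[source] = 0
        dq = deque([source])
        while dq:
            u = dq.popleft()
            for v, w in adj[u]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    if w == 0:
                        dq.appendleft(v)       # 0-edge: v is just as close as u, put it at the front
                    else:
                        dq.append(v)           # 1-edge: v goes to the back, as in plain BFS
        return dist

    adj = {"s": [("a", 0), ("b", 1)], "a": [("b", 0)], "b": [("c", 1)], "c": []}
    print(zero_one_bfs(adj, "s"))              # {'s': 0, 'a': 0, 'b': 0, 'c': 1}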
On a circle, N arbitrary points are chosen on its circumference. The complete graph formed with those N points would divide the area of the circle into many pieces.
What is the maximum number of pieces of area that the circle will get divided into when the points are chosen along its circumference?
Examples:
2 points => 2 pieces
4 points => 8 pieces
Any ideas how to go about this?
This is known as Moser's circle problem.
The solution is 1 + (n choose 2) + (n choose 4), i.e. (n^4 - 6n^3 + 23n^2 - 18n + 24) / 24.
The proof is quite simple:
Consider each intersection point inside the circle. It must be the intersection of two chords, and each chord is determined by two points on the circumference, so every intersection inside the circle corresponds to a unique set of 4 points on the circumference. Therefore, there are at most n choose 4 inner vertices, and obviously there are n vertices on the circumference.
Now, how many edges does each vertex touch? Well, it's a complete graph, so each vertex on the circumference touches n - 1 chords, plus the two circle arcs on either side of it, for n + 1 edges in total; each vertex on the inside touches 4 edges. So the number of edges is given by (n(n + 1) + 4(n choose 4))/2 (we divide by two because otherwise each edge would be counted twice, once by each of its endpoints).
The final step is to use Euler's formula for the number of faces of a planar graph, i.e. v - e + f = 1 (the Euler characteristic is 1 here because we only count the faces inside the circle, excluding the unbounded outer face).
Solving for f gives f = 1 + e - v = 1 + (n choose 2) + (n choose 4), i.e. the formula above :-)
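A quick way to sanity-check both the closed form and the Euler-formula counting (function names are mine):

    from math import comb

    def moser_regions(n):
        """Closed form: 1 + C(n, 2) + C(n, 4)."""
        return 1 + comb(n, 2) + comb(n, 4)

    def regions_via_euler(n):
        """Recount the regions with the Euler-formula argument from the answer."""
        inner = comb(n, 4)                         # interior intersection points
        v = n + inner                              # circumference vertices + interior ones
        e = (n * (n + 1) + 4 * inner) // 2         # each outer vertex meets n+1 edges, each inner one 4
        return 1 + e - v                           # f from v - e + f = 1

    for n in range(1, 8):
        assert moser_regions(n) == regions_via_euler(n)
    print([moser_regions(n) for n in range(1, 8)])  # [1, 2, 4, 8, 16, 31, 57]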