So, I wanted to have some fun with graphs and now it's driving me crazy.
First, I generate a connected graph with a given number of edges. This is the easy part, which became my curse. Basically, it works as intended, but the results I'm getting are quite bizarre (well, maybe they're not, and I'm the issue here). The algorithm for generating the graph is fairly simple.
I have two arrays, one of them is filled with numbers from 0 to n - 1, and the other is empty.
At the beginning I shuffle the first one and move its last element to the empty one.
Then, in a loop, I create an edge between the last element of the first array and a random element of the second one, and after that I again move the last element of the first array to the other one.
After that part is done, I have to create random edges between the vertices until I get as many as I need. This is, again, very easy: I just pick two random numbers in the range from 0 to n - 1, and if there is no edge between those vertices yet, I create one.
This is the code:
void generate(int n, double d) {
    initMatrix(n); // <- creates an adjacency matrix n x n, filled with 0s
    int *array1 = malloc(n * sizeof(int));
    int *array2 = malloc(n * sizeof(int));
    int j = n - 1, k = 0;
    for (int i = 0; i < n; ++i) {
        array1[i] = i;
        array2[i] = 0;
    }
    shuffle(array1, 0, n); // <- Fisher-Yates shuffle
    array2[k++] = array1[j--];
    int edges = d * n * (n - 1) * .5;
    if (edges % 2) {
        ++edges;
    }
    while (j >= 0) {
        int r = rand() % k;
        createEdge(array1[j], array2[r]);
        array2[k++] = array1[j--];
        --edges;
    }
    free(array1);
    free(array2);
    while (edges > 0) { // > 0: the spanning tree alone may already use up the edge budget
        int a = rand() % n;
        int b = rand() % n;
        if (a == b || checkEdge(a, b)) {
            continue;
        }
        createEdge(a, b);
        --edges;
    }
}
Now, if I print it out, it's a fine graph. Then I want to find a Hamiltonian cycle. This part works. Then I get to my bane - the Eulerian cycle. What's the problem?
Well, first I check whether all vertices have even degree. And they do not. Always. Every single time, unless I choose to generate a complete graph.
I now feel destroyed by my own code. Is something wrong? Or is it supposed to be like this? I knew that Eulerian circuits would be rare, but not that rare. Please help.
Let's analyze the probability of having an Eulerian cycle, and for simplicity let's do it over all graphs with n vertices, regardless of the number of edges.
Given a graph G of size n, choose an arbitrary vertex v. The probability of its degree being even is roughly 1/2 (assuming that for every u1, u2: P((v,u1) exists) = P((v,u2) exists)).
Now, remove v from G, and create a new graph G' with n-1 vertices and without all the edges connected to v.
Similarly, for any arbitrary vertex v' in G': if (v,v') was an edge in G, we need d(v') to be odd; otherwise, we need d(v') to be even (both degrees taken in G'). Either way, the probability is still roughly ~1/2, independently of the previous degree of v.
....
For the ith round, let #(v) be the number of edges discarded so far that are connected to v. Whatever the parity of #(v), it fixes the parity that v's current degree must have, and the probability of hitting that parity is again ~1/2; so at every round we remain at a probability of ~1/2.
We can now see how it works, and write a recurrence formula for the probability of the graph having an Eulerian cycle:
P(n) ~= 1/2 * P(n-1)
P(1) = 1
This gives us P(n) ~= 2^-(n-1), which is vanishingly small for any reasonable n.
Note that 1/2 is just a rough estimate (it becomes exact as n -> infinity); the true probability is in fact a bit higher, but it is still exponentially small in n - which makes Eulerian graphs very unlikely for reasonably sized graphs.
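This is easy to sanity-check empirically. Below is a minimal Monte Carlo sketch; for simplicity it uses the G(n, 1/2) model (every possible edge present independently with probability 1/2) rather than the asker's generator, and in that model P(all degrees even) is exactly 2^-(n-1):

#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if a random G(n, 1/2) graph has all degrees even.
 * rand() & 1 is a crude coin flip -- fine for a sketch. */
static int all_degrees_even(int n) {
    int deg[64] = {0};   /* assumes n <= 64 */
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (rand() & 1) { ++deg[i]; ++deg[j]; }
    for (int i = 0; i < n; ++i)
        if (deg[i] % 2) return 0;
    return 1;
}

int main(void) {
    int n = 10, trials = 1000000, hits = 0;
    for (int t = 0; t < trials; ++t)
        hits += all_degrees_even(n);
    /* expect about 2^-(n-1) = 0.00195 for n = 10 */
    printf("estimated P = %g\n", (double)hits / trials);
    return 0;
}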
Related
I want to shuffle an array such that each index has the same probability of ending up at any other index (excluding itself).
I have this solution, but I find that the last 2 indexes are always swapped with each other:
void Shuffle(int arr[], size_t n)
{
    int newIndx = 0;
    int i = 0;
    for (; i < n - 2; ++i)
    {
        newIndx = rand() % (n - 1);
        if (newIndx >= i)
        {
            ++newIndx;
        }
        swap(i, newIndx, arr);
    }
}
but in the end it might be that some indexes end up back in their original place again.
Any thoughts?
C lang.
A permutation (shuffle) where no element is in its original place is called a derangement.
Generating random derangements is harder than generating random permutations, but it can still be done in linear time and space. (Generating a random permutation can be done in linear time and constant space.) Here are two possible algorithms.
The simplest solution to understand is a rejection strategy: do a Fisher-Yates shuffle, but if the shuffle attempts to put an element at its original spot, restart the shuffle. [Note 1]
Since the probability that a random shuffle is a derangement is approximately 1/e, the expected number of shuffles performed is about e (that is, 2.71828…). But since unsuccessful shuffles are restarted as soon as the first fixed point is encountered, the total number of shuffle steps is less than e times the array size. For a detailed analysis, see this paper, which proves that the expected number of random numbers needed by the algorithm is around (e−1) times the number of elements.
In order to be able to do the check and restart, you need to keep an array of indices. The following little function produces a derangement of the indices from 0 to n-1; it is necessary to then apply the permutation to the original array.
/* n must be at least 2 for this to produce meaningful results */
void derange(size_t n, int ind[]) {
    for (size_t i = 0; i < n; ++i) ind[i] = (int)i;
    swap(ind, 0, randint(1, n));
    for (size_t i = 1; i < n; ++i) {
        size_t r = randint(i, n);
        swap(ind, i, r);
        if (ind[i] == (int)i) i = 0;  /* fixed point found: restart the shuffle */
    }
}
Here are the two functions used by that code:
void swap(int arr[], size_t i, size_t j) {
int t = arr[i]; arr[i] = arr[j]; arr[j] = t;
}
/* This is not the best possible implementation */
int randint(int low, int lim) {
return low + rand() % (lim - low);
}
The following function is based on the 2008 paper "Generating Random Derangements" by Conrado Martínez, Alois Panholzer and Helmut Prodinger, although I use a different mechanism to track cycles. Their algorithm uses a bit vector of size N but uses a rejection strategy in order to find an element which has not been marked. My algorithm uses an explicit vector of indices not yet operated on. The vector is also of size N, which is still O(N) space [Note 2]; since in practical applications, N will not be large, the difference is not IMHO significant. The benefit is that selecting the next element to use can be done with a single call to the random number generator. Again, this is not particularly significant since the expected number of rejections in the MP&P algorithm is very small. But it seems tidier to me.
The basis of the algorithms (both MP&P and mine) is the recursive procedure to produce a derangement. It is important to note that a derangement is necessarily the composition of some number of cycles where each cycle is of size greater than 1. (A cycle of size 1 is a fixed point.) Thus, a derangement of size N can be constructed from a smaller derangement using one of two mechanisms:
Produce a derangement of the N-1 elements other than element N, and add N to some cycle at any point in that cycle. To do so, randomly select any element j among those N-1 elements and place N immediately after j in j's cycle. This alternative covers all possibilities where N is in a cycle of size > 2.
Produce a derangement of N-2 of the N-1 elements other than N, and add a cycle of size 2 consisting of N and the element not selected from the smaller derangement. This alternative covers all possibilities where N is in a cycle of size 2.
If D(n) is the number of derangements of size n, it is easy to see from the above recursion that:
D(n) = (n−1) * (D(n−1) + D(n−2))
The multiplier is n−1 in both cases: in the first alternative, it refers to the number of possible elements after which N can be placed, and in the second alternative to the number of possible ways to select the n−2 elements of the recursive derangement.
Therefore, if we were to recursively produce a random derangement of size N, we would randomly select one of the N-1 previous elements, and then make a random boolean decision on whether to produce alternative 1 or alternative 2, weighted by the number of possible derangements in each case.
One advantage to this algorithm is that it can derange an arbitrary vector; there is no need to apply the permuted indices to the original vector as with the rejection algorithm.
As MP&P note, the recursive algorithm can just as easily be performed iteratively. This is quite clear in the case of alternative 2, since the new 2-cycle can be generated either before or after the recursion, so it might as well be done first, and then the recursion is just a loop. But it is also true for alternative 1: we can make element N the successor in a cycle of a randomly selected element j even before we know which cycle j will eventually be in. Looked at this way, the difference between the two alternatives reduces to whether or not element j is removed from future consideration.
As shown by the recursion, alternative 2 should be chosen with probability (n−1) * D(n−2) / D(n), which is how MP&P write their algorithm. I used the equivalent formula D(n−2) / (D(n−1) + D(n−2)), mostly because my prototype used Python (for its built-in bignum support).
Without bignums, the number of derangements and hence the probabilities need to be approximated as double, which will create a slight bias and limit the size of the array to be deranged to about 170 elements. (long double would allow slightly more.) If that is too much of a limitation, you could implement the algorithm using some bignum library. For ease of implementation, I used the Posix drand48 function to produce random doubles in the range [0.0, 1.0). That's not a great random number function, but it's probably adequate to the purpose and is available in most standard C libraries.
Since no attempt is made to verify the uniqueness of the elements in the vector to be deranged, a vector with repeated elements may produce a derangement where one or more of these elements appear to be in the original place. (It's actually a different element with the same value.)
The code:
#include <stdbool.h>   /* bool */
#include <stdlib.h>    /* drand48 (POSIX) */

/* Deranges the vector `arr` (of length `n`) in place, to produce
 * a permutation of the original vector where every element has
 * been moved to a new position. Returns `true` unless the derangement
 * failed because `n` was 1.
 */
bool derange(int arr[], size_t n) {
    if (n < 2) return n != 1;

    /* Compute derangement counts ("subfactorials") */
    double subfact[n];
    subfact[0] = 1;
    subfact[1] = 0;
    for (size_t i = 2; i < n; ++i)
        subfact[i] = (i - 1) * (subfact[i - 2] + subfact[i - 1]);

    /* The vector `todo` is the stack of elements which have not yet
     * been (fully) deranged; `u` is the count of elements in the stack
     */
    size_t todo[n];
    for (size_t i = 0; i < n; ++i) todo[i] = i;
    size_t u = n;

    /* While the stack is not empty, derange the element at the
     * top of the stack with some element lower down in the stack
     */
    while (u) {
        size_t i = todo[--u];     /* Pop the stack */
        size_t j = u * drand48(); /* Get a random stack index */
        swap(arr, i, todo[j]);    /* i will follow j in its cycle */
        /* If we're generating a 2-cycle, remove the element at j */
        if (drand48() * (subfact[u - 1] + subfact[u]) < subfact[u - 1])
            todo[j] = todo[--u];
    }
    return true;
}
Notes
Many people get this wrong, particularly in social occasions such as "secret friend" selection (I believe this is sometimes called "the Santa game" in other parts of the world). The incorrect algorithm is to just choose a different swap if the random shuffle produces a fixed point, unless the fixed point is at the very end, in which case the shuffle is restarted. This will produce a random derangement, but the selection is biased, particularly for small vectors. See this answer for an analysis of the bias.
Even if you don't use the RAM model where all integers are considered fixed size, the space used is still linear in the size of the input in bits, since N distinct input values must have at least N log N bits. Neither this algorithm nor MP&P makes any attempt to derange lists with repeated elements, which is a much harder problem.
Your algorithm is only almost correct (which in algorithmics means: expect unexpected results). Because of a few small errors scattered through it, it will not produce the expected results.
First, rand() % N is not guaranteed to produce a uniform distribution, unless N is a divisor of the number of possible values. In any other case, you will get a slight bias. Anyway, my man page for rand describes it as a bad random number generator, so you should try to use random(), or arc4random_uniform() if it is available.
But preventing an index from coming back to its original place is both uncommon and rather hard to achieve. The only way I can imagine is to keep an array of the numbers [0; n[ and apply the same swaps to it as to the real array, so that we can know the original index of each number.
The code could become:
void Shuffle(int arr[], size_t n)
{
    int i, newIndx;
    int *indexes = malloc(n * sizeof(int));
    for (i = 0; i < n; i++) indexes[i] = i;
    for (i = 0; i < n - 1; ++i)  // beware of the inequality!
    {
        int i1;
        // search if index i is in the [i; n[ part of the current array:
        for (i1 = i; i1 < n; ++i1) {
            if (indexes[i1] == i) {  // move it to position i
                if (i1 != i) {       // nothing to do if it is already at i
                    swap(i, i1, arr);
                    swap(i, i1, indexes);
                }
                break;
            }
        }
        i1 = (i1 == n) ? i : i + 1;  // we will start the search at i1
                                     // to guarantee that no element keeps its place
        newIndx = i1 + arc4random_uniform(n - i1);
        /* if arc4random is not available:
        newIndx = i1 + (random() % (n - i1));
        */
        swap(i, newIndx, arr);
        swap(i, newIndx, indexes);
    }
    /* special case: a permutation of [0; n-1[ may have left the last
     * element in place; we then exchange the last element with a random one
     */
    if (indexes[n-1] == n-1) {
        newIndx = arc4random_uniform(n-1);
        swap(n-1, newIndx, arr);
        swap(n-1, newIndx, indexes);
    }
    free(indexes);  // don't forget to free what we have malloc'ed...
}
Beware: the algorithm should be correct, but the code has not been tested and can contain typos...
I'm dealing with a task from my AI class, which is the following:
I need to use the A* algorithm.
I have an n-digit display; each digit is between 0-9.
Every digit is identified by Ci, i = 0...n-1.
Under every digit there is a button that the agent can press to modify its value.
Every time the agent presses button i, the digits change following 2 rules:
Ci = (Ci + 1) % 10
C(i+j)%n = (C(i+j)%n + k) % 10
Rule 1) is known to the agent, while rule 2) isn't. The constants j and k are unknown to the agent.
My goal is to reach a goal state starting from a 00...0 configuration.
The goal state can be generated by the agent applying random actions, i.e. pressing buttons randomly, so I'm sure there is at least one way to solve the problem.
My difficulties are:
How do I represent an n-digit display as a node?
How do I choose the right heuristic?
I'm stuck and frustrated with this exam.
(Sorry for English mistakes, I'm Italian!)
An A* algorithm is characterized mainly by:
The nodes forming the search tree.
The definition of the "neighbor" concept.
The definition of an optimistic heuristic for a node.
Nodes in the search tree
As you are describing the problem, each node of the search tree is a configuration of the n digits of the display. Note that this can easily be represented as an array of integers of size n, where each position of the array represents a digit in the display, and the value at that position is the digit's actual value. For example, a display of 5 digits showing 25847 could be represented by [2, 5, 8, 4, 7]. Easy, right?
"Neighbor" concept
That is, how do you change the display status? As you said, you can only press one of the n buttons below the digits. So, every status will have exactly n possible "neighbors", or "descendants", if you think in terms of the search tree. You will need a function that gives you the node resulting from pressing a particular button (which simulates the agent pressing the button). Something like the following (in Java):
import java.util.Arrays;

static int[] pressButton(int[] node, int button, int j, int k) {
int n = node.length;
int[] newNode = Arrays.copyOf(node, n);
newNode[button] = (newNode[button] + 1) % 10;
newNode[(button + j) % n] = (newNode[(button + j) % n] + k) % 10;
return newNode;
}
Now you have this, you can generate every child of the current node with something as simple as:
for (int i = 0; i < node.length; i++) {
int[] newNode = pressButton(node, i, j, k);
// compute heuristic of the new node
// add it to the A* priority queue
}
Optimistic heuristic
Now you just need a heuristic that provides an optimistic estimate of the distance from a given node to the goal. One simple idea would be to assume that, for a given node, you will have to press buttons at least as many times as the number of digits that differ from the goal; for example, if the goal is 41243 and the current node is 37253, you will have to press a button at least three times, because there are three digits that differ from the goal. In Java:
static int heuristic(int[] currentNode, int[] goal) {
int h = 0;
for (int i = 0; i < goal.length; i++) {
if (goal[i] != currentNode[i]) {
h = h + 1;
}
}
return h;
}
Note, however, that this heuristic as it stands is not optimistic. For example, if the goal is 81730, your current node is 71230, j = 2 and k = 5, this heuristic would give a value of 2; however, we could reach the goal state from the current node in just 1 step by pressing the first button, so the heuristic would be pessimistic in this case. This is because each button press affects two digits, which can get us to the solution faster than the per-digit count suggests. To avoid this, we can just subtract one from the heuristic (when it is bigger than one):
static int heuristic(int[] currentNode, int[] goal) {
int h = 0;
for (int i = 0; i < goal.length; i++) {
if (goal[i] != currentNode[i]) {
h = h + 1;
}
}
if (h > 1) {
h = h - 1;
}
return h;
}
More accurate heuristics could be defined based not only on the number of digits that differ from the goal, but also on how much they differ (the bigger the difference, the more times you need to press a button); however, I don't think there is an easy way to define an optimistic heuristic based on this (especially keeping in mind corner cases like j = 0 or j = n, negative k, etc.). If you are allowed to use the values of j and k in your heuristic (which I assumed not, because that's what I understood when you said they are unknown to the agent), maybe there would be room for some sophistication, but even then, I would have a hard time trying to define it.
Finally, given the nature of the problem, it's very easy to reach already-visited states when going from one node to another. If you want to keep your search space an actual (and finite) tree rather than an infinite graph, you will have to keep a set of already-visited nodes, to avoid creating cycles.
I'm working on a demo that requires a lot of vector math, and in profiling, I've found that it spends the most time finding the distances between given vectors.
Right now, it loops through an array of X^2 vectors and finds the distance between each pair, meaning it runs the distance function X^4 times, even though (I think) only about half of those distances are unique.
It works something like this: (pseudo c)
#define MATRIX_WIDTH 8
typedef float vec2_t[2];
vec2_t matrix[MATRIX_WIDTH * MATRIX_WIDTH];
...
for(int i = 0; i < MATRIX_WIDTH; i++)
{
for(int j = 0; j < MATRIX_WIDTH; j++)
{
float xd, yd;
float distance;
for(int k = 0; k < MATRIX_WIDTH; k++)
{
for(int l = 0; l < MATRIX_WIDTH; l++)
{
int index_a = (i * MATRIX_WIDTH) + j;
int index_b = (k * MATRIX_WIDTH) + l;
xd = matrix[index_a][0] - matrix[index_b][0];
yd = matrix[index_a][1] - matrix[index_b][1];
distance = sqrtf(powf(xd, 2) + powf(yd, 2));
}
}
// More code that uses the distances between each vector
}
}
What I'd like to do is create and populate an array of (X^2) / 2 distances without redundancy, then reference that array when I finally need it. However, I'm drawing a blank on how to index this array in a way that would work. A hash table would do it, but I think it's much too complicated and slow for a problem that seems like it could be solved by a clever indexing method.
EDIT: This is for a flocking simulation.
performance ideas:
a) if possible, work with the squared distance to avoid the root calculation
b) never use pow for constant integer powers - instead use xd*xd
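For example, a neighbor test for a flocking sim can combine both ideas. A sketch (vec2_t matches the question's typedef; NEIGHBOR_RADIUS is a made-up constant for illustration):

#include <stdbool.h>

typedef float vec2_t[2];
#define NEIGHBOR_RADIUS 2.5f

/* Compare squared distance against squared radius: no sqrtf, no powf. */
static bool is_neighbor(const vec2_t a, const vec2_t b) {
    float xd = a[0] - b[0];
    float yd = a[1] - b[1];
    float dist_sq = xd * xd + yd * yd;   /* instead of sqrtf(powf(xd,2)+powf(yd,2)) */
    return dist_sq < NEIGHBOR_RADIUS * NEIGHBOR_RADIUS;
}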
I would consider changing your algorithm - O(n^4) is really bad. When dealing with interactions in physics (also O(n^4) for distances in 2d field) one would implement b-trees etc and neglect particle interactions with a low impact. But it will depend on what "more code that uses the distance..." really does.
Just did some considerations: the number of unique distances is 0.5*n*(n+1), with n = w*h.
If you write down when unique distances occur, you will see that both inner loops can be reduced by starting them at i and j.
Additionally, if you only need to access those distances via the matrix indices, you can set up a 4D distance matrix.
If memory is limited, we can save nearly 50%, as mentioned above, with a lookup function that accesses a triangular matrix, as Code-Guru said. We would probably precalculate the line indices to avoid summing them up on every access:
float distanceArray[(H*W + 1) * H*W / 2];
int lineIndices[H*W];   /* lineIndices[j] = j*(j+1)/2 */

float searchDistance(int i, int j)
{
    return i < j ? distanceArray[i + lineIndices[j]]
                 : distanceArray[j + lineIndices[i]];
}
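A sketch of the precalculation, assuming the two declarations above plus the question's matrix of vec2_t (the function name is mine; lineIndices[j] = j*(j+1)/2 marks where row j of the lower triangle starts):

#include <math.h>   /* sqrtf */

void buildDistanceArray(void)
{
    int n = H * W;
    for (int j = 0; j < n; j++)
        lineIndices[j] = j * (j + 1) / 2;
    /* fill the lower triangle, row by row (i <= j) */
    for (int j = 0; j < n; j++) {
        for (int i = 0; i <= j; i++) {
            float xd = matrix[j][0] - matrix[i][0];
            float yd = matrix[j][1] - matrix[i][1];
            distanceArray[i + lineIndices[j]] = sqrtf(xd * xd + yd * yd);
        }
    }
}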
While this may look like homework, I assure you it's not. It stems from some homework assignment I did, though.
Let's call an undirected graph without self-edges "cubic" if every vertex has degree exactly three. Given a positive integer N I'd like to generate a random cubic graph on N vertices. I'd like for it to have uniform probability, that is, if there are M cubic graphs on N vertices the probability of generating each one is 1/M. A weaker condition that is still fine is that every cubic graph has non-zero probability.
I feel there's a quick and smart way to do this, but so far I've been unsuccessful.
I am a bad coder, please bear with this awful code:
PRE: edges = (3*nodes)/2, nodes is even, the constants are selected in such a way that the hash works (BIG_PRIME is bigger than edges, SMALL_PRIME is bigger than nodes, LOAD_FACTOR is small).
void random_cubic_graph() {
    int i, j, k, count;
    int *degree;
    char guard;

    count = 0;
    degree = (int*) calloc(nodes, sizeof(int));

    while (count < edges) {
        /* Try a new edge at random */
        guard = 0;
        i = rand() % nodes;
        j = rand() % nodes;

        /* Checks if it is a self-edge */
        if (i == j)
            guard = 1;

        /* Checks that the degrees are 3 or less */
        if (degree[i] > 2 || degree[j] > 2)
            guard = 1;

        /* Checks that the edge was not already selected, with a hash
         * (A is the global hash table) */
        k = 0;
        while (A[(j + k*BIG_PRIME) % (LOAD_FACTOR*edges)] != 0) {
            if (A[(j + k*BIG_PRIME) % (LOAD_FACTOR*edges)] % SMALL_PRIME == j)
                if ((A[(j + k*BIG_PRIME) % (LOAD_FACTOR*edges)] - j) / SMALL_PRIME == i)
                    guard = 1;
            k++;
        }
        if (guard == 0)
            A[(j + k*BIG_PRIME) % (LOAD_FACTOR*edges)] = hash(i,j);

        k = 0;
        while (A[(i + k*BIG_PRIME) % (LOAD_FACTOR*edges)] != 0) {
            if (A[(i + k*BIG_PRIME) % (LOAD_FACTOR*edges)] % SMALL_PRIME == i)
                if ((A[(i + k*BIG_PRIME) % (LOAD_FACTOR*edges)] - i) / SMALL_PRIME == j)
                    guard = 1;
            k++;
        }
        if (guard == 0)
            A[(i + k*BIG_PRIME) % (LOAD_FACTOR*edges)] = hash(j,i);

        /* If all checks were passed, increment the count, print the edge,
         * increment the degrees. */
        if (guard == 0) {
            count++;
            printf("%d\t%d\n", i, j);
            degree[i]++;
            degree[j]++;
        }
    }
    free(degree);
}
The problem is that the final edge to be selected might have to be a self-edge. That happens when N - 1 vertices already have degree 3 and only 1 has degree 1. Thus the algorithm might not terminate. Moreover, I'm not entirely convinced that the probability is uniform.
There's probably much to improve in my code, but can you suggest a better algorithm to implement?
Assume N is even. (Otherwise there cannot be a cubic graph on N vertices).
You can do the following:
Take 3N points and divide them into N groups of 3 points each.
Now pair up these 3N points randomly (note: 3N is even), i.e. marry two points off at random, forming 3N/2 marriages.
If there is a pairing between group i and group j, create an edge between i and j. This gives a graph on N vertices.
If this random pairing does not create any multiple edges or loops, you have a cubic graph.
If not, try again. This runs in expected linear time and generates a uniform distribution. A sketch of one attempt is shown below.
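A minimal C sketch of the pairing model, under the assumptions that N <= MAXN and that a small adjacency matrix is an acceptable representation (the function names are mine):

#include <stdio.h>
#include <stdlib.h>

#define MAXN 64   /* sketch limit on N */

/* One attempt: shuffle the 3N points, pair consecutive ones, and
 * map point p to group p / 3. Returns 1 if the pairing produced a
 * simple graph (no loops, no multiple edges). */
static int try_pairing(int n, char adj[MAXN][MAXN]) {
    int points[3 * MAXN], m = 3 * n;
    for (int i = 0; i < m; i++) points[i] = i;
    for (int i = m - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
        int r = rand() % (i + 1);
        int t = points[i]; points[i] = points[r]; points[r] = t;
    }
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) adj[i][j] = 0;
    for (int i = 0; i < m; i += 2) {             /* consecutive points form a pair */
        int u = points[i] / 3, v = points[i + 1] / 3;
        if (u == v || adj[u][v]) return 0;       /* loop or multiple edge: reject */
        adj[u][v] = adj[v][u] = 1;
    }
    return 1;
}

void random_cubic_graph(int n) {                 /* n must be even, n <= MAXN */
    static char adj[MAXN][MAXN];
    while (!try_pairing(n, adj))
        ;                                        /* the whole attempt is retried */
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (adj[i][j]) printf("%d\t%d\n", i, j);
}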
Note: all cubic graphs on N vertices are generated by this method (responding to Hamish's comments).
To see this:
Let G be a cubic graph on N vertices.
Let the vertices be, 1, 2, ...N.
Let the three neighbours of j be A(j), B(j) and C(j).
For each j, construct the group of ordered pairs { (j, A(j)), (j, B(j)), (j, C(j)) }.
This gives us 3N ordered pairs. We pair them up: (u,v) is paired with (v,u).
Thus any cubic graph corresponds to a pairing and vice versa...
More information on this algorithm and faster algorithms can be found here: Generating Random Regular Graphs Quickly.
Warning: I make a lot of intuitive-but-maybe-wrong claims in this answer. You should definitely prove them if you intend to use this idea.
Enumerating Cubic Graphs
When dealing with a random choice, a good starting point is to figure out how to enumerate over all of your possible elements. This might reveal some of the structure, and lead you to an algorithm.
Here is my strategy for enumerating cubic graphs: pick the first vertex, and iterate over all possible choices of three adjacent vertices. During those iterations, recurse on the next vertex, with the caveat that you keep track of how many more edges each vertex needs for its degree to reach 3. Continue in that fashion until the lowest level is reached. Now you have your first cubic graph. Undo the most recently added edges and continue to the next possibility until none are left. There are a few implementation details you need to consider, but it is generally straightforward.
Generalize Enumeration into Choice
Once you can enumerate all the elements, it is trivial to make a random choice. For example, you can scan the list once to compute its size, then pick a random number in [0, size), then scan the sequence again to get the element at that offset. This is incredibly inefficient, taking time at least proportional to the number of cubic graphs (which grows far faster than polynomially in n), but it works.
Sacrifice Uniform Probability for Efficiency
The obvious speed-up here is to make random edge choices at each level, instead of iterating over each possibility. Unfortunately, this will favor some graphs because of how your early choices affect the availability of later choices. Taking into account the need to track the remaining free vertices, you should be able to achieve O(n log n) time and O(n) space. Significantly better than the enumerating algorithm.
...
It's probably possible to do better. Probably a lot better. But this should get you started.
Another term for cubic graph is 3-regular graph or trivalent graph.
Your problem needs a little more clarification, because "the number of cubic graphs" could mean the number of cubic graphs on 2n nodes that are non-isomorphic to one another, or the number of cubic graphs on 2n labelled nodes. The former is given by integer sequence A005638, and it is likely a non-trivial problem to uniformly pick a random isomorphism class of cubic graphs efficiently (i.e. without listing them all out and then picking one class). The latter is given by A002829.
There is an article on Wikipedia about random regular graphs that you should take a look at.
Related
The question gives all necessary data: what is an efficient algorithm to generate a sequence of K non-repeating integers within a given interval [0, N-1]? The trivial algorithm (generating random numbers and, before adding them to the sequence, looking them up to see if they were already there) is very expensive if K is large and close enough to N.
The algorithm provided in Efficiently selecting a set of random elements from a linked list seems more complicated than necessary, and requires some implementation. I've just found another algorithm that seems to do the job fine, as long as you know all the relevant parameters, in a single pass.
In The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Third Edition, Knuth describes the following selection sampling algorithm:
Algorithm S (Selection sampling technique). To select n records at random from a set of N, where 0 < n ≤ N.
S1. [Initialize.] Set t ← 0, m ← 0. (During this algorithm, m represents the number of records selected so far, and t is the total number of input records that we have dealt with.)
S2. [Generate U.] Generate a random number U, uniformly distributed between zero and one.
S3. [Test.] If (N – t)U ≥ n – m, go to step S5.
S4. [Select.] Select the next record for the sample, and increase m and t by 1. If m < n, go to step S2; otherwise the sample is complete and the algorithm terminates.
S5. [Skip.] Skip the next record (do not include it in the sample), increase t by 1, and go back to step S2.
An implementation may be easier to follow than the description. Here is a Common Lisp implementation that select n random members from a list:
(defun sample-list (n list &optional (length (length list)) result)
(cond ((= length 0) result)
((< (* length (random 1.0)) n)
(sample-list (1- n) (cdr list) (1- length)
(cons (car list) result)))
(t (sample-list n (cdr list) (1- length) result))))
And here is an implementation that does not use recursion, and which works with all kinds of sequences:
(defun sample (n sequence)
(let ((length (length sequence))
(result (subseq sequence 0 n)))
(loop
with m = 0
for i from 0 and u = (random 1.0)
do (when (< (* (- length i) u)
(- n m))
(setf (elt result m) (elt sequence i))
(incf m))
until (= m n))
result))
The random module from Python library makes it extremely easy and effective:
from random import sample
print sample(xrange(N), K)
sample function returns a list of K unique elements chosen from the given sequence.
xrange is a "list emulator", i.e. it behaves like a list of consecutive numbers without creating it in memory, which makes it super-fast for tasks like this one.
It is actually possible to do this in space proportional to the number of elements selected, rather than the size of the set you're selecting from, regardless of what proportion of the total set you're selecting. You do this by generating a random permutation, then selecting from it like this:
Pick a block cipher, such as TEA or XTEA. Use XOR folding to reduce the block size to the smallest power of two larger than the set you're selecting from. Use the random seed as the key to the cipher. To generate an element n in the permutation, encrypt n with the cipher. If the output number is not in your set, encrypt that. Repeat until the number is inside the set. On average you will have to do less than two encryptions per generated number. This has the added benefit that if your seed is cryptographically secure, so is your entire permutation.
I wrote about this in much more detail here.
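For illustration, here is a minimal sketch of the encrypt-and-cycle-walk idea in C. Note the substitution: instead of TEA/XTEA with XOR folding, it uses a toy 4-round balanced Feistel network over the smallest power-of-four domain >= N, which is a permutation by construction; the round function is a made-up mixer, not a secure cipher, and the sketch assumes N <= 2^30:

#include <stdint.h>
#include <stdio.h>

/* Toy round function: NOT cryptographically secure. */
static uint32_t mix(uint32_t x, uint32_t key, uint32_t r) {
    x = x * 2654435761u + key + r;
    x ^= x >> 13;
    return x * 2246822519u;
}

/* A bijection on [0, 4^half_bits): 4 balanced Feistel rounds. */
static uint32_t feistel(uint32_t v, uint32_t key, int half_bits) {
    uint32_t mask = ((uint32_t)1 << half_bits) - 1;
    uint32_t l = v >> half_bits, r = v & mask;
    for (uint32_t i = 0; i < 4; i++) {
        uint32_t t = l ^ (mix(r, key, i) & mask);
        l = r;
        r = t;
    }
    return (l << half_bits) | r;
}

/* Element `idx` (0 <= idx < n) of a pseudorandom permutation of
 * [0, n); `key` plays the role of the random seed. */
uint32_t perm_element(uint32_t idx, uint32_t n, uint32_t key) {
    int half_bits = 1;
    while (((uint32_t)1 << (2 * half_bits)) < n) half_bits++;
    do {
        idx = feistel(idx, key, half_bits);
    } while (idx >= n);   /* cycle-walk until we land inside the set */
    return idx;
}

int main(void) {
    /* prints a permutation of [0, 10) */
    for (uint32_t i = 0; i < 10; i++)
        printf("%u ", perm_element(i, 10, 0xBEEF));
    printf("\n");
    return 0;
}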
The following code (in C, unknown origin) seems to solve the problem extremely well:
/* generate n sorted, non-duplicate integers in [0, max) */
int *generate(int n, int max) {
    int i, m, a;
    int *g = (int *)calloc(n, sizeof(int));
    if (!g) return 0;
    m = 0;
    for (i = 0; i < max; i++) {
        /* random_in_between(lo, hi) returns a uniform int in [lo, hi) */
        a = random_in_between(0, max - i);
        if (a < n - m) {   /* select i with probability (n - m) / (max - i) */
            g[m] = i;
            m++;
        }
    }
    return g;
}
Does anyone know where I can find more gems like this one?
Generate an array 0...N-1, filled with a[i] = i.
Then run K steps of the shuffle, working from the end of the array.
Shuffling:
Start J = N-1
Pick a random number 0...J (say, R)
swap a[R] with a[J]
since R can be equal to J, the element may be swapped with itself
subtract 1 from J and repeat.
Finally, take K last elements.
This essentially picks a random element from the list, moves it out, then picks a random element from the remaining list, and so on.
Runs in O(K) time after O(N) initialization, and requires O(N) storage.
The shuffling part is called Fisher-Yates shuffle or Knuth's shuffle, described in the 2nd volume of The Art of Computer Programming.
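A minimal C sketch of this partial shuffle (the function name is mine): it fills out[0..K-1] with K distinct integers from [0, N):

#include <stdlib.h>

void sample_k_of_n(int n, int k, int *out) {
    int *a = malloc(n * sizeof *a);
    for (int i = 0; i < n; ++i) a[i] = i;     /* a[i] = i */
    for (int step = 0; step < k; ++step) {
        int j = n - 1 - step;                 /* J runs from N-1 downward */
        int r = rand() % (j + 1);             /* 0..J; r == j is allowed */
        int t = a[r]; a[r] = a[j]; a[j] = t;
        out[step] = a[j];                     /* one of the "K last elements" */
    }
    free(a);
}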
Speed up the trivial algorithm by storing the K numbers in a hashing store. Knowing K before you start lets you size the hash table once, up front, which takes away the inefficiency of growing a hash map as you insert, and you still get the benefit of fast look-up. A sketch follows.
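Something like this (the names and the multiplicative hash are illustrative choices; a fixed-size open-addressing set, sized once from K so it never rehashes):

#include <stdlib.h>

void sample_with_set(int n, int k, int *out) {
    int cap = 1;
    while (cap < 4 * k) cap <<= 1;               /* power of two, load <= 1/4 */
    int *set = malloc(cap * sizeof *set);
    for (int i = 0; i < cap; ++i) set[i] = -1;   /* -1 marks an empty slot */
    for (int m = 0; m < k; ) {
        int x = rand() % n;
        /* multiplicative hash, then linear probing */
        int h = (int)(((unsigned)x * 2654435761u) & (unsigned)(cap - 1));
        while (set[h] != -1 && set[h] != x) h = (h + 1) & (cap - 1);
        if (set[h] == x) continue;               /* already drawn: retry */
        set[h] = x;
        out[m++] = x;
    }
    free(set);
}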
My solution is C++ oriented, but I'm sure it could be translated to other languages since it's pretty simple.
First, generate a vector with K elements, going from 0 to K-1.
Then, as long as that vector isn't empty, generate a random number between 0 and its size,
take that element, push it into another vector, and remove it from the original.
This solution only involves two loop iterations, and no hash table lookups or anything of the sort. So in actual code:
// Assume K is the highest number in the list
std::vector<int> sorted_list;
std::vector<int> random_list;
for(int i = 0; i < K; ++i) {
sorted_list.push_back(i);
}
// Loop to K - 1 elements, as this will cause problems when trying to erase
// the first element
while(sorted_list.size() > 1) {
int rand_index = rand() % sorted_list.size();
random_list.push_back(sorted_list.at(rand_index));
sorted_list.erase(sorted_list.begin() + rand_index);
}
// Finally push back the last remaining element to the random list
// The if() statement here is just a sanity check, in case K == 0
if(!sorted_list.empty()) {
random_list.push_back(sorted_list.at(0));
}
Step 1: Generate your list of integers.
Step 2: Perform Knuth Shuffle.
Note that you don't need to shuffle the entire list, since the Knuth Shuffle algorithm lets you stop after n swaps, where n is the number of elements to return. Generating the list will still take time proportional to its size, but you can reuse the existing list for any future shuffling needs (assuming the size stays the same), with no need to restore the partially shuffled list to its original order before running the shuffle again.
The basic algorithm for Knuth Shuffle is that you start with a list of integers. Then, you swap the first integer with any number in the list and return the current (new) first integer. Then, you swap the second integer with any number in the list (except the first) and return the current (new) second integer. Then...etc...
This is an absurdly simple algorithm, but be careful that you include the current item in the list when performing the swap or you will break the algorithm.
The Reservoir Sampling version is pretty simple:
my $N = 20;
my $k;
my @r;
while(<>) {
    if(++$k <= $N) {
        push @r, $_;
    } elsif(rand(1) <= ($N/$k)) {
        $r[rand(@r)] = $_;
    }
}
print @r;
That's $N randomly selected rows from STDIN. Replace the <>/$_ stuff with something else if you're not using rows from a file, but it's a pretty straightforward algorithm.
If you want the K elements out of N in sorted order (that is, you do not care about their relative order), an efficient algorithm is proposed in the paper An Efficient Algorithm for Sequential Random Sampling (Jeffrey Scott Vitter, ACM Transactions on Mathematical Software, Vol. 13, No. 1, March 1987, Pages 58-67).
Edited to add the code in C++ using Boost. I've just typed it in and there might be many errors. The random numbers come from the Boost library, with a stupid seed, so don't do anything serious with this.
/* Sampling according to [Vitter87].
*
* Bibliography
* [Vitter 87]
* Jeffrey Scott Vitter,
* An Efficient Algorithm for Sequential Random Sampling
* ACM Transactions on Mathematical Software, 13 (1), 58 (1987).
*/
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <string>
#include <iostream>
#include <iomanip>
#include <boost/random/linear_congruential.hpp>
#include <boost/random/variate_generator.hpp>
#include <boost/random/uniform_real.hpp>
using namespace std;
// This is a typedef for a random number generator.
// Try boost::mt19937 or boost::ecuyer1988 instead of boost::minstd_rand
typedef boost::minstd_rand base_generator_type;
// Define a random number generator and initialize it with a reproducible
// seed.
// (The seed is unsigned, otherwise the wrong overload may be selected
// when using mt19937 as the base_generator_type.)
base_generator_type generator(0xBB84u);
//TODO : change the seed above !
// Defines the suitable uniform distribution.
boost::uniform_real<> uni_dist(0,1);
boost::variate_generator<base_generator_type&, boost::uniform_real<> > uni(generator, uni_dist);
void SequentialSamplesMethodA(int K, int N)
// Outputs K sorted random integers out of 0..N, taken according to
// [Vitter87], method A.
{
int top=N-K, S, curr=0, currsample=-1;
double Nreal=N, quot=1., V;
while (K>=2)
{
V=uni();
S=0;
quot=top/Nreal;
while (quot > V)
{
S++; top--; Nreal--;
quot *= top/Nreal;
}
currsample+=1+S;
cout << curr << " : " << currsample << "\n";
Nreal--; K--;curr++;
}
// special case K=1 to avoid overflow
S=floor(round(Nreal)*uni());
currsample+=1+S;
cout << curr << " : " << currsample << "\n";
}
void SequentialSamplesMethodD(int K, int N)
// Outputs K sorted random integers out of 0..N, taken according to
// [Vitter87], method D.
{
const int negalphainv=-13; //between -20 and -7 according to [Vitter87]
//optimized for an implementation in 1987 !!!
int curr=0, currsample=0;
int threshold=-negalphainv*K;
double Kreal=K, Kinv=1./Kreal, Nreal=N;
double Vprime=exp(log(uni())*Kinv);
int qu1=N+1-K; double qu1real=qu1;
double Kmin1inv, X, U, negSreal, y1, y2, top, bottom;
int S, limit;
while ((K>1)&&(threshold<N))
{
Kmin1inv=1./(Kreal-1.);
while(1)
{//Step D2: generate X and U
while(1)
{
X=Nreal*(1-Vprime);
S=floor(X);
if (S<qu1) {break;}
Vprime=exp(log(uni())*Kinv);
}
U=uni();
negSreal=-S;
//step D3: Accept ?
y1=exp(log(U*Nreal/qu1real)*Kmin1inv);
Vprime=y1*(1. - X/Nreal)*(qu1real/(negSreal+qu1real));
if (Vprime <=1.) {break;} //Accept ! Test [Vitter87](2.8) is true
//step D4 Accept ?
y2=0; top=Nreal-1.;
if (K-1 > S)
{bottom=Nreal-Kreal; limit=N-S;}
else {bottom=Nreal+negSreal-1.; limit=qu1;}
for(int t=N-1;t>=limit;t--)
{y2*=top/bottom;top--; bottom--;}
if (Nreal/(Nreal-X)>=y1*exp(log(y2)*Kmin1inv))
{//Accept !
Vprime=exp(log(uni())*Kmin1inv);
break;
}
Vprime=exp(log(uni())*Kmin1inv);
}
// Step D5: Select the (S+1)th record
currsample+=1+S;
cout << curr << " : " << currsample << "\n";
curr++;
N-=S+1; Nreal+=negSreal-1.;
K-=1; Kreal-=1; Kinv=Kmin1inv;
qu1-=S; qu1real+=negSreal;
threshold+=negalphainv;
}
if (K>1) {SequentialSamplesMethodA(K, N);}
else {
S=floor(N*Vprime);
currsample+=1+S;
cout << curr << " : " << currsample << "\n";
}
}
int main(void)
{
int Ntest=10000000, Ktest=Ntest/100;
SequentialSamplesMethodD(Ktest,Ntest);
return 0;
}
$ time ./sampling | tail
gives the following output on my laptop:
99990 : 9998882
99991 : 9998885
99992 : 9999021
99993 : 9999058
99994 : 9999339
99995 : 9999359
99996 : 9999411
99997 : 9999427
99998 : 9999584
99999 : 9999745
real 0m0.075s
user 0m0.060s
sys 0m0.000s
This Ruby code showcases the selection sampling method (Knuth's Algorithm S, quoted above). In each cycle, it selects n = 5 unique random integers from the range [0, N = 10):
t=0
m=0
N=10
n=5
s=0
distrib=Array.new(N,0)
for i in 1..500000 do
  t=0
  m=0
  s=0
  while m<n do
    u=rand()
    if (N-t)*u>=n-m then
      t=t+1
    else
      distrib[s]+=1
      m=m+1
      t=t+1
    end #if
    s=s+1
  end #while
  if (i % 100000)==0 then puts i.to_s + ". cycle..." end
end #for
puts "--------------"
puts distrib
output:
100000. cycle...
200000. cycle...
300000. cycle...
400000. cycle...
500000. cycle...
--------------
250272
249924
249628
249894
250193
250202
249647
249606
250600
250034
All integers between 0 and 9 were chosen with nearly the same probability.
It's essentially Knuth's algorithm applied to arbitrary sequences (indeed, that answer has a Lisp version of this). The algorithm is O(N) in time and can be O(1) in memory if the sequence is streamed into it, as shown in @MichaelCramer's answer.
Here's a way to do it in O(N) without extra storage. I'm pretty sure this is not a purely random distribution, but it's probably close enough for many uses.
/* generate n sorted, non-duplicate integers in [0, max[ in O(n) */
int *generate(int n, int max) {
    float step, v = 0;
    int i;
    int *g = (int *)calloc(n, sizeof(int));
    if (!g) return 0;
    for (i = 0; i < n; i++) {
        step = (max - v) / (float)(n - i);
        v += floating_pt_random_in_between(0.0, step * 2.0);
        if (i > 0 && (int)v == g[i-1]) {
            v = (int)v + 1;  /* avoid collisions */
        }
        g[i] = v;
    }
    i = n - 1;
    while (i >= 0 && g[i] > max) {
        g[i] = max;          /* fix up overflow at the top of the range */
        max = g[i--] - 1;
    }
    return g;
}
This is Perl code. grep is a filter, and as always, I didn't test this code.
@list = grep ($_ % I) == 0, (0..N);
I = interval
N = upper bound
Only get numbers that match your interval via the modulus operator.
@list = grep ($_ % 3) == 0, (0..30);
will return 0, 3, 6, ... 30
This is pseudo Perl code. You may need to tweak it to get it to compile.