Generating a uniform distribution of INTEGERS in C

I've written a C function that I think selects integers from a uniform distribution with range [rangeLow, rangeHigh], inclusive. This isn't homework--I'm just using this in some embedded systems tinkering that I'm doing for fun.
In my test cases, this code appears to produce an appropriate distribution. I'm not feeling fully confident that the implementation is correct, though.
Could someone do a sanity check and let me know if I've done anything wrong here?
//uniform_distribution returns an INTEGER in [rangeLow, rangeHigh], inclusive.
int uniform_distribution(int rangeLow, int rangeHigh)
{
int myRand = (int)rand();
int range = rangeHigh - rangeLow + 1; //+1 makes it [rangeLow, rangeHigh], inclusive.
int myRand_scaled = (myRand % range) + rangeLow;
return myRand_scaled;
}
//note: make sure rand() was already initialized using srand()
P.S. I searched for other questions like this. However, it was hard to filter out the small subset of questions that discuss random integers instead of random floating-point numbers.

Let's assume that rand() generates a uniformly-distributed value I in the range [0..RAND_MAX],
and you want to generate a uniformly-distributed value O in the range [L,H].
Suppose I is in the range [0..32767] and O is in the range [0..2].
According to your suggested method, O=I%3. Note that in the given range, there are 10923 numbers for which I%3=0, 10923 numbers for which I%3=1, but only 10922 numbers for which I%3=2. Hence your method will not map a value from I into O uniformly.
As another example, suppose O is in the range [0..32766].
According to your suggested method, O=I%32767. Now you'll get O=0 for both I=0 and I=32767. Hence 0 is twice as likely as any other value - your method is again nonuniform.
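The counts in both examples can be checked by brute force. The sketch below assumes a generator whose outputs are exactly 0..max_in, each equally likely; count_residue is a made-up helper name, not a library function.

```c
/* Count how many inputs i in [0, max_in] satisfy i % range == residue.
   count_residue is a hypothetical helper used only for illustration. */
long count_residue(long max_in, long range, long residue)
{
    long count = 0;
    for (long i = 0; i <= max_in; i++) {
        if (i % range == residue)
            count++;
    }
    return count;
}
```

Calling count_residue(32767, 3, r) for r = 0, 1, 2 reproduces the uneven split of the first example, and count_residue(32767, 32767, 0) shows the doubled weight of 0 in the second.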
The suggested way to generate a uniform mapping is as follows:
1. Calculate the number of bits that are needed to store a random value in the range [L,H]:
unsigned int nRange = (unsigned int)H - (unsigned int)L + 1;
unsigned int nRangeBits = (unsigned int)ceil(log((double)nRange) / log(2.));
2. Generate nRangeBits random bits (this can be easily implemented by shifting right the result of rand()).
3. Ensure that the generated number is not greater than H-L. If it is, repeat step 2.
4. Now you can map the generated number into O just by adding L.
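A minimal sketch of these steps, under two assumptions that the C standard does not guarantee: RAND_MAX + 1 is a power of two, and H - L is no larger than RAND_MAX. The names bits_needed and uniform_by_bits are invented here, and the bit count is done with integer shifts instead of log() to stay clear of floating-point rounding.

```c
#include <stdlib.h>

/* Number of bits needed to hold any value in [0, n - 1]; n must be >= 1.
   Equivalent to ceil(log2(n)), computed with shifts. */
unsigned int bits_needed(unsigned int n)
{
    unsigned int bits = 0;
    unsigned int v = n - 1;
    while (v != 0) {
        bits++;
        v >>= 1;
    }
    return bits;
}

/* Uniform integer in [L, H].
   ASSUMPTIONS: RAND_MAX + 1 is a power of two and H - L <= RAND_MAX. */
int uniform_by_bits(int L, int H)
{
    unsigned int nRange = (unsigned int)H - (unsigned int)L + 1u;
    unsigned int nRangeBits = bits_needed(nRange);
    unsigned int randBits = bits_needed((unsigned int)RAND_MAX + 1u);
    unsigned int value;
    do {
        /* Keep only the top nRangeBits bits of rand(). */
        value = (unsigned int)rand() >> (randBits - nRangeBits);
    } while (value >= nRange);   /* reject values greater than H - L */
    return L + (int)value;
}
```

Because nRange is more than half of 2^nRangeBits, fewer than half of the draws are rejected, so the loop runs fewer than two times on average.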

On some implementations, rand() did not provide good randomness on its lower order bits, so the modulus operator would not provide very random results. If you find that to be the case, you could try this instead:
int uniform_distribution(int rangeLow, int rangeHigh) {
double myRand = rand()/(1.0 + RAND_MAX);
int range = rangeHigh - rangeLow + 1;
int myRand_scaled = (myRand * range) + rangeLow;
return myRand_scaled;
}
Using rand() this way will produce a bias as noted by Lior. But, the technique is fine if you can find a uniform number generator to calculate myRand. One possible candidate would be drand48(). This will greatly reduce the amount of bias to something that would be very difficult to detect.
However, if you need something cryptographically secure, you should use an algorithm outlined in Lior's answer, assuming your rand() is itself cryptographically secure (the default one is probably not, so you would need to find one). Below is a simplified implementation of what Lior described. Instead of counting bits, we assume the range falls within RAND_MAX, and compute a suitable multiple. Worst case, the algorithm ends up calling the random number generator twice on average per request for a number in the range.
int uniform_distribution_secure(int rangeLow, int rangeHigh) {
int range = rangeHigh - rangeLow + 1;
int secureMax = RAND_MAX - RAND_MAX % range;
int x;
do x = secure_rand(); while (x >= secureMax);
return rangeLow + x % range;
}

I think it is known that rand() is not very good. It just depends on how good of "random" data you need.
http://www.azillionmonkeys.com/qed/random.html
http://www.linuxquestions.org/questions/programming-9/generating-random-numbers-in-c-378358/
http://forums.indiegamer.com/showthread.php?9460-Using-C-rand%28%29-isn-t-as-bad-as-previously-thought
I suppose you could write a test then calculate the chi-squared value to see how good your uniform generator is:
http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
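As a rough sketch of such a test, the function below computes Pearson's statistic for a set of bucket counts against a uniform expectation. The name chi_squared_uniform is made up for this example, and deciding whether a given value is "too large" still requires a chi-squared table or library.

```c
#include <stddef.h>

/* Pearson chi-squared statistic for observed bucket counts, assuming every
   bucket is expected to receive the same share of the samples.
   chi_squared_uniform is a hypothetical helper name. */
double chi_squared_uniform(const long *counts, size_t buckets)
{
    long total = 0;
    for (size_t i = 0; i < buckets; i++)
        total += counts[i];

    double expected = (double)total / (double)buckets;
    double chi2 = 0.0;
    for (size_t i = 0; i < buckets; i++) {
        double diff = (double)counts[i] - expected;
        chi2 += diff * diff / expected;
    }
    return chi2;
}
```

For k buckets the result is compared against a chi-squared distribution with k - 1 degrees of freedom; a perfectly even histogram gives 0.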
Depending on your use (don't use this for your online poker shuffler), you might consider an LFSR:
http://en.wikipedia.org/wiki/Linear_feedback_shift_register
It may be faster, if you just want some pseudo-random output. Also, supposedly they can be uniform, although I haven't studied the math enough to back up that claim.
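For reference, here is a sketch of one step of a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13 and 11, the example configuration used in the Wikipedia article linked above. The function name lfsr16_step is invented for this example.

```c
#include <stdint.h>

/* Advance a 16-bit Fibonacci LFSR by one step.
   Taps: 16, 14, 13, 11 (polynomial x^16 + x^14 + x^13 + x^11 + 1).
   The state must never be 0, because 0 maps to itself. */
uint16_t lfsr16_step(uint16_t lfsr)
{
    /* unsigned int keeps the shift well defined even where int is 16 bits */
    unsigned int state = lfsr;
    unsigned int bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1u;
    return (uint16_t)((state >> 1) | (bit << 15));
}
```

This tap set is maximal-length, so starting from any nonzero state the register passes through all 65535 nonzero states once before repeating. In that sense the full state is uniform over a complete period, although consecutive outputs are strongly correlated.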

A version which corrects the distribution errors (noted by Lior),
involves the high-bits returned by rand() and
only uses integer math (if that's desirable):
int uniform_distribution(int rangeLow, int rangeHigh)
{
int range = rangeHigh - rangeLow + 1; //+1 makes it [rangeLow, rangeHigh], inclusive.
int copies=RAND_MAX/range; // we can fit n-copies of [0...range-1] into RAND_MAX
// Use rejection sampling to avoid distribution errors
int limit=range*copies;
int myRand=-1;
while( myRand<0 || myRand>=limit){
myRand=rand();
}
return myRand/copies+rangeLow; // note that this involves the high-bits
}
//note: make sure rand() was already initialized using srand()
This should work well provided that range is much smaller than RAND_MAX, otherwise
you'll be back to the problem that rand() isn't a good random number generator in terms of its low-bits.

Related

Why does rand() + rand() produce negative numbers?

I observed that when the rand() library function is called just once within a loop, it almost always produces positive numbers.
for (i = 0; i < 100; i++) {
printf("%d\n", rand());
}
But when I add two rand() calls, many of the generated numbers are negative.
for (i = 0; i < 100; i++) {
printf("%d = %d\n", rand(), (rand() + rand()));
}
Can someone explain why I am seeing negative numbers in the second case?
PS: I initialize the seed before the loop as srand(time(NULL)).
rand() is defined to return an integer between 0 and RAND_MAX.
rand() + rand()
could overflow. What you observe is likely a result of undefined behaviour caused by integer overflow.
The problem is the addition. rand() returns an int value of 0...RAND_MAX. So, if you add two of them, you will get up to RAND_MAX * 2. If that exceeds INT_MAX, the result of the addition overflows the valid range an int can hold. Overflow of signed values is undefined behaviour and may lead to your keyboard talking to you in foreign tongues.
As there is no gain here in adding two random results, the simple idea is to just not do it. Alternatively you can cast each result to unsigned int before the addition if that can hold the sum. Or use a larger type. Note that long is not necessarily wider than int, the same applies to long long if int is at least 64 bits!
Conclusion: Just avoid the addition. It does not provide more "randomness". If you need more bits, you might concatenate the values sum = a + b * (RAND_MAX + 1), but that also likely requires a larger data type than int.
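A minimal sketch of that concatenation, using uint64_t as the larger type. This assumes RAND_MAX is small enough that (RAND_MAX + 1) squared fits in 64 bits, which holds for the common values 32767 and 2147483647. combine_rand is a hypothetical name.

```c
#include <stdint.h>
#include <stdlib.h>

/* Concatenate two rand()-style values into one wider value.
   a is the low "digit" and b the high "digit" in base RAND_MAX + 1.
   ASSUMPTION: (RAND_MAX + 1) squared fits in uint64_t. */
uint64_t combine_rand(int a, int b)
{
    uint64_t base = (uint64_t)RAND_MAX + 1u;
    return (uint64_t)a + (uint64_t)b * base;
}
```

A call such as combine_rand(rand(), rand()) covers 0 to (RAND_MAX + 1)^2 - 1 without any signed overflow, since all arithmetic is done in an unsigned 64-bit type.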
As your stated reason is to avoid a zero result: that cannot be avoided by adding the results of two rand() calls, as both can be zero. Instead, you can just increment. If RAND_MAX == INT_MAX, this cannot be done in int. However, (unsigned int)rand() + 1 will very likely work. Likely (not definitely), because it requires UINT_MAX > INT_MAX, which is true on all implementations I'm aware of (which covers quite a few embedded architectures, DSPs and all desktop, mobile and server platforms of the past 30 years).
Warning:
Although already sprinkled in comments here, please note that adding two random values does not give a uniform distribution, but a triangular distribution like rolling two dice: to get 12 (two dice) both dice have to show 6; for 11 there are already two possible variants: 6 + 5 or 5 + 6, etc.
So, the addition is also bad from this aspect.
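The dice analogy is easy to verify by enumerating all 36 outcomes. The helper below (two_dice_ways is a made-up name) counts how many ways each sum can occur.

```c
/* Fill ways[s] with the number of ordered pairs (d1, d2) of six-sided dice
   whose sum is s. The array needs room for indices 0..12.
   two_dice_ways is a hypothetical helper for illustration. */
void two_dice_ways(int ways[13])
{
    for (int s = 0; s <= 12; s++)
        ways[s] = 0;
    for (int d1 = 1; d1 <= 6; d1++) {
        for (int d2 = 1; d2 <= 6; d2++)
            ways[d1 + d2]++;
    }
}
```

The counts rise from one way to make 2 up to six ways to make 7 and then fall back to one way to make 12, which is the triangular shape described above.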
Also note that the results rand() generates are not independent of each other, as they are generated by a pseudorandom number generator. Note also that the standard does not specify the quality or uniform distribution of the calculated values.
This is an answer to a clarification of the question made in comment to this answer,
the reason i was adding was to avoid '0' as the random number in my code. rand()+rand() was the quick dirty solution which readily came to my mind.
The problem was to avoid 0. There are (at least) two problems with the proposed solution. One is, as the other answers indicate, that rand()+rand() can invoke undefined behavior. Best advice is to never invoke undefined behavior. Another issue is there's no guarantee that rand() won't produce 0 twice in a row.
The following rejects zero, avoids undefined behavior, and in the vast majority of cases will be faster than two calls to rand():
int rnum;
for (rnum = rand(); rnum == 0; rnum = rand()) {}
// or do rnum = rand(); while (rnum == 0);
Basically rand() produces numbers between 0 and RAND_MAX, and 2 * RAND_MAX > INT_MAX in your case.
You can take the modulus with the max value of your data type to prevent overflow. This of course will disrupt the distribution of the random numbers, but rand is just a way to get quick random numbers.
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
int main(void)
{
int i=0;
for (i=0; i<100; i++)
printf(" %d : %d \n", rand(), ((rand() % (INT_MAX/2))+(rand() % (INT_MAX/2))));
for (i=0; i<100; i++)
printf(" %d : %ld \n", rand(), ((rand() % (LONG_MAX/2))+(rand() % (LONG_MAX/2))));
return 0;
}
Maybe you could try a rather tricky approach: ensure that the sum of two rand() calls never exceeds RAND_MAX. A possible approach is sum = rand()/2 + rand()/2; This ensures that for a 16-bit compiler with a RAND_MAX value of 32767, even if both rand() calls happen to return 32767, (32767/2 = 16383) 16383+16383 = 32766, which would not result in a negative sum.
the reason i was adding was to avoid '0' as the random number in my code. rand()+rand() was the quick dirty solution which readily came to my mind.
A simple solution (okay, call it a "Hack") which never produces a zero result and will never overflow is:
x=(rand()/2)+1; // using divide -or-
x=(rand()>>1)+1; // using shift which may be faster
// compiler optimization may use shift in both cases
This will limit your maximum value, but if you don't care about that, then this should work fine for you.
To avoid 0, try this:
int rnumb = rand()%(INT_MAX-1)+1;
You need to include limits.h.
thx. the reason i was adding was to avoid '0' as the random number in my code. rand()+rand() was the quick dirty solution which readily came to my mind
It sounds like an XY problem to me: in order to not get a 0 from rand(), you call rand() twice, which makes the program slower, introduces a new setback, and still leaves the possibility of getting a 0.
Another solution, if C++ is an option, is std::uniform_int_distribution, which produces a random, uniformly distributed number in the defined interval:
https://wandbox.org/permlink/QKIHG4ghwJf1b7ZN
#include <random>
#include <array>
#include <iostream>
int main()
{
const int MAX_VALUE=50;
const int MIN_VALUE=1;
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> distrib(MIN_VALUE, MAX_VALUE);
std::array<int,MAX_VALUE-MIN_VALUE+1> weight={0}; // +1: the distribution is inclusive at both ends
for(int i=0; i<50000; i++) {
weight[distrib(gen)-MIN_VALUE]++;
}
for(int i=0;i<(int)weight.size();i++) {
std::cout << "value: " << MIN_VALUE+i << " times: " << weight[i] << std::endl;
}
}
What everyone else has said about the likely overflow could very well be the cause of the negative numbers, even when you use unsigned integers. The real problem, though, is using time/date functionality as the seed. If you are truly familiar with this functionality, you will know exactly why I say this: what it really does is give a distance (elapsed time) since a given date/time. While using the date/time functionality as the seed for rand() is a very common practice, it really is not the best option. You should search for better alternatives, as there are many theories on the topic and I could not possibly go into all of them. Add the possibility of overflow to this equation, and this approach was doomed from the beginning.
Those that posted the rand()+1 are using the solution that most use in order to guarantee that they do not get a negative number. But, that approach is really not the best way either.
The best thing you can do is take the extra time to write and use proper exception handling, and only add to the rand() number if and/or when you end up with a zero result. And, to deal with negative numbers properly. The rand() functionality is not perfect, and therefore needs to be used in conjunction with exception handling to ensure that you end up with the desired result.
Taking the extra time and effort to investigate, study, and properly implement the rand() functionality is well worth the time and effort. Just my two cents. Good luck in your endeavors...

About a criteria for random integers number generation (C)

I am running a bunch of physical simulations in which I need random numbers. I'm using the standard rand() function in C++.
It works like this: first I precalculate a set of probabilities of the form 1/(1+exp(a)) for a set of different values of a. They are of type double, as returned by the exp function in the math library. Things must then happen with those probabilities (there are only two of them), so I generate a random number uniformly distributed between 0 and 1 and compare it with the precalculated probabilities. To do that, I used:
double p = double(rand()%101)/100.0;
so I get random values between 0 and 1, both included. This didn't yield correct physical results. I tried this:
double p = double(rand()%1000001)/1000000.0;
And this worked. I don't really understand why so I would like some criteria about how to do it. My intuition tells that if I do
double p = double(rand()%(N+1))/double(N);
with N big enough that the smallest division (1/N) is much smaller than the smallest probability 1/(1+exp(a)), then I will be getting realistic random numbers.
I would like to understand why, though.
rand() returns a random number between 0 and RAND_MAX.
Therefore you need this:
double p = double(rand()) / double(RAND_MAX);
Also run this snippet and you will understand:
int i;
for (i = 1; i < 30; i++)
{
int rnd = rand();
double p0 = double(rnd % 101) / 100.0;
double p1 = double(rnd % 1000001) / 1000000.0;
printf ("%d\t%f\t%f\n", rnd, p0, p1);
}
for (i = 1; i < 30; i++)
{
int rnd = rand();
double p0 = double(rnd) / double(RAND_MAX);
printf ("%d\t%f\n", rnd, p0);
}
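One way to see why the coarse grid fails: with rand() % (N + 1) divided by N, p can only take the N + 1 values 0, 1/N, ..., 1, so any probability q smaller than 1/N is effectively rounded up to about 1/N. The sketch below ignores modulo bias and assumes RAND_MAX is at least N; grid_hit_probability is an invented name.

```c
/* Probability that p = k / N is strictly below q when k is drawn uniformly
   from 0..N, i.e. the grid produced by rand() % (N + 1) divided by N.
   ASSUMPTIONS: modulo bias is ignored and RAND_MAX >= N.
   grid_hit_probability is a hypothetical helper. */
double grid_hit_probability(long N, double q)
{
    long hits = 0;
    for (long k = 0; k <= N; k++) {
        if ((double)k / (double)N < q)
            hits++;
    }
    return (double)hits / (double)(N + 1);
}
```

For q = 0.003, N = 100 gives roughly 0.0099 (only k = 0 qualifies), more than three times the intended probability, while N = 1000000 gives a value very close to 0.003.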
You have multiple problems.
1. rand() isn't very random at all. On almost all operating systems it returns badly distributed, horribly biased numbers. It's actually quite hard to find a good random number generator, but I can guarantee you that rand() will be among the worst you can find.
2. rand() % N gives a biased distribution. Think about the pigeonhole principle. Let's simplify it: assume that rand returns numbers in [0,8) and your N is 6. 0 to 5 map to 0 to 5, 6 maps to 0 and 7 maps to 1, meaning that 0 and 1 are twice as likely to come out.
3. Converting the numbers to double before division does not remove the bias from 2, it just makes it less visible. The pigeonhole principle applies regardless of the conversions you do.
4. Converting a well-distributed random number from integer to float/double is harder than it looks. Simple division ignores the problems of how floating point math works.
I can't help you much with 1, you need to do research. Look around the net for random number libraries. If you want something very random and unpredictable you need to look for cryptographic random libraries. If you want a repeatable but good random number, Mersenne Twister should probably be good enough. But you need to do the research here.
For 2 and 3 there are standard solutions. You are mapping a set of M elements to N elements, and rand % N will only work if N divides M evenly. Since on most systems M will be a power of two, it means that N also has to be a power of two. So assuming that M is a power of two, the algorithm is: find the nearest power of 2 higher than or equal to N, let's call it P. Generate randomness_source() % P. If the number is N or higher, throw it away and try again. This is the only safe way to do this. Cleverer people than you and me have spent years on this problem, there's no better way to remove the bias.
For 4, you can probably ignore the problem and just divide; in an absolute majority of cases this should be good enough. If you really want to study the problem, I've done some work on it and published the code on github. There I go through some basic principles of how floating point numbers work and how it relates to generating random numbers.
// produces pseudorandom bits. These are NOT crypto quality bits. Has the same underlying unpredictability as uncooked
// rand() output. It buffers rand() bits to produce a more convenient zero-to-the-argument range including negative
// arguments, corrects for the toward-zero bias of the modular construction I'd be using otherwise, eliminates the
// RAND_MAX range limitation, (use INT64_MAX instead) and effectively obscures biases and sequence telltales due to
// annoyingly bad rand libraries. It does not correct these biases; anyone tracking the arguments and outputs has
// enough information to reconstruct the rand() output and detect them. But it makes the relationships drastically more complicated.
// needs stdint, stdlib.
int64_t privaterandom(int64_t range, int reset){
static uint64_t state = 0;
int64_t retval;
if (reset != 0){
srand((unsigned int)range);
state = (uint64_t)range;
}
if (range == 0) return (0);
if (range < 0) return -privaterandom(-range, 0);
if (range > UINT64_MAX/0xFFFFFFFF){
retval = (privaterandom(range/0xFFFFFFFF, 0) * 0xFFFFFFFF); // order of operations matters
return (retval + privaterandom(0xFFFFFFFF, 0));
}
while (state < UINT64_MAX / 0xFF){
state *= RAND_MAX;
state += rand();
}
retval = (state % range);
// makes "pigeonhole" bias alternate unpredictably between toward-even and toward-odd
if ((state/range > (state - (retval) )/ range) && state % 2 == 0) retval++;
state /= range;
return retval;
}
int64_t Random(int64_t range){ return (privaterandom(range, 0));}
int64_t Random_Init(int64_t seed){return (privaterandom(seed, 1));}

Deterministic bit scrambling to filter coordinates

I am trying to write a function that, given an (x,y) coordinate pair and the random seed of the program, will pseudo-randomly return true for some preset percentage of all such pairs. There are no limits on x or y beyond the restrictions of the data type, which is a 32-bit signed int.
My current approach is to scramble the bits of x, y, and the seed together and then compare the resulting number to the percentage:
float percentage = 0.005;
...
unsigned int n = (x ^ y) ^ seed;
return (((float) n / UINT_MAX) < percentage);
However, it seems that this approach would be biased for certain values of x and y. For example, if it returns true for (0,a), it will also return true for (a,0).
I know this implementation that just XORs them together is naive. Is there a better bit-scrambling algorithm to use here that will not be biased?
Edit: To clarify, I am not starting with a set of (x,y) coordinates, nor am I trying to get a fixed-size set of coordinates that evaluate to true. The function should be able to evaluate a truth value for arbitrary x, y, and seed, with the percentage controlling the average frequency of "true" coordinates.
The easy solution is to use a good hashing algorithm. You can do the range check on the value of hash(seed || x || y).
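As an illustration only, here is a sketch of that idea in C. It uses the finalizer of the SplitMix64 generator as the mixing function, which is my choice for the example and not something this answer prescribes; any good 64-bit hash could be swapped in. The names mix64 and coord_selected are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* SplitMix64 step used as a 64-bit mixer (swapped in for illustration;
   not a cryptographic hash). */
static uint64_t mix64(uint64_t z)
{
    z += 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Deterministically decide whether (x, y) is selected for a given seed.
   percentage is a fraction in [0, 1], e.g. 0.005. */
bool coord_selected(int32_t x, int32_t y, uint64_t seed, double percentage)
{
    /* Pack x and y into disjoint halves so (a, 0) and (0, a) differ. */
    uint64_t packed = ((uint64_t)(uint32_t)x << 32) | (uint64_t)(uint32_t)y;
    uint64_t h = mix64(mix64(seed) ^ packed);
    /* The top 53 bits of the hash give a double in [0, 1). */
    double u = (double)(h >> 11) * (1.0 / 9007199254740992.0);
    return u < percentage;
}
```

Packing x and y side by side, instead of XORing them, removes the symmetry between (0,a) and (a,0) noted in the question.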
Of course, selecting points individually with percentage p does not guarantee that you will end up with a sample whose size will be exactly p * N. (That's the expected size of the sample, but any given sample will deviate a bit.) If you want to get a sample of size precisely k from a universe of N objects, you can use the following simple algorithm:
Examine the elements in the universe one at a time until k reaches 0.
When examining element i, add it to the sample if its hash value mapped onto the range [0, N-i) is less than k. If you add the element to the sample, decrement k.
There's no way to get the arithmetic absolutely perfect (since there is no way to perfectly partition 2^i different hash values into n buckets unless n is a power of 2), so there will always be a tiny bias. (Floating point arithmetic does not help; the number of possible floating point values is also fixed, and suffers from the same bias.)
If you do 64-bit arithmetic, the bias will be truly tiny, but the arithmetic is more complicated unless your environment provides a 128-bit multiply. So you might feel satisfied with 32-bit computations, where the bias of one in a couple of thousand million [Note 1] doesn't matter. Here, you can use the fact that any 32 bits in your hash should be as unbiased as any other 32 bits, assuming your hash algorithm is any good (see below). So the following check should work fine:
// I need k elements from a remaining universe of n, and I have a 64-bit hash.
// Return true if I should select this element
bool select(uint32_t n, uint32_t k, uint64_t hash) {
return ((hash & (uint32_t)(-1)) * (uint64_t)n) >> 32 < k;
}
// Untested example sampler
// select exactly k elements from U, using a seed value
std::vector<E> sample(const std::vector<E>& U, uint64_t seed, uint32_t k) {
std::vector<E> retval;
// n counts the elements not yet examined, including the current one
for (uint32_t n = U.size(); k && n; --n) {
const E& elt = U[n - 1];
if (select(n, k, hash_function(seed, elt))) {
retval.push_back(elt);
--k;
}
}
return retval;
}
Assuming you need to do this a lot, you'll want to use a fast hash algorithm; since you're not actually working in a secure environment, you don't need to worry about whether the algorithm is cryptographically secure.
Many high-speed hashing algorithms work on 64-bit units, so you could maximize the speed by constructing a 128-bit input consisting of a 64-bit seed and the two 32-bit co-ordinates. You can then unroll the hash loop to do exactly two blocks.
I won't venture a guess at the best hash function for your purpose. You might want to check out one or more of these open-source hashing functions:
Farmhash https://code.google.com/p/farmhash/
Murmurhash https://code.google.com/p/smhasher/
xxhash https://code.google.com/p/xxhash/
siphash https://github.com/majek/csiphash/
... and many more.
Notes
1. A couple of billion, if you're on that side of the Atlantic.
I would prefer feeding seed, x, and y through a Combined Linear Congruential Generator.
This is generally much faster than hashing, and it is designed specifically for the purpose: To output a pseudo-random number uniformly in a certain range.
Using coefficients recommended by Wichmann-Hill (which are also used in some versions of Microsoft Excel) we can do:
si = 171 * s % 30269;
xi = 172 * x % 30307;
yi = 170 * y % 30323;
r_combined = fmod(si/30269. + xi/30307. + yi/30323., 1.);
return r_combined < percentage;
Where s is the seed on the first call, and the previous si on each subsequent call. (Thanks to rici's comment for this point.)

How to generate a random uniformly distributed number between -32000 and 32000

How can I generate a random uniformly distributed number in the range of -32000 to 32000? I already know how to generate a random number without a uniform distribution. The code for the non-uniform distribution is given below:
sint16 min= Some value a;
sint16 max= Some value b;
sint32 array[1536];
uint16 i;
for(i=0; i<1536; i++) {
r= rand()%(max+min+1)+min;
array[i]=r;
}
This code produces non-uniform distribution. I think for uniform distribution I need to remove the modulus operation. Any suggestions please.
When the span (max+1-min) is small compared to RAND_MAX, the non-uniformity is small, and people often leave it non-uniform in applications that can tolerate it. (However, they usually distribute the non-uniformities over the entire interval. Your code groups the excess elements at the low end of the interval.)
If you want the distribution to be perfectly uniform, then it is necessary to reject some samples. This trims the number of possible values so that it is a perfect multiple of the desired span:
Let span = max+1-min.
Let M = the largest multiple of span not greater than RAND_MAX+1.
// Get samples from random-number generator until one is in range.
do
sample = rand();
while (M <= sample);
// Scale and translate to desired interval.
sample = sample / (M/span) + min;
(This assumes that span ≤ RAND_MAX+1. If you want a bigger span than rand provides, you must “paste together” samples from rand to make bigger numbers. However, it will still be necessary to use rejection to trim the samples, unless the span is a factor of some power of RAND_MAX+1.)
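The pseudocode above can be turned into a small C function. This is only a sketch: it assumes min <= max and span <= RAND_MAX+1, so on a platform where RAND_MAX is 32767 it cannot cover -32000 to 32000 without pasting samples together as described. The name uniform_in_range is made up.

```c
#include <stdlib.h>

/* Uniform integer in [min, max] by rejection sampling on rand().
   ASSUMPTIONS: min <= max and max - min <= RAND_MAX.
   uniform_in_range is a hypothetical name. */
int uniform_in_range(int min, int max)
{
    unsigned long span = (unsigned long)max - (unsigned long)min + 1UL;
    unsigned long limit = (unsigned long)RAND_MAX + 1UL;
    /* M = largest multiple of span not greater than RAND_MAX + 1 */
    unsigned long M = limit - limit % span;
    unsigned long sample;

    /* Get samples until one falls below M. */
    do {
        sample = (unsigned long)rand();
    } while (M <= sample);

    /* Scale and translate to the desired interval. */
    return (int)(sample / (M / span)) + min;
}
```

The arithmetic is done in unsigned long so that RAND_MAX + 1 does not overflow when RAND_MAX equals INT_MAX.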
Assuming rand() returns uniformly distributed integral numbers in the range [0,RAND_MAX], you can generate uniformly distributed numbers in the range [0,N] easily, as long as N<=RAND_MAX.
int uniform_rand(int N)
{
int res;
do{
res=rand();
}while(res>N);
return res;
}
You can also shift the distribution to cover the range [min,max] as long as max-min <= RAND_MAX and max>=min of course.
int sample = min + uniform_rand(max-min);
Working example
http://coliru.stacked-crooked.com/a/50fa635270697fbf
Notice
While this solution is straightforward, you can dramatically increase performance of the uniform_rand() function by using:
the largest multiple of N not greater than RAND_MAX+1.
as pointed out in Eric's answer.
EDIT: Completely revised my initial answer after caf's legitimate criticism. (see comments)
The modulo operation gives only a slight advantage to lower numbers, so we could non-strictly consider the distribution uniform.
In this regard, you could generate random numbers between -32000 and 32000 like this (assuming RAND_MAX is at least 64000):
r = rand() % 64001;
r -= 32000;

Does "n * (rand() / RAND_MAX)" make a skewed random number distribution?

I'd like to find an unskewed way of getting random numbers in C (although at most I'm going to be using it for values of 0-20, and more likely only 0-8). I've seen this formula but after running some tests I'm not sure if it's skewed or not. Any help?
Here is the full function used:
int randNum()
{
return 1 + (int) (10.0 * (rand() / (RAND_MAX + 1.0)));
}
I seeded it using:
unsigned int iseed = (unsigned int)time(NULL);
srand (iseed);
The one suggested below refuses to work for me. I tried:
int greek;
for (j=0; j<50000; j++)
{
greek =rand_lim(5);
printf("%d, ", greek);
greek =(int) (NUM * (rand() / (RAND_MAX + 1.0)));
int togo=number[greek];
number[greek]=togo+1;
}
and it stops working and gives me the same number 50000 times when I comment out printf.
Yes, it's skewed, unless your RAND_MAX happens to be a multiple of 10.
If you take the numbers from 0 to RAND_MAX, and try to divide them into 10 piles, you really have only three possibilities:
RAND_MAX is a multiple of 10, and the piles come out even.
RAND_MAX is not a multiple of 10, and the piles come out uneven.
You split it into uneven groups to start with, but throw away all the "extras" that would make it uneven.
You rarely have control over RAND_MAX, and it's often a prime number anyway. That really only leaves 2 and 3 as possibilities.
The third option looks roughly like this:
[Edit: After some thought, I've revised this to produce numbers in the range 0...(limit-1), to fit with the way most things in C and C++ work. This also simplifies the code (a tiny bit).]
int rand_lim(int limit) {
/* return a random number in the range [0..limit)
*/
int divisor = RAND_MAX/limit;
int retval;
do {
retval = rand() / divisor;
} while (retval >= limit);
return retval;
}
For anybody who questions whether this method might leave some skew, I also wrote a rather different version, purely for testing. This one uses a decidedly non-random generator with a very limited range, so we can simply iterate through every number in the range. It looks like this:
#include <stdlib.h>
#include <stdio.h>
#define MAX 1009
int next_val() {
// just return consecutive numbers
static int v=0;
return v++;
}
int lim(int limit) {
int divisor = MAX/limit;
int retval;
do {
retval = next_val() / divisor;
} while (retval == limit);
return retval;
}
#define LIMIT 10
int main() {
// we'll allocate extra space at the end of the array:
int buckets[LIMIT+2] = {0};
int i;
for (i=0; i<MAX; i++)
++buckets[lim(LIMIT)];
// and print one beyond what *should* be generated
for (i=0; i<LIMIT+1; i++)
printf("%2d: %d\n", i, buckets[i]);
}
So, we're starting with numbers from 0 to 1008 (1009 values; 1009 is prime, so it won't be an exact multiple of any range we choose). That is 1009 numbers, split into 10 buckets. That should give 100 in each bucket, and the 9 leftovers (so to speak) get "eaten" by the do/while loop. As it's written right now, it allocates and prints out an extra bucket. When I run it, I get exactly 100 in each of buckets 0..9, and 0 in bucket 10. If I comment out the do/while loop, I see 100 in each of 0..9, and 9 in bucket 10.
Just to be sure, I've re-run the test with various other numbers for both the range produced (mostly used prime numbers), and the number of buckets. So far, I haven't been able to get it to produce skewed results for any range (as long as the do/while loop is enabled, of course).
One other detail: there is a reason I used division instead of remainder in this algorithm. With a good (or even decent) implementation of rand() it's irrelevant, but when you clamp numbers to a range using division, you keep the upper bits of the input. When you do it with remainder, you keep the lower bits of the input. As it happens, with a typical linear congruential pseudo-random number generator, the lower bits tend to be less random than the upper bits. A reasonable implementation will throw out a number of the least significant bits already, rendering this irrelevant. On the other hand, there are some pretty poor implementations of rand around, and with most of them, you end up with better quality of output by using division rather than remainder.
I should also point out that there are generators that do roughly the opposite -- the lower bits are more random than the upper bits. At least in my experience, these are quite uncommon. Those in which the upper bits are more random are considerably more common.
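The weakness of the low bits in a power-of-two linear congruential generator is easy to demonstrate. The sketch below uses the multiplier and increment from the sample rand() in the C standard, but exposes the raw state instead of discarding the low bits as that sample does; lcg_next is a made-up name.

```c
#include <stdint.h>

/* One step of a textbook linear congruential generator modulo 2^31.
   The constants are those of the sample rand() in the C standard; the raw
   state is returned here purely to show the behaviour of its low bits.
   lcg_next is a hypothetical name. */
uint32_t lcg_next(uint32_t state)
{
    return (state * 1103515245u + 12345u) & 0x7FFFFFFFu;
}
```

Because both the multiplier and the increment are odd, the lowest bit of the state simply alternates 0, 1, 0, 1, ... on every step. A remainder by an even number inherits that pattern, while division keeps the better-behaved upper bits.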

Resources