Why does rand() + rand() produce negative numbers? - c

I observed that when the rand() library function is called just once within a loop, it almost always produces positive numbers.
for (i = 0; i < 100; i++) {
printf("%d\n", rand());
}
But when I add two rand() calls, the generated numbers now include far more negative values.
for (i = 0; i < 100; i++) {
printf("%d = %d\n", rand(), (rand() + rand()));
}
Can someone explain why I am seeing negative numbers in the second case?
PS: I initialize the seed before the loop as srand(time(NULL)).

rand() is defined to return an integer between 0 and RAND_MAX.
rand() + rand()
could overflow. What you observe is likely a result of undefined behaviour caused by integer overflow.

The problem is the addition. rand() returns an int value of 0...RAND_MAX. So, if you add two of them, you will get up to RAND_MAX * 2. If that exceeds INT_MAX, the result of the addition overflows the valid range an int can hold. Overflow of signed values is undefined behaviour and may lead to your keyboard talking to you in foreign tongues.
As there is no gain here in adding two random results, the simple idea is to just not do it. Alternatively you can cast each result to unsigned int before the addition if that can hold the sum. Or use a larger type. Note that long is not necessarily wider than int, the same applies to long long if int is at least 64 bits!
Conclusion: Just avoid the addition. It does not provide more "randomness". If you need more bits, you might concatenate the values sum = a + b * (RAND_MAX + 1), but that also likely requires a larger data type than int.
As your stated reason is to avoid a zero result: that cannot be achieved by adding the results of two rand() calls, as both can be zero. Instead, you can just increment. If RAND_MAX == INT_MAX, this cannot be done in int. However, (unsigned int)rand() + 1 will very, very likely do. Likely (not definitively), because it requires UINT_MAX > INT_MAX, which is true on every implementation I'm aware of (and that covers quite a few embedded architectures and DSPs, and all desktop, mobile and server platforms of the past 30 years).
Warning:
Although already sprinkled through the comments here, please note that adding two random values does not give a uniform distribution, but a triangular distribution, like rolling two dice: to get 12 (two dice) both dice have to show 6, whereas 11 already has two possible variants: 6 + 5 or 5 + 6, etc.
So, the addition is also bad from this aspect.
Also note that the results rand() generates are not independent of each other, as they are generated by a pseudorandom number generator. Note also that the standard does not specify the quality or uniform distribution of the calculated values.

This is an answer to a clarification of the question, made in a comment on this answer:
the reason i was adding was to avoid '0' as the random number in my code. rand()+rand() was the quick dirty solution which readily came to my mind.
The problem was to avoid 0. There are (at least) two problems with the proposed solution. One is, as the other answers indicate, that rand()+rand() can invoke undefined behavior. Best advice is to never invoke undefined behavior. Another issue is there's no guarantee that rand() won't produce 0 twice in a row.
The following rejects zero, avoids undefined behavior, and in the vast majority of cases will be faster than two calls to rand():
int rnum;
for (rnum = rand(); rnum == 0; rnum = rand()) {}
// or do rnum = rand(); while (rnum == 0);

Basically, rand() produces numbers between 0 and RAND_MAX, and 2 * RAND_MAX > INT_MAX in your case.
You can take the result modulo the maximum value of your data type to prevent overflow. This of course will disrupt the distribution of the random numbers, but rand() is just a way to get quick random numbers.
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

int main(void)
{
    int i = 0;
    for (i = 0; i < 100; i++)
        printf(" %d : %d \n", rand(), ((rand() % (INT_MAX/2)) + (rand() % (INT_MAX/2))));
    for (i = 0; i < 100; i++)
        printf(" %d : %ld \n", rand(), ((rand() % (LONG_MAX/2)) + (rand() % (LONG_MAX/2))));
    return 0;
}

Maybe you could try a somewhat tricky approach by ensuring that the value returned by the sum of two rand() calls never exceeds RAND_MAX. A possible approach is sum = rand()/2 + rand()/2;. This ensures that, for a 16-bit compiler with a RAND_MAX of 32767, even if both calls happen to return 32767, then (since 32767/2 = 16383) 16383 + 16383 = 32766, which does not result in a negative sum.

the reason i was adding was to avoid '0' as the random number in my code. rand()+rand() was the quick dirty solution which readily came to my mind.
A simple solution (okay, call it a "Hack") which never produces a zero result and will never overflow is:
x=(rand()/2)+1 // using divide -or-
x=(rand()>>1)+1 // using shift which may be faster
// compiler optimization may use shift in both cases
This will limit your maximum value, but if you don't care about that, then this should work fine for you.

To avoid 0, try this:
int rnumb = rand()%(INT_MAX-1)+1;
You need to include limits.h.

thx. the reason i was adding was to avoid '0' as the random number in my code. rand()+rand() was the quick dirty solution which readily came to my mind
It sounds like an XY problem to me: in order not to get a 0 from rand(), you call rand() twice, making the program slower and introducing a new pitfall, while the possibility of getting a 0 is still there.
Another solution is using uniform_int_distribution, which creates a random and uniformly distributed number in the defined interval:
https://wandbox.org/permlink/QKIHG4ghwJf1b7ZN
#include <random>
#include <array>
#include <iostream>

int main()
{
    const int MAX_VALUE = 50;
    const int MIN_VALUE = 1;
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> distrib(MIN_VALUE, MAX_VALUE);
    std::array<int, MAX_VALUE - MIN_VALUE + 1> weight = {0}; // +1: the interval is inclusive
    for (int i = 0; i < 50000; i++) {
        weight[distrib(gen) - MIN_VALUE]++;
    }
    for (int i = 0; i < (int)weight.size(); i++) {
        std::cout << "value: " << MIN_VALUE + i << " times: " << weight[i] << std::endl;
    }
}

While what everyone else has said about overflow may very well be the cause of the negative numbers, even when you use unsigned integers, the real problem in my view is using the time/date functionality as the seed. If you have truly become familiar with that functionality, you will know why I say this: what it really gives you is a distance (elapsed time) since a given date/time. While using the date/time as the seed to rand() is a very common practice, it really is not the best option. You should look into better alternatives, as there are many theories on the topic and I could not possibly go into all of them. Add to this equation the possibility of overflow, and this approach was doomed from the beginning.
Those who posted rand()+1 are using the solution most people use to guarantee that they do not get a negative number. But that approach is really not the best way either.
The best thing you can do is take the extra time to write and use proper exception handling, and only add to the rand() number if and when you end up with a zero result, and to deal with negative numbers properly. The rand() functionality is not perfect, and therefore needs to be used in conjunction with exception handling to ensure that you end up with the desired result.
Taking the extra time and effort to investigate, study, and properly implement the rand() functionality is well worth the time and effort. Just my two cents. Good luck in your endeavors...

Related

Why does rand() repeat numbers far more often on Linux than Mac?

I was implementing a hashmap in C as part of a project I'm working on and using random inserts to test it. I noticed that rand() on Linux seems to repeat numbers far more often than on Mac. RAND_MAX is 2147483647/0x7FFFFFFF on both platforms. I've reduced it to this test program that makes a byte array RAND_MAX+1-long, generates RAND_MAX random numbers, notes if each is a duplicate, and checks it off the list as seen.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
int main() {
    size_t size = ((size_t)RAND_MAX) + 1;
    char *randoms = calloc(size, sizeof(char));
    int dups = 0;
    srand(time(0));
    for (int i = 0; i < RAND_MAX; i++) {
        int r = rand();
        if (randoms[r]) {
            // printf("duplicate at %d\n", r);
            dups++;
        }
        randoms[r] = 1;
    }
    printf("duplicates: %d\n", dups);
}
Linux consistently generates around 790 million duplicates. Mac consistently only generates one, so it loops through every random number that it can generate almost without repeating. Can anyone please explain to me how this works? I can't tell anything different from the man pages, can't tell which RNG each is using, and can't find anything online. Thanks!
While at first it may sound like the macOS rand() is somehow better for not repeating any numbers, one should note that with this amount of numbers generated it is expected to see plenty of duplicates (in fact, around 790 million, or (2^31 − 1)/e). Likewise iterating through the numbers in sequence would also produce no duplicates, but wouldn't be considered very random. So the Linux rand() implementation is in this test indistinguishable from a true random source, whereas the macOS rand() is not.
Another thing that appears surprising at first glance is how the macOS rand() can manage to avoid duplicates so well. Looking at its source code, we find the implementation to be as follows:
/*
 * Compute x = (7^5 * x) mod (2^31 - 1)
 * without overflowing 31 bits:
 *      (2^31 - 1) = 127773 * (7^5) + 2836
 * From "Random number generators: good ones are hard to find",
 * Park and Miller, Communications of the ACM, vol. 31, no. 10,
 * October 1988, p. 1195.
 */
long hi, lo, x;

/* Can't be initialized with 0, so use another value. */
if (*ctx == 0)
    *ctx = 123459876;
hi = *ctx / 127773;
lo = *ctx % 127773;
x = 16807 * lo - 2836 * hi;
if (x < 0)
    x += 0x7fffffff;
return ((*ctx = x) % ((unsigned long) RAND_MAX + 1));
This does indeed result in all numbers between 1 and RAND_MAX − 1, inclusive, exactly once, before the sequence repeats again. Since the next state is based on multiplication, the state can never be zero (if it were, all future states would also be zero). Thus the repeated number you see is the first one, and zero is a value that is never returned.
Apple has been promoting the use of better random number generators in their documentation and examples for at least as long as macOS (or OS X) has existed, so the quality of rand() is probably not deemed important, and they've just stuck with one of the simplest pseudorandom generators available. (As you noted, their rand() is even commented with a recommendation to use arc4random() instead.)
On a related note, the simplest pseudorandom number generator I could find that produces decent results in this (and many other) tests for randomness is xorshift*:
#include <stdint.h>

uint64_t xorshift64star(uint64_t *ctx)  /* *ctx must be initialized to a nonzero value */
{
    uint64_t x = *ctx;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *ctx = x;
    return (x * 0x2545F4914F6CDD1DULL) >> 33;
}
This implementation results in almost exactly 790 million duplicates in your test.
MacOS provides an undocumented rand() function in stdlib. If you leave it unseeded, then the first values it outputs are 16807, 282475249, 1622650073, 984943658 and 1144108930. A quick search will show that this sequence corresponds to a very basic LCG random number generator that iterates the following formula:
x[n+1] = 7^5 · x[n] (mod 2^31 − 1)
Since the state of this RNG is described entirely by the value of a single 32-bit integer, its period is not very long. To be precise, it repeats itself every 2^31 − 2 iterations, outputting every value from 1 to 2^31 − 2.
I don't think there's a standard implementation of rand() for all versions of Linux, but there is a glibc rand() function that is often used. Instead of a single 32-bit state variable, this uses a pool of over 1000 bits, which to all intents and purposes will never produce a fully repeating sequence. Again, you can probably find out what version you have by printing the first few outputs from this RNG without seeding it first. (The glibc rand() function produces the numbers 1804289383, 846930886, 1681692777, 1714636915 and 1957747793.)
So the reason you're getting more collisions in Linux (and hardly any in MacOS) is that the Linux version of rand() is basically more random.
rand() is defined by the C standard, and the C standard does not specify which algorithm to use. Obviously, Apple is using an inferior algorithm to your GNU/Linux implementation: The Linux one is indistinguishable from a true random source in your test, while the Apple implementation just shuffles the numbers around.
If you want random numbers of any quality, either use a better PRNG that gives at least some guarantees on the quality of the numbers it returns, or simply read from /dev/urandom or similar. The latter gives you cryptographic-quality numbers, but is slow. Even if it is too slow by itself, /dev/urandom can provide some excellent seeds to some other, faster PRNG.
In general, the rand/srand pair has been considered sort of deprecated for a long time due to low-order bits displaying less randomness than high-order bits in the results. This may or may not have anything to do with your results, but I think this is still a good opportunity to remember that even though some rand/srand implementations are now more up to date, older implementations persist and it's better to use random(3). On my Arch Linux box, the following note is still in the man page for rand(3):
The versions of rand() and srand() in the Linux C Library use the same
random number generator as random(3) and srandom(3), so the lower-order
bits should be as random as the higher-order bits. However, on older
rand() implementations, and on current implementations on different
systems, the lower-order bits are much less random than the higher-order
bits. Do not use this function in applications intended to be portable
when good randomness is needed. (Use random(3) instead.)
Just below that, the man page actually gives very short, very simple example implementations of rand and srand that are about the simplest LC RNGs you've ever seen and having a small RAND_MAX. I don't think they match what's in the C standard library, if they ever did. Or at least I hope not.
In general, if you're going to use something from the standard library, use random if you can (the man page lists it as POSIX standard back to POSIX.1-2001, but rand is standard way back before C was even standardized). Or better yet, crack open Numerical Recipes (or look for it online) or Knuth and implement one. They're really easy and you only really need to do it once to have a general purpose RNG with the attributes you most often need and which is of known quality.

Random integers in C, how bad is rand()%N compared to integer arithmetic? What are its flaws?

EDIT:
My question is: rand()%N is considered very bad, whereas the use of integer arithmetic is considered superior, but I cannot see the difference between the two.
People always mention:
low bits are not random in rand()%N,
rand()%N is very predictable,
you can use it for games but not for cryptography
Can someone explain if any of these points are the case here and how to see that?
The idea that the lower bits are non-random should make the permutation entropy (PE) of the two cases I show differ, but it is not the case.
I guess many like me would always avoid rand(), or rand()%N, because we've always been taught that it is pretty bad. I was curious to see how "wrong" random integers generated with C's rand()%N effectively are. This is also a follow-up to Ryan Reich's answer in How to generate a random integer number from within a range.
The explanation there sounds very convincing, to be honest; nevertheless, I thought I'd give it a try. So, I compared the distributions in a VERY naive way. I ran both random generators for different numbers of samples and domains. I didn't see the point of computing a density instead of histograms, so I just computed histograms and, just by looking, I would say they both look equally uniform. Regarding the other point that was raised, about the actual randomness (despite being uniformly distributed): I, again naively, computed the permutation entropy for these runs, which is the same for both sample sets, which tells us that there is no difference between the two regarding the ordering of the occurrences.
So, for many purposes, it seems to me that rand()%N would be just fine; how can we see its flaws?
Here I show you a very simple, inefficient and not very elegant (but I think correct) way of computing these samples and getting the histograms together with the permutation entropies.
I show plots for domains (0,i) with i in {5,10,25,50,100} for different number of samples:
There's not much to see in the code I guess, so I will leave both the C and the matlab code for replication purposes.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char *argv[])
{
    unsigned long max = (unsigned long) atoi(argv[2]);
    int samples = atoi(argv[3]);
    srand(time(NULL));
    if (atoi(argv[1]) == 1) {
        for (int i = 0; i < samples; ++i)
            printf("%lu\n", rand() % (max + 1));   /* %lu: the expression is unsigned long */
    } else {
        for (int i = 0; i < samples; ++i) {
            unsigned long
                num_bins = (unsigned long) max + 1,
                num_rand = (unsigned long) RAND_MAX + 1,
                bin_size = num_rand / num_bins,
                defect = num_rand % num_bins;
            long x;
            do {
                x = rand();
            } while (num_rand - defect <= (unsigned long) x);
            printf("%lu\n", (unsigned long) x / bin_size);
        }
    }
    return 0;
}
And here is the Matlab code to plot this and compute the PEs (I took the recursion for the permutations from: https://www.mathworks.com/matlabcentral/answers/308255-how-to-generate-all-possible-permutations-without-using-the-function-perms-randperm):
system('gcc randomTest.c -o randomTest.exe;');
max = 100;
samples = max*10000;
trials = 200;
system(['./randomTest.exe 1 ' num2str(max) ' ' num2str(samples) ' > file1'])
system(['./randomTest.exe 2 ' num2str(max) ' ' num2str(samples) ' > file2'])
a1=load('file1');
a2=load('file2');
uni = figure(1);
title(['Samples: ' num2str(samples)])
subplot(1,3,1)
h1 = histogram(a1,max+1);
title('rand%(max+1)')
subplot(1,3,2)
h2 = histogram(a2,max+1);
title('Integer arithmetic')
as=[a1,a2];
ns=3:8;
H = nan(numel(ns),size(as,2));
for op=1:size(as,2)
x = as(:,op);
for n=ns
sequenceOcurrence = zeros(1,factorial(n));
sequences = myperms(1:n);
sequencesArrayIdx = sum(sequences.*10.^(size(sequences,2)-1:-1:0),2);
for i=1:numel(x)-n
[~,sequenceOrder] = sort(x(i:i+n-1));
out = sequenceOrder'*10.^(numel(sequenceOrder)-1:-1:0).';
sequenceOcurrence(sequencesArrayIdx == out) = sequenceOcurrence(sequencesArrayIdx == out) + 1;
end
chunks = length(x) - n + 1;
ps = sequenceOcurrence/chunks;
hh = sum(ps(logical(ps)).*log2(ps(logical(ps))));
H(n,op) = hh/log2(factorial(n));
end
end
subplot(1,3,3)
plot(ns,H(ns,:),'--*','linewidth',2)
ylabel('PE')
xlabel('Sequence length')
filename = ['all_' num2str(max) '_' num2str(samples) ];
export_fig(filename)
Due to the way modulo arithmetic works, if N is significant compared to RAND_MAX, doing %N will make you considerably more likely to get some values than others. Imagine rand() produced 12 equally likely values, 0 to 11, and N were 9. Then the chance of getting one of 0, 1, or 2 is 0.5 (each can be produced two ways, e.g. both 0 and 9 map to 0), and the chance of getting one of 3, 4, 5, 6, 7, 8 is also 0.5, so you're twice as likely to get a 0 as a 4. If N is an exact divisor of the number of possible values, this distribution problem doesn't happen, and if N is very small compared to RAND_MAX the issue becomes less noticeable. RAND_MAX may not be a particularly large value (possibly 2^15 − 1), making this problem worse than you might expect. The alternative of doing (rand() * n) / (RAND_MAX + 1) also doesn't give an even distribution; however, it will be every mth value (for some m) that is more likely to occur, rather than the more likely values all sitting at the low end of the distribution.
If N is 75% of RAND_MAX then the values in the bottom third of your distribution are twice as likely as the values in the top two thirds (as this is where the extra values map to)
The quality of rand() will depend on the implementation of the system that you're on. I believe that some systems have had very poor implementation, OS Xs man pages declare rand obsolete. The Debian man page says the following:
The versions of rand() and srand() in the Linux C Library use the same
random number generator as random(3) and srandom(3), so the lower-order
bits should be as random as the higher-order bits. However, on older
rand() implementations, and on current implementations on different
systems, the lower-order bits are much less random than the higher-
order bits. Do not use this function in applications intended to be
portable when good randomness is needed. (Use random(3) instead.)
Both approaches have their pitfalls, and your graphs are little more than a pretty verification of the central limit theorem! For a sensible implementation of rand():
% N suffers from a "pigeon-holing" effect if 1u + RAND_MAX is not a multiple of N
/((RAND_MAX + 1u)/N) does not, in general, evenly distribute the return of rand across your range, due to integer truncation effects.
On balance, if N is small compared with RAND_MAX, I'd plump for % for its tractability. In any case, test your generator to see if it has the appropriate statistical properties for your application.
rand() % N is considered extremely poor not because the distribution is bad, but because the randomness is poor-to-nonexistent. (If anything the distribution will be too good.)
If N is not small with respect to RAND_MAX, both
rand() % N
and
rand() / (RAND_MAX / N + 1)
will have more or less the same, poor distribution -- certain values will occur with significantly higher probability than others.
Looking at distribution histograms won't show you that for some implementations, rand() % N has a much, much worse problem -- to show that you'd have to perform some correlations with previous values. (For example, try taking rand() % 2, then subtracting from the previous value you got, and plotting a histogram of the differences. If the difference is never 0, you've got a problem.)
I would like to say that the implementations for which rand()'s low-order bits aren't random are simply buggy. I'd like to think that all those buggy implementations would have disappeared by now. I'd like to think that programmers shouldn't have to worry about calling rand()%N any more. But, unfortunately, my wishes don't change the fact that this seems to be one of those bugs that never get fixed, meaning that programmers do still have to worry.
See also the C FAQ list, question 13.16.

Is it a good idea to try and use the entire rand() table?

I'm currently using this code to randomise some variables.
srand(time(NULL));
do {
    vkmhrand = rand();
    vkmh = vkmhrand / 78;
} while (vkmh <= 0 || vkmh > vmax);
vmax won't exceed 415, which means that if I make it as simple as vkmh = rand() the majority of the randomized table will be discarded, plus I also get some nice decimals to make the results a little more interesting.
I know rand()'s not the best random function to use in existence. I'm just thinking in general, is it a good idea to try and use as much of the table as possible? Is it a waste of time? Or does it exacerbate any patterns that might already exist in the table?
Scaling the output of rand() effectively means, in your case, that you can use more of its bits. Assuming their values are independent of each other (which is probably not true for all implementations of rand()), this is generally a good thing. However, you have to be aware that those additional bits are not used to make the high-order bits "more random" (less predictable), but only to give you more decimal digits, so your solution space has a finer resolution.
The usual way to generate floats in a certain range would be:
double vkmh = (double) rand() / RAND_MAX * (vmax-vmin) + vmin;
if you are fine with ints, the usual way would be:
unsigned int vkmh = rand() % (vmax - vmin + 1) + vmin;
although this will result in a slightly skewed distribution if RAND_MAX + 1 isn't a multiple of (vmax - vmin + 1).
Use this; I'm not vouching for its randomness though.
srand(time(NULL));
do {
    vkmhrand = rand() % 415;
    vkmh = vkmhrand / 78;
} while (vkmh <= 0 || vkmh > vmax);

Generating a uniform distribution of INTEGERS in C

I've written a C function that I think selects integers from a uniform distribution with range [rangeLow, rangeHigh], inclusive. This isn't homework--I'm just using this in some embedded systems tinkering that I'm doing for fun.
In my test cases, this code appears to produce an appropriate distribution. I'm not feeling fully confident that the implementation is correct, though.
Could someone do a sanity check and let me know if I've done anything wrong here?
//uniform_distribution returns an INTEGER in [rangeLow, rangeHigh], inclusive.
int uniform_distribution(int rangeLow, int rangeHigh)
{
    int myRand = (int)rand();
    int range = rangeHigh - rangeLow + 1;       //+1 makes it [rangeLow, rangeHigh], inclusive.
    int myRand_scaled = (myRand % range) + rangeLow;
    return myRand_scaled;
}
//note: make sure rand() was already initialized using srand()
P.S. I searched for other questions like this. However, it was hard to filter out the small subset of questions that discuss random integers instead of random floating-point numbers.
Let's assume that rand() generates a uniformly-distributed value I in the range [0..RAND_MAX],
and you want to generate a uniformly-distributed value O in the range [L,H].
Suppose I is in the range [0..32767] and O is in the range [0..2].
According to your suggested method, O = I % 3. Note that in the given range, there are 10923 numbers for which I % 3 = 0, 10923 numbers for which I % 3 = 1, but only 10922 numbers for which I % 3 = 2. Hence your method will not map values from I onto O uniformly.
As another example, suppose O is in the range [0..32766].
According to your suggested method, O = I % 32767. Now you'll get O = 0 for both I = 0 and I = 32767. Hence 0 is twice as likely as any other value: your method is again nonuniform.
The suggested way to generate a uniform mapping is as follows:
Calculate the number of bits that are needed to store a random value in the range [L,H]:
unsigned int nRange = (unsigned int)H - (unsigned int)L + 1;
unsigned int nRangeBits = (unsigned int)ceil(log((double)nRange) / log(2.0));
Generate nRangeBits random bits
this can be easily implemented by shifting-right the result of rand()
Ensure that the generated number is not greater than H-L.
If it is - repeat step 2.
Now you can map the generated number into O just by adding L.
On some implementations, rand() did not provide good randomness on its lower order bits, so the modulus operator would not provide very random results. If you find that to be the case, you could try this instead:
int uniform_distribution(int rangeLow, int rangeHigh) {
    double myRand = rand() / (1.0 + RAND_MAX);
    int range = rangeHigh - rangeLow + 1;
    int myRand_scaled = (myRand * range) + rangeLow;
    return myRand_scaled;
}
Using rand() this way will produce a bias as noted by Lior. But, the technique is fine if you can find a uniform number generator to calculate myRand. One possible candidate would be drand48(). This will greatly reduce the amount of bias to something that would be very difficult to detect.
However, if you need something cryptographically secure, you should use an algorithm outlined in Lior's answer, assuming your rand() is itself cryptographically secure (the default one is probably not, so you would need to find one). Below is a simplified implementation of what Lior described. Instead of counting bits, we assume the range falls within RAND_MAX, and compute a suitable multiple. Worst case, the algorithm ends up calling the random number generator twice on average per request for a number in the range.
int uniform_distribution_secure(int rangeLow, int rangeHigh) {
    int range = rangeHigh - rangeLow + 1;
    int secureMax = RAND_MAX - RAND_MAX % range;
    int x;
    do x = secure_rand(); while (x >= secureMax);
    return rangeLow + x % range;
}
I think it is known that rand() is not very good. It just depends on how good the "random" data needs to be.
http://www.azillionmonkeys.com/qed/random.html
http://www.linuxquestions.org/questions/programming-9/generating-random-numbers-in-c-378358/
http://forums.indiegamer.com/showthread.php?9460-Using-C-rand%28%29-isn-t-as-bad-as-previously-thought
I suppose you could write a test then calculate the chi-squared value to see how good your uniform generator is:
http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test
Depending on your use (don't use this for your online poker shuffler), you might consider a LFSR
http://en.wikipedia.org/wiki/Linear_feedback_shift_register
It may be faster, if you just want some pseudo-random output. Also, supposedly they can be uniform, although I haven't studied the math enough to back up that claim.
A version which corrects the distribution errors (noted by Lior) uses the high bits returned by rand() and only integer math (if that's desirable):
int uniform_distribution(int rangeLow, int rangeHigh)
{
    int range = rangeHigh - rangeLow + 1; //+1 makes it [rangeLow, rangeHigh], inclusive.
    int copies = RAND_MAX / range;        // we can fit n-copies of [0...range-1] into RAND_MAX
    // Use rejection sampling to avoid distribution errors
    int limit = range * copies;
    int myRand = -1;
    while (myRand < 0 || myRand >= limit) {
        myRand = rand();
    }
    return myRand / copies + rangeLow;    // note that this involves the high-bits
}
//note: make sure rand() was already initialized using srand()
This should work well provided that range is much smaller than RAND_MAX, otherwise
you'll be back to the problem that rand() isn't a good random number generator in terms of its low-bits.

Does "n * (rand() / RAND_MAX)" make a skewed random number distribution?

I'd like to find an unskewed way of getting random numbers in C (although at most I'm going to be using it for values of 0-20, and more likely only 0-8). I've seen this formula but after running some tests I'm not sure if it's skewed or not. Any help?
Here is the full function used:
int randNum()
{
return 1 + (int) (10.0 * (rand() / (RAND_MAX + 1.0)));
}
I seeded it using:
unsigned int iseed = (unsigned int)time(NULL);
srand (iseed);
The one suggested below refuses to work for me. I tried:
int greek;
for (j = 0; j < 50000; j++)
{
    greek = rand_lim(5);
    printf("%d, ", greek);
    greek = (int) (NUM * (rand() / (RAND_MAX + 1.0)));
    int togo = number[greek];
    number[greek] = togo + 1;
}
and it stops working and gives me the same number 50000 times when I comment out printf.
Yes, it's skewed, unless your RAND_MAX happens to be a multiple of 10.
If you take the numbers from 0 to RAND_MAX, and try to divide them into 10 piles, you really have only three possibilities:
RAND_MAX is a multiple of 10, and the piles come out even.
RAND_MAX is not a multiple of 10, and the piles come out uneven.
You split it into uneven groups to start with, but throw away all the "extras" that would make it uneven.
You rarely have control over RAND_MAX, and it's often a prime number anyway. That really only leaves 2 and 3 as possibilities.
The third option looks roughly like this:
[Edit: After some thought, I've revised this to produce numbers in the range 0...(limit-1), to fit with the way most things in C and C++ work. This also simplifies the code (a tiny bit).]
int rand_lim(int limit) {
    /* return a random number in the range [0..limit)
     */
    int divisor = RAND_MAX / limit;
    int retval;
    do {
        retval = rand() / divisor;
    } while (retval >= limit);  /* >= rather than ==: rand()/divisor can occasionally exceed limit */
    return retval;
}
For anybody who questions whether this method might leave some skew, I also wrote a rather different version, purely for testing. This one uses a decidedly non-random generator with a very limited range, so we can simply iterate through every number in the range. It looks like this:
#include <stdlib.h>
#include <stdio.h>
#define MAX 1009
int next_val() {
    // just return consecutive numbers
    static int v = 0;
    return v++;
}

int lim(int limit) {
    int divisor = MAX / limit;
    int retval;
    do {
        retval = next_val() / divisor;
    } while (retval == limit);
    return retval;
}

#define LIMIT 10

int main() {
    // we'll allocate extra space at the end of the array:
    int buckets[LIMIT + 2] = {0};
    int i;
    for (i = 0; i < MAX; i++)
        ++buckets[lim(LIMIT)];
    // and print one beyond what *should* be generated
    for (i = 0; i < LIMIT + 1; i++)
        printf("%2d: %d\n", i, buckets[i]);
}
So, we're starting with the numbers from 0 to 1008 (1009 is prime, so it won't be an exact multiple of any range we choose). So, we're starting with 1009 numbers, and splitting them into 10 buckets. That should give 100 in each bucket, and the 9 leftovers (so to speak) get "eaten" by the do/while loop. As it's written right now, it allocates and prints out an extra bucket. When I run it, I get exactly 100 in each of buckets 0..9, and 0 in bucket 10. If I comment out the do/while loop, I see 100 in each of 0..9, and 9 in bucket 10.
Just to be sure, I've re-run the test with various other numbers for both the range produced (mostly used prime numbers), and the number of buckets. So far, I haven't been able to get it to produce skewed results for any range (as long as the do/while loop is enabled, of course).
One other detail: there is a reason I used division instead of remainder in this algorithm. With a good (or even decent) implementation of rand() it's irrelevant, but when you clamp numbers to a range using division, you keep the upper bits of the input. When you do it with remainder, you keep the lower bits of the input. As it happens, with a typical linear congruential pseudo-random number generator, the lower bits tend to be less random than the upper bits. A reasonable implementation will throw out a number of the least significant bits already, rendering this irrelevant. On the other hand, there are some pretty poor implementations of rand around, and with most of them, you end up with better quality of output by using division rather than remainder.
I should also point out that there are generators that do roughly the opposite: the lower bits are more random than the upper bits. At least in my experience, those are quite uncommon; ones in which the upper bits are more random are considerably more common.
