Improving "randomness" when extending the range of rand() - c

I am playing around with a pair of algorithms I found in other SO posts (listed in the references below), and am trying to figure out how to improve the distribution. I am effectively extending the range of a random number by doubling the number of bits, and want to ensure that the distribution is as uniform as possible, while removing (or at least reducing) the effects of modulo bias and other artifacts, since the result will drive a shuffling algorithm.
It is my understanding that if I initialize my RNG with a constant seed (i.e., srand(1)), I will get the same deterministic sequence of outputs from calling rand() in a loop. If I were to initialize the seed via srand(time(NULL)) instead, the sequence would differ, but that still might not help me with the following problem. Suppose I implement the following algorithm:
Take two random numbers a,b
Calculate a*(RAND_MAX+1)+b
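In C, the combination step might look something like this (a sketch; unsigned long long keeps the product from overflowing even when RAND_MAX is 2^31 - 1, and the function name is illustrative):

#include <stdlib.h>

/* Combine two rand() draws into one value on
   [0, RAND_MAX*(RAND_MAX+2)], i.e. [0, (RAND_MAX+1)^2 - 1]. */
unsigned long long rand_extended(void)
{
    unsigned long long a = (unsigned long long)rand();
    unsigned long long b = (unsigned long long)rand();
    return a * ((unsigned long long)RAND_MAX + 1ULL) + b;
}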
Would I be able to:
Generate every possible coordinate pair (a, b), where a and b are integers on [0, RAND_MAX] (i.e., between zero and RAND_MAX inclusive).
Maximize the uniformity of the entire distribution (i.e., an optimally flat histogram).
While the output of rand() is supposed to be uniformly distributed, I don't know whether it is guaranteed, over calls N and N+1 of each loop iteration, to produce every pair listed in point (1) before the random sequence repeats itself. My new random number generator could theoretically generate values on [0, (RAND_MAX+1)^2 - 1], but I don't know whether there might be "holes", i.e., values in this range that my algorithm can never generate.
I attempted to get further on this question myself, but I couldn't find information on how long a random sequence generated by rand() in C runs before it repeats itself. Lacking this and other information, I couldn't figure out whether it is possible to generate every pair (a, b).
So, using rand(), is it possible to achieve point (1) and if so, are there any solid suggestions on how to optimize its "randomness" according to point (2)?
Thank you for your time and assistance.
Update
I later revisited this problem and simulated it using an 8-bit PRNG. While it could indeed generate every possible coordinate pair, the distribution was actually quite interesting, and definitely not uniform. In the end, I read several articles/papers on PRNGs and used a Mersenne Twister algorithm to generate the additional bits needed (i.e., MT19937-64).
References
Extend rand() max range, Accessed 2014-05-07, <https://stackoverflow.com/questions/9775313/extend-rand-max-range>
Shuffle array in C, Accessed 2014-05-07, <https://stackoverflow.com/questions/6127503/shuffle-array-in-c>

Assumptions
As pointed out in the comments, the behaviour of rand() is implementation dependent. So, let's make a few simplifying assumptions to get to the point of the question:
rand() can generate all values from 0 to RAND_MAX. Justification: if it could not, then it would be even harder to generate all possible pairs (a, b).
rand() generates a statistically random sequence. Justification: The result of composing two random functions (or the same one twice) is only as good as the base random function.
Of course, we shouldn't expect the result to be better than the building blocks, so any deficiencies in the rand() implementation will reflect itself in any functions composed from it.
Holes in the distribution
Seeding rand() generates a deterministic sequence for a given seed, as the seed determines the PRNG's initial state. The sequence's maximum period is 2^N, where N is the number of bits in the state. Note that the state may in fact have more bits than are in RAND_MAX; we'll assume RAND_MAX = 2^N - 1 for this section. Because it is a sequence, generating two successive "random" N-bit values a and b means that a ≠ b. Therefore, the method a*(RAND_MAX + 1) + b will have some holes.
A little explanation on a ≠ b: PRNGs work by maintaining an internal state of N bits. It uses that state uniquely to determine its next state, so once the same state recurs, the sequence starts to repeat itself. The number of states gone through before the sequence starts repeating itself is called the period. So, technically we could have a = b, but that implies a period of 1, and that's a very bad PRNG. For more information, a helpful answer on PRNG periods has been posted on the Software Engineering site.
An algorithm without holes
One way to allow successive "random" calls to be equal is to generate two N-bit numbers but consider only a certain number of bits significant, i.e., discard some. Now we can have a = b, though with very slightly less probability than drawing some other value c. Note this is similar to how Java's random number generator works: it is seeded with 48 bits but outputs a 32-bit random number, discarding 16 bits (assuming the number of bits in the seed equals the number of bits in the state).
However, since you need values larger than RAND_MAX, what you could do is use the above method and then concatenate the bits until you have enough to reach the desired maximum (though, again, the distribution is not quite uniform). A sketch follows.
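A minimal sketch of that idea, assuming rand() yields at least the 15 bits the standard guarantees; SIG_BITS and the function names are illustrative choices, not part of the original answer:

#include <stdlib.h>

#define SIG_BITS 8    /* bits kept per rand() call; illustrative */

/* Keep only SIG_BITS of each draw, so successive outputs are free
   to repeat. (If RAND_MAX is wider than 15 bits, this picks
   interior bits, which is usually fine.) */
static unsigned sig_chunk(void)
{
    return ((unsigned)rand() >> (15 - SIG_BITS)) & ((1u << SIG_BITS) - 1u);
}

/* Concatenate chunks until at least total_bits (<= 64) bits are
   accumulated; as noted above, the distribution is close to, but
   not exactly, uniform. */
unsigned long long rand_wide(int total_bits)
{
    unsigned long long r = 0;
    for (int have = 0; have < total_bits; have += SIG_BITS)
        r = (r << SIG_BITS) | sig_chunk();
    return r;
}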

Related

How to model random variables?

I want to know how to model random variables using "basic operations". The only random function I know, at least for C, is rand(), along with srand for seeding. There probably exist packages somewhere online, but let's say I want to implement it on my own. I don't know if there are other very common random functions, but if not, let's just stick with rand() and the C language.
rand() allows me to pseudo-randomly generate an int from 0 to RAND_MAX. I can then use mod to get an int in some range. I can also mod by 2 to choose a sign and get negative numbers. I can also compute rand()/RAND_MAX to model values in the interval (0,1) and shift this to model Uniform(a,b).
But what I am not sure about is whether I can extend this to model any probability distribution, and at what point I have to worry about accuracy, especially when dealing with infinities and irrational probabilities. Also, this method is very crude, so I would like to know of more standard ways using basic tools, if any exist.
A simple example:
I have the random variable X such that Pr(X = 1) = 1/pi and Pr(X = 0) = 1 - 1/pi. Since pi is irrational, I would approximate the probability 1/pi with rand() and choose X = 1 if I get an int from 0 to Round(RAND_MAX * 1/pi). So this approximates twice: once for pi and another time for the rounding.
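As a concrete sketch of that thresholding (M_PI comes from POSIX's <math.h>, not strict ISO C, so treat it as an assumption; draw_x is a made-up name):

#include <math.h>
#include <stdlib.h>

/* X = 1 with probability approximately 1/pi: accept the draw when
   rand() lands at or below the scaled threshold. Both pi and the
   threshold are approximations, as noted above. */
int draw_x(void)
{
    return rand() <= (int)((double)RAND_MAX / M_PI);
}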
Is there a better approach? How would one go about modeling something more complicated, such as a continuous random variable on the interval (0, infinity), or a discrete random variable with irrational probabilities on a countably infinite set? Would my approach still work, or would I have to worry about rounding errors?
EDIT: Also, how does the pseudo-randomness (rather than true randomness) of rand() change things, and how would I account for these changes?
I can then use mod to get an int in some range
No, you can't. Try it with dice. You want a number between 1 and 5, so you take the roll mod 5 (more precisely, ((roll - 1) % 5) + 1). This maps 1 to 1, 2 to 2, and so on up to 5, and maps 6 back to 1. You now have 1 twice as likely as any other result.
The correct way of doing this is to find the nearest power of 2 at or above your range size, mask out the bits of the random number above that power of 2, then check whether you're in range. If you aren't in range, try again (this could in principle loop forever, but in practice the average number of retries is less than 2). This assumes that your random numbers are a stream of bits and not something else, which is usually a safe assumption for decent generators.
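A sketch of that retry loop for the dice example above (roll is a made-up name; note the caveat elsewhere on this page that some rand() implementations have weak low-order bits):

#include <stdlib.h>

/* Uniform on [1, n] by rejection: mask rand() down to the smallest
   power of two covering n and retry until the value is in range.
   Average number of retries is below 2. Requires 1 <= n <= RAND_MAX. */
int roll(int n)
{
    unsigned mask = 1;
    while (mask < (unsigned)n)      /* smallest power of two >= n */
        mask <<= 1;
    mask -= 1;                      /* e.g. n = 5 -> mask = 0x7 */

    int r;
    do {
        r = (int)((unsigned)rand() & mask);
    } while (r >= n);               /* reject out-of-range values */
    return r + 1;
}

roll(5) then gives the 1-to-5 result from the dice example without the doubled 1.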
I can also do rand()/RAND_MAX to model values in the interval (0,1)
No, you can't. That's not how floating point numbers work. This generates a horrible distribution.
Either the number of bits in the integer is smaller than the number of bits in the mantissa, in which case there is a whole set of floating point numbers you can never generate; or the number of bits in the integer is bigger than the number of bits in the mantissa, in which case the integer is truncated when converted to floating point before the division, and certain numbers are generated much more often than others.
in the interval (0,1) and shift this to model Uniform(a,b).
This makes things even worse. First you lose bits in one direction, then you lose bits in the other direction.
To actually generate uniformly distributed floating point numbers in an arbitrary range is harder than it looks.
I've done some experiments to figure this out myself a few years ago, learning floating point internals in the process and I've written some code with a lot of comments with reasoning here: https://github.com/art4711/random-double
In short, to generate random floating point numbers in an arbitrary range: take the end of the range with the larger absolute value; that is the start, and the other end of the range is the end. Figure out the next representable number from start toward end; subtracting start from that next number gives you the step. Calculate how many steps exist between start and end, generate a uniformly distributed random integer between 0 and the number of steps, and the answer is start + step * random number. Because of how floating point works, this might not be exactly what you're looking for: most of the representable floating point values in the range can never be generated by this method (except in very special cases). But this method guarantees that every value it can generate is equally likely.
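Here is one way that recipe might look in C; a sketch under stated assumptions (a != b, the range spans fewer than 2^53 steps, RAND_MAX >= 32767), not a substitute for the code in the repository above:

#include <math.h>
#include <stdint.h>
#include <stdlib.h>

/* 60 random bits folded together from four rand() calls. */
static uint64_t rand_bits60(void)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        r = (r << 15) | ((uint64_t)rand() & 0x7FFF);
    return r;
}

/* Uniform integer on [0, n] by rejection against a power-of-two mask. */
static uint64_t uniform_u64(uint64_t n)
{
    uint64_t mask = n;
    mask |= mask >> 1;  mask |= mask >> 2;  mask |= mask >> 4;
    mask |= mask >> 8;  mask |= mask >> 16; mask |= mask >> 32;
    uint64_t r;
    do {
        r = rand_bits60() & mask;
    } while (r > n);
    return r;
}

/* Step from the end with the larger magnitude (the coarsest ULP),
   so start + step * k is exactly representable for every k. */
double uniform_double(double a, double b)
{
    double start = (fabs(a) >= fabs(b)) ? a : b;
    double end   = (fabs(a) >= fabs(b)) ? b : a;
    double step  = nextafter(start, end) - start;   /* signed step */
    uint64_t nsteps = (uint64_t)((end - start) / step);
    return start + step * (double)uniform_u64(nsteps);
}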
Notice that your misconceptions are very common. Almost everyone does those things. Random numbers in the industry are anything but random. The word random in computer science pretty much means "predictable, repeatable, easily breakable and exploitable, quite possibly not well distributed". And don't get me started on the quality of the "random" number generators in standard libraries. If you dig around my github stuff, you'll find a package for Go with a long README rant about this.
I'm not going to respond to the rest of your question, those bits require a book or two.

How to generate either 0 or 1 randomly in C

I have read so many posts on this topic:
How does rand() work? Does it have certain tendencies? Is there something better to use?
How does the random number generator work in C
and this is what I got:
1) x(n+1) depends on x(n), i.e., the previous random number that was generated.
2) It is not recommended to initialize the seed more than once in the program.
3) It is a bad practice to use rand()%2 to generate either 0 or 1 randomly.
My questions are:
1) Are there any other libraries I missed that generate a completely random number (either 0 or 1) without depending on previous output?
2) Is there any workaround using the built-in rand() function to satisfy the requirement?
3) What is the side effect of initializing the seed more than once in a program?
Code snippet:
srand(time(NULL));
d1 = rand() % 2;
d2 = rand() % 2;
Here my intention is to make d1 and d2 completely independent of each other.
My initial thought is to do this:
srand(time(NULL));
d1 = rand() % 2;
srand(time(NULL));
d2 = rand() % 2;
But as I mentioned earlier, based on other posts, this is bad practice, I suppose?
So, can anyone please answer the above questions? I apologize if I completely missed an obvious thing.
Are there any other libraries I missed that generate a completely random number (either 0 or 1) without depending on previous output?
Not in the standard C library. There are lots of other libraries which generate "better" pseudo-random numbers.
Is there any workaround using the built-in rand() function to satisfy the requirement?
Most standard library implementations of rand produce sequences in which the low-order bit(s) have a short period and/or are not as independent of each other as one would like. The high-order bits are generally better distributed, so a better way of using the standard library rand function to generate a random single bit (0 or 1) is:
(rand() > RAND_MAX / 2)
or use an interior bit:
((rand() & 0x400U) != 0)
Those will produce reasonably uncorrelated sequences with most standard library rand implementations, and impose no more computational overhead than checking the low-order bit. If that's not good enough for you, you'll probably want to research other pseudo-random number generators.
All of these (including rand() % 2) assume that RAND_MAX is odd, which is almost always the case. (If RAND_MAX were even, there would be an odd number of possible values and any way of dividing an odd number of possible values into two camps must be slightly biased.)
What is the side effect of initializing the seed more than once in a program?
You should think of the random number generator as producing "not very random" numbers after being seeded, with the quality improving as you successively generate new random numbers. And remember that if you seed the random number generator using some seed, you will get exactly the same sequence as you will the next time you seed the generator with the same seed. (Since time() returns a number of seconds, two successive calls in quick succession will usually produce exactly the same number, or very occasionally two consecutive numbers. But definitely not two random uncorrelated numbers.)
So the side effect of reseeding is that you get less random numbers, and possibly exactly the same ones as you got the last time you reseeded.
1) Are there any other libraries I missed that generate a completely random number (either 0 or 1) without depending on previous output?
This sub-question is off-topic for Stack Overflow, but I'll point out that POSIX and BSD systems have an alternative random number generator function named random() that you could consider if you are programming for such a platform (e.g. Linux, OS X).
2) Is there any workaround using the built-in rand() function to satisfy the requirement?
Traditional computers (as opposed to quantum computers) are deterministic machines; they cannot produce true randomness. Every completely programmatic "random number generator" is in practice a pseudo-random number generator. They generate completely deterministic sequences, but the values from a given set of calls are distributed across the generator's range in a manner approximately consistent with a target probability distribution (ordinarily the uniform distribution).
Some operating systems provide support for generating numbers that depend on something more chaotic and less predictable than a computed sequence. For instance, they may collect information from mouse movements, CPU temperature variations, or other such sources to produce more objectively random numbers. Linux, for example, has such a driver that is often exposed as the special file /dev/random. The problem with such sources is that they hold a limited store of entropy and therefore cannot provide numbers at a sustained high rate. If you need only a few random numbers, however, then that might be a suitable source.
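A minimal sketch of reading one byte from that special file on Linux (error handling abbreviated):

#include <stdio.h>

int main(void)
{
    unsigned char byte;
    FILE *f = fopen("/dev/random", "rb");   /* kernel entropy pool */
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    if (fread(&byte, 1, 1, f) != 1) {
        perror("fread");
        fclose(f);
        return 1;
    }
    fclose(f);
    printf("random bit: %d\n", byte & 1);   /* 0 or 1 */
    return 0;
}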
3) What is the side effect of initializing the seed more than once in a program?
Code snippet:
srand(time(NULL));
d1 = rand() % 2;
d2 = rand() % 2;
Here my intention is to make d1 and d2 completely independent of each other.
My initial thought is to do this:
srand(time(NULL));
d1 = rand() % 2;
srand(time(NULL));
d2 = rand() % 2;
But as I mentioned earlier, based on other posts, this is bad practice, I suppose?
It is indeed bad if you want d1 and d2 to have a 50% probability of being different. time() returns the number of seconds since the epoch, so it is highly likely that it will return the same value when called twice so close together. The sequence of pseudorandom numbers is completely determined by the seed (this is a feature, not a bug), and when you seed the PRNG, you restart the sequence. Even if you used a higher-resolution clock to make the seeds more likely to differ, you don't escape correlation this way; you just change the function generating numbers for you. And the result does not have the same guarantees for output distribution.
Additionally, when you do rand() % 2 you use only one bit of the approximately log2(RAND_MAX) + 1 bits that it produced for you. Over the whole period of the PRNG, you can expect that bit to take each value the same number of times, but over narrow ranges you may sometimes see some correlation.
In the end, your requirement for your two random numbers to be completely independent of one another is probably overkill. It is generally sufficient for the pseudo-random result of one call to have no apparent correlation with the results of previous calls. You probably achieve that well enough with your first code snippet, even despite the use of only one bit per call. If you prefer to use more of the bits, though, then with some care you could base each number on the parity of the count of set bits in the value returned by rand(), as sketched below.
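A sketch of that parity idea (random_bit is a made-up name; XOR-folding computes the parity of the set bits, so the result depends on every bit rand() produced rather than just the lowest one):

#include <stdlib.h>

/* Parity of the set bits of rand(): XOR-fold all bits down to one.
   Works for rand() values up to 32 bits wide. */
int random_bit(void)
{
    unsigned r = (unsigned)rand();
    r ^= r >> 16;
    r ^= r >> 8;
    r ^= r >> 4;
    r ^= r >> 2;
    r ^= r >> 1;
    return (int)(r & 1u);
}

The question's first snippet would then become d1 = random_bit(); d2 = random_bit();.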
Use this
(double)rand() / (double)RAND_MAX
completely random number ..... without depending on previous output?
Well, in reality computers can't generate completely random numbers; there have to be some dependencies. But for almost all practical purposes, you can use rand().
side effect of initializing the seed more than once
No side effect as such, but reseeding every time defeats the point of using rand(): the output then depends mostly on the time (and processor) rather than on the generator's sequence.
any other work around using the inbuilt rand() function
You can write something like this:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    srand(time(NULL));
    printf("%lf\n", (double)rand() / (double)RAND_MAX);
    printf("%lf\n", (double)rand() / (double)RAND_MAX);
    return 0;
}
If you want to generate either a 0 or a 1, I think using rand() % 2 is perfectly fine, as the probability of an even number is the same as the probability of an odd number (every value is equally probable for an unbiased random number generator).

How to get pseudo-random uniformly distributed integers in C good enough for statistical simulation?

I'm writing a Monte Carlo simulation and am going to need a lot of random bits for generating integers uniformly distributed over {1,2,...,N} where N<40. The problem with using the C rand function is that I'd waste a lot of perfectly good bits using the standard rand % N technique. What's a better way for generating the integers?
I don't need cryptographically secure random numbers, but I don't want them to skew my results. Also, I don't consider downloading a batch of bits from random.org a solution.
rand() % N does not work; it skews your results unless RAND_MAX + 1 is a multiple of N.
A correct approach is to figure out the largest multiple of N that is at most RAND_MAX + 1, and then generate random numbers until you get one below that multiple. Only then should you do the modulo operation. This gives you a worst-case rejection ratio of 50%. A sketch follows.
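A sketch of that approach (uniform_below is a made-up name; it is written with RAND_MAX rather than RAND_MAX + 1 to avoid signed overflow, which rejects slightly more than necessary when RAND_MAX + 1 is an exact multiple of N but remains uniform):

#include <stdlib.h>

/* Uniform on [0, n-1] by rejection: accept only draws below the
   largest multiple of n that fits, then reduce modulo n. */
int uniform_below(int n)
{
    int limit = RAND_MAX - (RAND_MAX % n);   /* n * (RAND_MAX / n) */
    int r;
    do {
        r = rand();
    } while (r >= limit);                    /* reject the biased tail */
    return r % n;
}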
in addition to oli's answer:
if you're desperately concerned about bits then you can manage a queue of bits by hand, only retrieving as many as are necessary for the next number (i.e. ceil(log2(n))) - see the sketch below.
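a sketch of such a bit queue (names are illustrative; for brevity it takes the low 15 bits of each rand() call, which - per the caveat below - you may want to swap for higher bits):

#include <stdlib.h>

static unsigned long bit_buf = 0;   /* buffered random bits */
static int bits_left = 0;

/* fetch k random bits (1 <= k <= 15); tops the buffer up from
   rand(), taking the low 15 bits of each call for brevity */
unsigned get_bits(int k)
{
    while (bits_left < k) {
        bit_buf = (bit_buf << 15) | ((unsigned long)rand() & 0x7FFFUL);
        bits_left += 15;
    }
    bits_left -= k;
    return (unsigned)(bit_buf >> bits_left) & ((1u << k) - 1u);
}

combine this with oli's rejection method: e.g. for n = 40, draw get_bits(6) and retry while the value is >= 40.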
but you should make sure that your generator is good enough. simple linear congruential generators are better in the higher bits than the lower (see comments), which is worth remembering with your current modulus-based approach, since the modulus keeps the low bits.
numerical recipes has a really good section on all this and is very easy to read (not sure it mentions saving bits, but it's a good general reference).
update: if you're unsure whether this is needed or not, i would not worry about it for now (unless you have better advice from someone who understands your particular context).
Represent rand() in base 40 and take the digits as numbers. Drop any incomplete digits; that is, drop the leading digit, since it doesn't span the full range [0..39], and drop the whole random number whenever that leading digit takes its highest possible value, because the remaining digits would then be truncated (e.g. if RAND_MAX in base 40 is 21 23 05 06, drop all numbers whose leading base-40 digit is 21).
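A sketch of this method, with the constants worked out for RAND_MAX = 32767 (the minimum the standard guarantees), so 40^2 = 1600 fits and each accepted draw yields two full base-40 digits; the function name is illustrative:

#include <stdlib.h>

/* Uniform base-40 digit on [0, 39]. With 15-bit draws, the
   incomplete leading digit is d2 = r / 1600 on [0, 20]; rejecting
   r >= 32000 (= 20 * 40^2) drops exactly the draws where d2 is
   maxed out, leaving two full digits per accepted draw. */
int next_base40_digit(void)
{
    static int pending = -1;        /* one buffered digit */
    if (pending >= 0) {
        int d = pending;
        pending = -1;
        return d;
    }
    int r;
    do {
        r = rand() & 0x7FFF;        /* 15 bits, in case RAND_MAX > 32767 */
    } while (r >= 32000);           /* leading digit maxed out: reject */
    pending = r % 40;               /* low digit, saved for next call */
    return (r / 40) % 40;           /* high full digit */
}

Digits on [0, 39] cover the question's N < 40 directly.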

Write a C function that generates one random number, a pair of random numbers, or a triplet of random numbers given particular ranges

I have to generate random numbers for 3 different cases:
i. 1 die
ii. a pair of dice
iii. 3 dice
My questions:
1. Please suggest some good logic for generating random numbers in all 3 cases.
2. Does the logic change when I consider the case of 2 dice rather than 1?
3. How much does the range in which we have to generate a random number affect the logic of the random function?
If the range is small enough, you shouldn't have problems using the usual modulo method:
int GetRandomInt(int Min, int Max)
{
    return (rand() % (Max - Min + 1)) + Min;
}
(where Min and Max specify a closed interval, [Min, Max])
and calling it once for each die roll. Don't forget to call srand(time(NULL)) at the start of the application (at the start only, not each time you want a random number) to seed the random number generator.
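For the three cases in the question, usage might look like:

/* one die, a pair, and three dice, reusing GetRandomInt above */
int one   = GetRandomInt(1, 6);
int pair  = GetRandomInt(1, 6) + GetRandomInt(1, 6);
int three = GetRandomInt(1, 6) + GetRandomInt(1, 6) + GetRandomInt(1, 6);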
If the range gets bigger, you may have to face two problems:
First, the range of rand() obviously isn't [0, +∞); instead it's [0, RAND_MAX], where RAND_MAX is a #define guaranteed to be at least 32767. If your range (Max - Min) spans more than RAND_MAX, then with this method some numbers will have zero probability of being returned.
The second problem is more subtle: suppose that RAND_MAX is bigger than your range, but not that much bigger; say RAND_MAX == 1.5 * (Max - Min).
In this case, the distribution of results won't be uniform: rand() returns an integer in the range [0, RAND_MAX] (and each integer in this range should be equiprobable), but you are taking the remainder of the division by (Max - Min). This means that the numbers in the first half of your required range have twice the probability of being returned than the others: they can come from both the first and the third third of the rand() range, while the second half of the required range can come only from the second third.
What does this mean for you?
Probably nothing. If all you want is a dice-roll simulator, you can go with the modulo method without problems, since the range involved is small and the second problem, while still present, is almost irrelevant. Suppose your range is 3 and RAND_MAX is 32767: from 0 to 32765 the residues 0, 1 and 2 are equally likely, but going up to 32767 the residues 0 and 1 each gain one extra possibility. That is almost irrelevant: each residue moves from a perfect 1/3 (10922/32766 = 0.333...) to 10922/32768 (~0.33331) for 2 and 10923/32768 (~0.33334) for 0 and 1 (assuming that rand() provides a perfect distribution).
Anyhow, a commonly used method to overcome such problems is to "stretch" the rand() range over a wider range (or compress it to a smaller one) using a method like this:
int GetRandomInt(int Min, int Max)
{
    return (int)((double)rand() / RAND_MAX * (Max - Min)) + Min;
}
based on the proportion rand() : RAND_MAX = X : (Max - Min). The conversion to double is necessary; otherwise the integer division between rand() and its maximum value would always yield 0 (or 1 in the rare case of rand() == RAND_MAX). The calculation could be done in integer arithmetic by performing the product first, if RAND_MAX is small and the range is not too wide; otherwise there's a high risk of overflow.
I suspect that, if the output range is bigger than the range of rand(), the "stretching" and the truncation of the fp value (due to the conversion to int) affect the distribution in some way, but only locally (e.g. in small ranges you may never get a certain number, while globally the distribution looks OK).
Notice that this method helps to overcome a widespread limitation of the C standard library random number generator, i.e. the low randomness of the lower bits of the returned value - which are, incidentally, exactly the bits you use when you perform a modulo operation with a small output range.
However, keep in mind that the C standard library RNG is a simple one that strives to comply with "easy" statistical rules, and as such is easily predictable; it shouldn't be used when "serious" random numbers are required (e.g. cryptography). For such needs there are dedicated RNG libraries (e.g. the RNG part of the GNU Scientific Library), or, if you need really random stuff, there are several real random number services (one of the most famous is random.org), which do not use mathematical pseudo-RNGs but take their numbers from real random sources (e.g. radioactive decay).
Yeah, like DarkDust said, this sounds like homework, so, to answer your questions with that in mind, I'd say:
--> No, the logic doesn't change, no matter how many dice you include.
--> The easiest way to do this would be to make a function that gives you ONE random number and, depending on how many dice you have, call it that many times.
--> You could instead include a for loop in the function, add the values into an array, and return the array.
This way, you can generate random numbers for 100 dice too.
Since this sounds like homework I'm just going to give hints which should be "good enough" for you (a pro would do it slightly differently): use the random() function and the % (modulo) operator. Modulo gives the remainder after a division.

How is the lagged fibonacci generator random?

I don't get it. If it has a fixed length, won't choosing the lags and the mod over and over again give the same number?
To be precise, the lagged Fibonacci generator is a pseudo-random number generator. It's not truly random, but it's much better than, say, the more commonly used linear congruential generator (the standard generator for C++, Java, etc.). I'm not sure why you think it will give the same number over and over again, but it's true that, like all pseudo-random number generators, it has a period after which the sequence of numbers repeats.
The multiplicative LFG has a period of (2^k - 1)*2^(M-3), where 2^M is the modulus. For practical parameters this is actually quite huge (an LCG's period is at most its modulus, 2^M).
The only catch with LFG is that the initialization procedure is very complex, and the mathematics behind it is incomplete. It's best to consult the literature for good choice of parameters and recommended procedure for proper seeding.
As an illustration, a multiplicative LFG with parameters (j=31, k=52) and modulus m=2^32 is seeded with an array of 52 32-bit numbers.
Additional references:
http://sprng.fsu.edu/Version4.0/generators.html
More details on this generator and the seeding algorithms can be found in papers by Mascagni, et al.
It's not random, it's pseudorandom.
From http://en.wikipedia.org/wiki/Lagged_Fibonacci_generator:
Lagged Fibonacci generators have a maximum period of (2^k - 1)*2^(M-1) if addition or subtraction is used, and (2^k-1) if exclusive-or operations are used to combine the previous values. If, on the other hand, multiplication is used, the maximum period is (2^k - 1)*2^(M-3), or 1/4 of period of the additive case.
So, given a certain seed value, the sequence of output values is predictable and repeatable, and it has a cycle. It will repeat if you wait long enough - but the cycle is quite large.
For an observer that doesn't know the seed value, the sequence appears to be quite random so it can be useful as a source of "randomness" for simulations and other situations where true randomness isn't required.
It's random in the same way that any pseudorandom number generator is--which is to say, not at all.
However, lagged Fibonacci generators (and all linear feedback shift register PRNGs) improve on a basic linear congruential generator by increasing the state size: the next value depends on several former values rather than just the immediately previous one. Combined with a decent seed, you should be able to get fairly decent results.
Edit:
From your post, it isn't clear that you understand that the underlying state is stored in a shift register, meaning that it isn't static but is updated after each draw (by shifting each value one place to the left, dropping the leftmost value, and appending the most recent value on the right). In this way, drawing the same number over and over again is avoided (for most seed values, at least).
It all depends on the seed. Most random number generators do give the same sequence of numbers for a fixed seed value.
Pseudo-random number generators are deterministic functions: for every input state there is one fixed output. To make the output "random" you have to feed the generator a seed that is itself "random", like the system time or the values of computer memory locations, for example.
If you're wondering why you don't just straight up use the seed (the time, etc.), it's because the time is sequential (1,2,3,4) whereas most pseudorandom number generators spit out numbers that appear random (8, 27, 13, 1). That way if you're generating pseudorandom numbers in a loop (which happens very fast), you're not just getting {1,2,3,4}...
