How do I generate normally distributed positive integers in C? - c

I am writing a program in C that requires generating a normal distribution of positive integers with mean less than 1.

You can rescale data that is already normally distributed: for example, take heights of adult humans (mean about 180 cm) and multiply every value by a constant factor so that the mean becomes less than 1, e.g. multiply every height by 1/180. Scaling by a constant preserves normality.

I used a Poisson random number generator function in C, which takes the mean as input. It uses a combination of rand() calls and exponentiation to produce the distribution, which is the usual way to do this, as in Calculation of Poisson distribution in C.
Note, though, that a Poisson distribution only resembles a normal one when its mean is large. Generating 2^14 samples does not change the shape of each draw's distribution; by the Central Limit Theorem, it is their (suitably scaled) sum or average that tends to a normal distribution.

Related

Improving "randomness" when extending the range of rand()

I am playing around with a pair of algorithms I found on other SO posts (listed in the references below), and am trying to find out how to improve the distribution. I am effectively extending the range of a random number by doubling the number of bits, and want to ensure that the distribution is as uniform as possible, while removing (or at least reducing) the effects of modulo bias and other artifacts for a shuffling algorithm that would be using the result of my modified random number generator.
So, to my understanding, if I initialize my RNG with a constant seed (i.e. srand(1)), I will get the same deterministic sequence of outputs from calling rand() in a loop. If I were instead to seed via srand(time(NULL)), I'd get a different sequence, but that still might not help me with the following problem. Suppose I implement the following algorithm:
Take two random numbers a,b
Calculate a*(RAND_MAX+1)+b
Would I be able to:
Generate every possible coordinate pair (a, b), where a and b are non-negative integers on [0, RAND_MAX] (i.e. between zero and RAND_MAX inclusive).
Maximize the uniformity of the entire distribution (i.e. an optimally flat histogram).
While the output of rand() is supposed to be uniformly distributed, I don't know whether successive calls to rand() in each loop iteration are guaranteed to eventually produce every pair listed in point (1) before the sequence repeats itself. My new random number generator could theoretically generate values on [0, (RAND_MAX + 1)^2 - 1], but I don't know if there might be "holes", i.e. values in this range that can never be generated by my algorithm.
I attempted to get further on this question myself, but I couldn't find information on how long a random sequence generated by rand() in C runs before it repeats itself. Lacking this and other information, I couldn't figure out whether or not it is possible to generate every pair (a, b).
So, using rand(), is it possible to achieve point (1) and if so, are there any solid suggestions on how to optimize its "randomness" according to point (2)?
Thank you for your time and assistance.
Update
I later revisited this problem and simulated it using an 8-bit PRNG. While it could indeed generate every possible coordinate pair, the distribution was actually quite interesting, and definitely not uniform. In the end, I read several articles/papers on PRNGs and used a Mersenne Twister algorithm to generate the additional bits needed (i.e. MT19937-64).
References
Extend rand() max range, Accessed 2014-05-07, <https://stackoverflow.com/questions/9775313/extend-rand-max-range>
Shuffle array in C, Accessed 2014-05-07, <https://stackoverflow.com/questions/6127503/shuffle-array-in-c>
Assumptions
As pointed out in the comments, the behaviour of rand() is implementation dependent. So, let's make a few simplifying assumptions to get to the point of the question:
rand() can generate all values from 0 to RAND_MAX. Justification: If it could not, then it would be even harder to generate all possible pairs (a, b).
rand() generates a statistically random sequence. Justification: The result of composing two random functions (or the same one twice) is only as good as the base random function.
Of course, we shouldn't expect the result to be better than the building blocks, so any deficiencies in the rand() implementation will reflect itself in any functions composed from it.
Holes in the distribution
Seeding rand() produces a deterministic sequence for a given seed, since the seed determines the PRNG's initial state. The sequence's maximum period is 2^N, where N is the number of bits in the state. The state may in fact have more bits than the output, but we'll assume RAND_MAX = 2^N - 1 for this section, i.e. the output is the entire state. In that case, generating two successive "random" N-bit values a and b means that a ≠ b. Therefore, the method a*(RAND_MAX + 1) + b will have some holes.
A little explanation on a ≠ b: PRNGs work by maintaining an internal state of N bits. It uses that state uniquely to determine its next state, so once the same state recurs, the sequence starts to repeat itself. The number of states gone through before the sequence starts repeating itself is called the period. So, technically we could have a = b, but that implies a period of 1, and that's a very bad PRNG. For more information, a helpful answer on PRNG periods has been posted on the Software Engineering site.
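The a ≠ b claim can be seen concretely with a toy full-period 8-bit LCG whose output is its entire state; the multiplier 141 and increment 3 are illustrative choices that satisfy the Hull–Dobell full-period conditions mod 256:

```c
#include <stdint.h>

/* A full-period 8-bit LCG: state and output are the same 8 bits, so
 * the next value is a function of the current one. A repeated output
 * would require x == 141*x + 3 (mod 256), i.e. 140*x == -3 (mod 256),
 * which has no solution (even left side, odd right side). Hence two
 * successive outputs can never be equal. */
uint8_t lcg8(uint8_t *state)
{
    *state = (uint8_t)(*state * 141 + 3);
    return *state;
}
```

Iterating it 256 times from any seed visits all 256 values exactly once, so pairing successive outputs leaves holes exactly as described.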
An algorithm without holes
One way to allow successive "random" calls to be equal is to generate 2 N-bit numbers, but consider only a certain number of bits significant, i.e. discard some. Now, we can have a = b, though with very slightly less probability than another random number c. Note this is similar to how Java's random number generator works. It is seeded with 48 bits, but outputs a 32-bit random number, discarding 16 bits (assuming # of bits in seed = # of bits in state).
However, since you need values larger than RAND_MAX, what you could do is use the above method, and then concatenate the bits, until you get enough bits to reach the desired maximum (though again, the distribution is not quite uniform).
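A hedged sketch of that concatenation idea, keeping only 8 of the (at least 15) bits each rand() call guarantees; `rand16` is an illustrative name, and shifting right first is a nod to the weak low bits of some LCG-based implementations:

```c
#include <stdint.h>
#include <stdlib.h>

/* Build a 16-bit value from two rand() calls, discarding all but 8
 * bits of each. Because bits are thrown away, successive draws can
 * repeat, unlike the full-state outputs discussed above. */
uint16_t rand16(void)
{
    uint16_t hi = (uint16_t)((rand() >> 7) & 0xFF); /* keep higher bits */
    uint16_t lo = (uint16_t)((rand() >> 7) & 0xFF);
    return (uint16_t)((hi << 8) | lo);
}
```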

Converting a Uniform distribution to Poisson distribution

I have to write a C program to convert a uniform distribution of random numbers (say from 0 to 1) to a Poisson distribution. Can anyone help?
Use GSL, the GNU Scientific Library. There's a function called gsl_ran_poisson:
This function returns a random integer from the Poisson distribution with mean mu.
The probability distribution for Poisson variates is,
p(k) = (μ^k / k!) exp(-μ)
for k >= 0.
Otherwise, look at the code and copy the ideas.
I am assuming you want to write a C program that can sample a random number from the Poisson Distribution, given a random number in U(0,1).
Generally, this is done by applying the inverse CDF to the number from U(0,1). For a discrete distribution like the Poisson, the CDF is a step function, so inverting it amounts to finding the smallest integer k whose cumulative probability is at least u (in effect, a floor-style approximation).
The book Numerical Recipes in C++ (3rd Ed) has the complete explanation and C++ code as well. sec 7.3.12, page 372.

How to get pseudo-random uniformly distributed integers in C good enough for statistical simulation?

I'm writing a Monte Carlo simulation and am going to need a lot of random bits for generating integers uniformly distributed over {1,2,...,N} where N<40. The problem with using the C rand function is that I'd waste a lot of perfectly good bits using the standard rand % N technique. What's a better way for generating the integers?
I don't need cryptographically secure random numbers, but I don't want them to skew my results. Also, I don't consider downloading a batch of bits from random.org a solution.
rand() % N does not work; it skews your results unless RAND_MAX + 1 is a multiple of N.
A correct approach is to compute the largest multiple of N that is no greater than RAND_MAX + 1, then draw random numbers until you get one below that multiple, and only then apply the modulo operation. This bounds the worst-case rejection rate below 50%.
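That rejection scheme might look like the following sketch (the helper name `uniform_int` is an assumption):

```c
#include <stdlib.h>

/* Uniform integer in [0, n-1] without modulo bias: reject any draw
 * at or above the largest multiple of n that fits in RAND_MAX + 1,
 * so every residue class is equally likely. Expected retries < 2. */
int uniform_int(int n)
{
    unsigned long limit = ((unsigned long)RAND_MAX + 1UL)
                          / (unsigned long)n * (unsigned long)n;
    unsigned long r;
    do {
        r = (unsigned long)rand();
    } while (r >= limit);
    return (int)(r % (unsigned long)n);
}
```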
in addition to oli's answer:
if you're desperately concerned about bits then you can manage a queue of bits by hand, only retrieving as many as are necessary for the next number (i.e. ceil(log2(n))).
but you should make sure that your generator is good enough. simple linear congruential generators are better in the higher bits than the lower (see comments), so if you do extract raw bits, take them from the high end.
numerical recipes has a really good section on all this and is very easy to read (not sure it mentions saving bits, but as a general ref).
update if you're unsure whether it's needed or not, i would not worry about this for now (unless you have better advice from someone who understands your particular context).
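The bit-queue idea above can be sketched like this, assuming only the portable guarantee of 15 random bits per rand() call; all names here are illustrative, and (as the answer warns) this consumes the low bits of rand(), which are poor in some LCG-based implementations:

```c
#include <stdlib.h>

#define RAND_BITS 15  /* portable minimum: RAND_MAX >= 32767 */

static unsigned long bit_buffer = 0;
static int bits_left = 0;

/* Hand out nbits at a time, refilling from rand() only when the
 * buffer runs dry, so no guaranteed-random bit is wasted. */
unsigned int next_bits(int nbits)
{
    unsigned int out = 0;
    while (nbits-- > 0) {
        if (bits_left == 0) {
            bit_buffer = (unsigned long)rand();
            bits_left = RAND_BITS;
        }
        out = (out << 1) | (unsigned int)(bit_buffer & 1UL);
        bit_buffer >>= 1;
        bits_left--;
    }
    return out;
}

/* Uniform in [0, n-1]: draw ceil(log2(n)) bits, rejecting values >= n. */
unsigned int uniform_bits(unsigned int n)
{
    int nbits = 0;
    while ((1UL << nbits) < n) nbits++;
    unsigned int v;
    do {
        v = next_bits(nbits);
    } while (v >= n);
    return v;
}
```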
Represent rand in base40 and take the digits as numbers. Drop any incomplete digits, that is, drop the first digit if it doesn't have the full range [0..39] and drop the whole random number if the first digit takes its highest-possible value (e.g. if RAND_MAX is base40 is 21 23 05 06, drop all numbers having the highest base-40 digit 21).

How is the lagged fibonacci generator random?

I don't get it. If it has a fixed length, won't choosing the lags and the mod over and over again give the same number?
To be precise, the lagged Fibonacci generator is a pseudo-random number generator. It's not truly random, but it's much better than, say, the more commonly used linear congruential generator (the standard generator for C++, Java, etc.). I'm not sure why you think it will give the same number over and over again, but it's true that, like all pseudo-random number generators, it has a period after which the sequence of numbers repeats.
The multiplicative LFG has a period of (2^k - 1)*2^(M-3), where k is the longer lag and M is the word size in bits. For practical parameters this is huge (an LCG's period is at most its modulus M).
The only catch with LFG is that the initialization procedure is very complex, and the mathematics behind it is incomplete. It's best to consult the literature for good choice of parameters and recommended procedure for proper seeding.
As an illustration, a multiplicative LFG with parameters (j=31, k=52) and modulus m=2^32 is seeded with an array of 52 32-bit numbers.
Additional references:
http://sprng.fsu.edu/Version4.0/generators.html
More details on this generator and the seeding algorithms can be found in papers by Mascagni, et al.
It's not random, it's pseudorandom.
From this http://en.wikipedia.org/wiki/Lagged_Fibonacci_generator
Lagged Fibonacci generators have a maximum period of (2^k - 1)*2^(M-1) if addition or subtraction is used, and (2^k-1) if exclusive-or operations are used to combine the previous values. If, on the other hand, multiplication is used, the maximum period is (2^k - 1)*2^(M-3), or 1/4 of period of the additive case.
So, given a certain seed value, the sequence of output values is predictable and repeatable, and it has a cycle. It will repeat if you wait long enough - but the cycle is quite large.
For an observer that doesn't know the seed value, the sequence appears to be quite random so it can be useful as a source of "randomness" for simulations and other situations where true randomness isn't required.
It's random in the same way that any pseudorandom number generator is--which is to say, not at all.
However, lagged Fibonacci (and all linear-feedback shift register PRNGs) improve on a basic linear congruential generator by increasing the state size: the next value depends on several former values, rather than just the immediately previous one. Combined with a decent seed, you should be able to get fairly decent results.
Edit:
From your post, it isn't clear that you understand that the underlying state is stored in a shift register, meaning that it isn't static but updated (by shifting each value one place to the left, dropping the leftmost value, and appending the most recent value on the right side) after each draw. In this way, drawing the same number over & over again is avoided (for most seed values, at least).
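A sketch of an additive lagged Fibonacci generator with the classic lags (24, 55) may make the shift-register behaviour concrete; the simple LCG-based seeding below is purely illustrative, since (as noted above) proper seeding is the delicate part:

```c
#include <stdint.h>

#define LAG_J 24
#define LAG_K 55

static uint32_t lfg_state[LAG_K];
static int lfg_pos = 0;

/* Illustrative seeding only: fill the 55-word register with LCG output. */
void lfg_seed(uint32_t seed)
{
    for (int i = 0; i < LAG_K; i++) {
        seed = seed * 1664525u + 1013904223u;
        lfg_state[i] = seed;
    }
    lfg_pos = 0;
}

/* x[n] = x[n-24] + x[n-55] (mod 2^32). The oldest slot is overwritten
 * on each draw, so the register shifts and the output keeps changing. */
uint32_t lfg_next(void)
{
    uint32_t v = lfg_state[(lfg_pos + LAG_K - LAG_J) % LAG_K]
               + lfg_state[lfg_pos];
    lfg_state[lfg_pos] = v;
    lfg_pos = (lfg_pos + 1) % LAG_K;
    return v;
}
```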
It all depends on the seed. Most random number generators do give the same sequence of numbers for a fixed seed value.
Random number generators are deterministic functions: for every input state there is one fixed output. To make the result "random" you have to feed in a seed that itself varies, like the system time or the values of certain memory locations, for example.
If you're wondering why you don't just straight up use the seed (the time, etc.), it's because the time is sequential (1,2,3,4) whereas most pseudorandom number generators spit out numbers that appear random (8, 27, 13, 1). That way if you're generating pseudorandom numbers in a loop (which happens very fast), you're not just getting {1,2,3,4}...
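The seed-determinism point above is easy to check directly (the helper name is an assumption):

```c
#include <stdlib.h>

/* Fill buf with n rand() outputs after reseeding: the same seed must
 * reproduce the same sequence on any given implementation. */
void fill_sequence(unsigned int seed, int *buf, int n)
{
    srand(seed);
    for (int i = 0; i < n; i++)
        buf[i] = rand();
}
```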

C: Random Number Generation - What (If Anything) Is Wrong With This

For a simple simulation in C, I need to generate exponential random variables. I remember reading somewhere (but I can't find it now, and I don't remember why) that using the rand() function to generate random integers in a fixed range would generate non-uniformly distributed integers. Because of this, I'm wondering if this code might have a similar problem:
//generate u ~ U[0,1]
u = (double)rand() / (double)RAND_MAX;
//inverse of exponential CDF to get exponential random variable
expon = -log(1-u) * mean;
Thank you!
The problem with random numbers in a fixed range is that a lot of people do this for numbers between 100 and 200 for example:
100 + rand() % 100
That is not uniform. But by doing this it is (or is close enough to uniform at least):
u = 100 + 100 * ((double)rand() / (double)RAND_MAX);
Since that's what you're doing, you should be safe.
In theory, at least, rand() should give you a discrete uniform distribution from 0 to RAND_MAX... in practice, it has some undesirable properties, such as a small period, so whether it's useful depends on how you're using it.
RAND_MAX is often only 32767, even when the LCG behind rand() works internally with 32-bit numbers. At that scale, the lack of uniformity, as well as the low periodicity, will generally go unnoticed.
If you require high-quality pseudorandom numbers, you could try George Marsaglia's CMWC4096 (complementary multiply-with-carry). It is among the best pseudorandom number generators around, with an enormous period and a uniform distribution (you just have to pick good seeds for it). Plus, it's blazing fast (not as fast as an LCG, but approximately twice as fast as a Mersenne Twister).
Yes and no. The problem you're thinking of arises when you're clamping the output from rand() into a range that's smaller than RAND_MAX (i.e. there are fewer possible outputs than inputs).
In your case, you're (normally) reversing that: you're taking a fairly small number of bits produced by the random number generator, and spreading them among what will usually be a larger number of bits in the mantissa of your double. That means there are normally some bit patterns in the double (and therefore, specific values of the double) that can never occur. For most people's uses that's not a problem though.
As far as the "normally" goes, it's always possible that you have a 64-bit random number generator, where a double typically has a 53-bit mantissa. In this case, you could have the same kind of problem as with clamping the range with integers.
No, your algorithm will work; it's using the modulus function that does things imperfectly.
The one problem is that because the output is quantized, once in a while it will generate exactly RAND_MAX, so u = 1 and you'll be asking for log(1-1), i.e. log(0). I'd recommend at least (rand() + 0.5)/(RAND_MAX + 1), if not a better source like drand48().
There are much faster ways to compute the necessary numbers, e.g. the Ziggurat algorithm.
