Related
I've been looking into the int rand() function from <stdlib.h> in C11 when I stumbled over the following cppreference-example for rolling a six sided die.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(void)
{
srand(time(NULL)); // use current time as seed for random generator
int random_variable = rand();
printf("Random value on [0,%d]: %d\n", RAND_MAX, random_variable);
// roll a 6-sided die 20 times
for (int n=0; n != 20; ++n) {
int x = 7;
while(x > 6)
x = 1 + rand()/((RAND_MAX + 1u)/6); // Note: 1+rand()%6 is biased
printf("%d ", x);
}
}
Specifically this part:
[...]
while(x > 6)
x = 1 + rand()/((RAND_MAX + 1u)/6); // Note: 1+rand()%6 is biased
[...]
Questions:
Why the addition of + 1u? Since rand() is [0,RAND_MAX] I'm guessing
that doing rand()/(RAND_MAX/6) -> [0,RAND_MAX/(RAND_MAX/6)] -> [0,6]? And
since it's integer division (LARGE/(LARGE+small)) < 1 -> 0, adding 1u gives it the required range of [0,5]?
Building on the previous question, assuming [0,5], 1 + (rand()/((RAND_MAX+1u)/6)) should only go through [1,6] and never trigger a second loop?
Been poking around to see if rand() has returned float at some point, but
that seems like a pretty huge breakage towards old code? I guess the check
makes sense if you add 1.0f instead of 1u making it a floating point
division?
Trying to wrap my head around this, have a feeling that I might be missing
something..
(P.s. This is not a basis for anything security critical, I'm just exploring
the standard library. D.s)
The code avoids bias by ensuring each possible result in [1, 6] is the output from exactly the same number of return values from rand.
By definition, rand returns int values from 0 to RAND_MAX. So there are 1+RAND_MAX possible values it can return. If 1+RAND_MAX is not a multiple of 6, then it is impossible to partition it into 6 exactly equal intervals of integers. So the code partitions it into 6 equal intervals that are as big as possible and one odd-size fragment interval. Then the results of rand are mapped into these intervals: The first six intervals correspond to results from 1 to 6, and the last interval is rejected, and the code tries again.
When we divide 1+RAND_MAX by 6, there is some quotient q and some remainder r. Now consider the result of rand() / q:
When rand produces a number in [0, q−1], rand() / q will be 0.
When rand produces a number in [q, 2q−1], rand() / q will be 1.
When rand produces a number in [2q, 3q−1], rand() / q will be 2.
When rand produces a number in [3q, 4q−1], rand() / q will be 3.
When rand produces a number in [4q, 5q−1], rand() / q will be 4.
When rand produces a number in [5q, 6q−1], rand() / q will be 5.
When rand produces a number that is 6q or greater, rand() / q will be 6.
Observe that in each of the first six intervals, there are exactly q numbers. In the seventh interval, the possible return values are in [6q, RAND_MAX]. That interval contains r numbers.
This code works by rejecting that last interval:
int x = 7;
while(x > 6)
x = 1 + rand()/((RAND_MAX + 1u)/6);
Whenever rand produces a number in that last fragmentary interval, this code rejects it and tries again. When rand produces a number in one of the whole intervals, this code accepts it and exits (after adding 1 so the results in x are 1 to 6 instead of 0 to 5).
Thus, every output from 1 to 6, inclusive, is mapped to from an exactly equal number of rand values.
This is the best way to produce a uniform distribution from rand in the sense that it has the fewest rejections, given we are using a scheme like this.1 The range of rand has been split into six intervals that are as big as possible. The remaining fragmentary interval cannot be used because the remainder r is less than six, so the r unused values cannot be split evenly over the six desired values for x.
Footnote
1 This is not necessarily the best way to use rand to generate random numbers in [1, 6] overall. For example, from a single rand call with RAND_MAX equal to 32767, we could view the value as a base-six numeral from 000000 to 411411. If it is under 400000, we can take the last five digits, which are each uniformly distributed in [0, 5], and adding one gts us the desired [1, 6]. If it is in [400000, 410000), we can use the last four digits. If it is in [410000, 411000), we can use the last three, and so on. Additionally, the otherwise discarded information, such as the leading digit, might be pooled over multiple rand calls to increase the average number of outputs we get per call to rand.
I'm writing a short program to approximate the definite integral of the gaussian function f(x) = exp(-x^2/2), and my codes are as follows:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
double gaussian(double x) {
return exp((-pow(x,2))/2);
}
int main(void) {
srand(0);
double valIntegral, yReal = 0, xRand, yRand, yBound;
int xMin, xMax, numTrials, countY = 0;
do {
printf("Please enter the number of trials (n): ");
scanf("%d", &numTrials);
if (numTrials < 1) {
printf("Exiting.\n");
return 0;
}
printf("Enter the interval of integration (a b): ");
scanf("%d %d", &xMin, &xMax);
while (xMin > xMax) { //keeps looping until a valid interval is entered
printf("Invalid interval!\n");
printf("Enter the interval of integration (a b): ");
scanf("%d %d", &xMin, &xMax);
}
//check real y upper bound
if (gaussian((double)xMax) > gaussian((double)xMin))
yBound = gaussian((double)xMax);
else
yBound = gaussian((double)xMin);
for (int i = 0; i < numTrials; i++) {
xRand = (rand()% ((xMax-xMin)*1000 + 1))/1000.00 + xMin; //generate random x value between xMin and xMax to 3 decimal places
yRand = (rand()% (int)(yBound*1000 + 1))/1000.00; //generate random y value between 0 and yBound to 3 decimal places
yReal = gaussian(xRand);
if (yRand < yReal)
countY++;
}
valIntegral = (xMax-xMin)*((double)countY/numTrials);
printf("Integral of exp(-x^2/2) on [%.3lf, %.3lf] with n = %d trials is: %.3lf\n\n", (double)xMin, (double)xMax, numTrials, valIntegral);
countY = 0; //reset countY to 0 for the next run
} while (numTrials >= 1);
return 0;
}
However, the outputs from my code doesn't match the solutions. I tried to debug and print out all xRand, yRand and yReal values for 100 trials (and checked yReal value with particular xRand values with Matlab, in case I had any typos), and those values didn't seem to be out of range in any way... I don't know where my mistake is.
The correct output for # of trials = 100 on [0, 1] is 0.810, and mine is 0.880; correct output for # of trials = 50 on [-1, 0] is 0.900, and mine was 0.940. Can anyone find where I did wrong? Thanks a lot.
Another question is, I can't find a reference to the use of following code:
double randomNumber = rand() / (double) RAND MAX;
but it was provided by the instructor and he said it would generate a random number from 0 to 1. Why did he use '/' instead of '%' after "rand()"?
There's a few logical errors / discussion points in your code, both mathematics and programming-wise.
First of all, just to get it out of the way, we're talking about the standard gaussian here, i.e.
except, the definition of the gaussian on line 6, omits the
normalising term. Given the outputs you seem to expect, this seems to have been done on purpose. Fair enough. But if you wanted to calculate the actual integral, such that a practically infinite range (e.g. [-1000, 1000]) would sum up to 1, then you would need that term.
Is my code logically correct?
No. Your code has two logical errors: one on line 29 (i.e. your if statement), and one on line 40 (i.e. the calculation of valIntegral), which is a direct consequence of the first logical error.
For the first error, consider the following plot to see why:
Your Monte Carlo process effectively considers a bounded box over a certain range, and then says "I will randomly place points inside this box, and then count the proportion of the total number of points that randomly fell under the curve; the integral estimate is then the area of the bounded box itself, times this proportion".
Now, if both
and
are to the left of the mean (i.e. 0), then your if statement correctly sets the box's upper bound (i.e. yBound) to
such that the topmost bound of the box contains the highest part of that curve. So, e.g., to estimate the integral for the range [-2,-1], you set the upper bound to
.
Similarly, if both
and
are to the right of the mean, then you correctly set yBound to
However, if
, you should be setting yBound to neither
nor
, since the 0 point is higher than both!. So in this case, your yBound should simply be at the peak of the Gaussian, i.e.
(which in your case of an unnormalised Gaussian, this takes a value of '1').
Therefore, the correct if statement is as follows:
if (xMax < 0.0)
{ yBound = gaussian((double)xMax); }
else if (xMin > 0.0)
{ yBound = gaussian((double)xMin); }
else
{ yBound = gaussian(0.0); }
As for the second logical error, we already mentioned that the value of the integral is the "area of the bounding box" times the "proportion of successes". However, you seem to ignore the height of the box in your calculation. It is true that in the special case where
, the height of your unnormalised Gaussian function defaults to '1', therefore this term can be omitted. I suspect that this is why it may have been missed. However, in the other two cases, the height of the bounding box is necessarily less than 1, and therefore needs to be included in the calculation. So the correct code for line 40 should be:
valIntegral = yBound * (xMax-xMin) * (((double)countY)/numTrials);
Why am I not getting the correct output?
Even despite the above logical errors, as we've discussed above, your output should have been correct for the specific intervals [0,1] and [-1,0] (since they include the mean and therefore the correct yBound of 1). So why are you still getting a 'wrong' output?
The answer is, you are not. Your output is "correct". Except, a Monte Carlo process involves randomness, and 100 trials is not a big enough number to lead to consistent results. If you run the same range for 100 trials again and again, you'll see you'll get very different results each time (though, overall, they'll be distributed around the right value). Run with 1000000 trials, and you'll see that the result becomes a lot more precise.
What's up with that randomNumber code?
The rand() function returns an integer in the range [0, RAND_MAX], where RAND_MAX is system-specific (have a look at man 3 rand).
The modulo approach (i.e. %) works as follows: consider the range [-0.1, 0.3]. This range spans 0.4 units. 0.4 * 1000 + 1 = 401. For a random number from 0 to RAND_MAX, doing rand() modulo 401 will always result in a random number in the range [0,400]. If you then divide this back by 1000, you get a random number in the range [0, 0.4]. Add this to your xmin offset (here: -0.1) and you get a random number in the range [-0.1, 0.3].
In theory, this makes sense. However, unfortunately, as already pointed out in the other answer here, as a method it is susceptible to modulo bias, because RAND_MAX isn't necessarily exactly divisible by 401, therefore the top part of that range leading up to RAND_MAX overrepresents some numbers compared to others.
By contrast, the approach given to you by your teacher is simply saying: divide the result of the rand() function with RAND_MAX. This effectively normalises the returned random number into the range [0,1]. This is a much more straightforward thing to do, and it avoids modulo bias.
Therefore, the way I would implement this would be to make it into a function:
double randomNumber(void) {
return rand() / (double) RAND_MAX;
}
which then simplifies your computations as follows too:
xRand = randomNumber() * (xMax-xMin) + xMin;
yRand = randomNumber() * yBound;
You can see that this is a much more accurate thing to do, if you use a normalised gaussian, i.e.
double gaussian(double x) {
return exp((-pow(x,2.0))/2.0) / sqrt(2.0 * M_PI);
}
and then compare the two methods. You will see that the randomNumber() method for an "effectively infinite" range (e.g. [-1000,1000]) gives the correct result of 1, whereas the modulo approach tends to give numbers that are larger than 1.
Your code has no obvious bug (though there is a bug in the upper bound calculation, as #TasosPapastylianou points out, though it isn't the issue in your test cases). On 100 trials, your answer of 0.880 is closer to the actual value of the integral (0.855624...) than 0.810, and neither of those numbers are so far from the true value to suggest an outright bug in the code. Seems to be within sampling error (though see below). Here is a histogram of 1000 runs of a Monte Carlo integration (done in R, but with the same algorithm) of e^(-x^2/2) on [0,1] with 100 trials:
Unless your instructor specified the algorithm and the seed in precise detail, you shouldn't expect the exact same answer.
As far as your second question about rand() / (double) RAND MAX: it is an attempt to avoid modulo bias. It is possible that such a bias is effecting your code (especially given the way you round to 3 decimal places), since it does seem to overestimate the integral (based on running it a dozen times or so). Perhaps you could use that in your code and see if you get better results.
I am running a bunch of physical simulations in which I need random numbers. I'm using the standard rand() function in C++.
So it works like this: first I precalculate a bunch of probabilities that are of the form 1/(1+exp(a)), for a set of different a. They're of type double as returned by the exp function in the math library, and then things must happen with those probabilities, there are only two of them, so I generate a random number uniformly distributed between 0 and 1 and compared with those precalculated probabilities. To do that, I used:
double p = double(rand()%101)/100.0;
so I'm given random values between 0 and 1 both included. This didn't yield to correct physical results. I tried this:
double p = double(rand()%1000001)/1000000.0;
And this worked. I don't really understand why so I would like some criteria about how to do it. My intuition tells that if I do
double p = double(rand()%(N+1))/double(N);
with N big enough such that the smallest division (1/N) is much smaller than the smallest probability 1/1+exp(a) then I will be getting realistic random numbers.
I would like to understand why, though.
rand() returns a random number between 0 and RAND_MAX.
Therefore you need this:
double p = double(rand() % RAND_MAX) / double(RAND_MAX);
Also run this snippet and you will understand:
int i;
for (i = 1; i < 30; i++)
{
int rnd = rand();
double p0 = double(rnd % 101) / 100.0;
double p1 = double(rnd % 1000001) / 1000000.0;
printf ("%d\t%f\t%f\n", rnd, p0, p1);
}
for (i = 1; i < 30; i++)
{
int rnd = rand();
double p0 = double(rnd) / double(RAND_MAX);
printf ("%d\t%f\n", rnd, p0);
}
You have multiple problems.
rand() isn't very random at all. On almost all operating systems it returns badly distributed, horribly biased numbers. It's actually quite hard to find a good random number generator, but I can guarantee you that rand() will be among the worst you can find.
rand() % N gives a biased distribution. Think about the pigeonhole principle. Let's simplify it, assume that rand returns numbers [0,7) and your N is 6. 0 to 5 map to 0 to 5, 6 maps to 0 and 7 maps to 1, meaning that 0 and 1 are twice as likely to come out.
Converting the numbers to double before division does not remove the bias from 2, it just makes it less visible. The pigeonhole principle applies regardless of the conversions you do.
Converting a well-distributed random number from integer to float/double is harder than it looks. Simple division ignores the problems of how floating point math works.
I can't help you much with 1, you need to do research. Look around the net for random number libraries. If you want something very random and unpredictable you need to look for cryptographic random libraries. If you want a repeatable but good random number Mersenne Twister should probably be good enough. But you need to do the research here.
For 2 and 3 there are standard solutions. You are mapping a set from M elements to N elements and rand % N will only work iff N < M and N and M share prime factors. Since on most systems M will be a power of two it means that N also has to be a power of two. So assuming that M is a power of two the algorithm is: find the nearest power of 2 higher or equal to N, let's call it P. Generate randomness_source() % P. If the number is higher than N, throw it away and try again. This is the only safe way to do this. Cleverer people than you and me have spent years on this problem, there's no better way to remove the bias.
For 4, you can probably ignore the problem and just divide, in an absolute majority of cases this should be good enough. If you really want to study the problem, I've done some work on it and published the code on github. There I go through some basic principles of how floating point numbers work and how it relates to generating random numbers.
// produces pseudorandom bits. These are NOT crypto quality bits. Has the same underlying unpredictability as uncooked
// rand() output. It buffers rand() bits to produce a more convenient zero-to-the-argument range including negative
// arguments, corrects for the toward-zero bias of the modular construction I'd be using otherwise, eliminates the
// RAND_MAX range limitation, (use INT64_MAX instead) and effectively obscures biases and sequence telltales due to
// annoyingly bad rand libraries. It does not correct these biases; anyone tracking the arguments and outputs has
// enough information to reconstruct the rand() output and detect them. But it makes the relationships drastically more complicated.
// needs stdint, stdlib.
int64_t privaterandom(int64_t range, int reset){
static uint64_t state = 0;
int64_t retval;
if (reset != 0){
srand((unsigned int)range);
state = (uint64_t)range;
}
if (range == 0) return (0);
if (range < 0) return -privaterandom(-range, 0);
if (range > UINT64_MAX/0xFFFFFFFF){
retval = (privaterandom(range/0xFFFFFFFF, 0) * 0xFFFFFFFF); // order of operations matters
return (retval + privaterandom(0xFFFFFFFF, 0));
}
while (state < UINT64_MAX / 0xFF){
state *= RAND_MAX;
state += rand();
}
retval = (state % range);
// makes "pigeonhole" bias alternate unpredictably between toward-even and toward-odd
if ((state/range > (state - (retval) )/ range) && state % 2 == 0) retval++;
state /= range;
return retval;
}
int64_t Random(int64_t range){ return (privaterandom(range, 0));}
int64_t Random_Init(int64_t seed){return (privaterandom(seed, 1));}
I need to generate a set of random numbers within an interval which also happens to have a mean value. For instance min = 1000, max = 10000 and a mean of 7000. I know how to create numbers within a range but I am struggling with the mean value thing. Is there a function that I can use?
What you're looking for is done most easily with so called acceptance rejection method.
Split your interval into smaller intervals.
Specify a probability density function (PDF), can be a very simple one too, like a step function. For Gaussian distrubution you would have left and right steps lower than your middle step i.e (see the image bellow that has a more general distribution).
Generate a random number in the whole interval. If the generated number is greater than the value of your PDF at that point reject the generated number.
Repeat the steps until you get desired number of points
EDIT 1
Proof of concept on a Gaussian PDF.
Ok, so the basic idea is shown in graph (a).
Define/Pick your probability density function (PDF). PDF is a function of, statistically speaking, a random variable and describes the probability of finding the value x in a measurement/experiment. A function can be a PDF of a random variable x if it satisfies: 1) f(x) >= 0 and 2) it's normalized (meaning it sums, or integrates, up to the value 1).
Get maximum (max) and "zero points" (z1 < z2) of PDF. Some PDF's can have their zero points in infinity. In that case, determine cutoff points (z1, z2) for which PDF(z1>x>z2) < eta where you pick eta yourself. Basically means, set some small-ish value eta and then say your zero points are those values for which the value of PDF(x) is smaller than eta.
Define the interval Ch(z1, z2, max) of your random generator. This is the interval in which you generate your random variables.
Generate a random variable x such that z1<x<z2.
Generate a second unrelated random variable y in the range (0, max). If the value of y is smaller than PDF(x) reject both randomly generated values (x,y) and go back to step 4. If the generated value y is larger than PDF(x) accept the value x as the randomly generated point on a distribution and return it.
Here's the code that reproduces similar behavior for a Gaussian PDF.
#include "Random.h"
#include <fstream>
using namespace std;
double gaus(double a, double b, double c, double x)
{
return a*exp( -((x-b)*(x-b)/(2*c*c) ));
}
double* random_on_a_gaus_distribution(double inter_a, double inter_b)
{
double res [2];
double a = 1.0; //currently parameters for the Gaussian
double b = 2.0; //are defined here to avoid having
double c = 3.0; //a long function declaration line.
double x = kiss::Ran(inter_a, inter_b);
double y = kiss::Ran(0.0, 1.0);
while (y>gaus(a,b,c,x)) //keep creating values until step 5. is satisfied.
{
x = kiss::Ran(inter_a, inter_b); //this is interval (z1, z2)
y = kiss::Ran(0.0, 1.0); //this is the interval (0, max)
}
res[0] = x;
res[1] = y;
return res; //I return (x,y) for plot reasons, only x is the randomly
} //generated value you're looking for.
void main()
{
double* x;
ofstream f;
f.open("test.txt");
for(int i=0; i<100000; i++)
{
//see bellow how I got -5 and 10 to be my interval (z1, z2)
x = random_on_a_gaus_distribution(-5.0, 10.0);
f << x[0]<<","<<x[1]<<endl;
}
f.close();
}
Step 1
So first we define a general look of a Gaussian PDF in a function called gaus. Simple.
Then we define a function random_on_a_gaus_distribution which uses a well defined Gaussian function. In an experiment\measurement we would get coefficients a, b, c by fitting our function. I picked some random ones (1, 2, 3) for this example, you can pick the ones that satisfy your HW assignment (that is: coefficients that make a Gaussian that has a mean of 7000).
Step 2 and 3
I used wolfram mathematica to plot gaus. with parameters 1,2,3 too see what would be the most appropriate values for max and (z1, z2) . You can see the graph yourself. Maximum of the function is 1.0 and via ancient method of science called eyeballin' I estimated that the cutoff points are -5.0 and 10.0.
To make random_on_a_gaus_distribution more general you could follow step 2) more rigorously and define eta and then calculate your function in successive points until PDF gets smaller than eta. Dangers with this are that your cutoff points can be very far apart and this could take long for very monotonous functions. Additionally you have to find the maximum yourself. This is generally tricky, However a simpler problem is minimization of a negative of a function. This can also be tricky for a general case but not "undoable". Easiest way is to cheat a bit like I did and just hard-code this for a couple of functions only.
Step 4 and 5
And then you bash away. Just keep creating new and new points until you reach satisfactory hit. DO NOTICE the returned number x is a random number. You wouldn't be able to find a logical link between two successively created x values, or first created x and the millionth.
However the number of accepted x values in the interval around the x_max of our distribution is greater than the number of x values created in intervals for which PDF(x) < PDF(x_max).
This just means that your random numbers will be weighted within the chosen interval in such manner that the larger PDF value for a random variable x will correspond to more random points accepted in a small interval around that value than around any other value of xi for which PDF(xi)<PDF(x).
I returned both x and y to be able to plot the graph bellow, however what you're looking to return is actually just the x. I did the plots with matplotlib.
It's probably better to show just a histogram of randomly created variable on a distribution. This shows that the x values that are around the mean value of your PDF function are the most likely ones to get accepted, and therefore more randomly created variables with those approximate values will be created.
Additionally I assume you would be interested in implementation of the kiss Random number generator. IT IS VERY IMPORTANT YOU HAVE A VERY GOOD GENERATOR. I dare to say to an extent kiss doesn't probably cut it (mersene twister is used often).
Random.h
#pragma once
#include <stdlib.h>
const unsigned RNG_MAX=4294967295;
namespace kiss{
// unsigned int kiss_z, kiss_w, kiss_jsr, kiss_jcong;
unsigned int RanUns();
void RunGen();
double Ran0(int upper_border);
double Ran(double bottom_border, double upper_border);
}
namespace Crand{
double Ran0(int upper_border);
double Ran(double bottom_border, double upper_border);
}
Kiss.cpp
#include "Random.h"
unsigned int kiss_z = 123456789; //od 1 do milijardu
unsigned int kiss_w = 378295763; //od 1 do milijardu
unsigned int kiss_jsr = 294827495; //od 1 do RNG_MAX
unsigned int kiss_jcong = 495749385; //od 0 do RNG_MAX
//KISS99*
//Autor: George Marsaglia
unsigned int kiss::RanUns()
{
kiss_z=36969*(kiss_z&65535)+(kiss_z>>16);
kiss_w=18000*(kiss_w&65535)+(kiss_w>>16);
kiss_jsr^=(kiss_jsr<<13);
kiss_jsr^=(kiss_jsr>>17);
kiss_jsr^=(kiss_jsr<<5);
kiss_jcong=69069*kiss_jcong+1234567;
return (((kiss_z<<16)+kiss_w)^kiss_jcong)+kiss_jsr;
}
void kiss::RunGen()
{
for (int i=0; i<2000; i++)
kiss::RanUns();
}
double kiss::Ran0(int upper_border)
{
unsigned velicinaIntervala = RNG_MAX / upper_border;
unsigned granicaIzbora= velicinaIntervala*upper_border;
unsigned slucajniBroj = kiss::RanUns();
while(slucajniBroj>=granicaIzbora)
slucajniBroj = kiss::RanUns();
return slucajniBroj/velicinaIntervala;
}
double kiss::Ran (double bottom_border, double upper_border)
{
return bottom_border+(upper_border-bottom_border)*kiss::Ran0(100000)/(100001.0);
}
Additionally there's the standard C random generators:
CRands.cpp
#include "Random.h"
//standardni pseudo random generatori iz C-a
double Crand::Ran0(int upper_border)
{
return rand()%upper_border;
}
double Crand::Ran (double bottom_border, double upper_border)
{
return (upper_border-bottom_border)*rand()/((double)RAND_MAX+1);
}
It's worthy also to comment on the (b) graph above. When you have a very badly behaved PDF, PDF(x) will vary significantly between large numbers and very small ones.
Issue with that is that the interval area Ch(x) will match the extreme values of the PDF well, but since we create a random variable y for small values of PDF(x) as well; the chances of accepting that value are minute! It is more likely that the generated y value will always be larger than PDF(x) at that point. This means that you'll spend a lot of cycles creating numbers that won't get chosen and that all your chosen random numbers will be very locally bound to the max of your PDF.
That's why it's often useful not to have the same Ch(x) intervals everywhere, but to define a parametrized set of intervals. However this adds a fair bit of complexity to the code.
Where do you set your limits? How to deal with borderline cases? When and how to determine that you indeed need to suddenly use this approach? Calculating max might not be as simple now, depending on the method you originally envisioned would be doing this.
Additionally now you have to correct for the fact that a lot more numbers get accepted more easily in the areas where your Ch(x) box height is lower which skews the original PDF.
This can be corrected by weighing numbers created in the lowered boundary by the ratio of heights of higher and lower boundary, basically you repeat the y step one more time. Create a random number z from 0 to 1 and compare it to the ratio lower_height/higher_height, guaranteed to be <1. If z is smaller than the ratio: accept x and if it's larger reject.
Generalizations of code presented are also possible by writing a function, that takes in an object pointer instead. By defining your own class i.e. function which would generally describe functions, have a eval method at a point, be able to store your parameters, calculate and store it's own max/min values and zero/cutoff points, you wouldn't have to pass, or define them in a function like I did.
Good Luck have fun!
tl;dr: Raise a uniform 0 to 1 distribution to the power (1 - m) / m where m is the desired mean (between 0 and 1). Shift/scale as desired.
I was curious about how to implement this. I figured a trapezoid would be the easiest method, but then you're limited in that the most extreme mean you can get is with a triangle, which isn't that extreme. The math started getting hard, so I reverted to a purely empirical method that seems to work pretty well.
Anyways, for a distribution, how about starting with the uniform [0, 1) distribution and raising the values to some arbitrary power. Square them and the distribution shifts to the right. Square root them and they shift to the left. You can go to whatever extreme you want and shove the distribution as hard as you want.
def randompow(p):
return random.random() ** p
(Everything's written in Python, but should be easy enough to translate. If something's unclear, just ask. random.random() returns floats from 0 to 1)
So, how do we adjust that power? Well, how's the mean seem to shift with varying powers?
Looks like some sort of sigmoid curve. There are lots of sigmoid functions, but hyperbolic tangent seems to work pretty well.
Not 100% there, lets try to scale it in the X direction...
# x are the values from -3 to 3 (log transformed from the powers used)
# y are the empirically-determined means given all those powers
def fitter(tanscale):
xsc = tanscale * x
sigtan = np.tanh(xsc)
sigtan = (1 - sigtan) / 2
resid = sigtan - y
return sum(resid**2)
fit = scipy.optimize.minimize(fitter, 1)
The fitter says the best scaling factor is 1.1514088816214016. The residuals are actually pretty low, so sounds good.
Implementing the inverse of all the math I didn't talk about looks like:
def distpow(mean):
p = 1 - (mean * 2)
p = np.arctanh(p) / 1.1514088816214016
return 10**p
That gives us the power to use in the first function to get whatever mean to the distribution. A factory function can return a method to churn out a bunch of numbers from the distribution with the desired mean
def randommean(mean):
p = distpow(mean)
def f():
return random.random() ** p
return f
How's it do? Reasonably well out to 3-4 decimals:
for x in [0.01, 0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 0.99]:
f = randommean(x)
# sample the distribution 10 million times
mean = np.mean([f() for _ in range(10000000)])
print('Target mean: {:0.6f}, actual: {:0.6f}'.format(x, mean))
Target mean: 0.010000, actual: 0.010030
Target mean: 0.100000, actual: 0.100122
Target mean: 0.200000, actual: 0.199990
Target mean: 0.400000, actual: 0.400051
Target mean: 0.500000, actual: 0.499905
Target mean: 0.600000, actual: 0.599997
Target mean: 0.800000, actual: 0.799999
Target mean: 0.900000, actual: 0.899972
Target mean: 0.990000, actual: 0.989996
A more succinct function that just gives you a value given a mean (not a factory function):
def randommean(m):
p = np.arctanh(1 - (2 * m)) / 1.1514088816214016
return random.random() ** (10 ** p)
Edit: fitting against the natural log of the mean instead of log10 gave a residual suspiciously close to 0.5. Doing some math to simplify out the arctanh gives:
def randommean(m):
'''Return a value from the distribution 0 to 1 with average *m*'''
return random.random() ** ((1 - m) / m)
From here it should be fairly easy to shift, rescale, and round off the distribution. The truncating-to-integer might end up shifting the mean by 1 (or half a unit?), so that's an unsolved problem (if it matters).
You simply define 2 distributions dist1 operating in [1000, 7000] and dist2 operating in [7000, 10000].
Let's call m1 the mean of dist1 and m2 the mean of dist2.
You are looking for a mixture between dist1and dist2the mean of which is 7000.
You must adjust the weights (w1, w2 = 1-w1) such as :
7000 = w1 * m1 + w2 * m2
which leads to:
w1 = (m2 - 7000) / (m2 - m1)
Using the OpenTURNS library, the code will look as follow:
import openturns as ot
dist1 = ot.Uniform(1000, 7000)
dist2 = ot.Uniform(7000, 10000)
m1 = dist1.getMean()[0]
m2 = dist2.getMean()[0]
w = (m2 - 7000) / (m2 - m1)
dist = ot.Mixture([dist1, dist2], [w, 1 - w])
print ("Mean of dist = ", dist.getMean())
>>> Mean of dist = [7000]
Now you can draw a sample of size N by calling dist.getSample(N). For instance:
print(dist.getSample(10))
>>> [ X0 ]
0 : [ 3019.97 ]
1 : [ 7682.17 ]
2 : [ 9035.1 ]
3 : [ 8873.59 ]
4 : [ 5217.08 ]
5 : [ 6329.67 ]
6 : [ 9791.22 ]
7 : [ 7786.76 ]
8 : [ 7046.59 ]
9 : [ 7088.48 ]
What i would love to do is to create a function that takes a parameter that is the limit of which number the random generation should create. I have experienced that some generators that just repeat the number generated over and over again.
How can I make a generator that doesn't return the same number consecutively. Can someone please help me to achieve my goal?
int randomGen(int max)
{
int n;
return n;
}
The simplest way to get uniformly distributed results from rand is something like this:
int limited_rand(int limit)
{
int r, d = RAND_MAX / limit;
limit *= d;
do { r = rand(); } while (r >= limit);
return r / d;
}
The result will be in the range 0 to limit-1, and each will occur with equal probability as long as the values 0 through RAND_MAX all had equal probability with the original rand function.
Other methods such as modular arithmetic or dividing without the loop I used introduce bias. Methods that go through floating point intermediates do not avoid this problem. Getting good random floating point numbers from rand is at least as difficult. Using my function for integers (or an improvement of it) is a good place to start if you want random floats.
Edit: Here's an explanation of what I mean by bias. Suppose RAND_MAX is 7 and limit is 5. Suppose (if this is a good rand function) that the outputs 0, 1, 2, ..., 7 are all equally likely. Taking rand()%5 would map 0, 1, 2, 3, and 4 to themselves, but map 5, 6, and 7 to 0, 1, and 2. This means the values 0, 1, and 2 are twice as likely to pop up as the values 3 and 4. A similar phenomenon happens if you try to rescale and divide, for instance using rand()*(double)limit/(RAND_MAX+1) Here, 0 and 1 map to 0, 2 and 3 map to 1, 4 maps to 2, 5 and 6 map to 3, and 7 maps to 4.
These effects are somewhat mitigated by the magnitude of RAND_MAX, but they can come back if limit is large. By the way, as others have said, with linear congruence PRNGs (the typical implementation of rand), the low bits tend to behave very badly, so using modular arithmetic when limit is a power of 2 may avoid the bias problem I described (since limit usually divides RAND_MAX+1 evenly in this case), but you run into a different problem in its place.
How about this:
int randomGen(int limit)
{
return rand() % limit;
}
/* ... */
int main()
{
srand(time(NULL));
printf("%d", randomGen(2041));
return 0;
}
Any pseudo-random generator will repeat the values over and over again with some period. C only has rand(), if you use that you should definitively initialize the random seed with srand(). But probably your platform has better than that.
On POSIX systems there is a whole family of functions that you should find under the man drand48 page. They have a well defined period and quality. You probably find what you need, there.
Without explicit knowledge of the random generator of your platform, do not do rand() % max. The low-order bytes of simple random number generators are usually not random at all.
Use instead (returns a number between min inclusive and max non-inclusive):
int randomIntegerInRange(int min, int max)
{
double tmp = (double)rand() / (RAND_MAX - 1.);
return min + (int)floor(tmp * (max - min));
}
Update: The solution above is biased (see comments for explanation), and will likely not produce uniform results. I do not delete it since it is a non natural example of what not to do. Please use rejection methods as recommended elsewhere in this thread.