C - generate random numbers within an interval with respect to a mean - c

I need to generate a set of random numbers within an interval which also happens to have a mean value. For instance min = 1000, max = 10000 and a mean of 7000. I know how to create numbers within a range but I am struggling with the mean value thing. Is there a function that I can use?

What you're looking for is most easily done with the so-called acceptance-rejection method.
Split your interval into smaller intervals.
Specify a probability density function (PDF); it can be a very simple one, like a step function. For a Gaussian distribution you would have left and right steps lower than your middle step (see the image below, which shows a more general distribution).
Generate a random number x in the whole interval, plus a second random number y between 0 and the maximum of your PDF. If y is greater than the value of your PDF at x, reject x.
Repeat the steps until you get the desired number of points.
EDIT 1
Proof of concept on a Gaussian PDF.
Ok, so the basic idea is shown in graph (a).
1. Define/pick your probability density function (PDF). A PDF is a function of, statistically speaking, a random variable and describes the probability of finding the value x in a measurement/experiment. A function can be a PDF of a random variable x if it satisfies: 1) f(x) >= 0 and 2) it's normalized (meaning it sums, or integrates, up to the value 1).
2. Get the maximum (max) and "zero points" (z1 < z2) of the PDF. Some PDFs can have their zero points at infinity. In that case, determine cutoff points (z1, z2) such that PDF(x) < eta for every x < z1 and every x > z2, where you pick eta yourself. This basically means: set some small-ish value eta and then say your zero points are those values beyond which PDF(x) is smaller than eta.
3. Define the interval Ch(z1, z2, max) of your random generator. This is the interval in which you generate your random variables.
4. Generate a random variable x such that z1 < x < z2.
5. Generate a second, independent random variable y in the range (0, max). If y is larger than PDF(x), reject both randomly generated values (x, y) and go back to step 4. If y is smaller than PDF(x), accept the value x as the randomly generated point on the distribution and return it.
Here's the code that reproduces similar behavior for a Gaussian PDF.
#include "Random.h"
#include <cmath>
#include <fstream>
using namespace std;
double gaus(double a, double b, double c, double x)
{
return a*exp( -((x-b)*(x-b)/(2*c*c) ));
}
double* random_on_a_gaus_distribution(double inter_a, double inter_b)
{
static double res [2]; //static, so the returned pointer stays valid after the function returns
double a = 1.0; //currently parameters for the Gaussian
double b = 2.0; //are defined here to avoid having
double c = 3.0; //a long function declaration line.
double x = kiss::Ran(inter_a, inter_b);
double y = kiss::Ran(0.0, 1.0);
while (y>gaus(a,b,c,x)) //keep creating values until step 5. is satisfied.
{
x = kiss::Ran(inter_a, inter_b); //this is interval (z1, z2)
y = kiss::Ran(0.0, 1.0); //this is the interval (0, max)
}
res[0] = x;
res[1] = y;
return res; //I return (x,y) for plot reasons, only x is the randomly
} //generated value you're looking for.
int main()
{
double* x;
ofstream f;
f.open("test.txt");
for(int i=0; i<100000; i++)
{
//see below how I got -5 and 10 to be my interval (z1, z2)
x = random_on_a_gaus_distribution(-5.0, 10.0);
f << x[0]<<","<<x[1]<<endl;
}
f.close();
}
Step 1
So first we define a general look of a Gaussian PDF in a function called gaus. Simple.
Then we define a function random_on_a_gaus_distribution which uses a well-defined Gaussian function. In an experiment/measurement we would get the coefficients a, b, c by fitting our function. I picked some arbitrary ones (1, 2, 3) for this example; you can pick the ones that satisfy your HW assignment (that is: coefficients that make a Gaussian with a mean of 7000).
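For the numbers in the question, the same accept/reject loop could be sketched in Python as follows. This is only an illustration: the width c = 1000 is an assumed value that does not come from the question, and the function names are made up here. Because the interval [1000, 10000] cuts the right tail closer to the peak than the left one, the sample mean lands slightly below 7000 (by a few units for this width).

```python
import math
import random

def gauss_shape(x, b, c):
    """Unnormalised Gaussian with peak value 1 at x = b (a = 1 in the notation above)."""
    return math.exp(-((x - b) ** 2) / (2.0 * c * c))

def sample_on_gauss(lo, hi, b, c, rng=random):
    """Acceptance-rejection: draw x in (lo, hi) and y in (0, 1); keep x once y <= shape(x)."""
    while True:
        x = rng.uniform(lo, hi)
        y = rng.uniform(0.0, 1.0)  # 1.0 is the maximum of gauss_shape
        if y <= gauss_shape(x, b, c):
            return x

if __name__ == "__main__":
    rng = random.Random(12345)  # fixed seed so runs are repeatable
    # c = 1000.0 is an assumed width, chosen only for this example
    values = [sample_on_gauss(1000.0, 10000.0, 7000.0, 1000.0, rng) for _ in range(20000)]
    print(min(values), max(values), sum(values) / len(values))
```

A narrower c keeps the mean closer to 7000 but concentrates the values; a wider c spreads them out and pulls the mean further down because of the asymmetric cutoff.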
Step 2 and 3
I used Wolfram Mathematica to plot gaus with parameters 1, 2, 3 to see what would be the most appropriate values for max and (z1, z2). You can see the graph yourself. The maximum of the function is 1.0 and, via the ancient method of science called eyeballin', I estimated that the cutoff points are -5.0 and 10.0.
To make random_on_a_gaus_distribution more general you could follow step 2 more rigorously: define eta and then evaluate your function at successive points until the PDF gets smaller than eta. The danger is that your cutoff points can be very far apart, and this could take long for very slowly decaying functions. Additionally you have to find the maximum yourself. This is generally tricky; however, it is the same problem as minimizing the negative of the function. That can also be tricky in the general case, but it's not "undoable". The easiest way is to cheat a bit like I did and just hard-code this for a couple of functions only.
Step 4 and 5
And then you bash away. Just keep creating new and new points until you reach satisfactory hit. DO NOTICE the returned number x is a random number. You wouldn't be able to find a logical link between two successively created x values, or first created x and the millionth.
However the number of accepted x values in the interval around the x_max of our distribution is greater than the number of x values created in intervals for which PDF(x) < PDF(x_max).
This just means that your random numbers will be weighted within the chosen interval in such manner that the larger PDF value for a random variable x will correspond to more random points accepted in a small interval around that value than around any other value of xi for which PDF(xi)<PDF(x).
I returned both x and y to be able to plot the graph below; however, what you're looking to return is actually just the x. I did the plots with matplotlib.
It's probably better to show just a histogram of the randomly created variable on a distribution. This shows that the x values around the mean value of your PDF are the most likely ones to get accepted, and therefore more randomly created variables with those approximate values will be created.
Additionally I assume you would be interested in the implementation of the kiss random number generator. IT IS VERY IMPORTANT THAT YOU HAVE A VERY GOOD GENERATOR. I dare say that, to an extent, kiss probably doesn't cut it (the Mersenne Twister is often used).
Random.h
#pragma once
#include <stdlib.h>
const unsigned RNG_MAX=4294967295;
namespace kiss{
// unsigned int kiss_z, kiss_w, kiss_jsr, kiss_jcong;
unsigned int RanUns();
void RunGen();
double Ran0(int upper_border);
double Ran(double bottom_border, double upper_border);
}
namespace Crand{
double Ran0(int upper_border);
double Ran(double bottom_border, double upper_border);
}
Kiss.cpp
#include "Random.h"
unsigned int kiss_z = 123456789; //from 1 to a billion
unsigned int kiss_w = 378295763; //from 1 to a billion
unsigned int kiss_jsr = 294827495; //from 1 to RNG_MAX
unsigned int kiss_jcong = 495749385; //from 0 to RNG_MAX
//KISS99*
//Author: George Marsaglia
unsigned int kiss::RanUns()
{
kiss_z=36969*(kiss_z&65535)+(kiss_z>>16);
kiss_w=18000*(kiss_w&65535)+(kiss_w>>16);
kiss_jsr^=(kiss_jsr<<13);
kiss_jsr^=(kiss_jsr>>17);
kiss_jsr^=(kiss_jsr<<5);
kiss_jcong=69069*kiss_jcong+1234567;
return (((kiss_z<<16)+kiss_w)^kiss_jcong)+kiss_jsr;
}
void kiss::RunGen()
{
for (int i=0; i<2000; i++)
kiss::RanUns();
}
double kiss::Ran0(int upper_border)
{
unsigned velicinaIntervala = RNG_MAX / upper_border; //interval size
unsigned granicaIzbora= velicinaIntervala*upper_border; //selection limit
unsigned slucajniBroj = kiss::RanUns(); //random number
while(slucajniBroj>=granicaIzbora) //redraw values above the limit to avoid modulo bias
slucajniBroj = kiss::RanUns();
return slucajniBroj/velicinaIntervala;
}
double kiss::Ran (double bottom_border, double upper_border)
{
return bottom_border+(upper_border-bottom_border)*kiss::Ran0(100000)/(100001.0);
}
Additionally there's the standard C random generators:
CRands.cpp
#include "Random.h"
//standard pseudo-random generators from C
double Crand::Ran0(int upper_border)
{
return rand()%upper_border;
}
double Crand::Ran (double bottom_border, double upper_border)
{
return bottom_border+(upper_border-bottom_border)*rand()/((double)RAND_MAX+1);
}
It's also worth commenting on graph (b) above. When you have a very badly behaved PDF, PDF(x) will vary significantly between large values and very small ones.
Issue with that is that the interval area Ch(x) will match the extreme values of the PDF well, but since we create a random variable y for small values of PDF(x) as well; the chances of accepting that value are minute! It is more likely that the generated y value will always be larger than PDF(x) at that point. This means that you'll spend a lot of cycles creating numbers that won't get chosen and that all your chosen random numbers will be very locally bound to the max of your PDF.
That's why it's often useful not to have the same Ch(x) intervals everywhere, but to define a parametrized set of intervals. However this adds a fair bit of complexity to the code.
Where do you set your limits? How to deal with borderline cases? When and how to determine that you indeed need to suddenly use this approach? Calculating max might not be as simple now, depending on the method you originally envisioned would be doing this.
Additionally now you have to correct for the fact that a lot more numbers get accepted more easily in the areas where your Ch(x) box height is lower which skews the original PDF.
This can be corrected by weighing numbers created in the lowered boundary by the ratio of heights of higher and lower boundary, basically you repeat the y step one more time. Create a random number z from 0 to 1 and compare it to the ratio lower_height/higher_height, guaranteed to be <1. If z is smaller than the ratio: accept x and if it's larger reject.
Generalizations of the code presented are also possible by writing a function that takes an object pointer instead. By defining your own class, i.e. a function class that describes functions in general, has an eval method at a point, can store your parameters, and calculates and stores its own max/min values and zero/cutoff points, you wouldn't have to pass them, or define them in a function like I did.
Good Luck have fun!

tl;dr: Raise a uniform 0 to 1 distribution to the power (1 - m) / m where m is the desired mean (between 0 and 1). Shift/scale as desired.
I was curious about how to implement this. I figured a trapezoid would be the easiest method, but then you're limited in that the most extreme mean you can get is with a triangle, which isn't that extreme. The math started getting hard, so I reverted to a purely empirical method that seems to work pretty well.
Anyways, for a distribution, how about starting with the uniform [0, 1) distribution and raising the values to some arbitrary power. Square them and the values get pushed toward 0, so the mean drops. Square root them and they get pushed toward 1, so the mean rises. You can go to whatever extreme you want and shove the distribution as hard as you want.
import random

def randompow(p):
    return random.random() ** p
(Everything's written in Python, but should be easy enough to translate. If something's unclear, just ask. random.random() returns floats from 0 to 1)
So, how do we adjust that power? Well, how's the mean seem to shift with varying powers?
Looks like some sort of sigmoid curve. There are lots of sigmoid functions, but hyperbolic tangent seems to work pretty well.
Not 100% there, let's try to scale it in the X direction...
import numpy as np
import scipy.optimize

# x are the values from -3 to 3 (log transformed from the powers used)
# y are the empirically-determined means given all those powers
def fitter(tanscale):
    xsc = tanscale * x
    sigtan = np.tanh(xsc)
    sigtan = (1 - sigtan) / 2
    resid = sigtan - y
    return sum(resid**2)

fit = scipy.optimize.minimize(fitter, 1)
The fitter says the best scaling factor is 1.1514088816214016. The residuals are actually pretty low, so sounds good.
Implementing the inverse of all the math I didn't talk about looks like:
def distpow(mean):
    p = 1 - (mean * 2)
    p = np.arctanh(p) / 1.1514088816214016
    return 10**p
That gives us the power to use in the first function to get whatever mean to the distribution. A factory function can return a method to churn out a bunch of numbers from the distribution with the desired mean
def randommean(mean):
    p = distpow(mean)
    def f():
        return random.random() ** p
    return f
How's it do? Reasonably well out to 3-4 decimals:
for x in [0.01, 0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 0.99]:
    f = randommean(x)
    # sample the distribution 10 million times
    mean = np.mean([f() for _ in range(10000000)])
    print('Target mean: {:0.6f}, actual: {:0.6f}'.format(x, mean))
Target mean: 0.010000, actual: 0.010030
Target mean: 0.100000, actual: 0.100122
Target mean: 0.200000, actual: 0.199990
Target mean: 0.400000, actual: 0.400051
Target mean: 0.500000, actual: 0.499905
Target mean: 0.600000, actual: 0.599997
Target mean: 0.800000, actual: 0.799999
Target mean: 0.900000, actual: 0.899972
Target mean: 0.990000, actual: 0.989996
A more succinct function that just gives you a value given a mean (not a factory function):
def randommean(m):
    p = np.arctanh(1 - (2 * m)) / 1.1514088816214016
    return random.random() ** (10 ** p)
Edit: fitting against the natural log of the mean instead of log10 gave a residual suspiciously close to 0.5. Doing some math to simplify out the arctanh gives:
def randommean(m):
    '''Return a value from the distribution 0 to 1 with average *m*'''
    return random.random() ** ((1 - m) / m)
From here it should be fairly easy to shift, rescale, and round off the distribution. The truncating-to-integer might end up shifting the mean by 1 (or half a unit?), so that's an unsolved problem (if it matters).
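As a quick check of the closed form: for U uniform on [0, 1), E[U^p] = 1/(p + 1), so p = (1 - m)/m gives a mean of exactly m. A minimal sketch of the shift-and-scale step for the question's numbers (min 1000, max 10000, mean 7000) could look like this; random_with_mean is a name made up for this illustration, and no rounding to integers is attempted.

```python
import random

def random_with_mean(lo, hi, mean, rng=random):
    """Draw a value in [lo, hi] from a distribution whose mean is `mean`.

    Uses U ** ((1 - m) / m), where m is the target mean rescaled to (0, 1).
    """
    if not lo < mean < hi:
        raise ValueError("mean must lie strictly between lo and hi")
    m = (mean - lo) / (hi - lo)   # target mean on the unit interval
    p = (1.0 - m) / m             # E[U ** p] = 1 / (p + 1) = m
    return lo + (hi - lo) * rng.random() ** p

if __name__ == "__main__":
    rng = random.Random(2024)     # fixed seed so runs are repeatable
    values = [random_with_mean(1000, 10000, 7000, rng) for _ in range(100000)]
    print(sum(values) / len(values))  # close to 7000, up to sampling noise
```

For these numbers m = 2/3 and p = 1/2, so each value is 1000 + 9000 * sqrt(U).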

You simply define 2 distributions dist1 operating in [1000, 7000] and dist2 operating in [7000, 10000].
Let's call m1 the mean of dist1 and m2 the mean of dist2.
You are looking for a mixture of dist1 and dist2 whose mean is 7000.
You must adjust the weights (w1, w2 = 1 - w1) such that:
7000 = w1 * m1 + w2 * m2
which leads to:
w1 = (m2 - 7000) / (m2 - m1)
Using the OpenTURNS library, the code will look as follows:
import openturns as ot
dist1 = ot.Uniform(1000, 7000)
dist2 = ot.Uniform(7000, 10000)
m1 = dist1.getMean()[0]
m2 = dist2.getMean()[0]
w = (m2 - 7000) / (m2 - m1)
dist = ot.Mixture([dist1, dist2], [w, 1 - w])
print ("Mean of dist = ", dist.getMean())
>>> Mean of dist = [7000]
Now you can draw a sample of size N by calling dist.getSample(N). For instance:
print(dist.getSample(10))
>>> [ X0 ]
0 : [ 3019.97 ]
1 : [ 7682.17 ]
2 : [ 9035.1 ]
3 : [ 8873.59 ]
4 : [ 5217.08 ]
5 : [ 6329.67 ]
6 : [ 9791.22 ]
7 : [ 7786.76 ]
8 : [ 7046.59 ]
9 : [ 7088.48 ]
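If OpenTURNS is not available, the same two-piece mixture can be sketched with the Python standard library alone. The function name is an assumption made for this illustration; the weight is the w1 formula above, which for m1 = 4000 and m2 = 8500 gives w1 = 1/3.

```python
import random

def mixture_sample(lo, hi, mean, rng=random):
    """Mix Uniform(lo, mean) and Uniform(mean, hi) so that the overall mean is `mean`."""
    m1 = (lo + mean) / 2.0           # mean of the lower uniform piece
    m2 = (mean + hi) / 2.0           # mean of the upper uniform piece
    w1 = (m2 - mean) / (m2 - m1)     # weight of the lower piece
    if rng.random() < w1:
        return rng.uniform(lo, mean)
    return rng.uniform(mean, hi)

if __name__ == "__main__":
    rng = random.Random(42)          # fixed seed so runs are repeatable
    sample = [mixture_sample(1000.0, 10000.0, 7000.0, rng) for _ in range(100000)]
    print(sum(sample) / len(sample))  # close to 7000, up to sampling noise
```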

Related

Monte Carlo integration of the Gaussian function f(x) = exp(-x^2/2) in C incorrect output

I'm writing a short program to approximate the definite integral of the Gaussian function f(x) = exp(-x^2/2), and my code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
double gaussian(double x) {
return exp((-pow(x,2))/2);
}
int main(void) {
srand(0);
double valIntegral, yReal = 0, xRand, yRand, yBound;
int xMin, xMax, numTrials, countY = 0;
do {
printf("Please enter the number of trials (n): ");
scanf("%d", &numTrials);
if (numTrials < 1) {
printf("Exiting.\n");
return 0;
}
printf("Enter the interval of integration (a b): ");
scanf("%d %d", &xMin, &xMax);
while (xMin > xMax) { //keeps looping until a valid interval is entered
printf("Invalid interval!\n");
printf("Enter the interval of integration (a b): ");
scanf("%d %d", &xMin, &xMax);
}
//check real y upper bound
if (gaussian((double)xMax) > gaussian((double)xMin))
yBound = gaussian((double)xMax);
else
yBound = gaussian((double)xMin);
for (int i = 0; i < numTrials; i++) {
xRand = (rand()% ((xMax-xMin)*1000 + 1))/1000.00 + xMin; //generate random x value between xMin and xMax to 3 decimal places
yRand = (rand()% (int)(yBound*1000 + 1))/1000.00; //generate random y value between 0 and yBound to 3 decimal places
yReal = gaussian(xRand);
if (yRand < yReal)
countY++;
}
valIntegral = (xMax-xMin)*((double)countY/numTrials);
printf("Integral of exp(-x^2/2) on [%.3lf, %.3lf] with n = %d trials is: %.3lf\n\n", (double)xMin, (double)xMax, numTrials, valIntegral);
countY = 0; //reset countY to 0 for the next run
} while (numTrials >= 1);
return 0;
}
However, the outputs from my code don't match the solutions. I tried to debug and print out all xRand, yRand and yReal values for 100 trials (and checked the yReal value for particular xRand values with Matlab, in case I had any typos), and those values didn't seem to be out of range in any way... I don't know where my mistake is.
The correct output for # of trials = 100 on [0, 1] is 0.810, and mine is 0.880; the correct output for # of trials = 50 on [-1, 0] is 0.900, and mine was 0.940. Can anyone find what I did wrong? Thanks a lot.
Another question is, I can't find a reference to the use of following code:
double randomNumber = rand() / (double) RAND_MAX;
but it was provided by the instructor and he said it would generate a random number from 0 to 1. Why did he use '/' instead of '%' after "rand()"?
There are a few logical errors / discussion points in your code, both mathematics- and programming-wise.
First of all, just to get it out of the way, we're talking about the standard Gaussian here, i.e.
$f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
except that the definition in your gaussian function omits the $\frac{1}{\sqrt{2\pi}}$ normalising term. Given the outputs you seem to expect, this seems to have been done on purpose. Fair enough. But if you wanted to calculate the actual integral, such that a practically infinite range (e.g. [-1000, 1000]) would sum up to 1, then you would need that term.
Is my code logically correct?
No. Your code has two logical errors: one in the if statement that sets yBound, and one in the calculation of valIntegral, which is a direct consequence of the first logical error.
For the first error, consider the following plot to see why:
Your Monte Carlo process effectively considers a bounded box over a certain range, and then says "I will randomly place points inside this box, and then count the proportion of the total number of points that randomly fell under the curve; the integral estimate is then the area of the bounded box itself, times this proportion".
Now, if both $x_{\min}$ and $x_{\max}$ are to the left of the mean (i.e. 0), then your if statement correctly sets the box's upper bound (i.e. yBound) to $f(x_{\max})$, such that the topmost bound of the box contains the highest part of that curve. So, e.g., to estimate the integral for the range [-2,-1], you set the upper bound to $f(-1)$.
Similarly, if both $x_{\min}$ and $x_{\max}$ are to the right of the mean, then you correctly set yBound to $f(x_{\min})$.
However, if $x_{\min} < 0 < x_{\max}$, you should be setting yBound to neither $f(x_{\min})$ nor $f(x_{\max})$, since the 0 point is higher than both! So in this case, your yBound should simply be at the peak of the Gaussian, i.e. $f(0)$ (which, in your case of an unnormalised Gaussian, takes a value of 1).
Therefore, the correct if statement is as follows:
if (xMax < 0.0)
{ yBound = gaussian((double)xMax); }
else if (xMin > 0.0)
{ yBound = gaussian((double)xMin); }
else
{ yBound = gaussian(0.0); }
As for the second logical error, we already mentioned that the value of the integral is the "area of the bounding box" times the "proportion of successes". However, you seem to ignore the height of the box in your calculation. It is true that in the special case where $x_{\min} \le 0 \le x_{\max}$, the height of the box for your unnormalised Gaussian is 1, therefore this term can be omitted. I suspect that this is why it may have been missed. However, in the other two cases, the height of the bounding box is necessarily less than 1, and therefore needs to be included in the calculation. So the correct calculation of valIntegral should be:
valIntegral = yBound * (xMax-xMin) * (((double)countY)/numTrials);
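Putting both fixes together, the estimator might look like the following sketch. It is written in Python for brevity rather than as a drop-in replacement for the C program; mc_integral and exact_integral are hypothetical names, and the exact value used for comparison comes from the error function, sqrt(pi/2) * (erf(b/sqrt(2)) - erf(a/sqrt(2))).

```python
import math
import random

def gaussian(x):
    """Unnormalised Gaussian exp(-x^2 / 2), as in the question."""
    return math.exp(-x * x / 2.0)

def mc_integral(x_min, x_max, trials, rng=random):
    """Hit-or-miss Monte Carlo estimate of the integral of gaussian on [x_min, x_max]."""
    # Height of the bounding box: the largest value of the curve on the interval.
    if x_max < 0.0:
        y_bound = gaussian(x_max)
    elif x_min > 0.0:
        y_bound = gaussian(x_min)
    else:
        y_bound = gaussian(0.0)
    hits = 0
    for _ in range(trials):
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(0.0, y_bound)
        if y < gaussian(x):
            hits += 1
    # Box area (height times width) times the fraction of points under the curve.
    return y_bound * (x_max - x_min) * hits / trials

def exact_integral(a, b):
    """Closed form via the error function, for comparison only."""
    root2 = math.sqrt(2.0)
    return math.sqrt(math.pi / 2.0) * (math.erf(b / root2) - math.erf(a / root2))

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for a, b in [(0.0, 1.0), (1.0, 2.0), (-2.0, -1.0)]:
        print(a, b, mc_integral(a, b, 200000, rng), exact_integral(a, b))
```

On intervals such as [1, 2] that do not contain 0, dropping the y_bound factor from the last line would overestimate the integral by a factor of 1/y_bound.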
Why am I not getting the correct output?
Even despite the above logical errors, as we've discussed above, your output should have been correct for the specific intervals [0,1] and [-1,0] (since they include the mean and therefore the correct yBound of 1). So why are you still getting a 'wrong' output?
The answer is, you are not. Your output is "correct". Except, a Monte Carlo process involves randomness, and 100 trials is not a big enough number to lead to consistent results. If you run the same range for 100 trials again and again, you'll see you'll get very different results each time (though, overall, they'll be distributed around the right value). Run with 1000000 trials, and you'll see that the result becomes a lot more precise.
What's up with that randomNumber code?
The rand() function returns an integer in the range [0, RAND_MAX], where RAND_MAX is system-specific (have a look at man 3 rand).
The modulo approach (i.e. %) works as follows: consider the range [-0.1, 0.3]. This range spans 0.4 units. 0.4 * 1000 + 1 = 401. For a random number from 0 to RAND_MAX, doing rand() modulo 401 will always result in a random number in the range [0,400]. If you then divide this back by 1000, you get a random number in the range [0, 0.4]. Add this to your xmin offset (here: -0.1) and you get a random number in the range [-0.1, 0.3].
In theory, this makes sense. However, unfortunately, as already pointed out in the other answer here, as a method it is susceptible to modulo bias, because RAND_MAX isn't necessarily exactly divisible by 401, therefore the top part of that range leading up to RAND_MAX overrepresents some numbers compared to others.
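The size of that bias can be counted directly. The sketch below assumes RAND_MAX = 32767, the smallest value the C standard permits (your system's value may be larger), and tallies how often each residue modulo 401 occurs if every possible rand() output appears exactly once.

```python
from collections import Counter

RAND_MAX = 32767   # assumption: the minimum RAND_MAX allowed by the C standard
MODULUS = 401      # (0.3 - (-0.1)) * 1000 + 1, as in the example above

# Count how many of the RAND_MAX + 1 possible rand() outputs map to each residue.
counts = Counter(value % MODULUS for value in range(RAND_MAX + 1))

highest = max(counts.values())
lowest = min(counts.values())
more_frequent = [r for r in range(MODULUS) if counts[r] == highest]
less_frequent = [r for r in range(MODULUS) if counts[r] == lowest]

print(highest, lowest)
print(len(more_frequent), len(less_frequent))
```

With these numbers, residues 0 to 286 each occur 82 times and residues 287 to 400 each occur 81 times, so the low end of the range is slightly favoured. The effect shrinks as RAND_MAX grows relative to the modulus.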
By contrast, the approach given to you by your teacher is simply saying: divide the result of the rand() function with RAND_MAX. This effectively normalises the returned random number into the range [0,1]. This is a much more straightforward thing to do, and it avoids modulo bias.
Therefore, the way I would implement this would be to make it into a function:
double randomNumber(void) {
return rand() / (double) RAND_MAX;
}
which then simplifies your computations as follows too:
xRand = randomNumber() * (xMax-xMin) + xMin;
yRand = randomNumber() * yBound;
You can see that this is a much more accurate thing to do, if you use a normalised gaussian, i.e.
double gaussian(double x) {
return exp((-pow(x,2.0))/2.0) / sqrt(2.0 * M_PI);
}
and then compare the two methods. You will see that the randomNumber() method for an "effectively infinite" range (e.g. [-1000,1000]) gives the correct result of 1, whereas the modulo approach tends to give numbers that are larger than 1.
Your code has no obvious bug (there is a bug in the upper bound calculation, as @TasosPapastylianou points out, but it isn't the issue in your test cases). On 100 trials, your answer of 0.880 is closer to the actual value of the integral (0.855624...) than 0.810, and neither of those numbers is so far from the true value as to suggest an outright bug in the code. It seems to be within sampling error (though see below). Here is a histogram of 1000 runs of a Monte Carlo integration (done in R, but with the same algorithm) of e^(-x^2/2) on [0,1] with 100 trials:
Unless your instructor specified the algorithm and the seed in precise detail, you shouldn't expect the exact same answer.
As for your second question about rand() / (double) RAND_MAX: it is an attempt to avoid modulo bias. It is possible that such a bias is affecting your code (especially given the way you round to 3 decimal places), since it does seem to overestimate the integral (based on running it a dozen times or so). Perhaps you could use that in your code and see if you get better results.

A numbers power between 0 and 1 in C

I'm making a program to replace math.h's pow() function.
I'm not using any functions from math.h.
The problem is, I can calculate powers with integer exponents, like
15^-2
45.32^11
but I can't calculate
x^2.132
My program first finds the integer power of x (x^2) and multiplies it by x^0.132.
I know that x^0.132 is the 1000th root of x raised to the power 132, but I can't compute it.
How can I find x^y (0 < y < 1)?
To compute x ^ y, 0 < y < 1:
1. Approximate y as a rational fraction, a/b.
(Easiest way: pick whatever b you want to get sufficient accuracy as a constant. Then use a = b * y, rounded to an integer.)
2. Approximate the b-th root of x using any method you like, such as Newton's.
(Simplest way: you know it's between 0 and max(1, x) and can easily tell if a given value is too low or too high. So keep a min that starts at zero and a max that starts at max(1, x). Repeatedly try (min + max) / 2, see if it's too big or too small, and adjust min or max appropriately. Repeat until min and max are nearly the same.)
3. Raise that to the a-th power.
(Possibly by repeatedly multiplying it by itself. Optimize this if you like. For example, a^4 can be computed with just two multiplications, one to find a^2 and then one to square it. This generalizes easily.)
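The three steps above can be sketched as follows, in Python for brevity and without any math-library calls so that it translates directly to C. The function names are made up for this illustration, and b = 1000 is an assumed default taken from the question's "1000th root" remark.

```python
def int_pow(base, exponent):
    """base ** exponent for a non-negative integer exponent, by repeated squaring."""
    result = 1.0
    while exponent > 0:
        if exponent & 1:
            result *= base
        base *= base
        exponent >>= 1
    return result

def nth_root(x, n, iterations=200):
    """n-th root of x >= 0 by bisection on [0, max(1, x)]."""
    lo, hi = 0.0, max(1.0, x)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if int_pow(mid, n) > x:   # mid is too big
            hi = mid
        else:                     # mid is too small (or exact)
            lo = mid
    return (lo + hi) / 2.0

def frac_pow(x, y, b=1000):
    """Approximate x ** y for x >= 0 and 0 < y < 1 as (x ** (1/b)) ** a, a = round(b * y)."""
    a = int(b * y + 0.5)
    return int_pow(nth_root(x, b), a)

if __name__ == "__main__":
    print(frac_pow(2.0, 0.132), 2.0 ** 0.132)
```

The accuracy is limited by how well a/b approximates y; exponents with more than three decimal places would need a larger b.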
Use the factorization inherent in floating point formats to split x=2^e*m with 1<=m<2 to create the sub-problems 2^(e*y) and m^y
Use square roots, x^y=sqrt(x)^(2*y), and if there is an integer part in 2*y, split that off.
Use the binomial theorem for x close to 1, which will occur when iterating the square root.
(1+h)^y=1+y*h+(y*(y-1))/2*h^2+...+binom(y,j)*h^j+...
where the quotient from one term to the next is (y-j)/(j+1)*h
h=x-1;
term = y*h;
sum = 1+term;
j=1;
while(1+term !=1) {
term *= h*(y-j)/(1+j);
sum += term;
j+=1;
}

Generating random number in sorted order

I want to generate random numbers in sorted order.
I wrote the code below:
void CreateSortedNode(pNode head)
{
int size = 10, last = 0;
pNode temp;
while(size-- > 0) {
temp = (pNode)malloc(sizeof(struct node));
last += (rand()%10);
temp->data = last;//randomly generate number in sorted order
list_add(temp);
}
}
[EDIT:]
I expect the numbers to be generated in increasing or decreasing order, e.g. {2, 5, 9, 23, 45, 68}
int main()
{
int size = 10, last = 0;
while(size-- > 0) {
last += (rand()%10);
printf("%4d",last);
}
return 0;
}
Any better idea?
Solved back in 1979 (by Bentley and Saxe at Carnegie-Mellon):
https://apps.dtic.mil/dtic/tr/fulltext/u2/a066739.pdf
The solution is ridiculously compact in terms of code too!
Their paper is in Pascal, I converted it to Python so it should work with any language:
from random import random
cur_max=100 #desired maximum random number
n=100 #size of the array to fill
x=[0]*(n) #generate an array x of size n
for i in range(n,0,-1):
    cur_max=cur_max*random()**(1/i) #the magic formula
    x[i-1]=cur_max
print(x) #the results
Enjoy your sorted random numbers...
Without any information about sample size or sample universe, it's not easy to know if the following is interesting but irrelevant or a solution, but since it is in any case interesting, here goes.
The problem:
In O(1) space, produce an unbiased ordered random sample of size n from an ordered set S of size N: <S1,S2,…SN>, such that the elements in the sample are in the same order as the elements in the ordered set.
The solution:
1. With probability n/|S|, do the following:
   - add S1 to the sample;
   - decrement n.
2. Remove S1 from S.
3. Repeat steps 1 and 2, each time with the new first element (and size) of S, until n is 0, at which point the sample will have the desired number of elements.
The solution in python:
from random import randrange
# select n random integers in order from range(N)
def sample(n, N):
    # insist that 0 <= n <= N
    for i in range(N):
        if randrange(N - i) < n:
            yield i
            n -= 1
            if n <= 0:
                break
The problem with the solution:
It takes O(N) time. We'd really like to take O(n) time, since n is likely to be much smaller than N. On the other hand, we'd like to retain the O(1) space, in case n is also quite large.
A better solution (outline only)
(The following is adapted from a 1987 paper by Jeffrey Scott Vitter, "An Efficient Algorithm for Sequential Random Sampling". See Dr. Vitter's publications page. Please read the paper for the details.)
Instead of incrementing i and selecting a random number, as in the above python code, it would be cool if we could generate a random number according to some distribution which would be the number of times that i will be incremented without any element being yielded. All we need is the distribution (which will obviously depend on the current values of n and N.)
Of course, we can derive the distribution precisely from an examination of the algorithm. That doesn't help much, though, because the resulting formula requires a lot of time to compute accurately, and the end result is still O(N).
However, we don't always have to compute it accurately. Suppose we have some easily computable reasonably good approximation which consistently underestimates the probabilities (with the consequence that it will sometimes not make a prediction). If that approximation works, we can use it; if not, we'll need to fallback to the accurate computation. If that happens sufficiently rarely, we might be able to achieve O(n) on the average. And indeed, Dr. Vitter's paper shows how to do this. (With code.)
Suppose you wanted to generate just three random numbers, x, y, and z so that they are in sorted order x <= y <= z. You will place these in some C++ container, which I'll just denote as a list like D = [x, y, z], so we can also say that x is component 0 of D, or D_0 and so on.
For any sequential algorithm that first draws a random value for x, let's say it comes up with 2.5, then this tells us some information about what y has to be, namely y >= 2.5.
So, conditional on the value of x, your desired random number algorithm has to satisfy the property that p(y >= x | x) = 1. If the distribution you are drawing from is anything like a common distribution, like uniform or Gaussian, then it's clear to see that usually p(y >= x) would be some other expression involving the density for that distribution. (In fact, only a pathological distribution like a Dirac delta at "infinity" could be independent, and would be nonsense for your application.)
So what we can speculate with great confidence is that p(y >= t | x) for various values of t is not equal to p(y >= t). That's the definition for dependent random variables. So now you know that the random variable y (second in your eventual list) is not statistically independent of x.
Another way to state it is that in your output data D, the components of D are not statistically independent observations. And in fact they must be positively correlated since if we learn that x is bigger than we thought, we also automatically learn that y is bigger than or equal to what we thought.
In this sense, a sequential algorithm that provides this kind of output is an example of a Markov Chain. The probability distribution of a given number in the sequence is conditionally dependent on the previous number.
If you really want a Markov Chain like that (I suspect that you don't), then you could instead draw a first number at random (for x) and then draw positive deltas, which you will add to each successive number, like this:
Draw a value for x, say 2.5
Draw a strictly positive value for y-x, say 13.7, so y is 2.5 + 13.7 = 16.2
Draw a strictly positive value for z-y, say 0.001, so z is 16.201
and so on...
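A minimal sketch of this delta-based scheme, under assumptions that are not part of the answer: the gaps are drawn from an exponential distribution with a chosen mean, the sequence starts from 0, and the function name is made up here.

```python
import random

def sorted_by_deltas(count, mean_gap, rng=random):
    """Build a non-decreasing sequence by accumulating random non-negative gaps."""
    values = []
    current = 0.0
    for _ in range(count):
        # Exponential gap with the given mean; each value depends on the previous one.
        current += rng.expovariate(1.0 / mean_gap)
        values.append(current)
    return values

if __name__ == "__main__":
    rng = random.Random(99)  # fixed seed so runs are repeatable
    print(sorted_by_deltas(6, 10.0, rng))
```

As the answer stresses, the values produced this way are correlated with each other, so they should not be fed into anything that assumes independent observations.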
You just have to acknowledge that the components of your result are not statistically independent, and so you cannot use them in an application that relies on statistical independence assumptions.

WAV-file analysis C (libsndfile, fftw3)

I'm trying to develop a simple C application that can give a value from 0-100 at a certain frequency range at a given timestamp in a WAV-file.
Example: I have frequency range of 44.1kHz (typical MP3 file) and I want to split that range into n amount of ranges (starting from 0). I then need to get the amplitude of each range, being from 0 to 100.
What I've managed so far:
Using libsndfile I'm now able to read the data of a WAV-file.
infile = sf_open(argv [1], SFM_READ, &sfinfo);
float samples[sfinfo.frames];
sf_read_float(infile, samples, 1);
However, my understanding of FFT is rather limited. I know it's required in order to get the amplitudes at the ranges I need, but how do I move on from here? I found the library FFTW-3, which seems to be suited for the purpose.
I found some help here: https://stackoverflow.com/a/4371627/1141483
and looked at the FFTW tutorial here: http://www.fftw.org/fftw2_doc/fftw_2.html
But as I'm unsure about the behaviour of FFTW, I don't know how to progress from here.
And another question, assuming you use libsndfile: if you force the reading to be single-channel (with a stereo file) and then read the samples, will you actually be reading only half of the samples of the total file, since half of them are from channel 1? Or does it automatically filter those out?
Thanks a ton for your help.
EDIT: My code can be seen here:
#include <complex.h> /* must precede fftw3.h so fftw_complex is C99 double complex */
#include <math.h>
#include <stdio.h>
#include <sndfile.h>
#include <fftw3.h>

/* 4-term Blackman-Harris window, sample n of N */
double blackman_harris(int n, int N)
{
    const double a0 = 0.35875, a1 = 0.48829, a2 = 0.14128, a3 = 0.01168;
    double seg1 = a1 * cos(2.0 * M_PI * n / (N - 1));
    double seg2 = a2 * cos(4.0 * M_PI * n / (N - 1));
    double seg3 = a3 * cos(6.0 * M_PI * n / (N - 1));
    return a0 - seg1 + seg2 - seg3;
}

int main(int argc, char *argv[])
{
    SF_INFO sfinfo = {0};
    SNDFILE *infile;
    if (argc < 2 || (infile = sf_open(argv[1], SFM_READ, &sfinfo)) == NULL) {
        fprintf(stderr, "usage: %s file.wav\n", argv[0]);
        return 1;
    }
    enum { N = 1024 };        /* 2^10 samples per FFT window */
    static double samples[N]; /* zero-initialised, so a short read is zero-padded */
    fftw_complex results[N / 2 + 1];
    /* Read N items, not 1. For a multi-channel file the items are interleaved. */
    sf_read_double(infile, samples, N);
    /* Apply the window and accumulate its sum for the normalisation. */
    double normalizer = 0.0;
    for (int k = 0; k < N; k++) {
        double w = blackman_harris(k, N);
        samples[k] *= w;
        normalizer += w;
    }
    normalizer /= 2.0; /* a full-scale sine centred on a bin now maps to 1.0 */
    fftw_plan p = fftw_plan_dft_r2c_1d(N, samples, results, FFTW_ESTIMATE);
    fftw_execute(p);
    for (int i = 0; i < N / 2 + 1; i++) {
        double value = cabs(results[i]) / normalizer;
        printf("%f\n", value);
    }
    fftw_destroy_plan(p);
    sf_close(infile);
    return 0;
} /* main */
Well, it all depends on the frequency range you're after. An FFT works by taking 2^n real samples and providing you with 2^(n-1) + 1 complex values (real and imaginary parts), one per frequency bin; the last one sits exactly at the Nyquist frequency and is often ignored. I have to admit I'm quite hazy on what exactly these values represent (I've got a friend who has promised to go through it all with me in lieu of a loan I made him when he had financial issues ;)) other than an angle around a circle. Effectively each pair gives you the amplitude and phase of a sinusoid at that bin's frequency, from which the original 2^n samples can be perfectly reconstructed.
Anyway this has the huge advantage that you can calculate magnitude by taking the euclidean distance of the real and imaginary parts (sqrtf( (real * real) + (imag * imag) )). This provides you with an unnormalised distance value. This value can then be used to build a magnitude for each frequency band.
So let's take an order 10 FFT (2^10). You input 1024 samples. You FFT those samples and you get 512 complex (real and imaginary) values back, plus the Nyquist bin with FFTW's real-to-complex layout (the particular ordering of those values depends on the FFT algorithm you use). Those bins span 0 to 22050Hz, half the sample rate, so for a 44.1kHz audio file each bin represents 44100/1024 Hz, or ~43Hz per bin.
One thing that should stand out from this is that if you use more samples (from what's called the time domain, or the spatial domain when dealing with multi-dimensional signals such as images) you get better resolution in what's called the frequency domain. However, you sacrifice one for the other: a longer window gives finer frequency bins but smears changes over a longer stretch of time. This is just the way things go and you will have to live with it.
Basically you will need to tune the frequency bins and time/spatial resolution to get the data you require.
First a bit of nomenclature. The 1024 time domain samples I referred to earlier are called your window. Generally when performing this sort of process you will want to slide the window on by some amount to get the next 1024 samples you FFT. The obvious thing to do would be to take samples 0->1023, then 1024->2047, and so forth. This unfortunately doesn't give the best results. Ideally you want to overlap the windows to some degree so that you get a smoother frequency change over time. Most commonly people slide the window on by half a window size, i.e. your first window will be 0->1023, the second 512->1535, and so on and so forth.
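The window positions for an overlapping scheme can be computed with a little index arithmetic. The sketch below is only an illustration; the helper names `window_count` and `window_start` are made up here, and it assumes that a trailing partial window is simply dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Number of full windows of length win that fit into total samples when
 * the window advances by hop samples per step (hop = win / 2 gives the
 * common 50% overlap). A trailing partial window is not counted. */
size_t window_count(size_t total, size_t win, size_t hop)
{
    assert(win > 0 && hop > 0);
    if (total < win)
        return 0;
    return (total - win) / hop + 1;
}

/* Index of the first sample of window number w (0-based). */
size_t window_start(size_t w, size_t hop)
{
    return w * hop;
}
```

With `win = 1024` and `hop = 512`, windows 0, 1 and 2 start at samples 0, 512 and 1024, matching the 0->1023, 512->1535 pattern; each FFT call would then be given a pointer to `samples + window_start(w, hop)`.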
Now this then brings up one further problem. While this information provides for perfect inverse FFT signal reconstruction, it leaves you with the problem that frequencies leak into surrounding bins to some extent. To solve this issue some mathematicians (far more intelligent than me) came up with the concept of a window function. The window function provides for far better frequency isolation in the frequency domain, though it leads to a loss of information in the time domain (i.e. it's impossible to perfectly reconstruct the signal after you have used a window function, AFAIK).
Now there are various types of window function, ranging from the rectangular window (effectively doing nothing to the signal) to various functions that provide far better frequency isolation (though some may also kill surrounding frequencies that may be of interest to you!!). There is, alas, no one size fits all, but I'm a big fan (for spectrograms) of the Blackman-Harris window function. I think it gives the best looking results!
However as I mentioned earlier the FFT provides you with an unnormalised spectrum. To normalise the spectrum (after the euclidean distance calculation) you need to divide all the values by a normalisation factor (I go into more detail here).
This normalisation will provide you with a value between 0 and 1, so you can easily multiply this value by 100 to get your 0 to 100 scale.
This, however, is not where it ends. The spectrum you get from this is rather unsatisfying. This is because you are looking at the magnitude using a linear scale. Unfortunately the human ear hears using a logarithmic scale. This rather causes issues with how a spectrogram/spectrum looks.
To get round this you need to convert these 0 to 1 values (I'll call it 'x') to the decibel scale. The standard transformation is 20.0f * log10f( x ). This gives you a value whereby 1 has converted to 0 and 0 has converted to -infinity. Your magnitudes are now on the appropriate logarithmic scale. However, it's not always that helpful.
At this point you need to look into the original sample bit depth. At 16-bit sampling you get a value that is between 32767 and -32768. This means your dynamic range is fabsf( 20.0f * log10f( 1.0f / 65536.0f ) ) or ~96.33dB. So now we have this value.
Take the values we've got from the dB calculation above and add 96.33 to them. The maximum amplitude (0 dB) is now 96.33. Now divide by that same value and you have a value ranging from -infinity to 1.0f. Clamp the lower end to 0 and you have a range from 0 to 1; multiply that by 100 and you have your final 0 to 100 range.
And that is much more of a monster post than I had originally intended but should give you a good grounding in how to generate a good spectrum/spectrogram for an input signal.
and breathe
Further reading (for people other than the original poster who has already found it):
Converting an FFT to a spectogram
Edit: As an aside I found kiss FFT far easier to use, my code to perform a forward fft is as follows:
CFFT::CFFT( unsigned int fftOrder ) :
    BaseFFT( fftOrder )
{
    mFFTSetupFwd = kiss_fftr_alloc( 1 << fftOrder, 0, NULL, NULL );
}

bool CFFT::ForwardFFT( std::complex< float >* pOut, const float* pIn, unsigned int num )
{
    kiss_fftr( mFFTSetupFwd, pIn, (kiss_fft_cpx*)pOut );
    return true;
}

Problem with Precision floating point operation in C

For one of my course projects I started implementing a "Naive Bayesian classifier" in C. My project is to implement a document classifier application (especially for spam) using a huge amount of training data.
Now I have problem implementing the algorithm because of the limitations in the C's datatype.
( Algorithm I am using is given here, http://en.wikipedia.org/wiki/Bayesian_spam_filtering )
PROBLEM STATEMENT:
The algorithm involves taking each word in a document and calculating the probability of it being a spam word. If p1, p2, p3 ... pn are the probabilities of word-1, 2, 3 ... n, the probability of the doc being spam or not is calculated using
p = (p1 * p2 * ... * pn) / (p1 * p2 * ... * pn + (1-p1) * (1-p2) * ... * (1-pn))
Here, a probability value can very easily be around 0.01, so even if I use the datatype "double" my calculation will go for a toss. To confirm this I wrote the sample code given below.
#include <stdio.h>

#define PROBABILITY_OF_UNLIKELY_SPAM_WORD (0.01)
#define PROBABILITY_OF_MOSTLY_SPAM_WORD (0.99)

int main(void)
{
    int index;
    long double numerator = 1.0;
    long double denom1 = 1.0, denom2 = 1.0;
    long double doc_spam_prob;

    /* Simulating FEW unlikely spam words */
    for (index = 0; index < 162; index++)
    {
        numerator = numerator * (long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom2 = denom2 * (long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom1 = denom1 * (long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD);
    }
    /* Simulating lot of mostly definite spam words */
    for (index = 0; index < 1000; index++)
    {
        numerator = numerator * (long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom2 = denom2 * (long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom1 = denom1 * (long double)(1 - PROBABILITY_OF_MOSTLY_SPAM_WORD);
    }
    doc_spam_prob = numerator / (denom1 + denom2);
    printf("%Lg\n", doc_spam_prob); /* print the result so the underflow is visible */
    return 0;
}
I tried the float, double and even long double datatypes, but I still have the same problem.
Hence, say in a 100K-word document I am analyzing, if just 162 words have a 1% spam probability and the remaining 99838 are conspicuously spam words, my app will still classify it as a Not Spam doc because of the precision error (as the numerator easily goes to ZERO)!!!
This is the first time I am hitting such issue. So how exactly should this problem be tackled?
This happens often in machine learning. AFAIK, there's nothing you can do about the loss in precision. So to bypass this, we use the log function and convert divisions and multiplications to subtractions and additions, resp.
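So I decided to do the math.
The original equation is:
p = (p_1 * p_2 * ... * p_n) / (p_1 * p_2 * ... * p_n + (1-p_1) * (1-p_2) * ... * (1-p_n))
I slightly modify it:
1/p - 1 = ((1-p_1) * (1-p_2) * ... * (1-p_n)) / (p_1 * p_2 * ... * p_n)
Taking logs on both sides:
ln(1/p - 1) = sum_i [ln(1-p_i) - ln(p_i)]
Let,
eta = sum_i [ln(1-p_i) - ln(p_i)]
Substituting,
1/p - 1 = e^eta
Hence the alternate formula for computing the combined probability:
p = 1 / (1 + e^eta)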
If you need me to expand on this, please leave a comment.
Here's a trick:
for the sake of readability, let S := p_1 * ... * p_n and H := (1-p_1) * ... * (1-p_n),
then we have:
p = S / (S + H)
p = 1 / ((S + H) / S)
p = 1 / (1 + H / S)
let's expand again:
p = 1 / (1 + ((1-p_1) * ... * (1-p_n)) / (p_1 * ... * p_n))
p = 1 / (1 + (1-p_1)/p_1 * ... * (1-p_n)/p_n)
So basically, you obtain a product of moderately sized factors (each (1-p_i)/p_i lies between about 0.0101, for p_i = 0.99, and 99, for p_i = 0.01). The idea is not to multiply tons of small numbers with one another, only to obtain, well, 0, but to form quotients of two small numbers first. For example, if n = 1000000 and p_i = 0.5 for all i, the original formula gives you 0/(0+0), which is NaN, whereas the proposed method gives you 1/(1+1*...*1), which is 0.5.
You can get even better results, when all p_i are sorted and you pair them up in opposed order (let's assume p_1 < ... < p_n), then the following formula will get even better precision:
p = 1 / (1 + (1-p_1)/p_n * ... * (1-p_n)/p_1)
That way you divide big numerators (small p_i) by big denominators (big p_(n+1-i)), and small numerators by small denominators.
edit: MSalter proposed a useful further optimization in his answer. Using it, the formula reads as follows:
p = 1 / (1 + (1-p_1)/p_n * (1-p_2)/p_(n-1) * ... * (1-p_(n-1))/p_2 * (1-p_n)/p_1)
Your problem is caused because you are collecting too many terms without regard for their size. One solution is to take logarithms. Another is to sort your individual terms. First, let's rewrite the equation as 1/p = 1 + ∏((1-p_i)/p_i). Now your problem is that some of the terms are small, while others are big. If you have too many small terms in a row, you'll underflow, and with too many big terms you'll overflow the intermediate result.
So, don't put too many of the same order in a row. Sort the terms (1-p_i)/p_i. As a result, the first will be the smallest term, the last the biggest. Now, if you'd multiply them straight away you would still have an underflow. But the order of calculation doesn't matter. Use two iterators into your temporary collection. One starts at the beginning (i.e. (1-p_0)/p_0), the other at the end (i.e. (1-p_n)/p_n), and your intermediate result starts at 1.0. Now, when your intermediate result is >= 1.0, you take a term from the front, and when your intermediate result is < 1.0 you take a term from the back.
The result is that as you take terms, the intermediate result will oscillate around 1.0. It will only go up or down as you run out of small or big terms. But that's OK. At that point, you've consumed the extremes on both ends, so the intermediate result will slowly approach the final result.
There's of course a real possibility of overflow. If the input is completely unlikely to be spam (p=1E-1000) then 1/p will overflow, because ∏((1-p_i)/p_i) overflows. But since the terms are sorted, we know that the intermediate result will overflow only if ∏((1-p_i)/p_i) overflows. So, if the intermediate result overflows, there's no subsequent loss of precision.
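The two-iterator scheme might be sketched in C as below. The function name `combine_balanced` and the caller-supplied scratch buffer `terms` are choices made for this illustration, not something the answer specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Combined probability p = 1 / (1 + prod((1 - p_i) / p_i)).
 * terms is scratch space for n doubles; each p[i] must be in (0, 1). */
double combine_balanced(const double *p, double *terms, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(p[i] > 0.0 && p[i] < 1.0);
        terms[i] = (1.0 - p[i]) / p[i];
    }
    qsort(terms, n, sizeof *terms, cmp_double);   /* ascending order */

    double prod = 1.0;
    size_t lo = 0, hi = n;           /* unused terms are terms[lo..hi-1] */
    while (lo < hi) {
        if (prod >= 1.0)
            prod *= terms[lo++];     /* take the smallest remaining term */
        else
            prod *= terms[--hi];     /* take the largest remaining term */
    }
    return 1.0 / (1.0 + prod);
}
```

If the product still underflows to 0 or overflows to infinity at the end, the returned value saturates at 1 or 0, which matches the observation above that overflow only happens when the true product is itself out of range.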
Try computing the inverse 1/p. That gives you an equation of the form 1 + ((1-p1)*(1-p2)*...)/(p1*p2*...).
If you then count the occurrences of each probability (it looks like you have a small number of values that recur) you can use the pow() function, e.g. pow(1-p, occurrences_of_p)*pow(1-q, occurrences_of_q), and avoid individual roundoff with each multiplication.
You can express the probability in percent or per mille:
doc_spam_prob= (numerator*100/(denom1+denom2));
or
doc_spam_prob= (numerator*1000/(denom1+denom2));
or use some other coefficient
I am not strong in math so I cannot comment on possible simplifications to the formula that might eliminate or reduce your problem. However, I am familiar with the precision limitations of long double types and am aware of several arbitrary and extended precision math libraries for C. Check out:
http://www.nongnu.org/hpalib/
and
http://www.tc.umn.edu/~ringx004/mapm-main.html
