Are floating point numbers changed when sorted in Perl?

I'm running a statistical bootstrap at 10k permutations, which I'm trying to compare against an observed value. The observed is supposed to be identical to the max of the 10k permutations. The way I am measuring this is by attempting to find its percentile.
All results of the 10k permutations (10,000 random numbers) are stored in an array, which I sort using:
my @sorted = sort { $a <=> $b } @permutednumbers;
When I then compare the observed value $truevalue, I'm getting an inaccurate comparison. These are stored as floating point numbers. The bootstrapping procedure uses the same formula for generating the random number so it should be absolutely identical, but when comparing the same value, it becomes inaccurate. I'm testing this with:
if ($sorted[$#sorted] == $truevalue) {
    print "sorted: $sorted[$#sorted] is eq truevalue:$truevalue\n";
} elsif ($sorted[$#sorted] > $truevalue) {
    print "sorted: $sorted[$#sorted] is gt truevalue:$truevalue\n";
} elsif ($sorted[$#sorted] < $truevalue) {
    print "sorted: $sorted[$#sorted] is lt truevalue:$truevalue, totalpermvalues; $totalpermvalues\n";
}
output:
sorted: 0.937864522389543 is gt truevalue:0.937864522389543
So I get that floating point numbers aren't printed in complete accuracy, but I always assumed internally the computer stores the correct numbers. Is that not a correct assumption? Of course I can fix this quickly by changing them into integers of some sort, but is this something that I should be doing automatically all the time? Are floating point numbers just dangerous to use? Those exact values should be identical given that they are outputs of identical inputs, which is what is confusing me...
If this matters, the values are individually calculated using the linear_interpolate function in Math::Interpolate package, but the inputs are identical.

If I understand correctly, you are wondering why == is returning false and > is returning true for what appear to be identical numbers. Obviously, the numbers are not actually identical. You can see this by printing more digits.
printf "sorted: %.80e is gt truevalue:%.80e\n", $sorted[$#sorted], $truevalue;
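The same effect can be reproduced in a few lines. The following is a minimal Python sketch, not the asker's Perl: the "sorted maximum" here is constructed artificially as the next representable double above the observed value (using `math.nextafter`, which needs Python 3.9 or later). It assumes the short format of 15 significant digits, which is what Perl typically uses when it stringifies a number.

```python
import math

true_value = 0.937864522389543
# Hypothetical "sorted maximum": the very next representable double above true_value.
sorted_max = math.nextafter(true_value, math.inf)

short_a = "%.15g" % sorted_max   # 15 significant digits
short_b = "%.15g" % true_value
long_a = "%.17g" % sorted_max    # 17 significant digits distinguish any two doubles
long_b = "%.17g" % true_value

print(short_a, short_b, short_a == short_b)  # the short forms look identical
print(long_a, long_b, long_a == long_b)      # the long forms do not
print(sorted_max > true_value)               # and the comparison sees the difference
```

Two values can therefore print identically and still compare as unequal, which matches the output shown in the question.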

No, sort will not change values. One has to assume that there is a difference in the way these two values have been produced.
It is most certainly possible to use == with floating point numbers (FPN), returning true if a pair of 64 bit quantities is identical. But one has to be very careful when one asks the question "Are these two FPNs equal?"
A (relatively small but still considerable) quantity of integers and rational numbers can be represented accurately in a FPN. For these (and only for these), questions such as "Is the FPN a equal to 1.5?" (written as $a==1.5) may make sense but only if you are confident about the genesis of the value in $a. - Don't take this lightly: will both of the following statements print "1"?
print 0.12345678901234567 == 1.2345678901234567E-1,"\n";
print 0.12345678901234567 == 12.345678901234567E-2,"\n";
All FPNs are not only representatives of the value x they represent accurately. They are also responsible for an interval of real numbers, including rational, irrational and transcendental (and even integer) numbers "a little greater and a little smaller" than x. You can quantify "a little": it is about 1e-16 for x == 1.0, and shrinks or grows with the magnitude of x. So, for instance, 1+1e-17 will be 1.0 on your computer. You can input this number, but the FPN will be 1.0 all the same. Asking whether an FPN that results from some computation equals 1+1e-17 doesn't make sense, since you cannot even tell the computer that value.
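As a quick illustration of the interval around 1.0, here is a small Python sketch (Python floats are IEEE 754 doubles on common platforms, which is assumed here; the variable names are made up for the example):

```python
import sys

# 1e-17 is smaller than half the gap between 1.0 and the next double,
# so the sum rounds back to exactly 1.0.
absorbed = 1.0 + 1e-17
gap_at_one = sys.float_info.epsilon  # distance from 1.0 to the next larger double

print(absorbed == 1.0)
print(gap_at_one)
print(1.0 + gap_at_one > 1.0)
```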
The solution isn't difficult. Instead of asking for equality you have to ask "Is the FPN a in an interval [p,q] around x?" Determining p and q should be given a little thought, as a suitable choice of these values primarily depends on x. The usual formula is something like
abs( $a - $expect ) <= abs( $expect ) * PRECISION
where PRECISION could be, for instance, 1e-12. (The value to use here may depend on the algorithm you use for computing $a, or on your needs, or both.)
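A minimal Python sketch of that relative-tolerance test follows. The function name `nearly_equal` and the value of `PRECISION` are assumptions made for illustration; the standard library's `math.isclose` performs a similar check.

```python
import math

PRECISION = 1e-12  # assumed relative tolerance; tune to the algorithm and needs

def nearly_equal(a, expect, precision=PRECISION):
    """True if a lies within a relative tolerance of expect."""
    return abs(a - expect) <= abs(expect) * precision

computed = 0.1 + 0.2
print(computed == 0.3)                                 # exact comparison
print(nearly_equal(computed, 0.3))                     # tolerant comparison
print(math.isclose(computed, 0.3, rel_tol=PRECISION))  # library equivalent
```

Note that a purely relative tolerance degenerates to exact equality when the expected value is 0, so that case may need an absolute tolerance instead.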
Finally: due to the mathematical properties of FP machine instructions, the usual arithmetic laws of associativity and distributivity are not guaranteed. The loss of low-order digits in addition or subtraction may, for instance, heavily distort the result. A typical illustration is to compute some Taylor series twice: once adding terms in decreasing order until the terms become smaller than a given limit, and once adding the same terms in increasing order.
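The order dependence can be seen without a full Taylor series. The sketch below is a simplified stand-in (a hypothetical list of one large term and ten tiny ones, not a real series) that sums the same terms in both orders:

```python
terms = [1.0] + [1e-16] * 10  # one large term followed by ten tiny ones

def running_sum(values):
    """Plain left-to-right accumulation in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total

decreasing = running_sum(terms)            # large term first: tiny terms are absorbed
increasing = running_sum(reversed(terms))  # tiny terms first: they accumulate

print(decreasing, increasing, decreasing == increasing)

# Associativity also fails for everyday values:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))
```

Adding the small terms first lets them build up to something large enough to survive the final addition.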

Related

Simplest way to make a histogram of an unknown, finite list of discrete floating point numbers

I have a code that generates a sequence of configurations of some system of interest (Markov Chain Monte Carlo). For each configuration, I make a measurement of a particular value for that configuration, which is bounded between zero and some maximum which I can presumably predict before hand, let's call it Rmax. It can only take a finite number of discrete values in between 0 and Rmax, but the values could be irrational and are not evenly spaced, and I don't know them a priori, or necessarily how many there are (though I could probably estimate an upper bound). I want to generate a very large number of configurations (on the order 1e8) and make a histogram of the distribution of these values, but the issue that I am facing is how to effectively keep track of them.
For example, if the values were integers in the range [0,N-1], I would just create an integer array of N elements, initially set to zero, and increment the appropriate array element for each configuration, e.g. in pseudocode
do i = 1, 1e8
    call generateConfig()
    R = measureR() ! R is an integer
    Rhist(R)++
end do
How can I do something similar to count or tally the number of times each of these irrational, non-uniformly distributed numbers occurs?
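One possible approach, sketched here in Python under stated assumptions, is to tally the values in a hash map keyed by the value rounded to a fixed resolution. The resolution `TOL`, the function name `tally`, and the sample data are all hypothetical; the idea only works if distinct values are separated by much more than `TOL`, and values that fall very close to a rounding boundary may still be split across two keys.

```python
import math
from collections import Counter

TOL = 1e-9  # assumed resolution; must be well below the spacing between distinct values

def tally(values, tol=TOL):
    """Count occurrences, treating values that round to the same multiple of tol as equal."""
    counts = Counter()
    for r in values:
        counts[round(r / tol)] += 1
    return counts

# Hypothetical measurements: two ways of computing 0.3 and two copies of sqrt(2).
samples = [0.1 + 0.2, 0.3, math.sqrt(2.0), math.sqrt(2.0)]
hist = tally(samples)
for key, count in sorted(hist.items()):
    print(key * TOL, count)
```

Because the number of distinct values is finite, the map stays small even when the number of configurations is very large.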

How to only allow floats to two decimal places

For instance, I have a floating point number 0.02344489282. I want to make sure that every float I have is kept to two decimal places: 0.02. It will be inexact, I'm sure, but all the floats in my code should have anything after two decimal places truncated. I have seen other related posts on Stack Overflow, but they deal with outputting the decimal to two places.
Goal: to optimize memory consumption at the expense of accuracy. But the accuracy can be downgraded to 5-15%.
Practical example: I am executing a Kalman filter. Instead of exact values of noise and actual values, I try to find the approximate values by shortening the bit width of variables. Then I'll find the difference of accuracy of the former script and the latter script and how much of energy and memory is saved.
Two possible solutions:
Use integers representing units of 1/100.
Use floating point, but only use integer multiples of 0.25 (i.e. numbers ending in .25, .50, .75, or .00) since these are the only floats which have only two decimal places.
Since option 2 is almost certainly not what you actually want, go for 1.
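Option 1 might look like the following Python sketch. The helper names `to_hundredths` and `from_hundredths` are made up for this example, and truncation toward zero is assumed, as in the question.

```python
def to_hundredths(x):
    """Truncate x toward zero and return it as an integer count of 1/100 units."""
    return int(x * 100)

def from_hundredths(n):
    """Convert an integer count of 1/100 units back to a float for display."""
    return n / 100

stored = to_hundredths(0.02344489282)
print(stored, from_hundredths(stored))
```

All arithmetic is then done on the integers, and conversion back to a float happens only at the edges of the program.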

How to compare double numbers?

I know that when I would like to check if double == double I should write:
bool AreSame(double a, double b)
{
    return fabs(a - b) < EPSILON;
}
But what when I would like to check if a > b or b > a ?
There is no general solution for comparing floating-point numbers that contain errors from previous operations. The code that must be used is application-specific. So, to get a proper answer, you must describe your situation more specifically. For example, if you are sorting numbers in a list or other data structure, you should not use any tolerance for comparison.
Usually, if your program needs to compare two numbers for order but cannot do so because it has only approximations of those numbers, then you should redesign the program rather than try to allow numbers to be ordered incorrectly.
The underlying problem is that performing a correct computation using incorrect data is in general impossible. If you want to compute some function of two exact mathematical values x and y but the only data you have is some incorrectly computed values x' and y', it is generally impossible to compute the exactly correct result. For example, suppose you want to know what the sum, x+y, is, but you only know that x' is 3 and y' is 4, and you do not know what the true, exact x and y are. Then you cannot compute x+y.
If you know that x' and y' are approximately x and y, then you can compute an approximation of x+y by adding x' and y'. This works when the function being computed has a reasonable derivative: Slightly changing the inputs of a function with a reasonable derivative slightly changes its outputs. This fails when the function you want to compute has a discontinuity or a large derivative. For example, if you want to compute the square root of x (in the real domain) using an approximation x' but x' might be negative due to previous rounding errors, then computing sqrt(x') may produce an exception. Similarly, comparing for inequality or order is a discontinuous function: A slight change in inputs can change the answer completely.
The common bad advice is to compare with a “tolerance”. This method trades false negatives (incorrect rejections of numbers that would satisfy the comparison if the true mathematical values were compared) for false positives (incorrect acceptance of numbers that would not satisfy the comparison).
Whether or not an application can tolerate false acceptance depends on the application. Therefore, there is no general solution.
The level of tolerance to set, and even the nature by which it is calculated, depend on the data, the errors, and the previous calculations. So, even when it is acceptable to compare with a tolerance, the amount of tolerance to use and how to calculate it depends on the application. There is no general solution.
The analogous comparisons are:
a > b - EPSILON
and
b > a - EPSILON
I am assuming that EPSILON is some small positive number.
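The trade-off described earlier in this thread shows up directly with such a comparison. In this Python sketch (the function name, the value of `EPSILON`, and the sample values are assumptions), two values that differ by less than the tolerance are each reported as "greater" than the other:

```python
EPSILON = 1e-9  # assumed tolerance; some small positive number

def greater_with_tolerance(a, b, eps=EPSILON):
    """The tolerant analogue of a > b."""
    return a > b - eps

a = 1.0
b = 1.0 + 1e-10  # slightly larger than a, but within EPSILON of it

print(a > b)                         # strict comparison
print(greater_with_tolerance(a, b))  # tolerant comparison accepts a > b
print(greater_with_tolerance(b, a))  # and also accepts b > a
```

Whether that kind of false acceptance is tolerable depends on the application.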

Shuffling biased random numbers

While thinking about this question and conversing with the participants, the idea came up that shuffling a finite set of clearly biased random numbers makes them random because you don't know the order in which they were chosen. Is this true and if so can someone point to some resources?
EDIT: I think I might have been a little unclear. Suppose a bad random numbers generator. Take n values. These are biased(the rng is bad). Is there a way through shuffling to make the output of the rng over multiple trials statistically match the output of a known good rng?
False.
There is an easy test: Assume the bias in the original set creation algorithm is "creates sets whose arithmetic average is significantly lower than expected average". Obviously, shuffling the result of the algorithm will not change the averages and thus not remove the bias.
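This test is easy to run. The sketch below uses Python with a made-up biased generator (the weights and the seed are arbitrary assumptions) and checks that shuffling leaves the mean and the frequency of each value untouched:

```python
import random
from collections import Counter

rng = random.Random(12345)  # seed chosen arbitrarily for reproducibility

# Hypothetical biased generator: draws from 1..9 but heavily favours 1 and 2.
biased = rng.choices(range(1, 10), weights=[30, 30, 5, 5, 5, 5, 5, 5, 5], k=1000)

shuffled = biased[:]
rng.shuffle(shuffled)

print(sum(biased) / len(biased), sum(shuffled) / len(shuffled))  # identical means
print(Counter(biased) == Counter(shuffled))                      # identical frequencies
```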
Also, regarding your clarification: How would you shuffle the set? Using the same bad output from the bad RNG that created the set in the first place? Or using a better RNG? Which raises the question why you don't use that directly.
It's not true. In the other question the problem is to select 30 random numbers in [1..9] with a sum of 200. After choosing on average about 20 of them randomly, you reach a point where you can't select nines anymore because this would make the total sum go over 200. Of the remaining 10 numbers, most will be ones and twos. So in the end, ones and twos are heavily overrepresented in the selected numbers. Shuffling doesn't change that. But it's not clear what the random distribution really should look like, so one could say this is as good a solution as any.
In general, if your "random" numbers will be biased to, say, low numbers, they will be biased that way no matter the ordering.
Just shuffling a set of already random numbers won't do anything to the probability distribution, of course. So the answer would be false. Perhaps I misunderstand your question, though?
I would say false, with a caveat:
I think there is random, and then there is 'random-enough'. For most applications that I have needed to work on, 'random-enough' was more than enough, i.e. picking a 'random' ad to display on a page from a list of 300 or so that have paid to be placed on that site.
I am sure a mathematician could prove my very basic 'random' selection criteria is not truly random at all, but in fact is predictable - for my clients, and for the users, nobody cares.
On the other hand, if I were writing a video game to be used in Las Vegas where large amounts of money were at stake, I'd define random differently (and might have a hard time coming up with truly random).
False
The set is finite; suppose it consists of n numbers. What happens if you choose n+1 numbers? Let's also consider a basic random function as implemented in many languages which gives you a random number in [0,1). However, this number is limited to three digits after the decimal, giving you a set of 1000 possible numbers (0.000 - 0.999). However, in most cases you will not need to use all these 1000 numbers, so this amount of randomness is more than enough.
However for some uses, you will need a better random generator than this. So it all comes down to exactly how many random numbers you are going to need, and how random you need them to be.
Addition after reading original question: in the case that you have some sort of limitation (such as in the original question, in which each set of selected numbers must sum up to a certain N) you are not really selecting random numbers per se, but rather choosing numbers in a random order from a given set (specifically, a permutation of numbers summing up to N).
Addition to edit: Suppose your bad number generator generated the sequence (1,1,1,2,2,2). Does the permutation (1,2,2,1,1,2) satisfy your definition of random?
Completely and utterly untrue: Shuffling doesn't remove a bias, it just conceals it from the casual observer. It's like removing your dog's fondly-laid present from your carpet by just pushing it under the sofa - you really haven't solved the problem, you've just made it less conspicuous. Anyone with a nose knows that there is still a problem that needs removing.
The randomness must be applied evenly over the whole range, so here's one way (off the top of my head, lots of assumptions, yadda yadda. The point is the approach, not the code - start with everything even, then introduce your randomness in a consistent fashion until you're done. The only bias now is dependent on the values chosen for 'target' and 'numberofnumbers', which is part of the question.)
target = 200
numberofnumbers = 30
numbers = array();
for (i=0; i<numberofnumbers; i++)
    numbers[i] = 9
while (sum(numbers)>target)
    numbers[random(numberofnumbers)]--
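A runnable rendering of the pseudocode above, as a Python sketch. The function name, the `low` and `high` parameters, and the lower-bound guard are additions made here for illustration; the pseudocode itself has no guard, so a slot could in principle be decremented below 1.

```python
import random

def random_numbers_with_sum(target=200, count=30, low=1, high=9, rng=random):
    """Start with every slot at `high`, then decrement random slots until the sum is `target`."""
    if not count * low <= target <= count * high:
        raise ValueError("target is not reachable with these bounds")
    numbers = [high] * count
    while sum(numbers) > target:
        i = rng.randrange(count)
        if numbers[i] > low:  # guard added here; the pseudocode has no lower bound
            numbers[i] -= 1
    return numbers

result = random_numbers_with_sum()
print(result, sum(result))
```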
False. Consider a bad random number generator producing only zeros (I said it was BAD :-) No amount of shuffling the zeros would change any property of that sequence.

Is there a good reason for storing percentages that are less than 1 as numbers greater than 1?

I inherited a project that uses SQL Server 200x, wherein a column that stores a value that is always considered as a percentage in the problem domain is stored as its greater than 1 decimal equivalent. For example, 70% (0.7, literally) is stored as 70, 100% as 100, etc. Aside from the need to remember to * 0.01 on retrieved values and * 100 before persisting values, it doesn't seem to be a problem in and of itself. It does make my head explode though... so is there a good reason for it that I'm missing? Are there compelling reasons to fix it, given that there is a fair amount of code written to work with the pseudo-percentages?
There are a few cases where greater than 100% occurs, but I don't see why the value wouldn't just be stored as 1.05, for example, in those cases.
EDIT: Head feeling better, and slightly smarter. Thanks for all the insights.
There are actually four good reasons I can think of that you might want to store—and calculate with—whole-number percentage values rather than floating-point equivalents:
Depending on the data types chosen, the integer value may take up less space.
Depending on the data type, the floating-point value may lose precision (remember that not all languages have a data type equivalent to SQL Server's decimal type).
If the value will be input from or output to the user very frequently, it may be more convenient to keep it in a more user-friendly format (decision between convert when you display and convert when you calculate ... but see the next point).
If the principal values are also integers, then
principal * integerPercentage / 100
which uses all-integer arithmetic, is usually faster than its floating-point equivalent (likely significantly faster in the case of a floating-point type equivalent to T-SQL's decimal type).
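To make the last point concrete, here is a small Python sketch with hypothetical values. Python's `//` operator is used for integer division; for non-negative operands it gives the same result as truncating integer division in other languages.

```python
principal = 1999          # hypothetical amount in whole units (for example, cents)
integer_percentage = 70   # 70% stored as the integer 70

share_int = principal * integer_percentage // 100  # integer arithmetic throughout
share_float = principal * 0.7                      # floating-point equivalent

print(share_int, share_float)
```

The integer version discards the fractional part, so the rounding rule has to be chosen deliberately.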
If it's a byte field then it takes up less room in the db than floating point numbers, but unless you have millions and millions of records, you'll hardly see a difference.
Since floating-point values can't be compared for equality, an integer may have been used to make the SQL simpler.
For example
(0.3==3*.1)
is usually False.
However
abs( 0.3 - 3*.1 )
is a tiny number (5.55e-17). But it's a pain to have to do everything with (column-SomeValue) BETWEEN -0.0001 AND 0.0001 or ABS(column-SomeValue) < 0.0001. You'd rather do column = SomeValue in your WHERE clause.
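The same behaviour can be checked outside the database. A short Python sketch follows (Python floats are IEEE 754 doubles on common platforms; the variable names are made up for the example):

```python
is_equal = (0.3 == 3 * 0.1)      # exact floating-point comparison
difference = abs(0.3 - 3 * 0.1)  # size of the discrepancy

stored_percent = 3 * 10          # the same quantity held as whole-number percentages

print(is_equal)
print(difference)
print(stored_percent == 30)
```

With whole-number percentages the equality test is exact, which is the simplification this answer describes.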
Floating point numbers are prone to rounding errors and, therefore, can act "funny" in comparisons. If you always want to deal with it as fixed decimal, you could either choose a decimal type, say decimal(5,2), or do the convert and store as int thing that your db does. I'd probably go the decimal route, even though the int would take up less space.
A good guess is that anything you do with integers (storing, calculating, stuffing into an edit box for a user, etc.) is marginally easier and more efficient than doing the same with floating point numbers. And the rounding issues aren't so obvious when you look at the data.
If these are numbers that end users are likely to see and interact with, percentages are easier to understand than decimals.
This is one of those situations where a notation aid can help; in the program, be consistent in using a prefix (Hungarian) or postfix to specify values that are percentages vs. those that are decimal. If you can extend a naming convention to the database fields themselves, so much the better.
And to add to the data storage issue: if you can use integer arithmetic for whatever processing you are doing, the performance is much better than with floating point arithmetic. So storing the percentages as integer values may allow the processing logic to utilize integer arithmetic.
If you're actually using them as a coefficient (or expect users of the database to do this sort of thing in reports), there's a case for storing them as a coefficient - particularly if there's a reason to do calculations involving more than one.
However, if you do this you should be consistent - either all percentages or all coefficients.

Resources