Precision loss / rounding difference when directly assigning double result to an int - c

Is there a reason why converting from a double to an int performs as expected in this case:
double value = 45.33;
double multResult = (double) value*100.0; // assign to double
int convert = multResult; // assign to int
printf("convert = %d\n", convert); // prints 4533 as expected
But not in this case:
double value = 45.33;
int multResultInt = (double) value*100.0; // assign directly to int
printf("multResultInt = %d\n", multResultInt); // prints 4532??
It seems to me there should be no difference. In the second case the result is still first stored as a double before being converted to an int unless I am not understanding some difference between casts and hard assignments.

There is indeed no difference between the two, but compilers are used to take some freedom when it comes down to floating point computations. For example compilers are free to use higher precision for intermediate results of computations but higher still means different so the results may vary.
Some compilers provide switches to always drop extra precision and convert all intermediate results to the prescribed floating point numbers (say 64bit double-precision numbers). This will make the code slower, however.
In the specific the number 45.33 cannot be represented exactly with a floating point value (it's a periodic number when expressed in binary and it would require an infinite number of bits). When multiplying by 100 this value may be you don't get an integer, but something very close (just below or just above).
int conversion or cast is performed using truncation and something very close to 4533 but below will become 4532, when above will become 4533; even if the difference is incredibly tiny, say 1E-300.
To avoid having problems be sure to account for numeric accuracy problems. If you are doing a computation that depends on exact values of floating point numbers then you're using the wrong tool.

#6502 has given you the theory, here's how to look at things experimentally
double v = 45.33;
int x = v * 100.0;
printf("x=%d v=%.20lf v100=%.20lf\n", x, v, v * 100.0 );
On my machine, this prints
x=4533 v=45.32999999999999829470 v100=4533.00000000000000000000
The value 45.33 does not have an exact representation when encoded as a 64-bit IEEE-754 floating point number. The actual value of v is slightly lower than the intended value due to the limited precision of the encoding.
So why does multiplying by 100.0 fix the problem on some machines? One possibility is that the multiplication is done with 80-bits of precision and then rounded to fit into a 64-bit result. The 80-bit number 4532.999... will round to 4533 when converted to 64-bits.
On your machine, the multiplication is evidently done with 64-bits of precision, and I would expect that v100 will print as 4532.999....

Related

Trying to recreate printf's behaviour with doubles and given precisions (rounding) and have a question about handling big numbers

I'm trying to recreate printf and I'm currently trying to find a way to handle the conversion specifiers that deal with floats. More specifically: I'm trying to round doubles at a specific decimal place. Now I have the following code:
double ft_round(double value, int precision)
{
long long int power;
long long int result;
power = ft_power(10, precision);
result = (long long int) (value * power);
return ((double)result / power);
}
Which works for relatively small numbers (I haven't quite figured out whether printf compensates for truncation and rounding errors caused by it but that's another story). However, if I try a large number like
-154584942443242549.213565124235
I get -922337203685.4775391 as output, whereas printf itself gives me
-154584942443242560.0000000 (precision for both outputs is 7).
Both aren't exactly the output I was expecting but I'm wondering if you can help me figure out how I can make my idea for rounding work with larger numbers.
My question is basically twofold:
What exactly is happening in this case, both with my code and printf itself, that causes this output? (I'm pretty new to programming, sorry if it's a dumb question)
Do you guys have any tips on how to make my code capable of handling these bigger numbers?
P.S. I know there are libraries and such to do the rounding but I'm looking for a reinventing-the-wheel type of answer here, just FYI!
You can't round to a particular decimal precision with binary floating point arithmetic. It's not just possible. At small magnitudes, the errors are small enough that you can still get the right answer, but in general it doesn't work.
The only way to round a floating point number as decimal is to do all the arithmetic in decimal. Basically you start with the mantissa, converting it to decimal like an integer, then scale it by powers of 2 (the exponent) using decimal arithmetic. The amount of (decimal) precision you need to keep at each step is roughly (just a bit over) the final decimal precision you want. If you want an exact result, though, it's on the order of the base-2 exponent range (i.e. very large).
Typically rather than using base 10, implementations will use a base that's some large power of 10, since it's equivalent to work with but much faster. 1000000000 is a nice base because it fits in 32 bits and lets you treat your decimal representation as an array of 32-bit ints (comparable to how BCD lets you treat decimal representations as arrays of 4-bit nibbles).
My implementation in musl is dense but demonstrates this approach near-optimally and may be informative.
What exactly is happening in this case, both with my code and printf itself, that causes this output?
Overflow. Either ft_power(10, precision) exceeds LLONG_MAX and/or value * power > LLONG_MAX.
Do you guys have any tips on how to make my code capable of handling these bigger numbers?
Set aside various int types to do rounding/truncation. Use FP routines like round(), nearby(), etc.
double ft_round(double value, int precision) {
// Use a re-coded `ft_power()` that computes/returns `double`
double pwr = ft_power(10, precision);
return round(value * pwr)/pwr;
}
As well mentioned in this answer, floating point numbers have binary characteristics as well as finite precision. Using only double will extend the range of acceptable behavior. With extreme precision, the value computed with this code be close yet potentially only near the desired result.
Using temporary wider math will extend the acceptable range.
double ft_round(double value, int precision) {
double pwr = ft_power(10, precision);
return (double) (roundl((long double) value * pwr)/pwr);
}
I haven't quite figured out whether printf compensates for truncation and rounding errors caused by it but that's another story
See Printf width specifier to maintain precision of floating-point value to print FP with enough precision.

Why casting double to int might give different results?

I am using fixed decimal point number (using uint16_t) to store percentage with 2 fractional digits. I have found that the way I am casting the double value to integer makes a difference in the resulting value.
const char* testString = "99.85";
double percent = atof(testString);
double hundred = 100;
uint16_t reInt1 = (uint16_t)(hundred * percent);
double stagedDouble = hundred * percent;
uint16_t reInt2 = (uint16_t)stagedDouble;
Example output:
percent: 99.850000
stagedDouble: 9985.000000
reInt1: 9984
reInt2: 9985
The error is visible in about 47% of all values between 0 and 10000 (of the fixed point representation). It does not appear at all when casting with stagedDouble. And I do not understand why the two integers are different. I am using GCC 6.3.0.
Edit:
Improved code snippet to demonstrate percent variable and to unify the coefficient between the two statements. The change of 100 into a double seems as a quality change that might affect the output, but it does not change a thing in my program.
Is percent a float? If so, look at what types you're multiplying.
reInt1 is double * float and stagedDouble is int * float. Mixing up floating point math can cause these types of rounding errors.
Changing the 100's to be both double or both int results in the same answer.
The reported behavior is consistent with percent being declared float, and the use of IEEE-754 basic 32-bit and 64-bit binary floating-point for float and double.
uint16_t reInt1 = (uint16_t)(100.0 * percent);
Since 100.0 is a double constant, this converts percent to double, performs a multiplication in double, and converts the result to uint16_t. The multiplication may have a very slight rounding error, up to ½ ULP of the double format, a relative error around 2−53.
double stagedDouble = 100 * percent;
uint16_t reInt2 = (uint16_t)stagedDouble;
Since 100 is an int constant, this converts 100 to float, performs a multiplication in float, and converts the result to uint16_t. The rounding error in the multiplication may be up to ½ ULP of the float format, a relative error around 2−24.
Since all of the values are near hundredths of an integer, a 50:50 ratio of errors up:down would make about half the results just under what is needed for the integer threshold. In the multiplications, all those with values that are 0, 25, 50, or 100 one-hundredths would be exact (because 25/100 is ¼, which is exactly representable in binary floating-point), so 96/100 would have rounding errors. If the directions of the float and double rounding errors behave as independent, uniform random variables, about half would round in different directions, producing different results, giving about 48% mismatches, which is consistent with the 47% reported in the question.
(However, when I measure the actual results, I get 42% differences between the float and double methods. I suspect that has something to do with the trailing bits in the float multiplication before rounding—the distribution might not act like a uniform distribution of two possibilities. It may be the OP’s code prepares the percent values in some way other than dividing an integer value by 100.)

Nonintuitive result of the assignment of a double precision number to an int variable in C

Could someone give me an explanation why I get two different
numbers, resp. 14 and 15, as an output from the following code?
#include <stdio.h>
int main()
{
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
printf("%d %d",b,c); // 14 15, why?
return 0;
}
I expect to get 15 in both cases but it seems I'm missing some fundamentals of the language.
I am not sure if it's relevant but I was doing the test in CodeBlocks. However, if I type the same lines of code in some on-line compiler ( this one for example) I get an answer of 15 for the two printed variables.
... why I get two different numbers ...
Aside from the usual float-point issues, the computation paths to b and c are arrived in different ways. c is calculated by first saving the value as double a.
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
C allows intermediate floating-point math to be computed using wider types. Check the value of FLT_EVAL_METHOD from <float.h>.
Except for assignment and cast (which remove all extra range and precision), ...
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the
type;
1 evaluate operations and constants of type float and double to the
range and precision of the double type, evaluate long double
operations and constants to the range and precision of the long double
type;
2 evaluate all operations and constants to the range and precision of the
long double type.
C11dr §5.2.4.2.2 9
OP reported 2
By saving the quotient in double a = (Vmax-Vmin)/step;, precision is forced to double whereas int b = (Vmax-Vmin)/step; could compute as long double.
This subtle difference results from (Vmax-Vmin)/step (computed perhaps as long double) being saved as a double versus remaining a long double. One as 15 (or just above), and the other just under 15. int truncation amplifies this difference to 15 and 14.
On another compiler, the results may both have been the same due to FLT_EVAL_METHOD < 2 or other floating-point characteristics.
Conversion to int from a floating-point number is severe with numbers near a whole number. Often better to round() or lround(). The best solution is situation dependent.
This is indeed an interesting question, here is what happens precisely in your hardware. This answer gives the exact calculations with the precision of IEEE double precision floats, i.e. 52 bits mantissa plus one implicit bit. For details on the representation, see the wikipedia article.
Ok, so you first define some variables:
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
The respective values in binary will be
Vmax = 10.111001100110011001100110011001100110011001100110011
Vmin = 1.0110011001100110011001100110011001100110011001100110
step = .00011001100110011001100110011001100110011001100110011010
If you count the bits, you will see that I have given the first bit that is set plus 52 bits to the right. This is exactly the precision at which your computer stores a double. Note that the value of step has been rounded up.
Now you do some math on these numbers. The first operation, the subtraction, results in the precise result:
10.111001100110011001100110011001100110011001100110011
- 1.0110011001100110011001100110011001100110011001100110
--------------------------------------------------------
1.1000000000000000000000000000000000000000000000000000
Then you divide by step, which has been rounded up by your compiler:
1.1000000000000000000000000000000000000000000000000000
/ .00011001100110011001100110011001100110011001100110011010
--------------------------------------------------------
1110.1111111111111111111111111111111111111111111111111100001111111111111
Due to the rounding of step, the result is a tad below 15. Unlike before, I have not rounded immediately, because that is precisely where the interesting stuff happens: Your CPU can indeed store floating point numbers of greater precision than a double, so rounding does not take place immediately.
So, when you convert the result of (Vmax-Vmin)/step directly to an int, your CPU simply cuts off the bits after the fractional point (this is how the implicit double -> int conversion is defined by the language standards):
1110.1111111111111111111111111111111111111111111111111100001111111111111
cutoff to int: 1110
However, if you first store the result in a variable of type double, rounding takes place:
1110.1111111111111111111111111111111111111111111111111100001111111111111
rounded: 1111.0000000000000000000000000000000000000000000000000
cutoff to int: 1111
And this is precisely the result you got.
The "simple" answer is that those seemingly-simple numbers 2.9, 1.4, and 0.1 are all represented internally as binary floating point, and in binary, the number 1/10 is represented as the infinitely-repeating binary fraction 0.00011001100110011...[2] . (This is analogous to the way 1/3 in decimal ends up being 0.333333333... .) Converted back to decimal, those original numbers end up being things like 2.8999999999, 1.3999999999, and 0.0999999999. And when you do additional math on them, those .0999999999's tend to proliferate.
And then the additional problem is that the path by which you compute something -- whether you store it in intermediate variables of a particular type, or compute it "all at once", meaning that the processor might use internal registers with greater precision than type double -- can end up making a significant difference.
The bottom line is that when you convert a double back to an int, you almost always want to round, not truncate. What happened here was that (in effect) one computation path gave you 15.0000000001 which truncated down to 15, while the other gave you 14.999999999 which truncated all the way down to 14.
See also question 14.4a in the C FAQ list.
An equivalent problem is analyzed in analysis of C programs for FLT_EVAL_METHOD==2.
If FLT_EVAL_METHOD==2:
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
computes b by evaluating a long double expression then truncating it to a int, whereas for c it's evaluating from long double, truncating it to double and then to int.
So both values are not obtained with the same process, and this may lead to different results because floating types does not provides usual exact arithmetic.

How to round 8.475 to 8.48 in C (rounding function that takes into account representation issues)? Reducing probability of issue

I am trying to round 8.475 to 8.48 (to two decimal places in C). The problem is that 8.475 internally is represented as 8.47499999999999964473:
double input_test =8.475;
printf("input tests: %.20f, %.20f \n", input_test, *&input_test);
gives:
input tests: 8.47499999999999964473, 8.47499999999999964473
So, if I had an ideal round function then it would round 8.475=8.4749999... to 8.47. So, internal round function is no appropriate for me. I see that rounding problem arises in cases of "underflow" and therefore I am trying to use the following algorithm:
double MyRound2( double * value) {
double ad;
long long mzr;
double resval;
if ( *value < 0.000000001 )
ad = -0.501;
else
ad = 0.501;
mzr = long long (*value);
resval = *value - mzr;
resval= (long long( resval*100+ad))/100;
return resval;
}
This solves the "underflow" issue and it works well for "overflow" issues as well. The problem is that there are valid values x.xxx99 for which this function incorrectly gives bigger value (because of 0.001 in 0.501). How to solve this issue, how to devise algorithm that can detect floating point representation issue and that can round taking account this issue? Maybe C already has such clever rounding function? Maybe I can select different value for constant ad - such that probability of such rounding errors goes to zero (I mostly work with money values with up to 4 decimal ciphers).
I have read all the popoular articles about floating point representation and I know that there are tricky and unsolvable issues, but my client do not accept such explanation because client can clearly demonstrate that Excel handles (reproduces, rounds and so on) floating point numbers without representation issues.
(The C and C++ standards are intentionally flexible when it comes to the specification of the double type; quite often it is IEEE754 64 bit type. So your observed result is platform-dependent).
You are observing of the pitfalls of using floating point types.
Sadly there isn't an "out-of-the-box" fix for this. (Adding a small constant pre-rounding just pushes the problem to other numbers).
Moral of the story: don't use floating point types for money.
Use a special currency type instead or work in "pence"; using an integral type instead.
By the way, Excel does use an IEEE754 double precision floating point for its number type, but it also has some clever tricks up its sleeve. Essentially it tracks the joke digits carefully and also is clever with its formatting. This is how it can evaluate 1/3 + 1/3 + 1/3 exactly. But even it will get money calculations wrong sometimes.
For financial calculations, it is better to work in base-10 to avoid represenatation issues when going to/from binary. In many countries, financial software is even legally required to do so. Here is one library for IEEE 754R Decimal Floating-Point Arithmetic, have not tried it myself:
http://www.netlib.org/misc/intel/
Also note that working in decimal floating-point instead of fixed-point representation allows clever algoritms like the Kahan summation algorithm, to avoid accumulation of rounding errors. A noteworthy difference to normal floating point is that numbers with few significant digits are not normalized, so you can have e.g both 1*10^2 and .1*10^3.
An implementation note is that one representation in the std uses a binary significand, to allow sw implementations using a standard binary ALU.
How about this one: Define some threshold. This threshold is the distance to the next multiple of 0.005 at which you assume that this distance could be an error of imprecision. Execute appropriate methods if it's within that distance and smaller. Round as usual and at the end, if you detected that it was, add 0.01.
That said, this is only a work around and somewhat of a code smell. If you don't need too much speed, go for some other type than float. Like your own type that works like
class myDecimal{ int digits; int exponent_of_ten; } with value = digits * E exponent_of_ten
I am not trying to argument that using floating point numbers to represent money is advisable - it is not! but sometimes you have no choice... We do kind of work with money (life incurance calculations) and are forced to use floating point numbers for everything including values representing money.
Now there are quite some different rounding behaviours out there: round up, round down, round half up, round half down, round half even, maybe more. It looks like you were after round half up method.
Our round-half-up function - here translated from Java - looks like this:
#include <iostream>
#include <cmath>
#include <cfloat>
using namespace std;
int main()
{
double value = 8.47499999999999964473;
double result = value * pow(10, 2);
result = nextafter(result + (result > 0.0 ? 1e-8 : -1e-8), DBL_MAX);
double integral = floor(result);
double fraction = result - integral;
if (fraction >= 0.5) {
result = ceil(result);
} else {
result = integral;
}
result /= pow(10, 2);
cout << result << endl;
return 0;
}
where nextafter is a function returning the next floating point value after the given value - this code is proved to work using C++11 (AFAIK the nextafter is also available in boost), the result written into the standard output is 8.48.

Floating point rounding when truncating

This is probably a question for an x86 FPU expert:
I am trying to write a function which generates a random floating point value in the range [min,max]. The problem is that my generator algorithm (the floating-point Mersenne Twister, if you're curious) only returns values in the range [1,2) - ie, I want an inclusive upper bound, but my "source" generated value is from an exclusive upper bound. The catch here is that the underlying generator returns an 8-byte double, but I only want a 4-byte float, and I am using the default FPU rounding mode of Nearest.
What I want to know is whether the truncation itself in this case will result in my return value being inclusive of max when the FPU internal 80-bit value is sufficiently close, or whether I should increment the significand of my max value before multiplying it by the intermediary random in [1,2), or whether I should change FPU modes. Or any other ideas, of course.
Here's the code I am currently using, and I did verify that 1.0f resolves to 0x3f800000:
float MersenneFloat( float min, float max )
{
//genrand returns a double in [1,2)
const float random = (float)genrand_close1_open2();
//return in desired range
return min + ( random - 1.0f ) * (max - min);
}
If it makes a difference, this needs to work on both Win32 MSVC++ and Linux gcc. Also, will using any versions of the SSE optimizations change the answer to this?
Edit: The answer is yes, truncation in this case from double to float is sufficient to cause the result to be inclusive of max. See Crashworks' answer for more.
The SSE ops will subtly change the behavior of this algorithm because they don't have the intermediate 80-bit representation -- the math truly is done in 32 or 64 bits. The good news is that you can easily test it and see if it changes your results by simply specifying the /ARCH:SSE2 command line option to MSVC, which will cause it to use the SSE scalar ops instead of x87 FPU instructions for ordinary floating point math.
I'm not sure offhand of what the exact rounding behavior is around the integer boundaries, but you can test to see what'll happen when 1.999.. gets rounded from 64 to 32 bits by eg
static uint64 OnePointNineRepeating = 0x3FF FFFFF FFFF FFFF // exponent 0 (biased to 1023), all 1 bits in mantissa
double asDouble = *(double *)(&OnePointNineRepeating);
float asFloat = asDouble;
return asFloat;
Edit, result: original poster ran this test and found that with truncation, the 1.99999 will round up to 2 both with and without /arch:SSE2.
If you do adjust the rounding so that does include both ends of the range, will those extreme values not be only half as likely as any of the non-extreme ones?
With truncation, you are never going to be inclusive of the max.
Are you sure you really need the max? There is literally an almost 0 chance that you will land on exactly the maximum.
That said, you can exploit the fact that you are giving up precision and do something like this:
float MersenneFloat( float min, float max )
{
double random = 100000.0; // just a dummy value
while ((float)random > 65535.0)
{
//genrand returns a double in [1,2)
double random = genrand_close1_open2() - 1.0; // now it's [0,1)
random *= 65536.0; // now it's [0,65536). We try again if it's > 65535.0
}
//return in desired range
return min + float(random/65535.0) * (max - min);
}
Note that, now, it has a slight chance of multiple calls to genrand each time you call MersenneFloat. So you have given up possible performance for a closed interval. Since you are downcasting from double to float, you end up sacrificing no precision.
Edit: improved algorithm

Resources