math.h default rounding mode not clear - c

I am facing a very strange fact about rounding a float and converting it to an int.
As stated here:
http://www.gnu.org/software/libc/manual/html_node/Rounding.html
rounding to the nearest representable value is the default rounding mode. But it doesn't seem to be.
So I have created this simple program:
#include <fenv.h>
#include <stdio.h>

int a;
double b;

int main(void) {
    b = 1.3;  a = b; printf("%f %d\n", b, a);
    b = 1.8;  a = b; printf("%f %d\n", b, a);
    b = -1.3; a = b; printf("%f %d\n", b, a);
    b = -1.8; a = b; printf("%f %d\n", b, a);
    printf("%d %d %d\n", fegetround(), FE_TONEAREST, FE_TOWARDZERO);
    return 0;
}
The program was compiled with gcc 4.7 (Debian), Cygwin gcc, and Visual Studio. The output was the same; only the definition of FE_TOWARDZERO changed.
Output of program:
1.300000 1
1.800000 1
-1.300000 -1
-1.800000 -1
0 0 3072
So we can clearly see that the rounding mode is set to FE_TONEAREST (the default) in all tested compilers, yet all of them round toward zero.
Why?
PS: Yes, I can use round(), but I am wondering why this is happening.

Because the rounding mode applies to floating-point rounding functions. Conversion to int always truncates: C11 6.3.1.4 says the fractional part is discarded, i.e. the value is truncated toward zero.
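For contrast, here is a minimal sketch (mine, not from the answer) showing that the lrint() family from <math.h> does honour the current rounding mode, while the cast never does; link with -lm on POSIX systems:

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Strictly, #pragma STDC FENV_ACCESS ON is required before changing
       the rounding mode; not all compilers implement the pragma. */
    double b = 1.8;
    printf("(int)b   = %d\n", (int) b);    /* always truncates: 1 */
    printf("lrint(b) = %ld\n", lrint(b));  /* FE_TONEAREST: 2 */
    fesetround(FE_DOWNWARD);               /* round toward -infinity */
    printf("lrint(b) = %ld\n", lrint(b));  /* now 1 */
    return 0;
}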

OK, I have found out why this is happening.
As stated here:
http://software.intel.com/en-us/articles/fast-floating-point-to-integer-conversions
in the chapter "A Closer Look at Float-to-Int Conversions":
The problem with casting from floating-point numbers to 32-bit
integers stems from the ANSI C standard, which states the conversion
should be effected by truncating the fractional portion of the number
and retaining the integer result. Because of this, whenever the
Microsoft Visual C++ 6.0 compiler encounters an (int) or a (long)
cast, it inserts a call to the _ftol C run-time function. This
function modifies the floating-point rounding mode to 'truncate',
performs the conversion, and then resets the rounding mode to its
original state prior to the cast.

Related

Machine epsilon calculation is different using C11 and GNU11 compiler flags

When using Python & Julia, I can use a neat trick to investigate machine epsilon for a particular floating point representation.
For example, in Julia 1.1.1:
julia> 7.0/3 - 4/3 - 1
2.220446049250313e-16
julia> 7.0f0/3f0 - 4f0/3f0 - 1f0
-1.1920929f-7
I'm currently learning C and wrote this program to try and achieve the same thing:
#include <stdio.h>

int main(void)
{
    float foo;
    double bar;
    foo = 7.0f/3.0f - 4.0f/3.0f - 1.0f;
    bar = 7.0/3.0 - 4.0/3.0 - 1.0;
    printf("\nM.E. for float: %e \n\n", foo);
    printf("M.E. for double: %e \n\n", bar);
    return 0;
}
Curiously, the answer I get depends on whether I use C11 or GNU11 compiler standard. My compiler is GCC 5.3.0, running on Windows 7 and installed via MinGW.
So in short, when I compile with: gcc -std=gnu11 -pedantic begin.c I get:
M.E. for float: -1.192093e-007
M.E. for double: 2.220446e-016
as I expect, and matches Python and Julia. But when I compile with: gcc -std=c11 -pedantic begin.c I get:
M.E. for float: -1.084202e-019
M.E. for double: -1.084202e-019
which is unexpected. I thought it might be due to GNU-specific features, which is why I added the -pedantic flag. I have been searching on Google and found this: https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html but I am still unable to explain the difference in behaviour.
To be explicit, my question is: Why is the result different using the different standards?
Update: The same differences apply with C99 and GNU99 standards.
In C, the best way to get the float or double epsilon is to include <float.h> and use FLT_EPSILON or DBL_EPSILON (FLT_MIN and DBL_MIN are the smallest positive normalized values, not the epsilons).
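A minimal sketch printing those constants; their magnitudes match the gnu11 output in the question:

#include <float.h>
#include <stdio.h>

int main(void) {
    /* Machine epsilon: the gap between 1.0 and the next representable value. */
    printf("FLT_EPSILON = %e\n", FLT_EPSILON);  /* ~1.192093e-07 */
    printf("DBL_EPSILON = %e\n", DBL_EPSILON);  /* ~2.220446e-16 */
    return 0;
}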
The value of 7.0/3.0 - 4.0/3.0 - 1.0; is not fully specified by the C standard because it allows implementations to evaluate floating-point expressions with more precision than the nominal type. To some extent, this can be dealt with by using casts or assignments. The C standard requires casts or assignments to “discard” excess precision. This is not a proper solution in general, because there can be rounding both with the initial excess precision and with the operation that “discards” excess precision. This double-rounding may produce a different result than calculating entirely with the nominal precision.
Using the cast workaround with the code in the question yields:
_Static_assert(FLT_RADIX == 2, "Floating-point radix must be two.");
float FloatEpsilon = (float) ((float) (7.f/3) - (float) (4.f/3)) - 1;
double DoubleEpsilon = (double) ((double) (7./3) - (double) (4./3)) - 1;
Note that a static assertion is required to ensure that the floating-point radix is as expected for this kludge to operate. The code should also include documentation explaining this bad idea:
The binary representation of the fraction ⅓ ends in an infinitely repeating sequence “01010101…”.
When the binary for 4/3 or 7/3 is rounded to a fixed precision, it is as if the numeral were truncated and rounded down or up, depending on whether the next binary digit after truncation were a 0 or a 1.
Given our assumption that floating-point uses a base-two radix, 4/3 and 7/3 are in consecutive binades (4/3 is in [1, 2), and 7/3 is in [2, 4). Therefore, their truncation points are one position apart.
Thus, when converting to a binary floating-point format, 4/3 and 7/3 differ in that the latter exceeds the former by 1 and its significand ends one bit sooner. Examination of the possible truncation points reveals that, aside from the initial difference of 1, the significands differ by the value of the position of the low bit in 4/3, although the difference may be in either direction.
By Sterbenz’ Lemma, there is no floating-point error in subtracting 4/3 from 7/3, so the result is exactly 1 plus the difference described above.
Subtracting 1 produces that difference, which is the value of the position of the low bit of 4/3 except that it may be positive or negative.
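Assembled into a complete program (a sketch; _Static_assert needs C11, and per the last point the computed values may differ in sign from FLT_EPSILON and DBL_EPSILON while matching them in magnitude):

#include <float.h>
#include <stdio.h>

int main(void) {
    _Static_assert(FLT_RADIX == 2, "Floating-point radix must be two.");
    /* The casts force each intermediate result to the nominal type,
       discarding any excess precision. */
    float FloatEpsilon = (float) ((float) (7.f/3) - (float) (4.f/3)) - 1;
    double DoubleEpsilon = (double) ((double) (7./3) - (double) (4./3)) - 1;
    printf("float:  %e (FLT_EPSILON = %e)\n", FloatEpsilon, FLT_EPSILON);
    printf("double: %e (DBL_EPSILON = %e)\n", DoubleEpsilon, DBL_EPSILON);
    return 0;
}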

pow numeric error in c

I'm wondering where the numeric error happens, in what layer.
Let me explain using an example:
int p = pow(5, 3);
printf("%d", p);
I've tested this code on various hardware and compilers (VS and GCC), and some of them print 124 and some 125.
On the same HW (OS) I get different results from different compilers (VS and GCC).
On different HW (OS) I get different results from the same compiler (cc (GCC) 4.8.1).
AFAIK, pow computes to 124.99999999 and that gets truncated to int, but where does this error happen?
Or, in other words, where does the correction (124.99… -> 125) happen?
Is it a compiler-HW interaction?
//****** edited:
Here's an additional snippet to play with (keep an eye on p=5, p=18, ...):
#include <stdio.h>
#include <math.h>

int main(void) {
    int p;
    for (p = 1; p < 20; p++) {
        printf("\n%d %d %f %f", (int) pow(p, 3), (int) exp(3 * log(p)),
               pow(p, 3), exp(3 * log(p)));
    }
    return 0;
}
(First note that for an IEEE 754 double-precision floating-point type, all integers up to 2^53 can be represented exactly. Blaming floating-point precision for integral pow inaccuracies is normally incorrect.)
pow(x, y) is normally implemented in C as exp(y * log(x)). Hence it can "go off" for even quite small integral cases.
For small integral cases, I normally write the computation long-hand, and for other integral arguments I use a third-party library. Although a do-it-yourself solution using a for loop is tempting, there are effective optimisations for integral powers that such a solution might not exploit; one is sketched below.
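For illustration, a minimal sketch of one such optimisation, exponentiation by squaring, which needs O(log n) multiplications instead of n-1 (the name ipow and its signature are mine, not the answer's):

#include <stdint.h>

/* Exact integer power by repeated squaring; no floating point,
   hence no rounding error, but beware of overflow for large results. */
int64_t ipow(int64_t base, unsigned exp) {
    int64_t result = 1;
    while (exp > 0) {
        if (exp & 1u)   /* low bit of exponent set: fold this factor in */
            result *= base;
        base *= base;   /* square the base for the next bit */
        exp >>= 1;
    }
    return result;
}

With this, the (int) pow(p, 3) calls in the snippet above could be replaced by ipow(p, 3), which is exact for every p in range.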
As for the observed differing results, they could be down to some of the platforms using an 80-bit floating-point intermediate. Perhaps on some platforms the intermediate result lands just above 125 and on others just below it.

Why does pow(n,2) return 24 when n=5, with my compiler and OS?

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{
    int n, i, ele;
    n = 5;
    ele = pow(n, 2);
    printf("%d", ele);
    return 0;
}
The output is 24.
I'm using GNU/GCC in Code::Blocks.
What is happening?
I know the pow function returns a double, but 25 fits in an int type, so why does this code print 24 instead of 25? With n=4, n=6, n=3, or n=2 the code works, but with five it doesn't.
Here is what may be happening. You should be able to confirm it by looking at your compiler's implementation of the pow function:
Assuming you have the correct #includes (all the previous answers and comments about this are correct -- don't take the #include files for granted), the prototype for the standard pow function is:
double pow(double, double);
and you're calling pow like this:
pow(5,2);
The pow function goes through an algorithm (probably using logarithms) and thus uses floating-point operations and values to compute the power.
The pow function does not do a naive "multiply the value of x a total of n times", since it also has to handle fractional exponents, and you can't compute fractional powers that way.
So more than likely, the computation of pow with the arguments 5 and 2 resulted in a slight rounding error. When you assigned the result to an int, the fractional value was truncated, yielding 24.
If you are using integers, you might as well write your own intpow or similar function that simply multiplies the value the requisite number of times (a sketch follows the list below). The benefits of this are:
You won't get into the situation where you may get subtle rounding errors using pow.
Your intpow function will more than likely run faster than an equivalent call to pow.
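A minimal sketch of such a function, assuming a non-negative exponent (intpow is the name suggested above; the exact signature is my guess):

/* Naive exact integer power: multiplies base together exp times.
   No floating point, hence no rounding error; watch for overflow. */
long intpow(long base, unsigned exp) {
    long result = 1;
    while (exp-- > 0)
        result *= base;
    return result;
}

With it, ele = intpow(n, 2); yields exactly 25 for n = 5.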
You want an int result from a function meant for doubles.
You should perhaps use
ele = (int)(0.5 + pow(n, 2));  /* cast to int, and round by adding 0.5 */
Note that adding 0.5 before truncation rounds correctly only for non-negative results.
Floating-point arithmetic is not exact.
Although small values can be added and subtracted exactly, the pow() function normally works via logarithms (computing exp(y * log(x))), so even if the inputs are both exact, the result is not. Assigning to int always truncates, so if the inexactness is negative, you'll get 24 rather than 25.
The moral of this story is to use integer operations on integers, and be suspicious of <math.h> functions when the actual arguments are to be promoted or truncated. It's unfortunate that GCC doesn't warn unless you add -Wfloat-conversion (it's not in -Wall -Wextra, probably because there are many cases where such conversion is anticipated and wanted).
For integer powers, it's always safer and faster to use multiplication (division if negative) rather than pow() - reserve the latter for where it's needed! Do be aware of the risk of overflow, though.
When you use pow with variables, its result is a double. Assigning it to an int truncates it.
You can avoid this error by assigning the result of pow to a double or float variable.
So basically, pow translates to exp(log(x) * y), which produces a result that isn't precisely the same as x^y -- just a near approximation as a floating-point value. So for example 5^2 may become 24.9999996 or 25.00002.

rounding error of GNU C compiler

My textbook - C in a Nutshell, ISBN 978-0596006976
In the section on casting, the book gives an example that is supposed to show a C rounding error:
Code:
#include <stdio.h>

int main()
{
    long l_var = 123456789L;
    float f_var = l_var;
    printf("The rounding error (f_var - l_var) is %f\n", f_var - l_var);
    return 0;
}
The output is nothing but 0.000000, so there seems to be no precision problem when converting that literal. I compiled with gcc (v4.4.7):
gcc -Wall file.c -o exec
Did GCC find a better way around the problem mentioned in that chapter, or is some setting keeping the rounding error from showing?
I don't know what this chapter is telling you, but:
float f_var = l_var;
We can tell that f_var is (float)l_var. Now the expression:
f_var - l_var
As this operates on a long and a float, the long will be converted into a float. So the compiler will do:
f_var - (float)l_var
Which is the same as:
(float)l_var - (float)l_var
Which is zero, regardless of any rounding of the conversion.
I don't have access to this book.
My guess is that the example is trying to tell you that if you assign a 32-bit integer to a 32-bit float, you may lose bits due to rounding: a 32-bit float has only a 24-bit significand (23 bits stored), so some bits may be lost during the assignment.
Apparently, the example code is bogus in the book though. Here is the code to demonstrate the truncation error:
#include <stdint.h>
#include <stdio.h>

int main() {
    int32_t l_var = 123456789L;
    /* 32-bit variable; float has a 24-bit significand, approx. 7 decimals */
    float f_var = l_var;
    double err = (double) f_var - (double) l_var;
    printf("The rounding error (f_var - l_var) is %f\n", err);
    return 0;
}
This prints
The rounding error (f_var - l_var) is 3.000000
on my machine.
0 is the value you get if both values are converted to float; you'll get something else if they are converted to something else. And there is an allowance in the standard to use a wider floating-point representation than required by the type for computation (*). Using that allowance is especially tempting here, as the result has to be converted to a double for passing to printf.
My version of gcc does not use that allowance when compiling for x86_64 (-m64 argument to gcc) and does use it when compiling for x86 (-m32 argument). That makes sense when you know that for 64 bits it uses SSE instructions, which can easily do the computation in float, while for 32 bits it uses the older "8087" stack model, which can't do that easily.
(*) Last paragraph of 6.2.1.5 in C90, 6.3.1.8/2 in C99, 6.3.1.8/2 in C11. I give the text of the latest (as in n1539)
The values of floating operands and of the results of floating expressions may be
represented in greater precision and range than that required by the type; the types are not changed thereby.
As pointed out by Pascal Cuoq, starting from C99 you can test this with FLT_EVAL_METHOD.
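A minimal check, assuming a C99-or-later <float.h>:

#include <float.h>
#include <stdio.h>

int main(void) {
    /*  0: evaluate in the nominal type (typical with SSE);
        1: evaluate float and double as double;
        2: evaluate everything as long double (typical of the x87 model);
       -1: indeterminable. */
    printf("FLT_EVAL_METHOD = %d\n", (int) FLT_EVAL_METHOD);
    return 0;
}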

Why does GCC give an unexpected result when adding float values?

I'm using GCC to compile a program which adds floats, longs, ints and chars. When it runs, the result is bad. The following program unexpectedly prints the value of 34032.101562.
Recompiling with a Microsoft compiler gives the right result.
#include <stdio.h>

int main(void) {
    const char val_c = 10;
    const int val_i = 20;
    const long val_l = 34000;
    const float val_f = 2.1;
    float result;
    result = val_c + val_i + val_l + val_f;
    printf("%f\n", result);
    return 0;
}
What do you think the "right result" is? I'm guessing that you believe it is 34032.1. It isn't.
2.1 is not representable as a float, so val_f instead is initialized with the closest representable float value. In binary, 2.1 is:
10.000110011001100110011001100110011001100110011001...
a float has 24 binary digits, so the value of val_f in binary is:
10.0001100110011001100110
The expression result = val_c + val_i + val_l + val_f computes 34030 + val_f, which is evaluated in single precision and causes another rounding to occur.
  1000010011101110.0
+               10.0001100110011001100110
-----------------------------------------
  1000010011110000.0001100110011001100110

which rounds to 24 significant digits:

  1000010011110000.00011010
In decimal, this result is exactly 34032.1015625. Because the %f format prints 6 digits after the decimal point (unless specified otherwise), this is rounded again, and printf prints 34032.101562.
Now, why do you not get this result when you compile with MSVC? The C and C++ standards allow floating-point calculations to be carried out in a wider type if the compiler chooses to do so. MSVC does this with your calculation, which means that the result of 34030 + val_f is not rounded before being passed to printf. In that case, the exact floating-point value being printed is 34032.099999999991268850862979888916015625, which is rounded to 34032.1 by printf.
Why don't all compilers do what MSVC does? A few reasons. First, it's slower on some processors. Second, and more importantly, although it can give more accurate answers, the programmer cannot depend on that -- seemingly unrelated code changes can cause the answer to change in the presence of this behavior. Because of this, carrying extra precision often causes more problems than it solves.
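A minimal sketch of the two behaviours (assuming FLT_EVAL_METHOD is 0): forcing the sum into a float reproduces the extra rounding described above, while keeping it in a double preserves the wider intermediate:

#include <stdio.h>

int main(void) {
    float as_float = 34030 + 2.1f;            /* rounded to float precision */
    double as_double = 34030 + (double) 2.1f; /* keeps the wider intermediate */
    printf("%f\n", as_float);   /* 34032.101562 */
    printf("%f\n", as_double);  /* 34032.100000 */
    return 0;
}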
Google David Goldberg's paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
The float format has only about 6-7 digits of precision. Use %7.1f or some other reasonable format and you will like your results better.
I don't see any problem here. 2.1 has no exact representation in IEEE floating-point format, and as such the entire answer is converted to a floating-point number with around 6-7 (correct) significant figures. If you need more precision, use a double.
