rounding error of GNU C compiler - c

My textbook is C in a Nutshell (ISBN 978-0596006976).
In the section on casting, an example is meant to show a rounding error in C:
#include <stdio.h>
int
main()
{
long l_var = 123456789L;
float f_var = l_var;
printf("The rounding error (f_var - l_var) is %f\n", f_var - l_var);
return 0;
}
The program prints nothing but 0.000000, so the conversion seems to have caused no loss of precision.
I compiled it with gcc (v4.4.7) using:
gcc -Wall file.c -o exec
Does GCC somehow work around the problem described in that chapter, or is some setting hiding the rounding error?

I don't know what this chapter is telling you, but:
float f_var = l_var;
We can tell that f_var is (float)l_var. Now the expression:
f_var - l_var
As this operates on a long and a float, the long will be converted into a float. So the compiler will do:
f_var - (float)l_var
Which is the same as:
(float)l_var - (float)l_var
Which is zero, regardless of any rounding of the conversion.

I don't have access to this book.
My guess is that the example is trying to show that if you assign a 32-bit integer to a 32-bit float, you may lose bits to rounding: a 32-bit float has only a 24-bit significand (23 bits stored explicitly), so some low-order bits may be lost in the assignment.
The example code in the book appears to be bogus, though. Here is code that demonstrates the rounding error:
#include <stdint.h>
#include <stdio.h>
int main() {
int32_t l_var = 123456789L;
/* 32 bit variable, 24 bit significand (23 stored), approx. 7 decimals */
float f_var = l_var;
double err = (double) f_var - (double) l_var;
printf("The rounding error (f_var - l_var) is %f\n", err);
return 0;
}
This prints
The rounding error (f_var - l_var) is 3.000000
on my machine.

0 is the value you get if both values are converted to float; you'll get something else if they are converted to something else. And there is an allowance in the standard to use a wider floating-point representation than that required by the type for computation (*). Using it is especially tempting here, as the result has to be converted to a double for passing to printf.
My version of gcc does not use that allowance when compiling for x86_64 (-m64 argument for gcc), and it does use it when compiling for x86 (-m32 argument). That makes sense when you know that for 64 bits it uses SSE instructions, which can easily do the computation in float, while for 32 bits it uses the older "8087" stack model, which can't do that easily.
(*) Last paragraph of 6.2.1.5 in C90, 6.3.1.8/2 in C99, 6.3.1.8/2 in C11. I give the text of the latest (as in n1539)
The values of floating operands and of the results of floating expressions may be
represented in greater precision and range than that required by the type; the types are not changed thereby.
As pointed out by Pascal Cuoq, starting from C99, you can test which evaluation method is in use with FLT_EVAL_METHOD.

Related

Machine epsilon calculation is different using C11 and GNU11 compiler flags

When using Python & Julia, I can use a neat trick to investigate machine epsilon for a particular floating point representation.
For example, in Julia 1.1.1:
julia> 7.0/3 - 4/3 - 1
2.220446049250313e-16
julia> 7.0f0/3f0 - 4f0/3f0 - 1f0
-1.1920929f-7
I'm currently learning C and wrote this program to try and achieve the same thing:
#include <stdio.h>
int main(void)
{
float foo;
double bar;
foo = 7.0f/3.0f - 4.0f/3.0f - 1.0f;
bar = 7.0/3.0 - 4.0/3.0 - 1.0;
printf("\nM.E. for float: %e \n\n", foo);
printf("M.E. for double: %e \n\n", bar);
return 0;
}
Curiously, the answer I get depends on whether I use C11 or GNU11 compiler standard. My compiler is GCC 5.3.0, running on Windows 7 and installed via MinGW.
So in short, when I compile with: gcc -std=gnu11 -pedantic begin.c I get:
M.E. for float: -1.192093e-007
M.E. for double: 2.220446e-016
as I expect, and matches Python and Julia. But when I compile with: gcc -std=c11 -pedantic begin.c I get:
M.E. for float: -1.084202e-019
M.E. for double: -1.084202e-019
which is unexpected. I thought it might be GNU-specific features, which is why I added the -pedantic flag. I have been searching on Google and found this: https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html but I am still unable to explain the difference in behaviour.
To be explicit, my question is: Why is the result different using the different standards?
Update: The same differences apply with C99 and GNU99 standards.
In C, the best way to get the float or double epsilon is to include <float.h> and use FLT_EPSILON or DBL_EPSILON.
The value of 7.0/3.0 - 4.0/3.0 - 1.0; is not fully specified by the C standard because it allows implementations to evaluate floating-point expressions with more precision than the nominal type. To some extent, this can be dealt with by using casts or assignments. The C standard requires casts or assignments to “discard” excess precision. This is not a proper solution in general, because there can be rounding both with the initial excess precision and with the operation that “discards” excess precision. This double-rounding may produce a different result than calculating entirely with the nominal precision.
Using the cast workaround with the code in the question yields:
_Static_assert(FLT_RADIX == 2, "Floating-point radix must be two.");
float FloatEpsilon = (float) ((float) (7.f/3) - (float) (4.f/3)) - 1;
double DoubleEpsilon = (double) ((double) (7./3) - (double) (4./3)) - 1;
Note that a static assertion is required to ensure that the floating-point radix is as expected for this kludge to operate. The code should also include documentation explaining this bad idea:
The binary representation for the fraction ⅓ ends in an infinite sequence of “01010101…”.
When the binary for 4/3 or 7/3 is rounded to a fixed precision, it is as if the numeral were truncated and rounded down or up, depending on whether the next binary digit after truncation were a 0 or a 1.
Given our assumption that floating-point uses a base-two radix, 4/3 and 7/3 are in consecutive binades (4/3 is in [1, 2), and 7/3 is in [2, 4)). Therefore, their truncation points are one position apart.
Thus, when converted to a binary floating-point format, 4/3 and 7/3 differ in that the latter exceeds the former by 1 and its significand ends one bit sooner. Examination of the possible truncation points reveals that, aside from the initial difference of 1, the significands differ by the value of the position of the low bit in 4/3, although the difference may be in either direction.
By Sterbenz’ Lemma, there is no floating-point error in subtracting 4/3 from 7/3, so the result is exactly 1 plus the difference described above.
Subtracting 1 produces that difference, which is the value of the position of the low bit of 4/3 except that it may be positive or negative.

Specify float when initializing double. gcc and clang differ

I tried running this simple code on ideone.com
#include<stdio.h>
int main()
{
double a = 0.7f; // Notice: f for float
double b = 0.7;
if (a == b)
printf("Identical\n");
else
printf("Differ\n");
return 0;
}
With gcc-5.1 the output is Identical
With clang 3.7 the output is Differ
So it seems gcc ignores the f in 0.7f and treats it as a double while clang treats it as a float.
Is this a bug in one of the compilers or is this implementation dependent per standard?
Note: This is not about floating point numbers being inaccurate. The point is that gcc and clang treat this code differently.
The C standard allows floating-point operations to use higher precision than what is implied by the code, during both compilation and execution. I'm not sure if this is the exact clause in the standard, but the closest I can find is §6.5 in C11:
A floating expression may be contracted, that is, evaluated as though it were a single operation, thereby omitting rounding errors implied by the source code and the expression evaluation method
Not sure if this is it, or whether there's a better part of the standard that specifies this. There was a huge debate about this a decade or two ago (the problem used to be much worse on i386 because of the internally 40/80 bit floating point numbers in the 8087).
The compiler is required to convert the literal into an internal representation which is at least as accurate as the literal. So gcc is permitted to store floating point literals internally as doubles. Then when it stores the literal value in 'a' it will be able to store the double. And clang is permitted to store floats as floats and doubles as doubles.
So it's implementation specific, rather than a bug.
Addendum: For what it is worth, something similar can happen with ints as well
int64_t val1 = 5000000000;
int64_t val2 = 5000000000LL;
if (val1 != val2) { printf("Different\n"); } else { printf("Same\n"); }
can print either Different or Same depending on how your compiler treats integer literals (though this is more particularly an issue with 32 bit compilers)

pow numeric error in c

I'm wondering where does the numeric error happen, in what layer.
Let me explain using an example:
int p = pow(5, 3);
printf("%d", p);
I've tested this code on various HW and compilers (VS and GCC) and some of them print out 124, and some 125.
On the same HW (OS) I get different results with different compilers (VS and GCC).
On different HW (OS) I get different results with the same compiler (cc (GCC) 4.8.1).
AFAIK, pow computes to 124.99999999 and that gets truncated to int, but where does this error happen?
Or, in other words, where does the correction happen (124.99->125)
Is it a compiler-HW interaction?
//****** edited:
Here's an additional snippet to play with (keep an eye on p=5, p=18, ...):
#include <stdio.h>
#include <math.h>
int main(void) {
int p;
for (p = 1; p < 20; p++) {
printf("\n%d %d %f %f", (int) pow(p, 3), (int) exp(3 * log(p)), pow(p, 3), exp(3 * log(p)));
}
return 0;
}
(First note that for an IEEE754 double precision floating point type, all integers up to the 53rd power of 2 can be represented exactly. Blaming floating point precision for integral pow inaccuracies is normally incorrect).
pow(x, y) is normally implemented in C as exp(y * log(x)). Hence it can "go off" for even quite small integral cases.
For small integral cases, I normally write the computation long-hand, and for other integral arguments I use a 3rd party library. Although a do-it-yourself solution using a for loop is tempting, there are effective optimisations that can be done for integral powers that such a solution might not exploit.
As for the observed different results, it could be down to some of the platforms using an 80 bit floating point intermediary. Perhaps some of the computations then are above 125 and others are below that.

math.h default rounding mode not clear

I am facing a very strange fact about the rounding of floats and their conversion to int.
As is stated here:
http://www.gnu.org/software/libc/manual/html_node/Rounding.html
Rounding to the nearest representable value is the default rounding mode. But it doesn't seem to be.
So I have created this simple program:
#include <fenv.h>
#include <stdio.h>
int a;
double b;
int main(void) {
b=1.3; a=b; printf("%f %d\n",b,a);
b=1.8; a=b; printf("%f %d\n",b,a);
b=-1.3; a=b; printf("%f %d\n",b,a);
b=-1.8; a=b; printf("%f %d\n",b,a);
printf("%d %d %d\n",fegetround(),FE_TONEAREST,FE_TOWARDZERO);
}
Program was compiled with gcc-4.7 (debian), cygwin gcc and Visual studio. Output was same, only definition of FE_TOWARDZERO changed.
Output of program:
1.300000 1
1.800000 1
-1.300000 -1
-1.800000 -1
0 0 3072
So we can clearly see, that rounding mode is set to FE_TONEAREST (default) in all tested compilers, but all of them are rounding towards zero.
Why?
PS: Yes, I can use Math.round() but I am wondering why is this happening.
Because the rounding mode applies to floating-point arithmetic and to rounding functions such as rint() and nearbyint(). Conversion to int always truncates toward zero.
OK, I have found the answer to why this is happening.
As is stated here:
http://software.intel.com/en-us/articles/fast-floating-point-to-integer-conversions
in chapter
A Closer Look at Float-to-Int Conversions
The problem with casting from floating-point numbers to 32-bit
integers stems from the ANSI C standard, which states the conversion
should be effected by truncating the fractional portion of the number
and retaining the integer result. Because of this, whenever the
Microsoft Visual C++ 6.0 compiler encounters an (int) or a (long)
cast, it inserts a call to the _ftol C run-time function. This
function modifies the floating-point rounding mode to 'truncate',
performs the conversion, and then resets the rounding mode to its
original state prior to the cast.

Why does GCC give an unexpected result when adding float values?

I'm using GCC to compile a program which adds floats, longs, ints and chars. When it runs, the result is bad. The following program unexpectedly prints the value of 34032.101562.
Recompiling with a Microsoft compiler gives the right result.
#include <stdio.h>
int main (void) {
const char val_c = 10;
const int val_i = 20;
const long val_l = 34000;
const float val_f = 2.1;
float result;
result = val_c + val_i + val_l + val_f;
printf("%f\n", result);
return 0;
}
What do you think the "right result" is? I'm guessing that you believe it is 34032.1. It isn't.
2.1 is not representable as a float, so val_f instead is initialized with the closest representable float value. In binary, 2.1 is:
10.000110011001100110011001100110011001100110011001...
a float has 24 binary digits, so the value of val_f in binary is:
10.0001100110011001100110
The expression result = val_c + val_i + val_l + val_f computes 34030 + val_f, which is evaluated in single-precision and causes another rounding to occur.
  1000010011101110.0
+               10.0001100110011001100110
-----------------------------------------
  1000010011110000.0001100110011001100110
rounds to 24 digits:
-----------------------------------------
  1000010011110000.00011010
In decimal, this result is exactly 34032.1015625. Because the %f format prints 6 digits after the decimal point (unless specified otherwise), this is rounded again, and printf prints 34032.101562.
Now, why do you not get this result when you compile with MSVC? The C and C++ standards allow floating-point calculations to be carried out in a wider type if the compiler chooses to do so. MSVC does this with your calculation, which means that the result of 34030 + val_f is not rounded before being passed to printf. In that case, the exact floating-point value being printed is 34032.099999904632568359375, which is rounded to 34032.1 by printf.
Why don't all compilers do what MSVC does? A few reasons. First, it's slower on some processors. Second, and more importantly, although it can give more accurate answers, the programmer cannot depend on that -- seemingly unrelated code changes can cause the answer to change in the presence of this behavior. Because of this, carrying extra precision often causes more problems than it solves.
Google David Goldberg's paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
The float format has only about 6-7 digits of precision. Use %7.1f or some other reasonable format and you will like your results better.
I don't see any problem here. 2.1 has no exact representation in IEEE floating-point format, and as such, it is converting the entire answer to a floating-point number with around 6-7 (correct) sig-figs. If you need more precision, use a double.

Resources