I'm doing a project where I do RGB to luma conversions, and I have some rounding issues with the -mno-sse2 flag:
Here's the test code:
#include <stdio.h>
#include <stdint.h>
static double rec709_luma_coeff[3] = {0.2126, 0.7152, 0.0722};
int main()
{
uint16_t n = 242 * rec709_luma_coeff[0] + 242 * rec709_luma_coeff[1] + 242 * rec709_luma_coeff[2];
printf("%u\n", n);
return 0;
}
And here's what I get:
user@gentoo> gcc -mno-sse2 test.c -o test && ./test
241
user@gentoo> gcc test.c -o test && ./test
242
I suppose that gcc uses sse2 optimizations for double multiplications, but what I don't get is why the optimized version would be the correct one.
Also, what do you recommend I use to get more consistent results, ceil() or floor()?
TL;DR: use lrint(x) or (int)rint(x) to convert from float to int with round-to-nearest instead of truncation. Unfortunately, not all compilers inline the same math functions efficiently, though. See round() for float in C++.
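For instance, here is a minimal sketch of the conversion from the question using lrint() (the coefficient array is copied from the question; link with -lm):

#include <math.h>     /* lrint(); link with -lm */
#include <stdint.h>
#include <stdio.h>

static const double rec709_luma_coeff[3] = {0.2126, 0.7152, 0.0722};

int main(void)
{
    double sum = 242 * rec709_luma_coeff[0]
               + 242 * rec709_luma_coeff[1]
               + 242 * rec709_luma_coeff[2];
    /* lrint() rounds to the nearest integer (in the current rounding mode)
       instead of truncating toward zero, so a result just below 242.0
       still comes out as 242. */
    uint16_t n = (uint16_t) lrint(sum);
    printf("%u\n", n);
    return 0;
}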
gcc -mno-sse2 has to use x87 for double, even in 64-bit code. x87 registers have an internal precision of 80 bits, but SSE2 uses the IEEE binary64 (aka double) format natively in XMM registers, so all the temporaries are rounded to 64-bit double at each step.
The problem isn't anything as interesting as the double rounding problem (80 bit -> 64 bit, then to integer). It's also not from gcc -O0 (the default: no extra optimizations) rounding when storing temporaries to memory, because you did the whole thing in one C statement so it does just use x87 registers for the whole expression.
It's simply that 80-bit precision leads to a result that is just below 242.0 and is truncated to 241 by C's float->int semantics, while SSE2 produces a result just above 242.0, which truncates to 242. For x87, this rounding down to the next lower integer happens consistently for any input from 1 to 65535, not just 242. (I made a version of your program using atoi(argv[1]) so I could test other values, and with -O3.)
Remember that int foo = 123.99999 is 123, because C uses the "truncation" rounding mode (towards zero). For non-negative numbers, this is the same as floor (which rounds towards -Infinity). https://en.wikipedia.org/wiki/Floating-point_arithmetic#Rounding_modes.
double can't represent the coefficients exactly: I printed them with gdb and got: {0.21260000000000001, 0.71519999999999995, 0.0722}. Those decimal representations are probably not exact representations of the base-2 floating point values. But they're close enough to see that the coefficients add up to 0.99999999999999996 (using an arbitrary-precision calculator).
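You can see roughly the same thing without gdb by printing the constants with 17 significant digits; a small sketch (the printed digits should match the gdb output above):

#include <stdio.h>
int main(void)
{
    /* 17 significant decimal digits are enough to uniquely identify a double. */
    printf("%.17g\n", 0.2126);   /* expect 0.21260000000000001 */
    printf("%.17g\n", 0.7152);   /* expect 0.71519999999999995 */
    printf("%.17g\n", 0.0722);   /* expect 0.0722 */
    return 0;
}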
We get consistent rounding down because the x87 internal precision is higher than the precision of the coefficients, so the sum of the rounding errors in n * rec709_luma_coeff[0] and so on, and in summing up the results, is about 2^11 times smaller than the difference between the sum of the coefficients and 1.0 (64-bit significand vs. 53 bits).
The real question is how the SSE2 version managed to work! Presumably round to nearest-even on the temporaries happens to go upward in enough cases, at least for 242. It happens to produce the original input for more cases than not, but it produces input-1 for 5, 7, 10, 13, 14, 20... (252 of the first 1000 numbers from 1..1000 are "munged" by the SSE2 version, so it's not like it always works either.)
With -O3 for your source, it does the calculation at compile time with extended precision and produces the exact result, i.e. it compiles the same as printf("%u\n", 242);.
And BTW, you should use static const for your constants so gcc can optimize better. static alone is already much better than a plain global, because the compiler can see that nothing in the compilation unit writes the values or passes their address anywhere, so it can treat them as if they were const.
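That is, something along these lines (same array as in the question):

/* file-scope, read-only: the compiler can fold these into the computation */
static const double rec709_luma_coeff[3] = {0.2126, 0.7152, 0.0722};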
Related
When using Python & Julia, I can use a neat trick to investigate machine epsilon for a particular floating point representation.
For example, in Julia 1.1.1:
julia> 7.0/3 - 4/3 - 1
2.220446049250313e-16
julia> 7.0f0/3f0 - 4f0/3f0 - 1f0
-1.1920929f-7
I'm currently learning C and wrote this program to try and achieve the same thing:
#include <stdio.h>
int main(void)
{
float foo;
double bar;
foo = 7.0f/3.0f - 4.0f/3.0f - 1.0f;
bar = 7.0/3.0 - 4.0/3.0 - 1.0;
printf("\nM.E. for float: %e \n\n", foo);
printf("M.E. for double: %e \n\n", bar);
return 0;
}
Curiously, the answer I get depends on whether I use C11 or GNU11 compiler standard. My compiler is GCC 5.3.0, running on Windows 7 and installed via MinGW.
So in short, when I compile with: gcc -std=gnu11 -pedantic begin.c I get:
M.E. for float: -1.192093e-007
M.E. for double: 2.220446e-016
as I expect, and matches Python and Julia. But when I compile with: gcc -std=c11 -pedantic begin.c I get:
M.E. for float: -1.084202e-019
M.E. for double: -1.084202e-019
which is unexpected. I thought it might be due to GNU-specific features, which is why I added the -pedantic flag. I have been searching on Google and found this: https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html but I am still unable to explain the difference in behaviour.
To be explicit, my question is: Why is the result different using the different standards?
Update: The same differences apply with C99 and GNU99 standards.
In C, the best way to get the float or double machine epsilon is to include <float.h> and use FLT_EPSILON or DBL_EPSILON.
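For reference, a minimal sketch that just prints those constants instead of recomputing them:

#include <float.h>
#include <stdio.h>
int main(void)
{
    /* FLT_EPSILON / DBL_EPSILON: difference between 1.0 and the next
       representable float / double value. */
    printf("float  epsilon: %e\n", FLT_EPSILON);
    printf("double epsilon: %e\n", DBL_EPSILON);
    return 0;
}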
The value of 7.0/3.0 - 4.0/3.0 - 1.0; is not fully specified by the C standard because it allows implementations to evaluate floating-point expressions with more precision than the nominal type. To some extent, this can be dealt with by using casts or assignments. The C standard requires casts or assignments to “discard” excess precision. This is not a proper solution in general, because there can be rounding both with the initial excess precision and with the operation that “discards” excess precision. This double-rounding may produce a different result than calculating entirely with the nominal precision.
Using the cast workaround with the code in the question yields:
#include <float.h>   /* added here for FLT_RADIX, used by the static assertion */

_Static_assert(FLT_RADIX == 2, "Floating-point radix must be two.");
float FloatEpsilon = (float) ((float) (7.f/3) - (float) (4.f/3)) - 1;
double DoubleEpsilon = (double) ((double) (7./3) - (double) (4./3)) - 1;
Note that a static assertion is required to ensure that the floating-point radix is as expected for this kludge to operate. The code should also include documentation explaining this bad idea:
The binary representation for the fraction ⅓ ends in an infinite sequence of “01010101…”.
When the binary for 4/3 or 7/3 is rounded to a fixed precision, it is as if the numeral were truncated and rounded down or up, depending on whether the next binary digit after truncation were a 0 or a 1.
Given our assumption that floating-point uses a base-two radix, 4/3 and 7/3 are in consecutive binades (4/3 is in [1, 2), and 7/3 is in [2, 4)). Therefore, their truncation points are one position apart.
Thus, when converting to a binary floating-point format, 4/3 and 7/3 differ in that the latter exceeds the former by 1 and its significand ends one bit sooner. Examination of the possible truncation points reveals that, aside from the initial difference of 1, the significands differ by the value of the position of the low bit in 4/3, although the difference may be in either direction.
By Sterbenz’ Lemma, there is no floating-point error in subtracting 4/3 from 7/3, so the result is exactly 1 plus the difference described above.
Subtracting 1 produces that difference, which is the value of the position of the low bit of 4/3 except that it may be positive or negative.
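If you want to see this directly, printing the values with C99's %a (exact hexadecimal floating point) makes the truncation points described above visible; a small sketch, assuming double expressions are evaluated without excess precision (e.g. x86-64 with SSE2):

#include <float.h>
#include <stdio.h>
int main(void)
{
    printf("4/3         = %a\n", 4./3);
    printf("7/3         = %a\n", 7./3);
    printf("7/3-4/3-1   = %a\n", 7./3 - 4./3 - 1.);   /* one low-order bit of 4/3 */
    printf("DBL_EPSILON = %a\n", DBL_EPSILON);
    return 0;
}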
I'm compiling and running the following program in 32 and 64 bit platforms:
int main()
{
double y = 8.34214e08;
double z = 1.25823e45;
return y * z == 8.34214e08 * 1.25823e45;
}
While on 64-bit the result is the expected one (the values are equal and the exit code is non-zero), on 32-bit there seems to be a small difference between the value calculated at compile time (the right-hand side of the comparison) and the left-hand side, computed at runtime.
Is this a bug in the compiler, or is there a logical explanation?
EDIT: this is different from Why comparing double and float leads to unexpected result? because here all the values are double.
IEEE-754 allows intermediate computations to be done in a greater precision (emphasis mine).
(IEEE-754:2008) "A language standard should also define, and require implementations to provide, attributes that allow and disallow value-changing optimizations, separately or collectively, for a block. These optimizations might include, but are not limited to: [...] Use of wider intermediate results in expression evaluation."
In your case for example on a IA-32, the double values could be stored in the x87 FPU registers with greater precision (80-bit instead of 64). So you are actually comparing a multiplication done on double precision with a multiplication done on double-extended precision.
For example, on x64 where the result is 1 (the x87 FPU is not used as SSE is used instead), adding gcc option -mfpmath=387 to use the x87 makes the result change to 0 on my machine.
And if you wonder if that is also allowed by C, it is:
(C99, 6.3.1.p8) "The values of floating operands and of the results of floating expressions may be represented in greater precision and range than that required by the type;"
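A quick way to check whether an implementation takes advantage of this allowance is the FLT_EVAL_METHOD macro from <float.h> (C99 and later); a minimal sketch:

#include <float.h>
#include <stdio.h>
int main(void)
{
    /* 0: evaluate in the nominal type; 1: float evaluated as double;
       2: everything evaluated as long double (typical for x87);
       -1: indeterminable. */
    printf("FLT_EVAL_METHOD = %d\n", (int) FLT_EVAL_METHOD);
    return 0;
}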
In general, never do equality checks with floating point numbers. You need to check whether the result you want differs from the result you get by less than a pre-set precision.
What is happening here is in all likelihood due to the multiplication being run on two different "platforms": once by your code, and once by the compiler, which may have a different precision. This happens with most compilers.
Your program would probably work if you compiled it with the same options that were used to compile the compiler (supposing the compiler was compiled by itself). But that would not mean you would get the correct result; you would be getting the same precision error the compiler is getting.
(Also, I'm assuming that the compiler performs a straight multiplication and the parsing code recognizing floats does not enter into the equation. This might well be wishful thinking on my part).
Testing
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/4.8/lto-wrapper
Target: x86_64-suse-linux
Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java,ada --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.8 --enable-ssp --disable-libssp --disable-plugin --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' --disable-libgcj --disable-libmudflap --with-slibdir=/lib64 --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-version-specific-runtime-libs --enable-linker-build-id --enable-linux-futex --program-suffix=-4.8 --without-system-libunwind --with-arch-32=i586 --with-tune=generic --build=x86_64-suse-linux --host=x86_64-suse-linux
Thread model: posix
gcc version 4.8.3 20141208 [gcc-4_8-branch revision 218481] (SUSE Linux)
#include <stdio.h>
int main()
{
double y = 8.34214e08;
double z = 1.25823e45;
return printf("%s\n", y * z == 8.34214e08 * 1.25823e45 ? "Equal" : "NOT equal!");
}
Forcing -O0 to keep the compiler from optimizing out the whole computation (thanks @markgz!), we get
$ gcc -m32 -O0 -o float float.c && ./float
NOT equal!
$ gcc -m32 -frounding-math -O0 -o float float.c && ./float
Equal
For the record, since you got there before me :-),
-frounding-math
Disable transformations and optimizations that assume default floating-point rounding behavior. This is round-to-zero for all floating point to integer conversions, and round-to-nearest for all other arithmetic truncations. This option should be specified for programs that change the FP rounding mode dynamically, or that may be executed with a non-default rounding mode. This option disables constant folding of floating-point expressions at compile time (which may be affected by rounding mode) and arithmetic transformations that are unsafe in the presence of sign-dependent rounding modes.
The default is -fno-rounding-math.
Floating-point calculations done at compile time often occur at a higher precision than double uses at run time. Also, C may perform run-time intermediate double calculations at the higher long double precision. Either could explain your inequality. See FLT_EVAL_METHOD for details.
volatile double y = 8.34214e08;
volatile double z = 1.25823e45;
volatile double yz = 8.34214e08 * 1.25823e45;
printf("%.20e\n", y);
printf("%.20e\n", z);
printf("%.20e\n", yz);
printf("%.20Le\n", (long double) y*z);
printf("%.20Le\n", (long double) 8.34214e08 * 1.25823e45);
8.34214000000000000000e+08
1.25822999999999992531e+45
// 3 different products!
1.04963308121999993395e+54
1.04963308121999993769e+54
1.04963308122000000000e+54
Your results may slightly differ.
I'm debugging some old C code and it has a definition #define PI 3.14... where ... is about 50 other digits.
Why is this? I said I could reduce the number to about 16 decimal places, but my boss snarled at me, saying that the other digits are there for platform independence and forward compatibility. But will it slow the program down?
No, this will not slow down the program, unless you are running on an incredibly underpowered 1MHz DSP chip that has to do floating point arithmetic in software as opposed to passing it off to a dedicated FPU. This would mean that any mathematical operations that use floating point data are much slower than just using integer arithmetic.
In general, greater precision is only going to introduce a slowdown if the most time-consuming part of your program is doing a lot of calculations in rapid succession, and floating point calculations are especially slow. On a modern CPU, this is generally not the case, with the possible exception of certain chips that cause an 80-cycle stall on things like floating point underflow. That kind of issue likely exceeds the domain of this question.
First, it's better to use a common definition of PI, like M_PI from <math.h> (a POSIX/X-Open extension that most implementations provide rather than part of ISO C), where it is defined as #define M_PI 3.14159265358979323846. If you insist, you can go ahead and define it manually.
Also, the best precision currently available in C is the equivalent of about 19 digits.
According to Wikipedia, the 80-bit "Intel" IEEE 754 extended-precision long double, which is 80 bits padded to 16 bytes in memory, has a 64-bit mantissa with no implicit bit, which gets you 19.26 decimal digits. This has been the almost universal standard for long double for ages, but recently things have started to change.
The newer 128-bit quad-precision format has 112 mantissa bits plus an implicit bit, which gets you 34 decimal digits. GCC implements this as the __float128 type and there is (if memory serves) a compiler option to set long double to it.
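To check what a given platform actually provides, a small sketch using the <float.h> macros:

#include <float.h>
#include <stdio.h>
int main(void)
{
    /* LDBL_MANT_DIG is 64 for the 80-bit x87 format (about 19 decimal digits)
       and 113 for IEEE quad precision (about 34 decimal digits). */
    printf("long double: %d mantissa bits, %d decimal digits\n",
           LDBL_MANT_DIG, LDBL_DIG);
    return 0;
}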
Personally, if I were required to use our own definition of pi, I'd write something like this:
#ifndef M_PI
#define PI 3.14159265358979323846264338327950288419716939937510
#else
#define PI M_PI
#endif
If the latest C standard supports an even wider floating point primitive data type, it's pretty much a guarantee that constants in the math library would be updated to support this.
References
More Precise Floating point Data Types than double?, Accessed 2014-03-13, <https://stackoverflow.com/questions/15659668/more-precise-floating-point-data-types-than-double>
Math constant PI value in C, Accessed 2014-03-13, <https://stackoverflow.com/questions/9912151/math-constant-pi-value-in-c>
The number of digits in a macro definition almost certainly will have no effect at all on run-time performance.
Macro expansion is textual. That means that if you have:
#define PI 3.14159... /* 50 digits */
then any time you refer to PI in code to which that definition is visible, it will be as if you had written out 3.14159....
C has just three floating-point types: float, double, and long double. Their sizes and precisions are implementation-defined, but they're typically 32 bits, 64 bits, and something wider than 64 bits (the size of long double typically varies more from system to system than the other two do).
If you use PI in an expression, it will be evaluated as a value of some specific type. And in fact, if there's no L suffix on the literal, it will be of type double.
So if you write:
double x = PI / 2.0;
it's as if you had written:
double x = 3.14159... / 2.0;
The compiler will probably evaluate the division at compile time generating a value of type double. Any extra precision in the literal will be discarded.
To see this, you can try writing a small program that uses the PI macro and examining an assembly listing.
For example:
#include <stdio.h>
#define PI 3.141592653589793238462643383279502884198716939937510582097164
int main(void) {
double x = PI;
printf("x = %g\n", x);
}
On my x86_64 system, the generated machine code has no reference to the full precision value. The instruction corresponding to the initialization is:
movabsq $4614256656552045848, %rax
where 4614256656552045848 is a 64-bit integer corresponding to the binary IEEE double-precision representation of a number as close as possible to 3.141592653589793238462643383279502884198716939937510582097164.
The actual stored floating-point value on my system happens to be exactly:
3.1415926535897931159979634685441851615905761718750000000000000000
of which only about 16 decimal digits are significant.
I'm trying to write unit tests for some simple vector math functions that operate on arrays of single-precision floating point numbers. The functions use SSE intrinsics, and I'm getting false positives (at least I think so) when running the tests on a 32-bit system (the tests pass on 64-bit). As the operation runs through the array, I accumulate more and more round-off error. Here is a snippet of unit test code and output (my actual question(s) follow):
Test Setup:
#include <string.h>      /* memset (implied by the snippet) */
#include <xmmintrin.h>   /* _mm_malloc (implied by the snippet) */

/* Buffers used by the tests; their declarations are implied by the question. */
static float *input, *ainput, *output, *expected;

static const int N = 1024;
static const float MSCALAR = 42.42f;

static void setup(void) {
    input = _mm_malloc(sizeof(*input) * N, 16);
    ainput = _mm_malloc(sizeof(*ainput) * N, 16);
    output = _mm_malloc(sizeof(*output) * N, 16);
    expected = _mm_malloc(sizeof(*expected) * N, 16);

    memset(output, 0, sizeof(*output) * N);

    for (int i = 0; i < N; i++) {
        input[i] = i * 0.4f;
        ainput[i] = i * 2.1f;
        expected[i] = (input[i] * MSCALAR) + ainput[i];
    }
}
My main test code then calls the function to be tested (which does the same calculation used to generate the expected array) and checks its output against the expected array generated above. The check is for closeness (within 0.0001) not equality.
Sample output:
0.000000 0.000000 delta: 0.000000
44.419998 44.419998 delta: 0.000000
...snip 100 or so lines...
2043.319946 2043.319946 delta: 0.000000
2087.739746 2087.739990 delta: 0.000244
...snip 100 or so lines...
4086.639893 4086.639893 delta: 0.000000
4131.059570 4131.060059 delta: 0.000488
4175.479492 4175.479980 delta: 0.000488
...etc, etc...
I know I have two problems:
On 32-bit machines, differences between 387 and SSE floating point arithmetic units. I believe 387 uses more bits for intermediate values.
Non-exact representation of my 42.42 value that I'm using to generate expected values.
So my question is, what is the proper way to write meaningful and portable unit tests for math operations on floating point data?
*By portable I mean it should pass on both 32- and 64-bit architectures.
Per a comment, we see that the function being tested is essentially:
for (int i = 0; i < N; ++i)
D[i] = A[i] * b + C[i];
where A[i], b, C[i], and D[i] all have type float. When referring to the data of a single iteration, I will use a, c, and d for A[i], C[i], and D[i].
Below is an analysis of what we could use for an error tolerance when testing this function. First, though, I want to point out that we can design the test so that there is no error. We can choose the values of A[i], b, C[i], and D[i] so that all the results, both final and intermediate results, are exactly representable and there is no rounding error. Obviously, this will not test the floating-point arithmetic, but that is not the goal. The goal is to test the code of the function: Does it execute instructions that compute the desired function? Simply choosing values that would reveal any failures to use the right data, to add, to multiply, or to store to the right location will suffice to reveal bugs in the function. We trust that the hardware performs floating-point correctly and are not testing that; we just want to test that the function was written correctly. To accomplish this, we could, for example, set b to a power of two, A[i] to various small integers, and C[i] to various small integers multiplied by b. I could detail limits on these values more precisely if desired. Then all results would be exact, and any need to allow for a tolerance in comparison would vanish.
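As a sketch of that idea (names here are illustrative, not taken from the question): with b a power of two and small integer inputs, every product, sum, and stored value is exactly representable in float, so the test could compare for exact equality.

/* Exact test data for D[i] = A[i]*b + C[i]: b is a power of two, A[i] and
   C[i]/b are small integers, so no operation rounds. */
#define TESTN 1024
static const float b = 4.0f;              /* power of two */

static void fill_exact(float *A, float *C, float *D_expected)
{
    for (int i = 0; i < TESTN; i++) {
        A[i] = (float) i;                 /* small integer */
        C[i] = (float) (i % 8) * b;       /* small integer multiple of b */
        D_expected[i] = A[i] * b + C[i];  /* exact: values stay well below 2^24 */
    }
}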
That aside, let us proceed to error analysis.
The goal is to find bugs in the implementation of the function. To do this, we can ignore small errors in the floating-point arithmetic, because the kinds of bugs we are seeking almost always cause large errors: The wrong operation is used, the wrong data is used, or the result is not stored in the desired location, so the actual result is almost always very different from the expected result.
Now the question is how much error should we tolerate? Because bugs will generally cause large errors, we can set the tolerance quite high. However, in floating-point, “high” is still relative; an error of one million is small compared to values in the trillions, but it is too high to discover errors when the input values are in the ones. So we ought to do at least some analysis to decide the level.
The function being tested will use SSE intrinsics. This means it will, for each i in the loop above, either perform a floating-point multiply and a floating-point add or will perform a fused floating-point multiply-add. The potential errors in the latter are a subset of the former, so I will use the former. The floating-point operations for a*b+c do some rounding so that they calculate a result that is approximately a•b+c (interpreted as an exact mathematical expression, not floating-point). We can write the exact value calculated as (a•b•(1+e0)+c)•(1+e1) for some errors e0 and e1 with magnitudes at most 2^-24, provided all the values are in the normal range of the floating-point format. (2^-24 is the maximum relative error that can occur in any correctly rounded elementary floating-point operation in round-to-nearest mode in the IEEE-754 32-bit binary floating-point format. Rounding in round-to-nearest mode changes the mathematical value by at most half the value of the least significant bit in the significand, which is 23 bits below the most significant bit.)
Next, we consider what value the test program produces for its expected value. It uses the C code d = a*b + c;. (I have converted the long names in the question to shorter names.) Ideally, this would also calculate a multiply and an add in IEEE-754 32-bit binary floating-point. If it did, then the result would be identical to the function being tested, and there would be no need to allow for any tolerance in comparison. However, the C standard allows implementations some flexibility in performing floating-point arithmetic, and there are non-conforming implementations that take more liberties than the standard allows.
A common behavior is for an expression to be computed with more precision than its nominal type. Some compilers may calculate a*b + c using double or long double arithmetic. The C standard requires that results be converted to the nominal type in casts or assignments; extra precision must be discarded. If the C implementation is using extra precision, then the calculation proceeds: a*b is calculated with extra precision, yielding exactly a•b, because double and long double have enough precision to exactly represent the product of any two float values. A C implementation might then round this result to float. This is unlikely, but I allow for it anyway. However, I also dismiss it because it moves the expected result to be closer to the result of the function being tested, and we just need to know the maximum error that can occur. So I will continue, with the worse (more distant) case, that the result so far is a•b. Then c is added, yielding (a•b+c)•(1+e2) for some e2 with magnitude at most 2^-53 (the maximum relative error of normal numbers in the 64-bit binary format). Finally, this value is converted to float for assignment to d, yielding (a•b+c)•(1+e2)•(1+e3) for some e3 with magnitude at most 2^-24.
Now we have expressions for the exact result computed by a correctly operating function, (a•b•(1+e0)+c)•(1+e1), and for the exact result computed by the test code, (a•b+c)•(1+e2)•(1+e3), and we can calculate a bound on how much they can differ. Simple algebra tells us the exact difference is a•b•(e0+e1+e0•e1-e2-e3-e2•e3)+c•(e1-e2-e3-e2•e3). This is a simple function of e0, e1, e2, and e3, and we can see its extremes occur at endpoints of the potential values for e0, e1, e2, and e3. There are some complications due to interactions between possibilities for the signs of the values, but we can simply allow some extra error for the worst case. A bound on the maximum magnitude of the difference is |a•b|•(3•2^-24 + 2^-53 + 2^-48) + |c|•(2•2^-24 + 2^-53 + 2^-77).
Because we have plenty of room, we can simplify that, as long as we do it in the direction of making the values larger. E.g., it might be convenient to use |a•b|•3.001•2^-24 + |c|•2.001•2^-24. This expression should suffice to allow for rounding in floating-point calculations while detecting nearly all implementation errors.
Note that the expression is not proportional to the final value, a*b+c, as calculated either by the function being tested or by the test program. This means that, in general, tests using a tolerance relative to the final values calculated by the function being tested or by the test program are wrong. The proper form of a test should be something like this:
double tolerance = fabs(input[i] * MSCALAR) * 0x3.001p-24 + fabs(ainput[i]) * 0x2.001p-24;
double difference = fabs(output[i] - expected[i]);
if (! (difference < tolerance))
// Report error here.
In summary, this gives us a tolerance that is larger than any possible differences due to floating-point rounding, so it should never give us a false positive (report the test function is broken when it is not). However, it is very small compared to the errors caused by the bugs we want to detect, so it should rarely give us a false negative (fail to report an actual bug).
(Note that there are also rounding errors computing the tolerance, but they are smaller than the slop I have allowed for in using .001 in the coefficients, so we can ignore them.)
(Also note that ! (difference < tolerance) is not equivalent to difference >= tolerance. If the function produces a NaN, due to a bug, any comparison yields false: both difference < tolerance and difference >= tolerance yield false, but ! (difference < tolerance) yields true.)
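A tiny sketch of why the negated form matters when a NaN sneaks in:

#include <math.h>
#include <stdio.h>
int main(void)
{
    double difference = NAN, tolerance = 0.5;
    /* Every ordered comparison involving a NaN is false, so only the
       negated form reports the broken case. */
    printf("difference <  tolerance  : %d\n", difference < tolerance);    /* 0 */
    printf("difference >= tolerance  : %d\n", difference >= tolerance);   /* 0 */
    printf("!(difference < tolerance): %d\n", !(difference < tolerance)); /* 1 */
    return 0;
}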
On 32-bit machines, differences between 387 and SSE floating point arithmetic units. I believe 387 uses more bits for intermediate values.
If you are using GCC as a 32-bit compiler, you can tell it to generate SSE2 code anyway with the options -msse2 -mfpmath=sse. Clang can be told to do the same thing with one of the two options and ignores the other one (I forget which). In both cases the binary program should implement strict IEEE 754 semantics, and compute the same result as a 64-bit program that also uses SSE2 instructions to implement strict IEEE 754 semantics.
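For instance, a 32-bit build forced onto SSE2 might look like this (the file name is a placeholder):
gcc -m32 -msse2 -mfpmath=sse -O2 vecmath_test.c -o vecmath_test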
Non-exact representation of my 42.42 value that I'm using to generate expected values.
The C standard says that a literal such as 42.42f must be converted to either the floating-point number immediately above or immediately below the number represented in decimal. Moreover, if the literal is representable exactly as a floating-point number of the intended format, then this value must be used. However, a quality compiler (such as GCC) will give you(*) the nearest representable floating-point number, of which there is only one, so again, this is not a real portability issue as long as you are using a quality compiler (or at the very least, the same compiler).
Should this turn out to be a problem, a solution is to write an exact representation of the constants you intend. Such an exact representation can be very long in decimal format (up to 750 decimal digits for the exact representation of a double) but is always quite compact in C99's hexadecimal format: 0x1.535c28p+5 for the exact representation of the float nearest to 42.42. A recent version of the static analysis platform for C programs Frama-C can provide the hexadecimal representation of all inexact decimal floating-point constants with option -warn-decimal-float:all.
(*) barring a few conversion bugs in older GCC versions. See Rick Regan's blog for details.
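If you just want the hexadecimal constant for a particular literal without running Frama-C, printf's %a conversion (C99) prints it; a minimal sketch:

#include <stdio.h>
int main(void)
{
    /* %a prints the exact hexadecimal representation of the value the
       compiler actually stored for the decimal literal. */
    printf("%a\n", 42.42f);   /* should match the constant quoted above */
    printf("%a\n", 42.42);    /* the double nearest to 42.42 */
    return 0;
}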
My textbook - C in a Nutshell, ISBN 978-0596006976
In the part about casting, the book gives example code showing a C rounding error:
Code:
#include <stdio.h>
int
main()
{
long l_var = 123456789L;
float f_var = l_var;
printf("The rounding error (f_var - l_var) is %f\n", f_var - l_var);
return 0;
}
Then the output is nothing but 0.000000, which seems to show no precision problem when converting that value. I compiled with gcc (v4.4.7) using the command:
gcc -Wall file.c -o exec
Did GCC find a better way around the problem mentioned in that chapter, or is it just some setting that isn't strictly related to the rounding-error issue?
I don't know what this chapter is telling you, but:
float f_var = l_var;
We can tell that f_var is (float)l_var. Now the expression:
f_var - l_var
As this operates on a long and a float, the long will be converted into a float. So the compiler will do:
f_var - (float)l_var
Which is the same as:
(float)l_var - (float)l_var
Which is zero, regardless of any rounding of the conversion.
I don't have access to this book.
My guess is that the example is trying to tell you that if you assign a 32-bit integer to a 32-bit float, you may lose bits due to truncation (rounding errors): a 32-bit float has only a 23-bit stored significand, so some bits may be lost during the assignment.
Apparently, the example code in the book is bogus, though. Here is code to demonstrate the truncation error:
#include <stdint.h>
#include <stdio.h>
int main() {
int32_t l_var = 123456789L;
/* 32 bit variable, 23 bit significand, approx. 7 decimals */
float f_var = l_var;
double err = (double) f_var - (double) l_var;
printf("The rounding error (f_var - l_var) is %f\n", err);
return 0;
}
This prints
The rounding error (f_var - l_var) is 3.000000
on my machine.
0 is the value you get if both values are converted to float; you'll get something else if they are converted to something else. And there is an allowance in the standard to use a wider floating-point representation than required by the type for computation (*). Using that allowance is especially tempting here, as the result has to be converted to a double for passing to printf.
My version of gcc is not using that allowance when compiling for x86_64 (-m64 argument for gcc), and it is using it when compiling for x86 (-m32 argument). That makes sense when you know that for 64 bits it is using SSE instructions, which can easily do the computation in float, while when compiling for 32 bits it is using the older "8087" stack model, which can't do that easily.
(*) Last paragraph of 6.2.1.5 in C90, 6.3.1.8/2 in C99, 6.3.1.8/2 in C11. I give the text of the latest (as in n1539):
The values of floating operands and of the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.
As pointed out by Pascal Cuoq, starting from C99 you can test for this with FLT_EVAL_METHOD.