Compilation platform taking FPU rounding mode into account in printing, conversions - c

EDIT: I had made a mistake during the debugging session that led me to ask this question. The differences I was seeing were in fact in printing a double and in parsing a double (strtod). Stephen's answer still covers my question very well even after this rectification, so I think I will leave the question alone in case it is useful to someone.
Some (most) C compilation platforms I have access to do not take the FPU rounding mode into account when
converting a 64-bit integer to double;
printing a double.
Nothing very exotic here: Mac OS X Leopard, various recent Linuxes and BSD variants, Windows.
On the other hand, Mac OS X Snow Leopard seems to take the rounding mode into account when doing these two things. Of course, having different behaviors annoys me no end.
Here are typical snippets for the two cases:
#if defined(__OpenBSD__) || defined(__NetBSD__)
# include <ieeefp.h>
# define FE_UPWARD FP_RP
# define fesetround(RM) fpsetround(RM)
#else
# include <fenv.h>
#endif
#include <float.h>
#include <math.h>
#include <stdio.h>
fesetround(FE_UPWARD);
...
double f;
long long b = 2000000001;
b = b*b;               /* 4000000004000000001 cannot be represented exactly in a double */
f = b;                 /* 64-bit integer -> double conversion */
...
printf("%f\n", 0.1);   /* printing a double */
My questions are:
Is there something non-ugly that I can do to normalize the behavior across all platforms? Some hidden setting to tell the platforms that take rounding mode into account not to or vice versa?
Is one of the behaviors standard?
What am I likely to encounter when the FPU rounding mode is not used? Round towards zero? Round to nearest? Please, tell me that there is only one alternative :)
Regarding 2. I found the place in the standard where it is said that floats converted to integers are always truncated (rounded towards zero) but I couldn't find anything for the integer -> float direction.

If you have not set the rounding mode, it should be the IEEE-754 default mode, which is round-to-nearest.
For conversions from integer to float, the C standard says (§6.3.1.4):
When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. If the value being converted is outside the range of values that can be represented, the behavior is undefined.
So both behaviors conform to the C standard.
The C standard requires (§F.5) that conversions between IEC 60559 floating point formats and character sequences be correctly rounded as per the IEEE-754 standard. For non-IEC 60559 formats, this is recommended, but not required. The 1985 IEEE-754 standard says (clause 5.4):
Conversions shall be correctly rounded as specified in Section 4 for operands lying within the ranges specified in Table 3. Otherwise, for rounding to nearest, the error in the converted result shall not exceed by more than 0.47 units in the destination's least significant digit the error that is incurred by the rounding specifications of Section 4, provided that exponent over/underflow does not occur. In the directed rounding modes the error shall have the correct sign and shall not exceed 1.47 units in the last place.
What Section 4 actually says is that the operation shall occur according to the prevailing rounding mode. That is, if you change the rounding mode, IEEE-754 says that the result of float->string conversion should change accordingly. Ditto for integer->float conversions.
The 2008 revision of the IEEE-754 standard says (clause 4.3):
The rounding-direction attribute affects all computational operations that might be inexact. Inexact numeric floating-point results always have the same sign as the unrounded result.
Both conversions are defined to be computational operations in clause 5, so again they should be performed according to the prevailing rounding mode.
I would argue that Snow Leopard has the correct behavior here (assuming that it is correctly rounding the results according to the prevailing rounding mode). If you want to force the old behavior, you can always wrap your printf calls in code that changes the rounding mode, I suppose, though that's clearly not ideal.
Alternatively, you could use the %a format specifier (hexadecimal floating point) on C99 compliant platforms. Since the result of this conversion is always exact, it will never be affected by the prevailing rounding mode. I don't think that the Windows C library supports %a, but you could probably port the BSD or glibc implementation easily enough if you need it.
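To see which of the two effects a given platform actually exhibits, here is a minimal sketch (assuming a C99 <fenv.h> environment and a compiler that honors the FENV_ACCESS pragma; the variable names are illustrative). The %a output is exact, so it isolates the integer-to-double conversion from the printing:

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON                /* may be ignored or warned about by some compilers */

int main(void)
{
    volatile long long b = 2000000001;     /* volatile: discourage compile-time conversion */
    b = b * b;                             /* 4000000004000000001, needs more than 53 bits */

    fesetround(FE_TONEAREST);
    double near_val = (double)b;
    fesetround(FE_UPWARD);
    double up_val = (double)b;
    printf("conversion: %a vs %a\n", near_val, up_val);  /* differ only if the conversion honors the mode */

    fesetround(FE_TONEAREST);
    printf("printing (nearest): %f\n", 0.1);
    fesetround(FE_UPWARD);
    printf("printing (upward):  %f\n", 0.1);             /* differs only if printf honors the mode */

    fesetround(FE_TONEAREST);                            /* restore the default mode */
    return 0;
}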


Do different platforms have different double precision in C?

C code
#include <stdio.h>

int main(int argc, char *argv[]) {
    double x = 1.0/3.0;
    printf("%.32f\n", x);
    return 0;
}
On Windows Server 2008 R2, I compile and run the code using MSVC 2013. The output is:
0.33333333333333331000000000000000
On macOS High Sierra, I compile and run the same code using Apple LLVM version 9.0.0 (clang-900.0.39.2). The output is:
0.33333333333333331482961625624739
Running the same code on ideone.com gives the same result as the one on macOS.
I know the double-precision floating-point format has only about 15 or 16 significant decimal digits. The number keeps the precise decimal representation of the 64-bit binary value of 1/3 (as in the Wikipedia explanation) on the macOS platform, but why is it "truncated" on the Windows platform?
According to the C standard, the following behaviour is implementation-defined:
The accuracy of the floating-point operations (+, -, *, /) and of the library functions in <math.h> and <complex.h> that return floating-point results is implementation-defined, as is the accuracy of the conversion between floating-point internal representations and string representations performed by the library functions in <stdio.h>, <stdlib.h>, and <wchar.h>. The implementation may state that the accuracy is unknown.
The MSVC docs state that, since MSVC conforms to IEEE, a double has up to 16 significant digits.
But the docs also state that by default the /fp flag is set to precise:
With /fp:precise on x86 processors, the compiler performs rounding on variables of type float to the correct precision for assignments and casts and when parameters are passed to a function. This rounding guarantees that the data does not retain any significance greater than the capacity of its type.
So the double is being rounded when it is passed to printf, which is where those zeros come from. As far as I can tell the trailing 1 is noise and should not be there (it's digit 17, I counted).
This behavior is MSVC specific; you can read more about it in the docs I linked to, and see what flags you would need for other compilers.
According to coding-guidelines:
696 When a double is demoted to float, a long double is demoted to double or float, or a value being represented in greater precision and range than required by its semantic type (see 6.3.1.8) is explicitly converted to its semantic type (including to its own type), if the value being converted can be represented exactly in the new type, it is unchanged.
697 If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
698 If the value being converted is outside the range of values that can be represented, the behavior is undefined.
Yes, different implementations may have double types with different characteristics, but that is not actually the cause of the behavior you are seeing here. In this case, the cause is that the Windows routines for converting floating-point values to decimal do not produce the correctly rounded result as defined in IEEE 754 (they “give up” after a limited number of digits), whereas the macOS routines do produce the correctly rounded result (and in fact produce the exact result if allowed to produce enough digits for that).
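One way to confirm this on both machines is to print the bit-exact value with C99's %a conversion; a minimal sketch (the hexadecimal constant in the comment is what an IEEE-754 binary64 value of 1.0/3.0 should contain):

#include <stdio.h>

int main(void)
{
    double x = 1.0 / 3.0;
    printf("%a\n", x);     /* expected on IEEE-754 doubles: 0x1.5555555555555p-2 */
    printf("%.17g\n", x);  /* 17 significant digits uniquely identify a double   */
    return 0;
}

If the %a lines match, the stored doubles are identical and any difference lies purely in the decimal conversion performed by printf.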
The short answer is: yes, different platforms behave differently.
From Intel:
Intel Xeon processor E5-2600 V2 product family supports half-precision (16-bit) floating-point data types. Half-precision floating-point data types provide 2x more compact data representation than single-precision (32-bit) floating-point data format, but sacrifice data range and accuracy. In particular, half-floats may provide better performance than 32-bit floats when the 32-bit float data does not fit into the L1 cache.
I don't have access to a Xeon machine to test this, but I'd expect that you are ending up with a double that is not really a double due to the use of "faster floating-point, but less precision" CPU instructions.
The "i" line of Intel Core processors doesn't have that half-width fast floating point precision so it doesn't truncate the result.
Here's more on the half-precision format in Xeon on Intel website.

Overflow vs Inf

When I enter a number greater than the maximum double in Matlab (approximately 1.79769e+308), for example 10^309, it returns Inf. For educational purposes, I want to get an overflow exception like C compilers that return an overflow error message, not Inf. My questions are:
Is Inf an overflow exception?
If it is, why don't C compilers return Inf?
If not, can I get an overflow exception in Matlab?
Is there any difference between Inf and an overflow exception at all?
Also, I don't want to check for Inf in Matlab and then throw an exception with the error() function.
1) Floating-points in C/C++
Operations on floating-point numbers can produce results that are not numerical values. Examples:
the result of an operation is a complex number (think sqrt(-1.0))
the result of an operation is undefined (think 1.0 / 0.0)
the result of an operation is too large to be represented
an operation is performed where one of the operands is already NaN or Inf
The philosophy of IEEE754 is to not trap such exceptions by default, but to produce special values (Inf and NaN), and allow computation to continue normally without interrupting the program. It is up to the user to test for such results and treat them separately (like isinf and isnan functions in MATLAB).
There exist two types of NaN values: NaN (Quiet NaN) and sNaN (Signaling NaN). Normally all arithmetic operations of floating-point numbers will produce the quiet type (not the signaling type) when the operation cannot be successfully completed.
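Here is a small C sketch of that default, non-trapping behavior (the variable names are just illustrative):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    volatile double zero = 0.0;   /* volatile: keep the operations at run time */
    volatile double big  = DBL_MAX;

    double div  = 1.0 / zero;     /* division by zero  -> +Inf */
    double ovf  = big * 2.0;      /* overflow          -> +Inf */
    double inv1 = sqrt(-1.0);     /* invalid operation -> NaN  */
    double inv2 = zero / zero;    /* invalid operation -> NaN  */

    /* the program was not interrupted; we simply test the results afterwards */
    printf("1.0/0.0     is Inf? %s\n", isinf(div)  ? "yes" : "no");
    printf("DBL_MAX*2.0 is Inf? %s\n", isinf(ovf)  ? "yes" : "no");
    printf("sqrt(-1.0)  is NaN? %s\n", isnan(inv1) ? "yes" : "no");
    printf("0.0/0.0     is NaN? %s\n", isnan(inv2) ? "yes" : "no");
    return 0;
}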
There are (platform-dependent) functions to control the floating-point environment and catch FP exceptions:
Win32 API has _control87() to control the FPU flags.
POSIX/Linux systems typically handle FP exceptions by trapping the SIGFPE signal (see feenableexcept).
SunOS/Solaris has its own functions as well (see chapter 4 in Numerical Computation Guide by Sun/Oracle)
C99/C++11 introduced the <fenv.h> header with functions that control the floating-point exception flags (see the sketch below).
For instance, check out how Python implements the FP exception control module for different platforms: https://hg.python.org/cpython/file/tip/Modules/fpectlmodule.c
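For the portable, non-trapping route in C99, here is a minimal sketch using the <fenv.h> status flags (it only inspects the flags after the fact; it does not raise a signal):

#include <fenv.h>
#include <float.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON                    /* may be ignored or warned about by some compilers */

int main(void)
{
    volatile double big = DBL_MAX;             /* volatile: keep the multiplication at run time */

    feclearexcept(FE_ALL_EXCEPT);              /* start from a clean status word */
    volatile double result = big * 2.0;        /* overflows to +Inf and sets flags */
    (void)result;

    if (fetestexcept(FE_OVERFLOW))  printf("FE_OVERFLOW raised\n");
    if (fetestexcept(FE_INEXACT))   printf("FE_INEXACT raised\n");
    if (fetestexcept(FE_DIVBYZERO)) printf("FE_DIVBYZERO raised\n");
    return 0;
}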
2) Integers in C/C++
This is obviously completely different from floating-points, since integer types cannot represent Inf or NaN:
unsigned integers use modular arithmetic (so values wrap around if the result exceeds the largest integer). This means that the result of an unsigned arithmetic operation is always "mathematically defined" and never overflows. Compare this to MATLAB, which uses saturation arithmetic for integers (uint8(200) + uint8(200) will be uint8(255)); see the C sketch after this list.
signed integer overflow on the other hand is undefined behavior.
integer division by zero is undefined behavior.
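A minimal C sketch of the unsigned wrap-around mentioned above, to contrast with the MATLAB saturation result:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t a = 200, b = 200;
    uint8_t sum = (uint8_t)(a + b);   /* 400 mod 256 = 144, not saturated to 255 */
    printf("%u\n", (unsigned)sum);    /* prints 144 */
    return 0;
}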
Floating Point
MATLAB implements the IEEE Standard 754 for floating point operations.
This standard has five defined exceptions:
Invalid Operation
Division by Zero
Overflow
Underflow
Inexact
As noted by the GNU C Library, these exceptions are indicated by a status word but do not terminate the program.
Instead, an exception-dependent default value is returned; the value may be an actual number or a special value. Special values in MATLAB are Inf, -Inf, NaN, and -0; these MATLAB symbols are used in place of the official standard's reserved binary representations for readability and usability (a bit of nice syntactic sugar).
Operations on the special values are well-defined and operate in an intuitive way.
With this information in hand, the answers to the questions are:
Inf means that an operation was performed that raised one of the above exceptions (namely, 1, 2, or 3), and Inf was determined to be the default return value.
Depending on how the C program is written, what compiler is being used, and what hardware is present, INFINITY and NaN are special values that can be returned by a C operation. It depends on whether and how the IEEE-754 standard was implemented. C99 includes an IEEE-754 implementation as part of the standard (Annex F), but it is ultimately up to the compiler how the implementation works (this can be complicated by aggressive optimizations and standard options like rounding modes).
A return value of Inf or -Inf indicates that an Overflow exception may have happened, but it could also be an Invalid Operation or Division by Zero. I don't think MATLAB will tell you which it is (though maybe you have access to that information via compiled MEX files, but I'm unfamiliar with those).
See answer 1.
For more fun and in-depth examples, here is a nice PDF.
Integers
Integers do not behave as above in MATLAB.
If an operation on an integer of a specified bit size will exceed the maximum value of that class, it will be set to the maximum value and vice versa for negatives (if signed).
In other words, MATLAB integers do not wrap.
I'm going to repeat an answer by Jan Simon from the "MATLAB Answers" website:
For stopping (in debugger mode) on division-by-zero, use:
warning on MATLAB:divideByZero
dbstop if warning MATLAB:divideByZero
Similarly for stopping on taking the logarithm of zero:
warning on MATLAB:log:LogOfZero
dbstop if warning MATLAB:log:LogOfZero
and for stopping when an operation (a function call or an assignment) returns either NaN or Inf, use:
dbstop if naninf
Unfortunately the first two warnings seem to be no longer supported, although the last option still works for me on R2014a and is in fact documented.

Guaranteed precision of sqrt function in C/C++

Everyone knows the sqrt function from math.h/cmath in C/C++ - it returns the square root of its argument. Of course, it has to do this with some error, because not every number can be stored precisely. But am I guaranteed that the result has some precision? For example, 'it's the best approximation of the square root that can be represented in the floating point type used' or 'if you calculate the square of the result, it will be as close to the initial argument as possible using the floating point type given'?
Does the C/C++ standard say anything about it?
For C99, there are no specific requirements. But most implementations try to support Annex F: IEC 60559 floating-point arithmetic as well as possible. It says:
An implementation that defines __STDC_IEC_559__ shall conform to the specifications in this annex.
And:
The sqrt functions in <math.h> provide the IEC 60559 square root operation.
IEC 60559 (equivalent to IEEE 754) says about basic operations like sqrt:
Except for binary <-> decimal conversion, each of the operations shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then coerced this intermediate result to fit in the destination's format.
The final step consists of rounding according to the active rounding mode; in the default round-to-nearest mode the result is always the representable value closest to the infinitely precise result.
This question was already answered here as Chris Dodd noticed in the comments section. In short: it's not guaranteed by C++ standard, but IEEE-754 standard guarantees me that the result will be as close to the 'real result' as possible, i.e. error will be less than or equal to 1/2 unit-in-the-last-place. In particular, if the result can be precisely stored, it should be.
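A small sketch of what that guarantee means in practice (assuming an Annex F / IEEE-754 implementation):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* 9.0 and 3.0 are exactly representable, so a correctly rounded sqrt
       must return exactly 3.0 -- not just something close to 3. */
    printf("sqrt(9.0) == 3.0 ? %s\n", sqrt(9.0) == 3.0 ? "yes" : "no");

    /* For inexact cases the result is the double nearest to the true root;
       %a prints the exact bits, which should match across conforming platforms. */
    printf("sqrt(2.0) = %a\n", sqrt(2.0));
    return 0;
}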

How to check that IEEE 754 single-precision (32-bit) floating-point representation is used?

I want to test the following things on my target board:
Is 'float' implemented as an IEEE 754 single-precision (32-bit) floating-point type?
Is 'double' implemented as an IEEE 754 double-precision (64-bit) floating-point type?
What are the ways in which I can test this with a simple C program?
No simple test exists.
The overwhelming majority of systems today use IEEE-754 formats for floating-point. However, most C implementations do not fully conform to IEEE 754 (which is identical to IEC 60559) and do not set the preprocessor identifier __STDC_IEC_559__. In the absence of this identifier, the only way to determine whether a C implementation conforms to IEEE 754 is one or a combination of:
Read its documentation.
Examine its source code.
Test it (which is, of course, difficult when only exhaustive testing can be conclusive).
In many C implementations and software applications, the deviations from IEEE 754 can be ignored or worked around: You may write code as if IEEE 754 were in use, and much code will largely work. However, there are a variety of things that can trip up an unsuspecting programmer; writing completely correct floating-point code is difficult even when the full specification is obeyed.
Common deviations include:
Intermediate arithmetic is performed with more precision than the nominal type. E.g., expressions that use double values may be calculated with long double precision.
sqrt does not return a correctly rounded value in every case.
Other math library routines return values that may be slightly off (a few ULP) from the correctly rounded results. (In fact, nobody has implemented all the math routines recommended in IEEE 754-2008 with both guaranteed correct rounding and guaranteed bounded run time.)
Subnormal numbers (tiny numbers near the edge of the floating-point format) may be converted to zero instead of handled as specified by IEEE 754.
Conversions between decimal numerals (e.g., 3.1415926535897932384626433 in the source code) and binary floating-point formats (e.g., the common double format, IEEE-754 64-bit binary) do not always round correctly, in either conversion direction.
Only round-to-nearest mode is supported; the other rounding modes specified in IEEE 754 are not supported. Or they may be available for simple arithmetic but require using machine-specific assembly language to access. Standard math libraries (cos, log, et cetera) rarely support other rounding modes.
In C99, you can check for __STDC_IEC_559__:
#ifdef __STDC_IEC_559__
/* using IEEE-754 */
#endif
This is because the international floating point standard referenced by C99 is IEC 60559:1989 (IEC 559 and IEEE-754 were previous designations). The mapping from the C language to IEC 60559 is optional, but if it is in use, the implementation defines the macro __STDC_IEC_559__ (Annex F of the C99 standard), so you can totally rely on that.
Another alternative is to manually check if the values in float.h, such as FLT_MAX, FLT_EPSILON, FLT_MAX_10_EXP, etc, match with the IEEE-754 limits, although theoretically there could be another representation with the same values.
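A minimal sketch of that <float.h> check (it can only show that the parameters match the IEEE-754 binary32/binary64 formats, not that every operation conforms):

#include <float.h>
#include <limits.h>
#include <stdio.h>

int main(void)
{
    int f_ok = FLT_RADIX == 2 && FLT_MANT_DIG == 24 &&
               FLT_MAX_EXP == 128 && FLT_MIN_EXP == -125 &&
               sizeof(float) * CHAR_BIT == 32;
    int d_ok = FLT_RADIX == 2 && DBL_MANT_DIG == 53 &&
               DBL_MAX_EXP == 1024 && DBL_MIN_EXP == -1021 &&
               sizeof(double) * CHAR_BIT == 64;

    printf("float  has the IEEE-754 binary32 parameters: %s\n", f_ok ? "yes" : "no");
    printf("double has the IEEE-754 binary64 parameters: %s\n", d_ok ? "yes" : "no");
    return 0;
}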
First of all, you can find the details about the ISO/IEC/IEEE 60559 (or IEEE 754) in Wikipedia:
Floating point standard types
As F. Goncalvez has told you, the macro __STDC_IEC_559__ tells you whether or not your compiler conforms to IEEE 754.
However, you can obtain additional information with the macro FLT_EVAL_METHOD.
The value of this macro means:
0 All operations and constants are evaluated in the range and precision of the type used.
1 The operations of types float and double are evaluated in the range and precision of double, and long double operations are evaluated in the range and precision of long double.
2 The evaluations of all types are done in the precision and range of long double.
-1 Indeterminate
Other negative values: Implementation defined (it depends on your compiler).
For example, if FLT_EVAL_METHOD == 2 and you hold the result of several calculations in a floating point variable x, then all operations and constants are calculated or processed in the best precision, that is, long double, but only the final result is rounded to the type that x has.
This behaviour reduces the impact of numerical errors.
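A tiny sketch to see what your compiler reports (FLT_EVAL_METHOD lives in <float.h>; the matching float_t and double_t typedefs are in <math.h>):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("FLT_EVAL_METHOD  = %d\n", (int)FLT_EVAL_METHOD);
    printf("sizeof(float_t)  = %zu\n", sizeof(float_t));   /* type used for float intermediates  */
    printf("sizeof(double_t) = %zu\n", sizeof(double_t));  /* type used for double intermediates */
    return 0;
}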
In order to know details about the floating point types, you have to watch the constant macros provided by the standard header <float.h>.
For example, see this link:
Characteristics of floating point types
In the sad case that your implementation does not conform to the IEEE 754 standard, you can try looking for details in the standard header <float.h>, if it exists.
Also, you have to read the documentation of your compiler.
For example, the compiler GCC explains what it does with floating point:
Status of C99 features in GCC
No, Standard C18, p. 373 specifies that IEC 60559 is used for float, double...
Why do you think IEEE 754 is used?

What is the significance of "A conforming compiler may choose not to implement non-normalized floating point numbers"?

ISO/IEC 9899:2011 §5.2.4.2.2 ¶10 (p48) says:
The presence or absence of subnormal numbers is characterized by the implementation-defined values of FLT_HAS_SUBNORM, DBL_HAS_SUBNORM, and LDBL_HAS_SUBNORM:
-1 indeterminable
0 absent (type does not support subnormal numbers)
1 present (type does support subnormal numbers)
What the! So on some platforms I cannot write double d = 33.3? Or will the compiler automatically convert this to 333E-1? What is the practical significance of presence or absence of non-normalized floating point numbers?
Subnormal numbers are the nonzero floating-point numbers between -FLT_MIN and FLT_MIN (for type float) and -DBL_MIN and DBL_MIN (for type double). The constant FLT_MIN is typically 1.17549435E-38F, that is, small. If you do only a little programming with floating-point numbers, you may never have encountered a subnormal number.
On a compilation platform with FLT_HAS_SUBNORM == 0, there are only the numbers +0. and -0. between -FLT_MIN and FLT_MIN.
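A small sketch to see which case your platform is in (fpclassify and FP_SUBNORMAL are C99; DBL_HAS_SUBNORM only exists from C11 on):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double d = DBL_MIN / 4.0;   /* below DBL_MIN: subnormal if supported, otherwise flushed to zero */

    printf("DBL_MIN / 4.0 = %a\n", d);
    printf("classified as subnormal? %s\n", fpclassify(d) == FP_SUBNORMAL ? "yes" : "no");
#ifdef DBL_HAS_SUBNORM                       /* only defined from C11 on */
    printf("DBL_HAS_SUBNORM = %d\n", (int)DBL_HAS_SUBNORM);
#endif
    return 0;
}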
Subnormal numbers are usually handled in software (since they have exceptional behavior and do not happen often). One reason not to handle them at all is to avoid the slowdown that can occur when they happen. This can be important in real-time contexts.
The next Intel desktop processor generation (or is it the current one?) will handle subnormals in hardware.
The notion of subnormal number has nothing to do with the notations 33.3 and 333E-1, which represent the same double value.
The justification for subnormals and the history of their standardization in IEEE 754 can be found in these reminiscences by Kahan, under “Gradual Underflow”.
EDIT:
I could not find a source for Intel handling subnormals in hardware in its next generation of processors, but I found one for Nvidia's Fermi platform doing so already.
