C code
#include <stdio.h>
int main(int argc, char *argv[]) {
double x = 1.0/3.0;
printf("%.32f\n", x);
return 0;
}
On Windows Server 2008 R2, I compile and run the code using MSVC 2013. The output is:
0.33333333333333331000000000000000
On macOS High Sierra, I compile and run the same code using Apple LLVM version 9.0.0 (clang-900.0.39.2). The output is:
0.33333333333333331482961625624739
Running the same code on ideone.com gives the same result as the one on macOS.
I know the double-precision floating-point format has only about 15 or 16 significant decimal digits. On macOS the output is the exact decimal representation of the 64-bit binary value nearest to 1/3 (as Wikipedia explains), but why is it "truncated" on the Windows platform?
According to the C standard, the following behaviour is implementation-defined:
The accuracy of the floating-point operations (+, -, *, /) and of the library functions in <math.h> and <complex.h> that return floating-point results is implementation-defined, as is the accuracy of the conversion between floating-point internal representations and string representations performed by the library functions in <stdio.h>, <stdlib.h>, and <wchar.h>. The implementation may state that the accuracy is unknown.
According to the MSVC docs, since MSVC conforms to IEEE 754, a double has up to 16 significant digits.
But the docs also state that by default the /fp flag is set to /fp:precise:
With /fp:precise on x86 processors, the compiler performs rounding on
variables of type float to the correct precision for assignments and
casts and when parameters are passed to a function. This rounding
guarantees that the data does not retain any significance greater than
the capacity of its type.
So the double is rounded when it is passed to printf, which is where those trailing zeros come from. As far as I can tell the trailing 1 is noise and should not be there (it's digit 17; I counted).
This behavior is MSVC-specific; you can read more about it in the docs linked above to see which flags you would need for other compilers.
According to coding-guidelines:
696 When a double is demoted to float, a long double is demoted to double or float, or a value being represented in greater precision and range than required by its semantic type (see 6.3.1.8) is explicitly converted to its semantic type (including to its own type), if the value being converted can be represented exactly in the new type, it is unchanged.
697 If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
698 If the value being converted is outside the range of values that can be represented, the behavior is undefined.
Yes, different implementations may have double types with different characteristics, but that is not actually the cause of the behavior you are seeing here. In this case, the cause is that the Windows routines for converting floating-point values to decimal do not produce the correctly rounded result as defined in IEEE 754 (they “give up” after a limited number of digits), whereas the macOS routines do produce the correctly rounded result (and in fact produce the exact result if allowed to produce enough digits for that).
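For what it's worth, here is a small check you can compile on both platforms (a sketch: DBL_DECIMAL_DIG is C11, so it falls back to 17 where the macro is missing). Printing with just enough significant digits to round-trip a double sidesteps the question of how many digits the platform's conversion routine is willing to produce beyond that:
#include <stdio.h>
#include <float.h>

int main(void) {
    double x = 1.0 / 3.0;

    /* 17 significant digits are enough to recover the exact bits of an
       IEEE-754 double when the text is read back with strtod. */
#ifdef DBL_DECIMAL_DIG
    printf("%.*g\n", DBL_DECIMAL_DIG, x);
#else
    printf("%.17g\n", x);
#endif

    /* Asking for 32 fractional digits goes beyond what every platform's
       printf is guaranteed to produce exactly. */
    printf("%.32f\n", x);
    return 0;
}
Judging by the outputs quoted in the question, the two platforms agree on the first 17 significant digits; they diverge only in the digits past that, which only the macOS library bothers to compute exactly.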
The short answer is: yes, different platforms behave differently.
From Intel:
Intel Xeon processor E5-2600 V2 product family supports half-precision (16-bit) floating-point data types. Half-precision floating-point data types provide 2x more compact data representation than single-precision (32-bit) floating-point data format, but sacrifice data range and accuracy. In particular, half-floats may provide better performance than 32-bit floats when the 32-bit float data does not fit into the L1 cache.
I don't have access to a Xeon machine to test this, but I'd expect that you are ending up with a double that is not really a double, due to the use of "faster floating point, but less precision" CPU instructions.
The "i" line of Intel Core processors doesn't have that half-width fast floating-point capability, so it doesn't truncate the result.
Here's more on the half-precision format in Xeon on Intel's website.
Related
I have a gcc cross compiler for an 18-bit soft-core processor target that has the following data types defined: integer 18-bit, long 36-bit, and float 36-bit (single precision). Right now my focus is on floating-point operations. Since the width is non-standard (36-bit), I use the following scheme: 27 bits for the mantissa (significand), 8 bits for the exponent, and 1 sign bit.
I can see the widths are defined in float.h. Of interest to me are the following:
FLT_MANT_DIG and FLT_DIG.
They are defined as:
FLT_MANT_DIG 24
FLT_DIG 6
I have changed them to
FLT_MANT_DIG 28
FLT_DIG 9
as per my requirements in float.h, and then built the gcc compiler. But I still get 32-bit floating-point output. Does anyone have any experience implementing non-standard single-precision floating-point numbers and/or know a workaround?
Efficient floating-point math requires hardware which is designed to support the exact floating-point formats which are being used. In the absence of such hardware, routines which are designed around a particular floating-point format will be much more efficient than routines which are readily adaptable to other formats. The GCC compiler and supplied libraries are designed to operate efficiently with IEEE-754 floating-point types and are not particularly adaptable to any others. The aforementioned headers exist not to allow a programmer to request a particular floating-point format, but merely to notify code about what format is going to be used.
If you don't need 72-bit floating-point types, and if the compiler's double type will perform 64-bit math in something resembling sensible fashion even though long is 36 bits rather than 32, you might be able to arrange things so that float values get unpacked into a four-word double, perform computations using that, and then rearrange the bits of the double to yield a float. Alternatively, you could write or find 36-bit floating-point libraries. I would not particularly expect GCC or its libraries to include such a thing, since 36-bit processors are rather rare these days.
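To illustrate the unpack-into-double idea, here is a sketch for the 36-bit layout described in the question (1 sign bit, 8 exponent bits, 27 mantissa bits). The bias and the implicit leading bit are not specified in the question, so IEEE-style conventions (bias 127, implicit 1) are assumed, and subnormals, infinities and NaNs are ignored; the reverse packing step would be analogous:
#include <stdint.h>
#include <stdio.h>
#include <math.h>

/* Assumed layout: bit 35 = sign, bits 34..27 = exponent (bias 127),
   bits 26..0 = fraction with an implicit leading 1. */
static double unpack36(uint64_t bits)
{
    int      sign = (int)((bits >> 35) & 1u);
    int      exp  = (int)((bits >> 27) & 0xFFu);
    uint32_t frac = (uint32_t)(bits & 0x7FFFFFFu);

    double m = 1.0 + (double)frac / 134217728.0;   /* divide by 2^27 */
    double v = ldexp(m, exp - 127);                /* scale by 2^(exp - 127) */
    return sign ? -v : v;
}

int main(void)
{
    uint64_t one = (uint64_t)127 << 27;            /* exponent 127, fraction 0 -> 1.0 */
    printf("%g\n", unpack36(one));
    return 0;
}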
I know that in C and Java, a float's underlying representation is IEEE754-32 and a double's is IEEE754-64.
In expressions, a float will be auto-promoted to double. So how does that work?
Take 3.7f for example. Is the process like this?
3.7f will be represented in memory using IEEE754. It fits in 4 bytes.
During calculation, it may be loaded into a 64-bit register (or whatever 64-bit location), turning 3.7f into its IEEE754-64 representation.
It is very implementation-dependent.
For one example, on the x86 platform the set of FPU instructions includes instructions for loading/storing data in IEEE754 float and double formats (as well as many other formats). The data is loaded into internal FPU registers that are 80 bits wide. So in reality, on x86 all floating-point calculations are performed with 80-bit precision, i.e. all floating-point data is effectively promoted to 80-bit precision. How the data is represented inside those registers is completely irrelevant, since you cannot observe it directly anyway.
This means that on x86 platform there's no such thing as a single-step float-to-double conversion. Whenever a need for such conversion arises, it is actually implemented as two-step conversion: float-to-internal-fpu and internal-fpu-to-double.
This BTW created a significant semantic difference between x86 FPU computation model and C/C++ computation models. In order to fully match the language model the processor has to forcefully reduce precision of intermediate floating-point results, thus negatively affecting performance. Many compilers provide user with options that control FPU computation model, allowing the user to opt for strict C/C++ conformance, better performance or something in between.
Not so many years ago the FPU was an optional component of the x86 platform. Floating-point computations on FPU-less platforms were performed in software, either by emulating the FPU or by generating code without any FPU instructions at all. In such implementations things could work differently, for example by performing a software conversion from IEEE754 float to IEEE754 double directly.
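A small program can make the semantic difference visible. This is only a sketch: what it prints depends entirely on the code generation model, e.g. an x87 build (gcc -m32 -mfpmath=387 with optimization) versus an SSE2 build or one compiled with -ffloat-store or /fp:precise:
#include <stdio.h>

int main(void)
{
    volatile double a = 1.0 / 3.0;   /* volatile blocks constant folding */
    volatile double b = a * 3.0;     /* the store to memory rounds to 64 bits (exactly 1.0) */

    /* On an x87 build the right-hand side below may be kept as an 80-bit
       temporary equal to 1 - 2^-54, so the comparison can fail; with strict
       64-bit evaluation both sides are 1.0 and it succeeds. */
    if (b == a * 3.0)
        printf("intermediates rounded to double\n");
    else
        printf("extended-precision temporary detected\n");
    return 0;
}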
I know in C/Java, a floating-point number's underlying representation is IEEE754-32, and a double's is IEEE754-64.
Wrong. The C standard has never specified fixed, specific sizes for the integer and floating-point types, although it does guarantee the following relations between them:
1 == sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long)
sizeof(float) <= sizeof(double) <= sizeof(long double)
C implementations are allowed to use any floating-point format, although most now use IEEE-754 and its descendants. Likewise, they can freely use any integer representation, such as one's complement or sign-magnitude.
As for the promotion rules: pre-standard versions of C promoted float operands in expressions to double, but in C89/90 the rule was changed, so float * float yields a float result. The usual arithmetic conversions are:
If either operand has type long double, the other operand is converted to long double
Otherwise, if either operand is double, the other operand is converted to double.
Otherwise, if either operand is float, the other operand is converted to float.
Implicit type conversion rules in C++ operators
It would be true in Java or C# though, since they run bytecode in a virtual machine, and the VM's types are consistent across platforms.
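A tiny illustration of those rules: sizeof reports the type of an expression (without evaluating it), which stays float for float * float in C89 and later and becomes double as soon as a double operand is involved. The sizes in the comments are merely the typical ones:
#include <stdio.h>

int main(void)
{
    float  f = 3.7f;
    double d = 3.7;

    printf("%zu\n", sizeof(f * 2.0f));  /* float * float -> float, typically 4 */
    printf("%zu\n", sizeof(f * d));     /* the float operand is converted to double, typically 8 */
    printf("%zu\n", sizeof(f + 1));     /* the int operand is converted to float, typically 4 */
    return 0;
}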
I noticed that on Windows and Linux x86, float is a 4-byte type and double is 8 bytes, but long double is 12 and 16 bytes on x86 and x86_64 respectively. C99 is supposed to be breaking such barriers with its specific integral sizes.
The initial technological limitation appears to be that the x86 processor cannot handle more than 80-bit floating-point operations (plus 2 bytes to round it up), but why the inconsistency in the standard compared to the int types? Why don't they standardize on at least 80 bits?
The C language doesn't specify the implementation of various types, so that it can be efficiently implemented on as wide a variety of hardware as possible.
This extends to the integer types too - the C standard integral types have minimum ranges (e.g. signed char is -127 to 127, short and int are both -32,767 to 32,767, long is -2,147,483,647 to 2,147,483,647, and long long is -9,223,372,036,854,775,807 to 9,223,372,036,854,775,807). For almost all purposes, this is all that the programmer needs to know.
C99 does provide "fixed-width" integer types, like int32_t - but these are optional - if the implementation can't provide such a type efficiently, it doesn't have to provide it.
For floating point types, there are equivalent limits (e.g. double must have at least 10 decimal digits worth of precision).
They were trying to (mostly) accommodate pre-existing C implementations, some of which don't even use IEEE floating point formats.
ints can be used to represent abstract things like IDs, colors, error codes, requests, etc. In such cases ints are not really used as integer numbers but as sets of bits (i.e. a container). Most of the time a programmer knows exactly how many bits are needed, so they want to be able to use just as many bits as necessary.
floats, on the other hand, are designed for a very specific usage: floating-point arithmetic. You are very unlikely to be able to size precisely how many bits you need for your float.
Actually, most of the time the more bits you have the better it is.
C99 is supposed to be breaking such barriers with the specific integral sizes.
No, those fixed-width (u)intN_t types are completely optional, because not all processors use type sizes that are a power of 2. C99 only requires (u)int_fastN_t and (u)int_leastN_t to be defined. That means the premise "why the inconsistency in the standard compared to int types" is just plain wrong, because there's no consistency in the sizes of the int types either.
Lots of modern DSPs use a 24-bit word for 24-bit audio. There are even 20-bit DSPs like the Zoran ZR3800x family or 28-bit DSPs like the ADAU1701, which allows transformation of 16/24-bit audio without clipping. Many 32 or 64-bit architectures also have some odd-sized registers to allow accumulation of values without overflow, for example the TI C5500/C6000 with a 40-bit long and SHARC with an 80-bit accumulator. The Motorola DSP5600x/3xx series also has odd sizes: 2-byte short, 3-byte int, 6-byte long. In the past there were lots of architectures with other word sizes like 12, 18, 36, 60-bit... and lots of CPUs that use one's complement or sign-magnitude. See Exotic architectures the standards committees care about
C was designed to be flexible enough to support all kinds of such platforms. Specifying a fixed size, whether for integer or floating-point types, defeats that purpose. Floating-point support in hardware varies wildly, just like integer support. There are different formats that use decimal, hexadecimal, or possibly other bases. Each format has different exponent/mantissa sizes, different positions for the sign/exponent/mantissa, and even different signed representations. For example, some use two's complement for the mantissa while others use two's complement for the exponent or the whole floating-point value. You can see many formats here, but that's obviously not every format that has ever existed. For example, the SHARC above has a special 40-bit floating-point format. Some platforms also use double-double arithmetic for long double. See also
What uncommon floating-point sizes exist in C++ compilers?
Do any real-world CPUs not use IEEE 754?
That means you can't standardize a single floating-point format for all platforms, because there is no one-size-fits-all solution. If you're designing a DSP then obviously you need a format that's best for your purpose, so that you can churn through as much data as possible. There's no reason to use IEEE-754 binary64 when a 40-bit format has enough precision for your application, fits better in cache, and needs far less die area. Or if you're on a small embedded system then an 80-bit long double is usually useless, as you don't even have enough ROM for that 80-bit long double library. That's why some platforms limit long double to 64 bits, the same as double.
On my 32-bit machine (with an Intel Core 2 Duo T7700), I get 15 digits of precision for both the double and long double types in C. I compared the parameters LDBL_DIG for long double and DBL_DIG for double, and they are both 15. I got these answers using Visual Studio 2008. I was wondering whether these results can be compiler-dependent, or whether they just depend on my processor?
Thanks a lot...
Some compilers support a long double format that has more precision than double. Microsoft MSVC isn't one of them. If 15 significant digits isn't good enough then the odds are very high that you shouldn't be using a floating point type in the first place. Check this thread for arbitrary precision libraries.
It's possible for them to depend on compiler, but in general, they will just depend on architecture. If the compilers are using the same include files, in particular, these will probably not vary. It's a good idea to check them to be sure though, if you want to write portable code.
Right. These are implementation-dependent. The only guarantees of the C standard are:
float is a subset of double and double is a subset of long double (6.2.5/10)
FLT_RADIX ≥ 2 (5.2.4.2.2/9)
FLT_DIG ≥ 6, DBL_DIG ≥ 10, LDBL_DIG ≥ 10
FLT_MIN_10_EXP, DBL_MIN_10_EXP, LDBL_MIN_10_EXP ≤ -37
FLT_MAX_10_EXP, DBL_MAX_10_EXP, LDBL_MAX_10_EXP ≥ +37
FLT_MAX, DBL_MAX, LDBL_MAX ≥ 1e+37 (5.2.4.2.2/10)
FLT_EPSILON ≤ 1e-5, DBL_EPSILON ≤ 1e-9, LDBL_EPSILON ≤ 1e-9 (5.2.4.2.2/11)
FLT_MIN, DBL_MIN, LDBL_MIN ≤ 1e-37
Treating long double = double is permitted by the C standard.
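If you want to see what your own implementation picked within those bounds, printing the <float.h> macros is enough. A minimal sketch; the values in the comment are just what typical x86 toolchains report and will differ elsewhere:
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("FLT_RADIX = %d\n", FLT_RADIX);
    printf("FLT_DIG   = %d, sizeof(float)       = %zu\n", FLT_DIG,  sizeof(float));
    printf("DBL_DIG   = %d, sizeof(double)      = %zu\n", DBL_DIG,  sizeof(double));
    printf("LDBL_DIG  = %d, sizeof(long double) = %zu\n", LDBL_DIG, sizeof(long double));
    /* MSVC typically reports LDBL_DIG == 15 (long double == double), while
       GCC on x86 reports 18 for the 80-bit extended type. */
    return 0;
}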
While the C standard does not require this, it STRONGLY ADVISES that float and double are standard IEEE 754 single and double precision floating-point types, respectively. Which they are on any architecture that supports them in hardware (which means practically everywhere).
Things are slightly more tricky with long double, as not many architectures support floating-point types of higher-than-double precision. The standard requires that long double has at least as much range and precision as double. Which means that if an architecture does not support anything more, long double type is identical to double. And even if it does (like x87), some compilers still make long double equivalent to double (like M$VC), while others expose the extended precision type as long double (like Borland and GCC).
Even if the compiler exposes the extended-precision type, there is still no standard for what exactly "extended precision" means. On x87 this is 80-bit. Some other architectures have 128-bit quad-precision types. Even on x87, some compilers have sizeof(long double) = 10, while others pad it for alignment, so that it is 12 or 16 (or 8 if long double is double).
So the bottom line is, implementation of long double varies across platforms. The only thing you can be sure about it, is that it is at least equivalent to double. If you want to write portable code, don't depend on its representation - keep it away from interfaces and binary I/O. Using long double in internal calculations of your program is OK though.
You should also be aware that some CPUs' floating point units support multiple levels of precision for intermediate results, and this level can be controlled at runtime. Apps have been affected by things like buggy versions of DirectX libraries selecting a lower level of precision during a library call and forgetting to restore the setting, thus affecting later FP calculations in the caller.
EDIT: I had made a mistake during the debugging session that led me to ask this question. The differences I was seeing were in fact in printing a double and in parsing a double (strtod). Stephen's answer still covers my question very well even after this rectification, so I think I will leave the question alone in case it is useful to someone.
Some (most) C compilation platforms I have access to do not take the FPU rounding mode into account when
converting a 64-bit integer to double;
printing a double.
Nothing very exotic here: Mac OS X Leopard, various recent Linuxes and BSD variants, Windows.
On the other hand, Mac OS X Snow Leopard seems to take the rounding mode into account when doing these two things. Of course, having different behaviors annoys me no end.
Here are typical snippets for the two cases:
#if defined(__OpenBSD__) || defined(__NetBSD__)
# include <ieeefp.h>
# define FE_UPWARD FP_RP
# define fesetround(RM) fpsetround(RM)
#else
# include <fenv.h>
#endif
#include <float.h>
#include <math.h>
#include <stdio.h>

fesetround(FE_UPWARD);
...
/* Case 1: converting a 64-bit integer to double */
double f;
long long b = 2000000001;
b = b*b;             /* 4000000004000000001 cannot be represented exactly in a double */
f = b;
...
/* Case 2: printing a double */
printf("%f\n", 0.1);
My questions are:
Is there something non-ugly that I can do to normalize the behavior across all platforms? Some hidden setting to tell the platforms that take rounding mode into account not to or vice versa?
Is one of the behaviors standard?
What am I likely to encounter when the FPU rounding mode is not used? Round towards zero? Round to nearest? Please, tell me that there is only one alternative :)
Regarding question 2: I found the place in the standard where it says that floats converted to integers are always truncated (rounded toward zero), but I couldn't find anything for the integer -> float direction.
If you have not set the rounding mode, it should be the IEEE-754 default mode, which is round-to-nearest.
For conversions from integer to float, the C standard says (§6.3.1.4):
When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. If the value being converted is outside the range of values that can be represented, the behavior is undefined.
So both behaviors conform to the C standard.
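If you want to check which camp a given platform falls into for case 1 (integer to double), something along these lines works. A sketch: 2000000001 squared is 4000000004000000001, a 62-bit odd number, so it cannot be represented exactly in a 53-bit significand and must be rounded on conversion; strictly speaking #pragma STDC FENV_ACCESS ON should also be in effect:
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    volatile long long b = 2000000001LL;   /* volatile blocks constant folding */
    b = b * b;                             /* 4000000004000000001 */

    fesetround(FE_UPWARD);
    volatile double up = (double)b;

    fesetround(FE_DOWNWARD);
    volatile double down = (double)b;

    fesetround(FE_TONEAREST);
    /* The two results differ (by one ulp, 512 at this magnitude) only if the
       conversion honors the rounding mode; otherwise they are identical. */
    printf(up != down ? "conversion honors the FPU rounding mode\n"
                      : "conversion ignores the FPU rounding mode\n");
    return 0;
}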
The C standard requires (§F.5) that conversions between IEC 60559 floating-point formats and character sequences be correctly rounded as per the IEEE-754 standard. For non-IEC 60559 formats, this is recommended, but not required. The 1985 IEEE-754 standard says (clause 5.6):
Conversions shall be correctly rounded as specified in Section 4 for operands lying within the ranges specified in Table 3. Otherwise, for rounding to nearest, the error in the converted result shall not exceed by more than 0.47 units in the destination's least significant digit the error that is incurred by the rounding specifications of Section 4, provided that exponent over/underflow does not occur. In the directed rounding modes the error shall have the correct sign and shall not exceed 1.47 units in the last place.
What Section 4 actually says is that the operation shall occur according to the prevailing rounding mode; i.e., if you change the rounding mode, IEEE-754 says that the result of a float->string conversion should change accordingly. Ditto for integer->float conversions.
The 2008 revision of the IEEE-754 standard says (clause 4.3):
The rounding-direction attribute affects all computational operations that might be inexact. Inexact numeric floating-point results always have the same sign as the unrounded result.
Both conversions are defined to be computational operations in clause 5, so again they should be performed according to the prevailing rounding mode.
I would argue that Snow Leopard has the correct behavior here (assuming that it is correctly rounding the results according to the prevailing rounding mode). If you want to force the old behavior, you can always wrap your printf calls in code that changes the rounding mode, I suppose, though that's clearly not ideal.
Alternatively, you could use the %a format specifier (hexadecimal floating point) on C99-compliant platforms. Since the result of this conversion is always exact, it will never be affected by the prevailing rounding mode. I don't think that the Windows C library supports %a, but you could probably port the BSD or glibc implementation easily enough if you need it.
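For completeness, here is a sketch of the two workarounds just mentioned: forcing round-to-nearest around the printf call, and using %a. It needs a C99 <fenv.h> and printf; error checking and #pragma STDC FENV_ACCESS ON are omitted for brevity:
#include <fenv.h>
#include <stdio.h>

/* Print a double with the decimal conversion done in round-to-nearest,
   regardless of the caller's current rounding mode. */
static void printf_nearest(double x)
{
    int saved = fegetround();
    fesetround(FE_TONEAREST);
    printf("%f\n", x);
    fesetround(saved);
}

int main(void)
{
    fesetround(FE_UPWARD);
    printf_nearest(0.1);     /* decimal output as if no rounding mode had been set */
    printf("%a\n", 0.1);     /* hexadecimal output is exact, hence mode-independent */
    return 0;
}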