Changing Mantissa's Width in Non-IEEE Floating Point implementation

I have a GCC cross compiler for an 18-bit soft-core processor target
that has the following data types defined:
Integer 18 bit, Long 36 bit and float 36 bit (single precision).
Right now my focus is on floating-point operations. Since the width is
non-standard (36 bit), I have the following scheme: 27 bits for the
mantissa (significand), 8 bits for the exponent and 1 sign bit.
I can see that the widths are defined in the float.h file. Of interest to
me are the following:
FLT_MANT_DIG and FLT_DIG.
They are defined as:
FLT_MANT_DIG 24
FLT_DIG 6
I have changed them to
FLT_MANT_DIG 28
FLT_DIG 9
as per my requirements in float.h, and then rebuilt the GCC compiler. But
I still get 32-bit floating-point output. Does anyone have any experience
implementing non-standard single-precision floating-point numbers,
or know of a workaround?

Efficient floating-point math requires hardware which is designed to support the exact floating-point formats which are being used. In the absence of such hardware, routines which are designed around a particular floating-point format will be much more efficient than routines which are readily adaptable to other formats. The GCC compiler and supplied libraries are designed to operate efficiently with IEEE-754 floating-point types and are not particularly adaptable to any others. The aforementioned headers exist not to allow a programmer to request a particular floating-point format, but merely to notify code about what format is going to be used.
If you don't need 72-bit floating-point types, and if the compiler's double type will perform 64-bit math in something resembling sensible fashion even though long is 36 bits rather than 32, you might be able to arrange things so that float values get unpacked into a four-word double, perform computations using that, and then rearrange the bits of the double to yield a float. Alternatively, you could write or find 36-bit floating-point libraries. I would not particularly expect GCC or its libraries to include such a thing, since 36-bit processors are rather rare these days.

Related

How to define a floating-point data type larger than 16 bytes?

I'm trying to do math calculations that require more than 100 decimal digits of precision. C data types cannot go beyond 16 bytes (long double), so I cannot compute more than ~17 decimals. Is there a way to create a variable in C that can get more precision?

Realistically, you need an arbitrary-precision arithmetic library; see Wikipedia for some options. I personally have found GNU MPFR to be fairly reliable, though I have also heard good things about Arb.

Do different platforms have different double precision in C?

C code
#include <stdio.h>
int main(int argc, char *argv[]) {
    double x = 1.0/3.0;
    printf("%.32f\n", x);
    return 0;
}
On Windows Server 2008 R2, I compile and run the code using MSVC 2013. The output is:
0.33333333333333331000000000000000
On macOS High Sierra, I compile and run the same code using Apple LLVM version 9.0.0 (clang-900.0.39.2). The output is:
0.33333333333333331482961625624739
Running the same code on ideone.com gives the same result as the one on macOS.
I know the double-precision floating-point format has only about 15 or 16 significant decimal digits. On macOS, the output keeps the precise decimal representation of the 64-bit binary value of 1/3 (as explained on Wikipedia), but why is it "truncated" on Windows?

According to the C standard, the following behaviour is implementation-defined:
The accuracy of the floating-point operations (+, -, *, /) and of the library functions in <math.h> and <complex.h> that return floating-point results is implementation-defined, as is the accuracy of the conversion between floating-point internal representations and string representations performed by the library functions in <stdio.h>, <stdlib.h>, and <wchar.h>. The implementation may state that the accuracy is unknown.

According to the MSVC docs, MSVC conforms to IEEE, so a double has up to 16 significant digits.
But the docs also state that by default the /fp flag is set to precise:
With /fp:precise on x86 processors, the compiler performs rounding on
variables of type float to the correct precision for assignments and
casts and when parameters are passed to a function. This rounding
guarantees that the data does not retain any significance greater than
the capacity of its type.
So the double is rounded when it is passed to printf, which gives you those zeros. As far as I can tell the trailing 1 is noise and should not be there (it's digit 17, I counted).
This behavior is MSVC-specific; you can read more about it in the docs I linked to, to see what flags you would need for other compilers.

According to coding-guidelines:
696 When a double is demoted to float, a long double is demoted to double or float, or a value being represented in greater precision and
range than required by its semantic type (see 6.3.1.8) is explicitly
converted to its semantic type (including to its own type), if the
value being converted can be represented exactly in the new type, it
is unchanged.
697 If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is
either the nearest higher or nearest lower representable value, chosen
in an implementation-defined manner.
698 If the value being converted is outside the range of values that can be represented, the behavior is undefined.

Yes, different implementations may have double types with different characteristics, but that is not actually the cause of the behavior you are seeing here. In this case, the cause is that the Windows routines for converting floating-point values to decimal do not produce the correctly rounded result as defined in IEEE 754 (they “give up” after a limited number of digits), whereas the macOS routines do produce the correctly rounded result (and in fact produce the exact result if allowed to produce enough digits for that).

The short answer is: yes, different platforms behave differently.
From Intel:
Intel Xeon processor E5-2600 V2 product family supports half-precision (16-bit) floating-point data types. Half-precision floating-point data types provide 2x more compact data representation than single-precision (32-bit) floating-point data format, but sacrifice data range and accuracy. In particular, half-floats may provide better performance than 32-bit floats when the 32-bit float data does not fit into the L1 cache.
I don't have access to a Xeon machine to test this, but I'd expect that you are ending up with a double that is not really a double due to the use of "faster floating-point, but less precision" CPU instructions.
The "i" line of Intel Core processors doesn't have that half-width fast floating point precision so it doesn't truncate the result.
Here's more on the half-precision format in Xeon on Intel website.

Is it safe to assume floating point is represented using IEEE754 floats in C?

Floating point is implementation-defined in C, so there aren't any guarantees.
Our code needs to be portable, and we are discussing whether or not it is acceptable to use IEEE754 floats in our protocol. For performance reasons it would be nice if we didn't have to convert back and forth between a fixed-point format when sending or receiving data.
I know that there can be differences between platforms and architectures regarding the size of long or wchar_t, but I can't seem to find anything specific about float and double.
What I have found so far is that the byte order may be reversed on big-endian platforms, and that there are platforms without floating-point support where code containing float and double wouldn't even link. Otherwise platforms seem to stick to IEEE754 single and double precision.
So is it safe to assume that floating point is in IEEE754 when available?
EDIT: In response to a comment:
What is your definition of "safe"?
By safe I mean that the bit pattern on one system means the same on another (after the byte rotation to deal with endianness).

Essentially all architectures in current non-punch-card use, including embedded architectures and exotic signal processing architectures, offer one of two floating point systems:
IEEE-754.
IEEE-754 except for blah. That is, they mostly implement 754, but cheap out on some of the more expensive and/or fiddly bits.
The most common cheap-outs:
Flushing denormals to zero. This invalidates certain sometimes-useful theorems (in particular, the theorem that a-b can be exactly represented if a and b are within a factor of 2), but in practice it's generally not going to be an issue.
Failure to recognize inf and NaN as special. These architectures will fail to follow the rules regarding inf and NaN as operands, and may not saturate to inf, instead producing numbers that are larger than FLT_MAX, which will generally be recognized by other architectures as NaN.
Proper rounding of division and square root. It's a whole lot easier to guarantee that the result is within 1-3 ulps of the exact result than within 1/2 ulp. A particularly common case is for division to be implemented as reciprocal+multiplication, which loses you one bit of precision.
Fewer or no guard digits. This is an unusual cheap-out, but means that other operations can be 1-2 ulps off.
BUUUUT... even those except for blah architectures still use IEEE-754's representation of numbers. Other than byte ordering issues, the bits describing a float or double on architecture A are essentially guaranteed to have the same meaning on architecture B.
So as long as all you care about is the representation of values, you're totally fine. If you care about cross-platform consistency of operations, you may need to do some extra work.
EDIT: As Chux mentions in the comments, a common extra source of inconsistency between platforms is the use of extended precision, such as the x87's 80-bit internal representation. That's the opposite of a cheap-out, and (with proper treatment) fully conforms to both IEEE-754 and the C standard, but it will likewise cause results to differ between architectures, and even between compiler versions and following apparently minor and unrelated code changes. However: a particular x86/x64 executable will NOT produce different results on different processors due to extended precision.

There is a macro to check (since C99):
C11 §6.10.8.3 Conditional feature macros
__STDC_IEC_559__ The integer constant 1, intended to indicate conformance to the specifications in annex F (IEC 60559 floating-point arithmetic).
IEC 60559 (short for ISO/IEC/IEEE 60559) is another name for IEEE-754.
Annex F then establishes the mapping between C floating types and IEEE-754 types:
The C floating types match the IEC 60559 formats as follows:
The float type matches the IEC 60559 single format.
The double type matches the IEC 60559 double format.
The long double type matches an IEC 60559 extended format, else a
non-IEC 60559 extended format, else the IEC 60559 double format.

I suggest you need to look more carefully at your definition of portable.
I would also suggest your definition of "safe" is insufficient. Even if the binary representation (allowing for endianness) is okay, the operations on variables may behave differently. After all, there are few applications of floating point that don't involve operations on variables.
If you want to support all host architectures that have ever been created then assuming IEEE floating point format is inherently unsafe. You will have to deal with systems that support different formats, systems that don't support floating point at all, systems for which compilers have switches to select floating point behaviours (with some behaviours being associated with non-IEEE formats), CPUs that have an optional co-processor (so floating point support depends on whether an additional chip is installed, but otherwise variants of the CPU are identical), systems that emulate floating point operations in software (some such software emulators are configurable at run time), and systems with buggy or incomplete implementation of floating point (which may or may not be IEEE based).
If you are willing to limit yourself to hardware of post 2000 vintage, then your risk is lower but non-zero. Virtually all CPUs of that vintage support IEEE in some form. However you still (as with older CPUs too) need to consider what floating point operations you wish to have supported, and the trade-offs you are willing to accept to have them. Different CPUs (or software emulation) have less complete implementation of floating point than others, and some are configured by default to not support some features - so it is necessary to change settings to enable some features, which can impact on performance or correctness of your code.
If you need to share floating point values between applications (which may be on different hosts with different features, built with different compilers, etc) then you will need to define a protocol. That protocol might involve IEEE format, but all your applications will need to be able to handle conversion between the protocol and their native representations.

Almost all common architectures now use IEEE-754, but this is not required by the standard. There used to be old non-IEEE-754 architectures, and some could still be around.
If the only requirement is for exchange of network data, my advice is:
if __STDC_IEC_559__ is defined, only use network order for the bytes and assume you do have standard IEEE-754 for float and double.
if __STDC_IEC_559__ is not defined, use a special interchange format, which could be IEEE-754 (one single protocol) or anything else (which needs a protocol indication).

Like others have mentioned, there's the __STDC_IEC_559__ macro, but it isn't very useful because it's only set by compilers that completely implement the respective annex in the C standard. There are compilers that implement only a subset but still have (mostly) usable IEEE floating point support.
If you're only concerned with the binary representation, you should write a feature test that checks the bit patterns of certain floating numbers. Something like:
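#include <stdint.h>
#include <stdio.h>
/* View the same 8 bytes as a double and as a 64-bit integer. */
typedef union {
    double d;
    uint64_t i;
} double_bits;
int main(void) {
    double_bits b;
    b.d = 2.5;
    /* 2.5 is 0x4004000000000000 in IEEE-754 binary64. */
    if (b.i != UINT64_C(0x4004000000000000)) {
        fprintf(stderr, "Not an IEEE-754 double\n");
        return 1;
    }
    return 0;
}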
Check a couple of numbers with different exponents, mantissae, and signs, and you should be on the safe side. Since these tests aren't expensive, you could even run them once at runtime.

Strictly speaking, it's not safe to assume floating-point support; generally speaking, the vast majority of platforms will support it. Notable exceptions include (now deprecated) VMS systems running on Alpha chips.
If you have the luxury of runtime checking, consider paranoia, a floating-point vetting tool written by William Kahan.
Edit: sounds like your application is more concerned with binary formats as they pertain to storage and/or serialization. I would suggest narrowing your scope to choosing a third-party library that supports this. You could do worse than Google Protocol Buffers.

Is C floating-point non-deterministic?

I have read somewhere that there is a source of non-determinism in C double-precision floating point as follows:
1. The C standard says that 64-bit floats (doubles) are required to produce only about 64-bit accuracy.
2. Hardware may do floating point in 80-bit registers.
3. Because of (1), the C compiler is not required to clear the low-order bits of floating-point registers before stuffing a double into the high-order bits.
This means YMMV, i.e. small differences in results can happen.
Is there any now-common combination of hardware and software where this really happens? I see in other threads that .net has this problem, but is C doubles via gcc OK? (e.g. I am testing for convergence of successive approximations based on exact equality)

The behavior on implementations with excess precision, which seems to be the issue you're concerned about, is specified strictly by the standard in most if not all cases. Combined with IEEE 754 (assuming your C implementation follows Annex F) this does not leave room for the kinds of non-determinism you seem to be asking about. In particular, things like x == x (which Mehrdad mentioned in a comment) failing are forbidden since there are rules for when excess precision is kept in an expression and when it is discarded. Explicit casts and assignment to an object are among the operations that drop excess precision and ensure that you're working with the nominal type.
Note however that there are still a lot of broken compilers out there that don't conform to the standards. GCC intentionally disregards them unless you use -std=c99 or -std=c11 (i.e. the "gnu99" and "gnu11" options are intentionally broken in this regard). And prior to GCC 4.5, correct handling of excess precision was not even supported.

This may happen on Intel x86 code that uses the x87 floating-point unit (except probably 3., which seems bogus. LSB bits will be set to zero.). So the hardware platform is very common, but on the software side use of x87 is dying out in favor of SSE.
Basically, whether a number is represented in 80 or 64 bits is at the whim of the compiler and may change at any point in the code, with the consequence, for example, that a number which just tested non-zero is now zero.
See "The pitfalls of verifying floating-point computations", page 8ff.

Testing for exact convergence (or equality) in floating point is usually a bad idea, even in a totally deterministic environment. FP is an approximate representation to begin with. It is much safer to test for convergence to within a specified epsilon.

Why don't the authors of the C99 standard specify a standard for the size of floating point types?

I noticed on Windows and Linux x86, float is a 4-byte type, double is 8, but long double is 12 and 16 on x86 and x86_64 respectively. C99 is supposed to be breaking such barriers with the specific integral sizes.
The initial technological limitation appears to be due to the x86 processor not being able to handle more than 80-bit floating point operations (plus 2 bytes to round it up) but why the inconsistency in the standard compared to int types? Why don't they go at least to 80-bit standardization?

The C language doesn't specify the implementation of various types, so that it can be efficiently implemented on as wide a variety of hardware as possible.
This extends to the integer types too - the C standard integral types have minimum ranges (eg. signed char is -127 to 127, short and int are both -32,767 to 32,767, long is -2,147,483,647 to 2,147,483,647, and long long is -9,223,372,036,854,775,807 to 9,223,372,036,854,775,807). For almost all purposes, this is all that the programmer needs to know.
C99 does provide "fixed-width" integer types, like int32_t - but these are optional - if the implementation can't provide such a type efficiently, it doesn't have to provide it.
For floating point types, there are equivalent limits (eg double must have at least 10 decimal digits worth of precision).

They were trying to (mostly) accommodate pre-existing C implementations, some of which don't even use IEEE floating point formats.

ints can be used to represent abstract things like ids, colors, error code, requests, etc. In this case ints are not really used as integers numbers but as sets of bits (= a container). Most of the time a programmer knows exactly how many bits he needs, so he wants to be able to use just as many bits as needed.
floats, on the other hand, are designed for a very specific usage (floating-point arithmetic). You are very unlikely to be able to determine precisely how many bits you need for your float.
Actually, most of the time the more bits you have the better it is.

C99 is supposed to be breaking such barriers with the specific integral sizes.
No, those fixed-width (u)intN_t types are completely optional because not all processors use type sizes that are a power of 2. C99 only requires (u)int_fastN_t and (u)int_leastN_t to be defined. That means the premise "why the inconsistency in the standard compared to int types" is just plain wrong, because there's no consistency in the size of int types.
Lots of modern DSPs use a 24-bit word for 24-bit audio. There are even 20-bit DSPs like the Zoran ZR3800x family or 28-bit DSPs like the ADAU1701, which allows transformation of 16/24-bit audio without clipping. Many 32 or 64-bit architectures also have some odd-sized registers to allow accumulation of values without overflow, for example the TI C5500/C6000 with 40-bit long and SHARC with an 80-bit accumulator. The Motorola DSP5600x/3xx series also has odd sizes: 2-byte short, 3-byte int, 6-byte long. In the past there were lots of architectures with other word sizes like 12, 18, 36, 60-bit... and lots of CPUs that use one's complement or sign-magnitude. See Exotic architectures the standards committees care about.
C was designed to be flexible to support all kinds of such platforms. Specifying a fixed size, whether for integer or floating-point types, defeats that purpose. Floating-point support in hardware varies wildly just like integer support. There are different formats that use decimal, hexadecimal or possibly other bases. Each format has different sizes of exponent/mantissa, different position of sign/exponent/mantissa and even the signed format. For example some use two's complement for the mantissa while some others use two's complement for the exponent or the whole floating-point value. You can see many formats here but that's obviously not every format that ever existed. For example the SHARC above has a special 40-bit floating-point format. Some platforms also use double-double arithmetic for long double. See also
What uncommon floating-point sizes exist in C++ compilers?
Do any real-world CPUs not use IEEE 754?
That means you can't standardize a single floating-point format for all platforms because there's no one-size-fits-all solution. If you're designing a DSP then obviously you need to have a format that's best for your purpose so that you can churn through as much data as possible. There's no reason to use IEEE-754 binary64 when a 40-bit format has enough precision for your application, fits better in cache and needs far less die area. Or if you're on a small embedded system then 80-bit long double is usually useless, as you don't even have enough ROM for that 80-bit long double library. That's why some platforms limit long double to 64-bit like double.
