How is the behavior of the ARM Neon float-to-integer conversion instructions vcvt.s32.f32 and vcvt.u32.f32 defined in case of overflows? Can you rely upon the behavior that I observed on a particular processor, i.e. that the result is saturated? Any links to official documentation are highly appreciated.
The ARM Architecture Reference Manual is the source of all answers for this sort of question. In section A8.8.305 it says:
The floating-point to integer operation uses the Round towards Zero rounding mode.
And in the Glossary it clarifies:
Round towards Zero (RZ) mode
Means that results are rounded to the nearest representable number that is no greater in magnitude than the unrounded result.
(Which is the same meaning for "Round towards Zero" as in IEEE 754.)
The gory details are in the pseudocode for FPToFixed and FPUnpack: FPToFixed explicitly saturates any result that falls outside the range of the destination integer type.
So, in short: yes, the result is guaranteed to be saturated.
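For illustration, here is a small self-contained test using the corresponding NEON intrinsic; this is my sketch rather than part of the original answer, and the expected values simply follow from the saturating behaviour described above:

#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
    /* 3e9 overflows int32_t, -3e9 underflows it, 2.7 merely truncates. */
    const float in[4] = { 3.0e9f, -3.0e9f, 2.7f, -2.7f };
    float32x4_t f = vld1q_f32(in);
    int32x4_t   s = vcvtq_s32_f32(f);          /* compiles to vcvt.s32.f32 */
    printf("%d %d %d %d\n",                    /* expected: 2147483647 -2147483648 2 -2 */
           vgetq_lane_s32(s, 0), vgetq_lane_s32(s, 1),
           vgetq_lane_s32(s, 2), vgetq_lane_s32(s, 3));
    return 0;
}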
I am trying to understand how floating point conversion is handled at the low level. So based on my understanding, this is implemented in hardware. So, for example, SSE provides the instruction cvttss2si which converts a float to an int.
But my question is: was the floating point conversion always handled this way? What about before the invention of FPU and SSE, was the calculation done manually using Assembly code?
It depends on the processor, and there have been a huge number of different processors over the years.
FPU stands for "floating-point unit". It's a more or less generic term that can refer to a floating-point hardware unit for any computer system. Some systems might have floating-point operations built into the CPU. Others might have a separate chip. Yet others might not have hardware floating-point support at all. If you specify a floating-point conversion in your code, the compiler will generate whatever CPU instructions are needed to perform the necessary computation. On some systems, that might be a call to a subroutine that does whatever bit manipulations are needed.
SSE stands for "Streaming SIMD Extensions", and is specific to the x86 family of CPUs. For non-x86 CPUs, there's no "before" or "after" SSE; SSE simply doesn't apply.
The conversion from floating-point to integer is considered a basic enough operation that the 387 instruction set already had such an instruction, FIST—although not useful for compiling the (int)f construct of C programs, as that instruction used the current rounding mode.
Some RISC instruction sets have always considered that a dedicated conversion instruction from floating-point to integer was an unnecessary luxury, and that this could be done with several instructions accessing the IEEE 754 floating-point representation. One basic scheme might look like the one in this blog post, although the blog post is about rounding a float to a float representing the nearest integer.
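For reference, one widely used scheme of that sort is the "magic constant" trick; this is a sketch of my own (not from the blog post), assuming IEEE 754 binary32, the default round-to-nearest mode, and no aggressive FP reassociation by the compiler:

/* Round x to the nearest integral float without a conversion instruction.
   Valid only for |x| < 2^22. */
float round_to_nearest_float(float x)
{
    const float magic = 12582912.0f;   /* 1.5 * 2^23 */
    return (x + magic) - magic;        /* the add/subtract forces rounding to an integer */
}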
Prior to the standardization of IEEE 754 arithmetic, there were many competing vendor-specific ways of doing floating-point arithmetic. These had different ranges, precision, and different behavior with respect to overflow, underflow, signed zeroes, and undefined results such as 0/0 or sqrt(-1).
However, you can divide floating point implementations into two basic groups: hardware and software. In hardware, you would typically see an opcode which performs the conversion, although coprocessor FPUs can complicate things. In software, the conversion would be done by a function.
Today, there are still soft FPUs around, mostly on embedded systems. Not too long ago, this was common for mobile devices, but soft FPUs are still the norm on smaller systems.
Indeed, floating point operations are a challenge for hardware engineers, as they require a lot of hardware (leading to higher costs of the final product) and consume a lot of power. There are some architectures that do not contain a floating point unit. There are also architectures that do not provide instructions even for basic operations like integer division. The ARM architecture is an example of this, where you have to implement division in software. Also, the floating point unit comes as an optional coprocessor in this architecture. This is worth keeping in mind, considering that ARM is the main architecture used in embedded systems.
IEEE 754 (the floating point standard used today in most applications) is not the only way of representing real numbers. You can also represent them using a fixed point format. For example, on a 32-bit machine you can assume there is a binary point between bits 15 and 16 and perform operations keeping this in mind. This is a simple way of representing fractional numbers, and it can be handled in software easily, as sketched below.
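As an illustration of that idea, here is a minimal sketch of a hypothetical Q16.16 format (16 integer bits, 16 fractional bits); the names are made up for this example:

#include <stdint.h>

typedef int32_t q16_16;                  /* Q16.16: 16 integer bits, 16 fraction bits */
#define Q16_ONE (1 << 16)                /* the value 1.0 in Q16.16 */

q16_16 q16_from_int(int x)         { return (q16_16)(x * Q16_ONE); }
q16_16 q16_add(q16_16 a, q16_16 b) { return a + b; }
q16_16 q16_mul(q16_16 a, q16_16 b)
{
    /* Widen to 64 bits so the intermediate product cannot overflow, then
       shift the radix point back into place (arithmetic shift assumed). */
    return (q16_16)(((int64_t)a * b) >> 16);
}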
It depends on the implementation of the compiler. You can implement floating point math in just about any language (an example in C: http://www.jhauser.us/arithmetic/SoftFloat.html), and so usually the compiler's runtime library will include a software implementation of things like floating point math (or possibly the target hardware has always supported native instructions for this - again, depends on the hardware) and instructions which target the FPU or use SSE are offered as an optimization.
"Before floating point units" doesn't really apply, since some of the earliest computers made back in the 1940s supported floating point numbers: wiki - first electro mechanical computers.
On processors without floating point hardware, the floating point operations are implemented in software, or on some computers, in microcode as opposed to being fully hardware implemented: wiki - microcode. Or the operations could be handled by separate hardware components such as the Intel x87 series: wiki - x87.
But my question is: was the floating point conversion always handled this way?
No. There's no x87 or SSE on architectures other than x86, so there's no cvttss2si either.
Everything you can do with software, you can also do in hardware and vice versa.
The same goes for float conversion. If you don't have the hardware support, just do some bit hacking. There's nothing really low level here, so you can do it in C or any other language easily. There are already a lot of solutions on SO, and a minimal sketch follows the list of links below:
Converting Int to Float/Float to Int using Bitwise
Casting float to int (bitwise) in C
Converting float to an int (float2int) using only bitwise manipulation
...
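For concreteness, here is the promised sketch of such a bit-level conversion; the function name is hypothetical, and it assumes IEEE 754 binary32, truncation toward zero, no NaN handling, and a crude clamp for out-of-range values:

#include <stdint.h>
#include <string.h>

int32_t float_to_int_trunc(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                        /* reinterpret the bit pattern */
    uint32_t sign = bits >> 31;
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127;  /* unbiased exponent */
    uint32_t mant = (bits & 0x7FFFFF) | 0x800000;          /* restore the implicit leading 1 */

    if (exp < 0)  return 0;                                /* |f| < 1 truncates to 0 */
    if (exp > 30) return sign ? INT32_MIN : INT32_MAX;     /* crude overflow clamp */

    int32_t magnitude = (exp <= 23) ? (int32_t)(mant >> (23 - exp))
                                    : (int32_t)(mant << (exp - 23));
    return sign ? -magnitude : magnitude;
}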
Yes. The exponent was changed to 0 by shifting the mantissa, denormalizing the number. If the result was too large for an int, an exception was generated. Otherwise the denormalized number (minus the fractional part, and optionally rounded) is the integer equivalent.
Is there any function that controls the rounding mode of the vcvt_s32_f32 intrinsic? I want to use round to nearest even instead of round toward negative infinity.
Thanks.
No, you can't change the rounding mode.
NEON is designed for performance rather than precision, and thus is restricted compared to VFP. Unlike VFP, it's not a full IEEE 754 implementation, and is hardwired to certain settings - quoting from the ARM ARM:
denormalized numbers are flushed to zero
only default NaNs are supported
the Round to Nearest* rounding mode selected
untrapped exception handling selected for all floating-point exceptions
The specific case of floating-point to integer conversion is slightly different in that the behaviour of the VCVT instruction in this case (for both VFP and NEON) is to ignore the selected rounding mode and always round towards zero. The VCVTR instruction which does use the selected rounding mode is only available in VFP.
The ARMv8 architecture introduced a whole bunch of rounding and conversion instructions
for using specific rounding modes, but I suspect that's not much help in this particular case. If you want to do conversions under a different rounding mode on ARMv7 and earlier, you'll either have to use VFP (if available) or some bit-hacking to implement it manually.
* The ARM ARM uses IEEE 754-1985 terminology, so more precisely this is round to nearest, ties to even
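For what it's worth, one common ARMv7 workaround for a round-to-nearest-even conversion is the "magic constant" trick expressed with intrinsics. This is a sketch of my own (the helper name is made up), valid only for inputs with magnitude below 2^22, and it relies on NEON arithmetic being hardwired to round-to-nearest as quoted above:

#include <arm_neon.h>

static inline int32x4_t cvt_nearest_even_s32_f32(float32x4_t x)
{
    const float32x4_t magic = vdupq_n_f32(12582912.0f);   /* 1.5 * 2^23 */
    /* Round |x| to an integral float; the VADD/VSUB themselves do the
       nearest-even rounding. */
    float32x4_t r = vsubq_f32(vaddq_f32(vabsq_f32(x), magic), magic);
    /* Put the sign back, then truncate; truncation is exact since r is integral. */
    uint32x4_t neg = vcltq_f32(x, vdupq_n_f32(0.0f));
    r = vbslq_f32(neg, vnegq_f32(r), r);
    return vcvtq_s32_f32(r);
}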
I have read somewhere that there is a source of non-determinism in C double-precision floating point as follows:
1. The C standard says that 64-bit floats (doubles) are required to produce only about 64-bit accuracy.
2. Hardware may do floating point in 80-bit registers.
3. Because of (1), the C compiler is not required to clear the low-order bits of floating-point registers before stuffing a double into the high-order bits.
This means YMMV, i.e. small differences in results can happen.
Is there any now-common combination of hardware and software where this really happens? I see in other threads that .net has this problem, but is C doubles via gcc OK? (e.g. I am testing for convergence of successive approximations based on exact equality)
The behavior on implementations with excess precision, which seems to be the issue you're concerned about, is specified strictly by the standard in most if not all cases. Combined with IEEE 754 (assuming your C implementation follows Annex F) this does not leave room for the kinds of non-determinism you seem to be asking about. In particular, things like x == x (which Mehrdad mentioned in a comment) failing are forbidden since there are rules for when excess precision is kept in an expression and when it is discarded. Explicit casts and assignment to an object are among the operations that drop excess precision and ensure that you're working with the nominal type.
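As a small sketch of what that guarantee buys you (my example, assuming a conforming C99/C11 implementation with Annex F):

#include <stdio.h>

int main(void)
{
    double a = 0.1, b = 0.2;
    double s = a + b;        /* assignment to an object discards excess precision */
    /* The cast also forces the value to the nominal double type, so this comparison
       must hold even if a + b was evaluated in 80-bit registers. */
    printf("%d\n", (double)(a + b) == s);   /* prints 1 on a conforming implementation */
    return 0;
}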
Note however that there are still a lot of broken compilers out there that don't conform to the standards. GCC intentionally disregards them unless you use -std=c99 or -std=c11 (i.e. the "gnu99" and "gnu11" options are intentionally broken in this regard). And prior to GCC 4.5, correct handling of excess precision was not even supported.
This may happen on Intel x86 code that uses the x87 floating-point unit (except probably 3., which seems bogus. LSB bits will be set to zero.). So the hardware platform is very common, but on the software side use of x87 is dying out in favor of SSE.
Basically, whether a number is represented in 80 or 64 bits is at the whim of the compiler and may change at any point in the code, with the consequence, for example, that a number which just tested non-zero is now zero.
See "The pitfalls of verifying floating-point computations", page 8ff.
Testing for exact convergence (or equality) in floating point is usually a bad idea, even in a totally deterministic environment. FP is an approximate representation to begin with. It is much safer to test for convergence to within a specified epsilon.
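For instance, a hedged sketch of a tolerance-based convergence check (the scale factor is an illustrative choice, not a universal recipe):

#include <math.h>
#include <float.h>

/* Returns nonzero when successive iterates agree to within a relative tolerance. */
static int has_converged(double prev, double curr)
{
    double tol = 8.0 * DBL_EPSILON * fmax(fabs(prev), fabs(curr));
    return fabs(curr - prev) <= tol;
}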
Division by zero in a C program results in abnormal termination with the error message Floating point exception (core dumped). This is unsurprising for floating point division, but why does it say this when integer division by zero occurs? Does integer division actually use the FPU under the hood?
(This is all on Linux under x86, by the way.)
Does integer division actually use the FPU under the hood?
No, Linux just generates SIGFPE in this case too (it's a legacy name whose usage has now been extended). Indeed, the Single Unix Specification defines SIGFPE as "Erroneous arithmetic operation".
man signal mentions:
Integer division by zero has undefined result. On some architectures it will generate a SIGFPE signal. (Also dividing the most negative integer by -1 may generate SIGFPE.)
My guess at a historical explanation for this would be that the original unix hardware didn't generate a trap on integer division by zero, so the name SIGFPE made sense. (PDP assembly programmers, confirm?) Then later when the system was ported (or in the case of Linux, reimplemented) to hardware with an integer division-by-zero trap, it was not considered worthwhile to add a new signal number, so the old one acquired a new meaning and now has a slightly confusing name.
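To see the signal in action, here is a minimal sketch for Linux/x86 (not from the original answers; returning normally from a handler for a hardware-generated SIGFPE is undefined, so the handler simply exits):

#include <signal.h>
#include <unistd.h>
#include <stdio.h>

static void on_fpe(int sig)
{
    (void)sig;
    /* write() and _exit() are async-signal-safe. */
    write(STDERR_FILENO, "caught SIGFPE\n", 14);
    _exit(1);
}

int main(void)
{
    signal(SIGFPE, on_fpe);
    volatile int zero = 0;      /* volatile stops the compiler folding the division away */
    int r = 1 / zero;           /* #DE trap on x86, delivered as SIGFPE */
    printf("unreachable: %d\n", r);
    return 0;
}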
There could be many different implementation-specific reasons for that.
For example, the FPU unit on x86 platform supports both floating point and integer formats for reading arguments and writing results. Back when the platform itself was 16-bit, some compilers used the FPU to perform division with 32-bit integer operands (since there's no precision loss for 32-bit wide data). Under such circumstances there would be nothing unusual in getting a genuine FPU error for invalid 32-bit integer division.
While porting an application from Linux x86 to iOS ARM (iPhone 4), I've discovered a difference in behavior of floating point arithmetic on small values.
64-bit floating point numbers (double) smaller in magnitude than [+/-]2.2250738585072014E-308 are called denormal/denormalized/subnormal numbers in the IEEE 754-1985/IEEE 754-2008 standards.
On the iPhone 4, such small numbers are treated as zero (0), while on x86, subnormal numbers can be used in computation.
I wasn't able to find any explanation regarding conformance to the IEEE 754 standard in Apple's documentation (Mac OS X Manual Page For float(3)).
But thanks to some answers on Stack Overflow (flush-to-zero behavior in floating-point arithmetic, Double vs float on the iPhone), I have found some clues.
According to some searches, it seems the VFP (or NEON) math coprocessor used alongside the ARM core uses Flush-To-Zero (FTZ) mode (i.e. subnormal results are converted to 0 on output) and Denormals-Are-Zero (DAZ) mode (i.e. subnormal values are converted to 0 when used as input operands) to provide fast, hardware-handled IEEE 754 computation. The VFP documentation describes two relevant options:
Full IEEE 754 compliance, with ARM support code
Run-Fast mode for near IEEE 754 compliance (hardware only)
A good explanation of FTZ and DAZ can be found in
x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ):
FTZ and DAZ modes both handle the cases when invalid floating-point data occurs or is
processed with underflow or denormal conditions. [...]. The difference between a number
that is handled by FTZ and DAZ is very subtle. FTZ handles underflow conditions while
DAZ handles denormals. An underflow condition occurs when a computation results in a
denormal. In this case, FTZ mode sets the output to zero. DAZ fixes the cases when
denormals are used as input, either as constants or by reading invalid memory into
registers. DAZ mode sets the inputs of the calculation to zero before computation. FTZ
can then be said to handle [output] while DAZ handles [input].
The only thing about FTZ on Apple's developer site seems to be in the iOS ABI Function Call Guide:
VFP status register (FPSCR), usage "Special": Condition code bits (28-31) and saturation bits (0-4) are not preserved by function calls. Exception control (8-12), rounding mode (22-23), and flush-to-zero (24) bits should be modified only by specific routines that affect the application state (including framework API functions). Short vector length (16-18) and stride (20-21) bits must be zero on function entry and exit. All other bits must not be modified.
According to the ARM1176JZF-S Technical Reference Manual, section 18.5 Modes of operation (the first iPhone's processor), the VFP can be configured to fully support IEEE 754 (subnormal arithmetic), but in this case it requires some software support (trapping into the kernel to compute in software).
Note: I have also read Debian's ARM Hard Float Port and VFP comparison pages.
My questions are:
Where can one find definitive answers regarding subnormal number handling across iOS devices?
Can one set the iOS system to provide support for subnormal numbers without asking the compiler to produce only full software floating point code?
Thanks.
Can one set the iOS system to provide support for subnormal numbers without asking the compiler to produce only full software floating point code?
Yes. This can be achieved by setting the FZ bit in the FPSCR to zero:
static inline void DisableFZ( )
{
    /* Read FPSCR, clear the flush-to-zero bit (bit 24), and write it back. */
    __asm__ volatile("vmrs r0, fpscr\n"
                     "bic r0, $(1 << 24)\n"
                     "vmsr fpscr, r0" : : : "r0");
}
Note that this can cause significant slowdowns in application performance when appreciable quantities of denormal values are encountered. You can (and should) restore the default floating-point state before making calls into any code that does not make an ABI guarantee to work properly in non-default modes:
static inline void RestoreFZ( ) {
    /* Read FPSCR, set the flush-to-zero bit (bit 24) again, and write it back. */
    __asm__ volatile("vmrs r0, fpscr\n"
                     "orr r0, $(1 << 24)\n"
                     "vmsr fpscr, r0" : : : "r0");
}
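For illustration, a hypothetical call site would bracket the denormal-sensitive work and restore the default state afterwards (process_samples is a made-up function name):

    DisableFZ();
    /* ... computation that needs gradual underflow ... */
    process_samples(buffer, count);
    RestoreFZ();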
Please file a bug report to request that better documentation be provided for the modes of FP operation in iOS.