ARM NEON convert f32 to s32 with round toward even - arm

Is there any function that controls the rounding mode of the vcvt_s32_f32 intrinsic? I want to use round toward even instead of round toward negative infinity.
Thanks.

No, you can't change the rounding mode.
NEON is designed for performance rather than precision, and thus is restricted compared to VFP. Unlike VFP, it's not a full IEEE 754 implementation, and is hardwired to certain settings - quoting from the ARM ARM:
denormalized numbers are flushed to zero
only default NaNs are supported
the Round to Nearest* rounding mode selected
untrapped exception handling selected for all floating-point exceptions
The specific case of floating-point to integer conversion is slightly different in that the behaviour of the VCVT instruction in this case (for both VFP and NEON) is to ignore the selected rounding mode and always round towards zero. The VCVTR instruction which does use the selected rounding mode is only available in VFP.
The ARMv8 architecture introduced a whole bunch of rounding and conversion instructions for using specific rounding modes, but I suspect that's not much help in this particular case. If you want to do conversions under a different rounding mode on ARMv7 and earlier, you'll either have to use VFP (if available) or some bit-hacking to implement it manually (one such bit-hack is sketched below).
* The ARM ARM uses IEEE 754-1985 terminology, so more precisely this is round to nearest, ties to even
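For ARMv7 NEON specifically, a common workaround is to let NEON's own hardwired round-to-nearest-even arithmetic do the rounding before the truncating convert. Below is a minimal sketch assuming inputs stay well inside ±2^22; the helper name is made up for illustration, and the ARMv8 path uses the vcvtnq_s32_f32 intrinsic where it is available.

#include <arm_neon.h>

/* Sketch: convert f32 lanes to s32 with round-to-nearest, ties to even.
   Assumes |x| < 2^22 on the ARMv7 path. */
static inline int32x4_t cvt_s32_f32_nearest_even(float32x4_t x)
{
#if defined(__aarch64__) || defined(__ARM_FEATURE_DIRECTED_ROUNDING)
    return vcvtnq_s32_f32(x);                      /* ARMv8 VCVTN: ties to even */
#else
    /* Adding 1.5*2^23 pushes the value into [2^23, 2^24), where the float ULP
       is 1, so the NEON add itself rounds to the nearest integer (ties to even,
       NEON's fixed mode). Subtracting the constant back yields an exactly
       integral float, which the truncating VCVT then converts losslessly. */
    const float32x4_t magic = vdupq_n_f32(12582912.0f);   /* 1.5 * 2^23 */
    float32x4_t rounded = vsubq_f32(vaddq_f32(x, magic), magic);
    return vcvtq_s32_f32(rounded);
#endif
}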

Related

How floating point conversion was handled before the invention of FPU and SSE?

I am trying to understand how floating point conversion is handled at the low level. So based on my understanding, this is implemented in hardware. So, for example, SSE provides the instruction cvttss2si which converts a float to an int.
But my question is: was the floating point conversion always handled this way? What about before the invention of FPU and SSE, was the calculation done manually using Assembly code?
It depends on the processor, and there have been a huge number of different processors over the years.
FPU stands for "floating-point unit". It's a more or less generic term that can refer to a floating-point hardware unit for any computer system. Some systems might have floating-point operations built into the CPU. Others might have a separate chip. Yet others might not have hardware floating-point support at all. If you specify a floating-point conversion in your code, the compiler will generate whatever CPU instructions are needed to perform the necessary computation. On some systems, that might be a call to a subroutine that does whatever bit manipulations are needed.
SSE stands for "Streaming SIMD Extensions", and is specific to the x86 family of CPUs. For non-x86 CPUs, there's no "before" or "after" SSE; SSE simply doesn't apply.
The conversion from floating-point to integer is considered a basic enough operation that the 387 instruction set already had such an instruction, FIST—although not useful for compiling the (int)f construct of C programs, as that instruction used the current rounding mode.
Some RISC instruction sets have always considered that a dedicated conversion instruction from floating-point to integer was an unnecessary luxury, and that this could be done with several instructions accessing the IEEE 754 floating-point representation. One basic scheme might look like this blog post, although the blog post is about rounding a float to a float representing the nearest integer.
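As a rough, illustrative sketch of the kind of integer-only sequence such code might use (not any particular library's implementation; the clamping of out-of-range values here is an arbitrary choice):

#include <stdint.h>
#include <string.h>

/* Sketch: truncating float -> int32 conversion using only integer operations,
   by unpacking the IEEE 754 fields. NaN handling is folded into the clamp. */
static int32_t float_to_int_trunc(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                      /* reinterpret the bits */

    uint32_t sign = bits >> 31;
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127;   /* unbiased exponent */
    uint32_t mant = (bits & 0x007FFFFFu) | 0x00800000u;     /* implicit leading 1 */

    if (exp < 0)  return 0;                              /* |f| < 1 truncates to 0 */
    if (exp > 30) return sign ? INT32_MIN : INT32_MAX;   /* out of range: clamp */

    /* mant currently holds the value scaled by 2^23; shift to remove the scale. */
    int32_t mag = (exp >= 23) ? (int32_t)(mant << (exp - 23))
                              : (int32_t)(mant >> (23 - exp));
    return sign ? -mag : mag;
}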
Prior to the standardization of IEEE 754 arithmetic, there were many competing vendor-specific ways of doing floating-point arithmetic. These had different ranges, precision, and different behavior with respect to overflow, underflow, signed zeroes, and undefined results such as 0/0 or sqrt(-1).
However, you can divide floating point implementations into two basic groups: hardware and software. In hardware, you would typically see an opcode which performs the conversion, although coprocessor FPUs can complicate things. In software, the conversion would be done by a function.
Today, there are still soft FPUs around, mostly on embedded systems. Not too long ago, this was common for mobile devices, but soft FPUs are still the norm on smaller systems.
Indeed, floating point operations are a challenge for hardware engineers, as they require a lot of hardware (which raises the cost of the final product) and consume significant power. Some architectures do not contain a floating point unit at all, and some do not even provide instructions for basic operations like integer division. The ARM architecture is an example of this: you have to implement division in software, and the floating point unit comes as an optional coprocessor. That is worth keeping in mind, considering that ARM is the main architecture used in embedded systems.
IEEE 754 (the floating point standard used today in most applications) is not the only way of representing real numbers. You can also represent them using a fixed point format. For example, on a 32 bit machine you can assume there is a binary point between bits 15 and 16 and perform operations with that convention in mind. This is a simple way of representing fractional values and it can be handled in software easily.
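A minimal sketch of that fixed-point idea, assuming a Q16.16 layout (16 integer bits, 16 fraction bits); the type and helper names are made up for illustration:

#include <stdint.h>

/* Q16.16 fixed point: the binary point sits between bits 15 and 16. */
typedef int32_t q16_16;
#define Q16_ONE (1 << 16)

static inline q16_16  q16_from_int(int32_t i) { return (q16_16)((uint32_t)i << 16); }
static inline int32_t q16_to_int(q16_16 x)    { return x >> 16; }

/* Multiplication and division need a 64-bit intermediate plus a shift by 16. */
static inline q16_16 q16_mul(q16_16 a, q16_16 b) { return (q16_16)(((int64_t)a * b) >> 16); }
static inline q16_16 q16_div(q16_16 a, q16_16 b) { return (q16_16)((((int64_t)a) << 16) / b); }

/* Example: 2.5 * 3.0 -> 7.5, i.e. 7.5 * 65536 in the raw representation. */
/* q16_16 r = q16_mul(Q16_ONE * 5 / 2, Q16_ONE * 3); */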
It depends on the implementation of the compiler. You can implement floating point math in just about any language (an example in C: http://www.jhauser.us/arithmetic/SoftFloat.html), and so usually the compiler's runtime library will include a software implementation of things like floating point math (or possibly the target hardware has always supported native instructions for this - again, depends on the hardware) and instructions which target the FPU or use SSE are offered as an optimization.
"Before the invention of the FPU" doesn't really apply, since some of the earliest computers, built back in the 1940s, already supported floating point numbers: wiki - first electromechanical computers.
On processors without floating point hardware, the floating point operations are implemented in software, or on some computers in microcode as opposed to being fully hardware implemented: wiki - microcode. The operations could also be handled by separate hardware components such as the Intel x87 series: wiki - x87.
But my question is: was the floating point conversion always handled this way?
No, there's no x87 or SSE on architectures other than x86, so there's no cvttss2si either.
Everything you can do in software, you can also do in hardware, and vice versa.
The same goes for float conversion. If you don't have hardware support, just do some bit hacking. There's nothing especially low-level here, so you can do it in C or any other language easily. There are already a lot of solutions on SO:
Converting Int to Float/Float to Int using Bitwise
Casting float to int (bitwise) in C
Converting float to an int (float2int) using only bitwise manipulation
...
Yes. The exponent was changed to 0 by shifting the mantissa, denormalizing the number. If the result was too large for an int, an exception was generated. Otherwise the denormalized number (minus the fractional part, and optionally rounded) is the integer equivalent.

Does any floating point-intensive code produce bit-exact results in any x86-based architecture?

I would like to know if any code in C or C++ using floating point arithmetic would produce bit exact results in any x86 based architecture, regardless of the complexity of the code.
To my knowledge, any x86 architecture since the Intel 8087 uses an FPU prepared to handle IEEE-754 floating point numbers, and I cannot see any reason why the result would be different on different architectures. However, if they were different (namely due to a different compiler or a different optimization level), would there be some way to produce bit-exact results by just configuring the compiler?
Table of contents:
C/C++
asm
Creating real-life software that achieves this.
In C or C++:
No, a fully ISO C11 and IEEE-conforming C implementation does not guarantee bit-identical results to other C implementations, even other implementations on the same hardware.
(And first of all, I'm going to assume we're talking about normal C implementations where double is the IEEE-754 binary64 format, etc., even though it would be legal for a C implementation on x86 to use some other format for double and implement FP math with software emulation, and define the limits in float.h. That might have been plausible when not all x86 CPUs came with an FPU, but in 2016 that's Deathstation 9000 territory.)
related: Bruce Dawson's Floating-Point Determinism blog post is an answer to this question. His opening paragraph is amusing (and is followed by a lot of interesting stuff):
Is IEEE floating-point math deterministic? Will you always get the same results from the same inputs? The answer is an unequivocal “yes”. Unfortunately the answer is also an unequivocal “no”. I’m afraid you will need to clarify your question.
If you're pondering this question, then you will definitely want to have a look at the index to Bruce's series of articles about floating point math, as implemented by C compilers on x86, and also asm, and IEEE FP in general.
First problem: only the "basic operations" + - * / and sqrt are required to return "correctly rounded" results, i.e. <= 0.5 ulp of error, correctly rounded out to the last bit of the mantissa, so the result is the closest representable value to the exact result.
Other math library functions like pow(), log(), and sin() allow implementers to make a tradeoff between speed and accuracy. For example, glibc generally favours accuracy, and is slower than Apple's OS X math libraries for some functions, IIRC. See also glibc's documentation of the error bounds for every libm function across different architectures.
But wait, it gets worse. Even code that only uses the correctly-rounded basic operations doesn't guarantee the same results.
C rules also allow some flexibility in keeping higher precision temporaries. The implementation defines FLT_EVAL_METHOD so code can detect how it works, but you don't get a choice if you don't like what the implementation does. You do get a choice (with #pragma STDC FP_CONTRACT off) to forbid the compiler from e.g. turning a*b + c into an FMA with no rounding of the a*b temporary before the add.
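As a small illustration of that pragma (a sketch; compiler support varies, and with GCC the -ffp-contract= option is the practical control):

/* With contraction disabled, a*b must be rounded to double before the add,
   so every conforming compiler produces the same two-rounding result.
   With contraction allowed, a compiler may instead emit one fused
   multiply-add with a single rounding, giving a slightly different bit
   pattern on FMA-capable hardware. */
#pragma STDC FP_CONTRACT OFF

double mul_add(double a, double b, double c)
{
    return a * b + c;     /* never fused under FP_CONTRACT OFF */
}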
On x86, compilers targeting 32-bit non-SSE code (i.e. using obsolete x87 instructions) typically keep FP temporaries in x87 registers between operations. This produces the FLT_EVAL_METHOD = 2 behaviour of 80-bit precision. (The standard specifies that rounding still happens on every assignment, but real compilers like gcc don't actually do extra store/reloads for rounding unless you use -ffloat-store. See https://gcc.gnu.org/wiki/FloatingPointMath. That part of the standard seems to have been written assuming non-optimizing compilers, or hardware that efficiently provides rounding to the type width like non-x86, or like x87 with precision set to round to 64-bit double instead of 80-bit long double. Storing after every statement is exactly what gcc -O0 and most other compilers do, and the standard allows extra precision within evaluation of one expression.)
So when targeting x87, the compiler is allowed to evaluate the sum of three floats with two x87 FADD instructions, without rounding off the sum of the first two to a 32-bit float. In that case, the temporary has 80-bit precision... Or does it? Not always, because the C implementation's startup code (or a Direct3D library!!!) may have changed the precision setting in the x87 control word, so values in x87 registers are rounded to a 53- or 24-bit mantissa. (This makes FDIV and FSQRT run a bit faster.) All of this is from Bruce Dawson's article about intermediate FP precision.
In assembly:
With rounding mode and precision set the same, I think every x86 CPU should give bit-identical results for the same inputs, even for complex x87 instructions like FSIN.
Intel's manuals don't define exactly what those results are for every case, but I think Intel aims for bit-exact backwards compatibility. I doubt they'll ever add extended-precision range-reduction for FSIN, for example. It uses the 80-bit pi constant you get with fldpi (correctly-rounded 64-bit mantissa, actually 66-bit because the next 2 bits of the exact value are zero). Intel's documentation of the worst-case-error was off by a factor of 1.3 quintillion until they updated it after Bruce Dawson noticed how bad the worst-case actually was. But this can only be fixed with extended-precision range reduction, so it wouldn't be cheap in hardware.
I don't know if AMD implements their FSIN and other micro-coded instructions to always give bit-identical results to Intel, but I wouldn't be surprised. Some software does rely on it, I think.
Since SSE only provides instructions for add/sub/mul/div/sqrt, there's nothing too interesting to say. They implement the IEEE operation exactly, so there's no chance that any x86 implementation will ever give you anything different (unless the rounding mode is set differently, or denormals-are-zero and/or flush-to-zero are different and you have any denormals).
SSE rsqrt (fast approximate reciprocal square root) is not exactly specified, and I think it's possible you might get a different result even after a Newton iteration, but other than that SSE/SSE2 is always bit exact in asm, assuming the MXCSR isn't set weird. So the only question is getting the compiler to generate the same code, or just using the same binaries.
In real life:
So, if you statically link a libm that uses SSE/SSE2 and distribute those binaries, they will run the same everywhere. Unless that library uses run-time CPU detection to choose alternate implementations...
As @Yan Zhou points out, you pretty much need to control every bit of the implementation down to the asm to get bit-exact results.
However, some games really do depend on this for multi-player, but often with detection/correction for clients that get out of sync. Instead of sending the entire game state over the network every frame, every client computes what happens next. If the game engine is carefully implemented to be deterministic, they stay in sync.
In the Spring RTS, clients checksum their gamestate to detect desync. I haven't played it for a while, but I do remember reading something at least 5 years ago about them trying to achieve sync by making sure all their x86 builds used SSE math, even the 32-bit builds.
One possible reason for some games not allowing multi-player between PC and non-x86 console systems is that the engine gives the same results on all PCs, but different results on the different-architecture console with a different compiler.
Further reading: GAFFER ON GAMES: Floating Point Determinism. Some techniques that real game engines use to get deterministic results. e.g. wrap sin/cos/tan in non-optimized function calls to force the compiler to leave them at single-precision.
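A sketch of that wrapper trick (the attribute is GCC/Clang-specific and the function name is made up; the point is just to force a real call and a result rounded to plain float):

#include <math.h>

/* Illustrative only: keep sinf() from being inlined, constant-folded, or
   re-vectorized, so every build returns the same float-rounded value. */
__attribute__((noinline))
float det_sinf(float x)
{
    volatile float r = sinf(x);   /* volatile blocks folding/reassociation */
    return r;
}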
If the compiler and architecture are compliant with the IEEE standards, yes.
For instance, gcc is IEEE compliant if configured properly. If you use the -ffast-math flag, it will not be IEEE compliant.
See http://www.validlab.com/goldberg/paper.pdf page 25.
If you want to know exactly what accuracy you can rely on when using an IEEE 754-1985 hardware/compiler pair, you need to purchase the standard from the IEEE site; unfortunately, it is not publicly available.

Behavior of ARM Neon float-integer conversion with overflow

How is the behavior of the ARM Neon float-to-integer conversion instructions vcvt.s32.f32 and vcvt.u32.f32 defined in case of overflows? Can you rely upon the behavior that I observed on a particular processor, i.e. that the result is saturated? Any links to official documentation are highly appreciated.
The ARM Architecture Reference Manual is the source of all answers for this sort of question. In section A8.8.305 it says:
The floating-point to integer operation uses the Round towards Zero rounding mode.
And in the Glossary it clarifies:
Round towards Zero (RZ) mode
Means that results are rounded to the nearest representable number that is no greater in magnitude than the unrounded result.
(Which is the same meaning for "Round towards Zero" as in IEEE 754.)
The gory details are in the pseudocode for FPToFixed and FPUnpack.
So, in short: yes, the result is guaranteed to be saturated.
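A quick way to see that saturation (a sketch; it assumes a GCC/Clang-style toolchain where NEON vectors can be brace-initialized):

#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
    /* 3e9 and -3e9 are outside the s32 range, so the result should saturate. */
    float32x4_t in  = { 3.7f, -3.7f, 3e9f, -3e9f };
    int32x4_t   out = vcvtq_s32_f32(in);     /* rounds toward zero, saturates */

    int32_t r[4];
    vst1q_s32(r, out);
    printf("%d %d %d %d\n", r[0], r[1], r[2], r[3]);
    /* expected: 3 -3 2147483647 -2147483648 */
    return 0;
}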

Xilinx MicroBlaze Floating Point Compatibility

I have C code targeted at a MicroBlaze CPU.
When I debug the code as a C program in Eclipse + GCC or Visual Studio, I get the results I want.
Yet when I run it on the target, the results are different.
It happens only on floating point operations (Multiplication and Division).
How can I make it work with full floating point precision?
Are there special GCC flags?
P.S.
The MicroBlaze is configured with all the floating point hardware enabled.
I'm not very experienced with MicroBlaze, but the Wikipedia page states:
Also, key processor instructions which are rarely used but more expensive to implement in hardware can be selectively added/removed (i.e. multiply, divide, and floating-point ops.)
Emphasis mine.
So, make sure that your particular MicroBlaze actually has the floating point operations supported, otherwise I imagine your results will be very random.
Also make sure your compiler toolchain generates the proper instructions, sometimes toolchains for embedded development support software-emulated floating point. This should be trivial to figure out by disassembling the final code, and seeing how the floating-point operations are implemented.
The MicroBlaze floating-point hardware supports IEEE 754 with some exceptions that are listed in the MicroBlaze reference guide.
Floating-point is not 100% identical on all machines.
It depends on actual precision when executing the operations (hardware can use extended precision when executing single-precision operations), it also depends on the configuration of the rounding-mode (IEEE defines four different rounding modes).
MicroBlaze does not support denormalized floating-point numbers (they are treated as zero).
However, normal code should avoid denormalized values anyway, since they have reduced accuracy.
What kind of difference do you see?
Göran Bilski

subnormal IEEE 754 floating point numbers support on iOS ARM devices (iPhone 4)

While porting an application from Linux x86 to iOS ARM (iPhone 4), I've discovered a difference in the behavior of floating point arithmetic on small values.
64-bit floating point numbers (double) with a magnitude smaller than 2.2250738585072014E-308 are called denormal/denormalized/subnormal numbers in the IEEE 754-1985 / IEEE 754-2008 standards.
On iPhone 4, such small numbers are treated as zero (0), while on x86, subnormal numbers can be used for computation.
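A small check like the following (a sketch) makes the difference visible at run time: dividing the smallest normal double by a power of two should give a subnormal under gradual underflow, but comes back as zero when flush-to-zero is in effect.

#include <stdio.h>
#include <float.h>

int main(void)
{
    volatile double tiny = DBL_MIN;     /* smallest normal double, ~2.225e-308 */
    volatile double sub  = tiny / 4.0;  /* subnormal under gradual underflow */

    if (sub > 0.0)
        printf("gradual underflow: got %g\n", sub);
    else
        printf("flush-to-zero: the subnormal became %g\n", sub);
    return 0;
}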
I wasn't able to find any explanation regarding conformance to IEEE-754 standards on Apple's documentation Mac OS X Manual Page For float(3).
But thanks to some answers on Stack Overflow ( flush-to-zero behavior in floating-point arithmetic , Double vs float on the iPhone ), I have found some clues.
According to some searches, it seems the VFP (or NEON) math coprocessor used alongside the ARM core is using Flush-To-Zero (FTZ) mode (i.e. subnormal values are converted to 0 on output) and Denormals-Are-Zero (DAZ) mode (i.e. subnormal values are converted to 0 when used as input parameters) to provide fast hardware-handled IEEE 754 computation. The VFP offers two modes of operation:
Full IEEE 754 compliance, with ARM support code
Run-Fast mode for near IEEE 754 compliance (hardware only)
A good explanation of FTZ and DAZ can be found in x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ):
FTZ and DAZ modes both handle the cases when invalid floating-point data occurs or is processed with underflow or denormal conditions. [...]. The difference between a number that is handled by FTZ and DAZ is very subtle. FTZ handles underflow conditions while DAZ handles denormals. An underflow condition occurs when a computation results in a denormal. In this case, FTZ mode sets the output to zero. DAZ fixes the cases when denormals are used as input, either as constants or by reading invalid memory into registers. DAZ mode sets the inputs of the calculation to zero before computation. FTZ can then be said to handle [output] while DAZ handles [input].
The only thing about FTZ on Apple's developer site seems to be in the iOS ABI Function Call Guide, in the entry for the VFP status register FPSCR (listed as "Special"):
Condition code bits (28-31) and saturation bits (0-4) are not preserved by function calls. Exception control (8-12), rounding mode (22-23), and flush-to-zero (24) bits should be modified only by specific routines that affect the application state (including framework API functions). Short vector length (16-18) and stride (20-21) bits must be zero on function entry and exit. All other bits must not be modified.
According to the ARM1176JZF-S Technical Reference Manual, section 18.5 Modes of operation (this was the first iPhone's processor), the VFP can be configured to fully support IEEE 754 (subnormal arithmetic), but in that case it requires some software support (trapping into the kernel to compute in software).
Note: I have also read Debian's ARM Hard Float Port and VFP comparison pages.
My questions are :
Where can one find definitive answers regarding subnormal number handling across iOS devices?
Can one set the iOS system to provide support for subnormal numbers without asking the compiler to produce only full software floating point code?
Thanks.
Can one set the iOS system to provide support for subnormal numbers without asking the compiler to produce only full software floating point code?
Yes. This can be achieved by setting the FZ bit in the FPSCR to zero:
static inline void DisableFZ(void)
{
    /* Read FPSCR, clear bit 24 (FZ, flush-to-zero), write it back. */
    __asm__ volatile("vmrs r0, fpscr\n"
                     "bic r0, r0, #(1 << 24)\n"
                     "vmsr fpscr, r0" : : : "r0");
}
Note that this can cause significant slowdowns in application performance when appreciable quantities of denormal values are encountered. You can (and should) restore the default floating-point state before making calls into any code that does not make an ABI guarantee to work properly in non-default modes:
static inline void RestoreFZ(void)
{
    /* Set bit 24 (FZ) again to restore the default flush-to-zero state. */
    __asm__ volatile("vmrs r0, fpscr\n"
                     "orr r0, r0, #(1 << 24)\n"
                     "vmsr fpscr, r0" : : : "r0");
}
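A hedged usage sketch: enable denormal support only around the code that actually needs it, and put the default state back before calling into other frameworks:

DisableFZ();
/* ... computation that relies on gradual underflow ... */
RestoreFZ();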
Please file a bug report to request that better documentation be provided for the modes of FP operation in iOS.
