I have a quite specific question.
An ADC gives me 24-bit data points in two's complement. Usually I store them in a 32-bit int (also two's complement) by copying them in starting at the MSB of the int and then shifting them 8 bits towards the LSB, so that the leading one or zero (the sign) is preserved.
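Roughly like this (a simplified sketch of that packing, not my actual driver code; the byte parameters are just for illustration):

#include <stdint.h>

// b2 is the ADC's most significant byte of the 24-bit sample
int32_t pack24(uint8_t b2, uint8_t b1, uint8_t b0)
{
    uint32_t u = ((uint32_t)b2 << 24) | ((uint32_t)b1 << 16) | ((uint32_t)b0 << 8);
    return (int32_t)u >> 8;   // arithmetic shift keeps the leading sign bit (true for ARM compilers)
}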
Now I want to use the CMSIS-DSP library on an ARM processor to do an FFT. The FFT expects float32_t input. I've never heard of that data format and can't find any specific sources about whether it is fixed point, floating point, or something else ...
Can anyone tell me what exactly float32_t is? Additionally, any thoughts on converting the 24-bit two's-complement values into float32_t?
I'll keep investigating and will edit this post if I find anything new :-)
If someone is interested:
The ADC is the TI-ADS1299
The CMSIS-DSP library can be found here. The link goes directly to the method I want to use (arm_rfft_f32()). Since I'm only able to use an older version of the library, the method is already marked as deprecated.
Thanks & Greetings!
Often the most obvious solution also turns out to be the best. If I had to sign-extend a 24-bit number and convert it to a floating-point type, I'd start by writing something like this:
// See Dric512's answer; I happen to know my compiler's ABI implements
// 'float' with the appropriate IEEE 754 single-precision format
typedef float float32_t;
float32_t conv_func(unsigned int int24) {
return (int)(int24 << 8) >> 8;
}
Since you mention both CMSIS and critical timing, I'm going to safely assume your micro has a Cortex-M4 (or possibly Cortex-M7) with a hardware FPU - the words "performance" and "software floating-point FFT" go together pretty laughably - and that since it's the 21st century you're using a half-decent optimising compiler, so I compiled the above thusly:
$ arm-none-eabi-gcc -c -Os -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mthumb float.c
and got this out of it (comments added for clarity):
0: f340 0017 sbfx r0, r0, #0, #24 # sign-extend 24-bit value from argument
4: ee07 0a90 vmov s15, r0 # move 32-bit result to FPU register
8: eeb8 0ae7 vcvt.f32.s32 s0, s15 # convert signed int to 32-bit float
c: 4770 bx lr # return (with final result in FPU)
Well, that looks like optimal code already - there's no way any manual bit-twiddling is gonna beat a mere 2 single-cycle instructions. Job done!
And if you do happen to be stuck without an FPU, then the fundamental point of the answer remains unchanged - let the compiler/library do the dirty work, because the soft-fp library's conversion implementation will be:
Reliably correct.
Pretty well optimised.
Entirely lost in the noise compared to the overhead of the calculations themselves.
float32_t is simply the standard IEEE 754 32-bit (single-precision) floating-point format, which (like float64_t) is what the hardware floating-point unit on several ARM CPUs operates on natively.
There is 1 sign bit (bit 31), 8 bits of exponent, and 23 bits of mantissa:
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
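As a quick illustration (my sketch, not part of the original answer), you can pull those three fields out of a float32_t value in C like this:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

typedef float float32_t;   // assumes 'float' is IEEE 754 binary32 on this ABI

int main(void)
{
    float32_t x = -6.25f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);              // grab the raw bit pattern
    uint32_t sign     = bits >> 31;              // bit 31
    uint32_t exponent = (bits >> 23) & 0xFF;     // bits 30..23, biased by 127
    uint32_t mantissa = bits & 0x7FFFFF;         // bits 22..0 (fraction)
    printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    return 0;   // prints: sign=1 exponent=129 (unbiased 2) mantissa=0x480000
}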
If your CPU has a hardware floating-point unit, you can directly use its instructions to convert a 32-bit integer to a 32-bit float (the VCVT instruction).
Related
The Intel intrinsics guide states that the sqrtsd instruction has a latency of 18 cycles.
I tested it with my own program, and it is correct if, for example, we take 0.15 as input. But when we take 256 (or any 2^x number), the latency is only 13. Why is that?
One theory I had is that since 13 is the latency of sqrtss, which is the same as sqrtsd but for 32-bit floats, maybe the processor was smart enough to understand that 256 fits in 32 bits and hence used that version, while 0.15 needs the full 64 bits since it isn't representable in a finite way.
I am doing it using inline assembly; here is the relevant part, compiled with gcc -O3 and -fno-tree-vectorize.
static double sqrtsd (double x) {
double r;
__asm__ ("sqrtsd %1, %0" : "=x" (r) : "x" (x));
return r;
}
SQRT* and DIV* are the only two "simple" ALU instructions (single uop, not microcoded branching / looping) that have data-dependent throughput or latency on modern Intel/AMD CPUs. (Not counting microcode assists for denormal aka subnormal FP values in add/multiply/fma). Everything else is pretty much fixed so the out-of-order uop scheduling machinery doesn't need to wait for confirmation that a result was ready some cycle, it just knows it will be.
As usual, Intel's intrinsics guide gives an over-simplified picture of performance. The actual latency isn't a fixed 18 cycles for double-precision on Skylake. (Based on the numbers you chose to quote, I assume you have a Skylake.)
div/sqrt are hard to implement; even in hardware the best we can do is an iterative refinement process. Refining more bits at once (radix-1024 divider since Broadwell) speeds it up (see this Q&A about the hardware). But it's still slow enough that an early-out is used to speed up simple cases (Or maybe the speedup mechanism is just skipping a setup step for all-zero mantissas on modern CPUs with partially-pipelined div/sqrt units. Older CPUs had throughput=latency for FP div/sqrt; that execution unit is harder to pipeline.)
https://www.uops.info/html-instr/VSQRTSD_XMM_XMM_XMM.html shows Skylake SQRTSD can vary from 13 to 19 cycle latency. The SKL (client) numbers only show 13 cycle latency, but we can see from the detailed SKL vsqrtsd page that they only tested with input = 0. SKX (server) numbers show 13-19 cycle latency. (This page has the detailed breakdown of the test code they used, including the binary bit-patterns for the tests.) Similar testing (with only 0 for client cores) was done on the non-VEX sqrtsd xmm, xmm page. :/
InstLatx64 results show best / worst case latencies of 13 to 18 cycles on Skylake-X (which uses the same core as Skylake-client, but with AVX512 enabled).
Agner Fog's instruction tables show 15-16 cycle latency on Skylake. (Agner does normally test with a range of different input values.) His tests are less automated and sometimes don't exactly match other results.
What makes some cases fast?
Note that most ISAs (including x86) use binary floating point:
the bits represent values as a linear significand (aka mantissa) times 2^exp, plus a sign bit.
It seems that there may be only 2 speeds on modern Intel (since Haswell at least); see the discussion with @harold in the comments. For example, even powers of 2 are all fast, like 0.25, 1, 4, and 16. These have a trivial mantissa of 0x0, representing 1.0. https://www.h-schmidt.net/FloatConverter/IEEE754.html has a nice interactive decimal <-> bit-pattern converter for single-precision, with checkboxes for the set bits and annotations of what the mantissa and exponent represent.
On Skylake the only fast cases I've found in a quick check are even powers of 2 like 4.0 but not 2.0. These numbers have an exact sqrt result with both input and output having a 1.0 mantissa (only the implicit 1 bit set). 9.0 is not fast, even though it's exactly representable and so is the 3.0 result. 3.0 has mantissa = 1.5 with just the most significant bit of the mantissa set in the binary representation. 9.0's mantissa is 1.125 (0b00100...). So the non-zero bits are very close to the top, but apparently that's enough to disqualify it.
(+-Inf and NaN are fast, too. So are ordinary negative numbers: result = -NaN. I measure 13 cycle latency for these on i7-6700k, same as for 4.0. vs. 18 cycle latency for the slow case.)
x = sqrt(x) is definitely fast with x = 1.0 (all-zero mantissa except for the implicit leading 1 bit). It has a simple input and simple output.
With 2.0 the input is also simple (all-zero mantissa and exponent 1 higher) but the output is not a round number. sqrt(2) is irrational and thus has infinite non-zero bits in any base. This apparently makes it slow on Skylake.
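If you want to see the bit patterns behind that, here is a small sketch (mine, not part of the answer) that dumps the exponent and mantissa fields of the inputs discussed:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

static void dump(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    printf("%-4g exponent=0x%03llx mantissa=0x%013llx\n", d,
           (unsigned long long)((bits >> 52) & 0x7FF),
           (unsigned long long)(bits & 0xFFFFFFFFFFFFFULL));
}

int main(void)
{
    dump(4.0);   // mantissa = 0, sqrt is exact with a 1.0 mantissa -> fast
    dump(2.0);   // mantissa = 0, but sqrt(2) is irrational -> slow
    dump(9.0);   // mantissa = 0x2000000000000 (1.125) -> slow
    return 0;
}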
Agner Fog's instruction tables say that AMD K10's integer div instruction performance depends on the number of significant bits in the dividend (input), not the quotient, but searching Agner's microarch pdf and instruction tables didn't find any footnotes or info about how sqrt specifically is data-dependent.
On older CPUs with even slower FP sqrt, there might be more room for a range of speeds. I think number of significant bits in the mantissa of the input will probably be relevant. Fewer significant bits (more trailing zeros in the significand) makes it faster, if this is correct. But again, on Haswell/Skylake the only fast cases seem to be even powers of 2.
You can test this with something that couples the output back to the input without breaking the data dependency, e.g. andps xmm0, xmm1 / orps xmm0, xmm2 to set a fixed value in xmm0 that's dependent on the sqrtsd output.
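Here is roughly what such a latency test can look like with intrinsics (my sketch, not the answer's code). The AND/OR pair couples each sqrt result back into the next input while forcing the value back to the chosen constant, so the chain is sqrt -> and -> or -> sqrt; the measurement therefore includes a couple of extra cycles for the AND/OR, and __rdtsc counts reference cycles, so treat the numbers as relative (fast vs. slow input) rather than exact:

#include <x86intrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const long iters = 100000000;
    volatile double zero_src = 0.0;           // hide the zero from the optimizer
    __m128d mask = _mm_set1_pd(zero_src);     // all-zero mask, but not a compile-time constant
    __m128d seed = _mm_set_sd(4.0);           // try 4.0 (fast) vs. 2.0 or 0.15 (slow)
    __m128d v = seed;

    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; i++) {
        __m128d r = _mm_sqrt_sd(v, v);                 // sqrtsd
        v = _mm_or_pd(_mm_and_pd(r, mask), seed);      // depends on r, value is back to seed
    }
    uint64_t t1 = __rdtsc();

    printf("%.2f ref cycles per iteration (final value %g)\n",
           (double)(t1 - t0) / iters, _mm_cvtsd_f64(v));
    return 0;
}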
Or a simpler way to test latency is to take "advantage" of the false output dependency of sqrtsd xmm0, xmm1 - it and sqrtss leave the upper 64 / 32 bits (respectively) of the destination unmodified, thus the output register is also an input for that merging. I assume this is how your naive inline-asm attempt ended up bottlenecking on latency instead of throughput with the compiler picking a different register for the output so it could just re-read the same input in a loop. The inline asm you added to your question is totally broken and won't even compile, but perhaps your real code used "x" (xmm register) input and output constraints instead of "i" (immediate)?
This NASM source for a static executable test loop (to run under perf stat) uses that false dependency with the non-VEX encoding of sqrtsd.
This ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves. Leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency). AVX finally lets us avoid this with vsqrtsd dst, src,src at least for register sources, and similarly vcvtsi2sd dst, cold_reg, eax for the similarly near-sightedly designed scalar int->fp conversion instructions. (GCC missed-optimization reports: 80586, 89071, 80571.)
On many earlier CPUs even throughput was variable, but Skylake beefed up the dividers enough that the scheduler always knows it can start a new div/sqrt uop 3 cycles after the last single-precision input.
Even Skylake double-precision throughput is variable, though: 4 to 6 cycles after the last double-precision input uop, if Agner Fog's instruction tables are right.
https://uops.info/ shows a flat 6c reciprocal throughput. (Or twice that long for 256-bit vectors; 128-bit and scalar can use separate halves of the wide SIMD dividers for more throughput but the same latency.) See also Floating point division vs floating point multiplication for some throughput/latency numbers extracted from Agner Fog's instruction tables.
As I understand it, floating-point values are stored in XMM registers and not in general-purpose registers such as eax, so I did an experiment:
float a = 5;
in this case, a is stored as 1084227584 in the XMM register.
Here is an assembly version:
.text
.global _start
.LCO:
.long 1084227584
_start:
mov .LCO, %eax
movss .LCO, %xmm0
Executing the above assembly and debugging it using gdb shows that the value in eax will be 1084227584, while the value in ymm0 is 5.
Here are my questions:
1- What's so special about the XMM registers? Besides the SIMD instructions, are they the only registers that can store floating-point values?
Why can't I set the same bits in a regular register?
2- Are float and double values always stored as floating point?
Can we never store them as a fixed point in C or assembly?
however the value in ymm0 is 5.
The bit-pattern in ymm0 is 1084227584. The float interpretation of that number is 5.0.
But you can print /x $xmm0.v4_int32 to see a hex representation of the bits in xmm0.
What's so special about the XMM registers? beside the SIMD instructions, are they the only type of registers to store floating points?
No, in asm everything is just bytes.
Some compilers will use an integer register to copy a float or double from one memory location to another, if not doing any computation on it. (Integer instructions are often smaller.) e.g. clang will do this: https://godbolt.org/z/76EWMY
void copy(float *d, float *s) { *d = *s; }
# clang8.0 -O3 targeting x86-64 System V
copy: # #copy
mov eax, dword ptr [rsi]
mov dword ptr [rdi], eax
ret
XMM/YMM/ZMM registers are special because they're the only registers that FP ALU instructions exist for (ignoring x87, which is only used for 80-bit long double in x86-64).
addsd xmm0, xmm1 (add scalar double) has no equivalent for integer registers.
Usually FP and integer data don't mingle very much, so providing a whole separate set of architectural registers allows more space for more data to be in registers. (Given the same instruction-encoding constraints, it's a choice between 16 FP + 16 GP integer vs. 16 unified registers, not vs. 32 unified registers).
Plus, a major microarchitectural benefit of a separate register file is that it can be physically close to the FP ALUs, while the integer register file can be physically close to the integer ALUs. For more, see Is there any architecture that uses the same register space for scalar integer and floating point operations?
are float and double values always stored as a floating point? can we never store them as a fixed point in C or assembly?
x86 compilers use float = IEEE754 binary32 https://en.wikipedia.org/wiki/Single-precision_floating-point_format. (And double = IEEE754 binary64). This is specified as part of the ABI.
Internally the as-if rule allows the compiler to do whatever it wants, as long as the final result is identical. (Or with -ffast-math, to pretend that FP math is associative, and assume NaN/Inf aren't possible.)
Compilers can't just randomly choose a different object representation for some float that other separately-compiled functions might look at.
There might be rare cases for locals that are never visible to other functions where a "human compiler" (hand-writing asm to implement C) could prove that fixed-point was safe. Or more likely, that the float values were exact integers small enough that double wouldn't round them, so your fixed-point could degenerate to integer (except maybe for a final step).
But it would be rare to know this much about possible values without just being able to do constant propagation and optimize everything away. That's why I say a human would have to be involved, to prove things the compiler wouldn't know to look for.
I think in theory you could have a C implementation that did use a fixed-point float or double. ISO C places very few restrictions on what float and double actually are.
But float.h constants like FLT_RADIX and DBL_MAX_EXP have interactions that might not make sense for a fixed-point format, which has a constant distance between representable values, instead of them being much closer together near 0 and much farther apart for large numbers. (The rounding error of 0.5 ulp is relative to the magnitude, instead of absolute.)
Still, most programs don't actually do things that would break if the "mantissa" and exponent limits didn't correspond to what you'd expect for DBL_MIN and DBL_MAX.
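For concreteness, a trivial snippet (mine) that prints the <float.h> parameters in question, with the values you would expect for IEEE 754 binary64:

#include <float.h>
#include <stdio.h>

int main(void)
{
    printf("FLT_RADIX   = %d\n", FLT_RADIX);     // 2 on IEEE 754 targets
    printf("DBL_MAX_EXP = %d\n", DBL_MAX_EXP);   // 1024 for binary64
    printf("DBL_MIN     = %g\n", DBL_MIN);       // ~2.2250738585072014e-308
    printf("DBL_MAX     = %g\n", DBL_MAX);       // ~1.7976931348623157e+308
    return 0;
}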
Another interesting possibility is to make float and double based on the Posit format (similar to traditional floating-point, but with a variable-length exponent encoding. https://www.johndcook.com/blog/2018/04/11/anatomy-of-a-posit-number/ https://posithub.org/index).
Modern hardware, especially Intel CPUs, has very good support for IEEE float/double, so fixed-point is often not a win. There are some nice SIMD instructions for 16-bit fixed-point, though, like high-half-only multiply, and even pmulhrsw which does fixed-point rounding.
But general 32-bit integer multiply has worse throughput than packed-float multiply. (Because the SIMD ALUs optimized for float/double only need 24x24-bit significand multipliers per 32 bits of vector element. Modern Intel CPUs run integer multiply and shift on the FMA execution units, with 2 uops per clock throughput.)
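As an aside, here is a tiny sketch (my example, not from the answer) of that pmulhrsw fixed-point rounding multiply via its SSSE3 intrinsic; 16384 and 8192 are 0.5 and 0.25 in Q15, and the rounded product comes out as 4096 = 0.125:

// build with e.g. gcc -O2 -mssse3
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set1_epi16(16384);           // 0.5 in Q15
    __m128i b = _mm_set1_epi16(8192);            // 0.25 in Q15
    __m128i p = _mm_mulhrs_epi16(a, b);          // (a*b + 0x4000) >> 15 per 16-bit lane
    short out[8];
    _mm_storeu_si128((__m128i *)out, p);
    printf("%d (Q15) = %f\n", out[0], out[0] / 32768.0);   // 4096 = 0.125
    return 0;
}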
are they the only type of registers to store floating points?
No. There are the 80-bit floating-point registers st(0)-st(7) in the 8087-compatible FPU, which should still be present in most modern CPUs.
Most 32-bit programs use these registers.
Can we store a floating point in a regular [integer] register?
Yes. 30 years ago many PCs contained a CPU without an 80x87 FPU, so there were no st(0)-st(7) registers. CPUs with XMM registers came even later.
We find a similar situation in mobile devices today.
What's so special about the XMM registers?
Using the 80x87 FPU seems to be more complicated than using XMM registers. Furthermore, I'm not sure if using the 80x87 is allowed in 64-bit programs in every operating system.
If you store a floating-point value in an integer register (such as eax), you don't have any instructions to perform arithmetic on it: on x86 CPUs, there is no instruction for multiplying or adding floating-point values held in integer registers.
In the case of CPUs without FPU, you have to do floating-point emulation. This means you have to perform one floating-point operation by doing multiple integer operations - just like you would do it with paper and pencil.
However, if you only want to store a floating-point value, you can of course also use an integer register. The same is true for copying a value or checking if two values are equal and similar operations.
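As an illustration (my sketch, not part of the original answer), this is how you can inspect or compare a float's bit pattern purely with integer operations in C; it reproduces the 1084227584 from the question:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    float f = 5.0f, g = 5.0f;
    uint32_t fbits, gbits;
    memcpy(&fbits, &f, sizeof fbits);   // type-pun via memcpy (well defined)
    memcpy(&gbits, &g, sizeof gbits);
    printf("bits of 5.0f: %u (0x%08X)\n", fbits, fbits);   // 1084227584, 0x40A00000
    // bitwise equality, no FPU involved (note: not identical to ==,
    // e.g. +0.0 vs -0.0 differ bitwise, and NaN != NaN with ==)
    printf("bitwise equal: %d\n", fbits == gbits);
    return 0;
}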
Can we never store them as a fixed point in C or assembly?
Fixed point is used a lot when using CPUs that do not have an FPU.
For example with 8- or 16-bit CPUs, which are still used in the automotive industry, in consumer devices, and in PC peripherals.
However, I doubt that there are C compilers that automatically translate the keyword "float" to fixed point.
NEON can do SIMD operations on 32-bit floats, but it does not do SIMD operations on 64-bit floats (doubles).
VFP is not SIMD; it can do 32-bit or 64-bit floating-point operations on only one element at a time.
Does ARM support SIMD operations for 64 bit floating point numbers?
This is only possible on processors supporting ARMv8, and only when running the AArch64 instruction set; it is not possible in the AArch32 instruction set.
However, most processors support 32-bit and 64-bit scalar floating-point operations (i.e. they have a floating-point unit).
ARMv8
In ARMv8, it is possible:
fadd v2.2d, v0.2d, v1.2d
Minimal runnable example with an assert and QEMU user setup.
The analogous ARMv7 instruction does not work:
vadd.f64 q2, q0, q1
The assembler fails with:
bad type in Neon instruction -- `vadd.f64 q2,q0,q1'
Minimal runnable 32-bit float v7 code for comparison.
Manual
https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf A1.5 "Advanced SIMD and floating-point support" says:
The SIMD instructions provide packed Single Instruction Multiple Data (SIMD) and single-element scalar operations, and support:
Single-precision and double-precision arithmetic in AArch64 state.
For ARMv7, F6.1.27 "VADD (floating-point)" says:
<dt> Is the data type for the elements of the vectors, encoded in the "sz" field. It can have the following values:
F32 when sz = 0
F16 when sz = 1
but there is no F64, which suggests that it is not possible.
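For completeness, a small C sketch (mine, not from the answer) doing the same 2 x double addition through NEON intrinsics; it only builds for an AArch64 target, where float64x2_t exists:

#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
    float64x2_t a = vdupq_n_f64(1.5);
    float64x2_t b = vdupq_n_f64(2.25);
    float64x2_t c = vaddq_f64(a, b);   // compiles to fadd v.2d on AArch64
    printf("%f %f\n", vgetq_lane_f64(c, 0), vgetq_lane_f64(c, 1));
    return 0;
}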
I am somewhat confused by the MOVSD assembly instruction. I wrote some numerical code computing a matrix multiplication, using ordinary C code with no SSE intrinsics. I do not even include the header file for SSE2 intrinsics in the compilation. But when I check the assembler output, I see that:
1) 128-bit vector registers XMM are used;
2) SSE2 instruction MOVSD is invoked.
I understand that MOVSD essentially operates on a single double-precision floating-point value. It only uses the lower 64 bits of an XMM register and sets the upper 64 bits to 0. But I just don't understand two things:
1) I never gave the compiler any hint to use SSE2. Plus, I am using GCC, not the Intel compiler. As far as I know, the Intel compiler will automatically seek opportunities for vectorization, but GCC will not. So how does GCC know to use MOVSD? Or has this x86 instruction been around long before the SSE instruction set, and the _mm_load_sd() intrinsic in SSE2 just provides backward compatibility for using XMM registers for scalar computation?
2) Why doesn't the compiler use other floating-point registers, either the 80-bit floating-point stack or 64-bit floating-point registers? Why must it use an XMM register (setting the upper 64 bits to 0 and essentially wasting that storage)? Do XMM registers provide faster access?
By the way, I have another question regarding SSE2. I just can't see the difference between _mm_store_sd() and _mm_storel_sd(). Both store the lower 64-bit value to an address. What is the difference? A performance difference? An alignment difference?
Thank you.
Update 1:
Okay, obviously when I first asked this question I lacked some basic knowledge of how a CPU handles floating-point operations, so experts tended to think my question was nonsense. Since I did not include even the shortest sample C code, people might have found the question vague as well. Here I provide a review as an answer, which will hopefully be useful to anyone unclear about floating-point operations on modern CPUs.
A review of floating point scalar/vector processing on modern CPUs
The idea of vector processing dates back to the old vector processors, but those have been superseded by modern architectures with cache systems. So we focus on modern CPUs, especially x86 and x86-64. These architectures are the mainstream in high-performance scientific computing.
Starting with the 8087 coprocessor (the 387 in the i386 era), Intel provided a floating-point stack where floating-point numbers up to 80 bits wide can be held. This stack is commonly known as the x87 or 387 floating-point "registers", with a set of x87 FPU instructions. The x87 stack registers are not real, directly addressable registers like the general-purpose registers, as they sit on a stack. Access to register st(i) is by offsetting from the stack-top register %st(0), or simply %st. With the help of the FXCH instruction, which swaps the contents of the current stack top %st and some offset register %st(i), random access can be achieved, though FXCH can impose some performance penalty, even if minimized. The x87 stack provides high-precision computation by calculating intermediate results with 80 bits of precision by default, to minimise round-off error in numerically unstable algorithms. However, x87 instructions are completely scalar.
The first effort at vectorization was the MMX instruction set, which implemented integer vector operations. The vector registers under MMX are the 64-bit wide registers MMX0, MMX1, ..., MMX7. Each can hold either one 64-bit integer or multiple smaller integers in a "packed" format: a single instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once. So there are the legacy general-purpose registers for scalar integer operations, and the new MMX registers for integer vector operations, with no shared execution resources. But MMX shares execution resources with the scalar x87 FPU: each MMX register corresponds to the lower 64 bits of an x87 register, and the upper 16 bits of the x87 registers are unused. The MMX registers are each directly addressable, but the aliasing made it difficult to work with floating-point and integer vector operations in the same application. To maximize performance, programmers often used the processor exclusively in one mode or the other, deferring the relatively slow switch between them as long as possible.
Later, SSE created a separate set of 128-bit wide registers, XMM0-XMM7, alongside the x87 stack. SSE instructions focused exclusively on single-precision floating point (32-bit); integer vector operations were still performed using the MMX registers and the MMX instruction set. But now both kinds of operations can proceed at the same time, as they share no execution resources. It is important to know that SSE does not only do floating-point vector operations, but also floating-point scalar operations. Essentially it provides a new place where floating-point operations take place, and the x87 stack is no longer the preferred choice for them. Using XMM registers for scalar floating-point operations is faster than using the x87 stack, as all XMM registers are easy to access, while the x87 stack can't be randomly accessed without FXCH. When I posted my question, I was clearly unaware of this fact. The other concept I was not clear about is that general-purpose registers are integer/address registers. Even though they are 64 bits wide on x86-64, they are of no use for 64-bit floating-point arithmetic: the execution unit associated with general-purpose registers is the ALU (arithmetic & logic unit), which does not do floating-point computation.
SSE2 was a major step forward, as it extended the vector data types, so SSE2 instructions, whether scalar or vector, can work with all C standard data types. This extension in fact made MMX obsolete. Also, the x87 stack is no longer as important as it once was. Since there are two alternative places where floating-point operations can take place, you can tell the compiler which one to use. For example with GCC, compilation with the flag
-mfpmath=387
will schedule floating-point operations on the legacy x87 stack. Note that this seems to be the default for 32-bit x86, even if SSE is available. For example, I have an Intel Core 2 Duo laptop made in 2007 that already supports SSE up to SSE4, yet GCC still uses the x87 stack by default, which makes scientific computations unnecessarily slow. In this case, we need to compile with the flag
-mfpmath=sse
and GCC will schedule floating-point operations on XMM registers. x86-64 users need not worry about this configuration, as it is the default on x86-64. This flag only affects scalar floating-point operations. If we have written code using vector instructions and compile the code with the flag
-msse2
then XMM registers will be the only place where computation takes place. In other words, this flag turns on -mfpmath=sse. For more information, see GCC's configuration of x86, x86-64. For examples of writing SSE2 C code, see my other post How to ask GCC to completely unroll this loop (i.e., peel this loop)?
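As a quick illustration (my example, not from the original post), you can compile a one-line function both ways and compare the generated assembly:

/* gcc -m32 -S -O2 -mfpmath=387 scale.c          -> multiplies with fmuls (x87)
   gcc -m32 -S -O2 -msse2 -mfpmath=sse scale.c   -> multiplies with mulss (SSE) */
float scale(float x, float y)
{
    return x * y;
}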
The SSE instruction set, though very useful, is not the latest vector extension. AVX, the advanced vector extensions, enhances SSE by providing 3-operand and 4-operand instructions. See "number of operands in instruction set" if you are unclear what this means. 3-operand instructions optimize the commonly seen fused multiply-add (FMA) operation in scientific computing by 1) using one fewer register; 2) reducing the explicit amount of data movement between registers; and 3) speeding up FMA computations themselves. For an example of using AVX, see @Nominal Animal's answer to my post.
While porting an application from Linux x86 to iOS ARM (iPhone 4), I discovered a difference in behavior of floating-point arithmetic on small values.
64-bit floating-point numbers (double) smaller than [+/-]2.2250738585072014E-308 are called denormal/denormalized/subnormal numbers in the IEEE 754-1985/IEEE 754-2008 standards.
On iPhone 4, such small numbers are treated as zero (0), while on x86, subnormal numbers can be used for computation.
I wasn't able to find any explanation regarding conformance to the IEEE 754 standard in Apple's documentation (Mac OS X Manual Page For float(3)).
But thanks to some answers on Stack Overflow (flush-to-zero behavior in floating-point arithmetic, Double vs float on the iPhone), I have found some clues.
According to some searches, it seems the VFP (or NEON) math coprocessor used alongside the ARM core uses Flush-To-Zero (FTZ) mode (i.e. subnormal values are converted to 0 on output) and Denormals-Are-Zero (DAZ) mode (i.e. subnormal values are converted to 0 when used as input) to provide fast, hardware-handled IEEE 754 computation. The ARM documentation describes two modes of operation for the VFP:
Full IEEE 754 compliance, with ARM support code
RunFast mode for near IEEE 754 compliance (hardware only)
A good explanation on FTZ and DAZ can be found in
x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ):
FTZ and DAZ modes both handle the cases when invalid floating-point data occurs or is processed with underflow or denormal conditions. [...]. The difference between a number that is handled by FTZ and DAZ is very subtle. FTZ handles underflow conditions while DAZ handles denormals. An underflow condition occurs when a computation results in a denormal. In this case, FTZ mode sets the output to zero. DAZ fixes the cases when denormals are used as input, either as constants or by reading invalid memory into registers. DAZ mode sets the inputs of the calculation to zero before computation. FTZ can then be said to handle [output] while DAZ handles [input].
The only material about FTZ on Apple's developer site seems to be in the iOS ABI Function Call Guide:
VFP status register, FPSCR (preservation: Special): Condition code bits (28-31) and saturation bits (0-4) are not preserved by function calls. Exception control (8-12), rounding mode (22-23), and flush-to-zero (24) bits should be modified only by specific routines that affect the application state (including framework API functions). Short vector length (16-18) and stride (20-21) bits must be zero on function entry and exit. All other bits must not be modified.
According to the ARM1176JZF-S Technical Reference Manual, section 18.5 "Modes of operation" (the first iPhone's processor), the VFP can be configured to fully support IEEE 754 (subnormal arithmetic), but in that case it requires some software support (trapping into the kernel to compute in software).
Note: I have also read Debian's ARM Hard Float Port and VFP comparison pages.
My questions are:
Where can one find definitive answers regarding subnormal number handling across iOS devices?
Can one set the iOS system to support subnormal numbers without asking the compiler to produce only full software floating-point code?
Thanks.
Can one set the iOS system to support subnormal numbers without asking the compiler to produce only full software floating-point code?
Yes. This can be achieved by setting the FZ bit in the FPSCR to zero:
static inline void DisableFZ( )
{
    // Clear bit 24 (FZ, flush-to-zero) of the FPSCR
    __asm__ volatile("vmrs r0, fpscr\n"             // read FPSCR into r0
                     "bic r0, $(1 << 24)\n"         // clear the FZ bit
                     "vmsr fpscr, r0" : : : "r0");  // write it back
}
Note that this can cause significant slowdowns in application performance when appreciable quantities of denormal values are encountered. You can (and should) restore the default floating-point state before making calls into any code that does not make an ABI guarantee to work properly in non-default modes:
static inline void RestoreFZ( ) {
    // Set bit 24 (FZ) again to restore the default flush-to-zero mode
    __asm__ volatile("vmrs r0, fpscr\n"             // read FPSCR into r0
                     "orr r0, $(1 << 24)\n"         // set the FZ bit
                     "vmsr fpscr, r0" : : : "r0");  // write it back
}
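A hypothetical usage sketch (mine; it assumes the DisableFZ()/RestoreFZ() helpers above are in scope and an ARM VFP target):

#include <stdio.h>

int main(void)
{
    volatile double smallest_normal = 2.2250738585072014e-308;

    DisableFZ();
    volatile double kept = smallest_normal / 4.0;     // subnormal result is preserved
    RestoreFZ();
    volatile double flushed = smallest_normal / 4.0;  // flushed to 0 in the default FTZ mode

    printf("FZ off: %g\nFZ on:  %g\n", kept, flushed);
    return 0;
}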
Please file a bug report to request that better documentation be provided for the modes of FP operation in iOS.