Divide by zero exception on M0+ devices - arm

Is there a Divide by zero exception on M0+ devices?
I know Cortex M3 and M4 devices have this.

Cortex-M0+ implements ARMv6-M, which does not include a divide instruction, so there is no hardware exception for that. Since division is necessarily a software operation, it is up to the software implementation to trap divide-by-zero. The behaviour is therefore down to your compiler's runtime library; in C and C++ at least, division by zero is undefined behaviour.
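On ARM EABI targets, the compiler's software-division helpers (__aeabi_idiv, __aeabi_uidiv, ...) call __aeabi_idiv0() when the divisor is zero, and you can supply your own definition of that hook to get trap-like behaviour. A minimal sketch, assuming a GCC/Clang toolchain (the breakpoint action is just an illustration):

    /* Called by the AEABI division helpers on a zero divisor.
       Overriding it provides a software "divide-by-zero exception"
       on Cortex-M0+. */
    int __aeabi_idiv0(int return_value)
    {
        (void)return_value;
        __asm__ volatile ("bkpt #0");  /* halt under a debugger, or jump to a fault handler */
        for (;;) { }                   /* do not resume the division */
    }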

Related

How to detect FPU in Cortex M?

Cortex-M processors implement the CPUID register, through which it is possible to detect information about the core: part number (e.g. Cortex-M7 or Cortex-M4), revision and patch level (e.g. r1p2), etc.
Is there a register or a way to detect whether the FPU has been implemented by the implementer? And how to detect the type of FPU (VFPv4, VFPv5-SP or VFPv5-DP)?
In the Cortex-M Architecture Reference Manual,
B3.2.20 Coprocessor Access Control Register, CPACR
The CPACR characteristics are:
Purpose: Specifies the access privileges for coprocessors
Usage constraints: If a coprocessor is not implemented, a write of 0b01 or 0b11 to the corresponding CPACR field reads back as 0b00.
Configurations: Always implemented
The VFP will have implemented CP10 and CP11 (decimal). If there is no VFP, then they should read back as 0b00. This applies to the majority of Cortex-M CPUs. As a vendor can implement their own IP, it is possible that some CPU/SoC might not work as documented. It would be prudent to trap/handle the undefined instruction exception which will be taken if a co-processor is not present.
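A minimal sketch of that probe in C, assuming the architectural CPACR address 0xE000ED88 (fpu_present is a hypothetical helper name):

    #include <stdint.h>

    #define CPACR (*(volatile uint32_t *)0xE000ED88u)  /* SCB->CPACR */

    /* Attempt to grant full access to CP10/CP11; on a core without
       an FPU the fields read back as 0b00. */
    static int fpu_present(void)
    {
        uint32_t saved = CPACR;
        CPACR = saved | (0xFu << 20);           /* CP10 = CP11 = 0b11 (full access) */
        __asm__ volatile ("dsb\n\tisb");        /* let the write take effect */
        int present = ((CPACR >> 20) & 0xFu) == 0xFu;
        CPACR = saved;                          /* restore previous access rights */
        return present;
    }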

How are traps generated for floating point exceptions?

I want to know which code and files in the glibc library are responsible for generating traps for floating point exceptions when traps are enabled.
Currently, GCC for RISC-V does not trap floating point exceptions. I am interested in adding this feature. So, I was looking at how this functionality is implemented in GCC for x86.
I am aware that we can trap signals as described in this question (Trapping floating-point overflow in C), but I want to know more details about how it works.
I went through the files in glibc/math which, as far as I can tell, are in some way responsible for generating traps, such as:
fenv.h
feenablxcpt.c
fegetexcept.c
feupdateenv.c
and many other files starting with fe.
All these files are also present in glibc for RISC-V. I am not able to figure out how glibc for x86 is able to generate traps.
These traps are usually generated by the hardware itself, at the instruction set architecture (ISA) level; that is the case on x86-64 in particular.
I want to know which code and files in the glibc library are responsible for generating traps for floating point exceptions when traps are enabled.
So there is no such file. However, the operating system kernel (notably via signal(7) on Linux) translates those hardware traps into something else.
Please read Operating Systems: Three Easy Pieces for more, and study the x86-64 instruction set in detail.
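For instance, on x86-64 glibc's feenableexcept() extension merely clears the corresponding mask bits in the x87 control word and in MXCSR; the trap is then raised in hardware by the next offending instruction and delivered by the kernel as SIGFPE. A minimal sketch (link with -lm):

    #define _GNU_SOURCE            /* feenableexcept() is a GNU extension */
    #include <fenv.h>
    #include <stdio.h>

    int main(void)
    {
        /* Unmask the divide-by-zero exception. glibc does not
           "generate" the trap; the hardware does, at the next
           offending floating-point instruction. */
        feenableexcept(FE_DIVBYZERO);

        volatile double one = 1.0, zero = 0.0;
        printf("%f\n", one / zero);   /* trap -> kernel -> SIGFPE */
        return 0;
    }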
A more familiar example is integer division by zero. On most hardware, that produces a machine trap (or machine exception), handled by the kernel. On some hardware (IIRC, PowerPC), it gives -1 as a result and sets some bit in a status register. Further machine code could test that bit. I believe that the GCC compiler would, in some cases and with some optimizations disabled, generate such a test after every division, but it is not required to do so.
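A sketch of what that looks like on x86-64 Linux, where the hardware #DE exception surfaces as SIGFPE (the division itself is undefined behaviour in C, so this is illustrative only):

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void on_fpe(int sig)
    {
        (void)sig;
        /* write() and _exit() are async-signal-safe; returning from
           the handler would just re-execute the faulting division. */
        write(STDERR_FILENO, "caught SIGFPE\n", 14);
        _exit(1);
    }

    int main(void)
    {
        signal(SIGFPE, on_fpe);
        volatile int zero = 0;
        printf("%d\n", 1 / zero);     /* #DE trap -> kernel -> SIGFPE */
        return 0;
    }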
The C language (read n1570, which is practically the C11 standard) defines the notion of undefined behavior to handle such situations as quickly and simply as possible. Read Lattner's What Every C Programmer Should Know About Undefined Behavior blog posts.
Since you mention RISC-V, read about the RISC philosophy of the previous century, and be aware that designing out-of-order and super-scalar processors requires a lot of engineering effort. My guess is that if you invested as much R&D (that is, tens of billions of US$ or €) into a RISC-V chip as Intel, or to a lesser extent AMD, did on x86-64, you could get performance comparable to current x86-64 processors. Notice that SPARC and PowerPC (and perhaps ARM) chips are RISC-like, and their best processors are nearly comparable in performance to Intel chips, while probably having received ten times less R&D investment than what Intel put into its microprocessors.

Why is a vdiv instruction generated with NEON flags?

I disassembled an arm binary previously compiled with neon flags:
-mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -ftree-vectorize
The dump shows a vdiv.f64 instruction generated by the compiler. According to the ARM manual for ARMv7 (Cortex-A9), the NEON SIMD ISA does not support a vdiv instruction, but the floating point (VFP) engine does. Why is this instruction generated? Is it then a floating point instruction that will be executed by the VFP? Both NEON and VFP support addition and multiplication for floating point, so how can I differentiate them from each other?
In the case of Cortex-A9, the NEON FPU option also implements VFP; it is a superset of the cut-down 16-register VFP-only FPU option.
More generally, the architecture does not allow implementing floating-point Advanced SIMD without also implementing at least single-precision VFP, therefore GCC's -mfpu=neon implies VFPv3 as well. It is permissible to implement integer-only Advanced SIMD without any floating-point capability at all, but I'm not sure GCC can support that (or that anyone's ever built such a thing).
The actual VFP and Advanced SIMD variants of instructions are unambiguous from the syntax - anything operating on double-precision data (i.e. <op>.F64) is obviously VFP, as Advanced SIMD doesn't support double-precision. Single precision operations (i.e. <op>.F32) operating on 32-bit s registers are scalar, thus VFP; if they're operating on larger 64-bit d or 128-bit q registers, then they are handling multiple 32-bit values at once, thus are vectorised Advanced SIMD instructions.
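A small illustration of that rule: compiled with the flags from the question, these three C functions produce all three forms (the function names here are just for illustration):

    #include <arm_neon.h>

    /* Scalar single precision: vadd.f32 s0, s0, s1 -> VFP */
    float add_scalar(float a, float b)
    {
        return a + b;
    }

    /* Vector single precision on q registers via an intrinsic:
       vadd.f32 q0, q0, q1 -> Advanced SIMD (NEON) */
    float32x4_t add_vector(float32x4_t a, float32x4_t b)
    {
        return vaddq_f32(a, b);
    }

    /* Double precision: vadd.f64 d0, d0, d1 -> always VFP, since
       ARMv7 Advanced SIMD has no double-precision arithmetic */
    double add_double(double a, double b)
    {
        return a + b;
    }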

Power function without the use of the math library

I'm working on a microcontroller that has access to floating-point operations.
I need to make use of a power function. The problem is that there isn't enough memory to support the pow and sqrt functions: the microcontroller doesn't support FP operations natively, so using them produces a large number of instructions. I can still multiply and divide floating point numbers.
Architecture: Freescale HCS12 (16-bit)
If you mentioned the architecture, you might get a more specific answer.
The linux kernel still has the old x87 IEEE-754 math emulation library for i386 and i486 processors without a hardware floating point unit, under: arch/x86/math-emu/
There are a lot of resources online for floating point routines implemented for PIC micros, and AVR libc has a floating point library - though it's in AVR assembly.
glibc has implementations for pow functions in sysdeps/ieee754. Obviously, the compiler must handle the elementary floating point ops using hardware instructions or emulation / function calls.
Make your own function that multiplies repeatedly in a loop.
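A minimal sketch of that idea for non-negative integer exponents, refined into square-and-multiply so the loop needs only O(log n) multiplications (powi is a hypothetical name):

    /* Compute base^exp using only multiplication; no math library. */
    float powi(float base, unsigned int exp)
    {
        float result = 1.0f;
        while (exp != 0u) {
            if (exp & 1u)          /* odd exponent: fold in one factor */
                result *= base;
            base *= base;          /* square the base each round */
            exp >>= 1;
        }
        return result;
    }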

ARM Cortex-A8: What's the difference between VFP and NEON

In the ARM Cortex-A8 processor, I understand what NEON is: it is a SIMD co-processor.
But does the VFP (Vector Floating Point) unit, which is also a co-processor, work as a SIMD processor? If so, which one is better to use?
I read a few links, such as Link1 and Link2, but they are not really very clear about what this means. They say that VFP was never intended to be used for SIMD, but on Wiki I read the following: "The VFP architecture also supports execution of short vector instructions but these operate on each vector element sequentially and thus do not offer the performance of true SIMD (Single Instruction Multiple Data) parallelism."
It is not so clear what to believe; can anyone elaborate more on this topic?
There are quite a few differences between the two. Neon is a SIMD (Single Instruction Multiple Data) accelerator processor that is part of the ARM core. It means that during the execution of one instruction the same operation will occur on up to 16 data sets in parallel. Since there is parallelism inside Neon, you can get more MIPS or FLOPS out of Neon than you can from a standard SISD processor running at the same clock rate.
The biggest benefit of Neon is if you want to execute operations on vectors, e.g. video encoding/decoding. Also, it can perform single precision floating point (float) operations in parallel.
VFP is a classic floating point hardware accelerator. It is not a parallel architecture like Neon; basically it performs one operation on one set of inputs and returns one output. Its purpose is to speed up floating point calculations. It supports single and double precision floating point.
You have 3 possibilities for using Neon:
use intrinsic functions: #include "arm_neon.h" (see the sketch after this list)
inline the assembly code
let GCC do the optimizations for you by providing -mfpu=neon as an argument (GCC 4.5 is good at this)
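A minimal sketch of the intrinsics option, multiplying two float arrays four lanes at a time (vec_mul is a hypothetical name; n is assumed to be a multiple of 4):

    #include <arm_neon.h>

    void vec_mul(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);      /* load 4 floats */
            float32x4_t vb = vld1q_f32(b + i);
            vst1q_f32(dst + i, vmulq_f32(va, vb));  /* multiply and store */
        }
    }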
For armv7 ISA (and variants)
The NEON is a SIMD and parallel data processing unit for integer and floating point data and the VFP is a fully IEEE-754 compatible floating point unit. In particular on the A8, the NEON unit is much faster for just about everything, even if you don't have highly parallel data, since the VFP is non-pipelined.
So why would you ever use the VFP?!
The most major difference is that the VFP provides double precision floating point.
Secondly, there are some specialized instructions that the VFP offers for which there are no equivalent implementations in the NEON unit. SQRT comes to mind, and perhaps some type conversions.
But the most important difference not mentioned in Cosmin's answer is that the NEON floating point pipeline is not entirely IEEE-754 compliant. The best description of the differences is in the FPSCR Register Description.
Because it is not IEEE-754 compliant, a compiler cannot generate these instructions unless you tell the compiler that you are not interested in full compliance. This can be done in several ways.
Using an intrinsic function to force NEON usage, for example see the GCC Neon Intrinsic Function List.
Ask the compiler, very nicely. Even newer GCC versions with -mfpu=neon will not generate floating point NEON instructions unless you also specify -funsafe-math-optimizations.
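For example, a plain scaling loop like the sketch below stays on scalar VFP instructions when built with only -O2 -mfpu=neon -mfloat-abi=softfp, but may be vectorized into NEON vmul.f32 on q registers once -funsafe-math-optimizations (or -ffast-math) is added; the function name and exact flag set are illustrative:

    /* Candidate loop for auto-vectorization with
       -mfpu=neon -ftree-vectorize -funsafe-math-optimizations */
    void scale(float *x, float s, int n)
    {
        for (int i = 0; i < n; ++i)
            x[i] *= s;
    }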
For armv8+ ISA (and variants) [Update]
NEON is now fully IEEE-754 compliant, and from a programmer's (and compiler's) point of view, there is actually not too much difference. Double precision has been vectorized. From a micro-architecture point of view I rather doubt they are even different hardware units. ARM does document scalar and vector instructions separately, but both are part of "Advanced SIMD."
Architecturally, VFP (it wasn't called Vector Floating Point for nothing) indeed has a provision for operating on a floating-point vector in a single instruction. I don't think it ever actually executes multiple operations simultaneously (like true SIMD), but it could save some code size. However, if you read the ARM Architecture Reference Manual in the Shark help (as I describe in my introduction to NEON, link 1 in the question), you'll see at section A2.6 that the vector feature of VFP is deprecated in ARMv7 (which is what the Cortex A8 implements), and software should use Advanced SIMD for floating-point vector operations.
Worse yet, in the Cortex A8 implementation, VFP is implemented with a VFP Lite execution unit (read "lite" as occupying a smaller silicon surface, not as having fewer features), which means that it's actually slower than on the ARM11, for instance! Fortunately, most single-precision VFP instructions get executed by the NEON unit, but I'm not sure vector VFP operations do; and even if they do, they certainly execute more slowly than with NEON instructions.
Hope that clears things up!
IIRC, the VFP is a floating point coprocessor which works sequentially.
This means that you can use an instruction on a vector of floats for SIMD-like behaviour, but internally, the instruction is performed on each element of the vector in sequence.
While the overall time required for the instruction is reduced by this because of the single load instruction, the VFP still needs time to process all elements of the vector.
True SIMD will gain more net floating point performance, but using the VFP with vectors is still faster than using it purely sequentially.
