What is the difference between NEON SIMD and NEON SIMD version 2 as in Cortex A15?
It adds the SIMD fused multiply-add (FMA) instruction (VFMA.F32) and makes the NEON half-precision extension mandatory. NEONv2 is supported on ARM Cortex-A7, ARM Cortex-A15, and Qualcomm Krait (I am not sure about ARM Cortex-A5).
It is not much of a difference. From the ARM ARM (Architecture Reference Manual):
(in reverse order of definitions)
Advanced SIMDv2 is an OPTIONAL extension to the ARMv7-A and ARMv7-R profiles.
Advanced SIMDv2 adds both the Half-precision Extension and the fused
multiply-add instructions to the features of Advanced SIMDv1.
...
Advanced SIMDv1 can be extended by the OPTIONAL Half-precision Extension,
that provides conversion functions in both directions between half-precision
floating-point and single-precision floating-point.
...
The Advanced SIMD architecture extension, its associated implementations, and supporting software, are
commonly referred to as NEON™ technology.
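To make the FMA difference concrete, here is a minimal sketch of a four-lane multiply-accumulate that uses the fused vfmaq_f32 intrinsic when the compiler reports FMA support and falls back to the unfused vmlaq_f32 (Advanced SIMDv1) otherwise. The function name madd4 is hypothetical, and the sketch assumes the ACLE feature macros __ARM_NEON and __ARM_FEATURE_FMA are defined by your toolchain (with GCC on ARMv7 this typically needs something like -mfpu=neon-vfpv4). A plain C loop is used when NEON is not available at all.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: acc[i] += a[i] * b[i] for four floats. */
void madd4(float acc[4], const float a[4], const float b[4])
{
#if (defined(__ARM_NEON) || defined(__ARM_NEON__)) && defined(__ARM_FEATURE_FMA)
    /* NEONv2 path: fused multiply-add, one rounding per lane (VFMA.F32). */
    vst1q_f32(acc, vfmaq_f32(vld1q_f32(acc), vld1q_f32(a), vld1q_f32(b)));
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* NEONv1 path: multiply, round, then add (VMLA.F32). */
    vst1q_f32(acc, vmlaq_f32(vld1q_f32(acc), vld1q_f32(a), vld1q_f32(b)));
#else
    /* Portable fallback so the sketch also builds on non-ARM hosts. */
    for (int i = 0; i < 4; ++i) {
        acc[i] += a[i] * b[i];
    }
#endif
}
```

The fused and unfused paths can differ in the last bit of the result, which matters if you compare outputs across cores with and without NEONv2.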
I am working on a project to improve performance on the ARM platform using NEON intrinsics.
I could not find direct equivalents for the intrinsics below:
_mm_mulhi_epi16
_mm_hadd_epi32
_mm_maddubs_epi16
_mm_madd_epi16
_mm_extract_epi8
Equivalent intrinsics would help a lot in my efforts.
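_mm_hadd_epi32 appears to match vpaddq_s32. Note that vpaddq_s32 is an AArch64 intrinsic; on ARMv7 NEON you would apply vpadd_s32 to the 64-bit halves and recombine with vcombine_s32.
_mm_extract_epi8 appears to match vgetq_lane_s8 (or vgetq_lane_u8 if you want the zero-extended value that _mm_extract_epi8 returns).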
Not sure about the others offhand.
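For two of the remaining intrinsics, there is no single NEON instruction as far as I know, but they can be composed from a widening multiply. The sketch below is one possible mapping, not an official equivalence: _mm_mulhi_epi16 as vmull_s16 followed by a narrowing shift by 16, and _mm_madd_epi16 as vmull_s16 followed by a pairwise add. The function names are hypothetical. Scalar reference functions describing the per-lane semantics are included so the file also builds on non-ARM hosts.

```c
#include <stdint.h>

/* Per-lane semantics of _mm_mulhi_epi16: high 16 bits of the signed
 * 32-bit product. Assumes arithmetic right shift of negative values,
 * which GCC and Clang provide. */
static int16_t mulhi_s16_ref(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> 16);
}

/* Per-lane semantics of _mm_madd_epi16: sum of two adjacent products,
 * wrapping on overflow. */
static int32_t madd_s16_ref(int16_t a0, int16_t b0, int16_t a1, int16_t b1)
{
    return (int32_t)((uint32_t)((int32_t)a0 * b0) + (uint32_t)((int32_t)a1 * b1));
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Sketch of _mm_mulhi_epi16 on eight 16-bit lanes. */
static inline int16x8_t mulhi_s16_neon(int16x8_t a, int16x8_t b)
{
    int32x4_t lo = vmull_s16(vget_low_s16(a), vget_low_s16(b));
    int32x4_t hi = vmull_s16(vget_high_s16(a), vget_high_s16(b));
    /* vshrn_n_s32 keeps bits [31:16] of each 32-bit product. */
    return vcombine_s16(vshrn_n_s32(lo, 16), vshrn_n_s32(hi, 16));
}

/* Sketch of _mm_madd_epi16: multiply lanes, then add adjacent pairs. */
static inline int32x4_t madd_s16_neon(int16x8_t a, int16x8_t b)
{
    int32x4_t lo = vmull_s16(vget_low_s16(a), vget_low_s16(b));
    int32x4_t hi = vmull_s16(vget_high_s16(a), vget_high_s16(b));
    int32x2_t sum_lo = vpadd_s32(vget_low_s32(lo), vget_high_s32(lo));
    int32x2_t sum_hi = vpadd_s32(vget_low_s32(hi), vget_high_s32(hi));
    return vcombine_s32(sum_lo, sum_hi);
}
#endif
```

_mm_maddubs_epi16 mixes unsigned and signed 8-bit inputs with saturation, so it needs more care and is left out of this sketch.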
I am using the FFTW3 library on Beagleboard xM in a C application to perform r2c FFTs of floats. I read on this page that FFTW3 includes support for Neon, which is part of the xM architecture.
Is there a way to tell if the Neon coprocessor is actually being used?
For example, can I list symbols from the object files and search for some special Neon symbols? Alternatively, can I look through gcc -S assembler output for any Neon instructions? What instruction(s) would I look for? (I'm not familiar with what Neon assembly looks like.)
Look at the disassembly. NEON instructions that operate on float data have a .f32 suffix and the NEON registers have names of the form dN or qN (where N is an integer). So if you see instructions that look like:
vadd.f32 q0, q1, q2
then NEON is being used.
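If you want a known-positive sample to compare against, a tiny file like the following sketch can help. When built for a NEON target, the vaddq_f32 intrinsic is expected to show up in the disassembly as a vadd.f32 on q registers. The function names add4_f32 and built_with_neon are hypothetical, and the sketch assumes your compiler defines the ACLE macro __ARM_NEON (or the older __ARM_NEON__) when NEON is enabled. Note that this only tells you about code you compile yourself; whether a prebuilt FFTW3 uses NEON depends on how that library was configured.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: element-wise sum of two 4-float vectors. */
void add4_f32(const float a[4], const float b[4], float out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Expected to compile to a single vadd.f32 on q registers. */
    vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
#else
    /* Scalar fallback when the compiler is not targeting NEON. */
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}

/* Returns 1 if this translation unit was compiled with NEON enabled. */
int built_with_neon(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}
```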
I haven't yet written a program to see whether GCC needs a flag passed for this. When I do, I'd like to know how to enable a strict floating-point mode that gives reproducible results between runs and computers. Thanks.
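Compiling with -msse2 on an Intel/AMD processor that supports it will get you almost there. On 32-bit x86 you also need -mfpmath=sse so that GCC actually uses SSE rather than x87 for scalar math. Do not let any library put the FPU in FTZ/DAZ (flush-to-zero / denormals-are-zero) mode, and you will be mostly set (processor bugs notwithstanding).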
For other architectures, the answer would be different. Those architectures that do not offer any convenient way to get exact IEEE 754 semantics (for instance, pre-SSE2 IA32 CPUs) would require the use of a floating-point emulation library to get the result you want, at a very high performance penalty.
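If your target architecture supports a fused multiply-add (fmadd) instruction, which multiplies and adds without intermediate rounding, make sure your compiler does not use it when you have explicit multiplications and additions in the source code. GCC contracts such expressions under -ffast-math, and in its default GNU mode it may also do so through -ffp-contract=fast; pass -ffp-contract=off (or compile in a strict ISO mode such as -std=c99) to prevent this.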
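To see why contraction affects reproducibility, here is a small sketch that computes a*b + c two ways. The function names are hypothetical. The first forces the product to be rounded to float before the addition (the volatile store is an assumption used to stop the compiler from fusing or widening the expression). The second imitates a fused operation by doing the arithmetic in double, where the product of two floats is exact, and rounding to float once at the end; this is an illustration, not a general replacement for fmaf.

```c
/* Multiply, round to float, then add: two roundings. */
float mul_add_two_roundings(float a, float b, float c)
{
    volatile float product = a * b; /* forces rounding to float here */
    return product + c;
}

/* Compute in double and round to float once at the end.
 * float * float fits exactly in a double (24 + 24 bits <= 53 bits). */
float mul_add_one_rounding(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. Rounding it to float first drops the 2^-24 term, so the first function returns 0, while the second returns 2^-24. The same source expression can therefore give different results depending on whether the compiler fuses it.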
If you use -ffloat-store and always store intermediate values to variables or apply (explicit) casts to the desired type/precision, you should be at least 90% to your goal, and maybe more. I'd welcome comments on whether there are cases this approach still misses. Note that I claim this works even without any SSE options.
You can also use GCC's -mpc64 option on i386 / ia32 targets to make the x87 FPU round results to double precision. See the GCC manual.
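Whether intermediates are kept in a wider format than the declared type is something the compiler reports through the standard C99 macro FLT_EVAL_METHOD in <float.h>. The sketch below wraps it in two hypothetical helper functions so you can check what a given set of flags produces; it only reports the compile-time setting and does not detect runtime changes to the FPU control word.

```c
#include <float.h>

/* Returns the compiler's FLT_EVAL_METHOD for this translation unit. */
int float_eval_method(void)
{
    return (int)FLT_EVAL_METHOD;
}

/* Human-readable meaning of the value, per the C99 definition. */
const char *float_eval_description(void)
{
    switch ((int)FLT_EVAL_METHOD) {
    case 0:
        return "operations evaluated in the range and precision of their type";
    case 1:
        return "float and double evaluated as double";
    case 2:
        return "all operations evaluated as long double";
    default:
        return "indeterminate or implementation-defined";
    }
}
```

A value of 0 is what you would typically hope for when aiming at reproducible results; a value of 2 is typical of x87 code generation and is the case that -ffloat-store and explicit casts try to work around.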
You can also modify the x87 FPU behavior at runtime; see "Deterministic cross-platform floating point arithmetics" and also "An Introduction to GCC".
I want to write some C code such that gcc using the -msse4.1 flag can optimize it. Basically I want to check whether or not the compiler is taking advantage of SSE4.1 instructions.
There are many SSE4.1 instructions (http://en.wikipedia.org/wiki/SSE4#New_instructions), but I have not been able to write a fragment of C code for which the generated assembly uses any of them.
Thanks in advance.
From what I've seen, compilers rarely ever generate SSE4.1 instructions. I've seen a few cases where it will use the insert/extract instructions to pack data.
But for the most part, if you want to use the SSE4.1 instructions, you need to do them explicitly using intrinsics:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_bk_sse41.htm
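As a minimal sketch of the explicit route, the function below multiplies two int32_t arrays using the SSE4.1 intrinsic _mm_mullo_epi32 (the PMULLD instruction) when the compiler defines __SSE4_1__, which GCC does under -msse4.1. The function name mul_i32 is hypothetical. The scalar tail loop also serves as the fallback when SSE4.1 is not enabled, so you can compile the same file with and without the flag and compare the assembly.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE4_1__)
#include <smmintrin.h> /* SSE4.1 intrinsics */
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] with wrapping 32-bit results. */
void mul_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE4_1__)
    /* Four lanes per iteration; unaligned loads keep the sketch simple. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_mullo_epi32(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when SSE4.1 is not enabled. */
    for (; i < n; ++i) {
        out[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
    }
}
```

Compiling with gcc -O2 -msse4.1 -S and searching the output for pmulld is one way to confirm the instruction was emitted.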
I doubt GCC would emit SSE4.1 instructions that easily. But you could have a look at Intel SPMD Program Compiler:
Under the SPMD model, the programmer writes a program that mostly
appears to be a regular serial program, though the execution model is
actually that a number of program instances execute in parallel on the
hardware. (See a more detailed example that illustrates this concept.)
ispc compiles a C-based SPMD programming language to run on the SIMD
units of CPUs; it frequently provides a 3x or more speedup on CPUs
with 4-wide SSE units, without any of the difficulty of writing
intrinsics code.
I'm using Cortex-A8 processor and I'm not understanding how to use the -mfpu flag.
On the Cortex-A8 there are both vfpv3 and neon co-processors. Previously I did not know how to use neon, so I was only using
gcc -marm -mfloat-abi=softfp -mfpu=vfpv3
Now I have understood how SIMD processors run and I have written certain code using NEON intrinsics. To use neon co-processor now my -mfpu flag has to change to -mfpu=neon, so my compiler command line looks like this
gcc -marm -mfloat-abi=softfp -mfpu=neon
Now, does this mean that my vfpv3 is not used any more? I have lots of code which does not make use of NEON; do those parts no longer make use of vfpv3?
If both neon and vfpv3 are still used then I have no issues, but if only one of them is used how can I make use of both?
NEON implies having the traditional VFP support too. VFP can be used for "normal" (non-vector) floating-point calculations. Also, NEON does not support double-precision FP so only VFP instructions can be used for that.
What you can do is add -S to gcc's command line and check the assembly. Instructions starting with V (e.g. vld1.32, vmla.f32) are NEON instructions, and those starting with F (fldd, fmacd) are VFP. (ARM docs now prefer the V prefix even for VFP instructions, and newer GCC versions follow that unified syntax; in that case, scalar operations on sN registers or with a .f64 suffix are VFP.)
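A small test file makes this check easier. The sketch below has one double-precision function, which on ARMv7 can only be compiled to VFP instructions, and one single-precision loop that the compiler may vectorize with NEON. The function names are hypothetical. As far as I know, GCC only auto-vectorizes float code for ARMv7 NEON when -funsafe-math-optimizations (or -ffast-math) is also given, because NEON there does not fully implement IEEE 754; treat the exact flags as an assumption to verify against your GCC version.

```c
#include <stddef.h>

/* Double precision: ARMv7 NEON has no double-precision arithmetic,
 * so with -mfpu=neon this is expected to use VFP instructions. */
double dot_f64(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Single precision: a candidate for NEON auto-vectorization. */
void add_f32(const float *x, const float *y, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = x[i] + y[i];
    }
}
```

Compile it with your usual flags plus -O3 -S and compare the two functions in the generated assembly: the first should contain only scalar floating-point instructions, while the second may contain vector loads and a vadd.f32 on q registers.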