I was working with some assembly code for ARM's NEON unit and came across these instructions. I am not able to understand how these instructions work at the memory/register level. Any help would be appreciated. Thank you in advance.
VZIP.F32 Q3, Q4
VUZP.F32 Q3, Q4
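To make the register-level behavior concrete, here is a plain-C++ model of what the two instructions do to a pair of 4-lane Q registers (the function names are invented for this sketch; NEON modifies both operands in place):

```cpp
// Model of VZIP.32 Qa, Qb: interleave the lanes of the two
// registers, low half of the result into a, high half into b.
void vzip32(float a[4], float b[4]) {
    float lo[4] = { a[0], b[0], a[1], b[1] };  // low half of the zip
    float hi[4] = { a[2], b[2], a[3], b[3] };  // high half
    for (int i = 0; i < 4; i++) { a[i] = lo[i]; b[i] = hi[i]; }
}

// Model of VUZP.32 Qa, Qb: the inverse operation; even lanes of the
// concatenated pair go to a, odd lanes to b.
void vuzp32(float a[4], float b[4]) {
    float ev[4] = { a[0], a[2], b[0], b[2] };
    float od[4] = { a[1], a[3], b[1], b[3] };
    for (int i = 0; i < 4; i++) { a[i] = ev[i]; b[i] = od[i]; }
}
```

So with a = {0,1,2,3} and b = {4,5,6,7}, VZIP leaves a = {0,4,1,5} and b = {2,6,3,7}, and a following VUZP restores the originals: the two instructions are inverses of each other.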
There is an Arm Software Optimization Guide for each core (e.g., https://developer.arm.com/documentation/swog309707/latest for the Neoverse N1).
This guide doesn't seem to contain the latency and throughput for Neon or SVE instructions. Is there a separate guide for NEON or SVE (e.g., one giving the latency and throughput of the INSR (SIMD&FP scalar) instruction)?
A pointer would be very helpful!
The timings for Neon instructions are in that document, listed under ASIMD ("Advanced SIMD", Arm's more formal name for that instruction set). See Section 3.15 onward.
There are no timings for SVE instructions because, as I understand it, the N1 simply doesn't support that extension. But if you look at the guide for some core that does support SVE, you'll see the timings included. For the Neoverse N2 they are from Section 3.26 onward.
I'm trying to understand the comment made by "Iwillnotexist Idonotexist" at SIMD optimization of cvtColor using ARM NEON intrinsics:
... why you don't use the ARM NEON intrinsics that map to the VLD3 instruction? That spares you all of the shuffling, both simplifying and speeding up the code. The Intel SSE implementation requires shuffles because it lacks 2/3/4-way deinterleaving load instructions, but you shouldn't pass on them when they are available.
The trouble I am having is that the solution offers code that is non-interleaved, and it performs fused multiplies on floating points. I'm trying to separate the two and understand just the interleaved loads.
According to the other question's comment and Coding for NEON - Part 1: Load and Stores, the answer is probably going to use VLD3.
Unfortunately, I'm just not seeing it (probably because I'm less familiar with NEON and its intrinsic functions). It seems like VLD3 basically produces 3 outputs for each input, so my mental model is confused.
Given the following SSE intrinsics that operate on data in BGR BGR BGR BGR... format, which needs a shuffle to become BBBB GGGG RRRR ...:
const byte* data = ... // assume 16-byte aligned
const __m128i mask = _mm_setr_epi8(0,3,6,9,12,15,1,4,7,10,13,2,5,8,11,14);
__m128i a = _mm_shuffle_epi8(_mm_load_si128((__m128i*)(data)),mask);
How do we perform the interleaved loads using NEON intrinsics so that we don't need the SSE shuffles?
Also note: I'm interested in intrinsics, not ASM. I can use ARM's intrinsics on Windows Phone, Windows Store, and Linux-powered devices under MSVC, ICC, Clang, etc. I can't do that with ASM, and I'm not trying to specialize the code three times (Microsoft 32-bit ASM, Microsoft 64-bit ASM, and GCC ASM).
According to this page:
The VLD3 intrinsic you need is:
int8x8x3_t vld3_s8(__transfersize(24) int8_t const * ptr);
// VLD3.8 {d0, d1, d2}, [r0]
If the memory at the address pointed to by ptr contains this data:
0x00: 33221100
0x04: 77665544
0x08: bbaa9988
0x0c: ffddccbb
0x10: 76543210
0x14: fedcba98
you will end up with the following register contents:
d0: ba54ffbb99663300
d1: dc7610ccaa774411
d2: fe9832ddbb885522
The int8x8x3_t structure is defined as:
struct int8x8x3_t
{
int8x8_t val[3];
};
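The de-interleaving rule behind the example above is simply that lane i of register k receives byte 3*i + k of memory. A plain-C++ model of the intrinsic (the helper names here are invented; the real vld3_s8 maps to a single VLD3.8) makes that explicit:

```cpp
#include <cstdint>

// Model of vld3_s8: load 24 bytes and de-interleave them so that
// lane i of val[k] is ptr[3*i + k].
struct int8x8x3_model { int8_t val[3][8]; };

int8x8x3_model model_vld3_s8(const int8_t *ptr) {
    int8x8x3_model r;
    for (int i = 0; i < 8; i++)        // 8 lanes per D register
        for (int k = 0; k < 3; k++)    // 3 registers: d0, d1, d2
            r.val[k][i] = ptr[3 * i + k];
    return r;
}
```

Running this model over the 24 bytes from the example (00 11 22 33 44 55 ... dc fe in memory order) reproduces exactly the d0/d1/d2 contents listed above, which is why VLD3 replaces the SSE shuffle for BGR-to-planar conversion.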
This question already has an answer here: Why does a std::atomic store with sequential consistency use XCHG? (1 answer). Closed 11 months ago.
In C++ and Beyond 2012: Herb Sutter - atomic<> Weapons, 2 of 2, Herb Sutter argues (around 0:38:20) that one should use xchg, not mov/mfence, to implement atomic_store on x86. He also seems to suggest that this particular instruction sequence is what everyone agreed on. However, GCC uses the latter. Why does GCC use this particular implementation?
Quite simply, the mov/mfence method is faster because it avoids the redundant memory read that xchg performs, which takes time. The x86 CPU guarantees strict ordering of writes between threads anyway, so the cheaper sequence is enough.
Note that some very old CPUs had a bug in the mov instruction which made xchg necessary, but that was a very long time ago, and working around it is not worth the overhead for most users.
Credit to @amdn for the information on the bug in old Pentium CPUs that made xchg necessary in the past.
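At the source level, the two sequences under discussion correspond to a plain seq_cst store versus an exchange whose result is discarded. A minimal sketch (the exact instructions emitted depend on the compiler version and target; GCC at the time of the question chose mov + mfence for the first form):

```cpp
#include <atomic>

std::atomic<int> g{0};

// Compiles (on GCC/x86-64, historically) to: mov [g], v ; mfence
void store_via_store(int v) {
    g.store(v, std::memory_order_seq_cst);
}

// Compiles to a single xchg (implicitly locked), result discarded.
void store_via_exchange(int v) {
    (void)g.exchange(v, std::memory_order_seq_cst);
}
```

Both forms are valid implementations of a sequentially consistent store; the debate in the talk is purely about which costs less on real hardware.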
I'm trying to convert a piece of code from SSE to ARM Neon for optimization. For most of the SSE instructions in the code I found clearly equivalent Neon ones, but I've got problems with these:
result1_shifted = _mm_srli_si128 (result1, 1);
result=_mm_packus_epi16 (res1,res2);
_mm_storeu_si128 (p_dest, result);
Could you please help me?
I agree with the comments that it's probably a good idea to go back to a "C" (or anything really) reference design and maybe start from scratch. In particular you will find that perhaps NEON has some more optimal ways of doing things in some cases. But if you find that you need to do nearly identical things, here are some hints:
_mm_srli_si128 (result1, 1); Try VEXT.S8 Qdst, Qsrc, Qsrc2, #1, where src2 has been cleared to 0.
_mm_packus_epi16 (res1,res2); Try VQMOVUN.S16 Ddst, Qsrc, once per 64-bit half. The key word when looking for alternatives is "narrow": you are moving with narrowing, and the "Q" prefix is NEON nomenclature for saturation. The "UN" variant performs exactly the signed-to-unsigned saturation that packus does. Having a reference design and tests is still a good idea!
_mm_storeu_si128 (__m128i *p, __m128i a); The closest NEON match for an unaligned 128-bit store is VST1 (e.g., VST1.8 {d0, d1}, [r0]); VSTM is another option, and there are lots of choices here. You probably want to look at this in some detail.
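Since the answer recommends keeping a reference design to test against, here is a plain-C++ oracle for the first two SSE ops (the helper names are invented for this sketch), which a NEON port using VEXT and VQMOVUN can be validated against:

```cpp
#include <cstdint>
#include <cstring>

// _mm_srli_si128(x, 1): shift the 16-byte vector right by one byte,
// shifting in zero -- VEXT.8 against a zeroed vector on NEON.
void srli_si128_1(uint8_t out[16], const uint8_t in[16]) {
    std::memcpy(out, in + 1, 15);  // byte i of result = byte i+1 of input
    out[15] = 0;                   // zero shifted in at the top
}

// Clamp a signed 16-bit value into [0, 255].
uint8_t sat_u8(int16_t v) {
    return v < 0 ? 0 : v > 255 ? 255 : static_cast<uint8_t>(v);
}

// _mm_packus_epi16(a, b): narrow 16 signed 16-bit lanes to 16
// unsigned bytes with saturation -- VQMOVUN.S16 per half on NEON.
void packus_epi16(uint8_t out[16], const int16_t a[8], const int16_t b[8]) {
    for (int i = 0; i < 8; i++) {
        out[i]     = sat_u8(a[i]);
        out[8 + i] = sat_u8(b[i]);
    }
}
```

Running both the SSE original and the NEON port against this oracle on random inputs catches lane-ordering and saturation mistakes early.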
Where can I find information about common SIMD tricks? I have an instruction set reference and know how to write straightforward SIMD code, but I know SIMD is now much more powerful: it can express complex conditional logic without branches.
For example (ARMv6), the following sequence of instructions sets each byte of Rd equal to the unsigned minimum of the corresponding bytes of Ra and Rb:
USUB8 Rd, Ra, Rb
SEL Rd, Rb, Ra
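The trick works because USUB8 sets a per-byte GE flag (GE[i] = 1 iff byte i of Ra >= byte i of Rb, unsigned), and SEL then picks Rb's byte where GE is set and Ra's byte otherwise, yielding the per-byte unsigned minimum. A plain-C++ model of the pair (function name invented for the sketch):

```cpp
#include <cstdint>

// Model of: USUB8 Rd, Ra, Rb ; SEL Rd, Rb, Ra
// Returns the byte-wise unsigned minimum of ra and rb.
uint32_t min_u8x4(uint32_t ra, uint32_t rb) {
    uint32_t rd = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t a = (ra >> (8 * i)) & 0xff;
        uint8_t b = (rb >> (8 * i)) & 0xff;
        bool ge = (a >= b);                       // GE[i] set by USUB8
        rd |= uint32_t(ge ? b : a) << (8 * i);    // SEL Rd, Rb, Ra
    }
    return rd;
}
```

For example, min_u8x4(0x01FF0305, 0x02FE0304) picks the smaller byte in each lane, giving 0x01FE0304.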
Links to tutorials and uncommon SIMD techniques are welcome too :) ARMv6 is the most interesting for me, but x86 (SSE, ...), Neon (in ARMv7), and others are good too.
One of the best SIMD resources ever was the old AltiVec mailing list. Although PowerPC/AltiVec-specific, I suspect a lot of the material on that list would be of general interest to anyone working with other SIMD architectures. Sadly the list now seems to be defunct after being moved to a forum on power.org, but you may be able to find archived versions of it. (If not, let me know; I have pretty much all the posts from 2000 to 2007.)
There is also a lot of potentially useful info on AltiVec, SSE, SIMD vectorization and performance in general at http://developer.apple.com/hardwaredrivers/ve/index.html, a good deal of which may be transferable to other SIMD architectures.
Try AMD's SSEPlus project on SourceForge.