NEON, SSE and interleaving loads vs shuffles

NEON, SSE and interleaving loads vs shuffles - arm

I'm trying to understand the comment made by "Iwillnotexist Idonotexist" at SIMD optimization of cvtColor using ARM NEON intrinsics:
... why you don't use the ARM NEON intrisics that map to the VLD3 instruction? That spares you all of the shuffling, both simplifying and speeding up the code. The Intel SSE implementation requires shuffles because it lacks 2/3/4-way deinterleaving load instructions, but you shouldn't pass on them when they are available.
The trouble I am having is the solution offers code that is non-interleaved, and it performs fused multiplies on floating points. I'm trying to separate the two and understand just the interleaved loads.
According to the other question's comment and Coding for NEON - Part 1: Load and Stores, the answer is probably going to use VLD3.
Unfortunately, I'm just not seeing it (probably because I'm less familiar with NEON and its intrinsic functions). It seems like VLD3 basically produces 3 outputs for each input, so my metal model is confused.
Given the following SSE instrinsics that operate on data in BGR BGR BGR BGR... format that needs a shuffle for BBBB GGGG RRRR ...:
const byte* data = ... // assume 16-byte aligned
const __m128i mask = _mm_setr_epi8(0,3,6,9,12,15,1,4,7,10,13,2,5,8,11,14);
__m128i a = _mm_shuffle_epi8(_mm_load_si128((__m128i*)(data)),mask);
How do we perform the interleaved loads using NEON intrinsics so that the we don't need the SSE shuffles?
Also note... I'm interested in intrinsics and not ASM. I can use ARM's intrinsics on Windows Phone, Windows Store, and Linux powered devices under MSVC, ICC, Clang, etc. I can't do that with ASM, and I'm not trying to specialize the code 3 times (Microsoft 32-bit ASM, Microsoft 64-bit ASM and GCC ASM).

According to this page:
The VLD3 intrinsic you need is:
int8x8x3_t vld3_s8(__transfersize(24) int8_t const * ptr);
// VLD3.8 {d0, d1, d2}, [r0]
If at address pointed by ptr you have this data:
0x00: 33221100
0x04: 77665544
0x08: bbaa9988
0x0c: ffddccbb
0x10: 76543210
0x14: fedcba98
You will finally get in the registers:
d0: ba54ffbb99663300
d1: dc7610ccaa774411
d2: fe9832ddbb885522
The int8x8x3_t structure is defined as:
struct int8x8x3_t
{
int8x8_t val[3];
};

Related

_mm_mul_epu32 and _mm_mullo_epi32 on arm neon

I am working on a application to port SSE code to Neon.
I see the intrinsics _mm_mullo_epi32 and _mm_mul_epu32 in SSE.
Do we have an equivalent of Neon for these ?

The equivalent of _mm_mullo_epi32 is vmulq_s32
_mm_mul_epu32 is a bit tricky, no single NEON instruction does the job.
Still, the workaround is not that bad, only needs 3 instructions. Two vmovn_u64 instructions to discard every 2-nd lane of the arguments, followed by vmull_u32 to multiply 32-bit lanes into 64-bit ones.

How to extract bytes from an SSE2 __m128i structure?

I'm a beginner with SIMD intrinsics, so I'll thank everyone for their patience in advance. I have an application involving absolute difference comparison of unsigned bytes (I'm working with greyscale images).
I tried AVX, more modern SSE versions etc, but eventually decided SSE2 seems sufficient and has the most support for individual bytes - please correct me if I'm wrong.
I have two questions: first, what's the right way to load 128-bit registers? I think I'm supposed to pass the load intrinsics data aligned to multiples of 128, but will that work with 2D array code like this:
greys = aligned_alloc(16, xres * sizeof(int8_t*));
for (uint32_t x = 0; x < xres; x++)
{
greys[x] = aligned_alloc(16, yres * sizeof(int8_t*));
}
(The code above assumes xres and yres are the same, and are powers of two). Does this turn into a linear, unbroken block in memory? Could I then, as I loop, just keep passing addresses (incrementing them by 128) to the SSE2 load intrinsics? Or does something different need to be done for 2D arrays like this one?
My second question: once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i ? Looking through the Intel Intrinsics Guide, instructions that convert a vector type to a scalar one are rare. The closest I've found is int _mm_movemask_epi8 (__m128i a) but I don't quite understand how to use it.
Oh, and one third question - I assumed _mm_load_si128 only loads signed bytes? And I couldn't find any other byte loading function, so I guess you're just supposed to subtract 128 from each and account for it later?
I know these are basic questions for SIMD experts, but I hope this one will be useful to beginners like me. And if you think my whole approach to the application is wrong, or I'd be better off with more modern SIMD extensions, I'd love to know. I'd just like to humbly warn I've never worked with assembly and all this bit-twiddling stuff requires a lot of explication if it's to help me.
Nevertheless, I'm grateful for any clarification available.
In case it makes a difference: I'm targeting a low-power i7 Skylake architecture. But it'd be nice to have the application run on much older machines too (hence SSE2).

Least obvious question first:
once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i
Extract the low 64 bits to an integer with int64_t _mm_cvtsi128_si64x(__m128i), or the low 32 bits with int _mm_cvtsi128_si32 (__m128i a).
If you want other parts of the vector, not the low element, your options are:
Shuffle the vector to create a new __m128i with the data you want in the low element, and use the cvt intrinsics (MOVD or MOVQ in asm).
Use SSE2 int _mm_extract_epi16 (__m128i a, int imm8), or the SSE4.1 similar instructions for other element sizes such as _mm_extract_epi64(v, 1); (PEXTRB/W/D/Q) are not the fastest instructions, but if you only need one high element, they're about equivalent to a separate shuffle and MOVD, but smaller machine code.
_mm_store_si128 to an aligned temporary array and access the members: compilers will often optimize this into just a shuffle or pextr* instruction if you compile with -msse4.1 or -march=haswell or whatever. print a __m128i variable shows an example, including Godbolt compiler output showing _mm_store_si128 into an alignas(16) uint64_t tmp[2]
Or use union { __m128i v; int64_t i64[2]; } or something. Union-based type punning is legal in C99, but only as an extension in C++. This is compiles the same as a tmp array, and is generally not easier to read.
An alternative to the union that would also work in C++ would be memcpy(&my_int64_local, 8 + (char*)my_vector, 8); to extract the high half, but that seems more complicated and less clear, and more likely to be something a compiler wouldn't "see through". Compilers are usually pretty good about optimizing away small fixed-size memcpy when it's an entire variable, but this is just half of the variable.
If the whole high half of a vector can go directly into memory unmodified (instead of being needed in an integer register), a smart compiler might optimize to use MOVHPS to store the high half of a __m128i with the above union stuff.
Or you can use _mm_storeh_pi((__m64*)dst, _mm_castsi128_ps(vec)). That only requires SSE1, and is more efficient than SSE4.1 pextrq on most CPUs. But don't do this for a scalar integer you're about to use again right away; if SSE4.1 isn't available it's likely the compiler will actually MOVHPS and integer reload, which usually isn't optimal. (And some compilers like MSVC don't optimize intrinsics.)
Does this turn into a linear, unbroken block in memory?
No, it's an array of pointers to separate blocks of memory, introducing an extra level of indirection vs. a proper 2D array. Don't do that.
Make one large allocation, and do the index calculation yourself (using array[x*yres + y]).
And yes, load data from it with _mm_load_si128, or loadu if you need to load from an offset.
assumed _mm_load_si128 only loads signed bytes
Signed or unsigned isn't an inherent property of a byte, it's only how you interpret the bits. You use the same load intrinsic for loading two 64-bit elements, or a 128-bit bitmap.
Use intrinsics that are appropriate for your data. It's a little bit like assembly language: everything is just bytes, and the machine will do what you tell it with your bytes. It's up to you to choose a sequence of instructions / intrinsics that produces meaningful results.
The integer load intrinsics take __m128i* pointer args, so you have to use _mm_load_si128( (const __m128i*) my_int_pointer ) or similar. This looks like pointer aliasing (e.g. reading an array of int through a short *), which is Undefined Behaviour in C and C++. However, this is how Intel says you're supposed to do it, so any compiler that implements Intel's intrinsics is required to make this work correctly. gcc does so by defining __m128i with __attribute__((may_alias)).
See also Loading data for GCC's vector extensions which points out that you can use Intel intrinsics for GNU C native vector extensions, and shows how to load/store.
To learn more about SIMD with SSE, there are some links in the sse tag wiki, including some intro / tutorial links.
The x86 tag wiki has some good x86 asm / performance links.

Will Knights Landing CPU (Xeon Phi) accelerate byte/word integer code?

The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support "F" (like SSE without SSE2, or AVX without AVX2), so floating-point stuff mainly.
I'm writing software that operates on bytes and words (8- and 16-bit) using up to SSE4.1 instructions via intrinsics.
I am confused whether there will be EVEX-encoded versions of all/most SSE4.1 instructions in AVX-512F, and whether this means I can expect my SSE code to automatically gain EVEX-extended instructions and map to all new registers.
Wikipedia says this:
The width of the SIMD register file is increased from 256 bits to 512 bits, with a total of 32 registers ZMM0-ZMM31. These registers can be addressed as 256 bit YMM registers from AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using EVEX encoded form.
This unfortunately does not clarify whether compiling SSE4 code with AVX512-enabled will lead to the same (awesome) speedup that compiling it to AVX2 provides (VEX coding of legacy instructions).
Anybody know what will happen when SSE2/4 code (C intrinsics) are compiled for AVX-512F? Could one expect a speed bump like with AVX1's VEX coding of the byte and word instructions?

Okay, I think I've pieced together enough information to make a decent answer. Here goes.
What will happen when native SSE2/4 code is run on Knights Landing (KNL)?
The code will run in the bottom fourth of the registers on a single VPU (called the compatibility layer) within a core. According to a pre-release webinar from Colfax, this means occupying only 1/4 to 1/8 of the total register space available to a core and running in legacy mode.
What happens if the same code is recompiled with compiler flags for AVX-512F?
SSE2/4 code will be generated with VEX prefix. That means pshufb becomes vpshufb and works with other AVX code in ymm. Instructions will NOT be promoted to AVX512's native EVEX or allowed to address the new zmm registers specifically. Instructions can only be promoted to EVEX with AVX512-VL, in which case they gain the ability to directly address (renamed) zmm registers. It is unknown whether register sharing is possible at this point, but pipelining on AVX2 has demonstrated similar throughput with half-width AVX2 (AVX-128) as with full 256-bit AVX2 code in many cases.
Most importantly, how do I get my SSE2/4/AVX128 byte/word size code running on AVX512F?
You'll have to load 128-bit chunks into xmm, sign/zero extend those bytes/words into 32-bit in zmm, and operate as if they were always larger integers. Then when finished, convert back to bytes/words.
Is this fast?
According to material published on Larrabee (Knights Landing's prototype), type conversions of any integer width are free from xmm to zmm and vice versa, so long as registers are available. Additionally, after calculations are performed, the 32-bit results can be truncated on the fly down to byte/word length and written (packed) to unaligned memory in 128-bit chunks, potentially saving an xmm register.
On KNL, each core has 2 VPUs that seem to be capable of talking to each other. Hence, 32-way 32-bit lookups are possible in a single vperm*2d instruction of presumably reasonable throughput. This is not possible even with AVX2, which can only permute within 128-bit lanes (or between lanes for the 32-bit vpermd only, which is inapplicable to byte/word instructions). Combined with free type conversions, the ability to use masks implicitly with AVX512 (sparing the costly and register-intensive use of blendv or explicit mask generation), and the presence of more comparators (native NOT, unsigned/signed lt/gt, etc), it may provide a reasonable performance boost to rewrite SSE2/4 byte/word code for AVX512F after all. At least on KNL.
Don't worry, I'll test the moment I get my hands on mine. ;-)

Any preference to SHUFPD or PSHUFD for reversing two packed double in an XMM?

Question today is fairly short. Consider the following toy C program shuffle.c for reversing two packed double in register xmm0:
#include <stdio.h>
void main () {
double x[2] = {0.0, 1.0};
asm volatile (
"movupd (%[x]), %%xmm0\n\t"
"shufpd $1, %%xmm0, %%xmm0\n\t" /* method 1 */
//"pshufd $78, %%xmm0, %%xmm0\n\t" /* method 2 */
"movupd %%xmm0, (%[x])\n\t"
:
: [x] "r" (x)
: "xmm0", "memory");
printf("x[0] = %.2f, x[1] = %.2f\n", x[0], x[1]);
}
After a dry run: gcc -msse3 -o shuffle shuffle.c | ./test, both methods/instructions will return the correct result x[0] = 1.00, x[1] = 0.00. This page says that shufpd has a latency of 6 cycles, while the intel intrinsic guide says that pshufd only has a latency of 1 cycles. This sounds like great preference to pshufd. However, This instruction is truly for packed integers. When using it for packed doubles, will there be any penalty associated with "wrong type"?
As a similar question, I also heard that instruction movaps is 1-byte smaller than movapd, and they do the same thing by reading 128bits from a 16-bit aligned address. So can we always use the former for move (between XMMs) / load (from memory) / store (to memory)? This seems crazy. I think there must be some reason to reject this. Can someone give me an explanation? Thank you.

You'll always get correct results, but it can matter for performance.
Prefer FP shuffles for FP data that will be an input to FP math instructions (like addps or vfma..., as opposed to insns like xorps).
This avoids any extra bypass-delay latency on some microarchitectures, including potentially current Intel chips. See Agner Fog's microarchitecture guide. AMD Bulldozer-family does all shuffles in the vector-integer domain, so there's a bypass delay whichever shuffle you use.
If it saves instructions, it can be worth it to use an integer shuffle anyway. (But usually it's the other way around, where you want to use shufps to combine data from two integer vectors. That's fine in even more cases, and mostly a problem only on Nehalem, IIRC.)
http://x86.renejeschke.de/html/file_module_x86_id_293.html lists the latency for CPUID 0F3n/0F2n CPUs, i.e. Pentium4 (family 0xF model 2 (Northwood) / model 3 (Prescott)). Those numbers are obviously totally irrelevant, and don't even match Agner Fog's P4 table for shufpd.
Intel's intrinsics guide sometimes has numbers that don't match experimental testing, either. See Agner Fog's instruction tables for good latency/throughput numbers, and microarch guides to understand the details.
movaps vs. movapd: No existing microarchitectures care which you use. It would be possible for someone in the future to design an x86 CPU that kept double vectors separate from float vectors internally, but for now the only distinction has been int vs. FP.
Always prefer the ps instruction when the behaviour is identical (xorps over xorpd, movhps over movhpd).
Some compilers (maybe both gcc and clang, I forget) will compile a _mm_store_si128 integer vector store to movaps, because there's no performance downside on any existing hardware, and it's one byte shorter.
IIRC, there's also no perf downside to loading integer vector data with movaps / movups, but I'm less sure about that.
There is a perf downside to using the wrong mov instruction for a reg-reg move, though. movdqa xmm1, xmm2 between two FP instructions is bad on Nehalem.
re: your inline asm:
It doesn't need to be volatile, and you could drop the "memory" clobber if you used a 16 byte struct or something as a "+m" input/output operand. Or a "+x" vector-register operand for an __m128d variable.
You'll probably get better results from intrinsics than from inline asm, unless you write whole loops in inline asm or stand-alone functions.
See the x86 tag wiki for a link to my inline asm guide.

What's the difference between logical SSE intrinsics?

Is there any difference between logical SSE intrinsics for different types? For example if we take OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128 all of which do the same thing: compute bitwise OR of their operands. My questions:
Is there any difference between using one or another intrinsic (with appropriate type casting). Won't there be any hidden costs like longer execution in some specific situation?
These intrinsics maps to three different x86 instructions (por, orps, orpd). Does anyone have any ideas why Intel is wasting precious opcode space for several instructions which do the same thing?

Is there any difference between using one or another intrinsic (with appropriate type casting). Won't there be any hidden costs like longer execution in some specific situation?
Yes, there can be performance reasons to choose one vs. the other.
1: Sometimes there is an extra cycle or two of latency (forwarding delay) if the output of an integer execution unit needs to be routed to the input of an FP execution unit, or vice versa. It takes a LOT of wires to move 128b of data to any of many possible destinations, so CPU designers have to make tradeoffs, like only having a direct path from every FP output to every FP input, not to ALL possible inputs.
See this answer, or Agner Fog's microarchitecture doc for bypass-delays. Search for "Data bypass delays on Nehalem" in Agner's doc; it has some good practical examples and discussion. He has a section on it for every microarch he has analysed.
However, the delays for passing data between the
different domains or different types of registers are smaller on the
Sandy Bridge and Ivy Bridge than on the Nehalem, and often zero. --
Agner Fog's micro arch doc
Remember that latency doesn't matter if it isn't on the critical path of your code (except sometimes on Haswell/Skylake where it infects later use of the produced value, long after actual bypass :/). Using pshufd instead of movaps + shufps can be a win if uop throughput is your bottleneck, rather than latency of your critical path.
2: The ...ps version takes 1 fewer byte of code than the other two for legacy-SSE encoding. (Not AVX). This will align the following instructions differently, which can matter for the decoders and/or uop cache lines. Generally smaller is better for better code density in I-cache and fetching code from RAM, and packing into the uop cache.
3: Recent Intel CPUs can only run the FP versions on port5.
Merom (Core2) and Penryn: orps can run on p0/p1/p5, but integer-domain only. Presumably all 3 versions decoded into the exact same uop. So the cross-domain forwarding delay happens. (AMD CPUs do this too: FP bitwise instructions run in the ivec domain.)
Nehalem / Sandybridge / IvB / Haswell / Broadwell: por can run on p0/p1/p5, but orps can run only on port5. p5 is also needed by shuffles, but the FMA, FP add, and FP mul units are on ports 0/1.
Skylake: por and orps both have 3-per-cycle throughput. Intel's optimization manual has some info about bypass forwarding delays: to/from FP instructions it depends on which port the uop ran on. (Usually still port 5 because the FP add/mul/fma units are on ports 0 and 1.) See also Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says - "bypass" latency can affect every use of the register until it's overwritten.
Note that on SnB/IvB (AVX but not AVX2), only p5 needs to handle 256b logical ops, as vpor ymm, ymm requires AVX2. This was probably not the reason for the change, since Nehalem did this.
How to choose wisely:
Keep in mind that compilers can use por for _mm_or_pd if they want, so some of this applies mostly to hand-written asm. But some compilers are somewhat faithful to the intrinsics you choose.
If logical op throughput on port5 could be a bottleneck, then use the integer versions, even on FP data. This is especially true if you want to use integer shuffles or other data-movement instructions.
AMD CPUs always use the integer domain for logicals, so if you have multiple integer-domain things to do, do them all at once to minimize round-trips between domains. Shorter latencies will get things cleared out of the reorder buffer faster, even if a dep chain isn't the bottleneck for your code.
If you just want to set/clear/flip a bit in FP vectors between FP add and mul instructions, use the ...ps logicals, even on double-precision data, because single and double FP are the same domain on every CPU in existence, and the ...ps versions are one byte shorter (without AVX).
There are practical / human-factor reasons for using the ...pd versions, though, with intrinsics. Readability of your code by other humans is a factor: They'll wonder why you're treating your data as singles when it's actually doubles. For C/C++ intrinsics, littering your code with casts between __m128 and __m128d is not worth it. (And hopefully a compiler will use orps for _mm_or_pd anyway, if compiling without AVX where it will actually save a byte.)
If tuning on the level of insn alignment matters, write in asm directly, not intrinsics! (Having the instruction one byte longer might align things better for uop cache line density and/or decoders, but with prefixes and addressing modes you can extend instructions in general)
For integer data, use the integer versions. Saving one instruction byte isn't worth the bypass-delay between paddd or whatever, and integer code often keeps port5 fully occupied with shuffles. For Haswell, many shuffle / insert / extract / pack / unpack instructions became p5 only, instead of p1/p5 for SnB/IvB. (Ice Lake finally added a shuffle unit on another port for some more common shuffles.)
These intrinsics maps to three different x86 instructions (por, orps,
orpd). Does anyone have any ideas why Intel is wasting precious opcode
space for several instructions which do the same thing?
If you look at the history of these instruction sets, you can kind of see how we got here.
por (MMX): 0F EB /r
orps (SSE): 0F 56 /r
orpd (SSE2): 66 0F 56 /r
por (SSE2): 66 0F EB /r
MMX existed before SSE, so it looks like opcodes for SSE (...ps) instructions were chosen out of the same 0F xx space. Then for SSE2, the ...pd version added a 66 operand-size prefix to the ...ps opcode, and the integer version added a 66 prefix to the MMX version.
They could have left out orpd and/or por, but they didn't. Perhaps they thought that future CPU designs might have longer forwarding paths between different domains, and so using the matching instruction for your data would be a bigger deal. Even though there are separate opcodes, AMD and early Intel treated them all the same, as int-vector.
Related / near duplicate:
What is the point of SSE2 instructions such as orpd? also summarizes the history. (But I wrote it 5 years later.)
Difference between the AVX instructions vxorpd and vpxor
Does using mix of pxor and xorps affect performance?

According to Intel and AMD optimization guidelines mixing op types with data types produces a performance hit as the CPU internally tags 64 bit halves of the register for a particular data type. This seems to mostly effect pipe-lining as the instruction is decoded and the uops are scheduled. Functionally they produce the same result. The newer versions for the integer data types have larger encoding and take up more space in the code segment. So if code size is a problem use the old ops as these have smaller encoding.

I think all three are effectively the same, i.e. 128 bit bitwise operations. The reason different forms exist is probably historical, but I'm not certain. I guess it's possible that there may be some additional behaviour in the floating point versions, e.g. when there are NaNs, but this is pure guesswork. For normal inputs the instructions seem to be interchangeable, e.g.
#include <stdio.h>
#include <emmintrin.h>
#include <pmmintrin.h>
#include <xmmintrin.h>
int main(void)
{
__m128i a = _mm_set1_epi32(1);
__m128i b = _mm_set1_epi32(2);
__m128i c = _mm_or_si128(a, b);
__m128 x = _mm_set1_ps(1.25f);
__m128 y = _mm_set1_ps(1.5f);
__m128 z = _mm_or_ps(x, y);
printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
printf("x = %vf, y = %vf, z = %vf\n", x, y, z);
c = (__m128i)_mm_or_ps((__m128)a, (__m128)b);
z = (__m128)_mm_or_si128((__m128i)x, (__m128i)y);
printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
printf("x = %vf, y = %vf, z = %vf\n", x, y, z);
return 0;
}
Terminal:
$ gcc -Wall -msse3 por.c -o por
$ ./por
a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000
a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight