Fastest way to shift 32 bits right on a __m128 (Intel Intrinsics) - c

I have a 128-bit variable holding 4 separate integers: [1,2,3,4]. I want to shift right so that I get [2,3,4,0]. What's the fastest way to do this?
My current code:
__m128 v1;
v1 = (__m128)_mm_srli_si128( _mm_castps_si128(v1) , 4 );
This succeeds in shifting the bits, but I am going for speed and cache optimization (i.e. as few variables as possible). Is there any way to improve this code to avoid casting to and from a __m128i?
thanks

Don't worry about it. __m128 and __m128i are two different ways of interpreting the contents of an XMM register, so the cast disappears in compilation. My compiler (clang on Mac OS 10.9) compiles the whole thing down to a single instruction as it stands:
psrldq $0x4, %xmm0
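For reference, here is a minimal sketch of the same shift written with the dedicated cast intrinsics instead of the C-style (__m128) cast, which some compilers reject. The helper name shift_right_one_element is hypothetical and not part of the original code.

```cpp
#include <emmintrin.h>  // SSE2: _mm_srli_si128 and the cast intrinsics

// Hypothetical helper name. Shifts the whole register right by 4 bytes,
// so element 0 is dropped and a zero enters at the top.
inline __m128 shift_right_one_element(__m128 v)
{
    __m128i bits = _mm_castps_si128(v);  // reinterpretation only, no instruction
    bits = _mm_srli_si128(bits, 4);      // psrldq by 4 bytes
    return _mm_castsi128_ps(bits);       // reinterpretation only, no instruction
}
```

The two casts only change the type the compiler sees, so this should still come out as a single psrldq.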

Related

How to best emulate the logical meaning of _mm_slli_si128 (128-bit bit-shift), not _mm_bslli_si128

Looking through the Intel intrinsics guide, I saw this instruction. Going by the naming pattern, the meaning should be clear: "Shift 128-bit register left by a fixed number of bits", but it is not. In actuality it shifts by a fixed number of bytes, which makes it exactly the same as _mm_bslli_si128.
Is this an oversight? Shouldn't it be shifting by bits like _mm_slli_epi32 or _mm_slli_epi64?
If not, in which situation should I use this over _mm_bslli_si128?
Is there an assembly instruction which does this correctly?
What is the best way of emulating this with smaller shifts?
1. That's not an oversight. That instruction indeed shifts by bytes, i.e. multiples of 8 bits.
2. It doesn't matter: _mm_slli_si128 and _mm_bslli_si128 are equivalent, and both compile into the pslldq SSE2 instruction.
As for the emulation, I'd do it like this, assuming you have C++17. If you're writing C++14, replace if constexpr with a normal if, and add a message to the static_assert.
template<int i>
inline __m128i shiftLeftBits( __m128i vec )
{
    static_assert( i >= 0 && i < 128 );
    // Handle couple trivial cases
    if constexpr( 0 == i )
        return vec;
    if constexpr( 0 == ( i % 8 ) )
        return _mm_slli_si128( vec, i / 8 );
    if constexpr( i > 64 )
    {
        // Shifting by more than 8 bytes, the lowest half will be all zeros
        vec = _mm_slli_si128( vec, 8 );
        return _mm_slli_epi64( vec, i - 64 );
    }
    else
    {
        // Shifting by less than 8 bytes.
        // Need to propagate a few bits across 64-bit lanes.
        __m128i low = _mm_slli_si128( vec, 8 );
        __m128i high = _mm_slli_epi64( vec, i );
        low = _mm_srli_epi64( low, 64 - i );
        return _mm_or_si128( low, high );
    }
}
TL:DR: They're synonyms; the bslli name is newer, introduced around the same time as new AVX-512 intrinsics (sometime before 2015, long after SSE2 _mm_slli_si128 was in widespread usage). I find it clearer and would recommend it for new development.
SSE/AVX2/AVX-512 do not have bit-shifts with element sizes wider than 64. (Or any other bit-granularity operation like add, except pure-vertical bitwise boolean stuff that's really 128 fully separate operations, not one big wide one. Or for AVX-512 masking and broadcast-load purposes, can be in dword or qword chunks like _mm512_xor_epi32 / vpxord)
You have to emulate it somehow, which can be fairly efficient for compile-time-constant counts so you can pick between strategies according to c >= 64, with special cases for c%8 reducing to a byte-shift. Existing SO Q&As cover that, or see @Soonts' answer on this Q.
Runtime-variable counts would suck; you'd have to branch or do both ways and blend, unlike for element bit-shifts where _mm_sll_epi64(v, _mm_cvtsi32_si128(i)) can compile to movd / psllq xmm, xmm. Unfortunately, hardware variable-count versions of byte-shuffle/shift instructions don't exist, only for the bit-shift versions.
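To make the runtime-variable case concrete, here is a rough sketch of the branching strategy using only SSE2. The function name shift_left_bits_runtime is hypothetical, and it assumes the count is in the range 0..127. It relies on psllq / psrlq producing zero when the count is 64 or more.

```cpp
#include <emmintrin.h>  // SSE2

// Hypothetical helper: shift a whole 128-bit value left by a runtime
// bit count. Assumes count < 128.
inline __m128i shift_left_bits_runtime(__m128i v, unsigned count)
{
    // Low qword moved into the high qword; the low qword becomes zero.
    const __m128i moved = _mm_slli_si128(v, 8);

    if (count >= 64) {
        // The low half of the result is all zeros; only the old low qword
        // survives, shifted by the remaining count - 64 bits.
        return _mm_sll_epi64(moved, _mm_cvtsi32_si128(static_cast<int>(count - 64)));
    }

    // Shift both qwords independently, then OR in the bits that cross
    // from the low qword into the high qword.
    const __m128i shifted = _mm_sll_epi64(v, _mm_cvtsi32_si128(static_cast<int>(count)));
    // For count == 0 this shifts by 64, which yields zero, so nothing is carried.
    const __m128i carry = _mm_srl_epi64(moved, _mm_cvtsi32_si128(static_cast<int>(64 - count)));
    return _mm_or_si128(shifted, carry);
}
```

This costs a branch plus several instructions per call, which is why a compile-time count is preferable when one is available.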
bslli / bsrli are new, clearer intrinsic names for the same asm instructions
The b names are supported in current versions of all 4 major compilers for x86 (Godbolt), and I'd recommend them for new development unless you need backwards compat with crusty old compilers, or for some reason you like the old name that doesn't bother to distinguish it from different operations. (e.g. familiarity; if you don't want people to have to look up this newfangled name in the manual.)
gcc since 4.8
clang since 3.7
ICC since ICC13 or earlier, Godbolt doesn't have any older
MSVC since 19.14 or earlier, Godbolt doesn't have any older
If you check the intrinsics guide, _mm_slli_si128 is listed as an intrinsic for PSLLDQ, which is a byte shift. This is not a bug, just Intel's idea of a joke, or whatever process they used to choose names for intrinsics back in the SSE2 days. (There are only 2 hard problems in computer science: cache invalidation and naming things).
Asm mnemonics also use the same pattern of not making the byte-shuffle one look different from the bit-shifts. psllw xmm, 1 / pslld / psllq / pslldq. Again, you just have to know that 128-bit size is special, and must be a byte shuffle not a bit-shift, because x86 never has that. (Or you have to check the manual.)
The asm manual entry for pslldq in turn lists intrinsics for forms of it, interestingly only using the b name for the __m512i AVX-512BW version. When SSE2 and AVX2 were new, _mm_slli_si128 and _mm256_slli_si256 were the only names available, I think. The b name certainly post-dates the SSE2 intrinsics.
(Note that the si256 and si512 versions are just 2 or 4 copies of the 16-byte operation, not shifting bytes across 128-bit lanes; something a few other Q&As have asked for. This often makes AVX2 versions of shuffles like this and palignr a lot less useful than they'd otherwise be: either not worth using at all, or needing extra shuffles on top of it.)
I think this new bslli name was introduced when AVX-512 was new. Intel invented some new names for other intrinsics around that time, and the AVX-512 load/store intrinsics take void* instead of __m512i*, which is a major improvement to amount of noise in code, especially for C where implicit conversion to void* is allowed. (Creating a misaligned __m512i* is not actually a problem in C terms, but you couldn't deref it normally so it's a weird-looking thing to do.) So there was cleanup work happening on intrinsic naming then, and I think this was part of it.
(AVX-512 also gave Intel the chance to introduce some fairly bad names, like _mm_loadu_epi32(const void*) - you'd guess that's a strict-aliasing-safe way to do a 32-bit movd load, right? No, unfortunately, it's an intrinsic for vmovdqu32 xmm, [mem] with no masking. It's just _mm_loadu_si128 with a different C type for the pointer arg. It's there for consistency with the naming pattern for _mm_maskz_loadu_epi32. It would be nice to have void* load / store intrinsics for __m128i and __m256i, but if they have misleading names like that (esp. when you aren't using the mask/maskz versions in nearby code), I'll just stick to those cumbersome _mm256_loadu_si256( (const __m256i*)(arr + i) ) casts for the old intrinsic, because I love typing 256 three times. >.<)
I wish asm was more maintainable (or that intrinsics just used asm mnemonics) because it's much more concise; Intel generally does a good job naming their mnemonics.
It somewhat but not entirely helps to note the difference between epi16/32/64 and si128: EPI = Extended (SSE instead of MMX) Packed Integer. (Packed implying multiple SIMD elements). si128 means a whole 128-bit integer vector.
There's no way to infer from the name that you aren't just doing the same thing to a single 128-bit integer, instead of packed elements. You just have to know that there are no bit-granularity things that ever cross 64-bit boundaries, only SIMD shuffles (which work in terms of bytes). This avoids the combinatorial explosion of building a really wide barrel shifter, or of carry propagation at such a long distance for a 128-bit add, or whatever.

How to extract bytes from an SSE2 __m128i structure?

I'm a beginner with SIMD intrinsics, so I'll thank everyone for their patience in advance. I have an application involving absolute difference comparison of unsigned bytes (I'm working with greyscale images).
I tried AVX, more modern SSE versions etc, but eventually decided SSE2 seems sufficient and has the most support for individual bytes - please correct me if I'm wrong.
I have two questions: first, what's the right way to load 128-bit registers? I think I'm supposed to pass the load intrinsics data aligned to multiples of 128, but will that work with 2D array code like this:
greys = aligned_alloc(16, xres * sizeof(int8_t*));
for (uint32_t x = 0; x < xres; x++)
{
greys[x] = aligned_alloc(16, yres * sizeof(int8_t*));
}
(The code above assumes xres and yres are the same, and are powers of two). Does this turn into a linear, unbroken block in memory? Could I then, as I loop, just keep passing addresses (incrementing them by 128) to the SSE2 load intrinsics? Or does something different need to be done for 2D arrays like this one?
My second question: once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i ? Looking through the Intel Intrinsics Guide, instructions that convert a vector type to a scalar one are rare. The closest I've found is int _mm_movemask_epi8 (__m128i a) but I don't quite understand how to use it.
Oh, and one third question - I assumed _mm_load_si128 only loads signed bytes? And I couldn't find any other byte loading function, so I guess you're just supposed to subtract 128 from each and account for it later?
I know these are basic questions for SIMD experts, but I hope this one will be useful to beginners like me. And if you think my whole approach to the application is wrong, or I'd be better off with more modern SIMD extensions, I'd love to know. I'd just like to humbly warn I've never worked with assembly and all this bit-twiddling stuff requires a lot of explication if it's to help me.
Nevertheless, I'm grateful for any clarification available.
In case it makes a difference: I'm targeting a low-power i7 Skylake architecture. But it'd be nice to have the application run on much older machines too (hence SSE2).
Least obvious question first:
once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i
Extract the low 64 bits to an integer with int64_t _mm_cvtsi128_si64x(__m128i), or the low 32 bits with int _mm_cvtsi128_si32 (__m128i a).
If you want other parts of the vector, not the low element, your options are:
Shuffle the vector to create a new __m128i with the data you want in the low element, and use the cvt intrinsics (MOVD or MOVQ in asm).
Use SSE2 int _mm_extract_epi16 (__m128i a, int imm8), or the similar SSE4.1 intrinsics for other element sizes, such as _mm_extract_epi64(v, 1). These (PEXTRB/W/D/Q) are not the fastest instructions, but if you only need one high element, they're about equivalent to a separate shuffle and MOVD, with smaller machine code.
_mm_store_si128 to an aligned temporary array and access the members: compilers will often optimize this into just a shuffle or pextr* instruction if you compile with -msse4.1 or -march=haswell or whatever. print a __m128i variable shows an example, including Godbolt compiler output showing _mm_store_si128 into an alignas(16) uint64_t tmp[2]
Or use union { __m128i v; int64_t i64[2]; } or something. Union-based type punning is legal in C99, but only as an extension in C++. This compiles the same as a tmp array, and is generally not easier to read.
An alternative to the union that would also work in C++ would be memcpy(&my_int64_local, 8 + (char*)&my_vector, 8); to extract the high half, but that seems more complicated and less clear, and more likely to be something a compiler wouldn't "see through". Compilers are usually pretty good about optimizing away small fixed-size memcpy when it's an entire variable, but this is just half of the variable.
If the whole high half of a vector can go directly into memory unmodified (instead of being needed in an integer register), a smart compiler might optimize to use MOVHPS to store the high half of a __m128i with the above union stuff.
Or you can use _mm_storeh_pi((__m64*)dst, _mm_castsi128_ps(vec)). That only requires SSE1, and is more efficient than SSE4.1 pextrq on most CPUs. But don't do this for a scalar integer you're about to use again right away; if SSE4.1 isn't available it's likely the compiler will actually MOVHPS and integer reload, which usually isn't optimal. (And some compilers like MSVC don't optimize intrinsics.)
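As a rough illustration of the options listed above, the following sketch shows a few of them side by side using only SSE1/SSE2. The helper names (low_dword, word5, store_high_qword, to_bytes) are hypothetical and chosen just for this example.

```cpp
#include <emmintrin.h>  // SSE2 (also pulls in the SSE1 header for _mm_storeh_pi)
#include <cstdint>

// Low 32 bits of the vector: a single MOVD.
inline std::int32_t low_dword(__m128i v) { return _mm_cvtsi128_si32(v); }

// One 16-bit element chosen by a compile-time index: PEXTRW (SSE2).
// The result is zero-extended into the int.
inline int word5(__m128i v) { return _mm_extract_epi16(v, 5); }

// High 64 bits written straight to memory with MOVHPS (SSE1).
inline void store_high_qword(std::uint64_t* dst, __m128i v)
{
    _mm_storeh_pi(reinterpret_cast<__m64*>(dst), _mm_castsi128_ps(v));
}

// All 16 bytes copied into an ordinary array; storeu needs no alignment.
inline void to_bytes(__m128i v, std::uint8_t out[16])
{
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
```

For greyscale image work, to_bytes is probably the most natural fit, since the results usually end up in a byte array anyway.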
Does this turn into a linear, unbroken block in memory?
No, it's an array of pointers to separate blocks of memory, introducing an extra level of indirection vs. a proper 2D array. Don't do that.
Make one large allocation, and do the index calculation yourself (using array[x*yres + y]).
And yes, load data from it with _mm_load_si128, or loadu if you need to load from an offset.
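A minimal sketch of the single-allocation layout might look like the following. The GreyImage struct and load16 function are hypothetical names, and std::vector is used here only to keep the example short; it does not guarantee 16-byte alignment, which is why the sketch uses the unaligned load.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: one contiguous allocation, indexed as x * yres + y.
struct GreyImage {
    std::size_t xres, yres;
    std::vector<std::uint8_t> pixels;

    GreyImage(std::size_t xr, std::size_t yr) : xres(xr), yres(yr), pixels(xr * yr) {}

    std::uint8_t& at(std::size_t x, std::size_t y) { return pixels[x * yres + y]; }
};

// Load 16 consecutive pixels starting at (x, y).
// The caller must make sure 16 bytes are available from that position.
inline __m128i load16(const GreyImage& img, std::size_t x, std::size_t y)
{
    const std::uint8_t* p = img.pixels.data() + x * img.yres + y;
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}
```

Note that when looping, the address advances by 16 bytes per vector (128 bits), not by 128.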
assumed _mm_load_si128 only loads signed bytes
Signed or unsigned isn't an inherent property of a byte, it's only how you interpret the bits. You use the same load intrinsic for loading two 64-bit elements, or a 128-bit bitmap.
Use intrinsics that are appropriate for your data. It's a little bit like assembly language: everything is just bytes, and the machine will do what you tell it with your bytes. It's up to you to choose a sequence of instructions / intrinsics that produces meaningful results.
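Since the question is about absolute differences of unsigned bytes, here is one possible SSE2 sketch where the unsignedness lives entirely in the choice of arithmetic intrinsic, not in the load. The function name abs_diff_u8 is hypothetical; the technique shown is unsigned saturating subtraction in both directions, OR-ed together.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

// Hypothetical helper: out[i] = |a[i] - b[i]| for n unsigned bytes.
inline void abs_diff_u8(const std::uint8_t* a, const std::uint8_t* b,
                        std::uint8_t* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Unsigned saturating subtract clamps negative results to 0, so for
        // each byte one of the two differences is 0 and the other is |a - b|.
        const __m128i d = _mm_or_si128(_mm_subs_epu8(va, vb), _mm_subs_epu8(vb, va));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), d);
    }
    // Scalar tail for the last n % 16 bytes.
    for (; i < n; ++i)
        out[i] = static_cast<std::uint8_t>(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
}
```

No offset of 128 is needed anywhere: the same bytes loaded by _mm_loadu_si128 are treated as unsigned simply because _mm_subs_epu8 interprets them that way.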
The integer load intrinsics take __m128i* pointer args, so you have to use _mm_load_si128( (const __m128i*) my_int_pointer ) or similar. This looks like pointer aliasing (e.g. reading an array of int through a short *), which is Undefined Behaviour in C and C++. However, this is how Intel says you're supposed to do it, so any compiler that implements Intel's intrinsics is required to make this work correctly. gcc does so by defining __m128i with __attribute__((may_alias)).
See also Loading data for GCC's vector extensions which points out that you can use Intel intrinsics for GNU C native vector extensions, and shows how to load/store.
To learn more about SIMD with SSE, there are some links in the sse tag wiki, including some intro / tutorial links.
The x86 tag wiki has some good x86 asm / performance links.

How to align 16-bit ints for use with SSE intrinsics

I am working with two-dimensional arrays of 16-bit integers defined as
int16_t e[MAX_SIZE*MAX_NODE][MAX_SIZE];
int16_t C[MAX_SIZE][MAX_SIZE];
Where MAX_SIZE and MAX_NODE are constant values. I'm not a professional programmer, but somehow with the help of people on StackOverflow I managed to write a piece of code that uses SSE instructions on my data and achieved a significant speed-up. Currently, I am using the intrinsics that do not require data alignment (mainly _mm_loadu_si128 and _mm_storeu_si128).
for (b=0; b<n; b+=8){
v1 = _mm_loadu_si128((__m128i*)&C[level][b]); // level defined elsewhere.
v2 = _mm_loadu_si128((__m128i*)&e1[node][b]); // node defined elsewhere.
v3 = _mm_and_si128(v1,v2);
_mm_storeu_si128((__m128i*)&C[level+1][b],v3);
}
When I change the intrinsics to their counterparts for aligned data (i.e. _mm_load_si128 and _mm_store_si128), I get run-time errors, which leads me to the assumption that my data is not aligned properly.
My question is now, if my data is not aligned properly, how can I align it to be able to use the corresponding intrinsics? I'd think since the integers are 16 bits, they're automatically aligned. But I seem to be wrong!
Any insight on this will be highly appreciated.
Thanks!
SSE needs data to be aligned on a 16-byte boundary, not a 16-bit one; that's your problem.
What you're looking for to align your static arrays is compiler dependent.
If you're using MSVC, you'll have to use __declspec(align(16)), or with GCC, this would be __attribute__((aligned (16))).
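As a portable alternative to the compiler-specific attributes, C++11 alignas can be used. The sketch below is only an illustration: the value of MAX_SIZE, the second array's dimensions, and the function name and_row are assumptions. It also assumes n is a multiple of 8. Aligning the array only aligns its first row; every row start is 16-byte aligned only if a row is a multiple of 16 bytes, i.e. MAX_SIZE is a multiple of 8 for int16_t.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

constexpr std::size_t MAX_SIZE = 64;  // assumed value for this sketch
static_assert(MAX_SIZE % 8 == 0, "each row must be a multiple of 16 bytes");

// alignas(16) puts the start of each array on a 16-byte boundary.
alignas(16) static std::int16_t C[MAX_SIZE][MAX_SIZE];
alignas(16) static std::int16_t e1[MAX_SIZE][MAX_SIZE];

// Hypothetical helper: C[level + 1][b] = C[level][b] & e1[node][b],
// 8 elements at a time. Assumes n is a multiple of 8.
inline void and_row(std::size_t level, std::size_t node, std::size_t n)
{
    for (std::size_t b = 0; b < n; b += 8) {
        const __m128i v1 = _mm_load_si128(reinterpret_cast<const __m128i*>(&C[level][b]));
        const __m128i v2 = _mm_load_si128(reinterpret_cast<const __m128i*>(&e1[node][b]));
        _mm_store_si128(reinterpret_cast<__m128i*>(&C[level + 1][b]), _mm_and_si128(v1, v2));
    }
}
```

If MAX_SIZE cannot be made a multiple of 8, padding each row up to the next multiple is one option; otherwise the unaligned intrinsics remain the safe choice.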

Which one is faster?

I am using SSE2 with gcc 4.4.3. In my program, I need to use the least-significant 8 bits (bits 0 - 7) of a 128-bit SIMD register. Please suggest a way in which I can retrieve those 8 bits quickly.
I tried _mm_movepi64_pi64 and _mm_extract_epi16, both of which give similar performance in my program. I also tried the union approach: union{__m128i a1, int a2[4]}. Although it gave good results in the test case, this approach was not very good in my program.
Any ideas.. (which of the above mentioned three ways I should use?)
_mm_movepi64_pi64 moves from XMM to MMX registers. There's no way it's the right choice, unless you want to do some more SIMD in MMX registers, and your code runs out of XMM regs.
If you want the bits as an array index or something, they have to be in a GP register, in which case you want SSE4.1 _mm_extract_epi8.
If you need to stick to SSE2, this should be the fastest way to get byte 5 of xmm0:
pextrw eax, xmm0, 2
movzx eax, ah
So this should hopefully get the compiler to be efficient like that:
(uint8_t)(_mm_extract_epi16(var, n/2) >> ((n%2) * 8))
Less efficient would be a shift-by-bytes _mm_bsrli_si128 (psrldq) to put the byte you want into the low byte of an xmm reg, then movd (_mm_extract_epi16(var, 0) emits movd, not pextrw r32, xmm, 0, fortunately). This way you don't have to do anything extra if the byte you want is an odd-numbered byte that pextrw would leave in the high 8 of a result. Still no easy way to use this with an index that isn't a compile-time constant.
Storing 16B to memory and loading the element you want should be fairly good. (What you'll probably get with the union approach, unless the compiler optimizes it to a pextract instruction). The compiler will use a 16B-aligned location on the stack. Thus store->load forwarding should work fine in this case, so the latency will be low. If you need two separate elements into two separate integer variables, this is probably the best choice, maybe beating multiple pextrw.
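The two SSE2 approaches above could be wrapped roughly as follows. Both function names are hypothetical; the first takes the byte index as a template parameter because _mm_extract_epi16 needs a compile-time constant, and the second goes through memory so it can take a runtime index.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Hypothetical helper: byte N via PEXTRW plus a shift, N known at compile time.
template<int N>
inline std::uint8_t extract_byte_sse2(__m128i v)
{
    static_assert(N >= 0 && N < 16, "byte index out of range");
    // PEXTRW yields the zero-extended 16-bit word that contains the byte;
    // odd-numbered bytes sit in its upper 8 bits.
    return static_cast<std::uint8_t>(_mm_extract_epi16(v, N / 2) >> ((N % 2) * 8));
}

// Hypothetical helper: store all 16 bytes to an aligned temporary and index it.
// Works with a runtime index; the index is masked to stay inside the array.
inline std::uint8_t extract_byte_runtime(__m128i v, unsigned n)
{
    alignas(16) std::uint8_t tmp[16];
    _mm_store_si128(reinterpret_cast<__m128i*>(tmp), v);
    return tmp[n & 15];
}
```

Which one is faster in practice depends on the surrounding code and the compiler, so it is worth checking the generated asm for the actual use case.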

What's the most efficient way to load and extract 32 bit integer values from a 128 bit SSE vector?

I'm trying to optimize my code using SSE intrinsics but am running into a problem where I don't know of a good way to extract the integer values from a vector after I've done the SSE intrinsics operations to get what I want.
Does anyone know of a good way to do this? I'm programming in C and my compiler is gcc version 4.3.2.
Thanks for all your help.
It depends on what you can assume about the minimum level of SSE support that you have.
Going all the way back to SSE2 you have _mm_extract_epi16 (PEXTRW) which can be used to extract any 16 bit element from a 128 bit vector. You would need to call this twice to get the two halves of a 32 bit element.
In more recent versions of SSE (SSE4.1 and later) you have _mm_extract_epi32 (PEXTRD) which can extract a 32 bit element in one instruction.
Alternatively if this is not inside a performance-critical loop you can just use a union, e.g.
typedef union
{
__m128i v;
int32_t a[4];
} U32;
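The two-call PEXTRW approach mentioned above for plain SSE2 could be sketched like this. The function name extract_epi32_sse2 is hypothetical, and the element index is a template parameter because _mm_extract_epi16 requires a compile-time constant.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Hypothetical helper: rebuild 32-bit element I from its two 16-bit halves.
template<int I>
inline std::int32_t extract_epi32_sse2(__m128i v)
{
    static_assert(I >= 0 && I < 4, "element index out of range");
    // Each PEXTRW result is zero-extended, so the halves can be combined directly.
    const std::uint32_t lo = static_cast<std::uint32_t>(_mm_extract_epi16(v, 2 * I));
    const std::uint32_t hi = static_cast<std::uint32_t>(_mm_extract_epi16(v, 2 * I + 1));
    return static_cast<std::int32_t>(lo | (hi << 16));
}
```

With SSE4.1 available, _mm_extract_epi32 does the same job in one instruction.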
_mm_extract_epi32
The extract intrinsic is indeed the best option, but if you need to support SSE2, I'd recommend this:
inline int get_x(const __m128i& vec){return _mm_cvtsi128_si32 (vec);}
inline int get_y(const __m128i& vec){return _mm_cvtsi128_si32 (_mm_shuffle_epi32(vec,0x55));}
inline int get_z(const __m128i& vec){return _mm_cvtsi128_si32 (_mm_shuffle_epi32(vec,0xAA));}
inline int get_w(const __m128i& vec){return _mm_cvtsi128_si32 (_mm_shuffle_epi32(vec,0xFF));}
I've found that if you reinterpret_cast/union the vector to any int[4] representation the compiler tends to flush things back to memory (which may not be that bad) and reads it back as an int, though I haven't looked at the assembly to see if the latest versions of the compilers generate better code.
