How to extract bytes from an SSE2 __m128i structure? - c

I'm a beginner with SIMD intrinsics, so I'll thank everyone for their patience in advance. I have an application involving absolute difference comparison of unsigned bytes (I'm working with greyscale images).
I tried AVX, more modern SSE versions etc, but eventually decided SSE2 seems sufficient and has the most support for individual bytes - please correct me if I'm wrong.
I have two questions: first, what's the right way to load 128-bit registers? I think I'm supposed to pass the load intrinsics data aligned to multiples of 128, but will that work with 2D array code like this:
greys = aligned_alloc(16, xres * sizeof(int8_t*));
for (uint32_t x = 0; x < xres; x++)
{
greys[x] = aligned_alloc(16, yres * sizeof(int8_t*));
}
(The code above assumes xres and yres are the same, and are powers of two). Does this turn into a linear, unbroken block in memory? Could I then, as I loop, just keep passing addresses (incrementing them by 128) to the SSE2 load intrinsics? Or does something different need to be done for 2D arrays like this one?
My second question: once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i ? Looking through the Intel Intrinsics Guide, instructions that convert a vector type to a scalar one are rare. The closest I've found is int _mm_movemask_epi8 (__m128i a) but I don't quite understand how to use it.
Oh, and one third question - I assumed _mm_load_si128 only loads signed bytes? And I couldn't find any other byte loading function, so I guess you're just supposed to subtract 128 from each and account for it later?
I know these are basic questions for SIMD experts, but I hope this one will be useful to beginners like me. And if you think my whole approach to the application is wrong, or I'd be better off with more modern SIMD extensions, I'd love to know. I'd just like to humbly warn I've never worked with assembly and all this bit-twiddling stuff requires a lot of explication if it's to help me.
Nevertheless, I'm grateful for any clarification available.
In case it makes a difference: I'm targeting a low-power i7 Skylake architecture. But it'd be nice to have the application run on much older machines too (hence SSE2).

Least obvious question first:
once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i
Extract the low 64 bits to an integer with int64_t _mm_cvtsi128_si64x(__m128i), or the low 32 bits with int _mm_cvtsi128_si32 (__m128i a).
If you want other parts of the vector, not the low element, your options are:
Shuffle the vector to create a new __m128i with the data you want in the low element, and use the cvt intrinsics (MOVD or MOVQ in asm).
Use SSE2 int _mm_extract_epi16 (__m128i a, int imm8), or the SSE4.1 similar instructions for other element sizes such as _mm_extract_epi64(v, 1); (PEXTRB/W/D/Q) are not the fastest instructions, but if you only need one high element, they're about equivalent to a separate shuffle and MOVD, but smaller machine code.
_mm_store_si128 to an aligned temporary array and access the members: compilers will often optimize this into just a shuffle or pextr* instruction if you compile with -msse4.1 or -march=haswell or whatever. print a __m128i variable shows an example, including Godbolt compiler output showing _mm_store_si128 into an alignas(16) uint64_t tmp[2]
Or use union { __m128i v; int64_t i64[2]; } or something. Union-based type punning is legal in C99, but only as an extension in C++. This is compiles the same as a tmp array, and is generally not easier to read.
An alternative to the union that would also work in C++ would be memcpy(&my_int64_local, 8 + (char*)my_vector, 8); to extract the high half, but that seems more complicated and less clear, and more likely to be something a compiler wouldn't "see through". Compilers are usually pretty good about optimizing away small fixed-size memcpy when it's an entire variable, but this is just half of the variable.
If the whole high half of a vector can go directly into memory unmodified (instead of being needed in an integer register), a smart compiler might optimize to use MOVHPS to store the high half of a __m128i with the above union stuff.
Or you can use _mm_storeh_pi((__m64*)dst, _mm_castsi128_ps(vec)). That only requires SSE1, and is more efficient than SSE4.1 pextrq on most CPUs. But don't do this for a scalar integer you're about to use again right away; if SSE4.1 isn't available it's likely the compiler will actually MOVHPS and integer reload, which usually isn't optimal. (And some compilers like MSVC don't optimize intrinsics.)
Does this turn into a linear, unbroken block in memory?
No, it's an array of pointers to separate blocks of memory, introducing an extra level of indirection vs. a proper 2D array. Don't do that.
Make one large allocation, and do the index calculation yourself (using array[x*yres + y]).
And yes, load data from it with _mm_load_si128, or loadu if you need to load from an offset.
assumed _mm_load_si128 only loads signed bytes
Signed or unsigned isn't an inherent property of a byte, it's only how you interpret the bits. You use the same load intrinsic for loading two 64-bit elements, or a 128-bit bitmap.
Use intrinsics that are appropriate for your data. It's a little bit like assembly language: everything is just bytes, and the machine will do what you tell it with your bytes. It's up to you to choose a sequence of instructions / intrinsics that produces meaningful results.
The integer load intrinsics take __m128i* pointer args, so you have to use _mm_load_si128( (const __m128i*) my_int_pointer ) or similar. This looks like pointer aliasing (e.g. reading an array of int through a short *), which is Undefined Behaviour in C and C++. However, this is how Intel says you're supposed to do it, so any compiler that implements Intel's intrinsics is required to make this work correctly. gcc does so by defining __m128i with __attribute__((may_alias)).
See also Loading data for GCC's vector extensions which points out that you can use Intel intrinsics for GNU C native vector extensions, and shows how to load/store.
To learn more about SIMD with SSE, there are some links in the sse tag wiki, including some intro / tutorial links.
The x86 tag wiki has some good x86 asm / performance links.

Related

How to best emulate the logical meaning of _mm_slli_si128 (128-bit bit-shift), not _mm_bslli_si128

Looking through the intel intrinsics guide, I saw this instruction. Looking through the naming pattern, the meaning should be clear: "Shift 128-bit register left by a fixed number of bits", but it is not. In actuality it shifts by a fixed number of bytes, which makes it exactly the same as _mm_bslli_si128.
Is this an oversight? Shouldn't it be shifting by bits like _mm_slli_epi32 or _mm_slli_epi64?
If not, in which situation should I use this over _mm_bslli_si128?
Is there an assembly instruction which does this correctly?
What is the best way of emulating this with smaller shifts?
1 that’s not an oversight. That instruction indeed shifts by bytes, i.e. multiples of 8 bits.
2 doesn’t matter, _mm_slli_si128 and _mm_bslli_si128 are equivalents, both compile into pslldq SSE2 instruction.
As for the emulation, I’d do it like that, assuming you have C++/17. If you’re writing C++/14, replace if constexpr with normal if, also add a message to the static_assert.
template<int i>
inline __m128i shiftLeftBits( __m128i vec )
{
static_assert( i >= 0 && i < 128 );
// Handle couple trivial cases
if constexpr( 0 == i )
return vec;
if constexpr( 0 == ( i % 8 ) )
return _mm_slli_si128( vec, i / 8 );
if constexpr( i > 64 )
{
// Shifting by more than 8 bytes, the lowest half will be all zeros
vec = _mm_slli_si128( vec, 8 );
return _mm_slli_epi64( vec, i - 64 );
}
else
{
// Shifting by less than 8 bytes.
// Need to propagate a few bits across 64-bit lanes.
__m128i low = _mm_slli_si128( vec, 8 );
__m128i high = _mm_slli_epi64( vec, i );
low = _mm_srli_epi64( low, 64 - i );
return _mm_or_si128( low, high );
}
}
TL:DR: They're synonyms; the bslli name is newer, introduced around the same time as new AVX-512 intrinsics (sometime before 2015, long after SSE2 _mm_slli_si128 was in widespread usage). I find it clearer and would recommend it for new development.
SSE/AVX2/AVX-512 do not have bit-shifts with element sizes wider than 64. (Or any other bit-granularity operation like add, except pure-vertical bitwise boolean stuff that's really 128 fully separate operations, not one big wide one. Or for AVX-512 masking and broadcast-load purposes, can be in dword or qword chunks like _mm512_xor_epi32 / vpxord)
You have to emulate it somehow, which can be fairly efficient for compile-time-constant counts so you can pick between strategies according to c >= 64, with special cases for c%8 reducing to a byte-shift. Existing SO Q&As cover that, or see #Soonts' answer on this Q.
Runtime-variable counts would suck; you'd have to branch or do both ways and blend, unlike for element bit-shifts where _mm_sll_epi64(v, _mm_cvtsi32_si128(i)) can compile to movd / psllq xmm, xmm. Unfortunately, hardware variable-count versions of byte-shuffle/shift instructions don't exist, only for the bit-shift versions.
bslli / bsrli are new, clearer intrinsic names for the same asm instructions
The b names are supported in current version of all 4 major compilers for x86 (Godbolt), and I'd recommend them for new development unless you need backwards compat with crusty old compilers, or for some reason you like the old name that doesn't both to distinguish it from different operations. (e.g. familiarity; if you don't want people to have to look up this newfangled name in the manual.)
gcc since 4.8
clang since 3.7
ICC since ICC13 or earlier, Godbolt doesn't have any older
MSVC since 19.14 or earlier, Godbolt doesn't have any older
If you check the intrinsics guide, _mm_slli_si128 is listed as an intrinsic for PSLLDQ, which is a byte shift. This is not a bug, just Intel's idea of a joke, or whatever process they used to choose names for intrinsics back in the SSE2 days. (There are only 2 hard problems in computer science: cache invalidation and naming things).
Asm mnemonics also use the same pattern of not making the byte-shuffle one look different from the bit-shifts. psllw xmm, 1 / pslld / psllq / pslldq. Again, you just have to know that 128-bit size is special, and must be a byte shuffle not a bit-shift, because x86 never has that. (Or you have to check the manual.)
The asm manual entry for pslldq in turn lists intrinsics for forms of it, interestingly only using the b name for the __m512i AVX-512BW version. When SSE2 and AVX2 were new, _mm_slli_si128 and _mm256_slli_si256 were the only names available, I think. Certainly it post-dates SSE2 intrinsics.
(Note that the si256 and si512 versions are just 2 or 4 copies of the 16-byte operation, not shifting bytes across 128-bit lanes; something a few other Q&As have asked for. This often makes AVX2 versions of shuffles like this and palignr a lot less useful than they'd otherwise be: either not worth using at all, or needing extra shuffles on top of it.)
I think this new bslli name was introduced when AVX-512 was new. Intel invented some new names for other intrinsics around that time, and the AVX-512 load/store intrinsics take void* instead of __m512i*, which is a major improvement to amount of noise in code, especially for C where implicit conversion to void* is allowed. (Creating a misaligned __m512i* is not actually a problem in C terms, but you couldn't deref it normally so it's a weird-looking thing to do.) So there was cleanup work happening on intrinsic naming then, and I think this was part of it.
(AVX-512 also gave Intel the chance to introduce some fairly bad names, like _mm_loadu_epi32(const void*) - you'd guess that's a strict-aliasing-safe way to do a 32-bit movd load, right? No, unfortunately, it's an intrinsic for vmovdqu32 xmm, [mem] with no masking. It's just _mm_loadu_si128 with a different C type for the pointer arg. It's there for consistency with the naming pattern for _mm_maskz_loadu_epi32. It would be nice to have void* load / store intrinsics for __m128i and __m256i, but if they have misleading names like that (esp. when you aren't using the mask/maskz versions in nearby code), I'll just stick to those cumbersome _mm256_loadu_si256( (const __m256i*)(arr + i) ) casts for the old intrinsic, because I love typing 256 three times. >.<
I wish asm was more maintainable (or that intrinsics just used asm mnemonics) because it's much more concise; Intel generally does a good job naming their mnemonics.
It somewhat but not entirely helps to note the difference between epi16/32/64 and si128: EPI = Extended (SSE instead of MMX) Packed Integer. (Packed implying multiple SIMD elements). si128 means a whole 128-bit integer vector.
There's no way to infer from the name that you aren't just doing the same thing to a single 128-bit integer, instead of packed elements. You just have to know that there are no bit-granularity things that ever cross 64-bit boundaries, only SIMD shuffles (which work in terms of bytes). This avoids the combinatorial explosion of building a really wide barrel shifter, or of carry propagation at such a long distance for a 128-bit add, or whatever.

What is the difference between loadu/lddqu and assignment operator?

I am using SIMD vector to do some computations, and curious the difference of them, as followings.
__m128i vector2 = vector1;
__m128i vector2 = _mm_loadu_si128(&vector1);
So, what's the difference of above two statements?
Like Peter Cordes said in his comment, if vector1 really is a __m128i it's just unnecessarily complicated.
However, that's not the use case for _mm_loadu_si128. While the type of the argument is __m128i const*, that's really more about a lack of good options and poor decisions.
_mm_loadu_si128 is really meant to load any 16 bytes of data into a vector register. If you want to load data which is already aligned to 16 byte boundaries you should use _mm_load_si128 instead. If your data isn't aligned to 16 byte boundaries it's not really a __m128i, so the type of the parameter is misleading at best.
The reason Intel (or whoever) chose to use __m128i const* isn't completely clear, but to be honest there aren't a lot of good options. int const* wouldn't really make sense because what we're trying to load isn't always 32-bit signed integers. They could have made an _mm_loadu_epi8, _mm_loadu_epi16, _mm_loadu_epi32, etc., but even that wouldn't be quite right because the data doesn't need to be aligned to _Alignof(int), _Alignof(short), etc., though char actually work well here.
The right choice would probably be to make the argument void*, but I guess Intel wanted to signal that they really wanted 16 bytes of data. char mem_addr[16] would be okay in C99+, but not in C++, and while SSE2 came out at around the same time as C99 many compilers didn't support C99 (MSVC still doesn't!).
Basically, for this function, ignore the types of the parameters and read the description.

How to do an indirect load (gather-scatter) in AVX or SSE instructions?

I've been searching for a while now, but can't seem to find anything useful in the documentation or on SO. This question didn't really help me out, since it makes references to modifying the assembly and I am writing in C.
I have some code making indirect accesses that I want to vectorize.
for (i = 0; i < LENGTH; ++i) {
foo[bar[i]] *= 2;
}
Since I have the indices I want to double inside bar, I was wondering if there was a way to load those indices of foo into a vector register and then I could apply my math and store it back to the same indices.
Something like the following. The load and store instructions I just made up because I couldn't find anything like them in the AVX or SSE documentation. I think I read somewhere that AVX2 has similar functions, but the processor I'm working with doesn't support AVX2.
for (i = 0; i < LENGTH; i += 8) {
// For simplicity, I'm leaving out any pointer type casting
__m256 ymm0 = _mm256_load_indirect(bar+i);
__m256 ymm1 = _mm256_set1_epi32(2); // Set up vector of just 2's
__m256 ymm2 = _mm256_mul_ps(ymm0, ymm1);
_mm256_store_indirect(ymm2, bar+i);
}
Are there any instructions in AVX or SSE that will allow me to load a vector register with an array of indices from a different array? Or any "hacky" ways around it if there isn't an explicit function?
(I' writing an answer to this old question as I think it may help others.)
Short answer
No. There are no scatter/gather instructions in the SSE and AVX instruction sets.
Longer answer
Scatter/gather instructions are expensive to implement (in terms of complexity and silicon real estate) because scatter/gather mechanism needs to be deeply intertwined with the cache memory controller. I believe this is the reason that this functionality was missing from SSE/AVX.
For newer instruction sets the situation is different. In AVX2 you have
VGATHERDPD, VGATHERDPS, VGATHERQPD, VGATHERQPS for floating point gather (intrinsics here)
VPGATHERDD, VPGATHERQD, VPGATHERDQ, VPGATHERQQ for integer gather (intrinsics here)
In AVX-512 we got
VSCATTERDPD, VSCATTERDPS, VSCATTERQPD, VSCATTERQPS for floating point scatter
(intrinsics here)
VPSCATTERDD, VPSCATTERQD, VPSCATTERDQ, VPSCATTERQQ for integer scatter (intrinsics here)
However, it is still a question whether using scatter/gather for such a simple operation would actually pay off.

How to align 16-bit ints for use with SSE intrinsics

I am working with two-dimensional arrays of 16-bit integers defined as
int16_t e[MAX_SIZE*MAX_NODE][MAX_SIZE];
int16_t C[MAX_SIZE][MAX_SIZE];
Where Max_SIZE and MAX_NODE are constant values. I'm not a professional programmer, but somehow with the help of people in StackOverflow I managed to write a piece of code that deploys SSE instruction on my data and achieved a significant speed-up. Currently, I am using the intrinsics that do not require data alignment (mainly _mm_loadu_si128 and _mm_storeu_si128).
for (b=0; b<n; b+=8){
v1 = _mm_loadu_si128((__m128i*)&C[level][b]); // level defined elsewhere.
v2 = _mm_loadu_si128((__m128i*)&e1[node][b]); // node defined elsewhere.
v3 = _mm_and_si128(v1,v2);
_mm_storeu_si128((__m128i*)&C[level+1][b],v3);
}
When I change the intrinsics to their counterparts for aligned data (i.e. _mm_load_si128 and _mm_store_si128), I get run-time errors, which leads me to the assumption that my data is not aligned properly.
My question is now, if my data is not aligned properly, how can I align it to be able to use the corresponding intrinsics? I'd think since the integers are 16 bits, they're automatically aligned. But I seem to be wrong!
Any insight on this will be highly appreciated.
Thanks!
SSE needs data to be aligned on 16 bytes boundary, not 16 bits, that's your problem.
What you're looking for to align your static arrays is compiler dependent.
If you're using MSVC, you'll have to use __declspec(align(16)), or with GCC, this would be __attribute__((aligned (16))).

What's the most efficient way to load and extract 32 bit integer values from a 128 bit SSE vector?

I'm trying to optimize my code using SSE intrinsics but am running into a problem where I don't know of a good way to extract the integer values from a vector after I've done the SSE intrinsics operations to get what I want.
Does anyone know of a good way to do this? I'm programming in C and my compiler is gcc version 4.3.2.
Thanks for all your help.
It depends on what you can assume about the minimum level of SSE support that you have.
Going all the way back to SSE2 you have _mm_extract_epi16 (PEXTRW) which can be used to extract any 16 bit element from a 128 bit vector. You would need to call this twice to get the two halves of a 32 bit element.
In more recent versions of SSE (SSE4.1 and later) you have _mm_extract_epi32 (PEXTRD) which can extract a 32 bit element in one instruction.
Alternatively if this is not inside a performance-critical loop you can just use a union, e.g.
typedef union
{
__m128i v;
int32_t a[4];
} U32;
_mm_extract_epi32
The extract intrinsics is indeed the best option but if you need to support SSE2, I'd recommend this:
inline int get_x(const __m128i& vec){return _mm_cvtsi128_si32 (vec);}
inline int get_y(const __m128i& vec){return _mm_cvtsi128_si32 (_mm_shuffle_epi32(vec,0x55));}
inline int get_z(const __m128i& vec){return _mm_cvtsi128_si32 (_mm_shuffle_epi32(vec,0xAA));}
inline int get_w(const __m128i& vec){return _mm_cvtsi128_si32 (_mm_shuffle_epi32(vec,0xFF));}
I've found that if you reinterpret_cast/union the vector to any int[4] representation the compiler tends to flush things back to memory (which may not be that bad) and reads it back as an int, though I haven't looked at the assembly to see if the latest versions of the compilers generate better code.

Resources