Assembly intrinsic to do a masked load in C

#include <stdio.h>

int main()
{
    const int STRIDE = 2, SIZE = 8192;
    int i = 0;
    double u[SIZE][STRIDE];
    #pragma vector aligned
    for (i = 0; i < SIZE; i++)
    {
        u[i][STRIDE - 1] = i;
    }
    printf("%lf\n", u[7][STRIDE - 1]);
    return 0;
}
The compiler uses xmm registers here. There is stride-2 access, and I want to make the compiler ignore this, do a regular load from memory, and then mask out the alternate elements, so I would only be using 50% of each SIMD register. I need intrinsics that can be used to load the register, mask it bitwise, and then store it back to memory.
P.S.: I have never done assembly coding before.

That is, a masked store with a mask value of 0xAA (10101010).

You can't do a masked load (only a masked store). The easiest alternative would be to do a load and then mask it yourself (e.g. using intrinsics).
A potentially better alternative would be to change your array to "double u[STRIDE][SIZE];" so that you don't need to mask anything and don't end up with half an XMM register wasted/masked.
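If you do keep the current layout, here is a minimal SSE2 sketch of the "load, mask it yourself, then store" approach suggested above. It is only illustrative: the mask constant, the use of C11 _Alignas, and the blend-by-AND/OR are my own choices, not anything from the question, and lane 0 of each pair is simply preserved (the original loop never writes it either).

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

enum { STRIDE = 2, SIZE = 8192 };

int main(void)
{
    /* static so the array is zero-initialized; _Alignas(16) so the aligned
       load/store intrinsics are legal (assumes a C11 compiler). */
    static _Alignas(16) double u[SIZE][STRIDE];

    /* Bitwise mask: all-ones in the high double (lane 1), zero in the
       low double (lane 0). _mm_set_epi32 takes elements high to low. */
    const __m128d keep_hi = _mm_castsi128_pd(_mm_set_epi32(-1, -1, 0, 0));

    for (int i = 0; i < SIZE; i++) {
        __m128d old_pair = _mm_load_pd(&u[i][0]);    /* {u[i][0], u[i][1]} */
        __m128d new_val  = _mm_set1_pd((double)i);   /* i in both lanes */
        /* blended = (new_val & mask) | (old_pair & ~mask) */
        __m128d blended  = _mm_or_pd(_mm_and_pd(new_val, keep_hi),
                                     _mm_andnot_pd(keep_hi, old_pair));
        _mm_store_pd(&u[i][0], blended);             /* write the pair back */
    }

    printf("%lf\n", u[7][STRIDE - 1]);
    return 0;
}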

Without AVX, half a SIMD register is only one double anyway, so there seems little wrong with regular 64-bit stores.
If you want to use masked stores (MASKMOVDQU/MASKMOVQ), note that they write directly to DRAM just like the non-temporal stores like MOVNTPS. This may or may not be what you want. If the data fits in cache and you plan to read it soon, it is likely better not to use them.
Certain AMD processors can do a 64-bit non-temporal store from an XMM register using MOVNTSD; this may simplify things slightly compared to MASKMOVDQU.

Related

ARM NEON intrinsics convert D (64-bit) register to low half of Q (128-bit) register, leaving upper half undefined

I'd like to be able to essentially be able to typecast a uint8x8_t into a uint8x16_t with no overhead, leaving the upper 64-bits undefined. This is useful if you only care about the bottom 64-bits, but wish to use 128-bit instructions, for example:
uint8x16_t data = (uint8x16_t)vld1_u8(src); // if you can somehow do this
uint8x16_t shifted = vextq_u8(oldData, data, 2);
From my understanding of ARM assembly, this should be possible as the load can be issued to a D register, then interpreted as a Q register.
Some ways I can think of getting this working would be:
data = vcombine_u8(vld1_u8(src), vdup_n_u8(0)); - compiler seems to go to the effort of setting the upper half to 0, even though this is never necessary
data = vld1q_u8(src); - doing a 128-bit load works (and is fine in my case), but is likely slower on processors with 64-bit NEON units?
I suppose there may be an icky case of partial dependencies in the CPU, with only setting half a register like this, but I'd rather the compiler figure out the best approach here rather than forcing it to use a 0 value.
Is there any way to do this?
On aarch32, you are completely at the compiler's mercy on this. (That's why I write NEON routines in assembly)
On aarch64 on the other hand, it's pretty much automatic, since the upper 64 bits aren't directly accessible anyway.
The compiler will emit a trn1 instruction for vcombine though.
To sum it up, there is always some overhead involved on aarch64, while on aarch32 it's unpredictable. If your aarch32 routine is simple and short, and thus doesn't need many registers, chances are good that the compiler assigns the registers cleverly, but it's VERY unlikely otherwise.
BTW, on aarch64, if you initialize the lower 64 bits, the CPU automatically sets the upper 64 bits to zero. I don't know if it costs extra time though. It did cost me several days until I found out what had been wrong all along. So annoying!!!
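For reference, a sketch of the vcombine idiom from the question, packaged as a helper (the function name is mine, not standard; on aarch64 the explicit zeroing of the upper half is usually redundant, since writing a D register already zeroes the upper 64 bits):

#include <arm_neon.h>

/* Widen a 64-bit load into a Q register. The upper half is zeroed
   explicitly; on aarch64, writing the lower 64 bits of a vector register
   zeroes the upper 64 bits anyway, so the compiler can often drop the vdup. */
static inline uint8x16_t load_low_u8(const uint8_t *src)
{
    return vcombine_u8(vld1_u8(src), vdup_n_u8(0));
}

/* Usage, matching the question's example (oldData as defined there): */
/* uint8x16_t shifted = vextq_u8(oldData, load_low_u8(src), 2); */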

How to extract bytes from an SSE2 __m128i structure?

I'm a beginner with SIMD intrinsics, so I'll thank everyone for their patience in advance. I have an application involving absolute difference comparison of unsigned bytes (I'm working with greyscale images).
I tried AVX, more modern SSE versions etc, but eventually decided SSE2 seems sufficient and has the most support for individual bytes - please correct me if I'm wrong.
I have two questions: first, what's the right way to load 128-bit registers? I think I'm supposed to pass the load intrinsics data aligned to multiples of 128, but will that work with 2D array code like this:
greys = aligned_alloc(16, xres * sizeof(int8_t*));
for (uint32_t x = 0; x < xres; x++)
{
    greys[x] = aligned_alloc(16, yres * sizeof(int8_t*));
}
(The code above assumes xres and yres are the same, and are powers of two). Does this turn into a linear, unbroken block in memory? Could I then, as I loop, just keep passing addresses (incrementing them by 128) to the SSE2 load intrinsics? Or does something different need to be done for 2D arrays like this one?
My second question: once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i ? Looking through the Intel Intrinsics Guide, instructions that convert a vector type to a scalar one are rare. The closest I've found is int _mm_movemask_epi8 (__m128i a) but I don't quite understand how to use it.
Oh, and one third question - I assumed _mm_load_si128 only loads signed bytes? And I couldn't find any other byte loading function, so I guess you're just supposed to subtract 128 from each and account for it later?
I know these are basic questions for SIMD experts, but I hope this one will be useful to beginners like me. And if you think my whole approach to the application is wrong, or I'd be better off with more modern SIMD extensions, I'd love to know. I'd just like to humbly warn I've never worked with assembly and all this bit-twiddling stuff requires a lot of explication if it's to help me.
Nevertheless, I'm grateful for any clarification available.
In case it makes a difference: I'm targeting a low-power i7 Skylake architecture. But it'd be nice to have the application run on much older machines too (hence SSE2).
Least obvious question first:
once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i
Extract the low 64 bits to an integer with int64_t _mm_cvtsi128_si64x(__m128i), or the low 32 bits with int _mm_cvtsi128_si32 (__m128i a).
If you want other parts of the vector, not the low element, your options are:
Shuffle the vector to create a new __m128i with the data you want in the low element, and use the cvt intrinsics (MOVD or MOVQ in asm).
Use SSE2 int _mm_extract_epi16 (__m128i a, int imm8), or the SSE4.1-and-later equivalents for other element sizes, such as _mm_extract_epi64(v, 1);. PEXTRB/W/D/Q are not the fastest instructions, but if you only need one high element, they're about equivalent to a separate shuffle and MOVD, just with smaller machine code.
_mm_store_si128 to an aligned temporary array and access the members: compilers will often optimize this into just a shuffle or pextr* instruction if you compile with -msse4.1 or -march=haswell or whatever. The question "print a __m128i variable" shows an example, including Godbolt compiler output showing _mm_store_si128 into an alignas(16) uint64_t tmp[2].
Or use union { __m128i v; int64_t i64[2]; } or something. Union-based type punning is legal in C99, but only as an extension in C++. This compiles the same as a tmp array, and is generally not easier to read.
An alternative to the union that would also work in C++ would be memcpy(&my_int64_local, 8 + (char*)my_vector, 8); to extract the high half, but that seems more complicated and less clear, and more likely to be something a compiler wouldn't "see through". Compilers are usually pretty good about optimizing away small fixed-size memcpy when it's an entire variable, but this is just half of the variable.
If the whole high half of a vector can go directly into memory unmodified (instead of being needed in an integer register), a smart compiler might optimize to use MOVHPS to store the high half of a __m128i with the above union stuff.
Or you can use _mm_storeh_pi((__m64*)dst, _mm_castsi128_ps(vec)). That only requires SSE1, and is more efficient than SSE4.1 pextrq on most CPUs. But don't do this for a scalar integer you're about to use again right away; if SSE4.1 isn't available it's likely the compiler will actually MOVHPS and integer reload, which usually isn't optimal. (And some compilers like MSVC don't optimize intrinsics.)
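A few of these options as SSE2-only sketches (the helper names are illustrative, not standard):

#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Low 32 bits via MOVD. */
static inline int low32(__m128i v) { return _mm_cvtsi128_si32(v); }

/* 32-bit element 2 (0-based): shuffle it down to lane 0, then MOVD. */
static inline int lane2(__m128i v)
{
    return _mm_cvtsi128_si32(_mm_shuffle_epi32(v, _MM_SHUFFLE(2, 2, 2, 2)));
}

/* Byte n (n < 16): store to an aligned temporary and index it. */
static inline uint8_t byte_n(__m128i v, size_t n)
{
    _Alignas(16) uint8_t tmp[16];
    _mm_store_si128((__m128i *)tmp, v);
    return tmp[n];
}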
Does this turn into a linear, unbroken block in memory?
No, it's an array of pointers to separate blocks of memory, introducing an extra level of indirection vs. a proper 2D array. Don't do that.
Make one large allocation, and do the index calculation yourself (using array[x*yres + y]).
And yes, load data from it with _mm_load_si128, or loadu if you need to load from an offset.
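For example, a sketch of that flat layout (names mirror the question's; aligned_alloc is C11 and needs the size to be a multiple of the alignment, which holds here since the dimensions are powers of two):

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stdlib.h>

/* One contiguous, 16-byte-aligned block; index as row * yres + column. */
static uint8_t *alloc_greys(size_t xres, size_t yres)
{
    return aligned_alloc(16, xres * yres);
}

/* Load 16 contiguous greyscale bytes starting at column y of row x.
   loadu is used so y does not have to be a multiple of 16. */
static __m128i load_16_pixels(const uint8_t *greys, size_t yres,
                              size_t x, size_t y)
{
    return _mm_loadu_si128((const __m128i *)&greys[x * yres + y]);
}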
assumed _mm_load_si128 only loads signed bytes
Signed or unsigned isn't an inherent property of a byte; it's only how you interpret the bits. You use the same load intrinsic for loading two 64-bit elements, or a 128-bit bitmap.
Use intrinsics that are appropriate for your data. It's a little bit like assembly language: everything is just bytes, and the machine will do what you tell it with your bytes. It's up to you to choose a sequence of instructions / intrinsics that produces meaningful results.
The integer load intrinsics take __m128i* pointer args, so you have to use _mm_load_si128( (const __m128i*) my_int_pointer ) or similar. This looks like pointer aliasing (e.g. reading an array of int through a short *), which is Undefined Behaviour in C and C++. However, this is how Intel says you're supposed to do it, so any compiler that implements Intel's intrinsics is required to make this work correctly. gcc does so by defining __m128i with __attribute__((may_alias)).
See also Loading data for GCC's vector extensions which points out that you can use Intel intrinsics for GNU C native vector extensions, and shows how to load/store.
To learn more about SIMD with SSE, there are some links in the sse tag wiki, including some intro / tutorial links.
The x86 tag wiki has some good x86 asm / performance links.

How to do an indirect load (gather-scatter) in AVX or SSE instructions?

I've been searching for a while now, but can't seem to find anything useful in the documentation or on SO. This question didn't really help me out, since it makes references to modifying the assembly and I am writing in C.
I have some code making indirect accesses that I want to vectorize.
for (i = 0; i < LENGTH; ++i) {
    foo[bar[i]] *= 2;
}
Since I have the indices I want to double inside bar, I was wondering if there was a way to load those indices of foo into a vector register and then I could apply my math and store it back to the same indices.
Something like the following. The load and store instructions I just made up because I couldn't find anything like them in the AVX or SSE documentation. I think I read somewhere that AVX2 has similar functions, but the processor I'm working with doesn't support AVX2.
for (i = 0; i < LENGTH; i += 8) {
    // For simplicity, I'm leaving out any pointer type casting
    __m256 ymm0 = _mm256_load_indirect(bar+i);
    __m256 ymm1 = _mm256_set1_epi32(2); // Set up vector of just 2's
    __m256 ymm2 = _mm256_mul_ps(ymm0, ymm1);
    _mm256_store_indirect(ymm2, bar+i);
}
Are there any instructions in AVX or SSE that will allow me to load a vector register with an array of indices from a different array? Or any "hacky" ways around it if there isn't an explicit function?
(I'm writing an answer to this old question as I think it may help others.)
Short answer
No. There are no scatter/gather instructions in the SSE and AVX instruction sets.
Longer answer
Scatter/gather instructions are expensive to implement (in terms of complexity and silicon real estate) because the scatter/gather mechanism needs to be deeply intertwined with the cache/memory controller. I believe this is the reason that this functionality was missing from SSE/AVX.
For newer instruction sets the situation is different. In AVX2 you have
VGATHERDPD, VGATHERDPS, VGATHERQPD, VGATHERQPS for floating point gather (intrinsics here)
VPGATHERDD, VPGATHERQD, VPGATHERDQ, VPGATHERQQ for integer gather (intrinsics here)
In AVX-512 we got
VSCATTERDPD, VSCATTERDPS, VSCATTERQPD, VSCATTERQPS for floating point scatter (intrinsics here)
VPSCATTERDD, VPSCATTERQD, VPSCATTERDQ, VPSCATTERQQ for integer scatter (intrinsics here)
However, it is still a question whether using scatter/gather for such a simple operation would actually pay off.
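For illustration, a hedged AVX2 sketch of the question's loop using a gather plus scalar write-back. It assumes foo is float and bar is int32_t (as the question's _mm256_mul_ps suggests), that LENGTH is a multiple of 8, and that no index repeats within a group of 8 (otherwise the result differs from the scalar loop). AVX2 has no scatter, so the stores stay scalar, and whether any of this beats the plain loop is workload-dependent.

#include <immintrin.h>   /* AVX2 */
#include <stddef.h>
#include <stdint.h>

static void double_indexed(float *foo, const int32_t *bar, size_t length)
{
    const __m256 two = _mm256_set1_ps(2.0f);

    for (size_t i = 0; i < length; i += 8) {
        __m256i idx      = _mm256_loadu_si256((const __m256i *)(bar + i));
        __m256  gathered = _mm256_i32gather_ps(foo, idx, 4);  /* scale = sizeof(float) */
        __m256  doubled  = _mm256_mul_ps(gathered, two);

        float tmp[8];
        _mm256_storeu_ps(tmp, doubled);
        for (int j = 0; j < 8; j++)        /* scalar "scatter" */
            foo[bar[i + j]] = tmp[j];
    }
}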

What's the most efficient way to load and extract 32 bit integer values from a 128 bit SSE vector?

I'm trying to optimize my code using SSE intrinsics but am running into a problem where I don't know of a good way to extract the integer values from a vector after I've done the SSE intrinsics operations to get what I want.
Does anyone know of a good way to do this? I'm programming in C and my compiler is gcc version 4.3.2.
Thanks for all your help.
It depends on what you can assume about the minimum level of SSE support that you have.
Going all the way back to SSE2 you have _mm_extract_epi16 (PEXTRW) which can be used to extract any 16 bit element from a 128 bit vector. You would need to call this twice to get the two halves of a 32 bit element.
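For example, a sketch of that two-call approach for one fixed element (illustrative only; the helper name is mine):

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Extract 32-bit element 1 of v using two 16-bit PEXTRW extracts.
   _mm_extract_epi16 zero-extends, so the halves can be OR'd together. */
static inline int32_t extract32_lane1_sse2(__m128i v)
{
    uint32_t lo = (uint32_t)_mm_extract_epi16(v, 2);  /* bits 32..47 */
    uint32_t hi = (uint32_t)_mm_extract_epi16(v, 3);  /* bits 48..63 */
    return (int32_t)((hi << 16) | lo);
}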
In more recent versions of SSE (SSE4.1 and later) you have _mm_extract_epi32 (PEXTRD) which can extract a 32 bit element in one instruction.
Alternatively if this is not inside a performance-critical loop you can just use a union, e.g.
typedef union
{
    __m128i v;
    int32_t a[4];
} U32;
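Usage sketch (relying on the U32 union defined just above, plus <stdint.h> for int32_t):

#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    U32 u;                               /* union from above */
    u.v = _mm_set_epi32(4, 3, 2, 1);     /* a[0]=1, a[1]=2, a[2]=3, a[3]=4 */
    printf("%d\n", u.a[2]);              /* prints 3 */
    return 0;
}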
The _mm_extract_epi32 intrinsic is indeed the best option, but if you need to support SSE2, I'd recommend this:
inline int get_x(const __m128i& vec){return _mm_cvtsi128_si32 (vec);}
inline int get_y(const __m128i& vec){return _mm_cvtsi128_si32 (_mm_shuffle_epi32(vec,0x55));}
inline int get_z(const __m128i& vec){return _mm_cvtsi128_si32 (_mm_shuffle_epi32(vec,0xAA));}
inline int get_w(const __m128i& vec){return _mm_cvtsi128_si32 (_mm_shuffle_epi32(vec,0xFF));}
I've found that if you reinterpret_cast/union the vector into an int[4] representation, the compiler tends to flush things back to memory (which may not be that bad) and read it back as an int, though I haven't looked at the assembly to see whether the latest compiler versions generate better code.

How to treat 64-bit words on a CUDA device?

I'd like to handle directly 64-bit words on the CUDA platform (eg. uint64_t vars).
I understand, however, that addressing space, registers and the SP architecture are all 32-bit based.
I actually found this to work correctly (on my CUDA cc1.1 card):
__global__ void test64Kernel( uint64_t *word )
{
    (*word) <<= 56;
}
but I don't know, for example, how this affects registers usage and the operations per clock cycle count.
Whether addresses are 32-bit or anything else does not affect what data types you can use. In your example you have a pointer (32-bit, 64-bit, 3-bit (!) - doesn't matter) to a 64-bit unsigned integer.
64-bit integers are supported in CUDA, but of course every 64-bit value stores twice as much data as a 32-bit value, so it will use more registers, and arithmetic operations will take longer (adding two 64-bit integers is expanded into operations on the smaller data types, using carries to propagate into the next sub-word). The compiler is an optimising compiler, so it will try to minimise the impact of this.
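As a rough illustration of that carry-based expansion (plain C, not what the compiler literally emits; the hardware uses add-with-carry instructions):

#include <stdint.h>

/* Roughly how a 64-bit add decomposes into 32-bit operations on
   hardware with 32-bit ALUs. Illustrative only. */
static uint64_t add64_via_32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo    = a_lo + b_lo;          /* add the low halves */
    uint32_t carry = (lo < a_lo);          /* 1 if the low add wrapped */
    uint32_t hi    = a_hi + b_hi + carry;  /* add the high halves plus carry */

    return ((uint64_t)hi << 32) | lo;
}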
Note that using double precision floating point, also 64-bit, is only supported in devices with compute capability 1.3 or higher (i.e. 1.3 or 2.0 at this time).

Resources