Convert __m256d to __m256i - c

Since casts like this:
__m256d a;
uint64_t t[4];
_mm256_store_si256( (__m256i*)t, (__m256i)a );/* Cast of 'a' to __m256i not allowed */
are not allowed when compiling under Visual Studio, I thought I could use some intrinsic function to convert a __m256d value into a __m256i before passing it to _mm256_store_si256, thus avoiding the cast that causes the error.
But after looking through the intrinsics list, I couldn't find a function that takes a __m256d argument and returns a __m256i value. So maybe you could help me write my own function, or find the one I'm looking for: a function that stores the bit patterns of 4x 64-bit doubles to an array of 4x 64-bit integers.
EDIT:
After further research, I found _mm256_cvtpd_epi64, which seems to be exactly what I want. But my CPU doesn't support the AVX512 instruction set...
What is left for me to do here?

You could use _mm256_store_pd( (double*)t, a). I'm pretty sure this is strict-aliasing safe because you're not directly dereferencing the pointer after casting it. The _mm256_store_pd intrinsic wraps the store with any necessary may-alias stuff.
(With AVX512, Intel switched to using void* for the load/store intrinsics instead of float*, double*, or __m512i*, to remove the need for these clunky casts and make it more clear that intrinsics can alias anything.)
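For example, a minimal sketch of that first option (assuming t is 32-byte aligned; otherwise use _mm256_storeu_pd):

#include <immintrin.h>
#include <stdint.h>

void store_double_bits(__m256d a, uint64_t t[4])
{
    _mm256_store_pd((double*)t, a);   /* stores the same 32 bytes; no value conversion happens */
}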
The other option is to _mm256_castpd_si256 to reinterpret the bits of your __m256d as a __m256i:
alignas(32) uint64_t t[4];
_mm256_store_si256( (__m256i*)t, _mm256_castpd_si256(a));
This cast compiles to zero asm instructions, i.e. it's just a type-pun to keep the C compiler happy.
If you read from t[] right away, your compiler might optimize away the store/reload and just shuffle or pextrq rax, xmm0, 1 to extract FP bit patterns directly into integer registers. You could write this manually with intrinsics. Store/reload is not bad, though, especially if you want more than 1 of the double bit-patterns as scalar integers.
You could instead use union m256_elements { uint64_t u64[4]; __m256d vecd; };, but there's no guarantee that will compile efficiently.
If you wanted to actually round packed double to the nearest signed or unsigned 64-bit integer and have the result in 2's complement or unsigned binary instead of IEEE754 binary64, you need AVX512DQ _mm256/512_cvtpd_epi64 (vcvtpd2qq; plus AVX512VL for the 256-bit version) for it to be efficient. SSE2 + x86-64 can do it for scalar, or you can use some packed FP hacks for numbers in the [0..2^52] range: How to efficiently perform double/int64 conversions with SSE/AVX?.
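As a rough illustration of that range-limited hack (adapted from the idea in the linked Q&A; assumes the doubles hold whole numbers in [0, 2^52) and that AVX2 is available for the integer XOR):

#include <immintrin.h>

__m256i double_to_uint64_small(__m256d x)
{
    __m256d magic = _mm256_set1_pd(4503599627370496.0);   /* 2^52 */
    x = _mm256_add_pd(x, magic);                           /* pushes the integer value into the mantissa bits */
    return _mm256_xor_si256(_mm256_castpd_si256(x),
                            _mm256_castpd_si256(magic));   /* strips the 2^52 exponent bits, leaving the integer */
}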
BTW, storeu doesn't require an aligned destination, but store does. If the destination is a local, you should normally align it instead of using an unaligned store, at least if the store happens in a loop, or if this function can inline into a larger function.

Related

Efficient Embedded Fixed Point 2x2 Matrix Multiplication in ARM Cortex-M4 C code

I am trying to implement a VERY efficient 2x2 matrix multiplication in C code for operation in an ARM Cortex-M4. The function accepts three pointers to 2x2 arrays: two for the inputs to be multiplied and one for an output buffer provided by the calling function. Here is what I have so far...
static inline void multiply_2x2_2x2(int16_t a[2][2], int16_t b[2][2], int32_t c[2][2])
{
    int32_t a00a01, a10a11, b00b10, b01b11;
    a00a01 = a[0][0] | a[0][1]<<16;
    b00b10 = b[0][0] | b[1][0]<<16;
    b01b11 = b[0][1] | b[1][1]<<16;
    c[0][0] = __SMUAD(a00a01, b00b10);
    c[0][1] = __SMUAD(a00a01, b01b11);
    a10a11 = a[1][0] | a[1][1]<<16;
    c[1][0] = __SMUAD(a10a11, b00b10);
    c[1][1] = __SMUAD(a10a11, b01b11);
}
Basically, my strategy is to use the ARM Cortex-M4 __SMUAD() function to do the actual multiply accumulates. But this requires me to build the inputs a00a01, a10a11, b00b10, and b01b11 ahead of time. My question is, given that the C array should be contiguous in memory, is there a more efficient way to pass the data into the function directly without the intermediate variables? Secondary question, am I overthinking this and I should just let the compiler do its job as it is smarter than I am? I tend to do that a lot.
Thanks!
You could break the strict aliasing rules and load each matrix row directly into a 32-bit register, using an int16_t* to int32_t* typecast. An expression such as a00a01 = a[0][0] | a[0][1]<<16 just takes some consecutive bits from RAM and arranges them into other consecutive bits in a register. Consult your compiler manual for the flag that disables its strict-aliasing assumptions and makes such a cast safe to use.
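For instance, a hypothetical helper along those lines (only valid with strict aliasing disabled, e.g. gcc -fno-strict-aliasing, on a little-endian target, and assuming the rows are 4-byte aligned or the core tolerates unaligned 32-bit loads, as Cortex-M4 does):

#include <stdint.h>

/* Packs one row of an int16_t 2x2 matrix into a 32-bit value with a single load:
   the low half holds row[0] and the high half holds row[1] on little-endian. */
static inline int32_t load_row(const int16_t row[2])
{
    return *(const int32_t *)row;
}

So a00a01 = load_row(a[0]); would replace the shift-and-OR expression.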
You could also perhaps avoid transposing matrix columns into registers, by generating b in transposed format in the first place.
The best way to learn about the compiler, and get a sense of the cases for which it's smarter than you, is to disassemble its results and compare the instruction sequence to your intentions.
The first main concern is that some_signed_int << 16 invokes undefined behavior for negative numbers. So you have bugs all over. And then bitwise OR of two int16_t where either is negative does not necessarily form a valid int32_t either. Do you actually need the sign or can you drop it?
ARM examples use unsigned int, which in turn supposedly contains 2x int16_t in raw binary form. This is what you actually want too.
Also it would seem that it shouldn't matter for SMUAD which 16-bit word you place where, so the a[0][0] | a[0][1]<<16 just serves to needlessly swap data around in memory. It will confuse the compiler, which can't optimize such code well. Sure, shifts etc. are always very fast, but this is pointless overhead.
(As someone noted, this whole thing is likely much easier to write in pure assembler without concern of all the C type rules and undefined behavior.)
To avoid all these issues you could define your own union type:
typedef union
{
    int16_t i16 [2][2];
    uint32_t u32 [2];
} mat2x2_t;
u32[0] corresponds to i16[0][0] and i16[0][1]
u32[1] corresponds to i16[1][0] and i16[1][1]
C actually lets you "type pun" between these types pretty wildly (unlike C++). Unions also dodge the brittle strict aliasing rules.
The function can then become something along the lines of this pseudo code:
static uint32_t mat_mul16 (mat2x2_t a, mat2x2_t b)
{
    uint32_t c0 = __SMUAD(a.u32[0], b.u32[0]);
    ...
}
Supposedly each such line should give 2x signed 16-bit multiplications, as per the SMUAD instruction.
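Filled in, the idea might look something like this (an untested sketch; __SMUAD comes from the CMSIS core headers, b is assumed to be supplied already transposed as suggested in the other answer, and the usual little-endian Cortex-M byte order is assumed so that i16[0][0] sits in the low half of u32[0]):

#include <stdint.h>
/* Assumes the device's CMSIS header is already included so that __SMUAD is available. */

typedef union
{
    int16_t i16 [2][2];
    uint32_t u32 [2];
} mat2x2_t;

/* c = a * b, where b_t holds b transposed (each u32 word is one column of b). */
static inline void mat_mul16(const mat2x2_t *a, const mat2x2_t *b_t, int32_t c[2][2])
{
    c[0][0] = __SMUAD(a->u32[0], b_t->u32[0]);  /* a00*b00 + a01*b10 */
    c[0][1] = __SMUAD(a->u32[0], b_t->u32[1]);  /* a00*b01 + a01*b11 */
    c[1][0] = __SMUAD(a->u32[1], b_t->u32[0]);  /* a10*b00 + a11*b10 */
    c[1][1] = __SMUAD(a->u32[1], b_t->u32[1]);  /* a10*b01 + a11*b11 */
}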
As for whether this actually gives some revolutionary performance increase compared to a plain MUL, I kind of doubt it. Disassemble and count CPU ticks.
am I overthinking this and I should just let the compiler do its job as it is smarter than I am?
Most likely :) The old rule of thumb: benchmark and then only manually optimize at the point when you've actually found a performance bottleneck.

Why is the last argument of _mm_permute_ps an int?

GCC kindly informed me that the last argument of the SIMD intrinsic _mm_permute_ps must be an 8-bit immediate. Why then is its last argument declared as expecting an int?
__m128 _mm_permute_ps(__m128 a, int imm8);
__m256d _mm256_permute_pd(__m256d a, int imm8);
Would an 8-bit type parameter not provide a more helpful interface to the end user?
It is consistent with all the other intrinsics taking a shuffle vector or immediate argument. Probably to indicate it is an integer and not a character, while avoiding depending on stdint.h for int8_t.
The funnier part from a C++ point of view is that it isn't constexpr, so you can give it non-compile-time arguments, which then causes fun stuff for the compiler. I once tried improving the intrinsics for gcc in a way that assumed the immediate argument was compile-time, and it broke a surprising amount of code.
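For illustration, the immediate is normally built from a compile-time constant expression, e.g. with the _MM_SHUFFLE macro (this sketch needs AVX, since _mm_permute_ps is vpermilps):

#include <immintrin.h>

__m128 reverse4(__m128 v)
{
    /* imm8 = 0b00011011: lane 0 takes element 3, lane 1 takes element 2, and so on */
    return _mm_permute_ps(v, _MM_SHUFFLE(0, 1, 2, 3));
}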

Intrinsic to set value in array based on a BitMask

Is there an intrinsic that will set a single value at all the places in an input array where the corresponding position had a 1 bit in the provided BitMask?
For example: with bitmask 10101010 and value 121, it will set positions 0, 2, 4, 6 to 121.
With AVX512, yes. Masked stores are a first-class operation in AVX512.
Use the bitmask as an AVX512 mask for a vector store to an array, using _mm512_mask_storeu_epi8 (void* mem_addr, __mmask64 k, __m512i a), which compiles to vmovdqu8. (That's AVX512BW; with only AVX512F, you can only use 32- or 64-bit element sizes.)
#include <immintrin.h>
#include <stdint.h>
void set_value_in_selected_elements(char *array, uint64_t bitmask, uint8_t value) {
    __m512i broadcastv = _mm512_set1_epi8(value);
    // integer types are implicitly convertible to/from __mmask types
    // the compiler emits the KMOV instruction for you.
    _mm512_mask_storeu_epi8 (array, bitmask, broadcastv);
}
This compiles (with gcc7.3 -O3 -march=skylake-avx512) to:
vpbroadcastb zmm0, edx
kmovq k1, rsi
vmovdqu8 ZMMWORD PTR [rdi]{k1}, zmm0
vzeroupper
ret
If you want to write zeros in the elements where the bitmap was zero, either use a zero-masking move to create a constant from the mask and store that, or create a 0 / -1 vector using AVX512BW or DQ __m512i _mm512_movm_epi8(__mmask64 ). Other element sizes are available. But using a masked store makes it possible to safely use it when the array size isn't a multiple of the vector width, because the unmodified elements aren't read / rewritten or anything; they're truly untouched. (The CPU can take a slow microcode assist if any of the untouched elements would have faulted on a real store, though.)
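For instance, the zero-masking variant might look like this (a sketch; AVX512BW, and it unconditionally writes all 64 bytes):

#include <immintrin.h>
#include <stdint.h>

void set_value_or_zero(char *array, uint64_t bitmask, uint8_t value)
{
    __m512i v = _mm512_maskz_mov_epi8((__mmask64)bitmask, _mm512_set1_epi8((char)value));
    _mm512_storeu_si512(array, v);   /* stores the full vector: value where the bit was set, 0 elsewhere */
}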
Without AVX512, there's no single intrinsic that does this (you did ask for "an intrinsic", singular), but you can get close.
There's pdep, which you can use to expand a bitmap to a byte-map. See my AVX2 left-packing answer for an example of using _pdep_u64(mask, 0x0101010101010101); to unpack each bit in mask to a byte. This gives you 8 bytes in a uint64_t. In C, if you use a union between that and an array, then it gives you an array of 0 / 1 elements. (But of course indexing the array will require the compiler to emit shift instructions, if it hasn't spilled it somewhere first. You probably just want to memcpy the uint64_t into a permanent array.)
But in the more general case (larger bitmaps), or even with 8 elements when you want to blend in new values based on the bitmask, you should use multiple intrinsics to implement the inverse of pmovmskb, and use that to blend. (See the "Without pdep" section below.)
In general, if your array fits in 64 bits (e.g. an 8-element char array), you can use pdep. Or if it's an array of 4-bit nibbles, then you can do a 16-bit mask instead of 8.
Otherwise there's no single instruction, and thus no intrinsic. For larger bitmaps, you can process it in 8-bit chunks and store 8-byte chunks into the array.
If your array elements are wider than 8 bits (and you don't have AVX512), you should probably still expand bits to bytes with pdep, but then use [v]pmovzx to expand from bytes to dwords or whatever in a vector. e.g.
// only the low 8 bits of the input matter
__m256i bits_to_dwords(unsigned bitmap) {
    uint64_t mask_bytes = _pdep_u64(bitmap, 0x0101010101010101);  // expand bits to bytes
    __m128i byte_vec = _mm_cvtsi64x_si128(mask_bytes);
    return _mm256_cvtepu8_epi32(byte_vec);
}
If you want to leave elements unmodified instead of setting them to zero where the bitmask had zeros, OR with the previous contents instead of assigning / storing.
This is rather inconvenient to express in C / C++ (compared to asm). To copy 8 bytes from a uint64_t into a char array, you can (and should) just use memcpy (to avoid any undefined behaviour because of pointer aliasing or misaligned uint64_t*). This will compile to a single 8-byte store with modern compilers.
But to OR them in, you'd either have to write a loop over the bytes of the uint64_t, or cast your char array to uint64_t*. This usually works fine, because char* can alias anything so reading the char array later doesn't have any strict-aliasing UB. But a misaligned uint64_t* can cause problems even on x86, if the compiler assumes that it is aligned when auto-vectorizing. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Assigning a value other than 0 / 1
Use a multiply by 0xFF to turn the mask of 0/1 bytes into a 0 / -1 mask, and then AND that with a uint64_t that has your value broadcasted to all byte positions.
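For example, a purely scalar sketch of that for one 8-byte chunk (BMI2 for pdep; hypothetical helper name):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Writes `value` to the bytes of dst[0..7] selected by bitmask, and 0 to the rest. */
void store8_value_or_zero(char *dst, uint8_t bitmask, uint8_t value)
{
    uint64_t ones  = _pdep_u64(bitmask, 0x0101010101010101ULL);        /* 0 or 1 in each byte */
    uint64_t bytes = (ones * 0xFF) & (0x0101010101010101ULL * value);  /* 0 or value in each byte */
    memcpy(dst, &bytes, sizeof(bytes));   /* compiles to a single 8-byte store with modern compilers */
}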
If you want to leave elements unmodified instead of setting them to zero or value=121, you should probably use SSE2 / SSE4 or AVX2 even if your array has byte elements. Load the old contents, vpblendvb with set1(121), using the byte-mask as a control vector.
vpblendvb only uses the high bit of each byte, so your pdep constant can be 0x8080808080808080 to scatter the input bits to the high bit of each byte, instead of the low bit. (So you don't need to multiply by 0xFF to get an AND mask).
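Putting those two points together, a sketch for one 8-byte chunk (SSE4.1 for pblendvb, BMI2 for pdep):

#include <immintrin.h>
#include <stdint.h>

void blend_value_into_selected(char *dst, uint8_t bitmask, uint8_t value)
{
    uint64_t himask = _pdep_u64(bitmask, 0x8080808080808080ULL);   /* each bit -> high bit of a byte */
    __m128i old     = _mm_loadl_epi64((const __m128i *)dst);       /* load the existing 8 bytes */
    __m128i mask    = _mm_cvtsi64_si128((long long)himask);
    __m128i blended = _mm_blendv_epi8(old, _mm_set1_epi8((char)value), mask);
    _mm_storel_epi64((__m128i *)dst, blended);                     /* re-store the whole 8-byte chunk */
}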
If your elements are dword or larger, you could use _mm256_maskstore_epi32. (Use pmovsx instead of zx to copy the sign bit when expanding the mask from bytes to dwords). This can be a perf win over a variable-blend + always read / re-write. Is it possible to use SIMD instruction for replace?.
Without pdep
pdep is very slow on Ryzen, and even on Intel it's maybe not the best choice.
The alternative is to turn your bitmask into a vector mask:
is there an inverse instruction to the movemask instruction in intel avx2? and
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
i.e. broadcast your bitmap to every position of a vector (or shuffle it so the right bit of the bitmap is in the corresponding byte), and use a SIMD AND to mask off the appropriate bit for that byte. Then use pcmpeqb/w/d against the AND-mask to find the elements that had their bit set.
You're probably going to want to load / blend / store if you don't want to store zeros where the bitmap was zero.
Use the compare-mask to blend on your value, e.g. with _mm_blendv_epi8 or the 256bit AVX2 version. You can handle bitmaps in 16-bit chunks, producing 16-byte vectors with just a pshufb to send bytes of it to the right elements.
It's not safe for multiple threads to do this at the same time on the same array even if their bitmaps don't intersect, unless you use masked stores, though.

Union in C changing machine behavior of Float Addition

New to C programming, and I've been told to avoid unions which in general makes perfect sense and I agree with. However, as part of an academic exercise I'm writing an emulator for hardware single-precision floating point addition by doing bit manipulation operations on unsigned 32-bit integers. I only mention that to explain why I want to use unions; I'm having no trouble with the emulation.
In order to test this emulator, I wrote a test program. But of course I'm trying to find the bit representation of floats on my hardware, so I thought this could be the perfect use for a union. I wrote this union:
typedef union {
    float floatRep;
    uint32_t unsignedIntRep;
} FloatExaminer;
This way, I can initialize a float with the floatRep member and then examine the bits with the unsignedIntRep member.
This worked most of the time, but when I got to NaN addition, I started running into trouble. The exact situation was that I wrote a function to automate these tests. The gist of it was this:
void addTest(float op1, float op2){
    FloatExaminer result;
    result.floatRep = op1 + op2;
    printf("%f + %f = %f\n", op1, op2, result.floatRep);
    //print bit pattern as well
    printf("Bit pattern of result: %08x", result.unsignedIntRep);
}
OK, now for the confusing part:
I added two NaNs with different mantissa bit patterns so I could differentiate between them. On my particular hardware, it's supposed to return the second NaN operand (making it quiet if it was signalling). (I'll explain how I know this below.) However, passing the bit patterns op1=0x7fc00001, op2=0x7fc00002 would return op1, 0x7fc00001, every time!
I know it's supposed to return the second operand because I tried--outside the function--initializing as an integer and casting to a float as below:
uint32_t intRep1 = 0x7fc00001;
uint32_t intRep2 = 0x7fc00002;
float *op1 = (float *) &intRep1;
float *op2 = (float *) &intRep2;
float result = *op1 + *op2;
uint32_t *intResult = (uint32_t *)&result;
printf("%08x", *intResult); //bit pattern 0x7fc00002
In the end, I've concluded that unions are evil and I should never use them. However, does anyone know why I'm getting the result I am? Did I make stupid mistake or assumption? (I understand that hardware architecture varies, but this just seems bizarre.)
I'm assuming that when you say "my particular hardware", you are referring to an Intel processor using SSE floating point. But in fact, that architecture has a different rule, according to the Intel® 64 and IA-32 Architectures Software Developer's Manual. Here's a summary of Table 4.7 ("Rules for handling NaNs") from Volume 1 of that documentation, which describes the handling of NaNs in arithmetic operations: (QNaN is a quiet NaN; SNaN is a signalling NaN; I've only included information about two-operand instructions)
SNaN and QNaN
    x87 FPU — QNaN source operand.
    SSE — First source operand, converted to a QNaN.
Two SNaNs
    x87 FPU — SNaN source operand with the larger significand, converted to a QNaN.
    SSE — First source operand, converted to a QNaN.
Two QNaNs
    x87 FPU — QNaN source operand with the larger significand.
    SSE — First source operand.
NaN and a floating-point value
    x87/SSE — NaN source operand, converted to a QNaN.
SSE floating point machine instructions generally have the form op xmm1, xmm2/m32, where the first operand is the destination register and the second operand is either a register or a memory location. The instruction will then do, in effect, xmm1 <- xmm1 (op) xmm2/m32, so the first operand is both the left-hand operand of the operation and the destination. That's the meaning of "first operand" in the above chart. AVX adds three-operand instructions, where the destination might be a different register; it is then the third operand and does not figure in the above chart. The x87 FPU uses a stack-based architecture, where the top of the stack is always one of the operands and the result replaces either the top of the stack or the other operand; in the above chart, it will be noted that the rules do not attempt to decide which operand is "first", relying instead on a simple comparison.
Now, suppose we're generating code for an SSE machine, and we have to handle the C statement:
a = b + c;
where none of those variables are in a register. That means we might emit code something like this: (I'm not using real instructions here, but the principle is the same)
LOAD r1, b (r1 <- b)
ADD r1, c (r1 <- r1 + c)
STORE r1, a (a <- r1)
But we could also do this, with (almost) the same result:
LOAD r1, c (r1 <- c)
ADD r1, b (r1 <- r1 + b)
STORE r1, a (a <- r1)
That will have precisely the same effect, except for additions involving NaNs (and only when using SSE). Since arithmetic involving NaNs is unspecified by the C standard, there is no reason why the compiler should care which of these two options it chooses. In particular, if r1 happened to already have the value c in it, the compiler would probably choose the second option, since it saves a load instruction. (And who is going to complain? We all want the compiler to generate code which runs as quickly as possible, no?)
So, in short, the order of the operands of the ADD instruction will vary with the intricate details of how the compiler chooses to optimize the code, and the particular state of the registers at the moment in which the addition operator is being emitted. It is possible that this will be affected by the use of a union, but it is equally or more likely that it has to do with the fact that in your code using the union, the values being added are arguments to the function and therefore are already placed in registers.
Indeed, different versions of gcc, and different optimization settings, produce different results for your code. And forcing the compiler to emit x87 FPU instructions produces yet different results, because the hardware operates according to a different logic.
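One way to see that variability for yourself, reusing the FloatExaminer union from the question (the printed patterns depend on the compiler, flags, and whether SSE or x87 code is emitted):

#include <stdint.h>
#include <stdio.h>

typedef union {
    float floatRep;
    uint32_t unsignedIntRep;
} FloatExaminer;

int main(void)
{
    FloatExaminer x, y, r1, r2;
    x.unsignedIntRep = 0x7fc00001;          /* QNaN #1 */
    y.unsignedIntRep = 0x7fc00002;          /* QNaN #2 */
    r1.floatRep = x.floatRep + y.floatRep;
    r2.floatRep = y.floatRep + x.floatRep;  /* same addition, operands swapped */
    printf("%08x %08x\n", r1.unsignedIntRep, r2.unsignedIntRep);
    return 0;
}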
Note:
If you want some bedtime reading, you can download the entire Intel SDM (currently 4,684 pages / 23.3MB, but it keeps on getting bigger) from their site.

Working with Intel SSE SIMD intrinsics

I have a question regarding the various arithmetic operations for Intel SSE intrinsics.
What is the difference between doing a _mm_add_ps vs. _mm_add_epi8/16/32? I want to make sure that my data is aligned at all times.
In a sample code when I do this:
__m128 u1 = _mm_load_ps(&V[(i-1)]);
I get a segmentation fault. But when I do this:
__m128 u1 = _mm_loadu_ps(&V[(i-1)]);
It works fine.
Since I want my data aligned, I declared the array like this:
posix_memalign((void**)&V, 16, dx*sizeof(float));
Can someone help explain this?
_mm_add_ps adds floats together, whereas _mm_add_epi8/16/32 add integers, which are not floating point numbers.
_mm_loadu_ps does not require your floats to be 16-byte (128-bit) aligned, whereas _mm_load_ps does require 16-byte alignment.
So if you get a seg fault on the first one, your alignment is wrong.
On the posix_memalign page it says this:
The posix_memalign() function shall fail if:
[EINVAL] The value of the alignment parameter is not a power of two
multiple of sizeof( void *).
Note that this condition is about the alignment argument, not the size: 16 is a power-of-two multiple of sizeof(void *) on both 32-bit and 64-bit systems, so posix_memalign((void**)&V, 16, dx*sizeof(float)) is valid as written, requests the right number of bytes, and returns a 16-byte-aligned base pointer.
The more likely culprit is the load address itself: &V[(i-1)] is only 16-byte aligned when i-1 is a multiple of 4 floats, which is exactly why _mm_load_ps faults there while _mm_loadu_ps works.
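A small self-contained sketch of that point (hypothetical dx and offsets; minimal error handling):

#define _POSIX_C_SOURCE 200112L
#include <immintrin.h>
#include <stdlib.h>

int main(void)
{
    float *V;
    size_t dx = 64;
    if (posix_memalign((void**)&V, 16, dx * sizeof(float)) != 0)
        return 1;                            /* allocation failed */
    for (size_t i = 0; i < dx; i++)
        V[i] = (float)i;
    __m128 a = _mm_load_ps(&V[4]);           /* offset is a multiple of 4 floats: 16-byte aligned */
    __m128 b = _mm_loadu_ps(&V[3]);          /* offset 3 is not: the unaligned load is required */
    _mm_storeu_ps(&V[0], _mm_add_ps(a, b));
    free(V);
    return 0;
}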

Resources