Best way to mask a single bit in AVX2? - c

For example, with an input ymm vector x and bit index i I want an output vector with only the ith bit kept and everything else zeroed.
With AVX512 k registers, I could write the following, but AVX2 and below don't have k registers, so what do you think is the best way to do it?
__m512i m512i_maskBit(__m512i x, unsigned i) {
    __mmask8 m = _cvtu32_mask8(1u << (i / 64));
    __m512i vm = _mm512_maskz_set1_epi64(m, 1ull << (i % 64));
    return _mm512_and_si512(x, vm);
}

Here is an approach using variable shifts (just creating the mask):
__m256i create_mask(unsigned i) {
    __m256i ii = _mm256_set1_epi32(i);
    ii = _mm256_sub_epi32(ii, _mm256_setr_epi32(0, 32, 64, 96, 128, 160, 192, 224));
    __m256i mask = _mm256_sllv_epi32(_mm256_set1_epi32(1), ii);
    return mask;
}
_mm256_sllv_epi32 (vpsllvd) was introduced with AVX2; it shifts each 32-bit element left by a per-element variable amount. If the (unsigned) shift amount is greater than 31 (which also covers negative values when interpreted as signed), the corresponding element becomes 0.
Godbolt link with small test code: https://godbolt.org/z/a5xfqTcGs
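To actually mask the input vector as in the question, just AND the created mask with it. A minimal usage sketch (the wrapper name maskBit256 is mine, not from the answer):
__m256i maskBit256(__m256i x, unsigned i) {
    return _mm256_and_si256(x, create_mask(i));
}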

How about the simplest approach:
__m256i m256i_create_mask(unsigned i) {
    // Get the required bit in every byte of the vector
    __m256i vm = _mm256_broadcastb_epi8(_mm_cvtsi32_si128(1u << (i & 7u)));
    // Mask off the bytes that are outside the index
    __m256i vi = _mm256_broadcastb_epi8(_mm_cvtsi32_si128(i >> 3u));
    __m256i vm1 = _mm256_cmpeq_epi8(vi,
        _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
            16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31));
    return _mm256_and_si256(vm, vm1);
}

Here’s another approach. Not sure it’s necessarily better, it depends on CPU model and surrounding code, but it might be.
// A buffer to load vectors with a single bit set in one lane
alignas( 64 ) static const std::array<int, 16> s_oneBuffer =
{
    0, 0, 0, 0, 0, 0, 0, 0,
    1, 0, 0, 0, 0, 0, 0, 0
};

__m256i maskSingleBit( __m256i x, uint32_t bitIndex )
{
    // Load `1` into a single 32-bit lane of the vector.
    // The buffer is aligned by 64 bytes and contained in a single cache line, so there is no unaligned load penalty.
    __m256i one = _mm256_loadu_si256( ( const __m256i* )( ( s_oneBuffer.data() + 8 ) - ( bitIndex / 32 ) ) );
    // Left shift to move the `1` into the correct location
    __m128i shift = _mm_cvtsi32_si128( bitIndex % 32 );
    __m256i bit = _mm256_sll_epi32( one, shift );
    // Bitwise AND with the value
    return _mm256_and_si256( x, bit );
}

Related

How to compare two vectors using SIMD and get a strncmp like result?

I want to achieve something like a strncmp result, but not that complicated.
I tried to read the https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strcmp-avx2.S.html source code but failed to understand it.
Suppose we have two 256-bit vectors. How can I compare them, based on 8-bit comparisons, to get a result like strncmp?
I know there is a library, but I want to understand the basics.
How can it return a -1/0/+1 result using _mm256_cmpeq_epi8 and _mm256_min_epu8?
I would do it like that.
inline int compareBytes( __m256i a, __m256i b )
{
    // Compare for both a <= b and a >= b
    __m256i min = _mm256_min_epu8( a, b );
    __m256i le = _mm256_cmpeq_epi8( a, min );
    __m256i ge = _mm256_cmpeq_epi8( b, min );
    // Reverse bytes within 16-byte lanes
    const __m128i rev16 = _mm_set_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
    const __m256i rev32 = _mm256_broadcastsi128_si256( rev16 );
    le = _mm256_shuffle_epi8( le, rev32 );
    ge = _mm256_shuffle_epi8( ge, rev32 );
    // Move the masks to scalar registers
    uint32_t lessMask = (uint32_t)_mm256_movemask_epi8( le );
    uint32_t greaterMask = (uint32_t)_mm256_movemask_epi8( ge );
    // Flip high/low 16-bit pieces in the masks.
    // Apparently, modern compilers are smart enough to emit ROR instructions for that code
    lessMask = ( lessMask >> 16 ) | ( lessMask << 16 );
    greaterMask = ( greaterMask >> 16 ) | ( greaterMask << 16 );
    // Produce the desired result
    if( lessMask > greaterMask )
        return -1;
    else if( lessMask < greaterMask )
        return +1;
    else
        return 0;
}
The reason that method works: integer comparison is essentially a search for the most significant bit that differs, and the comparison result equals the difference in that most significant differing bit. Because we reversed the order of the bytes being tested, the first byte in the vectors corresponds to the most significant bit in the masks. For this reason, the ( lessMask > greaterMask ) expression evaluates to true when ( a < b ) holds for the first byte that differs between the source vectors.
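As a usage illustration (my addition, not part of the answer above; compare32 and the buffer-length assumption are mine), comparing the first 32 bytes of two buffers could look like this. For a full strncmp-style routine you would loop in 32-byte chunks, stop at the first chunk with a non-zero result, and handle the terminating zero and the length limit separately:
#include <immintrin.h>
#include <stdint.h>

// Assumes both buffers are at least 32 bytes long.
inline int compare32( const void* pa, const void* pb )
{
    __m256i a = _mm256_loadu_si256( ( const __m256i* )pa );
    __m256i b = _mm256_loadu_si256( ( const __m256i* )pb );
    return compareBytes( a, b );
}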

How to load multiple char values in armv7 assembly program?

I am loading multiple char values in an armv7 program using the vldm instruction,
but all four values load into one s register, while I need to expand these values into a floating-point register (q0).
Please help me. This is my C code:
void sum(){
    int sum = 0;
    char *p = NULL;
    p = (char *) malloc(sizeof(char) * 16);
    for (int i = 0; i < 16; ++i) {
        p[i] = i;
        sum += i;
    }
    printf("sum =%d\n", sum);
}
Here is a typical textbook example of loading/storing multiple values between the vector register bank and memory, using general-purpose registers that hold the source and destination addresses.
VLDM r1!, {d0-d7}
VSTM r0!, {d0-d7}
If you are using gdb, you can get a better view of a particular register bank or group of registers.
(gdb) p $q0
{u8 = {0 <repeats 16 times>}, u16 = {0, 0, 0, 0, 0, 0, 0, 0}, u32 = {0, 0, 0, 0}, u64 = {0, 0}, f32 = {0, 0, 0, 0}, f64 = {0, 0}}
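For the widening the question asks about (char values into a q register of floats), here is a hedged sketch using NEON intrinsics rather than hand-written assembly; it is my addition, the helper name is mine, and it assumes at least 8 readable bytes at p:
#include <arm_neon.h>
#include <stdint.h>

// Widen the first 4 bytes at p to a float32x4_t (one q register of floats).
float32x4_t load4_chars_as_f32(const uint8_t *p)
{
    uint8x8_t  b = vld1_u8(p);                   // load 8 bytes (only the low 4 are used)
    uint16x8_t h = vmovl_u8(b);                  // u8  -> u16
    uint32x4_t w = vmovl_u16(vget_low_u16(h));   // u16 -> u32 (low 4 lanes)
    return vcvtq_f32_u32(w);                     // u32 -> f32
}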

0xFFFF flags in SSE

I would like to create an SSE register with values that I can store in an array of integers, from another SSE register which contains flags 0xFFFF and zeros. For example:
__m128i regComp = _mm_cmpgt_epi16(regA, regB);
For the sake of argument, lets assume that regComp was loaded with { 0, 0xFFFF, 0, 0xFFFF }. I would like to convert this into say { 0, 80, 0, 80 }.
What I had in mind was to create an array of integers initialized to 80 and load it into a register regC. Then, do a _mm_and_si128 between regC and regComp and store the result in regD. However, this does not do the trick, which led me to think that I do not understand the flags in SSE registers. Could someone answer the question with a brief explanation of why my solution does not work?
short valA[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
short valB[16] = { 5, 5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10 };
short ones[16] = { 1 };
short final[16];
__m128i vA, vB, vOnes, vRes, vRes2;

vOnes = _mm_load_si128((__m128i *)&(ones)[0] );
for( i = 0; i < 16; i += 8 ){
    vA = _mm_load_si128((__m128i *)&(valA)[i] );
    vB = _mm_load_si128((__m128i *)&(valB)[i] );
    vRes = _mm_cmpgt_epi16(vA, vB);
    vRes2 = _mm_and_si128(vRes, vOnes);
    _mm_storeu_si128((__m128i *)&(final)[i], vRes2);
}
You only set the first element of array ones to 1 (the rest of the array is initialised to 0).
I suggest you get rid of the array ones altogether and then change this line:
vOnes = _mm_load_si128((__m128i *)&(ones)[0] );
to:
vOnes = _mm_set1_epi16(1);
Probably a better solution though, if you just want to convert SIMD TRUE (0xffff) results to 1, would be to use a shift:
for (i = 0; i < 16; i += 8) {
    vA = _mm_loadu_si128((__m128i *)&pA[i]);
    vB = _mm_loadu_si128((__m128i *)&pB[i]);
    vRes = _mm_cmpgt_epi16(vA, vB);  // generate 0xffff/0x0000 results
    vRes = _mm_srli_epi16(vRes, 15); // convert to 1/0 results
    _mm_storeu_si128((__m128i *)&final[i], vRes);
}
Try this for loading 1:
vOnes = _mm_set1_epi16(1);
This is shorter than creating a constant array.
Be careful: providing fewer initializer values than the array size in C++ initializes the remaining values to zero. This was your error, not the SSE part.
Don't forget the debugger; modern ones display SSE variables properly.
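For the { 0, 80, 0, 80 } example from the original question, the same masking idea works once the constant is built with _mm_set1_epi16. A short sketch (the variable names v80 and regD are mine; regA/regB/regComp are from the question):
__m128i v80     = _mm_set1_epi16(80);           // every 16-bit lane holds 80
__m128i regComp = _mm_cmpgt_epi16(regA, regB);  // 0xFFFF where regA > regB, else 0
__m128i regD    = _mm_and_si128(regComp, v80);  // 80 where the compare was true, else 0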

SSE _mm_movemask_epi8 equivalent method for ARM NEON

I decided to continue the Fast corners optimisation and got stuck at the
_mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with a uint8x16_t input?
I know this post is quite outdated but I found it useful to give my (validated) solution. It assumes all ones/all zeroes in every lane of the Input argument.
const uint8_t __attribute__ ((aligned (16))) _Powers[16] =
    { 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };

// Set the powers of 2 (do it once for all, if applicable)
uint8x16_t Powers = vld1q_u8(_Powers);

// Compute the mask from the input
uint64x2_t Mask = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(Input, Powers))));

// Get the resulting bytes
uint16_t Output;
vst1q_lane_u8((uint8_t*)&Output + 0, vreinterpretq_u8_u64(Mask), 0);
vst1q_lane_u8((uint8_t*)&Output + 1, vreinterpretq_u8_u64(Mask), 8);
(Mind http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47553, anyway.)
Similarly to Michael, the trick is to form the powers of the indexes of the non-null entries, and to sum them pairwise three times. This must be done with increasing data size to double the stride on every addition. You reduce from 2 x 8 8-bit entries to 2 x 4 16-bit, then 2 x 2 32-bit and 2 x 1 64-bit. The low byte of these two numbers gives the solution. I don't think there is an easy way to pack them together to form a single short value using NEON.
Takes 6 NEON instructions if the input is in the suitable form and the powers can be preloaded.
The obvious solution seems to be completely missed here.
// Use shifts to collect all of the sign bits.
// I'm not sure if this works on big endian, but big endian NEON is very
// rare.
int vmovmaskq_u8(uint8x16_t input)
{
    // Example input (half scale):
    // 0x89 FF 1D C0 00 10 99 33

    // Shift out everything but the sign bits
    // 0x01 01 00 01 00 00 01 00
    uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7));

    // Merge the even lanes together with vsra. The '??' bytes are garbage.
    // vsri could also be used, but it is slightly slower on aarch64.
    // 0x??03 ??02 ??00 ??01
    uint32x4_t paired16 = vreinterpretq_u32_u16(vsraq_n_u16(high_bits, high_bits, 7));

    // Repeat with wider lanes.
    // 0x??????0B ??????04
    uint64x2_t paired32 = vreinterpretq_u64_u32(vsraq_n_u32(paired16, paired16, 14));

    // 0x??????????????4B
    uint8x16_t paired64 = vreinterpretq_u8_u64(vsraq_n_u64(paired32, paired32, 28));

    // Extract the low 8 bits from each lane and join.
    // 0x4B
    return vgetq_lane_u8(paired64, 0) | ((int)vgetq_lane_u8(paired64, 8) << 8);
}
This question deserves a newer answer for aarch64. The addition of new capabilities to Armv8 allows the same function to be implemented in fewer instructions. Here's my version:
uint32_t _mm_movemask_aarch64(uint8x16_t input)
{
    // The shift amounts are negative, so they must be signed and passed as int8x16_t.
    const int8_t __attribute__ ((aligned (16))) cShift[] = {-7,-6,-5,-4,-3,-2,-1,0,-7,-6,-5,-4,-3,-2,-1,0};
    int8x16_t vshift = vld1q_s8(cShift);
    uint8x16_t vmask = vandq_u8(input, vdupq_n_u8(0x80));
    uint32_t out;

    vmask = vshlq_u8(vmask, vshift);
    out = vaddv_u8(vget_low_u8(vmask));
    out += (vaddv_u8(vget_high_u8(vmask)) << 8);

    return out;
}
After some tests, it looks like the following code works correctly:
int32_t _mm_movemask_epi8_neon(uint8x16_t input)
{
    const int8_t __attribute__ ((aligned (16))) xr[8] = {-7,-6,-5,-4,-3,-2,-1,0};
    uint8x8_t mask_and = vdup_n_u8(0x80);
    int8x8_t mask_shift = vld1_s8(xr);

    uint8x8_t lo = vget_low_u8(input);
    uint8x8_t hi = vget_high_u8(input);

    lo = vand_u8(lo, mask_and);
    lo = vshl_u8(lo, mask_shift);

    hi = vand_u8(hi, mask_and);
    hi = vshl_u8(hi, mask_shift);

    lo = vpadd_u8(lo, lo);
    lo = vpadd_u8(lo, lo);
    lo = vpadd_u8(lo, lo);

    hi = vpadd_u8(hi, hi);
    hi = vpadd_u8(hi, hi);
    hi = vpadd_u8(hi, hi);

    return ((hi[0] << 8) | (lo[0] & 0xFF));
}
Note that I haven't tested any of this, but something like this might work:
X := the vector that you want to create the mask from
A := 0x808080808080...
B := 0x00FFFEFDFCFB... (i.e. 0,-1,-2,-3,...)
X = vand_u8(X, A); // Keep d7 of each byte in X
X = vshl_u8(X, B); // X[7]>>=0; X[6]>>=1; X[5]>>=2; ...
// Each byte of X now contains its msb shifted 7-N bits to the right, where N
// is the byte index.
// Do 3 pairwise adds in order to pack all these into X[0]
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
X = vpadd_u8(X, X);
// X[0] should now contain the mask. Clear the remaining bytes if necessary
This would need to be repeated once to process a 128-bit vector, since vpadd only works on 64-bit vectors.
I know this question has been here for 8 years already, but let me give you an answer which might solve most of the performance problems with emulation. It's based on the blog post Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more.
Most uses of movemask come from comparisons, where the result vector holds 0xFF or 0x00 in each of its 16 bytes. The mask is then typically used to check whether none/all of the bytes match, to find the leading/trailing match, or to iterate over the set bits.
If that is the case, which it often is, you can use the shrn reg1, reg2, #4 instruction. This Shift-Right-then-Narrow instruction reduces a 128-bit byte mask to a 64-bit nibble mask (alternating low and high nibbles in the result), which can then be extracted into a 64-bit general-purpose register.
const uint16x8_t equalMask = vreinterpretq_u16_u8(vceqq_u8(chunk, vdupq_n_u8(tag)));
const uint8x8_t res = vshrn_n_u16(equalMask, 4);
const uint64_t matches = vget_lane_u64(vreinterpret_u64_u8(res), 0);
return matches;
After that you can use all bit operations you typically use on x86 with very minor tweaks like shifting by 2 or doing a scalar AND.
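For example (my addition, not from the blog post), turning the nibble mask into the index of the first matching byte looks like this; findFirstMatch is a hypothetical helper built around the snippet above:
#include <arm_neon.h>
#include <stdint.h>

// Returns the index of the first byte of `chunk` equal to `tag`, or -1 if none match.
int findFirstMatch(uint8x16_t chunk, uint8_t tag)
{
    uint16x8_t equalMask = vreinterpretq_u16_u8(vceqq_u8(chunk, vdupq_n_u8(tag)));
    uint8x8_t res = vshrn_n_u16(equalMask, 4);          // 4 mask bits per source byte
    uint64_t matches = vget_lane_u64(vreinterpret_u64_u8(res), 0);
    if (matches == 0)
        return -1;
    // Each byte occupies 4 bits of the mask, hence the shift by 2.
    return __builtin_ctzll(matches) >> 2;
}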

What is wrong with my program?

I cannot figure out what is wrong. I spent a few hours trying to debug this. I am compiling with gcc -m32 source.c -o source
How else can I approach this when debugging? Right now, I am isolating the code in many different ways and everything works the way I expect, but it works the wrong way when I have it all together.
This program takes an input and then looks for the highest position with the 1 bit.
I removed my code for now.
In bitsearch, you are storing num in eax, and you store a special value in edx in order to perform check. check tests whether the highest bit is set (indicating a negative number), and exits if that's the case...
The andl instruction in check stores the result of the operation in the second operand (eax), so the result overwrites num.
Then in zero you are using edx to perform your computation... edx still contains the special value from the start of the function, so your result will always be wrong.
Now at the end of zero, you are going back to check, but the check is unnecessary here; you should loop back to zero instead...
Does the bit-search need to be implemented in assembly? A simple for loop can accomplish the same task, and is much more readable:
int num = 10;
int maxFound = -1;
for (int numShifts = 0; numShifts < 32 && num != 0; numShifts++) {
    if ((num & 1) == 1) {
        maxFound = numShifts;
    }
    num = num >> 1;
}
//the last position that had a 1 will be in maxFound
There's a neat bit-fiddling trick: x & -x isolates the last 1-bit. The following C program uses a lookup table based on de Bruijn sequences to compute the number of trailing (!) zeros of a number in constant (!) time:
unsigned int x; // find the number of trailing zeros in 32-bit x
int r; // result goes here
int table[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = table[((uint32_t)((x & -x) * 0x077CB531U)) >> 27];
Doing this in assembly language (which I stopped learning by the age of 16) should be no problem. Now all you have to do is to reverse the bits in num and apply the technique described above.
I wrote a paper about the trick described above, but unfortunately it's not available on the web. If you're interested, I can send it to you (or anyone else who's interested) by email.
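In C, the missing piece would look roughly like this (my sketch, not from the answer or the paper): reverse the bits with the standard swap sequence, then the trailing-zero count of the reversed value is 31 minus the position of the highest set bit of the original:
#include <stdint.h>

// Returns the position of the highest set bit of x (x must be non-zero).
int highestSetBit(uint32_t x)
{
    static const int table[32] =
    {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    // Reverse the bits of x
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    x = (x >> 16) | (x << 16);
    // Trailing zeros of the reversed value = 31 - (highest set bit of the original)
    int tz = table[((uint32_t)((x & (0u - x)) * 0x077CB531u)) >> 27];
    return 31 - tz;
}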
My assembly knowledge is a little rusty, but it seems to me like bitsearch is overly complicated. How about just shifting the number to the right and counting how many shifts you need until it's zero?
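In C, that idea looks roughly like this (my sketch, just to illustrate the counting):
// Shift right and count until the value reaches zero;
// the count is the position of the highest set bit (-1 for zero input).
unsigned num = 10;
int highestBit = -1;
while (num != 0) {
    num >>= 1;
    highestBit++;
}
// highestBit == 3 for num == 10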
