I would like to take the result of an 8-bit vertical SIMD comparison between 256-bit vectors and pack the bits into the lowest byte of each 32-bit element for a vpshufb lookup on the lowest bytes. This isn't terribly difficult with AVX-512 (replace the & with a masked move if using 512-bit vectors):
__m256i cmp_8_into_32(__m256i a, __m256i b) {
    return _mm256_popcnt_epi32(_mm256_cmpeq_epi8(a, b)
                               & _mm256_set1_epi32(0xff0f0301 /* can be any order */));
}
That's three uops and, assuming perfect scheduling, a throughput of 1 according to uops.info, which isn't bad. Alas, vpopcntd isn't in AVX2. What's the optimal way to do this operation there? The best I can think of is to mask the pairs of bits at indices 7,8 and 15,16, then perform two constant-amount vpsrld and a vpor. So that's 6 uops, with a throughput of roughly 2.5. Not bad, but I wonder if there's something better.
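For reference, a rough sketch of that shift/OR route could look like the following; this is my own guess at it, using bit pairs at indices 7,8 and 23,24 rather than exactly the ones mentioned above, and it leaves junk in bits 16,17 that would need one more vpand if the upper bytes matter:
__m256i cmp_8_into_32_shift_or(__m256i a, __m256i b) {
    __m256i cmp = _mm256_cmpeq_epi8(a, b);
    // keep bit 7 of byte 0, bit 0 of byte 1, bit 7 of byte 2, bit 0 of byte 3
    __m256i msk = _mm256_and_si256(cmp, _mm256_set1_epi32(0x01800180));
    __m256i lo  = _mm256_srli_epi32(msk, 7);   // matches for bytes 0,1 -> bits 0,1
    __m256i hi  = _mm256_srli_epi32(msk, 21);  // matches for bytes 2,3 -> bits 2,3
    return _mm256_or_si256(lo, hi);            // low byte holds the 4-bit mask
}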
Following chtz's comment (thanks!), I realize it's actually fairly easy:
__m256i cmp_8_into_32_1(__m256i a, __m256i b) {
    const __m256i weights = _mm256_set1_epi32(0x08040201);
    const __m256i all_n1  = _mm256_set1_epi16(-1);
    __m256i cmp    = _mm256_cmpeq_epi8(a, b);
    __m256i hsum16 = _mm256_maddubs_epi16(weights, cmp);
    return _mm256_madd_epi16(hsum16, all_n1);
}
Peter Cordes's suggestion saved an additional vpand. The two multiply–add instructions both run on either port 0 or 1, so this has the same throughput as the original popcount-based solution, although with a latency of about 11 instead of 5.
This version uses a single multiply:
__m256i cmp_8_into_32(__m256i a, __m256i b) {
    __m256i cmp  = _mm256_cmpeq_epi8(a, b);
    __m256i msk  = _mm256_and_si256(cmp, _mm256_set1_epi32(0x08040201));
    __m256i hsum = _mm256_madd_epi16(msk, _mm256_set1_epi8(1));
    return _mm256_srli_epi16(hsum, 8);
}
A 32-bit multiply (_mm256_mullo_epi32) is avoided because it is slow (2 uops on recent Intel CPUs).
If the results are not needed "in-lane" then one could use a _mm256_packs_epi16 immediately after the comparison to process twice as much data at once. If you don't need all of the possible states (say we don't care about lowest byte matches) then you could do 4x as much per instruction. If the results from the vpshufb lookup are getting gathered together then there may be other possible optimizations...
I am porting an application from Altivec to Neon.
I see a lot of intrinsics in Altivec which return scalar values.
Do we have any such intrinsics on ARM?
For instance, vec_all_gt.
There are no intrinsics that give scalar comparison results. This is because the common pattern for SIMD comparisons is to use branchless lane-masking and conditional selects to multiplex results, not branch-based control flow.
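For example, the usual branchless select pattern looks something like this (an illustrative per-lane maximum, just to show the idiom):
uint32x4_t  mask   = vcgeq_f32(a, b);        // all-ones or all-zeros per lane
float32x4_t result = vbslq_f32(mask, a, b);  // keep a where a >= b, otherwise b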
You can build them if you need them though ...
// Do a comparison of e.g. two vectors of floats
uint32x4_t compare = vcgeq_f32(a, b);

// Shift all compares down to a single bit in the LSB of each lane, other bits zero
uint32x4_t tmp = vshrq_n_u32(compare, 31);

// Shift compare results up so lane 0 = bit 0, lane 1 = bit 1, etc.
const int32_t shifta[4] = { 0, 1, 2, 3 };
const int32x4_t shift = vld1q_s32(shifta);
tmp = vshlq_u32(tmp, shift);

// Horizontal add across the vector to merge the result into a scalar (AArch64)
return vaddvq_u32(tmp);
... at which point you can define any() (mask is non-zero) and all() (mask is 0xF) comparisons if you need branchy logic.
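For example, given the 4-bit mask built above (call it mask), helpers along these lines would do (the names here are arbitrary):
static inline bool any4(uint32_t mask) { return mask != 0;   }  // at least one lane passed
static inline bool all4(uint32_t mask) { return mask == 0xF; }  // all four lanes passed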
Given a __m256i register and an index i, I want to extract a single byte from each value stored in the register and save it in another __m256i register. Then, after performing some calculation on the second register, I want to write the byte back into the first register without touching the other bytes.
Example:
index i = 2
__m256i a:
3210
|AAAA|AAAA|AAAA|AAAA|AAAA|AAAA|AAAA|AAAA|
__m256i b:
|FAFF|FAFF|FAFF|FAFF|FAFF|FAFF|FAFF|FAFF|
// some calculation
__m256i a:
|A6AA|A6AA|A6AA|A6AA|A6AA|A6AA|A6AA|A6AA|
I am sorry if this question has been asked before, but since I am new to this topic it is quite hard to find answers. Thank you!
I'll try to generalize the answers above:
const int index = 2;                                            // byte index
__m256i mask = _mm256_set1_epi32(0xFFu << (index * 8));         // byte mask      |0F00|0F00|...|0F00|
__m256i a;                                                      // source vector  |AAAA|AAAA|...|AAAA|
__m256i b = _mm256_blendv_epi8(_mm256_set1_epi8(-1), a, mask);  // extract byte   |FAFF|FAFF|...|FAFF|
// ... some manipulations on b ...                              //                |BBBB|BBBB|...|BBBB|
a = _mm256_blendv_epi8(a, b, mask);                             // store byte     |ABAA|ABAA|...|ABAA|
What I want to do is:
Multiply the input floating-point numbers by a fixed factor.
Convert them to 8-bit signed chars.
Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].
I am working with the AVX2 instruction set only, so intrinsics like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16 but it mixes two inputs together. :(
I also wrote some code that converts 32-bit float to 16-bit int, and it does exactly what I want.
#include <assert.h>
#include <immintrin.h>

void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
    // input is actually a matrix; num_rows and width are its number of rows and columns
    assert(width % 16 == 0);
    int num_input_chunks = width / 16;
    __m256 avx2_quant_mult = _mm256_set_ps(quant_mult, quant_mult, quant_mult, quant_mult,
                                           quant_mult, quant_mult, quant_mult, quant_mult);
    for (int i = 0; i < num_rows; ++i) {
        const float* input_row = input + i * width;
        __m256i* output_row = output + i * num_input_chunks;
        for (int j = 0; j < num_input_chunks; ++j) {
            const float* x = input_row + j * 16;
            // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.
            __m256 f_0 = _mm256_loadu_ps(x);
            __m256 f_1 = _mm256_loadu_ps(x + 8);
            __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
            __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);
            __m256i i_0 = _mm256_cvtps_epi32(m_0);
            __m256i i_1 = _mm256_cvtps_epi32(m_1);
            *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
        }
    }
}
Any help is welcome, thank you so much!
For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that need to be stored separately.)
Or are you complaining about how it operates in-lane? Yes, that's annoying, but _mm256_packs_epi32 does the same thing. If it's OK for your outputs to have interleaved groups of data there, do the same thing for this, too.
Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there's no lane-crossing pack). Then use one lane-crossing shuffle to fix it up.
#include <immintrin.h>

// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float* p) {
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p + 24));
    __m256i ab = _mm256_packs_epi32(a, b);      // 16x int16_t
    __m256i cd = _mm256_packs_epi32(c, d);
    __m256i abcd = _mm256_packs_epi16(ab, cd);  // 32x int8_t
    // packed into one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
    // if you can deal with that in-memory format (e.g. for a later in-lane unpack), great, you're done
    // but if you need sequential order, then vpermd:
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    return lanefix;
}
(Compiles nicely on the Godbolt compiler explorer).
Call this in a loop and _mm256_store_si256 the resulting vector.
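For example, a driver loop might look roughly like the following; the function name, pointer names, and the assumption that n is a multiple of 32 are all mine, and an unaligned store is used to stay safe:
#include <stddef.h>
#include <stdint.h>

void quantize_all(const float* in, int8_t* out, size_t n) {
    for (size_t i = 0; i < n; i += 32)   // 32 floats in, 32 int8_t out per iteration
        _mm256_storeu_si256((__m256i*)(out + i), pack_float_int8(in + i));
}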
(For uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because 16 -> u8 vpackuswb packing still takes its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)
With 4 total shuffles per 256-bit store, 1 shuffle per clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or maybe bottlenecked on memory bandwidth if data isn't hot in L2.
If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for lane-crossing.
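Here's a rough sketch of that single-vector route (my own code; note it grabs the low byte of each element, i.e. it truncates instead of saturating, so the int32 values must already fit in int8):
__m256i pack1_epi32_to_epi8(__m256i v) {
    // in each 128-bit lane, gather byte 0 of every epi32 element into the low dword, zero the rest
    const __m256i shuf = _mm256_setr_epi8(
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);
    __m256i bytes = _mm256_shuffle_epi8(v, shuf);
    // lane-crossing fixup: move dword 4 (the high lane's packed bytes) next to dword 0,
    // so the 8 packed int8_t results end up in the low 8 bytes of the return value
    return _mm256_permutevar8x32_epi32(bytes, _mm256_setr_epi32(0, 4, 1, 2, 3, 5, 6, 7));
}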
Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you're just doing a single vector. (Still on Ryzen, you'll want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) 2 128-bit uops.)
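A sketch of that alternative as well (again my own code; unlike the shuffle version above, the two 128-bit packs do saturate):
__m128i pack1_epi32_to_epi8_sse(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);
    __m128i w  = _mm_packs_epi32(lo, hi);   // 8x int16_t with signed saturation, in sequential order
    return _mm_packs_epi16(w, w);           // low 8 bytes hold the 8x int8_t results
}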
Related:
SSE - AVX conversion from double to char
How can I convert a vector of float to short int using avx instructions?
Please check the IEEE 754 standard for how float values are stored. First understand how float and double are laid out in memory; then you will know how to convert a float or double to char, it is quite simple.
I have a question about using 128-bit registers to gain speed in my code. Consider the following C/C++ code: I define two unsigned long long ints a and b, and give them some values.
unsigned long long int a = 4369, b = 56481;
Then, I want to compute
a & b;
Here a is represented in the computer as a 64-bit number 4369 = 1000100010001, and likewise b = 56481 = 1101110010100001, and I compute a & b, which is still a 64-bit number given by the bit-by-bit logical AND between a and b:
a & b = 1000000000001
My question is the following: do computers have 128-bit registers where I could do the operation above, but with 128-bit integers rather than 64-bit integers, in the same amount of time? To be clearer: I would like to gain a factor of two in speed in my code by using 128-bit numbers rather than 64-bit numbers, e.g. by computing 128 ANDs rather than 64 ANDs (one AND for every bit) in the same amount of time. If this is possible, do you have a code example? I have heard that the SSE registers might do this, but I am not sure.
Yes, SSE2 has a 128-bit bitwise AND. You can use it via intrinsics in C or C++, e.g.
#include <emmintrin.h>  // SSE2 intrinsics

__m128i v0, v1, v2;          // 128-bit variables
v2 = _mm_and_si128(v0, v1);  // bitwise AND
or you can use it directly in assembler - the instruction is PAND.
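As a concrete usage example (my own sketch; the function name and the assumption that n is even are mine), ANDing two arrays of 64-bit words two at a time could look like this:
#include <stddef.h>
#include <emmintrin.h>

void and_u64_sse2(const unsigned long long* a, const unsigned long long* b,
                  unsigned long long* out, size_t n) {  // n assumed to be even
    for (size_t i = 0; i < n; i += 2) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(out + i), _mm_and_si128(va, vb));
    }
}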
You can even do a 256-bit AND on Haswell and later CPUs which have AVX2:
#include <immintrin.h>  // AVX2 intrinsics

__m256i v0, v1, v2;              // 256-bit variables
v2 = _mm256_and_si256(v0, v1);   // bitwise AND
The corresponding instruction in this case is VPAND.
I have a __m128i register of 8-bit values with the content:
{-4,10,10,10,10,10,10,-4,-4,10,10,10,10,10,10,-4}
Now I want to convert it to eight 16-bit values in a __m128i register. It should look like:
{-4,10,10,10,10,10,10,-4}
How is this possible with as few instructions as possible?
I want to use SSSE3 at most.
Assuming you just want the first 8 values out of the 16 and are going to ignore the other 8 (the example data you give is somewhat ambiguous) then you can do it with SSE2 like this:
v = _mm_srai_epi16(_mm_unpacklo_epi8(v, v), 8);  // interleave v with itself, then arithmetic-shift each 16-bit element right by 8 to sign-extend
You can do it this way with one SSE2 instruction (ignoring initialization), but note that unpacking against zero zero-extends rather than sign-extends, so it only matches the example if the bytes are non-negative; the -4 values above need the sign-extending version.
__m128i const zero = _mm_setzero_si128();  // (if you're in a loop, pull this out)
__m128i v;
v = _mm_unpacklo_epi8(v, zero);            // zero-extends the low 8 bytes to 16 bits