I have a question about using 128-bit registers to gain speed in my code. Consider the following C/C++ code: I define two unsigned long long ints a and b, and give them some values.
unsigned long long int a = 4369, b = 56481;
Then, I want to compute
a & b;
Here a is represented in the computer as a 64-bit number 4369 = 1000100010001, and likewise b = 56481 = 1101110010100001, and I compute a & b, which is still a 64-bit number given by the bit-by-bit logical AND between a and b:
a & b = 1000000000001
My question is the following: do computers have a 128-bit register where I could do the operation above, but with 128-bit integers rather than 64-bit integers, in the same amount of time? To be clearer: I would like to gain a factor of two in speed by using 128-bit numbers rather than 64-bit numbers, i.e. compute 128 ANDs rather than 64 ANDs (one AND for every bit) in the same time. If this is possible, do you have a code example? I have heard that the SSE registers might do this, but I am not sure.
Yes, SSE2 has a 128-bit bitwise AND; you can use it via intrinsics in C or C++, e.g.
#include "emmintrin.h" // SSE2 intrinsics
__m128i v0, v1, v2; // 128 bit variables
v2 = _mm_and_si128(v0, v1); // bitwise AND
or you can use it directly in assembler - the instruction is PAND.
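For example, here is a minimal sketch (the function and array names are mine, not from the question) that ANDs two arrays of unsigned long long, 128 bits (two elements) per iteration:
#include <emmintrin.h>   // SSE2
#include <stddef.h>

// AND two arrays of 64-bit integers, two elements at a time.
// Assumes n is even; unaligned loads/stores so no alignment is required.
void and_arrays(const unsigned long long* a, const unsigned long long* b,
                unsigned long long* out, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(out + i), _mm_and_si128(va, vb));   // 128-bit AND
    }
}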
You can even do a 256-bit AND on Haswell and later CPUs, which have AVX2:
#include "immintrin.h" // AVX2 intrinsics
__m256i v0, v1, v2; // 256 bit variables
v2 = _mm256_and_si256(v0, v1); // bitwise AND
The corresponding instruction in this case is VPAND.
Related
I would like to take the result of an 8-bit vertical SIMD comparison between 256-bit vectors and pack the bits into the lowest byte of each 32-bit element for a vpshufb lookup on the lowest bytes. This isn't terribly difficult with AVX-512 (replace the & with a masked move if using 512-bit vectors):
__m256i cmp_8_into_32(__m256i a, __m256i b) {
    return _mm256_popcnt_epi32(_mm256_cmpeq_epi8(a, b)
                               & _mm256_set1_epi32(0xff0f0301 /* can be any order */));
}
That's three uops and, assuming perfect scheduling, a throughput of 1 according to uops.info, which is not bad. Alas, vpopcntd isn't in AVX2. What's the optimal way to do this operation there? The best I can think of is to mask the pairs of bits at indices 7,8 and 15,16, then do two constant-amount vpsrld shifts and a vpor. That's 6 uops with a throughput of about 2.5. Not bad, but I wonder if there's something better.
Following chtz's comment (thanks!), I realize it's actually fairly easy:
__m256i cmp_8_into_32_1(__m256i a, __m256i b) {
    const __m256i weights = _mm256_set1_epi32(0x08040201);
    const __m256i all_n1 = _mm256_set1_epi16(-0x1);
    __m256i cmp = _mm256_cmpeq_epi8(a, b);
    __m256i hsum16 = _mm256_maddubs_epi16(weights, cmp);
    return _mm256_madd_epi16(hsum16, all_n1);
}
Peter Cordes's suggestion saved an additional vpand. The two multiply–add instructions both run on either port 0 or 1, so this has the same throughput as the original popcount-based solution, although with a latency of about 11 instead of 5.
An alternative that uses one multiply:
__m256i cmp_8_into_32(__m256i a, __m256i b) {
    __m256i cmp = _mm256_cmpeq_epi8(a, b);
    __m256i msk = _mm256_and_si256(cmp, _mm256_set1_epi32(0x08040201));
    __m256i hsum = _mm256_madd_epi16(msk, _mm256_set1_epi8(1));
    return _mm256_srli_epi16(hsum, 8);
}
A 32-bit multiply (_mm256_mullo_epi32) is not used because it is slow.
If the results are not needed "in-lane" then one could use a _mm256_packs_epi16 immediately after the comparison to process twice as much data at once. If you don't need all of the possible states (say we don't care about lowest byte matches) then you could do 4x as much per instruction. If the results from the vpshufb lookup are getting gathered together then there may be other possible optimizations...
What I want to do is:
Multiply the input floating-point numbers by a fixed factor.
Convert them to 8-bit signed char.
Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].
I am working with the AVX2 instruction set only, so AVX-512 intrinsics like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16, but it mixes two inputs together. :(
I also wrote some code that converts 32-bit float to 16-bit int, and it works exactly as I want.
#include <immintrin.h>
#include <assert.h>

void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
    // input is actually a matrix; num_rows and width are its numbers of rows and columns
    assert(width % 16 == 0);
    int num_input_chunks = width / 16;
    __m256 avx2_quant_mult = _mm256_set1_ps(quant_mult);
    for (int i = 0; i < num_rows; ++i) {
        const float* input_row = input + i * width;
        __m256i* output_row = output + i * num_input_chunks;
        for (int j = 0; j < num_input_chunks; ++j) {
            const float* x = input_row + j * 16;
            // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.
            __m256 f_0 = _mm256_loadu_ps(x);
            __m256 f_1 = _mm256_loadu_ps(x + 8);
            __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
            __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);
            __m256i i_0 = _mm256_cvtps_epi32(m_0);
            __m256i i_1 = _mm256_cvtps_epi32(m_1);
            *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
        }
    }
}
Any help is welcome, thank you so much!
For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that need to be stored separately.)
Or are you complaining about how it operates in-lane? Yes that's annoying, but _mm256_packs_epi32 does the same thing. If it's ok for your outputs to have interleaved groups of data there, do the same thing for this, too.
Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there's no lane-crossing pack). Then use one lane-crossing shuffle to fix it up.
#include <immintrin.h>
// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float* p) {
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p+8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p+16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p+24));
    __m256i ab = _mm256_packs_epi32(a,b);        // 16x int16_t
    __m256i cd = _mm256_packs_epi32(c,d);
    __m256i abcd = _mm256_packs_epi16(ab, cd);   // 32x int8_t
    // packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
    // if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done
    // but if you need sequential order, then vpermd:
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    return lanefix;
}
(Compiles nicely on the Godbolt compiler explorer).
Call this in a loop and _mm256_store_si256 the resulting vector.
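For example, a caller loop might look like this sketch (the function and buffer names are mine; it assumes n is a multiple of 32 and uses unaligned stores, so switch to _mm256_store_si256 if dst is 32-byte aligned):
#include <stddef.h>
#include <stdint.h>

// Convert n floats to n int8_t using pack_float_int8 from above.
void convert_to_int8(const float* src, int8_t* dst, size_t n)
{
    for (size_t i = 0; i < n; i += 32) {
        __m256i v = pack_float_int8(src + i);          // 32 floats -> 32 int8_t
        _mm256_storeu_si256((__m256i*)(dst + i), v);
    }
}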
(For uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because 16 -> u8 vpackuswb packing still takes its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)
With 4 total shuffles per 256-bit store, 1 shuffle per clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or maybe bottlenecked on memory bandwidth if data isn't hot in L2.
If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for lane-crossing.
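Roughly like this, as a sketch (the helper name is mine; note that vpshufb truncates rather than saturates, so the int32 values need to already fit in int8):
// Truncate 8x int32 to 8x int8, ending up in the low 8 bytes of the result.
__m256i truncate1_epi32_to_epi8(__m256i v)
{
    const __m256i grab_low_bytes = _mm256_setr_epi8(
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,   // low lane
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);  // high lane
    __m256i packed = _mm256_shuffle_epi8(v, grab_low_bytes);  // low byte of each dword -> low dword of each lane
    // move dword 4 (from the high lane) next to dword 0 so the 8 bytes are contiguous
    return _mm256_permutevar8x32_epi32(packed, _mm256_setr_epi32(0, 4, 1, 2, 3, 5, 6, 7));
}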
Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you're just doing a single vector. (Also, on Ryzen you'll generally want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) two 128-bit uops.)
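A sketch of that alternative (the helper name is mine):
// Single-vector 8x int32 -> 8x int8 with signed saturation, result in the low 8 bytes.
__m128i pack1_epi32_to_epi8(__m256i v)
{
    __m128i lo  = _mm256_castsi256_si128(v);        // elements 0..3
    __m128i hi  = _mm256_extracti128_si256(v, 1);   // elements 4..7
    __m128i w16 = _mm_packs_epi32(lo, hi);          // 8x int16, in order
    return _mm_packs_epi16(w16, w16);               // 8x int8 in the low half
}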
Related:
SSE - AVX conversion from double to char
How can I convert a vector of float to short int using avx instructions?
Please check the IEEE 754 standard for how float values are stored. Once you understand how float and double are laid out in memory, you will know how to convert a float or double to char; it is quite simple.
I want to do some operations using Intel intrinsics on vectors of 16-bit unsigned integers, and the operations are the following:
load or set from an array of unsigned short int.
Div and Mod operations with unsigned short int.
Multiplication operation with unsigned short int.
Store operation of unsigned short int into an array.
I looked into the Intrinsics Guide, but it looks like there are only intrinsics for signed short integers and not for unsigned ones. Does anyone have a trick that could help me with this?
In fact, I'm trying to store an image in a specific raster format in an array with a specific ordering, so I have to calculate the index where each pixel value is going to be stored:
unsigned int Index(unsigned int interleaving_depth, unsigned int x_size, unsigned int y_size, unsigned int z_size, unsigned int Pixel_number)
{
    unsigned int x = 0, y = 0, z = 0, remainder = 0, i = 0;
    y = Pixel_number/(x_size*z_size);
    remainder = Pixel_number % (x_size*z_size);
    i = remainder/(x_size*interleaving_depth);
    remainder = remainder % (x_size*interleaving_depth);
    if(i == z_size/interleaving_depth){
        x = remainder/(z_size - i*interleaving_depth);
        remainder = remainder % (z_size - i*interleaving_depth);
    }
    else
    {
        x = remainder/interleaving_depth;
        remainder = remainder % interleaving_depth;
    }
    z = interleaving_depth*i + remainder;
    if(z >= z_size)
        z = z_size - 1;
    return x + y*x_size + z*x_size*y_size;
}
If you only want the low half of the result, multiplication is the same binary operation for signed or unsigned. So you can use pmullw on either. There are separate high-half multiply instructions for signed and unsigned short, though: _mm_mulhi_epu16 (pmulhuw) vs. _mm_mulhi_epi16 (pmulhw).
Similarly, you don't need an _mm_set_epu16 because it's the same operation: on x86 casting to signed doesn't change the bit-pattern, so Intel only bothered to provide _mm_set_epi16, but you can use it with args like 0xFFFFu instead of -1 with no problems. (Using Intel intrinsics automatically means your code only has to be portable to x86 32 and 64 bit.)
Load / store intrinsics don't change the data at all.
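For example, here is a minimal sketch (the function name is mine) that loads unsigned shorts, multiplies them element-wise with the "signed" pmullw intrinsic, and stores the results:
#include <emmintrin.h>   // SSE2
#include <stddef.h>

// Element-wise multiply of two uint16_t arrays, keeping the low 16 bits of each product.
// pmullw produces the same low-half result for signed and unsigned inputs.
// Assumes n is a multiple of 8.
void mul_u16(const unsigned short* a, const unsigned short* b,
             unsigned short* out, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(out + i), _mm_mullo_epi16(va, vb));
    }
}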
SSE/AVX doesn't have integer division or mod instructions. If you have compile-time-constant divisors, do it yourself with a multiply/shift. You can look at compiler output to get the magic constant and shift counts (Why does GCC use multiplication by a strange number in implementing integer division?), or even let gcc auto-vectorize something for you. Or even use GNU C native vector syntax to divide:
#include <immintrin.h>
__m128i div13_epu16(__m128i a)
{
    typedef unsigned short __attribute__((vector_size(16))) v8uw;
    v8uw tmp = (v8uw)a;
    v8uw divisor = (v8uw)_mm_set1_epi16(13);
    v8uw result = tmp/divisor;
    return (__m128i)result;
    // clang allows "lax" vector type conversions without casts
    // gcc allows vector / scalar, e.g. tmp / 13.  Clang requires set1
    // to work with both, we need to jump through all the syntax hoops
}
compiles to this asm with gcc and clang (Godbolt compiler explorer):
div13_epu16:
        pmulhuw xmm0, XMMWORD PTR .LC0[rip]   # tmp93,
        psrlw   xmm0, 2                       # tmp95,
        ret
        .section .rodata
.LC0:
        .value  20165
        # repeats 8 times
If you have runtime-variable divisors, it's going to be slower, but you can use http://libdivide.com/. It's not too bad if you reuse the same divisor repeatedly, so you only have to calculate a fixed-point inverse for it once; but code that uses an arbitrary inverse needs a variable shift count, which is less efficient with SSE (and for scalar integer code too), and potentially more instructions, because some divisors require a more complicated sequence than others.
Is there any way to implement the logic below in NEON?
I did not find any multiply-accumulate instruction with 64-bit input and output values.
int64x2_t result;
int64x2_t num1;
int64x2_t num2;
result += num1 * num2;
Technically, multiplying two 64-bit values can produce a 128-bit result. That's why there are the following widening multiply-accumulate functions (64-bit accumulator += 32-bit × 32-bit), but not one that takes two 64-bit input values:
int64x2_t vmlal_s32 (int64x2_t, int32x2_t, int32x2_t);
int64x2_t vqdmlal_s32 (int64x2_t, int32x2_t, int32x2_t);
If those don't work for you, then you'll need to use scalar 64*64 multiplies followed by vaddq_s64.
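A sketch of that scalar fallback (the helper name is mine; it keeps only the low 64 bits of each product, matching the result += num1 * num2 pseudocode):
#include <arm_neon.h>
#include <stdint.h>

static inline int64x2_t mla_s64(int64x2_t acc, int64x2_t a, int64x2_t b)
{
    // 64x64 -> low 64-bit multiply per lane, done in scalar code
    int64_t p0 = vgetq_lane_s64(a, 0) * vgetq_lane_s64(b, 0);
    int64_t p1 = vgetq_lane_s64(a, 1) * vgetq_lane_s64(b, 1);
    int64x2_t prod = vcombine_s64(vcreate_s64((uint64_t)p0), vcreate_s64((uint64_t)p1));
    return vaddq_s64(acc, prod);   // acc += a * b, per lane
}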
Note: Visual Studio implements _mul128, _umul128, __mulh, and __umulh for all architectures including ARM for handling the full 64*64 = 128-bit scenario.
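For example, a minimal sketch assuming an x64 MSVC target (the function name is mine):
#include <intrin.h>

// Full 64x64 -> 128-bit unsigned product: returns the low 64 bits, writes the high 64 bits to *hi.
unsigned __int64 mul_u64_wide(unsigned __int64 a, unsigned __int64 b, unsigned __int64* hi)
{
    return _umul128(a, b, hi);
}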
I have a __m128i register with 8 bit values with the content:
{-4,10,10,10,10,10,10,-4,-4,10,10,10,10,10,10,-4}
Now I want to convert it to eight 16-bit values in a __m128i register. It should look like:
{-4,10,10,10,10,10,10,-4}
How can I do this with as few instructions as possible?
I want to use SSSE3 at most.
Assuming you just want the first 8 values out of the 16 and are going to ignore the other 8 (the example data you give is somewhat ambiguous) then you can do it with SSE2 like this:
v = _mm_srai_epi16(_mm_unpacklo_epi8(v, v), 8);
You can do it this way with one SSE2 instruction (ignoring initialization). Note that unpacking with zero zero-extends rather than sign-extends, so it only matches the version above when the bytes are non-negative:
__m128i const zero = _mm_setzero_si128(); // (if you're in a loop pull this out)
__m128i v;
v = _mm_unpacklo_epi8(v, zero);