Effective use of vmlaq_s16

When using the vmlaq_s16 intrinsic/VMLA.I16 instruction, the result takes the form of a set of 8 16-bit integers. However, the multiplies inside the instruction would need their results stored in 32-bit integers to be protected from overflow.
On Intel processors with SSE2, _mm_madd_epi16 preserves the register width (8 16-bit integers go into 4 32-bit results) by multiplying and adding pairs of consecutive elements of the vectors, i.e.
r0 := (a0 * b0) + (a1 * b1)
r1 := (a2 * b2) + (a3 * b3)
r2 := (a4 * b4) + (a5 * b5)
r3 := (a6 * b6) + (a7 * b7)
Where r0,r1,r2,r3 are all 32-bit, and a0-a7, b0-b7 are all 16-bit elements.
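(For reference, a minimal self-contained sketch of that SSE2 pattern used as a dot product; the horizontal add at the end just produces a scalar, and dot8_s16 is an illustrative name, not part of the question.)
#include <emmintrin.h>
int dot8_s16(const short *a, const short *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s  = _mm_madd_epi16(va, vb);                            /* r0..r3 as above */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);                                    /* r0 + r1 + r2 + r3 */
}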
Is there a trick that I'm missing with the vmlaq_s16 instruction that would still allow me to process 8 16-bit elements at once and get results that don't overflow? Or is this instruction simply provided for operands that are inherently in the 8-bit range (highly doubtful)?
Thanks!
EDIT: So I just thought about the fact that if vmlaq_s16 sets the overflow register flag(s?) for each of the elements in the result, then it's easy to count the overflows and recover the result.
EDIT 2: For everyone's reference, here's how to load 8 elements and pipeline two long multiply-adds on a 128-bit register with intrinsics (proof of concept code that compiles with VS2012 for the ARM target):
signed short vector1[] = {1, 2, 3, 4, 5, 6, 7, 8};
signed short vector2[] = {1, 2, 3, 4, 5, 6, 7, 8};
int16x8_t v1; // = vdupq_n_s16(0);
int16x8_t v2; // = vdupq_n_s16(0);
v1 = vld1q_s16(vector1);
v2 = vld1q_s16(vector2);
int32x4_t sum = vdupq_n_s32(0); // vdupq_n_s32, since the accumulator holds 32-bit lanes
sum = vmlal_s16(sum, v1.s.low64, v2.s.low64);
sum = vmlal_s16(sum, v1.s.high64, v2.s.high64);
printf("sum: %d\n", sum.n128_i32[0]);

These aren't directly equivalent operations - VMLA multiplies two vectors then adds the result elementwise to a 3rd vector, unlike the self-contained half-elementwise-half-horizontal craziness of Intel's PMADDWD. Since that 3rd vector is a regular operand it has to exist in a register, thus there's no room for a 256-bit accumulator.
If you don't want to risk overflow by using VMLA to do 8x16 * 8x16 + 8x16, the alternative is to use VMLAL to do 4x16 * 4x16 + 4x32. The obvious suggestion would be to pipeline pairs of instructions to process 8x16 vectors into two 4x32 accumulators then add them together at the end, but I'll admit I'm not too familiar with intrinsics so I don't know how difficult they would make that (compared to assembly where you can exploit the fact that "64-bit vectors" and "128-bit vectors" are simply interchangeable views of the same register file).
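For reference, here is a sketch of that approach with the standard arm_neon.h intrinsics (GCC/Clang style rather than the MSVC member access used in the question); dot_s16 is an illustrative name, len is assumed to be a multiple of 8, and the 32-bit accumulators can of course still overflow for long enough inputs:
#include <arm_neon.h>
#include <stdint.h>
int32_t dot_s16(const int16_t *a, const int16_t *b, int len)
{
    int32x4_t acc_lo = vdupq_n_s32(0);
    int32x4_t acc_hi = vdupq_n_s32(0);
    for (int i = 0; i < len; i += 8) {
        int16x8_t va = vld1q_s16(a + i);
        int16x8_t vb = vld1q_s16(b + i);
        acc_lo = vmlal_s16(acc_lo, vget_low_s16(va),  vget_low_s16(vb));   /* 4x16 * 4x16 + 4x32 */
        acc_hi = vmlal_s16(acc_hi, vget_high_s16(va), vget_high_s16(vb));
    }
    int32x4_t sum = vaddq_s32(acc_lo, acc_hi);           /* add the two accumulators at the end */
    int32x2_t s2  = vadd_s32(vget_low_s32(sum), vget_high_s32(sum));
    s2 = vpadd_s32(s2, s2);                              /* horizontal add of the four lanes */
    return vget_lane_s32(s2, 0);
}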

Related

How can I quickly get the value 2^64 divided by random integer in C lang? [duplicate]

How to compute the integer division, 2^64/n? Assuming:
unsigned long is 64-bit
We use a 64-bit CPU
1 < n < 2^64
If we do 18446744073709551616ul / n, we get warning: integer constant is too large for its type at compile time. This is because we cannot express 2^64 in a 64-bit CPU. Another way is the following:
#define IS_POWER_OF_TWO(x) ((x & (x - 1)) == 0)
unsigned long q = 18446744073709551615ul / n;
if (IS_POWER_OF_TWO(n))
return q + 1;
else
return q;
Is there any faster (CPU cycle) or cleaner (coding) implementation?
I'll use uint64_t here (which needs the <stdint.h> include) so as not to require your assumption about the size of unsigned long.
phuclv's idea of using -n is clever, but can be made much simpler. As unsigned 64-bit integers, we have -n = 2^64 - n, then (-n)/n = 2^64/n - 1, and we can simply add back the 1.
uint64_t divide_two_to_the_64(uint64_t n) {
return (-n)/n + 1;
}
The generated code is just what you would expect (gcc 8.3 on x86-64 via godbolt):
mov rax, rdi
xor edx, edx
neg rax
div rdi
add rax, 1
ret
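For illustration, a quick sanity check (not part of the original answer; the expected quotients are worked out by hand, and divide_two_to_the_64 is the function above):
#include <stdio.h>
#include <stdint.h>
int main(void)
{
    printf("%llu\n", (unsigned long long)divide_two_to_the_64(3));   /* 6148914691236517205 */
    printf("%llu\n", (unsigned long long)divide_two_to_the_64(10));  /* 1844674407370955161 */
    return 0;
}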
I've come up with another solution which was inspired by this question. From there we know that
(a1 + a2 + a3 + ... + an)/n =
(a1/n + a2/n + a3/n + ... + an/n) + (a1 % n + a2 % n + a3 % n + ... + an % n)/n
By choosing a1 = a2 = a3 = ... = a(n-1) = 1 and a(n) = 2^64 - n we'll have
(a1 + a2 + a3 + ... + an)/n = (1 + 1 + ... + 1 + (2^64 - n))/n = 2^64/n
= [(n - 1)*(1/n) + (2^64 - n)/n] + [(n - 1)*(1 % n) + (2^64 - n) % n]/n
= (2^64 - n)/n + (n - 1 + (2^64 - n) % n)/n
2^64 - n is the 2's complement of n, which is -n, or we can also write it as ~0 - n + 1. So the final solution would be
uint64_t twoPow64div(uint64_t n)
{
return (-n)/n + (n + (-n) % n)/n + (n > 1ULL << 63);
}
The last part is to correct the result, because we deal with unsigned integers instead of signed ones like in the other question. I checked both the 32-bit and 64-bit versions on my PC and the result matches your solution.
On MSVC, however, there's an intrinsic for 128-bit division, so you can use it like this:
uint64_t remainder;
return _udiv128(1, 0, n, &remainder);
which results in the cleanest output
mov edx, 1
xor eax, eax
div rcx
ret 0
Here's the demo
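For completeness, a self-contained wrapper might look like this (a sketch of my own; _udiv128 is x64 MSVC only and requires <intrin.h>, and n must be greater than 1 or the 64-bit quotient overflows and the division faults):
#include <stdint.h>
#include <intrin.h>
uint64_t twoPow64div_msvc(uint64_t n)
{
    uint64_t remainder;
    /* divides the 128-bit value 2^64 (high half = 1, low half = 0) by n */
    return _udiv128(1, 0, n, &remainder);
}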
On most x86 compilers (one notable exception is MSVC) long double also has 64 bits of precision, so you can use any of these:
(uint64_t)(powl(2, 64)/n)
(uint64_t)(((long double)~0ULL)/n)
(uint64_t)(18446744073709551616.0L/n)
although probably the performance would be worse. This can also be applied to any implementations where long double has more than 63 bits of significand, like PowerPC with its double-double implementation
There's a related question about calculating ((UINT_MAX + 1)/x)*x - 1: Integer arithmetic: Add 1 to UINT_MAX and divide by n without overflow, which also has clever solutions. Based on that we have
2^64/n = (2^64 - n + n)/n = (2^64 - n)/n + 1 = (-n)/n + 1
which is essentially just another way to get Nate Eldredge's answer.
Here's some demo for other compilers on godbolt
See also:
Trick to divide a constant (power of two) by an integer
Efficient computation of 2**64 / divisor via fast floating-point reciprocal
We use a 64-bit CPU
Which 64-bit CPU?
In general, if you multiply a number with N bits by another number that has M bits, the result will have up to N+M bits. For integer division it's similar - if a number with N bits is divided by a number with M bits, the result will have up to N-M+1 bits.
Because multiplication is naturally "widening" (the result has more digits than either of the source numbers) and integer division is naturally "narrowing" (the result has less digits); some CPUs support "widening multiplication" and "narrowing division".
In other words, some 64-bit CPUs support dividing a 128-bit number by a 64-bit number to get a 64-bit result. For example, on 80x86 it's a single DIV instruction.
Unfortunately, C doesn't support "widening multiplication" or "narrowing division". It only supports "result is same size as source operands".
Ironically (for unsigned 64-bit divisors on 64-bit 80x86) there is no other choice: the compiler must use the DIV instruction, which divides a 128-bit number by a 64-bit number. This means that the C language forces you to use a 64-bit numerator, the code generated by the compiler extends your 64-bit numerator to 128 bits and divides it by a 64-bit number to get a 64-bit result, and then you write extra code to work around the fact that the language prevented you from using a 128-bit numerator to begin with.
Hopefully you can see how this situation might be considered "less than ideal".
What I'd want is a way to trick the compiler into supporting "narrowing division". For example, maybe by abusing casts and hoping that the optimiser is smart enough, like this:
__uint128_t numerator = (__uint128_t)1 << 64;
if(n > 1) {
return (uint64_t)(numerator/n);
}
I tested this for the latest versions of GCC, CLANG and ICC (using https://godbolt.org/ ) and found that (for 64-bit 80x86) none of the compilers are smart enough to realise that a single DIV instruction is all that is needed (they all generated code that does a call __udivti3, which is an expensive function to get a 128-bit result). The compilers will only use DIV when the (128-bit) numerator is 64 bits (and it will be preceded by an XOR RDX,RDX to set the highest half of the 128-bit numerator to zeros).
In other words, it's likely that the only way to get ideal code (the DIV instruction by itself on 64-bit 80x86) is to resort to inline assembly.
For example, the best code you'll get without inline assembly (from Nate Eldredge's answer) will be:
mov rax, rdi
xor edx, edx
neg rax
div rdi
add rax, 1
ret
...and the best code that's possible is:
mov edx, 1
xor rax, rax
div rdi
ret
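For what it's worth, a sketch of how that could be written with GCC/Clang extended inline assembly (my own illustration, not from the answer; the caller must ensure n > 1, otherwise the 64-bit quotient overflows and DIV raises a divide error):
#include <stdint.h>
uint64_t div_2e64_asm(uint64_t n)
{
    uint64_t lo = 0, hi = 1;                 /* RDX:RAX = 2^64 */
    __asm__("divq %[v]"
            : "+a"(lo), "+d"(hi)             /* quotient comes back in RAX, remainder in RDX */
            : [v] "r"(n)
            : "cc");
    return lo;
}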
Your way is pretty good. It might be better to write it like this:
return 18446744073709551615ul / n + ((n&(n-1)) ? 0:1);
The hope is to make sure the compiler notices that it can do a conditional move instead of a branch.
Compile and disassemble.

How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)

What I want to do is:
Multiply the input floating point number by a fixed factor.
Convert them to 8-bit signed char.
Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].
I can only use the AVX2 instruction set, so intrinsics like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16 but it mixes two inputs together. :(
I also wrote some code that converts 32-bit float to 16-bit int, and it works exactly as I want.
void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
    // input is actually a matrix; num_rows and width are the number of rows and columns
    assert(width % 16 == 0);
    int num_input_chunks = width / 16;
    __m256 avx2_quant_mult = _mm256_set_ps(quant_mult, quant_mult, quant_mult, quant_mult,
                                           quant_mult, quant_mult, quant_mult, quant_mult);
    for (int i = 0; i < num_rows; ++i) {
        const float* input_row = input + i * width;
        __m256i* output_row = output + i * num_input_chunks;
        for (int j = 0; j < num_input_chunks; ++j) {
            const float* x = input_row + j * 16;
            // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.
            __m256 f_0 = _mm256_loadu_ps(x);
            __m256 f_1 = _mm256_loadu_ps(x + 8);
            __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
            __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);
            __m256i i_0 = _mm256_cvtps_epi32(m_0);
            __m256i i_1 = _mm256_cvtps_epi32(m_1);
            *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
        }
    }
}
Any help is welcome, thank you so much!
For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that need to be stored separately.)
Or are you complaining about how it operates in-lane? Yes that's annoying, but _mm256_packs_epi32 does the same thing. If it's ok for your outputs to have interleaved groups of data there, do the same thing for this, too.
Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there's no lane-crossing pack). Then use one lane-crossing shuffle to fix it up.
#include <immintrin.h>
// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float*p) {
__m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
__m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p+8));
__m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p+16));
__m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p+24));
__m256i ab = _mm256_packs_epi32(a,b); // 16x int16_t
__m256i cd = _mm256_packs_epi32(c,d);
__m256i abcd = _mm256_packs_epi16(ab, cd); // 32x int8_t
// packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
// if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done
// but if you need sequential order, then vpermd:
__m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
return lanefix;
}
(Compiles nicely on the Godbolt compiler explorer).
Call this in a loop and _mm256_store_si256 the resulting vector.
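For example, a loop that also folds in the quant_mult scaling from the question might look like this (a sketch under the assumption that the total element count is a multiple of 32; QuantizeToInt8 is an illustrative name, not part of the answer):
#include <immintrin.h>
#include <stdint.h>
void QuantizeToInt8(const float* input, int8_t* output, float quant_mult, int size) {
    const __m256 scale = _mm256_set1_ps(quant_mult);
    for (int i = 0; i < size; i += 32) {
        // scale and convert 32 floats to 32-bit integers
        __m256i a = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(input + i),      scale));
        __m256i b = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(input + i + 8),  scale));
        __m256i c = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(input + i + 16), scale));
        __m256i d = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(input + i + 24), scale));
        // two in-lane packs, then a lane-crossing fixup, as in pack_float_int8 above
        __m256i ab   = _mm256_packs_epi32(a, b);
        __m256i cd   = _mm256_packs_epi32(c, d);
        __m256i abcd = _mm256_packs_epi16(ab, cd);
        abcd = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7));
        _mm256_storeu_si256((__m256i*)(output + i), abcd);
    }
}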
(For uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because 16 -> u8 vpackuswb packing still takes its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)
With 4 total shuffles per 256-bit store, 1 shuffle per clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or maybe bottlenecked on memory bandwidth if data isn't hot in L2.
If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for lane-crossing.
Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you're just doing a single vector. (Still on Ryzen, you'll want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) 2 128-bit uops.)
Related:
SSE - AVX conversion from double to char
How can I convert a vector of float to short int using avx instructions?
Please check the IEEE 754 standard for how float values are stored. Once you understand how float and double are laid out in memory, you will know how to convert a float or double to char - it is quite simple.

32-bit multiplication through 16-bit shifting

I am writing a soft-multiplication function call using shifting and addition. The existing function call goes like this:
unsigned long __mulsi3 (unsigned long a, unsigned long b) {
    unsigned long answer = 0;
    while (b)
    {
        if (b & 1) {
            answer += a;
        }
        a <<= 1;
        b >>= 1;
    }
    return answer;
}
Although my hardware does not have a multiplier, I have a hard shifter. The shifter is able to shift up to 16 bits at one time.
I want to make full use of my 16-bit shifter. Any suggestions on how I can adapt the code above to reflect my hardware's capabilities? The given code shifts only 1 bit per iteration.
The 16-bit shifter can shift 32-bit unsigned long values up to 16 places at a time; unsigned long is 32 bits wide.
The ability to shift multiple bits is not going to help much, unless you have a hardware multiply, say 8-bit x 8-bit, or you can afford some RAM/ROM to do (say) a 4-bit by 4-bit multiply by lookup.
The straightforward shift and add (as you are doing) can be helped by swapping the arguments so that the multiplier is the smaller.
If your machine is faster doing 16-bit things in general, then treating your 32-bit 'a' as 'a1:a0', 16 bits at a time, and similarly 'b', you just might be able to save some cycles. Your result is only 32 bits, so you don't need to do 'a1 * b1' -- though one or both of those may be zero, so the win may not be big! Also, you only need the least-significant 16 bits of 'a0 * b1', so that can be done entirely in 16 bits -- but if b1 (assuming b <= a) is generally zero, this is not a big win, either. For 'a * b0', you need a 32-bit 'a' and 32-bit adds into 'answer', but your multiplier is 16 bits only... which may or may not help.
Skipping runs of multiplier zeros could help -- depending on processor and any properties of the multiplier.
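A minimal sketch of that idea (my own illustration, not the answerer's code): skip runs of zero bits in the multiplier so the barrel shifter moves 'a' and 'b' by the whole run at once, up to 16 places, instead of one bit per iteration:
unsigned long mul_skip_zero_runs(unsigned long a, unsigned long b)
{
    unsigned long answer = 0;
    while (b) {
        if (b & 1) {
            answer += a;
            a <<= 1;
            b >>= 1;
        } else {
            unsigned n = 1;
            while (n < 16 && ((b >> n) & 1) == 0)   /* measure the zero run, capped at 16 */
                n++;
            a <<= n;                                /* one barrel shift instead of n single shifts */
            b >>= n;
        }
    }
    return answer;
}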
FWIW: doing the magic 'a1*b1', '(a1-a0)*(b0-b1)', 'a0*b0' and combining the result by shifts, adds and subtracts is, in my small experience, an absolute nightmare... the signs of '(a1-a0)', '(b0-b1)' and their product have to be respected, which makes a bit of a mess of what looks like a cute trick. By the time you have finished with that and the adds and subtracts, you would need a mighty slow multiply to make it all worthwhile! When multiplying very, very long integers this may help... but there the memory issues may dominate... when I tried it, it was something of a disappointment.
Having 16-bit shifts can help you make a minor speed enhancement using the following approach:
(U1 * P + U0) * (V1 * P + V0) =
= U1 * V1 * P * P + U1 * V0 * P + U0 * V1 * P + U0 * V0 =
= U1 * V1 * (P*P + P) + (U1 - U0) * (V0 - V1) * P + U0 * V0 * (1 + P)
provided P is a convenient power of 2 (for example, 2^16 or 2^32), so multiplying by it is a fast shift. This reduces the work from 4 multiplications of smaller numbers to 3 and, applied recursively, gives O(N^1.58) instead of O(N^2) for very long numbers.
This method is named Karatsuba's multiplication. There are more advanced versions described there.
For small numbers (e.g. 8 by 8 bits), the following method is fast, if you have enough fast ROM:
a * b = square(a+b)/4 - square(a-b)/4
If you tabulate int(square(x)/4), you'll need 1022 bytes for unsigned multiplication and 510 bytes for a signed one.
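A sketch of that table method for 8x8 unsigned multiplication (illustrative only; in a real system the table would sit in ROM, it is generated here at startup):
static unsigned short quarter_square[511];   /* int(x*x/4) for x = 0..510, 511 * 2 = 1022 bytes */

void init_quarter_square(void)
{
    for (int x = 0; x < 511; x++)
        quarter_square[x] = (unsigned short)((x * x) / 4);
}

unsigned short mul_8x8(unsigned char a, unsigned char b)
{
    /* a*b = square(a+b)/4 - square(a-b)/4; a+b <= 510, so it stays inside the table */
    return a >= b ? quarter_square[a + b] - quarter_square[a - b]
                  : quarter_square[a + b] - quarter_square[b - a];
}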
The basic approach is (assuming shifting by 1) :-
Shift the top 16 bits
Set the bottom bit of the top 16 bits to the top bit of the bottom 16 bits
Shift the bottom 16 bits
Depends a bit on your hardware...
but you could try :-
assuming unsigned long is 32 bits
assuming Big Endian
then :-
union Data32
{
unsigned long l;
unsigned short s[2];
};
unsigned long shiftleft32(unsigned long valueToShift, unsigned short bitsToShift)
{
    /* assumes 1 <= bitsToShift <= 15 and big-endian layout, so s[0] is the high half */
    union Data32 u;
    u.l = valueToShift;
    u.s[0] <<= bitsToShift;
    u.s[0] |= (u.s[1] >> (16 - bitsToShift));   /* carry the top bits of the low half in */
    u.s[1] <<= bitsToShift;
    return u.l;
}
then do the same in reverse for shifting right
The code above multiplies in the traditional way, the way we learnt in primary school:
EX:
0101
* 0111
-------
0101
0101.
0101..
--------
100011
Of course you cannot approach it like that if you don't have either a multiply operator or a 1-bit shifter!
You can, though, do it in other ways, for example with a loop:
unsigned long _mult(unsigned long a, unsigned long b)
{
    unsigned long res = 0;
    while (a > 0)
    {
        res += b;
        a--;
    }
    return res;
}
It is costly, but it serves your needs; anyway, you can think about other approaches if you have more constraints (like computation time ...).

Bitwise operations between 128-bit integers

I have a question about using 128-bit registers to gain speed in my code. Consider the following C/C++ code: I define two unsigned long long ints a and b, and give them some values.
unsigned long long int a = 4369, b = 56481;
Then, I want to compute
a & b;
Here a is represented in the computer as a 64-bit number, 4369 = 1000100010001, and likewise b = 56481 = 1101110010100001, and I compute a & b, which is still a 64-bit number given by the bit-by-bit logical AND between a and b:
a & b = 1000000000001
My question is the following: do computers have a 128-bit register where I could do the operation above, but with 128-bit integers rather than 64-bit integers, and in the same time? To be clearer: I would like to gain a factor of two in speed by using 128-bit numbers rather than 64-bit numbers, e.g. I would like to compute 128 ANDs rather than 64 ANDs (one AND for every bit) in the same computer time. If this is possible, do you have a code example? I have heard that the SSE registers might do this, but I am not sure.
Yes, SSE2 has a 128 bit bitwise AND - you can use it via intrinsics in C or C++, e.g.
#include "emmintrin.h" // SSE2 intrinsics
__m128i v0, v1, v2; // 128 bit variables
v2 = _mm_and_si128(v0, v1); // bitwise AND
or you can use it directly in assembler - the instruction is PAND.
You can even do a 256 bit AND on Haswell and later CPUs which have AVX2:
#include "immintrin.h" // AVX2 intrinsics
__m256i v0, v1, v2; // 256 bit variables
v2 = _mm256_and_si256(v0, v1); // bitwise AND
The corresponding instruction in this case is VPAND.
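As a usage sketch (my own illustration, not from the answer): ANDing two arrays of 64-bit integers two elements at a time, assuming an even element count and no particular alignment:
#include <emmintrin.h>   // SSE2
#include <stdint.h>
#include <stddef.h>
void and_arrays(const uint64_t *a, const uint64_t *b, uint64_t *out, size_t count)
{
    for (size_t i = 0; i < count; i += 2) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_and_si128(va, vb));  // one 128-bit AND per iteration
    }
}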

How to perform element-wise left shift with __m128i?

The SSE shift instructions I have found can only shift by the same amount on all the elements:
_mm_sll_epi32()
_mm_slli_epi32()
These shift all elements, but by the same shift amount.
Is there a way to apply different shifts to the different elements? Something like this:
__m128i a, b;
r0:= a0 << b0;
r1:= a1 << b1;
r2:= a2 << b2;
r3:= a3 << b3;
There exists the _mm_shl_epi32() intrinsic that does exactly that.
http://msdn.microsoft.com/en-us/library/gg445138.aspx
However, it requires the XOP instruction set. Only AMD Bulldozer and Interlagos processors or later have this instruction. It is not available on any Intel processor.
If you want to do it without XOP instructions, you will need to do it the hard way: Pull them out and do them one by one.
Without XOP instructions, you can do this with SSE4.1 using the following intrinsics:
_mm_insert_epi32()
_mm_extract_epi32()
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse41_reg_ins_ext.htm
Those will let you extract parts of a 128-bit register into regular registers to do the shift and put them back.
If you go with the latter method, it'll be horrifically inefficient. That's why _mm_shl_epi32() exists in the first place.
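For illustration only, that slow fallback might look like this (a sketch; shl_scalar_fallback is a made-up name):
#include <smmintrin.h>   // SSE4.1
// Shift each 32-bit element of a left by the corresponding element of counts.
__m128i shl_scalar_fallback(__m128i a, __m128i counts)
{
    a = _mm_insert_epi32(a, _mm_extract_epi32(a, 0) << _mm_extract_epi32(counts, 0), 0);
    a = _mm_insert_epi32(a, _mm_extract_epi32(a, 1) << _mm_extract_epi32(counts, 1), 1);
    a = _mm_insert_epi32(a, _mm_extract_epi32(a, 2) << _mm_extract_epi32(counts, 2), 2);
    a = _mm_insert_epi32(a, _mm_extract_epi32(a, 3) << _mm_extract_epi32(counts, 3), 3);
    return a;
}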
Without XOP, your options are limited. If you can control the format of the shift-count argument, then you can use _mm_mullo_epi16, since multiplying by a power of two is the same as shifting left by its exponent.
For example, if you want to shift your 8 16-bit elements in an SSE register by <0, 1, 2, 3, 4, 5, 6, 7>, you can multiply by 2 raised to those shift counts, i.e. by <1, 2, 4, 8, 16, 32, 64, 128>.
In some circumstances, this can substitute for _mm_shl_epi32(a, b):
_mm_mullo_epi16(a, 1 << b);   // pseudocode: each element of the second operand is 1 << b (for 32-bit elements you would need SSE4.1's _mm_mullo_epi32)
Generally speaking, this requires b to have a constant value - I don't know of an efficient way to calculate (1 << b) using older SSE instructions.
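A minimal sketch of the constant-count, 16-bit case described above (SSE2 only; the function name is mine):
#include <emmintrin.h>
// Shift the eight 16-bit elements of a left by 0, 1, 2, ..., 7 respectively.
__m128i shl_0_to_7_epi16(__m128i a)
{
    const __m128i pow2 = _mm_setr_epi16(1, 2, 4, 8, 16, 32, 64, 128);  // 1 << k for k = 0..7
    return _mm_mullo_epi16(a, pow2);
}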
