Multiplying 2 32-bit numbers and taking the top 32 bits using AVX2

I am using multiplication (along with other operations) as a substitute for integer division. My solution eventually requires me to multiply two 32-bit numbers together and take the top 32 bits (just like the mulhi function), but AVX2 does not offer a 32-bit variant of _mm256_mulhi_epu16 (e.g., there is no _mm256_mulhi_epu32 function).
I have tried various approaches, such as checking the AVX512 intrinsics or splitting each 32-bit integer into hi/lo 16-bit halves. I'm very new to low-level programming, so I'm unaware of what is optimal, or even just possible.

This can be done with _mm256_mul_epu32, which multiplies the even-indexed 32-bit elements of each vector into full 64-bit products; shifting each 64-bit product right by 32 then leaves its high half:
__m256i t1 = _mm256_mul_epu32(m, n);
t1 = _mm256_srli_epi64(t1, 32);
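That handles lanes 0, 2, 4 and 6. A sketch of one common way to cover all eight lanes (mulhi_epu32 is a made-up helper name, not a real intrinsic) is to repeat the multiply on the odd lanes and blend:

#include <immintrin.h>

static inline __m256i mulhi_epu32(__m256i a, __m256i b)
{
    // High halves of the even-lane products land in the low 32 bits
    // of each 64-bit element after the shift.
    __m256i even = _mm256_srli_epi64(_mm256_mul_epu32(a, b), 32);
    // Shift the odd lanes down so _mm256_mul_epu32 can see them; their
    // products' high halves already sit in the odd 32-bit positions.
    __m256i odd = _mm256_mul_epu32(_mm256_srli_epi64(a, 32),
                                   _mm256_srli_epi64(b, 32));
    // Take even 32-bit positions from 'even' and odd ones from 'odd'.
    return _mm256_blend_epi32(even, odd, 0xAA);
}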

Operating on the Rightmost/Leftmost n Bits, Not All the Bits, of an Integer Variable

In a programming task, I have to add a smaller integer in variable B (data type int)
to a larger integer (a 20-digit decimal number) in variable A (data type long long int),
then compare A with variable C (data type long long int), which is as large as A.
What I realized is that since I add a small B to A,
I don't need to check all the digits of A when I compare it with C; in other words, we don't need to check all the bits of A and C.
Given that I know how many bits from the right I need to check, say n bits,
is there a way/technique to check only those specific n bits from the right (not all the bits of A and C) to make the program faster in the C programming language?
Comparing all the bits takes more time, and since I am working with large numbers, the program becomes slower.
Every time I search on Google, bit masking appears, which uses all the bits of A and C; that doesn't do what I am asking for, so I am probably not using the correct terminology. Please help.
Addendum:
Initial comments on this post made me think there was no way, but I found the following:
Bit Manipulation by University of Colorado Boulder
(#cuboulder, after 7:45)
...the bit-band region is accessed via a bit-band alias; each bit in a supported bit-band region has its own unique address, and we can access that bit using a pointer to its bit-band alias location. The least significant bit in an alias location can be set or cleared, and that will be mapped to the bit in the corresponding data or peripheral memory. Unfortunately, this will not help you if you need to write to multiple bit locations; these operations only allow a single bit to be cleared or set...
Is the above what I am asking for? If yes, where can I find more detail as a beginner?
Updated question:
Is there a way/technique to check only those specific n bits from the right (not all the bits of A and C) to make the program faster, in C or any other language?
Your assumption that comparing fewer bits is faster might be true in some cases but is probably not true in most cases.
I'm only familiar with x86 CPUs. An x86-64 processor has 64-bit-wide registers. These can be accessed as 64-bit registers, but the lower bits can also be accessed as 32-, 16-, and 8-bit registers. There are processor instructions which work with the 64-, 32-, 16-, or 8-bit parts of the registers. Comparing 8 bits is one instruction, but so is comparing 64 bits.
If using the 32-bit comparison were faster than the 64-bit comparison, you could gain some speed. But it seems there is no speed difference on current processor generations. (Check out the "cmp" instruction with the link to uops.info from @harold.)
If your long long data type is actually bigger than the word size of your processor, then it's a different story. E.g., if your long long is 64 bits but you are on a 32-bit processor, then the comparison cannot be handled in one register and needs multiple instructions. In that case, if you know that comparing only the lower 32 bits is enough, it could save some time.
Also note that comparing only, e.g., 20 bits would actually take more time than comparing 32 bits: you would have to mask off the 12 highest bits and then compare, so you need a bitwise AND instruction in addition to the comparison.
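For example (a and c here stand for hypothetical 32-bit values), a low-20-bit comparison could look like:

if ((a & 0xFFFFFu) == (c & 0xFFFFFu)) { /* the low 20 bits match */ }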
As you can see, this is very processor-specific, and you are down at the processor's opcode level. As @RawkFist wrote in his comment, you could try to get the C compiler to emit such instructions, but that does not automatically mean it would be faster.
All of this is only relevant if these operations are executed a lot. I'm not sure what you are doing. If, e.g., you add many values B to A and compare to C each time, it might be faster to start with C, subtract the B values from it, and compare with 0, because the compare operation works internally like a subtraction. So instead of an add and a compare instruction, a single subtraction would be enough within the loop. But modern CPUs and compilers are very smart and optimize a lot, so the compiler may already perform such optimizations automatically.
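As a rough sketch of that last idea (A, C, the B array and count are hypothetical names here):

long long remaining = C - A;          // what is left to reach C
for (size_t i = 0; i < count; i++) {
    remaining -= B[i];                // one subtraction per iteration
    if (remaining == 0) {
        // A plus the Bs seen so far equals C
    }
}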
Try this question.
Is there a way/technique to check only those specific n bits from the right (not all the bits of A and C) to make the program faster, in C or any other language?
Yes - when A + B != C. We can short-cut the comparison once a difference is found: from least to most significant.
No - when A + B == C. All bits need comparison.
Now back to OP's original question
Is there a way/technique to check only those specific n bits from the right (not all the bits of A and C) to make the program faster, in C or any other language?
No. In order to do so, we would need to out-think the compiler. A good compiler with optimizations enabled will itself notice any "tricks" available for long long + int == long long and emit efficient code.
Yet what about really long compares? How about a custom uint1000000 for A and C?
For long compares of a custom type, a quick compare can be had.
First, select a fast working type. unsigned is a prime candidate.
typedef unsigned ufast;
Now define the wide integer.
#include <limits.h>
#include <stdbool.h>
#define UINT1000000_N (1000000/(sizeof(ufast) * CHAR_BIT))
typedef struct {
    // Least significant first
    ufast digit[UINT1000000_N];
} uint1000000;
Perform the addition and compare one "digit" at a time.
bool uint1000000_fast_offset_compare(const uint1000000 *A, unsigned B,
                                     const uint1000000 *C) {
    ufast carry = B;
    for (unsigned i = 0; i < UINT1000000_N; i++) {
        ufast sum = A->digit[i] + carry;
        if (sum != C->digit[i]) {
            return false;
        }
        carry = sum < A->digit[i];
    }
    return true;
}
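A quick usage sketch (values picked purely for illustration):

uint1000000 A = {0}, C = {0};
C.digit[0] = 42;
bool eq = uint1000000_fast_offset_compare(&A, 42, &C);  // true: 0 + 42 == 42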

Summing 8-bit integers in __m512i with AVX intrinsics

AVX512 provides us with intrinsics to sum all elements in a __m512i vector. However, some of their counterparts are missing: there is no _mm512_reduce_add_epi8 yet.
_mm512_reduce_add_ps //horizontal sum of 16 floats
_mm512_reduce_add_pd //horizontal sum of 8 doubles
_mm512_reduce_add_epi32 //horizontal sum of 16 32-bit integers
_mm512_reduce_add_epi64 //horizontal sum of 8 64-bit integers
Basically, I need to implement MAGIC in the following snippet.
__m512i all_ones = _mm512_set1_epi16(1);
short sum_of_ones = MAGIC(all_ones);
/* now sum_of_ones contains 32, the sum of 32 ones. */
The most obvious way would be to store the vector with _mm512_storeu_epi8 and sum the elements of the array together, but that would be slow, plus it might pollute the cache. I suppose there exists a faster approach.
Bonus points for implementing _mm512_reduce_add_epi16 as well.
First of all, _mm512_reduce_add_epi64 does not correspond to a single AVX512 instruction, but it generates a sequence of shuffles and additions.
To reduce 64 epu8 values to 8 epi64 values one usually uses the vpsadbw instruction (SAD=Sum of Absolute Differences) against a zero vector, which then can be reduced further:
long reduce_add_epu8(__m512i a)
{
    return _mm512_reduce_add_epi64(_mm512_sad_epu8(a, _mm512_setzero_si512()));
}
Try it on godbolt: https://godbolt.org/z/1rMiPH. Unfortunately, neither GCC nor Clang seem to be able to optimize away the function if it is used with _mm512_set1_epi16(1).
For epi8 instead of epu8 you need to first add 128 to each element (or xor with 0x80), then reduce it using vpsadbw, and at the end subtract 64*128 (or 8*128 from each intermediate 64-bit result). [Note: this was wrong in a previous version of this answer.]
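A sketch of that recipe, reusing the reduce_add_epu8 building block above (the xor variant; reduce_add_epi8 is a made-up name):

long reduce_add_epi8(__m512i a)
{
    // Bias each signed byte by +128 (xor with 0x80) to make it unsigned,
    // reduce with vpsadbw, then remove the total bias of 64 * 128.
    __m512i biased = _mm512_xor_si512(a, _mm512_set1_epi8((char)0x80));
    long sum = _mm512_reduce_add_epi64(
        _mm512_sad_epu8(biased, _mm512_setzero_si512()));
    return sum - 64 * 128;
}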
For epi16 I suggest having a look at what instructions _mm512_reduce_add_epi32 and _mm512_reduce_add_epi64 generate and derive from there what to do.
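One possible result of that derivation (my sketch, not from the answer): widen adjacent pairs of 16-bit elements with vpmaddwd against a vector of ones, then reuse the 32-bit reduction:

int reduce_add_epi16(__m512i a)
{
    // Each 16-bit element is multiplied by 1 and adjacent pairs are
    // summed into 32-bit lanes, so no intermediate overflow can occur.
    return _mm512_reduce_add_epi32(_mm512_madd_epi16(a, _mm512_set1_epi16(1)));
}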
Overall, as @Mysticial suggested, the best approach to reducing depends on your context. E.g., if you have a very large array of int64 and want the sum as an int64, you should just add them together packet-wise and only at the very end reduce one packet to a single int64.
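A sketch of that packet-wise idea, assuming n is a multiple of 8 (sum_i64 is a made-up name):

#include <immintrin.h>
#include <stddef.h>

long long sum_i64(const long long *p, size_t n)
{
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 8)        // add packet-wise...
        acc = _mm512_add_epi64(acc, _mm512_loadu_si512(p + i));
    return _mm512_reduce_add_epi64(acc);     // ...reduce once at the end
}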

fixed point arithmetic in modern systems

I'd like to start out by saying this isn't about optimizations, so please refrain from dragging this topic down that path. My purpose for using fixed-point arithmetic is that I want to control the precision of my calculations without using floating point.
With that being said, let's move on. I wanted to have 17 bits for the range and 15 bits for the fractional part; the extra bit is for the sign. Here are some macros below.
const int scl = 18;
#define Double2Fix(x) ((x) * (double)(1 << scl))
#define Float2Fix(x) ((x) * (float)(1 << scl))
#define Fix2Double(x) ((double)(x) / (1 << scl))
#define Fix2Float(x) ((float)(x) / (1 << scl))
Addition and subtraction are fairly straightforward, but things get a bit tricky with mul and div.
I've seen two different ways to handle these two types of operations.
1) If using 32 bits, use a temporary 64-bit variable to store the intermediate multiplication result, then scale at the end.
2) Scale both variables down to a smaller bit range right in the multiplication step. For example, if you have a 32-bit register with 16 bits for the whole number, you could shift like this:
(((a)>>8)*((b)>>6) >> 2) or some combination that makes sense for your app.
It seems to me that if you design your fixed-point math around 32 bits, it might be impractical to always depend on having a 64-bit variable to store your intermediate values; on the other hand, shifting to a lower scale will seriously reduce your range and precision.
questions
Since I'd like to avoid forcing the CPU to create a 64-bit type in the middle of my calculations, is shifting to lower bit values the only alternative?
Also, I've noticed:
int b = Double2Fix(9.1234567890);
printf("double shift:%f\n",Fix2Double(b));
int c = Float2Fix(9.1234567890);
printf("float shift:%f\n",Fix2Float(c));
double shift:9.123444
float shift:9.123444
Is that precision loss just a part of using fixed point numbers?
Since I'd like to avoid forcing the CPU to create a 64-bit type in the middle of my calculations, is shifting to lower bit values the only alternative?
You have to work with the hardware capabilities, and the only available operations you'll find are:
Multiply N x N => low N bits (native C multiplication)
Multiply N x N => high N bits (the C language has no operator for this)
Multiply N x N => all 2N bits (cast to wider type, then multiply)
If the instruction set has #3, and the CPU implements it efficiently, then there's no need to worry about the extra-wide result it produces. For x86, you can pretty much take these as a given. Anyway, you said this wasn't an optimization question :) .
Sticking to just #1, you'll need to break the operands into (N/2)-bit pieces and do long multiplication, which is likely to generate more work. There are still cases where it's the right thing to do, for instance implementing #3 (software extended arithmetic) on a CPU that has neither #3 nor #2.
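A minimal sketch of option #3, assuming the scl constant from the question (fixmul is a made-up name):

#include <stdint.h>

int32_t fixmul(int32_t a, int32_t b)
{
    // The cast forces a full 32x32 -> 64-bit multiply; the shift then
    // drops the doubled scale factor of the fixed-point product.
    return (int32_t)(((int64_t)a * b) >> scl);
}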
Is that precision loss just a part of using fixed point numbers?
log2(9.1234567890 - 9.123444) = -16.25, and you used 16 bits of precision, so yep, that's very typical.

Pure high-bit multiplication in assembly?

To implement real numbers between 0 and 1, one usually uses ANSI floats or doubles. But fixed precision numbers between 0 and 1 (decimals modulo 1) can be efficiently implemented as 32 bit integers or 16 bit words, which add like normal integers/words, but which multiply the "wrong way", meaning that when you multiply X times Y, you keep the high order bits of the product. This is equivalent to multiplying 0.X and 0.Y, where all the bits of X are behind the decimal point. Likewise, signed numbers between -1 and 1 are also implementable this way with one extra bit and a shift.
How would one implement fixed-precision mod 1 or mod 2 in C (especially using MMX or SSE)?
I think this representation could be useful for efficient representation of unitary matrices, for numerically intensive physics simulations. Integer quantities open up more of MMX/SSE, but you need higher-level access to PMULHW.
If 16-bit fixed-point arithmetic is sufficient and you are on x86 or a similar architecture, you can use SSE directly.
The SSSE3 instruction pmulhrsw directly implements signed 0.15 fixed-point multiplication (mod 2, as you call it, from -1..+1) in hardware. Addition is no different from the standard 16-bit vector operations, just using paddw.
So a library which handles multiplication and addition of eight signed 16-bit fixed-point variables at a time could look like this:
#include <tmmintrin.h>  // SSSE3 intrinsics (_mm_mulhrs_epi16)

typedef __v8hi fixed16_t;  // GCC-specific vector of eight signed shorts

fixed16_t mul(fixed16_t a, fixed16_t b) {
    return (fixed16_t)_mm_mulhrs_epi16((__m128i)a, (__m128i)b);
}

fixed16_t add(fixed16_t a, fixed16_t b) {
    return (fixed16_t)_mm_add_epi16((__m128i)a, (__m128i)b);
}
Permission granted to use it in any way you like ;-)

Fixed-point multiplication in a known range

I'm trying to multiply A*B in 16-bit fixed point while keeping as much accuracy as possible. A is in the 16-bit unsigned integer range; B is divided by 1000 and always between 0.001 and 9.999. It's been a while since I dealt with problems like this, so:
I know I can just do A*B/1000 after moving to 32-bit variables, then strip back to 16-bit
I'd like to make it faster than that
I'd like to do all the operations without moving to 32-bit (since I've got 16-bit multiplication only)
Is there any easy way to do that?
Edit: A will be between 0 and 4000, so all possible results are in the 16-bit range too.
Edit: B comes from user, set digit-by-digit in the X.XXX mask, that's why the operation is /1000.
No, you have to go to 32 bits: in general, the product of two 16-bit numbers needs the full 32 bits.
You should check the instruction set of the CPU you're working on, because most multiply instructions on 16-bit machines have an option to return the result as a 32-bit integer directly.
This would help you a lot because:
short testfunction(short a, short b)
{
    int A32 = a;
    int B32 = b;
    return A32 * B32 / 1000;
}
This forces the compiler to do a 32-bit * 32-bit multiply. On your machine this could be very slow, or it could even be done in multiple steps using only 16-bit multiplies.
A little bit of inline assembly, or even better a compiler intrinsic, could speed things up a lot.
Here is an example for the Texas Instruments C64x+ DSP which has such intrinsics:
short test(short a, short b)
{
    int product = _mpy(a, b);  // calculates product, returns 32 bit integer
    return product / 1000;
}
Another thought: you're dividing by 1000. Was that constant your choice? It would be much faster to use a power of two as the base for your fixed-point numbers, and 1024 is close. Why don't you:
return (a*b)/1024;
instead? The compiler can optimize this into a right shift by 10 bits, which ought to be much faster than reciprocal-multiplication tricks.
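A hedged sketch of that idea under the question's constraints (B_milli and b1024 are made-up names; B_milli is the user's X.XXX input scaled by 1000):

// One-time, rounded conversion from the /1000 base to the /1024 base:
unsigned short b1024 = (unsigned short)(((unsigned long)B_milli * 1024 + 500) / 1000);
// Each product then needs only a shift instead of a division:
unsigned short result = (unsigned short)(((unsigned long)A * b1024) >> 10);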
