Fixed-point multiplication in a known range - c

I'm trying to multiply A*B in 16-bit fixed point, while keeping as much accuracy as possible. A is 16-bit in unsigned integer range, B is divided by 1000 and always between 0.001 and 9.999. It's been a while since I dealt with problems like that, so:
I know I can just do A*B/1000 after moving to 32-bit variables, then strip back to 16-bit
I'd like to make it faster than that
I'd like to do all the operations without moving to 32-bit (since I've got 16-bit multiplication only)
Is there any easy way to do that?
Edit: A will be between 0 and 4000, so all possible results are in the 16-bit range too.
Edit: B comes from user, set digit-by-digit in the X.XXX mask, that's why the operation is /1000.

No, you have to go to 32 bit. In general, the product of two 16-bit numbers needs up to 32 bits.
You should check the instruction set of the CPU you're working on, because most multiply instructions on 16-bit machines have an option to return the result as a 32-bit integer directly.
This would help you a lot because:
short testfunction (short a, short b)
{
    /* long is at least 32 bits, even where int is only 16 bits */
    long A32 = a;
    long B32 = b;
    return (short)(A32*B32/1000);
}
Would force the compiler to do a 32bit * 32bit multiply. On your machine this could be very slow or even done in multiple steps using 16bit multiplies only.
A little bit of inline assembly or even better a compiler intrinsic could speed things up a lot.
Here is an example for the Texas Instruments C64x+ DSP which has such intrinsics:
short test (short a, short b)
{
    int product = _mpy (a,b); // calculates product, returns 32 bit integer
    return product / 1000;
}
Another thought: You're dividing by 1000. Was that constant your choice? It would be much faster to use a power of two as the base for your fixed-point numbers. 1024 is close. Why don't you:
return (a*b)/1024;
instead? The compiler could optimize this by using a shift right by 10 bits. That ought to be much faster than doing reciprocal multiplication tricks.
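As a rough sketch of the two options above, assuming A is in 0..4000 and B is stored as the integer 1..9999 typed into the X.XXX mask: the function names here are made up for illustration, and the 1/1024 variant trades a small conversion error in B for a cheaper shift.

```c
#include <stdint.h>

/* A * (B / 1000), using a 16x16 -> 32-bit product and rounding to nearest.
   The product is at most 4000 * 9999 = 39,996,000, so it fits in 32 bits. */
static uint16_t mul_div1000(uint16_t a, uint16_t b_milli)
{
    uint32_t product = (uint32_t)a * b_milli;
    return (uint16_t)((product + 500u) / 1000u);
}

/* Convert B once from 1/1000 units to 1/1024 units (rounded).
   This is where the small extra error of the power-of-two base comes from. */
static uint16_t milli_to_q10(uint16_t b_milli)
{
    return (uint16_t)(((uint32_t)b_milli * 1024u + 500u) / 1000u);
}

/* A * (B / 1024): the division becomes a right shift by 10 bits. */
static uint16_t mul_q10(uint16_t a, uint16_t b_q10)
{
    uint32_t product = (uint32_t)a * b_q10;
    return (uint16_t)((product + 512u) >> 10);
}
```

The conversion to 1/1024 units only needs to happen when the user changes B, so the per-sample cost is one widening multiply and one shift.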

Related

Operating Rightmost/Leftmost n-Bits, Not All the Bits of A Integer Type Data Variable

In a programming-task, I have to add a smaller integer in variable B (data type int)
to a larger integer (20 decimal integer) in variable A (data type long long int),
then compare A with variable C which is also as large integer (data type long long int) as A.
What I realized, since I add a smaller B to A,
I don't need to check all the digits of A when I compare that with C, in other words, we don't need to check all the bits of A and C.
Given that I know, how many bits from the right I need to check, say n-bits,
is there a way/technique to check only those specific n-bits from the right (not all the bits of A, C) to make the program faster in c programming language?
Comparing all the bits takes more time, and since I am working with large numbers, the program becomes slower.
Every time I search on Google, bit-masking comes up, which uses all the bits of A and C. That doesn't do what I am asking for, so I am probably not using the correct terminology. Please help.
Addition:
Initial comments of this post made me think there is no way but i found the following -
Bit Manipulation by University of Colorado Boulder
(@cuboulder, after 7:45)
...the bit band region is accessed via a bit band alias, each bit in a
supported bit band region has its own unique address and we can access
that bit using a pointer to its bit band alias location, the least
significant bit in an alias location can be set or cleared and that
will be mapped to the bit in the corresponding data or peripheral
memory, unfortunately this will not help you if you need to write to
multiple bit locations in memory dependent operations only allow a
single bit to be cleared or set...
Is the above what I am asking for? If yes,
where can I find the details as a beginner?
Updated question:
Is there a way/technique to check only those specific n-bits from the right (not all the bits of A, C) to make the program faster in c programming language (or any other language) that makes the program faster?
Your assumption that comparing fewer bits is faster might be true in some cases but is probably not true in most cases.
I'm only familiar with x86 CPUs. An x86-64 processor has 64-bit wide registers. These can be accessed as 64-bit registers, but their lower bits can also be accessed as 32-, 16- and 8-bit registers. There are processor instructions which work with the 64-, 32-, 16- or 8-bit part of the registers. Comparing 8 bits is one instruction, but so is comparing 64 bits.
If the 32-bit comparison were faster than the 64-bit comparison, you could gain some speed. But it seems there is no speed difference on current processor generations. (Check out the "cmp" instruction with the link to uops.info from @harold.)
If your long long data type is actually bigger than the word size of your processor, then it's a different story. E.g. if your long long is 64 bits but you are on a 32-bit processor, then these values cannot be held in one register and you would need multiple instructions. So if you know that comparing only the lower 32 bits would be enough, this could save some time.
Also note that comparing only e.g. 20 bits would actually take more time than comparing 32 bits. You would have to mask off the 12 highest bits and then compare the remaining bits. So you would need a bitwise AND instruction in addition to the comparison.
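To make the extra work concrete, here is a minimal sketch of comparing only the lowest n bits of two 64-bit values. The function name is hypothetical, and the point is that the mask is an additional operation on top of an ordinary full-width compare, not a replacement for it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the lowest n bits of a and c are identical.
   XOR leaves a 1 wherever the inputs differ; the mask then
   discards the differences above bit n-1. */
static bool low_bits_equal(uint64_t a, uint64_t c, unsigned n)
{
    /* Shifting a 64-bit value by 64 is undefined, so handle n >= 64 separately. */
    uint64_t mask = (n >= 64) ? UINT64_MAX : ((UINT64_C(1) << n) - 1);
    return ((a ^ c) & mask) == 0;
}
```

On a 64-bit machine this costs an XOR, an AND and a test, which is more than the single compare that `a == c` needs.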
As you see, this is very processor specific, and you are at the processor's opcode level. As @RawkFist wrote in his comment, you could try to get the C compiler to create such instructions, but that does not automatically mean that this is even faster.
All of this is only relevant if these operations are executed a lot. I'm not sure what you are doing. If e.g. you add many values B to A and compare them to C each time it might be faster to start with C, subtract the B values from it and compare with 0. Because the compare-operation works internally like a subtraction. So instead of an add and a compare instruction a single subtraction would be enough within the loop. But modern CPUs and compilers are very smart and optimize a lot. So maybe the compiler automatically performs such or similar optimizations.
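The countdown idea could be sketched as follows, assuming the task is to find the step at which a running total reaches exactly C. Both function names and the return convention are invented for this illustration, and an optimizing compiler may well produce similar code for both versions.

```c
#include <stddef.h>

/* Straightforward version: add each B to A, then compare with C. */
static long reach_by_add(long long a, const int *b, size_t n, long long c)
{
    for (size_t i = 0; i < n; i++) {
        a += b[i];
        if (a == c) {
            return (long)i;   /* index of the step that hit C */
        }
    }
    return -1;                /* C was never hit exactly */
}

/* Countdown version: track the remaining distance and compare with zero. */
static long reach_by_countdown(long long a, const int *b, size_t n, long long c)
{
    long long remaining = c - a;
    for (size_t i = 0; i < n; i++) {
        remaining -= b[i];
        if (remaining == 0) {
            return (long)i;
        }
    }
    return -1;
}
```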
Try this question.
Is there a way/technique to check only those specific n-bits from the right (not all the bits of A, C) to make the program faster in c programming language (or any other language) that makes the program faster?
Yes - when A + B != C. We can short-cut the comparison once a difference is found: from least to most significant.
No - when A + B == C. All bits need comparison.
Now back to OP's original question
Is there a way/technique to check only those specific n-bits from the right (not all the bits of A, C) to make the program faster in c programming language (or any other language) that makes the program faster?
No. In order to do so, we need to out-think the compiler. A well enabled compiler itself will notice any "tricks" available for long long + (signed char)int == long long and emit efficient code.
Yet what about really long compares? How about a custom uint1000000 for A and C?
For long compares of a custom type, a quick compare can be had.
First, select a fast working type. unsigned is a prime candidate.
typedef unsigned ufast;
Now define the wide integer.
#include <limits.h>
#include <stdbool.h>
#define UINT1000000_N (1000000/(sizeof(ufast) * CHAR_BIT))
typedef struct {
    // Least significant first
    ufast digit[UINT1000000_N];
} uint1000000;
Perform the addition and compare one "digit" at a time.
bool uint1000000_fast_offset_compare(const uint1000000 *A, unsigned B,
        const uint1000000 *C) {
    ufast carry = B;
    for (unsigned i = 0; i < UINT1000000_N; i++) {
        ufast sum = A->digit[i] + carry;
        if (sum != C->digit[i]) {
            return false;
        }
        carry = sum < A->digit[i];
    }
    return true;
}

Multiplying 2 32-bit numbers and taking the top 32 bits using AVX2

I am using multiplication (with the addition of other operations) as a substitution for integer division. My solution eventually requires me to multiply 2 32-bit numbers together and take the top 32 bits (just like the mulhi function), but AVX2 does not offer a 32-bit variant of _mm256_mulhi_epu16 (Ex: there's no '_mm256_mulhi_epu32' function).
I have tried various methods such as checking the functions of AVX512, or even manipulating the 32-bit integers to be 2 hi/lo 16-bit integers. I'm very new to working with low-level programming, so I'm unaware what is optimal, or even just possible.
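This can be done as follows. Note that _mm256_mul_epu32 only multiplies the low 32 bits of each 64-bit element, so this produces the high halves for the four even-indexed 32-bit lanes only, each left in the low 32 bits of its 64-bit element: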
__m256i t1 = _mm256_mul_epu32(m, n);
t1 = _mm256_srli_epi64(t1, 32);
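To cover all eight 32-bit lanes, one possible approach is to run the same multiply a second time on the odd lanes and blend the two results. The following is only a sketch: the function name `mulhi_u32x8` is made up, it works on plain arrays to keep the example self-contained, and the `target` attribute is a GCC/Clang extension used here so it compiles without `-mavx2` (it still needs an AVX2-capable CPU to run).

```c
#include <immintrin.h>
#include <stdint.h>

/* out[i] = high 32 bits of the 64-bit product a[i] * b[i], for 8 lanes. */
__attribute__((target("avx2")))
static void mulhi_u32x8(const uint32_t a[8], const uint32_t b[8], uint32_t out[8])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* 64-bit products of the even lanes (0, 2, 4, 6). */
    __m256i even = _mm256_mul_epu32(va, vb);
    /* Move the odd lanes (1, 3, 5, 7) into the even positions and multiply those. */
    __m256i odd = _mm256_mul_epu32(_mm256_srli_epi64(va, 32),
                                   _mm256_srli_epi64(vb, 32));

    /* Even products: shift the high half down into the even 32-bit slot. */
    even = _mm256_srli_epi64(even, 32);
    /* Odd products: the high half already sits in the odd 32-bit slot.
       0xAA selects the odd 32-bit elements from `odd`, the rest from `even`. */
    __m256i hi = _mm256_blend_epi32(even, odd, 0xAA);

    _mm256_storeu_si256((__m256i *)out, hi);
}
```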

fixed point arithmetic in modern systems

I'd like to start out by saying this isn't about optimizations so please refrain from dragging this topic down that path. My purpose for using fixed point arithmetic is because I want to control the precision of my calculations without using floating point.
With that being said let's move on. I wanted to have 17 bits for range and 15 bits for the fractional part. The extra bit is for the signed value. Here are some macros below.
const int scl = 18;
#define Double2Fix(x) ((x) * (double)(1 << scl))
#define Float2Fix(x) ((x) * (float)(1 << scl))
#define Fix2Double(x) ((double)(x) / (1 << scl))
#define Fix2Float(x) ((float)(x) / (1 << scl))
Addition and subtraction are fairly straightforward, but things get a bit tricky with mul and div.
I've seen two different ways to handle these two types of operations.
1) if I am using 32 bits then use a temp 64bit variable to store intermediate multiplication steps then scale at the end.
2) right in the multiplication step scale both variables to a lesser bit range before multiplication. For example if you have a 32 bit register with 16 bits for the whole number you could shift like this:
(((a)>>8)*((b)>>6) >> 2) or some combination that makes sense for your app.
It seems to me that if you design your fixed point math around 32 bits it might be impractical to always depend on having a 64bit variable able to store your intermediate values but on the other hand shifting to a lower scale will seriously reduce your range and precision.
questions
Since I'd like to avoid forcing the CPU to create a 64-bit type in the middle of my calculations, is shifting to lower bit values the only other alternative?
Also, I've noticed:
int b = Double2Fix(9.1234567890);
printf("double shift:%f\n",Fix2Double(b));
int c = Float2Fix(9.1234567890);
printf("float shift:%f\n",Fix2Float(c));
double shift:9.123444
float shift:9.123444
Is that precision loss just a part of using fixed point numbers?
Since I'd like to avoid forcing the CPU to create a 64-bit type in the middle of my calculations, is shifting to lower bit values the only other alternative?
You have to work with the hardware capabilities, and the only available operations you'll find are:
1) Multiply N x N => low N bits (native C multiplication)
2) Multiply N x N => high N bits (the C language has no operator for this)
3) Multiply N x N => all 2N bits (cast to wider type, then multiply)
If the instruction set has #3, and the CPU implements it efficiently, then there's no need to worry about the extra-wide result it produces. For x86, you can pretty much take these as a given. Anyway, you said this wasn't an optimization question :) .
Sticking to just #1, you'll need to break the operands into pieces of (N/2) bits and do long multiplication, which is likely to generate more work. There are still cases where it's the right thing to do, for instance implementing #3 (software extended arithmetic) on a CPU that doesn't have it or #2.
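The long multiplication mentioned above could look roughly like this for N = 32, using only 32-bit multiplies and adds. This is a sketch for unsigned operands, and the function name `mul32x32_wide` is invented for the example.

```c
#include <stdint.h>

/* Full 32x32 -> 64-bit unsigned product, returned as two 32-bit halves,
   built from four 16x16 -> 32-bit partial products. */
static void mul32x32_wide(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;   /* weight 2^0  */
    uint32_t p1 = a_lo * b_hi;   /* weight 2^16 */
    uint32_t p2 = a_hi * b_lo;   /* weight 2^16 */
    uint32_t p3 = a_hi * b_hi;   /* weight 2^32 */

    /* Middle column. The sum is at most 0xFFFF + 0xFFFF + 0xFFFE0001,
       which is exactly 0xFFFFFFFF, so it cannot overflow 32 bits. */
    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFFu) + p2;

    *lo = (mid << 16) | (p0 & 0xFFFFu);
    *hi = p3 + (p1 >> 16) + (mid >> 16);
}
```

A 32-bit fixed-point multiply with s fractional bits would then take `(hi << (32 - s)) | (lo >> s)` for 0 < s < 32, which shows why option 3 is so much less work when the hardware provides it.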
Is that precision loss just a part of using fixed point numbers?
log2( 9.1234567890 – 9.123444 ) = –16.25, and you used 16 bits of precision, so yep, that's very typical.

Pure high-bit multiplication in assembly?

To implement real numbers between 0 and 1, one usually uses ANSI floats or doubles. But fixed precision numbers between 0 and 1 (decimals modulo 1) can be efficiently implemented as 32 bit integers or 16 bit words, which add like normal integers/words, but which multiply the "wrong way", meaning that when you multiply X times Y, you keep the high order bits of the product. This is equivalent to multiplying 0.X and 0.Y, where all the bits of X are behind the decimal point. Likewise, signed numbers between -1 and 1 are also implementable this way with one extra bit and a shift.
How would one implement fixed-precision mod 1 or mod 2 in C (especially using MMX or SSE)?
I think this representation could be useful for efficient representation of unitary matrices, for numerically intensive physics simulations. It makes for more MMX/SSE to have integer quantities, but you need higher level access to PMULHW.
If 16 bit fixed point arithmetic is sufficient and you are on x86 or a similar architecture, you can directly use SSE.
The SSSE3 instruction pmulhrsw directly implements signed 0.15 fixed-point multiplication (mod 2 as you call it, from -1..+1) in hardware. Addition is no different from the standard 16-bit vector operations, just using paddw.
So a library which handles multiplication and addition of eight signed 16 bit fixed point variables at a time could look like this:
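#include <tmmintrin.h> /* SSSE3: _mm_mulhrs_epi16 */
typedef __m128i fixed16_t; /* eight signed 0.15 fixed-point values */
fixed16_t mul(fixed16_t a, fixed16_t b) {
    return _mm_mulhrs_epi16(a,b); /* rounded high-half multiply (pmulhrsw) */
}
fixed16_t add(fixed16_t a, fixed16_t b) {
    return _mm_add_epi16(a,b); /* wrapping 16-bit add (paddw) */
}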
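For reference, a scalar model of what each 16-bit lane of pmulhrsw computes may help when checking results. This is a sketch with a made-up function name; it assumes that right-shifting a negative value is an arithmetic shift and that narrowing to int16_t wraps, both of which are implementation-defined in C but hold on common compilers.

```c
#include <stdint.h>

/* Signed 0.15 (Q15) multiply with rounding: the 32-bit product is in
   Q30 format, so add half of the final LSB and drop 15 fraction bits. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * b;            /* Q30 */
    return (int16_t)((product + 0x4000) >> 15);  /* back to Q15, rounded */
}
```

For example, 0x4000 represents 0.5, and `q15_mul(0x4000, 0x4000)` gives 0x2000, which represents 0.25. The one awkward input is -1 times -1, whose true result +1 is not representable and wraps to -1.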
Permission granted to use it in any way you like ;-)

On embedded platforms, is it more efficient to use unsigned int instead of (implicitly signed) int?

I've got into this habit of always using unsigned integers where possible in my code, because the processor can do divides by powers of two on unsigned types, which it can't with signed types. Speed is critical for this project. The processor operates at up to 40 MIPS.
My processor has an 18 cycle divide, but it takes longer than the single cycle barrel shifter. So is it worth using unsigned integers here to speed things up or do they bring other disadvantages? I'm using a dsPIC33FJ128GP802 - a member of the dsPIC33F series by Microchip. It has single cycle multiply for both signed and unsigned ints. It also has sign and zero extend instructions.
For example, it produces this code when mixing signed and unsigned integers.
026E4 97E80F mov.b [w15-24],w0
026E6 FB0000 se w0,w0
026E8 97E11F mov.b [w15-31],w2
026EA FB8102 ze w2,w2
026EC B98002 mul.ss w0,w2,w0
026EE 400600 add.w w0,w0,w12
026F0 FB8003 ze w3,w0
026F2 100770 subr.w w0,#16,w14
I'm using C (GCC for dsPIC.)
I think we all need to know a lot more about the peculiarities of your processor to answer this question. Why can't it do divides by powers of two on signed integers? As far as I remember the operation is the same for both. I.e.
10/2 = 00001010 goes to 00000101
-10/2 = 11110110 goes to 11111011
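One detail the even example above hides: for odd negative values, an arithmetic right shift and C's `/` operator disagree, because the shift rounds toward negative infinity while division truncates toward zero. That is why compilers typically emit a small fix-up for signed division by a power of two. A minimal sketch, with invented function names and assuming the compiler implements `>>` on negative values as an arithmetic shift (implementation-defined, but the usual behaviour):

```c
#include <stdint.h>

/* C division: truncates toward zero, so -7 / 2 is -3. */
static int32_t div2_c(int32_t x) { return x / 2; }

/* Plain arithmetic shift: rounds toward negative infinity, so -7 >> 1 is -4. */
static int32_t div2_shift(int32_t x) { return x >> 1; }

/* Shift with fix-up: add 1 to negative inputs first, which makes the
   result match C division. The added term is just the sign bit. */
static int32_t div2_shift_fixed(int32_t x)
{
    return (x + (int32_t)((uint32_t)x >> 31)) >> 1;
}
```

For unsigned values the plain shift is already exact, which is the small saving the question is asking about.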
Maybe you should write some simple code doing an unsigned divide and a signed divide and compare the compiled output.
Also, benchmarking is a good idea. It doesn't need to be precise. Just have an array of a few thousand numbers, start a timer, divide them a few million times and see how long it takes. Maybe do it a few billion times if your processor is fast. E.g.
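int s_numbers[] = { etc. etc. };
/* Element count, not byte count. */
long s_array_size = sizeof(s_numbers) / sizeof(s_numbers[0]);
unsigned int u_numbers[] = { etc. etc.};
long u_array_size = sizeof(u_numbers) / sizeof(u_numbers[0]);
long i; /* long, because int may be only 16 bits on this target */
volatile int s_result; /* volatile so the loop is not optimized away */
volatile unsigned int u_result;
/* Start timer. */
for(i = 0; i < 100000000; i++)
{
    /* The array must not contain zeros. */
    s_result = s_numbers[i % s_array_size] / s_numbers[(i + 1) % s_array_size];
}
/* Stop timer and print difference. */
/* Repeat for unsigned integers. */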
Written in a hurry to show the principle, please forgive any errors.
It won't give precise benchmarking but should give a general idea of which is faster.
I don't know much about the instruction set available on your processor but a quick look makes me think that it has instructions that may be used for both arithmetic and logical shifts, which should mean that shifting a signed value costs about the same as shifting an unsigned value, and dividing by powers of 2 for each using the shifts should also cost the same. (my knowledge about this is from a quick glance at some intrinsic functions for a C compiler that targets your processor family).
That being said, if you are working with values which are to be interpreted as unsigned then you might as well declare them as unsigned. For the last few years I've been using the types from stdint.h more and more, and usually I end up using the unsigned versions because my values are either inherently unsigned or I'm just using them as bit arrays.
Generate assembly both ways and count cycles.
I'm going to guess that unsigned divides by powers of two are faster, because the compiler can simply do a right shift without needing to worry about sign extension.
As for disadvantages: detecting arithmetic overflows, overflowing a signed type because you didn't realize it while using unsigned, etc. Nothing blocking, just different things to watch out for.

Resources