Fixed-point arithmetic in modern systems - C

I'd like to start by saying this isn't about optimization, so please refrain from dragging the topic down that path. I'm using fixed-point arithmetic because I want to control the precision of my calculations without using floating point.
With that said, let's move on. I wanted 17 bits for the integer range and 15 bits for the fractional part; the extra bit is for the sign. Here are some macros:
const int scl = 18;  /* scale: values are stored times 2^scl */
#define Double2Fix(x) ((x) * (double)(1 << scl))  /* double -> fixed */
#define Float2Fix(x)  ((x) * (float)(1 << scl))   /* float  -> fixed */
#define Fix2Double(x) ((double)(x) / (1 << scl))  /* fixed  -> double */
#define Fix2Float(x)  ((float)(x) / (1 << scl))   /* fixed  -> float */
Addition and subtraction are fairly straightforward, but things get a bit tricky with multiplication and division.
I've seen two different ways to handle these two types of operations.
1) If I am using 32 bits, use a temporary 64-bit variable to hold the intermediate product, then scale at the end.
2) Scale both operands down to a smaller bit range right in the multiplication step. For example, if you have a 32-bit register with 16 bits for the whole number, you could shift like this:
(((a)>>8) * ((b)>>6)) >> 2, or some combination that makes sense for your app.
It seems to me that if you design your fixed-point math around 32 bits, it might be impractical to always depend on having a 64-bit variable to hold your intermediate values; on the other hand, shifting to a lower scale will seriously reduce your range and precision.
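To make the trade-off concrete, here is a minimal sketch of both options in C (the function names are mine; fixmul_64 reuses the scl constant from the macros above, while fixmul_shift hard-codes the 8/6/2 split from the example, which assumes 16 fraction bits):
#include <stdint.h>

/* Option 1: widen to 64 bits, multiply, then scale back down.
   Assumes an arithmetic right shift for negative values. */
static int32_t fixmul_64(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> scl);
}

/* Option 2: pre-shift the operands so the product fits in 32 bits.
   8 + 6 + 2 == 16 fraction bits are discarded in total; the split is
   arbitrary and costs precision. */
static int32_t fixmul_shift(int32_t a, int32_t b) {
    return ((a >> 8) * (b >> 6)) >> 2;
}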
Questions
Since I'd like to avoid forcing the CPU to create a 64-bit type in the middle of my calculations, is shifting to lower bit values the only other alternative?
Also, I've noticed:
int b = Double2Fix(9.1234567890);
printf("double shift:%f\n",Fix2Double(b));
int c = Float2Fix(9.1234567890);
printf("float shift:%f\n",Fix2Float(c));
double shift:9.123444
float shift:9.123444
Is that precision loss just a part of using fixed point numbers?

Since I'd like to avoid forcing the CPU to create a 64-bit type in the middle of my calculations, is shifting to lower bit values the only other alternative?
You have to work with the hardware capabilities, and the only available operations you'll find are:
1. Multiply N x N => low N bits (native C multiplication)
2. Multiply N x N => high N bits (the C language has no operator for this)
3. Multiply N x N => all 2N bits (cast to a wider type, then multiply)
If the instruction set has #3, and the CPU implements it efficiently, then there's no need to worry about the extra-wide result it produces. For x86, you can pretty much take these as a given. Anyway, you said this wasn't an optimization question :) .
Sticking to just #1, you'll need to break the operands into pieces of (N/2) bits and do long multiplication, which is likely to generate more work. There are still cases where it's the right thing to do, for instance implementing #3 (software extended arithmetic) on a CPU that has neither it nor #2.
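For illustration, this is roughly what that #1-only route looks like in C for an unsigned 32x32 -> 64 multiply built from 16x16 -> 32 pieces (the names are mine; a signed version needs extra care with the cross terms):
#include <stdint.h>

/* Schoolbook long multiplication: split each operand into 16-bit halves,
   form the four partial products, and recombine with carries. */
static void mul32x32_64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo) {
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;   /* contributes to bits  0..31 */
    uint32_t p1 = a_lo * b_hi;   /* contributes to bits 16..47 */
    uint32_t p2 = a_hi * b_lo;   /* contributes to bits 16..47 */
    uint32_t p3 = a_hi * b_hi;   /* contributes to bits 32..63 */

    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFFu) + (p2 & 0xFFFFu);
    *lo = (p0 & 0xFFFFu) | (mid << 16);
    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}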
Is that precision loss just a part of using fixed point numbers?
log2(9.1234567890 - 9.123444) = -16.25, and you used 16 bits of precision, so yep, that's very typical.
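For instance, assuming the run used 16 fraction bits (which is what the log2 estimate above implies):
9.1234567890 * 2^16 = 597914.86..., which truncates to 597914
597914 / 2^16 = 9.1234436..., which prints as 9.123444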

Related

How do you convert the remainder of a division operation to a fixed point in C?

I understand the concept of fixed point pretty well at this point, but I'm having trouble making a logical jump.
I'm working with M68000 CPUs using gcc with no standard libraries of any sort. Using the DIVU/DIVS opcodes, I can obtain the quotient and the remainder. Given a Q16.16 fixed-point value stored in an unsigned 32-bit memory space, I know I can put the quotient in the upper 16 bits. However, how does one convert the integer remainder into the fractional portion of the fixed-point value?
I'm sure this is something simple and I'm just missing it. Any help would be greatly appreciated.
The way to think about it is that fixed-point numbers are actually integers holding the value of your number times some fixed multiplier. You want to build your fixed-point operations out of the integer operations you have available in your hardware.
So for a 16.16 fixed-point format, your multiplier is 65536 (2^16). If you want to do a divide c = a/b, the numbers (integers) you actually have to work with are a' = a * 65536 and b' = b * 65536, and you want to find c' = c * 65536. Substituting into the desired c = a/b, you have
c'/65536 = (a'/65536) / (b'/65536) = a'/b'
c' = 65536 * a' / b'
So you actually want to first (integer) multiply the fixed-point value of a by 65536 (left shift by 16), then do an integer divide by the fixed-point value of b, and that will give you the fixed-point value of c. The issue is that the first multiply will almost certainly overflow 32 bits, so you need a 64-bit (actually only 48-bit) intermediate. So if you're using a 68020+ with a 64/32 DIVS.L instruction (which divides a 64-bit value in a pair of registers by a 32-bit value), you're fine. You don't need the remainder at all.
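A direct C rendering of that idea (the 64-by-32 division may end up as a library call rather than a single DIVU.L/DIVS.L, depending on the compiler) would be something like this sketch for the unsigned case:
#include <stdint.h>

/* c = a / b with a, b, c all in Q16.16; needs a 64-bit (really 48-bit)
   intermediate for the pre-shift. */
static uint32_t fix16_div(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a << 16) / b);
}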
If you're using a pure 68000 that doesn't have the wide divide, you'll need to do 16-bit long division on the values (where you use 16-bit numbers as "digits", so you're dividing a 3-"digit" number by a 2-"digit" one).
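If you do want to use the remainder the question asks about, an equivalent route (my sketch, assuming the divisor fits in 16 bits so DIVU can be used twice) is to divide once for the integer part and feed the remainder back in for the fraction:
#include <stdint.h>

/* a and b are plain integers, with b in 1..65535 and a/b < 65536 so the
   result fits in Q16.16 (the same constraint DIVU itself imposes). */
static uint32_t int_div_to_fix16(uint32_t a, uint16_t b) {
    uint32_t q = a / b;             /* integer quotient (DIVU yields q and r together) */
    uint32_t r = a % b;             /* integer remainder */
    uint32_t frac = (r << 16) / b;  /* r < b <= 0xFFFF, so r << 16 fits in 32 bits */
    return (q << 16) | frac;        /* upper 16 bits: quotient, lower 16: fraction */
}
This works because (a << 16) / b == (q << 16) + ((r << 16) / b) when a == q*b + r.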

Understanding fixed-point arithmetic

I am struggling with how to implement arithmetic on fixed-point numbers of different precision. I have read the paper by R. Yates, but I'm still lost. In what follows, I use Yates's notation, in which A(n,m) designates a signed fixed-point format with n integer bits, m fraction bits, and n + m + 1 bits overall.
Short question: How exactly are A(a,b)*A(c,d) and A(a,b)+A(c,d) carried out when a != c and b != d?
Long question: In my FFT algorithm, I am generating a random input signal (in) with values between -10V and 10V, scaled to A(15,16); the twiddle factors (tw) are scaled to A(2,29). Both are stored as ints. Something like this:
float temp = (((float)rand() / (float)(RAND_MAX)) * (MAX_SIG - MIN_SIG)) + MIN_SIG;
in_seq[i][j] = (int)roundf(temp * (1 << numFracBits));
And similarly for the twiddle factors.
Now I need to perform
res = a*tw
Questions:
a) How do I implement this?
b) Should the size of res be 64 bits?
c) Can I make res A(17,14), since I know the ranges of a and tw? If yes, should I be scaling a*tw by 2^14 to store the correct value in res?
a + res
Questions:
a) How do I add these two numbers of different Q formats?
b) If not, how do I do this operation?
Maybe it's easiest to work through an example.
Suppose you want to add two numbers, one in the format A(3, 5), and the other in the format A(2, 10).
You can do it by converting both numbers to a "common" format - that is, they should have the same number of bits in the fractional part.
A conservative way of doing that is to choose the greater number of bits. That is, convert the first number to A(3, 10) by shifting it 5 bits left. Then, add the second number.
The result of an addition has the range of the greater format, plus 1 bit. In my example, if you add A(3, 10) and A(2, 10), the result has the format A(4, 10).
I call this the "conservative" way because you cannot lose information - it guarantees that the result is representable in the fixed-point format, without losing precision. However, in practice, you will want to use smaller formats for your calculation results. To do that, consider these ideas:
You can use the less-accurate format as your common representation. In my example, you can convert the second number to A(2, 5) by shifting the integer right by 5 bits. This will lose precision, and usually this precision loss is not problematic, because you are going to add a less-precise number to it anyway.
You can use 1 fewer bit for the integer part of the result. In applications, it often happens that the result cannot be too big. In this case, you can allocate 1 fewer bit to represent it. You might want to check if the result is too big, and clamp it to the needed range.
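As a small sketch of the conservative path (the function name and the fixed formats are mine, matching the example above):
#include <stdint.h>

/* Add x in A(3,5) and y in A(2,10): align x to the finer scale first.
   The sum is exact and fits in A(4,10), i.e. 15 bits. */
static int32_t add_a35_a210(int32_t x_q5, int32_t y_q10) {
    int32_t x_q10 = x_q5 * (1 << 5);  /* A(3,5) -> A(3,10) */
    return x_q10 + y_q10;             /* A(4,10) */
}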
Now, on multiplication.
It's possible to multiply two fixed-point numbers directly - they can be in any format. The format of the result is the "sum of the input formats" - all the parts added together - and add 1 to the integer part. In my example, multiplying A(3, 5) with A(2, 10) gives a number in the format A(6, 15). This is a conservative rule - the output format is able to store the result without loss of precision, but in applications, almost always you want to cut the precision of the output, because it's just too many bits.
In your case, where the number of bits for all numbers is 32, you probably want to lose precision in such a way that all intermediate results have 32 bits.
For example, multiplying A(17, 14) with A(2, 29) gives A(20, 43) - 64 bits required. You should probably cut it down to 32 bits and throw away the rest. What is the range of the result? If your twiddle factor is a number up to 4, the result is probably limited by 2^19 (the conservative 20 above is needed only to accommodate the edge case of multiplying -2^31 by -2^31 - it's almost always worth rejecting this edge case).
So use A(19, 12) for your output format, i.e. remove 31 bits from the fractional part of your output.
So, instead of
res = a*tw;
you probably want
int64_t res_tmp = (int64_t)a * tw;        // A(20, 43)
if (res_tmp == ((int64_t)1 << 62))        // you might want to neglect this edge case
    --res_tmp;                            // A(19, 43)
int32_t res = (int32_t)(res_tmp >> 31);   // A(19, 12)
Your question seems to assume that there is a single right way to perform the operations you are interested in, but you are explicitly asking about some of the details that direct how the operations should be performed. Perhaps this is the kernel of your confusion.
res = a*tw
a is represented as A(15,16) and tw as A(2,29), so the natural representation of their product is A(18,45). You need more value bits (as many bits as the two factors have combined) to maintain full precision. A(18,45) is how you should interpret the result of widening your ints to a 64-bit signed integer type (e.g. int64_t) and computing their product.
If you don't actually need or want 45 bits of fraction, then you can indeed round that to A(18,13) (or to A(18+x,13-x) for any non-negative x) without changing the magnitude of the result. That does require scaling. I would probably implement it like this:
/*
* Computes a magnitude-preserving fixed-point product of any two signed
* fixed-point numbers with a combined 31 (or fewer) value bits. If x
* is represented as A(s,t) and y is represented as A(u,v),
* where s + t == u + v == 31, then the representation of the result is
* A(s + u + 1, t + v - 32).
*/
int32_t fixed_product(int32_t x, int32_t y) {
    int64_t full_product = (int64_t) x * (int64_t) y;
    int32_t truncated = full_product / (1U << 31);
    int round_up = ((uint32_t) full_product) >> 31;
    return truncated + round_up;
}
That avoids several potential issues and implementation-defined characteristics of signed integer arithmetic. It assumes that you want the results to be in a consistent format (that is, depending only on the formats of the inputs, not on their actual values), without overflowing.
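For the formats in the question, a call site would then look like this (variable names as in the question):
int32_t res = fixed_product(a, tw);  /* A(15,16) * A(2,29) -> A(18,13) */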
a + res
Addition is actually a little harder if you cannot rely on the operands to initially have the same scale. You need to rescale so that they match before you can perform the addition. In the general case, you may not be able to do that without rounding away some precision.
In your case, you start with one A(15,16) and one A(18,13). You can compute an intermediate result in A(19,16) or wider (presumably A(47,16) in practice) that preserves magnitude without losing any precision, but if you want to represent that in 32 bits then the best you can do without risk of changing the magnitude is A(19,11). That would be this:
int32_t a_plus_res(int32_t a, int32_t res) {
    int64_t res16 = ((int64_t) res) * (1 << 3);
    int64_t sum16 = a + res16;
    int round_up = (((uint32_t) sum16) >> 4) & 1;
    return (int32_t) ((sum16 / (1 << 5)) + round_up);
}
A generic version would need to accept the scales of the operands' representations as additional arguments. Such a thing is possible, but the above is enough to chew on as it is.
All of the foregoing assumes that the fixed-point format for each operand and result is constant. That is more or less the distinguishing feature of fixed-point, differentiating it from floating-point formats on one hand and from arbitrary-precision formats on the other. You do, however, have the alternative of allowing formats to vary, and tracking them with a separate variable per value. That would be basically a hybrid of fixed-point and arbitrary-precision formats, and it would be messier.
Additionally, the foregoing assumes that overflow must be avoided at all costs. It would also be possible to instead put operands and results on a consistent scale; this would make addition simpler and multiplication more complicated, and it would afford the possibility of arithmetic overflow. That might nevertheless be acceptable if you have reason to believe that such overflow is unlikely for your particular data.

How does C perform the % operation internally

I am curious to understand the logic behind the mod operation, since I know that bit-shifting operations can be used to do different things, such as shifting to multiply.
One way I can see it being done is by a recursive algorithm that keeps dividing until you cannot divide anymore, but this does not seem efficient.
Any ideas will be helpful. Thanks in advance!
The quick version is: it depends on the hardware, the optimizer, whether it's division by a constant or not, whether there are exceptions to be checked for (e.g. modulo by 0), if and how negative numbers are handled (this is a scary question for C++), etc.
R.. gave a nice, concise answer for unsigned integers (quoted further down), but it's difficult to understand unless you're well versed in C.
The crux of the technique illuminated by R.. is to strip away multiples of q until there are no more multiples of q left. We could naively do this with a simple loop:
while (p >= q) p -= q; // One liner, woohoo!
The code may be short, but for large values of p and small values of q this might take a very long time.
Better than stripping away one q at a time would be to strip away many q's at a time. Note that we actually want to strip away as many q's as possible -- that is, floor(p/q) many q's... And indeed, that's a valid technique. For unsigned integers, one would expect that p % q == p - (p / q) * q. (Note that unsigned integer division rounds down.)
But this almost feels like cheating because division and remainder operations are so intimately related. (In fact, often if hardware natively supports division, it supports a divide-and-compute-remainder operation because they're so strongly related.)
Assuming we've no access to division, how shall we find a multiple of q greater than 1 to strip away? In hardware, fixed shift operations are cheap (if not practically free) and conceptually represent multiplication by a non-negative power of two. For example, shifting a bit string left by 3 is equivalent to multiplying by 8 (that is, 2^3): 5 decimal is '101' in binary; shift '101' left by appending three zeroes (giving '101000') and the result is 40 in decimal -- five times eight.
Likewise, shift operations are very cheap as software operations and you'll struggle to find a controller that doesn't support them and quickly. (Some architectures such as ARM can even combine shifts with other instructions to make them 'free' a good deal of the time.)
ARMed (couldn't resist) with these shift operations, we can proceed as follows:
Find out the largest power of two we can multiply q by and still be less than p.
Working from the largest power of two down to the smallest, multiply q by each power of two and, if the result is less than or equal to what's left of p, subtract it from what's left of p.
Whatever you've got left is the remainder.
Why does this work? Because in the end you'll find that all the subtracted powers of two actually sum to floor(p / q)! Don't take my word for it, similar knowledge has been known for a very long time.
Breaking apart R..'s answer:
#define HI (-1U-(-1U/2))
This effectively gives you an unsigned integer with only the highest value bit set.
unsigned i;
for (i=0; !(HI & (q<<i)); i++);
This line actually finds the highest power of two that q can be multiplied by before overflowing an unsigned integer. This isn't strictly necessary, but it doesn't change the results other than increasing the amount of execution time required.
In case you're not familiar with the C-isms in this line:
(q<<i) is a left bit shift by i. Recall this is equivalent to multiplying by 2^i.
HI & (q<<i) performs a bitwise-AND. Since HI only has its top bit populated this will only result in a non-zero value when (q<<i) is large enough to cause the top bit to be non-zero. One more shift over to the left and there'd be an integer overflow.
!(HI & (q<<i)) is 'true' when (HI & (q<<i)) is zero and 'false' otherwise.
do { if (p >= (q<<i)) p -= (q<<i); } while (i--);
This is a simple decreasing loop do { .... } while (i--);. Note that post-decrementing is used on i so the loop executes, then it checks to see if i is not zero, then it subtracts one from i, and then if its earlier check resulted in true it continues. This has the property that the loop executes its last time when i is 0. This is important because we may need to strip away an unmultiplied copy of q.
if (p >= (q<<i)) checks whether 2^i * q is less than or equal to p. If it is, p -= (q<<i) strips it away.
The remainder is left.
While most C implementations run on hardware that has a division instruction, the remainder operation can be performed roughly like this, for computing p%q, assuming unsigned values:
#define HI (-1U-(-1U/2))
unsigned i;
for (i=0; !(HI & (q<<i)); i++);
do { if (p >= (q<<i)) p -= (q<<i); } while (i--);
The resulting remainder is in p.
In addition to a hardware instruction and implementation using shifts, as R.. suggests, there's also reciprocal multiplication.
This technique can be used when the right-hand side of % is a constant, known at compile time.
Reciprocal multiplication is used to implement division, but using it for % is easy, based on the formula a%b == a-(a/b)*b.
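As a sketch of what a compiler might emit for a constant divisor (the magic constant below is the standard one for unsigned 32-bit division by 3):
#include <stdint.h>

/* n % 3 without a divide: multiply by the precomputed reciprocal
   0xAAAAAAAB == ceil(2^33 / 3), keep the high bits to get n / 3,
   then apply a%b == a - (a/b)*b. */
static uint32_t mod3(uint32_t n) {
    uint32_t q = (uint32_t)(((uint64_t)n * 0xAAAAAAABu) >> 33);  /* q == n / 3 */
    return n - q * 3;
}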
Depending on the smarts of the optimizer, there is a shortcut for modulo by a power of two. For example, a % 32 can be implemented as a & 31. In general, a % (2^N) == a & (2^N - 1). This is lightning fast compared to division. Most dividers (even in hardware) require at least one cycle for each bit of the result, while a logical AND takes just a few cycles (in the pipeline).
EDIT: this only works if a is unsigned!
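A quick illustration of the difference for negative operands (C's % truncates toward zero, so the compiler has to emit a correction for signed types):
#include <stdio.h>

int main(void) {
    unsigned u = 37;
    int s = -5;
    printf("%u %u\n", u % 32, u & 31);   /* 5 5  : identical for unsigned  */
    printf("%d %d\n", s % 32, s & 31);   /* -5 27: different for negatives */
    return 0;
}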

Pure high-bit multiplication in assembly?

To implement real numbers between 0 and 1, one usually uses ANSI floats or doubles. But fixed-precision numbers between 0 and 1 (decimals modulo 1) can be efficiently implemented as 32-bit integers or 16-bit words, which add like normal integers/words but multiply the "wrong way", meaning that when you multiply X times Y, you keep the high-order bits of the product. This is equivalent to multiplying 0.X and 0.Y, where all the bits of X are behind the decimal point. Likewise, signed numbers between -1 and 1 are also implementable this way with one extra bit and a shift.
How would one implement fixed-precision mod 1 or mod 2 in C (especially using MMX or SSE)?
I think this representation could be useful for efficient representation of unitary matrices in numerically intensive physics simulations. Having integer quantities makes for better use of MMX/SSE, but you need higher-level access to PMULHW.
If 16-bit fixed-point arithmetic is sufficient and you are on x86 or a similar architecture, you can use SSE directly.
The SSSE3 instruction pmulhrsw directly implements signed Q0.15 fixed-point multiplication (mod 2, as you call it, covering -1..+1) in hardware. Addition is no different from the standard 16-bit vector operations; just use paddw.
So a library which handles multiplication and addition of eight signed 16-bit fixed-point variables at a time could look like this:
#include <tmmintrin.h>  /* SSSE3 intrinsics */

typedef __m128i fixed16_t;  /* eight signed Q0.15 values */

/* element-wise (a * b) >> 15 with rounding: Q0.15 multiply */
fixed16_t mul(fixed16_t a, fixed16_t b) {
    return _mm_mulhrs_epi16(a, b);
}

/* element-wise wrapping addition */
fixed16_t add(fixed16_t a, fixed16_t b) {
    return _mm_add_epi16(a, b);
}
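A hypothetical call site, continuing from the definitions above (16384 is 0.5 in Q0.15, so the first lane comes out as 8192, i.e. 0.25):
#include <stdint.h>

int16_t x[8] = { 16384, 16384, -16384, 8192, 0, 32767, -32768, 1000 };
int16_t y[8] = { 16384, -16384, -16384, 16384, 12345, 32767, 32767, -1000 };
int16_t z[8];

void demo(void) {
    fixed16_t a = _mm_loadu_si128((const __m128i *)x);
    fixed16_t b = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)z, mul(a, b));  /* z[0] == 8192 */
}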
Permission granted to use it in any way you like ;-)

Fixed-point multiplication in a known range

I'm trying to multiply A*B in 16-bit fixed point, while keeping as much accuracy as possible. A is 16-bit in unsigned integer range, B is divided by 1000 and always between 0.001 and 9.999. It's been a while since I dealt with problems like that, so:
I know I can just do A*B/1000 after moving to 32-bit variables, then strip back to 16-bit
I'd like to make it faster than that
I'd like to do all the operations without moving to 32-bit (since I've got 16-bit multiplication only)
Is there any easy way to do that?
Edit: A will be between 0 and 4000, so all possible results are in the 16-bit range too.
Edit: B comes from user, set digit-by-digit in the X.XXX mask, that's why the operation is /1000.
No, you have to go to 32 bits. In general, the product of two 16-bit numbers will always give you a 32-bit-wide result.
You should check the instruction set of the CPU you're working on, because most multiply instructions on 16-bit machines have an option to return the result as a 32-bit integer directly.
This would help you a lot because:
short testfunction(short a, short b)
{
    int A32 = a;
    int B32 = b;
    return A32 * B32 / 1000;
}
This would force the compiler to do a 32-bit * 32-bit multiply. On your machine this could be very slow, or even done in multiple steps using 16-bit multiplies only.
A little bit of inline assembly or even better a compiler intrinsic could speed things up a lot.
Here is an example for the Texas Instruments C64x+ DSP which has such intrinsics:
short test(short a, short b)
{
    int product = _mpy(a, b);  // calculates product, returns 32 bit integer
    return product / 1000;
}
Another thought: You're dividing by 1000. Was that constant your choice? It would be much faster to use a power of two as the base for your fixed-point numbers. 1024 is close. Why don't you:
return (a*b)/1024
instead? The compiler could optimize this by using a shift right by 10 bits. That ought to be much faster than doing reciprocal multiplication tricks.
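A sketch of that idea, converting the user's X.XXX entry to a /1024 representation once and then multiplying (the names and the rounding choice are mine):
#include <stdint.h>

/* Convert the user's X.XXX entry (in thousandths, 1..9999) to a /1024
   fixed-point factor once, when it is entered. */
static uint16_t thousandths_to_q10(uint16_t b_milli) {
    return (uint16_t)(((uint32_t)b_milli * 1024u + 500u) / 1000u);
}

/* a in 0..4000 and b_q10 up to 9.999 * 1024: the product fits in 32 bits
   and the scaled result fits back into 16 bits. */
static uint16_t scale_q10(uint16_t a, uint16_t b_q10) {
    return (uint16_t)(((uint32_t)a * b_q10) >> 10);
}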

Resources