Fixed point multiplication - C

I need to convert a value from one unit to another using a non-constant factor. The input value ranges from 0 to 1073676289 and the range value ranges from 0 to 1155625. The conversion can be described like this:
output = input * (range / 1073676289)
My own initial fixed point implementation feels a bit clumsy:
// Input values (examples)
unsigned int input = 536838144; // min 0, max 1073676289
unsigned int range = 1155625; // min 0, max 1155625
// Conversion
unsigned int tmp = (input >> 16) * ((range) >> 3u);
unsigned int output = (tmp / ((1073676289) >> 16u)) << 3u;
Can my code be improved to be simpler or to have better accuracy?

This will give you the best precision with no floating point values, and the result will be rounded to the nearest integer (the added 536838144 is half of the divisor 1073676289):
output = (input * (long long) range + 536838144) / 1073676289;

The problem is that input * range would overflow a 32-bit integer. Fix that by using a 64-bit integer:
uint_least64_t tmp;
tmp = input;
tmp = tmp * range;
tmp = tmp / 1073676289ul;
output = tmp;
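Putting the overflow fix and the rounding together, a complete sketch might look like this (the function name is illustrative; `uint32_t` is assumed for the inputs since both fit in 32 bits):

```c
#include <stdint.h>

/* Scale input by range/1073676289 using a 64-bit intermediate,
   rounding to the nearest integer by adding half the divisor. */
static uint32_t scale(uint32_t input, uint32_t range)
{
    uint64_t t = (uint64_t)input * range; /* at most ~2^51, no overflow */
    return (uint32_t)((t + 1073676289u / 2) / 1073676289u);
}
```

With the example values, scale(536838144, 1155625) yields 577812, matching the floating point baseline.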

A quick trip out to Google brought http://sourceforge.net/projects/fixedptc/ to my attention.
It's a C library in a single header for managing fixed point math in 32- or 64-bit integers.
A little bit of experimentation with the following code:
#include <stdio.h>
#include <stdint.h>
#define FIXEDPT_BITS 64
#include "fixedptc.h"
int main(int argc, char ** argv)
{
unsigned int input = 536838144; // min 0, max 1073676289
unsigned int range = 1155625; // min 0, max 1155625
// Conversion
unsigned int tmp = (input >> 16) * ((range) >> 3u);
unsigned int output = (tmp / ((1073676289) >> 16u)) << 3u;
double output2 = (double)input * ((double)range / 1073676289.0);
uint32_t output3 = fixedpt_toint(fixedpt_xmul(fixedpt_fromint(input), fixedpt_xdiv(fixedpt_fromint(range), fixedpt_fromint(1073676289))));
printf("baseline = %g, better = %u, library = %u\n", output2, output, output3);
return 0;
}
Got me the following results:
baseline = 577812, better = 577776, library = 577812
This shows better precision (matching the floating point result) than you were getting with your code. Under the hood it's not doing anything terribly complicated (and it doesn't work at all in 32 bits):
/* Multiplies two fixedpt numbers, returns the result. */
static inline fixedpt
fixedpt_mul(fixedpt A, fixedpt B)
{
return (((fixedptd)A * (fixedptd)B) >> FIXEDPT_FBITS);
}
/* Divides two fixedpt numbers, returns the result. */
static inline fixedpt
fixedpt_div(fixedpt A, fixedpt B)
{
return (((fixedptd)A << FIXEDPT_FBITS) / (fixedptd)B);
}
But it does show that you can get the precision you want. You'll just need 64 bits to do it.

You won't get it any simpler than output = input * (range / 1073676289).
As noted in the comments, if you are restricted to integer operations, then range / 1073676289 == 0 for any range < 1073676289, so you would effectively end up with:
output = range < 1073676289 ? 0 : input
If that is not what you want and you actually need precision, then
output = (input * (long long)range) / 1073676289
will be the way to go.
If you need to do a lot of these, then I suggest you use double and have your compiler vectorise the operations. Precision will be OK too.
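A sketch of that double-based approach (the function and array names are illustrative; rounding to nearest by adding 0.5 before truncation):

```c
#include <stddef.h>

/* Convert n values at once; the factor is computed once outside the
   loop, which the compiler can then vectorise. The product fits
   exactly in a double (it is below 2^53). */
void convert_all(const unsigned int *in, unsigned int *out, size_t n,
                 unsigned int range)
{
    const double factor = (double)range / 1073676289.0;
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned int)((double)in[i] * factor + 0.5);
}
```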

Related

Iterate bits from left to right for any number

I am trying to implement the modular exponentiation (square-and-multiply, left to right) algorithm in C.
In order to iterate the bits from left to right, I can use masking, which is explained in this link.
In that example the mask used is 0x80, which only works for numbers with at most 8 bits.
To make it work for any number of bits, I need to assign the mask dynamically, but this makes it a bit complicated.
Is there any other solution by which this can be done?
Thanks in advance!
EDIT:
long long base = 23;
long long exponent = 297;
long long mod = 327;
long long result = 1;
unsigned int mask;
for (mask = 0x80; mask != 0; mask >>= 1) {
result = (result * result) % mod; // Square
if (exponent & mask) {
result = (base * result) % mod; // Mul
}
}
As in this example, it will not work if I use the mask 0x80, but if I use 0x100 then it works fine.
Selecting the mask value at run time seems to be an overhead.
If you want to iterate over all bits, you first have to know how many bits there are in your type.
This is a surprisingly complicated matter:
sizeof gives you the number of bytes, but a byte can have more than 8 bits.
limits.h gives you CHAR_BIT to know the number of bits in a byte. But even if you multiply this by the sizeof your type, the result could still be wrong: unsigned types are allowed to contain padding bits that are not part of the value representation, while sizeof returns the storage size in bytes, which includes these padding bits.
Fortunately, this answer has an ingenious macro that can calculate the number of actual value bits based on the maximum value of the respective type:
#define IMAX_BITS(m) ((m) /((m)%0x3fffffffL+1) /0x3fffffffL %0x3fffffffL *30 \
+ (m)%0x3fffffffL /((m)%31+1)/31%31*5 + 4-12/((m)%31+3))
The maximum value of an unsigned type is surprisingly easy to get: just cast -1 to your unsigned type.
So, all in all, your code could look like this, including the macro above:
#define UNSIGNED_BITS IMAX_BITS((unsigned)-1)
// [...]
unsigned int mask;
for (mask = 1u << (UNSIGNED_BITS-1); mask != 0; mask >>= 1) {
// [...]
}
Note that applying this complicated macro has no runtime drawback at all, it's a compile-time constant.
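As a quick sanity check of the macro (this is the widely circulated Hallvard Furuseth macro; on a typical platform with 32-bit unsigned int, UNSIGNED_BITS evaluates to 32 at compile time):

```c
/* Counts the value bits in the maximum value m of an unsigned type.
   Pure integer constant expression - usable in #if and array sizes. */
#define IMAX_BITS(m) ((m)/((m)%0x3fffffffL+1)/0x3fffffffL%0x3fffffffL*30 \
                      + (m)%0x3fffffffL/((m)%31+1)/31%31*5 + 4-12/((m)%31+3))

#define UNSIGNED_BITS IMAX_BITS((unsigned)-1)

/* Compile-time use, e.g. as an array dimension: */
typedef char unsigned_bits_at_least_16[UNSIGNED_BITS >= 16 ? 1 : -1];
```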
Your algorithm seems unnecessarily complicated: bits from the exponent can be tested from the least significant to the most significant in a way that does not depend on the integer type nor its maximum value. Here is a simple implementation that does not need any special case for any size integers:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
unsigned long long base = (argc > 1) ? strtoull(argv[1], NULL, 0) : 23;
unsigned long long exponent = (argc > 2) ? strtoull(argv[2], NULL, 0) : 297;
unsigned long long mod = (argc > 3) ? strtoull(argv[3], NULL, 0) : 327;
unsigned long long y = exponent;
unsigned long long x = base;
unsigned long long result = 1;
for (;;) {
if (y & 1) {
result = result * x % mod;
}
if ((y >>= 1) == 0)
break;
x = x * x % mod;
}
printf("expmod(%llu, %llu, %llu) = %llu\n", base, exponent, mod, result);
return 0;
}
Without any command line arguments, it produces: expmod(23, 297, 327) = 185. You can try other numbers by passing the base, exponent and modulo as command line arguments.
EDIT:
If you must scan the bits in exponent from most significant to least significant, mask should be declared with the same type as exponent and initialized this way if the type is unsigned:
unsigned long long exponent = 297;
unsigned long long mask = 0;
mask = ~mask - (~mask >> 1);
If the type is signed, for complete portability, you must use the definition for its maximum value from <limits.h>. Note however that it would be more efficient to use the unsigned type.
long long exponent = 297;
long long mask = LLONG_MAX - (LLONG_MAX >> 1);
The loop will waste time running through all the most significant 0 bits, so a simpler loop could be used first to skip these bits:
while (mask > exponent) {
mask >>= 1;
}

function to convert float to int (huge integers)

This is a university question. Just to make sure :-) We need to implement (float)x
I have the following code which must convert integer x to its floating point binary representation stored in an unsigned integer.
unsigned float_i2f(int x) {
if (!x) return x;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 0;
int t = a;
while(t != 1) {
/* divide by two until t is 1 */
t >>= 1;
e++;
}
/* calculate mantissa */
int m = a << (32 - e);
/* logical right shift */
m = (m >> 9) & ~(((0x1 << 31) >> 9 << 1));
/* add bias for 32bit float */
e += 127;
int res = sign << 31;
res |= (e << 23);
res |= m;
/* lots of printf */
return res;
}
One problem I encounter now is that when my integers are too big then my code fails. I have this control procedure implemented:
float f = (float)x;
unsigned int r;
memcpy(&r, &f, sizeof(unsigned int));
This of course always produces the correct output.
Now when I do some test runs, these are my outputs (GOAL is what it needs to be, result is what I got):
:!make && ./btest -f float_i2f -1 0x80004999
make: Nothing to be done for `all'.
Score Rating Errors Function
x: [-2147464807] 10000000000000000100100110011001
sign: 1
expone: 01001110100000000000000000000000
mantis: 00000000011111111111111101101100
result: 11001110111111111111111101101100
GOAL: 11001110111111111111111101101101
So in this case, a 1 is added as the LSB.
Next case:
:!make && ./btest -f float_i2f -1 0x80000001
make: Nothing to be done for `all'.
Score Rating Errors Function
x: [-2147483647] 10000000000000000000000000000001
sign: 1
expone: 01001110100000000000000000000000
mantis: 00000000011111111111111111111111
result: 11001110111111111111111111111111
GOAL: 11001111000000000000000000000000
Here 1 is added to the exponent while the mantissa is the complement of it.
I tried for hours to look it up on the internet and in my books, but I can't find any references to this problem. I guess it has something to do with the fact that the mantissa is only 23 bits. But how do I have to handle it then?
EDIT: THIS PART IS OBSOLETE THANKS TO THE COMMENTS BELOW. int l must be unsigned l.
int x = 2147483647;
float f = (float)x;
int l = f;
printf("l: %d\n", l);
then l becomes -2147483648.
How can this happen? So C is doing the casting wrong?
Hope someone can help me here!
Thx
Markus
EDIT 2:
My updated code is now this:
unsigned float_i2f(int x) {
if (x == 0) return 0;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 158;
int t = a;
while (!(t >> 31) & 0x1) {
t <<= 1;
e--;
};
/* calculate mantissa */
int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
m &= 0x7fffff;
int res = sign << 31;
res |= (e << 23);
res |= m;
return res;
}
I also figured out that the code works for all integers in the range -2^24, 2^24. Everything above/below sometimes works but mostly doesn't.
Something is missing, but I really have no idea what. Can anyone help me?
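For reference, the two GOAL mismatches in the traces above are exactly IEEE 754 round-to-nearest-even applied to the 8 low bits that get shifted out when forming the 23-bit mantissa. A sketch of the conversion with that rounding step added (written freely, not within the assignment's restricted-operator rules):

```c
#include <stdint.h>

/* int -> IEEE 754 single-precision bit pattern, with round-to-nearest-even
   on the 8 bits shifted out of the mantissa. */
uint32_t int_to_float_bits(int32_t x)
{
    if (x == 0) return 0;
    uint32_t sign = (uint32_t)x >> 31;
    uint32_t a = sign ? -(uint32_t)x : (uint32_t)x; /* magnitude, mod 2^32 */
    uint32_t e = 158;                               /* 127 bias + 31 */
    while (!(a & 0x80000000u)) { a <<= 1; e--; }    /* normalize */
    uint32_t m = (a >> 8) & 0x7FFFFF;               /* top 23 mantissa bits */
    uint32_t rest = a & 0xFF;                       /* bits shifted out */
    /* round to nearest, ties to even */
    if (rest > 0x80 || (rest == 0x80 && (m & 1))) {
        m++;
        if (m > 0x7FFFFF) { m = 0; e++; } /* mantissa overflow bumps exponent */
    }
    return (sign << 31) | (e << 23) | m;
}
```

This reproduces both GOAL lines: the first example rounds the mantissa's LSB up, and the second overflows the all-ones mantissa into the exponent.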
The answer printed is absolutely correct, as it depends entirely on the underlying representation of the numbers being cast. However, if we understand the binary representation of the number, the result is not surprising.
To understand it, note that an implicit conversion is associated with the assignment operator (ref C99 Standard 6.5.16). The C99 Standard goes on to say:
6.3.1.4 Real floating and integer
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
Your earlier example illustrates exactly this undefined behavior: (float)2147483647 rounds up to 2147483648.0, whose integral part cannot be represented by int, so converting it back to int is undefined. The problem comes from converting floating point to integer out of range, not from assigning a negative value to an unsigned type.
The asserts in the following snippet ought to prevent any undefined behavior from occurring.
#include <assert.h>
#include <limits.h>
#include <math.h>
unsigned int convertFloatingPoint(double v) {
double d;
assert(isfinite(v));
d = trunc(v);
assert((d>=0.0) && (d<=(double)UINT_MAX));
return (unsigned int)d;
}
Another way of doing the same thing: create a union containing a 32-bit integer and a float. The int and float are now just different ways of looking at the same bits of memory:
union {
int myInt;
float myFloat;
} my_union;
my_union.myInt = 0xBFFFF2E5;
printf("float is %f\n", my_union.myFloat);
float is -1.999600
You are telling the compiler to take the number you have (large integer) and make it into a float, not to interpret the number AS float. To do that, you need to tell the compiler to read the number from that address in a different form, so this:
myFloat = *(float *)&myInt;
That means, if we take it apart, starting from the right:
&myInt - the location in memory that holds your integer.
(float *) - really, I want the compiler to use this as a pointer to float, not whatever the compiler thinks it may be.
* - read from the address of whatever is to the right.
myFloat = - set this variable to whatever is to the right.
So, you are telling the compiler: In the location of (myInt), there is a floating point number, now put that float into myFloat.
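As an aside, *(float *)&myInt formally violates C's strict aliasing rules; a memcpy-based version expresses the same reinterpretation in a way that is always well defined (assuming float is 4 bytes, as with IEEE 754 single precision; the function name is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a 32-bit integer as a float.
   memcpy copies the object representation, so no aliasing issues. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, bits_to_float(0x3F800000) gives 1.0f, and bits_to_float(0xBFFFF2E5) gives roughly -1.9996, the value shown above.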

Implementing BigInteger

I need to implement 1024-bit math operations in C. I implemented a simple BigInteger library where the integer is stored as an array "typedef INT UINT1024[400]" in which each element represents one digit. It turned out to be so slow that I decided to implement the BigInteger using an array of sixteen UINT64s: "typedef UINT64 UINT1024[16]".
So, for example, the number 1000 is represented as {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1000},
18446744073709551615 as {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xFFFFFFFFFFFFFFFF}, and 18446744073709551616 as {0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0}.
I started with writing functions to convert a char array number to a UINT1024 and a UINT1024 back to a char array; it worked for numbers <= 0xFFFFFFFFFFFFFFFF.
Here's what I did:
void UINT1024_FROMSTRING(UINT1024 Integer,const char szInteger[],UINT Length) {
int c = 15;
UINT64 Result = 0,Operation,Carry = 0;
UINT64 Temp = 1;
while(Length--)
{
Operation = (szInteger[Length] - '0') * Temp;
Result += Operation + Carry;
/*Overflow ?*/
if (Result < Operation || Temp == 1000000000000000000)
{
Carry = Result - Operation;
Result = 0;
Integer[c--] = 0;
Temp = 1;
}
else Carry = 0;
Temp *= 10;
}
if (Result || Carry)
{
/* I DONT KNOW WHAT TO DO HERE ! */
}
while(c--) Integer[c] = 0;
}
So how can I implement this? Is it possible to do it with UINT64 elements for speed, or do I have to stick with one array element per decimal digit, which is very slow for 1024-bit operations?
PS: I can't use any existing library !
Thanks in advance !
Update
Still can't figure out how to do the multiplication. I am using this function:
void _uint128_mul(UINT64 u,UINT64 v,UINT64 * ui64Hi,UINT64 * ui64Lo)
{
UINT64 ulo, uhi, vlo, vhi, k, t;
UINT64 wlo, whi, wt;
uhi = u >> 32;
ulo = u & 0xFFFFFFFF;
vhi = v >> 32;
vlo = v & 0xFFFFFFFF;
t = ulo*vlo; wlo = t & 0xFFFFFFFF;
k = t >> 32;
t = uhi*vlo + k;
whi = t & 0xFFFFFFFF;
wt = t >> 32;
t = ulo*vhi + whi;
k = t >> 32;
*ui64Lo = (t << 32) + wlo;
*ui64Hi = uhi*vhi + wt + k;
}
Then
void multiply(uint1024_t dUInteger,uint1024_t UInteger)
{
int i = 16;
UINT64 lo,hi,Carry = 0;
while(i--)
{
_uint128_mul(dUInteger[i],UInteger[15],&hi,&lo);
dUInteger[i] = lo + Carry;
Carry = hi;
}
}
I really need some help with this. Thanks in advance!
You need to implement two functions for your UINT1024 type: multiply by a small integer and add a small integer. Then for each digit you convert, multiply the previous value by 10 and add the value of the digit.
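That suggestion can be sketched as follows, with the multiply and add fused into one helper (shown with two 64-bit limbs for brevity; use 16 for 1024 bits; the unsigned __int128 type is a GCC/Clang extension, and the function names are illustrative):

```c
#include <stdint.h>

#define LIMBS 2 /* 2 x 64-bit limbs = 128 bits; use 16 for 1024 bits */
typedef uint64_t bigint[LIMBS]; /* most significant limb first */

/* big = big * m + a, processed from the least significant limb. */
static void mul_add_small(bigint big, uint64_t m, uint64_t a)
{
    uint64_t carry = a;
    for (int i = LIMBS - 1; i >= 0; i--) {
        /* 64x64 -> 128 multiply via unsigned __int128 (GCC/Clang) */
        unsigned __int128 t = (unsigned __int128)big[i] * m + carry;
        big[i] = (uint64_t)t;
        carry  = (uint64_t)(t >> 64);
    }
}

/* Decimal string -> bigint: repeatedly multiply by 10 and add a digit. */
static void from_string(bigint big, const char *s)
{
    for (int i = 0; i < LIMBS; i++) big[i] = 0;
    for (; *s; s++)
        mul_add_small(big, 10, (uint64_t)(*s - '0'));
}
```

For example, from_string on "18446744073709551616" (2^64) sets the upper limb to 1 and the lower limb to 0, which is exactly the overflow case the question struggles with.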
Writing, debugging, defining test cases, and checking they all work right is a huge undertaking. Just get one of the packaged multiprecision arithmetic libraries, like GMP, or perhaps NTL or CLN for C++. There are other alternatives; trawl the web. Jörg Arndt's Matters Computational gives source code in C++.
If you are doing this for your education, you should take the middle road between your two previous approaches. Put more than 1 bit into a leaf or digit, but do not use the full bit range of the integer type.
The reason is that this may significantly simplify the multiplication operation: you can at first just accumulate the products a[i]*b[j] in c[i+j], and then normalize the result to the fixed digit range. c has length 2N-1, and the product should fit into 1024 bits, so a and b are restricted to 512 bits.
If the arrays a and b hold N digits with maximum value B-1, B=2^b, then the largest of the c[k] is c[N-1] with bound N*(B-1)^2. Thus the design constraints are
(2N)*b>=1024
ld(N)+(2b)<=64
b   N   2N*b   ld(N)+2b
32  16  1024   68
24  22  1056   53
28  19  1064   61
So one possibility is to set b=28, B=1<<28.
Even more suited for educational purposes would be to set B=10^d, e.g. with d=9, so that conversion from and to string is relatively trivial.
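A tiny sketch of that accumulate-then-normalize idea with B = 10^9 and uint64_t accumulators (the sizes and function name are illustrative; with N = 4 the bound N*(B-1)^2 is about 4e18, safely below 2^64):

```c
#include <stdint.h>

#define N 4              /* number of base-10^9 digits (illustrative) */
#define B 1000000000ull  /* digit base 10^9 */

/* c = a * b, schoolbook: accumulate a[i]*b[j] into c[i+j], then
   normalize digits back below B. Least significant digit first;
   c must have room for 2*N digits. */
void bigmul(const uint64_t a[N], const uint64_t b[N], uint64_t c[2 * N])
{
    for (int k = 0; k < 2 * N; k++) c[k] = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i + j] += a[i] * b[j];   /* fits: N*(B-1)^2 < 2^64 */
    uint64_t carry = 0;                /* normalization pass */
    for (int k = 0; k < 2 * N; k++) {
        c[k] += carry;
        carry = c[k] / B;
        c[k] %= B;
    }
}
```

With B = 10^9 the digits print directly as 9-decimal chunks, which is what makes string conversion relatively trivial.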

Full variation of Random numbers in C

I am trying to generate 64-bit random numbers using the following code. I want the numbers in binary, but the problem is I can't get all the bits to vary. I want the numbers to vary as much as possible.
void PrintDoubleAsCBytes(double d, FILE* f)
{
f = fopen("tb.txt","a");
unsigned char a[sizeof(d)];
unsigned i;
memcpy(a, &d, sizeof(d));
for (i = 0; i < sizeof(a); i++){
fprintf(f, "%0*X", (CHAR_BIT + 3) / 4, a[sizeof(d)-1-i]);
}
fprintf(f,"\n");
fclose(f); /*done!*/
}
int main (int argc, char *argv[])
{
int limit = 100 ;
double a, b;
double result;
int i ;
printf("limit = %d", limit );
for (i= 0 ; i< limit;i++)
{
a= rand();
b= rand();
result = a * b;
printf ("A= %f B = %f\n",a,b);
printf ("result= %f\n",result);
PrintDoubleAsCBytes(a, stdout); puts("");
PrintDoubleAsCBytes(b, stdout); puts("");
PrintDoubleAsCBytes(result, stdout); puts("");
}
}
OUTPUT FILE
41DAE2D159C00000 //Last bits remain zero, I want them to change as well as in case of the result
41C93D91E3000000
43B534EE7FAEB1C3
41D90F261A400000
41D98CD21CC00000
43C4021C95228080
41DD2C3714400000
41B9495CFF000000
43A70D6CAD0EE321
How do I achieve this? I do not have much experience in software coding.
In Java it is very easy:
Random rng = new Random(); // do this only once
long randLong = rng.nextLong();
double randDoubleFromBits = Double.longBitsToDouble(randLong);
In C I only know of a hack way to do it :)
Since RAND_MAX can be as low as 2^15-1 but is implementation defined, maybe you can get 64 random bits out of rand() by doing masks and bitshifts:
//seed program once at the start
srand(time(NULL));
uint64_t a = rand()&0x7FFF;
uint64_t b = rand()&0x7FFF;
uint64_t c = rand()&0x7FFF;
uint64_t d = rand()&0x7FFF;
uint64_t e = rand()&0x7FFF;
uint64_t random = (a<<60)+(b<<45)+(c<<30)+(d<<15)+e;
Then stuff it in a union and use the other member of the union to interpret its bits as a double. Something like
union
{
double d;
uint64_t l; /* long may be only 32 bits; use a fixed 64-bit type */
} doubleOrLong;
doubleOrLong.l = random;
double randomDouble = doubleOrLong.d;
(I haven't tested this code)
EDIT: Explanation of how it should work
First, srand(time(NULL)); seeds rand with the current timestamp. So you only need to do this once at the start, and if you want to reproduce an earlier RNG series you can reuse that seed if you like.
rand() returns a random, unbiased integer between 0 and RAND_MAX inclusive. RAND_MAX is guaranteed to be at least 2^15-1, which is 0x7FFF. To write the program such that it doesn't matter what RAND_MAX is (for example, it could be 2^16-1, 2^31-1, 2^32-1...), we mask out all but the bottom 15 bits - 0x7FFF is 0111 1111 1111 1111 in binary, or the bottom 15 bits.
Now we have to pack all of our 15-bit pieces into 64 bits. The bitshift operator, <<, shifts the left operand to the left by the number of bit positions given by the right operand. So the final uint64_t we call random has random bits derived from the other variables like so:
aaaa bbbb bbbb bbbb bbbc cccc cccc cccc ccdd dddd dddd dddd deee eeee eeee eeee
But this is still being treated as a uint64_t, not as a double. It's undefined behaviour to do so, so you should make sure it works the way you expect on your compiler of choice, but if you put this uint64_t in a union and then read the union's other double member, then you'll (hopefully!) interpret those same bits as a double made up of random bits.
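A self-contained version of that idea, using memcpy instead of the union (which sidesteps the aliasing question entirely; the function name is illustrative):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assemble 64 random bits from five 15-bit rand() pieces and
   reinterpret them as a double. 5*15 = 75 bits are generated; the
   excess high bits simply fall off the top of the shift. The result
   may be an infinity, NaN or subnormal - callers needing ordinary
   values must filter those out. */
double random_double_bits(void)
{
    uint64_t r = 0;
    for (int i = 0; i < 5; i++)
        r = (r << 15) | (uint64_t)(rand() & 0x7FFF);
    double d;
    memcpy(&d, &r, sizeof d); /* well-defined bit reinterpretation */
    return d;
}
```

As in the answer above, call srand(time(NULL)) once at program start before using it.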
Depending on your platform, but assuming IEEE 754, e.g. Wikipedia, why not explicitly handle the internal double format?
(Barring mistakes), this generates random but valid doubles.
[ Haven't quite covered all bases here, e.g. case where exp = 0 or 0x7ff ]
double randomDouble()
{
uint64_t buf = 0ull;
// sign bit
int odd = rand() % 2;
if (odd)
buf = 1ull<<63;
// exponent
int exponentLength = 11;
int exponentMask = (1 << exponentLength) - 1;
int exponentLocation = 63 - exponentLength;
uint64_t exponent = rand()&exponentMask;
buf += exponent << exponentLocation;
// fraction
int fractionLength = exponentLocation; // 52 bits
uint64_t fractionMask = ((uint64_t)1 << fractionLength) - 1;
// Courtesy of Patashu
uint64_t a = rand()&0x7FFF;
uint64_t b = rand()&0x7FFF;
uint64_t c = rand()&0x7FFF;
uint64_t d = rand()&0x7FFF;
uint64_t fraction = (a<<45)+(b<<30)+(c<<15)+d;
fraction = fraction & fractionMask;
buf += fraction;
// reinterpret the bits as a double (memcpy avoids aliasing issues)
double res;
memcpy(&res, &buf, sizeof res);
return res;
}
You could use this:
void GenerateRandomDouble(double* d)
{
unsigned char* p = (unsigned char*)d;
unsigned i;
for (i = 0; i < sizeof(*d); i++)
p[i] = rand();
}
The problem with this method is that your C program may be unable to use some of the values returned by this function, because they're invalid or special floating point values.
But if you're testing your hardware, you could generate random bytes and feed them directly into said hardware without first converting them into a double.
The only place where you need to treat these random bytes as a double is the point of validation of the results returned by the hardware.
At that point you need to look at the bytes and see if they represent a valid value. If they do, you can memcpy() the bytes into a double and use it.
The next problem to deal with is overflows/underflows and exceptions resulting from whatever you need to do with these random doubles (addition, multiplication, etc). You need to figure out how to deal with them on your platform (compiler+CPU+OS), whether or not you can safely and reliably detect them.
But that looks like a separate question and it has probably already been asked and answered.

Picking good first estimates for Goldschmidt division

I'm calculating fixedpoint reciprocals in Q22.10 with Goldschmidt division for use in my software rasterizer on ARM.
This is done by just setting the numerator to 1, i.e. the numerator becomes the scalar on the first iteration. To be honest, I'm kind of following the Wikipedia algorithm blindly here. The article says that if the denominator is scaled into the half-open range (0.5, 1.0], a good first estimate can be based on the denominator alone: let F be the estimated scalar and D be the denominator, then F = 2 - D.
But when doing this, I lose a lot of precision. Say if I want to find the reciprocal of 512.00002f. In order to scale the number down, I lose 10 bits of precision in the fraction part, which is shifted out. So, my questions are:
Is there a way to pick a better estimate which does not require normalization? Why? Why not? A mathematical proof of why this is or is not possible would be great.
Also, is it possible to pre-calculate the first estimates so the series converges faster? Right now, it converges after the 4th iteration on average. On ARM this is about ~50 cycles worst case, and that's not taking emulation of clz/bsr into account, nor memory lookups. If it's possible, I'd like to know if doing so increases the error, and by how much.
Here is my testcase. Note: The software implementation of clz on line 13 is from my post here. You can replace it with an intrinsic if you want. clz should return the number of leading zeros, and 32 for the value 0.
#include <stdio.h>
#include <stdint.h>
const unsigned int BASE = 22ULL;
static unsigned int divfp(unsigned int val, int* iter)
{
/* Numerator, denominator, estimate scalar and previous denominator */
unsigned long long N,D,F, DPREV;
int bitpos;
*iter = 1;
D = val;
/* Get the shift amount + is right-shift, - is left-shift. */
bitpos = 31 - clz(val) - BASE;
/* Normalize into the half-range (0.5, 1.0] */
if(0 < bitpos)
D >>= bitpos;
else
D <<= (-bitpos);
/* (FNi / FDi) == (FN(i+1) / FD(i+1)) */
/* F = 2 - D */
F = (2ULL<<BASE) - D;
/* N = F for the first iteration, because the numerator is simply 1.
So don't waste a 64-bit UMULL on a multiply with 1 */
N = F;
D = ((unsigned long long)D*F)>>BASE;
while(1){
DPREV = D;
F = (2<<(BASE)) - D;
D = ((unsigned long long)D*F)>>BASE;
/* Bail when we get the same value for two denominators in a row.
This means that the error is too small to make any further progress. */
if(D == DPREV)
break;
N = ((unsigned long long)N*F)>>BASE;
*iter = *iter + 1;
}
if(0 < bitpos)
N >>= bitpos;
else
N <<= (-bitpos);
return N;
}
int main(int argc, char* argv[])
{
double fv, fa;
int iter;
unsigned int D, result;
sscanf(argv[1], "%lf", &fv);
D = fv*(double)(1<<BASE);
result = divfp(D, &iter);
fa = (double)result / (double)(1UL << BASE);
printf("Value: %8.8lf 1/value: %8.8lf FP value: 0x%.8X\n", fv, fa, result);
printf("iteration: %d\n",iter);
return 0;
}
I could not resist spending an hour on your problem...
This algorithm is described in section 5.5.2 of "Arithmetique des ordinateurs" by Jean-Michel Muller (in French). It is actually a special case of Newton iteration with 1 as the starting point. The book gives a simple formulation of the algorithm to compute N/D, with D normalized into the range [1/2, 1[:
e = 1 - D
Q = N
repeat K times:
Q = Q * (1+e)
e = e*e
The number of correct bits doubles at each iteration. In the case of 32 bits, 4 iterations will be enough. You can also iterate until e becomes too small to modify Q.
Normalization is used because it provides the max number of significant bits in the result. It is also easier to compute the error and number of iterations needed when the inputs are in a known range.
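The convergence claim is easy to sanity-check in plain double arithmetic before worrying about fixed point (a throwaway sketch; goldschmidt is an illustrative name, not the fixed-point implementation that follows):

```c
/* Goldschmidt reciprocal in doubles: for d normalized into [0.5, 1),
   compute n/d via e = 1-d, then repeat q *= (1+e), e *= e.
   The number of correct bits roughly doubles each iteration. */
double goldschmidt(double n, double d, int iters)
{
    double e = 1.0 - d;
    double q = n;
    for (int i = 0; i < iters; i++) {
        q *= 1.0 + e;  /* refine quotient */
        e *= e;        /* error squares -> quadratic convergence */
    }
    return q;
}
```

For instance, goldschmidt(1.0, 0.75, 4) is already within about 1e-9 of 4/3, while one iteration gives only 1.25.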
Once your input value is normalized, you don't need to bother with the value of BASE until you have the inverse. You simply have a 32-bit number X normalized in range 0x80000000 to 0xFFFFFFFF, and compute an approximation of Y=2^64/X (Y is at most 2^33).
This simplified algorithm may be implemented for your Q22.10 representation as follows:
// Fixed point inversion
// EB Apr 2010
#include <math.h>
#include <stdio.h>
// Number X is represented by integer I: X = I/2^BASE.
// We have (32-BASE) bits in integral part, and BASE bits in fractional part
#define BASE 22
typedef unsigned int uint32;
typedef unsigned long long int uint64;
// Convert FP to/from double (debug)
double toDouble(uint32 fp) { return fp/(double)(1<<BASE); }
uint32 toFP(double x) { return (int)floor(0.5+x*(1<<BASE)); }
// Return inverse of FP
uint32 inverse(uint32 fp)
{
if (fp == 0) return (uint32)-1; // invalid
// Shift FP to have the most significant bit set
int shl = 0; // normalization shift
uint32 nfp = fp; // normalized FP
while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
uint64 q = 0x100000000ULL; // 2^32
uint64 e = 0x100000000ULL - (uint64)nfp; // 2^32-NFP
int i;
for (i=0;i<4;i++) // iterate
{
// Both multiplications are actually
// 32x32 bits truncated to the 32 high bits
q += (q*e)>>(uint64)32;
e = (e*e)>>(uint64)32;
printf("Q=0x%llx E=0x%llx\n",q,e);
}
// Here, (Q/2^32) is the inverse of (NFP/2^32).
// We have 2^31<=NFP<2^32 and 2^32<Q<=2^33
return (uint32)(q>>(64-2*BASE-shl));
}
int main()
{
double x = 1.234567;
uint32 xx = toFP(x);
uint32 yy = inverse(xx);
double y = toDouble(yy);
printf("X=%f Y=%f X*Y=%f\n",x,y,x*y);
printf("XX=0x%08x YY=0x%08x XX*YY=0x%016llx\n",xx,yy,(uint64)xx*(uint64)yy);
}
As noted in the code, the multiplications are not full 32x32->64 bits. E will become smaller and smaller and fits initially on 32 bits. Q will always be on 34 bits. We take only the high 32 bits of the products.
The derivation of 64-2*BASE-shl is left as an exercise for the reader :-). If it becomes 0 or negative, the result is not representable (the input value is too small).
EDIT. As a follow-up to my comment, here is a second version with an implicit 32-th bit on Q. Both E and Q are now stored on 32 bits:
uint32 inverse2(uint32 fp)
{
if (fp == 0) return (uint32)-1; // invalid
// Shift FP to have the most significant bit set
int shl = 0; // normalization shift for FP
uint32 nfp = fp; // normalized FP
while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
int shr = 64-2*BASE-shl; // normalization shift for Q
if (shr <= 0) return (uint32)-1; // overflow
uint64 e = 1 + (0xFFFFFFFF ^ nfp); // 2^32-NFP, max value is 2^31
uint64 q = e; // 2^32 implicit bit, and implicit first iteration
int i;
for (i=0;i<3;i++) // iterate
{
e = (e*e)>>(uint64)32;
q += e + ((q*e)>>(uint64)32);
}
return (uint32)(q>>shr) + (1<<(32-shr)); // insert implicit bit
}
A couple of ideas for you, though none that solve your problem directly as stated.
Why this algorithm for division? Most divides I've seen on ARM use some variant of
adcs hi, den, hi, lsl #1
subcc hi, hi, den
adcs lo, lo, lo
repeated n bits times with a binary search off of the clz to determine where to start. That's pretty dang fast.
If precision is a big problem, you are not limited to 32/64 bits for your fixed point representation. It'll be a bit slower, but you can do add/adc or sub/sbc to move values across registers. mul/mla are also designed for this kind of work.
Again, not direct answers for you, but possibly a few ideas to go forward this. Seeing the actual ARM code would probably help me a bit as well.
Mads, you are not losing any precision at all. When you divide 512.00002f by 2^10, you merely decrease the exponent of your floating point number by 10. The mantissa remains the same. Of course, unless the exponent hits its minimum value, but that shouldn't happen since you're scaling into (0.5, 1].
EDIT: Ok, so you're using a fixed binary point. In that case you should allow a different representation of the denominator in your algorithm. The value of D stays in (0.5, 1] not only at the beginning but throughout the whole calculation (it's easy to prove that x * (2 - x) < 1 for x < 1). So you should represent the denominator with the binary point at bit 31, as the shifts below do. This way you keep 31 bits of precision all the time.
EDIT: To implement this you'll have to change the following lines of your code:
//bitpos = 31 - clz(val) - BASE;
bitpos = 31 - clz(val) - 31;
...
//F = (2ULL<<BASE) - D;
//N = F;
//D = ((unsigned long long)D*F)>>BASE;
F = -D;
N = F >> (31 - BASE);
D = ((unsigned long long)D*F)>>31;
...
//F = (2<<(BASE)) - D;
//D = ((unsigned long long)D*F)>>BASE;
F = -D;
D = ((unsigned long long)D*F)>>31;
...
//N = ((unsigned long long)N*F)>>BASE;
N = ((unsigned long long)N*F)>>31;
Also in the end you'll have to shift N not by bitpos but some different value which I'm too lazy to figure out right now :).
