Most efficient way of splitting a number into whole and decimal parts - C

I am trying to split a double into its whole and fraction parts. My code works, but it is much too slow given that the microcontroller I am using does not have a dedicated multiply instruction in assembly. For instance,
temp = (int)(tacc - temp); // This line takes about 500us
However, if I do this,
temp = (int)(100*(tacc-temp)); // This takes about 4ms
I could speed up the microcontroller, but since I'm trying to stay low power, I am curious if it is possible to do this faster. This is the little piece I'm actually interested in optimizing:
txBuffer[5] = ((int)tacc_y); // Whole part
txBuffer[6] = (int)(100*(tacc_y-txBuffer[5])); // 2 digits of fraction
I remember there is a fast way of multiplying by 10 using shifts, such that:
a * 10 = ((a << 2) + a) << 1
I could probably nest this and get multiplication by 100. Is there any other way?
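For reference, the nested trick expands into a direct shift-and-add form, since 100 = 64 + 32 + 4. A sketch (on a core without a hardware multiplier, a compiler will often emit something similar for a constant multiply anyway):
/* Multiply by 100 with shifts and adds: 100 = 64 + 32 + 4 */
static inline unsigned mul100(unsigned a) {
    return (a << 6) + (a << 5) + (a << 2);
}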

I believe the correct answer, which may not be the fastest, is this:
double whole = trunc(tacc_y);
double fract = tacc_y - whole;
// first, extract (some of) the data into an int
fract = fract * (1 << 11); // should be just an exponent change
int ifract = (int)trunc(fract);
// next, decimalize it (two digits, as in the question)
ifract = ifract * 100; // assuming integer multiply is available
ifract = ifract >> 11;
txBuffer[5] = (int)whole;
txBuffer[6] = ifract;
If integer multiplication is not OK, then your shift trick should now work.
If the floating-point multiply is too stupid to just edit the exponent quickly, then you can do it manually by bit twiddling, but I wouldn't recommend it as a first option. In any case, once you've got as far as bit-twiddling FP numbers you might as well just extract the mantissa, or even do the whole operation manually.

I assume you are working with doubles. You could try to take a double apart bitwise:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    double input = 10.64;
    int sign = (int)(*(int64_t *)&input >> 63); // 0 for positive values
    int exponent = (int)((*(int64_t *)&input >> 52) & 0x7FF);
    int64_t fraction = *(int64_t *)&input & 0xFFFFFFFFFFFFF;
    fraction |= 0x10000000000000; // restore the implicit leading 1 bit
    int64_t whole = fraction >> (52 + 1023 - exponent);
    int64_t digits = ((fraction - (whole << (52 + 1023 - exponent))) * 100) >> (52 + 1023 - exponent);
    printf("%lf, %lld.%lld (sign %d)\n", input, (long long)whole, (long long)digits, sign);
    return 0;
}

Related

How to generate an IEEE 754 Single-precision float using only integer arithmetic?

Assuming a low end microprocessor with no floating point arithmetic, I need to generate an IEEE 754 single precision floating point format number to push out to a file.
I need to write a function that takes three integers being the sign, whole and the fraction and returns a byte array with 4 bytes being the IEEE 754 single precision representation.
Something like:
// Convert 75.65 to 4 byte IEEE 754 single precision representation
char* bytes = convert(0, 75, 65);
Does anybody have any pointers or example C code please? I'm particularly struggling to understand how to convert the mantissa.
You will need to generate the sign (1 bit), the exponent (8 bits, a biased power of 2), and the fraction/mantissa (23 bits).
Bear in mind that the fraction has an implicit leading '1' bit, which means that the most significant leading '1' bit (2^22) is not stored in the IEEE format. For example, given a fraction of 0x755555 (24 bits), the actual bits stored would be 0x355555 (23 bits).
Also bear in mind that the fraction is shifted so that the binary point is immediately to the right of the implicit leading '1' bit. So an IEEE 23-bit fraction of 11 0101 0101... represents the 24-bit binary fraction 1.11 0101 0101...
This means that the exponent has to be adjusted accordingly.
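As a sketch of those two points (the field layout is standard IEEE 754 binary32, but the helper and its names are mine, not the asker's requested converter):
#include <stdint.h>

/* Pack sign, biased exponent, and a normalized 24-bit significand
   (leading 1 in bit 23) into an IEEE 754 binary32 word. */
static uint32_t pack_binary32(uint32_t sign, uint32_t biased_exp,
                              uint32_t significand24) {
    uint32_t fraction = significand24 & 0x7FFFFF; /* drop the implicit bit */
    return (sign << 31) | (biased_exp << 23) | fraction;
}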
Does the value have to be written big endian or little endian? Reversed bit ordering?
If you are free to choose the format, think about writing the value as a string instead. That way you can easily convert the integer: just write the int part and append "e0" as the exponent (or omit the exponent and write ".0").
For the binary representation, have a look at Wikipedia. Best is to first assemble the bit fields into a uint32_t - the structure is given in the linked article. Note that you might have to round if the integer has more than 23 significant bits. Remember to normalize the generated value.
The second step is to serialize the uint32_t to a uint8_t array. Mind the endianness of the result!
Also, use uint8_t for the result if you really want 8-bit values; in any case use an unsigned type. For the intermediate representation, uint32_t is recommended, as it guarantees you operate on 32-bit values.
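For the serialization step, a minimal sketch (assuming little-endian output; reverse the indexing for big-endian):
#include <stdint.h>

/* Split a packed binary32 word into 4 bytes, little-endian. */
static void to_bytes_le(uint32_t w, uint8_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(w >> (8 * i));
}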
You haven't had a go yet, so no giveaways.
Remember that two 32-bit integers a and b interpreted as a number a.b can be regarded as a single 64-bit integer with an exponent of 2^-32 (where ^ denotes exponentiation).
So without doing anything you've got it in the form:
s * m * 2^e
The only problem is your mantissa is too long and your number isn't normalized.
A bit of shifting and adding/subtracting with a possible rounding step and you're done.
You can use a software floating point compiler/library.
See https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
The basic premise is to:
Goal: a binary32 float.
Form a binary fixed-point representation of the combined whole and fractional (hundredths) parts. The code below uses a structure with separate whole and hundredths fields; it is important that the whole field is at least 32 bits.
Shift left/right (*2 and /2) until the MSbit is in the implied-bit position, counting the shifts. A robust solution also notes any non-zero bits shifted out.
Form a biased exponent.
Round mantissa and drop implied bit.
Form sign (not done here).
Combine the above three fields to form the answer.
As sub-normals, infinities and Not-a-Number cannot result from whole, hundredths input, generating those float special cases is not addressed here.
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

#define IMPLIED_BIT 0x00800000L

typedef struct {
    int_least32_t whole;
    int hundredths;
} x_xx;

int_least32_t convert(int whole, int hundredths) {
    assert(whole >= 0 && hundredths >= 0 && hundredths < 100);
    if (whole == 0 && hundredths == 0) return 0;
    x_xx x = { whole, hundredths };
    int_least32_t expo = 0;
    int sticky_bit = 0; // Note any 1 bits shifted out
    while (x.whole >= IMPLIED_BIT * 2) {
        expo++;
        sticky_bit |= x.hundredths % 2;
        x.hundredths /= 2;
        x.hundredths += (x.whole % 2) * (100 / 2);
        x.whole /= 2;
    }
    while (x.whole < IMPLIED_BIT) {
        expo--;
        x.hundredths *= 2;
        x.whole *= 2;
        x.whole += x.hundredths / 100;
        x.hundredths %= 100;
    }
    int32_t mantissa = x.whole;
    // Round to nearest - ties to even
    if (x.hundredths >= 100 / 2 && (x.hundredths > 100 / 2 || x.whole % 2 || sticky_bit)) {
        mantissa++;
    }
    if (mantissa >= (IMPLIED_BIT * 2)) {
        mantissa /= 2;
        expo++;
    }
    mantissa &= ~IMPLIED_BIT; // Toss MSbit as it is implied in the final format
    expo += 24 + 126;         // Bias: 24 bits + binary32 bias
    expo <<= 23;              // Offset
    return expo | mantissa;
}

void test_convert(int whole, int hundredths) {
    union {
        uint32_t u32;
        float f;
    } u;
    u.u32 = convert(whole, hundredths);
    volatile float best = whole + hundredths / 100.0;
    printf("%10d.%02d --> %15.6e %15.6e Same:%d\n", whole, hundredths, u.f, best,
           best == u.f);
}

int main(void) {
    test_convert(75, 65);
    test_convert(0, 1);
    test_convert(INT_MAX, 99);
    return 0;
}
Output
75.65 --> 7.565000e+01 7.565000e+01 Same:1
0.01 --> 1.000000e-02 1.000000e-02 Same:1
2147483647.99 --> 2.147484e+09 2.147484e+09 Same:1
Known issues: sign not applied.

custom floats addition, implementing math expression - C

I'm implementing a new kind of float, "NewFloat", in C. It uses 32 bits and has no sign bit (only positive numbers), so all 32 bits are used by the exponent and mantissa.
In my example, I have 6 bits for the exponent (EXPBITS) and 26 for the mantissa (MANBITS).
There is an offset used for representing negative exponents, which is (2^(EXPBITS-1)-1).
Given a NewFloat nf1, the translation to a real number is done like this:
nf1 = 2^(exponent - offset) * (1 + mantissa/2^MANBITS).
Now, given two NewFloats (nf1, nf2), each with its own exponent and mantissa (exp1, man1, exp2, man2) and the same offset,
and assuming that nf1 > nf2, I can calculate the exponent and mantissa of the sum of nf1 and nf2 like this: link
To spare your time, I found that:
Exponent of the sum is: exp1
Mantissa of the sum is: man1 + 2^(exp2 - exp1 + MANBITS) + 2^(exp2 - exp1) * man2
To ease the coding, I split the work and calculate each component of the mantissa separately:
x = 2^(exp2 - exp1 + MANBITS)
y = 2^(exp2 - exp1) * man2
I'm fairly sure that I'm not implementing the mantissa part right:
unsigned long long x = (1 << (exp2 - exp1 + MANBITS));
unsigned long long y = ((1 << exp2) >> exp1) * man2;
unsigned long long tempMan = man1;
tempMan += x + y;
unsigned int exp = exp1; // CAN USE DIRECTLY EXP1.
unsigned int man = (unsigned int)tempMan;
The sum is represented like this:
sum = 2^(exp1 - offset) * (1 + (man1 + x + y)/2^MANBITS).
The last thing I have to handle is the case of an overflow of the sum's mantissa.
In this case, I should add 1 to the exponent and halve the whole (1 + (man1 + x + y)/2^MANBITS) expression.
In that case, given that I only need to represent the numerator in bits, how do I do that after the division?
Is there any problem in my implementation? I have a feeling there is.
If you have a better way of doing this, I would be really happy to hear about it.
Please don't ask me why I'm doing this... it's an exercise I've been trying to solve for more than 10 hours.
The code is doing signed int shifts where unsigned long long is certainly desired.
// unsigned long long x = (1 << (exp2 - exp1 + MANBITS));
unsigned long long x = (1LLU << (exp2 - exp1 + MANBITS));
Notes:
Suggest more meaningful variable names like x_mantissa.
Rounding not implemented. Rounding can cause a need for increase in exponent.
Overflow not detected/implemented.
Sub-normals not implemented. Should NewFloat not use them, note that a - b --> 0 does not mean a == b.
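To make the overflow case concrete, here is a minimal sketch (the struct and names are mine, not the asker's code) of the aligned add with the exponent bump the question asks about; rounding is still plain truncation:
#include <stdint.h>

#define MANBITS 26
typedef struct { unsigned exp, man; } NewFloat;

/* Requires nf1 >= nf2, i.e. a.exp >= b.exp. */
static NewFloat nf_add(NewFloat a, NewFloat b) {
    unsigned shift = a.exp - b.exp;
    /* Align b's full significand (implicit bit restored) to a's exponent */
    uint64_t msum = (uint64_t)a.man + (((1ULL << MANBITS) | b.man) >> shift);
    NewFloat r = { a.exp, 0 };
    if (msum >= (1ULL << MANBITS)) { /* mantissa overflow: sum >= 2.0 */
        msum = (msum - (1ULL << MANBITS)) >> 1; /* halve, keeping the 1.x form */
        r.exp++;
    }
    r.man = (unsigned)msum;
    return r;
}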

16bit Float Multiplication in C

I'm working on a small project, where I need float multiplication with 16bit floats (half precision). Unhappily, I'm facing some problems with the algorithm:
Example Output
1 * 5 = 5
2 * 5 = 10
3 * 5 = 14.5
4 * 5 = 20
5 * 5 = 24.5
100 * 4 = 100
100 * 5 = 482
The Source Code
const int bits = 16;
const int exponent_length = 5;
const int fraction_length = 10;
const int bias = pow(2, exponent_length - 1) - 1;
const int exponent_mask = ((1 << 5) - 1) << fraction_length;
const int fraction_mask = (1 << fraction_length) - 1;
const int hidden_bit = (1 << 10); // Was 1 << 11 before update 1
int float_mul(int f1, int f2) {
    int res_exp = 0;
    int res_frac = 0;
    int exp1 = (f1 & exponent_mask) >> fraction_length;
    int exp2 = (f2 & exponent_mask) >> fraction_length;
    int frac1 = (f1 & fraction_mask) | hidden_bit;
    int frac2 = (f2 & fraction_mask) | hidden_bit;
    // Add exponents
    res_exp = exp1 + exp2 - bias; // Remove double bias
    // Multiply significands
    res_frac = frac1 * frac2; // 11 bit * 11 bit -> 22 bit!
    // Shift the 22-bit product right to fit into 10 bits
    if (highest_bit_pos(res_frac) == 21) {
        res_frac >>= 11;
        res_exp += 1;
    } else {
        res_frac >>= 10;
    }
    res_frac &= ~hidden_bit; // Remove hidden bit
    // Construct float
    return (res_exp << (bits - exponent_length - 1)) | res_frac;
}
By the way: I'm storing the floats in ints because I'll later try to port this code to some kind of assembler without floating-point operations.
The Question
Why does the code work for some values only? Did I forget some normalization or similar? Or does it work only by accident?
Disclaimer: I'm not a CompSci student, it's a leisure project ;)
Update #1
Thanks to the comment by Eric Postpischil, I noticed one problem with the code: the hidden_bit flag was off by one (it should be 1 << 10). With that change, I don't get decimal places any more, but some calculations are still off (e.g. 3•3=20). I assume it's the res_frac shift, as described in the answers.
Update #2
The second problem with the code was indeed the res_frac shifting. After update #1, I got wrong results for 22-bit products of frac1 * frac2. I've updated the code above with the corrected shift statement. Thanks to all for every comment and answer! :)
From a cursory look:
No attempt is made to determine the location of the high bit in the product. Two 11-bit numbers, each with its high bit set, may produce a 21- or 22-bit number. (Example with two-bit numbers: 10₂ • 10₂ is 100₂, three bits, but 11₂ • 11₂ is 1001₂, four bits.)
The result is truncated instead of rounded.
Signs are ignored.
Subnormal numbers are not handled, on input or output.
11 is hardcoded as a shift amount in one place. This is likely incorrect; the correct amount will depend on how the significand is handled for normalization and rounding.
In decoding, the exponent field is shifted right by fraction_length. In encoding, it is shifted left by bits - exponent_length - 1. To avoid bugs, the same expression should be used in both places.
From a more detailed look by chux:
res_frac = frac1 * frac2 fails if int is less than 23 bits (22 for the product and one for the sign).
This is more a suggestion for how to make it easier to get your code right, rather than analysis of what is wrong with the existing code.
There are a number of steps that are common to some or all of the floating point arithmetic operations. I suggest extracting each into a function that can be written with focus on one issue, and tested separately. Then when you come to write e.g. multiplication, you only have to deal with the specifics of that operation.
All the operations will be easier working with a structure that has the actual signed exponent, and the full significand in a wider unsigned integer field. If you were dealing with signed numbers, it would also have a boolean for the sign bit.
Here are some sample operations that could be separate functions, at least until you get it working:
unpack: Take a 16-bit float and extract the exponent and significand into a struct (a sketch of this one follows the list).
pack: Undo unpack - deal with dropping the hidden bit, applying the bias to the exponent, and combining them into a float.
normalize: Shift the significand and adjust the exponent to bring the most significant 1-bit to a specified bit position.
round: Apply your rounding rules to drop low significance bits. If you want to do IEEE 754 style round-to-nearest, you need a guard digit that is the most significant bit that will be dropped, and an additional bit indicating if there are any one bits of lower significance than the guard bit.
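For instance, unpack might look like this: a sketch for normal numbers only, using the struct layout suggested above (the names are mine):
#include <stdint.h>

typedef struct {
    int exponent;         /* actual (unbiased) exponent */
    uint32_t significand; /* full significand, hidden bit restored */
} UnpackedHalf;

static UnpackedHalf unpack_half(uint16_t h) {
    UnpackedHalf u;
    u.exponent = ((h >> 10) & 0x1F) - 15; /* remove the bias of 15 */
    u.significand = (h & 0x3FF) | 0x400;  /* restore the hidden bit */
    return u;
}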
One problem is that you are truncating instead of rounding:
res_frac >>= 11; // Shift 22bit int right to fit into 10 bit
You should first compute res_frac & 0x7ff, the part of the 22-bit result that your algorithm is about to discard, and compare it to 0x400. If it is below, truncate; if it is above, round away from zero; if it is exactly 0x400, round to the even alternative.
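In code, that check might look like this (a sketch; the helper name is mine):
#include <stdint.h>

/* Round to nearest, ties to even, dropping the low 11 bits of a
   22-bit product. The caller must renormalize if the increment
   carries out of the kept bits. */
static uint32_t round_half_even_11(uint32_t product)
{
    uint32_t kept = product >> 11;
    uint32_t dropped = product & 0x7FF;
    if (dropped > 0x400 || (dropped == 0x400 && (kept & 1)))
        kept++;
    return kept;
}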

C/C++ - convert 32-bit floating-point value to 24-bit normalized fixed-point value?

Please let me know how to convert a 32-bit float to a 24-bit normalized value. What I tried is (units * (1 << 24)), but it doesn't seem to be working. Please help me with this. Thanks.
Of course it is not working: (1 << 24) is too large, by exactly 1, to fit in a 24-bit number whose range starts at 0. To put this another way, 1 << 24 is actually a 25-bit number.
Consider (units * ((1 << 24) - 1)) instead.
(1 << 24) - 1 is the largest value an unsigned 24-bit integer starting at 0 can represent.
Now a floating-point number in the range [0.0, 1.0] will fit into an unsigned 24-bit fixed-point integer without overflow.
A normalized fixed-point representation means that the maximum representable value, not strictly reachable, is 1; so 1 is represented by 1 << 24. See also Q formats.
For example, Q24 means 24 fractional bits, 0 integer bits and no sign. If a 32-bit unsigned integer is used to hold a Q24 value, the remaining 8 bits can be used to ease calculations.
Before translating from floating-point to fixed-point representation, you always have to define the range of your original value. Example: the floating-point value is a physical quantity in the range [0, 5), so 0 is included and 5 is not, and your fixed-point value is normalized to 5.
#include <stdint.h>
#include <stdio.h>

float length_flp = 4.5; // Units: meters. Range: [0,5)
float time_flp = 1.2;   // Seconds. Range: [0,2)
float speed_flp = 1.2;  // m/sec. Range: [0,2.5)
uint32_t length_fixp;   // Meters. Representation: Q24, normalized to MAX_LENGTH=5
uint32_t time_fixp;     // Seconds. Representation: Q24, normalized to MAX_TIME=2
uint32_t speed_fixp;    // m/sec. Representation: Q24, normalized to MAX_SPEED=(MAX_LENGTH/MAX_TIME)=2.5

int main(void)
{
    printf("length_flp=%f m\n", length_flp);
    printf("time_flp=%f sec\n", time_flp);
    printf("speed_flp=%f m/sec\n\n", length_flp / time_flp);
    length_fixp = (length_flp / 5) * (1 << 24);
    time_fixp = (time_flp / 2) * (1 << 24);
    speed_fixp = (length_fixp / (time_fixp >> 12)) << 12;
    printf("length_fixp = %u m [fixed-point]\n", length_fixp);
    printf("time_fixp = %u sec [fixed-point]\n", time_fixp);
    printf("speed_fixp = %u m/sec [fixed-point] = %f m/sec\n", speed_fixp,
           (float)speed_fixp / (1 << 24) * 2.5);
    return 0;
}
The advantage of the normalized representation is that operations between normalized values return a normalized value.
By the way, you have to define a generic function for each operation (division, multiplication, etc.) to prevent overflow and preserve precision.
As you can see I've used a small trick to calculate speed_fixp.
The output is
length_flp=4.500000 m
time_flp=1.200000 sec
speed_flp=3.750000 m/sec
length_fixp = 15099494 m [fixed-point]
time_fixp = 10066330 sec [fixed-point]
speed_fixp = 25169920 m/sec [fixed-point] = 3.750610 m/sec
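As an example of the generic per-operation helpers mentioned above, a Q24 multiply and divide might widen to 64 bits internally (a sketch; the names are mine):
#include <stdint.h>

/* Q24 * Q24 -> Q24, widening to 64 bits to avoid overflow */
static uint32_t q24_mul(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 24);
}

/* Q24 / Q24 -> Q24, pre-shifting the dividend to keep precision */
static uint32_t q24_div(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a << 24) / b);
}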

Picking good first estimates for Goldschmidt division

I'm calculating fixed-point reciprocals in Q22.10 with Goldschmidt division for use in my software rasterizer on ARM.
This is done by just setting the numerator to 1, i.e. the numerator becomes the scalar on the first iteration. To be honest, I'm kind of following the Wikipedia algorithm blindly here. The article says that if the denominator is scaled into the half-open range (0.5, 1.0], a good first estimate can be based on the denominator alone: let F be the estimated scalar and D be the denominator; then F = 2 - D.
But when doing this, I lose a lot of precision. Say I want to find the reciprocal of 512.00002f. To scale the number down, I lose 10 bits of precision in the fraction part, which are shifted out. So, my questions are:
Is there a way to pick a better estimate which does not require normalization? Why or why not? A mathematical proof of why this is or is not possible would be great.
Also, is it possible to pre-calculate the first estimate so the series converges faster? Right now, it converges after the 4th iteration on average. On ARM this is about ~50 cycles worst case, and that's not taking emulation of clz/bsr into account, nor memory lookups. If it's possible, I'd like to know whether doing so increases the error, and by how much.
Here is my test case. Note: the software implementation of clz is from my post here; you can replace it with an intrinsic if you want. clz should return the number of leading zeros, and 32 for the value 0.
#include <stdio.h>
#include <stdint.h>

const unsigned int BASE = 22;

int clz(unsigned int x); /* leading-zero count, 32 for 0; see the note above */

static unsigned int divfp(unsigned int val, int* iter)
{
    /* Numerator, denominator, estimated scalar and previous denominator */
    unsigned long long N, D, F, DPREV;
    int bitpos;
    *iter = 1;
    D = val;
    /* Get the shift amount: + is right-shift, - is left-shift. */
    bitpos = 31 - clz(val) - BASE;
    /* Normalize into the half-open range (0.5, 1.0] */
    if (0 < bitpos)
        D >>= bitpos;
    else
        D <<= (-bitpos);
    /* (FNi / FDi) == (FN(i+1) / FD(i+1)) */
    /* F = 2 - D */
    F = (2ULL << BASE) - D;
    /* N = F for the first iteration, because the numerator is simply 1,
       so don't waste a 64-bit UMULL on a multiply by 1. */
    N = F;
    D = ((unsigned long long)D * F) >> BASE;
    while (1) {
        DPREV = D;
        F = (2ULL << BASE) - D;
        D = ((unsigned long long)D * F) >> BASE;
        /* Bail when we get the same value for two denominators in a row.
           This means that the error is too small to make any further progress. */
        if (D == DPREV)
            break;
        N = ((unsigned long long)N * F) >> BASE;
        *iter = *iter + 1;
    }
    if (0 < bitpos)
        N >>= bitpos;
    else
        N <<= (-bitpos);
    return N;
}

int main(int argc, char* argv[])
{
    double fv, fa;
    int iter;
    unsigned int D, result;
    sscanf(argv[1], "%lf", &fv);
    D = fv * (double)(1 << BASE);
    result = divfp(D, &iter);
    fa = (double)result / (double)(1UL << BASE);
    printf("Value: %8.8lf 1/value: %8.8lf FP value: 0x%.8X\n", fv, fa, result);
    printf("iteration: %d\n", iter);
    return 0;
}
I could not resist spending an hour on your problem...
This algorithm is described in section 5.5.2 of "Arithmetique des ordinateurs" by Jean-Michel Muller (in French). It is actually a special case of Newton iterations with 1 as the starting point. The book gives a simple formulation of the algorithm to compute N/D, with D normalized in the range [1/2, 1[:
e = 1 - D
Q = N
repeat K times:
    Q = Q * (1 + e)
    e = e * e
The number of correct bits doubles at each iteration. In the case of 32 bits, 4 iterations will be enough. You can also iterate until e becomes too small to modify Q.
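You can check the convergence quickly with plain doubles (a throwaway sketch, not the fixed-point version):
#include <stdio.h>

int main(void)
{
    double N = 1.0, D = 0.75; /* compute 1/0.75, D in [1/2,1[ */
    double e = 1.0 - D, Q = N;
    for (int k = 0; k < 5; k++) {
        Q *= 1.0 + e; /* the number of correct bits doubles each pass */
        e *= e;
    }
    printf("Q=%.17g exact=%.17g\n", Q, N / D);
    return 0;
}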
Normalization is used because it provides the max number of significant bits in the result. It is also easier to compute the error and number of iterations needed when the inputs are in a known range.
Once your input value is normalized, you don't need to bother with the value of BASE until you have the inverse. You simply have a 32-bit number X normalized in range 0x80000000 to 0xFFFFFFFF, and compute an approximation of Y=2^64/X (Y is at most 2^33).
This simplified algorithm may be implemented for your Q22.10 representation as follows:
// Fixed point inversion
// EB Apr 2010
#include <math.h>
#include <stdio.h>

// Number X is represented by integer I: X = I/2^BASE.
// We have (32-BASE) bits in the integral part, and BASE bits in the fractional part
#define BASE 22
typedef unsigned int uint32;
typedef unsigned long long int uint64;

// Convert FP to/from double (debug)
double toDouble(uint32 fp) { return fp / (double)(1 << BASE); }
uint32 toFP(double x) { return (int)floor(0.5 + x * (1 << BASE)); }

// Return inverse of FP
uint32 inverse(uint32 fp)
{
    if (fp == 0) return (uint32)-1; // invalid

    // Shift FP to have the most significant bit set
    int shl = 0;     // normalization shift
    uint32 nfp = fp; // normalized FP
    while ((nfp & 0x80000000) == 0) { nfp <<= 1; shl++; } // use "clz" instead

    uint64 q = 0x100000000ULL;               // 2^32
    uint64 e = 0x100000000ULL - (uint64)nfp; // 2^32-NFP
    int i;
    for (i = 0; i < 4; i++) // iterate
    {
        // Both multiplications are actually
        // 32x32 bits truncated to the 32 high bits
        q += (q * e) >> (uint64)32;
        e = (e * e) >> (uint64)32;
        printf("Q=0x%llx E=0x%llx\n", q, e);
    }
    // Here, (Q/2^32) is the inverse of (NFP/2^32).
    // We have 2^31 <= NFP < 2^32 and 2^32 < Q <= 2^33
    return (uint32)(q >> (64 - 2*BASE - shl));
}

int main()
{
    double x = 1.234567;
    uint32 xx = toFP(x);
    uint32 yy = inverse(xx);
    double y = toDouble(yy);
    printf("X=%f Y=%f X*Y=%f\n", x, y, x * y);
    printf("XX=0x%08x YY=0x%08x XX*YY=0x%016llx\n", xx, yy, (uint64)xx * (uint64)yy);
}
As noted in the code, the multiplications are not full 32x32->64 bits. E will become smaller and smaller and fits initially on 32 bits. Q will always be on 34 bits. We take only the high 32 bits of the products.
The derivation of 64-2*BASE-shl is left as an exercise for the reader :-). If it becomes 0 or negative, the result is not representable (the input value is too small).
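The truncated products in the loop correspond to a "high 32 bits of a 32x32 multiply" primitive, which in C is simply (a sketch):
#include <stdint.h>

/* High 32 bits of a 32x32-bit product (a single UMULL on ARM) */
static uint32_t mulhi32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}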
EDIT: As a follow-up to my comment, here is a second version with an implicit 32nd bit on Q. Both E and Q now fit in 32 bits:
uint32 inverse2(uint32 fp)
{
    if (fp == 0) return (uint32)-1; // invalid

    // Shift FP to have the most significant bit set
    int shl = 0;     // normalization shift for FP
    uint32 nfp = fp; // normalized FP
    while ((nfp & 0x80000000) == 0) { nfp <<= 1; shl++; } // use "clz" instead
    int shr = 64 - 2*BASE - shl;     // normalization shift for Q
    if (shr <= 0) return (uint32)-1; // overflow

    uint64 e = 1 + (0xFFFFFFFF ^ nfp); // 2^32-NFP, max value is 2^31
    uint64 q = e;                      // 2^32 implicit bit, and implicit first iteration
    int i;
    for (i = 0; i < 3; i++) // iterate
    {
        e = (e * e) >> (uint64)32;
        q += e + ((q * e) >> (uint64)32);
    }
    return (uint32)(q >> shr) + (1 << (32 - shr)); // insert implicit bit
}
A couple of ideas for you, though none that solve your problem directly as stated.
Why this algorithm for division? Most divides I've seen on ARM use some variant of
adcs hi, den, hi, lsl #1
subcc hi, hi, den
adcs lo, lo, lo
repeated n times (once per bit), with a binary search off the clz to determine where to start. That's pretty dang fast.
If precision is a big problem, you are not limited to 32/64 bits for your fixed-point representation. It'll be a bit slower, but you can use add/adc or sub/sbc to move values across registers. mul/mla are also designed for this kind of work.
Again, not direct answers, but possibly a few ideas to move this forward. Seeing the actual ARM code would probably help me a bit as well.
Mads, you are not losing any precision at all. When you divide 512.00002f by 2^10, you merely decrease the exponent of your floating-point number by 10. The mantissa remains the same. Of course, unless the exponent hits its minimum value, but that shouldn't happen since you're scaling into (0.5, 1].
EDIT: OK, so you're using a fixed binary point. In that case you should allow a different representation of the denominator in your algorithm. The value of D stays in (0.5, 1] not only at the beginning but throughout the whole calculation (it's easy to prove that x * (2 - x) < 1 for x < 1). So you should represent the denominator with the binary point at base = 32. This way you will have 32 bits of precision all the time.
EDIT: To implement this you'll have to change the following lines of your code:
//bitpos = 31 - clz(val) - BASE;
bitpos = 31 - clz(val) - 31;
...
//F = (2ULL<<BASE) - D;
//N = F;
//D = ((unsigned long long)D*F)>>BASE;
F = -D;
N = F >> (31 - BASE);
D = ((unsigned long long)D*F)>>31;
...
//F = (2<<(BASE)) - D;
//D = ((unsigned long long)D*F)>>BASE;
F = -D;
D = ((unsigned long long)D*F)>>31;
...
//N = ((unsigned long long)N*F)>>BASE;
N = ((unsigned long long)N*F)>>31;
Also in the end you'll have to shift N not by bitpos but some different value which I'm too lazy to figure out right now :).
