Custom float addition, implementing a math expression in C

I'm implementing a new kind of float, "NewFloat", in C. It uses 32 bits and has no sign bit (only positive numbers), so all 32 bits are used by the exponent and mantissa.
In my example, I have 6 bits for the exponent (EXPBITS) and 26 for the mantissa (MANBITS).
There is also an offset used for representing negative exponents, which is 2^(EXPBITS-1) - 1.
Given a NewFloat nf1, the translation to a real number is done like this:
nf1 = 2^(exponent - offset) * (1 + mantissa/2^MANBITS).
Now, given two NewFloats nf1 and nf2, each with its own fields (exp1, man1 and exp2, man2, sharing the same offset),
and assuming that nf1 > nf2, I can calculate the exponent and mantissa of the sum of nf1 and nf2, and this is done like this: link
To spare your time, I found that:
Exponent of the sum is: exp1
Mantissa of the sum is: man1 + 2^(exp2 - exp1 + MANBITS) + 2^(exp2 - exp1) * man2
To simplify the code, I split the work and calculate each component of the mantissa separately:
x = 2^(exp2 - exp1 + MANBITS)
y = 2^(exp2 - exp1) * man2
I'm fairly sure I'm not implementing the mantissa part correctly:
unsigned long long x = (1 << (exp2 - exp1 + MANBITS));
unsigned long long y = ((1 << exp2) >> exp1) * man2;
unsigned long long tempMan = man1;
tempMan += x + y;
unsigned int exp = exp1; // CAN USE DIRECTLY EXP1.
unsigned int man = (unsigned int)tempMan;
The sum is represented like this:
sum = 2^(exp1 - offset) * (1 + (man1 + x + y)/2^MANBITS).
The last thing I have to handle is an overflow of the sum's mantissa.
In that case, I should add 1 to the exponent and divide the whole (1 + (man1 + x + y)/2^MANBITS) expression by 2.
Given that I only need to represent the numerator in bits, how do I do that after the division?
Is there any problem in my implementation? I have a feeling there is.
If you have a better way of doing this, I would be really happy to hear about it.
Please don't ask me why I'm doing this... it's an exercise I've been trying to solve for more than 10 hours.

The code is doing signed int shifts where unsigned long long is certainly desired:
// unsigned long long x = (1 << (exp2 - exp1 + MANBITS));
unsigned long long x = (1LLU << (exp2 - exp1 + MANBITS));
Notes:
Suggest more meaningful variable names like x_mantissa.
Rounding not implemented. Rounding can cause a need for increase in exponent.
Overflow not detected/implemented.
Sub-normals not implemented. Should NewFloat not use them, note that a - b --> 0 does not imply a == b.

Related

Can we use bitwise operators for conversion from decimal to bases other than 4, 8, 16 and so on? In C

I understand how to do that for 4, 8, 16 and so on.
But for conversion from decimal to base 3, or base 12, for example, I don't know.
Is it possible?
I assume in your question you meant conversion from binary to other bases.
All arithmetic operations can be reduced to bitwise operations and shifts. That's what the CPU is doing internally in hardware too.
a + b ==> (a ^ b) + ((a & b) << 1)
The right side still has a + in it, so you have to apply the same transformation again and again until you have a left shift larger than the width of your integer type. Or do it bit by bit in a loop.
With two's-complement:
-a ==> ~a + 1
And if you have + and negate you have -. * is just a bunch of shifts and adds. / is a bunch of shifts and subtract. Just consider how you did multiplication and long division in school and bring that down to base 2.
For most bases doing the math with bitwise operations is insane, especially if you derive your code from the basic operations above. The CPU's add, sub and mul operations are just fine and way faster. But if you want to implement printf() for a freestanding environment (like a kernel) you might need to do a division of uint64_t / 10 that your CPU can't do in hardware. The compiler (gcc, clang) also isn't smart enough to do this well and falls back to a general iterative uint64_t / uint64_t long-division algorithm.
But a division can be done by multiplying by the inverse shifted up a few bits and then shifting the result back down. This method works out really well for a division by 10 and you get nicely optimized code:
uint64_t divu10(uint64_t n) {
    uint64_t q, r;
    q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q + (q >> 32);
    q = q >> 3;
    r = n - (((q << 2) + q) << 1);
    return q + (r > 9);
}
That is shorter, and faster by an order of magnitude or two than the general uint64_t / uint64_t long-division function that gcc / clang will call when you write x / 10.
Note: (((q << 2) + q) << 1) is q * 10, another bitwise construction that is faster than q * 10 when the CPU doesn't have native 64-bit integers.

Divide a signed integer by a power of 2

I'm working on a way to divide a signed integer by a power of 2 using only binary operators (<< >> + ^ ~ & | !), with the result rounded toward 0. I came across a question on Stack Overflow about this problem; however, I cannot understand why its solution works. Here it is:
int divideByPowerOf2(int x, int n)
{
    return (x + ((x >> 31) & ((1 << n) + ~0))) >> n;
}
I understand the x >> 31 part (only add the next part if x is negative, because if it's positive x will be automatically round toward 0). But what's bothering me is the (1 << n) + ~0 part. How can it work?
Assuming two's complement, just bit-shifting the dividend is equivalent to a certain kind of division: not the conventional division, where we round the dividend to the next multiple of the divisor toward zero, but another kind where we round the dividend toward negative infinity. I rediscovered that in Smalltalk, see http://smallissimo.blogspot.fr/2015/03/is-bitshift-equivalent-to-division-in.html.
For example, let's divide -126 by 8. Traditionally, we would write
-126 = -15 * 8 - 6
But if we round toward negative infinity, we get a positive remainder and write it:
-126 = -16 * 8 + 2
The bit-shifting is performing the second operation, in terms of bit patterns (assuming an 8-bit int for the sake of brevity):
1000|0010 >> 3 = 1111|0000
1000|0010 = 1111|0000 * 0000|1000 + 0000|0010
So what if we want the traditional division with quotient rounded toward zero and remainder of same sign as dividend? Simple, we just have to add 1 to the quotient - if and only if the dividend is negative and the division is inexact.
You saw that x>>31 corresponds to first condition, dividend is negative, assuming int has 32 bits.
The second term corresponds to the second condition, if division is inexact.
See how -1, -2, -4, ... are encoded in two's complement: 1111|1111, 1111|1110, 1111|1100. So the negation of the nth power of two has n trailing zeros.
When the dividend has n trailing zeros and we divide by 2^n, then no need to add 1 to final quotient. In any other case, we need to add 1.
What ((1 << n) + ~0) is doing is creating a mask with n trailing ones.
The n last bits don't really matter, because we are going to shift to the right and just throw them away. So, if the division is exact, the n trailing bits of dividend are zero, and we just add n 1s that will be skipped. On the contrary, if the division is inexact, then one or more of the n trailing bits of the dividend is 1, and we are sure to cause a carry to the n+1 bit position: that's how we add 1 to the quotient (we add 2^n to the dividend). Does that explain it a bit more?
This is "write-only code": instead of trying to understand the code, try to create it by yourself.
For example, let's divide a number by 8 (shift right by 3).
If the number is negative, the normal right-shift rounds in the wrong direction. Let's "fix" it by adding a number:
int divideBy8(int x)
{
    if (x >= 0)
        return x >> 3;
    else
        return (x + whatever) >> 3;
}
Here you can come up with a mathematical formula for whatever, or do some trial and error. Anyway, here whatever = 7:
int divideBy8(int x)
{
    if (x >= 0)
        return x >> 3;
    else
        return (x + 7) >> 3;
}
How to unify the two cases? You need to make an expression that looks like this:
(x + stuff) >> 3
where stuff is 7 for negative x, and 0 for positive x. The trick here is using x >> 31, which is a 32-bit number whose bits are equal to the sign-bit of x: all 0 or all 1. So stuff is
(x >> 31) & 7
Combining all these, and replacing 8 and 7 by the more general power of 2, you get the code you asked about.
Note: in the description above, I assume that int represents a 32-bit hardware register, and hardware uses two's complement representation to do right shift.
OP's reference is to C# code, and there are enough subtle differences to make it bad code in C, as this post is tagged.
int is not necessarily 32 bits, so using a magic number of 32 does not make for a robust solution.
In particular, (1 << n) + ~0 results in implementation-defined behavior when n causes a bit to be shifted into the sign position. Not good coding.
Restricting code to only the "binary" operators << >> + ^ ~ & | ! encourages a coder to assume things about int which are neither portable nor compliant with the C spec. So OP's posted code does not "work" in general, although it may work in many common implementations.
OP's code fails when int is not two's complement, does not use the range [-2147483648 .. 2147483647], or when 1 << n invokes implementation behavior that is not as expected.
// weak code
int divideByPowerOf2(int x, int n) {
    return (x + ((x >> 31) & ((1 << n) + ~0))) >> n;
}
A simple alternative, assuming long long exceeds the range of int, follows. I doubt this meets some corner of OP's goals, but OP's stated goals encourage non-robust coding.
int divideByPowerOf2(int x, int n) {
    long long ill = x;
    if (x < 0) ill = -ill;
    while (n--) ill >>= 1;
    if (x < 0) ill = -ill;
    return (int) ill;
}

16bit Float Multiplication in C

I'm working on a small project where I need float multiplication with 16-bit floats (half precision). Unfortunately, I'm facing some problems with the algorithm:
Example Output
1 * 5 = 5
2 * 5 = 10
3 * 5 = 14.5
4 * 5 = 20
5 * 5 = 24.5
100 * 4 = 100
100 * 5 = 482
The Source Code
const int bits = 16;
const int exponent_length = 5;
const int fraction_length = 10;
const int bias = pow(2, exponent_length - 1) - 1;
const int exponent_mask = ((1 << 5) - 1) << fraction_length;
const int fraction_mask = (1 << fraction_length) - 1;
const int hidden_bit = (1 << 10); // Was 1 << 11 before update 1

int float_mul(int f1, int f2) {
    int res_exp = 0;
    int res_frac = 0;
    int result = 0;

    int exp1 = (f1 & exponent_mask) >> fraction_length;
    int exp2 = (f2 & exponent_mask) >> fraction_length;
    int frac1 = (f1 & fraction_mask) | hidden_bit;
    int frac2 = (f2 & fraction_mask) | hidden_bit;

    // Add exponents
    res_exp = exp1 + exp2 - bias; // Remove double bias

    // Multiply significands
    res_frac = frac1 * frac2; // 11 bit * 11 bit -> 22 bit!

    // Shift the 22-bit product right to fit into 10 bits
    if (highest_bit_pos(res_frac) == 21) {
        res_frac >>= 11;
        res_exp += 1;
    } else {
        res_frac >>= 10;
    }
    res_frac &= ~hidden_bit; // Remove hidden bit

    // Construct float
    return (res_exp << (bits - exponent_length - 1)) | res_frac;
}
By the way: I'm storing the floats in ints because I'll later try to port this code to some kind of assembler without floating-point operations.
The Question
Why does the code work for some values only? Did I forget some normalization or similar? Or does it work only by accident?
Disclaimer: I'm not a CompSci student, it's a leisure project ;)
Update #1
Thanks to the comment by Eric Postpischil I noticed one problem with the code: the hidden_bit flag was off by one (it should be 1 << 10). With that change, I don't get decimal places any more, but some calculations are still off (e.g. 3 * 3 = 20). I assume it's the res_frac shift, as described in the answers.
Update #2
The second problem with the code was indeed the res_frac shifting. After update #1 I got wrong results whenever frac1 * frac2 produced a 22-bit result. I've updated the code above with the corrected shift statement. Thanks to all for every comment and answer! :)
From a cursory look:
No attempt is made to determine the location of the high bit in the product. Two 11-bit numbers, each with its high bit set, may produce a 21- or 22-bit product. (Example with two-bit numbers: binary 10 * 10 is 100, three bits, but binary 11 * 11 is 1001, four bits.)
The result is truncated instead of rounded.
Signs are ignored.
Subnormal numbers are not handled, on input or output.
11 is hardcoded as a shift amount in one place. This is likely incorrect; the correct amount will depend on how the significand is handled for normalization and rounding.
In decoding, the exponent field is shifted right by fraction_length. In encoding, it is shifted left by bits - exponent_length - 1. To avoid bugs, the same expression should be used in both places.
From a more detailed look by chux:
res_frac = frac1 * frac2 fails if int is less than 23 bits (22 for the product and one for the sign).
This is more a suggestion for how to make it easier to get your code right, rather than analysis of what is wrong with the existing code.
There are a number of steps that are common to some or all of the floating point arithmetic operations. I suggest extracting each into a function that can be written with focus on one issue, and tested separately. Then when you come to write e.g. multiplication, you only have to deal with the specifics of that operation.
All the operations will be easier working with a structure that has the actual signed exponent, and the full significand in a wider unsigned integer field. If you were dealing with signed numbers, it would also have a boolean for the sign bit.
Here are some sample operations that could be separate functions, at least until you get it working:
unpack: Take a 16 bit float and extract the exponent and significand into a struct.
pack: Undo unpack - deal with dropping the hidden bit, applying the bias to the exponent, and combining them into a float.
normalize: Shift the significand and adjust the exponent to bring the most significant 1-bit to a specified bit position.
round: Apply your rounding rules to drop low significance bits. If you want to do IEEE 754 style round-to-nearest, you need a guard digit that is the most significant bit that will be dropped, and an additional bit indicating if there are any one bits of lower significance than the guard bit.
One problem is that you are truncating instead of rounding:
res_frac >>= 11; // Shift 22bit int right to fit into 10 bit
You should compute res_frac & 0x7ff first, the part of the 22-bit result that your algorithm is about to discard, and compare it to 0x400. If it is below, truncate. If it is above, round away from zero. If it is equal to 0x400, round to the even alternative.

Most efficient way of splitting a number into whole and decimal parts

I am trying to split a double into its whole and fraction parts. My code works, but it is much too slow given that the microcontroller I am using does not have a dedicated multiply instruction in assembly. For instance,
temp = (int)(tacc - temp); // This line takes about 500us
However, if I do this,
temp = (int)(100*(tacc-temp)); // This takes about 4ms
I could speed up the microcontroller, but since I'm trying to stay low power, I am curious if it is possible to do this faster. This is the little piece I'm actually interested in optimizing:
txBuffer[5] = ((int)tacc_y); // Whole part
txBuffer[6] = (int)(100*(tacc_y-txBuffer[5])); // 2 digits of fraction
I remember there is a fast way of multiplying by 10 using shifts, such that:
a * 10 = ((a << 2) + a) << 1
I could probably nest this and get multiplication by 100. Is there any other way?
I believe the correct answer, which may not be the fastest, is this:
double whole = trunc(tacc_y);
double fract = tacc_y - whole;

// first, extract (some of) the data into an int
fract = fract * (1 << 11); // should be just an exponent change
int ifract = (int)trunc(fract);

// next, decimalize it (I think)
ifract = ifract * 1000; // Assuming integer multiply available
ifract = ifract >> 11;

txBuffer[5] = (int)whole;
txBuffer[6] = ifract;
If integer multiplication is not OK, then your shift trick should now work.
If the floating-point multiply is too stupid to just edit the exponent quickly, then you can do it manually by bit twiddling, but I wouldn't recommend it as a first option. In any case, once you've got as far as bit-twiddling FP numbers you might as well just extract the mantissa, or even do the whole operation manually.
I assume you are working with doubles. You could try to take a double apart bitwise. (The usual pointer-cast trick violates strict aliasing; memcpy is the well-defined way to get at the bits.)

#include <string.h>
#include <inttypes.h>

double input = 10.64;
int64_t bits;
memcpy(&bits, &input, sizeof bits);   /* well-defined type punning */
int sign = (int)(bits >> 63);         /* unused below; negative inputs are not handled */
int exponent = (int)((bits >> 52) & 0x7FF);
int64_t fraction = bits & 0xFFFFFFFFFFFFF;
fraction |= 0x10000000000000;         /* restore the implicit leading 1 bit */
int64_t whole = fraction >> (52 + 1023 - exponent);
int64_t digits = ((fraction - (whole << (52 + 1023 - exponent))) * 100) >> (52 + 1023 - exponent);
printf("%lf, %" PRId64 ".%" PRId64 "\n", input, whole, digits);

Multiply with negative integer just by shifting

I'm trying to find a way to multiply an integer value with negative value just with bit shifting.
Usually I do this by shifting with the power of 2 which is closest to my factor and just adding / subtracting the rest, e.g. x * 7 = ((x << 3) - x)
Let's say I'd want to calculate x * -112. The only way I can imagine is -((x << 7) - (x << 4)), so calculating x * 112 and negating it afterwards.
Is there a "prettier" way to do this?
Get the compiler to do it, then check the produced assembly.
The negative of a positive number in 2's complement is done by negating all the bits and then adding 1 to the result. For example, to get -4 from 4 you would do:
4 = 000...0100 in binary. ~4 = 111...1011. -4 = 111...1100.
Same to reverse the sign.
So you could do this:
(~((x << 7) - (x << 4))) + 1.
Not necessarily prettier, but faster if we consider bitwise operations faster than arithmetic operations (especially multiplication) and ignore compiler optimizations.
Not that I'm saying you should do this, because you shouldn't. It's good to know about it though.
Computers internally represent negative integers in two's-complement form. One of the nice properties of two's-complement arithmetic is that multiplying negative numbers works just like multiplying positive numbers. Hence, find the two's complement and use your normal approach.
Here's a simple example. For ease of exposition, I'm going to using 8-bit integers and multiply by -15.
15 in hex is 0x0f. The two's complement of 0x0f is 0xf1.
Since these are 8-bit integers, all arithmetic is mod 0x100. In particular, note that 0x100 * anything = 0.
x * 0xf1
= x * (0x100 - 0x10 + 0x01)
= -(x * 0x10) + x
= -(x << 4) + x
