Multiplying a single-precision float by 2 (C)

This is a homework question. I already found a lot of code online, including some on Stack Overflow, but I just want the concept, not the code. I want to implement it myself. So the function I want to implement is:
float_twice - Return bit-level equivalent of expression 2*f for floating point argument f.
Both the argument and result are passed as unsigned int's, but they are to be interpreted as the bit-level representation of single-precision floating point values.
I want to know how to do this. I know the floating-point representation, and I read the wiki page on how to multiply two floats, but I didn't understand it. I just want to know the concept/algorithm for it.
Edit:
Thanks everyone. Based on your suggestions I wrote the following code:
unsigned float_twice(unsigned uf) {
    unsigned s = uf & 0x80000000;   // sign bit
    unsigned e = (uf >> 23) & 0xFF; // 8-bit exponent field
    unsigned f = uf & 0x7FFFFF;     // 23-bit fraction field
    // if the exponent is all 1's then it's a special value (NaN/infinity)
    if (e == 0xFF) {
        return uf;
    } else if (e > 0) { // exponent is neither all 0's nor all 1's:
        // normalized form, add 1 to the exponent
        return uf + (1 << 23);
    } else { // exponent is all 0's: denormalized form,
        // doubling means shifting the fraction left by one
        return s | (f << 1);
    } // end of if
} // end of function

You could do something like this (although, strictly speaking, it violates the strict-aliasing rule, as discussed further down this page):
#include <stdio.h>

unsigned int func(unsigned int n)
{
    float x = *(float*)&n;
    x *= 2;
    return *(unsigned int*)&x;
}

void test(float x)
{
    unsigned int n = *(unsigned int*)&x;
    printf("%08X\n", func(n));
}
In any case, you'll have to assert that the size of float is equal to the size of int on your platform.
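If you would rather sidestep the aliasing question entirely, a memcpy-based variant of the same idea works too. This is a sketch of my own (the name func_memcpy is hypothetical), assuming only that float and unsigned int are both 32 bits wide:
#include <assert.h>
#include <string.h>

unsigned int func_memcpy(unsigned int n)
{
    float x;
    assert(sizeof x == sizeof n);  /* the size check mentioned above */
    memcpy(&x, &n, sizeof x);      /* reinterpret the bits as a float */
    x *= 2;
    memcpy(&n, &x, sizeof n);      /* and copy the doubled value back */
    return n;
}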
If you just want to take an unsigned int operand and perform on it the equivalent operation of multiplying a float by 2, then you can simply add 1 to the exponent part of it (located in bits 23-30), as long as the value is a normalized, finite number:
unsigned int func(unsigned int n)
{
    return n + (1 << 23);
}

Related

Getting the mantissa (of a float) of either an unsigned int or a float (C)

So, I am trying to program a function which prints a given float number (n) in its (mantissa * 2^exponent) format. I was able to get the sign and the exponent, but not the mantissa (whichever the number is, the mantissa is always equal to 0.000000). What I have is:
unsigned int num = *(unsigned*)&n;
unsigned int m = num & 0x007fffff;
mantissa = *(float*)&m;
Any ideas of what the problem might be?
The C library includes a function that does this exact task, frexp:
int expon;
float mant = frexpf(n, &expon);
printf("%g = %g * 2^%d\n", n, mant, expon);
Another way to do it is with log2f and exp2f:
if (n == 0) {
    mant = 0;
    expon = 0;
} else {
    expon = floorf(log2f(fabsf(n)));
    mant = n * exp2f(-expon);
}
These two techniques are likely to give different results for the same input. For instance, on my computer the frexpf technique describes 4 as 0.5 × 2^3 but the log2f technique describes 4 as 1 × 2^2. Both are correct, mathematically speaking. Also, frexp will give you the exact bits of the mantissa, whereas log2f and exp2f will probably round off the last bit or two.
You should know that *(unsigned *)&n and *(float *)&m violate the rule against "type punning" and have undefined behavior. If you want to get the integer with the same bit representation as a float, or vice versa, use a union:
union { uint32_t i; float f; } u;
u.f = n;
num = u.i;
(Note: This use of unions is well-defined in C since roughly 2003, but, due to the C++ committee's long-standing habit of not paying sufficient attention to changes going into C, it is not officially well-defined in C++.)
You should also know IEEE floating-point numbers use "biased" exponents. When you initialize a float variable's mantissa field but leave its exponent field at zero, that gives you the representation of a number with a large negative exponent: in other words, a number so small that printf("%f", n) will print it as zero. Whenever printf("%f", variable) prints zero, change %f to %g or %a and rerun the program before assuming that variable actually is zero.
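As a quick illustration (my own example, assuming 32-bit IEEE floats and a C99 printf), a value built from mantissa bits alone prints as zero with %f but not with %g or %a:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* exponent field = 0, mantissa field = all ones: a subnormal near 1.18e-38 */
    union { uint32_t i; float f; } u;
    u.i = 0x007fffff;
    printf("%f\n", u.f);  /* prints 0.000000 */
    printf("%g\n", u.f);  /* prints 1.17549e-38 */
    printf("%a\n", u.f);  /* prints 0x1.fffffcp-127 */
    return 0;
}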
You are stripping off the bits of the exponent, leaving 0. An exponent of 0 is special, it means the number is denormalized and is quite small, at the very bottom of the range of representable numbers. I think you'd find if you looked closely that your result isn't quite exactly zero, just so small that you have trouble telling the difference.
To get a reasonable number for the mantissa, you need to put an appropriate exponent back in. If you want a mantissa in the range of 1.0 to 2.0, you need an exponent of 0, but adding the bias means you really need an exponent of 127.
unsigned int m = (num & 0x007fffff) | (127 << 23);
mantissa = *(float*)&m;
If you'd rather have a fully integer mantissa you need an exponent of 23, biased it becomes 150.
unsigned int m = (num & 0x007fffff) | ((23+127) << 23);
mantissa = *(float*)&m;
In addition to zwol's remarks: if you want to do it yourself you have to acquire some knowledge about the innards of an IEEE-754 float. Once you have done so you can write something like
#include <stdlib.h>
#include <stdio.h>
#include <math.h> // for testing only

typedef union {
    float value;
    unsigned int bits; // assuming 32 bit large ints (better: uint32_t)
} ieee_754_float;

// clang -g3 -O3 -W -Wall -Wextra -Wpedantic -Weverything -std=c11 -o testthewest testthewest.c -lm
int main(int argc, char **argv)
{
    unsigned int m, num;
    int exp; // the exponent can be negative
    float n, mantissa;
    ieee_754_float uf;

    // neither checks nor balances included!
    if (argc == 2) {
        n = atof(argv[1]);
    } else {
        exit(EXIT_FAILURE);
    }
    uf.value = n;
    num = uf.bits;
    m = num & 0x807fffff;    // keep sign and mantissa, clear the exponent
    num = num & 0x7fffffff;  // full number without sign bit
    exp = (num >> 23) - 126; // extract exponent and subtract the bias (126, so the mantissa below lands in [0.5, 1))
    m |= 0x3f000000;         // force the exponent field to 126, i.e. normalize the mantissa to [0.5, 1)
    uf.bits = m;
    mantissa = uf.value;
    printf("n = %g, mantissa = %g, exp = %d, check %g\n", n, mantissa, exp, mantissa * powf(2, exp));
    exit(EXIT_SUCCESS);
}
Note: the code above is one of the quick&dirty(tm) species and is not meant for production. It also lacks handling for subnormal (denormal) numbers, a thing you must include. Hint: multiply the mantissa with a large power of two (e.g.: 2^25 or in that ballpark) and adjust the exponent accordingly (if you took the value of my example subtract 25).
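As an illustration of that hint, here is a sketch of my own (the helper name decompose is hypothetical, and the same assumptions as the program above apply; zero, infinities and NaNs remain unhandled):
#include <stdint.h>

/* Return a mantissa in [0.5, 1) and the matching exponent. Subnormal inputs
   are pre-scaled by 2^25 so they become normal, and 25 is subtracted from
   the exponent to compensate. */
static float decompose(float n, int *exp)
{
    union { float value; uint32_t bits; } uf;
    int extra = 0;

    uf.value = n;
    if (((uf.bits >> 23) & 0xffu) == 0 && (uf.bits & 0x007fffffu) != 0) {
        uf.value = n * 33554432.0f; /* 2^25: now the exponent field is nonzero */
        extra = 25;
    }
    *exp = (int)((uf.bits >> 23) & 0xffu) - 126 - extra;
    uf.bits = (uf.bits & 0x807fffffu) | 0x3f000000u; /* force the biased exponent to 126 */
    return uf.value;                                 /* mantissa in [0.5, 1) */
}
The same check as above, mantissa * powf(2, exp), should then reproduce n for normal and subnormal inputs alike.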

Bitwise multiplication of a floating point number

I am working on a project for school that requires me to manipulate floating point numbers using only their binary representation, specifically multiplying a float by two. I am supposed to replicate the function of the following code using only bit operations (! | ^ & ~), logical operators (&&, ||), and if/while expressions.
However, I am having a problem with some of my numbers not having the correct result. For example, when I pass in 8388608 (0x800000), the output is 8388608 when it should be 16777216 (0x1000000).
Here is the code whose function we are supposed to replicate:
unsigned test_dl23(unsigned uf) {
    float f = u2f(uf);
    float tf = 2*f;
    if (isnan(f))
        return uf;
    else
        return f2u(tf);
}
And here is my attempt:
unsigned dl23(unsigned uf) {
    int pullExpMask, expMask, mantMask, i, mask, signedBit;
    mask = 0x007fffff;
    pullExpMask = (-1 << 23) ^ 0x80000000; // fills lower 23 bits with zeroes
    signedBit = 0x80000000 & uf;
    expMask = (uf & pullExpMask); // grabs the exponent and signed bits, mantissa bits are zeroes
    mantMask = mask & uf;
    if (!(uf & mask)){ // if uf has an exponent that is all ones, return uf
        return uf;
    }
    else if (expMask == 0){
        mantMask = mantMask << 1;
    }
    else if (expMask != 0){
        expMask = expMask + 0x00100000;
    }
    return (signedBit | expMask | mantMask);
}
I have been working on this problem for a long time, and any helpful tips would be greatly appreciated!
I'm not sure which numbers you are referring to as failing, but I do see a bug in your code that would cause this behavior. The section with the comment // if uf has an exponent that is all ones, return uf does not do what it describes: it actually tests whether the mantissa part is 0 and returns the number unchanged, which causes the behavior you're seeing (0x800000 has a zero mantissa). It should probably be something like if (expMask == pullExpMask).
Also, you need to consider the case where incrementing the exponent makes it all 1's: the result should then become an infinity (with a zero mantissa), never a quiet or signaling NaN.
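Putting those two fixes together, a corrected version might look roughly like this (my own reconstruction, not code from the answer above; it keeps the structure of the original attempt and assumes IEEE-754 single precision with the exponent in bits 23..30):
unsigned dl23(unsigned uf) {
    unsigned expMask  = uf & 0x7f800000;  /* exponent field */
    unsigned mantMask = uf & 0x007fffff;  /* mantissa field */
    unsigned signBit  = uf & 0x80000000;  /* sign bit */

    if (expMask == 0x7f800000) {
        /* exponent all ones: NaN or infinity, return unchanged */
        return uf;
    } else if (expMask == 0) {
        /* denormal: doubling is a left shift of the mantissa */
        return signBit | (mantMask << 1);
    } else {
        expMask = expMask + 0x00800000;   /* add 1 to the exponent field (bit 23) */
        if (expMask == 0x7f800000)
            mantMask = 0;                 /* overflow must become infinity, not a NaN */
        return signBit | expMask | mantMask;
    }
}
The denormal branch works because a left shift of the mantissa can only carry into the lowest exponent bit, which is exactly the encoding of the smallest normal number.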

function to convert int to float (huge integers)

This is a university question. Just to make sure :-) We need to implement (float)x
I have the following code which must convert integer x to its floating point binary representation stored in an unsigned integer.
unsigned float_i2f(int x) {
    if (!x) return x;
    /* get sign of x */
    int sign = (x >> 31) & 0x1;
    /* absolute value of x */
    int a = sign ? ~x + 1 : x;
    /* calculate exponent */
    int e = 0;
    int t = a;
    while (t != 1) {
        /* shift right until only the leading 1 remains */
        t >>= 1;
        e++;
    }
    /* calculate mantissa */
    int m = a << (32 - e);
    /* logical right shift */
    m = (m >> 9) & ~(((0x1 << 31) >> 9 << 1));
    /* add bias for 32bit float */
    e += 127;
    int res = sign << 31;
    res |= (e << 23);
    res |= m;
    /* lots of printf */
    return res;
}
One problem I encounter now is that when my integers are too big, my code fails. I have this control procedure implemented:
float f = (float)x;
unsigned int r;
memcpy(&r, &f, sizeof(unsigned int));
This of course always produces the correct output.
Now when I do some test runs, these are my outputs (GOAL is what it needs to be, result is what I got):
:!make && ./btest -f float_i2f -1 0x80004999
make: Nothing to be done for `all'.
Score Rating Errors Function
x: [-2147464807] 10000000000000000100100110011001
sign: 1
expone: 01001110100000000000000000000000
mantis: 00000000011111111111111101101100
result: 11001110111111111111111101101100
GOAL: 11001110111111111111111101101101
So in this case, a 1 is added as the LSB.
Next case:
:!make && ./btest -f float_i2f -1 0x80000001
make: Nothing to be done for `all'.
Score Rating Errors Function
x: [-2147483647] 10000000000000000000000000000001
sign: 1
expone: 01001110100000000000000000000000
mantis: 00000000011111111111111111111111
result: 11001110111111111111111111111111
GOAL: 11001111000000000000000000000000
Here the GOAL has 1 added to the exponent and an all-zero mantissa, while my result has an all-ones mantissa; in other words, the value should have been rounded up.
I tried for hours to look it up on the internet and in my books, but I can't find any references to this problem. I guess it has something to do with the fact that the mantissa is only 23 bits. But how do I have to handle it then?
EDIT: THIS PART IS OBSOLETE THANKS TO THE COMMENTS BELOW. int l must be unsigned l.
int x = 2147483647;
float f = (float)x;
int l = f;
printf("l: %d\n", l);
then l becomes -2147483648.
How can this happen? So C is doing the casting wrong?
Hope someone can help me here!
Thx
Markus
EDIT 2:
My updated code is now this:
unsigned float_i2f(int x) {
    if (x == 0) return 0;
    /* get sign of x */
    int sign = (x >> 31) & 0x1;
    /* absolute value of x */
    int a = sign ? ~x + 1 : x;
    /* calculate exponent */
    int e = 158;
    int t = a;
    while (!(t >> 31) & 0x1) {
        t <<= 1;
        e--;
    }
    /* calculate mantissa */
    int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
    m &= 0x7fffff;
    int res = sign << 31;
    res |= (e << 23);
    res |= m;
    return res;
}
I also figured out that the code works for all integers in the range [-2^24, 2^24]. Everything outside that range sometimes works but mostly doesn't.
Something is missing, but I really have no idea what. Can anyone help me?
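The missing piece is rounding: once |x| needs more than 24 significant bits, the bits shifted out of the 23-bit fraction have to be rounded to nearest, ties to even, which is what the reference (float)x conversion does. The answers below address the earlier, now-obsolete edit, so here is a sketch of my own for that rounding step (assuming 32-bit int and IEEE single precision):
#include <stdint.h>

static uint32_t i2f_round(int32_t x)
{
    if (x == 0)
        return 0;
    uint32_t sign = (uint32_t)x & 0x80000000u;
    uint32_t a = sign ? -(uint32_t)x : (uint32_t)x;

    int msb = 31;                               /* position of the highest set bit */
    while (!(a & (1u << msb)))
        msb--;

    uint32_t exp = (uint32_t)(msb + 127);
    uint32_t frac;
    if (msb <= 23) {
        frac = (a << (23 - msb)) & 0x7FFFFFu;   /* fits exactly, no rounding needed */
    } else {
        int drop = msb - 23;                    /* number of bits that do not fit */
        uint32_t kept = a >> drop;
        uint32_t rest = a & ((1u << drop) - 1);
        uint32_t half = 1u << (drop - 1);
        if (rest > half || (rest == half && (kept & 1)))
            kept++;                             /* round to nearest, ties to even */
        if (kept >> 24) {                       /* rounding overflowed the 24-bit significand */
            kept >>= 1;
            exp++;
        }
        frac = kept & 0x7FFFFFu;
    }
    return sign | (exp << 23) | frac;
}
With that rounding step in place, x = -2147464807 produces exactly the GOAL bit pattern 11001110111111111111111101101101 from the first test run above.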
The value printed is not the compiler doing the cast wrong; it follows directly from the underlying representation of the numbers being converted. Once you understand the binary representation of the number, the result is not surprising.
An implicit conversion is performed by the assignment (see C99 6.5.16), and the C99 Standard goes on to say:
6.3.1.4 Real floating and integer
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
Your earlier example illustrates exactly this: (float)2147483647 rounds up to 2147483648.0, and the integral part of that value cannot be represented in an int, so converting it back is undefined behavior. (Assigning a negative value to an unsigned type, by contrast, is well-defined; the undefined behavior here comes from the floating-point-to-integer conversion.)
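As a concrete illustration (my own example, assuming 32-bit int and IEEE single precision):
#include <stdio.h>

int main(void)
{
    int x = 2147483647;                /* INT_MAX, not representable in float */
    float f = (float)x;                /* rounds up to 2147483648.0f */
    printf("%.1f\n", f);               /* prints 2147483648.0 */

    /* int l = (int)f;                    undefined: 2147483648 is outside int's range */
    unsigned int l = (unsigned int)f;  /* fine: the value fits in unsigned int */
    printf("%u\n", l);                 /* prints 2147483648 */
    return 0;
}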
The asserts in the following snippet ought to prevent any undefined behavior from occurring.
#include <assert.h>
#include <limits.h>
#include <math.h>

unsigned int convertFloatingPoint(double v) {
    double d;
    assert(isfinite(v));
    d = trunc(v);
    assert((d >= 0.0) && (d <= (double)UINT_MAX));
    return (unsigned int)d;
}
Another way of doing the same thing: create a union containing a 32-bit integer and a float. The int and the float are then just different ways of looking at the same bit of memory:
union {
    int myInt;
    float myFloat;
} my_union;

my_union.myInt = 0xBFFFF2E5;
printf("float is %f\n", my_union.myFloat);
float is -1.999600
With a plain cast you are telling the compiler to take the number you have (a large integer) and convert its value into a float, not to interpret the existing bits AS a float. To do that reinterpretation, you need to tell the compiler to read the number from that address in a different form, so this:
myFloat = *(float *)&myInt;
That means, if we take it apart, starting from the right:
&myInt - the location in memory that holds your integer.
(float *) - really, I want the compiler to use this as a pointer to float, not whatever the compiler thinks it may be.
* - read from the address of whatever is to the right.
myFloat = - set this variable to whatever is to the right.
So, you are telling the compiler: In the location of (myInt), there is a floating point number, now put that float into myFloat.
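Note that, as pointed out earlier on this page, *(float *)&myInt is technically a strict-aliasing violation; memcpy (or a union) expresses the same reinterpretation without that problem. A minimal sketch, assuming float and int are both 32 bits (the helper name is my own):
#include <string.h>

float reinterpret_int_as_float(int myInt)
{
    float myFloat;
    memcpy(&myFloat, &myInt, sizeof myFloat); /* copy the bits, don't convert the value */
    return myFloat;
}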

Bit Rotation in C

The Problem: Exercise 2-8 of The C Programming Language, "Write a function rightrot(x,n) that returns the value of the integer x, rotated to the right by n positions."
I have done this every way that I know how. Here is the issue that I am having. Take a given number for this exercise, say 29, and rotate it right one position.
29 is 11101 in binary; rotated right one position it becomes 11110, or 30. Let's say for the sake of argument that the system we are working on has an unsigned integer type size of 32 bits. Let's further say that we have the number 29 stored in an unsigned integer variable. In memory the number will have 27 zeros ahead of it. So when we rotate 29 right by one using one of several algorithms (mine is posted below), we get the number 2147483662. This is obviously not the desired result.
unsigned int rightrot(unsigned x, int n) {
    return (x >> n) | (x << (sizeof(x) * CHAR_BIT - n));
}
Technically, this is correct, but I was thinking that the 27 zeros that are in front of 11101 were insignificant. I have also tried a couple of other solutions:
int wordsize(void) { // compute the wordsize on a given machine...
    unsigned x = ~0;
    int b;
    for (b = 0; x; b++)
        x &= x - 1;
    return b;
}
unsigned int rightrot(unsigned x, int n) {
    unsigned rbit;
    while (n--) {
        rbit = x >> 1;
        x |= (rbit << wordsize() - 1);
    }
    return x;
}
This last and final solution is the one where I thought that I had it; I will explain where it failed once I get to the end. I am sure that you will see my mistake...
int bitcount(unsigned x) {
    int b;
    for (b = 0; x; b++)
        x &= x - 1;
    return b;
}

unsigned int rightrot(unsigned x, int n) {
    unsigned rbit;
    int shift = bitcount(x);
    while (n--) {
        rbit = x & 1;
        x >>= 1;
        x |= (rbit << shift);
    }
    return x;
}
This solution gives the expected answer of 30 that I was looking for, but if you use a number like, say, 31 (11111) for x, then there are issues: with n equal to one, the outcome is 47. I did not think of this earlier, but if a number like 8 (1000) is used, then mayhem. There is only one set bit in 8, so the shift is most certainly going to be wrong. My theory at this point is that the first two solutions are correct (mostly) and I am just missing something...
A bitwise rotation is always necessarily within an integer of a given width. In this case, as you're assuming a 32-bit integer, 2147483662 (0b10000000000000000000000000001110) is indeed the correct answer; you aren't doing anything wrong!
0b11110 would not be considered the correct result by any reasonable definition, as continuing to rotate it right using the same definition would never give you back the original input. (Consider that another right rotation would give 0b1111, and continuing to rotate that would have no effect.)
In my opinion, the spirit of the section of the book which immediately precedes this exercise would have the reader do this problem without knowing anything about the size (in bits) of integers, or any other type. The examples in the section do not require that information; I don't believe the exercises should either.
Regardless of my belief, the book had not yet introduced the sizeof operator by section 2.9, so the only way to figure the size of a type is to count the bits "by hand".
But we don't need to bother with all that. We can do bit rotation in n steps, regardless of how many bits there are in the data type, by rotating one bit at a time.
Using only the parts of the language that are covered by the book up to section 2.9, here's my implementation (with integer parameters, returning an integer, as specified by the exercise): loop n times; on each iteration save the low bit, shift x right by one, and if the saved bit was 1, set the new high bit.
int rightrot(int x, int n) {
    int lowbit;
    while (n-- > 0) {
        lowbit = x & 1;            /* save low bit */
        x = (x >> 1) & (~0u >> 1); /* shift right by one, and clear the high bit (in case of sign extension) */
        if (lowbit)
            x = x | ~(~0u >> 1);   /* set the high bit if the low bit was set */
    }
    return x;
}
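A quick check of the function above against the number from the question (a driver of my own, assuming 32-bit int):
#include <stdio.h>

int rightrot(int x, int n); /* the function defined above */

int main(void)
{
    printf("%u\n", (unsigned)rightrot(29, 1));  /* 2147483662, i.e. 0x8000000E */
    printf("%u\n", (unsigned)rightrot(30, 31)); /* 60: rotating right by 31 equals rotating left by 1 */
    return 0;
}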
You could find the location of the first '1' in the 32-bit value using binary search. Then note the bit in the LSB location, right shift the value by the required number of places, and put the LSB bit in the location of the first '1'.
int bitcount(unsigned x) {
    int b;
    for (b = 0; x; b++)
        x &= x - 1;
    return b;
}

unsigned rightrot(unsigned x, int n) {
    int b = bitcount(x);
    unsigned a = (x & ~(~0 << n)) << (b - n + 1);
    x >>= n;
    x |= a;
    return x;
}

Converting double to float without relying on the FPU rounding mode

Does anyone have handy the snippets of code to convert an IEEE 754 double to the immediately inferior (resp. superior) float, without changing or assuming anything about the FPU's current rounding mode?
Note: this constraint probably implies not using the FPU at all. I expect the simplest way to do it in these conditions is to read the bits of the double in a 64-bit long and to work with that.
You can assume the endianness of your choice for simplicity, and that the double in question is available through the d field of the union below:
union double_bits
{
    long i;
    double d;
};
I would try to do it myself but I am certain I would introduce hard-to-notice bugs for denormalized or negative numbers.
I think the following works, but I will state my assumptions first:
floating-point numbers are stored in IEEE-754 format on your implementation,
No overflow,
You have nextafterf() available (it's specified in C99).
Also, most likely, this method is not very efficient.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    /* Change to non-zero for superior, otherwise inferior */
    int superior = 0;
    /* double value to convert */
    double d = 0.1;
    float f;
    double tmp = d;

    if (argc > 1)
        d = strtod(argv[1], NULL);

    /* First, get an approximation of the double value */
    f = d;
    /* Now, convert that back to double */
    tmp = f;

    /* Print the numbers. %a is C99 */
    printf("Double: %.20f (%a)\n", d, d);
    printf("Float: %.20f (%a)\n", f, f);
    printf("tmp: %.20f (%a)\n", tmp, tmp);

    if (superior) {
        /* If we wanted superior, and got a smaller value,
           get the next value */
        if (tmp < d)
            f = nextafterf(f, INFINITY);
    } else {
        if (tmp > d)
            f = nextafterf(f, -INFINITY);
    }
    printf("converted: %.20f (%a)\n", f, f);
    return 0;
}
On my machine, it prints:
Double: 0.10000000000000000555 (0x1.999999999999ap-4)
Float: 0.10000000149011611938 (0x1.99999ap-4)
tmp: 0.10000000149011611938 (0x1.99999ap-4)
converted: 0.09999999403953552246 (0x1.999998p-4)
The idea is that I am converting the double value to a float value; this could be less than or greater than the double value, depending upon the rounding mode. When it is converted back to double, we can check whether it is smaller or greater than the original value. Then, if the float is not on the required side, we take the next float from the converted number in the direction of the original number.
To do this job more accurately than just recombining the mantissa and exponent bits, check this out:
http://www.mathworks.com/matlabcentral/fileexchange/23173
regards
I posted code to do this here: https://stackoverflow.com/q/19644895/364818 and copied it below for your convenience.
#include <string.h>

// d points to an IEEE double, but double is not natively supported.
static float ConvertDoubleToFloat(void* d)
{
    unsigned long long x;
    float f; // assumed to be IEEE float
    unsigned long long sign;
    unsigned long long exponent;
    unsigned long long mantissa;

    memcpy(&x, d, 8);

    // IEEE binary64 format (unsupported)
    sign     = (x >> 63) & 1;                    //  1 bit
    exponent = (x >> 52) & 0x7FF;                // 11 bits
    mantissa = (x >> 0) & 0x000FFFFFFFFFFFFFULL; // 52 bits
    exponent -= 1023;

    // IEEE binary32 format (supported)
    exponent += 127;        // rebase
    exponent &= 0xFF;
    mantissa >>= (52 - 23); // keep the 23 most significant mantissa bits (truncates, no rounding)

    x = mantissa | (exponent << 23) | (sign << 31);
    memcpy(&f, &x, 4);
    return f;
}
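A possible driver for the routine above (my own addition; it has to be compiled in the same file, since the function is static). Because the mantissa is simply truncated, the result is rounded toward zero rather than to nearest:
#include <stdio.h>

int main(void)
{
    double d = 0.1;
    float f = ConvertDoubleToFloat(&d);
    printf("%a\n%a\n", d, (double)f);
    /* prints 0x1.999999999999ap-4 and 0x1.999998p-4:
       one ULP below the round-to-nearest float 0x1.99999ap-4 */
    return 0;
}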
