How to decide when to round when converting decimal/int to float - c

I have an integer -2147483647, which is 0x80000001 in hex.
When I use an online float converter, it says that it will instead convert -2147483648, due to rounding.
Now I have to implement this int-to-float function in c with single-precision.
I have all the other features already implemented but I was not aware of the rounding.
When do I decide when to round?
EDIT:
Here are the features I have so far. (in the order of how the function works)
1) First I check whether the int is negative; if so, I negate it and assign it to an unsigned int. I also set the sign bit to 1.
2) To get the exponent, I search for the first '1' bit from the left of the unsigned int, and get its position.
3) I add 127 to the position and shift the result 23 bits to the left to match the IEEE 754 layout.
4) With the position counter from 2), I shift the unsigned int so that bit 31 holds the bit right after the leading '1' bit (the leading '1' bit is omitted because it is implied in every normalized IEEE float).
5) I right shift that int 9 bits to the right so that it fits the mantissa slot of the standard.
6) I add the sign bit, the exponent, and the mantissa to get the final bit representation of the original int.

The three closest values for -2147483647 in single precision, as shown in hex and the integer number are:
hex ceffffff = -2147483520
hex cf000000 = -2147483648
hex cf000001 = -2147483904
so hex cf000000 = -2147483648 is used since it's the closest.
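To answer "when do I round": rounding is only needed when the leading '1' sits above bit 23, because then the low bits do not fit into the 23-bit mantissa. The following is a minimal sketch of round-to-nearest, ties-to-even, using only integer operations. The function name `int_to_float_bits` is made up for illustration, and the code assumes a 32-bit int and the IEEE 754 single-precision layout.

```c
#include <stdint.h>

/* Hypothetical helper: returns the IEEE 754 single-precision bit pattern
   of a 32-bit signed integer, rounding to nearest with ties to even. */
uint32_t int_to_float_bits(int32_t x)
{
    if (x == 0)
        return 0;

    uint32_t sign = 0;
    uint32_t mag = (uint32_t)x;
    if (x < 0) {
        sign = 0x80000000u;
        mag = 0u - mag;          /* negate in unsigned arithmetic; safe for INT32_MIN */
    }

    /* Position of the highest set bit (0..31). */
    int pos = 31;
    while (!(mag & (1u << pos)))
        --pos;

    uint32_t exponent = (uint32_t)(pos + 127);
    uint32_t mantissa;

    if (pos <= 23) {
        /* At most 24 significant bits: shift left, nothing is lost. */
        mantissa = (mag << (23 - pos)) & 0x007FFFFFu;
    } else {
        int shift = pos - 23;                         /* number of dropped bits: 1..8 */
        uint32_t dropped = mag & ((1u << shift) - 1u);
        uint32_t half = 1u << (shift - 1);
        mantissa = (mag >> shift) & 0x007FFFFFu;

        /* Round up if the dropped part is more than half a unit, or exactly
           half a unit while the kept mantissa is odd (ties to even). */
        if (dropped > half || (dropped == half && (mantissa & 1u)))
            ++mantissa;

        /* The mantissa overflowed into bit 23: carry into the exponent. */
        if (mantissa == 0x00800000u) {
            mantissa = 0;
            ++exponent;
        }
    }
    return sign | (exponent << 23) | mantissa;
}
```

For -2147483647 the magnitude is 0x7FFFFFFF: the seven dropped bits are 1111111, which is more than half, so the mantissa rounds up, overflows, and bumps the exponent, giving cf000000.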

Related

Format a float using IEEE-754

I have an assignment in C where I'm given a float number and have to print it using the IEEE-754 format:
__(sign)mantissa * 2^(exponent)__
The sign would be either '-' or ' ', the exponent an int value and the mantissa a float value. I also cannot use structs or any function present in any library other than the ones in stdio.h (for the printf function). This should be done using bitwise operations.
I was given two functions:
unsignedToFloat: given an unsigned int, returns a float with the same bits;
floatToUnsigned: given a float, returns an unsigned int with the same bits;
After getting the float representation as an unsigned int, I have managed to determine the sign and exponent of any float number. I have also managed to determine the mantissa in bits. For example, given the float 3.75:
> bit representation: 01000000011100000000000000000000;
> sign: 0 (' ');
> exponent = 10000000 (128 - bias = 1)
> mantissa = 11100000000000000000000
Now, in this example, I should represent the mantissa as "1.875". However, I have not been able to do this conversion. So far, I have tried to create another float as 000000000[mantissa], but it gives me the wrong result (which I now understand why). I was instructed that the mantissa has a 1 in the beginning, meaning in this example, the mantissa would become 1.111, but I wasn't instructed on how exactly to do this. I tried searching online, but couldn't find any way of adding a 1 to the beginning of the mantissa.
I also had the idea of doing this portion by going through every bit of the mantissa and getting its decimal representation, and then adding 1. In this example, I would do the following:
> 11100000000000000000000
> as_float = 2^-1 + 2^-2 + 2^-3
> as_float += 1
However, this approach seems very hack-y and slow, and could perhaps give me a wrong result.
With this said, I am completely out of ideas. How would I represent the mantissa of a float number as its own thing?
1) Convert the float to unsigned int.
2) Clear the sign bit (using bitwise AND &).
3) Set the exponent so that it's equal to the bias (using bitwise AND & and OR |).
4) Convert the modified value from unsigned int to float.
5) Use printf to print the new value.
This works because clearing the sign bit and setting the exponent to the bias leaves you with (positive)mantissa * 2^0 = mantissa, so printf will print the mantissa for you.
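The steps above could be sketched roughly as follows. The name `significand_of` is made up for illustration, and `memcpy` is only a stand-in for the `floatToUnsigned` and `unsignedToFloat` helpers the question was given (whose code is not shown). The sketch assumes a 32-bit IEEE 754 float and a normalized input; zero, subnormals, infinities and NaN are not handled.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the significand of a normalized float as a
   value in [1, 2) by clearing the sign and setting the exponent to the bias. */
float significand_of(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);    /* stand-in for floatToUnsigned */

    bits &= 0x007FFFFFu;               /* keep the 23 mantissa bits; clears sign and exponent */
    bits |= 127u << 23;                /* exponent field = bias, i.e. a factor of 2^0 */

    float out;
    memcpy(&out, &bits, sizeof out);   /* stand-in for unsignedToFloat */
    return out;
}
```

With the 3.75 example from the question, `significand_of(3.75f)` yields 1.875, which can then be passed to printf.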

How to extract the sign, mantissa and exponent from a 32-Bit float [duplicate]

so I got a task where I have to extract the sign, exponent and mantissa from a floating point number given as uint32_t. I have to do that in C and as you might expect, how do I do that ?
For the sign I would search for the MSB (Most Significant Bit, since it tells me whether my number is positive or negative, depending if it's 0 or 1)
Or let's get straight to my idea, can I "splice" my 32 bit number into three parts ?
Get the 1 bit for msb/sign
Then after that follows 1 byte which stands for the exponent
and at last 23 bits for the mantissa
It probably doesn't work like that but can you give me a hint/solution ?
I know of frexp, but I want an alternative, where I learn a little more of C.
Thank you.
If you know the bitwise layout of your floating point type (e.g. because your implementation supports IEEE floating point representations) then convert a pointer to your floating point variable (of type float, double, or long double) into a pointer to unsigned char. From there, treat the variable like an array of unsigned char and use bitwise operations to extract the parts you need.
Otherwise, do this:
#include <math.h>
int main(void)
{
    double x = 4.25E12;   /* value picked at random */
    double significand;
    int exponent;
    int sign;
    sign = (x >= 0) ? 1 : -1;            /* deem 0 to be positive sign */
    significand = frexp(x, &exponent);   /* x == significand * 2^exponent */
    return 0;
}
The calculation of sign in the above should be obvious.
significand may be positive or negative, for non-zero x. The absolute value of significand is in the range [0.5,1) which, when multiplied by 2 to the power of exponent gives the original value.
If x is 0, both exponent and significand will be 0.
This will work regardless of what floating point representations your compiler supports (assuming double values).
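For the case in the question, where the value already arrives as a uint32_t, the "splice into three parts" idea does work with shifts and masks. The sketch below is only an illustration: the struct and function names are invented, and it assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits).

```c
#include <stdint.h>

/* Hypothetical container for the three fields (names are illustrative). */
struct float_fields {
    uint32_t sign;       /* 1 bit: 0 = positive, 1 = negative */
    uint32_t exponent;   /* 8 bits, still biased by 127 */
    uint32_t mantissa;   /* 23 bits, without the implicit leading 1 */
};

struct float_fields split_float_bits(uint32_t bits)
{
    struct float_fields f;
    f.sign     = bits >> 31;              /* top bit */
    f.exponent = (bits >> 23) & 0xFFu;    /* next 8 bits */
    f.mantissa = bits & 0x007FFFFFu;      /* low 23 bits */
    return f;
}
```

Subtracting 127 from `exponent` gives the actual power of two for normalized numbers.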

What does signed and unsigned values mean?

What does signed mean in C? I have this table to show:
This says signed char 128 to +127. 128 is also a positive integer, so how can this be something like +128 to +127? Or do 128 and +127 have different meanings? I am referring to the book Apress Beginning C.
A signed integer can represent negative numbers; unsigned cannot.
Signed integers have undefined behavior if they overflow, while unsigned integers wrap around using modulo.
Note that that table is incorrect. First off, it's missing the - signs (such as -128 to +127). Second, the standard does not guarantee that those types must fall within those ranges.
By default, numerical values in C are signed, which means they can be both negative and positive. Unsigned values on the other hand, don't allow negative numbers.
Because it's all just about memory, in the end all the numerical values are stored in binary. A 32 bit unsigned integer can contain values from all binary 0s to all binary 1s. When it comes to 32 bit signed integer, it means one of its bits (most significant) is a flag, which marks the value to be positive or negative. So, it's the interpretation issue, which tells that value is signed.
Positive signed values are stored the same way as unsigned values, but negative numbers are stored using two's complement method.
If you want to write negative value in binary, first write positive number, next invert all the bits and last add 1. When a negative value in two's complement is added to a positive number of the same magnitude, the result will be 0.
In the example below let's deal with 8-bit numbers, because they are simple to inspect:
positive 95: 01011111
negative 95: 10100000 + 1 = 10100001 [161 if read as unsigned]
0: 01011111 + 10100001 = 1 00000000
The leading 1 is the carry out of the top bit. As we're dealing with 8-bit numbers, it is discarded,
leaving the 8 bits 00000000, which means the result is 0.
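The invert-and-add-1 rule can be tried out on 8-bit patterns with a small sketch like the one below. The function name `negate8` is invented for illustration; the cast back to `uint8_t` is what discards the carry.

```c
#include <stdint.h>

/* Hypothetical helper: two's-complement negation of an 8-bit pattern. */
uint8_t negate8(uint8_t v)
{
    return (uint8_t)(~v + 1u);   /* invert all bits, add 1, keep only the low 8 bits */
}
```

For example, `negate8(0x5F)` (95, binary 01011111) gives 0xA1 (binary 10100001), and adding the two patterns in 8 bits gives 0.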
The table is missing the minuses. The range of signed char is -128 to +127; likewise for the other types on the table.
It was a typo in the book; signed char goes from -128 to 127.
Signed integers are stored using the two's complement representation, in which the first bit is used to indicate the sign.
In C, a char is a small integer, 8 bits wide on virtually all platforms. A signed 8-bit value can therefore go from -(2^7) to 2^7 - 1. That's because the last 7 bits hold the number and the first bit holds the sign: 0 means positive and 1 means negative (in two's complement representation).
The biggest positive value is (01111111)b = 2^7 - 1 = 127.
The smallest negative value is (10000000)b = -128
(because 10000000 is its own two's complement: inverting it gives 01111111, and adding 1 gives 10000000 = 2^7 = 128).
Unsigned chars don't have signs so they can use all the 8 bits. Going from (00000000)b = 0 to (11111111)b = 255.
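The actual limits on a given platform can be read from `<limits.h>` rather than from a table. The function name below is made up, and the values in the comments are what a typical platform with 8-bit chars reports, not something the standard guarantees exactly.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: prints the char limits of the current platform. */
void print_char_ranges(void)
{
    printf("bits per char: %d\n", CHAR_BIT);                     /* typically 8 */
    printf("signed char:   %d to %d\n", SCHAR_MIN, SCHAR_MAX);   /* typically -128 to 127 */
    printf("unsigned char: 0 to %u\n", (unsigned)UCHAR_MAX);     /* typically 0 to 255 */
}
```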
Signed numbers are those written with a + or - in front of them.
E.g. +2 and -6 are signed numbers.
Signed types can store both positive and negative numbers, so their range is split around zero,
i.e. -32768 to 32767 (for a 16-bit type).
Unsigned numbers are simply numbers with no sign. They are never negative, and their range is from 0 to 65535 (for a 16-bit type).
Hope it helps
Signed usually means the number has a + or - symbol in front of it. This means that unsigned int, unsigned shorts, etc cannot be negative.
Nobody mentioned this, but range of int in table is wrong:
it is
-2^(31) to 2^(31)-1
i.e.,
-2,147,483,648 to 2,147,483,647
A signed integer can have both negative and positive values, while an unsigned integer can only have non-negative values.
For signed integers using two's complement, which is most commonly used, the range is (depending on the bit width of the integer):
signed char s -> range -128 to 127
Whereas an unsigned char has the range:
unsigned char s -> range 0 to 255
First, your table is wrong: the negative numbers are missing. Regarding the type char: it occupies one byte, and with 8 bits you can represent 2^8 = 256 different values. So you have two alternatives for the range: either -128 to +127 or 0 to 255. The first is a signed char, the second an unsigned char. If you are using int, be aware of what kind of platform you are on (16-bit, 32-bit or 64-bit), because the size of int depends on it. A char is almost always exactly 8 bits (the exact width is given by CHAR_BIT).
It means that there will likely be a sign ( a symbol) in front of your value (+12345 || -12345 )

Integer representation to float representation

Is there an algorithm that can convert a 32-bit integer in its integer representation to an IEEE 754 float representation by just using integer operations?
I have a couple of thoughts on this but none of them works so far. (Using C)
I was thinking about shifting the integer, but then I failed to
construct the new float representation from that.
I suppose I could convert the integer to binary, but it has the same
problem as the first approach.
excellent resource on float
Address +3 +2 +1 +0
Format SEEEEEEE EMMMMMMM MMMMMMMM MMMMMMMM
S represents the sign bit where 1 is negative and 0 is positive.
E is the exponent, stored with a bias (offset) of 127 rather than in two's complement.
M is the 23-bit normalized mantissa. The highest bit is always 1
and, therefore, is not stored
Then look here for two's complement
I'll use num as an array of bits, I know this isn't standard C array range accessing, but you get the point
So for a basic algorithm, we start with filling out S.
bit S = 0;
if (num[0] == 1) {
S = 1;
num[0..31] = ~num[0..31] + 1; //negate: flip all the bits, then add 1
}
Now we have set S and we have a scalar value for the rest of the number.
Then we can position our number into the mantissa, by finding the first index of 1. Which will also let us find the exponent. Note here that the exponent will always be positive since we can't have fractional int values. (also, make a special case for checking if the value is 0 first, to avoid an infinite loop in here, or just modify the loop appropriately, I'm lazy)
int pos = 0; //num[0] is 0 after the negation, except for the most negative int
int E = 31 + 127; //biased exponent if the leading 1 is at num[0]
bit[23] M;
while(num[pos] == 0) {
--E;
++pos;
}
int finalPos = min(31, pos+23); //don't read past the last bit, num[31]
M = num[pos+1..finalPos]; //set the mantissa bits, left-aligned in M; any lower bits are dropped
Then you construct your float with the bits in S,E,M
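The S/E/M steps above might look like this in real C. This is a sketch under assumptions: the function name is invented, int is taken to be 32 bits, and, like the pseudocode, it simply drops the bits that do not fit into the mantissa (truncation). A compiler's own `(float)` conversion normally rounds to nearest instead, so results can differ for magnitudes above 2^24.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the S/E/M steps; truncates, does not round. */
uint32_t int_to_float_bits_trunc(int32_t num)
{
    if (num == 0)
        return 0;                           /* special case: avoids an endless search */

    uint32_t S = 0;
    uint32_t mag = (uint32_t)num;
    if (num < 0) {
        S = 1;
        mag = ~mag + 1u;                    /* flip all the bits, then add 1 */
    }

    int pos = 31;                           /* bit index of the leading 1 */
    while (((mag >> pos) & 1u) == 0)
        --pos;

    uint32_t E = (uint32_t)(pos + 127);     /* biased exponent */

    /* Move the leading 1 to bit 23, then mask it off (it is implicit). */
    uint32_t M = (pos > 23) ? (mag >> (pos - 23)) : (mag << (23 - pos));
    M &= 0x007FFFFFu;

    return (S << 31) | (E << 23) | M;
}
```

Small values such as 1 or 3 convert exactly; a value like -2147483647 comes out as ceffffff because the low seven bits are cut off rather than rounded.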

Bits representation of negative numbers

This is a doubt regarding the representation of bits of signed integers. For example, when you want to represent -1, it is equivalent to the 2's complement of (+1). So -1 is represented as 0xFFFFFFFF. Now when I shift my number right by 31 and print the result, it comes back as -1.
signed int a = -1;
printf("The number is %d ", (a >> 31)); // this prints -1
So can anyone please explain to me how the bits are represented for negative numbers?
Thanks.
When the top bit is zero, the number is positive. When it's 1, the number is negative.
Negative numbers shifted right keep shifting a "1" in as the topmost bit to keep the number negative. That's why you're getting that answer.
For more about two's complement, see this Stackoverflow question.
@Stobor points out that some C implementations could shift 0 into the high bit instead of 1. [Verified in Wikipedia.] In Java it's dependably an arithmetic shift.
But the output given by the questioner shows that his compiler is doing an arithmetic shift.
The C standard leaves it implementation-defined whether the right shift of a negative (necessarily signed) integer shifts zeroes (logical shift right) or sign bits (arithmetic shift right) into the most significant bit. It is up to the implementation to choose.
Consequently, portable code ensures that it does not perform right shifts on negative numbers. Either it converts the value to the corresponding unsigned value before shifting (which is guaranteed to use a logical shift right, putting zeroes into the vacated bits), or it ensures that the value is positive, or it tolerates the variation in the output.
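A minimal sketch of the first option, converting to unsigned before shifting. The function name `top_bit` is made up for illustration, and a 32-bit integer type is assumed.

```c
#include <stdint.h>

/* Hypothetical helper: extracts the top (sign) bit of a 32-bit value portably. */
uint32_t top_bit(int32_t a)
{
    /* Converting to unsigned first guarantees that zeroes are shifted in. */
    return (uint32_t)a >> 31;
}
```

With this, `top_bit(-1)` is 1 on every conforming implementation, whereas `-1 >> 31` may be -1 or 1 depending on the compiler.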
This is an arithmetic shift operation, which preserves the sign bit and shifts the remaining value bits of a signed number.
cheers
Basically there are two types of right shift: an unsigned right shift and a signed right shift. An unsigned right shift will shift the bits to the right, causing the least significant bit to be lost, and the most significant bit to be replaced with a 0. With a signed right shift, the bits are shifted to the right, causing the least significant bit to be lost, and the most significant bit to be preserved. A signed right shift divides the number by a power of two (corresponding to the number of places shifted), whereas an unsigned shift is a logical shifting operation.
The ">>" operator performs an unsigned right shift when the data type on which it operates is unsigned, and it performs a signed right shift when the data type on which it operates is signed. So, what you need to do is cast the object to an unsigned integer type before performing the bit manipulation to get the desired result.
Have a look at two's complement description. It should help.
EDIT: When the below was written, the code in the question was written as:
unsigned int a = -1;
printf("The number is %d ", (a >> 31)); // this prints -1
If unsigned int is at least 32 bits wide, then your compiler isn't really allowed to produce -1 as the output of that (with the small caveat that you should be casting the unsigned value to int before you pass it to printf).
Because a is an unsigned int, assigning -1 to it must give it the value of UINT_MAX (as the smallest non-negative value congruent to -1 modulo UINT_MAX+1). As long as unsigned int has at least 32 bits on your platform, the result of shifting that unsigned quantity right by 31 will be UINT_MAX divided by 2^31, which has to fit within int. (If unsigned int is 31 bits or shorter, it can produce whatever it likes, because shifting by an amount greater than or equal to the width of the type is undefined behavior.)

Resources