How to create a custom float with varying bit length - c

I'm writing a program that converts an int into a floating-point representation with a twist: the user specifies how many bits are used for the exponent and how many for the mantissa. I don't understand the algorithm for taking a number and breaking it into an exponent and a mantissa, though. If I go to an IEEE 754 converter and type in 34, it shows an exponent of 2^5 and a mantissa of 1.0625, but I don't understand how to get this conversion. I know what I'll do afterwards to place the bits in the specified positions, but I don't know how to get the correct bits in the first place.
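For illustration, here is a minimal sketch of one way to do it (my own example, assuming a positive, non-zero input, an IEEE-style bias of 2^(expBits-1) - 1, and truncation instead of rounding; the function name encodeCustomFloat is just for this example):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Sketch: pack a positive, non-zero integer into a custom float layout
   with expBits exponent bits and manBits mantissa bits. Uses an
   IEEE-style bias of 2^(expBits-1) - 1 and simply truncates mantissa
   bits that do not fit (no rounding, no sign bit). */
static uint32_t encodeCustomFloat(uint32_t value, int expBits, int manBits)
{
    int e = 0;                                /* position of the leading 1 bit */
    for (uint32_t v = value; v > 1; v >>= 1)
        e++;                                  /* e.g. 34 = 100010b gives e = 5 */

    int bias = (1 << (expBits - 1)) - 1;
    uint32_t exponent = (uint32_t)(e + bias); /* stored (biased) exponent      */

    /* Drop the implicit leading 1, then shift the remaining bits so that
       exactly manBits of them form the mantissa field. */
    uint32_t frac = value & ~(1u << e);
    uint32_t mantissa = (e <= manBits) ? frac << (manBits - e)
                                       : frac >> (e - manBits);

    return (exponent << manBits) | mantissa;
}

int main(void)
{
    /* 34 = 1.0625 * 2^5; with 8 exponent bits and 23 mantissa bits this
       should reproduce the IEEE 754 single-precision encoding 0x42080000. */
    printf("0x%08" PRIx32 "\n", encodeCustomFloat(34, 8, 23));
    return 0;
}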

Related

Convert integer to IEEE floating point?

I am currently reading "Computer Systems: A Programmer's Perspective". In the book, big-endian is used (most significant bits first). In the context of IEEE floating-point numbers, using 32-bit single precision, here is a quoted passage about the conversion between an integer and IEEE floating point:
One useful exercise for understanding floating-point representations
is to convert sample integer values into floating-point form. For
example, we saw in Figure
2.15 that 12,345 has binary representation [11000000111001]. We create a normalized representation of this by shifting 13 positions to the
right of a binary point, giving 12,345 = 1.1000000111001₂ × 2^13. To
encode this in IEEE single-precision format, we construct the fraction
field by dropping the leading 1 and adding 10 zeros to the end, giving
binary representation [10000001110010000000000]. To construct the
exponent field, we add bias 127 to 13, giving 140, which has binary
representation [10001100]. We combine this with a sign bit of 0 to get
the floating-point representation in binary of
[01000110010000001110010000000000].
What I do not understand is "by dropping the leading 1 and adding 10 zeros to the end, giving binary representation [10000001110010000000000]." If big-endian is used, why can you add 10 zeros to the end of 1000000111001? Doesn't that lead to a different value than the one after the binary point? It would make sense to me if we added 10 zeros at the front, since the final value would still be the one originally after the binary point.
Why/how can you add 10 zeros at the back without changing the value if big-endian is used?
This is how the number 12345 is represented as a 32-bit single-precision IEEE754 float:
            3 32222222 22211111111110000000000
            1 09876543 21098765432109876543210
            S ---E8--- ----------F23----------
Binary:     0 10001100 10000001110010000000000
Hex: 4640 E400
Precision: SP
Sign: Positive
Exponent: 13 (Stored: 140, Bias: 127)
Hex-float: +0x1.81c8p13
Value: +12345.0 (NORMAL)
Since this is a NORMAL value, the fraction field is interpreted with an implicit leading 1-bit; that is, the significand is 1.10000001110010000000000. So, to fill the 23-bit mantissa field you simply append 10 zeros at the end, which doesn't change the value.
Endianness isn't really related to how these numbers are represented, as each bit has a fixed meaning. But in general, the most-significant-bit is to the left in both the exponent and the mantissa.
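As a quick check, here is a small sketch (my own illustration, not from the book) that builds the encoding of 12,345 exactly as the quoted passage describes and compares it with the bits the compiler produces for 12345.0f:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint32_t n = 12345;                  /* binary [11000000111001], 14 bits  */
    int e = 13;                          /* position of the leading 1         */
    uint32_t frac = (n & ~(1u << e)) << (23 - e); /* drop leading 1, pad with zeros */
    uint32_t exp  = (uint32_t)(e + 127);          /* add bias 127, giving 140 */
    uint32_t bits = (0u << 31) | (exp << 23) | frac;

    float f = 12345.0f;
    uint32_t hw;
    memcpy(&hw, &f, sizeof hw);          /* well-defined way to view the bits */

    printf("hand-built: 0x%08X\n", (unsigned)bits);  /* 0x4640E400 */
    printf("compiler:   0x%08X\n", (unsigned)hw);    /* 0x4640E400 */
    return 0;
}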

C, FLT_MAX value larger than 32 bits?

I'm actually researching how to display a float number (with write) and I'm facing something which is confusing me.
I found that floats are stored in 32 bits, with 1 bit for the sign, 8 bits for the exponent and the rest for the mantissa.
Where my trouble comes from is when I display FLT_MAX with printf: I get 340282346638528859811704183484516925440.000000 by simply doing
printf("%f\n", FLT_MAX)
This value is bigger than INT_MAX, bigger than LLONG_MAX; how can this many digits be stored in 32 bits? Is it really 32 bits, or is it system dependent? I'm on Ubuntu x86_64 GNU/Linux.
I can't understand how more than 10 digits (the length of INT_MAX) can be stored in the same number of bits.
I think the problem is related, but I also have trouble with double, which gives me
printf("%lf", DBL_MAX);
#179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
It's making the mystery bigger !
Thanks for helping, hope I was clear.
Bits are merely physical things with two states. There is no inherent meaning to them. When we use the bits to represent an integer in binary, we interpret each bit as having a value, 1 for one bit, 2 for another, 4 for another, 8 for another, and so on. There is nothing in physics, logic, or law that requires us to give them this interpretation.
When we use the bits to represent a floating-point object, we give each bit a different meaning. One bit represents the sign. Eight bits contain an encoding of the exponent. 23 bits contain an encoding of the significand.
To figure out the meaning of the bits given the floating-point encoding scheme for numbers in the normal range, we interpret the exponent bits as a binary numeral, then subtract 127, then raise two to the resulting power. (For example, "10000011" is the binary numeral for 131, so it represents 2^4, i.e. 16.) Then we take the significand bits and append them to "1.", forming a binary numeral such as "1.00111110000000000000000". We convert that numeral to a number (it is 159/128), and we multiply it by the power from the exponent (producing 159/8 in this example) and apply the sign.
Since the exponent can be large, the value represented can be very large. The software that converts floating-point numbers to characters for output such as “340282346638528859811704183484516925440.000000” performs these interpretations for you.
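To make that concrete, here is a small sketch (my own, assuming the common case where float is the IEEE 754 binary32 format) that splits FLT_MAX into its sign, exponent, and significand fields:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <float.h>

int main(void)
{
    float f = FLT_MAX;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* view the 32 bits as an integer */

    unsigned sign = bits >> 31;
    unsigned exp  = (bits >> 23) & 0xFF;      /* biased exponent field          */
    unsigned frac = bits & 0x7FFFFF;          /* 23 stored significand bits     */

    printf("bits     : 0x%08X\n", (unsigned)bits);   /* 0x7F7FFFFF              */
    printf("sign     : %u\n", sign);                 /* 0                       */
    printf("exponent : %u (unbiased %d)\n", exp, (int)exp - 127); /* 254 -> 127 */
    printf("fraction : 0x%06X\n", frac);             /* 0x7FFFFF                */
    printf("value    : %f\n", f);        /* the long decimal string shown above */
    return 0;
}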

C, getting the maximum float or maximum double not from <float.h>

I was working through the book "The C Programming Language", but I got stuck on an exercise in which I should get the maximum/minimum value of a floating-point number without using any standard libraries, such as <float.h>. Thank you.
“Without using” exercises are a little bit stupid, so here is one version “without using” any header.
…
double nextafter(double, double);
double max = nextafter(1.0 / 0.0, 0.0);
…
And without using any library function, only assuming that double is mapped to IEEE 754's binary64 format (a very common choice):
…
double max = 0x1.fffffffffffffp1023;
…
Assuming a binary floating-point format, start with 2.0 and multiply it by 2.0 until you get an overflow. This determines the maximum exponent. Then, starting with x as the number you had right before the overflow, take the sum x + x/2 + x/4 + ... until adding x/q does not change the value of the number (or overflows again). This determines the maximum mantissa.
The smallest representable positive number can be found in a similar way.
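A rough sketch of that approach (my own code, assuming binary floating point with IEEE-style round-to-nearest and no extended intermediate precision; <stdio.h> is used only to print the result):

#include <stdio.h>

int main(void)
{
    double inf = 1.0 / 0.0;            /* positive infinity                    */

    double x = 2.0;
    while (x * 2.0 != inf)
        x *= 2.0;                      /* x is now the largest power of two    */

    double max = x, term = x / 2.0;
    for (;;) {
        double next = max + term;
        if (next == max || next == inf)
            break;                     /* term no longer contributes, or overflow */
        max = next;
        term /= 2.0;
    }

    printf("%g\n", max);               /* should match DBL_MAX, about 1.8e308  */
    return 0;
}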
From wikipedia you can read up the IEEE floating point format: http://en.wikipedia.org/wiki/Single-precision_floating-point_format
This contains
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
The page also contains information on how to interpret the exponent value. A value of 0xFF (255) in the exponent field signifies ±infinity if the significand is zero, and NaN (not a number) otherwise. The ±infinities are the largest values; the sign bit determines whether the number is +infinity or -infinity. If the question is about the largest non-infinite value, then just use the largest non-special encoding.
The largest non-infinite value has all 24 significand bits set to 1 (23 of them stored) and 0xFE (254) in the exponent field. Since the exponent is biased, the actual value is roughly significand * 2^(254-127), which comes to about 3.402823 × 10^38 in decimal according to the Wikipedia page. If you want the minimum, just set the sign bit to get the same magnitude as a negative value.
EDIT: Since this is about C, I've assumed the 32 bit IEEE float.
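If you go that route for the 32-bit IEEE float, the largest finite value can also be assembled directly from the bit pattern described above (a sketch of my own, assuming float is binary32 and the same size as uint32_t):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    /* sign 0, exponent 0xFE (254), all 23 stored significand bits set */
    uint32_t bits = (0u << 31) | (0xFEu << 23) | 0x7FFFFFu;   /* 0x7F7FFFFF */
    float max;
    memcpy(&max, &bits, sizeof max);
    printf("%g\n", max);               /* about 3.40282e+38, i.e. FLT_MAX   */
    return 0;
}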
You can figure out the number of bits the number holds by doing a sizeof(type)*8.
Then look at http://en.wikipedia.org/wiki/Double-precision_floating-point_format or http://en.wikipedia.org/wiki/Single-precision_floating-point_format
This way you can look it up in a table based on the number of bits.
This assumes that the structure is using IEEE 754.
You could start from the IEEE definition and work from there: the number of bits of exponent, the number of bits of mantissa. When you study the format, you will see that the 23 stored mantissa bits actually represent 24 bits. The reason is that the mantissa is "normalised", that is, it is shifted left so that the most significant bit is always 1. This gives the maximum number of significant bits retained from a calculation. Where has the 24th bit gone? Because it is always there (except for a zero value), it is "implied" as the 24th bit.
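As an illustration of the implied bit (my own sketch, again assuming float is IEEE 754 binary32), the full 24-bit significand of a normal float can be recovered and used to rebuild the value:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

int main(void)
{
    float f = 12345.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    uint32_t stored = bits & 0x7FFFFF;         /* the 23 explicitly stored bits */
    uint32_t full   = stored | (1u << 23);     /* OR in the implied 24th bit    */
    int      e      = (int)((bits >> 23) & 0xFF) - 127;   /* unbiased exponent  */

    /* value = full significand * 2^(e - 23); for 12345.0f this yields 12345 */
    printf("stored=0x%06X full=0x%06X value=%g\n",
           (unsigned)stored, (unsigned)full, ldexp((double)full, e - 23));
    return 0;
}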

Square root union and Bit Shift

I found this code that computes the square root, and what surprises me is the way it does it, using a union and bit shifts. This is the code:
float sqrt3(const float x)
{
    union
    {
        int   i;
        float x;
    } u;

    u.x = x;
    u.i = (1 << 29) + (u.i >> 1) - (1 << 22);
    return u.x;
}
First the value of x is saved in u.x, then a value is assigned to u.i, and then the square root of the number magically appears in u.x.
Can someone explain this algorithm to me?
The above code exhibits UB (undefined behaviour), so it should not be trusted to work on any platform. This is because it writes to one member of a union and then reads back from a different member than the one last written. It also depends heavily on endianness (the ordering of the bytes within a multi-byte integer).
However, it generally will do what is expected, and to understand why it is worthwhile for you to read about the IEEE 754 binary32 floating-point format.
Crash Course in IEEE754 binary32 format
IEEE754 commonly divides a 32-bit float into 1 sign bit, 8 exponent bits and 23 mantissa bits, thus giving
Bit #:               31   30 - 23   22 - 0
Bit Representation:  s    eeeeeeee  mmmmmmmmmmmmmmmmmmmmmmm
Value: sign * 1.mantissa * pow(2, exponent-127)
With the number essentially being in "scientific notation, base 2".
As a detail, the exponent is stored in a "biased" form (that is, it has a value 127 units too high). This is why we subtract 127 from the encoded exponent to get the "real" exponent.
Short Explanation
What your code does is halve the exponent field of x (the shift also spills into the mantissa bits, which is why the result is only approximate). This is done because the square root of a number has an exponent roughly half in magnitude.
Example in base 10
Assume we want the square root of 4000000 = 4*10^6.
4000000 ~ 4*10^6 <- Exponent is 6
4000 ~ 4*10^3 <- Divide exponent in half
Just by dividing the exponent 6 by 2, getting 3, and making it the new exponent, we are already within the right order of magnitude, and much closer to the truth: 2000 = sqrt(4000000).
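For reference, here is a sketch of the same trick written with memcpy instead of a union, sidestepping the type-punning concern mentioned above (my own variant; it assumes float is IEEE 754 binary32 and that int32_t has the same size):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Same bit trick as sqrt3(), but using memcpy for the type pun. */
static float sqrt_approx(float x)
{
    int32_t i;
    memcpy(&i, &x, sizeof i);             /* view the float's bits as an int */
    i = (1 << 29) + (i >> 1) - (1 << 22); /* halve the exponent, re-bias     */
    float r;
    memcpy(&r, &i, sizeof r);
    return r;
}

int main(void)
{
    printf("%f\n", sqrt_approx(4.0f));    /* about 2                         */
    printf("%f\n", sqrt_approx(100.0f));  /* about 10 (prints 10.25)         */
    return 0;
}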
You can find a perfect explanation on wikipedia:
Methods of computing square roots
see section: Approximations that depend on the floating point representation
So for a 32-bit single precision floating point number in IEEE format
(where notably, the power has a bias of 127 added for the represented
form) you can get the approximate logarithm by interpreting its binary
representation as a 32-bit integer, scaling it by 2^(-23), and
removing a bias of 127, i.e. log2(x) ≈ I * 2^(-23) - 127, where I is the float's bit pattern read as a 32-bit integer.
To get the square root, divide the logarithm by 2 and convert the value back.

Why floating point does not start from negative numbers when it exceeds its range?

As we all know, when an integer variable exceeds its range it starts from the other end, that is, from negative numbers. For example:
int a=2147483648;
printf("%d",a);
OUTPUT:
-2147483648 (as I was expecting)
Now I tried the same for floating points.
for example
float a=3.4e39;//as largest float is 3.4e38
printf("%f",a);
OUTPUT:
1.#INF00 (I was expecting some negative float value)
I didn't get the above output exactly, but I know it represents positive infinity.
So my question is simply: why does it not start from the other end (negative values, like integers)?
Floating point numbers are stored in a different format than integer numbers, and don't follow the same over-/under-flowing mechanics.
More specifically, the binary bit pattern for 2147483648 is 10000000000000000000000000000000 (a 1 followed by 31 zeros), which in a two's complement system (like the one used on almost all modern computers) is the same as -2147483648.
Most computers today use the IEEE 754 format for floating-point values, and those are handled quite differently from plain integers.
In IEEE 754, the maximum finite float (binary32) value is below the double value 3.4e39.
IEEE-754 says (for default rounding-direction attribute roundTiesToEven):
(IEEE-754:2008, 4.3.1 Rounding-direction attributes to nearest) "In the following two rounding-direction attributes, an infinitely precise result with magnitude at least b^emax * (b − ½ * b^(1−p)) shall round to ∞ with no change in sign; here emax and p are determined by the destination format (see 3.3)"
So in this declaration:
float a=3.4e39;
the conversion yields a positive infinity.
Under IEEE floating point, it's impossible for arithmetic to overflow because the representable range is [-INF,INF] (including the endpoints). As usual, floating point is subject to rounding when the exact value is not representable, and in your case, rounding yields INF.
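A short demonstration (my own sketch; the textual form of infinity varies by C library, e.g. inf on glibc or 1.#INF00 on older MSVC):

#include <stdio.h>
#include <float.h>

int main(void)
{
    float a = 3.4e39;            /* too large for float: converts to +infinity */
    float b = FLT_MAX * 2.0f;    /* arithmetic past the range also gives +inf  */
    printf("%f\n%f\n", a, b);    /* prints inf (or 1.#INF00, etc.), never a    */
                                 /* wrapped-around negative value              */
    return 0;
}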
Other answers have looked at floating point. This answer is about why signed integer values traditionally wrap around. It is not because that is particularly nice behavior. It is because that is what is expected because it is the way it has been done for a long time.
Especially in early hardware, with either discrete logic or very limited chip space, there was a major advantage to using the same adder for signed and unsigned integer addition and subtraction.
Floating point arithmetic was done in software except on special "scientific" computers that cost extra. Floating point numbers are always signed, and, as has been pointed out in other answers, have their own format. There is no signed/unsigned hardware sharing issue.
Common hardware for signed and unsigned integers can be achieved by using 2's complement representation for signed integer types.
What follows is based on 8 bit integers, with each bit pattern represented as 2 hexadecimal digits. Other widths work the same way.
00 through 7f have the same meaning in unsigned and 2's complement, 0 through 127 in that order, the intersection of the two ranges. 80 through ff represent 128 through 255, in that order, for unsigned integers, but represent negative numbers for signed. To make addition the same for both, 80 represents -128, and ff represents -1.
Now see what happens if you add 1 to 7f. For unsigned, it has to increment from 127 to 128. That means the resulting bit pattern is 80, which is also the most negative signed value. The price of sharing an adder is wrap-around at one point in the range.
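A small sketch of that wrap-around with 8-bit integers (my own example; the conversion of the out-of-range value back to int8_t is implementation-defined in C, but on 2's complement machines it behaves as shown):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t u = 0x7f;            /* 127 */
    int8_t  s = 0x7f;            /* 127 */

    u = (uint8_t)(u + 1);        /* unsigned: 127 + 1 goes to 128 (0x80)        */
    s = (int8_t)(s + 1);         /* same bit pattern 0x80, read as signed: -128 */

    printf("unsigned: %d (0x%02X)\n", u, (unsigned)u);
    printf("signed:   %d (0x%02X)\n", s, (unsigned)(uint8_t)s);
    return 0;
}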

Resources