(sorry, I'm coming up with some funny ideas... bear with me...)
Let's say I have a 'double' value, consisting of:
sign  exponent     implicit bit + mantissa
0     10000001001  (1).0011010010101010000001000001100010010011011101001100
representing 1234.6565 if I'm right.
I'd like to be able to access the fields of sign, exponent, implicit bit, and mantissa separately, as bits, and manipulate them with bitwise operations like AND, OR, XOR... or with string operations like 'left', 'mid', etc.
And then I'd like to puzzle a new double together from the manipulated bits.
E.g. setting the sign bit to 1 would make the number negative; adding or subtracting 1 to/from the exponent would double/halve the value; stripping all bits behind the position indicated by the recalculated (unbiased) value of the exponent would convert the value to an integer; and so on.
Other tasks would/could be to find the last set bit, calculate how much it contributes to the value, check if the last bit is '1' (binary 'odd') or '0' (binary 'even') and the like.
I have seen similar things in programs, but can't find them offhand. I seem to remember something like 'reinterpret cast'? I think there are libraries, toolkits, or how-tos around which offer access to such things, and I hope people are reading here who can point me to them.
I'd like a solution near simple processor instructions and simple C code. I'm working in Debian Linux and compiling with the gcc that was installed by default.
Starting point is any double value I can address as 'x'.
Starting point 2: I'm not! an experienced programmer :-(
How can I do this easily, and get it working with good performance?
This is straightforward, if a bit esoteric.
Step 1 is to access the individual bits of a float or double. There are a number of ways of doing this, but the commonest are to use a char * pointer, or a union. For our purposes today let's use a union. [There are subtleties to this choice, which I'll address in a footnote.]
#include <stdint.h>

union doublebits {
    double d;
    uint64_t bits;
};
union doublebits x;
x.d = 1234.6565;
So now x.bits lets us access the bits and bytes of our double value as a 64-bit unsigned integer. First, we could print them out:
printf("bits: %llx\n", x.bits);
This prints
bits: 40934aa04189374c
and we're on our way.
The rest is "simple" bit manipulation.
We'll start by doing it the brute-force, obvious way:
int sign = x.bits >> 63;
int exponent = (x.bits >> 52) & 0x7ff;
long long mantissa = x.bits & 0xfffffffffffff;
printf("sign = %d, exponent = %d, mantissa = %llx\n", sign, exponent, mantissa);
This prints
sign = 0, exponent = 1033, mantissa = 34aa04189374c
and these values exactly match the bit decomposition you showed in your question, so it looks like you were right about the number 1234.6565.
What we have so far are the raw exponent and mantissa values.
As you know, the exponent is offset, and the mantissa has an implicit leading "1", so let's take care of those:
exponent -= 1023;
mantissa |= 1ULL << 52;
(Actually this isn't quite right. Soon enough we're going to have to address some additional complications having to do with denormalized numbers, and infinities and NaNs.)
Now that we have the true mantissa and exponent, we can do some math to recombine them, to see if everything is working:
double check = (double)mantissa * pow(2, exponent);
But if you try that, it gives the wrong answer, and it's because of a subtlety that, for me, is always the hardest part of this stuff: Where is the decimal point in the mantissa, really?
(Actually, it's not a "decimal point", anyway, because we're not working in decimal. Formally it's a "radix point", but that sounds too stuffy, so I'm going to keep using "decimal point", even though it's wrong. Apologies to any pedants whom this rubs the wrong way.)
When we did mantissa * pow(2, exponent) we assumed a decimal point, in effect, at the right end of the mantissa, but really, it's supposed to be 52 bits to the left of that (where that number 52 is, of course, the number of explicit mantissa bits). That is, our hexadecimal mantissa 0x134aa04189374c (with the leading 1 bit restored) is actually supposed to be treated more like 0x1.34aa04189374c. We can fix this by adjusting the exponent, subtracting 52:
double check = (double)mantissa * pow(2, exponent - 52);
printf("check = %f\n", check);
So now check is 1234.6565 (plus or minus some roundoff error). And that's the same number we started with, so it looks like our extraction was correct in all respects.
But we have some unfinished business, because for a fully general solution, we have to handle "subnormal" (also known as "denormalized") numbers, and the special representations inf and NaN.
These wrinkles are controlled by the exponent field. If the exponent (before subtracting the bias) is exactly 0, this indicates a subnormal number, that is, one whose mantissa is not in the normal range of (decimal) 1.00000 to 1.99999. A subnormal number does not have the implicit leading "1" bit, and the mantissa ends up being in the range from 0.00000 to 0.99999. (This also ends up being the way the ordinary number 0.0 has to be represented, since it obviously can't have that implicit leading "1" bit!)
On the other hand, if the exponent field has its maximum value (that is, 2047, or 2^11 - 1, for a double) this indicates a special marker. In that case, if the mantissa is 0, we have an infinity, with the sign bit distinguishing between positive and negative infinity. Or, if the exponent is max and the mantissa is not 0, we have a "not a number" marker, or NaN. The specific nonzero value in the mantissa can be used to distinguish between different kinds of NaN, like "quiet" and "signaling" ones, although it turns out the particular values that might be used for this aren't standard, so we'll ignore that little detail.
(If you're not familiar with infinities and NaNs, they're what IEEE-754 says that certain operations are supposed to return when the proper mathematical result is, well, not an ordinary number. For example, sqrt(-1.0) returns NaN, and 1./0. typically gives inf. There's a whole set of IEEE-754 rules about infinities and NaNs, such as that atan(inf) returns π/2.)
The bottom line is that instead of just blindly tacking on the implicit 1 bit, we have to check the exponent value first, and do things slightly differently depending on whether the exponent has its maximum value (indicating specials), an in-between value (indicating ordinary numbers), or 0 (indicating subnormal numbers):
if (exponent == 2047) {
    /* inf or NaN */
    if (mantissa != 0)
        printf("NaN\n");
    else if (sign)
        printf("-inf\n");
    else
        printf("inf\n");
} else if (exponent != 0) {
    /* ordinary value */
    mantissa |= 1ULL << 52;
} else {
    /* subnormal */
    exponent++;
}
exponent -= 1023;
That last adjustment, adding 1 to the exponent for subnormal numbers, reflects the fact that subnormals are "interpreted with the value of the smallest allowed exponent, which is one greater" (per the Wikipedia article on subnormal numbers).
I said this was all "straightforward, if a bit esoteric", but as you can see, while extracting the raw mantissa and exponent values is indeed pretty straightforward, interpreting what they actually mean can be a challenge!
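Putting the pieces together, here's the whole decode as one self-contained sketch; this is just one way to write it, and it assumes 64-bit IEEE-754 doubles (compile with -lm for pow):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Decode a double into sign/exponent/mantissa, handling normals,
 * subnormals, infinities, and NaNs, then recombine as a check. */
void decode(double d)
{
    union { double d; uint64_t bits; } x = { .d = d };
    int sign = (int)(x.bits >> 63);
    int exponent = (int)((x.bits >> 52) & 0x7ff);
    uint64_t mantissa = x.bits & 0xfffffffffffff;

    if (exponent == 2047) {               /* special marker: inf or NaN */
        printf("%s\n", mantissa != 0 ? "NaN" : (sign ? "-inf" : "inf"));
        return;
    }
    if (exponent != 0)
        mantissa |= 1ULL << 52;           /* ordinary: restore implicit 1 */
    else
        exponent++;                       /* subnormal */
    exponent -= 1023;

    double check = (sign ? -1.0 : 1.0) * (double)mantissa * pow(2, exponent - 52);
    printf("check = %.10g\n", check);
}

int main(void)
{
    decode(1234.6565);   /* check = 1234.6565 */
    decode(5e-324);      /* smallest subnormal, comes back out intact */
    decode(-INFINITY);   /* -inf */
    return 0;
}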
If you already have raw exponent and mantissa numbers, going back in the other direction — that is, constructing a double value from them — is just about as straightforward:
sign = 1;
exponent = 1024;
mantissa = 0x921fb54442d18;
x.bits = ((uint64_t)sign << 63) | ((uint64_t)exponent << 52) | mantissa;
printf("%.15f\n", x.d);
This answer is getting too long, so for now I'm not going to delve into the question of how to construct appropriate exponent and mantissa numbers from scratch for an arbitrary real number. (Me, I usually do the equivalent of x.d = atof(the number I care about), and then use the techniques we've been discussing so far.)
Your original question was about "bitwise splitting", which is what we've been discussing. But it's worth noting that there's a much more portable way to do all this, if you don't want to muck around with raw bits, and if you don't want/need to assume that your machine uses IEEE-754. If you just want to split a floating-point number into a mantissa and an exponent, you can use the standard library frexp function:
int exp;
double mant = frexp(1234.6565, &exp);
printf("mant = %.15f, exp = %d\n", mant, exp);
This prints
mant = 0.602859619140625, exp = 11
and that looks right, because 0.602859619140625 × 2^11 = 1234.6565 (approximately). (How does it compare to our bitwise decomposition? Well, our mantissa was 0x34aa04189374c, or 0x1.34aa04189374c, which in decimal is 1.20571923828125, which is twice the mantissa that frexp just gave us. But our exponent was 1033 - 1023 = 10, which is one less, so it comes out in the wash: 1.20571923828125 × 2^10 = 0.602859619140625 × 2^11 = 1234.6565.)
There's also a function ldexp that goes in the other direction:
double x2 = ldexp(mant, exp);
printf("%f\n", x2);
This prints 1234.656500 again.
Footnote: When you're trying to access the raw bits of something, as of course we've been doing here, there are some lurking portability and correctness questions having to do with something called strict aliasing. Strictly speaking, and depending on who you ask, you may need to use an array of unsigned char as the other part of your union, not uint64_t as I've been doing here. And there are those who say that you can't portably use a union at all, that you have to use memcpy to copy the bytes into a completely separate data structure, although I think they're talking about C++, not C.
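For completeness, here's what that memcpy approach looks like; a short sketch, again assuming 64-bit IEEE-754 doubles:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    double d = 1234.6565;
    uint64_t bits;

    memcpy(&bits, &d, sizeof bits);   /* copy raw bytes; no aliasing worries */
    printf("bits: %llx\n", (unsigned long long)bits);

    bits ^= 1ULL << 63;               /* e.g. flip the sign bit */
    memcpy(&d, &bits, sizeof d);
    printf("d = %f\n", d);            /* -1234.656500 */
    return 0;
}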
I am reading the Serialization section of a tutorial: http://beej.us/guide/bgnet/html/#serialization
And I am reviewing the code which encodes the number into a portable binary form.
#include <stdint.h>

uint32_t htonf(float f)
{
    uint32_t p;
    uint32_t sign;

    if (f < 0) { sign = 1; f = -f; }
    else { sign = 0; }

    p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31);     // whole part and sign
    p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction

    return p;
}

float ntohf(uint32_t p)
{
    float f = ((p>>16)&0x7fff);           // whole part
    f += (p&0xffff) / 65536.0f;           // fraction

    if (((p>>31)&0x1) == 0x1) { f = -f; } // sign bit set

    return f;
}
I ran into problems with this line p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign .
According to the original code comments, this line extracts the whole part and sign, and the next line deals with fraction part.
Then I found an image about how float is represented in memory and started the calculation by hand.
From Wikipedia Single-precision floating-point format:
So I then presumed that whole part == exponent part.
But ((((uint32_t)f)&0x7fff)<<16) is getting the last 15 bits of the fraction part, if based on the image above.
Now I get confused, where did I get wrong?
It's important to realize what this code is not. This code does not do anything with the individual bits of a float value. (If it did, it wouldn't be portable and machine-independent, as it claims to be.) And the "portable" string representation it creates is fixed point, not floating point.
For example, if we use this code to convert the number -123.125, we will get the binary result
10000000011110110010000000000000
or in hexadecimal
807b2000
Now, where did that number 10000000011110110010000000000000 come from? Let's break it up into its sign, whole number, and fractional parts:
1 000000001111011 0010000000000000
The sign bit is 1 because our original number was negative. 000000001111011 is the 15-bit binary representation of 123. And 0010000000000000 is 8192. Where did 8192 come from? Well, 8192 ÷ 65536 is 0.125, which was our fractional part. (More on this below.)
How did the code do this? Let's walk through it step by step.
(1) Extract sign. That's easy: it's the ordinary test if(f < 0).
(2) Extract whole-number part. That's also easy: We take our floating-point number f, and cast it to type uint32_t. When you convert a floating-point number to an integer in C, the behavior is pretty obvious: it throws away the fractional part and gives you the integer. So if f is 123.125, (uint32_t)f is 123.
(3) Extract fraction. Since we've already got the integer part, we can isolate the fraction by starting with the original floating-point number f, and subtracting the integer part. That is, 123.125 - 123 = 0.125. Then we multiply the fractional part by 65536, which is 2^16.
It may not be obvious why we multiplied by 65536 and not some other number. In one sense, it doesn't matter what number you use. The goal here is to take a fractional number f and turn it into two integers a and b such that we can recover the fractional number f again later (although perhaps approximately). The way we're going to recover the fractional number f again later is by computing
a + b / x
where x is, well, some other number. If we chose 1000 for x, we'd break 123.125 up into a and b values of 123 and 125. We're choosing 65536, or 2^16, for x because that lets us make maximal use of the 16 bits we've allocated for the fractional part in our representation. Since x is 65536, b has to be some number we can divide by 65536 in order to get 0.125. So since b / 65536 = 0.125, by simple algebra we have b = 0.125 * 65536. Make sense?
Anyway, let's now look at the actual code for performing steps 1, 2, and 3.
if (f < 0) { sign = 1; f = -f; }
Easy peasy. If f is negative, our sign bit will be 1, and we want the rest of the code to operate on the positive version of f.
p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31);
As mentioned, the important part here is (uint32_t)f, which just grabs the integer (whole-number) part of f. The bitmask & 0x7fff extracts the low-order 15 bits of it, throwing anything else away. (This is because our "portable representation" only allocates 15 bits for the whole-number part, meaning that numbers greater than 2^15 - 1, or 32767, can't be represented.) The shift << 16 moves it into the high half of the eventual uint32_t result, where it belongs. And then | (sign<<31) takes the sign bit and puts it in the high-order position where it belongs.
p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction
Here, (int)f recomputes the integer (whole-number) part of f, and then f - (int)f extracts the fraction. We multiply it by 65536, as explained above. There may still be a fractional part (even after the multiplication, that is), so we cast to (uint32_t) again to throw that away, retaining only the integer part. We can only handle 16 bits of fraction, so we extract those bits (discarding anything else) with & 0xffff, although this should be unnecessary since we started with a positive fractional number less than 1, and multiplied it by 65536, so we should end up with a positive number less than 65536, i.e. we shouldn't have a number that won't exactly fit in 16 bits. Finally, the p |= operation stuffs these 16 bits we've just computed into the low-order half of p, and we're done.
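To confirm the hand calculation, here are the guide's two functions wrapped in a minimal test harness (the harness is mine, not from the guide):

#include <stdio.h>
#include <stdint.h>

uint32_t htonf(float f)
{
    uint32_t p, sign;
    if (f < 0) { sign = 1; f = -f; }
    else { sign = 0; }
    p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31);     // whole part and sign
    p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction
    return p;
}

float ntohf(uint32_t p)
{
    float f = ((p>>16)&0x7fff);           // whole part
    f += (p&0xffff) / 65536.0f;           // fraction
    if (((p>>31)&0x1) == 0x1) { f = -f; } // sign bit set
    return f;
}

int main(void)
{
    uint32_t p = htonf(-123.125f);
    printf("packed:   %08x\n", (unsigned)p);  /* 807b2000, as computed by hand */
    printf("unpacked: %f\n", ntohf(p));       /* -123.125000 */
    return 0;
}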
Addendum: It may still not be obvious where the number 65536 came from, and why that was used instead of 10000 or some other number. So let's review two key points: we're ultimately dealing with integers here. Also, in one sense, the number 65536 actually was pretty arbitrary.
At the end of the day, any bit pattern we're working with is "really" just an integer. It's not a character, or a floating-point number, or a pointer — it's just an integer. If it has N bits, it represents integers from 0 to 2^N - 1.
In the fixed-point representation we're using here, there are three subfields: a 1-bit sign, a 15-bit whole-number part, and a 16-bit fraction part.
The interpretation of the sign and whole-number parts is obvious. But the question is: how shall we represent a fraction using a 16-bit integer?
And the answer is, we pick a number, almost any number, to divide by. We can call this number a "scaling factor".
We really can pick almost any number. Suppose I chose the number 3467 as my scaling factor. Here is how I would then represent several different fractions as integers:
½ → 1734/3467 → 1734
⅓ → 1155/3467 → 1155
0.125 → 433/3467 → 433
So my fractions ½, ⅓, and 0.125 are represented by the integers 1734, 1155, and 433. To recover my original fractions, I just divide by 3467:
1734 → 1734 ÷ 3467 → 0.500144
1155 → 1155 ÷ 3467 → 0.333141
433 → 433 ÷ 3467 → 0.124891
Obviously I wasn't able to recover my original fractions exactly, but I came pretty close.
The other thing to wonder about is, where does that number 3467 "live"? If you're just looking at the numbers 1734, 1155, and 433, how do you know you're supposed to divide them by 3467? And the answer is, you don't know, at least, not just by looking at them. 3467 would have to be part of the definition of my silly fractional number format; people would just have to know, because I said so, that they had to multiply by 3467 when constructing integers to represent fractions, and divide by 3467 when recovering the original fractions.
And the other thing to look at is what the implications are of choosing various different scaling factors. The first thing is that, since in the end we're going to be using a 16-bit integer for the fractional representation, we absolutely can't use a scaling factor any greater than 65536. If we did, sometimes we'd end up with an integer greater than 65535, and it wouldn't fit in 16 bits. For example, suppose we tried to use a scaling factor of 70000, and suppose we tried to represent the fraction 0.95. Now, 0.95 is equal to 66500/70000, so our integer would be 66500, but that doesn't fit in 16 bits.
On the other hand, it turns out that ideally we don't want to use a number less than 65536, either. The smaller a number we use, the more of our 16-bit fractional representation we'll waste. When I chose 3467 in my silly example a little earlier, that meant I could represent fractions from 0/3467 = 0.000000, in steps of 1/3467 = 0.000288, up to 3466/3467 = 0.999711. But I'd never use any of the integers from 3467 through 65535. They'd be wasted, and by not using them, I'd unnecessarily limit the precision of the fractions I could represent.
The "best" (least wasteful) scaling factor to use is 65536, although there's one other consideration, namely, which fractions do you want to be able to represent exactly? When I used 3467 as my scaling factor, I couldn't represent any of my test numbers ½, ⅓, or 0.125 exactly. If we use 65536 as the scaling factor, it turns out that we can represent fractions involving small powers of two exactly — that is, halves, quarters, eights, sixteenths, etc. — but not any other fractions, and in particular not most of the decimal fractions like 0.1. If we wanted to be able to represent decimal fractions exactly, we would have to use a scaling factor that was a power of 10. The largest power of 10 that will fit in 16 bits is 10000, and that would indeed let us exactly represent decimal fractions as small as 0.00001, although we'd waste about 5/6 (or 85%) of our 16-bit fractional range.
So if we wanted to represent decimal fractions exactly, without wasting precision, the inescapable conclusion is that we should not have allocated 16 bits for our fraction field in the first place. Better choices would have been 10 bits (ideal scaling factor 1024, we'd use 1000, wasting only 2%) or 20 bits (ideal scaling factor 1048576, we'd use 1000000, wasting about 5%).
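To make those tradeoffs concrete, here's a tiny sketch (mine, not from the guide) that encodes 0.125 and 0.1 with both a power-of-two and a power-of-ten scaling factor:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    double fracs[] = { 0.125, 0.1 };
    unsigned scales[] = { 65536, 10000 };

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            /* encode with rounding, then decode by dividing again */
            uint32_t enc = (uint32_t)(fracs[i] * scales[j] + 0.5);
            printf("%.3f at scale %5u -> %5u -> %.10f\n",
                   fracs[i], scales[j], (unsigned)enc,
                   (double)enc / scales[j]);
        }
    return 0;
}

Eighths come back exactly under both factors, but 0.1 survives only the power-of-ten one (65536 turns it into 6554/65536 = 0.1000061035...).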
The relevant excerpts from the page are
The thing to do is to pack the data into a known format and send that over the wire for decoding. For example, to pack floats, here’s something quick and dirty with plenty of room for improvement
and
On the plus side, it’s small, simple, and fast. On the minus side, it’s not an efficient use of space and the range is severely restricted—try storing a number greater-than 32767 in there and it won’t be very happy! You can also see in the above example that the last couple decimal places are not correctly preserved.
The code is presented only as an example. It is really quick and dirty, because it packs and unpacks the float as a fixed point number with 16 bits for fractions, 15 bits for integer magnitude and one for sign. It is an example and does not attempt to map floats 1:1.
It is in fact a rather incredibly stupid algorithm: it can map 1:1 all IEEE 754 float32s within the magnitude range ~256...32767 without losing a bit of information, truncates the fractions of floats in the range 0...255 to 16 bits, and fails spectacularly for any number >= 32768. And for NaNs.
As for the endianness problem: for any protocol that does not intrinsically work with integers >= 32 bits, someone needs to decide how to serialize these integers, in turn, into the other format. For example, at the lowest levels of the Internet, data consists of 8-bit octets.
There are 24 obvious ways of mapping a 32-bit unsigned integer onto 4 octets, of which 2 are now generally used, and some more historically. Of course there are countably infinite (and exponentially sillier) ways of encoding them...
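For what it's worth, here's a sketch of one of the two mappings in general use, the big-endian "network byte order" one; it behaves the same regardless of the host machine's own endianness:

#include <stdint.h>

/* Serialize a uint32_t into 4 octets, most significant byte first. */
void pack_u32_be(uint32_t v, unsigned char out[4])
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)(v);
}

/* And the inverse: reassemble the uint32_t from the 4 octets. */
uint32_t unpack_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}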
flt32 flt32_abs (flt32 x) {
    int mask = x >> 31;
    printMask(mask, 32);
    puts("Original");
    printMask(x, 32);
    x = x ^ mask;
    puts("after XOR");
    printMask(x, 32);
    x = x - mask;
    puts("after x-mask");
    printMask(x, 32);
    return x;
}
Here's my code; calling the function on the value -32 returns .125. I'm confused because it's a pretty straight-up formula for abs on bits, but I seem to be missing something. Any ideas?
Is flt32 a type for floating point or fixed point numbers?
I suspect it's a type for fixed point arithmetic and you are not using it correctly. Let me explain it.
A fixed-point number uses, as the name says, a fixed position for the radix point; this means it uses a fixed number of bits for the fractional part. It is, in fact, a scaled integer.
I guess the flt32 type you are using uses the most significant 24 bits for the whole part and the least significant 8 bits for the fractional part; the value, as a real number, of a 32-bit representation is its value as an integer divided by 256 (i.e. 2^8).
For example, the 32-bit value 0x00000020 interpreted as an integer is 32. As a fixed-point number using 8 bits for the fractional part, its value is 0.125 (= 32/256).
The code you posted is correct but you are not using it correctly.
The number -32 encoded as a fixed-point number with 8 fractional bits is 0xFFFFE000, which is the integer representation of -8192 (= -32 × 256). The algorithm correctly produces 8192, which is 0x00002000 (= 32 × 256); this is also 32 when it is interpreted as fixed-point.
If you pass -32 to the function without taking care to encode it as fixed-point, it correctly converts it to 32 and returns this value. But 32 (0x00000020) is 0.125 (= 1/8 = 32/256) when it is interpreted as fixed-point (which I assume the function printMask() does).
How can you test the code correctly?
You probably have a function that creates fixed-point numbers from integers. Use it to get the correct representation of -32 and pass that value to the flt32_abs() function.
In case you don't have such a function, it is easy to write. Just multiply the integer by 256 (or, even better, left-shift it 8 bits) and that's all:

fx32 int_to_fx32(int x)
{
    return x << 8;
}
The fixed-point libraries usually use macros for such conversions because they produce faster code. Expressed as macro, it looks like this:
#define int_to_fx32(x) ((x) << 8)
Now you do the test:
fx32 negative = int_to_fx32(-32);
fx32 positive = fx32_abs(negative);
// This should print 32
printMask(positive, 32);
// This should print 8192
printf("%d", positive);
// This should print -8192
printf("%d", negative);
// This should print 0.125
printMask(32, 32);
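Here's the whole test as one self-contained program. The OP's printMask() wasn't shown, so the version below is a guess that prints its argument interpreted as 24.8 fixed point, and fx32 itself is assumed to be a 24.8 type as described above:

#include <stdio.h>
#include <stdint.h>

typedef int32_t fx32;              /* assumed: 24.8 fixed point */

/* Multiply rather than shift, since left-shifting a negative value
 * is technically undefined; the result is the same on ordinary machines. */
#define int_to_fx32(x) ((x) * 256)

fx32 fx32_abs(fx32 x)
{
    fx32 mask = x >> 31;           /* assumes arithmetic right shift */
    return (x ^ mask) - mask;
}

/* Guessed stand-in for the OP's printMask(): show the low n bits
 * of v interpreted as a 24.8 fixed-point value. */
void printMask(fx32 v, int n)
{
    (void)n;                       /* width ignored in this simple guess */
    printf("%g\n", v / 256.0);
}

int main(void)
{
    fx32 negative = int_to_fx32(-32);  /* 0xFFFFE000, i.e. -8192 */
    fx32 positive = fx32_abs(negative);

    printMask(positive, 32);           /* 32 */
    printf("%d\n", positive);          /* 8192 */
    printf("%d\n", negative);          /* -8192 */
    printMask(32, 32);                 /* 0.125 */
    return 0;
}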
int flt32_abs (int x) {
^^^ ^^^
int mask=x>>31;
x=x^mask;
x=x-mask;
return x;
}
I've been able to fix this and obtain the result of 32 by changing float to int, else the code wouldn't build with the error:
error: invalid operands of types 'float' and 'int' to binary 'operator>>'
For an explanation of why binary operations on floats are not allowed in C++, see
How to perform a bitwise operation on floating point numbers
I would like to ask more experienced developers, why did the code even build for OP? Relaxed compiler settings, I guess?
I was reading C Primer Plus; in chapter 3, on data types, the author says:
If you take the bit pattern that represents the float number 256.0 and interpret it as a long value, you get 1132462080.
I don't understand how the conversion works. Can someone helps me with this? Thanks.
256.0 is 1.0 * 2^8, right?
Now, look at the format (stealing it from #bash.d):
31 0
| |
SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM //S - SIGN , E - EXPONENT, M - MANTISSA
The number is positive, so 0 goes into S.
The exponent, 8, goes into EEEEEEEE but before it goes there you need to add 127 to it as required by the format, so 135 goes there.
Now, of 1.0 only what's to the right of the point is actually stored in MMMMMMMMMMMMMMMMMMMMMMM, so 0 goes there. The 1. is implied for most numbers represented in the format and isn't actually stored in the format.
The idea here is that the absolute values of all nonzero numbers can be transformed into
1.0...1.111(1) × 10^(some integer) (all numbers are binary)
or nearly equivalently
1.0...1.999(9) × 2^(some integer) (all numbers are decimal)
and that's what I did at the top of my answer. The transformation is done by repeated division or multiplication of the number by 2 until you get the mantissa in the decimal range [1.0, 2.0) (or [1.0, 10.0) in binary). Since there's always this 1 in a non-zero number, why store it? And so it's not stored and gives you another free M bit.
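That repeated halving or doubling is easy to spell out in code; here's a quick sketch normalizing 256.0 (positive, finite input assumed):

#include <stdio.h>

int main(void)
{
    double m = 256.0;   /* the number to normalize */
    int e = 0;

    /* halve or double until the mantissa lands in [1.0, 2.0) */
    while (m >= 2.0) { m /= 2.0; e++; }
    while (m <  1.0) { m *= 2.0; e--; }

    printf("m = %f, e = %d\n", m, e);   /* m = 1.000000, e = 8 */
    return 0;
}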
So you end up with:
(0 << 31) + ((8 + 127) << 23) + 0 = 1132462080
The format is described here.
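If you want to watch the 1132462080 come out of a real machine, here's a quick check, assuming 32-bit IEEE-754 floats:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = 256.0f;
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits);     /* reinterpret the same 4 bytes */
    printf("%u\n", bits);               /* 1132462080 */
    printf("0x%08x\n", (unsigned)bits); /* 0x43800000 */
    return 0;
}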
What's important from that quote is that integers/longs and floats are stored in different formats in memory, so you cannot simply take a piece of memory that holds a float, declare that it is now an int, and get a correct value.
The specifics of how each data type is stored in memory can be found by searching for the IEEE 754 standard, but again, that probably isn't the objective of the quote. What it tries to tell you is that floats and integers are stored using different patterns, and you cannot simply use a float number as an int or vice versa.
While integer and long values are usually represented using two's complement, float values have a special encoding, because you cannot represent fractional values using plain integer bits.
A 32-bit float number contains a sign bit, a mantissa and an exponent. Together these determine what value the float has.
See here for an article.
EDIT
So, this is what a float encoded by IEEE 754 looks like (32-bit)
31 0
| |
SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM //S - SIGN , E - EXPONENT, M - MANTISSA
I don't know the pattern for 256.0, but the long value will be purely interpreted as
31 0
| |
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB // B - BIT
So there is no "conversion", but a different interpretation.
I wrote the following program:
#include <stdio.h>

int main(void)
{
    float f;

    printf("\nInput a floating-point no.: ");
    scanf("%f", &f);
    printf("\nOutput: %f\n", f);
    return 0;
}
I am on Ubuntu and used GCC to compile the above program. Here is my sample run and output I want to inquire about:
Input a floating-point no.: 125.1
Output: 125.099998
Why does the precision change?
Because the number 125.1 is impossible to represent exactly with floating-point numbers. This happens in most programming languages. Use e.g. printf("%.1f", f); if you want to print the number with one decimal, but be warned: the number itself is not exactly equal to 125.1.
Thank you all for your answers. Although almost all of you helped me look in the right direction I could not understand the exact reason for this behavior. So I did a bit of research in addition to reading the pages you guys pointed me to. Here is my understanding for this behavior:
Single-precision floating-point numbers typically use 4 bytes of storage on x86/x86-64 architectures. However, not all 32 bits (4 bytes = 32 bits) are used to store the magnitude of the number.
For storing as a single precision floating type, the input stream is formatted in the following notation (somewhat similar to scientific notation):
(-1)^s × 1.m × 2^(e-127), where
s = sign of the number, range: {0,1} - takes up 1 bit
m = mantissa (fractional portion) of the number - takes up 23 bits
e = exponent of the number offset by 127, range: {0,...,255} - takes up 8 bits
and then stored in memory as
0th byte 1st byte 2nd byte 3rd byte
mmmmmmmm mmmmmmmm emmmmmmm seeeeeee
Therefore the decimal number 125.1 is first converted to binary form but limited to 24 bits so that the mantissa is represented by no more than 23 bits. After conversion to binary form:
125.1 = 1111101.00011001100110011
NOTE: 0.1 in decimal would require infinitely many bits in binary, but the computer limits the representation to 17 bits here so that the complete representation does not exceed 24 bits.
Now converting it into the specified notation we get:
125.1 = 1.111101 00011001100110011 × 2^6
= (-1)^0 × 1.111101 00011001100110011 × 2^(133-127)
which implies
s = 0
m = 11110100011001100110011
e = 133 = 10000101
Therefore, 125.1 will be stored in memory as:
0th byte 1st byte 2nd byte 3rd byte
mmmmmmmm mmmmmmmm emmmmmmm seeeeeee
00110011 00110011 11111010 01000010
On being passed to the printf() function, the output stream is generated by converting the binary form back to decimal form. The bytes are actually stored in reverse order (x86 is little-endian) and hence read in this order:
3rd byte 2nd byte 1st byte 0th byte
seeeeeee emmmmmmm mmmmmmmm mmmmmmmm
01000010 11111010 00110011 00110011
Next, it is converted back using the notation above:
(-1)^0 × 1.111101 00011001100110011 × 2^(133-127)
On simplifying the above representation further:
= 1.111101 00011001100110011 × 2^6
= 1111101.00011001100110011
And finally converting it to decimal:
= 125.0999984741210938
and since printf's %f format prints only six digits after the decimal point by default, the displayed answer is rounded off to 125.099998.
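You can see both the default rounding and the value actually stored by printing the same float at different precisions; a quick sketch:

#include <stdio.h>

int main(void)
{
    float f = 125.1f;

    printf("%f\n", f);             /* 125.099998 -- printf's default 6 digits */
    printf("%.16f\n", (double)f);  /* 125.0999984741210938 -- the stored value */
    return 0;
}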
Think about a fixed point representation first.
2^3=8 2^2=4 2^1=2 2^0=1 2^-1=1/2 2^-2=1/4 2^-3=1/8 2^-4=1/16
If we want to represent a fraction then we set the bits to the right of the point, so 5.5 is represented as 01011000 (i.e. 0101.1000).
But if we want to represent 5.6, there is not an exact fractional representation. The closest we can get is 01011001 == 5.5625
That's because the fraction .1001 is 2^-1 + 2^-4 = 1/2 + 1/16 = 0.5625.
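A couple of lines of C make the same point; the 4.4 bit split here is my reading of the table above:

#include <stdio.h>

int main(void)
{
    /* 4.4 fixed point: value = bits / 16.0 */
    unsigned b55 = 0x58;            /* 0101.1000 */
    unsigned b56 = 0x59;            /* 0101.1001, the closest we get to 5.6 */

    printf("%f\n", b55 / 16.0);     /* 5.500000 */
    printf("%f\n", b56 / 16.0);     /* 5.562500 */
    return 0;
}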
Because it's the closest representation of 125.1; remember that single-precision floating-point numbers are just 32 bits.
If I tell you to write down 1/3 as a decimal number, you realize there are numbers which have no finite representation. 0.1 is the exact representation of 1/10, so in decimal this problem does not appear; BUT in binary representation, .1 is one of those numbers that require infinite digits. Since your number must be cut off somewhere, something is lost.
Floating-point numbers don't, in general, represent decimal values exactly; they all have limited accuracy. When converting from a number in text to a float (with scanf or otherwise), you're in another world with different kinds of numbers, and precision may be lost. The same thing goes when converting from a float to a string: you decide how many digits you want. You can't know "how many digits there are" in a float before converting it to text or another format that can keep that information. This all has to do with how floats are stored:
significant_digits × base^exponent
The normal type used for floating point in C is double, not float. Your float is implicitly cast to a double, and because the float is less precise, the difference to the closest representable number to 125.1 is more apparent (and printf's default precision is tailored for use with doubles). Try this instead:
#include <stdio.h>

int main(void)
{
    double f;

    printf("\nInput a floating-point no.: ");
    scanf("%lf", &f);
    printf("\nOutput: %f\n", f);
    return 0;
}