I am currently reading "Computer Systems: A Programmer's Perspective". In the book, big-endian is used(most significant bits first). In the context of IEEE floating point numbers, using 32-bit single-precision, here is a citation of conversion between an integer and IEEE floating point:
One useful exercise for understanding floating-point representations
is to convert sample integer values into floating-point form. For
example, we saw in Figure
2.15 that 12,345 has binary representation [11000000111001]. We create a normalized representation of this by shifting 13 positions to the
right of a binary point, giving 12,345 = 1.10000001110012 × 2^13. To
encode this in IEEE single-precision format, we construct the fraction
field by dropping the leading 1 and adding 10 zeros to the end, giving
binary representation [10000001110010000000000]. To construct the
exponent field, we add bias 127 to 13, giving 140, which has binary
representation [10001100]. We combine this with a sign bit of 0 to get
the floating-point representation in binary of
[01000110010000001110010000000000].
What I do not understand is "by dropping the leading 1 and adding 10 zeros to the end, giving
binary representation [10000001110010000000000]." If big-endian is used, why can you add 10 zeros to the end of 1000000111001? Doesn't that lead to another value than that after the binary point? It would make sense to me if we added 10 zeros in the front since the final decimal value would still be that originally after the binary point.
Why/how can you add 10 zeros to the back without changing the value if big-endian is used?
This is how the number 12345 is represented as a 32-bit single-precision IEEE754 float:
3 2 1 0
1 09876543 21098765432109876543210
S ---E8--- ----------F23----------
Binary: 0 10001100 10000001110010000000000
Hex: 4640 E400
Precision: SP
Sign: Positive
Exponent: 13 (Stored: 140, Bias: 127)
Hex-float: +0x1.81c8p13
Value: +12345.0 (NORMAL)
Since this is a NORMAL value, the fractional part is interpreted with an implicit 1-bit; that is it's 1.10000001110010000000000. So, to fill the 23 bit mantissa you simply add 10 0's at the end as it doesn't change the value.
Endianness isn't really related to how these numbers are represented, as each bit has a fixed meaning. But in general, the most-significant-bit is to the left in both the exponent and the mantissa.
Related
Single precision floating point:
Sign bit: 1
Exponent: 8 bits
Mantissa: 23 bits
Double precision floating point:
Sign bit: 1
Exponent: 11 bits
Mantissa: 52 bits
What does this information mean?
I don't know English terms well.
A floating-point quantity (in most situations, not just C) is defined by three numbers: the sign, the significand (also called the "mantissa"), and the exponent.
These combine to form a pseudo-real number of the form
sign × significand × 2exponent
This is similar to scientific notation, except that the numbers are all binary, and the multiplication is by powers of 2, not powers of 10.
For example, the number 4.000 can be represented as
+1 × 1 × 22
The number 768.000 can be represented as
+1 × 1.5 × 29
The number -0.625 can be represented as
-1 × 1.25 × 2-1
The number 5.375 can be represented as
+1 × 1.34375 × 22
In any particular floating-point format, you can have different numbers of bits assigned to the different parts. The sign is always 0 (positive) or 1 (negative), so you only ever need one bit for that. The more bits you allocate to the significand, the more precision you can have in your numbers. The more bits you allocate to the exponent, the more range you can have for your numbers.
For example, IEEE 754 single-precision floating point has a total of 24 bits of precision for the significand (which is, yes, one more than your table called out, because there's literally one extra or "hidden" bit). So single-precision floating point has the equivalent of log10(224) or about 7.2 decimal digits worth of precision. It has 8 bits for the exponent, which gives us exponent values of about ±127, meaning we can multiply by 2±127, giving us a decimal range of about ±1038.
When you start digging into the details of actual floating-point formats, there are a few more nuances to consider. You might need to understand where the decimal point (really the "binary point" or "radix point") sits with respect to the number that is the significand. You might need to understand the "hidden 1 bit", and the concept of subnormals. You might need to understand how positive and negative exponents are represented, typically by using a bias. You might need to understand the special representations for infinity, and the "not a number" markers. You can read about all of these in general terms in the Wikipedia article on Floating point, or you can read about the specifics of the IEEE 754 floating-point standard which most computers use.
Once you understand how binary floating-point numbers work "on the inside", some of their surprising properties begin to make sense. For example, the ordinary-looking decimal fraction 0.1 is not exactly representable! In single precision, the closest you can get is
+1 × 0x1.99999a × 2-4
or equivalently
+1 × 1.60000002384185791015625 × 2-4
or equivalently
+1 × 0b1.10011001100110011001101 × 2-4
which works out to about 0.10000000149. We simply can't get any more precise than that — we can't add any more 0's to the decimal equivalent — because the significand 1.10011001100110011001101 has completely used up our 1+23 available bits of single-precision significance.
You can read more about such floating point "surprises" at this canonical SO question, and this one, and this one.
Footnote: I said everything was based on "a pseudo-real number of the form sign × significand × 2exponent, but strictly speaking, it's more like -1sign × significand × 2exponent. That is, the 1-bit sign component is 0 for positive, and 1 for negative.
So, It's almost time for midterms and the professor gave us some sample questions.
What I THINK the answer is:
We are given a float that is f=50000.
if we do f*f we get 2,500,000,000.
Now, I'm assuming we're working with a 32 bit machine as that is what we have studied so far. So, if that Is the case then 2,500,000,000 32 bit float not being declared unsigned is considered signed by default. Since 2,500,000,000 is a little over half of the 32 bit representation of 4294967296, and it is signed, we would have a negative value returned, so the statement f * f < 0 would be true, right?
I've only been studying systems programming for 4 weeks, PLEASE correct me if I am wrong here.
Unlike the int type, which is typically represented as a two's complement number, a float is a floating point type, which means it stores values using a mantissa and an exponent. This means that the typical wrapping behavior seen with signed integer types doesn't apply to floating point types.
In the case of 2,500,000,000, this will actually get stored as 0x1.2A05F2 x 231.
Floating point types are typically stored using IEEE 754 floating point format. In the case of a single precision floating point (which a float typically is), it has 1 sign bit, 8 exponent bits, and 24 mantissa bits (with 23 bits stored, as the high order "1" bit is implied).
While this format can't "wrap" from positive to negative, it is subject to 2 things:
Loss of precision
Overflow of the exponent
As an example of precision loss, let's use a decimal floating point format with a 3 digit mantissa and a 2 digit exponent. If we multiply 2.34 x 1010 by 6.78 x 1010, you get 1.58652 x 1021, but because of the 3 digit precision it gets truncated to 1.58 x 1021. So we lose the least significant digits.
To illustrate exponent overflow, suppose we were to multiply 2.00 x 1060 by 3.00 x 1050. You'd get 6.00 x 10110. But because the maximum value of an exponent is 99, this is an overflow. IEEE 754 has a special notation for infinity which it uses in the case of overflow where it sets the mantissa to all 0 bits and the exponent to all 1 bits, and the sign bit can be used to distinguish positive infinity and negative infinity.
While studying C I came to know that range of long double is 3.4E-4932 to 1.1E+4932. What is E here ? Size of long double in 10 bytes. If I assume E is 10 then how long double stores numbers till 19 places after decimal.
3.4E-4932 means . Both floats and doubles are stored in a format that keeps the exponent and the mantissa separate. In your example, -4392 will be encoded in the exponent, and 3.4 will be encoded in the mantissa, both as binary numbers.
Note that IEEE floating point formats come in a variety of ranges with availability that varies by platform. Refer to IEEE floating point for more details. As pointed out by Joe Farrell, your range is probably x86 Extended Precision Format. That format carries 1 bit for sign (s), 15 bits of binary exponent (e) with a bias of -16383, and 1 + 63 bits of binary mantissa (m). For normalized numbers, the value is computed as .
The smallest positive normalized number in this format has a sign bit of 0, an exponent of 1, and a mantissa of 1.0, corresponding to or . In binary, that number looks like:
The range of a long double (or, indeed, any floating point width) on Intel hardware is typically [-∞, ∞]. Between those endpoints many finite numbers are also representable:
0
±m×2e, where:
m is an integer between 1 and 264-1, and
e is an integer between -16445 and 16320
That means that the smallest non-zero long double is 2-16445 and the largest finite long double is (264-1)·216320 (or 216384-216320), which are approximately equal to the decimal numbers in scientific notation in the question.
See this Wikipedia article for details on the representation (which is binary, not decimal).
i was completing book "C. Programming language", but faced up with the question in which i should get the maximum\minimum value of float-pointing number, without using any of standard libraries, such as <float.h>. Thank you
“Without using” exercises are a little bit stupid, so here is one version “without using” any header.
…
double nextafter(double, double);
double max = nextafter(1.0 / 0.0, 0.0);
…
And without using any library function, only assuming that double is mapped to IEEE 754's binary64 format (a very common choice):
…
double max = 0x1.fffffffffffffp1023;
…
Assuming a binary floating-point format, start with 2.0 and multiply it by 2.0 until you get an overflow. This determines the maximum exponent. Then, starting with x as the number you had right before the overflow, take the sum x + x/2 + x/4 + ... until adding x/q does not change the value of the number (or overflows again). This determines the maximum mantissa.
The smallest representable positive number can be found a similar way.
From wikipedia you can read up the IEEE floating point format: http://en.wikipedia.org/wiki/Single-precision_floating-point_format
This contains
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
The page also contains information on how to interpret the exponent value. Value of 0xFF (255) in exponent signifies ±infinity if the significant is zero and NaN (not a number) otherwise. The +- infinity are largest numbers. The sign bit defines if the number if +infinity or -infinity. If the question is about the largest non-infinite value then just use the largest non-special value.
Largest non-infinite value is 24 bits of 1s in significand and 0xFE (254) as exponent. Since the exponent is offset the actual value is something like: significand * 2^(254-127) which is somewhere close to 3.402823 × 10^38 in decimal according to the wikipedia page. If you want the minimum, just toggle the sign bit on to get the exact same value as negative.
EDIT: Since this is about C, I've assumed the 32 bit IEEE float.
You can figure out the number of bits the number holds by doing a sizeof(type)*8.
Then look at http://en.wikipedia.org/wiki/Double-precision_floating-point_format or http://en.wikipedia.org/wiki/Single-precision_floating-point_format
This way you can look it up in a table based in the number of bits.
This assumes that the structure is using IEEE 754.
You could start from the IEEE definition, and work from there. For example, number of bits of exponent, number of bits of mantissa. When you study the format, you will see that the 23 bits of mantissa actually represent 24 bits. The reason is, the mantissa is "normalised", that is, it is left shifted so that the ms bit is always 1. This gives the maximum number of significant bits retained from a calculation. Where has the 24th bit gone? Because it is always there (except for a 0 value), it is "implied" as the 24th bit.
If you print a float with more precision than is stored in memory, aren't the extra places supposed to have zeros in them? I have code that is something like this:
double z[2*N]="0";
...
for( n=1; n<=2*N; n++) {
fprintf( u1, "%.25g", z[n-1]);
fputc( n<2*N ? ',' : '\n', u1);
}
Which is creating output like this:
0,0.7071067811865474617150085,....
A float should have only 17 decimal places (right? Doesn't 53 bits comes out to 17 decimal places). If that's so, then the 18th, 19th... 25th places should have zeros. Notice in the above output that they have digits other than 0 in them.
Am I misunderstanding something? If so, what?
No, 53 bits means that the 17 decimal places are what you can trust, but because base-10 notation that we use is in a different base from which the double is stored (binary), the later digits are just because 1/2^53 is not exactly 1/10^n, i.e.,
1/2^53 = .0000000000000001110223024625156540423631668090820312500000000
The string printed by your implementation shows the exact value of the double in your example, and this is permitted by the C standard, as I show below.
First, we should understand what the floating-point object represents. The C standard does a poor job of this, but, presuming your implementation uses the IEEE 754 floating-point standard, a normal floating-point object represents exactly (-1)s•2e•(1+f) for some sign bit s (0 or 1), exponent e (in range for the specific type, -1022 to 1023 for double), and fraction f (also in range, 52 bits after a radix point for double). Many people use the object to approximate nearby values, but, according to the standard, the object only represents the one value it is defined to be.
The value you show, 0.7071067811865474617150085, is exactly representable as a double (sign bit 0, exponent -1, and fraction bits [in hexadecimal] .6a09e667f3bcc16). It is important to understand the double with this value represents exactly that value; it does not represent nearby values, such as 0.707106781186547461715.
Now that we know the value being passed to fprintf, we can consider what the C standard says about this. First, the C standard defines a constant named DECIMAL_DIG. C 2011 5.2.4.2.2 11 defines this to be the number of decimal digits such that any floating-point number in the widest supported type can be rounded to that many decimal digits and back again without change to the value. The precision you passed to fprintf, 25, is likely greater than the value of DECIMAL_DIG on your system.
In C 2011 7.21.6.1 13, the standard says “If the number of significant decimal digits is more than DECIMAL_DIG but the source value is exactly representable with DECIMAL_DIG digits, then the result should be an exact representation with trailing zeros. Otherwise, the source value is bounded by two adjacent decimal strings L < U , both having DECIMAL_DIG significant digits; the value of the resultant decimal string D should satisfy L ≤ D ≤ U, with the extra stipulation that the error should have a correct sign for the current rounding direction.”
This wording allows the compiler some wiggle room. The intent is that the result must be accurate enough that it can be converted back to the original double with no error. It may be more accurate, and some C implementations will produce the exactly correct value, which is permitted since it satisfies the paragraph above.
Incidentally, the value you show is not the double closest to sqrt(2)/2. That value is +0x1.6A09E667F3BCDp-1 = 0.70710678118654757273731092936941422522068023681640625.
There is enough precision to represent 0.7071067811865474617150085 in double precision floating point. The 64 bit output is actually 3FE6A09E667F3BCC
The formula used to evaluate the number is an exponentiation, so you cannot say that 53 bits will take 17 decimal places.
EDIT:
Look at the example below in the wiki article for another instance:
0.333333333333333314829616256247390992939472198486328125
=2^(−54) × 15 5555 5555 5555 base16
=2^(−2) × (15 5555 5555 5555 base16 × 2^(−52) )
You are asking for float, but in your code appears double.
Anyway, neither float or double have always the same number of decimals. Float have assigned 32 bits (4 bytes) for a floating point representation according to IEEE 754.
From Wikipedia:
The IEEE 754 standard specifies a binary32 as having:
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 (23 explicitly stored)
This gives from 6 to 9 significant decimal digits precision (if a
decimal string with at most 6 significant decimal is converted to IEEE
754 single precision and then converted back to the same number of
significant decimal, then the final string should match the original;
and if an IEEE 754 single precision is converted to a decimal string
with at least 9 significant decimal and then converted back to single,
then the final number must match the original).
In the case of double, from Wikipedia again:
Double-precision binary floating-point is a commonly used format on
PCs, due to its wider range over single-precision floating point, in
spite of its performance and bandwidth cost. As with single-precision
floating-point format, it lacks precision on integer numbers when
compared with an integer format of the same size. It is commonly known
simply as double. The IEEE 754 standard specifies a binary64 as
having:
Sign bit: 1 bit
Exponent width: 11 bits
Significand precision: 53 bits (52 explicitly stored)
This gives from 15 - 17 significant
decimal digits precision. If a decimal string with at most 15
significant decimal is converted to IEEE 754 double precision and then
converted back to the same number of significant decimal, then the
final string should match the original; and if an IEEE 754 double
precision is converted to a decimal string with at least 17
significant decimal and then converted back to double, then the final
number must match the original.
On the other hand, you can't expect that if you have a float and print it out with more precision that the really stored, the rest of digits will fill with 0s. The compiler can't imagine the tricks you are trying to do.