Why aren't the rightmost digits zeros (C/Linux)?

If you print a float with more precision than is stored in memory, aren't the extra places supposed to have zeros in them? I have code that is something like this:
double z[2*N] = {0};
...
for (n = 1; n <= 2*N; n++) {
    fprintf(u1, "%.25g", z[n-1]);
    fputc(n < 2*N ? ',' : '\n', u1);
}
Which is creating output like this:
0,0.7071067811865474617150085,....
A float should have only 17 decimal places (right? Doesn't 53 bits come out to 17 decimal places?). If that's so, then the 18th, 19th, ... 25th places should have zeros. Notice in the above output that they have digits other than 0 in them.
Am I misunderstanding something? If so, what?

No. 53 bits means that about 17 decimal digits are what you can trust, but because the base-10 notation we use is a different base from the one the double is stored in (binary), the later digits are not zeros; they appear because 1/2^53 is not exactly 1/10^n for any n, i.e.,
1/2^53 = .0000000000000001110223024625156540423631668090820312500000000
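To see this concretely, here is a minimal sketch, assuming an IEEE 754 double and a C library (such as glibc) that prints the exact stored value when extra digits are requested; link with -lm for sqrt and ldexp:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double ulp53 = ldexp(1.0, -53);   /* exactly 2^-53, the value quoted above */
    double z = sqrt(2.0) / 2.0;       /* the kind of value in the question     */

    printf("%.60f\n", ulp53);  /* digits continue well past the 17th significant place */
    printf("%.25g\n", z);      /* 25 significant digits of the stored binary value     */
    return 0;
}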

The string printed by your implementation shows the exact value of the double in your example, and this is permitted by the C standard, as I show below.
First, we should understand what the floating-point object represents. The C standard does a poor job of this, but, presuming your implementation uses the IEEE 754 floating-point standard, a normal floating-point object represents exactly (−1)^s • 2^e • (1+f) for some sign bit s (0 or 1), exponent e (in range for the specific type, -1022 to 1023 for double), and fraction f (also in range, 52 bits after a radix point for double). Many people use the object to approximate nearby values, but, according to the standard, the object only represents the one value it is defined to be.
The value you show, 0.7071067811865474617150085, is exactly representable as a double (sign bit 0, exponent -1, and fraction bits [in hexadecimal] .6a09e667f3bcc). It is important to understand that the double with this value represents exactly that value; it does not represent nearby values, such as 0.707106781186547461715.
Now that we know the value being passed to fprintf, we can consider what the C standard says about this. First, the C standard defines a constant named DECIMAL_DIG. C 2011 5.2.4.2.2 11 defines this to be the number of decimal digits such that any floating-point number in the widest supported type can be rounded to that many decimal digits and back again without change to the value. The precision you passed to fprintf, 25, is likely greater than the value of DECIMAL_DIG on your system.
In C 2011 7.21.6.1 13, the standard says “If the number of significant decimal digits is more than DECIMAL_DIG but the source value is exactly representable with DECIMAL_DIG digits, then the result should be an exact representation with trailing zeros. Otherwise, the source value is bounded by two adjacent decimal strings L < U , both having DECIMAL_DIG significant digits; the value of the resultant decimal string D should satisfy L ≤ D ≤ U, with the extra stipulation that the error should have a correct sign for the current rounding direction.”
This wording allows the compiler some wiggle room. The intent is that the result must be accurate enough that it can be converted back to the original double with no error. It may be more accurate, and some C implementations will produce the exactly correct value, which is permitted since it satisfies the paragraph above.
Incidentally, the value you show is not the double closest to sqrt(2)/2. That value is +0x1.6A09E667F3BCDp-1 = 0.70710678118654757273731092936941422522068023681640625.
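For reference, here is a small sketch of the DECIMAL_DIG behaviour described above, assuming IEEE 754 double, a C11 <float.h>, and a correctly rounding printf/strtod (as in glibc); what appears beyond DECIMAL_DIG digits is implementation-defined:

#include <float.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* the double nearest sqrt(2)/2, as discussed above */
    double z = 0.70710678118654757273731092936941422522068023681640625;
    char buf[64];

    printf("DECIMAL_DIG = %d\n", DECIMAL_DIG);   /* 17 if long double is binary64, 21 if x87 80-bit */

    /* DECIMAL_DIG significant digits are enough to get the same double back ... */
    snprintf(buf, sizeof buf, "%.*g", DECIMAL_DIG, z);
    printf("%s -> %s\n", buf, strtod(buf, NULL) == z ? "same double" : "different double");

    /* ... while %.25g is allowed to reveal more of the exact binary value */
    printf("%.25g\n", z);
    return 0;
}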

There is enough precision to represent 0.7071067811865474617150085 in double-precision floating point. The 64-bit encoding is actually 0x3FE6A09E667F3BCC.
The value is reconstructed from a binary significand and a power-of-two exponent, so you cannot say that 53 bits will always come out to exactly 17 decimal places.
EDIT:
Look at the example below, from the Wikipedia article on double-precision floating point, for another instance:
0.333333333333333314829616256247390992939472198486328125
= 2^(−54) × 0x15555555555555
= 2^(−2) × (0x15555555555555 × 2^(−52))
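If your compiler supports C99's %a conversion, you can reproduce this example directly; a minimal sketch (the exact spelling of the %a output may vary between implementations):

#include <stdio.h>

int main(void)
{
    double third = 1.0 / 3.0;

    printf("%a\n", third);     /* typically 0x1.5555555555555p-2, i.e. 0x15555555555555 * 2^-54 */
    printf("%.54f\n", third);  /* the long decimal expansion quoted above (on many libcs)        */
    return 0;
}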

You are asking about float, but your code uses double.
Anyway, neither float nor double always has the same number of decimal digits. A float is assigned 32 bits (4 bytes) for its floating-point representation according to IEEE 754.
From Wikipedia:
The IEEE 754 standard specifies a binary32 as having:
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
This gives from 6 to 9 significant decimal digits of precision (if a decimal string with at most 6 significant decimal digits is converted to IEEE 754 single precision and then converted back to the same number of significant decimal digits, then the final string should match the original; and if an IEEE 754 single-precision value is converted to a decimal string with at least 9 significant decimal digits and then converted back to single precision, then the final number must match the original).
In the case of double, from Wikipedia again:
Double-precision binary floating-point is a commonly used format on
PCs, due to its wider range over single-precision floating point, in
spite of its performance and bandwidth cost. As with single-precision
floating-point format, it lacks precision on integer numbers when
compared with an integer format of the same size. It is commonly known
simply as double. The IEEE 754 standard specifies a binary64 as
having:
Sign bit: 1 bit
Exponent width: 11 bits
Significand precision: 53 bits (52 explicitly stored)
This gives from 15 to 17 significant decimal digits of precision. If a decimal string with at most 15 significant decimal digits is converted to IEEE 754 double precision and then converted back to the same number of significant decimal digits, then the final string should match the original; and if an IEEE 754 double-precision value is converted to a decimal string with at least 17 significant decimal digits and then converted back to double precision, then the final number must match the original.
On the other hand, you can't expect that if you print a float with more precision than is really stored, the remaining digits will be filled with 0s. The compiler can't guess what you intended; it simply prints the decimal digits of the binary value that is actually stored.
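The round-trip rules quoted above correspond to the FLT_DIG and FLT_DECIMAL_DIG macros (the latter is C11). A small sketch, assuming IEEE 754 float and a libc such as glibc that prints the exact stored value:

#include <float.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    float f = 0.1f;   /* 0.1 has no exact binary representation */
    char buf[64];

    printf("FLT_DIG = %d, FLT_DECIMAL_DIG = %d\n", FLT_DIG, FLT_DECIMAL_DIG);  /* 6 and 9 for binary32 */

    /* 9 significant digits are enough to get exactly the same float back ... */
    snprintf(buf, sizeof buf, "%.*g", FLT_DECIMAL_DIG, (double)f);
    printf("%s -> %s\n", buf, strtof(buf, NULL) == f ? "same float" : "different float");

    /* ... but asking for more digits does not give trailing zeros; it exposes
       more of the exact stored value, 0.100000001490116119384765625 */
    printf("%.27f\n", f);
    return 0;
}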

Related

Floating Point Representation in Hexadecimal using C Language

I typed the following C code:
#include <stdio.h>

typedef unsigned char *byte_pointer;

void show_bytes(byte_pointer start, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        printf(" %.2x", start[i]);
    printf("\n");
}

void show_float(float x)
{
    show_bytes((byte_pointer) &x, sizeof(float));
}

int main(void)
{
    float f = 3.9;
    show_float(f);
    return 0;
}
The output of this code is:
Output: 0x4079999A
Manual Calculations
1.1111001100110011001100 × 2^1
M: 1111001100110011001100
E: 1 + 127 = 128(d) = 10000000(b)
Final Bits: 01000000011110011001100110011000
Hex: 0x40799998
Why is the last digit displayed as A instead of 8?
According to my manual calculations, the answer in hex was supposed to be 0x40799998.
The manual calculations above are wrong. The correct result is 0x4079999A.
In the format commonly used for float, IEEE-754 binary32 or “single precision,” numbers are represented as an integer with magnitude less than 2^24 multiplied by a power of two within certain limits. (The floating-point representation is often described in other forms, such as sign, a 24-digit binary significand with radix point after the first digit, and a power of two. These forms are mathematically equivalent.)
The two numbers in this form closest to 3.9 are 16,357,785•2^−22 and 16,357,786•2^−22. These are, respectively, 3.8999998569488525390625 and 3.900000095367431640625. Lining them up, we can see the latter is closer to 3.9:
3.8999998569488525390625
3.9000000000000000000000
3.9000000953674316406250
as the former differs by about 1.4 at the seventh digit after the decimal point, whereas the latter differs by about 9.5 at the eighth digit after the decimal point.
Therefore, the best conversion of 3.9 to this float format produces 16,357,786•2^−22. In hexadecimal, 16,357,786 is 0xF9999A. In the encoding of the representation into the bits of a float, the low 23 bits of the significand are put into the primary significand field. The low 23 bits are 0x79999A, and that is what we should see in the primary significand field.
Also note we can easily see the binary for 3.9 is
11.11100110011001100110011001100110011001100110…, and the first 24 bits, 11.1110011001100110011001, are the ones that fit in the float significand. Immediately after them is 1001…, which ought to round up, since it exceeds half of the last retained bit, and therefore the last four bits of the rounded significand should be 1010.
(Also note that good C implementations convert numerals in source text to the nearest representable number, especially for numbers without many decimal digits, but the C standard does not require this. It says “For decimal floating constants, and also for hexadecimal floating constants when FLT_RADIX is not a power of 2, the result is either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner.” However, the encoding shown in the question, 0x40799998, is not the encoding of either of the adjacent representable values, which are 0x40799999 and 0x4079999A. It is farther away than either.)
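To check the encoding yourself without worrying about byte order, you can copy the float into a 32-bit integer and print it in hex; a minimal sketch, assuming IEEE 754 binary32 and C99:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = 3.9f;
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits);      /* reinterpret the bytes as one 32-bit integer */

    printf("0x%08X\n", (unsigned)bits);  /* expect 0x4079999A */
    printf("%a\n", f);                   /* typically 0x1.f33334p+1 */
    return 0;
}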

How many digits after the decimal point can a float variable save in c? [duplicate]

Generally we say that a float has precision of 6 digits after the decimal point. But if we store a large number of the order of 10^30 we won't get 6 digits after the decimal point. So is it correct to say that floats have a precision of 6 digits after the decimal point?
"6 digits after the decimal point" is nonsense, and your example is a good demonstration of this.
The IEEE 754 single-precision (binary32) format is an exact specification of the float data type on most platforms.
The precision of the float is 24 bits: 23 bits denote the fraction after the binary point, plus there is an "implicit leading bit" of 1, giving 24 significant bits in total.
Hence in decimal digits this is approximately:
24 * log(2) / log(10) = 7.22
It sounds like you're asking about precision to decimal places (digits after the decimal point), whereas significant figures (total number of digits excluding leading and trailing zeroes) is a better way to describe accuracy of numbers.
You're correct in that the number of digits after the decimal point will change when the number is larger - but if we're talking precision, the number of significant figures will not change when the number is larger. However, the answer isn't simple for decimal numbers:
Most systems these days use the IEEE 754 floating point format to represent numbers in C. However, if you're on something unusual, it's worth checking. Single-precision IEEE 754 float numbers are made up of three parts:
The sign bit (is this number positive or negative)
The (generally also signed) exponent
The fraction (the number before the exponent is applied)
As we'd expect, this is all stored in binary.
How many significant figures?
If you are using IEEE-754 numbers, "how many significant figures" probably isn't an easy way to think about it, because the precision is measured in binary significant figures rather than decimal. floats have only 23 bits of storage for the fraction part, but because there's an implicit leading bit (except when the exponent field is all zeroes, which indicates zero or a subnormal number), there are 24 effective bits of precision.
This means there are 24 significant binary digits, which does not translate to an exact number of decimal significant figures. You can use the formula 24 * log(2) / log(10) to determine that there are 7.225 digits of decimal precision, which isn't a very good answer to your question, since there are numbers of 24 significant binary digits which only have 6 significant decimal digits.
So, single precision floating point numbers have 6-9 significant decimal digits of precision, depending on the number.
Interestingly, you can also use this precision to work out the largest consecutive integer (counting from zero) that you can successfully represent in a single precision float. It is 2^24, or 16,777,216. You can exactly store larger integers, but only if they can be represented in 24 significant binary digits.
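A quick way to convince yourself of the 2^24 limit, assuming IEEE 754 single precision:

#include <stdio.h>

int main(void)
{
    float big = 16777216.0f;        /* 2^24 */

    printf("%.1f\n", big);          /* 16777216.0 */
    printf("%.1f\n", big + 1.0f);   /* still 16777216.0: 16777217 has no float representation */
    printf("%.1f\n", big + 2.0f);   /* 16777218.0: even integers in this range are representable */
    return 0;
}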
Further trivia: The limited size of the fraction component is the same thing that causes this in Javascript:
> console.log(9999999999999999);
10000000000000000
Javascript numbers are always represented as double precision floats, which have 53 bits of precision. This means between 2^53 and 2^54, only even numbers can be represented, because the final bit of any odd number is lost.
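The same effect is easy to reproduce in C, since a C double is the same binary64 format; a minimal sketch (the literal below has 16 nines):

#include <stdio.h>

int main(void)
{
    double d = 9999999999999999.0;   /* needs 54 significant bits; a double only keeps 53 */

    printf("%.0f\n", d);             /* prints 10000000000000000, as in the JavaScript example */
    return 0;
}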
The precision of floating point numbers should be measured in binary digits, not decimal digits. This is because computers operate on binary numbers, and a binary fraction can only approximate a decimal fraction.
Language lawyers will say that the exact width of a float is unspecified by the C standard and therefore implementation-dependent, but on any platform you are likely to encounter a C float means an IEEE754 single-precision number.
IEEE754 specifies that a floating point number is in scientific notation: (−1)^s × 2^e × m
where s is one bit wide, e is eight bits wide, and m is twenty three bits wide. Mathematically, m is 24 bits wide because it's always assumed that the top bit is 1.
So, the maximum number of decimal digits that can be approximated with this representation is: log10(2^24) = 7.22.
That approximates seven significant decimal digits, and an exponent ranging from 2^−126 to 2^127.
Notice that the exponent is measured separately. This is exactly like if you were using ordinary scientific notation, like "A person weighs 72.3 kilograms = 7.23×10^4 grams". Notice that there are three significant digits here, representing that the number is only accurate to within 100 grams. But there is also an exponent which is a different number entirely. You can have a very big exponent with very few significant digits, like "the sun weighs 1.99×10^33 grams." Big number, few digits.
In a nutshell, a float can store about 7-8 significant decimal digits. Let me illustrate this with an example:
1234567001.00
       ^
       +---------------- this information is lost

.01234567001
         ^
         +-------------- this information is lost
Basically, the float stores two values: 1234567 and the position of the decimal point.
Now, this is a simplified example. Floats store binary values instead of decimal values. A 32-bit IEEE 754 float has space for 23 "significant bits" (plus the first one which is always assumed to be 1), which corresponds to roughly 7-8 decimal digits.
1234567001.00 (dec) =
1001001100101011111111101011001.00 (bin) gets rounded to
1001001100101011111111110000000.00 =
 |<----- 23 bits ----->|
1234567040.00 (dec)
And this is exactly what C produces:
#include <stdio.h>

int main(void) {
    float a = 1234567001;
    printf("%f\n", a);   /* prints 1234567040.000000 */
    return 0;
}

Convert integer to IEEE floating point?

I am currently reading "Computer Systems: A Programmer's Perspective". In the book, big-endian is used (most significant bits first). In the context of IEEE floating point numbers, using 32-bit single precision, here is a quoted passage about conversion between an integer and IEEE floating point:
One useful exercise for understanding floating-point representations
is to convert sample integer values into floating-point form. For
example, we saw in Figure
2.15 that 12,345 has binary representation [11000000111001]. We create a normalized representation of this by shifting 13 positions to the
right of a binary point, giving 12,345 = 1.1000000111001 (binary) × 2^13. To
encode this in IEEE single-precision format, we construct the fraction
field by dropping the leading 1 and adding 10 zeros to the end, giving
binary representation [10000001110010000000000]. To construct the
exponent field, we add bias 127 to 13, giving 140, which has binary
representation [10001100]. We combine this with a sign bit of 0 to get
the floating-point representation in binary of
[01000110010000001110010000000000].
What I do not understand is "by dropping the leading 1 and adding 10 zeros to the end, giving
binary representation [10000001110010000000000]." If big-endian is used, why can you add 10 zeros to the end of 1000000111001? Doesn't that lead to another value than that after the binary point? It would make sense to me if we added 10 zeros in the front since the final decimal value would still be that originally after the binary point.
Why/how can you add 10 zeros to the back without changing the value if big-endian is used?
This is how the number 12345 is represented as a 32-bit single-precision IEEE754 float:
              3          2         1         0
            1 09876543 21098765432109876543210
            S ---E8--- ----------F23----------
   Binary:  0 10001100 10000001110010000000000
      Hex:  4640 E400
Precision:  SP
     Sign:  Positive
 Exponent:  13 (Stored: 140, Bias: 127)
Hex-float:  +0x1.81c8p13
    Value:  +12345.0 (NORMAL)
Since this is a NORMAL value, the fractional part is interpreted with an implicit 1-bit; that is, it is 1.10000001110010000000000. So, to fill the 23-bit fraction field you simply add 10 zeros at the end of 1000000111001, as that doesn't change the value.
Endianness isn't really related to how these numbers are represented, as each bit has a fixed meaning. But in general, the most-significant-bit is to the left in both the exponent and the mantissa.
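You can confirm the book's result by letting the compiler do the conversion; a minimal sketch, assuming IEEE 754 binary32 and C99 (%a typically prints the value as 0x1.81c8p+13):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = 12345.0f;
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits);      /* view the encoding independent of endianness */

    printf("0x%08X\n", (unsigned)bits);  /* expect 0x4640E400 */
    printf("%a\n", f);                   /* hex-float form of the same value */
    return 0;
}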

what's the largest number float type can hold?

I'm new to programming and have recently come up with this simple question.
The float type has 32 bits, in which 8 bits are for the whole number part (the mantissa).
So my question is: can the float type hold numbers bigger than 255.9999?
I would also appreciate it if someone told me why this code is behaving unexpectedly. Is it a related issue?
#include <stdio.h>

int main(void) {
    float a = 123456789.1;
    printf("%lf", a);
    return 0;
}
for which the output is :
123456792.000000
<float.h> -- Numeric limits of floating point types has your answers, specifically...
FLT_MAX
DBL_MAX
LDBL_MAX
maximum finite value of float, double and long double respectively
...and...
FLT_DIG
DBL_DIG
LDBL_DIG
number of decimal digits that are guaranteed to be preserved in text -> float/double/long double -> text roundtrip without change due to rounding or overflow
That last part means that a value with more significant decimal digits than FLT_DIG is no longer guaranteed to be preserved exactly by a float.
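A small sketch that simply prints those limits; the exact values depend on your platform, but with IEEE 754 float/double and an x87 80-bit long double you should see roughly the figures quoted elsewhere on this page:

#include <float.h>
#include <stdio.h>

int main(void)
{
    printf("FLT_MAX  = %g\n",  FLT_MAX);    /* about 3.40282e+38  */
    printf("DBL_MAX  = %g\n",  DBL_MAX);    /* about 1.79769e+308 */
    printf("LDBL_MAX = %Lg\n", LDBL_MAX);   /* about 1.18973e+4932 with x87 extended precision */

    printf("FLT_DIG = %d, DBL_DIG = %d, LDBL_DIG = %d\n", FLT_DIG, DBL_DIG, LDBL_DIG);
    return 0;
}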
The most common 32-bit floating-point format, IEEE-754 binary32, does not have eight bits for the whole number part. It has one bit for a sign, eight bits for an exponent field, and 23 bits for a significand field (a fraction part).
The sign bit determines whether the number is positive (0) or negative (1).
The exponent field, e, has several uses. If it is 11111111 (in binary), and the significand field, f, is zero, the floating-point value represents infinity. If e is 11111111, and the significand field is not zero, it represents a special Not-a-Number “value”.
If the exponent field is not 11111111 and is not zero, the floating-point value represents 2^(e−127)•(1+f/2^23), with the sign applied. Note that the significand is formed by adding 1 to the contents of the significand field. That is often called an implicit 1, so the mathematical significand is 24 bits: 1 bit from the leading 1 and 23 bits from the significand field.
If the exponent field is zero, the floating-point value represents 2^(1−127)•(0+f/2^23), or the negative of that if the sign bit is 1. Note that the leading bit is 0. These are called subnormal numbers. They are included in the format to make some mathematical properties work in floating-point arithmetic.
The largest finite value represented is when the exponent is 11111110 (254) and the significand field is all ones (f is 2^23−1), so the number represented is 2^(254−127)•(1+(2^23−1)/2^23) = 2^127•(2−2^−23) = 2^128−2^104 = 340282346638528859811704183484516925440.
In float a=123456789.1;, the float type does not have enough precision to represent 123456789.1. (In fact, a decimal fraction ending in .1 can never be represented exactly in a binary floating-point format.) When we have only 24 bits for the significand, the closest numbers to 123456789.1 that we can represent are 123456784 and 123456792, and the conversion rounds to the nearer one, 123456792.
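Both points are easy to check; a sketch assuming IEEE 754 binary32 (link with -lm for ldexp):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* FLT_MAX should equal 2^128 - 2^104, computed here exactly in double */
    double limit = ldexp(1.0, 128) - ldexp(1.0, 104);
    printf("FLT_MAX == 2^128 - 2^104: %s\n", (double)FLT_MAX == limit ? "yes" : "no");

    /* 123456789.1 is rounded to the nearest representable float */
    float a = 123456789.1f;
    printf("%.1f\n", a);   /* prints 123456792.0 */
    return 0;
}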
what's the largest number [the] float type can hold?
The C Standard defines:
FLT_MAX
Include <float.h> to have it be #defined.

How is the range of long double in C calculated?

While studying C I came to know that the range of long double is 3.4E-4932 to 1.1E+4932. What is E here? The size of long double is 10 bytes. If I assume E is 10, then how does long double store numbers with up to 19 digits after the decimal point?
3.4E-4932 means 3.4 × 10^−4932. Both floats and doubles are stored in a format that keeps the exponent and the mantissa separate. In your example, −4932 will be encoded in the exponent, and 3.4 will be encoded in the mantissa, both as binary numbers.
Note that IEEE floating point formats come in a variety of ranges with availability that varies by platform. Refer to IEEE floating point for more details. As pointed out by Joe Farrell, your range is probably the x86 Extended Precision Format. That format carries 1 bit for sign (s), 15 bits of binary exponent (e) with a bias of 16383, and 1 + 63 bits of binary mantissa (m). For normalized numbers, the value is computed as (−1)^s × m × 2^(e−16383), where m includes an explicit leading integer bit, so 1 ≤ m < 2.
The smallest positive normalized number in this format has a sign bit of 0, an exponent of 1, and a mantissa of 1.0, corresponding to 2^(1−16383) = 2^−16382, or about 3.36 × 10^−4932. In binary, that number looks like:
sign bit: 0
exponent: 000000000000001 (15 bits)
mantissa: 1.000…0 (an explicit leading 1 followed by 63 zero bits)
The range of a long double (or, indeed, any floating point width) on Intel hardware is typically [-∞, ∞]. Between those endpoints many finite numbers are also representable:
0
±m×2^e, where:
m is an integer between 1 and 2^64−1, and
e is an integer between −16445 and 16320
That means that the smallest non-zero long double is 2^−16445 and the largest finite long double is (2^64−1)·2^16320 (or 2^16384−2^16320), which are approximately equal to the decimal numbers in scientific notation in the question.
See the Wikipedia article on the x86 extended-precision format for details on the representation (which is binary, not decimal).
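To see what your own long double provides, <float.h> has the corresponding macros; a minimal sketch (the comments assume the x87 80-bit format described above):

#include <float.h>
#include <stdio.h>

int main(void)
{
    printf("LDBL_MIN      = %Le\n", LDBL_MIN);      /* smallest positive normal, about 3.36e-4932 */
    printf("LDBL_MAX      = %Le\n", LDBL_MAX);      /* largest finite value, about 1.19e+4932     */
    printf("LDBL_MANT_DIG = %d\n",  LDBL_MANT_DIG); /* significand bits: 64 for x87 extended      */
    printf("LDBL_DIG      = %d\n",  LDBL_DIG);      /* 18 decimal digits survive a round trip     */
    return 0;
}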
