'float' vs. 'double' precision


The code
float x = 3.141592653589793238;
double z = 3.141592653589793238;
printf("x=%f\n", x);
printf("z=%f\n", z);
printf("x=%20.18f\n", x);
printf("z=%20.18f\n", z);
will give you the output
x=3.141593
z=3.141593
x=3.141592741012573242
z=3.141592653589793116
where on the third line of output 741012573242 is garbage and on the fourth line 116 is garbage. Do doubles always have 16 significant figures while floats always have 7 significant figures? Why don't doubles have 14 significant figures?

Floating point numbers in C use IEEE 754 encoding.
This type of encoding uses a sign, a significand, and an exponent.
Because of this encoding, many numbers will have small changes to allow them to be stored.
Also, the number of significant digits can change slightly since it is a binary representation, not a decimal one.
Single precision (float) gives you 23 bits of significand, 8 bits of exponent, and 1 sign bit.
Double precision (double) gives you 52 bits of significand, 11 bits of exponent, and 1 sign bit.

Do doubles always have 16 significant figures while floats always have 7 significant figures?
No. Doubles always have 53 significant bits and floats always have 24 significant bits (except for denormals, infinities, and NaN values, but those are subjects for a different question). These are binary formats, and you can only speak clearly about the precision of their representations in terms of binary digits (bits).
This is analogous to the question of how many digits can be stored in a binary integer: an unsigned 32 bit integer can store integers with up to 32 bits, which doesn't precisely map to any number of decimal digits: all integers of up to 9 decimal digits can be stored, but a lot of 10-digit numbers can be stored as well.
Why don't doubles have 14 significant figures?
The encoding of a double uses 64 bits (1 bit for the sign, 11 bits for the exponent, 52 explicit significant bits and one implicit bit), which is double the number of bits used to represent a float (32 bits).

float: 23 bits of significand, 8 bits of exponent, and 1 sign bit.
double: 52 bits of significand, 11 bits of exponent, and 1 sign bit.

It's usually based on significant figures of both the exponent and significand in base 2, not base 10. From what I can tell in the C99 standard, however, there is no specified precision for floats and doubles (other than the requirement that 1 and 1 + 1E-5 be distinguishable for float, and 1 and 1 + 1E-9 for double). The number of significant figures is left to the implementer, as is the base used internally; in other words, an implementation could decide to provide 18 digits of precision in base 3. [1]
If you need to know these values, the constants FLT_RADIX and FLT_MANT_DIG (and DBL_MANT_DIG / LDBL_MANT_DIG) are defined in float.h.
The reason it's called a double is that the number of bytes used to store it is double that of a float (this includes both the exponent and the significand). The IEEE 754 standard (used by most compilers) allocates relatively more bits to the significand than to the sign and exponent (23 vs. 9 for float, 52 vs. 12 for double), which is why the precision is more than doubled.
1: Section 5.2.4.2.2 ( http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf )

A float stores 23 bits of significand, and a double stores 52 (24 and 53 bits of precision once the implicit leading bit is counted).

It's not exactly double precision because of how IEEE 754 works, and because binary doesn't really translate well to decimal. Take a look at the standard if you're interested.

Related

How many digits after the decimal point can a float variable save in c? [duplicate]

Generally we say that a float has precision of 6 digits after the decimal point. But if we store a large number of the order of 10^30 we won't get 6 digits after the decimal point. So is it correct to say that floats have a precision of 6 digits after the decimal point?

"6 digits after the decimal point" is nonsense, and your example is a good demonstration of this.
This is an exact specification of the float data type.
The precision of the float is 24 bits. There are 23 bits denoting the fraction after the binary point, plus there's also an "implicit leading bit", according to the online source. This gives 24 significant bits in total.
Hence in decimal digits this is approximately:
24 * log(2) / log(10) = 7.22

It sounds like you're asking about precision to decimal places (digits after the decimal point), whereas significant figures (total number of digits excluding leading and trailing zeroes) is a better way to describe accuracy of numbers.
You're correct in that the number of digits after the decimal point will change when the number is larger - but if we're talking precision, the number of significant figures will not change when the number is larger. However, the answer isn't simple for decimal numbers:
Most systems these days use the IEEE 754 floating point format to represent numbers in C. However, if you're on something unusual, it's worth checking. Single precision IEEE 754 float numbers are made up of three parts:
The sign bit (is this number positive or negative)
The (generally also signed) exponent
The fraction (the number before the exponent is applied)
As we'd expect, this is all stored in binary.
How many significant figures?
If you are using IEEE 754 numbers, "how many significant figures" probably isn't an easy way to think about it, because the precision is measured in binary significant figures rather than decimal. floats have only 23 bits of accuracy for the fraction part, but because there's an implicit leading bit (unless the exponent field is all zeroes, which marks zero or a denormal number), there are 24 effective bits of precision.
This means there are 24 significant binary digits, which does not translate to an exact number of decimal significant figures. You can use the formula 24 * log(2) / log(10) to determine that there are 7.225 digits of decimal precision, which isn't a very good answer to your question, since there are numbers of 24 significant binary digits which only have 6 significant decimal digits.
So, single precision floating point numbers have 6-9 significant decimal digits of precision, depending on the number.
Interestingly, you can also use this precision to work out the largest consecutive integer (counting from zero) that you can successfully represent in a single precision float. It is 2^24, or 16,777,216. You can exactly store larger integers, but only if they can be represented in 24 significant binary digits.
Further trivia: The limited size of the fraction component is the same thing that causes this in Javascript:
> console.log(9999999999999999);
10000000000000000
Javascript numbers are always represented as double precision floats, which have 53 bits of precision. This means between 2^53 and 2^54, only even numbers can be represented, because the final bit of any odd number is lost.

The precision of floating point numbers should be measured in binary digits, not decimal digits. This is because computers operate on binary numbers, and a binary fraction can only approximate a decimal fraction.
Language lawyers will say that the exact width of a float is unspecified by the C standard and therefore implementation-dependent, but on any platform you are likely to encounter a C float means an IEEE754 single-precision number.
IEEE754 specifies that a floating point number is in scientific notation: (-1)^s × 2^e × m
where s is one bit wide, e is eight bits wide, and m is twenty three bits wide. Mathematically, m is 24 bits wide because it's always assumed that the top bit is 1.
So, the maximum number of decimal digits that can be approximated with this representation is: log10(2^24) = 7.22.
That approximates seven significant decimal digits, and an exponent ranging from 2^-126 to 2^127.
Notice that the exponent is measured separately. This is exactly like if you were using ordinary scientific notation, like "A person weighs 72.3 kilograms = 7.23×10^4 grams". Notice that there are three significant digits here, representing that the number is only accurate to within 100 grams. But there is also an exponent which is a different number entirely. You can have a very big exponent with very few significant digits, like "the sun weighs 1.99×10^33 grams." Big number, few digits.

In a nutshell, a float can store about 7-8 significant decimal digits. Let me illustrate this with an example:
1234567001.00
       ^
       +---------------- this information is lost
.01234567001
         ^
         +-------------- this information is lost
Basically, the float stores two values: 1234567 and the position of the decimal point.
Now, this is a simplified example. Floats store binary values instead of decimal values. A 32-bit IEEE 754 float has space for 23 "significant bits" (plus the first one which is always assumed to be 1), which corresponds to roughly 7-8 decimal digits.
1234567001.00 (dec) =
1001001100101011111111101011001.00 (bin) gets rounded to
1001001100101011111111110000000.00 =
| 23 bits |
1234567040.00 (dec)
And this is exactly what C produces:
#include <stdio.h>

int main(void) {
    float a = 1234567001;
    printf("%f\n", a); // outputs 1234567040.000000
    return 0;
}

If I have a float that f = 50,000, and then i do f*f, is the value returned a negative?

So, It's almost time for midterms and the professor gave us some sample questions.
What I THINK the answer is:
We are given a float that is f=50000.
if we do f*f we get 2,500,000,000.
Now, I'm assuming we're working with a 32 bit machine as that is what we have studied so far. So, if that Is the case then 2,500,000,000 32 bit float not being declared unsigned is considered signed by default. Since 2,500,000,000 is a little over half of the 32 bit representation of 4294967296, and it is signed, we would have a negative value returned, so the statement f * f < 0 would be true, right?
I've only been studying systems programming for 4 weeks, PLEASE correct me if I am wrong here.

Unlike the int type, which is typically represented as a two's complement number, a float is a floating point type, which means it stores values using a mantissa and an exponent. This means that the typical wrapping behavior seen with signed integer types doesn't apply to floating point types.
In the case of 2,500,000,000, this will actually get stored as 0x1.2A05F2 × 2^31.
Floating point types are typically stored using IEEE 754 floating point format. In the case of a single precision floating point (which a float typically is), it has 1 sign bit, 8 exponent bits, and 24 mantissa bits (with 23 bits stored, as the high order "1" bit is implied).
While this format can't "wrap" from positive to negative, it is subject to 2 things:
Loss of precision
Overflow of the exponent
As an example of precision loss, let's use a decimal floating point format with a 3 digit mantissa and a 2 digit exponent. If we multiply 2.34 x 10^10 by 6.78 x 10^10, you get 1.58652 x 10^21, but because of the 3 digit precision it gets truncated to 1.58 x 10^21. So we lose the least significant digits.
To illustrate exponent overflow, suppose we were to multiply 2.00 x 10^60 by 3.00 x 10^50. You'd get 6.00 x 10^110. But because the maximum value of an exponent is 99, this is an overflow. IEEE 754 has a special notation for infinity which it uses in the case of overflow where it sets the mantissa to all 0 bits and the exponent to all 1 bits, and the sign bit can be used to distinguish positive infinity and negative infinity.

max floating point value [duplicate]

I am wondering if the max float represented in IEEE 754 is:
(1.11111111111111111111111)_b*2^[(11111111)_b-127]
Here _b means binary representation. But that value is 3.403201383*10^38, which is different from 3.402823669*10^38, which is (1.0)_b*2^[(11111111)_b-127] and given by for example c++ <limits>. Isn't
(1.11111111111111111111111)_b*2^[(11111111)_b-127] representable and larger in the framework?
Does anybody know why?
Thank you.

The exponent 11111111b is reserved for infinities and NaNs, so your number cannot be represented.
The greatest value that can be represented in single precision, approximately 3.4028235×10^38, is actually 1.11111111111111111111111b × 2^(11111110b - 127).
See also http://en.wikipedia.org/wiki/Single-precision_floating-point_format

With "m" as the mantissa and "e" as the exponent, the largest finite value depends on the width of the IEEE 754 format:
16 bits: 1 bit for the sign, 5 for the exponent and 10 for the mantissa. The largest number represented is 65,504.
32 bits: 1 bit for the sign, 8 for the exponent and 23 for the mantissa. The largest number represented is 3.402823466E38.
64 bits: 1 bit for the sign, 11 for the exponent and 52 for the mantissa. The largest number represented is 2^1024 - 2^971.

floating point numbers in C slightly different from expected

I noticed that in C, a float can be as small as 2^-149, and as large as 2^127. If I try to set the float to any smaller or larger respectively than these, then I get zero and inf, respectively. The 2^-149 doesn't make sense to me; where does it come from?
It appears that the exponent is 8 bits, so we can have 2^-128 to 2^127. The overall sign of the float is 1 bit, so that leaves 23 bits for the significand since a float is 32 bits total. If all 23 bits of the significand are placed after the binary "decimal point" such that the significand is <= 0.5, then we should be able to have floats as small as 2^(-128-23) = 2^-151. On the other hand, if one of the 23 bits is placed BEFORE the binary "decimal" point such that the significand is <= 1, then we would have the smallest float be 2^(-128-22) = 2^-150. Both of these do not agree with the fact that the smallest float seems to be 2^-149. Why is this?

Infinity (+ or -) is represented by the maximum exponent (all 1 bits), and zero mantissa. NaN is represented by the maximum exponent, and any non-zero mantissa.
Denormal numbers, and zero, are represented with the minimum exponent (all 0 bits).
So those two exponents are not available for normal numbers.
