Single precision floating point:
Sign bit: 1
Exponent: 8 bits
Mantissa: 23 bits
Double precision floating point:
Sign bit: 1
Exponent: 11 bits
Mantissa: 52 bits
What does this information mean?
I don't know English terms well.
A floating-point quantity (in most situations, not just C) is defined by three numbers: the sign, the significand (also called the "mantissa"), and the exponent.
These combine to form a pseudo-real number of the form
sign × significand × 2^exponent
This is similar to scientific notation, except that the numbers are all binary, and the multiplication is by powers of 2, not powers of 10.
For example, the number 4.000 can be represented as
+1 × 1 × 2^2
The number 768.000 can be represented as
+1 × 1.5 × 2^9
The number -0.625 can be represented as
-1 × 1.25 × 2^-1
The number 5.375 can be represented as
+1 × 1.34375 × 2^2
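The decompositions above can be reproduced in C. This is only a sketch: `struct parts` and `decompose` are hypothetical names made up for illustration, and the code assumes the input is finite and nonzero. It relies on the standard `frexpf` function, which returns a fraction in [0.5, 1), so the result is rescaled to put the significand in [1, 2).

```c
#include <math.h>

/* Hypothetical helper type: x == sign * significand * 2^exponent */
struct parts {
    int sign;           /* +1 or -1 */
    float significand;  /* in the range [1, 2) */
    int exponent;       /* power of two */
};

/* Splits a finite, nonzero float into sign, significand and exponent. */
struct parts decompose(float x) {
    struct parts p;
    int e;
    float magnitude = (x < 0.0f) ? -x : x;

    /* frexpf yields frac in [0.5, 1) with magnitude == frac * 2^e */
    float frac = frexpf(magnitude, &e);

    p.sign = (x < 0.0f) ? -1 : +1;
    p.significand = frac * 2.0f;  /* rescale to [1, 2) */
    p.exponent = e - 1;           /* compensate for the rescaling */
    return p;
}
```

Calling `decompose(5.375f)` should give sign +1, significand 1.34375 and exponent 2, matching the last example above.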
In any particular floating-point format, you can have different numbers of bits assigned to the different parts. The sign is always 0 (positive) or 1 (negative), so you only ever need one bit for that. The more bits you allocate to the significand, the more precision you can have in your numbers. The more bits you allocate to the exponent, the more range you can have for your numbers.
For example, IEEE 754 single-precision floating point has a total of 24 bits of precision for the significand (which is, yes, one more than your table called out, because there's literally one extra or "hidden" bit). So single-precision floating point has the equivalent of log10(2^24) or about 7.2 decimal digits worth of precision. It has 8 bits for the exponent, which gives us exponent values of about ±127, meaning we can multiply by 2^±127, giving us a decimal range of about 10^±38.
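If you want to check these figures on your own machine, the standard header `<float.h>` reports them. The sketch below uses a made-up function name, `print_float_limits`; the macros themselves are standard C. The values in the comments are what an IEEE 754 single-precision `float` is expected to report, which is an assumption about your platform.

```c
#include <float.h>
#include <stdio.h>

/* Prints the parameters of the float type as reported by <float.h>. */
void print_float_limits(void) {
    /* significand bits, including the hidden bit (24 for IEEE 754 single) */
    printf("significand bits:      %d\n", FLT_MANT_DIG);
    /* decimal digits guaranteed to survive a round trip (6 for IEEE 754 single) */
    printf("safe decimal digits:   %d\n", FLT_DIG);
    /* largest decimal exponent (38 for IEEE 754 single) */
    printf("max decimal exponent:  %d\n", FLT_MAX_10_EXP);
    printf("largest finite float:  %e\n", FLT_MAX);
    printf("smallest normal float: %e\n", FLT_MIN);
}
```

Note that `FLT_DIG` is 6 rather than 7: it is the number of decimal digits that is always safe, which is slightly less than the 7.2 average.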
When you start digging into the details of actual floating-point formats, there are a few more nuances to consider. You might need to understand where the decimal point (really the "binary point" or "radix point") sits with respect to the number that is the significand. You might need to understand the "hidden 1 bit", and the concept of subnormals. You might need to understand how positive and negative exponents are represented, typically by using a bias. You might need to understand the special representations for infinity, and the "not a number" markers. You can read about all of these in general terms in the Wikipedia article on Floating point, or you can read about the specifics of the IEEE 754 floating-point standard which most computers use.
Once you understand how binary floating-point numbers work "on the inside", some of their surprising properties begin to make sense. For example, the ordinary-looking decimal fraction 0.1 is not exactly representable! In single precision, the closest you can get is
+1 × 0x1.99999a × 2^-4
or equivalently
+1 × 1.60000002384185791015625 × 2^-4
or equivalently
+1 × 0b1.10011001100110011001101 × 2^-4
which works out to about 0.10000000149. We simply can't get any more precise than that — we can't add any more 0's to the decimal equivalent — because the significand 1.10011001100110011001101 has completely used up our 1+23 available bits of single-precision significance.
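You can see this for yourself with `printf`. In the sketch below, `show_one_tenth` and `one_tenth_differs` are hypothetical names; the `%a` conversion is standard C99 and prints the value as a hexadecimal significand with a power-of-two exponent. It assumes `float` is IEEE 754 single precision and `double` is double precision.

```c
#include <stdio.h>

/* Prints the float nearest to 0.1 in hexadecimal and in decimal. */
void show_one_tenth(void) {
    float f = 0.1f;
    printf("%a\n", f);     /* hex significand and power-of-two exponent */
    printf("%.11f\n", f);  /* the stored value to 11 decimal places */
}

/* Returns 1 if the float nearest 0.1 differs from the double nearest 0.1. */
int one_tenth_differs(void) {
    return (double)0.1f != 0.1;
}
```

The comparison in `one_tenth_differs` is true because the `double` has more significand bits and therefore lands on a different, closer approximation of 0.1.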
You can read more about such floating point "surprises" at this canonical SO question, and this one, and this one.
Footnote: I said everything was based on "a pseudo-real number of the form sign × significand × 2^exponent", but strictly speaking, it's more like (-1)^sign × significand × 2^exponent. That is, the 1-bit sign component is 0 for positive, and 1 for negative.
Generally we say that a float has precision of 6 digits after the decimal point. But if we store a large number of the order of 10^30 we won't get 6 digits after the decimal point. So is it correct to say that floats have a precision of 6 digits after the decimal point?
"6 digits after the decimal point" is nonsense, and your example is a good demonstration of this.
This is an exact specification of the float data type.
The precision of the float is 24 bits. There are 23 bits denoting the fraction after the binary point, plus there's also an "implicit leading bit", according to the online source. This gives 24 significant bits in total.
Hence in decimal digits this is approximately:
24 * log(2) / log(10) = 7.22
It sounds like you're asking about precision to decimal places (digits after the decimal point), whereas significant figures (total number of digits excluding leading and trailing zeroes) is a better way to describe accuracy of numbers.
You're correct in that the number of digits after the decimal point will change when the number is larger - but if we're talking precision, the number of significant figures will not change when the number is larger. However, the answer isn't simple for decimal numbers:
Most systems these days use the IEEE 754 floating point format to represent numbers in C. However, if you're on something unusual, it's worth checking. Single precision IEEE float numbers are made up of three parts:
The sign bit (is this number positive or negative)
The (generally also signed) exponent
The fraction (the number before the exponent is applied)
As we'd expect, this is all stored in binary.
How many significant figures?
If you are using IEEE 754 numbers, "how many significant figures" probably isn't an easy way to think about it, because the precision is measured in binary significant figures rather than decimal. floats have only 23 bits of accuracy for the fraction part, but because there's an implicit leading 1 bit (unless the exponent field is all zeroes, which marks zero and the subnormal numbers), there are 24 effective bits of precision.
This means there are 24 significant binary digits, which does not translate to an exact number of decimal significant figures. You can use the formula 24 * log(2) / log(10) to determine that there are 7.225 digits of decimal precision, which isn't a very good answer to your question, since there are numbers of 24 significant binary digits which only have 6 significant decimal digits.
So, single precision floating point numbers have 6-9 significant decimal digits of precision, depending on the number.
Interestingly, you can also use this precision to work out the largest consecutive integer (counting from zero) that you can successfully represent in a single precision float. It is 2^24, or 16,777,216. You can exactly store larger integers, but only if they can be represented in 24 significant binary digits.
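A quick way to check the 2^24 limit is to convert an integer to `float` and back. This is a sketch with a hypothetical helper name, `fits_in_float`, and it assumes `float` is IEEE 754 single precision with the default round-to-nearest mode.

```c
/* Returns 1 if n survives a round trip through float unchanged.
   Assumes |n| is far below LONG_MAX, so the conversion back is in range. */
int fits_in_float(long n) {
    float f = (float)n;   /* rounds to the nearest representable float */
    return (long)f == n;
}
```

With this helper, `fits_in_float(16777216L)` is true, `fits_in_float(16777217L)` is false because 2^24 + 1 needs 25 significant bits, and `fits_in_float(16777218L)` is true again because that value is even and fits in 24 significant bits.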
Further trivia: The limited size of the fraction component is the same thing that causes this in Javascript:
> console.log(9999999999999999);
10000000000000000
Javascript numbers are always represented as double precision floats, which have 53 bits of precision. This means between 2^53 and 2^54, only even numbers can be represented, because the final bit of any odd number is lost.
The precision of floating point numbers should be measured in binary digits, not decimal digits. This is because computers operate on binary numbers, and a binary fraction can only approximate a decimal fraction.
Language lawyers will say that the exact width of a float is unspecified by the C standard and therefore implementation-dependent, but on any platform you are likely to encounter a C float means an IEEE754 single-precision number.
IEEE754 specifies that a floating point number is in scientific notation: (-1)^s × 2^e × m
where s is one bit wide, e is eight bits wide, and m is twenty three bits wide. Mathematically, m is 24 bits wide because it's always assumed that the top bit is 1.
So, the maximum number of decimal digits that can be approximated with this representation is: log10(2^24) ≈ 7.22.
That approximates seven significant decimal digits, and a scale factor ranging from 2^-126 to 2^127.
Notice that the exponent is measured separately. This is exactly like if you were using ordinary scientific notation, like "A person weighs 72.3 kilograms = 7.23×10^4 grams". Notice that there are three significant digits here, representing that the number is only accurate to within 100 grams. But there is also an exponent which is a different number entirely. You can have a very big exponent with very few significant digits, like "the sun weighs 1.99×10^33 grams." Big number, few digits.
In a nutshell, a float can store about 7-8 significant decimal digits. Let me illustrate this with an example:
1234567001.00
       ^
       +---------------- this information is lost
.01234567001
         ^
         +-------------- this information is lost
Basically, the float stores two values: 1234567 and the position of the decimal point.
Now, this is a simplified example. Floats store binary values instead of decimal values. A 32-bit IEEE 754 float has space for 23 "significant bits" (plus the first one which is always assumed to be 1), which corresponds to roughly 7-8 decimal digits.
1234567001.00 (dec) =
1001001100101011111111101011001.00 (bin) gets rounded to
1001001100101011111111110000000.00 =
 |       23 bits       |
1234567040.00 (dec)
And this is exactly what C produces:
#include <stdio.h>

int main(void) {
    float a = 1234567001;
    printf("%f\n", a); // outputs 1234567040.000000
    return 0;
}
While trying to understand int, if I was given the size of int in bits, I could use the formula of permutations to determine the maximum positive and negative base-10 values of int. So if a signed int is 16 bits wide, I can use 2^16 to determine the number of possible permutations and then can calculate the maximum number of positive numbers and the maximum number of negative numbers by using 2^15.
In a 32 bit float, 24 bits are assigned for the significand and its sign. 2^23 would be the maximum number of permutations, if we consider the sign to be positive. How can I get the maximum value of the significand from this number 2^23? Or is my understanding of floating point numbers flawed?
IEEE 754 uses the term significand rather than mantissa.
C does not define mantissa. C uses significand.
Common float normal [1] values have a 24-bit significand that is made up of 1 implied bit with a value of 1 and 23 explicitly encoded binary fractional bits. All 2^23 combinations of the explicit bits are possible.
The maximum significand is 1.11111111 11111111 1111111 (base 2) or 1.99999988079071044921875 (base 10) or (2.0 - 2^-23).
When this is combined with the maximum binary exponent for finite numbers, 2^(254-127), the maximum float, FLT_MAX, is 340282346638528859811704183484516925440.0 or about 3.402823466e+38.
[1] For subnormals, there is no implied bit.
That maximum significand is 0.11111111 11111111 1111111 (base 2) or 0.99999988079071044921875 (base 10).
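The maximum value can be rebuilt from the two parts described above. This is a minimal sketch, assuming `float` is IEEE 754 binary32; `max_from_parts` and `matches_flt_max` are hypothetical names, while `ldexpf` (multiply by a power of two) and `FLT_MAX` are standard C.

```c
#include <float.h>
#include <math.h>

/* Builds the largest finite binary32 value from its parts:
   significand (2 - 2^-23) scaled by 2^(254 - 127). */
float max_from_parts(void) {
    float significand = 2.0f - ldexpf(1.0f, -23);  /* 1.111...1 in binary */
    return ldexpf(significand, 254 - 127);         /* multiply by 2^127 */
}

/* Returns 1 if the rebuilt value equals the library's FLT_MAX. */
int matches_flt_max(void) {
    return max_from_parts() == FLT_MAX;
}
```

Both steps are exact: the significand needs exactly 24 bits, and scaling by a power of two only changes the exponent.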
The number of possible values of a normal significand of a float is (FLT_RADIX-1)/FLT_EPSILON, where FLT_RADIX and FLT_EPSILON are defined by including <float.h>.
This is because FLT_EPSILON is the step size from 1 to the next greater representable number, so it is a change of 1 in the significand bits (when they are interpreted as a binary integer and we are starting from the floating-point number 1.000…000). FLT_RADIX/FLT_EPSILON calculates how many steps the significand could go through, starting from 0, until it wraps or overflows its leading digit. However, we do not start at zero; the question requests excluding the implicit leading 1 bit. The leading bit of a normalized binary-based floating-point number is 1, but, when we generalize to other bases, the leading digit of a floating-point number may be something other than 1 for a normalized number; it can be a non-zero integer less than FLT_RADIX. So, starting from 1 instead of 0, there are (FLT_RADIX-1)/FLT_EPSILON possible values of normal significands.
Note that (FLT_RADIX-1)/FLT_EPSILON has an integer value but floating-point type. To use it as an integer type, you may need a cast, such as when printing it with %d.
The floating-point number with the same scale (exponent) as 1 but the maximum significand is FLT_RADIX - FLT_EPSILON. The maximum value of the significand as an integer is FLT_RADIX/FLT_EPSILON - 1. Note that the latter includes the leading digit.
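As a sketch of the expressions above, with the cast to an integer type made explicit. The function names are hypothetical; `FLT_RADIX` and `FLT_EPSILON` come from `<float.h>`. The values in the comments assume IEEE 754 single precision, where `FLT_RADIX` is 2 and `FLT_EPSILON` is 2^-23.

```c
#include <float.h>

/* Number of distinct significands a normal float can have
   (8388608, i.e. 2^23, for IEEE 754 single precision). */
long normal_significand_count(void) {
    return (long)((FLT_RADIX - 1) / FLT_EPSILON);
}

/* Largest significand read as an integer, leading digit included
   (16777215, i.e. 2^24 - 1, for IEEE 754 single precision). */
long max_significand_as_integer(void) {
    return (long)(FLT_RADIX / FLT_EPSILON - 1);
}
```

Returning `long` means the results can be printed with `%ld` without a further cast.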
Notes
“Significand” is the preferred term for the fraction portion of a floating-point number. “Mantissa” is an old term for the fraction portion of a logarithm. Significands are linear; multiplying a significand multiplies the number represented. Mantissas are logarithmic; adding to a mantissa multiplies the number represented.
“Permutation” refers to moving things around; (1 2 3 4) and (3 4 2 1) are permutations of each other. You appear to want the number of different values the significand bits can have.
The magnitude of the mantissa has no meaning without considering the exponent. What the 23 tells you is that the number of significant decimal digits is 23 * log10(2) ≈ 7.
However a 24th bit is implied, giving 24 * log10(2) ≈ 7.22, which is > 7. So all 7-digit integer values can be stored without loss of precision.
Also, any integer that has a power of 2 as a factor, and when divided by that factor has 7 digits or less, can also be exactly represented, as the power of 2 is taken up by the exponent (subject to the limit of the exponent value).
So it is the exponent size that gives the range of values that can be stored, while the mantissa (significand) size gives the precision.
So, It's almost time for midterms and the professor gave us some sample questions.
What I THINK the answer is:
We are given a float that is f=50000.
if we do f*f we get 2,500,000,000.
Now, I'm assuming we're working with a 32 bit machine as that is what we have studied so far. So, if that Is the case then 2,500,000,000 32 bit float not being declared unsigned is considered signed by default. Since 2,500,000,000 is a little over half of the 32 bit representation of 4294967296, and it is signed, we would have a negative value returned, so the statement f * f < 0 would be true, right?
I've only been studying systems programming for 4 weeks, PLEASE correct me if I am wrong here.
Unlike the int type, which is typically represented as a two's complement number, a float is a floating point type, which means it stores values using a mantissa and an exponent. This means that the typical wrapping behavior seen with signed integer types doesn't apply to floating point types.
In the case of 2,500,000,000, this will actually get stored as 0x1.2A05F2 × 2^31.
Floating point types are typically stored using IEEE 754 floating point format. In the case of a single precision floating point (which a float typically is), it has 1 sign bit, 8 exponent bits, and 24 mantissa bits (with 23 bits stored, as the high order "1" bit is implied).
While this format can't "wrap" from positive to negative, it is subject to 2 things:
Loss of precision
Overflow of the exponent
As an example of precision loss, let's use a decimal floating point format with a 3 digit mantissa and a 2 digit exponent. If we multiply 2.34 × 10^10 by 6.78 × 10^10, you get 1.58652 × 10^21, but because of the 3 digit precision it gets truncated to 1.58 × 10^21. So we lose the least significant digits.
To illustrate exponent overflow, suppose we were to multiply 2.00 × 10^60 by 3.00 × 10^50. You'd get 6.00 × 10^110. But because the maximum value of an exponent is 99, this is an overflow. IEEE 754 has a special notation for infinity which it uses in the case of overflow where it sets the mantissa to all 0 bits and the exponent to all 1 bits, and the sign bit can be used to distinguish positive infinity and negative infinity.
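Both points can be checked in C. The sketch below assumes IEEE 754 single-precision arithmetic with the default rounding mode; the function names are made up for illustration. The `volatile` qualifier is only there so the overflowing multiplication is done at run time.

```c
#include <float.h>

/* 50000 * 50000 as a float: large, but still positive and finite. */
float square_50000(void) {
    float f = 50000.0f;
    return f * f;
}

/* Overflowing the exponent gives positive infinity, not a negative number. */
float overflowed(void) {
    volatile float big = FLT_MAX;
    return big * 2.0f;
}
```

So for the exam question, `f * f < 0` is false: `square_50000()` is exactly 2500000000, and even a product too large for the format, as in `overflowed()`, stays positive and compares greater than `FLT_MAX`.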
Title pretty much sums it all.
I know that floats are 32bit total with 23bits for mantissa and 8bits for the exponent value and 1 for signing.
Calculating the range of "int" is pretty simple: 32bits = 32-1bit signature =31bits ==> Range is therefore 2³¹= 2.14e9
The formula makes sense...
Now i've looked around stackoverflow but all the answers i've found regarding float range calculations lacked substance. Just a bunch of numbers appearing randomly in the responses and magically reaching the 3.4e38 conclusion.
I'm looking for an answer from someone with real knowledge of subject. Someone that can explain through the use of a formula how this range is calculated.
Thank you all.
Mo.
C does not define float as described by OP. The one suggested by OP: binary32, the most popular, is one of many conforming formats.
What C does define
5.2.4.2.2 Characteristics of floating types
s sign (±1)
b base or radix of exponent representation (an integer > 1)
e exponent (an integer between a minimum e_min and a maximum e_max)
p precision (the number of base-b digits in the significand)
f_k nonnegative integers less than b (the significand digits)
x = s*power(b,e)*Σ(k=1, p, f[k]*power(b,-k))
For binary32, the max value is
x = (+1)*power(2, 128)*(0.1111111111 1111111111 1111 binary)
x = 3.402...e+38
Given 32 bits to define a float, many other possibilities occur. Example: A float could exist just like binary32, yet not support infinity/not-a-number. That leaves another exponent value available for finite numbers. The max value is then 2*3.402...e+38.
binary32 describes its significand ranging up to 1.11111... binary. The C characteristic formula above ranges up to 0.111111...
On most platforms, a C float is an IEEE 754 single-precision number, which means that a 32-bit float has 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. The mantissa is calculated by summing each mantissa bit * 2^(-(bit_index)), where bit_index counts from 1 at the most significant mantissa bit. The exponent is calculated by converting the 8 bit binary number to a decimal and subtracting 127 (thus you can have negative exponents as well), and the sign bit indicates whether or not the number is negative. The formula is thus:
(-1)^S * 1.M * 2^(E - 127)
Where S is the sign, M is the mantissa, and E is the exponent. See https://en.wikipedia.org/wiki/Single-precision_floating-point_format for a better mathematical explanation.
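To explicitly answer your question, that means for a 32 bit float, the largest finite value is (-1)^0 * 1.99999988079071044921875 * 2^127, which is about 3.4028235 × 10^38. The exponent stops at 127 rather than 128 because the all-ones exponent field (E = 255) is reserved for infinity and NaN, so the largest usable E is 254. The smallest (most negative) finite value is the negative of that.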
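The formula (-1)^S * 1.M * 2^(E - 127) can be applied to the raw bits of a float. This is only a sketch: `rebuild_from_fields` is a hypothetical name, and the code assumes `float` is a 32-bit IEEE 754 binary32 value and that the input is a normal number (E is neither 0 nor 255).

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Rebuilds a normal float from its bit fields using
   (-1)^S * 1.M * 2^(E - 127). */
float rebuild_from_fields(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);     /* copy the raw bytes of the float */

    uint32_t s = bits >> 31;            /* 1 sign bit */
    uint32_t e = (bits >> 23) & 0xFFu;  /* 8 exponent bits, biased by 127 */
    uint32_t m = bits & 0x7FFFFFu;      /* 23 stored mantissa bits */

    /* 1.M written as an integer is 2^23 + m; the extra -23 in the
       exponent puts the binary point back after the leading 1. */
    float magnitude = ldexpf((float)(0x800000u + m), (int)e - 127 - 23);
    return s ? -magnitude : magnitude;
}
```

For any normal input the function should return the value it was given, for example `rebuild_from_fields(5.375f) == 5.375f`, which is a convenient check that the field layout and the bias of 127 are right.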