max floating point value [duplicate] - c

I am wondering if the max float represented in IEEE 754 is:
(1.11111111111111111111111)_b*2^[(11111111)_b-127]
Here _b means binary representation. But that value is 3.403201383*10^38, which is different from 3.402823669*10^38, i.e. (1.0)_b*2^[(11111111)_b-127], the value given by, for example, the C++ <limits> header. Isn't
(1.11111111111111111111111)_b*2^[(11111111)_b-127] representable in the format, and larger?
Does anybody know why?
Thank you.

The exponent 11111111b is reserved for infinities and NaNs, so your number cannot be represented.
The greatest value that can be represented in single precision, approximately 3.4028235×10^38, is actually 1.11111111111111111111111b×2^(11111110b−127).
See also http://en.wikipedia.org/wiki/Single-precision_floating-point_format
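As a quick check from C (a minimal sketch; ldexpf from <math.h> scales a value by a power of two), building the number with the all-ones significand and the biased exponent 11111110b = 254 reproduces FLT_MAX exactly:

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* (2 - 2^-23) * 2^127: all-ones 24-bit significand, biased exponent 254 */
    float max = ldexpf(2.0f - ldexpf(1.0f, -23), 127);
    printf("computed: %.9g\n", max);
    printf("FLT_MAX : %.9g\n", FLT_MAX);
    printf("equal?    %d\n", max == FLT_MAX);
    return 0;
}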

Being the "m" the mantisa and the "e" the exponent, the answer is:
In your case, if the number of bits on IEEE 754 are:
16 Bits you have 1 for the sign, 5 for the exponent and 10 for the mantissa. The largest number represented is 4,293,918,720.
32 Bits you have 1 for the sign, 8 for the exponent and 23 for the mantissa. The largest number represented is 3.402823466E38
64 Bits you have 1 for the sign, 11 for the exponent and 52 for the mantissa. The largest number represented is 2^1024 - 2^971
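All three maxima follow the same pattern, (2 - 2^-m) × 2^(2^(e-1) - 1), where m is the number of mantissa bits and e the number of exponent bits. A sketch that evaluates it (max_finite is a hypothetical helper, not a standard function; it assumes the usual IEEE layout where the top exponent code is reserved):

#include <float.h>
#include <math.h>
#include <stdio.h>

/* Largest finite value of an IEEE-style binary format with the given field
   widths: (2 - 2^-mant_bits) * 2^(2^(exp_bits-1) - 1). */
static double max_finite(int exp_bits, int mant_bits)
{
    int emax = (1 << (exp_bits - 1)) - 1;   /* exponent bias = maximum exponent */
    return ldexp(2.0 - ldexp(1.0, -mant_bits), emax);
}

int main(void)
{
    printf("16 bits: %.17g\n", max_finite(5, 10));    /* 65504 */
    printf("32 bits: %.17g\n", max_finite(8, 23));    /* ~3.402823466e+38 */
    printf("64 bits: %.17g\n", max_finite(11, 52));   /* equals DBL_MAX */
    printf("DBL_MAX: %.17g\n", DBL_MAX);
    return 0;
}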

Related

What are the parts of a floating-point number in C?

Single precision floating point:
Sign bit: 1
Exponent: 8 bits
Mantissa: 23 bits
Double precision floating point:
Sign bit: 1
Exponent: 11 bits
Mantissa: 52 bits
What does this information mean?
I don't know English terms well.
A floating-point quantity (in most situations, not just C) is defined by three numbers: the sign, the significand (also called the "mantissa"), and the exponent.
These combine to form a pseudo-real number of the form
sign × significand × 2^exponent
This is similar to scientific notation, except that the numbers are all binary, and the multiplication is by powers of 2, not powers of 10.
For example, the number 4.000 can be represented as
+1 × 1 × 2^2
The number 768.000 can be represented as
+1 × 1.5 × 2^9
The number -0.625 can be represented as
-1 × 1.25 × 2^-1
The number 5.375 can be represented as
+1 × 1.34375 × 2^2
In any particular floating-point format, you can have different numbers of bits assigned to the different parts. The sign is always 0 (positive) or 1 (negative), so you only ever need one bit for that. The more bits you allocate to the significand, the more precision you can have in your numbers. The more bits you allocate to the exponent, the more range you can have for your numbers.
For example, IEEE 754 single-precision floating point has a total of 24 bits of precision for the significand (which is, yes, one more than your table called out, because there's literally one extra or "hidden" bit). So single-precision floating point has the equivalent of log10(2^24) or about 7.2 decimal digits worth of precision. It has 8 bits for the exponent, which gives us exponent values of about ±127, meaning we can multiply by 2^±127, giving us a decimal range of about 10^±38.
When you start digging into the details of actual floating-point formats, there are a few more nuances to consider. You might need to understand where the decimal point (really the "binary point" or "radix point") sits with respect to the number that is the significand. You might need to understand the "hidden 1 bit", and the concept of subnormals. You might need to understand how positive and negative exponents are represented, typically by using a bias. You might need to understand the special representations for infinity, and the "not a number" markers. You can read about all of these in general terms in the Wikipedia article on Floating point, or you can read about the specifics of the IEEE 754 floating-point standard which most computers use.
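To make the field layout concrete, here is a small sketch (assuming the common case where float is IEEE 754 binary32 and 32 bits wide; memcpy is used to reinterpret the bits without violating aliasing rules) that pulls the three fields out of the 768.0 example above:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float x = 768.0f;                  /* +1 x 1.5 x 2^9 from the examples above */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);    /* copy the raw 32 bits of the float */

    unsigned sign     = bits >> 31;           /* 1 bit */
    unsigned exponent = (bits >> 23) & 0xFF;  /* 8 bits, stored with a bias of 127 */
    unsigned fraction = bits & 0x7FFFFF;      /* 23 bits; the leading 1 is implicit */

    /* prints: sign=0  exponent=136 (unbiased 9)  fraction=0x400000 */
    printf("sign=%u  exponent=%u (unbiased %d)  fraction=0x%06X\n",
           sign, exponent, (int)exponent - 127, fraction);
    return 0;
}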
Once you understand how binary floating-point numbers work "on the inside", some of their surprising properties begin to make sense. For example, the ordinary-looking decimal fraction 0.1 is not exactly representable! In single precision, the closest you can get is
+1 × 0x1.99999a × 2^-4
or equivalently
+1 × 1.60000002384185791015625 × 2^-4
or equivalently
+1 × 0b1.10011001100110011001101 × 2^-4
which works out to about 0.10000000149. We simply can't get any more precise than that — we can't add any more 0's to the decimal equivalent — because the significand 1.10011001100110011001101 has completely used up our 1+23 available bits of single-precision significance.
You can read more about such floating point "surprises" in the canonical Stack Overflow questions on the topic.
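If you want to see this from C, here is a minimal sketch (C99's %a conversion prints the stored value in hexadecimal floating-point notation, which matches the 0x1.99999a form above):

#include <stdio.h>

int main(void)
{
    float tenth = 0.1f;
    printf("%.9g\n", tenth);   /* 0.100000001 -- really 0.100000001490116... */
    printf("%a\n", tenth);     /* typically 0x1.99999ap-4, the value actually stored */
    return 0;
}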
Footnote: I said everything was based on "a pseudo-real number of the form sign × significand × 2^exponent", but strictly speaking, it's more like (-1)^sign × significand × 2^exponent. That is, the 1-bit sign component is 0 for positive, and 1 for negative.

How to determine maximum positive base-10 value of a float mantissa?

While trying to understand int, if I was given the size of int in bits, I could use the formula of permutations to determine the maximum positive and negative base-10 values of int. So if a signed int is 16 bits wide, I can use 2^16 to determine the number of possible permutations and then can calculate the maximum number of positive numbers and the maximum number of negative numbers by using 2^15.
In a 32 bit float, 24 bits are assigned for the significand and its sign. 2^23 would be the maximum number of permutations, if we consider the sign to be positive. How can I get the maximum value of the significand from this number 2^23? Or is my understanding of floating point numbers flawed?
IEEE 754 uses significand rather than mantissa.
C does not define mantissa; C uses significand.
Common float normal¹ values have a 24-bit significand that is made up of 1 implied bit with a value of 1 and 23 explicitly encoded binary fraction bits. All 2^23 combinations of the explicit bits are possible.
The maximum significand is 1.11111111 11111111 1111111 in binary, or 1.99999988079071044921875 in decimal, or (2.0 - 2^-23).
When this is combined with the maximum binary exponent for finite numbers, 2^(254-127), the maximum float, FLT_MAX, is 340282346638528859811704183484516925440.0, or about 3.402823466e+38.
¹ For subnormals, there is no implied bit.
That maximum significand is 0.11111111 11111111 1111111 in binary, or 0.99999988079071044921875 in decimal.
The number of possible values of a normal significand of a float is (FLT_RADIX-1)/FLT_EPSILON, where FLT_RADIX and FLT_EPSILON are defined by including <float.h>.
This is because FLT_EPSILON is the step size from 1 to the next greater representable number, so it is a change of 1 in the significand bits (when they are interpreted as a binary integer and we are starting from the floating-point number 1.000…000). FLT_RADIX/FLT_EPSILON calculates how many steps the significand could go through, starting from 0, until it wraps or overflows its leading digit. However, we do not start at zero; the question requests excluding the implicit leading 1 bit. The leading bit of a normalized binary-based floating-point number is 1, but, when we generalize to other bases, the leading digit of a floating-point number may be something other than 1 for a normalized number; it can be a non-zero integer less than FLT_RADIX. So, starting from 1 instead of 0, there are (FLT_RADIX-1)/FLT_EPSILON possible values of normal significands.
Note that (FLT_RADIX-1)/FLT_EPSILON has an integer value but floating-point type. To use it as an integer type, you may need a cast, such as when printing it with %d.
The floating-point number with the same scale (exponent) as 1 but the maximum significand is FLT_RADIX - FLT_EPSILON. The maximum value of the significand as an integer is FLT_RADIX/FLT_EPSILON - 1. Note that the latter includes the leading digit.
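On a typical binary float implementation those expressions come out to 2^23 = 8388608 and 2^24 - 1 = 16777215. A quick sketch (assuming int is wide enough to hold the results):

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* number of distinct normal significands (2^23 for a binary32 float) */
    printf("normal significands:     %d\n", (int)((FLT_RADIX - 1) / FLT_EPSILON));
    /* maximum significand read as an integer, leading digit included (2^24 - 1) */
    printf("max integer significand: %d\n", (int)(FLT_RADIX / FLT_EPSILON - 1));
    return 0;
}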
Notes
“Significand” is the preferred term for the fraction portion of a floating-point number. “Mantissa” is an old term for the fraction portion of a logarithm. Significands are linear; multiplying a significand multiplies the number represented. Mantissas are logarithmic; adding to a mantissa multiplies the number represented.
“Permutation” refers to moving things around; (1 2 3 4) and (3 4 2 1) are permutations of each other. You appear to want the number of different values the significand bits can have.
The magnitude of the mantissa has no meaning without considering the exponent. What the 23 tells you is that the number of significant decimal digits is 23 × log10(2) ≈ 7.
However, a 24th bit is implied, giving 24 × log10(2), which is > 7. So all 7-digit integer values can be stored without loss of precision.
Also, any integer that has a power of 2 as a factor, and when divided by that factor has 7 digits or less, can also be exactly represented, as the power of 2 is taken up by the exponent (subject to the limit of the exponent value).
So it is the exponent size that gives the range of values that can be stored, while the mantissa (significand) size gives the precision.
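A small sketch of that precision limit: any 7-digit integer is stored exactly, but 2^24 + 1 = 16777217 is the first positive integer a float cannot hold:

#include <stdio.h>

int main(void)
{
    float seven_digits = 9999999.0f;   /* below 2^24, so stored exactly */
    float too_big      = 16777217.0f;  /* 2^24 + 1 does not fit in 24 significant bits */

    printf("%.1f\n", seven_digits);    /* 9999999.0 */
    printf("%.1f\n", too_big);         /* 16777216.0 -- rounded to the nearest float */
    return 0;
}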

what's the largest number float type can hold?

I'm new to programming and have recently come up with this simple question.
The float type has 32 bits, of which 8 bits are for the whole number part (the mantissa).
So my question is: can the float type hold numbers bigger than 255.9999?
I would also appreciate it if someone told me why this code is behaving unexpectedly. Is it a related issue?
#include <stdio.h>

int main(){
    float a = 123456789.1;
    printf("%lf", a);
    return 0;
}
for which the output is :
123456792.000000
<float.h> -- Numeric limits of floating point types has your answers, specifically...
FLT_MAX
DBL_MAX
LDBL_MAX
maximum finite value of float, double and long double respectively
...and...
FLT_DIG
DBL_DIG
LDBL_DIG
number of decimal digits that are guaranteed to be preserved in text -> float/double/long double -> text roundtrip without change due to rounding or overflow
That last part means that a value with more significant decimal digits than FLT_DIG is no longer guaranteed to survive such a round trip unchanged.
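A minimal sketch that simply prints those limits for whatever implementation you compile it with:

#include <float.h>
#include <stdio.h>

int main(void)
{
    printf("FLT_MAX  = %g    FLT_DIG  = %d\n", FLT_MAX, FLT_DIG);
    printf("DBL_MAX  = %g    DBL_DIG  = %d\n", DBL_MAX, DBL_DIG);
    printf("LDBL_MAX = %Lg   LDBL_DIG = %d\n", LDBL_MAX, LDBL_DIG);
    return 0;
}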
The most common 32-bit floating-point format, IEEE-754 binary32, does not have eight bits for the whole number part. It has one bit for a sign, eight bits for an exponent field, and 23 bits for a significand field (a fraction part).
The sign bit determines whether the number is positive (0) or negative (1).
The exponent field, e, has several uses. If it is 11111111 (in binary), and the significand field, f, is zero, the floating-point value represents infinity. If e is 11111111, and the significand field is not zero, it represents a special Not-a-Number “value”.
If the exponent is not 11111111 and is not zero, the floating-point value represented is 2^(e-127)·(1+f/2^23), with the sign applied. Note that the fraction portion is formed by adding 1 to the contents of the significand field. That is often called an implicit 1, so the mathematical significand is 24 bits: 1 bit from the leading 1, 23 bits from the significand field.
If the exponent is zero, the floating-point value represented is 2^(1-127)·(0+f/2^23), or the negative of that if the sign bit is 1. Note that the leading bit is 0. These are called subnormal numbers. They are included in the format to make some mathematical properties work in floating-point arithmetic.
The largest finite value is represented when the exponent is 11111110 (254) and the significand field is all ones (f is 2^23-1), so the number represented is 2^(254-127)·(1+(2^23-1)/2^23) = 2^127·(2-2^-23) = 2^128-2^104 = 340282346638528859811704183484516925440.
In float a=123456789.1;, the float type does not have enough precision to represent 123456789.1 exactly. (In fact, a decimal fraction ending in .1 can never be represented exactly in a binary floating-point format.) When we have only 24 bits for the significand, the closest representable values to 123456789.1 are 123456784 and 123456792, and the nearer of the two, 123456792, is what gets stored.
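You can inspect those neighbouring representable values with nextafterf from <math.h>; a small sketch:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float a = 123456789.1f;             /* rounds to the nearest representable float */
    printf("stored value:    %.1f\n", a);                        /* 123456792.0 */
    printf("next float up:   %.1f\n", nextafterf(a, INFINITY));  /* 123456800.0 */
    printf("next float down: %.1f\n", nextafterf(a, 0.0f));      /* 123456784.0 */
    return 0;
}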
what's the largest number [the] float type can hold?
The C Standard defines:
FLT_MAX
Include <float.h> to have it be #defined.

floating point numbers in C slightly different from expected

I noticed that in C, a float can be as small as 2^-149, and as large as 2^127. If I try to set a float to anything smaller or larger than these respectively, I get zero and inf. The 2^-149 doesn't make sense to me; where does it come from?
It appears that the exponent is 8 bits, so we can have 2^-128 to 2^127. The overall sign of the float is 1 bit, so that leaves 23 bits for the significand since a float is 32 bits total. If all 23 bits of the significand are placed after the binary "decimal point" such that the significand is <= 0.5, then we should be able to have floats as small as 2^(-128-23) = 2^-151. On the other hand, if one of the 23 bits is placed BEFORE the binary "decimal" point such that the significand is <= 1, then we would have the smallest float be 2^(-128-22) = 2^-150. Both of these do not agree with the fact that the smallest float seems to be 2^-149. Why is this?
Infinity (+ or -) is represented by the maximum exponent (all 1 bits), and zero mantissa. NaN is represented by the maximum exponent, and any non-zero mantissa.
Denormal numbers, and zero, are represented with the minimum exponent (all 0 bits).
So those two exponents are not available for normal numbers. The smallest normal exponent is therefore -126 (biased value 1), and the smallest positive value is a subnormal with only the lowest significand bit set: 2^-126 × 2^-23 = 2^-149, which is where that figure comes from.
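A small sketch that shows those limits numerically (assuming the usual IEEE 754 float; C11's FLT_TRUE_MIN names the same 2^-149 value that ldexpf builds here):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("smallest normal:    %g (FLT_MIN, 2^-126)\n", FLT_MIN);
    printf("smallest subnormal: %g (2^-126 * 2^-23 = 2^-149)\n", ldexpf(1.0f, -149));
    printf("one step further:   %g (2^-150 rounds to zero under the default rounding)\n",
           ldexpf(1.0f, -150));
    return 0;
}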

'float' vs. 'double' precision

The code
#include <stdio.h>

int main(void)
{
    float x = 3.141592653589793238;
    double z = 3.141592653589793238;
    printf("x=%f\n", x);
    printf("z=%f\n", z);
    printf("x=%20.18f\n", x);
    printf("z=%20.18f\n", z);
    return 0;
}
will give you the output
x=3.141593
z=3.141593
x=3.141592741012573242
z=3.141592653589793116
where on the third line of output 741012573242 is garbage and on the fourth line 116 is garbage. Do doubles always have 16 significant figures while floats always have 7 significant figures? Why don't doubles have 14 significant figures?
Floating point numbers in C use IEEE 754 encoding.
This type of encoding uses a sign, a significand, and an exponent.
Because of this encoding, many numbers will have small changes to allow them to be stored.
Also, the number of significant digits can change slightly since it is a binary representation, not a decimal one.
Single precision (float) gives you 23 bits of significand, 8 bits of exponent, and 1 sign bit.
Double precision (double) gives you 52 bits of significand, 11 bits of exponent, and 1 sign bit.
Do doubles always have 16 significant figures while floats always have 7 significant figures?
No. Doubles always have 53 significant bits and floats always have 24 significant bits (except for denormals, infinities, and NaN values, but those are subjects for a different question). These are binary formats, and you can only speak clearly about the precision of their representations in terms of binary digits (bits).
This is analogous to the question of how many digits can be stored in a binary integer: an unsigned 32 bit integer can store integers with up to 32 bits, which doesn't precisely map to any number of decimal digits: all integers of up to 9 decimal digits can be stored, but a lot of 10-digit numbers can be stored as well.
Why don't doubles have 14 significant figures?
The encoding of a double uses 64 bits (1 bit for the sign, 11 bits for the exponent, 52 explicit significant bits and one implicit bit), which is double the number of bits used to represent a float (32 bits).
float: 23 bits of significand, 8 bits of exponent, and 1 sign bit.
double: 52 bits of significand, 11 bits of exponent, and 1 sign bit.
It's usually based on significant figures of both the exponent and significand in base 2, not base 10. From what I can tell in the C99 standard, however, there is no specified precision for floats and doubles (other than the fact that 1 and 1 + 1E-5 / 1 + 1E-7 are distinguishable [float and double respectively]). However, the number of significant figures is left to the implementer (as well as which base is used internally; in other words, an implementation could decide to make it based on 18 digits of precision in base 3). [1]
If you need to know these values, the constants FLT_RADIX and FLT_MANT_DIG (and DBL_MANT_DIG / LDBL_MANT_DIG) are defined in float.h.
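For reference, a tiny sketch that prints those constants; on a typical IEEE 754 implementation it reports radix 2, 24 and 53 significand bits, and 6 and 15 decimal digits:

#include <float.h>
#include <stdio.h>

int main(void)
{
    printf("FLT_RADIX     = %d\n", FLT_RADIX);
    printf("FLT_MANT_DIG  = %d   FLT_DIG  = %d\n", FLT_MANT_DIG, FLT_DIG);
    printf("DBL_MANT_DIG  = %d   DBL_DIG  = %d\n", DBL_MANT_DIG, DBL_DIG);
    printf("LDBL_MANT_DIG = %d   LDBL_DIG = %d\n", LDBL_MANT_DIG, LDBL_DIG);
    return 0;
}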
The reason it's called a double is because the number of bytes used to store it is double the number of a float (but this includes both the exponent and significand). The IEEE 754 standard (used by most compilers) allocates relatively more bits to the significand than to the exponent (23 vs. 8 for float, 52 vs. 11 for double), which is why the precision is more than doubled.
1: Section 5.2.4.2.2 ( http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf )
A float has 23 explicitly stored bits of significand, and a double has 52 (24 and 53 counting the implicit leading bit).
It's not exactly double precision because of how IEEE 754 works, and because binary doesn't really translate well to decimal. Take a look at the standard if you're interested.
