Calculating range of float in C

Calculating range of float in C - c

Title pretty much sums it all.
I know that floats are 32bit total with 23bits for mantissa and 8bits for the exponent value and 1 for signing.
Calculating the range of "int" is pretty simple: 32bits = 32-1bit signature =31bits ==> Range is therefore 2³¹= 2.14e9
The formula makes sense...
Now i've looked around stackoverflow but all the answers i've found regarding float range calculations lacked substance. Just a bunch of numbers appearing randomly in the responses and magically reaching the 3.4e38 conclusion.
I'm looking for an answer from someone with real knowledge of subject. Someone that can explain through the use of a formula how this range is calculated.
Thank you all.
Mo.

C does not define float as described by OP. The one suggested by OP: binary32, the most popular, is one of many conforming formats.
What C does define
5.2.4.2.2 Characteristics of floating types
s sign (±1)
b base or radix of exponent representation (an integer > 1)
e exponent (an integer between a minimum emin and a maximum emax)
p precision (the number of base-b digits in the significand)
fk nonnegative integers less than b (the significand digits)
x = s*power(b,e)*Σ(k=1, p, f[k]*power(b,-k))
For binary32, the max value is
x = (+1)*power(2, 128)*(0.1111111111 1111111111 1111 binary)
x = 3.402...e+38
Given 32-bits to define a float many other possibilities occur. Example: A float could exist just like binary32, yet not support infinity/not-a-number. The leaves another exponent available numbers. The max value is then 2*3.402...e+38.
binary32 describes its significand ranging up to 1.11111... binary. The C characteristic formula above ranges up to 0.111111...

C uses single-precision floating point notation, which means that a 32-bit float has 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. The mantissa is calculated by summing each mantissa bit * 2^(- (bit_index)). The exponent is calculated by converting the 8 bit binary number to a decimal and subtracting 127 (thus you can have negative exponents as well), and the sign bit indicates whether or not is negative. The formula is thus:
(-1)^S * 1.M * 2^(E - 127)
Where S is the sign, M is the mantissa, and E is the exponent. See https://en.wikipedia.org/wiki/Single-precision_floating-point_format for a better mathematical explanation.
To explicitly answer your question, that means for a 32 bit float, the largest value is (-1)^0 * 1.99999988079071044921875 * 2^128, which is 6.8056469327705771962340836696903385088 × 10^38 according to Wolfram. The smallest value is the negative of that.

Related

How to determine maximum positive base-10 value of a float mantissa?

While trying to understand int, if I was given the size of int in bits, I could use the formula of permutations to determine the maximum positive and negative base-10 values of int. So if a signed int is 16 bits wide, I can use 2^16 to determine the number of possible permutations and then can calculate the maximum number of positive numbers and the maximum number of negative numbers by using 2^15.
In a 32 bit float, 24 bits are assigned for the significand and its sign. 2^23 would be the maximum number of permutations, if we consider the sign to be positive. How can I get the maximum value of the significand from this number 2^23? Or is my understanding of floating point numbers flawed?

ieee-754 uses significand rather than mantissa.
C does not define mantissa. C uses significand.
Common float normal1 values have a 24-bit significand that is made up of 1 implied bit with a value of 1 and 23 explicitly encoded binary fractional bits. All 224 combination are possible.
The maximum significand is 1.11111111 11111111 11111112 or 1.9999998807907104492187510 or (2.0-2-23).
When this is combined with the maximum binary exponent for finite numbers 2(254-127), the maximum float, FLT_MAX is 340282346638528859811704183484516925440.0 or about 3.402823466e+38.
1For sub-numerals, there is no implied bit.
That maximum significand is 0.11111111 11111111 11111112 or 0.9999998807907104492187510

The number of possible values of a normal significand of a float is (FLT_RADIX-1)/FLT_EPSILON, where FLT_RADIX and FLT_EPSILON are defined by including <float.h>.
This is because FLT_EPSILON is the step size from 1 to the next greater representable number, so it is a change of 1 in the significand bits (when they are interpreted as a binary integer and we are starting from the floating-point number 1.000…000). FLT_RADIX/FLT_EPSILON calculates how many steps the significand could go through, starting from 0, until it wraps or overflows its leading digit. However, we do not start at zero; the question requests excluding the implicit leading 1 bit. The leading bit of a normalized binary-based floating-point number is 1, but, when we generalize to other bases, the leading digit of a floating-point number may be something other than 1 for a normalized number; it can be a non-zero integer less than FLT_RADIX. So, starting from 1 instead of 0, there are (FLT_RADIX-1)/FLT_EPSILON possible values of normal significands.
Note that (FLT_RADIX-1)/FLT_EPSILON has an integer value but floating-point type. To use it as an integer type, you may need a cast, such as when printing it with %d.
The floating-point number with the same scale (exponent) as 1 but the maximum significand is FLT_RADIX - FLT_EPSILON. The maximum value of the significand as an integer is FLT_RADIX/FLT_EPSILON - 1. Note that the latter includes the leading digit.
Notes
“Significand” is the preferred term for the fraction portion of a floating-point number. “Mantissa” is an old term for the fraction portion of a logarithm. Significands are linear; multiplying a significand multiplies the number represented. Mantissas are logarithm; adding to a mantissa multiplies the number represented.
“Permutation” refers to moving things around; (1 2 3 4) and (3 4 2 1) are permutations of each other. You appear to want the number of different values the significand bits can have.

The magnitude of the mantissa has no meaning without considering the exponent. What the 23 tells you is that the number of significant decimal digits is 23 * log(2) ≈ 7.
However a 24th bit is implied, giving 24 * log(2) which is > 7. So all 7-digit integer values can be stored without loss of precision.
Also, any integer that has a power of 2 as a factor, and when divided by that factor has 7 digits or less, can also be exactly represented, as the power of 2 is taken up by the exponent (subject to the limit of the exponent value).
So it is the exponent size that gives the range of values that can be stored, while the mantissa (significand) size gives the precision.

what's the largest number float type can hold?

I'm new to programming and have recently come up with this simple question .
float type has 32 bits in which 8 bits are for the whole number part (the mantissa).
so my question is can float type hold numbers bigger than 255.9999 ?
and I would also appreciate if someone told me why this code is behaving unexpectedly. Is it a related issue?
int main(){
float a=123456789.1;
printf("%lf",a);
return 0;
}
for which the output is :
123456792.000000

<float.h> -- Numeric limits of floating point types has your answers, specifically...
FLT_MAX
DBL_MAX
LDBL_MAX
maximum finite value of float, double and long double respectively
...and...
FLT_DIG
DBL_DIG
LDBL_DIG
number of decimal digits that are guaranteed to be preserved in text -> float/double/long double -> text roundtrip without change due to rounding or overflow
That last part is meant to say that a float value longer (i.e. more significant digits) than FLT_DIG is no longer guaranteed to be precisely representable.

The most common 32-bit floating-point format, IEEE-754 binary32, does not have eight bits for the whole number part. It has one bit for a sign, eight bits for an exponent field, and 23 bits for a significand field (a fraction part).
The sign bit determines whether the number is positive (0) or negative (1).
The exponent field, e, has several uses. If it is 11111111 (in binary), and the significand field, f, is zero, the floating-point value represents infinity. If e is 11111111, and the significand field is not zero, it represents a special Not-a-Number “value”.
If the exponent is not 11111111 and is not zero, floating-point value represents 2e−127•(1+f/223), with the sign added. Note that the fraction portion is formed by adding 1 to the contents of the significand field. That is often called an implicit 1, so the mathematical significand is 24 bits—1 bit from the leading 1, 23 bits from the significand field.
If the exponent is zero, floating-point value represents 21−127•(0+f/223) or the negative of that if the sign bit is 1. Note that the leading bit is 0. These are called subnormal numbers. They are included in the format to make some mathematical properties work in floating-point arithmetic.
The largest finite value represented is when the exponent is 11111110 (254) and the significand field is all ones (f is 223−1), so the number represented is 2254−127•(1+ (223−1)/223) = 2127•(2−2−23) = 2128−2104 = 340282346638528859811704183484516925440.
In float a=123456789.1;, the float type does not have enough precision to represent 123456789.1. (In fact, a decimal fraction .1 can never be represented with a binary floating-point format.) When we have only 24 bits for the significand, the closest numbers to 123456789.1 that we can represent are 123456792 and 123456800.

what's the largest number [the] float type can hold?
The C Standard defines:
FLT_MAX
Include <float.h> to have it be #defined.

If I have a float that f = 50,000, and then i do f*f, is the value returned a negative?

So, It's almost time for midterms and the professor gave us some sample questions.
What I THINK the answer is:
We are given a float that is f=50000.
if we do f*f we get 2,500,000,000.
Now, I'm assuming we're working with a 32 bit machine as that is what we have studied so far. So, if that Is the case then 2,500,000,000 32 bit float not being declared unsigned is considered signed by default. Since 2,500,000,000 is a little over half of the 32 bit representation of 4294967296, and it is signed, we would have a negative value returned, so the statement f * f < 0 would be true, right?
I've only been studying systems programming for 4 weeks, PLEASE correct me if I am wrong here.

Unlike the int type, which is typically represented as a two's complement number, a float is a floating point type, which means it stores values using a mantissa and an exponent. This means that the typical wrapping behavior seen with signed integer types doesn't apply to floating point types.
In the case of 2,500,000,000, this will actually get stored as 0x1.2A05F2 x 231.
Floating point types are typically stored using IEEE 754 floating point format. In the case of a single precision floating point (which a float typically is), it has 1 sign bit, 8 exponent bits, and 24 mantissa bits (with 23 bits stored, as the high order "1" bit is implied).
While this format can't "wrap" from positive to negative, it is subject to 2 things:
Loss of precision
Overflow of the exponent
As an example of precision loss, let's use a decimal floating point format with a 3 digit mantissa and a 2 digit exponent. If we multiply 2.34 x 1010 by 6.78 x 1010, you get 1.58652 x 1021, but because of the 3 digit precision it gets truncated to 1.58 x 1021. So we lose the least significant digits.
To illustrate exponent overflow, suppose we were to multiply 2.00 x 1060 by 3.00 x 1050. You'd get 6.00 x 10110. But because the maximum value of an exponent is 99, this is an overflow. IEEE 754 has a special notation for infinity which it uses in the case of overflow where it sets the mantissa to all 0 bits and the exponent to all 1 bits, and the sign bit can be used to distinguish positive infinity and negative infinity.

How is the range of long double in C calculated?

While studying C I came to know that range of long double is 3.4E-4932 to 1.1E+4932. What is E here ? Size of long double in 10 bytes. If I assume E is 10 then how long double stores numbers till 19 places after decimal.

3.4E-4932 means . Both floats and doubles are stored in a format that keeps the exponent and the mantissa separate. In your example, -4392 will be encoded in the exponent, and 3.4 will be encoded in the mantissa, both as binary numbers.
Note that IEEE floating point formats come in a variety of ranges with availability that varies by platform. Refer to IEEE floating point for more details. As pointed out by Joe Farrell, your range is probably x86 Extended Precision Format. That format carries 1 bit for sign (s), 15 bits of binary exponent (e) with a bias of -16383, and 1 + 63 bits of binary mantissa (m). For normalized numbers, the value is computed as .
The smallest positive normalized number in this format has a sign bit of 0, an exponent of 1, and a mantissa of 1.0, corresponding to or . In binary, that number looks like:

The range of a long double (or, indeed, any floating point width) on Intel hardware is typically [-∞, ∞]. Between those endpoints many finite numbers are also representable:
0
±m×2e, where:
m is an integer between 1 and 264-1, and
e is an integer between -16445 and 16320
That means that the smallest non-zero long double is 2-16445 and the largest finite long double is (264-1)·216320 (or 216384-216320), which are approximately equal to the decimal numbers in scientific notation in the question.
See this Wikipedia article for details on the representation (which is binary, not decimal).

C, getting the maximum float or maximum double not from <float.h>

i was completing book "C. Programming language", but faced up with the question in which i should get the maximum\minimum value of float-pointing number, without using any of standard libraries, such as <float.h>. Thank you

“Without using” exercises are a little bit stupid, so here is one version “without using” any header.
…
double nextafter(double, double);
double max = nextafter(1.0 / 0.0, 0.0);
…
And without using any library function, only assuming that double is mapped to IEEE 754's binary64 format (a very common choice):
…
double max = 0x1.fffffffffffffp1023;
…

Assuming a binary floating-point format, start with 2.0 and multiply it by 2.0 until you get an overflow. This determines the maximum exponent. Then, starting with x as the number you had right before the overflow, take the sum x + x/2 + x/4 + ... until adding x/q does not change the value of the number (or overflows again). This determines the maximum mantissa.
The smallest representable positive number can be found a similar way.

From wikipedia you can read up the IEEE floating point format: http://en.wikipedia.org/wiki/Single-precision_floating-point_format
This contains
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
The page also contains information on how to interpret the exponent value. Value of 0xFF (255) in exponent signifies ±infinity if the significant is zero and NaN (not a number) otherwise. The +- infinity are largest numbers. The sign bit defines if the number if +infinity or -infinity. If the question is about the largest non-infinite value then just use the largest non-special value.
Largest non-infinite value is 24 bits of 1s in significand and 0xFE (254) as exponent. Since the exponent is offset the actual value is something like: significand * 2^(254-127) which is somewhere close to 3.402823 × 10^38 in decimal according to the wikipedia page. If you want the minimum, just toggle the sign bit on to get the exact same value as negative.
EDIT: Since this is about C, I've assumed the 32 bit IEEE float.

You can figure out the number of bits the number holds by doing a sizeof(type)*8.
Then look at http://en.wikipedia.org/wiki/Double-precision_floating-point_format or http://en.wikipedia.org/wiki/Single-precision_floating-point_format
This way you can look it up in a table based in the number of bits.
This assumes that the structure is using IEEE 754.

You could start from the IEEE definition, and work from there. For example, number of bits of exponent, number of bits of mantissa. When you study the format, you will see that the 23 bits of mantissa actually represent 24 bits. The reason is, the mantissa is "normalised", that is, it is left shifted so that the ms bit is always 1. This gives the maximum number of significant bits retained from a calculation. Where has the 24th bit gone? Because it is always there (except for a 0 value), it is "implied" as the 24th bit.