I typed the following C code:
#include <stdio.h>

typedef unsigned char *byte_pointer;

void show_bytes(byte_pointer start, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        printf(" %.2x", start[i]);
    printf("\n");
}

void show_float(float x)
{
    show_bytes((byte_pointer) &x, sizeof(float));
}

int main(void)
{
    float f = 3.9;
    show_float(f);
    return 0;
}
The output of this code, with the printed bytes reassembled into one 32-bit word, is:
0x4079999A
Manual Calculations
1.1111001100110011001100 × 2^1
M: 1111001100110011001100
E: 1 + 127 = 128 (decimal) = 10000000 (binary)
Final bits: 0 10000000 11110011001100110011000
Hex: 0x40799998
Why is the last digit displayed as A instead of 8?
According to my manual calculations, the answer in hex should be 0x40799998.
Those manual calculations are wrong. The correct result is 0x4079999A.
In the format commonly used for float, IEEE-754 binary32 or “single precision,” numbers are represented as an integer with magnitude less than 2^24 multiplied by a power of two within certain limits. (The floating-point representation is often described in other forms, such as sign, a 24-digit binary significand with the radix point after the first digit, and a power of two. These forms are mathematically equivalent.)
The two numbers in this form closest to 3.9 are 16,357,785 × 2^−22 and 16,357,786 × 2^−22. These are, respectively, 3.8999998569488525390625 and 3.900000095367431640625. Lining them up, we can see the latter is closer to 3.9:
3.8999998569488525390625
3.9000000000000000000000
3.9000000953674316406250
as the former differs by about 1.4 at the seventh digit after the decimal point, whereas the latter differs by about 9.5 at the eighth digit after the decimal point.
Therefore, the best conversion of 3.9 to this float format produces 16,357,786 × 2^−22. In hexadecimal, 16,357,786 is 0xF9999A. In the encoding of the representation into the bits of a float, the low 23 bits of the significand are put into the primary significand field. The low 23 bits are 0x79999A, and that is what we should see in the primary significand field.
Also note we can easily see the binary for 3.9 is
11.1110011001100110011001 1001100110011001100110…₂, where the space separates the 24 bits that fit in the float significand from the rest. Immediately after them is 1001…, which we can see ought to round up, since it exceeds half of the previous bit, and therefore the last four bits of the significand should be 1010.
(Also note that good C implementations convert numerals in source text to the nearest representable number, especially for numbers without many decimal digits, but the C standard does not require this. It says “For decimal floating constants, and also for hexadecimal floating constants when FLT_RADIX is not a power of 2, the result is either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner.” However, the encoding shown in the question, 0x40799998, is not for either of the adjacent representable values, 0x40799999 and 0x4079999A. It is farther away than either.)
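To check the rounding directly, you can print the value with C99's %a conversion and view the raw bits via memcpy. This is a minimal sketch, assuming IEEE-754 binary32 and a 32-bit unsigned int:

#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 3.9f;
    unsigned int bits;
    memcpy(&bits, &f, sizeof bits);  /* well-defined way to view the bits */
    printf("%a\n", f);               /* typically prints 0x1.f33334p+1 */
    printf("0x%08X\n", bits);        /* typically prints 0x4079999A */
    return 0;
}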
Related
I'm new to programming and have recently come up with this simple question.
The float type has 32 bits, of which 8 bits are for the whole number part (the mantissa).
So my question is: can the float type hold numbers bigger than 255.9999?
I would also appreciate it if someone told me why this code behaves unexpectedly. Is it a related issue?
#include <stdio.h>

int main(void) {
    float a = 123456789.1;
    printf("%lf\n", a);
    return 0;
}
for which the output is :
123456792.000000
<float.h> -- Numeric limits of floating point types has your answers, specifically...
FLT_MAX
DBL_MAX
LDBL_MAX
maximum finite value of float, double and long double respectively
...and...
FLT_DIG
DBL_DIG
LDBL_DIG
number of decimal digits that are guaranteed to be preserved in text -> float/double/long double -> text roundtrip without change due to rounding or overflow
That last part means that a float value with more significant decimal digits than FLT_DIG is no longer guaranteed to survive a text round trip unchanged.
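To see the actual values on your implementation, you can print those macros directly. A minimal sketch (the numbers printed depend on your platform):

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("FLT_MAX = %g, FLT_DIG = %d\n", FLT_MAX, FLT_DIG);
    printf("DBL_MAX = %g, DBL_DIG = %d\n", DBL_MAX, DBL_DIG);
    printf("LDBL_MAX = %Lg, LDBL_DIG = %d\n", LDBL_MAX, LDBL_DIG);
    return 0;
}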
The most common 32-bit floating-point format, IEEE-754 binary32, does not have eight bits for the whole number part. It has one bit for a sign, eight bits for an exponent field, and 23 bits for a significand field (a fraction part).
The sign bit determines whether the number is positive (0) or negative (1).
The exponent field, e, has several uses. If it is 11111111 (in binary), and the significand field, f, is zero, the floating-point value represents infinity. If e is 11111111, and the significand field is not zero, it represents a special Not-a-Number “value”.
If the exponent field is neither 11111111 nor zero, the floating-point value represents 2^(e−127) × (1 + f/2^23), with the sign applied. Note that the fraction portion is formed by adding 1 to the contents of the significand field. That is often called an implicit 1, so the mathematical significand is 24 bits: 1 bit from the leading 1, plus the 23 bits of the significand field.
If the exponent field is zero, the floating-point value represents 2^(1−127) × (0 + f/2^23), or the negative of that if the sign bit is 1. Note that the leading bit is 0. These are called subnormal numbers. They are included in the format to make some mathematical properties work in floating-point arithmetic.
The largest finite value is represented when the exponent field is 11111110 (254) and the significand field is all ones (f is 2^23 − 1), so the number represented is 2^(254−127) × (1 + (2^23 − 1)/2^23) = 2^127 × (2 − 2^−23) = 2^128 − 2^104 = 340282346638528859811704183484516925440.
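You can verify that identity numerically. A sketch, assuming IEEE-754 binary32; ldexp builds 2^128 and 2^104 exactly, and their difference fits exactly in a double because it needs only 24 significand bits:

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    double big = ldexp(1.0, 128) - ldexp(1.0, 104); /* 2^128 - 2^104, exact */
    printf("FLT_MAX       = %.17g\n", (double)FLT_MAX);
    printf("2^128 - 2^104 = %.17g\n", big);
    printf("equal: %d\n", (double)FLT_MAX == big);  /* prints 1 on IEEE systems */
    return 0;
}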
In float a=123456789.1;, the float type does not have enough precision to represent 123456789.1. (In fact, the decimal fraction .1 can never be represented exactly in a binary floating-point format.) When we have only 24 bits for the significand, the representable numbers nearest to 123456789.1 are 123456784 and 123456792, which are 8 apart, and the constant rounds to the latter.
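You can see that 8-apart spacing with nextafterf. A minimal sketch, assuming IEEE-754 binary32:

#include <stdio.h>
#include <math.h>

int main(void)
{
    float a = 123456789.1f;                 /* rounds to the nearest float */
    printf("stored:   %.1f\n", a);                        /* 123456792.0 */
    printf("previous: %.1f\n", nextafterf(a, 0.0f));      /* 123456784.0 */
    printf("next:     %.1f\n", nextafterf(a, INFINITY));  /* 123456800.0 */
    return 0;
}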
what's the largest number [the] float type can hold?
The C Standard defines:
FLT_MAX
Include <float.h> to have it be #defined.
I am trying to figure out exactly how big a number I can use as a floating point number and as a double. But the values are not stored the way I expected, except for integer values. A double should hold 8 bytes of information, which is enough to hold variable a, but it does not hold it right: it shows 1234567890123456768, in which the last 2 digits are different. And when I stored 2147483648, or changed the last digit to anything else, in the float variable b, it showed the same value, 2147483648, which is supposed to be the limit. So what's going on?
#include <stdio.h>
#include <math.h>

int main(void)
{
    double a;
    float b;
    int c;

    a = 1234567890123456789;
    b = 2147483648;
    c = 2147483647;

    printf("Bytes of double: %zu\n", sizeof(double));
    printf("Bytes of integer: %zu\n", sizeof(int));
    printf("Bytes of float: %zu\n", sizeof(float));
    printf("\n");

    printf("You can count up to %.0f in 4 bytes\n", pow(2, 32));
    printf("You can count up to %.0f with + or - sign in 4 bytes\n", pow(2, 31));
    printf("You can count up to %.0f in 8 bytes\n", pow(2, 64));
    printf("You can count up to %.0f with + or - sign in 8 bytes\n", pow(2, 63));
    printf("\n");

    printf("double number: %.0f\n", a);
    printf("floating point: %.0f\n", b);
    printf("integer: %d\n", c);
    return 0;
}
The answer to the question of what is the largest (finite) number that can be stored in a floating point type would be FLT_MAX or DBL_MAX for float and double, respectively.
However, that doesn't mean that the type can precisely represent every smaller number or integer (in fact, not even close).
First you need to understand that not all bits of a floating point number are “equal”. A floating point number has an exponent (8 bits in an IEEE-754 standard float, 11 bits in a double) and a mantissa (23 and 52 bits in float and double, respectively). The number is obtained by multiplying the mantissa (which has an implied leading 1-bit and binary point) by 2^exponent (after the exponent has been unbiased; its encoded value is not used directly). There is also a separate sign bit, so the following applies to negative numbers as well.
As the exponent changes, the distance between consecutive values of the mantissa changes as well, i.e., the greater the exponent, the further apart consecutive representable values of the floating point number are. Thus you may be able to store one number of a given magnitude precisely, but not the “next” number. One should also remember that some seemingly simple fractions can not be represented precisely with any number of binary digits (e.g., 1/10, one tenth, is an infinitely repeating sequence in binary, like 1/3, one third, is in decimal).
When it comes to integers, you can precisely represent every integer up to a magnitude of 2^(mantissa_bits + 1). Thus an IEEE-754 float can represent all integers up to 2^24 and a double up to 2^53 (in the last half of these ranges the consecutive floating point values are exactly one integer apart, since the entire mantissa is used for the integer part only). There are individual larger integers that can be represented, but they are spaced more than one integer apart; i.e., you can represent some integers greater than 2^(mantissa_bits + 1), but every integer only up to that magnitude.
For example:
#include <stdio.h>
#include <math.h>

int main(void)
{
    float f = powf(2.0f, 24.0f);   /* 2^24, the integer limit for float */
    float f1 = f + 1.0f, f2 = f1 + 2.0f;
    double d = pow(2.0, 53.0);     /* 2^53, the integer limit for double */
    double d1 = d + 1.0, d2 = d + 2.0;

    (void) printf("2**24 float = %.0f, +1 = %.0f, +2 = %.0f\n", f, f1, f2);
    (void) printf("2**53 double = %.0f, +1 = %.0f, +2 = %.0f\n", d, d1, d2);
    return 0;
}
Outputs:
2**24 float = 16777216, +1 = 16777216, +2 = 16777218
2**53 double = 9007199254740992, +1 = 9007199254740992, +2 = 9007199254740994
As you can see, adding 1 to 2^(mantissa_bits + 1) makes no difference, since the result is not representable, but adding 2 does produce the correct answer (as it happens, at this magnitude the representable numbers are two integers apart, since the multiplier has doubled).
TL;DR: An IEEE-754 float can precisely represent all integers up to 2^24 and a double up to 2^53, but only some integers of greater magnitude (the spacing of representable values depends on the magnitude).
sizeof(double) is 8, true, but double needs some bits to store the exponent part as well.
Assuming IEEE-754 is used, double can represent integers up to 2^53 precisely, which is less than 1234567890123456789.
See also Double-precision floating-point format.
You can use these constants to know what the limits are:
FLT_MAX
DBL_MAX
LDBL_MAX
From CPP reference
You can print the actual limits of the standard types using the macros from the <float.h> and <limits.h> headers (for C++, the equivalent is std::numeric_limits).
Hardware cannot represent real numbers exactly; it uses a fixed bit length to approximate a floating type. Since a floating type does not have infinite length, a double variable can only be stored to a specific precision. Most hardware uses the IEEE-754 standard for this floating-point representation.
To get more precision you could try long double (depending on the hardware, this may be quadruple precision rather than double precision), AVX or SSE registers, a big-num library, or you could roll your own.
The sizeof of an object only reports the memory space it occupies. It does not show the valid range. It would be quite possible to have an unsigned int with, e.g., 2**16 (65536) possible values occupy 32 bits in memory.
For floating point objects, it is more difficult. Simplified, they consist of two fields: an integer mantissa and an exponent (see details in the linked article), both with a fixed width.
As the mantissa only has a limited range, trailing bits are truncated or rounded and the exponent is adjusted, if required. This is one reason one should never use floating point types to store precise values like currency.
In decimal (note: computers use binary representation) with 4 digit mantissa:
1000 --> 1.000e3
12345678 --> 1.234e7
The parameters for your implementation are defined in float.h, similar to limits.h, which provides the parameters for integers.
On Linux, #include <values.h>
On Windows, #include <float.h>
There is a fairly comprehensive list of defines in each.
int x = 25;
float *p = (float *)&x;
printf("%f\n", *p);
I understand that the bit representations of floating point numbers and ints are different, but no matter what value I store, the answer is always 0.000000. Shouldn't it be some other value, depending on the floating point representation?
Your code has undefined behavior -- but it will most likely behave as you expect, as long as the size and alignment of types int and float are compatible.
By using the "%f" format to print *p, you're losing a lot of information.
Try this:
#include <stdio.h>
int main(void) {
int x = 25;
float *p = (float*)&x;
printf("%g\n", *p);
return 0;
}
On my system (and probably on yours), it prints:
3.50325e-44
The int value 25 has zeros in most of its high-order bits. Those bits are probably in the same place as the exponent field of type float -- resulting in a very small number.
Look up IEEE floating-point representation for more information. Byte order is going to be an issue. (And don't do this kind of thing in real code unless you have a very good reason.)
As rici suggests in a comment, a better way to learn about floating-point representation is to start with a floating-point value, convert it to an unsigned integer of the same size, and display the integer value in hexadecimal. For example:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
void show(float f) {
unsigned int rep;
memcpy(&rep, &f, sizeof rep);
printf("%g --> 0x%08x\n", f, rep);
}
int main(void) {
if (sizeof (float) != sizeof (unsigned int)) {
fprintf(stderr, "Size mismatch\n");
exit(EXIT_FAILURE);
}
show(0.0);
show(1.0);
show(1.0/3.0);
show(-12.34e5);
return 0;
}
For the purposes of this discussion, we're going to assume both int and float are 32 bits wide. We're also going to assume IEEE-754 floats.
Floating point values are represented as sign × β^exp × significand. For 32-bit binary floats, β is 2, the exponent exp ranges from −126 to 127, and the significand is a normalized binary fraction, such that there is a single leading non-zero bit before the radix point. For example, the binary integer representation of 25 is
11001₂
while the binary floating point representation of 25.0 would be:
1.1001₂ × 2^4   // normalized
The IEEE-754 encoding for a 32-bit float is
s eeeeeeee fffffffffffffffffffffff
where s denotes the sign bit, e denotes the exponent bits, and f denotes the significand (fraction) bits. The exponent is encoded using "excess 127" notation, meaning an exponent value of 127 (01111111₂) represents 0, while 1 (00000001₂) represents −126 and 254 (11111110₂) represents 127. The leading bit of the significand is not explicitly stored, so 25.0 would be encoded as
0 10000011 10010000000000000000000 // exponent 131-127 = 4
However, what happens when you map the bit pattern for the 32-bit integer value 25 onto a 32-bit floating point format? We wind up with the following:
0 00000000 00000000000000000011001
It turns out that in IEEE-754 floats, the exponent value 00000000₂ is reserved for representing 0.0 and subnormal (or denormal) numbers. A subnormal number is a number close to 0 that can't be represented as 1.??? × 2^exp, because the exponent would have to be smaller than what we can encode in 8 bits. Such numbers are interpreted as 0.??? × 2^−126, with as many leading 0s as necessary.
In this case, it adds up to 0.00000000000000000011001₂ × 2^−126, which gives us 3.50325 × 10^−44.
You'll have to map large integer values (in excess of 2^24) to see anything other than 0 out to a bunch of decimal places. And, like Keith says, this is all undefined behavior anyway.
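If you want to reproduce that result without the undefined behavior, you can memcpy the int's bytes into a float instead of casting pointers. A sketch, assuming 32-bit int and IEEE-754 float; ldexp(25.0, -149) builds 25 × 2^−149 exactly:

#include <stdio.h>
#include <string.h>
#include <math.h>

int main(void)
{
    int x = 25;
    float f;
    memcpy(&f, &x, sizeof f);           /* reinterpret the bits, no UB */
    printf("%g\n", f);                  /* typically 3.50325e-44 */
    printf("%g\n", ldexp(25.0, -149));  /* same value: 25 * 2^-149 */
    return 0;
}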
I found this code that obtains the square root. What surprises me is the way it does it, using a union and bit shifts. This is the code:
float sqrt3(const float x)
{
union
{
int i;
float x;
} u;
u.x = x;
u.i = (1<<29) + (u.i >> 1) - (1<<22);
return u.x;
}
First the value of x is saved in u.x, then a value is assigned to u.i, and then the square root of the number magically appears in u.x.
Can someone explain this algorithm to me?
The above code exhibits UB (undefined behaviour), so it should not be trusted to work on any platform. This is because it writes to one member of a union and reads back from a member different from the one it last wrote. It also depends heavily on endianness (the ordering of the bytes within a multi-byte integer).
However, it generally will do what is expected, and to understand why it is worthwhile for you to read about the IEEE 754 binary32 floating-point format.
Crash Course in IEEE754 binary32 format
IEEE754 commonly divides a 32-bit float into 1 sign bit, 8 exponent bits and 23 mantissa bits, thus giving
Bit #:              31  30......23  22.....................0
Bit representation:  s  eeeeeeee    mmmmmmmmmmmmmmmmmmmmmmm
Value:              sign * 1.mantissa * pow(2, exponent - 127)
With the number essentially being in "scientific notation, base 2".
As a detail, the exponent is stored in a "biased" form (that is, it has a value 127 units too high). This is why we subtract 127 from the encoded exponent to get the "real" exponent.
Short Explanation
What your code does is it halves the exponent portion and damages the mantissa. This is done because the square root of a number has an exponent roughly half in magnitude.
Example in base 10
Assume we want the square root of 4000000 = 4*10^6.
4000000 ~ 4*10^6 <- Exponent is 6
4000 ~ 4*10^3 <- Divide exponent in half
Just by dividing the exponent 6 by 2, getting 3, and making it the new exponent we are already within the right order of magnitude, and much closer to the truth,
2000 = sqrt(4000000)
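To get a feel for how rough the approximation is, you can compare it against sqrtf over a few magnitudes. A sketch (sqrt3 is copied from the question):

#include <stdio.h>
#include <math.h>

/* the approximation from the question */
float sqrt3(const float x)
{
    union { int i; float x; } u;
    u.x = x;
    u.i = (1 << 29) + (u.i >> 1) - (1 << 22);
    return u.x;
}

int main(void)
{
    for (float v = 1.0f; v <= 1000000.0f; v *= 10.0f)
        printf("x = %9.0f  approx = %10.3f  sqrtf = %10.3f\n",
               v, sqrt3(v), sqrtf(v));
    return 0;
}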
You can find a perfect explanation on wikipedia:
Methods of computing square roots
see section: Approximations that depend on the floating point representation
So for a 32-bit single precision floating point number in IEEE format
(where notably, the power has a bias of 127 added for the represented
form) you can get the approximate logarithm by interpreting its binary
representation as a 32-bit integer, scaling it by 2^{-23}, and
removing a bias of 127, i.e.
log2(x) ≈ I(x)/2^23 − 127, where I(x) is the float's bit pattern read as a 32-bit integer.
To get the square root, divide the logarithm by 2 and convert the value back.
If you print a float with more precision than is stored in memory, aren't the extra places supposed to have zeros in them? I have code that is something like this:
double z[2*N] = {0};
...
for (n = 1; n <= 2*N; n++) {
    fprintf(u1, "%.25g", z[n-1]);
    fputc(n < 2*N ? ',' : '\n', u1);
}
Which is creating output like this:
0,0.7071067811865474617150085,....
A float should have only 17 decimal places (right? Doesn't 53 bits come out to 17 decimal places?). If that's so, then the 18th, 19th, ..., 25th places should have zeros. Notice in the above output that they have digits other than 0 in them.
Am I misunderstanding something? If so, what?
No. 53 bits means that 17 decimal places are what you can trust; but because the base-10 notation we use is a different base from the one the double is stored in (binary), the later digits appear because 1/2^53 is not exactly 1/10^n for any n:
1/2^53 = .0000000000000001110223024625156540423631668090820312500000000
The string printed by your implementation shows the exact value of the double in your example, and this is permitted by the C standard, as I show below.
First, we should understand what the floating-point object represents. The C standard does a poor job of this, but, presuming your implementation uses the IEEE 754 floating-point standard, a normal floating-point object represents exactly (−1)^s × 2^e × (1+f) for some sign bit s (0 or 1), exponent e (in range for the specific type, −1022 to 1023 for double), and fraction f (also in range, 52 bits after a radix point for double). Many people use the object to approximate nearby values, but, according to the standard, the object only represents the one value it is defined to be.
The value you show, 0.7071067811865474617150085, is exactly representable as a double (sign bit 0, exponent −1, and fraction bits, in hexadecimal, .6a09e667f3bcc). It is important to understand that the double with this value represents exactly that value; it does not represent nearby values, such as 0.707106781186547461715.
Now that we know the value being passed to fprintf, we can consider what the C standard says about this. First, the C standard defines a constant named DECIMAL_DIG. C 2011 5.2.4.2.2 11 defines this to be the number of decimal digits such that any floating-point number in the widest supported type can be rounded to that many decimal digits and back again without change to the value. The precision you passed to fprintf, 25, is likely greater than the value of DECIMAL_DIG on your system.
In C 2011 7.21.6.1 13, the standard says “If the number of significant decimal digits is more than DECIMAL_DIG but the source value is exactly representable with DECIMAL_DIG digits, then the result should be an exact representation with trailing zeros. Otherwise, the source value is bounded by two adjacent decimal strings L < U , both having DECIMAL_DIG significant digits; the value of the resultant decimal string D should satisfy L ≤ D ≤ U, with the extra stipulation that the error should have a correct sign for the current rounding direction.”
This wording allows the compiler some wiggle room. The intent is that the result must be accurate enough that it can be converted back to the original double with no error. It may be more accurate, and some C implementations will produce the exactly correct value, which is permitted since it satisfies the paragraph above.
Incidentally, the value you show is not the double closest to sqrt(2)/2. That value is +0x1.6A09E667F3BCDp-1 = 0.70710678118654757273731092936941422522068023681640625.
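You can watch the trustworthy digits give way to the tail of the exact binary value by printing the same double at increasing precision. A sketch (this computes the double nearest sqrt(2)/2, which, as noted above, differs from the question's value in the last bit):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double z = sqrt(2.0) / 2.0;
    printf("%.17g\n", z);  /* enough digits for an exact round trip */
    printf("%.25g\n", z);  /* extra digits come from the binary value */
    printf("%.60f\n", z);  /* may show the exact decimal expansion */
    return 0;
}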
There is enough precision to represent 0.7071067811865474617150085 in double precision floating point. The 64-bit representation is actually 0x3FE6A09E667F3BCC.
The formula used to evaluate the number is an exponentiation, so you cannot say that 53 bits will come out to 17 decimal places.
EDIT:
Look at the example below from the wiki article for another instance:
0.333333333333333314829616256247390992939472198486328125
= 2^−54 × 0x15555555555555
= 2^−2 × (0x15555555555555 × 2^−52)
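On implementations whose printf produces exact decimal expansions when enough digits are requested (glibc does this), you can reproduce that string directly. A small sketch:

#include <stdio.h>

int main(void)
{
    /* 1.0/3.0 rounds to the double 0x15555555555555 * 2^-54; its exact
       decimal expansion has 54 fractional digits */
    printf("%.54f\n", 1.0 / 3.0);
    return 0;
}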
You are asking about float, but your code uses double.
Anyway, neither float nor double always has the same number of decimal digits. A float is assigned 32 bits (4 bytes) for a floating-point representation according to IEEE 754.
From Wikipedia:
The IEEE 754 standard specifies a binary32 as having:
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 (23 explicitly stored)
This gives from 6 to 9 significant decimal digits precision (if a
decimal string with at most 6 significant decimal is converted to IEEE
754 single precision and then converted back to the same number of
significant decimal, then the final string should match the original;
and if an IEEE 754 single precision is converted to a decimal string
with at least 9 significant decimal and then converted back to single,
then the final number must match the original).
In the case of double, from Wikipedia again:
Double-precision binary floating-point is a commonly used format on
PCs, due to its wider range over single-precision floating point, in
spite of its performance and bandwidth cost. As with single-precision
floating-point format, it lacks precision on integer numbers when
compared with an integer format of the same size. It is commonly known
simply as double. The IEEE 754 standard specifies a binary64 as
having:
Sign bit: 1 bit
Exponent width: 11 bits
Significand precision: 53 bits (52 explicitly stored)
This gives from 15 - 17 significant
decimal digits precision. If a decimal string with at most 15
significant decimal is converted to IEEE 754 double precision and then
converted back to the same number of significant decimal, then the
final string should match the original; and if an IEEE 754 double
precision is converted to a decimal string with at least 17
significant decimal and then converted back to double, then the final
number must match the original.
On the other hand, you can't expect that if you have a float and print it out with more precision than is really stored, the rest of the digits will be filled with 0s. The compiler can't imagine the tricks you are trying to do.