Why is float more precise than it ought to be? - c

#include <stdio.h>
#include <float.h>
int main(int argc, char** argv)
{
long double pival = 3.14159265358979323846264338327950288419716939937510582097494459230781640628620899L;
float pival_float = pival;
printf("%1.80f\n", pival_float);
return 0;
}
The output I got with gcc is:
3.14159274101257324218750000000000000000000000000000000000000000000000000000000000
A float uses a 23-bit mantissa, so the largest fraction that can be represented is 2^23 = 8388608, which gives about 7 decimal digits of precision.
But the above output shows 23 decimal digits of precision (3.14159274101257324218750). I expected it to print 3.1415927000000000000...
What am I misunderstanding?

You only got 7 digits of precision. Pi is
3.1415926535897932384626433832795028841971693993751058209...
But the output you got from printing your float approximation to Pi was
3.14159274101257324218750000...
As you can see the values diverge starting from the 7th digit after the decimal point.
If you ask printf() for 80 digits after the decimal place, it will print out that many digits of the decimal representation of the binary value stored in the float, even if that many digits is far more than the precision allowed by the float representation.

A binary floating-point value can't represent 3.1415927 exactly (since that's not an exact binary fraction). The nearest value that it can represent is 3.1415927410125732421875, so that's the actual value of your pival_float. When you print pival_float with eighty digits, you see its exact value, plus a bunch of zeroes for good measure.

The closest float value to pi has binary encoding...
0 10000000 10010010000111111011011
...in which I've inserted spaces between the sign, exponent and mantissa. The exponent is biased, so the bits above encode a multiplier of 2^1 == 2, and the mantissa encodes a fraction above 1, with the first bit being worth a half, and each bit thereafter being worth half as much as the bit before.
Therefore, the mantissa bits above are worth:
1 x 0.5
0 x 0.25
0 x 0.125
1 x 0.0625
0 x 0.03125
0 x 0.015625
1 x 0.0078125
0 x 0.00390625
0 x 0.001953125
0 x 0.0009765625
0 x 0.00048828125
1 x 0.000244140625
1 x 0.0001220703125
1 x 0.00006103515625
1 x 0.000030517578125
1 x 0.0000152587890625
1 x 0.00000762939453125
0 x 0.000003814697265625
1 x 0.0000019073486328125
1 x 0.00000095367431640625
0 x 0.000000476837158203125
1 x 0.0000002384185791015625
1 x 0.00000011920928955078125
So, the least significant bit after multiplying by the exponent-encoded value "2" is worth...
0.000 000 238 418 579 101 562 5
I added spaces to make it easier to count that the last non-0 digit is in the 22nd decimal place.
The value the question says printf() displayed appears below alongside the contribution of the least significant bit in the mantissa:
3.14159274101257324218750000000000000000000000000000000000000000000000000000000000
0.0000002384185791015625
Clearly the least significant digits line up properly. If you added up all the mantissa contributions above, added the implicit 1, then multiplied by 2, you'd get the exact value printf displayed. That explains how the float value is precisely (in the mathematical sense of zero randomness) the value shown by printf, but the comparison below against pi shows only the first 6 decimal places are accurate given the particular value we want it to store.
3.14159274101257324218750000000000000000000000000000000000000000000000000000000000
3.14159265358979323846264338327950288419716939937510582097494459230781640628620899
        ^
In computing, it's common to refer to the precision of floating point types when we're actually interested in the accuracy we can rely on. I suppose you could argue that while taken in isolation the precision of floats and doubles is infinite, the rounding necessary when using them to approximate numbers that they can't encode perfectly is for most practical purposes random, and in that sense they offer finite significant digits of precision at encoding such numbers.
So, printf isn't wrong to display so many digits; some application might be using a float to encode that exact number (almost certainly because the nature of the app's calculations involve sums of 1/2^n values), but that'd be the exception rather than the rule.

Carrying on from Tony's answer, one way to prove this limitation on decimal precision to yourself in a practical way is simply to declare pi to as many decimal places as you like while assigning the value to a float. Then look at how it is stored in memory.
What you find is that, no matter how many decimal places you give it, the 32-bit value in memory will always be the equivalent of the unsigned value 1078530011, or 01000000010010010000111111011011 in binary. That is due, as others explained, to the IEEE-754 single-precision floating-point format. Below is a simple bit of code that will allow you to prove to yourself that this limitation means pi, as a float, is limited to six decimal places of precision:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#if defined (__LP64__) || defined (_LP64)
# define BUILD_64 1
#endif
#ifdef BUILD_64
# define BITS_PER_LONG 64
#else
# define BITS_PER_LONG 32
#endif
char *binpad (unsigned long n, size_t sz);
int main (void) {
float fPi = 3.1415926535897932384626433;
unsigned bits;
/* copy the raw bits; avoids the aliasing violation of *(unsigned*)&fPi */
memcpy (&bits, &fPi, sizeof bits);
printf ("\n fPi : %f, in memory : %s unsigned : %u\n\n",
fPi, binpad (bits, 32), bits);
return 0;
}
char *binpad (unsigned long n, size_t sz)
{
static char s[BITS_PER_LONG + 1] = {0};
char *p = s + BITS_PER_LONG;
register size_t i;
for (i = 0; i < sz; i++)
*(--p) = (n>>i & 1) ? '1' : '0';
return p;
}
Output
$ ./bin/ieee754_pi
fPi : 3.141593, in memory : 01000000010010010000111111011011 unsigned : 1078530011

Related

Display decimal number

Does anybody know how to control the number of digits shown to the left of the decimal point?
I want to display a number with 5 digits to the left of the point and 4 digits to the right of it.
For example, for 12345/100 I want to get 00123.4500.
printf("%010.4f", (double)12345/100);
man 3 printf says:
The overall syntax of a conversion specification is:
%[$][flags][width][.precision][length modifier]conversion
.4 is the precision: print 4 digits after the decimal point.
0 is a flag that means to pad with 0s.
10 is the total field width: 5 digits to the left of the decimal point, the point itself, and 4 digits to the right make 10 characters.
While the answer of alinsoar is correct and the most common case, I would like to mention another possibility which sometimes is useful: fixed-point representation.
A uint32_t integer can hold any 9-digit decimal number. We may choose to imagine a decimal point before the 4th digit from the right. The example would then look like this:
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>
int main(void)
{
uint32_t a = 123450000; // 12345.0000
uint32_t b = a / 100; // 123.4500
uint32_t i = b / 10000; // integer part
uint32_t f = b % 10000; // fractional part
printf("%05" PRIu32 ".%04" PRIu32, i, f); // 5 integer digits, 4 fractional digits
}
Using fixed-point one has to pay special attention to the value range and e.g. multiplication needs special handling, but there are cases where it is preferable over the floating-point representation.

How do I print a floating-point value for later scanning with perfect accuracy?

Suppose I have a floating-point value of type float or double (i.e. 32 or 64 bits on typical machines). I want to print this value as text (e.g. to the standard output stream), and then later, in some other process, scan it back in - with fscanf() if I'm using C, or perhaps with istream::operator>>() if I'm using C++. But - I need the scanned float to end up being exactly identical to the original value (up to equivalent representations of the same value). Also, the printed value should be easily readable - to a human - as floating-point, i.e. I don't want to print 0x42355316 and reinterpret that as a 32-bit float.
How should I do this? I'm assuming the standard libraries (of C and C++) won't be sufficient, but perhaps I'm wrong. I suppose that a sufficient number of decimal digits might be able to guarantee an error that's underneath the precision threshold - but that's not the same as guaranteeing the rounding/truncation will happen just the way I want it.
Notes:
The scanning does not have to be perfectly accurate w.r.t. the value it scans, only the original value.
If it makes it easier, you may assume the value is a number and is not infinity.
denormal support is desired but not required; still if we get a denormal, failure should be conspicuous.
First, you should use the %a format with fprintf and fscanf. This is what it was designed for, and the C standard requires it to work (reproduce the original number) if the implementation uses binary floating-point.
Failing that, you should print a float with at least FLT_DECIMAL_DIG significant digits and a double with at least DBL_DECIMAL_DIG significant digits. Those constants are defined in <float.h>, where the standard describes them as the:
… number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,… [b is the base used for the floating-point format, defined in FLT_RADIX, and p is the number of base-b digits in the format.]
For example:
printf("%.*g\n", FLT_DECIMAL_DIG, 1.f/3);
or:
#define QuoteHelper(x) #x
#define Quote(x) QuoteHelper(x)
…
printf("%." Quote(FLT_DECIMAL_DIG) "g\n", 1.f/3);
In C++, these constants are defined in <limits> as std::numeric_limits<Type>::max_digits10, where Type is float or double or another floating-point type.
Note that the C standard only recommends that such a round-trip through a decimal numeral work; it does not require it. For example, C 2018 5.2.4.2.2 15 says, under the heading “Recommended practice”:
Conversion from (at least) double to decimal with DECIMAL_DIG digits and back should be the identity function. [DECIMAL_DIG is the equivalent of FLT_DECIMAL_DIG or DBL_DECIMAL_DIG for the widest floating-point format supported in the implementation.]
In contrast, if you use %a, and FLT_RADIX is a power of two (meaning the implementation uses a floating-point base that is two, 16, or another power of two), then the C standard requires that the result of scanning the numeral produced with %a equals the original number.
I need the scanned float to end up being exactly identical to the original value.
As already pointed out in the other answers, that can be achieved with the %a format specifier.
Also, the printed value should be easily readable - to a human - as floating-point, i.e. I don't want to print 0x42355316 and reinterpret that as a 32-bit float.
That's more tricky and subjective. The first part of the string that %a produces is in fact a fraction composed of hexadecimal digits, so that an output like 0x1.4p+3 may take some time to be parsed as 10 by a human reader.
An option could be to print all the decimal digits needed to represent the floating-point value, but there may be a lot of them. Consider, for example, the value 0.1; its closest representation as a 64-bit float may be
0x1.999999999999ap-4 == 0.1000000000000000055511151231257827021181583404541015625
While printf("%.*lf\n", DBL_DECIMAL_DIG, 0.1); (see e.g. Eric's answer) would print
0.10000000000000001 // If DBL_DECIMAL_DIG == 17
My proposal is somewhere in the middle. Similarly to what %a does, we can exactly represent any floating-point value with radix 2 as a fraction multiplied by 2 raised to some integer power. We can transform that fraction into a whole number (increasing the exponent accordingly) and print it as a decimal value.
0x1.999999999999ap-4 --> 1.999999999999a (base 16) * 2^-4 --> 1999999999999a (base 16) * 2^-56
--> 7205759403792794 (base 10) * 2^-56
That whole number has a limited number of digits (it's < 2^53), but the result is still an exact representation of the original double value.
The following snippet is a proof of concept, without any check for corner cases. The format specifier %a separates the mantissa and the exponent with a p character (as in "... multiplied by two raised to the Power of..."), I'll use a q instead, for no particular reason other than using a different symbol.
The value of the mantissa will also be reduced (and the exponent raised accordingly) by removing all the trailing zero bits. The idea is that 5q+1 (parsed as 5 * 2^1, with the 5 in base 10) should be more "easily" identified as 10 than 2814749767106560q-48.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void to_my_format(double x, char *str)
{
int exponent;
double mantissa = frexp(x, &exponent);
long long m = 0;
if ( mantissa ) {
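// frexp() returns a mantissa in [0.5, 1), so all 53 significand bits
// of a double only become integral after scaling by 2^53
exponent -= 53;
m = (long long)scalbn(mantissa, 53);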
// A reduced mantissa should be more readable
while (m && m % 2 == 0) {
++exponent;
m /= 2;
}
}
sprintf(str, "%lldq%+d", m, exponent);
// ^
// Here 'q' is used to separate the mantissa from the exponent
}
double from_my_format(char const *str)
{
char *end;
long long mantissa = strtoll(str, &end, 10);
long exponent = strtol(str + (end - str + 1), &end, 10);
return scalbn(mantissa, exponent);
}
int main(void)
{
double tests[] = { 1, 0.5, 2, 10, -256, acos(-1), 1000000, 0.1, 0.125 };
size_t n = (sizeof tests) / (sizeof *tests);
char num[32];
for ( size_t i = 0; i < n; ++i ) {
to_my_format(tests[i], num);
double x = from_my_format(num);
printf("%22s%22a ", num, tests[i]);
if ( tests[i] != x )
printf(" *** %22a *** Round-trip failed\n", x);
else
printf("%58.55g\n", x);
}
return 0;
}
Generally, the improvement in readability is admittedly little to none, and surely a matter of opinion.
You can use the %a format specifier to print the value as hexadecimal floating point. Note that this is not the same as reinterpreting the float as an integer and printing the integer value.
For example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
float x;
scanf("%f", &x);
printf("x=%.7f\n", x);
char str[20];
sprintf(str, "%a", x);
printf("str=%s\n", str);
float y;
sscanf(str, "%f", &y);
printf("y=%.7f\n", y);
printf("x==y: %d\n", (x == y));
return 0;
}
With an input of 4, this outputs:
x=4.0000000
str=0x1p+2
y=4.0000000
x==y: 1
With an input of 3.3, this outputs:
x=3.3000000
str=0x1.a66666p+1
y=3.3000000
x==y: 1
As you can see from the output, the %a format specifier prints in exponential format with the significand in hex and the exponent in decimal. This format can then be converted directly back to the exact same value as demonstrated by the equality check.

double representing values up to 16 significant digits

I was reading values into variables from a byte-positioned file, and I had to interpret one value as a decimal number with 20 digits in total, the last 6 of which are the fractional part.
So the value could, for example, be 99999999999999999999 (20 9s), and it is to be used as a floating-point number with the last six 9s as the fractional part.
When I tried to do it this way:
#include<stdio.h>
#include<stdlib.h>
int main(void)
{
char c[]="999999999.999999"; //9 nines to the left of decimal and 6 nines to the right
double d=0;
sscanf(c,"%lf",&d);
printf("%f\n",d);
return 0;
}
Output: 999999999.999999, the same as the input.
Then I increased the number of 9s to the left of the decimal point by one (making 10 nines to the left of the decimal point), and
the output became 9999999999.999998.
Adding one more 9 to the left of the decimal point made the result round off to 100000000000.000000.
For my usage, values with 14 digits to the left of the decimal point and 6 to the right of it can occur, and I want them to be converted precisely, just like the input, without any truncation or rounding. I also read somewhere that a double can represent a value with up to 16 significant digits, but here, when I used only 9999999999.999999 (10 nines to the left and 6 to the right), the result was 9999999999.999998, which contradicts the "represent a value with up to 16 significant digits" statement.
What should be done in this case?
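Floating-point numbers carry a fixed number of significant bits (53 for a typical IEEE-754 double), so they are inexact, and the bigger the number, the larger the absolute error. A double is only guaranteed to preserve DBL_DIG (typically 15) significant decimal digits, not 16.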
What should be done in this case?
You could try a long double, but even that's not guaranteed to be precise enough.
long double z = 99999999999999.999999L;
printf("%Lf\n", z); // 100000000000000.000000
You could store it as an integer and remember it's actually 1,000,000 times smaller. Unfortunately, 20 digits is a bit too large for even an unsigned 64 bit integer. It's 3 bits too large.
#include <inttypes.h>
int main() {
uint64_t x = 99999999999999999999ULL;
}
test.c:4:18: error: integer literal is too large to be represented in any integer type
uint64_t x = 99999999999999999999ULL;
You could make a struct that stores the pieces separately.
#include <inttypes.h>
#include <stdio.h>
#include <stdbool.h>
typedef struct {
bool positive;
uint64_t integer;
uint64_t decimal;
} bignum;
int main() {
bignum num = {
.positive = true,
.integer = 99999999999999,
.decimal = 999999
};
/* zero-pad the fraction to six digits so that e.g. 42 prints as .000042 */
printf("%s" "%"PRIu64 "." "%06"PRIu64 "\n",
num.positive ? "" : "-", num.integer, num.decimal
);
}
At which point you're building your own arbitrary-precision arithmetic library. Instead, use an existing one such as GMP. This can store numbers of any size. The trade off is speed and you have to use special types and functions.
#include <gmp.h>
int main() {
mpf_t y;
mpf_init_set_str(y, "99999999999999.999999", 10);
gmp_printf("%Ff\n", y);
}

Calculate machine precision on gmp arbitrary precision

I'm trying to obtain the machine precision of GMP variables.
To that end, I adapted the code from Wikipedia to compute the machine epsilon of a GMP float with a fixed precision:
int main( int argc, char **argv )
{
long int precision = 100;
mpf_set_default_prec(precision); // in principle redundant, but who cares
mpf_t machEps, one, temp; // "one" is the "1" in gmp. tmp is to comparison.
mpf_init2(machEps, precision);
mpf_set_ui(machEps, 1); // eps = 1
mpf_init_set(one,machEps); // ensure "one" has the same precision as machEps
mpf_init_set(temp,machEps); // ensure "temp" has the same precision as machEps
do {
mpf_div_ui(machEps,machEps,2); // eps = eps/2
mpf_div_ui(temp,machEps,2); // temp = eps/2
mpf_add(temp,temp,one); // temp += 1
}
while ( mpf_cmp(temp,one)); // temp == 1
/// print the result...
char *t = new char[400];
mp_exp_t expprt;
mpf_get_str(NULL, &expprt, 10, 10, machEps);
sprintf(t, "%se%ld", mpf_get_str(NULL, &expprt, 10, mpf_get_default_prec(), machEps), expprt);
printf( "Calculated Machine epsilon: %s\n", t);
return 0;
}
However, the result is not consistent with Wikipedia's formula, nor does it change with the precision I set. What am I missing? I've also tried with double and float (standard C), and the result is correct...
I get results that are consistent with wikipedia's formula, and the values depend on the precision.
However, the value - and the effective precision - only change when crossing a limb-boundary(1). For me, that means multiples of 64, so for
(k-1)*64 < precision <= k*64
the calculated machine epsilon is
0.5^(k*64)
Some results:
$ ./a.out 192
Calculated Machine epsilon: 15930919111324522770288803977677118055911045551926187860739e-57
$ ./a.out 193
Calculated Machine epsilon: 8636168555094444625386351862800399571116000364436281385023703470168591803162427e-77
For comparison:
Prelude> 0.5^192
1.5930919111324523e-58
Prelude> 0.5^256
8.636168555094445e-78
The output of the GMP programme is in the form mantissa,'e',exponent where the value is
0.mantissa * 10^exponent
(1) GMP represents the floating point numbers as a pair of exponent (for base 2) and mantissa (and sign). The mantissa is maintained as an array of unsigned integers, the limbs. For me, the limbs are 64 bits, on 32 bit systems usually 32 bits (iirc). So when the desired precision is between (k-1)*LIMB_BITS (exclusive) and k*LIMB_BITS (inclusive), the array for the mantissa contains k limbs, and all of them are used, thus the effective precision is k*LIMB_BITS bits. Therefore the epsilon only changes when the number of limbs changes.

Given a double, need to find how many digits in total

I have a double which is usually, but not necessarily, positive. It can be 0.xxxx000 or X.xxxx00000 or XX.00000 or 0.xxx0xxx00000, where eventually there are all 0's to the right of the last number. I need to keep track of how many digits there are. I've been having trouble with this, any help? This is C.
A double has 52 mantissa bits plus an implicit "1" bit, so you should be able to type-pun a double pointer to a 64-bit integer (getting the raw bits into an integer), &= this with (1ULL<<52)-1, and |= the result with (1ULL<<52).
The log10 of that would be the number of decimal digits.
Though, I'm almost inclined to say "go with jonsca's solution" because it is so ingeniously simple (it deserves a +1 in any case for being KISS).
Use sprintf to turn it into a string and do whatever counting/testing you need to do on the digits
The representation of a double is not decimal; it is binary (like all the other numbers in a computer). The problem as you defined it makes little sense. Consider an example: the number 1.2 converted to binary is 1+1/5 = 1.(0011) binary, with 0011 repeating. If you cut it to 52 bits of precision (double), you'll have 1.0011001100110011001100110011001100110011001100110011 binary, which equals 1+(1-1/2^52)/5. If you write this number out precisely in decimal, you will get 52 decimal places before the all-zeroes, which is a lot more than the maximum decimal precision of a double, about 16 digits (so the digits from the 17th to the 52nd are meaningless).
Anyway, if you have a purely abstract problem (like in school):
#include <math.h>
int f( double x )
{
int n = 0;
x = fabs(x);
x -= floor(x);
while( x != floor(x) )
{
x *= 2;
++n;
}
return n;
}
The function returns the number of binary digits before the all-zeroes, which is also the number of decimal digits before the all-zeroes (the last decimal digit is always 5 if the returned value is > 0).

Resources