Calculate machine precision on gmp arbitrary precision - c

I'm trying to obtain the machine precision on gmp variables.
To that end, I adapted the code from Wikipedia to compute the machine epsilon of a GMP float with a fixed precision:
#include <stdio.h>
#include <gmp.h>

int main( int argc, char **argv )
{
    long int precision = 100;
    mpf_set_default_prec(precision); // in principle redundant, but who cares
    mpf_t machEps, one, temp; // "one" is the "1" in gmp; "temp" is for the comparison
    mpf_init2(machEps, precision);
    mpf_set_ui(machEps, 1); // eps = 1
    mpf_init_set(one, machEps);  // ensure "one" has the same precision as machEps
    mpf_init_set(temp, machEps); // ensure "temp" has the same precision as machEps
    do {
        mpf_div_ui(machEps, machEps, 2); // eps = eps/2
        mpf_div_ui(temp, machEps, 2);    // temp = eps/2
        mpf_add(temp, temp, one);        // temp = 1 + eps/2
    } while ( mpf_cmp(temp, one) ); // loop while temp != 1
    // print the result...
    char t[400];
    mp_exp_t expprt;
    sprintf(t, "%se%ld", mpf_get_str(NULL, &expprt, 10, mpf_get_default_prec(), machEps), expprt);
    printf( "Calculated Machine epsilon: %s\n", t);
    return 0;
}
However, the result is not consistent with Wikipedia's formula, nor does it change with the precision I set. What am I missing? I've also tried with double and float (standard C), and the result is correct...

I get results that are consistent with Wikipedia's formula, and the values depend on the precision.
However, the value - and the effective precision - only change when crossing a limb boundary(1). For me, that means multiples of 64, so for
(k-1)*64 < precision <= k*64
the calculated machine epsilon is
0.5^(k*64)
Some results:
$ ./a.out 192
Calculated Machine epsilon: 15930919111324522770288803977677118055911045551926187860739e-57
$ ./a.out 193
Calculated Machine epsilon: 8636168555094444625386351862800399571116000364436281385023703470168591803162427e-77
For comparison:
Prelude> 0.5^192
1.5930919111324523e-58
Prelude> 0.5^256
8.636168555094445e-78
The output of the GMP programme is in the form mantissa,'e',exponent where the value is
0.mantissa * 10^exponent
(1) GMP represents the floating point numbers as a pair of exponent (for base 2) and mantissa (and sign). The mantissa is maintained as an array of unsigned integers, the limbs. For me, the limbs are 64 bits, on 32 bit systems usually 32 bits (iirc). So when the desired precision is between (k-1)*LIMB_BITS (exclusive) and k*LIMB_BITS (inclusive), the array for the mantissa contains k limbs, and all of them are used, thus the effective precision is k*LIMB_BITS bits. Therefore the epsilon only changes when the number of limbs changes.
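You can watch GMP round the requested precision up to a whole number of limbs by querying it back with mpf_get_prec (a small sketch I've added; the exact effective precision can vary with GMP version and limb size):

#include <stdio.h>
#include <gmp.h>

int main(void)
{
    /* Request several precisions and print what GMP actually provides. */
    for (long prec = 60; prec <= 260; prec += 50) {
        mpf_t x;
        mpf_init2(x, prec);
        /* mpf_get_prec reports the working precision in bits; with 64-bit
           limbs it comes back as a multiple of 64. */
        printf("requested %3ld bits, effective %3lu bits\n",
               prec, (unsigned long) mpf_get_prec(x));
        mpf_clear(x);
    }
    return 0;
}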

Related

How do I print a floating-point value for later scanning with perfect accuracy?

Suppose I have a floating-point value of type float or double (i.e. 32 or 64 bits on typical machines). I want to print this value as text (e.g. to the standard output stream), and then later, in some other process, scan it back in - with fscanf() if I'm using C, or perhaps with istream::operator>>() if I'm using C++. But - I need the scanned float to end up being exactly identical to the original value (up to equivalent representations of the same value). Also, the printed value should be easily readable - to a human - as floating-point, i.e. I don't want to print 0x42355316 and reinterpret that as a 32-bit float.
How should I do this? I'm assuming the standard libraries of C and C++ won't be sufficient, but perhaps I'm wrong. I suppose that a sufficient number of decimal digits might guarantee an error that's beneath the precision threshold - but that's not the same as guaranteeing the rounding/truncation will happen just the way I want it.
Notes:
The scanning does not have to be perfectly accurate w.r.t. the printed value, only w.r.t. the original value.
If it makes it easier, you may assume the value is a number and is not infinity.
Denormal support is desired but not required; still, if we get a denormal, failure should be conspicuous.
First, you should use the %a format with fprintf and fscanf. This is what it was designed for, and the C standard requires it to work (reproduce the original number) if the implementation uses binary floating-point.
Failing that, you should print a float with at least FLT_DECIMAL_DIG significant digits and a double with at least DBL_DECIMAL_DIG significant digits. Those constants are defined in <float.h> and are described as:
… number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,… [b is the base used for the floating-point format, defined in FLT_RADIX, and p is the number of base-b digits in the format.]
For example:
printf("%.*g\n", FLT_DECIMAL_DIG, 1.f/3);
or:
#define QuoteHelper(x) #x
#define Quote(x) QuoteHelper(x)
…
printf("%." Quote(FLT_DECIMAL_DIG) "g\n", 1.f/3);
In C++, these constants are defined in <limits> as std::numeric_limits<Type>::max_digits10, where Type is float or double or another floating-point type.
Note that the C standard only recommends that such a round-trip through a decimal numeral work; it does not require it. For example, C 2018 5.2.4.2.2 15 says, under the heading “Recommended practice”:
Conversion from (at least) double to decimal with DECIMAL_DIG digits and back should be the identity function. [DECIMAL_DIG is the equivalent of FLT_DECIMAL_DIG or DBL_DECIMAL_DIG for the widest floating-point format supported in the implementation.]
In contrast, if you use %a, and FLT_RADIX is a power of two (meaning the implementation uses a floating-point base that is two, 16, or another power of two), then the C standard requires that the result of scanning the numeral produced with %a equals the original number.
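For example, a complete round trip with DBL_DECIMAL_DIG digits might look like this (my sketch; DBL_DECIMAL_DIG needs C11, and the round trip relies on the recommended practice above holding for your implementation):

#include <float.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double original = 1.0 / 3.0;
    char buf[64];

    /* Print with DBL_DECIMAL_DIG significant digits... */
    snprintf(buf, sizeof buf, "%.*g", DBL_DECIMAL_DIG, original);

    /* ...and scan the text back in. */
    double scanned = strtod(buf, NULL);

    printf("text: %s, round-trip %s\n", buf,
           scanned == original ? "ok" : "FAILED");
    return 0;
}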
I need the scanned float to end up being exactly, identical to the original value.
As already pointed out in the other answers, that can be achieved with the %a format specifier.
Also, the printed value should be easily readable - to a human - as floating-point, i.e. I don't want to print 0x42355316 and reinterpret that as a 32-bit float.
That's more tricky and subjective. The first part of the string that %a produces is in fact a fraction composed of hexadecimal digits, so an output like 0x1.4p+3 may take some time to be parsed as 10 by a human reader.
An option could be to print all the decimal digits needed to represent the floating-point value exactly, but there may be a lot of them. Consider, for example, the value 0.1; its closest representation as a 64-bit float may be
0x1.999999999999ap-4 == 0.1000000000000000055511151231257827021181583404541015625
While printf("%.*lf\n", DBL_DECIMAL_DIG, 0.1); (see e.g. Eric's answer) would print
0.10000000000000001 // If DBL_DECIMAL_DIG == 17
My proposal is somewhere in the middle. Similarly to what %a does, we can exactly represent any floating-point value with radix 2 as a fraction multiplied by 2 raised to some integer power. We can transform that fraction into a whole number (increasing the exponent accordingly) and print it as a decimal value.
0x1.999999999999ap-4 --> 1.999999999999a (base 16) * 2^-4 --> 1999999999999a (base 16) * 2^-56
--> 7205759403792794 (base 10) * 2^-56
That whole number has a limited number of digits (it's < 2^53), but the result is still an exact representation of the original double value.
The following snippet is a proof of concept, without any check for corner cases. The format specifier %a separates the mantissa and the exponent with a p character (as in "... multiplied by two raised to the Power of..."), I'll use a q instead, for no particular reason other than using a different symbol.
The value of the mantissa will also be reduced (and the exponent raised accordingly), removing all the trailing zero-bits. The idea being that 5q+1 (parsed as 5 * 2^1) should be more "easily" identified as 10, rather than 2814749767106560q-48.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void to_my_format(double x, char *str)
{
    int exponent;
    double mantissa = frexp(x, &exponent);
    long long m = 0;
    if ( mantissa ) {
        exponent -= 52;
        m = (long long)scalbn(mantissa, 52);
        // A reduced mantissa should be more readable
        while (m && m % 2 == 0) {
            ++exponent;
            m /= 2;
        }
    }
    sprintf(str, "%lldq%+d", m, exponent);
    //                ^
    // Here 'q' is used to separate the mantissa from the exponent
}

double from_my_format(char const *str)
{
    char *end;
    long long mantissa = strtoll(str, &end, 10);
    long exponent = strtol(end + 1, &end, 10);
    return scalbn(mantissa, exponent);
}

int main(void)
{
    double tests[] = { 1, 0.5, 2, 10, -256, acos(-1), 1000000, 0.1, 0.125 };
    size_t n = (sizeof tests) / (sizeof *tests);
    char num[32];
    for ( size_t i = 0; i < n; ++i ) {
        to_my_format(tests[i], num);
        double x = from_my_format(num);
        printf("%22s%22a ", num, tests[i]);
        if ( tests[i] != x )
            printf(" *** %22a *** Round-trip failed\n", x);
        else
            printf("%58.55g\n", x);
    }
    return 0;
}
Generally, the improvement in readability is admittedly little to none; it's surely a matter of opinion.
You can use the %a format specifier to print the value as hexadecimal floating point. Note that this is not the same as reinterpreting the float as an integer and printing the integer value.
For example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    float x;
    scanf("%f", &x);
    printf("x=%.7f\n", x);

    char str[20];
    sprintf(str, "%a", x);
    printf("str=%s\n", str);

    float y;
    sscanf(str, "%f", &y);
    printf("y=%.7f\n", y);

    printf("x==y: %d\n", (x == y));
    return 0;
}
With an input of 4, this outputs:
x=4.0000000
str=0x1p+2
y=4.0000000
x==y: 1
With an input of 3.3, this outputs:
x=3.3000000
str=0x1.a66666p+1
y=3.3000000
x==y: 1
As you can see from the output, the %a format specifier prints in exponential format with the significand in hex and the exponent (a power of two) in decimal. This format can then be converted directly back to the exact same value, as demonstrated by the equality check.

Understanding the maximum values that can be stored in floats in C

I have come across some behaviour with the float type in C that I do not understand, and I was hoping it might be explained. Using the macros defined in float.h I can determine the maximum/minimum values that the datatype can store on the given hardware. However, when performing a calculation that should not exceed these limits, I find that a typed float variable fails where a double succeeds.
The following is a minimal example, which compiles on my machine.
#include <stdio.h>
#include <stdlib.h>
#include <float.h>

int main(int argc, char **argv)
{
    int gridsize;
    long gridsize3;
    float *datagrid;
    float sumval_f;
    double sumval_d;
    long i;

    gridsize = 512;
    gridsize3 = (long)gridsize*gridsize*gridsize;

    datagrid = calloc(gridsize3, sizeof(float));
    if(datagrid == NULL)
    {
        free(datagrid);
        printf("Memory allocation failed\n");
        exit(0);
    }

    for(i=0; i<gridsize3; i++)
    {
        datagrid[i] += 1.0;
    }

    sumval_f = 0.0;
    sumval_d = 0.0;
    for(i=0; i<gridsize3; i++)
    {
        sumval_f += datagrid[i];
        sumval_d += (double)datagrid[i];
    }

    printf("\ngridsize3 = %e\n", (float)gridsize3);
    printf("FLT_MIN = %e\n", FLT_MIN);
    printf("FLT_MAX = %e\n", FLT_MAX);
    printf("DBL_MIN = %e\n", DBL_MIN);
    printf("DBL_MAX = %e\n", DBL_MAX);

    printf("\nfloat sum = %f\n", sumval_f);
    printf("double sum = %lf\n", sumval_d);
    printf("sumval_d/sumval_f = %f\n\n", sumval_d/(double)sumval_f);

    free(datagrid);
    return(0);
}
Compiling with gcc I find the output:
gridsize3 = 1.342177e+08
FLT_MIN = 1.175494e-38
FLT_MAX = 3.402823e+38
DBL_MIN = 2.225074e-308
DBL_MAX = 1.797693e+308
float sum = 16777216.000000
double sum = 134217728.000000
sumval_d/sumval_f = 8.000000
Whilst compiling with icc, sumval_f = 67108864.0, and hence the final ratio is instead 2.0*. Note that the float sum is incorrect, whilst the double sum is correct.
As far as I can tell the output of FLT_MAX suggests that the sum should fit into a float, and yet it seems to plateau out at either an eighth or a half of the full value.
Is there a compiler specific override to the values found using float.h?
Why is a double required to correctly find the sum of this array?
*Interestingly the inclusion of an if statement inside the for loop that prints values of the array causes the value to match the gcc output, i.e. an eighth of the correct sum, rather than a half.
The problem here isn't the range of values but the precision.
Assuming a 32-bit IEEE754 float, this datatype has a maximum of 24 bits of precision. This means that not all integers larger than 16777216 can be represented exactly.
So when your sum reaches 16777216, adding 1 to it is outside the precision of what the datatype can store, so the number doesn't get any bigger.
A (presumably) 64-bit double has 53 bits of precision. This is enough bits to hold all integer values up to your sum of 134217728, so it gives you an accurate result.
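Here is a minimal demonstration of that cutoff (my sketch, assuming IEEE-754 single precision):

#include <stdio.h>

int main(void)
{
    float f = 16777216.0f;  /* 2^24: the largest value before integer gaps appear */
    float g = f + 1.0f;     /* 2^24 + 1 is not representable; rounds back to 2^24 */
    float h = f + 2.0f;     /* 2^24 + 2 is even, so it still fits exactly */

    printf("f + 1 = %.1f\n", g);  /* 16777216.0 */
    printf("f + 2 = %.1f\n", h);  /* 16777218.0 */
    return 0;
}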
A float can precisely represent any integer between -16777215 and +16777215, inclusive. It can also represent all even integers between -2*16777215 and +2*16777215 (including +/- 2*8388608, i.e. 16777216), all multiples of 4 between -4*16777215 and +4*16777215, and likewise for all power-of-two scaling factors up to 2^104 (roughly 2.028E+31). Additionally, it can represent multiples of 1/2 from -16777215/2 to +16777215/2, multiples of 1/4 from -16777215/4 to +16777215/4, etc. down to multiples of 1/2^149 from -16777215/(2^149) to +16777215/(2^149).
Between any two numbers there are infinitely many possible values, but a computer cannot hold an infinite number of values. So a compromise is made: floating point numbers hold an approximation of the value.
This means that if you pick a value that is "more" than the stored floating point number, but not enough to arrive at the "next" storable approximation, then storing that logically bigger number won't actually change the floating point value.
The "error" in a floating point approximation is variable. For small numbers, the error is more precise; for bigger numbers, the error proportionally the same, but a bigger actual value.

Why float is more precise than it ought to be?

#include <stdio.h>
#include <float.h>

int main(int argc, char** argv)
{
    long double pival = 3.14159265358979323846264338327950288419716939937510582097494459230781640628620899L;
    float pival_float = pival;
    printf("%1.80f\n", pival_float);
    return 0;
}
The output I got on gcc is :
3.14159274101257324218750000000000000000000000000000000000000000000000000000000000
The float uses a 23-bit mantissa. So the largest fraction that can be represented is 2^23 = 8388608, i.e. 7 decimal digits of precision.
But the above output shows 23 decimal digits of precision (3.14159274101257324218750). I expected it to print 3.1415927000000000000....
What did I miss to understand ?
You only got 7 digits of precision. Pi is
3.1415926535897932384626433832795028841971693993751058209...
But the output you got from printing your float approximation to Pi was
3.14159274101257324218750000...
As you can see the values diverge starting from the 7th digit after the decimal point.
If you ask printf() for 80 digits after the decimal place, it will print out that many digits of the decimal representation of the binary value stored in the float, even if that many digits is far more than the precision allowed by the float representation.
A binary floating-point value can't represent 3.1415927 exactly (since that's not an exact binary fraction). The nearest value that it can represent is 3.1415927410125732421875, so that's the actual value of your pival_float. When you print pival_float with eighty digits, you see its exact value, plus a bunch of zeroes for good measure.
The closest float value to pi has binary encoding...
0 10000000 10010010000111111011011
...in which I've inserted spaces between the sign, exponent and mantissa. The exponent is biased, so the bits above encode a multiplier of 2^1 == 2, and the mantissa encodes a fraction above 1, with the first bit being worth a half, and each bit thereafter being worth half as much as the bit before.
Therefore, the mantissa bits above are worth:
1 x 0.5
0 x 0.25
0 x 0.125
1 x 0.0625
0 x 0.03125
0 x 0.015625
1 x 0.0078125
0 x 0.00390625
0 x 0.001953125
0 x 0.0009765625
0 x 0.00048828125
1 x 0.000244140625
1 x 0.0001220703125
1 x 0.00006103515625
1 x 0.000030517578125
1 x 0.0000152587890625
1 x 0.00000762939453125
0 x 0.000003814697265625
1 x 0.0000019073486328125
1 x 0.00000095367431640625
0 x 0.000000476837158203125
1 x 0.0000002384185791015625
1 x 0.00000011920928955078125
So, the least significant bit after multiplying by the exponent-encoded value "2" is worth...
0.000 000 238 418 579 101 562 5
I added spaces to make it easier to count that the last non-0 digit is in the 22nd decimal place.
The value the question says printf() displayed appears below alongside the contribution of the least significant bit in the mantissa:
3.14159274101257324218750000000000000000000000000000000000000000000000000000000000
0.0000002384185791015625
Clearly the least significant digits line up properly. If you added up all the mantissa contributions above, added the implicit 1, then multiplied by 2, you'd get the exact value printf displayed. That explains how the float value is precisely (in the mathematical sense of zero randomness) the value shown by printf, but the comparison below against pi shows only the first 6 decimal places are accurate given the particular value we want it to store.
3.14159274101257324218750000000000000000000000000000000000000000000000000000000000
3.14159265358979323846264338327950288419716939937510582097494459230781640628620899
^
In computing, it's common to refer to the precision of floating point types when we're actually interested in the accuracy we can rely on. I suppose you could argue that while taken in isolation the precision of floats and doubles is infinite, the rounding necessary when using them to approximate numbers that they can't encode perfectly is for most practical purposes random, and in that sense they offer finite significant digits of precision at encoding such numbers.
So, printf isn't wrong to display so many digits; some application might be using a float to encode that exact number (almost certainly because the nature of the app's calculations involve sums of 1/2^n values), but that'd be the exception rather than the rule.
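To check the arithmetic above mechanically, you can rebuild the value from its bit fields (a sketch I've added, assuming IEEE-754 single precision):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Fields of the closest float to pi: 0 10000000 10010010000111111011011 */
    unsigned sign     = 0;
    unsigned exponent = 0x80;      /* biased: 128 - 127 = 1 */
    unsigned mantissa = 0x490FDB;  /* the 23 bits 10010010000111111011011 */

    /* value = (-1)^sign * (1 + mantissa/2^23) * 2^(exponent - 127) */
    double value = (sign ? -1.0 : 1.0)
                 * (1.0 + mantissa / 8388608.0)  /* 2^23 = 8388608 */
                 * ldexp(1.0, (int)exponent - 127);

    printf("%.23f\n", value);  /* 3.14159274101257324218750 */
    return 0;
}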
Carrying on from Tony's answer, one way to prove this limitation on decimal precision to yourself in a practical way is simply to declare pi to as many decimal places as you like while assigning the value to a float, then look at how it is stored in memory.
What you find is that no matter how many decimal places you give it, the 32-bit value in memory will always be the equivalent of the unsigned value 1078530011, or 01000000010010010000111111011011 in binary. That is due, as others explained, to the IEEE-754 single-precision floating point format. Below is a simple bit of code that will allow you to prove to yourself that this limitation means pi, as a float, is limited to six digits of decimal precision:
#include <stdio.h>
#include <stdlib.h>

#if defined (__LP64__) || defined (_LP64)
# define BUILD_64 1
#endif

#ifdef BUILD_64
# define BITS_PER_LONG 64
#else
# define BITS_PER_LONG 32
#endif

char *binpad (unsigned long n, size_t sz);

int main (void) {

    float fPi = 3.1415926535897932384626433;

    /* type-pun the float's bits to an unsigned int to inspect them */
    printf ("\n fPi : %f, in memory : %s unsigned : %u\n\n",
            fPi, binpad (*(unsigned*)&fPi, 32), *(unsigned*)&fPi);

    return 0;
}

/* render the low 'sz' bits of 'n' as a string of '0'/'1' characters */
char *binpad (unsigned long n, size_t sz)
{
    static char s[BITS_PER_LONG + 1] = {0};
    char *p = s + BITS_PER_LONG;
    register size_t i;

    for (i = 0; i < sz; i++)
        *(--p) = (n>>i & 1) ? '1' : '0';

    return p;
}
Output
$ ./bin/ieee754_pi
fPi : 3.141593, in memory : 01000000010010010000111111011011 unsigned : 1078530011

Computing floating point accuracy (K&R 2-1)

I found Stevens Computing Services – K & R Exercise 2-1 a very thorough answer to K&R 2-1. This slice of the full code computes the maximum value of a float type in the C programming language.
Unfortunately, my theoretical comprehension of float values is quite limited. I know they are composed of a significand (mantissa) scaled by a magnitude which is a power of 2.
#include <stdio.h>
#include <limits.h>
#include <float.h>

int main()
{
    float flt_a, flt_b, flt_c, flt_m;

    /* FLOAT */
    printf("\nFLOAT MAX\n");
    printf("<limits.h> %E ", FLT_MAX);

    flt_a = 2.0;
    flt_b = 1.0;
    while (flt_a != flt_b) {
        flt_m = flt_b;           /* MAX POWER OF 2 IN MANTISSA */
        flt_a = flt_b = flt_b * 2.0;
        flt_a = flt_a + 1.0;
    }
    flt_m = flt_m + (flt_m - 1); /* MAX VALUE OF MANTISSA */

    flt_a = flt_b = flt_c = flt_m;
    while (flt_b == flt_c) {
        flt_c = flt_a;
        flt_a = flt_a * 2.0;
        flt_b = flt_a / 2.0;
    }
    printf("COMPUTED %E\n", flt_c);
    return 0;
}
I understand that the latter part basically checks to which power of 2 it's possible to raise the significand with a three variable algorithm. What about the first part?
I can see that a progression of multiples of 2 should eventually determine the value of the significand, but I tried to trace a few small numbers to check how it should work and it failed to find the right values...
======================================================================
What are the concepts this program is based upon, and does this program get more precise as longer and non-integer numbers have to be found?
The first loop determines the number of bits contributing to the significand by finding the least power of 2 such that adding 1 to it (using floating-point arithmetic) fails to change its value. If that's the nth power of two, then the significand uses n bits, because with n bits you can express all the integers from 0 through 2^n - 1, but not 2^n. The floating-point representation of 2^n must therefore have an exponent large enough that the (binary) units digit is not significant.
By that same token, having found the first power of 2 whose float representation has worse than unit precision, the maximum float value that does have unit precision is one less. That value is recorded in variable flt_m.
The second loop then tests for the maximum exponent by starting with the maximum unit-precision value, and repeatedly doubling it (thereby increasing the exponent by 1) until it finds that the result cannot be converted back by halving it. The maximum float is the value before that final doubling.
Do note, by the way, that all the above supposes a base-2 floating-point representation. You are unlikely to run into anything different, but C does not actually require any specific representation.
With respect to the second part of your question,
does this program get more precise as longer and non-integer numbers have to be found?
the program takes care to avoid losing precision. It does assume a binary floating-point representation such as you described, but it will work correctly regardless of the number of bits in the significand or exponent of such a representation. No non-integers are involved, but the program already deals with numbers that have worse than unit precision, and with numbers larger than can be represented with type int.
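As a quick sanity check of the first loop's logic, its bit count can be compared against FLT_MANT_DIG (my sketch, assuming binary floating point):

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* Count doublings until 2^n + 1 collapses to 2^n in float arithmetic. */
    float b = 1.0f;
    int n = 0;
    while ((float)(b + 1.0f) != b) {  /* cast guards against excess precision */
        b *= 2.0f;
        ++n;
    }
    printf("significand bits found: %d, FLT_MANT_DIG: %d\n", n, FLT_MANT_DIG);
    return 0;
}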

Given a double, need to find how many digits in total

I have a double which is not necessarily positive, but usually is. It can be 0.xxxx000 or X.xxxx00000 or XX.00000 or 0.xxx0xxx00000, where eventually there are all 0's to the right of the last number. I need to keep track of how many digits there are. I've been having trouble with this, any help? This is C.
A double has 52 mantissa bits plus an implicit "1" bit, so you should be able to type-pun a double pointer to a 64-bit integer (getting the raw bits into an integer), &= this with (1ULL<<52)-1, and |= the result with (1ULL<<52).
The log10 of that would be the number of decimal digits.
Though, I'm almost inclined to say "go with jonsca's solution" because it is so ingeniously simple (it deserves a +1 in any case for being KISS).
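A sketch of that bit-extraction step (my addition; it uses memcpy instead of a pointer cast to avoid strict-aliasing trouble, assumes IEEE-754 doubles, and ignores zeros, denormals, infinities and NaNs):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    double x = 0.15625;  /* hypothetical example value */
    uint64_t bits;

    memcpy(&bits, &x, sizeof bits);  /* raw bits of the double */

    uint64_t mantissa = bits & ((1ULL << 52) - 1);  /* keep the 52 mantissa bits */
    mantissa |= 1ULL << 52;  /* restore the implicit leading 1 */

    printf("significand as an integer: %llu\n", (unsigned long long) mantissa);
    return 0;
}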
Use sprintf to turn it into a string and do whatever counting/testing you need to do on the digits
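One literal reading of that suggestion (my sketch; what exactly should count as a digit depends on the format you print with):

#include <ctype.h>
#include <stdio.h>

/* Count the digit characters in a double's text form. */
int count_digits(double x)
{
    char buf[64];
    int count = 0;

    snprintf(buf, sizeof buf, "%.17g", x);  /* enough digits to round-trip */
    for (char *p = buf; *p; ++p)
        if (isdigit((unsigned char) *p))
            ++count;
    return count;
}

int main(void)
{
    printf("%d\n", count_digits(3.25));  /* "3.25" -> 3 */
    printf("%d\n", count_digits(12.5));  /* "12.5" -> 3 */
    return 0;
}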
The representation of a double is not decimal - it is binary (like all the other numbers in a computer), so the problem as defined makes little sense really. Consider an example: the number 1.2 converted to binary is 1 + 1/5 = 1.(0011) binary [0011 repeating]. If you cut it to 52 bits of precision (a double), you'll have 1.0011001100110011001100110011001100110011001100110011 binary, which equals 1 + (1 - 1/2^52)/5. If you represent this number precisely in decimal form you will get 52 decimals before the trailing zeroes, which is a lot more than the maximum decimal precision of a double, about 16 digits (all the digits of the representation from 17 to 52 are just meaningless).
Anyway, if you have a purely abstract problem (like in school):
#include <math.h>

int f( double x )
{
    int n = 0;
    x = fabs(x);
    x -= floor(x);               /* keep only the fractional part */
    while( x != floor(x) )
    {
        x *= 2;
        ++n;
    }
    return n;
}
The function returns the number of binary digits in the fractional part before the trailing zeroes, which is also the number of decimal digits before the trailing zeroes (the last decimal digit is always 5 if the returned value is > 0).
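For example (my addition; assumes the f above is in scope):

#include <stdio.h>

int f(double x);  /* the function above */

int main(void)
{
    printf("%d\n", f(0.5));    /* 0.5   = 0.1 binary   -> 1 */
    printf("%d\n", f(0.625));  /* 0.625 = 0.101 binary -> 3 */
    printf("%d\n", f(3.0));    /* no fractional part   -> 0 */
    return 0;
}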
