double representing values up to 16 significant digits


I was reading values from a byte-positioned file and had to interpret one field as a decimal number with 20 digits in total, the last 6 of which form the fractional part.
So the value could be, for example, 99999999999999999999 (20 9s), to be treated as a floating-point number in which the last six 9s are the fractional part.
Now when I tried to do it with the following method:
#include<stdio.h>
#include<stdlib.h>
int main(void)
{
char c[]="999999999.999999"; //9 nines to the left of decimal and 6 nines to the right
double d=0;
sscanf(c,"%lf",&d);
printf("%f\n",d);
return 0;
}
OUTPUT:999999999.999999 same as input
Then I increased the number of 9s to the left of the decimal point by one (making 10 nines to the left)
and the output became 9999999999.999998.
Adding one more 9 to the left of the decimal point made the result round to 100000000000.000000
For my usage, values with 14 digits to the left of the decimal point and 6 to the right of it can occur, and I want them converted exactly as given in the input, without any truncation or rounding. Also, I read somewhere that double can "represent a value with up to 16 significant digits", but when I used only 9999999999.999999 (10 nines to the left and 6 to the right) the result was 9999999999.999998, which contradicts that statement.
What should be done in this case?

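Floating-point numbers are inherently approximate. A double holds a fixed number of significant bits (53 on typical IEEE 754 systems, roughly 15 to 17 decimal digits), so the bigger the number, the wider the gaps between the values it can represent.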
What should be done in this case?
You could try a long double, but even that's not guaranteed to be precise enough.
long double z = 99999999999999.999999L;
printf("%Lf\n", z); // 100000000000000.000000
You could store it as an integer and remember it's actually 1,000,000 times smaller. Unfortunately, 20 digits is a bit too large for even an unsigned 64 bit integer. It's 3 bits too large.
#include <inttypes.h>
int main() {
uint64_t x = 99999999999999999999ULL;
}
test.c:4:18: error: integer literal is too large to be represented in any integer type
uint64_t x = 99999999999999999999ULL;
You could make a struct that stores the pieces separately.
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>
typedef struct {
bool positive;
uint64_t integer;
uint64_t decimal; // six fractional digits, 0..999999
} bignum;
int main() {
bignum num = {
.positive = true,
.integer = 99999999999999,
.decimal = 999999
};
// %06 keeps leading zeros in the fractional part (e.g. 000042)
printf("%s" "%"PRIu64 "." "%06"PRIu64 "\n",
num.positive ? "" : "-", num.integer, num.decimal
);
}
At which point you're building your own arbitrary-precision arithmetic library. Instead, use an existing one such as GMP. This can store numbers of any size. The trade off is speed and you have to use special types and functions.
#include <gmp.h>
int main() {
mpf_t y;
mpf_init_set_str(y, "99999999999999.999999", 10);
gmp_printf("%Ff\n", y);
}

Related

Display decimal number

Does somebody know how to control the number of digits shown to the left of the decimal point?
I want to display a number with 5 digits to the left of the point and 4 digits to the right of it.
For example, for 12345/100 I want to get 00123.4500

printf("%010.4f", (double)12345/100);
man 3 printf says:
The overall syntax of a conversion specification is:
%[argument$][flags][width][.precision][length modifier]conversion
.4 means the floating point precision to print 4 decimals.
0 is a flag that means to pad with 0s.
10 is the minimum field width. With 5 digits to the left of the decimal point, 4 to the right, and the point itself, the total is 10 characters.

While the answer of alinsoar is correct and the most common case, I would like to mention another possibility which sometimes is useful: fixed-point representation.
A uint32_t can hold any value of up to 9 decimal digits. We may choose to imagine a decimal point before the 4th digit from the right. The example would then look like this:
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>
int main(void)
{
uint32_t a = 123450000; // 12345.0000
uint32_t b = a / 100; // 123.4500
uint32_t i = b / 10000; // integer part
uint32_t f = b % 10000; // fractional part
printf("%05" PRIu32 ".%04" PRIu32, i, f); // 5 integer digits, 4 fractional digits
}
Using fixed-point one has to pay special attention to the value range and e.g. multiplication needs special handling, but there are cases where it is preferable over the floating-point representation.

Largest integer that can be stored in long double

EDIT: After some discussion in the comments it turned out that, because of a lack of knowledge about how floating-point numbers are implemented in C, I asked something different from what I meant to ask.
I wanted to use (do operations with) integers larger than those I can have with unsigned long long (which for me is 8 bytes), possibly without resorting to arrays or bigint libraries. Since my long double is 16 bytes, I thought it might be possible by just switching type. It turned out that even though it is possible to represent larger integers, you can't do operations with these larger long double integers without losing precision. So it's not possible to achieve what I wanted to do. Actually, as stated in the comments, it is not possible for me. But in general, whether it is possible or not depends on the floating-point characteristics of your long double.
// end of EDIT
I am trying to understand what's the largest integer that I can store in a long double.
I know it depends on the environment the program is built in, but I don't know exactly how. I have sizeof(long double) == 16, for what it's worth.
Now in this answer they say that the maximum value for a 64-bit double should be 2^53, which is around 9 x 10^15, and exactly 9007199254740992.
When I run the following program, it just works:
#include <stdio.h>
int main() {
long double d = 9007199254740992.0L, i;
printf("%Lf\n", d);
for(i = -3.0; i < 4.0; i++) {
printf("%.Lf) %.1Lf\n", i, d+i);
}
return 0;
}
It works even with 11119007199254740992.0L that is the same number with four 1s added at the start. But when I add one more 1, the first printf works as expected, while all the others show the same number of the first print.
So I tried to get the largest value of my long double with this program
#include <stdio.h>
#include <math.h>
int main() {
long double d = 11119007199254740992.0L, i;
for(i = 0.0L; d+i == d+i-1.0; i++) {
if( !fmodl(i, 10000.0L) ) printf("%Lf\n", i);
}
printf("%.Lf\n", i);
return 0;
}
But it prints 0.
(Edit: I just realized that I needed the condition != in the for)
Also in the same answer, they say that the largest possible value of a double is DBL_MAX or approximately 1.8 x 10^308.
I have no idea what that means, but if I run
printf("%e\n", LDBL_MAX);
I get every time a different value that is always around 6.9 x 10^(-310).
(Edit: I should have used %Le, getting as output a value around 1.19 x 10^4932)
I took LDBL_MAX from here.
I also tried this one
printf("%d\n", LDBL_MAX_10_EXP);
That gives the value 4932 (which I also found in this C++ question).
Since we have 16 bytes for a long double, even if all of them were for the integer part of the type, we would be able to store numbers till 2^128, that is around 3.4 x 10^38. So I don't get what 308, -310 and 4932 are supposed to mean.
Is someone able to tell me how can I find out what's the largest integer that I can store as long double?

Inasmuch as you express in comments that you want to use long double as a substitute for long long to obtain increased range, I assume that you also require unit precision. Thus, you are asking for the largest number representable by the available number of mantissa digits (LDBL_MANT_DIG) in the radix of the floating-point representation (FLT_RADIX). In the very likely event that FLT_RADIX == 2, you can compute that value like so:
#include <float.h>
#include <math.h>
long double get_max_integer_equivalent() {
long double max_bit = ldexpl(1, LDBL_MANT_DIG - 1);
return max_bit + (max_bit - 1);
}
The ldexp family of functions scale floating-point values by powers of 2, analogous to what the bit-shift operators (<< and >>) do for integers, so the above is similar to
// not reliable for the purpose!
unsigned long long max_bit = 1ULL << (LDBL_MANT_DIG - 1);
return max_bit + (max_bit - 1);
Inasmuch as you suppose that your long double provides more mantissa digits than your long long has value bits, however, you must assume that bit shifting would overflow.
There are, of course, much larger values that your long double can express, all of them integers. But they do not have unit precision, and thus the behavior of your long double will diverge from the expected behavior of integers when its values are larger. For example, if long double variable d contains a larger value then at least one of d + 1 == d and d - 1 == d will likely evaluate to true.

You can print the maximum values on your machine using the macros ULLONG_MAX (from limits.h) and LDBL_MAX (from float.h).
In https://www.geeksforgeeks.org/climits-limits-h-cc/ is a C++ example.
The format specifier for printing unsigned long long with printf() is %llu; for printing long double it is %Lf.
printf("unsigned long long int: %llu ",(unsigned long long) ULLONG_MAX);
printf("long double: %Lf ",(long double) LDBL_MAX);
https://www.tutorialspoint.com/format-specifiers-in-c
Is also in Printing unsigned long long int Value Type Returns Strange Results

Assuming you mean "stored without loss of information", LDBL_MANT_DIG gives the number of bits used for the floating-point mantissa, so that's how many bits of an integer value that can be stored without loss of information.*
You'd need 128-bit integers to easily determine the maximum integer value that can be held in a 128-bit float, but this will at least emit the hex value (this assumes unsigned long long is 64 bits - you can use CHAR_BIT and sizeof( unsigned long long ) to get a portable answer):
#include <stdio.h>
#include <float.h>
#include <limits.h>
int main( int argc, char **argv )
{
int tooBig = 0;
unsigned long long shift = LDBL_MANT_DIG;
if ( shift >= 64 )
{
tooBig = 1;
shift -= 64;
}
unsigned long long max = ( 1ULL << shift ) - 1ULL;
printf( "Max integer value: 0x" );
// don't emit an extraneous zero if LDBL_MANT_DIG is
// exactly 64
if ( max )
{
printf( "%llx", max );
}
if ( tooBig )
{
printf( "%llx", ULLONG_MAX );
}
printf( "\n" );
return( 0 );
}
* - pedantically, it's the number of digits in FLT_RADIX base, but that base is almost certainly 2.

I've made a program in C that takes two inputs, x and n, and raises x to the power of n. 10^10 doesn't work, what happened?

#include <cs50.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
float isEven(int n)
{
return n % 2 == 0;
}
float isOdd(int n)
{
return !isEven(n);
}
float power(int x, int n)
{
// base case
if (n == 0)
{
return 1;
}
// recursive case: n is negative
else if (n < 0)
{
return (1 / power(x, -n));
}
// recursive case: n is odd
else if (isOdd(n))
{
return x * power(x, n-1);
}
// recursive case: n is positive and even
else if (isEven(n))
{
int y = power(x, n/2);
return y * y;
}
return true;
}
int displayPower(int x, int n)
{
printf("%d to the %d is %f", x, n, power(x, n));
return true;
}
int main(void)
{
int x = 0;
printf("What will be the base number?");
scanf("%d", &x);
int n = 0;
printf("What will be the exponent?");
scanf("%d", &n);
displayPower(x, n);
}
For example, here is a pair of inputs that works:
./exponentRecursion
What will be the base number?10
What will be the exponent?9
10 to the 9 is 1000000000.000000
But this is what I get for 10^10:
./exponentRecursion
What will be the base number?10
What will be the exponent?10
10 to the 10 is 1410065408.000000
Why does this write such a weird number?
BTW, 10^11 returns 14100654080.000000, exactly ten times the above.
Perhaps it may be that there is some "Limit" to the data type that I am using? I am not sure.

Your variable x is of type int, and so is the intermediate y inside power(), where y * y is computed. The most common internal representation of int is 32 bits. That is a signed binary number, so only 31 bits are available for representing a magnitude, with the usual maximum positive int value being 2^31 - 1 = 2,147,483,647. Anything larger than that will overflow, typically giving a smaller magnitude and possibly a negative sign.
For a greater range, you can change the type to long long (usually 64 bits, about 18 digits) or double (usually 64 bits, with 53 bits of precision for about 15 to 16 digits).
(Warning: Many implementations use the same representation for int and long, so using long might not be an improvement.)

A float only has enough precision for about 7 decimal digits. Any number with more digits than that will only be an approximation.
If you switch to double you'll get about 16 digits of precision.

When you start handling large numbers with the basic data types in C, you can run into trouble.
Integral types have a limited range of values (such as about 4x10^9 for a 32-bit unsigned integer). Floating-point types have a much larger range (though not infinite) but limited precision. For example, IEEE 754 double precision can give you about 16 decimal digits of precision in the range +/-10^308.
To recover both of these aspects, you'll need to use a bignum library of some sort, such as MPIR.

If you mix different data types in a C program, the compiler performs several implicit conversions. Because the compiler follows strict rules, one can work out exactly what happens in your program and why.
As I do not know all of these conversion rules, I did the following: I estimated the maximum precision needed for the biggest result, then explicitly cast every variable and function in the calculation to that precision, even where it is not necessary. Normally this works as a workaround.

Why float is more precise than it ought to be?

#include <stdio.h>
#include <float.h>
int main(int argc, char** argv)
{
long double pival = 3.14159265358979323846264338327950288419716939937510582097494459230781640628620899L;
float pival_float = pival;
printf("%1.80f\n", pival_float);
return 0;
}
The output I got on gcc is :
3.14159274101257324218750000000000000000000000000000000000000000000000000000000000
The float uses a 23-bit mantissa. So the largest fraction that can be represented is 2^23 = 8388608, i.e. 7 decimal digits of precision.
But the above output shows 23 decimal digits of precision (3.14159274101257324218750). I expected it to print 3.1415927000000000000...
What am I missing?

You only got 7 digits of precision. Pi is
3.1415926535897932384626433832795028841971693993751058209...
But the output you got from printing your float approximation to Pi was
3.14159274101257324218750000...
As you can see the values diverge starting from the 7th digit after the decimal point.
If you ask printf() for 80 digits after the decimal place, it will print out that many digits of the decimal representation of the binary value stored in the float, even if that many digits is far more than the precision allowed by the float representation.

A binary floating-point value can't represent 3.1415927 exactly (since that's not an exact binary fraction). The nearest value that it can represent is 3.1415927410125732421875, so that's the actual value of your pival_float. When you print pival_float with eighty digits, you see its exact value, plus a bunch of zeroes for good measure.

The closest float value to pi has binary encoding...
0 10000000 10010010000111111011011
...in which I've inserted spaces between the sign, exponent and mantissa. The exponent is biased, so the bits above encode a multiplier of 2^1 == 2, and the mantissa encodes a fraction above 1, with the first bit being worth a half, and each bit thereafter being worth half as much as the bit before.
Therefore, the mantissa bits above are worth:
1 x 0.5
0 x 0.25
0 x 0.125
1 x 0.0625
0 x 0.03125
0 x 0.015625
1 x 0.0078125
0 x 0.00390625
0 x 0.001953125
0 x 0.0009765625
0 x 0.00048828125
1 x 0.000244140625
1 x 0.0001220703125
1 x 0.00006103515625
1 x 0.000030517578125
1 x 0.0000152587890625
1 x 0.00000762939453125
0 x 0.000003814697265625
1 x 0.0000019073486328125
1 x 0.00000095367431640625
0 x 0.000000476837158203125
1 x 0.0000002384185791015625
1 x 0.00000011920928955078125
So, the least significant bit after multiplying by the exponent-encoded value "2" is worth...
0.000 000 238 418 579 101 562 5
I added spaces to make it easier to count that the last non-0 digit is in the 22nd decimal place.
The value the question says printf() displayed appears below alongside the contribution of the least significant bit in the mantissa:
3.14159274101257324218750000000000000000000000000000000000000000000000000000000000
0.0000002384185791015625
Clearly the least significant digits line up properly. If you added up all the mantissa contributions above, added the implicit 1, then multiplied by 2, you'd get the exact value printf displayed. That explains how the float value is precisely (in the mathematical sense of zero randomness) the value shown by printf, but the comparison below against pi shows only the first 6 decimal places are accurate given the particular value we want it to store.
3.14159274101257324218750000000000000000000000000000000000000000000000000000000000
3.14159265358979323846264338327950288419716939937510582097494459230781640628620899
        ^
In computing, it's common to refer to the precision of floating point types when we're actually interested in the accuracy we can rely on. I suppose you could argue that while taken in isolation the precision of floats and doubles is infinite, the rounding necessary when using them to approximate numbers that they can't encode perfectly is for most practical purposes random, and in that sense they offer finite significant digits of precision at encoding such numbers.
So, printf isn't wrong to display so many digits; some application might be using a float to encode that exact number (almost certainly because the nature of the app's calculations involve sums of 1/2^n values), but that'd be the exception rather than the rule.

Carrying on from Tony's answer, one practical way to convince yourself of this limit on decimal precision is to declare pi to as many decimal places as you like while assigning the value to a float, then look at how it is stored in memory.
What you find is that no matter how many decimal places you give it, the 32-bit value in memory will always be the equivalent of the unsigned value 1078530011, or 01000000010010010000111111011011 in binary. That is due, as others explained, to the IEEE-754 single-precision floating-point format. Below is a simple bit of code that lets you verify that pi, as a float, is limited to six decimal places of precision:
#include <stdio.h>
#include <stdlib.h>
#if defined (__LP64__) || defined (_LP64)
# define BUILD_64 1
#endif
#ifdef BUILD_64
# define BITS_PER_LONG 64
#else
# define BITS_PER_LONG 32
#endif
char *binpad (unsigned long n, size_t sz);
int main (void) {
float fPi = 3.1415926535897932384626433;
printf ("\n fPi : %f, in memory : %s unsigned : %u\n\n",
fPi, binpad (*(unsigned*)&fPi, 32), *(unsigned*)&fPi);
return 0;
}
char *binpad (unsigned long n, size_t sz)
{
static char s[BITS_PER_LONG + 1] = {0};
char *p = s + BITS_PER_LONG;
register size_t i;
for (i = 0; i < sz; i++)
*(--p) = (n>>i & 1) ? '1' : '0';
return p;
}
Output
$ ./bin/ieee754_pi
fPi : 3.141593, in memory : 01000000010010010000111111011011 unsigned : 1078530011

pow() seems to be out by one here

What's going on here:
#include <stdio.h>
#include <math.h>
int main(void) {
printf("17^12 = %lf\n", pow(17, 12));
printf("17^13 = %lf\n", pow(17, 13));
printf("17^14 = %lf\n", pow(17, 14));
}
I get this output:
17^12 = 582622237229761.000000
17^13 = 9904578032905936.000000
17^14 = 168377826559400928.000000
The results for 13 and 14 do not match Wolfram Alpha; compare:
12: 582622237229761.000000
582622237229761
13: 9904578032905936.000000
9904578032905937
14: 168377826559400928.000000
168377826559400929
Moreover, it's not wrong by some strange fraction - it's wrong by exactly one!
If this is down to me reaching the limits of what pow() can do for me, is there an alternative that can calculate this? I need a function that can calculate x^y, where x^y is always less than ULLONG_MAX.

pow works with double numbers. These represent numbers of the form s * 2^e where s is a 53 bit integer. Therefore double can store all integers below 2^53, but only some integers above 2^53. In particular, it can only represent even numbers > 2^53, since for e > 0 the value is always a multiple of 2.
17^13 needs 54 bits to represent exactly, so e is set to 1 and hence the calculated value becomes even number. The correct value is odd, so it's not surprising it's off by one. Likewise, 17^14 takes 58 bits to represent. That it too is off by one is a lucky coincidence (as long as you don't apply too much number theory), it just happens to be one off from a multiple of 32, which is the granularity at which double numbers of that magnitude are rounded.
For exact integer exponentiation, you should use integers all the way. Write your own double-free exponentiation routine. Use exponentiation by squaring if y can be large, but I assume it's always less than 64, making this issue moot.

The numbers you get are too big to be represented with a double accurately. A double-precision floating-point number has essentially 53 significant binary digits and can represent all integers up to 2^53 or 9,007,199,254,740,992.
For higher numbers, the last binary digits are lost and the result of your calculation is rounded to the nearest number that can be represented as a double. For 17^13, which is only slightly above the limit, this is the closest even number. For numbers greater than 2^54 this is the closest number that is divisible by four, and so on.

If your input arguments are non-negative integers, then you can implement your own pow.
Recursively:
unsigned long long pow(unsigned long long x,unsigned int y)
{
if (y == 0)
return 1;
if (y == 1)
return x;
return pow(x,y/2)*pow(x,y-y/2);
}
Iteratively:
unsigned long long pow(unsigned long long x,unsigned int y)
{
unsigned long long res = 1;
while (y--)
res *= x;
return res;
}
Efficiently:
unsigned long long pow(unsigned long long x,unsigned int y)
{
unsigned long long res = 1;
while (y > 0)
{
if (y & 1)
res *= x;
y >>= 1;
x *= x;
}
return res;
}

A small addition to the other good answers: on the x86 architecture the x87 80-bit extended format is usually available, and most C compilers support it via the long double type. This format can represent all integers up to 2^64 without gaps.
<math.h> has an analogue of pow() intended for long double values: powl(). Note also that the format specifier for long double differs from the one for double: it is %Lf. So the correct program using the long double type looks like this:
#include <stdio.h>
#include <math.h>
int main(void) {
printf("17^12 = %Lf\n", powl(17, 12));
printf("17^13 = %Lf\n", powl(17, 13));
printf("17^14 = %Lf\n", powl(17, 14));
}
As Stephen Canon noted in comments there is no guarantee that this program should give exact result.
