Why are my exponential numbers being rounded? (C language)

I am getting unexpected results when printing some doubles. Some rounding is taking place, and I'm not sure why.
#include <stdio.h>
int main(void)
{
double d1 = 0;
double d2 = 0;
d1 = 1.2345678901234567e16;
d2 = 112233445566778899.0;
printf("d1: %.0lf\n", d1);
printf("d2: %.0lf\n", d2);
return 0;
}
The results of running the program are:
d1: 12345678901234568
d2: 112233445566778900
In the first case, I'm not sure why the last digit (the 7) got rounded to an 8, if there are no numbers after it.
In the second case, I also don't know why the number in the hundreds position got rounded. Doubles should accommodate numbers much larger than these without rounding.
Thanks

Not "much larger" - in fact you're right at the limit for "accuracy". A double has 53 bits of accuracy. Your first number is about 10^16, which would need about 16/log10(2) ≈ 53.15 bits to be accurate to within an integer.

“Doubles should accommodate numbers much larger than these without rounding.” Why do you think so?
An IEEE standard double (which is what you are using) has 53 bits (binary digits) of precision.
Go to Wolfram Alpha and ask it for the binary representation of 12345678901234567. It will tell you that the binary form has 54 digits. Therefore it cannot be represented exactly as a double.
Your second number requires 57 digits, so it too cannot be represented exactly.

Doubles should accommodate numbers much larger than these without rounding - yes, if they are powers of 2, for example. If the distance between the leftmost and rightmost 1 in their binary representation is more than 53 bits, they will be rounded.

A 64-bit double has only about 16 decimal digits of precision; you're simply reaching the precision limits of the data type.

Related

How are results rounded in floating-point arithmetic?

I wrote this code that simply sums a list of n numbers, to practice with floating point arithmetic, and I don't understand this:
I am working with float, this means I have 7 digits of precision, therefore, if I do the operation 10002*10002=100040004, the result in data type float will be 100040000.000000, since I lost any digit beyond the 7th (the program still knows the exponent, as seen here).
If the input in this program is
3
10000
10001
10002
You will see that, however, when this program computes 30003*30003=900180009 we have 30003*30003=900180032.000000
I understand this 32 appears because I am working with float, and my goal is not to make the program more precise but to understand why this is happening. Why is it 900180032.000000 and not 900180000.000000? Why does this decimal noise (32) appear in 30003*30003 and not in 10002*10002, even though the magnitudes of the numbers are similar? Thank you for your time.
#include <stdio.h>
#include <math.h>
#define MAX_SIZE 200
int main()
{
int numbers[MAX_SIZE];
int i, N;
float sum=0;
float sumb=0;
float sumc=0;
printf("introduce n" );
scanf("%d", &N);
printf("write %d numbers:\n", N);
for(i=0; i<N; i++)
{
scanf("%d", &numbers[i]);
}
int r=0;
while (r<N){
sum=sum+numbers[r];
sumb=sumb+(numbers[r]*numbers[r]);
printf("sum is %f\n",sum);
printf("sumb is %f\n",sumb);
r++;
}
sumc=(sum*sum);
printf("sumc is %f\n",sumc);
}
As explained below, the computed result of multiplying 10,002 by 10,002 must be a multiple of eight, and the computed result of multiplying 30,003 by 30,003 must be a multiple of 64, due to the magnitudes of the numbers and the number of bits available for representing them. Although your question asks about “decimal noise,” there are no decimal digits involved here. The results are entirely due to rounding to multiples of powers of two. (Your C implementation appears to use the common IEEE 754 format for binary floating-point.)
When you multiply 10,002 by 10,002, the computed result must be a multiple of eight. I will explain why below. The mathematical result is 100,040,004. The nearest multiples of eight are 100,040,000 and 100,040,008. They are equally far from the exact result, and the rule used to break ties chooses the even multiple (100,040,000 is eight times 12,505,000, an even number, while 100,040,008 is eight times 12,505,001, an odd number).
Many C implementations use IEEE 754 32-bit basic binary floating-point for float. In this format, a number is represented as an integer M multiplied by a power of two 2^e. The integer M must be less than 2^24 in magnitude. The exponent e may be from −149 to 104. These limits come from the numbers of bits used to represent the integer and the exponent.
So all float values in this format have the value M • 2^e for some M and some e. There are no decimal digits in the format, just an integer multiplied by a power of two.
Consider the number 100,040,004. The biggest M we can use is 16,777,215 (2^24−1). That is not big enough that we can write 100,040,004 as M • 2^0. So we must increase the exponent. Even with 2^2, the biggest we can get is 16,777,215 • 2^2 = 67,108,860. So we must use 2^3. And that is why the computed result must be a multiple of eight, in this case.
So, to produce a result for 10,002•10,002 in float, the computer uses 12,505,000 • 2^3, which is 100,040,000.
In 30,003•30,003, the result must be a multiple of 64. The exact result is 900,180,009. 2^5 is not enough because 16,777,215•2^5 is 536,870,880. So we need 2^6, which is 64. The two nearest multiples of 64 are 900,179,968 and 900,180,032. In this case, the latter is closer (23 away versus 41 away), so it is chosen.
(While I have described the format as an integer times a power of two, it can also be described as a binary numeral with one binary digit before the radix point and 23 binary digits after it, with the exponent range adjusted to compensate. These are mathematically equivalent. The IEEE 754 standard uses the latter description. Textbooks may use the former description because it makes analyzing some of the numerical properties easier.)
Floating point arithmetic is done in binary, not in decimal.
Floats have 24 binary bits of precision in the significand: 23 are stored explicitly and the leading 1 is implicit. The sign bit is stored separately. This converts to approximately 7 decimal digits of precision.
The number you're looking at, 900180032, is already 9 digits long, so it makes sense that the last two digits (the 32) might be wrong. The rounding, like the arithmetic, is done in binary, so the reason for the difference in rounding can only be seen if you break things down into binary.
900180032 = 110101101001111010100001000000
900180000 = 110101101001111010100000100000
If you count from the first 1 to the last 1 in each of those numbers, that is how many significand bits it takes to store the number. 900180032 takes 24 significand bits, which just fits, while 900180000 takes 25, which makes 900180000 impossible to store because floats only have 24 significand bits. 900180032 is the closest number to the correct answer, 900180009, that a float can store.
In the other example
100040000 = 101111101100111110101000000
100040004 = 101111101100111110101000100
The correct answer, 100040004, has 25 significand bits, too many for a float. The nearest number that has 24 or fewer significand bits is 100040000, which only has 21 significand bits.
For more on how floating point arithmetic works, try here http://steve.hollasch.net/cgindex/coding/ieeefloat.html

Using floorf to reduce the number of decimals

I would like to use the first five digits of a number for computation.
For example,
A floating point number: 4.23654897E-05
I wish to use 4.2365E-05. I tried the following
#include <math.h>
#include <stdio.h>
float num = 4.23654897E-05;
int main(){
float rounded_down = floorf(num * 10000) / 10000;
printf("%f",rounded_down);
return 0;
}
The output is 0.000000. The desired output is 4.2365E-05.
In short, say 52 bits are allocated for storing the mantissa. Is there a way to reduce the number of bits being allocated?
Any suggestions on how this can be done?
A number x that is positive and within the normal range can be rounded down approximately to five significant digits with:
double l = pow(10, floor(log10(x)) - 4);
double y = l * floor(x / l);
This is useful only for tinkering with floating-point arithmetic as a learning tool. The exact mathematical result is generally not exactly representable, because binary floating-point cannot represent most decimal values exactly. Additionally, rounding errors can occur in the pow, /, and * operations that may cause the result to differ slightly from the true mathematical result of rounding x to five significant digits. Also, poor implementations of log10 or pow can cause the result to differ from the true mathematical result.
I'd go:
printf("%.6f", num);
Or you can try using snprintf() from stdio.h:
float num = 4.23654897E-05; char output[50];
snprintf(output, 50, "%f", num);
printf("%s", output);
The result is expected. The multiplication by 10000 yields 0.423..., and the largest integer not greater than that is 0. So the result is 0. Rounding for display can be done using the %f format specifier with a precision, to print the result up to a certain number of places after the decimal point.
If you check the documentation for the return value of floorf, you will see: "If no errors occur, the largest integer value not greater than arg, that is ⌊arg⌋, is returned," where arg is the passed argument.
Without using floorf, you can use the %e (or %E) format specifier to print it accordingly.
printf("%.4E",num);
which outputs:
4.2365E-05
After David's comment:
Your way of doing things is right but the number you multiplied by is wrong. The thing is, 4.2365E-05 is 0.000042365.... Now if you multiply it by 10000 it becomes 0.42365.... You said you want the value itself in that form. floorf returns a float in this case. Store it in a variable and the rounded value will be in that variable. But you will see that the rounded-down value is 0. That is what you got.
float rounded_down = floorf(num * 10000) / 10000;
This will hold the correct value rounded down to 4 digits after . (not in exponent notation with E or e). Don't confuse the value with the format specifier used to represent it.
What you need to do in order to get the result you want is move the decimal places to the right. To do that multiply with larger number. (1e7 or 1e8 or as you want it to).
I would like to use the first five digits of a number for computation.
In general, floating point numbers are encoded using binary and OP wants to use 5 significant decimal digits. This is problematic as numbers like 4.23654897E-05 and 4.2365E-05 are not exactly representable as a float/double. The best we can do is get close.
The floor*() approach has problems with 1) negative numbers (should have used trunc()) and 2) values near x.99995 that during rounding may change the number of digits. I strongly recommend against it here as such solutions employing it fail many corner cases.
The *10000 * power10, round, /(10000 * power10) approach suffers from 1) power10 calculation (1e5 in this case) 2) rounding errors in the multiple, 3) overflow potential. The needed power10 may not be exact. * errors show up with cases when the product is close to xxxxx.5. Often this intermediate calculation is done using wider double math and so the corner cases are rare. Bad rounding using (some_int_type) which has limited range and is a truncation instead of the better round() or rint().
An approach that gets close to OP's goal: print to 5 significant digits using %e and convert back. Not highly efficient, yet handles all cases well.
#include <stdio.h>
#include <stdlib.h>
int main(void) {
float num = 4.23654897E-05f;
// sign d . dddd e sign expo + \0
#define N (1 + 1 + 1 + 4 + 1 + 1 + 4 + 1)
char buf[N*2]; // Use a generous buffer - I like 2x what I think is needed.
// OP wants 5 significant digits so print 4 digits after the decimal point.
sprintf(buf, "%.4e", num);
float rounded = (float) atof(buf);
printf("%.5e %s\n", rounded, buf);
}
Output
4.23650e-05 4.2365e-05
Why 5 in %.5e: Typical float will print up to 6 significant decimal digits as expected (research FLT_DIG), so 5 digits after the decimal point are printed. The exact value of rounded in this case was about 4.236500171...e-05 as 4.2365e-05 is not exactly representable as a float.

How to overflow a float?

I am working my way through exercise 2.1 from "The C Programming Language", where one should calculate on the local machine the range of different types like char, short, int etc., but also float and double. For everything except float and double I watch for the overflow to happen and so can calculate the max/min values. However, for floats this is still not working.
So, the question is why this code prints the same value twice? I thought the second line should print inf
float f = 1.0;
printf("%f\n",FLT_MAX);
printf("%f\n",FLT_MAX + f);
Try multiplying by 10, and it will overflow. The reason it doesn't overflow is the same reason why adding a small float to an already very large float doesn't actually change the value at all - it's a floating point format, meaning the number of digits of precision is limited.
Or, adding at least that last significant digit would likely work:
float f = 3.402823e38f; // FLT_MAX
f = f + 0.000001e38f; // this should result in overflow
The reason why it prints the same value twice is that 1.0 is too small to be added to FLT_MAX. A float usually has 24 bits for the mantissa and 8 bits for the exponent. If you have a very large value with an exponent of 127, you would need a mantissa of about 128 bits to be able to add 1.0.
As an example, the same problem exists with decimal (and any other) exponential values:
If you have a number with 3 significant digits like 1.00*10^6, you can't add 1 to it because this would be 1'000'001, and this requires 7 significant digits.
You could overflow a float by doubling the value repeatedly.

Multiplying two floats doesn't give exact result

I am trying to multiply two floats as follows:
float number1 = 321.12;
float number2 = 345.34;
float result = number1 * number2;
The result I want to see is 110895.582, but when I run the code it just gives me 110896. Most of the time I'm having this issue. Any calculator gives me the exact result with all decimals. How can I achieve that result?
edit : It's C code. I'm using XCode iOS simulator.
There's a lot of rounding going on.
float a = 321.12; // this number will be rounded
float b = 345.34; // this number will also be rounded
float r = a * b; // and this number will be rounded too
printf("%.15f\n", r);
I get 110895.578125000000000 after the three separate roundings.
If you want more than 6 decimal digits' worth of precision, you will have to use double and not float. (Note that I said "decimal digits' worth", because you don't get decimal digits, you get binary.) As it stands, 1/2 ULP of error (a worst-case bound for a perfectly rounded result) is about 0.004.
If you want exactly rounded decimal numbers, you will have to use a specialized decimal library for such a task. A double has more than enough precision for scientists, but if you work with money everything has to be 100% exact. No floating point numbers for money.
Unlike integers, floating point numbers take some real work before you can get accustomed to their pitfalls. See "What Every Computer Scientist Should Know About Floating-Point Arithmetic", which is the classic introduction to the topic.
Edit: Actually, I'm not sure that the code rounds three times. It might round five times, since the constants for a and b might be rounded first to double-precision and then to single-precision when they are stored. But I don't know the rules of this part of C very well.
You will never get the exact result that way.
First of all, number1 ≠ 321.12 because that value cannot be represented exactly in a base-2 system. You'll need an infinite number of bits for it.
The same holds for number2 ≠ 345.34.
So, you begin with inexact values to begin with.
Then the product will get rounded because multiplication gives you double the number of significant digits but the product has to be stored in float again if you multiply floats.
You probably want to use a 10-based system for your numbers. Or, in case your numbers only have 2 decimal digits of the fractional, you can use integers (32-bit integers are sufficient in this case, but you may end up needing 64-bit):
32112 * 34534 = 1108955808.
That represents 321.12 * 345.34 = 110895.5808.
Since you are using C you could easily set the precision by using "%.xf" where x is the wanted precision.
For example:
float n1 = 321.12;
float n2 = 345.34;
float result = n1 * n2;
printf("%.20f", result);
Output:
110895.57812500000000000000
However, note that float only gives six digits of precision. For better precision use double.
Floating point variables are only an approximate representation, not a precise one. Not every number can "fit" into a float variable. For example, there is no way to put 1/10 (0.1) into a binary variable, just like it's not possible to put 1/3 into a decimal one (you can only approximate it with endless 0.33333).
When outputting such variables, it's usual to apply many rounding options. Unless you set them all, you can never be sure which of them are applied. This is especially true for << operators, as the stream can be told how to round BEFORE <<.
Printf also does some rounding. Consider http://codepad.org/LLweoeHp:
float t = 0.1f;
printf("result: %f\n", t);
--
result: 0.100000
Well, it looks fine. Why? Because printf defaulted to some precision and rounded up the output. Let's dial in 50 places after decimal point: http://codepad.org/frUPOvcI
float t = 0.1f;
printf("result: %.50f\n", t);
--
result: 0.10000000149011611938476562500000000000000000000000
That's different, isn't it? These digits are the exact decimal value of the float closest to 0.1. That binary fraction terminates after the 625, which is why we see only zeroes after it.
A double can hold more digits, but 0.1 in binary is not finite. Double has to give up, eventually: http://codepad.org/RAd7Yu2r
double t = 0.1;
printf("result: %.70f\n", t);
--
result: 0.1000000000000000055511151231257827021181583404541015625000000000000000
In your example, 321.12 alone is enough to cause trouble: http://codepad.org/cgw3vUKn
float t = 321.12f;
printf("and the result is: %.50f\n", t);
result: 321.11999511718750000000000000000000000000000000000000
This is why one has to round floating point values before presenting them to humans.
Many calculator programs don't use floats or doubles. They implement a decimal number format, e.g.:
struct decimal
{
int mantissa; // meaningful digits
int exponent; //number of decimal zeroes
};
Of course, that requires reinventing all operations: addition, subtraction, multiplication and division. Or just look for a decimal library.

Why do I need 17 significant digits (and not 16) to represent a double?

Can someone give me an example of a floating point number (double precision), that needs more than 16 significant decimal digits to represent it?
I have found in this thread that sometimes you need up to 17 digits, but I am not able to find an example of such a number (16 seems enough to me).
Can somebody clarify this?
My other answer was dead wrong.
#include <stdio.h>
int
main(int argc, char *argv[])
{
unsigned long long n = 1ULL << 53;
unsigned long long a = 2*(n-1);
unsigned long long b = 2*(n-2);
printf("%llu\n%llu\n%d\n", a, b, (double)a == (double)b);
return 0;
}
Compile and run to see:
18014398509481982
18014398509481980
0
a and b are just 2*(2^53-1) and 2*(2^53-2).
Those are 17-digit base-10 numbers. When rounded to 16 digits, they are the same. Yet a and b clearly only need 53 bits of precision to represent in base-2. So if you take a and b and cast them to double, you get your counter-example.
The correct answer is the one by Nemo above. Here I am just pasting a simple Fortran program showing an example of two numbers that need 17 digits of precision to print, which shows that one does need the (es23.16) format to print double precision numbers if one doesn't want to lose any precision:
program test
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: a, b
a = 1.8014398509481982e+16_dp
b = 1.8014398509481980e+16_dp
print *, "First we show, that we have two different 'a' and 'b':"
print *, "a == b:", a == b, "a-b:", a-b
print *, "using (es22.15)"
print "(es22.15)", a
print "(es22.15)", b
print *, "using (es23.16)"
print "(es23.16)", a
print "(es23.16)", b
end program
it prints:
First we show, that we have two different 'a' and 'b':
a == b: F a-b: 2.0000000000000000
using (es22.15)
1.801439850948198E+16
1.801439850948198E+16
using (es23.16)
1.8014398509481982E+16
1.8014398509481980E+16
I think the guy on that thread is wrong, and 16 base-10 digits are always enough to represent an IEEE double.
My attempt at a proof would go something like this:
Suppose otherwise. Then, necessarily, two distinct double-precision numbers must be represented by the same 16-significant-digit base-10 number.
But two distinct double-precision numbers must differ by at least one part in 2^53, which is greater than one part in 10^16. And no two numbers differing by more than one part in 10^16 could possibly round to the same 16-significant-digit base-10 number.
This is not completely rigorous and could be wrong. :-)
Dig into the single and double precision basics and wean yourself of the notion of this or that (16-17) many DECIMAL digits and start thinking in (53) BINARY digits. The necessary examples may be found here at stackoverflow if you spend some time digging.
And I fail to see how you can award a best answer to anyone giving a DECIMAL answer without qualified BINARY explanations. This stuff is straight-forward but it is not trivial.
The largest continuous range of integers that can be exactly represented by a double (8-byte IEEE) is -2^53 to 2^53 (-9007199254740992. to 9007199254740992.). The numbers -2^53-1 and 2^53+1 cannot be exactly represented by a double.
Therefore, no more than 16 significant decimal digits to the left of the decimal point will exactly represent a double in the continuous range.

Resources