The same operations seem to work differently for larger and smaller values (I think the code below explains the question better than I could in words). I calculated max and max3 in the same way, only with different values, and likewise max2 and max4. Yet the results I get are very different:
#include <stdio.h>
#include <math.h>
int main(void)
{
// 86997171 / 48 = 1812441.0625
int max = ceil((float) 86997171 / 48);
float max2 = ((float) 86997171)/ 48;
printf("max = %i, max2 = %f\n", max, max2);
int max3 = ceil((float) 3 / 2);
float max4 = ((float) 3) / 2;
printf("ma3 = %i, max4 = %f\n", max3, max4);
}
Output:
max = 1812441, max2 = 1812441.000000
ma3 = 2, max4 = 1.500000
I was expecting max = 1812442, max2 = 1812441.062500 to be the output, since that's what it should be in principle. Now I don't know what to do.
float division in C for large numbers
This issue has nothing to do with division. The rounding error occurs in the initial conversion to float.
In the format most commonly used for float, IEEE-754 binary32, the two representable numbers closest to 86,997,171 are 86,997,168 and 86,997,176. (These are 10,874,646•2^3 and 10,874,647•2^3. 10,874,646 and 10,874,647 are 24-bit numbers (it takes 24 digits in binary to represent them), and 24 bits is all the binary32 format has for representing the fraction portion of a floating-point number.)
Of those two, 86,997,168 is closer. So, in (float) 86997171, 86,997,171 is converted to 86,997,168.
Then 86,997,168 / 48 is 1,812,441. So (float) 86997171 / 48 is 1,812,441, and so is ceil((float) 86997171 / 48). So max and max2 are both set to 1,812,441.
In C, float is a single-precision floating-point format, so it is usually 4 bytes (on most compilers), so its precision is around 6-9 significant digits, typically 7 digits.
Your number in question, 1812441.0625 has 11 digits, which don't fit in a float type.
You should use double instead. In C, double is a double-precision floating-point format, usually 8 bytes (on most compilers), so its precision is around 15-17 significant digits, typically 16 digits, and it can therefore keep the precision of your number.
In fact, using double in this case gives:
max = 1812442, max2 = 1812441.062500
ma3 = 2, max4 = 1.500000
which is what you need.
Note that the precision of these types is explained here. It is far from the truth (as explained by the link), but it gives good perspective in your question.
Given the alternating harmonic series 1 - 1/2 + 1/3 - 1/4 + ... = ln(2), is it possible to get a value of 0.69314718056 using only float values and only basic operations (+, -, *, /)? Are there any algorithms that can increase the precision of this calculation without going to unreasonably high values of n (the current reasonable limit is 1e10)?
What I currently have: this nets me 8 correct digits -> 0.6931471825
EDIT
The goal is to compute the most precise summation value using only float datatypes
#include <stdio.h>
#include <math.h>
int main()
{
float sum = 0;
int n = 1e9;
double ans = log(2);
int i;
float r = 0;
for (i = n; i > 0; i--) {
r = i - (2*(i/2));
if(r == 0){
sum -= 1.0000000 / i;
}else{
sum += 1.0000000 / i;
}
}
printf("\n%.10f", sum);
printf("\n%.10f", ans);
return 0;
}
On systems where a float is a single-precision IEEE floating point number, it has 24 bits of precision, which is roughly 7 (log10(2^24) ≈ 7.2) digits of decimal precision.
If you change
double ans = log(2);
to
float ans = log(2);
You'll see you already get the best answer possible.
0.6931471 82464599609375 From log(2), cast to float
0.6931471 82464599609375 From your algorithm
0.6931471 8055994530941723... Actual value
  \_____/
  7 digits
In fact, if you use %A instead of %f, you'll see you get the same answer to the bit.
0X1.62E43P-1 // From log(2), cast to float
0X1.62E43P-1 // From your algorithm
@ikegami already showed this answer in decimal and hex, but to make it even more clear, here are the numbers in binary.
ln(2) is actually:
0.1011000101110010000101111111011111010001110011111…
Rounded to 24 bits, that is:
0.101100010111001000011000
Converted back to decimal, that is:
0.693147182464599609375
...which is the number you got. You simply can't do any better than that, in the 24 bits of precision you've got available in a single-precision float.
Here I have a function sum() of type float that takes in a pointer t of type float and an integer size. It returns the sum of all the elements in the array. Then I create two arrays using that function. One that has the BIG value at the first index and one that has it at the last index. When I return the sums of each of those arrays I get different results. This is my code:
#include <stdlib.h>
#include <stdio.h>
#define N 1024
#define SMALL 1.0
#define BIG 100000000.0
float sum(float* t, int size) { // here I define the function sum()
float s = 0.0;
for (int i = 0; i < size; i++) {
s += t[i];
}
return s;
}
int main() {
float tab[N];
for (int i = 0; i < N; i++) {
tab[i] = SMALL;
}
tab[0] = BIG;
float sum1 = sum(tab, N); // initialize sum1 with the big value at index 0
printf("sum1 = %f\n", sum1);
tab[0] = SMALL;
tab[N-1] = BIG;
float sum2 = sum(tab, N); // initialize sum2 with the big value at last index
printf("sum2 = %f\n", sum2);
return 0;
}
After compiling the code and running it I get the following output:
sum1 = 100000000.000000
sum2 = 100001024.000000
Why do I get different results even though the arrays have the same elements (but at different indexes)?
What you're experiencing is floating point imprecision. Here's a simple demonstration.
#include <stdio.h>
int main() {
float big = 100000000.0;
float small = 1.0;
printf("%f\n", big + small);
printf("%f\n", big + (19 *small));
return 0;
}
You'd expect 100000001.0 and 100000019.0.
$ ./test
100000000.000000
100000016.000000
Why'd that happen? Because computers don't store numbers like we do, floating point numbers doubly so. A float has a size of just 32 bits, but it can store numbers up to about 3.4 × 10^38, rather than just the 2^31 of a 32-bit integer. And it can store fractions. How? They cheat. What it really stores is the sign, an exponent, and a mantissa.
sign * 2^exponent * mantissa
The mantissa is what determines accuracy and there's only 24 bits in a float. So large numbers lose precision.
You can read about exactly how and play around with the representation.
To solve this either use a double which has greater precision, or use an accurate, but slower, arbitrary precision library such as GMP.
Why do I get different results even though the arrays have the same elements
In floating-point math, 100000000.0 + 1.0 equals 100000000.0 and not 100000001.0, but 100000000.0 + 1024.0 does equal 100001024.0. Given the value 100000000.0, the value 1.0 is too small to show up in the available bits used to represent 100000000.0.
So when you put 100000000.0 first, all the later + 1.0 operations have no effect.
When you put 100000000.0 last, though, the previous 1023 additions of 1.0 accumulate exactly to 1023.0, and 1023.0 is "big enough" to make a difference given the available precision of floating point math: 100000000.0 + 1023.0 rounds to the nearest representable float, 100001024.0.
I have come across some behaviour with the float type in C that I do not understand, and was hoping might be explained. Using the macros defined in float.h I can determine the maximum/minimum values that the datatype can store on the given hardware. However when performing a calculation that should not exceed these limits, I find that a typed float variable fails where a double succeeds.
The following is a minimal example, which compiles on my machine.
#include <stdio.h>
#include <stdlib.h>
#include <float.h>
int main(int argc, char **argv)
{
int gridsize;
long gridsize3;
float *datagrid;
float sumval_f;
double sumval_d;
long i;
gridsize = 512;
gridsize3 = (long)gridsize*gridsize*gridsize;
datagrid = calloc(gridsize3, sizeof(float));
if(datagrid == NULL)
{
printf("Memory allocation failed\n");
exit(EXIT_FAILURE);
}
for(i=0; i<gridsize3; i++)
{
datagrid[i] += 1.0;
}
sumval_f = 0.0;
sumval_d = 0.0;
for(i=0; i<gridsize3; i++)
{
sumval_f += datagrid[i];
sumval_d += (double)datagrid[i];
}
printf("\ngridsize3 = %e\n", (float)gridsize3);
printf("FLT_MIN = %e\n", FLT_MIN);
printf("FLT_MAX = %e\n", FLT_MAX);
printf("DBL_MIN = %e\n", DBL_MIN);
printf("DBL_MAX = %e\n", DBL_MAX);
printf("\nfloat sum = %f\n", sumval_f);
printf("double sum = %lf\n", sumval_d);
printf("sumval_d/sumval_f = %f\n\n", sumval_d/(double)sumval_f);
free(datagrid);
return(0);
}
Compiling with gcc I find the output:
gridsize3 = 1.342177e+08
FLT_MIN = 1.175494e-38
FLT_MAX = 3.402823e+38
DBL_MIN = 2.225074e-308
DBL_MAX = 1.797693e+308
float sum = 16777216.000000
double sum = 134217728.000000
sumval_d/sumval_f = 8.000000
Whilst compiling with icc the sumval_f = 67108864.0 and hence the final ratio is instead 2.0*. Note that the float sum is incorrect, whilst the double sum is correct.
As far as I can tell the output of FLT_MAX suggests that the sum should fit into a float, and yet it seems to plateau out at either an eighth or a half of the full value.
Is there a compiler specific override to the values found using float.h?
Why is a double required to correctly find the sum of this array?
*Interestingly the inclusion of an if statement inside the for loop that prints values of the array causes the value to match the gcc output, i.e. an eighth of the correct sum, rather than a half.
The problem here isn't the range of values but the precision.
Assuming a 32-bit IEEE754 float, this datatype has a maximum of 24 bits of precision. This means that not all integers larger than 16777216 can be represented exactly.
So when your sum reaches 16777216, adding 1 to it is outside the precision of what the datatype can store, so the number doesn't get any bigger.
A (presumably) 64-bit double has 53 bits of precision. This is enough bits to hold all integer values up to your sum of 134217728, so it gives you an accurate result.
A float can precisely represent any integer between -16777215 and +16777215, inclusive. It can also represent all even integers between -2*16777215 and +2*16777215 (including +/- 2*8388608, i.e. 16777216), all multiples of 4 between -4*16777215 and +4*16777215, and likewise for all power-of-two scaling factors up to 2^104 (roughly 2.028E+31). Additionally, it can represent multiples of 1/2 from -16777215/2 to +16777215/2, multiples of 1/4 from -16777215/4 to +16777215/4, etc. down to multiples of 1/2^149 from -16777215/(2^149) to +16777215/(2^149).
There are infinitely many real values between any two numbers, but a computer cannot hold an infinite number of values, so a compromise is made: a floating point number holds an approximation of the value.
This means that if you pick a value that is "more" than the stored floating point number, but not enough to arrive at the "next" storable approximation, then storing that logically bigger number won't actually change the floating point value.
The "error" in a floating point approximation is variable. For small numbers, the absolute error is small; for bigger numbers, the error is proportionally the same, but larger in absolute terms.
#include <stdio.h>
#include <float.h>
int main(int argc, char** argv)
{
long double pival = 3.14159265358979323846264338327950288419716939937510582097494459230781640628620899L;
float pival_float = pival;
printf("%1.80f\n", pival_float);
return 0;
}
The output I got on gcc is :
3.14159274101257324218750000000000000000000000000000000000000000000000000000000000
A float uses a 23-bit mantissa, so the maximum fraction that can be represented is 2^23 = 8388608, i.e. 7 decimal digits of precision.
But the above output shows 23 decimal digits of precision (3.14159274101257324218750). I expected it to print 3.1415927000000000000...
What am I missing?
You only got 7 digits of precision. Pi is
3.1415926535897932384626433832795028841971693993751058209...
But the output you got from printing your float approximation to Pi was
3.14159274101257324218750000...
As you can see the values diverge starting from the 7th digit after the decimal point.
If you ask printf() for 80 digits after the decimal place, it will print out that many digits of the decimal representation of the binary value stored in the float, even if that many digits is far more than the precision allowed by the float representation.
A binary floating-point value can't represent 3.1415927 exactly (since that's not an exact binary fraction). The nearest value that it can represent is 3.1415927410125732421875, so that's the actual value of your pival_float. When you print pival_float with eighty digits, you see its exact value, plus a bunch of zeroes for good measure.
The closest float value to pi has binary encoding...
0 10000000 10010010000111111011011
...in which I've inserted spaces between the sign, exponent and mantissa. The exponent is biased, so the bits above encode a multiplier of 2^1 == 2, and the mantissa encodes a fraction above 1, with the first bit being worth a half, and each bit thereafter being worth half as much as the bit before.
Therefore, the mantissa bits above are worth:
1 x 0.5
0 x 0.25
0 x 0.125
1 x 0.0625
0 x 0.03125
0 x 0.015625
1 x 0.0078125
0 x 0.00390625
0 x 0.001953125
0 x 0.0009765625
0 x 0.00048828125
1 x 0.000244140625
1 x 0.0001220703125
1 x 0.00006103515625
1 x 0.000030517578125
1 x 0.0000152587890625
1 x 0.00000762939453125
0 x 0.000003814697265625
1 x 0.0000019073486328125
1 x 0.00000095367431640625
0 x 0.000000476837158203125
1 x 0.0000002384185791015625
1 x 0.00000011920928955078125
So, the least significant bit after multiplying by the exponent-encoded value "2" is worth...
0.000 000 238 418 579 101 562 5
I added spaces to make it easier to count that the last non-0 digit is in the 22nd decimal place.
The value the question says printf() displayed appears below alongside the contribution of the least significant bit in the mantissa:
3.14159274101257324218750000000000000000000000000000000000000000000000000000000000
0.0000002384185791015625
Clearly the least significant digits line up properly. If you added up all the mantissa contributions above, added the implicit 1, then multiplied by 2, you'd get the exact value printf displayed. That explains how the float value is precisely (in the mathematical sense of zero randomness) the value shown by printf, but the comparison below against pi shows only the first 6 decimal places are accurate given the particular value we want it to store.
3.14159274101257324218750000000000000000000000000000000000000000000000000000000000
3.14159265358979323846264338327950288419716939937510582097494459230781640628620899
^
In computing, it's common to refer to the precision of floating point types when we're actually interested in the accuracy we can rely on. I suppose you could argue that while taken in isolation the precision of floats and doubles is infinite, the rounding necessary when using them to approximate numbers that they can't encode perfectly is for most practical purposes random, and in that sense they offer finite significant digits of precision at encoding such numbers.
So, printf isn't wrong to display so many digits; some application might be using a float to encode that exact number (almost certainly because the nature of the app's calculations involve sums of 1/2^n values), but that'd be the exception rather than the rule.
Carrying on from Tony's answer, one way to prove this limitation on decimal precision to yourself in a practical way is simply to declare pi to as many decimals points as you like while assigning the value to a float. Then look at how it is stored in memory.
What you find is that, no matter how many decimal places you give it, the 32-bit value in memory will always be the equivalent of the unsigned value 1078530011, or 01000000010010010000111111011011 in binary. That is due, as others explained, to the IEEE-754 Single Precision Floating Point Format. Below is a simple bit of code that will allow you to prove to yourself that this limitation means pi, as a float, is limited to six decimal places of precision:
#include <stdio.h>
#include <stdlib.h>
#if defined (__LP64__) || defined (_LP64)
# define BUILD_64 1
#endif
#ifdef BUILD_64
# define BITS_PER_LONG 64
#else
# define BITS_PER_LONG 32
#endif
char *binpad (unsigned long n, size_t sz);
int main (void) {
float fPi = 3.1415926535897932384626433;
printf ("\n fPi : %f, in memory : %s unsigned : %u\n\n",
fPi, binpad (*(unsigned*)&fPi, 32), *(unsigned*)&fPi);
return 0;
}
char *binpad (unsigned long n, size_t sz)
{
static char s[BITS_PER_LONG + 1] = {0};
char *p = s + BITS_PER_LONG;
register size_t i;
for (i = 0; i < sz; i++)
*(--p) = (n>>i & 1) ? '1' : '0';
return p;
}
Output
$ ./bin/ieee754_pi
fPi : 3.141593, in memory : 01000000010010010000111111011011 unsigned : 1078530011
I'm new to C and when I run the code below, the value that is put out is 12098 instead of 12099.
I'm aware that working with decimals always involves a degree of inaccuracy, but is there a way to accurately move the decimal point to the right two places every time?
#include <stdio.h>
int main(void)
{
int i;
float f = 120.99;
i = f * 100;
printf("%d", i);
}
Use the round function from <math.h>:
float f = 120.99;
int i = round( f * 100.0 );
Be aware however, that a float typically only has 6 or 7 digits of precision, so there's a maximum value where this will work. The smallest float value that won't convert properly is the number 131072.01. If you multiply by 100 and round, the result will be 13107202.
You can extend the range of your numbers by using double values, but even a double has limited precision. (A double has 15 to 17 digits of precision.) For example, the following code will print 10000000000000098
double d = 100000000000000.99;
unsigned long long j = round( d * 100.0 );
printf( "%llu\n", j );
That's just an example; finding the smallest number that exceeds the precision of a double is left as an exercise for the reader.
Use fixed-point arithmetic on integers:
#include <stdio.h>
#define abs(x) ((x)<0 ? -(x) : (x))
int main(void)
{
int d = 12099;
int i = d * 100;
printf("%d.%02d\n", d/100, abs(d)%100);
printf("%d.%02d\n", i/100, abs(i)%100);
}
Your problem is that floats are represented internally using IEEE-754, which works in base 2, not base 10. 0.25 has an exact representation, but 0.1 does not, and neither does 120.99.
What really happens is that, due to floating point inaccuracy, the IEEE-754 float closest to the decimal value 120.99, multiplied by 100, is slightly below 12099, so it is truncated to 12098. Your compiler should have warned you about the truncation from float to int (mine did).
A simple way to get what you expect is to add 0.5 to the float before the truncation to int:
i = (f * 100) + 0.5;
But beware: floating point values are inherently inaccurate when processing decimal values.
Edit :
Of course for negative numbers, it should be i = (f * 100) - 0.5 ...
If you'd like to continue operating on the number as a floating point number, then the answer is more or less no. There's various things you can do for small numbers, but as your numbers get larger, you'll have issues.
If you'd like to only print the number, then my recommendation would be to convert the number to a string, and then move the decimal point there. This can be slightly complicated depending on how you represent the number in the string (exponential and what not).
If you'd like this to work and you don't mind not using floating point, then I'd recommend researching any number of fixed decimal libraries.
You can use
float f = 120.99f;
or
double f = 120.99;
By default, C treats floating-point literals as double, so storing one in a float variable causes an implicit conversion, which can lose precision.
I think this works.