Float underflow in C explanation

Float underflow in C explanation - c

I am solving one of C Primer Plus exercises dealing with float underflow. The task is to simulate it. I did it this way:
#include<stdio.h>
#include<float.h>
int main(void)
{
// print min value for a positive float retaining full precision
printf("%s\n %.150f\n", "Minimum positive float value retaining full precision:",FLT_MIN);
// print min value for a positive float retaining full precision divided by two
printf("%s\n %.150f\n", "Minimum positive float value retaining full precision divided by two:",FLT_MIN/2.0);
// print min value for a positive float retaining full precision divided by four
printf("%s\n %.150f\n", "Minimum positive float value retaining full precision divided by four:",FLT_MIN/4.0);
return 0;
}
The result is
Minimum positive float value retaining full precision: 0.000000000000000000000000000000000000011754943508222875079687365372222456778186655567720875215087517062784172594547271728515625000000000000000000000000
Minimum positive float value retaining full precision divided by two: 0.000000000000000000000000000000000000005877471754111437539843682686111228389093327783860437607543758531392086297273635864257812500000000000000000000000
Minimum positive float value retaining full precision divided by four: 0.000000000000000000000000000000000000002938735877055718769921841343055614194546663891930218803771879265696043148636817932128906250000000000000000000000
I expected less precision for min float value divide by two and four but it seems the precision is ok and there is no underflow situation. How is it possible? Did I miss something?
Thank you very much

Incorrect method of assessing precision as code simple divides FLT_MIN (certainly a power of 2) by 2.
Instead start with a number that is just above a power of 2 so its binary significand is something like 1.000...(maybe total of 24 binary digits)...0001. Insure values printed are originally float. (FLT_MIN/2.0 is a double.)
Notice below that the precision is lost when the numbers becomes less than FLT_MIN: minimum normalized positive floating-point number.
Also consider FLT_TRUE_MIN: minimum positive floating-point number. See binary32
#include <float.h>
#include <math.h>
#include <stdio.h>
int main(void) {
char *format = "%.10e %a\n";
printf(format, FLT_MIN, FLT_MIN);
printf(format, FLT_TRUE_MIN, FLT_TRUE_MIN);
float f = nextafterf(1.0f, 2.0f);
do {
f /= 2;
printf(format, f, f); // print in decimal and hex for detail
} while (f);
return 0;
}
Output
1.1754943508e-38 0x1p-126
1.4012984643e-45 0x1p-149
5.0000005960e-01 0x1.000002p-1
2.5000002980e-01 0x1.000002p-2
1.2500001490e-01 0x1.000002p-3
...
2.3509889819e-38 0x1.000002p-125
1.1754944910e-38 0x1.000002p-126
5.8774717541e-39 0x1p-127 // lost least significant bit of precision
2.9387358771e-39 0x1p-128
...
2.8025969286e-45 0x1p-148
1.4012984643e-45 0x1p-149
0.0000000000e+00 0x0p+0

Related

Round float to 2 decimal places in C language?

float number = 123.8798831;
number=(floorf((number + number * 0.1) * 100.0)) / 100.0;
printf("number = %f",number);
I want to get number = 136.25
But the compiler shows me number = 136.259995
I know that I can write like this printf("number = %.2f",number) ,but I need the number itself for further operation.It is necessary that the number be stored in a variable as number = 136.25

It is necessary that the number be stored in a variable as number = 136.25
But that would be the incorrect result. The precise result of number + number * 0.1 is 136.26787141. When you round that downwards to 2 decimal places, the number that you would get is 136.26, and not 136.25.
However, there is no way to store 136.26 in a float because it simply isn't a representable value (on your system). Best you can get is a value that is very close to it. You have successfully produced a floating point number that is very close to 136.26. If you cannot accept the slight error in the value, then you shouldn't be using finite precision floating point arithmetic.
If you wish to print the value of a floating point number up to limited number of decimals, you must understand that not all values can be represented by floating point numbers, and that you must use %.2f to get desired output.
Round float to 2 decimal places in C language?
Just like you did:
multiply with 100
round
divide by 100

I agree with the other comments/answers that using floating point numbers for money is usually not a good idea, not all numbers can be stored exactly. Basically, when you use floating point numbers, you sacrifice exactness for being able to storage very large and very small numbers and being able to store decimals. You don't want to sacrifice exactness when dealing with real money, but I think this is a student project, and no actual money is being calculated, so I wrote the small program to show one way of doing this.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(void)
{
double number, percent_interest, interest, result, rounded_result;
number = 123.8798831;
percent_interest = 0.1;
interest = (number * percent_interest)/100; //Calculate interest of interest_rate percent.
result = number + interest;
rounded_result = floor(result * 100) / 100;
printf("number=%f, percent_interest=%f, interest=%f, result=%f, rounded_result=%f\n", number, percent_interest, interest, result, rounded_result);
return EXIT_SUCCESS;
}
As you can see, I use double instead float, because double has more precession and floating point constants are of type double not float. The code in your question should give you a warning because in
float number = 123.8798831;
123.8798831 is of type double and has to be converted to float (possibly losing precession in the process).
You should also notice that my program calculates interest at .1% (like you say you want to do) unlike the code in your question which calculates interest at 10%. Your code multiplies by 0.1 which is 10/100 or 10%.

Here is an example of a function you can use for rounding to x number of decimals.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <stddef.h>
double dround(double number, int dp)
{
int charsNeeded = 1 + snprintf(NULL, 0, "%.*f", dp, number);
char *buffer = malloc(charsNeeded);
snprintf(buffer, charsNeeded, "%.*f", dp, number);
double result = atof(buffer);
free(buffer);
return result;
}
int main()
{
float number = 37.777779;
number = dround(number,2);
printf("Number is %f\n",number);
return 0;
}

Determining the number of decimal digits in a floating number

I am trying to write a program that outputs the number of the digits in the decimal portion of a given number (0.128).
I made the following program:
#include <stdio.h>
#include <math.h>
int main(){
float result = 0;
int count = 0;
int exp = 0;
for(exp = 0; int(1+result) % 10 != 0; exp++)
{
result = 0.128 * pow(10, exp);
count++;
}
printf("%d \n", count);
printf("%f \n", result);
return 0;
}
What I had in mind was that exp keeps being incremented until int(1+result) % 10 outputs 0. So for example when result = 0.128 * pow(10,4) = 1280, result mod 10 (int(1+result) % 10) will output 0 and the loop will stop.
I know that on a bigger scale this method is still inefficient since if result was a given input like 1.1208 the program would basically stop at one digit short of the desired value; however, I am trying to first find out the reason why I'm facing the current issue.
My Issue: The loop won't just stop at 1280; it keeps looping until its value reaches 128000000.000000.
Here is the output when I run the program:
10
128000000.000000
Apologies if my description is vague, any given help is very much appreciated.

I am trying to write a program that outputs the number of the digits in the decimal portion of a given number (0.128).
This task is basically impossible, because on a conventional (binary) machine the goal is not meaningful.
If I write
float f = 0.128;
printf("%f\n", f);
I see
0.128000
and I might conclude that 0.128 has three digits. (Never mind about the three 0's.)
But if I then write
printf("%.15f\n", f);
I see
0.128000006079674
Wait a minute! What's going on? Now how many digits does it have?
It's customary to say that floating-point numbers are "not accurate" or that they suffer from "roundoff error". But in fact, floating-point numbers are, in their own way, perfectly accurate — it's just that they're accurate in base two, not the base 10 we're used to thinking about.
The surprising fact is that most decimal (base 10) fractions do not exist as finite binary fractions. This is similar to the way that the number 1/3 does not even exist as a finite decimal fraction. You can approximate 1/3 as 0.333 or 0.3333333333 or 0.33333333333333333333, but without an infinite number of 3's it's only an approximation. Similarly, you can approximate 1/10 in base 2 as 0b0.00011 or 0b0.000110011 or 0b0.000110011001100110011001100110011, but without an infinite number of 0011's it, too, is only an approximation. (That last rendition, with 33 bits past the binary point, works out to about 0.0999999999767.)
And it's the same with most decimal fractions you can think of, including 0.128. So when I wrote
float f = 0.128;
what I actually got in f was the binary number 0b0.00100000110001001001101111, which in decimal is exactly 0.12800000607967376708984375.
Once a number has been stored as a float (or a double, for that matter) it is what it is: there is no way to rediscover that it was initially initialized from a "nice, round" decimal fraction like 0.128. And if you try to "count the number of decimal digits", and if your code does a really precise job, you're liable to get an answer of 26 (that is, corresponding to the digits "12800000607967376708984375"), not 3.
P.S. If you were working with computer hardware that implemented decimal floating point, this problem's goal would be meaningful, possible, and tractable. And implementations of decimal floating point do exist. But the ordinary float and double values any of is likely to use on any of today's common, mass-market computers are invariably going to be binary (specifically, conforming to IEEE-754).
P.P.S. Above I wrote, "what I actually got in f was the binary number 0b0.00100000110001001001101111". And if you count the number of significant bits there — 100000110001001001101111 — you get 24, which is no coincidence at all. You can read at single precision floating-point format that the significand portion of a float has 24 bits (with 23 explicitly stored), and here, you're seeing that in action.

float vs. code
A binary float cannot encode 0.128 exactly as it is not a dyadic rational.
Instead, it takes on a nearby value: 0.12800000607967376708984375. 26 digits.
Rounding errors
OP's approach incurs rounding errors in result = 0.128 * pow(10, exp);.
Extended math needed
The goal is difficult. Example: FLT_TRUE_MIN takes about 149 digits.
We could use double or long double to get us somewhat there.
Simply multiply the fraction by 10.0 in each step.
d *= 10.0; still incurs rounding errors, but less so than OP's approach.
#include <stdio.h>
#include <math.h> int main(){
int count = 0;
float f = 0.128f;
double d = f - trunc(f);
printf("%.30f\n", d);
while (d) {
d *= 10.0;
double ipart = trunc(d);
printf("%.0f", ipart);
d -= ipart;
count++;
}
printf("\n");
printf("%d \n", count);
return 0;
}
Output
0.128000006079673767089843750000
12800000607967376708984375
26
Usefulness
Typically, past FLT_DECMAL_DIG (9) or so significant decimal places, OP’s goal is usually not that useful.

As others have said, the number of decimal digits is meaningless when using binary floating-point.
But you also have a flawed termination condition. The loop test is (int)(1+result) % 10 != 0 meaning that it will stop whenever we reach an integer whose last digit is 9.
That means that 0.9, 0.99 and 0.9999 all give a result of 2.
We also lose precision by truncating the double value we start with by storing into a float.
The most useful thing we could do is terminate when the remaining fractional part is less than the precision of the type used.
Suggested working code:
#include <math.h>
#include <float.h>
#include <stdio.h>
int main(void)
{
double val = 0.128;
double prec = DBL_EPSILON;
double result;
int count = 0;
while (fabs(modf(val, &result)) > prec) {
++count;
val *= 10;
prec *= 10;
}
printf("%d digit(s): %0*.0f\n", count, count, result);
}
Results:
3 digit(s): 128

How do I print a floating-point value for later scanning with perfect accuracy?

Suppose I have a floating-point value of type float or double (i.e. 32 or 64 bits on typical machines). I want to print this value as text (e.g. to the standard output stream), and then later, in some other process, scan it back in - with fscanf() if I'm using C, or perhaps with istream::operator>>() if I'm using C++. But - I need the scanned float to end up being exactly, identical to the original value (up to equivalent representations of the same value). Also, the printed value should be easily readable - to a human - as floating-point, i.e. I don't want to print 0x42355316 and reinterpret that as a 32-bit float.
How should I do this? I'm assuming the standard library of (C and C++) won't be sufficient, but perhaps I'm wrong. I suppose that a sufficient number of decimal digits might be able to guarantee an error that's underneath the precision threshold - but that's not the same as guaranteeing the rounding/truncation will happen just the way I want it.
Notes:
The scanning does not having to be perfectly accurate w.r.t. the value it scans, only the original value.
If it makes it easier, you may assume the value is a number and is not infinity.
denormal support is desired but not required; still if we get a denormal, failure should be conspicuous.

First, you should use the %a format with fprintf and fscanf. This is what it was designed for, and the C standard requires it to work (reproduce the original number) if the implementation uses binary floating-point.
Failing that, you should print a float with at least FLT_DECIMAL_DIG significant digits and a double with at least DBL_DECIMAL_DIG significant digits. Those constants are defined in <float.h> and are defined:
… number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,… [b is the base used for the floating-point format, defined in FLT_RADIX, and p is the number of base-b digits in the format.]
For example:
printf("%.*g\n", FLT_DECIMAL_DIG, 1.f/3);
or:
#define QuoteHelper(x) #x
#define Quote(x) QuoteHelper(x)
…
printf("%." Quote(FLT_DECIMAL_DIG) "g\n", 1.f/3);
In C++, these constants are defined in <limits> as std::numeric_limits<Type>::max_digits10, where Type is float or double or another floating-point type.
Note that the C standard only recommends that such a round-trip through a decimal numeral work; it does not require it. For example, C 2018 5.2.4.2.2 15 says, under the heading “Recommended practice”:
Conversion from (at least) double to decimal with DECIMAL_DIG digits and back should be the identity function. [DECIMAL_DIG is the equivalent of FLT_DECIMAL_DIG or DBL_DECIMAL_DIG for the widest floating-point format supported in the implementation.]
In contrast, if you use %a, and FLT_RADIX is a power of two (meaning the implementation uses a floating-point base that is two, 16, or another power of two), then C standard requires that the result of scanning the numeral produced with %a equals the original number.

I need the scanned float to end up being exactly, identical to the original value.
As already pointed out in the other answers, that can be achieved with the %a format specifier.
Also, the printed value should be easily readable - to a human - as floating-point, i.e. I don't want to print 0x42355316 and reinterpret that as a 32-bit float.
That's more tricky and subjective. The first part of the string that %a produces is in fact a fraction composed by hexadecimal digits, so that an output like 0x1.4p+3 may take some time to be parsed as 10 by a human reader.
An option could be to print all the decimal digits needed to represent the floating-point value, but there may be a lot of them. Consider, for example the value 0.1, its closest representation as a 64-bit float may be
0x1.999999999999ap-4 == 0.1000000000000000055511151231257827021181583404541015625
While printf("%.*lf\n", DBL_DECIMAL_DIG, 01); (see e.g. Eric's answer) would print
0.10000000000000001 // If DBL_DECIMAL_DIG == 17
My proposal is somewhere in the middle. Similarly to what %a does, we can exactly represent any floating-point value with radix 2 as a fraction multiplied by 2 raised to some integer power. We can transform that fraction into a whole number (increasing the exponent accordingly) and print it as a decimal value.
0x1.999999999999ap-4 --> 1.999999999999a16 * 2-4 --> 1999999999999a16 * 2-56
--> 720575940379279410 * 2-56
That whole number has a limited number of digits (it's < 253), but the result it's still an exact representation of the original double value.
The following snippet is a proof of concept, without any check for corner cases. The format specifier %a separates the mantissa and the exponent with a p character (as in "... multiplied by two raised to the Power of..."), I'll use a q instead, for no particular reason other than using a different symbol.
The value of the mantissa will also be reduced (and the exponent raised accordingly), removing all the trailing zero-bits. The idea beeing that 5q+1 (parsed as 510 * 21) should be more "easily" identified as 10, rather than 2814749767106560q-48.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void to_my_format(double x, char *str)
{
int exponent;
double mantissa = frexp(x, &exponent);
long long m = 0;
if ( mantissa ) {
exponent -= 52;
m = (long long)scalbn(mantissa, 52);
// A reduced mantissa should be more readable
while (m && m % 2 == 0) {
++exponent;
m /= 2;
}
}
sprintf(str, "%lldq%+d", m, exponent);
// ^
// Here 'q' is used to separate the mantissa from the exponent
}
double from_my_format(char const *str)
{
char *end;
long long mantissa = strtoll(str, &end, 10);
long exponent = strtol(str + (end - str + 1), &end, 10);
return scalbn(mantissa, exponent);
}
int main(void)
{
double tests[] = { 1, 0.5, 2, 10, -256, acos(-1), 1000000, 0.1, 0.125 };
size_t n = (sizeof tests) / (sizeof *tests);
char num[32];
for ( size_t i = 0; i < n; ++i ) {
to_my_format(tests[i], num);
double x = from_my_format(num);
printf("%22s%22a ", num, tests[i]);
if ( tests[i] != x )
printf(" *** %22a *** Round-trip failed\n", x);
else
printf("%58.55g\n", x);
}
return 0;
}
Testable here.
Generally, the improvement in readability is admitedly little to none, surely a matter of opinion.

You can use the %a format specifier to print the value as hexadecimal floating point. Note that this is not the same as reinterpreting the float as an integer and printing the integer value.
For example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
float x;
scanf("%f", &x);
printf("x=%.7f\n", x);
char str[20];
sprintf(str, "%a", x);
printf("str=%s\n", str);
float y;
sscanf(str, "%f", &y);
printf("y=%.7f\n", y);
printf("x==y: %d\n", (x == y));
return 0;
}
With an input of 4, this outputs:
x=4.0000000
str=0x1p+2
y=4.0000000
x==y: 1
With an input of 3.3, this outputs:
x=3.3000000
str=0x1.a66666p+1
y=3.3000000
x==y: 1
As you can see from the output, the %a format specifier prints in exponential format with the significand in hex and the exponent in decimal. This format can then be converted directly back to the exact same value as demonstrated by the equality check.

Understanding the maximum values that can be stored in floats in C

I have come across some behaviour with the float type in C that I do not understand, and was hoping might be explained. Using the macros defined in float.h I can determine the maximum/minimum values that the datatype can store on the given hardware. However when performing a calculation that should not exceed these limits, I find that a typed float variable fails where a double succeeds.
The following is a minimal example, which compiles on my machine.
#include <stdio.h>
#include <stdlib.h>
#include <float.h>
int main(int argc, char **argv)
{
int gridsize;
long gridsize3;
float *datagrid;
float sumval_f;
double sumval_d;
long i;
gridsize = 512;
gridsize3 = (long)gridsize*gridsize*gridsize;
datagrid = calloc(gridsize3, sizeof(float));
if(datagrid == NULL)
{
free(datagrid);
printf("Memory allocation failed\n");
exit(0);
}
for(i=0; i<gridsize3; i++)
{
datagrid[i] += 1.0;
}
sumval_f = 0.0;
sumval_d = 0.0;
for(i=0; i<gridsize3; i++)
{
sumval_f += datagrid[i];
sumval_d += (double)datagrid[i];
}
printf("\ngridsize3 = %e\n", (float)gridsize3);
printf("FLT_MIN = %e\n", FLT_MIN);
printf("FLT_MAX = %e\n", FLT_MAX);
printf("DBL_MIN = %e\n", DBL_MIN);
printf("DBL_MAX = %e\n", DBL_MAX);
printf("\nfloat sum = %f\n", sumval_f);
printf("double sum = %lf\n", sumval_d);
printf("sumval_d/sumval_f = %f\n\n", sumval_d/(double)sumval_f);
free(datagrid);
return(0);
}
Compiling with gcc I find the output:
gridsize3 = 1.342177e+08
FLT_MIN = 1.175494e-38
FLT_MAX = 3.402823e+38
DBL_MIN = 2.225074e-308
DBL_MAX = 1.797693e+308
float sum = 16777216.000000
double sum = 134217728.000000
sumval_d/sumval_f = 8.000000
Whilst compiling with icc the sumval_f = 67108864.0 and hence the final ratio is instead 2.0*. Note that the float sum is incorrect, whilst the double sum is correct.
As far as I can tell the output of FLT_MAX suggests that the sum should fit into a float, and yet it seems to plateau out at either an eighth or a half of the full value.
Is there a compiler specific override to the values found using float.h?
Why is a double required to correctly find the sum of this array?
*Interestingly the inclusion of an if statement inside the for loop that prints values of the array causes the value to match the gcc output, i.e. an eighth of the correct sum, rather than a half.

The problem here isn't the range of values but the precision.
Assuming a 32-bit IEEE754 float, this datatype has a maximum of 24 bits of precision. This means that not all integers larger than 16777216 can be represented exactly.
So when your sum reaches 16777216, adding 1 to it is outside the precision of what the datatype can store, so the number doesn't get any bigger.
A (presumably) 64-bit double has 53 bits of precision. This is enough bits to hold all integer values up to your sum of 134217728, so it gives you an accurate result.

A float can precisely represent any integer between -16777215 and +16777215, inclusive. It can also represent all even integers between -2*16777215 and +2*16777215 (including +/- 2*8388608, i.e. 16777216), all multiples of 4 between -4*16777215 and +4*16777215, and likewise for all power-of-two scaling factors up to 2^104 (roughly 2.028E+31). Additionally, it can represent multiples of 1/2 from -16777215/2 to +16777215/2, multiples of 1/4 from -16777215/4 to +16777215/4, etc. down to multiples of 1/2^149 from -167777215/(2^149) to +16777215/(2^149).

Floating point numbers represent all of the infinite possible values between any two numbers; but, computers cannot hold an infinite number of values. So a compromise is made. The floating point numbers hold an approximation of the value.
This means that if you pick a value that is "more" than the stored floating point number, but not enough to arrive at the "next" storable approximation, then storing that logically bigger number won't actually change the floating point value.
The "error" in a floating point approximation is variable. For small numbers, the error is more precise; for bigger numbers, the error proportionally the same, but a bigger actual value.

Computing floating point accuracy (K&R 2-1)

I found Stevens Computing Services – K & R Exercise 2-1 a very thorough answer to K&R 2-1. This slice of the full code computes the maximum value of a float type in the C programming language.
Unluckily my theoretical comprehension of float values is quite limited. I know they are composed of significand (mantissa.. ) and a magnitude which is a power of 2.
#include <stdio.h>
#include <limits.h>
#include <float.h>
main()
{
float flt_a, flt_b, flt_c, flt_r;
/* FLOAT */
printf("\nFLOAT MAX\n");
printf("<limits.h> %E ", FLT_MAX);
flt_a = 2.0;
flt_b = 1.0;
while (flt_a != flt_b) {
flt_m = flt_b; /* MAX POWER OF 2 IN MANTISSA */
flt_a = flt_b = flt_b * 2.0;
flt_a = flt_a + 1.0;
}
flt_m = flt_m + (flt_m - 1); /* MAX VALUE OF MANTISSA */
flt_a = flt_b = flt_c = flt_m;
while (flt_b == flt_c) {
flt_c = flt_a;
flt_a = flt_a * 2.0;
flt_b = flt_a / 2.0;
}
printf("COMPUTED %E\n", flt_c);
}
I understand that the latter part basically checks to which power of 2 it's possible to raise the significand with a three variable algorithm. What about the first part?
I can see that a progression of multiples of 2 should eventually determine the value of the significand, but I tried to trace a few small numbers to check how it should work and it failed to find the right values...
======================================================================
What are the concepts on which this program is based upon and does this program gets more precise as longer and non-integer numbers have to be found?

The first loop determines the number of bits contributing to the significand by finding the least power 2 such that adding 1 to it (using floating-point arithmetic) fails to change its value. If that's the nth power of two, then the significand uses n bits, because with n bits you can express all the integers from 0 through 2^n - 1, but not 2^n. The floating-point representation of 2^n must therefore have an exponent large enough that the (binary) units digit is not significant.
By that same token, having found the first power of 2 whose float representation has worse than unit precision, the maximim float value that does have unit precision is one less. That value is recorded in variable flt_m.
The second loop then tests for the maximum exponent by starting with the maximum unit-precision value, and repeatedly doubling it (thereby increasing the exponent by 1) until it finds that the result cannot be converted back by halving it. The maximum float is the value before that final doubling.
Do note, by the way, that all the above supposes a base-2 floating-point representation. You are unlikely to run into anything different, but C does not actually require any specific representation.
With respect to the second part of your question,
does this program gets more precise as longer and non-integer numbers have to be found?
the program takes care to avoid losing precision. It does assume a binary floating-point representation such as you described, but it will work correctly regardless of the number of bits in the significand or exponent of such a representation. No non-integers are involved, but the program already deals with numbers that have worse than unit precision, and with numbers larger than can be represented with type int.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight