Why does this floating-point loop terminate at 1,000,000? - c

The answer to this sample homework problem is "1,000,000", but I do not understand why:
What is the output of the following code?
#include <stdio.h>

int main(void) {
    float k = 1;
    while (k != k + 1) {
        k = k + 1;
    }
    printf("%g", k); // %g means output a floating point variable in decimal
}
If the program runs indefinitely but produces no output, write INFINITE LOOP as the answer to the question. All of the programs compile and run. They may or may not contain serious errors, however. You should assume that int is four bytes. You should assume that float has the equivalent of six decimal digits of precision. You may round your answer off to the nearest power of 10 (e.g., you can say 1,000 instead of 2^10 (i.e., 1024)).
I do not understand why the loop would ever terminate.

It doesn't run forever for the simple reason that floating point numbers are not perfect.
At some point, k will become big enough so that adding 1 to it will have no effect.
At that point, k will be equal to k+1 and your loop will exit.
Floating point numbers can be differentiated by a single unit only when they're in a certain range.
As an example, let's say you have an integer type with 3 decimal digits of precision for a positive integer and a single-decimal-digit exponent.
With this, you can represent the numbers 0 through 999 perfectly as 000×10^0 through 999×10^0 (since 10^0 is 1).
What happens when you want to represent 1000? You need to use 100×10^1. This is still represented perfectly.
However, there is no accurate way to represent 1001 with this scheme; the next number you can represent is 101×10^1, which is 1010.
So, when you add 1 to 1000, you'll get the closest match, which is 1000.

The code is using a float variable.
As specified in the question, float has 6 digits of precision, meaning that any digits after the sixth will be inaccurate. Therefore, once you pass a million, the final digit becomes inaccurate, so incrementing by 1 can have no effect.

The output of this program is not specified by the C standard, since the semantics of the float type are not specified. One likely result (what you will get on a platform for which float arithmetic is evaluated in IEEE-754 single precision) is 2^24.
All integers smaller than 2^24 are exactly representable in single precision, so the computation will not stop before that point. The next representable single precision number after 2^24, however, is 2^24 + 2. Since 2^24 + 1 is exactly halfway between that number and 2^24, in the default IEEE-754 rounding mode it rounds to the one whose trailing bit is zero, which is 2^24.
Other likely answers include 2^53 and 2^64. Still other answers are possible. Infinity (the floating-point value) could result on a platform for which the default rounding mode is round up, for example. As others have noted, an infinite loop is also possible on platforms that evaluate floating-point expressions in a wider type (which is the source of all sorts of programmer confusion, but allowed by the C standard).
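Here is a short sketch that makes both claims observable, assuming an IEEE-754 binary32 float and nextafterf() from <math.h>: the next representable value after 2^24 is 2^24 + 2, and adding 1 rounds back down to 2^24.
#include <math.h>
#include <stdio.h>

int main(void) {
    float x = 16777216.0f;  /* 2^24 */
    /* On an IEEE-754 platform the next representable float is expected to be
       16777218 (2^24 + 2). */
    printf("next after 2^24: %.1f\n", nextafterf(x, INFINITY));
    /* Storing the sum into a float forces it back to single precision;
       2^24 + 1 rounds to 2^24 under ties-to-even. */
    float y = x + 1.0f;
    printf("2^24 + 1 stored in a float: %.1f\n", y);
    return 0;
}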

Actually, on most C compilers, this will run forever (infinite loop), though the precise behavior is implementation defined.
The reason that most compilers give an infinite loop is that they evaluate all floating-point expressions at double precision and only round values back to float (single) precision when storing into a variable. So when the value of k gets to about 2^24, k == k + 1 will still evaluate as false (as a double can hold the value k+1 without rounding), but the k = k + 1 assignment will be a no-op, as k+1 has to be rounded to fit into a float.
edit
GCC on x86 exhibits this infinite-loop behavior. Interestingly, on x64 it does not, as it uses SSE instructions, which do the comparison in float precision.
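As a hedged sketch of what this answer describes, the following spells out the two steps separately, doing the comparison in double but rounding the assignment back to float (mirroring what an x87-based compiler may do implicitly):
#include <stdio.h>

int main(void) {
    float k = 16777216.0f;                       /* 2^24 */
    /* Comparison carried out in double: 2^24 + 1 is exactly representable
       there, so the two values really do differ... */
    int differs = ((double)k != (double)k + 1.0);
    /* ...but storing the sum back into a float rounds it to 2^24 again,
       so the assignment is a no-op. */
    k = (float)((double)k + 1.0);
    printf("differs = %d, k = %g\n", differs, k);   /* differs = 1, k = 1.67772e+07 */
    return 0;
}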

Related

What happens if we keep dividing float 1.0 by 2 until it reaches zero?

float f = 1.0;
while (f != 0.0) f = f / 2.0;
This loop runs 150 times using 32-bit precision. Why is that so? Is it getting rounded to zero?
In common C implementations, the IEEE-754 binary32 format is used for float. It is also called “single precision.” It is a binary-based format where finite numbers are represented as ±f•2^e, where f is a 24-bit binary numeral in [1, 2) and e is an integer in [−126, 127].
In this format, 1 is represented as +1.00000000000000000000000₂•2^0. Dividing that by 2 yields ½, which is represented as +1.00000000000000000000000₂•2^−1. Dividing that by 2 yields +1.00000000000000000000000₂•2^−2, then +1.00000000000000000000000₂•2^−3, and so on until we reach +1.00000000000000000000000₂•2^−126.
When that is divided by two, the mathematical result is +1.00000000000000000000000₂•2^−127, but −127 is below the normal exponent range, [−126, 127]. Instead, the significand becomes denormalized; 2^−127 is represented by +0.10000000000000000000000₂•2^−126. Dividing that by 2 yields +0.01000000000000000000000₂•2^−126, then +0.00100000000000000000000₂•2^−126, +0.00010000000000000000000₂•2^−126, and so on until we get to +0.00000000000000000000001₂•2^−126.
At this point, we have done 149 divisions by 2; +0.00000000000000000000001₂•2^−126 is 2^−149.
When the next division is performed, the result would be 2^−150, but that is not representable in this format. Even with the lowest non-zero significand, 0.00000000000000000000001₂, and the lowest exponent, −126, we cannot get to 2^−150. The next lower representable number is +0.00000000000000000000000₂•2^−126, which equals 0.
So, the real-number-arithmetic result of the division would be 2^−150, but we cannot represent that in this format. The two nearest representable numbers are +0.00000000000000000000001₂•2^−126 just above it and +0.00000000000000000000000₂•2^−126 just below it. They are equally near 2^−150. The default rounding method is to take the nearest representable number and, in case of ties, to take the number with the even low digit. So +0.00000000000000000000000₂•2^−126 wins the tie, and that is produced as the result for the 150th division.
What happens is simply that your system has only a limited number of bits available for a variable, and hence limited precision; even though, mathematically, you can halve a number (!= 0) indefinitely without ever reaching zero, in a computer implementation that has a limited precision for a float variable, that variable will inevitably, at some stage, become indistinguishable from zero. The more bits your system uses, the more precision it has and the later this will happen, but at some stage it will.
Since I suppose this is meant to be C, I just implemented it in C (with a counter counting each iteration), and indeed it ran for 150 rounds until the loop ended. I also implemented it with a double, where it ran for 1075 iterations. Keep in mind, however, that the C standard does not define the exact precision of a float variable. In most implementations it's 32 bits for a float and 64 for a double. With a long double, I get 16,446 iterations.
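A sketch of the counting program described above, assuming IEEE-754 binary32 for float (where 150 iterations are expected):
#include <stdio.h>

int main(void) {
    float f = 1.0f;
    int count = 0;
    /* Halve f until it underflows to zero, counting the iterations.
       With IEEE-754 binary32 floats this is expected to print 150. */
    while (f != 0.0f) {
        f = f / 2.0f;
        count++;
    }
    printf("%d iterations\n", count);
    return 0;
}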

Nonintuitive result of the assignment of a double precision number to an int variable in C

Could someone give me an explanation of why I get two different numbers, 14 and 15 respectively, as output from the following code?
#include <stdio.h>
int main()
{
    double Vmax = 2.9;
    double Vmin = 1.4;
    double step = 0.1;
    double a =(Vmax-Vmin)/step;
    int b = (Vmax-Vmin)/step;
    int c = a;
    printf("%d %d",b,c); // 14 15, why?
    return 0;
}
I expect to get 15 in both cases but it seems I'm missing some fundamentals of the language.
I am not sure if it's relevant but I was doing the test in CodeBlocks. However, if I type the same lines of code in some on-line compiler ( this one for example) I get an answer of 15 for the two printed variables.
... why I get two different numbers ...
Aside from the usual floating-point issues, the values of b and c are arrived at by different computation paths. c is calculated by first saving the value as double a.
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
C allows intermediate floating-point math to be computed using wider types. Check the value of FLT_EVAL_METHOD from <float.h>.
Except for assignment and cast (which remove all extra range and precision), ...
-1  indeterminable;
 0  evaluate all operations and constants just to the range and precision of the type;
 1  evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;
 2  evaluate all operations and constants to the range and precision of the long double type.
C11dr §5.2.4.2.2 9
OP reported 2
By saving the quotient in double a = (Vmax-Vmin)/step;, precision is forced to double whereas int b = (Vmax-Vmin)/step; could compute as long double.
This subtle difference results from (Vmax-Vmin)/step (computed perhaps as long double) being saved as a double versus remaining a long double. One ends up as 15 (or just above), the other just under 15. int truncation amplifies this difference to 15 and 14.
On another compiler, the results may both have been the same due to FLT_EVAL_METHOD < 2 or other floating-point characteristics.
Conversion from a floating-point number to int truncates, which is unforgiving for values near a whole number. It is often better to use round() or lround(). The best solution is situation dependent.
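A minimal sketch of that suggestion, using lround() from <math.h> so the near-15 quotient is rounded rather than truncated:
#include <math.h>
#include <stdio.h>

int main(void) {
    double Vmax = 2.9;
    double Vmin = 1.4;
    double step = 0.1;
    /* lround() rounds to the nearest integer, so a quotient just under 15
       still yields 15, regardless of FLT_EVAL_METHOD. */
    int b = (int)lround((Vmax - Vmin) / step);
    printf("%d\n", b);   /* 15 */
    return 0;
}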
This is indeed an interesting question, here is what happens precisely in your hardware. This answer gives the exact calculations with the precision of IEEE double precision floats, i.e. 52 bits mantissa plus one implicit bit. For details on the representation, see the wikipedia article.
Ok, so you first define some variables:
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
The respective values in binary will be
Vmax = 10.111001100110011001100110011001100110011001100110011
Vmin = 1.0110011001100110011001100110011001100110011001100110
step = .00011001100110011001100110011001100110011001100110011010
If you count the bits, you will see that I have given the first bit that is set plus 52 bits to the right. This is exactly the precision at which your computer stores a double. Note that the value of step has been rounded up.
Now you do some math on these numbers. The first operation, the subtraction, results in the precise result:
10.111001100110011001100110011001100110011001100110011
- 1.0110011001100110011001100110011001100110011001100110
--------------------------------------------------------
1.1000000000000000000000000000000000000000000000000000
Then you divide by step, which has been rounded up by your compiler:
1.1000000000000000000000000000000000000000000000000000
/ .00011001100110011001100110011001100110011001100110011010
--------------------------------------------------------
1110.1111111111111111111111111111111111111111111111111100001111111111111
Due to the rounding of step, the result is a tad below 15. Unlike before, I have not rounded immediately, because that is precisely where the interesting stuff happens: Your CPU can indeed store floating point numbers of greater precision than a double, so rounding does not take place immediately.
So, when you convert the result of (Vmax-Vmin)/step directly to an int, your CPU simply cuts off the bits after the fractional point (this is how the implicit double -> int conversion is defined by the language standards):
1110.1111111111111111111111111111111111111111111111111100001111111111111
cutoff to int: 1110
However, if you first store the result in a variable of type double, rounding takes place:
1110.1111111111111111111111111111111111111111111111111100001111111111111
rounded: 1111.0000000000000000000000000000000000000000000000000
cutoff to int: 1111
And this is precisely the result you got.
The "simple" answer is that those seemingly-simple numbers 2.9, 1.4, and 0.1 are all represented internally as binary floating point, and in binary, the number 1/10 is represented as the infinitely-repeating binary fraction 0.00011001100110011...[2] . (This is analogous to the way 1/3 in decimal ends up being 0.333333333... .) Converted back to decimal, those original numbers end up being things like 2.8999999999, 1.3999999999, and 0.0999999999. And when you do additional math on them, those .0999999999's tend to proliferate.
And then the additional problem is that the path by which you compute something -- whether you store it in intermediate variables of a particular type, or compute it "all at once", meaning that the processor might use internal registers with greater precision than type double -- can end up making a significant difference.
The bottom line is that when you convert a double back to an int, you almost always want to round, not truncate. What happened here was that (in effect) one computation path gave you 15.0000000001 which truncated down to 15, while the other gave you 14.999999999 which truncated all the way down to 14.
See also question 14.4a in the C FAQ list.
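One way to see the two computation paths side by side is the sketch below; the exact digits depend on the platform, and FLT_EVAL_METHOD from <float.h> tells you whether intermediates may be kept wider than double.
#include <float.h>
#include <stdio.h>

int main(void) {
    double Vmax = 2.9, Vmin = 1.4, step = 0.1;
    printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
    /* Forcing the division into long double shows the "just under 15" value
       an x87-style platform works with before it is rounded to double. */
    long double q = (long double)(Vmax - Vmin) / step;
    printf("%.21Lg\n", q);   /* e.g. 14.9999999999999991673 where long double is 80-bit */
    /* Truncating the wide value gives 14; rounding it to double first gives 15. */
    printf("%d %d\n", (int)q, (int)(double)q);   /* likely 14 15 */
    return 0;
}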
An equivalent problem is analyzed in analysis of C programs for FLT_EVAL_METHOD==2.
If FLT_EVAL_METHOD==2:
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
computes b by evaluating a long double expression then truncating it to an int, whereas for c the expression is evaluated as long double, truncated to double, and then to int.
So the two values are not obtained by the same process, and this may lead to different results because floating types do not provide the usual exact arithmetic.

Do the rules of arithmetic hold for addition in C?

I'm new to C, and I'm having such a hard time understanding this material. I really need help! Please someone help.
In arithmetic, the sum of any two positive integers is greater than either:
(n+m) > n for n, m > 0
(n+m) > m for n, m > 0
C has an addition operator +. Does this arithmetic rule hold in C?
I know this is False. But can please someone explain to me why so, I can understand it? Please provide counter-example?
Thank you in advance.
(I won't solve this for you, but will provide some pointers.)
It is false for both integer and floating-point arithmetic, for different reasons.
Integers are susceptible to overflow.
Adding a very small floating-point number m to a very large number n returns n. Have a read of What Every Computer Scientist Should Know About Floating-Point Arithmetic.
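For concreteness, here is a hedged sketch of those two pointers as runnable counter-examples (using unsigned arithmetic so the integer case stays well-defined):
#include <limits.h>
#include <stdio.h>

int main(void) {
    /* Integer case: unsigned addition wraps around, so n + m is not > n. */
    unsigned int n = UINT_MAX;
    unsigned int m = 1;
    printf("n + m = %u\n", n + m);                           /* 0 */

    /* Floating-point case: adding a tiny number to a huge one changes nothing. */
    float big = 1.0e20f;
    float small = 1.0f;
    printf("big + small > big? %d\n", big + small > big);    /* likely 0 */
    return 0;
}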
It doesn't hold, since C's integers are not "abstract", infinitely-sized integers like the integers of mathematics.
In C, integers are discrete and digital, and implemented using a fixed number of bits. This leads to a limited range, and to problems when you go (or try to go) out of range. Typically integers will wrap, which is very "un-natural".
A brief search did not turn up good answers describing these issues, so I will attempt to answer this properly here, for beginners.
The answer is false, of course, but why so?
Integers
In C, or in any programming language providing some kind of integer type, that type does not mean "integer" in the mathematical sense. In the mathematical sense, non-negative integers range from 0 to infinity. A computer, however, has only limited storage, so integers necessarily are constrained to something less than infinity.
This alone proves that a + b > a and a + b > b cannot be true all the time, since both a and b can be set up to be less than the largest number the computer can represent in its storage, while a + b is larger than that.
What exactly happens here depends. Some mentioned wraparound, but that's not necessarily the case. The C language in the first place defines signed integer overflow to be undefined behaviour, meaning that anything, including fire and smoke, may happen if the code steps on it (of course in reality that won't happen, but interpreting the standard strictly it could, as could a breach of the space-time continuum).
I won't describe how wraparound works here since it is beyond the scope of the problem itself.
Floating point
The case here is again just the same as for integers: the key to understanding why mathematics doesn't fully apply here is that the computer has limited storage.
Floating-point numbers in the computer's memory are represented much like scientific notation: a mantissa and an exponent. Both of these have a fixed, limited range depending on the type of the floating-point variable.
In base 10, you may picture this as having an exponent ranging from 10^-10 to 10^10, and a mantissa with 4 fraction digits after the decimal point, always normalized.
With this in mind, check these example additions:
1.2345 * (10 ^ 0) + 1.0237 * (10 ^ 5)
5.2345 * (10 ^ 10) + 6.7891 * (10 ^ 10)
The first is an example where the result will equal one of the input numbers while both were larger than zero. The second is an example where the result is out of range.
The floating-point representation computers actually use, however, is capable of representing infinity, and two at that: positive infinity and negative infinity. So while the first example passes as a proof, the second does not, since that addition's result is positive infinity.
With this in mind, however, you could produce another counter-example:
3.1416 * (10 ^ 0) + (+ infinity)
Of course the result is positive infinity, no matter what you add to it. And of course positive infinity is not larger than positive infinity, so the rule is disproved again.
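A sketch of that last case in C, assuming the INFINITY macro from <math.h>:
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Adding a finite value to +infinity yields +infinity, which is not
       greater than itself, so (n + m) > m fails here too. */
    double m = INFINITY;
    double n = 3.1416;
    double sum = n + m;
    printf("sum > m? %d\n", sum > m);   /* 0 */
    return 0;
}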

float variable doesn't meet the conditions (C)

I'm trying to get the user to input a number between 1.00000 and 0.00001, edges not included, into a float variable. I can assume that the user isn't typing more than 5 digits after the dot.
now, here is what I have written:
printf("Enter required Leibniz gap.(Between 0.00001 to 1.00000)\n");
scanf("%f", &gap);
while ((gap < 0.00002) || (gap > 0.99999))
{
    printf("Enter required Leibniz gap.(Between 0.00001 to 1.00000)\n");
    scanf("%f", &gap);
}
Now, when I type the smallest number possible, 0.00002, I'm getting stuck in the while loop.
When I ran the debugger I saw that 0.00002 is stored with this value in the float variable: 1.99999995e-005.
Can anybody clarify what I am doing wrong? Why isn't 0.00002 meeting the conditions? And what is this "1.99999995e-005" thing?
The problem here is that you are using a float variable (gap), but you are comparing it with a double constant (0.00002). The constant is double because floating-point constants in C are double unless otherwise specified.
An underlying issue is that the number 0.00002 is not representable in either float or double. (It's not representable at all in binary floating point because its binary expansion is infinitely long, like the decimal expansion of 1/3.) So when you write 0.00002 in a program, the C compiler substitutes it with a double value which is very close to 0.00002. Similarly, when scanf reads the number 0.00002 into a float variable, it substitutes a float value which is very close to 0.00002. Since double numbers have more bits than floats, the double value is closer to 0.00002 than the float value.
When you compare two floating point values with different precision, the compiler converts the value with less precision into exactly the same value with more precision. (The set of values representable as double is a superset of the set of values representable as float, so it is always possible to find a double whose value is the same as the value of a float.) And that's what happens when gap < 0.00002 is executed: gap is converted to the double of the same value, and that is compared with the double (close to) 0.00002. Since both of these values are actually slightly less than 0.00002, and the double is closer, the float is less than the double.
You can solve this problem in a couple of ways. First, you can avoid the conversion, either by making gap a double and changing the scanf format to %lf, or by comparing gap to a float:
while (gap < 0.00002F || gap > 0.99999F) {
But that's not really correct, for a couple of reasons. First, there is actually no guarantee that the floating point conversion done by the C compiler is the same as the conversion done by the standard library (scanf), and the standard allows the compiler to use "either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner." (It doesn't specify in detail which value scanf produces either, but recommends that it be the nearest representable value.) As it happens, gcc and glibc (the C compiler and standard library used on Linux) both produce the nearest representable value, but other implementations don't.
Anyway, according to your error message, you want the value to be between 0.00001 and 1.00000. So your test should be precisely that:
while (gap <= 0.00001F || gap >= 1.0000F) { ...
(assuming you keep gap as a float.)
Any of the above solutions will work. Personally, I'd make gap a double in order to make the comparison more intuitive, and also change the comparison to compare against 0.00001 and 1.0000.
By the way, the E-05 suffix means "times ten to the power of -5" (the E stands for Exponent). You'll see that a lot; it's a standard way of writing floating point constants.
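Here is a sketch of the "make gap a double" variant suggested above, reading with %lf and testing against the stated bounds (the do/while restructuring is just to avoid repeating the prompt):
#include <stdio.h>

int main(void) {
    double gap;
    do {
        printf("Enter required Leibniz gap.(Between 0.00001 to 1.00000)\n");
        if (scanf("%lf", &gap) != 1)
            return 1;                          /* bail out on malformed input */
    } while (gap <= 0.00001 || gap >= 1.00000);
    printf("Accepted: %g\n", gap);
    return 0;
}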
floats are not capable of storing exact values for every possible number (there are infinitely many numbers between 0 and 1, so that would be impossible). Assigning 0.00002 to a float will store a different but really close number due to the implementation, which is what you are experiencing. Precision decreases as the numbers grow.
So you can't directly compare two close floats and expect healthy results.
More information on floating points can be found on this Wikipedia page.
What you could do is emulate fixed point math. Have an int n = 100000; to represent 1.00000 internally (1000 -> 0.001 and such) and do calculations accordingly or use a fixed point math library.
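A tiny sketch of that fixed-point idea, with the scale factor 100000 chosen here purely for illustration:
#include <stdio.h>

int main(void) {
    /* Represent the gap as an integer count of 0.00001 units:
       1 means 0.00001, 100000 means 1.00000. */
    int gap_scaled = 2;                          /* e.g. the user's 0.00002 */
    if (gap_scaled > 1 && gap_scaled < 100000) {
        printf("gap = 0.%05d\n", gap_scaled);    /* prints gap = 0.00002 */
    }
    return 0;
}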
The significand (fraction part) of a single-precision floating-point number can represent values from -2 to 2-2^-23, with a smallest quantization step of 2^-23. So if some value cannot be represented with such a step, it is represented by the nearest value according to IEEE 754 rounding rules:
0.00002*32768 = 0.655360043 // floating point exponent is chosen.
0.655360043/(2^-23) = 5497558.5 // is not an integer multiplier
// of quantization step, so the
5497558*(2^-23) = 0.655359983 // nearest value is chosen
5497559*(2^-23) = 0.655360103 // from these two variants
The first variant equals 1.999969797×10⁻⁵ in decimal format and the second one equals 1.999999948×10⁻⁵ (just to compare: if we chose 5497560 we would get 2.000000677×10⁻⁵). So the second variant is chosen as the result, and its value is not equal to 0.00002.
The total precision of a floating-point number depends on the exponent value as well (it takes values from -128 to 127): it can be computed by multiplying the quantization step of the fraction part by the scale set by the exponent. In the case of 0.00002 the total precision is (2^-23)×(2^-15) ≈ 3.6×10^-12. It means that if we add to 0.00002 a value smaller than half of this, 0.00002 remains the same. In general it means that the spacing between adjacent representable values near a number is about 2^-23 times the scale set by its exponent.
That is why a very popular approach is to compare two floating-point numbers using some epsilon value which is greater than the quantization step.
Like some of the comments said, due to how floating-point numbers are represented, you will see errors like this.
A solution is to convert the comparison to
gap + 1e-8 < 0.00002
This gives you a small tolerance window, enough to let most cases you want pass and most you don't want fail.
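A sketch of that tolerance check; the 1e-8 slack is this answer's assumed epsilon, not a universal constant:
#include <stdio.h>

int main(void) {
    float gap = 0.00002f;        /* stored as roughly 1.99999995e-05 */
    double eps = 1e-8;
    /* With the slack added, the float that is "almost 0.00002" no longer
       fails the lower-bound test. */
    int too_small = (gap + eps < 0.00002);
    printf("too_small = %d\n", too_small);   /* 0 */
    return 0;
}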

Simple question about 'floating point exception' in C

I have the following C program:
#include <stdio.h>
int main()
{
    double x=0;
    double y=0/x;
    if (y==1)
        printf("y=1\n");
    else
        printf("y=%f\n",y);
    if (y!=1)
        printf("y!=1\n");
    else
        printf("y=%f\n",y);
    return 0;
}
The output I get is
y=nan
y!=1
But when I change the line
double x=0;
to
int x=0;
the output becomes
Floating point exception
Can anyone explain why?
You're causing the division 0/0 with integer arithmetic (which is invalid, and produces the exception you see). Regardless of the type of y, what's evaluated first is 0/x.
When x is declared to be a double, the zero is converted to a double as well, and the operation is performed using floating-point arithmetic.
When x is declared to be an int, you are dividing one int 0 by another, and the result is not valid.
Per IEEE 754, NaN will be produced when conducting an invalid operation on floating-point numbers (e.g. 0/0, ∞×0, or sqrt(−1)).
There are actually two kinds of NaNs, signaling and quiet. Using a
signaling NaN in any arithmetic operation (including numerical
comparisons) will cause an "invalid" exception. Using a quiet NaN
merely causes the result to be NaN too.
The representation of NaNs specified by the standard has some
unspecified bits that could be used to encode the type of error; but
there is no standard for that encoding. In theory, signaling NaNs
could be used by a runtime system to extend the floating-point numbers
with other special values, without slowing down the computations with
ordinary values. Such extensions do not seem to be common, though.
Also, Wikipedia says this about integer division by zero:
Integer division by zero is usually handled differently from floating
point since there is no integer representation for the result. Some
processors generate an exception when an attempt is made to divide an
integer by zero, although others will simply continue and generate an
incorrect result for the division. The result depends on how division
is implemented, and can either be zero, or sometimes the largest
possible integer.
There's a special bit pattern in IEEE 754 which indicates NaN as the result of floating-point division-by-zero errors.
However there's no such representation when using integer arithmetic, so the system has to throw an exception instead of returning NaN.
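A small sketch of the contrast: the floating-point path produces a quiet NaN that isnan() from <math.h> can detect, while the integer path would typically raise the "Floating point exception" signal instead.
#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 0.0;
    double y = 0.0 / x;                      /* invalid operation: quiet NaN */
    printf("isnan(y) = %d\n", isnan(y));     /* 1 */

    /* int x2 = 0;
       int y2 = 0 / x2;   <- integer division by zero: undefined behaviour,
                             usually aborts with "Floating point exception" */
    return 0;
}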
Check the min and max values of an integer data type. You will see that an undefined or NaN result is not in its range.
Also read What Every Computer Scientist Should Know About Floating-Point Arithmetic.
Integer division by 0 is illegal and is not handled. Float values, on the other hand, are handled in C using NaN. The following, however, would work.
int x=0;
double y = 0.0 / x;
If you divide an int by an int, you end up with an integer division by 0.
0/0 in doubles is NaN.
int x=0;
double y=0/x;          // 0/0 as ints, only after that converted to double. You can use
double z=0.0/x;        // or
double t=0/(double)x;  // to avoid the exception and get NaN
Floating point is inherently modeling the reals to limited precision. There are only a finite number of bit-patterns, but an infinite (continuous!) number of reals. It does its best of course, returning the closest representable real to the exact inputs it is given. Answers that are too small to be directly represented are instead represented by zero. Dividing by zero is an error in the real numbers. In floating point, however, because zero can arise from these very small answers, it can be useful to consider x/0.0 (for positive x) to be "positive infinity" or "too big to be represented". This is no longer useful for x = 0.0.
The best we could say is that dividing zero by zero is really "dividing something small that can't be told apart from zero by something small that can't be told apart from zero". What is the answer to this? Well, there is no answer for the exact case of 0/0, and there is no good way of treating it inexactly. It would depend on the relative magnitudes, and so the processor basically shrugs and says "I lost all precision -- any result I gave you would be misleading", by returning Not a Number.
In contrast, when doing an integer divide by zero, the divisor really can only mean precisely zero. There's no possible way to give a consistent meaning to it, so when your code asks for the answer, it really is doing something illegitimate.
(It's an integer division in the second case, but not the first because of the promotion rules of C. 0 can be taken as an integer literal, and as both sides are integers, the division is integer division. In the first case, the fact that x is a double causes the dividend to be promoted to double. If you replace the 0 by 0.0, it will be a floating-point division, no matter the type of x.)

Resources