Do the rules of arithmetic hold for addition in C? - c

I'm new to C, and I'm having such a hard time understanding this material. I really need help! Please someone help.
In arithmetic, the sum of any two positive integers is great than either:
(n+m) > n for n, m > 0
(n+m) > m for n, m > 0
C has an addition operator +. Does this arithmetic rule hold in C?
I know this is False. But can please someone explain to me why so, I can understand it? Please provide counter-example?
Thank you in advance.

(I won't solve this for you, but will provide some pointers.)
It is false for both integer and floating-point arithmetic, for different reasons.
Integers are susceptible to overflow.
Adding a very small floating-point number m to a very large number n returns n. Have a read of What Every Computer Scientist Should Know About Floating-Point Arithmetic.

It doesn't hold, since C's integers are not "abstract" infinititely-sized integers that the real integers (in mathematics) are.
In C, integers are discrete and digital, and implemented using a fixed number of bits. This leads to limited range, and problems when you go (try to) out of range. Typically integers will wrap, which is very "un-natural".

I brief search did not show up nice answers describing these, so I rather attempt to answer this nicely here, for beginners.
The answer is false, of course, but why so?
Integers
In C, or any programming language providing some kind of integer type, this type does not mean it in the mathematical sense. In mathematical sense non-negative integers range from 0 to infinity. A computer, however has only limited storage, so integers necessarily are constrained to something less than infinity.
This alone proves that a + b > a and a + b > b can not be true all the time, since it can be set up so both a and b is less than the largest number the computer can represent in it's storage, but a + b is larger than that.
What exactly happens here, depends. Some mentioned wraparound, but that's not necessarily the case. The C language the first place defines integer overflow to be an undefined behaviour, that whatever, including fire and smoke, may happen if the code happens to step on it (of course in the reality that won't happen, but interpreting the standard strictly it could, as well as the breach of the space-time continuum).
I won't describe how wraparound works here since it is beyond the scope of the problem itself.
Floating point
The case here is again just the same like for integers: the key to understand why mathematics don't fully apply here is that the computer has a limited storage.
Floating numbers in the computers memory are represented much like scientific notation: a mantissa, and an exponent. Both of these have a fixed limited range depending on the type of the floating point variable.
In base 10, you may conceive this like you have the exponent ranging from 10 ^ -10 to 10 ^ 10, and the mantissa having like 4 fraction digits after the decimal point, always normalized.
With this in mind, check these example additions:
1.2345 * (10 ^ 0) + 1.0237 * (10 ^ 5)
5.2345 * (10 ^ 10) + 6.7891 * (10 ^ 10)
The first is an example where the result will equal one of the input numbers while both were larger than zero. The second is an example where the result is out of range.
The floating point representation computers use however is capable to represent infinity, and two at that: positive infinity and negative infinity. So while the first example passes as a proof, the second does not, since that addition's result is positive infinity.
However with this in mind, you could produce an another proofing example:
3.1416 * (10 ^ 0) + (+ infinity)
Of course the result is positive infinity, no matter what you add it to. And of course positive infinity is not larger than positive infinity, so proved again.

Related

Simple floating point multiplication not giving expected result [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 3 years ago.
When given the input 150, I expect the output to be the mathematically correct answer 70685.7750, but I am getting the wrong output 70685.7812.
#include<stdio.h>
int main()
{
float A,n,R;
n=3.14159;
scanf("%f",&R);
A=n*(R*R);
printf("A=%.4f\n",A);
}
float and double numbers are not represented very accurately in the memory. The main reason is that the memory is limited, and most non-integers are not.
The best example is PI. You can specify as many digits as you want, but it will still be an approximation.
The limited precision of representing the numbers is the reason of the following rule:
when working with floats and double numbers, not not check for equality (m == n), but check that the difference between them is smaller than a certain error ((m-n) < e)
Please note, as mentioned in the comments too, that the above rule is not "the mother rule of all rules". There are other rules also.
Careful analysis must be done for each particular situation, in order to have a properly working application.
(Thanks #EricPostpischil for the reminder)
It is common for a variable of type float to be an IEEE-754 32-bit floating point number.
The number 3.14159 cannot be stored exactly in an IEEE-754 32-bit float - the closest value is approximately 3.14159012. 150 * 150 * 3.14159012 is 70685.7777, and the closest value to this that can be represented in a 32-bit float is 70685.78125, which you are then printing with %.4f so you see 70685.7812.
Another way of thinking about this is that your n value only ends up being accurate to the sixth significant figure, so - as you are just calculating a series of multiplications - your result is also only acccurate to the sixth significant figure (ie 70685.8). (In the general case this can be worse - for example subtraction of two close values can lead to a large increase in the relative error).
If you switch to using variables of type double (and change the scanf() to use %lf), then you will likely get the answer you are after. double is typically a 64-bit float, which means that the error in the representation of your n values and the result is small enough not to affect the fourth decimal place.
Have you heard that float and double values aren't always perfectly accurate, have limited precision? Have you heard that type float gives you the equivalent of only about 7 decimal digits' worth of precision? This is what that means. Your expected and actual answers, 70685.7750 and 70685.7812, differ in the seventh digit, just about as expected.
I expect the output to be the mathematically correct answer
I am sorry to disappoint you, but that's your mistake. As a general rule, when you're doing floating-point arithmetic, you will never get the mathematically correct answer, you will always get a limited-precision approximation of the mathematically correct answer.
The canonical SO answers to this sort of question are collected at Is floating point math broken?. You might want to read some of those answers for more enlightenment.

C thinking : float vs. integers and float representation

When using integers in C (and in many other languages), one must pay attention when dividing about precision. It is always better to multiply and add things (thus creating a larger intermediary result, so long as it doesn't overflow) before dividing.
But what about floats? Does that still hold? Or are they represented in such a way that it is better to divide number of similar orders of magnitude rather than large ones by small ones?
The representation of floats/doubles and similar floating-point working, is geared towards retaining numbers of significant digits (aka "precision"), rather than a fixed number of decimal places, such as happens in fixed-point, or integer working.
It is best to avoid combining quantities, that may give rise to implicit under or overflow in terms of the exponent, ie at the limits of the floating-point number range.
Hence, addition/subtraction of quantities of widely differing magnitudes (either explicitly, or due to having opposite signs)) should be avoided and re-arranged, where possible, to avoid this well-known route to lost precision.
Example: it's better to refactor/re-order
small + big + small + big + small * big
as
(small+small+small) + big + big
since the smalls individually might make no difference to a big, and hence their contribution might disappear.
If there is any "noise" or imprecision in the lower bits of any quantity, it's also wise to be aware how loss of significant bits propagates through a computation.
With integers:
As long as there is no overflow, +,-,* is always exact.
With division, the result is truncated and often not equal to the mathematical answer.
ia,ib,ic, multiplying before dividing ia*ib/ic vs ia*(ib/ic) is better as the quotient is based on more bits of the product ia*ib than ib.
With floating point:
Issues are subtle. Again, as long as no over/underflow, the order or *,/ sequence make less impact than with integers. FP */- is akin to adding/subtracting logs. Typical results are within 0.5 ULP of the mathematically correct answer.
With FP and +,- the result of fa,fb,fc can have significant differences than the mathematical correct one when 1) values are far apart in magnitude or 2) subtracting values that are nearly equal and the error in a prior calculation now become significant.
Consider the quadratic equation:
double d = sqrt(b*b - 4*a/c); // assume b*b - 4*a/c >= 0
double root1 = (-b + d)/(2*a);
double root2 = (-b - d)/(2*a);
Versus
double d = sqrt(b*b - 4*a/c); // assume b*b - 4*a/c >= 0
double root1 = (b < 0) ? (-b + d)/(2*a) : (-b - d)/(2*a)
double root2 = c/(a*root1); // assume a*root1 != 0
The 2nd has much better root2 precision result when one root is near 0 and |b| is nearly d. This is because the b,d subtraction cancels many bits of significance allowing the error in the calculation of d to become significant.
(for integer) It is always better to multiply and add things (thus creating a larger intermediary result, so long as it doesn't overflow) before dividing.
Does that still hold (for floats)?
In general the answer is No
It is easy to construct an example where adding all input before division will give you a huge rounding error.
Assume you want to add 10000000000 values and divide them by 1000. Further assume that each value is 1. So the expected result is 10000000.
Method 1
However, if you add all the values before division, you'll get the result 16777.216 (for a 32 bit float). As you can see it is pretty much off.
Method 2
So is it better to divide each value by 1000 before adding it to the result? If you do that, you'll get the result 32768.0 (for a 32 bit float). As you can see it is pretty much off as well.
Method 3
However, if you go on adding values until the temporary result is greater than 1000000 and then divide the temporary result by 1000 and add that intermediate result to the final result and repeats that until you have added a total 10000000000 values, you will get the correct result.
So there is no simple "always add before division" or "always divide before adding" when dealing with floating point. As a general rule it is typically a good idea to keep operands in similar magnitude. That is what the third example does.

Compiler does not recognise matching float values [duplicate]

I know UIKit uses CGFloat because of the resolution independent coordinate system.
But every time I want to check if for example frame.origin.x is 0 it makes me feel sick:
if (theView.frame.origin.x == 0) {
// do important operation
}
Isn't CGFloat vulnerable to false positives when comparing with ==, <=, >=, <, >?
It is a floating point and they have unprecision problems: 0.0000000000041 for example.
Is Objective-C handling this internally when comparing or can it happen that a origin.x which reads as zero does not compare to 0 as true?
First of all, floating point values are not "random" in their behavior. Exact comparison can and does make sense in plenty of real-world usages. But if you're going to use floating point you need to be aware of how it works. Erring on the side of assuming floating point works like real numbers will get you code that quickly breaks. Erring on the side of assuming floating point results have large random fuzz associated with them (like most of the answers here suggest) will get you code that appears to work at first but ends up having large-magnitude errors and broken corner cases.
First of all, if you want to program with floating point, you should read this:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
Yes, read all of it. If that's too much of a burden, you should use integers/fixed point for your calculations until you have time to read it. :-)
Now, with that said, the biggest issues with exact floating point comparisons come down to:
The fact that lots of values you may write in the source, or read in with scanf or strtod, do not exist as floating point values and get silently converted to the nearest approximation. This is what demon9733's answer was talking about.
The fact that many results get rounded due to not having enough precision to represent the actual result. An easy example where you can see this is adding x = 0x1fffffe and y = 1 as floats. Here, x has 24 bits of precision in the mantissa (ok) and y has just 1 bit, but when you add them, their bits are not in overlapping places, and the result would need 25 bits of precision. Instead, it gets rounded (to 0x2000000 in the default rounding mode).
The fact that many results get rounded due to needing infinitely many places for the correct value. This includes both rational results like 1/3 (which you're familiar with from decimal where it takes infinitely many places) but also 1/10 (which also takes infinitely many places in binary, since 5 is not a power of 2), as well as irrational results like the square root of anything that's not a perfect square.
Double rounding. On some systems (particularly x86), floating point expressions are evaluated in higher precision than their nominal types. This means that when one of the above types of rounding happens, you'll get two rounding steps, first a rounding of the result to the higher-precision type, then a rounding to the final type. As an example, consider what happens in decimal if you round 1.49 to an integer (1), versus what happens if you first round it to one decimal place (1.5) then round that result to an integer (2). This is actually one of the nastiest areas to deal with in floating point, since the behaviour of the compiler (especially for buggy, non-conforming compilers like GCC) is unpredictable.
Transcendental functions (trig, exp, log, etc.) are not specified to have correctly rounded results; the result is just specified to be correct within one unit in the last place of precision (usually referred to as 1ulp).
When you're writing floating point code, you need to keep in mind what you're doing with the numbers that could cause the results to be inexact, and make comparisons accordingly. Often times it will make sense to compare with an "epsilon", but that epsilon should be based on the magnitude of the numbers you are comparing, not an absolute constant. (In cases where an absolute constant epsilon would work, that's strongly indicative that fixed point, not floating point, is the right tool for the job!)
Edit: In particular, a magnitude-relative epsilon check should look something like:
if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y))
Where FLT_EPSILON is the constant from float.h (replace it with DBL_EPSILON fordoubles or LDBL_EPSILON for long doubles) and K is a constant you choose such that the accumulated error of your computations is definitely bounded by K units in the last place (and if you're not sure you got the error bound calculation right, make K a few times bigger than what your calculations say it should be).
Finally, note that if you use this, some special care may be needed near zero, since FLT_EPSILON does not make sense for denormals. A quick fix would be to make it:
if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y) || fabs(x-y) < FLT_MIN)
and likewise substitute DBL_MIN if using doubles.
Since 0 is exactly representable as an IEEE754 floating-point number (or using any other implementation of f-p numbers I've ever worked with) comparison with 0 is probably safe. You might get bitten, however, if your program computes a value (such as theView.frame.origin.x) which you have reason to believe ought to be 0 but which your computation cannot guarantee to be 0.
To clarify a little, a computation such as :
areal = 0.0
will (unless your language or system is broken) create a value such that (areal==0.0) returns true but another computation such as
areal = 1.386 - 2.1*(0.66)
may not.
If you can assure yourself that your computations produce values which are 0 (and not just that they produce values which ought to be 0) then you can go ahead and compare f-p values with 0. If you can't assure yourself to the required degree, best stick to the usual approach of 'toleranced equality'.
In the worst cases the careless comparison of f-p values can be extremely dangerous: think avionics, weapons-guidance, power-plant operations, vehicle navigation, almost any application in which computation meets the real world.
For Angry Birds, not so dangerous.
I want to give a bit of a different answer than the others. They are great for answering your question as stated but probably not for what you need to know or what your real problem is.
Floating point in graphics is fine! But there is almost no need to ever compare floats directly. Why would you need to do that? Graphics uses floats to define intervals. And comparing if a float is within an interval also defined by floats is always well defined and merely needs to be consistent, not accurate or precise! As long as a pixel (which is also an interval!) can be assigned that's all graphics needs.
So if you want to test if your point is outside a [0..width[ range this is just fine. Just make sure you define inclusion consistently. For example always define inside is (x>=0 && x < width). The same goes for intersection or hit tests.
However, if you are abusing a graphics coordinate as some kind of flag, like for example to see if a window is docked or not, you should not do this. Use a boolean flag that is separate from the graphics presentation layer instead.
Comparing to zero can be a safe operation, as long as the zero wasn't a calculated value (as noted in an above answer). The reason for this is that zero is a perfectly representable number in floating point.
Talking perfectly representable values, you get 24 bits of range in a power-of-two notion (single precision). So 1, 2, 4 are perfectly representable, as are .5, .25, and .125. As long as all your important bits are in 24-bits, you are golden. So 10.625 can be repsented precisely.
This is great, but will quickly fall apart under pressure. Two scenarios spring to mind:
1) When a calculation is involved. Don't trust that sqrt(3)*sqrt(3) == 3. It just won't be that way. And it probably won't be within an epsilon, as some of the other answers suggest.
2) When any non-power-of-2 (NPOT) is involved. So it may sound odd, but 0.1 is an infinite series in binary and therefore any calculation involving a number like this will be imprecise from the start.
(Oh and the original question mentioned comparisons to zero. Don't forget that -0.0 is also a perfectly valid floating-point value.)
[The 'right answer' glosses over selecting K. Selecting K ends up being just as ad-hoc as selecting VISIBLE_SHIFT but selecting K is less obvious because unlike VISIBLE_SHIFT it is not grounded on any display property. Thus pick your poison - select K or select VISIBLE_SHIFT. This answer advocates selecting VISIBLE_SHIFT and then demonstrates the difficulty in selecting K]
Precisely because of round errors, you should not use comparison of 'exact' values for logical operations. In your specific case of a position on a visual display, it can't possibly matter if the position is 0.0 or 0.0000000003 - the difference is invisible to the eye. So your logic should be something like:
#define VISIBLE_SHIFT 0.0001 // for example
if (fabs(theView.frame.origin.x) < VISIBLE_SHIFT) { /* ... */ }
However, in the end, 'invisible to the eye' will depend on your display properties. If you can upper bound the display (you should be able to); then choose VISIBLE_SHIFT to be a fraction of that upper bound.
Now, the 'right answer' rests upon K so let's explore picking K. The 'right answer' above says:
K is a constant you choose such that the accumulated error of your
computations is definitely bounded by K units in the last place (and
if you're not sure you got the error bound calculation right, make K a
few times bigger than what your calculations say it should be)
So we need K. If getting K is more difficult, less intuitive than selecting my VISIBLE_SHIFT then you'll decide what works for you. To find K we are going to write a test program that looks at a bunch of K values so we can see how it behaves. Ought to be obvious how to choose K, if the 'right answer' is usable. No?
We are going to use, as the 'right answer' details:
if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
Let's just try all values of K:
#include <math.h>
#include <float.h>
#include <stdio.h>
void main (void)
{
double x = 1e-13;
double y = 0.0;
double K = 1e22;
int i = 0;
for (; i < 32; i++, K = K/10.0)
{
printf ("K:%40.16lf -> ", K);
if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
printf ("YES\n");
else
printf ("NO\n");
}
}
ebg#ebg$ gcc -o test test.c
ebg#ebg$ ./test
K:10000000000000000000000.0000000000000000 -> YES
K: 1000000000000000000000.0000000000000000 -> YES
K: 100000000000000000000.0000000000000000 -> YES
K: 10000000000000000000.0000000000000000 -> YES
K: 1000000000000000000.0000000000000000 -> YES
K: 100000000000000000.0000000000000000 -> YES
K: 10000000000000000.0000000000000000 -> YES
K: 1000000000000000.0000000000000000 -> NO
K: 100000000000000.0000000000000000 -> NO
K: 10000000000000.0000000000000000 -> NO
K: 1000000000000.0000000000000000 -> NO
K: 100000000000.0000000000000000 -> NO
K: 10000000000.0000000000000000 -> NO
K: 1000000000.0000000000000000 -> NO
K: 100000000.0000000000000000 -> NO
K: 10000000.0000000000000000 -> NO
K: 1000000.0000000000000000 -> NO
K: 100000.0000000000000000 -> NO
K: 10000.0000000000000000 -> NO
K: 1000.0000000000000000 -> NO
K: 100.0000000000000000 -> NO
K: 10.0000000000000000 -> NO
K: 1.0000000000000000 -> NO
K: 0.1000000000000000 -> NO
K: 0.0100000000000000 -> NO
K: 0.0010000000000000 -> NO
K: 0.0001000000000000 -> NO
K: 0.0000100000000000 -> NO
K: 0.0000010000000000 -> NO
K: 0.0000001000000000 -> NO
K: 0.0000000100000000 -> NO
K: 0.0000000010000000 -> NO
Ah, so K should be 1e16 or larger if I want 1e-13 to be 'zero'.
So, I'd say you have two options:
Do a simple epsilon computation using your engineering judgement for the value of 'epsilon', as I've suggested. If you are doing graphics and 'zero' is meant to be a 'visible change' than examine your visual assets (images, etc) and judge what epsilon can be.
Don't attempt any floating point computations until you've read the non-cargo-cult answer's reference (and gotten your Ph.D in the process) and then use your non-intuitive judgement to select K.
The correct question: how does one compare points in Cocoa Touch?
The correct answer: CGPointEqualToPoint().
A different question: Are two calculated values are the same?
The answer posted here: They are not.
How to check if they are close? If you want to check if they are close, then don't use CGPointEqualToPoint(). But, don't check to see if they are close. Do something that makes sense in the real world, like checking to see if a point is beyond a line or if a point is inside a sphere.
The last time I checked the C standard, there was no requirement for floating point operations on doubles (64 bits total, 53 bit mantissa) to be accurate to more than that precision. However, some hardware might do the operations in registers of greater precision, and the requirement was interpreted to mean no requirement to clear lower order bits (beyond the precision of the numbers being loaded into the registers). So you could get unexpected results of comparisons like this depending on what was left over in the registers from whoever slept there last.
That said, and despite my efforts to expunge it whenever I see it, the outfit where I work has lots of C code that is compiled using gcc and run on linux, and we have not noticed any of these unexpected results in a very long time. I have no idea whether this is because gcc is clearing the low-order bits for us, the 80-bit registers are not used for these operations on modern computers, the standard has been changed, or what. I'd like to know if anyone can quote chapter and verse.
You can use such code for compare float with zero:
if ((int)(theView.frame.origin.x * 100) == 0) {
// do important operation
}
This will compare with 0.1 accuracy, that enough for CGFloat in this case.
Another issue that may need to be kept in mind is that different implementations do things differently. One example of this that I am very familiar with is the FP units on the Sony Playstation 2. They have significant discrepancies when compared to the IEEE FP hardware in any X86 device. The cited article mentions the complete lack of support for inf and NaN, and it gets worse.
Less well known is what I came to know as the "one bit multiply" error. For certain values of float x:
y = x * 1.0;
assert(y == x);
would fail the assert. In the general case, sometimes, but not always, the result of a FP multiply on the Playstation 2 had a mantissa that was a single bit less than the equivalent IEEE mantissa.
My point being that you should not assume that porting FP code from one platform to another will produce the same results. Any given platform is internally consistent, in that results don't change on that platform, it's just that they may not agree with a different platform. E.g. CPython on X86 uses 64 bit doubles to represent floats, while CircuitPython on a Cortex MO has to use software FP, and only uses 32 bit floats. Needless to say that will introduce discrepancies.
A quote I learned over 40 years ago is as true today as the day I learned it. "Doing floating point maths on a computer is like moving a pile of sand. Every time you do anything, you leave a little sand behind and pick up a little dirt."
Playstation is a registered trademark of Sony Corporation.
-(BOOL)isFloatEqual:(CGFloat)firstValue secondValue:(CGFloat)secondValue{
BOOL isEqual = NO;
NSNumber *firstValueNumber = [NSNumber numberWithDouble:firstValue];
NSNumber *secondValueNumber = [NSNumber numberWithDouble:secondValue];
isEqual = [firstValueNumber isEqualToNumber:secondValueNumber];
return isEqual;
}
I am using the following comparison function to compare a number of decimal places:
bool compare(const double value1, const double value2, const int precision)
{
int64_t magnitude = static_cast<int64_t>(std::pow(10, precision));
int64_t intValue1 = static_cast<int64_t>(value1 * magnitude);
int64_t intValue2 = static_cast<int64_t>(value2 * magnitude);
return intValue1 == intValue2;
}
// Compare 9 decimal places:
if (compare(theView.frame.origin.x, 0, 9)) {
// do important operation
}
I'd say the right thing is to declare each number as an object, and then define three things in that object: 1) an equality operator. 2) a setAcceptableDifference method. 3)the value itself. The equality operator returns true if the absolute difference of two values is less than the value set as acceptable.
You can subclass the object to suit the problem. For example, round bars of metal between 1 and 2 inches might be considered of equal diameter if their diameters differed by less than 0.0001 inches. So you'd call setAcceptableDifference with parameter 0.0001, and then use the equality operator with confidence.

How do you print out an IEEE754 number (without printf)?

For the purposes of this question, I do not have the ability to use printf facilities (I can't tell you why, unfortunately, but let's just assume for now that I know what I'm doing).
For an IEEE754 single precision number, you have the following bits:
SEEE EEEE EFFF FFFF FFFF FFFF FFFF FFFF
where S is the sign, E is the exponent and F is the fraction.
Printing the sign is relatively easy for all cases, as is catching all the special cases like NaN (E == 0xff, F != 0), Inf (E == 0xff, F == 0) and 0 (E == 0, F == 0, considered special just because the exponent bias isn't used in that case).
I have two questions.
The first is how best to turn denormalised numbers (where E == 0, F != 0) into normalised numbers (where 1 <= E <= 0xfe)? I suspect this will be necessary to simplify the answer to the next question (but I could be wrong so feel free to educate me).
The second question is how to print out the normalised numbers. I want to be able to print them out in two ways, exponential like -3.74195E3 and non-exponential like 3741.95. Although, just looking at those two side-by-side, it should be fairly easy to turn the former into the latter by just moving the decimal point around. So let's just concentrate on the exponential form.
I have a vague recollection of an algorithm I used long ago for printing out PI where you used one of the ever-reducing formulae and kept an upper and lower limit on the possibilities, outputting a digit when both limits agreed, and shifting the calculation by a factor of 10 (so when the upper and lower limits were 3.2364 and 3.1234, you could output the 3 and adjust for that in the calculation).
But it's been a long time since I did that so I don't even know if that's a suitable approach to take here. It seems so since the value of each bit is half that of the previous bit when moving through the fractional part (1/2, 1/4, 1/8 and so on).
I would really prefer not to have to go trudging through printf source code unless absolutely necessary so, if anyone can help out with this, I'll be eternally grateful.
If you want to get exact results for every conversion, you'll have to use arbitrary-precision arithmetic, as done in printf() implementations. If you want to get results that are "close," perhaps differing only in their least significant digit(s), then a very simple double-precision based algorithm will suffice: for the integer part, repeatedly divide by ten and append the remainders to form the decimal string (in reverse); for the fractional part, repeatedly multiply by ten and subtract off the integer parts to form the decimal string.
I recently wrote an article about this method: http://www.exploringbinary.com/quick-and-dirty-floating-point-to-decimal-conversion/ . It does not print scientific notation, but that should be trivial to add. The algorithm prints subnormal numbers (the ones I printed came out accurately, but you'd have to do more thorough testing).
Denormalized numbers cannot be turned into normalized numbers of the same floating point type. The equivalent normalized number's exponent will be too small to be represented by the exponent.
To print normalized numbers, one silly way I can think of is to repeatedly multiply by 10 (well, for the fractional part).
The first thing you need to do is convert the exponent to decimal (since presumably that's what you want the output in) using logarithms. You take the fraction of that result and multiply the mantissa by the exp10 of that fraction, and then convert that to decimal characters. From there you just need to insert the decimal point in the appropriate location, shifted by the now-decimal exponent.
There is a paper by G. Steele describing in more details an algorithm which seems based on the same principle as the one you outline. If memory serve, there are time when you are forced to use unbounded precision arithmetic. (I think it is How to print floating-point numbers accurately but citeseer is currently down from here, I can't confirm and google results are polluted by a retrospective paper by the same from 20 years later).

Why does this floating-point loop terminate at 1,000,000?

The answer to this sample homework problem is "1,000,000", but I do not understand why:
What is the output of the following code?
int main(void) {
float k = 1;
while (k != k + 1) {
k = k + 1;
}
printf(ā€œ%gā€, k); // %g means output a floating point variable in decimal
}
If the program runs indefinitely but produces no output, write INFINITE LOOP as the answer to the question. All of the programs compile and run. They may or may not contain serious errors, however. You should assume that int is four bytes. You should assume that float has the equivalent of six decimal digits of precision. You may round your answer off to the nearest power of 10 (e.g., you can say 1,000 instead of 210 (i.e., 1024)).
I do not understand why the loop would ever terminate.
It doesn't run forever for the simple reason that floating point numbers are not perfect.
At some point, k will become big enough so that adding 1 to it will have no effect.
At that point, k will be equal to k+1 and your loop will exit.
Floating point numbers can be differentiated by a single unit only when they're in a certain range.
As an example, let's say you have an integer type with 3 decimal digits of precision for a positive integer and a single-decimal-digit exponent.
With this, you can represent the numbers 0 through 999 perfectly as 000x100 through 999x100 (since 100 is 1):
What happens when you want to represent 1000? You need to use 100x101. This is still represented perfectly.
However, there is no accurate way to represent 1001 with this scheme, the next number you can represent is 101x101 which is 1010.
So, when you add 1 to 1000, you'll get the closest match which is 1000.
The code is using a float variable.
As specified in the question, float has 6 digits of precision, meaning that any digits after the sixth will be inaccurate. Therefore, once you pass a million, the final digit will be inaccurate, so that incrementing it can have no effect.
The output of this program is not specified by the C standard, since the semantics of the float type are not specified. One likely result (what you will get on a platform for which float arithmetic is evaluated in IEEE-754 single precision) is 2^24.
All integers smaller than 2^24 are exactly representable in single precision, so the computation will not stop before that point. The next representable single precision number after 2^24, however, is 2^24 + 2. Since 2^24 + 1 is exactly halfway between that number and 2^24, in the default IEEE-754 rounding mode it rounds to the one whose trailing bit is zero, which is 2^24.
Other likely answers include 2^53 and 2^64. Still other answers are possible. Infinity (the floating-point value) could result on a platform for which the default rounding mode is round up, for example. As others have noted, an infinite loop is also possible on platforms that evaluate floating-point expressions in a wider type (which is the source of all sorts of programmer confusion, but allowed by the C standard).
Actually, on most C compilers, this will run forever (infinite loop), though the precise behavior is implementation defined.
The reason that most compilers will give an infinite loop is that they evaluate all floating point expressions at double precision and only round values back to float (single) precision when storing into a variable. So when the value of k gets to about 2^24, k == k + 1 will still evaluate as false (as a double can hold the value k+1 without rounding), but the k = k + 1 assignment will be a noop, as k+1 needs to be rounded to fit into a float
edit
gcc on x86 gets this infinite loop behavior. Interestingly on x64 it does not, as it uses sse instructions which do the comparison in float precision.

Resources