Correct strtod implementation? - c

Simple question: what is the correct bit-representation of the number 1.15507e-173, in double precision?
Full question: how does one determine the correct parsing of this number?
Background: my question follows from this answer which shows two different bit-representations from three different parsers, namely
0x1c06dace8bda0ee0
and
0x1c06dace8bda0edf
and I'm wondering which parser has got it right.
Update Section 6.4.4.2 of the C99 specification says that for the C parser,
"...the result is either the nearest representable value, or the larger
or smaller representable value immediately adjacent to the nearest
representable value, chosen in an implementation-defined manner."
This implies that the parsed number need not be the nearest, nor even one of the two adjacent representable numbers. The same spec in 7.20.1.3 says that strtod() behaves essentially the same way as the built-in parser. Thanks to the answerers who pointed this out.
Also see this answer to a similar question, and this blog.

:= num1 = ImportString["\.1c\.06\.da\.ce\.8b\.da\.0e\.e0", "Real64", ByteOrdering->1] // First;
:= num2 = ImportString["\.1c\.06\.da\.ce\.8b\.da\.0e\.df", "Real64", ByteOrdering->1] // First;
:= SetPrecision[num1, Infinity]-numOr //N
:= numOr = SetPrecision[1.15507, Infinity] * 10^-173;
-190
= -6.65645 10
:= SetPrecision[num2, Infinity]-numOr //N
-189
= -2.46118 10
Given that both deviate for the same side, it follows that the correct representation is the first one.

Related

How should I obtain the fractional part of a floating-point value?

I have a variable x of type float, and I need its fractional part. I know I can get it with
x - floorf(x), or
fmodf(x, 1.0f)
My questions: Is one of these always preferable to the other? Are they effectively the same? Is there a third alternative I might consider?
Notes:
If the answer depends on the processor I'm using, let's make it x86_64, and if you can elaborate about other processors that would be nice.
Please make sure and refer to the behavior on negative values of x. I don't mind this behavior or that, but I need to know what the behavior is.
Is there a third alternative I might consider?
There's the dedicated function for it. modff exists to decompose a number into its integral and fractional parts.
float modff( float arg, float* iptr );
Decomposes given floating point value arg into integral and fractional
parts, each having the same type and sign as arg. The integral part
(in floating-point format) is stored in the object pointed to by iptr.
I'd say that x - floorf(x) is pretty good (exact), except in corner cases
it has the wrong sign bit for negative zero or any other negative whole float
(we might expect the fraction part to wear the same sign bit).
it does not work that well with inf
modff does respect -0.0 sign bit for both int and frac part, and answer +/-0.0 for +/-inf fraction part - at least if implementation supports the IEC 60559 standard (IEEE 754).
A rationale for inf could be: since every float greater than 2^precision has a null fraction part, then it must be true for infinite float too.
That's minor, but nonetheless different.
EDIT Err, of course as pointed by #StoryTeller-UnslanderMonica the most obvious flaw of x - floor(x) is for the case of negative floating point with a fraction part, because applied to -2.25, it would return +0.75 for example, which is not what we expect...
Since c99 label is used, x - truncf(x) would be more correct, but still suffer from the minor problems onto which I initially focused.

Are there floating point infinitesimals?

Since there are finitely many floating point numbers and one can compare each possible pair of such numbers (I assume), there must always exist a number 'b' which is
smaller than some given number 'a' (not +/- infinity) and
there exists no number 'c' smaller than 'a' and greater than 'b';
i.e. the 'next' smaller floating-point-represented number. I wonder if:
there is a function smaller(float a) returning such number b (or greater(float a) for that matter) in the C programming language
if not, then if there is a way to obtain these 'next' numbers for certain types of numbers 'a', for example if 'a' is an integer/zero.
Trying
float smaller(float a) return a - 0.00...001f;
seems to me like a hack that probably doesn't work for all possible inputs, but I might be wrong, so that's why I'm turning to you guys. Any help is appretiated.
Indeed there is. You're after the "nextafter" family of functions.
These can be used to move from one floating point number to the next, much in the same way as you can use ++ and -- for integral types.
See https://en.cppreference.com/w/c/numeric/math/nextafter
(This is C documentation).
The C99/POSIX functions nextafter/nexttoward can do this. You provide a start value x and a destination value y, and they return the next value from the start in the direction of the destination.
Also, if your language does not have the nextafter family of functions, but does let you treat values stored in memory as integers (by pointer casting or other, dirtier tricks), then, for any floating-point type (double, float, half, ...) that conforms to IEEE 754, if you want to find the next larger number than value, you can do
FLOATING value = ...;
if (value >= 0) {
integer_increment(value);
} else {
integer_decrement(value);
}
and vice versa for the next smaller number, where integer_increment increments the value of value as if value was of an integral type.

Compiler does not recognise matching float values [duplicate]

I know UIKit uses CGFloat because of the resolution independent coordinate system.
But every time I want to check if for example frame.origin.x is 0 it makes me feel sick:
if (theView.frame.origin.x == 0) {
// do important operation
}
Isn't CGFloat vulnerable to false positives when comparing with ==, <=, >=, <, >?
It is a floating point and they have unprecision problems: 0.0000000000041 for example.
Is Objective-C handling this internally when comparing or can it happen that a origin.x which reads as zero does not compare to 0 as true?
First of all, floating point values are not "random" in their behavior. Exact comparison can and does make sense in plenty of real-world usages. But if you're going to use floating point you need to be aware of how it works. Erring on the side of assuming floating point works like real numbers will get you code that quickly breaks. Erring on the side of assuming floating point results have large random fuzz associated with them (like most of the answers here suggest) will get you code that appears to work at first but ends up having large-magnitude errors and broken corner cases.
First of all, if you want to program with floating point, you should read this:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
Yes, read all of it. If that's too much of a burden, you should use integers/fixed point for your calculations until you have time to read it. :-)
Now, with that said, the biggest issues with exact floating point comparisons come down to:
The fact that lots of values you may write in the source, or read in with scanf or strtod, do not exist as floating point values and get silently converted to the nearest approximation. This is what demon9733's answer was talking about.
The fact that many results get rounded due to not having enough precision to represent the actual result. An easy example where you can see this is adding x = 0x1fffffe and y = 1 as floats. Here, x has 24 bits of precision in the mantissa (ok) and y has just 1 bit, but when you add them, their bits are not in overlapping places, and the result would need 25 bits of precision. Instead, it gets rounded (to 0x2000000 in the default rounding mode).
The fact that many results get rounded due to needing infinitely many places for the correct value. This includes both rational results like 1/3 (which you're familiar with from decimal where it takes infinitely many places) but also 1/10 (which also takes infinitely many places in binary, since 5 is not a power of 2), as well as irrational results like the square root of anything that's not a perfect square.
Double rounding. On some systems (particularly x86), floating point expressions are evaluated in higher precision than their nominal types. This means that when one of the above types of rounding happens, you'll get two rounding steps, first a rounding of the result to the higher-precision type, then a rounding to the final type. As an example, consider what happens in decimal if you round 1.49 to an integer (1), versus what happens if you first round it to one decimal place (1.5) then round that result to an integer (2). This is actually one of the nastiest areas to deal with in floating point, since the behaviour of the compiler (especially for buggy, non-conforming compilers like GCC) is unpredictable.
Transcendental functions (trig, exp, log, etc.) are not specified to have correctly rounded results; the result is just specified to be correct within one unit in the last place of precision (usually referred to as 1ulp).
When you're writing floating point code, you need to keep in mind what you're doing with the numbers that could cause the results to be inexact, and make comparisons accordingly. Often times it will make sense to compare with an "epsilon", but that epsilon should be based on the magnitude of the numbers you are comparing, not an absolute constant. (In cases where an absolute constant epsilon would work, that's strongly indicative that fixed point, not floating point, is the right tool for the job!)
Edit: In particular, a magnitude-relative epsilon check should look something like:
if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y))
Where FLT_EPSILON is the constant from float.h (replace it with DBL_EPSILON fordoubles or LDBL_EPSILON for long doubles) and K is a constant you choose such that the accumulated error of your computations is definitely bounded by K units in the last place (and if you're not sure you got the error bound calculation right, make K a few times bigger than what your calculations say it should be).
Finally, note that if you use this, some special care may be needed near zero, since FLT_EPSILON does not make sense for denormals. A quick fix would be to make it:
if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y) || fabs(x-y) < FLT_MIN)
and likewise substitute DBL_MIN if using doubles.
Since 0 is exactly representable as an IEEE754 floating-point number (or using any other implementation of f-p numbers I've ever worked with) comparison with 0 is probably safe. You might get bitten, however, if your program computes a value (such as theView.frame.origin.x) which you have reason to believe ought to be 0 but which your computation cannot guarantee to be 0.
To clarify a little, a computation such as :
areal = 0.0
will (unless your language or system is broken) create a value such that (areal==0.0) returns true but another computation such as
areal = 1.386 - 2.1*(0.66)
may not.
If you can assure yourself that your computations produce values which are 0 (and not just that they produce values which ought to be 0) then you can go ahead and compare f-p values with 0. If you can't assure yourself to the required degree, best stick to the usual approach of 'toleranced equality'.
In the worst cases the careless comparison of f-p values can be extremely dangerous: think avionics, weapons-guidance, power-plant operations, vehicle navigation, almost any application in which computation meets the real world.
For Angry Birds, not so dangerous.
I want to give a bit of a different answer than the others. They are great for answering your question as stated but probably not for what you need to know or what your real problem is.
Floating point in graphics is fine! But there is almost no need to ever compare floats directly. Why would you need to do that? Graphics uses floats to define intervals. And comparing if a float is within an interval also defined by floats is always well defined and merely needs to be consistent, not accurate or precise! As long as a pixel (which is also an interval!) can be assigned that's all graphics needs.
So if you want to test if your point is outside a [0..width[ range this is just fine. Just make sure you define inclusion consistently. For example always define inside is (x>=0 && x < width). The same goes for intersection or hit tests.
However, if you are abusing a graphics coordinate as some kind of flag, like for example to see if a window is docked or not, you should not do this. Use a boolean flag that is separate from the graphics presentation layer instead.
Comparing to zero can be a safe operation, as long as the zero wasn't a calculated value (as noted in an above answer). The reason for this is that zero is a perfectly representable number in floating point.
Talking perfectly representable values, you get 24 bits of range in a power-of-two notion (single precision). So 1, 2, 4 are perfectly representable, as are .5, .25, and .125. As long as all your important bits are in 24-bits, you are golden. So 10.625 can be repsented precisely.
This is great, but will quickly fall apart under pressure. Two scenarios spring to mind:
1) When a calculation is involved. Don't trust that sqrt(3)*sqrt(3) == 3. It just won't be that way. And it probably won't be within an epsilon, as some of the other answers suggest.
2) When any non-power-of-2 (NPOT) is involved. So it may sound odd, but 0.1 is an infinite series in binary and therefore any calculation involving a number like this will be imprecise from the start.
(Oh and the original question mentioned comparisons to zero. Don't forget that -0.0 is also a perfectly valid floating-point value.)
[The 'right answer' glosses over selecting K. Selecting K ends up being just as ad-hoc as selecting VISIBLE_SHIFT but selecting K is less obvious because unlike VISIBLE_SHIFT it is not grounded on any display property. Thus pick your poison - select K or select VISIBLE_SHIFT. This answer advocates selecting VISIBLE_SHIFT and then demonstrates the difficulty in selecting K]
Precisely because of round errors, you should not use comparison of 'exact' values for logical operations. In your specific case of a position on a visual display, it can't possibly matter if the position is 0.0 or 0.0000000003 - the difference is invisible to the eye. So your logic should be something like:
#define VISIBLE_SHIFT 0.0001 // for example
if (fabs(theView.frame.origin.x) < VISIBLE_SHIFT) { /* ... */ }
However, in the end, 'invisible to the eye' will depend on your display properties. If you can upper bound the display (you should be able to); then choose VISIBLE_SHIFT to be a fraction of that upper bound.
Now, the 'right answer' rests upon K so let's explore picking K. The 'right answer' above says:
K is a constant you choose such that the accumulated error of your
computations is definitely bounded by K units in the last place (and
if you're not sure you got the error bound calculation right, make K a
few times bigger than what your calculations say it should be)
So we need K. If getting K is more difficult, less intuitive than selecting my VISIBLE_SHIFT then you'll decide what works for you. To find K we are going to write a test program that looks at a bunch of K values so we can see how it behaves. Ought to be obvious how to choose K, if the 'right answer' is usable. No?
We are going to use, as the 'right answer' details:
if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
Let's just try all values of K:
#include <math.h>
#include <float.h>
#include <stdio.h>
void main (void)
{
double x = 1e-13;
double y = 0.0;
double K = 1e22;
int i = 0;
for (; i < 32; i++, K = K/10.0)
{
printf ("K:%40.16lf -> ", K);
if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
printf ("YES\n");
else
printf ("NO\n");
}
}
ebg#ebg$ gcc -o test test.c
ebg#ebg$ ./test
K:10000000000000000000000.0000000000000000 -> YES
K: 1000000000000000000000.0000000000000000 -> YES
K: 100000000000000000000.0000000000000000 -> YES
K: 10000000000000000000.0000000000000000 -> YES
K: 1000000000000000000.0000000000000000 -> YES
K: 100000000000000000.0000000000000000 -> YES
K: 10000000000000000.0000000000000000 -> YES
K: 1000000000000000.0000000000000000 -> NO
K: 100000000000000.0000000000000000 -> NO
K: 10000000000000.0000000000000000 -> NO
K: 1000000000000.0000000000000000 -> NO
K: 100000000000.0000000000000000 -> NO
K: 10000000000.0000000000000000 -> NO
K: 1000000000.0000000000000000 -> NO
K: 100000000.0000000000000000 -> NO
K: 10000000.0000000000000000 -> NO
K: 1000000.0000000000000000 -> NO
K: 100000.0000000000000000 -> NO
K: 10000.0000000000000000 -> NO
K: 1000.0000000000000000 -> NO
K: 100.0000000000000000 -> NO
K: 10.0000000000000000 -> NO
K: 1.0000000000000000 -> NO
K: 0.1000000000000000 -> NO
K: 0.0100000000000000 -> NO
K: 0.0010000000000000 -> NO
K: 0.0001000000000000 -> NO
K: 0.0000100000000000 -> NO
K: 0.0000010000000000 -> NO
K: 0.0000001000000000 -> NO
K: 0.0000000100000000 -> NO
K: 0.0000000010000000 -> NO
Ah, so K should be 1e16 or larger if I want 1e-13 to be 'zero'.
So, I'd say you have two options:
Do a simple epsilon computation using your engineering judgement for the value of 'epsilon', as I've suggested. If you are doing graphics and 'zero' is meant to be a 'visible change' than examine your visual assets (images, etc) and judge what epsilon can be.
Don't attempt any floating point computations until you've read the non-cargo-cult answer's reference (and gotten your Ph.D in the process) and then use your non-intuitive judgement to select K.
The correct question: how does one compare points in Cocoa Touch?
The correct answer: CGPointEqualToPoint().
A different question: Are two calculated values are the same?
The answer posted here: They are not.
How to check if they are close? If you want to check if they are close, then don't use CGPointEqualToPoint(). But, don't check to see if they are close. Do something that makes sense in the real world, like checking to see if a point is beyond a line or if a point is inside a sphere.
The last time I checked the C standard, there was no requirement for floating point operations on doubles (64 bits total, 53 bit mantissa) to be accurate to more than that precision. However, some hardware might do the operations in registers of greater precision, and the requirement was interpreted to mean no requirement to clear lower order bits (beyond the precision of the numbers being loaded into the registers). So you could get unexpected results of comparisons like this depending on what was left over in the registers from whoever slept there last.
That said, and despite my efforts to expunge it whenever I see it, the outfit where I work has lots of C code that is compiled using gcc and run on linux, and we have not noticed any of these unexpected results in a very long time. I have no idea whether this is because gcc is clearing the low-order bits for us, the 80-bit registers are not used for these operations on modern computers, the standard has been changed, or what. I'd like to know if anyone can quote chapter and verse.
You can use such code for compare float with zero:
if ((int)(theView.frame.origin.x * 100) == 0) {
// do important operation
}
This will compare with 0.1 accuracy, that enough for CGFloat in this case.
Another issue that may need to be kept in mind is that different implementations do things differently. One example of this that I am very familiar with is the FP units on the Sony Playstation 2. They have significant discrepancies when compared to the IEEE FP hardware in any X86 device. The cited article mentions the complete lack of support for inf and NaN, and it gets worse.
Less well known is what I came to know as the "one bit multiply" error. For certain values of float x:
y = x * 1.0;
assert(y == x);
would fail the assert. In the general case, sometimes, but not always, the result of a FP multiply on the Playstation 2 had a mantissa that was a single bit less than the equivalent IEEE mantissa.
My point being that you should not assume that porting FP code from one platform to another will produce the same results. Any given platform is internally consistent, in that results don't change on that platform, it's just that they may not agree with a different platform. E.g. CPython on X86 uses 64 bit doubles to represent floats, while CircuitPython on a Cortex MO has to use software FP, and only uses 32 bit floats. Needless to say that will introduce discrepancies.
A quote I learned over 40 years ago is as true today as the day I learned it. "Doing floating point maths on a computer is like moving a pile of sand. Every time you do anything, you leave a little sand behind and pick up a little dirt."
Playstation is a registered trademark of Sony Corporation.
-(BOOL)isFloatEqual:(CGFloat)firstValue secondValue:(CGFloat)secondValue{
BOOL isEqual = NO;
NSNumber *firstValueNumber = [NSNumber numberWithDouble:firstValue];
NSNumber *secondValueNumber = [NSNumber numberWithDouble:secondValue];
isEqual = [firstValueNumber isEqualToNumber:secondValueNumber];
return isEqual;
}
I am using the following comparison function to compare a number of decimal places:
bool compare(const double value1, const double value2, const int precision)
{
int64_t magnitude = static_cast<int64_t>(std::pow(10, precision));
int64_t intValue1 = static_cast<int64_t>(value1 * magnitude);
int64_t intValue2 = static_cast<int64_t>(value2 * magnitude);
return intValue1 == intValue2;
}
// Compare 9 decimal places:
if (compare(theView.frame.origin.x, 0, 9)) {
// do important operation
}
I'd say the right thing is to declare each number as an object, and then define three things in that object: 1) an equality operator. 2) a setAcceptableDifference method. 3)the value itself. The equality operator returns true if the absolute difference of two values is less than the value set as acceptable.
You can subclass the object to suit the problem. For example, round bars of metal between 1 and 2 inches might be considered of equal diameter if their diameters differed by less than 0.0001 inches. So you'd call setAcceptableDifference with parameter 0.0001, and then use the equality operator with confidence.

Float data without fraction part [duplicate]

This question already has answers here:
Why isn't "0f" treated as a floating point literal in C++?
(6 answers)
Closed 8 years ago.
The make difference between double and float data, we add f. But when i tried to write:
float Value = 255f;
The compiler diplays the following error.:
line 50: error (dcc:1633): parse error near 'f'
line 50: error (dcc:1206): syntax error
line 50: fatal error (dcc:1340): can't recover from earlier errors
Why ?
According to draft n1570, §6.4.4.2, paragraph ¶2
Description
A floating constant has a significand part that may be followed by an exponent part and a
suffix that specifies its type. The components of the significand part may include a digit
sequence representing the whole-number part, followed by a period (.), followed by a
digit sequence representing the fraction part. The components of the exponent part are an
e, E, p, or P followed by an exponent consisting of an optionally signed digit sequence.
Either the whole-number part or the fraction part has to be present; for decimal floating
constants, either the period or the exponent part has to be present.
I made the relevant part bold, so you can see why it doesn't work.
Note that this also implies that
float value = 255e0f;
works.
You need a period in addition, in order the compiler to accept it:
float Value = 255.f;
It's part of the standard specification but I guess it was chosen this way in order to simplify the implementation of the lexical analyzer and also improve the readability of the number, since it's easy to think it's a hex number otherwise.
My guess would be that you are missing a period in your code.
float Value = 255.f;
Add a period before f and you should be good to go. If this doesn't work, please tell us what Error you're receiving.

In C, is specifying 2.0f the same as 2.000000f?

Are these lines the same?
float a = 2.0f;
and
float a = 2.000000f;
Yes, it is. No matter what representation you use, when the code is compiled, the number will be converted to a unique binary representation. There's only one way of representing 2 in the IEEE 754 binary32 standard used in modern computers to represent float numbers.
The only thing the C99 standard has to say on the matter is this (section 6.4.4.2):
For decimal floating constants ... the result is either
the nearest representable value, or the larger or smaller representable value immediately
adjacent to the nearest representable value, chosen in an implementation-defined manner.
That bit about "implementation-defined" means that technically an implementation could choose to do something different in each case. Although in practice, nothing weird is going to happen for a value like 2.
It's important to bear in mind that the C standards don't require IEEE-754.
Yes, they are the same.
Simple check:
http://codepad.org/FOQsufB4
int main() {
printf("%d",2.0f == 2.000000f);
}
^ Will output 1 (true)
Yes Sure it is the same extra zeros on the right are ignored just likes zeros on the left

Resources