long double in fabs, range and overflow errors - c

At wiki.sei.cmu.edu, they claim the following code is error-free for out-of-range floating-point errors during assignment; I've narrowed it down to the long double case:
Compliant Solution (Narrowing Conversion)
This compliant solution checks whether the values to be stored can be represented in the new type:
#include <float.h>
#include <math.h>

void func(double d_a, long double big_d) {
    double d_b;
    // ...
    if (big_d != 0.0 &&
        (isnan(big_d) ||
         isgreater(fabs(big_d), DBL_MAX) ||
         isless(fabs(big_d), DBL_MIN))) {
        /* Handle error */
    } else {
        d_b = (double)big_d;
    }
}
Unless I'm missing something, fabs is declared in both C99 and C11 as double fabs(double x); it takes a double, so this code isn't compliant, and long double fabsl(long double x) should be used instead.
Further, I believe isgreater and isless should then take a long double as their first parameter (since that's what fabsl returns).
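For reference, here is a minimal sketch of what I believe the corrected check would look like, with fabsl so the argument is never narrowed (the structure otherwise follows the CERT example):

#include <float.h>
#include <math.h>

void func(double d_a, long double big_d) {
    double d_b;
    // ...
    if (big_d != 0.0L &&
        (isnan(big_d) ||
         isgreater(fabsl(big_d), (long double)DBL_MAX) ||
         isless(fabsl(big_d), (long double)DBL_MIN))) {
        /* Handle error */
    } else {
        d_b = (double)big_d;
    }
}

isgreater and isless are type-generic macros, so mixed long double/double arguments would also work; the casts just make the common type explicit.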
#include <stdio.h>
#include <math.h>

int main(void)
{
    long double ld = 1.12345e506L;
    printf("%lg\n", fabs(ld));  // UB: ld is outside the range of double (~1e308)
    printf("%Lg\n", fabsl(ld)); // OK
    return 0;
}
On my machine, this produces the following output:
inf
1.12345e+506
along with a warning (GCC):
warning: conversion from 'long double' to 'double' may change value [-Wfloat-conversion]
printf("%lg\n", fabs(ld));
^~
Am I therefore correct in saying their code results in undefined behavior?
On p. 211 of the C99 standard there's a footnote that reads:
Particularly on systems with wide expression evaluation, a <math.h> function might pass arguments
and return values in wider format than the synopsis prototype indicates.
and on some systems long double has the exact same value range, representation, etc. as double, but this doesn't mean the code above is portable.
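A quick probe (just a sketch) shows whether long double actually offers more range than double on a given implementation:

#include <float.h>
#include <stdio.h>

int main(void)
{
    printf("DBL_MAX  = %e\n", DBL_MAX);
    printf("LDBL_MAX = %Le\n", LDBL_MAX);
    printf("same range: %s\n", (long double)DBL_MAX == LDBL_MAX ? "yes" : "no");
    return 0;
}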
Now I have a related question here, and I'd just like to ask for confirmation. (I've read through dozens of questions and answers on the subject, but I'm still a little confused: they often deal with specific examples and specific types, not all of them are sourced, or they're about C++, and asking each of these as a separate, "formal" question on Stack Overflow would be a waste of time.)

According to the C99 and C11 standards, there's a difference between overflow, which occurs during an arithmetic operation, and a range error, which occurs when a value is too large to be represented in a given type. I've provided excerpts from the C99 standard below, and I'd appreciate it if someone could confirm that my interpretation is correct. (I'm aware that certain implementations define what happens when undefined behavior occurs, e.g. as explained here, but that's not what I'm interested in right now.)
for floating-point types, overflow results in some representation of a "large value" (i.e. one of the HUGE_VAL* macros, as per 7.12.1):
A floating result overflows if the magnitude of the mathematical result is finite but so
large that the mathematical result cannot be represented without extraordinary roundoff
error in an object of the specified type. If a floating result overflows and default rounding
is in effect, or if the mathematical result is an exact infinity (for example log(0.0)),
then the function returns the value of the macro HUGE_VAL, HUGE_VALF, or HUGE_VALL according to the return type, with the same sign as the correct value of the
function;
On my system, HUGE_VAL* is defined as INFINITY cast to the appropriate floating-point type.
So this is completely legal, even though the value of HUGE_VAL* is implementation-defined.
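For example (a sketch): exp overflows for an argument this large, and with default rounding the standard requires it to return HUGE_VAL, which on systems where HUGE_VAL is INFINITY prints as inf.

#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%g\n", exp(1000.0)); /* overflow: returns HUGE_VAL, typically "inf" */
    return 0;
}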
for floating-point types, a range error results in undefined behavior (6.3.1.5):
When a double is demoted to float, a long double is demoted to double or
float, or a value being represented in greater precision and range than required by its
semantic type (see 6.3.1.8) is explicitly converted to its semantic type [...]. If the value being converted is outside the range of values that can be represented, the behavior is undefined.

Related

C - ceil/float rounding to int guarantees

I'm wondering if there are any circumstances where code like this will be incorrect due to floating point inaccuracies:
#include <math.h>
// other code ...
float f = /* random but not NAN or INF */;
int i = (int)floorf(f);
// OR
int i = (int)ceilf(f);
Are there any guarantees about these values? If I have a well-formed f (not NAN or INF), will i always be the integer that it rounds to, whichever way that is?
I can imagine a situation where (with a bad spec/implementation) the value you get is the one just below the true result rather than just above or equal to it, yet actually closer; truncating it would then round down to the next lower integer.
It doesn't seem possible to me, given that integers can be represented exactly in IEEE 754 floating point, but I don't know whether float is guaranteed to follow that standard.
The C standard is sloppy in specifying floating-point behavior, so it is technically not completely specified that floorf(f) produces the correct floor of f or that ceilf(f) produces the correct ceiling of f.
Nonetheless, no C implementations I am aware of get this wrong.
If, instead of floorf(some variable), you have floorf(some expression), there are C implementations that may evaluate the expression in diverse ways that will not get the same result as if IEEE-754 arithmetic were used throughout.
If the C implementation defines __STDC_IEC_559__, it should evaluate the expressions using IEEE-754 arithmetic.
Nonetheless, int i = (int)floorf(f); is of course not guaranteed to set i to the floor of f if the floor of f is out of range of int.
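If the range matters, a guarded conversion along these lines avoids the problem (a sketch, assuming a 32-bit int; note that -2^31 is exactly representable as a float, while 2^31 is the first value that is too large):

#include <math.h>

/* Returns 1 and stores the floor in *out if it fits in int, else 0. */
int floor_to_int(float f, int *out)
{
    float r = floorf(f);
    if (!(r >= -2147483648.0f && r < 2147483648.0f))
        return 0;   /* out of range, or NaN */
    *out = (int)r;  /* exact: r is integral and in range */
    return 1;
}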

Dodging the inaccuracy of a floating point number

I totally understand the problems associated with floating points, but I have seen a very interesting behavior that I can't explain.
float x = 1028.25478;
long int y = 102825478;
float z = y/(float)100000.0;
printf("x = %f ", x);
printf("z = %f",z);
The output is:
x = 1028.254761 z = 1028.254780
Now, if floating-point numbers fail to represent that specific value (1028.25478) when it is assigned to variable x, why isn't the same true of variable z?
P.S. I'm using the Pelles C IDE to test the code (C11 compiler).
I am pretty sure that what happens here is that the latter floating-point value is elided and instead kept in a double-precision register, then passed as-is as an argument to printf. The compiler then believes it is safe to pass this number at double precision after the default argument promotions.
I managed to produce a similar result using GCC 7.2.0, with these switches:
-Wall -Werror -ffast-math -m32 -funsafe-math-optimizations -fexcess-precision=fast -O3
The output is
x = 1028.254761 z = 1028.254800
The number is slightly different there^.
The description for -fexcess-precision=fast says:
-fexcess-precision=style
This option allows further control over excess precision on
machines where floating-point operations occur in a format with
more precision or range than the IEEE standard and interchange
floating-point types. By default, -fexcess-precision=fast is in
effect; this means that operations may be carried out in a wider
precision than the types specified in the source if that would
result in faster code, and it is unpredictable when rounding to
the types specified in the source code takes place. When
compiling C, if -fexcess-precision=standard is specified then
excess precision follows the rules specified in ISO C99; in
particular, both casts and assignments cause values to be rounded
to their semantic types (whereas -ffloat-store only affects
assignments). This option [-fexcess-precision=standard] is enabled by default for C if a
strict conformance option such as -std=c99 is used. -ffast-math
enables -fexcess-precision=fast by default regardless of whether
a strict conformance option is used.
This behaviour isn't C11-compliant.
Restricting this to strict IEEE 754 floating point, the answers should be the same.
1028.25478 is actually 1028.2547607421875. That accounts for x.
In the evaluation of y / (float)100000.0, y is converted to float by the usual arithmetic conversions. The closest float to 102825478 is 102825480. IEEE 754 requires a correctly rounded result for division, so z should be 1028.2547607421875: the closest float to the exact quotient 1028.2548.
So my answer is at odds with your observed behaviour. I put that down to your compiler not implementing floating point strictly, or perhaps not implementing IEEE 754.
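A quick check of the representation claims above (a sketch):

#include <stdio.h>

int main(void)
{
    printf("%.13f\n", 1028.25478f);      /* prints 1028.2547607421875 */
    printf("%.1f\n", (float)102825478L); /* prints 102825480.0 */
    return 0;
}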
The code acts as if z were a double and y/(float)100000.0 were y/100000.0.
float x = 1028.25478;
long int y = 102825478;
double z = y/100000.0;
// output
x = 1028.254761 z = 1028.254780
An important consideration is FLT_EVAL_METHOD. This allows select floating point code to evaluate at higher precision.
#include <float.h>
#include <stdio.h>

int main(void)
{
    printf("FLT_EVAL_METHOD %d\n", FLT_EVAL_METHOD);
    return 0;
}
Except for assignment and cast ..., the values yielded by operators with floating operands and values subject to the usual arithmetic conversions and of floating constants are evaluated to a format whose range and precision may be greater than required by the type. The use of evaluation formats is characterized by the implementation-defined value of FLT_EVAL_METHOD.
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the type;
1 evaluate ... type float and double to the range and precision of the double type, evaluate long double ... to the range and precision of the long double type;
2 evaluate all ... to the range and precision of the long double type.
Yet this does not apply here: with float z = y/(float)100000.0;, z should lose all higher precision on the assignment.
I agree with @Antti Haapala that the code is using a speed optimization that has less adherence to the expected rules of floating-point math.
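To illustrate: under a conforming translation (e.g. gcc with -std=c11, which implies -fexcess-precision=standard), both an assignment and a cast must round the quotient to float. A sketch:

#include <stdio.h>

int main(void)
{
    long int y = 102825478;
    float z1 = y / (float)100000.0;           /* assignment rounds to float */
    double z2 = (float)(y / (float)100000.0); /* the cast also rounds to float */
    printf("z1 = %f\nz2 = %f\n", z1, z2);     /* both should print 1028.254761 */
    return 0;
}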

Is operator ≤ UB for floating point comparison? [closed]

There are numerous references on the subject (here or here). However, I still fail to understand why the following is not considered UB and properly reported by my favorite compiler (insert clang and/or gcc) with a neat warning:
// f1, f2 and epsilon are defined as double
if ( f1 / f2 <= epsilon )
As per C99:TC3, 5.2.4.2.2 §8, we have:
Except for assignment and cast (which remove all extra range and
precision), the values of operations with floating operands and values
subject to the usual arithmetic conversions and of floating constants
are evaluated to a format whose range and precision may be greater
than required by the type. [...]
With typical compilation, f1 / f2 would be read directly from the FPU. I've tried here using gcc -m32, with gcc 5.2. So f1 / f2 is (here) in an 80-bit (just a guess, I don't have the exact spec at hand) floating-point register. There is no type promotion here (per the standard).
I've also tested clang 3.5; this compiler seems to cast the result of f1 / f2 back to a normal 64-bit floating-point representation (this is implementation-defined behavior, but for my question I prefer the default gcc behavior).
As per my understanding, the comparison will be done between a type whose size we don't know (i.e. a format whose range and precision may be greater) and epsilon, whose size is exactly 64 bits.
What I really find hard to understand is an equality-style comparison between a well-known C type (e.g. a 64-bit double) and something whose range and precision may be greater. I would have assumed that somewhere in the standard some kind of promotion would be required (e.g. that the standard would mandate that epsilon be promoted to a wider floating-point type).
So the only legitimate syntaxes should instead be:
if ( (double)(f1 / f2) <= epsilon )
or
double res = f1 / f2;
if ( res <= epsilon )
As a side note, I would have expected the literature to document only the operator <; in my case:
if ( f1 / f2 < epsilon )
since it is always possible to compare floating-point values of different sizes using operator <.
So in which cases does the first expression make sense? In other words, how could the standard define some kind of equality operator between two floating-point representations of different sizes?
EDIT: The whole confusion here was that I assumed it was possible to compare two floats of different sizes, which cannot happen. (Thanks @DevSolar!)
<= is well-defined for all possible floating point values.
There is one exception, though: the case when at least one of the arguments is uninitialised. But that's more to do with reading an uninitialised variable being UB, not with <= itself.
I think you're confusing implementation-defined with undefined behavior. The C language doesn't mandate IEEE 754, so all floating point operations are essentially implementation-defined. But this is different from undefined behavior.
After a bit of chat, it became clear where the miscommunication came from.
The quoted part of the standard explicitly allows an implementation to use wider formats for floating operands in calculations. This includes, but is not limited to, using the long double format for double operands.
The standard section in question also does not call this "type promotion". It merely refers to a format being used.
So, f1 / f2 may be done in some arbitrary internal format, but without making the result any other type than double.
So when the result is compared (by either <= or the problematic ==) to epsilon, there is no promotion of epsilon (because the result of the division never got a different type), but by the same rule that allowed f1 / f2 to happen in some wider format, epsilon is allowed to be evaluated in that format as well. It is up to the implementation to do the right thing here.
The value of FLT_EVAL_METHOD might tell you exactly what an implementation is doing (if set to 0, 1, or 2 respectively), or it might have a negative value, which indicates "indeterminable" (-1) or "implementation-defined", which means "look it up in your compiler manual".
This gives an implementation "wiggle room" to do any kind of funny things with floating operands, as long as at least the range / precision of the actual type is preserved. (Some older FPUs had "wobbly" precisions, depending on the kind of floating operation performed. The quoted part of the standard caters for exactly that.)
In no case may any of this lead to undefined behaviour. Implementation-defined, yes. Undefined, no.
The only case where you would get undefined behavior is when a large floating point variable gets demoted to a smaller one which cannot represent the contents. I don't quite see how that applies in this case.
The text you quote is concerned about whether or not floats may be evaluated as doubles etc, as indicated by the text you unfortunately didn't include in the quote:
The use of evaluation formats is characterized by the
implementation-defined value of FLT_EVAL_METHOD:
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the type;
1 evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;
2 evaluate all operations and constants to the range and precision of the long double type.
However, I don't believe this macro overwrites the behavior of the usual arithmetic conversions. The usual arithmetic conversions guarantee that you can never compare two float variables of different size. So I don't see how you could run into undefined behavior here. The only possible issue you would have is performance.
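For illustration, the usual arithmetic conversions bring both operands to a common type before the comparison (a sketch):

#include <stdio.h>

int main(void)
{
    float f = 0.1f;
    double d = 0.1;
    /* f is converted to double first; since (double)0.1f carries float
       rounding error, the two are not equal. */
    printf("%d\n", f == d); /* prints 0 */
    return 0;
}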
In theory, in case FLT_EVAL_METHOD == 2 then your operands could indeed get evaluated as type long double. But please note that if the compiler allows such implicit promotions to larger types, there will be a reason for it.
According to the text you cited, explicit casting will counter this compiler behavior.
In that case, the code if ( (double)(f1 / f2) <= epsilon ) is nonsense. By the time you cast the result of f1 / f2 to double, the calculation has already been carried out in long double. The calculation of result <= epsilon will, however, be carried out on double, since you forced this with the cast.
To avoid long double entirely, you would have to write the code as:
if ( (double)((double)f1 / (double)f2) <= epsilon )
or to increase readability, preferably:
double div = (double)f1 / (double)f2;
if( (double)div <= (double)epsilon )
But again, code like this only makes sense if you know that there will be implicit promotions which you wish to avoid in order to increase performance. In practice, I doubt you'll ever run into that situation, as the compiler is most likely far more capable than the programmer of making such decisions.
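As an illustration of the difference the evaluation format can make, here is a sketch; the exact output depends on FLT_EVAL_METHOD and compiler flags:

#include <stdio.h>

int main(void)
{
    volatile double f1 = 1.0, f2 = 3.0; /* volatile defeats constant folding */
    double epsilon = 1.0 / 3.0;         /* the double nearest to 1/3 */

    /* The quotient may be evaluated in a wider format, in which case it is
       slightly closer to 1/3 than epsilon and the direct test is false. */
    int direct = (f1 / f2 <= epsilon);

    double res = f1 / f2;               /* assignment discards excess precision */
    int stored = (res <= epsilon);

    /* e.g. gcc -m32 -std=c99 on x87 can print "0 1"; SSE builds print "1 1" */
    printf("%d %d\n", direct, stored);
    return 0;
}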

Float overflowed but not when a printf argument in C?

Why is it that when my overflow calculation is an argument of the printf() function, the float does not overflow, but when the calculation is assigned to a separate variable, float_overflowed, which is not an argument of the printf function, I get the expected result of 'inf'?
Why does this happen? What is causing this difference?
The code and results that led me to this question are below.
Here is my code that didn't execute as expected when the calculation is an argument:
float float_overflow;
float_overflow=3.4e38;
printf("This demonstrates floating data type overflow. We should get an \'inf\' value.\n%e*10=%e.\n\n",float_overflow, float_overflow*10); //No overflow?
The result:
This demonstrates floating data type overflow. We should get an 'inf' value.
3.400000e+38*10=3.400000e+39.
And, when the calculation is not an argument:
float float_upperlimit;
float float_overflowed;
float_upperlimit=3.4e38;
float_overflowed=float_upperlimit*10;
printf("This demonstrates floating data type overflow. We should get an \'inf\' value.\n%e*10=%e.\n\n",float_upperlimit, float_overflowed); //for float overflow
and its result:
This demonstrates floating data type overflow. We should get an 'inf' value.
3.400000e+38*10=inf.
Actually the compiler is not constrained to do the arithmetic in float; it might well use double. 5.2.4.2.2 of the current C standard has:
Except for assignment and cast (which remove all extra range and
precision), the values yielded by operators with floating operands and
values subject to the usual arithmetic conversions and of floating
constants are evaluated to a format whose range and precision may be
greater than required by the type. The use of evaluation formats is
characterized by the implementation-defined value of FLT_EVAL_METHOD
So the value is only forced to be float when you assign it. Since printf is a variadic function, any such argument is promoted to double anyhow, so no conversion takes place in case FLT_EVAL_METHOD is 1, that is, when all float arithmetic is carried out in double.
Remember that for the "%e" format (and all other floating-point formatting codes), the argument is actually a double. See e.g. the table in this reference.
That means that when you do the calculation "in-line" as the argument, it does not actually overflow. But when you assign it to the float variable, it does overflow, and the overflowed value is what is then used in the printf call.
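A sketch that makes this visible: a cast removes the extra range and precision (5.2.4.2.2), so on an IEEE-754 (Annex F) implementation the overflow shows up even inside the argument list:

#include <stdio.h>

int main(void)
{
    float f = 3.4e38f;
    printf("%e\n", (float)(f * 10)); /* the cast forces float: prints inf */
    printf("%e\n", f * 10);          /* may print 3.4e+39 if evaluated in double */
    return 0;
}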

Problems casting NAN floats to int

Ignoring why I would want to do this, the IEEE 754 floating-point standard doesn't define the behavior for the following:
float h = NAN;
printf("%x %d\n", (int)h, (int)h);
Gives: 80000000 -2147483648
Basically, regardless of what value of NAN I give, it outputs 80000000 (hex) or -2147483648 (dec). Is there a reason for this and/or is this correct behavior? If so, how come?
The way I'm giving it different values of NaN are here:
How can I manually set the bit value of a float that equates to NaN?
So basically, are there cases where the payload of the NaN affects the output of the cast?
Thanks!
The result of a cast of a floating-point number to an integer is undefined/unspecified for values not in the range of the integer variable (give or take 1, since the value is truncated first).
Clause 6.3.1.4:
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
If the implementation defines __STDC_IEC_559__, then for conversions from a floating-point type to an integer type other than _Bool:
if the floating value is infinite or NaN or if the integral part of the floating value exceeds the range of the integer type, then the "invalid" floating-point exception is raised and the resulting value is unspecified.
(Annex F [normative], point 4.)
If the implementation doesn't define __STDC_IEC_559__, then all bets are off.
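On an implementation that does define __STDC_IEC_559__, the raised exception can be observed with <fenv.h> (a sketch; note that compiler support for the FENV_ACCESS pragma varies):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    #pragma STDC FENV_ACCESS ON
    float h = NAN;
    feclearexcept(FE_ALL_EXCEPT);
    int i = (int)h; /* Annex F: raises "invalid", value is unspecified */
    printf("FE_INVALID raised: %d, i = %d\n", fetestexcept(FE_INVALID) != 0, i);
    return 0;
}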
There is a reason for this behavior, but it is not something you should usually rely on.
As you note, IEEE-754 does not specify what happens when you convert a floating-point NaN to an integer, except that it should raise an invalid operation exception, which your compiler probably ignores. The C standard says the behavior is undefined, which means not only do you not know what integer result you will get, you do not know what your program will do at all; the standard allows the program to abort or get crazy results or do anything. You probably executed this program on an Intel processor, and your compiler probably did the conversion using one of the built-in instructions. Intel specifies instruction behavior very carefully, and the behavior for converting a floating-point NaN to a 32-bit integer is to return 0x80000000, regardless of the payload of the NaN, which is what you observed.
Because Intel specifies the instruction behavior, you can rely on it if you know the instruction used. However, since the compiler does not provide such guarantees to you, you cannot rely on this instruction being used.
First, a NaN is anything not representing a number according to the IEEE standard, so it can be several things. In the compiler I work with there is NAN and -NAN, so it's not about only one value.
Second, every implementation provides an isnan family of functions/macros to test for this case, so the programmer doesn't have to deal with the bits himself. To summarize, I don't think peeking at the value makes any difference. You might peek at the value to see its IEEE construction (sign, mantissa, and exponent), but, again, each compiler gives its own functions (or, better said, a library) to deal with it.
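For instance (a sketch):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float a = NAN, b = -NAN;
    /* isnan is a type-generic macro from <math.h>; it recognizes a NaN
       regardless of its sign or payload. */
    printf("%d %d\n", isnan(a) != 0, isnan(b) != 0); /* prints 1 1 */
    return 0;
}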
I do have more to say about your testing, however.
float h = NAN;
printf("%x %d\n", (int)h, (int)h);
The cast you did truncates the float when converting it to an int. If you want the integer whose bits are those of the float, do the following:
printf("%x %d\n", *(int *)&h, *(int *)&h);
That is, you take the address of the float, then treat it as a pointer to int, and finally take the int value it points to. This way the bit representation is preserved.
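Note, though, that the pointer cast formally violates the effective-type ("strict aliasing") rules; a memcpy-based version (a sketch, assuming a 32-bit float) is well-defined:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float h = NAN;
    uint32_t bits;
    memcpy(&bits, &h, sizeof bits); /* copies the bit pattern unchanged */
    printf("%08x\n", (unsigned)bits);
    return 0;
}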
