C floating point exception handling

The manual pages for fenv.h (feraiseexcept and its ilk) are unusually uninformative; the examples I find online (cppreference.com and others) are very minimal.
How are they supposed to be used? In particular, does feraiseexcept(...) always return? What should one return from a function that gets an illegal argument, like sqrt(-1)? I assume NAN.
Should one get and restore the original state in a function? For example, one might compute a starting point for some iterative algorithm; if that computation turns out to be inaccurate, it is of no consequence, as presumably the iteration can converge to full precision.
How should one use this in a program calling e.g. library functions?
(I'd prefer not to have to dive into random math library sources to answer the above.)

The manual pages for fenv.h (feraiseexcept and its ilk) are unusually uninformative; the examples I find online (cppreference.com and others) are very minimal.
<fenv.h> is specified by the C standard. The Unix manual pages are useful only insofar as they extend the C specification. Regard cppreference.com with skepticism; it is known to have errors.
How are they supposed to be used?
Discussing all of <fenv.h> is too broad a question for Stack Overflow.
In particular, does feraiseexcept(...) always return?
Per C 2018 7.6.2.3 2 and footnote 221, feraiseexcept attempts to cause floating-point exceptions as if they had been generated by arithmetic instructions. Thus, if it raises an exception for which a trap is enabled, it will cause a trap.
In contrast, fesetexceptflag merely sets floating-point exception flags without generating actual exceptions. Per 7.6.2.4 2, “… This function does not raise floating-point exceptions, but only sets the state of the flags.”
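A minimal sketch of the difference, assuming no traps are enabled (so feraiseexcept returns normally) and a compiler that honors FENV_ACCESS:

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON   /* required by the standard before touching the
                                 floating-point environment; not all compilers
                                 honor this pragma */

int main(void)
{
    feclearexcept(FE_ALL_EXCEPT);

    /* Acts as if an arithmetic instruction raised the exception: if a trap
       were enabled for FE_DIVBYZERO, this could trap instead of returning. */
    feraiseexcept(FE_DIVBYZERO);
    printf("divbyzero flag set: %d\n", fetestexcept(FE_DIVBYZERO) != 0);

    /* fesetexceptflag only restores previously saved flag state; it does not
       generate exceptions and therefore never traps. */
    fexcept_t saved;
    fegetexceptflag(&saved, FE_ALL_EXCEPT);   /* save the current flags */
    feclearexcept(FE_ALL_EXCEPT);
    fesetexceptflag(&saved, FE_ALL_EXCEPT);   /* put them back silently */
    printf("divbyzero flag after restore: %d\n", fetestexcept(FE_DIVBYZERO) != 0);
    return 0;
}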
What should one return from a function that gets an illegal argument, like sqrt(-1)? I assume NAN.
This is outside the domain of <fenv.h>. sqrt is specified in 7.12, covering <math.h>. Per 7.12.7.5 2, a sqrt domain error occurs if the argument is less than zero. Per 7.12.1, 2, “… On a domain error, the function returns an implementation-defined value…” and it may also set errno and/or raise the “invalid” floating-point exception. Commonly, implementations will return a quiet NaN.
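As an illustration, the C99 math_errhandling macro reports which of those reporting mechanisms an implementation uses; whether sqrt(-1.0) actually yields a quiet NaN is implementation-defined (it commonly does):

#include <errno.h>
#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    errno = 0;
    feclearexcept(FE_ALL_EXCEPT);

    volatile double neg = -1.0;   /* volatile blocks compile-time folding */
    double r = sqrt(neg);         /* domain error */

    if ((math_errhandling & MATH_ERRNO) && errno == EDOM)
        printf("errno was set to EDOM\n");
    if ((math_errhandling & MATH_ERREXCEPT) && fetestexcept(FE_INVALID))
        printf("the \"invalid\" exception was raised\n");
    printf("result is NaN: %d\n", isnan(r));   /* commonly a quiet NaN */
    return 0;
}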
Should one get and restore the original state in a function?
This is up to the person who is specifying or implementing a function. Routines that provide “elementary” mathematical functions, as for the standard math library, may be required to affect the floating-point state only as called for by the ideal function being provided, concealing any internal operations. For example, if log10(100) is implemented by using an engineered polynomial to approximate the logarithm, the evaluation of the polynomial might use arithmetic operations that have very high precision and accuracy, so high that the final result, when rounded to the final format, is exactly two. But the intermediate calculations might involve some operations that were inexact. Thus, the inexact exception would be raised. If the routine merely evaluated the polynomial and returned the exact result, 2, it would have raised the inexact flag. Ideally, the routine would suppress floating-point traps, save the exception flags, do its calculations, restore originally enabled traps, set the proper exception flags if any, and return the result. (Alternately, the routine might be implemented to use only integer arithmetic, avoiding the floating-point issues.)
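A rough sketch of that save/compute/restore pattern; the computation here is a deliberately artificial stand-in, not a real libm routine:

#include <fenv.h>

#pragma STDC FENV_ACCESS ON

/* Returns an exact value while hiding the inexact exception raised by its
   intermediate arithmetic. */
double exact_result_with_hidden_inexact(double x)
{
    fenv_t env;
    feholdexcept(&env);             /* save environment, clear flags, install
                                       non-stop mode (traps disabled) */

    volatile double t = x / 3.0;    /* typically raises FE_INEXACT */
    double result = (t - t) + 2.0;  /* exactly 2.0 */

    feclearexcept(FE_ALL_EXCEPT);   /* discard exceptions from the internal
                                       steps; the ideal operation is exact */
    feupdateenv(&env);              /* restore the caller's traps and re-raise
                                       any exceptions still flagged (none here) */
    return result;
}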
Outside of specialized library routines, floating-point subroutines rarely go to this extent. It is up to each application how it wants its routines to behave.
For example, one might compute a starting point for some iterative algorithm, if that computation turns out inaccurate is of no consequence as presumably the iteration can converge to full precision.
This paragraph is unclear and may be ungrammatical. I am unable to answer it as written.
How should one use this in a program calling e.g. library functions?
This question is unclear, possibly due to the lack of clarity of the preceding paragraph. The antecedent of “this” is unclear.

Related

To what lengths should I go in order to avoid raising `FE_INEXACT` in library code?

I'm creating a library in C that contains common data structures, convenience functions, etc. that is intended for general use. Within, I've implemented a dynamic array, and I've chosen the golden ratio as the growth factor for the reason explained here. However, this necessarily involves multiplication of floating-point numbers, which can cause FE_INEXACT to be raised if they have large significands.
When I implemented it, I was under the impression that, as the library is for general use, floating point exceptions must be avoided if at all possible. I first tried something like
fenv_t fenv;
feholdexcept(&fenv);
// expand dynamic array
feclearexcept(FE_INEXACT);
feupdateenv(&fenv);
but this had such an enormous time cost that it wasn't worth it.
Eventually, I came up with a solution that had negligible time cost. While not avoiding FE_INEXACT entirely, it made it highly unlikely. Namely,
size_t newCapacity = nearbyint((double)(float)PHI * capacity);
This would only raise FE_INEXACT if the current capacity was extremely large, at least for compilers that adhere to IEEE 754 standards.
I'm starting to wonder whether my efforts have gone into solving a relative nonissue. For library code, is it reasonable to expect the user to handle the raising of FE_INEXACT when necessary, or should it be avoided within the library? In the latter case, how important is the issue compared to other factors, such as efficiency?
To what lengths should I go...
None at all. Almost nobody uses fenv.h, compilers do not even fully support it (they make transformations that wrongly disregard or alter the floating point environment), and if someone calling your code is using it, it's completely reasonable to require them to save/restore exception state around calls to your library. Moreover, most of the time if you're doing an operation that raises FE_INEXACT, it's precisely because the result you're going to be returning is inexact, and it's thereby semantically appropriate to be raising it.

How to compare two Math library implementations?

As you know, the C standard library defines several standard function calls that should be implemented by any compliant implementation, e.g., Newlib, MUSL, GLIBC ...
If I am targeting Linux, for example, I have to choose between glibc and MUSL, and the criterion for me is the accuracy of the math library libm. How can I compare two possible implementations of, say, sin() or cos()?
A naive approach would be to compare the output of both implementations on a set of randomly generated inputs against a reference one (from MATLAB, for example), but is there any more reliable/formal/structured/guided way to compare/model the two? I tried to see if there is any research in this direction but could not find any, so any pointers are appreciated.
Some thoughts:
You can use the GNU Multiple Precision Arithmetic Library (GnuMP) to generate good reference results.
It is possible to test most, if not all of the single-argument single-precision (IEEE-754 binary32) routines exhaustively. (For some of the macOS trigonometric routines, such as sinf, we tested one implementation exhaustively, verifying that it returned faithfully rounded results, meaning the result was the mathematical value [if representable] or one of the two adjacent values [if not]. Then, when changing implementations, we compared one to the other. If a new-implementation result was identical to the old-implementation result, it passed. Otherwise, GnuMP was used to test it. Since new implementations largely coincided with old implementations, this resulted in few invocations of GnuMP, so we were able to exhaustively test a new routine implementation in about three minutes, if I recall correctly.)
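For illustration, here is a brute-force sketch of such an exhaustive test, using double-precision sin as a stand-in reference (a real harness would use a correctly rounded multiple-precision reference, e.g. built on GnuMP/MPFR, and would classify faithful rounding more carefully than this rough one-step check):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint64_t tested = 0, suspect = 0;

    for (uint32_t bits = 0; ; ++bits) {
        float x;
        memcpy(&x, &bits, sizeof x);         /* every binary32 bit pattern */

        if (!isnan(x)) {
            float got = sinf(x);
            double ref = sin((double)x);     /* stand-in reference value */

            /* Rough check: flag results more than one binary32 step away
               from the rounded reference. */
            float lo = nextafterf((float)ref, -HUGE_VALF);
            float hi = nextafterf((float)ref, +HUGE_VALF);
            if (got < lo || got > hi)
                ++suspect;
            ++tested;
        }
        if (bits == UINT32_MAX)
            break;
    }

    printf("tested %llu inputs, %llu suspect results\n",
           (unsigned long long)tested, (unsigned long long)suspect);
    return 0;
}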
It is not feasible to test the multiple-argument or double-precision routines exhaustively.
When comparing implementations, you have to choose a metric, or several metrics. A library that has a good worst-case error is good for proofs; its bound can be asserted to hold true for any argument, and that can be used to derive further bounds in subsequent computations. But a library that has a good average error may tend to produce better results for, say, physics simulations that use large arrays of data. For some applications, only the errors in a “normal” domain may be relevant (angles around −2π to +2π), so errors in reducing large arguments (up to around 10^308) may be irrelevant because those arguments are never used.
There are some common points where various routines should be tested. For example, for the trigonometric routines, test at various fractions of π. Aside from being mathematically interesting, these tend to be where implementations switch between approximations, internally. Also test at large numbers that are representable but happen to be very near multiples of simple fractions of π. These are the worst cases for argument reduction and can yield huge relative errors if not done correctly. They require number theory to find. Testing in any sort of scattershot approach, or even orderly approaches that fail to consider this reduction problem, will fail to find these troublesome arguments, so it would be easy to report as accurate a routine that had huge errors.
On the other hand, there are important points to test that cannot be known without internal knowledge of the implementation. For example, when designing a sine routine, I would use the Remez algorithm to find a minimax polynomial, aiming for it to be good from, say, –π/2 to +π/2 (quite large for this sort of thing, but just for example). Then I would look at the arithmetic and rounding errors that could occur during argument reduction. Sometimes they would produce a result a little outside that interval. So I would go back to the minimax polynomial generation and push for a slightly larger interval. And I would also look for improvements in the argument reduction. In the end, I would end up with a reduction guaranteed to produce results within a certain interval and a polynomial known to be good to a certain accuracy within that interval. To test my routine, you then need to know the endpoints of that interval, and you have to be able to find some arguments for which the argument reduction yields points near those endpoints, which means you have to have some idea of how my argument reduction is implemented—how many bits does it use, and such. Like the troublesome arguments mentioned above, these points cannot be found with a scattershot approach. But unlike those above, they cannot be found from pure mathematics; you need information about the implementation. This makes it practically impossible to know you have compared the worst potential arguments for implementations.

Overflow vs Inf

When I enter a number greater than the maximum double in MATLAB, which is approximately 1.79769e+308 (for example 10^309), it returns Inf. For educational purposes, I want to get an overflow exception, as with C compilers that return an overflow error message, rather than Inf. My questions are:
Is Inf an overflow exception?
If it is, why don't C compilers return Inf?
If not, can I get an overflow exception in Matlab?
Is there any difference between Inf and an overflow exception at all?
Also, I don't want to check for Inf in MATLAB and then throw an exception with the error() function.
1) Floating point in C/C++
Operations on floating-point numbers can produce results that are not numerical values. Examples:
the result of an operation is a complex number (think sqrt(-1.0))
the result of an operation is undefined (think 1.0 / 0.0)
the result of an operation is too large to be represented
an operation is performed where one of the operands is already NaN or Inf
The philosophy of IEEE754 is to not trap such exceptions by default, but to produce special values (Inf and NaN), and allow computation to continue normally without interrupting the program. It is up to the user to test for such results and treat them separately (like isinf and isnan functions in MATLAB).
There exist two types of NaN values: NaN (Quiet NaN) and sNaN (Signaling NaN). Normally all arithmetic operations of floating-point numbers will produce the quiet type (not the signaling type) when the operation cannot be successfully completed.
There are (platform-dependent) functions to control the floating-point environment and catch FP exceptions:
Win32 API has _control87() to control the FPU flags.
POSIX/Linux systems typically handle FP exceptions by trapping the SIGFPE signal (see feenableexcept).
SunOS/Solaris has its own functions as well (see chapter 4 in Numerical Computation Guide by Sun/Oracle)
C99/C++11 introduced the fenv header with functions that control the floating-point exception flags.
For instance, check out how Python implements the FP exception control module for different platforms: https://hg.python.org/cpython/file/tip/Modules/fpectlmodule.c
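As a brief sketch of the C99 <fenv.h> approach from the last bullet, checking the overflow flag rather than trapping:

#include <fenv.h>
#include <float.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    feclearexcept(FE_ALL_EXCEPT);

    volatile double big = DBL_MAX;   /* volatile keeps the compiler from
                                        folding the multiplication away */
    double r = big * 2.0;            /* overflows; result is +Inf */

    if (fetestexcept(FE_OVERFLOW))
        printf("overflow exception raised, result = %g\n", r);
    return 0;
}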
2) Integers in C/C++
This is obviously completely different from floating-points, since integer types cannot represent Inf or NaN:
unsigned integers use modular arithmetic (so values wrap around if the result exceeds the largest integer). This means that the result of an unsigned arithmetic operation is always "mathematically defined" and never overflows; see the short C example after this list. Compare this to MATLAB, which uses saturation arithmetic for integers (uint8(200) + uint8(200) will be uint8(255)).
signed integer overflow on the other hand is undefined behavior.
integer division by zero is undefined behavior.
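The wrap-around behavior mentioned above, in a few lines of C:

#include <stdio.h>

int main(void)
{
    unsigned char a = 200, b = 200;
    unsigned char u = a + b;   /* the sum 400 is converted back to unsigned
                                  char modulo 256, giving 144 */
    printf("unsigned char 200 + 200 = %d\n", u);
    /* MATLAB's uint8(200) + uint8(200) would instead saturate to 255. */
    return 0;
}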
Floating Point
MATLAB implements the IEEE Standard 754 for floating point operations.
This standard has five defined exceptions:
1. Invalid Operation
2. Division by Zero
3. Overflow
4. Underflow
5. Inexact
As noted by the GNU C Library, these exceptions are indicated by a status word but do not terminate the program.
Instead, an exception-dependent default value is returned; the value may be an actual number or a special value. Special values in MATLAB are Inf, -Inf, NaN, and -0; these MATLAB symbols are used in place of the official standard's reserved binary representations for readability and usability (a bit of nice syntactic sugar).
Operations on the special values are well-defined and operate in an intuitive way.
With this information in hand, the answers to the questions are:
Inf means that an operation was performed that raised one of the above exceptions (namely, 1, 2, or 3), and Inf was determined to be the default return value.
Depending on how the C program is written, what compiler is being used, and what hardware is present, INFINITY and NaN are special values that can be returned by a C operation. It depends on whether and how the IEEE-754 standard was implemented. C99 includes IEEE-754 support as part of the standard (Annex F), but it is ultimately up to the compiler how the implementation works (this can be complicated by aggressive optimizations and standard options like rounding modes).
A return value of Inf or -Inf indicates that an Overflow exception may have happened, but it could also be an Invalid Operation or Division by Zero. I don't think MATLAB will tell you which it is (though maybe you have access to that information via compiled MEX files, but I'm unfamiliar with those).
See answer 1.
For more fun and in-depth examples, here is a nice PDF.
Integers
Integers do not behave as above in MATLAB.
If an operation on an integer of a specified bit size would exceed the maximum value of that class, the result is set to the maximum value, and likewise it is clamped to the minimum value for negative overflow (if signed).
In other words, MATLAB integers do not wrap.
I'm going to repeat an answer by Jan Simon from the "MATLAB Answers" website:
For stopping (in debugger mode) on division-by-zero, use:
warning on MATLAB:divideByZero
dbstop if warning MATLAB:divideByZero
Similarly for stopping on taking the logarithm of zero:
warning on MATLAB:log:LogOfZero
dbstop if warning MATLAB:log:LogOfZero
and for stopping when an operation (a function call or an assignment) returns either NaN or Inf, use:
dbstop if naninf
Unfortunately the first two warnings seem to be no longer supported, although the last option still works for me on R2014a and is in fact documented.

Is C floating-point non-deterministic?

I have read somewhere that there is a source of non-determinism in C double-precision floating point as follows:
1. The C standard says that 64-bit floats (doubles) are required to produce only about 64-bit accuracy.
2. Hardware may do floating point in 80-bit registers.
3. Because of (1), the C compiler is not required to clear the low-order bits of floating-point registers before stuffing a double into the high-order bits.
4. This means YMMV, i.e. small differences in results can happen.
Is there any now-common combination of hardware and software where this really happens? I see in other threads that .net has this problem, but is C doubles via gcc OK? (e.g. I am testing for convergence of successive approximations based on exact equality)
The behavior on implementations with excess precision, which seems to be the issue you're concerned about, is specified strictly by the standard in most if not all cases. Combined with IEEE 754 (assuming your C implementation follows Annex F) this does not leave room for the kinds of non-determinism you seem to be asking about. In particular, things like x == x (which Mehrdad mentioned in a comment) failing are forbidden since there are rules for when excess precision is kept in an expression and when it is discarded. Explicit casts and assignment to an object are among the operations that drop excess precision and ensure that you're working with the nominal type.
Note however that there are still a lot of broken compilers out there that don't conform to the standards. GCC intentionally disregards them unless you use -std=c99 or -std=c11 (i.e. the "gnu99" and "gnu11" options are intentionally broken in this regard). And prior to GCC 4.5, correct handling of excess precision was not even supported.
This may happen on Intel x86 code that uses the x87 floating-point unit (except probably point 3, which seems bogus: the low-order bits will be set to zero). So the hardware platform is very common, but on the software side use of x87 is dying out in favor of SSE.
Basically, whether a number is represented in 80 or 64 bits is at the whim of the compiler and may change at any point in the code, with, for example, the consequence that a number which just tested as non-zero is now zero.
See "The pitfalls of verifying floating-point computations", page 8ff.
Testing for exact convergence (or equality) in floating point is usually a bad idea, even in a totally deterministic environment. FP is an approximate representation to begin with. It is much safer to test for convergence to within a specified epsilon.
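For example, a convergence test based on a relative tolerance rather than exact equality might look like the following sketch (the tolerance value is an arbitrary choice for illustration):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double tol = 1e-12;      /* relative tolerance, chosen for illustration */
    double x = 1.0, prev;

    do {                           /* Newton iteration for x*x == 2 */
        prev = x;
        x = 0.5 * (x + 2.0 / x);
    } while (fabs(x - prev) > tol * fabs(x));

    printf("converged to %.17g\n", x);
    return 0;
}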

Which is the best approach to handle floating point exceptions in embedded code?

I'm working on a safety-critical, embedded program (in C) where I'd like to use IEEE 754 floating-point arithmetic (with NaNs and Infs) for engineering calculations. Here I have two approaches (AFAIK) to deal with floating-point exceptions:
go to a permanent fault state if any exception occurs. This might be more robust from an error-detection point of view, but it is bad for fault tolerance/availability.
ignore exceptions, and check whether the final results are finite numbers (successful calculation) or NaN/Inf (failed calculation). This solution is more fault tolerant, but it is riskier because outputs might accidentally be excluded from the check.
1. Which would be a better solution in a safety-critical system?
2. Are there other options?
3. If the complexity of the calculations does not allow the first solution (I cannot avoid exceptions in normal usage), are the final checks enough, or are there other aspects I should consider?
Which would be better in a safety-critical system depends on the system and cannot be answered without more information.
Another option is to design the floating-point code so that no undesired behavior is possible (or can be handled as desired) and to write a proof of that.
In general, checking final values is inadequate to detect whether exceptions or other errors occurred during computation.
In regard to 3, consider that various exceptional results can vanish in subsequent operations. Infinity can produce zero when used as a divisor. NaN vanishes in some implementations of minimum or maximum. (E.g., max(3, NaN) may produce 3 rather than NaN.) Analyzing your code might (or might not) reveal whether or not these things are possible in your specific computations.
However, an alternative to checking final values is checking exception flags. Most implementations of IEEE 754 have cumulative flags—once an exception occurs, its flag is raised and remains raised until explicitly reset. So you can clear flags at the start of computations and test them at the end, and, unlike testing final values, this will guarantee that you observe exceptions after they occur.
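A minimal sketch of that clear-then-test pattern; the placeholder computation is contrived so that its final value looks normal (it is 0) even though a divide-by-zero occurred along the way:

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

/* Contrived placeholder: divides by zero internally, yet returns 0.0. */
static double compute(double x)
{
    double t = 1.0 / x;    /* x == 0 raises FE_DIVBYZERO and yields +Inf */
    return 2.0 / t;        /* 2/Inf == 0, so the result looks unremarkable */
}

int main(void)
{
    feclearexcept(FE_ALL_EXCEPT);       /* clear the cumulative flags */

    volatile double zero = 0.0;         /* volatile blocks constant folding */
    double y = compute(zero);           /* y is 0.0; a final-value check
                                           would not notice anything */

    if (fetestexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW))
        printf("an exception occurred during the computation, y = %g\n", y);
    else
        printf("no serious exception, y = %g\n", y);
    return 0;
}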

Resources