When I enter a number greater than the maximum double in MATLAB, which is approximately 1.79769e+308 (for example 10^309), it returns Inf. For educational purposes, I want to get an overflow exception like C compilers, which report an overflow error message, instead of Inf. My questions are:
Is Inf an overflow exception?
If it is, why don't C compilers return Inf?
If not, can I get an overflow exception in Matlab?
Is there any difference between Inf and an overflow exception at all?
Also, I don't want to check for Inf in MATLAB and then throw an exception with the error() function.
1) Floating-points in C/C++
Operations on floating-point numbers can produce results that are not numerical values. Examples:
the result of an operation is a complex number (think sqrt(-1.0))
the result of an operation is undefined (think 1.0 / 0.0)
the result of an operation is too large to be represented
an operation is performed where one of the operands is already NaN or Inf
The philosophy of IEEE754 is to not trap such exceptions by default, but to produce special values (Inf and NaN), and allow computation to continue normally without interrupting the program. It is up to the user to test for such results and treat them separately (like isinf and isnan functions in MATLAB).
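As a minimal C sketch of this non-stop default (assuming IEEE 754 doubles and the isinf/isnan macros from <math.h>), the program below produces Inf and NaN and then simply tests for them afterwards, much as one would with isinf and isnan in MATLAB:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double huge = 1e308;
    double zero = 0.0;
    double too_big = huge * 10.0;     /* overflow: result is +Inf         */
    double undefined = zero / zero;   /* invalid operation: result is NaN */
    /* Execution continues normally; the special values are tested later. */
    printf("isinf: %d, isnan: %d\n", isinf(too_big) != 0, isnan(undefined) != 0);
    return 0;
}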
There exist two types of NaN values: NaN (Quiet NaN) and sNaN (Signaling NaN). Normally all arithmetic operations of floating-point numbers will produce the quiet type (not the signaling type) when the operation cannot be successfully completed.
There are (platform-dependent) functions to control the floating-point environment and catch FP exceptions:
Win32 API has _control87() to control the FPU flags.
POSIX/Linux systems typically handle FP exceptions by trapping the SIGFPE signal (see feenableexcept).
SunOS/Solaris has its own functions as well (see chapter 4 in Numerical Computation Guide by Sun/Oracle)
C99/C++11 introduced the fenv header with functions that control the floating-point exception flags.
For instance, check out how Python implements the FP exception control module for different platforms: https://hg.python.org/cpython/file/tip/Modules/fpectlmodule.c
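As a rough illustration of the C99 <fenv.h> flag interface mentioned above (a sketch only: strictly conforming code would also use #pragma STDC FENV_ACCESS ON, and how reliably the flag is raised depends on the compiler and optimization level):

#include <fenv.h>
#include <float.h>
#include <stdio.h>

int main(void)
{
    feclearexcept(FE_ALL_EXCEPT);   /* start with a clean set of flags */
    volatile double x = DBL_MAX;
    double y = x * 2.0;             /* too large for double: sets FE_OVERFLOW, yields +Inf */
    if (fetestexcept(FE_OVERFLOW))
        printf("overflow flag set, result = %g\n", y);
    return 0;
}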
2) Integers in C/C++
This is obviously completely different from floating-point types, since integer types cannot represent Inf or NaN:
unsigned integers use modular arithmetic (so values wrap around if the result exceeds the largest representable integer). This means that the result of an unsigned arithmetic operation is always "mathematically defined" and never overflows. Compare this to MATLAB, which uses saturation arithmetic for integers (uint8(200) + uint8(200) will be uint8(255)); a short C sketch of the wrap-around behavior follows this list.
signed integer overflow on the other hand is undefined behavior.
integer division by zero is undefined behavior.
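As promised above, a short C sketch of the unsigned wrap-around (contrast this with MATLAB's saturating uint8 arithmetic):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t a = 200, b = 200;
    uint8_t c = (uint8_t)(a + b);   /* 400 mod 256 = 144: modular, never overflows */
    printf("%u\n", (unsigned)c);    /* prints 144; MATLAB's uint8(200) + uint8(200) saturates to 255 */
    return 0;
}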
Floating Point
MATLAB implements the IEEE Standard 754 for floating point operations.
This standard has five defined exceptions:
1. Invalid Operation
2. Division by Zero
3. Overflow
4. Underflow
5. Inexact
As noted by the GNU C Library, these exceptions are indicated by a status word but do not terminate the program.
Instead, an exception-dependent default value is returned; the value may be an actual number or a special value. Special values in MATLAB are Inf, -Inf, NaN, and -0; these MATLAB symbols are used in place of the official standard's reserved binary representations for readability and usability (a bit of nice syntactic sugar).
Operations on the special values are well-defined and operate in an intuitive way.
With this information in hand, the answers to the questions are:
Inf means that an operation was performed that raised one of the above exceptions (namely, 1, 2, or 3), and Inf was determined to be the default return value.
Depending on how the C program is written, what compiler is used, and what hardware is present, INFINITY and NaN are special values that can be returned by a C operation. It depends on if and how the IEEE-754 standard was implemented. C99 includes an IEEE-754 implementation as part of the standard (Annex F), but it is ultimately up to the compiler how the implementation works (this can be complicated by aggressive optimizations and standard options like rounding modes).
A return value of Inf or -Inf indicates that an Overflow exception may have happened, but it could also be an Invalid Operation or Division by Zero. I don't think MATLAB will tell you which it is (though maybe you have access to that information via compiled MEX files, but I'm unfamiliar with those).
See answer 1.
For more fun and in-depth examples, here is a nice PDF.
Integers
Integers do not behave as above in MATLAB.
If an operation on an integer of a specified bit size would exceed the maximum value of that class, the result is set (saturated) to the maximum value, and likewise clamped to the minimum value for negative results (if signed).
In other words, MATLAB integers do not wrap.
I'm going to repeat an answer by Jan Simon from the "MATLAB Answers" website:
For stopping (in debugger mode) on division-by-zero, use:
warning on MATLAB:divideByZero
dbstop if warning MATLAB:divideByZero
Similarly for stopping on taking the logarithm of zero:
warning on MATLAB:log:LogOfZero
dbstop if warning MATLAB:log:LogOfZero
and for stopping when an operation (a function call or an assignment) returns either NaN or Inf, use:
dbstop if naninf
Unfortunately, the first two warnings seem to no longer be supported, although the last option still works for me on R2014a and is in fact documented.
Related
The manual pages for fenv.h (feraiseexcept and its ilk) are unusually uninformative; the examples I find online (cppreference.com and others) are very minimal.
How are they supposed to be used? In particular, does feraiseexcept(...) always return? What should one return from a function that gets an illegal argument, like sqrt(-1)? I assume NAN.
Should one get and restore the original state in a function? For example, one might compute a starting point for some iterative algorithm; if that computation turns out to be inaccurate, it is of no consequence, as presumably the iteration can converge to full precision.
How should one use this in a program calling e.g. library functions?
(I'd prefer not to have to dive into random math library sources to answer the above.)
The manual pages for fenv.h (feraiseexcept and its ilk) are unusually uninformative; the examples I find online (cppreference.com and others) are very minimal.
<fenv.h> is specified by the C standard. The Unix manual pages are useful only insofar as they extend the C specification. Regard cppreference.com with skepticism; it is known to have errors.
How are they supposed to be used?
Discussing all of <fenv.h> is too broad a question for Stack Overflow.
In particular, does feraiseexcept(...) always return?
Per C 2018 7.6.2.3 2 and footnote 221, feraiseexcept attempts to cause floating-point exceptions as if they had been generated by arithmetic instructions. Thus, if it raises an exception for which a trap is enabled, it will cause a trap.
In contrast, fesetexceptflag merely sets floating-point exception flags without generating actual exceptions. Per 7.6.2.4 2, “… This function does not raise floating-point exceptions, but only sets the state of the flags.”
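A small sketch of that difference (assuming no floating-point traps are enabled, so feraiseexcept simply sets the flag and returns):

#include <fenv.h>
#include <stdio.h>

int main(void)
{
    /* feraiseexcept acts as if an arithmetic instruction raised the
       exception; with a trap enabled for FE_DIVBYZERO it would trap here. */
    feclearexcept(FE_ALL_EXCEPT);
    feraiseexcept(FE_DIVBYZERO);
    printf("divide-by-zero flag: %d\n", fetestexcept(FE_DIVBYZERO) != 0);

    /* fesetexceptflag merely restores previously saved flag state; it never
       generates an exception and therefore never traps. */
    fexcept_t saved;
    fegetexceptflag(&saved, FE_ALL_EXCEPT);
    feclearexcept(FE_ALL_EXCEPT);
    fesetexceptflag(&saved, FE_ALL_EXCEPT);
    printf("flag after silent restore: %d\n", fetestexcept(FE_DIVBYZERO) != 0);
    return 0;
}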
What should one return from a function that gets an illegal argument, like sqrt(-1)? I assume NAN.
This is outside the domain of <fenv.h>. sqrt is specified in 7.12, covering <math.h>. Per 7.12.7.5 2, a sqrt domain error occurs if the argument is less than zero. Per 7.12.1, 2, “… On a domain error, the function returns an implementation-defined value…” and it may also set errno and/or raise the “invalid” floating-point exception. Commonly, implementations will return a quiet NaN.
Should one get and restore the original state in a function?
This is up to the person who is specifying or implementing a function. Routines that provide “elementary” mathematical functions, as for the standard math library, may be required to affect the floating-point state only as called for by the ideal function being provided, concealing any internal operations. For example, if log10(100) is implemented by using an engineered polynomial to approximate the logarithm, the evaluation of the polynomial might use arithmetic operations that have very high precision and accuracy, so high that the final result, when rounded to the final format, is exactly two. But the intermediate calculations might involve some operations that were inexact. Thus, the inexact exception would be raised. If the routine merely evaluated the polynomial and returned the exact result, 2, it would have raised the inexact flag. Ideally, the routine would suppress floating-point traps, save the exception flags, do its calculations, restore originally enabled traps, set the proper exception flags if any, and return the result. (Alternately, the routine might be implemented to use only integer arithmetic, avoiding the floating-point issues.)
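A rough sketch of that save/restore pattern using the standard feholdexcept/feupdateenv pair; my_log10 is a hypothetical wrapper for illustration, not how any real library implements log10:

#include <fenv.h>
#include <math.h>
#include <stdio.h>

double my_log10(double x)
{
    fenv_t env;
    feholdexcept(&env);        /* save the caller's environment, clear flags, disable traps */

    double result = log10(x);  /* internal arithmetic may raise flags (e.g. inexact) */

    /* A careful implementation would clear flags that reflect only internal
       work, e.g. FE_INEXACT when it can prove the final result is exact
       (as for log10(100) == 2); this sketch passes everything through. */

    feupdateenv(&env);         /* restore the environment and re-raise the surviving flags */
    return result;
}

int main(void)
{
    printf("%g\n", my_log10(100.0));   /* 2 */
    return 0;
}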
Outside of specialized library routines, floating-point subroutines rarely go to this extent. It is up to each application how it wants its routines to behave.
For example, one might compute a starting point for some iterative algorithm; if that computation turns out to be inaccurate, it is of no consequence, as presumably the iteration can converge to full precision.
This paragraph is unclear; I am unable to answer it as written.
How should one use this in a program calling e.g. library functions?
This question is unclear, possibly due to the lack of clarity of the preceding paragraph. The antecedent of “this” is unclear.
For a long time I thought floating-point arithmetic was well defined and that different platforms making the same calculation should get the same results. (Given the same rounding modes.)
Yet Microsoft's SQL Server deems any calculation performed with floating point to have an "imprecise" result, meaning you can't have an index on it. This suggests to me that the guys at Microsoft thought there's a relevant catch regarding floating point.
So what is the catch?
EDIT: You can have indexes on floats in general, just not on computed columns. Also, geodata uses floating point types for the coordinates and those are obviously indexed, too. So it's not a problem of reliable comparison in general.
Floating-point arithmetic is well defined by the IEEE 754 standard. In the documentation you point out, Microsoft has apparently not chosen to adhere to the standard.
There are a variety of issues that make floating-point reproducibility difficult, and you can find Stack Overflow discussions about them by searching for “[floating-point] reproducibility”. However, most of these issues are about:
lack of control in high-level languages (the individual floating-point operations are completely reproducible and specified by IEEE 754, and the hardware provides sufficient IEEE 754 conformance, but the high-level language specification does not adequately map language constructs to specific floating-point operations);
differences in math library routines (functions such as sin and log are “hard” to compute in some sense, and vendors implement them without what is called correct rounding, so each vendor’s routines have slightly different error characteristics than others);
multithreading and other issues that allow operations to be performed in different orders, thus yielding different results;
and so on.
In a single system such as Microsoft’s SQL Server, Microsoft presumably could have controlled these issues if they wanted to. Still, there are issues to consider. For example, a database system may have a sum function that computes the sum of many things. For speed, you may wish the sum implementation to have the flexibility to add the elements in any order, so that it can take advantage of multiprocessing or of adding the elements in whatever order they happen to be stored in. But adding floating-point data using elementary add operations of the same floating-point format has different results depending on the order of the elements. To make the sum reproducible, you have to specify the order of operation or use extra precision or other techniques, and then performance suffers.
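To make the order dependence concrete, here is a small C illustration (the values assume IEEE 754 binary64, where adjacent doubles near 1e16 are 2 apart):

#include <stdio.h>

int main(void)
{
    double big = 1e16, one = 1.0;
    /* Adding 1.0 twice is lost each time (each sum rounds back to 1e16),
       while adding 2.0 once lands on a representable value. */
    printf("%.1f\n", (big + one) + one);   /* 10000000000000000.0 */
    printf("%.1f\n", big + (one + one));   /* 10000000000000002.0 */
    return 0;
}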
So, not making floating-point arithmetic reproducible is a choice that is made, not a consequence of any lack of specification for floating-point arithmetic.
Another problem for database purposes is that even well defined and completely specified floating-point arithmetic has NaN values. (NaN, an abbreviation for Not a Number, represents a floating-point datum that is not a number. A NaN is produced as the result of an operation that has no mathematical result (such as the real square root of a negative number). NaNs act as placeholders so that floating-point operations can continue without interruption, and an application can complete a set of floating-point operations and then take action to replace or otherwise deal with any NaNs that arose.) Because a NaN does not represent a number, it is not equal to anything, not even itself. Comparing two NaNs for equality produces false, even if the NaNs are represented with exactly the same bits. This is a problem for databases, because NaNs cannot be used as a key for looking up records, since a NaN search key will never equal a NaN in the key field of a record. Sometimes this is dealt with by defining two different ordering relations: one is the usual mathematical comparison, which defines less than, equal to, and greater than for numbers (and for which all three are false for NaNs), and a second which defines a sort order and which is defined for all data, including NaNs.
It should be noted that each floating-point datum that is not a NaN represents a certain number exactly. There is no imprecision in a floating-point number. A floating-point number does not represent an interval. Floating-point operations approximate real arithmetic in that they return values approximately equal to the exact mathematical results, while floating-point numbers are exact. Elementary floating-point operations are exactly specified by IEEE 754. Lack of reproducibility arises in using different operations (including the same operations with different precisions), in using operations in different orders, or in using operations that do not conform to the IEEE 754 standard.
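For example (assuming IEEE 754 binary64), printing 0.1 with extra digits shows the single exact number the double actually holds, rather than an interval or a fuzzy value:

#include <stdio.h>

int main(void)
{
    /* The double nearest to 0.1 is exactly
       0.1000000000000000055511151231257827021181583404541015625 */
    printf("%.25g\n", 0.1);
    return 0;
}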
I don't want to introduce floating point when an inexact value would be a disaster, so I have a couple of questions about when you can actually use it safely.
Are they exact for integers as long as you don't overflow the number of significant digits? Are these two tests always true:
double d = 2.0;
if (d + 3.0 == 5.0) ...
if (d * 3.0 == 6.0) ...
What math functions can you rely on? Are these tests always true:
#include <math.h>
double d = 100.0;
if (log10(d) == 2.0) ...
if (pow(d, 2.0) == 10000.0) ...
if (sqrt(d) == 10.0) ...
How about this:
int v = ...;
if (log2((double) v) > 16.0) ... /* gonna need more than 16 bits to store v */
if (log((double) v) / log(2.0) > 16.0) ... /* C89 */
I guess you can summarize this question as: 1) Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h? 2) Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
I too find incorrect results distasteful.
On common hardware, you can rely on +, -, *, /, and sqrt working and delivering the correctly-rounded result. That is, they deliver the floating-point number closest to the sum, difference, product, quotient, or square root of their argument or arguments.
Some library functions, notably log2 and log10 and exp2 and exp10, traditionally have terrible implementations that are not even faithfully-rounded. Faithfully-rounded means that a function delivers one of the two floating-point numbers bracketing the exact result. Most modern pow implementations have similar issues. Lots of these functions will even blow exact cases like log10(10000) and pow(7, 2). Thus equality comparisons involving these functions, even in exact cases, are asking for trouble.
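If you want to see how your own library fares on such exact cases, a check along these lines works; whether the comparisons hold is implementation-dependent, which is exactly the point:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Both mathematical results are exactly representable doubles, yet a
       sloppy implementation may return a value one or more ulps away. */
    printf("log10(10000) == 4.0: %s\n", log10(10000.0) == 4.0 ? "yes" : "no");
    printf("pow(7, 2) == 49.0:   %s\n", pow(7.0, 2.0) == 49.0 ? "yes" : "no");
    return 0;
}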
sin, cos, tan, atan, exp, and log have faithfully-rounded implementations on every platform I've recently encountered. In the bad old days, on processors using the x87 FPU to evaluate sin, cos, and tan, you would get horribly wrong outputs for largish inputs and you'd get the input back for larger inputs. CRlibm has correctly-rounded implementations; these are not mainstream because, I'm told, they've got rather nastier worst cases than the traditional faithfully-rounded implementations.
Things like copysign and nextafter and isfinite all work correctly. ceil and floor and rint and friends always deliver the exact result. fmod and friends do too. frexp and friends work. fmin and fmax work.
Someone thought it would be a brilliant idea to make fma(x,y,z) compute x*y+z by computing x*y rounded to a double, then adding z and rounding the result to a double. You can find this behaviour on modern platforms. It's stupid and I hate it.
I have no experience with the hyperbolic trig, gamma, or Bessel functions in my C library.
I should also mention that popular compilers targeting 32-bit x86 play by a different, broken, set of rules. Since the x87 is the only supported floating-point instruction set and all x87 arithmetic is done with an extended exponent, computations that would induce an underflow or overflow in double precision may fail to underflow or overflow. Furthermore, since the x87 also by default uses an extended significand, you may not get the results you're looking for. Worse still, compilers will sometimes spill intermediate results to variables of lower precision, so you can't even rely on your calculations with doubles being done in extended precision. (Java has a trick for doing 64-bit math with 80-bit registers, but it is quite expensive.)
I would recommend sticking to arithmetic on long doubles if you're targeting 32-bit x86. Compilers are supposed to set FLT_EVAL_METHOD to an appropriate value, but I do not know if this is done universally.
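A quick way to see what evaluation method a compiler claims to use is to print FLT_EVAL_METHOD from <float.h>: 0 means operations are evaluated in their nominal types, 1 means float and double are evaluated as double, 2 means everything is evaluated as long double (typical of traditional x87 code generation), and -1 means indeterminable.

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* See C99/C11 5.2.4.2.2 for the meaning of the values. */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}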
Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h?
Well, they can store the integers which fit in their mantissa (significand). So [-2^53, 2^53] for double. For more on this, see: Which is the first integer that an IEEE 754 float is incapable of representing exactly?
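A small check of that boundary (assuming IEEE 754 binary64):

#include <stdio.h>

int main(void)
{
    double limit = 9007199254740992.0;     /* 2^53 */
    /* 2^53 + 1 is the first integer a double cannot hold; it rounds back
       to 2^53, whereas 2^53 + 2 is representable again. */
    printf("%d\n", limit + 1.0 == limit);  /* prints 1 */
    printf("%d\n", limit + 2.0 == limit);  /* prints 0 */
    return 0;
}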
Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
They at least guarantee that the result is immediately on either side of the actual mathematical result. That is, you won't get a result which has a valid floating point value between itself and the "actual" result. But beware, because repeated operations may accumulate an error which seems counter to this, while it is not (because all intermediate values are subject to the same constraints, not just the inputs and output of a compound expression).
I have read somewhere that there is a source of non-determinism in C double-precision floating point as follows:
1. The C standard says that 64-bit floats (doubles) are required to produce only about 64-bit accuracy.
2. Hardware may do floating point in 80-bit registers.
3. Because of (1), the C compiler is not required to clear the low-order bits of floating-point registers before stuffing a double into the high-order bits.
4. This means YMMV, i.e. small differences in results can happen.
Is there any now-common combination of hardware and software where this really happens? I see in other threads that .net has this problem, but is C doubles via gcc OK? (e.g. I am testing for convergence of successive approximations based on exact equality)
The behavior on implementations with excess precision, which seems to be the issue you're concerned about, is specified strictly by the standard in most if not all cases. Combined with IEEE 754 (assuming your C implementation follows Annex F) this does not leave room for the kinds of non-determinism you seem to be asking about. In particular, things like x == x (which Mehrdad mentioned in a comment) failing are forbidden since there are rules for when excess precision is kept in an expression and when it is discarded. Explicit casts and assignment to an object are among the operations that drop excess precision and ensure that you're working with the nominal type.
Note however that there are still a lot of broken compilers out there that don't conform to the standards. GCC intentionally disregards them unless you use -std=c99 or -std=c11 (i.e. the "gnu99" and "gnu11" options are intentionally broken in this regard). And prior to GCC 4.5, correct handling of excess precision was not even supported.
This may happen on Intel x86 code that uses the x87 floating-point unit (except probably 3., which seems bogus. LSB bits will be set to zero.). So the hardware platform is very common, but on the software side use of x87 is dying out in favor of SSE.
Basically, whether a number is represented in 80 or 64 bits is at the whim of the compiler and may change at any point in the code, with the consequence that, for example, a number which just tested non-zero is now zero.
See "The pitfalls of verifying floating-point computations", page 8ff.
Testing for exact convergence (or equality) in floating point is usually a bad idea, even in a totally deterministic environment. FP is an approximate representation to begin with. It is much safer to test for convergence to within a specified epsilon.
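A minimal sketch of such a test (the tolerance 1e-12 and the Newton iteration for sqrt(2) are only illustrative choices):

#include <math.h>
#include <stdio.h>

/* Converged when successive approximations agree to a relative tolerance
   (with a floor of 1.0 so values near zero get an absolute tolerance),
   instead of requiring exact equality of two doubles. */
static int converged(double x_new, double x_old, double tol)
{
    return fabs(x_new - x_old) <= tol * fmax(fabs(x_new), 1.0);
}

int main(void)
{
    double x = 1.0, prev;
    do {
        prev = x;
        x = 0.5 * (x + 2.0 / x);    /* Newton step for sqrt(2) */
    } while (!converged(x, prev, 1e-12));
    printf("%.17g\n", x);
    return 0;
}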
EDIT: I had made a mistake during the debugging session that led me to ask this question. The differences I was seeing were in fact in printing a double and in parsing a double (strtod). Stephen's answer still covers my question very well even after this rectification, so I think I will leave the question alone in case it is useful to someone.
Some (most) C compilation platforms I have access to do not take the FPU rounding mode into account when
converting a 64-bit integer to double;
printing a double.
Nothing very exotic here: Mac OS X Leopard, various recent Linuxes and BSD variants, Windows.
On the other hand, Mac OS X Snow Leopard seems to take the rounding mode into account when doing these two things. Of course, having different behaviors annoys me no end.
Here are typical snippets for the two cases:
#if defined(__OpenBSD__) || defined(__NetBSD__)
# include <ieeefp.h>
# define FE_UPWARD FP_RP
# define fesetround(RM) fpsetround(RM)
#else
# include <fenv.h>
#endif
#include <float.h>
#include <math.h>
fesetround(FE_UPWARD);
...
double f;
long long b = 2000000001;
b = b*b;
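/* b*b = 4000000004000000001 needs 62 bits, more than a double's 53-bit
   significand, so the conversion below is inexact and the chosen neighbour
   depends on the rounding mode. */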
f = b;
...
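/* 0.1 is not exactly representable in binary; the question is whether the
   decimal output produced by printf honours the current (upward) rounding mode. */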
printf("%f\n", 0.1);
My questions are:
Is there something non-ugly that I can do to normalize the behavior across all platforms? Some hidden setting to tell the platforms that take rounding mode into account not to or vice versa?
Is one of the behaviors standard?
What am I likely to encounter when the FPU rounding mode is not used? Round towards zero? Round to nearest? Please, tell me that there is only one alternative :)
Regarding 2. I found the place in the standard where it is said that floats converted to integers are always truncated (rounded towards zero) but I couldn't find anything for the integer -> float direction.
If you have not set the rounding mode, it should be the IEEE-754 default mode, which is round-to-nearest.
For conversions from integer to float, the C standard says (§6.3.1.4):
When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. If the value being converted is outside the range of values that can be represented, the behavior is undefined.
So both behaviors conform to the C standard.
The C standard says (§F.5) that conversions between IEC 60559 floating-point formats and character sequences shall be correctly rounded as per the IEEE-754 standard. For non-IEC 60559 formats, this is recommended, but not required. The 1985 IEEE-754 standard says (clause 5.4):
Conversions shall be correctly rounded as specified in Section 4 for operands lying within the ranges specified in Table 3. Otherwise, for rounding to nearest, the error in the converted result shall not exceed by more than 0.47 units in the destination's least significant digit the error that is incurred by the rounding specifications of Section 4, provided that exponent over/underflow does not occur. In the directed rounding modes the error shall have the correct sign and shall not exceed 1.47 units in the last place.
What section (4) actually says is that the operation shall occur according to the prevailing rounding mode. I.e. if you change the rounding mode, IEEE-754 says that the result of float->string conversion should change accordingly. Ditto for integer->float conversions.
The 2008 revision of the IEEE-754 standard says (clause 4.3):
The rounding-direction attribute affects all computational operations that might be inexact. Inexact numeric floating-point results always have the same sign as the unrounded result.
Both conversions are defined to be computational operations in clause 5, so again they should be performed according to the prevailing rounding mode.
I would argue that Snow Leopard has the correct behavior here (assuming that it is correctly rounding the results according to the prevailing rounding mode). If you want to force the old behavior, you can always wrap your printf calls in code that changes the rounding mode, I suppose, though that's clearly not ideal.
Alternatively, you could use the %a format specifier (hexadecimal floating point) on C99 compliant platforms. Since the result of this conversion is always exact, it will never be affected by the prevailing rounding mode. I don't think that the Windows C library supports %a, but you could probably port the BSD or glibc implementation easily enough if you need it.
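For instance (a minimal sketch; output assumes IEEE 754 binary64, and the exact spelling of the hex digits may vary between libraries):

#include <stdio.h>

int main(void)
{
    /* Hexadecimal floating point shows the stored bits exactly, so the
       prevailing rounding mode cannot change the output. */
    printf("%a\n", 0.1);   /* e.g. 0x1.999999999999ap-4 */
    return 0;
}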