Guaranteed precision of sqrt function in C/C++ - c

Everyone knows the sqrt function from math.h/cmath in C/C++ - it returns the square root of its argument. Of course, it has to do so with some error, because not every number can be stored precisely. But am I guaranteed that the result has some precision? For example, 'it's the best approximation of the square root that can be represented in the floating-point type used', or 'if you calculate the square of the result, it will be as close to the initial argument as possible given the floating-point type'?
Does C/C++ standard have something about it?

For C99, there are no specific requirements. But most implementations try to support Annex F: IEC 60559 floating-point arithmetic as well as possible. It says:
An implementation that defines __STDC_IEC_559__ shall conform to the specifications in this annex.
And:
The sqrt functions in <math.h> provide the IEC 60559 square root operation.
IEC 60559 (equivalent to IEEE 754) says about basic operations like sqrt:
Except for binary <-> decimal conversion, each of the operations shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then coerced this intermediate result to fit in the destination's format.
The final step consists of rounding according to several rounding modes but the result must always be the closest representable value in the target precision.

This question was already answered here, as Chris Dodd noted in the comments section. In short: it's not guaranteed by the C++ standard, but the IEEE-754 standard guarantees that the result will be as close to the 'real result' as possible, i.e. the error will be at most 1/2 unit in the last place (ULP). In particular, if the result can be stored exactly, it will be.

Related

Is the floating-point literal “.1” the same as “0.1” in C?

In the source text of a C program, do .1 and 0.1 have the same value?
.1 represents one-tenth, the same as 0.1 does. However, due to a lack of strictness in the C standard, .1 and 0.1 do not necessarily convert to the same internal value, per C 2018 6.4.4.2 5. They will be equal in all compilers of reasonable quality. (6.4.4.2 5 says “All floating constants of the same source form shall convert to the same internal format with the same value.” Footnote 77 gives examples of source forms that have the same mathematical values but that do not necessarily convert to the same internal value.)
Floating-point constants in source text are converted to an internal format. Most commonly, a binary-based format is used. Most decimal numerals, including .1, are not exactly representable in binary floating-point. So, when they are converted, the result is rounded (in binary) to a representable value. In typical C implementations, .1 becomes 0.1000000000000000055511151231257827021181583404541015625.
All good compilers will convert .1 and 0.1 to the same value. The reason the C standard is lax about this is that other floating-point literals, involving exponents or many digits, were difficult (in some sense) to convert to binary floating-point with ideal rounding. Historically, there were C implementations that fudged the conversions. The C standard accommodated these implementations by not making strict requirements about handling of floating-point values. (Today, good algorithms are known, and any good compiler ought to convert a floating-point literal to the nearest representable value, with ties to the even low digit, unless the user requests otherwise.)
So, the C standard does not guarantee that .1 and 0.1 have the same value. However, in practice, they will.
Eric's answer is correct if you're just talking about the baseline C standard, which makes basically no guarantees about floating point; 1.0==42.0 is a valid implementation choice. But this is not very helpful.
If you want any reasonable floating point behavior in C, you want an implementation that supports Annex F (the alignment of IEEE floating point semantics with C), an optional part of the standard. You can tell if your implementation supports (or claims to support) Annex F by checking for the predefined macro __STDC_IEC_559__.
Assuming Annex F, the interpretation of floating point literals is not up for grabs, and .1 and 0.1 will necessarily be the same.

How to check that IEEE 754 single-precision (32-bit) floating-point representation is used?

I want to test the following things on my target board:
Is 'float' implemented with IEEE 754 single-precision (32-bit) floating-point variable?
Is 'double' implemented with IEEE 754 double-precision (64-bit) floating-point variable?
What are the ways in which I can test this with a simple C program?
No simple test exists.
The overwhelming majority of systems today use IEEE-754 formats for floating-point. However, most C implementations do not fully conform to IEEE 754 (which is identical to IEC 60559) and do not set the preprocessor identifier __STDC_IEC_559__. In the absence of this identifier, the only way to determine whether a C implementation conforms to IEEE 754 is one or a combination of:
Read its documentation.
Examine its source code.
Test it (which is, of course, difficult when only exhaustive testing can be conclusive).
In many C implementations and software applications, the deviations from IEEE 754 can be ignored or worked around: You may write code as if IEEE 754 were in use, and much code will largely work. However, there are a variety of things that can trip up an unsuspecting programmer; writing completely correct floating-point code is difficult even when the full specification is obeyed.
Common deviations include:
Intermediate arithmetic is performed with more precision than the nominal type. E.g., expressions that use double values may be calculated with long double precision.
sqrt does not return a correctly rounded value in every case.
Other math library routines return values that may be slightly off (a few ULP) from the correctly rounded results. (In fact, nobody has implemented all the math routines recommended in IEEE 754-2008 with both guaranteed correct rounding and guaranteed bounded run time.)
Subnormal numbers (tiny numbers near the edge of the floating-point format) may be converted to zero instead of handled as specified by IEEE 754.
Conversions between decimal numerals (e.g., 3.1415926535897932384626433 in the source code) and binary floating-point formats (e.g., the common double format, IEEE-754 64-bit binary) do not always round correctly, in either conversion direction.
Only round-to-nearest mode is supported; the other rounding modes specified in IEEE 754 are not supported. Or they may be available for simple arithmetic but require using machine-specific assembly language to access. Standard math libraries (cos, log, et cetera) rarely support other rounding modes.
In C99, you can check for __STDC_IEC_559__:
#ifdef __STDC_IEC_559__
/* using IEEE-754 */
#endif
This is because the international floating point standard referenced by C99 is IEC 60559:1989 (IEC 559 and IEEE 754 are earlier designations of the same standard). The mapping from the C language to IEC 60559 is optional, but if it is in use, the implementation defines the macro __STDC_IEC_559__ (Annex F of the C99 standard), so you can rely on that.
Another alternative is to manually check if the values in float.h, such as FLT_MAX, FLT_EPSILON, FLT_MAX_10_EXP, etc, match with the IEEE-754 limits, although theoretically there could be another representation with the same values.
First of all, you can find the details about the ISO/IEC/IEEE 60559 (or IEEE 754) in Wikipedia:
Floating point standard types
As F. Goncalvez has told you, the macro __STDC_IEC_559__ tells you whether your compiler conforms to IEEE 754 or not.
However, you can obtain additional information with the macro FLT_EVAL_METHOD.
The value of this macro means:
0 All operations and constants are evaluated in the range and precision of the type used.
1 The operations of types float and double are evaluated in the range and precision of double; long double operations keep the range and precision of long double.
2 The evaluations of all types are done in the precision and range of long double.
-1 Indeterminate
Other negative values: Implementation defined (it depends on your compiler).
For example, if FLT_EVAL_METHOD == 2, and you hold the result of several calculations in a floating point variable x, then all operations and constants are calculated or processed in the best precision, that is, long double, but only the final result is rounded to the type that x has.
This behaviour reduces the impact of numerical errors.
In order to know details about the floating point types, you have to watch the constant macros provided by the standard header <float.h>.
For example, see this link:
Characteristics of floating point types
In the sad case that your implementation does not conform to the IEEE 754 standard, you can try looking for details in the standard header <float.h>, if it exists.
Also, you have to read the documentation of your compiler.
For example, the compiler GCC explains what does with floating point:
Status of C99 features in GCC
No; Standard C18, p. 373 specifies that IEC 60559 is used for float, double, etc.
Why do you think IEEE 754 is used?

Compilation platform taking FPU rounding mode into account in printing, conversions

EDIT: I had made a mistake during the debugging session that led me to ask this question. The differences I was seeing were in fact in printing a double and in parsing a double (strtod). Stephen's answer still covers my question very well even after this rectification, so I think I will leave the question alone in case it is useful to someone.
Some (most) C compilation platforms I have access to do not take the FPU rounding mode into account when
converting a 64-bit integer to double;
printing a double.
Nothing very exotic here: Mac OS X Leopard, various recent Linuxes and BSD variants, Windows.
On the other hand, Mac OS X Snow Leopard seems to take the rounding mode into account when doing these two things. Of course, having different behaviors annoys me no end.
Here are typical snippets for the two cases:
#if defined(__OpenBSD__) || defined(__NetBSD__)
# include <ieeefp.h>
# define FE_UPWARD FP_RP
# define fesetround(RM) fpsetround(RM)
#else
# include <fenv.h>
#endif
#include <float.h>
#include <math.h>
fesetround(FE_UPWARD);
...
double f;
long long b = 2000000001;
b = b*b;
f = b;
...
printf("%f\n", 0.1);
My questions are:
Is there something non-ugly that I can do to normalize the behavior across all platforms? Some hidden setting to tell the platforms that take rounding mode into account not to or vice versa?
Is one of the behaviors standard?
What am I likely to encounter when the FPU rounding mode is not used? Round towards zero? Round to nearest? Please, tell me that there is only one alternative :)
Regarding 2. I found the place in the standard where it is said that floats converted to integers are always truncated (rounded towards zero) but I couldn't find anything for the integer -> float direction.
If you have not set the rounding mode, it should be the IEEE-754 default mode, which is round-to-nearest.
For conversions from integer to float, the C standard says (§6.3.1.4):
When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. If the value being converted is outside the range of values that can be represented, the behavior is undefined.
So both behaviors conform to the C standard.
The C standard says (§F.5) that conversions between IEC60559 floating point formats and character sequences be correctly rounded as per the IEEE-754 standard. For non-IEC60559 formats, this is recommended, but not required. The 1985 IEEE-754 standard says (clause 5.4):
Conversions shall be correctly rounded as specified in Section 4 for operands lying within the ranges specified in Table 3. Otherwise, for rounding to nearest, the error in the converted result shall not exceed by more than 0.47 units in the destination's least significant digit the error that is incurred by the rounding specifications of Section 4, provided that exponent over/underflow does not occur. In the directed rounding modes the error shall have the correct sign and shall not exceed 1.47 units in the last place.
What section (4) actually says is that the operation shall occur according to the prevailing rounding mode. I.e. if you change the rounding mode, IEEE-754 says that the result of float->string conversion should change accordingly. Ditto for integer->float conversions.
The 2008 revision of the IEEE-754 standard says (clause 4.3):
The rounding-direction attribute affects all computational operations that might be inexact. Inexact numeric floating-point results always have the same sign as the unrounded result.
Both conversions are defined to be computational operations in clause 5, so again they should be performed according to the prevailing rounding mode.
I would argue that Snow Leopard has the correct behavior here (assuming that it is correctly rounding the results according to the prevailing rounding mode). If you want to force the old behavior, you can always wrap your printf calls in code that changes the rounding mode, I suppose, though that's clearly not ideal.
Alternatively, you could use the %a format specifier (hexadecimal floating point) on C99 compliant platforms. Since the result of this conversion is always exact, it will never be affected by the prevailing rounding mode. I don't think that the Windows C library supports %a, but you could probably port the BSD or glibc implementation easily enough if you need it.

Lower Bounds For Floating Points

Are there any lower bounds for floating point types in C? Like there are lower bounds for integral types (int being at least 16 bits)?
Yes. float.h contains constants such as:
FLT_EPSILON, DBL_EPSILON, LDBL_EPSILON: the difference between 1 and the least value greater than 1 that can be represented in float, double, and long double, respectively.
FLT_MAX and FLT_MIN represent the largest finite and the smallest positive normalized values which can be represented by a float. Similar DBL_ and LDBL_ macros are available.
FLT_DIG, DBL_DIG, LDBL_DIG are defined as the number of decimal digits precision.
You are asking for either the xxx_MIN or the xxx_EPSILON value.
Along these lines, here is a question wherein I posted some code which displays the internals of a 64-bit IEEE-754 floating-point number.
To be strict and grounded:
ISO/IEC 9899:TC2: (WG14/N1124m May 6, 2005):
5.2.4.2.2, Characteristics of floating types <float.h>
float.h contains many macros describing various properties of the floating types (including FLT_MIN and DBL_MIN).
The description of the requirements of the limits in float.h is given in the standard (C90 or C99 - 5.2.4.2.2 "Characteristics of floating types").
In particular, according to the standard FLT_MIN and DBL_MIN may be no greater than 1E-37, so any implementation must support positive values at least that small for float and double. But an implementation is free to do better than that (and indicates what it does in FLT_MIN and DBL_MIN).
See this question for information on where to get a copy of the standards documents if you need one:
Where do I find the current C or C++ standard documents?
Maybe this helps: float.h reference (it is C++, I'm not sure if it applies to plain C as well)
This Draft C99 standard (PDF) notes minimum values for floating point type precision in section 5.2.4.2.2.
(Found via Wikipedia on C99.)
A useful reference here is What Every Computer Scientist Should Know About Floating-Point Arithmetic.
The nature of a floating point number — its size, precision, limits — is really defined by the hardware, rather than the programming language. A single-precision float on an x86 is the same in C, C#, Java, and any other practical programming language. (The exception is esoteric programming languages that implement odd widths of floating point number in software.)
Excerpts from the Standard draft (n1401.pdf)
Annex F
(normative)
IEC 60559 floating-point arithmetic
F.1 Introduction
1 ... An implementation that defines __STDC_IEC_559__ shall conform to the specifications in this annex. ...
F.2 Types
1 The C floating types match the IEC 60559 formats as follows:
-- The float type matches the IEC 60559 single format.
-- The double type matches the IEC 60559 double format.
-- The long double type matches an IEC 60559 extended format ...
Wikipedia has an article about IEC 559 (or rather IEEE 754-1985) you might find interesting.

How are floating point literals in C interpreted?

In a C program, when you write a floating point literal like 3.14159 is there standard interpretation or is it compiler or architecture dependent? Java is exceedingly clear about how floating point strings are interpreted, but when I read K&R or other C documentation the issue seems swept under the rug.
It is architecture dependent.
That generally means IEEE 754, but not necessarily.
The C standard (ISO 9899:1999) discusses this mainly in section 5.2.4.2.2 'Characteristics of floating types'.
From the C99 standard, section 6.4.4.2 Floating constants, paragraph 3 (emphasis mine):
The significand part is interpreted as a (decimal or hexadecimal) rational number; the digit sequence in the exponent part is interpreted as a decimal integer. For decimal floating constants, the exponent indicates the power of 10 by which the significand part is to be scaled. For hexadecimal floating constants, the exponent indicates the power of 2 by which the significand part is to be scaled. For decimal floating constants, and also for hexadecimal floating constants when FLT_RADIX is not a power of 2, the result is either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner. For hexadecimal floating constants when FLT_RADIX is a power of 2, the result is correctly rounded.
So, you're going to get a constant within one ULP in an implementation-defined manner. Recall that implementation-defined means that the implementation (in this case, the C runtime) can choose any of the options, but that choice must be documented. So, you can consult libc runtime documentation to find out how the rounding occurs.
You are not clear about whether you mean a floating point literal as part of the source code (for the compiler to parse into an architecture-dependent binary representation), or one scanned by library functions, such as scanf(), atof(), strtof(), strtod() and strtold() (at run-time, to convert to an in-memory float, double or long double value).
In the first case, it is part of ISO/IEC 9899:1999 (ISO C99), §6.4.4.2 "Floating constants". It defines both the lexicon and how it should be interpreted.
In the second case, the behavior of the library functions are defined in §7.20.1 "Numeric conversion functions".
I don't have a hard copy of the previous standard (ANSI C, 1989), but I'm pretty sure it also defines very precisely how floating point numbers are parsed and converted.
In the case you want to know if there is a standard to represent these values in binary format, in-memory, the answer is no. The C language is intended to be close to the architecture, and not impose constraints over it. So the in-memory representation is always architecture-dependent. But the C standard defines how arithmetic should be performed over floating point values. It follows IEC 60559 standard. In the ISO C99 standard, it is described in Annex F (normative), "IEC 60559 floating-point arithmetic". The implementation may or may not implement this standard. If it does, it must define the __STDC_IEC_559__ preprocessor name.
