Why does frexp() not yield scientific notation?

Scientific notation is the common way to express a number with an explicit order of magnitude. First a nonzero digit, then a radix point, then a fractional part, and the exponent. In binary, there is only one possible nonzero digit.
Floating-point math involves an implicit first digit equal to one, then the mantissa bits "follow the radix point."
So why does frexp() put the radix point to the left of the implicit bit, and return a number in [0.5, 1) instead of scientific-notation-like [1, 2)? Is there some overflow to beware of?
Effectively it subtracts one more than the bias value specified by IEEE 754/IEC 60559. In hardware, this potentially trades an addition for an XOR. Alone, that seems like a pretty weak argument, considering that in many cases getting back to normal will require another floating-point operation.
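For concreteness, here is a small sketch (assuming a binary double) showing the off-by-one between frexp's exponent and the scientific-notation-like exponent:
#include <math.h>
#include <stdio.h>

int main(void)
{
    int e;
    double f = frexp(6.0, &e);   /* f in [0.5, 1): f = 0.75, e = 3, since 6 = 0.75 * 2^3 */
    double m = ldexp(f, 1);      /* shift to [1, 2): m = 1.5, exponent becomes e - 1 = 2  */
    printf("6 = %g * 2^%d = %g * 2^%d\n", f, e, m, e - 1);
    return 0;
}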

The rationale says:
4.5.4.2 The frexp function
The functions frexp, ldexp, and modf are primitives used by the
remainder of the library. There was some sentiment for dropping them
for the same reasons that ecvt, fcvt, and gcvt were dropped, but their
adherents rescued them for general use. Their use is problematic: on
nonbinary architectures ldexp may lose precision, and frexp may be
inefficient.
One can speculate that the “remainder of the library” was more convenient to write with frexp's convention, or was already traditionally written against this interface although it did not provide any benefit.
I know that this does not fully answer the question, but it did not quite fit inside a comment.
I should also point out that some of the choices made in the design of the C language predate IEEE 754. Perhaps the format returned by frexp made sense with the PDP-11's floating-point format(s), or any other architecture on which a function frexp was first introduced. EDIT: See also page 155 of the manual for one PDP-11 model.

Related

IEEE Floating Point Numbers: Are they not "precise"? (read: well-defined)

For a long time I thought floating-point arithmetic was well defined and that different platforms making the same calculation should get the same results (given the same rounding modes).
Yet Microsoft's SQL Server deems any calculation performed with floating point to have an "imprecise" result, meaning you can't have an index on it. This suggests to me that the guys at Microsoft thought there's a relevant catch regarding floating point.
So what is the catch?
EDIT: You can have indexes on floats in general, just not on computed columns. Also, geodata uses floating point types for the coordinates and those are obviously indexed, too. So it's not a problem of reliable comparison in general.
Floating-point arithmetic is well defined by the IEEE 754 standard. In the documentation you point out, Microsoft has apparently chosen not to adhere to the standard.
There are a variety of issues that make floating-point reproducibility difficult, and you can find Stack Overflow discussions about them by searching for “[floating-point] reproducibility”. Most of these issues, however, come down to:
Lack of control in high-level languages. The individual floating-point operations are completely reproducible and specified by IEEE 754, and the hardware provides sufficient IEEE 754 conformance, but the high-level language specification does not adequately map language constructs to specific floating-point operations.
Differences in math library routines. Functions such as sin and log are “hard” to compute in some sense, and vendors implement them without what is called correct rounding, so each vendor’s routines have slightly different error characteristics than others.
Multithreading and other issues that allow operations to be performed in different orders, thus yielding different results.
And so on.
In a single system such as Microsoft’s SQL Server, Microsoft presumably could have controlled these issues if they wanted to. Still, there are issues to consider. For example, a database system may have a sum function that computes the sum of many things. For speed, you may wish the sum implementation to have the flexibility to add the elements in any order, so that it can take advantage of multiprocessing or of adding the elements in whatever order they happen to be stored in. But adding floating-point data using elementary add operations of the same floating-point format has different results depending on the order of the elements. To make the sum reproducible, you have to specify the order of operation or use extra precision or other techniques, and then performance suffers.
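As a small illustration of the ordering problem (the values are chosen for convenience, not taken from any database), adding the same doubles in a different order can already change the rounded sum:
#include <stdio.h>

int main(void)
{
    double a = 1e16, b = -1e16, c = 1.0;
    printf("%.1f\n", (a + b) + c);   /* 1.0: a and b cancel exactly first        */
    printf("%.1f\n", a + (b + c));   /* 0.0: c is absorbed when added to b first */
    return 0;
}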
So, not making floating-point arithmetic reproducible is a choice that was made, not a consequence of any lack of specification for floating-point arithmetic.
Another problem for database purposes is that even well-defined and completely specified floating-point arithmetic has NaN values. (NaN, an abbreviation for Not a Number, represents a floating-point datum that is not a number. A NaN is produced as the result of an operation that has no mathematical result, such as the real square root of a negative number. NaNs act as placeholders so that floating-point operations can continue without interruption, and an application can complete a set of floating-point operations and then take action to replace or otherwise deal with any NaNs that arose.) Because a NaN does not represent a number, it is not equal to anything, not even itself. Comparing two NaNs for equality produces false, even if the NaNs are represented with exactly the same bits. This is a problem for databases, because a NaN cannot be used as a key for looking up records: a NaN search key will never equal a NaN in the key field of a record. Sometimes this is dealt with by defining two different ordering relations: one is the usual mathematical comparison, which defines less than, equal to, and greater than for numbers (and for which all three are false for NaNs), and a second which defines a sort order and which is defined for all data, including NaNs.
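A short sketch of that NaN behaviour:
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = sqrt(-1.0);       /* produces a NaN (and raises the invalid flag)      */
    printf("%d\n", x == x);      /* 0: a NaN compares unequal even to itself          */
    printf("%d\n", x < 1.0 || x == 1.0 || x > 1.0);   /* 0: every ordered comparison is false */
    return 0;
}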
It should be noted that each floating-point datum that is not a NaN represents a certain number exactly. There is no imprecision in a floating-point number. A floating-point number does not represent an interval. Floating-point operations approximate real arithmetic in that they return values approximately equal to the exact mathematical results, while floating-point numbers are exact. Elementary floating-point operations are exactly specified by IEEE 754. Lack of reproducibility arises in using different operations (including the same operations with different precisions), in using operations in different orders, or in using operations that do not conform to the IEEE 754 standard.

How unreliable are floating point values, operators and functions?

I don't want to introduce floating point when an inexact value would be a disaster, so I have a couple of questions about when you actually can use them safely.
Are they exact for integers as long as you don't overflow the number of significant digits? Are these two tests always true:
double d = 2.0;
if (d + 3.0 == 5.0) ...
if (d * 3.0 == 6.0) ...
What math functions can you rely on? Are these tests always true:
#include <math.h>
double d = 100.0;
if (log10(d) == 2.0) ...
if (pow(d, 2.0) == 10000.0) ...
if (sqrt(d) == 10.0) ...
How about this:
int v = ...;
if (log2((double) v) > 16.0) ... /* gonna need more than 16 bits to store v */
if (log((double) v) / log(2.0) > 16.0) ... /* C89 */
I guess you can summarize this question as: 1) Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h? 2) Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
I too find incorrect results distasteful.
On common hardware, you can rely on +, -, *, /, and sqrt working and delivering the correctly-rounded result. That is, they deliver the floating-point number closest to the sum, difference, product, quotient, or square root of their argument or arguments.
Some library functions, notably log2 and log10 and exp2 and exp10, traditionally have terrible implementations that are not even faithfully-rounded. Faithfully-rounded means that a function delivers one of the two floating-point numbers bracketing the exact result. Most modern pow implementations have similar issues. Lots of these functions will even blow exact cases like log10(10000) and pow(7, 2). Thus equality comparisons involving these functions, even in exact cases, are asking for trouble.
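Whether those exact cases hold depends entirely on the math library you link against; a quick, hedged way to check on your own system (the expected mathematical results are 4 and 49, but a given libm is not required to return them exactly):
#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("log10(10000) == 4.0 ? %d\n", log10(10000.0) == 4.0);
    printf("pow(7, 2)    == 49  ? %d\n", pow(7.0, 2.0) == 49.0);
    return 0;
}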
sin, cos, tan, atan, exp, and log have faithfully-rounded implementations on every platform I've recently encountered. In the bad old days, on processors using the x87 FPU to evaluate sin, cos, and tan, you would get horribly wrong outputs for largish inputs and you'd get the input back for larger inputs. CRlibm has correctly-rounded implementations; these are not mainstream because, I'm told, they've got rather nastier worst cases than the traditional faithfully-rounded implementations.
Things like copysign and nextafter and isfinite all work correctly. ceil and floor and rint and friends always deliver the exact result. fmod and friends do too. frexp and friends work. fmin and fmax work.
Someone thought it would be a brilliant idea to make fma(x,y,z) compute x*y+z by computing x*y rounded to a double, then adding z and rounding the result to a double. You can find this behaviour on modern platforms. It's stupid and I hate it.
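A minimal sketch of the difference a genuinely fused multiply-add makes (the hex-float constants assume IEEE 754 binary64; a compiler that contracts x * x + z into an fma despite the pragma may print the same value twice, and a broken fma prints 0 on both lines):
#include <math.h>
#include <stdio.h>

#pragma STDC FP_CONTRACT OFF     /* ask the compiler not to fuse x * x + z (not all honor this) */

int main(void)
{
    double x = 1.0 + 0x1p-52;    /* the double just above 1.0                        */
    double z = -(1.0 + 0x1p-51); /* exactly the double-rounded value of x * x        */
    printf("%a\n", fma(x, x, z));/* a true fma keeps the low bits: 0x1p-104          */
    printf("%a\n", x * x + z);   /* round the product, then add: 0x0p+0              */
    return 0;
}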
I have no experience with the hyperbolic trig, gamma, or Bessel functions in my C library.
I should also mention that popular compilers targeting 32-bit x86 play by a different, broken, set of rules. Since the x87 is the only supported floating-point instruction set and all x87 arithmetic is done with an extended exponent, computations that would induce an underflow or overflow in double precision may fail to underflow or overflow. Furthermore, since the x87 also by default uses an extended significand, you may not get the results you're looking for. Worse still, compilers will sometimes spill intermediate results to variables of lower precision, so you can't even rely on your calculations with doubles being done in extended precision. (Java has a trick for doing 64-bit math with 80-bit registers, but it is quite expensive.)
I would recommend sticking to arithmetic on long doubles if you're targeting 32-bit x86. Compilers are supposed to set FLT_EVAL_METHOD to an appropriate value, but I do not know if this is done universally.
Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h?
Well, they can store the integers which fit in their mantissa (significand). So [-2^53, 2^53] for double. For more on this, see: Which is the first integer that an IEEE 754 float is incapable of representing exactly?
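For example (assuming IEEE 754 binary64 for double):
#include <stdio.h>

int main(void)
{
    double big = 9007199254740992.0;   /* 2^53: still exact                      */
    printf("%.1f\n", big);             /* 9007199254740992.0                     */
    printf("%.1f\n", big + 1.0);       /* 9007199254740992.0 again:              */
                                       /* 2^53 + 1 is not representable          */
    return 0;
}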
Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
They at least guarantee that the result is immediately on either side of the actual mathematical result. That is, you won't get a result which has a valid floating point value between itself and the "actual" result. But beware, because repeated operations may accumulate an error which seems counter to this, while it is not (because all intermediate values are subject to the same constraints, not just the inputs and output of a compound expression).
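A classic illustration of that last point: each operation below is individually correctly rounded, yet the final comparison fails:
#include <stdio.h>

int main(void)
{
    double a = 0.1, b = 0.2;       /* neither decimal is exactly representable      */
    printf("%d\n", a + b == 0.3);  /* 0: three correctly-rounded conversions and    */
                                   /* one correctly-rounded add still miss 0.3      */
    return 0;
}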

What is Biased Notation?

I have read:
"Like an unsigned int, but offset by −(2^(n−1) − 1), where n is the number of bits in the numeral. Aside:
Technically we could choose any bias we please, but the choice presented here is extraordinarily common." - http://inst.eecs.berkeley.edu/~cs61c/sp14/disc/00/Disc0.pdf
However, I don't get what the point is. Can someone explain this to me with examples? Also, when should I use it, given other options like one's complement, sign and magnitude, and two's complement?
Biased notation is a way of storing a range of values that doesn't start with zero.
Put simply, you take an existing representation that goes from zero to N, and then add a bias B to each number so it now goes from B to N+B.
Floating-point exponents are stored with a bias to keep the dynamic range of the type "centered" on 1.
Excess-three encoding is a technique for simplifying decimal arithmetic using a bias of three.
Two's complement notation could be considered as biased notation with a bias of INT_MIN and the most-significant bit flipped.
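A toy sketch of the encode/decode step, using a made-up 8-bit excess-127 field (the same bias IEEE 754 single precision uses for its exponent); the helper names are just for illustration:
#include <stdio.h>

#define BIAS 127   /* hypothetical excess-127: stored 0..255 represents -127..128 */

unsigned encode(int value)  { return (unsigned)(value + BIAS); }
int      decode(unsigned u) { return (int)u - BIAS; }

int main(void)
{
    printf("%u\n", encode(-127));  /* 0   : the smallest value maps to all-zero bits */
    printf("%u\n", encode(0));     /* 127                                            */
    printf("%d\n", decode(255));   /* 128 : all-one bits is the largest value        */
    return 0;
}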
A "representation" is a way of encoding information so that it easy to extract details or inferences from the encoded information.
Most modern CPUs "represent" numbers using "twos complement notation". They do this because it is easy to design digital circuits that can do what amounts to arithmetic on these values quickly (add, subtract, multiply, divide, ...). Twos complement also has the nice property that one can interpret the most significant bit as either a power-of-two (giving "unsigned numbers") or as a sign bit (giving signed numbers) without changing essentially any of the hardware used to implement the arithmetic.
Older machines used other bases, e.g., quite common in the 60s were machines that represented numbers as sets of binary-coded-decimal digits stuck in 4-bit addressable nibbles (the IBM 1620 and 1401 are examples of this). So, you can represent the same concept or value in different ways.
A bias just means that whatever representation you chose (for numbers), you have added a constant bias to that value. Presumably that is done to enable something to be done more effectively. I can't speak to "−(2^(n−1) − 1)" being "extraordinarily common" as a bias; I do lots of assembly and C coding and pretty much never find a need to "bias" values.
However, there is a common example. Modern CPUs largely implement IEEE floating point, which stores floating-point numbers with a sign, an exponent, and a mantissa. The exponent is a power of two, roughly symmetric around zero, but biased by 2^(N−1) − 1 for an N-bit exponent field (1023 for double's 11-bit exponent).
This bias allows floating point values with the same sign to be compared for equal/less/greater by using the standard machine twos-complement instructions rather than a special floating point instruction, which means that sometimes use of actual floating point compares can be avoided. (See http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm for dark corner details.) [Thanks to @PotatoSwatter for noting the inaccuracy of my initial answer here, and making me go dig this out.]
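To make the bias concrete, here is a sketch that pulls the exponent field out of a double and removes the bias of 1023 (it assumes IEEE 754 binary64 and a 64-bit unsigned long long):
#include <stdio.h>
#include <string.h>

int main(void)
{
    double d = 6.0;                              /* 6.0 = 1.5 * 2^2                */
    unsigned long long bits;
    memcpy(&bits, &d, sizeof bits);              /* reinterpret the bit pattern    */
    int stored   = (int)((bits >> 52) & 0x7FF);  /* 11-bit biased exponent field   */
    int unbiased = stored - 1023;                /* remove the 2^(11-1) - 1 bias   */
    printf("stored %d, unbiased %d\n", stored, unbiased);  /* stored 1025, unbiased 2 */
    return 0;
}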

Why dividing a float by a power of 10 is less accurate than typing the number directly?

When I run
printf("%.8f\n", 971090899.9008999);
printf("%.8f\n", 9710908999008999.0 / 10000000.0);
I get
971090899.90089989
971090899.90089977
I know why neither is exact, but what I don't understand is why doesn't the second match the first?
I thought basic arithmetic operations (+ - * /) were always as accurate as possible...
Isn't the first number a more accurate result of the division than the second?
Judging from the numbers you're using and based on the IEEE 754 floating-point standard, it seems the left-hand side of the division is too large to be completely encompassed in the mantissa (significand) of a 64-bit double.
You've got 53 bits' worth of pure integer representation (52 stored bits plus the implicit leading bit) before you start bleeding precision. 9710908999008999 needs 54 bits, so it does not fit properly -- thus, the truncation and approximation begin and your end numbers get all finagled.
EDIT: As was pointed out, the first number, which has no mathematical operations done on it, doesn't fit exactly either. But, since you're doing extra math on the second one, you're introducing extra rounding errors not present with the first number. So you'll have to take that into consideration too!
Evaluating the expression 971090899.9008999 involves one operation, a conversion from decimal to the floating-point format.
Evaluating the expression 9710908999008999.0 / 10000000.0 involves three operations:
Converting 9710908999008999.0 from decimal to the floating-point format.
Converting 10000000.0 from decimal to the floating-point format.
Dividing the results of the above operations.
The second of those should be exact in any good C implementation, because the result is exactly representable. However, the other two add rounding errors.
C does not require implementations to convert decimal to floating-point as accurately as possible; it allows some slack. However, a good implementation does convert accurately, using extra precision if necessary. Thus, the single operation on 971090899.9008999 produces a more accurate result than the multiple operations.
Additionally, as we learn from a comment, the C implementation used by the OP converts 9710908999008999.0 to 9710908999008998. This is incorrect by the rules of IEEE-754 for the common round-to-nearest mode. The correct result is 9710908999009000. Both of these candidates are representable in IEEE-754 64-bit binary, and both are equidistant from the source value, 9710908999008999. The usual rounding mode is round-to-nearest, ties-to-even, meaning the candidate with the even low bit should be selected, which is 9710908999009000 (with significand 0x1.1400298aa8174), not 9710908999008998 (with significand 0x1.1400298aa8173). (IEEE 754 defines another round-to-nearest mode: ties-to-away, which selects the candidate with the larger magnitude, which is again 9710908999009000.)
The C standard permits some slack in conversions; either of these two candidates conforms to the C standard, but good implementations also conform to IEEE 754.
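To see the individual roundings on your own implementation, it can help to print the hexadecimal form of each value involved (the %a conversion is C99; the exact output depends on how your implementation converts decimal constants):
#include <stdio.h>

int main(void)
{
    /* One operation: a single decimal -> double conversion. */
    printf("%a\n", 971090899.9008999);

    /* Three operations: two conversions (the second exact) and one divide,
       each individually rounded.                                           */
    printf("%a\n", 9710908999008999.0);
    printf("%a\n", 10000000.0);
    printf("%a\n", 9710908999008999.0 / 10000000.0);
    return 0;
}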

Rounding to specified absolute decimal precision in C90/99

I am working on software that, among other things, converts measured numbers between text and internal (double) representation. A necessary part of the process is to produce text representations with the correct decimal precision based on the statistical uncertainty of the measurement. The needed precision varies with the number, and the least-significant digit in it can be anywhere, including left of the (decimal) units place.
Correct rounding is essential for this process, where "correct" means according to the floating-point rounding mode in effect at the time, or at least in a well-defined rounding mode. As such, I need to be careful about (read: avoid) performing intermediate arithmetic on the numbers being handled, because rounding can be sensitive even to the least-significant bit in the internal representation of a number.
I think I can do almost all the needed formatting reasonably well with the printf family of functions if I first compute the number of significant digits in the required representation:
sprintf(buffer, "%.*e", num_sig_figs - 1, number);
There is one class of corner cases that has so far defeated me, however: the one where the most significant (decimal) digit in the measured number is one place right of the least significant digit of the desired-precision representation. In that case, rounding should yield the least (and only) significant digit in the desired result as either 0 or 1, but I haven't been able to devise a way to perform the rounding in a portable(*) way without risk of changing the result. This is similar to what the MPFR function mpfr_prec_round() could do, except that it works in binary precision, whereas I need to use decimal precision.
For example, in the default rounding mode (round-to-nearest with ties rounded to even):
0.5 expressed to unit (10^0) precision should be "0" or "0e+00"
654 expressed to thousands (10^3) precision should be "1e+03"
0.03125 expressed to tenths (10^-1) precision should be "0" or "0e-01" or even "0e+00"
(*) "Portable" here means that the code accurately expresses the computation in standard, portable C99 (or better, C90). It is understood that the actual result may depend on machine details, and it should depend (and be consistent with) the floating-point rounding mode in effect.
What options do I have?
One simple (albeit fairly inefficient) approach that will always work is to print the full exact decimal value as a string, then do your rounding in decimal manually. This can be achieved by something like
snprintf(buf, sizeof buf, "%.*f", DBL_MANT_DIG-DBL_MIN_EXP, x);
I hope I got that precision right. The idea is that each additional mantissa bit, and each additional negative power of two, takes up one extra decimal place.
You avoid the issue of double rounding by the fact that the decimal value obtained is exact.
Note that double rounding only matters in the default rounding mode (nearest). In other modes, double rounding obtains the same result that would be obtained by a single rounding step, so you can take lots of shortcuts if you like.
There are probably better solutions which I'll post later if I think of them. Note that the above solution will only work on high-quality implementations where the printf family of functions is capable of printing exact decimals. It will fail horribly, for example, on MSVCRT and other low-quality implementations, even some conforming ones.
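As a rough usage sketch of that idea (assuming a libc whose printf prints the full exact decimal expansion; the actual decimal rounding step is only indicated in the comments):
#include <float.h>
#include <stdio.h>

int main(void)
{
    /* 53 - (-1021) = 1074 fractional digits is enough for any finite double;
       the buffer is sized for values with a one-digit integer part.          */
    char buf[DBL_MANT_DIG - DBL_MIN_EXP + 16];
    double x = 0.5;

    snprintf(buf, sizeof buf, "%.*f", DBL_MANT_DIG - DBL_MIN_EXP, x);

    /* The decimal string is exact, so rounding to the desired decimal place
       can now be done on the text, e.g. by inspecting the digit after the
       target position (and the trailing digits for the ties-to-even case).   */
    printf("%.20s...\n", buf);
    return 0;
}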

Resources