IEEE 754-2008 Decimal Floating Point. Why though? - database

IEEE 754-2008 added storage formats for decimal floating-point numbers, i.e. floating-point numbers with a radix of 10 instead of 2. IEEE 854 already defined operations on these kinds of numbers, but without a storage format, and it never saw adoption.
I'd be interested in knowing what use cases the standards committee might have had in mind when adopting these storage formats and associated operations into the standard. I have a hard time coming up with use cases for either.
The only source that I could find that tries to grasp at some straws is this: Intel and Floating-Point states:
Decimal arithmetic also provides a robust, reliable framework for financial applications that are often subject to legal requirements concerning rounding and precision of the results in the areas of banking, telephone billing, tax calculation, currency conversion, insurance, or accounting in general.
I mean, come on. That's definitely not the case. A financial application can't use numbers with rounding behavior that changes depending on the size of the number. All the tax law I know (and I know some) requires that numbers be rounded to a fixed number of digits when calculating tax, regardless of the magnitude of the amounts (e.g. 2 digits for VAT and 0 digits for income tax in 🇨🇭). Supporting this is the fact that DB engines generally provide fixed-point types for decimal numbers (e.g. SQL Server's MONEY or PostgreSQL's numeric).
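To make the fixed-digit requirement concrete, here is a minimal sketch (my own illustration, not from the post) that computes VAT on integer cents; the 7.7% rate, the amount, and the half-up rounding rule are assumptions chosen for the example:

#include <stdio.h>

/* Hypothetical example: VAT rounded to 2 decimal digits on integer cents,
   so the rounding unit never depends on the magnitude of the amount. */
int main(void) {
    long long net_cents = 1234567890123LL;            /* CHF 12,345,678,901.23 */
    long long vat_x1000 = net_cents * 77;             /* 7.7% = net * 77 / 1000 */
    long long vat_cents = (vat_x1000 + 500) / 1000;   /* round half up to whole cents */
    printf("VAT: %lld.%02lld\n", vat_cents / 100, vat_cents % 100);
    return 0;
}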
So my question is: Is there any industry, technology, company etc. that makes use of decimal floating-point numbers?

Related

IEEE Floating Point Numbers: Are they not "precise"? (read: well-defined)

For a long time I thought floating point arithmetic is well-defined and different platforms making the same calculation should get the same results. (Given the same rounding modes.)
Yet Microsoft's SQL Server deems any calculation performed with floating point to have an "imprecise" result, meaning you can't have an index on it. This suggests to me that the guys at Microsoft thought there's a relevant catch regarding floating point.
So what is the catch?
EDIT: You can have indexes on floats in general, just not on computed columns. Also, geodata uses floating point types for the coordinates and those are obviously indexed, too. So it's not a problem of reliable comparison in general.
Floating-point arithmetic is well defined by the IEEE 754 standard. In the documentation you point out, Microsoft has apparently not chosen to adhere to the standard.
There are a variety of issues that make floating-point reproducibility difficult, and you can find Stack Overflow discussions about them by searching for “[floating-point] reproducibility”. However, most of them come down to a few causes: lack of control in high-level languages (the individual floating-point operations are completely reproducible and specified by IEEE 754, and the hardware provides sufficient IEEE 754 conformance, but the high-level language specification does not adequately map language constructs to specific floating-point operations); differences in math library routines (functions such as sin and log are “hard” to compute in some sense, and vendors implement them without what is called correct rounding, so each vendor’s routines have slightly different error characteristics than others); multithreading and other issues that allow operations to be performed in different orders, thus yielding different results; and so on.
In a single system such as Microsoft’s SQL Server, Microsoft presumably could have controlled these issues if they wanted to. Still, there are issues to consider. For example, a database system may have a sum function that computes the sum of many things. For speed, you may wish the sum implementation to have the flexibility to add the elements in any order, so that it can take advantage of multiprocessing or of adding the elements in whatever order they happen to be stored in. But adding floating-point data using elementary add operations of the same floating-point format has different results depending on the order of the elements. To make the sum reproducible, you have to specify the order of operation or use extra precision or other techniques, and then performance suffers.
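As a small illustration of the ordering problem (my own sketch, not part of the answer), the same three doubles summed in two different orders give two different results, which is exactly what a parallel sum runs into:

#include <stdio.h>

int main(void) {
    double a = 1e16, b = -1e16, c = 1.0;
    printf("%.1f\n", (a + b) + c);   /* prints 1.0 */
    printf("%.1f\n", a + (b + c));   /* prints 0.0: b + c rounds back to -1e16 */
    return 0;
}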
So, not making floating-point arithmetic reproducible is a choice that is made, not a consequence of any lack of specification for floating-point arithmetic.
Another problem for database purposes is that even well defined and completely specified floating-point arithmetic has NaN values. (NaN, an abbreviation for Not a Number, represents a floating-point datum that is not a number. A NaN is produced as the result of an operation that has no mathematical result (such as the real square root of a negative number). NaNs act as placeholders so that floating-point operations can continue without interruption, and an application can complete a set of floating-point operations and then take action to replace or otherwise deal with any NaNs that arose.) Because a NaN does not represent a number, it is not equal to anything, not even itself. Comparing two NaNs for equality produces false, even if the NaNs are represented with exactly the same bits. This is a problem for databases: a NaN cannot be used as a key for looking up records, because a NaN search key will never equal a NaN in the key field of a record. Sometimes this is dealt with by defining two different ordering relations: one is the usual mathematical comparison, which defines less than, equal to, and greater than for numbers (and for which all three are false for NaNs), and a second that defines a sort order and is defined for all data, including NaNs.
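A quick sketch (mine, not from the answer) of the NaN self-inequality in C:

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = NAN;                       /* quiet NaN from <math.h> (C99) */
    printf("x == x: %d\n", x == x);       /* prints 0: a NaN equals nothing, not even itself */
    printf("isnan(x): %d\n", isnan(x));   /* nonzero */
    return 0;
}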
It should be noted that each floating-point datum that is not a NaN represents a certain number exactly. There is no imprecision in a floating-point number. A floating-point number does not represent an interval. Floating-point operations approximate real arithmetic in that they return values approximately equal to the exact mathematical results, while floating-point numbers are exact. Elementary floating-point operations are exactly specified by IEEE 754. Lack of reproducibility arises in using different operations (including the same operations with different precisions), in using operations in different orders, or in using operations that do not conform to the IEEE 754 standard.
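For example, the double nearest to 0.1 is itself one exact number; a small sketch, assuming IEEE 754 binary64 and a printf that prints exact decimal expansions (e.g. glibc):

#include <stdio.h>

int main(void) {
    /* The stored value is exactly
       0.1000000000000000055511151231257827021181583404541015625 */
    printf("%.55f\n", 0.1);
    return 0;
}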

Why dividing a float by a power of 10 is less accurate than typing the number directly?

When I run
printf("%.8f\n", 971090899.9008999);
printf("%.8f\n", 9710908999008999.0 / 10000000.0);
I get
971090899.90089989
971090899.90089977
I know why neither is exact, but what I don't understand is why doesn't the second match the first?
I thought basic arithmetic operations (+ - * /) were always as accurate as possible...
Isn't the first number a more accurate result of the division than the second?
Judging from the numbers you're using and based on the standard IEEE 754 floating point standard, it seems the left hand side of the division is too large to be completely encompassed in the mantissa (significand) of a 64-bit double.
You've got 53 bits of significand (52 explicitly stored plus an implicit leading 1) before you start bleeding precision. 9710908999008999 needs 54 bits in its representation, so it does not fit exactly -- thus, the rounding and approximation begins and your end numbers get all finagled.
EDIT: As was pointed out, the first number, which has no mathematical operations done on it, doesn't fit either. But since you're doing extra math on the second one, you're introducing extra rounding errors not present with the first number. So you'll have to take that into consideration too!
Evaluating the expression 971090899.9008999 involves one operation, a conversion from decimal to the floating-point format.
Evaluating the expression 9710908999008999.0 / 10000000.0 involves three operations:
Converting 9710908999008999.0 from decimal to the floating-point format.
Converting 10000000.0 from decimal to the floating-point format.
Dividing the results of the above operations.
The second of those should be exact in any good C implementation, because the result is exactly representable. However, the other two add rounding errors.
C does not require implementations to convert decimal to floating-point as accurately as possible; it allows some slack. However, a good implementation does convert accurately, using extra precision if necessary. Thus, the single operation on 971090899.9008999 produces a more accurate result than the multiple operations.
Additionally, as we learn from a comment, the C implementation used by the OP converts 9710908999008999.0 to 9710908999008998. This is incorrect by the rules of IEEE-754 for the common round-to-nearest mode. The correct result is 9710908999009000. Both of these candidates are representable in IEEE-754 64-bit binary, and both are equidistant from the source value, 9710908999008999. The usual rounding mode is round-to-nearest, ties-to-even, meaning the candidate with the even low bit should be selected, which is 9710908999009000 (with significand 0x1.1400298aa8174), not 9710908999008998 (with significand 0x1.1400298aa8173). (IEEE 754 defines another round-to-nearest mode: ties-to-away, which selects the candidate with the larger magnitude, which is again 9710908999009000.)
The C standard permits some slack in conversions; either of these two candidates conforms to the C standard, but good implementations also conform to IEEE 754.
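If you want to see what your own implementation does with these conversions, a small sketch like this prints the values in hex floating point (%a, C99); the expected significand on the third line is the one quoted in the answer above, and everything else depends on your platform's conversion quality:

#include <stdio.h>

int main(void) {
    printf("%.13a\n", 971090899.9008999);
    printf("%.13a\n", 9710908999008999.0 / 10000000.0);
    printf("%.13a\n", 9710908999008999.0);  /* correctly rounded: 0x1.1400298aa8174p+53 */
    return 0;
}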

Rounding to specified absolute decimal precision in C90/99

I am working on software that, among other things, converts measured numbers between text and internal (double) representation. A necessary part of the process is to produce text representations with the correct decimal precision based on the statistical uncertainty of the measurement. The needed precision varies with the number, and the least-significant digit in it can be anywhere, including left of the (decimal) units place.
Correct rounding is essential for this process, where "correct" means according to the floating-point rounding mode in effect at the time, or at least in a well-defined rounding mode. As such, I need to be careful about (read: avoid) performing intermediate arithmetic on the numbers being handled, because rounding can be sensitive even to the least-significant bit in the internal representation of a number.
I think I can do almost all the needed formatting reasonably well with the printf family of functions if I first compute the number of significant digits in the required representation:
sprintf(buffer, "%.*e", num_sig_figs - 1, number);
There is one class of corner cases that has so far defeated me, however: the one where the most significant (decimal) digit in the measured number is one place right of the least significant digit of the desired-precision representation. In that case, rounding should yield the least (and only) significant digit in the desired result as either 0 or 1, but I haven't been able to devise a way to perform the rounding in a portable(*) way without risk of changing the result. This is similar to what the MPFR function mpfr_prec_round() could do, except that it works in binary precision, whereas I need to use decimal precision.
For example, in the default rounding mode (round-to-nearest with ties rounded to even):
0.5 expressed to unit (10^0) precision should be "0" or "0e+00"
654 expressed to thousands (10^3) precision should be "1e+03"
0.03125 expressed to tenths (10^-1) precision should be "0" or "0e-01" or even "0e+00"
(*) "Portable" here means that the code accurately expresses the computation in standard, portable C99 (or better, C90). It is understood that the actual result may depend on machine details, and it should depend (and be consistent with) the floating-point rounding mode in effect.
What options do I have?
One simple (albeit fairly inefficient) approach that will always work is to print the full exact decimal value as a string, then do your rounding in decimal manually. This can be achieved by something like
snprintf(buf, sizeof buf, "%.*f", DBL_MANT_DIG-DBL_MIN_EXP, x);
I hope I got that precision right. The idea is that each additional mantissa bit, and each additional negative power of two, takes up one extra decimal place.
You avoid the issue of double rounding by the fact that the decimal value obtained is exact.
Note that double rounding only matters in the default rounding mode (nearest). In other modes, double rounding obtains the same result that would be obtained by a single rounding step, so you can take lots of shortcuts if you like.
There are probably better solutions which I'll post later if I think of them. Note that the above solution will only work on high-quality implementations where the printf family of functions is capable of printing exact decimals. It will fail horribly, for example, on MSVCRT and other low-quality implementations, even some conforming ones.
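A rough sketch of the exact-printing step described above; the buffer sizing follows the reasoning in the answer, and the decimal rounding on the digit string is only outlined, so treat this as a starting point rather than a finished routine:

#include <float.h>
#include <stdio.h>

int main(void) {
    double x = 0.5;
    /* Worst case: up to DBL_MAX_10_EXP + 1 integer digits, a decimal point,
       DBL_MANT_DIG - DBL_MIN_EXP fractional digits, plus sign and NUL. */
    static char buf[DBL_MAX_10_EXP + 1 + 1 + (DBL_MANT_DIG - DBL_MIN_EXP) + 2];
    snprintf(buf, sizeof buf, "%.*f", DBL_MANT_DIG - DBL_MIN_EXP, x);
    printf("%s\n", buf);  /* the exact decimal expansion of x */
    /* Rounding to the desired decimal place would then be done on the digit
       string itself: inspect the digit after the target place, apply the
       chosen rule (e.g. ties-to-even), and propagate any carry. */
    return 0;
}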

Binary, Floats, and Modern Computers

I have been reading a lot about floats and computer-processed floating-point operations. The biggest question I see when reading about them is why are they so inaccurate? I understand this is because binary cannot accurately represent all real numbers, so the numbers are rounded to the 'best' approximation.
My question is, knowing this, why do we still use binary as the base for computer operations? Surely using a larger base number than 2 would increase the accuracy of floating-point operations exponentially, would it not?
What are the advantages of using a binary number system for computers as opposed to another base, and has another base ever been tried? Or is it even possible?
First of all: You cannot represent all real numbers even when using, say, base 100. But you already know this. Anyway, this means: inaccuracy will always arise due to 'not being able to represent all real numbers'.
Now let's talk about "what can higher bases bring to you when doing math?": Higher bases bring exactly 'nothing' in terms of precision. Why?
If you want to use base 4, then a 16 digit base 4 number provides exactly 4^16 different values.
But you can get the same number of different values from a 32 digit base 2 number (2^32 = 4^16).
As another answer already said: Transistors can either be on or off. So your newly designed base 4 registers need to be an abstraction over (base 2) ON/OFF 'bits'. This means: Use two 'bits' to represent a base 4 digit. But you'll still get exactly 2^N levels by spending N 'bits' (or N/2 base-4 digits). You can only get better accuracy by spending more bits, not by increasing the base. Which base you 'imagine/abstract' your numbers to be in (e.g. like how printf can print these base-2 numbers in base-10) is really just a matter of abstraction and not precision.
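A tiny sketch of that point: the same 32-bit value written with 32 base-2 digits or with 16 base-4 digits carries exactly the same information, since 2^32 = 4^16 (the value 0xDEADBEEF is arbitrary):

#include <stdio.h>

int main(void) {
    unsigned v = 0xDEADBEEFu;
    for (int i = 31; i >= 0; --i)             /* 32 base-2 digits */
        putchar('0' + ((v >> i) & 1u));
    putchar('\n');
    for (int i = 15; i >= 0; --i)             /* 16 base-4 digits, 2 bits each */
        putchar('0' + ((v >> (2 * i)) & 3u));
    putchar('\n');
    return 0;
}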
Computers are built on transistors, which have a "switched on" state, and a "switched off" state. This corresponds to high and low voltage. Pretty much all digital integrated circuits work in this binary fashion.
Ignoring the fact that transistors just simply work this way, using a different base (e.g. base 3) would require these circuits to operate at an intermediate voltage state (or several) as well as 0V and their highest operating voltage. This is more complicated, and can result in problems at high frequencies - how can you tell whether a signal is just transitioning between 2V and 0V, or actually at 1V?
When we get down to the floating point level, we are (as nhahtdh mentioned in their answer) mapping an infinite space of numbers down to a finite storage space. It's an absolute guarantee that we'll lose some precision. One advantage of IEEE floats, though, is that the precision is relative to the magnitude of the value.
Update: You should also check out Tunguska, a ternary computer emulator. It uses base-3 instead of base-2, which makes for some interesting (albeit mind-bending) concepts.
We are essentially mapping a finite space to an infinite set of real numbers. So it is not even a problem of base anyway.
Base 2 is chosen, like Polynomial said, for implementation reasons, as it is easier to distinguish 2 energy levels.
We either spend more space to represent more numbers / increase precision, or limit the range that we want to encode, or a mix of the two.
It boils down to getting the most from the available chip area.
If you use on/off switches to represent numbers, you can't get more precision per switch than with a base-2 representation. This is simply because N switches can represent 2^N quantities no matter what you choose these values to be. There were early machines that used base 16 floating point digits, but each of these needed 4 binary bits, so the overall precision per bit was the same as base 2 (actually somewhat less due to edge cases).
If you choose a base that's not a power of 2, precision is obviously lost. For example, you need 4 bits to represent one decimal digit, but 6 of the available values of those 4 bits are never used. This system is called binary-coded decimal and it's still used occasionally, usually when doing computations with money.
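For illustration (my own sketch, not from the answer), packing two decimal digits in BCD shows how each digit occupies a 4-bit nibble whose values 10-15 go unused:

#include <stdio.h>

int main(void) {
    unsigned char bcd = (4 << 4) | 2;   /* decimal 42 packed as 0x42 */
    printf("packed byte: 0x%02X\n", (unsigned)bcd);
    printf("tens: %d, ones: %d\n", bcd >> 4, bcd & 0x0F);
    return 0;
}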
Multi-level logic could efficiently implement other bases, but at least with current chip technologies, it turns out to be very expensive to implement more than 2 levels. Even quantum computers are being designed assuming two quantum levels: quantum bits or qubits.
The nature of the world and math is what makes the floating point situation hopeless. There is a hierarchy of real numbers: Integer -> Rational -> Algebraic -> Transcendental. There's a wonderful mathematical proof, Cantor diagonalization, that most numbers (a "bigger infinity" than the other sets) are transcendental. Yet no matter what floating point system you choose, there will still be lowly rational numbers with no perfect representation (e.g. 1/3 in base 10). This is our universe. No amount of clever hardware design will change it.
Software can and does use rational representations, storing a numerator and denominator as integers. However with these your programmer's hands are tied. For example, square root is not "closed." Sqrt(2) has no rational representation.
There has been research with algebraic number reps and "lazy" reps of arbitrary reals that produce more digits as needed. Most work of this type seems to be in computational geometry.
Your first paragraph makes sense, but the second is a non sequitur. A larger base would not make a difference to the precision.
The precision of a number depends on the amount of storage that is used for it - for example a 16 bit binary number has the same precision as a 2 digit base-256 number - both take up the same amount of information.
See the usual floating point reference for more detail - and it generalises to all bases.
Yes, computers have been built using other bases - I know of ones that use base 10 (decimal); cf. Wikipedia.
Yes, there are/have been computers that use other than binary (i.e., other than base 2 representations and arithmetic):
Decimal computers.
Designers of computing systems have looked into many alternatives. But it's hard to find a model that's as simple to implement in a physical device as one using two discrete states. So start with a binary circuit that's very easy and cheap to build and work up to a computer with complex operations. That's the history of binary in a nutshell.
I am not a EE, so everything I say below may be totally wrong. But...
The advantage of binary is that it maps very cleanly to distinguishing between on/off (or, more accurately, high/low voltage) states in real circuits. Trying to distinguish between multiple voltages would, I think, present a bit more of a challenge.
This may go completely out the window if quantum computers make it out of the lab.
There are 2 issues arising from the use of binary floating-point numbers to represent mathematical real numbers -- well, there are probably a lot more issues, but 2 is enough for the time being.
All computer numbers are finite, so any number which requires an infinite number of digits cannot be accurately represented on a computer, whatever number base is chosen. So that deals with pi, e, and most of the other real numbers.
Whatever base is chosen will have difficulties representing some fractions finitely. Base 2, for example, can only approximate a fraction with a factor of 3 in the denominator, and bases 5 and 7 have the same problem.
Over the years computers with circuitry based on devices with more than 2 states have been built. The old Soviet Union developed a series of computers with 3-state devices and at least one US computer manufacturer at one time offered computers using 10-state devices for arithmetic.
I suspect that binary representation has won out (so far) because it is simple, both to reason about and to implement with current electronic devices.
I vote that we move to a rational number storage system: two 32 bit integers that will evaluate as p/q. Multiplication and division will be really cheap operations. Yeah, there will be redundantly represented numbers (1/2 = 2/4), but who really uses the full dynamic range of a 64 bit double anyway?
I'm neither an electrical engineer nor a mathematician, so take that into consideration when I make the following statement:
All floating point numbers can be represented as integers.

Why can't I multiply a float? [duplicate]

Possible Duplicate:
Dealing with accuracy problems in floating-point numbers
I was quite surprised when I tried to multiply a float in C (with GCC 3.2) and it did not do what I expected. As a sample:
#include <stdio.h>

int main(void) {
    float nb = 3.11f;
    nb *= 10;
    printf("%f\n", nb);
    return 0;
}
Displays: 31.099998
I am curious about the way floats are implemented and why they produce this unexpected behavior.
First off, you can multiply floats. The problem you have is not the multiplication itself, but the original number you've used. Multiplication can lose some precision, but here the original number you multiplied had already lost precision.
This is actually expected behavior: floats are implemented using a binary representation, which means they can't accurately represent all decimal values.
See MSDN for more information.
You can also see in the description of float that it has 6-7 significant digits accuracy. In your example if you round 31.099998 to 7 significant digits you will get 31.1 so it still works as expected here.
The double type would of course be more accurate, but it still has rounding error due to its binary representation, while the number you wrote is decimal.
If you want complete accuracy for decimal numbers, you should use a decimal type. This type exists in languages like C#. http://msdn.microsoft.com/en-us/library/system.decimal.aspx
You can also use a rational number representation: two integers, which will give you complete accuracy as long as you can represent the number as a ratio of two integers.
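A minimal sketch of that idea (the struct and the fixed 311/100 starting value are made up for illustration, and it does no reduction or overflow checking):

#include <stdio.h>

struct rational { long long num, den; };   /* value is num/den */

int main(void) {
    struct rational r    = { 311, 100 };   /* 3.11 exactly */
    struct rational ten  = { 10, 1 };
    struct rational prod = { r.num * ten.num, r.den * ten.den };
    printf("%lld/%lld\n", prod.num, prod.den);   /* 3110/100, i.e. 31.1 exactly */
    return 0;
}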
This is working as expected. Computers have finite precision: they represent values with a fixed number of binary digits, which leads to floating point inaccuracies.
The Floating point wikipedia page goes into far more detail on the representation and resulting accuracy problems than I could here :)
Interesting real-world side-note: this is partly why a lot of money calculations are done using integers (cents) - don't let the computer lose money with lack of precision! I want my $0.00001!
The number 3.11 cannot be represented in binary. The closest you can get with 24 significant bits is 11.0001110000101000111101, which works out to 3.1099998950958251953125 in decimal.
If your number 3.11 is supposed to represent a monetary amount, then you need to use a decimal representation.
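Printing extra digits makes the stored values visible; a small sketch, assuming IEEE 754 single precision arithmetic and a printf that prints exact decimal expansions:

#include <stdio.h>

int main(void) {
    float nb = 3.11f;
    printf("%.25f\n", nb);   /* 3.1099998950958251953125000 */
    nb *= 10;                /* the product is rounded to float again */
    printf("%.25f\n", nb);   /* 31.0999984741210937500000000 */
    return 0;
}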
In the Python communities we often see people surprised at this, so there are well-tested-and-debugged FAQs and tutorial sections on the issue (of course they're phrased in terms of Python, not C, but since Python delegates float arithmetic to the underlying C and hardware anyway, all the descriptions of float's mechanics still apply).
It's not the multiplication's fault, of course -- remove the statement where you multiply nb and you'll see similar issues anyway.
From the Wikipedia article:
The fact that floating-point numbers cannot precisely represent all real numbers, and that floating-point operations cannot precisely represent true arithmetic operations, leads to many surprising situations. This is related to the finite precision with which computers generally represent numbers.
Floating point numbers are not precise because they use base 2 (binary: either 0 or 1) instead of base 10, and converting between base 2 and base 10, as many have stated before, causes rounding issues.
