Float data type uncertainty - c

I am doing a numerical analysis of a math software I developed. I want to identify what is the uncertainty of my result. Being f() my method and x an input value, I want to identify y of my result as f(x) +/- y. My f() method has multiple operations between float variables. To study the error propagation occurred in f(), I have to apply the Statistical Propagation of Uncertainty formulas and in order to do so I have to know the uncertainty of a float variable.
I do understand the architecture of a float variable as specified in the IEEE 754 standard and the rounding error converting a decimal value to float inherent to the latter.
From what I understood of the literature, the FLT_EPSILON macro in http://www.cplusplus.com/reference/cfloat/
defines my y value but this quick test proves it wrong:
float f1 = 1.234567f;
float f2 = 1.234567f + 1.192092896e-7f;
float f3 = 1.234567f + 1.192092895e-7f;
printf("Inicial:\t%f\n", f1);
printf("Inicial:\t%f\n", f2);
printf("Inicial:\t%f\n\n", f3);
Output:
Inicial: 1.234567
Inicial: 1.234567
Inicial: 1.234567
When the expected output should be:
Inicial: 1.234567
Inicial: 1.234568 <---
Inicial: 1.234567
What is it that I am wrong about?
Should not the float value of x + FLT_EPSILON and x - FLT_EPSILON be the same?
EDIT: My question is being R the float value of x, what is the y value that x + y || x - y equals the same R float value?

Propagation of uncertainty is from the field of statistics and refers to how uncertainties in inputs affect mathematical functions of them. The analysis of errors that occur in computational arithmetic is numerical analysis.
FLT_EPSILON is not a measure of uncertainty or error in floating-point results. It is the distance between 1 and the next value representable in the float type. Hence, it is the size of steps between representable numbers at the magnitude of 1.
When you convert a decimal numeral to floating-point, the rounding error that results may have a magnitude of up to ½ the step size when the common round-to-nearest mode is used. The reason the bound is ½ the step size is that for any number x (within the finite domain of the floating-point format), there is a representable value within ½ the step size (inclusive). This is because, if there is a representable number more than ½ the step size in one direction, there is a representable number less than ½ the step size in the other direction.
The step size varies with the magnitudes of the numbers. With binary floating-point, it doubles at 2, and again at 4, then 8, and so on. Below 1, it halves, and again at ½, ¼, and so on.
When you perform floating-point arithmetic operations, the rounding that occurs in the computation may compound or cancel previous errors. There is no general formula for the final error.
The two numerals used in your sample code, 1.192092897e-7f and 1.192092896e-7f, are so close together that they convert to the same float value, 2^-23. That is why there is no difference in your f2 and f3.
There is a difference between f1 and f2, but you did not print enough digits to display it.
You ask “Should not the float value of x + FLT_EPSILON and x - FLT_EPSILON be the same?”, but your code does not contain x - FLT_EPSILON.
Re: “My question is being R the float value of x, what is the y value that x + y || x - y equals the same R float value?” This is trivially satisfied by y = 0. Did you mean to ask what is the largest value of y that satisfies the condition? That is a bit complicated.
The step size for a number x is called the ULP of x, which we may consider as a function ULP(x). ULP stands for Unit of Least Precision. It is the place value of the least digit in the floating-point representation of x. It is not a constant; it is a function of x.
For most values representable in a floating-point format, the largest y that satisfies your condition is ½ ULP(x) if the least digit in the floating-point representation of x is even and, if the digit is odd, it is just under ½ ULP(x). This complication arises from the rule that the results of arithmetic are rounded to the nearest representable value and, in case of a tie, the value with the even low digit is chosen. Thus, adding ½ ULP(x) to x will yield a tie that will round to x if the low digit is even, but will not round to x if the low digit is odd.
However, for x that are on the boundary where the ULP changes, the largest y that satisfies your condition is ¼ ULP(x). This is because, just below x (in magnitude), the step size changes, and the next number lower than x is half of x’s step size away instead of the usual full step size. So you can only go halfway toward that value before changing the result of the subtraction, so the most y can be is ¼ ULP(x).
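The three cases above (even low digit, odd low digit, and a boundary where the ULP changes) can be checked with a small sketch. The helper name absorbs is hypothetical, and the sketch assumes binary32 float, round-to-nearest with ties to even, and arithmetic actually carried out in float precision.

```c
#include <stdbool.h>

/* Hypothetical helper: true when both x + y and x - y round back to x.
   The volatile temporaries force each result to be stored as a float,
   which guards against excess intermediate precision. */
bool absorbs(float x, float y)
{
    volatile float up = x + y;
    volatile float down = x - y;
    return up == x && down == x;
}
```

For x in [1, 2), ULP(x) is FLT_EPSILON. With 1.5f (even low digit), absorbs(1.5f, FLT_EPSILON / 2) is true. With 1.0f + FLT_EPSILON (odd low digit), the same y fails. With 1.0f, which sits on a boundary, FLT_EPSILON / 4 is absorbed but FLT_EPSILON / 2 is not, because 1 - FLT_EPSILON / 2 is itself representable.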

float is a 32-bit IEEE 754 single-precision floating-point number: 1 bit for the sign, 8 bits for the exponent, and 23 stored bits for the significand (plus an implicit leading bit), i.e. float has about 7 decimal digits of precision.
Increase the number of digits printed by printf to see more, but after about 7 significant digits it's just noise:
#include <stdio.h>
int main(void) {
    float f1 = 1.234567f;
    float f2 = 1.234567f + 1.192092897e-7f;
    float f3 = 1.234567f + 1.192092896e-7f;
    printf("Inicial:\t%.16f\n", f1);
    printf("Inicial:\t%.16f\n", f2);
    printf("Inicial:\t%.16f\n\n", f3);
    return 0;
}
Output:
Inicial: 1.2345670461654663
Inicial: 1.2345671653747559
Inicial: 1.2345671653747559

float f1 = 1.234567f;
float f2 = f1 + 1.192092897e-7f;
float f3 = f1 + 1.192092896e-7f;
printf("Inicial:\t%.20f\n", f1);
printf("Inicial:\t%.20f\n", f2);
printf("Inicial:\t%.20f\n\n", f3);
Output:
Inicial: 1.23456704616546630000
Inicial: 1.23456716537475590000
Inicial: 1.23456716537475590000
No, your expectation is wrong.
The first printf call prints f1 with nothing added to it, so it is just 1.234567f.

Related

IEEE-754: "smallest" overflow condition

Before I start, just some background information:
I'm running a bare-metal application on an ARM7 microcontroller (LPC2294/01) compiled in Keil uVision3, using the compiler standard math library (which is IEEE-754 compliant).
The issue:
I'm having trouble wrapping my head around what exactly constitutes an 'overflow' on the sum of 2 single-precision floating point inputs.
Initially, I was under the impression that if I attempted to add any positive value to the largest value that can be represented by IEEE-754 notation, the result would generate an overflow exception.
So for instance, suppose I have:
a = 0x7f7fffff (ie. 3.4028235..E38);
b = 0x3f800000 (ie. 1.0)
I expected that summing these two values would result in overflow as defined in IEEE-754. To my initial surprise, the result simply returned the value of 'a' with no exception being flagged.
So then I thought, since the precision (or resolution if you prefer) decreases as the value being represented increases, it's likely the value '1' in this case is being effectively rounded down to 0 due to its relative insignificance.
So that begged the question: What would be the smallest value of 'b' in this case that would cause an overflow exception? Does it depend on the specific implementation of IEEE-754?
Maybe it's as simple as me not understanding how to determine the minimum 'significant' precision in this particular case, but given the code below, why would the second sum cause an overflow and not the first?
static union sFloatConversion32
{
    unsigned int unsigned32Value;
    float floatValue;
} sFloatConversion32;
t_bool test_Float32_Addition(void)
{
    float a;
    float b;
    float c;
    sFloatConversion32.unsigned32Value = 0x7f7fffff;
    a = sFloatConversion32.floatValue;
    sFloatConversion32.unsigned32Value = 0x72ffffff;
    b = sFloatConversion32.floatValue;
    /* This sum returns (c = a) without overflow */
    c = a + b;
    sFloatConversion32.unsigned32Value = 0x73000000;
    b = sFloatConversion32.floatValue;
    /* This sum, however, causes an overflow exception */
    c = a + b;
}
Is there a generalized rule that can be applied such that it would be possible to know ahead of time (ie. without performing the sum), that given two floats, their sum will cause an overflow as defined by IEEE-754?
Overflow occurs when the result is affected by the range of the format. As long as normal rounding keeps the result within the finite range, no overflow occurs, because the result is the same as it would be if the exponent were unbounded—the result was reduced by the normal rounding, before range was considered. So there is no exception due to range.
When the rounded result does not fit into the finite range of the format, then a finite result cannot be produced, so an overflow exception occurs and infinity is produced.
In IEEE 754, a normal operation is in effect two steps:
Calculate the exact mathematical result.
Round the exact mathematical result to the nearest representable value.
IEEE 754 defines overflow to occur if and only if the result of the above exceeds in magnitude the largest representable finite value. In other words, overflow does not occur just because you went above the largest representable value but only if you go so far above the largest representable value that the normal way arithmetic works in floating-point does not work.
So, if you start with the largest representable value and add a small number to it, the result would simply round to the largest representable value anyway (when using round-to-nearest). IEEE 754 regards this as normal—all arithmetic operations round, and if that rounding kept the result in bounds, that is normal and unexceptionable. Even if the exponent range were unbounded, normal rounding would have produced the same result. Since this is a normal result not affected by the limited range, nothing exceptional has occurred.
Overflow occurs only when the mathematical result is so large that rounding would produce the next higher number if we were not limited by the exponent. (But, since we have reached the limits of the exponent range, we must return infinity.)
The largest representable value in IEEE-754 basic 32-bit binary floating-point is 2^128 − 2^104. At this point, the steps between representable numbers are in units of 2^104. With the round-to-nearest rule, adding any number less than half a step, 2^103, to this will round to 2^128 − 2^104, and no overflow occurs. If you add a number greater than 2^103, then the result would round to 2^128 if the exponent could go that high. Instead, infinity is produced and an overflow exception occurs. (If you add exactly 2^103, the rule for ties is used. This rule says to choose the candidate with the even low bit. That produces 2^128, so it also overflows.)
So, with round-to-nearest, overflow occurs at the midpoint of a step. With other rounding rules, overflow occurs at different points. With round-toward-infinity (round up), adding any positive value, even 2^-149, to 2^128 − 2^104 will cause an overflow. With round-toward-zero, adding any value less than 2^104 to 2^128 − 2^104 will not overflow.
Does it depend on the specific implementation of IEEE-754?
Yes and the rounding mode active at the time.
Consider the step between FLT_MAX and the float just before it.
float max = FLT_MAX;
float before_max = nextafterf(max, 0.0f);
float delta1 = max - before_max;
printf("max: %- 20a %.*g\n", max, FLT_DECIMAL_DIG, max);
printf("b4max: %- 20a %.*g\n", before_max, FLT_DECIMAL_DIG, before_max);
printf("1st d: %- 20a %.*g\n", delta1, FLT_DECIMAL_DIG, delta1);
// Typical output
// max: 0x1.fffffep+127 3.40282347e+38
// b4max: 0x1.fffffcp+127 3.40282326e+38
// 1st d: 0x1p+104 2.02824096e+31
FLT_MAX is about twice the smallest float that has the same step size (ULP). Think of that smaller float as having all its explicit precision bits cleared, whereas FLT_MAX has them all set.
float m0 = nextafterf(max/2, max);
printf("m0: %- 20a %.*g\n", m0, FLT_DECIMAL_DIG, m0);
// m0: 0x1p+127 1.70141183e+38
Now compare this to FLT_EPSILON, the smallest step from 1.0 to the next larger float:
float eps = FLT_EPSILON;
printf("epsil: %- 20a %.*g\n", eps, FLT_DECIMAL_DIG, eps);
// Output
// epsil: 0x1p-23 1.1920929e-07
Notice the ratio delta1/m0 is FLT_EPSILON.
float r = delta1/m0;
printf("r: %- 20a %.*g\n", r, FLT_DECIMAL_DIG, r);
// r: 0x1p-23 1.1920929e-07
Consider the typical rounding mode of rounding to nearest, ties to even.
Now let us try adding 1/2*delta1 to FLT_MAX and then try adding the next smaller float.
float sum = max + delta1/2;
printf("sum: % -20a %.*g\n", sum, FLT_DECIMAL_DIG, sum);
sum = max + nextafterf(delta1/2, 0);
printf("sum: % -20a %.*g\n", sum, FLT_DECIMAL_DIG, sum);
// sum: inf inf
// sum: 0x1.fffffep+127 3.40282347e+38
IEEE-754: “smallest” overflow condition
We can see the smallest delta is about FLT_MAX*1/2*1/2*FLT_EPSILON.
float small = FLT_MAX*0.25f*FLT_EPSILON;
printf("small: %- 20a %.*g\n", small, FLT_DECIMAL_DIG, small);
printf("sum: % -20a %.*g\n", max+small, FLT_DECIMAL_DIG, max+small);
small = nextafterf(small, max);
printf("sum: % -20a %.*g\n", max+small, FLT_DECIMAL_DIG, max+small);
// sum: 0x1.fffffep+127 3.40282347e+38
// sum: inf inf
Given the various possible encodings for float, your results may differ, yet this approach gives an idea of how to determine the smallest delta that causes overflow.
Run this program long enough and see what will happen:
float x = 10000000.0f;
while(1)
{
    printf("%f\n", x);
    x += 1.0f;
}
I think it will answer your question.
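For readers who would rather not wait for the output to stall, here is a bounded variant of that loop as a sketch. The function name first_stuck_value is hypothetical, and the sketch assumes binary32 float, round-to-nearest, and a finite, non-NaN starting value.

```c
/* Hypothetical bounded variant of the loop above: stop as soon as
   adding 1.0f no longer changes x, and return that value.
   The volatile temporaries keep each sum in float precision. */
float first_stuck_value(float start)
{
    volatile float x = start;
    for (;;) {
        volatile float next = x + 1.0f;
        if (next == x)
            return x;
        x = next;
    }
}
```

Starting from 10000000.0f, this returns 16777216.0f (2^24). At that magnitude the step between floats is 2, so adding 1.0f lands exactly halfway between two representable values and the tie rounds back to 16777216.0f. The overflow question has the same shape, only at a much larger magnitude.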

Efficient floating-point division with constant integer divisors

A recent question, whether compilers are allowed to replace floating-point division with floating-point multiplication, inspired me to ask this question.
Under the stringent requirement, that the results after code transformation shall be bit-wise identical to the actual division operation,
it is trivial to see that for binary IEEE-754 arithmetic, this is possible for divisors that are a power of two. As long as the reciprocal
of the divisor is representable, multiplying by the reciprocal of the divisor delivers results identical to the division. For example, multiplication by 0.5 can replace division by 2.0.
One then wonders for what other divisors such replacements work, assuming we allow any short instruction sequence that replaces division but runs significantly faster, while delivering bit-identical results. In particular allow fused multiply-add operations in addition to plain multiplication.
In comments I pointed to the following relevant paper:
Nicolas Brisebarre, Jean-Michel Muller, and Saurabh Kumar Raina. Accelerating correctly rounded floating-point division when the divisor is known in advance. IEEE Transactions on Computers, Vol. 53, No. 8, August 2004, pp. 1069-1072.
The technique advocated by the authors of the paper precomputes the reciprocal of the divisor y as a normalized head-tail pair zh:zl as follows: zh = 1 / y, zl = fma (-y, zh, 1) / y. Later, the division q = x / y is then computed as q = fma (zh, x, zl * x). The paper derives various conditions that divisor y must satisfy for this algorithm to work. As one readily observes, this algorithm has problems with infinities and zero when the signs of head and tail differ. More importantly, it will fail to deliver correct results for dividends x that are very small in magnitude, because computation of the quotient tail, zl * x, suffers from underflow.
The paper also makes a passing reference to an alternative FMA-based division algorithm, pioneered by Peter Markstein when he was at IBM. The relevant reference is:
P. W. Markstein. Computation of elementary functions on the IBM RISC System/6000 processor. IBM Journal of Research & Development, Vol. 34, No. 1, January 1990, pp. 111-119
In Markstein's algorithm, one first computes a reciprocal rc, from which an initial quotient q = x * rc is formed. Then, the remainder of the division is computed accurately with an FMA as r = fma (-y, q, x), and an improved, more accurate quotient is finally computed as q = fma (r, rc, q).
This algorithm also has issues for x that are zeroes or infinities (easily worked around with appropriate conditional execution), but exhaustive testing using IEEE-754 single-precision float data shows that it delivers the correct quotient across all possible dividends x for many divisors y, among these many small integers. This C code implements it:
/* precompute reciprocal */
rc = 1.0f / y;
/* compute quotient q=x/y */
q = x * rc;
if ((x != 0) && (!isinf(x))) {
    r = fmaf (-y, q, x);
    q = fmaf (r, rc, q);
}
On most processor architectures, this should translate into a branchless sequence of instructions, using either predication, conditional moves, or select-type instructions. To give a concrete example: For division by 3.0f, the nvcc compiler of CUDA 7.5 generates the following machine code for a Kepler-class GPU:
LDG.E R5, [R2]; // load x
FSETP.NEU.AND P0, PT, |R5|, +INF , PT; // pred0 = fabsf(x) != INF
FMUL32I R2, R5, 0.3333333432674408; // q = x * (1.0f/3.0f)
FSETP.NEU.AND P0, PT, R5, RZ, P0; // pred0 = (x != 0.0f) && (fabsf(x) != INF)
FFMA R5, R2, -3, R5; // r = fmaf (q, -3.0f, x);
MOV R4, R2; // q
@P0 FFMA R4, R5, c[0x2][0x0], R2; // if (pred0) q = fmaf (r, (1.0f/3.0f), q)
ST.E [R6], R4; // store q
For my experiments, I wrote the tiny C test program shown below that steps through integer divisors in increasing order and for each of them exhaustively tests the above code sequence against the proper division. It prints a list of the divisors that passed this exhaustive test. Partial output looks as follows:
PASS: 1, 2, 3, 4, 5, 7, 8, 9, 11, 13, 15, 16, 17, 19, 21, 23, 25, 27, 29, 31, 32, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 64, 65, 67, 69,
To incorporate the replacement algorithm into a compiler as an optimization, a whitelist of divisors to which the above code transformation can safely be applied is impractical. The output of the program so far (at a rate of about one result per minute) suggests that the fast code works correctly across all possible encodings of x for those divisors y that are odd integers or are powers of two. Anecdotal evidence, not a proof, of course.
What set of mathematical conditions can determine a-priori whether the transformation of division into the above code sequence is safe? Answers can assume that all the floating-point operations are performed in the default rounding mode of "round to nearest or even".
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
int main (void)
{
    float r, q, x, y, rc;
    volatile union {
        float f;
        unsigned int i;
    } arg, res, ref;
    int err;
    y = 1.0f;
    printf ("PASS: ");
    while (1) {
        /* precompute reciprocal */
        rc = 1.0f / y;
        arg.i = 0x80000000;
        err = 0;
        do {
            /* do the division, fast */
            x = arg.f;
            q = x * rc;
            if ((x != 0) && (!isinf(x))) {
                r = fmaf (-y, q, x);
                q = fmaf (r, rc, q);
            }
            res.f = q;
            /* compute the reference, slowly */
            ref.f = x / y;
            if (res.i != ref.i) {
                err = 1;
                break;
            }
            arg.i--;
        } while (arg.i != 0x80000000);
        if (!err) printf ("%g, ", y);
        y += 1.0f;
    }
    return EXIT_SUCCESS;
}
Let me restart for the third time. We are trying to accelerate
q = x / y
where y is an integer constant, and q, x, and y are all IEEE 754-2008 binary32 floating-point values. Below, fmaf(a,b,c) indicates a fused multiply-add a * b + c using binary32 values.
The naive algorithm is via a precalculated reciprocal,
C = 1.0f / y
so that at runtime a (much faster) multiplication suffices:
q = x * C
The Brisebarre-Muller-Raina acceleration uses two precalculated constants,
zh = 1.0f / y
zl = -fmaf(zh, y, -1.0f) / y
so that at runtime, one multiplication and one fused multiply-add suffices:
q = fmaf(x, zh, x * zl)
The Markstein algorithm combines the naive approach with two fused multiply-adds that yield the correct result if the naive approach yields a result within 1 unit in the least significant place, by precalculating
C1 = 1.0f / y
C2 = -y
so that the division can be approximated using
t1 = x * C1
t2 = fmaf(C2, t1, x)
q = fmaf(C1, t2, t1)
The naive approach works for all powers of two y, but otherwise it is pretty bad. For example, for divisors 7, 14, 15, 28, and 30, it yields an incorrect result for more than half of all possible x.
The Brisebarre-Muller-Raina approach similarly fails for almost all non-power of two y, but much fewer x yield the incorrect result (less than half a percent of all possible x, varies depending on y).
The Brisebarre-Muller-Raina article shows that the maximum error in the naive approach is ±1.5 ULPs.
The Markstein approach yields correct results for powers of two y, and also for odd integer y. (I have not found a failing odd integer divisor for the Markstein approach.)
For the Markstein approach, I have analysed divisors 1 - 19700 (raw data here).
Plotting the number of failure cases (divisor in the horizontal axis, the number of values of x where Markstein approach fails for said divisor), we can see a simple pattern occur:
(source: nominal-animal.net)
Note that these plots have both horizontal and vertical axes logarithmic. There are no dots for odd divisors, as the approach yields correct results for all odd divisors I've tested.
If we change the x axis to the bit reverse (binary digits in reverse order, i.e. 0b11101101 → 0b10110111, data) of the divisors, we have a very clear pattern:
(source: nominal-animal.net)
If we draw a straight line through the center of the point sets, we get curve 4194304/x. (Remember, the plot considers only half the possible floats, so when considering all possible floats, double it.)
8388608/x and 2097152/x bracket the entire error pattern completely.
Thus, if we use rev(y) to compute the bit reverse of divisor y, then 8388608/rev(y) is a good first order approximation of the number of cases (out of all possible float) where the Markstein approach yields an incorrect result for an even, non-power-of-two divisor y. (Or, 16777216/rev(y) for the upper limit.)
Added 2016-02-28: I found an approximation for the number of error cases using the Markstein approach, given any integer (binary32) divisor. Here it is as pseudocode:
function markstein_failure_estimate(divisor):
    if (divisor is zero)
        return no estimate
    if (divisor is not an integer)
        return no estimate
    if (divisor is negative)
        negate divisor
    # Consider, for avoiding underflow cases,
    if (divisor is very large, say 1e+30 or larger)
        return no estimate - do as division
    while (divisor > 16777216)
        divisor = divisor / 2
    if (divisor is a power of two)
        return 0
    if (divisor is odd)
        return 0
    while (divisor is not odd)
        divisor = divisor / 2
    # Use return (1 + 8388608 / divisor) / 2
    # if only nonnegative finite float divisors are counted!
    return 1 + 8388608 / divisor
This yields a correct error estimate to within ±1 on the Markstein failure cases I have tested (but I have not yet adequately tested divisors larger than 8388608). The final division should be such that it reports no false zeroes, but I cannot guarantee it (yet). It does not take into account very large divisors (say 0x1p100, or 1e+30, and larger in magnitude) which have underflow issues -- I would definitely exclude such divisors from acceleration anyway.
In preliminary testing, the estimate seems uncannily accurate. I did not draw a plot comparing the estimates and the actual errors for divisors 1 to 20000, because the points all coincide exactly in the plots. (Within this range, the estimate is exact, or one too large.) Essentially, the estimates reproduce the first plot in this answer exactly.
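The pseudocode above could be written in C roughly as follows. This is a sketch rather than the code used for the measurements: it returns -1 for "no estimate", and it assumes the final division truncates (integer division), which the pseudocode does not spell out.

```c
/* Sketch of the estimate above. Returns -1 for "no estimate".
   Assumption: the final division is a truncating integer division. */
long markstein_failure_estimate(float divisor)
{
    unsigned long n;

    if (divisor != divisor)                     /* NaN */
        return -1;
    if (divisor < 0.0f)
        divisor = -divisor;
    if (divisor == 0.0f || divisor >= 1e30f)    /* zero, very large, or infinite */
        return -1;
    /* Every float of magnitude 2^24 or more is an integer,
       so only smaller values need the integer check. */
    if (divisor < 16777216.0f && divisor != (float)(unsigned long)divisor)
        return -1;
    while (divisor > 16777216.0f)               /* halving is exact here */
        divisor /= 2.0f;

    n = (unsigned long)divisor;
    if (n % 2 == 1)                             /* odd divisor */
        return 0;
    while (n % 2 == 0)                          /* strip factors of two */
        n /= 2;
    if (n == 1)                                 /* power of two */
        return 0;
    return 1 + 8388608L / (long)n;
}
```

For instance, divisor 6 reduces to 3 and gives 1 + 8388608 / 3 = 2796203, while 3 and 8 both give 0.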
The pattern of failures for the Markstein approach is regular, and very interesting. The approach works for all power of two divisors, and all odd integer divisors.
For divisors greater than 16777216, I consistently see the same errors as for a divisor that is divided by the smallest power of two to yield a value less than 16777216. For example, 0x1.3cdfa4p+23 and 0x1.3cdfa4p+41, 0x1.d8874p+23 and 0x1.d8874p+32, 0x1.cf84f8p+23 and 0x1.cf84f8p+34, 0x1.e4a7fp+23 and 0x1.e4a7fp+37. (Within each pair, the mantissa is the same, and only the power of two varies.)
Assuming my test bench is not in error, this means that the Markstein approach also works for divisors larger than 16777216 in magnitude (but smaller than, say, 1e+30), provided that dividing the divisor by the smallest power of two that yields a quotient of less than 16777216 in magnitude gives an odd quotient.
This question asks for a way to identify the values of the constant Y that make it safe to transform x / Y into a cheaper computation using FMA for all possible values of x. Another approach is to use static analysis to determine an over-approximation of the values x can take, so that the generally unsound transformation can be applied in the knowledge that the values for which the transformed code differs from the original division do not happen.
Using representations of sets of floating-point values that are well adapted to the problems of floating-point computations, even a forwards analysis starting from the beginning of the function can produce useful information. For instance:
float f(float z) {
    float x = 1.0f + z;
    float r = x / Y;
    return r;
}
Assuming the default round-to-nearest mode(*), in the above function x can only be NaN (if the input is NaN), +0.0f, or a number larger than 2^-24 in magnitude, but not -0.0f or anything closer to zero than 2^-24. This justifies the transformation into one of the two forms shown in the question for many values of the constant Y.
(*) assumption without which many optimizations are impossible and that C compilers already make unless the program explicitly uses #pragma STDC FENV_ACCESS ON
A forwards static analysis that predicts the information for x above can be based on a representation of sets of floating-point values an expression can take as a tuple of:
a representation for the sets of possible NaN values (Since behaviors of NaN are underspecified, a choice is to use only a boolean, with true meaning some NaNs can be present, and false indicating no NaN is present.),
four boolean flags indicating respectively the presence of +inf, -inf, +0.0, -0.0,
an inclusive interval of negative finite floating-point values, and
an inclusive interval of positive finite floating-point values.
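The tuple listed above could be represented in C along these lines. Everything here is hypothetical (the struct name float_set, its fields, and both functions). The example set for 1.0f + z assumes, going slightly beyond what the text above states, that both infinities are possible and that the finite intervals start at 2^-24 in magnitude.

```c
#include <float.h>    /* FLT_MAX, FLT_EPSILON */
#include <math.h>     /* isnan, isinf, signbit */
#include <stdbool.h>

/* Hypothetical abstract value: an over-approximation of the floats
   an expression can take. An interval with lo > hi is empty. */
struct float_set {
    bool may_be_nan;
    bool pos_inf, neg_inf, pos_zero, neg_zero;
    float neg_lo, neg_hi;   /* inclusive interval of negative finite values */
    float pos_lo, pos_hi;   /* inclusive interval of positive finite values */
};

/* True when the concrete value v belongs to the set s. */
bool float_set_contains(struct float_set s, float v)
{
    if (isnan(v)) return s.may_be_nan;
    if (isinf(v)) return v > 0.0f ? s.pos_inf : s.neg_inf;
    if (v == 0.0f) return signbit(v) ? s.neg_zero : s.pos_zero;
    if (v < 0.0f) return s.neg_lo <= v && v <= s.neg_hi;
    return s.pos_lo <= v && v <= s.pos_hi;
}

/* Example: the set assumed for x = 1.0f + z in the function f above. */
struct float_set set_of_one_plus_z(void)
{
    const float tiny = FLT_EPSILON / 2;   /* 2^-24 for binary32 */
    struct float_set s;
    s.may_be_nan = true;
    s.pos_inf = true;
    s.neg_inf = true;
    s.pos_zero = true;
    s.neg_zero = false;
    s.neg_lo = -FLT_MAX;
    s.neg_hi = -tiny;
    s.pos_lo = tiny;
    s.pos_hi = FLT_MAX;
    return s;
}
```

With this set, float_set_contains(set_of_one_plus_z(), -0.0f) is false and so is the query for any nonzero value smaller than 2^-24 in magnitude, which is the information the transformation needs.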
In order to follow this approach, all the floating-point operations that can occur in a C program must be understood by the static analyzer. To illustrate, the addition between sets of values U and V, to be used to handle + in the analyzed code, can be implemented as:
If NaN is present in one of the operands, or if the operands can be infinities of opposite signs, NaN is present in the result.
If 0 cannot be a result of the addition of a value of U and a value of V, use standard interval arithmetic. The upper bound of the result is obtained for the round-to-nearest addition of the largest value in U and the largest value in V, so these bounds should be computed with round-to-nearest.
If 0 can be a result of the addition of a positive value of U and a negative value of V, then let M be the smallest positive value in U such that -M is present in V.
if succ(M) is present in U, then this pair of values contributes succ(M) - M to the positive values of the result.
if -succ(M) is present in V, then this pair of values contributes the negative value M - succ(M) to the negative values of the result.
if pred(M) is present in U, then this pair of values contributes the negative value pred(M) - M to the negative values of the result.
if -pred(M) is present in V, then this pair of values contributes the value M - pred(M) to the positive values of the result.
Do the same work if 0 can be the result of the addition of a negative value of U and a positive value of V.
Acknowledgement: the above borrows ideas from “Improving the Floating Point Addition and Subtraction Constraints”, Bruno Marre & Claude Michel
Example: compilation of the function f below:
float f(float z, float t) {
    float x = 1.0f + z;
    if (x + t == 0.0f) {
        float r = x / 6.0f;
        return r;
    }
    return 0.0f;
}
The approach in the question refuses to transform the division in function f into an alternate form, because 6 is not one of the values for which the division can be unconditionally transformed. Instead, what I am suggesting is to apply a simple value analysis starting from the beginning of the function which, in this case, determines that x is a finite float either +0.0f or at least 2^-24 in magnitude, and to use this information to apply Brisebarre et al's transformation, confident in the knowledge that x * C2 does not underflow.
To be explicit, I am suggesting to use an algorithm such as the one below to decide whether or not to transform the division into something simpler:
1. Is Y one of the values that can be transformed using Brisebarre et al's method according to their algorithm?
2. Do C1 and C2 from their method have the same sign, or is it possible to exclude the possibility that the dividend is infinite?
3. Do C1 and C2 from their method have the same sign, or can x take only one of the two representations of 0? In the case where C1 and C2 have different signs and x can only be one representation of zero, remember to fiddle(**) with the signs of the FMA-based computation to make it produce the correct zero when x is zero.
4. Can the magnitude of the dividend be guaranteed to be large enough to exclude the possibility that x * C2 underflows?
If the answer to the four questions is “yes”, then the division can be transformed into a multiplication and an FMA in the context of the function being compiled. The static analysis described above serves to answer questions 2., 3. and 4.
(**) “fiddling with the signs” means using -FMA(-C1, x, (-C2)*x) in place of FMA(C1, x, C2*x) when this is necessary to make the result come out correctly when x can only be one of the two signed zeroes
I love @Pascal's answer but in optimization it's often better to have a simple and well-understood subset of transformations rather than a perfect solution.
All current and common historical floating point formats had one thing in common: a binary mantissa.
Therefore, all fractions were rational numbers of the form:
x / 2^n
This is in contrast to the constants in the program (and all possible base-10 fractions) which are rational numbers of the form:
x / (2^n * 5^m)
So, one optimization would simply test the input and reciprocal for m == 0, since those numbers are represented exactly in the FP format and operations with them should produce numbers that are accurate within the format.
So, for example, within the (decimal 2-digit) range of .01 to 0.99 dividing or multiplying by the following numbers would be optimized:
.25 .50 .75
And everything else would not. (I think, do test it first, lol.)
The result of a floating point division is:
a sign flag
a significand
an exponent
a set of flags (overflow, underflow, inexact, etc - see <fenv.h>)
Getting the first 3 pieces correct (but the set of flags incorrect) is not enough. Without further knowledge (e.g. which parts of which pieces of the result actually matter, the possible values of the dividend, etc) I would assume that replacing division by a constant with multiplication by a constant (and/or a convoluted FMA mess) is almost never safe.
In addition; for modern CPUs I also wouldn't assume that replacing a division with 2 FMAs is always an improvement. For example, if the bottleneck is instruction fetch/decode, then this "optimisation" would make performance worse. For another example, if subsequent instructions don't depend on the result (the CPU can do many other instructions in parallel while waiting for the result) the FMA version may introduce multiple dependency stalls and make performance worse. For a third example, if all registers are being used then the FMA version (which requires additional "live" variables) may increase "spilling" and make performance worse.
Note that (in many but not all cases) division or multiplication by a constant power of 2 can be done with addition alone (specifically, adding a shift count to the exponent).

Computing floating point accuracy (K&R 2-1)

I found Stevens Computing Services – K & R Exercise 2-1 a very thorough answer to K&R 2-1. This slice of the full code computes the maximum value of a float type in the C programming language.
Unfortunately, my theoretical understanding of float values is quite limited. I know they are composed of a significand (mantissa) and a magnitude that is a power of 2.
#include <stdio.h>
#include <limits.h>
#include <float.h>
int main(void)
{
float flt_a, flt_b, flt_c, flt_m;
/* FLOAT */
printf("\nFLOAT MAX\n");
printf("<float.h> %E ", FLT_MAX);
flt_a = 2.0;
flt_b = 1.0;
while (flt_a != flt_b) {
flt_m = flt_b; /* MAX POWER OF 2 IN MANTISSA */
flt_a = flt_b = flt_b * 2.0;
flt_a = flt_a + 1.0;
}
flt_m = flt_m + (flt_m - 1); /* MAX VALUE OF MANTISSA */
flt_a = flt_b = flt_c = flt_m;
while (flt_b == flt_c) {
flt_c = flt_a;
flt_a = flt_a * 2.0;
flt_b = flt_a / 2.0;
}
printf("COMPUTED %E\n", flt_c);
}
I understand that the latter part basically checks to which power of 2 it's possible to raise the significand with a three variable algorithm. What about the first part?
I can see that a progression of multiples of 2 should eventually determine the value of the significand, but I tried to trace a few small numbers to check how it should work and it failed to find the right values...
======================================================================
What are the concepts on which this program is based, and does this program get more precise as longer and non-integer numbers have to be found?
The first loop determines the number of bits contributing to the significand by finding the least power 2 such that adding 1 to it (using floating-point arithmetic) fails to change its value. If that's the nth power of two, then the significand uses n bits, because with n bits you can express all the integers from 0 through 2^n - 1, but not 2^n. The floating-point representation of 2^n must therefore have an exponent large enough that the (binary) units digit is not significant.
By that same token, having found the first power of 2 whose float representation has worse than unit precision, the maximum float value that does have unit precision is one less. That value is recorded in variable flt_m.
The second loop then tests for the maximum exponent by starting with the maximum unit-precision value, and repeatedly doubling it (thereby increasing the exponent by 1) until it finds that the result cannot be converted back by halving it. The maximum float is the value before that final doubling.
Do note, by the way, that all the above supposes a base-2 floating-point representation. You are unlikely to run into anything different, but C does not actually require any specific representation.
With respect to the second part of your question,
does this program get more precise as longer and non-integer numbers have to be found?
the program takes care to avoid losing precision. It does assume a binary floating-point representation such as you described, but it will work correctly regardless of the number of bits in the significand or exponent of such a representation. No non-integers are involved, but the program already deals with numbers that have worse than unit precision, and with numbers larger than can be represented with type int.

Round positive value half-up to 2 decimal places in C

Typically, rounding to 2 decimal places is very easy with
printf("%.2lf",<variable>);
However, the rounding usually goes to the nearest even. For example,
2.554 -> 2.55
2.555 -> 2.56
2.565 -> 2.56
2.566 -> 2.57
And what I want to achieve is that
2.555 -> 2.56
2.565 -> 2.57
In fact, rounding half-up is doable in C, but for integers only:
int a = (int)(b+0.5);
So I'm asking how to do the same thing as above with 2 decimal places on positive values, instead of integers, to achieve what I described earlier for printing.
It is not clear whether you actually want to "round half-up", or rather "round half away from zero", which requires different treatment for negative values.
Single-precision binary float is precise to at least 6 significant decimal digits, and double to at least 15, so nudging an FP value by DBL_EPSILON (defined in float.h) will cause printf( "%.2lf", x ) to round up to the next hundredth for n.nn5 values, without affecting the displayed value for values that are not n.nn5.
double x2 = x * (1 + DBL_EPSILON) ; // round half-away from zero
printf( "%.2lf", x2 ) ;
For different rounding behaviours:
double x2 = x * (1 - DBL_EPSILON) ; // round half-toward zero
double x2 = x + DBL_EPSILON ; // round half-up
double x2 = x - DBL_EPSILON ; // round half-down
Following is precise code to round a double to the nearest 0.01 double.
The code functions like x = round(100.0*x)/100.0; except that it uses manipulations to ensure scaling by 100.0 is done exactly without precision loss.
Likely this is more code than OP is interested in, but it does work.
It works for the entire double range -DBL_MAX to DBL_MAX. (still should do more unit testing).
It depends on FLT_RADIX == 2, which is common.
#include <float.h>
#include <math.h>
#include <stdio.h>
void r100_best(const char *s) {
double x;
sscanf(s, "%lf", &x);
// Break x into whole number and fractional parts.
// Code only needs to round the fractional part.
// This preserves the entire `double` range.
double xi, xf;
xf = modf(x, &xi);
// Multiply the fractional part by N (256).
// Break into whole and fractional parts.
// This provides the needed extended precision.
// N should be >= 100 and a power of 2.
// The multiplication by a power of 2 will not introduce any rounding.
double xfi, xff;
xff = modf(xf * 256, &xfi);
// Multiply both parts by 100.
// *100 needs up to 7 more bits of precision; the preceding code
// ensures the 8 least significant bits of xfi and xff are zero.
int xfi100, xff100;
xfi100 = (int) (xfi * 100.0);
xff100 = (int) (xff * 100.0); // Cast here will truncate (towards 0)
// sum the 2 parts.
// sum is the exact truncate-toward-0 version of xf*256*100
int sum = xfi100 + xff100;
// add in half N
if (sum < 0)
sum -= 128;
else
sum += 128;
xf = sum / 256; // integer division on purpose: this completes the rounding
xf /= 100;
double y = xi + xf;
printf("%6s %25.22f ", "x", x);
printf("%6s %25.22f %.2f\n", "y", y, y);
}
int main(void) {
r100_best("1.105");
r100_best("1.115");
r100_best("1.125");
r100_best("1.135");
r100_best("1.145");
r100_best("1.155");
r100_best("1.165");
return 0;
}
[Edit] OP clarified that only the printed value needs rounding to 2 decimal places.
OP's observation about rounding "half-way" numbers per "round to even" or "round away from zero" is misleading. Of the 100 "half-way" numbers 0.005, 0.015, 0.025, ... 0.995, typically only 4 are exactly "half-way": 0.125, 0.375, 0.625, 0.875. This is because floating-point formats use base 2, and numbers like 2.565 cannot be represented exactly.
Instead, a sample number like 2.565 has as its closest double value 2.564999999999999947..., assuming binary64. Rounding that number to the nearest 0.01 should give 2.56 rather than the 2.57 desired by OP.
Thus only numbers ending with 0.125 and 0.625 are exactly half-way and round down rather than up as desired by OP. I suggest accepting that and using:
printf("%.2lf",variable); // This should be sufficient
To get close to OP's goal, numbers could be A) tested against ending with 0.125 or 0.625 or B) increased slightly. The smallest increase would be
#include <math.h>
printf("%.2f", nextafter(x, 2*x));
Another nudge method can be found in #Clifford's answer.
[Former answer that rounds a double to the nearest double multiple of 0.01]
Typical floating-point uses formats like binary64, which employs base-2. "Rounding to nearest mathematical 0.01 and ties away from 0.0" is challenging.
As #Pascal Cuoq mentions, floating point numbers like 2.555 typically are only near 2.555 and have a more precise value like 2.555000000000000159872... which is not half way.
#BLUEPIXY solution below is best and practical.
x = round(100.0*x)/100.0;
"The round functions round their argument to the nearest integer value in floating-point
format, rounding halfway cases away from zero, regardless of the current rounding direction." C11dr §7.12.9.6.
The ((int)(100 * (x + 0.005)) / 100.0) approach has 2 problems: it may round in the wrong direction for negative numbers (OP did not specify) and integers typically have a much smaller range (INT_MIN to INT_MAX) than double.
There are still some cases, such as double x = atof("1.115");, which end up near 1.12 when the result really should be 1.11, because 1.115 as a double is really closer to 1.11 and not "half-way".
string   x                             rounded x
1.115    1.1149999999999999911182e+00  1.1200000000000001065814e+00
OP has not specified rounding of negative numbers, assuming y = -f(-x).

What is the best way to fmod a decimal number?

My problem is getting the most accurate result from a mod calculation. I use the remainder in a further rounding calculation, so I need an accurate result.
double a = 0.12345678...(may with many digits);
double b = fmod(a, 0.01);
The result b may be inaccurate due to the binary storage issue.
Do I have to consider using float to increase the accuracy?
Or should I just shift the decimal point so that the value becomes an integer?
double a = 12345678.0;
thanks
First, any serious implementation of fmod will return the floating-point number nearest to the remainder in single/double/whatever precision, as if the division were performed with infinite precision.
(NOTE: rephrased thanks to #EricPostpischil)
But by then it is already too late. The binary floating-point internal representation of 0.01 does not represent 1/100 exactly, as you already seem to know.
Let's examine how the error accumulates.
You want to know the remainder of a division, say a % b = c.
You have inexact representations a1 and b1, and you know an error bound for these representations: a1=a+ea1, abs(ea1) < ea, b1=b+eb1, abs(eb1) < eb.
What can you say about a1 % b1 = c1 (the exact operation), c1=c+ec1 that is about error bound abs(ec1) < ec?
a = q * b + c.
a1 = q1 * b1 + c1.
a+ea1 = (q+eq1)*(b+eb1) + (c+ec1).
ea1 = eq1*(b+eb1) + q*eb1 + ec1.
ec1 = ea1 - q*eb1 - eq1*(b+eb1).
ec >= max( ea , abs(q)*eb , eq*abs(b) , eq*eb).
ec <= ea + abs(q)*eb + eq*abs(b) + eq*eb.
You can control ea and abs(q)*eb by increasing precision of representation (single, double, extended, quadruple, arbitrary precision...).
But the important term in this inequality is eq*abs(b): if the quotient can be off by one, then the error bound is ec > b!
And of course the quotient can be off by one; such cases are extremely easy to construct.
Take c=0 and a1 an underestimate of a (ea1<0), or b1 an overestimate of b (eb1>0), and you're done: you get eq1 = -1 even for a small quotient and an accurate representation.
Don't think that carefully controlling rounding modes so as to obtain ea1 > 0 (overestimate) and eb1 <= 0 (underestimate) would protect you in all cases, since we can construct the inverse case where
b - smallValue < c < b
Don't try remainder, a variant of fmod that rounds the quotient rather than truncating it; that will just move the problem to near-perfect ties (when the exact quotient a/b is an odd multiple of 1/2).
With a careful analysis of error bounds, you could produce an estimate of ec and identify the bad cases: a potentially incorrect rounding of the quotient q (when a1/b1 is near an integer), abs(q)*eb reaching 1, or ea>=b.
In bad cases, you could arrange to raise an exception and restart, producing a1 and b1 with increased precision, but in the edge case c=0 there is no guarantee of convergence, even with arbitrary precision.
If I understand your question correctly, you want the result of fmod as a double. As Pascal Cuoq noted in the comments, the fmod prototype is double fmod(double x,double y); so you can do it like this:
#include<stdio.h>
#include<math.h>
int main()
{
double a = 12.1649232848373633242;
double b = 1.234;
double c;
setbuf(stdout,NULL);
c = fmod(a,b);
printf("%.13f",c); // .13 in the format specifier sets the number of decimal places printed
return 0;
}

Resources