Question
For a C99 compiler implementing exact IEEE 754 arithmetic, do values of f, divisor of type float exist such that f / divisor != (float)(f * (1.0 / divisor))?
EDIT: By “implementing exact IEEE 754 arithmetic” I mean a compiler that rightfully defines FLT_EVAL_METHOD as 0.
Context
A C compiler that provides IEEE 754-compliant floating-point can only replace a single-precision division by a constant by a single-precision multiplication by the inverse if said inverse is itself representable exactly as a float.
In practice, this only happens for powers of two. So a programmer, Alex, may be confident that f / 2.0f will be compiled as if it had been f * 0.5f, but if it is acceptable for Alex to multiply by 0.10f instead of dividing by 10, Alex should express it by writing the multiplication in the program, or by using a compiler option such as GCC's -ffast-math.
This question is about transforming a single-precision division into a double-precision multiplication. Does it always produce the correctly rounded result? Is there a chance that it could be cheaper, and thus be an optimization that compilers might make (even without -ffast-math)?
I have compared (float)(f * 0.10) and f / 10.0f for all single-precision values of f between 1 and 2, without finding any counter-example. This should cover all divisions of normal floats producing a normal result.
Then I generalized the test to all divisors with the program below:
#include <float.h>
#include <math.h>
#include <stdio.h>
int main(void){
for (float divisor = 1.0; divisor != 2.0; divisor = nextafterf(divisor, 2.0))
{
double factor = 1.0 / divisor; // double-precision inverse
for (float f = 1.0; f != 2.0; f = nextafterf(f, 2.0))
{
float cr = f / divisor;
float opt = f * factor; // double-precision multiplication
if (cr != opt)
printf("For divisor=%a, f=%a, f/divisor=%a but (float)(f*factor)=%a\n",
divisor, f, cr, opt);
}
}
}
The search space is just large enough to make this interesting (2^46). The program is currently running. Can someone tell me whether it will print something, perhaps with an explanation why or why not, before it has finished?
Your program won't print anything, assuming round-ties-to-even rounding mode. The essence of the argument is as follows:
We're assuming that both f and divisor are between 1.0 and 2.0. So f = a / 2^23 and divisor = b / 2^23 for some integers a and b in the range [2^23, 2^24). The case divisor = 1.0 isn't interesting, so we can further assume that b > 2^23.
The only way that (float)(f * (1.0 / divisor)) could give the wrong result would be for the exact value f / divisor to be so close to a halfway case (i.e., a number exactly halfway between two single-precision floats) that the accumulated errors in the expression f * (1.0 / divisor) push us to the other side of that halfway case from the true value.
But that can't happen. For simplicity, let's first assume that f >= divisor, so that the exact quotient is in [1.0, 2.0). Now any halfway case for single precision in the interval [1.0, 2.0) has the form c / 2^24 for some odd integer c with 2^24 < c < 2^25. The exact value of f / divisor is a / b, so the absolute value of the difference f / divisor - c / 2^24 is bounded below by 1 / (2^24 b), so is at least 1 / 2^48 (since b < 2^24). So we're more than 16 double-precision ulps away from any halfway case, and it should be easy to show that the error in the double precision computation can never exceed 16 ulps. (I haven't done the arithmetic, but I'd guess it's easy to show an upper bound of 3 ulps on the error.)
So f / divisor can't be close enough to a halfway case to create problems. Note that f / divisor can't be an exact halfway case, either: since c is odd, c and 2^24 are relatively prime, so the only way we could have c / 2^24 = a / b is if b is a multiple of 2^24. But b is in the range (2^23, 2^24), so that's not possible.
The case where f < divisor is similar: the halfway cases then have the form c / 2^25 and the analogous argument shows that abs(f / divisor - c / 2^25) is greater than 1 / 2^49, which again gives us a margin of 16 double-precision ulps to play with.
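The claim can also be spot-checked without waiting for the full 2^46 search. What follows is only a minimal sketch, not part of the question's program: the helper names (from_mantissa, same_as_division, count_mismatches) and the sampling stride are my own choices, and it assumes round-to-nearest and FLT_EVAL_METHOD equal to 0.

```c
#include <stdint.h>

/* Build the float 1 + m/2^23 for a 23-bit mantissa m; exact by construction. */
static float from_mantissa(uint32_t m)
{
    return 1.0f + (float)m * 0x1p-23f;
}

/* 1 if the double-precision multiply reproduces the float quotient. */
int same_as_division(float f, float divisor)
{
    float cr = f / divisor;            /* correctly rounded reference */
    double factor = 1.0 / divisor;     /* double-precision inverse */
    float opt = (float)(f * factor);   /* double multiply, then round to float */
    return cr == opt;
}

/* Count mismatches over a strided sample of f in [1,2) and divisor in (1,2). */
unsigned long count_mismatches(uint32_t step)
{
    unsigned long bad = 0;
    for (uint32_t d = 1; d < (1u << 23); d += step)
        for (uint32_t n = 0; n < (1u << 23); n += step)
            if (!same_as_division(from_mantissa(n), from_mantissa(d)))
                bad++;
    return bad;
}
```

If the argument above is right, count_mismatches should return 0 for any nonzero stride. An odd stride such as 8191 keeps the run to about a million pairs and avoids sampling only mantissas with many trailing zero bits.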
It's certainly not possible if non-default rounding modes are possible. For example, in replacing 3.0f / 3.0f with 3.0f * C, a value of C less than the exact reciprocal would yield the wrong result in downward or toward-zero rounding modes, whereas a value of C greater than the exact reciprocal would yield the wrong result for upward rounding mode.
It's less clear to me whether what you're looking for is possible if you restrict to default rounding mode. I'll think about it and revise this answer if I come up with anything.
Random search resulted in an example.
Looks like when the result is a "denormal/subnormal" number, the inequality is possible. But then, maybe my platform is not IEEE 754 compliant?
f 0x1.7cbff8p-25
divisor -0x1.839p+116
q -0x1.f8p-142
q2 -0x1.f6p-142
int MyIsFinite(float f) {
union {
float f;
unsigned char uc[sizeof (float)];
unsigned long ul;
} x;
x.f = f;
return (x.ul & 0x7F800000L) != 0x7F800000L;
}
float floatRandom() {
union {
float f;
unsigned char uc[sizeof (float)];
} x;
do {
size_t i;
for (i=0; i<sizeof(x.uc); i++) x.uc[i] = rand();
} while (!MyIsFinite(x.f));
return x.f;
}
void testPC() {
for (;;) {
volatile float f, divisor, q, qd;
do {
f = floatRandom();
divisor = floatRandom();
q = f / divisor;
} while (!MyIsFinite(q));
qd = (float) (f * (1.0 / divisor));
if (qd != q) {
printf("%a %a %a %a\n", f, divisor, q, qd);
return;
}
}
}
Eclipse PC Version: Juno Service Release 2
Build id: 20130225-0426
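One observation about the example above: f = 3119103 * 2^-46 and divisor = -6201 * 2^104, and 3119103 = 503 * 6201, so the exact quotient is -503 * 2^-150. That is exactly halfway between the subnormal floats -251 * 2^-149 (-0x1.f6p-142) and -252 * 2^-149 (-0x1.f8p-142). The argument that exact ties cannot occur only covers normal results, so the outcome here is sensitive to how the platform evaluates the double-precision expression. The sketch below (function names are mine) recomputes both values and reports FLT_EVAL_METHOD so the two can be compared on a given compiler.

```c
#include <float.h>
#include <stdio.h>

/* Single-precision quotient for the reported operands. */
float quotient_direct(void)
{
    volatile float f = 0x1.7cbff8p-25f, divisor = -0x1.839p+116f;
    return f / divisor;
}

/* Same quotient via a double-precision multiplication by the inverse. */
float quotient_via_double(void)
{
    volatile float f = 0x1.7cbff8p-25f, divisor = -0x1.839p+116f;
    return (float)(f * (1.0 / divisor));
}

/* Print both results together with the evaluation method in use. */
void report(void)
{
    printf("FLT_EVAL_METHOD=%d q=%a q2=%a\n", (int)FLT_EVAL_METHOD,
           quotient_direct(), quotient_via_double());
}
```

With ties-to-even, the correctly rounded quotient is the one with the even significand, -0x1.f8p-142, which matches the reported q.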
Related
I am new to C, and my task is to create a function
f(x) = sqrt[(x^2)+1]-1
that can handle very large numbers and very small numbers. I am submitting my script on an online interface that checks my answers.
For very large numbers I simplify the expression to:
f(x) = x-1
By just using the highest power. This was the correct answer.
The same logic does not work for smaller numbers. For small numbers (on the order of 1e-7), they are very quickly truncated to zero, even before they are squared. I suspect that this has to do with floating point precision in C. In my textbook, it says that the float type has smallest possible value of 1.17549e-38, with 6 digit precision. So although 1e-7 is much larger than 1.17e-38, it has a higher precision, and is therefore rounded to zero. This is my guess, correct me if I'm wrong.
As a solution, I am thinking that I should convert x to a long double when x < 1e-6. However when I do this, I still get the same error. Any ideas? Let me know if I can clarify. Code below:
#include <math.h>
#include <stdio.h>
double feval(double x) {
/* Insert your code here */
if (x > 1e299)
{;
return x-1;
}
if (x < 1e-6)
{
long double g;
g = x;
printf("x = %Lf\n", g);
long double a;
a = pow(x,2);
printf("x squared = %Lf\n", a);
return sqrt(g*g+1.)- 1.;
}
else
{
printf("x = %f\n", x);
printf("Used third \n");
return sqrt(pow(x,2)+1.)-1;
}
}
int main(void)
{
double x;
printf("Input: ");
scanf("%lf", &x);
double b;
b = feval(x);
printf("%f\n", b);
return 0;
}
For small inputs, you're getting rounding error when you compute 1+x^2. If x=1e-7f, x*x will happily fit into a 32-bit floating-point number (with a little bit of error due to the fact that 1e-7 does not have an exact floating-point representation), but x*x will be so much smaller than 1 that floating-point precision will not be sufficient to represent 1+x*x.
It would be more appropriate to do a Taylor expansion of sqrt(1+x^2), which to lowest order would be
sqrt(1+x^2) = 1 + 0.5*x^2 + O(x^4)
Then, you could write your result as
sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4),
avoiding the scenario where you add a very small number to 1.
As a side note, you should not use pow for integer powers. For x^2, you should just do x*x. Arbitrary integer powers are a little trickier to do efficiently; the GNU scientific library for example has a function for efficiently computing arbitrary integer powers.
There are two issues when implementing this in the naive way: overflow or underflow in the intermediate computation of x * x, and subtractive cancellation during the final subtraction of 1. The second is an accuracy issue.
ISO C has a standard math function hypot (x, y) that performs the computation sqrt (x * x + y * y) accurately while avoiding underflow and overflow in intermediate computation. A common approach to fix issues with subtractive cancellation is to transform the computation algebraically such that it is transformed into multiplications and / or divisions.
Combining these two fixes leads to the following implementation for float argument. It has an error of less than 3 ulps across all possible inputs according to my testing.
/* Compute sqrt(x*x+1)-1 accurately and without spurious overflow or underflow */
float func (float x)
{
return (x / (1.0f + hypotf (x, 1.0f))) * x;
}
A trick that is often useful in these cases is based on the identity
(a+1)*(a-1) = a*a-1
In this case
sqrt(x*x+1)-1 = (sqrt(x*x+1)-1)*(sqrt(x*x+1)+1)
/(sqrt(x*x+1)+1)
= (x*x+1-1) / (sqrt(x*x+1)+1)
= x*x/(sqrt(x*x+1)+1)
The last formula can be used as an implementation. For very small x, sqrt(x*x+1)+1 will be close to 2 (for small enough x it will be exactly 2), but we don't lose precision in evaluating it.
The problem isn't with running into the minimum value, but with the precision.
As you said yourself, float on your machine has about 7 digits of precision. So let's take x = 1e-7, so that x^2 = 1e-14. That's still well within the range of float, no problems there. But now add 1. The exact answer would be 1.00000000000001. But if we only have 7 digits of precision, this gets rounded to 1.0000000, i.e. exactly 1. So you end up computing sqrt(1.0)-1 which is exactly 0.
One approach would be to use the linear approximation of sqrt around x=1 that sqrt(x) ~ 1+0.5*(x-1). That would lead to the approximation f(x) ~ 0.5*x^2.
I want to know whether the program defined below can return 1 assuming:
IEEE 754 floating-point arithmetic
no overflow (neither in max/x nor in f*x)
no nan or inf (obviously)
0 < x and 0 < n < 32
no unsafe math optimization
int canfail(int n, double x) {
double max = 1ULL << n; // 2^n
double f = max / x;
return f * x > max;
}
In my opinion, it should sometimes return 1, as roundToNearest(max / x) can in general be greater than max/x.
I'm able to find numbers for the opposite case, where f * x < max, but I have no example of an input that shows f * x > max and I have no idea how to find one. Can somebody help?
EDIT:
I know the value of x is in a range between 10^(-6) and 10^6. That still leaves a lot (too many) of possible double values, but I know I will not have to deal with overflow, underflow or subnormal numbers!
In addition, I just realized that because max is a power of two and we don't deal with overflow, the solution will be the same when fixing max=1, as it is exactly the same computation, but shifted.
Therefore, the problem corresponds to finding a positive, normal double value x such that (1/x) * x > 1.0!
I made a little program to try to find a solution:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <stdint.h>
#include <omp.h>
int main( void ) {
#pragma omp parallel
{
unsigned short int xsubi[3] = {
omp_get_thread_num(),
omp_get_thread_num(),
omp_get_thread_num()
};
#pragma omp for
for(int64_t i=0; i<INT64_MAX; i++) {
double x = fmod(nrand48(xsubi), 1048576.0);
if(x<0.000001)
continue;
double f = 1.0 / x;
if(f * x > 1.0) {
printf("found !!! x=%.30f\n", x);
fflush(stdout);
}
}
}
return 1;
}
If you change the sign of the comparison, you will find some value quickly. However, it seems to run forever with f * x > 1.0
In the absence of underflow or overflow, the exponents are irrelevant; if M/x*x > M, then (M/p) / (x/q) * (x/q) > (M/p) for any powers of two p and q. So let’s consider 2^52 ≤ x < 2^53 and M = 2^105. We can eliminate x = 2^52 since this yields exact floating-point arithmetic, so 2^52 < x < 2^53.
Division of 2^105 by x yields integer quotient q and integer remainder r, with 2^52 ≤ q < 2^53, 0 < r < x, and 2^105 = q•x + r.
In order for M/x*x to exceed M, both the division and the multiplication must round up. Since the division rounds up, x/2 ≤ r.
With rounding up, the floating-point division of 2^105 by x yields q+1. Then the exact (not rounded) multiplication yields (q+1)•x = q•x + x = q•x + r + x − r = 2^105 + x − r. Since x/2 ≤ r, x − r ≤ x/2, so rounding this exact result rounds down, yielding 2^105. (The “<” case always rounds down, and the “=” case rounds down because 2^105 has the low even bit.)
Therefore, for powers of two M and all arithmetic within exponent bounds, M/x*x > M never occurs with round-to-nearest-ties-to-even.
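The conclusion can be sanity-checked by sampling. This is only a sketch under my own assumptions: the xorshift generator, its seed and the function names are arbitrary choices, the scaling argument is used to restrict x to [1, 2) with M = 1, and round-to-nearest-ties-to-even with plain double evaluation is assumed.

```c
#include <stdint.h>

/* One step of a 64-bit xorshift generator; any decent generator would do. */
static uint64_t next_state(uint64_t s)
{
    s ^= s << 13;
    s ^= s >> 7;
    s ^= s << 17;
    return s;
}

/* Count sampled x in [1,2) for which (1/x)*x rounds to something above 1. */
unsigned long count_product_above_one(unsigned long trials)
{
    uint64_t s = 0x9E3779B97F4A7C15ull;
    unsigned long hits = 0;
    for (unsigned long i = 0; i < trials; i++) {
        s = next_state(s);
        /* 52 random fraction bits give an exactly representable x. */
        double x = 1.0 + (double)(s >> 12) * 0x1p-52;
        volatile double f = 1.0 / x;   /* volatile forces rounding to double */
        volatile double p = f * x;
        if (p > 1.0)
            hits++;
    }
    return hits;
}
```

If the proof holds, the count stays at zero however many trials are run.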
Multiplication by a power of two is just a scaling of exponent, it does not change the problem: so it's the same as finding x such that (1/x) * x > 1.
One solution is brute force search.
For the same reasons, we can limit the search for such x to the interval (1.0, 2.0).
A better approach is to analyze error bounds without brute force.
Let's note ix the nearest floating point to 1/x.
Considering x and ix as exact fractions, we can write the integer division: 1 = ix * x + r where r is the remainder
(these are all fractions with denominators being powers of 2, so we have to multiply the whole equation by appropriate power of 2 to really have integer division).
In other words, ix = 1/x - r/x, where -r/x is the rounding error of inversion.
When we multiply the inverse approximation by x, the exact value is ix*x = 1 - r.
We know that the floating point result will be rounded to the nearest float to that exact value.
So, assuming the default rounding mode (to nearest, ties to even), the question asked is whether -r can exceed 0.5 ulp.
The short answer is never!
Suppose |r| > 0.5 ulp, then the rounding error -r/x does exceed half ulp of exact result 1/x.
This is not a proper answer, because the exact result is not a floating point and does not have an ulp, but you get the idea...
I might come back with a correct proof if I have time, but my bet is that you can find it already done, possibly on SO.
EDIT
Why can you find (1/x) * x < 1?
Simply because 1.0 is at a binade limit: below 1, we would have to prove that r < 0.25 ulp, which we cannot...
canfail(1, pow(2, 1023) * (2 - pow(2, -51))) will return 1.
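To try this, here is a small sketch that repeats canfail from the question and builds the same x with ldexp instead of pow (large_x is my own helper name). Note that for this x the quotient 2/x is about 2^-1023, which is subnormal, so the example relies on exactly the underflow that the question's edit rules out.

```c
#include <math.h>

/* canfail as given in the question. */
int canfail(int n, double x)
{
    double max = 1ULL << n;   /* 2^n */
    double f = max / x;
    return f * x > max;
}

/* x = 2^1023 * (2 - 2^-51), one ulp below DBL_MAX. */
double large_x(void)
{
    return ldexp(2.0 - ldexp(1.0, -51), 1023);
}
```

Calling canfail(1, large_x()) exercises the case: the quotient rounds up to 2^-1023 + 2^-1074, and the product then lands just below 2 + 2^-51, which rounds up past 2. This assumes subnormals are not flushed to zero.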
I have a question about the pack754() function defined in Section 7.4 of Beej's Guide to Network Programming.
This function converts a floating point number f into its IEEE 754 representation where bits is the total number of bits to represent the number and expbits is the number of bits used to represent only the exponent.
I am concerned with single-precision floating numbers only, so for this question, bits is specified as 32 and expbits is specified as 8. This implies that 23 bits is used to store the significand (because one bit is sign bit).
My question is about this line of code.
significand = fnorm * ((1LL<<significandbits) + 0.5f);
What is the role of + 0.5f in this code?
Here is a complete code that uses this function.
#include <stdio.h>
#include <stdint.h> // defines uintN_t types
#include <inttypes.h> // defines PRIx macros
uint64_t pack754(long double f, unsigned bits, unsigned expbits)
{
long double fnorm;
int shift;
long long sign, exp, significand;
unsigned significandbits = bits - expbits - 1; // -1 for sign bit
if (f == 0.0) return 0; // get this special case out of the way
// check sign and begin normalization
if (f < 0) { sign = 1; fnorm = -f; }
else { sign = 0; fnorm = f; }
// get the normalized form of f and track the exponent
shift = 0;
while(fnorm >= 2.0) { fnorm /= 2.0; shift++; }
while(fnorm < 1.0) { fnorm *= 2.0; shift--; }
fnorm = fnorm - 1.0;
// calculate the binary form (non-float) of the significand data
significand = fnorm * ((1LL<<significandbits) + 0.5f);
// get the biased exponent
exp = shift + ((1<<(expbits-1)) - 1); // shift + bias
// return the final answer
return (sign<<(bits-1)) | (exp<<(bits-expbits-1)) | significand;
}
int main(void)
{
float f = 3.1415926;
uint32_t fi;
printf("float f: %.7f\n", f);
fi = pack754(f, 32, 8);
printf("float encoded: 0x%08" PRIx32 "\n", fi);
return 0;
}
What purpose does + 0.5f serve in this code?
The code is an incorrect attempt at rounding.
long double fnorm;
long long significand;
unsigned significandbits
...
significand = fnorm * ((1LL<<significandbits) + 0.5f); // bad code
The first clue of incorrectness is the f suffix of 0.5f, which indicates float: specifying float is nonsensical in a routine whose f and fnorm are long double. float math has no application in the function.
Yet adding 0.5f does not mean that the code is limited to float math in (1LL<<significandbits) + 0.5f. See FLT_EVAL_METHOD which may allow higher precision intermediate results and have fooled the code author in testing.
A rounding attempt does make sense as the argument is long double and the target representations are narrower. Adding 0.5 is a common approach - but it is not done right here. IMO, the lack of the author commenting here concerning 0.5f hinted that the intent was "obvious" - not subtle, albeit incorrect.
As commented, moving the 0.5 is closer to being correct for rounding, but may mislead some into thinking the addition is done with float math. (It is long double math: adding a long double product to a float causes the 0.5f to be promoted to long double first.)
// closer to rounding but may mislead
significand = fnorm * (1LL<<significandbits) + 0.5f;
// better
significand = fnorm * (1LL<<significandbits) + 0.5L; // or 0.5l or simply 0.5
To round without calling the preferred <math.h> rounding routines like rintl(), roundl(), nearbyintl(), llrintl(), adding an explicitly typed 0.5 is still a weak attempt at rounding. It is weak because it rounds incorrectly in many cases. The +0.5 trick relies on that sum being exact.
Consider
long double product = fnorm * (1LL<<significandbits);
long long significand = product + 0.5; // double rounding?
product + 0.5 itself may go through a rounding before truncation/assignment to long long - in effect double rounding.
Best to use the right tool in the C shed of standard library functions.
significand = llrintl(fnorm * (1ULL<<significandbits));
A corner case that remains with this rounding is where significand is now one too great, so that significand and exp need adjustment. As well identified by @Nayuki, the code has other shortcomings too. Also, it fails on -0.0.
The + 0.5f serves no purpose in the code, and may be harmful or misleading.
The expression (1LL<<significandbits) + 0.5f results in a float. But even for the small case of significandbits = 23 for single-precision floating-point, the expression evaluates to (float)(2^23 + 0.5), which rounds to exactly 2^23 (round half even).
Replacing + 0.5f with + 0.0f results in the same behavior. Heck, drop that term entirely, because fnorm will cause the right-hand operand of * to be converted to long double anyway. This would be a better way to rewrite the line: long long significand = fnorm * (long double)(1LL << significandbits);
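The vanishing 0.5f is easy to see in isolation. A tiny sketch follows (scale_with_half is my own name; it assumes float expressions are really rounded to float, i.e. FLT_EVAL_METHOD is 0):

```c
/* The scale factor as the original line computes it: long long + float -> float. */
float scale_with_half(unsigned significandbits)
{
    return (1LL << significandbits) + 0.5f;
}
```

For significandbits = 23 the sum 8388608.5 is a tie between two floats and rounds to the even neighbour 8388608.0f, so the 0.5 disappears. For a small width such as 10 the sum 1024.5 is representable and the 0.5 survives, which would then scale the significand by the wrong factor.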
Side note: This implementation of pack754() handles zero correctly (and collapses negative zero to positive zero), but mishandles subnormal numbers (wrong bits), infinities (infinite loop), and NaN (wrong bits). It's best to not treat it as a reference model function.
I have seen this code:
(int)(num < 0 ? (num - 0.5) : (num + 0.5))
(How to round floating point numbers to the nearest integer in C?)
for rounding but I need to use float and precision for three digits after the point.
Examples:
254.450 should be rounded up to 255.
254.432 should be rounded down to 254
254.448 should be rounded down to 254
and so on.
Notice: This is what I mean by "3 digits" the bold digits after the dot.
I believe it should be faster than roundf(), because I perform many hundreds of thousands of rounding operations. Do you have some tips on how to do that? I tried to find the source of roundf but found nothing.
Note: I need it for RGB2HSV conversion function so I think 3 digits should be enough. I use positive numbers.
"it should be faster then roundf()" is only verifiable with profiling various approaches.
To round to 0 places (round to nearest whole number), use roundf()
float f;
float f_rounded0 = roundf(f);
To round to 3 places using float, use round()
The round functions round their argument to the nearest integer value in floating-point format, rounding halfway cases away from zero, regardless of the current rounding direction.
#include <math.h>
float f;
float f_rounded3 = round(f * 1000.0)/1000.0;
The code purposely uses the intermediate type of double; otherwise the code could use the following, with reduced range:
float f_rounded3 = roundf(f * 1000.0f)/1000.0f;
If code is having trouble rounding 254.450 to 255.0 using roundf() or various tests, it is likely because the value is not 254.450, but a float close to it like 254.4499969, which rounds to 254. Typical floating point uses a binary format, in which 254.450 is not exactly representable.
You can use a double conversion, float -> string -> float, where the first conversion keeps 3 digits after the point:
sprintf(tmpStr, "%.3f", num);
This works for me:
#include <stdio.h>
int main(int ac, char**av)
{
float val = 254.449f;
float val2 = 254.450f;
int res = (int)(val < 0 ? (val - 0.55f) : (val + 0.55f));
int res2 = (int)(val2 < 0 ? (val2 - 0.55f) : (val2 + 0.55f));
printf("%f %d %d\n", val, res, res2);
return 0;
}
output : 254.449005 254 255
To increase the precision, just append as many 5s as you want to 0.55f, like 0.555f, 0.5555f, etc.
I wanted something like this:
float num = 254.454300;
float precision=10;
float p = 10*precision;
num = (int)(num * p + 0.5) / p ;
But the result will be inaccurate (with error) - my x86 machine gives me this result: 254.449997
When you change the border from b=0.5 to b=0.45, you must know that for positive values the rounded value is round_0(x,b)=(int)( x+(1-b) ); therefore b=0.45 ⟹ round_0(x)=(int)(x+0.55), and you can treat the sign separately. But remember that 254.45 does not exist, only 254.449997 (float) and 254.449999999999989 (double), so maybe you prefer to use b=0.4495.
If you have float round_0(float) for zero-digit rounding (it can be like the one you show in the question), you can do one-, two-, ... n-digit rounding like this in C/C++: # define round_n(x,n) (round_0((x)*1e##n)/1e##n).
round_1( x , b ) = round_0( 10*x ,b)/10
round_2( x , b ) = round_0( 100*x ,b)/100
round_3( x , b ) = round_0( 1000*x ,b)/1000
round_n( x , b , n ) = round_0( (10^n)*x ,b)/(10^n)
But converting to int and (one more conversion) back to float is slower than rounding with floating-point operations. If the compiler does not simplify the add/sub pair (some compilers have a setting for this), you can do a faster zero-digit round in the float type like this:
inline float round_0( float x , float b=0.5f ){
return (( x+(0.5f-b) )+(3<<22))-(3<<22) ; // or (( x+(0.5f-b) )-(3<<22))+(3<<22) ;
}
inline double round_0( double x , double b=0.5 ){
return (( x+(0.5-b) )+(3LL<<51))-(3LL<<51) ; // or (( x+(0.5-b) )-(3LL<<51))+(3LL<<51) ;
}
When b=0.5 it correctly rounds to the nearest integer if |x|<=2^23 (float) or |x|<=2^52 (double). But if the compiler uses the x87 FPU (ten-byte floating point) and optimizes loads, then the constant is 3.0*(1ull<<63), it works for |x|<=2^64, and using long double can be faster.
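A plain C version of the float trick with b = 0.5 is sketched below (round_magic is my own name). The constant 12582912.0f is 3<<22, i.e. 1.5 * 2^23; adding it pushes the sum into a range where floats are spaced exactly 1 apart, so the addition itself performs the rounding. I have only reasoned through the case |x| < 2^22, and the result follows the current rounding mode (ties to even by default), not the ties-away rule of roundf().

```c
/* Round to the nearest integer by adding and subtracting 1.5 * 2^23.
   volatile keeps the intermediate in float and stops the compiler from
   cancelling the add/sub pair. */
float round_magic(float x)
{
    volatile float t = x + 12582912.0f;
    return t - 12582912.0f;
}
```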
Typically, Rounding to 2 decimal places is very easy with
printf("%.2lf",<variable>);
However, the rounding system will usually round to the nearest even. For example,
2.554 -> 2.55
2.555 -> 2.56
2.565 -> 2.56
2.566 -> 2.57
And what I want to achieve is that
2.555 -> 2.56
2.565 -> 2.57
In fact, rounding half-up is doable in C, but for Integer only;
int a = (int)(b+0.5)
So, I'm asking for how to do the same thing as above with 2 decimal places on positive values instead of Integer to achieve what I said earlier for printing.
It is not clear whether you actually want to "round half-up", or rather "round half away from zero", which requires different treatment for negative values.
Single precision binary float is precise to at least 6 decimal places, and 15 for double, so nudging a FP value by DBL_EPSILON (defined in float.h) will cause a round-up to the next 100th by printf( "%.2lf", x ) for n.nn5 values, without affecting the displayed value for values that are not n.nn5.
double x2 = x * (1 + DBL_EPSILON) ; // round half-away from zero
printf( "%.2lf", x2 ) ;
For different rounding behaviours:
double x2 = x * (1 - DBL_EPSILON) ; // round half-toward zero
double x2 = x + DBL_EPSILON ; // round half-up
double x2 = x - DBL_EPSILON ; // round half-down
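Here is the multiplicative nudge wrapped into something testable. This is a sketch only: format_half_away and prints_as are names I made up, and it assumes a printf family that rounds the exact binary value correctly, as glibc does.

```c
#include <float.h>
#include <stdio.h>
#include <string.h>

/* Format x with two decimals after nudging it away from zero. */
void format_half_away(char *buf, size_t size, double x)
{
    double x2 = x * (1 + DBL_EPSILON);
    snprintf(buf, size, "%.2f", x2);
}

/* 1 if the nudged x prints as `expected`. */
int prints_as(double x, const char *expected)
{
    char buf[64];
    format_half_away(buf, sizeof buf, x);
    return strcmp(buf, expected) == 0;
}
```

For example, 2.565 is stored slightly below 2.565 and would print as 2.56; after the nudge it sits slightly above and prints as 2.57, while a value such as 2.554 is unaffected.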
Following is precise code to round a double to the nearest 0.01 double.
The code functions like x = round(100.0*x)/100.0; except that it uses manipulations to ensure that scaling by 100.0 is done exactly, without precision loss.
Likely this is more code than OP is interested, but it does work.
It works for the entire double range -DBL_MAX to DBL_MAX. (still should do more unit testing).
It depends on FLT_RADIX == 2, which is common.
#include <float.h>
#include <math.h>
#include <stdio.h>
void r100_best(const char *s) {
double x;
sscanf(s, "%lf", &x);
// Break x into whole number and fractional parts.
// Code only needs to round the fractional part.
// This preserves the entire `double` range.
double xi, xf;
xf = modf(x, &xi);
// Multiply the fractional part by N (256).
// Break into whole and fractional parts.
// This provides the needed extended precision.
// N should be >= 100 and a power of 2.
// The multiplication by a power of 2 will not introduce any rounding.
double xfi, xff;
xff = modf(xf * 256, &xfi);
// Multiply both parts by 100.
// *100 incurs 7 more bits of precision, of which the preceding code
// ensures the 8 LSbit of xfi, xff are zero.
int xfi100, xff100;
xfi100 = (int) (xfi * 100.0);
xff100 = (int) (xff * 100.0); // Cast here will truncate (towards 0)
// sum the 2 parts.
// sum is the exact truncate-toward-0 version of xf*256*100
int sum = xfi100 + xff100;
// add in half N
if (sum < 0)
sum -= 128;
else
sum += 128;
xf = sum / 256;
xf /= 100;
double y = xi + xf;
printf("%6s %25.22f ", "x", x);
printf("%6s %25.22f %.2f\n", "y", y, y);
}
int main(void) {
r100_best("1.105");
r100_best("1.115");
r100_best("1.125");
r100_best("1.135");
r100_best("1.145");
r100_best("1.155");
r100_best("1.165");
return 0;
}
[Edit] OP clarified that only the printed value needs rounding to 2 decimal places.
OP's observation that rounding of numbers "half-way" per a "round to even" or "round away from zero" is misleading. Of 100 "half-way" numbers like 0.005, 0.015, 0.025, ... 0.995, only 4 are typically exactly "half-way": 0.125, 0.375, 0.625, 0.875. This is because floating-point number formats use base-2 and numbers like 2.565 cannot be exactly represented.
Instead, sample numbers like 2.565 have as the closest double value of 2.564999999999999947... assuming binary64. Rounding that number to nearest 0.01 should be 2.56 rather than 2.57 as desired by OP.
Thus only numbers ending with 0.125 and 0.625 are exactly half-way and round down rather than up as desired by OP. Suggest to accept that and use:
printf("%.2lf",variable); // This should be sufficient
To get close to OP's goal, numbers could be A) tested against ending with 0.125 or 0.625 or B) increased slightly. The smallest increase would be
#include <math.h>
printf("%.2f", nextafter(x, 2*x));
Another nudge method is found in @Clifford's answer.
[Former answer that rounds a double to the nearest double multiple of 0.01]
Typical floating-point uses formats like binary64, which employs base-2. "Rounding to the nearest mathematical 0.01 with ties away from 0.0" is challenging.
As @Pascal Cuoq mentions, floating-point numbers like 2.555 typically are only near 2.555 and have a more precise value like 2.555000000000000159872... which is not half-way.
@BLUEPIXY's solution below is best and practical.
x = round(100.0*x)/100.0;
"The round functions round their argument to the nearest integer value in floating-point
format, rounding halfway cases away from zero, regardless of the current rounding direction." C11dr §7.12.9.6.
The ((int)(100 * (x + 0.005)) / 100.0) approach has 2 problems: it may round in the wrong direction for negative numbers (OP did not specify) and integers typically have a much smaller range (INT_MIN to INT_MAX) than double.
There are still some cases, like double x = atof("1.115");, which end up near 1.12 when the result really should be 1.11, because 1.115 as a double is really closer to 1.11 and not "half-way".
string x rounded x
1.115 1.1149999999999999911182e+00 1.1200000000000001065814e+00
OP has not specified rounding of negative numbers, assuming y = -f(-x).