Explain this code in K&R 2-1 (C)

I'm trying to determine the range of the various floating-point types. When I read this code:
#include <stdio.h>

int main(void)
{
    float fl, fltest, last;
    double dbl, dbltest, dblast;

    fl = 0.0;
    fltest = 0.0;
    while (fl == 0.0) {
        last = fltest;
        fltest = fltest + 1111e28;
        fl = (fl + fltest) - fltest;
    }
    printf("Maximum range of float variable: %e\n", last);

    dbl = 0.0;
    dbltest = 0.0;
    while (dbl == 0.0) {
        dblast = dbltest;
        dbltest = dbltest + 1111e297;
        dbl = (dbl + dbltest) - dbltest;
    }
    printf("Maximum range of double variable: %e\n", dblast);
    return 0;
}
I don't understand why the author added 1111e28 to the fltest variable.

The loop terminates when fltest reaches +Inf: at that point fl = (fl + fltest) - fltest becomes Inf - Inf, which is NaN, and NaN compares unequal to 0.0. last then holds a value which, when added to 1111e28, produces +Inf, so it is close to the upper limit of float.
1111e28 is chosen to reach +Inf reasonably quickly; it also needs to be large enough that the loop keeps making progress when fltest is large, i.e. it must be at least as large as the gap between the largest and second-largest finite float values.
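The termination mechanism can be seen in isolation; here is a minimal sketch of mine (not from the answer), assuming IEEE-754 semantics and INFINITY from <math.h>:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float fl = 0.0f;
    float fltest = INFINITY;       /* what fltest becomes after enough additions */
    fl = (fl + fltest) - fltest;   /* (0 + Inf) - Inf = Inf - Inf = NaN */
    printf("fl = %f, fl == 0.0 is %d\n", fl, fl == 0.0f);  /* prints nan and 0 */
    return 0;
}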

OP: ... why the author added 1111e28 to the fltest variable?
A: [Edit] For the code to work using float, this delta value (1111e28, i.e. 1.111e31) needs careful selection. It must be big enough that if fltest were FLT_MAX, the sum fltest + delta would overflow and become float infinity. With round-to-nearest mode, the minimum such delta is FLT_MAX*FLT_EPSILON/4. On my machine:
min_delta            1.014120601e+31   (half the step between the 2nd-largest float and FLT_MAX)
FLT_MAX              3.402823466e+38
1/FLT_EPSILON        8.388608000e+06
FLT_MAX*FLT_EPSILON  4.056481679e+31
delta also needs to be small enough that if fltest were the 2nd-largest float, adding delta would not jump right past FLT_MAX to float infinity. That limit is 3x min_delta:
max_delta            3.042361441e+31
So the constraint is 1.014120601e+31 <= 1111e28 < 3.042361441e+31.
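Those two bounds can be recomputed from <float.h> alone. A quick sketch of mine (not chux's code) checking that 1111e28 falls in the window:

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* Smallest delta that pushes FLT_MAX over to infinity under round-to-nearest. */
    double min_delta = (double)FLT_MAX * FLT_EPSILON / 4;
    /* Largest delta that still lands on FLT_MAX instead of skipping past it. */
    double max_delta = 3 * min_delta;
    printf("min_delta = %e\nmax_delta = %e\n", min_delta, max_delta);
    printf("1111e28 in range? %d\n", min_delta <= 1111e28 && 1111e28 < max_delta);
    return 0;
}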
@david.pfx: Yes. 1111e28 is a cute number, and it is in range.
Note: Complications occur when the math and its intermediate values, even though the variables are float, are calculated at higher precision such as double. This is allowed in C and is controlled by FLT_EVAL_METHOD or very careful coding.
1111e28 is a curious value that makes sense if the author already knew the general range of FLT_MAX.
The below code is expected to loop many times (24946069 on one test platform). Eventually the value fltest becomes "infinite". Then fl becomes NaN as the difference Infinity - Infinity, and the while loop ends because NaN != 0.0. (per @ecatmur)
while (fl == 0.0) {
    last = fltest;
    fltest = fltest + 1111e28;
    fl = (fl + fltest) - fltest;
}
The looping, if done in small enough increments, arrives at a precise answer. Prior knowledge of FLT_MAX and FLT_EPSILON is needed to ensure this.
The problem with this is that C does not define FLT_MAX and DBL_MAX beyond requiring that they be at least 1E+37. So if the maximum value were quite large, the increment 1111e28 or 1111e297 would have no effect. Example: dbltest = dbltest + 1111e297; with dbltest = 1e400 would certainly not increase 1e400 unless dbltest had a hundred decimal digits of precision.
If DBL_MAX were smaller than 1111e297, the method would fail too. (Note: on simple platforms in 2014, it is not surprising to find double and float both implemented as 4-byte IEEE binary32.) The first time through the loop, dbltest would become infinity and the loop would stop, reporting "Maximum range of double variable: 0.000000e+00".
There are many ways to efficiently derive the maximum floating-point value. A sample follows that uses a random initial value to help show its resilience to a potentially different FLT_MAX.
#include <math.h>
#include <stdlib.h>

float float_max(void) {
    /* Start from a random value in [1, 2] so the search does not depend on FLT_MAX. */
    float nextx = 1.0f + (float)rand() / RAND_MAX;
    float x;
    /* Double until overflow; x keeps the last finite value reached. */
    do {
        x = nextx;
        nextx *= 2;
    } while (!isinf(nextx));
    /* Walk back up toward the edge in ever-smaller steps. */
    float delta = x;
    do {
        nextx = x + delta / 2;
        if (!isinf(nextx)) {
            x = nextx;
        }
        delta /= 2;
    } while (delta >= 1.0f);
    return x;
}
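A possible usage sketch (mine; it assumes the float_max() above and seeds rand() first):

#include <float.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    srand((unsigned)time(NULL));
    printf("float_max() = %e\n", float_max());
    printf("FLT_MAX     = %e\n", FLT_MAX);  /* the two are expected to match */
    return 0;
}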
isinf() is a C99 addition (<math.h>); simple enough to roll your own if needed.
In re @didierc's comment:
[Edit]
The precision of float and double is implied by their "epsilon": "the difference between 1 and the least value greater than 1 that is representable in the given floating point type ...". The maximum values the C standard permits are:
FLT_EPSILON 1E-5
DBL_EPSILON 1E-9
Per @Pascal Cuoq's comment ("... 1111e28 being chosen larger than FLT_MAX*FLT_EPSILON"), 1111e28 needs to be on the order of FLT_MAX*FLT_EPSILON to impact the loop's addition, yet small enough to precisely reach the number before infinity. Again, prior knowledge of FLT_MAX and FLT_EPSILON is needed to make this determination. If these values are known ahead of time, then the code could simply have been:
printf("Maximum range of float variable: %e\n", FLT_MAX);

The largest value representable in a float is 3.40282e+38. The constant 1111e28 is chosen so that adding it to a number in the range of 10^38 still produces a different floating-point value, so that fltest continues to increase as the loop runs. It needs to be large enough to remain significant at the 10^38 scale, and small enough that the result stays accurate.
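To illustrate (a sketch of mine, assuming IEEE-754 binary32 arithmetic with FLT_EVAL_METHOD == 0): near 3e38 a float's step (ULP) is 2^104, about 2.03e31, so increments below half a step are simply absorbed:

#include <stdio.h>

int main(void)
{
    float big = 3e38f;
    float up   = big + 1111e28f;  /* 1.111e31 exceeds half a step: value changes */
    float same = big + 1e28f;     /* 1e28 is below half a step: rounded away */
    printf("%d %d\n", up != big, same == big);  /* prints 1 1 */
    return 0;
}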

Related

Underflow error in floating point arithmetic in C

I am new to C, and my task is to create a function
f(x) = sqrt[(x^2)+1]-1
that can handle very large numbers and very small numbers. I am submitting my script on an online interface that checks my answers.
For very large numbers I simplify the expression to:
f(x) = x-1
By just using the highest power. This was the correct answer.
The same logic does not work for smaller numbers. For small numbers (on the order of 1e-7), they are very quickly truncated to zero, even before they are squared. I suspect that this has to do with floating-point precision in C. In my textbook, it says that the float type has a smallest possible value of 1.17549e-38, with 6-digit precision. So although 1e-7 is much larger than 1.17e-38, it requires more precision, and is therefore rounded to zero. This is my guess; correct me if I'm wrong.
As a solution, I am thinking that I should convert x to a long double when x < 1e-6. However when I do this, I still get the same error. Any ideas? Let me know if I can clarify. Code below:
#include <math.h>
#include <stdio.h>

double feval(double x) {
    /* Insert your code here */
    if (x > 1e299)
    {
        return x-1;
    }
    if (x < 1e-6)
    {
        long double g;
        g = x;
        printf("x = %Lf\n", g);
        long double a;
        a = pow(x,2);
        printf("x squared = %Lf\n", a);
        return sqrt(g*g+1.)- 1.;
    }
    else
    {
        printf("x = %f\n", x);
        printf("Used third \n");
        return sqrt(pow(x,2)+1.)-1;
    }
}

int main(void)
{
    double x;
    printf("Input: ");
    scanf("%lf", &x);
    double b;
    b = feval(x);
    printf("%f\n", b);
    return 0;
}
For small inputs, you're getting truncation error when you do 1+x^2. If x = 1e-7f, x*x will happily fit into a 32-bit floating-point number (with a little bit of error due to the fact that 1e-7 does not have an exact floating-point representation), but x*x will be so much smaller than 1 that floating-point precision will not be sufficient to represent 1 + x*x.
It would be more appropriate to do a Taylor expansion of sqrt(1+x^2), which to lowest order would be
sqrt(1+x^2) = 1 + 0.5*x^2 + O(x^4)
Then, you could write your result as
sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4),
avoiding the scenario where you add a very small number to 1.
As a side note, you should not use pow for integer powers. For x^2, you should just write x*x. Arbitrary integer powers are a little trickier to do efficiently; the GNU Scientific Library, for example, has a function for efficiently computing arbitrary integer powers.
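A hedged sketch of that approach (my code, not the answerer's); the 1e-4 cutoff is illustrative, chosen so the dropped O(x^4) term is negligible:

#include <math.h>
#include <stdio.h>

/* f(x) = sqrt(x*x + 1) - 1, using the Taylor form 0.5*x*x for small x */
double feval_taylor(double x)
{
    if (fabs(x) < 1e-4)
        return 0.5 * x * x;
    return sqrt(x * x + 1.0) - 1.0;
}

int main(void)
{
    printf("%.17g\n", feval_taylor(1e-7));  /* ~5e-15 rather than 0 */
    return 0;
}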
There are two issues here when implementing this in the naive way: overflow or underflow in the intermediate computation of x * x, and subtractive cancellation during the final subtraction of 1. The second issue is an accuracy issue.
ISO C has a standard math function hypot(x, y) that performs the computation sqrt(x*x + y*y) accurately while avoiding underflow and overflow in intermediate computation. A common approach to fixing issues with subtractive cancellation is to transform the computation algebraically such that it turns into multiplications and/or divisions.
Combining these two fixes leads to the following implementation for float argument. It has an error of less than 3 ulps across all possible inputs according to my testing.
/* Compute sqrt(x*x+1)-1 accurately and without spurious overflow or underflow */
#include <math.h>

float func (float x)
{
    return (x / (1.0f + hypotf (x, 1.0f))) * x;
}
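A quick usage sketch (mine; it assumes the func() above, with check values of my own choosing):

#include <stdio.h>

int main(void)
{
    printf("%g\n", func(1e-8f));  /* ~5e-17; the naive form loses this to cancellation */
    printf("%g\n", func(1e20f));  /* ~1e20; the naive x*x would overflow a float here */
    return 0;
}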
A trick that is often useful in these cases is based on the identity
(a+1)*(a-1) = a*a-1
In this case
sqrt(x*x+1)-1 = (sqrt(x*x+1)-1)*(sqrt(x*x+1)+1) / (sqrt(x*x+1)+1)
              = (x*x+1-1) / (sqrt(x*x+1)+1)
              = x*x / (sqrt(x*x+1)+1)
The last formula can be used as an implementation. For very small x, sqrt(x*x+1)+1 will be close to 2 (for small enough x it will be exactly 2), but we don't lose precision in evaluating it.
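A minimal sketch of that formula (mine, not the answerer's code); note x*x can still overflow for huge x, so this form targets the small-x case under discussion:

#include <math.h>
#include <stdio.h>

double feval_stable(double x)
{
    return x * x / (sqrt(x * x + 1.0) + 1.0);  /* no subtraction, so no cancellation */
}

int main(void)
{
    printf("%.17g\n", feval_stable(1e-7));  /* ~5e-15 instead of 0 */
    return 0;
}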
The problem isn't with running into the minimum value, but with the precision.
As you said yourself, float on your machine has about 7 digits of precision. So let's take x = 1e-7, so that x^2 = 1e-14. That's still well within the range of float, no problems there. But now add 1. The exact answer would be 1.00000000000001. But if we only have 7 digits of precision, this gets rounded to 1.0000000, i.e. exactly 1. So you end up computing sqrt(1.0)-1 which is exactly 0.
One approach would be to use the linear approximation of sqrt around x=1, sqrt(x) ~ 1 + 0.5*(x-1). That leads to the approximation f(x) ~ 0.5*x^2.
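The rounding step is easy to demonstrate directly (a sketch of mine, mirroring the x = 1e-7 example above):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float x = 1e-7f;
    float s = 1.0f + x * x;           /* exact value would be 1.00000000000001 */
    printf("%d\n", s == 1.0f);        /* 1: the 1e-14 is below float's ~7-digit precision */
    printf("%g\n", sqrtf(s) - 1.0f);  /* 0, although f(x) is about 5e-15 */
    return 0;
}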

IEEE-754: "smallest" overflow condition

Before I start, just some background information:
I'm running a bare-metal application on an ARM7 microcontroller (LPC2294/01) compiled in Keil uVision3, using the compiler standard math library (which is IEEE-754 compliant).
The issue:
I'm having trouble wrapping my head around what exactly constitutes an 'overflow' on the sum of 2 single-precision floating point inputs.
Initially, I was under the impression that if I attempted to add any positive value to the largest value that can be represented by IEEE-754 notation, the result would generate an overflow exception.
So for instance, suppose I have:
a = 0x7f7fffff (ie. 3.4028235..E38);
b = 0x3f800000 (ie. 1.0)
I expected that summing these two values would result in overflow as defined in IEEE-754. To my initial surprise, the result simply returned the value of 'a' with no exception being flagged.
So then I thought, since the precision (or resolution if you prefer) decreases as the value being represented increases, it's likely the value '1' in this case is being effectively rounded down to 0 due to its relative insignificance.
So that begged the question: What would be the smallest value of 'b' in this case that would cause an overflow exception? Does it depend on the specific implementation of IEEE-754?
Maybe it's as simple as me not understanding how to determine the minimum 'significant' precision in this particular case, but given the code below, why would the second sum cause an overflow and not the first?
static union sFloatConversion32
{
    unsigned int unsigned32Value;
    float floatValue;
} sFloatConversion32;

t_bool test_Float32_Addition(void)
{
    float a;
    float b;
    float c;

    sFloatConversion32.unsigned32Value = 0x7f7fffff;
    a = sFloatConversion32.floatValue;
    sFloatConversion32.unsigned32Value = 0x72ffffff;
    b = sFloatConversion32.floatValue;
    /* This sum returns (c = a) without overflow */
    c = a + b;

    sFloatConversion32.unsigned32Value = 0x73000000;
    b = sFloatConversion32.floatValue;
    /* This sum, however, causes an overflow exception */
    c = a + b;
}
Is there a generalized rule that can be applied such that it would be possible to know ahead of time (ie. without performing the sum), that given two floats, their sum will cause an overflow as defined by IEEE-754?
Overflow occurs when the result is affected by the range of the format. As long as normal rounding keeps the result within the finite range, no overflow occurs, because the result is the same as it would be if the exponent were unbounded—the result was reduced by the normal rounding, before range was considered. So there is no exception due to range.
When the rounded result does not fit into the finite range of the format, then a finite result cannot be produced, so an overflow exception occurs and infinity is produced.
In IEEE 754, a normal operation is in effect two steps:
Calculate the exact mathematical result.
Round the exact mathematical result to the nearest representable value.
IEEE 754 defines overflow to occur if and only if the result of the above steps exceeds in magnitude the largest representable finite value. In other words, overflow does not occur merely because you went above the largest representable value, but only if you go so far above it that the normal way floating-point arithmetic works breaks down.
So, if you start with the largest representable value and add a small number to it, the result would simply round to the largest representable value anyway (when using round-to-nearest). IEEE 754 regards this as normal—all arithmetic operations round, and if that rounding kept the result in bounds, that is normal and unexceptionable. Even if the exponent range were unbounded, normal rounding would have produced the same result. Since this is a normal result not affected by the limited range, nothing exceptional has occurred.
Overflow occurs only when the mathematical result is so large that rounding would produce the next higher number if we were not limited by the exponent. (But, since we have reached the limits of the exponent range, we must return infinity.)
The largest representable value in IEEE-754 basic 32-bit binary floating-point is 2^128 - 2^104. At this point, the steps between representable numbers are in units of 2^104. With the round-to-nearest rule, adding any number less than half a step, 2^103, to this will round to 2^128 - 2^104, and no overflow occurs. If you add a number greater than 2^103, then the result would round to 2^128 if the exponent could go that high. Instead, infinity is produced and an overflow exception occurs. (If you add exactly 2^103, the rule for ties is used. This rule says to choose the candidate with the even low bit. That produces 2^128, so it also overflows.)
So, with round-to-nearest, overflow occurs at the midpoint of a step. With other rounding rules, overflow occurs at different points. With round-toward-infinity (round up), adding any positive value, even 2^-149, to 2^128 - 2^104 will cause an overflow. With round-toward-zero, adding any value less than 2^104 to 2^128 - 2^104 will not overflow.
Does it depend on the specific implementation of IEEE-754?
Yes, and it also depends on the rounding mode active at the time.
Consider the step between the float just below FLT_MAX and FLT_MAX itself.
float max = FLT_MAX;
float before_max = nextafterf(max, 0.0f);
float delta = max - before_max;
printf("max:   %-20a %.*g\n", max, FLT_DECIMAL_DIG, max);
printf("b4max: %-20a %.*g\n", before_max, FLT_DECIMAL_DIG, before_max);
printf("1st d: %-20a %.*g\n", delta, FLT_DECIMAL_DIG, delta);
// Typical output
// max:   0x1.fffffep+127    3.40282347e+38
// b4max: 0x1.fffffcp+127    3.40282326e+38
// 1st d: 0x1p+104           2.02824096e+31
The largest float is about twice m0, the smallest float that has the same step size (ULP). Think of m0 as this value with all its explicit significand bits cleared, versus FLT_MAX with them all set.
float m0 = nextafterf(max/2, max);
printf("m0:    %-20a %.*g\n", m0, FLT_DECIMAL_DIG, m0);
// m0:    0x1p+127           1.70141183e+38
Now compare this to FLT_EPSILON, the smallest step from 1.0 to the next larger float:
float eps = FLT_EPSILON;
printf("epsil: %- 20a %.*g\n", eps, FLT_DECIMAL_DIG, eps);
// Output
// epsil: 0x1p-23 1.1920929e-07
Notice the ratio delta/m0 is FLT_EPSILON.
float r = delta/m0;
printf("r:     %-20a %.*g\n", r, FLT_DECIMAL_DIG, r);
// r:     0x1p-23            1.1920929e-07
Consider the typical rounding mode of round to nearest, ties to even.
Now let us try adding delta/2 to FLT_MAX, and then try adding the next smaller float instead.
float sum = max + delta/2;
printf("sum:   %-20a %.*g\n", sum, FLT_DECIMAL_DIG, sum);
sum = max + nextafterf(delta/2, 0.0f);
printf("sum:   %-20a %.*g\n", sum, FLT_DECIMAL_DIG, sum);
// sum:   inf                inf
// sum:   0x1.fffffep+127    3.40282347e+38
Re the question "IEEE-754: smallest overflow condition": we can see the smallest delta is about FLT_MAX*1/2*1/2*FLT_EPSILON.
float small = FLT_MAX*0.25f*FLT_EPSILON;
printf("small: %- 20a %.*g\n", small, FLT_DECIMAL_DIG, small);
printf("sum: % -20a %.*g\n", max+small, FLT_DECIMAL_DIG, max+small);
small = nextafterf(small, max);
printf("sum: % -20a %.*g\n", max+small, FLT_DECIMAL_DIG, max+small);
// sum: 0x1.fffffep+127 3.40282347e+38
// sum: inf inf
Given the various possible encodings for float, your results may differ, yet this approach gives an idea of how to determine the smallest delta that causes overflow.
Run this program long enough and see what will happen:
float x = 10000000.0f;
while (1)
{
    printf("%f\n", x);
    x += 1.0f;  /* once x reaches 2^24 = 16777216, adding 1.0f rounds back to x */
}
I think it will answer your question.

Understanding the maximum values that can be stored in floats in C

I have come across some behaviour with the float type in C that I do not understand, and was hoping it might be explained. Using the macros defined in float.h I can determine the maximum/minimum values that the datatype can store on the given hardware. However, when performing a calculation that should not exceed these limits, I find that a float variable fails where a double succeeds.
The following is a minimal example, which compiles on my machine.
#include <stdio.h>
#include <stdlib.h>
#include <float.h>

int main(int argc, char **argv)
{
    int gridsize;
    long gridsize3;
    float *datagrid;
    float sumval_f;
    double sumval_d;
    long i;

    gridsize = 512;
    gridsize3 = (long)gridsize*gridsize*gridsize;

    datagrid = calloc(gridsize3, sizeof(float));
    if(datagrid == NULL)
    {
        free(datagrid);
        printf("Memory allocation failed\n");
        exit(0);
    }

    for(i=0; i<gridsize3; i++)
    {
        datagrid[i] += 1.0;
    }

    sumval_f = 0.0;
    sumval_d = 0.0;
    for(i=0; i<gridsize3; i++)
    {
        sumval_f += datagrid[i];
        sumval_d += (double)datagrid[i];
    }

    printf("\ngridsize3 = %e\n", (float)gridsize3);
    printf("FLT_MIN = %e\n", FLT_MIN);
    printf("FLT_MAX = %e\n", FLT_MAX);
    printf("DBL_MIN = %e\n", DBL_MIN);
    printf("DBL_MAX = %e\n", DBL_MAX);
    printf("\nfloat sum = %f\n", sumval_f);
    printf("double sum = %lf\n", sumval_d);
    printf("sumval_d/sumval_f = %f\n\n", sumval_d/(double)sumval_f);

    free(datagrid);
    return(0);
}
Compiling with gcc I find the output:
gridsize3 = 1.342177e+08
FLT_MIN = 1.175494e-38
FLT_MAX = 3.402823e+38
DBL_MIN = 2.225074e-308
DBL_MAX = 1.797693e+308
float sum = 16777216.000000
double sum = 134217728.000000
sumval_d/sumval_f = 8.000000
Whilst compiling with icc, sumval_f = 67108864.0 and hence the final ratio is instead 2.0.* Note that the float sum is incorrect, whilst the double sum is correct.
As far as I can tell, the output of FLT_MAX suggests that the sum should fit into a float, and yet it seems to plateau at either an eighth or a half of the full value.
Is there a compiler specific override to the values found using float.h?
Why is a double required to correctly find the sum of this array?
*Interestingly, including an if statement inside the for loop that prints values of the array causes the value to match the gcc output, i.e. an eighth of the correct sum, rather than a half.
The problem here isn't the range of values but the precision.
Assuming a 32-bit IEEE754 float, this datatype has a maximum of 24 bits of precision. This means that not all integers larger than 16777216 can be represented exactly.
So when your sum reaches 16777216, adding 1 to it is outside the precision of what the datatype can store, so the number doesn't get any bigger.
A (presumably) 64-bit double has 53 bits of precision. This is enough bits to hold all integer values up to your sum of 134217728, so it gives you an accurate result.
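The plateau is easy to reproduce in isolation (a sketch of mine, assuming IEEE-754 binary32 with FLT_EVAL_METHOD == 0):

#include <stdio.h>

int main(void)
{
    float sum = 16777216.0f;   /* 2^24: above this, the float step size exceeds 1 */
    float t = sum + 1.0f;      /* the ULP here is 2, so +1 rounds back down */
    printf("%d\n", t == sum);  /* prints 1 */
    return 0;
}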
A float can precisely represent any integer between -16777215 and +16777215, inclusive. It can also represent all even integers between -2*16777215 and +2*16777215 (including +/- 2*8388608, i.e. 16777216), all multiples of 4 between -4*16777215 and +4*16777215, and likewise for all power-of-two scaling factors up to 2^104 (roughly 2.028E+31). Additionally, it can represent multiples of 1/2 from -16777215/2 to +16777215/2, multiples of 1/4 from -16777215/4 to +16777215/4, etc. down to multiples of 1/2^149 from -16777215/(2^149) to +16777215/(2^149).
Between any two numbers there are infinitely many real values, but a computer cannot hold an infinite number of distinct values. So a compromise is made: a floating-point number holds an approximation of the value.
This means that if you pick a value that is "more" than the stored floating-point number, but not by enough to arrive at the "next" storable approximation, then storing that logically bigger number won't actually change the floating-point value.
The "error" in a floating-point approximation is variable. For small numbers, the absolute error is small; for bigger numbers, the relative error is roughly the same, but the absolute error becomes bigger.

Round positive value half-up to 2 decimal places in C

Typically, rounding to 2 decimal places is very easy with
printf("%.2lf", <variable>);
However, the rounding system will usually round to the nearest even. For example,
2.554 -> 2.55
2.555 -> 2.56
2.565 -> 2.56
2.566 -> 2.57
And what I want to achieve is that
2.555 -> 2.56
2.565 -> 2.57
In fact, rounding half-up is doable in C, but only for integers:
int a = (int)(b + 0.5);
So, I'm asking how to do the same thing as above to 2 decimal places, on positive values instead of integers, to achieve what I said earlier for printing.
It is not clear whether you actually want to "round half-up", or rather "round half away from zero", which requires different treatment for negative values.
Single-precision binary float is precise to at least 6 decimal digits, and double to at least 15, so nudging an FP value by DBL_EPSILON (defined in float.h) will cause printf( "%.2lf", x ) to round up to the next 100th for n.nn5 values, without affecting the displayed value for values that are not n.nn5:
double x2 = x * (1 + DBL_EPSILON) ; // round half-away from zero
printf( "%.2lf", x2 ) ;
For different rounding behaviours:
double x2 = x * (1 - DBL_EPSILON) ; // round half-toward zero
double x2 = x + DBL_EPSILON ; // round half-up
double x2 = x - DBL_EPSILON ; // round half-down
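A quick check of the nudge (my sketch; the behaviour assumes IEEE-754 binary64 and round-to-nearest):

#include <float.h>
#include <stdio.h>

int main(void)
{
    double x = 2.555;              /* stored as 2.55499999999999971... */
    double x2 = x * (1 + DBL_EPSILON);
    printf("%.2f\n", x);           /* 2.55 */
    printf("%.2f\n", x2);          /* 2.56: nudged one ULP above the halfway point */
    return 0;
}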
Following is precise code that rounds a double to the nearest 0.01 double.
The code functions like x = round(100.0*x)/100.0; except that it uses manipulations to ensure the scaling by 100.0 is done exactly, without precision loss.
Likely this is more code than OP is interested in, but it does work.
It works for the entire double range -DBL_MAX to DBL_MAX (it still should get more unit testing).
It depends on FLT_RADIX == 2, which is common.
#include <float.h>
#include <math.h>
#include <stdio.h>

void r100_best(const char *s) {
    double x;
    sscanf(s, "%lf", &x);

    // Break x into whole number and fractional parts.
    // Code only needs to round the fractional part.
    // This preserves the entire `double` range.
    double xi, xf;
    xf = modf(x, &xi);

    // Multiply the fractional part by N (256).
    // Break into whole and fractional parts.
    // This provides the needed extended precision.
    // N should be >= 100 and a power of 2.
    // The multiplication by a power of 2 will not introduce any rounding.
    double xfi, xff;
    xff = modf(xf * 256, &xfi);

    // Multiply both parts by 100.
    // *100 incurs 7 more bits of precision; the preceding *256 step
    // ensured there is room for them, so no rounding occurs.
    int xfi100, xff100;
    xfi100 = (int) (xfi * 100.0);
    xff100 = (int) (xff * 100.0); // Cast here will truncate (towards 0)

    // Sum the 2 parts.
    // sum is the exact truncate-toward-0 version of xf*256*100
    int sum = xfi100 + xff100;

    // Add in half N.
    if (sum < 0)
        sum -= 128;
    else
        sum += 128;

    xf = sum / 256;  // Integer division: truncation toward 0 is intended here.
    xf /= 100;

    double y = xi + xf;
    printf("%6s %25.22f ", "x", x);
    printf("%6s %25.22f %.2f\n", "y", y, y);
}

int main(void) {
    r100_best("1.105");
    r100_best("1.115");
    r100_best("1.125");
    r100_best("1.135");
    r100_best("1.145");
    r100_best("1.155");
    r100_best("1.165");
    return 0;
}
[Edit] OP clarified that only the printed value needs rounding to 2 decimal places.
OP's observation that "half-way" numbers round per "round to even" or "round away from zero" is misleading. Of 100 "half-way" numbers like 0.005, 0.015, 0.025, ... 0.995, only 4 are typically exactly "half-way": 0.125, 0.375, 0.625, 0.875. This is because floating-point number formats use base 2, and numbers like 2.565 cannot be represented exactly.
Instead, sample numbers like 2.565 have as their closest double a value of 2.564999999999999947... assuming binary64. Rounding that number to the nearest 0.01 should give 2.56 rather than 2.57 as desired by OP.
Thus only numbers ending with 0.125 and 0.625 are exactly half-way, and they round down rather than up as desired by OP. Suggest accepting that and using:
printf("%.2lf",variable); // This should be sufficient
To get closer to OP's goal, numbers could be A) tested for ending in 0.125 or 0.625, or B) increased slightly. The smallest increase would be
#include <math.h>
printf("%.2f", nextafter(x, 2*x));
Another nudge method is found in @Clifford's answer.
[Former answer that rounds a double to the nearest double multiple of 0.01]
Typical floating point uses formats like binary64, which employs base 2. "Rounding to the nearest mathematical 0.01, ties away from 0.0" is challenging.
As @Pascal Cuoq mentions, floating-point numbers like 2.555 typically are only near 2.555, with a more precise value like 2.555000000000000159872..., which is not half-way.
@BLUEPIXY's solution below is best and practical.
x = round(100.0*x)/100.0;
"The round functions round their argument to the nearest integer value in floating-point
format, rounding halfway cases away from zero, regardless of the current rounding direction." C11dr §7.12.9.6.
The ((int)(100 * (x + 0.005)) / 100.0) approach has 2 problems: it may round in the wrong direction for negative numbers (OP did not specify), and integers typically have a much smaller range (INT_MIN to INT_MAX) than double.
There are still some cases, like double x = atof("1.115");, which end up near 1.12 when they really should be 1.11, because 1.115 as a double is really closer to 1.11 and not "half-way".
string  x                              rounded x
1.115   1.1149999999999999911182e+00   1.1200000000000001065814e+00
OP has not specified rounding of negative numbers; assuming y = -f(-x).

accuracy of sqrt of integers

I have a loop like this:
for(uint64_t i=0; i*i<n; i++) {
This requires doing a multiplication every iteration. If I could calculate the sqrt before the loop then I could avoid this.
unsigned cut = sqrt(n);
for(uint64_t i=0; i<cut; i++) {
In my case it's okay if the sqrt function rounds up to the next integer but it's not okay if it rounds down.
My question is: is the sqrt function accurate enough to do this for all cases?
Edit: Let me list some cases. If n is a perfect square, so that n = y^2, my question would be: is cut = sqrt(n) >= y for all n? If cut = y-1 then there is a problem. E.g. if n = 120 and cut = 10 it's okay, but if n = 121 (i.e. 11^2) and cut is still 10 then it won't work.
My first concern was that the fractional part of float only has 23 bits and double 52, so they can't store all the digits of some 32-bit or 64-bit integers. However, I don't think this is a problem. Let's assume we want the sqrt of some number y but we can't store all the digits of y. If we let the part of y we can store be x, we can write y = x + dx. Then we want to make sure that whatever dx we choose does not move us to the next integer:
sqrt(x+dx) < sqrt(x) + 1   // square both sides
x + dx < x + 2*sqrt(x) + 1 // subtract x
dx < 2*sqrt(x) + 1
// e.g. for x = 100, dx < 21:
// sqrt(100+20) < sqrt(100) + 1
Float can store 23 bits so we let y = 2^23 + 2^9. This is more than sufficient since 2^9 < 2*sqrt(2^23) + 1. It's easy to show this for double as well with 64-bit integers. So although they can't store all the digits, as long as the sqrt of what they can store is accurate, the sqrt of the stored fraction should be sufficient. Now let's look at what happens for integers close to UINT_MAX and their sqrt:
unsigned xi = -1-1;
printf("%u %u\n", xi, (unsigned)(float)xi); //4294967294 4294967295
printf("%u %u\n", (unsigned)sqrt(xi), (unsigned)sqrtf(xi)); //65535 65536
Since float can't store all the digits of 2^32-2 and double can, they get different results for the sqrt. But the float version of the sqrt is one integer larger. This is what I want. For 64-bit integers, as long as the sqrt of the double always rounds up it's okay.
First, integer multiplication is really quite cheap. So long as you have more than a few cycles of work per loop iteration and one spare execute slot, it should be entirely hidden by reordering on most non-tiny processors.
If you did have a processor with dramatically slow integer multiply, a truly clever compiler might transform your loop to:
for (uint64_t i = 0, j = 0; j < n; j += 2*i+1, i++)
replacing the multiply with an lea, or a shift and two adds (j tracks i*i via the identity (i+1)^2 = i^2 + 2*i + 1).
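A small sketch (mine) verifying that strength-reduction invariant:

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint64_t n = 1000000;
    for (uint64_t i = 0, j = 0; j < n; j += 2*i + 1, i++) {
        assert(j == i * i);  /* j tracks the square with no multiply in the loop test */
    }
    return 0;
}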
Those notes aside, let’s look at your question as stated. No, you can’t just use i < sqrt(n). Counter-example: n = 0x20000000000000. Assuming adherence to IEEE-754, you will have cut = 0x5a82799, but cut*cut is 0x1ffffff8eff971, which is less than n, so the loop would stop one iteration too early.
However, a basic floating-point error analysis shows that the error in computing sqrt(n) (before conversion to integer) is bounded by 3/4 of an ULP. So you can safely use:
uint32_t cut = sqrt(n) + 1;
and you’ll perform at most one extra loop iteration, which is probably acceptable. If you want to be totally precise, instead use:
uint32_t cut = sqrt(n);
cut += (uint64_t)cut*cut < n;
Edit: z boson clarifies that for his purposes, this only matters when n is an exact square (otherwise, getting a value of cut that is “too small by one” is acceptable). In that case, there is no need for the adjustment and one can safely just use:
uint32_t cut = sqrt(n);
Why is this true? It’s pretty simple to see, actually. Converting n to double introduces a perturbation:
double_n = n*(1 + e)
which satisfies |e| < 2^-53. The mathematical square root of this value can be expanded as follows:
square_root(double_n) = square_root(n)*square_root(1+e)
Now, since n is assumed to be a perfect square with at most 64 bits, square_root(n) is an exact integer with at most 32 bits, and is the mathematically precise value that we hope to compute. To analyze the square_root(1+e) term, use a Taylor series about 1:
square_root(1+e) = 1 + e/2 + O(e^2)
= 1 + d with |d| <~ 2^-54
Thus, the mathematically exact value square_root(double_n) is less than half an ULP away from[1] the desired exact answer, and necessarily rounds to that value.
[1] I’m being fast and loose here in my abuse of relative error estimates, where the relative size of an ULP actually varies across a binade — I’m trying to give a bit of the flavor of the proof without getting too bogged down in details. This can all be made perfectly rigorous, it just gets to be a bit wordy for Stack Overflow.
My whole answer is useless if you have access to IEEE 754 double-precision floating point, since Stephen Canon demonstrated both
a simple way to avoid imul in the loop
a simple way to compute the ceiling sqrt
Otherwise, if for some reason you have a non-IEEE-754-compliant platform, or only single precision, you could get the integer part of the square root with a simple Newton-Raphson loop. For example, in Squeak Smalltalk we have this method in Integer:

sqrtFloor
    "Return the integer part of the square root of self"
    | guess delta |
    guess := 1 bitShift: (self highBit + 1) // 2.
    [
        delta := (guess squared - self) // (guess + guess).
        delta = 0 ] whileFalse: [
            guess := guess - delta ].
    ^guess - 1

Where // is the operator for the quotient of floor division.
The final guard guess*guess <= self ifTrue: [^guess]. can be avoided if the initial guess is fed in excess of the exact solution, as is the case here.
Initializing with an approximate float sqrt was not an option there, because those integers are arbitrarily large and might overflow.
But here, you could seed the initial guess with the floating-point sqrt approximation, and my bet is that the exact solution will be found in very few loops. In C that would be:
#include <math.h>
#include <stdint.h>

uint32_t sqrtFloor(uint64_t n)
{
    int64_t diff;
    int64_t delta;
    uint64_t guess = sqrt(n); /* implicit conversions here... */
    while ((delta = (diff = guess*guess - n) / (guess + guess)) != 0)
        guess -= delta;
    return guess - (diff > 0);
}
That's a few integer multiplications and divisions, but outside the main loop.
What you are looking for is a way to calculate a rational upper bound on the square root of a natural number. Continued fractions are what you need; see Wikipedia.
For x > 0, there is

sqrt(x) = 1 + (x-1)/(1 + sqrt(x))

To make the notation more compact, substituting the left-hand side into itself rewrites the above formula as the continued fraction

sqrt(x) = 1 + (x-1)/(2 + (x-1)/(2 + (x-1)/(2 + ...)))

Truncating the continued fraction by removing the tail term (x-1)/2 at each recursion depth, one gets a sequence of approximations of sqrt(x):

1 + (x-1)/2
1 + (x-1)/(2 + (x-1)/2)
1 + (x-1)/(2 + (x-1)/(2 + (x-1)/2))
...

Upper bounds appear at the odd-numbered lines and get tighter. When the distance between an upper bound and its neighbouring lower bound is less than 1, that approximation is what you need. Using that value as the value of cut (here cut must be a floating-point number) solves the problem.
For very large numbers, rational arithmetic should be used, so no precision is lost during conversion between integer and floating point.
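A small sketch of mine of that iteration in C (double-based for brevity; a production version would use the rational arithmetic suggested above):

#include <math.h>
#include <stdio.h>

/* Approximate sqrt(x) by truncations of the continued fraction
   sqrt(x) = 1 + (x-1)/(2 + (x-1)/(2 + ...)).
   Odd truncation depths give upper bounds, even depths lower bounds. */
double sqrt_upper_bound(double x)
{
    double d = 2.0;                 /* depth-1 denominator */
    double prev = 1 + (x - 1) / d;  /* depth 1: an upper bound */
    for (;;) {
        d = 2 + (x - 1) / d;        /* deepen the fraction by one level */
        double next = 1 + (x - 1) / d;
        if (fabs(prev - next) < 1)
            return prev > next ? prev : next;  /* keep the upper bound */
        prev = next;
    }
}

int main(void)
{
    printf("upper bound on sqrt(121): %f\n", sqrt_upper_bound(121.0));  /* a bit above 11 */
    return 0;
}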
