accuracy of sqrt of integers

I have a loop like this:
for(uint64_t i=0; i*i<n; i++) {
This requires doing a multiplication every iteration. If I could calculate the sqrt before the loop then I could avoid this.
unsigned cut = sqrt(n);
for(uint64_t i=0; i<cut; i++) {
In my case it's okay if the sqrt function rounds up to the next integer but it's not okay if it rounds down.
My question is: is the sqrt function accurate enough to do this for all cases?
Edit: Let me list some cases. If n is a perfect square so that n = y^2 my question would be - is cut=sqrt(n)>=y for all n? If cut=y-1 then there is a problem. E.g. if n = 120 and cut = 10 it's okay but if n=121 (11^2) and cut is still 10 then it won't work.
My first concern was the fractional part of float only has 23 bits and double 52 so they can't store all the digits of some 32-bit or 64-bit integers. However, I don't think this is a problem. Let's assume we want the sqrt of some number y but we can't store all the digits of y. If we let the fraction of y we can store be x we can write y = x + dx then we want to make sure that whatever dx we choose does not move us to the next integer.
sqrt(x+dx) < sqrt(x) + 1 //solve
dx < 2*sqrt(x) + 1
// e.g for x = 100 dx < 21
// sqrt(100+20) < sqrt(100) + 1
Float can store 23 bits so we let y = 2^23 + 2^9. This is more than sufficient since 2^9 < 2*sqrt(2^23) + 1. It's easy to show this for double as well with 64-bit integers. So although they can't store all the digits, as long as the sqrt of what they can store is accurate, the sqrt(fraction) should be sufficient. Now let's look at what happens for integers close to UINT_MAX and the sqrt:
unsigned xi = -1-1;
printf("%u %u\n", xi, (unsigned)(float)xi); //4294967294 4294967295
printf("%u %u\n", (unsigned)sqrt(xi), (unsigned)sqrtf(xi)); //65535 65536
Since float can't store all the digits of 2^32-2 and double can, they get different results for the sqrt. But the float version of the sqrt is one integer larger. This is what I want. For 64-bit integers, as long as the sqrt of the double always rounds up it's okay.

First, integer multiplication is really quite cheap. So long as you have more than a few cycles of work per loop iteration and one spare execute slot, it should be entirely hidden by reorder on most non-tiny processors.
If you did have a processor with dramatically slow integer multiply, a truly clever compiler might transform your loop to:
for (uint64_t i = 0, j = 0; j < n; j += 2*i+1, i++)
replacing the multiply with an lea or a shift and two adds.
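As a rough sketch of that transformation (the function names here are made up for illustration, and n is assumed small enough that the squares do not overflow 64 bits), the running square j can be carried alongside i so the loop condition needs no multiply:

```c
#include <stdint.h>

/* Counts iterations of the original loop: for (i = 0; i*i < n; i++). */
uint64_t count_with_multiply(uint64_t n)
{
    uint64_t count = 0;
    for (uint64_t i = 0; i * i < n; i++)
        count++;
    return count;
}

/* Same loop, but j tracks i*i using (i+1)^2 = i^2 + 2*i + 1. */
uint64_t count_with_running_square(uint64_t n)
{
    uint64_t count = 0;
    for (uint64_t i = 0, j = 0; j < n; j += 2 * i + 1, i++)
        count++;
    return count;
}
```

Both functions should visit exactly the same values of i, which is the property the compiler would rely on.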
Those notes aside, let’s look at your question as stated. No, you can’t just use i < sqrt(n). Counter-example: n = 0x20000000000000. Assuming adherence to IEEE-754, you will have cut = 0x5a82799, and cut*cut is 0x1ffffff8eff971.
However, a basic floating-point error analysis shows that the error in computing sqrt(n) (before conversion to integer) is bounded by 3/4 of an ULP. So you can safely use:
uint32_t cut = sqrt(n) + 1;
and you’ll perform at most one extra loop iteration, which is probably acceptable. If you want to be totally precise, instead use:
uint32_t cut = sqrt(n);
cut += (uint64_t)cut*cut < n;
Edit: z boson clarifies that for his purposes, this only matters when n is an exact square (otherwise, getting a value of cut that is “too small by one” is acceptable). In that case, there is no need for the adjustment and one can safely just use:
uint32_t cut = sqrt(n);
Why is this true? It’s pretty simple to see, actually. Converting n to double introduces a perturbation:
double_n = n*(1 + e)
which satisfies |e| < 2^-53. The mathematical square root of this value can be expanded as follows:
square_root(double_n) = square_root(n)*square_root(1+e)
Now, since n is assumed to be a perfect square with at most 64 bits, square_root(n) is an exact integer with at most 32 bits, and is the mathematically precise value that we hope to compute. To analyze the square_root(1+e) term, use a Taylor series about 1:
square_root(1+e) = 1 + e/2 + O(e^2)
= 1 + d with |d| <~ 2^-54
Thus, the mathematically exact value square_root(double_n) is less than half an ULP away from[1] the desired exact answer, and necessarily rounds to that value.
[1] I’m being fast and loose here in my abuse of relative error estimates, where the relative size of an ULP actually varies across a binade — I’m trying to give a bit of the flavor of the proof without getting too bogged down in details. This can all be made perfectly rigorous, it just gets to be a bit wordy for Stack Overflow.

My whole answer is moot if you have access to IEEE 754 double precision floating point, since Stephen Canon demonstrated both
a simple way to avoid imul in loop
a simple way to compute the ceiling sqrt
Otherwise, if for some reason you have a non IEEE 754 compliant platform, or only single precision, you could get the integer part of square root with a simple Newton-Raphson loop. For example in Squeak Smalltalk we have this method in Integer:
sqrtFloor
"Return the integer part of the square root of self"
| guess delta |
guess := 1 bitShift: (self highBit + 1) // 2.
[
delta := (guess squared - self) // (guess + guess).
delta = 0 ] whileFalse: [
guess := guess - delta ].
^guess - 1
Where // is operator for quotient of integer division.
The final guard guess*guess <= self ifTrue: [^guess]. can be avoided if the initial guess is fed in excess of the exact solution, as is the case here.
Initializing with an approximate float sqrt was not an option because integers are arbitrarily large and might overflow.
But here, you could seed the initial guess with floating point sqrt approximation, and my bet is that the exact solution will be found in very few loops. In C that would be:
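uint32_t sqrtFloor(uint64_t n)
{
    int64_t diff;
    int64_t delta;
    int64_t guess=sqrt(n); /* implicit conversions here... */
    if( guess == 0 ) return 0; /* avoid dividing by zero below */
    /* signed arithmetic throughout, so a negative diff divides correctly; needs n < 2^62 */
    while( (delta = (diff=guess*guess-(int64_t)n) / (guess+guess)) != 0 )
        guess -= delta;
    return guess-(diff>0);
}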
That's a few integer multiplications and divisions, but outside the main loop.

What you are looking for is a way to calculate a rational upper bound of the square root of a natural number. A continued fraction is what you need; see Wikipedia.
For x>0, there is
sqrt(x) = 1 + (x-1)/(2 + (x-1)/(2 + (x-1)/(2 + ...))).
To make the notation more compact, rewriting the above formula as
Truncating the continued fraction by removing the tail term (x-1)/2 at each recursion depth gives a sequence of approximations of sqrt(x) as below:
Upper bounds appear at lines with odd line numbers, and get tighter. When the distance between an upper bound and its neighboring lower bound is less than 1, that approximation is what you need. Using that value as the value of cut (here cut must be a floating-point number) solves the problem.
For very large numbers, rational numbers should be used, so no precision is lost during conversion between integer and floating-point numbers.
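A small sketch of that procedure follows. It assumes the expansion sqrt(x) = 1 + (x-1)/(2 + (x-1)/(2 + ...)), which is my reading of the "(x-1)/2" tail term above, and it uses double instead of exact rationals for brevity. The function name sqrt_upper_bound is made up. Each truncation is obtained from the previous one by t -> 1 + (x-1)/(1+t), and for x >= 1 the results alternate between upper and lower bounds.

```c
/* Upper bound u on sqrt(x) for x >= 1, with u - sqrt(x) < 1.
   Assumes sqrt(x) = 1 + (x-1)/(2 + (x-1)/(2 + ...)); doubles stand in for rationals. */
double sqrt_upper_bound(double x)
{
    double upper = 1.0 + (x - 1.0) / 2.0;                /* first truncation: upper bound */
    for (;;) {
        double lower = 1.0 + (x - 1.0) / (1.0 + upper);  /* next truncation: lower bound */
        if (upper - lower < 1.0)
            return upper;                                /* bounds are less than 1 apart */
        upper = 1.0 + (x - 1.0) / (1.0 + lower);         /* next, tighter upper bound */
    }
}
```

Convergence is only linear, so for large x a floating-point sqrt with the one-step correction shown in the other answer is likely to be much faster.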

Related

Finding whether an interval contains at least one integer without math.h

For a class project I need to split some audio clips in smaller sections, for which we are provided a min length and a max length, to figure out whether this is possible, I do the following:
a = length/max
b = length/min
mathematically I figured that [a,b] contains at least one integer if ⌊b⌋ >= ⌈a⌉, but I can't use math.h for floor() and ceil(). Since a and b are always positive I can use type casting for floor(), but I am at a loss at how to do ceil(). I thought about using ((int)x)+1 but that would round integers up which would break the formula.
I would like either a way to do ceil() which would solve my problem, or another way to check whether an interval contains at least one integer.

You don't need the math.h to perform floor. Please look at the following code:
int length=5,min=2,max=3; // only an example of inputs.
int a = length/max;
int b = length/min;
if(a!=b){
//there is at least one integer in the interval.
}else{
if(length % min==0 || length % max==0 ){
//there is at least one integer in the interval.
}else{
//there is no integer in the interval.
}
}
The result for the above example will be that there is an integer in the interval.
You can also perform ceil without using math.h as follows:
int a;
if(length % max == 0){
a = length / max;
}else{
a = (length / max) + 1;
}
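Putting the two pieces together, the whole check can be done with integer arithmetic only. This is a minimal sketch, assuming length, min and max are positive integers with min <= max; the function name interval_has_integer is made up for illustration.

```c
/* Returns 1 if [length/max, length/min] contains an integer, else 0.
   Assumes positive integer inputs with min <= max. */
int interval_has_integer(int length, int min, int max)
{
    int ceil_a = (length + max - 1) / max;  /* ceil(length / max) */
    int floor_b = length / min;             /* floor(length / min) */
    return floor_b >= ceil_a;
}
```

For the example inputs above (length=5, min=2, max=3) this gives ceil_a = 2 and floor_b = 2, so the interval contains an integer.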

If I understood your question right, I guess you can do ceil(a) in this case, and then check if the result is less than or equal to b. Thus, for example, for interval [1.3, 3.5], ceil(1.3) will return 2, which fits into this interval.
UPD
Also you could do (b - a). If it's > 1, there's for sure at least one integer between them.

There is a general trick in programming that will come in handy if you ever find yourself programming in Apple Basic, or any other language where floating point math is supported.
You can "round" a number by addition, then truncation, as follows:
x = some floating value
rounded_x = int(x + roundoff_amount)
Where roundoff_amount is the difference between the lowest fraction to round up, and 1.
So, to round at .5, your round_off would be 1 - .5 = .5, and you would do int(x + .5). If x is .5 or .51 then the result becomes 1.0 or 1.01 and int() takes that to 1. Obviously, if x is higher, then you still get rounded to 1, until x becomes 1.5 when rounding takes it to 2. To round upwards starting at .6, your roundoff amount would be 1 - .6 = .4, and you would do int(x + .4), etc.
You can do a similar thing to get ceil behavior. Set your roundoff_amount to be 0.99999... and do the round. You can choose your value to provide a "nearby" window, since floats have some inaccuracy inherent that might prevent getting a perfectly integer value after adding fractions.
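In C the same trick might look like the sketch below. The helper name round_at and its threshold parameter are illustrative only, and the code assumes x is non-negative, since the cast truncates towards zero.

```c
/* Rounds a non-negative x: fractional parts >= threshold round up, the rest truncate. */
int round_at(double x, double threshold)
{
    double roundoff_amount = 1.0 - threshold;
    return (int)(x + roundoff_amount);
}
```

Calling round_at(x, 0.5) gives ordinary rounding, while a tiny threshold such as 0.000001 behaves like ceil except inside the small window just above each integer.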

Taylor series of function e^x

Given a number x. You need to calculate sum of Taylor Series of e^x.
e^x = 1 + x + x^2/2! + x^3/3! + ...
Calculate sum until a general number is lower or equal to 10^(-9).
Down below is my solution but it is wrong for x<0 numbers. Do you have any idea how to fix this to work for negative numbers?
int x,i,n;
long long fact; //fact needs to be double
double sum=0,k=1;
scanf("%d",&x);
i=0; sum=0; k=1;
while (fabs(k)>=1.0E-9) {
fact=1;
for (int j=1;j<=i;++j)
fact*=j;
k=pow(x,i)/fact;
sum+=k;
++i;
}
printf("%lf\n",sum);

You should not use the pow function for raising a (possibly negative) number to an integer power. Instead use repeated multiplication as you do to compute the factorial.
Notice also that you could store the last computed values of $n!$ and $x^k$ to obtain $(n+1)!$ and $x^{k+1}$ with a single multiplication.
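A minimal sketch of that idea: instead of tracking the power and the factorial separately, carry the whole term x^i/i! and update it with one multiplication and one division per step. The function name exp_taylor is made up, and the stopping rule (stop once the term's magnitude is at most 1e-9) is my reading of the task statement.

```c
#include <math.h>

/* Sums 1 + x + x^2/2! + ... and stops once a term's magnitude is <= 1e-9. */
double exp_taylor(double x)
{
    double term = 1.0;  /* x^0 / 0! */
    double sum = 0.0;
    for (int i = 1; fabs(term) > 1.0e-9; ++i) {
        sum += term;
        term *= x / i;  /* x^i / i! obtained from x^(i-1) / (i-1)! */
    }
    return sum;
}
```

Because no factorial is ever stored on its own, nothing overflows, and negative x needs no special handling.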

Your problem is that your factorial computation overflows and becomes garbage.
After that your ith term doesn't decrease anymore and produces completely wrong results.
After 20 iterations a 64-bit number cannot contain the value of the next factorial, 21!. See: http://www.wolframalpha.com/input/?i=21%21%2F2%5E64
If x^n/n! is not below your threshold (1e-9) when n=20 then your computation of n! will overflow even a 64-bit integer. When that happens you will get the value of n! modulo 2^64 (I simplify because you didn't use an unsigned integer and you will get a random negative value instead, but the principle remains). These values may be very low instead of being very high. And this will cause your x^n/n! to become greater instead of smaller.

fact needs to be double, it can not be long long because of divides.

Error Propagation upon Summing Single-Precision (float) Values

I'm learning single precision and would like to understand the error propagation. According to this nice website, addition is a dangerous operation.
So I wrote a small C program to test how quickly the errors add up. I'm not entirely sure if this is a valid way of testing. If it is, I'm unsure how to interpret the result, see below.
#include <stdio.h>
#include <math.h>
#define TYPE float
#define NUM_IT 168600
void increment (TYPE base, const TYPE increment, const unsigned long num_iters) {
TYPE err;
unsigned long i;
const TYPE ref = base + increment * num_iters;
for (i=0; i < num_iters; i++ ) {
base += increment;
}
err = (base - ref)/ref;
printf("%lu\t%9f\t%9f\t%+1.9f\n", i, base, ref, err);
}
int
main()
{
int j;
printf("iters\tincVal\trefVal\trelErr\n");
for (j = 1; j < 20; j++ ) {
increment(1e-1, 1e-6, (unsigned long) (pow(2, (j-10))* NUM_IT));
}
return 0;
}
The result of executing
gcc -pedantic -Wall -Wextra -Werror -lm errorPropagation.c && ./a.out | tee float.dat | column -t
is
iters incVal refVal relErr
329 0.100328 0.100329 -0.000005347
658 0.100657 0.100658 -0.000010585
1317 0.101315 0.101317 -0.000021105
2634 0.102630 0.102634 -0.000041596
5268 0.105259 0.105268 -0.000081182
10537 0.110520 0.110537 -0.000154624
21075 0.121041 0.121075 -0.000282393
42150 0.142082 0.142150 -0.000480946
84300 0.184163 0.184300 -0.000741986
168600 0.268600 0.268600 +0.000000222 <-- *
337200 0.439439 0.437200 +0.005120996
674400 0.781117 0.774400 +0.008673230
1348800 1.437150 1.448800 -0.008041115
2697600 2.723466 2.797600 -0.026499098
5395200 5.296098 5.495200 -0.036231972
10790400 10.441361 10.890400 -0.041232508
21580800 25.463778 21.680799 +0.174485177
43161600 32.000000 43.261597 -0.260313928 <-- **
86323200 32.000000 86.423195 -0.629729033
If the test is valid
Why does the error change sign? If 0.1 is represented as e.g. 0.100000001, shouldn't this accumulate always to the same bias, irrespective of the number of summations?
What's special about 168600 summations (see *)? The error becomes very small. Might be a coincidence.
Which wall is being hit at incVal = 32.00 (see **, last two lines). I'm still well below the unsigned long limit.
Thanks in advance for your effort.

First, it's important to know that 0.1 can't be represented exactly; in binary it has periodically repeating digits. The value would be 0.0001100110011.... Compare to how 1/3 and 1/7 are represented with decimal digits. It's worth repeating your test with increment 0.25, which can be represented exactly as 0.01.
I'll illustrate the errors in decimal, that's what we humans are used to. Let's work with decimal, and assume we can have 4 digits of precision. Those are the things happening here.
Division: let's calculate 1/11:
1/11 equals 0.090909..., which is probably rounded to 0.09091. This is, as expected, correct to 4 significant digits.
Magnitude differences: suppose we calculate 10 + 1/11.
When adding 1/11 to 10, we have to do more rounding, since 10.09091 has 7 significant digits, and we have only four. We have to round 1/11 to two digits after the point, and the calculated sum is 10.09. That's an underestimation. Note how only one significant digit of 1/11 is retained. If you add a lot of small values together, this will limit the precision of your final result.
Now calculate 100 + 1/11. Now we round 1/11 to 0.1 and represent the sum as 100.1. Now we have a slight overestimation instead of a slight underestimation.
My guess is that the pattern of sign changes in your test is the effect of systematic slight underestimation vs. overestimation depending on the magnitude of base.
What about 1000 + 1/11? Now we can't have any digits after the point, as we have 4 significant digits before the point already. 1/11 is now rounded to 0, and the sum is still 1000. That's the wall you're seeing.
Another important thing you're not seeing in your test is: what happens if the two values have a different sign. Calculate 1.234 – 1.243: both numbers have 4 significant digits. The result is -0.009. Now the result has only one correct significant digit instead of four.
An answer to a similar question here: How does floating point error propagate when doing mathematical operations in C++? . It has a few links to more information.

To answer your questions...
1 - IEEE float rounds to nearest, with ties going to the even mantissa. This was done specifically in order to prevent error accumulation from always biasing in one way or the other; if it always rounded down, or rounded up, your errors would be much larger.
2 - Nothing in particular is special about 168600 in and of itself. I haven't mathed it out but it's entirely likely that it ends up making a cleaner value in binary representation (i.e. a rational/non-repeating value). Look at the values in binary, not decimal, and see if that theory holds up.
3 - The limiting factor might be due to the float mantissa being 23 bits long. Once base gets to be a certain size, increment is so small in comparison to base that computing base + increment and then rounding the mantissa back down to 23 bits completely erases the change. That is, the difference between base and base + increment is rounding error.
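The absorption described in point 3 can be checked directly. The sketch below (the function name is invented) adds an increment to a base in single precision and reports whether the sum is still equal to the base. The volatile qualifier is only there to force the intermediate result to be rounded to float.

```c
/* Returns 1 when adding inc to base in single precision leaves base unchanged. */
int increment_is_absorbed(float base, float inc)
{
    volatile float sum = base;  /* volatile forces rounding to float after the add */
    sum += inc;
    return sum == base;
}
```

With IEEE single precision, the spacing between floats in [32, 64) is 2^-18 (about 3.8e-6), so an increment of 1e-6 is less than half a step and is rounded away. In [16, 32) the spacing is 2^-19, so the same increment still rounds up to a full step, which would also explain the overshoot just before the wall.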

The "wall" you are hitting has nothing to do with the increment value, if it is constant through the addition and you start at zero. It has to do with the iters. 2^23 is about 8 million, and you are doing 86 million additions. So once the accumulator is 2^23 times bigger than the increment, you hit the wall.
Try running the code with 86323200 iterations, but an increment of 1 or 0.0000152587890625 (or any power of 2). It should have the same relative problem as an increment of 32.

Upper bound for number of digits of big integer in different base

I want to create a big integer from string representation and to do that efficiently I need an upper bound on the number of digits in the target base to avoid reallocating memory.
Example:
A 640 bit number has 640 digits in base 2, but only ten digits in base 2^64, so I will have to allocate ten 64 bit integers to hold the result.
The function I am currently using is:
int get_num_digits_in_different_base(int n_digits, double src_base, double dst_base){
return ceil(n_digits*log(src_base)/log(dst_base));
}
Where src_base is in {2, ..., 10 + 26} and dst_base is in {2^8, 2^16, 2^32, 2^64}.
I am not sure if the result will always be correctly rounded though. log2 would be easier to reason about, but I read that older versions of Microsoft Visual C++ do not support that function. It could be emulated like log2(x) = log(x)/log(2) but now I am back where I started.
GMP probably implements a function to do base conversion, but I may not read the source or else I might get GPL cancer so I can not do that.

I imagine speed is of some concern, or else you could just try the floating point-based estimate and adjust if it turned out to be too small. In that case, one can sacrifice tightness of the estimate for speed.
In the following, let dst_base be 2^w, src_base be b, and n_digits be n.
Let k(b,w)=max {j | b^j < 2^w}. This represents the largest power of b that is guaranteed to fit within a w-wide binary (non-negative) integer. Because of the relatively small number of source and destination bases, these values can be precomputed and looked-up in a table, but mathematically k(b,w)=[w log 2/log b] (where [.] denotes the integer part.)
For a given n let m=ceil( n / k(b,w) ). Then the maximum number of dst_base digits required to hold a number less than b^n is:
ceil(log (b^n-1)/log (2^w)) ≤ ceil(log (b^n) / log (2^w) )
≤ ceil( m . log (b^k(b,w)) / log (2^w) ) ≤ m.
In short, if you precalculate the k(b,w) values, you can quickly get an upper bound (which is not tight!) by dividing n by k, rounding up.
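Here is a sketch of that bound using only integer arithmetic. The names max_power and digits_upper_bound are hypothetical, and the code assumes 2 <= b < 2^w and 1 <= w <= 64. Comparing against 2^w - 1 instead of 2^w avoids needing a 65-bit value when w is 64.

```c
#include <stdint.h>

/* k(b, w): largest j with b^j < 2^w. Assumes 2 <= b < 2^w and 1 <= w <= 64. */
unsigned max_power(unsigned b, unsigned w)
{
    uint64_t limit = (w >= 64) ? UINT64_MAX : ((UINT64_C(1) << w) - 1);  /* 2^w - 1 */
    uint64_t p = 1;
    unsigned j = 0;
    while (p <= limit / b) {  /* true exactly when p*b <= 2^w - 1 */
        p *= b;
        j++;
    }
    return j;
}

/* Upper bound on the base-2^w digits of an n-digit base-b number: ceil(n / k(b, w)). */
unsigned digits_upper_bound(unsigned n, unsigned b, unsigned w)
{
    unsigned k = max_power(b, w);
    return (n + k - 1) / k;
}
```

For the 640-bit example from the question this gives ceil(640/63) = 11 words rather than the exact 10, which illustrates that the bound is safe but not tight.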

I'm not sure about floating-point rounding in this case, but it is relatively easy to implement this using only integers, as log2 is a classic bit manipulation pattern and integer division can be easily rounded up. The following code is equivalent to yours, but using integers:
// Returns log2(x) rounded up using bit manipulation (not most efficient way)
unsigned int log2(unsigned int x)
{
unsigned int y = 0;
--x;
while (x) {
y++;
x >>= 1;
}
return y;
}
// Returns ceil(a/b) using integer division
unsigned int roundup(unsigned int a, unsigned int b)
{
return (a + b - 1) / b;
}
unsigned int get_num_digits_in_different_base(unsigned int n_digits, unsigned int src_base, unsigned int log2_dst_base)
{
return roundup(n_digits * log2(src_base), log2_dst_base);
}
Please, note that:
This function returns different results compared to yours! However, in every case I looked at, both were still correct (the smaller value was more accurate, but your requirement is just an upper bound).
The integer version I wrote receives log2_dst_base instead of dst_base to avoid overflow for 2^64.
log2 can be made more efficient using lookup tables.
I've used unsigned int instead of int.

Need Floating Point Precision Using Unsigned Int

I'm working with a microchip that doesn't have room for floating point precision; however, I need to account for fractional values during some equations. So far I've had good luck using the old *100 -> /100 method like so:
increment = (short int)(((value1 - value2)*100 / totalSteps));
// later in the code I loop through the number of totalSteps
// adding back the increment to arrive at the total I want at the precise
// time I need it.
newValue = oldValue + (increment / 100);
This works great for values from 0-255 divided by a totalSteps of up to 300. After 300, the fractional values to the right of the decimal place become important, because they add up over time of course.
I'm curious if anyone has a better way to save decimal accuracy within an integer paradigm? I tried using *1000 /1000, but that didn't work at all.
Thank you in advance.

Fractions with integers is called fixed point math.
Try Googling "fixed point".
Fixed point tips and tricks are out of the scope of SO answer...
Example: 5 tap FIR filter
// C is the filter coefficients using 2.8 fixed precision.
// 2 MSB (of 10) is for integer part and 8 LSB (of 10) is the fraction part.
// Actual fraction precision here is 1/256.
int FIR_5(int* in, // input samples
int inPrec, // sample fraction precision
int* c, // filter coefficients
int cPrec) // coefficients fraction precision
{
const int coefHalf = (cPrec > 0) ? 1 << (cPrec - 1) : 0; // value of 0.5 using cPrec
int sum = 0;
for ( int i = 0; i < 5; ++i )
{
sum += in[i] * c[i];
}
// sum's precision is X.N. where N = inPrec + cPrec;
// return to original precision (inPrec)
sum = (sum + coefHalf) >> cPrec; // adding coefHalf for rounding
return sum;
}
int main()
{
const int filterPrec = 8;
int C[5] = { 8, 16, 208, 16, 8 }; // 1.0 == 256 in 2.8 fixed point. Filter value are 8/256, 16/256, 208/256, etc.
int W[5] = { 10, 203, 40, 50, 72}; // A sampling window (example)
int res = FIR_5(W, 0, C, filterPrec);
return 0;
}
Notes:
In the above example:
the samples are integers (no fraction)
the coefs have fractions of 8 bit.
8 bit fractions mean that each change of 1 is treated as 1/256. 1 << 8 == 256.
A useful notation is Y.Xu or Y.Xs, where Y is how many bits are allocated for the integer part and X for the fraction. u/s denote unsigned/signed.
When multiplying 2 fixed point numbers, their precisions (sizes of the fraction bits) are added to each other.
Example: A is 0.8u, B is 0.2u. C=A*B. C is 0.10u.
when dividing, use a shift operation to lower the result precision. Amount of shifting is up to you. Before lowering precision it's better to add a half to lower the error.
Example: A=129 in 0.8u which is a little over 0.5 (129/256). We want the integer part so we right shift it by 8. Before that we want to add a half which is 128 (1<<7). So A = (A + 128) >> 8 --> 1.
Without adding a half you'll get a larger error in the final result.
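The multiply and round rules from these notes can be sketched as two small helpers. The names FRAC_BITS, HALF, fixed_mul and fixed_to_int are invented for this illustration, and the code assumes non-negative values small enough that the raw product fits in 32 bits.

```c
#include <stdint.h>

#define FRAC_BITS 8                   /* 8 fraction bits: 1.0 == 256 */
#define HALF (1 << (FRAC_BITS - 1))   /* 0.5 in this format */

/* Multiply two X.8 values. The raw product has 16 fraction bits,
   so add a half and shift 8 of them back out. */
int32_t fixed_mul(int32_t a, int32_t b)
{
    return (a * b + HALF) >> FRAC_BITS;
}

/* Round an X.8 value to the nearest integer. */
int32_t fixed_to_int(int32_t a)
{
    return (a + HALF) >> FRAC_BITS;
}
```

For instance, 1.5 is 384 and 0.25 is 64 in this format, and fixed_mul(384, 64) yields 96, which is 0.375.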

Don't use this approach.
New paradigm: Do not accumulate using FP math or fixed point math. Do your accumulation and other equations with integer math. Anytime you need to get some scaled value, divide by your scale factor (100), but do the "add up" part with the raw, unscaled values.

Here's a quick attempt at a precise rational (Bresenham-esque) version of the interpolation if you truly cannot afford to directly interpolate at each step.
// div() and div_t come from <stdlib.h>
div_t frac_step = div(target - source, num_steps);
if(frac_step.rem < 0) {
    // Annoying special case to deal with rounding towards zero:
    // switch to floor division so the remainder is non-negative.
    frac_step.rem += num_steps;
    --frac_step.quot;
}
int error = 0;
for(int step = 0; step < num_steps; ++step) {
    // Add the integer term plus an accumulated fraction
    error += frac_step.rem;
    if(error >= num_steps) {
        // Time to carry
        error -= num_steps;
        ++source;
    }
    source += frac_step.quot;
}
A major drawback compared to the fixed-point solution is that the fractional term gets rounded off between iterations if you are using the function to continually walk towards a moving target at differing step lengths.
Oh, and for the record your original code does not seem to be properly accumulating the fractions when stepping, e.g. a 1/100 increment will always be truncated to 0 in the addition no matter how many times the step is taken. Instead you really want to add the increment to a higher-precision fixed-point accumulator and then divide it by 100 (or preferably right shift to divide by a power-of-two) each iteration in order to compute the integer "position".
Do take care with the different integer types and ranges required in your calculations. A multiplication by 1000 will overflow a 16-bit integer unless one term is a long. Go through your calculations and keep track of input ranges and the headroom at each step, then select your integer types to match.
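The higher-precision accumulator suggested above might look like the following sketch. It assumes endpoints in the 0-255 range from the question, a 32-bit accumulator with 16 fraction bits, and a hypothetical function name interpolate. The increment is added to the accumulator at full precision, and only the reported position is rounded back to an integer.

```c
#include <stddef.h>
#include <stdint.h>

/* Steps from source to target in num_steps steps using a 16.16 fixed-point
   accumulator. Stores each position in positions[] when it is not NULL and
   returns the final position. Assumes endpoints in 0..255 and num_steps > 0. */
int interpolate(int source, int target, int num_steps, int *positions)
{
    int32_t accumulator = (int32_t)source * 65536;                   /* source in 16.16 */
    int32_t increment = (int32_t)(target - source) * 65536 / num_steps;
    int position = source;
    for (int i = 0; i < num_steps; ++i) {
        accumulator += increment;                    /* fraction is kept between steps */
        position = (accumulator + 32768) >> 16;      /* round to nearest integer */
        if (positions != NULL)
            positions[i] = position;
    }
    return position;
}
```

Using a power-of-two scale means the per-step division becomes a shift, and the 16 fraction bits leave enough headroom for step counts well beyond 300.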

Maybe you can simulate floating point behaviour by saving
it using the IEEE 754 specification.
So you save mantissa, exponent, and sign as unsigned int values.
For calculation you then use integer and bitwise operations on the mantissa and exponent, and so on.
Multiplication and division can be replaced by shift and add operations.
I think it is a lot of programming work to emulate that, but it should work.

Your choice of type is the problem: short int is likely to be 16 bits wide. That's why large multipliers don't work - you're limited to +/-32767. Use a 32 bit long int, assuming that your compiler supports it. What chip is it, by the way, and what compiler?