Power function giving different answer than math.pow function in C

I was trying to write a program to calculate the value of x^n using a while loop:
#include <stdio.h>
#include <math.h>

int main()
{
    float x = 3, power = 1, copyx;
    int n = 22, copyn;
    copyx = x;
    copyn = n;
    while (n)
    {
        if ((n % 2) == 1)
        {
            power = power * x;
        }
        n = n / 2;
        x *= x;
    }
    printf("%g^%d = %f\n", copyx, copyn, power);
    printf("%g^%d = %f\n", copyx, copyn, pow(copyx, copyn));
    return 0;
}
Up until n = 15, my loop and the pow function (from math.h) give the same value; but once n exceeds 15, they start giving different answers.
I cannot understand why there is a difference in the answer. Is it that I have written the function in the wrong way, or is it something else?

You are mixing up two different types of floating-point data. The pow function uses the double type but your loop uses the float type (which has less precision).
You can make the results coincide by either using the double type for your x, power and copyx variables, or by calling the powf function (which uses the float type) instead of pow.
The latter adjustment (using powf) gives the following output (clang-cl compiler, Windows 10, 64-bit):
3^22 = 31381059584.000000
3^22 = 31381059584.000000
And, changing the first line of your main to double x = 3, power = 1, copyx; gives the following:
3^22 = 31381059609.000000
3^22 = 31381059609.000000
Note that, with larger and larger values of n, you are increasingly likely to get divergence between the results of your loop and the value calculated using the pow or powf library functions. On my platform, the double version gives the same results, right up to the point where the value overflows the range and becomes Infinity. However, the float version starts to diverge around n = 55:
3^55 = 174449198498104595772866560.000000
3^55 = 174449216944848669482418176.000000
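For reference, a minimal sketch of the same program with x, power and copyx promoted to double, as suggested above (nothing else altered):
#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = 3, power = 1, copyx;   /* double throughout, matching pow() */
    int n = 22, copyn;

    copyx = x;
    copyn = n;
    while (n)
    {
        if ((n % 2) == 1)
            power = power * x;        /* multiply in when the current bit of n is set */
        n = n / 2;
        x *= x;                       /* square the base for the next bit */
    }

    printf("%g^%d = %f\n", copyx, copyn, power);
    printf("%g^%d = %f\n", copyx, copyn, pow(copyx, copyn));
    return 0;
}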

When I run your code I get this:
3^22 = 31381059584.000000
3^22 = 31381059609.000000
This would be because pow returns a double but your code uses float. When I changed to powf I got identical results:
3^22 = 31381059584.000000
3^22 = 31381059584.000000
So simply use double everywhere if you need high resolution results.

Floating point math is imprecise (and float is worse than double, having even fewer bits to store the data in; using double might delay the imprecision longer). The pow function (usually) uses an exponentiation algorithm that minimizes precision loss, and/or delegates to a chip-level instruction that may do stuff more efficiently, more precisely, or both. There could be more than one implementation of pow too, depending on whether you tell the compiler to use strictly conformant floating point math, the fastest possible, the hardware instruction, etc.
Your code is fine (though using double would get more precise results), but matching the improved precision of math.h's pow is non-trivial; by the time you've done so, you'll have reinvented it. That's why you use the library function.
That said, for logically integer math as you're using here, precision loss from your algorithm likely doesn't matter, it's purely the float vs. double issue where you lose precision from the type itself. As a rule, default to using double, and only switch to float if you're 100% sure you don't need the precision and can't afford the extra memory/computation cost of double.

Precision
float x = 3, power = 1; ... power = power * x forms a float product.
pow(x, y) forms a double result and good implementations internally use even wider math.
OP's loop method incurs rounded results after the 15th iteration. These roundings slowly compound the inaccuracy of the final result.
3^16 is a 26-bit odd number.
float encodes all odd numbers exactly until typically 2^24. Larger values are all even and have only 24 significant binary digits.
double encodes all odd numbers exactly until typically 2^53.
To do a fair comparison, use:
double objects and pow() or
float objects and powf().
For large powers, the pow() or powf() function is certain to provide better answers than a loop, as such functions often use extended precision internally and well-managed rounding, versus the loop approach.
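A quick way to see the 3^16 point for yourself, assuming an IEEE 754 float with a 24-bit significand (3^16 = 43046721 needs 26 bits, so the nearest float is an even neighbour):
#include <stdio.h>

int main(void)
{
    float  f = 43046721.0f;   /* rounds to the nearest 24-bit-significand value */
    double d = 43046721.0;    /* fits easily in double's 53-bit significand */
    printf("float : %.1f\n", f);   /* typically prints 43046720.0 */
    printf("double: %.1f\n", d);   /* prints 43046721.0 */
    return 0;
}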

Related

Underflow error in floating point arithmetic in C

I am new to C, and my task is to create a function
f(x) = sqrt[(x^2)+1]-1
that can handle very large numbers and very small numbers. I am submitting my script on an online interface that checks my answers.
For very large numbers I simplify the expression to:
f(x) = x-1
By just using the highest power. This was the correct answer.
The same logic does not work for smaller numbers. For small numbers (on the order of 1e-7), they are very quickly truncated to zero, even before they are squared. I suspect that this has to do with floating point precision in C. In my textbook, it says that the float type has smallest possible value of 1.17549e-38, with 6 digit precision. So although 1e-7 is much larger than 1.17e-38, it has a higher precision, and is therefore rounded to zero. This is my guess, correct me if I'm wrong.
As a solution, I am thinking that I should convert x to a long double when x < 1e-6. However when I do this, I still get the same error. Any ideas? Let me know if I can clarify. Code below:
#include <math.h>
#include <stdio.h>

double feval(double x) {
    /* Insert your code here */
    if (x > 1e299)
    {
        return x - 1;
    }
    if (x < 1e-6)
    {
        long double g;
        g = x;
        printf("x = %Lf\n", g);
        long double a;
        a = pow(x, 2);
        printf("x squared = %Lf\n", a);
        return sqrt(g * g + 1.) - 1.;
    }
    else
    {
        printf("x = %f\n", x);
        printf("Used third \n");
        return sqrt(pow(x, 2) + 1.) - 1;
    }
}

int main(void)
{
    double x;
    printf("Input: ");
    scanf("%lf", &x);
    double b;
    b = feval(x);
    printf("%f\n", b);
    return 0;
}
For small inputs, you're getting truncation error when you do 1+x^2. If x=1e-7f, x*x will happily fit into a 32 bit floating point number (with a little bit of error due to the fact that 1e-7 does not have an exact floating point representation), but x*x will be so much smaller than 1 that floating point precision will not be sufficient to represent 1+x*x.
It would be more appropriate to do a Taylor expansion of sqrt(1+x^2), which to lowest order would be
sqrt(1+x^2) = 1 + 0.5*x^2 + O(x^4)
Then, you could write your result as
sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4),
avoiding the scenario where you add a very small number to 1.
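A minimal sketch of that approach (the cutoff 1e-4 and the function name feval_taylor are illustrative choices, not prescribed by the exercise; only the lowest-order term is kept, as in the expansion above):
#include <math.h>

/* sqrt(x*x + 1) - 1 via the lowest-order Taylor term for small |x|,
   so the tiny x*x is never added to 1 and lost to rounding. */
double feval_taylor(double x)
{
    if (fabs(x) < 1e-4)       /* illustrative cutoff; truncation error is about x*x/4 relative */
        return 0.5 * x * x;
    return sqrt(x * x + 1.0) - 1.0;
}
Note that keeping only the lowest-order term limits the accuracy near the cutoff; the algebraic rewrite discussed below avoids that trade-off entirely.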
As a side note, you should not use pow for integer powers. For x^2, you should just do x*x. Arbitrary integer powers are a little trickier to do efficiently; the GNU scientific library for example has a function for efficiently computing arbitrary integer powers.
There are two issues here when implementing this in the naive way: overflow or underflow in intermediate computation when computing x * x, and subtractive cancellation during the final subtraction of 1. The second issue is an accuracy issue.
ISO C has a standard math function hypot(x, y) that performs the computation sqrt(x*x + y*y) accurately while avoiding underflow and overflow in intermediate computation. A common approach to fixing issues with subtractive cancellation is to transform the computation algebraically so that it uses only multiplications and/or divisions.
Combining these two fixes leads to the following implementation for float argument. It has an error of less than 3 ulps across all possible inputs according to my testing.
/* Compute sqrt(x*x+1)-1 accurately and without spurious overflow or underflow */
float func (float x)
{
    return (x / (1.0f + hypotf (x, 1.0f))) * x;
}
A trick that is often useful in these cases is based on the identity
(a+1)*(a-1) = a*a-1
In this case
sqrt(x*x+1)-1 = (sqrt(x*x+1)-1)*(sqrt(x*x+1)+1) / (sqrt(x*x+1)+1)
              = (x*x+1-1) / (sqrt(x*x+1)+1)
              = x*x / (sqrt(x*x+1)+1)
The last formula can be used as an implementation. For very small x, sqrt(x*x+1)+1 will be close to 2 (for small enough x it will be exactly 2), but we don't lose precision in evaluating it.
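As a sketch, the last formula translates directly into code for double arguments (the function name feval_stable is just for this sketch; the float version using hypotf, shown earlier, additionally avoids overflow of x*x for huge |x|):
#include <math.h>

/* sqrt(x*x + 1) - 1 rewritten as x*x / (sqrt(x*x + 1) + 1):
   no subtraction of nearly equal quantities, so no cancellation.
   Note x*x can still overflow for very large |x|; combine with hypot
   (as in the earlier answer) if that range matters. */
double feval_stable(double x)
{
    return (x * x) / (sqrt(x * x + 1.0) + 1.0);
}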
The problem isn't with running into the minimum value, but with the precision.
As you said yourself, float on your machine has about 7 digits of precision. So let's take x = 1e-7, so that x^2 = 1e-14. That's still well within the range of float, no problems there. But now add 1. The exact answer would be 1.00000000000001. But if we only have 7 digits of precision, this gets rounded to 1.0000000, i.e. exactly 1. So you end up computing sqrt(1.0)-1 which is exactly 0.
One approach would be to use the linear approximation of sqrt around x=1 that sqrt(x) ~ 1+0.5*(x-1). That would lead to the approximation f(x) ~ 0.5*x^2.

Efficient check of float division by zero for embedded application

For the embedded SW application on a microcontroller I have to write some kind of customised division operator for float. The solution I want to go with is briefly described below.
In fact, I am not sure if this approach could be efficient enough in terms of execution time for the embedded application with high performance requirements.
Has anybody experience in diverse approaches to handle float division by zero, which can be efficient/optimised for embedded applications?
typedef union MyFloatType_
{
    unsigned long Ulong;
    float Float;
} MyFloatType;

float div_float(float a, float b)
{
    float numerator = a;
    MyFloatType denominator;
    denominator.Float = b;
    float result = 0.0f;

    if ((denominator.Ulong & 0x7fffffffUL) != 0UL)
    {
        result = numerator / denominator.Float;
    }
    else
    {
        /* handle division by zero, for example: */
        result = 0.0f;
    }
    return result;
}
In most applications, there will be some value below which a floating-point number "might as well be" zero. For example, consider something like:
float intensity(float dx, float dy)
{
    float result = 1/(dx*dx + dy*dy);
    if (result > 65535.0f) result = 65535.0f;
    return result;
}
If the divisor is less than 1/65535.0f, then in cases where the distance isn't exactly zero the function should return 65535.0f regardless of the actual value of the divisor, and such behavior would probably be useful even if it is zero. Thus, the function could be rewritten as:
float intensity(float dx, float dy)
{
    float distSq = dx*dx + dy*dy;
    if (distSq <= (1.0f/65535.0f))
        return 65535.0f;
    else
        return 1/distSq;
}
Note that the corner-case handling of this sort of code may be very slightly imperfect. While not a problem for 65535.0f in particular, there may be cases where distSq is precisely equal to the reciprocal of the maximum value, but the reciprocal of that is less than the maximum value. For example, if the maximum value was 46470.0f and distSq was 0.0000215238924f, the correct result would be 46459.9961f but the function would return 46470.0f. Such issues are unlikely to pose problems in practice, but one should be aware of them. Note that if the comparison had used less-than rather than less-than-or-equal, and the maximum had been 46590.0f, a distSq value of 0.0000214642932f would yield a result of 46589.0039, which exceeds the maximum.
Incidentally, on many systems, the cost of computing an approximate reciprocal of the divisor and multiplying by the dividend may be much cheaper than the cost of performing a floating-point division. Such an approach may be useful in the many situations where its precision would be adequate.
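As an x86-only illustration of that reciprocal idea (not from the question, and not applicable to the ARM/SH4 targets mentioned in the edit; the SSE rcpss estimate is roughly 12 bits accurate, and one Newton-Raphson step still leaves the result short of a correctly rounded quotient, so verify the accuracy is adequate for your data):
#include <xmmintrin.h>

/* Approximate a / b (b assumed non-zero): start from the hardware
   reciprocal estimate of b, refine once with Newton-Raphson, multiply. */
static float approx_div(float a, float b)
{
    float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(b)));  /* r ~= 1/b, ~12 bits */
    r = r * (2.0f - b * r);                              /* one refinement step */
    return a * r;
}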
It's not faster but potentially much slower. It has to move the float to an integer register (possibly by writing it into memory and reading it back) and then perform the integer operation. Just compare with == 0.0f and a floating-point comparison is done instead, which is the most efficient way.
If you want it to be "high performance" then try the following:
float div_float(float a, float b)
{
    b = b == 0.0f ? 1.0f : b;
    return a / b;
}
This can be optimized to a few simple instructions without branching and is much faster overall, as any branching kills performance. This is used in graphics drivers: if the user is supplying garbage data that will eventually result in division by zero, the results are already invalid, but throwing a floating-point exception is not desired.

Double floating point precision

I'm solving Laplace's Equation with the Gauss-Seidel method, but in some regions it's showing a plateau-like aspect. Formally, i.e. by numerical analysis, such regions should not exist, even if the gradient is almost zero.
I'm forced to believe that double precision isn't enough to perform the arithmetic and that a big-number library needs to be used (killing the performance, since now it will be done in software). Or that I should do the operations in a different order, aiming to preserve some significance in the decimals.
Example
Cell (13, 14, 0) is being updated by a 7-point mesh (in 3D), and its neighbours are:
(12,14,0)= 0.9999999999999936; // (x-)
(14,14,0)= 0.9999999999999969; // (x+)
(13,13,0)= 0.9999999999999938; // (y-)
(13,15,0)= 1.0000000000000000; // (y+)
(13,14,-1)= 1.0000000000000000; // (z-)
(13,14,1)= 0.9999999999999959; // (z+)
So, the new value of cell (13,14,0) would be evaluated as:
p_new = (0.9999999999999936 + 0.9999999999999969 + 0.9999999999999938 + 1.0000000000000000 + 1.0000000000000000 + 0.9999999999999959) / 6.0 ;
which leads to p_new being 1.0000000000000000, when it should be 0.9999999999999966.
Code
#include <stdio.h>

int main()
{
    double ad_neighboor[6] = {0.9999999999999936, 0.9999999999999969,
                              0.9999999999999938, 1.0000000000000000,
                              1.0000000000000000, 0.9999999999999959};
    double d_denom = 6.0;

    unsigned int i_xBackward = 0;
    unsigned int i_xForward  = 1;
    unsigned int i_yBackward = 2;
    unsigned int i_yForward  = 3;
    unsigned int i_zBackward = 4;
    unsigned int i_zForward  = 5;

    double d_newPotential = (ad_neighboor[i_xForward] + ad_neighboor[i_xBackward] +
                             ad_neighboor[i_yForward] + ad_neighboor[i_yBackward] +
                             ad_neighboor[i_zForward] + ad_neighboor[i_zBackward]) / d_denom;

    printf("%.16f\n", d_newPotential);
}
Since you are solving:
d²(phi)/dx² + d²(phi)/dy² = 0
Instead you can solve the equivalent problem:
d²(phi')/dx² + d²(phi')/dy² = 0
Where, phi' = phi - 1.
Remember to apply the boundary conditions in terms of phi'.
Finally after the solution has converged, you can get the solution as phi = 1 + phi'.
I am assuming here that the boundary values are close to 1.
I haven't tried this, but I think that the significant digits of the numbers will then be spent on the small offsets rather than on the leading 1 in floating-point notation, so the truncation error will be reduced.
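A minimal sketch with the numbers from the question, where the offsets phi' = phi - 1 were written out by hand (so they are approximate in the last digit; the variable names are just for this sketch):
#include <stdio.h>

int main(void)
{
    /* Same neighbours as in the question, stored as phi' = phi - 1 */
    double ad_offset[6] = {-6.4e-15, -3.1e-15, -6.2e-15, 0.0, 0.0, -4.1e-15};
    double d_sum = 0.0;
    for (int i = 0; i < 6; i++)
        d_sum += ad_offset[i];

    double d_newPotentialPrime = d_sum / 6.0;                /* update in terms of phi'    */
    double d_newPotential = 1.0 + d_newPotentialPrime;       /* shift back only for output */

    printf("phi' = %.16e\n", d_newPotentialPrime);
    printf("phi  = %.16f\n", d_newPotential);
}
The key point is to keep iterating in terms of phi', where the small differences are fully resolved, and only add the 1 back when reporting results.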
Your granularity is too fine for the double precision floating point type on your platform.
In the majority of cases, you'd address this by adjusting your granularity. If you need any convincing, 15 significant figures of granularity is enough to mesh the Solar System out to the orbit of Pluto in squares of 1 cm length! For this method, I'd be inclined to reserve at least four orders of magnitude to obviate numerical noise.
Only in a very small number of cases ought you think in terms of switching to another data type, such as long double (if different from double on your platform), or an arbitrary-precision type.

Strange output when using float instead of double

Strange output when I use float instead of double
#include <stdio.h>

void main()
{
    double p, p1, cost, cost1 = 30;
    for (p = 0.1; p < 10; p = p + 0.1)
    {
        cost = 30 - 6*p + p*p;
        if (cost < cost1)
        {
            cost1 = cost;
            p1 = p;
        }
        else
        {
            break;
        }
        printf("%lf\t%lf\n", p, cost);
    }
    printf("%lf\t%lf\n", p1, cost1);
}
This gives the expected output, with the minimum at p = 3.
But when I use float the output is a little weird.
#include <stdio.h>

void main()
{
    float p, p1, cost, cost1 = 40;
    for (p = 0.1; p < 10; p = p + 0.1)
    {
        cost = 30 - 6*p + p*p;
        if (cost < cost1)
        {
            cost1 = cost;
            p1 = p;
        }
        else
        {
            break;
        }
        printf("%f\t%f\n", p, cost);
    }
    printf("%f\t%f\n", p1, cost1);
}
Why is the increment of p in the second case going weird after 2.7?
This is happening because the float and double data types store numbers in base 2. Most base-10 numbers can’t be stored exactly. Rounding errors add up much more quickly when using floats. Outside of embedded applications with limited memory, it’s generally better, or at least easier, to use doubles for this reason.
To see this happening for double types, consider the output of this code:
#include <stdio.h>

int main(void)
{
    double d = 0.0;
    for (int i = 0; i < 100000000; i++)
        d += 0.1;
    printf("%f\n", d);
    return 0;
}
On my computer, it outputs 9999999.981129. So after 100 million iterations, rounding error made a difference of 0.018871 in the result.
For more information about how floating-point data types work, read What Every Computer Scientist Should Know About Floating-Point Arithmetic. Or, as akira mentioned in a comment, see the Floating-Point Guide.
Your program can work fine with float. You don't need double to compute a table of 100 values to a few significant digits. You can use double, and if you do, it will have a good chance of working even if you use binary floating-point at cross-purposes. The IEEE 754 double-precision format used for double by most C compilers is so precise that it makes many misuses of floating-point unnoticeable (but not all of them).
Values that are simple in decimal may not be simple in binary
A consequence is that a value that is simple in decimal may not be represented exactly in binary.
This is the case for 0.1: it is not simple in binary, and it is not represented exactly as either double or float, but the double representation has more digits and as a result, is closer to the intended value 1/10.
Floating-point operations are not exact in general
Binary floating-point operations in a format such as float or double have to produce a result in the intended format. This leads to some digits having to be dropped from the result each time an operation is computed. When using binary floating-point in an advanced manner, the programmer sometimes knows that the result will have few enough digits for all the digits to be represented in the format (in other words, sometimes a floating-point operation can be exact, and advanced programmers can predict and take advantage of the conditions in which this happens). But here you are adding 0.1, which is not simple and (in binary) uses all the available digits, so most of the time this addition is not exact.
How to print a small table of values using only float
In for (p = 0.1; p < 10;p=p+0.1), the value of p, being a float, will be rounded at each iteration. Each iteration will be computed from a previous iteration that was already rounded, so the rounding errors will accumulate and make the end result drift away from the intended, mathematical value.
Here is a list of improvements over what you wrote, in increasing order of exactness:
for (i = 1, p = 0.1f; i < 100; i++, p = i * 0.1f)
In the above version, 0.1f is not exactly 1/10, but the computation of p involves only one multiplication and one rounding, instead of up to 100. That version gives a more precise approximation of i/10.
for (i = 1, p = 0.1f; i < 100; i++, p = i * 0.1)
In the very slightly different version above, i is multiplied by the double value 0.1, which more closely approximates 1/10. The result is always the closest float to i/10, but this solution is cheating a bit, since it uses a double multiplication. I said a solution existed with only float!
for (i = 1, p = 0.1f; i < 100; i++, p = i / 10.0f)
In this last solution, p is computed as the division of i, represented exactly as a float because it is a small integer, by 10.0f, which is also exact for the same reason. The only computation approximation is that of a single operation, and the arguments are exactly what we wanted them to, so this is the best solution. It produces the closest float to i/10 for all values of i between 1 and 99.
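As a small illustration of the difference (a sketch; the variable names are mine and the exact digits printed depend on the platform's float arithmetic, but the accumulated column drifts as i grows while the recomputed column stays the closest float to i/10):
#include <stdio.h>

int main(void)
{
    float p_acc = 0.1f;                 /* accumulated: p = p + 0.1f each iteration */
    for (int i = 1; i < 100; i++)
    {
        float p_div = i / 10.0f;        /* recomputed: closest float to i/10 */
        if (i % 27 == 0)                /* print a few sample rows */
            printf("i=%2d  accumulated=%.7f  recomputed=%.7f\n", i, p_acc, p_div);
        p_acc += 0.1f;
    }
    return 0;
}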

Faster way of finding multiple of double

I have the following C function, used to determine whether one number is a multiple of another to an arbitrary tolerance:
#include <math.h>

#define TOLERANCE 0.0001

int IsMultipleOf(double x, double mod)
{
    return (fabs(fmod(x, mod)) < TOLERANCE);
}
It works fine, but profiling shows it to be very slow, to the extent that it has become a candidate for optimization. About 75% of the time is spent in modulo and the remaining in fabs. I'm trying to figure a way of speeding things up, using something like a look-up table. The parameter x changes regularly, whereas mod changes infrequently. The number of possible values of x is small enough that the space for a look-up would not be an issue, typically it will be one of a few hundred possible values. I can get rid of the fabs easily enough, but can't figure out a reasonable alternative to the modulo. Any ideas on how to optimize the above?
Edit: The code will be running on a wide range of Windows desktop and mobile devices, hence processors could include Intel or AMD on desktop, and ARM or SH4 on mobile devices. Visual Studio 2008 is the compiler.
Do you really have to use modulo for this?
Wouldn't it be possible to just compute result = x / mod and then check if the decimal part of result is close to 0? For instance:
11 / 5.4999 = 2.0000364 ==> 0.0000364 < TOLERANCE
Or something like that.
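Something along those lines might look like the sketch below (the name IsMultipleOfDiv is just for this sketch; note the tolerance is now measured relative to mod rather than in absolute units of x, it assumes mod is non-zero, and it also accepts values just below a multiple, which the fmod version does not):
#include <math.h>

#define TOLERANCE 0.0001

/* One division, then see how far the quotient is from a whole number. */
int IsMultipleOfDiv(double x, double mod)
{
    double q = x / mod;
    return fabs(q - round(q)) < TOLERANCE;
}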
Division (floating point or not, fmod in your case) is often an operation where the execution time varies a lot depending on the cpu and compiler:
gcc has a builtin replacement for that if you give it the right compile flags or if you use __builtin_fmod explicitly. This then might map the operation onto a small number of assembler instructions.
there may be special units like SSE on Intel processors where this operation is implemented more efficiently
By such tricks, depending on your environment (you didn't say which), the time may vary from a few clock cycles to a few hundred. I think it's best to look into the documentation of your compiler and CPU for that particular operation.
The following is probably overkill, and sub-optimal. But for what it is worth here is one way on how to do it.
We know the format of the double ...
1 bit for the sign
11 bits for the biased exponent
52 fraction bits
Let ...
value = x / mod;
exp = exponent bits of value - BIAS;
lsb = least sig bit of value's fraction bits;
Once you have that ...
/*
* If applying the exponent would eliminate the fraction bits
* then for double precision resolution it is a multiple.
* Note: lsb may require some massaging.
*/
if (exp > lsb)
return (true);
if (exp < 0)
return (false);
The only case remaining is the tolerance case. Build your double so that you are getting rid of all the digits to the left of the decimal.
sign bit is zero (positive)
exponent is the BIAS (1023 I think ... look it up to be sure)
shift the fraction bits as appropriate
Now compare it against your tolerance.
I think you need to inspect the bowels of your C RTL fmod() function: X86 FPU's have 'FPREM/FPREM1' instructions which computes remainders by repeated subtraction.
While floating point division is a single instruction, it seems you may need to call FPREM repeatedly to get the right answer for modulus, so your RTL may not use it.
I have not tested this at all, but from the way I understand fmod, the following should be an equivalent inlined version, which might let the compiler optimize it better, though I would have thought that the compiler's math library (or builtins) would work just as well. (Also, I don't even know for sure that this is correct.)
#include <math.h>

int IsMultipleOf(double x, double mod) {
    long n = x / mod;                // You should probably test for /0 or NAN result here
    double new_x = mod * n;
    double delta = x - new_x;
    return fabs(delta) < TOLERANCE;  // and for NAN result from fabs
}
Maybe you can get away with long long instead of double if you have comparable scale of data. For example long long would be enough for over 60 astronomical units in micrometer resolution.
Does it need to be double precision? Depending on how good your math library is, this ought to be faster:
#include <math.h>
#include <stdbool.h>   /* for bool */

#define TOLERANCE 0.0001f

bool IsMultipleOf(float x, float mod)
{
    return (fabsf(fmodf(x, mod)) < TOLERANCE);
}
I presume modulo looks a little like this on the inside:
mod(x, m) {
    while (x > m) {
        x = x - m
    }
    return x
}
I think that through some sort of search it could be optimised, e.g.:
fastmod(x, m) {
    q = 1
    while (m * q < x) {
        q = q * 2
    }
    return mod((x - (q / 2) * m), m)
}
You might even choose to replace the final call to mod with another call to fastmod, adding the condition that if x < m it simply returns x.
