fast square root optimization? - c

If you check this very nice page:
http://www.codeproject.com/Articles/69941/Best-Square-Root-Method-Algorithm-Function-Precisi
You'll see this program:
#define SQRT_MAGIC_F 0x5f3759df
float sqrt2(const float x)
{
    const float xhalf = 0.5f*x;
    union // get bits for floating value
    {
        float x;
        int i;
    } u;
    u.x = x;
    u.i = SQRT_MAGIC_F - (u.i >> 1);     // gives initial guess y0
    return x*u.x*(1.5f - xhalf*u.x*u.x); // Newton step, repeating increases accuracy
}
My question is: Is there any particular reason why this isn't implemented as:
#define SQRT_MAGIC_F 0x5f3759df
float sqrt2(const float x)
{
    union // get bits for floating value
    {
        float x;
        int i;
    } u;
    u.x = x;
    u.i = SQRT_MAGIC_F - (u.i >> 1); // gives initial guess y0
    const float xux = x*u.x;
    return xux*(1.5f - .5f*xux*u.x); // Newton step, repeating increases accuracy
}
since, from the disassembly, I see one fewer MUL. Is there any purpose to having xhalf appear at all?

It could be that legacy floating-point math, which used 80-bit registers, was more accurate when the multiplications were chained together in the last line, as intermediate results were kept in 80-bit registers.
The first multiplication in the upper implementation takes place in parallel with the integer math that follows; they use different execution resources.
The second function, on the other hand, looks faster, but it's hard to tell whether it really is, because of the above.
Also, the const float xux = x*u.x; statement reduces the result back to a 32-bit float, which may reduce overall accuracy.
You could test these functions head to head and compare them to the sqrt function in math.h (use double, not float). This way you can see which is faster and which is more accurate.
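A minimal head-to-head harness might look like the sketch below (my own illustration, not from the original answer): it times one variant with clock() and measures its worst relative error against the double-precision sqrt() from math.h. Swap in the second variant the same way to compare.
#include <math.h>
#include <stdio.h>
#include <time.h>

/* sketch: time sqrt2() and measure its worst relative error vs. sqrt() */
int main(void)
{
    const int N = 10000000;
    volatile float sink = 0.0f; /* prevents the timing loop from being optimized away */
    clock_t t0 = clock();
    for (int i = 1; i <= N; i++)
        sink += sqrt2((float)i * 0.001f);
    clock_t t1 = clock();
    printf("sqrt2: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    double max_rel_err = 0.0;
    for (int i = 1; i <= N; i++) {
        double x = (double)i * 0.001;
        double rel = fabs((double)sqrt2((float)x) - sqrt(x)) / sqrt(x);
        if (rel > max_rel_err) max_rel_err = rel;
    }
    printf("max relative error: %.3e\n", max_rel_err);
    return 0;
}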

Related

Underflow error in floating point arithmetic in C

I am new to C, and my task is to create a function
f(x) = sqrt[(x^2)+1]-1
that can handle very large numbers and very small numbers. I am submitting my script on an online interface that checks my answers.
For very large numbers I simplify the expression to:
f(x) = x-1
By just using the highest power. This was the correct answer.
The same logic does not work for smaller numbers. Small numbers (on the order of 1e-7) are very quickly truncated to zero, even before they are squared. I suspect that this has to do with floating point precision in C. In my textbook, it says that the float type has a smallest possible value of 1.17549e-38, with 6 digit precision. So although 1e-7 is much larger than 1.17e-38, it has a higher precision, and is therefore rounded to zero. This is my guess, correct me if I'm wrong.
As a solution, I am thinking that I should convert x to a long double when x < 1e-6. However when I do this, I still get the same error. Any ideas? Let me know if I can clarify. Code below:
#include <math.h>
#include <stdio.h>

double feval(double x) {
    /* Insert your code here */
    if (x > 1e299)
    {
        return x-1;
    }
    if (x < 1e-6)
    {
        long double g;
        g = x;
        printf("x = %Lf\n", g);
        long double a;
        a = pow(x,2);
        printf("x squared = %Lf\n", a);
        return sqrt(g*g+1.) - 1.;
    }
    else
    {
        printf("x = %f\n", x);
        printf("Used third \n");
        return sqrt(pow(x,2)+1.) - 1;
    }
}

int main(void)
{
    double x;
    printf("Input: ");
    scanf("%lf", &x);
    double b;
    b = feval(x);
    printf("%f\n", b);
    return 0;
}
int main(void)
{
double x;
printf("Input: ");
scanf("%lf", &x);
double b;
b = feval(x);
printf("%f\n", b);
return 0;
}
For small inputs, you're getting truncation error when you do 1+x^2. If x=1e-7f, x*x will happily fit into a 32 bit floating point number (with a little bit of error, due to the fact that 1e-7 does not have an exact floating point representation), but x*x will be so much smaller than 1 that floating point precision will not be sufficient to represent 1+x*x.
It would be more appropriate to do a Taylor expansion of sqrt(1+x^2), which to lowest order would be
sqrt(1+x^2) = 1 + 0.5*x^2 + O(x^4)
Then, you could write your result as
sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4),
avoiding the scenario where you add a very small number to 1.
As a side note, you should not use pow for integer powers. For x^2, you should just do x*x. Arbitrary integer powers are a little trickier to do efficiently; the GNU scientific library for example has a function for efficiently computing arbitrary integer powers.
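Put into code, the Taylor-expansion advice above might look like the following sketch (the cutoff 1e-4 is an illustrative choice of mine, not something from the question):
#include <math.h>

/* sketch: for small |x|, return the lowest-order Taylor term 0.5*x*x
   instead of evaluating sqrt(x*x+1)-1 directly (which rounds to 0) */
double feval_taylor(double x) {
    if (fabs(x) < 1e-4)      /* illustrative cutoff, not tuned */
        return 0.5 * x * x;  /* sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4) */
    return sqrt(x * x + 1.0) - 1.0;
}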
There are two issues here when implementing this in the naive way: overflow or underflow in the intermediate computation of x * x, and subtractive cancellation during the final subtraction of 1. The second issue is an accuracy issue.
ISO C has a standard math function hypot (x, y) that performs the computation sqrt (x * x + y * y) accurately while avoiding underflow and overflow in intermediate computation. A common approach to fix issues with subtractive cancellation is to transform the computation algebraically such that it is transformed into multiplications and / or divisions.
Combining these two fixes leads to the following implementation for float argument. It has an error of less than 3 ulps across all possible inputs according to my testing.
#include <math.h> // hypotf()

/* Compute sqrt(x*x+1)-1 accurately and without spurious overflow or underflow */
float func (float x)
{
    return (x / (1.0f + hypotf (x, 1.0f))) * x;
}
A trick that is often useful in these cases is based on the identity
(a+1)*(a-1) = a*a-1
In this case
sqrt(x*x+1)-1 = (sqrt(x*x+1)-1)*(sqrt(x*x+1)+1) / (sqrt(x*x+1)+1)
              = (x*x+1-1) / (sqrt(x*x+1)+1)
              = x*x / (sqrt(x*x+1)+1)
The last formula can be used as an implementation. For very small x, sqrt(x*x+1)+1 will be close to 2 (for small enough x it will be exactly 2), but we don't lose precision in evaluating it.
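For instance, a direct implementation of that last formula could look like this sketch in double (note it can still overflow for huge x, unlike the hypotf-based version above):
#include <math.h>

/* sketch: sqrt(x*x+1)-1 rewritten to avoid subtractive cancellation */
double f_identity(double x) {
    return (x * x) / (sqrt(x * x + 1.0) + 1.0);
}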
The problem isn't with running into the minimum value, but with the precision.
As you said yourself, float on your machine has about 7 digits of precision. So let's take x = 1e-7, so that x^2 = 1e-14. That's still well within the range of float, no problems there. But now add 1. The exact answer would be 1.00000000000001. But if we only have 7 digits of precision, this gets rounded to 1.0000000, i.e. exactly 1. So you end up computing sqrt(1.0)-1 which is exactly 0.
One approach would be to use the linear approximation of sqrt around x=1 that sqrt(x) ~ 1+0.5*(x-1). That would lead to the approximation f(x) ~ 0.5*x^2.

fast double exp2 function in C [closed]

I need a version of the following fast exp2 function working in double precision; can you help me? Please don't answer that it is an approximation, so a double version is pointless and casting the result to (double) is enough. The function, which I found somewhere and which works for me, is the following, and it is much faster than exp2f(), but unfortunately I could not find any double precision version:
#include <stdint.h> // uint32_t

inline float fastexp2(float p)
{
    if (p < -126.f) p = -126.f;
    int w = (int)p;
    float z = p - (float)w;
    if (p < 0.f) z += 1.f;
    union { uint32_t i; float f; } v = {
        (uint32_t)((1 << 23) * (p + 121.2740575f + 27.7280233f / (4.84252568f - z) - 1.49012907f * z))
    };
    return v.f;
}
My assumption is that the existing code from the question assumes IEEE-754 binary floating-point computation, in particular mapping C's float type to IEEE-754's binary32 format.
The existing code suggests that only floating-point results in the normal range are of interest: subnormal results are avoided by clamping the input from below and overflow is ignored. So for float computation valid inputs are in the interval [-126, 128). By exhaustive test, I found that the maximum relative error of the function in the question is 7.16e-5, and that the root-mean-square (RMS) error is 1.36e-5.
My assumption is that the desired change to double computation should widen the range of allowed inputs to [-1022, 1024), and that identical relative accuracy should be maintained. The code is written in a fairly cryptic fashion. So as a first step, I rearranged it into a more readable version. In a second step, I tweaked the coefficients of the core approximation so as to minimize the maximum relative error. This results in the following ISO-C99 code:
#include <math.h>   // floorf()
#include <stdint.h> // int32_t
#include <string.h> // memcpy()

/* compute 2**p, for p in [-126, 128). Maximum relative error: 5.04e-5; RMS error: 1.03e-5 */
float fastexp2 (float p)
{
    const int FP32_MIN_EXPO = -126; // exponent of minimum binary32 normal
    const int FP32_MANT_BITS = 23;  // number of stored mantissa (significand) bits
    const int FP32_EXPO_BIAS = 127; // binary32 exponent bias
    float res;
    p = (p < FP32_MIN_EXPO) ? FP32_MIN_EXPO : p; // clamp below
    /* 2**p = 2**(w+z), with w an integer and z in [0, 1) */
    float w = floorf (p); // integral part
    float z = p - w;      // fractional part
    /* approximate 2**z-1 for z in [0, 1) */
    float approx = -0x1.6e7592p+2f + 0x1.bba764p+4f / (0x1.35ed00p+2f - z) - 0x1.f5e546p-2f * z;
    /* assemble the exponent and mantissa components into final result */
    int32_t resi = ((1 << FP32_MANT_BITS) * (w + FP32_EXPO_BIAS + approx));
    memcpy (&res, &resi, sizeof res);
    return res;
}
Refactoring and retuning of the coefficients resulted in slight improvements in accuracy, with maximum relative error of 5.04e-5 and RMS error of 1.03e-5. It should be noted that floating-point arithmetic is generally not associative, therefore any re-association of floating-point operations, either by manual code changes or compiler transformations could affect the stated accuracy and requires careful re-testing.
I do not expect any need to modify the code as it compiles into efficient machine code for common architectures, as can be seen from trial compilations with Compiler Explorer, e.g. gcc with x86-64 or gcc with ARM64.
At this point it is obvious what needs to be changed for switching to double computation. Change all instances of float to double, all instances of int32_t to int64_t, change type suffixes for literal constants and math functions, and change the floating-point format specific parameters for IEEE-754 binary32 to those for IEEE-754 binary64. The core approximation needs re-tuning to make best possible use of double precision coefficients in the core approximation.
#include <math.h>   // floor()
#include <stdint.h> // int64_t
#include <string.h> // memcpy()

/* compute 2**p, for p in [-1022, 1024). Maximum relative error: 4.93e-5. RMS error: 9.91e-6 */
double fastexp2 (double p)
{
    const int FP64_MIN_EXPO = -1022; // exponent of minimum binary64 normal
    const int FP64_MANT_BITS = 52;   // number of stored mantissa (significand) bits
    const int FP64_EXPO_BIAS = 1023; // binary64 exponent bias
    double res;
    p = (p < FP64_MIN_EXPO) ? FP64_MIN_EXPO : p; // clamp below
    /* 2**p = 2**(w+z), with w an integer and z in [0, 1) */
    double w = floor (p); // integral part
    double z = p - w;     // fractional part
    /* approximate 2**z-1 for z in [0, 1) */
    double approx = -0x1.6e75d58p+2 + 0x1.bba7414p+4 / (0x1.35eccbap+2 - z) - 0x1.f5e53c2p-2 * z;
    /* assemble the exponent and mantissa components into final result */
    int64_t resi = ((1LL << FP64_MANT_BITS) * (w + FP64_EXPO_BIAS + approx));
    memcpy (&res, &resi, sizeof res);
    return res;
}
Both maximum relative error and root-mean-square error decrease very slightly, to 4.93e-5 and 9.91e-6, respectively. This is as expected, because for an approximation that is roughly accurate to 15 bits it matters little whether intermediate computation is performed with 24 bits of precision or with 53 bits of precision. The computation uses a division, and division tends to be slower for double than for float on all platforms I am familiar with, so the double-precision port doesn't seem to provide any significant advantages other than perhaps eliminating conversion overhead if the calling code uses double-precision computation.
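As a sanity check of such a port, one could sample the supported input interval and compare against the standard exp2(); a rough sketch (random sampling rather than the exhaustive test mentioned above):
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* sketch: estimate the maximum relative error of fastexp2() against exp2()
   by sampling the supported input interval [-1022, 1024) */
int main(void)
{
    double max_rel_err = 0.0;
    for (int i = 0; i < 10000000; i++) {
        double p = -1022.0 + 2046.0 * ((double)rand() / RAND_MAX);
        double ref = exp2(p);
        double rel = fabs(fastexp2(p) - ref) / ref;
        if (rel > max_rel_err) max_rel_err = rel;
    }
    printf("max relative error: %.3e\n", max_rel_err);
    return 0;
}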

Efficient computation of 2**64 / divisor via fast floating-point reciprocal

I am currently looking into ways of using the fast single-precision floating-point reciprocal capability of various modern processors to compute a starting approximation for a 64-bit unsigned integer division based on fixed-point Newton-Raphson iterations. It requires computation of 2**64 / divisor, as accurately as possible, where the initial approximation must be smaller than, or equal to, the mathematical result, based on the requirements of the following fixed-point iterations. This means this computation needs to provide an underestimate. I currently have the following code, which works well, based on extensive testing:
#include <stdint.h> // import uint64_t
#include <math.h> // import nextafterf()
uint64_t divisor, recip;
float r, s, t;
t = uint64_to_float_ru (divisor); // ensure t >= divisor
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; // underestimate of 2**64 / divisor
While this code is functional, it isn't exactly fast on most platforms. One obvious improvement, which requires a bit of machine-specific code, is to replace the division r = 1.0f / t with code that makes use of a fast floating-point reciprocal provided by the hardware. This can be augmented with iteration to produce a result that is within 1 ulp of the mathematical result, so an underestimate is produced in the context of the existing code. A sample implementation for x86_64 would be:
#include <math.h>       // fmaf()
#include <xmmintrin.h>  // _mm_rcp_ss()

/* Compute 1.0f/a almost correctly rounded. Halley iteration with cubic convergence */
inline float fast_recip_f32 (float a)
{
    __m128 t;
    float e, r;
    t = _mm_set_ss (a);
    t = _mm_rcp_ss (t);
    _mm_store_ss (&r, t);
    e = fmaf (r, -a, 1.0f);
    e = fmaf (e, e, e);
    r = fmaf (e, r, r);
    return r;
}
Implementations of nextafterf() are typically not performance-optimized. On platforms where there are means to quickly reinterpret an IEEE 754 binary32 as an int32 and vice versa, via intrinsics float_as_int() and int_as_float(), we can combine the use of nextafterf() and scaling as follows:
s = int_as_float (float_as_int (r) + 0x1fffffff);
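float_as_int() and int_as_float() are placeholder names; where no such intrinsics exist, the reinterpretation can be written portably with memcpy, which compilers typically compile down to a plain register move. A sketch:
#include <stdint.h>
#include <string.h>

static inline int32_t float_as_int (float f)
{
    int32_t i;
    memcpy (&i, &f, sizeof i); /* reinterpret the bits, no value conversion */
    return i;
}

static inline float int_as_float (int32_t i)
{
    float f;
    memcpy (&f, &i, sizeof f);
    return f;
}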
Assuming these approaches are possible on a given platform, this leaves us with the conversions between float and uint64_t as major obstacles. Most platforms don't provide an instruction that performs a conversion from uint64_t to float with static rounding mode (here: towards positive infinity = up), and some don't offer any instructions to convert between uint64_t and floating-point types, making this a performance bottleneck.
t = uint64_to_float_ru (divisor);
r = fast_recip_f32 (t);
s = int_as_float (float_as_int (r) + 0x1fffffff);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */
A portable, but slow, implementation of uint64_to_float_ru uses dynamic changes to FPU rounding mode:
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

float uint64_to_float_ru (uint64_t a)
{
    float res;
    int curr_mode = fegetround ();
    fesetround (FE_UPWARD);
    res = (float)a;
    fesetround (curr_mode);
    return res;
}
I have looked into various splitting and bit-twiddling approaches to deal with the conversions (e.g. do the rounding on the integer side, then use a normal conversion to float which uses the IEEE 754 rounding mode round-to-nearest-or-even), but the overhead this creates makes this computation via fast floating-point reciprocal unappealing from a performance perspective. As it stands, it looks like I would be better off generating a starting approximation by using a classical LUT with interpolation, or a fixed-point polynomial approximation, and follow those up with a 32-bit fixed-point Newton-Raphson step.
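For illustration, a "round on the integer side" conversion along those lines might look like the sketch below (assuming GCC/Clang's __builtin_clzll); the extra instructions it needs are exactly the overhead referred to above:
#include <stdint.h>

/* sketch: convert uint64_t to float with round-up, by rounding the
   significand up on the integer side before an exact conversion */
float uint64_to_float_ru (uint64_t a)
{
    if (a == 0) return 0.0f;
    int bits = 64 - __builtin_clzll (a);         /* significant bits in a */
    if (bits <= 24) return (float)a;             /* fits exactly in binary32 */
    int shift = bits - 24;
    uint64_t mant = a >> shift;                  /* top 24 bits */
    if (a & ((1ULL << shift) - 1)) mant++;       /* round up if any bits dropped */
    return (float)mant * (float)(1ULL << shift); /* both factors convert exactly */
}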
Are there ways to improve the efficiency of my current approach? Portable and semi-portable ways involving intrinsics for specific platforms would be of interest (in particular for x86 and ARM as the currently dominant CPU architectures). Compiling for x86_64 using the Intel compiler at very high optimization (/O3 /QxCORE-AVX2 /Qprec-div-) the computation of the initial approximation takes more instructions than the iteration, which takes about 20 instructions. Below is the complete division code for reference, showing the approximation in context.
uint64_t udiv64 (uint64_t dividend, uint64_t divisor)
{
    uint64_t temp, quot, rem, recip, neg_divisor = 0ULL - divisor;
    float r, s, t;

    /* compute initial approximation for reciprocal; must be underestimate! */
    t = uint64_to_float_ru (divisor);
    r = 1.0f / t;
    s = 0x1.0p64f * nextafterf (r, 0.0f);
    recip = (uint64_t)s; /* underestimate of 2**64 / divisor */

    /* perform Halley iteration with cubic convergence to refine reciprocal */
    temp = neg_divisor * recip;
    temp = umul64hi (temp, temp) + temp;
    recip = umul64hi (recip, temp) + recip;

    /* compute preliminary quotient and remainder */
    quot = umul64hi (dividend, recip);
    rem = dividend - divisor * quot;

    /* adjust quotient if too small; quotient off by 2 at most */
    if (rem >= divisor) quot += ((rem - divisor) >= divisor) ? 2 : 1;

    /* handle division by zero */
    if (divisor == 0ULL) quot = ~0ULL;

    return quot;
}
umul64hi() would generally map to a platform-specific intrinsic, or a bit of inline assembly code. On x86_64 I currently use this implementation:
inline uint64_t umul64hi (uint64_t a, uint64_t b)
{
    uint64_t res;
    __asm__ (
        "movq %1, %%rax;\n\t" // rax = a
        "mulq %2;\n\t"        // rdx:rax = a * b
        "movq %%rdx, %0;\n\t" // res = (a * b)<127:64>
        : "=rm" (res)
        : "rm"(a), "rm"(b)
        : "%rax", "%rdx");
    return res;
}
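On compilers that provide a 128-bit integer type, such as gcc and clang, a portable alternative is possible; a sketch:
#include <stdint.h>

/* sketch: high 64 bits of a 64x64-bit product via unsigned __int128 */
static inline uint64_t umul64hi (uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}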
This solution combines two ideas:
You can convert to floating point by simply reinterpreting the bits as floating point and subtracting a constant, so long as the number is within a particular range. So add a constant, reinterpret, and then subtract that constant. This will give a truncated result (which is therefore always less than or equal to the desired value).
You can approximate the reciprocal by negating both the exponent and the mantissa. This may be achieved by interpreting the bits as int.
The first idea only works in a certain range, so we check the range and adjust the constants used. This works in 64 bits because the desired float only has 23 bits of precision.
The result in this code will be double, but converting to float is trivial, and can be done on the bits or directly, depending on hardware.
After this you'd want to do the Newton-Raphson iteration(s).
Much of this code simply sets up the magic numbers.
double
u64tod_inv( uint64_t u64 ) {
    __asm__( "#annot0" );
    union {
        double f;
        struct {
            unsigned long m:52; // careful here with endianness
            unsigned long x:11;
            unsigned long s:1;
        } u64;
        uint64_t u64i;
    } z,
      magic0 = { .u64 = { 0, (1<<10)-1 + 52, 0 } },
      magic1 = { .u64 = { 0, (1<<10)-1 + (52+12), 0 } },
      magic2 = { .u64 = { 0, 2046, 0 } };
    __asm__( "#annot1" );
    if( u64 < (1UL << 52UL) ) {
        z.u64i = u64 + magic0.u64i;
        z.f -= magic0.f;
    } else {
        z.u64i = ( u64 >> 12 ) + magic1.u64i;
        z.f -= magic1.f;
    }
    __asm__( "#annot2" );
    z.u64i = magic2.u64i - z.u64i;
    return z.f;
}
Compiling this on an Intel Core i7 gives a number of instructions (and a branch), but, of course, no multiplies or divides at all. If the casts between int and double are fast this should run pretty quickly.
I suspect float (with only 23 bits of precision) will require more than 1 or 2 Newton-Raphson iterations to get the accuracy you want, but I haven't done the math...

Efficient implementation of natural logarithm (ln) and exponentiation

I'm looking for implementations of the log() and exp() functions provided in the C library <math.h>. I'm working with 8-bit microcontrollers (OKI 411 and 431). I need to calculate the Mean Kinetic Temperature. The requirement is that we should be able to calculate MKT as fast as possible and with as little code memory as possible. The compiler comes with log() and exp() functions in <math.h>. But calling either function and linking with the library causes the code size to increase by 5 kilobytes, which will not fit in one of the micros we work with (OKI 411), because our code has already consumed ~12K of the available ~15K code memory.
The implementation I'm looking for should not use any other C library functions (like pow(), sqrt() etc.). This is because all library functions are packed in one library, and even if one function is called, the linker will bring the whole 5K library into code memory.
EDIT
The algorithm should be correct up to 3 decimal places.
Using a Taylor series is neither the simplest nor the fastest way of doing this. Most professional implementations use approximating polynomials. I'll show you how to generate one in Maple (a computer algebra program), using the Remez algorithm.
For 3 digits of accuracy execute the following commands in Maple:
with(numapprox):
Digits := 8
minimax(ln(x), x = 1 .. 2, 4, 1, 'maxerror')
maxerror
Its response is the following polynomial:
-1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x
With the maximal error of: 0.000061011436
We generated a polynomial which approximates ln(x), but only inside the [1..2] interval. Increasing the interval is not wise, because that would increase the maximal error even more. Instead of that, do the following decomposition: write x = 2^n * m with m in [1..2], so that ln(x) = n*ln(2) + ln(m).
So first find the highest power of 2 which is still smaller than the number (see: What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?). That exponent n is the integer part of the base-2 logarithm. Divide by that power of 2, and the result m lands in the [1..2] interval, where the polynomial applies. At the end we will have to add n*ln(2) to get the final result.
An example implementation for numbers >= 1:
float ln(float y) {
    int log2;
    float divisor, x, result;

    log2 = msb((int)y); // See: https://stackoverflow.com/a/4970859/6630230
    divisor = (float)(1 << log2);
    x = y / divisor;    // normalized value between [1.0, 2.0]

    result = -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x;
    result += ((float)log2) * 0.69314718; // ln(2) = 0.69314718

    return result;
}
Although if you plan to use it only in the [1.0, 2.0] interval, then the function is like:
float ln(float x) {
    return -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x;
}
The Taylor series for e^x converges extremely quickly, and you can tune your implementation to the precision that you need. (http://en.wikipedia.org/wiki/Taylor_series)
The Taylor series for log is not as nice...
If you don't need floating-point math for anything else, you may compute an approximate fractional base-2 log pretty easily. Start by shifting your value left until it's 32768 or higher and store the number of times you did that in count. Then, repeat some number of times (depending upon your desired scale factor):
n = (mult(n,n) + 32768u) >> 16; // If a function is available for 16x16->32 multiply
count <<= 1;
if (n < 32768) n *= 2; else count += 1;
If the above loop is repeated 8 times, then the log base 2 of the number will be count/256. If ten times, count/1024. If eleven, count/2048. Effectively, this function works by computing the integer power-of-two logarithm of n**(2^reps), but with intermediate values scaled to avoid overflow.
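Wrapped into a complete function, the scheme might look like the following sketch (my assumptions: a 16-bit input, a plain 32-bit multiply standing in for mult(), and the MSB position supplying the integer part of the result):
#include <stdint.h>

/* sketch: log2(x) in fixed point with 'reps' fraction bits, using only
   shifts, adds and one 16x16->32 multiply per iteration */
int32_t fixed_log2 (uint16_t x, int reps)
{
    if (x == 0) return INT32_MIN;          /* log2(0) undefined */
    uint32_t n = x;
    int32_t count = 15;                    /* MSB position = integer part */
    while (n < 32768u) { n <<= 1; count--; }
    for (int i = 0; i < reps; i++) {
        n = (n * n + 32768u) >> 16;        /* square; scaling keeps n in 16 bits */
        count <<= 1;                       /* make room for the next bit */
        if (n < 32768u) n <<= 1;           /* squared value < 2: bit is 0 */
        else            count += 1;        /* squared value >= 2: bit is 1 */
    }
    return count;                          /* log2(x) ~= count / 2^reps */
}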
Would a basic table with interpolation between values work? If the range of values is limited (which is likely for your case - I doubt temperature readings have a huge range) and high precision is not required, it may work. Should be easy to test on a normal machine.
Here is one of many topics on table representation of functions: Calculating vs. lookup tables for sine value performance?
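As a sketch of the idea, here is ln(x) on [1.0, 2.0] from a small precomputed 9-entry table with linear interpolation (the table step of 1/8 is an arbitrary choice for illustration):
/* sketch: ln(x) for 1.0 <= x <= 2.0 via table lookup plus linear interpolation */
static const float ln_tab[9] = {
    0.000000f, 0.117783f, 0.223144f, 0.318454f,
    0.405465f, 0.485508f, 0.559616f, 0.628609f, 0.693147f
};

float ln_lut(float x)
{
    float t = (x - 1.0f) * 8.0f;    /* position in table units */
    int   i = (int)t;               /* lower table index */
    if (i > 7) i = 7;               /* clamp so i+1 stays in range */
    float f = t - (float)i;         /* fractional part for interpolation */
    return ln_tab[i] + f * (ln_tab[i + 1] - ln_tab[i]);
}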
Necromancing.
I had to implement logarithms on rational numbers.
This is how I did it:
According to Wikipedia, there is the Halley-Newton approximation method,
which can be used for very high precision.
Using Newton's method, the iteration simplifies to the implementation below, which has cubic convergence to ln(x), which is way better than what the Taylor series offers.
// Using Newton's method, the iteration simplifies to (implementation)
// which has cubic convergence to ln(x).
public static double ln(double x, double epsilon)
{
    double yn = x - 1.0d; // using the first term of the taylor series as initial-value
    double yn1 = yn;

    do
    {
        yn = yn1;
        yn1 = yn + 2 * (x - System.Math.Exp(yn)) / (x + System.Math.Exp(yn));
    } while (System.Math.Abs(yn - yn1) > epsilon);

    return yn1;
}
This is not C, but C#, but I'm sure anybody capable of programming in C will be able to deduce the C code from that.
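For convenience, a direct C translation might look like this sketch (it leans on exp() and fabs() from math.h, so it illustrates the general technique rather than any no-library constraint):
#include <math.h>

/* sketch: C translation of the C# routine above */
double ln_newton(double x, double epsilon)
{
    double yn = x - 1.0; /* first term of the Taylor series as initial value */
    double yn1 = yn;
    do {
        yn = yn1;
        double ey = exp(yn);
        yn1 = yn + 2.0 * (x - ey) / (x + ey); /* cubically convergent step */
    } while (fabs(yn - yn1) > epsilon);
    return yn1;
}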
Furthermore, since logn(x) = ln(x)/ln(n), you have therefore just implemented logN as well:
public static double log(double x, double n, double epsilon)
{
    return ln(x, epsilon) / ln(n, epsilon);
}
where epsilon (error) is the minimum precision.
Now as to speed, you're probably better off using the ln cast in hardware, but as I said, I used this as a base to implement logarithms on a rational numbers class working with arbitrary precision.
Arbitrary precision might be more important than speed, under certain circumstances.
Then, use the logarithmic identities for rational numbers:
logB(x/y) = logB(x) - logB(y)
In addition to Crouching Kitten's answer, which gave me inspiration, you can build a pseudo-recursive (at most 1 self-call) logarithm to avoid using polynomials. In pseudocode:
ln(x) :=
    If (x <= 0)
        return NaN
    Else if (!(1 <= x < 2))
        return LN2 * b + ln(a)
    Else
        return taylor_expansion(x - 1)
This is pretty efficient and precise, since on [1; 2) the Taylor series converges A LOT faster, and the first call to ln hands us such a number 1 <= a < 2 whenever the input is positive but not in this range.
You can find 'b' as your unbiased exponent from the data held in the float x, and 'a' from the mantissa of the float x (a is exactly the same float as x, but now with exponent biased_0 rather than exponent biased_b). LN2 should be kept as a macro in hexadecimal floating point notation IMO. You can also use http://man7.org/linux/man-pages/man3/frexp.3.html for this.
Also, the trick
unsigned long tmp = *(ulong*)(&d);
for "memory-casting" double to unsigned long, rather than "value-casting", is very useful to know when dealing with floats memory-wise, as bitwise operators will cause warnings or errors depending on the compiler.
Possible computation of ln(x) and expo(x) in C without <math.h>:
static double expo(double n) {
    int a = 0, b = n > 0;
    double c = 1, d = 1, e = 1;
    for (b || (n = -n); e + .00001 < (e += (d *= n) / (c *= ++a)););
    // approximately 15 iterations
    return b ? e : 1 / e;
}

static double native_log_computation(const double n) {
    // Basic logarithm computation.
    static const double euler = 2.7182818284590452354;
    unsigned a = 0, d;
    double b, c, e, f;
    if (n > 0) {
        for (c = n < 1 ? 1 / n : n; (c /= euler) > 1; ++a);
        c = 1 / (c * euler - 1), c = c + c + 1, f = c * c, b = 0;
        for (d = 1, c /= 2; e = b, b += 1 / (d * c), b - e/* > 0.0000001 */;)
            d += 2, c *= f;
    } else b = (n == 0) / 0.;
    return n < 1 ? -(a + b) : a + b;
}

static inline double native_ln(const double n) {
    // Returns the natural logarithm (base e) of N.
    return native_log_computation(n);
}

static inline double native_log_base(const double n, const double base) {
    // Returns the logarithm (base b) of N.
    return native_log_computation(n) / native_log_computation(base);
}
Building off Crouching Kitten's great natural log answer above, if you need it to be accurate for inputs < 1 you can add a simple scaling factor. Below is an example in C++ that I've used on microcontrollers. It has a scaling factor of 256, and it's accurate for inputs down to 1/256 = ~0.04 and up to 2^32/256 = 16777216 (due to overflow of a uint32 variable).
It's interesting to note that even on an STM32F103 (Arm Cortex-M3) with no FPU, the float implementation below is significantly faster (e.g. 3x or better) than the 16-bit fixed-point implementation in libfixmath (that being said, this float implementation still takes a few thousand cycles, so it's still not ~fast~).
#include <float.h>
#include <stdint.h>

float TempSensor::Ln(float y)
{
    // Algo from: https://stackoverflow.com/a/18454010
    // Accurate between (1 / scaling factor) < y < (2^32 / scaling factor).
    // Read comments below for more info on how to extend this range.
    float divisor, x, result;
    const float LN_2 = 0.69314718; // pre-calculated constant used in calculations
    uint32_t log2 = 0;

    // handle if input is zero or less
    if (y <= 0)
    {
        return -FLT_MAX;
    }

    // Scaling factor. The polynomial below is accurate when the input y > 1;
    // therefore using a scaling factor of 256 (aka 2^8) extends this down to
    // 1/256 or ~0.04. Given the use of uint32_t, the input y must stay below
    // 2^24 or 16777216 (aka 2^(32-8)), otherwise uint_y used below will overflow.
    // Increasing the scaling factor will reduce the lower accuracy bound and
    // also reduce the upper overflow bound. If you need the range to be wider,
    // consider changing uint_y to a uint64_t.
    const uint32_t SCALING_FACTOR = 256;
    const float LN_SCALING_FACTOR = 5.545177444; // natural log of the scaling factor, precalculated
    y = y * SCALING_FACTOR;

    // Convert the number to an integer and then find the location of the MSB.
    // This is the integer portion of Log2(y). See: https://stackoverflow.com/a/4970859/6630230
    uint32_t uint_y = (uint32_t)y;
    while (uint_y >>= 1)
    {
        log2++;
    }

    // Find the remainder value between [1.0, 2.0], then calculate the natural log
    // of this remainder using a polynomial approximation valid on [1, 2].
    divisor = (float)(1 << log2);
    x = y / divisor;
    result = -1.7417939 + (2.8212026 + (-1.4699568 + (0.44717955 - 0.056570851 * x) * x) * x) * x;

    // Using the log product rule Log(A) + Log(B) = Log(AB) and the log base change rule
    // log_x(A) = log_y(A)/log_y(x), calculate all the components in base e and sum them:
    // = Ln(x_remainder) + (log_2(x_integer) * ln(2)) - ln(SCALING_FACTOR)
    result = result + ((float)log2) * LN_2 - LN_SCALING_FACTOR;
    return result;
}

How will you implement pow(a,b) in C ? condition follows --

without using multiplication or division operators.
You can use only add/subtract operators.
A pointless problem, but solvable with the properties of logarithms:
pow(a,b) = exp( b * log(a) )
         = exp( exp( log(b) + log(log(a)) ) )
Take care to ensure that your exponential and logarithm functions are using the same base.
Yes, I know how to use a slide rule. Learning that trick will change your perspective on logarithms.
If they are integers, it's simple to turn pow (a, b) into b multiplications of a.
pow(a, b) = a * a * a * a ... ; // do this b times
And simple to turn a * a into additions
a * a = a + a + a + a + ... ; // do this a times
If you combine them, you can make pow.
First, make mult(int a, int b), then use it to make pow.
A recursive solution:
#include <stdio.h>

int multiplication(int a1, int b1)
{
    if (b1)
        return (a1 + multiplication(a1, b1 - 1));
    else
        return 0;
}

int pow(int a, int b)
{
    if (b)
        return multiplication(a, pow(a, b - 1));
    else
        return 1;
}

int main()
{
    printf("\n %d", pow(5, 4));
}
You've already gotten answers purely for FP and purely for integers. Here's one for a FP number raised to an integer power:
double power(double x, int y) {
    double z = 1.0;
    while (y > 0) {
        while (!(y&1)) {
            y >>= 1; // exponent is even: halve it...
            x *= x;  // ...and square the base
        }
        --y;
        z = x * z;
    }
    return z;
}
At the moment this uses multiplication. You can implement multiplication using only bit shifts, a few bit comparisons, and addition. For integers it looks like this:
int mul(int x, int y) {
    int result = 0;
    while (y) {
        if (y&1)
            result += x; // add x for each set bit of y
        x <<= 1;
        y >>= 1;
    }
    return result;
}
Floating point is pretty much the same, except you have to normalize your results -- i.e., in essence, a floating point number is 1) a significand expressed as a (usually fairly large) integer, and 2) a scale factor. If you want to produce normal IEEE floating point numbers a few parts get a bit ugly though -- for example, the scale factor is stored as a "bias" number instead of any of the usual 1's complement, 2's complement, etc., so working with it is clumsy (basically, for each operation you subtract off the bias, do the operation, check for overflow, and (assuming it hasn't overflowed) add the bias back on again).
Doing the job without any kind of logical tests sounds (to me) like it probably wasn't really intended. For quite a few computer architecture classes, it's interesting to reduce a problem to primitive operations you can express directly in hardware (e.g., bit shifts, bitwise-AND, -OR and -NOT, etc.). The implementation shown above fits that reasonably well (if you want to get technical, an adder takes a few gates, but it's available as a primitive in hardware description languages like VHDL and Verilog anyway).
