I have to raise 10 to the power of a double a lot of times.
Is there a more efficient way to do this than with the math library pow(10,double)? If it matters, my doubles are always negative between -5 and -11.
I assume pow(double,double) uses a more general algorithm than is required for pow(10,double) and might therefore not be the fastest method. Given some of the answers below, that might have been an incorrect assumption.
As for the why, it is for logartihmic interpolation.
I have a table of x and y values.
My object has a known x value (which is almost always a double).
double Dbeta(struct Data *diffusion, double per){
double frac;
while(per>diffusion->x[i]){
i++;
}
frac = (per-diffusion->x[i-1])/(diffusion->x[i]-diffusion->x[i-1]);
return pow(10,log10DB[i-1] + frac * (log10DB[i]-log10DB[i-1]));
}
This function is called a lot of times.
I have been told to look into profiling, so that is what I will do first.
I have just been told I could have used natural logarithms instead of base 10, which is obviously right. (my stupidity sometimes amazes even myself.)
After replacing everything with natural logarithms everything runs a bit faster. With profiling (which is a new word I learned today) I found out 39% of my code is spend in the exp function, so for those who wondered if it was in fact this part that was bottlenecking my code, it was.
For pow(10.0, n) it should be faster to set c = log(10.0), which you can compute once, then use exp(c*n), which should be significantly faster than pow(10.0, n) (which is basically doing that same thing internally, except it would be calculating log(10.0) over and over instead of just once). Beyond that, there probably isn't much else you can do.
Yes, the pow function is slow (roughly 50x the cost of a multiply, for those asking for benchmarks).
By some log/exponents trickery, we can express 10^x as
10^x = exp(log(10^x)) = exp(x * log(10)).
So you can implement 10^x with exp(x * M_LN10), which should be more efficient than pow.
If double accuracy isn't critical, use the float version of the function expf (or powf), which should be more efficient than the double version.
If rough accuracy is Ok, precompute a table over the [-5, -11] range and do a quick look up with linear interpolation.
Some benchmarks (using glibc 2.31):
Benchmark Time
---------------------------------
pow(10, x) 15.54 ns
powf(10, x) 7.18 ns
expf(x * (float)M_LN10) 3.45 ns
Related
I'm trying to do the Steinhart-Hart temperature calculation on an Arduino. The equation is
I solved a system of 3 equations to obtain the values of A, B and C, which are:
A = 0.0164872
B = -0.00158538
C = 3.3813e-6
When I plug these into WolframAlpha to solve for T I get a value in Kelvins that makes sense:
T=1/(0.0164872-0.00158538*log2(10000)+3.3813E-6*(log2(10000))^3) solve for T
T = 298.145 Kelvins = 77 Fahrenheit
However when I try to use this equation on my Arduino, I get a very wrong answer, I suspect because doubles do not have enough precision. Here's what I'm using:
double temp = (1 / (A + B*log(R_therm) + C*pow(log(R_therm),3)));
This returns 222 Kelvin instead, which is way off.
So, how can I do a calculation like this in Arduino?? Any advice is greatly appreciated, thanks.
Precision is not the main issue. Could even use float and powf(). A thermistor temperature calculation is not that accurate. After all the temperature is certainly not better than ±0.1°C accurate. Self heating of the thermistor is a larger factor.
OP's C code assumes log base 2, use log base e log() as the constants were derived using log base 2. #Martin R
// double temp = (1 / (A + B*log(R_therm) + C*pow(log(R_therm),3)));
double temp = (1 / (A + B*log(R_therm)/log(2) + C*pow(log(R_therm)/log(2),3)));`
Sample implementation, that avoids an unnecessary slow pow() call.
static const inv_ln2 = 1.4426950408889634073599246810019;
double ln2_R = log(R_therm)*inv_ln2;
double temp = 1.0 / (A + ln2_R*(B + C*ln2_R*ln2_R));
Yes, floating point arithmetic has limited precision on most arduinos.
Have you considered using fixed precision? If used correctly, this might give you better results. The requirement for this is to have rather narrow parameters, however, and be careful about unit conversions.
An unsigned long on arduino is 4 bytes too, so it can contain numbers up to 2^32-1. If using fixed point, you might want to replace this 1/T by something like 100000/T, where the numerator constant and T have been scaled according to the desired precision.
You will also need to keep a (mental or paper) model of the number of decimals each variable contains, in order to optimize the operation order not to lose precision.
For the log2 function, I doubt it is available out of the box for integers. You could either cast the result or reimplement it. There is plenty of ressources for this problem, even here on SO.
I am trying to optimize a code in C, specificly a critical loop which takes almost 99.99% of total execution time. Here is that loop:
#pragma omp parallel shared(NTOT,i) num_threads(4)
{
# pragma omp for private(dx,dy,d,j,V,E,F,G) reduction(+:dU) nowait
for(j = 1; j <= NTOT; j++){
if(j == i) continue;
dx = (X[j][0]-X[i][0])*a;
dy = (X[j][1]-X[i][1])*a;
d = sqrt(dx*dx+dy*dy);
V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
E = dS[0]*dx+dS[1]*dy;
F = spin[2*j-2]*dx+spin[2*j-1]*dy;
G = -3*(D/(d*d*d*d*d))*E*F;
dU += (V+G);
}
}
All variables are local. The loop takes 0.7 second for NTOT=3600 which is a large amount of time, especially when I have to do this 500,000 times in the whole program, resulting in 97 hours spent in this loop. My question is if there are other things to be optimized in this loop?
My computer's processor is an Intel core i5 with 4 CPU(4X1600Mhz) and 3072K L3 cache.
Optimize for hardware or software?
Soft:
Getting rid of time consuming exceptions such as divide by zeros:
d = sqrt(dx*dx+dy*dy + 0.001f );
V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
You could also try John Carmack , Terje Mathisen and Gary Tarolli 's "Fast inverse square root" for the
D/(d*d*d)
part. You get rid of division too.
float qrsqrt=q_rsqrt(dx*dx+dy*dy + easing);
qrsqrt=qrsqrt*qrsqrt*qrsqrt * D;
with sacrificing some precision.
There is another division also to be gotten rid of:
(D/(d*d*d*d*d))
such as
qrsqrt_to_the_power2 * qrsqrt_to_the_power3 * D
Here is the fast inverse sqrt:
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what ?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
}
To overcome big arrays' non-caching behaviour, you can do the computation in smaller patches/groups especially when is is many to many O(N*N) algorithm. Such as:
get 256 particles.
compute 256 x 256 relations.
save 256 results on variables.
select another 256 particles as target(saving the first 256 group in place)
do same calculations but this time 1st group vs 2nd group.
save first 256 results again.
move to 3rd group
repeat.
do same until all particles are versused against first 256 particles.
Now get second group of 256.
iterate until all 256's are complete.
Your CPU has big cache so you can try 32k particles versus 32k particles directly. But L1 may not be big so I would stick with 512 vs 512(or 500 vs 500 to avoid cache line ---> this is going to be dependent on architecture) if I were you.
Hard:
SSE, AVX, GPGPU, FPGA .....
As #harold commented, SSE should be start point to compare and you should vectorize or at least parallelize through 4-packed vector instructions which have advantage of optimum memory fetching ability and pipelining. When you need 3x-10x more performance(on top of SSE version using all cores), you will need an opencl/cuda compliant gpu(equally priced as i5) and opencl(or cuda) api or you can learn opengl too but it seems harder(maybe directx easier).
Trying SSE is easiest, should give 3x faster than the fast inverse I mentionad above. An equally priced gpu should give another 3x of SSE at least for thousands of particles. Going or over 100k particles, whole gpu can achieve 80x performance of a single core of cpu for this type of algorithm when you optimize it enough(making it less dependent to main memory). Opencl gives ability to address cache to save your arrays. So you can use terabytes/s of bandwidth in it.
I would always do random pausing
to pin down exactly which lines were most costly.
Then, after fixing something I would do it again, to find another fix, and so on.
That said, some things look suspicious.
People will say the compiler's optimizer should fix these, but I never rely on that if I can help it.
X[i], X[j], spin[2*j-1(and 2)] look like candidates for pointers. There is no need to do this index calculation and then hope the optimizer can remove it.
You could define a variable d2 = dx*dx+dy*dy and then say d = sqrt(d2). Then wherever you have d*d you can instead write d2.
I suspect a lot of samples will land in the sqrt function, so I would try to figure a way around using that.
I do wonder if some of these quantities like (dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]) could be calculated in a separate unrolled loop outside this loop. In some cases two loops can be faster than one if the compiler can save some registers.
I cannot believe that 3600 iterations of an O(1) loop can take 0.7 seconds. Perhaps you meant the double loop with 3600 * 3600 iterations? Otherwise I can suggest checking if optimization is enabled, and how long threads spawning takes.
General
Your inner loop is very simple and it contains only a few operations. Note that divisions and square roots are roughly 15-30 times slower than additions, subtractions and multiplications. You are doing three of them, so most of the time is eaten by them.
First of all, you can compute reciprocal square root in one operation instead of computing square root, then getting reciprocal of it. Second, you should save the result and reuse it when necessary (right now you divide by d twice). This would result in one problematic operation per iteration instead of three.
invD = rsqrt(dx*dx+dy*dy);
V = (D * (invD*invD*invD))*(...);
...
G = -3*(D * (invD*invD*invD*invD*invD))*E*F;
dU += (V+G);
In order to further reduce time taken by rsqrt, I advise vectorizing it. I mean: compute rsqrt for two or four input values at once with SSE. Depending on size of your arguments and desired precision of result, you can take one of the routines from this question. Note that it contains a link to a small GitHub project with all the implementations.
Indeed you can go further and vectorize the whole loop with SSE (or even AVX), that is not hard.
OpenCL
If you are ready to use some big framework, then I suggest using OpenCL. Your loop is very simple, so you won't have any problems porting it to OpenCL (except for some initial adaptation to OpenCL).
Then you can use CPU implementations of OpenCL, e.g. from Intel or AMD. Both of them would automatically use multithreading. Also, they are likely to automatically vectorize your loop (e.g. see this article). Finally, there is a chance that they would find a good implementation of rsqrt automatically, if you use native_rsqrt function or something like that.
Also, you would be able to run your code on GPU. If you use single precision, it may result in significant speedup. If you use double precision, then it is not so clear: modern consumer GPUs are often slow with double precision, because they lack the necessary hardware.
Minor optimisations:
(d * d * d) is calculated twice. Store d*d and use it for d^3 and d^5
Modify 2 * x by x<<1;
I'm trying to make a program to calculate the cos(x) function using taylor series so far I've got this:
int factorial(int a){
if(a < 0)
return 0;
else if(a==0 || a==1)
return 1;
else
return a*(factorial(a-1));
}
double Tserie(float angle, int repetitions){
double series = 0.0;
float i;
for(i = 0.0; i < repeticiones; i++){
series += (pow(-1, i) * pow(angle, 2*i))/factorial(2*i);
printf("%f\n", (pow(-1, i) * pow(angle, 2*i))/factorial(2*i));
}
return series;
}
For my example I'm using angle = 90, and repetitions = 20, to calculate cos(90) but it's useless I just keep getting values close to the infinite, any help will be greatly appreciated.
For one thing, the angle is in radians, so for a 90 degree angle, you'd need to pass M_PI/2.
Also, you should avoid recursive functions for something as trivial as factorials, it would take 1/4 the effort to write it iteratively and it would perform a lot better. You don't actually even need it, you can keep the factorial in a temporary variable and just multiply it by 2*i*(2*i-1) at each step. Keep in mind that you'll hit a representability/precision wall really quickly at this step.
You also don't need to actually call pow for -1 to the power of i, a simple i%2?1:-1 would suffice. This way it's faster and it won't lose precision as you increase i.
Oh and don't make i float, it's an integer, make it an integer. You're leaking precision a lot as it is, why make it worse..
And to top it all off, you're approximating cos around 0, but are calling it for pi/2. You'll get really high errors doing that.
The Taylor series is for the mathematical cosine function, whose arguments is in radians. So 90 probably doesn't mean what you thought it meant here.
Furthermore, the series requires more terms the longer the argument is from 0. Generally, the number of terms need to be comparable to the size of the argument before you even begin to see the successive terms becoming smaller, and many more than that in order to get convergence. 20 is pitifully few terms to use for x=90.
Another problem is then that you compute the factorial as an int. The factorial function grows very fast -- already for 13! an ordinary C int (on a 32-bit machine) will overflow, so your terms beyond the sixth will be completely wrong anyway.
In fact the factorials and the powers of 90 quickly become too large to be represented even as doubles. If you want any chance of seeing the series converge, you must not compute each term from scratch but derive it from the previous one using a formula such as
nextTerm = - prevTerm * x * x / (2*i-1) / (2*i);
If have the following C function, used to determine if one number is a multiple of another to an arbirary tolerance
#include <math.h>
#define TOLERANCE 0.0001
int IsMultipleOf(double x,double mod)
{
return(fabs(fmod(x, mod)) < TOLERANCE);
}
It works fine, but profiling shows it to be very slow, to the extent that it has become a candidate for optimization. About 75% of the time is spent in modulo and the remaining in fabs. I'm trying to figure a way of speeding things up, using something like a look-up table. The parameter x changes regularly, whereas mod changes infrequently. The number of possible values of x is small enough that the space for a look-up would not be an issue, typically it will be one of a few hundred possible values. I can get rid of the fabs easily enough, but can't figure out a reasonable alternative to the modulo. Any ideas on how to optimize the above?
Edit The code will be running on a wide range of Windows desktop and mobile devices, hence processors could include Intel, AMD on desktop, and ARM or SH4 on mobile devices. VisualStudio 2008 is the compiler.
Do you really have to use modulo for this?
Wouldn't it be possible to just result = x / mod and then check if the decimal part of result is close to 0. For instance:
11 / 5.4999 = 2.000003 ==> 0.000003 < TOLERANCE
Or something like that.
Division (floating point or not, fmod in your case) is often an operation where the execution time varies a lot depending on the cpu and compiler:
gcc has a builtin replacement for
that if you give it the right compile
flags or if you use __builtin_fmod
explicitly. This then might map the
operation on a small number of
assembler instructions.
there may be special units like SSE
on intel processors where this
operation is implemented more
efficiently
By such tricks, depending on your environment (you didn't tell which) the time may vary from some clock cycles to some hundred. I think best is to look into the documentation of your compiler and cpu for that particular operation.
The following is probably overkill, and sub-optimal. But for what it is worth here is one way on how to do it.
We know the format of the double ...
1 bit for the sign
11 bits for the biased exponent
52 fraction bits
Let ...
value = x / mod;
exp = exponent bits of value - BIAS;
lsb = least sig bit of value's fraction bits;
Once you have that ...
/*
* If applying the exponent would eliminate the fraction bits
* then for double precision resolution it is a multiple.
* Note: lsb may require some massaging.
*/
if (exp > lsb)
return (true);
if (exp < 0)
return (false);
The only case remaining is the tolerance case. Build your double so that you are getting rid of all the digits to the left of the decimal.
sign bit is zero (positive)
exponent is the BIAS (1023 I think ... look it up to be sure)
shift the fraction bits as appropriate
Now compare it against your tolerance.
I think you need to inspect the bowels of your C RTL fmod() function: X86 FPU's have 'FPREM/FPREM1' instructions which computes remainders by repeated subtraction.
While floating point division is a single instruction, it seems you may need to call FPREM repeatedly to get the right answer for modulus, so your RTL may not use it.
I have not tested this at all, but from the way I understand fmod this should be equivalent inlined, which might let the compiler optimize it better, though I would have thought that the compiler's math library (or builtins) would work just as well. (also, I don't even know for sure if this is correct).
#include <math.h>
int IsMultipleOf(double x, double mod) {
long n = x / mod; // You should probably test for /0 or NAN result here
double new_x = mod * n;
double delta = x - new_x;
return fabs(delta) < TOLERANCE; // and for NAN result from fabs
}
Maybe you can get away with long long instead of double if you have comparable scale of data. For example long long would be enough for over 60 astronomical units in micrometer resolution.
Does it need to be double precision ? Depending on how good your math library is, this ought to be faster:
#include <math.h>
#define TOLERANCE 0.0001f
bool IsMultipleOf(float x, float mod)
{
return(fabsf(fmodf(x, mod)) < TOLERANCE);
}
I presume modulo looks a little like this on the inside:
mod(x,m) {
while (x > m) {
x = x - m
}
return x
}
I think that through some sort of search i could be optimised: eg:
fastmod(x,m) {
q = 1
while (m * q < x) {
q = q * 2
}
return mod((x - (q / 2) * m), m)
}
You might even choose to replace the finall call to mod with annother call to fastmod, adding the condition that if x < m then to return x.
For one of my course project I started implementing "Naive Bayesian classifier" in C. My project is to implement a document classifier application (especially Spam) using huge training data.
Now I have problem implementing the algorithm because of the limitations in the C's datatype.
( Algorithm I am using is given here, http://en.wikipedia.org/wiki/Bayesian_spam_filtering )
PROBLEM STATEMENT:
The algorithm involves taking each word in a document and calculating probability of it being spam word. If p1, p2 p3 .... pn are probabilities of word-1, 2, 3 ... n. The probability of doc being spam or not is calculated using
Here, probability value can be very easily around 0.01. So even if I use datatype "double" my calculation will go for a toss. To confirm this I wrote a sample code given below.
#define PROBABILITY_OF_UNLIKELY_SPAM_WORD (0.01)
#define PROBABILITY_OF_MOSTLY_SPAM_WORD (0.99)
int main()
{
int index;
long double numerator = 1.0;
long double denom1 = 1.0, denom2 = 1.0;
long double doc_spam_prob;
/* Simulating FEW unlikely spam words */
for(index = 0; index < 162; index++)
{
numerator = numerator*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
denom2 = denom2*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
denom1 = denom1*(long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD);
}
/* Simulating lot of mostly definite spam words */
for (index = 0; index < 1000; index++)
{
numerator = numerator*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
denom2 = denom2*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
denom1 = denom1*(long double)(1- PROBABILITY_OF_MOSTLY_SPAM_WORD);
}
doc_spam_prob= (numerator/(denom1+denom2));
return 0;
}
I tried Float, double and even long double datatypes but still same problem.
Hence, say in a 100K words document I am analyzing, if just 162 words are having 1% spam probability and remaining 99838 are conspicuously spam words, then still my app will say it as Not Spam doc because of Precision error (as numerator easily goes to ZERO)!!!.
This is the first time I am hitting such issue. So how exactly should this problem be tackled?
This happens often in machine learning. AFAIK, there's nothing you can do about the loss in precision. So to bypass this, we use the log function and convert divisions and multiplications to subtractions and additions, resp.
SO I decided to do the math,
The original equation is:
I slightly modify it:
Taking logs on both sides:
Let,
Substituting,
Hence the alternate formula for computing the combined probability:
If you need me to expand on this, please leave a comment.
Here's a trick:
for the sake of readability, let S := p_1 * ... * p_n and H := (1-p_1) * ... * (1-p_n),
then we have:
p = S / (S + H)
p = 1 / ((S + H) / S)
p = 1 / (1 + H / S)
let`s expand again:
p = 1 / (1 + ((1-p_1) * ... * (1-p_n)) / (p_1 * ... * p_n))
p = 1 / (1 + (1-p_1)/p_1 * ... * (1-p_n)/p_n)
So basically, you will obtain a product of quite large numbers (between 0 and, for p_i = 0.01, 99). The idea is, not to multiply tons of small numbers with one another, to obtain, well, 0, but to make a quotient of two small numbers. For example, if n = 1000000 and p_i = 0.5 for all i, the above method will give you 0/(0+0) which is NaN, whereas the proposed method will give you 1/(1+1*...1), which is 0.5.
You can get even better results, when all p_i are sorted and you pair them up in opposed order (let's assume p_1 < ... < p_n), then the following formula will get even better precision:
p = 1 / (1 + (1-p_1)/p_n * ... * (1-p_n)/p_1)
that way you devide big numerators (small p_i) with big denominators (big p_(n+1-i)), and small numerators with small denominators.
edit: MSalter proposed a useful further optimization in his answer. Using it, the formula reads as follows:
p = 1 / (1 + (1-p_1)/p_n * (1-p_2)/p_(n-1) * ... * (1-p_(n-1))/p_2 * (1-p_n)/p_1)
Your problem is caused because you are collecting too many terms without regard for their size. One solution is to take logarithms. Another is to sort your individual terms. First, let's rewrite the equation as 1/p = 1 + ∏((1-p_i)/p_i). Now your problem is that some of the terms are small, while others are big. If you have too many small terms in a row, you'll underflow, and with too many big terms you'll overflow the intermediate result.
So, don't put too many of the same order in a row. Sort the terms (1-p_i)/p_i. As a result, the first will be the smallest term, the last the biggest. Now, if you'd multiply them straight away you would still have an underflow. But the order of calculation doesn't matter. Use two iterators into your temporary collection. One starts at the beginning (i.e. (1-p_0)/p_0), the other at the end (i.e (1-p_n)/p_n), and your intermediate result starts at 1.0. Now, when your intermediate result is >=1.0, you take a term from the front, and when your intemediate result is < 1.0 you take a result from the back.
The result is that as you take terms, the intermediate result will oscillate around 1.0. It will only go up or down as you run out of small or big terms. But that's OK. At that point, you've consumed the extremes on both ends, so it the intermediate result will slowly approach the final result.
There's of course a real possibility of overflow. If the input is completely unlikely to be spam (p=1E-1000) then 1/p will overflow, because ∏((1-p_i)/p_i) overflows. But since the terms are sorted, we know that the intermediate result will overflow only if ∏((1-p_i)/p_i) overflows. So, if the intermediate result overflows, there's no subsequent loss of precision.
Try computing the inverse 1/p. That gives you an equation of the form 1 + 1/(1-p1)*(1-p2)...
If you then count the occurrence of each probability--it looks like you have a small number of values that recur--you can use the pow() function--pow(1-p, occurences_of_p)*pow(1-q, occurrences_of_q)--and avoid individual roundoff with each multiplication.
You can use probability in percents or promiles:
doc_spam_prob= (numerator*100/(denom1+denom2));
or
doc_spam_prob= (numerator*1000/(denom1+denom2));
or use some other coefficient
I am not strong in math so I cannot comment on possible simplifications to the formula that might eliminate or reduce your problem. However, I am familiar with the precision limitations of long double types and am aware of several arbitrary and extended precision math libraries for C. Check out:
http://www.nongnu.org/hpalib/
and
http://www.tc.umn.edu/~ringx004/mapm-main.html