I am trying to optimize some C code, specifically a critical loop which takes almost 99.99% of total execution time. Here is that loop:
#pragma omp parallel shared(NTOT,i) num_threads(4)
{
#   pragma omp for private(dx,dy,d,j,V,E,F,G) reduction(+:dU) nowait
    for(j = 1; j <= NTOT; j++){
        if(j == i) continue;
        dx = (X[j][0]-X[i][0])*a;
        dy = (X[j][1]-X[i][1])*a;
        d = sqrt(dx*dx+dy*dy);
        V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
        E = dS[0]*dx+dS[1]*dy;
        F = spin[2*j-2]*dx+spin[2*j-1]*dy;
        G = -3*(D/(d*d*d*d*d))*E*F;
        dU += (V+G);
    }
}
All variables are local. The loop takes 0.7 seconds for NTOT=3600, which is a large amount of time, especially since I have to do this 500,000 times over the whole program, resulting in 97 hours spent in this loop. My question is: are there other things that can be optimized in this loop?
My computer's processor is an Intel Core i5 with 4 cores (4 x 1600 MHz) and 3072 KB of L3 cache.
Optimize for hardware or software?
Soft:
Getting rid of time-consuming exceptional cases such as divide-by-zero:
d = sqrt(dx*dx+dy*dy + 0.001f );
V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
You could also try John Carmack, Terje Mathisen and Gary Tarolli's "fast inverse square root" for the
D/(d*d*d)
part. That gets rid of the division too.
float qrsqrt = Q_rsqrt(dx*dx + dy*dy + easing);
qrsqrt = qrsqrt * qrsqrt * qrsqrt * D;
at the cost of some precision.
There is another division to get rid of as well:
(D/(d*d*d*d*d))
such as
qrsqrt_to_the_power2 * qrsqrt_to_the_power3 * D
Here is the fast inverse sqrt:
float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what ?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed
    return y;
}
To overcome the poor cache behaviour of big arrays, you can do the computation in smaller patches/groups, especially when it is a many-to-many O(N*N) algorithm. Such as:
get 256 particles.
compute 256 x 256 relations.
save the 256 results in variables.
select another 256 particles as the target (keeping the first 256 group in place).
do the same calculations, but this time 1st group vs 2nd group.
save the first 256 results again.
move to the 3rd group.
repeat.
do the same until all particles have been paired against the first 256 particles.
Now get the second group of 256.
iterate until all 256-groups are complete.
Your CPU has a big cache, so you can try 32k particles versus 32k particles directly. But L1 may not be big, so I would stick with 512 vs 512 (or 500 vs 500 to avoid cache-line conflicts; this is architecture dependent) if I were you.
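A minimal sketch of this blocking scheme (TILE, U[], and interaction() are illustrative placeholders standing in for the original loop body, not names from the question):

#define TILE 256

for (int jj = 1; jj <= NTOT; jj += TILE) {
    int jend = (jj + TILE - 1 <= NTOT) ? jj + TILE - 1 : NTOT;
    for (int ii = 1; ii <= NTOT; ii += TILE) {
        int iend = (ii + TILE - 1 <= NTOT) ? ii + TILE - 1 : NTOT;
        /* all pairs inside this TILE x TILE block stay hot in the cache */
        for (int j = jj; j <= jend; j++)
            for (int i = ii; i <= iend; i++)
                if (i != j)
                    U[i] += interaction(i, j);   /* same body as the original loop */
    }
}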
Hard:
SSE, AVX, GPGPU, FPGA .....
As @harold commented, SSE should be the starting point for comparison, and you should vectorize, or at least parallelize, through 4-wide packed vector instructions, which have the advantage of optimal memory fetching and pipelining. When you need 3x-10x more performance (on top of an SSE version using all cores), you will need an OpenCL/CUDA-capable GPU (priced roughly like an i5) and the OpenCL (or CUDA) API; you could also learn OpenGL, but it seems harder (maybe DirectX is easier).
Trying SSE is easiest and should be about 3x faster than the fast inverse square root I mentioned above. An equally priced GPU should give at least another 3x over SSE for thousands of particles. At or over 100k particles, a whole GPU can achieve 80x the performance of a single CPU core for this type of algorithm, if you optimize it enough (making it less dependent on main memory). OpenCL lets you address local/cache memory to hold your arrays, so you can use terabytes per second of bandwidth there.
I would always do random pausing to pin down exactly which lines were most costly. Then, after fixing something, I would do it again to find another fix, and so on.
That said, some things look suspicious.
People will say the compiler's optimizer should fix these, but I never rely on that if I can help it.
X[i], X[j], spin[2*j-1] (and spin[2*j-2]) look like candidates for pointers. There is no need to do this index calculation and then hope the optimizer removes it.
You could define a variable d2 = dx*dx+dy*dy and then say d = sqrt(d2). Then wherever you have d*d you can instead write d2.
I suspect a lot of samples will land in the sqrt function, so I would try to figure a way around using that.
I do wonder if some of these quantities like (dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]) could be calculated in a separate unrolled loop outside this loop. In some cases two loops can be faster than one if the compiler can save some registers.
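A minimal sketch combining these suggestions (hoisting X[i], walking the spin array with a pointer, and reusing d2); variable names follow the original loop, everything else is illustrative:

double xi0 = X[i][0], xi1 = X[i][1];
const double *sp = &spin[0];               /* sp[0], sp[1] correspond to spin[2*j-2], spin[2*j-1] */
for (j = 1; j <= NTOT; j++, sp += 2) {
    if (j == i) continue;
    double dx = (X[j][0] - xi0) * a;
    double dy = (X[j][1] - xi1) * a;
    double d2 = dx*dx + dy*dy;
    double d  = sqrt(d2);
    double inv_d3 = D / (d2 * d);          /* D / d^3 */
    double V = inv_d3 * (dS[0]*sp[0] + dS[1]*sp[1]);
    double E = dS[0]*dx + dS[1]*dy;
    double F = sp[0]*dx + sp[1]*dy;
    double G = -3.0 * (inv_d3 / d2) * E * F;   /* -3 * D / d^5 */
    dU += V + G;
}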
I cannot believe that 3600 iterations of an O(1) loop body can take 0.7 seconds. Perhaps you meant the double loop with 3600 * 3600 iterations? Otherwise I can suggest checking whether optimization is enabled, and how long thread spawning takes.
General
Your inner loop is very simple and it contains only a few operations. Note that divisions and square roots are roughly 15-30 times slower than additions, subtractions and multiplications. You are doing three of them, so most of the time is eaten by them.
First of all, you can compute reciprocal square root in one operation instead of computing square root, then getting reciprocal of it. Second, you should save the result and reuse it when necessary (right now you divide by d twice). This would result in one problematic operation per iteration instead of three.
invD = rsqrt(dx*dx+dy*dy);
V = (D * (invD*invD*invD))*(...);
...
G = -3*(D * (invD*invD*invD*invD*invD))*E*F;
dU += (V+G);
In order to further reduce time taken by rsqrt, I advise vectorizing it. I mean: compute rsqrt for two or four input values at once with SSE. Depending on size of your arguments and desired precision of result, you can take one of the routines from this question. Note that it contains a link to a small GitHub project with all the implementations.
Indeed you can go further and vectorize the whole loop with SSE (or even AVX), that is not hard.
OpenCL
If you are ready to use some big framework, then I suggest using OpenCL. Your loop is very simple, so you won't have any problems porting it to OpenCL (except for some initial adaptation to OpenCL).
Then you can use CPU implementations of OpenCL, e.g. from Intel or AMD. Both of them would automatically use multithreading. Also, they are likely to automatically vectorize your loop (e.g. see this article). Finally, there is a chance that they would find a good implementation of rsqrt automatically, if you use native_rsqrt function or something like that.
Also, you would be able to run your code on GPU. If you use single precision, it may result in significant speedup. If you use double precision, then it is not so clear: modern consumer GPUs are often slow with double precision, because they lack the necessary hardware.
Minor optimisations:
(d * d * d) is effectively calculated twice (once for d^3 and again as part of d^5). Store d*d and reuse it for d^3 and d^5.
Replace 2 * x with x << 1 (valid for the integer index arithmetic such as the 2*j indices).
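A minimal sketch of the first point, in terms of the original loop's variables:

double d2 = dx*dx + dy*dy;
double d  = sqrt(d2);
double d3 = d2 * d;    /* d^3 */
double d5 = d3 * d2;   /* d^5 */
V = (D / d3) * (dS[0]*spin[2*j-2] + dS[1]*spin[2*j-1]);
G = -3 * (D / d5) * E * F;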
Related
Recently, an interest in computer architecture and performance has been sparked in me. With that said, I have been picking up an "easier" assembly language to really learn how stuff works under the hood, namely MIPS assembly. I feel comfortable enough to experiment with some more advanced stuff, and as such I have decided to combine programming with my interest in mathematics.
My goal is simple: given a 24x24 matrix A (I don't care about any other size), I want to write an algorithm that finds the upper triangular form of the matrix as efficiently as possible. By efficiently I mean that I want to end up using the resources of the processor I am running on as well as I can: high cache hit rate, efficient usage of memory (locality of reference, etc.), and performance in terms of the time it takes to run the solution.
Eventually my goal is to translate the C solution to MIPS assembly and tailor it to the memory subsystem of the processor I will be running my algorithm on. Regarding the processor, I will have different options to play with for caches, write buffers and memory: different cache sizes, block sizes, associativity levels, memory access times, etc. Performance in this case will be measured as the time it takes to triangularize a 24x24 matrix.
To begin, I need to write some high-level code and solve the problem there before diving into MIPS assembly. I have looked around and eventually came up with this seemingly standard solution. It isn't necessarily super fast, nor do I think it is optimal for triangularizing 24x24 matrices. Can I do better?
void triangularize(float **A, int N)
{
    int i, j, k;
    // Loop over the diagonal elements
    for (k = 0; k < N; k++)
    {
        // Loop over all the elements in the pivot row and right of the pivot ELEMENT
        for (j = k + 1; j < N; j++)
        {
            // divide by the pivot element
            A[k][j] = A[k][j] / A[k][k];
        }
        // Set the pivot element
        A[k][k] = 1.0;
        // Loop over all the elements below the pivot ROW and right of the pivot COLUMN
        for (i = k + 1; i < N; i++)
        {
            for (j = k + 1; j < N; j++)
            {
                A[i][j] = A[i][j] - A[i][k] * A[k][j];
            }
            A[i][k] = 0.0;
        }
    }
}
Furthermore, what should be my next steps when trying to convert the C code to MIPS assembly with respect to maximizing performance and minimizing cost (cache hit rates, IO costs when dealing with memory etc.) to get a lightning fast and efficient solution?
First of all, encoding a matrix as a jagged array (i.e. float**) is generally not efficient, as it causes unnecessary, expensive indirections, and the array may not be contiguous in memory, resulting in more cache misses or even cache thrashing in pathological cases. It is certainly better to copy the matrix into a contiguous flattened array. Please consider storing your matrices as flattened arrays, which are generally more efficient (especially on MIPS). A flattened array can be indexed using something like array[i*24+j] instead of array[i][j].
Moreover, if you do not care about matrices other than 24x24 ones, then you can write specialized code for 24x24 matrices. This helps compilers generate more efficient assembly code (typically by unrolling loops and using cheaper instructions such as multiplication by a constant).
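A sketch of both suggestions combined: a flattened, contiguous array plus a fixed 24x24 size (triangularize24 is an illustrative name, not from the original code):

#define N 24

/* A points to 24*24 contiguous floats, indexed as A[i*N + j] */
void triangularize24(float *A)
{
    for (int k = 0; k < N; k++)
    {
        /* divide the pivot row (right of the pivot) by the pivot element */
        for (int j = k + 1; j < N; j++)
            A[k*N + j] = A[k*N + j] / A[k*N + k];
        A[k*N + k] = 1.0f;

        /* eliminate the entries below the pivot */
        for (int i = k + 1; i < N; i++)
        {
            for (int j = k + 1; j < N; j++)
                A[i*N + j] = A[i*N + j] - A[i*N + k] * A[k*N + j];
            A[i*N + k] = 0.0f;
        }
    }
}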
Additionally, divisions are generally expensive, especially on embedded MIPS processors. Thus, you can replace divisions by multiplications with the inverse. For example:
float inv = 1.0f / A[k][k];
for (j = k + 1; j < N; j++)
A[k][j] *= inv;
Note that the result might be slightly different due to floating-point rounding. You can use the -ffast-math compiler flag to help the compiler generate such optimisations itself, if you know that special values like NaN or Inf do not appear in the matrix.
Moreover, it may be faster to unroll the loop manually, since not all compilers do that (properly). That being said, the benefit of loop unrolling is very dependent on the target processor (unspecified here). Without more information, it is very hard to know whether this is useful. For example, some processors can execute multiple floating-point operations per cycle, while others cannot even do them natively (i.e. no hardware FP unit): the operations are emulated with many instructions, which is very expensive (compilers like GCC emit function calls for basic operations like addition/subtraction on such processors). If there is no hardware FP unit, then it might be faster to use fixed-point arithmetic.
Finally, some MIPS processors have a 128-bit SIMD unit. Using it should significantly speed up the execution. Compilers should be able to mostly auto-vectorize your code, but you need to tell them whether your target processor supports it (see the -march flag for GCC/Clang). For a fixed-size matrix, manual vectorization often results in faster execution than auto-vectorization, assuming you write efficient code.
I am working on a project which incorporates computing a sine wave as input for a control loop.
The sine wave has a frequency of 280 Hz, and the control loop runs every 30 µs and everything is written in C for an Arm Cortex-M7.
At the moment we are simply doing:
double time;

void control_loop() {
    time += 30e-6;
    double sine = sin(2 * M_PI * 280 * time);
    ...
}
Two problems/questions arise:
When running for a long time, time becomes bigger. Suddenly there is a point where the computation time for the sine function increases drastically (see image). Why is this? How are these functions usually implemented? Is there a way to circumvent this (without noticeable precision loss) as speed is a huge factor for us? We are using sin from math.h (Arm GCC).
How can I deal with time in general? When running for a long time, the variable time will inevitably reach the limits of double precision. Even using a counter time = counter++ * 30e-6; only improves this, but it does not solve it. As I am certainly not the first person who wants to generate a sine wave for a long time, there must be some ideas/papers/... on how to implement this fast and precise.
Instead of calculating sine as a function of time, maintain a sine/cosine pair and advance it through complex number multiplication. This doesn't require any trigonometric functions or lookup tables; only four multiplies and an occasional re-normalization:
static const double a = 2 * M_PI * 280 * 30e-6;
static const double dx = cos(a);  // in C, initialize dx and dy once at startup instead:
static const double dy = sin(a);  // cos()/sin() are not constant expressions at file scope
double x = 1, y = 0; // complex x + iy
int counter = 0;
void control_loop() {
    double xx = dx*x - dy*y;
    double yy = dx*y + dy*x;
    x = xx, y = yy;

    // renormalize once in a while, based on
    // https://www.gamedev.net/forums/topic.asp?topic_id=278849
    if((counter++ & 0xff) == 0) {
        double d = 1 - (x*x + y*y - 1)/2;
        x *= d, y *= d;
    }

    double sine = y; // this is your sine
}
The frequency can be adjusted, if needed, by recomputing dx, dy.
Additionally, all the operations here can be done, rather easily, in fixed point.
Rationality
As @user3386109 points out below (+1), 280 * 30e-6 = 21 / 2500 is a rational number, thus the sine should loop around after exactly 2500 samples. We can combine this method with theirs by resetting our generator (x=1, y=0) every 2500 iterations (or 5000, or 10000, etc.). This would eliminate the need for renormalization, as well as get rid of any long-term phase inaccuracies.
(Technically any floating-point number is a dyadic rational. However, 280 * 30e-6 doesn't have an exact representation in binary. Yet, by resetting the generator as suggested, we'll get an exactly periodic sine as intended.)
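A minimal sketch of the combined scheme, assuming dx/dy are initialized once at startup (sine_init is an illustrative helper name):

#include <math.h>

static double dx, dy;              /* rotation step: cos(a), sin(a) */
static double x = 1.0, y = 0.0;    /* current (cos, sin) pair */
static int counter = 0;

void sine_init(void) {
    const double a = 2 * M_PI * 280 * 30e-6;
    dx = cos(a);
    dy = sin(a);
}

void control_loop(void) {
    double xx = dx*x - dy*y;
    double yy = dx*y + dy*x;
    x = xx; y = yy;
    if (++counter == 2500) {       /* 2500 samples == exactly 21 cycles at 280 Hz */
        counter = 0;
        x = 1.0;                   /* resetting removes drift and long-term phase error, */
        y = 0.0;                   /* so no renormalization step is needed */
    }
    double sine = y;               /* this is your sine */
    (void)sine;
}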
Explanation
Some requested an explanation down in the comments of why this works. The simplest explanation is to use the angle sum trigonometric identities:
xx = cos((n+1)*a) = cos(n*a)*cos(a) - sin(n*a)*sin(a) = x*dx - y*dy
yy = sin((n+1)*a) = sin(n*a)*cos(a) + cos(n*a)*sin(a) = y*dx + x*dy
and the correctness follows by induction.
This is essentially De Moivre's formula, if we view those sine/cosine pairs as complex numbers, in accordance with Euler's formula.
A more insightful way might be to look at it geometrically. Complex multiplication by exp(ia) is equivalent to rotation by a radians. Therefore, by repeatedly multiplying by dx + idy = exp(ia), we incrementally rotate our starting point 1 + 0i along the unit circle. The y coordinate, according to Euler's formula again, is the sine of the current phase.
Normalization
While the phase continues to advance with each iteration, the magnitude (aka norm) of x + iy drifts away from 1 due to round-off errors. However, we're interested in generating a sine of amplitude 1, thus we need to normalize x + iy to compensate for numeric drift. The straightforward way is, of course, to divide it by its own norm:
double d = 1/sqrt(x*x + y*y);
x *= d, y *= d;
This requires a calculation of a reciprocal square root. Even though we normalize only once every X iterations, it'd still be cool to avoid it. Fortunately |x + iy| is already close to 1, thus we only need a slight correction to keep it at bay. Expanding the expression for d around 1 (first order Taylor approximation), we get the formula that's in the code:
d = 1 - (x*x + y*y - 1)/2
TODO: to fully understand the validity of this approximation one needs to prove that it compensates for round-off errors faster than they accumulate -- and thus get a bound on how often it needs to be applied.
The function can be rewritten as
double n;

void control_loop() {
    n += 1;
    double sine = sin(2 * M_PI * 280 * 30e-6 * n);
    ...
}
That does exactly the same thing as the code in the question, with exactly the same problems. But it can now be simplified:
280 * 30e-6 = 280 * 30 / 1000000 = 21 / 2500 = 8.4e-3
Which means that when n reaches 2500, you've output exactly 21 cycles of the sine wave. Which means that you can set n back to 0.
The resulting code is:
int n;

void control_loop() {
    n += 1;
    if (n == 2500)
        n = 0;
    double sine = sin(2 * M_PI * 8.4e-3 * n);
    ...
}
As long as your code can run for 21 cycles without problems, it'll run forever without problems.
I'm rather shocked at the existing answers. The first problem you detect is easily solved, and the next problem magically disappears when you solve the first problem.
You need a basic understanding of math to see how it works. Recall that sin(x + 2*pi) is just sin(x), mathematically. The large increase in time you see happens when your sin() implementation switches to another (slower) argument-reduction algorithm for large arguments, and you really want to avoid that.
Remember that float has only about 6 significant digits. In 100000.0f*M_PI + x, those 6 digits are all used up by 100000.0f*M_PI, so there is nothing left for x.
So, the easiest solution is to keep track of x yourself. At t=0 you initialize x to 0.0. Every 30 µs, you increment x += 2 * M_PI * 280 * 30e-6; (the phase advance per step). The time does not appear in this formula! Finally, if x > 2*M_PI, you decrement x -= 2*M_PI; (since sin(x) == sin(x - 2*pi)).
You now have an x that stays nicely in the range 0 to about 6.2832 (2*pi), where sin is fast and the 6 digits of precision are all useful.
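A minimal sketch of this phase-accumulator approach, in terms of the question's 280 Hz / 30 µs numbers:

#include <math.h>

static double x = 0.0;                 /* current phase, kept in [0, 2*pi) */

void control_loop(void) {
    x += 2 * M_PI * 280 * 30e-6;       /* phase advance per 30 us step */
    if (x > 2 * M_PI)
        x -= 2 * M_PI;                 /* sin(x) == sin(x - 2*pi) */
    double sine = sin(x);
    (void)sine;
    /* ... */
}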
How to generate a lovely sine:
The DAC is 12 bits, so you have only 4096 levels. It makes no sense to send more than 4096 samples per period. In real life you will need far fewer samples to generate a good quality waveform.
Create a C file with the lookup table (using your PC). Redirect the generator's output to a file (https://helpdeskgeek.com/how-to/redirect-output-from-command-line-to-text-file/).
#include <stdio.h>
#include <math.h>

#define STEP ((2*M_PI) / 4096.0)

int main(void)
{
    double alpha = 0;
    printf("#include <stdint.h>\nconst uint16_t sine[4096] = {\n");
    for(int x = 0; x < 4096 / 16; x++)
    {
        for(int y = 0; y < 16; y++)
        {
            printf("%d, ", (int)(4095 * (sin(alpha) + 1.0) / 2.0));
            alpha += STEP;
        }
        printf("\n");
    }
    printf("};\n");
}
https://godbolt.org/z/e899d98oW
Configure the timer to overflow 4096*280 = 1146880 times per second. Set the timer to generate the DAC trigger event. For a 180 MHz timer clock this will not be exact, and the frequency will be 279.906449045 Hz. If you need better precision, change the number of samples to match your timer frequency and/or change the timer clock frequency (H7 timers can run at up to 480 MHz).
Configure DAC to use DMA and transfer the value from the lookup table created in the step 1 to the DAC on the trigger event.
Enjoy the beautiful sine wave on your oscilloscope. Note that your microcontroller core will not be loaded at all; you will have it free for other tasks. If you want to change the period, simply reconfigure the timer. You can do this as many times per second as you wish. To reconfigure the timer, use the timer DMA burst mode, which will reload the PSC & ARR registers automatically on the update event without disturbing the generated waveform.
I know it is advanced STM32 programming and it will require register level programming. I use it to generate complex waveforms in our devices.
It is the correct way of doing it. No control loops, no calculations, no core load.
I'd like to address the embedded programming issues in your code directly - #0___________'s answer is the correct way to do this on a microcontroller and I won't retread the same ground.
Variables representing time should never be floating point. If your increment is not a power of two, errors will always accumulate. Even if it is, eventually your increment will become smaller than the smallest representable step of the growing value, and the timer will stop advancing. Always use integers for time. You can pick an integer size big enough to ignore rollover: an unsigned 32-bit integer representing milliseconds will take 50 days to roll over, while an unsigned 64-bit integer will take over 500 million years.
Generating any periodic signal where you do not care about the signal's phase does not require a time variable. Instead, you can keep an internal counter which resets to 0 at the end of a period. (When you use DMA with a look-up table, that's exactly what you're doing - the counter is the DMA controller's next-read pointer.)
Whenever you use a transcendental function such as sine in a microcontroller, your first thought should be "can I use a look-up table for this?" You don't have access to the luxury of a modern operating system optimally shuffling your load around on a 4 GHz+ multi-core processor. You're often dealing with a single thread that will stall waiting for your 200 MHz microcontroller to bring the FPU out of standby and perform the approximation algorithm. There is a significant cost to transcendental functions. There's a cost to LUTs too, but if you're hitting the function constantly, there's a good chance you'll like the tradeoffs of the LUT a lot better.
As noted in some of the comments, the time value is continually growing with time. This poses two problems:
The sin function likely has to perform a modulus internally to get the internal value into a supported range.
The resolution of time will become worse and worse as the value increases, due to adding on higher digits.
Making the following changes should improve the performance:
double time;

void control_loop() {
    time += 30.0e-6;
    if((1.0/280.0) < time)
    {
        time -= 1.0/280.0;
    }
    double sine = sin(2 * M_PI * 280 * time);
    ...
}
Note that once this change is made, the variable no longer represents absolute time; it only tracks the position within the current 1/280 s period.
Use a look-up table. Your comment in the discussion with Eugene Sh.:
A small deviation from the sine frequency (like 280.1Hz) would be ok.
In that case, with a control interval of 30 µs, if you have a table of 119 samples that you repeat over and over, you will get a sine wave of 280.112 Hz. Since you have a 12-bit DAC, you only need 119 * 2 = 238 bytes to store this if you would output it directly to the DAC. If you use it as input for further calculations like you mention in the comments, you can store it as float or double as desired. On an MCU with embedded static RAM, it only takes a few cycles at most to load from memory.
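A minimal sketch of such a 119-sample table (names are illustrative; stored as float here, but it could just as well be uint16_t codes for direct DAC output):

#include <math.h>

#define NSAMP 119                       /* 1 / (119 * 30e-6 s) ≈ 280.112 Hz */

static float sine_table[NSAMP];
static int sample_idx = 0;

void table_init(void) {
    for (int i = 0; i < NSAMP; i++)
        sine_table[i] = (float)sin(2.0 * M_PI * i / NSAMP);
}

float next_sine_sample(void) {          /* call once per 30 us control step */
    float s = sine_table[sample_idx];
    sample_idx = (sample_idx + 1) % NSAMP;
    return s;
}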
If you have a few kilobytes of memory available, you can eliminate this problem completely with a lookup table.
With a sampling period of 30 µs, 2500 samples will have a total duration of 75 ms. This is exactly equal to the duration of 21 cycles at 280 Hz.
I haven't tested or compiled the following code, but it should at least demonstrate the approach:
#include <math.h>    /* sin */
#include <stdlib.h>  /* malloc */

double sin2500() {
    static double *table = NULL;
    static int n = 2499;
    if (!table) {
        table = malloc(2500 * sizeof(double));
        for (int i = 0; i < 2500; i++)
            table[i] = sin(2 * M_PI * 280 * i * 30e-06);
    }
    n = (n + 1) % 2500;
    return table[n];
}
How about a variant of others' modulo-based concept:
int t = 0;
int divisor = 1000000;

void control_loop() {
    t += 30 * 280;
    if (t > divisor) t -= divisor;
    double sine = sin(2 * M_PI * t / (double)divisor);
    ...
}
It calculates the modulo in integer arithmetic, so it causes no roundoff errors.
There is an alternative approach to calculating a series of values of sine (and cosine) for angles that increase by some very small amount. It essentially boils down to calculating the X and Y coordinates of a circle, and then dividing the Y value by some constant to produce the sine, and dividing the X value by the same constant to produce the cosine.
If you are content to generate a "very round ellipse", you can use the following hack, which is attributed to Marvin Minsky in the 1960s. It's much faster than calculating sines and cosines, although it introduces a very small error into the series. Here is an extract from the HAKMEM document, Item 149, where the Minsky circle algorithm is outlined.
ITEM 149 (Minsky): CIRCLE ALGORITHM
Here is an elegant way to draw almost circles on a point-plotting display:
NEW X = OLD X - epsilon * OLD Y
NEW Y = OLD Y + epsilon * NEW(!) X
This makes a very round ellipse centered at the origin with its size determined by the initial point. epsilon determines the angular velocity of the circulating point, and slightly affects the eccentricity. If epsilon is a power of 2, then we don't even need multiplication, let alone square roots, sines, and cosines! The "circle" will be perfectly stable because the points soon become periodic.
The circle algorithm was invented by mistake when I tried to save one register in a display hack! Ben Gurley had an amazing display hack using only about six or seven instructions, and it was a great wonder. But it was basically line-oriented. It occurred to me that it would be exciting to have curves, and I was trying to get a curve display hack with minimal instructions.
Here is a link to the hakmem: http://inwap.com/pdp10/hbaker/hakmem/hacks.html
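A hedged C sketch of that recurrence used as a sine oscillator; epsilon here is taken as the per-step angle for this thread's 280 Hz / 30 µs case (about 2*pi*280*30e-6 ≈ 0.0528), and the amplitude is set by the starting point:

static double cx = 1.0, cy = 0.0;        /* starting point on the "circle" */
static const double epsilon = 0.0528;    /* small step => nearly circular */

double minsky_step(void) {
    cx = cx - epsilon * cy;
    cy = cy + epsilon * cx;              /* note: uses the NEW cx, as in the item */
    return cy;                           /* approximately sin(n * epsilon) */
}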
I think it would be possible to use a modulo, because sin() is periodic. Then you don't have to worry about the precision problems.
double time = 0;
long unsigned int timesteps = 0;
double sine;

void control_loop()
{
    timesteps++;
    time += 30e-6;
    if( time > 1 )
    {
        time -= 1;
    }
    sine = sin( 2 * M_PI * 280 * time );
    ...
}
Fascinating thread. Minsky's algorithm mentioned in Walter Mitty's answer reminded me of a method for drawing circles that was published in Electronics & Wireless World and that I kept. (Credit: https://www.electronicsworld.co.uk/magazines/). I'm attaching it here for interest.
However, for my own similar projects (for audio synthesis) I use a lookup table, with enough points that linear interpolation is accurate enough (do the math(s)!)
I have to raise 10 to the power of a double a lot of times.
Is there a more efficient way to do this than with the math library pow(10,double)? If it matters, my doubles are always negative between -5 and -11.
I assume pow(double,double) uses a more general algorithm than is required for pow(10,double) and might therefore not be the fastest method. Given some of the answers below, that might have been an incorrect assumption.
As for the why, it is for logarithmic interpolation.
I have a table of x and y values.
My object has a known x value (which is almost always a double).
double Dbeta(struct Data *diffusion, double per){
    double frac;
    int i = 1;   /* i was not declared in the original snippet; starting at 1 since x[i-1] is accessed */
    while(per > diffusion->x[i]){
        i++;
    }
    frac = (per - diffusion->x[i-1]) / (diffusion->x[i] - diffusion->x[i-1]);
    return pow(10, log10DB[i-1] + frac * (log10DB[i] - log10DB[i-1]));
}
This function is called a lot of times.
I have been told to look into profiling, so that is what I will do first.
I have just been told I could have used natural logarithms instead of base 10, which is obviously right. (my stupidity sometimes amazes even myself.)
After replacing everything with natural logarithms, everything runs a bit faster. With profiling (which is a new word I learned today) I found out that 39% of my code is spent in the exp function, so for those who wondered whether it was in fact this part that was bottlenecking my code: it was.
For pow(10.0, n) it should be faster to set c = log(10.0), which you can compute once, then use exp(c*n), which should be significantly faster than pow(10.0, n) (which is basically doing that same thing internally, except it would be calculating log(10.0) over and over instead of just once). Beyond that, there probably isn't much else you can do.
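A minimal sketch of that suggestion (pow10_fast is an illustrative name):

#include <math.h>

static const double LN10 = 2.302585092994046;   /* log(10.0), computed once */

double pow10_fast(double n) {
    return exp(LN10 * n);                        /* 10^n == e^(n * ln 10) */
}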
Yes, the pow function is slow (roughly 50x the cost of a multiply, for those asking for benchmarks).
By some log/exponents trickery, we can express 10^x as
10^x = exp(log(10^x)) = exp(x * log(10)).
So you can implement 10^x with exp(x * M_LN10), which should be more efficient than pow.
If double accuracy isn't critical, use the float version of the function expf (or powf), which should be more efficient than the double version.
If rough accuracy is OK, precompute a table over the [-11, -5] range and do a quick lookup with linear interpolation.
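A rough-accuracy sketch of that table-plus-interpolation idea for the asker's [-11, -5] range (table size and names are illustrative; accuracy depends on how fine you make the table):

#include <math.h>

#define TBL_N   1024
#define TBL_LO  (-11.0)
#define TBL_HI  (-5.0)

static double tbl[TBL_N + 1];

void pow10_table_init(void) {
    for (int i = 0; i <= TBL_N; i++) {
        double x = TBL_LO + (TBL_HI - TBL_LO) * i / TBL_N;
        tbl[i] = pow(10.0, x);
    }
}

double pow10_lut(double x) {           /* assumes TBL_LO <= x <= TBL_HI */
    double t = (x - TBL_LO) * (TBL_N / (TBL_HI - TBL_LO));
    int i = (int)t;
    if (i < 0) i = 0;                  /* clamp so tbl[i+1] stays in range */
    if (i >= TBL_N) i = TBL_N - 1;
    double f = t - i;
    return tbl[i] + f * (tbl[i + 1] - tbl[i]);   /* linear interpolation */
}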
Some benchmarks (using glibc 2.31):
Benchmark                    Time
---------------------------------
pow(10, x)                   15.54 ns
powf(10, x)                   7.18 ns
expf(x * (float)M_LN10)       3.45 ns
I want to rewrite this simple routine in SSE2 code (preferably in NASM), and I am not totally sure how to do it. Two things are not clear: how to express the calculations (the inner loop and those from the outer loop too), and how to call the C function "SetPixelInDibInt(i, j, palette[n]);" from statically linked asm code.
void DrawMandelbrotD(double ox, double oy, double lx, int N_ITER)
{
    double ly = lx * (double)CLIENT_Y / (double)CLIENT_X;
    double dx = lx / CLIENT_X;
    double dy = ly / CLIENT_Y;
    double ax = ox - lx * 0.5 + dx * 0.5;
    double ay = oy - ly * 0.5 + dy * 0.5;
    static double re, im, c_re, c_im, rere, imim;
    static int n;

    for(int j = 0; j < CLIENT_Y; j += 1)
    {
        for(int i = 0; i < CLIENT_X; i += 1)
        {
            c_re = ax + i * dx;
            c_im = ay + j * dy;
            re = c_re;
            im = c_im;
            rere = re*re;
            imim = im*im;
            n = 1;
            for(int k = 0; k < N_ITER; k++)
            {
                im = (re+re)*im + c_im;
                re = rere - imim + c_re;
                rere = re*re;
                imim = im*im;
                if ( (rere + imim) > 4.0 ) break;
                n++;
            }
            SetPixelInDibInt(i, j, palette[n]);
        }
    }
}
Could someone help? I would prefer not to see other code implementations, just a NASM/SSE translation of the above. That would be most helpful in my case.
Intel has a complete implementation as an AVX example. See below.
What makes Mandelbrot tricky is that the early-out condition for each point in the set (i.e. pixel) is different. You could keep a pair or quad of pixels iterating until the magnitudes of all of them exceed 2.0 (or you hit max iterations). To do otherwise would require tracking which pixel's points were in which vector element.
Anyway, a simplistic implementation to operate on a vector of 2 (or 4 with AVX) doubles at a time would have its throughput limited by the latency of the dependency chains. You'd need to do multiple dependency chains in parallel to keep both of Haswell's FMA units fed. So you'd duplicate your variables, and interleave operations for two iterations of the outer loop inside the inner loop.
Keeping track of which pixels are being calculated would be a little tricky. I think it might take less overhead to use one set of registers for one row of pixels, and another set of registers for another row. (So you can always just move 4 pixels to the right, rather than checking whether the other dep chain is already processing that vector.)
I suspect that only checking the loop exit condition every 4 iterations or so might be a win. Getting code to branch based on a packed vector comparison is slightly more expensive than in the scalar case. The extra FP add required is also expensive. (Haswell can do two FMAs per cycle (latency = 5). The lone FP add unit is on the same port as one of the FMA units. The two FP mul units are on the same ports that can run FMA.)
The loop condition can be checked with a packed-compare to generate a mask of zeros and ones, and a (V)PTEST of that register with itself to see if it's all zero. (edit: movmskps then test+jcc is fewer uops, but maybe higher latency.) Then obviously je or jne as appropriate, depending on whether you did a FP compare that leaves zeros when you should exit, or zeros when you shouldn't. NAN shouldn't be possible, but there's no reason not to choose your comparison op such that a NAN will result in the exit condition being true.
const __m256d const_four = _mm256_set1_pd(4.0); // outside the loop
__m256d cmp_result = _mm256_cmp_pd(mag_squared, const_four, _CMP_LE_OQ); // vcmppd: all-ones in every element that is still <= 4.0
if (_mm256_testz_si256(_mm256_castpd_si256(cmp_result), _mm256_castpd_si256(cmp_result)))
    break;
There MIGHT be some way to use PTEST directly on a packed-double, with some bit-hack AND-mask that will pick out bits that will be set iff the FP value is > 4.0. Like maybe some bits in the exponent? Maybe worth considering. I found a forum post about it, but didn't try it out.
Hmm, oh crap, this doesn't record WHEN the loop condition failed, for each vector element separately, for the purpose of coloring the points outside the Mandelbrot set. Maybe test for any element hitting the condition (instead of all), record the result, and then set that element (and c for that element) to 0.0 so it won't trigger the exit condition again. Or maybe scheduling pixels into vector elements is the way to go after all. This code might do fairly well on a hyperthreaded CPU, since there will be a lot of branch mispredicts with every element separately triggering the early-out condition.
That might waste a lot of your throughput, and given that 4 uops per cycle is doable, but only 2 of them can be FP mul/add/FMA, there's room for a significant amount of integer code to schedule points into vector elements. (On Sandybridge/Ivybridge, without FMA, FP throughput is lower. But there are only 3 ports that can handle integer ops, and 2 of those are the ports for the FP mul and FP add units.)
Since you don't have to read any source data, there's only 1 memory access stream for each dep chain, and it's a write stream. (And it's low bandwidth, since most points take a lot of iterations before you're ready to write a single pixel value.) So the number of hardware prefetch streams isn't a limiting factor for the number of dep chains to run in parallel. Cache misses latency should be hidden by write buffers.
I can write some code if anyone's still interested in this (just post a comment). I stopped at the high-level design stage since this is an old question, though.
==============
I also found that Intel already used the Mandelbrot set as an example for one of their AVX tutorials. They use the mask-off-vector-elements method for the loop condition. (using the mask generated directly by vcmpps to AND). Their results indicate that AVX (single-precision) gave a 7x speedup over scalar float, so apparently it's not common for neighbouring pixels to hit the early-out condition at very different numbers of iterations. (at least for the zoom / pan they tested with.)
They just let the FP results keep accumulating for elements that fail the early-out condition. They just stop incrementing the counter for that element. Hopefully most systems default to having the control word set to zero out denormals, if denormals still take extra cycles.
Their code is silly in one way, though: They track the iteration count for each vector element with a floating-point vector, and then convert it to int at the end before use. It'd be faster, and not occupy an FP execution unit, to use packed-integers for that. Oh, I know why they do that: AVX (without AVX2) doesn't support 256bit integer vector ops. They could have used packed 16bit int loop counters, but that could overflow. (And they'd have to compress the mask down from 256b to 128b).
They also test for all elements being > 4.0 with movmskps and then test that, instead of using ptest. I guess the test / jcc can macro-fuse, and run on a different execution unit than FP vector ops, so it's maybe not even slower. Oh, and of course AVX (without AVX2) doesn't have 256bit PTEST. Also, PTEST is 2 uops, so actually movmskps + test / jcc is fewer uops than ptest + jcc. (PTEST is 1 fused-domain uop on SnB, but still 2 unfused uops for the execution ports. On IvB/HSW, 2 uops even in the fused domain.) So it looks like movmskps is the optimal way, unless you can take advantage of the bitwise AND that's part of PTEST, or need to test more than just the high bit of each element. If a branch is unpredictable, ptest might be lower latency, and thus be worth it by catching mispredicts a cycle sooner.
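For reference, the movmsk-flavoured exit check on the cmp_result from the earlier double-precision snippet might look like this (a sketch; whether it beats ptest depends on the surrounding code):

int mask = _mm256_movemask_pd(cmp_result);   /* one bit per element, taken from the sign bit */
if (mask == 0)                               /* no element still has mag_squared <= 4.0 */
    break;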
For one of my course projects I started implementing a "naive Bayesian classifier" in C. My project is to implement a document classifier application (especially for spam) using huge training data.
Now I have a problem implementing the algorithm because of the limitations of C's floating-point datatypes.
(The algorithm I am using is given here: http://en.wikipedia.org/wiki/Bayesian_spam_filtering)
PROBLEM STATEMENT:
The algorithm involves taking each word in a document and calculating the probability of it being a spam word. If p1, p2, p3, ..., pn are the probabilities of words 1, 2, 3, ..., n, the probability of the doc being spam is calculated using
p = (p1 * p2 * ... * pn) / (p1 * p2 * ... * pn + (1-p1) * (1-p2) * ... * (1-pn))
Here, each probability value can very easily be around 0.01, so even if I use the datatype "double" my calculation will go for a toss. To confirm this I wrote the sample code given below.
#define PROBABILITY_OF_UNLIKELY_SPAM_WORD (0.01)
#define PROBABILITY_OF_MOSTLY_SPAM_WORD (0.99)
int main()
{
    int index;
    long double numerator = 1.0;
    long double denom1 = 1.0, denom2 = 1.0;
    long double doc_spam_prob;

    /* Simulating FEW unlikely spam words */
    for(index = 0; index < 162; index++)
    {
        numerator = numerator * (long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom2 = denom2 * (long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom1 = denom1 * (long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD);
    }

    /* Simulating lot of mostly definite spam words */
    for (index = 0; index < 1000; index++)
    {
        numerator = numerator * (long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom2 = denom2 * (long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom1 = denom1 * (long double)(1 - PROBABILITY_OF_MOSTLY_SPAM_WORD);
    }

    doc_spam_prob = (numerator / (denom1 + denom2));
    return 0;
}
I tried float, double and even long double datatypes, but the problem remains.
Hence, say in a 100K-word document I am analyzing, if just 162 words have a 1% spam probability and the remaining 99,838 are conspicuously spam words, my app will still classify it as Not Spam because of precision error (the numerator easily goes to ZERO)!
This is the first time I am hitting such issue. So how exactly should this problem be tackled?
This happens often in machine learning. AFAIK, there's nothing you can do about the loss in precision. So to bypass this, we use the log function and convert divisions and multiplications to subtractions and additions, resp.
So I decided to do the math.
The original equation is:
p = (p_1 * p_2 * ... * p_n) / (p_1 * p_2 * ... * p_n + (1-p_1) * (1-p_2) * ... * (1-p_n))
I slightly modify it:
1/p - 1 = ((1-p_1) * (1-p_2) * ... * (1-p_n)) / (p_1 * p_2 * ... * p_n)
Taking logs on both sides:
ln(1/p - 1) = sum over i of [ ln(1-p_i) - ln(p_i) ]
Let
eta = sum over i of [ ln(1-p_i) - ln(p_i) ]
Substituting,
1/p - 1 = e^eta
Hence the alternate formula for computing the combined probability:
p = 1 / (1 + e^eta)
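A minimal C sketch of that log-domain formula (function name is illustrative); the products become sums, so nothing underflows:

#include <math.h>

double combined_spam_prob(const double *p, int n) {
    double eta = 0.0;
    for (int i = 0; i < n; i++)
        eta += log(1.0 - p[i]) - log(p[i]);
    return 1.0 / (1.0 + exp(eta));   /* exp overflow just drives p toward 0, which is fine */
}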
If you need me to expand on this, please leave a comment.
Here's a trick:
for the sake of readability, let S := p_1 * ... * p_n and H := (1-p_1) * ... * (1-p_n),
then we have:
p = S / (S + H)
p = 1 / ((S + H) / S)
p = 1 / (1 + H / S)
let's expand again:
p = 1 / (1 + ((1-p_1) * ... * (1-p_n)) / (p_1 * ... * p_n))
p = 1 / (1 + (1-p_1)/p_1 * ... * (1-p_n)/p_n)
So basically, you will obtain a product of quite large numbers (between 0 and, for p_i = 0.01, 99). The idea is not to multiply tons of small numbers with one another (and obtain, well, 0), but to form a quotient of two small numbers. For example, if n = 1000000 and p_i = 0.5 for all i, the above method will give you 0/(0+0), which is NaN, whereas the proposed method will give you 1/(1+1*...*1), which is 0.5.
You can get even better results, when all p_i are sorted and you pair them up in opposed order (let's assume p_1 < ... < p_n), then the following formula will get even better precision:
p = 1 / (1 + (1-p_1)/p_n * ... * (1-p_n)/p_1)
that way you divide big numerators (small p_i) by big denominators (big p_(n+1-i)), and small numerators by small denominators.
edit: MSalter proposed a useful further optimization in his answer. Using it, the formula reads as follows:
p = 1 / (1 + (1-p_1)/p_n * (1-p_2)/p_(n-1) * ... * (1-p_(n-1))/p_2 * (1-p_n)/p_1)
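A minimal sketch of that paired-ratio formula (illustrative function name; assumes p[] is sorted ascending so each small p_i is paired with a large p_(n-1-i)):

double combined_prob_paired(const double *p, int n) {
    double prod = 1.0;
    for (int i = 0; i < n; i++)
        prod *= (1.0 - p[i]) / p[n - 1 - i];   /* big numerator over big denominator, etc. */
    return 1.0 / (1.0 + prod);                 /* p = 1 / (1 + H/S) */
}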
Your problem is caused because you are collecting too many terms without regard for their size. One solution is to take logarithms. Another is to sort your individual terms. First, let's rewrite the equation as 1/p = 1 + ∏((1-p_i)/p_i). Now your problem is that some of the terms are small, while others are big. If you have too many small terms in a row, you'll underflow, and with too many big terms you'll overflow the intermediate result.
So, don't put too many of the same order in a row. Sort the terms (1-p_i)/p_i. As a result, the first will be the smallest term, the last the biggest. Now, if you multiplied them straight away you would still have an underflow. But the order of calculation doesn't matter. Use two iterators into your temporary collection. One starts at the beginning (i.e. (1-p_0)/p_0), the other at the end (i.e. (1-p_n)/p_n), and your intermediate result starts at 1.0. Now, when your intermediate result is >= 1.0, you take a term from the front, and when your intermediate result is < 1.0 you take a term from the back.
The result is that as you take terms, the intermediate result will oscillate around 1.0. It will only go up or down as you run out of small or big terms. But that's OK. At that point, you've consumed the extremes on both ends, so the intermediate result will slowly approach the final result.
There's of course a real possibility of overflow. If the input is completely unlikely to be spam (p=1E-1000) then 1/p will overflow, because ∏((1-p_i)/p_i) overflows. But since the terms are sorted, we know that the intermediate result will overflow only if ∏((1-p_i)/p_i) overflows. So, if the intermediate result overflows, there's no subsequent loss of precision.
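A minimal sketch of this two-iterator scheme (illustrative name; terms[] holds the values (1-p_i)/p_i sorted ascending):

double combined_prob_balanced(const double *terms, int n) {
    double prod = 1.0;
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        if (prod >= 1.0)
            prod *= terms[lo++];   /* small term pulls the running product down */
        else
            prod *= terms[hi--];   /* big term pushes it back up toward 1.0 */
    }
    return 1.0 / (1.0 + prod);     /* 1/p = 1 + prod */
}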
Try computing the inverse 1/p. That gives you an equation of the form 1/p = 1 + ((1-p_1) * ... * (1-p_n)) / (p_1 * ... * p_n).
If you then count the occurrences of each probability (it looks like you have a small number of values that recur), you can use the pow() function, e.g. pow(1-p, occurrences_of_p) * pow(1-q, occurrences_of_q), and avoid individual roundoff with each multiplication.
You can express the probability in percent or per mille:
doc_spam_prob= (numerator*100/(denom1+denom2));
or
doc_spam_prob= (numerator*1000/(denom1+denom2));
or use some other coefficient
I am not strong in math so I cannot comment on possible simplifications to the formula that might eliminate or reduce your problem. However, I am familiar with the precision limitations of long double types and am aware of several arbitrary and extended precision math libraries for C. Check out:
http://www.nongnu.org/hpalib/
and
http://www.tc.umn.edu/~ringx004/mapm-main.html