Is SIMD Worth It? Is there a better option? - c

I have some code that runs fairly well, but I would like to make it run better. The major problem I have with it is that it needs to have a nested for loop. The outer one is for iterations (which must happen serially), and the inner one is for each point particle under consideration. I know there's not much I can do about the outer one, but I'm wondering if there is a way of optimizing something like:
void collide(particle particles[], box boxes[],
double boxShiftX, double boxShiftY) {/*{{{*/
int i;
double nX;
double nY;
int boxnum;
for(i=0;i<PART_COUNT;i++) {
boxnum = ((((int)(particles[i].sX+boxShiftX))/BOX_SIZE)%BWIDTH+
BWIDTH*((((int)(particles[i].sY+boxShiftY))/BOX_SIZE)%BHEIGHT));
//copied and pasted the macro which is why it's kinda odd looking
particles[i].vX -= boxes[boxnum].mX;
particles[i].vY -= boxes[boxnum].mY;
if(boxes[boxnum].rotDir == 1) {
nX = particles[i].vX*Wxx+particles[i].vY*Wxy;
nY = particles[i].vX*Wyx+particles[i].vY*Wyy;
} else { //to make it randomly pick a rot. direction
nX = particles[i].vX*Wxx-particles[i].vY*Wxy;
nY = -particles[i].vX*Wyx+particles[i].vY*Wyy;
}
particles[i].vX = nX + boxes[boxnum].mX;
particles[i].vY = nY + boxes[boxnum].mY;
}
}/*}}}*/
I've looked at SIMD, though I can't find much about it, and I'm not entirely sure that the processing required to properly extract and pack the data would be worth the gain of doing half as many instructions, since apparently only two doubles can be used at a time.
I tried breaking it up into multiple threads with shm and pthread_barrier (to synchronize the different stages, of which the above code is one), but it just made it slower.
My current code does go pretty quickly; it's on the order of one second per 10M particle*iterations, and from what I can tell from gprof, 30% of my time is spent in that function alone (5000 calls; PART_COUNT=8192 particles took 1.8 seconds). I'm not worried about small, constant time things, it's just that 512K particles * 50K iterations * 1000 experiments took more than a week last time.
I guess my question is if there is any way of dealing with these long vectors that is more efficient than just looping through them. I feel like there should be, but I can't find it.

I'm not sure how much SIMD would benefit; the inner loop is pretty small and simple, so I'd guess (just by looking) that you're probably more memory-bound than anything else. With that in mind, I'd try rewriting the main part of the loop to not touch the particles array more than needed:
const double temp_vX = particles[i].vX - boxes[boxnum].mX;
const double temp_vY = particles[i].vY - boxes[boxnum].mY;
if(boxes[boxnum].rotDir == 1)
{
nX = temp_vX*Wxx+temp_vY*Wxy;
nY = temp_vX*Wyx+temp_vY*Wyy;
}
else
{
//to make it randomly pick a rot. direction
nX = temp_vX*Wxx-temp_vY*Wxy;
nY = -temp_vX*Wyx+temp_vY*Wyy;
}
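particles[i].vX = nX + boxes[boxnum].mX;
particles[i].vY = nY + boxes[boxnum].mY;
This reads each velocity component from the particles array once and writes it once, instead of updating particles[i] in place several times per iteration. The box velocity still has to be added back at the end, as in the original; dropping that addition would change the result.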
Another potential speedup would be to use __restrict on the particle array, so that the compiler can better optimize the writes to the velocities. Also, if Wxx etc. are global variables, they may have to get reloaded each time through the loop instead of possibly stored in registers; using __restrict would help with that too.
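A minimal sketch of the restrict idea, assuming C99's restrict qualifier (GCC also accepts __restrict). The struct layouts are hypothetical, since the question does not show the definitions of particle and box. The rotation coefficients are passed as parameters rather than read from globals, and the box indices are assumed to be precomputed into a separate array:

```c
/* Hypothetical layouts: the question does not show the real structs. */
typedef struct { double sX, sY, vX, vY; } particle;
typedef struct { double mX, mY; int rotDir; } box;

/* restrict promises the compiler that the three arrays never alias,
   so a store to particles[i] cannot change boxes[] or boxnums[]. */
void rotate_velocities(particle *restrict particles,
                       const box *restrict boxes,
                       const int *restrict boxnums, /* assumed precomputed */
                       int count,
                       double wxx, double wxy, double wyx, double wyy)
{
    for (int i = 0; i < count; i++) {
        const box *b = &boxes[boxnums[i]];
        /* Read each velocity component once into a local. */
        const double vx = particles[i].vX - b->mX;
        const double vy = particles[i].vY - b->mY;
        double nx, ny;
        if (b->rotDir == 1) {
            nx = vx * wxx + vy * wxy;
            ny = vx * wyx + vy * wyy;
        } else {
            nx = vx * wxx - vy * wxy;
            ny = -vx * wyx + vy * wyy;
        }
        /* Write each component once, adding the box velocity back. */
        particles[i].vX = nx + b->mX;
        particles[i].vY = ny + b->mY;
    }
}
```

Whether this actually helps depends on the compiler and flags, so it is worth comparing the generated code or timings before and after.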
Since you're accessing the particles in order, you can try prefetching (e.g. __builtin_prefetch on GCC) a few particles ahead to reduce cache misses. Prefetching on the boxes is a bit tougher since you're accessing them in an unpredictable order; you could try something like
int nextBoxnum = ((((int)(particles[i+1].sX+boxShiftX) /// etc...
// prefetch boxes[nextBoxnum]
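The box prefetch could be sketched roughly as follows, assuming GCC or Clang (for __builtin_prefetch). The struct layouts and the BOX_SIZE, BWIDTH and BHEIGHT values are hypothetical, and only the velocity subtraction is shown in the loop body. Looking one particle ahead mirrors the fragment above; a useful prefetch distance is probably larger and has to be found by measurement:

```c
/* Hypothetical layouts and grid constants, not taken from the question. */
typedef struct { double sX, sY, vX, vY; } particle;
typedef struct { double mX, mY; int rotDir; } box;
enum { BOX_SIZE = 4, BWIDTH = 16, BHEIGHT = 16 };

/* Same index formula as in the question, wrapped in a helper. */
static int box_index(const particle *p, double shiftX, double shiftY)
{
    return (((int)(p->sX + shiftX)) / BOX_SIZE) % BWIDTH
         + BWIDTH * ((((int)(p->sY + shiftY)) / BOX_SIZE) % BHEIGHT);
}

void subtract_box_velocity(particle *particles, int count, const box *boxes,
                           double shiftX, double shiftY)
{
    if (count <= 0)
        return;
    int boxnum = box_index(&particles[0], shiftX, shiftY);
    for (int i = 0; i < count; i++) {
        int next = boxnum;
        if (i + 1 < count) {
            /* Compute the next particle's box early and hint the cache:
               0 = read access, 3 = keep in all cache levels. */
            next = box_index(&particles[i + 1], shiftX, shiftY);
            __builtin_prefetch(&boxes[next], 0, 3);
        }
        particles[i].vX -= boxes[boxnum].mX;
        particles[i].vY -= boxes[boxnum].mY;
        boxnum = next; /* reuse the index instead of recomputing it */
    }
}
```

A side benefit of this arrangement is that each box index is computed only once, even though it is needed one iteration early.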
One last one that I just noticed - if box::rotDir is always +/- 1.0, then you can eliminate the comparison and branch in the inner loop like this:
const double rot = boxes[boxnum].rotDir; // always +/- 1.0
nX = particles[i].vX*Wxx + rot*particles[i].vY*Wxy;
nY = rot*particles[i].vX*Wyx + particles[i].vY*Wyy;
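To check that the branch-free form matches the original branch, the two can be written side by side. This is a small sketch under the assumption stated above (the rotation direction is stored as exactly +1.0 or -1.0); the vec2 type and the function names are made up for the example:

```c
typedef struct { double x, y; } vec2; /* hypothetical helper type */

/* Branchy form from the question: rotDir == 1 selects one rotation sense. */
vec2 rotate_branchy(vec2 v, int rotDir,
                    double wxx, double wxy, double wyx, double wyy)
{
    vec2 n;
    if (rotDir == 1) {
        n.x = v.x * wxx + v.y * wxy;
        n.y = v.x * wyx + v.y * wyy;
    } else {
        n.x = v.x * wxx - v.y * wxy;
        n.y = -v.x * wyx + v.y * wyy;
    }
    return n;
}

/* Branch-free form: rot must be exactly +1.0 or -1.0. */
vec2 rotate_branchless(vec2 v, double rot,
                       double wxx, double wxy, double wyx, double wyy)
{
    vec2 n;
    n.x = v.x * wxx + rot * v.y * wxy;
    n.y = rot * v.x * wyx + v.y * wyy;
    return n;
}
```

If rotDir can take any value other than +1 or -1 (for example 0), the two forms no longer agree, so the stored representation would have to change first.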
Naturally, the usual caveats of profiling before and after apply. But I think all of these might help, and can be done regardless of whether or not you switch to SIMD.

Just for the record, there's also libSIMDx86!
http://simdx86.sourceforge.net/Modules.html
(On compiling you may also try: gcc -O3 -msse2 or similar).

((int)(particles[i].sX+boxShiftX))/BOX_SIZE
That's expensive if sX is an int (can't tell). Truncate boxShiftX/Y to an int before entering the loop.

Do you have sufficient profiling to tell you where the time is spent within that function?
For instance, are you sure it's not your divs and mods in the boxnum calculation where the time is being spent? Sometimes compilers fail to spot possible shift/AND alternatives, even where a human (or at least, one who knew BOX_SIZE and BWIDTH/BHEIGHT, which I don't) might be able to.
It would be a pity to spend lots of time on SIMDifying the wrong bit of the code...
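As an illustration of the shift/AND alternative, here is a sketch that assumes BOX_SIZE, BWIDTH and BHEIGHT are powers of two and that the coordinates are non-negative. The values below are hypothetical, since the question does not give the real ones:

```c
/* Hypothetical power-of-two constants; the real values are not in the question. */
enum { BOX_SHIFT = 2, BOX_SIZE = 1 << BOX_SHIFT, BWIDTH = 16, BHEIGHT = 16 };

/* Division/modulo form, as in the question (integer coordinates shown). */
int boxnum_divmod(int x, int y)
{
    return (x / BOX_SIZE) % BWIDTH + BWIDTH * ((y / BOX_SIZE) % BHEIGHT);
}

/* Shift/mask form: equal to the above only for x >= 0 and y >= 0,
   and only when all three constants are powers of two. */
int boxnum_shiftmask(int x, int y)
{
    return ((x >> BOX_SHIFT) & (BWIDTH - 1))
         + BWIDTH * ((y >> BOX_SHIFT) & (BHEIGHT - 1));
}
```

For signed operands the compiler has to preserve the rounding behaviour of / and % on negative values, which is one reason it may not reduce them to a plain shift and mask by itself.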
The other thing which might be worth looking into is if the work can be coerced into something which could work with a library like IPP, which will make well-informed decisions about how best to use the processor.

Your algorithm has too many memory, integer and branch instructions to have enough independent flops to profit from SIMD. The pipeline will be constantly stalled.
Finding a more effective way to randomize would be top of the list. Then, try to work either in float or int, but not both. Recast conditionals as arithmetic, or at least as a select operation. Only then does SIMD become a realistic proposition.

Related

C - Double type variables : same formulas, different values

EDIT
SOLVED
Solution was to use the long double versions of sin & cos: sinl & cosl.
It is my first post here, so bear with me :).
I come today here to ask for your input on a small problem that I am having with a C application at work. Basically, I am computing an Extended Kalman Filter and one of my formulas (that I store in a variable) has multiple computations of sin and cos, at least 16 in total in the same line. I want to decrease the time it takes for the computation to be done, so the idea is to compute each cos and sin separately, store them in a variable, and then replace the variables back in the formula.
So I did this:
const ComputationType sin_Roll = compute_sin((ComputationType)(Roll));
const ComputationType sin_Pitch = compute_sin((ComputationType)(Pitch));
const ComputationType cos_Pitch = compute_cos((ComputationType)(Pitch));
const ComputationType cos_Roll = compute_cos((ComputationType)(Roll));
Where ComputationType is a macro definition (renaming) of the type double. I know it looks ugly, with a lot of possibly unnecessary casts, but this code is generated in Python and was specifically designed that way.
Also, compute_cos and compute_sin are defined as such:
#define compute_sin(a) sinf(a)
#define compute_cos(a) cosf(a)
My problem is the value I get from the "optimized" formula is different from the value of the original one.
I will post the code of both and I apologise in advance because it is very ugly and hard to follow but the main points where cos and sin have been replaced can be seen. This is my task, to clean it up and optimize it, but I am doing it step by step to make sure I don't introduce a bug.
So, the new value is:
ComputationType newValue = (ComputationType)(((((((ComputationType)-1.0))*(sin_Pitch))+((DT)*((((Dg_y)+((((ComputationType)-1.0))*(Gy)))*(cos_Pitch)*(cos_Roll))+(((Gz)+((((ComputationType)-1.0))*(Dg_z)))*(cos_Pitch)*(sin_Roll)))))*(cos_Pitch)*(cos_Roll))+((((DT)*((((Dg_y)+((((ComputationType)-1.0))*(Gy)))*(cos_Roll)*(sin_Pitch))+(((Gz)+((((ComputationType)-1.0))*(Dg_z)))*(sin_Pitch)*(sin_Roll))))+(cos_Pitch))*(cos_Roll)*(sin_Pitch))+((((ComputationType)-1.0))*(DT)*((((Gz)+((((ComputationType)-1.0))*(Dg_z)))*(cos_Roll))+((((ComputationType)-1.0))*((Dg_y)+((((ComputationType)-1.0))*(Gy)))*(sin_Roll)))*(sin_Roll)));
And the original is:
ComputationType originalValue = (ComputationType)(((((((ComputationType)-1.0))*(compute_sin((ComputationType)(Pitch))))+((DT)*((((Dg_y)+((((ComputationType)-1.0))*(Gy)))*(compute_cos((ComputationType)(Pitch)))*(compute_cos((ComputationType)(Roll))))+(((Gz)+((((ComputationType)-1.0))*(Dg_z)))*(compute_cos((ComputationType)(Pitch)))*(compute_sin((ComputationType)(Roll)))))))*(compute_cos((ComputationType)(Pitch)))*(compute_cos((ComputationType)(Roll))))+((((DT)*((((Dg_y)+((((ComputationType)-1.0))*(Gy)))*(compute_cos((ComputationType)(Roll)))*(compute_sin((ComputationType)(Pitch))))+(((Gz)+((((ComputationType)-1.0))*(Dg_z)))*(compute_sin((ComputationType)(Pitch)))*(compute_sin((ComputationType)(Roll))))))+(compute_cos((ComputationType)(Pitch))))*(compute_cos((ComputationType)(Roll)))*(compute_sin((ComputationType)(Pitch))))+((((ComputationType)-1.0))*(DT)*((((Gz)+((((ComputationType)-1.0))*(Dg_z)))*(compute_cos((ComputationType)(Roll))))+((((ComputationType)-1.0))*((Dg_y)+((((ComputationType)-1.0))*(Gy)))*(compute_sin((ComputationType)(Roll)))))*(compute_sin((ComputationType)(Roll)))));
What I want is to get the same value as in the original formula. To compare them I use memcmp.
Any help is welcome. I kindly thank you in advance :).
EDIT
I will post also the values that I get.
New value : -1.2214615708217025e-005
Original value : -1.2214615708215651e-005
They are similar up to a point, but the application is safety critical and it is necessary to validate the results.
You cannot meet your expectation, for a couple of reasons.
By altering the code you change, in subtle ways, the machine instructions being used, and that affects the final value.
For instance, if the original code used fused multiply-add instructions and the new code no longer does, the result will change.
You don't mention the target architecture. Some architectures retain more than 64 bits in the floating-point pipeline. These extra bits get rounded away when a value is forced into 64-bit memory. Again, altering how this happens will have minor effects on the final output.
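If bit-for-bit equality (the memcmp check in the question) turns out not to be a hard requirement, one alternative is to compare the two results against a relative tolerance. The following is a minimal sketch, not the asker's validation method. The name nearly_equal and the tolerance values are assumptions, and an acceptable tolerance for a safety-critical application would have to come from its own error analysis:

```c
#include <math.h>
#include <stdbool.h>

/* Returns true when a and b differ by at most rel_tol times the larger
   of their magnitudes. Hypothetical helper, not part of any library. */
bool nearly_equal(double a, double b, double rel_tol)
{
    const double diff = fabs(a - b);
    const double mag_a = fabs(a);
    const double mag_b = fabs(b);
    const double scale = (mag_a > mag_b) ? mag_a : mag_b;
    return diff <= rel_tol * scale;
}
```

The two values posted in the question differ by roughly 1 part in 10^13, so nearly_equal accepts them with a tolerance of 1e-12 and rejects them with 1e-14.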

Floating point equation checking ansi c - isnormal()

I'm trying to check my floating point operations in c99.
Should I be doing all of my operations inside of isnormal()? Does this code make sense?
double dTest1 = 0.0;
double dTest2 = 0.0;
double dOutput = 0.0;
dTest1 = 5.0;
dTest2 = 10.3;
dOutput = dTest1 * dTest2;
//add some logic based on output
isnormal(dOutput);
Your use of isnormal does not look like anything idiomatic. I am not sure what you expect exactly from using isnormal this way (it's obviously going to be true for 5.0*10.3, I would expect the compiler to optimize it so), but here are at least some obvious problems assuming you use it for other computations:
Zero is not normal, so you shouldn't use isnormal as a sanity check for a result that can be zero.
isnormal will not tell you if your computation came so close to zero that it lost precision (the subnormal range) and went back into the normal range later.
You might be better served by FPU exceptions: there is one for each possible event for which you might want to know if it happened since you initiated your computations, and the way to use them is sketched out in this existing answer.

GSL solving ODE for a pendulum movement

I'm trying to solve a differential equation for a pendulum movement, given the pendulum's initial angle (x), gravitational acceleration (g), line length (l), and a time step (h). I've tried this using the Euler method and everything's alright. But now I am to use the Runge-Kutta method implemented in GSL. I've tried to implement it by following the GSL manual, but I'm stuck on one problem: the pendulum doesn't want to stop. Let's say that I start it with an initial angle of 1 rad; it always has its peak tilt at 1 rad, no matter how many swings it has already done. Here's the equation and the function I use to give it to GSL:
x''(t) + g/l*sin(x(t)) = 0
transforming it:
x''(t) = -g/l*sin(x(t))
and decomposing:
y(t) = x'(t)
y'(t) = -g/l*sin(x(t))
Here's the code snippet. If that's not enough I can post the whole program (it's not too long), but maybe the problem is somewhere in here:
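int func (double t, const double x[], double dxdt[], void *params){
(void) t; /* the system is autonomous, so t is unused */
const double *p = (const double *) params; /* params points at {l, g} */
double l = p[0];
double g = p[1]; /* avoids arithmetic on void*, which is a GNU extension */
dxdt[0] = x[1]; /* x'(t) = y(t) */
dxdt[1] = -g/l*sin(x[0]); /* y'(t) = -g/l*sin(x(t)) */
return GSL_SUCCESS;
}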
The parameters g and l are passed correctly to the function, I've already checked that.
As Barton Chittenden noted in a comment, the pendulum should keep going in the absence of friction. This is expected.
As for why it slows and stops when you use the Euler method, that's touching on a subtle and interesting subject. An (ideal, friction-free) physical pendulum has the property that energy in the system is conserved. Different integration schemes preserve that property to different degrees. With some integration schemes, the energy in the system will grow, and the pendulum will swing progressively higher. With others, energy is lost, and the pendulum comes to a halt. The speed at which either of these happens depends partially on the order of the method; a more accurate method will often lose energy more slowly.
You can easily observe this by plotting the total energy in your system (potential + kinetic) for different integration schemes.
Finally, there is a whole fascinating sub-field of integration methods which preserve certain conserved quantities of a system like this, called symplectic methods.
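The energy behaviour can be seen in a few lines. This sketch makes two simplifying assumptions that are not in the question: it uses the small-angle approximation sin(x) ≈ x, so the equation becomes x'' = -k*x with k standing in for g/l, and it compares explicit Euler against symplectic (semi-implicit) Euler rather than the GSL Runge-Kutta stepper. All names are made up for the example:

```c
typedef struct { double x, v; } state;           /* angle and angular velocity */
typedef state (*step_fn)(state, double, double); /* (state, k, h) -> new state */

/* Explicit Euler: both updates use the OLD state. */
state step_explicit_euler(state s, double k, double h)
{
    state n = { s.x + h * s.v, s.v - h * k * s.x };
    return n;
}

/* Symplectic Euler: update the velocity first, then move with the NEW velocity. */
state step_symplectic_euler(state s, double k, double h)
{
    state n;
    n.v = s.v - h * k * s.x;
    n.x = s.x + h * n.v;
    return n;
}

/* Total energy per unit mass: kinetic plus potential. */
double energy(state s, double k)
{
    return 0.5 * s.v * s.v + 0.5 * k * s.x * s.x;
}

/* Runs `steps` steps from x = 1, v = 0 and returns final energy / initial energy. */
double energy_ratio_after(step_fn step, int steps, double k, double h)
{
    state s = { 1.0, 0.0 };
    const double e0 = energy(s, k);
    for (int i = 0; i < steps; i++)
        s = step(s, k, h);
    return energy(s, k) / e0;
}
```

For this linear system each explicit Euler step multiplies the energy by exactly (1 + k*h*h), so with k = 1 and h = 0.01 the energy has grown by a factor of about e after 10000 steps. The symplectic variant stays within a fraction of a percent of the initial energy over the same run.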

variable timestep and acceleration

To move objects with a variable time step I just have to do:
ship.position += ship.velocity * deltaTime;
But when I try this with:
ship.velocity += ship.power * deltaTime;
I get different results with different time steps. How can I fix this?
EDIT:
I am modelling an object falling to the ground on one axis with a single fixed force (gravity) acting on it.
ship.position = ship.position + ship.velocity * deltaTime + 0.5 * ship.power * deltaTime * deltaTime;
ship.velocity += ship.power * deltaTime;
http://www.ugrad.math.ubc.ca/coursedoc/math101/notes/applications/velocity.html
The velocity part of your equations is correct and they must both be updated at every time step.
This all assumes that you have constant power (acceleration) over the deltaTime as pointed out by belisarius.
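The effect of the extra 0.5 * a * dt^2 term can be checked with a small sketch, written here in C with made-up names (body, step_naive, step_constant_accel). It assumes a single axis and a constant acceleration, as in the falling-object case described in the question:

```c
typedef struct { double position, velocity; } body;
typedef void (*step_fn)(body *, double, double); /* (body, accel, dt) */

/* Naive update: the position ignores the acceleration during the step. */
void step_naive(body *b, double accel, double dt)
{
    b->position += b->velocity * dt;
    b->velocity += accel * dt;
}

/* Exact update for constant acceleration; the position must be updated
   before the velocity, because it needs the velocity at the step's start. */
void step_constant_accel(body *b, double accel, double dt)
{
    b->position += b->velocity * dt + 0.5 * accel * dt * dt;
    b->velocity += accel * dt;
}

/* Starts at rest at position 0 and returns the position after `steps` steps. */
double simulate(step_fn step, double accel, double dt, int steps)
{
    body b = { 0.0, 0.0 };
    for (int i = 0; i < steps; i++)
        step(&b, accel, dt);
    return b.position;
}
```

With accel = -8 over one second, the exact update gives -4 whether the second is split into 2 steps of 0.5 or 8 steps of 0.125. The naive update gives -2 and -3.5 respectively, which is the step-size dependence described in the question.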
What you are doing (mathematically) is evaluating integrals. In the first case, the linear approximation is exact, as you have a linear relationship.
In the second case, you have at least a parabola, so your results are only approximate. You may get better results by using a smaller deltaTime, or by using the real integral equations, if available.
Edit
Brian's answer is right as long as the ship.power remains always constant, and you recalculate ship.velocity at each step. It is indeed the integral equation for a constant accelerated movement.
This is an inherent problem trying to integrate numerically. There will be an error. Lowering delta will give you more accurate results, but more computation is needed. If your power function is integrable, you could try that.
Your simulation is numerically solving the equation of motion for a single mass point. The time discretisation you are using is called "Euler method", and it is possible to show that it does not preserve energy (as the exact solution does in some way). A much better yet simple way of solving equations of motion is the "leapfrog integration".
You can use Verlet integration to calculate the position and velocity of the object. You can calculate the acceleration from a = F/m, where m is mass and F is force. This is one of the easiest algorithms.
In your code you use setInterval(moveBoxes,20) to update the boxes, and subsequently you use (new Date()).getTime() to calculate deltaT. This is somewhat redundant, because you could have used the number 20 to calculate deltaT directly.
It is better to write the code so that you use exactly the same value for deltaT during each time step. (In other words, deltaT should not depend on the value of (new Date()).getTime().) This way your code becomes reproducible and it is easier for you to write unit tests.
Let us look at a situation where the browser has less CPU time available for a short interval. In this situation you want to avoid long-term effects on the dynamics. Once the lack of CPU time is over, you want the browser to return to a state that is unaffected by it. You can achieve this by using the same value of deltaT in each time step.
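One common way to keep deltaT fixed while still following the wall clock is an accumulator: measured time is collected, and the simulation advances in whole fixed-size steps. This is a swapped-in technique rather than something from the code under discussion, sketched in C with a hypothetical function name and integer milliseconds to avoid rounding issues:

```c
/* Adds the elapsed wall-clock time to the accumulator and returns how many
   fixed-size simulation steps are now due. Any leftover time smaller than
   one step stays in the accumulator for the next call. */
int fixed_steps_due(long *accumulator_ms, long elapsed_ms, long step_ms)
{
    int steps = 0;
    *accumulator_ms += elapsed_ms;
    while (*accumulator_ms >= step_ms) {
        *accumulator_ms -= step_ms;
        steps++;
    }
    return steps;
}
```

The caller would run its update function that many times, always with the same deltaT (20 ms in the setInterval example). After a slow frame the simulation catches up with extra steps, but each step is computed with the same deltaT as every other.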
By the way, I think that the following code
if(box.x < 0) {
box.x = 0;
box.vx *= -1;
}
could be replaced with
if(box.x < 0) {
box.x *= -1 ;
box.vx *= -1;
}
Good luck with the project - and please include code samples in the first version of your question next time you ask :-)

Performance difference in looping

Will there be a huge performance difference between:
if (this.chkSelectAll.Checked)
for (int i = 0; i < this.listBoxColumns.Items.Count; i++)
this.listBoxColumns.SetSelected(i, true);
else
for (int i = 0; i < this.listBoxColumns.Items.Count; i++)
this.listBoxColumns.SetSelected(i, false);
vs.
for (int i = 0; i < this.listBoxColumns.Items.Count; i++)
this.listBoxColumns.SetSelected(i, this.chkSelectAll.Checked);
Which one is advisable: concise code or the performance gain?
I wouldn't expect to see much performance difference, and I'd certainly go with the latter as it's more readable. (I'd put braces round it though.)
It's quite easy to imagine a situation where you might need to change the loop, and with the first example you might accidentally only change one of them instead of both. If you really want to avoid calling the Checked property in every iteration, you could always do:
bool selectAll = this.chkSelectAll.Checked; // "checked" is a C# keyword, so it cannot be used as a variable name
for (int i = 0; i < this.listBoxColumns.Items.Count; i++)
{
this.listBoxColumns.SetSelected(i, selectAll);
}
As ever, write the most readable code first, and measure/profile any performance differences before bending your design/code out of shape for the sake of performance.
I suppose the performance difference will be barely noticeable. However here's a variation that is both efficient and highly readable:
bool isChecked = this.chkSelectAll.Checked;
for (int i = 0; i < this.listBoxColumns.Items.Count; i++) {
this.listBoxColumns.SetSelected(i, isChecked);
}
If you're after some real optimization you will also want to pay attention to whether the overhead of accessing "this.listBoxColumns" twice on each iteration is present in the first place and is worth paying attention to. That's what profiling is for.
You have an extra boolean check in the first example. But having said that, I can't imagine that the performance difference will be anything other than negligible. Have you tried measuring this in your particular scenario?
The second example is preferable since you're not repeating the loop code.
I can't see there being a significant performance difference between the two. The way to confirm it would be to set up a benchmark and time the different algorithms over 1000s of iterations.
However as it's UI code any performance gain is pretty meaningless as you are going to be waiting for the user to read the dialog and decide what to do next.
Personally I'd go for the second approach every time. You've only got one loop to maintain, and the code is clearer.
Any performance difference will be negligible.
Your primary concern should be code readability and maintainability.
Micro-optimisations such as this are, more often than not, misplaced. Always profile before being concerned with performance.
It's most likely to be negligible. More importantly, however, I feel the need to quote the following:
"Premature optimisation is the root of all evil"
The second is easily the more readable, so simply go with that, unless you later find a need to optimise (which is quite unlikely in my opinion).
Why not use System.Diagnostics.Stopwatch and compare the two yourself? However, I don't believe there's going to be any real performance difference. The first example might be faster because you're only accessing chkSelectAll.Checked once. Both are easily readable though.
