Is there a better way to optimize the Lennard-Jones potential function? - c

In actual fact, it's the derivative of the Lennard-Jones potential. The reason is that I am writing a Molecular Dynamics program, and at least 80% of the time is spent in the following function, even with the most aggressive compiler options (gcc -O3).
double ljd(double r)   /* Derivative of the Lennard-Jones potential for Argon
                          with respect to distance (r) */
{
    double temp;
    temp = Si / r;
    temp = temp * temp;
    temp = temp * temp * temp;
    return ( (24 * Ep / r) * (temp - (2 * pow(temp, 2))) );
}
This code is in a file "functs.h", which I include in my main file. I thought that using a temporary variable in this way would make the function faster, but I am worried that creating it is too wasteful. Should I use static? Also, the code is parallelized with OpenMP, so I can't really declare temp as a global variable.
The constants Ep and Si are defined with #define. I have only been using C for about a month. I tried to look at the assembler code generated by gcc, but I was completely lost.
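For concreteness, the definitions might look something like this (the question does not give the actual values or units, so the Argon parameters below are purely illustrative):
#include <math.h>

/* Hypothetical values for illustration only -- the question does not state
   the units or the exact numbers used. */
#define Ep 1.654e-21   /* Lennard-Jones well depth for Argon in joules (assumed) */
#define Si 3.405e-10   /* Lennard-Jones sigma for Argon in metres (assumed) */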

I would get rid of the call to pow() for a start:
double ljd(double r)   /* Derivative of the Lennard-Jones potential for Argon
                          with respect to distance (r) */
{
    double temp;
    temp = Si / r;
    temp = temp * temp;
    temp = temp * temp * temp;
    return ( (24.0 * Ep / r) * (temp - (2.0 * temp * temp)) );
}

On my architecture (Intel Centrino Duo, MinGW-gcc 4.5.2 on Windows XP), this non-optimized code using pow()
static inline double ljd(double r)
{
    return 24 * Ep / Si * (pow(Si / r, 7) - 2 * pow(Si / r, 13));
}
actually outperforms your version if -ffast-math is provided.
The generated assembly (using some arbitrary values for Ep and Si) looks like this:
fldl LC0
fdivl 8(%ebp)
fld %st(0)
fmul %st(1), %st
fmul %st, %st(1)
fld %st(0)
fmul %st(1), %st
fmul %st(2), %st
fxch %st(1)
fmul %st(2), %st
fmul %st(0), %st
fmulp %st, %st(2)
fxch %st(1)
fadd %st(0), %st
fsubrp %st, %st(1)
fmull LC1

Well, as I've said before, compilers suck at optimising floating point code for many reasons. So, here's an Intel assembly version that should be faster (compiled using DevStudio 2005):
const double Si6 = /* whatever pow(Si, 6) is */;
const double Si_value = /* whatever Si is */;  /* need _value as Si is a register name! */
const double Ep24 = /* whatever 24 * Ep is */;

double ljd (double r)
{
    double result;
    __asm
    {
        fld qword ptr [r]
        fld st(0)
        fmul st(0),st(0)
        fld st(0)
        fmul st(0),st(0)
        fmulp st(1),st(0)
        fld qword ptr [Si6]
        fdivrp st(1),st(0)
        fld st(0)
        fld1
        fsub st(0),st(1)
        fsubrp st(1),st(0)
        fmulp st(1),st(0)
        fld qword ptr [Ep24]
        fmulp st(1),st(0)
        fdivrp st(1),st(0)
        fstp qword ptr [result]
    }
    return result;
}
This version will produce slightly different results from the version posted. The compiler will probably be writing intermediate results to RAM in the original code, which loses precision since the (Intel) FPU operates at 80 bits internally whereas the double type is only 64 bits. The assembler above does not lose precision in the intermediate results; everything is done at 80 bits, and only the final result is rounded to 64 bits.

The local variable is just fine. It doesn't cost anything. Leave it alone.
As others said, get rid of the pow call. It can't be any faster than simply squaring the number, and it could be a lot slower.
That said, just because the function is active 80+% of the time does not mean it's a problem. It only means if there is something you can optimize, it's either in there, or in something it calls (like pow) or in something that calls it.
If you try random pausing, which is a method of stack-sampling, you will see that routine on 80+% of samples, plus the lines within it that are responsible for the time, plus its callers that are responsible for the time, and their callers, and so on. All the lines of code on the stack are jointly responsible for the time.
Optimality is not when nothing takes a large percent of time; it is when nothing you can fix takes a large percent of time.

Is your application structured in such a way that you could profitably vectorise this function, calculating several independent values in parallel? This would allow you to utilise hardware vector units, such as SSE.
It also seems like you would be better off keeping 1/r values around, rather than r itself.
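For instance, a scalar version written in terms of the inverse distance might look like this (a sketch only, assuming the caller can cheaply supply rinv = 1/r because it is already needed elsewhere):
/* Sketch: the same derivative, but taking rinv = 1/r as the argument,
   so the function itself contains no division. */
double ljd_inv(double rinv)
{
    double temp = Si * rinv;      /* (Si/r)   */
    temp = temp * temp;           /* (Si/r)^2 */
    temp = temp * temp * temp;    /* (Si/r)^6 */
    return (24.0 * Ep * rinv) * (temp - 2.0 * temp * temp);
}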
This is an example explicitly using SSE2 instructions to implement the function. ljd() calculates two values at once.
static __m128d ljd(__m128d r)
{
    static const __m128d two  = { 2.0, 2.0 };
    static const __m128d si   = { Si, Si };
    static const __m128d ep24 = { 24 * Ep, 24 * Ep };
    __m128d temp2, temp3;
    __m128d temp   = _mm_div_pd(si, r);
    __m128d ep24_r = _mm_div_pd(ep24, r);

    temp  = _mm_mul_pd(temp, temp);
    temp2 = _mm_mul_pd(temp, temp);
    temp2 = _mm_mul_pd(temp2, temp);
    temp3 = _mm_mul_pd(temp2, temp2);
    temp3 = _mm_mul_pd(temp3, two);
    return _mm_mul_pd(ep24_r, _mm_sub_pd(temp2, temp3));
}
/* Requires `out` and `in` to be 16-byte aligned */
void ljd_array(double out[], const double in[], int n)
{
    int i;
    for (i = 0; i < n; i += 2)
    {
        _mm_store_pd(out + i, ljd(_mm_load_pd(in + i)));
    }
}
However, it is important to note that recent versions of GCC are often able to vectorise functions like this automatically, as long as you're targeting the right architecture and have optimisation enabled. If you're targeting 32-bit x86, try compiling with -msse2 -O3, and arrange things so that the input and output arrays are 16-byte aligned.
Alignment for static and automatic arrays can be achieved under gcc with the type attribute __attribute__ ((aligned (16))), and for dynamic arrays using the posix_memalign() function.
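For example (a small sketch; the array name, size, and helper function are made up):
#include <stdlib.h>

/* Static or automatic array aligned to 16 bytes via a type attribute. */
static double positions[4096] __attribute__ ((aligned (16)));

/* Dynamically allocated, 16-byte-aligned buffer of n doubles. */
double *make_aligned_buffer(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(double)) != 0)
        return NULL;   /* allocation failed */
    return p;
}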

Ah, that brings me back some memories... I've done MD with Lennard Jones potential years ago.
In my scenario (not huge systems) it was enough to replace the pow() with several multiplications, as suggested in another answer. I also restricted the range of neighbours, effectively truncating the potential at about r ~ 3.5 and applying a standard thermodynamic correction afterwards.
But if all this is not enough for you, I suggest precomputing the function for closely spaced values of r and simply interpolating (linear or quadratic, I'd say).
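A minimal sketch of the tabulation idea, assuming the ljd() function from the question, a fixed range [R_MIN, R_CUT], and linear interpolation (the table size and range here are arbitrary):
#define TABLE_SIZE 4096
#define R_MIN      0.5      /* below this the potential blows up anyway */
#define R_CUT      3.5      /* truncation radius */

/* One spare entry at the end so that r == R_CUT is safe to look up. */
static double ljd_table[TABLE_SIZE + 2];
static const double dr = (R_CUT - R_MIN) / TABLE_SIZE;

/* Fill the table once, before the MD loop. */
void ljd_table_init(void)
{
    for (int i = 0; i <= TABLE_SIZE + 1; i++)
        ljd_table[i] = ljd(R_MIN + i * dr);
}

/* Linear interpolation between the two nearest table entries.
   The caller must ensure R_MIN <= r <= R_CUT. */
double ljd_lookup(double r)
{
    double x = (r - R_MIN) / dr;
    int    i = (int)x;
    double f = x - i;
    return ljd_table[i] * (1.0 - f) + ljd_table[i + 1] * f;
}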

Related

C - efficiently changing a function pointer based on command line input

I have several similar functions, say A, B, and C, and I want to choose one of them with a command line option. I call that function billions of times, so instead of checking a variable inside the function on every call, I define a function pointer Phi and set it to the desired function just once. But when I set Phi = A directly (so no user input is considered), my code runs in ~24 seconds; when I add an if-else and set Phi to the desired function, my code runs in ~30 seconds with exactly the same parameters. (Of course the command line option sets Phi to A.) What is the efficient way to handle this case?
My functions:
double funcA(double r)
{
    return 0;
}

double funcB(double r)
{
    return 1;
}

double funcC(double r)
{
    return r;
}

void computationFunctionFast(Context *userInputs) {
    double (*Phi)(double) = funcA;
    /* computation codes */
}

void computationFunctionSlow(Context *userInputs) {
    double (*Phi)(double);
    switch (userInputs->funcEnum) {
    case A:
        Phi = funcA;
        break;
    case B:
        Phi = funcB;
        break;
    case C:
        Phi = funcC;
    }
    /* computation codes */
}
I've tried gcc, clang and icx with -O2 and -O3 optimizations. (gcc shows no performance difference between the two cases, but has the worst performance overall.) Although I'm using C, I've tried std::function too. I've also tried defining the Phi function pointer in different scopes, etc.
Generally, there are a few things here that are slightly bad for performance:
Branches/comparisons lead to inefficient use of branch prediction/instruction cache and might affect pipelining too.
Function pointers are notoriously inefficient since they generally block inlining, and there isn't much the compiler can do about them.
Here's an example based on your code:
double computationFunctionSlow (int input, double val) {
    double (*Phi)(double);
    switch (input) {
        case 0: Phi = funcA; break;
        case 1: Phi = funcB; break;
        case 2: Phi = funcC; break;
    }
    double res = Phi(val);
    return res;
}
clang 15.0.0 x86_64 -O3 gives:
computationFunctionSlow: # #computationFunctionSlow
cmp edi, 2
ja .LBB3_1
movsxd rax, edi
lea rcx, [rip + .Lswitch.table.computationFunctionSlow]
jmp qword ptr [rcx + 8*rax] # TAILCALL
.LBB3_1:
xorps xmm0, xmm0
ret
.Lswitch.table.computationFunctionSlow:
.quad funcA
.quad funcB
.quad funcC
Even though the numbers I picked are adjacent, the usual compilers fail to optimize out the comparison cmp. Even when I include a default: return 0; it is still there. You can quite easily manually optimize any switch with contiguous indices like this into a function pointer jump table:
double computationFunctionSlow (int input, double val) {
    double (*Phi[3])(double) = {funcA, funcB, funcC};
    double res = Phi[input](val);
    return res;
}
clang 15.0.0 x86_64 -O3 gives:
computationFunctionSlow: # #computationFunctionSlow
movsxd rax, edi
lea rcx, [rip + .L__const.computationFunctionSlow.Phi]
jmp qword ptr [rcx + 8*rax] # TAILCALL
.L__const.computationFunctionSlow.Phi:
.quad funcA
.quad funcB
.quad funcC
This leads to slightly better code here, as the comparison instruction/branch is now removed. However, this is really a micro-optimization that shouldn't have much impact on performance. You have to benchmark it to see if there's any improvement at all.
(Also, gcc 12.2 didn't optimize this code as well, which is why I went with clang for this example.)
Godbolt link: https://godbolt.org/z/ja4zerj7o
There isn't a more "efficient" way to handle this case, you are already doing what you should.
The difference in timing you observe is because:
In the first case (Phi = funcA) the compiler knows the function will always be the same and is therefore able to optimize its calls. Depending on what your "computation code" does, this could mean inlining the function and simplifying a lot of calculations for you.
In the second case (Phi = <choice from user>) the compiler cannot know which function will be selected, and therefore cannot optimize any of the calls made to it by the rest of the code. It also cannot propagate optimizations to other parts of your "computation code" like in the first case.
In general, there isn't much you can do. Dynamic function pointers inherently add a bit of runtime overhead and make optimizations harder (or impossible).
What you could try is duplicating the "computation code" inside different functions or different branches that you only enter after asserting that Phi is equal to a constant, like so:
void computationFunctionSlow(Context *userInputs) {
    if (userInputs->funcEnum == A) {
        double (*const Phi)(double) = funcA;
        // computation code
    } else if (...) {
        // ...
    }
}
In the above piece of code, the compiler knows that inside any of those if blocks Phi can only have one value, and may therefore be able to perform the same optimizations discussed in point 1 above.
There's no need to put an enum in your userInputs when all you do with it is use it to select a function pointer. Just add the function pointer in the structure directly and eliminate the branching done on every call.
Instead of
struct Context
{
    ...
    enum funcType funcEnum;
};
use
struct Context
{
    ...
    double (*phi)(double);
};
You'd wind up with something like this:
void computationFunctionSlow(Context *userInputs) {
    /* computation codes */
    double result = userInputs->phi( data );
}

MSVC Inline Assembly: Freeing FPU registers for performance

While playing a little with the FPU using MSVC's inline assembly, I got a little confused about freeing FPU registers with a view to increasing performance...
For example:
#include <stdio.h>

double fpu_add(register double x, register double y) {
    double res = 0.0;
    __asm {
        fld x
        fld y
        fadd
        fstp res
    }
    return res;
}

int main(void) {
    double x = fpu_add(5.0, 2.0);
    (void) printf("x = %f\n", x);
    return 0;
}
When do I have to ffree the FPU registers in Inline Assembly?
In that example would performance be better if I decided to ffree the st(1) register?
Also, is fstp shorthand for the instructions below?
__asm {
    fst res
    ffree st(0)
}
NOTE: I know FPU instructions are a bit old nowadays, but I'm treating them as another option alongside SSE.
The ffree instruction allows you to mark any slot of the x87 FP register stack as free without actually changing the stack pointer. So ffree st(0) does NOT pop the stack; it just marks the top value of the stack as free/invalid, so any following instruction that tries to access it will get a floating point exception.
To actually pop to the stack you need both ffree st(0) and fincstp (to increment the pointer). Or better, fstp st(0) to do both those things with a single cheap instruction. Or fstp st(1) to keep the top-of-stack value and discard the old st(1).
But it's usually even better and easier (and faster) to use the p suffixed versions of other instructions. In your case, you probably want
__asm {
    fld x      // push x on the stack
    fld y      // push y on the stack
    faddp      // pop a value and add it to the (now) tos
    fstp res   // pop and store tos
}
This ends up pushing and popping two values, leaving the fp stack in the same state as it was before. Leaving stuff on the fp stack is likely to cause problems with other fp code, if the compiler is generating x87 fp code, so should be avoided.
Or even better, use memory-source fadd to save instructions, if you're optimizing for CPUs where that's not slower. (Check Agner Fog's microarch PDF and instruction tables for P5 Pentium and newer: seems to be fine, at least break even, and saves a uop on more modern CPUs like Core2 that can do micro-fusion of memory source operands.)
__asm {
    fld x      // push x on the stack
    fadd y     // ST0 += y
    fstp res   // pop and store tos
}
But MSVC inline asm is inherently slow for wrapping a single instruction like fadd, forcing inputs to be in memory, even if the compiler had them available in registers before the asm statement. And forcing the result to be stored in the asm and then reloaded for the return statement, unless you use a hack like leaving a value in st(0) and falling off the end of a function without a return statement. (MSVC does actually support this even when inlining, but clang-cl / clang -fasm-blocks does not.)
GNU C inline asm could wrap a single fadd instruction with appropriate constraints to ask for inputs in x87 registers and tell the compiler where the output is (in st(0)), but you'd still have to choose between fadd and faddp, not letting the compiler pick based on whether it had values in registers or a value from memory. (https://stackoverflow.com/tags/inline-assembly/info)
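A rough sketch of what such a GNU C wrapper could look like (constraint usage modelled on the x87 examples in the GCC manual; treat it as illustrative rather than a drop-in):
/* Sketch only: asks for x in st(0) and y in st(1), result left in st(0).
   faddp adds st(0) into st(1) and pops, so st(1) must be clobbered. */
static inline double fadd_asm(double x, double y)
{
    double res;
    __asm__ ("faddp" : "=t" (res) : "0" (x), "u" (y) : "st(1)");
    return res;
}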
Compilers aren't terrible, they will make code at least this good from plain C source. Inline asm is generally not useful for performance, unless you're writing a whole loop that's carefully tuned for a specific CPU, or for a case where the compiler does a poor job with something. (Look at the compiler's optimized asm output, e.g. on https://godbolt.org/)

Optimization of matrix and vector multiplication in C

I have a function that gets a 3 x 3 matrix and a 3 x 4000 vector, and multiplies them.
All the calculation are done in double precision (64-bit).
The function is called about 3.5 million times so it should be optimized.
#define MATRIX_DIM 3
#define VECTOR_LEN 3000

typedef struct {
    double a;
    double b;
    double c;
} vector_st;

double matrix[MATRIX_DIM][MATRIX_DIM];
vector_st vector[VECTOR_LEN];

inline void rotate_arr(double input_matrix[][MATRIX_DIM], vector_st *input_vector, vector_st *output_vector)
{
    int i;
    for (i = 0; i < VECTOR_LEN; i++) {
        output_vector[i].a = input_matrix[0][0] * input_vector[i].a +
                             input_matrix[0][1] * input_vector[i].b +
                             input_matrix[0][2] * input_vector[i].c;
        output_vector[i].b = input_matrix[1][0] * input_vector[i].a +
                             input_matrix[1][1] * input_vector[i].b +
                             input_matrix[1][2] * input_vector[i].c;
        output_vector[i].c = input_matrix[2][0] * input_vector[i].a +
                             input_matrix[2][1] * input_vector[i].b +
                             input_matrix[2][2] * input_vector[i].c;
    }
}
I'm all out of ideas on how to optimize it, because it's inline, data access is sequential, and the function is short and pretty straightforward.
It can be assumed that the vector is always the same and only the matrix is changing if it will boost performance.
One easy-to-fix problem here is that compilers assume that the matrix and the output vectors may alias. As seen here in the second function, that causes code to be generated that is less efficient and significantly larger. This can be fixed simply by adding restrict to the output pointer, as in the prototype sketched below. Doing only this already helps and keeps the code free from platform-specific optimization, but it relies on auto-vectorization to exploit the performance increases of the past two decades.
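Concretely, that is just one keyword added to the existing signature (shown here as a prototype):
/* Same function as in the question, but the output pointer is now
   restrict-qualified, promising the compiler it does not alias the inputs. */
void rotate_arr(double input_matrix[][MATRIX_DIM],
                vector_st *input_vector,
                vector_st * restrict output_vector);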
Auto-vectorization is evidently still too immature for the task, both Clang and GCC generate way too much shuffling around of the data. This should improve in future compilers, but for now even a case like this (that doesn't seem inherently super hard) needs manual help, such as this (not tested though)
void rotate_arr_avx(double input_matrix[][MATRIX_DIM], vector_st *input_vector, vector_st * restrict output_vector)
{
    __m256d col0, col1, col2, a, b, c, t;
    int i;
    // using set macros like this is kind of dirty, but it's outside the loop anyway
    col0 = _mm256_set_pd(0.0, input_matrix[2][0], input_matrix[1][0], input_matrix[0][0]);
    col1 = _mm256_set_pd(0.0, input_matrix[2][1], input_matrix[1][1], input_matrix[0][1]);
    col2 = _mm256_set_pd(0.0, input_matrix[2][2], input_matrix[1][2], input_matrix[0][2]);
    for (i = 0; i < VECTOR_LEN; i++) {
        a = _mm256_set1_pd(input_vector[i].a);
        b = _mm256_set1_pd(input_vector[i].b);
        c = _mm256_set1_pd(input_vector[i].c);
        t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0, a), _mm256_mul_pd(col1, b)), _mm256_mul_pd(col2, c));
        // this stores an element too much, ensure 8 bytes of padding exist after the array
        _mm256_storeu_pd(&output_vector[i].a, t);
    }
}
Writing it this way significantly improves what compilers do with it, now compiling to a nice and tight loop without all the nonsense. Earlier the code hurt to look at, but with this the loop now looks like this (GCC 8.1, with FMA enabled), which is actually readable:
.L2:
vbroadcastsd ymm2, QWORD PTR [rsi+8+rax]
vbroadcastsd ymm1, QWORD PTR [rsi+16+rax]
vbroadcastsd ymm0, QWORD PTR [rsi+rax]
vmulpd ymm2, ymm2, ymm4
vfmadd132pd ymm1, ymm2, ymm3
vfmadd132pd ymm0, ymm1, ymm5
vmovupd YMMWORD PTR [rdx+rax], ymm0
add rax, 24
cmp rax, 72000
jne .L2
This has an obvious deficiency: only 3 of the 4 double precision slots of the 256-bit AVX vectors are actually used. If the data format of the vector was changed to, for example, AAAABBBBCCCC repeating, a totally different approach could be used, namely broadcasting the matrix elements instead of the vector elements, then multiplying each broadcasted matrix element by the A components of 4 different vector_sts at once (see the sketch below).
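A sketch of that structure-of-arrays approach (untested; the function name and the separate a/b/c arrays are made up here, and VECTOR_LEN is assumed to be a multiple of 4):
#include <immintrin.h>

/* Sketch: inputs repacked as separate arrays (AAAA... BBBB... CCCC...),
   so each 256-bit load covers the same component of 4 vectors at once. */
void rotate_arr_soa(double m[][MATRIX_DIM],
                    const double *a, const double *b, const double *c,
                    double * restrict oa, double * restrict ob, double * restrict oc)
{
    int i;
    for (i = 0; i < VECTOR_LEN; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vc = _mm256_loadu_pd(c + i);
        /* broadcast matrix elements instead of vector elements */
        __m256d ra = _mm256_add_pd(_mm256_add_pd(
                         _mm256_mul_pd(_mm256_set1_pd(m[0][0]), va),
                         _mm256_mul_pd(_mm256_set1_pd(m[0][1]), vb)),
                         _mm256_mul_pd(_mm256_set1_pd(m[0][2]), vc));
        __m256d rb = _mm256_add_pd(_mm256_add_pd(
                         _mm256_mul_pd(_mm256_set1_pd(m[1][0]), va),
                         _mm256_mul_pd(_mm256_set1_pd(m[1][1]), vb)),
                         _mm256_mul_pd(_mm256_set1_pd(m[1][2]), vc));
        __m256d rc = _mm256_add_pd(_mm256_add_pd(
                         _mm256_mul_pd(_mm256_set1_pd(m[2][0]), va),
                         _mm256_mul_pd(_mm256_set1_pd(m[2][1]), vb)),
                         _mm256_mul_pd(_mm256_set1_pd(m[2][2]), vc));
        _mm256_storeu_pd(oa + i, ra);
        _mm256_storeu_pd(ob + i, rb);
        _mm256_storeu_pd(oc + i, rc);
    }
}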
Another thing we can try, without even changing the data format, is processing more than one matrix at the same time, which helps re-use loads from input_vector and increases arithmetic intensity.
void rotate_arr_avx(double input_matrixA[][MATRIX_DIM], double input_matrixB[][MATRIX_DIM], vector_st *input_vector, vector_st * restrict output_vectorA, vector_st * restrict output_vectorB)
{
    __m256d col0A, col1A, col2A, a, b, c, t, col0B, col1B, col2B;
    int i;
    // using set macros like this is kind of dirty, but it's outside the loop anyway
    col0A = _mm256_set_pd(0.0, input_matrixA[2][0], input_matrixA[1][0], input_matrixA[0][0]);
    col1A = _mm256_set_pd(0.0, input_matrixA[2][1], input_matrixA[1][1], input_matrixA[0][1]);
    col2A = _mm256_set_pd(0.0, input_matrixA[2][2], input_matrixA[1][2], input_matrixA[0][2]);
    col0B = _mm256_set_pd(0.0, input_matrixB[2][0], input_matrixB[1][0], input_matrixB[0][0]);
    col1B = _mm256_set_pd(0.0, input_matrixB[2][1], input_matrixB[1][1], input_matrixB[0][1]);
    col2B = _mm256_set_pd(0.0, input_matrixB[2][2], input_matrixB[1][2], input_matrixB[0][2]);
    for (i = 0; i < VECTOR_LEN; i++) {
        a = _mm256_set1_pd(input_vector[i].a);
        b = _mm256_set1_pd(input_vector[i].b);
        c = _mm256_set1_pd(input_vector[i].c);
        t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0A, a), _mm256_mul_pd(col1A, b)), _mm256_mul_pd(col2A, c));
        // this stores an element too much, ensure 8 bytes of padding exist after the array
        _mm256_storeu_pd(&output_vectorA[i].a, t);
        t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0B, a), _mm256_mul_pd(col1B, b)), _mm256_mul_pd(col2B, c));
        _mm256_storeu_pd(&output_vectorB[i].a, t);
    }
}

Why is using a third variable faster than an addition trick?

When computing fibonacci numbers, a common method is mapping the pair of numbers (a, b) to (b, a + b) multiple times. This can usually be done by defining a third variable c and doing a swap. However, I realised you could do the following, avoiding the use of a third integer variable:
b = a + b; // b2 = a1 + b1
a = b - a; // a2 = b2 - a1 = b1, Ta-da!
I expected this to be faster than using a third variable, since in my mind this new method should only have to consider two memory locations.
So I wrote the following C programs comparing the processes. These mimic the calculation of fibonacci numbers, but rest assured I am aware that they will not calculate the correct values due to size limitations.
(Note: I realise now that it was unnecessary to make n a long int, but I will keep it as it is because that is how I first compiled it)
File: PlusMinus.c
// Using the 'b=a+b; a=b-a;' method.
#include <stdio.h>

int main() {
    long int n = 1000000; // Number of iterations.
    long int a, b;
    a = 0; b = 1;
    while (n--) {
        b = a + b;
        a = b - a;
    }
    printf("%lu\n", a);
}
File: ThirdVar.c
// Using the third-variable method.
#include <stdio.h>

int main() {
    long int n = 1000000; // Number of iterations.
    long int a, b, c;
    a = 0; b = 1;
    while (n--) {
        c = a;
        a = b;
        b = b + c;
    }
    printf("%lu\n", a);
}
When I run the two with GCC (no optimisations enabled) I notice a consistent difference in speed:
$ time ./PlusMinus
14197223477820724411
real 0m0.014s
user 0m0.009s
sys 0m0.002s
$ time ./ThirdVar
14197223477820724411
real 0m0.012s
user 0m0.008s
sys 0m0.002s
When I run the two with GCC with -O3, the assembly outputs are equal. (I suspect I had confirmation bias when stating that one just outperformed the other in previous edits.)
Inspecting the assembly for each, I see that PlusMinus.s actually has one less instruction than ThirdVar.s, but runs consistently slower.
Question
Why does this time difference occur? Not only at all, but also why is my addition/subtraction method slower contrary to my expectations?
Why does this time difference occur?
There is no time difference when compiled with optimizations (under recent versions of gcc and clang). For instance, gcc 8.1 for x86_64 compiles both to:
Live at Godbolt
.LC0:
.string "%lu\n"
main:
sub rsp, 8
mov eax, 1000000
mov esi, 1
mov edx, 0
jmp .L2
.L3:
mov rsi, rcx
.L2:
lea rcx, [rdx+rsi]
mov rdx, rsi
sub rax, 1
jne .L3
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
add rsp, 8
ret
Not only at all, but also why is my addition/subtraction method slower contrary to my expectations?
Adding and subtracting could be slower than just moving. However, on most architectures (e.g. an x86 CPU), it is basically the same (1 cycle plus the memory latency), so this does not explain it.
The real problem is, most likely, the dependencies between the data. See:
b = a + b;
a = b - a;
To compute the second line, you have to have finished computing the value of the first. If the compiler uses the expressions as they are (which is the case under -O0), that is what the CPU will see.
In your second example, however:
c = a;
a = b;
b = b + c;
You can compute both the new a and b at the same time, since they do not depend on each other. And, in a modern processor, those operations can actually be computed in parallel. Or, putting it another way, you are not "stopping" the processor by making it wait on a previous result. This is called Instruction-level parallelism.

Fastest way to compute distance squared

My code relies heavily on computing distances between two points in 3D space.
To avoid the expensive square root I use the squared distance throughout.
But still it takes up a major fraction of the computing time and I would like to replace my simple function with something even faster.
I now have:
double distance_squared(double *a, double *b)
{
    double dx = a[0] - b[0];
    double dy = a[1] - b[1];
    double dz = a[2] - b[2];
    return dx*dx + dy*dy + dz*dz;
}
I also tried using a macro to avoid the function call but it doesn't help much.
#define DISTANCE_SQUARED(a, b) ((a)[0]-(b)[0])*((a)[0]-(b)[0]) + ((a)[1]-(b)[1])*((a)[1]-(b)[1]) + ((a)[2]-(b)[2])*((a)[2]-(b)[2])
I thought about using SIMD instructions but could not find a good example or complete list of instructions (ideally some multiply+add on two vectors).
GPU's are not an option since only one set of points is known at each function call.
What would be the fastest way to compute the distance squared?
A good compiler will optimize that about as well as you will ever manage. A good compiler will use SIMD instructions if it deems that they are going to be beneficial. Make sure that you turn on all such possible optimizations for your compiler. Unfortunately, vectors of dimension 3 don't tend to sit well with SIMD units.
I suspect that you will simply have to accept that the code produced by the compiler is probably pretty close to optimal and that no significant gains can be made.
The first obvious thing would be to use the restrict keyword.
As it is now, a and b can alias (and thus, from the compiler's point of view, which has to assume the worst case, they are aliased). No compiler will auto-vectorize this, as it would be wrong to do so.
Worse, not only can the compiler not vectorize such a loop, but if you also store through one of the pointers (luckily not in your example), it must also re-load values each time. Always be clear about aliasing, as it greatly impacts the compiler.
Next, if you can live with it, use float instead of double and pad to 4 floats even if one is unused; this is a more "natural" data layout for the majority of CPUs. (This is somewhat platform-specific, but 4 floats is a good guess for most platforms, whereas 3 doubles, a.k.a. 1.5 SIMD registers on "typical" CPUs, is not optimal anywhere.) A possible layout is sketched below.
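A padded layout along those lines might look like this (just a sketch of the idea; the struct and function names are made up):
/* 4 floats = 16 bytes = one SSE register; the .pad member is unused. */
typedef struct {
    float x, y, z, pad;
} point4;

float distance_squared_f(const point4 *a, const point4 *b)
{
    float dx = a->x - b->x;
    float dy = a->y - b->y;
    float dz = a->z - b->z;
    return dx*dx + dy*dy + dz*dz;
}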
(For a hand-written SIMD implementation (which is harder than you think), first of all be sure to have aligned data. Next, look into what latencies your instructions have on the target machine and do the longest ones first. For example, on pre-Prescott Intel it makes sense to first shuffle each component into a register and then multiply it with itself, even though that uses 3 multiplies instead of one, because shuffles have a long latency. On later models, a shuffle takes a single cycle, so that would be a total anti-optimization.
Which again shows that leaving it to the compiler is not such a bad idea.)
The SIMD code to do this (using SSE3):
movaps xmm0,a
movaps xmm1,b
subps xmm0,xmm1
mulps xmm0,xmm0
haddps xmm0,xmm0
haddps xmm0,xmm0
but you need four value vectors (x,y,z,0) for this to work. If you've only got three values then you'd need to do a bit of fiddling about to get the required format which would cancel out any benefit of the above.
In general though, due to the superscalar, pipelined architecture of the CPU, the best way to get performance is to do the same operation on lots of data; that way you can interleave the various steps and do a bit of loop unrolling to avoid pipeline stalls. The above code will definitely stall on the last three instructions because of the "can't use a value directly after it's modified" principle: each instruction has to wait for the result of the previous one to complete, which isn't good in a pipelined system.
Doing the calculation on two or more different sets of points at the same time can remove the above bottleneck: whilst waiting for the result of one computation, you can start the computation of the next point:
movaps xmm0,a1
movaps xmm2,a2
movaps xmm1,b1
movaps xmm3,b2
subps xmm0,xmm1
subps xmm2,xmm3
mulps xmm0,xmm0
mulps xmm2,xmm2
haddps xmm0,xmm0
haddps xmm2,xmm2
haddps xmm0,xmm0
haddps xmm2,xmm2
If you would like to optimize something, first profile the code and inspect the assembler output.
After compiling it with gcc -O3 (4.6.1) we get nicely disassembled output using (scalar) SSE instructions:
movsd (%rdi), %xmm0
movsd 8(%rdi), %xmm2
subsd (%rsi), %xmm0
movsd 16(%rdi), %xmm1
subsd 8(%rsi), %xmm2
subsd 16(%rsi), %xmm1
mulsd %xmm0, %xmm0
mulsd %xmm2, %xmm2
mulsd %xmm1, %xmm1
addsd %xmm2, %xmm0
addsd %xmm1, %xmm0
This type of problem often occurs in MD simulations. Usually the amount of calculation is reduced by cutoffs and neighbour lists, so the number of distance evaluations is reduced. The actual calculation of the squared distances, however, is done exactly as given in your question (with compiler optimizations and a fixed type float[3]).
So if you want to reduce the number of squared-distance calculations, you should tell us more about the problem.
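For illustration, a cutoff test can be done on the squared distance so that no square root is ever needed (a small sketch; the helper name and rc parameter are made up):
/* Returns 1 if the pair is within the cutoff radius rc, 0 otherwise.
   Comparing squared distances avoids any square root. */
int within_cutoff(const double *a, const double *b, double rc)
{
    double rc2 = rc * rc;
    return distance_squared(a, b) < rc2;
}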
Perhaps passing the 6 doubles directly as arguments could make it faster (because it could avoid the array dereference):
inline double distsquare_coord(double xa, double ya, double za,
                               double xb, double yb, double zb)
{
    double dx = xa - xb;
    double dy = ya - yb;
    double dz = za - zb;
    return dx*dx + dy*dy + dz*dz;
}
Or perhaps, if you have many points in the vicinity, you might compute a distance (to the same fixed other point) by linear approximation of the distances of other near points.
If you can rearrange your data to process two pairs of input vectors at once, you may use this code (SSE2 only)
#include <emmintrin.h>

// @brief Computes two squared distances between two pairs of 3D vectors.
// @param a
//        Pointer to the first pair of 3D vectors.
//        The two vectors must be stored with stride 24, i.e. (a + 3) should point
//        to the first component of the second vector in the pair.
//        Must be aligned by 16 (2 doubles).
// @param b
//        Pointer to the second pair of 3D vectors.
//        The two vectors must be stored with stride 24, i.e. (b + 3) should point
//        to the first component of the second vector in the pair.
//        Must be aligned by 16 (2 doubles).
// @param c
//        Pointer to the output 2-element array.
//        Must be aligned by 16 (2 doubles).
//        The two distances between the a and b vectors will be written to c[0] and c[1] respectively.
void distance_squared_2pairs(const double * __restrict__ a, const double * __restrict__ b, double * __restrict__ c) {
    // diff0 = ( a0.y - b0.y, a0.x - b0.x ) = ( d0.y, d0.x )
    __m128d diff0 = _mm_sub_pd(_mm_load_pd(a), _mm_load_pd(b));
    // diff1 = ( a1.x - b1.x, a0.z - b0.z ) = ( d1.x, d0.z )
    __m128d diff1 = _mm_sub_pd(_mm_load_pd(a + 2), _mm_load_pd(b + 2));
    // diff2 = ( a1.z - b1.z, a1.y - b1.y ) = ( d1.z, d1.y )
    __m128d diff2 = _mm_sub_pd(_mm_load_pd(a + 4), _mm_load_pd(b + 4));
    // prod0 = ( d0.y * d0.y, d0.x * d0.x )
    __m128d prod0 = _mm_mul_pd(diff0, diff0);
    // prod1 = ( d1.x * d1.x, d0.z * d0.z )
    __m128d prod1 = _mm_mul_pd(diff1, diff1);
    // prod2 = ( d1.z * d1.z, d1.y * d1.y )
    __m128d prod2 = _mm_mul_pd(diff2, diff2);
    // _mm_unpacklo_pd(prod0, prod2) = ( d1.y * d1.y, d0.x * d0.x )
    // psum = ( d1.x * d1.x + d1.y * d1.y, d0.x * d0.x + d0.z * d0.z )
    __m128d psum = _mm_add_pd(_mm_unpacklo_pd(prod0, prod2), prod1);
    // _mm_unpackhi_pd(prod0, prod2) = ( d1.z * d1.z, d0.y * d0.y )
    // dotprod = ( d1.x * d1.x + d1.y * d1.y + d1.z * d1.z, d0.x * d0.x + d0.y * d0.y + d0.z * d0.z )
    __m128d dotprod = _mm_add_pd(_mm_unpackhi_pd(prod0, prod2), psum);
    _mm_store_pd(c, dotprod);
}
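A usage sketch, assuming the function above is named distance_squared_2pairs and the arrays are laid out and aligned as its comments describe:
#include <stdio.h>

int main(void)
{
    /* Two pairs of 3D points, stored back to back and 16-byte aligned. */
    double a[6]  __attribute__ ((aligned (16))) = { 0, 0, 0,  1, 1, 1 };
    double b[6]  __attribute__ ((aligned (16))) = { 1, 2, 2,  1, 1, 4 };
    double d2[2] __attribute__ ((aligned (16)));

    distance_squared_2pairs(a, b, d2);
    printf("%f %f\n", d2[0], d2[1]);   /* expected: 9.0 and 9.0 */
    return 0;
}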
