How to square two complex doubles with 256-bit AVX vectors?

Matt Scarpino gives a good explanation (although he admits he's not sure it's the optimal algorithm, I offer him my gratitude) for how to multiply two complex doubles with Intel's AVX intrinsics. Here's his method, which I've verified:
__m256d vec1 = _mm256_setr_pd(4.0, 5.0, 13.0, 6.0);
__m256d vec2 = _mm256_setr_pd(9.0, 3.0, 6.0, 7.0);
__m256d neg = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0);
/* Step 1: Multiply vec1 and vec2 */
__m256d vec3 = _mm256_mul_pd(vec1, vec2);
/* Step 2: Switch the real and imaginary elements of vec2 */
vec2 = _mm256_permute_pd(vec2, 0x5);
/* Step 3: Negate the imaginary elements of vec2 */
vec2 = _mm256_mul_pd(vec2, neg);
/* Step 4: Multiply vec1 and the modified vec2 */
__m256d vec4 = _mm256_mul_pd(vec1, vec2);
/* Horizontally subtract the elements in vec3 and vec4 */
vec1 = _mm256_hsub_pd(vec3, vec4);
/* Display the elements of the result vector */
double* res = (double*)&vec1;
printf("%lf %lf %lf %lf\n", res[0], res[1], res[2], res[3]);
My problem is that I want to square two complex doubles. I tried to use Matt's technique like so:
struct cmplx a;
struct cmplx b;
a.r = 2.5341;
a.i = 1.843;
b.r = 1.3941;
b.i = 0.93;
__m256d zzs = squareZ(a, b);
double* res = (double*) &zzs;
printf("\nA: %f + %f, B: %f + %f\n", res[0], res[1], res[2], res[3]);
Using Haskell's complex arithmetic, I have verified the results are correct except, as you can see, the real part of B:
A: 3.025014 + 9.340693, B: 0.000000 + 2.593026
So I have two questions really: is there a better (simpler and/or faster) way to square two complex doubles with AVX intrinsics? If not, how can I modify Matt's code to do it?
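For checking vector output, a scalar reference is handy. The sketch below assumes `struct cmplx` is a plain pair of doubles named `r` and `i` (the question does not show its definition), and `csquare_ref` is a hypothetical helper name. For the inputs above it gives a real part of about 1.078615 for B, so that is the value a correct AVX version should print instead of 0.

```c
/* Assumed layout: the question does not show the definition of struct cmplx. */
struct cmplx {
    double r;  /* real part */
    double i;  /* imaginary part */
};

/* Hypothetical scalar reference: (r + i*j)^2 = (r*r - i*i) + (2*r*i)*j */
struct cmplx csquare_ref(struct cmplx z)
{
    struct cmplx out;
    out.r = z.r * z.r - z.i * z.i;
    out.i = 2.0 * z.r * z.i;
    return out;
}
```

Comparing each element of the vector result against this reference (with a small tolerance) makes it obvious which lane is wrong.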

This answer covers the general case of multiplying two arrays of complex numbers
Ideally, store your data in separate real and imaginary arrays, so you can just load contiguous vectors of real and imaginary parts. That makes it free to do the cross-multiplying (just use different registers / variables) instead of having to shuffle things around within a vector.
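As a rough illustration of the split layout, here is a minimal sketch of a multiply over separate real and imaginary arrays. The function name `cmul_split` and its parameter order are assumptions made for this example, not anything from the original code. Each cross product simply reads from a different array, so no shuffles are needed and the loop is a straightforward target for auto-vectorization.

```c
#include <stddef.h>

/* Hypothetical split-format multiply: dst[k] = A[k] * B[k], with real and
   imaginary parts stored in separate arrays. */
void cmul_split(size_t n,
                double *restrict dst_re, double *restrict dst_im,
                const double *restrict a_re, const double *restrict a_im,
                const double *restrict b_re, const double *restrict b_im)
{
    for (size_t k = 0; k < n; k++) {
        /* real = rA*rB - iA*iB, imag = rA*iB + iA*rB */
        dst_re[k] = a_re[k] * b_re[k] - a_im[k] * b_im[k];
        dst_im[k] = a_re[k] * b_im[k] + a_im[k] * b_re[k];
    }
}
```

With the inputs from the question, (4+5i)*(9+3i) and (13+6i)*(6+7i), this produces 21+57i and 36+127i.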
You can convert between interleaved double complex style and SIMD-friendly separate-vectors style on the fly fairly cheaply, subject to the vagaries of AVX in-lane shuffles. e.g. very cheaply with unpacklo / unpackhi shuffles to de-interleave or to re-interleave within a lane, if you don't care about the actual order of the data within the temporary vector.
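The in-lane behaviour is easy to get wrong, so here is a scalar model of what the 256-bit unpacklo / unpackhi shuffles do with four doubles (two lanes of two). The `_model` function names are made up for this sketch; they only mirror the element selection of `_mm256_unpacklo_pd` and `_mm256_unpackhi_pd`. The point is that applying the same pair of shuffles a second time undoes the scrambled order.

```c
/* Scalar models of the in-lane 256-bit shuffles (4 doubles = 2 lanes of 2). */
void unpacklo_model(const double a[4], const double b[4], double out[4])
{
    out[0] = a[0]; out[1] = b[0];   /* low lane: low element of a, then of b */
    out[2] = a[2]; out[3] = b[2];   /* high lane: low element of a, then of b */
}

void unpackhi_model(const double a[4], const double b[4], double out[4])
{
    out[0] = a[1]; out[1] = b[1];   /* low lane: high element of a, then of b */
    out[2] = a[3]; out[3] = b[3];   /* high lane: high element of a, then of b */
}
```

Feeding in two packed vectors [z0 z1] and [z2 z3] gives real parts in the order z0 z2 z1 z3 (and likewise for the imaginary parts). Unpacking the real and imaginary vectors against each other restores [z0 z1] and [z2 z3].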
It's actually so cheap to do this shuffle that doing it on the fly for a single complex multiply comes out somewhat ahead of (even a tweaked version of) Matt's code, especially on CPUs that support FMA. This requires producing results in groups of 4 complex doubles (2 result vectors).
If you need to produce only one result vector at a time, I also came up with an alternative to Matt's algorithm that can use FMA (actually FMADDSUB) and avoid the separate sign-change insn.
gcc auto-vectorizes a simple scalar complex-multiply loop into pretty good code, as long as you use -ffast-math. It deinterleaves like I suggested.
#include <complex.h>
// even with -ffast-math -ffp-contract=fast, clang doesn't manage to use vfmaddsubpd, instead using vmulpd and vaddsubpd :(
// gcc does use FMA though.
// auto-vectorizes with a lot of extra shuffles
void cmul(double complex *restrict dst,
const double complex *restrict A, const double complex *restrict B)
{ // clang and gcc change strategy slightly for i<1 or i<2, vs. i<4
for (int i=0; i<4 ; i++) {
dst[i] = A[i] * B[i];
}
}
See the asm on the Godbolt compiler explorer. I'm not sure how good clang's asm is; it uses a lot of 64b->128b VMOVDDUP broadcast-loads. This form is handled purely in the load ports on Intel CPUs (see Agner Fog's insn tables), but it's still a lot of operations. As mentioned earlier, gcc uses 4 VPERMPD shuffles to reorder within lanes before multiplying / FMA, then another 4 VPERMPD to reorder the results before combining them with VSHUFPD. This is 8 extra shuffles for 4 complex multiplies.
Converting gcc's version back to intrinsics and removing the redundant shuffles gives optimal code. (gcc apparently wants its temporaries to be in A B C D order instead of the A C B D order resulting from the in-lane behaviour of VUNPCKLPD (_mm256_unpacklo_pd)).
I put the code on Godbolt, along with a tweaked version of Matt's code. So you can play around with different compiler options, and also different compiler versions.
// multiplies 4 complex doubles each from A and B, storing the result in dst[0..3]
void cmul_manualvec(double complex *restrict dst,
const double complex *restrict A, const double complex *restrict B)
{
// low element first, little-endian style
__m256d A0 = _mm256_loadu_pd((double*)A); // [A0r A0i A1r A1i ] // [a b c d ]
__m256d A2 = _mm256_loadu_pd((double*)(A+2)); // [e f g h ]
__m256d realA = _mm256_unpacklo_pd(A0, A2); // [A0r A2r A1r A3r ] // [a e c g ]
__m256d imagA = _mm256_unpackhi_pd(A0, A2); // [A0i A2i A1i A3i ] // [b f d h ]
// the in-lane behaviour of this interleaving is matched by the same in-lane behaviour when we recombine.
__m256d B0 = _mm256_loadu_pd((double*)B); // [m n o p]
__m256d B2 = _mm256_loadu_pd((double*)(B+2)); // [q r s t]
__m256d realB = _mm256_unpacklo_pd(B0, B2); // [m q o s]
__m256d imagB = _mm256_unpackhi_pd(B0, B2); // [n r p t]
// desired: real=rArB - iAiB, imag=rAiB + rBiA
__m256d realprod = _mm256_mul_pd(realA, realB);
__m256d imagprod = _mm256_mul_pd(imagA, imagB);
__m256d rAiB = _mm256_mul_pd(realA, imagB);
__m256d rBiA = _mm256_mul_pd(realB, imagA);
// gcc and clang will contract these into FMA. (clang needs -ffp-contract=fast)
// Doing it manually would remove the option to compile for non-FMA targets
__m256d real = _mm256_sub_pd(realprod, imagprod); // [D0r D2r | D1r D3r]
__m256d imag = _mm256_add_pd(rAiB, rBiA); // [D0i D2i | D1i D3i]
// interleave the separate real and imaginary vectors back into packed format
__m256d dst0 = _mm256_shuffle_pd(real, imag, 0b0000); // [D0r D0i | D1r D1i]
__m256d dst2 = _mm256_shuffle_pd(real, imag, 0b1111); // [D2r D2i | D3r D3i]
_mm256_storeu_pd((double*)dst, dst0);
_mm256_storeu_pd((double*)(dst+2), dst2);
}
Godbolt asm output: gcc6.2 -O3 -ffast-math -ffp-contract=fast -march=haswell
vmovupd ymm0, YMMWORD PTR [rsi+32]
vmovupd ymm3, YMMWORD PTR [rsi]
vmovupd ymm1, YMMWORD PTR [rdx]
vunpcklpd ymm5, ymm3, ymm0
vunpckhpd ymm3, ymm3, ymm0
vmovupd ymm0, YMMWORD PTR [rdx+32]
vunpcklpd ymm4, ymm1, ymm0
vunpckhpd ymm1, ymm1, ymm0
vmulpd ymm2, ymm1, ymm3
vmulpd ymm0, ymm4, ymm3
vfmsub231pd ymm2, ymm4, ymm5 # separate mul/sub contracted into FMA
vfmadd231pd ymm0, ymm1, ymm5
vunpcklpd ymm1, ymm2, ymm0
vunpckhpd ymm0, ymm2, ymm0
vmovupd YMMWORD PTR [rdi], ymm1
vmovupd YMMWORD PTR [rdi+32], ymm0
vzeroupper
ret
For 4 complex multiplies (of 2 pairs of input vectors), my code uses:
4 loads (32B each)
2 stores (32B each)
6 in-lane shuffles (one for each input vector, one for each output)
2 VMULPD
2 VFMA...something
(only 4 shuffles if we can use the results in separated real and imag vectors, or 0 shuffles if the inputs are already in this format, too)
latency on Intel Skylake (not counting loads/stores): 14 cycles = 4c for 4 shuffles until the second VMULPD can start + 4 cycles (second VMULPD) + 4c (second vfmadd231pd) + 1c (shuffle first result vector ready 1c earlier) + 1c (shuffle second result vector)
So for throughput, this completely bottlenecks on the shuffle port. (1 shuffle per clock throughput, vs. 2 total MUL/FMA/ADD per clock on Intel Haswell and later). This is why packed storage is horrible: shuffles have limited throughput, and spending more instructions shuffling than on doing math is not good.
Matt Scarpino's code with my minor tweaks (repeated to do 4 complex multiplies). (See below for my rewrite of producing one vector at a time more efficiently).
the same 6 loads/stores
6 in-lane shuffles (HSUBPD is 2 shuffles and a subtract on current Intel and AMD CPUs)
4 multiplies
2 subtracts (which can't combine with the muls into FMAs)
An extra instruction (+ a constant) to flip the sign of the imaginary elements. Matt chose to multiply by 1.0 or -1.0, but the efficient choice is to XOR the sign bit (i.e. XORPD with -0.0).
latency on Intel Skylake for the first result vector: 11 cycles. 1c(vpermilpd and vxorpd in the same cycle) + 4c(second vmulpd) + 6c(vhsubpd). The first vmulpd overlaps with other ops, starting in the same cycle as the shuffle and vxorpd. Computation of a second result vector should interleave pretty nicely.
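To make the sign-flip item in the list above concrete, here is a scalar model of XORing with -0.0. The helper name `flip_sign_xor` is made up for this sketch. In vector code the equivalent would be something like `_mm256_xor_pd(v, _mm256_setr_pd(0.0, -0.0, 0.0, -0.0))`, which negates only the odd (imaginary) elements.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of XORPD with -0.0: flip only the sign bit of a double.
   memcpy does the bit reinterpretation without aliasing problems. */
double flip_sign_xor(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= UINT64_C(0x8000000000000000);  /* bit pattern of -0.0 */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Unlike multiplying by -1.0, this is a bitwise operation with no floating-point rounding or exception behaviour, and it is typically cheaper than a multiply.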
The major advantage of Matt's code is that it works with just one vector-width of complex multiplies at once, instead of requiring you to have 4 input vectors of data. It has somewhat lower latency. But note that my version doesn't need the 2 pairs of input vectors to be from contiguous memory, or related to each other at all. They get mixed together while processing, but the result is 2 separate 32B vectors.
My tweaked version of Matt's code is nearly as good (as the 4-at-a-time version) on CPUs without FMA (just costing an extra VXORPD), but significantly worse when it stops us from taking advantage of FMA. Also, it never has the results available in non-packed form, so you can't use the separated form as input to another multiply and skip the shuffling.
One vector result at a time, with FMA:
Don't use this if you're sometimes squaring, instead of multiplying two different complex numbers. This is like Matt's algorithm in that common subexpression elimination doesn't simplify.
I haven't typed in the C intrinsics for this, just worked out the data movement. Since all the shuffles are in-lane, I'll only show the low lane. Use the 256b versions of the relevant instructions to do the same shuffle in both lanes. They stay separate.
// MULTIPLY: for each AVX lane: am-bn, an+bm
r i r i
a b c d // input vectors: a + b*i, etc.
m n o p
Algorithm:
create bm bn with movshdup(a b) + mulpd
create bn bm with shufpd on the previous result. (or create n m with a shuffle before the mul)
create a a with movsldup(a b)
use fmaddsubpd to produce the final result: [a|a]*[m|n] -/+ [bn|bm].
Yes, SSE/AVX has ADDSUBPD to do alternating subtract/add in even/odd elements (in that order, presumably because of this use-case). FMA includes FMADDSUB132PD which subtracts and adds, (and the reverse, FMSUBADD which adds and subtracts).
Per 4 results: 6x shuffle, 2x mul, 2xfmaddsub. So unless I got something wrong, it's as efficient as the deinterleave method (when not squaring the same number). Skylake latency = 10c = 1+4+1 to create bn bm (overlapping with 1 cycle to create a a), + 4 (FMA). So it's one cycle lower latency than Matt's.
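Since the intrinsics were not typed in above, here is one possible sketch of the data movement. It is untested on real hardware and makes a few assumptions: the function name `cmul_fmaddsub` is made up, it uses the "shuffle before the mul" variant, and it uses the double-precision shuffles `_mm256_permute_pd` and `_mm256_movedup_pd` to do the duplicate-high and duplicate-low steps (the movshdup / movsldup names in the steps above are the single-precision forms). When AVX and FMA are not enabled at compile time, a portable fallback with the same results is used instead.

```c
#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* Hypothetical helper: multiplies two packed complex doubles per call.
   x = [a b | c d] and y = [m n | o p], real and imaginary parts interleaved.
   Writes [am-bn an+bm | co-dp cp+do] to out. */
void cmul_fmaddsub(const double x[4], const double y[4], double out[4])
{
#if defined(__AVX__) && defined(__FMA__)
    __m256d vx = _mm256_loadu_pd(x);
    __m256d vy = _mm256_loadu_pd(y);
    __m256d bb = _mm256_permute_pd(vx, 0xF);  /* [b b | d d]: duplicate imag */
    __m256d nm = _mm256_permute_pd(vy, 0x5);  /* [n m | p o]: swap each pair */
    __m256d aa = _mm256_movedup_pd(vx);       /* [a a | c c]: duplicate real */
    __m256d bnbm = _mm256_mul_pd(bb, nm);     /* [bn bm | dp do] */
    /* even elements: aa*vy - bnbm, odd elements: aa*vy + bnbm */
    _mm256_storeu_pd(out, _mm256_fmaddsub_pd(aa, vy, bnbm));
#else
    /* Portable fallback: same arithmetic, one complex number at a time. */
    for (int lane = 0; lane < 4; lane += 2) {
        double a = x[lane], b = x[lane + 1];
        double m = y[lane], n = y[lane + 1];
        out[lane]     = a * m - b * n;
        out[lane + 1] = a * n + b * m;
    }
#endif
}
```

That is 3 shuffles, 1 multiply and 1 FMADDSUB per result vector, which matches the per-4-results count above.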
On Bulldozer-family, it would be a win to shuffle both inputs to the first mul, so the mul->fmaddsub critical path stays inside the FMA domain (1 cycle lower latency). Doing it the other way helps stop silly compilers from making resource conflicts by doing the movsldup(a b) too early, and delaying the mulpd. (In a loop, though, many iterations will be in flight and bottleneck on the shuffle port.)
This is still better than Matt's for squaring (still save the XOR, and can use FMA), but we don't save any shuffles:
// SQUARING: for each AVX lane: aa-bb, 2*ab
// ab bb // movshdup + mul
// bb ab // ^ -> shufpd
// a a // movsldup
// aa-bb ab+ab // fmaddsubpd : [a|a]*[a|b] -/+ [bb|ab]
// per 4 results: 6x shuffle, 2x mul, 2xfmaddsub
I also played around with some possibilities like (a+b) * (a+b) = aa+2ab+bb, or (r-i)*(r+i) = rr - ii but didn't get anywhere. Rounding between steps means that FP math doesn't cancel perfectly, so doing something like this wouldn't even produce exactly identical results.

See my other answer for the general case of multiplying different complex numbers, not squaring.
TL:DR: just use the code in my other answer with both inputs the same. Compilers do a good job with the redundancy.
Squaring simplifies the math slightly: instead of needing two different cross products, rAiB and rBiA are the same. But it still needs to get doubled, so basically we end up with 2 mul + 1 FMA + 1 add, instead of 2 mul + 2 FMA.
With the SIMD-unfriendly interleaved storage format, it gives a big boost to the deinterleave method, since there's only one input to shuffle. Matt's method doesn't benefit at all, since it calculates both cross products with the same vector multiply.
Using the cmul_manualvec() from my other answer:
// squares 4 complex doubles from A[0..3], storing the result in dst[0..3]
void csquare_manual(double complex *restrict dst,
const double complex *restrict A) {
cmul_manualvec(dst, A, A);
}
gcc and clang are smart enough to optimize away the redundancy of using the same input twice, so there's no need to make a custom version with intrinsics. clang does a bad job on the scalar auto-vectorizing version, so don't use that. I don't see anything to be gained over this asm output (from Godbolt):
clang3.9 -O3 -ffast-math -ffp-contract=fast -march=haswell
vmovupd ymm0, ymmword ptr [rsi]
vmovupd ymm1, ymmword ptr [rsi + 32]
vunpcklpd ymm2, ymm0, ymm1
vunpckhpd ymm0, ymm0, ymm1 # doing this shuffle first would let the first multiply start a cycle earlier. Silly compiler.
vmulpd ymm1, ymm0, ymm0 # imag*imag
vfmsub231pd ymm1, ymm2, ymm2 # real*real - imag*imag
vaddpd ymm0, ymm0, ymm0 # imag+imag = 2*imag
vmulpd ymm0, ymm2, ymm0 # 2*imag * real
vunpcklpd ymm2, ymm1, ymm0
vunpckhpd ymm0, ymm1, ymm0
vmovupd ymmword ptr [rdi], ymm2
vmovupd ymmword ptr [rdi + 32], ymm0
vzeroupper
ret
Possibly a different instruction ordering would have been better, to maybe reduce resource conflicts. e.g. double the real vector, since it's unpacked first, so the VADDPD could start a cycle sooner, before the imag*imag VMULPD. But reordering lines in the C source doesn't usually translate directly to asm reordering, because modern compilers are complex beasts. (IIRC, gcc doesn't particularly try to schedule instructions for x86, because out-of-order execution mostly hides those effects.)
Anyway, per 4 complex squares:
2 loads (down from 4) + 2 stores, for obvious reasons
4 shuffles (down from 6), again obvious
2 VMULPD (same)
1 FMA + 1 VADDPD (down from 2 FMA. VADDPD is lower latency than FMA on Haswell/Broadwell, same on Skylake).
Matt's version would still be 6 shuffles, and same everything else.

Related

SIMD unpack 12-bit fields to 16-bit

I need to unpack two 16-bit values from each 24 bits of input. (3 bytes -> 4 bytes). I already did it the naïve way but I'm not happy with the performance.
For example, InBuffer is __m128i:
value1 = (uint16_t)InBuffer[0:11] // bit-ranges
value2 = (uint16_t)InBuffer[12:23]
value3 = (uint16_t)InBuffer[24:35]
value4 = (uint16_t)InBuffer[36:47]
... for all the 128 bits.
After the unpacking, the values should be stored in a __m256i variable.
How can I solve this with AVX2? Probably using unpack / shuffle / permute intrinsics?

I'm assuming you're doing this in a loop over a large array. If you only used __m128i loads, you'd have 15 useful bytes, which would only produce 20 output bytes in your __m256i output. (Well, I guess the 21st byte of output would be present, as the 16th byte of the input vector, the first 8 bytes of a new bitfield. But then your next vector would need to shuffle differently.)
Much better to use 24 bytes of input, producing 32 bytes of output. Ideally with a load that splits down the middle, so the low 12 bytes are in the low 128-bit "lane", avoiding the need for a lane-crossing shuffle like _mm256_permutexvar_epi32. Instead you can just _mm256_shuffle_epi8 to put bytes where you want them, setting up for some shift/and.
// uses 24 bytes starting at p by doing a 32-byte load from p-4.
// Don't use this for the first vector of a page-aligned array, or the last
inline
__m256i unpack12to16(const char *p)
{
__m256i v = _mm256_loadu_si256( (const __m256i*)(p-4) );
// v= [ x H G F E | D C B A x ] where each letter is a 3-byte pair of two 12-bit fields, and x is 4 bytes of garbage we load but ignore
const __m256i bytegrouping =
_mm256_setr_epi8(4,5, 5,6, 7,8, 8,9, 10,11, 11,12, 13,14, 14,15, // low half uses last 12B
0,1, 1,2, 3,4, 4,5, 6, 7, 7, 8, 9,10, 10,11); // high half uses first 12B
v = _mm256_shuffle_epi8(v, bytegrouping);
// each 16-bit chunk has the bits it needs, but not in the right position
// in each chunk of 8 nibbles (4 bytes): [ f e d c | d c b a ]
__m256i hi = _mm256_srli_epi16(v, 4); // [ 0 f e d | xxxx ]
__m256i lo = _mm256_and_si256(v, _mm256_set1_epi32(0x00000FFF)); // [ 0000 | 0 c b a ]
return _mm256_blend_epi16(lo, hi, 0b10101010);
// nibbles in each pair of epi16: [ 0 f e d | 0 c b a ]
}
// Untested: I *think* I got my shuffle and blend controls right, but didn't check.
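Because the vector version above is marked untested, a scalar reference is useful for checking its output. This is a minimal sketch; the name `unpack12to16_ref` and its signature are assumptions for illustration. It assumes the same bit layout as the vector code: in each 3-byte group, the low 12-bit field comes first.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: every 3 input bytes hold two 12-bit fields, low field
   first. n_pairs is the number of 3-byte groups; out gets 2*n_pairs values. */
void unpack12to16_ref(const uint8_t *in, size_t n_pairs, uint16_t *out)
{
    for (size_t k = 0; k < n_pairs; k++) {
        const uint8_t *p = in + 3 * k;
        /* low field: all of byte 0 plus the low nibble of byte 1 */
        out[2 * k]     = (uint16_t)(p[0] | ((p[1] & 0x0F) << 8));
        /* high field: high nibble of byte 1 plus all of byte 2 */
        out[2 * k + 1] = (uint16_t)((p[1] >> 4) | (p[2] << 4));
    }
}
```

Running both versions over the same 24 input bytes and comparing the 16 output words would confirm (or refute) the shuffle and blend constants.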
It compiles like this (Godbolt) with clang -O3 -march=znver2. Of course an inline version would load the vector constants once, outside a loop.
unpack12to16(char const*): # #unpack12to16(char const*)
vmovdqu ymm0, ymmword ptr [rdi - 4]
vpshufb ymm0, ymm0, ymmword ptr [rip + .LCPI0_0] # ymm0 = ymm0[4,5,5,6,7,8,8,9,10,11,11,12,13,14,14,15,16,17,17,18,19,20,20,21,22,23,23,24,25,26,26,27]
vpsrlw ymm1, ymm0, 4
vpand ymm0, ymm0, ymmword ptr [rip + .LCPI0_1]
vpblendw ymm0, ymm0, ymm1, 170 # ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7],ymm0[8],ymm1[9],ymm0[10],ymm1[11],ymm0[12],ymm1[13],ymm0[14],ymm1[15]
ret
On Intel CPUs (before Ice Lake) vpblendw only runs on port 5 (https://uops.info/), competing with vpshufb (...shuffle_epi8). But it's a single uop (unlike vpblendvb variable-blend) with an immediate control. Still, that means a back-end ALU bottleneck of at best one vector per 2 cycles on Intel. If your src and dst are hot in L2 cache (or maybe only L1d), that might be the bottleneck, but this is already 5 uops for the front end, so with loop overhead and a store you're already close to a front-end bottleneck.
Blending with another vpand / vpor would cost more front-end uops but would mitigate the back-end bottleneck on Intel (before Ice Lake). It would be worse on AMD, where vpblendw can run on any of the 4 FP execution ports, and worse on Ice Lake where vpblendw can run on p1 or p5. And like I said, cache load/store throughput might be a bigger bottleneck than port 5 anyway, so fewer front-end uops are definitely better to let out-of-order exec see farther.
This may not be optimal; perhaps there's some way to set up for vpunpcklwd by getting the even (low) and odd (high) bit fields into the bottom 8 bytes of two separate input vectors even more cheaply? Or set up so we can blend with OR instead of needing to clear garbage in one input with vpblendw which only runs on port 5 on Skylake?
Or something we can do with vpsrlvd? (But not vpsrlvw - that would require AVX-512).
If you have AVX512VBMI, vpmultishiftqb is a parallel bitfield-extract. You'd just need to shuffle the right 3-byte pairs into the right 64-bit SIMD elements, then one _mm256_multishift_epi64_epi8 to put the good bits where you want them, and a _mm256_and_si256 to zero the high 4 bits of each 16-bit field will do the trick. (Can't quite take care of everything with 0-masking, or shuffling some zeros into the input for multishift, because there won't be any contiguous with the low 12-bit field.) Or you could set up for just an srli_epi16 that works for both low and high, instead of needing an AND constant, by having the multishift bitfield-extract line up both output fields with the bits you want at the top of the 16-bit element.
This may also allow a shuffle with larger granularity than bytes, although vpermb is actually fast on CPUs with AVX512VBMI, and unfortunately Ice Lake's vpermw is slower than vpermb.
With AVX-512 but not AVX512VBMI, working in 256-bit chunks lets us do the same thing as AVX2 but avoiding the blend. Instead, use merge-masking for the right shift, or vpsrlvw with a control vector to only shift the odd elements. For 256-bit vectors, this is probably as good as vpmultishiftqb.

Optimization of matrix and vector multiplication in C

I have a function that gets a 3 x 3 matrix and a 3 x 3000 vector, and multiplies them.
All the calculations are done in double precision (64-bit).
The function is called about 3.5 million times so it should be optimized.
#define MATRIX_DIM 3
#define VECTOR_LEN 3000
typedef struct {
double a;
double b;
double c;
} vector_st;
double matrix[MATRIX_DIM][MATRIX_DIM];
vector_st vector[VECTOR_LEN];
inline void rotate_arr(double input_matrix[][MATRIX_DIM], vector_st *input_vector, vector_st *output_vector)
{
int i;
for (i = 0; i < VECTOR_LEN; i++) {
output_vector[i].a = input_matrix[0][0] * input_vector[i].a +
input_matrix[0][1] * input_vector[i].b +
input_matrix[0][2] * input_vector[i].c;
output_vector[i].b = input_matrix[1][0] * input_vector[i].a +
input_matrix[1][1] * input_vector[i].b +
input_matrix[1][2] * input_vector[i].c;
output_vector[i].c = input_matrix[2][0] * input_vector[i].a +
input_matrix[2][1] * input_vector[i].b +
input_matrix[2][2] * input_vector[i].c;
}
}
I'm all out of ideas on how to optimize it, because it's inline, data access is sequential, and the function is short and pretty straightforward.
It can be assumed that the vector is always the same and only the matrix is changing if it will boost performance.

One easy-to-fix problem here is that compilers assume that the matrix and the output vectors may alias. As seen here in the second function, that causes code to be generated that is less efficient and significantly larger. This can be fixed simply by adding restrict to the output pointer. Doing only this already helps and keeps the code free from platform-specific optimization, but it relies on auto-vectorization to exploit the performance increases of the past two decades.
Auto-vectorization is evidently still too immature for the task: both Clang and GCC generate way too much shuffling of the data. This should improve in future compilers, but for now even a case like this (which doesn't seem inherently super hard) needs manual help, such as this (not tested, though):
void rotate_arr_avx(double input_matrix[][MATRIX_DIM], vector_st *input_vector, vector_st * restrict output_vector)
{
__m256d col0, col1, col2, a, b, c, t;
int i;
// using set macros like this is kind of dirty, but it's outside the loop anyway
col0 = _mm256_set_pd(0.0, input_matrix[2][0], input_matrix[1][0], input_matrix[0][0]);
col1 = _mm256_set_pd(0.0, input_matrix[2][1], input_matrix[1][1], input_matrix[0][1]);
col2 = _mm256_set_pd(0.0, input_matrix[2][2], input_matrix[1][2], input_matrix[0][2]);
for (i = 0; i < VECTOR_LEN; i++) {
a = _mm256_set1_pd(input_vector[i].a);
b = _mm256_set1_pd(input_vector[i].b);
c = _mm256_set1_pd(input_vector[i].c);
t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0, a), _mm256_mul_pd(col1, b)), _mm256_mul_pd(col2, c));
// this stores one element too many, so ensure 8 bytes of padding exist after the array
_mm256_storeu_pd(&output_vector[i].a, t);
}
}
Writing it this way significantly improves what compilers do with it, now compiling to a nice and tight loop without all the nonsense. Earlier the code hurt to look at, but with this the loop now looks like this (GCC 8.1, with FMA enabled), which is actually readable:
.L2:
vbroadcastsd ymm2, QWORD PTR [rsi+8+rax]
vbroadcastsd ymm1, QWORD PTR [rsi+16+rax]
vbroadcastsd ymm0, QWORD PTR [rsi+rax]
vmulpd ymm2, ymm2, ymm4
vfmadd132pd ymm1, ymm2, ymm3
vfmadd132pd ymm0, ymm1, ymm5
vmovupd YMMWORD PTR [rdx+rax], ymm0
add rax, 24
cmp rax, 72000
jne .L2
This has an obvious deficiency: only 3 of the 4 double precision slots of the 256bit AVX vectors are actually used. If the data format of the vector was changed to for example AAAABBBBCCCC repeating, a totally different approach could be used, namely broadcasting the matrix elements instead of the vector elements, then multiplying the broadcasted matrix element by the A component of 4 different vector_sts at once.
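As a rough sketch of that idea, the version below goes one step further than the AAAABBBBCCCC blocking and assumes fully separate arrays for the a, b and c components. The name `rotate_split` and its parameter list are assumptions for illustration. Each matrix element is a loop-invariant scalar, so a vectorizing compiler can broadcast it once outside the loop and use every SIMD slot for real work.

```c
#include <stddef.h>

/* Hypothetical split-array variant: a[], b[], c[] each hold one component of
   all n vectors, and the outputs use the same layout. */
void rotate_split(const double m[3][3], size_t n,
                  const double *restrict a, const double *restrict b,
                  const double *restrict c,
                  double *restrict out_a, double *restrict out_b,
                  double *restrict out_c)
{
    for (size_t k = 0; k < n; k++) {
        out_a[k] = m[0][0] * a[k] + m[0][1] * b[k] + m[0][2] * c[k];
        out_b[k] = m[1][0] * a[k] + m[1][1] * b[k] + m[1][2] * c[k];
        out_c[k] = m[2][0] * a[k] + m[2][1] * b[k] + m[2][2] * c[k];
    }
}
```

The restrict qualifiers matter here for the same aliasing reason as in the original function: without them the compiler has to assume the outputs may overlap the inputs.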
Another thing we can try, without even changing the data format, is processing more than one matrix at the same time, which helps to re-use loads from the input_vector to increase arithmetic intensity.
void rotate_arr_avx(double input_matrixA[][MATRIX_DIM], double input_matrixB[][MATRIX_DIM], vector_st *input_vector, vector_st * restrict output_vectorA, vector_st * restrict output_vectorB)
{
__m256d col0A, col1A, col2A, a, b, c, t, col0B, col1B, col2B;
int i;
// using set macros like this is kind of dirty, but it's outside the loop anyway
col0A = _mm256_set_pd(0.0, input_matrixA[2][0], input_matrixA[1][0], input_matrixA[0][0]);
col1A = _mm256_set_pd(0.0, input_matrixA[2][1], input_matrixA[1][1], input_matrixA[0][1]);
col2A = _mm256_set_pd(0.0, input_matrixA[2][2], input_matrixA[1][2], input_matrixA[0][2]);
col0B = _mm256_set_pd(0.0, input_matrixB[2][0], input_matrixB[1][0], input_matrixB[0][0]);
col1B = _mm256_set_pd(0.0, input_matrixB[2][1], input_matrixB[1][1], input_matrixB[0][1]);
col2B = _mm256_set_pd(0.0, input_matrixB[2][2], input_matrixB[1][2], input_matrixB[0][2]);
for (i = 0; i < VECTOR_LEN; i++) {
a = _mm256_set1_pd(input_vector[i].a);
b = _mm256_set1_pd(input_vector[i].b);
c = _mm256_set1_pd(input_vector[i].c);
t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0A, a), _mm256_mul_pd(col1A, b)), _mm256_mul_pd(col2A, c));
// this stores one element too many, so ensure 8 bytes of padding exist after the array
_mm256_storeu_pd(&output_vectorA[i].a, t);
t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0B, a), _mm256_mul_pd(col1B, b)), _mm256_mul_pd(col2B, c));
_mm256_storeu_pd(&output_vectorB[i].a, t);
}
}

Handling zeroes in _mm256_rsqrt_ps()

Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing:
_mm256_mul_ps(_mm256_rsqrt_ps(eightFloats),
eightFloats);
Is the way to go for that extra bit of performance and avoiding a pipeline stall.
Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better way, or am I going to run into problems under certain conditions?
_mm256_mul_ps(_mm256_rsqrt_ps(_mm256_max_ps(eightFloats,
_mm256_set1_ps(0.1))),
eightFloats);
My code is for a vertical application, so I can assume that it will be running on a Haswell CPU (i7-4810MQ), so FMA/AVX2 can be used. The original code is approximately:
float vals[MAX];
int sum = 0;
for (int i = 0; i < MAX; i++)
{
int thisSqrt = (int) floor(sqrt(vals[i]));
sum += min(thisSqrt, 0x3F);
}
All the values of vals should be integer values. (Why everything isn't just int is a different question...)

tl;dr: See the end for code that compiles and should work.
To just solve the 0.0 problem, you could also special-case inputs of 0.0 with an FP compare of the source against 0.0. Use the compare result as a mask to zero out any NaNs resulting from 0 * +Infinity in sqrt(x) = x * rsqrt(x). Clang does this when autovectorizing. (But it uses blendps with the zeroed vector, instead of using the compare mask with andnps directly to zero or preserve elements.)
It would also be possible to use sqrt(x) ~= recip(rsqrt(x)), as suggested by njuffa. rsqrt(0) = +Inf. recip(+Inf) = 0. However, using two approximations would compound the relative error, which is a problem.
The thing you're missing:
Truncating to integer (instead of rounding) requires an accurate sqrt result when the input is a perfect square. If the result for 25*rsqrt(25) is 4.999999 or something (instead of 5.00001), you'll add 4 instead of 5.
Even with a Newton-Raphson iteration, rsqrtps isn't perfectly accurate the way sqrtps is, so it might still give 5.0 - 1ulp. (1ulp = one unit in the last place = lowest bit of the mantissa).
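The truncation problem can be illustrated with a small scalar model. This is only a sketch in double precision with hypothetical helper names. It does not reproduce what rsqrtps returns on any particular CPU, and the 2^-12 starting error is picked for illustration (it is within the 1.5*2^(-12) bound). In exact arithmetic one Newton-Raphson step takes an estimate with relative error d to one with relative error of about -1.5*d*d, so the refined value lands slightly below the true 1/sqrt(x) whichever side the initial error was on.

```c
/* One Newton-Raphson step for y ~= 1/sqrt(x): y' = y * (1.5 - 0.5*x*y*y). */
double rsqrt_nr_step(double x, double y)
{
    return y * (1.5 - 0.5 * x * y * y);
}

/* Truncating sqrt estimate built from a reciprocal-sqrt estimate y:
   sqrt(x) ~= x * y, then convert to int (rounds toward zero). */
int trunc_sqrt_from_rsqrt(double x, double y)
{
    return (int)(x * y);
}
```

For x = 25 and a starting estimate of 0.2 * (1 - 2^-12), the unrefined product is about 4.9988, and after one step it is about 4.9999996. Both truncate to 4 rather than 5, which is why a refinement step alone is not enough when the result is truncated.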
Also:
Newton Raphson formula explained
Newton Raphson SSE implementation performance (latency/throughput). Note that we care more about throughput than latency, since we're using it in a loop that doesn't do much else. sqrt isn't part of the loop-carried dep chain, so different iterations can have their sqrt calcs in flight at once.
It might be possible to kill 2 birds with one stone by adding a small constant before doing the (x+offset)*approx_rsqrt(x+offset) and then truncating to integer. Large enough to overcome the max relative error of 1.5*2^(-12), but small enough not to bump sqrt_approx(63*63-1+offset) up to 63 (the most sensitive case).
63*1.5*2^(-12) == 0.023071...
approx_sqrt(63*63-1) == 62.99206... +/- 0.023068..
Actually, we're screwed without a Newton iteration even without adding anything. approx_sqrt(63*63-1) could come out above 63.0 all by itself. n=36 is the largest value where the relative error in sqrt(n*n-1) + error is less than sqrt(n*n). GNU Calc:
define f(n) { local x=sqrt(n*n-1); local e=x*1.5*2^(-12); print x; print e, x+e; }
; f(36)
35.98610843089316319413
~0.01317850650545403926 ~35.99928693739861723339
; f(37)
36.9864840178138587015
~0.01354485498699237990 ~37.00002887280085108140
Does your source data have any properties that mean you don't have to worry about it being just below a large perfect square? e.g. is it always perfect squares?
You could check all possible input values, since the important domain is very small (integer FP values from 0..63*63) to see if the error in practice is small enough on Intel Haswell, but that would be a brittle optimization that could make your code break on AMD CPUs, or even on future Intel CPUs. Unfortunately, just coding to the ISA spec's guarantee that the relative error is up to 1.5*2^(-12) requires more instructions. I don't see any tricks that avoid a NR iteration.
If your upper limit was smaller (like 20), you could just do isqrt = static_cast<int> ((x+0.5)*approx_rsqrt(x+0.5)). You'd get 20 for 20*20, but always 19 for 20*20-1.
; define test_approx_sqrt(x, off) { local s=x*x+off; local sq=s/sqrt(s); local sq_1=(s-1)/sqrt(s-1); local e=1.5*2^(-12); print sq, sq_1; print sq*e, sq_1*e; }
; test_approx_sqrt(20, 0.5)
~20.01249609618950056874 ~19.98749609130668473087 # (x+0.5)/sqrt(x+0.5)
~0.00732879495710064718 ~0.00731963968187500662 # relative error
Note that val * (x +/- err) = val*x +/- val*err. IEEE FP mul produces results that are correctly rounded to 0.5ulp, so this should work for FP relative errors.
Anyway, I think you need one Newton-Raphson iteration.
The best bet is to add 0.5 to your input before doing an approx_sqrt using rsqrt. That sidesteps the 0/0 = NaN problem, and pushes the +/- error range all to one side of the whole number cut point (for numbers in the range we care about).
FP min/max instructions have the same performance as FP add, and will be on the critical path either way. Using an add instead of a max also solves the problem of results for perfect squares potentially being a few ulp below the correct result.
Compiler output: a decent starting point
I get pretty good autovectorization results from clang 3.7.1 with sum_int, with -fno-math-errno -funsafe-math-optimizations. -ffinite-math-only is not required (but even with the full -ffast-math, clang avoids sqrt(0) = NaN when using rsqrtps).
sum_fp doesn't auto-vectorize, even with the full -ffast-math.
However, clang's version suffers from the same problem as your idea: truncating an inexact result from rsqrt + NR, potentially giving the wrong integer. IDK if this is why gcc doesn't auto-vectorize, because it could have used sqrtps for a big speedup without changing the results. (At least, as long as all the floats are between 0 and INT_MAX^2; otherwise converting back to integer will give the "indefinite" result of INT_MIN: sign bit set, all other bits cleared.) This is a case where -ffast-math breaks your program, unless you use -mrecip=none or something.
See the asm output on godbolt from:
// autovectorizes with clang, but has rounding problems.
// Note the use of sqrtf, and that floorf before truncating to int is redundant. (removed because clang doesn't optimize away the roundps)
int sum_int(float vals[]){
int sum = 0;
for (int i = 0; i < MAX; i++) {
int thisSqrt = (int) sqrtf(vals[i]);
sum += std::min(thisSqrt, 0x3F);
}
return sum;
}
To manually vectorize with intrinsics, we can look at the asm output from -fno-unroll-loops (to keep things simple). I was going to include this in the answer, but then realized that it had problems.
Putting it together:
I think converting to int inside the loop is better than using floorf and then addps. roundps is a 2-uop instruction (6c latency) on Haswell (1uop in SnB/IvB). Worse, both uops require port1, so they compete with FP add / mul. cvttps2dq is a 1-uop instruction for port1, with 3c latency, and then we can use integer min and add to clamp and accumulate, so port5 gets something to do. Using an integer vector accumulator also means the loop-carried dependency chain is 1 cycle, so we don't need to unroll or use multiple accumulators to keep multiple iterations in flight. Smaller code is always better for the big picture (uop cache, L1 I-cache, branch predictors).
As long as we aren't in danger of overflowing 32bit accumulators, this seems to be the best choice. (Without having benchmarked anything or even tested it).
I'm not using the sqrt(x) ~= approx_recip(approx_rsqrt(x)) method, because I don't know how to do a Newton iteration to refine it (probably it would involve a division), and because the compounded error is larger.
Horizontal sum from this answer.
Complete but untested version:
#include <immintrin.h>
#define MAX 4096
// 2*sqrt(x) ~= 2*x*approx_rsqrt(x), with a Newton-Raphson iteration
// dividing by 2 is faster in the integer domain, so we don't do it
__m256 approx_2sqrt_ps256(__m256 x) {
// clang / gcc usually use -3.0 and -0.5. We could do the same by using fnmsub_ps (add 3 = subtract -3), so we can share constants
__m256 three = _mm256_set1_ps(3.0f);
//__m256 half = _mm256_set1_ps(0.5f); // we omit the *0.5 step
__m256 nr = _mm256_rsqrt_ps( x ); // initial approximation for Newton-Raphson
// 1/sqrt(x) ~= nr * (3 - x*nr * nr) * 0.5 = nr*(1.5 - x*0.5*nr*nr)
// sqrt(x) = x/sqrt(x) ~= (x*nr) * (3 - x*nr * nr) * 0.5
// 2*sqrt(x) ~= (x*nr) * (3 - x*nr * nr)
__m256 xnr = _mm256_mul_ps( x, nr );
__m256 three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three ); // -(xnr*nr) + 3
return _mm256_mul_ps( xnr, three_minus_muls );
}
// packed int32_t: correct results for inputs from 0 to well above 63*63
__m256i isqrt256_ps(__m256 x) {
__m256 offset = _mm256_set1_ps(0.5f); // or subtract -0.5, to maybe share constants with compiler-generated Newton iterations.
__m256 xoff = _mm256_add_ps(x, offset); // avoids 0*Inf = NaN, and rounding error before truncation
__m256 approx_2sqrt_xoff = approx_2sqrt_ps256(xoff);
__m256i i2sqrtx = _mm256_cvttps_epi32(approx_2sqrt_xoff);
return _mm256_srli_epi32(i2sqrtx, 1); // divide by 2 with truncation
// alternatively, we could mask the low bit to zero and divide by two outside the loop, but that has no advantage unless port0 turns out to be the bottleneck
}
__m256i isqrt256_ps_simple_exact(__m256 x) {
__m256 sqrt_x = _mm256_sqrt_ps(x);
__m256i isqrtx = _mm256_cvttps_epi32(sqrt_x);
return isqrtx;
}
int hsum_epi32_avx(__m256i x256){
__m128i xhi = _mm256_extracti128_si256(x256, 1);
__m128i xlo = _mm256_castsi256_si128(x256);
__m128i x = _mm_add_epi32(xlo, xhi);
__m128i hl = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
hl = _mm_add_epi32(hl, x);
x = _mm_shuffle_epi32(hl, _MM_SHUFFLE(2, 3, 0, 1));
hl = _mm_add_epi32(hl, x);
return _mm_cvtsi128_si32(hl);
}
int sum_int_avx(float vals[]){
__m256i sum = _mm256_setzero_si256();
__m256i upperlimit = _mm256_set1_epi32(0x3F);
for (int i = 0; i < MAX; i+=8) {
__m256 v = _mm256_loadu_ps(vals+i);
__m256i visqrt = isqrt256_ps(v);
// assert visqrt == isqrt256_ps_simple_exact(v) or something
visqrt = _mm256_min_epi32(visqrt, upperlimit);
sum = _mm256_add_epi32(sum, visqrt);
}
return hsum_epi32_avx(sum);
}
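Since the version above is untested, a scalar reference to compare it against may be useful. The following is a sketch, not part of the original answer: the names isqrt_clamped_ref and sum_int_ref are hypothetical, it takes the element count as a parameter instead of using MAX, and it assumes non-negative, non-NaN inputs. It uses only exact comparisons, so it does not depend on any square-root approximation.

```c
#include <stddef.h>

/* floor(sqrt(x)) clamped to `limit`, found by comparing against squares.
   Assumes x >= 0 and x is not NaN. */
static int isqrt_clamped_ref(float x, int limit) {
    int k = 0;
    while (k < limit && (float)((k + 1) * (k + 1)) <= x)
        k++;
    return k;
}

/* Scalar counterpart of sum_int_avx: sum of min(floor(sqrt(v)), 0x3F). */
static int sum_int_ref(const float *vals, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += isqrt_clamped_ref(vals[i], 0x3F);
    return sum;
}
```

Calling sum_int_ref(vals, MAX) on the same buffer as sum_int_avx(vals) and comparing the two sums would be one way to do the check hinted at in the loop comment.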
Compiles on godbolt to nice code, but I haven't tested it. clang makes slightly nicer code than gcc: clang uses broadcast-loads from 4B locations for the set1 constants, instead of repeating them at compile time into 32B constants. gcc also has a bizarre movdqa to copy a register.
Anyway, the whole loop winds up being only 9 vector instructions, compared to 12 for the compiler-generated sum_int version. It probably didn't notice the x*initial_guess(x) common-subexpressions that occur in the Newton-Raphson iteration formula when you're multiplying the result by x, or something like that. It also does an extra mulps instead of a psrld because it does the *0.5 before converting to int. So that's where the extra two mulps instructions come from, and there's the cmpps/blendvps.
sum_int_avx(float*):
vpxor ymm3, ymm3, ymm3
xor eax, eax
vbroadcastss ymm0, dword ptr [rip + .LCPI4_0] ; set1(0.5)
vbroadcastss ymm1, dword ptr [rip + .LCPI4_1] ; set1(3.0)
vpbroadcastd ymm2, dword ptr [rip + .LCPI4_2] ; set1(63)
.LBB4_1: ; latencies
vaddps ymm4, ymm0, ymmword ptr [rdi + 4*rax] ; 3c
vrsqrtps ymm5, ymm4 ; 7c
vmulps ymm4, ymm4, ymm5 ; x*nr ; 5c
vfnmadd213ps ymm5, ymm4, ymm1 ; 5c
vmulps ymm4, ymm4, ymm5 ; 5c
vcvttps2dq ymm4, ymm4 ; 3c
vpsrld ymm4, ymm4, 1 ; 1c this would be a mulps (but not on the critical path) if we did this in the FP domain
vpminsd ymm4, ymm4, ymm2 ; 1c
vpaddd ymm3, ymm4, ymm3 ; 1c
; ... (those 9 insns repeated: loop unrolling)
add rax, 16
cmp rax, 4096
jl .LBB4_1
;... horizontal sum
IACA thinks that with no unroll, Haswell can sustain a throughput of one iteration per 4.15 cycles, bottlenecking on ports 0 and 1. So potentially you could shave a cycle by accumulating sqrt(x)*2 (with truncation to even numbers using _mm256_and_si256), and only divide by two outside the loop.
Also according to IACA, the latency of a single iteration is 38 cycles on Haswell. I only get 31c, so probably it's including L1 load-use latency or something. Anyway, this means that to saturate the execution units, operations from 8 iterations have to be in flight at once. That's 8 * ~14 unfused-domain uops = 112 unfused-uops (or less with clang's unroll) that have to be in flight at once. Haswell's scheduler is actually only 60 entries, but the ROB is 192 entries. The early uops from early iterations will already have executed, so they only need to be tracked in the ROB, not also in the scheduler. Many of the slow uops are at the beginning of each iteration, though. Still, there's reason to hope that this will come close-ish to saturating ports 0 and 1. Unless data is hot in L1 cache, cache/memory bandwidth will probably be the bottleneck.
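The "8 iterations in flight" figure is just latency divided by throughput, rounded up. A minimal sketch of that arithmetic (the function name is hypothetical; the inputs are the numbers quoted above):

```c
/* Smallest whole number of iterations that must overlap so that a new one
   can start every `cycles_per_iter` cycles while each one takes `latency`
   cycles from start to finish. */
static int iterations_in_flight(double latency, double cycles_per_iter) {
    int n = (int)(latency / cycles_per_iter);
    if (n * cycles_per_iter < latency)
        n++;                    /* round up */
    return n;
}
```

With the hand-counted 31c latency and 4.15 cycles per iteration this gives 8, and 8 * 14 = 112 uops. Plugging in IACA's 38c instead would give 10 iterations.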
Interleaving operations from multiple dep chains would also be better. When clang unrolls, it puts all 9 instructions for one iteration ahead of all 9 instructions for another iteration. It uses a surprisingly small number of registers, so it would be possible to have instructions for 2 or 4 iterations mixed together. This is the sort of thing compilers are supposed to be good at, but which is cumbersome for humans. :/
It would also be slightly more efficient if the compiler chose a one-register addressing mode, so the load could micro-fuse with the vaddps. gcc does this.

When the compiler reorders AVX instructions on Sandy, does it affect performance?

Please do not say this is premature microoptimization. I want to understand, as much as it is possible given my limited knowledge, how the described SB feature and assembly works, and make sure that my code makes use of this architectural feature. Thank you for understanding.
I started learning intrinsics a few days ago, so the answer may seem obvious to some, but I don't have a reliable source of information to figure this out.
I need to optimize some code for a Sandy Bridge CPU (this is a requirement). Now I know that it can do one AVX multiply and one AVX add per cycle, and read this paper:
http://research.colfaxinternational.com/file.axd?file=2012%2F7%2FColfax_CPI.pdf
which shows how it can be done in C++. So, the problem is that my code won't get auto-vectorized using Intel's compiler (which is another requirement for the task), so I decided to implement it manually using intrinsics like this:
__sum1 = _mm256_setzero_pd();
__sum2 = _mm256_setzero_pd();
__sum3 = _mm256_setzero_pd();
sum = 0;
for(kk = k; kk < k + BS && kk < aW; kk+=12)
{
const double *a_addr = &A[i * aW + kk];
const double *b_addr = &newB[jj * aW + kk];
__aa1 = _mm256_load_pd((a_addr));
__bb1 = _mm256_load_pd((b_addr));
__sum1 = _mm256_add_pd(__sum1, _mm256_mul_pd(__aa1, __bb1));
__aa2 = _mm256_load_pd((a_addr + 4));
__bb2 = _mm256_load_pd((b_addr + 4));
__sum2 = _mm256_add_pd(__sum2, _mm256_mul_pd(__aa2, __bb2));
__aa3 = _mm256_load_pd((a_addr + 8));
__bb3 = _mm256_load_pd((b_addr + 8));
__sum3 = _mm256_add_pd(__sum3, _mm256_mul_pd(__aa3, __bb3));
}
__sum1 = _mm256_add_pd(__sum1, _mm256_add_pd(__sum2, __sum3));
_mm256_store_pd(&vsum[0], __sum1);
The reason I manually unroll the loop like this is explained here:
Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell
They say you need to unroll by a factor of 3 to achieve the best performance on Sandy. My naive testing confirms that this indeed runs better than without unrolling or 4-fold unrolling.
OK, so here is the problem. The icl compiler from Intel Parallel Studio 15 generates this:
$LN149:
movsxd r14, r14d ;78.49
$LN150:
vmovupd ymm3, YMMWORD PTR [r11+r14*8] ;80.48
$LN151:
vmovupd ymm5, YMMWORD PTR [32+r11+r14*8] ;84.49
$LN152:
vmulpd ymm4, ymm3, YMMWORD PTR [r8+r14*8] ;82.56
$LN153:
vmovupd ymm3, YMMWORD PTR [64+r11+r14*8] ;88.49
$LN154:
vmulpd ymm15, ymm5, YMMWORD PTR [32+r8+r14*8] ;86.56
$LN155:
vaddpd ymm2, ymm2, ymm4 ;82.34
$LN156:
vmulpd ymm4, ymm3, YMMWORD PTR [64+r8+r14*8] ;90.56
$LN157:
vaddpd ymm0, ymm0, ymm15 ;86.34
$LN158:
vaddpd ymm1, ymm1, ymm4 ;90.34
$LN159:
add r14d, 12 ;76.57
$LN160:
cmp r14d, ebx ;76.42
$LN161:
jb .B1.19 ; Prob 82% ;76.42
To me, this looks like a mess, where the correct order (add next to multiply required to use the handy SB feature) is broken.
Question:
Will this assembly code leverage the Sandy Bridge feature I am referring to?
If not, what do I need to do in order to utilize the feature and prevent the code from becoming "tangled" like this?
Also, when there is only one loop iteration, the order is nice and clean, i.e. load, multiply, add, as it should be.

With x86 CPUs many people expect to get the maximum FLOPS from the dot product
for(int i=0; i<n; i++) sum += a[i]*b[i];
but this turns out not to be the case.
What can give the maximum FLOPS is this
for(int i=0; i<n; i++) sum += k*a[i];
where k is a constant. Why is the CPU not optimized for the dot product? I can speculate. One of the things CPUs are optimized for is BLAS. BLAS is considered a building block of many other routines.
The Level-1 and Level-2 BLAS routines become memory bandwidth bound as n increases. It's only the Level-3 routines (e.g. Matrix Multiplication) which are capable of being compute bound. This is because the Level-3 computations go as n^3 and the reads as n^2. So the CPU is optimized for the Level-3 routines. The Level-3 routines don't need to optimize for a single dot product. They only need to read from one matrix per iteration (sum += k*a[i]).
From this we can conclude that the number of bits needed to be read each cycle to get the maximum FLOPS for the Level-3 routines is
read_size = SIMD_WIDTH * num_MAC
where num_MAC is the number of multiply–accumulate operations that can be done each cycle.
               SIMD_WIDTH (bits)   num_MAC   read_size (bits)   ports used
Nehalem        128                 1         128                128-bits on port 2
Sandy Bridge   256                 1         256                128-bits port 2 and 3
Haswell        256                 2         512                256-bits port 2 and 3
Skylake        512                 2         1024               ?
For Nehalem through Haswell this agrees with what the hardware is capable of. I don't actually know that Skylake will be able to read 1024 bits per clock cycle, but if it can't, AVX512 won't be very interesting, so I'm confident in my guess. A nice plot for Nehalem, Sandy Bridge, and Haswell for each port can be found at http://www.anandtech.com/show/6355/intels-haswell-architecture/8
So far I have ignored latency and dependency chains. To really get the maximum FLOPS you need to unroll the loop at least three times on Sandy Bridge (I use four because I find it inconvenient to work with multiples of three)
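The reason for unrolling is that each add into an accumulator has to wait for the previous add into the same accumulator. Assuming a 3-cycle FP add latency on Sandy Bridge with one add issued per cycle, at least three independent accumulators are needed to keep the adder busy. A plain C sketch with four accumulators, as preferred above (the function name is hypothetical, and the compiler is still responsible for vectorizing it):

```c
#include <stddef.h>

/* Dot product with four independent partial sums, so consecutive adds
   belong to different dependency chains. */
static double dot_unrolled4(const double *a, const double *b, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that splitting the sum this way changes the order of the floating-point additions, so the result can differ in the last bits from a single-accumulator loop.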
The best way to answer your question about performance is to find the theoretical best performance you expect for your operation and then compare how close your code gets to it. I call this the efficiency. Doing this, you will find that despite the reordering of the instructions you see in the assembly, the performance is still good. But there are many other subtle issues you may need to consider. Here are three issues I encountered:
L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096
Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%
Difference in performance between MSVC and GCC for highly optimized matrix multp
I also suggest you consider using IACA to study the performance.
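As a rough sketch of the efficiency calculation described above (function and parameter names are hypothetical; the peak assumes every MAC unit does one multiply and one add on a full vector each cycle):

```c
/* Peak floating-point operations per cycle per core:
   SIMD lanes * MACs per cycle * 2 (one multiply plus one add per MAC). */
static int peak_flops_per_cycle(int simd_bits, int elem_bits, int num_mac) {
    return (simd_bits / elem_bits) * num_mac * 2;
}

/* Fraction of the theoretical peak reached by a measured rate. */
static double efficiency(double measured_gflops, double ghz, int cores,
                         int flops_per_cycle) {
    return measured_gflops / (ghz * cores * flops_per_cycle);
}
```

For doubles this gives 8 FLOPs/cycle on Sandy Bridge (256 bits, one MAC) and 16 on Haswell (256 bits, two MACs). A single core at a hypothetical 3.0 GHz measured at 12 GFLOPS would therefore be running at 50% efficiency on Sandy Bridge.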

Fastest way to compute distance squared

My code relies heavily on computing distances between two points in 3D space.
To avoid the expensive square root I use the squared distance throughout.
But still it takes up a major fraction of the computing time and I would like to replace my simple function with something even faster.
I now have:
double distance_squared(double *a, double *b)
{
double dx = a[0] - b[0];
double dy = a[1] - b[1];
double dz = a[2] - b[2];
return dx*dx + dy*dy + dz*dz;
}
I also tried using a macro to avoid the function call but it doesn't help much.
#define DISTANCE_SQUARED(a, b) ((a)[0]-(b)[0])*((a)[0]-(b)[0]) + ((a)[1]-(b)[1])*((a)[1]-(b)[1]) + ((a)[2]-(b)[2])*((a)[2]-(b)[2])
I thought about using SIMD instructions but could not find a good example or complete list of instructions (ideally some multiply+add on two vectors).
GPU's are not an option since only one set of points is known at each function call.
What would be the fastest way to compute the distance squared?

A good compiler will optimize that about as well as you will ever manage. A good compiler will use SIMD instructions if it deems that they are going to be beneficial. Make sure that you turn on all such possible optimizations for your compiler. Unfortunately, vectors of dimension 3 don't tend to sit well with SIMD units.
I suspect that you will simply have to accept that the code produced by the compiler is probably pretty close to optimal and that no significant gains can be made.

The first obvious thing would be to use the restrict keyword.
As it is now, a and b are aliasable (and thus, from the compiler's point of view, which assumes the worst possible case, are aliased). No compiler will auto-vectorize this, as it would be wrong to do so.
Worse, not only can the compiler not vectorize such a loop, in case you also store (luckily not in your example), it also must re-load values each time. Always be clear about aliasing, as it greatly impacts the compiler.
Next, if you can live with that, use float instead of double and pad to 4 floats even if one is unused, this is a more "natural" data layout for the majority of CPUs (this is somewhat platform specific, but 4 floats is a good guess for most platforms -- 3 doubles, a.k.a. 1.5 SIMD registers on "typical" CPUs, is not optimal anywhere).
(For a hand-written SIMD implementation (which is harder than you think), first and foremost be sure to have aligned data. Next, look into what latencies your instructions have on the target machine and do the longest ones first. For example, on pre-Prescott Intel it makes sense to first shuffle each component into a register and then multiply it with itself, even though that uses 3 multiplies instead of one, because shuffles have a long latency. On the later models, a shuffle takes a single cycle, so that would be a total anti-optimization.
Which again shows that leaving it to the compiler is not such a bad idea.)
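A minimal sketch of the restrict + padded-float suggestion, assuming C99 and that both points are stored as four floats with the unused fourth lane set to the same value (e.g. 0) in each. The function name is hypothetical, and whether the compiler actually turns this into a single SIMD subtract and multiply depends on the compiler and flags.

```c
/* Squared distance between two points stored as (x, y, z, pad).
   The pad lanes must be equal in a and b so they contribute 0. */
static float distance_squared_f4(const float *restrict a,
                                 const float *restrict b) {
    float sum = 0.0f;
    for (int i = 0; i < 4; i++) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
```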

The SIMD code to do this (using SSE3):
movaps xmm0,a
movaps xmm1,b
subps xmm0,xmm1
mulps xmm0,xmm0
haddps xmm0,xmm0
haddps xmm0,xmm0
but you need four value vectors (x,y,z,0) for this to work. If you've only got three values then you'd need to do a bit of fiddling about to get the required format which would cancel out any benefit of the above.
In general though, due to the superscalar pipelined architecture of the CPU, the best way to get performance is to do the same operation on lots of data; that way you can interleave the various steps and do a bit of loop unrolling to avoid pipeline stalls. The above code will definitely stall on the last three instructions based on the "can't use a value directly after it's modified" principle: each of those instructions has to wait for the result of the previous one, which isn't good in a pipelined system.
Doing the calculation on two or more different sets of points at the same time can remove the above bottleneck: whilst waiting for the result of one computation, you can start the computation of the next point:
movaps xmm0,a1
movaps xmm2,a2
movaps xmm1,b1
movaps xmm3,b2
subps xmm0,xmm1
subps xmm2,xmm3
mulps xmm0,xmm0
mulps xmm2,xmm2
haddps xmm0,xmm0
haddps xmm2,xmm2
haddps xmm0,xmm0
haddps xmm2,xmm2
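If the data can be rearranged, the "same operation on lots of data" idea can be taken further than interleaving two points. The sketch below is a different technique from the assembly above, not a translation of it: it assumes a hypothetical structure-of-arrays layout (one array per coordinate), which lets a vectorizing compiler process several points per instruction without any horizontal adds. All names are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical structure-of-arrays layout: one array per coordinate. */
struct points_soa {
    const float *x;
    const float *y;
    const float *z;
};

/* out[i] = squared distance between point i of a and point i of b. */
static void distance_squared_batch(struct points_soa a, struct points_soa b,
                                   float *restrict out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float dx = a.x[i] - b.x[i];
        float dy = a.y[i] - b.y[i];
        float dz = a.z[i] - b.z[i];
        out[i] = dx * dx + dy * dy + dz * dz;
    }
}
```

Each loop iteration is independent of the others, so there is no dependency chain between points for the pipeline to stall on.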

If you would like to optimize something, first profile the code and inspect the assembler output.
After compiling it with gcc -O3 (4.6.1) we'll have nice disassembled output with SIMD:
movsd (%rdi), %xmm0
movsd 8(%rdi), %xmm2
subsd (%rsi), %xmm0
movsd 16(%rdi), %xmm1
subsd 8(%rsi), %xmm2
subsd 16(%rsi), %xmm1
mulsd %xmm0, %xmm0
mulsd %xmm2, %xmm2
mulsd %xmm1, %xmm1
addsd %xmm2, %xmm0
addsd %xmm1, %xmm0

This type of problem often occurs in MD (molecular dynamics) simulations. Usually the amount of computation is reduced by cutoffs and neighbor lists, so fewer distances have to be calculated. The actual calculation of the squared distances, however, is done exactly as given in your question (with compiler optimizations and a fixed type float[3]).
So if you want to reduce the amount of squared calculations you should tell us more about the problem.
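To illustrate the cutoff idea: the squared distance is compared against the squared cutoff, so no square root is ever needed. This is a brute-force sketch with a hypothetical function name and a flat xyz-triple layout; a real neighbor list would additionally use cells or a similar scheme to avoid visiting every pair.

```c
#include <stddef.h>

/* Count pairs (i < j) closer than `cutoff`.
   pos holds n points as consecutive (x, y, z) triples. */
static size_t count_pairs_within(const double *pos, size_t n, double cutoff) {
    const double cutoff2 = cutoff * cutoff;   /* compare squared values */
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            double dx = pos[3 * i]     - pos[3 * j];
            double dy = pos[3 * i + 1] - pos[3 * j + 1];
            double dz = pos[3 * i + 2] - pos[3 * j + 2];
            if (dx * dx + dy * dy + dz * dz < cutoff2)
                count++;
        }
    }
    return count;
}
```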

Perhaps passing the 6 doubles directly as arguments could make it faster (because it could avoid the array dereference):
inline double distsquare_coord(double xa, double ya, double za,
double xb, double yb, double zb)
{
double dx = xa-xb; double dy=ya-yb; double dz=za-zb;
return dx*dx + dy*dy + dz*dz;
}
Or perhaps, if you have many points in the vicinity, you might compute a distance (to the same fixed other point) by linear approximation of the distances of other near points.

If you can rearrange your data to process two pairs of input vectors at once, you may use this code (SSE2 only)
// @brief Computes two squared distances between two pairs of 3D vectors
// @param a
// Pointer to the first pair of 3D vectors.
// The two vectors must be stored with stride 24, i.e. (a + 3) should point to the first component of the second vector in the pair.
// Must be aligned by 16 (2 doubles).
// @param b
// Pointer to the second pair of 3D vectors.
// The two vectors must be stored with stride 24, i.e. (b + 3) should point to the first component of the second vector in the pair.
// Must be aligned by 16 (2 doubles).
// @param c
// Pointer to the output 2 element array.
// Must be aligned by 16 (2 doubles).
// The two distances between a and b vectors will be written to c[0] and c[1] respectively.
// NOTE: the function name is missing in the original post; distance_squared_x2 is a placeholder.
void distance_squared_x2(const double * __restrict__ a, const double * __restrict__ b, double * __restrict__ c) {
// diff0 = ( a0.y - b0.y, a0.x - b0.x ) = ( d0.y, d0.x )
__m128d diff0 = _mm_sub_pd(_mm_load_pd(a), _mm_load_pd(b));
// diff1 = ( a1.x - b1.x, a0.z - b0.z ) = ( d1.x, d0.z )
__m128d diff1 = _mm_sub_pd(_mm_load_pd(a + 2), _mm_load_pd(b + 2));
// diff2 = ( a1.z - b1.z, a1.y - b1.y ) = ( d1.z, d1.y )
__m128d diff2 = _mm_sub_pd(_mm_load_pd(a + 4), _mm_load_pd(b + 4));
// prod0 = ( d0.y * d0.y, d0.x * d0.x )
__m128d prod0 = _mm_mul_pd(diff0, diff0);
// prod1 = ( d1.x * d1.x, d0.z * d0.z )
__m128d prod1 = _mm_mul_pd(diff1, diff1);
// prod2 = ( d1.z * d1.z, d1.y * d1.y )
__m128d prod2 = _mm_mul_pd(diff2, diff2);
// _mm_unpacklo_pd(prod0, prod2) = ( d1.y * d1.y, d0.x * d0.x )
// psum = ( d1.x * d1.x + d1.y * d1.y, d0.x * d0.x + d0.z * d0.z )
__m128d psum = _mm_add_pd(_mm_unpacklo_pd(prod0, prod2), prod1);
// _mm_unpackhi_pd(prod0, prod2) = ( d1.z * d1.z, d0.y * d0.y )
// dotprod = ( d1.x * d1.x + d1.y * d1.y + d1.z * d1.z, d0.x * d0.x + d0.y * d0.y + d0.z * d0.z )
__m128d dotprod = _mm_add_pd(_mm_unpackhi_pd(prod0, prod2), psum);
_mm_store_pd(c, dotprod);
}
