Optimization of matrix and vector multiplication in C

Optimization of matrix and vector multiplication in C - c

I have a function that gets a 3 x 3 matrix and a 3 x 4000 vector, and multiplies them.
All the calculation are done in double precision (64-bit).
The function is called about 3.5 million times so it should be optimized.
#define MATRIX_DIM 3
#define VECTOR_LEN 3000
typedef struct {
double a;
double b;
double c;
} vector_st;
double matrix[MATRIX_DIM][MATRIX_DIM];
vector_st vector[VACTOR_LEN];
inline void rotate_arr(double input_matrix[][MATRIX_DIM], vector_st *input_vector, vector_st *output_vector)
{
int i;
for (i = 0; i < VACTOR_LEN; i++) {
op_rotate_preset_arr[i].a = input_matrix[0][0] * input_vector[i].a +
input_matrix[0][1] * input_vector[i].b +
input_matrix[0][2] * input_vector[i].c;
op_rotate_preset_arr[i].b = input_matrix[1][0] * input_vector[i].a +
input_matrix[1][1] * input_vector[i].b +
input_matrix[1][2] * input_vector[i].c;
op_rotate_preset_arr[i].c = input_matrix[2][0] * input_vector[i].a +
input_matrix[2][1] * input_vector[i].b +
input_matrix[2][2] * input_vector[i].c;
}
}
I all out of ideas on how to optimize it because it's inline, data access is sequential and the function is short and pretty straight-forward.
It can be assumed that the vector is always the same and only the matrix is changing if it will boost performance.

One easy to fix problem here is that compilers assumes that the matrix and the output vectors may alias. As seen here in the second function, that causes code to be generated that is less efficient and significantly larger. This can be fixed simply by adding restrict to the output pointer. Doing only this already helps and keeps the code free from platform specific optimization, but relies on auto-vectorization in order to use the performance increases that have happened in the past two decades.
Auto-vectorization is evidently still too immature for the task, both Clang and GCC generate way too much shuffling around of the data. This should improve in future compilers, but for now even a case like this (that doesn't seem inherently super hard) needs manual help, such as this (not tested though)
void rotate_arr_avx(double input_matrix[][MATRIX_DIM], vector_st *input_vector, vector_st * restrict output_vector)
{
__m256d col0, col1, col2, a, b, c, t;
int i;
// using set macros like this is kind of dirty, but it's outside the loop anyway
col0 = _mm256_set_pd(0.0, input_matrix[2][0], input_matrix[1][0], input_matrix[0][0]);
col1 = _mm256_set_pd(0.0, input_matrix[2][1], input_matrix[1][1], input_matrix[0][1]);
col2 = _mm256_set_pd(0.0, input_matrix[2][2], input_matrix[1][2], input_matrix[0][2]);
for (i = 0; i < VECTOR_LEN; i++) {
a = _mm256_set1_pd(input_vector[i].a);
b = _mm256_set1_pd(input_vector[i].b);
c = _mm256_set1_pd(input_vector[i].c);
t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0, a), _mm256_mul_pd(col1, b)), _mm256_mul_pd(col2, c));
// this stores an element too much, ensure 8 bytes of padding exist after the array
_mm256_storeu_pd(&output_vector[i].a, t);
}
}
Writing it this way significantly improves what compilers do with it, now compiling to a nice and tight loop without all the nonsense. Earlier the code hurt to look at, but with this the loop now looks like this (GCC 8.1, with FMA enabled), which is actually readable:
.L2:
vbroadcastsd ymm2, QWORD PTR [rsi+8+rax]
vbroadcastsd ymm1, QWORD PTR [rsi+16+rax]
vbroadcastsd ymm0, QWORD PTR [rsi+rax]
vmulpd ymm2, ymm2, ymm4
vfmadd132pd ymm1, ymm2, ymm3
vfmadd132pd ymm0, ymm1, ymm5
vmovupd YMMWORD PTR [rdx+rax], ymm0
add rax, 24
cmp rax, 72000
jne .L2
This has an obvious deficiency: only 3 of the 4 double precision slots of the 256bit AVX vectors are actually used. If the data format of the vector was changed to for example AAAABBBBCCCC repeating, a totally different approach could be used, namely broadcasting the matrix elements instead of the vector elements, then multiplying the broadcasted matrix element by the A component of 4 different vector_sts at once.
An other thing we can try, without even changing the data format, is processing more than one matrix at the same time, which helps to re-use loads from the input_vector to increase arithmetic intensity.
void rotate_arr_avx(double input_matrixA[][MATRIX_DIM], double input_matrixB[][MATRIX_DIM], vector_st *input_vector, vector_st * restrict output_vectorA, vector_st * restrict output_vectorB)
{
__m256d col0A, col1A, col2A, a, b, c, t, col0B, col1B, col2B;
int i;
// using set macros like this is kind of dirty, but it's outside the loop anyway
col0A = _mm256_set_pd(0.0, input_matrixA[2][0], input_matrixA[1][0], input_matrixA[0][0]);
col1A = _mm256_set_pd(0.0, input_matrixA[2][1], input_matrixA[1][1], input_matrixA[0][1]);
col2A = _mm256_set_pd(0.0, input_matrixA[2][2], input_matrixA[1][2], input_matrixA[0][2]);
col0B = _mm256_set_pd(0.0, input_matrixB[2][0], input_matrixB[1][0], input_matrixB[0][0]);
col1B = _mm256_set_pd(0.0, input_matrixB[2][1], input_matrixB[1][1], input_matrixB[0][1]);
col2B = _mm256_set_pd(0.0, input_matrixB[2][2], input_matrixB[1][2], input_matrixB[0][2]);
for (i = 0; i < VECTOR_LEN; i++) {
a = _mm256_set1_pd(input_vector[i].a);
b = _mm256_set1_pd(input_vector[i].b);
c = _mm256_set1_pd(input_vector[i].c);
t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0A, a), _mm256_mul_pd(col1A, b)), _mm256_mul_pd(col2A, c));
// this stores an element too much, ensure 8 bytes of padding exist after the array
_mm256_storeu_pd(&output_vectorA[i].a, t);
t = _mm256_add_pd(_mm256_add_pd(_mm256_mul_pd(col0B, a), _mm256_mul_pd(col1B, b)), _mm256_mul_pd(col2B, c));
_mm256_storeu_pd(&output_vectorB[i].a, t);
}
}

Related

Efficient C vectors for generic SIMD (SSE, AVX, NEON) test for zero matches. (find FP max absolute value and index)

I want to see if it's possible to write some generic SIMD code that can compile efficiently. Mostly for SSE, AVX, and NEON. A simplified version of the problem is: Find the maximum absolute value of an array of floating point numbers and return both the value and the index. It is the last part, the index of the maximum, that causes the problem. There doesn't seem to be a very good way to write code that has a branch.
See update at end for finished code using some of the suggested answers.
Here's a sample implementation (more complete version on godbolt):
#define VLEN 8
typedef float vNs __attribute__((vector_size(VLEN*sizeof(float))));
typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));
#define SWAP128 4,5,6,7, 0,1,2,3
#define SWAP64 2,3, 0,1, 6,7, 4,5
#define SWAP32 1, 0, 3, 2, 5, 4, 7, 6
static bool any(vNb x) {
x = x | __builtin_shufflevector(x,x, SWAP128);
x = x | __builtin_shufflevector(x,x, SWAP64);
x = x | __builtin_shufflevector(x,x, SWAP32);
return x[0];
}
float maxabs(float* __attribute__((aligned(32))) data, unsigned n, unsigned *index) {
vNs max = {0,0,0,0,0,0,0,0};
vNs tmax;
unsigned imax = 0;
for (unsigned i = 0 ; i < n; i += VLEN) {
vNs t = *(vNs*)(data + i);
t = -t < t ? t : -t; // Absolute value
vNb cmp = t > max;
if (any(cmp)) {
tmax = t; imax = i;
// broadcast horizontal max of t into every element of max
vNs tswap128 = __builtin_shufflevector(t,t, SWAP128);
t = t < tswap128 ? tswap128 : t;
vNs tswap64 = __builtin_shufflevector(t,t, SWAP64);
t = t < tswap64 ? tswap64 : t;
vNs tswap32 = __builtin_shufflevector(t,t, SWAP32);
max = t < tswap32 ? tswap32 : t;
}
}
// To simplify example, ignore finding index of true value in tmax==max
*index = imax; // + which(tmax == max);
return max[0];
}
Code on godbolt allows changing VLEN to 8 or 4.
This mostly works very well. For AVX/SSE the absolute value becomes t & 0x7fffffff using a (v)andps, i.e. clear the sign bit. For NEON it's done with vneg + fmaxnm. The block to find and broadcast the horizontal max becomes an efficient sequence of permute and max instructions. gcc is able to use NEON fabs for absolute value.
The 8 element vector on the 4 element SSE/NEON targets works well on clang. It uses a pair of instructions on two sets of registers and for the SWAP128 horizontal op will max or or the two registers without any unnecessary permute. gcc on the other hand really can't handle this and produces mostly non-SIMD code. If we reduce the vector length to 4, gcc works fine for SSE and NEON.
But there's a problem with if (any(cmp)). For clang + SSE/AVX, it works well, vcmpltps + vptest, with an orps to go from 8->4 on SSE.
But gcc and clang on NEON do all the permutes and ORs, then move the result to a gp register to test.
Is there some bit of code, other than architecture specific intrinsics, to get ptest with gcc and vmaxvq with clang/gcc and NEON?
I tried some other methods, like if (x[0] || x[1] || ... x[7]) but they were worse.
Update
I've created an updated example that shows two different implementations, both the original and "indices in a vector" method as suggested by chtz and shown in Aki Suihkonen's answer. One can see the resulting SSE and NEON output.
While some might be skeptical, the compiler does produce very good code from the generic SIMD (not auto-vectorization!) C++ code. On SSE/AVX, I see very little room to improve the code in the loop. The NEON version still troubled by a sub-optimal implementation of "any()".
Unless the data is usually in ascending order, or nearly so, my original version is still fastest on SSE/AVX. I haven't tested on NEON. This is because most loop iterations do not find a new max value and it's best to optimize for that case. The "indices in a vector" method produces a tighter loop and the compiler does a better job too, but the common case is just a bit slower on SSE/AVX. The common case might be equal or faster on NEON.
Some notes on writing generic SIMD code.
The absolute value of a vector of floats can be found with the following. It produces optimal code on SSE/AVX (and with a mask that clears the sign bit) and on NEON (the fabs instruction).
static vNs vabs(vNs x) {
return -x < x ? x : -x;
}
This will do a vertical max efficiently on SSE/AVX/NEON. It doesn't do a compare; it produces the architecture's "max' instruction. On NEON, changing it to use > instead of < causes the compiler to produce very bad scalar code. Something with denormals or exceptions I guess.
template <typename v> // Deduce vector type (float, unsigned, etc.)
static v vmax(v a, v b) {
return a < b ? b : a; // compiles best with "<" as compare op
}
This code will broadcast the horizontal max across a register. It compiles very well on SSE/AVX. On NEON, it would probably be better if the compiler could use a horizontal max instruction and then broadcast the result. I was impressed to see that if one uses 8 element vectors on SSE/NEON, which have only 4 element registers, the compiler is smart enough to use just one register for the broadcasted result, since the top 4 and bottom 4 elements are the same.
template <typename v>
static v hmax(v x) {
if (VLEN >= 8)
x = vmax(x, __builtin_shufflevector(x,x, SWAP128));
x = vmax(x, __builtin_shufflevector(x,x, SWAP64));
return vmax(x, __builtin_shufflevector(x,x, SWAP32));
}
This is the best "any()" I found. It is optimal on SSE/AVX, using a single ptest instruction. On NEON it does the permutes and ORs, instead of a horizontal max instruction, but I haven't found a way to get anything better on NEON.
static bool any(vNb x) {
if (VLEN >= 8)
x |= __builtin_shufflevector(x,x, SWAP128);
x |= __builtin_shufflevector(x,x, SWAP64);
x |= __builtin_shufflevector(x,x, SWAP32);
return x[0];
}
Also interesting, on AVX the code i = i + 1 will be compiled to vpsubd ymmI, ymmI, ymmNegativeOne, i.e. subtract -1. Why? Because a vector of -1s is produced with vpcmpeqd ymm0, ymm0, ymm0 and that's faster than broadcasting a vector of 1s.
Here is the best which() I've come up with. This gives you the index of the 1st true value in a vector of booleans (0 = false, -1 = true). One can do somewhat better on AVX with movemask. I don't know about the best NEON.
// vector of signed ints
typedef int vNi __attribute__((vector_size(VLEN*sizeof(int))));
// vector of bytes, same number of elements, 1/4 the size
typedef unsigned char vNb __attribute__((vector_size(VLEN*sizeof(unsigned char))));
// scalar type the same size as the byte vector
using sNb = std::conditional_t<VLEN == 4, uint32_t, uint64_t>;
static int which(vNi x) {
vNb cidx = __builtin_convertvector(x, vNb);
return __builtin_ctzll((sNb)cidx) / 8u;
}

As commented by chtz, the most generic and typical method is to have another mask to gather indices:
Vec8s indices = { 0,1,2,3,4,5,6,7};
Vec8s max_idx = indices;
Vec8f max_abs = abs(load8(ptr));
for (auto i = 8; i + 8 <= vec_length; i+=8) {
Vec8s data = abs(load8(ptr[i]));
auto mask = is_greater(data, max_abs);
max_idx = bitselect(mask, indices, max_idx);
max_abs = max(max_abs, data);
indices = indices + 8;
}
Another option is to interleave the values and indices:
auto data = load8s(ptr) & 0x7fffffff; // can load data as int32_t
auto idx = vec8s{0,1,2,3,4,5,6,7};
auto lo = zip_lo(idx, data);
auto hi = zip_hi(idx, data);
for (int i = 8; i + 8 <= size; i+=8) {
idx = idx + 8;
auto d1 = load8s(ptr + i) & 0x7fffffff;
auto lo1 = zip_lo(idx, d1);
auto hi1 = zip_hi(idx, d1);
lo = max_u64(lo, lo1);
hi = max_u64(hi, hi1);
}
This method is especially lucrative, if the range of inputs is small enough to shift the input left, while appending a few bits from the index to the LSB bits of the same word.
Even in this case we can repurpose 1 bit in the float allowing us to save one half of the bit/index selection operations.
auto data0 = load8u(ptr) << 1; // take abs by shifting left
auto data1 = (load8u(ptr + 8) << 1) + 1; // encode odd index to data
auto mx = max_u32(data0, data1); // the LSB contains one bit of index
Looks like one can use double as the storage, since even SSE2 supports _mm_max_pd (some attention needs to be given to Inf/Nan handling, which don't encode as Inf/Nan any more when reinterpreted as the high part of 64-bit double).

UPD: the no-aligning issue is fixed now, all the examples on godbolt use aligned reads.
UPD: MISSED THE ABS
Terribly sorry about that, I missed the absolute value from the definition.
I do not have the measurements, but here are all 3 functions vectorised:
max value with abs: https://godbolt.org/z/6Wznrc5qq
find with abs: https://godbolt.org/z/61r9Efxvn
one pass with abs: https://godbolt.org/z/EvdbfnWjb
Asm stashed in a gist
On the method
The way to do max element with simd is to first find the value and then find the index.
Alternatively you have to keep a register of indexes and blend the indexes.
This requires keeping indexes, doing more operations and the problem of the overflow needs to be addressed.
Here are my timings on avx2 by type (char, short and int) for 10'000 bytes of data
The min_element is my implementation of keeping the index.
reduce(min) + find is doing two loops - first get the value, then find where.
For ints (should behave like floats), performance is 25% faster for the two loops solution, at least on my measurements.
For completeness, comparisons against scalar for both methods - this is definitely an operation that should be vectorized.
How to do it
finding the maximum value is auto-vectorised across all platforms if you write it as reduce
if (!arr.size()) return {};
// std::reduce is also ok, just showing for more C ppl
float res = arr[0];
for (int i = 1; i != (int)arr.size(); ++i) {
res = res > arr[i] ? res : arr[i];
}
return res;
https://godbolt.org/z/EsazWf1vT
Now the find portion is trickier, non of the compilers I know autovectorize find
We have eve library that provides you with find algorithm: https://godbolt.org/z/93a98x6Tj
Or I explain how to implement find in this talk if you want to do it yourself.
UPD:
UPD2: changed the blend to max
#Peter Cordes in the comments said that there is maybe a point to doing the one pass solution in case of bigger data.
I have no evidence of this - my measurements point to reduce + find.
However, I hacked together roughly how keeping the index looks (there is an aligning issue at the moment, we should definitely align reads here)
https://godbolt.org/z/djrzobEj4
AVX2 main loop:
.L6:
vmovups ymm6, YMMWORD PTR [rdx]
add rdx, 32
vcmpps ymm3, ymm6, ymm0, 30
vmaxps ymm0, ymm6, ymm0
vpblendvb ymm3, ymm2, ymm1, ymm3
vpaddd ymm1, ymm5, ymm1
vmovdqa ymm2, ymm3
cmp rcx, rdx
jne .L6
ARM-64 main loop:
.L6:
ldr q3, [x0], 16
fcmgt v4.4s, v3.4s, v0.4s
fmax v0.4s, v3.4s, v0.4s
bit v1.16b, v2.16b, v4.16b
add v2.4s, v2.4s, v5.4s
cmp x0, x1
bne .L6
Links to ASM if godbolt becomes stale: https://gist.github.com/DenisYaroshevskiy/56d82c8cf4a4dd5bf91d58b053ea80f2

I don’t believe that’s possible. Compilers aren’t smart enough to do that efficiently.
Compare the other answer (which uses NEON-like pseudocode) with the SSE version below:
// Compare vector absolute value with aa, if greater update both aa and maxIdx
inline void updateMax( __m128 vec, __m128i idx, __m128& aa, __m128& maxIdx )
{
vec = _mm_andnot_ps( _mm_set1_ps( -0.0f ), vec );
const __m128 greater = _mm_cmpgt_ps( vec, aa );
aa = _mm_max_ps( vec, aa );
// If you don't have SSE4, emulate with bitwise ops: and, andnot, or
maxIdx = _mm_blendv_ps( maxIdx, _mm_castsi128_ps( idx ), greater );
}
float maxabs_sse4( const float* rsi, size_t length, size_t& index )
{
// Initialize things
const float* const end = rsi + length;
const float* const endAligned = rsi + ( ( length / 4 ) * 4 );
__m128 aa = _mm_set1_ps( -1 );
__m128 maxIdx = _mm_setzero_ps();
__m128i idx = _mm_setr_epi32( 0, 1, 2, 3 );
// Main vectorized portion
while( rsi < endAligned )
{
__m128 vec = _mm_loadu_ps( rsi );
rsi += 4;
updateMax( vec, idx, aa, maxIdx );
idx = _mm_add_epi32( idx, _mm_set1_epi32( 4 ) );
}
// Handle the remainder, if present
if( rsi < end )
{
__m128 vec;
if( length > 4 )
{
// The source has at least 5 elements
// Offset the source pointer + index back, by a few elements
const int offset = (int)( 4 - ( length % 4 ) );
rsi -= offset;
idx = _mm_sub_epi32( idx, _mm_set1_epi32( offset ) );
vec = _mm_loadu_ps( rsi );
}
else
{
// The source was smaller than 4 elements, copy them into temporary buffer and load vector from there
alignas( 16 ) float buff[ 4 ];
_mm_store_ps( buff, _mm_setzero_ps() );
for( size_t i = 0; i < length; i++ )
buff[ i ] = rsi[ i ];
vec = _mm_load_ps( buff );
}
updateMax( vec, idx, aa, maxIdx );
}
// Reduce to scalar
__m128 tmpMax = _mm_movehl_ps( aa, aa );
__m128 tmpMaxIdx = _mm_movehl_ps( maxIdx, maxIdx );
__m128 greater = _mm_cmpgt_ps( tmpMax, aa );
aa = _mm_max_ps( tmpMax, aa );
maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );
// SSE3 has 100% market penetration in 2022
tmpMax = _mm_movehdup_ps( tmpMax );
tmpMaxIdx = _mm_movehdup_ps( tmpMaxIdx );
greater = _mm_cmpgt_ss( tmpMax, aa );
aa = _mm_max_ss( tmpMax, aa );
maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );
index = (size_t)_mm_cvtsi128_si32( _mm_castps_si128( maxIdx ) );
return _mm_cvtss_f32( aa );
}
As you see, pretty much everything is completely different. Not just the boilerplate about remainder and final reduction, the main loop is very different too.
SSE doesn’t have bitselect; blendvps is not quite that, it selects 32-bit lanes based on high bit of the selector. Unlike NEON, SSE doesn’t have instructions for absolute value, need to be emulated with bitwise andnot.
The final reduction going to be completely different as well. NEON has very limited shuffles, but it has better horizontal operations, like vmaxvq_f32 which finds horizontal maximum over the complete SIMD vector.

How to square two complex doubles with 256-bit AVX vectors?

Matt Scarpino gives a good explanation (although he admits he's not sure it's the optimal algorithm, I offer him my gratitude) for how to multiply two complex doubles with Intel's AVX intrinsics. Here's his method, which I've verified:
__m256d vec1 = _mm256_setr_pd(4.0, 5.0, 13.0, 6.0);
__m256d vec2 = _mm256_setr_pd(9.0, 3.0, 6.0, 7.0);
__m256d neg = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0);
/* Step 1: Multiply vec1 and vec2 */
__m256d vec3 = _mm256_mul_pd(vec1, vec2);
/* Step 2: Switch the real and imaginary elements of vec2 */
vec2 = _mm256_permute_pd(vec2, 0x5);
/* Step 3: Negate the imaginary elements of vec2 */
vec2 = _mm256_mul_pd(vec2, neg);
/* Step 4: Multiply vec1 and the modified vec2 */
__m256d vec4 = _mm256_mul_pd(vec1, vec2);
/* Horizontally subtract the elements in vec3 and vec4 */
vec1 = _mm256_hsub_pd(vec3, vec4);
/* Display the elements of the result vector */
double* res = (double*)&vec1;
printf("%lf %lf %lf %lf\n", res[0], res[1], res[2], res[3]);
My problem is that I want to square two complex doubles. I tried to use Matt's technique like so:
struct cmplx a;
struct cmplx b;
a.r = 2.5341;
a.i = 1.843;
b.r = 1.3941;
b.i = 0.93;
__m256d zzs = squareZ(a, b);
double* res = (double*) &zzs;
printf("\nA: %f + %f, B: %f + %f\n", res[0], res[1], res[2], res[3]);
Using Haskell's complex arithmetic, I have verified the results are correct except, as you can see, the real part of B:
A: 3.025014 + 9.340693, B: 0.000000 + 2.593026
So I have two questions really: is there a better (simpler and/or faster) way to square two complex doubles with AVX intrinsics? If not, how can I modify Matt's code to do it?

This answer covers the general case of multiplying two arrays of complex numbers
Ideally, store your data in separate real and imaginary arrays, so you can just load contiguous vectors of real and imaginary parts. That makes it free to do the cross-multiplying (just use different registers / variables) instead of having to shuffle things around within a vector.
You can convert between interleaved double complex style and SIMD-friendly separate-vectors style on the fly fairly cheaply, subject to the vagaries of AVX in-lane shuffles. e.g. very cheaply with unpacklo / unpackhi shuffles to de-interleave or to re-interleave within a lane, if you don't care about the actual order of the data within the temporary vector.
It's actually so cheap to do this shuffle that doing it on the fly for a single complex multiply comes out somewhat ahead of (even a tweaked version of) Matt's code, especially on CPUs that support FMA. This requires producing results in groups of 4 complex doubles (2 result vectors).
If you need to produce only one result vector at a time, I also came up with an alternative to Matt's algorithm that can use FMA (actually FMADDSUB) and avoid the separate sign-change insn.
gcc auto-vectorizes simple complex multiply scalar loop to pretty good code, as long as you use -ffast-math. It deinterleaves like I suggested.
#include <complex.h>
// even with -ffast-math -ffp-contract=fast, clang doesn't manage to use vfmaddsubpd, instead using vmulpd and vaddsubpd :(
// gcc does use FMA though.
// auto-vectorizes with a lot of extra shuffles
void cmul(double complex *restrict dst,
const double complex *restrict A, const double complex *restrict B)
{ // clang and gcc change strategy slightly for i<1 or i<2, vs. i<4
for (int i=0; i<4 ; i++) {
dst[i] = A[i] * B[i];
}
}
See the asm on the Godbolt compiler explorer. I'm not sure how good clang's asm is; it uses a lot of 64b->128b VMODDDUP broadcast-loads. This form is handled purely in the load ports on Intel CPUs (see Agner Fog's insn tables), but it's still a lot of operations. As mentioned earlier, gcc uses 4 VPERMPD shuffles to reorder within lanes before multiplying / FMA, then another 4 VPERMPD to reorder the results before combining them with VSHUFPD. This is 8 extra shuffles for 4 complex multiplies.
Converting gcc's version back to intrinsics and removing the redundant shuffles gives optimal code. (gcc apparently wants its temporaries to be in A B C D order instead of the A C B D order resulting from the in-lane behaviour of VUNPCKLPD (_mm256_unpacklo_pd)).
I put the code on Godbolt, along with a tweaked version of Matt's code. So you can play around with different compiler options, and also different compiler versions.
// multiplies 4 complex doubles each from A and B, storing the result in dst[0..3]
void cmul_manualvec(double complex *restrict dst,
const double complex *restrict A, const double complex *restrict B)
{
// low element first, little-endian style
__m256d A0 = _mm256_loadu_pd((double*)A); // [A0r A0i A1r A1i ] // [a b c d ]
__m256d A2 = _mm256_loadu_pd((double*)(A+2)); // [e f g h ]
__m256d realA = _mm256_unpacklo_pd(A0, A2); // [A0r A2r A1r A3r ] // [a e c g ]
__m256d imagA = _mm256_unpackhi_pd(A0, A2); // [A0i A2i A1i A3i ] // [b f d h ]
// the in-lane behaviour of this interleaving is matched by the same in-lane behaviour when we recombine.
__m256d B0 = _mm256_loadu_pd((double*)B); // [m n o p]
__m256d B2 = _mm256_loadu_pd((double*)(B+2)); // [q r s t]
__m256d realB = _mm256_unpacklo_pd(B0, B2); // [m q o s]
__m256d imagB = _mm256_unpackhi_pd(B0, B2); // [n r p t]
// desired: real=rArB - iAiB, imag=rAiB + rBiA
__m256d realprod = _mm256_mul_pd(realA, realB);
__m256d imagprod = _mm256_mul_pd(imagA, imagB);
__m256d rAiB = _mm256_mul_pd(realA, imagB);
__m256d rBiA = _mm256_mul_pd(realB, imagA);
// gcc and clang will contract these nto FMA. (clang needs -ffp-contract=fast)
// Doing it manually would remove the option to compile for non-FMA targets
__m256d real = _mm256_sub_pd(realprod, imagprod); // [D0r D2r | D1r D3r]
__m256d imag = _mm256_add_pd(rAiB, rBiA); // [D0i D2i | D1i D3i]
// interleave the separate real and imaginary vectors back into packed format
__m256d dst0 = _mm256_shuffle_pd(real, imag, 0b0000); // [D0r D0i | D1r D1i]
__m256d dst2 = _mm256_shuffle_pd(real, imag, 0b1111); // [D2r D2i | D3r D3i]
_mm256_storeu_pd((double*)dst, dst0);
_mm256_storeu_pd((double*)(dst+2), dst2);
}
Godbolt asm output: gcc6.2 -O3 -ffast-math -ffp-contract=fast -march=haswell
vmovupd ymm0, YMMWORD PTR [rsi+32]
vmovupd ymm3, YMMWORD PTR [rsi]
vmovupd ymm1, YMMWORD PTR [rdx]
vunpcklpd ymm5, ymm3, ymm0
vunpckhpd ymm3, ymm3, ymm0
vmovupd ymm0, YMMWORD PTR [rdx+32]
vunpcklpd ymm4, ymm1, ymm0
vunpckhpd ymm1, ymm1, ymm0
vmulpd ymm2, ymm1, ymm3
vmulpd ymm0, ymm4, ymm3
vfmsub231pd ymm2, ymm4, ymm5 # separate mul/sub contracted into FMA
vfmadd231pd ymm0, ymm1, ymm5
vunpcklpd ymm1, ymm2, ymm0
vunpckhpd ymm0, ymm2, ymm0
vmovupd YMMWORD PTR [rdi], ymm1
vmovupd YMMWORD PTR [rdi+32], ymm0
vzeroupper
ret
For 4 complex multiplies (of 2 pairs of input vectors), my code uses:
4 loads (32B each)
2 stores (32B each)
6 in-lane shuffles (one for each input vector, one for each output)
2 VMULPD
2 VFMA...something
(only 4 shuffles if we can use the results in separated real and imag vectors, or 0 shuffles if the inputs are already in this format, too)
latency on Intel Skylake (not counting loads/stores): 14 cycles = 4c for 4 shuffles until the second VMULPD can start + 4 cycles (second VMULPD) + 4c (second vfmadd231pd) + 1c (shuffle first result vector ready 1c earlier) + 1c (shuffle second result vector)
So for throughput, this completely bottlenecks on the shuffle port. (1 shuffle per clock throughput, vs. 2 total MUL/FMA/ADD per clock on Intel Haswell and later). This is why packed storage is horrible: shuffles have limited throughput, and spending more instructions shuffling than on doing math is not good.
Matt Scarpino's code with my minor tweaks (repeated to do 4 complex multiplies). (See below for my rewrite of producing one vector at a time more efficiently).
the same 6 loads/stores
6 in-lane shuffles (HSUBPD is 2 shuffles and a subtract on current Intel and AMD CPUs)
4 multiplies
2 subtracts (which can't combine with the muls into FMAs)
An extra instruction (+ a constant) to flip the sign of the imaginary elements. Matt chose to multiply by 1.0 or -1.0, but the efficient choice is to XOR the sign bit (i.e. XORPD with -0.0).
latency on Intel Skylake for the first result vector: 11 cycles. 1c(vpermilpd and vxorpd in the same cycle) + 4c(second vmulpd) + 6c(vhsubpd). The first vmulpd overlaps with other ops, starting in the same cycle as the shuffle and vxorpd. Computation of a second result vector should interleave pretty nicely.
The major advantage of Matt's code is that it works with just one vector-width of complex multiplies at once, instead of requiring you to have 4 input vectors of data. It has somewhat lower latency. But note that my version doesn't need the 2 pairs of input vectors to be from contiguous memory, or related to each other at all. They get mixed together while processing, but the result is 2 separate 32B vectors.
My tweaked version of Matt's code is nearly as good (as the 4-at-a-time version) on CPUs without FMA (just costing an extra VXORPD), but significantly worse when it stops us from taking advantage of FMA. Also, it never has the results available in non-packed form, so you can't use the separated form as input to another multiply and skip the shuffling.
One vector result at a time, with FMA:
Don't use this if you're sometimes squaring, instead of multiplying two different complex numbers. This is like Matt's algorithm in that common subexpression elimination doesn't simplify.
I haven't typed in the C intrinsics for this, just worked out the data movement. Since all the shuffles are in-lane, I'll only show the low lane. Use the 256b versions of the relevant instructions to do the same shuffle in both lanes. They stay separate.
// MULTIPLY: for each AVX lane: am-bn, an+bm
r i r i
a b c d // input vectors: a + b*i, etc.
m n o p
Algorithm:
create bm bn with movshdup(a b) + mulpd
create bn bm with shufpd on the previous result. (or create n m with a shuffle before the mul)
create a a with movsldup(a b)
use fmaddsubpd to produce the final result: [a|a]*[m|n] -/+ [bn|bm].
Yes, SSE/AVX has ADDSUBPD to do alternating subtract/add in even/odd elements (in that order, presumably because of this use-case). FMA includes FMADDSUB132PD which subtracts and adds, (and the reverse, FMSUBADD which adds and subtracts).
Per 4 results: 6x shuffle, 2x mul, 2xfmaddsub. So unless I got something wrong, it's as efficient as the deinterleave method (when not squaring the same number). Skylake latency = 10c = 1+4+1 to create bn bm (overlapping with 1 cycle to create a a), + 4 (FMA). So it's one cycle lower latency than Matt's.
On Bulldozer-family, it would be a win to shuffle both inputs to the first mul, so the mul->fmaddsub critical path stays inside the FMA domain (1 cycle lower latency). Doing it the other way helps stop silly compilers from making resource conflicts by doing the movsldup(a b) too early, and delaying the mulpd. (In a loop, though, many iterations will be in flight and bottleneck on the shuffle port.)
This is still better than Matt's for squaring (still save the XOR, and can use FMA), but we don't save any shuffles:
// SQUARING: for each AVX lane: aa-bb, 2*ab
// ab bb // movshdup + mul
// bb ab // ^ -> shufpd
// a a // movsldup
// aa-bb ab+ab // fmaddsubpd : [a|a]*[a|b] -/+ [bb|ab]
// per 4 results: 6x shuffle, 2x mul, 2xfmaddsub
I also played around with some possibilities like (a+b) * (a+b) = aa+2ab+bb, or (r-i)*(r+i) = rr - ii but didn't get anywhere. Rounding between steps means that FP math doesn't cancel perfectly, so doing something like this wouldn't even produce exactly identical results.

See my other answer for the general case of multiplying different complex numbers, not squaring.
TL:DR: just use the code in my other answer with both inputs the same. Compilers do a good job with the redundancy.
Squaring simplifies the math slightly: instead of needing two different cross products, rAiB and rBiA are the same. But it still needs to get doubled, so basically we end up with 2 mul + 1 FMA + 1 add, instead of 2 mul + 2 FMA.
With the SIMD-unfriendly interleaved storage format, it gives a big boost to the deinterleave method, since there's only one input to shuffle. Matt's method doesn't benefit at all, since it calculates both cross products with the same vector multiply.
Using the cmul_manualvec() from my other answer:
// squares 4 complex doubles from A[0..3], storing the result in dst[0..3]
void csquare_manual(double complex *restrict dst,
const double complex *restrict A) {
cmul_manualvec(dst, A, A);
}
gcc and clang are smart enough to optimize away the redundancy of using the same input twice, so there's no need to make a custom version with intrinsics. clang does a bad job on the scalar auto-vectorizing version, so don't use that. I don't see anything to be gained over this asm output (from Godbolt):
clang3.9 -O3 -ffast-math -ffp-contract=fast -march=haswell
vmovupd ymm0, ymmword ptr [rsi]
vmovupd ymm1, ymmword ptr [rsi + 32]
vunpcklpd ymm2, ymm0, ymm1
vunpckhpd ymm0, ymm0, ymm1 # doing this shuffle first would let the first multiply start a cycle earlier. Silly compiler.
vmulpd ymm1, ymm0, ymm0 # imag*imag
vfmsub231pd ymm1, ymm2, ymm2 # real*real - imag*imag
vaddpd ymm0, ymm0, ymm0 # imag+imag = 2*imag
vmulpd ymm0, ymm2, ymm0 # 2*imag * real
vunpcklpd ymm2, ymm1, ymm0
vunpckhpd ymm0, ymm1, ymm0
vmovupd ymmword ptr [rdi], ymm2
vmovupd ymmword ptr [rdi + 32], ymm0
vzeroupper
ret
Possibly a different instruction ordering would have been better, to maybe reduce resource conflicts. e.g. double the real vector, since it's unpacked first, so the VADDPD could start a cycle sooner, before the imag*imag VMULPD. But reordering lines in the C source doesn't usually translate directly to asm reordering, because modern compilers are complex beasts. (IIRC, gcc doesn't particularly try to schedule instructions for x86, because out-of-order execution mostly hides those effects.)
Anyway, per 4 complex squares:
2 loads (down from 4) + 2 stores, for obvious reasons
4 shuffles (down from 6), again obvious
2 VMULPD (same)
1 FMA + 1 VADDPD (down from 2 FMA. VADDPD is lower latency than FMA on Haswell/Broadwell, same on Skylake).
Matt's version would still be 6 shuffles, and same everything else.

Handling zeroes in _mm256_rsqrt_ps()

Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing:
_mm256_mul_ps(_mm256_rsqrt_ps(eightFloats),
eightFloats);
Is the way to go for that extra bit of performance and avoiding a pipeline stall.
Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better way, or am I going to run into problems under certain conditions?
_mm256_mul_ps(_mm256_rsqrt_ps(_mm256_max_ps(eightFloats,
_mm256_set1_ps(0.1))),
eightFloats);
My code is for a vertical application, so I can assume that it will be running on a Haswell CPU (i7-4810MQ), so FMA/AVX2 can be used. The original code is approximately:
float vals[MAX];
int sum = 0;
for (int i = 0; i < MAX; i++)
{
int thisSqrt = (int) floor(sqrt(vals[i]));
sum += min(thisSqrt, 0x3F);
}
All the values of vals should be integer values. (Why everything isn't just int is a different question...)

tl;dr: See the end for code that compiles and should work.
To just solve the 0.0 problem, you could also special-case inputs of 0.0 with an FP compare of the source against 0.0. Use the compare result as a mask to zero out any NaNs resulting from 0 * +Infinity in sqrt(x) = x * rsqrt(x)). Clang does this when autovectorizing. (But it uses blendps with the zeroed vector, instead of using the compare mask with andnps directly to zero or preserve elements.)
It would also be possible to use sqrt(x) ~= recip(rsqrt(x)), as suggested by njuffa. rsqrt(0) = +Inf. recip(+Inf) = 0. However, using two approximations would compound the relative error, which is a problem.
The thing you're missing:
Truncating to integer (instead of rounding) requires an accurate sqrt result when the input is a perfect square. If the result for 25*rsqrt(25) is 4.999999 or something (instead of 5.00001), you'll add 4 instead of 5.
Even with a Newton-Raphson iteration, rsqrtps isn't perfectly accurate the way sqrtps is, so it might still give 5.0 - 1ulp. (1ulp = one unit in the last place = lowest bit of the mantissa).
Also:
Newton Raphson formula explained
Newton Raphson SSE implementation performance (latency/throughput). Note that we care more about throughput than latency, since we're using it in a loop that doesn't do much else. sqrt isn't part of the loop-carried dep chain, so different iterations can have their sqrt calcs in flight at once.
It might be possible to kill 2 birds with one stone by adding a small constant before doing the (x+offset)*approx_rsqrt(x+offset) and then truncating to integer. Large enough to overcome the max relative error of 1.5*2-12, but small enough not to bump sqrt_approx(63*63-1+offset) up to 63 (the most sensitive case).
63*1.5*2^(-12) == 0.023071...
approx_sqrt(63*63-1) == 62.99206... +/- 0.023068..
Actually, we're screwed without a Newton iteration even without adding anything. approx_sqrt(63*63-1) could come out above 63.0 all by itself. n=36 is the largest value where the relative error in sqrt(n*n-1) + error is less than sqrt(n*n). GNU Calc:
define f(n) { local x=sqrt(n*n-1); local e=x*1.5*2^(-12); print x; print e, x+e; }
; f(36)
35.98610843089316319413
~0.01317850650545403926 ~35.99928693739861723339
; f(37)
36.9864840178138587015
~0.01354485498699237990 ~37.00002887280085108140
Does your source data have any properties that mean you don't have to worry about it being just below a large perfect square? e.g. is it always perfect squares?
You could check all possible input values, since the important domain is very small (integer FP values from 0..63*63) to see if the error in practice is small enough on Intel Haswell, but that would be a brittle optimization that could make your code break on AMD CPUs, or even on future Intel CPUs. Unfortunately, just coding to the ISA spec's guarantee that the relative error is up to 1.5*2-12 requires more instructions. I don't see any tricks a NR iteration.
If your upper limit was smaller (like 20), you could just do isqrt = static_cast<int> ((x+0.5)*approx_rsqrt(x+0.5)). You'd get 20 for 20*20, but always 19 for 20*20-1.
; define test_approx_sqrt(x, off) { local s=x*x+off; local sq=s/sqrt(s); local sq_1=(s-1)/sqrt(s-1); local e=1.5*2^(-12); print sq, sq_1; print sq*e, sq_1*e; }
; test_approx_sqrt(20, 0.5)
~20.01249609618950056874 ~19.98749609130668473087 # (x+0.5)/sqrt(x+0.5)
~0.00732879495710064718 ~0.00731963968187500662 # relative error
Note that val * (x +/- err) = val*x +/- val*err. IEEE FP mul produces results that are correctly rounded to 0.5ulp, so this should work for FP relative errors.
Anyway, I think you need one Newton-Raphson iteration.
The best bet is to add 0.5 to your input before doing an approx_sqrt using rsqrt. That sidesteps the 0/0 = NaN problem, and pushes the +/- error range all to one side of the whole number cut point (for numbers in the range we care about).
FP min/max instructions have the same performance as FP add, and will be on the critical path either way. Using an add instead of a max also solves the problem of results for perfect squares potentially being a few ulp below the correct result.
Compiler output: a decent starting point
I get pretty good autovectorization results from clang 3.7.1 with sum_int, with -fno-math-errno -funsafe-math-optimizations. -ffinite-math-only is not required (but even with the full -ffast-math, clang avoids sqrt(0) = NaN when using rsqrtps).
sum_fp doesn't auto-vectorize, even with the full -ffast-math.
However clang's version suffers from the same problem as your idea: truncating an inexact result from rsqrt + NR, potentially giving the wrong integer. IDK if this is why gcc doesn't auto-vectorize, because it could have used sqrtps for a big speedup without changing the results. (At least, as long as all the floats are between 0 and INT_MAX2, otherwise converting back to integer will give the "indefinite" result of INT_MIN. (sign bit set, all other bits cleared). This is a case where -ffast-math breaks your program, unless you use -mrecip=none or something.
See the asm output on godbolt from:
// autovectorizes with clang, but has rounding problems.
// Note the use of sqrtf, and that floorf before truncating to int is redundant. (removed because clang doesn't optimize away the roundps)
int sum_int(float vals[]){
int sum = 0;
for (int i = 0; i < MAX; i++) {
int thisSqrt = (int) sqrtf(vals[i]);
sum += std::min(thisSqrt, 0x3F);
}
return sum;
}
To manually vectorize with intrinsics, we can look at the asm output from -fno-unroll-loops (to keep things simple). I was going to include this in the answer, but then realized that it had problems.
putting it together:
I think converting to int inside the loop is better than using floorf and then addps. roundps is a 2-uop instruction (6c latency) on Haswell (1uop in SnB/IvB). Worse, both uops require port1, so they compete with FP add / mul. cvttps2dq is a 1-uop instruction for port1, with 3c latency, and then we can use integer min and add to clamp and accumulate, so port5 gets something to do. Using an integer vector accumulator also means the loop-carried dependency chain is 1 cycle, so we don't need to unroll or use multiple accumulators to keep multiple iterations in flight. Smaller code is always better for the big picture (uop cache, L1 I-cache, branch predictors).
As long as we aren't in danger of overflowing 32bit accumulators, this seems to be the best choice. (Without having benchmarked anything or even tested it).
I'm not using the sqrt(x) ~= approx_recip(approx_sqrt(x)) method, because I don't know how to do a Newton iteration to refine it (probably it would involve a division). And because the compounded error is larger.
Horizontal sum from this answer.
Complete but untested version:
#include <immintrin.h>
#define MAX 4096
// 2*sqrt(x) ~= 2*x*approx_rsqrt(x), with a Newton-Raphson iteration
// dividing by 2 is faster in the integer domain, so we don't do it
__m256 approx_2sqrt_ps256(__m256 x) {
// clang / gcc usually use -3.0 and -0.5. We could do the same by using fnmsub_ps (add 3 = subtract -3), so we can share constants
__m256 three = _mm256_set1_ps(3.0f);
//__m256 half = _mm256_set1_ps(0.5f); // we omit the *0.5 step
__m256 nr = _mm256_rsqrt_ps( x ); // initial approximation for Newton-Raphson
// 1/sqrt(x) ~= nr * (3 - x*nr * nr) * 0.5 = nr*(1.5 - x*0.5*nr*nr)
// sqrt(x) = x/sqrt(x) ~= (x*nr) * (3 - x*nr * nr) * 0.5
// 2*sqrt(x) ~= (x*nr) * (3 - x*nr * nr)
__m256 xnr = _mm256_mul_ps( x, nr );
__m256 three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three ); // -(xnr*nr) + 3
return _mm256_mul_ps( xnr, three_minus_muls );
}
// packed int32_t: correct results for inputs from 0 to well above 63*63
__m256i isqrt256_ps(__m256 x) {
__m256 offset = _mm256_set1_ps(0.5f); // or subtract -0.5, to maybe share constants with compiler-generated Newton iterations.
__m256 xoff = _mm256_add_ps(x, offset); // avoids 0*Inf = NaN, and rounding error before truncation
__m256 approx_2sqrt_xoff = approx_2sqrt_ps256(xoff);
__m256i i2sqrtx = _mm256_cvttps_epi32(approx_2sqrt_xoff);
return _mm256_srli_epi32(i2sqrtx, 1); // divide by 2 with truncation
// alternatively, we could mask the low bit to zero and divide by two outside the loop, but that has no advantage unless port0 turns out to be the bottleneck
}
__m256i isqrt256_ps_simple_exact(__m256 x) {
__m256 sqrt_x = _mm256_sqrt_ps(x);
__m256i isqrtx = _mm256_cvttps_epi32(sqrt_x);
return isqrtx;
}
int hsum_epi32_avx(__m256i x256){
__m128i xhi = _mm256_extracti128_si256(x256, 1);
__m128i xlo = _mm256_castsi256_si128(x256);
__m128i x = _mm_add_epi32(xlo, xhi);
__m128i hl = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
hl = _mm_add_epi32(hl, x);
x = _mm_shuffle_epi32(hl, _MM_SHUFFLE(2, 3, 0, 1));
hl = _mm_add_epi32(hl, x);
return _mm_cvtsi128_si32(hl);
}
int sum_int_avx(float vals[]){
__m256i sum = _mm256_setzero_si256();
__m256i upperlimit = _mm256_set1_epi32(0x3F);
for (int i = 0; i < MAX; i+=8) {
__m256 v = _mm256_loadu_ps(vals+i);
__m256i visqrt = isqrt256_ps(v);
// assert visqrt == isqrt256_ps_simple_exact(v) or something
visqrt = _mm256_min_epi32(visqrt, upperlimit);
sum = _mm256_add_epi32(sum, visqrt);
}
return hsum_epi32_avx(sum);
}
Compiles on godbolt to nice code, but I haven't tested it. clang makes slightly nicer code that gcc: clang uses broadcast-loads from 4B locations for the set1 constants, instead of repeating them at compile time into 32B constants. gcc also has a bizarre movdqa to copy a register.
Anyway, the whole loop winds up being only 9 vector instructions, compared to 12 for the compiler-generated sum_int version. It probably didn't notice the x*initial_guess(x) common-subexpressions that occur in the Newton-Raphson iteration formula when you're multiplying the result by x, or something like that. It also does an extra mulps instead of a psrld because it does the *0.5 before converting to int. So that's where the extra two mulps instructions come from, and there's the cmpps/blendvps.
sum_int_avx(float*):
vpxor ymm3, ymm3, ymm3
xor eax, eax
vbroadcastss ymm0, dword ptr [rip + .LCPI4_0] ; set1(0.5)
vbroadcastss ymm1, dword ptr [rip + .LCPI4_1] ; set1(3.0)
vpbroadcastd ymm2, dword ptr [rip + .LCPI4_2] ; set1(63)
LBB4_1: ; latencies
vaddps ymm4, ymm0, ymmword ptr [rdi + 4*rax] ; 3c
vrsqrtps ymm5, ymm4 ; 7c
vmulps ymm4, ymm4, ymm5 ; x*nr ; 5c
vfnmadd213ps ymm5, ymm4, ymm1 ; 5c
vmulps ymm4, ymm4, ymm5 ; 5c
vcvttps2dq ymm4, ymm4 ; 3c
vpsrld ymm4, ymm4, 1 ; 1c this would be a mulps (but not on the critical path) if we did this in the FP domain
vpminsd ymm4, ymm4, ymm2 ; 1c
vpaddd ymm3, ymm4, ymm3 ; 1c
; ... (those 9 insns repeated: loop unrolling)
add rax, 16
cmp rax, 4096
jl .LBB4_1
;... horizontal sum
IACA thinks that with no unroll, Haswell can sustain a throughput of one iteration per 4.15 cycles, bottlenecking on ports 0 and 1. So potentially you could shave a cycle by accumulating sqrt(x)*2 (with truncation to even numbers using _mm256_and_si256), and only divide by two outside the loop.
Also according to IACA, the latency of a single iteration is 38 cycles on Haswell. I only get 31c, so probably it's including L1 load-use latency or something. Anyway, this means that to saturate the execution units, operations from 8 iterations have to be in flight at once. That's 8 * ~14 unfused-domain uops = 112 unfused-uops (or less with clang's unroll) that have to be in flight at once. Haswell's scheduler is actually only 60 entries, but the ROB is 192 entries. The early uops from early iterations will already have executed, so they only need to be tracked in the ROB, not also in the scheduler. Many of the slow uops are at the beginning of each iteration, though. Still, there's reason to hope that this will come close-ish to saturating ports 0 and 1. Unless data is hot in L1 cache, cache/memory bandwidth will probably be the bottleneck.
Interleaving operations from multiple dep chains would also be better. When clang unrolls, it puts all 9 instructions for one iteration ahead of all 9 instructions for another iteration. It uses a surprisingly small number of registers, so it would be possible to have instructions for 2 or 4 iterations mixed together. This is the sort of thing compilers are supposed to be good at, but which is cumbersome for humans. :/
It would also be slightly more efficient if the compiler chose a one-register addressing mode, so the load could micro-fuse with the vaddps. gcc does this.

Fastest way to compute distance squared

My code relies heavily on computing distances between two points in 3D space.
To avoid the expensive square root I use the squared distance throughout.
But still it takes up a major fraction of the computing time and I would like to replace my simple function with something even faster.
I now have:
double distance_squared(double *a, double *b)
{
double dx = a[0] - b[0];
double dy = a[1] - b[1];
double dz = a[2] - b[2];
return dx*dx + dy*dy + dz*dz;
}
I also tried using a macro to avoid the function call but it doesn't help much.
#define DISTANCE_SQUARED(a, b) ((a)[0]-(b)[0])*((a)[0]-(b)[0]) + ((a)[1]-(b)[1])*((a)[1]-(b)[1]) + ((a)[2]-(b)[2])*((a)[2]-(b)[2])
I thought about using SIMD instructions but could not find a good example or complete list of instructions (ideally some multiply+add on two vectors).
GPU's are not an option since only one set of points is known at each function call.
What would be the fastest way to compute the distance squared?

A good compiler will optimize that about as well as you will ever manage. A good compiler will use SIMD instructions if it deems that they are going to be beneficial. Make sure that you turn on all such possible optimizations for your compiler. Unfortunately, vectors of dimension 3 don't tend to sit well with SIMD units.
I suspect that you will simply have to accept that the code produced by the compiler is probably pretty close to optimal and that no significant gains can be made.

The first obvious thing would be to use the restrict keyword.
As it is now, a and b are aliasable (and thus, from the compiler's point of view which assumes the worst possible case are aliased). No compiler will auto-vectorize this, as it is wrong to do so.
Worse, not only can the compiler not vectorize such a loop, in case you also store (luckily not in your example), it also must re-load values each time. Always be clear about aliasing, as it greatly impacts the compiler.
Next, if you can live with that, use float instead of double and pad to 4 floats even if one is unused, this is a more "natural" data layout for the majority of CPUs (this is somewhat platform specific, but 4 floats is a good guess for most platforms -- 3 doubles, a.k.a. 1.5 SIMD registers on "typical" CPUs, is not optimal anywhere).
(For a hand-written SIMD implementation (which is harder than you think), first and before all be sure to have aligned data. Next, look into what latencies your instrucitons have on the target machine and do the longest ones first. For example on pre-Prescott Intel it makes sense to first shuffle each component into a register and then multiply with itself, even though that uses 3 multiplies instead of one, because shuffles have a long latency. On the later models, a shuffle takes a single cycle, so that would be a total anti-optimization.
Which again shows that leaving it to the compiler is not such a bad idea.)

The SIMD code to do this (using SSE3):
movaps xmm0,a
movaps xmm1,b
subps xmm0,xmm1
mulps xmm0,xmm0
haddps xmm0,xmm0
haddps xmm0,xmm0
but you need four value vectors (x,y,z,0) for this to work. If you've only got three values then you'd need to do a bit of fiddling about to get the required format which would cancel out any benefit of the above.
In general though, due to the superscalar pipelined architecture of the CPU, the best way to get performance is to do the same operation on lots of data, that way you can interleave the various steps and do a bit of loop unrolling to avoid pipeline stalls. The above code will definately stall on the last three instructions based on the "can't use a value directly after it's modified" principle - the second instruction has to wait for the result of the previous instruction to complete which isn't good in a pipelined system.
Doing the calculation on two or more different sets points of points at the same time can remove the above bottleneck - whilst waiting for the result of one computation, you can start the computation of the next point:
movaps xmm0,a1
movaps xmm2,a2
movaps xmm1,b1
movaps xmm3,b2
subps xmm0,xmm1
subps xmm2,xmm3
mulps xmm0,xmm0
mulps xmm2,xmm2
haddps xmm0,xmm0
haddps xmm2,xmm2
haddps xmm0,xmm0
haddps xmm2,xmm2

If you would like to optimize something, at first profile code and inspect assembler output.
After compiling it with gcc -O3 (4.6.1) we'll have nice disassembled output with SIMD:
movsd (%rdi), %xmm0
movsd 8(%rdi), %xmm2
subsd (%rsi), %xmm0
movsd 16(%rdi), %xmm1
subsd 8(%rsi), %xmm2
subsd 16(%rsi), %xmm1
mulsd %xmm0, %xmm0
mulsd %xmm2, %xmm2
mulsd %xmm1, %xmm1
addsd %xmm2, %xmm0
addsd %xmm1, %xmm0

This type of problem often occurs in MD simulations. Usually the amount of calculations is reduced by cutoffs and neighbor lists, so the number for the calculation is reduced. The actual calculation of the squared distances however is exactly done (with compiler optimizations and a fixed type float[3]) as given in your question.
So if you want to reduce the amount of squared calculations you should tell us more about the problem.

Perhaps passing the 6 doubles directly as arguments could make it faster (because it could avoid the array dereference):
inline double distsquare_coord(double xa, double ya, double za,
double xb, double yb, double zb)
{
double dx = xa-yb; double dy=ya-yb; double dz=za-zb;
return dx*dx + dy*dy + dz*dz;
}
Or perhaps, if you have many points in the vicinity, you might compute a distance (to the same fixed other point) by linear approximation of the distances of other near points.

If you can rearrange your data to process two pairs of input vectors at once, you may use this code (SSE2 only)
// #brief Computes two squared distances between two pairs of 3D vectors
// #param a
// Pointer to the first pair of 3D vectors.
// The two vectors must be stored with stride 24, i.e. (a + 3) should point to the first component of the second vector in the pair.
// Must be aligned by 16 (2 doubles).
// #param b
// Pointer to the second pairs of 3D vectors.
// The two vectors must be stored with stride 24, i.e. (a + 3) should point to the first component of the second vector in the pair.
// Must be aligned by 16 (2 doubles).
// #param c
// Pointer to the output 2 element array.
// Must be aligned by 16 (2 doubles).
// The two distances between a and b vectors will be written to c[0] and c[1] respectively.
void (const double * __restrict__ a, const double * __restrict__ b, double * __restrict c) {
// diff0 = ( a0.y - b0.y, a0.x - b0.x ) = ( d0.y, d0.x )
__m128d diff0 = _mm_sub_pd(_mm_load_pd(a), _mm_load_pd(b));
// diff1 = ( a1.x - b1.x, a0.z - b0.z ) = ( d1.x, d0.z )
__m128d diff1 = _mm_sub_pd(_mm_load_pd(a + 2), _mm_load_pd(b + 2));
// diff2 = ( a1.z - b1.z, a1.y - b1.y ) = ( d1.z, d1.y )
__m128d diff2 = _mm_sub_pd(_mm_load_pd(a + 4), _mm_load_pd(b + 4));
// prod0 = ( d0.y * d0.y, d0.x * d0.x )
__m128d prod0 = _mm_mul_pd(diff0, diff0);
// prod1 = ( d1.x * d1.x, d0.z * d0.z )
__m128d prod1 = _mm_mul_pd(diff1, diff1);
// prod2 = ( d1.z * d1.z, d1.y * d1.y )
__m128d prod2 = _mm_mul_pd(diff1, diff1);
// _mm_unpacklo_pd(prod0, prod2) = ( d1.y * d1.y, d0.x * d0.x )
// psum = ( d1.x * d1.x + d1.y * d1.y, d0.x * d0.x + d0.z * d0.z )
__m128d psum = _mm_add_pd(_mm_unpacklo_pd(prod0, prod2), prod1);
// _mm_unpackhi_pd(prod0, prod2) = ( d1.z * d1.z, d0.y * d0.y )
// dotprod = ( d1.x * d1.x + d1.y * d1.y + d1.z * d1.z, d0.x * d0.x + d0.y * d0.y + d0.z * d0.z )
__m128d dotprod = _mm_add_pd(_mm_unpackhi_pd(prod0, prod2), psum);
__mm_store_pd(c, dotprod);
}

Is there a better way to optimize the lennard jones potential function?

In actual fact, its the derivative of the Lennard Jones potential. The reason for is that I am writing a Molecular Dynamics program and at least 80% of the time is spent in the following function, even with the most aggressive compiler options (gcc ** -O3).
double ljd(double r) /* Derivative of Lennard Jones Potential for Argon with
respect to distance (r) */
{
double temp;
temp = Si/r;
temp = temp*temp;
temp = temp*temp*temp;
return ( (24*Ep/r)*(temp-(2 * pow(temp,2))) );
}
This code is from a file "functs.h", which I import into my main file. I thought that using temporary variables in this way would make the function faster, but I am worried that creating them is too wasteful. Should I use static? Also the code is written in parallel using openmp, so I can't really declare temp as a global variable?
The variables Ep and Si are defined (using #define). I have only been using C for about 1 month. I tried to look at the assembler code generated by gcc, but I was completely lost.\

I would get rid of the call to pow() for a start:
double ljd(double r) /* Derivative of Lennard Jones Potential for Argon with
respect to distance (r) */
{
double temp;
temp = Si / r;
temp = temp * temp;
temp = temp * temp * temp;
return ( (24.0 * Ep / r) * (temp - (2.0 * temp * temp)) );
}

On my architecture (intel Centrino Duo, MinGW-gcc 4.5.2 on Windows XP), non-optimized code using pow()
static inline double ljd(double r)
{
return 24 * Ep / Si * (pow(Si / r, 7) - 2 * pow(Si / r, 13));
}
actually outperforms your version if -ffast-math is provided.
The generated assembly (using some arbitrary values for Ep and Si) looks like this:
fldl LC0
fdivl 8(%ebp)
fld %st(0)
fmul %st(1), %st
fmul %st, %st(1)
fld %st(0)
fmul %st(1), %st
fmul %st(2), %st
fxch %st(1)
fmul %st(2), %st
fmul %st(0), %st
fmulp %st, %st(2)
fxch %st(1)
fadd %st(0), %st
fsubrp %st, %st(1)
fmull LC1

Well, as I've said before, compilers suck at optimising floating point code for many reasons. So, here's an Intel assembly version that should be faster (compiled using DevStudio 2005):
const double Si6 = /*whatever pow(Si,6) is*/;
const double Si_value = /*whatever Si is*/; /* need _value as Si is a register name! */
const double Ep24 = /*whatever 24.Ep is*/;
double ljd (double r)
{
double result;
__asm
{
fld qword ptr [r]
fld st(0)
fmul st(0),st(0)
fld st(0)
fmul st(0),st(0)
fmulp st(1),st(0)
fld qword ptr [Si6]
fdivrp st(1),st(0)
fld st(0)
fld1
fsub st(0),st(1)
fsubrp st(1),st(0)
fmulp st(1),st(0)
fld qword ptr [Ep24]
fmulp st(1),st(0)
fdivrp st(1),st(0)
fstp qword ptr [result]
}
return result;
}
This version will produce slightly different results to the version posted. The compiler will probably be writing intermediate results to RAM in the original code. This will lose precision since the (Intel) FPU operates at 80bits internally whereas the double type is only 64bits. The above assembler will not lose precision in the intermediate results, it is all done at 80bits. Only the final result is rounded to 64bits.

The local variable is just fine. It doesn't cost anything. Leave it alone.
As others said, get rid of the pow call. It can't be any faster than simply squaring the number, and it could be a lot slower.
That said, just because the function is active 80+% of the time does not mean it's a problem. It only means if there is something you can optimize, it's either in there, or in something it calls (like pow) or in something that calls it.
If you try random pausing, which is a method of stack-sampling, you will see that routine on 80+% of samples, plus the lines within it that are responsible for the time, plus its callers that are responsible for the time, and their callers, and so on. All the lines of code on the stack are jointly responsible for the time.
Optimality is not when nothing take a large percent of time, it is when nothing you can fix takes a large percent of time.

Is your application structured in such a way that you could profitably vectorise this function, calculating several independent values in parallel? This would allow you to utilise hardware vector units, such as SSE.
It also seems like you would be better off keeping 1/r values around, rather than r itself.
This is an example explicitly using SSE2 instructions to implement the function. ljd() calculates two values at once.
static __m128d ljd(__m128d r)
{
static const __m128d two = { 2.0, 2.0 };
static const __m128d si = { Si, Si };
static const __m128d ep24 = { 24 * Ep, 24 * Ep };
__m128d temp2, temp3;
__m128d temp = _mm_div_pd(si, r);
__m128d ep24_r = _mm_div_pd(ep24, r);
temp = _mm_mul_pd(temp, temp);
temp2 = _mm_mul_pd(temp, temp);
temp2 = _mm_mul_pd(temp2, temp);
temp3 = _mm_mul_pd(temp2, temp2);
temp3 = _mm_mul_pd(temp3, two);
return _mm_mul_pd(ep24_r, _mm_sub_pd(temp2, temp3));
}
/* Requires `out` and `in` to be 16-byte aligned */
void ljd_array(double out[], const double in[], int n)
{
int i;
for (i = 0; i < n; i += 2)
{
_mm_store_pd(out + i, ljd(_mm_load_pd(in + i)));
}
}
However, it is important to note that recent versions of GCC are often able to vectorise functions like this automatically, as long as you're targetting the right architecture and have optimisation enabled. If you're targetting 32-bit x86, try compiling with -msse2 -O3, and adjust things such that the input and output arrays are 16-byte aligned.
Alignment for static and automatic arrays can be achieved under gcc with the type attribute __attribute__ ((aligned (16))), and for dynamic arrays using the posix_memalign() function.

Ah, that brings me back some memories... I've done MD with Lennard Jones potential years ago.
In my scenario (not huge systems) it was enough to replace the pow() with several multiplications, as suggested by another answer. I also restricted the range of neighbours, effectively truncating the potential at about r ~ 3.5 and applying some standard thermodynamic correction afterwards.
But if all this is not enough for you, I suggest to precompute the function for closely spaced values of r and simply interpolate (linear or quadratic, I'd say).

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Optimization of matrix and vector multiplication in C - c

Related

Efficient C vectors for generic SIMD (SSE, AVX, NEON) test for zero matches. (find FP max absolute value and index)

How to square two complex doubles with 256-bit AVX vectors?

Handling zeroes in _mm256_rsqrt_ps()

Fastest way to compute distance squared

Is there a better way to optimize the lennard jones potential function?

Categories

Resources