I'm trying to implement the following operation using AVX:
for (i=0; i<N; i++) {
    for(j=0; j<N; j++) {
        for (k=0; k<K; k++) {
            d[i][j] += 2 * a[i][k] * ( b[k][j]- c[k]);
        }
    }
}
for (int i=0; i<N; i++){
    f+= d[ind[i]][ind[i]]/2;
}
where d is an NxN matrix, a is NxK, b is KxN, and c is a vector of length K. All of them are doubles. All the data is aligned, and I am using #pragma vector aligned to help the compiler (gcc).
I know how to use AVX extensions with one-dimensional arrays, but I'm finding it a bit tricky to do with matrices. Currently I have the following, but I'm not getting correct results:
for (int i=0; i< floor (N/4); i++){
    for (int j=0; j< floor (N/4); j++){
        __m256d D, A, B, C;
        D = _mm256_setzero_pd();
        #pragma vector aligned
        for (int k=0; k<K_MAX; k++){
            A = _mm256_load_pd(a[i] + k*4);
            B = _mm256_load_pd(b[k] + j*4);
            C = _mm256_load_pd(c + 4*k);
            B = _mm256_sub_pd(B, C);
            A = _mm256_mul_pd(A, B);
            D = _mm256_add_pd(_mm256_set1_pd(2.0), A);
            _mm256_store_pd(d[i] + j*4, D);
        }
    }
}
for (int i=0; i<N; i++){
    f+= d[ind[i]][ind[i]]/2;
}
I hope someone can tell me where the mistake is.
Thanks in advance.
NOTE: I'm not willing to introduce OpenMP, just using SIMD Intel instructions
Assuming both N and K are relatively large (much larger than 4, which is the hardware vector width in doubles), here's one way to vectorize your main loop. Untested.
The main idea is vectorizing the middle loop instead of the inner one. This is done for two reasons:
1. This avoids horizontal operations. When vectorizing just the inner loop, we would have to compute the horizontal sum of a vector.
2. The b[k][j] load has an unfortunate memory access pattern when loading 4 consecutive k values: it needs either 4 separate load instructions or a gather load, and both are relatively slow. Loading elements for 4 consecutive j values is a single full-vector load, which is very efficient, especially since you align your inputs.
const int N_aligned = ( N / 4 ) * 4;
for( int i = 0; i < N; i++ )
{
    int j = 0;
    for( ; j < N_aligned; j += 4 )
    {
        // Load 4 scalars from d
        __m256d dv = _mm256_loadu_pd( &d[ i ][ j ] );
        // Run the inner loop which only loads from RAM but never stores any data
        for( int k = 0; k < K; k++ )
        {
            __m256d av = _mm256_broadcast_sd( &a[ i ][ k ] );
            __m256d bv = _mm256_loadu_pd( &b[ k ][ j ] );
            __m256d cv = _mm256_broadcast_sd( &c[ k ] );
            // dv += 2*av*( bv - cv )
            __m256d t1 = _mm256_add_pd( av, av ); // 2*av
            __m256d t2 = _mm256_sub_pd( bv, cv ); // bv - cv
            dv = _mm256_fmadd_pd( t1, t2, dv );
        }
        // Store the updated 4 values
        _mm256_storeu_pd( &d[ i ][ j ], dv );
    }
    // Handle remainder with scalar code
    for( ; j < N; j++ )
    {
        double ds = d[ i ][ j ];
        for( int k = 0; k < K; k++ )
            ds += 2 * a[ i ][ k ] * ( b[ k ][ j ] - c[ k ] );
        d[ i ][ j ] = ds;
    }
}
If you want to optimize further, try to unroll the inner loop by a small factor like 2, use 2 independent accumulators initialized with _mm256_setzero_pd(), add them after the loop. It could be that on some processors, this version stalls on the latency of the FMA instruction, instead of saturating load ports or ALUs. Multiple independent accumulators sometimes help.
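As a rough illustration of the multiple-accumulator idea, here is a minimal scalar sketch for a single d[i][j] element. The function name accumulate_two_chains and the flat row-major layout with row stride ldb are assumptions made for this example, not part of the code above; the same structure would apply to two __m256d accumulators.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns sum over k of 2*a_row[k]*(b[k][j] - c[k]),
   where b is row-major with row stride ldb. Two independent accumulators
   keep consecutive additions from forming one long dependency chain. */
double accumulate_two_chains(const double *a_row, const double *b, size_t ldb,
                             const double *c, size_t j, size_t K)
{
    assert(j < ldb);
    double acc0 = 0.0, acc1 = 0.0;
    size_t k = 0;
    for (; k + 2 <= K; k += 2) {
        acc0 += 2.0 * a_row[k]     * (b[k * ldb + j]       - c[k]);
        acc1 += 2.0 * a_row[k + 1] * (b[(k + 1) * ldb + j] - c[k + 1]);
    }
    if (k < K)  /* odd K: one leftover term */
        acc0 += 2.0 * a_row[k] * (b[k * ldb + j] - c[k]);
    return acc0 + acc1;  /* combine the chains once, after the loop */
}
```

The caller would then do something like d[i][j] += accumulate_two_chains(a[i], &b[0][0], N, c, j, K). Note that splitting the sum changes the order of floating-point additions, so results can differ in the last bits from the single-accumulator loop.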
b[k][j] is your problem: the elements b[k + 0..3][j] aren't contiguous in memory. Using SIMD (in a reasonable / useful way) is not something you can drop in to the classic naive matmul loop. See What Every Programmer Should Know About Memory? - there's an appendix with an example of an SSE2 matmul (with cache-blocking) which shows how to do operations in a different order that's SIMD-friendly.
Soonts's answer shows how to vectorize at all, by vectorizing over j, the middle loop. But that leaves a relatively poor memory access pattern, and 3 loads + 3 ALU operations inside the loop. (This answer started out as a comment on it, see it for the code I'm talking about and proposing changes to.)
Loop interchange should make it possible to have j as the inner-most loop. That would mean doing stores for d[i][j] += ... inside the inner-most loop, but OTOH it makes more loop invariants in 2 * a[i][k] * ( b[k][j]- c[k] ) so you can usefully transform to d[i][j] += (2*a_ik) * b[k][j] - (2*a_ik*c_k), i.e. one VFMSUBPD and one VADDPD per load&store. (With the bv load folding into the FMSUB as a memory source operand, and the dv load folding into VADDPD, so hopefully only 3 uops for the front-end, including a separate store, not including loop overhead.)
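To make the interchange and the hoisted invariants concrete, here is a minimal scalar sketch. The function name matop_ikj and the flat, contiguous row-major arrays are assumptions for this example only. A compiler may or may not turn the inner loop into the FMSUB + ADD pair described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar version with loop order i, k, j.
   d is N x N, a is N x K, b is K x N, all contiguous row-major. */
void matop_ikj(size_t N, size_t K, double *d, const double *a,
               const double *b, const double *c)
{
    assert(d && a && b && c);
    for (size_t i = 0; i < N; i++) {
        double *row_d = d + i * N;
        for (size_t k = 0; k < K; k++) {
            const double *row_b = b + k * N;
            const double two_a  = 2.0 * a[i * K + k];  /* invariant over j */
            const double offset = two_a * c[k];        /* invariant over j */
            for (size_t j = 0; j < N; j++)
                row_d[j] += two_a * row_b[j] - offset; /* sequential in d and b */
        }
    }
}
```

Because the subtraction now happens after the multiply, rounding can differ slightly from 2 * a[i][k] * (b[k][j] - c[k]); whether that matters depends on your data.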
The compiler will have to unroll and avoid an indexed addressing mode so the store-address uop can stay micro-fused and run on port 7 on Intel CPUs (Haswell through Skylake-family), not competing with the two loads. Ice Lake doesn't have that problem, having two full independent store-AGUs separate from the two load AGUs. But probably still needs some loop unrolling to avoid a front-end bottleneck.
Here's an example, untested (original version contributed by Soonts, thanks). It optimizes down to 2 FP math ops in the loop in a different way: simply hoisting 2*a out of the loop, doing SUB then FMA for dv += (2av)*(sub_result). But bv can't be a memory source operand for vsubpd because we need bv - cv, and only the second (subtracted) operand can come from memory. We can fix that by negating cv to allow (-cv) + bv in the inner loop, with bv as a memory source operand. Sometimes compilers will do things like that for you, but here it seems they didn't, so I did it manually. Otherwise we get a separate vmovupd load going through the front-end.
#include <stdint.h>
#include <stdlib.h>
#include <immintrin.h>
// This double [N][N] C99 VLA syntax isn't portable to C++ even with GNU extensions
// restrict tells the compiler the output doesn't overlap with any of the inputs
void matop(size_t N, size_t K, double d[restrict N][N], const double a[restrict N][K], const double b[restrict K][N], const double c[restrict K])
{
    for( size_t i = 0; i < N; i++ ) {
        // loop-invariant pointers for this outer iteration
        //double* restrict rowDi = &d[ i ][ 0 ];
        const double* restrict rowAi = &a[ i ][ 0 ];
        for( size_t k = 0; k < K; k++ ) {
            const double* restrict rowBk = &b[ k ][ 0 ];
            double* restrict rowDi = &d[ i ][ 0 ];
#if 0 // pure scalar
            // auto-vectorizes ok; still a lot of extra checking outside outermost loop even with restrict
            for (size_t j=0 ; j<N ; j++){
                rowDi[j] += 2*rowAi[k] * (rowBk[j] - c[k]);
            }
#else // SIMD inner loop with cleanup
            // *** TODO: unroll over 2 or 3 i values
            // and maybe also 2 or 3 k values, to reuse each bv a few times while it's loaded.
            __m256d av = _mm256_broadcast_sd( rowAi + k );
            av = _mm256_add_pd( av, av ); // 2*a[ i ][ k ] broadcasted
            const __m256d cv = _mm256_broadcast_sd( &c[ k ] );
            const __m256d minus_ck = _mm256_xor_pd(cv, _mm256_set1_pd(-0.0)); // broadcasted -c[k]
            //const size_t N_aligned = ( (size_t)N / 4 ) * 4;
            size_t N_aligned = N & -4; // round down to a multiple of 4 j iterations
            const double* endBk = rowBk + N_aligned;
            //for( ; j < N_aligned; j += 4 )
            for ( ; rowBk != endBk ; rowBk += 4, rowDi += 4) { // coax GCC into using pointer-increments in the asm, instead of j+=4
                // Load the output vector to update
                __m256d dv = _mm256_loadu_pd( rowDi );
                // Update with FMA
                __m256d bv = _mm256_loadu_pd( rowBk );
                __m256d t2 = _mm256_add_pd( minus_ck, bv ); // bv - cv
                dv = _mm256_fmadd_pd( av, t2, dv );
                // Store back to the same address
                _mm256_storeu_pd( rowDi, dv );
            }
            // rowDi and rowBk point to the double after the last full vector
            // The remainder, if you can't pad your rows to a multiple of 4 and step on that padding
            for(int j=0 ; j < (N&3); j++ )
                rowDi[ j ] += _mm256_cvtsd_f64( av ) * ( rowBk[ j ] + _mm256_cvtsd_f64( minus_ck ) );
#endif
        }
    }
}
Without unrolling (https://godbolt.org/z/6WeYKbnYY), GCC11's inner loop asm looks like this, all single-uop instructions that can stay micro-fused even in the back-end on Haswell and later.
.L7: # do{
vaddpd ymm0, ymm2, YMMWORD PTR [rax] # -c[k] + rowBk[0..3]
add rax, 32 # rowBk += 4
add rdx, 32 # rowDi += 4
vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdx-32] # fma(2aik, Bkj-ck, Dij)
vmovupd YMMWORD PTR [rdx-32], ymm0 # store FMA result
cmp rcx, rax
jne .L7 # }while(p != endp)
But it's 6 total uops, 3 of them loop overhead (pointer increments and fused cmp+jne), so Haswell through Skylake could only run it at 1 iteration per 1.5 clocks, bottlenecked on the 4-wide issue stage in the front-end. (Which wouldn't let OoO exec get ahead on executing the pointer increments and loop branch, to notice early and recover while the back-end was still chewing on older loads and FP math.)
So loop unrolling should be helpful, since we managed to coax GCC into using non-indexed (pointer-increment) addressing modes. Without that it's relatively useless with AVX code on Intel Haswell/Skylake CPUs, with each vaddpd ymm5, ymm4, [rax + r14] decoding as 1 micro-fused uop, but unlaminating into 2 at issue into the back-end, not helping us get more work through the narrowest part of the front-end. (A lot like if we'd used a separate vmovupd load like we got with _mm256_sub_pd(bv, cv) instead of add(bv, -cv).)
The vmovupd ymmword ptr [rbp + r14], ymm5 store stays micro-fused but can't run on port 7, limiting us to a total of 2 memory operations per clock (up to 1 of which can be a store.) So a best case of 1.5 cycles per vector.
Compiled on https://godbolt.org/z/rd3rn9zor with GCC and clang -O3 -march=skylake -funroll-loops. GCC does actually use pointer increments with loads folded into 8x vaddpd and 8x vfmadd213pd. But clang uses indexed addressing modes and doesn't unroll. (You probably don't want -funroll-loops for your whole program, so either compile this separately or manually unroll. GCC's unrolling fully peels a prologue that does 0..7 vector iterations before entering the actual SIMD loop, so it's quite aggressive.)
GCC's loop-unrolling looks useful here for large N, amortizing the pointer increments and loop overhead over multiple vectors. (GCC doesn't know how to invent multiple accumulators for FP dep chains in a dot product for example, making its unrolling useless in that case, unlike clang.)
Unfortunately clang doesn't unroll the inner loop for us, but it does use vmaskmovpd in an interesting way for the cleanup.
It's maybe good that we use a separate loop counter for cleanup, in a way that lets the compiler easily prove the trip-count for the cleanup is 0..3, so it doesn't try to auto-vectorize with YMM.
The other way to do it is to use an actual j variable for the inner loop and its cleanup, more like Soonts' edit. IIRC, compilers did try to auto-vectorize the cleanup for this, wasting code size and some always-false branching.
size_t j = 0; // used for cleanup loop after
for( ; j < N_aligned; j += 4 )
{
    // Load the output vector to update
    __m256d dv = _mm256_loadu_pd( rowDi + j );
    // Update with FMA
    __m256d bv = _mm256_loadu_pd( rowBk + j );
    __m256d t2 = _mm256_sub_pd( bv, cv ); // bv - cv
    dv = _mm256_fmadd_pd( av, t2, dv );
    // Store back to the same address
    _mm256_storeu_pd( rowDi + j, dv );
}
// The remainder, if you can't pad your rows to a multiple of 4
for( ; j < N; j++ )
    rowDi[ j ] += _mm256_cvtsd_f64( av ) * ( rowBk[ j ] - _mm256_cvtsd_f64( cv ) );
This has a fairly good mix of load&store vs. FP math for modern CPUs (https://agner.org/optimize/ and https://uops.info/), especially Intel where we can do 2 loads and 1 store. I think Zen 2 or 3 can also do 2 loads + 1 store. It needs to hit in L1d cache to sustain that kind of throughput, though. (And even then, Intel's optimization manual says the max sustained L1d bandwidth on Skylake is less than the full 96 bytes/cycle that would require. More like mid-80s IIRC, so we can't quite expect one result vector per cycle, even with sufficient unrolling to avoid front-end bottlenecks.)
There's no latency bottleneck, since we move on to a new dv every iteration instead of accumulating anything across loop iterations.
The other advantage to this is that memory access to d[i][j] and b[k][j] would be sequential, with no other memory access in the inner-most loop. (The middle loop would do broadcast-loads of a[i][k] and c[k]. Those seem likely to cache-miss if the inner loop evicts too much; with some unrolling of the outer loop, one SIMD load and some shuffling could help, but probably cache-blocking would avoid a need for that.)
Looping over the same d[i] row repeatedly for different b[k] rows gives us locality for the part that we're modifying (i.e. use k as the middle loop, keeping i as the outer-most.) With k as the outer loop, we'd be looping K times over the whole d[0..N-1][0..N-1], probably needing to write + read each pass all that way out to whichever level of cache or memory could hold it.
But really you'd still want to cache-block if each row is really long, so you avoid the cache misses to bring all of b[][] in from DRAM N times. And avoid evicting the stuff you're going to broadcast-load next.
Smarter unrolling: a first step towards cache-blocking
Some of the above problems with maxing out load/store execution unit throughput, and requiring the compiler to use non-indexed addressing modes, can go away if we do more with each vector of data while it's loaded.
For example, instead of working on just one row of d[][], we could be working on 2, 3, or 4. Then every (rowBk[j] - c[k]) result can be used that many times (with a different 2aik) for a d[i+unroll][j + 0..vec] vector.
And we can also load a couple different (rowBk+K*0..unroll)[j+0..3], each with a corresponding minus_ck0, minus_ck1, etc. (Or keep an array of vectors; as long as it's small and the compiler has enough registers, the elements won't exist in memory.)
With multiple bv-cv and dv vectors in registers all at the same time, we can do significantly more FMAs per load without increasing the total amount of FP work. It takes more registers for constants, though, otherwise we could be defeating the purpose by forcing more reloads.
The d[i][j] += (2*a_ik) * b[k][j] - (2*a_ik*c_k) transformation wouldn't be useful here; we want to keep bv-cv separate from i so we can reuse that result as an input for different FMAs.
The b[k][j]+(-c[k]) can still benefit from micro-fusion of a load with a vaddpd so ideally it would still use a pointer increment, but the front-end might not be a bottleneck anymore.
Don't overdo it with this; too many memory input streams can be a problem for cache conflict misses especially for some N values that might create aliasing, and also for HW prefetching tracking them all. (Although Intel's L2 streamer is said to track 1 forward and 1 backward stream per 4k page, IIRC.) Probably about 4 to 8 ish streams is ok. But if d[][] isn't missing in L1d, then it's not really an input stream from memory. You don't want your b[][] input rows to be evicting the d data, though, since you'll be looping over 2 to 4 rows of d data repeatedly.
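A minimal scalar sketch of the "work on 2 rows of d at once" idea, assuming flat contiguous row-major arrays; the name matop_two_rows is made up for this example. Each (b[k][j] - c[k]) difference is computed once and feeds two multiply-adds, which is the reuse described above. A real version would do the same with vectors and probably unroll over k as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: process rows i and i+1 of d together so that every
   (b[k][j] - c[k]) value is used twice while it is in a register. */
void matop_two_rows(size_t N, size_t K, double *d, const double *a,
                    const double *b, const double *c)
{
    assert(d && a && b && c);
    size_t i = 0;
    for (; i + 2 <= N; i += 2) {
        double *d0 = d + i * N;
        double *d1 = d + (i + 1) * N;
        for (size_t k = 0; k < K; k++) {
            const double *row_b  = b + k * N;
            const double  ck     = c[k];
            const double  two_a0 = 2.0 * a[i * K + k];
            const double  two_a1 = 2.0 * a[(i + 1) * K + k];
            for (size_t j = 0; j < N; j++) {
                const double diff = row_b[j] - ck;  /* one load + one subtract */
                d0[j] += two_a0 * diff;             /* reused for row i   */
                d1[j] += two_a1 * diff;             /* reused for row i+1 */
            }
        }
    }
    for (; i < N; i++) {  /* odd N: the last row on its own */
        double *d0 = d + i * N;
        for (size_t k = 0; k < K; k++) {
            const double two_a = 2.0 * a[i * K + k];
            for (size_t j = 0; j < N; j++)
                d0[j] += two_a * (b[k * N + j] - c[k]);
        }
    }
}
```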
By comparison: Soonts's loop - less frequent cleanup, but worse memory access pattern.
Soonts's current loop with 3 loads and 3 ALU operations isn't ideal, although 1 load per FMA operation is already ok if they hit in cache (most modern CPUs can do 2 each per clock, although AMD Zen can also do 2 FP adds in parallel with mul/fma). If that extra ALU operation was a bottleneck, we could pre-multiply a[][] by 2 once, taking only O(N*K) work vs. O(N^2*K) to do it on the fly. But it's probably not a bottleneck and thus not worth it.
More importantly, the memory access pattern in Soonts's current answer is looping forward 1 double at a time for broadcast loads of c[k] and a[i][k] which is good, but the bv = _mm256_loadu_pd of b[k][j + 0..3] is unfortunately striding down a column.
If you're going to unroll as Soonts suggested, don't just do two dep chains for one dv, do at least two vectors, d[i][j + 0..3] and 4..7 so you use a whole 64 bytes (full cache line) from every b[k][j] you touch. Or four vectors for a pair of cache-lines. (Intel CPUs at least use an adjacent-line prefetcher, which likes to complete a 128-byte aligned pair of cache lines, so you'd benefit from aligning the rows of b[][] by 128. Or at least by 64, and get some benefit from adjacent-line prefetching.)
If a vertical slice of b[][] fits in some level of cache (along with the row of d[i][] you're currently accumulating into), the next stride down the next group of columns can benefit from that prefetching and locality. If not, fully using the lines you touch is more important, so they don't have to get pulled in again later.
So with Soonts's vectorization strategy, for large problems where this won't fit in L1d cache, probably good to make sure b's rows are aligned by 64, even if that means padding at the end of each row. (The storage geometry doesn't have to match the actual matrix dimension; you pass N and row_stride separately. You use one for index calculations, the other for loop bounds.)
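A minimal sketch of separating storage geometry from the logical dimension. The helper names padded_stride, alloc_padded_matrix and row_sum are made up for this example, and it assumes a C11 toolchain for aligned_alloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Round a column count up to a multiple of 8 doubles (64 bytes). */
size_t padded_stride(size_t cols)
{
    return (cols + 7) & ~(size_t)7;
}

/* Hypothetical helper: allocate a zeroed rows x cols matrix in which every
   row starts on a 64-byte boundary. Release it with free(). */
double *alloc_padded_matrix(size_t rows, size_t cols, size_t *row_stride)
{
    assert(rows > 0 && cols > 0 && row_stride != NULL);
    *row_stride = padded_stride(cols);
    size_t bytes = rows * *row_stride * sizeof(double); /* a multiple of 64 */
    double *m = aligned_alloc(64, bytes);
    if (m != NULL)
        memset(m, 0, bytes);
    return m;
}

/* Sum one logical row: the loop bound is cols, the addressing uses row_stride. */
double row_sum(const double *m, size_t row_stride, size_t cols, size_t r)
{
    const double *row = m + r * row_stride;
    double s = 0.0;
    for (size_t j = 0; j < cols; j++)
        s += row[j];
    return s;
}
```

With this layout, element (r, j) lives at m[r * row_stride + j], and the padding at the end of each row is never read by loops bounded by cols.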
Related
/* Inner product. Accumulate in temporary */
void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(u);
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum = (data_t) 0;
    for (i = 0; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}
Write a version of the inner product procedure described in the above problem that uses 6 × 1a loop unrolling. For x86-64, our measurements of the unrolled version give a CPE of 1.07 for integer data but still 3.01 for both floating-point data types.
My code for 6*1a version of loop unrolling
void inner4(vec_ptr u, vec_ptr v, data_t *dest){
    long i;
    long length = vec_length(u);
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    long limit = length -5;
    data_t sum = (data_t) 0;
    for(i=0; i<limit; i+=6){
        sum = sum +
            ((udata[ i ] * vdata[ i ]
            + udata[ i+1 ] * vdata[ i+1 ])
            + (udata[ i+2 ] * vdata[ i+2 ]
            + udata[ i+3 ] * vdata[ i+3 ]))
            + ((udata[ i+4 ] * vdata[ i+4 ])
            + udata[ i+5 ] * vdata[ i+5 ]);
    }
    /* finish the remaining elements, continuing from where the unrolled loop stopped */
    for (; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}
Question: Explain why any (scalar) version of an inner product procedure running on an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00.
Any idea how to solve the problem?
Your unroll doesn't help with the FP latency bottleneck:
sum + x + y + z without -ffast-math is the same order of operations as sum += x; sum += y; ... so you haven't done anything about the single dependency chain running through all the + operations. Loop overhead (or front-end throughput) is not the bottleneck, it's the 3 cycle latency of addss on Haswell, so this unroll makes basically no difference.
What would work is sum += u[i]*v[i] + u[i+1]*v[i+1] + ... as a way to unroll without multiple accumulators, because then the sum of each group of elements is independent.
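A minimal sketch of that grouping, using plain float arrays instead of the book's vec_ptr / data_t types; the name inner_grouped is made up for this example. Only the final sum += g is on the loop-carried dependency chain, so there is one dependent add per 6 elements instead of six.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 6 x 1a style unroll: the six products and their partial sums
   depend only on this iteration's loads, not on sum. */
float inner_grouped(const float *u, const float *v, long length)
{
    assert(length >= 0);
    float sum = 0.0f;
    long i = 0;
    for (; i + 6 <= length; i += 6) {
        float g = (u[i]     * v[i]     + u[i + 1] * v[i + 1])
                + (u[i + 2] * v[i + 2] + u[i + 3] * v[i + 3])
                + (u[i + 4] * v[i + 4] + u[i + 5] * v[i + 5]);
        sum += g;  /* the only operation on the loop-carried chain */
    }
    for (; i < length; i++)  /* cleanup continues from i, not from 0 */
        sum += u[i] * v[i];
    return sum;
}
```

The grouping changes the order of additions, so the result can differ from the strictly sequential sum in the last bits.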
It costs slightly more math operations that way, like starting with a mul and ending with an add, but the middle ones can still contract into FMAs if you compile with -march=haswell. See comments on AVX performance slower for bitwise xor op and popcount for an example of GCC turning a naive unroll like sum += u0*v0; sum += u1*v1 into sum += u0*v0 + u1*v1;. In that case the problem was slightly different: sum of squared differences like sum += (u0-v0)**2 + (u1-v1)**2;, but it boils down to the same latency problem of ultimately doing some multiplies and adds.
The other way to solve the problem is with multiple accumulators, allowing all the operations to be FMAs. But Haswell has 5-cycle latency FMA and 3-cycle latency addss, so doing the sum += ... addition on its own, not as part of an FMA, actually helps with the latency bottleneck on Haswell (unlike on Skylake, where add/sub/mul are all 4-cycle latency). The following all show unrolling with multiple accumulators, instead of adding groups together as a first step towards pairwise summation like you're doing:
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)
When, if ever, is loop unrolling still useful?
Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell
FP math instruction throughput isn't the bottleneck for a big dot product on modern CPUs, only latency. Or load throughput if you unroll enough.
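For reference, a minimal scalar sketch of the multiple-accumulator pattern those links describe. The name dot_four_acc and the choice of 4 accumulators are assumptions for illustration; how many you need depends on the latency and throughput of the target CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dot product with four independent dependency chains. */
double dot_four_acc(const double *u, const double *v, size_t n)
{
    assert(n == 0 || (u != NULL && v != NULL));
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += u[i]     * v[i];      /* each line can become its own FMA chain */
        s1 += u[i + 1] * v[i + 1];
        s2 += u[i + 2] * v[i + 2];
        s3 += u[i + 3] * v[i + 3];
    }
    for (; i < n; i++)  /* leftover elements */
        s0 += u[i] * v[i];
    return (s0 + s1) + (s2 + s3);  /* combine once, after the loop */
}
```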
Explain why any (scalar) version of an inner product procedure running on an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00.
Each element takes 2 loads, and with only 2 load ports, that's a hard throughput bottleneck. (https://agner.org/optimize/ / https://www.realworldtech.com/haswell-cpu/5/)
I'm assuming you're counting an "element" as an i value, a pair of floats, one each from udata[i] and vdata[i]. The FP FMA throughput bottleneck is also 2/clock on Haswell (whether they're scalar, 128-bit, or 256-bit vectors), but dot product takes 2 loads per FMA. In theory, even Sandybridge or maybe even K8 could achieve 1 element per clock, with separate mul and add instructions, since they both support 2 loads per clock, and have a wide enough pipeline to get load / mulss / addss through the pipeline with some room to spare.
I want to see if it's possible to write some generic SIMD code that can compile efficiently. Mostly for SSE, AVX, and NEON. A simplified version of the problem is: Find the maximum absolute value of an array of floating point numbers and return both the value and the index. It is the last part, the index of the maximum, that causes the problem. There doesn't seem to be a very good way to write code that has a branch.
See update at end for finished code using some of the suggested answers.
Here's a sample implementation (more complete version on godbolt):
#define VLEN 8
typedef float vNs __attribute__((vector_size(VLEN*sizeof(float))));
typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));
#define SWAP128 4,5,6,7, 0,1,2,3
#define SWAP64 2,3, 0,1, 6,7, 4,5
#define SWAP32 1, 0, 3, 2, 5, 4, 7, 6
static bool any(vNb x) {
    x = x | __builtin_shufflevector(x,x, SWAP128);
    x = x | __builtin_shufflevector(x,x, SWAP64);
    x = x | __builtin_shufflevector(x,x, SWAP32);
    return x[0];
}
float maxabs(float* __attribute__((aligned(32))) data, unsigned n, unsigned *index) {
    vNs max = {0,0,0,0,0,0,0,0};
    vNs tmax;
    unsigned imax = 0;
    for (unsigned i = 0 ; i < n; i += VLEN) {
        vNs t = *(vNs*)(data + i);
        t = -t < t ? t : -t; // Absolute value
        vNb cmp = t > max;
        if (any(cmp)) {
            tmax = t; imax = i;
            // broadcast horizontal max of t into every element of max
            vNs tswap128 = __builtin_shufflevector(t,t, SWAP128);
            t = t < tswap128 ? tswap128 : t;
            vNs tswap64 = __builtin_shufflevector(t,t, SWAP64);
            t = t < tswap64 ? tswap64 : t;
            vNs tswap32 = __builtin_shufflevector(t,t, SWAP32);
            max = t < tswap32 ? tswap32 : t;
        }
    }
    // To simplify example, ignore finding index of true value in tmax==max
    *index = imax; // + which(tmax == max);
    return max[0];
}
Code on godbolt allows changing VLEN to 8 or 4.
This mostly works very well. For AVX/SSE the absolute value becomes t & 0x7fffffff using a (v)andps, i.e. clear the sign bit. For NEON it's done with vneg + fmaxnm. The block to find and broadcast the horizontal max becomes an efficient sequence of permute and max instructions. gcc is able to use NEON fabs for absolute value.
The 8 element vector on the 4 element SSE/NEON targets works well on clang. It uses a pair of instructions on two sets of registers, and for the SWAP128 horizontal op it will max or OR the two registers without any unnecessary permute. gcc on the other hand really can't handle this and produces mostly non-SIMD code. If we reduce the vector length to 4, gcc works fine for SSE and NEON.
But there's a problem with if (any(cmp)). For clang + SSE/AVX, it works well, vcmpltps + vptest, with an orps to go from 8->4 on SSE.
But gcc and clang on NEON do all the permutes and ORs, then move the result to a gp register to test.
Is there some bit of code, other than architecture specific intrinsics, to get ptest with gcc and vmaxvq with clang/gcc and NEON?
I tried some other methods, like if (x[0] || x[1] || ... x[7]) but they were worse.
Update
I've created an updated example that shows two different implementations, both the original and "indices in a vector" method as suggested by chtz and shown in Aki Suihkonen's answer. One can see the resulting SSE and NEON output.
While some might be skeptical, the compiler does produce very good code from the generic SIMD (not auto-vectorization!) C++ code. On SSE/AVX, I see very little room to improve the code in the loop. The NEON version is still troubled by a sub-optimal implementation of "any()".
Unless the data is usually in ascending order, or nearly so, my original version is still fastest on SSE/AVX. I haven't tested on NEON. This is because most loop iterations do not find a new max value and it's best to optimize for that case. The "indices in a vector" method produces a tighter loop and the compiler does a better job too, but the common case is just a bit slower on SSE/AVX. The common case might be equal or faster on NEON.
Some notes on writing generic SIMD code.
The absolute value of a vector of floats can be found with the following. It produces optimal code on SSE/AVX (an AND with a mask that clears the sign bit) and on NEON (the fabs instruction).
static vNs vabs(vNs x) {
    return -x < x ? x : -x;
}
This will do a vertical max efficiently on SSE/AVX/NEON. It doesn't do a compare; it produces the architecture's "max" instruction. On NEON, changing it to use > instead of < causes the compiler to produce very bad scalar code. Something with denormals or exceptions I guess.
template <typename v> // Deduce vector type (float, unsigned, etc.)
static v vmax(v a, v b) {
    return a < b ? b : a; // compiles best with "<" as compare op
}
This code will broadcast the horizontal max across a register. It compiles very well on SSE/AVX. On NEON, it would probably be better if the compiler could use a horizontal max instruction and then broadcast the result. I was impressed to see that if one uses 8 element vectors on SSE/NEON, which have only 4 element registers, the compiler is smart enough to use just one register for the broadcasted result, since the top 4 and bottom 4 elements are the same.
template <typename v>
static v hmax(v x) {
    if (VLEN >= 8)
        x = vmax(x, __builtin_shufflevector(x,x, SWAP128));
    x = vmax(x, __builtin_shufflevector(x,x, SWAP64));
    return vmax(x, __builtin_shufflevector(x,x, SWAP32));
}
This is the best "any()" I found. It is optimal on SSE/AVX, using a single ptest instruction. On NEON it does the permutes and ORs, instead of a horizontal max instruction, but I haven't found a way to get anything better on NEON.
static bool any(vNb x) {
    if (VLEN >= 8)
        x |= __builtin_shufflevector(x,x, SWAP128);
    x |= __builtin_shufflevector(x,x, SWAP64);
    x |= __builtin_shufflevector(x,x, SWAP32);
    return x[0];
}
Also interesting, on AVX the code i = i + 1 will be compiled to vpsubd ymmI, ymmI, ymmNegativeOne, i.e. subtract -1. Why? Because a vector of -1s is produced with vpcmpeqd ymm0, ymm0, ymm0 and that's faster than broadcasting a vector of 1s.
Here is the best which() I've come up with. This gives you the index of the 1st true value in a vector of booleans (0 = false, -1 = true). One can do somewhat better on AVX with movemask. I don't know about the best NEON.
// vector of signed ints
typedef int vNi __attribute__((vector_size(VLEN*sizeof(int))));
// vector of bytes, same number of elements, 1/4 the size
typedef unsigned char vNb __attribute__((vector_size(VLEN*sizeof(unsigned char))));
// scalar type the same size as the byte vector
using sNb = std::conditional_t<VLEN == 4, uint32_t, uint64_t>;
static int which(vNi x) {
    vNb cidx = __builtin_convertvector(x, vNb);
    return __builtin_ctzll((sNb)cidx) / 8u;
}
As commented by chtz, the most generic and typical method is to have another mask to gather indices:
Vec8s indices = { 0,1,2,3,4,5,6,7};
Vec8s max_idx = indices;
Vec8f max_abs = abs(load8(ptr));
for (auto i = 8; i + 8 <= vec_length; i+=8) {
    indices = indices + 8;               // lane indices of the block loaded below
    Vec8f data = abs(load8(ptr + i));
    auto mask = is_greater(data, max_abs);
    max_idx = bitselect(mask, indices, max_idx);
    max_abs = max(max_abs, data);
}
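The pseudocode above leaves out the final step: each lane ends up with its own maximum and index, and these still have to be reduced to one result. Here is a minimal scalar emulation of the whole method in plain C, with arrays standing in for vector registers. The names maxabs_lanes, abs_f and LANES are made up for this sketch, and it assumes NaN-free input whose length is a multiple of LANES.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8   /* emulated vector width, chosen to match Vec8f above */

static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* Scalar emulation of the "indices in a vector" method. */
float maxabs_lanes(const float *x, size_t n, size_t *index)
{
    assert(n >= LANES && n % LANES == 0 && index != NULL);
    float  max_abs[LANES];
    size_t max_idx[LANES];
    for (size_t l = 0; l < LANES; l++) {
        max_abs[l] = abs_f(x[l]);
        max_idx[l] = l;
    }
    /* Main loop: lane l only ever sees elements l, l+LANES, l+2*LANES, ... */
    for (size_t i = LANES; i < n; i += LANES) {
        for (size_t l = 0; l < LANES; l++) {
            float v = abs_f(x[i + l]);
            if (v > max_abs[l]) {   /* stands in for is_greater + bitselect + max */
                max_abs[l] = v;
                max_idx[l] = i + l;
            }
        }
    }
    /* Horizontal reduction, done once after the loop.
       Ties between lanes go to the smaller index. */
    float  best = max_abs[0];
    size_t best_idx = max_idx[0];
    for (size_t l = 1; l < LANES; l++) {
        if (max_abs[l] > best || (max_abs[l] == best && max_idx[l] < best_idx)) {
            best = max_abs[l];
            best_idx = max_idx[l];
        }
    }
    *index = best_idx;
    return best;
}
```

The tie-breaking in the reduction matters: without the index comparison, a later lane could win over an earlier occurrence of the same value.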
Another option is to interleave the values and indices:
auto data = load8s(ptr) & 0x7fffffff; // can load data as int32_t
auto idx = vec8s{0,1,2,3,4,5,6,7};
auto lo = zip_lo(idx, data);
auto hi = zip_hi(idx, data);
for (int i = 8; i + 8 <= size; i+=8) {
    idx = idx + 8;
    auto d1 = load8s(ptr + i) & 0x7fffffff;
    auto lo1 = zip_lo(idx, d1);
    auto hi1 = zip_hi(idx, d1);
    lo = max_u64(lo, lo1);
    hi = max_u64(hi, hi1);
}
This method is especially attractive if the range of inputs is small enough to shift the input left while appending a few bits of the index to the low bits of the same word.
Even in this case we can repurpose 1 bit in the float allowing us to save one half of the bit/index selection operations.
auto data0 = load8u(ptr) << 1; // take abs by shifting left
auto data1 = (load8u(ptr + 8) << 1) + 1; // encode odd index to data
auto mx = max_u32(data0, data1); // the LSB contains one bit of index
Looks like one can use double as the storage, since even SSE2 supports _mm_max_pd (some attention needs to be given to Inf/Nan handling, which don't encode as Inf/Nan any more when reinterpreted as the high part of 64-bit double).
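The interleaving idea can be sketched in scalar C as packing each element into one 64-bit key and taking an unsigned max. The names make_key and maxabs_packed are made up for this example. Storing the inverted index is my own variation, so that among equal values the first occurrence wins; the pseudocode above stores the plain index. NaN inputs are not handled, since their bit patterns compare above infinity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Key layout: bits of |x| in the high 32 bits, inverted index in the low 32.
   For non-NaN floats with the sign bit cleared, unsigned integer order of the
   bit patterns matches numeric order. */
static uint64_t make_key(float x, uint32_t i)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0x7fffffffu;                   /* clear the sign bit: absolute value */
    return ((uint64_t)bits << 32) | (uint32_t)~i;
}

float maxabs_packed(const float *x, uint32_t n, uint32_t *index)
{
    assert(n > 0 && index != NULL);
    uint64_t best = make_key(x[0], 0);
    for (uint32_t i = 1; i < n; i++) {
        uint64_t key = make_key(x[i], i);
        if (key > best)                    /* one compare covers value and index */
            best = key;
    }
    uint32_t bits = (uint32_t)(best >> 32);
    float result;
    memcpy(&result, &bits, sizeof result);
    *index = ~(uint32_t)best;              /* undo the index inversion */
    return result;
}
```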
UPD: the no-aligning issue is fixed now, all the examples on godbolt use aligned reads.
UPD: MISSED THE ABS
Terribly sorry about that, I missed the absolute value from the definition.
I do not have the measurements, but here are all 3 functions vectorised:
max value with abs: https://godbolt.org/z/6Wznrc5qq
find with abs: https://godbolt.org/z/61r9Efxvn
one pass with abs: https://godbolt.org/z/EvdbfnWjb
Asm stashed in a gist
On the method
The way to do max element with simd is to first find the value and then find the index.
Alternatively you have to keep a register of indexes and blend the indexes.
This requires keeping indexes, doing more operations and the problem of the overflow needs to be addressed.
Here are my timings on avx2 by type (char, short and int) for 10'000 bytes of data
The min_element is my implementation of keeping the index.
reduce(min) + find is doing two loops - first get the value, then find where.
For ints (should behave like floats), performance is 25% faster for the two loops solution, at least on my measurements.
For completeness, comparisons against scalar for both methods - this is definitely an operation that should be vectorized.
How to do it
Finding the maximum value is auto-vectorised across all platforms if you write it as a reduce:
if (!arr.size()) return {};
// std::reduce is also ok, just showing for more C ppl
float res = arr[0];
for (int i = 1; i != (int)arr.size(); ++i) {
    res = res > arr[i] ? res : arr[i];
}
return res;
https://godbolt.org/z/EsazWf1vT
Now the find portion is trickier: none of the compilers I know autovectorize find.
The eve library provides a find algorithm: https://godbolt.org/z/93a98x6Tj
Or I explain how to implement find in this talk if you want to do it yourself.
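For clarity, here is what the two-pass structure (reduce, then find) looks like in plain scalar C. This is only a sketch of the shape of the algorithm, not the eve implementation; the names maxabs_two_pass and abs_f are made up, and it assumes NaN-free input.

```c
#include <assert.h>
#include <stddef.h>

static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* Hypothetical two-pass version: pass 1 is a pure reduction, pass 2 a find. */
float maxabs_two_pass(const float *x, size_t n, size_t *index)
{
    assert(n > 0 && index != NULL);
    /* Pass 1: no index bookkeeping, so this is the part compilers vectorize. */
    float best = abs_f(x[0]);
    for (size_t i = 1; i < n; i++) {
        float v = abs_f(x[i]);
        best = best > v ? best : v;
    }
    /* Pass 2: first position whose absolute value equals the maximum.
       It stops early, so on average it reads only part of the array. */
    size_t pos = 0;
    while (pos < n && abs_f(x[pos]) != best)
        pos++;
    *index = pos;
    return best;
}
```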
UPD:
UPD2: changed the blend to max
@Peter Cordes said in the comments that there may be a point to doing the one-pass solution in the case of bigger data.
I have no evidence of this - my measurements point to reduce + find.
However, I hacked together roughly how keeping the index looks (there is an aligning issue at the moment, we should definitely align reads here)
https://godbolt.org/z/djrzobEj4
AVX2 main loop:
.L6:
vmovups ymm6, YMMWORD PTR [rdx]
add rdx, 32
vcmpps ymm3, ymm6, ymm0, 30
vmaxps ymm0, ymm6, ymm0
vpblendvb ymm3, ymm2, ymm1, ymm3
vpaddd ymm1, ymm5, ymm1
vmovdqa ymm2, ymm3
cmp rcx, rdx
jne .L6
ARM-64 main loop:
.L6:
ldr q3, [x0], 16
fcmgt v4.4s, v3.4s, v0.4s
fmax v0.4s, v3.4s, v0.4s
bit v1.16b, v2.16b, v4.16b
add v2.4s, v2.4s, v5.4s
cmp x0, x1
bne .L6
Links to ASM if godbolt becomes stale: https://gist.github.com/DenisYaroshevskiy/56d82c8cf4a4dd5bf91d58b053ea80f2
I don’t believe that’s possible. Compilers aren’t smart enough to do that efficiently.
Compare the other answer (which uses NEON-like pseudocode) with the SSE version below:
// Compare vector absolute value with aa, if greater update both aa and maxIdx
inline void updateMax( __m128 vec, __m128i idx, __m128& aa, __m128& maxIdx )
{
vec = _mm_andnot_ps( _mm_set1_ps( -0.0f ), vec );
const __m128 greater = _mm_cmpgt_ps( vec, aa );
aa = _mm_max_ps( vec, aa );
// If you don't have SSE4, emulate with bitwise ops: and, andnot, or
maxIdx = _mm_blendv_ps( maxIdx, _mm_castsi128_ps( idx ), greater );
}
float maxabs_sse4( const float* rsi, size_t length, size_t& index )
{
// Initialize things
const float* const end = rsi + length;
const float* const endAligned = rsi + ( ( length / 4 ) * 4 );
__m128 aa = _mm_set1_ps( -1 );
__m128 maxIdx = _mm_setzero_ps();
__m128i idx = _mm_setr_epi32( 0, 1, 2, 3 );
// Main vectorized portion
while( rsi < endAligned )
{
__m128 vec = _mm_loadu_ps( rsi );
rsi += 4;
updateMax( vec, idx, aa, maxIdx );
idx = _mm_add_epi32( idx, _mm_set1_epi32( 4 ) );
}
// Handle the remainder, if present
if( rsi < end )
{
__m128 vec;
if( length > 4 )
{
// The source has at least 5 elements
// Offset the source pointer + index back, by a few elements
const int offset = (int)( 4 - ( length % 4 ) );
rsi -= offset;
idx = _mm_sub_epi32( idx, _mm_set1_epi32( offset ) );
vec = _mm_loadu_ps( rsi );
}
else
{
// The source was smaller than 4 elements, copy them into temporary buffer and load vector from there
alignas( 16 ) float buff[ 4 ];
_mm_store_ps( buff, _mm_setzero_ps() );
for( size_t i = 0; i < length; i++ )
buff[ i ] = rsi[ i ];
vec = _mm_load_ps( buff );
}
updateMax( vec, idx, aa, maxIdx );
}
// Reduce to scalar
__m128 tmpMax = _mm_movehl_ps( aa, aa );
__m128 tmpMaxIdx = _mm_movehl_ps( maxIdx, maxIdx );
__m128 greater = _mm_cmpgt_ps( tmpMax, aa );
aa = _mm_max_ps( tmpMax, aa );
maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );
// SSE3 has 100% market penetration in 2022
tmpMax = _mm_movehdup_ps( aa );
tmpMaxIdx = _mm_movehdup_ps( maxIdx );
greater = _mm_cmpgt_ss( tmpMax, aa );
aa = _mm_max_ss( tmpMax, aa );
maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );
index = (size_t)_mm_cvtsi128_si32( _mm_castps_si128( maxIdx ) );
return _mm_cvtss_f32( aa );
}
As you see, pretty much everything is completely different. Not just the boilerplate about remainder and final reduction, the main loop is very different too.
SSE doesn’t have bitselect; blendvps is not quite that, since it selects whole 32-bit lanes based on the high bit of the selector. Unlike NEON, SSE doesn’t have an instruction for absolute value, so it needs to be emulated with bitwise andnot.
The final reduction is going to be completely different as well. NEON has very limited shuffles, but it has better horizontal operations, like vmaxvq_f32 which finds the horizontal maximum over the complete SIMD vector.
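The two differences mentioned above (bit-wise select versus lane-wise blend, and absolute value through clearing the sign bit) can be modelled on a single 32-bit lane in plain C++. This is only an illustrative sketch; the function names are made up, and it assumes float is a 32-bit IEEE value.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");

// Absolute value by clearing the sign bit, the same idea as
// _mm_andnot_ps( _mm_set1_ps( -0.0f ), vec ).
float abs_via_andnot(float x) {
    const std::uint32_t sign = 0x80000000u;  // bit pattern of -0.0f
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits = ~sign & bits;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Bit-wise select: every bit of the mask picks b (1) or a (0).
// This is also the and / andnot / or emulation mentioned in the code comment.
std::uint32_t bit_select(std::uint32_t mask, std::uint32_t a, std::uint32_t b) {
    return (mask & b) | (~mask & a);
}

// Lane-wise blend: only the top bit of the selector matters.
std::uint32_t lane_blend(std::uint32_t selector, std::uint32_t a, std::uint32_t b) {
    return (selector & 0x80000000u) ? b : a;
}

// Small demonstration of where the two selects agree and where they differ.
void demo_select() {
    // With a comparison result (all ones or all zeros) they agree.
    assert(bit_select(0xFFFFFFFFu, 1u, 2u) == lane_blend(0xFFFFFFFFu, 1u, 2u));
    assert(bit_select(0x00000000u, 1u, 2u) == lane_blend(0x00000000u, 1u, 2u));
    // With a partial mask they do not.
    assert(bit_select(0x0000FFFFu, 0xAAAAAAAAu, 0x55555555u) == 0xAAAA5555u);
    assert(lane_blend(0x0000FFFFu, 0xAAAAAAAAu, 0x55555555u) == 0xAAAAAAAAu);
}
```

Since the mask in updateMax comes straight from a compare, every lane is all ones or all zeros, so either kind of select gives the same result there.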
I am trying to optimize my matrix multiplication code running on a single core. How can I further improve the performance in regards to loop unrolling, FMA/SSE? I'm also curious to know why the performance won't increase if you use four instead of two sums in the inner loop.
The problem size is a 1000x1000 matrix multiplication. Both gcc 9 and icc 19.0.5 are available. Intel Xeon @ 3.10GHz, 32K L1d Cache, Skylake Architecture. Compiled with gcc -O3 -mavx.
void mmult(double* A, double* B, double* C)
{
const int block_size = 64 / sizeof(double);
__m256d sum[2], broadcast;
for (int i0 = 0; i0 < SIZE_M; i0 += block_size) {
for (int k0 = 0; k0 < SIZE_N; k0 += block_size) {
for (int j0 = 0; j0 < SIZE_K; j0 += block_size) {
int imax = i0 + block_size > SIZE_M ? SIZE_M : i0 + block_size;
int kmax = k0 + block_size > SIZE_N ? SIZE_N : k0 + block_size;
int jmax = j0 + block_size > SIZE_K ? SIZE_K : j0 + block_size;
for (int i1 = i0; i1 < imax; i1++) {
for (int k1 = k0; k1 < kmax; k1++) {
broadcast = _mm256_broadcast_sd(A+i1*SIZE_N+k1);
for (int j1 = j0; j1 < jmax; j1+=8) {
sum[0] = _mm256_load_pd(C+i1*SIZE_K+j1+0);
sum[0] = _mm256_add_pd(sum[0], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+0)));
_mm256_store_pd(C+i1*SIZE_K+j1+0, sum[0]);
sum[1] = _mm256_load_pd(C+i1*SIZE_K+j1+4);
sum[1] = _mm256_add_pd(sum[1], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+4)));
_mm256_store_pd(C+i1*SIZE_K+j1+4, sum[1]);
// doesn't improve performance.. why?
// sum[2] = _mm256_load_pd(C+i1*SIZE_K+j1+8);
// sum[2] = _mm256_add_pd(sum[2], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+8)));
// _mm256_store_pd(C+i1*SIZE_K+j1+8, sum[2]);
// sum[3] = _mm256_load_pd(C+i1*SIZE_K+j1+12);
// sum[3] = _mm256_add_pd(sum[3], _mm256_mul_pd(broadcast, _mm256_load_pd(B+k1*SIZE_K+j1+12)));
// _mm256_store_pd(C+i1*SIZE_K+j1+12, sum[3]);
}
}
}
}
}
}
}
This code has 2 loads per FMA (if FMA-contraction happens), but Skylake only supports at most one load per FMA in theory (if you want to max out 2/clock FMA throughput), and even that is usually too much in practice. (Peak throughput is 2 loads + 1 store per clock, but it usually can't quite sustain that). See Intel's optimization guide and https://agner.org/optimize/
The loop overhead is not the biggest problem, the body itself forces the code to run at half speed.
If the k-loop was the inner loop, a lot of accumulation could be chained, without having to load/store to and from C. This has a downside: with a loop-carried dependency chain like that, it would be up to the code to explicitly ensure that there is enough independent work to be done.
In order to have few loads but enough independent work, the body of the inner loop could calculate the product between a small column vector from A and a small row vector from B, for example using 4 scalar broadcasts to load the column and 2 normal vector loads from B, resulting in just 6 loads for 8 independent FMAs (even lower ratios are possible), which is enough independent FMAs to keep Skylake happy and not too many loads. Even a 3x4 footprint is possible, which also has enough independent FMAs to keep Haswell happy (it needs at least 10).
I happen to have some example code, it's for single precision and C++ but you'll get the point:
sumA_1 = _mm256_load_ps(&result[i * N + j]);
sumB_1 = _mm256_load_ps(&result[i * N + j + 8]);
sumA_2 = _mm256_load_ps(&result[(i + 1) * N + j]);
sumB_2 = _mm256_load_ps(&result[(i + 1) * N + j + 8]);
sumA_3 = _mm256_load_ps(&result[(i + 2) * N + j]);
sumB_3 = _mm256_load_ps(&result[(i + 2) * N + j + 8]);
sumA_4 = _mm256_load_ps(&result[(i + 3) * N + j]);
sumB_4 = _mm256_load_ps(&result[(i + 3) * N + j + 8]);
for (size_t k = kk; k < kk + akb; k++) {
auto bc_mat1_1 = _mm256_set1_ps(*mat1ptr);
auto vecA_mat2 = _mm256_load_ps(mat2 + m2idx);
auto vecB_mat2 = _mm256_load_ps(mat2 + m2idx + 8);
sumA_1 = _mm256_fmadd_ps(bc_mat1_1, vecA_mat2, sumA_1);
sumB_1 = _mm256_fmadd_ps(bc_mat1_1, vecB_mat2, sumB_1);
auto bc_mat1_2 = _mm256_set1_ps(mat1ptr[N]);
sumA_2 = _mm256_fmadd_ps(bc_mat1_2, vecA_mat2, sumA_2);
sumB_2 = _mm256_fmadd_ps(bc_mat1_2, vecB_mat2, sumB_2);
auto bc_mat1_3 = _mm256_set1_ps(mat1ptr[N * 2]);
sumA_3 = _mm256_fmadd_ps(bc_mat1_3, vecA_mat2, sumA_3);
sumB_3 = _mm256_fmadd_ps(bc_mat1_3, vecB_mat2, sumB_3);
auto bc_mat1_4 = _mm256_set1_ps(mat1ptr[N * 3]);
sumA_4 = _mm256_fmadd_ps(bc_mat1_4, vecA_mat2, sumA_4);
sumB_4 = _mm256_fmadd_ps(bc_mat1_4, vecB_mat2, sumB_4);
m2idx += 16;
mat1ptr++;
}
_mm256_store_ps(&result[i * N + j], sumA_1);
_mm256_store_ps(&result[i * N + j + 8], sumB_1);
_mm256_store_ps(&result[(i + 1) * N + j], sumA_2);
_mm256_store_ps(&result[(i + 1) * N + j + 8], sumB_2);
_mm256_store_ps(&result[(i + 2) * N + j], sumA_3);
_mm256_store_ps(&result[(i + 2) * N + j + 8], sumB_3);
_mm256_store_ps(&result[(i + 3) * N + j], sumA_4);
_mm256_store_ps(&result[(i + 3) * N + j + 8], sumB_4);
This means that the j-loop and the i-loop are unrolled, but not the k-loop, even though it is the inner loop now. Unrolling the k-loop a bit did help a bit in my experiments.
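The load and FMA counts quoted above follow from simple arithmetic on the tile shape: a tile of `rows` broadcast scalars from A times `vec_cols` vectors from B needs rows + vec_cols loads and performs rows * vec_cols independent FMAs per k step. A small sketch of that bookkeeping (the names TileCost and tile_cost are mine, purely for illustration):

```cpp
#include <cassert>

// Hypothetical bookkeeping helper for a register-tiled inner loop body.
struct TileCost {
    int loads;  // broadcasts from A plus vector loads from B
    int fmas;   // independent accumulators updated per k step
};

constexpr TileCost tile_cost(int rows, int vec_cols) {
    return {rows + vec_cols, rows * vec_cols};
}

// The footprints mentioned in the text.
void check_tile_costs() {
    // 4 broadcasts + 2 vector loads: 6 loads for 8 FMAs.
    assert(tile_cost(4, 2).loads == 6 && tile_cost(4, 2).fmas == 8);
    // A 3x4 footprint: 7 loads for 12 FMAs, above the 10 Haswell needs.
    assert(tile_cost(3, 4).loads == 7 && tile_cost(3, 4).fmas == 12);
    // A 1x1 "tile" needs 2 loads per FMA, the ratio of the loop in the question
    // (which additionally stores to C every iteration).
    assert(tile_cost(1, 1).loads == 2 && tile_cost(1, 1).fmas == 1);
}
```

Larger tiles improve the ratio further, but the accumulators, the B vectors and the broadcast all have to fit in the 16 ymm registers, so 12 accumulators is about the limit with AVX2.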
See @harold's answer for an actual improvement. This is mostly to repost what I wrote in comments.
four instead of two sums in the inner loop. (Why doesn't unrolling help?)
There's no loop-carried dependency through sum[i]. The next iteration assigns sum[0] = _mm256_load_pd(C+i1*SIZE_K+j1+0); which has no dependency on the previous value.
Therefore register-renaming of the same architectural register onto different physical registers is sufficient to avoid write-after-write hazards that might stall the pipeline. No need to complicate the source with multiple tmp variables. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (In that question, one dot product of 2 arrays, there is a loop carried dependency through an accumulator. There, using multiple accumulators is valuable to hide FP FMA latency so we bottleneck on FMA throughput, not latency.)
A pipeline without register renaming (most in-order CPUs) would benefit from "software pipelining" to statically schedule for what out-of-order exec can do on the fly: load into different registers so there's distance (filled with independent work) between each load and the FMA that consumes it. And then between that and the store.
But all modern x86 CPUs are OoO; even Knights Landing has some limited OoO exec for SIMD vectors. (Silvermont doesn't support AVX, but does run SIMD instructions in-order, only doing OoO for integer).
Without any multiple-accumulator situation to hide latency, the benefits of unrolling (explicitly in the source or with -funroll-loops as enabled by -fprofile-use, or in clang by default) are:
Reduce front-end bandwidth to get the loop overhead into the back-end. More useful-work uops per loop overhead. Thus it helps if your "useful work" is close to bottlenecked on the front end.
Less back-end execution-unit demand for running the loop overhead. Normally not a problem on Haswell and later, or Zen; the back end can mostly keep up with the front-end when the instruction mix includes some integer stuff and some pure load instructions.
Fewer total uops per work done means OoO exec can "see" farther ahead for memory loads/stores.
Sometimes better branch prediction for short-running loops: The lower iteration count means a shorter pattern for branch prediction to learn. So for short trip-counts, a better chance of correctly predicting the not-taken for the last iteration when execution falls out of the loop.
Sometimes save a mov reg,reg in more complicated cases where it's easier for the compiler to generate a new result in a different reg. The same variable can alternate between living in two regs instead of needing to get moved back to the same one to be ready for the next iteration. Especially if you have a loop that uses a[i] and a[i+1] in a dependent way, or something like Fibonacci.
With 2 loads + 1 store in the loop, that will probably be the bottleneck, not FMA or front-end bandwidth. Unrolling by 2 might have helped avoid a front-end bottleneck, but more than that would only matter with contention from another hyperthread.
An interesting question came up in comments: doesn't unrolling need a lot of registers to be useful?
Harold commented:
16 is not a huge number of registers, but it's enough to have 12
accumulators and 3 pieces of row vector from B and the broadcasted
scalar from A, so it works out to just about enough. The loop from OP
above barely uses any registers anyway. The 8 registers in 32bit are
indeed too few.
Of course since the code in the question doesn't have "accumulators" in registers across loop iterations, only adding into memory, compilers could have optimized all of sum[0..n] to reuse the same register in asm; it's "dead" after storing. So actual register pressure is very low.
Yes x86-64 is somewhat register-poor, that's why AVX512 doubles the number as well as width of vector regs (zmm0..31). Yes, many RISCs have 32 int / 32 fp regs, including AArch64 up from 16 in ARM.
x86-64 has 16 scalar integer registers (including the stack pointer, not including the program counter), so normal functions can use 15. There are also 16 vector regs, xmm0..15. (And with AVX they're double the width ymm0..15).
(Some of this was written before I noticed that sum[0..n] was pointless, not loop-carried.)
Register renaming onto a large physical register file is sufficient in this case. There are other cases where having more architectural registers helps, especially with higher FP latency, which is why AVX512 has 32 zmm regs. But for integer, 16 is close to enough. RISC CPUs were often designed for in-order execution without register renaming, needing software pipelining.
With OoO exec, the jump from 8 to 16 architectural GP integer regs is more significant than a jump from 16 to 32 would be, in terms of reducing spill/reloads. (I've seen a paper that measured total dynamic instruction count for SPECint with various numbers of architectural registers. I didn't look it up again, but 8->16 might have been 10% total saving while 16->32 was only a couple %. Something like that).
But this specific problem doesn't need a lot of FP registers, only 4 vectors for sum[0..3] (if they were loop-carried) and maybe 1 temporary; x86 can use memory-source mul/add/FMA. Register renaming removes any WAW hazards so we can reuse the same temporary register instead of needing software pipelining. (And OoO exec also hides load and ALU latency.)
You want multiple accumulators when there are loop-carried dependencies. This code is adding into memory, not into a few vector accumulators, so any dependency is through store/reload. But that only has ~7 cycle latency so any sane cache-blocking factor hides it.
I understand the concept of unrolling loops however, can someone explain to me how to unroll a simple loop?
It would be great if you would show me a loop, and then a unrolled version of that loop with explanations of what is happening.
I think it's important to clarify when loop unrolling is most effective: with dependency chains. A dependency chain is a series of operations where each calculation depends on the previous calculation. For example, the following loop has a dependency chain.
for(i=0; i<n; i++) sum += a[i];
Most modern processors can execute multiple operations per cycle, out of order, which increases instruction throughput. However, out-of-order execution cannot overlap the operations within a single dependency chain. In the loop above each addition has to wait for the previous one, so the loop is bound by the latency of the addition operation.
In the loop above we can unroll it into two dependency chains like this
sum1 = 0, sum2 = 0;
for(i=0; i<n/2; i++) sum1 += a[2*i], sum2 += a[2*i+1];
for(i=(n/2)*2; i<n; i++) sum += a[i]; // clean up for n odd
sum += sum1 + sum2;
Now an out-of-order processor can work on the two chains independently and, depending on the processor, simultaneously.
In general you should unroll by an amount equal to the latency of the operation times the number of those operations that can be done per clock cycle. For example, an x86_64 processor can perform at least one SSE addition per clock cycle and the SSE addition has a latency of 3, so you should unroll three times. A Haswell processor can do two FMA operations per clock cycle and each FMA operation has a latency of 5, so you would need to unroll 10 times to get the maximum throughput.
As far as compilers go GCC does not unroll dependency chains (even with -funroll-loops). You have to unroll yourself with GCC. With Clang it unrolls four times which is generally pretty good (in some cases on Haswell and Broadwell you would need to unroll 10 times and with Skylake 8 times).
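As a rough illustration of the rule above, here is a sketch with the number of independent chains as a template parameter, plus the latency-times-throughput arithmetic. The names sum_unrolled and chains_needed are hypothetical, and note that splitting a floating-point sum into several chains changes the order of the additions, so the rounding can differ slightly from the simple loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of independent dependency chains needed to hide latency:
// latency (cycles) times operations that can start per cycle.
constexpr int chains_needed(int latency_cycles, int ops_per_cycle) {
    return latency_cycles * ops_per_cycle;
}

// Sum with U independent accumulators (U dependency chains).
template <std::size_t U>
double sum_unrolled(const std::vector<double>& a) {
    static_assert(U > 0, "need at least one accumulator");
    double acc[U] = {};
    std::size_t i = 0;
    for (; i + U <= a.size(); i += U) {
        for (std::size_t u = 0; u < U; ++u) {
            acc[u] += a[i + u];  // each acc[u] is its own chain
        }
    }
    double sum = 0.0;
    for (; i < a.size(); ++i) sum += a[i];         // clean up the remainder
    for (std::size_t u = 0; u < U; ++u) sum += acc[u];  // combine the chains
    return sum;
}

// The numbers from the text: SSE add, Haswell FMA, Skylake FMA.
void check_chains() {
    assert(chains_needed(3, 1) == 3);
    assert(chains_needed(5, 2) == 10);
    assert(chains_needed(4, 2) == 8);
}
```

Usage would be something like sum_unrolled<4>(v). The inner loop over u has a compile-time trip count, so an optimizing compiler will normally flatten it into U separate additions.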
Another reason to unroll is when the number of operations in a loop exceeds the number of instructions which can be pushed through per clock cycle. For example, in the following loop
for(i=0; i<n; i++) b[i] += 3.14159*a[i];
there is no dependency chain so there is no problem with out-of-order execution. But let's consider an instruction set which needs the following operations per iteration.
2 SIMD load
1 SIMD store
1 SIMD multiply
1 SIMD addition
1 scalar addition for the loop counter
1 conditional jump
Let's also assume the processor can push through five of these instructions per cycle. In this case there are seven instructions per iteration but only five can be done per cycle. Loop unrolling can then be used to amortize the cost of the scalar addition to the counter i and the conditional jump. For example, if you fully unrolled the loop these instructions would not be necessary.
For amortizing the cost of the loop counter and jump, -funroll-loops works fine with GCC. It unrolls eight times, which means the counter addition and jump have to be done once every eight iterations instead of every iteration.
The process of unrolling loops utilizes an essential concept in computer science: the space-time tradeoff, where increasing the space used can often lead to decreasing the time of an algorithm.
Let's say we have a simple loop,
const int n = 1000;
for (int i = 0; i < n; ++i) {
foo();
}
This is compiled to assembly looking something like this:
mov eax, 0
loop:
call foo
inc eax
cmp eax, 1000
jne loop
So the space-time trade-off is 5 lines of assembly for ~(4 * 1000) = ~4000 instructions executed.
So, let's try and unroll the loop a bit.
for (int i = 0; i < n; i += 10) {
foo();
foo();
foo();
foo();
foo();
foo();
foo();
foo();
foo();
foo();
}
And its assembly:
mov eax, 0
loop:
call foo
call foo
call foo
call foo
call foo
call foo
call foo
call foo
call foo
call foo
add eax, 10
cmp eax, 1000
jne loop
The space-time trade-off is 14 lines of assembly for ~(13 * 100) = ~1300 instructions executed.
We can do a total unrolling, like this:
foo();
foo();
// ...
// 996 foo()'s
// ...
foo();
foo();
Which compiles in assembly as 1000 call instructions.
This gives a space-time trade-off of 1000 lines of assembly for 1000 instructions.
As you can see, the general trend is that to reduce the amount of instructions executed by the CPU, we must increase the space required.
It is not efficient to totally unroll a loop, as the space required becomes extremely large. Partial unrolling gives huge benefits with greatly diminishing returns the more you unroll the loop.
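The diminishing returns can be seen by counting instructions for different unroll factors. The sketch below uses the same simplified model as the assembly listings (one mov before the loop, then u calls plus three loop-control instructions per trip) and assumes that u divides n; the names UnrollCost and unroll_cost are just for illustration.

```cpp
#include <cassert>

// Hypothetical instruction-count model for a loop of n calls unrolled u times.
struct UnrollCost {
    long static_instrs;   // instructions in the program text
    long dynamic_instrs;  // instructions executed inside the loop
};

// Assumes u divides n evenly, as in the examples above.
constexpr UnrollCost unroll_cost(long n, long u) {
    return {1 + u + 3, (n / u) * (u + 3)};
}

void check_unroll_costs() {
    assert(unroll_cost(1000, 1).static_instrs == 5);
    assert(unroll_cost(1000, 1).dynamic_instrs == 4000);
    assert(unroll_cost(1000, 10).static_instrs == 14);
    assert(unroll_cost(1000, 10).dynamic_instrs == 1300);
    // Going from 10 to 100 costs 90 more instructions of code
    // but saves far fewer executed instructions than 1 to 10 did.
    assert(unroll_cost(1000, 100).static_instrs == 104);
    assert(unroll_cost(1000, 100).dynamic_instrs == 1030);
}
```

In this model the first factor of 10 removes 2700 executed instructions, while the next factor of 10 removes only 270 more, at ten times the code growth.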
While it's a good idea to understand loop unrolling, keep in mind that the compiler is smart and will do it for you.
Rolled (regular):
#define N 44
int main() {
int A[N], B[N];
int i;
// fill A with stuff ...
for(i = 0; i < N; i++) {
B[i] = A[i] * (100 % i);
}
// do stuff with B ...
}
Unrolled:
#define N 44
int main() {
int A[N], B[N];
int i;
// fill A with stuff ...
for(i = 0; i < N; i += 4) {
B[i] = A[i] * (100 % i);
B[i+1] = A[i+1] * (100 % (i+1));
B[i+2] = A[i+2] * (100 % (i+2));
B[i+3] = A[i+3] * (100 % (i+3));
}
// do stuff with B ...
}
Unrolling can potentially increase performance at the cost of a larger program size. Performance increases could be due to a reduction in branch penalties, cache misses and execution instructions. Some disadvantages are obvious, like an increase in the amount of code and a decrease in readability, and some are not so obvious.
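One of the less obvious details is what happens when the trip count is not a multiple of the unroll factor (the example above works only because 44 is divisible by 4). A common way to handle it is a short clean-up loop after the unrolled one. The sketch below uses a simpler loop body (multiply by 3) than the example above, and the function names are my own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rolled reference version.
void scale_rolled(const std::vector<int>& A, std::vector<int>& B) {
    assert(A.size() == B.size());
    for (std::size_t i = 0; i < A.size(); ++i) {
        B[i] = A[i] * 3;
    }
}

// Unrolled by 4, with a clean-up loop for the last n % 4 elements.
void scale_unrolled4(const std::vector<int>& A, std::vector<int>& B) {
    assert(A.size() == B.size());
    const std::size_t n = A.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        B[i]     = A[i]     * 3;
        B[i + 1] = A[i + 1] * 3;
        B[i + 2] = A[i + 2] * 3;
        B[i + 3] = A[i + 3] * 3;
    }
    for (; i < n; ++i) {  // at most 3 leftover iterations
        B[i] = A[i] * 3;
    }
}
```

Both functions should produce identical output for any length, including lengths shorter than 4, where only the clean-up loop runs.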
I have been working with a piece of code which is intensively memory bound. I am trying to optimize it within a single core by manually implementing cache blocking, software prefetching, loop unrolling etc. Cache blocking gives a significant improvement in performance. However, when I introduce loop unrolling I get tremendous performance degradation.
I am compiling with Intel icc with compiler flags -O2 and -ipo in all my test cases.
My code is similar to this (3D 25-point stencil):
void stencil_baseline (double *V, double *U, int dx, int dy, int dz, double c0, double c1, double c2, double c3, double c4)
{
int i, j, k;
for (k = 4; k < dz-4; k++)
{
for (j = 4; j < dy-4; j++)
{
//x-direction
for (i = 4; i < dx-4; i++)
{
U[k*dy*dx+j*dx+i] = (c0 * (V[k*dy*dx+j*dx+i]) //center
+ c1 * (V[k*dy*dx+j*dx+(i-1)] + V[k*dy*dx+j*dx+(i+1)])
+ c2 * (V[k*dy*dx+j*dx+(i-2)] + V[k*dy*dx+j*dx+(i+2)])
+ c3 * (V[k*dy*dx+j*dx+(i-3)] + V[k*dy*dx+j*dx+(i+3)])
+ c4 * (V[k*dy*dx+j*dx+(i-4)] + V[k*dy*dx+j*dx+(i+4)]));
}
//y-direction
for (i = 4; i < dx-4; i++)
{
U[k*dy*dx+j*dx+i] += (c1 * (V[k*dy*dx+(j-1)*dx+i] + V[k*dy*dx+(j+1)*dx+i])
+ c2 * (V[k*dy*dx+(j-2)*dx+i] + V[k*dy*dx+(j+2)*dx+i])
+ c3 * (V[k*dy*dx+(j-3)*dx+i] + V[k*dy*dx+(j+3)*dx+i])
+ c4 * (V[k*dy*dx+(j-4)*dx+i] + V[k*dy*dx+(j+4)*dx+i]));
}
//z-direction
for (i = 4; i < dx-4; i++)
{
U[k*dy*dx+j*dx+i] += (c1 * (V[(k-1)*dy*dx+j*dx+i] + V[(k+1)*dy*dx+j*dx+i])
+ c2 * (V[(k-2)*dy*dx+j*dx+i] + V[(k+2)*dy*dx+j*dx+i])
+ c3 * (V[(k-3)*dy*dx+j*dx+i] + V[(k+3)*dy*dx+j*dx+i])
+ c4 * (V[(k-4)*dy*dx+j*dx+i] + V[(k+4)*dy*dx+j*dx+i]));
}
}
}
}
When I do loop unrolling on the innermost loop (dimension i) and unroll in directions x,y,z separately by unroll factor 2,4,8 respectively, I get performance degradation in all 9 cases i.e. unroll by 2 on direction x, unroll by 2 on direction y, unroll by 2 in direction z, unroll by 4 in direction x ... etc.
But when I perform loop unrolling on the outermost loop (dimension k) by a factor of 8 (2 and 4 also), I get a very good performance improvement which is even better than cache blocking.
I even tried profiling my code with Intel VTune. It seemed like the bottlenecks were mainly due to 1. LLC Misses and 2. LLC Load Misses serviced by Remote DRAM.
I am unable to understand why unrolling the innermost, fastest loop is giving performance degradation whereas unrolling the outermost, slowest dimension is giving a performance improvement. However, this improvement in the latter case is when I use -O2 and -ipo when compiling with icc.
I am not sure how to interpret these statistics. Can someone help shed some light on this.
This strongly suggests that you are causing instruction cache misses by the unrolling, which is typical. In the age of modern hardware, unrolling no longer automatically means faster code. If each inner loop fits in a cache line, you will get better performance.
You may be able to unroll manually, to limit the size of the generated code, but this will require examining the generated machine-language instructions -- and their position -- to ensure that your loop is within a single cache line. Cache lines are typically 64 bytes long, and aligned on 64-byte boundaries.
Outer loops do not have the same effect. They will likely be outside of the instruction cache regardless of the unroll level. Unrolling these results in fewer branches, which is why you get better performance.
"Load misses serviced by remote DRAM" means that you allocated memory on one NUMA node, but now you are running on the other. Setting process or thread affinity based on NUMA is the answer.
Remote DRAM takes almost twice as long to read as local DRAM on the Intel machines that I have used.
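For reference, unrolling the outer k loop while keeping a single inner i loop (often called unroll-and-jam) looks roughly like the sketch below. It uses a much simpler two-point stencil in the k direction than the 25-point one in the question, and the function names are made up for illustration; it is not meant as a drop-in replacement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: out[k][i] = in[k-1][i] + in[k+1][i] for the interior planes.
void stencil_rolled(const std::vector<double>& in, std::vector<double>& out,
                    std::size_t nk, std::size_t ni) {
    assert(nk >= 2 && in.size() == nk * ni && out.size() == nk * ni);
    for (std::size_t k = 1; k < nk - 1; ++k)
        for (std::size_t i = 0; i < ni; ++i)
            out[k * ni + i] = in[(k - 1) * ni + i] + in[(k + 1) * ni + i];
}

// Outer loop unrolled by 2, with both copies fused into one inner loop.
void stencil_unrolled_k2(const std::vector<double>& in, std::vector<double>& out,
                         std::size_t nk, std::size_t ni) {
    assert(nk >= 2 && in.size() == nk * ni && out.size() == nk * ni);
    std::size_t k = 1;
    for (; k + 1 < nk - 1; k += 2) {
        for (std::size_t i = 0; i < ni; ++i) {
            out[k * ni + i]       = in[(k - 1) * ni + i] + in[(k + 1) * ni + i];
            out[(k + 1) * ni + i] = in[k * ni + i]       + in[(k + 2) * ni + i];
        }
    }
    for (; k < nk - 1; ++k)  // leftover plane when the interior count is odd
        for (std::size_t i = 0; i < ni; ++i)
            out[k * ni + i] = in[(k - 1) * ni + i] + in[(k + 1) * ni + i];
}
```

The inner loop body stays small, and each pass over i produces two output planes from four neighbouring input planes, which halves the number of inner-loop passes without the code growth of unrolling over i.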