SSE: not seeing a speedup by using _mm_add_epi32 - c

I would expect SSE to be faster than not using SSE. Do I need to add some additional compiler flags? Could it be that I am not seeing a speedup because this is integer code and not floating point?
invocation/output
$ make sum2
clang -O3 -msse -msse2 -msse3 -msse4.1 sum2.c ; ./a.out 123
n: 123
SSE Time taken: 0 seconds 124 milliseconds
vector+vector:begin int: 1 5 127 0
vector+vector:end int: 0 64 66 68
NOSSE Time taken: 0 seconds 115 milliseconds
vector+vector:begin int: 1 5 127 0
vector+vector:end int: 0 64 66 68
compiler
$ clang --version
Apple LLVM version 9.0.0 (clang-900.0.37)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
sum2.c
#include <stdlib.h>
#include <stdio.h>
#include <x86intrin.h>
#include <time.h>
#ifndef __cplusplus
#include <stdalign.h> // C11 defines _Alignas(). This header defines alignas()
#endif
#define CYCLE_COUNT 10000
// add vector and return resulting value on stack
__attribute__((noinline)) __m128i add_iv(__m128i *a, __m128i *b) {
return _mm_add_epi32(*a,*b);
}
// add int vectors via sse
__attribute__((noinline)) void add_iv_sse(__m128i *a, __m128i *b, __m128i *out, int N) {
for(int i=0; i<N/sizeof(int); i++) {
//out[i]= _mm_add_epi32(a[i], b[i]); // this also works
_mm_storeu_si128(&out[i], _mm_add_epi32(a[i], b[i]));
}
}
// add int vectors without sse
__attribute__((noinline)) void add_iv_nosse(int *a, int *b, int *out, int N) {
for(int i=0; i<N; i++) {
out[i] = a[i] + b[i];
}
}
__attribute__((noinline)) void p128_as_int(__m128i in) {
alignas(16) uint32_t v[4];
_mm_store_si128((__m128i*)v, in);
printf("int: %i %i %i %i\n", v[0], v[1], v[2], v[3]);
}
// print first 4 and last 4 elements of int array
__attribute__((noinline)) void debug_print(int *h) {
printf("vector+vector:begin ");
p128_as_int(* (__m128i*) &h[0] );
printf("vector+vector:end ");
p128_as_int(* (__m128i*) &h[32764] );
}
int main(int argc, char *argv[]) {
int n = atoi (argv[1]);
printf("n: %d\n", n);
// sum: vector + vector, of equal length
int f[32768] __attribute__((aligned(16))) = {0,2,4};
int g[32768] __attribute__((aligned(16))) = {1,3,n};
int h[32768] __attribute__((aligned(16)));
f[32765] = 33; f[32766] = 34; f[32767] = 35;
g[32765] = 31; g[32766] = 32; g[32767] = 33;
// https://stackoverflow.com/questions/459691/best-timing-method-in-c
clock_t start = clock();
for(int i=0; i<CYCLE_COUNT; ++i) {
add_iv_sse((__m128i*)f, (__m128i*)g, (__m128i*)h, 32768);
}
int msec = (clock()-start) * 1000 / CLOCKS_PER_SEC;
printf(" SSE Time taken: %d seconds %d milliseconds\n", msec/1000, msec%1000);
debug_print(h);
// process intense function again
start = clock();
for(int i=0; i<CYCLE_COUNT; ++i) {
add_iv_nosse(f, g, h, 32768);
}
msec = (clock()-start) * 1000 / CLOCKS_PER_SEC;
printf("NOSSE Time taken: %d seconds %d milliseconds\n", msec/1000, msec%1000);
debug_print(h);
return EXIT_SUCCESS;
}

Look at the asm: clang -O2 or -O3 probably auto-vectorizes add_iv_nosse (with a check for overlap, since you didn't use int * restrict a and so on).
Use -fno-tree-vectorize to disable auto vectorization, without stopping you from using intrinsics. I'd recommend clang -march=native -mno-avx -O3 -fno-tree-vectorize to test what I think you want to test, scalar integer vs. legacy-SSE paddd. (It works in gcc and clang. In clang, AFAIK it's a synonym for the clang-specific -fno-vectorize.)
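For reference, here's a sketch (not part of the question's code) of the scalar function with restrict-qualified pointers, which removes the need for that runtime overlap check:
// sketch: restrict promises the arrays can't overlap, so the auto-vectorizer
// doesn't need to emit a runtime overlap check (otherwise same as add_iv_nosse)
__attribute__((noinline))
void add_iv_nosse_restrict(const int *restrict a, const int *restrict b,
                           int *restrict out, int N) {
    for (int i = 0; i < N; i++)
        out[i] = a[i] + b[i];
}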
BTW, timing both in the same executable hurts the first one, because the CPU doesn't ramp up to full turbo right away. You're probably into the timed section of the code before your CPU hits full speed. (So run this a couple of times back-to-back, e.g. with for i in {1..10}; do time ./a.out; done.)
On Linux I'd use perf stat -r5 ./a.out to run it 5 times with performance counters (and I'd split it up so one run tested one or the other, so I could look at perf counters for the whole run.)
Code review:
You forgot stdint.h for uint32_t. I had to add that to get it to compile on Godbolt to see the asm. (Assuming clang-5.0 is something like the Apple clang version you're using. IDK if Apple's clang implies a default -mtune= option, but that would make sense because it's only targeting Mac. Also a baseline SSSE3 would make sense for 64-bit on x86-64 OS X.)
You don't need noinline on debug_print. Also, I'd recommend a different name for CYCLE_COUNT. Cycles in this context makes me think of clock cycles, so call it REP_COUNT or REPEATS or whatever.
Putting your arrays on the stack in main is probably fine. You do initialize both input arrays (to mostly zero, but add performance isn't data-dependent).
This is good, because leaving them uninitialized might mean that multiple 4k pages of each array was copy-on-write mapped to the same physical zero page, so you'd get more than the expected number of L1D cache hits.
The SSE2 loop should bottleneck on L2 / L3 cache bandwidth, since the working set is 3 arrays * 32Ki elements * 4 B = 384 KiB, about 1.5x the 256 KiB L2 cache in Intel CPUs.
clang might unroll its auto-vectorized loop more than it does your manual intrinsics loop. That might explain the better performance, since 16B vectors alone (not 32B AVX2) might not saturate cache bandwidth if you're not getting 2 loads + 1 store per clock.
Update: actually the loop overhead is pretty extreme, with 3 pointer increments + a loop counter, and only unrolling by 2 to amortize that.
The auto-vectorized loop:
.LBB2_12: # =>This Inner Loop Header: Depth=1
movdqu xmm0, xmmword ptr [r9 - 16]
movdqu xmm1, xmmword ptr [r9] # hoisted load for 2nd unrolled iter
movdqu xmm2, xmmword ptr [r10 - 16]
paddd xmm2, xmm0
movdqu xmm0, xmmword ptr [r10]
paddd xmm0, xmm1
movdqu xmmword ptr [r11 - 16], xmm2
movdqu xmmword ptr [r11], xmm0
add r9, 32
add r10, 32
add r11, 32
add rbx, -8 # add / jne macro-fused on SnB-family CPUs
jne .LBB2_12
So it's 12 fused-domain uops, and can run at best 2 vectors per 3 clocks, bottlenecked on the front-end issue bandwidth of 4 uops per clock.
It's not using aligned loads because the compiler doesn't have that info without inlining into main where the alignment is known, and you didn't guarantee alignment with p = __builtin_assume_aligned(p, 16) or anything in the stand-alone function. Aligned loads (or AVX) would let paddd use a memory operand instead of a separate movdqu load.
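A sketch of what that could look like (a hypothetical variant of the question's function, using the GCC/clang __builtin_assume_aligned builtin):
// sketch: promise 16-byte alignment inside the stand-alone scalar function,
// so the auto-vectorizer can use aligned loads / fold loads into paddd
__attribute__((noinline))
void add_iv_nosse_aligned(int *a, int *b, int *out, int N) {
    a = __builtin_assume_aligned(a, 16);
    b = __builtin_assume_aligned(b, 16);
    out = __builtin_assume_aligned(out, 16);
    for (int i = 0; i < N; i++)
        out[i] = a[i] + b[i];
}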
The manually-vectorized loop uses aligned loads to save front-end uops, but has more loop overhead from the loop counter.
.LBB1_7: # =>This Inner Loop Header: Depth=1
movdqa xmm0, xmmword ptr [rcx - 16]
paddd xmm0, xmmword ptr [rax - 16]
movdqu xmmword ptr [r11 - 16], xmm0
movdqa xmm0, xmmword ptr [rcx]
paddd xmm0, xmmword ptr [rax]
movdqu xmmword ptr [r11], xmm0
add r10, 2 # separate loop counter
add r11, 32 # 3 pointer increments
add rax, 32
add rcx, 32
cmp r9, r10 # compare the loop counter
jne .LBB1_7
So it's 11 fused-domain uops. It should be running faster than the auto-vectorized loop. Your timing method probably caused the problem.
(Unless mixing loads and stores is actually making it less optimal. The auto-vectorized loop did 4 loads and then 2 stores. Actually that might explain it. Your arrays are a multiple of 4kiB, and might all have the same relative alignment. So you might be getting 4k aliasing here, which means the CPU isn't sure that a store doesn't overlap a load. I think there's a performance counter you can check for that.)
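If you want to check that with Linux perf, the event is (if I remember the name right, on Sandybridge-family) something like:
perf stat -e ld_blocks_partial.address_alias -r5 ./a.out 123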
See also Agner Fog's microarch guide (and instruction tables + optimization guide), and other links in the x86 tag wiki, especially Intel's optimization guide.
There's also some good SSE/SIMD beginner stuff in the sse tag wiki.

Related

Fastest way to find 16bit match in a 4 element short array?

I have an array, short arr[] = {0x1234, 0x5432, 0x9090, 0xFEED};. I know I can use SIMD to compare all elements at once, using movemask+tzcnt to find the index of a match. However, since it's only 64 bits, I was wondering if there's a faster way?
First I thought maybe I could build a 64-bit int by writing target|(target<<16)|(target<<32)|(target<<48), but then realized that neither an AND nor a SUB is the same as a compare, since the low 16 bits can affect the higher 16. Then I thought that instead of a plain loop I could write index = tzcnt((target==arr[0]?1:0) | ... | (target==arr[3]?8:0)).
Can anyone think of something more clever? I suspect the ternary method would give me the best results since it's branchless?
For SWAR compare-for-equality, the operation you want is XOR, which like SUB produces all-zero on equal inputs, but unlike SUB doesn't propagate carry sideways.
But then you need to detect 16 contiguous zero bits. Unlike pcmpeqw, you'll have some zero bits in the other elements.
So it's probably about the same as https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord but with wider mask patterns to operate on 16-bit instead of 8-bit chunks.
There is yet a faster method — use hasless(v, 1), which is defined below; it works in 4 operations and requires no subsequent verification. It simplifies to
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
The subexpression (v - 0x01010101UL), evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.
This bithack was originally by Alan Mycroft in 1987.
So it could look like this (untested):
#include <stdint.h>
#include <string.h>
// returns 0 / non-zero status.
uint64_t hasmatch_16in64(uint16_t needle, const uint16_t haystack[4])
{
uint64_t vneedle = 0x0001000100010001ULL * needle; // broadcast
uint64_t vbuf;
memcpy(&vbuf, haystack, sizeof(vbuf)); // aliasing-safe unaligned load
//static_assert(sizeof(vbuf) == 4*sizeof(haystack[0]));
uint64_t match = vbuf ^ vneedle;
uint64_t any_zeros = (match - 0x0001000100010001ULL) & ~match & 0x8000800080008000ULL;
return any_zeros;
// unsigned matchpos = _tzcnt_u32(any_zeros) >> 4; // I think.
}
Godbolt with GCC and clang, also including a SIMD intrinsics version.
# gcc12.2 -O3 -march=x86-64-v3 -mtune=znver1
# x86-64-v3 is the Haswell/Zen1 baseline: AVX2+FMA+BMI2, but with tune=generic
# without tune=haswell or whatever, GCC uses shl/add /shl/add instead of imul, despite still needing the same constant
hasmatch_16in64:
movabs rax, 281479271743489 # 0x1000100010001
movzx edi, di # zero-extend to 64-bit
imul rdi, rax # vneedle
xor rdi, QWORD PTR [rsi] # match
# then the bithack
mov rdx, rdi
sub rdx, rax
andn rax, rdi, rdx # BMI1
movabs rdx, -9223231297218904064 # 0x8000800080008000
and rax, rdx
ret
Clang unfortunately adds 0xFFFEFFFEFFFEFFFF instead of reusing the multiplier constant, so it has three 64-bit immediate constants.
AArch64 can do repeating-pattern constants like this as immediates for bitwise ops, and doesn't have as convenient SIMD movemask, so this might be more of a win there, especially if you can guarantee alignment of your array of shorts.
Match position
If you need to know where the match is, I think that bithack has a 1 in the high bit of each zero byte or u16, and nowhere else. (The lowest-precedence / last operations are bitwise AND involving 0x80008000...).
So maybe tzcnt(any_zeros) >> 4 to go from bit-index to u16-index, rounding down. e.g. if the second one is zero, the tzcnt result will be 31. 31 >> 4 = 1.
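A minimal sketch of that index extraction (hypothetical helper, building on hasmatch_16in64() above; _tzcnt_u64 needs BMI1):
#include <stdint.h>
#include <immintrin.h>   // _tzcnt_u64
// returns the element index 0..3 of the first match, or 4 if there was none
static inline unsigned matchpos_16in64(uint64_t any_zeros) {
    return (unsigned)(_tzcnt_u64(any_zeros) >> 4);   // tzcnt(0) == 64, and 64>>4 == 4
}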
If that doesn't work, then yeah AVX2 or AVX-512 vpbroadcastw xmm0, edi / vmovq / vpcmpeqw / vpmovmskb / tzcnt will work well, too, with smaller code-size and fewer uops, but maybe higher latency. Or maybe less. (tzcnt of the pmovmskb result gives a byte offset; right-shift by 1 if you need an index of which short.)
Actually just SSE2 pshuflw can broadcast a word to the low qword of an XMM register. Same for MMX, which would actually allow a memory-source pcmpeqw mm0, [rsi] since it has no alignment requirement and is only 64-bit, not 128.
If you can use SIMD intrinsics, especially if you have efficient word broadcast from AVX2, definitely have a look at it.
#include <immintrin.h>
// note the unsigned function arg, not uint16_t;
// we only use the low 16, but GCC doesn't realize that and wastes an instruction in the non-AVX2 version
int hasmatch_SIMD(unsigned needle, const uint16_t haystack[4])
{
#ifdef __AVX2__ // or higher
__m128i vneedle = _mm_set1_epi16(needle);
#else
__m128i vneedle = _mm_cvtsi32_si128(needle); // movd
vneedle = _mm_shufflelo_epi16(vneedle, 0); // broadcast to low half
#endif
__m128i vbuf = _mm_loadl_epi64((void*)haystack); // alignment and aliasing safe
unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
//return _tzcnt_u32(mask) >> 1;
return mask;
}
# clang expects narrow integer args to already be zero- or sign-extended to 32
hasmatch_SIMD:
movd xmm0, edi
pshuflw xmm0, xmm0, 0 # xmm0 = xmm0[0,0,0,0,4,5,6,7]
movq xmm1, qword ptr [rsi] # xmm1 = mem[0],zero
pcmpeqw xmm1, xmm0
pmovmskb eax, xmm1
ret
AVX-512 gives us vpbroadcastw xmm0, edi, replacing vmovd + vpbroadcastw xmm,xmm or movd + pshuflw, saving a shuffle uop.
With AVX2, this is 5 single-uop instructions, vs. 7 (or 9 counting the constants) for the SWAR bithack. Or 6 or 8 not counting the zero-extension of the "needle". So SIMD is better for front-end throughput. (https://agner.org/optimize/ / https://uops.info/)
There are limits to which ports some of these instructions can run on (vs. the bithack instructions mostly being any integer ALU port), but presumably you're not doing this in a loop over many such 4-element arrays. Or else SIMD is an obvious win; checking two 4-element arrays at once in the low and high halves of a __m128i. So probably we do need to consider the front-end costs of setting up those constants.
I didn't add up the latencies; it's probably a bit higher even on Intel CPUs which generally have good latency between integer and SIMD units.
GCC unfortunately fails to optimize away the movzx edi, di from the SIMD version if compiled without AVX2; only clang realizes the upper 16 of _mm_cvtsi32_si128(needle) is discarded by the later shuffle. Maybe better to make the function arg unsigned, not explicitly a narrow 16-bit type.
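And a sketch (untested) of the two-arrays-at-once idea mentioned above, assuming the two 4-element haystacks happen to be contiguous in memory:
#include <immintrin.h>
#include <stdint.h>
// compare both arrays against the same needle with one 128-bit compare
static inline unsigned hasmatch_2x4(unsigned needle, const uint16_t haystacks[8]) {
    __m128i vneedle = _mm_set1_epi16((short)needle);
    __m128i vbuf = _mm_loadu_si128((const __m128i*)haystacks);  // both arrays at once
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
    return mask;   // low 8 bits: first array, high 8 bits: second array
}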
Clang with -O2 or -O3 and GCC with -O3 compile a simple search loop into branchless instructions:
int indexOf(short target, short* arr) {
int index = -1;
for (int i = 0; i < 4; ++i) {
if (target == arr[i]) {
index = i;
}
}
return index;
}
Demo
I doubt you can get much better without SIMD. In other words, write simple and understandable code to help the compiler produce efficient code.
Side note: for some reason, neither Clang nor GCC use conditional moves on this very similar code:
int indexOf(short target, short* arr) {
for (int i = 0; i < 4; ++i) {
if (target == arr[i]) {
return i;
}
}
return -1;
}

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

I am looking for an optimal method to calculate the sum of all packed 32-bit integers in a __m256i or __m512i. To calculate the sum of n elements, I often use log2(n) vpaddd and vpermd instructions, then extract the final result. However, I don't think it is the best option.
Edit: best/optimal in terms of speed / cycle reduction.
Related: if you're looking for the non-existent _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics; vpsadbw as an hsum within qwords is much more efficient than shuffling.
Without AVX512, see hsum_8x32(__m256i) below for AVX2 without Intel's reduce_add helper function. reduce_add doesn't necessarily compile optimally anyway with AVX512.
There is an int _mm512_reduce_add_epi32(__m512i) inline function in immintrin.h. You might as well use it. (It compiles to shuffle and add instructions, but more efficient ones than vpermd, like I describe below.) AVX512 didn't introduce any new hardware support for horizontal sums, just this new helper function. It's still something to avoid or sink out of loops whenever possible.
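Minimal usage is just something like this (my sketch; the helper itself is Intel's inline function from immintrin.h, and needs AVX512F):
#include <immintrin.h>
int reduce_add_wrapper(__m512i v) {
    return _mm512_reduce_add_epi32(v);
}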
GCC 9.2 -O3 -march=skylake-avx512 compiles a wrapper that calls it as follows:
vextracti64x4 ymm1, zmm0, 0x1
vpaddd ymm1, ymm1, ymm0
vextracti64x2 xmm0, ymm1, 0x1 # silly compiler, vextracti128 would be shorter
vpaddd xmm1, xmm0, xmm1
vpshufd xmm0, xmm1, 78
vpaddd xmm0, xmm0, xmm1
vmovd edx, xmm0
vpextrd eax, xmm0, 1 # 2x xmm->integer to feed scalar add.
add eax, edx
ret
Extracting twice to feed scalar add is questionable; it needs uops for p0 and p5 so it's equivalent to a regular shuffle + a movd.
Clang doesn't do that; it does one more step of shuffle / SIMD add to reduce down to a single scalar for vmovd. See below for perf analysis of the two.
There is a VPHADDD but you should never use it with both inputs the same. (Unless you're optimizing for code-size over speed). It can be useful to transpose-and-sum multiple vectors, resulting in some vectors of results. You do that by feeding phadd with 2 different inputs. (Except it gets messy with 256 and 512-bit because vphadd is still only in-lane.)
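For example, a sketch of that transpose-and-sum pattern with 128-bit vectors (hypothetical helper, needs SSSE3):
#include <immintrin.h>
// sums of four input vectors, one sum per output element
static inline __m128i hsum_4x4_epi32(__m128i a, __m128i b, __m128i c, __m128i d) {
    __m128i ab = _mm_hadd_epi32(a, b);   // { a0+a1, a2+a3, b0+b1, b2+b3 }
    __m128i cd = _mm_hadd_epi32(c, d);   // { c0+c1, c2+c3, d0+d1, d2+d3 }
    return _mm_hadd_epi32(ab, cd);       // { sum(a), sum(b), sum(c), sum(d) }
}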
Yes, you need log2(vector_width) shuffles and vpaddd instructions. (So this isn't very efficient; avoid horizontal sums inside inner loops. Accumulate vertically until the end of a loop, for example).
General strategy for all SSE / AVX / AVX512
You want to successively narrow from 512 -> 256, then 256 -> 128, then shuffle within __m128i until you're down to one scalar element. Presumably some future AMD CPU will decode 512-bit instructions to two 256-bit uops, so reducing width is a big win there. And narrower instructions presumably cost slightly less power.
Your shuffles can take immediate control operands instead of needing a vector control like vpermd does, e.g. VEXTRACTI32x8, vextracti128, and vpshufd. (Or vpunpckhqdq to save the code size of an immediate byte.)
See Fastest way to do horizontal SSE vector sum (or other reduction) (my answer also includes some integer versions).
This general strategy is appropriate for all element types: float, double, and any size integer
Special cases:
8-bit integer: start with vpsadbw, more efficient and avoids overflow, but then continue as for 64-bit integers.
16-bit integer: start by widening to 32-bit with pmaddwd (_mm256_madd_epi16 with set1_epi16(1)); see SIMD: Accumulate Adjacent Pairs. This is fewer uops even if you don't care about the overflow-avoidance benefit, except on AMD before Zen2 where 256-bit instructions cost at least 2 uops. But then you continue as for 32-bit integer.
32-bit integer can be done manually like this, with an SSE2 function called by the AVX2 function after reducing to __m128i, in turn called by the AVX512 function after reducing to __m256i. The calls will of course inline in practice.
#include <immintrin.h>
#include <stdint.h>
// from my earlier answer, with tuning for non-AVX CPUs removed
// static inline
uint32_t hsum_epi32_avx(__m128i x)
{
__m128i hi64 = _mm_unpackhi_epi64(x, x); // 3-operand non-destructive AVX lets us save a byte without needing a movdqa
__m128i sum64 = _mm_add_epi32(hi64, x);
__m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1)); // Swap the low two elements
__m128i sum32 = _mm_add_epi32(sum64, hi32);
return _mm_cvtsi128_si32(sum32); // movd
}
// only needs AVX2
uint32_t hsum_8x32(__m256i v)
{
__m128i sum128 = _mm_add_epi32(
_mm256_castsi256_si128(v),
_mm256_extracti128_si256(v, 1)); // silly GCC uses a longer AVX512VL instruction if AVX512 is enabled :/
return hsum_epi32_avx(sum128);
}
// AVX512
uint32_t hsum_16x32(__m512i v)
{
__m256i sum256 = _mm256_add_epi32(
_mm512_castsi512_si256(v), // low half
_mm512_extracti64x4_epi64(v, 1)); // high half. AVX512F. 32x8 version is AVX512DQ
return hsum_8x32(sum256);
}
Notice that this uses __m256i hsum as a building block for __m512i; there's nothing to be gained by doing in-lane operations first.
Well possibly a very tiny advantage: in-lane shuffles have lower latency than lane-crossing, so they could execute 2 cycles earlier and leave the RS earlier, and similarly retire from the ROB slightly earlier. But the higher-latency shuffles are coming just a couple instructions later even if you did that. So you might get a handful of independent instructions into the back-end 2 cycles earlier if this hsum was on the critical path (blocking retirement).
But reducing to a narrower vector width sooner is generally good, maybe getting 512-bit uops out of the system sooner so the CPU can re-activate the SIMD execution units on port 1, if you aren't doing more 512-bit work right away.
Compiles on Godbolt to these instructions, with GCC9.2 -O3 -march=skylake-avx512
hsum_16x32(long long __vector(8)):
vextracti64x4 ymm1, zmm0, 0x1
vpaddd ymm0, ymm1, ymm0
vextracti64x2 xmm1, ymm0, 0x1 # silly compiler uses a longer EVEX instruction when its available (AVX512VL)
vpaddd xmm0, xmm0, xmm1
vpunpckhqdq xmm1, xmm0, xmm0
vpaddd xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 177
vpaddd xmm0, xmm1, xmm0
vmovd eax, xmm0
ret
P.S.: perf analysis of GCC's _mm512_reduce_add_epi32 vs. clang's (which is equivalent to my version), using data from https://uops.info/ and/or Agner Fog's instruction tables:
After inlining into a caller that does something with the result, it could allow optimizations like adding a constant as well using lea eax, [rax + rdx + 123] or something.
But other than that, it seems almost always worse than the shuffle / vpaddd / vmovd at the end of my implementation, on Skylake-X:
total uops: reduce: 4. Mine: 3
ports: reduce: 2p0, p5 (part of vpextrd), p0156 (scalar add)
ports: mine: p5, p015 (vpaddd on SKX), p0 (vmovd)
Latency is equal at 4 cycles, assuming no resource conflicts:
shuffle 1 cycle -> SIMD add 1 cycle -> vmovd 2 cycles
vpextrd 3 cycles (in parallel with 2 cycle vmovd) -> add 1 cycle.

Why is there a large performance impact when looping over an array with 240 or more elements?

When running a sum loop over an array in Rust, I noticed a huge performance drop when CAPACITY >= 240. CAPACITY = 239 is about 80 times faster.
Is there special compilation optimization Rust is doing for "short" arrays?
Compiled with rustc -C opt-level=3.
use std::time::Instant;
const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;
fn main() {
let mut arr = [0; CAPACITY];
for i in 0..CAPACITY {
arr[i] = i;
}
let mut sum = 0;
let now = Instant::now();
for _ in 0..IN_LOOPS {
let mut s = 0;
for i in 0..arr.len() {
s += arr[i];
}
sum += s;
}
println!("sum:{} time:{:?}", sum, now.elapsed());
}
Summary: below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.
You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, therefore the length is multiplied by 8 bytes, assuming an x86-64 CPU). In this benchmark, one specific optimization – only performed for length 239 – is responsible for the huge speed difference. But let's start slowly:
(All code in this answer is compiled with -C opt-level=3)
pub fn foo() -> usize {
let arr = [0; 240];
let mut s = 0;
for i in 0..arr.len() {
s += arr[i];
}
s
}
This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. Here is a small part of the assembly:
movdqa xmm1, xmmword ptr [rsp + 32]
movdqa xmm0, xmmword ptr [rsp + 48]
paddq xmm1, xmmword ptr [rsp]
paddq xmm0, xmmword ptr [rsp + 16]
paddq xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq xmm0, xmmword ptr [rsp + 1840]
paddq xmm1, xmmword ptr [rsp + 1856]
paddq xmm0, xmmword ptr [rsp + 1872]
paddq xmm0, xmm1
pshufd xmm1, xmm0, 78
paddq xmm1, xmm0
This is what's called loop unrolling: LLVM pastes the loop body a bunch of times to avoid having to execute all those "loop management" instructions, i.e. incrementing the loop variable, checking whether the loop has ended, and jumping to the start of the loop.
In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that instruction-level parallelism of the CPU can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.
Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.
LLVM does unroll small loops some when it's not fully unrolling, and still uses multiple accumulators. So usually, front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.
But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts the one loop inside another one:
const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;
pub fn foo() -> usize {
let mut arr = [0; CAPACITY];
for i in 0..CAPACITY {
arr[i] = i;
}
let mut sum = 0;
for _ in 0..IN_LOOPS {
let mut s = 0;
for i in 0..arr.len() {
s += arr[i];
}
sum += s;
}
sum
}
(On Godbolt Compiler Explorer)
The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite some code just for initializing, which we will ignore.) For 239, however, it looks very different! We see that the initializing loop and the inner loop got unrolled: so far so expected.
The important difference is that for 239, LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!
First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I commented to explain the assembly; the comments with * are especially important):
; at the start of the function, `rbx` was set to 0
movq rax, xmm1 ; result of SIMD summing up stored in `rax`
add rax, 711 ; add up missing terms from loop unrolling
mov ecx, 500000 ; * init loop variable outer loop
.LBB0_1:
add rbx, rax ; * rbx += rax
add rcx, -1 ; * decrement loop variable
jne .LBB0_1 ; * if loop variable != 0 jump to LBB0_1
mov rax, rbx ; move rbx (the sum) back to rax
; two unimportant instructions omitted
ret ; the return value is stored in `rax`
As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have run, and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.
This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.
An additional note: can you do anything about this? Not really. LLVM has to have such magic thresholds as without them LLVM-optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even factor 2 in these cases. So no need to worry about real use cases.
As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array. And changing this in the second example does not lead to any notable differences in emitted assembly. You should use short and idiomatic versions unless you measured that it hurts performance.
In addition to Lukas' answer, if you want to use an iterator, try this:
const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;
pub fn bar() -> usize {
(0..CAPACITY).sum::<usize>() * IN_LOOPS
}
Thanks @Chris Morgan for the suggestion about range pattern.
The optimized assembly is quite good:
example::bar:
movabs rax, 14340000000
ret

Why vector length SIMD code is slower than plain C

Why is my SIMD vector4 length function 3x slower than a naive vector length method?
SIMD vector4 length function:
__extern_always_inline float vec4_len(const float *v) {
__m128 vec1 = _mm_load_ps(v);
__m128 xmm1 = _mm_mul_ps(vec1, vec1);
__m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
__m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
return sqrtf(_mm_cvtss_f32(xmm3));
}
Naive implementation:
sqrtf(V[0] * V[0] + V[1] * V[1] + V[2] * V[2] + V[3] * V[3])
The SIMD version took 16110ms to iterate 1000000000 times. The naive version was ~3 times faster, taking only 4746ms.
#include <math.h>
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>
static float vec4_len(const float *v) {
__m128 vec1 = _mm_load_ps(v);
__m128 xmm1 = _mm_mul_ps(vec1, vec1);
__m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
__m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
return sqrtf(_mm_cvtss_f32(xmm3));
}
int main() {
float A[4] __attribute__((aligned(16))) = {3, 4, 0, 0};
struct timespec t0 = {};
clock_gettime(CLOCK_MONOTONIC, &t0);
double sum_len = 0;
for (uint64_t k = 0; k < 1000000000; ++k) {
A[3] = k;
sum_len += vec4_len(A);
// sum_len += sqrtf(A[0] * A[0] + A[1] * A[1] + A[2] * A[2] + A[3] * A[3]);
}
struct timespec t1 = {};
clock_gettime(CLOCK_MONOTONIC, &t1);
fprintf(stdout, "%f\n", sum_len);
fprintf(stdout, "%ldms\n", (((t1.tv_sec - t0.tv_sec) * 1000000000) + (t1.tv_nsec - t0.tv_nsec)) / 1000000);
return 0;
}
I ran it with the following command on an Intel(R) Core(TM) i7-8550U CPU, first with the vec4_len version and then with the plain C one.
I compile with GCC (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0:
gcc -Wall -Wextra -O3 -msse -msse3 sse.c -lm && ./a.out
SSE version output:
499999999500000128.000000
13458ms
Plain C version output:
499999999500000128.000000
4441ms
The most obvious problem is using an inefficient dot-product (with haddps which costs 2x shuffle uops + 1x add uop) instead of shuffle + add. See Fastest way to do horizontal float vector sum on x86 for what to do after _mm_mul_ps that doesn't suck as much. But still this is just not something x86 can do very efficiently.
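For reference, a sketch of the cheaper shuffle + add horizontal sum from that linked Q&A (the SSE3 movshdup variant); vec4_len would then be sqrtf(hsum_ps_sse3(_mm_mul_ps(vec1, vec1))):
#include <immintrin.h>
static inline float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // { v1, v1, v3, v3 }
    __m128 sums = _mm_add_ps(v, shuf);       // { v0+v1, 2*v1, v2+v3, 2*v3 }
    shuf = _mm_movehl_ps(shuf, sums);        // move v2+v3 down to element 0
    sums = _mm_add_ss(sums, shuf);           // v0+v1+v2+v3 in element 0
    return _mm_cvtss_f32(sums);
}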
But anyway, the real problem is your benchmark loop.
A[3] = k; and then using _mm_load_ps(A) creates a store-forwarding stall, if it compiles naively instead of to a vector shuffle. A store + reload can be efficiently forwarded with ~5 cycles of latency if the load only loads data from a single store instruction, and no data outside that. Otherwise it has to do a slower scan of the whole store buffer to assemble bytes. This adds about 10 cycles of latency to the store-forwarding.
I'm not sure how much impact this has on throughput, but could be enough to stop out-of-order exec from overlapping enough loop iterations to hide the latency and only bottleneck on sqrtss shuffle throughput.
(Your Coffee Lake CPU has 1 per 3 cycle sqrtss throughput, so surprisingly SQRT throughput is not your bottleneck.1 Instead it will be shuffle throughput or something else.)
See Agner Fog's microarch guide and/or optimization manual.
What does "store-buffer forwarding" mean in the Intel developer's manual?
How does store to load forwarding happens in case of unaligned memory access?
Can modern x86 implementations store-forward from more than one prior store?
Why would a compiler generate this assembly? quotes Intel's optimization manual re: store forwarding. (In that question, an old gcc version stored the 2 dword halves of an 8-byte struct separately, then copied the struct with a qword load/store. Super braindead.)
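One way to sidestep the stall in a benchmark like this (a sketch, not the original code; needs SSE4.1 for insertps, and reuses the hsum_ps_sse3 sketch above) is to build the vector in registers instead of bouncing A[3] through memory:
#include <immintrin.h>
#include <math.h>
#include <stdint.h>
// hsum_ps_sse3() is the shuffle+add helper sketched earlier in this answer
double sum_lengths_no_stall(const float *A) {    // A 16-byte aligned; element 3 is replaced per iteration
    __m128 base = _mm_load_ps(A);                // load the constant part once
    double sum_len = 0;
    for (uint64_t k = 0; k < 1000000000; ++k) {
        __m128 vk  = _mm_set_ss((float)k);             // scalar int -> float convert
        __m128 vec = _mm_insert_ps(base, vk, 0x30);    // vec[3] = (float)k, rest from base
        sum_len += sqrtf(hsum_ps_sse3(_mm_mul_ps(vec, vec)));
    }
    return sum_len;
}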
Plus you're biasing this even more against SSE by letting the compiler hoist the computation of V[0] * V[0] + V[1] * V[1] + V[2] * V[2] out of the loop.
That part of the expression is loop-invariant, so the compiler only has to do (float)k squared, add, and a scalar sqrt every loop iteration. (And convert that to double to add to your accumulator).
(@StaceyGirl's deleted answer pointed this out; looking over the code of the inner loops in it was a great start on writing this answer.)
Extra inefficiency in A[3] = k in the vector version
GCC9.1's inner loop from Kamil's Godbolt link looks terrible, and seems to include a loop-carried store/reload to merge a new A[3] into the 8-byte A[2..3] pair, further limiting the CPU's ability to overlap multiple iterations.
I'm not sure why gcc thought this was a good idea. It would maybe help on CPUs that split vector loads into 8-byte halves (like Pentium M or Bobcat) to avoid store-forwarding stalls. But that's not a sane tuning for "generic" modern x86-64 CPUs.
.L18:
pxor xmm4, xmm4
mov rdx, QWORD PTR [rsp+8] ; reload A[2..3]
cvtsi2ss xmm4, rbx
mov edx, edx ; truncate RDX to 32-bit
movd eax, xmm4 ; float bit-pattern of (float)k
sal rax, 32
or rdx, rax ; merge the float bit-pattern into A[3]
mov QWORD PTR [rsp+8], rdx ; store A[2..3] again
movaps xmm0, XMMWORD PTR [rsp] ; vector load: store-forwarding stall
mulps xmm0, xmm0
haddps xmm0, xmm0
haddps xmm0, xmm0
ucomiss xmm3, xmm0
movaps xmm1, xmm0
sqrtss xmm1, xmm1
ja .L21 ; call sqrtf to set errno if needed; flags set by ucomiss.
.L17:
add rbx, 1
cvtss2sd xmm1, xmm1
addsd xmm2, xmm1 ; total += (double)sqrtf
cmp rbx, 1000000000
jne .L18 ; }while(k<1000000000);
This insanity isn't present in the scalar version.
Either way, gcc did manage to avoid the inefficiency of a full uint64_t -> float conversion (which x86 doesn't have in hardware until AVX512). It was presumably able to prove that using a signed 64-bit -> float conversion would always work because the high bit can't be set.
Footnote 1: But sqrtps has the same 1 per 3 cycle throughput as scalar, so you're only getting 1/4 of your CPU's sqrt throughput capability by doing 1 vector at a time horizontally, instead of doing 4 lengths for 4 vectors in parallel.

searching through a short sorted array of doubles

I am trying to optimize a search through a very short sorted array of doubles to locate a bucket a given value belongs to. Assuming the size of the array is 8 doubles, I have come up with the following sequence of AVX intrinsics:
_data = _mm256_load_pd(array);
temp = _mm256_movemask_pd(_mm256_cmp_pd(_data, _value, _CMP_LT_OQ));
pos = _mm_popcnt_u32(temp);
_data = _mm256_load_pd(array+4);
temp = _mm256_movemask_pd(_mm256_cmp_pd(_data, _value, _CMP_LT_OQ));
pos += _mm_popcnt_u32(temp);
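(For reference, the snippet above wrapped into a self-contained function; a sketch assuming _value is a broadcast of the search value and the array is 32-byte aligned:)
#include <immintrin.h>
// returns how many of the 8 elements are < value, i.e. the bucket index
static inline int find_bucket_avx(const double *array, double value) {
    __m256d v = _mm256_set1_pd(value);
    int lt0 = _mm_popcnt_u32(_mm256_movemask_pd(_mm256_cmp_pd(_mm256_load_pd(array), v, _CMP_LT_OQ)));
    int lt1 = _mm_popcnt_u32(_mm256_movemask_pd(_mm256_cmp_pd(_mm256_load_pd(array + 4), v, _CMP_LT_OQ)));
    return lt0 + lt1;
}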
To my surprise (I do not have the instruction latency specs in my head..), it turned out that gcc generates faster code for the following C loop:
for(i=0; i<7; ++i) if(array[i+1]>=value) break;
This loop compiles into what I found to be very efficient code:
lea ecx, [rax+1]
vmovsd xmm1, QWORD PTR [rdx+rcx*8]
vucomisd xmm1, xmm0
jae .L7
lea ecx, [rax+2]
vmovsd xmm1, QWORD PTR [rdx+rcx*8]
vucomisd xmm1, xmm0
jae .L8
[... repeat for all elements of array]
so it takes 4 instructions to check 1 bucket (lea, vmovsd, vucomisd, jae). Assuming the value is uniformly spread, on average I will have to check ~3.5 buckets per value. Apparently, this is enough to outperform the AVX code listed earlier.
Now, in a general case the array may of course be larger than 8 elements. If I code a C loop like this:
for(i=0; i<n-1; i++) if(array[i+1]>=value) break;
I get the following instruction sequence for the loop body:
.L76:
mov eax, edx
.L67:
cmp eax, esi
jae .L77
lea edx, [rax+1]
mov ecx, edx
vmovsd xmm1, QWORD PTR [rdi+rcx*8]
vucomisd xmm1, xmm0
jb .L76
I can tell gcc to unroll the loop, but the point is that the number of instructions per element is larger than in the case of the loop with constant bounds, and the code is slower. Also, I do not understand the reason behind using an additional rcx register for addressing in vmovsd.
I can manually modify the assembly for the loop to look something like in the first example, and it does work faster:
.L76:
cmp edx, esi # eax -> edx
jae .L77
lea edx, [rdx+1] # rax -> rdx
vmovsd xmm1, QWORD PTR [rdi+rdx*8]
vucomisd xmm1, xmm0
jb .L76
but I can not seem to make gcc do it. And I know it can - the asm generated in the first example is OK.
Do you have any ideas how to do it otherwise than using inline asm? Or even better - can you suggest a faster implementation of the search?
Not really an answer, but there's no room in the comments for this.
I tested the AVX function against a simple C implementation and got completely different results.
I tested on Windows 7 x64 not Linux but the generated code was very similar.
How the test went:
1) I disabled the CPU's SpeedStep.
2) Within main() I raised the process priority and thread priority to the max (realtime).
3) I ran 10M calls to the tested function to heat up the CPU - activate turbo.
4) I called Sleep(0) to avoid a context switch
5) I called __rdtscp to start measurement
6) In a loop I called either the AVX find-index function or the simple C version, like you did. The other implementation was commented out and not used. Loop size was 10M calls.
7) I called __rdtscp again to finish the benchmark.
8) I printed ticks/iterations. to get the average tick count for a call
Note: I declared both 'find index' functions as inline and I confirmed in the disassembly that they got inlined.
The AVX function and the C function you described are not identical: the C function returns a zero-based index and the AVX function returns a 1-based index.
On my system, it took the AVX function 1.1 cycles per iteration and the C function took 4.4 cycles per iteration.
I couldn't force the MSVC compiler to use more than ymm registers :(
Array used:
double A[8] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 };
Results (avg. ticks/iter):
value = 0.3 (index = 2): AVX: 1.1 | C: 4.4
value = 0.5 (index = 3): AVX: 1.1 | C: 11.1
value = 0.9 (index = 7): AVX: 1.1 | C: 18.1
If the AVX function is corrected to return pos-1, then it will be 50% slower.
You can see that the AVX function works in constant time while the trivial C loop function performance depends on the index you're looking for.
Timing with clock() and running 100M iterations yields similar results; AVX is almost 4x faster for the first test.
Also note that running longer tests reveals different results, but every time AVX holds a similar advantage.
You can try integer comparison. For non-negative, non-NaN values, double comparison is equivalent to int64_t comparison of the same bits (negative doubles compare in reverse order when their bit patterns are treated as integers). It could turn out faster: the CPU has more integer execution units than SIMD ones. Just pass in the double* and reinterpret the bits as int64_t inside the function.
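A minimal sketch of that idea (my code, not the answerer's; it uses memcpy for an aliasing-safe bit copy and assumes non-negative, non-NaN values):
#include <stdint.h>
#include <string.h>
// same result as the scalar double loop under those assumptions
static inline int find_bucket_int64(const double *array, double value, int n) {
    int64_t iv;
    memcpy(&iv, &value, sizeof iv);
    int i;
    for (i = 0; i < n - 1; ++i) {
        int64_t ai;
        memcpy(&ai, &array[i + 1], sizeof ai);
        if (ai >= iv) break;        // same as array[i+1] >= value
    }
    return i;
}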

Resources