Why is my SIMD vector4 length function 3x slower than a naive vector length method?
SIMD vector4 length function:
__extern_always_inline float vec4_len(const float *v) {
__m128 vec1 = _mm_load_ps(v);
__m128 xmm1 = _mm_mul_ps(vec1, vec1);
__m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
__m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
return sqrtf(_mm_cvtss_f32(xmm3));
}
Naive implementation:
sqrtf(V[0] * V[0] + V[1] * V[1] + V[2] * V[2] + V[3] * V[3])
The SIMD version took 16110ms to iterate 1000000000 times. The naive version was ~3 times faster, taking only 4746ms.
#include <math.h>
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>
static float vec4_len(const float *v) {
__m128 vec1 = _mm_load_ps(v);
__m128 xmm1 = _mm_mul_ps(vec1, vec1);
__m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
__m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
return sqrtf(_mm_cvtss_f32(xmm3));
}
int main() {
float A[4] __attribute__((aligned(16))) = {3, 4, 0, 0};
struct timespec t0 = {};
clock_gettime(CLOCK_MONOTONIC, &t0);
double sum_len = 0;
for (uint64_t k = 0; k < 1000000000; ++k) {
A[3] = k;
sum_len += vec4_len(A);
// sum_len += sqrtf(A[0] * A[0] + A[1] * A[1] + A[2] * A[2] + A[3] * A[3]);
}
struct timespec t1 = {};
clock_gettime(CLOCK_MONOTONIC, &t1);
fprintf(stdout, "%f\n", sum_len);
fprintf(stdout, "%ldms\n", (((t1.tv_sec - t0.tv_sec) * 1000000000) + (t1.tv_nsec - t0.tv_nsec)) / 1000000);
return 0;
}
I run with the following command on an Intel(R) Core(TM) i7-8550U CPU, first with the vec4_len version and then with the plain C. I compile with GCC (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0:
gcc -Wall -Wextra -O3 -msse -msse3 sse.c -lm && ./a.out
SSE version output:
499999999500000128.000000
13458ms
Plain C version output:
499999999500000128.000000
4441ms
The most obvious problem is using an inefficient dot-product (with haddps which costs 2x shuffle uops + 1x add uop) instead of shuffle + add. See Fastest way to do horizontal float vector sum on x86 for what to do after _mm_mul_ps that doesn't suck as much. But still this is just not something x86 can do very efficiently.
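For example, here is a hedged sketch of the kind of shuffle + add sequence that answer recommends (SSE3, which you already enable with -msse3; hsum_ps_sse3 is my name for it, not something from your code):
static inline float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // duplicate odd elements: [v1, v1, v3, v3]
    __m128 sums = _mm_add_ps(v, shuf);       // [v0+v1, ..., v2+v3, ...]
    shuf        = _mm_movehl_ps(shuf, sums); // bring the high pair down to the low half
    sums        = _mm_add_ss(sums, shuf);    // (v0+v1) + (v2+v3) in the low element
    return _mm_cvtss_f32(sums);
}
// usage in vec4_len: return sqrtf(hsum_ps_sse3(_mm_mul_ps(vec1, vec1)));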
But anyway, the real problem is your benchmark loop.
A[3] = k; and then using _mm_load_ps(A) creates a store-forwarding stall, if it compiles naively instead of to a vector shuffle. A store + reload can be efficiently forwarded with ~5 cycles of latency if the load only loads data from a single store instruction, and no data outside that. Otherwise it has to do a slower scan of the whole store buffer to assemble bytes. This adds about 10 cycles of latency to the store-forwarding.
I'm not sure how much impact this has on throughput, but could be enough to stop out-of-order exec from overlapping enough loop iterations to hide the latency and only bottleneck on sqrtss shuffle throughput.
(Your Coffee Lake CPU has 1 per 3 cycle sqrtss throughput, so surprisingly SQRT throughput is not your bottleneck (see footnote 1). Instead it will be shuffle throughput or something else.)
See Agner Fog's microarch guide and/or optimization manual.
What does "store-buffer forwarding" mean in the Intel developer's manual?
How does store to load forwarding happens in case of unaligned memory access?
Can modern x86 implementations store-forward from more than one prior store?
Why would a compiler generate this assembly? quotes Intel's optimization manual re: store forwarding. (In that question, an old gcc version stored the 2 dword halves of an 8-byte struct separately, then copied the struct with a qword load/store. Super braindead.)
Plus you're biasing this even more against SSE by letting the compiler hoist the computation of V[0] * V[0] + V[1] * V[1] + V[2] * V[2] out of the loop.
That part of the expression is loop-invariant, so the compiler only has to do (float)k squared, add, and a scalar sqrt every loop iteration. (And convert that to double to add to your accumulator).
(@StaceyGirl's deleted answer pointed this out; looking over the code of the inner loops in it was a great start on writing this answer.)
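In other words, the scalar version effectively ends up benchmarking something like this (my paraphrase of the compiler's transformation, not code from the question):
float invariant = V[0]*V[0] + V[1]*V[1] + V[2]*V[2];   // hoisted out of the loop
for (uint64_t k = 0; k < 1000000000; ++k) {
    float x = (float)k;                                 // A[3] = k
    sum_len += (double)sqrtf(invariant + x * x);        // one convert, mul, add, sqrt per iteration
}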
Extra inefficiency in A[3] = k in the vector version
GCC9.1's inner loop from Kamil's Godbolt link looks terrible, and seems to include a loop-carried store/reload to merge a new A[3] into the 8-byte A[2..3] pair, further limiting the CPU's ability to overlap multiple iterations.
I'm not sure why gcc thought this was a good idea. It would maybe help on CPUs that split vector loads into 8-byte halves (like Pentium M or Bobcat) to avoid store-forwarding stalls. But that's not a sane tuning for "generic" modern x86-64 CPUs.
.L18:
pxor xmm4, xmm4
mov rdx, QWORD PTR [rsp+8] ; reload A[2..3]
cvtsi2ss xmm4, rbx
mov edx, edx ; truncate RDX to 32-bit
movd eax, xmm4 ; float bit-pattern of (float)k
sal rax, 32
or rdx, rax ; merge the float bit-pattern into A[3]
mov QWORD PTR [rsp+8], rdx ; store A[2..3] again
movaps xmm0, XMMWORD PTR [rsp] ; vector load: store-forwarding stall
mulps xmm0, xmm0
haddps xmm0, xmm0
haddps xmm0, xmm0
ucomiss xmm3, xmm0
movaps xmm1, xmm0
sqrtss xmm1, xmm1
ja .L21 ; call sqrtf to set errno if needed; flags set by ucomiss.
.L17:
add rbx, 1
cvtss2sd xmm1, xmm1
addsd xmm2, xmm1 ; total += (double)sqrtf
cmp rbx, 1000000000
jne .L18 ; }while(k<1000000000);
This insanity isn't present in the scalar version.
Either way, gcc did manage to avoid the inefficiency of a full uint64_t -> float conversion (which x86 doesn't have in hardware until AVX512). It was presumably able to prove that using a signed 64-bit -> float conversion would always work because the high bit can't be set.
Footnote 1: But sqrtps has the same 1 per 3 cycle throughput as scalar, so you're only getting 1/4 of your CPU's sqrt throughput capability by doing 1 vector at a time horizontally, instead of doing 4 lengths for 4 vectors in parallel.
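For example, if the data were laid out as separate x/y/z/w arrays (struct-of-arrays, an assumption; the question stores one xyzw vector), four lengths could come out of a single sqrtps:
static inline __m128 vec4_len_x4(__m128 x, __m128 y, __m128 z, __m128 w) {
    __m128 sumsq = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                              _mm_add_ps(_mm_mul_ps(z, z), _mm_mul_ps(w, w)));
    return _mm_sqrt_ps(sumsq);   // 4 lengths per sqrtps, no horizontal ops needed
}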
In one of our last assignments in Computer Science this term we have to apply strength reduction to some code fragments. Most of them were straightforward, especially when looking at the compiler output. But one of them I haven't been able to solve, even with the help of the compiler.
Our profs gave us the following hint:
Hint: Inquire how IEEE 754 single-precision floating-point numbers are
represented in memory.
Here is the code snippet: (a is of type double*)
for (int i = 0; i < N; ++i) {
a[i] += i / 5.3;
}
At first I tried to look at the compiler output for this snippet on Godbolt. I compiled it without any optimization (note: I copied only the relevant part of the for loop):
mov eax, DWORD PTR [rbp-4]
cdqe
lea rdx, [0+rax*8]
mov rax, QWORD PTR [rbp-16]
add rax, rdx
movsd xmm1, QWORD PTR [rax]
cvtsi2sd xmm0, DWORD PTR [rbp-4] //division relevant
movsd xmm2, QWORD PTR .LC0[rip] //division relevant
divsd xmm0, xmm2 //division relevant
mov eax, DWORD PTR [rbp-4]
cdqe
lea rdx, [0+rax*8]
mov rax, QWORD PTR [rbp-16]
add rax, rdx
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
and with -O3:
.L2:
pshufd xmm0, xmm2, 238 //division relevant
cvtdq2pd xmm1, xmm2 //division relevant
movupd xmm6, XMMWORD PTR [rax]
add rax, 32
cvtdq2pd xmm0, xmm0 //division relevant
divpd xmm1, xmm3 //division relevant
movupd xmm5, XMMWORD PTR [rax-16]
paddd xmm2, xmm4
divpd xmm0, xmm3 //division relevant
addpd xmm1, xmm6
movups XMMWORD PTR [rax-32], xmm1
addpd xmm0, xmm5
movups XMMWORD PTR [rax-16], xmm0
cmp rax, rbp
jne .L2
I commented the division part of the assembly code. But this output does not help me understand how to apply strength reduction to the snippet. (Maybe there are too many optimizations going on to fully understand the output.)
Second, I tried to understand the bit representation of the constant 5.3 as a single-precision float, which is:
0 |10000001|01010011001100110011010
Sign|Exponent|Mantissa
But this does not help me either.
If we adopt Wikipedia's definition that
strength reduction is a compiler optimization where expensive operations are replaced with equivalent but less expensive operations
then we can apply strength reduction here by converting the expensive floating-point division into a floating-point multiply plus two floating-point multiply-adds (FMAs). Assuming that double is mapped to IEEE-754 binary64, that the default rounding mode for floating-point computation is round-to-nearest-or-even, and that int is a 32-bit type, we can prove the transformation correct by a simple exhaustive test:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <math.h>
int main (void)
{
const double rcp_5p3 = 1.0 / 5.3; // 0x1.826a439f656f2p-3
int i = INT_MAX;
do {
double ref = i / 5.3;
double res = fma (fma (-5.3, i * rcp_5p3, i), rcp_5p3, i * rcp_5p3);
if (res != ref) {
printf ("error: i=%2d res=%23.13a ref=%23.13a\n", i, res, ref);
return EXIT_FAILURE;
}
i--;
} while (i >= 0);
return EXIT_SUCCESS;
}
Most modern instances of common processor architectures like x86-64 and ARM64 have hardware support for FMA, such that fma() can be mapped directly to the appropriate hardware instruction. This should be confirmed by looking at the disassembly of the generated binary. Where hardware support for FMA is lacking, the transformation obviously should not be applied, as software implementations of fma() are slow and sometimes functionally incorrect.
The basic idea here is that mathematically, division is equivalent to multiplication with the reciprocal. However, that is not necessarily true for finite-precision floating-point arithmetic. The code above tries to improve the likelihood of bit-accurate computation by determining the error in the naive approach with the help of FMA and applying a correction where necessary. For background including literature references see this earlier question.
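Spelled out as separate steps, the corrected division tested above amounts to the following (just a restatement of the expression in the test; div_by_5p3 is a name I introduce here):
#include <math.h>
static double div_by_5p3 (double i)
{
    const double rcp = 1.0 / 5.3;  // same as rcp_5p3 above
    double q = i * rcp;            // naive approximation of i / 5.3
    double e = fma (-5.3, q, i);   // residual i - 5.3*q; FMA avoids rounding the product
    return fma (e, rcp, q);        // corrected quotient q + e*rcp
}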
To the best of my knowledge, there is not yet a general mathematically proven algorithm to determine for which divisors paired with which dividends the above transformation is safe (that is, delivers bit-accurate results), which is why an exhaustive test is strictly necessary to show that the transformation is valid.
In comments, Pascal Cuoq points out that there is an alternative algorithm to potentially strength-reduce floating-point division with a compile-time constant divisor, by precomputing the reciprocal of the divisor to more than native precision, specifically as a double-double. For background see N. Brisebarre and J.-M. Muller, "Correctly rounded multiplication by arbitrary precision constants", IEEE Transactions on Computers, 57(2): 162-174, February 2008, which also provides guidance on how to determine whether that transformation is safe for any particular constant. Since the present case is simple, I again used an exhaustive test to show it is safe. Where applicable, this will reduce the division down to one FMA plus a multiply:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <mathimf.h>
int main (void)
{
const double rcp_5p3_hi = 1.8867924528301888e-1; // 0x1.826a439f656f2p-3
const double rcp_5p3_lo = -7.2921377017921457e-18;// -0x1.0d084b1883f6e0p-57
int i = INT_MAX;
do {
double ref = i / 5.3;
double res = fma (i, rcp_5p3_hi, i * rcp_5p3_lo);
if (res != ref) {
printf ("i=%2d res=%23.13a ref=%23.13a\n", i, res, ref);
return EXIT_FAILURE;
}
i--;
} while (i >= 0);
return EXIT_SUCCESS;
}
To cover another aspect: since all values of type int are exactly representable as double (but not as float), it is possible to get rid of the int-to-double conversion that happens in the loop when evaluating i / 5.3 by introducing a floating-point variable that counts from 0.0 to N:
double fp_i = 0;
for (int i = 0; i < N; fp_i += 1, i++)
a[i] += fp_i / 5.3;
However, this kills autovectorization and introduces a chain of dependent floating-point additions. Floating-point addition latency is typically 3 or 4 cycles, so the last iteration will retire after at least (N-1)*3 cycles, even if the CPU could dispatch the instructions in the loop faster. Thankfully, floating-point division is not fully pipelined, and the rate at which an x86 CPU can dispatch floating-point divisions roughly matches or exceeds the latency of the addition instruction.
This leaves the problem of killed vectorization. It's possible to bring it back by manually unrolling the loop and introducing two independent chains, but with AVX you'd need four chains for full vectorization:
double fp_i0 = 0, fp_i1 = 1;
int i = 0;
for (; i+1 < N; fp_i0 += 2, fp_i1 += 2, i += 2) {
double t0 = a[i], t1 = a[i+1];
a[i] = t0 + fp_i0 / 5.3;
a[i+1] = t1 + fp_i1 / 5.3;
}
if (i < N)
a[i] += i / 5.3;
CAVEAT: After a few days I realized that this answer is incorrect in that it ignores the consequence of underflow (to subnormal or to zero) in the computation of o / 5.3. In this case, multiplying the result by a power of two is “exact” but does not produce the result that dividing the larger integer by 5.3 would have.
i / 5.3 only needs to be computed for odd values of i.
For even values of i, you can simply multiply by 2.0 the value of (i/2)/5.3, which was already computed earlier in the loop.
The remaining difficulty is to reorder the iterations in a way such that each index between 0 and N-1 is handled exactly once and the program does not need to record an arbitrary number of division results.
One way to achieve this is to iterate over all odd numbers o less than N and, after computing o / 5.3 to handle index o, also handle all indexes of the form o * 2**p.
if (N > 0) {
a[0] += 0.0; // this is needed for strict IEEE 754 compliance lol
for (int o = 1; o < N; o += 2) {
double d = o / 5.3;
int i = o;
do {
a[i] += d;
i += i;
d += d;
} while (i < N);
}
}
Note: this does not use the provided hint “Inquire how IEEE 754 single-precision floating-point numbers are represented in memory”. I think I know pretty well how single-precision floating-point numbers are represented in memory, but I do not see how that is relevant, especially since there are no single-precision values or computations in the code to optimize. I think there is a mistake in the way the problem is expressed, but still the above is technically a partial answer to the question as phrased.
I also ignored overflow problems for values of N that come close to INT_MAX in the code snippet above, since the code is already complicated enough.
As an additional note, the above transformation only replaces one division out of two. It does this by making the code unvectorizable (and also less cache-friendly). In your question, gcc -O3 has already shown that automatic vectorization can be applied to the starting point your professor suggested, and that is likely to be more beneficial than suppressing half the divisions. The only good thing about the transformation in this answer is that it is a sort of strength reduction, which your professor requested.
I am looking for an optimal method to calculate the sum of all packed 32-bit integers in a __m256i or __m512i. To calculate the sum of n elements, I often use log2(n) vpaddd and vpermd instructions, then extract the final result. However, I don't think it is the best option.
Edit: best/optimal in terms of speed/cycle reduction.
Related: if you're looking for the non-existent _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics; vpsadbw as an hsum within qwords is much more efficient than shuffling.
Without AVX512, see hsum_8x32(__m256i) below for AVX2 without Intel's reduce_add helper function. reduce_add doesn't necessarily compile optimally anyway with AVX512.
There is an int _mm512_reduce_add_epi32(__m512i) inline function in immintrin.h. You might as well use it. (It compiles to shuffle and add instructions, but more efficient ones than vpermd, like I describe below.) AVX512 didn't introduce any new hardware support for horizontal sums, just this new helper function. It's still something to avoid or sink out of loops whenever possible.
GCC 9.2 -O3 -march=skylake-avx512 compiles a wrapper that calls it as follows:
vextracti64x4 ymm1, zmm0, 0x1
vpaddd ymm1, ymm1, ymm0
vextracti64x2 xmm0, ymm1, 0x1 # silly compiler, vextracti128 would be shorter
vpaddd xmm1, xmm0, xmm1
vpshufd xmm0, xmm1, 78
vpaddd xmm0, xmm0, xmm1
vmovd edx, xmm0
vpextrd eax, xmm0, 1 # 2x xmm->integer to feed scalar add.
add eax, edx
ret
Extracting twice to feed scalar add is questionable; it needs uops for p0 and p5 so it's equivalent to a regular shuffle + a movd.
Clang doesn't do that; it does one more step of shuffle / SIMD add to reduce down to a single scalar for vmovd. See below for perf analysis of the two.
There is a VPHADDD but you should never use it with both inputs the same. (Unless you're optimizing for code-size over speed). It can be useful to transpose-and-sum multiple vectors, resulting in some vectors of results. You do that by feeding phadd with 2 different inputs. (Except it gets messy with 256 and 512-bit because vphadd is still only in-lane.)
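As a hedged illustration of that transpose-and-sum use (my example, not from the original answer), summing four __m128i vectors into one vector of their four totals looks like this:
#include <immintrin.h>
// returns [sum(a), sum(b), sum(c), sum(d)] as four 32-bit elements (SSSE3)
static inline __m128i hsum_4x4_epi32(__m128i a, __m128i b, __m128i c, __m128i d) {
    __m128i ab = _mm_hadd_epi32(a, b);   // [a0+a1, a2+a3, b0+b1, b2+b3]
    __m128i cd = _mm_hadd_epi32(c, d);   // [c0+c1, c2+c3, d0+d1, d2+d3]
    return _mm_hadd_epi32(ab, cd);       // [sum a, sum b, sum c, sum d]
}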
Yes, you need log2(vector_width) shuffles and vpaddd instructions. (So this isn't very efficient; avoid horizontal sums inside inner loops. Accumulate vertically until the end of a loop, for example).
General strategy for all SSE / AVX / AVX512
You want to successively narrow from 512 -> 256, then 256 -> 128, then shuffle within __m128i until you're down to one scalar element. Presumably some future AMD CPU will decode 512-bit instructions to two 256-bit uops, so reducing width is a big win there. And narrower instructions presumably cost slightly less power.
Your shuffles can take immediate control operands, rather than needing a vector control operand like vpermd does, e.g. VEXTRACTI32x8, vextracti128, and vpshufd. (Or vpunpckhqdq to save code size for the immediate constant.)
See Fastest way to do horizontal SSE vector sum (or other reduction) (my answer also includes some integer versions).
This general strategy is appropriate for all element types: float, double, and any size integer
Special cases:
8-bit integer: start with vpsadbw, which is more efficient and avoids overflow, but then continue as for 64-bit integers (a sketch follows after the compiled output below).
16-bit integer: start by widening to 32-bit with pmaddwd (_mm256_madd_epi16 with set1_epi16(1)); see SIMD: Accumulate Adjacent Pairs. That's fewer uops even if you don't care about the overflow-avoidance benefit, except on AMD before Zen 2 where 256-bit instructions cost at least 2 uops. Then continue as for 32-bit integers.
32-bit integer can be done manually like this, with an SSE2 function called by the AVX2 function after reducing to __m128i, in turn called by the AVX512 function after reducing to __m256i. The calls will of course inline in practice.
#include <immintrin.h>
#include <stdint.h>
// from my earlier answer, with tuning for non-AVX CPUs removed
// static inline
uint32_t hsum_epi32_avx(__m128i x)
{
__m128i hi64 = _mm_unpackhi_epi64(x, x); // 3-operand non-destructive AVX lets us save a byte without needing a movdqa
__m128i sum64 = _mm_add_epi32(hi64, x);
__m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1)); // Swap the low two elements
__m128i sum32 = _mm_add_epi32(sum64, hi32);
return _mm_cvtsi128_si32(sum32); // movd
}
// only needs AVX2
uint32_t hsum_8x32(__m256i v)
{
__m128i sum128 = _mm_add_epi32(
_mm256_castsi256_si128(v),
_mm256_extracti128_si256(v, 1)); // silly GCC uses a longer AVX512VL instruction if AVX512 is enabled :/
return hsum_epi32_avx(sum128);
}
// AVX512
uint32_t hsum_16x32(__m512i v)
{
__m256i sum256 = _mm256_add_epi32(
_mm512_castsi512_si256(v), // low half
_mm512_extracti64x4_epi64(v, 1)); // high half. AVX512F. 32x8 version is AVX512DQ
return hsum_8x32(sum256);
}
Notice that this uses __m256i hsum as a building block for __m512i; there's nothing to be gained by doing in-lane operations first.
Well, possibly a very tiny advantage: in-lane shuffles have lower latency than lane-crossing ones, so they could execute 2 cycles earlier, leave the RS earlier, and similarly retire from the ROB slightly earlier. But the higher-latency shuffles are coming just a couple of instructions later even if you did that. So you might get a handful of independent instructions into the back-end 2 cycles earlier if this hsum was on the critical path (blocking retirement).
But reducing to a narrower vector width sooner is generally good, maybe getting 512-bit uops out of the system sooner so the CPU can re-activate the SIMD execution units on port 1, if you aren't doing more 512-bit work right away.
Compiles on Godbolt to these instructions, with GCC9.2 -O3 -march=skylake-avx512
hsum_16x32(long long __vector(8)):
vextracti64x4 ymm1, zmm0, 0x1
vpaddd ymm0, ymm1, ymm0
vextracti64x2 xmm1, ymm0, 0x1 # silly compiler uses a longer EVEX instruction when it's available (AVX512VL)
vpaddd xmm0, xmm0, xmm1
vpunpckhqdq xmm1, xmm0, xmm0
vpaddd xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 177
vpaddd xmm0, xmm1, xmm0
vmovd eax, xmm0
ret
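Coming back to the 8-bit special case from the list above, here is a hedged sketch of the vpsadbw approach for a __m256i of unsigned bytes (hsum_epu8_avx2 is my name for it, not an intrinsic):
#include <immintrin.h>
#include <stdint.h>
uint32_t hsum_epu8_avx2(__m256i v)
{
    __m256i sad  = _mm256_sad_epu8(v, _mm256_setzero_si256()); // 4x 64-bit sums of 8 bytes each
    __m128i lo   = _mm256_castsi256_si128(sad);
    __m128i hi   = _mm256_extracti128_si256(sad, 1);
    __m128i sum  = _mm_add_epi64(lo, hi);                      // reduce to 2x 64-bit
    __m128i hi64 = _mm_unpackhi_epi64(sum, sum);
    return (uint32_t)_mm_cvtsi128_si32(_mm_add_epi64(sum, hi64));
}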
P.S.: perf analysis of GCC's _mm512_reduce_add_epi32 vs. clang's (which is equivalent to my version), using data from https://uops.info/ and/or Agner Fog's instruction tables:
After inlining into a caller that does something with the result, it could allow optimizations like adding a constant as well using lea eax, [rax + rdx + 123] or something.
But other than that it seems almost always worse than the shuffle / vpadd / vmovd at the end of my implementation, on Skylake-X:
total uops: reduce: 4. Mine: 3
ports: reduce: 2p0, p5 (part of vpextrd), p0156 (scalar add)
ports: mine: p5, p015 (vpadd on SKX), p0 (vmovd)
Latency is equal at 4 cycles, assuming no resource conflicts:
shuffle 1 cycle -> SIMD add 1 cycle -> vmovd 2 cycles
vpextrd 3 cycles (in parallel with 2 cycle vmovd) -> add 1 cycle.
When running a sum loop over an array in Rust, I noticed a huge performance drop when CAPACITY >= 240. CAPACITY = 239 is about 80 times faster.
Is there special compilation optimization Rust is doing for "short" arrays?
Compiled with rustc -C opt-level=3.
use std::time::Instant;
const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;
fn main() {
let mut arr = [0; CAPACITY];
for i in 0..CAPACITY {
arr[i] = i;
}
let mut sum = 0;
let now = Instant::now();
for _ in 0..IN_LOOPS {
let mut s = 0;
for i in 0..arr.len() {
s += arr[i];
}
sum += s;
}
println!("sum:{} time:{:?}", sum, now.elapsed());
}
Summary: below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.
You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, therefore the length is multiplied by 8 bytes, assuming an x86-64 CPU). In this benchmark, one specific optimization – only performed for length 239 – is responsible for the huge speed difference. But let's start slowly:
(All code in this answer is compiled with -C opt-level=3)
pub fn foo() -> usize {
let arr = [0; 240];
let mut s = 0;
for i in 0..arr.len() {
s += arr[i];
}
s
}
This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. Here is a small part of the assembly:
movdqa xmm1, xmmword ptr [rsp + 32]
movdqa xmm0, xmmword ptr [rsp + 48]
paddq xmm1, xmmword ptr [rsp]
paddq xmm0, xmmword ptr [rsp + 16]
paddq xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq xmm0, xmmword ptr [rsp + 1840]
paddq xmm1, xmmword ptr [rsp + 1856]
paddq xmm0, xmmword ptr [rsp + 1872]
paddq xmm0, xmm1
pshufd xmm1, xmm0, 78
paddq xmm1, xmm0
This is what's called loop unrolling: LLVM pastes the loop body a bunch of times to avoid having to execute all those "loop management" instructions, i.e. incrementing the loop variable, checking whether the loop has ended, and jumping back to the start of the loop.
In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that instruction-level parallelism of the CPU can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.
Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.
LLVM does unroll small loops some when it's not fully unrolling, and still uses multiple accumulators. So usually, front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.
But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts the one loop inside another one:
const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;
pub fn foo() -> usize {
let mut arr = [0; CAPACITY];
for i in 0..CAPACITY {
arr[i] = i;
}
let mut sum = 0;
for _ in 0..IN_LOOPS {
let mut s = 0;
for i in 0..arr.len() {
s += arr[i];
}
sum += s;
}
sum
}
(On Godbolt Compiler Explorer)
The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite some code just for initializing, which we will ignore.) For 239, however, it looks very different! We see that the initializing loop and the inner loop got unrolled: so far so expected.
The important difference is that for 239, LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!
First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I commented to explain the assembly; the comments with * are especially important):
; at the start of the function, `rbx` was set to 0
movq rax, xmm1 ; result of SIMD summing up stored in `rax`
add rax, 711 ; add up missing terms from loop unrolling
mov ecx, 500000 ; * init loop variable outer loop
.LBB0_1:
add rbx, rax ; * rbx += rax
add rcx, -1 ; * decrement loop variable
jne .LBB0_1 ; * if loop variable != 0 jump to LBB0_1
mov rax, rbx ; move rbx (the sum) back to rax
; two unimportant instructions omitted
ret ; the return value is stored in `rax`
As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have ran and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.
This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.
An additional note: can you do anything about this? Not really. LLVM has to have such magic thresholds because without them LLVM optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even a factor of 2 in these cases. So no need to worry about real use cases.
As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array. And changing this in the second example does not lead to any notable differences in emitted assembly. You should use short and idiomatic versions unless you measured that it hurts performance.
In addition to Lukas' answer, if you want to use an iterator, try this:
const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;
pub fn bar() -> usize {
(0..CAPACITY).sum::<usize>() * IN_LOOPS
}
Thanks @Chris Morgan for the suggestion about the range pattern.
The optimized assembly is quite good:
example::bar:
movabs rax, 14340000000
ret
I would expect SSE to be faster than not using SSE. Do I need to add some additional compiler flags? Could it be that I am not seeing a speedup because this is integer code and not floating point?
invocation/output
$ make sum2
clang -O3 -msse -msse2 -msse3 -msse4.1 sum2.c ; ./a.out 123
n: 123
SSE Time taken: 0 seconds 124 milliseconds
vector+vector:begin int: 1 5 127 0
vector+vector:end int: 0 64 66 68
NOSSE Time taken: 0 seconds 115 milliseconds
vector+vector:begin int: 1 5 127 0
vector+vector:end int: 0 64 66 68
compiler
$ clang --version
Apple LLVM version 9.0.0 (clang-900.0.37)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
sum2.c
#include <stdlib.h>
#include <stdio.h>
#include <x86intrin.h>
#include <time.h>
#ifndef __cplusplus
#include <stdalign.h> // C11 defines _Alignas(). This header defines alignas()
#endif
#define CYCLE_COUNT 10000
// add vector and return resulting value on stack
__attribute__((noinline)) __m128i add_iv(__m128i *a, __m128i *b) {
return _mm_add_epi32(*a,*b);
}
// add int vectors via sse
__attribute__((noinline)) void add_iv_sse(__m128i *a, __m128i *b, __m128i *out, int N) {
for(int i=0; i<N/sizeof(int); i++) {
//out[i]= _mm_add_epi32(a[i], b[i]); // this also works
_mm_storeu_si128(&out[i], _mm_add_epi32(a[i], b[i]));
}
}
// add int vectors without sse
__attribute__((noinline)) void add_iv_nosse(int *a, int *b, int *out, int N) {
for(int i=0; i<N; i++) {
out[i] = a[i] + b[i];
}
}
__attribute__((noinline)) void p128_as_int(__m128i in) {
alignas(16) uint32_t v[4];
_mm_store_si128((__m128i*)v, in);
printf("int: %i %i %i %i\n", v[0], v[1], v[2], v[3]);
}
// print first 4 and last 4 elements of int array
__attribute__((noinline)) void debug_print(int *h) {
printf("vector+vector:begin ");
p128_as_int(* (__m128i*) &h[0] );
printf("vector+vector:end ");
p128_as_int(* (__m128i*) &h[32764] );
}
int main(int argc, char *argv[]) {
int n = atoi (argv[1]);
printf("n: %d\n", n);
// sum: vector + vector, of equal length
int f[32768] __attribute__((aligned(16))) = {0,2,4};
int g[32768] __attribute__((aligned(16))) = {1,3,n};
int h[32768] __attribute__((aligned(16)));
f[32765] = 33; f[32766] = 34; f[32767] = 35;
g[32765] = 31; g[32766] = 32; g[32767] = 33;
// https://stackoverflow.com/questions/459691/best-timing-method-in-c
clock_t start = clock();
for(int i=0; i<CYCLE_COUNT; ++i) {
add_iv_sse((__m128i*)f, (__m128i*)g, (__m128i*)h, 32768);
}
int msec = (clock()-start) * 1000 / CLOCKS_PER_SEC;
printf(" SSE Time taken: %d seconds %d milliseconds\n", msec/1000, msec%1000);
debug_print(h);
// process intense function again
start = clock();
for(int i=0; i<CYCLE_COUNT; ++i) {
add_iv_nosse(f, g, h, 32768);
}
msec = (clock()-start) * 1000 / CLOCKS_PER_SEC;
printf("NOSSE Time taken: %d seconds %d milliseconds\n", msec/1000, msec%1000);
debug_print(h);
return EXIT_SUCCESS;
}
Look at the asm: clang -O2 or -O3 probably auto-vectorizes add_iv_nosse (with a check for overlap, since you didn't use int * restrict a and so on).
Use -fno-tree-vectorize to disable auto vectorization, without stopping you from using intrinsics. I'd recommend clang -march=native -mno-avx -O3 -fno-tree-vectorize to test what I think you want to test, scalar integer vs. legacy-SSE paddd. (It works in gcc and clang. In clang, AFAIK it's a synonym for the clang-specific -fno-vectorize.)
BTW, timing both in the same executable hurts the first one, because the CPU doesn't ramp up to full turbo right away. You're probably into the timed section of the code before your CPU hits full speed. (So run this a couple of times back-to-back, with for i in {1..10}; do time ./a.out; done.)
On Linux I'd use perf stat -r5 ./a.out to run it 5 times with performance counters (and I'd split it up so one run tested one or the other, so I could look at perf counters for the whole run.)
Code review:
You forgot stdint.h for uint32_t. I had to add that to get it to compile on Godbolt to see the asm. (Assuming clang-5.0 is something like the Apple clang version you're using. IDK if Apple's clang implies a default -mtune= option, but that would make sense because it's only targeting Mac. Also a baseline SSSE3 would make sense for 64-bit on x86-64 OS X.)
You don't need noinline on debug_print. Also, I'd recommend a different name for CYCLE_COUNT. Cycles in this context makes me think of clock cycles, so call it REP_COUNT or REPEATS or whatever.
Putting your arrays on the stack in main is probably fine. You do initialize both input arrays (to mostly zero, but add performance isn't data-dependent).
This is good, because leaving them uninitialized might mean that multiple 4k pages of each array were copy-on-write mapped to the same physical zero page, so you'd get more than the expected number of L1D cache hits.
The SSE2 loop should bottleneck on L2 / L3 cache bandwidth, since the working set is 4 bytes * 32768 elements * 3 arrays = 384 kiB, about 1.5x the 256 kiB L2 cache in Intel CPUs.
clang might unroll its auto-vectorized loop more than it does your manual intrinsics loop. That might explain the better performance, since only 16B vectors (not 32B AVX2) might not saturate cache bandwidth if you're not getting 2 loads + 1 store per clock.
Update: actually the loop overhead is pretty extreme, with 3 pointer increments + a loop counter, and only unrolling by 2 to amortize that.
The auto-vectorized loop:
.LBB2_12: # =>This Inner Loop Header: Depth=1
movdqu xmm0, xmmword ptr [r9 - 16]
movdqu xmm1, xmmword ptr [r9] # hoisted load for 2nd unrolled iter
movdqu xmm2, xmmword ptr [r10 - 16]
paddd xmm2, xmm0
movdqu xmm0, xmmword ptr [r10]
paddd xmm0, xmm1
movdqu xmmword ptr [r11 - 16], xmm2
movdqu xmmword ptr [r11], xmm0
add r9, 32
add r10, 32
add r11, 32
add rbx, -8 # add / jne macro-fused on SnB-family CPUs
jne .LBB2_12
So it's 12 fused-domain uops, and can run at best 2 vectors per 3 clocks, bottlenecked on the front-end issue bandwidth of 4 uops per clock.
It's not using aligned loads because the compiler doesn't have that info without inlining into main where the alignment is known, and you didn't guarantee alignment with p = __builtin_assume_aligned(p, 16) or anything in the stand-alone function. Aligned loads (or AVX) would let paddd use a memory operand instead of a separate movdqu load.
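A hedged sketch of what that could look like in the stand-alone function (add_iv_nosse_aligned is a made-up name; the promise is only valid because main aligns the arrays to 16 bytes):
__attribute__((noinline)) void add_iv_nosse_aligned(int *a, int *b, int *out, int N) {
    a   = __builtin_assume_aligned(a, 16);   // tell the compiler the pointers are 16-byte aligned
    b   = __builtin_assume_aligned(b, 16);
    out = __builtin_assume_aligned(out, 16);
    for (int i = 0; i < N; i++) {
        out[i] = a[i] + b[i];
    }
}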
The manually-vectorized loop uses aligned loads to save front-end uops, but has more loop overhead from the loop counter.
.LBB1_7: # =>This Inner Loop Header: Depth=1
movdqa xmm0, xmmword ptr [rcx - 16]
paddd xmm0, xmmword ptr [rax - 16]
movdqu xmmword ptr [r11 - 16], xmm0
movdqa xmm0, xmmword ptr [rcx]
paddd xmm0, xmmword ptr [rax]
movdqu xmmword ptr [r11], xmm0
add r10, 2 # separate loop counter
add r11, 32 # 3 pointer increments
add rax, 32
add rcx, 32
cmp r9, r10 # compare the loop counter
jne .LBB1_7
So it's 11 fused-domain uops. It should be running faster than the auto-vectorized loop. Your timing method probably caused the problem.
(Unless mixing loads and stores is actually making it less optimal. The auto-vectorized loop did 4 loads and then 2 stores. Actually that might explain it. Your arrays are a multiple of 4kiB, and might all have the same relative alignment. So you might be getting 4k aliasing here, which means the CPU isn't sure that a store doesn't overlap a load. I think there's a performance counter you can check for that.)
See also Agner Fog's microarch guide (and instruction tables + optimization guide), and other links in the x86 tag wiki, especially Intel's optimization guide.
There's also some good SSE/SIMD beginner stuff in the sse tag wiki.
Problem
I am studying high performance matrix multiplication algorithms such as OpenBLAS or GotoBLAS and I am trying to reproduce some of the results. This question deals with the inner kernel of a matrix multiplication algorithm. Specifically, I am looking at computing C += AB, where A and B are 2x2 matrices of type double, at the peak speed of my CPU. There are two ways to do this. One way is to use SIMD intrinsics. The second way is to code directly in assembly using the SIMD registers.
What I have looked at so far
All the relevant papers, course webpages, many many SO Q&As dealing with the subject (too many to list), I have compiled OpenBLAS on my computer, looked through OpenBLAS, GotoBLAS, and BLIS source codes, Agner's manuals.
Hardware
My CPU is an Intel i5 - 540M. You can find the relevant CPUID information on cpu-world.com. The microarchitecture is Nehalem (westmere), so it can theoretically compute 4 double precision flops per core per cycle. I will be using just one core (no OpenMP), so with hyperthreading off and 4-step Intel Turbo Boost, I should be seeing a peak of ( 2.533 Ghz + 4*0.133 Ghz ) * ( 4 DP flops/core/cycle ) * ( 1 core ) = 12.27 DP Gflops. For reference, with both cores running at peak, Intel Turbo Boost gives a 2-step speed up and I should get a theoretical peak of 22.4 DP Gflops.
Setup
I declare my 2x2 matrices as double and initialize them with random entries as shown in the code snippet below.
srand(time(NULL));
const int n = 2;
double A[n*n];
double B[n*n];
double C[n*n];
double T[n*n];
for(int i = 0; i < n*n; i++){
A[i] = (double) rand()/RAND_MAX;
B[i] = (double) rand()/RAND_MAX;
C[i] = 0.0;
}
I compute a true answer using naive matrix-matrix multiplication (shown below), which allows me to check my result either visually or by computing the L2 norm of all the elements.
// "true" answer
for(int i = 0; i < n; i++)
for(int j = 0; j < n; j++)
for(int k = 0; k < n; k++)
T[i*n + j] += A[i*n + k]*B[k*n + j];
To run the code and get an estimate of the Gflops, I call each multiplication function once to warm up, and then execute it inside a for loop for maxiter times, making sure to zero the C matrix each time as I am computing C += AB. The for loop is placed between two clock() calls and this is used to estimate the Gflops. The code snippet below illustrates this part.
C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
mult2by2(A,B,C); //warmup
time1 = clock();
for(int i = 0; i < maxiter; i++){
mult2by2(A,B,C);
C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
}
time2 = clock() - time1;
time3 = (double)(time2)/CLOCKS_PER_SEC;
gflops = (double) (2.0*n*n*n)/time3/1.0e9*maxiter;
mult2by2(A,B,C); // to compute the norm against T
norm = L2norm(n,C,T);
SIMD code
My CPU supports 128-bit vectors, so I can fit 2 doubles in each vector. This is the main reason why I am doing 2x2 matrix multiplication in the inner kernel. The SIMD code computes an entire row of C at one time.
inline void
__attribute__ ((gnu_inline))
__attribute__ ((aligned(16))) mult2by2B(
const double* restrict A,
const double* restrict B,
double* restrict C
)
{
register __m128d xmm0, xmm1, xmm2, xmm3, xmm4;
xmm0 = _mm_load_pd(C);
xmm1 = _mm_load1_pd(A);
xmm2 = _mm_load_pd(B);
xmm3 = _mm_load1_pd(A + 1);
xmm4 = _mm_load_pd(B + 2);
xmm1 = _mm_mul_pd(xmm1,xmm2);
xmm2 = _mm_add_pd(xmm1,xmm0);
xmm1 = _mm_mul_pd(xmm3,xmm4);
xmm2 = _mm_add_pd(xmm1,xmm2);
_mm_store_pd(C,xmm2);
xmm0 = _mm_load_pd(C + 2);
xmm1 = _mm_load1_pd(A + 2);
xmm2 = _mm_load_pd(B);
xmm3 = _mm_load1_pd(A + 3);
//xmm4 = _mm_load_pd(B + 2);
xmm1 = _mm_mul_pd(xmm1,xmm2);
xmm2 = _mm_add_pd(xmm1,xmm0);
xmm1 = _mm_mul_pd(xmm3,xmm4);
xmm2 = _mm_add_pd(xmm1,xmm2);
_mm_store_pd(C + 2,xmm2);
}
Assembly (Intel Syntax)
My first attempt was to create a separate assembly routine for this part and call it from the main routine. However, it was extremely slow because I cannot inline extern functions. I wrote the assembly as inline assembly as shown below. It is identical to that which is produced by gcc -S -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -funroll-all-loops -fomit-frame-pointer -masm=intel. From what I understand of the Nehalem microarchitecture diagrams, this processor can perform SSE ADD, SSE MUL, and SSE MOV in parallel, which explains the interleaving of MUL, ADD, MOV instructions. You will notice the SIMD instructions above are in a different order because I had a different understanding from Agner Fog's manuals. Nevertheless, gcc is smart and the SIMD code above compiles to the assembly shown in the inline version.
inline void
__attribute__ ((gnu_inline))
__attribute__ ((aligned(16))) mult2by2A
(
const double* restrict A,
const double* restrict B,
double* restrict C
)
{
__asm__ __volatile__
(
"mov edx, %[A] \n\t"
"mov ecx, %[B] \n\t"
"mov eax, %[C] \n\t"
"movapd xmm3, XMMWORD PTR [ecx] \n\t"
"movapd xmm2, XMMWORD PTR [ecx+16] \n\t"
"movddup xmm1, QWORD PTR [edx] \n\t"
"mulpd xmm1, xmm3 \n\t"
"addpd xmm1, XMMWORD PTR [eax] \n\t"
"movddup xmm0, QWORD PTR [edx+8] \n\t"
"mulpd xmm0, xmm2 \n\t"
"addpd xmm0, xmm1 \n\t"
"movapd XMMWORD PTR [eax], xmm0 \n\t"
"movddup xmm4, QWORD PTR [edx+16] \n\t"
"mulpd xmm4, xmm3 \n\t"
"addpd xmm4, XMMWORD PTR [eax+16] \n\t"
"movddup xmm5, QWORD PTR [edx+24] \n\t"
"mulpd xmm5, xmm2 \n\t"
"addpd xmm5, xmm4 \n\t"
"movapd XMMWORD PTR [eax+16], xmm5 \n\t"
: // no outputs
: // inputs
[A] "m" (A),
[B] "m" (B),
[C] "m" (C)
: //register clobber
"memory",
"edx","ecx","eax",
"xmm0","xmm1","xmm2","xmm3","xmm4","xmm5"
);
}
Results
I compile my code with the following flags:
gcc -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -funroll-all-loops -fomit-frame-pointer -masm=intel
The results for maxiter = 1000000000 are below:
********** Inline ASM
L2 norm: 0.000000e+000, Avg. CPU time: 9.563000, Avg. Gflops: 1.673115
********** SIMD Version
L2 norm: 0.000000e+000, Avg. CPU time: 0.359000, Avg. Gflops: 44.568245
If I force the SIMD version to not be inlined with __attribute__ ((noinline)), the results are:
********** Inline ASM
L2 norm: 0.000000e+000, Avg. CPU time: 11.155000, Avg. Gflops: 1.434334
********** SIMD Version
L2 norm: 0.000000e+000, Avg. CPU time: 11.264000, Avg. Gflops: 1.420455
Questions
If both the inline ASM and SIMD implementations produce identical assembly output, why is the assembly version so much slower? It's as if the inline assembly did not get inlined, which is made evident by the second set of results showing identical performance for "inline" ASM versus "noinline" SIMD. The only explanation I can find is in Agner Fog Volume 2 page 6:
Compiled code may be faster than assembly code because compilers can make
inter-procedural optimization and whole-program optimization. The assembly
programmer usually has to make well-defined functions with a well-defined call
interface that obeys all calling conventions in order to make the code testable and
verifiable. This prevents many of the optimization methods that compilers use, such
as function inlining, register allocation, constant propagation, common subexpression
elimination across functions, scheduling across functions, etc. These
advantages can be obtained by using C++ code with intrinsic functions instead of
assembly code.
But the assembler output for both versions is exactly the same.
Why am I seeing 44 Gflops in the first set of results? This is way above the 12 Gflops peak I calculated, and is what I would expect if I was running both cores with single precision calculations.
EDIT 1
The comment says there might be dead code elimination. I can confirm that this is happening for the SIMD instructions. The -S output shows that the for loop for SIMD only zeros the C matrix. I can disable that by turning off compiler optimization with -O0. In that case, SIMD runs 3x slower than ASM, but ASM still runs at exactly the same speed. The norm is also nonzero now, but it's still OK at 10^-16. I also see that the inline ASM version is being inlined (the APP and NO_APP tags are present), but it is also being unrolled 8 times in the for loop. I think unrolling that many times will impact performance heavily, as I usually unroll loops 4 times. Anything more, in my experience, seems to degrade performance.
GCC is optimizing away your inline function using intrinsics, mult2by2B, due to the line
C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
Without that line it takes 2.9 seconds on the computer from Coliru
http://coliru.stacked-crooked.com/a/992304f5f672e257
And with the line it only takes 0.000001 seconds
http://coliru.stacked-crooked.com/a/9722c39bb6b8590a
You can also see this in the assembly. If you drop the code below into http://gcc.godbolt.org/ you will see that with that line of code it skips the function entirely.
However, when you inline the assembly GCC is NOT optimizing the function, mult2by2A, away (even though it inlines it). You can see this in the assembly as well.
#include <stdio.h>
#include <emmintrin.h> // SSE2
#include <omp.h>
inline void
__attribute__ ((gnu_inline))
__attribute__ ((aligned(16))) mult2by2B(
const double* __restrict A,
const double* __restrict B,
double* __restrict C
)
{
register __m128d xmm0, xmm1, xmm2, xmm3, xmm4;
xmm0 = _mm_load_pd(C);
xmm1 = _mm_load1_pd(A);
xmm2 = _mm_load_pd(B);
xmm3 = _mm_load1_pd(A + 1);
xmm4 = _mm_load_pd(B + 2);
xmm1 = _mm_mul_pd(xmm1,xmm2);
xmm2 = _mm_add_pd(xmm1,xmm0);
xmm1 = _mm_mul_pd(xmm3,xmm4);
xmm2 = _mm_add_pd(xmm1,xmm2);
_mm_store_pd(C,xmm2);
xmm0 = _mm_load_pd(C + 2);
xmm1 = _mm_load1_pd(A + 2);
xmm2 = _mm_load_pd(B);
xmm3 = _mm_load1_pd(A + 3);
//xmm4 = _mm_load_pd(B + 2);
xmm1 = _mm_mul_pd(xmm1,xmm2);
xmm2 = _mm_add_pd(xmm1,xmm0);
xmm1 = _mm_mul_pd(xmm3,xmm4);
xmm2 = _mm_add_pd(xmm1,xmm2);
_mm_store_pd(C + 2,xmm2);
}
int main() {
double A[4], B[4], C[4];
int maxiter = 10000000;
//int maxiter = 1000000000;
double dtime;
dtime = omp_get_wtime();
for(int i = 0; i < maxiter; i++){
mult2by2B(A,B,C);
C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
}
dtime = omp_get_wtime() - dtime;
printf("%f %f %f %f\n", C[0], C[1], C[2], C[3]);
//gflops = (double) (2.0*n*n*n)/time3/1.0e9*maxiter;
printf("time %f\n", dtime);
}
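If the goal is to time the intrinsics version without GCC deleting the work, one common workaround (my suggestion, not part of the original answer) is to force the compiler to treat C as used after every call, for example with an empty inline-asm "escape" helper:
// Tells the compiler that the memory behind p may be read/written in ways it can't see,
// so stores into it (here: the C matrix) can't be eliminated as dead.
static inline void escape(void *p) {
    __asm__ volatile("" : : "g"(p) : "memory");
}
// inside the timing loop:  mult2by2B(A, B, C); escape(C);  then zero C as before.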