In one of our last assignments in Computer Science this term we have to apply strength reduction to some code fragments. Most of them were straightforward, especially after looking at the compiler output. But there is one I can't solve, even with the help of the compiler.
Our profs gave us the following hint:
Hint: Inquire how IEEE 754 single-precision floating-point numbers are
represented in memory.
Here is the code snippet: (a is of type double*)
for (int i = 0; i < N; ++i) {
a[i] += i / 5.3;
}
At first I tried to look into the compiler output for this snippet on godbolt. I tried to compile it without any optimization (note: I copied only the relevant part of the for loop):
mov eax, DWORD PTR [rbp-4]
cdqe
lea rdx, [0+rax*8]
mov rax, QWORD PTR [rbp-16]
add rax, rdx
movsd xmm1, QWORD PTR [rax]
cvtsi2sd xmm0, DWORD PTR [rbp-4] //division relevant
movsd xmm2, QWORD PTR .LC0[rip] //division relevant
divsd xmm0, xmm2 //division relevant
mov eax, DWORD PTR [rbp-4]
cdqe
lea rdx, [0+rax*8]
mov rax, QWORD PTR [rbp-16]
add rax, rdx
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
and with -O3:
.L2:
pshufd xmm0, xmm2, 238 //division relevant
cvtdq2pd xmm1, xmm2 //division relevant
movupd xmm6, XMMWORD PTR [rax]
add rax, 32
cvtdq2pd xmm0, xmm0 //division relevant
divpd xmm1, xmm3 //division relevant
movupd xmm5, XMMWORD PTR [rax-16]
paddd xmm2, xmm4
divpd xmm0, xmm3 //division relevant
addpd xmm1, xmm6
movups XMMWORD PTR [rax-32], xmm1
addpd xmm0, xmm5
movups XMMWORD PTR [rax-16], xmm0
cmp rax, rbp
jne .L2
I commented the division part of the assembly code. But this output does not help me understand how to apply strength reduction to the snippet. (Maybe there are too many optimizations going on to fully understand the output.)
Second, I tried to understand the bit representation of the float part 5.3.
Which is:
0 |10000001|01010011001100110011010
Sign|Exponent|Mantissa
But this does not help me either.
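For reference, the three fields shown above can be reproduced with a few lines of C. This is only a minimal sketch: the helper names (float_bits, float_sign, float_exponent, float_mantissa) are made up for illustration, and it assumes float is IEEE 754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: raw 32-bit pattern of a float (assumes IEEE 754 binary32). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* aliasing-safe type pun */
    return u;
}

/* The three fields: 1 sign bit, 8 exponent bits (biased by 127), 23 mantissa bits. */
static uint32_t float_sign(float f)     { return float_bits(f) >> 31; }
static uint32_t float_exponent(float f) { return (float_bits(f) >> 23) & 0xFFu; }
static uint32_t float_mantissa(float f) { return float_bits(f) & 0x7FFFFFu; }
```

For 5.3f the biased exponent is 129 (binary 10000001), i.e. 5.3 = 1.325 * 2^2, which matches the pattern listed above.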
If we adopt Wikipedia's definition that
strength reduction is a compiler optimization where expensive operations are replaced with equivalent but less expensive operations
then we can apply strength reduction here by converting the expensive floating-point division into a floating-point multiply plus two floating-point multiply-adds (FMAs). Assuming that double is mapped to IEEE-754 binary64, the default rounding mode for floating-point computation is round-to-nearest-or-even, and that int is a 32-bit type, we can prove the transformation correct by simple exhaustive test:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <math.h>
int main (void)
{
const double rcp_5p3 = 1.0 / 5.3; // 0x1.826a439f656f2p-3
int i = INT_MAX;
do {
double ref = i / 5.3;
double res = fma (fma (-5.3, i * rcp_5p3, i), rcp_5p3, i * rcp_5p3);
if (res != ref) {
printf ("error: i=%2d res=%23.13a ref=%23.13a\n", i, res, ref);
return EXIT_FAILURE;
}
i--;
} while (i >= 0);
return EXIT_SUCCESS;
}
Most modern instances of common processor architectures like x86-64 and ARM64 have hardware support for FMA, such that fma() can be mapped directly to the appropriate hardware instruction. This should be confirmed by looking at the disassembly of the generated binary. Where hardware support for FMA is lacking, the transformation obviously should not be applied, as software implementations of fma() are slow and sometimes functionally incorrect.
The basic idea here is that mathematically, division is equivalent to multiplication with the reciprocal. However, that is not necessarily true for finite-precision floating-point arithmetic. The code above tries to improve the likelihood of bit-accurate computation by determining the error in the naive approach with the help of FMA and applying a correction where necessary. For background including literature references see this earlier question.
To the best of my knowledge, there is not yet a general mathematically proven algorithm to determine for which divisors paired with which dividends the above transformation is safe (that is, delivers bit-accurate results), which is why an exhaustive test is strictly necessary to show that the transformation is valid.
In comments, Pascal Cuoq points out that there is an alternative algorithm to potentially strength-reduce floating-point division with a compile-time constant divisor, by precomputing the reciprocal of the divisor to more than native precision, specifically as a double-double. For background see N. Brisebarre and J.-M. Muller, "Correctly rounded multiplication by arbitrary precision constant", IEEE Transactions on Computers, 57(2): 162-174, February 2008, which also provides guidance on how to determine whether that transformation is safe for any particular constant. Since the present case is simple, I again used an exhaustive test to show it is safe. Where applicable, this will reduce the division down to one FMA plus a multiply:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <math.h>
int main (void)
{
const double rcp_5p3_hi = 1.8867924528301888e-1; // 0x1.826a439f656f2p-3
const double rcp_5p3_lo = -7.2921377017921457e-18;// -0x1.0d084b1883f6e0p-57
int i = INT_MAX;
do {
double ref = i / 5.3;
double res = fma (i, rcp_5p3_hi, i * rcp_5p3_lo);
if (res != ref) {
printf ("i=%2d res=%23.13a ref=%23.13a\n", i, res, ref);
return EXIT_FAILURE;
}
i--;
} while (i >= 0);
return EXIT_SUCCESS;
}
To cover another aspect: since all values of type int are exactly representable as double (but not as float), it is possible to get rid of int-to-double conversion that happens in the loop when evaluating i / 5.3 by introducing a floating-point variable that counts from 0.0 to N:
double fp_i = 0;
for (int i = 0; i < N; fp_i += 1, i++)
a[i] += fp_i / 5.3;
However, this kills autovectorization and introduces a chain of dependent floating-point additions. Floating-point addition typically has a latency of 3 or 4 cycles, so the last iteration will retire after at least (N-1)*3 cycles, even if the CPU could dispatch the instructions in the loop faster. Thankfully, floating-point division is not fully pipelined, and on x86 CPUs the interval at which a new division can start roughly matches or exceeds the latency of the addition, so the addition chain is not the bottleneck.
This leaves the problem of killed vectorization. It's possible to bring it back by manually unrolling the loop and introducing two independent chains, but with AVX you'd need four chains for full vectorization:
double fp_i0 = 0, fp_i1 = 1;
int i = 0;
for (; i+1 < N; fp_i0 += 2, fp_i1 += 2, i += 2) {
double t0 = a[i], t1 = a[i+1];
a[i] = t0 + fp_i0 / 5.3;
a[i+1] = t1 + fp_i1 / 5.3;
}
if (i < N)
a[i] += i / 5.3;
CAVEAT: After a few days I realized that this answer is incorrect in that it ignores the consequence of underflow (to subnormal or to zero) in the computation of o / 5.3. In this case, multiplying the result by a power of two is “exact” but does not produce the result that dividing the larger integer by 5.3 would have.
i / 5.3 only needs to be computed for odd values of i.
For even values of i, you can simply multiply by 2.0 the value of (i/2)/5.3, which was already computed earlier in the loop.
The remaining difficulty is to reorder the iterations in a way such that each index between 0 and N-1 is handled exactly once and the program does not need to record an arbitrary number of division results.
One way to achieve this is to iterate on all odd numbers o less than N, and after computing o / 5.3 in order to handle index o, to also handle all indexes of the form o * 2**p.
if (N > 0) {
a[0] += 0.0; // this is needed for strict IEEE 754 compliance lol
for (int o = 1; o < N; o += 2) {
double d = o / 5.3;
int i = o;
do {
a[i] += d;
i += i;
d += d;
} while (i < N);
}
}
Note: this does not use the provided hint “Inquire how IEEE 754 single-precision floating-point numbers are represented in memory”. I think I know pretty well how single-precision floating-point numbers are represented in memory, but I do not see how that is relevant, especially since there are no single-precision values or computations in the code to optimize. I think there is a mistake in the way the problem is expressed, but still the above is technically a partial answer to the question as phrased.
I also ignored overflow problems for values of N that come close to INT_MAX in the code snippet above, since the code is already complicated enough.
As an additional note, the above transformation only replaces one division out of two. It does this by making the code unvectorizable (and also less cache-friendly). In your question, gcc -O3 has already shown that automatic vectorization could be applied to the starting point that your professor suggested, and that is likely to be more beneficial than suppressing half the divisions can be. The only good thing about the transformation in this answer is that it is a sort of strength reduction, which your professor requested.
I may confirm by using nanobench. Today I don't feel clever and can't think of an easy way
I have an array, short arr[]={0x1234, 0x5432, 0x9090, 0xFEED};. I know I can use SIMD to compare all elements at once, using movemask+tzcnt to find the index of a match. However, since it's only 64 bits I was wondering if there's a faster way?
First I thought maybe I could build a 64-bit int by writing target|(target<<16)|(target<<32)|(target<<48), but then realized that neither an AND nor a SUB is the same as a compare, since the low 16 bits can affect the higher 16. Then I thought that instead of a plain loop I could write index=tzcnt((target==arr[0]?1:0) | ... | (target==arr[3]?8:0)).
Can anyone think of something more clever? I suspect using the ternary method would give me best results since it's branchless?
For SWAR compare-for-equality, the operation you want is XOR, which like SUB produces all-zero on equal inputs, but unlike SUB doesn't propagate carry sideways.
But then you need to detect a contiguous run of 16 zero bits. Unlike pcmpeqw, you'll have some zero bits in the other elements.
So it's probably about the same as https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord but with wider mask patterns to operate on 16-bit instead of 8-bit chunks.
There is yet a faster method: use hasless(v, 1), which is defined below; it works in 4 operations and requires no subsequent verification. It simplifies to
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
The subexpression (v - 0x01010101UL), evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.
This bithack was originally by Alan Mycroft in 1987.
So it could look like this (untested):
#include <stdint.h>
#include <string.h>
// returns 0 / non-zero status.
uint64_t hasmatch_16in64(uint16_t needle, const uint16_t haystack[4])
{
uint64_t vneedle = 0x0001000100010001ULL * needle; // broadcast
uint64_t vbuf;
memcpy(&vbuf, haystack, sizeof(vbuf)); // aliasing-safe unaligned load
//static_assert(sizeof(vbuf) == 4*sizeof(haystack[0]));
uint64_t match = vbuf ^ vneedle;
uint64_t any_zeros = (match - 0x0001000100010001ULL) & ~match & 0x8000800080008000ULL;
return any_zeros;
// unsigned matchpos = _tzcnt_u64(any_zeros) >> 4; // I think. (any_zeros is 64-bit, so use the u64 variant)
}
Godbolt with GCC and clang, also including a SIMD intrinsics version.
# gcc12.2 -O3 -march=x86-64-v3 -mtune=znver1
# x86-64-v3 is the Haswell/Zen1 baseline: AVX2+FMA+BMI2, but with tune=generic
# without tune=haswell or whatever, GCC uses shl/add /shl/add instead of imul, despite still needing the same constant
hasmatch_16in64:
movabs rax, 281479271743489 # 0x1000100010001
movzx edi, di # zero-extend to 64-bit
imul rdi, rax # vneedle
xor rdi, QWORD PTR [rsi] # match
# then the bithack
mov rdx, rdi
sub rdx, rax
andn rax, rdi, rdx # BMI1
movabs rdx, -9223231297218904064 # 0x8000800080008000
and rax, rdx
ret
Clang unfortunately adds 0xFFFEFFFEFFFEFFFF instead of reusing the multiplier constant, so it has three 64-bit immediate constants.
AArch64 can do repeating-pattern constants like this as immediates for bitwise ops, and doesn't have as convenient SIMD movemask, so this might be more of a win there, especially if you can guarantee alignment of your array of shorts.
Match position
If you need to know where the match is, that bithack leaves a 1 in the high bit of the lowest zero byte or u16. (The lowest-precedence / last operations are bitwise ANDs involving 0x80008000....) Elements above the first zero can be flagged falsely, because the borrow from the subtraction propagates into a neighbouring element whose value is 1, so only the lowest set bit is reliable.
So tzcnt(any_zeros) >> 4 goes from bit-index to u16-index, rounding down. E.g. if the second one is zero, the tzcnt result will be 31, and 31 >> 4 = 1.
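The position lookup could be sketched as follows. This is an illustration only: the function name indexof_16in64 is made up, __builtin_ctzll is the GCC/Clang builtin standing in for tzcnt, and the element order assumes a little-endian target (element 0 in the low 16 bits).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: index (0..3) of the first element equal to needle,
 * or -1 if there is none. Assumes a little-endian target. */
static int indexof_16in64(uint16_t needle, const uint16_t haystack[4])
{
    const uint64_t ones  = 0x0001000100010001ULL;
    const uint64_t highs = 0x8000800080008000ULL;
    uint64_t vbuf;
    memcpy(&vbuf, haystack, sizeof vbuf);          /* aliasing-safe load */
    uint64_t match = vbuf ^ (ones * needle);       /* zero element == match */
    uint64_t any_zeros = (match - ones) & ~match & highs;
    if (any_zeros == 0)
        return -1;                                 /* ctz of 0 is undefined */
    /* Only the lowest set bit is trusted; higher ones may be false positives. */
    return __builtin_ctzll(any_zeros) >> 4;
}
```

With the array from the question, a needle of 0x9090 should yield index 2, and a value that is not present yields -1.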
If that doesn't work, then yeah AVX2 or AVX-512 vpbroadcastw xmm0, edi / vmovq / vpcmpeqw / vpmovmskb / tzcnt will work well, too, with smaller code-size and fewer uops, but maybe higher latency. Or maybe less. (tzcnt on the vpmovmskb result gives a byte offset; right shift by 1 if you need the index of the short.)
Actually just SSE2 pshuflw can broadcast a word to the low qword of an XMM register. Same for MMX, which would actually allow a memory-source pcmpeqw mm0, [rsi] since it has no alignment requirement and is only 64-bit, not 128.
If you can use SIMD intrinsics, especially if you have efficient word broadcast from AVX2, definitely have a look at it.
#include <immintrin.h>
#include <stdint.h> // uint16_t
// note the unsigned function arg, not uint16_t;
// we only use the low 16, but GCC doesn't realize that and wastes an instruction in the non-AVX2 version
int hasmatch_SIMD(unsigned needle, const uint16_t haystack[4])
{
#ifdef __AVX2__ // or higher
__m128i vneedle = _mm_set1_epi16(needle);
#else
__m128i vneedle = _mm_cvtsi32_si128(needle); // movd
vneedle = _mm_shufflelo_epi16(vneedle, 0); // broadcast to low half
#endif
__m128i vbuf = _mm_loadl_epi64((const __m128i*)haystack); // alignment and aliasing safe
unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
//return _tzcnt_u32(mask) >> 1;
return mask;
}
# clang expects narrow integer args to already be zero- or sign-extended to 32
hasmatch_SIMD:
movd xmm0, edi
pshuflw xmm0, xmm0, 0 # xmm0 = xmm0[0,0,0,0,4,5,6,7]
movq xmm1, qword ptr [rsi] # xmm1 = mem[0],zero
pcmpeqw xmm1, xmm0
pmovmskb eax, xmm1
ret
AVX-512 gives us vpbroadcastw xmm0, edi, replacing vmovd + vpbroadcastw xmm,xmm or movd + pshuflw, saving a shuffle uop.
With AVX2, this is 5 single-uop instructions, vs. 7 (or 9 counting the constants) for the SWAR bithack. Or 6 or 8 not counting the zero-extension of the "needle". So SIMD is better for front-end throughput. (https://agner.org/optimize/ / https://uops.info/)
There are limits to which ports some of these instructions can run on (vs. the bithack instructions mostly being any integer ALU port), but presumably you're not doing this in a loop over many such 4-element arrays. Or else SIMD is an obvious win; checking two 4-element arrays at once in the low and high halves of a __m128i. So probably we do need to consider the front-end costs of setting up those constants.
I didn't add up the latencies; it's probably a bit higher even on Intel CPUs which generally have good latency between integer and SIMD units.
GCC unfortunately fails to optimize away the movzx edi, di from the SIMD version if compiled without AVX2; only clang realizes the upper 16 of _mm_cvtsi32_si128(needle) is discarded by the later shuffle. Maybe better to make the function arg unsigned, not explicitly a narrow 16-bit type.
Clang with -O2 or -O3 and GCC with -O3 compile a simple search loop into branchless instructions:
int indexOf(short target, short* arr) {
int index = -1;
for (int i = 0; i < 4; ++i) {
if (target == arr[i]) {
index = i;
}
}
return index;
}
Demo
I doubt you can get much better without SIMD. In other words, write simple and understandable code to help the compiler produce efficient code.
Side note: for some reason, neither Clang nor GCC uses conditional moves on this very similar code:
int indexOf(short target, short* arr) {
for (int i = 0; i < 4; ++i) {
if (target == arr[i]) {
return i;
}
}
return -1;
}
When running a sum loop over an array in Rust, I noticed a huge performance drop when CAPACITY >= 240. CAPACITY = 239 is about 80 times faster.
Is there special compilation optimization Rust is doing for "short" arrays?
Compiled with rustc -C opt-level=3.
use std::time::Instant;
const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;
fn main() {
let mut arr = [0; CAPACITY];
for i in 0..CAPACITY {
arr[i] = i;
}
let mut sum = 0;
let now = Instant::now();
for _ in 0..IN_LOOPS {
let mut s = 0;
for i in 0..arr.len() {
s += arr[i];
}
sum += s;
}
println!("sum:{} time:{:?}", sum, now.elapsed());
}
Summary: below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.
You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, so the length is multiplied by 8 bytes, assuming an x86-64 CPU). In this benchmark, one specific optimization, performed only for lengths below 240, is responsible for the huge speed difference. But let's start slowly:
(All code in this answer is compiled with -C opt-level=3)
pub fn foo() -> usize {
let arr = [0; 240];
let mut s = 0;
for i in 0..arr.len() {
s += arr[i];
}
s
}
This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. Here is a small part of the assembly:
movdqa xmm1, xmmword ptr [rsp + 32]
movdqa xmm0, xmmword ptr [rsp + 48]
paddq xmm1, xmmword ptr [rsp]
paddq xmm0, xmmword ptr [rsp + 16]
paddq xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq xmm0, xmmword ptr [rsp + 1840]
paddq xmm1, xmmword ptr [rsp + 1856]
paddq xmm0, xmmword ptr [rsp + 1872]
paddq xmm0, xmm1
pshufd xmm1, xmm0, 78
paddq xmm1, xmm0
This is what's called loop unrolling: LLVM pastes the loop body a bunch of times to avoid having to execute all those "loop management instructions", i.e. incrementing the loop variable, checking whether the loop has ended, and jumping back to the start of the loop.
In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that instruction-level parallelism of the CPU can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.
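To make the two-accumulator idea concrete, here is a minimal scalar sketch, written in C rather than Rust. The function name sum_two_acc is made up for illustration; it only mimics the shape of the generated code (two independent running sums combined at the end) and is not what LLVM literally emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: sum an array with two independent accumulators,
 * so consecutive additions do not all depend on each other. */
static uint64_t sum_two_acc(const uint64_t *arr, size_t n)
{
    uint64_t acc0 = 0, acc1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += arr[i];       /* dependency chain 0 */
        acc1 += arr[i + 1];   /* dependency chain 1 */
    }
    if (i < n)                /* leftover element when n is odd */
        acc0 += arr[i];
    return acc0 + acc1;       /* combine the chains at the end */
}
```

In the real assembly each "accumulator" is a 16-byte register holding two 64-bit lanes, so four partial sums are in flight before the final horizontal add.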
Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.
LLVM does unroll small loops some when it's not fully unrolling, and still uses multiple accumulators. So usually, front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.
But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts the one loop inside another one:
const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;
pub fn foo() -> usize {
let mut arr = [0; CAPACITY];
for i in 0..CAPACITY {
arr[i] = i;
}
let mut sum = 0;
for _ in 0..IN_LOOPS {
let mut s = 0;
for i in 0..arr.len() {
s += arr[i];
}
sum += s;
}
sum
}
(On Godbolt Compiler Explorer)
The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite some code just for initializing, which we will ignore.) For 239, however, it looks very different! We see that the initializing loop and the inner loop got unrolled: so far so expected.
The important difference is that for 239, LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!
First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I commented to explain the assembly; the comments with * are especially important):
; at the start of the function, `rbx` was set to 0
movq rax, xmm1 ; result of SIMD summing up stored in `rax`
add rax, 711 ; add up missing terms from loop unrolling
mov ecx, 500000 ; * init loop variable outer loop
.LBB0_1:
add rbx, rax ; * rbx += rax
add rcx, -1 ; * decrement loop variable
jne .LBB0_1 ; * if loop variable != 0 jump to LBB0_1
mov rax, rbx ; move rbx (the sum) back to rax
; two unimportant instructions omitted
ret ; the return value is stored in `rax`
As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have run, and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.
This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.
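The effect of that transformation can be sketched at source level. The following is a hypothetical transliteration into C (the names fill, sum_nested and sum_hoisted are invented here), showing the benchmark as written next to what the CAPACITY = 239 assembly effectively computes.

```c
#include <stdint.h>

enum { CAPACITY = 239, IN_LOOPS = 500000 };

/* Fill arr with 0, 1, ..., CAPACITY-1, like the benchmark's init loop. */
static void fill(uint64_t arr[CAPACITY])
{
    for (int i = 0; i < CAPACITY; ++i)
        arr[i] = (uint64_t)i;
}

/* As written: the inner sum is recomputed on every outer iteration,
 * roughly CAPACITY * IN_LOOPS additions. */
static uint64_t sum_nested(void)
{
    uint64_t arr[CAPACITY];
    fill(arr);
    uint64_t sum = 0;
    for (int rep = 0; rep < IN_LOOPS; ++rep) {
        uint64_t s = 0;
        for (int i = 0; i < CAPACITY; ++i)
            s += arr[i];
        sum += s;
    }
    return sum;
}

/* After hoisting: the inner sum is computed once, then added IN_LOOPS times,
 * roughly CAPACITY + IN_LOOPS additions. */
static uint64_t sum_hoisted(void)
{
    uint64_t arr[CAPACITY];
    fill(arr);
    uint64_t s = 0;
    for (int i = 0; i < CAPACITY; ++i)
        s += arr[i];
    uint64_t sum = 0;
    for (int rep = 0; rep < IN_LOOPS; ++rep)
        sum += s;
    return sum;
}
```

Both functions return the same value; only the amount of work differs.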
An additional note: can you do anything about this? Not really. LLVM has to have such magic thresholds as without them LLVM-optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even factor 2 in these cases. So no need to worry about real use cases.
As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array. And changing this in the second example does not lead to any notable differences in emitted assembly. You should use short and idiomatic versions unless you measured that it hurts performance.
In addition to Lukas' answer, if you want to use an iterator, try this:
const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;
pub fn bar() -> usize {
(0..CAPACITY).sum::<usize>() * IN_LOOPS
}
Thanks @Chris Morgan for the suggestion about range pattern.
The optimized assembly is quite good:
example::bar:
movabs rax, 14340000000
ret
Why is my SIMD vector4 length function 3x slower than a naive vector length method?
SIMD vector4 length function:
__extern_always_inline float vec4_len(const float *v) {
__m128 vec1 = _mm_load_ps(v);
__m128 xmm1 = _mm_mul_ps(vec1, vec1);
__m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
__m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
return sqrtf(_mm_cvtss_f32(xmm3));
}
Naive implementation:
sqrtf(V[0] * V[0] + V[1] * V[1] + V[2] * V[2] + V[3] * V[3])
The SIMD version took 16110ms to iterate 1000000000 times. The naive version was ~3 times faster, it takes only 4746ms.
#include <math.h>
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>
static float vec4_len(const float *v) {
__m128 vec1 = _mm_load_ps(v);
__m128 xmm1 = _mm_mul_ps(vec1, vec1);
__m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
__m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
return sqrtf(_mm_cvtss_f32(xmm3));
}
int main() {
float A[4] __attribute__((aligned(16))) = {3, 4, 0, 0};
struct timespec t0 = {};
clock_gettime(CLOCK_MONOTONIC, &t0);
double sum_len = 0;
for (uint64_t k = 0; k < 1000000000; ++k) {
A[3] = k;
sum_len += vec4_len(A);
// sum_len += sqrtf(A[0] * A[0] + A[1] * A[1] + A[2] * A[2] + A[3] * A[3]);
}
struct timespec t1 = {};
clock_gettime(CLOCK_MONOTONIC, &t1);
fprintf(stdout, "%f\n", sum_len);
fprintf(stdout, "%ldms\n", (((t1.tv_sec - t0.tv_sec) * 1000000000) + (t1.tv_nsec - t0.tv_nsec)) / 1000000);
return 0;
}
I run with the following command on an Intel(R) Core(TM) i7-8550U CPU. First with the vec4_len version then with the plain C.
I compile with GCC (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0:
gcc -Wall -Wextra -O3 -msse -msse3 sse.c -lm && ./a.out
SSE version output:
499999999500000128.000000
13458ms
Plain C version output:
499999999500000128.000000
4441ms
The most obvious problem is using an inefficient dot-product (with haddps which costs 2x shuffle uops + 1x add uop) instead of shuffle + add. See Fastest way to do horizontal float vector sum on x86 for what to do after _mm_mul_ps that doesn't suck as much. But still this is just not something x86 can do very efficiently.
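As a rough illustration of "shuffle + add" in place of haddps, here is a sketch using only SSE1 intrinsics. The function name vec4_len_shuffle is made up, _mm_loadu_ps is used so the input need not be aligned, and _mm_sqrt_ss replaces the sqrtf call from the question to avoid the errno path. It has not been benchmarked.

```c
#include <xmmintrin.h>   /* SSE1 */

/* Hypothetical variant of vec4_len: horizontal sum via shuffle + add. */
static float vec4_len_shuffle(const float *v)
{
    __m128 x    = _mm_loadu_ps(v);
    __m128 sq   = _mm_mul_ps(x, x);                         /* [a, b, c, d] */
    __m128 shuf = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 3, 0, 1)); /* [b, a, d, c] */
    __m128 sums = _mm_add_ps(sq, shuf);                     /* [a+b, a+b, c+d, c+d] */
    shuf        = _mm_movehl_ps(shuf, sums);                /* low lane = c+d */
    sums        = _mm_add_ss(sums, shuf);                   /* low lane = a+b+c+d */
    return _mm_cvtss_f32(_mm_sqrt_ss(sums));
}
```

For the test vector {3, 4, 0, 0} from the question this returns 5.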
But anyway, the real problem is your benchmark loop.
A[3] = k; and then using _mm_load_ps(A) creates a store-forwarding stall, if it compiles naively instead of to a vector shuffle. A store + reload can be efficiently forwarded with ~5 cycles of latency if the load only loads data from a single store instruction, and no data outside that. Otherwise it has to do a slower scan of the whole store buffer to assemble bytes. This adds about 10 cycles of latency to the store-forwarding.
I'm not sure how much impact this has on throughput, but it could be enough to stop out-of-order exec from overlapping enough loop iterations to hide the latency and bottleneck only on sqrtss or shuffle throughput.
(Your Coffee Lake CPU has 1 per 3 cycle sqrtss throughput, so surprisingly SQRT throughput is not your bottleneck; see footnote 1. Instead it will be shuffle throughput or something else.)
See Agner Fog's microarch guide and/or optimization manual.
What does "store-buffer forwarding" mean in the Intel developer's manual?
How does store to load forwarding happens in case of unaligned memory access?
Can modern x86 implementations store-forward from more than one prior store?
Why would a compiler generate this assembly? quotes Intel's optimization manual re: store forwarding. (In that question, an old gcc version stored the 2 dword halves of an 8-byte struct separately, then copied the struct with a qword load/store. Super braindead.)
Plus you're biasing this even more against SSE by letting the compiler hoist the computation of V[0] * V[0] + V[1] * V[1] + V[2] * V[2] out of the loop.
That part of the expression is loop-invariant, so the compiler only has to do (float)k squared, add, and a scalar sqrt every loop iteration. (And convert that to double to add to your accumulator).
(@StaceyGirl's deleted answer pointed this out; looking over the code of the inner loops in it was a great start on writing this answer.)
Extra inefficiency in A[3] = k in the vector version
GCC9.1's inner loop from Kamil's Godbolt link looks terrible, and seems to include a loop-carried store/reload to merge a new A[3] into the 8-byte A[2..3] pair, further limiting the CPU's ability to overlap multiple iterations.
I'm not sure why gcc thought this was a good idea. It would maybe help on CPUs that split vector loads into 8-byte halves (like Pentium M or Bobcat) to avoid store-forwarding stalls. But that's not a sane tuning for "generic" modern x86-64 CPUs.
.L18:
pxor xmm4, xmm4
mov rdx, QWORD PTR [rsp+8] ; reload A[2..3]
cvtsi2ss xmm4, rbx
mov edx, edx ; truncate RDX to 32-bit
movd eax, xmm4 ; float bit-pattern of (float)k
sal rax, 32
or rdx, rax ; merge the float bit-pattern into A[3]
mov QWORD PTR [rsp+8], rdx ; store A[2..3] again
movaps xmm0, XMMWORD PTR [rsp] ; vector load: store-forwarding stall
mulps xmm0, xmm0
haddps xmm0, xmm0
haddps xmm0, xmm0
ucomiss xmm3, xmm0
movaps xmm1, xmm0
sqrtss xmm1, xmm1
ja .L21 ; call sqrtf to set errno if needed; flags set by ucomiss.
.L17:
add rbx, 1
cvtss2sd xmm1, xmm1
addsd xmm2, xmm1 ; total += (double)sqrtf
cmp rbx, 1000000000
jne .L18 ; }while(k<1000000000);
This insanity isn't present in the scalar version.
Either way, gcc did manage to avoid the inefficiency of a full uint64_t -> float conversion (which x86 doesn't have in hardware until AVX512). It was presumably able to prove that using a signed 64-bit -> float conversion would always work because the high bit can't be set.
Footnote 1: But sqrtps has the same 1 per 3 cycle throughput as scalar, so you're only getting 1/4 of your CPU's sqrt throughput capability by doing 1 vector at a time horizontally, instead of doing 4 lengths for 4 vectors in parallel.
I am trying to optimize a search through a very short sorted array of doubles to locate a bucket a given value belongs to. Assuming the size of the array is 8 doubles, I have come up with the following sequence of AVX intrinsics:
_data = _mm256_load_pd(array);
temp = _mm256_movemask_pd(_mm256_cmp_pd(_data, _value, _CMP_LT_OQ));
pos = _mm_popcnt_u32(temp);
_data = _mm256_load_pd(array+4);
temp = _mm256_movemask_pd(_mm256_cmp_pd(_data, _value, _CMP_LT_OQ));
pos += _mm_popcnt_u32(temp);
To my surprise (I do not have the instruction latency specs in my head...), it turned out that gcc generates faster code for the following C loop:
for(i=0; i<7; ++i) if(array[i+1]>=value) break;
This loop compiles into what I found to be very efficient code:
lea ecx, [rax+1]
vmovsd xmm1, QWORD PTR [rdx+rcx*8]
vucomisd xmm1, xmm0
jae .L7
lea ecx, [rax+2]
vmovsd xmm1, QWORD PTR [rdx+rcx*8]
vucomisd xmm1, xmm0
jae .L8
[... repeat for all elements of array]
so it takes 4 instructions to check 1 bucket (lea, vmovsd, vucomisd, jae). Assuming the value is uniformly spread, on average I will have to check ~3.5 buckets per value. Apparently, this is enough to outperform the AVX code listed earlier.
Now, in a general case the array may of course be larger than 8 elements. If I code a C loop like this:
for(i=0; i<n-1; i++) if(array[i+1]>=value) break;
I get the following instruction sequence for the loop body:
.L76:
mov eax, edx
.L67:
cmp eax, esi
jae .L77
lea edx, [rax+1]
mov ecx, edx
vmovsd xmm1, QWORD PTR [rdi+rcx*8]
vucomisd xmm1, xmm0
jb .L76
I can tell gcc to unroll the loop, but the point is that the number of instructions per element is larger than in the case of the loop with constant bounds, and the code is slower. Also, I do not understand the reason behind using an additional rcx register for addressing in vmovsd.
I can manually modify the assembly for the loop to look something like in the first example, and it does work faster:
.L76:
cmp edx, esi # eax -> edx
jae .L77
lea edx, [rdx+1] # rax -> rdx
vmovsd xmm1, QWORD PTR [rdi+rdx*8]
vucomisd xmm1, xmm0
jb .L76
but I can not seem to make gcc do it. And I know it can - the asm generated in the first example is OK.
Do you have any ideas how to do it otherwise than using inline asm? Or even better - can you suggest a faster implementation of the search?
Not really an answer, but there's no room in the comments for this.
I tested the AVX function against a simple C implementation and got completely different results.
I tested on Windows 7 x64 not Linux but the generated code was very similar.
How the test went:
1) I disabled the CPU's SpeedStep.
2) Within main() I raised the process priority and thread priority to the max (realtime).
3) I ran 10M calls to the tested function to heat up the CPU - activate turbo.
4) I called Sleep(0) to avoid a context switch
5) I called __rdtscp to start measurement
6) in a loop I called either the AVX find index function or the simple C version - like you did. the other implementation was commented out and not used. Loop size was 10M calls.
7) I called __rdtscp again to finish the benchmark.
8) I printed ticks/iterations. to get the average tick count for a call
Note: I declared both 'find index' functions as inline and I confirmed in the disassembly that they got inlined.
The AVX function and the C function you described are not identical: the C function returns a zero-based index and the AVX function returns a 1-based index.
On my system, it took the AVX function 1.1 cycles per iteration and the C function took 4.4 cycles per iteration.
I couldn't force the MSVC compiler to use more than ymm registers :(
Array used:
double A[8] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 };
Results (avg. ticks/iter):
value = 0.3 (index = 2): AVX: 1.1 | C: 4.4
value = 0.5 (index = 3): AVX: 1.1 | C: 11.1
value = 0.9 (index = 7): AVX: 1.1 | C: 18.1
If the AVX function is corrected to return pos-1, then it will be 50% slower.
You can see that the AVX function works in constant time while the trivial C loop function performance depends on the index you're looking for.
Timing with clock() and running 100M yields similar results, AVX is almost x4 faster for the first test.
Also note that running longer tests reveal different results, but every time AVX holds a similar advantage.
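For comparison, the constant-time behaviour of the AVX version can also be had in plain C by counting instead of breaking out of the loop. This is a sketch with an invented name (count_less); it mirrors the compare-less-than plus popcount logic of the intrinsics and returns the same 1-based style of result, but it was not part of the measurements above.

```c
#include <stddef.h>

/* Hypothetical branchless counterpart of the AVX code:
 * counts how many elements are strictly less than value. */
static unsigned count_less(const double *array, size_t n, double value)
{
    unsigned pos = 0;
    for (size_t i = 0; i < n; ++i)
        pos += (array[i] < value);   /* comparison yields 0 or 1 */
    return pos;
}
```

Because there is no early exit, the work does not depend on where the value falls, and compilers are free to vectorize the loop for a fixed n.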
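You can try integer comparison. For non-negative doubles, comparison is equivalent to int64_t comparison of the same bits, with the exception of NaNs (negative values are stored as sign-magnitude, so their integer order is reversed). It could turn out faster: the CPU has more integer execution units than SIMD ones. Just pass the double* and receive it as an int64_t* in the function argument.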
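A small sketch of that idea, with made-up helper names (double_as_i64, less_via_int) and memcpy used for the reinterpretation so it stays aliasing-safe. It assumes IEEE 754 binary64 doubles and is only valid as an ordering for non-negative, non-NaN inputs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the bit pattern of a double as a signed 64-bit integer. */
static int64_t double_as_i64(double d)
{
    int64_t i;
    memcpy(&i, &d, sizeof i);
    return i;
}

/* Integer comparison of the bit patterns.
 * Matches a < b only for non-negative, non-NaN inputs. */
static int less_via_int(double a, double b)
{
    return double_as_i64(a) < double_as_i64(b);
}
```

For negative inputs the result is inverted (the bits of -1.0 compare as less than the bits of -2.0), so a sorted array that may contain negative values needs extra handling.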
I have just started using SSE and I am confused how to get the maximum integer value (max) of a __m128i. For instance:
__m128i t = _mm_setr_epi32(0,1,2,3);
// max(t) = 3;
Searching around led me to MAXPS instruction but I can't seem to find how to use that with "xmmintrin.h".
Also, is there any documentation for "xmmintrin.h" that you would recommend, rather than looking into the header file itself?
In case anyone cares and since intrinsics seem to be the way to go these days here is a solution in terms of intrinsics.
int horizontal_max_Vec4i(__m128i x) {
__m128i max1 = _mm_shuffle_epi32(x, _MM_SHUFFLE(0,0,3,2));
__m128i max2 = _mm_max_epi32(x,max1);
__m128i max3 = _mm_shuffle_epi32(max2, _MM_SHUFFLE(0,0,0,1));
__m128i max4 = _mm_max_epi32(max2,max3);
return _mm_cvtsi128_si32(max4);
}
I don't know if that's any better than this:
int horizontal_max_Vec4i(__m128i x) {
int result[4] __attribute__((aligned(16))) = {0};
_mm_store_si128((__m128i *) result, x);
return max(max(max(result[0], result[1]), result[2]), result[3]);
}
If you find yourself needing to do horizontal operations on vectors, especially if it's inside an inner loop, then it's usually a sign that you are approaching your SIMD implementation in the wrong way. SIMD likes to operate element-wise on vectors - "vertically" if you like, not horizontally.
As for documentation, there is a very useful reference on intel.com which contains all the opcodes and intrinsics for everything from MMX through the various flavours of SSE all the way up to AVX and AVX-512.
According to this page, there is no horizontal max, and you need to test the elements vertically:
movhlps xmm1,xmm0 ; Move top two floats to lower part of xmm1
maxps xmm0,xmm1 ; Get the maximum of the two sets of floats
pshufd xmm1,xmm0,0x55 ; Move second float to lower part of xmm1
maxps xmm0,xmm1 ; Get the maximum of the two remaining floats
Conversely, getting the minimum:
movhlps xmm1,xmm0
minps xmm0,xmm1
pshufd xmm1,xmm0,0x55
minps xmm0,xmm1
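The same shuffle-and-max sequence can be written with intrinsics for a vector of four floats. This is a sketch using SSE1 only; the function name horizontal_max_ps is made up, and _mm_shuffle_ps stands in for the pshufd of the assembly listing.

```c
#include <xmmintrin.h>   /* SSE1 */

/* Hypothetical helper: maximum of the four floats in v. */
static float horizontal_max_ps(__m128 v)
{
    __m128 hi = _mm_movehl_ps(v, v);          /* [v2, v3, v2, v3] */
    __m128 m  = _mm_max_ps(v, hi);            /* [max(v0,v2), max(v1,v3), ..] */
    __m128 s  = _mm_shuffle_ps(m, m, 0x55);   /* broadcast element 1 */
    m = _mm_max_ss(m, s);                     /* low lane = max of all four */
    return _mm_cvtss_f32(m);
}
```

Swapping _mm_max_ps and _mm_max_ss for _mm_min_ps and _mm_min_ss gives the minimum in the same way.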
There is no Horizontal Maximum opcode in SSE (at least up until the point where I stopped keeping track of new SSE instructions).
So you are stuck doing some shuffling. What you end up with is...
movhlps %xmm0, %xmm1 # Move top two floats to lower part of %xmm1
maxps %xmm1, %xmm0 # Get maximum of sets of two floats
pshufd $0x55, %xmm0, %xmm1 # Move second float to lower part of %xmm1
maxps %xmm1, %xmm0 # Get maximum of all four floats originally in %xmm0
http://locklessinc.com/articles/instruction_wishlist/
MSDN has the intrinsic and macro function mappings documented
http://msdn.microsoft.com/en-us/library/t467de55.aspx