Getting max value in a __m128i vector with SSE?

I have just started using SSE and I am confused how to get the maximum integer value (max) of a __m128i. For instance:
__m128i t = _mm_setr_epi32(0,1,2,3);
// max(t) = 3;
Searching around led me to the MAXPS instruction, but I can't seem to find how to use that with "xmmintrin.h".
Also, is there any documentation for "xmmintrin.h" that you would recommend, rather than looking into the header file itself?

In case anyone cares and since intrinsics seem to be the way to go these days here is a solution in terms of intrinsics.
int horizontal_max_Vec4i(__m128i x) {
    __m128i max1 = _mm_shuffle_epi32(x, _MM_SHUFFLE(0,0,3,2));    // move high 64 bits down
    __m128i max2 = _mm_max_epi32(x, max1);                        // SSE4.1 packed signed max
    __m128i max3 = _mm_shuffle_epi32(max2, _MM_SHUFFLE(0,0,0,1)); // move element 1 down
    __m128i max4 = _mm_max_epi32(max2, max3);
    return _mm_cvtsi128_si32(max4);
}
I don't know if that's any better than this:
int horizontal_max_Vec4i(__m128i x) {
    int result[4] __attribute__((aligned(16))) = {0};
    _mm_store_si128((__m128i *) result, x);
    // assumes a max() macro or std::max is in scope
    return max(max(max(result[0], result[1]), result[2]), result[3]);
}

If you find yourself needing to do horizontal operations on vectors, especially if it's inside an inner loop, then it's usually a sign that you are approaching your SIMD implementation in the wrong way. SIMD likes to operate element-wise on vectors - "vertically" if you like, not horizontally.
As for documentation, there is a very useful reference on intel.com which contains all the opcodes and intrinsics for everything from MMX through the various flavours of SSE all the way up to AVX and AVX-512.

According to this page, there is no horizontal max, and you need to test the elements vertically:
movhlps xmm1,xmm0 ; Move top two floats to lower part of xmm1
maxps xmm0,xmm1 ; Get the maximum of the two sets of floats
pshufd xmm1,xmm0,$55 ; Move second float to lower part of xmm1
maxps xmm0,xmm1 ; Get the maximum of the two remaining floats
Conversely, getting the minimum:
movhlps xmm1,xmm0
minps xmm0,xmm1
pshufd xmm1,xmm0,$55
minps xmm0,xmm1
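With intrinsics, the same float max reduction could be written roughly as follows (a hedged, untested translation of the assembly above; the function name is mine):
#include <xmmintrin.h>  // SSE

// Horizontal max of 4 floats, mirroring the movhlps/maxps/pshufd sequence above.
static inline float horizontal_max_ps(__m128 v)
{
    __m128 hi  = _mm_movehl_ps(v, v);                            // top two floats moved down
    __m128 m   = _mm_max_ps(v, hi);                              // max of the two pairs
    __m128 swp = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1));  // broadcast element 1
    return _mm_cvtss_f32(_mm_max_ps(m, swp));                    // max of the remaining two
}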

There is no Horizontal Maximum opcode in SSE (at least up until the point where I stopped keeping track of new SSE instructions).
So you are stuck doing some shuffling. What you end up with is...
movhlps %xmm0, %xmm1 # Move top two floats to lower part of %xmm1
maxps %xmm1, %xmm0 # Get maximum of the two sets of two floats
pshufd $0x55, %xmm0, %xmm1 # Move second float to lower part of %xmm1
maxps %xmm1, %xmm0 # Get maximum of all four floats originally in %xmm0
http://locklessinc.com/articles/instruction_wishlist/
MSDN has the intrinsic and macro function mappings documented
http://msdn.microsoft.com/en-us/library/t467de55.aspx

Related

Strength reduction on floating point division by hand

In one of our last assignments in Computer Science this term we have to apply strength reduction to some code fragments. Most of them were straightforward, especially when looking at the compiler output. But one of them I was not able to solve, even with the help of the compiler.
Our profs gave us the following hint:
Hint: Inquire how IEEE 754 single-precision floating-point numbers are
represented in memory.
Here is the code snippet: (a is of type double*)
for (int i = 0; i < N; ++i) {
    a[i] += i / 5.3;
}
At first I tried to look at the compiler output for this snippet on Godbolt. I tried to compile it without any optimization: (note: I copied only the relevant part of the for loop)
mov eax, DWORD PTR [rbp-4]
cdqe
lea rdx, [0+rax*8]
mov rax, QWORD PTR [rbp-16]
add rax, rdx
movsd xmm1, QWORD PTR [rax]
cvtsi2sd xmm0, DWORD PTR [rbp-4] //division relevant
movsd xmm2, QWORD PTR .LC0[rip] //division relevant
divsd xmm0, xmm2 //division relevant
mov eax, DWORD PTR [rbp-4]
cdqe
lea rdx, [0+rax*8]
mov rax, QWORD PTR [rbp-16]
add rax, rdx
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
and with -O3:
.L2:
pshufd xmm0, xmm2, 238 //division relevant
cvtdq2pd xmm1, xmm2 //division relevant
movupd xmm6, XMMWORD PTR [rax]
add rax, 32
cvtdq2pd xmm0, xmm0 //division relevant
divpd xmm1, xmm3 //division relevant
movupd xmm5, XMMWORD PTR [rax-16]
paddd xmm2, xmm4
divpd xmm0, xmm3 //division relevant
addpd xmm1, xmm6
movups XMMWORD PTR [rax-32], xmm1
addpd xmm0, xmm5
movups XMMWORD PTR [rax-16], xmm0
cmp rax, rbp
jne .L2
I commented the division part of the assembly code. But this output does not help me understanding how to apply strength reduction on the snippet. (Maybe there are too many optimizations going on to fully understand the output)
Second, I tried to understand the bit representation of the float part 5.3.
Which is:
0 |10000001|01010011001100110011010
Sign|Exponent|Mantissa
But this does not help me either.
If we adopt Wikipedia's definition that
strength reduction is a compiler optimization where expensive operations are replaced with equivalent but less expensive operations
then we can apply strength reduction here by converting the expensive floating-point division into a floating-point multiply plus two floating-point multiply-adds (FMAs). Assuming that double is mapped to IEEE-754 binary64, the default rounding mode for floating-point computation is round-to-nearest-or-even, and that int is a 32-bit type, we can prove the transformation correct by simple exhaustive test:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <math.h>
int main (void)
{
    const double rcp_5p3 = 1.0 / 5.3; // 0x1.826a439f656f2p-3
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (fma (-5.3, i * rcp_5p3, i), rcp_5p3, i * rcp_5p3);
        if (res != ref) {
            printf ("error: i=%2d res=%23.13a ref=%23.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}
Most modern instances of common processor architectures like x86-64 and ARM64 have hardware support for FMA, so fma() can be mapped directly to the appropriate hardware instruction. This should be confirmed by looking at the disassembly of the generated binary. Where hardware support for FMA is lacking, the transformation obviously should not be applied, as software implementations of fma() are slow and sometimes functionally incorrect.
The basic idea here is that mathematically, division is equivalent to multiplication with the reciprocal. However, that is not necessarily true for finite-precision floating-point arithmetic. The code above tries to improve the likelihood of bit-accurate computation by determining the error in the naive approach with the help of FMA and applying a correction where necessary. For background including literature references see this earlier question.
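As a hedged illustration (my code, not part of the original answer): this is how the verified expression plugs back into the original loop, assuming hardware FMA and the rcp_5p3 constant from the test program above.
#include <math.h>

void add_scaled_index(double *a, int N)
{
    const double rcp_5p3 = 1.0 / 5.3;
    for (int i = 0; i < N; ++i) {
        double q = i * rcp_5p3;                      // naive quotient estimate
        a[i] += fma (fma (-5.3, q, i), rcp_5p3, q);  // corrected i / 5.3
    }
}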
To the best of my knowledge, there is not yet a general mathematically proven algorithm to determine for which divisors paired with which dividends the above transformation is safe (that is, delivers bit-accurate results), which is why an exhaustive test is strictly necessary to show that the transformation is valid.
In comments, Pascal Cuoq points out that there is an alternative algorithm to potentially strength-reduce floating-point division with a compile-time constant divisor, by precomputing the reciprocal of the divisor to more than native precision, specifically as a double-double. For background see N. Brisebarre and J.-M. Muller, "Correctly rounded multiplication by arbitrary precision constant", IEEE Transactions on Computers, 57(2): 162-174, February 2008, which also provides guidance on how to determine whether that transformation is safe for any particular constant. Since the present case is simple, I again used an exhaustive test to show it is safe. Where applicable, this reduces the division down to one FMA plus a multiply:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <mathimf.h>
int main (void)
{
    const double rcp_5p3_hi =  1.8867924528301888e-1;  //  0x1.826a439f656f2p-3
    const double rcp_5p3_lo = -7.2921377017921457e-18; // -0x1.0d084b1883f6e0p-57
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (i, rcp_5p3_hi, i * rcp_5p3_lo);
        if (res != ref) {
            printf ("i=%2d res=%23.13a ref=%23.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}
To cover another aspect: since all values of type int are exactly representable as double (but not as float), it is possible to get rid of the int-to-double conversion that happens in the loop when evaluating i / 5.3 by introducing a floating-point variable that counts from 0.0 to N:
double fp_i = 0;
for (int i = 0; i < N; fp_i += 1, i++)
    a[i] += fp_i / 5.3;
However, this kills autovectorization and introduces a chain of dependent floating-point additions. Floating-point addition typically has a latency of 3 or 4 cycles, so the last iteration will retire after at least (N-1)*3 cycles, even if the CPU could dispatch the instructions in the loop faster. Thankfully, floating-point division is not fully pipelined, and the rate at which an x86 CPU can dispatch floating-point division roughly matches or exceeds the latency of the addition instruction.
This leaves the problem of killed vectorization. It's possible to bring it back by manually unrolling the loop and introducing two independent chains, but with AVX you'd need four chains for full vectorization:
double fp_i0 = 0, fp_i1 = 1;
int i = 0;
for (; i+1 < N; fp_i0 += 2, fp_i1 += 2, i += 2) {
    double t0 = a[i], t1 = a[i+1];
    a[i] = t0 + fp_i0 / 5.3;
    a[i+1] = t1 + fp_i1 / 5.3;
}
if (i < N)
    a[i] += i / 5.3;
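A hedged sketch of the four-chain variant mentioned above (same assumptions and variables as the two-chain version, with a scalar loop handling any leftover elements):
double fp_i0 = 0, fp_i1 = 1, fp_i2 = 2, fp_i3 = 3;
int i = 0;
for (; i + 3 < N; fp_i0 += 4, fp_i1 += 4, fp_i2 += 4, fp_i3 += 4, i += 4) {
    a[i]   += fp_i0 / 5.3;  // four independent addition chains
    a[i+1] += fp_i1 / 5.3;
    a[i+2] += fp_i2 / 5.3;
    a[i+3] += fp_i3 / 5.3;
}
for (; i < N; ++i)          // scalar tail
    a[i] += i / 5.3;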
CAVEAT: After a few days I realized that this answer is incorrect in that it ignores the consequence of underflow (to subnormal or to zero) in the computation of o / 5.3. In this case, multiplying the result by a power of two is “exact” but does not produce the result that dividing the larger integer by 5.3 would have.
i / 5.3 only needs to be computed for odd values of i.
For even values of i, you can simply multiply by 2.0 the value of (i/2)/5.3, which was already computed earlier in the loop.
The remaining difficulty is to reorder the iterations in a way such that each index between 0 and N-1 is handled exactly once and the program does not need to record an arbitrary number of division results.
One way to achieve this is to iterate on all odd numbers o less than N, and after computing o / 5.3 in order to handle index o, to also handle all indexes of the form o * 2**p.
if (N > 0) {
    a[0] += 0.0; // this is needed for strict IEEE 754 compliance lol
    for (int o = 1; o < N; o += 2) {
        double d = o / 5.3;
        int i = o;
        do {
            a[i] += d;
            i += i;
            d += d;
        } while (i < N);
    }
}
Note: this does not use the provided hint “Inquire how IEEE 754 single-precision floating-point numbers are represented in memory”. I think I know pretty well how single-precision floating-point numbers are represented in memory, but I do not see how that is relevant, especially since there are no single-precision values or computations in the code to optimize. I think there is a mistake in the way the problem is expressed, but still the above is technically a partial answer to the question as phrased.
I also ignored overflow problems for values of N that come close to INT_MAX in the code snippet above, since the code is already complicated enough.
As an additional note, the above transformation only replaces one division out of two. It does this by making the code unvectorizable (and also less cache-friendly). In your question, gcc -O3 has already shown that automatic vectorization could be applied to the starting point that your professor suggested, and that is likely to be more beneficial than suppressing half the divisions can be. The only good thing about the transformation in this answer is that it is a sort of strength reduction, which your professor requested.

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

I am looking for an optimal method to calculate the sum of all packed 32-bit integers in a __m256i or __m512i. To calculate the sum of n elements, I often use log2(n) vpaddd and vpermd instructions, then extract the final result. However, I don't think it is the best option.
Edit: best/optimal in terms of speed/cycle reduction.
Related: if you're looking for the non-existent _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics; vpsadbw as an hsum within qwords is much more efficient than shuffling.
Without AVX512, see hsum_8x32(__m256i) below for AVX2 without Intel's reduce_add helper function. reduce_add doesn't necessarily compile optimally anyway with AVX512.
There is an int _mm512_reduce_add_epi32(__m512i) inline function in immintrin.h. You might as well use it. (It compiles to shuffle and add instructions, but more efficient ones than vpermd, like I describe below.) AVX512 didn't introduce any new hardware support for horizontal sums, just this new helper function. It's still something to avoid or sink out of loops whenever possible.
GCC 9.2 -O3 -march=skylake-avx512 compiles a wrapper that calls it as follows:
vextracti64x4 ymm1, zmm0, 0x1
vpaddd ymm1, ymm1, ymm0
vextracti64x2 xmm0, ymm1, 0x1 # silly compiler, vextracti128 would be shorter
vpaddd xmm1, xmm0, xmm1
vpshufd xmm0, xmm1, 78
vpaddd xmm0, xmm0, xmm1
vmovd edx, xmm0
vpextrd eax, xmm0, 1 # 2x xmm->integer to feed scalar add.
add eax, edx
ret
Extracting twice to feed scalar add is questionable; it needs uops for p0 and p5 so it's equivalent to a regular shuffle + a movd.
Clang doesn't do that; it does one more step of shuffle / SIMD add to reduce down to a single scalar for vmovd. See below for perf analysis of the two.
There is a VPHADDD but you should never use it with both inputs the same. (Unless you're optimizing for code-size over speed). It can be useful to transpose-and-sum multiple vectors, resulting in some vectors of results. You do that by feeding phadd with 2 different inputs. (Except it gets messy with 256 and 512-bit because vphadd is still only in-lane.)
Yes, you need log2(vector_width) shuffles and vpaddd instructions. (So this isn't very efficient; avoid horizontal sums inside inner loops. Accumulate vertically until the end of a loop, for example).
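For example, a hedged sketch of that pattern (my code; assumes AVX2, n a multiple of 8, and the hsum_8x32 helper defined later in this answer):
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

uint32_t hsum_8x32(__m256i v);   // AVX2 horizontal sum, defined later in this answer

// Accumulate vertically inside the loop, reduce horizontally once at the end.
uint32_t sum_array_avx2(const int32_t *a, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 8)
        acc = _mm256_add_epi32(acc, _mm256_loadu_si256((const __m256i *)(a + i)));
    return hsum_8x32(acc);       // the only horizontal reduction, outside the loop
}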
General strategy for all SSE / AVX / AVX512
You want to successively narrow from 512 -> 256, then 256 -> 128, then shuffle within __m128i until you're down to one scalar element. Presumably some future AMD CPU will decode 512-bit instructions to two 256-bit uops, so reducing width is a big win there. And narrower instructions presumably cost slightly less power.
Your shuffles can use immediate control operands instead of the vector control that vpermd needs, e.g. VEXTRACTI32x8, vextracti128, and vpshufd. (Or vpunpckhqdq to save code size for the immediate constant.)
See Fastest way to do horizontal SSE vector sum (or other reduction) (my answer also includes some integer versions).
This general strategy is appropriate for all element types: float, double, and any size integer
Special cases:
8-bit integer: start with vpsadbw, more efficient and avoids overflow, but then continue as for 64-bit integers.
16-bit integer: start by widening to 32-bit with pmaddwd (_mm256_madd_epi16 with set1_epi16(1)): SIMD: Accumulate Adjacent Pairs - fewer uops even if you don't care about the avoiding-overflow benefit, except on AMD before Zen2 where 256-bit instructions cost at least 2 uops. Then continue as for 32-bit integer (see the short sketch after the compiler output below).
32-bit integer can be done manually like this, with an SSE2 function called by the AVX2 function after reducing to __m128i, in turn called by the AVX512 function after reducing to __m256i. The calls will of course inline in practice.
#include <immintrin.h>
#include <stdint.h>
// from my earlier answer, with tuning for non-AVX CPUs removed
// static inline
uint32_t hsum_epi32_avx(__m128i x)
{
    __m128i hi64 = _mm_unpackhi_epi64(x, x); // 3-operand non-destructive AVX lets us save a byte without needing a movdqa
    __m128i sum64 = _mm_add_epi32(hi64, x);
    __m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1)); // Swap the low two elements
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32); // movd
}
// only needs AVX2
uint32_t hsum_8x32(__m256i v)
{
    __m128i sum128 = _mm_add_epi32(
        _mm256_castsi256_si128(v),
        _mm256_extracti128_si256(v, 1)); // silly GCC uses a longer AVX512VL instruction if AVX512 is enabled :/
    return hsum_epi32_avx(sum128);
}
// AVX512
uint32_t hsum_16x32(__m512i v)
{
    __m256i sum256 = _mm256_add_epi32(
        _mm512_castsi512_si256(v),        // low half
        _mm512_extracti64x4_epi64(v, 1)); // high half. AVX512F. 32x8 version is AVX512DQ
    return hsum_8x32(sum256);
}
Notice that this uses __m256i hsum as a building block for __m512i; there's nothing to be gained by doing in-lane operations first.
Well, possibly a very tiny advantage: in-lane shuffles have lower latency than lane-crossing ones, so they could execute 2 cycles earlier, leave the RS earlier, and similarly retire from the ROB slightly earlier. But the higher-latency shuffles are coming just a couple of instructions later even if you did that. So you might get a handful of independent instructions into the back-end 2 cycles earlier if this hsum was on the critical path (blocking retirement).
But reducing to a narrower vector width sooner is generally good, maybe getting 512-bit uops out of the system sooner so the CPU can re-activate the SIMD execution units on port 1, if you aren't doing more 512-bit work right away.
Compiles on Godbolt to these instructions, with GCC9.2 -O3 -march=skylake-avx512
hsum_16x32(long long __vector(8)):
vextracti64x4 ymm1, zmm0, 0x1
vpaddd ymm0, ymm1, ymm0
vextracti64x2 xmm1, ymm0, 0x1 # silly compiler uses a longer EVEX instruction when its available (AVX512VL)
vpaddd xmm0, xmm0, xmm1
vpunpckhqdq xmm1, xmm0, xmm0
vpaddd xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 177
vpaddd xmm0, xmm1, xmm0
vmovd eax, xmm0
ret
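For the 16-bit special case mentioned earlier, a hedged sketch (my code) of the vpmaddwd widening step feeding the same 32-bit hsum:
#include <immintrin.h>
#include <stdint.h>

uint32_t hsum_8x32(__m256i v);   // AVX2 32-bit hsum defined above

// Widen + pair-sum 16 x int16 into 8 x int32 with one vpmaddwd, then reuse the 32-bit hsum.
uint32_t hsum_16x16(__m256i v16)
{
    __m256i pairsums = _mm256_madd_epi16(v16, _mm256_set1_epi16(1)); // a0*1 + a1*1, ...
    return hsum_8x32(pairsums);
}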
P.S.: perf analysis of GCC's _mm512_reduce_add_epi32 vs. clang's (which is equivalent to my version), using data from https://uops.info/ and/or Agner Fog's instruction tables:
After inlining into a caller that does something with the result, it could allow optimizations like adding a constant as well using lea eax, [rax + rdx + 123] or something.
But other than that, it seems almost always worse than the shuffle / vpaddd / vmovd at the end of my implementation, on Skylake-X:
total uops: reduce: 4. Mine: 3
ports: reduce: 2p0, p5 (part of vpextrd), p0156 (scalar add)
ports: mine: p5, p015 (vpaddd on SKX), p0 (vmovd)
Latency is equal at 4 cycles, assuming no resource conflicts:
shuffle 1 cycle -> SIMD add 1 cycle -> vmovd 2 cycles
vpextrd 3 cycles (in parallel with 2 cycle vmovd) -> add 1 cycle.

How results of a SIMD operation go back into an array: cache-unfriendly?

Once again I'm jumping back into teaching myself basic assembly language again so I don't completely forget everything.
I made this practice code the other day, and in it, it turned out that I had to plop the results of a vector operation into an array backwards; otherwise it gave a wrong answer. Incidentally, this is also how GCC similarly outputs assembly code for SIMD operation results going back to a memory location, so I presume it's the "correct" way.
However, something occurred to me, from something I've been conscious about as a game developer for quite a long time: cache friendliness. My understanding is that moving forward in a contiguous block of memory is always ideal, else you risk cache misses.
My question is: even if this example below is nothing more than calculating a couple of four-element vectors and spitting out four numbers before quitting, I have to wonder whether this -- putting numbers back into an array in what's technically reverse order -- has any impact at all on cache misses in the real world, within a typical production-level program that does hundreds of thousands of SIMD vector calculations (and more specifically, returning them back to memory) per second?
Here is the full code (linux 64-bit NASM) with comments including the original one that prompted me to bring this curiosity of mine to stackexchange:
extern printf
extern fflush
global _start
section .data
outputText: db '[%f, %f, %f, %f]',10,0
align 16
vec1: dd 1.0, 2.0, 3.0, 4.0
vec2: dd 10.0,10.0,10.0,50.0
section .bss
result: resd 4 ; four 32-bit single-precision floats
section .text
_start:
sub rsp,16
movaps xmm0,[vec1]
movaps xmm1,[vec2]
mulps xmm0,xmm1 ; xmm0 = (vec1 * vec2)
movaps [result],xmm0 ; copy 4 floats back to result[]
; printf only accepts 64-bit floats for some dumb reason,
; so convert these 32-bit floats packed within the 128-bit xmm0
; register into four 64-bit floats, each in a separate xmm* reg
movss xmm0,[result+12] ; result[3]
unpcklps xmm0,xmm0 ; 32-->64 bit
cvtps2pd xmm3,xmm0 ; put double in 4th XMM
movss xmm0,[result+8] ; result[2]
unpcklps xmm0,xmm0 ; 32-->64 bit
cvtps2pd xmm2,xmm0 ; put double in 3rd XMM
movss xmm0,[result+4] ; result[1]
unpcklps xmm0,xmm0 ; 32-->64 bit
cvtps2pd xmm1,xmm0 ; put double in 2nd XMM
movss xmm0,[result] ; result[0]
unpcklps xmm0,xmm0 ; 32-->64 bit
cvtps2pd xmm0,xmm0 ; put double in 1st XMM
; FOOD FOR THOUGHT!
; *****************
; That was done backwards, going from highest element
; of what is technically an array down to the lowest.
;
; This is because when it was done from lowest to
; highest, this garbled bird poop was the answer:
; [13510801139695616.000000, 20.000000, 30.000000, 200.000000]
;
; HOWEVER, if the correct way is this way, in which
; it traipses through an array backwards...
; is that not cache-unfriendly? Or is it too tiny and
; miniscule to have any impact with cache misses?
mov rdi, outputText ; tells printf where is format string
mov rax,4 ; tells printf to print 4 XMM regs
call printf
mov rdi,0
call fflush ; ensure we see printf output b4 exit
add rsp,16
_exit:
mov eax,1 ; syscall id for sys_exit
mov ebx,0 ; exit with ret of 0 (no error)
int 80h
HW prefetchers can recognize streams with descending addresses as well as ascending. Intel's optimization manual documents the HW prefetchers in fair detail. I think AMD's prefetchers are broadly similar in terms of being able to recognize descending patterns as well.
Within a single cache-line, it doesn't matter at all what order you access things in, AFAIK.
See the x86 tag wiki for more links, especially Agner Fog's Optimizing Assembly guide to learn how to write asm that isn't slower than what a compiler could make. The tag wiki also has links to Intel's manuals.
Also, that is some ugly / bad asm. Here's how to do it better:
Printf only accepts double because of C rules for arg promotion to variadic functions. Yes, this is kinda dumb, but FP->base-10-text conversion dwarfs the overhead from an extra float->double conversion. If you need high-performance FP->string, you probably should avoid using a function that has to parse a format string every call.
debug-prints in ASM are usually more trouble than they're worth, compared to using a debugger.
Also:
This is 64-bit code, so don't use the 32-bit int 0x80 ABI to exit.
The UNPCKLPS instructions are pointless, because you only care about the low element anyway. CVTPS2PD produces two results, but you're converting the same number twice in parallel instead of converting two and then unpacking. Only the low double in an XMM matters, when calling a function that takes scalar args, so you can leave high garbage.
Store/reload is also pointless
DEFAULT REL ; use RIP-relative addressing for [vec1]
extern printf
;extern fflush ; just call exit(3) instead of manual fflush
extern exit
section .rodata ; read-only data can be part of the text segment
outputText: db '[%f, %f, %f, %f]',10,0
align 16
vec1: dd 1.0, 2.0, 3.0, 4.0
vec2: dd 10.0,10.0,10.0,50.0
section .bss
;; static scratch space is unwise. Use the stack to reduce cache misses, and for thread safety
; result: resd 4 ; four 32-bit single-precision floats
section .text
global _start
_start:
;; sub rsp,16 ; What was this for? We have a red-zone in x86-64 SysV, and we don't use the space anyway
movaps xmm2, [vec1]
; fold the load into the mulps
mulps xmm2, [vec2] ; (vec1 * vec2)
; printf only accepts 64-bit doubles, because it's a C variadic function.
; so convert these 32-bit floats packed within the 128-bit xmm0
; register into four 64-bit floats, each in a separate xmm* reg
; xmm2 = [f0,f1,f2,f3]
cvtps2pd xmm0, xmm2 ; xmm0=[d0,d1]
movaps xmm1, xmm0
unpckhpd xmm1, xmm1 ; xmm1=[d1,d1]
unpckhpd xmm2, xmm2 ; xmm2=[f2,f3, f2,f3]
cvtps2pd xmm2, xmm2 ; xmm2=[d2,d3]
movaps xmm3, xmm2 ; copy [d2,d3] so we can extract d3
unpckhpd xmm3, xmm3 ; xmm3=[d3,d3]
mov edi, outputText ; static data is in the low 2G, so we can use 32-bit absolute addresses
;lea rdi, [outputText] ; or this is the PIC way to do it
mov eax,4 ; tells printf to print 4 XMM regs
call printf
xor edi, edi
;call fflush ; flush before _exit()
jmp exit ; tailcall exit(3) which does flush, like if you returned from main()
; add rsp,16
;; this is how you would exit if you didn't use the libc function.
_exit:
xor edi, edi
mov eax, 231 ; exit_group(0)
syscall ; 64-bit code should use the 64-bit ABI
You could also use MOVHLPS to move the high 64 bits from one register into the low 64 bits of another reg, but that has a false dependency on the old contents.
cvtps2pd xmm0, xmm2 ; xmm0=[d0,d1]
;movaps xmm1, xmm0
;unpckhpd xmm1, xmm1 ; xmm1=[d1,d1]
;xorps xmm1, xmm1 ; break the false dependency
movhlps xmm1, xmm0 ; xmm1=[d1,??] ; false dependency on old value of xmm1
On Sandybridge, xorps and movhlps would be more efficient, because it can handle xor-zeroing without using an execution unit. IvyBridge and later, and AMD CPUs, can eliminate the MOVAPS the same way: zero latency. But still takes a uop and some frontend throughput resources.
If you were going to store and reload, and convert each float separately, you'd use CVTSS2SD, either as a load (cvtss2sd xmm2, [result + 12]) or after movss.
Using MOVSS first would break the false dependency on the full register, which CVTSS2SD has because Intel designed it badly, to merge with the old value instead of replacing it. The same applies to int->float or int->double conversion. The case where you actually want merging is much rarer than plain scalar math, and it can be done with a reg-reg MOVSS.
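For comparison, a hedged C-with-intrinsics sketch (my code, not part of the original answer) that does the same multiply and float->double widening without any store/reload:
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128 v1 = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 v2 = _mm_setr_ps(10.0f, 10.0f, 10.0f, 50.0f);
    __m128 prod = _mm_mul_ps(v1, v2);                      // [f0, f1, f2, f3]

    __m128d lo = _mm_cvtps_pd(prod);                       // [d0, d1]
    __m128d hi = _mm_cvtps_pd(_mm_movehl_ps(prod, prod));  // [d2, d3]

    printf("[%f, %f, %f, %f]\n",
           _mm_cvtsd_f64(lo), _mm_cvtsd_f64(_mm_unpackhi_pd(lo, lo)),
           _mm_cvtsd_f64(hi), _mm_cvtsd_f64(_mm_unpackhi_pd(hi, hi)));
    return 0;
}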

searching through a short sorted array of doubles

I am trying to optimize a search through a very short sorted array of doubles to locate a bucket a given value belongs to. Assuming the size of the array is 8 doubles, I have come up with the following sequence of AVX intrinsics:
_data = _mm256_load_pd(array);
temp = _mm256_movemask_pd(_mm256_cmp_pd(_data, _value, _CMP_LT_OQ));
pos = _mm_popcnt_u32(temp);
_data = _mm256_load_pd(array+4);
temp = _mm256_movemask_pd(_mm256_cmp_pd(_data, _value, _CMP_LT_OQ));
pos += _mm_popcnt_u32(temp);
To my surprise (I do not have the instruction latency specs in my head..), it turned out that a faster code is generated by gcc for the following C loop:
for(i=0; i<7; ++i) if(array[i+1]>=value) break;
This loop compiles into what I found to be a very efficient code:
lea ecx, [rax+1]
vmovsd xmm1, QWORD PTR [rdx+rcx*8]
vucomisd xmm1, xmm0
jae .L7
lea ecx, [rax+2]
vmovsd xmm1, QWORD PTR [rdx+rcx*8]
vucomisd xmm1, xmm0
jae .L8
[... repeat for all elements of array]
so it takes 4 instructions to check 1 bucket (lea, vmovsd, vucomisd, jae). Assuming the value is uniformly spread, on average I will have to check ~3.5 buckets per value. Apparently, this is enough to outperform the AVX code listed earlier.
Now, in a general case the array may of course be larger than 8 elements. If I code a C loop like this:
for(i=0; i<n-1; i++) if(array[i+1]>=value) break;
I get the following instruction sequence for the loop body:
.L76:
mov eax, edx
.L67:
cmp eax, esi
jae .L77
lea edx, [rax+1]
mov ecx, edx
vmovsd xmm1, QWORD PTR [rdi+rcx*8]
vucomisd xmm1, xmm0
jb .L76
I can tell gcc to unroll the loop, but the point is that the number of instructions per element is larger than in the case of the loop with constant bounds, and the code is slower. Also, I do not understand the reason behind using an additional rcx register for addressing in vmovsd.
I can manually modify the assembly for the loop to look something like in the first example, and it does work faster:
.L76:
cmp edx, esi # eax -> edx
jae .L77
lea edx, [rdx+1] # rax -> rdx
vmovsd xmm1, QWORD PTR [rdi+rdx*8]
vucomisd xmm1, xmm0
jb .L76
but I can not seem to make gcc do it. And I know it can - the asm generated in the first example is OK.
Do you have any ideas how to do it otherwise than using inline asm? Or even better - can you suggest a faster implementation of the search?
Not really an answer, but there's no room in the comments for this.
I tested the AVX function against a simple C implementation and got completely different results.
I tested on Windows 7 x64 not Linux but the generated code was very similar.
How the test went:
1) I disabled the CPU's SpeedStep.
2) Within main() I raised the process priority and thread priority to the max (realtime).
3) I ran 10M calls to the tested function to heat up the CPU - activate turbo.
4) I called Sleep(0) to avoid a context switch
5) I called __rdtscp to start measurement
6) In a loop I called either the AVX find index function or the simple C version - like you did. The other implementation was commented out and not used. Loop size was 10M calls.
7) I called __rdtscp again to finish the benchmark.
8) I printed ticks/iterations. to get the average tick count for a call
Note: I declared both 'find index' functions as inline and I confirmed in the disassembly that they got inlined.
The AVX function and the C function you described are not identical: the C function returns a zero-based index and the AVX function returns a 1-based index.
On my system, it took the AVX function 1.1 cycles per iteration and the C function took 4.4 cycles per iteration.
I couldn't force the MSVC compiler to use more than ymm registers :(
Array used:
double A[8] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 };
Results (avg. ticks/iter):
value = 0.3 (index = 2): AVX: 1.1 | C: 4.4
value = 0.5 (index = 3): AVX: 1.1 | C: 11.1
value = 0.9 (index = 7): AVX: 1.1 | C: 18.1
If the AVX function is corrected to return pos-1, then it will be 50% slower.
You can see that the AVX function works in constant time while the trivial C loop function performance depends on the index you're looking for.
Timing with clock() and running 100M iterations yields similar results; AVX is almost 4x faster for the first test.
Also note that running longer tests reveal different results, but every time AVX holds a similar advantage.
You can try integer comparison. A double comparison is equivalent to an int64_t comparison of the same bits, with exceptions for NaNs and for negative values (whose sign-magnitude encoding reverses the integer ordering). It could turn out faster: the CPU has more scalar integer execution units than SIMD/FP ones. Just pass the double* and reinterpret it as int64_t* inside the function.
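A hedged sketch of that suggestion (my code, untested): reinterpret the bits as int64_t and keep the same early-exit loop. It assumes all values are non-negative and not NaN, so the raw-bit ordering matches the numeric ordering.
#include <stdint.h>
#include <string.h>

static inline int find_bucket_int64(const double *array, double value, int n)
{
    int64_t key, elem;
    memcpy(&key, &value, sizeof key);              // type-pun via memcpy to avoid aliasing UB
    int i;
    for (i = 0; i < n - 1; ++i) {
        memcpy(&elem, &array[i + 1], sizeof elem);
        if (elem >= key)                           // integer compare instead of vucomisd
            break;
    }
    return i;
}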

SSE2 instruction to convert a 8x16 register to two 4x32 registers having the even and odd indexed elements

Is there any SSE2 instruction to convert an 8x16 register into two 4x32 registers, one 4x32 register having the odd-indexed elements from the 8x16 register and the other having the even-indexed elements? Please suggest.
Untested:
movdqa xmm1, xmm0
pslld xmm0, 16
psrad xmm1, 16 ; odd words
psrad xmm0, 16 ; even words
Should be easy enough to convert to intrinsics.
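For instance, a hedged intrinsics version of that shift approach (my naming, untested):
#include <emmintrin.h>  // SSE2

// Split 8 signed 16-bit elements into even/odd lanes, sign-extended to 32 bits.
static inline void split_even_odd_epi16(__m128i v, __m128i *even, __m128i *odd)
{
    *odd  = _mm_srai_epi32(v, 16);                      // high word of each dword = odd elements
    *even = _mm_srai_epi32(_mm_slli_epi32(v, 16), 16);  // shift even word up, then sign-extend down
}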
There is no single instruction for this, not even in later versions of SSE. Instructions with multiple outputs are very rare, mostly reserved for old instructions.
pmovsxwd from SSE4.1 uses the (for this problem) wrong subset of elements, namely the bottom 4.
Not sure if there's a single instruction for this, but something like this ought to work (untested):
; Assume that the 8 16-bit values are in xmm0
PSHUFLW xmm1,xmm0,0D8h ; Change word order to 3120 in the low qword
PSHUFHW xmm1,xmm1,0D8h ; Change word order to 3120 in the high qword
PSHUFD xmm1,xmm1,0D8h ; Change dword order to 3120
MOVAPD xmm0,xmm1 ; Copy to xmm0
PUNPCKLWD xmm0,xmm0 ; Expand even words to dwords
PUNPCKHWD xmm1,xmm1 ; Expand odd words to dwords
PSLLD xmm0,16 ; Sign-extend
PSRAD xmm0,16 ; ...
PSLLD xmm1,16
PSRAD xmm1,16
xmm0 should now contain the 4 even words sign-extended to 32 bits, and xmm1 should contain the odd words.
If you can use SSE4.1 instructions it's possible to simplify the sign-extension part a bit. For the even words (xmm0) you could replace the unpack and the two shifts with:
PMOVSXWD xmm0,xmm0
