This post is closely related to another one I posted some days ago. This time, I wrote a simple piece of code that just adds a pair of arrays element by element, multiplies the result by the values in another array and stores it in a fourth array; all variables are double-precision floating point.
I made two versions of that code: one with SSE instructions, using calls to intrinsics, and another one without them. I then compiled both with gcc at the -O0 optimization level. I write them below:
// SSE VERSION
#define N 10000
#define NTIMES 100000
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>
double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));
int main(void){
    int i, times;
    for( times = 0; times < NTIMES; times++ ){
        for( i = 0; i < N; i += 2 ){
            __m128d mm_a = _mm_load_pd( &a[i] );
            _mm_prefetch( &a[i+4], _MM_HINT_T0 );
            __m128d mm_b = _mm_load_pd( &b[i] );
            _mm_prefetch( &b[i+4], _MM_HINT_T0 );
            __m128d mm_c = _mm_load_pd( &c[i] );
            _mm_prefetch( &c[i+4], _MM_HINT_T0 );
            __m128d mm_r;
            mm_r = _mm_add_pd( mm_a, mm_b );
            mm_a = _mm_mul_pd( mm_r, mm_c );
            _mm_store_pd( &r[i], mm_a );
        }
    }
}
//NO SSE VERSION
//same definitions as before
int main(void){
    int i, times;
    for( times = 0; times < NTIMES; times++ ){
        for( i = 0; i < N; i++ ){
            r[i] = (a[i]+b[i])*c[i];
        }
    }
}
When compiling them with -O0, gcc makes use of the XMM/MMX registers and SSE instructions, unless it is explicitly given the -mno-sse (and related) options. I inspected the assembly code generated for the second program and noticed that it uses movsd, addsd and mulsd instructions. So it does use SSE instructions, but only those that operate on the lowest part of the registers, if I am not wrong. The assembly code generated for the first C program used, as expected, the addpd and mulpd instructions, though considerably more assembly code was generated.
Anyway, the first code should, as far as I know, benefit more from the SIMD paradigm, since in every iteration two result values are computed. Despite that, the second code performs about 25 per cent faster than the first one. I also ran a test with single-precision values and got similar results. What's the reason for that?
Vectorization in GCC is enabled at -O3. That's why at -O0, you see only the ordinary scalar SSE2 instructions (movsd, addsd, etc). Using GCC 4.6.1 and your second example:
#define N 10000
#define NTIMES 100000
double a[N] __attribute__ ((aligned (16)));
double b[N] __attribute__ ((aligned (16)));
double c[N] __attribute__ ((aligned (16)));
double r[N] __attribute__ ((aligned (16)));
int
main (void)
{
  int i, times;
  for (times = 0; times < NTIMES; times++)
    {
      for (i = 0; i < N; ++i)
        r[i] = (a[i] + b[i]) * c[i];
    }
  return 0;
}
and compiling with gcc -S -O3 -msse2 sse.c produces for the inner loop the following instructions, which is pretty good:
.L3:
movapd a(%eax), %xmm0
addpd b(%eax), %xmm0
mulpd c(%eax), %xmm0
movapd %xmm0, r(%eax)
addl $16, %eax
cmpl $80000, %eax
jne .L3
As you can see, with the vectorization enabled GCC emits code to perform two loop iterations in parallel. It can be improved, though - this code uses the lower 128 bits of the SSE registers, but it can use the full 256-bit YMM registers, by enabling the AVX encoding of SSE instructions (if available on the machine). So, compiling the same program with gcc -S -O3 -msse2 -mavx sse.c gives for the inner loop:
.L3:
vmovapd a(%eax), %ymm0
vaddpd b(%eax), %ymm0, %ymm0
vmulpd c(%eax), %ymm0, %ymm0
vmovapd %ymm0, r(%eax)
addl $32, %eax
cmpl $80000, %eax
jne .L3
Note the v in front of each instruction and that the instructions use the 256-bit YMM registers; four iterations of the original loop are executed in parallel.
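A side note that is not from the original answer: -mavx only changes code generation, so if the binary must also run on machines without AVX, the 256-bit path needs a runtime check. Here is a minimal dispatch sketch, assuming GCC 4.8 or later for __builtin_cpu_supports and the target attribute; compute_avx and compute_sse2 are hypothetical stand-ins for the AVX and baseline builds of the loop:
// Hypothetical stand-ins for the two builds of the loop body.
__attribute__((target("avx")))
static void compute_avx(void)  { /* AVX (YMM) version of the loop */ }

static void compute_sse2(void) { /* baseline SSE2 version of the loop */ }

void compute(void)
{
    if (__builtin_cpu_supports("avx"))   // runtime CPU feature check provided by GCC
        compute_avx();
    else
        compute_sse2();
}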
I would like to extend chill's answer and draw your attention to the fact that GCC seems unable to make the same smart use of AVX instructions when iterating backwards.
Just replace the inner loop in chill's sample code with:
for (i = N-1; i >= 0; --i)
r[i] = (a[i] + b[i]) * c[i];
GCC (4.8.4) with options -S -O3 -mavx produces:
.L5:
vmovsd a+79992(%rax), %xmm0
subq $8, %rax
vaddsd b+80000(%rax), %xmm0, %xmm0
vmulsd c+80000(%rax), %xmm0, %xmm0
vmovsd %xmm0, r+80000(%rax)
cmpq $-80000, %rax
jne .L5
I currently have two functions A and B.
When compiled without any flags, A is faster than B.
But when compiled with -O1 or -O3, B is much faster than A.
I want to port the function to other languages, so it seems like A is a better choice.
But it would be great if I could understand how -O3 managed to speed up function B. Are there any good ways of at least getting a slight understanding of the kind of optimizations done by -O3?
-O3 does the same as -O2, and also:
Inline parts of functions.
Perform function cloning to make interprocedural constant propagation stronger.
Perform loop interchange outside of graphite. This can improve cache performance on loop nest and allow further loop optimizations, like vectorization, to take place. For example, the loop:
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      c[i][j] = c[i][j] + a[i][k]*b[k][j];
is transformed to
for (int i = 0; i < N; i++)
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      c[i][j] = c[i][j] + a[i][k]*b[k][j];
Apply unroll and jam transformations on feasible loops. In a loop nest this unrolls the outer loop by some factor and fuses the resulting multiple inner loops (see the sketch after this list).
Peels loops for which there is enough information that they do not roll much. It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations).
Perform predictive commoning optimization, i.e., reusing computations (especially memory loads and stores) performed in previous iterations of loops (see the sketch after this list).
Split paths leading to loop backedges. This can improve dead code elimination and common subexpression elimination.
Improve cache performance on big loop bodies and allow further loop optimizations, like parallelization or vectorization, to take place.
Move branches with loop invariant conditions out of the loop, with duplicates of the loop on both branches (modified according to result of the condition).
If a loop iterates over an array with a variable stride, create another version of the loop that assumes the stride is always one. For example:
for (int i = 0; i < n; ++i)
  x[i * stride] = …;
becomes:
if (stride == 1)
  for (int i = 0; i < n; ++i)
    x[i] = …;
else
  for (int i = 0; i < n; ++i)
    x[i * stride] = …;
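To make the unroll-and-jam and predictive-commoning items above concrete, here are hand-written sketches of what those transformations do, in the same fragment style as the examples above. These are illustrations only, not GCC's actual output, and they assume the trip counts divide evenly (no remainder iterations):
// Unroll and jam: unroll the outer loop by 2, then fuse ("jam") the two inner loops.
// Before:
for (int i = 0; i < N; i++)
  for (int j = 0; j < M; j++)
    s[i] = s[i] + a[i][j];
// After:
for (int i = 0; i < N; i += 2)
  for (int j = 0; j < M; j++) {
    s[i]     = s[i]     + a[i][j];
    s[i + 1] = s[i + 1] + a[i + 1][j];
  }

// Predictive commoning: a value loaded in one iteration is reused in the next.
// Before: each iteration loads both p[i] and p[i + 1].
for (int i = 0; i < n - 1; i++)
  q[i] = p[i] + p[i + 1];
// After: p[i + 1] is carried over as the p[i] of the following iteration.
double prev = p[0];
for (int i = 0; i < n - 1; i++) {
  double next = p[i + 1];
  q[i] = prev + next;
  prev = next;
}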
For example, the following code:
unsigned long apply(unsigned long (*f)(unsigned long, unsigned long), unsigned long a, unsigned long b, unsigned long c) {
    for (unsigned long i = 0; i < b; i++)
        c = f(c, a);
    return c;
}
unsigned long inc(unsigned long a, unsigned long b) { return a + 1; }
unsigned long add(unsigned long a, unsigned long b) { return apply(inc, 0, b, a); }
optimizes the add function to:
Intel Syntax
add:
lea rax, [rsi+rdi]
ret
AT&T:
add:
leaq (%rsi,%rdi), %rax
ret
Without -O3, the output is:
Intel Syntax
add:
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov QWORD PTR [rbp-16], rsi
mov rdx, QWORD PTR [rbp-8]
mov rax, QWORD PTR [rbp-16]
mov rcx, rdx
mov rdx, rax
mov esi, 0
mov edi, OFFSET FLAT:inc
call apply
leave
ret
AT&T:
add:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movq -8(%rbp), %rdx
movq -16(%rbp), %rax
movq %rdx, %rcx
movq %rax, %rdx
movl $0, %esi
movl $inc, %edi
call apply
leave
ret
You can compare the output assembler for functions A and B using -S flag and -masm=intel.
This answer is based on the GCC documentation; you can learn more from it.
The question being
Are there any good ways of at least getting a slight understanding of the kind of optimizations done by -O3?
and the intention apparently being that the question be answered in a general sense that does not take the actual code into consideration, the best answer I see is to recommend reading the documentation for your compiler, especially the documentation on optimizations.
Although not every optimization GCC performs has a corresponding option flag, most do. The docs specify which optimizations are performed at each level in terms of those flags, and they also specify what each individual flag means. Some of the terminology used in those explanations may be unfamiliar, but you should be able to glean at least "a slight understanding". Do start reading at the very top of the optimization docs.
When looping over an array with inline assembly, should I use the register modifier "r" or the memory modifier "m"?
Let's consider an example which adds two float arrays x and y and writes the results to z. Normally I would use intrinsics to do this, like this:
for(int i=0; i<n/4; i++) {
__m128 x4 = _mm_load_ps(&x[4*i]);
__m128 y4 = _mm_load_ps(&y[4*i]);
__m128 s = _mm_add_ps(x4,y4);
_mm_store_ps(&z[4*i], s);
}
Here is the inline assembly solution I have come up with using the register modifier "r"
void add_asm1(float *x, float *y, float *z, unsigned n) {
for(int i=0; i<n; i+=4) {
__asm__ __volatile__ (
"movaps (%1,%%rax,4), %%xmm0\n"
"addps (%2,%%rax,4), %%xmm0\n"
"movaps %%xmm0, (%0,%%rax,4)\n"
:
: "r" (z), "r" (y), "r" (x), "a" (i)
:
);
}
}
This generates similar assembly to GCC. The main difference is that GCC adds 16 to the index register and uses a scale of 1 whereas the inline-assembly solution adds 4 to the index register and uses a scale of 4.
I was not able to use a general register for the iterator. I had to specify one which in this case was rax. Is there a reason for this?
Here is the solution I came up with using the memory modifier "m":
void add_asm2(float *x, float *y, float *z, unsigned n) {
for(int i=0; i<n; i+=4) {
__asm__ __volatile__ (
"movaps %1, %%xmm0\n"
"addps %2, %%xmm0\n"
"movaps %%xmm0, %0\n"
: "=m" (z[i])
: "m" (y[i]), "m" (x[i])
:
);
}
}
This is less efficient as it does not use an index register and instead has to add 16 to the base register of each array. The generated assembly is (gcc (Ubuntu 5.2.1-22ubuntu2) with gcc -O3 -S asmtest.c):
.L22:
movaps (%rsi), %xmm0
addps (%rdi), %xmm0
movaps %xmm0, (%rdx)
addl $4, %eax
addq $16, %rdx
addq $16, %rsi
addq $16, %rdi
cmpl %eax, %ecx
ja .L22
Is there a better solution using the memory modifier "m"? Is there some way to get it to use an index register? The reason I ask is that it seemed more logical to me to use the memory modifier "m" since I am reading and writing memory. Additionally, with the register modifier "r" I never use an output operand list, which seemed odd to me at first.
Maybe there is a better solution than using "r" or "m"?
Here is the full code I used to test this
#include <stdio.h>
#include <x86intrin.h>
#define N 64
void add_intrin(float *x, float *y, float *z, unsigned n) {
for(int i=0; i<n; i+=4) {
__m128 x4 = _mm_load_ps(&x[i]);
__m128 y4 = _mm_load_ps(&y[i]);
__m128 s = _mm_add_ps(x4,y4);
_mm_store_ps(&z[i], s);
}
}
void add_intrin2(float *x, float *y, float *z, unsigned n) {
for(int i=0; i<n/4; i++) {
__m128 x4 = _mm_load_ps(&x[4*i]);
__m128 y4 = _mm_load_ps(&y[4*i]);
__m128 s = _mm_add_ps(x4,y4);
_mm_store_ps(&z[4*i], s);
}
}
void add_asm1(float *x, float *y, float *z, unsigned n) {
for(int i=0; i<n; i+=4) {
__asm__ __volatile__ (
"movaps (%1,%%rax,4), %%xmm0\n"
"addps (%2,%%rax,4), %%xmm0\n"
"movaps %%xmm0, (%0,%%rax,4)\n"
:
: "r" (z), "r" (y), "r" (x), "a" (i)
:
);
}
}
void add_asm2(float *x, float *y, float *z, unsigned n) {
for(int i=0; i<n; i+=4) {
__asm__ __volatile__ (
"movaps %1, %%xmm0\n"
"addps %2, %%xmm0\n"
"movaps %%xmm0, %0\n"
: "=m" (z[i])
: "m" (y[i]), "m" (x[i])
:
);
}
}
int main(void) {
float x[N], y[N], z1[N], z2[N], z3[N];
for(int i=0; i<N; i++) x[i] = 1.0f, y[i] = 2.0f;
add_intrin2(x,y,z1,N);
add_asm1(x,y,z2,N);
add_asm2(x,y,z3,N);
for(int i=0; i<N; i++) printf("%.0f ", z1[i]); puts("");
for(int i=0; i<N; i++) printf("%.0f ", z2[i]); puts("");
for(int i=0; i<N; i++) printf("%.0f ", z3[i]); puts("");
}
Avoid inline asm whenever possible: https://gcc.gnu.org/wiki/DontUseInlineAsm. It blocks many optimizations. But if you really can't hand-hold the compiler into making the asm you want, you should probably write your whole loop in asm so you can unroll and tweak it manually, instead of doing stuff like this.
You can use an r constraint for the index. Use the q modifier to get the name of the 64bit register, so you can use it in an addressing mode. When compiled for 32bit targets, the q modifier selects the name of the 32bit register, so the same code still works.
If you want to choose what kind of addressing mode is used, you'll need to do it yourself, using pointer operands with r constraints.
GNU C inline asm syntax doesn't assume that you read or write the memory pointed to by pointer operands (e.g. maybe you're using an inline-asm and instruction on the pointer value). So you need to do something with either a "memory" clobber or memory input/output operands to let it know what memory you modify. A "memory" clobber is easy, but forces everything except locals to be spilled/reloaded. See the Clobbers section in the docs for an example of using a dummy input operand.
Specifically, a "m" (*(const float (*)[]) fptr) will tell the compiler that the entire array object is an input, arbitrary-length. i.e. the asm can't reorder with any stores that use fptr as part of the address (or that use the array it's known to point into). Also works with an "=m" or "+m" constraint (without the const, obviously).
Using a specific size like "m" (*(const float (*)[4]) fptr) lets you tell the compiler what you do/don't read. (Or write). Then it can (if otherwise permitted) sink a store to a later element past the asm statement, and combine it with another store (or do dead-store elimination) of any stores that your inline asm doesn't read.
(See How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for a whole Q&A about this.)
Another huge benefit to an m constraint is that -funroll-loops can work by generating addresses with constant offsets. Doing the addressing ourself prevents the compiler from doing a single increment every 4 iterations or something, because every source-level value of i needs to appear in a register.
Here's my version, with some tweaks as noted in comments. This is not optimal, e.g. can't be unrolled efficiently by the compiler.
#include <immintrin.h>
void add_asm1_memclobber(float *x, float *y, float *z, unsigned n) {
__m128 vectmp; // let the compiler choose a scratch register
for(int i=0; i<n; i+=4) {
__asm__ __volatile__ (
"movaps (%[y],%q[idx],4), %[vectmp]\n\t" // q modifier: 64bit version of a GP reg
"addps (%[x],%q[idx],4), %[vectmp]\n\t"
"movaps %[vectmp], (%[z],%q[idx],4)\n\t"
: [vectmp] "=x" (vectmp) // "=m" (z[i]) // gives worse code if the compiler prepares a reg we don't use
: [z] "r" (z), [y] "r" (y), [x] "r" (x),
[idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
: "memory"
// you can avoid a "memory" clobber with dummy input/output operands
);
}
}
Godbolt compiler explorer asm output for this and a couple versions below.
Your version needs to declare %xmm0 as clobbered, or you will have a bad time when this is inlined. My version uses a temporary variable as an output-only operand that's never used. This gives the compiler full freedom for register allocation.
If you want to avoid the "memory" clobber, you can use dummy memory input/output operands like "m" (*(const __m128*)&x[i]) to tell the compiler which memory is read and written by your function. This is necessary to ensure correct code-generation if you did something like x[4] = 1.0; right before running that loop. (And even if you didn't write something that simple, inlining and constant propagation can boil it down to that.) And also to make sure the compiler doesn't read from z[] before the loop runs.
In this case, we get horrible results: gcc5.x actually increments 3 extra pointers because it decides to use [reg] addressing modes instead of indexed. It doesn't know that the inline asm never actually references those memory operands using the addressing mode created by the constraint!
# gcc5.4 with dummy constraints like "=m" (*(__m128*)&z[i]) instead of "memory" clobber
.L11:
movaps (%rsi,%rax,4), %xmm0 # y, i, vectmp
addps (%rdi,%rax,4), %xmm0 # x, i, vectmp
movaps %xmm0, (%rdx,%rax,4) # vectmp, z, i
addl $4, %eax #, i
addq $16, %r10 #, ivtmp.19
addq $16, %r9 #, ivtmp.21
addq $16, %r8 #, ivtmp.22
cmpl %eax, %ecx # i, n
ja .L11 #,
r8, r9, and r10 are the extra pointers that the inline asm block doesn't use.
You can use a constraint that tells gcc an entire array of arbitrary length is an input or an output: "m" (*(const char (*)[]) pStr). This casts the pointer to a pointer-to-array (of unspecified size). See How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
If we want to use indexed addressing modes, we will have the base address of all three arrays in registers, and this form of constraint asks for the base address (of the whole array) as an operand, rather than a pointer to the current memory being operated on.
This actually works without any extra pointer or counter increments inside the loop: (avoiding a "memory" clobber, but still not easily unrollable by the compiler).
void add_asm1_dummy_whole_array(const float *restrict x, const float *restrict y,
float *restrict z, unsigned n) {
__m128 vectmp; // let the compiler choose a scratch register
for(int i=0; i<n; i+=4) {
__asm__ __volatile__ (
"movaps (%[y],%q[idx],4), %[vectmp]\n\t" // q modifier: 64bit version of a GP reg
"addps (%[x],%q[idx],4), %[vectmp]\n\t"
"movaps %[vectmp], (%[z],%q[idx],4)\n\t"
: [vectmp] "=x" (vectmp)
, "=m" (*(float (*)[]) z) // "=m" (z[i]) // gives worse code if the compiler prepares a reg we don't use
: [z] "r" (z), [y] "r" (y), [x] "r" (x),
[idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
, "m" (*(const float (*)[]) x),
"m" (*(const float (*)[]) y) // pointer to unsized array = all memory from this pointer
);
}
}
This gives us the same inner loop we got with a "memory" clobber:
.L19: # with clobbers like "m" (*(const struct {float a; float x[];} *) y)
movaps (%rsi,%rax,4), %xmm0 # y, i, vectmp
addps (%rdi,%rax,4), %xmm0 # x, i, vectmp
movaps %xmm0, (%rdx,%rax,4) # vectmp, z, i
addl $4, %eax #, i
cmpl %eax, %ecx # i, n
ja .L19 #,
It tells the compiler that each asm block reads or writes the entire arrays, so it may unnecessarily stop it from interleaving with other code (e.g. after fully unrolling with low iteration count). It doesn't stop unrolling, but the requirement to have each index value in a register does make it less effective. There's no way for this to end up with a 16(%rsi,%rax,4) addressing mode in a 2nd copy of this block in the same loop, because we're hiding the addressing from the compiler.
A version with m constraints that gcc can unroll:
#include <immintrin.h>
void add_asm1(float *x, float *y, float *z, unsigned n) {
// x, y, z are assumed to be aligned
__m128 vectmp; // let the compiler choose a scratch register
for(int i=0; i<n; i+=4) {
__asm__ __volatile__ (
// "movaps %[yi], %[vectmp]\n\t" // get the compiler to do this load instead
"addps %[xi], %[vectmp]\n\t"
"movaps %[vectmp], %[zi]\n\t"
// __m128 is a may_alias type so these casts are safe.
: [vectmp] "=x" (vectmp) // let compiler pick a stratch reg
,[zi] "=m" (*(__m128*)&z[i]) // actual memory output for the movaps store
: [yi] "0" (*(__m128*)&y[i]) // or [yi] "xm" (*(__m128*)&y[i]), and uncomment the movaps load
,[xi] "xm" (*(__m128*)&x[i])
//, [idx] "r" (i) // unrolling with this would need an insn for every increment by 4
);
}
}
Using [yi] as a +x input/output operand would be simpler, but writing it this way makes a smaller change for uncommenting the load in the inline asm, instead of letting the compiler get one value into registers for us.
When I compile your add_asm2 code with gcc (4.9.2) I get:
add_asm2:
.LFB0:
.cfi_startproc
xorl %eax, %eax
xorl %r8d, %r8d
testl %ecx, %ecx
je .L1
.p2align 4,,10
.p2align 3
.L5:
#APP
# 3 "add_asm2.c" 1
movaps (%rsi,%rax), %xmm0
addps (%rdi,%rax), %xmm0
movaps %xmm0, (%rdx,%rax)
# 0 "" 2
#NO_APP
addl $4, %r8d
addq $16, %rax
cmpl %r8d, %ecx
ja .L5
.L1:
rep; ret
.cfi_endproc
so it is not perfect (it uses a redundant register), but does use indexed loads...
GCC also has built-in vector extensions, which are even cross-platform:
typedef float v4sf __attribute__((vector_size(16)));
void add_vector(float *x, float *y, float *z, unsigned n) {
for(int i=0; i<n/4; i+=1) {
*(v4sf*)(z + 4*i) = *(v4sf*)(x + 4*i) + *(v4sf*)(y + 4*i);
}
}
On my gcc version 4.7.2 the generated assembly is:
.L28:
movaps (%rdi,%rax), %xmm0
addps (%rsi,%rax), %xmm0
movaps %xmm0, (%rdx,%rax)
addq $16, %rax
cmpq %rcx, %rax
jne .L28
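The same extension scales to wider vectors just by changing vector_size. Here is a sketch that is not from the original answer; it assumes the target supports AVX (compile with -mavx), that n is a multiple of 8, and that the pointers are 32-byte aligned:
typedef float v8sf __attribute__((vector_size(32)));

void add_vector_avx(float *x, float *y, float *z, unsigned n) {
    // With -mavx this becomes vmovaps/vaddps on YMM registers;
    // without AVX, GCC lowers each operation to two 128-bit halves.
    for(unsigned i=0; i<n/8; i+=1) {
        *(v8sf*)(z + 8*i) = *(v8sf*)(x + 8*i) + *(v8sf*)(y + 8*i);
    }
}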
I would like to translate this code using SSE intrinsics.
for (uint32_t i = 0; i < length; i += 4, src += 4, dest += 4)
{
uint32_t value = *(uint32_t*)src;
*(uint32_t*)dest = ((value >> 16) & 0xFFFF) | (value << 16);
}
Is anyone aware of an intrinsic to perform the 16-bit word swapping?
pshufb (SSSE3) should be faster than 2 shifts and an OR. Also, a slight modification to the shuffle mask would enable an endian conversion instead of just a word-swap (a sketch of such a mask follows the function below).
Stealing Paul R's function structure, just replacing the vector intrinsics:
void word_swapping_ssse3(uint32_t* dest, const uint32_t* src, size_t count)
{
size_t i;
__m128i shufmask = _mm_set_epi8(13,12, 15,14, 9,8, 11,10, 5,4, 7,6, 1,0, 3,2);
// _mm_set args go in big-endian order for some reason.
for (i = 0; i + 4 <= count; i += 4)
{
__m128i s = _mm_loadu_si128((__m128i*)&src[i]);
__m128i d = _mm_shuffle_epi8(s, shufmask);
_mm_storeu_si128((__m128i*)&dest[i], d);
}
for ( ; i < count; ++i) // handle residual elements
{
uint32_t w = src[i];
w = (w >> 16) | (w << 16);
dest[i] = w;
}
}
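For reference, a true byte reversal (endianness conversion) within each 32-bit element only needs a different mask. This is a sketch worked out from pshufb's byte-index semantics, not code from the original answer; it would replace the shuffle inside the loop above:
// Reverses the four bytes inside each 32-bit lane
// (hoist the mask out of the loop, like shufmask above).
// Remember that _mm_set_epi8 lists bytes from the highest (e15) down to the lowest (e0).
__m128i bswap32_mask = _mm_set_epi8(12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3);
__m128i d = _mm_shuffle_epi8(s, bswap32_mask);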
pshufb can have a memory operand, but it has to be the shuffle mask, not the data to be shuffled. So you can't use it as a shuffled-load. :/
gcc doesn't generate great code for the loop. The main loop is
# src: r8. dest: rcx. count: rax. shufmask: xmm1
.L16:
movq %r9, %rax
.L3: # first-iteration entry point
movdqu (%r8), %xmm0
leaq 4(%rax), %r9
addq $16, %r8
addq $16, %rcx
pshufb %xmm1, %xmm0
movups %xmm0, -16(%rcx)
cmpq %rdx, %r9
jbe .L16
With all that loop overhead, and needing a separate load and store instruction, throughput will only be 1 shuffle per 2 cycles. (8 uops, since cmp macro-fuses with jbe).
A faster loop would be
shl $2, %rax # uint count -> byte count
# check for %rax less than 16 and skip the vector loop
# cmp / jsomething
add %rax, %r8 # set up pointers to the end of the array
add %rax, %rcx
neg %rax # and count upwards toward zero
.loop:
movdqu (%r8, %rax), %xmm0
pshufb %xmm1, %xmm0
movups %xmm0, (%rcx, %rax) # IDK why gcc chooses movups for stores. Shorter encoding?
add $16, %rax
jl .loop
# ...
# scalar cleanup
movdqu loads can micro-fuse with complex addressing modes, unlike vector ALU ops, so all these instructions are single-uop except the store, I believe.
This should run at 1 cycle per iteration with some unrolling, since add can micro-fuse with jl. So the loop has 5 total uops. 3 of them are load/store ops, which have dedicated ports. Bottlenecks are: pshufb can only run on one execution port on Haswell (SnB/IvB can run pshufb on ports 1 and 5); one store per cycle (all microarchitectures); and finally, the 4 fused-domain uops per clock limit for Intel CPUs, which should be reachable barring cache misses on Nehalem and later (which have a uop loop buffer).
Unrolling would bring the total fused-domain uops per 16B down below 4. Incrementing pointers, instead of using complex addressing modes, would let the stores micro-fuse. (Reducing loop overhead is always good: letting the re-order buffer fill up with future iterations means the CPU has something to do when it hits a mispredict at the end of the loop and returns to other code.)
This is pretty much what you'd get by unrolling the intrinsics loops, as Elalfer rightly suggests would be a good idea. With gcc, try -funroll-loops if that doesn't bloat the code too much.
BTW, it's probably going to be better to byte-swap while loading or storing, mixed in with other code, rather than converting a buffer as a separate operation.
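A minimal sketch of that idea, assuming SSSE3 and a shuffle mask built like the one above (an illustration, not code from the answer): wrap the swap in a small helper and apply it at the point where the data is consumed, so the pshufb overlaps with whatever other work the surrounding loop does.
#include <stdint.h>
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8

// Load four uint32_t values and swap their 16-bit halves on the fly.
static inline __m128i load_wordswapped(const uint32_t *p, __m128i shufmask)
{
    return _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)p), shufmask);
}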
The scalar code in your question isn't really byte swapping (in the sense of endianness conversion, at least) - it's just swapping the high and low 16 bits within a 32 bit word. If this is what you want, though, then just re-use the solution to your previous question, with appropriate changes:
void byte_swapping(uint32_t* dest, const uint32_t* src, size_t count)
{
size_t i;
for (i = 0; i + 4 <= count; i += 4)
{
__m128i s = _mm_loadu_si128((__m128i*)&src[i]);
__m128i d = _mm_or_si128(_mm_slli_epi32(s, 16), _mm_srli_epi32(s, 16));
_mm_storeu_si128((__m128i*)&dest[i], d);
}
for ( ; i < count; ++i) // handle residual elements
{
uint32_t w = src[i];
w = (w >> 16) | (w << 16);
dest[i] = w;
}
}
Consider the following two programs that perform the same computations in two different ways:
// v1.c
#include <stdio.h>
#include <math.h>
int main(void) {
int i, j;
int nbr_values = 8192;
int n_iter = 100000;
float x;
for (j = 0; j < nbr_values; j++) {
x = 1;
for (i = 0; i < n_iter; i++)
x = sin(x);
}
printf("%f\n", x);
return 0;
}
and
// v2.c
#include <stdio.h>
#include <math.h>
int main(void) {
int i, j;
int nbr_values = 8192;
int n_iter = 100000;
float x[nbr_values];
for (i = 0; i < nbr_values; ++i) {
x[i] = 1;
}
for (i = 0; i < n_iter; i++) {
for (j = 0; j < nbr_values; ++j) {
x[j] = sin(x[j]);
}
}
printf("%f\n", x[0]);
return 0;
}
When I compile them using gcc 4.7.2 with -O3 -ffast-math and run on a Sandy Bridge box, the second program is twice as fast as the first one.
Why is that?
One suspect is the data dependency between successive iterations of the i loop in v1. However, I don't quite see what the full explanation might be.
(Question inspired by Why is my python/numpy example faster than pure C implementation?)
EDIT:
Here is the generated assembly for v1:
movl $8192, %ebp
pushq %rbx
LCFI1:
subq $8, %rsp
LCFI2:
.align 4
L2:
movl $100000, %ebx
movss LC0(%rip), %xmm0
jmp L5
.align 4
L3:
call _sinf
L5:
subl $1, %ebx
jne L3
subl $1, %ebp
.p2align 4,,2
jne L2
and for v2:
movl $100000, %r14d
.align 4
L8:
xorl %ebx, %ebx
.align 4
L9:
movss (%r12,%rbx), %xmm0
call _sinf
movss %xmm0, (%r12,%rbx)
addq $4, %rbx
cmpq $32768, %rbx
jne L9
subl $1, %r14d
jne L8
Ignore the loop structure altogether, and only think about the sequence of calls to sin. v1 does the following:
x <-- sin(x)
x <-- sin(x)
x <-- sin(x)
...
that is, each computation of sin( ) cannot begin until the result of the previous call is available; it must wait for the entirety of the previous computation. This means that for the 8192 × 100000 = 819,200,000 calls to sin, the total time required is roughly 819,200,000 times the latency of a single sin evaluation.
In v2, by contrast, you do the following:
x[0] <-- sin(x[0])
x[1] <-- sin(x[1])
x[2] <-- sin(x[2])
...
notice that each call to sin does not depend on the previous call. Effectively, the calls to sin are all independent, and the processor can begin on each as soon as the necessary register and ALU resources are available (without waiting for the previous computation to be completed). Thus, the time required is a function of the throughput of the sin function, not the latency, and so v2 can finish in significantly less time.
I should also note that DeadMG is right that v1 and v2 are formally equivalent, and in a perfect world the compiler would optimize both of them into a single chain of 100000 sin evaluations (or simply evaluate the result at compile time). Sadly, we live in an imperfect world.
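The same latency-versus-throughput effect can be reproduced with something as simple as a floating-point add chain; here is a toy sketch (it has nothing to do with sin itself, it only illustrates dependency chains):
// One long dependency chain: every add must wait for the previous result,
// so the loop runs at roughly (add latency) per element.
float sum_one_chain(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

// Four independent chains: the adds can overlap, so the loop runs closer to
// (add throughput) per element. Tail elements are ignored for brevity, and
// the different summation order changes rounding slightly.
float sum_four_chains(const float *a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}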
In the first example, it runs 100000 loops of sin, 8192 times.
In the second example, it runs 8192 loops of sin, 100000 times.
Other than that and storing the result differently, I don't see any difference.
However, what does make a difference is that the input is being changed for each loop in the second case. So I suspect what happens is that the sin value, at certain times in the loop, gets much easier to calculate. And that can make a big difference. Calculating sin is not entirely trivial, and it's a series calculation that loops until the exit condition is hit.
Firstly, I will prefix this by saying I don't think it is necessary to understand the functioning of the code below to make a sensible attempt to solve my problem. This is primarily an optimisation problem. The code is to understand what is being done.
I have the following somewhat optimised convolution main loop (which works):
for(int i=0; i<length-kernel_length; i+=4){
    acc = _mm_setzero_ps();
    for(int k=0; k<KERNEL_LENGTH; k+=4){
        int data_offset = i + k;
        for (int l = 0; l < 4; l++){
            data_block = _mm_load_ps(in_aligned[l] + data_offset);
            prod = _mm_mul_ps(kernel_reverse[k+l], data_block);
            acc = _mm_add_ps(acc, prod);
        }
    }
    _mm_storeu_ps(out+i, acc);
}
KERNEL_LENGTH is 4.
in_aligned is the input array (upon which the convolution is performed) repeated 4 times, with each repeat shifted one sample to the left of the others. This is so that every sample can be found at a 16-byte aligned location.
kernel_reverse is the reversed kernel, with every sample repeated 4 times to fill a 4-vector and is declared and defined as:
float kernel_block[4] __attribute__ ((aligned (16)));
__m128 kernel_reverse[KERNEL_LENGTH] __attribute__ ((aligned (16)));
// Repeat the kernel across the vector
for(int i=0; i<KERNEL_LENGTH; i++){
    kernel_block[0] = kernel[kernel_length - i - 1];
    kernel_block[1] = kernel[kernel_length - i - 1];
    kernel_block[2] = kernel[kernel_length - i - 1];
    kernel_block[3] = kernel[kernel_length - i - 1];
    kernel_reverse[i] = _mm_load_ps(kernel_block);
}
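(As an aside that is not part of the original code: _mm_set1_ps broadcasts one scalar across the whole vector, so the intermediate kernel_block array is not strictly necessary; an equivalent setup would be:)
// Equivalent kernel setup using a broadcast intrinsic.
for(int i=0; i<KERNEL_LENGTH; i++){
    kernel_reverse[i] = _mm_set1_ps(kernel[kernel_length - i - 1]);
}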
The code computes the algorithm correctly and pretty quickly too.
I compile the code with gcc -std=c99 -Wall -O3 -msse3 -mtune=core2
My question is this:
The loop is compiled to the machine code below. Inside this loop, a not-insignificant number of instructions are spent loading the kernel every time. The kernel does not change on each iteration of the loop and so can, in principle, be kept in SSE registers. As I understand it, there are sufficient registers to easily store the kernel (and indeed, the machine code doesn't suggest too much register pressure).
How do I persuade the compiler to not load the kernel on every loop?
I was expecting the compiler to do this automatically when the kernel length was set to be constant.
testl %edx, %edx
jle .L79
leaq (%rcx,%rcx,2), %rsi
movaps -144(%rbp), %xmm6
xorps %xmm2, %xmm2
leal -1(%rdx), %ecx
movaps -128(%rbp), %xmm5
xorl %eax, %eax
movaps -112(%rbp), %xmm4
leaq 0(%r13,%rsi,4), %rsi
shrl $2, %ecx
addq $1, %rcx
movaps -96(%rbp), %xmm3
salq $4, %rcx
.p2align 4,,10
.p2align 3
.L80:
movaps 0(%r13,%rax), %xmm0
movaps (%r14,%rax), %xmm1
mulps %xmm6, %xmm0
mulps %xmm5, %xmm1
addps %xmm2, %xmm0
addps %xmm1, %xmm0
movaps (%r9,%rax), %xmm1
mulps %xmm4, %xmm1
addps %xmm1, %xmm0
movaps (%rsi,%rax), %xmm1
mulps %xmm3, %xmm1
addps %xmm1, %xmm0
movups %xmm0, (%rbx,%rax)
addq $16, %rax
cmpq %rcx, %rax
jne .L80
.L79:
Edit: the full code listing is as follows:
#define KERNEL_LENGTH 4
int convolve_sse_in_aligned_fixed_kernel(float* in, float* out, int length,
float* kernel, int kernel_length)
{
float kernel_block[4] __attribute__ ((aligned (16)));
float in_aligned[4][length] __attribute__ ((aligned (16)));
__m128 kernel_reverse[KERNEL_LENGTH] __attribute__ ((aligned (16)));
__m128 data_block __attribute__ ((aligned (16)));
__m128 prod __attribute__ ((aligned (16)));
__m128 acc __attribute__ ((aligned (16)));
// Repeat the kernel across the vector
for(int i=0; i<KERNEL_LENGTH; i++){
int index = kernel_length - i - 1;
kernel_block[0] = kernel[index];
kernel_block[1] = kernel[index];
kernel_block[2] = kernel[index];
kernel_block[3] = kernel[index];
kernel_reverse[i] = _mm_load_ps(kernel_block);
}
/* Create a set of 4 aligned arrays
* Each array is offset by one sample from the one before
*/
for(int i=0; i<4; i++){
memcpy(in_aligned[i], (in+i), (length-i)*sizeof(float));
}
for(int i=0; i<length-kernel_length; i+=4){
acc = _mm_setzero_ps();
for(int k=0; k<KERNEL_LENGTH; k+=4){
int data_offset = i + k;
for (int l = 0; l < 4; l++){
data_block = _mm_load_ps(in_aligned[l] + data_offset);
prod = _mm_mul_ps(kernel_reverse[k+l], data_block);
acc = _mm_add_ps(acc, prod);
}
}
_mm_storeu_ps(out+i, acc);
}
// Need to do the last value as a special case
int i = length - kernel_length;
out[i] = 0.0;
for(int k=0; k<kernel_length; k++){
out[i] += in_aligned[0][i+k] * kernel[kernel_length - k - 1];
}
return 0;
}
The answer is that it is doing exactly what I wanted. The problem, it seems, was down to me being inept at reading the output from objdump -d. After modifying the question to use the output from gcc -S, as suggested by @PascalCuoq, the loop is notably easier to understand.
I left the question because somebody may value that point! (and indeed the code).