Why is my more complicated C loop faster?

I am looking at the performance of memchr-like functions and made an interesting observation.
This is check.c with 4 implementations that find the offset of a '\n' character in a string:
#include <stdlib.h>

size_t mem1(const char *s)
{
    const char *p = s;
    while (1)
    {
        const char x = *p;
        if (x == '\n') return (p - s);
        p++;
    }
}

size_t mem2(const char *s)
{
    const char *p = s;
    while (1)
    {
        const char x = *p;
        if (x <= '$' && (x == '\n' || x == '\0')) return (p - s);
        p++;
    }
}

size_t mem3(const char *s)
{
    const char *p = s;
    while (1)
    {
        const char x = *p;
        if (x == '\n' || x == '\0') return (p - s);
        p++;
    }
}

size_t mem4(const char *s)
{
    const char *p = s;
    while (1)
    {
        const char x = *p;
        if (x <= '$' && (x == '\n')) return (p - s);
        p++;
    }
}
I run these functions on a string of bytes which can be described by the Haskell expression (concat $ replicate 10000 "abcd") ++ "\n" ++ "hello" - that is, 10000 times abcd, then the newline to find, and then hello. Of course all 4 implementations return the same offset: 40000, as expected.
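(For anyone wanting to reproduce this, here is a minimal sketch of building that same input in C; the driver below is my own assumption, not the actual criterion harness:)

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

size_t mem1(const char *s);  /* from check.c above */

int main(void)
{
    /* (concat $ replicate 10000 "abcd") ++ "\n" ++ "hello" */
    char *buf = malloc(10000 * 4 + 7);  /* 40000 + "\nhello" + NUL */
    if (!buf) return 1;
    for (int i = 0; i < 10000; i++)
        memcpy(buf + i * 4, "abcd", 4);
    strcpy(buf + 40000, "\nhello");
    printf("offset = %zu\n", mem1(buf));  /* prints 40000 */
    free(buf);
    return 0;
}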
Interestingly, when compiled with gcc -O2, the run times on that string are:
mem1: 16 us
mem2: 12 us
mem3: 25 us
mem4: 16 us
(I'm using the criterion library to measure these times with statistical accuracy.)
I cannot explain this to myself. Why is mem2 so much faster than the others?
--
The assembly as generated by gcc -S -O2 -o check.asm check.c:
mem1:
.LFB14:
        cmpb $10, (%rdi)
        movq %rdi, %rax
        je .L9
.L6:
        addq $1, %rax
        cmpb $10, (%rax)
        jne .L6
        subq %rdi, %rax
        ret
.L9:
        xorl %eax, %eax
        ret

mem2:
.LFB15:
        movq %rdi, %rax
        jmp .L13
.L19:
        cmpb $10, %dl
        je .L14
.L11:
        addq $1, %rax
.L13:
        movzbl (%rax), %edx
        cmpb $36, %dl
        jg .L11
        testb %dl, %dl
        jne .L19
.L14:
        subq %rdi, %rax
        ret

mem3:
.LFB16:
        movzbl (%rdi), %edx
        testb %dl, %dl
        je .L26
        cmpb $10, %dl
        movq %rdi, %rax
        jne .L27
        jmp .L26
.L30:
        cmpb $10, %dl
        je .L23
.L27:
        addq $1, %rax
        movzbl (%rax), %edx
        testb %dl, %dl
        jne .L30
.L23:
        subq %rdi, %rax
        ret
.L26:
        xorl %eax, %eax
        ret

mem4:
.LFB17:
        cmpb $10, (%rdi)
        movq %rdi, %rax
        je .L38
.L36:
        addq $1, %rax
        cmpb $10, (%rax)
        jne .L36
        subq %rdi, %rax
        ret
.L38:
        xorl %eax, %eax
        ret
Any explanation is much appreciated!

My best guess is that it's to do with register dependency: if you look at the 3-instruction main loop in mem1, there is a circular dependency on rax. Naïvely, this means each instruction has to wait for the last one to finish; in practice, if instructions aren't retired quickly enough, the microarchitecture may run out of registers to rename and simply stall for a bit.
In mem2, the fact that there are 4 instructions in the loop - and possibly also that there's more of an explicit pipeline in the use of both rax and edx/dl - is probably giving the out-of-order execution hardware an easier time, so it ends up pipelining more efficiently.
I don't claim to be an expert so this may be complete nonsense, but based on what I've studied of Agner Fog's absolute goldmine of Intel optimisation details it doesn't seem an entirely unreasonable hypothesis.
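To make the loop-carried-dependency idea concrete, here is a contrived C sketch of the general effect (my own illustration, unrelated to the mem* functions): a reduction with a single accumulator forms one serial dependency chain, while splitting it into two accumulators gives the out-of-order hardware independent chains to overlap. The size of the effect is machine-dependent.

#include <stddef.h>

/* one accumulator: every addition depends on the previous one */
long sum_serial(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];              /* serial dependency chain through s */
    return s;
}

/* two accumulators: independent chains the CPU can overlap */
long sum_paired(const long *a, size_t n)
{
    long s0 = 0, s1 = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        s0 += a[i];             /* chain 1 */
        s1 += a[i + 1];         /* chain 2, independent of chain 1 */
    }
    if (n % 2)
        s0 += a[n - 1];         /* odd leftover element */
    return s0 + s1;
}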
Edit: Out of interest, I've tested mem1 and mem2 on my machine (Core 2 Duo E7500), compiled with -O2 -falign-functions=64 to exactly the assembly code shown above. Calling either function with the given string 1,000,000 times in a loop and using Linux's time, I get ~19 s for mem1 and ~18.8 s for mem2 - much less than the 25% difference on the newer microarchitecture. Guess it's time to buy an i5...

Your input is exactly the kind that makes mem2 faster. Every letter in the input apart from '\n' has a value larger than '$', so the first part of the condition (x <= '$') is false and the second part (x == '\n' || x == '\0') is never evaluated. If you used "####" instead of "abcd", I suspect the execution would become slower.
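A quick way to test that hypothesis is to build both inputs and time mem2 on each. This is a sketch using plain clock() rather than criterion, and it assumes mem2 from check.c is linked in:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

size_t mem2(const char *s);  /* from check.c */

static char *make_input(const char *fill)  /* 10000 * fill + "\nhello" */
{
    char *buf = malloc(10000 * 4 + 7);
    for (int i = 0; i < 10000; i++)
        memcpy(buf + i * 4, fill, 4);
    strcpy(buf + 40000, "\nhello");
    return buf;
}

int main(void)
{
    const char *fills[] = { "abcd", "####" };  /* '#' < '$' < 'a' */
    for (int f = 0; f < 2; f++) {
        char *s = make_input(fills[f]);
        clock_t t0 = clock();
        for (int i = 0; i < 10000; i++)
            if (mem2(s) != 40000) return 1;  /* sanity check + sink */
        printf("%s: %ld clocks\n", fills[f], (long)(clock() - t0));
        free(s);
    }
    return 0;
}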

With a cache, the test of mem1() takes the brunt of filling the cache.
Run the mem1() test first and then again last, and use the second time, as it reflects a primed cache like the other tests see. I'm confident it will be faster, giving a fairer comparison.
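Something along these lines (a sketch using clock_gettime instead of the asker's criterion setup):

#include <stdio.h>
#include <time.h>

size_t mem1(const char *s);  /* from check.c */

static double run_us(size_t (*fn)(const char *), const char *s)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile size_t r = fn(s);       /* volatile: keep the call alive */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)r;
    return (t1.tv_sec - t0.tv_sec) * 1e6
         + (t1.tv_nsec - t0.tv_nsec) / 1e3;
}

/* usage:
       run_us(mem1, s);              -- first pass primes the cache
       double us = run_us(mem1, s);  -- second pass is the fair time  */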

Related

Trying to translate a C function to x86_64 AT&T assembly

I've been trying to translate this function to assembly:
void foo (int a[], int n) {
    int i;
    int s = 0;
    for (i = 0; i < n; i++) {
        s += a[i];
        if (a[i] == 0) {
            a[i] = s;
            s = 0;
        }
    }
}
But something is going wrong.
That's what I've done so far:
.section .text
.globl foo
foo:
.L1:
        pushq %rbp
        movq %rsp, %rbp
        subq $16, %rsp
        movl $0, -16(%rbp)  /* s */
        movl $0, -8(%rbp)   /* i */
        jmp .L2
.L2:
        cmpl -8(%rbp), %esi
        jle .L4
        leave
        ret
.L3:
        addl $1, -8(%rbp)
        jmp .L2
.L4:
        movl -8(%rbp), %eax
        imull $4, %eax
        movslq %eax, %rax
        addq %rdi, %rax
        movl (%rax), %eax
        addl %eax, -16(%rbp)
        cmpl $0, %eax
        jne .L3
        /* if */
        leaq (%rax), %rdx
        movl -16(%rbp), %eax
        movl %eax, (%rdx)
        movl $0, -16(%rbp)
        jmp .L3
I am compiling the .s module together with a .c module that calls it with, for example, int nums[5] = {65, 23, 11, 0, 34}, and I'm getting back the same array instead of {65, 23, 11, 99, 34}.
Could someone help me?
Presumably you have a compiler that can generate AT&T syntax. It might be more instructive to look at what assembly output the compiler generates. Here's my re-formulation of your demo:
#include <stdio.h>

void foo (int a[], int n)
{
    for (int s = 0, i = 0; i < n; i++)
    {
        if (a[i] != 0)
            s += a[i];
        else
            a[i] = s, s = 0;
    }
}

int main (void)
{
    int nums[] = {65, 23, 11, 0, 34};
    int size = sizeof(nums) / sizeof(int);
    foo(nums, size);
    for (int i = 0; i < size; i++)
        fprintf(stdout, i < (size - 1) ? "%d, " : "%d\n", nums[i]);
    return (0);
}
Code compiled without optimizations is typically harder to work through than optimized code, since everything is loaded from and spilled back to memory. You won't learn much from it if you're investing time in learning how to write efficient assembly.
Compiling with the Godbolt compiler explorer with -O2 optimizations yields much more efficient code; it's also useful for cutting out unnecessary directives, labels, etc., that would be visual noise in this case.
In my experience, -O2 optimizations are clever enough to make you rethink your use of registers, refactoring, etc. -O3 can sometimes optimize too aggressively - unrolling loops, vectorizing, etc. - to be easy to follow.
Finally, for the case you have presented, there's a perfect compromise: -Os, which enables many of the optimizations of -O2, but not at the expense of increased code size. I'll paste the assembly here just for comparative purposes:
foo:
        xorl %eax, %eax
        xorl %ecx, %ecx
.L2:
        cmpl %eax, %esi
        jle .L7
        movl (%rdi,%rax,4), %edx
        testl %edx, %edx
        je .L3
        addl %ecx, %edx
        jmp .L4
.L3:
        movl %ecx, (%rdi,%rax,4)
.L4:
        incq %rax
        movl %edx, %ecx
        jmp .L2
.L7:
        ret
Remember that the x86-64 System V calling convention passes the pointer a in %rdi and the count n in %esi. Notice that your code never dereferences or indexes any element through %rdi. It's definitely worth stepping through the code - with pen and paper if it helps - to understand the branch conditions and how element a[i] is read and written.
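To help with that stepping-through, here's a rough C back-translation of the -Os loop above (my own rendering; the register mappings are noted in comments):

void foo(int *a /* %rdi */, int n /* %esi */)
{
    long i = 0;                /* %rax */
    int  s = 0;                /* %ecx */
    while (i < n) {            /* cmpl %eax, %esi ; jle .L7 */
        int v = a[i];          /* movl (%rdi,%rax,4), %edx  */
        if (v != 0)
            v += s;            /* addl %ecx, %edx           */
        else
            a[i] = s;          /* movl %ecx, (%rdi,%rax,4); v stays 0 */
        i++;                   /* incq %rax                 */
        s = v;                 /* movl %edx, %ecx           */
    }
}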
Curiously, using the inner loop of your code:

s += a[i];
if (a[i] == 0)
    a[i] = s, s = 0;

appears to generate more efficient code with -Os than the inner loop I used:
foo:
        xorl %eax, %eax
        xorl %edx, %edx
.L2:
        cmpl %eax, %esi
        jle .L6
        movl (%rdi,%rax,4), %ecx
        addl %ecx, %edx
        testl %ecx, %ecx
        jne .L3
        movl %edx, (%rdi,%rax,4)
        xorl %edx, %edx
.L3:
        incq %rax
        jmp .L2
.L6:
        ret
A reminder for me to keep things simple!

Are these inlining results common?

Due to university work, I have to investigate a simple optimization: inlining.
Here is the basic code:
#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>

#define ITER 1000
#define N 3000000

int i, j;
float x[N], y[N], z[N];

void add(float x, float y, float *z){
    *z = x + y;
}

void initialVersion(){
    struct timeval inicio, final;
    double time;
    gettimeofday(&inicio, 0);
    for(j = 0; j < ITER; j++){
        for(i = 0; i < N; i++){
            add(x[i], y[i], &z[i]);
        }
    }
    gettimeofday(&final, 0);
    time = (final.tv_sec - inicio.tv_sec + (final.tv_usec - inicio.tv_usec)/1.e6);
    printf("Time: %f\n", time);
}
And here is the code with inlining:
#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>

#define ITER 1000
#define N 3000000

int i, j;
float x[N], y[N], z[N];

void inliningVersion(){
    struct timeval inicio, final;
    double time;
    gettimeofday(&inicio, 0);
    for(j = 0; j < ITER; j++){
        for(i = 0; i < N; i++){
            z[i] = x[i] + y[i];
        }
    }
    gettimeofday(&final, 0);
    time = (final.tv_sec - inicio.tv_sec + (final.tv_usec - inicio.tv_usec)/1.e6);
    printf("Time: %f\n", time);
}
Compiling using the option -O0 with gcc, the results are 14.27 seconds for the basic version and 4.45 seconds for the version with inlining. Is that common? I executed the program 10 times and the results are always similar. What do you think?
Then, compiling with the option -O1, the results are similar for both versions (approximately 1.5 seconds), so I suppose gcc does the inlining for me at -O1.
By the way, I know that gettimeofday measures wall-clock time rather than just the time used by the program itself, but I am required to use that function specifically.
Thanks in advance!
Let's analyze the assembly output generated by GCC 7.2 (with -O0) for both versions of the code.
Without inlining
First, let's check how much work has to be done by the computer to achieve the task with a separate function:
void add(float x, float y, float *z){
    *z = x + y;
}

int main ()
{
    float x[100], y[100], z[100];
    for(int i = 0; i < 100; i++){
        add(x[i], y[i], &z[i]);
    }
}
For the above code, GCC produces an assembly as given below:
add(float, float, float*):
        pushq %rbp
        movq %rsp, %rbp
        movss %xmm0, -4(%rbp)
        movss %xmm1, -8(%rbp)
        movq %rdi, -16(%rbp)
        movss -4(%rbp), %xmm0
        addss -8(%rbp), %xmm0
        movq -16(%rbp), %rax
        movss %xmm0, (%rax)
        nop
        popq %rbp
        ret
main:
        pushq %rbp
        movq %rsp, %rbp
        subq $1224, %rsp
        movl $0, -4(%rbp)
.L4:
        cmpl $99, -4(%rbp)
        jg .L3
        leaq -1216(%rbp), %rax
        movl -4(%rbp), %edx
        movslq %edx, %rdx
        salq $2, %rdx
        addq %rax, %rdx
        movl -4(%rbp), %eax
        cltq
        movss -816(%rbp,%rax,4), %xmm0
        movl -4(%rbp), %eax
        cltq
        movl -416(%rbp,%rax,4), %eax
        movq %rdx, %rdi
        movaps %xmm0, %xmm1
        movl %eax, -1220(%rbp)
        movss -1220(%rbp), %xmm0
        call add(float, float, float*)
        addl $1, -4(%rbp)
        jmp .L4
.L3:
        movl $0, %eax
        leave
        ret
The processing part of the code takes approximately 32 instructions (the instructions between labels .L4 and .L3, plus those of the add function).
A large majority of the instructions are used for making the function call.
A simplified way to understand how a function call works is:
1. arguments are pushed onto the call stack
2. the return address is pushed onto the call stack
3. the function is called
4. a copy of the frame pointer is made
5. room is made for locals on the stack
6. the actual function code is executed
7. the state is restored as it was before the function call
8. control returns to the caller
The above steps (except the 6th) take additional instructions to do the required processing. This is called the function call overhead.
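One way to see that overhead directly (my own sketch, not from the question) is to keep one copy of the function out of line with GCC's noinline attribute and let the other be inlined, then time both loops. Compile with -O1 or higher so the inline copy really is inlined:

#include <stdio.h>
#include <time.h>

#define N 3000000

__attribute__((noinline))            /* GCC: force a real call */
static void add_call(float x, float y, float *z) { *z = x + y; }

static inline void add_inline(float x, float y, float *z) { *z = x + y; }

static float x[N], y[N], z[N];

int main(void)
{
    clock_t t0 = clock();
    for (int i = 0; i < N; i++) add_call(x[i], y[i], &z[i]);
    clock_t t1 = clock();
    for (int i = 0; i < N; i++) add_inline(x[i], y[i], &z[i]);
    clock_t t2 = clock();
    printf("call: %ld clocks, inline: %ld clocks\n",
           (long)(t1 - t0), (long)(t2 - t1));
    printf("z[0]=%f z[N-1]=%f\n", z[0], z[N - 1]);  /* keep z observable */
    return 0;
}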
With inlining
Now let's check how much work the computer has to do if the function was inlined.
int main ()
{
    float x[100], y[100], z[100];
    for(int i = 0; i < 100; i++){
        z[i] = x[i] + y[i];
    }
}
For the above code, GCC produces an assembly output as given below:
main:
        pushq %rbp
        movq %rsp, %rbp
        subq $1096, %rsp
        movl $0, -4(%rbp)
.L3:
        cmpl $99, -4(%rbp)
        jg .L2
        movl -4(%rbp), %eax
        cltq
        movss -416(%rbp,%rax,4), %xmm1
        movl -4(%rbp), %eax
        cltq
        movss -816(%rbp,%rax,4), %xmm0
        addss %xmm1, %xmm0
        movl -4(%rbp), %eax
        cltq
        movss %xmm0, -1216(%rbp,%rax,4)
        addl $1, -4(%rbp)
        jmp .L3
.L2:
        movl $0, %eax
        leave
        ret
The processing code (the instructions between labels .L3 and .L2) has around 14 instructions. In this assembly output, all the instructions responsible for making the function call are gone, which saves a considerable number of CPU cycles.
In general, the overhead of a function call is irrelevant when the function's running time is several times larger than the overhead itself. In your code, the running time of the function body is tiny, so the function call overhead becomes significant.
If you use the -O1 flag, the compiler indeed does the inlining for you. You can verify this by checking the assembly generated with -O1, or you can directly check the GCC manual for the list of optimizations enabled at -O1.
You can generate assembly output using the -S flag, or you can do it online with Godbolt (the assembly outputs for this post were taken from there).

Does the C compiler combine if statements if they return the same thing?

I'm trying to understand how a compiler will optimize two if statements that return the same value. Consider the following code at the top of a function:
if (some_ptr == NULL) {
    return -1;
}
if (some_other_ptr == NULL) {
    return -1;
}
Will the two if statements be combined into one check that is equivalent to:
if (some_ptr == NULL || some_other_ptr == NULL) {
    return -1;
}
While the comments emphasize that this behavior is compiler implementation dependent, a look at a specific compiler is helpful in understanding this.
Using the test program:
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[]) {
    srand(time(NULL));
    char *some_ptr = (char *) rand();
    char *some_other_ptr = (char *) rand();
    if (some_ptr == NULL) {
        return -1;
    }
    if (some_other_ptr == NULL) {
        return -1;
    }
    return 0;
}
Using clang on my laptop running OS X, with no optimization (-O0 flag) the assembly output follows the input code closely with no shortcuts.
        movslq %eax, %rcx
        movq %rcx, -32(%rbp)
        cmpq $0, -24(%rbp)
        jne LBB0_2
## BB#1:
        movl $-1, -4(%rbp)
        jmp LBB0_5
LBB0_2:
        cmpq $0, -32(%rbp)
        jne LBB0_4
## BB#3:
        movl $-1, -4(%rbp)
        jmp LBB0_5
LBB0_4:
        movl $0, -4(%rbp)
LBB0_5:
        movl -4(%rbp), %eax
        addq $32, %rsp
        popq %rbp
        retq
        .cfi_endproc
But compiling with the highest optimization (-O3 flag) results in some different code.
        movl %eax, %ecx
        movl $-1, %eax
        testl %ebx, %ebx
        je LBB0_2
## BB#1:
        cmpl $1, %ecx
        sbbl %eax, %eax
LBB0_2:
        addq $8, %rsp
        popq %rbx
        popq %rbp
        retq
        .cfi_endproc
In either case, with my version of clang, the compiler never ORs the two boolean results together; even in the optimized code, two comparisons are made, via the testl and cmpl instructions.
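As an aside, that cmpl $1 / sbbl pair is a branchless idiom worth knowing: cmpl $1, %ecx sets the carry flag exactly when %ecx is 0 (an unsigned borrow from computing ecx - 1), and sbbl %eax, %eax then computes %eax = -CF. Roughly in C:

/* what "cmpl $1, %ecx ; sbbl %eax, %eax" computes:
   CF = (ecx < 1) unsigned, i.e. CF = (ecx == 0),
   then eax = eax - eax - CF = -CF                 */
int sbb_idiom(unsigned ecx)
{
    return ecx == 0 ? -1 : 0;  /* -1 when the second pointer was NULL */
}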
You could write a compiler that has this behavior if you wanted to!

understanding pointers and casting in assembly

I was given a function in assembly which basically converted uppercase letters to lowercase letters. Here is some of the assembly,
Q1:
        pushq %rbp
        movq %rsp, %rbp
        subq $24, %rsp
        movq %rdi, -24(%rbp)
        movl $0, -4(%rbp)
        movl $0, -8(%rbp)
        jmp .L2
.L2:
        movl -4(%rbp), %edx
        movq -24(%rbp), %rax
        addq %rdx, %rax
        movzbl (%rax), %eax
        testb %al, %al
        jne .L4
        ...
Much of the rest is repetitive, but .L2 is what's really confusing me. This is my logic so far:
We store param1 at -24(%rbp). We create local1 and local2, set them both to 0, and then jump to .L2. I move local1 into %edx and param1 into %rax. Now this is where things get confusing for me.
I was told that, after the addq line, local1 ends up being a pointer into param1. I just reasoned it as: add local1 + param1 and store the result into %rax. How is that possible?
Next is movzbl. From my understanding we dereference %rax, so we get something like eax = (int) rax.
I was also told to think of it as converting a char to an int. Which one is true, and how do I know that I'm typecasting? What if %rax didn't have parentheses around it? Is it an int because it's 4 bytes and %eax is a 32-bit register? Thank you in advance for your help, I'm kind of lost here....
local1 is not a pointer, it's an index (a counter).
That code is doing something like:
void toupper(char* text)
{
    int i = 0;  /* at rbp-4 */
    int j = 0;  /* unused, at rbp-8 */
    int ch;     /* in eax */
    while ((ch = *(text + i)) != 0)
    {
        ...
    }
}
Note that in C pointer arithmetic *(text + i) is of course equivalent to text[i].
Yes, the movzbl is converting an unsigned char to an int; you can see that from the instruction name itself: MOVe Zero-extended Byte to Long.
The parentheses denote pointer dereferencing.
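In C terms, that byte load plus zero extension looks roughly like this (my own illustrative helper, not from the original code):

/* rough C equivalent of "movzbl (%rax), %eax": load one byte
   and zero-extend it into a 32-bit int                       */
int load_byte_zero_extended(const char *p)
{
    return (unsigned char)*p;  /* the cast is the "z" in movzbl */
}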

GCC optimization missed opportunity

I'm compiling this C code:
int mode; // use aa if true, else bb
int aa[2];
int bb[2];
inline int auto0() { return mode ? aa[0] : bb[0]; }
inline int auto1() { return mode ? aa[1] : bb[1]; }
int slow() { return auto1() - auto0(); }
int fast() { return mode ? aa[1] - aa[0] : bb[1] - bb[0]; }
Both slow() and fast() functions are meant to do the same thing, though fast() does it with one branch statement instead of two. I wanted to check if GCC would collapse the two branches into one. I've tried this with GCC 4.4 and 4.7, with various levels of optimization such as -O2, -O3, -Os, and -Ofast. It always gives the same strange results:
slow():
        movl mode(%rip), %ecx
        testl %ecx, %ecx
        je .L10
        movl aa+4(%rip), %eax
        movl aa(%rip), %edx
        subl %edx, %eax
        ret
.L10:
        movl bb+4(%rip), %eax
        movl bb(%rip), %edx
        subl %edx, %eax
        ret

fast():
        movl mode(%rip), %esi
        testl %esi, %esi
        jne .L18
        movl bb+4(%rip), %eax
        subl bb(%rip), %eax
        ret
.L18:
        movl aa+4(%rip), %eax
        subl aa(%rip), %eax
        ret
Indeed, only one branch is generated in each function. However, slow() seems to be inferior in a surprising way: it uses one extra load in each branch, for aa[0] and bb[0]. The fast() code uses them straight from memory in the subls without loading them into a register first. So slow() uses one extra register and one extra instruction per call.
A simple micro-benchmark shows that calling fast() one billion times takes 0.7 seconds, vs. 1.1 seconds for slow(). I'm using a Xeon E5-2690 at 2.9 GHz.
Why should this be? Can you tweak my source code somehow so that GCC does a better job?
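For reference, a harness along these lines is what I mean by a simple micro-benchmark (a reconstruction, not the exact code; the indirect call and the printed sink keep the compiler from optimizing the loop away):

#include <stdio.h>
#include <time.h>

int slow(void);  /* from the code above */
int fast(void);

static double bench_sec(int (*fn)(void), long iters)
{
    struct timespec t0, t1;
    long sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        sink += fn();                /* indirect call defeats inlining */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("(sink=%ld) ", sink);     /* keep the result observable */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("slow: %.2f s\n", bench_sec(slow, 1000000000L));
    printf("fast: %.2f s\n", bench_sec(fast, 1000000000L));
    return 0;
}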
Edit: here are the results with clang 4.2 on Mac OS:
slow():
        movq _aa@GOTPCREL(%rip), %rax   ; rax = aa (both ints at once)
        movq _bb@GOTPCREL(%rip), %rcx   ; rcx = bb
        movq _mode@GOTPCREL(%rip), %rdx ; rdx = mode
        cmpl $0, (%rdx)                 ; mode == 0 ?
        leaq 4(%rcx), %rdx              ; rdx = bb[1]
        cmovneq %rax, %rcx              ; if (mode != 0) rcx = aa
        leaq 4(%rax), %rax              ; rax = aa[1]
        cmoveq %rdx, %rax               ; if (mode == 0) rax = bb
        movl (%rax), %eax               ; eax = xx[1]
        subl (%rcx), %eax               ; eax -= xx[0]

fast():
        movq _mode@GOTPCREL(%rip), %rax ; rax = mode
        cmpl $0, (%rax)                 ; mode == 0 ?
        je LBB1_2                       ; if (mode != 0) {
        movq _aa@GOTPCREL(%rip), %rcx   ; rcx = aa
        jmp LBB1_3                      ; } else {
LBB1_2:                                 ; // (mode == 0)
        movq _bb@GOTPCREL(%rip), %rcx   ; rcx = bb
LBB1_3:                                 ; }
        movl 4(%rcx), %eax              ; eax = xx[1]
        subl (%rcx), %eax               ; eax -= xx[0]
Interesting: clang generates branchless conditionals for slow() but one branch for fast()! On the other hand, slow() does three loads (two of which are speculative, one will be unnecessary) vs. two for fast(). The fast() implementation is more "obvious," and as with GCC it's shorter and uses one less register.
GCC 4.7 on Mac OS generally suffers the same issue as on Linux. Yet it uses the same "load 8 bytes then twice extract 4 bytes" pattern as Clang on Mac OS. That's sort of interesting, but not very relevant, as the original issue of emitting subl with two registers rather than one memory and one register is the same on either platform for GCC.
The reason is that in the initial intermediate code, emitted for slow(), the memory load and the subtraction are in different basic blocks:
slow ()
{
  int D.1405;
  int mode.3;
  int D.1402;
  int D.1379;

  # BLOCK 2 freq:10000
  mode.3_5 = mode;
  if (mode.3_5 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

  # BLOCK 3 freq:5000
  D.1402_6 = aa[1];
  D.1405_10 = aa[0];
  goto <bb 5>;

  # BLOCK 4 freq:5000
  D.1402_7 = bb[1];
  D.1405_11 = bb[0];

  # BLOCK 5 freq:10000
  D.1379_3 = D.1402_17 - D.1405_12;
  return D.1379_3;
}
whereas in fast() they are in the same basic block:
fast ()
{
  int D.1377;
  int D.1376;
  int D.1374;
  int D.1373;
  int mode.1;
  int D.1368;

  # BLOCK 2 freq:10000
  mode.1_2 = mode;
  if (mode.1_2 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

  # BLOCK 3 freq:3900
  D.1373_3 = aa[1];
  D.1374_4 = aa[0];
  D.1368_5 = D.1373_3 - D.1374_4;
  goto <bb 5>;

  # BLOCK 4 freq:6100
  D.1376_6 = bb[1];
  D.1377_7 = bb[0];
  D.1368_8 = D.1376_6 - D.1377_7;

  # BLOCK 5 freq:10000
  return D.1368_1;
}
GCC relies on the instruction combining pass to handle cases like this (i.e., apparently not on the peephole optimization pass), and combining works within the scope of a single basic block. That's why the subtraction and the load are combined into a single insn in fast(), and aren't even considered for combining in slow().
Later, in the basic block reordering pass, the subtraction in slow() is duplicated and moved into the basic blocks which contain the loads. Now there's an opportunity for the combiner to, well, combine the load and the subtraction, but unfortunately the combiner pass is not run again (and perhaps it cannot be run that late in the compilation process, with hard registers already allocated and so on).
I don't have an answer as to why GCC is unable to optimize the code the way you want it to, but I have a way to re-organize your code to achieve similar performance. Instead of organizing your code the way you have done so in slow() or fast(), I would recommend that you define an inline function that returns either aa or bb based on mode without needing a branch:
inline int * xx () { static int *xx[] = { bb, aa }; return xx[!!mode]; }
inline int kwiky(int *xx) { return xx[1] - xx[0]; }
int kwik() { return kwiky(xx()); }
When compiled by GCC 4.7 with -O3:
        movl mode, %edx
        xorl %eax, %eax
        testl %edx, %edx
        setne %al
        movl xx.1369(,%eax,4), %edx
        movl 4(%edx), %eax
        subl (%edx), %eax
        ret
With the definition of xx(), you can redefine auto0() and auto1() like so:
inline int auto0() { return xx()[0]; }
inline int auto1() { return xx()[1]; }
And, from this, you should see that slow() now compiles into code similar or identical to kwik().
Have you tried modifying the compiler's internal parameters (--param name=value in the man page)? Those are not changed by any optimization level (with three minor exceptions).
Some of them control code reduction/deduplication.
For some optimizations in this section you can read things like "larger values can exponentially increase compilation time".
