Why is it that the code:
for( i = 0, j = 0; i < 4 , j < 3; i++, j++)
is slower than
for( i = 0, j = 0; i < 4 && j < 3; i++, j++)
Elaborating on that, some users proposed that two if statements take more time than a single if statement with an && operator. I tested it without for loops, and it is not true: two if statements are faster than a single one with an && operator.
The first code is not slower, at least in gcc without optimization; in fact, it should be faster.
When you compile both codes and disassemble them, you will find this for the first code:
cmpl $0x2,-0x8(%rbp)
jle 26 <main+0x26>
And this for the second one:
cmpl $0x3,-0x4(%rbp)
jg 44 <main+0x44>
cmpl $0x2,-0x8(%rbp)
jle 26 <main+0x26>
In the first example, gcc evaluates just the second part, because the first one has no effect and is not used as the loop condition. In the second one, it has to check the first part and then, if that is true, check the second one.
So, in the general case, the first example should be faster than the second one. If you find the first slower, your way of measuring it was probably not 100% correct.
There may be no change in execution time, but the number of iterations may differ, because:
A comma-separated condition in a for loop evaluates to the value of its last operand. So whichever condition you write first is evaluated and discarded, and only the second one actually controls the loop: i < 4, j < 3 will always check only j < 3, whereas i < 4 && j < 3 will evaluate both operands and continue only if both conditions are true.
If we check the assembly of the code, we can see the difference.
Program:
#include <stdio.h>
int main()
{
int x,y;
for(x=0,y=0;x<4,y<5;x++,y++);
printf("New one");
for(x=0,y=0;x<4 && y<5;x++,y++);
return 0;
}
Command to get the assembly: gcc -S <program name>
Assembly
.file "for1.c"
.section .rodata
.LC0:
.string "New one"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
andl $-16, %esp
subl $32, %esp
movl $0, 24(%esp)
movl $0, 28(%esp)
jmp .L2
.L3:
addl $1, 24(%esp)
addl $1, 28(%esp)
.L2:
cmpl $4, 28(%esp) //Here only one condition
jle .L3
movl $.LC0, (%esp)
call printf
movl $0, 24(%esp)
movl $0, 28(%esp)
jmp .L4
.L6:
addl $1, 24(%esp)
addl $1, 28(%esp)
.L4:
cmpl $3, 24(%esp) //First Condition
jg .L7
cmpl $4, 28(%esp) //Second Condition
jle .L6
.L7:
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",@progbits
So, it is clear that if we have two conditions, it will take more time.
The first option is two ifs; the second option is a mathematical expression plus one if, which is usually faster. Here you save one if by doing a calculation, which costs less processing time.
first option -> if() && if()
second option -> if(() && ())
I've been trying to translate this function to assembly:
void foo (int a[], int n) {
int i;
int s = 0;
for (i=0; i<n; i++) {
s += a[i];
if (a[i] == 0) {
a[i] = s;
s = 0;
}
}
}
But something is going wrong.
That's what I've done so far:
.section .text
.globl foo
foo:
.L1:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $0, -16(%rbp) /*s*/
movl $0, -8(%rbp) /*i*/
jmp .L2
.L2:
cmpl -8(%rbp), %esi
jle .L4
leave
ret
.L3:
addl $1, -8(%rbp)
jmp .L2
.L4:
movl -8(%rbp), %eax
imull $4, %eax
movslq %eax, %rax
addq %rdi, %rax
movl (%rax), %eax
addl %eax, -16(%rbp)
cmpl $0, %eax
jne .L3
/* if */
leaq (%rax), %rdx
movl -16(%rbp), %eax
movl %eax, (%rdx)
movl $0, -16(%rbp)
jmp .L3
I am compiling the .s module together with a .c module that calls it with, for example, int nums[5] = {65, 23, 11, 0, 34}, and I'm getting back the same array instead of {65, 23, 11, 99, 34}.
Could someone help me?
Presumably you have a compiler that can generate AT&T syntax. It might be more instructive to look at what assembly output the compiler generates. Here's my re-formulation of your demo:
#include <stdio.h>
void foo (int a[], int n)
{
for (int s = 0, i = 0; i < n; i++)
{
if (a[i] != 0)
s += a[i];
else
a[i] = s, s = 0;
}
}
int main (void)
{
int nums[] = {65, 23, 11, 0, 34};
int size = sizeof(nums) / sizeof(int);
foo(nums, size);
for (int i = 0; i < size; i++)
fprintf(stdout, i < (size - 1) ? "%d, " : "%d\n", nums[i]);
return (0);
}
Code compiled without optimizations enabled is typically harder to work through than optimized code, since it loads from and spills results to memory at every step. You won't learn much from it if you're investing time in learning how to write efficient assembly.
Compiling with the Godbolt compiler explorer with -O2 optimizations yields much more efficient code; it's also useful for cutting out unnecessary directives, labels, etc., that would be visual noise in this case.
In my experience, -O2 optimizations are clever enough to make you rethink your use of registers, refactoring, etc. -O3 can sometimes optimize too aggressively - unrolling loops, vectorizing, etc. - to follow easily.
Finally, for the case you have presented, there's a perfect compromise: -Os, which enables many of the optimizations of -O2, but not at the expense of increased code size. I'll paste the assembly here just for comparative purposes:
foo:
xorl %eax, %eax
xorl %ecx, %ecx
.L2:
cmpl %eax, %esi
jle .L7
movl (%rdi,%rax,4), %edx
testl %edx, %edx
je .L3
addl %ecx, %edx
jmp .L4
.L3:
movl %ecx, (%rdi,%rax,4)
.L4:
incq %rax
movl %edx, %ecx
jmp .L2
.L7:
ret
Remember that the calling convention passes the pointer to a in %rdi and the count n in %rsi. Notice that your code does not 'dereference' or 'index' any elements through %rdi. It's definitely worth stepping through the code - even with pen and paper if it helps - to understand the branch conditions and how element a[i] is read and written.
Curiously, using the inner loop of your code:
s += a[i];
if (a[i] == 0)
a[i] = s, s = 0;
appears to generate more efficient code with -Os than the inner loop I used:
foo:
xorl %eax, %eax
xorl %edx, %edx
.L2:
cmpl %eax, %esi
jle .L6
movl (%rdi,%rax,4), %ecx
addl %ecx, %edx
testl %ecx, %ecx
jne .L3
movl %edx, (%rdi,%rax,4)
xorl %edx, %edx
.L3:
incq %rax
jmp .L2
.L6:
ret
A reminder for me to keep things simple!
Do jg and jge jump to the following label if the second register in the statement (e.g., %edi in cmpl %esi, %edi) is greater than or equal to (or only greater than) the first register %esi? And is the result of the comparison stored in the second register and used to determine whether the jump is taken?
sum1.c
int sum(int first, int last)
{
int sum = 0;
int in_between;
for (in_between = first; in_between <= last; in_between++)
{
sum += in_between;
}
return sum;
}
sum1.s:
.file "sum1.c"
.text
.globl sum
.type sum, @function
sum:
.LFB0:
.cfi_startproc
movl %edi, %edx ; puts first into in_between
movl $0, %eax ; sets sum to zero
cmpl %esi, %edi ; compares first and last, checking if first - last < 0
jg .L3 ; jumps to .L3 if first is greater than last, otherwise
;executes .L6
.L6:
addl %edx, %eax ;adds in_between to sum
addl $1, %edx ; increments in_between
cmpl %edx, %esi ; makes the comparison between in_between and last,
;last < in_between
jge .L6 ; jumps to .L6 if last is greater than or equal to
;in_between. (the result jump uses is stored in last).
.L3:
rep
ret ;returns the value stored in %eax register.
.cfi_endproc
.LFE0:
.size sum, .-sum
.ident "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-17)"
.section .note.GNU-stack,"",@progbits
The CPU updates the processor status (PS, sometimes PSL) register after [nearly] every instruction. The CMPL instruction does an implicit subtraction and updates the flags in the PS register. You'd get the same effect from a SUBL instruction, except that SUBL puts the result in the destination operand while CMPL does not.
The Jxx instructions branch conditionally depending upon the value of the PS register.
Is there a gcc pragma or something I can use to force gcc to generate branch-free instructions on a specific section of code?
I have a piece of code that I want gcc to compile to branch-free code using cmov instructions:
int foo(int *a, int n, int x) {
int i = 0, j = n;
while (i < n) {
#ifdef PREFETCH
__builtin_prefetch(a+16*i + 15);
#endif /* PREFETCH */
j = (x <= a[i]) ? i : j;
i = (x <= a[i]) ? 2*i + 1 : 2*i + 2;
}
return j;
}
and, indeed, it does so:
morin@soprano$ gcc -O4 -S -c test.c -o -
.file "test.c"
.text
.p2align 4,,15
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
testl %esi, %esi
movl %esi, %eax
jle .L2
xorl %r8d, %r8d
jmp .L3
.p2align 4,,10
.p2align 3
.L6:
movl %ecx, %r8d
.L3:
movslq %r8d, %rcx
movl (%rdi,%rcx,4), %r9d
leal (%r8,%r8), %ecx # put 2*i in ecx
leal 1(%rcx), %r10d # put 2*i+1 in r10d
addl $2, %ecx # put 2*i+2 in ecx
cmpl %edx, %r9d
cmovge %r10d, %ecx # put 2*i+1 in ecx if appropriate
cmovge %r8d, %eax # set j = i if appropriate
cmpl %esi, %ecx
jl .L6
.L2:
rep ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",@progbits
(Yes, I realize the loop is a branch, but I'm talking about the choice operators inside the loop.)
Unfortunately, when I enable the __builtin_prefetch call, gcc generates branchy code:
morin@soprano$ gcc -DPREFETCH -O4 -S -c test.c -o -
.file "test.c"
.text
.p2align 4,,15
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
testl %esi, %esi
movl %esi, %eax
jle .L7
xorl %ecx, %ecx
jmp .L5
.p2align 4,,10
.p2align 3
.L3:
movl %ecx, %eax # this is the x <= a[i] branch
leal 1(%rcx,%rcx), %ecx
cmpl %esi, %ecx
jge .L11
.L5:
movl %ecx, %r8d # this is the main branch
sall $4, %r8d # setup the prefetch
movslq %r8d, %r8 # setup the prefetch
prefetcht0 60(%rdi,%r8,4) # do the prefetch
movslq %ecx, %r8
cmpl %edx, (%rdi,%r8,4) # compare x with a[i]
jge .L3
leal 2(%rcx,%rcx), %ecx # this is the x > a[i] branch
cmpl %esi, %ecx
jl .L5
.L11:
rep ret
.L7:
.p2align 4,,5
rep ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",@progbits
I've tried using __attribute__((optimize("if-conversion2"))) on this function, but that has no effect.
The reason I care so much is that I have hand-edited the compiler-generated branch-free code (from the first example) to include the prefetcht0 instructions, and it runs considerably faster than both of the versions gcc produces.
If you really rely on that level of optimization, you have to write your own assembler stubs.
The reason is that even a modification elsewhere in the code might change the code the compiler emits (this is not gcc-specific). Also, a different version of gcc or different options (e.g. -fomit-frame-pointer) can change the code dramatically.
You should really only do this if you have to. Other influences might have much more impact, like cache configuration, memory allocation (DRAM page/bank), execution order relative to concurrently running programs, CPU affinity, and much more. Play with compiler optimizations first. Command-line options can be found in the docs (you did not post the version used, so I cannot be more specific).
A (serious) alternative would be to use clang/llvm. Or just help the gcc team improve their optimizers. You would not be the first. Note also that gcc has made massive improvements specifically for ARM over the last versions.
It looks like gcc has trouble generating branch-free code for variables used in loop conditions and post-conditions, together with the constraint of keeping temporary registers alive across a pseudo-function intrinsic call.
There is something suspicious: the generated code for your function is different when using -funroll-all-loops and -fguess-branch-probability. It generates many return instructions. It smells like a little bug in gcc, around the RTL pass of the compiler or the simplification of blocks of code.
The following code is branch-less in both cases. This would be a good reason to submit a bug report to GCC. At level -O3, GCC should always generate the same code.
int foo( int *a, int n, int x) {
int c, i = 0, j = n;
while (i < n) {
#ifdef PREFETCH
__builtin_prefetch(a+16*i + 15);
#endif /* PREFETCH */
c = (x > a[i]);
j = c ? j : i;
i = 2*i + 1 + c;
}
return j;
}
which generates this
.cfi_startproc
testl %esi, %esi
movl %esi, %eax
jle .L4
xorl %ecx, %ecx
.p2align 4,,10
.p2align 3
.L3:
movslq %ecx, %r8
cmpl %edx, (%rdi,%r8,4)
setl %r8b
cmovge %ecx, %eax
movzbl %r8b, %r8d
leal 1(%r8,%rcx,2), %ecx
cmpl %ecx, %esi
jg .L3
.L4:
rep ret
.cfi_endproc
and this
.cfi_startproc
testl %esi, %esi
movl %esi, %eax
jle .L5
xorl %ecx, %ecx
.p2align 4,,10
.p2align 3
.L4:
movl %ecx, %r8d
sall $4, %r8d
movslq %r8d, %r8
prefetcht0 60(%rdi,%r8,4)
movslq %ecx, %r8
cmpl %edx, (%rdi,%r8,4)
setl %r8b
testb %r8b, %r8b
movzbl %r8b, %r9d
cmove %ecx, %eax
leal 1(%r9,%rcx,2), %ecx
cmpl %ecx, %esi
jg .L4
.L5:
rep ret
.cfi_endproc
My question is: does it matter whether I declare a variable outside the loop and reinitialize it every time inside the loop, or declare and initialize it inside the loop? Basically, is there any difference between these two forms (performance, standard, etc.)?
Method 1
int a,count=0;
while(count<10)
a=0;
Method 2
int count=0;
while(count<10)
int a=0;
Please assume that this is only part of a bigger program, and that the body of the loop requires the variable a to have a value of 0 every time. So, will there be any difference in the execution times of the two methods?
Yes, it does matter. In the second case,
int count=0;
while(count<10)
int a=0;
a can't be referenced outside of the while loop. It has block scope: the portion of the program text in which the variable can be referenced.
Another thing, which Jonathan Leffler pointed out in his answer, is that both of these loops are infinite loops. Also, most importantly, the second snippet would not compile without {} (in C), because a variable definition/declaration is not a statement and cannot appear as the body of a loop.
int count =0;
while(count++ < 10)
{
int a=0;
}
This
void f1(void)
{
int a, count = 10;
while (count--)
a = 0;
}
void f2(void)
{
int count = 10;
while (count--)
{
int a = 0;
}
}
results in this (using non optimising gcc (Debian 4.4.5-8) 4.4.5):
.globl f1
.type f1, @function
f1:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl $10, -4(%rbp)
jmp .L2
.L3:
movl $0, -8(%rbp)
.L2:
cmpl $0, -4(%rbp)
setne %al
subl $1, -4(%rbp)
testb %al, %al
jne .L3
leave
ret
.cfi_endproc
.LFE0:
.size f1, .-f1
.globl f2
.type f2, @function
f2:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl $10, -8(%rbp)
jmp .L6
.L7:
movl $0, -4(%rbp)
.L6:
cmpl $0, -8(%rbp)
setne %al
subl $1, -8(%rbp)
testb %al, %al
jne .L7
leave
ret
.cfi_endproc
.LFE1:
.size f2, .-f2
The assembly code looks quite the same.
The first compiles; the second doesn't. The first runs for a very long time.
If the second was:
int count = 0;
while (count++ < 10)
{
int a = 0;
...do something with a...
}
and you also made similar changes and did something with a in the first loop, then the difference is that a is set to zero on each iteration of the loop in the second case, but it holds whatever value it is set to in the loop in the first case. Also, in the second case, the variable a does not exist outside the loop and cannot therefore be referenced after the loop.
Yes: in your first method, a is zero for the rest of your bigger program, while in your method 2 the same a cannot be accessed in later statements; doing so would produce an error like undefined reference to variable 'a'. It is zero for the lifetime of that loop but does not exist after the loop. You can define a variable a again after the while loop.
Consider this C-code:
int sum=0;
for(int i=0;i<5;i++)
sum+=i;
This could be translated into (pseudo-)assembly this way (without loop unrolling):
% pseudo-code assembly
ADDI $R10, #0 % sum
ADDI $R11, #0 % i
LOOP:
ADD $R10, $R11
ADDI $R11, #1
BNE $R11, #5 LOOP
So my first question is how this code is translated using loop unrolling - in which of these two ways:
1)
ADDI $R10, #0
ADDI $R10, #0
ADDI $R10, #1
ADDI $R10, #2
ADDI $R10, #3
ADDI $R10, #4
2)
ADD $R10, #10
Is the compiler able to optimize the code and directly know that it has to add 10 without performing all sums?
Also, is there a possibility to block the pipeline with a branch instruction? Do I have to write it this way:
% pseudo-code assembly
ADDI $R10, #0 % sum
ADDI $R11, #0 % i
LOOP:
ADD $R10, $R11
ADDI $R11, #1
NOP % is this necessary to avoid the pipeline blocking?
NOP
NOP
NOP
BNE $R11, #5 LOOP
So that the fetch-decode-execute-mem-writeback cycle is not interrupted by the branch?
This is more for demonstration of what a compiler is capable of, rather than what every compiler would do. The source:
#include <stdio.h>
int main(void)
{
int i, sum = 0;
for(i=0; i<5; i++) {
sum+=i;
}
printf("%d\n", sum);
return 0;
}
Note the printf I have added. If the variable is not used, the compiler will optimize out the entire loop.
Compiling with -O0 (No optimization)
gcc -Wall -O0 -S -c lala.c:
.L3:
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
.L2:
cmpl $4, -8(%rbp)
jle .L3
The loop happens in a 'dumb' way, with -8(%rbp) being the variable i.
Compiling with -O1 (Optimization level 1)
gcc -Wall -O1 -S -c lala.c:
movl $10, %edx
The loop has been completely removed and replaced with the equivalent value.
In unrolling, the compiler looks at how many iterations will happen and tries to perform fewer iterations by unrolling. For example, the loop body might be duplicated twice, which would halve the number of branches. Such a case in C:
int i = 0, sum = 0;
sum += i;
i++;
for(; i<5;i++) {
sum+=i;
i++;
sum+=i;
}
Notice that one iteration had to be extracted out of the loop. This is because 5 is an odd number and so the work can not simply be halved by duplicating the contents. In this case the loop will only be entered twice. The assembly code produced by -O0:
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
jmp .L2
.L3:
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
.L2:
cmpl $4, -8(%rbp)
Completely unrolling in C:
for(i=0; i<5;i++) {
sum+=i;
i++;
sum+=i;
i++;
sum+=i;
i++;
sum+=i;
i++;
sum+=i;
}
This time the loop is actually entered only once. The assembly produced with -O0:
.L3:
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
movl -8(%rbp), %eax
addl %eax, -4(%rbp)
addl $1, -8(%rbp)
.L2:
cmpl $4, -8(%rbp)
jle .L3
So my first question is how is this code translated using loop unrolling, between these two ways
This kind of optimization is usually implemented at the AST level instead of the output-code (e.g. assembly) level. Loop unrolling can be done when the number of iterations is fixed and known at compile time. So, for instance, given this AST:
Program
|
+--For
|
+--Var
| |
| +--Variable i
|
+--Start
| |
| +--Constant 1
|
+--End
| |
| +--Constant 3
|
+--Statements
|
+ Print i
The compiler knows that For's Start and End are constants, and therefore can easily copy the Statements, replacing all occurrences of Var with its value for each iteration. The above AST would be translated to:
Program
|
+--Print 1
|
+--Print 2
|
+--Print 3
Is the compiler able to optimize the code and directly know that it has to add 10 without performing all sums?
Yes, if it's implemented to have such a feature. It's actually an improvement over the above case. In your example, after doing the unrolling, the compiler can see that the l-value always stays the same while the r-values are constants. It can therefore perform peephole optimization combined with constant folding to yield a single addition. If the peephole optimization also considers the declaration, it could be optimized even further, into a single move instruction.
At the basic level, the concept of loop unrolling is simply copying the body of the loop multiple times as appropriate. The compiler may do other optimizations (such as inserting fixed values from a calculation) as well, but that wouldn't be considered unrolling the loop so much as replacing it altogether. Ultimately it all depends on the compiler and the flags used.
The C code (unrolled only) would look more like this:
int sum = 0;
int i = 0;
for ( ; i < (5 & ~(4-1)); i += 4) /* unrolling 4 iterations */
{
sum+=(i+0);
sum+=(i+1);
sum+=(i+2);
sum+=(i+3);
}
for ( ; i < 5; i++)
{
sum+=i;
}
Though there's plenty of opportunities for the compiler to make even more optimizations here, this is just one step.
There is no general answer possible for this: different compilers, different versions of them, and different compiler flags all vary. Use the appropriate option of your compiler to look at the assembly output; with gcc and relatives this is the -S option.