Do jg and jge fall through or jump to the following label when the second register in a statement like cmpl %esi, %edi is greater than or equal to (or only strictly greater than) the first register %esi? And is the result of the comparison (which one is greater) stored in the second register and then used to determine whether the jump to the next label is taken?
sum1.c
int sum(int first, int last)
{
int sum = 0;
int in_between;
for (in_between = first; in_between <= last; in_between++)
{
sum += in_between;
}
return sum;
}
sum1.s:
.file "sum1.c"
.text
.globl sum
.type sum, @function
sum:
.LFB0:
.cfi_startproc
movl %edi, %edx ; puts first into in_between
movl $0, %eax ; sets sum to zero
cmpl %esi, %edi ; compares first and last, checking if first - last < 0
jg .L3 ; jumps to .L3 if first is greater than last, otherwise
;executes .L6
.L6:
addl %edx, %eax ;adds in_between to sum
addl $1, %edx ; increments in_between
cmpl %edx, %esi ; makes the comparison between in_between and last,
;last < in_between
jge .L6 ; jumps to .L6 if last is greater than or equal to
;in_between. (the result jump uses is stored in last).
.L3:
rep
ret ;returns the value stored in %eax register.
.cfi_endproc
.LFE0:
.size sum, .-sum
.ident "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-17)"
.section .note.GNU-stack,"",@progbits
The CPU updates the processor status register (PS, sometimes PSL; this is the EFLAGS register on x86) after [nearly] every instruction. The CMPL instruction does an implicit subtraction and updates the flags in that register. You'd get the same effect from a SUBL instruction, except that SUBL stores the result in the destination operand while CMPL does not.
The Jxx instructions conditionally branch depending upon the value of the PS register.
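To make the operand order concrete, here is my own C paraphrase of sum1.s (the function name sum_paraphrased is mine, not from the question). In AT&T syntax, cmpl SRC, DST sets the flags from DST - SRC, and the following jcc reads only those flags; no register receives the comparison result.

int sum_paraphrased(int first /* %edi */, int last /* %esi */)
{
    int sum = 0;                  /* movl $0, %eax                   */
    int in_between = first;       /* movl %edi, %edx                 */
    if (first > last)             /* cmpl %esi, %edi ; jg .L3        */
        return sum;               /* .L3: return whatever is in %eax */
    do {                          /* .L6:                            */
        sum += in_between;        /* addl %edx, %eax                 */
        in_between++;             /* addl $1, %edx                   */
    } while (last >= in_between); /* cmpl %edx, %esi ; jge .L6       */
    return sum;                   /* fall through to .L3: ret        */
}

So the condition is about the destination (second) operand of cmpl relative to the source: jg .L3 is taken when %edi (first) is strictly greater than %esi (last), and jge .L6 is taken when %esi (last) is greater than or equal to %edx (in_between). The comparison result itself lives only in the flags, never in either register.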
This is the assembly code I am supposed to translate:
f1:
subl $97, %edi
xorl %eax, %eax
cmpb $25, %dil
setbe %al
ret
Here's the C code I wrote that I think is equivalent.
int f1(int y){
int x = y-97;
int i = 0;
if(x<=25){
x = i;
}
return x;
}
And here's what I get from compiling the C code.
_f1: ## @f1
.cfi_startproc
%bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
## kill: def %edi killed %edi def %rdi
leal -97(%rdi), %ecx
xorl %eax, %eax
cmpl $123, %edi
cmovgel %ecx, %eax
popq %rbp
retq
.cfi_endproc
I was wondering whether this is correct, and what should be different. Could anyone also help explain how the jmps work? I am trying to translate this second piece of assembly as well and have gotten stuck:
f2:
cmpl $1, %edi
jle .L6
movl $2, %edx
movl $1, %eax
jmp .L5
.L8:
movl %ecx, %edx
.L5:
imull %edx, %eax
leal 1(%rdx), %ecx
cmpl %eax, %edi
jg .L8
.L4:
cmpl %edi, %eax
sete %al
movzbl %al, %eax
ret
.L6:
movl $1, %eax
jmp .L4
gcc8.3 -O3 emits exactly the asm in the question for this way of writing the range check using the unsigned-compare trick.
int is_ascii_lowercase_v2(int y){
unsigned char x = y-'a';
return x <= (unsigned)('z'-'a');
}
Narrowing to 8-bit after the int subtract matches the asm more exactly, but it's not necessary for correctness or even to convince compilers to use a 32-bit sub. For unsigned char y, the upper bytes of RDI are allowed to hold arbitrary garbage (x86-64 System V calling convention), but carry only propagates from low to high with sub and add.
The low 8 bits of the result (which is all the cmp reads) would be the same with sub $'a', %dil or sub $'a', %edi.
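A quick brute-force sanity check of that carry-propagation point (my own test harness, not part of the answer): the low byte of the difference is the same whether the subtraction is done in 8-bit or 32-bit, because borrows only move from low bits to high bits.

#include <assert.h>
#include <stdint.h>

int main(void)
{
    for (uint32_t y = 0; y <= 0xFFFF; y++) {
        uint8_t  sub8  = (uint8_t)((uint8_t)y - 97);  /* like sub $'a', %dil */
        uint32_t sub32 = y - 97u;                     /* like sub $'a', %edi */
        assert(sub8 == (uint8_t)sub32);               /* low 8 bits agree    */
    }
    return 0;
}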
Writing it as a normal range-check also gets gcc to emit identical code, because compilers know how to optimize range-checks. (And gcc chooses to use 32-bit operand-size for the sub, unlike clang which uses 8-bit.)
int is_ascii_lowercase_v3(char y){
return (y>='a' && y<='z');
}
On the Godbolt compiler explorer, this and _v2 compile as follows:
## gcc8.3 -O3
is_ascii_lowercase_v3: # and _v2 is identical
subl $97, %edi
xorl %eax, %eax
cmpb $25, %dil
setbe %al
ret
Returning a compare result as an integer, instead of using an if, matches the asm much more naturally.
But even writing it "branchlessly" in C won't match the asm unless you enable optimization. The default code-gen from gcc/clang is -O0: anti-optimized for consistent debugging, storing/reloading everything to memory between statements (and storing function args to memory on function entry). You need optimization, because -O0 code-gen is (intentionally) mostly braindead and nasty looking. See How to remove "noise" from GCC/clang assembly output?
## gcc8.3 -O0
is_ascii_lowercase_v2:
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp)
movl -20(%rbp), %eax
subl $97, %eax
movb %al, -1(%rbp)
cmpb $25, -1(%rbp)
setbe %al
movzbl %al, %eax
popq %rbp
ret
gcc and clang with optimization enabled will do if-conversion to branchless code when it's efficient. e.g.
int is_ascii_lowercase_branchy(char y){
unsigned char x = y-'a';
if (x < 25U) {
return 1;
}
return 0;
}
still compiles to the same asm with GCC8.3 -O3
is_ascii_lowercase_branchy:
subl $97, %edi
xorl %eax, %eax
cmpb $25, %dil
setbe %al
ret
We can tell that the optimization level was at least -O2. At -O1, gcc uses the less efficient setbe / movzx instead of xor-zeroing EAX ahead of setbe:
is_ascii_lowercase_v2:
subl $97, %edi
cmpb $25, %dil
setbe %al
movzbl %al, %eax
ret
I could never get clang to reproduce exactly the same sequence of instructions. It likes to use add $-97, %edi, and cmp with $26 / setb.
Or it will do really interesting (but sub-optimal) things like this:
# clang7.0 -O3
is_ascii_lowercase_v2:
addl $159, %edi # 256-97 = 8-bit version of -97
andl $254, %edi # 0xFE; I haven't figured out why it's clearing the low bit as well as the high bits
xorl %eax, %eax
cmpl $26, %edi
setb %al
retq
So this is something involving -(x-97), maybe using the 2's complement identity in there somewhere (-x = ~x + 1).
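Whatever clang is doing internally, a brute-force check (mine, not from the answer) confirms that its add/and/cmp sequence computes the same predicate as is_ascii_lowercase_v2. Both sides depend only on the low byte of y, so testing 0..255 covers every case.

#include <assert.h>

int main(void)
{
    for (int y = 0; y < 256; y++) {
        int reference = (unsigned char)(y - 'a') <= 'z' - 'a';  /* the C source            */
        int clang_way = ((y + 159) & 254) < 26;                 /* addl / andl / cmpl+setb */
        assert(reference == clang_way);
    }
    return 0;
}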
Here is an annotated version of the assembly:
# %edi is the first argument, we denote x
subl $97, %edi
# x -= 97
# %eax is the return value, we denote y
xorl %eax, %eax
# y = 0
# %dil is the least significant byte (lsb) of x
cmpb $25, %dil
# %al is lsb(y) which is already zeroed
setbe %al
# if lsb(x) <= 25 then lsb(y) = 1
# setbe is unsigned version, setle would be signed
ret
# return y
So a verbose C equivalent is:
int f(int x) {
int y = 0;
x -= 97;
x &= 0xFF; // x = lsb(x) using 0xFF as a bitmask
y = (unsigned)x <= 25; // Section 6.5.8 of C standard: comparisons yield 0 or 1
return y;
}
We can shorten it by realizing y is unnecessary:
int f(int x) {
x -= 97;
x &= 0xFF;
return (unsigned)x <= 25;
}
The assembly of this is an exact match on Godbolt Compiler Explorer (x86-64 gcc8.2 -O2): https://godbolt.org/z/fQ0LVR
Is there a gcc pragma or something I can use to force gcc to generate branch-free instructions on a specific section of code?
I have a piece of code that I want gcc to compile to branch-free code using cmov instructions:
int foo(int *a, int n, int x) {
int i = 0, j = n;
while (i < n) {
#ifdef PREFETCH
__builtin_prefetch(a+16*i + 15);
#endif /* PREFETCH */
j = (x <= a[i]) ? i : j;
i = (x <= a[i]) ? 2*i + 1 : 2*i + 2;
}
return j;
}
and, indeed, it does so:
morin@soprano$ gcc -O4 -S -c test.c -o -
.file "test.c"
.text
.p2align 4,,15
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
testl %esi, %esi
movl %esi, %eax
jle .L2
xorl %r8d, %r8d
jmp .L3
.p2align 4,,10
.p2align 3
.L6:
movl %ecx, %r8d
.L3:
movslq %r8d, %rcx
movl (%rdi,%rcx,4), %r9d
leal (%r8,%r8), %ecx # put 2*i in ecx
leal 1(%rcx), %r10d # put 2*i+1 in r10d
addl $2, %ecx # put 2*i+2 in ecx
cmpl %edx, %r9d
cmovge %r10d, %ecx # put 2*i+1 in ecx if appropriate
cmovge %r8d, %eax # set j = i if appropriate
cmpl %esi, %ecx
jl .L6
.L2:
rep ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",@progbits
(Yes, I realize the loop is a branch, but I'm talking about the choice operators inside the loop.)
Unfortunately, when I enable the __builtin_prefetch call, gcc generates branchy code:
morin@soprano$ gcc -DPREFETCH -O4 -S -c test.c -o -
.file "test.c"
.text
.p2align 4,,15
.globl foo
.type foo, @function
foo:
.LFB0:
.cfi_startproc
testl %esi, %esi
movl %esi, %eax
jle .L7
xorl %ecx, %ecx
jmp .L5
.p2align 4,,10
.p2align 3
.L3:
movl %ecx, %eax # this is the x <= a[i] branch
leal 1(%rcx,%rcx), %ecx
cmpl %esi, %ecx
jge .L11
.L5:
movl %ecx, %r8d # this is the main branch
sall $4, %r8d # setup the prefetch
movslq %r8d, %r8 # setup the prefetch
prefetcht0 60(%rdi,%r8,4) # do the prefetch
movslq %ecx, %r8
cmpl %edx, (%rdi,%r8,4) # compare x with a[i]
jge .L3
leal 2(%rcx,%rcx), %ecx # this is the x > a[i] branch
cmpl %esi, %ecx
jl .L5
.L11:
rep ret
.L7:
.p2align 4,,5
rep ret
.cfi_endproc
.LFE0:
.size foo, .-foo
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",@progbits
I've tried using __attribute__((optimize("if-conversion2"))) on this function, but that has no effect.
The reason I care so much is that I have hand-edited the compiler-generated branch-free code (from the first example) to include the prefetcht0 instructions, and it runs considerably faster than both of the versions gcc produces.
If you really rely on that level of optimization, you have to write your own assembler stubs.
The reason is that even a modification elsewhere in the code might change the code the compiler emits (this is not gcc-specific). Also, a different version of gcc or different options (e.g. -fomit-frame-pointer) can change the code dramatically.
You should really only do this if you have to. Other influences might have much more impact, like cache configuration, memory allocation (DRAM page/bank), execution order relative to concurrently running programs, CPU affinity, and much more. Play with compiler optimizations first. You will find the command-line options in the docs (you did not post the version used, so I cannot be more specific).
A (serious) alternative would be to use clang/llvm. Or just help the gcc team improve their optimizers. You would not be the first. Note also that gcc has made massive improvements specifically for ARM over the last versions.
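If you go that route, one option is a sketch like the following (the helper name cmov_le is mine, not from the answer): pin the conditional move down with GNU extended inline asm so the compiler can no longer turn it back into a branch. x86-64 only, and the usual caveats apply: inline asm blocks other optimizations around it.

static inline int cmov_le(int lhs, int rhs, int if_true, int if_false)
{
    int result = if_false;
    /* cmpl %2, %1 sets the flags from lhs - rhs; cmovle then copies if_true
       into result only when lhs <= rhs (signed), so no jcc is emitted here. */
    __asm__("cmpl %2, %1\n\t"
            "cmovle %3, %0"
            : "+r"(result)
            : "r"(lhs), "r"(rhs), "r"(if_true)
            : "cc");
    return result;
}

int foo_forced(int *a, int n, int x)
{
    int i = 0, j = n;
    while (i < n) {
        __builtin_prefetch(a + 16*i + 15);
        j = cmov_le(x, a[i], i, j);              /* j = (x <= a[i]) ? i : j         */
        i = cmov_le(x, a[i], 2*i + 1, 2*i + 2);  /* i = (x <= a[i]) ? 2*i+1 : 2*i+2 */
    }
    return j;
}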
It looks like gcc might have trouble generating branch-free code for variables used in loop conditions and post-conditions, together with the constraint of keeping temporary registers alive across a pseudo-function intrinsic call.
There is something suspicious: the generated code from your function is different when using -funroll-all-loops and -fguess-branch-probability. It generates many return instructions. It smells like a small bug in gcc, around the RTL passes of the compiler, or the simplification of blocks of code.
The following code is branchless in both cases. This would be a good reason to submit a bug to GCC; at -O3, GCC should always generate the same code.
int foo( int *a, int n, int x) {
int c, i = 0, j = n;
while (i < n) {
#ifdef PREFETCH
__builtin_prefetch(a+16*i + 15);
#endif /* PREFETCH */
c = (x > a[i]);
j = c ? j : i;
i = 2*i + 1 + c;
}
return j;
}
which generates this
.cfi_startproc
testl %esi, %esi
movl %esi, %eax
jle .L4
xorl %ecx, %ecx
.p2align 4,,10
.p2align 3
.L3:
movslq %ecx, %r8
cmpl %edx, (%rdi,%r8,4)
setl %r8b
cmovge %ecx, %eax
movzbl %r8b, %r8d
leal 1(%r8,%rcx,2), %ecx
cmpl %ecx, %esi
jg .L3
.L4:
rep ret
.cfi_endproc
and this
.cfi_startproc
testl %esi, %esi
movl %esi, %eax
jle .L5
xorl %ecx, %ecx
.p2align 4,,10
.p2align 3
.L4:
movl %ecx, %r8d
sall $4, %r8d
movslq %r8d, %r8
prefetcht0 60(%rdi,%r8,4)
movslq %ecx, %r8
cmpl %edx, (%rdi,%r8,4)
setl %r8b
testb %r8b, %r8b
movzbl %r8b, %r9d
cmove %ecx, %eax
leal 1(%r9,%rcx,2), %ecx
cmpl %ecx, %esi
jg .L4
.L5:
rep ret
.cfi_endproc
Why is it that the code:
for( i = 0, j = 0; i < 4 , j < 3; i++, j++)
is slower than
for( i = 0, j = 0; i < 4 && j < 3; i++, j++)
Elaborating on this: some users proposed that two if statements take more time than a single if statement with an && operator. I tested it without for loops, and that is not true: two if statements are faster than a single one with an && operator.
The first code is not slower; at least in gcc without optimization. In fact, it should be faster.
When you compile both codes and disassemble them, you will find this for the first code:
cmpl $0x2,-0x8(%rbp)
jle 26 <main+0x26>
And this for the second one:
cmpl $0x3,-0x4(%rbp)
jg 44 <main+0x44>
cmpl $0x2,-0x8(%rbp)
jle 26 <main+0x26>
In the first example, gcc evaluates just the second part, because the first one has no effect and is not used in the comparison. In the second one, it has to check for the first one, and then, if true, check the second one.
So, in the general case, the first example should be faster than the second one. If you find the first slower, maybe your way of measuring it was not 100% correct.
There may be no change in execution time, but the number of iterations may vary, since:
If we put a comma-separated condition in a for loop, it evaluates to the value of the last operand. So basically whichever condition you write first will be disregarded, and only the second one will be checked. So i < 4, j < 3 will always check only j < 3, whereas i < 4 && j < 3 evaluates both and is true if and only if both conditions are true.
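A tiny self-contained check (my own loop bounds, not the question's) of how the discarded condition can change the iteration count; with -Wall, gcc even warns that the left-hand operand of the comma expression has no effect:

#include <stdio.h>

int main(void)
{
    int i, j, n;

    for (i = 0, j = 0, n = 0; i < 2, j < 3; i++, j++)   /* comma: only j < 3 is tested */
        n++;
    printf("comma: %d iterations\n", n);                /* prints 3: i < 2 was ignored */

    for (i = 0, j = 0, n = 0; i < 2 && j < 3; i++, j++) /* &&: both conditions tested  */
        n++;
    printf("&&:    %d iterations\n", n);                /* prints 2: i < 2 stops it    */

    return 0;
}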
If we check the assembly of the code, we can see the difference.
Program:
#include <stdio.h>

int main()
{
int x,y;
for(x=0,y=0;x<4,y<5;x++,y++);
printf("New one");
for(x=0,y=0;x<4 && y<5;x++,y++);
}
Command to get the assembly: gcc -S <program name>
Assembly
.file "for1.c"
.section .rodata
.LC0:
.string "New one"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
andl $-16, %esp
subl $32, %esp
movl $0, 24(%esp)
movl $0, 28(%esp)
jmp .L2
.L3:
addl $1, 24(%esp)
addl $1, 28(%esp)
.L2:
cmpl $4, 28(%esp) //Here only one condition
jle .L3
movl $.LC0, (%esp)
call printf
movl $0, 24(%esp)
movl $0, 28(%esp)
jmp .L4
.L6:
addl $1, 24(%esp)
addl $1, 28(%esp)
.L4:
cmpl $3, 24(%esp) //First Condition
jg .L7
cmpl $4, 28(%esp) //Second Condition
jle .L6
.L7:
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",@progbits
So it is clear that with two conditions the loop takes more time: one extra compare and branch per iteration.
The first option is two ifs; the second option is a mathematical expression and one if, which is usually faster. Here you save one if by doing a calculation, which costs less processing time.
first option -> if() && if()
second option -> if(() && ())
For my homework assignment I am supposed to convert this C code
#define UPPER 15
const int lower = 12;
int sum = 0;
int main(void) {
int i;
for (i = lower; i < UPPER; i++) {
sum += i;
}
return sum;
}
into gcc assembly. I already compiled it so I could study the code before doing it by hand (obviously translating by hand is going to look quite different). This is the assembly code I got:
.file "upper.c"
.globl lower
.section .rodata
.align 4
.type lower, @object
.size lower, 4
lower:
.long 12
.globl sum
.bss
.align 4
.type sum, @object
.size sum, 4
sum:
.zero 4
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $12, -4(%rbp)
jmp .L2
.L3:
movl sum(%rip), %edx
movl -4(%rbp), %eax
addl %edx, %eax
movl %eax, sum(%rip)
addl $1, -4(%rbp)
.L2:
cmpl $14, -4(%rbp)
jle .L3
movl sum(%rip), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388]"
.section .note.GNU-stack,"",@progbits
Now I was wondering if someone could give me a few examples like
where the constructors i, lower, upper and sum are located in the code
where some of the expressions i = lower or i < UPPER are located
where the for-loop starts
and such things so I can then get an idea of how the assembler code is constructed. Thank you!
If I understood your question correctly, here are the answers:
Q: Where are the constructors i, lower, upper and sum located in the code?
lower is located inside the .rodata section (the read-only data section). Its value is initialized by the Linux loader during the program-loading stage to the value .long 12. lower's "constructor" is effectively the Linux loader: it just loads lower's value from the binary image.
.globl lower
.section .rodata
.align 4
.type lower, @object
.size lower, 4
lower:
.long 12
sum is located inside the .bss section (the data segment containing statically allocated variables). Its value is zero (.zero 4); the whole .bss section is zero-filled when the program image is set up, before main begins executing. Every variable located in the .bss section has zero as its initial value (see the Wikipedia article on .bss).
.globl sum
.bss
.align 4
.type sum, @object
.size sum, 4
sum:
.zero 4
UPPER is a preprocessor constant. The compiler did not put its declaration into the assembly. There is a reference to UPPER-1 (as $14) here:
.L2:
cmpl $14, -4(%rbp)
i is an on-stack temporary variable. Its value is accessed using addresses relative to %rbp (%rbp points to the current function's stack frame). There is no explicit declaration of i in the assembly. There is also no explicit stack reservation for i (no instruction like sub $0x8, %rsp in main's preamble), I think because main doesn't call any other functions, so it can safely use the space below %rsp (the red zone). Here is the code for i's initialization (note the compiler knows that lower's initial value is 12 and removes the access to lower when initializing i):
movl $12, -4(%rbp)
Q: where some of the expressions i = lower or i < UPPER are located
i = lower:
movl $12, -4(%rbp)
jmp .L2
i < UPPER:
.L2:
cmpl $14, -4(%rbp)
jle .L3
i++:
addl $1, -4(%rbp)
sum += i;:
movl sum(%rip), %edx
movl -4(%rbp), %eax
addl %edx, %eax
movl %eax, sum(%rip)
return sum; (the %eax register is used to hold the function's return value; more about this: x86 calling conventions):
jle .L3
movl sum(%rip), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
Q: where the for-loop starts
It starts here:
movl $12, -4(%rbp)
jmp .L2
I have this IA32 assembly language code I'm trying to convert into regular C code.
.globl fn
.type fn, @function
fn:
pushl %ebp #setup
movl $1, %eax #setup 1 is in A
movl %esp, %ebp #setup
movl 8(%ebp), %edx # pointer X is in D
cmpl $1, %edx # (*x > 1)
jle .L4
.L5:
imull %edx, %eax
subl $1, %edx
cmpl $1, %edx
jne .L5
.L4:
popl %ebp
ret
The trouble I'm having is deciding what type of comparison is going on. I don't get how the program gets to the .L5 label. .L5 seems to be a loop, since there's a comparison within it. I'm also unsure of what is being returned, because it seems like most of the work is done in the %edx register but doesn't go back to %eax for returning.
What I have so far:
int fn(int x)
{
}
It looks to me like it's computing a factorial. Ignoring the stack frame manipulation and such, we're left with:
movl $1, %eax #setup 1 is in A
Puts 1 into eax.
movl 8(%ebp), %edx # pointer X is in D
Retrieves a parameter into edx
imull %edx, %eax
Multiplies eax by edx, putting the result into eax.
subl $1, %edx
cmpl $1, %edx
jne .L5
Decrements edx and repeats if edx != 1.
In other words, this is roughly equivalent to:
unsigned fact(unsigned input) {
unsigned retval = 1;
for ( ; input != 1; --input)
retval *= input;
return retval;
}
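A quick test harness of my own for that reconstruction (it assumes the fact() definition above is in the same file):

#include <assert.h>

int main(void)
{
    assert(fact(1) == 1);    /* loop body never runs, like the jle .L4 path */
    assert(fact(5) == 120);  /* 5 * 4 * 3 * 2 accumulated in retval         */
    return 0;
}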