GCC vs CLANG pointer to char* optimization

GCC vs CLANG pointer to char* optimization - c

I have this code:
mainP.c:
int main(int c, char **v){
char *s = v[0];
while (*s++ != 0) {
if ((*s == 'a') && (*s != 'b')) {
return 1;
}
}
return 0;
}
which I compile with clang and gcc generating the assembly code to compare the optimizations:
clang-3.9 -S -masm=intel -O3 mainP.c
gcc -S -masm=intel -O3 mainP.c
The compiler version are:
clang version 3.9.1-9 (tags/RELEASE_391/rc2)
Target: x86_64-pc-linux-gnu
gcc (Debian 6.3.0-18) 6.3.0 20170516
The 2 resulting assembly codes are:
gcc assembly code:
main:
.LFB0:
.cfi_startproc
mov rax, QWORD PTR [rsi]
jmp .L2
.p2align 4,,10
.p2align 3
.L4:
cmp BYTE PTR [rax], 97
je .L5
.L2:
add rax, 1
cmp BYTE PTR -1[rax], 0
jne .L4
xor eax, eax
ret
.L5:
mov eax, 1
ret
clang assembly code:
main: # #main
.cfi_startproc
# BB#0:
mov rcx, qword ptr [rsi]
mov dl, byte ptr [rcx]
inc rcx
.p2align 4, 0x90
.LBB0_1: # =>This Inner Loop Header: Depth=1
xor eax, eax
test dl, dl
je .LBB0_3
# BB#2: # in Loop: Header=BB0_1 Depth=1
movzx edx, byte ptr [rcx]
inc rcx
mov eax, 1
cmp dl, 97
jne .LBB0_1
.LBB0_3:
ret
I notice this: in the gcc assembly code, *s is accessed twice in the loop while *s is accessed only once the clang assembly code.
is there an explanation for the difference?
Then after changing the C code a bit (adding a local char variable), I get about same the assembly code with GCC:
int main(int c, char **v){
char *s = v[0];
char ch;
ch = *s;
while (ch != 0) {
if ((ch == 'a') && (ch != 'b')) {
return 1;
}
ch = *s++;
}
return 0;
}
Resulting assembly code with GCC:
main:
.LFB0:
.cfi_startproc
mov rax, QWORD PTR [rsi]
movzx edx, BYTE PTR [rax]
test dl, dl
je .L6
add rax, 1
cmp dl, 97
jne .L5
jmp .L8
.p2align 4,,10
.p2align 3
.L4:
cmp dl, 97
je .L8
.L5:
add rax, 1
movzx edx, BYTE PTR -1[rax]
test dl, dl
jne .L4
.L6:
xor eax, eax
ret
.L8:
mov eax, 1
ret

The explanation for the difference: compiler optimizations are hard to do, and different compilers will optimize your code in different ways.
Because the expression is relatively simple, we can assume that *s is the same value in both places and only one load is really needed.
But figuring out that optimization is tricky, because the compiler has to absolutely "know" that "whatever s points to" can't be changed between the first reference and the second.
In your example, clang is better at figuring that out than gcc is.

Related

How to import a library from C to a Assembly code?

I'm trying to convert a C code to Assembly but I have a problem in the Assembly code. In the C code i'm importing the 'conio.h' library to use the getch() function, but in the Assembly code when I have the call of this function the output says error because the getch() function is undefined. I've already tried use the extern command in the asm code but it didnt work. Any ideias how to solve this?
C code for reference:
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
int main()
{
int cont=0;
char *string;
string = (char*)malloc(sizeof(char)*20);
do{
string[cont] = getch();
cont++;
}while (string[cont] != '\0');
printf("%s", string);
return 0;
}
Assembly code for reference:
.intel_syntax
.global main
.text
main:
push %rbp
mov %rbp, %rsp
push %rbx
sub %rsp, 40
mov DWORD PTR [%rbp-28], 0
mov %edi, 20
call malloc
mov QWORD PTR [%rbp-24], %rax
.L2:
mov %eax, DWORD PTR [%rbp-28]
cdqe
mov %rbx, %rax
add %rbx, QWORD PTR [%rbp-24]
mov %eax, 0
call getch ; Call of the getch function
mov BYTE PTR [%rbx], %al
inc DWORD PTR [%rbp-28]
mov %eax, DWORD PTR [%rbp-28]
cdqe
add %rax, QWORD PTR [%rbp-24]
movzx %eax, BYTE PTR [%rax]
test %al, %al
jne .L2
mov %rsi, QWORD PTR [%rbp-24]
mov %edi, OFFSET FLAT:.LC0
mov %eax, 0
call printf
mov %eax, 0
add %rsp, 40
pop %rbx
leave
ret
.LC0:
.string "%s"

Intuition for understanding gcc -Og and -Os loop output

Background
I've assumed for a while that gcc will convert while-loops into do-while form. (See Why are loops always compiled into "do...while" style (tail jump)?)
And that -O0 for while-loops...
while (test-expr)
body-statement
..Will generate code on the form of jump-to-middle do-while
goto test;
loop:
body-statement
test:
if (test-expr) goto loop;
And gcc -O2 will generate guarded do while
if (test-expr)
goto done;
loop:
body-statement
if (test-expr) goto loop;
done:
Concrete examples
Here are godbolt examples of functions for which gcc generates the kind of control flow I'm describing above (I use for-loops but a while loop will give the same code).
This simple function...
int sum1(int a[], size_t N) {
int s = 0;
for (size_t i = 0; i < N; i++) {
s += a[i];
}
return s;
}
Will for -O0 generate this jump to middle code
```sum1:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-24], rdi
mov QWORD PTR [rbp-32], rsi
mov DWORD PTR [rbp-4], 0
mov QWORD PTR [rbp-16], 0
jmp .L2
.L3:
mov rax, QWORD PTR [rbp-16]
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov eax, DWORD PTR [rax]
add DWORD PTR [rbp-4], eax
add QWORD PTR [rbp-16], 1
.L2:
mov rax, QWORD PTR [rbp-16]
cmp rax, QWORD PTR [rbp-32]
jb .L3
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
Will for -O2 generate this guarded-do code.
sum1:
test rsi, rsi
je .L4
lea rdx, [rdi+rsi*4]
xor eax, eax
.L3:
add eax, DWORD PTR [rdi]
add rdi, 4
cmp rdi, rdx
jne .L3
ret
.L4:
xor eax, eax
ret
My question
What I'm after is hand-wavy rule to apply when looking at -Os loops. I'm more used to looking at -O2 code and now that I'm working in the embedded field where -Os is more prevalent, I'm surprised by the form of loops I see.
It seems that gcc -Og and -Os both generate code as a jmpat a bottom and if() break at the top. Clang on the other hand generated guarded-do-while A godbolt link to gcc and clang output
Here is an example of gcc -Os output for the above function:
sum1:
xor eax, eax
xor r8d, r8d
.L2:
cmp rax, rsi
je .L5
add r8d, DWORD PTR [rdi+rax*4]
inc rax
jmp .L2
.L5:
mov eax, r8d
ret
Am I correct in assuming that gcc -Og and -Os generates code on the form I described above?
Does anyone have a resource that describes the rationale for using while-form for -Og and -Os? Is it by design or an accidental fall-out form the way the optimization passes are organized.
I thought that converting loops into do-while form was part of the early canonicalization done by compilers? How come gcc -O0 generates do-while but gcc -Og gives while-loops? Do that canonicalization only happen when optimization is enabled?
Sidenote: I'm surprised by how much code generated with -Os and -O2 differ given that there aren't many compiler flags that are different. Maybe many passes checks some variable for tradeoff_speed_vs_space.

Conditional move instruction on a NULL check before dereferencing a pointer [duplicate]

This question already has answers here:
Why does this example of what compilers aren't allowed to do cause null pointer dereferencing using cmov?
(3 answers)
Hard to debug SEGV due to skipped cmov from out-of-bounds memory
(2 answers)
Closed 4 years ago.
I'm working on exercise 3.61 of CSAPP, which requires to write a very simple function that checks if a pointer is NULL before trying to dereference it, which should base on a conditional move instruction rather than a jump. Here's an example I've found online:
long cond(long* p) {
return (!p) ? 0 : *p;
}
According to claims, the function can be compiled into the following assembly:
cond:
xor eax, eax
test rdi, rdi
cmovne rax, QWORD PTR [rdi]
ret
I am running GCC 7.3.0 (from APT package gcc/bionic-updates,now 4:7.3.0-3ubuntu2.1 amd64) on Ubuntu 18.04 on WSL. The computer is running on an Intel Coffee Lake (i.e. 8th-gen Core-i) processor.
I have tried the following commands:
gcc -S a.c -O3
gcc -S a.c -O3 -march=x86-64
gcc -S a.c -O3 -march=core2
gcc -S a.c -O3 -march=k8
Honestly, I wasn't able to observe any difference in the generated a.s file, since all of them look like
cond:
xorl %eax, %eax
testq %rdi, %rdi
je .L1
movq (%rdi), %rax
.L1:
ret
Is there any possibility to have such a function that compiles into a conditional move, without a jump?
Edit: As told in comments, the CMOVxx series of instructions loads operands unconditionally, and only the actual assignment operation is conditional, so there'd be no luck to put *p (or (%rdi)) as the source operand of CMOV, is it right?
The claim is on this page but I think it's invalid.

Here is a branch-less version:
inline long* select(long* p, long* q) {
uintptr_t a = (uintptr_t)p;
uintptr_t b = (uintptr_t)q;
uintptr_t c = !a;
uintptr_t r = (a & (c - 1)) | (b & (!c - 1));
return (long*)r;
}
long cond(long* p) {
long t = 0;
return *select(p, &t);
}
Assembly for gcc-8.2:
cond(long*):
mov QWORD PTR [rsp-8], 0
xor eax, eax
test rdi, rdi
sete al
lea rdx, [rax-1]
neg rax
and rdi, rdx
lea rdx, [rsp-8]
and rax, rdx
or rdi, rax
mov rax, QWORD PTR [rdi]
ret
Assembly for clang-7:
cond(long*): # #cond(long*)
mov qword ptr [rsp - 8], 0
xor eax, eax
test rdi, rdi
lea rcx, [rsp - 8]
cmovne rcx, rax
or rcx, rdi
mov rax, qword ptr [rcx]
ret

assembly output + questions about stacks

I was studying one of my courses when I ran into a specific exercise that I cannot seem to resolve... It is pretty basic because I am VERY new to assembly. So lets begin.
I have a C function
unsigned int func(int *ptr, unsigned int j) {
unsigned int res = j;
int i = ptr[j+1];
for(; i<8; ++i) {
res >>= 1;
}
return res;
}
I translated it with gcc to assembly
.file "func.c"
.intel_syntax noprefix
.text
.globl func
.type func, #function
func:
.LFB0:
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
mov rbp, rsp
.cfi_def_cfa_register 6
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
mov eax, DWORD PTR [rbp-28]
mov DWORD PTR [rbp-8], eax
mov eax, DWORD PTR [rbp-28]
add eax, 1
mov eax, eax
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov eax, DWORD PTR [rax]
mov DWORD PTR [rbp-4], eax
jmp .L2
.L3:
shr DWORD PTR [rbp-8]
add DWORD PTR [rbp-4], 1
.L2:
cmp DWORD PTR [rbp-4], 7
jle .L3
mov eax, DWORD PTR [rbp-8]
pop rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size func, .-func
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"
.section .note.GNU-stack,"",#progbits
The question is as follow. what is the command that place j (variable in the c function) on top of the stack?
I sincerely cannot find out please enlighten me XD.

The variable j is the second parameter for func; it is stored in the register esi in the x86-64 System V ABI calling convention. This instruction mov DWORD PTR [rbp-28], esi put j into the stack.
You can see it very clearly by writing a simple function that calls "func" and compiling it with -O0 (or with -O2 and marking it as noinline, or only providing a prototype so there's nothing for the compiler to inline).
unsigned int func(int *ptr, unsigned int j) {
unsigned int res = j;
int i = ptr[j+1];
for(; i<8; ++i) {
res >>= 1;
}
return res;
}
int main()
{
int a = 1;
int array[10];
func (array, a);
return 0;
}
Using the Godbolt compiler explorer, we can easily get gcc -O0 -fverbose-asm assembly output.
Please focus on the following instructions:
# in main:
...
mov DWORD PTR [rbp-4], 1
mov edx, DWORD PTR [rbp-4]
...
mov esi, edx
...
func(int*, unsigned int):
...
mov DWORD PTR [rbp-28], esi # j, j
...
j, j is a comment added by gcc -fverbose-asm tell you that the source and destination operands are both the C variable j in that instruction.
The full assembly instructions:
func(int*, unsigned int):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
mov eax, DWORD PTR [rbp-28]
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-28]
add eax, 1
mov eax, eax
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov eax, DWORD PTR [rax]
mov DWORD PTR [rbp-8], eax
jmp .L2
.L3:
shr DWORD PTR [rbp-4]
add DWORD PTR [rbp-8], 1
.L2:
cmp DWORD PTR [rbp-8], 7
jle .L3
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
main:
push rbp
mov rbp, rsp
sub rsp, 48
mov DWORD PTR [rbp-4], 1
mov edx, DWORD PTR [rbp-4]
lea rax, [rbp-48]
mov esi, edx
mov rdi, rax
call func(int*, unsigned int)
mov eax, 0
leave
ret

Taking into account these instructions
mov eax, DWORD PTR [rbp-28]
add eax, 1
it seems that j is stored at address rbp-28 While ptr is stored at address rbp-24.
These are instructions where the values are stored in the stack
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
It seems the arguments are passed to the function using registers rdi and esi.
Compilers can optimize their calls of functions and use registers instead of the stack to pass arguments of small sizes to functions. Within the functions they can use the stack to temporary store the arguments passed through registers.

Just a suggestion for further explorations on your own. Use gcc -O0 -g2 f.c -Wa,-adhln. It will turn off optimizations and generate assembly code intermixed with the source. It might give you better ideas about what it does.
As an alternative you can use the objdump -Sd f.o on the output '.o' or executable. Just make sure that you add debugging info and turn off optimizations at compilation.

Understanding these assembly instructions [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I'm learning assembly by writing C programs and viewing the assembly output. I've included the C program at the bottom for the page to make it easier. I'm struggling to understand one line of assembly:
cdqe
movzx eax, BYTE PTR [rbp-32+rax] <--- what is this doing?
movsx eax, al
So I think cdqe extends eax into rax (64 bits). Its clear that the string I want to print fits into the al register but I don't understand what is happening deep down with rbp-32+rax. Can someone explain for me?
.file "string_manip.c"
.intel_syntax noprefix
.section .rodata
.LC0:
.string "Hello"
.string ""
.zero 3
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
mov rbp, rsp
.cfi_def_cfa_register 6
sub rsp, 48
mov rax, QWORD PTR fs:40
mov QWORD PTR [rbp-8], rax
xor eax, eax
mov DWORD PTR [rbp-36], 0
mov eax, DWORD PTR .LC0[rip]
mov DWORD PTR [rbp-32], eax
movzx eax, WORD PTR .LC0[rip+4]
mov WORD PTR [rbp-28], ax
movzx eax, BYTE PTR .LC0[rip+6]
mov BYTE PTR [rbp-26], al
mov WORD PTR [rbp-25], 0
mov BYTE PTR [rbp-23], 0
mov DWORD PTR [rbp-36], 0
jmp .L2
.L3:
mov eax, DWORD PTR [rbp-36]
cdqe
movzx eax, BYTE PTR [rbp-32+rax] <--- what is this doing?
movsx eax, al
mov edi, eax
call putchar
add DWORD PTR [rbp-36], 1
.L2:
cmp DWORD PTR [rbp-36], 5
jle .L3
mov edi, 10
call putchar
mov eax, 0
mov rdx, QWORD PTR [rbp-8]
xor rdx, QWORD PTR fs:40
je .L5
call __stack_chk_fail
.L5:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4"
.section .note.GNU-stack,"",#progbits
#include <string.h>
#include <stdio.h>
int main()
{
int i = 0;
char array[10] = "Hello\0";
for(i=0; i<6; i++)
printf("%c", array[i]);
printf("\n");
return 0;
}

It's just calculating the address of one of the characters.
Presumably your string starts at rbp-32 and then the instruction does the C equivalent of ch = string[rax].
I guess this is unoptimized code, so the compiler does a few extra sign extend and zero extend that are not really needed.