Optimization for pure vs. const function

The source code used in this post is also available here: https://gcc.godbolt.org/z/dGvxnv
Given this C source code:
int pure_f(int a, int b) __attribute__((pure));
int const_f(int a, int b) __attribute__((const));
int my_f(int a, int b) {
int x = pure_f(a, b);
if (a > 0) {
return x;
}
return a;
}
When this is compiled with gcc at -O3, I would expect the evaluation of pure_f(a, b) to be moved into the if. But that does not happen:
my_f(int, int):
push r12
mov r12d, edi
call pure_f(int, int)
test r12d, r12d
cmovg r12d, eax
mov eax, r12d
pop r12
ret
On the other hand, if const_f is called instead of pure_f, the call is moved into the if:
my_f(int, int):
test edi, edi
jg .L4
mov eax, edi
ret
.L4:
jmp const_f(int, int)
Why isn't this optimization applied for a pure function? From my understanding, this should also be possible and it seems to be beneficial.
-- EDIT --
GCC bug report (see comments): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97307
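For reference, here is a minimal sketch of what the two attributes promise, as documented by GCC: a pure function has no side effects but may read global memory, while a const function may depend only on its argument values. The function bodies and the global scale are hypothetical, added only so the example links; the question declares the functions without defining them.

```c
/* Hypothetical global: state that a pure function is allowed to read. */
static int scale = 3;

/* pure: no side effects, but the result may depend on memory (here: scale). */
__attribute__((pure)) int pure_f(int a, int b) { return (a + b) * scale; }

/* const: the result depends on the argument values only. */
__attribute__((const)) int const_f(int a, int b) { return a + b; }

/* Same shape as my_f in the question. */
int my_f(int a, int b) {
    int x = pure_f(a, b);
    if (a > 0) {
        return x;
    }
    return a;
}
```

In both cases the call has no observable side effect, so skipping it when a <= 0 does not change the result of my_f.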

Related

how to save the value of ESP during a function call

I have a problem with the below code:
void swap(int* a, int* b) {
__asm {
mov eax, a;
mov ebx, b;
push[eax];
push[ebx];
pop[eax];
pop[ebx];
}
}
int main() {
int a = 3, b = 6;
printf("a: %d\tb: %d\n", a, b);
swap(&a, &b);
printf("a: %d\tb: %d\n", a, b);
}
I am running this code in Visual Studio, and when I run it, it says:
Run-Time check failure- The value of ESP was not properly saved across a function call. This is usually a result of calling a function declared with one calling convention with a function pointer declared with a different calling convention.
What am I missing?

To answer the title question: make sure you balance pushes and pops. (Normally getting that wrong would just crash, not return with the wrong ESP.) If you're writing a whole function in asm, make sure the ret 0, ret 8, or whatever you use matches the calling convention you're supposed to be following and the amount of stack args to pop (e.g. caller-pops cdecl: ret 0; callee-pops stdcall: ret n).
Looking at the compiler's asm output (e.g. on Godbolt or locally) reveals the problem: different operand-sizes for push vs. pop, MSVC not defaulting to dword ptr for pop.
; MSVC 19.14 (under WINE) -O0
_a$ = 8 ; size = 4
_b$ = 12 ; size = 4
void swap(int *,int *) PROC ; swap
push ebp
mov ebp, esp
push ebx ; save this call-preserved reg because you used it instead of ECX or EDX
mov eax, DWORD PTR _a$[ebp]
mov ebx, DWORD PTR _b$[ebp]
push DWORD PTR [eax]
push DWORD PTR [ebx]
pop WORD PTR [eax]
pop WORD PTR [ebx]
pop ebx
pop ebp
ret 0
void swap(int *,int *) ENDP
This code would just crash, with ret executing while ESP points to the saved EBP (pushed by push ebp). Presumably Visual Studio passes additional debug-build options to the compiler so it does more checking instead of just crashing?
Insanely, MSVC compiles/assembles push [reg] to push dword ptr (32-bit operand-size, ESP-=4 each), but pop [reg] to pop word ptr (16-bit operand-size, ESP+=2 each).
It doesn't even warn about the operand-size being ambiguous, unlike good assemblers such as NASM where push [eax] is an error without a size override. (push 123 of an immediate always defaults to an operand-size matching the mode, but push/pop of a memory operand usually needs a size specifier in most assemblers.)
Use an explicit operand size on every push and pop: push dword ptr [eax] / pop dword ptr [ebx].
Or since you're using EBX anyway, not limiting your function to just the 3 call-clobbered registers in the standard 32-bit calling conventions, use registers to hold the temporaries instead of stack space.
void swap_mov(int* a, int* b) {
__asm {
mov eax, a
mov ebx, b
mov ecx, [eax]
mov edx, [ebx]
mov [eax], edx
mov [ebx], ecx
}
}
(You don't need ; empty comments at the end of each line. The syntax inside an asm{} block is MASM-like, not C statements.)
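For comparison, here is a rough sketch of the same register-based swap in GNU C extended asm (GCC/Clang, AT&T syntax). The compiler picks the scratch registers there, so no call-preserved register such as EBX has to be saved by hand. The name swap_gnu and the plain-C fallback for non-x86 targets are assumptions for illustration; this is not MSVC syntax and will not build with the MSVC compiler.

```c
/* Hypothetical GNU C counterpart of swap_mov; not MSVC inline asm. */
void swap_gnu(int *a, int *b) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int t1, t2;  /* temporaries live in compiler-chosen registers */
    __asm__ volatile(
        "movl (%2), %0\n\t"   /* t1 = *a */
        "movl (%3), %1\n\t"   /* t2 = *b */
        "movl %1, (%2)\n\t"   /* *a = t2 */
        "movl %0, (%3)\n\t"   /* *b = t1 */
        : "=&r"(t1), "=&r"(t2)   /* early-clobber: must not alias the inputs */
        : "r"(a), "r"(b)
        : "memory");
#else
    /* Fallback for other compilers/targets: ordinary C swap. */
    int t = *a;
    *a = *b;
    *b = t;
#endif
}
```

Because every load and store is an explicit 32-bit movl, there is no operand-size ambiguity and the stack pointer is never touched.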

x64 argument and return value calling convention

I invoke Clang 12.0.0 with -Os -march=haswell to compile the following C program:
int bar(int);
int foo(int x) {
const int b = bar(x);
if (x || b) {
return 123;
}
return 456;
}
The following assembly is generated:
foo: # #foo
push rbx
mov ebx, edi
call bar
or eax, ebx
mov ecx, 456
mov eax, 123
cmove eax, ecx
pop rbx
ret
https://gcc.godbolt.org/z/WsGoM56Ez
As I understand it, the caller of foo sets up x in RAX/EAX. foo then calls bar, which doesn't require modifying RAX/EAX, since x is passed through as unmodified input.
The or eax, ebx instruction appears to be comparing the input x with the result of bar. How does that result end up in EBX? What purpose does mov ebx,edi serve?

I'm afraid you are mistaken:
the function argument is passed in rdi, as per the x86-64 System V calling convention.
register rbx is call-preserved: a function that uses it must save and restore it. Clang does so here (push rbx / pop rbx) so it can keep a copy of x there across the call to bar.
the function return value is in rax. (Actually eax; a 32-bit int only uses the low half)
You can verify the basics by compiling a function like int foo(int x){return x;} - you'll see just a mov eax, edi.
Here is a commented version:
foo: # #foo
push rbx # save register rbx
mov ebx, edi # save argument `x` in ebx
call bar # b = bar(x) (result in eax)
or eax, ebx # compute `x | b`, setting FLAGS
mov ecx, 456 # prepare 456 for conditional move
mov eax, 123 # eax = 123
cmove eax, ecx # if `(x | b) == 0` set eax to 456
pop rbx # restore register rbx
ret # return value is in eax
The compiler optimizes x || b as (x | b) != 0 which allows for branchless code generation.
Note that mov doesn't modify the FLAGS, unlike most integer ALU instructions.
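The equivalence used by the compiler can be checked with a small sketch. The helper names select_logical and select_bitwise are made up for this illustration; b stands for the already computed result of bar(x), so short-circuit evaluation makes no difference here.

```c
/* Logical form, as written in the question's source. */
int select_logical(int x, int b) {
    return (x || b) ? 123 : 456;
}

/* Bitwise form, mirroring `or eax, ebx` followed by `cmove`. */
int select_bitwise(int x, int b) {
    return ((x | b) != 0) ? 123 : 456;
}
```

Both forms agree for all inputs because x | b is zero exactly when both x and b are zero.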

Is there a way to convert a conditional assignment to branch free code?

Is there a way to convert the following C code to something without any conditional statements? I have profiled some of my code and noticed that it is getting many branch misses on an if statement that is very similar to this one.
int cond = /*...*/;
int a = /*...*/;
int b = /*...*/;
int x;
if (cond) {
x = a;
} else {
x = b;
}

It depends on the instruction set you're targeting. For x86, there's cmov. For arm64, there's csel. For armv7, there's conditionally executed mov (for example movge), since most armv7 instructions can take a condition-code suffix.
Any decent compiler should be able to optimize the code you have into a good set of instructions. GCC and clang do that (try it out yourself at https://gcc.godbolt.org/).
To answer your question more directly: there is no way to force this in straight C, since it's possible the CPU instruction set doesn't have a branch-free instruction that can be used as a substitute. So you either have to rely on your compiler (which is probably a good idea), or hand-write your own assembly.
To give you a little example, consider the following C code:
int min(int a, int b) {
int result;
if (a < b) {
result = a;
} else {
result = b;
}
return result;
}
gcc 5.4.1 for armv7 generates:
min(int, int):
cmp r0, r1
movge r0, r1
bx lr
gcc 5.4 for arm64 generates:
min(int, int):
cmp w0, w1
csel w0, w0, w1, le
ret
clang 4.0 for x86 generates:
min(int, int): # #min(int, int)
cmp edi, esi
cmovle esi, edi
mov eax, esi
ret
gcc 5 for x86 generates:
min(int, int):
cmp edi, esi
mov eax, esi
cmovle eax, edi
ret
icc 17 for x86 generates:
min(int, int):
cmp edi, esi #8.10
cmovl esi, edi #8.10
mov eax, esi #8.10
ret #8.10
As you can see, they're all branch-free (when compiled at -O1 or above).

A more complete example would be more helpful as the way the variables x, a, b and cond are accessed can play a role. If they are global variables declared outside the function that performs the conditional assignment then they will be accessed using loads and stores, which the compiler may deem to be too expensive to execute conditionally.
Look at the examples at https://godbolt.org/g/GEZbuf where the same conditional assignment is performed on globals in foo and on function arguments in foo2.

x = (!cond * b) | (!!cond * a);
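As a sketch of how such arithmetic selection can be wrapped up, here are the expression above and a mask-based variant. The function names are hypothetical, and the mask version assumes a two's complement int (so that -1 has all bits set). Neither form is guaranteed to compile to branch-free code; the compiler is free to turn it back into a branch or a conditional move.

```c
/* The expression from the answer above: exactly one factor is 1, the other 0. */
int select_mul(int cond, int a, int b) {
    return (!cond * b) | (!!cond * a);
}

/* Mask variant: mask is all ones when cond is nonzero, all zeros otherwise. */
int select_mask(int cond, int a, int b) {
    int mask = -(cond != 0);
    return (a & mask) | (b & ~mask);
}
```

Both return a when cond is nonzero and b otherwise, matching the if/else in the question.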

Replacing function with inline assembly C

I've got a function whose inner code I want to convert into assembly (for various reasons):
int foo(int x, int y, int z);
I generated the assembly code using:
clang -S -mllvm --x86-asm-syntax=intel foo.c
The assembly output: foo.s starts off with something like:
_foo: ## #foo
.cfi_startproc
## BB#0:
push RBP
Ltmp2:
.cfi_def_cfa_offset 16
...
I assume this is the corresponding assembly code for that function. My question is, what part of the assembly output should I copy into the C code (I'm trying to use inline assembly) so that the function would work? The code should look like:
int foo(int x, int y, int z) {
__asm__("..."); // <-- What goes inside?
}
Thanks

You have to look at the disassembly of that function and copy only the instructions that implement its body into the __asm__ statement. For example, the code below
int foo(int x, int y, int z) {
x = y+z;
return x;
}
will yield the following disassembly:
int foo(int x, int y, int z) {
push ebp
mov ebp,esp
sub esp,0C0h
push ebx
push esi
push edi
lea edi,[ebp-0C0h]
mov ecx,30h
mov eax,0CCCCCCCCh
rep stos dword ptr es:[edi]
x = y+z;
mov eax,dword ptr [y]
add eax,dword ptr [z]
mov dword ptr [x],eax
return x;
mov eax,dword ptr [x]
}
so for the statement x = y + z you have to add the following:
mov eax,dword ptr [y]
add eax,dword ptr [z]
mov dword ptr [x],eax
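Since the question targets clang, here is a hedged sketch of how the x = y + z statement might look in GNU C extended asm, where operands are bound through constraints instead of being named directly. It uses the default AT&T syntax rather than the Intel syntax of the listing above, and the plain-C fallback for non-x86 targets is an assumption for illustration.

```c
int foo(int x, int y, int z) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    x = y;
    /* x += z; "+r" makes x a read-write register operand, "cc" marks FLAGS as clobbered. */
    __asm__("addl %1, %0" : "+r"(x) : "r"(z) : "cc");
#else
    /* Fallback for other compilers/targets. */
    x = y + z;
#endif
    return x;
}
```

The prologue and epilogue from the compiler's listing are not copied; the compiler still generates those around the asm statement.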

Understanding the C function call prolog with __cdecl on windows

Compiling this simple function with MSVC2008, in Debug mode:
int __cdecl sum(int a, int b)
{
return a + b;
}
I get the following disassembly listing:
int __cdecl sum(int a, int b)
{
004113B0 push ebp
004113B1 mov ebp,esp
004113B3 sub esp,0C0h
004113B9 push ebx
004113BA push esi
004113BB push edi
004113BC lea edi,[ebp-0C0h]
004113C2 mov ecx,30h
004113C7 mov eax,0CCCCCCCCh
004113CC rep stos dword ptr es:[edi]
return a + b;
004113CE mov eax,dword ptr [a]
004113D1 add eax,dword ptr [b]
}
004113D4 pop edi
004113D5 pop esi
004113D6 pop ebx
004113D7 mov esp,ebp
004113D9 pop ebp
004113DA ret
There are some parts of the prolog I don't understand:
004113BC lea edi,[ebp-0C0h]
004113C2 mov ecx,30h
004113C7 mov eax,0CCCCCCCCh
004113CC rep stos dword ptr es:[edi]
Why is this required?
EDIT:
After removing the /RTC compiler option, as was suggested, most of this code indeed went away. What remained is:
int __cdecl sum(int a, int b)
{
00411270 push ebp
00411271 mov ebp,esp
00411273 sub esp,40h
00411276 push ebx
00411277 push esi
00411278 push edi
return a + b;
00411279 mov eax,dword ptr [a]
0041127C add eax,dword ptr [b]
}
Now, why is the sub esp, 40h needed? It's as if space is being allocated for local variables, though there aren't any. Why is the compiler doing this? Is there another flag involved?

This code is emitted due to the /RTC compile option. It initializes all local variables in your function to a bit pattern that is highly likely to generate an access violation or to cause unusual output values. That helps you find out when you forgot to initialize a variable.
The extra space in the stack frame you see allocated is there to support the Edit + Continue feature. This space will be used when you edit the function while debugging and add more local variables. Change the /ZI option to /Zi to disable it.

In addition, 0xCC is the opcode of int 3, so if a bug such as a buffer overflow makes execution land in this filled memory, the CPU runs into a field of int 3 instructions:
int 3 ; 0xCC
int 3 ; 0xCC
int 3 ; 0xCC
int 3 ; 0xCC
int 3 ; 0xCC
int 3 ; 0xCC
...
These breakpoints are caught by the debugger, so you can find and fix the bug.
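The numbers in the question's prolog fit together: ecx = 30h dwords of 4 bytes each is 0C0h bytes, exactly the amount reserved by sub esp,0C0h. A small sketch that mimics the fill in portable C follows; the names FRAME_BYTES, FILL_DWORDS, fill_frame and read_dword are made up for this illustration, and memset stands in for rep stos.

```c
#include <stdint.h>
#include <string.h>

enum { FRAME_BYTES = 0xC0, FILL_DWORDS = 0x30 };

/* Fill a frame-sized buffer with 0xCC bytes, like the rep stos sequence does. */
void fill_frame(unsigned char *frame) {
    memset(frame, 0xCC, FRAME_BYTES);
}

/* Read the 32-bit value at dword position index (memcpy avoids aliasing issues). */
uint32_t read_dword(const unsigned char *frame, size_t index) {
    uint32_t value;
    memcpy(&value, frame + index * 4, sizeof value);
    return value;
}
```

After the fill, every dword of the frame reads as 0xCCCCCCCC, which is the value an uninitialized int local would show in the debugger.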
