Loop fission/invariant optimization not performed, why?


I am trying to learn more about assembly and which optimizations compilers can and cannot do.
I have a test piece of code for which I have some questions.
See it in action here: https://godbolt.org/z/pRztTT, or check the code and assembly below.
#include <stdio.h>
#include <string.h>
int main(int argc, char* argv[])
{
for (int j = 0; j < 100; j++) {
if (argc == 2 && argv[1][0] == '5') {
printf("yes\n");
}
else {
printf("no\n");
}
}
return 0;
}
The assembly produced by GCC 10.1 with -O3:
.LC0:
.string "no"
.LC1:
.string "yes"
main:
push rbp
mov rbp, rsi
push rbx
mov ebx, 100
sub rsp, 8
cmp edi, 2
je .L2
jmp .L3
.L5:
mov edi, OFFSET FLAT:.LC0
call puts
sub ebx, 1
je .L4
.L2:
mov rax, QWORD PTR [rbp+8]
cmp BYTE PTR [rax], 53
jne .L5
mov edi, OFFSET FLAT:.LC1
call puts
sub ebx, 1
jne .L2
.L4:
add rsp, 8
xor eax, eax
pop rbx
pop rbp
ret
.L3:
mov edi, OFFSET FLAT:.LC0
call puts
sub ebx, 1
je .L4
mov edi, OFFSET FLAT:.LC0
call puts
sub ebx, 1
jne .L3
jmp .L4
It seems like GCC produces two versions of the loop: one with the argv[1][0] == '5' condition but without the argc == 2 condition, and one without any condition.
My questions:
What is preventing GCC from splitting away the full condition? It is similar to this question, but there is no chance for the code to get a pointer into argv here.
In the loop without any condition (.L3 in the assembly), why is the loop body duplicated? Is it to reduce the number of jumps while still fitting in some sort of cache?

GCC doesn't know that printf won't modify memory pointed-to by argv, so it can't hoist that check out of the loop.
argc is a local variable (that can't be pointed-to by any pointer global variable), so it knows that calling an opaque function can't modify it. Proving that a local variable is truly private is part of Escape Analysis.
The OP tested this by copying argv[1][0] into a local char variable first: that let GCC hoist the full condition out of the loop.
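A minimal sketch of that experiment, assuming a hypothetical helper named print_loop that stands in for main so it can be called with hand-built arguments. The byte from argv is copied into a local before the loop, so no call inside the loop can change the value the branch depends on. The returned count is only there to make the behaviour observable; it is not part of the original code.

```c
#include <stdio.h>

/* Hypothetical stand-in for main(): same loop as in the question, but the
 * argv byte is read once into a local variable before the loop starts. */
int print_loop(int argc, char *argv[])
{
    /* Only dereference argv[1] when it is known to exist. */
    const char first = (argc == 2) ? argv[1][0] : '\0';
    /* Both operands are locals now, so the whole condition is loop-invariant. */
    const int is_five = (argc == 2 && first == '5');
    int yes_count = 0;

    for (int j = 0; j < 100; j++) {
        if (is_five) {
            puts("yes");
            yes_count++;
        } else {
            puts("no");
        }
    }
    return yes_count; /* number of "yes" lines printed */
}
```

Because first and is_five are private locals whose address never escapes, the compiler is free to test the condition once and pick one of two condition-free loops.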
In practice argv[1] won't be pointing to memory that printf can modify. But we only know that because printf is a C standard library function, and we assume that main is only called by the CRT startup code with the actual command line args. Not by some other function in this program that passes its own args. In C (unlike C++), main is re-entrant and can be called from within the program.
Also, in GNU C, printf can have custom format-string handling functions registered with it. Although in this case, the compiler built-in printf looks at the format string and optimizes it to a puts call.
So printf is already partly special, but I don't think GCC bothers to look for optimizations based on it not modifying any other globally-reachable memory. With a custom stdio output buffer, that might not even be true. printf is slow; saving some spill / reloads around it is generally not a big deal.
Would (theoretically) compiling puts() together with this main() allow the compiler to see puts() isn't touching argv and optimize the loop fully?
Yes, e.g. if you'd written your own write function that uses an inline asm statement around a syscall instruction (with a memory input-only operand to make it safe while avoiding a "memory" clobber) then it could inline and assume that argv[1][0] wasn't changed by the asm statement and hoist a check based on it. Even if you were outputting argv[1].
Or maybe do inter-procedural optimization without inlining.
Re: unrolling: that's odd, -funroll-loops isn't on by default for GCC at -O3, only with -O3 -fprofile-use. Or if enabled manually.

Related

Why the first actual parameter printing as a output in C

#include <stdio.h>
int add(int a, int b)
{
if (a > b)
return a * b;
}
int main(void)
{
printf("%d", add(3, 7));
return 0;
}
Output:
3
In the above code, I am calling the function inside the printf call. In the function, the if condition is not true, so its body won't execute. Then why am I getting 3 as output? I tried changing the first parameter to some other value, but it always prints that first parameter when the if condition is not satisfied.

What happens here is called undefined behaviour.
When (a <= b), you don't return any value (and your compiler probably told you so). But if you use the return value of the function anyway, even if the function doesn't return anything, that value is garbage. In your case it is 3, but with another compiler or with other compiler flags it could be something else.
If your compiler didn't warn you, add the corresponding compiler flags. If your compiler is gcc or clang, use the -Wall compiler flag.
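For comparison, here is a sketch of a well-defined version in which every path returns a value. The name mul_if_greater and the fallback value 0 are assumptions made for illustration; the original code does not say what the function should return when a <= b.

```c
/* Hypothetical well-defined variant of add(): every control path returns.
 * The fallback of 0 for a <= b is an assumption, not taken from the question. */
int mul_if_greater(int a, int b)
{
    if (a > b)
        return a * b;
    return 0; /* explicit result when the condition is false */
}
```

With this version, mul_if_greater(3, 7) yields the explicit fallback instead of whatever happened to be left in a register.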

Jabberwocky is right: this is undefined behavior. You should turn your compiler warnings on and listen to them.
However, I think it can still be interesting to see what the compiler was thinking. And we have a tool to do just that: Godbolt Compiler Explorer.
We can plug your C program into Godbolt and see what assembly instructions it outputs. Here's the direct Godbolt link, and here's the assembly that it produces.
add:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov eax, DWORD PTR [rbp-4]
cmp eax, DWORD PTR [rbp-8]
jle .L2
mov eax, DWORD PTR [rbp-4]
imul eax, DWORD PTR [rbp-8]
jmp .L1
.L2:
.L1:
pop rbp
ret
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
mov esi, 7
mov edi, 3
call add
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
pop rbp
ret
Again, to be perfectly clear, what you've done is undefined behavior. With different compiler flags or a different compiler version or even just a compiler that happens to feel like doing things differently on a particular day, you will get different behavior. What I'm studying here is the assembly output by gcc 12.2 on Godbolt with optimizations disabled, and I am not representing this as standard or well-defined behavior.
This compiler is targeting the System V AMD64 calling convention, common on Linux machines. In System V, the first two integer or pointer arguments are passed in the rdi and rsi registers, and integer values are returned in rax. Since everything we work with here is either an int or a char*, this is good enough for us. Note that the compiler has figured out that it only needs edi, esi, and eax, the lower 32 bits of each of these registers, so I'll start using edi, esi, and eax from this point on.
Our main function works fine. It does everything we'd expect. Our two function calls are here.
mov esi, 7
mov edi, 3
call add
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
To call add, we put 3 in the edi register and 7 in the esi register and then we make the call. We get the return value back from add in eax, and we move it to esi (since it will be the second argument to printf). We put the address of the static memory containing "%d" in edi (the first argument), and then we call printf. This is all normal. main knows that add was declared to return an integer, so it has the right to assume that, after calling add, there will be something useful in eax.
Now let's look at add.
add:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov eax, DWORD PTR [rbp-4]
cmp eax, DWORD PTR [rbp-8]
jle .L2
mov eax, DWORD PTR [rbp-4]
imul eax, DWORD PTR [rbp-8]
jmp .L1
.L2:
.L1:
pop rbp
ret
The rbp and rsp shenanigans are standard function call fare and aren't specific to add. First, we store our two arguments on the call stack as local variables. Now here's where the undefined behavior comes in. Remember that I said eax is the return value of our function. Whatever happens to be in eax when the function returns is the returned value.
We want to compare a and b. To do that, we need a to be in a register (lots of assembly instructions require their left-hand argument to be a register, while the right-hand can be a register, reference, immediate, or just about anything). So we load a into eax. Then we compare the value in eax to the value b on the call stack. If a > b, then the jle does nothing. We go down to the next two lines, which are the inside of your if statement. They correctly set eax and return a value.
However, if a <= b, then the jle instruction jumps to the end of the function without doing anything else to eax. Since the last thing in eax happened to be a (because we happened to use eax as our comparison register in cmp), that's what gets returned from our function.
But this really is just random. It's what the compiler happened to have put in that register previously. If I turn optimizations up (with -O3), then gcc inlines the whole function call and ends up printing out 0 rather than a. I don't know exactly what sequence of optimizations led to this conclusion, but since they started out by hinging on undefined behavior, the compiler is free to make what assumptions it chooses.

Ambiguous behaviour of strcmp()

Please note that I have checked the relevant questions to this title, but from my point of view they are not related to this question.
Initially I thought that program1 and program2 would give me the same result.
//Program 1
char *a = "abcd";
char *b = "efgh";
printf("%d", strcmp(a,b));
//Output: -4
//Program 2
printf("%d", strcmp("abcd", "efgh"));
//Output: -1
The only difference that I can spot is that in program2 I have passed string literals, while in program1 I've passed char * variables as the arguments of the strcmp() function.
Why is there a difference between the behaviour of these seemingly identical programs?
Platform: Linux mint
compiler: g++
Edit: Actually, program1 always prints the difference of the ASCII codes of the first mismatched characters, but program2 prints -1 if the ASCII code of the first mismatched character in string2 is greater than that in string1, and vice versa.

This is your C code:
int x1()
{
char *a = "abcd";
char *b = "efgh";
printf("%d", strcmp(a,b));
}
int x2()
{
printf("%d", strcmp("abcd", "efgh"));
}
And this is the generated assembly output for both functions:
.LC0:
.string "abcd"
.LC1:
.string "efgh"
.LC2:
.string "%d"
x1:
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], OFFSET FLAT:.LC0
mov QWORD PTR [rbp-16], OFFSET FLAT:.LC1
mov rdx, QWORD PTR [rbp-16]
mov rax, QWORD PTR [rbp-8]
mov rsi, rdx
mov rdi, rax
call strcmp // the strcmp function is actually called
mov esi, eax
mov edi, OFFSET FLAT:.LC2
mov eax, 0
call printf
nop
leave
ret
x2:
push rbp
mov rbp, rsp
mov esi, -1 // strcmp is never called, the compiler
// knows what the result will be and it just
// uses -1
mov edi, OFFSET FLAT:.LC2
mov eax, 0
call printf
nop
pop rbp
ret
When the compiler sees strcmp("abcd", "efgh") it knows the result beforehand, because it knows that "abcd" comes before "efgh".
But if it sees strcmp(a,b) it does not know and hence generates code that actually calls strcmp.
With another compiler or with different compiler settings things could be different. You really shouldn't care about such details at least at a beginner's level.

It is indeed surprising that strcmp returns 2 different values for these calls, but it is not incompatible with the C Standard:
strcmp() returns a negative value if the first string is lexicographically before the second string. Both -4 and -1 are negative values.
As pointed by others, the code generated for the different calls is different:
the compiler generates a call to the library function in the first program
the compiler is able to determine the result of the comparison and generates an explicit result of -1 for the second case where both arguments are string literals.
In order to perform this compile-time evaluation, strcmp must be defined in a subtle way in <string.h> so the compiler can determine that the program refers to the C library's implementation and not an alternative that might behave differently. Tracing the corresponding prototype in recent GNU libc include files is a bit difficult, with a number of nested macros eventually leading to a hidden prototype.
Note that more recent versions of both gcc and clang will perform the optimisation in both cases, as can be tested on Godbolt Compiler Explorer, but neither combines this optimisation with that of printf to generate the even more compact code puts("-1");. They seem to convert printf to puts only for string literal formats without arguments.
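Since only the sign of the result is specified, code that needs a stable value can normalise it. The helper below is a sketch with a made-up name (cmp_sign); it is not part of the C library.

```c
#include <string.h>

/* Hypothetical helper: collapse strcmp's result to exactly -1, 0 or 1,
 * so the value no longer depends on how strcmp was implemented or folded. */
int cmp_sign(const char *a, const char *b)
{
    int r = strcmp(a, b);
    return (r > 0) - (r < 0);
}
```

Both programs from the question would then print the same value whether the compiler calls the library function or folds the comparison at compile time.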

I believe (would need to see (and interpret) machine code) one version works without calling code in the library (as if you wrote printf("%d", -1);).

stand alone object code in C and inline functions

I was reading about inline functions from Inline Functions In C when I came across this line:
Sometimes it is necessary for the compiler to emit a stand-alone copy of the object code for a function even though it is an inline function - for instance if it is necessary to take the address of the function, or if it can't be inlined in some particular context, or (perhaps) if optimization has been turned off. (And of course, if you use a compiler that doesn't understand inline, you'll need a stand-alone copy of the object code so that all the calls actually work at all.)
I am completely clueless about what it is trying to say, can somebody please explain it specially what is a stand-alone object code?

"Object code" generally refers to the machine code the compiler emits into an object file, which is then handed over to the linker to be combined into the final executable.
What the text says is that if you for some reason take the address of the function, for example by using a function pointer to it, then the compiler must also emit an ordinary, stand-alone (non-inlined) copy of it, because inlined code doesn't have an address that can be called through a function pointer. An inlined function is simply expanded into the calling code without any function call actually being made. Calls that can be inlined may still be inlined; the stand-alone copy exists for the uses that cannot.

As you know, an "inline" function is translated to machine-instructions that are "right there." Every time a new "call" to the function appears, those instructions are repeated verbatim in every different place -- the function is not actually "called." (An inline function is very much like an assembler "macro.")
But, if you ask for (say) "the address of" the function, the compiler has to generate a non-inlined copy of it in order to be able to give you one "place where it is."

Here you have an example:
#include <stdio.h>
#include <stdlib.h>
extern inline __attribute__((always_inline)) int mul16(int x) {
return x * 16; }
extern inline __attribute__((always_inline)) int mul3(int x) {
return x * 3; }
int main() {
for(int i = 0; i < 10; i ++)
{
int (*ptr)(int) = rand() & 1 ? mul16 : mul3;
printf("Mul2 = %d", mul16(i));
printf(", ptr(i) = %d\n", ptr(i));
}
}
https://godbolt.org/z/wDpF4j
mul16 exists as a separate object and is also inlined in the same code.
mul16: <----- object
mov eax, edi
sal eax, 4
ret
mul3:
lea eax, [rdi+rdi*2]
ret
.LC0:
.string "Mul2 = %d"
.LC1:
.string ", ptr(i) = %d\n"
main:
push r12
push rbp
push rbx
mov ebx, 0
mov r12d, OFFSET FLAT:mul16
.L5:
call rand
test al, 1
mov ebp, OFFSET FLAT:mul3
cmovne rbp, r12
mov esi, ebx
sal esi, 4 <-------------- inlined version
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov edi, ebx
call rbp
mov esi, eax
mov edi, OFFSET FLAT:.LC1
mov eax, 0
call printf
add ebx, 1
cmp ebx, 10
jne .L5
mov eax, 0
pop rbx
pop rbp
pop r12
ret

Is there any way to save registers before jumping into function?

this is my first question, because I couldn't find anything related to this topic.
Recently, while making a class for my C game engine project I've found something interesting:
struct Stack *S1 = new(Stack);
struct Stack *S2 = new(Stack);
S1->bPush(S1, 1, 2); //at this point
bPush is a function pointer in the structure.
So I wondered what the -> operator does in that case, and I've discovered:
mov r8b,2 ; a char, written to a low point of register r8
mov dl,1 ; also a char, but to d this time
mov rcx,qword ptr [S1] ; this is the 1st parameter of function
mov rax,qword ptr [S1] ; !Why cannot I use this one?
call qword ptr [rax+1A0h] ; pointer call
so I assume -> writes an object pointer to rcx, and I'd like to use it in functions (methods they shall be). So the question is, how can I do something alike
push rcx
// do other call vars
pop rcx
mov qword ptr [this], rcx
before it starts writing other variables of the function. Something with preprocessor?

It looks like you'd have an easier time (and get asm that's the same or more efficient) if you wrote in C++ so you could use language built-in support for virtual functions, and for running constructors on initialization. Not to mention not having to manually run destructors. You wouldn't need your struct Class hack.
I'd like to implicitly pass *this pointer, because as shown in second asm part it does the same thing twice, yes, it is what I'm looking for, bPush is a part of a struct and it cannot be called from outside, but I have to pass the pointer S1, which it already has.
You get inefficient asm because you disabled optimization.
MSVC -O2 or -Ox doesn't reload the static pointer twice. It does waste a mov instruction copying between registers, but if you want better asm use a better compiler (like gcc or clang).
The oldest MSVC on the Godbolt compiler explorer is CL19.0 from MSVC 2015, which compiles this source
struct Stack {
int stuff[4];
void (*bPush)(struct Stack*, unsigned char value, unsigned char length);
};
struct Stack *const S1 = new(Stack);
int foo(){
S1->bPush(S1, 1, 2);
//S1->bPush(S1, 1, 2);
return 0; // prevent tailcall optimization
}
into this asm (Godbolt)
# MSVC 2015 -O2
int foo(void) PROC ; foo, COMDAT
$LN4:
sub rsp, 40 ; 00000028H
mov rax, QWORD PTR Stack * __ptr64 __ptr64 S1
mov r8b, 2
mov dl, 1
mov rcx, rax ;; copy RAX to the arg-passing register
call QWORD PTR [rax+16]
xor eax, eax
add rsp, 40 ; 00000028H
ret 0
int foo(void) ENDP ; foo
(I compiled in C++ mode so I could write S1 = new(Stack) without having to copy your github code, and write it at global scope with a non-constant initializer.)
Clang7.0 -O3 loads into RCX straight away:
# clang -O3
foo():
sub rsp, 40
mov rcx, qword ptr [rip + S1]
mov dl, 1
mov r8b, 2
call qword ptr [rcx + 16] # uses the arg-passing register
xor eax, eax
add rsp, 40
ret
Strangely, clang only decides to use low-byte registers when targeting the Windows ABI with __attribute__((ms_abi)). It uses mov esi, 1 to avoid false dependencies when targeting its default Linux calling convention, not mov sil, 1.
Or, if you are already using optimization, then it's because your even older MSVC is even worse. In that case you probably can't do anything in the C source to fix it, although you might try using a struct Stack *p = S1 local variable to hand-hold the compiler into loading it into a register once and reusing it from there.
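A sketch of that local-variable suggestion in plain C. The struct layout, the stack_push body, and the push_via_local wrapper are all hypothetical stand-ins for the asker's engine code; only the S1->bPush(S1, 1, 2) call shape comes from the question.

```c
#include <stddef.h>

struct Stack {
    unsigned char data[16];
    size_t top;
    void (*bPush)(struct Stack *self, unsigned char value, unsigned char length);
};

/* Hypothetical method body: appends `length` copies of `value`. */
static void stack_push(struct Stack *self, unsigned char value, unsigned char length)
{
    for (unsigned char i = 0; i < length && self->top < sizeof self->data; i++)
        self->data[self->top++] = value;
}

static struct Stack g_stack = { {0}, 0, stack_push };
struct Stack *S1 = &g_stack; /* global pointer, as in the question */

size_t push_via_local(void)
{
    struct Stack *p = S1; /* read the global pointer once */
    p->bPush(p, 1, 2);    /* the same value serves as base and as first argument */
    return p->top;        /* elements stored so far */
}
```

Whether this changes the generated code depends on the compiler and optimization level; with optimization enabled, most compilers already keep the pointer in one register.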

NASM Assembly while loop counter

I'm writing a while loop in assembly to compile in the Linux terminal with nasm and gcc. The program compares x and y until y >= x and reports number of loops at the end. Here's the code:
segment .data
out1 db "It took ", 10, 0
out2 db "iterations to complete loop. That seems like a lot.", 10, 0
x db 10
y db 2
count db 0
segment .bss
segment .text
global main
extern printf
main:
mov eax, x
mov ebx, y
mov ecx, count
jmp lp ;jump to loop lp
lp:
cmp ebx, eax ;compare x and y
jge end ;jump to end if y >= x
inc eax ;add 1 to x
inc ebx ;add 2 to y
inc ebx
inc ecx ;add 1 to count
jp lp ;repeat loop
end:
push out1 ;print message part 1
call printf
push count ;print count
call printf
push out2 ;print message part 2
call printf
;mov edx, out1 ;
;call print_string ;
;
;mov edx, ecx ;these were other attempts to print
;call print_int ;using an included file
;
;mov edx, out2 ;
;call print_string ;
This is compiled and run in the terminal with:
nasm -f elf test.asm
gcc -o test test.o
./test
Terminal output comes out as:
It took
iterations to complete loop. That seems like a lot.
Segmentation fault (core dumped)
I can't see anything wrong with the logic. I think it's syntactical but we've only just started learning assembly and I've tried all sorts of different syntax like brackets around variables and using ret at the end of a segment, but nothing seems to work. I've also searched for segmentation faults but I haven't found anything really helpful. Any help would be appreciated because I'm an absolute beginner.

The reason it crashes is probably that your main function doesn't have a ret instruction. Also be sure to set eax to 0 to signal success:
xor eax, eax ; or `mov eax, 0` if you're more comfortable with that
ret
Additionally, global variables designate pointers, not values. mov eax, x sets eax to the address of x. You need to write back to it if you want anything to happen (or not use global variables).
Finally, you're calling printf with a single non-string argument:
push count ;print count
call printf
The first argument needs to be a format string, like "%i". Here, count is a pointer to a null byte, so printf sees an empty format string and prints nothing. Off the top of my head, you should try this:
out3 db "%i ", 0
; snip
push ecx
push out3
call printf
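As a reference point for what the assembly is meant to do, here is the same logic sketched in C, with the count passed to printf through a format string. The function names count_iterations and report are made up for this illustration.

```c
#include <stdio.h>

/* Hypothetical C rendering of the loop: x grows by 1 and y by 2
 * on each pass until y >= x; the number of passes is returned. */
int count_iterations(int x, int y)
{
    int count = 0;
    while (y < x) {
        x += 1;
        y += 2;
        count++;
    }
    return count;
}

/* The value goes through a "%d" conversion in the format string,
 * it is never passed as the format string itself. */
void report(int x, int y)
{
    printf("It took %d iterations to complete loop.\n", count_iterations(x, y));
}
```

With x = 10 and y = 2 the gap of 8 shrinks by one per pass, so count_iterations returns 8.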

I think your problem might just be that you are referencing the addresses of your constants and not their intrinsic value. One must think of a label in nasm as a pointer rather than a value. To access it you just need to use [label]:
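segment .data
x dd 42 ; 32-bit value, so a 32-bit load reads only x
fmt db "%d", 10, 0 ; printf always needs a format string first
segment .text
global main
extern printf
main:
mov eax, x ; eax = address of x
push eax
push fmt
call printf ; will print address of x (like doing cout<<&x in C++)
add esp, 8 ; cdecl: the caller removes its two arguments
mov eax, [x] ; eax = value stored at x
push eax
push fmt
call printf ; will print 42
add esp, 8
xor eax, eax
ret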
PS: I don't think anyone has mentioned it, but volatile (caller-saved) registers are very often modified when you call external code (C, C++ or other), since those functions are compiled to assembly and then linked with your asm file. The processor does not distinguish between code written in a high-level or a low-level language; it just executes opcodes on operands stored in registers and memory. That is why an external function such as printf may modify (or not; it always depends on the compiler and architecture) registers that you are also using.
There are various ways to deal with this:
You check which registers are not being modified by using gcc your_c_file.c -S; the file your_c_file.s will then contain the assembly code your compiler has produced from your C file. (It tends to be quite hard to figure out what is what, and if you are going to use this method, check out Name Mangling to see how function names will be changed.)
Push all the registers you want to save to the stack, and then after the call pop them back into their registers, keeping in mind that the stack is LIFO.
Use the instructions PUSHA and POPA, which push or pop all general-purpose registers respectively (available in 32-bit mode only).
This is the NASM manual chapter 3 which explains the basis of the language to use: http://www.csie.ntu.edu.tw/~comp03/nasm/nasmdoc3.html
Hope you managed to solve it.
