So I'm trying to understand how assembly programming works with stack frames etc.
I did some exercises and disassembled some C code with GDB. The task now is to find out how parameters are transferred between 'main' and the functions it calls. I just started learning and got a bit lost on what the next example is actually doing. Any ideas or tips on where to get started?
It's a recursive program that computes the factorial.
The assembly code looks like this:
1149: f3 0f 1e fa endbr64
114d: 55 push rbp
114e: 48 89 e5 mov rbp,rsp
1151: 48 83 ec 10 sub rsp,0x10
1155: 89 7d fc mov DWORD PTR [rbp-0x4],edi
1158: 83 7d fc 01 cmp DWORD PTR [rbp-0x4],0x1
115c: 76 13 jbe 1171 <f+0x28>
115e: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
1161: 83 e8 01 sub eax,0x1
1164: 89 c7 mov edi,eax
1166: e8 de ff ff ff call 1149 <f>
116b: 0f af 45 fc imul eax,DWORD PTR [rbp-0x4]
116f: eb 05 jmp 1176 <f+0x2d>
1171: b8 01 00 00 00 mov eax,0x1
1176: c9 leave
1177: c3 ret
1178: f3 0f 1e fa endbr64
117c: 55 push rbp
117d: 48 89 e5 mov rbp,rsp
1180: 48 83 ec 10 sub rsp,0x10
1184: c7 45 f8 05 00 00 00 mov DWORD PTR [rbp-0x8],0x5
118b: c7 45 fc 00 00 00 00 mov DWORD PTR [rbp-0x4],0x0
1192: 8b 45 f8 mov eax,DWORD PTR [rbp-0x8]
1195: 89 c7 mov edi,eax
1197: e8 ad ff ff ff call 1149 <f>
119c: 89 45 fc mov DWORD PTR [rbp-0x4],eax
Study the calling convention for your environment. An overview of the many calling conventions for a number of architectures: https://www.dyncall.org/docs/manual/manualse11.html
The calling convention specifies:
Where parameters and return values must appear at the one single point of transfer of control of the instruction stream from the caller to the callee. For parameters being passed, that single point is after the call is made and before the first instruction of the callee (and for return values, at the point where the callee finishes and just before execution resumes in the caller).
Many conventions combine parameter passing in CPU registers with stack memory for parameters that don't fit in CPU registers. And even some that don't use CPU registers for parameters still use CPU registers for return values.
What registers a function is allowed to clobber vs. must preserve. Call-clobbered registers can be assigned new values without concern. Call-preserved registers can be used but must be restored to the value they had upon entry before returning to the caller. The advantage of call-preserved registers is that since they are preserved by a call, you can use them for variables that need to survive another call.
The meaning & treatment of the stack pointer, regarding memory below and above the current pointer, and alignment requirements for stack allocation.
If the function allocates stack space in some manner, then memory parameters will appear to move farther away from the top of the stack (they don't actually move, of course, but become larger offsets from the current stack or frame pointer). Compilers know this and adjust their access to stack memory accordingly.
Some compilers set up frame pointers to refer to stack memory. A frame pointer is a copy of the stack pointer made at some point in the prologue. Frame pointers are not always necessary but facilitate exception handling and stack unwinding, as well as dynamic stack allocation.
It's probably a silly question, but it makes me slightly quibble every time I want to "optimize" the passage of heavy arguments (such as structure for example) to a function that just reads them. I hesitate between passing a pointer:
struct Foo
{
int x;
int y;
int z;
} Foo;
int sum(struct Foo *foo_struct)
{
return foo_struct->x + foo_struct->y + foo_struct->z;
}
Or a constant:
struct Foo
{
int x;
int y;
int z;
} Foo;
int sum(const struct Foo foo_struct)
{
return foo_struct.x + foo_struct.y + foo_struct.z;
}
The pointers are intended not to copy the data but just to send its address, which costs almost nothing.
For constants, it probably varies between compilers or optimization levels, although I don't know how a constant pass is optimized; if it is, then the compiler probably does a better job than I do.
From a performance point of view only (even if it is negligible in my examples), what is the preferred way of doing things?
Structs, much like arrays, are containers of data. Every time you work with a container, its data is laid out in a contiguous block of memory. The container itself is identified by its starting address, and every time you operate on it, your program needs to do low-level pointer arithmetic through dedicated instructions in order to apply an offset to get from the starting address to the desired field (or element, in the case of arrays). The only things that a compiler needs to know to work with a struct are (roughly):
Its starting address in memory.
The offset of each field.
The size of each field.
A compiler can optimize code working on structs in the same way whether the struct is passed as a pointer or not, and we'll see how in a moment. What differs, though, is how the struct is passed to each function.
First let me make one thing clear: the const qualifier is not useful to understand the difference between passing a structure as pointer or by value. It merely tells the compiler that inside the function the value of the parameter itself will not be modified. Performance difference between passing as value or as pointer is not affected in general by const. The const keyword only becomes useful for other kinds of optimizations, not this one.
The main difference between these two signatures:
void first(const struct mystruct x);
void second(struct mystruct *x);
is that the first function will expect the whole struct to be passed as parameter, which therefore means copying the whole structure on the stack right before calling the function. The second function however only needs a pointer to the structure, and therefore the argument can be passed as a single value on the stack, or in a register like it's usually done in x86-64.
Now, to better understand what happens, let's analyze the following program:
#include <stdio.h>
struct mystruct {
unsigned a, b, c, d, e, f, g, h, i, j, k;
};
unsigned long __attribute__ ((noinline)) first(const struct mystruct x) {
unsigned long total = x.a;
total += x.b;
total += x.c;
total += x.d;
total += x.e;
total += x.f;
total += x.g;
total += x.h;
total += x.i;
total += x.j;
total += x.k;
return total;
}
unsigned long __attribute__ ((noinline)) second(struct mystruct *x) {
unsigned long total = x->a;
total += x->b;
total += x->c;
total += x->d;
total += x->e;
total += x->f;
total += x->g;
total += x->h;
total += x->i;
total += x->j;
total += x->k;
return total;
}
int main (void) {
struct mystruct x = {0};
scanf("%u", &x.a);
unsigned long v = first(x);
printf("%lu\n", v);
v = second(&x);
printf("%lu\n", v);
return 0;
}
The __attribute__ ((noinline)) is just to avoid automatic inlining of the function, which for testing purposes is very simple and therefore will probably get inlined with -O3.
Let's now compile and disassemble the result with the help of objdump.
No optimizations
Let's first compile without optimizations and see what happens:
This is how main() calls first():
86a: 48 89 e0 mov rax,rsp
86d: 48 8b 55 c0 mov rdx,QWORD PTR [rbp-0x40]
871: 48 89 10 mov QWORD PTR [rax],rdx
874: 48 8b 55 c8 mov rdx,QWORD PTR [rbp-0x38]
878: 48 89 50 08 mov QWORD PTR [rax+0x8],rdx
87c: 48 8b 55 d0 mov rdx,QWORD PTR [rbp-0x30]
880: 48 89 50 10 mov QWORD PTR [rax+0x10],rdx
884: 48 8b 55 d8 mov rdx,QWORD PTR [rbp-0x28]
888: 48 89 50 18 mov QWORD PTR [rax+0x18],rdx
88c: 48 8b 55 e0 mov rdx,QWORD PTR [rbp-0x20]
890: 48 89 50 20 mov QWORD PTR [rax+0x20],rdx
894: 8b 55 e8 mov edx,DWORD PTR [rbp-0x18]
897: 89 50 28 mov DWORD PTR [rax+0x28],edx
89a: e8 81 fe ff ff call 720 <first>
And this is the function itself:
0000000000000720 <first>:
720: 55 push rbp
721: 48 89 e5 mov rbp,rsp
724: 8b 45 10 mov eax,DWORD PTR [rbp+0x10]
727: 89 c0 mov eax,eax
729: 48 89 45 f8 mov QWORD PTR [rbp-0x8],rax
72d: 8b 45 14 mov eax,DWORD PTR [rbp+0x14]
730: 89 c0 mov eax,eax
732: 48 01 45 f8 add QWORD PTR [rbp-0x8],rax
736: 8b 45 18 mov eax,DWORD PTR [rbp+0x18]
739: 89 c0 mov eax,eax
... same stuff happening over and over ...
783: 48 01 45 f8 add QWORD PTR [rbp-0x8],rax
787: 48 8b 45 f8 mov rax,QWORD PTR [rbp-0x8]
78b: 5d pop rbp
78c: c3 ret
It's quite obvious that the whole structure is being copied on the stack before calling the function.
The function then reads each field directly from the copy on the stack (DWORD PTR [rbp + offset]).
This is how main() calls second():
8bf: 48 8d 45 c0 lea rax,[rbp-0x40]
8c3: 48 89 c7 mov rdi,rax
8c6: e8 c2 fe ff ff call 78d <second>
And this is the function itself:
000000000000078d <second>:
78d: 55 push rbp
78e: 48 89 e5 mov rbp,rsp
791: 48 89 7d e8 mov QWORD PTR [rbp-0x18],rdi
795: 48 8b 45 e8 mov rax,QWORD PTR [rbp-0x18]
799: 8b 00 mov eax,DWORD PTR [rax]
79b: 89 c0 mov eax,eax
79d: 48 89 45 f8 mov QWORD PTR [rbp-0x8],rax
7a1: 48 8b 45 e8 mov rax,QWORD PTR [rbp-0x18]
7a5: 8b 40 04 mov eax,DWORD PTR [rax+0x4]
7a8: 89 c0 mov eax,eax
... same stuff happening over and over ...
81f: 48 01 45 f8 add QWORD PTR [rbp-0x8],rax
823: 48 8b 45 f8 mov rax,QWORD PTR [rbp-0x8]
827: 5d pop rbp
828: c3 ret
You can see that the argument is passed as a pointer instead of being copied on the stack, which is only two very simple instructions (lea + mov). However, since now the function has to work with a pointer using the -> operator, we see that every single time a value in the struct needs to be accessed, memory needs to be dereferenced two times instead of one (first to get the pointer to the structure from the stack, then to get the value at the specified offset in the struct).
It may seem that there is no real difference between the two functions, since the linear number of instructions (linear in terms of struct members) that was required to load the struct on the stack in the first case is still required to dereference the pointer another time in the second case.
We are talking about optimization though, and it makes no sense to not optimize the code. Let's see what happens if we do.
With optimizations
In reality, when working with a struct, we don't really care where it is in memory (stack, heap, data segment, whatever). As long as we know where it starts, it all boils down to applying the same simple pointer arithmetic to access the fields. This always needs to be done, regardless of where the structure resides or whether it was dynamically allocated or not.
If we optimize the code above with -O3, we now see the following:
This is how main() calls first():
61a: 48 83 ec 30 sub rsp,0x30
61e: 48 8b 44 24 30 mov rax,QWORD PTR [rsp+0x30]
623: 48 89 04 24 mov QWORD PTR [rsp],rax
627: 48 8b 44 24 38 mov rax,QWORD PTR [rsp+0x38]
62c: 48 89 44 24 08 mov QWORD PTR [rsp+0x8],rax
631: 48 8b 44 24 40 mov rax,QWORD PTR [rsp+0x40]
636: 48 89 44 24 10 mov QWORD PTR [rsp+0x10],rax
63b: 48 8b 44 24 48 mov rax,QWORD PTR [rsp+0x48]
640: 48 89 44 24 18 mov QWORD PTR [rsp+0x18],rax
645: 48 8b 44 24 50 mov rax,QWORD PTR [rsp+0x50]
64a: 48 89 44 24 20 mov QWORD PTR [rsp+0x20],rax
64f: 8b 44 24 58 mov eax,DWORD PTR [rsp+0x58]
653: 89 44 24 28 mov DWORD PTR [rsp+0x28],eax
657: e8 74 01 00 00 call 7d0 <first>
And this is the function itself:
00000000000007d0 <first>:
7d0: 8b 44 24 0c mov eax,DWORD PTR [rsp+0xc]
7d4: 8b 54 24 08 mov edx,DWORD PTR [rsp+0x8]
7d8: 48 01 c2 add rdx,rax
7db: 8b 44 24 10 mov eax,DWORD PTR [rsp+0x10]
7df: 48 01 d0 add rax,rdx
7e2: 8b 54 24 14 mov edx,DWORD PTR [rsp+0x14]
7e6: 48 01 d0 add rax,rdx
7e9: 8b 54 24 18 mov edx,DWORD PTR [rsp+0x18]
7ed: 48 01 c2 add rdx,rax
7f0: 8b 44 24 1c mov eax,DWORD PTR [rsp+0x1c]
7f4: 48 01 c2 add rdx,rax
7f7: 8b 44 24 20 mov eax,DWORD PTR [rsp+0x20]
7fb: 48 01 d0 add rax,rdx
7fe: 8b 54 24 24 mov edx,DWORD PTR [rsp+0x24]
802: 48 01 d0 add rax,rdx
805: 8b 54 24 28 mov edx,DWORD PTR [rsp+0x28]
809: 48 01 c2 add rdx,rax
80c: 8b 44 24 2c mov eax,DWORD PTR [rsp+0x2c]
810: 48 01 c2 add rdx,rax
813: 8b 44 24 30 mov eax,DWORD PTR [rsp+0x30]
817: 48 01 d0 add rax,rdx
81a: c3 ret
This is how main() calls second():
671: 48 89 df mov rdi,rbx
674: e8 a7 01 00 00 call 820 <second>
And this is the function itself:
0000000000000820 <second>:
820: 8b 47 04 mov eax,DWORD PTR [rdi+0x4]
823: 8b 17 mov edx,DWORD PTR [rdi]
825: 48 01 c2 add rdx,rax
828: 8b 47 08 mov eax,DWORD PTR [rdi+0x8]
82b: 48 01 d0 add rax,rdx
82e: 8b 57 0c mov edx,DWORD PTR [rdi+0xc]
831: 48 01 d0 add rax,rdx
834: 8b 57 10 mov edx,DWORD PTR [rdi+0x10]
837: 48 01 c2 add rdx,rax
83a: 8b 47 14 mov eax,DWORD PTR [rdi+0x14]
83d: 48 01 c2 add rdx,rax
840: 8b 47 18 mov eax,DWORD PTR [rdi+0x18]
843: 48 01 d0 add rax,rdx
846: 8b 57 1c mov edx,DWORD PTR [rdi+0x1c]
849: 48 01 d0 add rax,rdx
84c: 8b 57 20 mov edx,DWORD PTR [rdi+0x20]
84f: 48 01 c2 add rdx,rax
852: 8b 47 24 mov eax,DWORD PTR [rdi+0x24]
855: 48 01 c2 add rdx,rax
858: 8b 47 28 mov eax,DWORD PTR [rdi+0x28]
85b: 48 01 d0 add rax,rdx
85e: c3 ret
It should now be clear which code is better. The compiler successfully identified that all it needs in both cases is to know where the beginning of the structure is, and then it can just apply the same simple math to determine the position of each field. Whether the address is on the stack or somewhere else, it does not really matter.
In fact, in the first() case we see all fields being accessed through [rsp + offset], meaning that some address on the stack itself (rsp) is used to calculate the position of the fields, while in the second() case we see [rdi + offset], meaning that the address passed as parameter (in rdi) is used instead. The offsets are the same too, apart from a constant 8 bytes in first(): call pushed the return address, so the copied struct starts at [rsp+0x8] rather than at [rsp].
So what's the difference now between the two functions? In terms of the function code itself, basically none. In terms of parameter passing, first() still needs the struct passed by value, so even with optimizations enabled the whole structure has to be copied onto the stack. That makes calling first() much heavier and adds a lot of code in the caller.
As I previously said, a compiler can optimize code working on structs in the same way if the struct is passed as pointer or not. However, as we just saw, the way the structure is passed makes a big difference in the caller.
One could argue that the const qualifier for the first() function could ring a bell for the compiler and make it understand that there is really no need to copy the data on the stack, and the caller could just pass a pointer. However the compiler should strictly adhere to the calling convention dictated by the ABI for a given signature, instead of going out of its way to optimize the code. After all, it's not really the compiler's fault in this case, but the programmer's fault.
So, to answer your question:
From a performance point of view only (even if it is negligible in my examples), what is the preferred way of doing things?
The preferred way is definitely to pass a pointer, and not the struct itself.
The preferred way of doing things is to measure, not guess. Code up small prototypes of each approach, then instrument and profile them extensively. Quantify exactly how much of a runtime hit you take by passing a struct of a particular size by value vs. a pointer and accessing its contents. Remember, if you pass a pointer, you'll have to do a dereference operation to access each member, which may negate any savings you gained by not passing a full copy (after all, you may only pass it once but access it many times in a single function call). And you'll have to do this for every platform you want to support, because the answer will be different between different architectures.
Unless you are failing to meet a hard performance requirement, then do what best conveys the intent of the code. If the function is not supposed to modify the contents of the struct type, then favor passing it by value instead of using a pointer.
And, finally, it's not the 1980s anymore. Unless you're in an embedded environment or a mobile app that's trying to not suck the battery dry, you really shouldn't worry about performance at this level. Focus on higher-level design issues. Are you using the right algorithms and data structures? Are you doing needless I/O? Are you making these function calls a lot (as in a tight loop), or do they happen once over the lifetime of the program?
Every optimizing compiler will generate (sometimes almost) exactly the same code.
The only difference will be the invocation (i.e. the function call). Structs are passed by value, so the whole struct has to be placed on the stack (in a typical implementation) when the function's parameter is not a pointer to the struct.
https://godbolt.org/z/Fx5tvG
The function call when passing by the pointer:
x: # #x
mov edi, offset Foo
jmp sum # TAILCALL
The function call when passed by the value:
y: # #y
push rbx
sub rsp, 416
lea rbx, [rsp + 208]
mov esi, offset Foo
mov edx, 208
mov rdi, rbx
call memcpy
mov ecx, 26
mov rdi, rsp
mov rsi, rbx
rep movsq es:[rdi], [rsi]
call sum1
add rsp, 416
pop rbx
ret
The difference is obvious.
The functions are:
struct Foo
{
int x;
int y[50];
int z;
} Foo;
int __attribute__((noinline)) sum(struct Foo *foo_struct);
int __attribute__((noinline)) sum1(const struct Foo foo_struct);
int x()
{
return sum(&Foo);
}
int y()
{
return sum1(Foo);
}
For the rest of the code, please follow the godbolt link.
Without optimization, gcc 9.2 compiles the pointer version to:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax+4]
add edx, eax
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax+8]
add eax, edx
pop rbp
ret
and the const version to:
push rbp
mov rbp, rsp
mov rdx, rdi
mov eax, esi
mov QWORD PTR [rbp-16], rdx
mov DWORD PTR [rbp-8], eax
mov edx, DWORD PTR [rbp-16]
mov eax, DWORD PTR [rbp-12]
add edx, eax
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
Passing the struct by value means that a copy of the entire struct has to reach the function, while passing a pointer means that only the pointer has to, no matter how large the struct is. In the listing above, the 12-byte struct is small enough to arrive in registers (rdi and esi), and the unoptimized code then spills it into the function's stack frame; a larger struct would be copied onto the stack by the caller. Because of this, the pointer version will generally be more memory efficient for anything but small structs. I think it's possible that accessing the data through the pointer might be slower than accessing it within the stack frame if the pointer points far away (making the struct version potentially faster), but I'm not sure about that.
I've, for a few hours, been trying to enlarge my understanding of Assembly Language, by trying to read and understand the instructions of a very simple program I wrote in C to initiate myself to how arguments were handled in ASM.
#include <stdio.h>
int say_hello();
int main(void) {
printf("say_hello() -> %d\n", say_hello(10, 20, 30, 40, 50, 60, 70, 80, 90, 100));
}
int say_hello(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j) {
printf("a:b:c:d:e:f:g:h:i:j -> %d:%d:%d:%d:%d:%d:%d:%d:%d:%d\n", a, b, c, d, e, f, g, h, i, j);
return 1000;
}
The program is as I said, very basic and contains two functions, the main and another one called say_hello which takes 10 arguments, from a to j and print each one of them in a printf call. I've tried doing the same process (So trying to understand the instructions and what's happening), with the same program and less arguments, I think I was able to understand most of it, but then I was wondering, "ok but what's happening if I have so many arguments, there isn't any more register available to store the value in"
So I went to look for how many registers were available and usable in my case, and I found out from this website that "only" (not sure, correct me if I'm wrong) the following registers could be used in my case to store argument values in them edi, esi, r8d, r9d, r10d, r11d, edx, ecx, which is 8, so I went to modify my C program and I added a few more arguments, so that I reach the 8 limit, I even added one more, I don't really know why, let's say just in case.
So when I compiled my program using gcc with no optimization related option whatsoever, I was expecting the main() function to push the values that were left after all the 8 registers have been used, but I wasn't expecting anything from the say_hello() method, that's pretty much why I tried this out in the first place.
So I went to compile my program, then disassembled it using the objdump command (More specifically, this is the full command I used: objdump -d -M intel helloworld) and I started looking for my main method, which was doing pretty much as I expected
000000000000064a <main>:
64a: 55 push rbp
64b: 48 89 e5 mov rbp,rsp
64e: 6a 64 push 0x64
650: 6a 5a push 0x5a
652: 6a 50 push 0x50
654: 6a 46 push 0x46
656: 41 b9 3c 00 00 00 mov r9d,0x3c
65c: 41 b8 32 00 00 00 mov r8d,0x32
662: b9 28 00 00 00 mov ecx,0x28
667: ba 1e 00 00 00 mov edx,0x1e
66c: be 14 00 00 00 mov esi,0x14
671: bf 0a 00 00 00 mov edi,0xa
676: b8 00 00 00 00 mov eax,0x0
67b: e8 1e 00 00 00 call 69e <say_hello>
680: 48 83 c4 20 add rsp,0x20
684: 89 c6 mov esi,eax
686: 48 8d 3d 0b 01 00 00 lea rdi,[rip+0x10b] # 798 <_IO_stdin_used+0x8>
68d: b8 00 00 00 00 mov eax,0x0
692: e8 89 fe ff ff call 520 <printf@plt>
697: b8 00 00 00 00 mov eax,0x0
69c: c9 leave
69d: c3 ret
So it, as I expected pushed the values that were left after all the registers had been used into the stack, and then just did the usual work to pass values from one method to another. But then I went to look for the say_hello method, and it got me really confused.
000000000000069e <say_hello>:
69e: 55 push rbp
69f: 48 89 e5 mov rbp,rsp
6a2: 48 83 ec 20 sub rsp,0x20
6a6: 89 7d fc mov DWORD PTR [rbp-0x4],edi
6a9: 89 75 f8 mov DWORD PTR [rbp-0x8],esi
6ac: 89 55 f4 mov DWORD PTR [rbp-0xc],edx
6af: 89 4d f0 mov DWORD PTR [rbp-0x10],ecx
6b2: 44 89 45 ec mov DWORD PTR [rbp-0x14],r8d
6b6: 44 89 4d e8 mov DWORD PTR [rbp-0x18],r9d
6ba: 44 8b 45 ec mov r8d,DWORD PTR [rbp-0x14]
6be: 8b 7d f0 mov edi,DWORD PTR [rbp-0x10]
6c1: 8b 4d f4 mov ecx,DWORD PTR [rbp-0xc]
6c4: 8b 55 f8 mov edx,DWORD PTR [rbp-0x8]
6c7: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
6ca: 48 83 ec 08 sub rsp,0x8
6ce: 8b 75 28 mov esi,DWORD PTR [rbp+0x28]
6d1: 56 push rsi
6d2: 8b 75 20 mov esi,DWORD PTR [rbp+0x20]
6d5: 56 push rsi
6d6: 8b 75 18 mov esi,DWORD PTR [rbp+0x18]
6d9: 56 push rsi
6da: 8b 75 10 mov esi,DWORD PTR [rbp+0x10]
6dd: 56 push rsi
6de: 8b 75 e8 mov esi,DWORD PTR [rbp-0x18]
6e1: 56 push rsi
6e2: 45 89 c1 mov r9d,r8d
6e5: 41 89 f8 mov r8d,edi
6e8: 89 c6 mov esi,eax
6ea: 48 8d 3d bf 00 00 00 lea rdi,[rip+0xbf] # 7b0 <_IO_stdin_used+0x20>
6f1: b8 00 00 00 00 mov eax,0x0
6f6: e8 25 fe ff ff call 520 <printf@plt>
6fb: 48 83 c4 30 add rsp,0x30
6ff: b8 e8 03 00 00 mov eax,0x3e8
704: c9 leave
705: c3 ret
706: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
70d: 00 00 00
I'm really sorry in advance, I'm not exactly sure I understand what the square brackets do, but from what I've read, they are a way to "point" to the address containing the value I want (please correct me if I'm wrong), so for example mov DWORD PTR [rbp-0x4],edi moves the value in edi to the memory at the address rbp-0x4, right?
I'm also not sure why this process is required. Can't the say_hello method just read edi, for example, and that's it? Why does the program have to move it into [rbp-0x4] and then read it back from [rbp-0x4] into eax?
So the program just goes on and reads every value it needs and put them into an available register, and when it reaches the point when there's no register left, it just starts moving all of them into esi and then pushing them onto the stack, then repeating the process until all the 10 arguments have been stored somewhere.
So that makes sense, I was satisfied and then just went to double check if I really had got it well, so I started reading from bottom to top, starting from 0x6ea to 0x6e2 so the sample I'm working on is
6e2: 45 89 c1 mov r9d,r8d
6e5: 41 89 f8 mov r8d,edi
6e8: 89 c6 mov esi,eax
6ea: 48 8d 3d bf 00 00 00 lea rdi,[rip+0xbf] # 7b0 <_IO_stdin_used+0x20>
So just like on all my previous tests, I was expecting the arguments to go in "reverse" like the first argument is the last instruction executed, and the last one the first instruction executed, so I started double checking every field.
So the first one, rdi, was [rip+0xbf], which I thought for sure was pointing to my string.
So then I moved to 0x6e8, which moves eax which is currently equal to the value stored in [rbp-0x4], which is equal to edi as stated at 0x6a6, and edi is equal to 0xa (10) as stated on 0x671, so my first argument is my string, and the second one is 10, which is exactly what I expected.
But then when I jumped on the instruction executed right before 0x6e8, so 0x6e5 I was expecting it to be 20, so I did the same process. edi is moved to r8d and is currently equal to the value stored in [rbp-0x10] which is equal to ecx which is equal to, as stated at 0x662.. 40? What the heck? I'm confused, why would it be 40? Then I tried looking up the instruction right above that one, and found 50, and did the same for the next one, and again I found 60!! Why? Is the way I get those values wrong? Am I missing something in the instructions? Or did I just assume something by looking at my previous programs (which all had way less arguments, and were all in "reverse" like I said earlier) that I should not have?
I'm sorry if this is a dumb post, I'm very new to ASM (few hours of experience!) and just trying to get my mind cleared on that one, as I really can't figure it out alone. I'm also sorry if this post is too long, I was trying to include a lot of information so that what I'm trying to do is clear, the result I get is clear, and what my problem is is clear as well. Thanks a lot for reading and an even bigger thanks to anyone who will help!
I wrote a very simple program in C and try to understand the function calling process.
#include "stdio.h"
void Oh(unsigned x) {
printf("%u\n", x);
}
int main(int argc, char const *argv[])
{
Oh(0x67611c8c);
return 0;
}
And its assembly code seems to be
0000000100000f20 <_Oh>:
100000f20: 55 push %rbp
100000f21: 48 89 e5 mov %rsp,%rbp
100000f24: 48 83 ec 10 sub $0x10,%rsp
100000f28: 48 8d 05 6b 00 00 00 lea 0x6b(%rip),%rax # 100000f9a <_printf$stub+0x20>
100000f2f: 89 7d fc mov %edi,-0x4(%rbp)
100000f32: 8b 75 fc mov -0x4(%rbp),%esi
100000f35: 48 89 c7 mov %rax,%rdi
100000f38: b0 00 mov $0x0,%al
100000f3a: e8 3b 00 00 00 callq 100000f7a <_printf$stub>
100000f3f: 89 45 f8 mov %eax,-0x8(%rbp)
100000f42: 48 83 c4 10 add $0x10,%rsp
100000f46: 5d pop %rbp
100000f47: c3 retq
100000f48: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
100000f4f: 00
0000000100000f50 <_main>:
100000f50: 55 push %rbp
100000f51: 48 89 e5 mov %rsp,%rbp
100000f54: 48 83 ec 10 sub $0x10,%rsp
100000f58: b8 8c 1c 61 67 mov $0x67611c8c,%eax
100000f5d: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
100000f64: 89 7d f8 mov %edi,-0x8(%rbp)
100000f67: 48 89 75 f0 mov %rsi,-0x10(%rbp)
100000f6b: 89 c7 mov %eax,%edi
100000f6d: e8 ae ff ff ff callq 100000f20 <_Oh>
100000f72: 31 c0 xor %eax,%eax
100000f74: 48 83 c4 10 add $0x10,%rsp
100000f78: 5d pop %rbp
100000f79: c3 retq
Well, I don't quite understand the argument passing process. Since there is only one parameter passed to the Oh function, I could understand this:
100000f58: b8 8c 1c 61 67 mov $0x67611c8c,%eax
So what does the code below do? Why rbp? Isn't it abandoned in x86-64 assembly? If this is x86-style assembly, how can I generate x86-64-style assembly using clang? If it is x86, it doesn't matter; could anyone explain the code below line by line for me?
100000f5d: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
100000f64: 89 7d f8 mov %edi,-0x8(%rbp)
100000f67: 48 89 75 f0 mov %rsi,-0x10(%rbp)
100000f6b: 89 c7 mov %eax,%edi
100000f6d: e8 ae ff ff ff callq 100000f20 <_Oh>
You might get cleaner code if you turned optimizations on, or you might not. But, here’s what that does.
The %rbp register is being used as a frame pointer, that is, a pointer to the original top of the stack. It’s saved on the stack, stored, and restored at the end. Far from being removed in x86_64, it was added there; the 32-bit equivalent was %ebp.
After this value is saved, the program allocates sixteen bytes off the stack by subtracting from the stack pointer.
There then is a very inefficient series of copies that sets the first argument of Oh() as the second argument of printf() and the constant address of the format string (relative to the instruction pointer) as the first argument of printf(). Remember that, in this calling convention, the first argument is passed in %rdi (or %edi for 32-bit operands) and the second in %rsi (or %esi). This could have been simplified to two instructions.
After calling printf(), the program (needlessly) saves the return value on the stack, restores the stack and frame pointers, and returns.
In main(), there’s similar code to set up the stack frame, then the program saves argc and argv (needlessly), then it moves around the constant argument to Oh into its first argument, by way of %eax. This could have been optimized into a single instruction. It then calls Oh(). On return, it sets its return value to 0, cleans up the stack, and returns.
The code you’re asking about does the following: stores the constant 32-bit value 0 on the stack, saves the 32-bit value argc on the stack, saves the 64-bit pointer argv on the stack (the first and second arguments to main()), and sets the first argument of the function it is about to call to %eax, which it had previously loaded with a constant. This is all unnecessary for this program, but would have been necessary had it needed to use argc and argv after the call, when those registers would have been clobbered. There’s no good reason it used two steps to load the constant instead of one.
As Jester mentions, you still have frame pointers on (to aid debugging), so stepping through main:
0000000100000f50 <_main>:
First we enter a new stack frame, we have to save the base pointer and move the stack to the new base. Also, in x86_64 the stack frame has to be aligned to a 16 byte boundary (hence moving the stack pointer by 0x10).
100000f50: push %rbp
100000f51: mov %rsp,%rbp
100000f54: sub $0x10,%rsp
As you mention, x86_64 passes parameters by register, so load the param in to the register:
100000f58: mov $0x67611c8c,%eax
??? Help needed
100000f5d: movl $0x0,-0x4(%rbp)
From here: "Registers RBP, RBX, and R12-R15 are callee-save registers", so if we want to save other registers then we have to do it ourselves ....
100000f64: mov %edi,-0x8(%rbp)
100000f67: mov %rsi,-0x10(%rbp)
Not really sure why we didn't just load this in %edi where it needs to be for the call to begin with, but we better move it there now.
100000f6b: mov %eax,%edi
Call the function:
100000f6d: callq 100000f20 <_Oh>
This is the return value (passed in %eax). xor is a smaller instruction than loading 0, so it is a common optimization:
100000f72: xor %eax,%eax
Clean up that stack frame we added earlier (not really sure why we saved those registers on it when we didn't use them)
100000f74: add $0x10,%rsp
100000f78: pop %rbp
100000f79: retq
Assembly newbie here... I wrote the following simple C program:
void fun(int x, int* y)
{
char arr[4];
int* sp;
sp = y;
}
int main()
{
int i = 4;
fun(i, &i);
return 0;
}
I compiled it with gcc and ran objdump with -S, but the Assembly code output is confusing me:
000000000040055d <fun>:
void fun(int x, int* y)
{
40055d: 55 push %rbp
40055e: 48 89 e5 mov %rsp,%rbp
400561: 48 83 ec 30 sub $0x30,%rsp
400565: 89 7d dc mov %edi,-0x24(%rbp)
400568: 48 89 75 d0 mov %rsi,-0x30(%rbp)
40056c: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
400573: 00 00
400575: 48 89 45 f8 mov %rax,-0x8(%rbp)
400579: 31 c0 xor %eax,%eax
char arr[4];
int* sp;
sp = y;
40057b: 48 8b 45 d0 mov -0x30(%rbp),%rax
40057f: 48 89 45 e8 mov %rax,-0x18(%rbp)
}
400583: 48 8b 45 f8 mov -0x8(%rbp),%rax
400587: 64 48 33 04 25 28 00 xor %fs:0x28,%rax
40058e: 00 00
400590: 74 05 je 400597 <fun+0x3a>
400592: e8 a9 fe ff ff callq 400440 <__stack_chk_fail@plt>
400597: c9 leaveq
400598: c3 retq
0000000000400599 <main>:
int main()
{
400599: 55 push %rbp
40059a: 48 89 e5 mov %rsp,%rbp
40059d: 48 83 ec 10 sub $0x10,%rsp
int i = 4;
4005a1: c7 45 fc 04 00 00 00 movl $0x4,-0x4(%rbp)
fun(i, &i);
4005a8: 8b 45 fc mov -0x4(%rbp),%eax
4005ab: 48 8d 55 fc lea -0x4(%rbp),%rdx
4005af: 48 89 d6 mov %rdx,%rsi
4005b2: 89 c7 mov %eax,%edi
4005b4: e8 a4 ff ff ff callq 40055d <fun>
return 0;
4005b9: b8 00 00 00 00 mov $0x0,%eax
}
4005be: c9 leaveq
4005bf: c3 retq
First, in the line:
400561: 48 83 ec 30 sub $0x30,%rsp
Why is the stack pointer decremented so much in the call to 'fun' (48 bytes)? I assume it has to do with alignment issues, but I cannot visualize why it would need so much space (I only count 12 bytes for local variables (assuming 8 byte pointers))?
Second, I thought that in x86_64, the arguments to a function are either stored in specific registers, or if there are a lot of them, just 'above' (with a downward growing stack) the base pointer, %rbp. Like in the picture at http://en.wikipedia.org/wiki/Call_stack#Structure except 'upside-down'.
But the lines:
400565: 89 7d dc mov %edi,-0x24(%rbp)
400568: 48 89 75 d0 mov %rsi,-0x30(%rbp)
suggest to me that they are being stored way down from the base of the stack (%rsi and %edi are where main put the arguments, right before calling 'fun', and 0x30 down from %rbp is exactly where the stack pointer is pointing...). And when I try to do stuff with them, like assigning their values to local variables, it grabs them from those locations near the head of the stack:
sp = y;
40057b: 48 8b 45 d0 mov -0x30(%rbp),%rax
40057f: 48 89 45 e8 mov %rax,-0x18(%rbp)
... what is going on here?! I would expect them to grab the arguments from either the registers they were stored in, or just above the base pointer, where I thought they are 'supposed to be', according to every basic tutorial I read. Every answer and post I found on here related to stack frame questions confirms my understanding of what stack frames "should" look like, so why is my Assembly output so darn weird?
Because that stuff is a hideously simplified version of what really goes on. It's like wondering why Newtonian mechanics doesn't model the movement of the planets down to the millimeter. Compilers need stack space for all sorts of things. For example, saving callee-saved registers.
Also, the fundamental fact is that debug-mode compilations contain all sorts of debugging and checking machinery. The compiler outputs all sorts of code that checks that your code is correct, for example the call to __stack_chk_fail.
There are only two ways to understand the output of a given compiler. The first is to implement the compiler, or be otherwise very familiar with the implementation. The second is to accept that whatever you understand is a gross simplification. Pick one.
Because you're compiling without optimization, the compiler does a lot of extra work, partly to make things easier to debug, and that work uses a lot of extra stack space:
It does not attempt to compress the stack frame to reuse memory for anything, or get rid of any unused things.
It redundantly copies the arguments into the stack frame (which requires still more memory).
It copies a 'canary' onto the stack to guard against stack-smashing buffer overflows (even though they can't happen in this code).
Try turning on optimization, and you'll see more real code.
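The canary check mentioned above boils down to saving a guard value on entry and comparing it again just before returning. Here is a rough C model of what the compiler inserted around fun, assuming a plain global stands in for the guard the real code reads from %fs:0x28. The names guard_value, canary_intact and fun_checked are invented here and are not anything gcc emits.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the per-thread guard at %fs:0x28.
   The value is arbitrary and hypothetical. */
uintptr_t guard_value = 0x2545f491u;

/* Mirrors `xor %fs:0x28,%rax; je ...`:
   a zero result means the saved copy is unchanged. */
int canary_intact(uintptr_t saved, uintptr_t current)
{
    return (saved ^ current) == 0;
}

/* Same body as fun(), with the inserted check written out by hand.
   Returns 1 where the real code falls through to leaveq/retq and
   0 where it would call __stack_chk_fail instead. */
int fun_checked(int x, int *y)
{
    uintptr_t canary = guard_value; /* mov %fs:0x28,%rax; mov %rax,-0x8(%rbp) */
    char arr[4] = {0};
    int *sp;

    sp = y;
    (void)x; (void)arr; (void)sp;   /* silence unused-variable warnings */

    return canary_intact(canary, guard_value);
}
```

In the real function the saved copy sits between the local array and the saved %rbp and return address, so an overflow of arr would have to overwrite it before reaching them.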
This is 64 bit code. 0x30 of stack space corresponds to 6 slots on the stack. You have what appears to be:
2 slots for function arguments (which happen also to be passed in registers)
2 slots for local variables
1 slot for saving the AX register
1 slot looks like a stack guard, probably related to DEBUG mode.
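One way to check a breakdown like this against the listing is to tabulate the stores that actually appear in fun's disassembly and confirm they fit in the 0x30 bytes reserved. A small sketch follows; the offsets are copied from the question, while the struct and function names are hypothetical. arr is missing from the table because the code never writes to it, so its address never shows up in the listing.

```c
#include <assert.h>
#include <stddef.h>

/* One stored object in fun's frame. */
struct slot {
    const char *what;
    int below_rbp; /* 0x30 means the address %rbp - 0x30 */
    int size;      /* bytes written by the store */
};

/* Offsets and widths taken from the mov instructions in the question. */
const struct slot fun_frame[] = {
    { "y (copy of %rsi)",  0x30, 8 },
    { "x (copy of %edi)",  0x24, 4 },
    { "sp",                0x18, 8 },
    { "stack guard value", 0x08, 8 },
};
const size_t fun_frame_len = sizeof fun_frame / sizeof fun_frame[0];

/* Total bytes the visible stores account for. */
int bytes_accounted(void)
{
    int total = 0;
    for (size_t i = 0; i < fun_frame_len; i++)
        total += fun_frame[i].size;
    return total;
}

/* Returns 1 if every slot lies between %rbp - frame_bytes and %rbp. */
int fits_in_frame(int frame_bytes)
{
    for (size_t i = 0; i < fun_frame_len; i++) {
        if (fun_frame[i].below_rbp > frame_bytes ||
            fun_frame[i].size > fun_frame[i].below_rbp)
            return 0;
    }
    return 1;
}
```

The visible stores cover 28 of the 48 bytes. The rest is the unwritten arr plus padding that the unoptimized build leaves between slots.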
Best thing is to experiment rather than ask questions. Try compiling in different modes (DEBUG, optimisation, etc), and with different numbers and types of arguments and variables. Sometimes asking other people is just too easy -- you learn better by doing your own experiments.