Compiler using local variables without adjusting RSP

In the question "Compilers: Understanding assembly code generated from small programs", the compiler uses two local variables without adjusting the stack pointer.
Not adjusting RSP for local variables does not seem interrupt-safe, so the compiler seems to rely on the hardware automatically switching to a system stack when an interrupt occurs. Otherwise, the first interrupt that came along would push the instruction pointer onto the stack and overwrite the local variables.
The code from that question is:
#include <stdio.h>
int main()
{
    for(int i=0;i<10;i++){
        int k=0;
    }
}
The assembly code generated by that compiler is:
00000000004004d6 <main>:
4004d6: 55 push rbp
4004d7: 48 89 e5 mov rbp,rsp
4004da: c7 45 f8 00 00 00 00 mov DWORD PTR [rbp-0x8],0x0
4004e1: eb 0b jmp 4004ee <main+0x18>
4004e3: c7 45 fc 00 00 00 00 mov DWORD PTR [rbp-0x4],0x0
4004ea: 83 45 f8 01 add DWORD PTR [rbp-0x8],0x1
4004ee: 83 7d f8 09 cmp DWORD PTR [rbp-0x8],0x9
4004f2: 7e ef jle 4004e3 <main+0xd>
4004f4: b8 00 00 00 00 mov eax,0x0
4004f9: 5d pop rbp
4004fa: c3 ret
The local variables are i at [rbp-0x8] and k at [rbp-0x4].
Can anyone shed light on this interrupt problem? Does the hardware indeed switch to a system stack? How? Am I wrong in my understanding?

This is the so-called "red zone" of the x86-64 System V ABI. A summary from Wikipedia:
In computing, a red zone is a fixed-size area in a function's stack frame beyond the current stack pointer which is not preserved by that function. The callee function may use the red zone for storing local variables without the extra overhead of modifying the stack pointer. This region of memory is not to be modified by interrupt/exception/signal handlers. The x86-64 ABI used by System V mandates a 128-byte red zone which begins directly under the current value of the stack pointer.
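In 64-bit Linux user code this is fine, as long as no more than 128 bytes are used. It is an optimization used most prominently by leaf functions, i.e. functions that don't call other functions. Hardware interrupts taken while user code is running are serviced on a kernel stack rather than the user stack, and the kernel skips over the red zone when it builds a signal-handler frame, so those 128 bytes are not overwritten.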
If you were to compile the example program as a 64-bit Linux program with GCC (or compatible compiler) using the -mno-red-zone option you'd see code like this generated:
main:
push rbp
mov rbp, rsp
sub rsp, 16; <<============ Observe RSP is now being adjusted.
mov DWORD PTR [rbp-4], 0
.L3:
cmp DWORD PTR [rbp-4], 9
jg .L2
mov DWORD PTR [rbp-8], 0
add DWORD PTR [rbp-4], 1
jmp .L3
.L2:
mov eax, 0
leave
ret
This code generation can be observed at this godbolt.org link.
For a 32-bit Linux user program it would be a bad thing not to adjust the stack pointer. If you were to compile the code in the question as 32-bit code (using -m32 option) main would appear something like the following code:
main:
push ebp
mov ebp, esp
sub esp, 16; <<============ Observe ESP is being adjusted.
mov DWORD PTR [ebp-4], 0
.L3:
cmp DWORD PTR [ebp-4], 9
jg .L2
mov DWORD PTR [ebp-8], 0
add DWORD PTR [ebp-4], 1
jmp .L3
.L2:
mov eax, 0
leave
ret
This code generation can be observed at this godbolt.org link.

Related

Why does GCC use additional registers for pushing values onto the stack? [duplicate]

This C code
void test_function(int a, int b, int c, int d) {}
int main() {
    test_function(1, 2, 3, 4);
    return 0;
}
gets compiled by GCC (no flags, version 12.1.1, target x86_64-redhat-linux) into
0000000000401106 <test_function>:
401106: 55 push rbp
401107: 48 89 e5 mov rbp,rsp
40110a: 89 7d fc mov DWORD PTR [rbp-0x4],edi
40110d: 89 75 f8 mov DWORD PTR [rbp-0x8],esi
401110: 89 55 f4 mov DWORD PTR [rbp-0xc],edx
401113: 89 4d f0 mov DWORD PTR [rbp-0x10],ecx
401116: 90 nop
401117: 5d pop rbp
401118: c3 ret
0000000000401119 <main>:
401119: 55 push rbp
40111a: 48 89 e5 mov rbp,rsp
40111d: b9 04 00 00 00 mov ecx,0x4
401122: ba 03 00 00 00 mov edx,0x3
401127: be 02 00 00 00 mov esi,0x2
40112c: bf 01 00 00 00 mov edi,0x1
401131: e8 d0 ff ff ff call 401106 <test_function>
401136: b8 00 00 00 00 mov eax,0x0
40113b: 5d pop rbp
40113c: c3 ret
Why are additional registers (ecx, edx, esi, edi) used as intermediary storage for values 1, 2, 3, 4 instead of putting them into rbp directly?

"as intermediary storage": Your confusion seems to be this part.
The ABI specifies that these function arguments are passed in the registers you are seeing (see comments under the question). The registers are not just used as intermediaries. The values are never supposed to be put on the stack at all. They stay in the registers the whole time, unless the function needs to reuse a register for something else, pass on a pointer to the function parameter, or something similar.
What you are seeing in test_function is just an artifact of not compiling with optimizations enabled. The mov instructions storing the registers to the stack are pointless, since nothing is done with the stored values afterwards. The saved rbp is immediately restored and then the function returns.
The whole function should just be a single ret instruction. See https://godbolt.org/z/qG9GjMohY where -O2 is used.
Without optimizations enabled the compiler makes no attempt to remove instructions even if they are pointless and it always stores values of variables to memory and loads them from memory again, even if they could have been held in registers. That's why it is almost always pointless to look at -O0 assembly.

The registers are used for the arguments to the function call. The standard calling convention calls for arguments to be placed in certain registers, so the code you see in main puts the arguments into those registers, and the code in test_function expects them in those registers and reads them from there.
So your follow-on question might be "why is test_function copying those arguments onto the stack?". That's because you're compiling without optimization, so the compiler produces inefficient code, allocating space in the stack frame for every argument and local variable and copying the arguments from their input registers into the stack frame as part of the function prologue. If you were to use those values in the function, you would see it reading them from the stack frame locations even though they are probably still in the registers. If you compile with -O, you'll see the compiler get rid of all this, as the stack frame is not needed.

Is it more expensive to send a data structure to a function to just check its data than sending a pointer? [duplicate]

It's probably a silly question, but it nags at me every time I want to "optimize" the passing of heavy arguments (such as a structure) to a function that only reads them. I hesitate between passing a pointer:
struct Foo
{
    int x;
    int y;
    int z;
} Foo;
int sum(struct Foo *foo_struct)
{
    return foo_struct->x + foo_struct->y + foo_struct->z;
}
Or a constant:
struct Foo
{
    int x;
    int y;
    int z;
} Foo;
int sum(const struct Foo foo_struct)
{
    return foo_struct.x + foo_struct.y + foo_struct.z;
}
The pointers are intended not to copy the data but just to send its address, which costs almost nothing.
For constants, it probably varies between compilers or optimization levels, although I don't know how a constant pass is optimized; if it is, then the compiler probably does a better job than I do.
From a performance point of view only (even if it is negligible in my examples), what is the preferred way of doing things?

Structs, much like arrays, are containers of data. Every time you work with a container, you will have its data laid out in a contiguous block of memory. The container itself is identified by its starting address, and every single time you operate on it, your program will need to do low-level pointer arithmetic through dedicated instructions in order to apply an offset to get from the starting address to the desired field (or element in the case of arrays). The only things that a compiler needs to know to work with a struct are (roughly):
Its starting address in memory.
The offset of each field.
The size of each field.
A compiler can optimize code working on structs in the same way whether the struct is passed as a pointer or not, and we'll see how in a moment. What differs, though, is how the struct is passed to each function.
First let me make one thing clear: the const qualifier is not useful to understand the difference between passing a structure as pointer or by value. It merely tells the compiler that inside the function the value of the parameter itself will not be modified. Performance difference between passing as value or as pointer is not affected in general by const. The const keyword only becomes useful for other kinds of optimizations, not this one.
The main difference between these two signatures:
void first(const struct mystruct x);
void second(struct mystruct *x);
is that the first function will expect the whole struct to be passed as parameter, which therefore means copying the whole structure on the stack right before calling the function. The second function however only needs a pointer to the structure, and therefore the argument can be passed as a single value on the stack, or in a register like it's usually done in x86-64.
Now, to better understand what happens, let's analyze the following program:
#include <stdio.h>
struct mystruct {
    unsigned a, b, c, d, e, f, g, h, i, j, k;
};
unsigned long __attribute__ ((noinline)) first(const struct mystruct x) {
    unsigned long total = x.a;
    total += x.b;
    total += x.c;
    total += x.d;
    total += x.e;
    total += x.f;
    total += x.g;
    total += x.h;
    total += x.i;
    total += x.j;
    total += x.k;
    return total;
}
unsigned long __attribute__ ((noinline)) second(struct mystruct *x) {
    unsigned long total = x->a;
    total += x->b;
    total += x->c;
    total += x->d;
    total += x->e;
    total += x->f;
    total += x->g;
    total += x->h;
    total += x->i;
    total += x->j;
    total += x->k;
    return total;
}
int main (void) {
    struct mystruct x = {0};
    scanf("%u", &x.a);
    unsigned long v = first(x);
    printf("%lu\n", v);
    v = second(&x);
    printf("%lu\n", v);
    return 0;
}
The __attribute__ ((noinline)) is just to avoid automatic inlining of the function, which for testing purposes is very simple and therefore will probably get inlined with -O3.
Let's now compile and disassemble the result with the help of objdump.
No optimizations
Let's first compile without optimizations and see what happens:
This is how main() calls first():
86a: 48 89 e0 mov rax,rsp
86d: 48 8b 55 c0 mov rdx,QWORD PTR [rbp-0x40]
871: 48 89 10 mov QWORD PTR [rax],rdx
874: 48 8b 55 c8 mov rdx,QWORD PTR [rbp-0x38]
878: 48 89 50 08 mov QWORD PTR [rax+0x8],rdx
87c: 48 8b 55 d0 mov rdx,QWORD PTR [rbp-0x30]
880: 48 89 50 10 mov QWORD PTR [rax+0x10],rdx
884: 48 8b 55 d8 mov rdx,QWORD PTR [rbp-0x28]
888: 48 89 50 18 mov QWORD PTR [rax+0x18],rdx
88c: 48 8b 55 e0 mov rdx,QWORD PTR [rbp-0x20]
890: 48 89 50 20 mov QWORD PTR [rax+0x20],rdx
894: 8b 55 e8 mov edx,DWORD PTR [rbp-0x18]
897: 89 50 28 mov DWORD PTR [rax+0x28],edx
89a: e8 81 fe ff ff call 720 <first>
And this is the function itself:
0000000000000720 <first>:
720: 55 push rbp
721: 48 89 e5 mov rbp,rsp
724: 8b 45 10 mov eax,DWORD PTR [rbp+0x10]
727: 89 c0 mov eax,eax
729: 48 89 45 f8 mov QWORD PTR [rbp-0x8],rax
72d: 8b 45 14 mov eax,DWORD PTR [rbp+0x14]
730: 89 c0 mov eax,eax
732: 48 01 45 f8 add QWORD PTR [rbp-0x8],rax
736: 8b 45 18 mov eax,DWORD PTR [rbp+0x18]
739: 89 c0 mov eax,eax
... same stuff happening over and over ...
783: 48 01 45 f8 add QWORD PTR [rbp-0x8],rax
787: 48 8b 45 f8 mov rax,QWORD PTR [rbp-0x8]
78b: 5d pop rbp
78c: c3 ret
It's quite obvious that the whole structure is being copied on the stack before calling the function.
The function then reads each field directly from the copy on the stack (DWORD PTR [rbp + offset]).
This is how main() calls second():
8bf: 48 8d 45 c0 lea rax,[rbp-0x40]
8c3: 48 89 c7 mov rdi,rax
8c6: e8 c2 fe ff ff call 78d <second>
And this is the function itself:
000000000000078d <second>:
78d: 55 push rbp
78e: 48 89 e5 mov rbp,rsp
791: 48 89 7d e8 mov QWORD PTR [rbp-0x18],rdi
795: 48 8b 45 e8 mov rax,QWORD PTR [rbp-0x18]
799: 8b 00 mov eax,DWORD PTR [rax]
79b: 89 c0 mov eax,eax
79d: 48 89 45 f8 mov QWORD PTR [rbp-0x8],rax
7a1: 48 8b 45 e8 mov rax,QWORD PTR [rbp-0x18]
7a5: 8b 40 04 mov eax,DWORD PTR [rax+0x4]
7a8: 89 c0 mov eax,eax
... same stuff happening over and over ...
81f: 48 01 45 f8 add QWORD PTR [rbp-0x8],rax
823: 48 8b 45 f8 mov rax,QWORD PTR [rbp-0x8]
827: 5d pop rbp
828: c3 ret
You can see that the argument is passed as a pointer instead of being copied on the stack, which is only two very simple instructions (lea + mov). However, since now the function has to work with a pointer using the -> operator, we see that every single time a value in the struct needs to be accessed, memory needs to be dereferenced two times instead of one (first to get the pointer to the structure from the stack, then to get the value at the specified offset in the struct).
It may seem that there is no real difference between the two functions, since the linear number of instructions (linear in terms of struct members) that was required to load the struct on the stack in the first case is still required to dereference the pointer another time in the second case.
We are talking about optimization though, and it makes no sense to not optimize the code. Let's see what happens if we do.
With optimizations
In reality, when working with a struct, we don't really care where it is in memory (stack, heap, data segment, whatever). As long as we know where it starts, it all boils down to applying the same simple pointer arithmetic to access the fields. This always needs to be done, regardless of where the structure resides or whether it was dynamically allocated or not.
If we optimize the code above with -O3, we now see the following:
This is how main() calls first():
61a: 48 83 ec 30 sub rsp,0x30
61e: 48 8b 44 24 30 mov rax,QWORD PTR [rsp+0x30]
623: 48 89 04 24 mov QWORD PTR [rsp],rax
627: 48 8b 44 24 38 mov rax,QWORD PTR [rsp+0x38]
62c: 48 89 44 24 08 mov QWORD PTR [rsp+0x8],rax
631: 48 8b 44 24 40 mov rax,QWORD PTR [rsp+0x40]
636: 48 89 44 24 10 mov QWORD PTR [rsp+0x10],rax
63b: 48 8b 44 24 48 mov rax,QWORD PTR [rsp+0x48]
640: 48 89 44 24 18 mov QWORD PTR [rsp+0x18],rax
645: 48 8b 44 24 50 mov rax,QWORD PTR [rsp+0x50]
64a: 48 89 44 24 20 mov QWORD PTR [rsp+0x20],rax
64f: 8b 44 24 58 mov eax,DWORD PTR [rsp+0x58]
653: 89 44 24 28 mov DWORD PTR [rsp+0x28],eax
657: e8 74 01 00 00 call 7d0 <first>
And this is the function itself:
00000000000007d0 <first>:
7d0: 8b 44 24 0c mov eax,DWORD PTR [rsp+0xc]
7d4: 8b 54 24 08 mov edx,DWORD PTR [rsp+0x8]
7d8: 48 01 c2 add rdx,rax
7db: 8b 44 24 10 mov eax,DWORD PTR [rsp+0x10]
7df: 48 01 d0 add rax,rdx
7e2: 8b 54 24 14 mov edx,DWORD PTR [rsp+0x14]
7e6: 48 01 d0 add rax,rdx
7e9: 8b 54 24 18 mov edx,DWORD PTR [rsp+0x18]
7ed: 48 01 c2 add rdx,rax
7f0: 8b 44 24 1c mov eax,DWORD PTR [rsp+0x1c]
7f4: 48 01 c2 add rdx,rax
7f7: 8b 44 24 20 mov eax,DWORD PTR [rsp+0x20]
7fb: 48 01 d0 add rax,rdx
7fe: 8b 54 24 24 mov edx,DWORD PTR [rsp+0x24]
802: 48 01 d0 add rax,rdx
805: 8b 54 24 28 mov edx,DWORD PTR [rsp+0x28]
809: 48 01 c2 add rdx,rax
80c: 8b 44 24 2c mov eax,DWORD PTR [rsp+0x2c]
810: 48 01 c2 add rdx,rax
813: 8b 44 24 30 mov eax,DWORD PTR [rsp+0x30]
817: 48 01 d0 add rax,rdx
81a: c3 ret
This is how main() calls second():
671: 48 89 df mov rdi,rbx
674: e8 a7 01 00 00 call 820 <second>
And this is the function itself:
0000000000000820 <second>:
820: 8b 47 04 mov eax,DWORD PTR [rdi+0x4]
823: 8b 17 mov edx,DWORD PTR [rdi]
825: 48 01 c2 add rdx,rax
828: 8b 47 08 mov eax,DWORD PTR [rdi+0x8]
82b: 48 01 d0 add rax,rdx
82e: 8b 57 0c mov edx,DWORD PTR [rdi+0xc]
831: 48 01 d0 add rax,rdx
834: 8b 57 10 mov edx,DWORD PTR [rdi+0x10]
837: 48 01 c2 add rdx,rax
83a: 8b 47 14 mov eax,DWORD PTR [rdi+0x14]
83d: 48 01 c2 add rdx,rax
840: 8b 47 18 mov eax,DWORD PTR [rdi+0x18]
843: 48 01 d0 add rax,rdx
846: 8b 57 1c mov edx,DWORD PTR [rdi+0x1c]
849: 48 01 d0 add rax,rdx
84c: 8b 57 20 mov edx,DWORD PTR [rdi+0x20]
84f: 48 01 c2 add rdx,rax
852: 8b 47 24 mov eax,DWORD PTR [rdi+0x24]
855: 48 01 c2 add rdx,rax
858: 8b 47 28 mov eax,DWORD PTR [rdi+0x28]
85b: 48 01 d0 add rax,rdx
85e: c3 ret
It should now be clear which code is better. The compiler successfully identified that all it needs in both cases is to know where the beginning of the structure is, and then it can just apply the same simple math to determine the position of each field. Whether the address is on the stack or somewhere else, it does not really matter.
In fact, in the first() case we see all fields being accessed through [rsp + offset], meaning that some address on the stack itself (rsp) is used to calculate the position of the fields, while in the second() case we see [rdi + offset], meaning that the address passed as parameter (in rdi) is used instead. The offsets though are still the same.
So what's the difference now between the two functions? In terms of the function code itself, basically none. In terms of parameter passing, first() still needs the struct passed by value, so even with optimizations enabled the whole structure must be copied onto the stack. Calling first() is therefore much heavier and adds a lot of code in the caller.
As I previously said, a compiler can optimize code working on structs in the same way if the struct is passed as pointer or not. However, as we just saw, the way the structure is passed makes a big difference in the caller.
One could argue that the const qualifier for the first() function could ring a bell for the compiler and make it understand that there is really no need to copy the data on the stack, and the caller could just pass a pointer. However the compiler should strictly adhere to the calling convention dictated by the ABI for a given signature, instead of going out of its way to optimize the code. After all, it's not really the compiler's fault in this case, but the programmer's fault.
So, to answer your question:
From a performance point of view only (even if it is negligible in my examples), what is the preferred way of doing things?
The preferred way is definitely to pass a pointer, and not the struct itself.

The preferred way of doing things is to measure, not guess. Code up small prototypes of each approach, then instrument and profile them extensively. Quantify exactly how much of a runtime hit you take by passing a struct of a particular size by value vs. a pointer and accessing its contents. Remember, if you pass a pointer, you'll have to do a dereference operation to access each member, which may negate any savings you gained by not passing a full copy (after all, you may only pass it once but access it many times in a single function call). And you'll have to do this for every platform you want to support, because the answer will be different between different architectures.
Unless you are failing to meet a hard performance requirement, then do what best conveys the intent of the code. If the function is not supposed to modify the contents of the struct type, then favor passing it by value instead of using a pointer.
And, finally, it's not the 1980s anymore. Unless you're in an embedded environment or a mobile app that's trying to not suck the battery dry, you really shouldn't worry about performance at this level. Focus on higher-level design issues. Are you using the right algorithms and data structures? Are you doing needless I/O? Are you making these function calls a lot (as in a tight loop), or do they happen once over the lifetime of the program?

Every optimizing compiler will generate exactly (or almost exactly) the same code for the function body.
The only difference is the invocation (i.e. the function call). Structs are passed by value, so when the parameter is not a pointer to the struct, the whole struct has to be placed on the stack (in a typical implementation).
https://godbolt.org/z/Fx5tvG
The function call when passing by the pointer:
x: # #x
mov edi, offset Foo
jmp sum # TAILCALL
The function call when passed by the value:
y: # #y
push rbx
sub rsp, 416
lea rbx, [rsp + 208]
mov esi, offset Foo
mov edx, 208
mov rdi, rbx
call memcpy
mov ecx, 26
mov rdi, rsp
mov rsi, rbx
rep movsq es:[rdi], [rsi]
call sum1
add rsp, 416
pop rbx
ret
The difference is obvious.
The functions are:
struct Foo
{
int x;
int y[50];
int z;
} Foo;
int __attribute__((noinline)) sum(struct Foo *foo_struct);
int __attribute__((noinline)) sum1(const struct Foo foo_struct);
int x()
{
return sum(&Foo);
}
int y()
{
return sum1(Foo);
}
For the rest of the code please follow the godbolt link

Without optimization, gcc 9.2 compiles the pointer version to:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax+4]
add edx, eax
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax+8]
add eax, edx
pop rbp
ret
and the const version to:
push rbp
mov rbp, rsp
mov rdx, rdi
mov eax, esi
mov QWORD PTR [rbp-16], rdx
mov DWORD PTR [rbp-8], eax
mov edx, DWORD PTR [rbp-16]
mov eax, DWORD PTR [rbp-12]
add edx, eax
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
Passing the struct by value means the callee gets its own copy: here the 12-byte struct arrives in rdi and esi and is stored into the function's stack frame, and a larger struct would be copied through memory instead. Passing a pointer means that only enough room for the pointer needs to be allocated in the stack frame, no matter how large the struct is. Because of this, the pointer version will be more memory efficient for large structs. I think it's possible that accessing the data through the pointer might be slower than accessing it within the stack frame if the pointer points far away (making the struct version potentially faster), but I'm not sure about that.

How is main() called? Call to main() inside __libc_start_main()

I am trying to understand the call to main() inside __libc_start_main(). I know one of the parameters of __libc_start_main() is the address of main(), but I cannot figure out how main() is called inside __libc_start_main(), as there is no CALL or JMP opcode. I see the following disassembly right before execution jumps to main():
0x7ffff7ded08b <__libc_start_main+203>: lea rax,[rsp+0x20]
0x7ffff7ded090 <__libc_start_main+208>: mov QWORD PTR fs:0x300,rax
=> 0x7ffff7ded099 <__libc_start_main+217>: mov rax,QWORD PTR [rip+0x1c3e10] # 0x7ffff7fb0eb0
I wrote a simple "Hello, World!!" in C. In the assembly above:
The execution jumps to main() right after instruction at address 0x7ffff7ded099.
Why is the MOV (to RAX) instruction causing a jump to main()?

Well, of course those instructions are not the ones that cause the call to main. I am not sure how you are stepping through those instructions, but if you are using GDB, you should use stepi instead of nexti.
I don't know why this happens precisely (some strange GDB or x86 quirk?) so I only speak from personal experience, but when reverse-engineering ELF binaries, I occasionally find that the nexti command executes several instructions before breaking. In your case, it misses a few movs before the actual call rax to call main().
What you can do to remedy this is either use stepi, or dump more code and then explicitly tell GDB to set breakpoints:
(gdb) x/20i
0x7ffff7ded08b <__libc_start_main+203>: lea rax,[rsp+0x20]
0x7ffff7ded090 <__libc_start_main+208>: mov QWORD PTR fs:0x300,rax
=> 0x7ffff7ded099 <__libc_start_main+217>: mov rax,QWORD PTR [rip+0x1c3e10] # 0x7ffff7fb0eb0
... more lines ...
... find call rax ...
(gdb) b *0x7ffff7dedXXX <= replace this
(gdb) continue
Here's what __libc_start_main() on my system does to call main():
21b6f: 48 8d 44 24 20 lea rax,[rsp+0x20] ; start preparing args
21b74: 64 48 89 04 25 00 03 mov QWORD PTR fs:0x300,rax
21b7b: 00 00
21b7d: 48 8b 05 24 93 3c 00 mov rax,QWORD PTR [rip+0x3c9324]
21b84: 48 8b 74 24 08 mov rsi,QWORD PTR [rsp+0x8]
21b89: 8b 7c 24 14 mov edi,DWORD PTR [rsp+0x14]
21b8d: 48 8b 10 mov rdx,QWORD PTR [rax]
21b90: 48 8b 44 24 18 mov rax,QWORD PTR [rsp+0x18] ; get address of main
21b95: ff d0 call rax ; actual call to main()
21b97: 89 c7 mov edi,eax
21b99: e8 32 16 02 00 call 431d0 <exit@@GLIBC_2.2.5> ; exit(result of main)
The first three instructions are the same ones you show. At the moment of call rax, rax will contain the address of main. After calling main, the result is moved into edi (first argument) and exit(result) is called.
Looking at glibc's source code for __libc_start_main(), we can see that this is exactly what happens:
/* ... */
#ifdef HAVE_CLEANUP_JMP_BUF
int not_first_call;
not_first_call = setjmp ((struct __jmp_buf_tag *) unwind_buf.cancel_jmp_buf);
if (__glibc_likely (! not_first_call))
{
/* ... a bunch of stuff ... */
/* Run the program. */
result = main (argc, argv, __environ MAIN_AUXVEC_PARAM);
}
else
{
/* ... a bunch of stuff ... */
}
#else
/* Nothing fancy, just call the function. */
result = main (argc, argv, __environ MAIN_AUXVEC_PARAM);
#endif
exit (result);
}
In my case I can see from the disassembly that HAVE_CLEANUP_JMP_BUF was defined when my glibc was compiled, so the actual call to main() is the one inside the if. I also suspect this is the case for your glibc.

How does this program know the exact location where this string is stored?

I have disassembled a C program with Radare2. Inside this program there are many calls to scanf like the following:
0x000011fe 488d4594 lea rax, [var_6ch]
0x00001202 4889c6 mov rsi, rax
0x00001205 488d3df35603. lea rdi, [0x000368ff] ; "%d" ; const char *format
0x0000120c b800000000 mov eax, 0
0x00001211 e86afeffff call sym.imp.__isoc99_scanf ; int scanf(const char *format)
0x00001216 8b4594 mov eax, dword [var_6ch]
0x00001219 83f801 cmp eax, 1 ; rsi ; "ELF\x02\x01\x01"
0x0000121c 740a je 0x1228
Here scanf has the address of the string "%d" passed to it from the line lea rdi, [0x000368ff]. I'm assuming 0x000368ff is the location of "%d" in the executable file, because if I restart Radare2 in debugging mode (r2 -d ./exec) then lea rdi, [0x000368ff] is replaced by lea rdi, [someMemoryAddress].
If lea rdi, [0x000368ff] is what's hard-coded in the file, then how does the instruction change to the actual memory address when run?

Radare is tricking you: what you see is not the real instruction, but a simplified form of it.
The real instruction is:
0x00001205 488d3df3560300 lea rdi, qword [rip + 0x356f3]
0x0000120c b800000000 mov eax, 0
This is a typical position independent lea. The string to use is stored in your binary at the offset 0x000368ff, but since the executable is position independent, the real address needs to be calculated at runtime. Since the next instruction is at offset 0x0000120c, you know that, no matter where the binary is loaded in memory, the address you want will be rip + (0x000368ff - 0x0000120c) = rip + 0x356f3, which is what you see above.
When doing static analysis, since Radare does not know the base address of the binary in memory, it simply calculates 0x0000120c + 0x356f3 = 0x000368ff. This makes reverse engineering easier, but can be confusing since the real instruction is different.
As an example, the following program:
#include <stdio.h>
int main(void) {
    puts("Hello world!");
}
When compiled produces:
6b4: 48 8d 3d 99 00 00 00 lea rdi,[rip+0x99]
6bb: e8 a0 fe ff ff call 560 <puts@plt>
So rip + 0x99 = 0x6bb + 0x99 = 0x754, and if we take a look at offset 0x754 in the binary with hd:
$ hd -s 0x754 -n 16 a.out
00000754 48 65 6c 6c 6f 20 77 6f 72 6c 64 21 00 00 00 00 |Hello world!....|
00000764

The full instruction is
48 8d 3d f3 56 03 00
This instruction is literally
lea rdi, [rip + 0x000356f3]
with a rip relative addressing mode. The instruction pointer rip has the value 0x0000120c when the instruction is executed, thus rdi receives the desired value 0x000368ff.
If this is not the real address, it is possible that your program is a position-independent executable (PIE) which is subject to relocation. Since the address is encoded using a rip-relative addressing mode, no relocation is needed and the address is correct, regardless of where the binary is loaded.

Shellcode Segmentation Fault error when run from exploitable program

BITS 64
section .text
global _start
_start:
jmp short two
one:
pop rbx
xor al,al
xor cx,cx
mov al,8
mov cx,0755
int 0x80
xor al,al
inc al
xor bl,bl
int 0x80
two:
call one
db 'H'
This is my assembly code.
Then I used two commands: "nasm -f elf64 newdir.s -o newdir.o" and "ld newdir.o -o newdir". Running ./newdir worked fine, but when I extracted the opcodes and tried to test the shellcode using the following C program, it did not work (no segmentation fault). I compiled it using gcc newdir -z execstack
#include <stdio.h>
char sh[]="\xeb\x16\x5b\x30\xc0\x66\x31\xc9\xb0\x08\x66\xb9\xf3\x02\xcd\x80\x30\xc0\xfe\xc0\x30\xdb\xcd\x80\xe8\xe5\xff\xff\xff\x48";
void main(int argc, char **argv)
{
int (*func)();
func = (int (*)()) sh;
(int)(*func)();
}
objdump -d newdir
newdir: file format elf64-x86-64
Disassembly of section .text:
0000000000400080 <_start>:
400080: eb 16 jmp 400098 <two>
0000000000400082 <one>:
400082: 5b pop %rbx
400083: 30 c0 xor %al,%al
400085: 66 31 c9 xor %cx,%cx
400088: b0 08 mov $0x8,%al
40008a: 66 b9 f3 02 mov $0x2f3,%cx
40008e: cd 80 int $0x80
400090: 30 c0 xor %al,%al
400092: fe c0 inc %al
400094: 30 db xor %bl,%bl
400096: cd 80 int $0x80
0000000000400098 <two>:
400098: e8 e5 ff ff ff callq 400082 <one>
40009d: 48 rex.W
When I run ./a.out, I get something like what is shown in the photo. I am attaching a photo because I can't explain what is happening.
P.S. My problem is resolved, but I wanted to know where things were going wrong, so I used a debugger. The result is below:
(gdb) list
1 char shellcode[] = "\xeb\x16\x5b\x30\xc0\x66\x31\xc9\xb0\x08\x66\xb9\xf3\x02\xcd\x80\x30\xc0\xfe\xc0\x30\xdb\xcd\x80\xe8\xe5\xff\xff\xff\x48";
2 int main (int argc, char **argv)
3 {
4 int (*ret)();
5 ret = (int(*)())shellcode;
6
7 (int)(*ret)();
8 }
(gdb) disassemble main
Dump of assembler code for function main:
0x00000000000005fa <+0>: push %rbp
0x00000000000005fb <+1>: mov %rsp,%rbp
0x00000000000005fe <+4>: sub $0x20,%rsp
0x0000000000000602 <+8>: mov %edi,-0x14(%rbp)
0x0000000000000605 <+11>: mov %rsi,-0x20(%rbp)
0x0000000000000609 <+15>: lea 0x200a20(%rip),%rax # 0x201030 <shellcode>
0x0000000000000610 <+22>: mov %rax,-0x8(%rbp)
0x0000000000000614 <+26>: mov -0x8(%rbp),%rdx
0x0000000000000618 <+30>: mov $0x0,%eax
0x000000000000061d <+35>: callq *%rdx
0x000000000000061f <+37>: mov $0x0,%eax
0x0000000000000624 <+42>: leaveq
0x0000000000000625 <+43>: retq
End of assembler dump.
(gdb) b 7
Breakpoint 1 at 0x614: file test.c, line 7.
(gdb) run
Starting program: /root/Desktop/Progs/shell/a.out
Breakpoint 1, main (argc=1, argv=0x7fffffffe2b8) at test.c:7
7 (int)(*ret)();
(gdb) info registers rip
rip 0x555555554614 0x555555554614 <main+26>
(gdb) x/5i $rip
=> 0x555555554614 <main+26>: mov -0x8(%rbp),%rdx
0x555555554618 <main+30>: mov $0x0,%eax
0x55555555461d <main+35>: callq *%rdx
0x55555555461f <main+37>: mov $0x0,%eax
0x555555554624 <main+42>: leaveq
(gdb) s
(Control got stuck here, so i pressed ctrl+c)
^C
Program received signal SIGINT, Interrupt.
0x0000555555755048 in shellcode ()
(gdb) x/5i 0x0000555555755048
=> 0x555555755048 <shellcode+24>: callq 0x555555755032 <shellcode+2>
0x55555575504d <shellcode+29>: rex.W add %al,(%rax)
0x555555755050: add %al,(%rax)
0x555555755052: add %al,(%rax)
0x555555755054: add %al,(%rax)
Here is the debugging information. I am not able to find where control goes wrong. If you need more info, please ask.

Below is a working example for x86-64, which could be further optimized for size. The trailing 0x00 byte is fine for the purpose of executing the shellcode.
assemble & link:
$ nasm -felf64 -g -F dwarf pushpam_001.s -o pushpam_001.o && ld pushpam_001.o -o pushpam_001
Code:
BITS 64
section .text
global _start
_start:
jmp short two
one:
pop rdi ; pathname
xor rax, rax
add al, 85 ; creat syscall 64-bit Linux
xor rsi, rsi
add si, 0755 ; mode: NASM reads 0755 as decimal 755 (0x2f3); write 0o755 for octal
syscall
xor rax, rax
add ax, 60
xor rdi, rdi
syscall
two:
call one
db 'H',0
objdump:
pushpam_001: file format elf64-x86-64
0000000000400080 <_start>:
400080: eb 1c jmp 40009e <two>
0000000000400082 <one>:
400082: 5f pop rdi
400083: 48 31 c0 xor rax,rax
400086: 04 55 add al,0x55
400088: 48 31 f6 xor rsi,rsi
40008b: 66 81 c6 f3 02 add si,0x2f3
400090: 0f 05 syscall
400092: 48 31 c0 xor rax,rax
400095: 66 83 c0 3c add ax,0x3c
400099: 48 31 ff xor rdi,rdi
40009c: 0f 05 syscall
000000000040009e <two>:
40009e: e8 df ff ff ff 48 00
.....H.
encoding extraction: There are many other ways to do this.
$ for i in `objdump -d pushpam_001 | grep "^ " | cut -f2`; do echo -n '\x'$i; done; echo
\xeb\x1c\x5f\x48\x31\xc0\x04\x55\x48\x31\xf6\x66\x81\xc6\xf3\x02\x0f\x05\x48\x31\xc0\x66\x83\xc0\x3c\x48\x31\xff\x0f\x05\xe8\xdf\xff\xff\xff\x48\x00\x.....H.
C shellcode.c - partial
...
unsigned char code[] = \
"\xeb\x1c\x5f\x48\x31\xc0\x04\x55\x48\x31\xf6\x66\x81\xc6\xf3\x02\x0f\x05\x48\x31\xc0\x66\x83\xc0\x3c\x48\x31\xff\x0f\x05\xe8\xdf\xff\xff\xff\x48\x00";
...
final:
./shellcode
--wxrw---t 1 david david 0 Jan 31 12:25 H

If int 0x80 in 64-bit code was the only problem, building your C test with gcc -fno-pie -no-pie would have worked, because then char sh[] would be in the low 32 bits of virtual address space, so system calls that truncate pointers to 32 bits would still work.
Run your program under strace to see what system calls it actually makes. (Except that strace decodes int 0x80 syscalls incorrectly in 64-bit code, decoding as if you'd used the 64-bit syscall ABI. The call numbers and arg registers are different.) But at least you can see the system-call return values (which will be -EFAULT for 32-bit creat with a truncated 64-bit pointer.)
You can also just use gdb to single-step and check the system call return values. Having strace decode the system-call inputs is really nice, though, so I'd recommend porting your code to use the 64-bit ABI, and then it would just work.
Also, it would actually be able to exploit 64-bit processes where the buffer overflow is in memory at an address outside the low 32 bits. (e.g. like the stack). So yes, you should really stop using int 0x80 or stick to 32-bit code.
You're also depending on registers being zeroed before your code runs, like they are on process startup, but not when called from anywhere else.
xor al,al before mov al,8 is completely pointless, because xor-zeroing al doesn't clear upper bytes. Writing 32-bit registers clears the upper 32, but not writing 8 or 16 bit registers. And if it did, you wouldn't need the xor-zeroing before using mov which is also write-only.
If you want to set RAX=8 without any zero bytes in the machine code, you can
push 8 / pop rax (3 bytes)
xor eax,eax / mov al,8 (4 bytes)
Or given a zeroed rcx register, lea eax, [rcx+8] (3 bytes)
Setting CX to 0755 isn't so simple, because the constant doesn't fit in an imm8. Your 16-bit mov is a good choice (or would have been if you'd zeroed rcx first).
xor ecx,ecx
lea eax, [rcx+8] ; SYS_creat = 8 from unistd_32.h
mov cx, 0755 ; mode (NASM reads 0755 as decimal; 0o755 is octal)
int 0x80 ; invoke 32-bit ABI
xor ebx,ebx
lea eax, [rbx+1] ; SYS_exit = 1
int 0x80
