Why does gcc reorder the local variable in function? - c

I wrote a C program that just reads/writes a large array. I compiled the program with the command gcc -O0 program.c -o program. Out of curiosity, I disassembled the program with the objdump -S command.
The code and assembly of the read_array and write_array functions are attached at the end of this question.
I'm trying to interpret how gcc compiles the function. I used // to add my comments and questions.
Take this piece from the beginning of the assembly code of the write_array() function:
4008c1: 48 89 7d e8 mov %rdi,-0x18(%rbp) // this is the first parameter of the function
4008c5: 48 89 75 e0 mov %rsi,-0x20(%rbp) // this is the second parameter of the function
4008c9: c6 45 ff 01 movb $0x1,-0x1(%rbp) // comparing with the source code, I think this is the `char tmp` variable
4008cd: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%rbp) // this should be the `int i` variable.
What I don't understand is:
1) char tmp is obviously defined after int i in the write_array function. Why does gcc reorder the memory locations of these two local variables?
2) From the offsets, int i is at -0x8(%rbp) and char tmp is at -0x1(%rbp), which would suggest that int i takes 7 bytes. This is quite weird because int i should be 4 bytes on an x86-64 machine, shouldn't it? My speculation is that gcc is doing some alignment?
3) I found the gcc optimization choices quite interesting. Are there some good documents/books that explain how gcc works? (This third question may be off-topic, and if you think so, please just ignore it. I'm just trying to see if there is a shortcut to learning the underlying mechanisms gcc uses for compilation. :-) )
Below is the piece of function code:
#define CACHE_LINE_SIZE 64
static inline void
read_array(char* array, long size)
{
int i;
char tmp;
for ( i = 0; i < size; i+= CACHE_LINE_SIZE )
{
tmp = array[i];
}
return;
}
static inline void
write_array(char* array, long size)
{
int i;
char tmp = 1;
for ( i = 0; i < size; i+= CACHE_LINE_SIZE )
{
array[i] = tmp;
}
return;
}
Below is the piece of disassembled code for write_array, from gcc -O0:
00000000004008bd <write_array>:
4008bd: 55 push %rbp
4008be: 48 89 e5 mov %rsp,%rbp
4008c1: 48 89 7d e8 mov %rdi,-0x18(%rbp)
4008c5: 48 89 75 e0 mov %rsi,-0x20(%rbp)
4008c9: c6 45 ff 01 movb $0x1,-0x1(%rbp)
4008cd: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%rbp)
4008d4: eb 13 jmp 4008e9 <write_array+0x2c>
4008d6: 8b 45 f8 mov -0x8(%rbp),%eax
4008d9: 48 98 cltq
4008db: 48 03 45 e8 add -0x18(%rbp),%rax
4008df: 0f b6 55 ff movzbl -0x1(%rbp),%edx
4008e3: 88 10 mov %dl,(%rax)
4008e5: 83 45 f8 40 addl $0x40,-0x8(%rbp)
4008e9: 8b 45 f8 mov -0x8(%rbp),%eax
4008ec: 48 98 cltq
4008ee: 48 3b 45 e0 cmp -0x20(%rbp),%rax
4008f2: 7c e2 jl 4008d6 <write_array+0x19>
4008f4: 5d pop %rbp
4008f5: c3 retq

Even at -O0, gcc doesn't emit definitions for static inline functions unless there's a caller. In that case, it doesn't actually inline: instead it emits a stand-alone definition. So I guess your disassembly is from that.
Are you using a really old gcc version? gcc 4.6.4 puts the vars in that order on the stack, but 4.7.3 and later use the other order:
movb $1, -5(%rbp) #, tmp
movl $0, -4(%rbp) #, i
In your asm, they're stored in order of initialization rather than declaration, but I think that's just by chance, since the order changed with gcc 4.7. Also, tacking on an initializer like int i=1; doesn't change the allocation order, so that completely torpedoes that theory.
Remember that gcc is designed around a series of transformations from source to asm, so -O0 doesn't mean "no optimization". You should think of -O0 as leaving out some things that -O3 normally does. There is no option that tries to make a literal-as-possible translation from source to asm.
Once gcc does decide which order to allocate space for them:
the char at rbp-1: That's the first available location that can hold a char. If there was another char that needed storing, it could go at rbp-2.
the int at rbp-8: Since the 4 bytes from rbp-1 to rbp-4 aren't free, the next available naturally-aligned location is rbp-8.
Or with gcc 4.7 and newer, -4 is the first available spot for an int, and -5 is the next byte below that.
RE: space saving:
It's true that putting the char at -5 makes the lowest touched address %rsp-5, instead of %rsp-8, but this doesn't save anything.
The stack pointer is 16B-aligned in the AMD64 SysV ABI. (Technically, %rsp+8 (the start of stack args) is aligned on function entry, before you push anything.) The only way for %rbp-8 to touch a new page or cache-line that %rbp-5 wouldn't is for the stack to be less than 4B-aligned. This is extremely unlikely, even in 32bit code.
As far as how much stack is "allocated" or "owned" by the function: In the AMD64 SysV ABI, the function "owns" the red zone of 128B below %rsp (that size was chosen because a one-byte displacement can go down to -128). Signal handlers and any other asynchronous users of the user-space stack will avoid clobbering the red zone, which is why the function can write to memory below %rsp without decrementing %rsp. So from that perspective, it doesn't matter how much of the red zone we use; the chances of a signal handler running out of stack are unaffected.
In 32bit code, where there's no redzone, for either order gcc reserves space on the stack with sub $16, %esp. (try with -m32 on godbolt). So again, it doesn't matter whether we use 5 or 8 bytes, because we reserve in units of 16.
When there are many char and int variables, gcc packs the chars into 4B groups, instead of losing space to fragmentation, even when the declarations are mixed together:
void many_vars(void) {
char tmp = 1; int i=1;
char t2 = 2; int i2 = 2;
char t3 = 3; int i3 = 3;
char t4 = 4;
}
with gcc 4.6.4 -O0 -fverbose-asm, which helpfully labels which store is for which variable; this is one reason compiler asm output is preferable to disassembly:
pushq %rbp #
movq %rsp, %rbp #,
movb $1, -4(%rbp) #, tmp
movl $1, -16(%rbp) #, i
movb $2, -3(%rbp) #, t2
movl $2, -12(%rbp) #, i2
movb $3, -2(%rbp) #, t3
movl $3, -8(%rbp) #, i3
movb $4, -1(%rbp) #, t4
popq %rbp #
ret
I think variables go in either forward or reverse order of declaration, depending on gcc version, at -O0.
I made a version of your read_array function that works with optimization on:
// assumes that size is non-zero. Use a while() instead of do{}while() if you want extra code to check for that case.
void read_array_good(const char* array, size_t size) {
const volatile char *vp = array;
do {
(void) *vp; // this counts as accessing the volatile memory, with gcc/clang at least
vp += CACHE_LINE_SIZE/sizeof(vp[0]);
} while (vp < array+size);
}
Compiles to the following, with gcc 5.3 -O3 -march=haswell:
addq %rdi, %rsi # array, D.2434
.L11:
movzbl (%rdi), %eax # MEM[(const char *)array_1], D.2433
addq $64, %rdi #, array
cmpq %rsi, %rdi # D.2434, array
jb .L11 #,
ret
Casting an expression to void is the canonical way to tell the compiler that a value is used. e.g. to suppress unused-variable warnings, you can write (void)my_unused_var;.
For gcc and clang, doing that with a volatile pointer dereference does generate a memory access, with no need for a tmp variable. The C standard is very non-specific about what constitutes access to something that's volatile, so this probably isn't perfectly portable. Another way is to xor the values you read into an accumulator, and then store that to a global. As long as you don't use whole-program optimization, the compiler doesn't know that nothing reads the global, so it can't optimize away the calculation.
See the vmtouch source code for an example of this second technique. (It actually uses a global variable for the accumulator, which makes clunky code. Of course, that hardly matters since it's touching pages, not just cache lines, so it very quickly bottlenecks on TLB misses and page faults, even with a memory read-modify-write in the loop-carried dependency chain.)
I tried and failed to write something that gcc or clang would compile to a function with no prologue (which assumes that size is initially non-zero). GCC always wants to add rsi,rdi for a cmp/jcc loop condition, even with -march=haswell where sub rsi,64/jae can macro-fuse just as well as cmp/jcc. But in general on AMD, what GCC emits has fewer uops inside the loop.
read_array_handtuned_haswell:
.L0
movzx eax, byte [rdi] ; overwrite the full RAX to avoid any partial-register false deps from writing AL
add rdi, 64
sub rsi, 64
jae .L0 ; or ja, depending on what semantics you want
ret
Godbolt Compiler Explorer link with all my attempts and trial versions
I can get similar code if the loop-termination condition is je, in a loop like do { ... } while( size -= CL_SIZE ); but I can't seem to convince gcc to catch unsigned borrow when subtracting. It wants to subtract and then cmp -64/jb to detect underflow. It's not that hard to get compilers to check the carry flag after an add to detect carry :/
It's also easy to get compilers to make a 4-insn loop, but not without a prologue. e.g. calculate an end pointer (array+size) and increment a pointer until it's greater than or equal to the end.
Fortunately this is not a big deal; the loop we do get is good.

For local variables saved on the stack, the address order depends on the direction the stack grows. You can refer to Does stack grow upward or downward? for more information.
This is quite weird because int i should be 4 bytes on x86-64 machine. Isn't it?
Actually, on common x86-64 ABIs int is still 4 bytes; under the LP64 model it's long and pointers that are 8 bytes. You can confirm it by writing a test application that prints sizeof(int).

Related

Tiny C Compiler's generated code emits extra (unnecessary?) NOPs and JMPs

Can someone explain why this code:
#include <stdio.h>
int main()
{
return 0;
}
when compiled with tcc using tcc code.c produces this asm:
00401000 |. 55 PUSH EBP
00401001 |. 89E5 MOV EBP,ESP
00401003 |. 81EC 00000000 SUB ESP,0
00401009 |. 90 NOP
0040100A |. B8 00000000 MOV EAX,0
0040100F |. E9 00000000 JMP fmt_vuln1.00401014
00401014 |. C9 LEAVE
00401015 |. C3 RETN
I guess that
00401009 |. 90 NOP
is maybe there for some memory alignment, but what about
0040100F |. E9 00000000 JMP fmt_vuln1.00401014
00401014 |. C9 LEAVE
I mean, why would the compiler insert this near jump that jumps to the next instruction? LEAVE would execute anyway.
I'm on 64-bit Windows generating 32-bit executable using TCC 0.9.26.
Superfluous JMP before the Function Epilogue
The JMP at the bottom that goes to the next statement, this was fixed in a commit. Version 0.9.27 of TCC resolves this issue:
When 'return' is the last statement of the top-level block
(very common and often recommended case) jump is not needed.
As for the reason it existed in the first place? The idea is that each function has a possible common exit point. If there is a block of code with a return in it at the bottom, the JMP goes to a common exit point where stack cleanup is done and the ret is executed. Originally the code generator also emitted the JMP instruction erroneously at the end of the function if it appeared just before the final } (closing brace). The fix checks whether there is a return statement followed by a closing brace at the top level of the function. If there is, the JMP is omitted.
An example of code that has a return at a lower scope before a closing brace:
int main(int argc, char *argv[])
{
if (argc == 3) {
argc++;
return argc;
}
argc += 3;
return argc;
}
The generated code looks like:
401000: 55 push ebp
401001: 89 e5 mov ebp,esp
401003: 81 ec 00 00 00 00 sub esp,0x0
401009: 90 nop
40100a: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
40100d: 83 f8 03 cmp eax,0x3
401010: 0f 85 11 00 00 00 jne 0x401027
401016: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
401019: 89 c1 mov ecx,eax
40101b: 40 inc eax
40101c: 89 45 08 mov DWORD PTR [ebp+0x8],eax
40101f: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
; Jump to common function exit point. This is the `return argc` inside the if statement
401022: e9 11 00 00 00 jmp 0x401038
401027: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
40102a: 83 c0 03 add eax,0x3
40102d: 89 45 08 mov DWORD PTR [ebp+0x8],eax
401030: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
; Jump to common function exit point. This is the `return argc` at end of the function
401033: e9 00 00 00 00 jmp 0x401038
; Common function exit point
401038: c9 leave
401039: c3 ret
In versions prior to 0.9.27 the return argc inside the if statement would jump to a common exit point (function epilogue). Likewise, the return argc at the bottom of the function also jumps to the same common exit point of the function. The problem is that the common exit point for the function happens to be right after the top-level return argc, so the side effect is an extra JMP that happens to be to the next instruction.
NOP after Function Prologue
The NOP isn't for alignment. Because of the way Windows implements guard pages for the stack (Programs that are in Portable Executable format) TCC has two types of prologues. If the local stack space required < 4096 (smaller than a single page) then you see this kind of code generated:
401000: 55 push ebp
401001: 89 e5 mov ebp,esp
401003: 81 ec 00 00 00 00 sub esp,0x0
The sub esp,0 isn't optimized out. It is the amount of stack space needed for local variables (in this case 0). If you add some local variables you will see the 0x0 in the SUB instruction changes to coincide with the amount of stack space needed for local variables. This prologue requires 9 bytes. There is another prologue to handle the case where the stack space needed is >= 4096 bytes. If you add an array of 4096 bytes with something like:
char somearray[4096];
and look at the resulting instruction you will see the function prologue change to a 10 byte prologue:
401000: b8 00 10 00 00 mov eax,0x1000
401005: e8 d6 00 00 00 call 0x4010e0
TCC's code generator assumes that the function prologue is always 10 bytes when targeting WinPE. This is primarily because TCC is a single-pass compiler. The compiler doesn't know how much stack space a function will use until after the function is processed. To get around not knowing this ahead of time, TCC pre-allocates 10 bytes for the prologue, enough to fit the larger of the two variants. Anything shorter is padded to 10 bytes.
In the case where stack space needed < 4096 bytes the instructions used total 9 bytes. The NOP is used to pad the prologue to 10 bytes. For the case where >= 4096 bytes are needed, the number of bytes is passed in EAX and the function __chkstk is called to allocate the required stack space instead.
TCC is not an optimizing compiler, at least not really. Every single instruction it emitted for main is sub-optimal or not needed at all, except the ret. IDK why you thought the JMP was the only instruction that might not make sense for performance.
This is by design: TCC stands for Tiny C Compiler. The compiler itself is designed to be simple, so it intentionally doesn't include code to look for many kinds of optimizations. Notice the sub esp, 0: this useless instruction clearly comes from filling in a function-prologue template, and TCC doesn't even look for the special case where the offset is 0 bytes. Other functions need stack space for locals, or to align the stack before any child function calls, but this main() doesn't. TCC doesn't care, and blindly emits sub esp,0 to reserve 0 bytes.
(In fact, TCC is truly one-pass, laying out machine code as it goes through the C source statement by statement. It uses the imm32 encoding for sub so it will have room to fill in the right number (upon reaching the end of the function) even if it turns out the function uses more than 255 bytes of stack space. So instead of constructing a list of instructions in memory to finish assembling later, it just remembers one spot to fill in a uint32_t. That's why it can't omit the sub when it turns out not to be needed.)
Most of the work in creating a good optimizing compiler that anyone will use in practice is the optimizer. Even parsing modern C++ is peanuts compared to reliably emitting efficient asm (which not even gcc / clang / icc can do all the time, even without considering autovectorization). Just generating working but inefficient asm is easy compared to optimizing; most of gcc's codebase is optimization, not parsing. See Basile's answer on Why are there so few C compilers?
The JMP (as you can see from #MichaelPetch's answer) has a similar explanation: TCC (until recently) didn't optimize the case where a function only has one return path, and doesn't need to JMP to a common epilogue.
There's even a NOP in the middle of the function. It's obviously a waste of code bytes and decode / issue front-end bandwidth and out-of-order window size. (Sometimes executing a NOP outside a loop or something is worth it to align the top of a loop which is branched to repeatedly, but a NOP in the middle of a basic block is basically never worth it, so that's not why TCC put it there. And if a NOP did help, you could probably do even better by reordering instructions or choosing larger instructions to do the same thing without a NOP. Even proper optimizing compilers like gcc/clang/icc don't try to predict this kind of subtle front-end effect.)
#MichaelPetch points out that TCC always wants its function prologue to be 10 bytes, because it's a single-pass compiler (and it doesn't know how much space it needs for locals until the end of the function, when it comes back and fills in the imm32). But Windows targets need stack probes when modifying ESP / RSP by more than a whole page (4096 bytes), and the alternate prologue for that case is 10 bytes, instead of 9 for the normal one without the NOP. So this is another tradeoff favouring compilation speed over good asm.
An optimizing compiler would xor-zero EAX (because that's smaller and at least as fast as mov eax,0), and leave out all the other instructions. Xor-zeroing is one of the most well-known / common / basic x86 peephole optimizations, and has several advantages other than code-size on some modern x86 microarchitectures.
main:
xor eax,eax
ret
Some optimizing compilers might still make a stack frame with EBP, but tearing it down with pop ebp would be strictly better than leave on all CPUs, for this special case where ESP = EBP so the mov esp,ebp part of leave isn't needed. pop ebp is still 1 byte, but it's also a single-uop instruction on modern CPUs, unlike leave which is 2 or 3 on modern CPUs. (http://agner.org/optimize/, and see also other performance optimization links in the x86 tag wiki.) This is what gcc does. It's a fairly common situation; if you push some other registers after making a stack frame, you have to point ESP at the right place before pop ebx or whatever. (Or use mov to restore them.)
The benchmarks TCC cares about are compilation speed, not quality (speed or size) of the resulting code. For example, the TCC web site has a benchmark in lines/sec and MB/sec (of C source) vs. gcc3.2 -O0, where it's ~9x faster on a P4.
However, TCC is not totally braindead: it will apparently do some inlining, and as Michael's answer points out, a recent patch does leave out the JMP (but still not the useless sub esp, 0).

Is it practical to create a C language addon for anonymous functions?

I know that C compilers are capable of taking standalone code, and generating standalone shellcode out of it for the specific system they are targeting.
For example, given the following in anon.c:
int give3() {
return 3;
}
I can run
gcc anon.c -o anon.obj -c
objdump -D anon.obj
which gives me (on MinGW):
anon1.obj: file format pe-i386
Disassembly of section .text:
00000000 <_give3>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: b8 03 00 00 00 mov $0x3,%eax
8: 5d pop %ebp
9: c3 ret
a: 90 nop
b: 90 nop
So I can make main like this:
main.c
#include <stdio.h>
#include <stdint.h>
int main(int argc, char **argv)
{
uint8_t shellcode[] = {
0x55,
0x89, 0xe5,
0xb8, 0x03, 0x00, 0x00, 0x00,
0x5d, 0xc3,
0x90,
0x90
};
int (*p_give3)() = (int (*)())shellcode;
printf("%d.\n", (*p_give3)());
}
My question is: is it practical to automate the process of converting a self-contained anonymous function that does not refer to anything outside its own scope or its arguments?
eg:
#include <stdio.h>
#include <stdint.h>
int main(int argc, char **argv)
{
uint8_t shellcode[] = [#[
int anonymous() {
return 3;
}
]];
int (*p_give3)() = (int (*)())shellcode;
printf("%d.\n", (*p_give3)());
}
Which would compile the text into shellcode, and place it into the buffer?
The reason I ask is that I really like writing C, but making pthreads and callbacks is incredibly painful; and as soon as you go one step above C to get the notion of "lambdas", you lose your language's ABI (e.g. C++ has lambdas, but everything you do in C++ is suddenly implementation-dependent), and "Lisplike" scripting addons (e.g. plugging in Lisp, Perl, JavaScript/V8, or any other runtime that already knows how to generalize callbacks) make callbacks very easy, but also much more expensive than tossing shellcode around.
If this is practical, then it is possible to put functions which are only called once into the body of the function calling them, thus reducing global-scope pollution. It also means that you do not need to generate the shellcode manually for each system you are targeting, since each system's C compiler already knows how to turn self-contained C into assembly, so why should you do its job for it and ruin the readability of your own code with a bunch of binary blobs?
So the question is: is this practical(for functions which are perfectly self contained, eg even if they want to call puts, puts has to be given as an argument or inside a hash table/struct in an argument)? Or is there some issue preventing this from being practical?
Apple has implemented a very similar feature in clang, where it's called "blocks". Here's a sample:
int main(int argc, char **argv)
{
int (^blk_give3)(void) = ^(void) {
return 3;
};
printf("%d.\n", blk_give3());
return 0;
}
More information:
Clang: Language Specification for Blocks
Wikipedia: Blocks (C language extension)
I know that C compilers are capable of taking standalone code, and generate standalone shellcode out of it for the specific system they are targeting.
Turning source into machine code is what compilation is. Shellcode is machine code with specific constraints, none of which apply to this use-case. You just want ordinary machine code like compilers generate when they compile functions normally.
AFAICT, what you want is exactly what you get from static int foo(int x){ ...; }, and then passing foo as a function pointer. i.e. a block of machine code with a label attached, in the code section of your executable.
Jumping through hoops to get compiler-generated machine code into an array is not even close to worth the portability downsides (esp. in terms of making sure the array is in executable memory).
It seems the only thing you're trying to avoid is having a separately-defined function with its own name. That's an incredibly small benefit that doesn't come close to justifying doing anything like you're suggesting in the question. AFAIK, there's no good way to achieve it in ISO C11, but:
Some compilers support nested functions as a GNU extension:
This compiles (with gcc 6.2). On Godbolt, I used -xc to compile it as C, not C++. It also compiles with ICC17, but not clang 3.9.
#include <stdlib.h>
void sort_integers(int *arr, size_t len)
{
int bar(){return 3;} // gcc warning: ISO C forbids nested functions [-Wpedantic]
int cmp(const void *va, const void *vb) {
const int *a=va, *b=vb; // taking const int* args directly gives a warning, which we could silence with a cast
return *a > *b;
}
qsort(arr, len, sizeof(int), cmp);
}
The asm output is:
cmp.2286:
mov eax, DWORD PTR [rsi]
cmp DWORD PTR [rdi], eax
setg al
movzx eax, al
ret
sort_integers:
mov ecx, OFFSET FLAT:cmp.2286
mov edx, 4
jmp qsort
Notice that no definition for bar() was emitted, because it's unused.
Programs with nested functions built without optimization will have executable stacks. (For reasons explained below). So if you use this, make sure you use optimization if you care about security.
BTW, nested functions can even access variables in their parent (like lambdas). Changing cmp into a function that does return len results in this highly surprising asm:
__attribute__((noinline))
void call_callback(int (*cb)()) {
cb();
}
void foo(int *arr, size_t len) {
int access_parent() { return len; }
call_callback(access_parent);
}
## gcc5.4
access_parent.2450:
mov rax, QWORD PTR [r10]
ret
call_callback:
xor eax, eax
jmp rdi
foo:
sub rsp, 40
mov eax, -17599
mov edx, -17847
lea rdi, [rsp+8]
mov WORD PTR [rsp+8], ax
mov eax, OFFSET FLAT:access_parent.2450
mov QWORD PTR [rsp], rsi
mov QWORD PTR [rdi+8], rsp
mov DWORD PTR [rdi+2], eax
mov WORD PTR [rdi+6], dx
mov DWORD PTR [rdi+16], -1864106167
call call_callback
add rsp, 40
ret
I just figured out what this mess is about while single-stepping it: Those MOV-immediate instructions are writing machine-code for a trampoline function to the stack, and passing that as the actual callback.
gcc must ensure that the ELF metadata in the final binary tells the OS that the process needs an executable stack (note readelf -l shows GNU_STACK with RWE permissions). So nested functions that access outside their scope prevent the whole process from having the security benefits of NX stacks. (With optimization disabled, this still affects programs that use nested functions that don't access stuff from outer scopes, but with optimization enabled gcc realizes that it doesn't need the trampoline.)
The trampoline (from gcc5.2 -O0 on my desktop) is:
0x00007fffffffd714: 41 bb 80 05 40 00 mov r11d,0x400580 # address of access_parent.2450
0x00007fffffffd71a: 49 ba 10 d7 ff ff ff 7f 00 00 movabs r10,0x7fffffffd710 # address of `len` in the parent stack frame
0x00007fffffffd724: 49 ff e3 rex.WB jmp r11
# This can't be a normal rel32 jmp, and indirect is the only way to get an absolute near jump in x86-64.
0x00007fffffffd727: 90 nop
0x00007fffffffd728: 00 00 add BYTE PTR [rax],al
...
(trampoline might not be the right terminology for this wrapper function; I'm not sure.)
This finally makes sense, because r10 is normally clobbered without saving by functions. There's no register that foo could set that would be guaranteed to still have that value when the callback is eventually called.
The x86-64 SysV ABI says that r10 is the "static chain pointer", but C/C++ don't use that. (Which is why r10 is treated like r11, as a pure scratch register).
Obviously a nested function that accesses variables in the outer scope can't be called after the outer function returns. e.g. if call_callback held onto the pointer for future use from other callers, you would get bogus results. When the nested function doesn't do that, gcc doesn't do the trampoline thing, so the function works just like a separately-defined function, so it would be a function pointer you could pass around arbitrarily.
It seems possible, but unnecessarily complicated:
shellcode.c
int anon() { return 3; }
main.c
...
uint8_t shellcode[] = {
#include "anon.shell"
};
int (*p_give3)() = (int (*)())shellcode;
printf("%d.\n", (*p_give3)());
makefile:
anon.shell:
gcc anon.c -o anon.obj -c; objdump -D anon.obj | extractShellBytes.py anon.shell
Where extractShellBytes.py is a script you write which prints only the raw comma-separated code bytes from the objdump output.

Why does initializing a variable `i` to 0 and to a large size result in the same size of the program?

There is a problem which confuses me a lot.
int main(int argc, char *argv[])
{
int i = 12345678;
return 0;
}
int main(int argc, char *argv[])
{
int i = 0;
return 0;
}
The programs have the same bytes in total. Why?
And where the literal value indeed stored? Text segment or other place?
The programs have the same bytes in total. Why?
There are two possibilities:
The compiler is optimizing out the variable. It isn't used anywhere, so keeping it makes no sense.
If 1. doesn't apply, the program sizes are equal anyway. Why shouldn't they be? 0 takes just as much storage as 12345678: two variables of the same type T occupy the same size in memory.
And where the literal value indeed stored?
On the stack. Local variables are commonly stored on the stack.
Consider your bedroom. Whether you fill it with stuff or leave it empty, does that change the area of your bedroom?
The size of an int is sizeof(int). It does not matter what value you store in it.
Because your program is optimized. At compile time, the compiler found out that i was useless and removed it.
If optimization didn't occur, another simple explanation is that one int is the same size as another int.
TL;DR
First question: They're the same size because the instructions your programs compile to are roughly the same (more on that below). Further, they're the same size because the size (number of bytes) of your ints never changes.
Second question: The i variable is stored in your local-variable frame, which is in the function's stack frame. The actual value you assign to i is hardcoded in the instructions in the text segment.
GDB and Assembly
I know you're using Windows, but consider this code and output on Linux. I used exactly the same sources you posted.
For the first one, with i = 12345678, the actual main function is these computer instructions:
(gdb) disass main
Dump of assembler code for function main:
0x00000000004004ed <+0>: push %rbp
0x00000000004004ee <+1>: mov %rsp,%rbp
0x00000000004004f1 <+4>: mov %edi,-0x14(%rbp)
0x00000000004004f4 <+7>: mov %rsi,-0x20(%rbp)
0x00000000004004f8 <+11>: movl $0xbc614e,-0x4(%rbp)
0x00000000004004ff <+18>: mov $0x0,%eax
0x0000000000400504 <+23>: pop %rbp
0x0000000000400505 <+24>: retq
End of assembler dump.
As for the other program, with i = 0, main is:
(gdb) disass main
Dump of assembler code for function main:
0x00000000004004ed <+0>: push %rbp
0x00000000004004ee <+1>: mov %rsp,%rbp
0x00000000004004f1 <+4>: mov %edi,-0x14(%rbp)
0x00000000004004f4 <+7>: mov %rsi,-0x20(%rbp)
0x00000000004004f8 <+11>: movl $0x0,-0x4(%rbp)
0x00000000004004ff <+18>: mov $0x0,%eax
0x0000000000400504 <+23>: pop %rbp
0x0000000000400505 <+24>: retq
End of assembler dump.
The only difference between the two is the actual value being stored in your variable. Let's go step by step through the lines below (my computer is x86_64, so if your architecture is different, the instructions may differ).
OPCODES
And the actual instructions of main (using objdump):
00000000004004ed <main>:
4004ed: 55 push %rbp
4004ee: 48 89 e5 mov %rsp,%rbp
4004f1: 89 7d ec mov %edi,-0x14(%rbp)
4004f4: 48 89 75 e0 mov %rsi,-0x20(%rbp)
4004f8: c7 45 fc 4e 61 bc 00 movl $0xbc614e,-0x4(%rbp)
4004ff: b8 00 00 00 00 mov $0x0,%eax
400504: 5d pop %rbp
400505: c3 retq
400506: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40050d: 00 00 00
To see the actual byte differences, dump both with objdump -D prog1 > prog1_dump and objdump -D prog2 > prog2_dump, and then diff prog1_dump prog2_dump:
2c2
< draft1: file format elf64-x86-64
---
> draft2: file format elf64-x86-64
51,58c51,58
< 400283: 00 bc f6 06 64 9f ba add %bh,-0x45609bfa(%rsi,%rsi,8)
< 40028a: 01 3b add %edi,(%rbx)
< 40028c: 14 d1 adc $0xd1,%al
< 40028e: 12 cf adc %bh,%cl
< 400290: cd 2e int $0x2e
< 400292: 11 77 5d adc %esi,0x5d(%rdi)
< 400295: 79 fe jns 400295 <_init-0x113>
< 400297: 3b .byte 0x3b
---
> 400283: 00 e8 add %ch,%al
> 400285: f1 icebp
> 400286: 6e outsb %ds:(%rsi),(%dx)
> 400287: 8a f8 mov %al,%bh
> 400289: a8 05 test $0x5,%al
> 40028b: ab stos %eax,%es:(%rdi)
> 40028c: 48 2d 3f e9 e2 b2 sub $0xffffffffb2e2e93f,%rax
> 400292: f7 06 53 df ba af testl $0xafbadf53,(%rsi)
287c287
< 4004f8: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
---
> 4004f8: c7 45 fc 4e 61 bc 00 movl $0xbc614e,-0x4(%rbp)
Note that at address 0x4004f8 your number is there: 4e 61 bc 00 in prog2 and 00 00 00 00 in prog1, both 4 bytes, which is equal to sizeof(int). The bytes c7 45 fc are the rest of the instruction (move a value into an offset from rbp). Also note that the first two sections that differ have the same size in bytes (21). So, there you go: although slightly different, they're the same size.
Step by step through Assembly Instructions
push %rbp; mov %rsp, %rbp: This is called setting up the Stack Frame, and is standard for all C functions (unless you tell gcc -fomit-frame-pointer). This enables you to access the stack and your local variables through a fixed register, in this case, rbp.
mov %edi, -0x14(%rbp): This moves the content of register edi into our local variables frame. Specifically, into offset -0x14
mov %rsi, -0x20(%rbp): Same here, but this time it saves rsi. This is part of the x86_64 calling convention (which now passes arguments in registers instead of pushing everything on the stack like x86_32). Instead of keeping them in registers, we free the registers by saving the contents in our local variables frame - registers are way faster and are the only way the CPU can actually process anything, so the more free registers we have, the better.
Note: edi is the 4-byte low part of the rdi register, and from the x86_64 calling convention we know that rdi is used for the first argument. main's first argument is int argc, so it makes sense to use a 4-byte register to store it. rsi holds the second argument, effectively a pointer to pointers to char (**argv). So, on 64bit architectures, that fits perfectly into a 64bit register.
<+11>: movl $0xbc614e,-0x4(%rbp): This is the actual line int i = 12345678 (0xbc614e = 12345678d). Now, note that the way we "move" that value is very similar to how we stored the main arguments. We use offset -0x4(%rbp) to store it in memory, in the "local variables frame" (this answers your question on where it gets stored).
mov $0x0, %eax; pop %rbp; retq: Again, dull stuff to clear up the frame pointer and return (end the program since we're in main).
Note that on the second example, the only difference is the line <+11>: movl $0x0,-0x4(%rbp), which effectively stores the value zero - in C words, int i = 0.
So, by these instructions you can see that the main function of both programs gets translated to assembly in exactly the same way, so their sizes are the same in the end. (Assuming you compiled them the same way, because the compiler also puts lots of other things in the binaries, like data, library functions, etc. On Linux, you can get a full disassembly dump using objdump -D program.)
Note 2: In these examples, you cannot see the code subtracting a value from rsp in order to allocate stack space, but that's how it's normally done.
Stack Representation
The stack would be like this for both cases (only the value of i would change, or the value at -0x4(%rbp))
| ~~~ | Higher Memory addresses
| |
+------------------+ <--- Address 0x8(%rbp)
| RETURN ADDRESS |
+------------------+ <--- Address 0x0(%rbp) // instruction push %rbp
| previous rbp |
+------------------+ <--- Address -0x4(%rbp)
| i=0xbc614e   |
+------------------+ <---- Address -0x14(%rbp)
| argc |
+------------------+ <---- address -0x20(%rbp)
| argv |
+------------------+
| |
+~~~~~~~~~~~~~~~~~~+ Lower memory addresses
Note 3: The direction to where the stack grows depends on your architecture. How data gets written in memory also depends on your architecture.
Resources
What are the calling conventions for UNIX & Linux system calls on x86-64
Call Stack
GCC Optimization Options
Understanding the Stack
How does the stack work in assembly language?
x86_64 : is stack frame pointer almost useless?

some clang-generated assembly not working in real mode (.COM, tiny memory model)

First, this is kind of a follow-up to Custom memory allocator for real-mode DOS .COM (freestanding) — how to debug?. But to have it self-contained, here's the background:
clang (and gcc, too) has an -m16 switch, so 32-bit instructions of the i386 instruction set get size-override prefixes for execution in "16bit" real mode. This can be exploited to create DOS .COM 32bit-realmode-executables using the GNU linker, as described in this blog post. (Of course, it's still limited to the tiny memory model, meaning everything lives in one 64KB segment.) Wanting to play with this, I created a minimal runtime that seems to work quite nicely.
Then I tried to build my recently-created curses-based game with this runtime, and well, it crashed. The first thing I encountered was a classical heisenbug: printing the offending wrong value made it correct. I found a workaround, only to face the next crash. So the first thing to blame I had in mind was my custom malloc() implementation, see the other question. But as nobody spotted something really wrong with it so far, I decided to give my heisenbug a second look. It manifests in the following code snippet (note this worked flawlessly when compiling for other platforms):
typedef struct
{
Item it; /* this is an enum value ... */
Food *f; /* and this is an opaque pointer */
} Slot;
typedef struct board
{
Screen *screen;
int w, h;
Slot slots[1]; /* 1 element for C89 compatibility */
} Board;
[... *snip* ...]
size = sizeof(Board) + (size_t)(w*h-1) * sizeof(Slot);
self = malloc(size);
memset(self, 0, size);
sizeof(Slot) is 8 (with clang and i386 architecture), sizeof(Board) is 20, and w and h are the dimensions of the game board, in the case of running under DOS 80 and 24 (because one line is reserved for the title/status bar). To debug what's going on here, I made my malloc() output its parameter, and it was called with the value 12 (sizeof(Board) + (-1) * sizeof(Slot)?)
Printing out w and h showed the correct values, still malloc() got 12. Printing out size showed the correctly calculated size and this time, malloc() got the correct value, too. So, classical heisenbug.
The workaround I found looks like this:
size = sizeof(Board);
for (int i = 0; i < w*h-1; ++i) size += sizeof(Slot);
Weirdly enough, this worked. Next logical step: compare the generated assembly. Here I have to admit I'm totally new to x86; my only assembly experience was with the good old 6502. So, in the following snippets, I'll add my assumptions and thoughts as comments; please correct me here.
First the "broken" original version (w, h are in %esi, %edi):
movl %esi, %eax
imull %edi, %eax # ok, calculate the product w*h
leal 12(,%eax,8), %eax # multiply by 8 (sizeof(Slot)) and add
# 12 as an offset. Looks good because
# 12 = sizeof(Board) - sizeof(Slot)...
movzwl %ax, %ebp # just use 16bit because my size_t for
# realmode is "unsigned short"
movl %ebp, (%esp)
calll malloc
Now, to me, this looks good, but my malloc() sees 12, as mentioned. The workaround with the loop compiles to the following assembly:
movl %edi, %ecx
imull %esi, %ecx # ok, w*h again.
leal -1(%ecx), %edx # edx = ecx-1? loop-end condition?
movw $20, %ax # sizeof(Board)
testl %edx, %edx # I guess that sets just some flags in
# order to check whether (w*h-1) is <= 0?
jle .LBB0_5
leal 65548(,%ecx,8), %eax # This seems to be the loop body
# condensed to a single instruction.
# 65548 = 65536 (0x10000) + 12. So
# there is our offset of 12 again (for
# 16bit). The rest is the same ...
.LBB0_5:
movzwl %ax, %ebp # use bottom 16 bits
movl %ebp, (%esp)
calll malloc
As described before, this second variant works as expected. My question after all this long text is as simple as ... WHY? Is there something special about realmode I'm missing here?
For reference: this commit contains both code versions. Just type make -f libdos.mk for a version with the workaround (crashing later). To compile the code leading to the bug, remove the -DDOSREAL from the CFLAGS in libdos.mk first.
Update: given the comments, I tried to debug this myself a bit deeper. Using dosbox' debugger is somewhat cumbersome, but I finally got it to break at the position of this bug. So, the following assembly code intended by clang:
movl %esi, %eax
imull %edi, %eax
leal 12(,%eax,8), %eax
movzwl %ax, %ebp
movl %ebp, (%esp)
calll malloc
ends up as this (note intel syntax used by dosbox' disassembler):
0193:2839 6689F0 mov eax,esi
0193:283C 660FAFC7 imul eax,edi
0193:2840 668D060C00 lea eax,[000C] ds:[000C]=0000F000
0193:2845 660FB7E8 movzx ebp,ax
0193:2849 6766892C24 mov [esp],ebp ss:[FFB2]=00007B5C
0193:284E 66E8401D0000 call 4594 ($+1d40)
I think this lea instruction looks suspicious, and indeed, after it, the wrong value is in ax. So, I tried to feed the same assembly source to the GNU assembler, using .code16 with the following result (disassembly by objdump, I think it is not entirely correct because it might misinterpret the size prefix bytes):
00000000 <.text>:
0: 66 89 f0 mov %si,%ax
3: 66 0f af c7 imul %di,%ax
7: 67 66 8d 04 lea (%si),%ax
b: c5 0c 00 lds (%eax,%eax,1),%ecx
e: 00 00 add %al,(%eax)
10: 66 0f b7 e8 movzww %ax,%bp
14: 67 66 89 2c mov %bp,(%si)
The only difference is this lea instruction. Here it starts with 67 meaning "address is 32bit" in 16bit real mode. My guess is, this is actually needed because lea is meant to operate on addresses and just "abused" by the optimizer to do data calculation here. Are my assumptions correct? If so, could this be a bug in clangs internal assembler for -m16? Maybe someone can explain where this 668D060C00 emitted by clang comes from and what may be the meaning? 66 means "data is 32bit" and 8D probably is the opcode itself --- but what about the rest?
Your objdump output is bogus. It looks like it's disassembling with the assumption of 32bit address and operand sizes, rather than 16. So it thinks lea ends sooner than it does, and disassembles some of the address bytes into lds / add. And then miraculously gets back into sync, and sees a movzww that zero extends from 16b to 16b... Pretty funny.
I'm inclined to trust your DOSBOX disassembly output. It perfectly explains your observed behaviour (malloc always called with an arg of 12). You are correct that the culprit is
lea eax,[000C] ; eax = 0x0C = 12. Intel/MASM/NASM syntax
leal 12, %eax # the same in AT&T syntax
It looks like a bug in whatever assembled your DOSBOX binary (clang -m16 I think you said), since it assembled leal 12(,%eax,8), %eax into that.
leal 12(,%eax,8), %eax # AT&T
lea eax, [12 + eax*8] ; Intel/MASM/NASM syntax
I could probably dig through some instruction encoding tables / docs and figure out exactly how that lea should have been assembled into machine code. It should be the same as the 32bit-mode encoding, but with 67 66 prefixes (address size and operand size, respectively). (And no, the order of those prefixes doesn't matter, 66 67 would work, too.)
Your DOSBOX and objdump outputs don't even have the same binary, so yes, they did come out differently. (objdump is misinterpreting the operand-size prefix in previous instructions, but that didn't affect the insn length until LEA.)
Your GNU as .code16 binary has 67 66 8D 04 C5, then the 32bit 0x0000000C displacement (little-endian). This is LEA with both prefixes. I assume that's the correct encoding of leal 12(,%eax,8), %eax for 16bit mode.
Your DOSBOX disassembly has just 66 8D 06, with a 16bit 0x0C absolute address. (Missing the 32bit address size prefix, and using a different addressing mode.) I'm not an x86 binary expert; I haven't had problems with disassemblers / instruction encoding before. (And I usually only look at 64bit asm.) So I'd have to look up the encodings for the different addressing modes.
My go-to source for x86 instructions is Intel's Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z. (Linked from https://stackoverflow.com/tags/x86/info, BTW.)
It says: (section 2.1.1)
The operand-size override prefix allows a program to switch between
16- and 32-bit operand sizes. Either size can be the default; use of
the prefix selects the non-default size.
So that's easy, everything is pretty much the same as normal 32bit protected mode, except 16bit operand-size is the default.
The LEA insn description has a table describing exactly what happens with various combinations of 16, 32, and 64bit address (67H prefix) and operand sizes (66H prefix). In all cases, it truncates or zero-extends the result when there's a size mismatch, but it's an Intel insn ref manual, so it has to lay out every case separately. (This is helpful for more complex instruction behaviour.)
And yes, "abusing" lea by using it on non-address data is a common and useful optimization. You can do a non-destructive add of 2 registers, placing the result in a 3rd. And at the same time add a constant, and scale one of the inputs by 2, 4, or 8. So it can do things that would take up to 4 other instructions. (mov / shl / add r,r / add r,i). Also, it doesn't affect flags, which is a bonus if you want to preserve flags for another jump or especially cmov.

Smashing the stack example3 ala Aleph One

I've reproduced Example 3 from Smashing the Stack for Fun and Profit on Linux x86_64. However I'm having trouble understanding what is the correct number of bytes that should be incremented to the return address in order to skip past the instruction:
0x0000000000400595 <+35>: movl $0x1,-0x4(%rbp)
which is where I think the x = 1 instruction is. I've written the following:
#include <stdio.h>
void fn(int a, int b, int c) {
char buf1[5];
char buf2[10];
int *ret;
ret = buf1 + 24;
(*ret) += 7;
}
int main() {
int x;
x = 0;
fn(1, 2, 3);
x = 1;
printf("%d\n", x);
}
and disassembled it in gdb. I have disabled address randomization and compiled the program with the -fno-stack-protector option.
Question 1
I can see from the disassembler output below that I want to skip past the instruction at address 0x0000000000400595: both the return address from callq <fn> and the address of the movl instruction. Therefore, if the return address is 0x0000000000400595, and the next instruction is 0x000000000040059c, I should add 7 bytes to the return address?
0x0000000000400572 <+0>: push %rbp
0x0000000000400573 <+1>: mov %rsp,%rbp
0x0000000000400576 <+4>: sub $0x10,%rsp
0x000000000040057a <+8>: movl $0x0,-0x4(%rbp)
0x0000000000400581 <+15>: mov $0x3,%edx
0x0000000000400586 <+20>: mov $0x2,%esi
0x000000000040058b <+25>: mov $0x1,%edi
0x0000000000400590 <+30>: callq 0x40052d <fn>
0x0000000000400595 <+35>: movl $0x1,-0x4(%rbp)
0x000000000040059c <+42>: mov -0x4(%rbp),%eax
0x000000000040059f <+45>: mov %eax,%esi
0x00000000004005a1 <+47>: mov $0x40064a,%edi
0x00000000004005a6 <+52>: mov $0x0,%eax
0x00000000004005ab <+57>: callq 0x400410 <printf@plt>
0x00000000004005b0 <+62>: leaveq
0x00000000004005b1 <+63>: retq
Question 2
I notice that I can add 5 bytes to the return address in place of 7 and achieve the same result. When I do so, am I not jumping into the middle of the instruction 0x0000000000400595 <+35>: movl $0x1,-0x4(%rbp)? In which case, why does this not crash the program, like when I add 6 bytes to the return address in place of 5 bytes or 7 bytes.
Question 3
Just before buffer1[] on the stack is SFP, and before it, the return address.
That is 4 bytes pass the end of buffer1[]. But remember that buffer1[] is
really 2 word so its 8 bytes long. So the return address is 12 bytes from
the start of buffer1[].
In the example by Aleph 1, he/she calculates the offset of the return address as 12 bytes from the start of buffer1[]. Since I am on x86_64, and not x86_32, I need to recalculate the offset to the return address. When on x86_64, is it the case that buffer1[] is still 2 words, which is 16 bytes; and the SFP and return address are 8 bytes each (as we're on 64 bit) and therefore the return address is at: buf1 + (8 * 2) + 8 which is equivalent to buf1 + 24?
The first, and very important, thing to note: all numbers and offsets are very compiler-dependent. Different compilers, and even the same compiler with different settings, can produce drastically different assemblies. For example, many compilers can (and will) remove buf2 because it's not used. They can also remove x = 0 because its effect is never used and is later overwritten. They can also remove x = 1 and replace all occurrences of x with the constant 1, etc, etc.
That said, you absolutely need to work out the numbers for the specific assembly you're getting from your specific compiler and its settings.
Question 1
Since you provided the assembly for main(), I can confirm that you need to add 7 bytes to the return address (which would normally be 0x0000000000400595) to skip over x=1 and land at 0x000000000040059c, which loads x into a register for later use. 0x000000000040059c - 0x0000000000400595 = 7.
Question 2
Adding just 5 bytes instead of 7 will indeed jump into the middle of an instruction. However, the 2-byte tail of that instruction happens (by pure chance) to be another valid instruction encoding. This is why it doesn't crash.
Question 3
This is again very compiler- and settings-dependent. Pretty much anything can happen there. Since you didn't provide the disassembly, I can only make guesses. The guess would be the following: buf1 and buf2 are rounded up to the next stack unit boundary (8 bytes on x64). buf1 becomes 8 bytes, and buf2 becomes 16 bytes. Frame pointers are not saved to the stack on x64, so no "SFP". That's 24 bytes total.

Resources