Causes and benefits of this improvement on gcc version >= 4.9.0 vs gcc version < 4.9? - c

I have recently exploited a dangerous program and found something interesting about the difference between versions of gcc on x86-64 architecture.
Note:
Wrongful usage of gets is not the issue here.
If we replace gets with any other functions, the problem doesn't change.
This is the source code I use:
#include <stdio.h>
int main()
{
char buf[16];
gets(buf);
return 0;
}
I used gcc.godbolt.org to disassemble the program with the flags -m32 -fno-stack-protector -z execstack -g.
In the disassembly, gcc version >= 4.9.0 produces:
lea ecx, [esp+4] # begin of main
and esp, -16
push DWORD PTR [ecx-4] # push esp
push ebp
mov ebp, esp
/* the code between these comments is not related to the question
push ecx
sub esp, 20
sub esp, 12
lea eax, [ebp-24]
push eax
call gets
add esp, 16
mov eax, 0
*/
mov ebp, esp
mov ecx, DWORD PTR [ebp-4] # ecx = saved esp
leave
lea esp, [ecx-4]
ret # end of main
But gcc version < 4.9.0 produces just:
push ebp # begin of main
mov ebp, esp
/* the code between these comments is not related to the question
and esp, -16
sub esp, 32
lea eax, [esp+16]
mov DWORD PTR [esp], eax
call gets
mov eax, 0
*/
leave
ret # end of main
My question is: what causes this difference in the disassembled code, and what are its benefits? Does this technique have a name?

I can't say for sure without the actual values in:
and esp, 0xXX # XX is a number
but this looks a lot like extra code to align the stack to a larger value than the ABI requires.
Edit: The value is -16, which is 32-bit 0xFFFFFFF0 or 64-bit 0xFFFFFFFFFFFFFFF0 so this is indeed stack alignment to 16 bytes, likely meant for use of SSE instructions. As mentioned in comments, there is more code in the >= 4.9.0 version because it also aligns the frame pointer instead of only the stack pointer.
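To make the masking concrete, here is a minimal C sketch of what and esp, -16 computes. The helper name align_down is made up for illustration, and a 32-bit address is modelled as a uint32_t; it is not taken from GCC.

```c
#include <assert.h>
#include <stdint.h>

/* Round an address down to a multiple of `alignment` (a power of two).
   align_down(esp, 16) yields the same value as `and esp, -16`.
   Hypothetical helper, for illustration only. */
uint32_t align_down(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return addr & ~(alignment - 1);   /* for 16: addr & 0xFFFFFFF0 */
}
```

For an address that is already a multiple of 16 nothing changes; otherwise the low four bits are cleared, so the stack pointer only ever moves down.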

The i386 ABI, used for 32-bit programs, requires that a process, immediately after it is loaded, has its stack aligned to 32-bit values:
%esp Performing its usual job, the stack pointer holds the address of the
bottom of the stack, which is guaranteed to be word aligned.
Compare this with the x86_64 ABI [1] used for 64-bit programs:
%rsp The stack pointer holds the address of the byte with lowest address which
is part of the stack. It is guaranteed to be 16-byte aligned at process entry
The opportunity that AMD's new 64-bit technology gave to rewrite the old i386 ABI allowed a number of optimizations that had been held back by backward compatibility, among them a stricter stack alignment.
I won't dwell on the benefits of stack alignment but it suffices to say that if a 4-byte alignment was good, so is a 16-byte one.
So much that it is worth spending some instructions aligning the stack.
That's what GCC 4.9.0+ does: it aligns the stack to 16 bytes.
That explains the and esp, -16 but not the other instructions.
Aligning the stack with and esp, -16 is the fastest way to do it when the compiler only knows that the stack is 4-byte aligned (since esp MOD 16 can be 0, 4, 8 or 12).
However it is a destructive method, the compiler loses the original esp value.
But now comes the chicken-and-egg problem: if we save the original esp on the stack before aligning the stack, we lose it because we don't know how far the stack pointer is lowered by the alignment. If we save it after the alignment, well, we can't. We lost it in the alignment.
So the only possible solution is to save it in a register, align the stack and then save said register on the stack.
;Save the stack pointer in ECX, actually is ESP+4 but still does
lea ecx, [esp+4] #ECX = ESP+4
;Align the stack
and esp, -16 #This lowers ESP by 0, 4, 8 or 12
;IGNORE THIS FOR NOW
push DWORD PTR [ecx-4]
;Usual prolog
push ebp
mov ebp, esp
;Save the original ESP (before alignment), actually is ESP+4 but OK
push ecx
GCC saves esp+4 in ecx, I don't know why [2], but this value still does the trick.
The only mystery left is the push DWORD PTR [ecx-4].
But it turns out to be a simple mystery: for debugging purposes GCC pushes the return address just before the old frame pointer (before push ebp), which is where 32-bit tools expect it to be.
Since ecx=esp_o+4, where esp_o is the original stack pointer pre-alignment, [ecx-4] = [esp_o] = return address.
Note that 12 bytes have now been pushed since the alignment (return address, ebp and ecx), so esp is at 4 modulo 16 and the local variable area must be of size 16*k+4 to have the stack aligned at 16 bytes again.
In your example k is 1 and the area is of 20 bytes in size.
The subsequent sub esp, 12 is to align the stack for the gets function (the requirement is to have the stack aligned at the function call).
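The alignment bookkeeping above can be checked with a small C sketch that tracks only the value of ESP through the quoted prologue up to call gets. The function name esp_at_call is hypothetical and nothing is actually stored anywhere; it is just the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Model of ESP through the GCC >= 4.9 prologue quoted in the question,
   up to the point of `call gets`. Illustration only. */
uint32_t esp_at_call(uint32_t esp_entry)
{
    assert(esp_entry % 4 == 0);     /* i386 ABI: word aligned at entry */
    uint32_t esp = esp_entry;
    esp &= (uint32_t)-16;           /* and  esp, -16            */
    esp -= 4;                       /* push DWORD PTR [ecx-4]   */
    esp -= 4;                       /* push ebp                 */
    esp -= 4;                       /* push ecx                 */
    esp -= 20;                      /* sub  esp, 20  (16*1 + 4) */
    esp -= 12;                      /* sub  esp, 12             */
    esp -= 4;                       /* push eax  (arg to gets)  */
    return esp;
}
```

Whatever the entry value is modulo 16, 48 bytes are subtracted after the masking, so the result is a multiple of 16 when the call executes.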
Finally, the code
mov ebp, esp
mov ecx, DWORD PTR [ebp-4] # ecx = saved esp
leave
lea esp, [ecx-4]
ret
The first instruction is a copy-paste error.
One could check it out or simply reason that if it were there, [ebp-4] would be below the stack pointer (and there is no red zone in the i386 ABI).
The rest just undoes what is done in the prolog:
;Get the original stack pointer
mov ecx, DWORD PTR [ebp-4] ;ecx = esp_o+4
;Standard epilog
leave ;mov esp, ebp / pop ebp
;The stack pointer points to the copied return address
;Restore the original stack pointer
lea esp, [ecx-4] ;esp = esp_o
ret
GCC has to first get the original stack pointer (+4) saved on the stack, then restore the old frame pointer (ebp) and finally, restore the original stack pointer.
The return address is on the top of the stack when lea esp, [ecx-4] is executed, so in theory GCC could just return but it has to restore the original esp because main is not the first function to be executed in a C program, so it cannot leave the stack unbalanced.
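As a rough check of the save and restore logic, the following C sketch replays the prologue and epilogue on plain variables. The name esp_before_ret and the variable saved_slot (standing for the stack slot at [ebp-4]) are invented for this illustration; the rest of the stack contents are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Model of how the original ESP survives the destructive alignment.
   Returns the value of ESP just before `ret`. Illustration only. */
uint32_t esp_before_ret(uint32_t esp_entry)
{
    assert(esp_entry % 4 == 0);
    uint32_t esp = esp_entry;
    uint32_t ecx = esp + 4;         /* lea  ecx, [esp+4]                 */
    esp &= (uint32_t)-16;           /* and  esp, -16 (old value is lost) */
    esp -= 8;                       /* push return address, push ebp     */
    uint32_t ebp = esp;             /* mov  ebp, esp                     */
    esp -= 4;                       /* push ecx ...                      */
    uint32_t saved_slot = ecx;      /* ... which lands at [ebp-4]        */
    esp -= 20;                      /* body of main: its net effect on ESP does not matter */
    ecx = saved_slot;               /* mov  ecx, DWORD PTR [ebp-4]       */
    esp = ebp + 4;                  /* leave: mov esp, ebp / pop ebp     */
    esp = ecx - 4;                  /* lea  esp, [ecx-4]                 */
    return esp;
}
```

The value returned equals the entry value for every word-aligned input, which is the property the epilogue has to guarantee before ret.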
1 This is not the latest version but the text quoted went unchanged in the successive editions.
2 This has been discussed here on SO but I can't remember if in some comment or in an answer.

Related

How does assembly code determine how far down variables are placed on the stack?

I am trying to understand some basic assembly code concepts and am getting stuck on how the assembly code determines where to place things on the stack and how much space to give it.
To start playing around with it, I entered this simple code in godbolt.org's compiler explorer.
int main(int argc, char** argv) {
int num = 1;
num++;
return num;
}
and got this assembly code
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov QWORD PTR [rbp-32], rsi
mov DWORD PTR [rbp-4], 1
add DWORD PTR [rbp-4], 1
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
So a couple questions here:
Shouldn't the parameters have been placed on the stack before the call? Why are argc and argv placed at offset 20 and 32 from the base pointer of the current stack frame? That seems really far down to put them if we only need room for the one local variable num. Is there a reason for all of this extra space?
The local variable is stored at 4 below the base pointer. So if we were visualizing this in the stack and say the base pointer currently pointed at 0x00004000 (just making this up for an example, not sure if that's realistic), then we place the value at 0x00003FFC, right? And an integer is size 4 bytes, so does it take up the memory space from 0x00003FFC downward to 0x00003FF8, or does it take up the memory space from 0x00004000 to 0x00003FFC?
It looks like stack pointer was never moved down to allow room for this local variable. Shouldn't we have done something like sub rsp, 4 to make room for the local int?
And then if I modify this to add more locals to it:
int main(int argc, char** argv) {
int num = 1;
char *str1 = {0};
char *str2 = "some string";
num++;
return num;
}
Then we get
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-36], edi
mov QWORD PTR [rbp-48], rsi
mov DWORD PTR [rbp-4], 1
mov QWORD PTR [rbp-16], 0
mov QWORD PTR [rbp-24], OFFSET FLAT:.LC0
add DWORD PTR [rbp-4], 1
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
So now the main arguments got pushed down even further from base pointer. Why is the space between the first two locals 12 bytes but the space between the second two locals 8 bytes? Is that because of the sizes of the types?
I'm only going to answer this part of the question:
Shouldn't the parameters have been placed on the stack before the call? Why are argc and argv placed at offset 20 and 32 from the base pointer of the current stack frame?
The parameters to main are indeed set up by the code that calls main.
This appears to be code compiled according to the 64-bit ELF psABI for x86, in which the first several parameters to any function are passed in registers, not on the stack. When control reaches the main: label, argc will be in edi, argv will be in rsi, and a third argument conventionally called envp will be in rdx. (You didn't declare that argument, so you can't use it, but the code that calls main is generic and always sets it up.)
The instructions I believe you are referring to
mov DWORD PTR [rbp-20], edi
mov QWORD PTR [rbp-32], rsi
are what compiler nerds call spill instructions: they are copying the initial values of the argc and argv parameters from their original registers to the stack, just in case those registers are needed for something else. As several other people pointed out, this is unoptimized code; these instructions are unnecessary and would not have been emitted if you had turned optimization on. Of course, if you'd turned optimization on you'd have gotten code that doesn't touch the stack at all:
main:
mov eax, 2
ret
In this ABI, the compiler is allowed to put the "spill slots," to which register values are saved, wherever it wants within the stack frame. Their locations do not have to make sense, and may vary from compiler to compiler, from patchlevel to patchlevel of the same compiler, or with apparently-unconnected changes to the source code.
(Some ABIs do specify stack frame layout in some detail, e.g. IIRC the 32-bit Windows ABI does this, to facilitate "unwinding", but that's not important right now.)
(To underline that the arguments to main are in registers, this is the assembly I get at -O1 from
int main(int argc) { return argc + 1; }
:
main:
lea eax, [rdi+1]
ret
Still doesn't do anything with the stack! (Besides ret.))
This is "compiler 101" and what you want to research is "calling convention" and "stack frame". The details are compiler/OS/optimizations dependent. Briefly, incoming parameters may be in registers or on stack. When a function is entered, it may create a stack frame to save some of the registers. And then it may define a "frame pointer" to reference stack locals and stack parameters off the frame pointer. Sometimes the stack pointer is used as a frame pointer as well.
As for registers, usually someone (a company) defines a calling convention that specifies which registers are "volatile", meaning that a routine can use them freely, and which are "preserved", meaning that a routine that uses them has to save them on entry and restore them on exit. The calling convention also specifies which registers (if any) are used for parameter passing and function return.

C Pointer to EFLAGS using NASM

For a task at my school, I need to write a C program which does a 16-bit addition using an assembly program. Besides the result, the EFLAGS shall also be returned.
Here is my C Program:
int add(int a, int b, unsigned short* flags);//function must be used this way
int main(void){
unsigned short* flags = NULL;
printf("%d\n",add(30000, 36000, flags));// printing just the result
return 0;
}
For the moment the Program just prints out the result and not the flags because I am unable to get them.
Here is the assembly program for which I use NASM:
global add
section .text
add:
push ebp
mov ebp,esp
mov ax,[ebp+8]
mov bx,[ebp+12]
add ax,bx
mov esp,ebp
pop ebp
ret
Now this all works smoothly. But I have no idea how to get the pointer which must be at [ebp+16] pointing to the flag register. The professor said we will have to use the pushfd command.
My problem just lies in the assembly code. I will modify the C Program to give out the flags after I get the solution for the flags.
Normally you'd just use a debugger to look at flags, instead of writing all the code to get them into a C variable for a debug-print. Especially since decent debuggers decode the condition flags symbolically for you, instead of or as well as showing a hex value.
You don't have to know or care which bit in FLAGS is CF and which is ZF. (This information isn't relevant for writing real programs, either. I don't have it memorized, I just know which flags are tested by different conditions like jae or jl. Of course, it's good to understand that FLAGS are just data that you can copy around, save/restore, or even modify if you want)
Your function args and return value are int, which is 32-bit in the System V 32-bit x86 ABI you're using. (links to ABI docs in the x86 tag wiki). Writing a function that only looks at the low 16 bits of its input, and leaves high garbage in the high 16 bits of the output is a bug. The int return value in the prototype tells the compiler that all 32 bits of EAX are part of the return value.
As Michael points out, you seem to be saying that your assignment requires using a 16-bit ADD. That will produce carry, overflow, and other conditions with different inputs than if you looked at the full 32 bits. (BTW, this article explains carry vs. overflow very well.)
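To see how a 16-bit ADD differs from a 32-bit one for the inputs in the question, here is a small C sketch that computes the 16-bit result and four of the flags by hand. The struct and function names (add16_result, add16) are made up for illustration; this models the arithmetic only and does not read the real FLAGS register.

```c
#include <assert.h>
#include <stdint.h>

struct add16_result {
    int32_t value;        /* 16-bit sum, sign-extended to 32 bits (what CWDE does) */
    int cf, zf, sf, of;   /* carry, zero, sign, overflow */
};

/* Model of a 16-bit `add` on the low halves of two ints. Illustration only. */
struct add16_result add16(int a, int b)
{
    uint16_t x = (uint16_t)a, y = (uint16_t)b;   /* only the low 16 bits count */
    uint32_t wide = (uint32_t)x + y;
    uint16_t sum = (uint16_t)wide;
    struct add16_result r;
    r.cf = wide > 0xFFFF;                        /* unsigned carry out of bit 15 */
    r.zf = sum == 0;
    r.sf = (sum & 0x8000) != 0;
    /* signed overflow: inputs share a sign bit, result has the other one */
    r.of = ((~(x ^ y)) & (x ^ sum) & 0x8000) != 0;
    r.value = (sum & 0x8000) ? (int32_t)sum - 0x10000 : (int32_t)sum;
    return r;
}
```

With 30000 and 36000 the 16-bit sum wraps to 464 and sets the carry flag, whereas the same addition done in 32 bits gives 66000 with no carry. With 30000 and 30000 there is no carry, but the signed 16-bit result overflows.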
Here's what I'd do. Note the 16-bit operand size for the ADD.
global add
section .text
add:
push ebp
mov ebp,esp ; stack frames are optional, you can address things relative to ESP
mov eax, [ebp+8] ; first arg: No need to avoid loading the full 32 bits; the next insn doesn't care about the high garbage.
add ax, [ebp+12] ; low 16 bits of second arg. Operand-size implied by AX
cwde ; sign-extend AX into EAX
mov ecx, [ebp+16] ; the pointer arg
pushf ; the simple straightforward way
pop edx
mov [ecx], dx ; Store the low 16 of what we popped. Writing word [ecx] is optional, because dx implies 16-bit operand-size
; be careful not to do a 32-bit store here, because that would write outside the caller's object.
; mov esp,ebp ; redundant: ESP is still pointing at the place we pushed EBP, since the push is balanced by an equal-size pop
pop ebp
ret
CWDE (the 16->32 form of the 8086 CBW instruction) is not to be confused with CWD (the AX -> DX:AX 8086 instruction). If you're not using AX, then MOVSX / MOVZX are a good way to do this.
The fun way: instead of using the default operand size and doing 32-bit push and pop, we can do a 16-bit pop directly into the destination memory address. That would leave the stack unbalanced, so we could either uncomment the mov esp, ebp again, or use a 16-bit pushf (with an operand-size prefix, which according to the docs makes it only push the low 16 FLAGS, not the 32-bit EFLAGS.)
; What I'd *really* do: maximum efficiency if I had to use the 32-bit ABI with args on the stack, instead of args in registers
global add
section .text
add:
mov eax, [esp+4] ; first arg, first thing above the return address
add ax, [esp+8] ; second arg
cwde ; sign-extend AX into EAX
mov ecx, [esp+12] ; the pointer
pushfw ; push the low 16 of FLAGS
pop word [ecx] ; pop into memory pointed to by unsigned short* flags
ret
Both PUSHFW and POP WORD will assemble with an operand-size prefix. output from objdump -Mintel, which uses slightly different syntax from NASM:
4000c0: 66 9c pushfw
4000c2: 66 8f 01 pop WORD PTR [ecx]
PUSHFW is the same as o16 PUSHF. In NASM, o16 applies the operand-size prefix.
If you only needed the low 8 flags (not including OF), you could use LAHF to load FLAGS into AH and store that.
PUSHFing directly into the destination is not something I'd recommend. Temporarily pointing the stack at some random address is not safe in general. Programs with signal handlers will use the space below the stack asynchronously. This is why you have to reserve stack space before using it with sub esp, 32 or whatever, even if you're not going to make a function call that would overwrite it by pushing more stuff on the stack. The only exception is when you have a red-zone.
Your C caller:
You're passing a NULL pointer, so of course your asm function segfaults. Pass the address of a local to give the function somewhere to store to.
#include <stdio.h>
int add(int a, int b, unsigned short* flags);
int main(void) {
unsigned short flags;
int result = add(30000, 36000, &flags);
printf("%d %#hx\n", result, flags);
return 0;
}
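If you do want to print the stored flags symbolically rather than as a hex value, a helper along these lines could be used on the C side. This is only a sketch: describe_flags and the FLAG_* constants are names invented here, while the bit positions are the architectural ones for the status flags in the low 16 bits of EFLAGS.

```c
#include <assert.h>
#include <string.h>

/* Bit masks of the status flags in the low 16 bits of EFLAGS. */
enum {
    FLAG_CF = 0x0001, FLAG_PF = 0x0004, FLAG_AF = 0x0010,
    FLAG_ZF = 0x0040, FLAG_SF = 0x0080, FLAG_OF = 0x0800
};

/* Write the names of the status flags that are set into buf, each followed
   by a space, e.g. "CF OF ". Hypothetical helper, for illustration only. */
void describe_flags(unsigned short flags, char *buf, size_t size)
{
    static const struct { unsigned mask; const char *name; } table[] = {
        { FLAG_CF, "CF" }, { FLAG_PF, "PF" }, { FLAG_AF, "AF" },
        { FLAG_ZF, "ZF" }, { FLAG_SF, "SF" }, { FLAG_OF, "OF" },
    };
    assert(size > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (flags & table[i].mask) {
            strncat(buf, table[i].name, size - strlen(buf) - 1);
            strncat(buf, " ", size - strlen(buf) - 1);
        }
    }
}
```

Bits that are not status flags (for example the always-set bit 1 or the interrupt flag) are simply ignored by this helper.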
This is just a simple approach. I didn't test it, but you should get the idea...
Just set ESP to the pointer value, increment it by 2 (even for 32-bit arch) and PUSHF like this:
global add
section .text
add:
push ebp
mov ebp,esp
mov ax,[ebp+8]
mov bx,[ebp+12]
add ax,bx
; --- here comes the mod
mov esp, [ebp+16] ; this will set ESP to the pointers address "unsigned short* flags"
lea esp, [esp+2] ; adjust ESP to address above target
db 66h ; operand size prefix for 16-bit PUSHF (alternatively 'db 0x66', depending on your assembler)
pushf ; this will save the lower 16 bits of EFLAGS to the WORD that the pointer at [EBP+16] refers to (ESP+2-2)
; --- here ends the mod
mov esp,ebp
pop ebp
ret
This should work, because PUSHF decrements ESP by 2 and then saves the value to WORD PTR [ESP]. Therefore ESP has to be increased by 2 after loading the pointer address. Setting ESP to the appropriate value lets you designate the direct target of PUSHF.

How an assembly language works?

I am learning assembly and I am having a lot of trouble understanding this assembly code. Can someone clarify it?
Dump of assembler code for function main:
0x080483ed <+0>: push ebp
0x080483ee <+1>: mov ebp,esp
0x080483f0 <+3>: sub esp,0x10
0x080483f3 <+6>: mov DWORD PTR [ebp-0x8],0x0
0x080483fa <+13>: mov eax,DWORD PTR [ebp-0x8]
0x080483fd <+16>: add eax,0x1
0x08048400 <+19>: mov DWORD PTR [ebp-0x4],eax
0x08048403 <+22>: leave
0x08048404 <+23>: ret
So far, my understanding is the following:
Push something (don't know what) in ebp register. then move content of esp register into ebp (I think the data of ebp should be overwritten), then subtract 10 from the esp and store it in the esp (The function will take 10 byte, This reg is never used again, so no point of doing this operation). Now assign value 0 to the address pointed by 8 bytes less than ebp.
Now store that address into register eax. Now add 1 to the value pointed by eax (the previous value is lost). Now store the eax value on [ebp-0x4], then leave to the return address of main.
Here is my C code for the above program:
int main(){
int x=0;
int y = x+1;
}
Now, can someone figure out if I am wrong about anything? I also don't understand the mov at <+13>: it adds 1 to the address ebp-0x8, but that is the address of int x, so x no longer contains 0. Where am I wrong?
First of all, push ebp followed by mov ebp, esp are two instructions that are common at the beginning of a procedure. The ESP register indicates the top of the stack, so it changes constantly as the stack grows or shrinks. EBP is a helper register here. First we push the content of ebp onto the stack. Then we copy ESP (the current stack top address) to ebp. That is why, when we refer to other items on the stack, we use the constant value of ebp (and not the changing one of esp).
sub esp, 0x10 ; means we reserve 16 bytes on the stack (0x10 is 16 in hex)
now for the real fun:
mov DWORD PTR [ebp-0x8],0x0 ; remember ebp points at the stack
; top as it was BEFORE reserving 16 bytes.
; DWORD PTR means "double-word pointer": the memory operand is 32 bits wide.
; so the whole instruction means
; "move 0 to the 32 bits of the stack in a place which
; starts at the address ebp-8".
; this is our int x = 0
mov eax,DWORD PTR [ebp-0x8] ; copy x to the EAX register.
add eax,0x1 ; add 1 to the eax register
mov DWORD PTR [ebp-0x4],eax ; store the result (which is in eax) at the stack address
; [ebp-4]
leave ; clean up the stack frame (mov esp, ebp then pop ebp, undoing the prologue).
ret ; let's say this instruction returns to the caller (it's slightly more
; complicated than that)
Hope this helps! :)
0x080483ed <+0>: push ebp
0x080483ee <+1>: mov ebp,esp
Setting up the stack frame. Save the old base pointer and set the top of the stack as the new base pointer. This allows local variables and arguments within this function to be referenced relative to ebp (base pointer). The advantage of this is that its value is stable unlike esp which is affected by pushes and pops.
0x080483f0 <+3>: sub esp,0x10
On the x86 platform the stack 'grows' downwards. Generally speaking this means esp has a lower value (address in memory) than ebp. When ebp == esp the stack has not reserved any memory for local variables. This does not mean it is 'empty' - a common usage of a stack is [ebp+0x8] for instance. In this case the code is looking for something on the stack which was previously pushed on prior to the call (this could be arguments in the stdcall convention).
Here the stack is extended by 16 bytes, which is more space than the two locals need; the extra is reserved for alignment purposes.
0x080483f3 <+6>: mov DWORD PTR [ebp-0x8],0x0
The 4 bytes at [ebp-0x8] are initialised to the value 0. This is your x local variable.
0x080483fa <+13>: mov eax,DWORD PTR [ebp-0x8]
The 4 bytes at [ebp-0x8] are moved to a register. Arithmetic opcodes can not operate with two memory operands. Data needs to be moved to a register first before the arithmetic is performed. eax now holds the value of your x variable.
0x080483fd <+16>: add eax,0x1
The value of eax is increased so it now holds the value x + 1.
0x08048400 <+19>: mov DWORD PTR [ebp-0x4],eax
Stores the calculated value back on the stack. Notice the local variable is now [ebp-0x4] - this is your y variable.
0x08048403 <+22>: leave
Destroys the stack frame. Equivalent to mov esp, ebp followed by pop ebp: it releases the local variable space and restores the old stack base pointer.
0x08048404 <+23>: ret
Pops the top of the stack treating the value as a return address and sets the program pointer (eip) to this value. The return address typically holds the address of the instruction directly after the call instruction that brought execution into this function.
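The confusion in the question (whether mov eax, DWORD PTR [ebp-0x8] loads an address or a value) can be illustrated with a toy model in C. This is a sketch under stated assumptions: the stack is a small byte array, ebp is just an index into it, and the names replay_main and frame_result are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct frame_result { int32_t x, y; };

/* Replay the four body instructions of the disassembled main on a toy
   machine: a byte array for the stack, plain variables for ebp and eax. */
struct frame_result replay_main(void)
{
    uint8_t stack[64] = {0};
    uint32_t ebp = 32;                    /* arbitrary position in the toy stack (assumption) */
    int32_t eax, zero = 0;
    struct frame_result r;

    memcpy(&stack[ebp - 8], &zero, 4);    /* mov DWORD PTR [ebp-0x8],0x0 : x = 0          */
    memcpy(&eax, &stack[ebp - 8], 4);     /* mov eax,DWORD PTR [ebp-0x8] : eax = copy of x */
    eax += 1;                             /* add eax,0x1 : only the register changes       */
    memcpy(&stack[ebp - 4], &eax, 4);     /* mov DWORD PTR [ebp-0x4],eax : y = eax         */

    memcpy(&r.x, &stack[ebp - 8], 4);     /* read the two locals back out of the frame */
    memcpy(&r.y, &stack[ebp - 4], 4);
    return r;
}
```

Because the load copies the value stored at [ebp-0x8] into eax, the add changes only the register: x is still 0 afterwards and y ends up as 1.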

Why do byte spills occur and what do they achieve?

What is a byte spill?
When I dump the x86 ASM from an LLVM intermediate representation generated from a C program, there are numerous spills, usually of a 4 byte size. I cannot figure out why they occur and what they achieve.
They seem to "cut" pieces of the stack off, but in an unusual way:
## this fragment comes from a C program right before a malloc() call to a struct.
## there are other spills in different circumstances in this same program, so it
## is not related exclusively to malloc()
...
sub ESP, 84
mov EAX, 60
mov DWORD PTR [ESP + 80], 0
mov DWORD PTR [ESP], 60
mov DWORD PTR [ESP + 60], EAX # 4-byte Spill
call malloc
mov ECX, 60
...
A register spill is simply what happens when you have more local variables than registers (it's an analogy - really the meaning is that they must be saved to memory). The instruction is saving the value of EAX, likely because EAX is clobbered by malloc and you don't have another spare register to save it in (and for whatever reason the compiler has decided it needs the constant 60 in the register later).
By the looks of it, the compiler could certainly have omitted mov DWORD PTR [ESP + 60], EAX and instead repeated the mov EAX, 60 where it would otherwise mov EAX, DWORD PTR [ESP + 60] or whatever offset it used, because the saved value of EAX cannot be other than 60 at that point. However, compilation is not guaranteed to be perfectly optimal.
Bear also in mind that after sub ESP, 84, the stack size is not adjusted (except by the call instruction which of course pushes the return address). The following instructions are using ESP as a memory offset, not a destination.

Why is the compiler generating a push/pop instruction pair?

I compiled the code below with the VC++ 2010 compiler:
__declspec(dllexport)
unsigned int __cdecl __mm_getcsr(void) { return _mm_getcsr(); }
and the generated code was:
push ECX
stmxcsr [ESP]
mov EAX, [ESP]
pop ECX
retn
Why is there a push ECX/pop ECX instruction pair?
The compiler is making room on the stack to store the MXCSR. It could have equally well done this:
sub esp,4
stmxcsr [ESP]
mov EAX, [ESP]
add esp,4
retn
But "push ecx" is probably shorter or faster.
The push here is used to allocate 4 bytes of temporary space. [ESP] would normally point to the pushed return address, which we cannot overwrite.
ECX will be overwritten here; however, ECX is probably a volatile register in the ABI you're targeting, so functions don't have to preserve ECX.
The reason a push/pop is used here is a space (and possibly speed) optimization.
It creates a top-of-stack entry that ESP now refers to as the target for the stmxcsr instruction. Then the result is stored in EAX for the return.
