I'm having trouble understanding why the compiler chose to offset the stack space in the way it did with the code I wrote.
I was toying with Godbolt's Compiler Explorer in order to study the C calling convention, when I came up with a simple piece of code whose compiled output puzzled me.
The code is found in this link. I selected GCC 8.2 x86-64, but I am targeting x86 processors, and this is important. Below is a transcription of the C code and the generated assembly reported by the Compiler Explorer.
// C code
int testing(char a, int b, char c) {
return 42;
}
int main() {
int x = testing('0', 0, '7');
return 0;
}
; Generated assembly
testing(char, int, char):
push ebp
mov ebp, esp
sub esp, 8
mov edx, DWORD PTR [ebp+8]
mov eax, DWORD PTR [ebp+16]
mov BYTE PTR [ebp-4], dl
mov BYTE PTR [ebp-8], al
mov eax, 42
leave
ret
main:
push ebp
mov ebp, esp
sub esp, 16
push 55
push 0
push 48
call testing(char, int, char)
add esp, 12
mov DWORD PTR [ebp-4], eax
mov eax, 0
leave
ret
Looking at the assembly column from now on: as I understand it, line 15 is responsible for reserving stack space for the local variables. The problem is that I have only one local int, yet the stack pointer was moved by 16 bytes instead of 4. This feels like wasted space.
Is this somewhat related to word alignment? But even if it is, if the sizes of the general purpose registers are 4 bytes, shouldn't this alignment be with regards to 4 bytes?
One other strange thing I see is with respect to the placement of the local chars of the testing function. They seem to be taking 4 bytes each in the stack, as seen in lines 7-8, but only the lower bytes are manipulated. Why not use only 1 byte each?
These choices are probably well intended, and I would really like to understand their purpose (or whether there is no purpose). Or maybe I'm just confused and didn't quite get it.
So, from the comments, I could figure out that the stack growth issue is due to the i386 System V ABI requirements, as stated by @PeterCordes.
The reason why the chars are word aligned may be GCC's default behavior to improve speed, as may be inferred from @Ped7g's comment. Although not definitive, this is a good enough answer for me.
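The arithmetic behind "reserve more than you strictly need" is just rounding up to an alignment boundary. The sketch below only illustrates that arithmetic; it is not GCC's frame-layout logic, which involves more than this. The helper names frame_bytes and arg_slot_bytes are made up for the example, and the 16-byte and 4-byte figures are assumptions taken from the listing above (sub esp, 16 for one int local, and each pushed char argument occupying a full 4-byte slot).

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical helper: bytes reserved for locals under an assumed 16-byte rule. */
static size_t frame_bytes(size_t locals)
{
    return round_up(locals, 16);
}

/* Hypothetical helper: bytes one pushed argument occupies with 4-byte slots. */
static size_t arg_slot_bytes(size_t type_size)
{
    return round_up(type_size, 4);
}
```

With these assumptions, frame_bytes(sizeof(int)) gives the 16 seen in main, and arg_slot_bytes(sizeof(char)) gives the 4 bytes that each push of a char argument consumes.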
It's common today to acquire stack space in multiples of this size, for several reasons:
Cache lines favor this behaviour, since the whole block of data is more likely to stay in the cache.
Space for temporaries is preallocated, which avoids push and pop instructions when some temporary storage is needed outside the CPU registers.
Individual push and pop instructions degrade pipeline execution, because data must be updated before the next instruction is executed. Preallocating the space decouples the data dependencies between consecutive instructions and allows them to run faster.
For these reasons, current compilers follow ABIs designed this way.
Whenever I read about program execution in C, very little is said about function execution. I am still trying to find out what happens to a function from the time it is called by another function to the time it returns. How do the function arguments get stored in memory?
That's unspecified; it's up to the implementation. As pointed out by Keith Thompson, it doesn't even have to tell you how it works. :)
Some implementations will put all the arguments on the stack, some will use registers, and many use a mix (the first n arguments passed in registers, any more and they go on the stack).
But the function itself is just code, it's read-only and nothing much "happens" to it during execution.
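As one concrete instance of the "first n arguments in registers, the rest on the stack" scheme, the x86-64 System V convention (used on Linux and macOS) passes the first six integer or pointer arguments in rdi, rsi, rdx, rcx, r8 and r9. The sketch below encodes only that rule; the helper names are made up, and floating-point and struct arguments, which follow other rules, are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Integer/pointer argument registers of the x86-64 System V convention, in order. */
static const char *const SYSV_INT_REGS[] = { "rdi", "rsi", "rdx", "rcx", "r8", "r9" };

enum { SYSV_INT_REG_COUNT = sizeof SYSV_INT_REGS / sizeof SYSV_INT_REGS[0] };

/* Hypothetical helper: is the index-th (0-based) integer argument in a register? */
static int passed_in_register(size_t index)
{
    return index < SYSV_INT_REG_COUNT;
}

/* Hypothetical helper: name of the register carrying that argument. */
static const char *int_arg_register(size_t index)
{
    assert(passed_in_register(index));   /* later arguments go on the stack */
    return SYSV_INT_REGS[index];
}
```

Other conventions use a different n: the 32-bit cdecl code shown elsewhere on this page passes everything on the stack.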
There is no one correct answer to this question; it depends heavily on what the compiler writer determines to be the best model. Various parts of the standard describe this process, but most of it is implementation-defined. The process also depends on the architecture of the system, the OS you're targeting, the level of optimisation and so forth.
Take the following code:-
int DoProduct (int a, int b, int c)
{
return a * b * c;
}
int result = DoProduct (4, 5, 6);
The MSVC2005 compiler, using standard debug build options created this for the last line of the above code:-
push 6
push 5
push 4
call DoProduct (411186h)
add esp,0Ch
mov dword ptr [ebp-18h],eax
Here, the arguments are pushed onto the stack, starting with the last argument, then the penultimate argument and so on until the first argument is pushed onto the stack. The function is called, then the arguments are removed from the stack (the add esp,0Ch) and then the return value, which the function left in the eax register, is saved.
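The push/call/cleanup sequence just described can be modelled with a toy stack. This is only a sketch in C of the bookkeeping, not what the compiler does internally; every name in it (toy_stack, toy_push and so on) is made up, and the return address is a placeholder value.

```c
#include <assert.h>
#include <stdint.h>

enum { STACK_WORDS = 64 };

/* Toy model of a 32-bit descending stack; sp is a byte offset into mem. */
struct toy_stack {
    uint32_t mem[STACK_WORDS];
    uint32_t sp;
};

static void toy_init(struct toy_stack *s)
{
    s->sp = STACK_WORDS * 4;             /* empty stack: sp just past the end */
}

static void toy_push(struct toy_stack *s, uint32_t v)
{
    assert(s->sp >= 4);
    s->sp -= 4;
    s->mem[s->sp / 4] = v;
}

/* Read the dword at [sp + off], like "dword ptr [esp+off]". */
static uint32_t toy_peek(const struct toy_stack *s, uint32_t off)
{
    assert((s->sp + off) / 4 < STACK_WORDS);
    return s->mem[(s->sp + off) / 4];
}

/* Callee: the return address sits at [sp], the arguments start at [sp+4]. */
static uint32_t toy_do_product(const struct toy_stack *s)
{
    return toy_peek(s, 4) * toy_peek(s, 8) * toy_peek(s, 12);
}

/* Caller: push right to left, call, then drop the 12 bytes of arguments. */
static uint32_t toy_call_do_product(struct toy_stack *s,
                                    uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t eax;
    toy_push(s, c);
    toy_push(s, b);
    toy_push(s, a);
    toy_push(s, 0xDEADBEEFu);            /* "call" pushes a return address (placeholder) */
    eax = toy_do_product(s);
    s->sp += 4;                          /* "ret" pops the return address */
    s->sp += 12;                         /* "add esp,0Ch" */
    return eax;
}
```

Because the last argument is pushed first, the first argument ends up closest to the top of the stack, directly above the return address.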
Here's the code for the function:-
push ebp
mov ebp,esp
sub esp,0C0h
push ebx
push esi
push edi
lea edi,[ebp-0C0h]
mov ecx,30h
mov eax,0CCCCCCCCh
rep stos dword ptr es:[edi]
mov eax,dword ptr [a]
imul eax,dword ptr [b]
imul eax,dword ptr [c]
pop edi
pop esi
pop ebx
mov esp,ebp
pop ebp
ret
The first thing the function does is to create a local stack frame. This involves creating a space on the stack to store local and temporary variables in. In this case, 192 (0xc0) bytes are reserved (the first three instructions). The reason it's so many is to allow the edit-and-continue feature some space to put new variables into.
The next three instructions save the reserved registers as defined by the MS compiler. Then the stack frame space just created is initialised to contain a special debug signature, in this case 0xCC. This marks uninitialised memory, and if you ever see a value consisting of just 0xCC's in debug mode then you've forgotten to initialise the value (unless 0xCC was the value).
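The effect of that fill can be reproduced in plain C. This is a sketch of what the rep stos loop achieves, not of how MSVC emits it; the buffer stands in for the 0xC0-byte frame and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { FRAME_BYTES = 0xC0 };   /* 192 bytes, as in "sub esp,0C0h" */

/* Fill a frame-sized buffer with the debug byte, like the rep stos loop. */
static void fill_debug_pattern(unsigned char *frame, size_t bytes)
{
    assert(frame != NULL);
    memset(frame, 0xCC, bytes);
}

/* Read 32 bits from the buffer, the way an uninitialised int local would appear. */
static uint32_t read_u32(const unsigned char *frame, size_t offset)
{
    uint32_t v;
    memcpy(&v, frame + offset, sizeof v);
    return v;
}
```

The count loaded into ecx is 30h dwords, and 0x30 * 4 is 0xC0, so the loop covers exactly the bytes that sub esp,0C0h reserved. Any 32-bit read from the filled area yields 0xCCCCCCCC.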
Once all that housekeeping has been done, the next three instructions implement the body of the function, the two multiplies. After that, the reserved registers are restored and then the stack frame destroyed and finally the function ends with a ret. Fortunately, the imul puts the result of the multiplication into the eax register so there's no special code to get the result into the right register.
Now, you've probably been thinking that there's a lot there that isn't really necessary. And you're right, but debug is about getting the code right and a lot of the above helps to achieve that. In release, there's a lot that can be got rid of. There's no need for a stack frame, no need, therefore, to initialise it. There's no need to save the reserved registers as they aren't modified. In fact, the compiler creates this:-
mov eax,dword ptr [esp+4]
imul eax,dword ptr [esp+8]
imul eax,dword ptr [esp+0Ch]
ret
which, if I'd let the compiler do it, would have been in-lined into the caller.
There's a lot more stuff that can happen: values passed in registers and so on. Also, I've not got into how floating point values and structures / classes are passed to and from functions. And there's more that I've probably left out.
If I compile:
int *a;
void main(void)
{
*a = 1;
}
and then disassemble main in cdb I get:
pointersproject!main:
00000001`3fd51010 mov rax,qword ptr [pointersproject!a (00000001`3fd632f0)]
00000001`3fd51017 mov dword ptr [rax],1
00000001`3fd5101d xor eax,eax
00000001`3fd5101f ret
So *a is symbolized by pointersproject!a. All good.
However, if I declare the pointer within main:
void main(void)
{
int *a;
*a = 1;
}
I see that a is just an offset from the stack pointer (I believe), rather than the human-readable symbol I'd expect (like, say, pointersproject!main!a):
pointersproject!main:
00000001`3fd51010 sub rsp,18h
00000001`3fd51014 mov rax,qword ptr [rsp]
00000001`3fd51018 mov dword ptr [rax],1
00000001`3fd5101e xor eax,eax
00000001`3fd51020 add rsp,18h
00000001`3fd51024 ret
This is probably as much about my understanding of what the compiler's done as anything else but: can anyone explain why the notation for a isn't what I expect?
(This inspired by musing while looking at x64 Windows Debugging: Practical Foundations by Dmitry Vostokov).
When a variable is defined inside a function, it is an automatic variable unless explicitly declared static. Such variables only live during the execution of the function and are normally allocated on the stack, so they are deallocated when the function exits. The change you see in the compiled code is not due to the change in scope but to the change from a static to an automatic variable. If you make a static, it will not be allocated on the stack, even if its scope is the function main.
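The difference between the two storage durations is visible from C itself, without a debugger. A minimal sketch, with made-up function names:

```c
#include <assert.h>

/* Static storage: one object for the whole program run; it keeps its value
   between calls even though its scope is limited to this function. */
static int next_static(void)
{
    static int calls = 0;
    return ++calls;
}

/* Automatic storage: a fresh object in each call's stack frame. */
static int next_automatic(void)
{
    int calls = 0;
    return ++calls;
}

/* Small self-check showing the two behaviours side by side. */
static void storage_demo(void)
{
    int first = next_static();
    int second = next_static();
    int fresh = next_automatic();
    assert(second == first + 1);   /* the static counter remembered its value */
    assert(fresh == 1);            /* the automatic counter always starts over */
}
```

The static counter needs a fixed address for the whole run, which is why a debugger can attach a symbol to it; the automatic one only exists relative to the current stack frame.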
I am having a lot of trouble accessing a value in an array of chars at a specific location. I am using inline-assembly in C++ and using visual studio (if that is of any help). Here is my code:
char* addTwoStringNumbers(char *num1)
{
// here is what I have tried so far:
movzx eax, num1[3];
mov al, [eax]
}
When I debug, I can see that num1[3] is the value I want but I can't seem to make either al or eax equal that value, it seems to always be some pointer reference. I have also played around with Byte PTR with no luck.
I'm not good at either inline assembly or MASM syntax, but here are some hints:
1) Try this:
mov eax, num1 ;// eax points to the beginning of the string
movsx eax, byte ptr [eax + some_index] ;// movsx puts the char num1[some_index] in eax with sign extension; byte ptr gives the memory operand its size
(movzx is for unsigned char, so we used movsx)
2) You need to pass the value from eax to C. The simplest way is to declare a variable and put the result there: int rez; __asm { mov rez, eax; };
3) If you want to write the whole function in assembly, you should consider using the naked keyword (and read about calling conventions). If not, make sure you preserve registers and don't damage the stack.
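The difference between movsx and movzx in hint 1 can be written out in C. This sketch assumes plain char is signed, as it is by default in MSVC on x86; the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What movsx does: widen a byte to 32 bits, copying the sign bit upward. */
static int32_t sign_extend_byte(uint8_t byte)
{
    return byte >= 0x80 ? (int32_t)byte - 0x100 : (int32_t)byte;
}

/* What movzx does: widen a byte to 32 bits, filling the upper bits with zero. */
static int32_t zero_extend_byte(uint8_t byte)
{
    return (int32_t)byte;
}

/* C equivalent of loading num1[index] with movsx (char assumed signed). */
static int32_t char_at(const char *s, size_t index)
{
    assert(s != NULL);
    return sign_extend_byte((uint8_t)s[index]);
}
```

For ASCII digits, as in a string of numbers, both instructions give the same result; they only differ for bytes of 0x80 and above, where sign extension yields a negative value.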
Looks like someone's doing their ICS 51 homework! Follow ruslik's advice and you'll be done in no time.
How does one implement alloca() using inline x86 assembler in languages like D, C, and C++? I want to create a slightly modified version of it, but first I need to know how the standard version is implemented. Reading the disassembly from compilers doesn't help because they perform so many optimizations, and I just want the canonical form.
Edit: I guess the hard part is that I want this to have normal function call syntax, i.e. using a naked function or something, make it look like the normal alloca().
Edit # 2: Ah, what the heck, you can assume that we're not omitting the frame pointer.
Implementing alloca actually requires compiler assistance. A few people here are saying it's as easy as:
sub esp, <size>
which is unfortunately only half of the picture. Yes that would "allocate space on the stack" but there are a couple of gotchas.
1) If the compiler has emitted code which references other variables relative to esp instead of ebp (typical if you compile with no frame pointer), then those references need to be adjusted. Even with frame pointers, compilers do this sometimes.
2) More importantly, by definition, space allocated with alloca must be "freed" when the function exits.
The big one is point #2, because you need the compiler to emit code that symmetrically adds <size> to esp at every exit point of the function.
The most likely case is the compiler offers some intrinsics which allow library writers to ask the compiler for the help needed.
EDIT:
In fact, in glibc (GNU's implementation of libc), the implementation of alloca is simply this:
#ifdef __GNUC__
# define __alloca(size) __builtin_alloca (size)
#endif /* GCC. */
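For comparison, here is a minimal sketch of calling that builtin directly. It assumes GCC or Clang, which both provide __builtin_alloca; the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies s into a stack buffer obtained from the compiler builtin and returns
   the length of the copy. The buffer is released automatically when this
   function returns, so there is nothing to free. */
static size_t copy_via_builtin_alloca(const char *s)
{
    size_t n;
    char *buf;
    assert(s != NULL);
    n = strlen(s) + 1;
    buf = __builtin_alloca(n);     /* GCC/Clang builtin; the compiler adjusts the frame */
    memcpy(buf, s, n);
    return strlen(buf);
}
```

Because the compiler itself sees the allocation, it can keep its own stack-relative references consistent and restore the stack pointer on every exit path, which is exactly the help a hand-written sub esp cannot get.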
EDIT:
After thinking about it, the minimum I believe would be required is for the compiler to always use a frame pointer in any function which uses alloca, regardless of optimization settings. This would allow all locals to be referenced through ebp safely, and the frame cleanup would be handled by restoring the frame pointer to esp.
EDIT:
So I did some experimenting with things like this:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#define __alloca(p, N) \
do { \
__asm__ __volatile__( \
"sub %1, %%esp \n" \
"mov %%esp, %0 \n" \
: "=m"(p) \
: "i"(N) \
: "esp"); \
} while(0)
int func() {
char *p;
__alloca(p, 100);
memset(p, 0, 100);
strcpy(p, "hello world\n");
printf("%s\n", p);
}
int main() {
func();
}
which unfortunately does not work correctly. After analyzing the assembly output by gcc, it appears that optimizations get in the way. The problem seems to be that since the compiler's optimizer is entirely unaware of my inline assembly, it has a habit of doing things in an unexpected order and still referencing things via esp.
Here's the resultant ASM:
8048454: push ebp
8048455: mov ebp,esp
8048457: sub esp,0x28
804845a: sub esp,0x64 ; <- this and the line below are our "alloc"
804845d: mov DWORD PTR [ebp-0x4],esp
8048460: mov eax,DWORD PTR [ebp-0x4]
8048463: mov DWORD PTR [esp+0x8],0x64 ; <- whoops! compiler still referencing via esp
804846b: mov DWORD PTR [esp+0x4],0x0 ; <- whoops! compiler still referencing via esp
8048473: mov DWORD PTR [esp],eax ; <- whoops! compiler still referencing via esp
8048476: call 8048338 <memset@plt>
804847b: mov eax,DWORD PTR [ebp-0x4]
804847e: mov DWORD PTR [esp+0x8],0xd ; <- whoops! compiler still referencing via esp
8048486: mov DWORD PTR [esp+0x4],0x80485a8 ; <- whoops! compiler still referencing via esp
804848e: mov DWORD PTR [esp],eax ; <- whoops! compiler still referencing via esp
8048491: call 8048358 <memcpy@plt>
8048496: mov eax,DWORD PTR [ebp-0x4]
8048499: mov DWORD PTR [esp],eax ; <- whoops! compiler still referencing via esp
804849c: call 8048368 <puts@plt>
80484a1: leave
80484a2: ret
As you can see, it isn't so simple. Unfortunately, I stand by my original assertion that you need compiler assistance.
It would be tricky to do this - in fact, unless you have enough control over the compiler's code generation it cannot be done entirely safely. Your routine would have to manipulate the stack, such that when it returned everything was cleaned, but the stack pointer remained in such a position that the block of memory remained in that place.
The problem is that unless you can inform the compiler that the stack pointer has been modified across your function call, it may well decide that it can continue to refer to other locals (or whatever) through the stack pointer, but the offsets will be incorrect.
For the D programming language, the source code for alloca() comes with the download. How it works is fairly well commented. For dmd1, it's in /dmd/src/phobos/internal/alloca.d. For dmd2, it's in /dmd/src/druntime/src/compiler/dmd/alloca.d.
The C and C++ standards don't specify that alloca() has to use the stack, because alloca() isn't in the C or C++ standards (or POSIX for that matter)¹.
A compiler may also implement alloca() using the heap. For example, the ARM RealView (RVCT) compiler's alloca() uses malloc() to allocate the buffer (referenced on their website here), and also causes the compiler to emit code that frees the buffer when the function returns. This doesn't require playing with the stack pointer, but still requires compiler support.
Microsoft Visual C++ has a _malloca() function that uses the heap if there isn't enough room on the stack, but it requires the caller to use _freea(), unlike _alloca(), which does not need/want explicit freeing.
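The general idea of "stack when it fits, heap otherwise, with an explicit release call" can be sketched in portable C. This is not how _malloca() is implemented; the type and function names, and the 256-byte threshold, are assumptions made up for this example, and the fallback decision here is based on size alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { SCRATCH_INLINE_BYTES = 256 };   /* arbitrary threshold for this sketch */

/* Hypothetical helper type: declared as a local by the caller, so inline_buf
   is stack storage in the caller's frame. */
struct scratch {
    unsigned char inline_buf[SCRATCH_INLINE_BYTES];
    void *heap;                        /* non-NULL only when malloc was used */
};

/* Returns n bytes: from the inline buffer if they fit, otherwise from the heap. */
static void *scratch_acquire(struct scratch *s, size_t n)
{
    assert(s != NULL);
    s->heap = NULL;
    if (n <= sizeof s->inline_buf)
        return s->inline_buf;
    s->heap = malloc(n);
    return s->heap;                    /* may be NULL if malloc fails */
}

/* Must be called explicitly, in the spirit of _freea(); a no-op for the inline case. */
static void scratch_release(struct scratch *s)
{
    assert(s != NULL);
    free(s->heap);                     /* free(NULL) is allowed and does nothing */
    s->heap = NULL;
}
```

The explicit release is the price of not having compiler support: nothing else knows that a heap block may be attached to the frame.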
(With C++ destructors at your disposal, you can obviously do the cleanup without compiler support, but you can't declare local variables inside an arbitrary expression so I don't think you could write an alloca() macro that uses RAII. Then again, apparently you can't use alloca() in some expressions (like function parameters) anyway.)
¹ Yes, it's legal to write an alloca() that simply calls system("/usr/games/nethack").
Continuation Passing Style Alloca
Variable-Length Array in pure ISO C++. Proof-of-Concept implementation.
Usage
void foo(unsigned n)
{
cps_alloca<Payload>(n,[](Payload *first,Payload *last)
{
fill(first,last,something);
});
}
Core Idea
template<typename T,unsigned N,typename F>
auto cps_alloca_static(F &&f) -> decltype(f(nullptr,nullptr))
{
T data[N];
return f(&data[0],&data[0]+N);
}
template<typename T,typename F>
auto cps_alloca_dynamic(unsigned n,F &&f) -> decltype(f(nullptr,nullptr))
{
vector<T> data(n);
return f(&data[0],&data[0]+n);
}
template<typename T,typename F>
auto cps_alloca(unsigned n,F &&f) -> decltype(f(nullptr,nullptr))
{
switch(n)
{
case 1: return cps_alloca_static<T,1>(f);
case 2: return cps_alloca_static<T,2>(f);
case 3: return cps_alloca_static<T,3>(f);
case 4: return cps_alloca_static<T,4>(f);
case 0: return f(nullptr,nullptr);
default: return cps_alloca_dynamic<T>(n,f);
} // mpl::for_each / array / index pack / recursive bsearch / etc. variation
}
LIVE DEMO
cps_alloca on github
alloca is implemented directly in assembly code.
That's because you cannot control the stack layout directly from high-level languages.
Also note that most implementations will perform some additional optimization, like aligning the stack for performance reasons.
The standard way of allocating stack space on X86 looks like this:
sub esp, XXX
where XXX is the number of bytes to allocate.
Edit:
If you want to look at the implementation (and you're using MSVC) see alloca16.asm and chkstk.asm.
The code in the first file basically aligns the desired allocation size to a 16-byte boundary. The code in the second file walks all pages that would belong to the new stack area and touches them. This may trigger PAGE_GUARD exceptions, which the OS uses to grow the stack.
You can examine the sources of an open-source C compiler, like Open Watcom, and find it yourself.
If you can't use c99's Variable Length Arrays, you can use a compound literal cast to a void pointer.
#define ALLOCA(sz) ((void*)((char[sz]){0}))
This also works for -ansi (as a gcc extension) and even when it is a function argument;
some_func(&useful_return, ALLOCA(sizeof(struct useless_return)));
The downside is that when compiled as C++, g++ > 4.6 will give you an error: taking address of temporary array ... clang and icc don't complain, though.
Alloca is easy, you just move the stack pointer up; then generate all the read/writes to point to this new block
sub esp, 4
What we want to do is something like that:
void* alloca(size_t size) {
<sp> -= size;
return <sp>;
}
In Assembly (Visual Studio 2017, 64bit) it looks like:
;alloca.asm
_TEXT SEGMENT
PUBLIC alloca
alloca PROC
sub rsp, rcx ;<sp> -= size
mov rax, rsp ;return <sp>;
ret
alloca ENDP
_TEXT ENDS
END
Unfortunately our return pointer is the last item on the stack, and we do not want to overwrite it. Additionally we need to take care of alignment, i.e. round size up to a multiple of 8. So we have to do this:
;alloca.asm
_TEXT SEGMENT
PUBLIC alloca
alloca PROC
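;round up to multiple of 8
;(r8 is used as scratch because rbx is a nonvolatile register in the Windows x64 convention)
mov rax, rcx
mov r8, 8
xor rdx, rdx
div r8 ;rdx = size mod 8
sub r8, rdx ;r8 = 8 - (size mod 8)
mov rax, r8
mov r8, 8
xor rdx, rdx
div r8 ;rdx = (8 - size mod 8) mod 8
add rcx, rdx
;move the stack pointer down, keeping the return address on top
pop r8 ;take the return address off the stack
sub rsp, rcx
mov rax, rsp
push r8 ;put the return address back for ret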
ret
alloca ENDP
_TEXT ENDS
END
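The rounding step in the listing above adds (8 - size mod 8) mod 8 to the size. In C the same result is usually obtained with a mask instead of two divisions. The sketch below shows both forms and a brute-force check that they agree; the function names are made up. Note that the Windows x64 convention expects rsp to be 16-byte aligned at call sites, so a real implementation would probably need a stricter multiple than 8.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder-based form, following the listing: add (8 - size % 8) % 8. */
static uint64_t round_up8_div(uint64_t size)
{
    return size + (8 - size % 8) % 8;
}

/* Mask form: add 7, then clear the low three bits. */
static uint64_t round_up8_mask(uint64_t size)
{
    return (size + 7) & ~(uint64_t)7;
}

/* Returns 1 if both forms agree for every size below limit, 0 otherwise. */
static int round_up8_forms_agree(uint64_t limit)
{
    uint64_t size;
    for (size = 0; size < limit; ++size) {
        assert(round_up8_mask(size) % 8 == 0);   /* result is always a multiple of 8 */
        if (round_up8_div(size) != round_up8_mask(size))
            return 0;
    }
    return 1;
}
```

In assembly the mask form is two instructions (an add of 7 and an and with -8), which avoids the comparatively slow div.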
my_alloca: ; void *my_alloca(int size);
MOV EAX, [ESP+4] ; get size
ADD EAX,-4 ; include return address as stack space(4bytes)
SUB ESP,EAX
JMP DWORD [ESP+EAX] ; replace RET(do not pop return address)
I recommend the "enter" instruction. Available on 286 and newer processors (may have been available on the 186 as well, I can't remember offhand, but those weren't widely available anyways).