How does one implement alloca() using inline x86 assembler in languages like D, C, and C++? I want to create a slightly modified version of it, but first I need to know how the standard version is implemented. Reading the disassembly from compilers doesn't help because they perform so many optimizations, and I just want the canonical form.
Edit: I guess the hard part is that I want this to have normal function call syntax, i.e. using a naked function or something, make it look like the normal alloca().
Edit # 2: Ah, what the heck, you can assume that we're not omitting the frame pointer.
Implementing alloca actually requires compiler assistance. A few people here are saying it's as easy as:
sub esp, <size>
which is unfortunately only half of the picture. Yes, that would "allocate space on the stack", but there are a couple of gotchas:
1. If the compiler emitted code which references other variables relative to esp instead of ebp (typical if you compile with no frame pointer), then those references need to be adjusted. Even with frame pointers, compilers do this sometimes.
2. More importantly, by definition, space allocated with alloca must be "freed" when the function exits.
The big one is point #2, because you need the compiler to emit code which symmetrically adds <size> back to esp at every exit point of the function.
The most likely case is the compiler offers some intrinsics which allow library writers to ask the compiler for the help needed.
EDIT:
In fact, in glibc (GNU's implementation of libc), the implementation of alloca is simply this:
#ifdef __GNUC__
# define __alloca(size) __builtin_alloca (size)
#endif /* GCC. */
EDIT:
After thinking about it, I believe the minimum required would be for the compiler to always use a frame pointer in any function which uses alloca, regardless of optimization settings. This would allow all locals to be referenced safely through ebp, and the frame cleanup would be handled by restoring esp from the frame pointer.
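A quick way to see this (and point #2 above) with GCC, as a small experiment of my own rather than part of the original answer: even with -fomit-frame-pointer, GCC keeps a frame pointer in any function that uses __builtin_alloca, and restoring esp from ebp in the epilogue releases the space at every exit point:

#include <string.h>

/* sketch only: gcc forces a frame pointer here, so "leave" (mov esp,ebp /
   pop ebp) frees the alloca'd block on every return path */
int uses_alloca(unsigned n, int early_out)
{
    char *p = __builtin_alloca(n);   /* the compiler adjusts esp and tracks it */
    if (early_out)
        return 0;                    /* space released here... */
    memset(p, 0, n);
    return p[0];                     /* ...and here */
}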
EDIT:
So I did some experimenting with something like this:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#define __alloca(p, N) \
do { \
__asm__ __volatile__( \
"sub %1, %%esp \n" \
"mov %%esp, %0 \n" \
: "=m"(p) \
: "i"(N) \
: "esp"); \
} while(0)
int func() {
char *p;
__alloca(p, 100);
memset(p, 0, 100);
strcpy(p, "hello world\n");
printf("%s\n", p);
}
int main() {
func();
}
which unfortunately does not work correctly. After analyzing the assembly output by gcc, it appears that optimizations get in the way. The problem seems to be that since the compiler's optimizer is entirely unaware of my inline assembly, it has a habit of doing things in an unexpected order and still referencing things via esp.
Here's the resultant ASM:
8048454: push ebp
8048455: mov ebp,esp
8048457: sub esp,0x28
804845a: sub esp,0x64 ; <- this and the line below are our "alloc"
804845d: mov DWORD PTR [ebp-0x4],esp
8048460: mov eax,DWORD PTR [ebp-0x4]
8048463: mov DWORD PTR [esp+0x8],0x64 ; <- whoops! compiler still referencing via esp
804846b: mov DWORD PTR [esp+0x4],0x0 ; <- whoops! compiler still referencing via esp
8048473: mov DWORD PTR [esp],eax ; <- whoops! compiler still referencing via esp
8048476: call 8048338 <memset@plt>
804847b: mov eax,DWORD PTR [ebp-0x4]
804847e: mov DWORD PTR [esp+0x8],0xd ; <- whoops! compiler still referencing via esp
8048486: mov DWORD PTR [esp+0x4],0x80485a8 ; <- whoops! compiler still referencing via esp
804848e: mov DWORD PTR [esp],eax ; <- whoops! compiler still referencing via esp
8048491: call 8048358 <memcpy@plt>
8048496: mov eax,DWORD PTR [ebp-0x4]
8048499: mov DWORD PTR [esp],eax ; <- whoops! compiler still referencing via esp
804849c: call 8048368 <puts@plt>
80484a1: leave
80484a2: ret
As you can see, it isn't so simple. Unfortunately, I stand by my original assertion that you need compiler assistance.
It would be tricky to do this - in fact, unless you have enough control over the compiler's code generation it cannot be done entirely safely. Your routine would have to manipulate the stack, such that when it returned everything was cleaned, but the stack pointer remained in such a position that the block of memory remained in that place.
The problem is that unless you can inform the compiler that the stack pointer has been modified across your function call, it may well decide that it can continue to refer to other locals (or whatever) through the stack pointer, but the offsets will be incorrect.
For the D programming language, the source code for alloca() comes with the download. How it works is fairly well commented. For dmd1, it's in /dmd/src/phobos/internal/alloca.d. For dmd2, it's in /dmd/src/druntime/src/compiler/dmd/alloca.d.
The C and C++ standards don't specify that alloca() has to use the stack, because alloca() isn't in the C or C++ standards (or POSIX, for that matter)¹.
A compiler may also implement alloca() using the heap. For example, the ARM RealView (RVCT) compiler's alloca() uses malloc() to allocate the buffer (referenced on their website here), and also causes the compiler to emit code that frees the buffer when the function returns. This doesn't require playing with the stack pointer, but still requires compiler support.
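As an illustration only (this is not how RVCT does it), GNU C's cleanup attribute can approximate that "heap-allocated but automatically freed" behaviour without touching the stack pointer; the HEAP_ALLOCA macro and free_ptr helper below are names I made up for the sketch:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

static void free_ptr(void *p) { free(*(void **)p); }   /* receives &buf */

/* declares a heap buffer that is freed automatically when the enclosing
   scope exits */
#define HEAP_ALLOCA(name, n) \
    void *name __attribute__((cleanup(free_ptr))) = malloc(n)

void demo(void)
{
    HEAP_ALLOCA(buf, 100);
    strcpy(buf, "hello");
    puts(buf);
}                            /* free(buf) runs here, however demo() returns */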
Microsoft Visual C++ has a _malloca() function that uses the heap if there isn't enough room on the stack, but it requires the caller to use _freea(), unlike _alloca(), which does not need/want explicit freeing.
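For reference, a minimal usage sketch of that MSVC-specific pair:

#include <malloc.h>     /* MSVC: _malloca / _freea */
#include <string.h>

void msvc_demo(size_t n)
{
    char *p = _malloca(n);      /* stack if it fits, heap otherwise */
    if (p) {
        memset(p, 0, n);
        _freea(p);              /* required; only actually frees if it came from the heap */
    }
}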
(With C++ destructors at your disposal, you can obviously do the cleanup without compiler support, but you can't declare local variables inside an arbitrary expression so I don't think you could write an alloca() macro that uses RAII. Then again, apparently you can't use alloca() in some expressions (like function parameters) anyway.)
¹ Yes, it's legal to write an alloca() that simply calls system("/usr/games/nethack").
Continuation Passing Style Alloca
Variable-Length Array in pure ISO C++. Proof-of-Concept implementation.
Usage
void foo(unsigned n)
{
cps_alloca<Payload>(n,[](Payload *first,Payload *last)
{
fill(first,last,something);
});
}
Core Idea
template<typename T,unsigned N,typename F>
auto cps_alloca_static(F &&f) -> decltype(f(nullptr,nullptr))
{
T data[N];
return f(&data[0],&data[0]+N);
}
template<typename T,typename F>
auto cps_alloca_dynamic(unsigned n,F &&f) -> decltype(f(nullptr,nullptr))
{
vector<T> data(n);
return f(&data[0],&data[0]+n);
}
template<typename T,typename F>
auto cps_alloca(unsigned n,F &&f) -> decltype(f(nullptr,nullptr))
{
switch(n)
{
case 1: return cps_alloca_static<T,1>(f);
case 2: return cps_alloca_static<T,2>(f);
case 3: return cps_alloca_static<T,3>(f);
case 4: return cps_alloca_static<T,4>(f);
case 0: return f(nullptr,nullptr);
default: return cps_alloca_dynamic<T>(n,f);
} // mpl::for_each / array / index pack / recursive bsearch / etc. variation
}
LIVE DEMO
cps_alloca on github
alloca is directly implemented in assembly code.
That's because you cannot control stack layout directly from high level languages.
Also note that most implementations will perform some additional optimization, like aligning the stack for performance reasons.
The standard way of allocating stack space on X86 looks like this:
sub esp, XXX
where XXX is the number of bytes to allocate.
Edit:
If you want to look at the implementation (and you're using MSVC) see alloca16.asm and chkstk.asm.
The code in the first file basically aligns the desired allocation size to a 16-byte boundary. The code in the second file actually walks all the pages which would belong to the new stack area and touches them. This will possibly trigger PAGE_GUARD exceptions, which are used by the OS to grow the stack.
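Conceptually, the probing looks something like this (a sketch only; the real chkstk code is hand-written assembly, and the 4096-byte page size here is just the usual x86 value):

#define PAGE_SIZE 4096

/* touch every page between the old and the new stack pointer so the OS's
   guard-page mechanism can grow the stack one page at a time */
static void probe_stack(volatile char *old_sp, volatile char *new_sp)
{
    for (volatile char *p = old_sp - PAGE_SIZE; p >= new_sp; p -= PAGE_SIZE)
        (void)*p;               /* the access may hit the guard page and fault */
}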
You can examine the sources of an open-source C compiler, like Open Watcom, and find out for yourself.
If you can't use C99's variable-length arrays, you can use a compound literal cast to a void pointer.
#define ALLOCA(sz) ((void*)((char[sz]){0}))
This also works with -ansi (as a gcc extension) and even when it is a function argument:
some_func(&useful_return, ALLOCA(sizeof(struct useless_return)));
The downside is that when compiled as C++, g++ > 4.6 will give you an error: taking address of temporary array ... clang and icc don't complain, though.
Alloca is easy: you just move the stack pointer, then generate all the reads/writes to point at this new block.
sub esp, 4
What we want to do is something like this:
void* alloca(size_t size) {
<sp> -= size;
return <sp>;
}
In assembly (Visual Studio 2017, 64-bit MASM) it looks like this:
;alloca.asm
_TEXT SEGMENT
PUBLIC alloca
alloca PROC
sub rsp, rcx ;<sp> -= size
mov rax, rsp ;return <sp>;
ret
alloca ENDP
_TEXT ENDS
END
Unfortunately our return address is the last item on the stack, and we do not want to overwrite it. Additionally, we need to take care of alignment, i.e. round the size up to a multiple of 8. So we have to do this:
;alloca.asm
_TEXT SEGMENT
PUBLIC alloca
alloca PROC
;round the requested size (rcx) up to a multiple of 8
mov rax, rcx
mov r10, 8      ;r10 is volatile in the Windows x64 convention, so it may be clobbered
xor rdx, rdx
div r10         ;rdx = size mod 8
sub r10, rdx    ;r10 = 8 - (size mod 8)
mov rax, r10
mov r10, 8
xor rdx, rdx
div r10         ;rdx = (8 - size mod 8) mod 8, i.e. 0 if already aligned
add rcx, rdx    ;rcx = size rounded up
;move the return address out of the way, then grow the stack
pop r10         ;pop our return address
sub rsp, rcx    ;allocate the block
mov rax, rsp    ;return a pointer to it
push r10        ;put the return address back, just below the block
ret
alloca ENDP
_TEXT ENDS
END
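For what it's worth, the div-based sequence above just rounds the requested size up to the next multiple of 8, which in C is simply:

#include <stddef.h>

/* equivalent of the rounding code above */
size_t round_up_8(size_t size)
{
    return (size + 7) & ~(size_t)7;    /* e.g. 1..8 -> 8, 9..16 -> 16 */
}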
my_alloca: ; void *my_alloca(int size);
MOV EAX, [ESP+4]    ; get size
ADD EAX, -4         ; include the return address as stack space (4 bytes)
SUB ESP, EAX
JMP DWORD [ESP+EAX] ; replaces RET (do not pop the return address)
I recommend the "enter" instruction. Available on 286 and newer processors (it may have been available on the 186 as well, I can't remember offhand, but those weren't widely available anyway).
Related
Source C Code:
#include <stdio.h>

int main()
{
    int i;
    for(i = 0; i < 10; i++)
    {
        printf("Hello World!\n");
    }
}
Dump of Intel syntax x86 assembler code for function main:
1. 0x000055555555463a <+0>: push rbp
2. 0x000055555555463b <+1>: mov rbp,rsp
3. 0x000055555555463e <+4>: sub rsp,0x10
4. 0x0000555555554642 <+8>: mov DWORD PTR [rbp-0x4],0x0
5. 0x0000555555554649 <+15>: jmp 0x55555555465b <main+33>
6. 0x000055555555464b <+17>: lea rdi,[rip+0xa2] # 0x5555555546f4
7. 0x0000555555554652 <+24>: call 0x555555554510 <puts@plt>
8. 0x0000555555554657 <+29>: add DWORD PTR [rbp-0x4],0x1
9. 0x000055555555465b <+33>: cmp DWORD PTR [rbp-0x4],0x9
10. 0x000055555555465f <+37>: jle 0x55555555464b <main+17>
11. 0x0000555555554661 <+39>: mov eax,0x0
12. 0x0000555555554666 <+44>: leave
13. 0x0000555555554667 <+45>: ret
I'm currently working through "Hacking, The Art of Exploitation 2nd Edition by Jon Erickson", and I'm just starting to tackle assembly.
I have a few questions about the translation of the provided C code to Assembly, but I am mainly wondering about my first question.
1st Question: What is the purpose of line 6? (lea rdi,[rip+0xa2]).
My current working theory, is that this is used to save where the next instructions will jump to in order to track what is going on. I believe this line correlates with the printf function in the source C code.
So essentially, it's loading the effective address of rip+0xa2 (0x5555555546f4) into the register rdi, simply to track where it will jump to for the printf function?
2nd Question: What is the purpose of line 11? (mov eax,0x0?)
I do not see a prior use of the register EAX, and I am not sure why it needs to be set to 0.
The LEA puts a pointer to the string literal into a register, as the first arg for puts. The search term you're looking for is "calling convention" and/or ABI. (And also RIP-relative addressing). Why is the address of static variables relative to the Instruction Pointer?
The small offset between code and data (only +0xa2) is because the .rodata section gets linked into the same ELF segment as .text, and your program is tiny. (Newer gcc + ld versions will put it in a separate page so it can be non-executable.)
The compiler can't use a shorter, more efficient mov edi, address in position-independent code in your Linux PIE executable; it would do that with gcc -fno-pie -no-pie.
mov eax,0 implements the implicit return 0 at the end of main that C99 and C++ guarantee. EAX is the return-value register in all the standard x86 calling conventions.
If you don't use gcc -O2 or higher, you won't get peephole optimizations like xor-zeroing (xor eax,eax).
This:
lea rdi,[rip+0xa2]
Is a typical position independent LEA, putting the string address into a register (instead of loading from that memory address).
Your executable is position independent, meaning that it can be loaded at runtime at any address. Therefore, the real address of the argument to be passed to puts() needs to be calculated at runtime every single time, since the base address of the program could be different each time. Also, puts() is used instead of printf() because the compiler optimized the call since there is no need to format anything.
In this case, the binary was most probably loaded with the base address 0x555555554000. The string to use is stored in your binary at offset 0x6f4. Since the next instruction is at offset 0x652, you know that, no matter where the binary is loaded in memory, the address you want will be rip + (0x6f4 - 0x652) = rip + 0xa2, which is what you see above. See this answer of mine for another example.
The purpose of:
mov eax,0x0
Is to set the return value of main(). In Intel x86, the calling convention is to return values in the rax register (eax if the value is 32 bits, which is true in this case since main returns an int). See the table entry for x86-64 at the end of this page.
Even if you don't add an explicit return statement, main() is a special function, and the compiler will add a default return 0 for you.
If you add some debug data and symbols to the assembly everything will be easier. It is also easier to read the code if you add some optimizations.
There is a very useful tool, godbolt; here is your example: https://godbolt.org/z/9sRFmU
On the asm listing there you can clearly see that that line loads the address of the string literal, which will then be printed by the function.
EAX is considered volatile, and main returns zero by default; that's the reason why it is zeroed.
The calling convention is explained here: https://en.wikipedia.org/wiki/X86_calling_conventions
Here you have more interesting cases https://godbolt.org/z/M4MeGk
I'm having trouble understanding why the compiler chose to offset the stack space in the way it did with the code I wrote.
I was toying with Godbolt's Compiler Explorer in order to study the C calling convention, when I came up with a simple code that puzzled me by its choices.
The code is found in this link. I selected GCC 8.2 x86-64, but I am targeting x86 processors, and this is important. Below is a transcription of the C code and the generated assembly reported by the Compiler Explorer.
// C code
int testing(char a, int b, char c) {
return 42;
}
int main() {
int x = testing('0', 0, '7');
return 0;
}
; Generated assembly
testing(char, int, char):
push ebp
mov ebp, esp
sub esp, 8
mov edx, DWORD PTR [ebp+8]
mov eax, DWORD PTR [ebp+16]
mov BYTE PTR [ebp-4], dl
mov BYTE PTR [ebp-8], al
mov eax, 42
leave
ret
main:
push ebp
mov ebp, esp
sub esp, 16
push 55
push 0
push 48
call testing(char, int, char)
add esp, 12
mov DWORD PTR [ebp-4], eax
mov eax, 0
leave
ret
Looking at the assembly column from now on, as I understood, line 15 is responsible for reserving space in the stack for the local variables. The problem is that I have only one local int and the offset was by 16 bytes instead of 4. This feels like wasted space.
Is this somewhat related to word alignment? But even if it is, if the sizes of the general purpose registers are 4 bytes, shouldn't this alignment be with regards to 4 bytes?
One other strange thing I see is with respect to the placement of the local chars of the testing function. They seem to be taking 4 bytes each in the stack, as seen in lines 7-8, but only the lower bytes are manipulated. Why not use only 1 byte each?
These choices are probably well intended, and I would really like to understand their purpose (or whether there is no purpose). Or maybe I'm just confused and didn't quite get it.
So, by the comments, I could figure out that the stack growth issue is due to the i386 System V ABI requirements, as stated by @PeterCordes.
The reason why the chars are word-aligned may be due to GCC's default behavior of aligning for speed, as may be inferred from @Ped7g's comment. Although not definitive, this is a good enough answer for me.
It's common today to acquire stack space in multiples of this size (16 bytes), for several reasons:
Cache lines favor this behaviour, helping keep the whole data in the cache.
Space for temporaries is preallocated, avoiding the need for push and pop instructions when some temporary storage is needed outside the CPU registers.
Individual push and pop instructions degrade pipeline execution, by requiring the stack pointer to be updated before the next instruction is executed; preallocating the space decouples the data dependencies between consecutive instructions and allows them to run faster.
For these reasons, ABIs are specified to work this way, and actual compilers follow them.
I've made a function to calculate the length of a C string (I'm trying to beat clang's optimizer using -O3). I'm running macOS.
_string_length1:
push rbp
mov rbp, rsp
xor rax, rax
.body:
cmp byte [rdi], 0
je .exit
inc rdi
inc rax
jmp .body
.exit:
pop rbp
ret
This is the C function I'm trying to beat:
size_t string_length2(const char *str) {
size_t ret = 0;
while (str[ret]) {
ret++;
}
return ret;
}
And it disassembles to this:
string_length2:
push rbp
mov rbp, rsp
mov rax, -1
LBB0_1:
cmp byte ptr [rdi + rax + 1], 0
lea rax, [rax + 1]
jne LBB0_1
pop rbp
ret
Every C function sets up a stack frame using push rbp and mov rbp, rsp, and breaks it using pop rbp. But I'm not using the stack in any way here, I'm only using processor registers. It worked without using a stack frame (when I tested on x86-64), but is it necessary?
No, the stack frame is, at least in theory, not always required. An optimizing compiler might in some cases avoid using the call stack. Notably when it is able to inline a called function (in some specific call site), or when the compiler successfully detects a tail call (which reuses the caller's frame).
Read the ABI of your platform to understand requirements related to the stack.
You might try to compile your program with link time optimization (e.g. compile and link with gcc -flto -O2) to get more optimizations.
In principle, one could imagine a compiler clever enough to (for some programs) avoid using any call stack.
BTW, I just compiled a naive recursive long fact(int n) factorial function with GCC 7.1 (on Debian/Sid/x86-64) at -O3 (i.e. gcc -fverbose-asm -S -O3 fact.c). The resulting assembler code fact.s contains no call machine instruction.
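For reference, the function meant there is the obvious one (my transcription, not the answer's exact source):

/* naive recursive factorial; as noted above, gcc -O3 compiles it without
   emitting any call instruction, turning the recursion into a loop */
long fact(int n)
{
    return (n <= 1) ? 1L : n * fact(n - 1);
}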
Every C function sets up a stack frame using...
This is true for your compiler, not in general. It is possible to compile a C program without using the stack at all—see, for example, the method CPS, continuation passing style. Probably no C compiler on the market does so, but it is important to know that there are other ways to execute programs, in addition to stack-evaluation.
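As a toy illustration of continuation passing style in C (the names are mine):

/* toy CPS factorial: instead of returning, the function hands its result to
   the continuation k; both calls below are tail calls, so a compiler that
   performs tail-call optimization needs no growing call stack */
typedef void (*cont)(unsigned long result);

void fact_cps(unsigned n, unsigned long acc, cont k)
{
    if (n <= 1)
        k(acc);
    else
        fact_cps(n - 1, acc * n, k);
}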
The ISO 9899 standard says nothing about the stack. It leaves compiler implementations free to choose whichever method of evaluation they consider to be the best.
I'm writing a cryptography program, and the core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it extensively uses instructions like adc that are not easily accessible from C. I don't want to inline this function, because it's big and it's called several times in the inner loop.
Ideally I would also like to define a custom calling convention for this function, because internally it uses all the registers (except rsp), doesn't clobber its arguments, and returns in registers. Right now, it's adapted to the C calling convention, but of course this makes it slower (by about 10%).
To avoid this, I can call it with asm("call %Pn" : ... : my_function... : "cc", all the registers); but is there a way to tell GCC that the call instruction messes with the stack? Otherwise GCC will just put all those registers in the red zone, and the top one will get clobbered. I can compile the whole module with -mno-red-zone, but I'd prefer a way to tell GCC that, say, the top 8 bytes of the red zone will be clobbered so that it won't put anything there.
From your original question I did not realize gcc limited red-zone use to leaf functions. I don't think that's required by the x86_64 ABI, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function calling your assembly routine a non-leaf for purposes of compilation:
int global;
void other(void);

void was_leaf(void)
{
    if (global) other();
}
GCC can't tell whether global will be true, so it can't optimize away the call to other(), and therefore was_leaf() is no longer a leaf function. I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move %rsp, and with the modification shown it did.
I also tried simply allocating more than 128 bytes (just char buf[150]) in a leaf but I was shocked to see it only did a partial subtraction:
pushq %rbp
movq %rsp, %rbp
subq $40, %rsp
movb $7, -155(%rbp)
If I put the leaf-defeating code back in that becomes subq $160, %rsp
The max-performance way might be to write the whole inner loop in asm (including the call instructions, if it's really worth it to unroll but not inline. Certainly plausible if fully inlining is causing too many uop-cache misses elsewhere).
Anyway, have C call an asm function containing your optimized loop.
BTW, clobbering all the registers makes it hard for gcc to make a very good loop, so you might well come out ahead from optimizing the whole loop yourself. (e.g. maybe keep a pointer in a register, and an end-pointer in memory, because cmp mem,reg is still fairly efficient).
Have a look at the code gcc/clang wrap around an asm statement that modifies an array element (on Godbolt):
void testloop(long *p, long count) {
for (long i = 0 ; i < count ; i++) {
asm(" # XXX asm operand in %0"
: "+r" (p[i])
:
: // "rax",
"rbx", "rcx", "rdx", "rdi", "rsi", "rbp",
"r8", "r9", "r10", "r11", "r12","r13","r14","r15"
);
}
}
#gcc7.2 -O3 -march=haswell
push registers and other function-intro stuff
lea rcx, [rdi+rsi*8] ; end-pointer
mov rax, rdi
mov QWORD PTR [rsp-8], rcx ; store the end-pointer
mov QWORD PTR [rsp-16], rdi ; and the start-pointer
.L6:
# rax holds the current-position pointer on loop entry
# also stored in [rsp-16]
mov rdx, QWORD PTR [rax]
mov rax, rdx # looks like a missed optimization vs. mov rax, [rax], because the asm clobbers rdx
XXX asm operand in rax
mov rbx, QWORD PTR [rsp-16] # reload the pointer
mov QWORD PTR [rbx], rax
mov rax, rbx # another weird missed-optimization (lea rax, [rbx+8])
add rax, 8
mov QWORD PTR [rsp-16], rax
cmp QWORD PTR [rsp-8], rax
jne .L6
# cleanup omitted.
clang counts a separate counter down towards zero. But it uses load / add -1 / store instead of a memory-destination add [mem], -1 / jnz.
You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.
Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c latency. (It's still 1 uop on AMD, but higher latency especially on Bulldozer-family). Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or it saves more spill/reloads than it costs.
But of course XMM is not very viable for loop counters (paddd/pcmpeq/pmovmskb/cmp/jcc or psubd/ptest/jcc are not great compared to sub [mem], 1 / jcc), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another paddq sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.
If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use add $-128, %rsp ; call ; sub $-128, %rsp to preserve the red-zone in those versions. (Note that -128 is encodeable as an imm8 but +128 isn't.)
Including an actual function call in your C function doesn't necessarily make it safe to assume the red-zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red-zone, so clobbering all the registers in an asm statement is quite likely to trigger that behaviour.
// a non-leaf function that still uses the red-zone with gcc
void cryptofunc(long arg);   // declared elsewhere

void bar(void) {
    //cryptofunc(1); // gcc/clang don't use the redzone after this (not future-proof)
    volatile int tmp = 1;
    (void)tmp;
    cryptofunc(1);   // but gcc will use the redzone before a tailcall
}
# gcc7.2 -O3 output
mov edi, 1
mov DWORD PTR [rsp-12], 1
mov eax, DWORD PTR [rsp-12]
jmp cryptofunc(long)
If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space since they have to adjust the stack anyway (to align rsp before a call). This is not future-proof at all, but should happen to work.
GNU C has an __attribute__((target("options"))) x86 function attribute, but it's not usable for arbitrary options, and -mno-red-zone is not one of the ones you can toggle on a per-function basis, or with #pragma GCC target ("options") within a compilation unit.
You can use stuff like
__attribute__(( target("sse4.1,arch=core2") ))
void penryn_version(void) {
...
}
but not __attribute__(( target("mno-red-zone") )).
There's a #pragma GCC optimize and an optimize function-attribute (both of which are not intended for production code), but #pragma GCC optimize ("-mno-red-zone") doesn't work either. I think the idea is to let some important functions be optimized with -O2 even in debug builds. You can set -f options or -O.
You could put the function in a file by itself and compile that compilation unit with -mno-red-zone, though. (And hopefully LTO will not break anything...)
Can't you just modify your assembly function to respect the caller's red zone, as the x86-64 ABI requires of signal handlers, by shifting the stack pointer down by 128 bytes on entry to your function?
Or if you are referring to the return pointer itself, put the shift into your call macro (so sub %rsp; call...)
Not sure but looking at GCC documentation for function attributes, I found the stdcall function attribute which might be of interest.
I'm still wondering what you find problematic with your asm call version. If it's just aesthetics, you could transform it into a macro, or an inline function.
What about creating a dummy function that is written in C and does nothing but call the inline assembly?
I am learning assembly using GDB & Eclipse
Here is a simple C code.
#include <stdlib.h>

int absdiff(int x, int y)
{
    if(x < y)
        return y-x;
    else
        return x-y;
}

int main(void) {
    int x = 10;
    int y = 15;
    absdiff(x,y);
    return EXIT_SUCCESS;
}
Here is corresponding assembly instructions for main()
main:
080483bb: push %ebp #push old frame pointer onto the stack
080483bc: mov %esp,%ebp #move the frame pointer down, to the position of stack pointer
080483be: sub $0x18,%esp # ???
25 int x = 10;
080483c1: movl $0xa,-0x4(%ebp) #move the x (10) to 4 bytes below the frame pointer (why not push?)
26 int y = 15;
080483c8: movl $0xf,-0x8(%ebp) #move the y (15) to 8 bytes below the frame pointer (why not push?)
28 absdiff(x,y);
080483cf: mov -0x8(%ebp),%eax # -0x8(%ebp) == 15 = y, and move it into %eax
080483d2: mov %eax,0x4(%esp) # from this point on, I am confused
080483d6: mov -0x4(%ebp),%eax
080483d9: mov %eax,(%esp)
080483dc: call 0x8048394 <absdiff>
31 return EXIT_SUCCESS;
080483e1: mov $0x0,%eax
32 }
Basically, I am asking to help me to make sense of this assembly code, and why it is doing things in this particular order. Point where I am stuck, is shown in assembly comments. Thanks !
Lines 0x080483cf to 0x080483d9 are copying x and y from the current frame on the stack, and pushing them back onto the stack as arguments for absdiff() (this is typical; see e.g. http://en.wikipedia.org/wiki/X86_calling_conventions#cdecl). If you look at the disassembler for absdiff() (starting at 0x8048394), I bet you'll see it pick these values up from the stack and use them.
This might seem like a waste of cycles in this instance, but that's probably because you've compiled without optimisation, so the compiler does literally what you asked for. If you use e.g. -O2, you'll probably see most of this code disappear.
First it bears saying that this assembly is in the AT&T syntax version of x86_32, and that the order of arguments to operations is reversed from the Intel syntax (used with MASM, YASM, and many other assemblers and debuggers).
080483bb: push %ebp #push old frame pointer onto the stack
080483bc: mov %esp,%ebp #move the frame pointer down, to the position of stack pointer
080483be: sub $0x18,%esp # ???
This enters a stack frame. A frame is an area of memory between the stack pointer (esp) and the base pointer (ebp). This area is intended to be used for local variables that have to live on the stack. NOTE: Stack frames don't have to be implemented in this way, and GCC has the optimization switch -fomit-frame-pointer that does away with it except when alloca or variable sized arrays are used, because they are implemented by changing the stack pointer by arbitrary values. Not using ebp as the frame pointer allows it to be used as an extra general purpose register (more general purpose registers is usually good).
Using the base pointer makes several things simpler to calculate for compilers and debuggers, since where variables are located relative to the base does not change while in the function. You can also index them relative to the stack pointer and get the same results, though the stack pointer tends to move around, so the same location may require a different index at different times. For instance, right after this prologue esp == ebp - 0x18, so the slot at -0x4(%ebp) is the same byte as 0x14(%esp); the ebp-relative offset stays -0x4 for the whole function, while the esp-relative offset changes whenever esp does.
In this code 0x18 (or 24) bytes are being reserved on the stack for local use.
This code so far is often called the function prologue (not to be confused with the programming language "Prolog").
25 int x = 10;
080483c1: movl $0xa,-0x4(%ebp) #move the x (10) to 4 bytes below the frame pointer (why not push?)
This line moves the constant 10 (0xA) to a location within the current stack frame, relative to the base pointer. Because the local sits at a lower address than the base pointer and the stack grows downward in RAM, the index is negative rather than positive. If this were indexed relative to the stack pointer, a different index would be used, and it would be positive.
You are correct that this value could have been pushed rather than copied like this. I suspect that this is done this way because you have not compiled with optimizations turned on. By default gcc (which I assume you are using based on your use of gdb) does not optimize much, and so this code is probably the default "copy a constant to a location in the stack frame" code. This may not be the case, but it is one possible explanation.
26 int y = 15;
080483c8: movl $0xf,-0x8(%ebp) #move the y (15) to 8 bytes below the frame pointer (why not push?)
Similar to the previous line of code. These two lines of code put the 10 and 15 into local variables. They are on the stack (rather than in registers) because this is unoptimized code.
28 absdiff(x,y);
gdb printing this meant that this is the source code line being executed, not that this function is being executed (yet).
080483cf: mov -0x8(%ebp),%eax # -0x8(%ebp) == 15 = y, and move it into %eax
In preparation for calling the function, the values being passed as arguments need to be retrieved from their storage locations (even though they were just placed at those locations and their values are known; again, that's the lack of optimization).
080483d2: mov %eax,0x4(%esp) # from this point on, I am confused
This is the second part of moving one of the local variables' values to the stack so that it can be used as an argument to the function. You can't (usually) move from one memory address to another on x86, so you have to move it through a register (eax in this case).
080483d6: mov -0x4(%ebp),%eax
080483d9: mov %eax,(%esp)
These two lines do the same thing except for the other variable. Note that since this variable is being moved to the top of the stack that no offset is being used in the second instruction.
080483dc: call 0x8048394 <absdiff>
This pushes the return address onto the stack and jumps to the address of absdiff.
You didn't include code for absdiff, so you probably did not step through that.
31 return EXIT_SUCCESS;
080483e1: mov $0x0,%eax
C programs return 0 upon success, so EXIT_SUCCESS was defined as 0 by someone. Integer return values are put in eax, and some code that called the main function will use that value as the argument when calling the exit function.
32 }
This is the end. The reason that gdb stopped here is that there are things that actually happen to clean up. In C++ it is common to see destructors for local class instances being called here, but in C you will probably just see the function epilogue. This is the complement to the function prologue, and consists of returning the stack pointer and base pointer to the values that they originally had. Sometimes this is done with similar math on them, but sometimes it is done with the leave instruction. There is also an enter instruction which can be used for the prologue, but gcc doesn't do this (I don't know why). If you had continued to view the disassembly here you would have seen the epilogue code and a ret instruction.
Something you may be interested in is the ability to tell gcc to produce assembly files. If you do:
gcc -S source_file.c
a file named source_file.s will be produced with assembly code in it.
If you do:
gcc -S -O source_file.c
Then the same thing will happen, but some basic optimizations will be done. This will probably make reading the assembly code easier since the code will not likely have as many odd instructions that seem like they could have been done a better way (like moving constant values to the stack, then to a register, then to another location on the stack and never using the push instruction).
Your regular optimization flags for gcc are:
-O0 default -- none
-O1 a few optimizations
-O the same as -O1
-O2 a lot of optimizations
-O3 a bunch more, some of which may take a long time and/or make the code a lot bigger
-Os optimize for size -- similar to -O2, but not quite
If you are actually trying to debug C programs then you will probably want the least optimizations possible since things will happen in the order that they are written in your code and variables won't disappear.
You should have a look at the gcc man page:
man gcc
Remember, if you're running in a debugger or debug mode, the compiler reserves the right to insert whatever debugging code it likes and make other nonsensical code changes.
For example, this is Visual Studio's debug main():
int main(void) {
001F13D0 push ebp
001F13D1 mov ebp,esp
001F13D3 sub esp,0D8h
001F13D9 push ebx
001F13DA push esi
001F13DB push edi
001F13DC lea edi,[ebp-0D8h]
001F13E2 mov ecx,36h
001F13E7 mov eax,0CCCCCCCCh
001F13EC rep stos dword ptr es:[edi]
int x = 10;
001F13EE mov dword ptr [x],0Ah
int y = 15;
001F13F5 mov dword ptr [y],0Fh
absdiff(x,y);
001F13FC mov eax,dword ptr [y]
001F13FF push eax
001F1400 mov ecx,dword ptr [x]
001F1403 push ecx
001F1404 call absdiff (1F10A0h)
001F1409 add esp,8
*(int*)nullptr = 5;
001F140C mov dword ptr ds:[0],5
return 0;
001F1416 xor eax,eax
}
001F1418 pop edi
001F1419 pop esi
001F141A pop ebx
001F141B add esp,0D8h
001F1421 cmp ebp,esp
001F1423 call #ILT+300(__RTC_CheckEsp) (1F1131h)
001F1428 mov esp,ebp
001F142A pop ebp
001F142B ret
It helpfully posts the C++ source next to the corresponding assembly. In this case, you can fairly clearly see that x and y are stored on the stack explicitly, and an explicit copy is pushed on, then absdiff is called. I explicitly de-referenced nullptr to cause the debugger to break in. You may wish to change compiler.
Compile with -fverbose-asm -g -save-temps for additional information with GCC.