Fastest (CPU-wise) way to do functions in intel x64 assembly? - c

I've been reading about assembly functions and I'm confused as to whether to use the enter and leave instructions or just plain call/ret for fast execution. Is one way faster and the other smaller? For example, what is the fastest (stdcall) way to do this in assembly without inlining the function:
#include <stdint.h>

static int32_t Add(int32_t a, int32_t b) {
    return a + b;
}

int main() {
    int32_t i = Add(1, 3);
}

Use call / ret, without making a stack frame with either enter / leave or push&pop rbp / mov rbp, rsp. gcc (with the default -fomit-frame-pointer) only makes a stack frame in functions that do variable-size allocation on the stack. This may make debugging slightly more difficult, since gcc normally emits stack unwind info when compiling with -fomit-frame-pointer, but your hand-written asm won't have that. Normally it only makes sense to write leaf functions in asm, or at least ones that don't call many other functions.
Stack frames mean you don't have to keep track of how much the stack pointer has changed since function entry to access stuff on the stack (e.g. function args and spill slots for locals). Both the Windows and Linux/Unix 64-bit ABIs pass the first few args in registers, and there are often enough regs that you don't have to spill any variables to the stack. Stack frames are a waste of instructions in most cases. In 32-bit code, having ebp available (going from 6 to 7 GP regs, not counting the stack pointer) makes a bigger difference than going from 14 to 15 does in 64-bit code. You still have to push/pop rbp if you do use it, though, because in both ABIs it's a callee-saved register that functions aren't allowed to clobber.
If you're optimizing x86-64 asm, you should read Agner Fog's guides, and check out some of the other links in the x86 tag wiki.
The best implementation of your function is probably:
align 16
global Add
Add:
lea eax, [rdi + rsi]
ret
; the high 32 of either reg doesn't affect the low32 of the result
; so we don't need to zero-extend or use a 32bit address-size prefix
; like lea eax, [edi + esi]
; even if we're called with non-zeroed upper32 in rdi/rsi.
align 16
global main
main:
mov edi, 1 ; 1st arg in SysV ABI
mov esi, 3 ; 2nd arg in SysV ABI
call Add
; return value in eax in all ABIs
ret
align 16
OPmain: ; This is what you get if you don't return anything from main to use the result of Add
xor eax, eax
ret
This is in fact what gcc emits for Add(), but it still turns main into an empty function, or into a return 4 if you return i. clang 3.7 respects -fno-inline-functions even when the result is a compile-time constant. It beats my asm by doing tail-call optimization, and jmping to Add.
Note that the Windows 64bit ABI uses different registers for function args. See the links in the x86 tag wiki, or Agner Fog's ABI guide. Assembler macros may help for writing functions in asm that use the correct registers for their args, depending on the platform you're targeting.
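For example (my own sketch, not from the original answer): if you run your asm through the C preprocessor (a .S file), or use equivalent NASM %define macros, you can name the arg registers once per ABI. ARG0/ARG1 are made-up names for illustration:
#ifdef _WIN64
# define ARG0 rcx    /* 1st integer arg in the Windows x64 ABI */
# define ARG1 rdx    /* 2nd integer arg */
#else                /* x86-64 System V (Linux, macOS, BSD) */
# define ARG0 rdi
# define ARG1 rsi
#endif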

Related

How to set function arguments in assembly during runtime in a 64bit application on Windows?

I am trying to set arguments using assembly code that are used in a generic function. The arguments of this generic function - that is resident in a dll - are not known during compile time. During runtime the pointer to this function is determined using the GetProcAddress function. However its arguments are not known. During runtime I can determine the arguments - both value and type - using a datafile (not a header file or anything that can be included or compiled). I have found a good example of how to solve this problem for 32 bit (C Pass arguments as void-pointer-list to imported function from LoadLibrary()), but for 64 bit this example does not work, because you cannot fill the stack but you have to fill the registers. So I tried to use assembly code to fill the registers but until now no success. I use C-code to call the assembly code. I use VS2015 and MASM (64 bit). The C-code below works fine, but the assembly code does not. So what is wrong with the assembly code? Thanks in advance.
C code:
...
void fill_register_xmm0(double); // proto of assembly function
...
// code determining the pointer to a func returned by the GetProcAddress()
...
double dVal = 12.0;
int v;
fill_register_xmm0(dVal);
v = func->func_i(); // integer function that will use the dVal
...
assembly code in different .asm file (MASM syntax):
TITLE fill_register_xmm0
.code
option prologue:none ; turn off default prologue creation
option epilogue:none ; turn off default epilogue creation
fill_register_xmm0 PROC variable: REAL8 ; REAL8=equivalent to double or float64
movsd xmm0, variable ; fill value of variable into xmm0
ret
fill_register_xmm0 ENDP
option prologue:PrologueDef ; turn on default prologue creation
option epilogue:EpilogueDef ; turn on default epilogue creation
END
The x86-64 Windows calling convention is fairly simple, and makes it possible to write a wrapper function that doesn't know the types of anything. Just load the first 32 bytes of args into registers, and copy the rest to the stack.
You definitely need to make the function call from asm; it can't possibly work reliably to make a bunch of function calls like fill_register_xmm0 and hope that the compiler doesn't clobber any of those registers. The C compiler emits instructions that use the registers as part of its normal job, including passing args to functions like fill_register_xmm0.
The only alternative would be to write a C statement with a function call with all the args having the correct type, to get the compiler to emit code to make a function call normally. If there are only a few possible different combinations of args, putting those in if() blocks might be good.
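For instance, a minimal sketch of that if()-dispatch idea (the signature tags and function-pointer typedefs are hypothetical, not from the question): cast the GetProcAddress result to a prototype you know, and let the compiler emit the normal Windows x64 call sequence.
#include <windows.h>

typedef int (*fn_i_d)(double);     /* int func(double)   */
typedef int (*fn_i_ii)(int, int);  /* int func(int, int) */

int call_known_signature(FARPROC p, int sig, double d, int a, int b)
{
    if (sig == 0)
        return ((fn_i_d)p)(d);     /* compiler passes d in XMM0 */
    else if (sig == 1)
        return ((fn_i_ii)p)(a, b); /* compiler passes a, b in ECX, EDX */
    return -1;                     /* unknown signature */
}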
And BTW, movsd xmm0, variable probably assembles to movsd xmm0, xmm0, because the first function arg is passed in XMM0 if it's FP.
In C, prepare a buffer with the args (like in the 32-bit case).
Each one needs to be padded to 8 bytes if it's narrower. See MS's docs for x86-64 __fastcall. (Note that x86-64 __vectorcall passes __m128 args by value in registers, but for __fastcall it's strictly true that the args form an array of 8-byte values, after the register args. And storing those into the shadow space creates a full array of all the args.)
Any argument that doesn’t fit in 8 bytes, or is not 1, 2, 4, or 8 bytes, must be passed by reference. There is no attempt to spread a single argument across multiple registers.
But the key thing that makes variadic functions easy in the Windows calling convention also works here: The register used for the 2nd arg doesn't depend on the type of the first. i.e. if an FP arg is the first arg, then that uses up an integer register arg-passing slot. So you can only have up to 4 register args, not 4 integer and 4 FP.
If the 4th arg is integer, it goes in R9, even if it's the first integer arg. Unlike in the x86-64 System V calling convention, where the first integer arg goes in rdi, regardless of how many earlier FP args are in registers and/or on the stack.
So the asm wrapper that calls the function can load the first 8 bytes into both integer and FP registers! (Variadic functions already require this, so a callee doesn't have to know whether to store the integer or FP register to form that arg array. MS optimized the calling convention for simplicity of variadic callee functions at the expense of efficiency for functions with a mix of integer and FP args.)
The C side that puts all the args into a buffer can look like this:
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

int asmwrapper(const char *argbuf, size_t arg_bytes, void (*funcpointer)(...));

void somefunc() {
    alignas(16) uint64_t argbuf[256/8];  // or char argbuf[256]. Even without alignas, uint64_t still gives 8-byte alignment
    char *argp = (char*)argbuf;
    for ( ; argp < (char*)argbuf + 256 ; argp += 8) {
        if (figure_out_an_arg()) {
            int foo = get_int_arg();
            memcpy(argp, &foo, sizeof(foo));
        } else if (bar) {
            double foo = get_double_arg();
            memcpy(argp, &foo, sizeof(foo));
        } else {
            ... // memcpy whatever size
                // or allocate space to pass by ref and memcpy a pointer
        }
    }
    if (argp == (char*)argbuf + 256) {
        // error, ran out of space for args
    }

    asmwrapper((char*)argbuf, (size_t)(argp - (char*)argbuf), funcpointer);
}
Unfortunately I don't think we can directly use argbuf on the stack as the args + shadow space for a function call. We have no way of stopping the compiler from putting something valuable below argbuf which would let us just set rsp to the bottom of it (and save the return address somewhere, maybe at the top of argbuf by reserving some space for use by the asm).
Anyway, just copying the whole buffer will work. Or actually, load the first 32 bytes into registers (both integer and FP), and only copy the rest. The shadow space doesn't need to be initialized.
argbuf could be a VLA if you knew ahead of time how big it needed to be, but 256 bytes is pretty small. Reading past the end of it can't be a problem: it can't be at the end of a page with unmapped memory above it, because our parent function's stack frame definitely takes some space.
;; NASM syntax. For MASM just rename the local labels and add whatever PROC / ENDPROC is needed.
;; UNTESTED
;; rcx: argbuf
;; rdx: length in bytes of the args. 0..256, zero-extended to 64 bits
;; r8 : function pointer
;; reserve rdx bytes of space for arg passing
;; load first 32 bytes of argbuf into integer and FP arg-passing registers
;; copy the rest as stack-args above the shadow space
global asmwrapper
asmwrapper:
push rbp
mov rbp, rsp ; so we can efficiently restore the stack later
mov r10, r8 ; move function pointer to a volatile but non-arg-passing register
; load *both* xmm0-3 and rcx,rdx,r8,r9 from the first 32 bytes of argbuf
; regardless of types or whether there were that many arg bytes
; All bytes are loaded into registers early, some reg->reg transfers are done later
; when we're done with more registers.
; movsd xmm0, [rcx]
; movsd xmm1, [rcx+8]
movaps xmm0, [rcx] ; 16-byte alignment required for argbuf. Use movups to allow misalignment if you want
movhlps xmm1, xmm0 ; use some ALU instructions instead of just loads
; rcx,rdx can't be set yet, still in use for wrapper args
movaps xmm2, [rcx+16] ; it's ok to leave garbage in the high 64-bits of an XMM passing a float or double.
;movhlps xmm3, xmm2 ; the copyloop uses xmm3: do this later
movq r8, xmm2
mov r9, [rcx+24]
mov eax, 32
cmp edx, eax
jbe .small_args ; no copying needed, just shadow space
sub rsp, rdx
and rsp, -16 ; reserve extra space, realigning the stack by 16
; rax=32 on entry, start copying just above shadow space (which doesn't need to be copied)
.copyloop: ; do {
movaps xmm3, [rcx+rax]
movaps [rsp+rax], xmm3 ; indexed addressing modes aren't always optimal, but this loop only runs a couple times.
add eax, 16
cmp eax, edx
jb .copyloop ; } while(bytes_copied < arg_bytes);
.done_arg_copying:
; xmm0,xmm1 have the first 2 qwords of args
movq rcx, xmm0 ; RCX NO LONGER POINTS AT argbuf
movq rdx, xmm1
; xmm2 still has the 2nd 16 bytes of args
;movhlps xmm3, xmm2 ; don't use: false dependency on old value and we just used it.
pshufd xmm3, xmm2, 0xee ; xmm3 = high 64 bits of xmm2. (0xee = _MM_SHUFFLE(3,2,3,2))
; movq xmm3, r9 ; nah, can be multiple uops on AMD
; r8,r9 set earlier
call r10
leave ; restore RSP to its value on entry
ret
; could handle this branchlessly, but copy loop still needs to run zero times
; unless we bump up the min arg_bytes to 48 and sometimes copy an unnecessary 16 bytes
; As much work as possible is before the first branch, so it can happen while a mispredict recovers
.small_args:
sub rsp, rax ; reserve shadow space
;rsp still aligned by 16 after push rbp
jmp .done_arg_copying
;byte count. This wrapper is 82 bytes; would be nice to fit it in 80 so we don't waste 14 bytes before the next function.
;e.g. maybe mov rcx, [rcx] instead of movq rcx, xmm0
;mov eax, $-asmwrapper
align 16
This does assemble (on Godbolt with NASM), but I haven't tested it.
It should perform pretty well, but if you get mispredicts around the cutoff from <= 32 bytes to > 32 bytes, change the branching so it always copies an extra 16 bytes. (Uncomment the cmp/cmovb in the version on Godbolt, but the copy loop still needs to start at 32 bytes into each buffer.)
If you often pass very few args, the 16-byte loads might hit a store-forwarding stall from two narrow stores to one wide reload, causing about an extra 8 cycles of latency. This isn't normally a throughput problem, but it can increase the latency before the called function can access its args. If out-of-order execution can't hide that, then it's worth using more load uops to load each 8-byte arg separately. (Especially into integer registers, and then from there to XMM, if the args are mostly integer. That will have lower latency than mem -> xmm -> integer.)
If you have more than a couple args, though, hopefully the first few have committed to L1d and no longer need store forwarding by the time the asm wrapper runs. Or there's enough copying of later args that the first 2 args finish their load + ALU chain early enough not to delay the critical path inside the called function.
Of course, if performance was a huge issue, you'd write the code that figures out the args in asm so you didn't need this copy stuff, or use a library interface with a fixed function signature that a C compiler can call directly. I did try to make this suck as little as possible on modern Intel / AMD mainstream CPUs (http://agner.org/optimize/), but I didn't benchmark it or tune it, so probably it could be improved with some time spent profiling it, especially for some real use-case.
If you know that FP args aren't a possibility for the first 4, you can simplify by just loading integer regs.
So you need to call a function (in a DLL), but only at run time can you figure out the number and types of the parameters. Then you need to prepare the parameters, either on the stack or in registers, depending on the Application Binary Interface / calling convention.
I would use the following approach: some component of your program figures out the number and type of parameters. Let's assume it creates a list of {type, value}, {type, value}, ...
You then pass this list to a function to prepare the ABI call. This will be an assembler function. For a stack-based ABI (32 bit), it just pushes the parameters onto the stack. For a register-based ABI, it can prepare the register values and save them as local variables (sub sp, nnn to reserve space), and once all parameters have been prepared (possibly using registers needed for the call, hence saving those first), load the registers (a series of mov instructions) and perform the call instruction.
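A rough sketch of what that {type, value} list and the asm entry point could look like (every name here is invented for illustration):
#include <stddef.h>

enum arg_type { ARG_INT64, ARG_DOUBLE, ARG_PTR };

struct arg {
    enum arg_type type;
    union { long long i; double d; void *p; } value;
};

/* Implemented in asm: walks the list, fills registers / stack slots
   according to the target ABI, then executes the call instruction. */
extern long long do_abi_call(void *funcptr, const struct arg *args, size_t nargs);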

Is the stack frame required for all functions in C on x86-64?

I've made a function to calculate the length of a C string (I'm trying to beat clang's optimizer using -O3). I'm running macOS.
_string_length1:
push rbp
mov rbp, rsp
xor rax, rax
.body:
cmp byte [rdi], 0
je .exit
inc rdi
inc rax
jmp .body
.exit:
pop rbp
ret
This is the C function I'm trying to beat:
size_t string_length2(const char *str) {
    size_t ret = 0;
    while (str[ret]) {
        ret++;
    }
    return ret;
}
And it disassembles to this:
string_length2:
push rbp
mov rbp, rsp
mov rax, -1
LBB0_1:
cmp byte ptr [rdi + rax + 1], 0
lea rax, [rax + 1]
jne LBB0_1
pop rbp
ret
Every C function sets up a stack frame using push rbp and mov rbp, rsp, and breaks it using pop rbp. But I'm not using the stack in any way here, I'm only using processor registers. It worked without using a stack frame (when I tested on x86-64), but is it necessary?
No, the stack frame is, at least in theory, not always required. An optimizing compiler might in some cases avoid using the call stack. Notably when it is able to inline a called function (in some specific call site), or when the compiler successfully detects a tail call (which reuses the caller's frame).
Read the ABI of your platform to understand requirements related to the stack.
You might try to compile your program with link time optimization (e.g. compile and link with gcc -flto -O2) to get more optimizations.
In principle, one could imagine a compiler clever enough to (for some programs) avoid using any call stack.
BTW, I just compiled a naive recursive long fact(int n) factorial function with GCC 7.1 (on Debian/Sid/x86-64) at -O3 (i.e. gcc -fverbose-asm -S -O3 fact.c). The resulting assembler code fact.s contains no call machine instruction.
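For reference, the kind of naive recursive factorial being described looks like this (a sketch; the exact source isn't shown in the answer). With gcc -O3 the recursion gets turned into a loop, which is why fact.s contains no call:
long fact(int n)
{
    if (n <= 1)
        return 1;
    return n * fact(n - 1);   /* not a tail call, but gcc still removes the recursion at -O3 */
}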
Every C function sets up a stack frame using...
This is true for your compiler, not in general. It is possible to compile a C program without using the stack at all; see, for example, continuation-passing style (CPS). Probably no C compiler on the market does so, but it is important to know that there are other ways to execute programs besides stack evaluation.
The ISO 9899 standard says nothing about the stack. It leaves compiler implementations free to choose whichever method of evaluation they consider to be the best.
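A tiny illustration of the CPS idea (my own sketch, not from the answer): instead of returning a value, each function hands its result to an explicit continuation, so every call can be a tail call.
#include <stdio.h>

static void add_cps(int a, int b, void (*k)(int))
{
    k(a + b);              /* tail call: no frame needs to survive it */
}

static void print_result(int r)
{
    printf("%d\n", r);
}

int main(void)
{
    add_cps(1, 3, print_result);   /* prints 4 */
    return 0;
}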

Why is memcmp so much faster than a for loop check?

Why is memcmp(a, b, size) so much faster than:
for (i = 0; i < nelements; i++) {
    if (a[i] != b[i]) return 0;
}
return 1;
Is memcmp a CPU instruction or something? It must be pretty deep because I got a massive speedup using memcmp over the loop.
memcmp is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.
As a "builtin"
GCC supports memcmp (as well as a ton of other functions) as builtins. In some versions / configurations of GCC, a call to memcmp will be recognized as __builtin_memcmp. Instead of emitting a call to the memcmp library function, GCC will emit a handful of instructions to act as an optimized inline version of the function.
On x86, this leverages the use of the cmpsb instruction, which compares a string of bytes at one memory location to another. This is coupled with the repe prefix, so the strings are compared until they are no longer equal, or a count is exhausted. (Exactly what memcmp does).
Given the following code:
#include <string.h>

int test(const void* s1, const void* s2, int count)
{
    return memcmp(s1, s2, count) == 0;
}
gcc version 3.4.4 on Cygwin generates the following assembly:
; (prologue)
mov esi, [ebp+arg_0] ; Move first pointer to esi
mov edi, [ebp+arg_4] ; Move second pointer to edi
mov ecx, [ebp+arg_8] ; Move length to ecx
cld ; Clear DF, the direction flag, so comparisons happen
; at increasing addresses
cmp ecx, ecx ; Special case: If length parameter to memcmp is
; zero, don't compare any bytes.
repe cmpsb ; Compare bytes at DS:ESI and ES:EDI, setting flags
; Repeat this while equal ZF is set
setz al ; Set al (return value) to 1 if ZF is still set
; (all bytes were equal).
; (epilogue)
Reference:
cmpsb instruction
As a library function
Highly-optimized versions of memcmp exist in many C standard libraries. These will usually take advantage of architecture-specific instructions to work with lots of data in parallel.
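The basic idea, as a minimal sketch using SSE2 intrinsics (this is not glibc's actual code, and unlike the real memcmp it only answers equal / not-equal, for lengths that are a multiple of 16):
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

static int equal16(const void *a, const void *b, size_t n)
{
    const __m128i *pa = (const __m128i *)a;
    const __m128i *pb = (const __m128i *)b;
    for (size_t i = 0; i < n / 16; i++) {
        __m128i va = _mm_loadu_si128(pa + i);
        __m128i vb = _mm_loadu_si128(pb + i);
        __m128i eq = _mm_cmpeq_epi8(va, vb);   /* 0xFF in each byte position that matches */
        if (_mm_movemask_epi8(eq) != 0xFFFF)   /* did any of the 16 bytes differ? */
            return 0;
    }
    return 1;
}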
In Glibc, there are versions of memcmp for x86_64 that can take advantage of the following instruction set extensions:
SSE2 - sysdeps/x86_64/memcmp.S
SSE4 - sysdeps/x86_64/multiarch/memcmp-sse4.S
SSSE3 - sysdeps/x86_64/multiarch/memcmp-ssse3.S
The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has, and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S:
ENTRY(memcmp)
.type memcmp, #gnu_indirect_function
LOAD_RTLD_GLOBAL_RO_RDX
HAS_CPU_FEATURE (SSSE3)
jnz 2f
leaq __memcmp_sse2(%rip), %rax
ret
2: HAS_CPU_FEATURE (SSE4_1)
jz 3f
leaq __memcmp_sse4_1(%rip), %rax
ret
3: leaq __memcmp_ssse3(%rip), %rax
ret
END(memcmp)
In the Linux kernel
Linux does not seem to have an optimized version of memcmp for x86_64, but it does for memcpy, in arch/x86/lib/memcpy_64.S. Note that it uses the alternatives infrastructure (arch/x86/kernel/alternative.c) for not only deciding at runtime which version to use, but actually patching itself so that this decision is only made once at boot-up.
It's usually a compiler intrinsic that is translated into fast assembly with specialized instructions for comparing blocks of memory.
intrinsic memcmp
Is memcmp a CPU instruction or something?
It is at least a very highly optimized compiler-provided intrinsic function. Possibly a single machine instruction, or two, depending on the platform, which you haven't specified.

Inline assembly that clobbers the red zone

I'm writing a cryptography program, and the core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it extensively uses instructions like adc that are not easily accessible from C. I don't want to inline this function, because it's big and it's called several times in the inner loop.
Ideally I would also like to define a custom calling convention for this function, because internally it uses all the registers (except rsp), doesn't clobber its arguments, and returns in registers. Right now, it's adapted to the C calling convention, but of course this makes it slower (by about 10%).
To avoid this, I can call it with asm("call %Pn" : ... : my_function... : "cc", all the registers); but is there a way to tell GCC that the call instruction messes with the stack? Otherwise GCC will just put all those registers in the red zone, and the top one will get clobbered. I can compile the whole module with -mno-red-zone, but I'd prefer a way to tell GCC that, say, the top 8 bytes of the red zone will be clobbered so that it won't put anything there.
From your original question I did not realize gcc limited red-zone use to leaf functions. I don't think that's required by the x86_64 ABI, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function calling your assembly routine a non-leaf for purposes of compilation:
int global;
void other(void);

void was_leaf(void)
{
    if (global) other();
}
GCC can't tell whether global will be true, so it can't optimize away the call to other(), so was_leaf() is not a leaf function anymore. I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move %rsp, and with the modification shown it did.
I also tried simply allocating more than 128 bytes (just char buf[150]) in a leaf but I was shocked to see it only did a partial subtraction:
pushq %rbp
movq %rsp, %rbp
subq $40, %rsp
movb $7, -155(%rbp)
If I put the leaf-defeating code back in that becomes subq $160, %rsp
The max-performance way might be to write the whole inner loop in asm, including the call instructions, if it's really worth it to unroll but not inline. (Certainly plausible if fully inlining causes too many uop-cache misses elsewhere.)
Anyway, have C call an asm function containing your optimized loop.
BTW, clobbering all the registers makes it hard for gcc to make a very good loop, so you might well come out ahead from optimizing the whole loop yourself. (e.g. maybe keep a pointer in a register, and an end-pointer in memory, because cmp mem,reg is still fairly efficient).
Have a look at the code gcc/clang wrap around an asm statement that modifies an array element (on Godbolt):
void testloop(long *p, long count) {
    for (long i = 0 ; i < count ; i++) {
        asm(" # XXX asm operand in %0"
            : "+r" (p[i])
            :
            : // "rax",
              "rbx", "rcx", "rdx", "rdi", "rsi", "rbp",
              "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
        );
    }
}
#gcc7.2 -O3 -march=haswell
push registers and other function-intro stuff
lea rcx, [rdi+rsi*8] ; end-pointer
mov rax, rdi
mov QWORD PTR [rsp-8], rcx ; store the end-pointer
mov QWORD PTR [rsp-16], rdi ; and the start-pointer
.L6:
# rax holds the current-position pointer on loop entry
# also stored in [rsp-16]
mov rdx, QWORD PTR [rax]
mov rax, rdx # looks like a missed optimization vs. mov rax, [rax], because the asm clobbers rdx
XXX asm operand in rax
mov rbx, QWORD PTR [rsp-16] # reload the pointer
mov QWORD PTR [rbx], rax
mov rax, rbx # another weird missed-optimization (lea rax, [rbx+8])
add rax, 8
mov QWORD PTR [rsp-16], rax
cmp QWORD PTR [rsp-8], rax
jne .L6
# cleanup omitted.
clang counts a separate counter down towards zero. But it uses load / add -1 / store instead of a memory-destination add [mem], -1 / jnz.
You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.
Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c latency. (It's still 1 uop on AMD, but higher latency especially on Bulldozer-family). Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or it saves more spill/reloads than it costs.
But of course XMM is not very viable for loop counters (paddd/pcmpeq/pmovmskb/cmp/jcc or psubd/ptest/jcc are not great compared to sub [mem], 1 / jcc), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another paddq sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.
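As a sketch of what "scalar integer stuff in XMM registers" means in practice (my own example, assuming SSE2 on x86-64): each of these transfers compiles to a single movq between a GP register and an XMM register.
#include <emmintrin.h>
#include <stdint.h>

/* Park a 64-bit integer in an XMM register and get it back: movq xmm, r64 / movq r64, xmm.
   Whether spilling to XMM beats spilling to memory depends on the surrounding loop. */
static inline __m128i park_in_xmm(int64_t x)     { return _mm_cvtsi64_si128(x); }
static inline int64_t unpark_from_xmm(__m128i v) { return _mm_cvtsi128_si64(v); }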
If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use add $-128, %rsp ; call ; sub $-128, %rsp to preserve the red-zone in those versions. (Note that -128 is encodable as an imm8 but +128 isn't.)
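A hedged sketch of that wrapper idea in GNU C inline asm (cryptofunc and the abbreviated clobber list are placeholders; a real version must list every register the callee can modify, and PIC code would call cryptofunc@PLT):
void cryptofunc(long);   /* the out-of-line asm routine (placeholder) */

static inline void cryptofunc_outside_loop(long arg)
{
    register long a0 __asm__("rdi") = arg;
    __asm__ volatile(
        "add  $-128, %%rsp \n\t"   /* -128 fits in an imm8; the pushed return address lands below the red zone */
        "call cryptofunc   \n\t"
        "sub  $-128, %%rsp \n\t"   /* add the 128 back (+128 would not fit in an imm8) */
        : "+r"(a0)
        :
        : "rax", "rcx", "rdx", "rsi", "r8", "r9", "r10", "r11",
          "xmm0", "xmm1", "xmm2", "xmm3", "memory", "cc");
}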
Including an actual function call in your C function doesn't necessarily make it safe to assume the red-zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red-zone, so clobbering all the registers in an asm statement is quite likely to trigger that behaviour.
// a non-leaf function that still uses the red-zone with gcc
void cryptofunc(long);   // the out-of-line asm function (declaration inferred from the call below)

void bar(void) {
    //cryptofunc(1); // gcc/clang don't use the redzone after this (not future-proof)
    volatile int tmp = 1;
    (void)tmp;
    cryptofunc(1); // but gcc will use the redzone before a tailcall
}
# gcc7.2 -O3 output
mov edi, 1
mov DWORD PTR [rsp-12], 1
mov eax, DWORD PTR [rsp-12]
jmp cryptofunc(long)
If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space since they have to adjust the stack anyway (to align rsp before a call). This is not future-proof at all, but should happen to work.
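For example (a sketch of that compiler-specific trick, with hypothetical helper names):
void noinline_helper(void);   /* defined in another translation unit, so it can't be inlined away */

void caller(long *p, long n)
{
    noinline_helper();        /* non-leaf: current gcc/clang adjust rsp in this function anyway */
    for (long i = 0; i < n; i++) {
        __asm__("# asm that clobbers many registers, as in testloop() above"
                : "+r"(p[i]));
    }
}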
GNU C has an __attribute__((target("options"))) x86 function attribute, but it's not usable for arbitrary options, and -mno-red-zone is not one of the ones you can toggle on a per-function basis, or with #pragma GCC target ("options") within a compilation unit.
You can use stuff like
__attribute__(( target("sse4.1,arch=core2") ))
void penryn_version(void) {
...
}
but not __attribute__(( target("mno-red-zone") )).
There's a #pragma GCC optimize and an optimize function-attribute (both of which are not intended for production code), but #pragma GCC optimize ("-mno-red-zone") doesn't work either. I think the idea is to let some important functions be optimized with -O2 even in debug builds. You can set -f options or -O.
You could put the function in a file by itself and compile that compilation unit with -mno-red-zone, though. (And hopefully LTO will not break anything...)
Can't you just modify your assembly function to meet the requirements of a signal in the x86-64 ABI by shifting the stack pointer by 128 bytes on entry to your function?
Or if you are referring to the return pointer itself, put the shift into your call macro (so sub $128, %rsp; call ...)
Not sure but looking at GCC documentation for function attributes, I found the stdcall function attribute which might be of interest.
I'm still wondering what you find problematic with your asm call version. If it's just aesthetics, you could transform it into a macro, or an inline function.
What about creating a dummy function that is written in C and does nothing but call the inline assembly?

Alloca implementation

How does one implement alloca() using inline x86 assembler in languages like D, C, and C++? I want to create a slightly modified version of it, but first I need to know how the standard version is implemented. Reading the disassembly from compilers doesn't help because they perform so many optimizations, and I just want the canonical form.
Edit: I guess the hard part is that I want this to have normal function call syntax, i.e. using a naked function or something, make it look like the normal alloca().
Edit # 2: Ah, what the heck, you can assume that we're not omitting the frame pointer.
Implementing alloca actually requires compiler assistance. A few people here are saying it's as easy as:
sub esp, <size>
which is unfortunately only half of the picture. Yes, that would "allocate space on the stack", but there are a couple of gotchas:
1. If the compiler had emitted code which references other variables relative to esp instead of ebp (typical if you compile with no frame pointer), then those references need to be adjusted. Even with frame pointers, compilers do this sometimes.
2. More importantly, by definition, space allocated with alloca must be "freed" when the function exits.
The big one is point #2, because you need the compiler to emit code to symmetrically add <size> back to esp at every exit point of the function.
The most likely case is the compiler offers some intrinsics which allow library writers to ask the compiler for the help needed.
EDIT:
In fact, in glibc (GNU's implementation of libc), the implementation of alloca is simply this:
#ifdef __GNUC__
# define __alloca(size) __builtin_alloca (size)
#endif /* GCC. */
EDIT:
After thinking about it, the minimum I believe would be required is for the compiler to always use a frame pointer in any function which uses alloca, regardless of optimization settings. This would allow all locals to be referenced safely through ebp, and frame cleanup would be handled by restoring the frame pointer to esp.
EDIT:
So I did some experimenting with things like this:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define __alloca(p, N)              \
    do {                            \
        __asm__ __volatile__(       \
            "sub %1, %%esp \n"      \
            "mov %%esp, %0 \n"      \
            : "=m"(p)               \
            : "i"(N)                \
            : "esp");               \
    } while(0)

int func() {
    char *p;
    __alloca(p, 100);
    memset(p, 0, 100);
    strcpy(p, "hello world\n");
    printf("%s\n", p);
}

int main() {
    func();
}
which unfortunately does not work correctly. After analyzing the assembly output by gcc, it appears that optimizations get in the way. The problem seems to be that since the compiler's optimizer is entirely unaware of my inline assembly, it has a habit of doing things in an unexpected order and still referencing things via esp.
Here's the resultant ASM:
8048454: push ebp
8048455: mov ebp,esp
8048457: sub esp,0x28
804845a: sub esp,0x64 ; <- this and the line below are our "alloc"
804845d: mov DWORD PTR [ebp-0x4],esp
8048460: mov eax,DWORD PTR [ebp-0x4]
8048463: mov DWORD PTR [esp+0x8],0x64 ; <- whoops! compiler still referencing via esp
804846b: mov DWORD PTR [esp+0x4],0x0 ; <- whoops! compiler still referencing via esp
8048473: mov DWORD PTR [esp],eax ; <- whoops! compiler still referencing via esp
8048476: call 8048338 <memset#plt>
804847b: mov eax,DWORD PTR [ebp-0x4]
804847e: mov DWORD PTR [esp+0x8],0xd ; <- whoops! compiler still referencing via esp
8048486: mov DWORD PTR [esp+0x4],0x80485a8 ; <- whoops! compiler still referencing via esp
804848e: mov DWORD PTR [esp],eax ; <- whoops! compiler still referencing via esp
8048491: call 8048358 <memcpy#plt>
8048496: mov eax,DWORD PTR [ebp-0x4]
8048499: mov DWORD PTR [esp],eax ; <- whoops! compiler still referencing via esp
804849c: call 8048368 <puts#plt>
80484a1: leave
80484a2: ret
As you can see, it isn't so simple. Unfortunately, I stand by my original assertion that you need compiler assistance.
It would be tricky to do this - in fact, unless you have enough control over the compiler's code generation it cannot be done entirely safely. Your routine would have to manipulate the stack, such that when it returned everything was cleaned, but the stack pointer remained in such a position that the block of memory remained in that place.
The problem is that unless you can inform the compiler that the stack pointer has been modified across your function call, it may well decide that it can continue to refer to other locals (or whatever) through the stack pointer, but the offsets will be incorrect.
For the D programming language, the source code for alloca() comes with the download. How it works is fairly well commented. For dmd1, it's in /dmd/src/phobos/internal/alloca.d. For dmd2, it's in /dmd/src/druntime/src/compiler/dmd/alloca.d.
The C and C++ standards don't specify that alloca() has to use the stack, because alloca() isn't in the C or C++ standards (or POSIX, for that matter)¹.
A compiler may also implement alloca() using the heap. For example, the ARM RealView (RVCT) compiler's alloca() uses malloc() to allocate the buffer (referenced on their website here), and also causes the compiler to emit code that frees the buffer when the function returns. This doesn't require playing with the stack pointer, but still requires compiler support.
Microsoft Visual C++ has a _malloca() function that uses the heap if there isn't enough room on the stack, but it requires the caller to use _freea(), unlike _alloca(), which does not need/want explicit freeing.
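Usage looks like this (a sketch; _malloca and _freea are declared in MSVC's <malloc.h>):
#include <malloc.h>
#include <stddef.h>
#include <string.h>

void demo(size_t n)
{
    char *buf = _malloca(n);   /* on the stack if small enough, otherwise from the heap */
    if (buf) {
        memset(buf, 0, n);
        /* ... use buf ... */
        _freea(buf);           /* required: releases the buffer only if it came from the heap */
    }
}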
(With C++ destructors at your disposal, you can obviously do the cleanup without compiler support, but you can't declare local variables inside an arbitrary expression so I don't think you could write an alloca() macro that uses RAII. Then again, apparently you can't use alloca() in some expressions (like function parameters) anyway.)
¹ Yes, it's legal to write an alloca() that simply calls system("/usr/games/nethack").
Continuation Passing Style Alloca
Variable-Length Array in pure ISO C++. Proof-of-Concept implementation.
Usage
void foo(unsigned n)
{
    cps_alloca<Payload>(n,[](Payload *first,Payload *last)
    {
        fill(first,last,something);
    });
}
Core Idea
template<typename T,unsigned N,typename F>
auto cps_alloca_static(F &&f) -> decltype(f(nullptr,nullptr))
{
    T data[N];
    return f(&data[0],&data[0]+N);
}

template<typename T,typename F>
auto cps_alloca_dynamic(unsigned n,F &&f) -> decltype(f(nullptr,nullptr))
{
    vector<T> data(n);
    return f(&data[0],&data[0]+n);
}

template<typename T,typename F>
auto cps_alloca(unsigned n,F &&f) -> decltype(f(nullptr,nullptr))
{
    switch(n)
    {
        case 1: return cps_alloca_static<T,1>(f);
        case 2: return cps_alloca_static<T,2>(f);
        case 3: return cps_alloca_static<T,3>(f);
        case 4: return cps_alloca_static<T,4>(f);
        case 0: return f(nullptr,nullptr);
        default: return cps_alloca_dynamic<T>(n,f);
    } // mpl::for_each / array / index pack / recursive bsearch / etc variation
}
LIVE DEMO
cps_alloca on github
alloca is directly implemented in assembly code.
That's because you cannot control stack layout directly from high level languages.
Also note that most implementations will perform some additional optimization, like aligning the stack for performance reasons.
The standard way of allocating stack space on X86 looks like this:
sub esp, XXX
where XXX is the number of bytes to allocate.
Edit:
If you want to look at the implementation (and you're using MSVC) see alloca16.asm and chkstk.asm.
The code in the first file basically aligns the desired allocation size to a 16-byte boundary. The code in the 2nd file actually walks all pages which would belong to the new stack area and touches them. This will possibly trigger PAGE_GUARD exceptions, which are used by the OS to grow the stack.
You can examine the sources of an open-source C compiler, like Open Watcom, and find out for yourself.
If you can't use c99's Variable Length Arrays, you can use a compound literal cast to a void pointer.
#define ALLOCA(sz) ((void*)((char[sz]){0}))
This also works for -ansi (as a gcc extension) and even when it is a function argument;
some_func(&useful_return, ALLOCA(sizeof(struct useless_return)));
The downside is that when compiled as C++, g++ > 4.6 will give you an error: taking address of temporary array ... clang and icc don't complain, though.
Alloca is easy: you just move the stack pointer, then generate all the reads/writes to point into this new block
sub esp, 4
What we want to do is something like this:
void* alloca(size_t size) {
    <sp> -= size;
    return <sp>;
}
In Assembly (Visual Studio 2017, 64bit) it looks like:
;alloca.asm
_TEXT SEGMENT
PUBLIC alloca
alloca PROC
sub rsp, rcx ;<sp> -= size
mov rax, rsp ;return <sp>;
ret
alloca ENDP
_TEXT ENDS
END
Unfortunately our return address is the last item on the stack, and we do not want to overwrite it. Additionally, we need to take care of the alignment, i.e. round the size up to a multiple of 8. So we have to do this:
;alloca.asm
_TEXT SEGMENT
PUBLIC alloca
alloca PROC
;round up to multiple of 8
mov rax, rcx
mov rbx, 8
xor rdx, rdx
div rbx
sub rbx, rdx
mov rax, rbx
mov rbx, 8
xor rdx, rdx
div rbx
add rcx, rdx
;increase stack pointer
pop rbx
sub rsp, rcx
mov rax, rsp
push rbx
ret
alloca ENDP
_TEXT ENDS
END
my_alloca: ; void *my_alloca(int size);
MOV EAX, [ESP+4] ; get size
ADD EAX,-4 ; include return address as stack space(4bytes)
SUB ESP,EAX
JMP DWORD [ESP+EAX] ; replace RET(do not pop return address)
I recommend the "enter" instruction. Available on 286 and newer processors (may have been available on the 186 as well, I can't remember offhand, but those weren't widely available anyways).
