I'm writing a cryptography program, and the core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it extensively uses instructions like adc that are not easily accessible from C. I don't want to inline this function, because it's big and it's called several times in the inner loop.
Ideally I would also like to define a custom calling convention for this function, because internally it uses all the registers (except rsp), doesn't clobber its arguments, and returns in registers. Right now, it's adapted to the C calling convention, but of course this makes it slower (by about 10%).
To avoid this, I can call it with asm("call %Pn" : ... : my_function... : "cc", all the registers); but is there a way to tell GCC that the call instruction messes with the stack? Otherwise GCC will just put all those registers in the red zone, and the top one will get clobbered. I can compile the whole module with -mno-red-zone, but I'd prefer a way to tell GCC that, say, the top 8 bytes of the red zone will be clobbered so that it won't put anything there.
From your original question I did not realize gcc limited red-zone use to leaf functions. I don't think that's required by the x86_64 ABI, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function calling your assembly routine a non-leaf for purposes of compilation:
int global;
void other(void);            /* defined elsewhere, so the call can't be optimized away */

void was_leaf(void)
{
    if (global) other();
}
GCC can't tell whether global will be true, so it can't optimize away the call to other(), which means was_leaf() is no longer a leaf function. I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move %rsp, and with the modification shown it did.
I also tried simply allocating more than 128 bytes (just char buf[150]) in a leaf function, but I was shocked to see it only did a partial subtraction:
pushq %rbp
movq %rsp, %rbp
subq $40, %rsp
movb $7, -155(%rbp)
If I put the leaf-defeating code back in, that becomes subq $160, %rsp.
The max-performance way might be to write the whole inner loop in asm, including the call instructions if it's really worth it to unroll but not inline. (Certainly plausible if fully inlining causes too many uop-cache misses elsewhere.)
Anyway, have C call an asm function containing your optimized loop.
BTW, clobbering all the registers makes it hard for gcc to make a very good loop, so you might well come out ahead from optimizing the whole loop yourself. (e.g. maybe keep a pointer in a register, and an end-pointer in memory, because cmp mem,reg is still fairly efficient).
Have a look at the code gcc/clang wrap around an asm statement that modifies an array element (on Godbolt):
void testloop(long *p, long count) {
    for (long i = 0 ; i < count ; i++) {
        asm(" # XXX asm operand in %0"
            : "+r" (p[i])
            :
            : // "rax",
              "rbx", "rcx", "rdx", "rdi", "rsi", "rbp",
              "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
        );
    }
}
# gcc7.2 -O3 -march=haswell
# ... push registers and other function-intro stuff ...
lea rcx, [rdi+rsi*8] ; end-pointer
mov rax, rdi
mov QWORD PTR [rsp-8], rcx ; store the end-pointer
mov QWORD PTR [rsp-16], rdi ; and the start-pointer
.L6:
# rax holds the current-position pointer on loop entry
# also stored in [rsp-16]
mov rdx, QWORD PTR [rax]
mov rax, rdx # looks like a missed optimization vs. mov rax, [rax], because the asm clobbers rdx
XXX asm operand in rax
mov rbx, QWORD PTR [rsp-16] # reload the pointer
mov QWORD PTR [rbx], rax
mov rax, rbx # another weird missed-optimization (lea rax, [rbx+8])
add rax, 8
mov QWORD PTR [rsp-16], rax
cmp QWORD PTR [rsp-8], rax
jne .L6
# cleanup omitted.
clang counts a separate counter down towards zero. But it uses load / add -1 / store instead of a memory-destination add [mem], -1 / jnz.
You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.
Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c latency. (It's still 1 uop on AMD, but higher latency especially on Bulldozer-family). Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or it saves more spill/reloads than it costs.
But of course XMM is not very viable for loop counters (paddd/pcmpeq/pmovmskb/cmp/jcc or psubd/ptest/jcc are not great compared to sub [mem], 1 / jcc), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another paddq sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.
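If you do try it, a minimal sketch (not from the original answer) of the GP<->XMM round trip, assuming xmm15 is free to use as scratch:

static inline long stash_in_xmm(long v) {
    long out;
    asm("movq %1, %%xmm15 \n\t"   /* GP -> XMM: 1 ALU uop, ~1c latency on Intel */
        "movq %%xmm15, %0 \n\t"   /* XMM -> GP: the "reload" */
        : "=r"(out)
        : "r"(v)
        : "xmm15");
    return out;
}

In a real loop the two movq halves would be separated by the code that needed the register; this just shows the mechanics and cost of the moves.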
If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use add $-128, %rsp ; call ; sub $-128, %rsp to preserve the red-zone in those versions. (Note that -128 is encodeable as an imm8 but +128 isn't.)
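A sketch of what that wrapper could look like in GNU C (my_function is a stand-in name, and the clobber list is an assumption; adjust it to what your routine actually touches):

static inline void call_crypto_outside_loop(void) {
    asm volatile("add $-128, %%rsp \n\t"   /* skip the red zone; -128 fits in an imm8 */
                 "call my_function \n\t"
                 "sub $-128, %%rsp \n\t"   /* restore rsp; +128 would need an imm32 */
                 : /* outputs per your custom convention */
                 : /* inputs per your custom convention */
                 : "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "r8", "r9",
                   "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
}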
Including an actual function call in your C function doesn't necessarily make it safe to assume the red-zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red-zone, so clobbering all the registers in an asm statement is quite likely to trigger that behaviour.
// a non-leaf function that still uses the red-zone with gcc
void bar(void) {
//cryptofunc(1); // gcc/clang don't use the redzone after this (not future-proof)
volatile int tmp = 1;
(void)tmp;
cryptofunc(1); // but gcc will use the redzone before a tailcall
}
# gcc7.2 -O3 output
mov edi, 1
mov DWORD PTR [rsp-12], 1
mov eax, DWORD PTR [rsp-12]
jmp cryptofunc(long)
If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space since they have to adjust the stack anyway (to align rsp before a call). This is not future-proof at all, but should happen to work.
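A hedged sketch of that (ext_func is an assumed external function the compiler can't inline):

void ext_func(void);     /* defined in another translation unit, so it can't inline */

void hot_caller(void) {
    ext_func();          /* forces gcc/clang to adjust rsp in this function, so
                            later spills go into reserved stack, not the red zone */
    /* ... hot loop containing the asm("call ...") statement goes here ... */
}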
GNU C has an __attribute__((target("options"))) x86 function attribute, but it's not usable for arbitrary options, and -mno-red-zone is not one of the ones you can toggle on a per-function basis, or with #pragma GCC target ("options") within a compilation unit.
You can use stuff like
__attribute__(( target("sse4.1,arch=core2") ))
void penryn_version(void) {
...
}
but not __attribute__(( target("mno-red-zone") )).
There's a #pragma GCC optimize and an optimize function-attribute (both of which are not intended for production code), but #pragma GCC optimize ("-mno-red-zone") doesn't work either. I think the idea is to let some important functions be optimized with -O2 even in debug builds; they only accept -f options and -O levels.
You could put the function in a file by itself and compile that compilation unit with -mno-red-zone, though. (And hopefully LTO will not break anything...)
Can't you just modify your assembly function to meet the x86-64 ABI's red-zone requirements by shifting the stack pointer down 128 bytes on entry to your function?
Or, if you are referring to the return address itself, put the shift into your call macro (so sub %rsp; call...).
I'm not sure, but looking at the GCC documentation for function attributes, I found the stdcall function attribute, which might be of interest.
I'm still wondering what you find problematic with your asm call version. If it's just aesthetics, you could transform it into a macro or an inline function.
What about creating a dummy function that is written in C and does nothing but call the inline assembly?
Related
I wrote a simple multiplication function in C, and another in assembly code, using GCC's "asm" keyword.
I measured the execution time of each, and although their times are pretty close, the C function is a little faster than the one in assembly code.
I would like to know why, since I expected the asm one to be faster. Is it because of the extra "call" (I don't know what word to use) to GCC's "asm" keyword?
Here is the C function:
int multiply (int a, int b){return a*b;}
And here is the asm one in the C file:
int asmMultiply(int a, int b){
asm ("imull %1,%0;"
: "+r" (a)
: "r" (b)
);
return a;
}
And here is my main(), where I measure the times:
#include <stdio.h>
#include <time.h>

int main(){
    int n = 50000;
    clock_t asmClock = clock();
    while(n>0){
        asmMultiply(4,5);
        n--;
    }
    asmClock = clock() - asmClock;
    double asmTime = ((double)asmClock)/CLOCKS_PER_SEC;

    clock_t cClock = clock();
    n = 50000;
    while(n>0){
        multiply(4,5);
        n--;
    }
    cClock = clock() - cClock;
    double cTime = ((double)cClock)/CLOCKS_PER_SEC;

    printf("Asm time: %f\n",asmTime);
    printf("C code time: %f\n",cTime);
    return 0;
}
Thanks!
The assembly function is doing more work than the C function: it initializes mult, then does the multiplication and assigns the result to mult, and then moves the value from mult into the return location.
Compilers are good at optimizing; you won't easily beat them on basic arithmetic.
If you really want improvement, use static inline int multiply(int a, int b) { return a * b; }. Or just write a * b (or the equivalent) in the calling code instead of int x = multiply(a, b);.
This attempt to microbenchmark is too naive in almost every way possible for you to get any meaningful results.
Even if you fixed the surface problems (so the code didn't optimize away), there are major deep problems before you can conclude anything about when your asm would be better than the C * operator.
(Hint: probably never. Compilers already know how to optimally multiply integers, and understand the semantics of that operation. Forcing it to use imul instead of auto-vectorizing or doing other optimizations is going to be a loss.)
Both timed regions are empty because both multiplies can optimize away. (The asm is not asm volatile, and you don't use the result.) You're only measuring noise and/or CPU frequency ramp-up to max turbo before the clock() overhead.
And even if they weren't, a single imul instruction is basically unmeasurable against a function call with as much overhead as clock(). Maybe if you serialized with lfence to force the CPU to wait for imul to retire before rdtsc... See RDTSCP in NASM always returns the same value
Or you compiled with optimization disabled, which is pointless.
You basically can't measure a C * operator vs. inline asm without some kind of context involving a loop. And then it will be for that context, dependent on what optimizations you defeated by using inline asm. (And what if anything you did to stop the compiler from optimizing away work for the pure C version.)
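One common way to keep the work alive on the C side is an empty inline-asm "escape" that forces the value into a register (a sketch, not from the original code; constant-propagation can still compute 20 at compile time, as discussed below):

static inline void escape(int v) {
    asm volatile("" : : "r"(v));   /* compiler must materialize v in a register */
}

/* in the timed loops:  escape(multiply(4,5));  escape(asmMultiply(4,5)); */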
Measuring only one number for a single x86 instruction doesn't tell you much about it. You need to measure latency, throughput, and front-end uop cost to properly characterize its cost. Modern x86 CPUs are superscalar out-of-order pipelined, so the sum of costs for 2 instructions depends on whether they're dependent on each other, and other surrounding context. How many CPU cycles are needed for each assembly instruction?
The stand-alone definitions of the functions are identical, after your change to let the compiler pick registers, and your asm could inline somewhat efficiently, but it's still optimization-defeating. gcc knows that 5*4 = 20 at compile time, so if you did use the result multiply(4,5) could optimize to an immediate 20. But gcc doesn't know what the asm does, so it just has to feed it the inputs at least once. (non-volatile means it can CSE the result if you used asmMultiply(4,5) in a loop, though.)
So among other things, inline asm defeats constant propagation. This matters even if only one of the inputs is a constant, and the other is a runtime variable. Many small integer multipliers can be implemented with one or 2 LEA instructions or a shift (with lower latency than the 3c for imul on modern x86).
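For example, compilers already turn small constant multipliers into LEA:

int times5(int x) { return x * 5; }   /* gcc/clang -O2: lea eax, [rdi + rdi*4] */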
https://gcc.gnu.org/wiki/DontUseInlineAsm
The only use-case I could imagine asm helping is if a compiler used 2x LEA instructions in a situation that's actually front-end bound, where imul $constant, %[src], %[dst] would let it copy-and-multiply with 1 uop instead of 2. But your asm removes the possibility of using immediates (you only allowed register constraints), and GNU C inline asm doesn't let you use a different template for an immediate vs. a register arg. Maybe if you used multi-alternative constraints and a matching register constraint for the register-only part? But no, you'd still have to have something like asm("%2, %1, %0" :...), and that can't work for reg,reg.
You could use if(__builtin_constant_p(a)) { asm using imul-immediate } else { return a*b; }, which would work with GCC to let you defeat LEA. Or just require a constant multiplier anyway, since you'd only ever want to use this for a specific gcc version to work around a specific missed-optimization. (i.e. it's so niche that in practice you wouldn't ever do this.)
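A sketch of that __builtin_constant_p idea (GCC only; the "i" constraint is only safe at -O1 or higher, where the dead branch is discarded before constraints are checked):

static inline int mul_maybe_imm(int a, int b) {
    if (__builtin_constant_p(b)) {
        int dst;
        asm("imull %2, %1, %0"          /* AT&T three-operand form: imul $imm, src, dst */
            : "=r"(dst)
            : "r"(a), "i"(b));
        return dst;
    }
    return a * b;
}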
Your code on the Godbolt compiler explorer, with clang7.0 -O3 for the x86-64 System V calling convention:
# clang7.0 -O3 (The functions both inline and optimize away)
main: # #main
push rbx
sub rsp, 16
call clock
mov rbx, rax # save the return value
call clock
sub rax, rbx # end - start time
cvtsi2sd xmm0, rax
divsd xmm0, qword ptr [rip + .LCPI2_0]
movsd qword ptr [rsp + 8], xmm0 # 8-byte Spill
call clock
mov rbx, rax
call clock
sub rax, rbx # same block again for the 2nd group.
xorps xmm0, xmm0
cvtsi2sd xmm0, rax
divsd xmm0, qword ptr [rip + .LCPI2_0]
movsd qword ptr [rsp], xmm0 # 8-byte Spill
mov edi, offset .L.str
mov al, 1
movsd xmm0, qword ptr [rsp + 8] # 8-byte Reload
call printf
mov edi, offset .L.str.1
mov al, 1
movsd xmm0, qword ptr [rsp] # 8-byte Reload
call printf
xor eax, eax
add rsp, 16
pop rbx
ret
TL;DR: if you want to understand inline asm performance at this fine-grained level of detail, you need to understand how compilers optimize in the first place.
How to remove "noise" from GCC/clang assembly output?
C++ code for testing the Collatz conjecture faster than hand-written assembly - why?
Modern x86 cost model
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
I'm trying to call a golang function from my C code. Golang does not use the standard x86_64 calling convention, so I have to resort to implementing the transition myself. As gcc does not want to mix cdecl with the x86_64 convention,
I'm trying to call the function using inline assembly:
void go_func(struct go_String filename, void* key, int error){
void* f_address = (void*)SAVEECDSA;
asm volatile(" sub rsp, 0xe0; \t\n\
mov [rsp+0xe0], rbp; \t\n\
mov [rsp], %0; \t\n\
mov [rsp+0x8], %1; \t\n\
mov [rsp+0x18], %2; \t\n\
call %3; \t\n\
mov rbp, [rsp+0xe0]; \t\n\
add rsp, 0xe0;"
:
: "g"(filename.str), "g"(filename.len), "g"(key), "g"(f_address)
: );
return;
}
Sadly, the compiler always throws an error at me that I don't understand:
./code.c:241: Error: too many memory references for `mov'
This corresponds to this line: mov [rsp+0x18], %2; \t\n\
If I delete it, the compilation works. I don't understand what my mistake is...
I'm compiling with the -masm=intel flag so I use Intel syntax. Can someone please help me?
A "g" constraint allows the compiler to pick memory or register, so obviously you'll end up with mov mem,mem if that happens. mov can have at most 1 memory operand. (Like all x86 instructions, at most one explicit memory operand is possible.)
Use "ri" constraints for the inputs that will be moved to a memory destination, to allow register or immediate but not memory.
Also, you're modifying RSP, so you can't safely let the compiler pick memory source operands: it may choose an rsp-relative addressing mode like [rsp+16] or [rsp-4], which would point to the wrong place after you move rsp. For the same reason, you can't use push instead of mov.
You also need to declare clobbers on all the call-clobbered registers, because the function call will do that. (Or better, maybe ask for the inputs in those call-clobbered registers so the compiler doesn't have to bounce them through call-preserved regs like RBX. But you need to make those operands read/write or declare separate output operands for the same registers to let the compiler know they'll be modified.)
So probably your best bet for efficiency is something like
int ecx, edx, edi, esi; // dummy outputs as clobbers
register int r8 asm("r8d"); // for all the call-clobbered regs in the calling convention
register int r9 asm("r9d");
register int r10 asm("r10d");
register int r11 asm("r11d");
// These are the regs for x86-64 System V.
// **I don't know what Go actually clobbers.**
asm("sub rsp, 0xe0\n\t" // adjust as necessary to align the stack before a call
// "push args in reverse order"
"push %[fn_len] \n\t"
"push %[fn_str] \n\t"
"call \n\t"
"add rsp, 0xe0 + 3*8 \n\t" // pop red-zone skip space + pushed args
// real output in RAX, and dummy outputs in call-clobbered regs
: "=a"(retval), "=c"(ecx), "=d"(edx), "=D"(edi), "=S"(esi), "=r"(r8), "=r"(r9), "=r"(r10), "=r"(r11)
: [fn_str] "ri" (filename.str), [fn_len] "ri" (filename.len), [func] "r" (f_address), etc. // inputs can use the same regs as dummy outputs
: "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7", // All vector regs are call-clobbered
"xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15",
"memory" // if you're passing any pointers (even read-only), or the function accesses any globals,
// best to make this a compiler memory barrier
);
Notice that the outputs are not early-clobber, so the compiler can (at its option) use those registers for inputs; but we're not forcing that, so the compiler is still free to use some other register or an immediate.
Upon further discussion, Go functions don't clobber RBP, so there's no reason to save/restore it manually. The only reason you might have wanted to is that locals might use RBP-relative addressing modes, and older GCC made it an error to declare a clobber on RBP when compiling without -fomit-frame-pointer. (I think. Or maybe I'm thinking of EBX in 32-bit PIC code.)
Also, if you're using the x86-64 System V ABI, beware that inline asm must not clobber the red zone. The compiler assumes that doesn't happen, and there's no way to declare a clobber on the red zone or even set -mno-red-zone on a per-function basis. So you probably need to sub rsp, 128 + 0xe0. Or 0xe0 already includes enough space to skip the red zone, if that's not part of the callee's args.
The original poster added this solution as an edit to their question:
If someone ever finds this: the accepted answer does not help you when you try to call golang code with inline asm! The accepted answer only helped with my initial problem, which let me fix the golang call. Use something like this:
void* __cdecl go_call(void* func, __int64 p1, __int64 p2, __int64 p3, __int64 p4){
void* ret;
asm volatile(" sub rsp, 0x28; \t\n\
mov [rsp], %[p1]; \t\n\
mov [rsp+0x8], %[p2]; \t\n\
mov [rsp+0x10], %[p3]; \t\n\
mov [rsp+0x18], %[p4]; \t\n\
call %[func_addr]; \t\n\
mov %[ret_val], [rsp+0x20]; \t\n\
add rsp, 0x28; "
: [ret_val] "=r"(ret) /* assumption: Go's stack-based ABI0 places results right after the args */
: [p1] "ri"(p1), [p2] "ri"(p2),
[p3] "ri"(p3), [p4] "ri"(p4), [func_addr] "ri"(func)
: );
return ret;
}
I am trying to use assembly code to set arguments that are used by a generic function. The arguments of this generic function - which is resident in a DLL - are not known at compile time. At runtime the pointer to this function is determined using the GetProcAddress function, but its arguments are not known. At runtime I can determine the arguments - both value and type - using a datafile (not a header file or anything that can be included or compiled).

I have found a good example of how to solve this problem for 32 bit (C Pass arguments as void-pointer-list to imported function from LoadLibrary()), but for 64 bit this example does not work, because you cannot just fill the stack: you have to fill the registers. So I tried to use assembly code to fill the registers, but so far without success. I use C code to call the assembly code, with VS2015 and MASM (64 bit). The C code below works fine, but the assembly code does not. So what is wrong with the assembly code? Thanks in advance.
C code:
...
void fill_register_xmm0(double); // proto of assembly function
...
// code determining the pointer to a func returned by the GetProcAddress()
...
double dVal = 12.0;
int v;
fill_register_xmm0(dVal);
v = func->func_i(); // integer function that will use the dVal
...
assembly code in different .asm file (MASM syntax):
TITLE fill_register_xmm0
.code
option prologue:none ; turn off default prologue creation
option epilogue:none ; turn off default epilogue creation
fill_register_xmm0 PROC variable: REAL8 ; REAL8=equivalent to double or float64
movsd xmm0, variable ; fill value of variable into xmm0
ret
fill_register_xmm0 ENDP
option prologue:PrologueDef ; turn on default prologue creation
option epilogue:EpilogueDef ; turn on default epilogue creation
END
The x86-64 Windows calling convention is fairly simple, and makes it possible to write a wrapper function that doesn't know the types of anything. Just load the first 32 bytes of args into registers, and copy the rest to the stack.
You definitely need to make the function call from asm; it can't possibly work reliably to make a bunch of function calls like fill_register_xmm0 and hope the compiler doesn't clobber any of those registers in between. The C compiler emits instructions that use the registers as part of its normal job, including passing args to functions like fill_register_xmm0.
The only alternative would be to write a C statement with a function call with all the args having the correct type, to get the compiler to emit code to make a function call normally. If there are only a few possible different combinations of args, putting those in if() blocks might be good.
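A hedged sketch of that dispatch, with a couple of hypothetical signatures:

typedef int (*fn_int_void)(void);
typedef int (*fn_int_dbl)(double);

int call_known(void *fn, int sig, double d) {
    if (sig == 0)
        return ((fn_int_void)fn)();    /* compiler emits the normal call sequence */
    else
        return ((fn_int_dbl)fn)(d);    /* d goes in XMM0 per the Windows x64 ABI */
}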
And BTW, movsd xmm0, variable probably assembles to movsd xmm0, xmm0, because the first function arg is passed in XMM0 if it's FP.
In C, prepare a buffer with the args (like in the 32-bit case).
Each one needs to be padded to 8 bytes if it's narrower. See MS's docs for x86-64 __fastcall. (Note that x86-64 __vectorcall passes __m128 args by value in registers, but for __fastcall it's strictly true that the args form an array of 8-byte values, after the register args. And storing those into the shadow space creates a full array of all the args.)
Any argument that doesn’t fit in 8 bytes, or is not 1, 2, 4, or 8 bytes, must be passed by reference. There is no attempt to spread a single argument across multiple registers.
But the key thing that makes variadic functions easy in the Windows calling convention also works here: The register used for the 2nd arg doesn't depend on the type of the first. i.e. if an FP arg is the first arg, then that uses up an integer register arg-passing slot. So you can only have up to 4 register args, not 4 integer and 4 FP.
If the 4th arg is integer, it goes in R9, even if it's the first integer arg. Unlike in the x86-64 System V calling convention, where the first integer arg goes in rdi, regardless of how many earlier FP args are in registers and/or on the stack.
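To make the slot assignment concrete, for a hypothetical signature:

void f(double a, int b, double c, int d);
/* Windows x64:  a -> XMM0 (the RCX slot is burned)
                 b -> EDX  (the XMM1 slot is burned)
                 c -> XMM2 (the R8  slot is burned)
                 d -> R9D  (the XMM3 slot is burned)  */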
So the asm wrapper that calls the function can load the first 8 bytes into both integer and FP registers! (Variadic functions already require this, so a callee doesn't have to know whether to store the integer or FP register to form that arg array. MS optimized the calling convention for simplicity of variadic callee functions at the expense of efficiency for functions with a mix of integer and FP args.)
The C side that puts all the args into a buffer can look like this:
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

int asmwrapper(const char *argbuf, size_t arg_bytes, void (*funcpointer)());
void somefunc() {
    // or char argbuf[256]; uint64_t still gives 8-byte alignment even without alignas
    alignas(16) uint64_t argbuf[256/8];
    char *argp = (char*)argbuf;
    char *argend = (char*)argbuf + sizeof(argbuf);

    for ( ; argp < argend ; argp += 8) {
        if (figure_out_an_arg()) {
            int foo = get_int_arg();
            memcpy(argp, &foo, sizeof(foo));
        } else if (bar) {
            double foo = get_double_arg();
            memcpy(argp, &foo, sizeof(foo));
        } else {
            // ... memcpy whatever size,
            // or allocate space to pass by ref and memcpy a pointer
        }
    }
    if (argp == argend) {
        // error, ran out of space for args
    }

    asmwrapper((char*)argbuf, argp - (char*)argbuf, funcpointer);
}
Unfortunately I don't think we can directly use argbuf on the stack as the args + shadow space for a function call. We have no way of stopping the compiler from putting something valuable below argbuf which would let us just set rsp to the bottom of it (and save the return address somewhere, maybe at the top of argbuf by reserving some space for use by the asm).
Anyway, just copying the whole buffer will work. Or actually, load the first 32 bytes into registers (both integer and FP), and only copy the rest. The shadow space doesn't need to be initialized.
argbuf could be a VLA if you knew ahead of time how big it needed to be, but 256 bytes is pretty small. It's not like reading past the end of it can be a problem, it can't be at the end of a page with unmapped memory later, because our parent function's stack frame definitely takes some space.
;; NASM syntax. For MASM just rename the local labels and add whatever PROC / ENDPROC is needed.
;; UNTESTED
;; rcx: argbuf
;; rdx: length in bytes of the args. 0..256, zero-extended to 64 bits
;; r8 : function pointer
;; reserve rdx bytes of space for arg passing
;; load first 32 bytes of argbuf into integer and FP arg-passing registers
;; copy the rest as stack-args above the shadow space
global asmwrapper
asmwrapper:
push rbp
mov rbp, rsp ; so we can efficiently restore the stack later
mov r10, r8 ; move function pointer to a volatile but non-arg-passing register
; load *both* xmm0-3 and rcx,rdx,r8,r9 from the first 32 bytes of argbuf
; regardless of types or whether there were that many arg bytes
; All bytes are loaded into registers early, some reg->reg transfers are done later
; when we're done with more registers.
; movsd xmm0, [rcx]
; movsd xmm1, [rcx+8]
movaps xmm0, [rcx] ; 16-byte alignment required for argbuf. Use movups to allow misalignment if you want
movhlps xmm1, xmm0 ; use some ALU instructions instead of just loads
; rcx,rdx can't be set yet, still in use for wrapper args
movaps xmm2, [rcx+16] ; it's ok to leave garbage in the high 64-bits of an XMM passing a float or double.
;movhlps xmm3, xmm2 ; the copyloop uses xmm3: do this later
movq r8, xmm2
mov r9, [rcx+24]
mov eax, 32
cmp edx, eax
jbe .small_args ; no copying needed, just shadow space
sub rsp, rdx
and rsp, -16 ; reserve extra space, realigning the stack by 16
; rax=32 on entry, start copying just above shadow space (which doesn't need to be copied)
.copyloop: ; do {
movaps xmm3, [rcx+rax]
movaps [rsp+rax], xmm3 ; indexed addressing modes aren't always optimal, but this loop only runs a couple times.
add eax, 16
cmp eax, edx
jb .copyloop ; } while(bytes_copied < arg_bytes);
.done_arg_copying:
; xmm0,xmm1 have the first 2 qwords of args
movq rcx, xmm0 ; RCX NO LONGER POINTS AT argbuf
movq rdx, xmm1
; xmm2 still has the 2nd 16 bytes of args
;movhlps xmm3, xmm2 ; don't use: false dependency on old value and we just used it.
pshufd xmm3, xmm2, 0xee ; xmm3 = high 64 bits of xmm2. (0xee = _MM_SHUFFLE(3,2,3,2))
; movq xmm3, r9 ; nah, can be multiple uops on AMD
; r8,r9 set earlier
call r10
leave ; restore RSP to its value on entry
ret
; could handle this branchlessly, but copy loop still needs to run zero times
; unless we bump up the min arg_bytes to 48 and sometimes copy an unnecessary 16 bytes
; As much work as possible is before the first branch, so it can happen while a mispredict recovers
.small_args:
sub rsp, rax ; reserve shadow space
;rsp still aligned by 16 after push rbp
jmp .done_arg_copying
;byte count. This wrapper is 82 bytes; would be nice to fit it in 80 so we don't waste 14 bytes before the next function.
;e.g. maybe mov rcx, [rcx] instead of movq rcx, xmm0
;mov eax, $-asmwrapper
align 16
This does assemble (on Godbolt with NASM), but I haven't tested it.
It should perform pretty well, but if you get mispredicts around the cutoff from <= 32 bytes to > 32 bytes, change the branching so it always copies an extra 16 bytes. (Uncomment the cmp/cmovb in the version on Godbolt, but the copy loop still needs to start at 32 bytes into each buffer.)
If you often pass very few args, the 16-byte loads might hit a store-forwarding stall from two narrow stores to one wide reload, causing about an extra 8 cycles of latency. This isn't normally a throughput problem, but it can increase the latency before the called function can access its args. If out-of-order execution can't hide that, then it's worth using more load uops to load each 8-byte arg separately. (Especially into integer registers, and then from there to XMM, if the args are mostly integer. That will have lower latency than mem -> xmm -> integer.)
If you have more than a couple args, though, hopefully the first few have committed to L1d and no longer need store forwarding by the time the asm wrapper runs. Or there's enough copying of later args that the first 2 args finish their load + ALU chain early enough not to delay the critical path inside the called function.
Of course, if performance was a huge issue, you'd write the code that figures out the args in asm so you didn't need this copy stuff, or use a library interface with a fixed function signature that a C compiler can call directly. I did try to make this suck as little as possible on modern Intel / AMD mainstream CPUs (http://agner.org/optimize/), but I didn't benchmark it or tune it, so probably it could be improved with some time spent profiling it, especially for some real use-case.
If you know that FP args aren't a possibility for the first 4, you can simplify by just loading integer regs.
So you need to call a function (in a DLL), but only at run time can you figure out the number and types of the parameters. Then you need to prepare the parameters, either on the stack or in registers, depending on the Application Binary Interface/calling convention.
I would use the following approach: some component of your program figures out the number and type of parameters. Let's assume it creates a list of {type, value}, {type, value}, ...
You then pass this list to a function that prepares the ABI call. This will be an assembler function. For a stack-based ABI (32 bit), it just pushes the parameters onto the stack. For a register-based ABI, it can prepare the register values and save them as local variables (allocating space with sub sp, nnn), and once all parameters have been prepared (possibly using registers needed for the call, hence saving those first), it loads the registers (a series of mov instructions) and performs the call instruction.
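A hedged sketch of the {type, value} list such a component might produce:

enum argtype { ARG_I64, ARG_F64, ARG_PTR };

struct arg {
    enum argtype type;
    union { long long i; double d; void *p; } v;
};

/* The assembler routine walks a struct arg[] and fills registers and/or
   stack slots according to the target ABI before issuing the call. */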
I've made a function to calculate the length of a C string (I'm trying to beat clang's optimizer using -O3). I'm running macOS.
_string_length1:
push rbp
mov rbp, rsp
xor rax, rax
.body:
cmp byte [rdi], 0
je .exit
inc rdi
inc rax
jmp .body
.exit:
pop rbp
ret
This is the C function I'm trying to beat:
size_t string_length2(const char *str) {
size_t ret = 0;
while (str[ret]) {
ret++;
}
return ret;
}
And it disassembles to this:
string_length2:
push rbp
mov rbp, rsp
mov rax, -1
LBB0_1:
cmp byte ptr [rdi + rax + 1], 0
lea rax, [rax + 1]
jne LBB0_1
pop rbp
ret
Every C function sets up a stack frame using push rbp and mov rbp, rsp, and tears it down using pop rbp. But I'm not using the stack in any way here; I'm only using processor registers. It worked without a stack frame (when I tested on x86-64), but is it necessary?
No, the stack frame is, at least in theory, not always required. An optimizing compiler might in some cases avoid using the call stack, notably when it is able to inline a called function (at some specific call site), or when it successfully detects a tail call (which reuses the caller's frame).
Read the ABI of your platform to understand requirements related to the stack.
You might try to compile your program with link time optimization (e.g. compile and link with gcc -flto -O2) to get more optimizations.
In principle, one could imagine a compiler clever enough to (for some programs) avoid using any call stack.
BTW, I just compiled a naive recursive long fact(int n) factorial function with GCC 7.1 (on Debian/Sid/x86-64) at -O3 (i.e. gcc -fverbose-asm -S -O3 fact.c). The resulting assembler code fact.s contains no call machine instruction.
Every C function sets up a stack frame using...
This is true for your compiler, but not in general. It is possible to compile a C program without using the stack at all: see, for example, continuation-passing style (CPS). Probably no C compiler on the market does so, but it is important to know that there are other ways to execute programs besides stack evaluation.
The ISO 9899 standard says nothing about the stack. It leaves compiler implementations free to choose whichever method of evaluation they consider to be the best.
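For illustration, CPS in C turns every return into a tail call to a continuation, so no frame has to survive the call:

typedef void (*cont)(int);

void add_cps(int a, int b, cont k) {
    k(a + b);   /* tail call: with tail-call optimization this reuses the current frame */
}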
I've been reading about assembly functions and I'm confused about whether to use the enter and leave instructions or just call/ret, for fast execution. Is one way faster and the other smaller? For example, what is the fastest (stdcall) way to do this in assembly without inlining the function:
#include <stdint.h>
typedef int32_t Int32;   /* the question's Int32, assumed to be a 32-bit int */

static Int32 Add(Int32 a, Int32 b) {
    return a + b;
}

int main() {
    Int32 i = Add(1, 3);
}
Use call / ret, without making a stack frame with either enter / leave or push&pop rbp / mov rbp, rsp. gcc (with the default -fomit-frame-pointer) only makes a stack frame in functions that do variable-size allocation on the stack. This may make debugging slightly more difficult, since gcc normally emits stack unwind info when compiling with -fomit-frame-pointer, but your hand-written asm won't have that. Normally it only makes sense to write leaf functions in asm, or at least ones that don't call many other functions.
Stack frames mean you don't have to keep track of how much the stack pointer has changed since function entry to access stuff on the stack (e.g. function args and spill slots for locals). Both Windows and Linux/Unix 64bit ABIs pass the first few args in registers, and there are often enough regs that you don't have to spill any variables to the stack. Stack frames are a waste of instructions in most cases. In 32bit code, having ebp available (going from 6 to 7 GP regs, not counting the stack pointer) makes a bigger difference than going from 14 to 15. Of course, you still have to push/pop rbp if you do use it, though, because in both ABIs it's a callee-saved register that functions aren't allowed to clobber.
If you're optimizing x86-64 asm, you should read Agner Fog's guides, and check out some of the other links in the x86 tag wiki.
The best implementation of your function is probably:
align 16
global Add
Add:
lea eax, [rdi + rsi]
ret
; the high 32 of either reg doesn't affect the low32 of the result
; so we don't need to zero-extend or use a 32bit address-size prefix
; like lea eax, [edi + esi]
; even if we're called with non-zeroed upper32 in rdi/rsi.
align 16
global main
main:
mov edi, 1 ; 1st arg in SysV ABI
mov esi, 3 ; 2nd arg in SysV ABI
call Add
; return value in eax in all ABIs
ret
align 16
OPmain: ; This is what you get if you don't return anything from main to use the result of Add
xor eax, eax
ret
This is in fact what gcc emits for Add(), but it still turns main into an empty function, or into a return 4 if you return i. clang 3.7 respects -fno-inline-functions even when the result is a compile-time constant. It beats my asm by doing tail-call optimization, and jmping to Add.
Note that the Windows 64bit ABI uses different registers for function args. See the links in the x86 tag wiki, or Agner Fog's ABI guide. Assembler macros may help for writing functions in asm that use the correct registers for their args, depending on the platform you're targeting.
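A hedged sketch of that macro idea, using the C preprocessor on a .S file (assemble with gcc so the preprocessor runs first; the names are illustrative):

#ifdef _WIN64
# define ARG0 ecx   /* Windows x64: first two integer args in rcx/rdx */
# define ARG1 edx
#else
# define ARG0 edi   /* x86-64 System V: first two integer args in rdi/rsi */
# define ARG1 esi
#endif
/* then Add can be written once:  lea eax, [ARG0 + ARG1]  followed by  ret */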