Why is memcmp so much faster than a for loop check? - c

Why is memcmp(a, b, size) so much faster than:
for(i = 0; i < nelements; i++) {
if a[i] != b[i] return 0;
}
return 1;
Is memcmp a CPU instruction or something? It must be pretty deep because I got a massive speedup using memcmp over the loop.

memcmp is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.
As a "builtin"
GCC supports memcmp (as well as a ton of other functions) as builtins. In some versions / configurations of GCC, a call to memcmp will be recognized as __builtin_memcmp. Instead of emitting a call to the memcmp library function, GCC will emit a handful of instructions to act as an optimized inline version of the function.
On x86, this leverages the use of the cmpsb instruction, which compares a string of bytes at one memory location to another. This is coupled with the repe prefix, so the strings are compared until they are no longer equal, or a count is exhausted. (Exactly what memcmp does).
Given the following code:
int test(const void* s1, const void* s2, int count)
{
return memcmp(s1, s2, count) == 0;
}
gcc version 3.4.4 on Cygwin generates the following assembly:
; (prologue)
mov esi, [ebp+arg_0] ; Move first pointer to esi
mov edi, [ebp+arg_4] ; Move second pointer to edi
mov ecx, [ebp+arg_8] ; Move length to ecx
cld ; Clear DF, the direction flag, so comparisons happen
; at increasing addresses
cmp ecx, ecx ; Special case: If length parameter to memcmp is
; zero, don't compare any bytes.
repe cmpsb ; Compare bytes at DS:ESI and ES:EDI, setting flags
; Repeat this while equal ZF is set
setz al ; Set al (return value) to 1 if ZF is still set
; (all bytes were equal).
; (epilogue)
Reference:
cmpsb instruction
As a library function
Highly-optimized versions of memcmp exist in many C standard libraries. These will usually take advantage of architecture-specific instructions to work with lots of data in parallel.
In Glibc, there are versions of memcmp for x86_64 that can take advantage of the following instruction set extensions:
SSE2 - sysdeps/x86_64/memcmp.S
SSE4 - sysdeps/x86_64/multiarch/memcmp-sse4.S
SSSE3 - sysdeps/x86_64/multiarch/memcmp-ssse3.S
The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has, and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S:
ENTRY(memcmp)
.type memcmp, #gnu_indirect_function
LOAD_RTLD_GLOBAL_RO_RDX
HAS_CPU_FEATURE (SSSE3)
jnz 2f
leaq __memcmp_sse2(%rip), %rax
ret
2: HAS_CPU_FEATURE (SSE4_1)
jz 3f
leaq __memcmp_sse4_1(%rip), %rax
ret
3: leaq __memcmp_ssse3(%rip), %rax
ret
END(memcmp)
In the Linux kernel
Linux does not seem to have an optimized version of memcmp for x86_64, but it does for memcpy, in arch/x86/lib/memcpy_64.S. Note that is uses the alternatives infrastructure (arch/x86/kernel/alternative.c) for not only deciding at runtime which version to use, but actually patching itself to only make this decision once at boot-up.

It's usually a compiler intrinsic that is translated into fast assembly with specialized instructions for comparing blocks of memory.
intrinsic memcmp

Is memcmp a CPU instruction or something?
It is at least a very highly optimized compiler-provided intrinsic function. Possibly a single machine instruction, or two, depending on the platform, which you haven't specified.

Related

Is subtraction less performant than negation?

I wonder if it's faster for the processor to negate a number or to do a subtraction. For example:
Is
int a = -3;
more efficient than
int a = 0 - 3;
In other words, does a negation is equivalent to subtracting from 0? Or is there a special CPU instruction that negate faster that a subtraction?
I suppose that the compiler does not optimize anything.
From the C language point of view, 0 - 3 is an integer constant expression and those are always calculated at compile-time.
Formal definition from C11 6.6/6:
An integer constant expression shall have integer type and shall
only have operands that are integer constants, enumeration constants,
character constants, sizeof expressions whose results are integer
constants, _Alignof expressions, and floating constants that are the
immediate operands of casts.
Knowing that these are calculated at compile time is important when writing readable code. For example if you want to declare a char array to hold 5 characters and the null terminator, you can write char str[5+1]; rather than 6, to get self-documenting code telling the reader that you have considered null termination.
Similarly when writing macros, you can make use of integer constant expressions to perform parts of the calculation at compile time.
(This answer is about negating a runtime variable, like -x or 0-x where constant-propagation doesn't result in a compile-time constant value for x. A constant like 0-3 has no runtime cost.)
I suppose that the compiler does not optimize anything.
That's not a good assumption if you're writing in C. Both are equivalent for any non-terrible compiler because of how integers work, and it would be a missed-optimization bug if one compiled to more efficient code than the other.
If you actually want to ask about asm, then how to negate efficiently depends on the ISA.
But yes, most ISAs can negate with a single instruction, usually by subtracting from an immediate or implicit zero, or from an architectural zero register.
e.g. 32-bit ARM has an rsb (reverse-subtract) instruction that can take an immediate operand. rsb rdst, rsrc, #123 does dst = 123-src. With an immediate of zero, this is just negation.
x86 has a neg instruction: neg eax is exactly equivalent to eax = 0-eax, setting flags the same way.
3-operand architectures with a zero register (hard-wired to zero) can just do something like MIPS subu $t0, $zero, $t0 to do t0 = 0 - t0. It has no need for a special instruction because the $zero register always reads as zero. Similarly AArch64 removed RSB but has a xzr / wzr 64/32-bit zero register. (Although it also has a pseudo-instruction called neg which subtracts from the zero register).
You could see most of this by using a compiler. https://godbolt.org/z/B7N8SK
But you'd have to actually compile to machine code and disassemble because gcc/clang tend to use the neg pseudo-instruction on AArch64 and RISC-V. Still, you can see ARM32's rsb r0,r0,#0 for int negate(int x){return -x;}
Both are compile time constants, and will generate the same constant initialisation in any reasonable compiler regardless of optimisation.
For example at https://godbolt.org/z/JEMWvS the following code:
void test( void )
{
int a = -3;
}
void test2( void )
{
int a = 0-3;
}
Compiled with gcc 9.2 x86-64 -std=c99 -O0 generates:
test:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], -3
nop
pop rbp
ret
test2:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], -3
nop
pop rbp
ret
Using -Os, the code:
void test( void )
{
volatile int a = -3;
}
void test2( void )
{
volatile int a = 0-3;
}
generates:
test:
mov DWORD PTR [rsp-4], -3
ret
test2:
mov DWORD PTR [rsp-4], -3
ret
The volatile being necessary to prevent the compiler removing the unused variables.
As static data it is even simpler:
int a = -3;
int b = 0-3;
outside of a function generates no executable code, just initialised data objects (initialisation is different from assignment):
a:
.long -3
b:
.long -3
Assignment of the above statics:
a = -4 ;
b = 0-4 ;
is still a compiler evaluated constant:
mov DWORD PTR a[rip], -4
mov DWORD PTR b[rip], -4
The take-home here is:
If you are interested, try it and see (with your own compiler or Godbolt set for your compiler and/or architecture),
don't sweat the small stuff, let the compiler do its job,
constant expressions are evaluated at compile time and have no run-time impact,
writing weird code in the belief you can better the compiler is almost always pointless. Compilers work better with idiomatic code the optimiser can recognise.
It's hard to tell if you ask asking if subtraction is fast then negation in a general sense, or in this specific case of implementing negation via subtraction from zero. I'll try to answer both.
General Case
For the general case, on most modern CPUs these operations are both very fast: usually each only taking a single cycle to execute, and often having a throughput of more than one per cycle (because CPUs are superscalar). On all recent AMD and Intel CPUs that I checked, both sub and neg execute at the same speed for register and immediate arguments.
Implementing -x
As regards to your specific question of implementing the -x operation, it would usually be slightly faster to implement this with a dedicated neg operation than with a sub, because with neg you don't have to prepare the zero registers. For example, a negation function int neg(int x) { return -x; }; would look something like this with the neg instruction:
neg:
mov eax, edi
neg eax
... while implementing it terms of subtraction would look something like:
neg:
xor eax, eax
sub eax, edi
Well ... sub didn't come out looking at worse there, but that's mostly a quirk of the calling convention and the fact that x86 uses a 1 argument destructive neg: the result needs to be in eax, so in the neg case 1 instruction is spent just moving the result to the right register, and one doing the negation. The sub version takes two instructions to perform the negation itself: one to zero a register, and one to do the subtraction. It so happens that this lets you avoid the ABI shuffling because you get to choose the zero register as the result register.
Still, this ABI related inefficiency wouldn't persist after inlining, so we can say in some fundamental sense that neg is slightly more efficient.
Now many ISAs may not have a neg instruction at all, so the question is more or less moot. They may have a hardcoded zero register, so you'd implement negation via subtraction from this register and there is no cost to set up the zero.

Why does the asm code function in C, takes more time than the c code function?

I wrote a simple multiplication function in C, and another in assembly code, using GCC's "asm" keyword.
I took the execution time for each of them, and although their times are pretty close, the C function is a little faster than the one in assembly code.
I would like to know why, since I expected for the asm one to be faster. Is it because of the extra "call" (i don't know what word to use) to the GCC's "asm" keyword?
Here is the C function:
int multiply (int a, int b){return a*b;}
And here is the asm one in the C file:
int asmMultiply(int a, int b){
asm ("imull %1,%0;"
: "+r" (a)
: "r" (b)
);
return a;
}
my main where I take the times:
int main(){
int n = 50000;
clock_t asmClock = clock();
while(n>0){
asmMultiply(4,5);
n--;
}
asmClock = clock() - asmClock;
double asmTime = ((double)asmClock)/CLOCKS_PER_SEC;
clock_t cClock = clock();
n = 50000;
while(n>0){
multiply(4,5);
n--;
}
cClock = clock() - cClock;
double cTime = ((double)cClock)/CLOCKS_PER_SEC;
printf("Asm time: %f\n",asmTime);
printf("C code time: %f\n",cTime);
Thanks!
The assembly function is doing more work than the C function — it's initializing mult, then doing the multiplication and assigning the result to mult, and then pushing the value from mult into the return location.
Compilers are good at optimizing; you won't easily beat them on basic arithmetic.
If you really want improvement, use static inline int multiply(int a, int b) { return a * b; }. Or just write a * b (or the equivalent) in the calling code instead of int x = multiply(a, b);.
This attempt to microbenchmark is too naive in almost every way possible for you to get any meaningful results.
Even if you fixed the surface problems (so the code didn't optimize away), there are major deep problems before you can conclude anything about when your asm would be better than *.
(Hint: probably never. Compilers already know how to optimally multiply integers, and understand the semantics of that operation. Forcing it to use imul instead of auto-vectorizing or doing other optimizations is going to be a loss.)
Both timed regions are empty because both multiplies can optimize away. (The asm is not asm volatile, and you don't use the result.) You're only measuring noise and/or CPU frequency ramp-up to max turbo before the clock() overhead.
And even if they weren't, a single imul instruction is basically unmeasurable with a function with as much overhead as clock(). Maybe if you serialized with lfence to force the CPU to wait for imul to retire, before rdtsc... See RDTSCP in NASM always returns the same value
Or you compiled with optimization disabled, which is pointless.
You basically can't measure a C * operator vs. inline asm without some kind of context involving a loop. And then it will be for that context, dependent on what optimizations you defeated by using inline asm. (And what if anything you did to stop the compiler from optimizing away work for the pure C version.)
Measuring only one number for a single x86 instruction doesn't tell you much about it. You need to measure latency, throughput, and front-end uop cost to properly characterize its cost. Modern x86 CPUs are superscalar out-of-order pipelined, so the sum of costs for 2 instructions depends on whether they're dependent on each other, and other surrounding context. How many CPU cycles are needed for each assembly instruction?
The stand-alone definitions of the functions are identical, after your change to let the compiler pick registers, and your asm could inline somewhat efficiently, but it's still optimization-defeating. gcc knows that 5*4 = 20 at compile time, so if you did use the result multiply(4,5) could optimize to an immediate 20. But gcc doesn't know what the asm does, so it just has to feed it the inputs at least once. (non-volatile means it can CSE the result if you used asmMultiply(4,5) in a loop, though.)
So among other things, inline asm defeats constant propagation. This matters even if only one of the inputs is a constant, and the other is a runtime variable. Many small integer multipliers can be implemented with one or 2 LEA instructions or a shift (with lower latency than the 3c for imul on modern x86).
https://gcc.gnu.org/wiki/DontUseInlineAsm
The only use-case I could imagine asm helping is if a compiler used 2x LEA instructions in a situation that's actually front-end bound, where imul $constant, %[src], %[dst] would let it copy-and-multiply with 1 uop instead of 2. But your asm removes the possibility of using immediates (you only allowed register constraints), and GNU C inline can't let you use a different template for immediate vs. register arg. Maybe if you used multi-alternative constraints and a matching register constraint for the register-only part? But no, you'd still have to have something like asm("%2, %1, %0" :...) and that can't work for reg,reg.
You could use if(__builtin_constant_p(a)) { asm using imul-immediate } else { return a*b; }, which would work with GCC to let you defeat LEA. Or just require a constant multiplier anyway, since you'd only ever want to use this for a specific gcc version to work around a specific missed-optimization. (i.e. it's so niche that in practice you wouldn't ever do this.)
Your code on the Godbolt compiler explorer, with clang7.0 -O3 for the x86-64 System V calling convention:
# clang7.0 -O3 (The functions both inline and optimize away)
main: # #main
push rbx
sub rsp, 16
call clock
mov rbx, rax # save the return value
call clock
sub rax, rbx # end - start time
cvtsi2sd xmm0, rax
divsd xmm0, qword ptr [rip + .LCPI2_0]
movsd qword ptr [rsp + 8], xmm0 # 8-byte Spill
call clock
mov rbx, rax
call clock
sub rax, rbx # same block again for the 2nd group.
xorps xmm0, xmm0
cvtsi2sd xmm0, rax
divsd xmm0, qword ptr [rip + .LCPI2_0]
movsd qword ptr [rsp], xmm0 # 8-byte Spill
mov edi, offset .L.str
mov al, 1
movsd xmm0, qword ptr [rsp + 8] # 8-byte Reload
call printf
mov edi, offset .L.str.1
mov al, 1
movsd xmm0, qword ptr [rsp] # 8-byte Reload
call printf
xor eax, eax
add rsp, 16
pop rbx
ret
TL:DR: if you want to understand inline asm performance on this fine-grained level of detail, you need to understand how compilers optimize in the first place.
How to remove "noise" from GCC/clang assembly output?
C++ code for testing the Collatz conjecture faster than hand-written assembly - why?
Modern x86 cost model
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

Coaxing GCC to emit REPE CMPSB

How to coax the GCC compiler to emit the REPE CMPSB instruction in plain C, without the "asm" and "_emit" keywords, calls to an included library and compiler intrinsics?
I tried some C code like the one listed below, but unsuccessfully:
unsigned int repe_cmpsb(unsigned char *esi, unsigned char *edi, unsigned int ecx) {
for (; ((*esi == *edi) && (ecx != 0)); esi++, edi++, ecx--);
return ecx;
}
See how GCC compiles it at this link:
https://godbolt.org/g/obJbpq
P.S.
I realize that there are no guarantees that the compiler compiles a C code in a certain way, but I'd like to coax it anyway for fun and just to see how smart it is.
rep cmps isn't fast; it's >= 2 cycles per count throughput on Haswell, for example, plus startup overhead. (http://agner.org/optimize). You can get a regular byte-at-a-time loop to go at 1 compare per clock (modern CPUs can run 2 loads per clock) even when you have to check for a match and for a 0 terminator, if you write it carefully.
InstLatx64 numbers agree: Haswell can manage 1 cycle per byte for rep cmpsb, but that's total bandwidth (i.e. 2 cycles to compare 1 byte from each string).
Only rep movs and rep stos have "fast strings" support in current x86 CPUs. (i.e. microcoded implementations that internally use wider loads/stores when alignment and lack of overlap allow.)
The "smart" thing for modern CPUs is to use SSE2 pcmpeqb / pmovmskb. (But gcc and clang don't know how to vectorize loops with an iteration count that isn't known before loop entry; i.e. they can't vectorize search loops. ICC can, though.)
However, gcc will for some reason inline repz cmpsb for strcmp against short fixed strings. Presumably it doesn't know any smarter patterns for inlining strcmp, and the startup overhead may still be better than the overhead of a function call to a dynamic library function. Or maybe not, I haven't tested. Anyway, it's not horrible for code size in a block of code that compares something against a bunch of fixed strings.
#include <string.h>
int string_equal(const char *s) {
return 0 == strcmp(s, "test1");
}
gcc7.3 -O3 output from Godbolt
.LC0:
.string "test1"
string_cmp:
mov rsi, rdi
mov ecx, 6
mov edi, OFFSET FLAT:.LC0
repz cmpsb
setne al
movzx eax, al
ret
If you don't booleanize the result somehow, gcc generates a -1 / 0 / +1 result with seta / setb / sub / movzx. (Causing a partial-register stall on Intel before IvyBridge, and a false dependency on other CPUs, because it uses 32-bit sub on the setcc results, /facepalm. Fortunately most code only needs a 2-way result from strcmp, not 3-way).
gcc only does this with fixed-length string constants, otherwise it wouldn't know how to set rcx.
The results are totally different for memcmp: gcc does a pretty good job, in this case using a DWORD and a WORD cmp, with no rep string instructions.
int cmp_mem(const char *s) {
return 0 == memcmp(s, "test1", 6);
}
cmp DWORD PTR [rdi], 1953719668 # 0x74736574
je .L8
.L5:
mov eax, 1
xor eax, 1 # missed optimization here after the memcmp pattern; should just xor eax,eax
ret
.L8:
xor eax, eax
cmp WORD PTR [rdi+4], 49 # check last 2 bytes
jne .L5
xor eax, 1
ret
Controlling this behaviour
The manual says that -mstringop-strategy=libcall should force a library call, but it doesn't work. No change in asm output.
Neither does -mno-inline-stringops-dynamically -mno-inline-all-stringops.
It seems this part of the GCC docs is obsolete. I haven't investigated further with larger string literals, or fixed size but non-constant strings, or similar.

Fastest (CPU-wise) way to do functions in intel x64 assembly?

I've been reading about assembly functions and I'm confused as to whether to use the enter and exit or just the call/return instructions for fast execution. Is one way fast and the other smaller? For example what is the fastest (stdcall) way to do this in assembly without inlining the function:
static Int32 Add(Int32 a, Int32 b) {
return a + b;
}
int main() {
Int32 i = Add(1, 3);
}
Use call / ret, without making a stack frame with either enter / leave or push&pop rbp / mov rbp, rsp. gcc (with the default -fomit-frame-pointer) only makes a stack frame in functions that do variable-size allocation on the stack. This may make debugging slightly more difficult, since gcc normally emits stack unwind info when compiling with -fomit-frame-pointer, but your hand-written asm won't have that. Normally it only makes sense to write leaf functions in asm, or at least ones that don't call many other functions.
Stack frames mean you don't have to keep track of how much the stack pointer has changed since function entry to access stuff on the stack (e.g. function args and spill slots for locals). Both Windows and Linux/Unix 64bit ABIs pass the first few args in registers, and there are often enough regs that you don't have to spill any variables to the stack. Stack frames are a waste of instructions in most cases. In 32bit code, having ebp available (going from 6 to 7 GP regs, not counting the stack pointer) makes a bigger difference than going from 14 to 15. Of course, you still have to push/pop rbp if you do use it, though, because in both ABIs it's a callee-saved register that functions aren't allowed to clobber.
If you're optimizing x86-64 asm, you should read Agner Fog's guides, and check out some of the other links in the x86 tag wiki.
The best implementation of your function is probably:
align 16
global Add
Add:
lea eax, [rdi + rsi]
ret
; the high 32 of either reg doesn't affect the low32 of the result
; so we don't need to zero-extend or use a 32bit address-size prefix
; like lea eax, [edi, esi]
; even if we're called with non-zeroed upper32 in rdi/rsi.
align 16
global main
main:
mov edi, 1 ; 1st arg in SysV ABI
mov esi, 3 ; 2nd arg in SysV ABI
call Add
; return value in eax in all ABIs
ret
align 16
OPmain: ; This is what you get if you don't return anything from main to use the result of Add
xor eax, eax
ret
This is in fact what gcc emits for Add(), but it still turns main into an empty function, or into a return 4 if you return i. clang 3.7 respects -fno-inline-functions even when the result is a compile-time constant. It beats my asm by doing tail-call optimization, and jmping to Add.
Note that the Windows 64bit ABI uses different registers for function args. See the links in the x86 tag wiki, or Agner Fog's ABI guide. Assembler macros may help for writing functions in asm that use the correct registers for their args, depending on the platform you're targeting.

Inline assembly that clobbers the red zone

I'm writing a cryptography program, and the core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it extensively uses instructions like adc that are not easily accessible from C. I don't want to inline this function, because it's big and it's called several times in the inner loop.
Ideally I would also like to define a custom calling convention for this function, because internally it uses all the registers (except rsp), doesn't clobber its arguments, and returns in registers. Right now, it's adapted to the C calling convention, but of course this makes it slower (by about 10%).
To avoid this, I can call it with asm("call %Pn" : ... : my_function... : "cc", all the registers); but is there a way to tell GCC that the call instruction messes with the stack? Otherwise GCC will just put all those registers in the red zone, and the top one will get clobbered. I can compile the whole module with -mno-red-zone, but I'd prefer a way to tell GCC that, say, the top 8 bytes of the red zone will be clobbered so that it won't put anything there.
From your original question I did not realize gcc limited red-zone use to leaf functions. I don't think that's required by the x86_64 ABI, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function calling your assembly routine a non-leaf for purposes of compilation:
int global;
was_leaf()
{
if (global) other();
}
GCC can't tell if global will be true, so it can't optimize away the call to other() so was_leaf() is not a leaf function anymore. I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move %rsp and with the modification shown it did.
I also tried simply allocating more than 128 bytes (just char buf[150]) in a leaf but I was shocked to see it only did a partial subtraction:
pushq %rbp
movq %rsp, %rbp
subq $40, %rsp
movb $7, -155(%rbp)
If I put the leaf-defeating code back in that becomes subq $160, %rsp
The max-performance way might be to write the whole inner loop in asm (including the call instructions, if it's really worth it to unroll but not inline. Certainly plausible if fully inlining is causing too many uop-cache misses elsewhere).
Anyway, have C call an asm function containing your optimized loop.
BTW, clobbering all the registers makes it hard for gcc to make a very good loop, so you might well come out ahead from optimizing the whole loop yourself. (e.g. maybe keep a pointer in a register, and an end-pointer in memory, because cmp mem,reg is still fairly efficient).
Have a look at the code gcc/clang wrap around an asm statement that modifies an array element (on Godbolt):
void testloop(long *p, long count) {
for (long i = 0 ; i < count ; i++) {
asm(" # XXX asm operand in %0"
: "+r" (p[i])
:
: // "rax",
"rbx", "rcx", "rdx", "rdi", "rsi", "rbp",
"r8", "r9", "r10", "r11", "r12","r13","r14","r15"
);
}
}
#gcc7.2 -O3 -march=haswell
push registers and other function-intro stuff
lea rcx, [rdi+rsi*8] ; end-pointer
mov rax, rdi
mov QWORD PTR [rsp-8], rcx ; store the end-pointer
mov QWORD PTR [rsp-16], rdi ; and the start-pointer
.L6:
# rax holds the current-position pointer on loop entry
# also stored in [rsp-16]
mov rdx, QWORD PTR [rax]
mov rax, rdx # looks like a missed optimization vs. mov rax, [rax], because the asm clobbers rdx
XXX asm operand in rax
mov rbx, QWORD PTR [rsp-16] # reload the pointer
mov QWORD PTR [rbx], rax
mov rax, rbx # another weird missed-optimization (lea rax, [rbx+8])
add rax, 8
mov QWORD PTR [rsp-16], rax
cmp QWORD PTR [rsp-8], rax
jne .L6
# cleanup omitted.
clang counts a separate counter down towards zero. But it uses load / add -1 / store instead of a memory-destination add [mem], -1 / jnz.
You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.
Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c latency. (It's still 1 uop on AMD, but higher latency especially on Bulldozer-family). Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or it saves more spill/reloads than it costs.
But of course XMM is not very viable for loop counters (paddd/pcmpeq/pmovmskb/cmp/jcc or psubd/ptest/jcc are not great compared to sub [mem], 1 / jcc), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another paddq sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.
If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use add $-128, %rsp ; call ; sub $-128, %rsp to preserve the red-zone in those versions. (Note that -128 is encodeable as an imm8 but +128 isn't.)
Including an actual function call in your C function doesn't necessarily make it safe to assume the red-zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red-zone, so clobbering all the registers in an asm statement is quite likely to trigger that behaviour.
// a non-leaf function that still uses the red-zone with gcc
void bar(void) {
//cryptofunc(1); // gcc/clang don't use the redzone after this (not future-proof)
volatile int tmp = 1;
(void)tmp;
cryptofunc(1); // but gcc will use the redzone before a tailcall
}
# gcc7.2 -O3 output
mov edi, 1
mov DWORD PTR [rsp-12], 1
mov eax, DWORD PTR [rsp-12]
jmp cryptofunc(long)
If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space since they have to adjust the stack anyway (to align rsp before a call). This is not future-proof at all, but should happen to work.
GNU C has an __attribute__((target("options"))) x86 function attribute, but it's not usable for arbitrary options, and -mno-red- zone is not one of the ones you can toggle on a per-function basis, or with #pragma GCC target ("options") within a compilation unit.
You can use stuff like
__attribute__(( target("sse4.1,arch=core2") ))
void penryn_version(void) {
...
}
but not __attribute__(( target("mno-red-zone") )).
There's a #pragma GCC optimize and an optimize function-attribute (both of which are not intended for production code), but #pragma GCC optimize ("-mno-red-zone") doesn't work either. I think the idea is to let some important functions be optimized with -O2 even in debug builds. You can set -f options or -O.
You could put the function in a file by itself and compile that compilation unit with -mno-red-zone, though. (And hopefully LTO will not break anything...)
Can't you just modify your assembly function to meet the requirements of a signal in the x86-64 ABI by shifting the stack pointer by 128 bytes on entry to your function?
Or if you are referring to the return pointer itself, put the shift into your call macro (so sub %rsp; call...)
Not sure but looking at GCC documentation for function attributes, I found the stdcall function attribute which might be of interest.
I'm still wondering what you find problematic with your asm call version. If it's just aesthetics, you could transform it into a macro, or a inline function.
What about creating a dummy function that is written in C and does nothing but call the inline assembly?

Resources