Implementations for asm("nop") in windows - c

Is an empty line of code that ends with a semicolon equivalent to an asm("nop") instruction?
volatile int x = 5;
if (x == 5) {
    printf("x has not been changed yet\n");
}
else {
    ; // Is this the same as asm("nop") or __asm nop in windows?
      // alternatively could use __asm nop or __nop();
}
I looked at this answer, and it makes me not want to use an x86-specific inline-assembly implementation:
Is `__asm nop` the Windows equivalent of `asm volatile("nop");` from GCC compiler
I can use this void __nop(); function that MSDN seems to recommend, but I don't want to drag in the library if I don't have to.
https://learn.microsoft.com/en-us/cpp/intrinsics/nop?view=vs-2017
Is there a cheap, portable way to add a nop instruction that won't get compiled out? I thought an empty semicolon either was nop or compiled out but I can't find any info on it tonight for some reason.
CLARIFICATION EDIT: I can use inline asm to do this for x86, but I would like it to be portable. I can use the Windows library __nop(), but I don't want to import the library into my project; it's undesirable overhead.
I am looking for a clever way to generate a NOP instruction that will not be optimized out (with standard C syntax preferably), that can be made into a macro and used throughout a project, that has minimal overhead, and that works (or can easily be improved to work) on Windows/Linux/x86/x64.
Thanks.

I mean I don't want to add a library just to force the compiler to add a NOP.
... in a way that is independent of compiler settings (such as optimization settings) and in a way that works with all Visual C++ versions (and maybe even other compilers):
No chance: a compiler is free in how it generates code, as long as the assembly it emits has the behavior the C code describes.
And because the NOP instruction does not change the behavior of the program, the compiler is free to add it or to leave it out.
Even if you found a way to force the compiler to generate a NOP: one compiler update, or a Windows update modifying some file, and the compiler might not generate the NOP instruction any longer.
I can use inline asm to do this for x86 but I would like it to be portable.
As I wrote above, any way to force the compiler to write a NOP would only work on a certain compiler version for a certain CPU.
Using inline assembly or __nop() you might cover all compilers of a certain manufacturer (for example: all GNU C compilers or all variants of Visual C++ etc...).
Another question would be: Do you explicitly need the "official" NOP instruction or can you live with any instruction that does nothing?
If you could live with any instruction doing (nearly) nothing, reading a global or static volatile variable could be a replacement for NOP:
static volatile char dummy;
...
else
{
    (void)dummy;
}
This should force the compiler to add a MOV instruction reading the variable dummy.
Background:
If you wrote a device driver, you could link the variable dummy to some location where reading the variable has "side effects". Example: reading a variable located in VGA video memory can influence the screen content!
Using the volatile keyword you do not only tell the compiler that the value of the variable may change at any time, but also that reading the variable may have such effects.
This means that the compiler has to assume that not reading the variable causes the program not to work correctly. It cannot optimize away the (actually unnecessary) MOV instruction reading the variable.
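If that trade-off is acceptable, the idea can be wrapped into a project-wide macro. This is only a sketch of the approach described above (the names are illustrative, and the emitted instruction is a volatile load, not an architectural NOP):
/* nearly_nop.h -- sketch: each use forces one volatile load that the
 * optimizer may not remove.  Note: "static" in a header gives every
 * translation unit its own copy of the dummy byte. */
static volatile char nearly_nop_dummy;
#define NEARLY_NOP() ((void)nearly_nop_dummy)

/* usage:
 *   else {
 *       NEARLY_NOP();   // emits a load of nearly_nop_dummy
 *   }
 */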

Is an empty line of code that ends with a semicolon equivalent to an asm("nop") instruction?
No, of course not. You could have trivially tried it yourself. (On your own machine, or on the Godbolt compiler explorer, https://godbolt.org/)
You wouldn't want innocent CPP macros to introduce a NOP when FOO(x); expands to just ;, because the appropriate definition of FOO() in that case is an empty expansion.
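For example, a typical pattern is a tracing macro that expands to nothing in release builds; you would not want every such call site to turn into a NOP (the names here are only illustrative):
#include <stdio.h>

#ifdef NDEBUG
#define TRACE(msg)            /* TRACE("hi"); expands to just ';' */
#else
#define TRACE(msg) puts(msg)
#endif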
__nop() is not a library function. It's an intrinsic that does exactly what you want. e.g.
#ifdef USE_NOP
#ifdef _MSC_VER
#include <intrin.h>
#define NOP() __nop() // _emit 0x90
#else
// assume __GNUC__ inline asm
#define NOP() asm("nop") // implicitly volatile
#endif
#else
#define NOP() // no NOPs
#endif
int idx(int *arr, int b) {
    NOP();
    return arr[b];
}
compiles with Clang7.0 -O3 for x86-64 Linux to this asm
idx(int*, int):
nop
movsxd rax, esi # sign extend b
mov eax, dword ptr [rdi + 4*rax]
ret
compiles with 32-bit x86 MSVC 19.16 -O2 -Gv to this asm
int idx(int *,int) PROC ; idx, COMDAT
npad 1 ; pad with a 1 byte NOP
mov eax, DWORD PTR [ecx+edx*4] ; __vectorcall arg regs
ret 0
and compiles with x64 MSVC 19.16 -O2 -Gv to this asm (Godbolt for all of them):
int idx(int *,int) PROC ; idx, COMDAT
movsxd rax, edx
npad 1 ; pad with a 1 byte NOP
mov eax, DWORD PTR [rcx+rax*4] ; x64 __vectorcall arg regs
ret 0
Interestingly, the sign-extension of b to 64-bit is done before the NOP. Apparently x64 MSVC requires (by default) that functions start with at least a 2-byte or longer instruction (after the prologue of 1-byte push instructions, maybe?), so they support hot-patching with a jmp rel8.
If you use this in a 1-instruction function, you get an npad 2 (2 byte NOP) before the npad 1 from x64 MSVC:
int bar(int a, int b) {
    __nop();
    return a+b;
}
;; x64 MSVC 19.16
int bar(int,int) PROC ; bar, COMDAT
npad 2
npad 1
lea eax, DWORD PTR [rcx+rdx]
ret 0
I'm not sure how aggressively MSVC will reorder the NOP with respect to pure register instructions, but a^=b; after the __nop() will actually result in xor ecx, edx before the NOP instruction.
But wrt. memory access, MSVC decides not to reorder anything to fill that 2-byte slot in this case.
int sink;
int foo(int a, int b) {
    __nop();
    sink = 1;
    //a^=b;
    return a+b;
}
;; MSVC 19.16 -O2
int foo(int,int) PROC ; foo, COMDAT
npad 2
npad 1
lea eax, DWORD PTR [rcx+rdx]
mov DWORD PTR int sink, 1 ; sink
ret 0
It does the LEA first, but doesn't move it before the __nop(); seems like an obvious missed optimization, but then again if you're inserting __nop() instructions then optimization is clearly not the priority.
If you compiled to a .obj or .exe and disassembled, you'd see a plain 0x90 nop. But Godbolt doesn't support that for MSVC, only Linux compilers, unfortunately, so all I can do easily is copy the asm text output.
And as you'd expect, with the __nop() ifdefed out, the functions compile normally, to the same code but with no npad directive.
The nop instruction will run as many times as the NOP() macro does in the C abstract machine. Ordering wrt. surrounding non-volatile memory accesses is not guaranteed by the optimizer, or wrt. calculations in registers.
If you want it to be a compile-time memory reordering barrier, for GNU C use asm("nop" ::: "memory");. For MSVC, that would have to be separate, I assume.
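As a sketch of how the two could be combined into one macro (assuming MSVC's _ReadWriteBarrier intrinsic is acceptable; Microsoft documents it as deprecated in favor of C11/C++11 atomics):
#ifdef _MSC_VER
#include <intrin.h>
// NOP plus a compile-time barrier against reordering of non-volatile accesses
#define NOP_BARRIER() do { _ReadWriteBarrier(); __nop(); _ReadWriteBarrier(); } while (0)
#else
// GNU C: the "memory" clobber makes the asm statement a compile-time barrier
#define NOP_BARRIER() asm volatile("nop" ::: "memory")
#endif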

Related

Why is GCC with __attribute__((__ms_abi__)) returning values differently than MSVC does?

x86 Function Attributes in the GCC documentation says this:
On 32-bit and 64-bit x86 targets, you can use an ABI attribute to indicate which calling convention should be used for a function. The ms_abi attribute tells the compiler to use the Microsoft ABI, while the sysv_abi attribute tells the compiler to use the System V ELF ABI, which is used on GNU/Linux and other systems. The default is to use the Microsoft ABI when targeting Windows. On all other systems, the default is the System V ELF ABI.
But consider this C code:
#include <assert.h>
#ifdef _MSC_VER
#define MS_ABI
#else
#define MS_ABI __attribute__((__ms_abi__))
#endif
typedef struct {
    void *x, *y;
} foo;

static_assert(sizeof(foo) == 8, "foo must be an 8-byte structure");

foo MS_ABI f(void *x, void *y) {
    foo rv;
    rv.x = x;
    rv.y = y;
    return rv;
}
gcc -O2 -m32 compiles it to this:
f:
mov eax, DWORD PTR [esp+4]
mov edx, DWORD PTR [esp+8]
mov DWORD PTR [eax], edx
mov edx, DWORD PTR [esp+12]
mov DWORD PTR [eax+4], edx
ret
But cl /O2 compiles it to this:
_x$ = 8 ; size = 4
_y$ = 12 ; size = 4
_f PROC ; COMDAT
mov eax, DWORD PTR _x$[esp-4]
mov edx, DWORD PTR _y$[esp-4]
ret 0
_f ENDP
Godbolt link
These are clearly using incompatible calling conventions. Argument Passing and Naming Conventions on MSDN says this:
Return values are also widened to 32 bits and returned in the EAX register, except for 8-byte structures, which are returned in the EDX:EAX register pair. Larger structures are returned in the EAX register as pointers to hidden return structures.
Which means MSVC is correct. So why is GCC using the pointer-to-hidden-return-structure approach even though the return value is an 8-byte structure? Is this a bug in GCC, or am I not allowed to use ms_abi like I think I am?
That seems buggy to me. i386 SysV only returns int64_t in registers, not structs of the same size, but perhaps GCC forgot to take that into account for ms_abi. Same problem with gcc -mabi=ms (docs).
Even -mabi=ms doesn't in general change struct layouts in cases where that differs, or for x86-64 make long a 32-bit type. But your struct does have the same layout in both ABIs, so you'd expect it to get returned the way an MSVC caller wants. But that's not happening.
This is IIRC not the first time I've heard of bugs in __attribute__((ms_abi)) or -mabi=ms. But you are using it correctly; in 64-bit code that would make the difference to which arg-passing registers it looked in.
Clang makes the same asm as GCC, but that's not significant because it warns that the 'ms_abi' calling convention is not supported for this target. This is the asm we'd expect from i386 SysV. (And Godbolt doesn't have clang-cl.)
Thanks for reporting it to GCC as bug #105932.
(Funny that when it chooses to use SSE or AVX, it loads both dword stack args separately and shuffles them together, instead of a movq load. I guess that avoids likely store-forwarding stalls if the caller didn't use a single 64-bit store to write the args, though.)
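Until the GCC bug is fixed, one way to sidestep it on the C side is to avoid returning the small struct by value across the mixed-ABI boundary, for example by filling in a caller-provided structure instead (a sketch using the question's foo and MS_ABI definitions):
/* Sketch of a workaround: no struct return, so the disagreement about
 * the 8-byte-struct return convention never comes into play. */
void MS_ABI f_out(foo *out, void *x, void *y) {
    out->x = x;
    out->y = y;
}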

GCC inline assembly: "g" constraint and parameter size

Background
I am aware that solving the following problem with inline assembly is a bad idea. I'm currently learning inline assembly as part of a class on the linux kernel, and this was part of an assignment for that class.
The Setup
To begin with, below is a snippet of code that is almost correct, but instead segfaults. It is a function that copies the substring of src starting at index s_idx and ending (exclusively) at index e_idx into the pre-allocated dest using only inline assembly.
static inline char *asm_sub_str(char *dest, char *src, int s_idx, int e_idx) {
    asm("addq %q2, %%rsi;"  /* Add start index to src (ptrs are 64-bit) */
        "subl %k2, %%ecx;"  /* Get length of substr as e - s (int is 32-bit) */
        "cld;"              /* Clear direction bit (force increment) */
        "rep movsb;"        /* Move %ecx bytes of str at %esi into str at %edi */
        : /* No Outputs */
        : "S" (src), "D" (dest), "g" (s_idx), "c" (e_idx)
        : "cc", "memory"
    );
    return dest;
}
The issue with this code is the constraint for the second input parameter. When compiled with gcc's default optimization and -ggdb, the following assembly is generated:
Dump of assembler code for function asm_sub_str:
0x00000000004008e6 <+0>: push %rbp
0x00000000004008e7 <+1>: mov %rsp,%rbp
0x00000000004008ea <+4>: mov %rdi,-0x8(%rbp)
0x00000000004008ee <+8>: mov %rsi,-0x10(%rbp)
0x00000000004008f2 <+12>: mov %edx,-0x14(%rbp)
0x00000000004008f5 <+15>: mov %ecx,-0x18(%rbp)
0x00000000004008f8 <+18>: mov -0x10(%rbp),%rax
0x00000000004008fc <+22>: mov -0x8(%rbp),%rdx
0x0000000000400900 <+26>: mov -0x18(%rbp),%ecx
0x0000000000400903 <+29>: mov %rax,%rsi
0x0000000000400906 <+32>: mov %rdx,%rdi
0x0000000000400909 <+35>: add -0x14(%rbp),%rsi
0x000000000040090d <+39>: sub -0x14(%rbp),%ecx
0x0000000000400910 <+42>: cld
0x0000000000400911 <+43>: rep movsb %ds:(%rsi),%es:(%rdi)
0x0000000000400913 <+45>: mov -0x8(%rbp),%rax
0x0000000000400917 <+49>: pop %rbp
0x0000000000400918 <+50>: retq
This is identical to the assembly that is generated when the second input parameter's constraint is set to "m" instead of "g", leading me to believe the compiler is effectively choosing the "m" constraint. In stepping through these instructions with gdb, I found that the offending instruction is +35, which adds the starting offset index s_idx to the src pointer in %rsi. The problem of course is that s_idx is only 32 bits, and the upper 4 bytes of a 64-bit integer at that location on the stack are not necessarily 0. On my machine they are in fact nonzero, which causes the addition to muddle the upper 4 bytes of %rsi and leads to a segfault at instruction +43.
The Question
Of course the solution to the above is to change the constraint of parameter 2 to "r" so it's placed in its own 64-bit register where the top 4 bytes are correctly zeroed and call it a day. Instead, my question is why does gcc resolve the "g" constraint as "m" instead of "r" in this case when the expression "%q2" indicates the value of parameter 2 will be used as a 64-bit value?
I don't know much about how gcc parses inline assembly, and I know there's not really a sense of typing in assembly, but I would think that gcc could recognize the effectively implicit cast of s_idx to a long when it's used as a 64-bit value in the first inline instruction. FWIW, if I explicitly change "g" (s_idx) to "g" ((long) s_idx), gcc then resolves the "g" constraint to "r" since (long) s_idx is a temporary value. I would think gcc could do that implicitly as well?
but I would think that gcc could recognize the effectively implicit cast of s_idx to a long when it's used as a 64-bit value in the first inline instruction.
No, gcc only looks at the constraints, not the asm template string at all, when compiling the surrounding code. The part of gcc that fills in the % template operands is totally separate from register-allocation and code-gen for the surrounding code.
Nothing checks for sanity or understands the context that template operands are being used in. Maybe you have a 16-bit input and want to copy it to a vector register with vmovd %k[input], %%xmm0 / vpbroadcastw %%xmm0, %%ymm0. The upper 16 bits are ignored, so you don't want gcc to waste time zero or sign-extending it for you. But you definitely want to use vmovd instead of vpinsrw $0, %[input], %%xmm0, because that would be more uops and have a false dependency. For all gcc knows or cares, you could have used the operand in an asm comment line like "# low word of input = %h2 \n".
GNU C inline asm is designed so that the constraints tell the compiler everything it needs to know. Thus, you need to manually cast s_idx to long.
You don't need to cast the input for ECX, because the sub instruction will zero-extend the result implicitly (into RCX). Your inputs are signed types, but presumably you are expecting the difference to always be positive.
Register inputs must always be assumed to have high garbage beyond the width of the input type. This is similar to how function args in the x86-64 System V calling convention can have garbage in the upper 32 bits, but (I assume) with no unwritten rule about extending out to 32 bits. (And note that after function inlining, your asm statement's inputs might not be function args. You don't want to use __attribute__((noinline)), and as I said it wouldn't help anyway.)
leading me to believe the compiler is effectively choosing the "m" constraint.
Yes, gcc -O0 spills everything to memory between every C statement (so you can change it with a debugger if stopped at a breakpoint). Thus, a memory operand is the most efficient choice for the compiler. It would need a load instruction to get it back into a register. i.e. the value is in memory before the asm statement, at -O0.
(clang is bad at multiple-option constraints and picks memory even at -O3, even when that means spilling first, but gcc doesn't have that problem.)
gcc -O0 (and clang) will use an immediate for a g constraint when the input is a numeric literal constant, e.g. "g" (1234). In your case, you get:
...
addq $1234, %rsi;
subl $1234, %ecx;
rep movsb
...
An input like "g" ((long)s_idx) will use a register even at -O0, just like x+y or any other temporary result (as long as s_idx isn't already long). Interestingly, even (unsigned) resulted in a register operand, even though int and unsigned are the same size and the cast takes no instructions. At this point you're seeing exactly how little gcc -O0 optimizes, because what you get is more dependent on how gcc internals are designed than on what makes sense or is efficient.
Compile with optimization enabled if you want to see interesting asm. See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at compiler output.
Checking the asm with optimization disabled is good for inline asm, too; you might not have realized the problem with the q override if the operand had just been a register, although it would still be a problem. Checking how it inlines into a few different callers at -O3 can be useful, too (especially if you test with some compile-time-constant inputs).
Your code is seriously broken
Besides the high-garbage problems discussed above, you modify input-operand registers without telling the compiler about it.
Fixing this by making some of them "+" read/write outputs means your asm statement is no longer volatile by default, so the compiler will optimize it away if the outputs are unused. (This includes after function inlining, so the return dest is sufficient for the standalone version, but not after inlining if the caller ignores the return value.)
You did use a "memory" clobber, so the compiler will assume that you read/write memory. You could tell it which memory you read and write, so it can optimize around your copy more efficiently. See get string length in inline GNU Assembler: you can use dummy memory input/output constraints like "m" (*(const char (*)[]) src)
char *asm_sub_str_fancyconstraints(char *dest, char *src, int s_idx, int e_idx) {
    asm (
        "addq %[s_idx], %%rsi; \n\t"  /* Add start index to src (ptrs are 64-bit) */
        "subl %k[s_idx], %%ecx; \n\t" /* Get length of substr as e - s (int is 32-bit) */
        // the calling convention requires DF=0, and inline-asm can safely assume it, too
        // (it's widely done, including in the Linux kernel)
        //"cld;"                      /* Clear direction bit (force increment) */
        "rep movsb; \n\t"             /* Move %ecx bytes of str at %esi into str at %edi */
        : [src] "+&S" (src), [dest] "+D" (dest), [e_idx] "+c" (e_idx)
        , "=m" (*(char (*)[]) dest)   // dummy output: all of dest
        : [s_idx] "g" ((long long)s_idx)
        , "m" (*(const char (*)[]) src)  // dummy input: tell the compiler we read all of src[0..infinity]
        : "cc"
    );
    return 0;  // asm statement not optimized away, even without volatile,
               // because of the memory output.
               // Just like dest++; could optimize away, but *dest = 0; couldn't.
}
formatting: note the use of \n\t at the end of each line for readability; otherwise the asm instructions are all on one line separated only by ;. (It will assemble fine, but not very human-readable if you're checking how your asm template worked out.)
This compiles (with gcc -O3) to
asm_sub_str_fancyconstraints:
movslq %edx, %rdx # from the (long long)s_idx
xorl %eax, %eax # from the return 0, which I changed to test that it doesn't optimize away
addq %rdx, %rsi;
subl %edx, %ecx; # your code zero-extends (e_idx - s_idx)
rep movsb;
ret
I put this + a couple other versions on the Godbolt compiler explorer with gcc + clang. A simpler version fixes the bugs but still uses a "memory" clobber + asm volatile to get correctness with more compile-time optimization cost than this version that tells the compiler which memory is read and written.
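That simpler variant might look roughly like this (a sketch: same bug fixes, but with a blunt "memory" clobber and asm volatile instead of the dummy memory operands):
static inline char *asm_sub_str_simple(char *dest, char *src, int s_idx, int e_idx) {
    char *ret = dest;                  // rep movsb advances rdi, so save the original
    asm volatile("addq %[s_idx], %%rsi \n\t"
                 "subl %k[s_idx], %%ecx \n\t"
                 "rep movsb"
                 : "+&S" (src), "+D" (dest), "+c" (e_idx) // registers we modify
                 : [s_idx] "r" ((long)s_idx)              // the cast fixes the high-garbage bug
                 : "memory", "cc");
    return ret;
}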
Early clobber: Note the "+&S" constraint:
If for some weird reason, the compiler knew that the src address and s_idx were equal, it could use the same register (esi/rsi) for both inputs. This would lead to modifying s_idx before it was used in the sub. Declaring that the register holding src is clobbered early (before all input registers are read for the last time) will force the compiler to choose different registers.
See the Godbolt link above for a caller that causes breakage without the & for early-clobber. (But only with the nonsensical src = (char*)s_idx;). Early-clobber declarations are often necessary for multi-instruction asm statements to prevent more realistic breakage possibilities, so definitely keep this in mind, and only leave it out when you're sure it's ok for any read-only input to share a register with an output or input/output operand. (Of course using specific-register constraints limits that possibility.)
I omitted the early-clobber declaration from e_idx in ecx, because the only "free" parameter is s_idx, and putting them both in the same register will result in sub same,same, and rep movsb running 0 iterations as desired.
It would of course be more efficient to let the compiler do the math, and simply ask for the inputs to rep movsb in the right registers. Especially if both e_idx and s_idx are compile-time constants, it's silly to force the compiler to mov an immediate to a register and then subtract another immediate.
Or even better, don't use inline asm at all. (But if you really want rep movsb to test its performance, inline asm is one way to do it. gcc also has tuning options that control how memcpy inlines, if at all.)
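If you do want to keep rep movsb, the "let the compiler do the math" version might look roughly like this (a sketch, reusing the dummy memory operands from above):
char *asm_sub_str_compute(char *dest, char *src, int s_idx, int e_idx) {
    char *ret = dest;
    const char *from = src + s_idx;          // compiler does the pointer math
    size_t len = (size_t)(e_idx - s_idx);    // and the length calculation
    asm("rep movsb"                          // movsb doesn't write flags, so no "cc" needed
        : "+D" (dest), "+S" (from), "+c" (len),
          "=m" (*(char (*)[]) ret)           // dummy output: dest bytes are written
        : "m" (*(const char (*)[]) from));   // dummy input: source bytes are read
    return ret;
}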
No inline asm answer is complete without recommending that you avoid inline asm if you can: https://gcc.gnu.org/wiki/DontUseInlineAsm

Fastest (CPU-wise) way to do functions in intel x64 assembly?

I've been reading about assembly functions and I'm confused as to whether to use enter and leave or just the call/return instructions for fast execution. Is one way faster and the other smaller? For example, what is the fastest (stdcall) way to do this in assembly without inlining the function:
static Int32 Add(Int32 a, Int32 b) {
    return a + b;
}

int main() {
    Int32 i = Add(1, 3);
}
Use call / ret, without making a stack frame with either enter / leave or push&pop rbp / mov rbp, rsp. gcc (with the default -fomit-frame-pointer) only makes a stack frame in functions that do variable-size allocation on the stack. This may make debugging slightly more difficult, since gcc normally emits stack unwind info when compiling with -fomit-frame-pointer, but your hand-written asm won't have that. Normally it only makes sense to write leaf functions in asm, or at least ones that don't call many other functions.
Stack frames mean you don't have to keep track of how much the stack pointer has changed since function entry to access stuff on the stack (e.g. function args and spill slots for locals). Both Windows and Linux/Unix 64bit ABIs pass the first few args in registers, and there are often enough regs that you don't have to spill any variables to the stack. Stack frames are a waste of instructions in most cases. In 32bit code, having ebp available (going from 6 to 7 GP regs, not counting the stack pointer) makes a bigger difference than going from 14 to 15. Of course, you still have to push/pop rbp if you do use it, though, because in both ABIs it's a callee-saved register that functions aren't allowed to clobber.
If you're optimizing x86-64 asm, you should read Agner Fog's guides, and check out some of the other links in the x86 tag wiki.
The best implementation of your function is probably:
align 16
global Add
Add:
    lea eax, [rdi + rsi]
    ret
    ; the high 32 of either reg doesn't affect the low 32 of the result,
    ; so we don't need to zero-extend or use a 32-bit address-size prefix
    ; like lea eax, [edi + esi]
    ; even if we're called with non-zeroed upper32 in rdi/rsi.

align 16
global main
main:
    mov edi, 1      ; 1st arg in SysV ABI
    mov esi, 3      ; 2nd arg in SysV ABI
    call Add
    ; return value in eax in all ABIs
    ret

align 16
OPmain:             ; This is what you get if you don't return anything from main to use the result of Add
    xor eax, eax
    ret
This is in fact what gcc emits for Add(), but it still turns main into an empty function, or into a return 4 if you return i. clang 3.7 respects -fno-inline-functions even when the result is a compile-time constant. It beats my asm by doing tail-call optimization, and jmping to Add.
Note that the Windows 64bit ABI uses different registers for function args. See the links in the x86 tag wiki, or Agner Fog's ABI guide. Assembler macros may help for writing functions in asm that use the correct registers for their args, depending on the platform you're targeting.
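For example, GCC's ms_abi attribute (discussed earlier) can be used to get the Windows register convention for a specific function when cross-checking hand-written asm (a sketch; Add_win is an illustrative name):
// Sketch: ask GCC for the Microsoft x64 calling convention, so the
// arguments arrive in ecx/edx rather than the SysV edi/esi.
__attribute__((ms_abi)) int Add_win(int a, int b) {
    return a + b;
}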

Why is memcmp so much faster than a for loop check?

Why is memcmp(a, b, size) so much faster than:
for (i = 0; i < nelements; i++) {
    if (a[i] != b[i]) return 0;
}
return 1;
Is memcmp a CPU instruction or something? It must be pretty deep because I got a massive speedup using memcmp over the loop.
memcmp is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.
As a "builtin"
GCC supports memcmp (as well as a ton of other functions) as builtins. In some versions / configurations of GCC, a call to memcmp will be recognized as __builtin_memcmp. Instead of emitting a call to the memcmp library function, GCC will emit a handful of instructions to act as an optimized inline version of the function.
On x86, this leverages the use of the cmpsb instruction, which compares a string of bytes at one memory location to another. This is coupled with the repe prefix, so the strings are compared until they are no longer equal, or a count is exhausted. (Exactly what memcmp does).
Given the following code:
int test(const void* s1, const void* s2, int count)
{
    return memcmp(s1, s2, count) == 0;
}
gcc version 3.4.4 on Cygwin generates the following assembly:
; (prologue)
mov esi, [ebp+arg_0] ; Move first pointer to esi
mov edi, [ebp+arg_4] ; Move second pointer to edi
mov ecx, [ebp+arg_8] ; Move length to ecx
cld ; Clear DF, the direction flag, so comparisons happen
; at increasing addresses
cmp ecx, ecx ; Special case: If length parameter to memcmp is
; zero, don't compare any bytes.
repe cmpsb ; Compare bytes at DS:ESI and ES:EDI, setting flags
; Repeat this while equal ZF is set
setz al ; Set al (return value) to 1 if ZF is still set
; (all bytes were equal).
; (epilogue)
Reference:
cmpsb instruction
As a library function
Highly-optimized versions of memcmp exist in many C standard libraries. These will usually take advantage of architecture-specific instructions to work with lots of data in parallel.
In Glibc, there are versions of memcmp for x86_64 that can take advantage of the following instruction set extensions:
SSE2 - sysdeps/x86_64/memcmp.S
SSE4 - sysdeps/x86_64/multiarch/memcmp-sse4.S
SSSE3 - sysdeps/x86_64/multiarch/memcmp-ssse3.S
The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has, and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S:
ENTRY(memcmp)
.type memcmp, #gnu_indirect_function
LOAD_RTLD_GLOBAL_RO_RDX
HAS_CPU_FEATURE (SSSE3)
jnz 2f
leaq __memcmp_sse2(%rip), %rax
ret
2: HAS_CPU_FEATURE (SSE4_1)
jz 3f
leaq __memcmp_sse4_1(%rip), %rax
ret
3: leaq __memcmp_ssse3(%rip), %rax
ret
END(memcmp)
In the Linux kernel
Linux does not seem to have an optimized version of memcmp for x86_64, but it does for memcpy, in arch/x86/lib/memcpy_64.S. Note that it uses the alternatives infrastructure (arch/x86/kernel/alternative.c) not only to decide at runtime which version to use, but to actually patch itself so this decision is made only once at boot-up.
It's usually a compiler intrinsic that is translated into fast assembly with specialized instructions for comparing blocks of memory.
intrinsic memcmp
Is memcmp a CPU instruction or something?
It is at least a very highly optimized compiler-provided intrinsic function. Possibly a single machine instruction, or two, depending on the platform, which you haven't specified.
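As a small illustration of that (a sketch; exactly what you get depends on the compiler, version, and options), a call with a small compile-time-constant size and an ==0 test usually gets expanded inline into a couple of wide loads and compares instead of a call to the library function:
#include <string.h>

// With gcc/clang at -O2, this typically compiles to two 8-byte loads per
// buffer plus a compare-and-set -- no call to memcmp at all.
int same16(const void *a, const void *b)
{
    return memcmp(a, b, 16) == 0;
}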

Inline assembly that clobbers the red zone

I'm writing a cryptography program, and the core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it extensively uses instructions like adc that are not easily accessible from C. I don't want to inline this function, because it's big and it's called several times in the inner loop.
Ideally I would also like to define a custom calling convention for this function, because internally it uses all the registers (except rsp), doesn't clobber its arguments, and returns in registers. Right now, it's adapted to the C calling convention, but of course this makes it slower (by about 10%).
To avoid this, I can call it with asm("call %Pn" : ... : my_function... : "cc", all the registers); but is there a way to tell GCC that the call instruction messes with the stack? Otherwise GCC will just put all those registers in the red zone, and the top one will get clobbered. I can compile the whole module with -mno-red-zone, but I'd prefer a way to tell GCC that, say, the top 8 bytes of the red zone will be clobbered so that it won't put anything there.
From your original question I did not realize gcc limited red-zone use to leaf functions. I don't think that's required by the x86_64 ABI, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function calling your assembly routine a non-leaf for purposes of compilation:
int global;
void other(void);

void was_leaf(void)
{
    if (global) other();
}
GCC can't tell if global will be true, so it can't optimize away the call to other() so was_leaf() is not a leaf function anymore. I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move %rsp and with the modification shown it did.
I also tried simply allocating more than 128 bytes (just char buf[150]) in a leaf but I was shocked to see it only did a partial subtraction:
pushq %rbp
movq %rsp, %rbp
subq $40, %rsp
movb $7, -155(%rbp)
If I put the leaf-defeating code back in that becomes subq $160, %rsp
The max-performance way might be to write the whole inner loop in asm (including the call instructions, if it's really worth it to unroll but not inline. Certainly plausible if fully inlining is causing too many uop-cache misses elsewhere).
Anyway, have C call an asm function containing your optimized loop.
BTW, clobbering all the registers makes it hard for gcc to make a very good loop, so you might well come out ahead from optimizing the whole loop yourself. (e.g. maybe keep a pointer in a register, and an end-pointer in memory, because cmp mem,reg is still fairly efficient).
Have a look at the code gcc/clang wrap around an asm statement that modifies an array element (on Godbolt):
void testloop(long *p, long count) {
    for (long i = 0; i < count; i++) {
        asm(" # XXX asm operand in %0"
            : "+r" (p[i])
            :
            : // "rax",
              "rbx", "rcx", "rdx", "rdi", "rsi", "rbp",
              "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
        );
    }
}
#gcc7.2 -O3 -march=haswell
push registers and other function-intro stuff
lea rcx, [rdi+rsi*8] ; end-pointer
mov rax, rdi
mov QWORD PTR [rsp-8], rcx ; store the end-pointer
mov QWORD PTR [rsp-16], rdi ; and the start-pointer
.L6:
# rax holds the current-position pointer on loop entry
# also stored in [rsp-16]
mov rdx, QWORD PTR [rax]
mov rax, rdx # looks like a missed optimization vs. mov rax, [rax], because the asm clobbers rdx
XXX asm operand in rax
mov rbx, QWORD PTR [rsp-16] # reload the pointer
mov QWORD PTR [rbx], rax
mov rax, rbx # another weird missed-optimization (lea rax, [rbx+8])
add rax, 8
mov QWORD PTR [rsp-16], rax
cmp QWORD PTR [rsp-8], rax
jne .L6
# cleanup omitted.
clang counts a separate counter down towards zero. But it uses load / add -1 / store instead of a memory-destination add [mem], -1 / jnz.
You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.
Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c latency. (It's still 1 uop on AMD, but higher latency especially on Bulldozer-family). Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or it saves more spill/reloads than it costs.
But of course XMM is not very viable for loop counters (paddd/pcmpeq/pmovmskb/cmp/jcc or psubd/ptest/jcc are not great compared to sub [mem], 1 / jcc), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another paddq sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.
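For instance, a 64-bit value can be parked in an XMM register and pulled back with single movq instructions via intrinsics (a sketch; whether this actually relieves pressure depends on how the surrounding code gets register-allocated):
#include <emmintrin.h>   // SSE2, x86-64

static inline __m128i stash_u64(unsigned long long v) {
    return _mm_cvtsi64_si128((long long)v);          // movq xmm, r64
}
static inline unsigned long long unstash_u64(__m128i x) {
    return (unsigned long long)_mm_cvtsi128_si64(x); // movq r64, xmm
}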
If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use add $-128, %rsp ; call ; sub $-128, %rsp to preserve the red-zone in those versions. (Note that -128 is encodeable as an imm8 but +128 isn't.)
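A sketch of that add/call/sub trick as a GNU C asm statement (the routine name, argument register, and clobber list here are hypothetical; they would have to match the real custom convention):
static inline unsigned long long call_wide_mul(unsigned long long *p) {
    unsigned long long ret;
    asm volatile("add  $-128, %%rsp \n\t"   // skip this function's 128-byte red zone
                 "call my_wide_mul  \n\t"   // hypothetical asm routine, arg in rdi
                 "sub  $-128, %%rsp"        // restore rsp
                 : "=a" (ret), "+D" (p)
                 : /* no other inputs */
                 : "rsi", "rdx", "rcx", "r8", "r9", "r10", "r11",
                   "memory", "cc");
    return ret;
}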
Including an actual function call in your C function doesn't necessarily make it safe to assume the red-zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red-zone, so clobbering all the registers in an asm statement is quite likely to trigger that behaviour.
// a non-leaf function that still uses the red-zone with gcc
void bar(void) {
    //cryptofunc(1);  // gcc/clang don't use the redzone after this (not future-proof)
    volatile int tmp = 1;
    (void)tmp;
    cryptofunc(1);    // but gcc will use the redzone before a tailcall
}
# gcc7.2 -O3 output
mov edi, 1
mov DWORD PTR [rsp-12], 1
mov eax, DWORD PTR [rsp-12]
jmp cryptofunc(long)
If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space since they have to adjust the stack anyway (to align rsp before a call). This is not future-proof at all, but should happen to work.
GNU C has an __attribute__((target("options"))) x86 function attribute, but it's not usable for arbitrary options, and -mno-red-zone is not one of the ones you can toggle on a per-function basis, or with #pragma GCC target ("options") within a compilation unit.
You can use stuff like
__attribute__(( target("sse4.1,arch=core2") ))
void penryn_version(void) {
...
}
but not __attribute__(( target("mno-red-zone") )).
There's a #pragma GCC optimize and an optimize function-attribute (both of which are not intended for production code), but #pragma GCC optimize ("-mno-red-zone") doesn't work either. I think the idea is to let some important functions be optimized with -O2 even in debug builds. You can set -f options or -O.
You could put the function in a file by itself and compile that compilation unit with -mno-red-zone, though. (And hopefully LTO will not break anything...)
Can't you just modify your assembly function to meet the requirements of a signal in the x86-64 ABI by shifting the stack pointer by 128 bytes on entry to your function?
Or if you are referring to the return pointer itself, put the shift into your call macro (so sub %rsp; call...)
Not sure but looking at GCC documentation for function attributes, I found the stdcall function attribute which might be of interest.
I'm still wondering what you find problematic with your asm call version. If it's just aesthetics, you could transform it into a macro or an inline function.
What about creating a dummy function that is written in C and does nothing but call the inline assembly?
