GCC inline assembly: "g" constraint and parameter size - c

Background
I am aware that solving the following problem with inline assembly is a bad idea. I'm currently learning inline assembly as part of a class on the linux kernel, and this was part of an assignment for that class.
The Setup
To begin with, below is a snippet of code that is almost correct, but instead segfaults. It is a function that copies the substring of src starting at index s_idx and ending (exclusively) at index e_idx into the pre-allocated dest using only inline assembly.
static inline char *asm_sub_str(char *dest, char *src, int s_idx, int e_idx) {
asm("addq %q2, %%rsi;" /* Add start index to src (ptrs are 64-bit) */
"subl %k2, %%ecx;" /* Get length of substr as e - s (int is 32-bit) */
"cld;" /* Clear direction bit (force increment) */
"rep movsb;" /* Move %ecx bytes of str at %esi into str at %edi */
: /* No Outputs */
: "S" (src), "D" (dest), "g" (s_idx), "c" (e_idx)
: "cc", "memory"
);
return dest;
}
The issue with this code is the constraint for the second input parameter. When compiled with gcc's default optimization and -ggdb, the following assembly is generated:
Dump of assembler code for function asm_sub_str:
0x00000000004008e6 <+0>: push %rbp
0x00000000004008e7 <+1>: mov %rsp,%rbp
0x00000000004008ea <+4>: mov %rdi,-0x8(%rbp)
0x00000000004008ee <+8>: mov %rsi,-0x10(%rbp)
0x00000000004008f2 <+12>: mov %edx,-0x14(%rbp)
0x00000000004008f5 <+15>: mov %ecx,-0x18(%rbp)
0x00000000004008f8 <+18>: mov -0x10(%rbp),%rax
0x00000000004008fc <+22>: mov -0x8(%rbp),%rdx
0x0000000000400900 <+26>: mov -0x18(%rbp),%ecx
0x0000000000400903 <+29>: mov %rax,%rsi
0x0000000000400906 <+32>: mov %rdx,%rdi
0x0000000000400909 <+35>: add -0x14(%rbp),%rsi
0x000000000040090d <+39>: sub -0x14(%rbp),%ecx
0x0000000000400910 <+42>: cld
0x0000000000400911 <+43>: rep movsb %ds:(%rsi),%es:(%rdi)
0x0000000000400913 <+45>: mov -0x8(%rbp),%rax
0x0000000000400917 <+49>: pop %rbp
0x0000000000400918 <+50>: retq
This is identical to the assembly that is generated when the second input parameter's constraint is set to "m" instead of "g", leading me to believe the compiler is effectively choosing the "m" constraint. In stepping through these instructions with gdb, I found that the offending instruction is +35, which adds the starting offset index s_idx to the src pointer in %rsi. The problem of course is that s_idx is only 32 bits, and the upper 4 bytes of a 64-bit integer at that location on the stack are not necessarily 0. On my machine they are in fact nonzero, which causes the addition to muddle the upper 4 bytes of %rsi and leads to a segfault at instruction +43.
The Question
Of course the solution to the above is to change the constraint of parameter 2 to "r" so it's placed in its own 64-bit register where the top 4 bytes are correctly zeroed and call it a day. Instead, my question is why does gcc resolve the "g" constraint as "m" instead of "r" in this case when the expression "%q2" indicates the value of parameter 2 will be used as a 64-bit value?
I don't know much about how gcc parses inline assembly, and I know there's not really a sense of typing in assembly, but I would think that gcc could recognize the effectively implicit cast of s_idx to a long when it's used as a 64-bit value in the first inline instruction. FWIW, if I explicitly change "g" (s_idx) to "g" ((long) s_idx), gcc then resolves the "g" constraint to "r" since (long) s_idx is a temporary value. I would think gcc could do that implicitly as well?

but I would think that gcc could recognize the effectively implicit cast of s_idx to a long when it's used as a 64-bit value in the first inline instruction.
No, gcc only looks at the constraints, not the asm template string at all, when compiling the surrounding code. The part of gcc that fills in the % template operands is totally separate from register-allocation and code-gen for the surrounding code.
Nothing checks for sanity or understands the context that template operands are being used in. Maybe you have a 16-bit input and want to copy it to a vector register with vmovd %k[input], %%xmm0 / vpbroadcastw %%xmm0, %%ymm0. The upper 16 bits are ignored, so you don't want gcc to waste time zero- or sign-extending it for you. But you definitely want to use vmovd instead of vpinsrw $0, %[input], %%xmm0, because that would be more uops and have a false dependency. For all gcc knows or cares, you could have used the operand in an asm comment line like "# low word of input = %h2 \n".
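As a concrete illustration of that hypothetical scenario, here is a minimal sketch (the function name is mine, and it assumes AVX2, i.e. compiling with -mavx2):

#include <immintrin.h>

/* Sketch only: broadcast the low 16 bits of x to all word lanes of a YMM register.
 * The upper bits of x's register are don't-cares, so no extension is requested. */
__m256i broadcast_word(short x) {
    __m256i v;
    asm("vmovd %k[input], %%xmm0    \n\t"   /* 32-bit move; bits 16-31 are ignored below */
        "vpbroadcastw %%xmm0, %[out]"       /* broadcast the low word to every lane */
        : [out] "=x" (v)
        : [input] "r" (x)
        : "xmm0");                          /* hard-coded scratch register */
    return v;
}

gcc only sees the constraints here ("r" input, "=x" output, xmm0 clobber); the choice of vmovd/vpbroadcastw, and the fact that the upper bits of the input are ignored, live entirely in the template string.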
GNU C inline asm is designed so that the constraints tell the compiler everything it needs to know. Thus, you need to manually cast s_idx to long.
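For example, a minimal sketch of just that fix (the function name is mine; the asm body is unchanged, and the register-clobbering problems discussed further down still apply):

/* Sketch: widen s_idx before passing it in, so %q2 reads a properly extended value. */
static inline char *asm_sub_str_fixed(char *dest, char *src, int s_idx, int e_idx) {
    asm("addq %q2, %%rsi;"   /* now reads a correctly sign-extended 64-bit value */
        "subl %k2, %%ecx;"
        "cld;"
        "rep movsb;"
        : /* No Outputs */
        : "S" (src), "D" (dest), "g" ((long) s_idx), "c" (e_idx)
        : "cc", "memory");
    return dest;
}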
You don't need to cast the input for ECX, because the sub instruction will zero-extend the result implicitly (into RCX). Your inputs are signed types, but presumably you are expecting the difference to always be positive.
Register inputs must always be assumed to have high garbage beyond the width of the input type. This is similar to how function args in the x86-64 System V calling convention can have garbage in the upper 32 bits, but (I assume) with no unwritten rule about extending out to 32 bits. (And note that after function inlining, your asm statement's inputs might not be function args. You don't want to use __attribute__((noinline)), and as I said it wouldn't help anyway.)
leading me to believe the compiler is effectively choosing the "m" constraint.
Yes, gcc -O0 spills everything to memory between every C statement (so you can change it with a debugger if stopped at a breakpoint). Thus, a memory operand is the most efficient choice for the compiler: it would need a load instruction to get the value back into a register, i.e. the value is already in memory before the asm statement at -O0.
(clang is bad at multiple-option constraints and picks memory even at -O3, even when that means spilling first, but gcc doesn't have that problem.)
gcc -O0 (and clang) will use an immediate for a g constraint when the input is a numeric literal constant, e.g. "g" (1234). In your case, you get:
...
addq $1234, %rsi;
subl $1234, %ecx;
rep movsb
...
An input like "g" ((long)s_idx) will use a register even at -O0, just like x+y or any other temporary result (as long as s_idx isn't already long). Interestingly, even (unsigned) resulted in a register operand, even though int and unsigned are the same size and the cast takes no instructions. At this point you're seeing exactly how little gcc -O0 optimizes, because what you get is more dependent on how gcc internals are designed than on what makes sense or is efficient.
Compile with optimization enabled if you want to see interesting asm. See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at compiler output.
Although checking the asm with optimization disabled is useful for inline asm, too: you might not have realized the problem with using a q override if the operand had been just a register, even though it is still a problem. Checking how it inlines into a few different callers at -O3 can be useful, too (especially if you test with some compile-time-constant inputs).
Your code is seriously broken
Besides the high-garbage problems discussed above, you modify input-operand registers without telling the compiler about it.
Fixing this by making some of them "+" read/write outputs means your asm statement is no longer volatile by default, so the compiler will optimize it away if the outputs are unused. (This includes after function inlining, so the return dest is sufficient for the standalone version, but not after inlining if the caller ignores the return value.)
You did use a "memory" clobber, so the compiler will assume that you read/write memory. You could tell it which memory you read and write, so it can optimize around your copy more efficiently. See get string length in inline GNU Assembler: you can use dummy memory input/output constraints like "m" (*(const char (*)[]) src)
char *asm_sub_str_fancyconstraints(char *dest, char *src, int s_idx, int e_idx) {
asm (
"addq %[s_idx], %%rsi; \n\t" /* Add start index to src (ptrs are 64-bit) */
"subl %k[s_idx], %%ecx; \n\t" /* Get length of substr as e - s (int is 32-bit) */
// the calling convention requires DF=0, and inline-asm can safely assume it, too
// (it's widely done, including in the Linux kernel)
//"cld;" /* Clear direction bit (force increment) */
"rep movsb; \n\t" /* Move %ecx bytes of str at %esi into str at %edi */
: [src]"+&S" (src), [dest]"+D" (dest), [e_idx]"+c" (e_idx)
, "=m" (*(char (*)[]) dest) // dummy output: all of dest
: [s_idx]"g" ((long long)s_idx)
, "m" (*(const char (*)[]) src) // dummy input: tell the compiler we read all of src[0..infinity]
: "cc"
);
return 0; // asm statement not optimized away, even without volatile,
// because of the memory output.
// Just like dest++; could optimize away, but *dest = 0; couldn't.
}
formatting: note the use of \n\t at the end of each line for readability; otherwise the asm instructions are all on one line separated only by ;. (It will assemble fine, but not very human-readable if you're checking how your asm template worked out.)
This compiles (with gcc -O3) to
asm_sub_str_fancyconstraints:
movslq %edx, %rdx # from the (long long)s_idx
xorl %eax, %eax # from the return 0, which I changed to test that it doesn't optimize away
addq %rdx, %rsi;
subl %edx, %ecx; # your code zero-extends (e_idx - s_idx)
rep movsb;
ret
I put this + a couple other versions on the Godbolt compiler explorer with gcc + clang. A simpler version fixes the bugs but still uses a "memory" clobber + asm volatile to get correctness with more compile-time optimization cost than this version that tells the compiler which memory is read and written.
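A sketch of such a simpler version (the function name is mine): the inputs that get modified become "+" read/write operands and s_idx is widened, but it keeps asm volatile plus a blunt "memory" clobber instead of the dummy memory operands:

char *asm_sub_str_basic(char *dest, char *src, int s_idx, int e_idx) {
    char *retval = dest;                 /* dest (RDI) is advanced by rep movsb */
    asm volatile(
        "addq %[s_idx], %%rsi;  \n\t"    /* src += s_idx (64-bit add, no garbage) */
        "subl %k[s_idx], %%ecx; \n\t"    /* count = e_idx - s_idx */
        "rep movsb;             \n\t"    /* DF=0 is guaranteed by the calling convention */
        : [src] "+&S" (src), [dest] "+D" (dest), [e_idx] "+c" (e_idx)
        : [s_idx] "r" ((long long) s_idx)
        : "cc", "memory");
    return retval;
}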
Early clobber: Note the "+&S" constraint:
If for some weird reason, the compiler knew that the src address and s_idx were equal, it could use the same register (esi/rsi) for both inputs. This would lead to modifying s_idx before it was used in the sub. Declaring that the register holding src is clobbered early (before all input registers are read for the last time) will force the compiler to choose different registers.
See the Godbolt link above for a caller that causes breakage without the & for early-clobber. (But only with the nonsensical src = (char*)s_idx;). Early-clobber declarations are often necessary for multi-instruction asm statements to prevent more realistic breakage possibilities, so definitely keep this in mind, and only leave it out when you're sure it's ok for any read-only input to share a register with an output or input/output operand. (Of course using specific-register constraints limits that possibility.)
I omitted the early-clobber declaration from e_idx in ecx, because the only "free" parameter is s_idx, and putting them both in the same register will result in sub same,same, and rep movsb running 0 iterations as desired.
It would of course be more efficient to let the compiler do the math, and simply ask for the inputs to rep movsb in the right registers. Especially if both e_idx and s_idx are compile-time constants, it's silly to force the compiler to mov an immediate to a register and then subtract another immediate.
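That approach might look like the following sketch (my function name; like the original, it assumes e_idx >= s_idx): the pointer and length math happen in C, and the asm statement is just the rep movsb with its inputs pinned to RDI/RSI/RCX.

char *asm_sub_str_simple(char *dest, char *src, int s_idx, int e_idx) {
    char *d = dest;
    const char *s = src + s_idx;           /* compiler does the pointer math */
    size_t len = (size_t)(e_idx - s_idx);  /* assumed non-negative */
    asm volatile("rep movsb"               /* DF=0 per the calling convention */
                 : "+D" (d), "+S" (s), "+c" (len)
                 : /* no other inputs */
                 : "memory");
    return dest;
}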
Or even better, don't use inline asm at all. (But if you really want rep movsb to test its performance, inline asm is one way to do it. gcc also has tuning options that control how memcpy inlines, if at all.)
No inline asm answer is complete without recommending that you don't use inline asm (https://gcc.gnu.org/wiki/DontUseInlineAsm) if you can avoid it.

Related

How to specify %edx to be the output instead of conventional %eax in inline assembly in C?

I am trying to use the following inline assembly in C to read the high word (%edx) of Time Stamp Counter for time measurement:
unsigned int tsc;
asm volatile ("rdtscp; xchgl %%edx, %%eax" : "=r" (tsc));
Unfortunately, the code crashes. If the xchgl instruction is removed, or use rdtsc instruction, there is no problem. For the former, although the code does not crash, I have no way to take out what I want -- the value in %edx register. I have checked an online document about inline assembly in C but failed to find any clue to return the value in %edx to the output C variable directly without any additional instructions (if I change xchgl to movl, the same crash occurs). With little hope that I missed any syntax or did not understand the document correctly, I come here to ask: is there any way to specify %edx register to be the output instead of conventional %eax in inline assembly in C? Thank you.
PS1: I am working in Linux on an intel i386 compatible CPU whose TSC works well to me.
PS2: For some reason I just need the value in %edx.
You need to specify the register as a constraint to tell the compiler that it has to pick a specific register.
unsigned int tsc;
asm volatile ("rdtsc" : "=d" (tsc) : : "eax");
"d" is the edx register as the output operand. "a" is the eax register in the clobber list because its content is altered by the instruction. There are two colons in between because there are no input operands.
Your "=r" lets the compiler pick any register; it only happens to pick EAX in a debug build in a function that doesn't inline. You also need to tell the compiler about all other registers that are modified, even if you don't want them as outputs; the compiler assumes all registers and memory are unmodified and unread unless you tell it otherwise. https://stackoverflow.com/tags/inline-assembly/info
Normally you'd use intrinsics instead of inline asm for portability and to avoid messing with asm:
How to get the CPU cycle count in x86_64 from C++?
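For example, a minimal sketch using the __rdtsc() intrinsic (the wrapper name is mine):

#include <x86intrin.h>

/* Sketch: __rdtsc() returns the full 64-bit counter; the shift extracts the
 * EDX half the question asked about. */
unsigned int tsc_high(void) {
    unsigned long long t = __rdtsc();
    return (unsigned int)(t >> 32);
}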
But if you did want safe inline asm to get both halves, on both 32 bit and 64 bit:
unsigned int tsc_hi, tsc_lo;
asm volatile ("rdtsc" : "=a" (tsc_lo), "=d" (tsc_hi));
Alternatively, you can use the "A" constraint for the value in the edx:eax register-pair on a 32-bit machine. But in a 64-bit build, "A" lets the compiler pick either RDX or RAX for the 64-bit integer; it would only be both with an unsigned __int128.
// Beware: breaks silently if compiled as 64-bit code
unsigned long long tsc;
asm volatile ("rdtsc" : "=A" (tsc));

Copy a byte to another register in GNU C inline asm, where the compiler chooses registers for both operands

I'm trying to mess around with strings in inline asm for c. I was able to understand how strcpy works (shown below):
static inline char *strcpy(char *dest, char *src)
{
int d0, d1, d2;
char temp;
asm volatile(
"loop: lodsb;" /* load value pointed to by %si into %al, increment %si */
" stosb;" /* move %al to address pointed to by %di, increment %di */
" testb %%al, %%al;"
" jne loop;"
: "=&S" (d0), "=&D" (d1), "=&a" (d2)
: "0" (src), "1" (dest)
: "memory"
);
}
I'm trying to use this structure to make it so I can modify individual characters of the string before returning them. As a result I'm attempting something that looks like:
static inline char *strcpy(char *dest, char *src)
{
int d0, d1, d2;
char temp;
asm volatile(
"loop: lodsb;" /* load value pointed to by %si into %al, increment %si */
" mov %2, %3;" /* move al into temp */
/*
*
* Do and comparisons and jumps based off how I want to change the characters
*
*/
" stosb;" /* move %al to address pointed to by %di, increment %di */
" testb %%al, %%al;"
" jne loop;"
: "=&S" (d0), "=&D" (d1), "=&a" (d2), "+r" (temp)
: "0" (src), "1" (dest)
: "memory"
);
}
Where I'm basically moving the byte put into %al by the lodsb instruction into a temp variable where I do any processing after. However, it seems that the character is never actually stored in temp for some reason I cannot figure out.
Your 2nd version won't even assemble because temp and d2 are different sizes, so you end up with mov %eax, %dl from GCC: https://godbolt.org/z/tng4g4. When inline asm doesn't do what you want, always look at the compiler-generated asm to see what the compiler actually substituted into your template (and what registers it picked for which operand).
This doesn't match what you describe (runs but doesn't work) so it's not a MCVE of exactly what you were doing. But the real question is still answerable.
One easy way is to declare both C temporaries the same size so GCC picks the same width registers.
Or you can use size overrides like mov %k2, %k3 to get movl or mov %b2, %b3 to get movb (8-bit operand-size).
Strangely you chose int for the "=a" temporary so the compiler picks EAX, even though you only load a char.
I'd actually recommend movzbl %b2, %k3 to use the opposite sizes from how you declared the variables; that's more efficient than merging a byte into the low byte of the destination, and avoids introducing (or adding more) partial-register problems on P6-family, early Sandybridge-family, and CPUs that don't do any partial-register renaming. Plus, Intel since Ivybridge can do mov-elimination on it.
BTW, your first version of strcpy looks safe and correct, nice. And yes, the "memory" clobber is necessary.
Err, at least the inline asm is correct. You have C undefined behaviour from falling off the end of a non-void function without a return statement, though.
You could simplify the asm operands with a "+&S"(src) read/write operand instead of a dummy output because you're inside a wrapper function (so it's ok to modify this function's local src). Dummy output with matching constraint is the canonical way to take an input in a register you want to destroy, though.
(If you want to work like ISO C's poorly-designed strcpy, you'd want char *retval = dst ahead of the asm statement, if you're going to use the above suggestion of "+S" and "+D" operands. A better idea would be to call it stpcpy and return a pointer to the end of the destination. Also, your src should be const char*.)
Of course it's not particularly efficient to use lodsb/stosb in a loop, especially on CPUs that don't rename AL separately from RAX so every load also needs an ALU uop to merge into RAX. But byte-at-a-time is much worse than you can do with SSE2 so optimizing this with movzx loads, and maybe an indexed addressing mode is probably not worth the trouble. See https://agner.org/optimize/ and other optimization links in https://stackoverflow.com/tags/x86/info, especially https://uops.info/ for instruction latency / throughput / uop count. (stosb is 3 uops vs. 2 total for a mov store + inc edi.)
If you're actually optimizing for code-size over speed, just use 8-bit or 32-bit mov to copy registers, not movzbl.
BTW, with this many operands, you probably want to use named operands like [src] "+&S"(src) in the constraints, and then %[src] in the template.
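Putting those suggestions together, a sketch might look like this (the function name, the numeric local labels, and the placeholder comment are mine; it keeps the lodsb/stosb structure from the question):

static inline char *strcpy_tweaked(char *dest, const char *src)
{
    char *retval = dest;            /* dest (RDI) is advanced by stosb below */
    char al_byte, temp;             /* dummy outputs for the clobbered registers */
    asm volatile(
        "1: lodsb                  \n\t"  /* (%%rsi) -> %%al, %%rsi++ */
        "   movzbl %b[al], %k[tmp] \n\t"  /* zero-extend %%al into tmp's register */
        /*  ...inspect/modify %k[tmp] here, write it back to %b[al] if changed... */
        "   stosb                  \n\t"  /* %%al -> (%%rdi), %%rdi++ */
        "   testb %%al, %%al       \n\t"
        "   jne 1b"                       /* numeric label: safe if inlined twice */
        : [src] "+&S" (src), [dest] "+&D" (dest),
          [al] "=&a" (al_byte), [tmp] "=&r" (temp)
        : /* inputs are carried by the "+" operands */
        : "memory");
    return retval;
}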

c inline assembly getting "operand size mismatch" when using cmpxchg

I'm trying to use cmpxchg with inline assembly through c. This is my code:
static inline int
cas(volatile void* addr, int expected, int newval) {
int ret;
asm volatile("movl %2 , %%eax\n\t"
"lock; cmpxchg %0, %3\n\t"
"pushfl\n\t"
"popl %1\n\t"
"and $0x0040, %1\n\t"
: "+m" (*(int*)addr), "=r" (ret)
: "r" (expected), "r" (newval)
: "%eax"
);
return ret;
}
This is my first time using inline and i'm not sure what could be causing this problem.
I tried "cmpxchgl" as well, but still nothing. Also tried removing the lock.
I get "operand size mismatch".
I think maybe it has something to do with the casting i do to addr, but i'm unsure. I try and exchange int for int, so don't really understand why there would be a size mismatch.
This is using AT&T style.
Thanks
As #prl points out, you reversed the operands, putting them in Intel order (See Intel's manual entry for cmpxchg). Any time your inline asm doesn't assemble, you should look at the asm the compiler was feeding to the assembler to see what happened to your template. In your case, simply remove the static inline so the compiler will make a stand-alone definition, then you get (on the Godbolt compiler explorer):
# gcc -S output for the original, with cmpxchg operands backwards
movl %edx , %eax
lock; cmpxchg (%ecx), %ebx # error on this line from the assembler
pushfl
popl %edx
and $0x0040, %edx
Sometimes that will clue your eye / brain in cases where staring at %3 and %0 didn't, especially after you check the instruction-set reference manual entry for cmpxchg and see that the memory operand is the destination (Intel-syntax first operand, AT&T syntax last operand).
This makes sense because the explicit register operand is only ever a source, while EAX and the memory operand are both read and then one or the other is written depending on the success of the compare. (And semantically you use cmpxchg as a conditional store to a memory destination.)
You're discarding the load result from the cas-failure case. I can't think of any use-cases for cmpxchg where doing a separate load of the atomic value would be incorrect, rather than just inefficient, but the usual semantics for a CAS function is that oldval is taken by reference and updated on failure. (At least that's how C++11 std::atomic and C11 stdatomic do it with bool atomic_compare_exchange_weak( volatile A *obj, C* expected, C desired );.)
(The weak/strong thing allows better code-gen for CAS retry-loops on targets that use LL/SC, where spurious failure is possible due to an interrupt or being rewritten with the same value. x86's lock cmpxchg is "strong")
Actually, GCC's legacy __sync builtins provide 2 separate CAS functions: one that returns the old value, and one that returns a bool. Both take the expected and new values by value rather than by reference. So it's not the same API that C++11 uses, but apparently it isn't so horrible that nobody used it.
Your overcomplicated code isn't portable to x86-64. From your use of popl, I assume you developed it on x86-32. You don't need pushf/pop to get ZF as an integer; that's what setcc is for. cmpxchg example for 64 bit integer has a 32-bit example that works that way (to show what they want a 64-bit version of).
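For example, a setcc-based version might look like the following sketch (names are mine; it keeps the question's return-only-a-bool API, so the discarded-load issue above still applies), and unlike pushfl/popl it also works in 64-bit code:

static inline int cas_setz(volatile int *addr, int expected, int newval)
{
    unsigned char ok;
    asm volatile("lock; cmpxchg %3, %0 \n\t"   /* memory destination last (AT&T) */
                 "setz %1"                     /* ZF -> ok, no pushf/popf needed */
                 : "+m" (*addr), "=q" (ok), "+a" (expected)
                 : "r" (newval)
                 : "memory", "cc");
    return ok;
}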
Or even better, use GCC6 flag-return syntax so using this in a loop can compile to a cmpxchg / jne loop instead of cmpxchg / setz %al / test %al,%al / jnz.
We can fix all of those problems and improve the register allocation as well. (If the first or last instruction of an inline-asm statement is mov, you're probably using constraints inefficiently.)
Of course, by far the best thing for real usage would be to use C11 stdatomic or a GCC builtin. https://gcc.gnu.org/wiki/DontUseInlineAsm in cases where the compiler can emit just as good (or better) asm from code it "understands", because inline asm constrains the compiler. It's also difficult to write correctly and efficiently, and to maintain.
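For comparison, the no-asm route could be as simple as this sketch using C11 stdatomic (names are mine):

#include <stdatomic.h>
#include <stdbool.h>

/* Sketch: compiles to lock cmpxchg on x86 without any inline asm. */
bool cas_c11(_Atomic int *obj, int *expected, int desired)
{
    return atomic_compare_exchange_strong(obj, expected, desired);
}

On failure, atomic_compare_exchange_strong writes the current value back through expected, which is the usual by-reference CAS API discussed above.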
Portable to i386 and x86-64, AT&T or Intel syntax, and works for any integer type width of register width or smaller:
// Note: oldVal by reference
static inline char CAS_flagout(int *ptr, int *poldVal, int newVal)
{
char ret;
__asm__ __volatile__ (
" lock; cmpxchg {%[newval], %[mem] | %[mem], %[newval]}\n"
: "=#ccz" (ret), [mem] "+m" (*ptr), "+a" (*poldVal)
: [newval]"r" (newVal)
: "memory"); // barrier for compiler reordering around this
return ret; // ZF result, 1 on success else 0
}
// spinning read-only is much better (with _mm_pause in the retry loop)
// not hammering on the cache line with lock cmpxchg.
// This is over-simplified so the asm is super-simple.
void cas_retry(int *lock) {
int oldval = 0;
while(!CAS_flagout(lock, &oldval, 1)) oldval = 0;
}
The { foo,bar | bar,foo } syntax is for asm dialect alternatives. For x86, it's {AT&T | Intel}. The %[newval] is a named operand constraint; it's another way to keep your operands straight. The "=@ccz" output constraint takes the Z condition code (ZF) as the output value, like a setz.
Compiles on Godbolt to this asm for 32-bit x86 with AT&T output:
cas_retry:
pushl %ebx
movl 8(%esp), %edx # load the pointer arg.
movl $1, %ecx
xorl %ebx, %ebx
.L2:
movl %ebx, %eax # xor %eax,%eax would save a lot of insns
lock; cmpxchg %ecx, (%edx)
jne .L2
popl %ebx
ret
gcc is dumb and stores a 0 in one reg before copying it to eax, instead of re-zeroing eax inside the loop. This is why it needs to save/restore EBX at all. It's the same asm we get from avoiding inline-asm, though (from x86 spinlock using cmpxchg):
// also omits _mm_pause and read-only retry, see the linked question
void spin_lock_oversimplified(int *p) {
while(!__sync_bool_compare_and_swap(p, 0, 1));
}
Someone should teach gcc that Intel CPUs can materialize a 0 more cheaply with xor-zeroing than they can copy it with mov, especially on Sandybridge (xor-zeroing elimination but no mov-elimination).
You had the operand order for the cmpxchg instruction reversed. AT&T syntax needs the memory destination last:
"lock; cmpxchg %3, %0\n\t"
Or you could compile that instruction with its original order using -masm=intel, but the rest of your code is AT&T syntax and ordering so that's not the right answer.
As far as why it says "operand size mismatch", I can only say that that appears to be an assembler bug, in that it uses the wrong message.

Operand types for movaps

I'm trying to load 4 packed floats into xmm0 register:
float *f=(float*)_aligned_malloc(16,16);
asm volatile
(
"movaps %0,%%xmm0"
:
:"r"(f)
:"%xmm0","memory"
);
But I get this error:
operand type mismatch for `movaps'
How can I fix it?
You can just use an intrinsic, rather than trying to "re-invent the wheel":
#include <xmmintrin.h>
__m128 v = _mm_load_ps(f); // compiles to movaps
This just seems like a bad idea. If you want to write a whole block in asm, then do that, but don't try to build your own version of intrinsics using separate single-instruction asm blocks. It will not perform well, and you can't force register allocation between asm blocks.
You could maybe use stuff like __m128 foo asm("xmm2"); to have the compiler keep that C variable in xmm2, but that's not guaranteed to be respected except when used as an operand for an asm statement. The optimizer will still do its job.
I tried to make use of all 16 xmm registers in my program which uses intrinsics, but the assembly output of the code shows that only 4 xmm registers are actually used. So I thought it's better to implement it by inline assembly instead.
The compiler won't use extra registers for no reason; only if it needs to keep more values live at once. x86 is out-of-order with register renaming, so there are no write-after-read or write-after-write hazards; reusing the same register for something independent is not a problem. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?. A write-only access to a full register has no dependency on the old value of the register. (Merging with the old value does create a dependency, though, like movss %xmm1, %xmm0, which is why you should use movaps to copy registers even if you only care about the low element.)
Your template expands to something like movaps %rax, %xmm0, which of course doesn't assemble: movaps needs a memory (or xmm) source, not an integer register.
The best way is usually to tell the compiler about the memory operand, so you don't need a "memory" clobber or a separate "dummy" operand. (A pointer operand in a register doesn't imply that the pointed-to memory needs to be in sync).
But note that the memory operand has to have the right size, so the compiler knows you read 4 floats starting at that address. If you just used "m" (*f), it could still reorder your asm with an assignment to f[3]. (Yes, even with asm volatile, unless f[3] was also a volatile access.)
typedef float v4sf __attribute__((vector_size(16),may_alias));
// static inline
v4sf my_load_ps(float *f) {
v4sf my_vec;
asm(
"movaps %[input], %[output]"
: [output] "=x" (my_vec)
: [input] "m" (*(v4sf*)f)
: // no clobbers
);
return my_vec;
}
(On Godbolt)
Using a memory operand lets the compiler pick the addressing mode, so it can still unroll loops if you use this inside a loop. e.g. adding f+=16 to this function results in
movaps 64(%rdi), %xmm0
ret
instead of add $64, %rdi / movaps (%rdi), %xmm0 like you'd get if you hard-coded the addressing mode. See Looping over arrays with inline assembly.
movaps into a clobbered register is completely pointless. Use an "=x" output constraint for a vector of float. If you were planning to write a separate asm() statement that assumed something would still be in xmm0, that's not safe because the compiler could have used xmm0 to copy 16 bytes, or for scalar math. asm volatile doesn't make that safe.
Hopefully you were planning to add more instructions to the same asm statement, though.
You need to place the pointer operand inside parentheses:
"movaps (%0),%%xmm0"

Inline assembly that clobbers the red zone

I'm writing a cryptography program, and the core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it extensively uses instructions like adc that are not easily accessible from C. I don't want to inline this function, because it's big and it's called several times in the inner loop.
Ideally I would also like to define a custom calling convention for this function, because internally it uses all the registers (except rsp), doesn't clobber its arguments, and returns in registers. Right now, it's adapted to the C calling convention, but of course this makes it slower (by about 10%).
To avoid this, I can call it with asm("call %Pn" : ... : my_function... : "cc", all the registers); but is there a way to tell GCC that the call instruction messes with the stack? Otherwise GCC will just put all those registers in the red zone, and the top one will get clobbered. I can compile the whole module with -mno-red-zone, but I'd prefer a way to tell GCC that, say, the top 8 bytes of the red zone will be clobbered so that it won't put anything there.
From your original question I did not realize gcc limited red-zone use to leaf functions. I don't think that's required by the x86_64 ABI, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function calling your assembly routine a non-leaf for purposes of compilation:
int global;
void other(void);

void was_leaf(void)
{
    if (global) other();
}
GCC can't tell if global will be true, so it can't optimize away the call to other() so was_leaf() is not a leaf function anymore. I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move %rsp and with the modification shown it did.
I also tried simply allocating more than 128 bytes (just char buf[150]) in a leaf but I was shocked to see it only did a partial subtraction:
pushq %rbp
movq %rsp, %rbp
subq $40, %rsp
movb $7, -155(%rbp)
If I put the leaf-defeating code back in that becomes subq $160, %rsp
The max-performance way might be to write the whole inner loop in asm (including the call instructions, if it's really worth it to unroll but not inline. Certainly plausible if fully inlining is causing too many uop-cache misses elsewhere).
Anyway, have C call an asm function containing your optimized loop.
BTW, clobbering all the registers makes it hard for gcc to make a very good loop, so you might well come out ahead from optimizing the whole loop yourself. (e.g. maybe keep a pointer in a register, and an end-pointer in memory, because cmp mem,reg is still fairly efficient).
Have a look at the code gcc/clang wrap around an asm statement that modifies an array element (on Godbolt):
void testloop(long *p, long count) {
for (long i = 0 ; i < count ; i++) {
asm(" # XXX asm operand in %0"
: "+r" (p[i])
:
: // "rax",
"rbx", "rcx", "rdx", "rdi", "rsi", "rbp",
"r8", "r9", "r10", "r11", "r12","r13","r14","r15"
);
}
}
#gcc7.2 -O3 -march=haswell
push registers and other function-intro stuff
lea rcx, [rdi+rsi*8] ; end-pointer
mov rax, rdi
mov QWORD PTR [rsp-8], rcx ; store the end-pointer
mov QWORD PTR [rsp-16], rdi ; and the start-pointer
.L6:
# rax holds the current-position pointer on loop entry
# also stored in [rsp-16]
mov rdx, QWORD PTR [rax]
mov rax, rdx # looks like a missed optimization vs. mov rax, [rax], because the asm clobbers rdx
XXX asm operand in rax
mov rbx, QWORD PTR [rsp-16] # reload the pointer
mov QWORD PTR [rbx], rax
mov rax, rbx # another weird missed-optimization (lea rax, [rbx+8])
add rax, 8
mov QWORD PTR [rsp-16], rax
cmp QWORD PTR [rsp-8], rax
jne .L6
# cleanup omitted.
clang counts a separate counter down towards zero. But it uses load / add -1 / store instead of a memory-destination add [mem], -1 / jnz.
You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.
Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c latency. (It's still 1 uop on AMD, but higher latency especially on Bulldozer-family). Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or it saves more spill/reloads than it costs.
But of course XMM is not very viable for loop counters (paddd/pcmpeq/pmovmskb/cmp/jcc or psubd/ptest/jcc are not great compared to sub [mem], 1 / jcc), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another paddq sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.
If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use add $-128, %rsp ; call ; sub $-128, %rsp to preserve the red-zone in those versions. (Note that -128 is encodeable as an imm8 but +128 isn't.)
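A heavily hedged sketch of that add/sub wrapper (crypto_mul is a hypothetical external asm routine; the clobber list below is just the standard call-clobbered set and must be adjusted to whatever the real routine actually touches):

extern void crypto_mul(void);   /* hypothetical asm routine with a custom convention */

static inline void call_crypto_mul_outside_loop(void)
{
    asm volatile("add  $-128, %%rsp \n\t"   /* step %rsp past the red zone */
                 "call crypto_mul   \n\t"
                 "sub  $-128, %%rsp"        /* -128 fits in an imm8; +128 doesn't */
                 : /* no outputs */
                 : /* no inputs */
                 : "rax", "rcx", "rdx", "rsi", "rdi",
                   "r8", "r9", "r10", "r11", "memory", "cc");
}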
Including an actual function call in your C function doesn't necessarily make it safe to assume the red-zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red-zone, so clobbering all the registers in an asm statement is quite likely to trigger that behaviour.
// a non-leaf function that still uses the red-zone with gcc
void bar(void) {
//cryptofunc(1); // gcc/clang don't use the redzone after this (not future-proof)
volatile int tmp = 1;
(void)tmp;
cryptofunc(1); // but gcc will use the redzone before a tailcall
}
# gcc7.2 -O3 output
mov edi, 1
mov DWORD PTR [rsp-12], 1
mov eax, DWORD PTR [rsp-12]
jmp cryptofunc(long)
If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space since they have to adjust the stack anyway (to align rsp before a call). This is not future-proof at all, but should happen to work.
GNU C has an __attribute__((target("options"))) x86 function attribute, but it's not usable for arbitrary options, and -mno-red-zone is not one of the ones you can toggle on a per-function basis, or with #pragma GCC target ("options") within a compilation unit.
You can use stuff like
__attribute__(( target("sse4.1,arch=core2") ))
void penryn_version(void) {
...
}
but not __attribute__(( target("mno-red-zone") )).
There's a #pragma GCC optimize and an optimize function-attribute (both of which are not intended for production code), but #pragma GCC optimize ("-mno-red-zone") doesn't work either. I think the idea is to let some important functions be optimized with -O2 even in debug builds. You can only set -f options or -O levels with it, not -m target options.
You could put the function in a file by itself and compile that compilation unit with -mno-red-zone, though. (And hopefully LTO will not break anything...)
Can't you just modify your assembly function to respect the red zone required by the x86-64 ABI, by shifting the stack pointer by 128 bytes on entry to your function?
Or if you are referring to the return pointer itself, put the shift into your call macro (so sub $128, %rsp; call ...).
Not sure but looking at GCC documentation for function attributes, I found the stdcall function attribute which might be of interest.
I'm still wondering what you find problematic with your asm call version. If it's just aesthetics, you could transform it into a macro, or a inline function.
What about creating a dummy function that is written in C and does nothing but call the inline assembly?
