C inline assembly getting "operand size mismatch" when using cmpxchg

I'm trying to use cmpxchg with inline assembly through C. This is my code:
static inline int
cas(volatile void* addr, int expected, int newval) {
    int ret;
    asm volatile("movl %2 , %%eax\n\t"
                 "lock; cmpxchg %0, %3\n\t"
                 "pushfl\n\t"
                 "popl %1\n\t"
                 "and $0x0040, %1\n\t"
                 : "+m" (*(int*)addr), "=r" (ret)
                 : "r" (expected), "r" (newval)
                 : "%eax"
                 );
    return ret;
}
This is my first time using inline asm and I'm not sure what could be causing this problem.
I tried "cmpxchgl" as well, but still nothing. Also tried removing the lock.
I get "operand size mismatch".
I think maybe it has something to do with the casting I do to addr, but I'm unsure. I'm trying to exchange an int for an int, so I don't really understand why there would be a size mismatch.
This is using AT&T style.
Thanks

As @prl points out, you reversed the operands, putting them in Intel order (see Intel's manual entry for cmpxchg). Any time your inline asm doesn't assemble, you should look at the asm the compiler was feeding to the assembler to see what happened to your template. In your case, simply remove the static inline so the compiler will make a stand-alone definition, then you get (on the Godbolt compiler explorer):
# gcc -S output for the original, with cmpxchg operands backwards
movl %edx , %eax
lock; cmpxchg (%ecx), %ebx # error on this line from the assembler
pushfl
popl %edx
and $0x0040, %edx
Sometimes that will clue your eye / brain in cases where staring at %3 and %0 didn't, especially after you check the instruction-set reference manual entry for cmpxchg and see that the memory operand is the destination (Intel-syntax first operand, AT&T syntax last operand).
This makes sense because the explicit register operand is only ever a source, while EAX and the memory operand are both read and then one or the other is written depending on the success of the compare. (And semantically you use cmpxchg as a conditional store to a memory destination.)
You're discarding the load result from the cas-failure case. I can't think of any use-cases for cmpxchg where doing a separate load of the atomic value would be incorrect, rather than just inefficient, but the usual semantics for a CAS function is that oldval is taken by reference and updated on failure. (At least that's how C++11 std::atomic and C11 stdatomic do it with bool atomic_compare_exchange_weak( volatile A *obj, C* expected, C desired );.)
(The weak/strong thing allows better code-gen for CAS retry-loops on targets that use LL/SC, where spurious failure is possible due to an interrupt or being rewritten with the same value. x86's lock cmpxchg is "strong")
Actually, GCC's legacy __sync builtins provide 2 separate CAS functions: one that returns the old value, and one that returns a bool. Both take the old/new values by value. So it's not the same API that C++11 uses, but apparently it isn't so horrible that nobody used it.
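For reference, a minimal sketch of those two legacy flavours (the shared variable and wrapper names here are mine):

int shared;   // some variable accessed by multiple threads

int cas_bool(int expected, int newval) {
    // returns 1 if the swap happened, 0 if not
    return __sync_bool_compare_and_swap(&shared, expected, newval);
}
int cas_val(int expected, int newval) {
    // returns the old contents of shared, whether or not the swap happened
    return __sync_val_compare_and_swap(&shared, expected, newval);
}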
Your overcomplicated code isn't portable to x86-64. From your use of popl, I assume you developed it on x86-32. You don't need pushf/pop to get ZF as an integer; that's what setcc is for. cmpxchg example for 64 bit integer has a 32-bit example that works that way (to show what they want a 64-bit version of).
Or even better, use GCC6 flag-return syntax so using this in a loop can compile to a cmpxchg / jne loop instead of cmpxchg / setz %al / test %al,%al / jnz.
We can fix all of those problems and improve the register allocation as well. (If the first or last instruction of an inline-asm statement is mov, you're probably using constraints inefficiently.)
Of course, by far the best thing for real usage would be to use C11 stdatomic or a GCC builtin. See https://gcc.gnu.org/wiki/DontUseInlineAsm: in cases where the compiler can emit just-as-good (or better) asm from code it "understands", inline asm only constrains the compiler. It's also difficult to write correctly and efficiently, and to maintain.
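For example, a minimal sketch of this CAS with C11 stdatomic (the wrapper name is mine; expected is updated on failure, as usual for this API):

#include <stdatomic.h>
#include <stdbool.h>

static inline bool cas_c11(atomic_int *obj, int *expected, int desired) {
    // compiles to lock cmpxchg on x86; *expected gets the old value on failure
    return atomic_compare_exchange_strong(obj, expected, desired);
}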
If you do want inline asm: the following is portable to i386 and x86-64, AT&T or Intel syntax, and works for any integer type of register width or smaller:
// Note: oldVal by reference
static inline char CAS_flagout(int *ptr, int *poldVal, int newVal)
{
    char ret;
    __asm__ __volatile__ (
        " lock; cmpxchg {%[newval], %[mem] | %[mem], %[newval]}\n"
        : "=@ccz" (ret), [mem] "+m" (*ptr), "+a" (*poldVal)
        : [newval]"r" (newVal)
        : "memory"); // barrier for compiler reordering around this
    return ret; // ZF result, 1 on success else 0
}
// spinning read-only is much better (with _mm_pause in the retry loop)
// not hammering on the cache line with lock cmpxchg.
// This is over-simplified so the asm is super-simple.
void cas_retry(int *lock) {
    int oldval = 0;
    while(!CAS_flagout(lock, &oldval, 1)) oldval = 0;
}
The { foo,bar | bar,foo } is the ASM dialect-alternatives syntax; for x86 the dialects are {AT&T | Intel}. %[newval] is a named operand; named operands are another way to keep your operands straight. The "=@ccz" constraint takes the z condition code (ZF) as the output value, like a setz would.
Compiles on Godbolt to this asm for 32-bit x86 with AT&T output:
cas_retry:
pushl %ebx
movl 8(%esp), %edx # load the pointer arg.
movl $1, %ecx
xorl %ebx, %ebx
.L2:
movl %ebx, %eax # xor %eax,%eax would save a lot of insns
lock; cmpxchg %ecx, (%edx)
jne .L2
popl %ebx
ret
gcc is dumb and stores a 0 in one reg before copying it to eax, instead of re-zeroing eax inside the loop. This is why it needs to save/restore EBX at all. It's the same asm we get from avoiding inline-asm, though (from x86 spinlock using cmpxchg):
// also omits _mm_pause and read-only retry, see the linked question
void spin_lock_oversimplified(int *p) {
while(!__sync_bool_compare_and_swap(p, 0, 1));
}
Someone should teach gcc that Intel CPUs can materialize a 0 more cheaply with xor-zeroing than they can copy it with mov, especially on Sandybridge (xor-zeroing elimination but no mov-elimination).

You had the operand order for the cmpxchg instruction reversed. AT&T syntax needs the memory destination last:
"lock; cmpxchg %3, %0\n\t"
Or you could compile that instruction with its original order using -masm=intel, but the rest of your code is AT&T syntax and ordering so that's not the right answer.
As far as why it says "operand size mismatch", I can only say that that appears to be an assembler bug, in that it uses the wrong message.
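For reference, the question's function with just that operand order fixed (a minimal sketch that keeps the original pushfl/popl approach and i386-only code; a "memory" clobber is also added here so it can be used for synchronization):

static inline int
cas(volatile void *addr, int expected, int newval) {
    int ret;
    asm volatile("movl %2, %%eax\n\t"
                 "lock; cmpxchg %3, %0\n\t"  /* memory destination last in AT&T syntax */
                 "pushfl\n\t"
                 "popl %1\n\t"
                 "andl $0x0040, %1\n\t"      /* ZF is bit 6 of EFLAGS */
                 : "+m" (*(int *)addr), "=r" (ret)
                 : "r" (expected), "r" (newval)
                 : "%eax", "memory");
    return ret;
}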

Related

error: unsupported size for integer register

I'm using i686 gcc on Windows. When I built the code with separate asm statements, it worked. However, when I try to combine it into one statement, it doesn't build and gives me an error: unsupported size for integer register.
Here's my code
u8 lstatus;
u8 lsectors_read;
u8 data_buffer;
void operate(u8 opcode, u8 sector_size, u8 track, u8 sector, u8 head, u8 drive, u8* buffer, u8* status, u8* sectors_read)
{
    asm volatile("mov %3, %%ah;\n"
                 "mov %4, %%al;\n"
                 "mov %5, %%ch;\n"
                 "mov %6, %%cl;\n"
                 "mov %7, %%dh;\n"
                 "mov %8, %%dl;\n"
                 "int $0x13;\n"
                 "mov %%ah, %0;\n"
                 "mov %%al, %1;\n"
                 "mov %%es:(%%bx), %2;\n"
                 : "=r"(lstatus), "=r"(lsectors_read), "=r"(buffer)
                 : "r"(opcode), "r"(sector_size), "r"(track), "r"(sector), "r"(head), "r"(drive)
                 :);
    status = &lstatus;
    sectors_read = &lsectors_read;
    buffer = &data_buffer;
}
The error message is a little misleading. It seems to be happening because GCC ran out of 8-bit registers.
Interestingly, it compiles without error messages if you just edit the template to remove references to the last 2 operands (https://godbolt.org/z/oujNP7), even without dropping them from the list of input constraints! (Trimming down your asm statement is a useful debugging technique to figure out which part of it GCC doesn't like, without caring for now if the asm will do anything useful.)
Removing 2 earlier operands and changing numbers shows that "r"(head), "r"(drive) weren't specifically a problem, just the combination of everything.
It looks like GCC is avoiding high-8 registers like AH as inputs, and x86-16 only has 4 low-8 registers but you have 6 u8 inputs. So I think GCC means it ran out of byte registers that it was willing to use.
(The 3 outputs aren't declared early-clobber so they're allowed to overlap the inputs.)
You could maybe work around this by using "rm" to give GCC the option of picking a memory input. (The x86-specific constraints like "Q" that are allowed to pick a high-8 register wouldn't help unless you require it to pick the correct one to get the compiler to emit a mov for you.) That would probably let your code compile, but the result would be totally broken.
You re-introduced basically the same bugs as before: not telling the compiler which registers you write, so for example your mov %4, %%al will overwrite one of the registers GCC picked as an input, before you actually read that operand.
Declaring clobbers on all the registers you use would leave not enough registers to hold all the input variables. (Unless you allow memory source operands.) That could work but is very inefficient: if your asm template string starts or ends with mov, you're almost always doing it wrong.
Also, there are other serious bugs, apart from how you're using inline asm. You don't supply an input pointer for your buffer. int $0x13 doesn't allocate a new buffer for you; it needs a pointer in ES:BX (which it dereferences but leaves unmodified). GCC requires that ES=DS=SS, so you already have to have set up segmentation properly before calling into your C code; it isn't something you have to do on every call.
Plus, even in C terms outside the inline asm, your function doesn't make sense. status = &lstatus; modifies the value of a function arg; it doesn't dereference it to modify a pointed-to output variable. The values written by those assignments die at the end of the function. But the global temporaries do have to be updated, because they're global and some other function could see their value. Perhaps you meant something like *status = lstatus; with different types for your vars?
If that C problem isn't obvious (at least once it's pointed out), you need some more practice with C before you're ready to try mixing C and asm, which requires you to understand both very well in order to correctly describe your asm to the compiler with accurate constraints.
A good and correct way to implement this is shown in @fuz's answer to your previous question. If you want to understand how the constraints can replace your mov instructions, compile it and look at the compiler-generated instructions. See https://stackoverflow.com/tags/inline-assembly/info for links to guides and docs. e.g. @fuz's version without the ES setup (because GCC needs you to have done that already before calling any C):
typedef unsigned char u8;
typedef unsigned short u16;
// Note the different signature, and using the output args correctly.
void read(u8 sector_size, u8 track, u8 sector, u8 head, u8 drive,
          u8 *buffer, u8 *status, u8 *sectors_read)
{
    u16 result;
    asm volatile("int $0x13"
                 : "=a"(result)
                 : "a"(0x200|sector_size), "b"(buffer),
                   "c"(track<<8|sector), "d"(head<<8|drive)
                 : "memory" ); // memory clobber was missing from @fuz's version
    *status = result >> 8;
    *sectors_read = result >> 0;
}
Compiles as follows, with GCC10.1 -O2 -m16 on Godbolt:
read:
pushl %ebx
movzbl 12(%esp), %ecx
movzbl 16(%esp), %edx
movzbl 24(%esp), %ebx # load some stack args
sall $8, %ecx
movzbl 8(%esp), %eax
orl %edx, %ecx # shift and merge into CL,CH instead of writing partial regs
movzbl 20(%esp), %edx
orb $2, %ah
sall $8, %edx
orl %ebx, %edx
movl 28(%esp), %ebx # the pointer arg
int $0x13 # from the inline asm statement
movl 32(%esp), %edx # load output pointer arg
movl %eax, %ecx
shrw $8, %cx
movb %cl, (%edx)
movl 36(%esp), %edx
movb %al, (%edx)
popl %ebx
ret
It might be possible to use register u8 track asm("ch") or something to get the compiler to just write partial regs instead of shift/OR.
If you don't want to understand how constraints work, don't use GNU C inline asm. You could instead write stand-alone functions that you call from C, which accept args according to the calling convention the compiler uses (e.g. gcc -mregparm=3, or just everything on the stack with the traditional inefficient calling convention.)
You could do a better job than GCC's code-gen above in a hand-written stand-alone function, but note that the inline asm can be optimized into the surrounding code, avoiding some of the actual copying to memory for passing args via the stack.

GCC inline assembly: "g" constraint and parameter size

Background
I am aware that solving the following problem with inline assembly is a bad idea. I'm currently learning inline assembly as part of a class on the linux kernel, and this was part of an assignment for that class.
The Setup
To begin with, below is a snippet of code that is almost correct, but segfaults instead. It is a function that copies the substring of src starting at index s_idx and ending (exclusively) at index e_idx into the pre-allocated dest using only inline assembly.
static inline char *asm_sub_str(char *dest, char *src, int s_idx, int e_idx) {
    asm("addq %q2, %%rsi;"  /* Add start index to src (ptrs are 64-bit) */
        "subl %k2, %%ecx;"  /* Get length of substr as e - s (int is 32-bit) */
        "cld;"              /* Clear direction bit (force increment) */
        "rep movsb;"        /* Move %ecx bytes of str at %esi into str at %edi */
        : /* No Outputs */
        : "S" (src), "D" (dest), "g" (s_idx), "c" (e_idx)
        : "cc", "memory"
    );
    return dest;
}
The issue with this code is the constraint for the second input parameter. When compiled with gcc's default optimization and -ggdb, the following assembly is generated:
Dump of assembler code for function asm_sub_str:
0x00000000004008e6 <+0>: push %rbp
0x00000000004008e7 <+1>: mov %rsp,%rbp
0x00000000004008ea <+4>: mov %rdi,-0x8(%rbp)
0x00000000004008ee <+8>: mov %rsi,-0x10(%rbp)
0x00000000004008f2 <+12>: mov %edx,-0x14(%rbp)
0x00000000004008f5 <+15>: mov %ecx,-0x18(%rbp)
0x00000000004008f8 <+18>: mov -0x10(%rbp),%rax
0x00000000004008fc <+22>: mov -0x8(%rbp),%rdx
0x0000000000400900 <+26>: mov -0x18(%rbp),%ecx
0x0000000000400903 <+29>: mov %rax,%rsi
0x0000000000400906 <+32>: mov %rdx,%rdi
0x0000000000400909 <+35>: add -0x14(%rbp),%rsi
0x000000000040090d <+39>: sub -0x14(%rbp),%ecx
0x0000000000400910 <+42>: cld
0x0000000000400911 <+43>: rep movsb %ds:(%rsi),%es:(%rdi)
0x0000000000400913 <+45>: mov -0x8(%rbp),%rax
0x0000000000400917 <+49>: pop %rbp
0x0000000000400918 <+50>: retq
This is identical to the assembly that is generated when the second input parameter's constraint is set to "m" instead of "g", leading me to believe the compiler is effectively choosing the "m" constraint. In stepping through these instructions with gdb, I found that the offending instruction is +35, which adds the starting offset index s_idx to the src pointer in %rsi. The problem of course is that s_idx is only 32 bits, and the upper 4 bytes of a 64-bit integer at that location on the stack are not necessarily 0. On my machine, they are in fact nonzero and cause the addition to muddle the upper 4 bytes of %rsi, which leads to a segfault in instruction +43.
The Question
Of course the solution to the above is to change the constraint of parameter 2 to "r" so it's placed in its own 64-bit register where the top 4 bytes are correctly zeroed and call it a day. Instead, my question is why does gcc resolve the "g" constraint as "m" instead of "r" in this case when the expression "%q2" indicates the value of parameter 2 will be used as a 64-bit value?
I don't know much about how gcc parses inline assembly, and I know there's not really a sense of typing in assembly, but I would think that gcc could recognize the effectively implicit cast of s_idx to a long when it's used as a 64-bit value in the first inline instruction. FWIW, if I explicitly change "g" (s_idx) to "g" ((long) s_idx), gcc then resolves the "g" constraint to "r" since (long) s_idx is a temporary value. I would think gcc could do that implicitly as well?
but I would think that gcc could recognize the effectively implicit cast of s_idx to a long when it's used as a 64-bit value in the first inline instruction.
No, gcc only looks at the constraints, not the asm template string at all, when compiling the surrounding code. The part of gcc that fills in the % template operands is totally separate from register-allocation and code-gen for the surrounding code.
Nothing checks for sanity or understands the context that template operands are being used in. Maybe you have a 16-bit input and want to copy it to a vector register with vmovd %k[input], %%xmm0 / vpbroadcastw %%xmm0, %%ymm0. The upper 16 bits are ignored, so you don't want gcc to waste time zero- or sign-extending it for you. But you definitely want to use vmovd instead of vpinsrw $0, %[input], %%xmm0, because that would be more uops and have a false dependency. For all gcc knows or cares, you could have used the operand in an asm comment line like "# low word of input = %h2 \n".
GNU C inline asm is designed so that the constraints tell the compiler everything it needs to know. Thus, you need to manually cast s_idx to long.
You don't need to cast the input for ECX, because the sub instruction will zero-extend the result implicitly (into RCX). Your inputs are signed types, but presumably you are expecting the difference to always be positive.
Register inputs must always be assumed to have high garbage beyond the width of the input type. This is similar to how function args in the x86-64 System V calling convention can have garbage in the upper 32 bits, but (I assume) with no unwritten rule about extending out to 32 bits. (And note that after function inlining, your asm statement's inputs might not be function args. You don't want to use __attribute__((noinline)), and as I said it wouldn't help anyway.)
leading me to believe the compiler is effectively choosing the "m" constraint.
Yes, gcc -O0 spills everything to memory between every C statement (so you can change it with a debugger if stopped at a breakpoint). Thus, a memory operand is the most efficient choice for the compiler. It would need a load instruction to get it back into a register. i.e. the value is in memory before the asm statement, at -O0.
(clang is bad at multiple-option constraints and picks memory even at -O3, even when that means spilling first, but gcc doesn't have that problem.)
gcc -O0 (and clang) will use an immediate for a g constraint when the input is a numeric literal constant, e.g. "g" (1234). In your case, you get:
...
addq $1234, %rsi;
subl $1234, %ecx;
rep movsb
...
An input like "g" ((long)s_idx) will use a register even at -O0, just like x+y or any other temporary result (as long as s_idx isn't already long). Interestingly, even (unsigned) resulted in a register operand, even though int and unsigned are the same size and the cast takes no instructions. At this point you're seeing exactly how little gcc -O0 optimizes, because what you get is more dependent on how gcc internals are designed than on what makes sense or is efficient.
Compile with optimization enabled if you want to see interesting asm. See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at compiler output.
Checking the asm with optimization disabled is good too for inline asm; you might not have realized the problem with using a q override if the operand had just been a register, although it would still be a problem. Checking how it inlines into a few different callers at -O3 can be useful, too (especially if you test with some compile-time-constant inputs).
Your code is seriously broken
Besides the high-garbage problems discussed above, you modify input-operand registers without telling the compiler about it.
Fixing this by making some of them "+" read/write outputs means your asm statement is no longer volatile by default, so the compiler will optimize it away if the outputs are unused. (This includes after function inlining, so the return dest is sufficient for the standalone version, but not after inlining if the caller ignores the return value.)
You did use a "memory" clobber, so the compiler will assume that you read/write memory. You could tell it which memory you read and write, so it can optimize around your copy more efficiently. See get string length in inline GNU Assembler: you can use dummy memory input/output constraints like "m" (*(const char (*)[]) src)
char *asm_sub_str_fancyconstraints(char *dest, char *src, int s_idx, int e_idx) {
    asm (
        "addq %[s_idx], %%rsi; \n\t"  /* Add start index to src (ptrs are 64-bit) */
        "subl %k[s_idx], %%ecx; \n\t" /* Get length of substr as e - s (int is 32-bit) */
        // the calling convention requires DF=0, and inline-asm can safely assume it, too
        // (it's widely done, including in the Linux kernel)
        //"cld;"                      /* Clear direction bit (force increment) */
        "rep movsb; \n\t"             /* Move %ecx bytes of str at %esi into str at %edi */
        : [src]"+&S" (src), [dest]"+D" (dest), [e_idx]"+c" (e_idx),
          "=m" (*(char (*)[]) dest)     // dummy output: all of dest
        : [s_idx]"g" ((long long)s_idx),
          "m" (*(const char (*)[]) src) // dummy input: tell the compiler we read all of src[0..infinity]
        : "cc"
    );
    return 0; // asm statement not optimized away, even without volatile,
              // because of the memory output.
              // Just like dest++; could optimize away, but *dest = 0; couldn't.
}
formatting: note the use of \n\t at the end of each line for readability; otherwise the asm instructions are all on one line separated only by ;. (It will assemble fine, but not very human-readable if you're checking how your asm template worked out.)
This compiles (with gcc -O3) to
asm_sub_str_fancyconstraints:
movslq %edx, %rdx # from the (long long)s_idx
xorl %eax, %eax # from the return 0, which I changed to test that it doesn't optimize away
addq %rdx, %rsi;
subl %edx, %ecx; # your code zero-extends (e_idx - s_idx)
rep movsb;
ret
I put this + a couple other versions on the Godbolt compiler explorer with gcc + clang. A simpler version fixes the bugs but still uses a "memory" clobber + asm volatile to get correctness with more compile-time optimization cost than this version that tells the compiler which memory is read and written.
Early clobber: Note the "+&S" constraint:
If for some weird reason, the compiler knew that the src address and s_idx were equal, it could use the same register (esi/rsi) for both inputs. This would lead to modifying s_idx before it was used in the sub. Declaring that the register holding src is clobbered early (before all input registers are read for the last time) will force the compiler to choose different registers.
See the Godbolt link above for a caller that causes breakage without the & for early-clobber. (But only with the nonsensical src = (char*)s_idx;). Early-clobber declarations are often necessary for multi-instruction asm statements to prevent more realistic breakage possibilities, so definitely keep this in mind, and only leave it out when you're sure it's ok for any read-only input to share a register with an output or input/output operand. (Of course using specific-register constraints limits that possibility.)
I omitted the early-clobber declaration from e_idx in ecx, because the only "free" parameter is s_idx, and putting them both in the same register will result in sub same,same, and rep movsb running 0 iterations as desired.
It would of course be more efficient to let the compiler do the math, and simply ask for the inputs to rep movsb in the right registers. Especially if both e_idx and s_idx are compile-time constants, it's silly to force the compiler to mov an immediate to a register and then subtract another immediate.
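A sketch of that simpler approach (my naming; it reuses the same dummy memory operands as above, so it needs neither volatile nor a "memory" clobber):

static inline char *asm_sub_str_letcompiler(char *dest, char *src, int s_idx, int e_idx) {
    char *d = dest;                       // copies, so the original pointers survive
    const char *s = src + s_idx;
    size_t len = (size_t)(e_idx - s_idx); // compiler does the math, often at compile time
    asm("rep movsb"
        : "+D"(d), "+S"(s), "+c"(len),
          "=m"(*(char (*)[]) dest)        // dummy output: dest bytes are written
        : "m"(*(const char (*)[]) src)    // dummy input: src bytes are read
        );
    return dest;
}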
Or even better, don't use inline asm at all. (But if you really want rep movsb to test its performance, inline asm is one way to do it. gcc also has tuning options that control how memcpy inlines, if at all.)
No inline asm answer is complete without recommending that you https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it.

x86 add and addl operands are adding wrong?

I'm working with xv6, which implements the original UNIX on x86 machines. I wrote some very simple inline assembly in a C program:
register int ecx asm ("%ecx");
printf(1, "%d\n", ecx);
__asm__("movl 16(%esp), %ecx\t\n");
printf(1, "%d\n", ecx);
__asm__("add $0, %ecx\t\n");
printf(1, "%d\n", ecx);
__asm__("movl %ecx, 16(%esp)\t\n");
I usually get a value like 434 printed by the second print statement. However, after the add command it prints 2. If I use the addl command instead, it also prints 2. I am using the latest stable version of xv6. So, I don't really suspect it to be the problem. Is there any other way I can add two numbers in inline assembly?
Essentially I need to increment 16(%esp) by 4.
Edited code to:
__asm__("addl $8, 16(%esp)\t\n");
1) In your example you're not incrementing ecx by 4, you're incrementing it by 0.
__asm__("addl $4, %ecx");
2) You should be able to chain multiple commands into one asm call
__asm__("movl 16(%esp), %ecx\n\t"
"addl $4, %ecx\n\t"
"movl %ecx, 16(%esp)");
3) The register keyword is a hint, and the compiler may still decide to put your variable wherever it wants. Also, the documentation on the GCC page warns that some functions may clobber various registers. printf(), being a C function, may very well use the ecx register without preserving its value. It could preserve it, but it may not; the compiler could be using that register for all sorts of optimizations inside of that call. It is a general purpose register on the 80x86, and those are often used for parameter passing and return values all the time.
Untested corrections:
int reg; // No explicit register binding here, so GCC gets to pick the best available register.
/*
 * volatile indicates to GCC that this inline assembly might do odd side
 * effects and should disable any optimizations around it.
 */
asm volatile ("movl 16(%%esp), %0\n\t"
              "addl $4, %0\n\t"
              "movl %0, 16(%%esp)"
              : "=r" (reg)   // "=r": the result comes back in a register of GCC's choosing
              :
              : "memory");   // we also modify 16(%esp) behind the compiler's back
printf("Result: %d\n", reg);
The GCC man page has more details.

"unsupported for mov" GCC inline assembler

While playing around with GCC's inline assembler feature, I tried to make a function which immediately exited the process, akin to _Exit from the C standard library.
Here is the relevant piece of source code:
void immediate_exit(int code)
{
#if defined(__x86_64__)
asm (
//Load exit code into %rdi
"mov %0, %%rdi\n\t"
//Load system call number (group_exit)
"mov $231, %%rax\n\t"
//Linux syscall, 64-bit version.
"syscall\n\t"
//No output operands, single unrestricted input register, no clobbered registers because we're about to exit.
:: "" (code) :
);
//Skip other architectures here, I'll fix these later.
#else
# error "Architecture not supported."
#endif
}
This works fine for debug builds (with -O0), but as soon as I turn optimisation on at any level, I get the following error:
immediate_exit.c: Assembler messages:
immediate_exit.c:4: Error: unsupported for `mov'
So I looked at the assembler output for both builds (I've removed .cfi* directives and other things for clarity, I can add that in again if it's a problem). The debug build:
immediate_exit:
.LFB0:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
mov -4(%rbp), %rdi
mov $231, %rax
syscall
popq %rbp
ret
And the optimised version:
immediate_exit:
.LFB0:
mov %edi, %rdi
mov $231, %rax
syscall
ret
So the optimised version is trying to put a 32-bit register edi into a 64-bit register, rdi, rather than loading it from rbp, which I presume is what is causing the error.
Now, I can fix this by specifying 'm' as the constraint for code, which causes GCC to load from rbp regardless of optimisation level. However, I'd rather not do that, because I think the compiler and its authors have a much better idea about where to put stuff than I do.
So (finally!) my question is: how do I persuade GCC to use rdi rather than edi for the assembly output?
Overall, you're much better off using constraints to get values into the right registers rather than explicit moves:
#include <asm/unistd.h>
asm volatile("syscall"
: // no outputs. Other syscalls need an "=a"(retval) to tell the compiler RAX is modified, whether you actually use the retval or not.
: "D" ((uint64_t)code), "a" ((uint64_t)__NR_exit_group) // 231
: "rcx", "r11" // syscall itself clobbers these. exit can't fail and return; mostly here as an example for other syscalls
, "memory" // make sure any stores, e.g. to mmapped files, are done before this
);
__builtin_unreachable(); // tell the compiler execution doesn't come out the bottom of the asm statement. Maybe have the same effect as a "memory" clobber of making sure not to delay stores which could potentially be to mmapped files or shared memory.
That lets the compiler hoist the moves earlier in the code if useful, or even avoid the move altogether if the value can be arranged to already be in the correct register...
For example, code will already be in EDI if this function doesn't inline; the Linux system-calling convention was chosen to be as close as possible to the x86-64 System V function-calling convention, except for using R10 instead of RCX, because the syscall instruction itself overwrites RCX with saved-RIP, and R11 with saved-RFLAGS.
(Unnecessarily casting (uint64_t)code would force the compiler to redo zero-extension with a mov %edi, %edi in that case, though. The call number does need to be zero-extended to 64-bit, which will almost certainly happen for free even if you didn't manually cast it (since the compiler will use a mov $231, %eax), but it doesn't hurt to be explicit about something that is required. The exit_group system call takes a 32-bit int arg, so the kernel is guaranteed to ignore high garbage in RDI.)
Cast your variable into the appropriate length type.
#include <stdint.h>
asm (
//Load exit code into %rdi
"mov %0, %%rdi\n\t"
//Load system call number (group_exit)
"mov $231, %%rax\n\t"
//Linux syscall, 64-bit version.
"syscall\n\t"
//No output operands, single unrestricted input register, no clobbered registers because we're about to exit.
:: "g" ((uint64_t)code)
);
or better, have your operand be of the right size in the first place:
void immediate_exit(uint64_t code) { ...

Fastest way to find out minimum of 3 numbers?

In a program I wrote, 20% of the time is being spent on finding out the minimum of 3 numbers in an inner loop, in this routine:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a;
    if (m > b) m = b;
    if (m > c) m = c;
    return m;
}
Is there any way to speed this up? I am ok with assembly code too for x86/x86_64.
Edit: In reply to some of the comments:
* Compiler being used is gcc 4.3.3
* As far as assembly is concerned, I am only a beginner there. I asked for assembly here, to learn how to do this. :)
* I have a quad-core Intel 64 running, so MMX/SSE etc. are supported.
* It's hard to post the loop here, but I can tell you it's a heavily optimized implementation of the levenshtein algorithm.
This is what the compiler is giving me for the non-inlined version of min:
.globl min
.type min, #function
min:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %eax
movl 16(%ebp), %ecx
cmpl %edx, %eax
jbe .L2
movl %edx, %eax
.L2:
cmpl %ecx, %eax
jbe .L3
movl %ecx, %eax
.L3:
popl %ebp
ret
.size min, .-min
.ident "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
.section .note.GNU-stack,"",#progbits
The inlined version is within -O2 optimized code; even my markers (mrk = 0xfefefefe, placed before and after the call to min()) are getting optimized away by gcc, so I couldn't get hold of it.
Update: I tested the changes suggested by Nils and ephemient; however, there's no perceptible performance boost from using the assembly versions of min(). However, I get a 12.5% boost by compiling the program with -march=i686, which I guess is because the whole program is getting the benefit of the new, faster instructions that gcc generates with this option. Thanks for your help guys.
P.S. - I used the ruby profiler to measure performance (my C program is a shared library loaded by a ruby program), so I could get time spent only for the top-level C function called by the ruby program, which ends up calling min() down the stack. Please see this question.
Make sure you are using an appropriate -march setting, first off. GCC defaults to not using any instructions that were not supported on the original i386 - allowing it to use newer instruction sets can make a BIG difference at times! On -march=core2 -O2 I get:
min:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %ecx
movl 16(%ebp), %eax
cmpl %edx, %ecx
leave
cmovbe %ecx, %edx
cmpl %eax, %edx
cmovbe %edx, %eax
ret
The use of cmov here may help you avoid branch delays - and you get it without any inline asm just by passing in -march. When inlined into a larger function this is likely to be even more efficient, possibly just four assembly operations. If you need something faster than this, see if you can get the SSE vector operations to work in the context of your overall algorithm.
Assuming your compiler isn't out to lunch, this should compile down to two compares and two conditional moves. It isn't possible to do much better than that.
If you post the assembly that your compiler is actually generating, we can see if there's anything unnecessary that's slowing it down.
The number one thing to check is that the routine is actually getting inlined. The compiler isn't obligated to do so, and if it's generating a function call, that will be hugely expensive for such a simple operation.
If the call really is getting inlined, then loop unrolling may be beneficial, as DigitalRoss said, or vectorization may be possible.
Edit: If you want to vectorize the code, and are using a recent x86 processor, you will want to use the SSE4.1 pminud instruction (intrinsic: _mm_min_epu32), which takes two vectors of four unsigned ints each, and produces a vector of four unsigned ints. Each element of the result is the minimum of the corresponding elements in the two inputs.
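With intrinsics, that might look like the sketch below (the function name is mine; it needs -msse4.1, and how you lay four triples out across vectors is up to your algorithm):

#include <smmintrin.h>   // SSE4.1

// element-wise minimum of three vectors of four unsigned 32-bit ints each
static inline __m128i min3_epu32(__m128i a, __m128i b, __m128i c) {
    return _mm_min_epu32(_mm_min_epu32(a, b), c);   // two pminud instructions
}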
I also note that your compiler used branches instead of conditional moves; you should probably try a version that uses conditional moves first and see if that gets you any speedup before you go off to the races on a vector implementation.
This drop-in replacement clocks in about 1.5% faster on my AMD Phenom:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    asm("cmp %1,%0\n"
        "cmova %1,%0\n"
        "cmp %2,%0\n"
        "cmova %2,%0\n"
        : "+r" (a) : "r" (b), "r" (c));
    return a;
}
Results may vary; some x86 processors don't handle CMOV very well.
My take on an x86 assembler implementation, GCC syntax. Should be trivial to translate to another inline assembler syntax:
int inline least (int a, int b, int c)
{
    int result;
    __asm__ ("mov %1, %0\n\t"
             "cmp %0, %2\n\t"
             "cmovle %2, %0\n\t"
             "cmp %0, %3\n\t"
             "cmovle %3, %0\n\t"
             : "=&r"(result)   /* early-clobber: result is written before all inputs are read */
             : "r"(a), "r"(b), "r"(c)
    );
    return result;
}
New and improved version:
int inline least (int a, int b, int c)
{
    __asm__ (
        "cmp %0, %1\n\t"
        "cmovle %1, %0\n\t"
        "cmp %0, %2\n\t"
        "cmovle %2, %0\n\t"
        : "+r"(a)
        : "%r"(b), "r"(c)
    );
    return a;
}
NOTE: It may or may not be faster than C code.
This depends on a lot of factors. Usually cmov wins if the branches are not predictable (on some x86 architectures). OTOH, inline assembler is always a problem for the optimizer, so the optimization penalty for the surrounding code may outweigh all gains.
Btw Sudhanshu, it would be interesting to hear how this code performs with your testdata.
The SSE2 instruction extensions contain an integer min instruction that can choose 8 minimums at a time (pminsw, on signed 16-bit elements). See _mm_min_epi16 in http://www.intel.com/software/products/compilers/clin/docs/ug_cpp/comm1046.htm
First, look at the disassembly. That'll tell you a lot. For example, as written, there are 2 if-statements (which means there are 2 possible branch mispredictions), but my guess is that a decent modern C compiler will have some clever optimization that can do it without branching. I'd be curious to find out.
Second, if your libc has special built-in min/max functions, use them. GNU libc has fmin/fmax for floating-point, for example, and they claim that "On some processors these functions can use special machine instructions to perform these operations faster than the equivalent C code". Maybe there's something similar for uints.
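(For floats, that idea would look something like this sketch; min3d is just an illustrative name:)

#include <math.h>
static inline double min3d(double a, double b, double c) {
    return fmin(fmin(a, b), c);   // C99 fmin from math.h
}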
Finally, if you're doing this to a bunch of numbers in parallel, there are probably vector instructions to do this, which could provide significant speedup. But I've even seen non-vector code be faster when using vector units. Something like "load one uint into a vector register, call vector min function, get result out" looks dumb but might actually be faster.
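A quick sketch of that last trick (my naming; assumes SSE4.1 for the unsigned 32-bit minimum):

#include <smmintrin.h>   // SSE4.1 for _mm_min_epu32 (pminud)

static inline unsigned int min3_sse(unsigned int a, unsigned int b, unsigned int c) {
    __m128i va = _mm_cvtsi32_si128((int)a);   // move each scalar into a vector register
    __m128i vb = _mm_cvtsi32_si128((int)b);
    __m128i vc = _mm_cvtsi32_si128((int)c);
    __m128i m  = _mm_min_epu32(_mm_min_epu32(va, vb), vc);
    return (unsigned int)_mm_cvtsi128_si32(m); // move the result back out
}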
If you are only doing one comparison you might want to unroll the loop manually.
First, see if you can get the compiler to unroll the loop for you, and if you can't, do it yourself. This will at least reduce the loop control overhead...
You could try something like this to save on declaration and unnecessary comparisons:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    if (a < b)
    {
        if (a < c)
            return a;
        else
            return c;
    }
    if (b < c)
        return b;
    else
        return c;
}
These are all good answers. At the risk of being accused of not answering the question, I would also look at the other 80% of the time. Stackshots are my favorite way to find code worth optimizing, especially if it is function calls that you find out you don't absolutely need.
Yes, post assembly, but my naive optimization is:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a;
    if (m > b) m = b;
    if (m > c) return c;
    return m;
}

Resources