Hello,
So, I'm optimizing some functions that I wrote for a simple operating system I'm developing. This function, putpixel(), currently looks like this (I'm including the C in case my assembly below is unclear or wrong):
uint32_t loc = (x*pixel_w)+(y*pitch);
vidmem[loc] = color & 255;
vidmem[loc+1] = (color >> 8) & 255;
vidmem[loc+2] = (color >> 16) & 255;
This takes a little bit of explanation. First, loc is the pixel index I want to write to in video memory. X and Y coordinates are passed to the function. Then, we multiply X by the pixel width in bytes (in this case, 3) and Y by the number of bytes in each line. More information can be found here.
vidmem is a global variable, a uint8_t pointer to video memory.
That being said, anyone familiar with bitwise operations should be able to figure out how putpixel() works fairly easily.
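Putting it together, a minimal sketch of the whole function as described (the signature and the pixel_w/pitch globals are assumptions based on the description above, not the original source):

#include <stdint.h>

extern uint8_t *vidmem;          /* global pointer to video memory */
extern uint32_t pixel_w, pitch;  /* bytes per pixel (3 here) and bytes per scanline */

void putpixel(uint32_t x, uint32_t y, uint32_t color)
{
    uint32_t loc = (x * pixel_w) + (y * pitch);
    vidmem[loc]     = color & 255;          /* lowest byte first */
    vidmem[loc + 1] = (color >> 8) & 255;
    vidmem[loc + 2] = (color >> 16) & 255;
}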
Now, here's my assembly. Note that it has not been tested and may even be slower or just plain not work. This question is about how to make it compile.
I've replaced everything after the definition of loc with this:
__asm(
"push %%rdi;"
"push %%rbx;"
"mov %0, %%rdi;"
"lea %1, %%rbx;"
"add %%rbx, %%rdi;"
"pop %%rbx;"
"mov %2, %%rax;"
"stosb;"
"shr $8, %%rax;"
"stosb;"
"shr $8, %%rax;"
"stosb;"
"pop %%rdi;" : :
"r"(loc), "r"(vidmem), "r"(color)
);
When I compile this, clang gives me an error for every push instruction.
So when I saw that error, I assumed it had to do with my omission of the GAS suffixes (which should have been implicitly decided on, anyway). But when I added the "l" suffix (all of my variables are uint32_ts), I got the same error! I'm not quite sure what's causing it, and any help would be much appreciated. Thanks in advance!
You could probably make the compiler's output for your C version much more efficient by loading vidmem into a local variable before the stores. As it is, it can't assume that the stores don't alias vidmem, so it reloads the pointer before every byte store. Hrm, trying that does let gcc 4.9.2 avoid reloading vidmem, but it still generates some nasty code. clang 3.5 does slightly better.
Implementing what I said in my comment on your answer (that stos is 3 uops vs. 1 for mov):
#include <stdint.h>
extern uint8_t *vidmem;
void putpixel_asm_peter(uint32_t color, uint32_t loc)
{
// uint32_t loc = (x*pixel_w)+(y*pitch);
__asm( "\n"
"\t movb %b[col], (%[ptr])\n"
"\t shrl $8, %[col];\n"
"\t movw %w[col], 1(%[ptr]);\n"
: [col] "+r" (color), "=m" (vidmem[loc])
: [ptr] "r" (vidmem+loc)
:
);
}
compiles to a very efficient implementation:
gcc -O3 -S -o- putpixel.c 2>&1 | less # (with extra lines removed)
putpixel_asm_peter:
movl %esi, %esi
addq vidmem(%rip), %rsi
#APP
movb %dil, (%rsi)
shrl $8, %edi;
movw %di, 1(%rsi);
#NO_APP
ret
All of those instructions decode to a single uop on Intel CPUs. (The stores can micro-fuse, because they use a single-register addressing mode.) The movl %esi, %esi zeroes the upper 32 bits, since the caller might have generated that function arg with a 64bit instruction that left garbage in the high 32 of %rsi. Your version could have saved some instructions by using constraints to ask for the values in the desired registers in the first place, but this will still be faster than stos.
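As an aside, a hedged sketch of what "ask for the values in the desired registers" could look like for a stos-based version (my code, untested; stos still decodes to more uops than plain mov stores):

#include <stdint.h>
extern uint8_t *vidmem;

void putpixel_stos(uint32_t color, uint32_t loc)
{
    uint8_t *dst = vidmem + loc;
    /* "+D" pins dst to RDI and "+a" pins color to EAX, exactly where
       stosb/stosw and the shr want them, so no mov or push is needed. */
    __asm__ volatile("stosb\n\t"
                     "shrl $8, %%eax\n\t"
                     "stosw"
                     : "+D"(dst), "+a"(color)
                     : : "memory");   /* coarse clobber instead of "=m" outputs */
}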
Also notice how I let the compiler take care of adding loc to vidmem. You could have done it more efficiently in yours, with a lea to combine an add with a move. However, if the compiler wants to get clever when this is used in a loop, it could increment the pointer instead of the address. Finally, this means the same code will work for 32 and 64bit. %[ptr] will be a 64bit reg in 64bit mode, but a 32bit reg in 32bit mode. Since I don't have to do any math on it, it Just Works.
I used an "=m" output constraint to tell the compiler where we're writing in memory. (I should have cast the pointer to a struct { char a[3]; } or something, to tell gcc how much memory it actually writes, as per the tip at the end of the "Clobbers" section in the gcc manual.)
I also used color as an input/output constraint to tell the compiler that we modify it. If this got inlined, and later code expected to still find the value of color in the register, we'd have a problem. Having this in a function means color is already a tmp copy of the caller's value, so the compiler will know it needs to throw away the old color. Calling this in a loop could be slightly more efficient with two read-only inputs: one for color, one for color >> 8.
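A sketch of that loop-friendly variant with two read-only inputs (my naming, untested):

#include <stdint.h>
extern uint8_t *vidmem;

void putpixel_two_inputs(uint32_t color, uint32_t loc)
{
    uint32_t hi = color >> 8;   /* computed by the compiler; hoistable out of a loop */
    __asm__("movb %b[col], (%[ptr])\n\t"
            "movw %w[hi], 1(%[ptr])"
            : "=m" (vidmem[loc])   /* memory output; same sizing caveat as above */
            : [col] "r" (color), [hi] "r" (hi), [ptr] "r" (vidmem + loc));
}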
Note that I could have written the constraints as
: [col] "+r" (color), [memref] "=m" (vidmem[loc])
:
:
But using %[memref] and 1 %[memref] to generate the desired addresses would lead gcc to emit
movl %esi, %esi
movq vidmem(%rip), %rax
# APP
movb %dil, (%rax,%rsi)
shrl $8, %edi;
movw %di, 1 (%rax,%rsi);
The two-reg addressing mode means the store instructions can't micro-fuse (on Sandybridge and later, at least).
You don't even need inline asm to get decent code, though:
void putpixel_cast(uint32_t color, uint32_t loc)
{
// uint32_t loc = (x*pixel_w)+(y*pitch);
typeof(vidmem) vmem = vidmem;
vmem[loc] = color & 255;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
*(uint16_t *)(vmem+loc+1) = color >> 8;
#else
vmem[loc+1] = (color >> 8) & 255; // gcc sucks at optimizing this for little endian :(
vmem[loc+2] = (color >> 16) & 255;
#endif
}
compiles to (gcc 4.9.2 and clang 3.5 give the same output):
movq vidmem(%rip), %rax
movl %esi, %esi
movb %dil, (%rax,%rsi)
shrl $8, %edi
movw %di, 1(%rax,%rsi)
ret
This is only a tiny bit less efficient than what we get with inline asm, and should be easier for the optimizer to optimize if inlined into loops.
Overall performance
Calling this in a loop is probably a mistake. It'll be more efficient to combine multiple pixels in a register (esp. a vector register), and then write all at once. Or, do 4-byte writes, overlapping the last byte of the previous write, until you get to the end and have to preserve the byte after the last chunk of 3.
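For example, a sketch of the overlapping 4-byte-write idea (my function name and code, untested; assumes little-endian, and handles the final pixel byte-wise so the byte after the run is preserved):

#include <stdint.h>
#include <string.h>
extern uint8_t *vidmem;

void fill_run(uint32_t loc, uint32_t color, unsigned npix)
{
    uint8_t *p = vidmem + loc;
    for (unsigned i = 0; i + 1 < npix; i++, p += 3)
        memcpy(p, &color, 4);   /* one 4-byte store; the 4th byte is
                                   overwritten by the next iteration */
    if (npix) {                 /* last pixel: must not touch the byte after it */
        p[0] = color & 255;
        p[1] = (color >> 8) & 255;
        p[2] = (color >> 16) & 255;
    }
}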
See http://agner.org/optimize/ for more stuff about optimizing C and asm. That and other links can be found at https://stackoverflow.com/tags/x86/info.
Found the problem!
It was in a lot of places, but the major one was vidmem. I assumed it would pass the address, but it was causing an error. After referring to it as a dword, it worked perfectly. I also had to change the other constraints to "m", and I finally got this result (after some optimization):
__asm(
"movl %0, %%edi;"
"movl %k1, %%ebx;"
"addl %%ebx, %%edi;"
"movl %2, %%eax;"
"stosb;"
"shrl $8, %%eax;"
"stosw;" : :
"m"(loc), "r"(vidmem), "m"(color)
: "edi", "ebx", "eax"
);
Thanks to everyone who answered in the comments!
I have just written a few small inline asm routines to query the timestamp counter on x86 so that I can profile small portions of code. I would really like to put those routines in a header so that I can reuse them in many different source files, so basically my question is whether I should organize them as macros or as inline functions. My doubt with inline is that the compiler won't necessarily inline the function, and since this is a performance-sensitive call I would rather skip the function call overhead. On the other hand, with macros all type safety goes away, and I would strictly need a 32-bit int here; I assume I could state that in a comment, but I still try to avoid macros because of their many caveats. Here is the code:
inline void rdtsc(uint64_t* cycles)
{
uint32_t cycles_high, cycles_low;
asm volatile (
".att_syntax\n"
"CPUID\n\t" //Serialize
"RDTSC\n\t" //Read clock and cpuid
"mov %%edx, %0 \n\t"
"mov %%eax, %1 \n\t"
: "=r" (cycles_high), "=r" (cycles_low)
:: "%edx", "%eax");
*cycles = ((uint64_t) cycles_high << 32) | cycles_low;
}
Any suggestions on this are welcome. I am just trying to figure out what the preferred style would be for this kind of situation.
Since you will be measuring performance of portions of code, not necessarily always entire functions, you should not try to inline your performance counter.
It doesn't matter if there's a call overhead or not. What matters is that the measurement is consistent, which means you either want the call overhead to ALWAYS be present, or NEVER.
The first is much easier to achieve than the second.
Let every portion of your code have the same call overhead.
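For example, one way to make the overhead ALWAYS present (the noinline attribute is my suggestion, not something from the question) is:

#include <stdint.h>

/* Never inlined: every measurement pays exactly the same call cost. */
__attribute__((noinline)) uint64_t rdtsc_call(void)
{
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}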
If you really need to serialize before reading the TSC, you could use the LFENCE instruction instead, which doesn't alter any registers.
If you decide to continue to use CPUID for serialization, you ought to set EAX first (probably to 0, since you're not really concerned about the output) and note that this instruction trashes the EAX, EBX, ECX and EDX registers, so your routine MUST account for this fact.
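If you do keep CPUID, a corrected sketch (my code, untested) that sets EAX and accounts for all four trashed registers could look like:

#include <stdint.h>

static inline uint64_t rdtsc_serialized(void)
{
    uint32_t lo, hi;
    asm volatile("cpuid\n\t"           /* serialize; trashes eax/ebx/ecx/edx */
                 "rdtsc"
                 : "=a"(lo), "=d"(hi)  /* eax and edx are covered by the outputs */
                 : "a"(0)              /* set EAX=0 before CPUID */
                 : "%ebx", "%ecx");
    return ((uint64_t)hi << 32) | lo;
}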
In all, I'd be inclined to write it like this:
#include <stdint.h>
#include <stdio.h>
inline uint64_t rdtsc() {
uint32_t high, low;
asm volatile (
".att_syntax\n\t"
"LFENCE\n\t"
"RDTSC\n\t"
"movl %%eax, %0\n\t"
"movl %%edx, %1\n\t"
: "=rm" (low), "=rm" (high)
:: "%edx", "%eax");
return ((uint64_t) high << 32) | low;
}
int main() {
uint64_t x, y;
x = rdtsc();
printf("%lu\n", x);
y = rdtsc();
printf("%lu\n", y);
printf("%lu\n", y-x);
}
update:
It's been proposed by @Jester and by @DavidWohlferd that one can eliminate the register-to-register moves by assigning high and low directly to the edx and eax registers through constraints.
That version would look like this:
inline uint64_t rdtsc() {
uint32_t high, low;
asm volatile (
".att_syntax\n\t"
"LFENCE\n\t"
"RDTSC\n\t"
: "=a" (low), "=d" (high)
:: );
return ((uint64_t) high << 32) | low;
}
The resulting code (using gcc 4.8.3 on a 64-bit machine running Linux), compiled with -O2 and shown up to the call to printf, is this:
#APP
# 20 "rdtsc.c" 1
.att_syntax
LFENCE
RDTSC
# 0 "" 2
#NO_APP
movq %rdx, %rbx
movl %eax, %eax
movl $.LC0, %edi
salq $32, %rbx
orq %rax, %rbx
xorl %eax, %eax
movq %rbx, %rsi
call printf
The version I originally posted results in this:
#APP
# 7 "rdtsc.c" 1
.att_syntax
LFENCE
RDTSC
movl %eax, %ecx
movl %edx, %ebx
# 0 "" 2
#NO_APP
movl %ecx, %ecx
salq $32, %rbx
movl $.LC0, %edi
orq %rcx, %rbx
xorl %eax, %eax
movq %rbx, %rsi
call printf
That version of the code is one instruction longer.
Good day.
I faced a problem that I couldn't solve for several days. The error appears when I try to compile this function in C language.
void GetInInterrupt(UChar Interrupt)
{
//asm volatile(".intel_syntax noprefix");
asm volatile
(
"movb %0, %%al\n"
"movb %%al, 1(point)\n"
"point:\n"
"int $0\n"
: /*output*/ : "r" (Interrupt) /*input*/ : /*clobbered*/
);
//asm volatile(".att_syntax noprefix");
}
The message I get from gas is the following:
Error: junk '(point)' after expression
As far as I can understand, the pointer in the second line is faulty, but unfortunately I can't solve it on my own.
Thank you for your help.
If you can use C++, then this one:
template <int N> static inline void GetInInterrupt (void)
{
__asm__ ("int %0\n" : "N"(N));
}
will do. If I use that template like:
GetInInterrupt<123>();
GetInInterrupt<3>();
GetInInterrupt<23>();
GetInInterrupt<0>();
that creates the following object code:
   0:   cd 7b    int    $0x7b
   2:   cc       int3
   3:   cd 17    int    $0x17
   5:   cd 00    int    $0x0
which is pretty much optimal (even for the int3 case, which is the breakpoint op). It'll also create a compile-time warning if the operand is out of the 0..255 range, due to the N constraint allowing only that.
Edit: plain old C-style macros work as well, of course:
#define GetInInterrupt(arg) __asm__("int %0\n" : : "N"((arg)) : "cc", "memory")
creates the same code as the C++ templated function. Due to the way int behaves, it's a good idea to tell the compiler (via the "cc", "memory" clobbers) about the barrier semantics, to make sure it doesn't try to re-order instructions around the embedded inline assembly.
The limitation of both is, obviously, the fact that the interrupt number must be a compile-time constant. If you absolutely don't want that, then creating a switch() statement created e.g. with the help of BOOST_PP_REPEAT() covering all 255 cases is a better option than self-modifying code, i.e. like:
#include <boost/preprocessor/repetition/repeat.hpp>
#define GET_INTO_INT(a, INT, d) case INT: GetInInterrupt<INT>(); break;
void GetInInterrupt(int interruptNumber)
{
switch(interruptNumber) {
BOOST_PP_REPEAT(256, GET_INTO_INT, 0)
default:
runtime_error("interrupt Number %d out of range", interruptNumber);
}
}
This can be done in plain C (if you change the templated function invocation for a plain __asm__ of course) - because the boost preprocessor library does not depend on a C++ compiler ... and gcc 4.7.2 creates the following code for this:
GetInInterrupt:
.LFB0:
cmpl $255, %edi
jbe .L262
movl %edi, %esi
xorl %eax, %eax
movl $.LC0, %edi
jmp runtime_error
.p2align 4,,10
.p2align 3
.L262:
movl %edi, %edi
jmp *.L259(,%rdi,8)
.section .rodata
.align 8
.align 4
.L259:
.quad .L3
.quad .L4
[ ... ]
.quad .L258
.text
.L257:
#APP
# 17 "tccc.c" 1
int $254
# 0 "" 2
#NO_APP
ret
[ ... accordingly for the other vectors ... ]
Beware though if you do the above ... the compiler (gcc up to and including 4.8) is not intelligent enough to optimize the switch() away, i.e. even if you say static __inline__ ... it'll create the full jump table version of GetInInterrupt(3) instead of just an inlined int3 as would the simpler implementations.
The code below shows how you could write to a location in the code. It does assume that the code is writeable in the first place, which is typically not the case in mainstream OSes, since writeable code would let some nasty bugs go unnoticed.
void GetInInterrupt(UChar Interrupt)
{
//asm volatile(".intel_syntax noprefix");
asm volatile
(
"movb %0, point+1\n"
"point:\n"
"int $0\n"
: /*output*/ : "r" (Interrupt) /*input*/ : /*clobbered */
);
//asm volatile(".att_syntax noprefix");
}
I also simplified the code to avoid using two registers, instead just using the register that Interrupt is already in. If the compiler moans about it, you may find that "a" instead of "r" solves the problem.
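In other words, a sketch of that tweak (untested):

asm volatile
(
    "movb %0, point+1\n"
    "point:\n"
    "int $0\n"
    : /*output*/ : "a" (Interrupt) /*input*/ : /*clobbered*/
);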
Update: See Eric Postpischil's answer; I think he is right.
I found this code on this page while studying inline assembly:
static inline char * strcpy(char * dest,const char *src)
{
int d0, d1, d2;
__asm__ __volatile__( "1:\n\t"
"lodsb\n\t"
"stosb\n\t"
"testb %%al,%%al\n\t"
"jne 1b"
: "=&S" (d0), "=&D" (d1), "=&a" (d2)
: "0" (src),"1" (dest)
: "memory");
return dest;
}
I am confused about why the three temporary variables are needed. I tried to implement the function without them:
static inline char * strcpy(char * dest,const char *src)
{
__asm__ __volatile__( "1:\n\t"
"lodsb\n\t"
"stosb\n\t"
"testb %%al,%%al\n\t"
"jne 1b"
/* : "=&S" (d0), "=&D" (d1), "=&a" (d2) */
:
: "S" (src),"D" (dest)
: "memory", "esi", "edi", "al");
return dest;
}
But I got errors when compiling it with gcc.
inline.c: In function ‘strcpy’:
inline.c:6:9: error: can't find a register in class ‘SIREG’ while reloading ‘asm’
inline.c:6:9: error: ‘asm’ operand has impossible constraints
The original code is trying to deal with the fact that the lodsb and stosb instructions modify registers %ESI, %EDI, and %AL. To do this, it has defined d0, d1, and d2 so that it can declare them as outputs from the assembly code. It does not actually want these values as outputs, so it does not use them after the assembly. However, because they are outputs, the compiler knows that they are modified by the assembly code, so the compiler knows not to keep any values in those registers that it wants to remain unchanged through the assembly code.
It should be possible to do this in another way, and your code better expresses the semantics: It says inputs are provided in %ESI and %EDI and that memory, %ESI, %EDI, and %AL are altered. The original code asserts that %ESI, %EDI, and %AL are outputs, which is not the true intent; they are unwanted byproducts, not desired outputs.
However, your version of GCC is different from mine (Apple/clang-418.0.60); mine accepts the code you wrote without error. I suspect your GCC is confused by the fact that %ESI is specified both as an input register (by "S" after the second colon) and as something that is clobbered (by "esi" after the third colon). Perhaps this was a shortcoming in GCC that was later fixed, and the original code was working around that shortcoming.
Things are really simple. Let's recall that the S constraint stands for the esi register, a for al, and D for edi. When you inform the compiler that those registers will be clobbered, you must explicitly provide a place to spill them (d0 is temporary storage for esi, d1 for edi, d2 for al).
Next, during register allocation, the compiler decides whether it actually needs to spill, and looks for an available register. For example, my compiler (gcc 4.7.2) does it as follows:
movq %rdi, %rdx
1:
lodsb
stosb
testb %al,%al
jne 1b
movq %rdx, %rax
ret
You can see that d1 was allocated to rdx, while d0 and d2 were eliminated as unnecessary.
If you don't supply the compiler with that information, it cannot guess, so it reports an error.
I'm trying to do some code optimization to eliminate branches. The original C code is:
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
I intend to replace it with assembly code like the following:
mov a, %rax
mov b, %rbx
mov k, %rcx
xor %rdx %rdx
shl 1, %rcx
cmp %rax, %rax
setb %rdx
add %rdx,%rcx
mov %rcx, k
So I wrote C inline assembly code like the below:
#define next(a, b, k)\
__asm__("shl $0x1, %0; \
xor %%rbx, %%rbx; \
cmp %1, %2; \
setb %%rbx; \
addl %%rbx,%0;":"+c"(k) :"g"(a),"g"(b))
When I compile the code, I get these errors:
operand type mismatch for `add'
operand type mismatch for `setb'
How can I fix it?
Here are the mistakes in your code:
Error: operand type mismatch for 'cmp' -- One of CMP's operands must be a register. You're probably generating code that's trying to compare two immediates. Change the second operand's constraint from "g" to "r". (See GCC Manual - Extended Asm - Simple Constraints)
Error: operand type mismatch for 'setb' -- SETB only takes 8 bit operands, i.e. setb %bl works while setb %rbx doesn't.
The C expression T = (A < B) should translate to cmp B,A; setb T in AT&T x86 assembler syntax. You had the two operands to CMP in the wrong order. Remember that CMP works like SUB.
Once you realize the first two error messages are produced by the assembler, it follows that the trick to debugging them is to look at the assembler code generated by gcc. Try gcc $CFLAGS -S t.c and compare the problematic lines in t.s with an x86 opcode reference. Focus on the allowed operand codes for each instruction and you'll quickly see the problems.
In the fixed source code posted below, I assume your operands are unsigned since you're using SETB instead of SETL. I switched from using RBX to RCX to hold the temporary value because RCX is a call clobbered register in the ABI and used the "=&c" constraint to mark it as an earlyclobber operand since RCX is cleared before the inputs a and b are read:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
static uint64_t next(uint64_t a, uint64_t b, uint64_t k)
{
uint64_t tmp;
__asm__("shl $0x1, %[k];"
"xor %%rcx, %%rcx;"
"cmp %[b], %[a];"
"setb %%cl;"
"addq %%rcx, %[k];"
: /* outputs */ [k] "+g" (k), [tmp] "=&c" (tmp)
: /* inputs */ [a] "r" (a), [b] "g" (b)
: /* clobbers */ "cc");
return k;
}
int main()
{
uint64_t t, t0, k;
k = next(1, 2, 0);
printf("%" PRId64 "\n", k);
scanf("%" SCNd64 "%" SCNd64, &t, &t0);
k = next(t, t0, k);
printf("%" PRId64 "\n", k);
return 0;
}
main() translates to:
<+0>: push %rbx
<+1>: xor %ebx,%ebx
<+3>: mov $0x4006c0,%edi
<+8>: mov $0x1,%bl
<+10>: xor %eax,%eax
<+12>: sub $0x10,%rsp
<+16>: shl %rax
<+19>: xor %rcx,%rcx
<+22>: cmp $0x2,%rbx
<+26>: setb %cl
<+29>: add %rcx,%rax
<+32>: mov %rax,%rbx
<+35>: mov %rax,%rsi
<+38>: xor %eax,%eax
<+40>: callq 0x400470 <printf@plt>
<+45>: lea 0x8(%rsp),%rdx
<+50>: mov %rsp,%rsi
<+53>: mov $0x4006c5,%edi
<+58>: xor %eax,%eax
<+60>: callq 0x4004a0 <__isoc99_scanf@plt>
<+65>: mov (%rsp),%rax
<+69>: mov %rbx,%rsi
<+72>: mov $0x4006c0,%edi
<+77>: shl %rsi
<+80>: xor %rcx,%rcx
<+83>: cmp 0x8(%rsp),%rax
<+88>: setb %cl
<+91>: add %rcx,%rsi
<+94>: xor %eax,%eax
<+96>: callq 0x400470 <printf@plt>
<+101>: add $0x10,%rsp
<+105>: xor %eax,%eax
<+107>: pop %rbx
<+108>: retq
You can see the result of next() being moved into RSI before each printf() call.
Given that gcc (and it looks like gcc inline assembler) produces:
leal (%rdx,%rdx), %eax
xorl %edx, %edx
cmpl %esi, %edi
setl %dl
addl %edx, %eax
ret
from
int f(int a, int b, int k)
{
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
I would think that writing your own inline assembler is a complete waste of time and effort.
As always, BEFORE you start writing inline assembler, check what the compiler actually does. If your compiler doesn't produce this code, then you may need to upgrade the version of compiler to something a bit newer (I reported this sort of thing to Jan Hubicka [gcc maintainer for x86-64 at the time] ca 2001, and I'm sure it's been in gcc for quite some time).
You could just do this and the compiler will not generate a branch:
k = (k<<1) + (a < b) ;
But if you must, I fixed some stuff in your code now it should work as expected:
__asm__(
"shl $0x1, %0; \
xor %%eax, %%eax; \
cmpl %3, %2; \
setb %%al; \
addl %%eax, %0;"
:"=r"(k) /* output */
:"0"(k), "r"(a),"r"(b) /* input */
:"eax", "cc" /* clobbered register */
);
Note that setb expects a reg8 or mem8. You should add eax to the clobber list because you change it, as well as cc just to be safe. As for the register constraints, I'm not sure why you used those, but =r and r work just fine.
You also need to add k to both the input and output lists. There's more in the GCC-Inline-Assembly-HOWTO.
Summary:
Branchless might not even be the best choice.
Inline asm defeats some other optimizations; try other source changes first, e.g. ? : often compiles branchlessly, and booleans can be used as integer 0/1.
If you use inline-asm, make sure you optimize the constraints as well to make the compiler-generated code outside your asm block efficient.
The whole thing is doable with cmp %[b], %[a] / adc %[k],%[k]. Your hand-written code is worse than what compilers generate, but they are beatable in the small scale for cases where constant-propagation / CSE / inlining didn't make this code (partially) optimize away.
If your compiler generates branchy code, and profiling shows that was the wrong choice (high counts for branch misses at that instruction, e.g. on Linux perf record -ebranch-misses ./my_program && perf report), then yes you should do something to get branchless code.
(Branchy can be an advantage if it's predictable: branching means out-of-order execution of code that uses (k<<1) + 1 doesn't have to wait for a and b to be ready. LLVM recently merged a patch that makes x86 code-gen more branchy by default, because modern x86 CPUs have such powerful branch predictors. Clang/LLVM nightly build (with that patch) does still choose branchless for this C source, at least in a stand-alone function outside a loop).
If this is for a binary search, branchless probably is a good strategy, unless you see the same search often. (Branching + speculative execution means you have a control dependency off the critical path, so dependent work can start speculatively instead of waiting for the compare result.)
Compile with profile-guided optimization so the compiler has run-time info on which branches almost always go one way. It still might not know the difference between a poorly-predictable branch and one that does overall take both paths but with a simple pattern. (Or that's predictable based on global history; many modern branch-predictor designs index based on branch history, so which way the last few branches went determine which table entry is used for the current branch.)
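A typical profile-guided build with gcc (generic flags, nothing specific to this code) looks like:

gcc -O3 -fprofile-generate prog.c -o prog
./prog < representative-input     # run with training data to record branch statistics
gcc -O3 -fprofile-use prog.c -o prog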
Related: gcc optimization flag -O3 makes code slower then -O2 shows a case where a sorted array makes for near-perfect branch prediction for a condition inside a loop, and gcc -O3's branchless code (without profile guided optimization) bottlenecks on a data dependency from using cmov. But -O3 -fprofile-use makes branchy code. (Also, a different way of writing it makes lower-latency branchless code that also auto-vectorizes better.)
Inline asm should be your last resort if you can't hand-hold the compiler into making the asm you want, e.g. by writing it as (k<<1) + (a<b) as others have suggested.
Inline asm defeats many optimizations, most obvious constant-propagation (as seen in some other answers, where gcc moves a constant into a register outside the block of inline-asm code). https://gcc.gnu.org/wiki/DontUseInlineAsm.
You could maybe use if(__builtin_constant_p(a)) and so on to use a pure C version when the compiler has constant values for some/all of the variables, but that's a lot more work. (And doesn't work well with Clang, where __builtin_constant_p() is evaluated before function inlining.)
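A sketch of that approach (the function name is mine, and remember the Clang caveat just mentioned):

unsigned long next_maybe_asm(unsigned long a, unsigned long b, unsigned long k)
{
    if (__builtin_constant_p(a) && __builtin_constant_p(b))
        return (k<<1) + (a < b);   /* pure C path: lets the compiler fold constants */
    __asm__("cmpq %[b], %[a] \n\t"
            "adc %[k],%[k]"
            : [k] "+r" (k)
            : [a] "r" (a), [b] "re" (b)
            : "cc");
    return k;
}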
Even then (once you've limited things to cases where the inputs aren't compile-time constants), it's not possible to give the compiler the full range of options, because you can't use different asm blocks depending on which constraints are matched (e.g. a in a register and b in memory, or vice versa.) In cases where you want to use a different instruction depending on the situation, you're screwed, but here we can use multi-alternative constraints to expose most of the flexibility of cmp.
It's still usually better to let the compiler make near-optimal code than to use inline asm. Inline-asm destroys the ability of the compiler to reuse any temporary results, or to spread out the instructions to mix with other compiler-generated code. (Instruction-scheduling isn't a big deal on x86 because of good out-of-order execution, but still.)
That asm is pretty crap. If you get a lot of branch misses, it's better than a branchy implementation, but a much better branchless implementation is possible.
Your a<b is an unsigned compare (you're using setb, the unsigned below condition). So your compare result is in the carry flag. x86 has an add-with-carry instruction. Furthermore, k<<1 is the same thing as k+k.
So the asm you want (compiler-generated or with inline asm) is:
# k in %rax, a in %rdi, b in %rsi for this example
cmp %rsi, %rdi # CF = (a < b) = the carry-out from edi - esi
adc %rax, %rax # eax = (k<<1) + CF = (k<<1) + (a < b)
Compilers are smart enough to use add or lea for a left-shift by 1, and some are smart enough to use adc instead of setb, but they don't manage to combine both.
Writing a function with register args and a return value is often a good way to see what compilers might do, although it does force them to produce the result in a different register. (See also this Q&A, and Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”).
// I also tried a version where k is a function return value,
// or where k is a global, so it's in the same register.
unsigned funcarg(unsigned a, unsigned b, unsigned k) {
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
On the Godbolt compiler explorer, along with a couple of other versions. (I used unsigned in this version because you had addl in your asm. Using unsigned long makes everything except the xor-zeroing use 64-bit registers; xor %eax,%eax is still the best way to zero RAX.)
# gcc7.2 -O3 When it can keep the value in the same reg, uses add instead of lea
leal (%rdx,%rdx), %eax #, <retval>
cmpl %esi, %edi # b, a
adcl $0, %eax #, <retval>
ret
#clang 6.0 snapshot -O3
xorl %eax, %eax
cmpl %esi, %edi
setb %al
leal (%rax,%rdx,2), %eax
retq
# ICC18, same as gcc but fails to save a MOV
addl %edx, %edx #14.16
cmpl %esi, %edi #17.12
adcl $0, %edx #17.12
movl %edx, %eax #17.12
ret #17.12
MSVC is the only compiler that doesn't make branchless code without hand-holding. (k<<1) + (a < b); gives us exactly the same xor/cmp/setb/lea sequence as clang above (but with the Windows x86-64 calling convention).
funcarg PROC ; x86-64 MSVC CL19 -Ox
lea eax, DWORD PTR [r8*2+1]
cmp ecx, edx
jb SHORT $LN3#funcarg
lea eax, DWORD PTR [r8+r8] ; conditionally jumped over
$LN3#funcarg:
ret 0
Inline asm
The other answers cover the problems with your implementation pretty well. To debug assembler errors in inline asm, use gcc -O3 -S -fverbose-asm to see what the compiler is feeding to the assembler, with the asm template filled in. You would have seen addl %rax, %ecx or something.
This optimized implementation uses multi-alternative constraints to let the compiler pick either the cmp $imm, r/m, cmp r/m, r, or cmp r, r/m form of CMP. I used two alternatives that split things up not by opcode but by which side included the possible memory operand ("rme" is like "g" (rmi), but limited to 32-bit sign-extended immediates).
unsigned long inlineasm(unsigned long a, unsigned long b, unsigned long k)
{
__asm__("cmpq %[b], %[a] \n\t"
"adc %[k],%[k]"
: /* outputs */ [k] "+r,r" (k)
: /* inputs */ [a] "r,rm" (a), [b] "rme,re" (b)
: /* clobbers */ "cc"); // "cc" clobber is implicit for x86, but it doesn't hurt
return k;
}
I put this on Godbolt with callers that inline it in different contexts. gcc7.2 -O3 does what we expect for the stand-alone version (with register args).
inlineasm:
movq %rdx, %rax # k, k
cmpq %rsi, %rdi # b, a
adc %rax,%rax # k
ret
We can look at how well our constraints work by inlining into other callers:
unsigned long call_with_mem(unsigned long *aptr) {
return inlineasm(*aptr, 55555, 4);
}
# gcc
movl $4, %eax #, k
cmpq $55555, (%rdi) #, *aptr_3(D)
adc %rax,%rax # k
ret
With a larger immediate, we get movabs into a register. (But with an "i" or "g" constraint, gcc would emit code that doesn't assemble, or truncates the constant, trying to use a large immediate constant for cmpq.)
Compare what we get from pure C:
unsigned long call_with_mem_nonasm(unsigned long *aptr) {
return handhold(*aptr, 5, 4);
}
# gcc -O3
xorl %eax, %eax # tmp93
cmpq $4, (%rdi) #, *aptr_3(D)
setbe %al #, tmp93
addq $8, %rax #, k
ret
adc $8, %rax without setc would probably have been better, but we can't get that from inline asm without __builtin_constant_p() on k.
clang often picks the mem alternative if there is one, so it does this: /facepalm. Don't use inline asm.
inlineasm: # clang 5.0
movq %rsi, -8(%rsp)
cmpq -8(%rsp), %rdi
adcq %rdx, %rdx
movq %rdx, %rax
retq
BTW, unless you're going to optimize the shift into the compare-and-add, you can and should have asked the compiler for k<<1 as an input.
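For instance, a sketch of that version (my code, untested; the compiler now gets to pick add or lea for the shift, and only the flag-dependent part stays in asm):

unsigned long inlineasm_shifted(unsigned long a, unsigned long b, unsigned long k)
{
    unsigned long k2 = k << 1;    /* compiler's job: add, lea, or folded into a caller */
    __asm__("cmpq %[b], %[a] \n\t"
            "adcq $0, %[k2]"      /* just add the carry from the compare */
            : [k2] "+r" (k2)
            : [a] "r" (a), [b] "re" (b)
            : "cc");
    return k2;
}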