Alternative to mangling jmp_buf in C for a context switch

In the setjmp.h library on Linux systems, the pointers stored in jmp_buf are encrypted (mangled); to decrypt them we use a mangle function:
static long int i64_ptr_mangle(long int p) {
long int ret;
asm(" mov %1, %%rax;\n"
" xor %%fs:0x30, %%rax;"
" rol $0x11, %%rax;"
" mov %%rax, %0;"
: "=r"(ret)
: "r"(p)
: "%rax"
);
return ret;
}
I need to save the context and change the stack pointer, base pointer, and program counter in the jmp_buf. Is there any alternative to this function that I can use? I am trying to build a basic thread library and can't get my head around this. I can't use ucontext.h.

You might as well roll your own version of setjmp/longjmp; even if you reverse engineered that mess, your result will be more fragile than a proper version.
You will need to have a peek at the calling conventions for your environment, but mainly something like:
savectx:
mov 4(%esp), %eax       # eax -> context buffer (first argument)
mov %ebx, _BX(%eax)     # save the callee-saved registers
mov %esi, _SI(%eax)
mov %edi, _DI(%eax)
mov %ebp, _BP(%eax)
pushf; pop _FL(%eax)    # save the flags
pop _PC(%eax)           # the return address becomes the saved PC
mov %esp, _SP(%eax)     # save esp as it will be after we return
push _PC(%eax)          # put the return address back
xor %eax, %eax          # savectx itself returns 0
ret
loadctx:
mov 4(%esp), %edx       # edx -> context buffer
mov 8(%esp), %eax       # eax = value loadctx appears to return
mov _BX(%edx), %ebx     # restore the callee-saved registers
...
push _FL(%edx)
popf                    # restore the flags
mov _SP(%edx), %esp     # switch to the saved stack
jmp *_PC(%edx)          # resume at the saved PC
Then you define your register layout maybe like:
#define _PC 0
#define _SP 4
#define _FL 8
...
This should work as is in a dated compiler, like gcc 2.x. More modern compilers have been, uh, enhanced, to rely on thread-local storage (TLS) and the like. You may have to add bits to your context.
Another enhancement is stack checking, typically layered on TLS. Even if you disable stack checking, it is possible that libraries you use will rely on it, so you will have to swap the appropriate entries.
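On the C side, a minimal sketch of how a basic thread library might describe such a context and prepare a brand-new thread (savectx, loadctx and make_context are hypothetical names, not part of the answer above, and the slot order past _FL is up to you):
#include <stddef.h>

/* One slot per register saved by the assembly above (32-bit layout). */
typedef struct {
    unsigned long pc, sp, fl, bx, si, di, bp;   /* offsets _PC, _SP, _FL, ... */
} context_t;

extern int  savectx(context_t *ctx);               /* returns 0 when saving      */
extern void loadctx(context_t *ctx, int retval);   /* resumes ctx, never returns */

/* Point a context at a fresh stack so that loadctx(ctx, 1) starts func(). */
static void make_context(context_t *ctx, void (*func)(void),
                         void *stack, size_t stacksize)
{
    unsigned long top = ((unsigned long)stack + stacksize) & ~0xfUL; /* align */
    ctx->sp = top;                    /* empty, downward-growing stack        */
    ctx->pc = (unsigned long)func;    /* first instruction of the new thread  */
    ctx->fl = 0;
    /* A real library would also push the address of a thread-exit routine onto
       the new stack so that returning from func() lands somewhere sane. */
}
savectx(&ctx) followed later by loadctx(&ctx, 1) then behaves like setjmp/longjmp, and make_context is what the scheduler uses to start a fresh thread.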

Related

Segmentation fault in assembly when multiplying registers?

I was trying to convert the following C code into assembly. Here is the C code:
typedef struct {
int x;
int y;
} point;
int square_distance( point * p ) {
return p->x * p->x + p->y * p->y;
}
My assembly code is as follows:
square_distance:
.LFB23:
.cfi_startproc
movl (%edi), %edx
imull %edx, %edx
movl 4(%edi), %eax
imull %eax, %eax
addl %edx, %eax
ret
.cfi_endproc
I get a segmentation fault when I try to run this program. Could someone please explain why? Thanks! I would be grateful!
Your code is 32-bit code (x86), but you apply the calling convention used with 64-bit code (x64). This obviously cannot work.
The x86 calling convention passes all parameters on the stack.
The x64 calling convention passes the first parameter in rdi, the second in rsi, the third in rdx, etc. (I'm not sure which registers are used if there are more than 3 parameters; this might also depend on your platform.)
Your code is presumably more or less correct as x64 code, which would look something like this:
square_distance:
movl (%rdi), %edx
imull %edx, %edx
movl 4(%rdi), %eax
imull %eax, %eax
addl %edx, %eax
ret
With x86 code the parameters are passed on the stack and the corresponding correct code would be something like this:
square_distance:
movl 4(%esp), %edx
movl (%edx), %eax
imull %eax, %eax
movl 4(%edx), %edx
imull %edx, %edx
addl %edx, %eax
ret
In general, the subject of calling conventions is vast: there are other calling conventions depending on the platform, and even within the same platform different calling conventions can coexist in certain cases.
Just want to supplement Jabberwocky's answer, because my reputation is not enough to comment.
The way parameters are passed when calling functions (also known as the calling convention) differs between architectures and operating systems (OS). You can find many common calling conventions in this wiki.
From the wiki we can see that the x64 calling convention on *nix passes the first six parameters in the RDI, RSI, RDX, RCX, R8 and R9 registers, and the rest on the stack.
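To sanity-check the corrected x64 version, a small C driver can be linked against it (the file names and the .text/.globl boilerplate are assumptions, not part of the answers above):
/* main.c -- build together with the x64 assembly above, saved e.g. as square.s
   with a ".text" / ".globl square_distance" header:
       gcc main.c square.s -o sqdist                                          */
#include <stdio.h>

typedef struct {
    int x;
    int y;
} point;

extern int square_distance(point *p);   /* defined in the assembly file */

int main(void)
{
    point p = { 3, 4 };
    printf("%d\n", square_distance(&p)); /* prints 25 */
    return 0;
}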

how to call printf in C __asm__()? [duplicate]

This question already has an answer here:
Calling printf in extended inline ASM
(1 answer)
Closed 3 years ago.
I am trying to print AAAA using C __asm__ as follows:
#include <stdio.h>
int main()
{
__asm__("sub $0x150, %rsp\n\t"
"mov $0x0,%rax\n\t"
"lea -0x140(%rbp), %rax\n\t"
"movl $0x41414141,(%rax)\n\t"
"movb $0x0, 0x4(%rax)\n\t"
"lea -0x140(%rbp), %rax\n\t"
"mov %rax, %rdi\n\t"
"call printf\n\t");
return 0;
}
Disassembly:
Dump of assembler code for function main:
0x0000000000400536 <+0>: push %rbp
0x0000000000400537 <+1>: mov %rsp,%rbp
0x000000000040053a <+4>: sub $0x150,%rsp
0x0000000000400541 <+11>: mov $0x0,%rax
0x0000000000400548 <+18>: lea -0x140(%rbp),%rax
0x000000000040054f <+25>: movl $0x41414141,(%rax)
0x0000000000400555 <+31>: movb $0x0,0x4(%rax)
0x0000000000400559 <+35>: lea -0x140(%rbp),%rax
0x0000000000400560 <+42>: mov %rax,%rdi
0x0000000000400563 <+45>: callq 0x400410 <printf@plt>
0x0000000000400568 <+50>: mov $0x0,%eax
0x000000000040056d <+55>: pop %rbp
0x000000000040056e <+56>: retq
End of assembler dump.
When running the code there are basically two issues: #1, it does not print "AAAA"; #2, when RIP reaches retq, it throws a segmentation fault.
Is there anything I am missing?
Your code has the following problems:
The main problem of your code is that you fail to restore the stack pointer to its previous value after calling printf. The compiler does not know that you modified the stack pointer and tries to return to whatever address is at (%rsp), crashing your program. To fix this, restore rsp to the value it had at the beginning of the asm statement.
you forgot to set al to 0 to indicate that no floating-point values are being passed to printf. This is needed because printf is a variadic function: al must hold an upper bound on the number of vector registers used to pass arguments. Setting al to a value that is too high doesn't cause problems, but it is still good practice to set al correctly. To fix this, set al to 0 before calling printf.
you cannot safely assume that the compiler is going to set up rbp as a base pointer, yet your code addresses the buffer relative to rbp. To fix this, make sure to only reference rsp instead of rbp, or use extended asm to let the compiler figure this out.
take care not to overwrite the stack. Note that the 128 bytes below rsp are called the red zone and must be preserved, too. This is fine in your code as you allocate enough stack space to avoid this issue.
your code tacitly assumes that the stack is aligned to a multiple of 16 bytes on entry to the asm statement. This too is an assumption you cannot make. To fix this, align the stack pointer to a multiple of 16 bytes before calling printf.
you overwrite a bunch of registers; apart from rax and rdi, the call to printf may overwrite any caller-saved register. The compiler does not know that you did so and may assume that all registers kept the value they had before. To fix this, declare an appropriate clobber list or save and restore all registers you plan to overwrite, including all caller-saved registers.
So TL;DR: don't call functions from inline assembly and don't use inline assembly as a learning tool! It's very hard to get right and does not teach you anything useful about assembly programming.
Example in Pure Assembly
Here is how I would write your code in normal assembly. This is what I suggest you do:
.section .rodata # enter the read-only data section
str: .string "AAAA" # and place the string we want to print there
.section .text # enter the text section
.global main # make main visible to the link editor
main: push %rbp # establish...
mov %rsp, %rbp # ...a stack frame (and align rsp to 16 bytes)
lea str(%rip), %rdi # load the effective address of str to rdi
xor %al, %al # tell printf we have no floating point args
call printf # call printf(str)
leave # tear down the stack frame
ret # return
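Assuming the listing is saved as, say, print_aaaa.s (the file name is arbitrary), it can be assembled and linked against libc with the C compiler driver (on some toolchains you may additionally need -no-pie):
gcc print_aaaa.s -o print_aaaa
./print_aaaa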
Example in Inline Assembly
Here is how you could call a function in inline assembly. Understand that you should never ever do this. Not even for educational purposes. It's just terrible to do this. If you want to call a function in C, do it in C code, not inline assembly.
That said, you could do something like this. Note that we use extended assembly to make our life a lot easier:
int main(void)
{
char fpargs = 0; /* dummy variable: no fp arguments */
const char *str = "AAAA";/* the string we want to print */
__asm__ volatile ( /* volatile means we do more than just
returning a result and ban the compiler
from optimising our asm statement away */
"mov %%rsp, %%rbx;" /* save the stack pointer */
"and $~0xf, %%rsp;" /* align stack to 16 bytes */
"sub $128, %%rsp;" /* skip red zone */
"call printf;" /* do the actual function call */
"mov %%rbx, %%rsp" /* restore the stack pointer */
: /* (pseudo) output operands: */
"+a"(fpargs), /* al must be 0 (no FP arguments) */
"+D"(str) /* rdi contains pointer to string "AAAA" */
: /* input operands: none */
: /* clobber list */
"rsi", "rdx", /* all other registers... */
"rcx", "r8", "r9", /* ...the function printf... */
"r10", "r11", /* ...is allowed to overwrite */
"rbx", /* and rbx which we use for scratch space */
"cc", /* and flags */
"memory"); /* and arbitrary memory regions */
return (0); /* wrap this up */
}

Does GCC generate suboptimal code for static branch prediction?

From my university course I heard that, by convention, it is better to place the more likely case in the if branch rather than in the else branch, which may help the static branch predictor. For instance:
if (check_collision(player, enemy)) { // very unlikely to be true
doA();
} else {
doB();
}
may be rewritten as:
if (!check_collision(player, enemy)) {
doB();
} else {
doA();
}
I found a blog post Branch Patterns, Using GCC, which explains this phenomenon in more detail:
Forward branches are generated for if statements. The rationale for
making them not likely to be taken is that the processor can take
advantage of the fact that instructions following the branch
instruction may already be placed in the instruction buffer inside the
Instruction Unit.
Next to it, it says (emphasis mine):
When writing an if-else statement, always make the "then" block more
likely to be executed than the else block, so the processor can take
advantage of instructions already placed in the instruction fetch
buffer.
Finally, there is an article written by Intel, Branch and Loop Reorganization to Prevent Mispredicts, which summarizes this with two rules:
Static branch prediction is used when there is no data collected by the
microprocessor when it encounters a branch, which is typically the
first time a branch is encountered. The rules are simple:
A forward branch defaults to not taken
A backward branch defaults to taken
In order to effectively write your code to take advantage of these
rules, when writing if-else or switch statements, check the most
common cases first and work progressively down to the least common.
As I understand it, the idea is that a pipelined CPU may keep following the instructions from the instruction cache without breaking the flow by jumping to another address within the code segment. I am aware, though, that this may be largely oversimplified for modern CPU microarchitectures.
However, it looks like GCC doesn't respect these rules. Given the code:
extern void foo();
extern void bar();
int some_func(int n)
{
if (n) {
foo();
}
else {
bar();
}
return 0;
}
it generates (version 6.3.0 with -O3 -mtune=intel):
some_func:
lea rsp, [rsp-8]
xor eax, eax
test edi, edi
jne .L6 ; here, forward branch if (n) is (conditionally) taken
call bar
xor eax, eax
lea rsp, [rsp+8]
ret
.L6:
call foo
xor eax, eax
lea rsp, [rsp+8]
ret
The only way I found to force the desired behavior is to rewrite the if condition using __builtin_expect as follows:
if (__builtin_expect(n, 1)) { // force n condition to be treated as true
so the assembly code would become:
some_func:
lea rsp, [rsp-8]
xor eax, eax
test edi, edi
je .L2 ; here, backward branch is (conditionally) taken
call foo
xor eax, eax
lea rsp, [rsp+8]
ret
.L2:
call bar
xor eax, eax
lea rsp, [rsp+8]
ret
The short answer: no, it is not.
GCC does a metric ton of non-trivial optimizations, and one of them is guessing branch probabilities based on the control flow graph.
According to the GCC manual:
-fno-guess-branch-probability
Do not guess branch probabilities using
heuristics.
GCC uses heuristics to guess branch probabilities if they are not
provided by profiling feedback (-fprofile-arcs). These heuristics are
based on the control flow graph. If some branch probabilities are
specified by __builtin_expect, then the heuristics are used to guess
branch probabilities for the rest of the control flow graph, taking
the __builtin_expect info into account. The interactions between the
heuristics and __builtin_expect can be complex, and in some cases, it
may be useful to disable the heuristics so that the effects of
__builtin_expect are easier to understand.
-freorder-blocks may swap branches as well.
Also, as the OP mentioned, the behavior can be overridden with __builtin_expect.
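For reference, the usual way to package __builtin_expect is the likely/unlikely macro pair, a common idiom rather than anything specific to this answer:
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* e.g. with the OP's example: */
if (unlikely(check_collision(player, enemy))) {
    doA();
} else {
    doB();
}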
Proof
Look at the following listing.
#include <stdio.h>
void doA() { printf("A\n"); }
void doB() { printf("B\n"); }
int check_collision(void* a, void* b)
{ return a == b; }
void some_func (void* player, void* enemy) {
if (check_collision(player, enemy)) {
doA();
} else {
doB();
}
}
int main() {
// warming up gcc statistic
some_func((void*)0x1, NULL);
some_func((void*)0x2, NULL);
some_func((void*)0x3, NULL);
some_func((void*)0x4, NULL);
some_func((void*)0x5, NULL);
some_func(NULL, NULL);
return 0;
}
It is obvious that check_collision will return 0 most of the time. So the doB() branch is likely, and GCC can guess this:
gcc -O main.c -o opt.a
objdump -d opt.a
The asm of some_func is:
sub $0x8,%rsp
cmp %rsi,%rdi
je 6c6 <some_func+0x18>
mov $0x0,%eax
callq 68f <doB>
add $0x8,%rsp
retq
mov $0x0,%eax
callq 67a <doA>
jmp 6c1 <some_func+0x13>
But, of course, we can keep GCC from being too smart:
gcc -fno-guess-branch-probability main.c -o non-opt.a
objdump -d non-opt.a
And we will get:
push %rbp
mov %rsp,%rbp
sub $0x10,%rsp
mov %rdi,-0x8(%rbp)
mov %rsi,-0x10(%rbp)
mov -0x10(%rbp),%rdx
mov -0x8(%rbp),%rax
mov %rdx,%rsi
mov %rax,%rdi
callq 6a0 <check_collision>
test %eax,%eax
je 6ef <some_func+0x33>
mov $0x0,%eax
callq 67a <doA>
jmp 6f9 <some_func+0x3d>
mov $0x0,%eax
callq 68d <doB>
nop
leaveq
retq
So GCC will leave branches in source order.
I used gcc 7.1.1 for those tests.
I Think That You've Found A "Bug"
The funny thing is that optimization for space and no optimization are the only cases in which the "optimal" instruction code is generated: gcc -S [-O0 | -Os] source.c
some_func:
LFB0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
cmpl $0, 8(%ebp)
je L2
call _foo
jmp L3
L2:
call _bar
L3:
movl $0, %eax
# Or, for -Os:
# xorl %eax, %eax
leave
ret
My point is that ...
some_func:
LFB0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
cmpl $0, 8(%ebp)
je L2
call _foo
... up to & through the call to foo everything is "optimal", in the traditional sense, regardless of the exit strategy.
Optimality is ultimately determined by the processor, of course.

Difference between ebp based addressing and esp addressing

I have written some code to learn about the call stack, using some inline assembly to pass parameters on the stack. I compiled it with gcc 4.1.2 (on CentOS 5.4) and it works well; then I compiled it with gcc 4.8.4 (on Ubuntu 14.04.3) and ran the program, but it always crashes.
I discovered that there are differences in how the variables are referenced: the local variable is addressed using the EBP register in gcc 4.1.2 (CentOS 5.4), while it is addressed using the ESP register in gcc 4.8.4 (Ubuntu 14.04.3). This seems to be the reason why it crashes.
My question is, how can I control whether gcc uses EBP or ESP? Also, what is the difference between them?
Here is the C code:
double fun(double d) {
return d;
}
int main(void) {
double a = 1.6;
double (*myfun)() = fun;
asm volatile("subl $8, %esp\n"
"fstpl (%esp)\n");
myfun();
asm volatile("addl $8, %esp\n");
return 0;
}
Here is the assembly from gcc 4.1.2, which works:
int main(void) {
**......**
double a = 1.6;
0x080483bf <+17>: fldl 0x80484d0
0x080483c5 <+23>: fstpl -0x18(%ebp)
double (*myfun) () = fun;
0x080483c8 <+26>: movl $0x8048384,-0xc(%ebp)
asm volatile("subl $8, %esp\n"
"fstpl (%esp)\n");
0x080483cf <+33>: sub $0x8,%esp
0x080483d2 <+36>: fstpl (%esp)
myfun();
0x080483d5 <+39>: mov -0xc(%ebp),%eax
0x080483d8 <+42>: call *%eax
0x080483da <+44>: fstp %st(0)
asm volatile("addl $8, %esp\n");
0x080483dc <+46>: add $0x8,%esp
**......**
Here is the assembly from gcc 4.8.4, which crashes:
int main(void) {
**......**
double a = 1.6;
0x0804840d <+9>: fldl 0x80484d0
0x08048413 <+15>: fstpl 0x8(%esp)
double (*myfun)() = fun;
0x08048417 <+19>: movl $0x80483ed,0x4(%esp)
asm volatile("subl $8,%esp\n"
"fstpl (%esp)\n");
0x0804841f <+27>: sub $0x8,%esp
0x08048422 <+30>: fstpl (%esp)
myfun();
0x08048425 <+33>: mov 0x4(%esp),%eax
0x08048429 <+37>: call *%eax
0x0804842b <+39>: fstp %st(0)
asm volatile("addl $8,%esp\n");
0x0804842d <+41>: add $0x8,%esp
**......**
There's no real difference between using esp and ebp, except that esp changes with push, pop, call, ret, which sometimes makes it difficult to know where a certain local variable or parameter is located in the stack. That's why ebp gets loaded with esp, so that there is a stable reference point to refer to the function arguments and the local variables.
For a function like this:
int foo( int arg ) {
int a, b, c, d;
....
}
the following assembly is usually generated:
# using Intel syntax, where `mov eax, ebx` puts the value in `ebx` into `eax`
.intel_syntax noprefix
foo:
push ebp # preserve
mov ebp, esp # remember stack
sub esp, 16 # allocate local variables a, b, c, d
...
mov esp, ebp # de-allocate the 16 bytes
pop ebp # restore ebp
ret
Calling this method (foo(0)) would generate something like this:
push 0 # the value for arg; esp becomes esp-4
call foo
add esp, 4 # free the 4 bytes of the argument 'arg'.
Immediately after the call instruction has executed, right before the first instruction of the foo method is executed, [esp] will hold the return address, and [esp+4] the 0 value for arg.
In method foo, if we wanted to load arg into eax (at the ...)
we could use:
mov eax, [ebp + 4 + 4]
because [ebp + 0] holds the previous value of ebp (from the push ebp),
and [ebp + 4] (the original value of esp) holds the return address.
But we could also reference the parameter using esp:
mov eax, [esp + 16 + 4 + 4]
We add 16 because of the sub esp, 16, then 4 because of the push ebp, and another 4 to skip the return address, to arrive at arg.
Similarly accessing the four local variables can be done in two ways:
mov eax, [ebp - 4]
mov eax, [ebp - 8]
mov eax, [ebp - 12]
mov eax, [ebp - 16]
or
mov eax, [esp + 12]
mov eax, [esp + 8]
mov eax, [esp + 4]
mov eax, [esp + 0]
But, whenever esp changes, these instructions must change as well. So, in the end, it does not matter whether esp or ebp is used. It might be more efficient to use esp since you don't have to push ebp; mov ebp, esp; ... mov esp, ebp; pop ebp.
UPDATE
As far as I can tell, there's no way to guarantee your inline assembly will work: gcc 4.8.4 on Ubuntu optimizes out the use of ebp and references everything with esp. It doesn't know that your inline assembly modifies esp, so when it tries to call myfun(), it fetches it from [esp + 4], but it should have fetched it from [esp + 4 + 8].
Here is a workaround: don't use local variables (or parameters) in the function where you use inline assembly that does stack manipulation. To bypass the problem of casting double fun(double) to double fn(), call the function directly in assembly:
void my_call() {
asm volatile("subl $8, %esp\n"
"fstpl (%esp)\n"
"call fun\n"
"addl $8, %esp\n");
}
int main(void) {
my_call();
return 0;
}
You could also place the my_call function in a separate .s (or .S) file:
.text
.global my_call
my_call:
subl $8, %esp
fstpl (%esp)
call fun
addl $8, %esp
ret
and in C:
extern double my_call();
You could also pass fun as an argument:
extern double my_call( double (*myfun)() );
...
my_call( fun );
and
.text
.global my_call
my_call:
sub $8, %esp
fstpl (%esp)
call *12(%esp)
add $8, %esp
ret
Most compilers create EBP-based stack frames. Or, at least they used to. This is the method most people are taught, and it uses EBP as a fixed frame base pointer.
Some compilers create ESP-based stack frames. The reason is simple. It frees up EBP to be used for other uses, and removes the overhead of setting up and restoring the stack frame. It is clearly much harder to visualize, since the stack pointer can be constantly changing.
The problem you are having might be because you are calling APIs that use the stdcall calling convention, which end up unintentionally trashing your stack when they return to the caller. EBP must be preserved by the callee in both cdecl and stdcall functions. However, stdcall routines will clean up the stack with ret 4, for example, thus shrinking its size. The caller must compensate for these types of mishaps and reallocate space on the stack after the call returns.
GCC has the option -fomit-frame-pointer which will turn off EBP-based frames. It's on by default at most optimization levels. You can use -O2 -fno-omit-frame-pointer to optimize normally except for still setting up EBP as a frame pointer.
If you want to learn about the stack and parameter passing conventions (ABI), I suggest you look at the assembly generated by the compiler. You can do this interactively on this site: http://gcc.godbolt.org/#
Try various argument types, variadic functions, passing and returning floats, doubles, structures of different sizes...
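For example, a few declarations worth pasting in (just a starting point, the names are arbitrary):
#include <stdarg.h>

struct small { int x, y; };          /* small aggregate: fits in registers on x86-64 */
struct big   { double a, b, c, d; }; /* large aggregate: returned via a hidden pointer */

struct small make_small(int x, int y) { struct small s = { x, y }; return s; }
struct big   make_big(double d)       { struct big   b = { d, d, d, d }; return b; }

double sum(int n, ...)               /* variadic: see how callers set up al on x86-64 */
{
    va_list ap;
    double total = 0;
    va_start(ap, n);
    while (n--)
        total += va_arg(ap, double);
    va_end(ap);
    return total;
}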
Messing with the stack using inline assembly is too difficult and unpredictable. It is likely to fail in so many ways, you will not learn anything useful.
ebp is normally used for frame pointers. The first instructions for functions using frame pointers are
push ebp ;save ebp
mov ebp,esp ;ebp = esp
sub esp,... ;allocate space for local variables
Then parameters and local variables are at positive/negative offsets from ebp.
Most compilers have an option to not use frame pointers, in which case esp is used as the base pointer. If non-frame-pointer code uses ebp as a generic register, it still needs to be saved.
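For comparison, a sketch of the same shape without a frame pointer (offsets are illustrative):
foo:
    sub esp, 16        ; allocate locals; esp is now the reference point
    ...                ; locals at [esp] .. [esp+12], the argument at [esp+16+4]
    add esp, 16        ; free the locals
    ret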

gcc inline assembly - operand type mismatch for `add', trying to create branchless code

I'm trying to do some code optimization to eliminate branches; the original C code is:
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1)
I intend to replace it with assembly code like below
mov a, %rax
mov b, %rbx
mov k, %rcx
xor %rdx %rdx
shl 1, %rcx
cmp %rax, %rax
setb %rdx
add %rdx,%rcx
mov %rcx, k
So I wrote C inline assembly code like below,
#define next(a, b, k)\
__asm__("shl $0x1, %0; \
xor %%rbx, %%rbx; \
cmp %1, %2; \
setb %%rbx; \
addl %%rbx,%0;":"+c"(k) :"g"(a),"g"(b))
When I compile the code I get these errors:
operand type mismatch for `add'
operand type mismatch for `setb'
How can I fix it?
Here are the mistakes in your code:
Error: operand type mismatch for 'cmp' -- One of CMP's operands must be a register. You're probably generating code that's trying to compare two immediates. Change the second operand's constraint from "g" to "r". (See GCC Manual - Extended Asm - Simple Constraints)
Error: operand type mismatch for 'setb' -- SETB only takes 8 bit operands, i.e. setb %bl works while setb %rbx doesn't.
The C expression T = (A < B) should translate to cmp B,A; setb T in AT&T x86 assembler syntax. You had the two operands to CMP in the wrong order. Remember that CMP works like SUB.
Once you realize the first two error messages are produced by the assembler, it follows that the trick to debugging them is to look at the assembler code generated by gcc. Try gcc $CFLAGS -S t.c and compare the problematic lines in t.s with an x86 opcode reference. Focus on the allowed operand codes for each instruction and you'll quickly see the problems.
In the fixed source code posted below, I assume your operands are unsigned since you're using SETB instead of SETL. I switched from using RBX to RCX to hold the temporary value because RCX is a call clobbered register in the ABI and used the "=&c" constraint to mark it as an earlyclobber operand since RCX is cleared before the inputs a and b are read:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
static uint64_t next(uint64_t a, uint64_t b, uint64_t k)
{
uint64_t tmp;
__asm__("shl $0x1, %[k];"
"xor %%rcx, %%rcx;"
"cmp %[b], %[a];"
"setb %%cl;"
"addq %%rcx, %[k];"
: /* outputs */ [k] "+g" (k), [tmp] "=&c" (tmp)
: /* inputs */ [a] "r" (a), [b] "g" (b)
: /* clobbers */ "cc");
return k;
}
int main()
{
uint64_t t, t0, k;
k = next(1, 2, 0);
printf("%" PRId64 "\n", k);
scanf("%" SCNd64 "%" SCNd64, &t, &t0);
k = next(t, t0, k);
printf("%" PRId64 "\n", k);
return 0;
}
main() translates to:
<+0>: push %rbx
<+1>: xor %ebx,%ebx
<+3>: mov $0x4006c0,%edi
<+8>: mov $0x1,%bl
<+10>: xor %eax,%eax
<+12>: sub $0x10,%rsp
<+16>: shl %rax
<+19>: xor %rcx,%rcx
<+22>: cmp $0x2,%rbx
<+26>: setb %cl
<+29>: add %rcx,%rax
<+32>: mov %rax,%rbx
<+35>: mov %rax,%rsi
<+38>: xor %eax,%eax
<+40>: callq 0x400470 <printf@plt>
<+45>: lea 0x8(%rsp),%rdx
<+50>: mov %rsp,%rsi
<+53>: mov $0x4006c5,%edi
<+58>: xor %eax,%eax
<+60>: callq 0x4004a0 <__isoc99_scanf@plt>
<+65>: mov (%rsp),%rax
<+69>: mov %rbx,%rsi
<+72>: mov $0x4006c0,%edi
<+77>: shl %rsi
<+80>: xor %rcx,%rcx
<+83>: cmp 0x8(%rsp),%rax
<+88>: setb %cl
<+91>: add %rcx,%rsi
<+94>: xor %eax,%eax
<+96>: callq 0x400470 <printf@plt>
<+101>: add $0x10,%rsp
<+105>: xor %eax,%eax
<+107>: pop %rbx
<+108>: retq
You can see the result of next() being moved into RSI before each printf() call.
Given that gcc (and it looks like gcc inline assembler) produces:
leal (%rdx,%rdx), %eax
xorl %edx, %edx
cmpl %esi, %edi
setl %dl
addl %edx, %eax
ret
from
int f(int a, int b, int k)
{
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
I would think that writing your own inline assembler is a complete waste of time and effort.
As always, BEFORE you start writing inline assembler, check what the compiler actually does. If your compiler doesn't produce this code, then you may need to upgrade the version of compiler to something a bit newer (I reported this sort of thing to Jan Hubicka [gcc maintainer for x86-64 at the time] ca 2001, and I'm sure it's been in gcc for quite some time).
You could just do this and the compiler will not generate a branch:
k = (k<<1) + (a < b) ;
But if you must, I fixed some things in your code; now it should work as expected:
__asm__(
"shl $0x1, %0; \
xor %%eax, %%eax; \
cmpl %3, %2; \
setb %%al; \
addl %%eax, %0;"
:"=r"(k) /* output */
:"0"(k), "r"(a),"r"(b) /* input */
:"eax", "cc" /* clobbered register */
);
Note that setb expects a reg8 or mem8, and you should add eax to the clobber list because you change it, as well as cc just to be safe. As for the register constraints, I'm not sure why you used those, but =r and r work just fine.
And you need to add k to both the input and output lists. There's more in the GCC-Inline-Assembly-HOWTO.
Summary:
Branchless might not even be the best choice.
Inline asm defeats some other optimizations, try other source changes first, e.g. ? : often compiles branchlessly, also use booleans as integer 0/1.
If you use inline-asm, make sure you optimize the constraints as well to make the compiler-generated code outside your asm block efficient.
The whole thing is doable with cmp %[b], %[a] / adc %[k],%[k]. Your hand-written code is worse than what compilers generate, but they are beatable in the small scale for cases where constant-propagation / CSE / inlining didn't make this code (partially) optimize away.
If your compiler generates branchy code, and profiling shows that was the wrong choice (high counts for branch misses at that instruction, e.g. on Linux perf record -ebranch-misses ./my_program && perf report), then yes you should do something to get branchless code.
(Branchy can be an advantage if it's predictable: branching means out-of-order execution of code that uses (k<<1) + 1 doesn't have to wait for a and b to be ready. LLVM recently merged a patch that makes x86 code-gen more branchy by default, because modern x86 CPUs have such powerful branch predictors. Clang/LLVM nightly build (with that patch) does still choose branchless for this C source, at least in a stand-alone function outside a loop).
If this is for a binary search, branchless probably is a good strategy, unless you see the same search often. (Branching + speculative execution means you have a control dependency off the critical path, while branchless code puts a data dependency on the critical path.)
Compile with profile-guided optimization so the compiler has run-time info on which branches almost always go one way. It still might not know the difference between a poorly-predictable branch and one that does overall take both paths but with a simple pattern. (Or that's predictable based on global history; many modern branch-predictor designs index based on branch history, so which way the last few branches went determine which table entry is used for the current branch.)
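With GCC that workflow looks roughly like this (file names are placeholders):
gcc -O3 -fprofile-generate my_program.c -o my_program
./my_program < representative-input   # run on typical data to collect branch statistics
gcc -O3 -fprofile-use my_program.c -o my_program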
Related: gcc optimization flag -O3 makes code slower than -O2 shows a case where a sorted array makes for near-perfect branch prediction for a condition inside a loop, and gcc -O3's branchless code (without profile-guided optimization) bottlenecks on a data dependency from using cmov. But -O3 -fprofile-use makes branchy code. (Also, a different way of writing it makes lower-latency branchless code that also auto-vectorizes better.)
Inline asm should be your last resort if you can't hand-hold the compiler into making the asm you want, e.g. by writing it as (k<<1) + (a<b) as others have suggested.
Inline asm defeats many optimizations, most obvious constant-propagation (as seen in some other answers, where gcc moves a constant into a register outside the block of inline-asm code). https://gcc.gnu.org/wiki/DontUseInlineAsm.
You could maybe use if(__builtin_constant_p(a)) and so on to use a pure C version when the compiler has constant values for some/all of the variables, but that's a lot more work. (And doesn't work well with Clang, where __builtin_constant_p() is evaluated before function inlining.)
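A sketch of that dispatch (a hypothetical wrapper; inlineasm is the constraint-optimized function shown further down in this answer):
static inline unsigned long next_k(unsigned long a, unsigned long b, unsigned long k)
{
    if (__builtin_constant_p(a) && __builtin_constant_p(b))
        return (k << 1) + (a < b);   /* pure C: lets the compiler constant-fold */
    else
        return inlineasm(a, b, k);   /* fall back to the inline-asm version */
}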
Even then (once you've limited things to cases where the inputs aren't compile-time constants), it's not possible to give the compiler the full range of options, because you can't use different asm blocks depending on which constraints are matched (e.g. a in a register and b in memory, or vice versa.) In cases where you want to use a different instruction depending on the situation, you're screwed, but here we can use multi-alternative constraints to expose most of the flexibility of cmp.
It's still usually better to let the compiler make near-optimal code than to use inline asm. Inline asm destroys the ability of the compiler to reuse any temporary results or spread out the instructions to mix with other compiler-generated code. (Instruction scheduling isn't a big deal on x86 because of good out-of-order execution, but still.)
That asm is pretty crap. If you get a lot of branch misses, it's better than a branchy implementation, but a much better branchless implementation is possible.
Your a<b is an unsigned compare (you're using setb, the unsigned below condition). So your compare result is in the carry flag. x86 has an add-with-carry instruction. Furthermore, k<<1 is the same thing as k+k.
So the asm you want (compiler-generated or with inline asm) is:
# k in %rax, a in %rdi, b in %rsi for this example
cmp %rsi, %rdi # CF = (a < b) = the carry-out from edi - esi
adc %rax, %rax # eax = (k<<1) + CF = (k<<1) + (a < b)
Compilers are smart enough to use add or lea for a left-shift by 1, and some are smart enough to use adc instead of setb, but they don't manage to combine both.
Writing a function with register args and a return value is often a good way to see what compilers might do, although it does force them to produce the result in a different register. (See also this Q&A, and Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”).
// I also tried a version where k is a function return value,
// or where k is a global, so it's in the same register.
unsigned funcarg(unsigned a, unsigned b, unsigned k) {
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
On the Godbolt compiler explorer, along with a couple of other versions. (I used unsigned in this version because you had addl in your asm. Using unsigned long makes everything except the xor-zeroing use 64-bit registers; xor %eax,%eax is still the best way to zero RAX.)
# gcc7.2 -O3 When it can keep the value in the same reg, uses add instead of lea
leal (%rdx,%rdx), %eax #, <retval>
cmpl %esi, %edi # b, a
adcl $0, %eax #, <retval>
ret
#clang 6.0 snapshot -O3
xorl %eax, %eax
cmpl %esi, %edi
setb %al
leal (%rax,%rdx,2), %eax
retq
# ICC18, same as gcc but fails to save a MOV
addl %edx, %edx #14.16
cmpl %esi, %edi #17.12
adcl $0, %edx #17.12
movl %edx, %eax #17.12
ret #17.12
MSVC is the only compiler that doesn't make branchless code without hand-holding: (k<<1) + (a < b); gives us exactly the same xor/cmp/setb/lea sequence as clang above (but with the Windows x86-64 calling convention).
funcarg PROC ; x86-64 MSVC CL19 -Ox
lea eax, DWORD PTR [r8*2+1]
cmp ecx, edx
jb SHORT $LN3@funcarg
lea eax, DWORD PTR [r8+r8] ; conditionally jumped over
$LN3@funcarg:
ret 0
Inline asm
The other answers cover the problems with your implementation pretty well. To debug assembler errors in inline asm, use gcc -O3 -S -fverbose-asm to see what the compiler is feeding to the assembler, with the asm template filled in. You would have seen addl %rax, %ecx or something.
This optimized implementation uses multi-alternative constraints to let the compiler pick either the cmp $imm, r/m, cmp r/m, r, or cmp r, r/m form of CMP. I used two alternatives that split things up not by opcode but by which side included the possible memory operand. ("rme" is like "g" (rmi) but limited to 32-bit sign-extended immediates.)
unsigned long inlineasm(unsigned long a, unsigned long b, unsigned long k)
{
__asm__("cmpq %[b], %[a] \n\t"
"adc %[k],%[k]"
: /* outputs */ [k] "+r,r" (k)
: /* inputs */ [a] "r,rm" (a), [b] "rme,re" (b)
: /* clobbers */ "cc"); // "cc" clobber is implicit for x86, but it doesn't hurt
return k;
}
I put this on Godbolt with callers that inline it in different contexts. gcc7.2 -O3 does what we expect for the stand-alone version (with register args).
inlineasm:
movq %rdx, %rax # k, k
cmpq %rsi, %rdi # b, a
adc %rax,%rax # k
ret
We can look at how well our constraints work by inlining into other callers:
unsigned long call_with_mem(unsigned long *aptr) {
return inlineasm(*aptr, 55555, 4);
}
# gcc
movl $4, %eax #, k
cmpq $55555, (%rdi) #, *aptr_3(D)
adc %rax,%rax # k
ret
With a larger immediate, we get movabs into a register. (But with an "i" or "g" constraint, gcc would emit code that doesn't assemble, or truncates the constant, trying to use a large immediate constant for cmpq.)
Compare what we get from pure C:
unsigned long call_with_mem_nonasm(unsigned long *aptr) {
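/* handhold() is the plain-C (k<<1) + (a < b) version from the linked Godbolt example */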
return handhold(*aptr, 5, 4);
}
# gcc -O3
xorl %eax, %eax # tmp93
cmpq $4, (%rdi) #, *aptr_3(D)
setbe %al #, tmp93
addq $8, %rax #, k
ret
adc $8, %rax without the setbe would probably have been better, but we can't get that from inline asm without __builtin_constant_p() on k.
clang often picks the mem alternative if there is one, so it does this: /facepalm. Don't use inline asm.
inlineasm: # clang 5.0
movq %rsi, -8(%rsp)
cmpq -8(%rsp), %rdi
adcq %rdx, %rdx
movq %rdx, %rax
retq
BTW, unless you're going to optimize the shift into the compare-and-add, you can and should have asked the compiler for k<<1 as an input.
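For instance, a variant along those lines (a sketch, not from the original answer) lets the compiler generate the shift and keeps only the compare and add-with-carry in the asm:
unsigned long inlineasm_shift_in_c(unsigned long a, unsigned long b, unsigned long k)
{
    unsigned long k2 = k << 1;       /* the compiler is free to use lea/add for this */
    __asm__("cmp %[b], %[a] \n\t"
            "adc $0, %[k2]"          /* add the carry, i.e. (a < b) unsigned */
            : [k2] "+r" (k2)
            : [a] "r" (a), [b] "re" (b)
            : "cc");
    return k2;
}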
