In the setjmp.h library on Linux systems, the values stored in a jmp_buf are encrypted (pointer-mangled). To produce values the library will accept, a mangle function like this is used:
static long int i64_ptr_mangle(long int p) {
long int ret;
asm(" mov %1, %%rax;\n"
" xor %%fs:0x30, %%rax;"
" rol $0x11, %%rax;"
" mov %%rax, %0;"
: "=r"(ret)
: "r"(p)
: "%rax"
);
return ret;
}
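The inverse (demangling a value read back out of a jmp_buf) reverses the steps: rotate right by 17 first, then XOR. A minimal sketch, assuming the same x86-64 glibc layout where the pointer guard sits at %fs:0x30:
static long int i64_ptr_demangle(long int p) {
long int ret;
asm(" mov %1, %%rax;\n"
" ror $0x11, %%rax;" /* undo the rol $0x11 */
" xor %%fs:0x30, %%rax;" /* undo the XOR with the pointer guard */
" mov %%rax, %0;"
: "=r"(ret)
: "r"(p)
: "%rax"
);
return ret;
}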
I need to save the context and change the stack pointer, base pointer, and program counter in the jmp_buf. Is there any alternative to this function that I can use? I am trying to build a basic thread library and can't get my head around this. I can't use ucontext.h.
You might as well roll your own version of setjmp/longjmp; even if you reverse engineered that mess, your result will be more fragile than a proper version.
You will need to have a peek at the calling conventions for your environment, but mainly something like:
savectx:
mov 4(%esp), %eax # pointer to the context (first argument)
mov %ebx, _BX(%eax)
mov %esi, _SI(%eax)
mov %edi, _DI(%eax)
mov %ebp, _BP(%eax)
pushf; pop _FL(%eax)
mov (%esp), %ecx # the return address becomes the saved PC
mov %ecx, _PC(%eax)
lea 4(%esp), %ecx # esp as it will be once this call has returned
mov %ecx, _SP(%eax)
xor %eax,%eax # savectx itself returns 0, like setjmp
ret
loadctx:
mov 4(%esp), %edx # pointer to the context
mov 8(%esp), %eax # the value savectx will appear to return this time
mov _BX(%edx), %ebx
...
push _FL(%edx)
popf
mov _SP(%edx), %esp
jmp *_PC(%edx)
Then you define your register layout maybe like:
#define _PC 0
#define _SP 4
#define _FL 8
...
This should work as-is with a dated compiler, like gcc 2.x. More modern compilers have been, uh, enhanced, to rely on thread-local storage (TLS) and the like, so you may have to add bits to your context.
Another enhancement is stack checking, typically layered on TLS. Even if you disable stack checking for your own code, libraries you use may rely on it, so you will have to swap the appropriate entries as well.
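To turn those routines into threads you also need to manufacture a context that loadctx can jump into. A minimal sketch, assuming savectx/loadctx assembled from the listings above (with .globl directives added) and completing the offset layout illustratively; none of these names come from a known-good implementation:
#include <stdlib.h>
#define _PC 0
#define _SP 4
#define _FL 8
#define _BP 12
#define _BX 16
#define _SI 20
#define _DI 24
extern int savectx(void *ctx); /* returns 0 on the direct path */
extern void loadctx(void *ctx, int rval); /* never returns */
/* Aim a fresh context at entry(), on its own malloc'd stack.
entry() must never return; it should loadctx() somewhere else instead. */
static void make_context(void *ctx, void (*entry)(void), size_t stack_size)
{
char *stack = malloc(stack_size);
unsigned long sp = ((unsigned long)(stack + stack_size)) & ~0xfUL; /* stacks grow down; keep it aligned */
*(unsigned long *)((char *)ctx + _SP) = sp;
*(unsigned long *)((char *)ctx + _PC) = (unsigned long)entry;
*(unsigned long *)((char *)ctx + _FL) = 0;
}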
I was trying to convert the following C code into assembly. Here is the C code:
typedef struct {
int x;
int y;
} point;
int square_distance( point * p ) {
return p->x * p->x + p->y * p->y;
}
My assembly code is as follows:
square_distance:
.LFB23:
.cfi_startproc
movl (%edi), %edx
imull %edx, %edx
movl 4(%edi), %eax
imull %eax, %eax
addl %edx, %eax
ret
.cfi_endproc
I get a segmentation fault when I try to run this program. Could someone please explain why? I would be grateful!
Your code is 32-bit code (x86), but you apply the calling convention used with 64-bit code (x64). That cannot work.
The 32-bit x86 calling convention passes all parameters on the stack.
The x64 System V calling convention passes the first parameter in rdi, the second in rsi, the third in rdx, then rcx, r8, and r9; anything beyond six goes on the stack (and Windows x64 uses different registers).
Your code is presumably more or less correct for x64 code, that would be something like this:
square_distance:
movl (%rdi), %edx
imull %edx, %edx
movl 4(%rdi), %eax
imull %eax, %eax
addl %edx, %eax
ret
With x86 code the parameters are passed on the stack and the corresponding correct code would be something like this:
square_distance:
movl 4(%esp), %edx
movl (%edx), %eax
imull %eax, %eax
movl 4(%edx), %edx
imull %edx, %edx
addl %edx, %eax
ret
In general, the subject of calling conventions is vast: other calling conventions exist depending on the platform, and even on the same platform several conventions can coexist in certain cases.
Just want to supplement Jabberwocky's answer, because my reputation is not enough to comment.
The way parameters are passed when calling a function (also known as the calling convention) differs between architectures and operating systems (OS). You can find many common calling conventions on this wiki.
From the wiki we can learn that the x64 calling convention on *nix passes the first six parameters through the RDI, RSI, RDX, RCX, R8, and R9 registers, and the rest on the stack.
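To see the register-versus-stack split concretely, here is a small sketch (the function name is illustrative): with seven integer parameters, gcc on x86-64 Linux receives the first six in registers and must load the seventh from the stack.
/* seven.c -- compile with: gcc -O2 -S seven.c (x86-64 Linux).
a..f arrive in rdi, rsi, rdx, rcx, r8, r9; g arrives on the stack,
so gcc emits something like: movq 8(%rsp), %rax / addq %rdi, %rax */
long spill(long a, long b, long c, long d, long e, long f, long g) {
return a + g;
}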
I have a program in C which uses a NASM function. Here is the code of the C program:
#include <stdio.h>
#include <string.h>
#include <math.h>
extern float hyp(float a); // supposed to calculate 1/(2 - a) + 6
void test(float (*f)(float)){
printf("%f %f %f\n", f(2.1), f(2.1), f(2.1));
}
void main(int argc, char** argv){
for(int i = 1; i < argc; i++){
if(!strcmp(argv[i], "calculate")){
test(hyp);
}
}
}
And here is the NASM function:
section .data
a dd 1.0
b dd 2.0
c dd 6.0
section .text
global hyp
hyp:
push ebp
mov ebp, esp
finit
fld dword[b]
fsub dword[ebp + 8]
fstp dword[b]
fld dword[a]
fdiv dword[b]
fadd dword[c]
mov esp, ebp
pop ebp
ret
These programs were linked in Linux with gcc and nasm. Here is the Makefile:
all: project clean
main.o: main.c
gcc -c main.c -o main.o -m32 -std=c99
hyp.o: hyp.asm
nasm -f elf32 -o hyp.o hyp.asm -D UNIX
project: main.o hyp.o
gcc -o project main.o hyp.o -m32 -lm
clean:
rm -rf *.o
When the program is run, it outputs this:
5.767442 5.545455 -4.000010
The last number is correct. My question is: why do these results differ even though the input is the same?
The bug is that you do this:
fstp dword[b]
That overwrites the value of b, so on every subsequent call the "constant" is wrong. In the overall program's output, this shows up as the rightmost evaluation being the only correct one, because the compiler happened to evaluate the arguments to printf from right to left. (A compiler is allowed to evaluate the arguments of a multi-argument function call in any order it wants.)
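You can observe that order with a standalone sketch like this (the exact output is compiler-specific; "1 2" would be equally conforming):
#include <stdio.h>
/* Each call returns the next sequence number, exposing evaluation order. */
static int counter(void) { static int n = 0; return ++n; }
int main(void) {
printf("%d %d\n", counter(), counter()); /* gcc on x86 typically prints "2 1" */
return 0;
}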
You should have used the .rodata section for your constants; then the program would crash rather than overwrite a constant.
You can avoid needing to store and reload an intermediate value by using fdivr instead of fdiv.
hyp:
fld DWORD PTR [b]
fsub DWORD PTR [esp+4]
fdivr DWORD PTR [a]
fadd DWORD PTR [c]
ret
Alternatively, do what a Forth programmer would do, and load the constant 1 before everything else, so it's in ST(1) when it needs to be. This allows you to use fld1 instead of putting 1.0 in memory.
hyp:
fld1
fld DWORD PTR [b]
fsub DWORD PTR [esp+4]
fdivp
fadd DWORD PTR [c]
ret
You do not need to issue a finit, because the ABI guarantees that this was already done during process startup. You do not need to set up EBP for this function, as it does not make any function calls itself (the jargon term for this is "leaf procedure"), nor does it need any scratch space on the stack.
Another alternative, if you have a modern CPU, is to use the newer SSE2 instructions. That gives you normal registers instead of an operand stack, and it also means the calculations are actually done in float instead of 80-bit extended precision, which can be very important: some numerical algorithms will malfunction if they have more floating-point precision than their designers expected. Because you're using the 32-bit ELF ABI, though, the return value still needs to wind up in ST(0), and since there are no direct move instructions between SSE and x87 registers, you have to go through memory. I don't know how to write SSE2 instructions in Intel syntax, sorry.
hyp:
subl $4, %esp
movss b, %xmm1
subss 8(%esp), %xmm1
movss a, %xmm0
divss %xmm1, %xmm0
addss c, %xmm0
movss %xmm0, (%esp)
flds (%esp)
addl $4, %esp
ret
In the 64-bit ELF ABI, with floating-point return values in XMM0 (and argument passing in registers by default as well), that would just be
hyp:
movss b(%rip), %xmm1
subss %xmm0, %xmm1
movss a(%rip), %xmm0
divss %xmm1, %xmm0
addss c(%rip), %xmm0
ret
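For comparison, this is the plain C equivalent; a modern compiler at -O2 on x86-64 will generate essentially the scalar-SSE sequence above by itself (a sketch, not the original poster's code):
/* gcc -O2 -S hyp.c on x86-64 produces subss/divss/addss code
much like the hand-written version above. */
float hyp(float a) {
return 1.0f / (2.0f - a) + 6.0f;
}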
I'm trying to do some Code Optimization to Eliminate Branches, the original c code is
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1)
I intend to replace it with assembly code like below
mov a, %rax
mov b, %rbx
mov k, %rcx
xor %rdx %rdx
shl 1, %rcx
cmp %rax, %rax
setb %rdx
add %rdx,%rcx
mov %rcx, k
so I wrote C inline assembly code like below,
#define next(a, b, k)\
__asm__("shl $0x1, %0; \
xor %%rbx, %%rbx; \
cmp %1, %2; \
setb %%rbx; \
addl %%rbx,%0;":"+c"(k) :"g"(a),"g"(b))
when I compile the code I get these errors:
operand type mismatch for `add'
operand type mismatch for `setb'
How can I fix it?
Here are the mistakes in your code:
Error: operand type mismatch for 'cmp' -- One of CMP's operands must be a register. You're probably generating code that's trying to compare two immediates. Change the second operand's constraint from "g" to "r". (See GCC Manual - Extended Asm - Simple Constraints)
Error: operand type mismatch for 'setb' -- SETB only takes 8 bit operands, i.e. setb %bl works while setb %rbx doesn't.
The C expression T = (A < B) should translate to cmp B,A; setb T in AT&T x86 assembler syntax. You had the two operands to CMP in the wrong order. Remember that CMP works like SUB.
Once you realize the first two error messages are produced by the assembler, it follows that the trick to debugging them is to look at the assembler code generated by gcc. Try gcc $CFLAGS -S t.c and compare the problematic lines in t.s with an x86 opcode reference. Focus on the allowed operand codes for each instruction and you'll quickly see the problems.
In the fixed source code posted below, I assume your operands are unsigned, since you're using SETB instead of SETL. I switched from RBX to RCX to hold the temporary value, because RCX is a call-clobbered register in the ABI, and used the "=&c" constraint to mark it as an early-clobber operand, since RCX is cleared before the inputs a and b are read:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
static uint64_t next(uint64_t a, uint64_t b, uint64_t k)
{
uint64_t tmp;
__asm__("shl $0x1, %[k];"
"xor %%rcx, %%rcx;"
"cmp %[b], %[a];"
"setb %%cl;"
"addq %%rcx, %[k];"
: /* outputs */ [k] "+g" (k), [tmp] "=&c" (tmp)
: /* inputs */ [a] "r" (a), [b] "g" (b)
: /* clobbers */ "cc");
return k;
}
int main()
{
uint64_t t, t0, k;
k = next(1, 2, 0);
printf("%" PRId64 "\n", k);
scanf("%" SCNd64 "%" SCNd64, &t, &t0);
k = next(t, t0, k);
printf("%" PRId64 "\n", k);
return 0;
}
main() translates to:
<+0>: push %rbx
<+1>: xor %ebx,%ebx
<+3>: mov $0x4006c0,%edi
<+8>: mov $0x1,%bl
<+10>: xor %eax,%eax
<+12>: sub $0x10,%rsp
<+16>: shl %rax
<+19>: xor %rcx,%rcx
<+22>: cmp $0x2,%rbx
<+26>: setb %cl
<+29>: add %rcx,%rax
<+32>: mov %rax,%rbx
<+35>: mov %rax,%rsi
<+38>: xor %eax,%eax
<+40>: callq 0x400470 <printf@plt>
<+45>: lea 0x8(%rsp),%rdx
<+50>: mov %rsp,%rsi
<+53>: mov $0x4006c5,%edi
<+58>: xor %eax,%eax
<+60>: callq 0x4004a0 <__isoc99_scanf@plt>
<+65>: mov (%rsp),%rax
<+69>: mov %rbx,%rsi
<+72>: mov $0x4006c0,%edi
<+77>: shl %rsi
<+80>: xor %rcx,%rcx
<+83>: cmp 0x8(%rsp),%rax
<+88>: setb %cl
<+91>: add %rcx,%rsi
<+94>: xor %eax,%eax
<+96>: callq 0x400470 <printf@plt>
<+101>: add $0x10,%rsp
<+105>: xor %eax,%eax
<+107>: pop %rbx
<+108>: retq
You can see the result of next() being moved into RSI before each printf() call.
Given that gcc (and, it looks like, the gcc inline assembler) produces:
leal (%rdx,%rdx), %eax
xorl %edx, %edx
cmpl %esi, %edi
setl %dl
addl %edx, %eax
ret
from
int f(int a, int b, int k)
{
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
I would think that writing your own inline assembler for this is a complete waste of time and effort.
As always, BEFORE you start writing inline assembler, check what the compiler actually does. If your compiler doesn't produce this code, you may need to upgrade to a somewhat newer compiler version (I reported this sort of thing to Jan Hubicka [gcc maintainer for x86-64 at the time] ca 2001, and I'm sure it's been in gcc for quite some time).
You could just do this and the compiler will not generate a branch:
k = (k<<1) + (a < b) ;
But if you must, I fixed some stuff in your code; now it should work as expected:
__asm__(
"shl $0x1, %0; \
xor %%eax, %%eax; \
cmpl %3, %2; \
setb %%al; \
addl %%eax, %0;"
:"=r"(k) /* output */
:"0"(k), "r"(a),"r"(b) /* input */
:"eax", "cc" /* clobbered register */
);
Note that setb expects a reg8 or mem8 operand, and that you should add eax to the clobber list because you change it (as well as cc, just to be safe). As for the register constraints, I'm not sure why you used the ones you did; "=r" and "r" work just fine. And you need to add k to both the input and output lists. There's more in the GCC-Inline-Assembly-HOWTO.
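Folded back into the questioner's macro form, the fix would look something like this (a sketch using the corrected constraints above):
#define next(a, b, k) \
__asm__("shl $0x1, %0;" \
"xor %%eax, %%eax;" \
"cmpl %3, %2;" \
"setb %%al;" \
"addl %%eax, %0;" \
: "=r"(k) /* output */ \
: "0"(k), "r"(a), "r"(b) /* input */ \
: "eax", "cc" /* clobbers */)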
Summary:
Branchless might not even be the best choice.
Inline asm defeats some other optimizations, so try other source changes first: e.g. ?: often compiles branchlessly, and booleans can be used as integer 0/1.
If you use inline-asm, make sure you optimize the constraints as well to make the compiler-generated code outside your asm block efficient.
The whole thing is doable with cmp %[b], %[a] / adc %[k],%[k]. Your hand-written code is worse than what compilers generate, but they are beatable in the small scale for cases where constant-propagation / CSE / inlining didn't make this code (partially) optimize away.
If your compiler generates branchy code, and profiling shows that was the wrong choice (high counts for branch misses at that instruction, e.g. on Linux perf record -ebranch-misses ./my_program && perf report), then yes you should do something to get branchless code.
(Branchy can be an advantage if it's predictable: branching means out-of-order execution of code that uses (k<<1) + 1 doesn't have to wait for a and b to be ready. LLVM recently merged a patch that makes x86 code-gen more branchy by default, because modern x86 CPUs have such powerful branch predictors. Clang/LLVM nightly build (with that patch) does still choose branchless for this C source, at least in a stand-alone function outside a loop).
If this is for a binary search, branchless probably is a good strategy, unless you see the same search often. (Branching + speculative execution means the control dependency is off the critical path: the CPU can speculate past the compare and start the next lookup before a and b are even ready.)
Compile with profile-guided optimization so the compiler has run-time info on which branches almost always go one way. It still might not know the difference between a poorly-predictable branch and one that does overall take both paths but with a simple pattern. (Or that's predictable based on global history; many modern branch-predictor designs index based on branch history, so which way the last few branches went determine which table entry is used for the current branch.)
Related: gcc optimization flag -O3 makes code slower than -O2 shows a case where a sorted array makes for near-perfect branch prediction for a condition inside a loop, and gcc -O3's branchless code (without profile-guided optimization) bottlenecks on a data dependency from using cmov. But -O3 -fprofile-use makes branchy code. (Also, a different way of writing it makes lower-latency branchless code that also auto-vectorizes better.)
Inline asm should be your last resort if you can't hand-hold the compiler into making the asm you want, e.g. by writing it as (k<<1) + (a<b) as others have suggested.
Inline asm defeats many optimizations, most obviously constant-propagation (as seen in some other answers, where gcc moves a constant into a register outside the block of inline-asm code). https://gcc.gnu.org/wiki/DontUseInlineAsm.
You could maybe use if(__builtin_constant_p(a)) and so on to use a pure C version when the compiler has constant values for some/all of the variables, but that's a lot more work. (And doesn't work well with Clang, where __builtin_constant_p() is evaluated before function inlining.)
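A hypothetical sketch of that dispatch (next_asm stands in for an inline-asm implementation like the one further down; all names here are illustrative):
/* Use pure C when the compiler knows the values, so constant-propagation
still works; otherwise fall back to the asm version. */
unsigned long next_asm(unsigned long a, unsigned long b, unsigned long k);
static inline unsigned long next_k(unsigned long a, unsigned long b, unsigned long k)
{
if (__builtin_constant_p(a) && __builtin_constant_p(b))
return (k << 1) + (a < b); /* the compiler can fold this entirely */
return next_asm(a, b, k);
}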
Even then (once you've limited things to cases where the inputs aren't compile-time constants), it's not possible to give the compiler the full range of options, because you can't use different asm blocks depending on which constraints are matched (e.g. a in a register and b in memory, or vice versa.) In cases where you want to use a different instruction depending on the situation, you're screwed, but here we can use multi-alternative constraints to expose most of the flexibility of cmp.
It's still usually better to let the compiler make near-optimal code than to use inline asm. Inline asm destroys the ability of the compiler to reuse any temporary results, or to spread the instructions out to mix with other compiler-generated code. (Instruction scheduling isn't a big deal on x86 because of good out-of-order execution, but still.)
That asm is pretty crap. If you get a lot of branch misses, it's better than a branchy implementation, but a much better branchless implementation is possible.
Your a<b is an unsigned compare (you're using setb, the unsigned below condition). So your compare result is in the carry flag. x86 has an add-with-carry instruction. Furthermore, k<<1 is the same thing as k+k.
So the asm you want (compiler-generated or with inline asm) is:
# k in %rax, a in %rdi, b in %rsi for this example
cmp %rsi, %rdi # CF = (a < b) = the carry-out from rdi - rsi
adc %rax, %rax # rax = (k<<1) + CF = (k<<1) + (a < b)
Compilers are smart enough to use add or lea for a left-shift by 1, and some are smart enough to use adc instead of setb, but they don't manage to combine both.
Writing a function with register args and a return value is often a good way to see what compilers might do, although it does force them to produce the result in a different register. (See also this Q&A, and Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”).
// I also tried a version where k is a function return value,
// or where k is a global, so it's in the same register.
unsigned funcarg(unsigned a, unsigned b, unsigned k) {
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
On the Godbolt compiler explorer, along with a couple of other versions. (I used unsigned in this version because you had addl in your asm; using unsigned long makes everything except the xor-zeroing use 64-bit registers. xor %eax,%eax is still the best way to zero RAX.)
# gcc7.2 -O3 When it can keep the value in the same reg, uses add instead of lea
leal (%rdx,%rdx), %eax #, <retval>
cmpl %esi, %edi # b, a
adcl $0, %eax #, <retval>
ret
#clang 6.0 snapshot -O3
xorl %eax, %eax
cmpl %esi, %edi
setb %al
leal (%rax,%rdx,2), %eax
retq
# ICC18, same as gcc but fails to save a MOV
addl %edx, %edx #14.16
cmpl %esi, %edi #17.12
adcl $0, %edx #17.12
movl %edx, %eax #17.12
ret #17.12
MSVC is the only compiler that doesn't make branchless code without hand-holding: (k<<1) + (a < b); gives us exactly the same xor/cmp/setb/lea sequence as clang above (but with the Windows x86-64 calling convention).
funcarg PROC ; x86-64 MSVC CL19 -Ox
lea eax, DWORD PTR [r8*2+1]
cmp ecx, edx
jb SHORT $LN3#funcarg
lea eax, DWORD PTR [r8+r8] ; conditionally jumped over
$LN3#funcarg:
ret 0
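For reference, the hand-held version looks like this (a sketch; the function name is illustrative):
/* Writing the bool add explicitly makes MSVC CL19 emit the same
branchless xor/cmp/setb/lea sequence as clang above. */
unsigned funcarg_handhold(unsigned a, unsigned b, unsigned k) {
return (k << 1) + (a < b);
}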
Inline asm
The other answers cover the problems with your implementation pretty well. To debug assembler errors in inline asm, use gcc -O3 -S -fverbose-asm to see what the compiler is feeding to the assembler, with the asm template filled in. You would have seen addl %rax, %ecx or something.
This optimized implementation uses multi-alternative constraints to let the compiler pick either the cmp $imm, r/m, cmp r/m, r, or cmp r, r/m forms of CMP. I used two alternatives that split things up not by opcode but by which side included the possible memory operand ("rme" is like "g" (rmi) but limited to 32-bit sign-extended immediates).
unsigned long inlineasm(unsigned long a, unsigned long b, unsigned long k)
{
__asm__("cmpq %[b], %[a] \n\t"
"adc %[k],%[k]"
: /* outputs */ [k] "+r,r" (k)
: /* inputs */ [a] "r,rm" (a), [b] "rme,re" (b)
: /* clobbers */ "cc"); // "cc" clobber is implicit for x86, but it doesn't hurt
return k;
}
I put this on Godbolt with callers that inline it in different contexts. gcc7.2 -O3 does what we expect for the stand-alone version (with register args).
inlineasm:
movq %rdx, %rax # k, k
cmpq %rsi, %rdi # b, a
adc %rax,%rax # k
ret
We can look at how well our constraints work by inlining into other callers:
unsigned long call_with_mem(unsigned long *aptr) {
return inlineasm(*aptr, 55555, 4);
}
# gcc
movl $4, %eax #, k
cmpq $55555, (%rdi) #, *aptr_3(D)
adc %rax,%rax # k
ret
With a larger immediate, we get movabs into a register. (But with an "i" or "g" constraint, gcc would emit code that doesn't assemble, or would truncate the constant, trying to use a large immediate constant for cmpq.)
Compare what we get from pure C:
unsigned long call_with_mem_nonasm(unsigned long *aptr) {
return handhold(*aptr, 5, 4);
}
# gcc -O3
xorl %eax, %eax # tmp93
cmpq $4, (%rdi) #, *aptr_3(D)
setbe %al #, tmp93
addq $8, %rax #, k
ret
adc $8, %rax without setc would probably have been better, but we can't get that from inline asm without __builtin_constant_p() on k.
clang often picks the mem alternative if there is one, so it does this: /facepalm. Don't use inline asm.
inlineasm: # clang 5.0
movq %rsi, -8(%rsp)
cmpq -8(%rsp), %rdi
adcq %rdx, %rdx
movq %rdx, %rax
retq
BTW, unless you're going to optimize the shift into the compare-and-add, you can and should have asked the compiler for k<<1 as an input.
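For instance, something like this sketch lets the compiler produce k<<1 itself (via lea, or by folding it into surrounding code) and keeps only the flag-consuming add in asm (constraints illustrative):
/* The compiler computes k<<1; the asm only adds the carry from the
compare, so the result is (k<<1) + (a < b). */
unsigned long next_shifted(unsigned long a, unsigned long b, unsigned long k)
{
unsigned long k2 = k << 1; /* the compiler's job now */
__asm__("cmpq %[b], %[a] \n\t"
"adcq $0, %[k2]" /* add CF = (a < b) */
: [k2] "+r" (k2)
: [a] "r" (a), [b] "re" (b)
: "cc");
return k2;
}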
I have the following C code:
#include <stdio.h>
void function(int a, int b, int c) {
int buff_1[5];
int buff_2[10];
buff_1[0] = 6;
buff_2[0] = 'A';
buff_2[1] = 'B';
}
int main(void) {
int i = 1;
function(1,2,3);
return 0;
}
now I want to analyze the associated assembly code:
The assembly instructions before the function call, according to the book I'm reading, are:
pushl $3
pushl $2
pushl $1
call function
The underlying object file was created using gcc-5.3 -O0 -c functions.c.
However, if I create the assembly code using objdump I get the following instructions:
movl $3, %edx
movl $2, %esi
movl $1, %edi
As far as I understand assembly (I'm pretty new to it) the first one makes more sense to me.
Is the book simply wrong? Or is the book's output just outdated because it used gcc 2.9?
The book is out of date with respect to 64-bit x86. The x86-64 calling conventions per Wikipedia are:
System V AMD64 ABI
The calling convention of the System V AMD64 ABI is followed on Solaris, Linux, FreeBSD, OS X, and other UNIX-like or POSIX-compliant operating systems. The first six integer or pointer arguments are passed in registers RDI, RSI, RDX, RCX (R10 in the Linux kernel interface), R8, and R9, while XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6 and XMM7 are used for certain floating point arguments. As in the Microsoft x64 calling convention, additional arguments are passed on the stack and the return value is stored in RAX.
Since you're passing 32-bit int values, gcc uses the lower 32-bit halves of those registers, hence %edi, %esi, and %edx. (If you compile the same file with -m32, you get 32-bit code that passes the arguments on the stack, as in the book; depending on the gcc version it may emit movl stores to the stack rather than pushl.)