inline asm code organization

inline asm code organization - c

I have just written a few small inline asm routines to query the timestamp counter in x86 so that I can profile small portions of code. I would really like to put those routines in a header so that I can reuse them in many different source files so basically my question is whether I should just organize those in macros or make them inline functions, my doubt with inline is that it is not necessarily the case that the compiler will actually inline it and since it is a performance sensitive call I would rather skip the function call overhead, on the other hand with macros the whole type safety goes away and I would strictly need a 32 bit int for this, I assume I could just add the specification in comments but still I try to avoid macros because of the many caveats. Here is the code:
inline void rdtsc(uint64_t* cycles)
{
uint32_t cycles_high, cycles_low;
asm volatile (
".att_syntax\n"
"CPUID\n\t" //Serialize
"RDTSC\n\t" //Read clock and cpuid
"mov %%edx, %0 \n\t"
"mov %%eax, %1 \n\t"
: "=r" (cycles_high), "=r" (cycles_low)
:: "%edx", "%eax");
*cycles = ((uint64_t) cycles_high << 32) | cycles_low;
}
Any suggestions on this are welcome. I am just trying to figure out what the preferred style would be for this kind of situation.

Since you will be measuring performance of portions of code, not necessarily always entire functions, you should not try to inline your performance counter.
It doesn't matter if there's a call overhead or not. What matter is that the mesurement is consistent, which means you either want ALWAYS the call overhead to be present, or NEVER.
The first is much easier to achieve than the former.
Let every portion of your code have the same call overhead.

If you really need to serialize before reading the TSC, you could use the LFENCE instruction instead which doesn't alter registers.
If you decide to continue to use CPUID for serialization, you ought to set EAX first (probably to 0, since you're not really concerned about the output) and note that this instruction trashes the EAX, EBX, ECX and EDX registers, so your routine MUST account for this fact.
In all, I'd be inclined to write it like this:
#include <stdint.h>
#include <stdio.h>
inline uint64_t rdtsc() {
uint32_t high, low;
asm volatile (
".att_syntax\n\t"
"LFENCE\n\t"
"RDTSC\n\t"
"movl %%eax, %0\n\t"
"movl %%edx, %1\n\t"
: "=rm" (low), "=rm" (high)
:: "%edx", "%eax");
return ((uint64_t) high << 32) | low;
}
int main() {
uint64_t x, y;
x = rdtsc();
printf("%lu\n", x);
y = rdtsc();
printf("%lu\n", y);
printf("%lu\n", y-x);
}
update:
It's been proposed by #Jester, and by #DavidWohlferd that one can eliminate the register allocations by assigning high and low directly to the edx and eax registers.
That version would look like this:
inline uint64_t rdtsc() {
uint32_t high, low;
asm volatile (
".att_syntax\n\t"
"LFENCE\n\t"
"RDTSC\n\t"
: "=a" (low), "=d" (high)
:: );
return ((uint64_t) high << 32) | low;
}
The resulting code (using gcc 4.8.3 on a 64-bit machine running Linux) using optimization -O2 and including up to the call to printf, is this:
#APP
# 20 "rdtsc.c" 1
.att_syntax
LFENCE
RDTSC
# 0 "" 2
#NO_APP
movq %rdx, %rbx
movl %eax, %eax
movl $.LC0, %edi
salq $32, %rbx
orq %rax, %rbx
xorl %eax, %eax
movq %rbx, %rsi
call printf
The version I originally posted results in this:
#APP
# 7 "rdtsc.c" 1
.att_syntax
LFENCE
RDTSC
movl %eax, %ecx
movl %edx, %ebx
# 0 "" 2
#NO_APP
movl %ecx, %ecx
salq $32, %rbx
movl $.LC0, %edi
orq %rcx, %rbx
xorl %eax, %eax
movq %rbx, %rsi
call printf
That version of the code is one instruction longer.

Related

inline assembly - useless intermediate copy instructions

I'm trying to write a scheduler to run what we call "fibers".
Unfortunately, I'm not really used to writing inline assembly.
typedef struct {
//fiber's stack
long rsp;
long rbp;
//next fiber in ready list
struct fiber *next;
} fiber;
//currently executing fiber
fiber *fib;
So the very first task is - obviously - creating a fiber for the main function so it can be suspended.
int main(int argc, char* argv[]){
//create fiber for main function
fib = malloc(sizeof(*fib));
__asm__(
"movq %%rsp, %0;"
"movq %%rbp, %1;"
: "=r"(fib->rsp),"=r"(fib->rbp)
);
//jump to actual main and execute
__asm__(...);
}
This gets compiled to
movl $24, %edi #,
call malloc #
#APP
# 27 "scheduler.c" 1
movq %rsp, %rcx;movq %rbp, %rdx; # tmp92, tmp93
# 0 "" 2
#NO_APP
movq %rax, fib(%rip) # tmp91, fib
movq %rcx, (%rax) # tmp92, MEM[(struct fiber *)_3].rsp
movq %rdx, 8(%rax) # tmp93, MEM[(struct fiber *)_3].rbp
Why does this compile movs into temporary registers? Can I somehow get rid of them?
The first version of this question had asm output from gcc -O0, with even more instructions and temporaries.
Turning on optimisations does not get rid of them.

turning them on does not get rid of the temporaries
It did get rid of some extra loads and stores. The fib is of course still there in memory since you declared that as a global variable. The rax is the return value from the malloc that must be assigned to the fib in memory. The other two lines write into your fib members which are also required.
Since you specified register outputs the asm block can't write directly into memory. That's easy to fix with a memory constraint though:
__asm__(
"movq %%rsp, %0;"
"movq %%rbp, %1;"
: "=m"(fib->rsp),"=m"(fib->rbp)
);
This will generate:
call malloc
movq %rax, fib(%rip)
movq %rsp, (%rax)
movq %rbp, 8(%rax)

OS X asm C call with return value

I've been playing around with the asm macro in C to directly call some assembly instructions on OS X Mavericks to get a stack pointer address (from %rsp) and I've found really strange behaviour (at least to me) while trying to assign a return value from the assembler code into the %rax register (the one that should by convention hold the function return value). The C code is very simple:
#include <stdio.h>
unsigned long long get_sp(void) {
asm ("mov %rsp, %rax");
return 0;
}
int main(void) {
printf("0x%llx\n", get_sp());
}
If I compile and run the code, the value from %rax register gets printed(the actual stack pointer), which is strange as I would expect the %rax register to be overwritten by "return 0;"
However if I remove the return 0; a string "0x0" gets printed which is also strange as I would expect the return value from %rax register to be read and printed.
I've tried to run this code(with the only difference using %esp and %eax registers) also on the Ubuntu Linux also and it actually works as I would expect(using the gcc compiler).
Could this be a bug in the llvm-gcc compiler(Apple LLVM version 5.1)?
//EDIT
This is the version without the "return 0;"
otool -tV sp.out
sp.out:
(__TEXT,__text) section
_get_sp:
0000000100000f30 pushq %rbp
0000000100000f31 movq %rsp, %rbp
0000000100000f34 movq %rsp, %rax
0000000100000f37 movq -0x8(%rbp), %rax
0000000100000f3b popq %rbp
0000000100000f3c ret
0000000100000f3d nopl (%rax)
_main:
0000000100000f40 pushq %rbp
0000000100000f41 movq %rsp, %rbp
0000000100000f44 subq $0x10, %rsp
0000000100000f48 callq _get_sp
0000000100000f4d leaq 0x3a(%rip), %rdi ## literal pool for: "0x%llx
"
0000000100000f54 movq %rax, %rsi
0000000100000f57 movb $0x0, %al
0000000100000f59 callq 0x100000f6e ## symbol stub for: _printf
0000000100000f5e movl $0x0, %ecx
0000000100000f63 movl %eax, -0x4(%rbp)
0000000100000f66 movl %ecx, %eax
0000000100000f68 addq $0x10, %rsp
0000000100000f6c popq %rbp
0000000100000f6d ret

This is not a bug. It's the result of incorrect use of inline assembly. In the case where the return statement is included, the compiler does not inspect the asm statement. If %rax has already been set to zero before the asm block, the instruction overwrites that value. The compiler is free to do this before the asm block, since you haven't informed it of any register outputs, clobbers, etc.
In the case where no return statement is included, you can't rely on the return value. Which is why clang (that's what llvm-gcc is with Xcode 5.1 - it's not the gcc front end) issues a warning. gcc-4.8.2 appears to work on OS X - but because the code is incorrect in both cases, it's just 'luck'. With optimization: -O2, it no longer works. gcc doesn't issue a warning by default, which is a good reason to at least use -Wall.
{
unsigned long long ret;
__asm__ ("movq %rsp, %0" : "=r" (ret));
return ret;
}
always works. volatile is not necessary, as the compiler is using an output, so it cannot discard the asm statement. Even changing the first line to unsigned long long ret = 0; - the compiler is obviously not free to reorder.

this works for me on Mavericks [edit: and without a single change on Ubuntu Saucy x86_64]:
#include <stdio.h>
unsigned long long get_sp(void) {
long _sp = 0x0L;
__asm__ __volatile__(
"mov %%rsp, %[value] \n\t"
: [value] "=r" (_sp)
:
:);
return _sp;
}
int main(void) {
printf("0x%llx\n", get_sp());
}

gcc inline assembly - operand type mismatch for `add', trying to create branchless code

I'm trying to do some Code Optimization to Eliminate Branches, the original c code is
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1)
I intend to replace it with assembly code like below
mov a, %rax
mov b, %rbx
mov k, %rcx
xor %rdx %rdx
shl 1, %rcx
cmp %rax, %rax
setb %rdx
add %rdx,%rcx
mov %rcx, k
so I write c inline assembly code like blow,
#define next(a, b, k)\
__asm__("shl $0x1, %0; \
xor %%rbx, %%rbx; \
cmp %1, %2; \
setb %%rbx; \
addl %%rbx,%0;":"+c"(k) :"g"(a),"g"(b))
when I compile the code below i got error:
operand type mismatch for `add'
operand type mismatch for `setb'
How can I fix it?

Here are the mistakes in your code:
Error: operand type mismatch for 'cmp' -- One of CMP's operands must be a register. You're probably generating code that's trying to compare two immediates. Change the second operand's constraint from "g" to "r". (See GCC Manual - Extended Asm - Simple Constraints)
Error: operand type mismatch for 'setb' -- SETB only takes 8 bit operands, i.e. setb %bl works while setb %rbx doesn't.
The C expression T = (A < B) should translate to cmp B,A; setb T in AT&T x86 assembler syntax. You had the two operands to CMP in the wrong order. Remember that CMP works like SUB.
Once you realize the first two error messages are produced by the assembler, it follows that the trick to debugging them is to look at the assembler code generated by gcc. Try gcc $CFLAGS -S t.c and compare the problematic lines in t.s with an x86 opcode reference. Focus on the allowed operand codes for each instruction and you'll quickly see the problems.
In the fixed source code posted below, I assume your operands are unsigned since you're using SETB instead of SETL. I switched from using RBX to RCX to hold the temporary value because RCX is a call clobbered register in the ABI and used the "=&c" constraint to mark it as an earlyclobber operand since RCX is cleared before the inputs a and b are read:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
static uint64_t next(uint64_t a, uint64_t b, uint64_t k)
{
uint64_t tmp;
__asm__("shl $0x1, %[k];"
"xor %%rcx, %%rcx;"
"cmp %[b], %[a];"
"setb %%cl;"
"addq %%rcx, %[k];"
: /* outputs */ [k] "+g" (k), [tmp] "=&c" (tmp)
: /* inputs */ [a] "r" (a), [b] "g" (b)
: /* clobbers */ "cc");
return k;
}
int main()
{
uint64_t t, t0, k;
k = next(1, 2, 0);
printf("%" PRId64 "\n", k);
scanf("%" SCNd64 "%" SCNd64, &t, &t0);
k = next(t, t0, k);
printf("%" PRId64 "\n", k);
return 0;
}
main() translates to:
<+0>: push %rbx
<+1>: xor %ebx,%ebx
<+3>: mov $0x4006c0,%edi
<+8>: mov $0x1,%bl
<+10>: xor %eax,%eax
<+12>: sub $0x10,%rsp
<+16>: shl %rax
<+19>: xor %rcx,%rcx
<+22>: cmp $0x2,%rbx
<+26>: setb %cl
<+29>: add %rcx,%rax
<+32>: mov %rax,%rbx
<+35>: mov %rax,%rsi
<+38>: xor %eax,%eax
<+40>: callq 0x400470 <printf#plt>
<+45>: lea 0x8(%rsp),%rdx
<+50>: mov %rsp,%rsi
<+53>: mov $0x4006c5,%edi
<+58>: xor %eax,%eax
<+60>: callq 0x4004a0 <__isoc99_scanf#plt>
<+65>: mov (%rsp),%rax
<+69>: mov %rbx,%rsi
<+72>: mov $0x4006c0,%edi
<+77>: shl %rsi
<+80>: xor %rcx,%rcx
<+83>: cmp 0x8(%rsp),%rax
<+88>: setb %cl
<+91>: add %rcx,%rsi
<+94>: xor %eax,%eax
<+96>: callq 0x400470 <printf#plt>
<+101>: add $0x10,%rsp
<+105>: xor %eax,%eax
<+107>: pop %rbx
<+108>: retq
You can see the result of next() being moved into RSI before each printf() call.

Given that gcc (and it looks like gcc inline assembler) produces:
leal (%rdx,%rdx), %eax
xorl %edx, %edx
cmpl %esi, %edi
setl %dl
addl %edx, %eax
ret
from
int f(int a, int b, int k)
{
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
It would think that writing your own inline assembler is a complete waste of time and effort.
As always, BEFORE you start writing inline assembler, check what the compiler actually does. If your compiler doesn't produce this code, then you may need to upgrade the version of compiler to something a bit newer (I reported this sort of thing to Jan Hubicka [gcc maintainer for x86-64 at the time] ca 2001, and I'm sure it's been in gcc for quite some time).

You could just do this and the compiler will not generate a branch:
k = (k<<1) + (a < b) ;
But if you must, I fixed some stuff in your code now it should work as expected:
__asm__(
"shl $0x1, %0; \
xor %%eax, %%eax; \
cmpl %3, %2; \
setb %%al; \
addl %%eax, %0;"
:"=r"(k) /* output */
:"0"(k), "r"(a),"r"(b) /* input */
:"eax", "cc" /* clobbered register */
);
Note that setb expects a reg8 or mem8 and you should add eax to the clobbered list, because you change it, as well as cc just to be safe, as for the register constraints, I'm not sure why you used those, but =r and r work just fine.
And you need to add k to both the input and output lists. There's more in the GCC-Inline-Assembly-HOWTO

Summary:
Branchless might not even be the best choice.
Inline asm defeats some other optimizations, try other source changes first, e.g. ? : often compiles branchlessly, also use booleans as integer 0/1.
If you use inline-asm, make sure you optimize the constraints as well to make the compiler-generated code outside your asm block efficient.
The whole thing is doable with cmp %[b], %[a] / adc %[k],%[k]. Your hand-written code is worse than what compilers generate, but they are beatable in the small scale for cases where constant-propagation / CSE / inlining didn't make this code (partially) optimize away.
If your compiler generates branchy code, and profiling shows that was the wrong choice (high counts for branch misses at that instruction, e.g. on Linux perf record -ebranch-misses ./my_program && perf report), then yes you should do something to get branchless code.
(Branchy can be an advantage if it's predictable: branching means out-of-order execution of code that uses (k<<1) + 1 doesn't have to wait for a and b to be ready. LLVM recently merged a patch that makes x86 code-gen more branchy by default, because modern x86 CPUs have such powerful branch predictors. Clang/LLVM nightly build (with that patch) does still choose branchless for this C source, at least in a stand-alone function outside a loop).
If this is for a binary search, branchless probably is good strategy, unless you see the same search often. (Branching + speculative execution means you have a control dependency off the critical path,
Compile with profile-guided optimization so the compiler has run-time info on which branches almost always go one way. It still might not know the difference between a poorly-predictable branch and one that does overall take both paths but with a simple pattern. (Or that's predictable based on global history; many modern branch-predictor designs index based on branch history, so which way the last few branches went determine which table entry is used for the current branch.)
Related: gcc optimization flag -O3 makes code slower then -O2 shows a case where a sorted array makes for near-perfect branch prediction for a condition inside a loop, and gcc -O3's branchless code (without profile guided optimization) bottlenecks on a data dependency from using cmov. But -O3 -fprofile-use makes branchy code. (Also, a different way of writing it makes lower-latency branchless code that also auto-vectorizes better.)
Inline asm should be your last resort if you can't hand-hold the compiler into making the asm you want, e.g. by writing it as (k<<1) + (a<b) as others have suggested.
Inline asm defeats many optimizations, most obvious constant-propagation (as seen in some other answers, where gcc moves a constant into a register outside the block of inline-asm code). https://gcc.gnu.org/wiki/DontUseInlineAsm.
You could maybe use if(__builtin_constant_p(a)) and so on to use a pure C version when the compiler has constant values for some/all of the variables, but that's a lot more work. (And doesn't work well with Clang, where __builtin_constant_p() is evaluated before function inlining.)
Even then (once you've limited things to cases where the inputs aren't compile-time constants), it's not possible to give the compiler the full range of options, because you can't use different asm blocks depending on which constraints are matched (e.g. a in a register and b in memory, or vice versa.) In cases where you want to use a different instruction depending on the situation, you're screwed, but here we can use multi-alternative constraints to expose most of the flexibility of cmp.
It's still usually better to let the compiler make near-optimal code than to use inline asm. Inline-asm destroys the ability of the compiler to reuse use any temporary results, or spread out the instructions to mix with other compiler-generated code. (Instruction-scheduling isn't a big deal on x86 because of good out-of-order execution, but still.)
That asm is pretty crap. If you get a lot of branch misses, it's better than a branchy implementation, but a much better branchless implementation is possible.
Your a<b is an unsigned compare (you're using setb, the unsigned below condition). So your compare result is in the carry flag. x86 has an add-with-carry instruction. Furthermore, k<<1 is the same thing as k+k.
So the asm you want (compiler-generated or with inline asm) is:
# k in %rax, a in %rdi, b in %rsi for this example
cmp %rsi, %rdi # CF = (a < b) = the carry-out from edi - esi
adc %rax, %rax # eax = (k<<1) + CF = (k<<1) + (a < b)
Compilers are smart enough to use add or lea for a left-shift by 1, and some are smart enough to use adc instead of setb, but they don't manage to combine both.
Writing a function with register args and a return value is often a good way to see what compilers might do, although it does force them to produce the result in a different register. (See also this Q&A, and Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”).
// I also tried a version where k is a function return value,
// or where k is a global, so it's in the same register.
unsigned funcarg(unsigned a, unsigned b, unsigned k) {
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
On the Godbolt compiler explorer, along with a couple other versions. (I used unsigned in this version, because you had addl in your asm. Using unsigned long makes everything except the xor-zeroing into 64-bit registers. (xor %eax,%eax is still the best way to zero RAX.)
# gcc7.2 -O3 When it can keep the value in the same reg, uses add instead of lea
leal (%rdx,%rdx), %eax #, <retval>
cmpl %esi, %edi # b, a
adcl $0, %eax #, <retval>
ret
#clang 6.0 snapshot -O3
xorl %eax, %eax
cmpl %esi, %edi
setb %al
leal (%rax,%rdx,2), %eax
retq
# ICC18, same as gcc but fails to save a MOV
addl %edx, %edx #14.16
cmpl %esi, %edi #17.12
adcl $0, %edx #17.12
movl %edx, %eax #17.12
ret #17.12
MSVC is the only compiler that doesn't make branchless code without hand-holding. ((k<<1) + ( a < b ); gives us exactly the same xor/cmp/setb / lea sequence as clang above (but with the Windows x86-64 calling convention).
funcarg PROC ; x86-64 MSVC CL19 -Ox
lea eax, DWORD PTR [r8*2+1]
cmp ecx, edx
jb SHORT $LN3#funcarg
lea eax, DWORD PTR [r8+r8] ; conditionally jumped over
$LN3#funcarg:
ret 0
Inline asm
The other answers cover the problems with your implementation pretty well. To debug assembler errors in inline asm, use gcc -O3 -S -fverbose-asm to see what the compiler is feeding to the assembler, with the asm template filled in. You would have seen addl %rax, %ecx or something.
This optimized implementation uses multi-alternative constraints to let the compiler pick either the cmp $imm, r/m, cmp r/m, r, or cmp r, r/m forms of CMP. I used two alternates that split things up not by opcode but by which side included the possible memory operand. "rme" is like "g" (rmi) but limited to 32-bit sign-extended immediates).
unsigned long inlineasm(unsigned long a, unsigned long b, unsigned long k)
{
__asm__("cmpq %[b], %[a] \n\t"
"adc %[k],%[k]"
: /* outputs */ [k] "+r,r" (k)
: /* inputs */ [a] "r,rm" (a), [b] "rme,re" (b)
: /* clobbers */ "cc"); // "cc" clobber is implicit for x86, but it doesn't hurt
return k;
}
I put this on Godbolt with callers that inline it in different contexts. gcc7.2 -O3 does what we expect for the stand-alone version (with register args).
inlineasm:
movq %rdx, %rax # k, k
cmpq %rsi, %rdi # b, a
adc %rax,%rax # k
ret
We can look at how well our constraints work by inlining into other callers:
unsigned long call_with_mem(unsigned long *aptr) {
return inlineasm(*aptr, 5, 4);
}
# gcc
movl $4, %eax #, k
cmpq $55555, (%rdi) #, *aptr_3(D)
adc %rax,%rax # k
ret
With a larger immediate, we get movabs into a register. (But with an "i" or "g" constraint, gcc would emit code that doesn't assemble, or truncates the constant, trying to use a large immediate constant for cmpq.)
Compare what we get from pure C:
unsigned long call_with_mem_nonasm(unsigned long *aptr) {
return handhold(*aptr, 5, 4);
}
# gcc -O3
xorl %eax, %eax # tmp93
cmpq $4, (%rdi) #, *aptr_3(D)
setbe %al #, tmp93
addq $8, %rax #, k
ret
adc $8, %rax without setc would probably have been better, but we can't get that from inline asm without __builtin_constant_p() on k.
clang often picks the mem alternative if there is one, so it does this: /facepalm. Don't use inline asm.
inlineasm: # clang 5.0
movq %rsi, -8(%rsp)
cmpq -8(%rsp), %rdi
adcq %rdx, %rdx
movq %rdx, %rax
retq
BTW, unless you're going to optimize the shift into the compare-and-add, you can and should have asked the compiler for k<<1 as an input.

gcc inline asm statement gets optimized away - wrong constraints?

I'm having trouble with a gcc inline asm statement; gcc seems to think the result is a constant (which it isn't) and optimizes the statement away. I think I am using the operand constraints correctly, but would like a second opinion on the matter. If the problem is not in my use of constraints, I'll try to isolate a test case for a gcc bug report, but that may be difficult as even subtle changes in the surrounding code cause the problem to disappear.
The inline asm in question is
static inline void
ularith_div_2ul_ul_ul_r (unsigned long *r, unsigned long a1,
const unsigned long a2, const unsigned long b)
{
ASSERT(a2 < b); /* Or there will be quotient overflow */
__asm__(
"# ularith_div_2ul_ul_ul_r: divq %0 %1 %2 %3\n\t"
"divq %3"
: "+a" (a1), "=d" (*r)
: "1" (a2), "rm" (b)
: "cc");
}
which is a pretty run-of-the-mill remainder of a two-word dividend by a one-word divisor. Note that the high word of the input, a2, and the remainder output, *r, are tied to the same register %rdx by the "1" constraint.
From the surrounding code, ularith_div_2ul_ul_ul_r() gets effectively called as if by
if (s == 1)
modpp[0].one = 0;
else
ularith_div_2ul_ul_ul_r(&modpp[0].one, 0UL, 1UL, s);
so the high word of the input, a2, is the constant 1UL.
The resulting asm output of gcc -S -fverbose_asm looks like:
(earlier:)
xorl %r8d, %r8d # cstore.863
(then:)
cmpq $1, -208(%rbp) #, %sfp
movl $1, %eax #, tmp841
movq %rsi, -184(%rbp) # prephitmp.966, MEM[(struct __modulusredcul_t *)&modpp][0].invm
cmovne -208(%rbp), %rcx # prephitmp.966,, %sfp, prephitmp.966
cmovne %rax, %r8 # cstore.863,, tmp841, cstore.863
movq %r8, -176(%rbp) # cstore.863, MEM[(struct __modulusredcul_t *)&modpp][0].one
The effect is that the result of the ularith_div_2ul_ul_ul_r() call is assumed to be the constant 1; the divq never appears in the output.
Various changes make the problem disappear; different compiler flags, different code context or marking the asm block __asm__ __volatile__ (...). The output then correctly contains the divq instruction:
#APP
# ularith_div_2ul_ul_ul_r: divq %rax %rdx %rdx -208(%rbp) # a1, tmp590, tmp590, %sfp
divq -208(%rbp) # %sfp
#NO_APP
So, my question to the inline assembly guys here: did I do something wrong with the contraints?

The bug affects only Ubuntu versions of gcc; the stock GNU gcc is unaffected as far as we can tell. The bug was reported to Ubuntu launchpad and confirmed: Bug #1029454

syscall from within GCC inline assembly [duplicate]

This question already has answers here:
How to invoke a system call via syscall or sysenter in inline assembly?
(2 answers)
Closed 3 years ago.
is it possible to write a single character using a syscall from within an inline assembly block? if so, how? it should look "something" like this:
__asm__ __volatile__
(
" movl $1, %%edx \n\t"
" movl $80, %%ecx \n\t"
" movl $0, %%ebx \n\t"
" movl $4, %%eax \n\t"
" int $0x80 \n\t"
::: "%eax", "%ebx", "%ecx", "%edx"
);
$80 is 'P' in ascii, but that returns nothing.
any suggestions much appreciated!

You can use architecture-specific constraints to directly place the arguments in specific registers, without needing the movl instructions in your inline assembly. Furthermore, then you can then use the & operator to get the address of the character:
#include <sys/syscall.h>
void sys_putc(char c) {
// write(int fd, const void *buf, size_t count);
int ret;
asm volatile("int $0x80"
: "=a"(ret) // outputs
: "a"(SYS_write), "b"(1), "c"(&c), "d"(1) // inputs
: "memory"); // clobbers
}
int main(void) {
sys_putc('P');
sys_putc('\n');
}
(Editor's note: the "memory" clobber is needed, or some other way of telling the compiler that the memory pointed-to by &c is read. How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)
(In this case, =a(ret) is needed to indicate that the syscall clobbers EAX. We can't list EAX as a clobber because we need an input operand to use that register. The "a" constraint is like "r" but can only pick AL/AX/EAX/RAX. )
$ cc -m32 sys_putc.c && ./a.out
P
You could also return the number of bytes written that the syscall returns, and use "0" as a constraint to indicate EAX again:
int sys_putc(char c) {
int ret;
asm volatile("int $0x80" : "=a"(ret) : "0"(SYS_write), "b"(1), "c"(&c), "d"(1) : "memory");
return ret;
}
Note that on error, the system call return value will be a -errno code like -EBADF (bad file descriptor) or -EFAULT (bad pointer).
The normal libc system call wrapper functions check for a return value of unsigned eax > -4096UL and set errno + return -1.
Also note that compiling with -m32 is required: the 64-bit syscall ABI uses different call numbers (and registers), but this asm is hard-coding the slow way of invoking the 32-bit ABI, int $0x80.
Compiling in 64-bit mode will get sys/syscall.h to define SYS_write with 64-bit call numbers, which would break this code. So would 64-bit stack addresses even if you used the right numbers. What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? - don't do that.

IIRC, two things are wrong in your example.
Firstly, you're writing to stdin with mov $0, %ebx
Second, write takes a pointer as it's second argument, so to write a single character you need that character stored somewhere in memory, you can't write the value directly to %ecx
ex:
.data
char: .byte 80
.text
mov $char, %ecx
I've only done pure asm in Linux, never inline using gcc, you can't drop data into the middle of the assembly, so I'm not sure how you'd get the pointer using inline assembly.
EDIT: I think I just remembered how to do it. you could push 'p' onto the stack and use %esp
pushw $80
movl %%esp, %%ecx
... int $0x80 ...
addl $2, %%esp

Something like
char p = 'P';
int main()
{
__asm__ __volatile__
(
" movl $1, %%edx \n\t"
" leal p , %%ecx \n\t"
" movl $0, %%ebx \n\t"
" movl $4, %%eax \n\t"
" int $0x80 \n\t"
::: "%eax", "%ebx", "%ecx", "%edx"
);
}
Add: note that I've used lea to Load the Effective Address of the char into ecx register; for the value of ebx I tried $0 and $1 and it seems to work anyway ...
Avoid the use of external char
int main()
{
__asm__ __volatile__
(
" movl $1, %%edx \n\t"
" subl $4, %%esp \n\t"
" movl $80, (%%esp)\n\t"
" movl %%esp, %%ecx \n\t"
" movl $1, %%ebx \n\t"
" movl $4, %%eax \n\t"
" int $0x80 \n\t"
" addl $4, %%esp\n\t"
::: "%eax", "%ebx", "%ecx", "%edx"
);
}
N.B.: it works because of the endianness of intel processors! :D

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

inline asm code organization - c

Related

inline assembly - useless intermediate copy instructions

OS X asm C call with return value

gcc inline assembly - operand type mismatch for `add', trying to create branchless code

gcc inline asm statement gets optimized away - wrong constraints?

syscall from within GCC inline assembly [duplicate]

Categories

Resources