Confused with this implementation of strcpy, why three temp variables is needed? - c

Update: See Eric Postpischil's answer, I think he is right.
I found it on this page when I study inline assembly code.
static inline char * strcpy(char * dest,const char *src)
{
int d0, d1, d2;
__asm__ __volatile__( "1:\n\t"
"lodsb\n\t"
"stosb\n\t"
"testb %%al,%%al\n\t"
"jne 1b"
: "=&S" (d0), "=&D" (d1), "=&a" (d2)
: "0" (src),"1" (dest)
: "memory");
return dest;
}
I am confused that why three temp variables is needed? I try to implement without using them.
static inline char * strcpy(char * dest,const char *src)
{
__asm__ __volatile__( "1:\n\t"
"lodsb\n\t"
"stosb\n\t"
"testb %%al,%%al\n\t"
"jne 1b"
/* : "=&S" (d0), "=&D" (d1), "=&a" (d2) */
:
: "S" (src),"D" (dest)
: "memory", "esi", "edi", "al");
return dest;
}
But I got errors when compile it with gcc.
inline.c: In function ‘strcpy’:
inline.c:6:9: error: can't find a register in class ‘SIREG’ while reloading ‘asm’
inline.c:6:9: error: ‘asm’ operand has impossible constraints

The original code is trying to deal with the fact that the lodsb and stosb instructions modify registers %ESI, %EDI, and %AL. To do this, it has defined d0, d1, and d2 so that it can declare them as outputs from the assembly code. It does not actually want these values as outputs, so it does not use them after the assembly. However, because they are outputs, the compiler knows that they are modified by the assembly code, so the compiler knows not to keep any values in those registers that it wants to remain unchanged through the assembly code.
It should be possible to do this in another way, and your code better expresses the semantics: It says inputs are provided in %ESI and %EDI and that memory, %ESI, %EDI, and %AL are altered. The original code asserts that %ESI, %EDI, and %AL are outputs, which is not the true intent; they are unwanted byproducts, not desired outputs.
However, your version of GCC is different from mine (Apple/clang-418.0.60); mine accepts the code you wrote without error. I suspect your GCC is confused by the fact that %ESI is specified both as an input register (by "S" after the second colon) and as something that is clobbered (by "esi" after the third colon). Perhaps this was a shortcoming in GCC that was later fixed, and the original code was working around that shortcoming.

Things are really simple. Lets recall, that S constraint stands for esi, a for al and D for edi register. When you inform compiler, that those guys will be clobbered, you must explicitly point where to spill them (d0 is temporary storage for esi, d1 for edi, d2 for al).
Next in the process of register allocation, compiler decides if he actually need to spill and looks for available register. For example my compiler (gcc 4.7.2) does it as follows:
movq %rdi, %rdx
1:
lodsb
stosb
testb %al,%al
jne 1b
movq %rdx, %rax
ret
You may see, that d1 was allocated to rdx, d0 and d2 was eliminated as excessive.
If you don't supply compiler with that information, it can not guess, so outputs error.

Related

error: unsupported size for integer register

I'm using i686 gcc on windows. When I built the code with separate asm statements, it worked. However, when I try to combine it into one statement, it doesn't build and gives me a error: unsupported size for integer register.
Here's my code
u8 lstatus;
u8 lsectors_read;
u8 data_buffer;
void operate(u8 opcode, u8 sector_size, u8 track, u8 sector, u8 head, u8 drive, u8* buffer, u8* status, u8* sectors_read)
{
asm volatile("mov %3, %%ah;\n"
"mov %4, %%al;\n"
"mov %5, %%ch;\n"
"mov %6, %%cl;\n"
"mov %7, %%dh;\n"
"mov %8, %%dl;\n"
"int $0x13;\n"
"mov %%ah, %0;\n"
"mov %%al, %1;\n"
"mov %%es:(%%bx), %2;\n"
: "=r"(lstatus), "=r"(lsectors_read), "=r"(buffer)
: "r"(opcode), "r"(sector_size), "r"(track), "r"(sector), "r"(head), "r"(drive)
:);
status = &lstatus;
sectors_read = &lsectors_read;
buffer = &data_buffer;
}
The error message is a little misleading. It seems to be happening because GCC ran out of 8-bit registers.
Interestingly, it compiles without error messages if you just edit the template to remove references to the last 2 operands (https://godbolt.org/z/oujNP7), even without dropping them from the list of input constraints! (Trimming down your asm statement is a useful debugging technique to figure out which part of it GCC doesn't like, without caring for now if the asm will do anything useful.)
Removing 2 earlier operands and changing numbers shows that "r"(head), "r"(drive) weren't specifically a problem, just the combination of everything.
It looks like GCC is avoiding high-8 registers like AH as inputs, and x86-16 only has 4 low-8 registers but you have 6 u8 inputs. So I think GCC means it ran out of byte registers that it was willing to use.
(The 3 outputs aren't declared early-clobber so they're allowed to overlap the inputs.)
You could maybe work around this by using "rm" to give GCC the option of picking a memory input. (The x86-specific constraints like "Q" that are allowed to pick a high-8 register wouldn't help unless you require it to pick the correct one to get the compiler to emit a mov for you.) That would probably let your code compile, but the result would be totally broken.
You re-introduced basically the same bugs as before: not telling the compiler which registers you write, so for example your mov %4, %%al will overwrite one of the registers GCC picked as an input, before you actually read that operand.
Declaring clobbers on all the registers you use would leave not enough registers to hold all the input variables. (Unless you allow memory source operands.) That could work but is very inefficient: if your asm template string starts or ends with mov, you're almost always doing it wrong.
Also, there are other serious bugs, apart from how you're using inline asm. You don't supply an input pointer to your buffer. int $0x13 doesn't allocate a new buffer for you, it needs a pointer in ES:BX (which it dereferences but leaves unmodified). GCC requires that ES=DS=SS so you already have to have properly set up segmentation before calling into your C code, and isn't something you have to do every call.
Plus even in C terms outside the inline asm, your function doesn't make sense. status = &lstatus; modifies the value of a function arg, not dereferencing it to modify a pointed-to output variable. The variable written by those assignments die at the end of the function. But the global temporaries do have to be updated because they're global and some other function could see their value. Perhaps you meant something like *status = lstatus; with different types for your vars?
If that C problem isn't obvious (at least once it's pointed out), you need some more practice with C before you're ready to try mixing C and asm which require you to understand both very well, in order to correctly describe your asm to the compiler with accurate constraints.
A good and correct way to implement this is shown in #fuz's answer to your previous question. If you want to understand how the constraints can replace your mov instructions, compile it and look at the compiler-generated instructions. See https://stackoverflow.com/tags/inline-assembly/info for links to guides and docs. e.g. #fuz's version without the ES setup (because GCC needs you to have done that already before calling any C):
typedef unsigned char u8;
typedef unsigned short u16;
// Note the different signature, and using the output args correctly.
void read(u8 sector_size, u8 track, u8 sector, u8 head, u8 drive,
u8 *buffer, u8 *status, u8 *sectors_read)
{
u16 result;
asm volatile("int $0x13"
: "=a"(result)
: "a"(0x200|sector_size), "b"(buffer),
"c"(track<<8|sector), "d"(head<<8|drive)
: "memory" ); // memory clobber was missing from #fuz's version
*status = result >> 8;
*sectors_read = result >> 0;
}
Compiles as follows, with GCC10.1 -O2 -m16 on Godbolt:
read:
pushl %ebx
movzbl 12(%esp), %ecx
movzbl 16(%esp), %edx
movzbl 24(%esp), %ebx # load some stack args
sall $8, %ecx
movzbl 8(%esp), %eax
orl %edx, %ecx # shift and merge into CL,CH instead of writing partial regs
movzbl 20(%esp), %edx
orb $2, %ah
sall $8, %edx
orl %ebx, %edx
movl 28(%esp), %ebx # the pointer arg
int $0x13 # from the inline asm statement
movl 32(%esp), %edx # load output pointer arg
movl %eax, %ecx
shrw $8, %cx
movb %cl, (%edx)
movl 36(%esp), %edx
movb %al, (%edx)
popl %ebx
ret
It might be possible to use register u8 track asm("ch") or something to get the compiler to just write partial regs instead of shift/OR.
If you don't want to understand how constraints work, don't use GNU C inline asm. You could instead write stand-alone functions that you call from C, which accept args according to the calling convention the compiler uses (e.g. gcc -mregparm=3, or just everything on the stack with the traditional inefficient calling convention.)
You could do a better job than GCC's above code-gen, but note that the inline asm could optimize into surrounding code and avoid some of the actual copying to memory for passing args via the stack.

GCC doesn't push registers around my inline asm function call even though I have clobbers

I have a function (C) that modifies "ecx" (or any other registers)
int proc(int n) {
int ret;
asm volatile ("movl %1, %%ecx\n\t" // mov (n) to ecx
"addl $10, %%ecx\n\t" // add (10) to ecx (n)
"movl %%ecx, %0" /* ret = n + 10 */
: "=r" (ret) : "r" (n) : "ecx");
return ret;
}
now i want to call this function in another function which that function moves a value in "ecx" before calling "proc" function
int main_proc(int n) {
asm volatile ("movl $55, %%ecx" ::: "ecx"); /// mov (55) to ecx
int ret;
asm volatile ("call proc" : "=r" (ret) : "r" (n) : "ecx"); // ecx is modified in proc function and the value of ecx is not 55 anymore even with "ecx" clobber
asm volatile ("addl %%ecx, %0" : "=r" (ret));
return ret;
}
in this function, (55) is moved into "ecx" register and then "proc" function is called (which modifies "ecx"). in this situation, "proc" function Must push "ecx" first and pop it at the end but it's not going to happen !!!!
this is the assembly source with (-O3) optimiaztion level
proc:
movl %edi, %ecx
addl $10, %ecx
movl %ecx, %eax
ret
main_proc:
movl $55, %ecx
call proc
addl %ecx, %eax
ret
why GCC is not going to use (push) and (pop) for "ecx" register ?? i used "ecx" clobber too !!!!!
You are using inline asm completely wrong. Your input/output constraints need to fully describe the inputs / outputs of each asm statement. To get data between asm statements, you have to hold it in C variables between them.
Also, call isn't safe inside inline asm in general, and specifically in x86-64 code for the System V ABI it steps on the red-zone where gcc might have been keeping things. There's no way to declare a clobber on that. You could use sub $128, %rsp first to skip past the red zone, or you could make calls from pure C like a normal person so the compiler knows about it. (Remember that call pushes a return address.) Your inline asm doesn't even make sense; your proc takes an arg but you didn't do anything in the caller to pass one.
The compiler-generated code in proc could have also destroyed any other call-clobbered registers, so you at least need to declare clobbers on those registers. Or hand-write the whole function in asm so you know what to put in clobbers.
why GCC is not going to use (push) and (pop) for "ecx" register ?? i used "ecx" clobber too !!!!!
An ecx clobber tells GCC that this asm statement destroys whatever GCC had in ECX previously. Using an ECX clobber in two separate inline-asm statements doesn't declare any kind of data dependency between them.
It's not equivalent to declaring a register-asm local variable like
register int foo asm("ecx"); that you use as a "+r" (foo) operand to the first and last asm statement. (Or more simply that you use with a "+c" constraint to make an ordinary variable pick ECX).
From GCC's point of view, your source means only what the constraints + clobbers tell it.
int main_proc(int n) {
asm volatile ("movl $55, %%ecx" ::: "ecx");
// ^^ black box that destroys ECX and produces no outputs
int ret;
asm volatile ("call proc" : "=r" (ret) : "r" (n) : "ecx");
// ^^ black box that can take `n` in any register, and can produce `ret` in any reg. And destroys ECX.
asm volatile ("addl %%ecx, %0" : "=r" (ret));
// ^^ black box with no inputs that can produce a new value for `ret` in any register
return ret;
}
I suspect you wanted the last asm statement to be "+r"(ret) to read/write the C variable ret instead of telling GCC that it was output-only. Because your asm uses it as an input as well as output as the destination of an add.
It might be interesting to add comments like # %%0 = %0 %%1 = %1 inside your 2nd asm statement to see which registers the "=r" and "r" constraints picked. On the Godbolt compiler explorer:
# gcc9.2 -O3
main_proc:
movl $55, %ecx
call proc # %0 = %edi %1 = %edi
addl %ecx, %eax # "=r" happened to pick EAX,
# which happens to still hold the return value from proc
ret
That accident of picking EAX as the add destinatino might not happen after this function inlines into something else. or GCC happens to put some compiler-generated instructions between asm statements. (asm volatile is barrier to compile-time reordering but not not a strog one. It only definitely stops optimizing away entirely).
Remember that inline asm templates are purely text substitution; asking the compiler to fill in an operand into a comment is no different from anywhere else in the template string. (Godbolt strips comment lines by default so sometimes it's handy to tack them onto other instructions, or onto a nop).
As you can see, this is 64-bit code (n arrives in EDI as per the x86-64 SysV calling convention, like how you built your code), so push %ecx wouldn't be encodeable. push %rcx would be.
Of course if GCC actually wanted to keep a value around past an asm statement with an "ecx" clobber, it would have just used mov %ecx, %edx or whatever other call-clobbered register that wasn't in the clobber list.

Inline Assembly Causing Errors about No Prefixes

Hello,
So, I'm optimizing some functions that I wrote for a simple operating system I'm developing. This function, putpixel(), currently looks like this (in case my assembly is unclear or wrong):
uint32_t loc = (x*pixel_w)+(y*pitch);
vidmem[loc] = color & 255;
vidmem[loc+1] = (color >> 8) & 255;
vidmem[loc+2] = (color >> 16) & 255;
This takes a little bit of explanation. First, loc is the pixel index I want to write to in video memory. X and Y coordinates are passed to the function. Then, we multiply X by the pixel width in bytes (in this case, 3) and Y by the number of bytes in each line. More information can be found here.
vidmem is a global variable, a uint8_t pointer to video memory.
That being said, anyone familiar with bitwise operations should be able to figure out how putpixel() works fairly easily.
Now, here's my assembly. Note that it has not been tested and may even be slower or just plain not work. This question is about how to make it compile.
I've replaced everything after the definition of loc with this:
__asm(
"push %%rdi;"
"push %%rbx;"
"mov %0, %%rdi;"
"lea %1, %%rbx;"
"add %%rbx, %%rdi;"
"pop %%rbx;"
"mov %2, %%rax;"
"stosb;"
"shr $8, %%rax;"
"stosb;"
"shr $8, %%rax;"
"stosb;"
"pop %%rdi;" : :
"r"(loc), "r"(vidmem), "r"(color)
);
When I compile this, clang gives me this error for every push instruction:
So when I saw that error, I assumed it had to do with my omission of the GAS suffixes (which should have been implicitly decided on, anyway). But when I added the "l" suffix (all of my variables are uint32_ts), I got the same error! I'm not quite sure what's causing it, and any help would be much appreciated. Thanks in advance!
You could probably make the compiler's output for your C version much more efficient by loading vidmem into a local variable before the stores. As it is, it can't assume that the stores don't alias vidmem, so it reloads the pointer before every byte store. Hrm, that does let gcc 4.9.2 avoid reloading vidmem, but it still generates some nasty code. clang 3.5 does slightly better.
Implementing what I said in my comment on your answer (that stos is 3 uops vs. 1 for mov):
#include <stdint.h>
extern uint8_t *vidmem;
void putpixel_asm_peter(uint32_t color, uint32_t loc)
{
// uint32_t loc = (x*pixel_w)+(y*pitch);
__asm( "\n"
"\t movb %b[col], (%[ptr])\n"
"\t shrl $8, %[col];\n"
"\t movw %w[col], 1(%[ptr]);\n"
: [col] "+r" (color), "=m" (vidmem[loc])
: [ptr] "r" (vidmem+loc)
:
);
}
compiles to a very efficient implementation:
gcc -O3 -S -o- putpixel.c 2>&1 | less # (with extra lines removed)
putpixel_asm_peter:
movl %esi, %esi
addq vidmem(%rip), %rsi
#APP
movb %dil, (%rsi)
shrl $8, %edi;
movw %di, 1(%rsi);
#NO_APP
ret
All of those instructions decode to a single uop on Intel CPUs. (The stores can micro-fuse, because they use a single-register addressing mode.) The movl %esi, %esi zeroes the upper 32, since the caller might have generated that function arg with a 64bit instruction the left garbage in the high 32 of %rsi. Your version could have saved some instructions by using constraints to ask for the values in the desired registers in the first place, but this will still be faster than stos
Also notice how I let the compiler take care of adding loc to vidmem. You could have done it more efficiently in yours, with a lea to combine an add with a move. However, if the compiler wants to get clever when this is used in a loop, it could increment the pointer instead of the address. Finally, this means the same code will work for 32 and 64bit. %[ptr] will be a 64bit reg in 64bit mode, but a 32bit reg in 32bit mode. Since I don't have to do any math on it, it Just Works.
I used =m output constraint to tell the compiler where we're writing in memory. (I should have cast the pointer to a struct { char a[3]; } or something, to tell gcc how much memory it actually writes, as per the tip at the end of the "Clobbers" section in the gcc manual)
I also used color as an input/output constraint to tell the compiler that we modify it. If this got inlined, and later code expected to still find the value of color in the register, we'd have a problem. Having this in a function means color is already a tmp copy of the caller's value, so the compiler will know it needs to throw away the old color. Calling this in a loop could be slightly more efficient with two read-only inputs: one for color, one for color >> 8.
Note that I could have written the constraints as
: [col] "+r" (color), [memref] "=m" (vidmem[loc])
:
:
But using %[memref] and 1 %[memref] to generate the desired addresses would lead gcc to emit
movl %esi, %esi
movq vidmem(%rip), %rax
# APP
movb %edi, (%rax,%rsi)
shrl $8, %edi;
movw %edi, 1 (%rax,%rsi);
The two-reg addressing mode means the store instructions can't micro-fuse (on Sandybridge and later, at least).
You don't even need inline asm to get decent code, though:
void putpixel_cast(uint32_t color, uint32_t loc)
{
// uint32_t loc = (x*pixel_w)+(y*pitch);
typeof(vidmem) vmem = vidmem;
vmem[loc] = color & 255;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
*(uint16_t *)(vmem+loc+1) = color >> 8;
#else
vmem[loc+1] = (color >> 8) & 255; // gcc sucks at optimizing this for little endian :(
vmem[loc+2] = (color >> 16) & 255;
#endif
}
compiles to (gcc 4.9.2 and clang 3.5 give the same output):
movq vidmem(%rip), %rax
movl %esi, %esi
movb %dil, (%rax,%rsi)
shrl $8, %edi
movw %di, 1(%rax,%rsi)
ret
This is only a tiny bit less efficient than what we get with inline asm, and should be easier for the optimizer to optimize if inlined into loops.
Overall performance
Calling this in a loop is probably a mistake. It'll be more efficient to combine multiple pixels in a register (esp. a vector register), and then write all at once. Or, do 4-byte writes, overlapping the last byte of the previous write, until you get to the end and have to preserve the byte after the last chunk of 3.
See http://agner.org/optimize/ for more stuff about optimizing C and asm. That and other links can be found at https://stackoverflow.com/tags/x86/info.
Found the problem!
It was in a lot of places, but the major one was vidmem. I assumed it would pass the address, but it was causing an error. After referring to it as a dword, it worked perfectly. I also had to change the other constraints to "m", and I finally got this result (after some optimization):
__asm(
"movl %0, %%edi;"
"movl %k1, %%ebx;"
"addl %%ebx, %%edi;"
"movl %2, %%eax;"
"stosb;"
"shrl $8, %%eax;"
"stosw;" : :
"m"(loc), "r"(vidmem), "m"(color)
: "edi", "ebx", "eax"
);
Thanks to everyone who answered in the comments!

Inline assembly, getting into interrupt

Good day.
I faced a problem that I couldn't solve for several days. The error appears when I try to compile this function in C language.
void GetInInterrupt(UChar Interrupt)
{
//asm volatile(".intel_syntax noprefix");
asm volatile
(
"movb %0, %%al\n"
"movb %%al, 1(point)\n"
"point:\n"
"int $0\n"
: /*output*/ : "r" (Interrupt) /*input*/ : /*clobbered*/
);
//asm volatile(".att_syntax noprefix");
}
Message I get from gas is following:
Error: junk '(point)' after expression
As I can understand the pointer in second line is faulty, but unfortunately I can't solve it by my own.
Thank you for help.
If you can use C++, then this one:
template <int N> static inline void GetInInterrupt (void)
{
__asm__ ("int %0\n" : "N"(N));
}
will do. If I use that template like:
GetInInterrupt<123>();
GetInInterrupt<3>();
GetInInterrupt<23>();
GetInInterrupt<0>();
that creates the following object code: 0: cd 7b int $0x7b
2: cc int3
3: cd 17 int $0x17
5: cd 00 int $0x0
which is pretty much optimal (even for the int3 case, which is the breakpoint op). It'll also create a compile-time warning if the operand is out of the 0..255 range, due to the N constraint allowing only that.
Edit: plain old C-style macros work as well, of course:
#define GetInInterrupt(arg) __asm__("int %0\n" : : "N"((arg)) : "cc", "memory")
creates the same code as the C++ templated function. Due to the way int behaves, it's a good idea to tell the compiler (via the "cc", "memory" constraints) about the barrier semantics, to make sure it doesn't try to re-order instructions when embedding the inline assembly.
The limitation of both is, obviously, the fact that the interrupt number must be a compile-time constant. If you absolutely don't want that, then creating a switch() statement created e.g. with the help of BOOST_PP_REPEAT() covering all 255 cases is a better option than self-modifying code, i.e. like:
#include <boost/preprocessor/repetition/repeat.html>
#define GET_INTO_INT(a, INT, d) case INT: GetInInterrupt<INT>(); break;
void GetInInterrupt(int interruptNumber)
{
switch(interruptNumber) {
BOOST_PP_REPEAT(256, GET_INTO_INT, 0)
default:
runtime_error("interrupt Number %d out of range", interruptNumber);
}
}
This can be done in plain C (if you change the templated function invocation for a plain __asm__ of course) - because the boost preprocessor library does not depend on a C++ compiler ... and gcc 4.7.2 creates the following code for this:
GetInInterrupt:
.LFB0:
cmpl $255, %edi
jbe .L262
movl %edi, %esi
xorl %eax, %eax
movl $.LC0, %edi
jmp runtime_error
.p2align 4,,10
.p2align 3
.L262:
movl %edi, %edi
jmp *.L259(,%rdi,8)
.section .rodata
.align 8
.align 4
.L259:
.quad .L3
.quad .L4
[ ... ]
.quad .L258
.text
.L257:
#APP
# 17 "tccc.c" 1
int $254
# 0 "" 2
#NO_APP
ret
[ ... accordingly for the other vectors ... ]
Beware though if you do the above ... the compiler (gcc up to and including 4.8) is not intelligent enough to optimize the switch() away, i.e. even if you say static __inline__ ... it'll create the full jump table version of GetInInterrupt(3) instead of just an inlined int3 as would the simpler implementations.
Below show how you could write to a location in the code. It does assume that the code is writeable in the first place, which is typically not the case in mainstream OS's - since that would hide some nasty bugs.
void GetInInterrupt(UChar Interrupt)
{
//asm volatile(".intel_syntax noprefix");
asm volatile
(
"movb %0, point+1\n"
"point:\n"
"int $0\n"
: /*output*/ : "r" (Interrupt) /*input*/ : /*clobbered */
);
//asm volatile(".att_syntax noprefix");
}
I also simplified the code to avoid using two registers, and instead just using the register that Interrupt already is in. If the compiler moans about it, you may find that "a" instead or "r" solves the problem.

gcc inline asm statement gets optimized away - wrong constraints?

I'm having trouble with a gcc inline asm statement; gcc seems to think the result is a constant (which it isn't) and optimizes the statement away. I think I am using the operand constraints correctly, but would like a second opinion on the matter. If the problem is not in my use of constraints, I'll try to isolate a test case for a gcc bug report, but that may be difficult as even subtle changes in the surrounding code cause the problem to disappear.
The inline asm in question is
static inline void
ularith_div_2ul_ul_ul_r (unsigned long *r, unsigned long a1,
const unsigned long a2, const unsigned long b)
{
ASSERT(a2 < b); /* Or there will be quotient overflow */
__asm__(
"# ularith_div_2ul_ul_ul_r: divq %0 %1 %2 %3\n\t"
"divq %3"
: "+a" (a1), "=d" (*r)
: "1" (a2), "rm" (b)
: "cc");
}
which is a pretty run-of-the-mill remainder of a two-word dividend by a one-word divisor. Note that the high word of the input, a2, and the remainder output, *r, are tied to the same register %rdx by the "1" constraint.
From the surrounding code, ularith_div_2ul_ul_ul_r() gets effectively called as if by
if (s == 1)
modpp[0].one = 0;
else
ularith_div_2ul_ul_ul_r(&modpp[0].one, 0UL, 1UL, s);
so the high word of the input, a2, is the constant 1UL.
The resulting asm output of gcc -S -fverbose_asm looks like:
(earlier:)
xorl %r8d, %r8d # cstore.863
(then:)
cmpq $1, -208(%rbp) #, %sfp
movl $1, %eax #, tmp841
movq %rsi, -184(%rbp) # prephitmp.966, MEM[(struct __modulusredcul_t *)&modpp][0].invm
cmovne -208(%rbp), %rcx # prephitmp.966,, %sfp, prephitmp.966
cmovne %rax, %r8 # cstore.863,, tmp841, cstore.863
movq %r8, -176(%rbp) # cstore.863, MEM[(struct __modulusredcul_t *)&modpp][0].one
The effect is that the result of the ularith_div_2ul_ul_ul_r() call is assumed to be the constant 1; the divq never appears in the output.
Various changes make the problem disappear; different compiler flags, different code context or marking the asm block __asm__ __volatile__ (...). The output then correctly contains the divq instruction:
#APP
# ularith_div_2ul_ul_ul_r: divq %rax %rdx %rdx -208(%rbp) # a1, tmp590, tmp590, %sfp
divq -208(%rbp) # %sfp
#NO_APP
So, my question to the inline assembly guys here: did I do something wrong with the contraints?
The bug affects only Ubuntu versions of gcc; the stock GNU gcc is unaffected as far as we can tell. The bug was reported to Ubuntu launchpad and confirmed: Bug #1029454

Resources