Extended assembly, floating point division [duplicate] - c

I'm trying to compile a simple C program (Win7 32bit, Mingw32 Shell and GCC 5.3.0). The C code is like this:
#include <stdio.h>
#include <stdlib.h>
#define _set_tssldt_desc(n,addr,type) \
__asm__ ("movw $104,%1\n\t" \
:\
:"a" (addr),\
"m" (*(n)),\
"m" (*(n+2)),\
"m" (*(n+4)),\
"m" (*(n+5)),\
"m" (*(n+6)),\
"m" (*(n+7))\
)
#define set_tss_desc(n,addr) _set_tssldt_desc(((char *) (n)),addr,"0x89")
char *n;
char *addr;
int main(void) {
char *n = (char *)malloc(100*sizeof(int));
char *addr = (char *)malloc(100*sizeof(int));
set_tss_desc(n, addr);
free(n);
free(addr);
return 0;
}
_set_tssldt_desc(n,addr,type) is a macro whose body is inline assembly. set_tss_desc(n,addr) is a thin wrapper around it, and is what main calls.
When I'm trying to compile this code, the compiler's showing me the following error:
$ gcc test.c
test.c: In function 'main':
test.c:5:1: error: 'asm' operand has impossible constraints
__asm__ ("movw $104,%1\n\t" \
^
test.c:16:30: note: in expansion of macro '_set_tssldt_desc'
#define set_tss_desc(n,addr) _set_tssldt_desc(((char *) (n)),addr,"0x89")
^
test.c:25:3: note: in expansion of macro 'set_tss_desc'
set_tss_desc(n, addr);
^
The strange thing is that if I comment out the invocation in main, the code compiles successfully.
int main(void) {
char *n = (char *)malloc(100*sizeof(int));
char *addr = (char *)malloc(100*sizeof(int));
//I comment it out and code compiled.
//set_tss_desc(n, addr);
free(n);
free(addr);
return 0;
}
Or, if I delete one of the memory operands from the asm statement, it also compiles.
#include <stdio.h>
#include <stdlib.h>
#define _set_tssldt_desc(n,addr,type) \
__asm__ ("movw $104,%1\n\t" \
:\
:"a" (addr),\
"m" (*(n)),\
"m" (*(n+2)),\
"m" (*(n+4)),\
"m" (*(n+5)),\
"m" (*(n+6))\
)
//I DELETE "m" (*(n+7)) , code compiled
#define set_tss_desc(n,addr) _set_tssldt_desc(((char *) (n)),addr,"0x89")
char *n;
char *addr;
int main(void) {
char *n = (char *)malloc(100*sizeof(int));
char *addr = (char *)malloc(100*sizeof(int));
set_tss_desc(n, addr);
free(n);
free(addr);
return 0;
}
Can someone explain to me why that is and how to fix this?

As @MichaelPetch says, you're approaching this the wrong way. If you're trying to set up an operand for lgdt, do that in C and only use inline asm for the lgdt instruction itself. See the inline-assembly tag wiki, and the x86 tag wiki.
Related: a C struct/union for messing with Intel descriptor-tables: How to do computations with addresses at compile/linking time?. (The question wanted to generate the table as static data, hence asking about breaking addresses into low / high halves at compile time).
Also: Implementing GDT with basic kernel for some C + asm GDT manipulation. Or maybe not, since the answer there just says the code in the question is problematic, without a detailed fix.
Linker error setting loading GDT register with LGDT instruction using Inline assembly has an answer from Michael Petch, with some links to more guides/tutorials.
It's still useful to answer the specific question, even though the right fix is https://gcc.gnu.org/wiki/DontUseInlineAsm.
This compiles fine with optimization enabled.
With -O0, gcc doesn't notice or take advantage of the fact that the operands are all small constant offsets from each other, and can use the same base register with an offset addressing mode. It wants to put a pointer to each input memory operand into a separate register, but runs out of registers. With -O1 or higher, CSE does what you'd expect.
You can see this in a reduced example with the last 3 memory operands commented, and changing the asm string to include an asm comment with all the operands. From gcc5.3 -O0 -m32 on the Godbolt compiler explorer:
#define _set_tssldt_desc(n,addr,type) \
__asm__ ("movw $104,%1\n\t" \
"#operands: %0, %1, %2, %3\n" \
...
void simple_wrapper(char *n, char *addr) {
set_tss_desc(n, addr);
}
pushl %ebp
movl %esp, %ebp
pushl %ebx
movl 8(%ebp), %eax
leal 2(%eax), %ecx
movl 8(%ebp), %eax
leal 4(%eax), %ebx
movl 12(%ebp), %eax
movl 8(%ebp), %edx
#APP # your inline-asm code
movw $104,(%edx)
#operands: %eax, (%edx), (%ecx), (%ebx)
#NO_APP
nop # no idea why the compiler inserted a literal NOP here (not .p2align)
popl %ebx
popl %ebp
ret
But with optimization enabled, you get
simple_wrapper:
movl 4(%esp), %edx
movl 8(%esp), %eax
#APP
movw $104,(%edx)
#operands: %eax, (%edx), 2(%edx), 4(%edx)
#NO_APP
ret
Notice how the later operands use base+disp addressing modes.
Your constraints are totally backwards. You're writing to memory that you've told the compiler is an input operand. It will assume that the memory is not modified by the asm statement, so if you load from it in C, it might move that load ahead of the asm statement, among other possible breakage.
If you had used "=m" output operands, this code would be correct (but still inefficient compared to letting the compiler do it for you.)
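For example, here's a minimal x86-only sketch (with a hypothetical helper name) of the first store from the question, with its operand declared as an output so the compiler knows the asm writes those 2 bytes:

```c
#include <stdint.h>

/* x86-only sketch: the movw from the question, with the operand
 * correctly declared as an output ("=m") instead of an input. */
static void store_limit(uint8_t *n)
{
    __asm__ ("movw $104, %0"
             : "=m" (*(uint16_t *)n));  /* output: the 2 bytes we store */
}
```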
You could have written your asm to do the offsetting itself from a single memory operand, but then you'd need some way to tell the compiler about all the memory written by the asm statement; e.g. "=m" (*(struct {char a; char x[];} *) n) to tell it that you write the entire object starting at n. (See this answer).
AT&T syntax x86 memory operands are always offsettable, so you can use 2 + %[nbase] instead of a separate operand:
asm("movw $104, %[nbase]\n\t"
"movw $123, 2 + %[nbase]\n\t"
: [nbase] "=m" (*(struct {char a; char x[];} *) n)
: [addr] "ri" (addr)
);
gas will warn about 2 + (%ebx) or whatever it ends up being, but that's ok.
Using a separate memory output operand for each place you write will avoid any problems about telling the compiler which memory you write. But you got it wrong: you've told the compiler that your code doesn't use n+1 when in fact you're using movw $104 to store 2 bytes starting at n. So that should be a uint16_t memory operand. If this sounds complicated, https://gcc.gnu.org/wiki/DontUseInlineAsm. Like Michael said, do this part in C with a struct, and only use inline asm for a single instruction that needs it.
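For comparison, here's a plain-C sketch of the stores the full Linux 0.11 macro performs (which this code appears to derive from); the byte layout follows the 386 TSS/LDT descriptor format, and the helper name is illustrative, not a drop-in replacement:

```c
#include <stdint.h>

/* Plain-C equivalent of the descriptor stores: limit = 104, 32-bit base
 * scattered per the 386 descriptor layout, type byte (e.g. 0x89 for an
 * available TSS). No inline asm needed for any of this. */
static void set_tssldt_desc_c(uint8_t *n, uint32_t base, uint8_t type)
{
    n[0] = 104 & 0xff;           /* limit, low byte  */
    n[1] = 104 >> 8;             /* limit, high byte */
    n[2] = base & 0xff;          /* base bits 0-7   */
    n[3] = (base >> 8) & 0xff;   /* base bits 8-15  */
    n[4] = (base >> 16) & 0xff;  /* base bits 16-23 */
    n[5] = type;                 /* access/type byte */
    n[6] = 0x00;                 /* limit bits 16-19 + flags, zero here */
    n[7] = (base >> 24) & 0xff;  /* base bits 24-31 */
}
```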
It would obviously be more efficient to use fewer, wider store instructions. IDK what you're planning to do next, but any adjacent constants should be coalesced into a 32-bit store, like movl $(104 + (0x1234<<16)), %[n0] or something. Again, https://gcc.gnu.org/wiki/DontUseInlineAsm.

Related

error: unsupported size for integer register

I'm using i686 gcc on Windows. When I built the code with separate asm statements, it worked. However, when I try to combine it into one statement, it doesn't build and gives me an error: unsupported size for integer register.
Here's my code
u8 lstatus;
u8 lsectors_read;
u8 data_buffer;
void operate(u8 opcode, u8 sector_size, u8 track, u8 sector, u8 head, u8 drive, u8* buffer, u8* status, u8* sectors_read)
{
asm volatile("mov %3, %%ah;\n"
"mov %4, %%al;\n"
"mov %5, %%ch;\n"
"mov %6, %%cl;\n"
"mov %7, %%dh;\n"
"mov %8, %%dl;\n"
"int $0x13;\n"
"mov %%ah, %0;\n"
"mov %%al, %1;\n"
"mov %%es:(%%bx), %2;\n"
: "=r"(lstatus), "=r"(lsectors_read), "=r"(buffer)
: "r"(opcode), "r"(sector_size), "r"(track), "r"(sector), "r"(head), "r"(drive)
:);
status = &lstatus;
sectors_read = &lsectors_read;
buffer = &data_buffer;
}
The error message is a little misleading. It seems to be happening because GCC ran out of 8-bit registers.
Interestingly, it compiles without error messages if you just edit the template to remove references to the last 2 operands (https://godbolt.org/z/oujNP7), even without dropping them from the list of input constraints! (Trimming down your asm statement is a useful debugging technique to figure out which part of it GCC doesn't like, without caring for now if the asm will do anything useful.)
Removing 2 earlier operands and changing numbers shows that "r"(head), "r"(drive) weren't specifically a problem, just the combination of everything.
It looks like GCC is avoiding high-8 registers like AH as inputs, and x86-16 only has 4 low-8 registers but you have 6 u8 inputs. So I think GCC means it ran out of byte registers that it was willing to use.
(The 3 outputs aren't declared early-clobber so they're allowed to overlap the inputs.)
You could maybe work around this by using "rm" to give GCC the option of picking a memory input. (The x86-specific constraints like "Q" that are allowed to pick a high-8 register wouldn't help unless you require it to pick the correct one to get the compiler to emit a mov for you.) That would probably let your code compile, but the result would be totally broken.
You re-introduced basically the same bugs as before: not telling the compiler which registers you write, so for example your mov %4, %%al will overwrite one of the registers GCC picked as an input, before you actually read that operand.
Declaring clobbers on all the registers you use would not leave enough registers to hold all the input variables. (Unless you allow memory source operands.) That could work but is very inefficient: if your asm template string starts or ends with a mov, you're almost always doing it wrong.
Also, there are other serious bugs, apart from how you're using inline asm. You don't supply an input pointer to your buffer. int $0x13 doesn't allocate a new buffer for you; it needs a pointer in ES:BX (which it dereferences but leaves unmodified). GCC requires that ES=DS=SS, so you must already have set up segmentation properly before calling into your C code; it isn't something you have to redo on every call.
Plus, even in C terms outside the inline asm, your function doesn't make sense. status = &lstatus; modifies the value of a function arg; it doesn't dereference it to modify a pointed-to output variable. The variables written by those assignments die at the end of the function. (The global temporaries do have to be updated, because they're global and some other function could see their value.) Perhaps you meant something like *status = lstatus;, with different types for your vars?
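A tiny illustration of that C issue (hypothetical names, nothing BIOS-specific): write through the pointer args instead of reassigning the pointers.

```c
#include <stdint.h>
typedef uint8_t u8;

/* Output parameters done correctly: store through the pointers. */
static void report(u8 st, u8 cnt, u8 *status, u8 *sectors_read)
{
    *status = st;        /* updates the caller's variable */
    *sectors_read = cnt;
    /* status = &st;  would only change the local pointer copy */
}
```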
If that C problem isn't obvious (at least once it's pointed out), you need some more practice with C before you're ready to try mixing C and asm, which requires you to understand both very well in order to correctly describe your asm to the compiler with accurate constraints.
A good and correct way to implement this is shown in @fuz's answer to your previous question. If you want to understand how the constraints can replace your mov instructions, compile it and look at the compiler-generated instructions. See https://stackoverflow.com/tags/inline-assembly/info for links to guides and docs. e.g. @fuz's version without the ES setup (because GCC needs you to have done that already before calling any C):
typedef unsigned char u8;
typedef unsigned short u16;
// Note the different signature, and using the output args correctly.
void read(u8 sector_size, u8 track, u8 sector, u8 head, u8 drive,
u8 *buffer, u8 *status, u8 *sectors_read)
{
u16 result;
asm volatile("int $0x13"
: "=a"(result)
: "a"(0x200|sector_size), "b"(buffer),
"c"(track<<8|sector), "d"(head<<8|drive)
: "memory" ); // memory clobber was missing from #fuz's version
*status = result >> 8;
*sectors_read = result >> 0;
}
Compiles as follows, with GCC10.1 -O2 -m16 on Godbolt:
read:
pushl %ebx
movzbl 12(%esp), %ecx
movzbl 16(%esp), %edx
movzbl 24(%esp), %ebx # load some stack args
sall $8, %ecx
movzbl 8(%esp), %eax
orl %edx, %ecx # shift and merge into CL,CH instead of writing partial regs
movzbl 20(%esp), %edx
orb $2, %ah
sall $8, %edx
orl %ebx, %edx
movl 28(%esp), %ebx # the pointer arg
int $0x13 # from the inline asm statement
movl 32(%esp), %edx # load output pointer arg
movl %eax, %ecx
shrw $8, %cx
movb %cl, (%edx)
movl 36(%esp), %edx
movb %al, (%edx)
popl %ebx
ret
It might be possible to use register u8 track asm("ch") or something to get the compiler to just write partial regs instead of shift/OR.
If you don't want to understand how constraints work, don't use GNU C inline asm. You could instead write stand-alone functions that you call from C, which accept args according to the calling convention the compiler uses (e.g. gcc -mregparm=3, or just everything on the stack with the traditional inefficient calling convention.)
You could do a better job than GCC's code-gen above, but note that when inlined, the asm statement can optimize into the surrounding code and avoid some of the actual copying to memory for passing args via the stack.

c inline assembly getting "operand size mismatch" when using cmpxchg

I'm trying to use cmpxchg with inline assembly through c. This is my code:
static inline int
cas(volatile void* addr, int expected, int newval) {
int ret;
asm volatile("movl %2 , %%eax\n\t"
"lock; cmpxchg %0, %3\n\t"
"pushfl\n\t"
"popl %1\n\t"
"and $0x0040, %1\n\t"
: "+m" (*(int*)addr), "=r" (ret)
: "r" (expected), "r" (newval)
: "%eax"
);
return ret;
}
This is my first time using inline asm and I'm not sure what could be causing this problem.
I tried "cmpxchgl" as well, but still nothing. Also tried removing the lock.
I get "operand size mismatch".
I think maybe it has something to do with the casting i do to addr, but i'm unsure. I try and exchange int for int, so don't really understand why there would be a size mismatch.
This is using AT&T style.
Thanks
As @prl points out, you reversed the operands, putting them in Intel order (see Intel's manual entry for cmpxchg). Any time your inline asm doesn't assemble, you should look at the asm the compiler was feeding to the assembler to see what happened to your template. In your case, simply remove the static inline so the compiler will make a stand-alone definition, then you get (on the Godbolt compiler explorer):
# gcc -S output for the original, with cmpxchg operands backwards
movl %edx , %eax
lock; cmpxchg (%ecx), %ebx # error on this line from the assembler
pushfl
popl %edx
and $0x0040, %edx
Sometimes that will clue your eye / brain in cases where staring at %3 and %0 didn't, especially after you check the instruction-set reference manual entry for cmpxchg and see that the memory operand is the destination (Intel-syntax first operand, AT&T syntax last operand).
This makes sense because the explicit register operand is only ever a source, while EAX and the memory operand are both read and then one or the other is written depending on the success of the compare. (And semantically you use cmpxchg as a conditional store to a memory destination.)
You're discarding the load result from the cas-failure case. I can't think of any use-cases for cmpxchg where doing a separate load of the atomic value would be incorrect, rather than just inefficient, but the usual semantics for a CAS function is that oldval is taken by reference and updated on failure. (At least that's how C++11 std::atomic and C11 stdatomic do it with bool atomic_compare_exchange_weak( volatile A *obj, C* expected, C desired );.)
(The weak/strong thing allows better code-gen for CAS retry-loops on targets that use LL/SC, where spurious failure is possible due to an interrupt or being rewritten with the same value. x86's lock cmpxchg is "strong")
Actually, GCC's legacy __sync builtins provide 2 separate CAS functions: one that returns the old value, and one that returns a bool. Both take the expected old value and the new value by value (not by reference). So it's not the same API that C++11 uses, but apparently it isn't so horrible that nobody used it.
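A quick sketch of those two legacy builtins on a plain int (the __sync names are real; the wrapper functions are just for illustration):

```c
/* __sync_val_compare_and_swap returns the value that was in memory
 * before the operation; __sync_bool_compare_and_swap returns success. */
static int cas_val(int *p, int expected, int newval)
{
    return __sync_val_compare_and_swap(p, expected, newval);
}

static int cas_bool(int *p, int expected, int newval)
{
    return __sync_bool_compare_and_swap(p, expected, newval);
}
```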
Your overcomplicated code isn't portable to x86-64. From your use of popl, I assume you developed it on x86-32. You don't need pushf/pop to get ZF as an integer; that's what setcc is for. cmpxchg example for 64 bit integer has a 32-bit example that works that way (to show what they want a 64-bit version of).
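A sketch of the setcc approach (x86 only; hypothetical function name):

```c
#include <stdint.h>

/* CAS returning ZF via setz instead of pushf/pop; works the same in
 * 32- and 64-bit mode. A sketch, not production code. */
static int cas_setz(int *ptr, int expected, int newval)
{
    uint8_t ok;
    __asm__ __volatile__ ("lock; cmpxchgl %3, %0\n\t"
                          "setz %1"
                          : "+m" (*ptr), "=q" (ok), "+a" (expected)
                          : "r" (newval)
                          : "memory");
    return ok;  /* 1 if the exchange happened */
}
```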
Or even better, use GCC6 flag-return syntax so using this in a loop can compile to a cmpxchg / jne loop instead of cmpxchg / setz %al / test %al,%al / jnz.
We can fix all of those problems and improve the register allocation as well. (If the first or last instruction of an inline-asm statement is mov, you're probably using constraints inefficiently.)
Of course, by far the best thing for real usage would be to use C11 stdatomic or a GCC builtin. https://gcc.gnu.org/wiki/DontUseInlineAsm applies in cases where the compiler can emit asm just as good (or better) from code it "understands", because inline asm constrains the compiler. Inline asm is also difficult to write correctly and efficiently, and to maintain.
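For reference, a C11 stdatomic sketch of the same operation; the compiler emits the lock cmpxchg itself and updates *expected on failure:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Portable CAS: on x86 this compiles to lock cmpxchg, no inline asm. */
static bool cas_c11(atomic_int *obj, int *expected, int desired)
{
    return atomic_compare_exchange_strong(obj, expected, desired);
}
```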
Portable to i386 and x86-64, AT&T or Intel syntax, and works for any integer type width of register width or smaller:
// Note: oldVal by reference
static inline char CAS_flagout(int *ptr, int *poldVal, int newVal)
{
char ret;
__asm__ __volatile__ (
" lock; cmpxchg {%[newval], %[mem] | %[mem], %[newval]}\n"
: "=#ccz" (ret), [mem] "+m" (*ptr), "+a" (*poldVal)
: [newval]"r" (newVal)
: "memory"); // barrier for compiler reordering around this
return ret; // ZF result, 1 on success else 0
}
// spinning read-only is much better (with _mm_pause in the retry loop)
// not hammering on the cache line with lock cmpxchg.
// This is over-simplified so the asm is super-simple.
void cas_retry(int *lock) {
int oldval = 0;
while(!CAS_flagout(lock, &oldval, 1)) oldval = 0;
}
The { foo,bar | bar,foo } syntax is ASM dialect alternatives. For x86, it's {AT&T | Intel}. The %[newval] is a named operand constraint; it's another way to keep your operands straight. The "=@ccz" flag-output constraint takes the z condition code as the output value, like a setz.
Compiles on Godbolt to this asm for 32-bit x86 with AT&T output:
cas_retry:
pushl %ebx
movl 8(%esp), %edx # load the pointer arg.
movl $1, %ecx
xorl %ebx, %ebx
.L2:
movl %ebx, %eax # xor %eax,%eax would save a lot of insns
lock; cmpxchg %ecx, (%edx)
jne .L2
popl %ebx
ret
gcc is dumb and stores a 0 in one reg before copying it to eax, instead of re-zeroing eax inside the loop. This is why it needs to save/restore EBX at all. It's the same asm we get from avoiding inline-asm, though (from x86 spinlock using cmpxchg):
// also omits _mm_pause and read-only retry, see the linked question
void spin_lock_oversimplified(int *p) {
while(!__sync_bool_compare_and_swap(p, 0, 1));
}
Someone should teach gcc that Intel CPUs can materialize a 0 more cheaply with xor-zeroing than they can copy it with mov, especially on Sandybridge (xor-zeroing elimination but no mov-elimination).
You have the operand order for the cmpxchg instruction reversed. AT&T syntax needs the memory destination last:
"lock; cmpxchg %3, %0\n\t"
Or you could compile that instruction with its original order using -masm=intel, but the rest of your code is AT&T syntax and ordering so that's not the right answer.
As far as why it says "operand size mismatch", I can only say that that appears to be an assembler bug, in that it uses the wrong message.

Inline Assembly Causing Errors about No Prefixes

Hello,
So, I'm optimizing some functions that I wrote for a simple operating system I'm developing. This function, putpixel(), currently looks like this (in case my assembly is unclear or wrong):
uint32_t loc = (x*pixel_w)+(y*pitch);
vidmem[loc] = color & 255;
vidmem[loc+1] = (color >> 8) & 255;
vidmem[loc+2] = (color >> 16) & 255;
This takes a little bit of explanation. First, loc is the pixel index I want to write to in video memory. X and Y coordinates are passed to the function. Then, we multiply X by the pixel width in bytes (in this case, 3) and Y by the number of bytes in each line. More information can be found here.
vidmem is a global variable, a uint8_t pointer to video memory.
That being said, anyone familiar with bitwise operations should be able to figure out how putpixel() works fairly easily.
Now, here's my assembly. Note that it has not been tested and may even be slower or just plain not work. This question is about how to make it compile.
I've replaced everything after the definition of loc with this:
__asm(
"push %%rdi;"
"push %%rbx;"
"mov %0, %%rdi;"
"lea %1, %%rbx;"
"add %%rbx, %%rdi;"
"pop %%rbx;"
"mov %2, %%rax;"
"stosb;"
"shr $8, %%rax;"
"stosb;"
"shr $8, %%rax;"
"stosb;"
"pop %%rdi;" : :
"r"(loc), "r"(vidmem), "r"(color)
);
When I compile this, clang gives me this error for every push instruction:
So when I saw that error, I assumed it had to do with my omission of the GAS suffixes (which should have been implicitly decided on, anyway). But when I added the "l" suffix (all of my variables are uint32_ts), I got the same error! I'm not quite sure what's causing it, and any help would be much appreciated. Thanks in advance!
You could probably make the compiler's output for your C version much more efficient by loading vidmem into a local variable before the stores. As it is, it can't assume that the stores don't alias vidmem, so it reloads the pointer before every byte store. Hrm, that does let gcc 4.9.2 avoid reloading vidmem, but it still generates some nasty code. clang 3.5 does slightly better.
Implementing what I said in my comment on your answer (that stos is 3 uops vs. 1 for mov):
#include <stdint.h>
extern uint8_t *vidmem;
void putpixel_asm_peter(uint32_t color, uint32_t loc)
{
// uint32_t loc = (x*pixel_w)+(y*pitch);
__asm( "\n"
"\t movb %b[col], (%[ptr])\n"
"\t shrl $8, %[col];\n"
"\t movw %w[col], 1(%[ptr]);\n"
: [col] "+r" (color), "=m" (vidmem[loc])
: [ptr] "r" (vidmem+loc)
:
);
}
compiles to a very efficient implementation:
gcc -O3 -S -o- putpixel.c 2>&1 | less # (with extra lines removed)
putpixel_asm_peter:
movl %esi, %esi
addq vidmem(%rip), %rsi
#APP
movb %dil, (%rsi)
shrl $8, %edi;
movw %di, 1(%rsi);
#NO_APP
ret
All of those instructions decode to a single uop on Intel CPUs. (The stores can micro-fuse, because they use a single-register addressing mode.) The movl %esi, %esi zeroes the upper 32 bits, since the caller might have generated that function arg with a 64-bit instruction that left garbage in the high 32 of %rsi. Your version could have saved some instructions by using constraints to ask for the values in the desired registers in the first place, but this will still be faster than stos.
Also notice how I let the compiler take care of adding loc to vidmem. You could have done it more efficiently in yours, with a lea to combine an add with a move. However, if the compiler wants to get clever when this is used in a loop, it could increment the pointer instead of redoing the add. Finally, this means the same code works for both 32- and 64-bit: %[ptr] will be a 64-bit reg in 64-bit mode, but a 32-bit reg in 32-bit mode. Since I don't have to do any math on it, it Just Works.
I used =m output constraint to tell the compiler where we're writing in memory. (I should have cast the pointer to a struct { char a[3]; } or something, to tell gcc how much memory it actually writes, as per the tip at the end of the "Clobbers" section in the gcc manual)
I also used color as an input/output constraint to tell the compiler that we modify it. If this got inlined, and later code expected to still find the value of color in the register, we'd have a problem. Having this in a function means color is already a tmp copy of the caller's value, so the compiler will know it needs to throw away the old color. Calling this in a loop could be slightly more efficient with two read-only inputs: one for color, one for color >> 8.
Note that I could have written the constraints as
: [col] "+r" (color), [memref] "=m" (vidmem[loc])
:
:
But using %[memref] and 1 + %[memref] to generate the desired addresses would lead gcc to emit
movl %esi, %esi
movq vidmem(%rip), %rax
# APP
movb %dil, (%rax,%rsi)
shrl $8, %edi;
movw %di, 1 (%rax,%rsi);
The two-reg addressing mode means the store instructions can't micro-fuse (on Sandybridge and later, at least).
You don't even need inline asm to get decent code, though:
void putpixel_cast(uint32_t color, uint32_t loc)
{
// uint32_t loc = (x*pixel_w)+(y*pitch);
typeof(vidmem) vmem = vidmem;
vmem[loc] = color & 255;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
*(uint16_t *)(vmem+loc+1) = color >> 8;
#else
vmem[loc+1] = (color >> 8) & 255; // gcc sucks at optimizing this for little endian :(
vmem[loc+2] = (color >> 16) & 255;
#endif
}
compiles to (gcc 4.9.2 and clang 3.5 give the same output):
movq vidmem(%rip), %rax
movl %esi, %esi
movb %dil, (%rax,%rsi)
shrl $8, %edi
movw %di, 1(%rax,%rsi)
ret
This is only a tiny bit less efficient than what we get with inline asm, and should be easier for the optimizer to optimize if inlined into loops.
Overall performance
Calling this in a loop is probably a mistake. It'll be more efficient to combine multiple pixels in a register (esp. a vector register), and then write all at once. Or, do 4-byte writes, overlapping the last byte of the previous write, until you get to the end and have to preserve the byte after the last chunk of 3.
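A C sketch of the overlapping-write idea (assumes little-endian, with a hypothetical fill24 helper): each 4-byte store's extra byte is overwritten by the next store, and only the final pixel needs a 3-byte tail.

```c
#include <stdint.h>
#include <string.h>

/* Fill `count` 24bpp pixels with `color` (low 3 bytes used).
 * All but the last pixel get a 4-byte store whose spare byte is
 * clobbered by the next store; the last pixel gets exactly 3 bytes
 * so we never write past the end of the buffer. */
static void fill24(uint8_t *dst, uint32_t color, size_t count)
{
    size_t i;
    for (i = 0; i + 1 < count; i++)
        memcpy(dst + 3*i, &color, 4);      /* 4-byte store, 1 byte overlap */
    if (count)
        memcpy(dst + 3*(count - 1), &color, 3);  /* final pixel, no overrun */
}
```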
See http://agner.org/optimize/ for more stuff about optimizing C and asm. That and other links can be found at https://stackoverflow.com/tags/x86/info.
Found the problem!
It was in a lot of places, but the major one was vidmem. I assumed it would pass the address, but it was causing an error. After referring to it as a dword, it worked perfectly. I also had to change the other constraints to "m", and I finally got this result (after some optimization):
__asm(
"movl %0, %%edi;"
"movl %k1, %%ebx;"
"addl %%ebx, %%edi;"
"movl %2, %%eax;"
"stosb;"
"shrl $8, %%eax;"
"stosw;" : :
"m"(loc), "r"(vidmem), "m"(color)
: "edi", "ebx", "eax"
);
Thanks to everyone who answered in the comments!

Inline assembly, getting into interrupt

Good day.
I faced a problem that I couldn't solve for several days. The error appears when I try to compile this function in C language.
void GetInInterrupt(UChar Interrupt)
{
//asm volatile(".intel_syntax noprefix");
asm volatile
(
"movb %0, %%al\n"
"movb %%al, 1(point)\n"
"point:\n"
"int $0\n"
: /*output*/ : "r" (Interrupt) /*input*/ : /*clobbered*/
);
//asm volatile(".att_syntax noprefix");
}
Message I get from gas is following:
Error: junk '(point)' after expression
As I can understand the pointer in second line is faulty, but unfortunately I can't solve it by my own.
Thank you for help.
If you can use C++, then this one:
template <int N> static inline void GetInInterrupt (void)
{
__asm__ ("int %0\n" : : "N"(N));
}
will do. If I use that template like:
GetInInterrupt<123>();
GetInInterrupt<3>();
GetInInterrupt<23>();
GetInInterrupt<0>();
that creates the following object code:
 0: cd 7b    int $0x7b
 2: cc       int3
 3: cd 17    int $0x17
 5: cd 00    int $0x0
which is pretty much optimal (even for the int3 case, which is the breakpoint op). It'll also create a compile-time warning if the operand is out of the 0..255 range, due to the N constraint allowing only that.
Edit: plain old C-style macros work as well, of course:
#define GetInInterrupt(arg) __asm__("int %0\n" : : "N"((arg)) : "cc", "memory")
creates the same code as the C++ templated function. Due to the way int behaves, it's a good idea to tell the compiler (via the "cc", "memory" clobbers) about the barrier semantics, to make sure it doesn't try to re-order instructions around the embedded inline assembly.
The limitation of both is, obviously, the fact that the interrupt number must be a compile-time constant. If you absolutely don't want that, then a switch() statement covering all 256 cases, created e.g. with the help of BOOST_PP_REPEAT(), is a better option than self-modifying code, i.e. like:
#include <boost/preprocessor/repetition/repeat.hpp>
#define GET_INTO_INT(a, INT, d) case INT: GetInInterrupt<INT>(); break;
void GetInInterrupt(int interruptNumber)
{
switch(interruptNumber) {
BOOST_PP_REPEAT(256, GET_INTO_INT, 0)
default:
runtime_error("interrupt Number %d out of range", interruptNumber);
}
}
This can be done in plain C (if you change the templated function invocation for a plain __asm__ of course) - because the boost preprocessor library does not depend on a C++ compiler ... and gcc 4.7.2 creates the following code for this:
GetInInterrupt:
.LFB0:
cmpl $255, %edi
jbe .L262
movl %edi, %esi
xorl %eax, %eax
movl $.LC0, %edi
jmp runtime_error
.p2align 4,,10
.p2align 3
.L262:
movl %edi, %edi
jmp *.L259(,%rdi,8)
.section .rodata
.align 8
.align 4
.L259:
.quad .L3
.quad .L4
[ ... ]
.quad .L258
.text
.L257:
#APP
# 17 "tccc.c" 1
int $254
# 0 "" 2
#NO_APP
ret
[ ... accordingly for the other vectors ... ]
Beware though if you do the above ... the compiler (gcc up to and including 4.8) is not intelligent enough to optimize the switch() away, i.e. even if you say static __inline__ ... it'll create the full jump table version of GetInInterrupt(3) instead of just an inlined int3 as would the simpler implementations.
Below is how you could write to a location in the code. It assumes that the code is writable in the first place, which is typically not the case in mainstream OSes, since writable code would hide some nasty bugs.
void GetInInterrupt(UChar Interrupt)
{
//asm volatile(".intel_syntax noprefix");
asm volatile
(
"movb %0, point+1\n"
"point:\n"
"int $0\n"
: /*output*/ : "r" (Interrupt) /*input*/ : /*clobbered */
);
//asm volatile(".att_syntax noprefix");
}
I also simplified the code to avoid using two registers, instead just using the register that Interrupt is already in. If the compiler moans about it, you may find that "a" instead of "r" solves the problem.

GNU C inline asm "m" constraint with a pointer: address vs. pointed-to value?

I am trying to understand some things about inline assembler in Linux. I am using following function:
void test_func(Word32 *var){
asm( " addl %0, %%eax" : : "m"(var) );
return;
}
It generates following assembler code:
.globl test_func
.type test_func, #function
test_func:
pushl %ebp
movl %esp, %ebp
#APP
# 336 "opers.c" 1
addl 8(%ebp), %eax
# 0 "" 2
#NO_APP
popl %ebp
ret
.size test_func, .-test_func
It adds var's memory address to the eax register value instead of var's value.
Is there any way to tell the addl instruction to use var's value instead of its memory address, without copying the address to a register?
Regards
It adds var's memory address to the eax register value instead of var's value.
Yes, the syntax of gcc inline assembly is pretty arcane. Paraphrasing from the relevant section in the GCC Inline Assembly HOWTO: "m" roughly gives you the memory location of the C variable.
It's what you'd use when you just want an address you can write to or read from. Notice I said the location of the C variable, so %0 is set to the address of Word32 *var - you have a pointer to a pointer. A C translation of the inline assembly block could look like EAX += *(&var) because you can say that the "m" constraint implicitly takes the address of the C variable and gives you an address expression, that you then add to %eax.
Is there any way to tell the addl instruction to use var's value instead of its memory address, without copying the address to a register?
That depends on what you mean. You need to get var from the stack, so someone has to dereference memory (see @Bo Persson's answer), but you don't have to do it in the inline assembly.
The constraint needs to be "m"(*var) (as @fazo suggested). That will give you the memory location of the value that var is pointing to, rather than a memory location pointing to it.
The generated code is now:
test_func:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
#APP
# 2 "test.c" 1
addl (%eax), %eax
# 0 "" 2
#NO_APP
popl %ebp
ret
Which is a little suspect, but that's understandable, as you forgot to tell GCC that you clobbered (modified without listing in the input/output operands) %eax. Fixing that, asm("addl %0, %%eax" : : "m"(*var) : "%eax" ) generates:
movl 8(%ebp), %edx
addl (%edx), %eax
Which isn't any better or more correct in this case, but it is always a good practice to remember. See the section on the clobber list and pay special attention to the "memory" clobber for advanced usage of inline assembly.
Even though you don't want to (explicitly) load the memory address into a register I'll briefly cover it.
Changing the constraint from "m" to "r" almost seems to work, the relevant sections gets changed to (if we include %eax in the clobber list):
movl 8(%ebp), %edx
addl %edx, %eax
Which is almost correct: we have loaded the pointer value var into a register, but now we have to write the memory dereference ourselves. Changing the code to match the constraint (usually undesirable; shown only for completeness):
asm("addl (%0), %%eax" : : "r"(var) : "%eax" );
Gives:
movl 8(%ebp), %edx
addl (%edx), %eax
The same as with "m".
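The "r" variant can be sketched the same way (x86-only asm with a C fallback; add_via_reg is a hypothetical name). Because the template dereferences the pointer itself, a "memory" clobber is the blunt way to tell GCC the asm reads memory it can't see:

```c
#include <stdint.h>

typedef uint32_t Word32;

/* The pointer is passed in a register ("r"), and the template
   supplies the memory dereference explicitly: (%[ptr]). */
static Word32 add_via_reg(Word32 acc, const Word32 *var) {
#if defined(__i386__) || defined(__x86_64__)
    __asm__("addl (%[ptr]), %[acc]"
            : [acc] "+r"(acc)
            : [ptr] "r"(var)
            : "memory");  /* asm reads *var behind GCC's back */
#else
    acc += *var;  /* fallback for non-x86 targets */
#endif
    return acc;
}
```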
Yes, because you're giving it var, which is an address. Give it *var instead,
like:
void test_func(Word32 *var){
asm( " addl %0, %%eax" : : "m"(*var) );
return;
}
I don't remember exactly, but shouldn't you replace "m" with "r"?
A memory operand doesn't mean it will take the value from that address; it's just a pointer.
No, there is no addressing mode for x86 processors that goes two levels indirect.
You have to first load the pointer from a memory address and then load indirectly from the pointer.
An "m" constraint doesn't implicitly dereference anything. It's just like an "r" constraint, except it expands to an addressing mode for a memory location holding the value of the expression, instead of a register. (In C, every object has an address, although often that can be optimized away.)
The C object that's an input (or output for "=m") to the asm is the lvalue or rvalue you specify, e.g. "m"(var) takes the value of var, not *var. So you'd be adding the pointer. (And telling the compiler that you want that input pointer value to be in memory, not a register.)
Perhaps it's confusing you that you have a pointer but you called it var, not ptr or something? A C pointer is an object whose value is an address, and can itself be stored in memory. If you were using C++, Word32 &var would make the dereference implicit whenever you write var.
In C terms, you're doing eax += ptr, but you want eax += *ptr, so you should write
void test_func(Word32 *ptr){
asm( "add %[input], %%eax"
: // no inputs. Probably you should use "+a"(add_to_this) if you want the add result, and remove the EAX clobber.
: [input] "m"(*ptr) // the pointed-to Word32 in memory
: "eax" // the instruction modifies EAX; tell the compiler about it
);
}
Compiling (Godbolt compiler explorer) results in:
# gcc -O3 -m32
test_func:
movl 4(%esp), %edx # compiler-generated load of the function arg
add (%edx), %eax # from asm template, (%edx) filled in as %[input] for *ptr
ret
Or if you'd compiled with -mregparm=3, or a 64-bit build, the arg would already be in a register. e.g. 64-bit GCC emits add (%rdi), %eax ; ret.
If you'd written return *ptr in C for a function returning Word32, with no inline asm, the asm would be similar, loading the pointer arg from the stack and then mov (%edx), %eax to load the return value. See the Godbolt link for that.
If inline asm isn't doing what you expect, look at the compiler generated asm to see how it filled in your template. That sometimes helps you figure out what the compiler thought you meant. (But only if you understand the basic design principles.)
If you write "m"(ptr), it compiles as follows:
void add_pointer(Word32 *ptr)
{
asm( "add %[input], %%eax" : : [input] "m"(ptr) : "eax" );
}
add_pointer:
add 4(%esp), %eax # ptr
ret
Very similar to if you wrote Word32 *bar(Word32 *ptr){ return ptr; }
Note that if you wanted to increment the memory location, you'd use a "+m"(*ptr) constraint to tell the compiler that the pointed-to memory is both an input and output. Or if you write-only to the memory, "=m"(*ptr) so it can potentially optimize away earlier dead stores to this memory location.
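As a sketch of that read-modify-write case (x86-only asm with a C fallback; inc_in_place is a hypothetical name), "+m"(*ptr) marks the pointed-to memory as both input and output:

```c
#include <stdint.h>

typedef uint32_t Word32;

/* Increments the pointed-to word in place; "+m" tells GCC the
   memory operand is both read and written by the asm. */
static void inc_in_place(Word32 *ptr) {
#if defined(__i386__) || defined(__x86_64__)
    __asm__("addl $1, %[mem]" : [mem] "+m"(*ptr));
#else
    ++*ptr;  /* fallback for non-x86 targets */
#endif
}
```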
See also How can I indicate that the memory *pointed* to by an inline ASM argument may be used? to handle cases where you use an "r"(ptr) input and dereference the pointer manually inside the asm, accessing memory that you didn't tell the compiler about as being an input or output operand.
Generally avoid doing "r"(ptr) and then manually writing add (%0), %%eax. It needs extra constraints to be safe, and it forces the compiler to materialize the exact address in a register instead of reaching it via an addressing mode relative to some other register, e.g. 4(%ecx), if after inlining it sees that you're actually passing a pointer into an array or to a struct member.
Of course, generally avoid inline asm entirely unless you can't get the compiler to emit good enough asm without it. https://gcc.gnu.org/wiki/DontUseInlineAsm. If you do decide to use it, see https://stackoverflow.com/tags/inline-assembly/info for guides to avoid common mistakes.
Try
void test_func(Word32 *var){
    asm( "mov %0, %%edx\n\t"
         "addl (%%edx), %%eax"
         : : "m"(var)
         : "%edx", "%eax" );  /* declare the registers the asm modifies */
    return;
}