Clang integrated assembler and “unknown token in expression” during negate - c

We are using Clang in a default configuration. In the default configuration, Clang's Integrated Assembler is used (and not the system assembler, like GAS). I'm having trouble determining the exact problem (and fix) for:
make
...
clang++ -DNDEBUG -g2 -O3 -Wall -fPIC -arch i386 -arch x86_64 -pipe -Wno-tautological-compare -c integer.cpp
integer.cpp:542:2: error: unknown token in expression
AS1( neg %1)
^
./cpu.h:220:17: note: expanded from macro 'AS1'
#define AS1(x) GNU_AS1(x)
^
./cpu.h:215:24: note: expanded from macro 'GNU_AS1'
#define GNU_AS1(x) "\n\t" #x ";"
^
<inline asm>:3:6: note: instantiated into assembly here
neg %rcx;
^
The code in question is below.
I think the error above may be related to Inline Assembly and Compatibility. A simple negq %1 did not work as expected:
integer.cpp:543:2: error: unknown token in expression
AS1( negq %1)
^
./cpu.h:220:17: note: expanded from macro 'AS1'
#define AS1(x) GNU_AS1(x)
^
./cpu.h:215:24: note: expanded from macro 'GNU_AS1'
#define GNU_AS1(x) "\n\t" #x ";"
^
<inline asm>:3:7: note: instantiated into assembly here
negq %rcx;
^
Since its Intel assembly, I also tried negq [%1], neg DWORD PTR [%1] and negq DWORD PTR [%1] with no joy.
“unknown token in expression” appears to be a bogus message since the assembly instruction its complaining about does not seem to have a problem. So I suspect there is something else the Clang integrated assembler finds offensive.
What is the problem with the code?
Code causing the compile error:
int Baseline_Add(size_t N, word *C, const word *A, const word *B)
{
word result;
__asm__ __volatile__
(
".intel_syntax;" "\n"
AS1( neg %1)
ASJ( jz, 1, f)
AS2( mov %0,[%3+8*%1])
AS2( add %0,[%4+8*%1])
AS2( mov [%2+8*%1],%0)
ASL(0)
AS2( mov %0,[%3+8*%1+8])
AS2( adc %0,[%4+8*%1+8])
AS2( mov [%2+8*%1+8],%0)
AS2( lea %1,[%1+2])
ASJ( jrcxz, 1, f)
AS2( mov %0,[%3+8*%1])
AS2( adc %0,[%4+8*%1])
AS2( mov [%2+8*%1],%0)
ASJ( jmp, 0, b)
ASL(1)
AS2( mov %0, 0)
AS2( adc %0, %0)
".att_syntax;" "\n"
: "=&r" (result), "+c" (N)
: "r" (C+N), "r" (A+N), "r" (B+N)
: "memory", "cc"
);
return (int)result;
}

This appears to be an issue in the Integrated Assembler. One of the LLVM developers opened a PR after the question was asked on the CFE-dev mailing list. Also see LLVM Bug 24232 - [X86] Inline assembly operands don't work with .intel_syntax.
I think that means Michael's comment could be a solution if we had the time and energy to rewrite blocks with 1000's of lines.

Related

Intel asm syntax with GCC: undefined reference

I am running Ubuntu 64-bit and I have this code:
#include <stdio.h>
int main() {
int x, y;
int z = 0;
printf("Enter two numbers: ");
scanf("%d %d", &x, &y);
asm(".intel_syntax noprefix\n"
"MOV EAX, _x\n"
"MOV ECX, _y\n"
"ADD EAX, ECX\n"
"MOV _z, EAX\n"
".att_syntax\n");
printf("%d + %d = %d \n", x, y, z);
return 0;
}
According to lecture at school it should work, but when I try to compile it with GCC I get this error:
/tmp/ccU4vNLr.o: In function `main':
Jung_79913_211.c:(.text+0x4a): undefined reference to `_x'
Jung_79913_211.c:(.text+0x51): undefined reference to `_y'
Jung_79913_211.c:(.text+0x5a): undefined reference to `_z'
collect2: error: ld returned 1 exit status
I know GCC uses AT&T asm syntax by default, but I need Intel systax at university. So question is, how I can get it working?
Two things: First, on Linux you don't prefix C symbols with an underscore in assembly, so x, y, z instead of _x, _y, _z. Second, these three variables are automatic variables. You cannot refer to automatic variables like this as no symbols are created for them. Instead, you need to tell the compiler to hand-over these variables into your assembly. You also need to mark the registers eax and ecx as clobbered because your assembly modifies them. Read this documentation for details. Here is how this could work with your code:
asm(
"MOV EAX, %1\n"
"MOV ECX, %2\n"
"ADD EAX, ECX\n"
"MOV %0, EAX\n"
: "=r" (z) : "r" (x), "r" (y) : "eax", "ecx");
You also need to compile with -masm=intel for this to work as otherwise gcc will insert references to registers in AT&T syntax, causing a compilation error. Even better, learn AT&T syntax if you plan to write a lot of inline assembly for gcc.

Embedded broadcasts with intrinsics and assembly

In section 2.5.3 "Broadcasts" of the Intel Architecture Instruction Set Extensions Programming Reference the we learn than
AVX512 (and Knights Corner) has
a bit-field to encode data broadcast for some load-op instructions, i.e. instructions that
load data from memory and perform some computational
or data movement operation.
For example using Intel assembly syntax we can broadcast the scalar at the address stored in rax and then multiplying with the 16 floats in zmm2 and write the result to zmm1 like this
vmulps zmm1, zmm2, [rax] {1to16}
However, there are no intrinsics which can do this. Therefore, with intrinsics the compiler should be able to fold
__m512 bb = _mm512_set1_ps(b);
__m512 ab = _mm512_mul_ps(a,bb);
to a single instruction
vmulps zmm1, zmm2, [rax] {1to16}
but I have not observed GCC doing this. I found a GCC bug report about this.
I have observed something similar with FMA with GCC. e.g. GCC 4.9 will not collapse _mm256_add_ps(_mm256_mul_ps(areg0,breg0) to a single fma instruction with -Ofast. However, GCC 5.1 does collapse it to a single fma now. At least there are intrinsics to do this with FMA e.g. _mm256_fmadd_ps. But there is no e.g. _mm512_mulbroad_ps(vector,scalar) intrinsic.
GCC may fix this at some point but until then assembly is the only solution.
So my question is how to do this with inline assembly in GCC?
I think I may have come up with the correct syntax (but I am not sure) for GCC inline assembly for the example above.
"vmulps (%%rax)%{1to16}, %%zmm1, %%zmm2\n\t"
I am really looking for a function like this
static inline __m512 mul_broad(__m512 a, float b) {
return a*b;
}
where if b is in memory point to in rax it produces
vmulps (%rax){1to16}, %zmm0, %zmm0
ret
and if b is in xmm1 it produces
vbroadcastss %xmm1, %zmm1
vmulps %zmm1, %zmm0, %zmm0
ret
GCC will already do the vbroadcastss-from-register case with intrinsics, but if b is in memory, compiles this to a vbroadcastss from memory.
__m512 mul_broad(__m512 a, float b) {
__m512 bb = _mm512_set1_ps(b);
__m512 ab = _mm512_mul_ps(a,bb);
return ab;
}
clang will use a broadcast memory operand if b is in memory.
As Peter Cordes notes GCC doesn't let you specify a different template for different constraint alternatives. So instead my solution has the assembler choose the correct instruction according to the operands chosen.
I don't have a version of GCC that supports the ZMM registers, so this following example uses XMM registers and a couple of nonexistent instructions to demonstrate how you can achieve what you're looking for.
typedef __attribute__((vector_size(16))) float v4sf;
v4sf
foo(v4sf a, float b) {
v4sf ret;
asm(".ifndef isxmm\n\t"
".altmacro\n\t"
".macro ifxmm operand, rnum\n\t"
".ifc \"\\operand\",\"%%xmm\\rnum\"\n\t"
".set isxmm, 1\n\t"
".endif\n\t"
".endm\n\t"
".endif\n\t"
".set isxmm, 0\n\t"
".set regnum, 0\n\t"
".rept 8\n\t"
"ifxmm <%2>, %%regnum\n\t"
".set regnum, regnum + 1\n\t"
".endr\n\t"
".if isxmm\n\t"
"alt-1 %1, %2, %0\n\t"
".else\n\t"
"alt-2 %1, %2, %0\n\t"
".endif\n\t"
: "=x,x" (ret)
: "x,x" (a), "x,m" (b));
return ret;
}
v4sf
bar(v4sf a, v4sf b) {
return foo(a, b[0]);
}
This example should be compiled with gcc -m32 -msse -O3 and should generate two assembler error messages similar to the following:
t103.c: Assembler messages:
t103.c:24: Error: no such instruction: `alt-2 %xmm0,4(%esp),%xmm0'
t103.c:22: Error: no such instruction: `alt-1 %xmm0,%xmm1,%xmm0'
The basic idea here is the assembler checks to see whether the second operand (%2) is an XMM register or something else, presumably a memory location. Since the GNU assembler doesn't support much in the way of operations on strings, the second operand is compared to every possible XMM register one at a time in a .rept loop. The isxmm macro is used to paste %xmm and a register number together.
For your specific problem you'd probably need to rewrite it something like this:
__m512
mul_broad(__m512 a, float b) {
__m512 ret;
__m512 dummy;
asm(".ifndef isxmm\n\t"
".altmacro\n\t"
".macro ifxmm operand, rnum\n\t"
".ifc \"\\operand\",\"%%zmm\\rnum\"\n\t"
".set isxmm, 1\n\t"
".endif\n\t"
".endm\n\t"
".endif\n\t"
".set isxmm, 0\n\t"
".set regnum, 0\n\t"
".rept 32\n\t"
"ifxmm <%[b]>, %%regnum\n\t"
".set regnum, regnum + 1\n\t"
".endr\n\t"
".if isxmm\n\t"
"vbroadcastss %x[b], %[b]\n\t"
"vmulps %[a], %[b], %[ret]\n\t"
".else\n\t"
"vmulps %[b] %{1to16%}, %[a], %[ret]\n\t"
"# dummy = %[dummy]\n\t"
".endif\n\t"
: [ret] "=x,x" (ret), [dummy] "=xm,x" (dummy)
: [a] "x,xm" (a), [b] "m,[dummy]" (b));
return ret;
}

How to access C struct/variables from inline asm?

Consider the following code:
int bn_div(bn_t *bn1, bn_t *bn2, bn_t *bnr)
{
uint32 q, m; /* Division Result */
uint32 i; /* Loop Counter */
uint32 j; /* Loop Counter */
/* Check Input */
if (bn1 == NULL) return(EFAULT);
if (bn1->dat == NULL) return(EFAULT);
if (bn2 == NULL) return(EFAULT);
if (bn2->dat == NULL) return(EFAULT);
if (bnr == NULL) return(EFAULT);
if (bnr->dat == NULL) return(EFAULT);
#if defined(__i386__) || defined(__amd64__)
__asm__ (".intel_syntax noprefix");
__asm__ ("pushl %eax");
__asm__ ("pushl %edx");
__asm__ ("pushf");
__asm__ ("movl %eax, (bn1->dat[i])");
__asm__ ("xorl %edx, %edx");
__asm__ ("divl (bn2->dat[j])");
__asm__ ("movl (q), %eax");
__asm__ ("movl (m), %edx");
__asm__ ("popf");
__asm__ ("popl %edx");
__asm__ ("popl %eax");
#else
q = bn->dat[i] / bn->dat[j];
m = bn->dat[i] % bn->dat[j];
#endif
/* Return */
return(0);
}
The data types uint32 is basically an unsigned long int or a uint32_t unsigned 32-bit integer. The type bnint is either a unsigned short int (uint16_t) or a uint32_t depending on if 64-bit data types are available or not. If 64-bit is available, then bnint is a uint32, otherwise it's a uint16. This was done in order to capture carry/overflow in other parts of the code. The structure bn_t is defined as follows:
typedef struct bn_data_t bn_t;
struct bn_data_t
{
uint32 sz1; /* Bit Size */
uint32 sz8; /* Byte Size */
uint32 szw; /* Word Count */
bnint *dat; /* Data Array */
uint32 flags; /* Operational Flags */
};
The function starts on line 300 in my source code. So when I try to compile/make it, I get the following errors:
system:/home/user/c/m3/bn 1036 $$$ ->make
clang -I. -I/home/user/c/m3/bn/.. -I/home/user/c/m3/bn/../include -std=c99 -pedantic -Wall -Wextra -Wshadow -Wpointer-arith -Wcast-align -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Wwrite-strings -Wfloat-equal -Winline -Wunknown-pragmas -Wundef -Wendif-labels -c /home/user/c/m3/bn/bn.c
/home/user/c/m3/bn/bn.c:302:12: warning: unused variable 'q' [-Wunused-variable]
uint32 q, m; /* Division Result */
^
/home/user/c/m3/bn/bn.c:302:15: warning: unused variable 'm' [-Wunused-variable]
uint32 q, m; /* Division Result */
^
/home/user/c/m3/bn/bn.c:303:12: warning: unused variable 'i' [-Wunused-variable]
uint32 i; /* Loop Counter */
^
/home/user/c/m3/bn/bn.c:304:12: warning: unused variable 'j' [-Wunused-variable]
uint32 j; /* Loop Counter */
^
/home/user/c/m3/bn/bn.c:320:14: error: unknown token in expression
__asm__ ("movl %eax, (bn1->dat[i])");
^
<inline asm>:1:18: note: instantiated into assembly here
movl %eax, (bn1->dat[i])
^
/home/user/c/m3/bn/bn.c:322:14: error: unknown token in expression
__asm__ ("divl (bn2->dat[j])");
^
<inline asm>:1:12: note: instantiated into assembly here
divl (bn2->dat[j])
^
4 warnings and 2 errors generated.
*** [bn.o] Error code 1
Stop in /home/user/c/m3/bn.
system:/home/user/c/m3/bn 1037 $$$ ->
What I know:
I consider myself to be fairly well versed in x86 assembler (as evidenced from the code that I wrote above). However, the last time that I mixed a high level language and assembler was using Borland Pascal about 15-20 years ago when writing graphics drivers for games (pre-Windows 95 era). My familiarity is with Intel syntax.
What I don't know:
How do I access members of bn_t (especially *dat) from asm? Since *dat is a pointer to uint32, I am accessing the elements as an array (eg. bn1->dat[i]).
How do I access local variables that are declared on the stack?
I am using push/pop to restore clobbered registers to their previous values so as to not upset the compiler. However, do I also need to include the volatile keyword on the local variables as well?
Or, is there a better way that I am not aware of? I don't want to put this in a separate function call because of the calling overhead as this function is performance critical.
Additional:
Right now, I'm just starting to write this function so it is no where complete. There are missing loops and other such support/glue code. But, the main gist is accessing local variables/structure elements.
EDIT 1:
The syntax that I am using seems to be the only one that clang supports. I tried the following code and clang gave me all sorts of errors:
__asm__ ("pushl %%eax",
"pushl %%edx",
"pushf",
"movl (bn1->dat[i]), %%eax",
"xorl %%edx, %%edx",
"divl ($0x0c + bn2 + j)",
"movl %%eax, (q)",
"movl %%edx, (m)",
"popf",
"popl %%edx",
"popl %%eax"
);
It wants me to put a closing parenthesis on the first line, replacing the comma. I switched to using %% instead of % because I read somewhere that inline assembly requires %% to denote CPU registers, and clang was telling me that I was using an invalid escape sequence.
If you only need 32b / 32b => 32bit division, let the compiler use both outputs of div, which gcc, clang and icc all do just fine, as you can see on the Godbolt compiler explorer:
uint32_t q = bn1->dat[i] / bn2->dat[j];
uint32_t m = bn1->dat[i] % bn2->dat[j];
Compilers are quite good at CSEing that into one div. Just make sure you don't store the division result somewhere that gcc can't prove won't affect the input of the remainder.
e.g. *m = dat[i] / dat[j] might overlap (alias) dat[i] or dat[j], so gcc would have to reload the operands and redo the div for the % operation. See the godbolt link for bad/good examples.
Using inline asm for 32bit / 32bit = 32bit div doesn't gain you anything, and actually makes worse code with clang (see the godbolt link).
If you need 64bit / 32bit = 32bit, you probably need asm, though, if there isn't a compiler built-in for it. (GNU C doesn't have one, AFAICT). The obvious way in C (casting operands to uint64_t) generates a call to a 64bit/64bit = 64bit libgcc function, which has branches and multiple div instructions. gcc isn't good at proving the result will fit in 32bits, so a single div instruction don't cause a #DE.
For a lot of other instructions, you can avoid writing inline asm a lot of the time with builtin functions for things like popcount. With -mpopcnt, it compiles to the popcnt instruction (and accounts for the false-dependency on the output operand that Intel CPUs have.) Without, it compiles to a libgcc function call.
Always prefer builtins, or pure C that compiles to good asm, so the compiler knows what the code does. When inlining makes some of the arguments known at compile-time, pure C can be optimized away or simplified, but code using inline asm will just load constants into registers and do a div at run-time. Inline asm also defeats CSE between similar computations on the same data, and of course can't auto-vectorize.
Using GNU C syntax the right way
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html explains how to tell the assembler which variables you want in registers, and what the outputs are.
You can use Intel/MASM-like syntax and mnemonics, and non-% register names if you like, preferably by compiling with -masm=intel. The AT&T syntax bug (fsub and fsubr mnemonics are reversed) might still be present in intel-syntax mode; I forget.
Most software projects that use GNU C inline asm use AT&T syntax only.
See also the bottom of this answer for more GNU C inline asm info, and the x86 tag wiki.
An asm statement takes one string arg, and 3 sets of constraints. The easiest way to make it multi-line is by making each asm line a separate string ending with \n, and let the compiler implicitly concatenate them.
Also, you tell the compiler which registers you want stuff in. Then if variables are already in registers, the compiler doesn't have to spill them and have you load and store them. Doing that would really shoot yourself in the foot. The tutorial Brett Hale linked in comments hopefully covers all this.
Correct example of div with GNU C inline asm
You can see the compiler asm output for this on godbolt.
uint32_t q, m; // this is unsigned int on every compiler that supports x86 inline asm with this syntax, but not when writing portable code.
asm ("divl %[bn2dat_j]\n"
: "=a" (q), "=d" (m) // results are in eax, edx registers
: "d" (0), // zero edx for us, please
"a" (bn1->dat[i]), // "a" means EAX / RAX
[bn2dat_j] "mr" (bn2->dat[j]) // register or memory, compiler chooses which is more efficient
: // no register clobbers, and we don't read/write "memory" other than operands
);
"divl %4" would have worked too, but named inputs/outputs don't change name when you add more input/output constraints.

What ensures reads/writes of operands occurs at desired timed with extended ASM?

According to GCC's Extended ASM and Assembler Template, to keep instructions consecutive, they must be in the same ASM block. I'm having trouble understanding what provides the scheduling or timings of reads and writes to the operands in a block with multiple statements.
As an example, EBX or RBX needs to be preserved when using CPUID because, according to the ABI, the caller owns it. There are some open questions with respect to the use of EBX and RBX, so we want to preserve it unconditionally (its a requirement). So three instructions need to be encoded into a single ASM block to ensure the consecutive-ness of the instructions (re: the assembler template discussed in the first paragraph):
unsigned int __FUNC = 1, __SUBFUNC = 0;
unsigned int __EAX, __EBX, __ECX, __EDX;
__asm__ __volatile__ (
"push %ebx;"
"cpuid;"
"pop %ebx"
: "=a"(__EAX), "=b"(__EBX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC)
);
If the expression representing the operands is interpreted at the wrong point in time, then __EBX will be the saved EBX (and not the CPUID's EBX), which will likely be a pointer to the Global Offset Table (GOT) if PIC is enabled.
Where, exactly, does the expression specify that the store of CPUID's %EBX into __EBX should happen (1) after the PUSH %EBX; (2) after the CPUID; but (3) before the POP %EBX?
In your question you present some code that does a push and pop of ebx. The idea of saving ebx in the event that you compile with gcc using -fPIC (position independent code) is correct. It is up to our function not to clobber ebx upon return in that situation. Unfortunately the way you have defined the constraints you explicitly use ebx. Generally the compiler will warn you (error: inconsistent operand constraints in an 'asm') if you are using PIC code and you specify =b as an output constraint. Why it doesn't produce a warning for you is unusual.
To get around this problem you can let the assembler template choose a register for you. Instead of pushing and popping we simply exchange %ebx with an unused register chosen by the compiler and restore it by exchanging it back after. Since we don't wish to have the compiler clobber our input registers during the exchange we specify early clobber modifier, thus ending up with a constraint of =&r (instead of =b in the OPs code). More on modifiers can be found here. Your code (for 32 bit) would look something like:
unsigned int __FUNC = 1, __SUBFUNC = 0;
unsigned int __EAX, __EBX, __ECX, __EDX;
__asm__ __volatile__ (
"xchgl\t%%ebx, %k1\n\t" \
"cpuid\n\t" \
"xchgl\t%%ebx, %k1\n\t"
: "=a"(__EAX), "=&r"(__EBX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC));
If you intend to compile for X86_64 (64 bit) you'll need to save the entire contents of %rbx. The code above will not quite work. You'd have to use something like:
uint32_t __FUNC = 1, __SUBFUNC = 0;
uint32_t __EAX, __ECX, __EDX;
uint64_t __BX; /* Big enough to hold a 64 bit value */
__asm__ __volatile__ (
"xchgq\t%%rbx, %q1\n\t" \
"cpuid\n\t" \
"xchgq\t%%rbx, %q1\n\t"
: "=a"(__EAX), "=&r"(__BX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC));
You could code this up using conditional compilation to deal with both X86_64 and i386:
uint32_t __FUNC = 1, __SUBFUNC = 0;
uint32_t __EAX, __ECX, __EDX;
uint64_t __BX; /* Big enough to hold a 64 bit value */
#if defined(__i386__)
__asm__ __volatile__ (
"xchgl\t%%ebx, %k1\n\t" \
"cpuid\n\t" \
"xchgl\t%%ebx, %k1\n\t"
: "=a"(__EAX), "=&r"(__BX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC));
#elif defined(__x86_64__)
__asm__ __volatile__ (
"xchgq\t%%rbx, %q1\n\t" \
"cpuid\n\t" \
"xchgq\t%%rbx, %q1\n\t"
: "=a"(__EAX), "=&r"(__BX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC));
#else
#error "Unknown architecture."
#endif
GCC has a __cpuid macro defined in cpuid.h. It defined the macro so that it only saves the ebx and rbx register when required. You can find the GCC 4.8.1 macro definition here to get an idea of how they handle cpuid in cpuid.h.
The astute reader may ask the question - what stops the compiler from choosing ebx or rbx as the scratch register to use for the exchange. The compiler knows about ebx and rbx in the context of PIC, and will not allow it to be used as a scratch register. This is based on my personal observations over the years and reviewing the assembler (.s) files generated from C code. I can't say for certain how more ancient versions of gcc handled it so it could be a problem.
I think you understand, but to be clear, the "consecutive" rule means that this:
asm ("a");
asm ("b");
asm ("c");
... might get other instructions interposed, so if that's not desirable then it must be rewritten like this:
asm ("a\n"
"b\n"
"c");
... and now it will be inserted as a whole.
As for the cpuid snippet, we have two problems:
The cpuid instruction will overwrite ebx, and hence clobber the data that PIC code must keep there.
We want to extract the value that cpuid places in ebx while never returning to compiled code with the "wrong" ebx value.
One possible solution would be this:
unsigned int __FUNC = 1, __SUBFUNC = 0;
unsigned int __EAX, __EBX, __ECX, __EDX;
__asm__ __volatile__ (
"push %ebx;"
"cpuid;"
"mov %ebx, %ecx"
"pop %ebx"
: "=c"(__EBX)
: "a"(__FUNC), "c"(__SUBFUNC)
: "eax", "edx"
);
__asm__ __volatile__ (
"push %ebx;"
"cpuid;"
"pop %ebx"
: "=a"(__EAX), "=c"(__ECX), "=d"(__EDX)
: "a"(__FUNC), "c"(__SUBFUNC)
);
There's no need to mark ebx as clobbered as you're putting it back how you found it.
(I don't do much Intel programming, so I may have some of the assembler-specific details off there, but this is how asm works.)

Switching between Intel and ATT mode in GCC

So I have this inline assembly code along with my C code, and I want to use intel syntax for this particular call to asm(), however I need to switch back to ATT syntax or else it will give a long list of errors.
asm(".intel_syntax prefix");
asm volatile (
"add %0, $1 \n\t"
: "=r" (dst)
: "r" (src));
asm(".att_syntax prefix");
Now it gives the following error
/tmp/ccDNa2Wk.s: Assembler messages:
/tmp/ccDNa2Wk.s:180: Error: no such instruction: `movl -16(%ebp),%eax'
/tmp/ccDNa2Wk.s:187: Error: no such instruction: `movl %eax,-12(%ebp)'
I dont understand how to fix the error, i have no call to movl in any part of my code.
Since you haven't yet accepted an answer (<hint><hint>), let me add a third thought:
1) Instead of having 3 asm statements, do it in 1:
asm(".intel_syntax prefix\n\t"
"add %0, 1 \n\t"
".att_syntax prefix"
: "=r" (dst)
: "r" (src));
2) Change your compile options to include -masm=intel and omit the 2 syntax statements.
3) It is possible to support both intel and att at the same time. This way your code works whatever value is passed for -masm:
asm("{addl $1, %0 | add %0, 1}"
: "=r" (dst)
: "r" (src));
I should also mention that your asm may not work as expected. Since you are updating the contents of dst (instead of overwriting it), you probably want to use "+r" instead of "=r". And you do realize that this code doesn't actually use src, right?
Oh, and your original asm is NOT intel format (the $1 is the give-away).
I would try to do the following tests:
In some C code not containing inline assembler insert the line
asm(".att_syntax prefix");
in multiple different locations. Then compile the C code to object files and disassemble these object files (compiling to assembler won't work for this test).
Then compare the disassembly of the original code with the disassembly of the code containing the ".att_syntax" lines.
If the line ".att_syntax prefix" indeed is the correct line for switching back to AT&T mode the disassemblies must be equal AND compiling must work without any errors.
In the next step take your code and compile to assembler instead of object code ("-S" option of GCC). Then you can look at the assembler code.
My idea is the following one:
If you use data exchange in inline assembler ("=r" and "r" for example) GCC needs to insert code that is doing the data exchange:
asm(".intel_syntax prefix");
// GCC inserts code moving "src" to "%0" here
asm volatile (
"add %0, $1 \n\t"
: "=r" (dst)
: "r" (src));
// GCC inserts code moving "%0" to "dst" here
asm(".att_syntax prefix");
This code inserted by GCC is of course in AT&T syntax.
If you want to use Intel syntax in inline assembly you have to use the ".att_syntax" and ".intel_syntax" instructions in the same inline assembly block, just like this:
// GCC inserts code moving "src" to "%0" here
asm volatile (
".intel_syntax prefix \n\t"
"add %0, $1 \n\t"
".att_syntax prefix \n\t"
: "=r" (dst)
: "r" (src));
// GCC inserts code moving "%0" to "dst" here

Resources