Consider the following code:
int bn_div(bn_t *bn1, bn_t *bn2, bn_t *bnr)
{
uint32 q, m; /* Division Result */
uint32 i; /* Loop Counter */
uint32 j; /* Loop Counter */
/* Check Input */
if (bn1 == NULL) return(EFAULT);
if (bn1->dat == NULL) return(EFAULT);
if (bn2 == NULL) return(EFAULT);
if (bn2->dat == NULL) return(EFAULT);
if (bnr == NULL) return(EFAULT);
if (bnr->dat == NULL) return(EFAULT);
#if defined(__i386__) || defined(__amd64__)
__asm__ (".intel_syntax noprefix");
__asm__ ("pushl %eax");
__asm__ ("pushl %edx");
__asm__ ("pushf");
__asm__ ("movl %eax, (bn1->dat[i])");
__asm__ ("xorl %edx, %edx");
__asm__ ("divl (bn2->dat[j])");
__asm__ ("movl (q), %eax");
__asm__ ("movl (m), %edx");
__asm__ ("popf");
__asm__ ("popl %edx");
__asm__ ("popl %eax");
#else
q = bn1->dat[i] / bn2->dat[j];
m = bn1->dat[i] % bn2->dat[j];
#endif
/* Return */
return(0);
}
The data type uint32 is basically an unsigned long int, i.e. a uint32_t unsigned 32-bit integer. The type bnint is either an unsigned short int (uint16_t) or a uint32_t, depending on whether 64-bit data types are available. If 64-bit is available, then bnint is a uint32; otherwise it's a uint16. This was done in order to capture carry/overflow in other parts of the code. The structure bn_t is defined as follows:
typedef struct bn_data_t bn_t;
struct bn_data_t
{
uint32 sz1; /* Bit Size */
uint32 sz8; /* Byte Size */
uint32 szw; /* Word Count */
bnint *dat; /* Data Array */
uint32 flags; /* Operational Flags */
};
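For concreteness, the typedefs described above might look like this (a sketch; HAVE_64BIT is a hypothetical stand-in for however the build actually detects 64-bit support):

#include <stdint.h>

typedef uint32_t uint32;

#if defined(HAVE_64BIT)     /* hypothetical config macro */
typedef uint32_t bnint;     /* 64-bit temporaries can capture the carry */
#else
typedef uint16_t bnint;     /* carries from 16-bit limbs fit in a uint32 */
#endif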
The function starts on line 300 in my source code. So when I try to compile/make it, I get the following errors:
system:/home/user/c/m3/bn 1036 $$$ ->make
clang -I. -I/home/user/c/m3/bn/.. -I/home/user/c/m3/bn/../include -std=c99 -pedantic -Wall -Wextra -Wshadow -Wpointer-arith -Wcast-align -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Wwrite-strings -Wfloat-equal -Winline -Wunknown-pragmas -Wundef -Wendif-labels -c /home/user/c/m3/bn/bn.c
/home/user/c/m3/bn/bn.c:302:12: warning: unused variable 'q' [-Wunused-variable]
uint32 q, m; /* Division Result */
^
/home/user/c/m3/bn/bn.c:302:15: warning: unused variable 'm' [-Wunused-variable]
uint32 q, m; /* Division Result */
^
/home/user/c/m3/bn/bn.c:303:12: warning: unused variable 'i' [-Wunused-variable]
uint32 i; /* Loop Counter */
^
/home/user/c/m3/bn/bn.c:304:12: warning: unused variable 'j' [-Wunused-variable]
uint32 j; /* Loop Counter */
^
/home/user/c/m3/bn/bn.c:320:14: error: unknown token in expression
__asm__ ("movl %eax, (bn1->dat[i])");
^
<inline asm>:1:18: note: instantiated into assembly here
movl %eax, (bn1->dat[i])
^
/home/user/c/m3/bn/bn.c:322:14: error: unknown token in expression
__asm__ ("divl (bn2->dat[j])");
^
<inline asm>:1:12: note: instantiated into assembly here
divl (bn2->dat[j])
^
4 warnings and 2 errors generated.
*** [bn.o] Error code 1
Stop in /home/user/c/m3/bn.
system:/home/user/c/m3/bn 1037 $$$ ->
What I know:
I consider myself to be fairly well versed in x86 assembler (as evidenced by the code that I wrote above). However, the last time that I mixed a high-level language and assembler was using Borland Pascal about 15-20 years ago, when writing graphics drivers for games (pre-Windows 95 era). My familiarity is with Intel syntax.
What I don't know:
How do I access members of bn_t (especially *dat) from asm? Since *dat is a pointer to uint32, I am accessing the elements as an array (e.g. bn1->dat[i]).
How do I access local variables that are declared on the stack?
I am using push/pop to restore clobbered registers to their previous values so as to not upset the compiler. However, do I also need the volatile keyword on the local variables?
Or, is there a better way that I am not aware of? I don't want to put this in a separate function because of the call overhead, as this function is performance critical.
Additional:
Right now, I'm just starting to write this function, so it is nowhere near complete. There are missing loops and other such support/glue code. But the main gist is accessing local variables/structure elements.
EDIT 1:
The syntax that I am using seems to be the only one that clang supports. I tried the following code and clang gave me all sorts of errors:
__asm__ ("pushl %%eax",
"pushl %%edx",
"pushf",
"movl (bn1->dat[i]), %%eax",
"xorl %%edx, %%edx",
"divl ($0x0c + bn2 + j)",
"movl %%eax, (q)",
"movl %%edx, (m)",
"popf",
"popl %%edx",
"popl %%eax"
);
It wants me to put a closing parenthesis on the first line, replacing the comma. I switched to using %% instead of % because I read somewhere that inline assembly requires %% to denote CPU registers, and clang was telling me that I was using an invalid escape sequence.
If you only need 32b / 32b => 32bit division, let the compiler use both outputs of div, which gcc, clang and icc all do just fine, as you can see on the Godbolt compiler explorer:
uint32_t q = bn1->dat[i] / bn2->dat[j];
uint32_t m = bn1->dat[i] % bn2->dat[j];
Compilers are quite good at CSEing that into one div. Just make sure you don't store the quotient anywhere that gcc can't prove is non-overlapping with the inputs of the remainder calculation.
e.g. *m = dat[i] / dat[j] might overlap (alias) dat[i] or dat[j], so gcc would have to reload the operands and redo the div for the % operation. See the godbolt link for bad/good examples.
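In code form, the bad/good pattern looks something like this (hypothetical helpers, my sketch):

#include <stddef.h>
#include <stdint.h>

/* Bad: the store through qp may alias dat[i] or dat[j], so the
   compiler must reload the operands and run a second div for %. */
void divmod_bad(uint32_t *dat, size_t i, size_t j,
                uint32_t *qp, uint32_t *mp)
{
    *qp = dat[i] / dat[j];
    *mp = dat[i] % dat[j];
}

/* Good: compute both results into locals first, so the / and %
   CSE into a single div, then store. */
void divmod_good(uint32_t *dat, size_t i, size_t j,
                 uint32_t *qp, uint32_t *mp)
{
    uint32_t q = dat[i] / dat[j];
    uint32_t m = dat[i] % dat[j];
    *qp = q;
    *mp = m;
}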
Using inline asm for 32bit / 32bit = 32bit div doesn't gain you anything, and actually makes worse code with clang (see the godbolt link).
If you need 64bit / 32bit = 32bit, you probably do need asm, if there isn't a compiler built-in for it (GNU C doesn't have one, AFAICT). The obvious way in C (casting the operands to uint64_t) generates a call to a 64bit/64bit = 64bit libgcc function, which has branches and multiple div instructions. gcc isn't good at proving that the result will fit in 32 bits, so that a single div instruction couldn't raise #DE.
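If you do go that route, a sketch of such a helper might look like this (my example, AT&T syntax; the caller must guarantee the quotient fits in 32 bits, or divl will fault with #DE):

#include <stdint.h>

static inline uint32_t div64_32(uint64_t dividend, uint32_t divisor,
                                uint32_t *remainder)
{
    uint32_t quot, rem;
    asm ("divl %4"
         : "=a" (quot), "=d" (rem)              /* results: eax, edx */
         : "a" ((uint32_t)dividend),            /* low half in eax */
           "d" ((uint32_t)(dividend >> 32)),    /* high half in edx */
           "rm" (divisor));
    *remainder = rem;
    return quot;
}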
For many other instructions, you can avoid writing inline asm by using builtin functions, e.g. __builtin_popcountll for popcount. With -mpopcnt, it compiles to the popcnt instruction (and accounts for the false dependency on the output operand that Intel CPUs have). Without it, it compiles to a libgcc function call.
Always prefer builtins, or pure C that compiles to good asm, so the compiler knows what the code does. When inlining makes some of the arguments known at compile-time, pure C can be optimized away or simplified, but code using inline asm will just load constants into registers and do a div at run-time. Inline asm also defeats CSE between similar computations on the same data, and of course can't auto-vectorize.
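For example, a sketch of the popcount case mentioned above:

#include <stdint.h>

/* With -mpopcnt this compiles to a single popcnt instruction;
   without it, to a libgcc helper call. No inline asm needed. */
static inline int count_set_bits(uint64_t x)
{
    return __builtin_popcountll(x);
}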
Using GNU C syntax the right way
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html explains how to tell the assembler which variables you want in registers, and what the outputs are.
You can use Intel/MASM-like syntax and mnemonics, and non-% register names if you like, preferably by compiling with -masm=intel. The AT&T syntax bug (fsub and fsubr mnemonics are reversed) might still be present in intel-syntax mode; I forget.
Most software projects that use GNU C inline asm use AT&T syntax only.
See also the bottom of this answer for more GNU C inline asm info, and the x86 tag wiki.
An asm statement takes one template-string argument and three colon-separated lists of constraints: outputs, inputs, and clobbers. The easiest way to make it multi-line is to make each asm line a separate string ending with \n and let the compiler implicitly concatenate them.
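For example, a minimal multi-line statement might look like this (a sketch, not tied to the question's code):

#include <stdint.h>

static inline uint32_t add_one(uint32_t x)
{
    uint32_t y;
    asm ("movl %1, %0\n\t"     /* adjacent string literals are */
         "addl $1, %0"         /* concatenated by the compiler */
         : "=r" (y)            /* outputs */
         : "r" (x)             /* inputs */
         : /* no clobbers */);
    return y;
}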
Also, you tell the compiler which registers you want things in. Then, if variables are already in registers, the compiler doesn't have to spill them just so you can load and store them yourself (doing that would really be shooting yourself in the foot). The tutorial Brett Hale linked in comments hopefully covers all this.
Correct example of div with GNU C inline asm
You can see the compiler asm output for this on godbolt.
uint32_t q, m; // this is unsigned int on every compiler that supports x86 inline asm with this syntax, but not when writing portable code.
asm ("divl %[bn2dat_j]\n"
: "=a" (q), "=d" (m) // results are in eax, edx registers
: "d" (0), // zero edx for us, please
"a" (bn1->dat[i]), // "a" means EAX / RAX
[bn2dat_j] "mr" (bn2->dat[j]) // register or memory, compiler chooses which is more efficient
: // no register clobbers, and we don't read/write "memory" other than operands
);
"divl %4" would have worked too, but named inputs/outputs don't change name when you add more input/output constraints.
Related
I am using the bsf x86-64 instruction described on page 210 of Intel's developer manual found here. Essentially, if a least-significant 1 bit is found, its bit index is stored in the destination operand.
Furthermore, the ZF flag is set to 1 if the source operand is all zeros; otherwise, the ZF flag is cleared.
I am compiling my C code with inline x86-64 assembly instructions. I have defined a C function which invokes the bsf instruction:
uint64_t bitScanForward(T_bitboard b) {
__asm__(
"bsf %rcx,%rax\n"
"leave\n"
"ret\n"
);
}
and also another C function which checks if the status of the ZF bit in the flag register:
uint64_t isZFSet() {
printf("\n"); <- This is another problem I am having (see below)...
__asm__(
"jz true\n"
"movq $0,%rax\n"//return false
"jmp end\n"
"true:\n"
"movq $1,%rax\n"//return true
"end:\n"
"leave\n"
"ret\n"
);
}
I have tested these and found that the ZF flag is always cleared, even when the bsf command is applied to the number zero, seemingly going against the specification.
//Calling function...
//Do stuff...
bitScanForward(0ULL);//ULL is 64 bit on my machine
if(isZFSet()){//ZF flag *should* be set here but its not
printf("ZF flag is set\n");
}
//More stuff...
I suspect the ZF flag is being clobbered somewhere between leaving one set of inline instructions and entering the next.
How can I ensure that the flag in the above code is set as specified in the documentation? (I don't want to change much of my code or design)
My "other problem" is that if I dont include the printf statement in the isZFFlagSet, the function seemingly doesnt execute. Totally bizarre. Can anyone explain why?
You are treating an aggressively optimizing C compiler as if it were a macro assembler. That just plain isn't going to work. To get GCC to emit correct code in the presence of assembly inserts, you have to annotate the inserts with complete information about the registers and memory regions that are affected by the assembly code, and you have to use ancillary C statements to mesh them with the surrounding code. Even then, there are things the assembly insert cannot do at all. I urge you to scrap this entire mess and instead use the __builtin_ctzll intrinsic, as suggested in the comments on the question.
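For reference, a sketch of that intrinsic route (returning 64 for an all-zero input, matching the cmove fallback further down):

#include <stdint.h>

/* __builtin_ctzll is undefined for an input of 0, so handle
   that case in plain C. */
static inline uint64_t bitScanForward(uint64_t b)
{
    return b ? (uint64_t)__builtin_ctzll(b) : 64;
}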
Now, to specifics. Your first function is incorrect because GCC does not support use of leave or ret inside an assembly insert. (More generally, assembly inserts may not alter the stack pointer, and may only jump to designated labels within the same function.) The correct way to use bsf from a GCC-style assembly insert is with "extended asm" with input and output operands:
uint64_t bitScanForward(uint64_t b) {
uint64_t ret;
asm ("bsf %1, %0" : "=r" (ret) : "r" (b));
return ret;
}
You must declare a C variable to receive the output of the operation, and explicitly return that variable; having bsf write to %rax and falling off the end of the function would not work (unlike old MSVC-style inline asm). BSF accepts any two registers as operands, so there is no need to use constraints more specific than r.
Your second function is incorrect because you didn't tell GCC that the condition codes were meaningful after bitScanForward, and because GCC does not support using the condition-code register as an input to an assembly insert. In order to read the ZF output from bsf you must do so within the same assembly insert that invoked bsf:
uint64_t countTrailingZeroes(uint64_t b) {
uint64_t ret;
asm ("bsf %1, %0\n\t"
"cmove %2, %0"
: "=&r" (ret)
: "r" (b), "rm" (64));
return ret;
}
This requires special care -- see how the constraint on operand 0 is now =&r instead of just =r? Without that, GCC is liable to think it can put operand 2 in the same register as operand 0.
Alternatively, you can specify that ZF is an output, which is supported (see the "flag output operands" section of the manual) and then supply a default value from C:
uint64_t countTrailingZeroes(uint64_t b) {
uint64_t ret;
int zf;
asm ("bsf %2, %0"
: "=r" (ret), "=#ccz" (zf) : "r" (b));
if (zf) ret = 64;
return ret;
}
AVX512 introduced the opmask feature for its arithmetic instructions. A simple example: godbolt.org.
#include <immintrin.h>
__m512i add(__m512i a, __m512i b) {
__m512i sum;
asm(
"mov ebx, 0xAAAAAAAA; \n\t"
"kmovw k1, ebx; \n\t"
"vpaddd %[SUM] %{k1%}%{z%}, %[A], %[B]; # conditional add "
: [SUM] "=v"(sum)
: [A] "v" (a),
[B] "v" (b)
: "ebx", "k1" // clobbers
);
return sum;
}
-march=skylake-avx512 -masm=intel -O3
mov ebx,0xaaaaaaaa
kmovw k1,ebx
vpaddd zmm0{k1}{z},zmm0,zmm1
The problem is that k1 has to be specified.
Is there an input constraint like "r" for integers except that it picks a k register instead of a general-purpose register?
__mmask16 is literally a typedef for unsigned short (and other mask types for other plain integer types), so we just need a constraint for passing it in a k register.
We have to go digging in the gcc sources config/i386/constraints.md to find it:
The constraint for any mask register is "k". Or use "Yk" for k1..k7 (which can be used as a predicate, unlike k0). You'd use an "=k" operand as the destination for a compare-into-mask, for example.
Obviously you can use "=Yk"(tmp) with a __mmask16 tmp to get the compiler to do register allocation for you, instead of just declaring clobbers on whichever "k" registers you decide to use.
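As a sketch (my example, in AT&T syntax; compile with -mavx512f or -march=skylake-avx512) of "=Yk" as a compare-into-mask destination:

#include <immintrin.h>

/* Let GCC pick one of k1..k7 for the compare result. */
__mmask16 equal_lanes(__m512i a, __m512i b)
{
    __mmask16 k;
    asm ("vpcmpeqd %2, %1, %0"     /* AT&T operand order */
         : "=Yk" (k)
         : "v" (a), "v" (b));
    return k;
}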
Prefer intrinsics like _mm512_maskz_add_epi32
First of all, https://gcc.gnu.org/wiki/DontUseInlineAsm: avoid inline asm if you can. Understanding asm is great, but use that to read compiler output and/or figure out what would be optimal, then write intrinsics that can compile the way you want. Performance-tuning info like https://agner.org/optimize/ and https://uops.info/ lists things by asm mnemonic, and mnemonics are shorter / easier to remember than intrinsics, but you can search by mnemonic to find intrinsics on https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Intrinsics will also let the compiler fold loads into memory source operands for other instructions; with AVX512 those can even be broadcast loads! Your inline asm forces the compiler to use a separate load instruction. Even a "vm" input won't let the compiler pick a broadcast-load as the memory source, because it wouldn't know the broadcast element width of the instruction(s) you were using it with.
Use _mm512_mask_add_epi32 or _mm512_maskz_add_epi32 especially if you're already using __m512i types from <immintrin.h>.
Also, your asm has a bug: you're using {k1} merge-masking, not {k1}{z} zero-masking, but you used an uninitialized __m512i sum; with an output-only "=v" constraint as the merge destination! As a stand-alone function, it happens to merge into a because the calling convention has ZMM0 = first input = return-value register. But when inlining into other functions, you definitely can't assume that sum will pick the same register as a. Your best bet is to use a read/write operand "+v"(a) and use it as both the destination and the first source.
Merge-masking only makes sense with a "+v" read/write operand. (Or in an asm statement with multiple instructions where you've already written an output once, and want to merge another result into it.)
Intrinsics would stop you from making this mistake; the merge-masking version has an extra input for the merge-target. (The asm destination operand).
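For comparison, a sketch of both intrinsic flavors (compile with -mavx512f or -march=skylake-avx512):

#include <immintrin.h>

/* Zero-masking: lanes with a 0 mask bit become 0. */
__m512i add_maskz(__m512i a, __m512i b)
{
    return _mm512_maskz_add_epi32((__mmask16)0xAAAA, a, b);
}

/* Merge-masking: lanes with a 0 mask bit are taken from src,
   which is exactly the extra input the asm version was missing. */
__m512i add_merge(__m512i src, __m512i a, __m512i b)
{
    return _mm512_mask_add_epi32(src, (__mmask16)0xAAAA, a, b);
}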
Example using "Yk"
// works with -march=skylake-avx512 or -march=knl
// or just -mavx512f but don't do that.
// also needed: -masm=intel
#include <immintrin.h>
__m512i add_zmask(__m512i a, __m512i b) {
__m512i sum;
asm(
"vpaddd %[SUM] %{%[mask]%}%{z%}, %[A], %[B]; # conditional add "
: [SUM] "=v"(sum)
: [A] "v" (a),
[B] "v" (b),
[mask] "Yk" ((__mmask16)0xAAAA)
// no clobbers needed, unlike your question which I fixed with an edit
);
return sum;
}
Note that all the { and } are escaped with % (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Special-format-strings), so they're not parsed as dialect-alternatives {AT&T | Intel-syntax}.
This compiles with gcc as early as 4.9, but don't actually do that because it doesn't understand -march=skylake-avx512, or even have tuning settings for Skylake or KNL. Use a more recent GCC that knows about your CPU for best results.
Godbolt compiler explorer:
# gcc8.3 -O3 -march=skylake-avx512 or -march=knl (and -masm=intel)
add(long long __vector, long long __vector):
mov eax, -21846
kmovw k1, eax # compiler-generated
# inline asm starts
vpaddd zmm0 {k1}{z}, zmm0, zmm1; # conditional add
# inline asm ends
ret
-mavx512bw (implied by -march=skylake-avx512 but not knl) is required for "Yk" to work on an int. If you're compiling with -march=knl, integer literals need a cast to __mmask16 or __mmask8, because unsigned int = __mmask32 isn't available for masks.
[mask] "Yk" (0xAAAA) requires AVX512BW even though the constant does fit in 16 bits, just because bare integer literals always have type int. (vpaddd zmm has 16 elements per vector, so I shortened your constant to 16-bit.) With AVX512BW, you can pass wider constants or leave out the cast for narrow ones.
gcc6 and later support -march=skylake-avx512. Use that to set tuning as well as enabling everything. Preferably gcc8 or at least gcc7. Newer compilers generate less clunky code with new ISA extensions like AVX512 if you're ever using it outside of inline asm.
gcc5 supports -mavx512f -mavx512bw but doesn't know about Skylake.
gcc4.9 doesn't support -mavx512bw.
"Yk" is unfortunately not yet documented in https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html.
I knew where to look in the GCC source thanks to Ross's answer on In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?
While it is undocumented, looking here we see:
(define_register_constraint "Yk" "TARGET_AVX512F ? MASK_REGS :
NO_REGS" "#internal Any mask register that can be used as predicate,
i.e. k1-k7.")
Editing your godbolt to this:
asm(
"vpaddd %[SUM] %{%[k]}, %[A], %[B]"
: [SUM] "=v"(sum)
: [A] "v" (a), [B] "v" (b), [k] "Yk" (0xaaaaaaaa) );
seems to produce the correct output.
That said, I usually try to discourage people from using inline asm (and undocumented features). Can you use _mm512_mask_add_epi32?
I am currently working on my x86 OS. I tried implementing the inb function from here and it gives me Error: Operand type mismatch for `in'.
The same may also happen with outb or io_wait.
I am using Intel syntax (-masm=intel) and I don't know what to do.
Code:
#include <stdint.h>
#include "ioaccess.h"
uint8_t inb(uint16_t port)
{
uint8_t ret;
asm volatile ( "inb %1, %0"
: "=a"(ret)
: "Nd"(port) );
return ret;
}
With AT&T syntax this does work.
For outb I'm having a different problem after reversing the operands:
void io_wait(void)
{
asm volatile ( "outb $0x80, %0" : : "a"(0) );
}
Error: operand size mismatch for `out'
If you need to use -masm=intel you will need to ensure that your inline assembly is in Intel syntax. Intel syntax is dst, src (AT&T syntax is the reverse). This somewhat related answer has some useful information on some differences between NASM's Intel variant1 (not GAS's variant) and AT&T syntax:
Information on how you can go about translating NASM Intel syntax to GAS's AT&T syntax can be found in this Stackoverflow Answer, and a lot of useful information is provided in this IBM article.
[snip]
In general the biggest differences are:
With AT&T syntax the source is on the left and destination is on the right and Intel is the reverse.
With AT&T syntax register names are prepended with a %
With AT&T syntax immediate values are prepended with a $
Memory operands are probably the biggest difference. NASM uses [segment:disp+base+index*scale] instead of GAS's syntax of segment:disp(base, index, scale).
The problem in your code is that source and destination operands have to be reversed from the original AT&T syntax you were working with. This code:
asm volatile ( "inb %1, %0"
: "=a"(ret)
: "Nd"(port) );
Needs to be:
asm volatile ( "inb %0, %1"
: "=a"(ret)
: "Nd"(port) );
Regarding your update: the problem is that in Intel syntax immediate values are not prepended with a $. This line is a problem:
asm volatile ( "outb $0x80, %0" : : "a"(0) );
It should be:
asm volatile ( "outb 0x80, %0" : : "a"(0) );
If you had a proper outb function you could do something like this instead:
#include <stdint.h>
#include "ioaccess.h"
uint8_t inb(uint16_t port)
{
uint8_t ret;
asm volatile ( "inb %0, %1"
: "=a"(ret)
: "Nd"(port) );
return ret;
}
void outb(uint16_t port, uint8_t byte)
{
asm volatile ( "outb %1, %0"
:
: "a"(byte),
"Nd"(port) );
}
void io_wait(void)
{
outb (0x80, 0);
}
A slightly more complex version that supports both the AT&T and Intel dialects:
Multiple assembler dialects in asm templates

On targets such as x86, GCC supports multiple assembler dialects. The -masm option controls which dialect GCC uses as its default for inline assembler. The target-specific documentation for the -masm option contains the list of supported dialects, as well as the default dialect if the option is not specified. This information may be important to understand, since assembler code that works correctly when compiled using one dialect will likely fail if compiled using another. See x86 Options.

If your code needs to support multiple assembler dialects (for example, if you are writing public headers that need to support a variety of compilation options), use constructs of this form:
{ dialect0 | dialect1 | dialect2... }
On x86 and x86-64 targets there are two dialects. Dialect0 is AT&T syntax and Dialect1 is Intel syntax. The functions could be reworked this way:
#include <stdint.h>
#include "ioaccess.h"
uint8_t inb(uint16_t port)
{
uint8_t ret;
asm volatile ( "inb {%[port], %[retreg] | %[retreg], %[port]}"
: [retreg]"=a"(ret)
: [port]"Nd"(port) );
return ret;
}
void outb(uint16_t port, uint8_t byte)
{
asm volatile ( "outb {%[byte], %[port] | %[port], %[byte]}"
:
: [byte]"a"(byte),
[port]"Nd"(port) );
}
void io_wait(void)
{
outb (0x80, 0);
}
I have also given the constraints symbolic names rather than using %0 and %1, to make the inline assembly easier to read and maintain. From the GCC documentation, each constraint has the form:
[ [asmSymbolicName] ] constraint (cvariablename)
Where:
asmSymbolicName
Specifies a symbolic name for the operand. Reference the name in the assembler template by enclosing it in square brackets (i.e. ‘%[Value]’). The scope of the name is the asm statement that contains the definition. Any valid C variable name is acceptable, including names already defined in the surrounding code. No two operands within the same asm statement can use the same symbolic name.
When not using an asmSymbolicName, use the (zero-based) position of the operand in the list of operands in the assembler template. For example if there are three output operands, use ‘%0’ in the template to refer to the first, ‘%1’ for the second, and ‘%2’ for the third.
This version should work2 whether you compile with -masm=intel or -masm=att options
Footnotes
1Although NASM Intel dialect and GAS's (GNU Assembler) Intel syntax are similar there are some differences. One is that NASM Intel syntax uses [segment:disp+base+index*scale] where a segment can be specified inside the [] and GAS's Intel syntax requires the segment outside with segment:[disp+base+index*scale].
2Although the code will work, you should place all these basic functions in the ioaccess.h file directly and eliminate them from the .c file that contains them. Because you placed these basic functions in a separate .c file (external linkage) the compiler can't optimize them as well as it could. You can modify the functions to be of type static inline and place them in the header directly. The compiler will then have the ability to optimize the code by removing function calling overhead and reduce the need for extra loads and stores. You will want to compile with optimizations higher than -O0. Consider -O2 or -O3.
Special Notes Regarding OS Development:
There are many toy OSes (examples, tutorials, and even code on the OSDev Wiki) that do not work with optimizations on. Many failures are due to bad/poor inline assembly or reliance on undefined behaviour. Inline assembly should be used as a last resort. If your kernel doesn't run with optimizations on, it is likely not a bug in the compiler (possible, but not likely).
Heed the advice in #PeterCordes answer regarding port access that may trigger DMA reads.
It's possible to write code that works with or without -masm=intel, using dialect alternatives for GNU C inline asm: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html (This is a good idea for headers that other people might include.)
It works like "{at&t stuff | intel stuff}": the compiler picks which side of the | to keep based on the current mode.
The major difference between AT&T vs. Intel syntax is that the operand-list is reversed, so usually you have something like "inb {%1,%0 | %0,%1}".
This is a version of #MichaelPetch's nice functions using dialect alternatives:
// make this a header: these single instructions can inline more cheaply
// than setting up args for a function call
#include <stdint.h>
static inline
uint8_t inb(uint16_t port)
{
uint8_t ret;
asm volatile ( "inb {%1, %0 | %0, %1}"
: "=a"(ret)
: "Nd"(port) );
return ret;
}
static inline
void outb(uint16_t port, uint8_t byte)
{
asm volatile ( "outb {%1, %0 | %0, %1}"
:
: "a"(byte),
"Nd"(port) );
}
static inline
void io_wait(void) {
outb (0x80, 0);
}
The Linux/Glibc sys/io.h macros sometimes use %w1 to expand a constraint to the 16-bit register name, but using types of the right size also works.
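For example, a sketch of that pattern (AT&T operand order here):

#include <stdint.h>

/* %b0 forces the 8-bit register name (al) and %w1 the 16-bit
   name (dx), independent of the C operand types. */
static inline void outb_sized(uint16_t port, uint8_t byte)
{
    asm volatile ("outb %b0, %w1" : : "a"(byte), "Nd"(port));
}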
If you want a memory-barrier version of these to take advantage of the fact that in / out are (more or less) serializing like a locked instruction or mfence, add a "memory" clobber to stop compile-time reordering of memory access across it.
If a port I/O can trigger a DMA read of some other memory that you wrote recently, you might also need a "memory" clobber for that. (x86 has cache-coherent DMA so you wouldn't have had to explicitly flush it, but you can't let the compiler reorder it after an outb, or even optimize away an apparently dead store.)
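A sketch of that barrier variant, built on the outb above:

#include <stdint.h>

static inline void outb_barrier(uint16_t port, uint8_t byte)
{
    /* The "memory" clobber stops the compiler from reordering or
       eliminating memory accesses across the port write. */
    asm volatile ( "outb {%0, %1 | %1, %0}"
                   :
                   : "a"(byte),
                     "Nd"(port)
                   : "memory" );
}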
There's no support in GAS for saving the old mode, so using .intel_syntax noprefix inside your inline asm leaves you with no way to know whether to switch back to .att_syntax or not.
But that wouldn't usually be sufficient anyway: you need to get the compiler to format operands in ways that match the syntax mode when filling in a template. e.g. the port number needs to expand to $imm or %dx (AT&T1) vs. dx or imm without the $ prefix.
Or for a memory operand, [rdi + rax*4 + 8] or 8(%rdi, %rax, 4).
But you still need to take care of reversing the operand list with { | } yourself; the compiler doesn't try to do that for you. It's purely a text-substitution into the template according to simple rules.
Footnote 1: AT&T disassembly by objdump -d bizarrely uses (%dx) for the port number in the non-immediate form, but GAS accepts %dx or (%dx) on input, so an "Nd" constraint is usable, simply expanding to the bare register name.
I was looking at the documentation on the Atmel website and I came across this example where they explain some issues with reordering.
Here's the example code:
#define cli() __asm volatile( "cli" ::: "memory" )
#define sei() __asm volatile( "sei" ::: "memory" )
unsigned int ivar;
void test2( unsigned int val )
{
val = 65535U / val;
cli();
ivar = val;
sei();
}
In this example, they're implementing a critical-region-like mechanism. The cli instruction disables interrupts and the sei instruction enables them. Normally, I would save the interrupt state and restore it afterwards, but I digress...
The problem which they note is that, with optimization enabled, the division on the first line actually gets moved to after the cli instruction. This can cause some issues when you're trying to stay inside the critical region for the shortest time possible.
How come this is possible if the cli() MACRO expands to inline asm which explicitly clobbers the memory? How is the compiler free to move things before or after this statement?
Also, I modified the code to include memory barriers before every statement in the form of __asm volatile("" ::: "memory"); and it doesn't seem to change anything.
I also removed the memory clobber from the cli() and sei() MACROs, and the generated code was identical.
Of course, if I declare the test2 function argument as volatile, there is no reordering, which I assume to be because volatile statements can't be reordered with respect to other volatile statements (which the inline asm technically is). Is my assumption correct?
Can volatile accesses be reordered with respect to volatile inline asm?
Can non-volatile accesses be reordered with respect to volatile inline asm?
What's weird is that Atmel claims they need the memory clobber just to enforce the ordering of volatile accesses with respect to the asm. That doesn't make any sense to me.
If the compiler barrier isn't the proper solution for this, then how could I go about preventing any outside code from "leaking" into the critical region?
If anyone could shed some light, I'd appreciate it.
Thanks
How come this is possible if the cli() MACRO expands to inline asm which explicitly clobbers the memory? How is the compiler free to move things before or after this statement?
This is due to implementation details of avr-gcc: The compiler's support library, libgcc, provides many functions written in assembly for performance; including functions for integer division like __udivmodhi4. Not all of these functions clobber all of the callee-used registers as specified by the avr-gcc ABI. In particular, __udivmodhi4 does not clobber the Z register.
avr-gcc makes use of this as follows: on machines without a 16-bit division instruction, like AVR, GCC would normally issue a library call instead of generating code for it inline. avr-gcc, however, pretends that the architecture does have such a division instruction and models it as having an effect on processor registers just like the library call. Finally, after all code analysis and optimization, the avr backend prints this instruction as [R]CALL __udivmodhi4. Let's call this a transparent call, i.e. a call which the compiler analysis does not see.
Example
int div (int a, int b, volatile const __flash char *z)
{
int ab;
(void) *z;
asm volatile ("" : "+r" (a));
ab = a / b;
asm volatile ("" : "+r" (ab));
(void) *z;
return ab;
}
Compile this with avr-gcc -S -Os -mmcu=atmega8 ... to get assembly file *.s:
div:
movw r30,r20
lpm r18,Z
rcall __divmodhi4
movw r24,r22
lpm r18,Z
ret
Explanation
(void) *z reads one byte from flash, and in order to use the lpm instruction, the address must be in the Z register, accomplished by movw r30,r20. After reading via lpm, the compiler issues rcall __divmodhi4 to perform the signed 16-bit division. If this were an ordinary (non-transparent) call, the compiler would know nothing about the internal workings of the callee, but as the avr backend models the call by hand, the compiler knows that the instruction sequence does not change Z and hence may use Z again after the call without any further ado. This allows for better code generation due to less register pressure; in particular, z need not be saved / restored around the division.
The asm just serves to order the code: It is volatile and hence must not be reordered against the volatile read *z. And the asm must not be reordered against the division because the asm changes a and ab – at least that's what we are pretending and telling the compiler by means of the constraints. (These variables are not actually changed, but that does not matter here.)
Also, I modified the code to include memory barriers before every statement in the form of __asm volatile("" ::: "memory"); and it doesn't seem to change anything.
The division does not touch memory (it's a transparent call without memory clobber) hence the compiler machinery may reorder it against memory clobber / accesses.
If you need a specific order, then you'll have to introduce artificial dependencies like in my example above.
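Applied to the cli()/sei() example from the question, that could look like this (my sketch, reusing the macros defined there):

#define cli() __asm volatile( "cli" ::: "memory" )
#define sei() __asm volatile( "sei" ::: "memory" )

unsigned int ivar;

void test2 (unsigned int val)
{
    val = 65535U / val;
    /* Artificial dependency: this empty volatile asm "uses" val, so
       the division must happen before it, and volatile asms are not
       reordered against each other, keeping it before cli(). */
    __asm volatile ("" : "+r" (val));
    cli();
    ivar = val;
    sei();
}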
In order to tell apart ordinary calls from transparent ones, you can dump the generated assembly in the .s file by means of -save-temps -dp, where -dp prints insn names:
void func0 (void);
int func1 (int a, int b)
{
return a / b;
}
void func2 (void)
{
func0();
}
Every call that's neither call_insn nor call_value_insn is a transparent call, *divmodhi4_call in this case:
func1:
rcall __divmodhi4 ; 17 [c=0 l=1] *divmodhi4_call
movw r24,r22 ; 18 [c=4 l=1] *movhi/0
ret ; 23 [c=0 l=1] return
func2:
rjmp func0 ; 5 [c=0 l=1] call_insn/3
This is a strange request but I have a feeling that it could be possible. What I would like is to insert some pragmas or directives into areas of my code (written in C) so that GCC's register allocator will not use them.
I understand that I can do something like this, which might set aside this register for this variable
register int var1 asm ("EBX") = 1984;
register int var2 asm ("r9") = 101;
The problem is that I'm inserting new instructions (for a hardware simulator) directly and GCC and GAS don't recognise these yet. My new instructions can use the existing general purpose registers and I want to make sure that I have some of them (i.e. r12->r15) reserved.
Right now, I'm working in a mockup environment and I want to do my experiments quickly. In the future I will append GAS and add intrinsics into GCC, but right now I'm looking for a quick fix.
Thanks!
When writing GCC inline assembler, you can specify a "clobber list" - a list of registers that may be overwritten by your inline assembler code. GCC will then do whatever is needed to save and restore data in those registers (or avoid their use in the first place) over the course of the inline asm segment. You can also bind input or output registers to C variables.
For example:
/* Note: this example assumes a 32-bit target, where unsigned long
   fits in %ebx (as in the original 32-bit-era code). */
inline unsigned long addone(unsigned long v)
{
    unsigned long rv;
    asm("mov $1, %%eax;"
        "mov %1, %%ebx;"
        "add %%eax, %%ebx"
        : /* outputs */ "=b" (rv)
        : /* inputs */ "g" (v) /* any general operand: register, memory or immediate */
        : /* clobbers */ "eax"
        );
    return rv;
}
For more information, see the GCC-Inline-Asm-HOWTO.
If you use global explicit register variables, these will be reserved throughout the compilation unit and will not be used by the compiler for anything else (they may still be used by the system's libraries, so choose something that will be restored by those). Local register variables do not guarantee that your value will be in the register at all times, but only when referenced by code or used as an asm operand.
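For example, a sketch of the global flavor, reserving r12 (one of the registers the question mentions):

/* File-scope declaration: GCC's register allocator will not touch
   r12 anywhere in this compilation unit. r12 is call-saved in the
   x86-64 SysV ABI, so library code should preserve it as well. */
register long reserved_scratch asm ("r12");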
If you write an inline asm block for your new instructions, there are commands that inform GCC what registers are used by that block and how they are used. GCC will then avoid using those registers or will at least save and reload their contents.
Non-hardcoded scratch register in inline assembly
This is not a direct answer to the original question, but since I keep Googling this in that context, and since https://stackoverflow.com/a/6683183/895245 was accepted, I'm going to try to provide a possible improvement to that answer.
The improvement is the following: you should avoid hard-coding your scratch registers when possible, to give the register allocator more freedom.
Therefore, as an educational example that is useless in practice (it could be done with a single lea (%[in1], %[in2]), %[out]), the following hardcoded-scratch-register code:
bad.c
#include <assert.h>
#include <inttypes.h>
int main(void) {
uint64_t in1 = 0xFFFFFFFF;
uint64_t in2 = 1;
uint64_t out;
__asm__ (
"mov %[in2], %%rax;" /* scratch = in2 */
"add %[in1], %%rax;" /* scratch += in1 */
"mov %%rax, %[out];" /* out = scratch */
: [out] "=r" (out)
: [in1] "r" (in1),
[in2] "r" (in2)
: "rax"
);
assert(out == 0x100000000);
}
could compile to something more efficient if you instead use this non-hardcoded version:
good.c
#include <assert.h>
#include <inttypes.h>
int main(void) {
uint64_t in1 = 0xFFFFFFFF;
uint64_t in2 = 1;
uint64_t out;
uint64_t scratch;
__asm__ (
"mov %[in2], %[scratch];" /* scratch = in2 */
"add %[in1], %[scratch];" /* scratch += in1 */
"mov %[scratch], %[out];" /* out = scratch */
: [scratch] "=&r" (scratch),
[out] "=r" (out)
: [in1] "r" (in1),
[in2] "r" (in2)
:
);
assert(out == 0x100000000);
}
since the compiler is free to choose any register it wants instead of just rax.
Note that in this example we had to mark the scratch register as an early-clobber with & to prevent it from being put into the same register as an input; I have explained that in more detail at: When to use earlyclobber constraint in extended GCC inline assembly? This example also happens to fail without the & in the implementation I tested on.
Tested in Ubuntu 18.10 amd64, GCC 8.2.0, compile and run with:
gcc -O3 -std=c99 -ggdb3 -Wall -Werror -pedantic -o good.out good.c
./good.out
Non-hardcoded scratch registers are also mentioned in the GCC manual 6.45.2.6 "Clobbers and Scratch Registers", although their example is too much for mere mortals to take in at once:
Rather than allocating fixed registers via clobbers to provide scratch registers for an asm statement, an alternative is to define a variable and make it an early-clobber output as with a2 and a3 in the example below. This gives the compiler register allocator more freedom. You can also define a variable and make it an output tied to an input as with a0 and a1, tied respectively to ap and lda. Of course, with tied outputs your asm can’t use the input value after modifying the output register since they are one and the same register. What’s more, if you omit the early-clobber on the output, it is possible that GCC might allocate the same register to another of the inputs if GCC could prove they had the same value on entry to the asm.

This is why a1 has an early-clobber. Its tied input, lda, might conceivably be known to have the value 16 and without an early-clobber share the same register as %11. On the other hand, ap can’t be the same as any of the other inputs, so an early-clobber on a0 is not needed. It is also not desirable in this case. An early-clobber on a0 would cause GCC to allocate a separate register for the "m" ((const double (*)[]) ap) input.

Note that tying an input to an output is the way to set up an initialized temporary register modified by an asm statement. An input not tied to an output is assumed by GCC to be unchanged, for example "b" (16) below sets up %11 to 16, and GCC might use that register in following code if the value 16 happened to be needed. You can even use a normal asm output for a scratch if all inputs that might share the same register are consumed before the scratch is used. The VSX registers clobbered by the asm statement could have used this technique except for GCC’s limit on the number of asm parameters.
static void
dgemv_kernel_4x4 (long n, const double *ap, long lda,
const double *x, double *y, double alpha)
{
double *a0;
double *a1;
double *a2;
double *a3;
__asm__
(
/* lots of asm here */
"#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
"#a0=%3 a1=%4 a2=%5 a3=%6"
:
"+m" (*(double (*)[n]) y),
"+&r" (n), // 1
"+b" (y), // 2
"=b" (a0), // 3
"=&b" (a1), // 4
"=&b" (a2), // 5
"=&b" (a3) // 6
:
"m" (*(const double (*)[n]) x),
"m" (*(const double (*)[]) ap),
"d" (alpha), // 9
"r" (x), // 10
"b" (16), // 11
"3" (ap), // 12
"4" (lda) // 13
:
"cr0",
"vs32","vs33","vs34","vs35","vs36","vs37",
"vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47"
);
}