I'm using the gcc compiler, and I want to be able to quickly change the SSE rounding mode. The following code works if I compile it under Linux:
#include <xmmintrin.h>
unsigned int _mxcsr_up = _MM_MASK_MASK | _MM_ROUND_UP;
unsigned int _mxcsr_down = _MM_MASK_MASK | _MM_ROUND_DOWN;
unsigned int _mxcsr_n = _MM_MASK_MASK;
void round_nearest_mode() {
asm (
"ldmxcsr %0" : : "m" (_mxcsr_n)
);
}
void round_up_mode() {
asm (
"ldmxcsr %0" : : "m" (_mxcsr_up)
);
}
void round_down_mode() {
asm (
"ldmxcsr %0" : : "m" (_mxcsr_down)
);
}
But when I compile it under Windows using MinGW, the rounding mode is not changed. What is the reason?
The same header that provides the _MM_ROUND_UP constants also defines _mm_setcsr(unsigned int i) and _mm_getcsr(void) intrinsic wrappers around the relevant instructions.
You should normally retrieve the old value, OR or ANDN the bit you want to change, then apply the new value. (e.g. mxcsr &= ~SOME_BITS). You won't find many examples that just use LDMXCSR without doing a STMXCSR first.
Oh, I think you're actually doing that part wrong in your code. I haven't looked at how _MM_MASK_MASK is defined, but its name includes the word MASK. You're ORing it with various other constants, instead of ANDing it. You're probably setting the MXCSR to the same value every time, because you're ORing everything with _MM_MASK_MASK, which I assume has all the rounding-mode bits set.
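For example, here's a minimal sketch of that read-modify-write pattern using the _mm_getcsr / _mm_setcsr intrinsics (the function name is just illustrative; _MM_ROUND_MASK is the header's mask for the rounding-mode bits):
#include <xmmintrin.h>
static inline void set_sse_round_down(void) {
    unsigned int csr = _mm_getcsr();                  // STMXCSR: read the current MXCSR
    csr = (csr & ~_MM_ROUND_MASK) | _MM_ROUND_DOWN;   // clear only the rounding-mode bits, then set the new mode
    _mm_setcsr(csr);                                  // LDMXCSR: write the modified value back
}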
As @StoryTeller points out, you don't need inline asm or intrinsics to change rounding modes, since the four rounding modes provided by x86 hardware match the four defined by fenv.h in C99 (FE_DOWNWARD, FE_TONEAREST (the default), FE_TOWARDZERO, and FE_UPWARD), which you can set with e.g. fesetround(FE_DOWNWARD);.
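A minimal fenv.h sketch of that (saving and restoring the old mode; the function name is illustrative):
#include <fenv.h>
double sum_rounded_down(double a, double b) {
    int old_mode = fegetround();   // save the current rounding mode
    fesetround(FE_DOWNWARD);
    double r = a + b;              // note: without FENV_ACCESS, the compiler may still move this (see below)
    fesetround(old_mode);          // restore the old mode
    return r;
}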
If you want to change rounding modes on the fly and make sure the optimizer doesn't reorder any FP ops to a place where the rounding mode was set differently, you need
#pragma STDC FENV_ACCESS ON, but gcc doesn't support it. See also this gcc bug from 2008 which is still open: Optimization generates incorrect code with -frounding-math option (#pragma STDC FENV_ACCESS not implemented).
Doing it manually with asm volatile still won't prevent CSE from deciding that an x/y computed earlier is still the same value and not recomputing it after the asm statement, though. The way around that is to use x or y as a read-write operand of the asm statement, even if it's never actually used. e.g.
asm volatile("" : "+g"(x)); // optimizer must not make any assumptions about x's value.
You could put the LDMXCSR inside that same inline-asm statement, to guarantee that the point where the rounding mode changed is also the point where the compiler treats x as having changed.
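Here's a hedged sketch of that combination (the function name, the "+x" constraint choice for a double, and the lack of a restore are illustrative, not from the original code):
#include <xmmintrin.h>
static inline double add_round_up(double x, double y) {
    static const unsigned int csr_up = _MM_MASK_MASK | _MM_ROUND_UP;
    // x is a read-write operand, so the compiler treats its value as changing
    // at the same point the rounding mode changes:
    asm volatile ("ldmxcsr %1" : "+x"(x) : "m"(csr_up));
    return x + y;   // can't be CSEd with an x + y computed under the old rounding mode
}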
I am using the bsf x86-64 instruction found on page 210 of Intel's developer's manual found here. Essentially, if a least significant 1 bit is found, its bit index is stored in the destination operand.
Furthermore, the ZF flag is set to 1 if the source operand is all zeros; otherwise, the ZF flag is cleared.
I am compiling my C code with inline x86-64 assembly instructions. I have defined a C function which invokes the bsf instruction:
uint64_t bitScanForward(T_bitboard b) {
__asm__(
"bsf %rcx,%rax\n"
"leave\n"
"ret\n"
);
}
and also another C function which checks the status of the ZF bit in the flags register:
uint64_t isZFSet() {
printf("\n"); <- This is another problem I am having (see below)...
__asm__(
"jz true\n"
"movq $0,%rax\n"//return false
"jmp end\n"
"true:\n"
"movq $1,%rax\n"//return true
"end:\n"
"leave\n"
"ret\n"
);
}
I have tested these and found that the ZF flag is always cleared, even when the bsf command is applied to the number zero, seemingly going against the specification.
//Calling function...
//Do stuff...
bitScanForward(0ULL);//ULL is 64 bit on my machine
if(isZFSet()){//ZF flag *should* be set here but its not
printf("ZF flag is set\n");
}
//More stuff...
I suspect the reason the ZF flag is clearing is that I am leaving one set of inline instructions and entering another.
How can I ensure that the flag in the above code is set as specified in the documentation? (I don't want to change much of my code or design)
My "other problem" is that if I dont include the printf statement in the isZFFlagSet, the function seemingly doesnt execute. Totally bizarre. Can anyone explain why?
You are treating an aggressively optimizing C compiler as if it were a macro assembler. That just plain isn't going to work. To get GCC to emit correct code in the presence of assembly inserts, you have to annotate the inserts with complete information about the registers and memory regions that are affected by the assembly code, and you have to use ancillary C statements to mesh them with the surrounding code. Even then, there are things the assembly insert cannot do at all. I urge you to scrap this entire mess and instead use the __builtin_ctzll intrinsic, as suggested in the comments on the question.
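For reference, a minimal sketch of the builtin approach (__builtin_ctzll has an undefined result for an input of 0, so that case is handled explicitly here):
#include <stdint.h>
uint64_t bitScanForward_builtin(uint64_t b) {
    return b ? (uint64_t)__builtin_ctzll(b) : 64;   // 64 as an "input was all zeros" sentinel
}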
Now, to specifics. Your first function is incorrect because GCC does not support use of leave or ret inside an assembly insert. (More generally, assembly inserts may not alter the stack pointer, and may only jump to designated labels within the same function.) The correct way to use bsf from a GCC-style assembly insert is with "extended asm" with input and output operands:
uint64_t bitScanForward(uint64_t b) {
uint64_t ret;
asm ("bsf %1, %0" : "=r" (ret) : "r" (b));
return ret;
}
You must declare a C variable to receive the output of the operation, and explicitly return that variable; having bsf write to %rax would not work (unlike how it was in old MSVC). BSF accepts any two registers as operands, so there is no need to use constraints more specific than r.
Your second function is incorrect because you didn't tell GCC that the condition codes were meaningful after bitScanForward, and because GCC does not support using the condition-code register as an input to an assembly insert. In order to read the ZF output from bsf you must do so within the same assembly insert that invoked bsf:
uint64_t countTrailingZeroes(uint64_t b) {
uint64_t ret;
asm ("bsf %1, %0\n\t"
"cmove %2, %0"
: "=&r" (ret)
: "r" (b), "rm" (64));
return ret;
}
This requires special care -- see how the constraint on operand 0 is now =&r instead of just =r? Without that, GCC is liable to think it can put operand 2 in the same register as operand 0.
Alternatively, you can specify that ZF is an output, which is supported (see the "flag output operands" section of the manual) and then supply a default value from C:
uint64_t countTrailingZeroes(uint64_t b) {
uint64_t ret;
int zf;
asm ("bsf %2, %0"
: "=r" (ret), "=#ccz" (zf) : "r" (b));
if (zf) ret = 64;
return ret;
}
AVX512 introduced an opmask feature for its arithmetic instructions. A simple example: godbolt.org.
#include <immintrin.h>
__m512i add(__m512i a, __m512i b) {
__m512i sum;
asm(
"mov ebx, 0xAAAAAAAA; \n\t"
"kmovw k1, ebx; \n\t"
"vpaddd %[SUM] %{k1%}%{z%}, %[A], %[B]; # conditional add "
: [SUM] "=v"(sum)
: [A] "v" (a),
[B] "v" (b)
: "ebx", "k1" // clobbers
);
return sum;
}
-march=skylake-avx512 -masm=intel -O3
mov ebx,0xaaaaaaaa
kmovw k1,ebx
vpaddd zmm0{k1}{z},zmm0,zmm1
The problem is that k1 has to be specified.
Is there an input constraint like "r" for integers except that it picks a k register instead of a general-purpose register?
__mmask16 is literally a typedef for unsigned short (and the other mask types are typedefs for other plain integer types), so we just need a constraint for passing it in a k register.
We have to go digging in the gcc sources config/i386/constraints.md to find it:
The constraint for any mask register is "k". Or use "Yk" for k1..k7 (which can be used as a predicate, unlike k0). You'd use an "=k" operand as the destination for a compare-into-mask, for example.
Obviously you can use "=Yk"(tmp) with a __mmask16 tmp to get the compiler to do register allocation for you, instead of just declaring clobbers on whichever "k" registers you decide to use.
Prefer intrinsics like _mm512_maskz_add_epi32
First of all, https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. Understanding asm is great, but use that to read compiler output and/or figure out what would be optimal, then write intrinsics that can compile the way you want. Performance tuning info like https://agner.org/optimize/ and https://uops.info/ list things by asm mnemonic, and they're shorter / easier to remember than intrinsics, but you can search by mnemonic to find intrinsics on https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Intrinsics will also let the compiler fold loads into memory source operands for other instructions; with AVX512 those can even be broadcast loads! Your inline asm forces the compiler to use a separate load instruction. Even a "vm" input won't let the compiler pick a broadcast-load as the memory source, because it wouldn't know the broadcast element width of the instruction(s) you were using it with.
Use _mm512_mask_add_epi32 or _mm512_maskz_add_epi32 especially if you're already using __m512i types from <immintrin.h>.
Also, your asm has a bug: you're using {k1} merge-masking, not {k1}{z} zero-masking, but you used an uninitialized __m512i sum; with an output-only "=v" constraint as the merge destination! As a stand-alone function, it happens to merge into a because the calling convention has ZMM0 = first input = return value register. But when inlining into other functions, you definitely can't assume that sum will pick the same register as a. Your best bet is to use a read/write operand for "+v"(a) and use it as the destination and first source.
Merge-masking only makes sense with a "+v" read/write operand. (Or in an asm statement with multiple instructions where you've already written an output once, and want to merge another result into it.)
Intrinsics would stop you from making this mistake; the merge-masking version has an extra input for the merge-target. (The asm destination operand).
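For example, a minimal sketch of both intrinsic forms (using the same 0xAAAA mask constant as the asm example below):
#include <immintrin.h>
__m512i add_maskz(__m512i a, __m512i b) {
    // zero-masking: elements whose mask bit is 0 become 0
    return _mm512_maskz_add_epi32((__mmask16)0xAAAA, a, b);
}
__m512i add_merge(__m512i src, __m512i a, __m512i b) {
    // merge-masking: elements whose mask bit is 0 keep their value from src
    return _mm512_mask_add_epi32(src, (__mmask16)0xAAAA, a, b);
}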
Example using "Yk"
// works with -march=skylake-avx512 or -march=knl
// or just -mavx512f but don't do that.
// also needed: -masm=intel
#include <immintrin.h>
__m512i add_zmask(__m512i a, __m512i b) {
__m512i sum;
asm(
"vpaddd %[SUM] %{%[mask]%}%{z%}, %[A], %[B]; # conditional add "
: [SUM] "=v"(sum)
: [A] "v" (a),
[B] "v" (b),
[mask] "Yk" ((__mmask16)0xAAAA)
// no clobbers needed, unlike your question which I fixed with an edit
);
return sum;
}
Note that all the { and } are escaped with % (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Special-format-strings), so they're not parsed as dialect-alternatives {AT&T | Intel-syntax}.
This compiles with gcc as early as 4.9, but don't actually do that because it doesn't understand -march=skylake-avx512, or even have tuning settings for Skylake or KNL. Use a more recent GCC that knows about your CPU for best results.
Godbolt compiler explorer:
# gcc8.3 -O3 -march=skylake-avx512 or -march=knl (and -masm=intel)
add(long long __vector, long long __vector):
mov eax, -21846
kmovw k1, eax # compiler-generated
# inline asm starts
vpaddd zmm0 {k1}{z}, zmm0, zmm1; # conditional add
# inline asm ends
ret
-mavx512bw (implied by -march=skylake-avx512 but not knl) is required for "Yk" to work on an int. If you're compiling with -march=knl, integer literals need a cast to __mmask16 or __mmask8, because unsigned int = __mmask32 isn't available for masks.
[mask] "Yk" (0xAAAA) requires AVX512BW even though the constant does fit in 16 bits, just because bare integer literals always have type int. (vpaddd zmm has 16 elements per vector, so I shortened your constant to 16-bit.) With AVX512BW, you can pass wider constants or leave out the cast for narrow ones.
gcc6 and later support -march=skylake-avx512. Use that to set tuning as well as enabling everything. Preferably gcc8 or at least gcc7. Newer compilers generate less clunky code with new ISA extensions like AVX512 if you're ever using it outside of inline asm.
gcc5 supports -mavx512f -mavx512bw but doesn't know about Skylake.
gcc4.9 doesn't support -mavx512bw.
"Yk" is unfortunately not yet documented in https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html.
I knew where to look in the GCC source thanks to Ross's answer on In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?
While it is undocumented, looking here we see:
(define_register_constraint "Yk" "TARGET_AVX512F ? MASK_REGS : NO_REGS"
  "@internal Any mask register that can be used as predicate, i.e. k1-k7.")
Editing your godbolt to this:
asm(
"vpaddd %[SUM] %{%[k]}, %[A], %[B]"
: [SUM] "=v"(sum)
: [A] "v" (a), [B] "v" (b), [k] "Yk" (0xaaaaaaaa) );
seems to produce the correct output.
That said, I usually try to discourage people from using inline asm (and undocumented features). Can you use _mm512_mask_add_epi32?
I am currently working on my x86 OS. I tried implementing the inb function from here and it gives me Error: Operand type mismatch for `in'.
This may also be the same with outb or io_wait.
I am using Intel syntax (-masm=intel) and I don't know what to do.
Code:
#include <stdint.h>
#include "ioaccess.h"
uint8_t inb(uint16_t port)
{
uint8_t ret;
asm volatile ( "inb %1, %0"
: "=a"(ret)
: "Nd"(port) );
return ret;
}
With AT&T syntax this does work.
For outb I'm having a different problem after reversing the operands:
void io_wait(void)
{
asm volatile ( "outb $0x80, %0" : : "a"(0) );
}
Error: operand size mismatch for `out'
If you need to use -masm=intel you will need to ensure that your inline assembly is in Intel syntax. Intel syntax is dst, src (AT&T syntax is the reverse). This somewhat related answer has some useful information on some differences between NASM's Intel variant1 (not GAS's variant) and AT&T syntax:
Information on how you can go about translating NASM Intel syntax to GAS's AT&T syntax can be found in this Stackoverflow Answer, and a lot of useful information is provided in this IBM article.
[snip]
In general the biggest differences are:
With AT&T syntax the source is on the left and destination is on the right and Intel is the reverse.
With AT&T syntax register names are prepended with a %
With AT&T syntax immediate values are prepended with a $
Memory operands are probably the biggest difference. NASM uses [segment:disp+base+index*scale] instead of GAS's syntax of segment:disp(base, index, scale).
The problem in your code is that source and destination operands have to be reversed from the original AT&T syntax you were working with. This code:
asm volatile ( "inb %1, %0"
: "=a"(ret)
: "Nd"(port) );
Needs to be:
asm volatile ( "inb %0, %1"
: "=a"(ret)
: "Nd"(port) );
Regarding your update: the problem is that in Intel syntax immediate values are not prepended with a $. This line is a problem:
asm volatile ( "outb $0x80, %0" : : "a"(0) );
It should be:
asm volatile ( "outb 0x80, %0" : : "a"(0) );
If you had a proper outb function you could do something like this instead:
#include <stdint.h>
#include "ioaccess.h"
uint8_t inb(uint16_t port)
{
uint8_t ret;
asm volatile ( "inb %0, %1"
: "=a"(ret)
: "Nd"(port) );
return ret;
}
void outb(uint16_t port, uint8_t byte)
{
asm volatile ( "outb %1, %0"
:
: "a"(byte),
"Nd"(port) );
}
void io_wait(void)
{
outb (0x80, 0);
}
A slightly more complex version that supports both the AT&T and Intel dialects:
Multiple assembler dialects in asm templates
On targets such as x86,
GCC supports multiple assembler dialects. The -masm option controls
which dialect GCC uses as its default for inline assembler. The
target-specific documentation for the -masm option contains the list
of supported dialects, as well as the default dialect if the option is
not specified. This information may be important to understand, since
assembler code that works correctly when compiled using one dialect
will likely fail if compiled using another. See x86 Options.
If your code needs to support multiple assembler dialects (for
example, if you are writing public headers that need to support a
variety of compilation options), use constructs of this form:
{ dialect0 | dialect1 | dialect2... }
On x86 and x86-64 targets there are two dialects. Dialect0 is AT&T syntax and Dialect1 is Intel syntax. The functions could be reworked this way:
#include <stdint.h>
#include "ioaccess.h"
uint8_t inb(uint16_t port)
{
uint8_t ret;
asm volatile ( "inb {%[port], %[retreg] | %[retreg], %[port]}"
: [retreg]"=a"(ret)
: [port]"Nd"(port) );
return ret;
}
void outb(uint16_t port, uint8_t byte)
{
asm volatile ( "outb {%[byte], %[port] | %[port], %[byte]}"
:
: [byte]"a"(byte),
[port]"Nd"(port) );
}
void io_wait(void)
{
outb (0x80, 0);
}
I have also given the constraints symbolic names rather than using %0 and %1, to make the inline assembly easier to read and maintain. From the GCC documentation, each constraint has the form:
[ [asmSymbolicName] ] constraint (cvariablename)
Where:
asmSymbolicName
Specifies a symbolic name for the operand. Reference the name in the assembler template by enclosing it in square brackets (i.e. ‘%[Value]’). The scope of the name is the asm statement that contains the definition. Any valid C variable name is acceptable, including names already defined in the surrounding code. No two operands within the same asm statement can use the same symbolic name.
When not using an asmSymbolicName, use the (zero-based) position of the operand in the list of operands in the assembler template. For example if there are three output operands, use ‘%0’ in the template to refer to the first, ‘%1’ for the second, and ‘%2’ for the third.
This version should work2 whether you compile with the -masm=intel or -masm=att option.
Footnotes
1Although NASM Intel dialect and GAS's (GNU Assembler) Intel syntax are similar there are some differences. One is that NASM Intel syntax uses [segment:disp+base+index*scale] where a segment can be specified inside the [] and GAS's Intel syntax requires the segment outside with segment:[disp+base+index*scale].
2Although the code will work, you should place all these basic functions in the ioaccess.h file directly and eliminate them from the .c file that contains them. Because you placed these basic functions in a separate .c file (external linkage) the compiler can't optimize them as well as it could. You can modify the functions to be of type static inline and place them in the header directly. The compiler will then have the ability to optimize the code by removing function calling overhead and reduce the need for extra loads and stores. You will want to compile with optimizations higher than -O0. Consider -O2 or -O3.
Special Notes Regarding OS Development:
There are many toy OSes (examples, tutorials, and even code on OSDev Wiki) that do not work with optimizations on. Many failures are due to bad/poor inline assembly or using undefined behaviour. Inline assembly should be used as a last resort. If your kernel doesn't run with optimizations on it is likely not a bug in the compiler (it is possible just not likely).
Heed the advice in @PeterCordes' answer regarding port access that may trigger DMA reads.
It's possible to write code that works with or without -masm=intel, using dialect alternatives for GNU C inline asm: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html (This is a good idea for headers that other people might include.)
It works like "{at&t stuff | intel stuff}": the compiler picks which side of the | to keep based on the current mode.
The major difference between AT&T vs. Intel syntax is that the operand-list is reversed, so usually you have something like "inb {%1,%0 | %0,%1}".
This is a version of @MichaelPetch's nice functions using dialect alternatives:
// make this a header: these single instructions can inline more cheaply
// than setting up args for a function call
#include <stdint.h>
static inline
uint8_t inb(uint16_t port)
{
uint8_t ret;
asm volatile ( "inb {%1, %0 | %0, %1}"
: "=a"(ret)
: "Nd"(port) );
return ret;
}
static inline
void outb(uint16_t port, uint8_t byte)
{
asm volatile ( "outb {%1, %0 | %0, %1}"
:
: "a"(byte),
"Nd"(port) );
}
static inline
void io_wait(void) {
outb (0x80, 0);
}
The Linux/Glibc sys/io.h macros sometimes use %w1 to expand a constraint to the 16-bit register name, but using types of the right size also works.
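For example, a hedged sketch of the %w-modifier style in plain AT&T syntax (the function name is illustrative; port is deliberately a plain unsigned int here, so %w1 is what forces the 16-bit register name):
#include <stdint.h>
static inline uint8_t inb_w(unsigned int port) {
    uint8_t ret;
    // %w1 prints the 16-bit register name (e.g. %dx) even though the C type is wider
    asm volatile ("inb %w1, %0" : "=a"(ret) : "Nd"(port));
    return ret;
}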
If you want a memory-barrier version of these to take advantage of the fact that in / out are (more or less) serializing like a locked instruction or mfence, add a "memory" clobber to stop compile-time reordering of memory access across it.
If a port I/O can trigger a DMA read of some other memory that you wrote recently, you might also need a "memory" clobber for that. (x86 has cache-coherent DMA so you wouldn't have had to explicitly flush it, but you can't let the compiler reorder it after an outb, or even optimize away an apparently dead store.)
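A minimal sketch of that variant (the same outb as above with the clobber added; the name is illustrative):
#include <stdint.h>
static inline void outb_mb(uint16_t port, uint8_t byte) {
    asm volatile ("outb {%0, %1 | %1, %0}"
                  :
                  : "a"(byte), "Nd"(port)
                  : "memory");   // compile-time barrier: earlier stores (e.g. to a DMA buffer) can't sink past the port write
}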
There's no support in GAS for saving the old mode, so using .intel_syntax noprefix inside your inline asm leaves you no way know whether to switch back to .att_syntax or not.
But that wouldn't usually be sufficient anyway: you need to get the compiler to format operands in ways that match the syntax mode when filling in a template. e.g. the port number needs to expand to $imm or %dx (AT&T1) vs. dx or imm without the $ prefix.
Or for a memory operand, [rdi + rax*4 + 8] or 8(%rdi, %rax, 4).
But you still need to take care of reversing the operand list with { | } yourself; the compiler doesn't try to do that for you. It's purely a text-substitution into the template according to simple rules.
Footnote 1: AT&T disassembly by objdump -d bizarrely uses (%dx) for the port number in the non-immediate form, but GAS accepts %dx or (%dx) on input, so an "Nd" constraint is usable, simply expanding to the bare register name.
This is basically to perform a byte swap on buffers while transferring a message buffer. This statement left me puzzled (because of my unfamiliarity with embedded assembly code in C). It is a PowerPC instruction:
#define ASMSWAP32(dest_addr,data) __asm__ volatile ("stwbrx %0, 0, %1" : : "r" (data), "r" (dest_addr))
Besides being unsafe because of a bug, this macro is also less efficient than what the compiler will generate for you.
stwbrx = store word byte-reversed. The x stands for indexed.
You don't need inline asm for this in GNU C, where you can use __builtin_bswap32 and let the compiler emit this instruction for you.
void swapstore_asm(int a, int *p) {
ASMSWAP32(p, a);
}
void swapstore_c(int a, int *p) {
*p = __builtin_bswap32(a);
}
Compiled with gcc4.8.5 -O3 -mregnames, we get identical code from both functions (Godbolt compiler explorer):
swapstore_asm:
stwbrx %r3, 0, %r4
blr
swapstore_c:
stwbrx %r3,0,%r4
blr
But with a more complicated address (storing to p[off], where off is an integer function arg), the compiler knows how to use both register inputs, while your macro forces the compiler to have the address in a single register:
void swapstore_offset(int a, int *p, int off) {
p[off] = __builtin_bswap32(a);
}
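The macro version isn't shown above, but judging from the swapstore_offset_asm output below it was presumably something like:
void swapstore_offset_asm(int a, int *p, int off) {
    ASMSWAP32(&p[off], a);
}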
swapstore_offset:
slwi %r5,%r5,2 # *4 = sizeof(int)
stwbrx %r3,%r4,%r5 # use an indexed addressing mode, with both registers non-zero
blr
swapstore_offset_asm:
slwi %r5,%r5,2
add %r4,%r4,%r5 # extra instruction forced by using the macro
stwbrx %r3, 0, %r4
blr
BTW, if you're having trouble understanding GNU C inline asm templates, looking at the compiler's asm output can be a useful way to see what gets substituted in. See How to remove "noise" from GCC/clang assembly output? for more about reading compiler asm output.
Also note that this macro is buggy: it's missing a "memory" clobber for the store. And yes, you still need that with asm volatile. The compiler doesn't assume that *dest_addr is modified unless you tell it, so it could hoist a non-volatile load of *dest_addr ahead of this insn, or more likely to be a real problem, sink a store after it. (e.g. if you zeroed a buffer before storing to it with this, the compiler might actually zero after this instruction.)
Instead of a "memory" clobber (and also leaving out volatile), you could tell the compiler which memory location you modify with a =m" (*dest_addr) operand, either as a dummy operand or with a constraint on the addressing mode so you could use it as reg+reg. (IDK PPC well enough to know what "=m" usually expands to.)
In most cases this bug won't bite you, but it's still a bug. Upgrading your compiler version or using link-time optimization could maybe make your program buggy with no source-level changes.
This kind of thing is why https://gcc.gnu.org/wiki/DontUseInlineAsm
See also https://stackoverflow.com/tags/inline-assembly/info.
#define ASMSWAP32(dest_addr,data) ...
This part should be clear
__asm__ volatile ( ... : : "r" (data), "r" (dest_addr))
This is the actual inline assembly:
Two values are passed to the assembly code; no value is returned from the assembly code (that's what the two colons after the actual assembly code indicate).
Both parameters are passed in registers ("r"). The expression %0 will be replaced by the register that contains the value of data while the expression %1 will be replaced by the register that contains the value of dest_addr (which will be a pointer in this case).
The volatile here means that the assembly code has to be executed at this point and cannot be moved to somewhere else.
So if you use the following code in the C source:
ASMSWAP32(&a, b);
... the following assembler code will be generated:
# write the address of a to register 5 (for example)
...
# write the value of b to register 6
...
stwbrx 6, 0, 5
So the first argument of the stwbrx instruction is the value of b and the last argument is the address of a.
stwbrx x, 0, y
This instruction writes the value in register x to the address stored in register y; however, it writes the value in "reverse endian" (on a big-endian CPU it writes the value "little endian").
The following code:
uint32_t a;
ASMSWAP32(&a, 0x12345678);
... should therefore result in a = 0x78563412.
I was looking at the documentation on the Atmel website and I came across this example where they explain some issues with reordering.
Here's the example code:
#define cli() __asm volatile( "cli" ::: "memory" )
#define sei() __asm volatile( "sei" ::: "memory" )
unsigned int ivar;
void test2( unsigned int val )
{
val = 65535U / val;
cli();
ivar = val;
sei();
}
In this example, they're implementing a critical region-like mechanism. The cli instruction disables interrupts and the sei instruction enables them. Normally, I would save the interrupt state and restore to that state, but I digress...
The problem which they note is that, with optimization enabled, the division on the first line actually gets moved to after the cli instruction. This can cause some issues when you're trying to stay inside the critical region for the shortest amount of time possible.
How come this is possible if the cli() MACRO expands to inline asm which explicitly clobbers the memory? How is the compiler free to move things before or after this statement?
Also, I modified the code to include memory barriers before every statement in the form of __asm volatile("" ::: "memory"); and it doesn't seem to change anything.
I also removed the memory clobber from the cli() and sei() MACROs, and the generated code was identical.
Of course, if I declare the test2 function argument as volatile, there is no reordering, which I assume to be because volatile statements can't be reordered with respect to other volatile statements (which the inline asm technically is). Is my assumption correct?
Can volatile accesses be reordered with respect to volatile inline asm?
Can non-volatile accesses be reordered with respect to volatile inline asm?
What's weird is that Atmel claims they need the memory clobber just to enforce the ordering of volatile accesses with respect to the asm. That doesn't make any sense to me.
If the compiler barrier isn't the proper solution for this, then how could I go about preventing any outside code from "leaking" into the critical region?
If anyone could shed some light, I'd appreciate it.
Thanks
How come this is possible if the cli() MACRO expands to inline asm which explicitly clobbers the memory? How is the compiler free to move things before or after this statement?
This is due to implementation details of avr-gcc: The compiler's support library, libgcc, provides many functions written in assembly for performance; including functions for integer division like __udivmodhi4. Not all of these functions clobber all of the callee-used registers as specified by the avr-gcc ABI. In particular, __udivmodhi4 does not clobber the Z register.
avr-gcc makes use of this as follows: On machines without a 16-bit division instruction, like AVR, GCC would issue a library call instead of generating code for it inline. avr-gcc, however, pretends that the architecture does have such a division instruction and models it as having an effect on processor registers just like the library call. Finally, after all code analyses and optimizations, the avr backend prints this instruction as [R]CALL __udivmodhi4. Let's call this a transparent call, i.e. a call which the compiler analysis does not see.
Example
int div (int a, int b, volatile const __flash char *z)
{
int ab;
(void) *z;
asm volatile ("" : "+r" (a));
ab = a / b;
asm volatile ("" : "+r" (ab));
(void) *z;
return ab;
}
Compile this with avr-gcc -S -Os -mmcu=atmega8 ... to get assembly file *.s:
div:
movw r30,r20
lpm r18,Z
rcall __divmodhi4
movw r24,r22
lpm r18,Z
ret
Explanation
(void) *z reads one byte from flash, and in order to use the lpm instruction, the address must be in the Z register, which is accomplished by movw r30,r20. After reading via lpm, the compiler issues rcall __divmodhi4 to perform the signed 16-bit division. If this was an ordinary (non-transparent) call, the compiler would know nothing about the internal workings of the callee, but as the avr backend models the call by hand, the compiler knows that the instruction sequence does not change Z and hence may use Z again after the call without any further ado. This allows for better code generation due to less register pressure; in particular, z need not be saved / restored around the division.
The asm just serves to order the code: It is volatile and hence must not be reordered against the volatile read *z. And the asm must not be reordered against the division because the asm changes a and ab – at least that's what we are pretending and telling the compiler by means of the constraints. (These variables are not actually changed, but that does not matter here.)
Also, I modified the code to include memory barriers before every statement in the form of __asm volatile("" ::: "memory"); and it doesn't seem to change anything.
The division does not touch memory (it's a transparent call without memory clobber) hence the compiler machinery may reorder it against memory clobber / accesses.
If you need a specific order, then you'll have to introduce artificial dependencies like in my example above.
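For instance, a hedged sketch applying that to the original test2() (reusing the cli(), sei() and ivar from the question; the empty asm is the artificial dependency):
void test2_ordered(unsigned int val)
{
    val = 65535U / val;
    __asm__ volatile ("" : "+r" (val));   // val is "modified" here, so the division must happen before this point
    cli();
    ivar = val;
    sei();
}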
In order to tell ordinary calls apart from transparent ones, you can dump the generated assembly into a .s file by means of -save-temps -dp, where -dp prints insn names:
void func0 (void);
int func1 (int a, int b)
{
return a / b;
}
void func2 (void)
{
func0();
}
Every call that's neither call_insn nor call_value_insn is a transparent call, *divmodhi4_call in this case:
func1:
rcall __divmodhi4 ; 17 [c=0 l=1] *divmodhi4_call
movw r24,r22 ; 18 [c=4 l=1] *movhi/0
ret ; 23 [c=0 l=1] return
func2:
rjmp func0 ; 5 [c=0 l=1] call_insn/3