I have a problem with inline assembly on AArch64 Linux; the gcc version is 7.3.0.
uint8x16_t vcopyq_laneq_u8_inner(uint8x16_t a, const int b, uint8x16_t c, const int d)
{
uint8x16_t res;
__asm__ __volatile__(
:"ins %[dst].B[%[dlane]], %[src].B[%[sland]] \n\t"
:[dst] "=w"(res)
:"0"(a), [dlane]"i"(b), [src]"w"(c), [slane]"i"(d)
:);
return res;
}
This function used to be an inline function that could be compiled and linked into an executable program. But now we want to compile this function into a dynamic library, so we removed its inline keyword. Now it does not compile successfully, and the error info is:
warning: asm operand 2 probably doesn't match constraints
warning: asm operand 4 probably doesn't match constraints
error: impossible constraint in 'asm'
I guess this error happened because the inline-assembly constraint "i" needs an "immediate integer operand", but the variables 'b' and 'd' are const, aren't they?
Now I have an idea to make this function compile successfully: use if-else to test the values of 'b' and 'd', and replace dlane/slane with immediate integer operands. But in our code, uint8x16_t is a structure of 16 uint8_t variables, so I would need to write 16x16 == 256 if-else statements, which is inefficient.
So my question is following:
Why can this function be compiled and linked successfully into an executable program when it is inline, but cannot be compiled into a dynamic link library once the inline keyword is removed?
Is there an efficient way to avoid using 256 if-else statements?
const means you can't modify the variable, not that it's a compile-time constant. That's only the case if the caller passes a constant, and you compile with optimization enabled so constant-propagation can get that value to the asm statement. Even C++ constexpr doesn't require a constant expression in most contexts, it only allows it, and guarantees that compile-time constant-propagation is possible.
A stand-alone version of this function can't exist, but you didn't make it static so the compiler has to create a non-inline definition that can get called from other compilation units, even if it inlines into every call-site in this file. But this is impossible, because const int b doesn't have a known value.
For example,
int foo(const int x){
return x*37;
}
int bar(){
return foo(2);
}
On Godbolt compiled for AArch64: notice that foo can't just return a constant, it needs to work with a run-time variable argument, whatever value it happens to be. Only in bar with optimization enabled can it inline and not need the value of x in a register, just return a constant. (Which it used as an immediate for mov).
foo(int):
mov w1, 37
mul w0, w0, w1
ret
bar():
mov w0, 74
ret
In a shared library, your function also has to be __attribute__((visibility("hidden"))) so it can actually inline, otherwise the possibility of symbol interposition means that the compiler can't assume that foo(123) is actually going to call int foo(int) defined in the same .c
(Or static inline.)
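For example, a minimal sketch of such a declaration (the attribute spelling is GCC's documented form, applied to the function from the question):
__attribute__((visibility("hidden")))
uint8x16_t vcopyq_laneq_u8_inner(uint8x16_t a, const int b, uint8x16_t c, const int d);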
Is there an efficient way to avoid using 256 if-else statements?
Not sure what you're doing with your vector exactly, but if you don't have a shuffle that can work with runtime-variable counts, storing to a 16-byte array can be the least bad option. But storing one byte and then reloading the whole vector will cause a store-forwarding stall, probably with a cost similar to x86 if not worse.
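A minimal sketch of that store/reload fallback, assuming the NEON intrinsics in <arm_neon.h> are acceptable (the helper name and the & 15 lane clamping are illustrative, not from the question):
#include <arm_neon.h>
#include <stdint.h>

static uint8x16_t copy_lane_rt(uint8x16_t a, int dst_lane, uint8x16_t c, int src_lane)
{
    uint8_t buf_a[16], buf_c[16];
    vst1q_u8(buf_a, a);                           /* spill both vectors to memory */
    vst1q_u8(buf_c, c);
    buf_a[dst_lane & 15] = buf_c[src_lane & 15];  /* scalar byte copy with runtime lane numbers */
    return vld1q_u8(buf_a);                       /* reload; expect a store-forwarding stall */
}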
Doing your algorithm efficiently with AArch64 SIMD instructions is a separate question, and you haven't given enough info to figure out anything about that. Ask a different question if you want help implementing some algorithm to avoid this in the first place, or an efficient runtime-variable byte insert using other shuffles.
Constraint "i" means a number. A specific number. It means you want the compiler to emit an instruction like this:
ins v0.B[2], v1.B[3]
(pardon me if my AArch64 assembly syntax isn't quite right) where v0 is the register containing res, v1 is the register containing c, 2 is the value of b (not the number of the register containing b) and 3 is the value of d (not the number of the register containing d).
That is, if you call
vcopyq_laneq_u8_inner(something, 2, something, 3)
the instruction in the function is
ins v0.B[2], v1.B[3]
but if you call
vcopyq_laneq_u8_inner(something, 1, something, 2)
the instruction in the function is
ins v0.B[1], v1.B[2]
The compiler has to know which numbers b and d are, so it knows which instruction you want. If the function is inlined, and the parameters b and d are constant numbers, it's smart enough to do that. However, if you write this function in a way where it's not inlined, the compiler has to make an actual function that works no matter what number the b and d parameters are, and how can it possibly do that if you want it to use a different instruction depending on what they are?
The only way it could do that is to write all 256 possible instructions and switch between them depending on the parameters. However, the compiler won't do that automatically - you'd need to do it yourself. For one thing, the compiler doesn't know that b and d can only go from 0 up to 15.
You should consider either not making this a library function (it's one instruction - doesn't doing a call into a library add overhead?) or else using different instructions where the lane number can come from a register. The instruction ins copies one vector element to another. I'm not familiar with ARM vector instructions, but there should be some instructions to rearrange or select items in a vector according to a number stored in a register.
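One in-register approach, sketched under the assumption that the NEON table-lookup (TBL) and bit-select intrinsics are acceptable (the helper name and the identity-index table are illustrative, not from the question): broadcast the wanted source byte to every lane, build a mask that is all-ones only in the destination lane, and blend.
#include <arm_neon.h>
#include <stdint.h>

static uint8x16_t copy_lane_rt_tbl(uint8x16_t a, int dst_lane, uint8x16_t c, int src_lane)
{
    static const uint8_t iota[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
    uint8x16_t idx   = vld1q_u8(iota);
    uint8x16_t bcast = vqtbl1q_u8(c, vdupq_n_u8((uint8_t)src_lane)); /* byte src_lane of c in every lane */
    uint8x16_t mask  = vceqq_u8(idx, vdupq_n_u8((uint8_t)dst_lane)); /* 0xFF only in lane dst_lane */
    return vbslq_u8(mask, bcast, a);                                 /* take bcast there, a elsewhere */
}
This keeps everything in registers and avoids the store-forwarding stall of the memory round-trip.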
But now we want to compile this function into a dynamic library, so we removed its inline keyword. Now it does not compile successfully, and the error info is:
warning: asm operand 2 probably doesn't match constraints
warning: asm operand 4 probably doesn't match constraints
error: impossible constraint in 'asm'
I guess this error happened because the inline-assembly constraint "i" needs an "immediate integer operand"
In GCC, constraint "i" means "immediate operand", which is a value that is known at link-time or earlier, and that is an integer or an address. For example, the address of a variable in static storage is known at link time, and you can use it just like a known value (provided the assembler supports a relocation for it, which is outside GCC's control).
but the variables 'b' and 'd' are const, aren't they?
const in C basically means read-only, which does not imply the value is known at link-time or earlier.
If that function was inline, and the context (hosting function and compiler optimization) is such that the values turn out to be known, then the constraints can be satisfied.
If the context is such that "i" cannot be satisfied — which is the case for a library function where you don't know the context at compile-time — then gcc will throw an error.
What you can do
One way is to supply the function as static inline in the header that accompanies the library (*.so, *.a, etc.) and describes the library interfaces and public functions. In that case the user is responsible for only using the function in appropriate contexts (or they get that error message thrown at them).
A second way is to rewrite the inline assembly to use instructions which can handle operands that are only known at run-time, e.g. register operands. This is usually less efficient and generates higher register pressure. In the case of a library function, you will also add call overhead just to issue one instruction.
A third way is to combine both approaches and supply the function as static inline in the library header, but write it like:
static inline __attribute__((__always_inline__))
uint8x16_t vcopyq_laneq_u8_inner (uint8x16_t a, int b, uint8x16_t c, int d)
{
uint8x16_t res;
if (__builtin_constant_p (b) && __builtin_constant_p (d))
{
__asm__ __volatile__(
: "ins %[dst].B[%[dlane]], %[src].B[%[sland]]"
: [dst] "=w" (res)
: "0" (a), [dlane] "i" (b), [src] "w" (c), [slane] "i" (d));
}
else
{
/* Use code and constraints that can handle non-"i" b and d,
   e.g. register operands or a store/reload sequence. */
res = a; /* placeholder */
}
return res;
}
This allows the compiler to use the optimal code when b and d satisfy "i", but it makes the function generic enough that it will also work in a broader context.
Apart from that, nothing about that instruction seems to warrant volatile. If, for example, the return value is unused, the instruction is not needed, right? In that case, remove the volatile, which gives the compiler more freedom to optimize and schedule the inline asm.
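For example, the non-volatile form (a sketch using the operand names from the question) can be deleted entirely by the compiler if res turns out to be unused:
__asm__("ins %[dst].B[%[dlane]], %[src].B[%[slane]]"
        : [dst] "=w"(res)
        : "0"(a), [dlane] "i"(b), [src] "w"(c), [slane] "i"(d));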
Related
The following code contains an interrupt-service-routine and a normal function func(), which uses the global flag and additionally the static global g. Without any memory-barrier this code is faulty, since flag is modified asynchronously.
Introducing the global memory barrier <1> fixes that, but it also inhibits the optimizations on g. My expectation was that all accesses to g would be optimized out, since g is not accessible outside this TU.
I know that a global memory barrier has the same effect as calling a non-inline function f() <3>. But the same question applies there: since g is not visible outside this TU, why shouldn't the accesses to g be optimized?
I tried to use a specific memory-barrier against flag but that did not help either.
(I avoided qualifying flag as volatile: that would help here, but it should only be used for accessing HW registers.)
The question now is how to get the accesses to g optimized?
Compiler: avr-gcc
https://godbolt.org/z/ob6YoKx5a
#include <stdint.h>
uint8_t flag;
void isr() __asm__("__vector_5") __attribute__ ((__signal__, __used__, __externally_visible__));
void isr() {
flag = 1;
}
static uint8_t g;
void f();
void func(void) {
for (uint8_t i=0; i<20; i++) {
// f(); // <3>
// __asm__ __volatile__ ("" : : : "memory"); // <1>
// __asm__ __volatile__ ("" : "=m" (flag)); // <2>
++g;
if (flag) {
flag = 0;
}
}
}
//void f(){}
Lots of misconceptions.
There is nothing called a "static global"; that's like saying "dog cat": they are each other's opposites. You can have variables declared at local scope or at file scope. You can have variables with internal linkage or external linkage.
A "global", which isn't a formal term, is a variable declared at file scope with external linkage, which may be referred to by other parts of the program using extern. This is almost always bad practice and bad design.
static ensures that a variable, no matter where it was declared, has internal linkage. So it is by definition not "global". Variables declared static are only accessible inside the scope where they were declared. For details check out What does the static keyword do in C?
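In terms of the variables from the question, the distinction looks like this:
#include <stdint.h>

static uint8_t g;   /* file scope, internal linkage: only this translation unit can name it */
uint8_t flag;       /* file scope, external linkage: other translation units can reach it via extern */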
Memory barriers are not a concept that makes much sense outside multicore systems. The purpose of memory barriers is to prevent the effects of concurrent execution/pipelining, pre-fetching or instruction reordering in multicore systems. Also, memory barriers do not guarantee re-entrancy at the software level.
This is some single-core AVR, one of the simplest CPUs still manufactured, so memory barriers are not an applicable concept. Do not read articles about PC programming on 64-bit x86 and try to apply them to 8-bit legacy architectures from the 1990s. Wrong tool, wrong purpose, wrong system.
volatile does not necessarily act as a memory barrier even on systems where that concept is applicable. See for example https://stackoverflow.com/a/58697222/584518.
volatile does not make code re-entrant/thread-safe/interrupt-safe on any system, including AVR.
The purpose of volatile in this context is to protect against incorrect compiler optimizations when the compiler doesn't realize that an ISR is called by hardware and not by the program. We can't volatile qualify functions or code in C, only objects, hence variables shared with an ISR need to be volatile qualified. Details and examples here: https://electronics.stackexchange.com/questions/409545/using-volatile-in-embedded-c-development/409570#409570
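A minimal sketch of that pattern, reusing the declarations from the question (the poll() consumer is illustrative):
#include <stdint.h>

volatile uint8_t flag;   /* shared with the ISR, so volatile-qualified */

void isr() __asm__("__vector_5") __attribute__ ((__signal__, __used__, __externally_visible__));
void isr() {
    flag = 1;
}

void poll(void) {        /* illustrative consumer */
    if (flag) {          /* note: this read-modify-write is still not atomic;    */
        flag = 0;        /* an interrupt can fire between the load and the store */
    }
}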
As for what you should be doing instead, I believe my answer to your other question here covers that.
__asm__ __volatile__ ("" :: "m" (flag):"memory"); // <2>
You are clobbering all of memory due to "memory".
If you want to express that only flag is changed (and that the change has volatile effect(s)), then:
__asm__ __volatile__ ("" : "+m" (flag));
This tells GCC that flag is changed, not just an input to the asm like in <2>.
@Lundin is correct that if (flag) flag = 0; isn't an atomic RMW, and wouldn't be even with volatile (it's still just a separate atomic load and atomic store; an interrupt could happen between them). See their answer for more about that, and about why this seems like a fundamentally wrong approach for some goals. Also, avoiding volatile only makes sense if you're replacing it with _Atomic; that's what Herb Sutter's 2009 article you linked was saying. Not that you should use plain variables and force memory access via barriers; that is fraught with peril, as compilers can invent loads or invent stores to non-atomic variables, and there are other less-obvious pitfalls. If you're going to roll your own atomics with inline asm, you need volatile; GCC and clang support that usage of volatile since it's what the Linux kernel does, as well as pre-C11 code.
Optimizing away g
The barrier isn't what blocks GCC from optimizing away g entirely. GCC misses this valid optimization even for a function as simple as void func(){g++;} with no other code in the file except the declarations.
But g does get optimized away even with asm("" ::: "memory") if the C code that uses it won't produce a chain of different values across calls. Storing a constant is fine, and storing a constant after an increment is enough to make the increment a dead store. Perhaps GCC's heuristic for optimizing it away only considers a chain of a couple of calls, and doesn't try to prove that nothing value-dependent happens?
#include <stdint.h>
//uint8_t flag;
static uint8_t g;
void func(void) {
__asm__ __volatile__ ("" ::: "memory"); // <1>
int tmp = g;
g = tmp;
++g; g = 1;
// g = tmp+1; // without a later constant store to make this dead, GCC misses the optimization
__asm__ __volatile__ ("" ::: "memory"); // <1>
}
GCC12 -O3 output for AVR, on Godbolt:
func:
ret
The "memory" clobber forces the compiler to assume that g's value could have changed, if it doesn't optimize it away. But it doesn't make all static variables in the compilation implicit inputs/outputs. The asm statement is implicitly volatile because it has no output operands.
Telling the compiler that only flag was read+written should be equivalent. Except if g doesn't get optimized away, GCC can hoist the load of g out of the loop, only storing incremented values. (It misses the optimization of sinking the store out of the loop. That would be legal; the "+m"(flag) operand tells the compiler that flag was read + written so could have any value now, but without a "memory" clobber, the compiler can assume the asm statement didn't read or write any other state of the C abstract machine from registers or memory.
Your statement with an "=m" (flag) output-only operand is different: it tells the compiler that the old value of flag was irrelevant, not an input. So if it was unrolling the loop, any stores to flag before one of those asm statements would be a dead store.
(The asm statement is volatile, so the compiler does have to run it as many times as it is reached in the abstract machine; it has to assume there might be side-effects like I/O to something that isn't a C variable. So the earlier asm statements can't be removed as dead, but that is because of volatile, not because of their operands.)
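To summarize the distinction as code (a sketch; the wrapper function and the extern declaration are only there to make the snippet self-contained):
#include <stdint.h>

extern uint8_t flag;    /* the flag from the question, defined elsewhere */

void barrier_examples(void)
{
    __asm__ __volatile__("" ::: "memory");   /* <1>: full barrier; any memory may have been read or written */
    __asm__ __volatile__("" : "+m"(flag));   /* flag read and written; nothing else assumed touched */
    __asm__ __volatile__("" : "=m"(flag));   /* flag written only; its previous value is treated as dead */
}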
In some project, I must access the machine code instructions of a function for debugging reasons.
My first approach (I decided to do it differently) was to convert a function pointer to the function to a data pointer using a union like this:
void exampleCode(void)
{
volatile union {
uint8_t ** pptr;
void (* pFunc)(void);
} u;
uint8_t ** copyPptr;
uint8_t * resultPtr;
#if 1
u.pptr = (uint8_t **)0x10000000;
#endif
u.pFunc = &myFunction;
copyPptr = u.pptr;
/* The problem is here: */
resultPtr = copyPptr[109];
*resultPtr = 0xA5;
}
If I remove the line u.pptr = (uint8_t **)0x10000000, the compiler assumes that copyPptr two lines below has an undefined value (according to the C standard, converting function pointers to data pointers is not allowed).
For this reason, it "optimizes" the last two lines out (which is of course not wanted)!
However, the volatile keyword means that the C compiler must not optimize away the read operation of u.pptr in the line copyPptr = u.pptr.
From my understanding, this also means that the C compiler must assume that it is at least possible that a "valid" result is read in this read operation, so copyPptr contains some "valid" pointer.
(At least this is how I understand the answers to this question.)
Therefore, the compiler should not optimize away the last two lines.
Is this a compiler bug (GCC for ARM) or is this behavior allowed according to the C standard?
EDIT
As I have already written, I already have another solution for my problem.
I asked this question because I'm trying to understand why the union solution did not work to avoid similar problems in the future.
In the comments, you asked me for my exact use case:
For debugging reasons, I need to access some static variable in some library that comes as object code.
The disassembly looks like this:
.text
.global myFunction
myFunction:
...
.L1:
.word .theVariableOfInterest
.data:
# unfortunately not .global!
.theVariableOfInterest:
.word 0
What I'm doing is this:
I set copyPtr to myFunction. If K is (.L1-myFunction)/4, then copyPtr+K points to .L1 and copyPtr[K] points to .theVariableOfInterest.
So I can access the variable using *(copyPtr[K]).
From the disassembly of the object file, I can see that K is 109.
All this works fine if copyPtr contains the correct pointer.
EDIT 2
It's much simpler to reproduce the problem with gcc -O3:
char a, c;
char * b = &a;
void testCode(void)
{
char * volatile x;
c = 'A';
*x = 'X';
}
It's true that uninitialized variables cause undefined behavior.
However, if I understand the description of volatile in the draft "ISO/IEC 9899:TC3" of the C99 standard correctly, a volatile variable would not be "uninitialized" if I stop the program in the debugger (using a breakpoint) at the line c = 'A' and copy the value of b to x in the debugger.
On the other hand, some other statement was saying that using volatile for local variables in current C++ (not C) versions already causes undefined behavior ...
Now I'm wondering what is true for modern C (not C++) versions.
I'd personally try compiling with the no-optimization flag -O0 (capital O, zero) to avoid the compiler "optimizing" out your code, and see if it works then. If it's still acting problematic, consider adding the -S flag so you can see what assembly those instructions actually get translated into, and then re-write them yourself as inline assembly with __asm__.
I have the following problem (g++ (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4):
When I use _mm256_slli_si256() directly, such as:
__m256i x = _mm256_set1_epi8(0xff);
x = _mm256_slli_si256(x, 3);
the code compiles without problem (g++ -Wall -march=native -O3 -o shifttest shifttest.C).
However, if I wrap it into a function
__m256i doit(__m256i x, const int imm)
{
return _mm256_slli_si256(x, imm);
}
the compiler complains that
/usr/lib/gcc/x86_64-linux-gnu/4.8/include/avx2intrin.h: In function '__m256i doit(__m256i, int)':
/usr/lib/gcc/x86_64-linux-gnu/4.8/include/avx2intrin.h:651:58: error: the last argument must be an 8-bit immediate
return (__m256i)__builtin_ia32_pslldqi256 (__A, __N * 8);
regardless of whether the function is used or not.
This can't be a problem with the immediate operand, since the function doit() compiles if I use e.g. _mm256_slli_epi32(x, imm) instead, and _mm256_slli_epi32() also requires an immediate operand.
There is a related bug report on
https://gcc.gnu.org/bugzilla/show_bug.cgi?format=multiple&id=54825
but it is quite old (2012) and relates to gcc 4.8.0, so I thought the patch would have been incorporated into g++ 4.8.4 already.
Is there a workaround for this problem?
The argument indicating the number of bits to shift must be a compile-time constant, as it is encoded as an immediate value in the instruction (i.e. not loaded from a register; the actual shift value is part of the instruction encoding). As long as you use it directly, like this:
__m256i x = _mm256_set1_epi8(0xff);
x = _mm256_slli_si256(x, 3);
then the compiler sees the shift value as a compile-time constant, 3. However, when in the context of your wrapping function:
__m256i doit(__m256i x, const int imm)
{
return _mm256_slli_si256(x, imm);
}
there is no way for the compiler to infer the value of imm at compile time, which is required in order for it to synthesize the shift instruction. The fact that imm is a const int doesn't mean that its value is known at compile time, only that the semantics of the language don't allow it to be modified within the doit() function scope.
It's possible that if doit() were to be inlined by the compiler, then it may be able to statically determine the value of imm and therefore compile successfully, but that may be going too far out on a limb.
If you're using C++, another option would be to make doit() a function template with an argument indicating the shift size, like this:
template <int Shift>
__m256i doit(__m256i x)
{
return _mm256_slli_si256(x, Shift);
}
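A usage sketch (assuming the doit template above and the AVX2 header): the shift count is now part of the instantiation, so it is a compile-time constant at every call site.
__m256i x = _mm256_set1_epi8(0xff);
__m256i y = doit<3>(x);   // the 3 is baked in at compile time, so it can become the immediate in vpslldq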
The problem is due to your function being public (i.e. callable by functions in other C/C++ modules). If you declare it as static (or inline), the compiler does not have to emit a stand-alone version of the function, so as long as every call gets inlined with a compile-time-constant shift count (or the function is unused), you won't get an error.
In a code, I have following declaration
#if GCC == 1
#define SET_STACK(s) asm("movl temp,%esp");
...
#endif
In the code, this macro is used at exactly one place, and on that line the compiler reports an undefined reference to 'temp'.
temp = (int*)some_pointer;
SET_STACK(temp);
The temp variable is declared as a global volatile void pointer:
volatile void* temp;
Is there any syntax problem with the inline assembly?
As I understand it, the inline assembly tries to load the value of temp (not the dereferenced value, but the pointer itself).
You have to use extended asm to pass C operands to the assembler: read the manual. (Note: as you did not specify which version you are using, I just picked one.)
Do not forget to add the registers used in the assembly to the clobber list. You should also make the asm statement asm volatile.
Depending on your execution environment, it might be a very bad idea to manually manipulate the stack pointer! At the very least you should put that into a __attribute__((naked)) function, not a macro. The trailing ; in the macro is definitely wrong; you will already have one right after the macro invocation (two semicolons might break conditional statements!).
If you want to use C variables in GCC inline assembly, you have to make use of the Extended ASM syntax, e.g.:
volatile int temp = 0;
asm("movl %0,%%esp"
: /* No outputs. */
: "rm" (temp)
);
I'm reading the Linux kernel source code (3.12.5, x86_64) to understand how the process descriptor is handled.
I found that to get the current process descriptor I could use the current_thread_info() function, which is implemented as follows:
static inline struct thread_info *current_thread_info(void)
{
struct thread_info *ti;
ti = (void *)(this_cpu_read_stable(kernel_stack) +
KERNEL_STACK_OFFSET - THREAD_SIZE);
return ti;
}
Then I looked into this_cpu_read_stable():
#define this_cpu_read_stable(var) percpu_from_op("mov", var, "p" (&(var)))
#define percpu_from_op(op, var, constraint) \
({ \
typeof(var) pfo_ret__; \
switch (sizeof(var)) { \
...
case 8: \
asm(op "q "__percpu_arg(1)",%0" \
: "=r" (pfo_ret__) \
: constraint); \
break; \
default: __bad_percpu_size(); \
} \
pfo_ret__; \
})
#define __percpu_arg(x) __percpu_prefix "%P" #x
#ifdef CONFIG_SMP
#define __percpu_prefix "%%"__stringify(__percpu_seg)":"
#else
#define __percpu_prefix ""
#endif
#ifdef CONFIG_X86_64
#define __percpu_seg gs
#else
#define __percpu_seg fs
#endif
The expanded macro should be inline asm code like this:
asm("movq %%gs:%P1,%0" : "=r" (pfo_ret__) : "p"(&(kernel_stack)));
According to this post the input constraint used to be "m"(kernel_stack), which makes sense to me. But, obviously to improve performance, Linus changed the constraint to "p" and passed the address of the variable:
It uses a "p" (&var) constraint instead of a "m" (var) one, to make gcc
think there is no actual "load" from memory. This obviously _only_ works
for percpu variables that are stable within a thread, but 'current' and
'kernel_stack' should be that way.
Also, in the post Tejun Heo made these comments:
Added the magical undocumented "P" modifier to UP __percpu_arg()
to force gcc to dereference the pointer value passed in via the
"p" input constraint. Without this, percpu_read_stable() returns
the address of the percpu variable. Also added comment explaining
the difference between percpu_read() and percpu_read_stable().
But my experiments combining the "P" modifier with the "p" (&var) constraint did not work. If the segment register is not specified, "%P1" always gives the address of the variable; the pointer was not dereferenced. I have to use parentheses to dereference it, like "(%P1)". If the segment register is specified, gcc won't even compile it without the parentheses. My test code is as follows:
#include <stdio.h>
#define current(var) ({\
typeof(var) pfo_ret__;\
asm(\
"movq %%es:%P1, %0\n"\
: "=r"(pfo_ret__)\
: "p" (&(var))\
);\
pfo_ret__;\
})
int main () {
struct foo {
int field1;
int field2;
} a = {
.field1 = 100,
.field2 = 200,
};
struct foo *var = &a;
printf ("field1: %d\n", current(var)->field1);
printf ("field2: %d\n", current(var)->field2);
return 0;
}
Is there anything wrong with my code? Or do I need to pass some extra options to gcc? Also, when I used gcc -S to generate assembly code, I didn't see any optimization from using "p" over "m". Any answers or comments are much appreciated.
The reason your example code doesn't work is that the "p" constraint is only of very limited use in inline assembly. All inline assembly operands have the requirement that they be representable as an operand in assembly language. If the operand isn't representable, the compiler makes it so by moving it to a register first and substituting that as the operand. The "p" constraint places an additional restriction: the operand must be a valid address. The problem is that a register isn't a valid address. A register can contain an address, but a register is not itself a valid address.
That means the operand of the "p" constraint must have a valid assembly representation as is and be a valid address. You're trying to use the address of a variable on the stack as the operand. While this is a valid address, it's not a valid operand. The stack variable itself has a valid representation (something like 8(%rbp)), but the address of the stack variable doesn't. (If it were representable it would be something like 8 + %rbp, but this isn't a legal operand.)
One of the few things that you can take the address of and use as an operand with the "p" constraint is a statically allocated variable. In this case it's a valid assembly operand, as it can be represented as an immediate value (eg. &kernel_stack can be represented as $kernel_stack). It's also a valid address and so satisfies the constraint.
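A minimal user-space sketch of a case where "p" together with the "P" modifier can be satisfied (the variable and function names are made up, there is no segment prefix as in the kernel's UP case, and optimization should be enabled):
static long kernel_stack;   /* stand-in for a statically allocated per-CPU variable */

long read_stable(void)
{
    long ret;
    /* "%P1" prints the bare symbol, so this expands to: movq kernel_stack, %reg */
    asm("movq %P1, %0" : "=r"(ret) : "p"(&kernel_stack));
    return ret;
}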
So that's why the Linux kernel macro works and your macro doesn't. You're trying to use it with stack variables, while the kernel only uses it with statically allocated variables.
Or at least with what looks like a statically allocated variable to the compiler. In fact kernel_stack is allocated in a special section used for per-CPU data. This section doesn't exist at run time; instead it's used as a template to create a separate region of memory for each CPU. The offset of kernel_stack in this special section is used as the offset in each per-CPU data region to store a separate kernel stack value for each CPU. The FS or GS segment register is used as the base of this region, with each CPU using a different address as the base.
So that's why the Linux kernel uses inline assembly to access what otherwise looks like a static variable. The macro is used to turn the static variable into a per-CPU variable. If you're not trying to do something like this then you probably don't have anything to gain by copying the kernel macro. You should probably be considering a different way to do what you're trying to accomplish.
Now, if you're thinking that since Linus Torvalds came up with this optimization in the kernel, replacing an "m" constraint with a "p", it must be a good idea to use it generally, you should be very aware of how fragile this optimization is. What it's trying to do is fool GCC into thinking that the reference to kernel_stack doesn't actually access memory, so that it won't keep reloading the value every time memory changes. The danger here is that if kernel_stack does change, then the compiler will be fooled and continue to use the old value. Linus knows when and how the per-CPU variables are changed, and so can be confident that the macro is safe when used for its intended purpose in the kernel.
If you want to eliminate redundant loads in your own code, I suggest using -fstrict-aliasing and/or the restrict keyword. That way you're not dependent on fragile and non-portable inline assembly macros.
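For instance, a sketch of the restrict suggestion (the function and parameter names are illustrative): promising that stores through dst cannot alias *factor lets the compiler load *factor once instead of reloading it after every store.
void scale(long *restrict dst, const long *src, const long *factor, int n)
{
    /* With restrict on dst, the stores to dst[i] cannot modify *factor,
       so the compiler may hoist the load of *factor out of the loop. */
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * *factor;
}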