ARMv8: XZR use in LDUR/STUR instructions assembler error - arm

It is supposed XZR can be used in any register field of ARMv8 ISA, but when I try use in indexed addressing for load/store intruction, I always obtain errors:
ex2.s:3: Error: integer 64-bit register expected at operand 2 -- `ldr X1,[XZR,#0]'
ex2.s:4: Error: integer 64-bit register expected at operand 2 -- `ldur X2,[XZR,#1]'
ex2.s:8: Error: integer 64-bit register expected at operand 2 -- `stur X1,[XZR,#2]'
ex2.s:9: Error: integer 64-bit register expected at operand 2 -- `stur X3,[XZR,#0]'
ex2.s:14: Error: integer 64-bit register expected at operand 2 -- `stur X4,[XZR,#2]'
It is easy to bypass it, for example with the previous:
MOV X0, XZR
and then use X0, but I would like the reason.
Thanks in advance.

Related

STM32 MCU GCC Compilation behavior

I have some misunderstanding about MCU GCC compilation behavior regarding function that return other things that 32bits value.
MCU: STM32 L0 Series (STM32L083)
GCC : gcc version 7.3.1 20180622 (release) [ARM/embedded-7-branch revision 261907] (GNU Tools for Arm Embedded Processors 7-2018-q2-update)
My code is optimized for size (with option -Os ). In my understanding, this will allow the gcc to use implicit -fshort-enums in order to pack enums.
I have two enum var, 1-byte wide :
enum eRadioMode radio_mode // (# 0x20003200)
enum eRadioFunction radio_func // (# 0x20003201)
And a function :
enum eRadioMode radio_get_mode(enum eRadioFunction _radio_func);
When i call this bunch of code :
radio_mode = radio_get_mode(radio_func);
It will produce this bunch of ASM at compile time:
; At this point :
; r4 value is 0x20003201 (Address of radio_func)
7820 ldrb r0, [r4, #0] ; GCC treat correctly r4 as a pointer to 1 byte wide var, no problem here
f7ff ffcd bl 80098a8 <radio_get_mode> ; Call to radio_get_mode()
4d1e ldr r5, [pc, #120] ; r5 is loaded with 0x20003200 (Address of radio_mode)
6028 str r0, [r5, #0] ; Why GCC use 'str' and not 'strb' at this point ?
The last line here is the problem : The value of r0, return value of radio_get_mode(), is stored into address pointed by r5, as a 32bit value.
Since radio_func is 1 byte after radio_mode, its value is overwritten by the second byte of r0 (that is always 0x00 since enum is only 1 byte wide).
As my function radio_get_mode is declared as returning 1 single byte, why GCC doesn't use instruction strb in order to save this single byte into the address pointed by r5 ?
I have tried :
radio_get_mode() as returning uint8_t : uint8_t radio_get_mode(enum eRadioFunction _radio_func);
Forcing cast to uint8_t : radio_mode = (uint8_t)radio_get_mode(radio_func);
Passing by a third var (but GCC cancel that useless move at compile - not so dumb) :
uint32_t r = radio_get_mode(radio_func);
radio_mode = (uint8_t) r;
But none of these solutions work.
Since the size optimization (-Os) is needed in first sight to reduce rom usage (and not ram - at this time of my project -) I found that the workaround gcc option -fno-short-enums will let the compiler to use 4 bytes by enum, discarding by the way any overlapping memory in this case.
But, in my opinion, this is a dirty way to hide a real problem here :
Is GCC not able to correctly handle other return size than 32bit ?
There is a correct way to do that ?
Thanks in advance.
EDIT :
I did NOT use -f-short-enums at any moment.
I'm sure that these enum has no value greater than 0xFF
I have tried to declare radio_mode and radio_func as uint8_t (aka unsigned char) : The problem is the same.
When compiled with -Os, Output.map is as follow :
Common symbol size file
...
radio_mode 0x1 src/radio/radio.o
radio_func 0x1 src/radio/radio.o
...
...
...
Section address label
0x2000319c radio_state
0x20003200 radio_mode
0x20003201 radio_func
0x20003202 radio_protocol
...
The output of the mapfile show clearly that radio_mode and radio_func is 1 byte wide and at following address.
When compiled without -Os, Output.map show clearly that enums become 4 byte wide (with address padded to 4).
When compiled with -Os and -fno-short-enums, do the same things that without -Os for all enums (This is why I guess -Os implies implicit -f-short-enums)
I will try to provide minimal reproducible example
My analysis of the problem is that I'm pretty sure it is a compiler bug. For me, this is clearly a memory overlapping. My question is more about the best things to do in order to avoid this - in the "best practice" way.
EDIT 2
It is my bad, I have re-tester changing all signature to uint8_t (aka unsigned char) and it work well.
#Peter Cordes seems to found the problem here : When using it, -Os is partly enabling -fshort-enums, getting some parts of GCC to treat it as size 1 and other parts to treat it as size 4.
ASM code using only uint8_t is :
; Same position than before
7820 ldrb r0, [r4, #0]
f7ff ffcd bl 80098a8 <radio_get_mode>
4d1e ldr r5, [pc, #120]
7028 strb r0, [r5, #0] ; Yes ! GCC use 'strb' and not 'str' like before !
To clarify :
It seems to have compiler bug when using -Os and enums. This is bad luck that two enum is at consecutive adresses that overlap.
Using -fno-short-enums in conjonction with -Os appear to be a good workaround IMO, since the problem is concerning only enum, and not all 1 byte var at all.
Thanks again.
ARM port abi defines none-aebi enums to be a variable sized type, linux-eabi to be standards fixed one.
That is the reason the behaviour you observe. It is not related to the optimisation.
In this example you can see how it works. https://godbolt.org/z/-mY_WY

Array of registers in asm

How would you define a pointer to a XMM register in asm()?
Like accessing array elements in a loop how can you access registers in asm using a counter?
I tried to do it in the following code:
float *f=(float*)_aligned_malloc(64,16);
for(int i=0;i<4;i++)
asm volatile
(
"movaps (%1),%%xmm%0"
:
:"r"(i),"r"(f+4*i)
:"%xmm%0"
);
But the compiler gives me this error:
unknown register name '%xmm%0' in 'asm'
This sounds like a horrible idea compared to using assembler macros or actually manual unrolling. Your code would totally break if gcc decided not to fully unroll the loop, because it can only work with compile-time constant indexing.
Also, there's no way to tell the compiler which register you're putting the result in, so this is basically useless. I'm only answering as a silly exercise in using GNU C inline-asm syntax, not because this answer is possibly useful in any project.
That said, you can do it using an "i" constraint and a c operand modifier to format the immediate as a bare number, like 1 instead of $1.
void *_aligned_malloc(int, int);
void foo()
{
float *f=(float*)_aligned_malloc(64,16);
for(int i=0;i<4;i++) {
asm volatile (
"movaps %[input],%%xmm%c[regnum]"
:
// only compiles with optimization enabled.
:[regnum] "i"(i), [input] "m"(f[4*i])
:"%xmm0", "%xmm1", "%xmm2", "%xmm3"
);
}
}
gcc and clang, with -O3, are able to fully unroll and make i for each iteration a compile-time constant that can match an "i" constraint. This compiles on Godbolt.
# gcc7.3 -O3
foo():
subq $8, %rsp
movl $16, %esi
movl $64, %edi
call _aligned_malloc(int, int) # from a dummy prototype so it compiles
movaps (%rax),%xmm0
movaps 16(%rax),%xmm1 # compiler can use addressing modes because I switched to an "m" constraint
movaps 32(%rax),%xmm2
movaps 48(%rax),%xmm3
vzeroupper # XMM clobbers also include YMM, and I guess gcc assumes you might have dirtied the upper lanes.
addq $8, %rsp
ret
Note that I've only told the compiler about reading the first float of every group of 4.
ICC -O3 says catastrophic error: Cannot match asm operand constraint even with -O3. With optimization disabled, gcc and clang have the same problem, of course. For example, gcc -O0 will say:
<source>: In function 'void foo()':
<source>:11:10: warning: asm operand 0 probably doesn't match constraints
);
^
<source>:11:10: error: impossible constraint in 'asm'
Compiler returned: 1
Because without optimization, i isn't a compile-time constant and can't match an "i" (immediate) constraint.
Obviously you can't use an "r" constraint; that would fill in the asm template with something like %xmm%eax if the compiler picked eax.
Anyway, this is useless because you can't use destination register. All you can do is tell the compiler that all of the possible destination registers are clobbered. It's not safe to write to a clobbered register in one asm statement and then assume the value is still there in a later asm statement.
x86, like all other architectures, can't index the architectural registers using a runtime value. Register numbers must be hard-coded into the instruction stream.
(Some microcontrollers, like AVR, have memory-mapped registers, so you can index them by indexing the memory that aliases the register file. But this is rare, and x86 doesn't do it. It would interfere with out-of-order execution in a similar way to self-modifying code. And BTW, SMC (or branching to one of 16 different versions of an instruction) is the only option for runtime indexing of the register file.)
You can't -- there is no way to index into the register file.
If you want to use multiple registers in sequence, you will need to unroll the loop and name each of the registers explicitly.

MSVC Inline ASM to GCC

I'm trying to handle both MSVC and GCC compilers while updating this code base to work on GCC. But I'm unsure exactly how GCCs inline ASM works. Now I'm not great at translating ASM to C else I would just use C instead of ASM.
SLONG Div16(signed long a, signed long b)
{
signed long v;
#ifdef __GNUC__ // GCC doesnt work.
__asm() {
#else // MSVC
__asm {
#endif
mov edx, a
mov ebx, b
mov eax, edx
shl eax, 16
sar edx, 16
idiv ebx
mov v, eax
}
return v;
}
signed long ROR13(signed long val)
{
_asm{
ror val, 13
}
}
I assume ROR13 works something like (val << 13) | (val >> (32 - 13)) but the code doesn't produce the same output.
What is the proper way to translate this inline ASM to GCC and/or whats the C translation of this code?
GCC uses a completely different syntax for inline assembly than MSVC does, so it's quite a bit of work to maintain both forms. It's not an especially good idea, either. There are many problems with inline assembly. People often use it because they think it'll make their code run faster, but it usually has quite the opposite effect. Unless you're an expert in both assembly language and the compiler's code-generation strategies, you are far better off letting the compiler's optimizer generate the code.
When you try to do that, you will have to be a bit careful here, though: signed right shifts are implementation-defined in C, so if you care about portability, you need to cast the value to an equivalent unsigned type:
#include <limits.h> // for CHAR_BIT
signed long ROR13(signed long val)
{
return ((unsigned long)val >> 13) |
((unsigned long)val << ((sizeof(val) * CHAR_BIT) - 13));
}
(See also Best practices for circular shift (rotate) operations in C++).
This will have the same semantics as your original code: ROR val, 13. In fact, MSVC will generate precisely that object code, as will GCC. (Clang, interestingly, will do ROL val, 19, which produces the same result, given the way that rotations work. ICC 17 generates an extended shift instead: SHLD val, val, 19. I'm not sure why; maybe that's faster than rotation on certain Intel processors, or maybe it's the same on Intel but slower on AMD.)
To implement Div16 in pure C, you want:
signed long Div16(signed long a, signed long b)
{
return ((long long)a << 16) / b;
}
On a 64-bit architecture that can do native 64-bit division, (assuming long is still a 32-bit type like on Windows) this will be transformed into:
movsxd rax, a # sign-extend from 32 to 64, if long wasn't already 64-bit
shl rax, 16
cqo # sign-extend rax into rdx:rax
movsxd rcx, b
idiv rcx # or idiv b if the inputs were already 64-bit
ret
Unfortunately, on 32-bit x86, the code isn't nearly as good. Compilers emit a call into their internal library function that provides extended 64-bit division, because they can't prove that using a single 64b/32b => 32b idiv instruction won't fault. (It will raise a #DE exception if the quotient doesn't fit in eax, rather than just truncating)
In other words, transforming:
int32_t Divide(int64_t a, int32_t b)
{
return (a / b);
}
into:
mov eax, a_low
mov edx, a_high
idiv b # will fault if a/b is outside [-2^32, 2^32-1]
ret
is not a legal optimization—the compiler is unable to emit this code. The language standard says that a 64/32 division is promoted to a 64/64 division, which always produces a 64-bit result. That you later cast or coerce that 64-bit result to a 32-bit value is irrelevant to the semantics of the division operation itself. Faulting for some combinations of a and b would violate the as-if rule, unless the compiler can prove that those combinations of a and b are impossible. (For example, if b was known to be greater than 1<<16, this could be a legal optimization for a = (int32_t)input; a <<= 16; But even though this would produce the same behaviour as the C abstract machine for all inputs, gcc and clang
currently don't do that optimization.)
There simply isn't a good way to override the rules imposed by the language standard and force the compiler to emit the desired object code. MSVC doesn't offer an intrinsic for it (although there is a Windows API function, MulDiv, it's not fast, and just uses inline assembly for its own implementation—and with a bug in a certain case, now cemented thanks to the need for backwards compatibility). You essentially have no choice but to resort to assembly, either inline or linked in from an external module.
So, you get into ugliness. It looks like this:
signed long Div16(signed long a, signed long b)
{
#ifdef __GNUC__ // A GNU-style compiler (e.g., GCC, Clang, etc.)
signed long quotient;
signed long remainder; // (unused, but necessary to signal clobbering)
__asm__("idivl %[divisor]"
: "=a" (quotient),
"=d" (remainder)
: "0" ((unsigned long)a << 16),
"1" (a >> 16),
[divisor] "rm" (b)
:
);
return quotient;
#elif _MSC_VER // A Microsoft-style compiler (i.e., MSVC)
__asm
{
mov eax, DWORD PTR [a]
mov edx, eax
shl eax, 16
sar edx, 16
idiv DWORD PTR [b]
// leave result in EAX, where it will be returned
}
#else
#error "Unsupported compiler"
#endif
}
This results in the desired output on both Microsoft and GNU-style compilers.
Well, mostly. For some reason, when you use the rm constraint, which gives the compiler to freedom to choose whether to treat the divisor as either a memory operand or load it into a register, Clang generates worse object code than if you just use r (which forces it to load it into a register). This doesn't affect GCC or ICC. If you care about the quality of output on Clang, you'll probably just want to use r, since this will give equally good object code on all compilers.
Live Demo on Godbolt Compiler Explorer
(Note: GCC uses the SAL mnemonic in its output, instead of the SHL mnemonic. These are identical instructions—the difference only matters for right shifts—and all sane assembly programmers use SHL. I have no idea why GCC emits SAL, but you can just convert it mentally into SHL.)

"Error: unsupported relocation against <register>" error with inline PPC assembly code in .c file

I have below inline assembly code. But when i try to compile it, It throws error mentioned after the code snippet.
unsigned int func(void)
{
__asm__ ("mfspr r3, svr;");
}
Below are the errors.
{standard input}: Assembler messages:
{standard input}:3349: Error: unsupported relocation against r3
{standard input}:3349: Error: unsupported relocation against svr
{standard input}:3375: Error: unsupported relocation against r3
{standard input}:3375: Error: unsupported relocation against svr
{standard input}:3510: Error: unsupported relocation against r3
{standard input}:3510: Error: unsupported relocation against svr
{standard input}:3517: Error: unsupported relocation against r3
{standard input}:3517: Error: unsupported relocation against svr
Can anyone help me fixing these?
Apparently gas has no built-in support for these registers. In order to use those you should either define them yourself or use their indexes explicitly like:
mfspr 3, <some_index_here>
Alternatively you could include: ppc_asm.tmpl.
If your core is an e500 then svr index would be 1023.
You should specify inputs and outputs explicitly. As written, your ASM block may be optimized out!
unsigned int func(void)
{
unsigned x;
__asm__("mfspr %0, svr" : "=b"(x));
return x;
}
The compiler is smart enough to figure out that the register should be r3. (That's one of the compiler's main jobs: register allocation to minimize extra moves.)
If you leave out the output specification, and then compile with optimization enabled, you may find that your function is empty, without the mfspr opcode anywhere to be found.
At least some of the errors will go away if you pass -mregnames option to the assembler. (-Wa,-mregnames). gas 2.19 supports the following symbolic register names for PPC (from binutils-2.19/gas/config/tc-ppc.c):
/* List of registers that are pre-defined:
Each general register has predefined names of the form:
1. r<reg_num> which has the value <reg_num>.
2. r.<reg_num> which has the value <reg_num>.
Each floating point register has predefined names of the form:
1. f<reg_num> which has the value <reg_num>.
2. f.<reg_num> which has the value <reg_num>.
Each vector unit register has predefined names of the form:
1. v<reg_num> which has the value <reg_num>.
2. v.<reg_num> which has the value <reg_num>.
Each condition register has predefined names of the form:
1. cr<reg_num> which has the value <reg_num>.
2. cr.<reg_num> which has the value <reg_num>.
There are individual registers as well:
sp or r.sp has the value 1
rtoc or r.toc has the value 2
fpscr has the value 0
xer has the value 1
lr has the value 8
ctr has the value 9
pmr has the value 0
dar has the value 19
dsisr has the value 18
dec has the value 22
sdr1 has the value 25
srr0 has the value 26
srr1 has the value 27
The table is sorted. Suitable for searching by a binary search. */
This is the helpful error message that gas emits when it is trying to say that it doesn't know that "r3" and "svr" are the names of registers. gas expects numbers instead of register names for register operands. You'd get similar error messages if you tried
__asm__ ("mfspr foo, fum;");
In other words, gas is interpreting the register names as arbitrary symbols.

Tiny C Compiler: "error: unknown opcode 'jmp'"

Given this code:
int main(void)
{
__asm volatile ("jmp %eax");
return 0;
}
32-bit TCC will complain with:
test.c:3: error: unknown opcode 'jmp'
but the 64-bit version will compile just fine.
What's the problem with the 32 bit code?
The solution is to simply add a star (*) before the register, like this:
__asm volatile ("jmp *%eax");
I'm not exactly sure what the star means. According to this SO post:
The star is some syntactical sugar indicating that control is to be passed indirectly, by reference/pointer.
As for why it works with 64-bit TCC, I assume that it's a bug; 64-bit GCC complains with Error: operand type mismatch for 'jmp', as it should.

Resources