I'm working on an embedded project and I'm trying to add more structure to some of the code, which uses macros to optimize access to registers for USARTs. I'd like to organize preprocessor #define'd register addresses into const structures. If I define the structs as compound literals in a macro and pass them to inline'd functions, gcc has been smart enough to bypass the pointer in the generated assembly and hardcode the structure member values directly in the code. E.g.:
C1:
struct uart {
volatile uint8_t * ucsra, * ucsrb, *ucsrc, * udr;
volatile uint16_t * ubrr;
};
#define M_UARTX(X) \
( (struct uart) { \
.ucsra = &UCSR##X##A, \
.ucsrb = &UCSR##X##B, \
.ucsrc = &UCSR##X##C, \
.ubrr = &UBRR##X, \
.udr = &UDR##X, \
} )
void inlined_func(const struct uart * p, other_args...) {
...
(*p->ucsra) = 0;
(*p->ucsrb) = 0;
(*p->ucsrc) = 0;
}
...
int main(){
...
inlined_func(&M_UARTX(0), other_parms...);
...
}
Here UCSR0A, UCSR0B, &c, are defined as the uart registers as l-values, like
#define UCSR0A (*(uint8_t*)0xFFFF)
gcc was able to eliminate the structure literal entirely, and all assignments like that shown in inlined_func() write directly into the register address, w/o having to read the register's address into a machine register, and w/o indirect addressing:
A1:
movb $0, UCSR0A
movb $0, UCSR0B
movb $0, UCSR0C
This writes the values directly into the USART registers, w/o having to load the addresses into a machine register, and so never needs to generate the struct literal into the object file at all. The struct literal becomes a compile-time structure, with no cost in the generated code for the abstraction.
I wanted to get rid of the use of the macro, and tried using a static constant struct defined in the header:
C2:
#define M_UART0 M_UARTX(0)
#define M_UART1 M_UARTX(1)
static const struct uart * const uart[2] = { &M_UART0, &M_UART1 };
....
int main(){
...
inlined_func(uart[0], other_parms...);
...
}
However, gcc cannot remove the struct entirely here:
A2:
movl __compound_literal.0, %eax
movb $0, (%eax)
movl __compound_literal.0+4, %eax
movb $0, (%eax)
movl __compound_literal.0+8, %eax
movb $0, (%eax)
This loads the register addresses into a machine register, and uses indirect addressing to write to the register. Does anyone know any way I can convince gcc to generate A1 assembly code for C2 C code? I've tried various uses of the __restrict modifier, to no avail.
After many years of experience with UARTs and USARTs, I have come to these conclusions:
Don't use a struct for a 1:1 mapping with UART registers.
Compilers can add padding between struct members without your knowledge, thus messing up the 1:1 correspondence.
Writing to UART registers is best done directly or through a function.
Remember to use volatile modifier when defining pointers to the registers.
Very little performance gain with Assembly language
Assembly language should only be used if the UART is accessed through processor ports rather than memory-mapped. The C language has no support for ports. Accessing UART registers through pointers is very efficient (generate an assembly language listing and verify). Sometimes, it may take more time to code in assembly and verify.
Isolate UART functionality into a separate library
This is a good candidate. Besides, once the code has been tested, let it be. Libraries don't have to be (re)compiled all the time.
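The "access through a function, via a volatile pointer" advice above could be sketched roughly as follows. The names uart_reg_write, uart_reg_read and fake_udr are made up for illustration, and fake_udr is a RAM variable standing in for a real memory-mapped register so the sketch can run on a host:

```c
#include <stdint.h>

/* Stand-in for a memory-mapped UART data register (assumption: on real
   hardware this would be a fixed address taken from the device header). */
static volatile uint8_t fake_udr;

/* volatile on the pointee tells the compiler that every access matters,
   so the store below may not be removed or merged with a neighbour. */
static inline void uart_reg_write(volatile uint8_t *reg, uint8_t value)
{
    *reg = value;
}

static inline uint8_t uart_reg_read(const volatile uint8_t *reg)
{
    return *reg;
}
```

With optimization enabled, a call such as uart_reg_write(&fake_udr, 'A') would normally inline to a single store, which is easy to confirm in the assembly listing.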
Using structs "across compile domains" is a cardinal sin in my book. That means using a struct to point at something, anything: file data, memory, etc. The reason is that it will fail; it is not reliable, no matter the compiler. There are many compiler-specific flags and pragmas for this, but the better solution is to just not do it. If you want to point at address plus 8, point at address plus 8: use a pointer or an array. In this specific case I have had way too many compilers fail to do that as well, so I write assembler PUT32/GET32 and PUT16/GET16 functions to guarantee that the compiler doesn't mess with my register accesses. As with structs, you will get burned one day and have a hell of a time figuring out why your 32-bit register only had 8 bits written to it. The overhead of the jump to the function is worth the peace of mind and the reliability and portability of the code. This also makes your code extremely portable: you can put wrappers in for the put and get functions to cross networks, run your hardware in an HDL simulator and reach into the simulation to read/write registers, etc., with a single chunk of code that doesn't change from simulation to embedded to OS device driver to application-layer function.
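A rough sketch of the PUT32/GET32 idea, with the backend swapped for a host-side simulation. This is not the assembler version described above: here PUT32 and GET32 are plain C standing in for it, and SIM_BASE, sim_regs, UART_CTRL and uart_enable are hypothetical names invented for the example:

```c
#include <stdint.h>

/* Hypothetical simulated register file; on a real target PUT32/GET32 would
   be small assembler routines doing one 32-bit store or load each. */
#define SIM_BASE  0x40000000u
#define SIM_WORDS 16u
static uint32_t sim_regs[SIM_WORDS];

void PUT32(uint32_t addr, uint32_t value)
{
    sim_regs[((addr - SIM_BASE) / 4u) % SIM_WORDS] = value;
}

uint32_t GET32(uint32_t addr)
{
    return sim_regs[((addr - SIM_BASE) / 4u) % SIM_WORDS];
}

/* Driver code only sees addresses and the two accessors, so the same source
   can be linked against a hardware, simulator or network backend. */
#define UART_CTRL (SIM_BASE + 0x08u)   /* hypothetical register address */

void uart_enable(void)
{
    PUT32(UART_CTRL, GET32(UART_CTRL) | 1u);
}
```

Only the two accessor functions change between backends; uart_enable stays the same.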
Based on the register set, it looks like you are using an 8-bit Atmel AVR microcontroller (or something extremely similar). I'll show you some things I've used for Atmel's 32-bit ARM MCUs, which are a slightly modified version of what they ship in their device packs.
Code Notation
I'm using various macros that I'm not going to include here, but they are defined to do basic operations or paste types (like UL) onto numbers. They are hidden in macros for the cases where something is not allowed (like in assembly). Yes, these are easy to break - it's on the programmer not to shoot themselves in the foot:
#define _PPU(_V) (_V##U) /* guarded with #if defined(__ASSEMBLY__) */
#define _BV(_V) (_PPU(1) << _PPU(_V)) /* Variants for U, L, UL, etc */
There are also typedefs for specific length registers. Example:
/* Variants for 8, 16, 32-bit, RO, WO, & RW */
typedef volatile uint32_t rw_reg32_t;
typedef volatile const uint32_t ro_reg32_t;
The classic #define method
You can define the peripheral address with any register offsets...
#define PORT_REG_ADDR _PPUL(0x41008000)
#define PORT_ADDR_DIR (PORT_REG_ADDR + _PPU(0x00))
#define PORT_ADDR_DIRCLR (PORT_REG_ADDR + _PPU(0x04))
#define PORT_ADDR_DIRSET (PORT_REG_ADDR + _PPU(0x08))
#define PORT_ADDR_DIRTGL (PORT_REG_ADDR + _PPU(0x0C))
And de-referenced pointers to the register addresses...
#define PORT_DIR (*(rw_reg32_t *)PORT_ADDR_DIR)
#define PORT_DIRCLR (*(rw_reg32_t *)PORT_ADDR_DIRCLR)
#define PORT_DIRSET (*(rw_reg32_t *)PORT_ADDR_DIRSET)
#define PORT_DIRTGL (*(rw_reg32_t *)PORT_ADDR_DIRTGL)
And then directly set values in the register:
PORT_DIRSET = _BV(0) | _BV(1) | _BV(2);
Compiling in GCC with some other startup code...
arm-none-eabi-gcc -c -x c -mthumb -mlong-calls -mcpu=cortex-m4 -pipe
-std=c17 -O2 -Wall -Wextra -Wpedantic main.c
[SIZE] : Calculating size from ELF file
text data bss dec hex
924 0 49184 50108 c3bc
With disassembly:
00000000 <main>:
#include "defs/hw-v1.0.h"
void main (void) {
PORT_DIRSET = _BV(0) | _BV(1) | _BV(2);
0: 4b01 ldr r3, [pc, #4] ; (8 <main+0x8>)
2: 2207 movs r2, #7
4: 601a str r2, [r3, #0]
}
6: 4770 bx lr
8: 41008008 .word 0x41008008
The "new" structured method
You still define a base address as before as well as some numerical constants (like some number of instances), but instead of defining individual register addresses, you create a structure that models the peripheral. Note, I manually include some reserved space at the end for alignment. For some peripherals, there will be reserved chunks between other registers - it all depends on that peripheral memory mapping.
typedef struct PortGroup {
rw_reg32_t DIR;
rw_reg32_t DIRCLR;
rw_reg32_t DIRSET;
rw_reg32_t DIRTGL;
rw_reg32_t OUT;
rw_reg32_t OUTCLR;
rw_reg32_t OUTSET;
rw_reg32_t OUTTGL;
ro_reg32_t IN;
rw_reg32_t CTRL;
wo_reg32_t WRCONFIG;
rw_reg32_t EVCTRL;
rw_reg8_t PMUX[PORT_NUM_PMUX];
rw_reg8_t PINCFG[PORT_NUM_PINCFG];
reserved8_t reserved[PORT_GROUP_RESERVED];
} PORT_group_t;
Since the PORT peripheral has four units, and the PortGroup structure is packed to exactly model the memory mapping, I can create a parent structure that contains all of them.
typedef struct Port {
PORT_group_t GROUP[PORT_NUM_GROUPS];
} PORT_t;
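Since the structured method relies on the struct matching the memory map exactly, the layout can be checked at compile time. Below is a minimal sketch using C11 _Static_assert and offsetof on a shortened, hypothetical peripheral (demo_group_t and its offsets are invented for the example, not taken from any device pack):

```c
#include <stddef.h>
#include <stdint.h>

typedef volatile uint32_t rw_reg32_t;
typedef volatile uint8_t  rw_reg8_t;

/* Hypothetical, shortened peripheral: three 32-bit registers, four byte-wide
   registers, then reserved space padding the block out to 0x20 bytes. */
typedef struct DemoGroup {
    rw_reg32_t DIR;           /* offset 0x00 */
    rw_reg32_t DIRCLR;        /* offset 0x04 */
    rw_reg32_t DIRSET;        /* offset 0x08 */
    rw_reg8_t  PMUX[4];       /* offset 0x0C */
    uint8_t    reserved[16];  /* offset 0x10 */
} demo_group_t;

/* The build fails if the compiler's layout drifts from the intended map. */
_Static_assert(offsetof(demo_group_t, DIRSET) == 0x08, "DIRSET offset");
_Static_assert(offsetof(demo_group_t, PMUX)   == 0x0C, "PMUX offset");
_Static_assert(sizeof(demo_group_t)           == 0x20, "group size");
```

If a member is added or reordered by mistake, the assertions fail at build time instead of the code silently writing to the wrong register.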
And the final step is to associate this structure with an address.
#define PORT ((PORT_t *)PORT_REG_ADDR)
Note, this can still be de-referenced as before - it's a matter of style choice.
#define PORT (*(PORT_t *)PORT_REG_ADDR)
And now to set the register value as before...
PORT->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
Compiling (and linking) with the same options, this produces identical size info and disassembly:
Disassembly of section .text.startup.main:
00000000 <main>:
#include "defs/hw-v1.0.h"
void main (void) {
PORT->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
0: 4b01 ldr r3, [pc, #4] ; (8 <main+0x8>)
2: 2207 movs r2, #7
4: 609a str r2, [r3, #8]
}
6: 4770 bx lr
8: 41008000 .word 0x41008000
Reusable Code
The first method is straightforward, but requires a lot of manual definitions and some ugly macros if you have more than one peripheral. What if we had 2 different PORT peripherals at different addresses (similar to a device that has more than one USART)? We can just create multiple structured PORT pointers:
#define PORT0 ((PORT_t *)PORT0_REG_ADDR)
#define PORT1 ((PORT_t *)PORT1_REG_ADDR)
Calling them individually looks like what you'd expect:
PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
Compiling results in:
[SIZE] : Calculating size from ELF file
text data bss dec hex
936 0 49184 50120 c3c8
Disassembly of section .text.startup.main:
00000000 <main>:
#include "defs/hw-v1.0.h"
void main (void) {
PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
0: 4903 ldr r1, [pc, #12] ; (10 <main+0x10>)
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
2: 4b04 ldr r3, [pc, #16] ; (14 <main+0x14>)
PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
4: 2007 movs r0, #7
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
6: 2270 movs r2, #112 ; 0x70
PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
8: 6088 str r0, [r1, #8]
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
a: 609a str r2, [r3, #8]
}
c: 4770 bx lr
e: bf00 nop
10: 41008000 .word 0x41008000
14: 4100a000 .word 0x4100a000
And the final step to make it all reusable...
static PORT_t * const PORT[] = {PORT0, PORT1};
static inline void
PORT_setDir(const uint8_t unit, const uint8_t group, const uint32_t pins) {
PORT[unit]->GROUP[group].DIRSET = pins;
}
/* ... */
PORT_setDir(0, 0, _BV(0) | _BV(1) | _BV(2));
PORT_setDir(1, 0, _BV(4) | _BV(5) | _BV(6));
Compiling gives the same size and (basically) the same disassembly as before.
Disassembly of section .text.startup.main:
00000000 <main>:
static PORT_t * const PORT[] = {PORT0, PORT1};
static inline void
PORT_setDir(const uint8_t unit, const uint8_t group, const uint32_t pins) {
PORT[unit]->GROUP[group].DIRSET = pins;
0: 4903 ldr r1, [pc, #12] ; (10 <main+0x10>)
2: 4b04 ldr r3, [pc, #16] ; (14 <main+0x14>)
4: 2007 movs r0, #7
6: 2270 movs r2, #112 ; 0x70
8: 6088 str r0, [r1, #8]
a: 609a str r2, [r3, #8]
void main (void) {
PORT_setDir(0, 0, _BV(0) | _BV(1) | _BV(2));
PORT_setDir(1, 0, _BV(4) | _BV(5) | _BV(6));
}
c: 4770 bx lr
e: bf00 nop
10: 41008000 .word 0x41008000
14: 4100a000 .word 0x4100a000
I would clean it up a bit more with a module library header, enumerated constants, etc. But this should give someone a starting point. Note, in these examples, I am always calling a CONSTANT unit and group. I know exactly what I'm writing to, I just want reusable code. More instructions will (probably) be needed if the unit or group cannot be optimized to compile-time constants. Speaking of which, if optimizations are not used, all of this goes out the window. YMMV.
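For experimenting with the constant versus runtime unit/group case on a host, a RAM-backed variant of the table can be handy. This is only a sketch under assumptions: demo_port_t, sim_port0, sim_port1, DEMO_PORT and demo_set_dir are hypothetical names, and the simulated ports replace the real peripheral addresses:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { volatile uint32_t DIR, DIRCLR, DIRSET, DIRTGL; } demo_group_t;
typedef struct { demo_group_t GROUP[2]; } demo_port_t;

/* RAM stand-ins so the sketch runs on a host; on the MCU these would be
   casts of the peripheral base addresses instead. */
static demo_port_t sim_port0, sim_port1;
static demo_port_t * const DEMO_PORT[] = { &sim_port0, &sim_port1 };

static inline void demo_set_dir(uint8_t unit, uint8_t group, uint32_t pins)
{
    /* With constant arguments and optimization on, the check and the table
       lookup can typically be folded away; with runtime values the index
       arithmetic remains in the generated code. */
    assert(unit < 2 && group < 2);
    DEMO_PORT[unit]->GROUP[group].DIRSET = pins;
}
```

Comparing the disassembly of demo_set_dir(1, 0, 0x70) against a call whose unit comes from a function parameter shows the extra instructions mentioned above.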
Side Note on Bit Fields
Atmel further breaks a peripheral structure into individual typedef'd registers that have named bitfields in a union with the size of the register. This is the ARM CMSIS way, but it's not great IMO. Contrary to the information in some of the other answers, I know exactly how a compiler will pack this structure; however, I do not know how it will arrange bit fields without using special compiler attributes and flags. I would rather explicitly set and mask defined register bit field constant values. It also violates MISRA (as does some of what I've done here...) if you are worried about that.
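Explicit set-and-mask with defined constants, as opposed to bit fields, might look something like this. The field (a 2-bit MODE at bits [5:4]) and the helper names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical 2-bit MODE field occupying bits [5:4] of a control register. */
#define CTRL_MODE_POS   4u
#define CTRL_MODE_MASK  (0x3u << CTRL_MODE_POS)

/* Replace one field of a register value, leaving the other bits untouched. */
static inline uint32_t field_set(uint32_t reg, uint32_t mask, uint32_t pos,
                                 uint32_t val)
{
    return (reg & ~mask) | ((val << pos) & mask);
}

/* Extract one field from a register value. */
static inline uint32_t field_get(uint32_t reg, uint32_t mask, uint32_t pos)
{
    return (reg & mask) >> pos;
}
```

The bit positions are spelled out in the source, so nothing depends on how a particular compiler orders bit fields.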
Related
I am writing C code which may be compiled for the Arm Cortex-M3 microcontroller.
This microcontroller supports several useful instructions for efficiently manipulating bits in registers, including REV*, RBIT, SXT*.
When writing C code, how can I take advantage of these instructions if I need those specific functions? For example, how can I complete this code?
#define REVERSE_BIT_ORDER(x) { /* what to write here? */ }
I would like to do this without using inline assembler so that this code is both portable, and readable.
Added:
In part, I am asking how to express such a function in C elegantly. For example, it's easy to express bit shifting in C, because it's built into the language. Likewise, setting or clearing bits. But bit reversal is unknown in C, and so is very hard to express. For example, this is how I would reverse bits:
unsigned int ReverseBits(unsigned int x)
{
unsigned int ret = 0;
for (int i=0; i<32; i++)
{
ret <<= 1;
if (x & (1u<<i))
ret |= 1;
}
return ret;
}
Would the compiler recognise this as bit reversal, and issue the correct instruction?
Reversing the bits of a 32-bit integer is a fairly exotic operation, which might be why you can't get the compiler to generate it. However, I was able to generate code that uses REV (reverse byte order), which is a far more common use case:
#include <stdint.h>
uint32_t endianize (uint32_t input)
{
return ((input >> 24) & 0x000000FF) |
((input >> 8) & 0x0000FF00) |
((input << 8) & 0x00FF0000) |
((input << 24) & 0xFF000000) ;
}
With gcc -O3 -mcpu=cortex-m3 -ffreestanding (for ARM32, vers 11.2.1 "none"):
endianize:
rev r0, r0
bx lr
https://godbolt.org/z/odGqzjTGz
It works for clang armv7-a 15.0.0 too, as long as you use -mcpu=cortex-m3.
So this would support the idea of avoiding manual optimizations and let the compiler worry about such.
@Lundin's answer shows a pure-C shift/mask bithack that clang recognizes and compiles to a single rev instruction. (Or presumably to x86 bswap if targeting x86, or equivalent instructions on other ISAs that have them.)
In portable ISO C, hoping for pattern-recognition is unfortunately the best you can do, because they haven't added portable ways to expose CPU functionality; even C++ took until C++20 to add the <bit> header for things like std::popcount and C++23 std::byteswap.
(Some fairly-portable C libraries / headers have byte-reversal, e.g. as part of networking there's ntohl net-to-host which is an endian-swap on little-endian machines. Or there's GCC's (or glibc's?) endian.h, with htobe32 being host to big-endian 32-bit; see the endian(3) man page. These are usually implemented with intrinsics that compile to a single instruction in good-quality implementations.)
Of course, if you definitely want a byte swap regardless of host endianness, you could do htole32(be32toh(x)) because one of them's a no-op and the other's a byte-swap, since ARM is either big or little endian. (It's still a byte-swap even if neither of them are NOPs, even on PDP or other mixed-endian machines, but there might be more efficient ways to do it.)
There are also some "collections of useful functions" headers with intrinsics for different compilers, with functions like byte swap. These can be of varying quality in terms of efficiency and maybe even correctness.
You can see that no, neither GCC nor clang optimize your code to rbit for ARM or AArch64. https://godbolt.org/z/Y7noP61dE . Presumably looping over bits in the other direction isn't any better. Perhaps a bithack as in In C/C++ what's the simplest way to reverse the order of bits in a byte? or Efficient Algorithm for Bit Reversal (from MSB->LSB to LSB->MSB) in C .
GCC and clang recognize the standard bithack for popcount, but I didn't check any of the answers on the bit-reverse questions.
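For reference, the usual swap-by-halves bithack for a 32-bit reversal looks like the sketch below (reverse_bits32 is just an illustrative name). Whether a given compiler turns it into rbit was not checked here, so treat it as a portable fallback and not as a guaranteed single instruction:

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit value in five mask-and-shift steps. */
uint32_t reverse_bits32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);  /* adjacent bits */
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);  /* bit pairs     */
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);  /* nibbles       */
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);  /* bytes         */
    return (x >> 16) | (x << 16);                             /* half-words    */
}
```

For example, reverse_bits32(0x12345678) gives 0x1E6A2C48.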
Some languages, notably Rust, do care more about making it possible to portably express what modern CPUs can do. foo.reverse_bits() (since Rust 1.37) and foo.swap_bytes() just work for any type on any ISA. For u32 specifically, https://doc.rust-lang.org/std/primitive.u32.html#method.reverse_bits (That's Rust's equivalent of C uint32_t.)
Most mainstream C implementations have portable (across ISAs) builtins or (target-specific) intrinsics (like __REV() or __REV16()) for stuff like this.
The GNU dialect of C (GCC/clang/ICC and some others) includes __builtin_bswap32(input). See Does ARM GCC have a builtin function for the assembly 'REV' instruction?. It's named after the x86 bswap instruction, but it's just a byte-reverse that GCC / clang compile to whatever instructions can do it efficiently on the target ISA.
There's also a __builtin_bswap16(uint16_t) for swapping the bytes of a 16-bit integer, like revsh except the C semantics don't include preserving the upper 16 bits of a 32-bit integer. (Because normally you don't care about that part.) See the GCC manual for the available GNU C builtins that aren't target-specific.
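A minimal usage sketch of those two builtins, wrapped in helper functions (swap32 and swap16 are hypothetical names; the builtins themselves are GNU C extensions, so this assumes GCC or clang):

```c
#include <stdint.h>

/* GNU C only: __builtin_bswap32/16 are compiler builtins, not ISO C. */
static inline uint32_t swap32(uint32_t x)
{
    return __builtin_bswap32(x);
}

static inline uint16_t swap16(uint16_t x)
{
    return __builtin_bswap16(x);
}
```

For example, swap32(0x11223344) yields 0x44332211 on any target, whatever instruction the compiler picks to implement it.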
There isn't a GNU C builtin or intrinsic for bitwise reverse that I could find in the manual or GCC arm-none-eabi 12.2 headers.
ARM documents an __rbit() intrinsic for their own compiler, but I think that's Keil's ARMCC, so there might not be any equivalent of that for GCC/clang.
@0___________ suggests https://github.com/ARM-software/CMSIS_5 for headers that define a function for that.
If worst comes to worst, GNU C inline asm is possible for GCC/clang, given appropriate #ifdefs. You might also want if (__builtin_constant_p(x)) to use a pure-C bit-reversal so constant-propagation can happen on compile-time constants, only using inline asm on runtime-variable values.
uint32_t output, input=...;
#if defined(__arm__)
asm("rbit %0,%1" : "=r"(output) : "r"(input));
#elif defined(__aarch64__)
// same instruction, but %w selects the 32-bit register names (w0, not x0)
asm("rbit %w0,%w1" : "=r"(output) : "r"(input));
#else
... // pure C fallback or something
#endif
Note that it doesn't need to be volatile because rbit is a pure function of the input operand. It's a good thing if GCC/clang are able to hoist this out of a loop. And it's a single asm instruction so we don't need an early-clobber.
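Putting the pieces together, a wrapper along these lines keeps constant propagation working and only uses inline asm for runtime values. It is a sketch under assumptions: rbit32 and rbit_c are made-up names, the ARM branch assumes a core that actually has rbit (ARMv6T2 or later, which includes Cortex-M3), and on any other target only the C fallback is compiled:

```c
#include <stdint.h>

/* Pure-C fallback: foldable at compile time and valid on any ISA. */
static inline uint32_t rbit_c(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);   /* move the lowest bit of x into r */
        x >>= 1;
    }
    return r;
}

static inline uint32_t rbit32(uint32_t x)
{
#if defined(__GNUC__) && (defined(__arm__) || defined(__aarch64__))
    if (!__builtin_constant_p(x)) {
        uint32_t out;
#if defined(__aarch64__)
        /* %w prints the 32-bit register name (w0 instead of x0). */
        __asm__("rbit %w0, %w1" : "=r"(out) : "r"(x));
#else
        /* Assumes the target core implements rbit. */
        __asm__("rbit %0, %1" : "=r"(out) : "r"(x));
#endif
        return out;
    }
#endif
    return rbit_c(x);   /* compile-time constants and non-ARM targets */
}
```

With optimization on, rbit32(0x12345678) can fold to the constant 0x1E6A2C48, while a runtime argument goes through the instruction.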
This has the downside that the compiler can't fold a shift into it, e.g. if you wanted a byte-reverse, __rbit(x) >> 24 equals __rbit(x<<24), which could be done with rbit r0, r1, lsl #24. (I think).
With inline asm I don't think there's a way to tell the compiler that a r1, lsl #24 is a valid expansion for the %1 input operand. Hmm, unless there's a machine-specific constraint for that? https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html - no, no mention of "shifted" or "flexible" source operand in the ARM section.
Efficient Algorithm for Bit Reversal (from MSB->LSB to LSB->MSB) in C shows an #ifdefed version with a working fallback that uses a bithack to reverse bits within a byte, then __builtin_bswap32 or MSVC _byteswap_ulong to reverse bytes.
It would be best if you used CMSIS intrinsic.
__REV, __REV16 etc. Those CMSIS header files contain much much more.
You can get them from here:
https://github.com/ARM-software/CMSIS_5
and you are looking for cmsis_gcc.h file (or similar if you use another compiler).
Interestingly, ARM gcc seems to have improved its detection of byte order reversing recently. With version 11, it would detect byte reversal if done by bit shifting, or by byte swapping through a pointer. However, from version 10 and backwards, the pointer method failed to issue the REV instruction.
uint32_t endianize1 (uint32_t input)
{
return ((input >> 24) & 0x000000FF) |
((input >> 8) & 0x0000FF00) |
((input << 8) & 0x00FF0000) |
((input << 24) & 0xFF000000) ;
}
uint32_t endianize2 (uint32_t input)
{
uint32_t output;
uint8_t *in8 = (uint8_t*)&input;
uint8_t *out8 = (uint8_t*)&output;
out8[0] = in8[3];
out8[1] = in8[2];
out8[2] = in8[1];
out8[3] = in8[0];
return output;
}
endianize1:
rev r0, r0
bx lr
endianize2:
mov r3, r0
movs r0, #0
lsrs r2, r3, #24
bfi r0, r2, #0, #8
ubfx r2, r3, #16, #8
bfi r0, r2, #8, #8
ubfx r2, r3, #8, #8
bfi r0, r2, #16, #8
bfi r0, r3, #24, #8
bx lr
https://godbolt.org/z/E3xGvG9qq
So, as we wait for optimisers to improve, there are certainly ways you can help the compiler understand your intent and take good advantage of the instruction set (without resorting to micro optimisations or inline assembler). But it's likely that this will involve a good understanding of the architecture by the programmer, and examination of the output assembler.
Take advantage of http://godbolt.org to help examine the compiler output, and see what produces the best output.
I noted a strange behavior if a function takes an argument as plain struct like this:
struct Foo
{
int a;
int b;
};
int foo(struct Foo d)
{
return d.a;
}
Compiling for ARM Cortex-M3 using GCC 10.2 with -Os optimization (or any other optimization level):
arm-none-eabi-gcc.exe -Os -mcpu=cortex-m3 -o test2.c.obj -c test2.c
generates code in which the argument struct's data is saved on the stack for no reason.
Disassembly of section .text:
00000000 <foo>:
0: b082 sub sp, #8
2: ab02 add r3, sp, #8
4: e903 0003 stmdb r3, {r0, r1}
8: b002 add sp, #8
a: 4770 bx lr
What is the reason for saving the struct's data on the stack? It never uses this data.
If I compile this code for the RISC-V architecture, the result is even more interesting:
Disassembly of section .text:
00000000 <foo>:
0: 1141 addi sp,sp,-16
2: 0141 addi sp,sp,16
4: 8082 ret
Here the stack pointer just moves forward and back again. Why? What is the reason?
The optimizer just doesn't "optimize it away", probably because it's relying on a later part of the optimizer to handle it.
Try changing the code to
struct Foo
{
int a;
int b;
};
extern int extern_bar(struct Foo d);
int bar(struct Foo d)
{
return d.a;
}
#include <stdlib.h>
#include <stdio.h>
int main()
{
struct Foo baz;
baz.a = rand();
baz.b = rand();
printf("%d",bar(baz));
printf("%d",extern_bar(baz));
return bar(baz);
}
And compiling at godbolt.org under the different architectures. (Make sure to set -Os).
You can see that in many cases it completely optimizes away the call to bar and just uses the value in the register. While we don't show it, the linker can/could completely cull the function body of bar because it's unnecessary.
The call to extern_bar is still there because the compiler can't know what's going on inside of it, so it dutifully does what it needs to do to pass the struct by value according to the architecture ABI (most architectures push the struct on the stack). That means the function must copy it off the stack.
Apparently the RISC-V EABI is different and it passes smaller structs by value in registers. I guess it just has a built-in prologue/epilogue to push and pop the stack, and the optimizer doesn't trim it away because it's sort of an edge case.
Or, who knows.
But, the short of it is: if size and cycles REALLY matter, don't trust the compiler. Also don't trust the compiler to keep doing what it's doing. Changing revisions of the toolchain is asking for slight differences in code generation. Even not changing the toolchain revision could end up with different code based on heuristics you just aren't privy to or don't realize.
As per the AAPCS standard defined by Arm, composite types are passed to functions on the stack. Refer to section 6.4 of the document below for details.
https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#64parameter-passing
Problem description
I'm trying to design the C code unpacking array A of uint32_t elements to array B of uint32_t elements where each element of A is unpacked to two consecutive elements of B so that B[2*i] contains low 16 bits of A[i] and B[2*i + 1] contains high 16 bits of A[i] shifted right, i.e.,
B[2*i] = A[i] & 0xFFFFul;
B[2*i+1] = A[i] >> 16u;
Note the arrays are aligned to 4, have variable length, but A always contains multiple of 4 of uint32_t and the size is <= 32, B has sufficient space for unpacking and we are on ARM Cortex-M3.
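For checking any optimized version against the intended behaviour, a plain C reference of the unpacking described above may be useful. This is only a sketch: unpack_ref is a hypothetical name, and it takes an element count where the real code takes a block count:

```c
#include <stddef.h>
#include <stdint.h>

/* Reference unpack: each a[i] becomes two consecutive elements of b. */
void unpack_ref(const uint32_t *restrict a, uint32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        b[2 * i]     = a[i] & 0xFFFFu;   /* low 16 bits */
        b[2 * i + 1] = a[i] >> 16;       /* high 16 bits, shifted right */
    }
}
```

Running the reference and the hand-optimized routine on the same input and comparing the output arrays is a cheap way to catch register-handling mistakes.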
Current bad solution in GCC inline asm
As GCC is not good at optimizing this unpacking, I wrote unrolled C and inline asm to optimize it for speed with acceptable code size and register usage. The unrolled code looks like this:
static void unpack(uint32_t * src, uint32_t * dst, uint8_t nmb8byteBlocks)
{
switch(nmb8byteBlocks) {
case 8:
UNPACK(src, dst)
case 7:
UNPACK(src, dst)
...
case 1:
UNPACK(src, dst)
default:;
}
}
where
#define UNPACK(src, dst) \
asm volatile ( \
"ldm %0!, {r2, r4} \n\t" \
"lsrs r3, r2, #16 \n\t" \
"lsrs r5, r4, #16 \n\t" \
"stm %1!, {r2-r5} \n\t" \
: \
: "r" (src), "r" (dst) \
: "r2", "r3", "r4", "r5" \
);
It works until GCC's optimizer decides to inline the function (a wanted property) and reuse the register variables src and dst in the code that follows. Clearly, due to the ldm %0! and stm %1! instructions, src and dst contain different addresses when leaving the switch statement.
How to solve it?
I do not know how to inform GCC that the registers used for src and dst are invalid after the last UNPACK macro in the last case 1:.
I tried to pass them as output operands in all macros, or only in the last one ("=r" (mem), "=r" (pma)), or to somehow include them in the inline asm clobbers, but that only makes the register handling worse, with bad code again.
The only solution is to disable function inlining (__attribute__ ((noinline))), but in that case I lose the advantage that GCC can cut the proper number of macros and inline them if nmb8byteBlocks is known at compile time. (The same drawback holds for rewriting the code in pure assembly.)
Is there any way to solve this in inline assembly?
I think you are looking for the + constraint modifier, which means "this operand is both read and written". (See the "Modifiers" section of GCC's inline-assembly documentation.)
You also need to tell GCC that this asm reads and writes memory; the easiest way to do that is by adding "memory" to the clobber list. And that you clobber the "condition codes" with lsrs, so a "cc" clobber is also necessary. Try this:
#define UNPACK(src, dst) \
asm volatile ( \
"ldm %0!, {r2, r4} \n\t" \
"lsrs r3, r2, #16 \n\t" \
"lsrs r5, r4, #16 \n\t" \
"stm %1!, {r2-r5} \n\t" \
: "+r" (src), "+r" (dst) \
: /* no input-only operands */ \
: "r2", "r3", "r4", "r5", "memory", "cc" \
);
(Micro-optimization: since you don't use the condition codes from the shifts, it's better to use lsr instead of lsrs. It also makes the code easier to read months later; future you won't be scratching your head wondering if there's some reason why the condition codes are actually needed here. EDIT: I've been reminded that lsrs has a more compact encoding than lsr in Thumb format, which is enough of a reason to use it even though the condition codes aren't needed.)
(I would like to say that you'd get better register allocator behavior if you let GCC pick the scratch registers, but I don't know how to tell it to pick scratch registers in a particular numeric order as required by ldm and stm, or how to tell it to use only the registers accessible to 2-byte Thumb instructions.)
(It is possible to specify exactly what memory is read and written with "m"-type input and output operands, but it's complicated and may not improve things much. If you discover that this code works but causes a bunch of unrelated stuff to get reloaded from memory into registers unnecessarily, consult How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)
(You may get better code generation for what unpack is inlined into, if you change its function signature to
static void unpack(const uint32_t *restrict src,
uint32_t *restrict dst,
unsigned int nmb8byteBlocks)
I would also experiment with adding if (nmb8byteBlocks > 8) __builtin_trap(); as the first line of the function.)
Many thanks zwol, this is exactly what I was looking for but couldn't find in the GCC inline assembly pages. It solved the problem perfectly - now GCC makes a copy of src and dst in different registers and uses them correctly after the last UNPACK macro. Two remarks:
I use lsrs because it compiles to the 2-byte Cortex-M3 native lsrs. If I use the lsr version, which leaves the flags untouched, it compiles to the 4-byte mov.w r3, r2, lsr #16: the 16-bit Thumb-2 lsr sets the flags ('s') by default, and without the 's' the 32-bit Thumb-2 encoding must be used (I have to check this). Anyway, I should add "cc" to the clobbers in this case.
In the code above, I removed the nmb8byteBlocks range check for clarity. But of course, your last sentence is a good point for all C programmers.
I have some misunderstanding about MCU GCC compilation behavior regarding functions that return something other than a 32-bit value.
MCU: STM32 L0 Series (STM32L083)
GCC : gcc version 7.3.1 20180622 (release) [ARM/embedded-7-branch revision 261907] (GNU Tools for Arm Embedded Processors 7-2018-q2-update)
My code is optimized for size (with option -Os ). In my understanding, this will allow the gcc to use implicit -fshort-enums in order to pack enums.
I have two enum var, 1-byte wide :
enum eRadioMode radio_mode // (@ 0x20003200)
enum eRadioFunction radio_func // (@ 0x20003201)
And a function :
enum eRadioMode radio_get_mode(enum eRadioFunction _radio_func);
When I call this bunch of code :
radio_mode = radio_get_mode(radio_func);
It will produce this bunch of ASM at compile time:
; At this point :
; r4 value is 0x20003201 (Address of radio_func)
7820 ldrb r0, [r4, #0] ; GCC treat correctly r4 as a pointer to 1 byte wide var, no problem here
f7ff ffcd bl 80098a8 <radio_get_mode> ; Call to radio_get_mode()
4d1e ldr r5, [pc, #120] ; r5 is loaded with 0x20003200 (Address of radio_mode)
6028 str r0, [r5, #0] ; Why GCC use 'str' and not 'strb' at this point ?
The last line here is the problem : The value of r0, return value of radio_get_mode(), is stored into address pointed by r5, as a 32bit value.
Since radio_func is 1 byte after radio_mode, its value is overwritten by the second byte of r0 (that is always 0x00 since enum is only 1 byte wide).
As my function radio_get_mode is declared as returning 1 single byte, why GCC doesn't use instruction strb in order to save this single byte into the address pointed by r5 ?
I have tried :
radio_get_mode() as returning uint8_t : uint8_t radio_get_mode(enum eRadioFunction _radio_func);
Forcing cast to uint8_t : radio_mode = (uint8_t)radio_get_mode(radio_func);
Passing by a third var (but GCC cancel that useless move at compile - not so dumb) :
uint32_t r = radio_get_mode(radio_func);
radio_mode = (uint8_t) r;
But none of these solutions work.
Since size optimization (-Os) is needed at first to reduce ROM usage (and not RAM, at this stage of my project), I found that the workaround option -fno-short-enums makes the compiler use 4 bytes per enum, which incidentally avoids any overlapping memory in this case.
But, in my opinion, this is a dirty way to hide a real problem here :
Is GCC not able to correctly handle return sizes other than 32 bits ?
Is there a correct way to do that ?
Thanks in advance.
EDIT :
I did NOT use -fshort-enums at any moment.
I'm sure that these enum has no value greater than 0xFF
I have tried to declare radio_mode and radio_func as uint8_t (aka unsigned char) : The problem is the same.
When compiled with -Os, Output.map is as follow :
Common symbol size file
...
radio_mode 0x1 src/radio/radio.o
radio_func 0x1 src/radio/radio.o
...
...
...
Section address label
0x2000319c radio_state
0x20003200 radio_mode
0x20003201 radio_func
0x20003202 radio_protocol
...
The map file output clearly shows that radio_mode and radio_func are 1 byte wide and at consecutive addresses.
When compiled without -Os, Output.map clearly shows that enums become 4 bytes wide (with addresses padded to 4).
Compiling with -Os and -fno-short-enums does the same thing as compiling without -Os for all enums (this is why I guess -Os implies an implicit -fshort-enums).
I will try to provide a minimal reproducible example.
My analysis of the problem is that I'm pretty sure it is a compiler bug. For me, this is clearly a memory overlap. My question is more about the best thing to do in order to avoid this - in the "best practice" way.
EDIT 2
It is my bad, I have re-tested after changing all signatures to uint8_t (aka unsigned char) and it works well.
@Peter Cordes seems to have found the problem here : when using it, -Os is partly enabling -fshort-enums, getting some parts of GCC to treat it as size 1 and other parts to treat it as size 4.
ASM code using only uint8_t is :
; Same position as before
7820 ldrb r0, [r4, #0]
f7ff ffcd bl 80098a8 <radio_get_mode>
4d1e ldr r5, [pc, #120]
7028 strb r0, [r5, #0] ; Yes! GCC uses 'strb' and not 'str' like before!
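The uint8_t-only variant described in EDIT 2 can be sketched roughly as follows. This is a minimal illustration, not the original project code: only radio_mode, radio_func, radio_get_mode and eRadioFunction come from the post, while the enumerator names, struct radio_vars and radio_update_mode are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical enumerators; the post does not show the real ones. */
enum eRadioFunction { RADIO_FUNC_A, RADIO_FUNC_B };
enum eRadioMode { RADIO_MODE_IDLE, RADIO_MODE_RX, RADIO_MODE_TX };

/* Two adjacent one-byte objects, mirroring radio_mode at 0x20003200
   and radio_func at 0x20003201 in the map file. Grouping them in a
   struct (an assumption of this sketch) pins their relative layout. */
struct radio_vars {
    uint8_t radio_mode;
    uint8_t radio_func;
};

static_assert(sizeof(struct radio_vars) == 2,
              "both members must be exactly one byte wide");

/* Every signature uses uint8_t, so the return value is one byte wide
   regardless of how the compiler sizes enums. The mapping itself is
   made up for the example. */
uint8_t radio_get_mode(uint8_t radio_func)
{
    return (radio_func == RADIO_FUNC_B) ? RADIO_MODE_TX : RADIO_MODE_RX;
}

/* A one-byte store (strb on ARM) cannot spill into radio_func. */
void radio_update_mode(struct radio_vars *v)
{
    v->radio_mode = radio_get_mode(v->radio_func);
}
```

After calling radio_update_mode() on a struct whose radio_func is RADIO_FUNC_B, radio_mode holds RADIO_MODE_TX and radio_func is unchanged, which is the behaviour the strb listing above corresponds to.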
To clarify :
There seems to be a compiler bug when using -Os with enums. It is bad luck that two enums sit at consecutive addresses and overlap.
Using -fno-short-enums in conjunction with -Os appears to be a good workaround IMO, since the problem concerns only enums and not all 1-byte variables.
Thanks again.
The ARM port's ABI defines enums as a variable-sized type for none-eabi targets and as the standard fixed-size type for linux-eabi targets.
That is the reason for the behaviour you observe. It is not related to the optimisation.
In this example you can see how it works. https://godbolt.org/z/-mY_WY
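As a rough local complement to the linked example, the sketch below reports the storage size the compiler picked for an enum on the current target and shows fixed-width storage that does not depend on the ABI or on -fshort-enums. All names here (sample_mode, mode_byte_t, and the helper functions) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical enum; its storage size is chosen by the compiler/ABI. */
enum sample_mode { MODE_OFF, MODE_RX, MODE_TX };

/* Compile-time guard: every enumerator must fit in one byte before
   the enum is stored in a uint8_t. */
static_assert(MODE_TX <= UINT8_MAX, "enum values must fit in one byte");

/* Size of the enum as seen by this translation unit. With variable-
   sized enums this is typically 1 here; otherwise typically sizeof(int). */
size_t sample_mode_size(void)
{
    return sizeof(enum sample_mode);
}

/* Fixed-width storage type: always one byte, whatever the enum size. */
typedef uint8_t mode_byte_t;

mode_byte_t mode_pack(enum sample_mode m)
{
    return (mode_byte_t)m;
}

enum sample_mode mode_unpack(mode_byte_t b)
{
    return (enum sample_mode)b;
}
```

Building this once for each target, or once with -fshort-enums and once with -fno-short-enums, and comparing the result of sample_mode_size() makes the ABI difference visible, while mode_byte_t stays one byte in both cases.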
I have a problem with passing a pointer to a struct to a device function.
I want to create a struct in local memory (I know it's slow, it's just an example) and pass it to another function by pointer. The problem is that when I debug it with memcheck on, I get this error:
Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 7, warp 0, lane 0
0x0000000000977608 in foo (st=0x3fffc38) at test.cu:15
15 st->m_tx = 99;
If I debug it without memcheck on, it works fine and gives expected results.
My OS is RedHat 6.3 64-bits with Kernel 2.6.32-220.
I use GTX680, CUDA 5.0 and compile the program with sm=30.
Code I used for testing this is below:
typedef struct __align__(8) {
int m_x0;
int m_tx;
} myStruct;
__device__ void foo(myStruct *st) {
st->m_tx = 99;
st->m_x0 = 123;
}
__global__ void myKernel(){
myStruct m_struct ;
m_struct.m_tx = 45;
m_struct.m_x0 = 90;
foo(&m_struct);
}
int main(void) {
myKernel <<<1,1 >>>();
cudaThreadSynchronize();
return 0;
}
Any suggestions? Thanks for any help.
Your example code is completely optimised away by the compiler because none of the code contributes to a global memory write. This is easily proved by compiling the kernel to a cubin file and disassembling the result with cuobjdump:
$ nvcc -arch=sm_20 -Xptxas="-v" -cubin struct.cu
ptxas info : Compiling entry function '_Z8myKernelv' for 'sm_20'
ptxas info : Function properties for _Z8myKernelv
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 32 bytes cmem[0]
$ cuobjdump -sass struct_dumb.cubin
code for sm_20
Function : _Z8myKernelv
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x00001de780000000*/ EXIT;
.............................
i.e. the kernel is completely empty. The debugger can't debug the code you want to investigate because it does not exist in what the compiler/assembler emitted. If we take a few liberties with your code:
typedef struct __align__(8) {
int m_x0;
int m_tx;
} myStruct;
__device__ __noinline__ void foo(myStruct *st) {
st->m_tx = 99;
st->m_x0 = 123;
}
__global__ void myKernel(int dowrite, int *output){
myStruct m_struct ;
m_struct.m_tx = 45;
m_struct.m_x0 = 90;
if (dowrite) {
foo(&m_struct);
output[threadIdx.x] = m_struct.m_tx + m_struct.m_x0;
}
}
int main(void) {
int * output;
cudaMalloc((void **)(&output), sizeof(int));
myKernel <<<1,1 >>>(1, output);
cudaThreadSynchronize();
return 0;
}
and repeat the same compilation and disassembly steps, things look somewhat different:
$ nvcc -arch=sm_20 -Xptxas="-v" -cubin struct_dumb.cu
ptxas info : Compiling entry function '_Z8myKerneliPi' for 'sm_20'
ptxas info : Function properties for _Z8myKerneliPi
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z3fooP8myStruct
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 5 registers, 40 bytes cmem[0]
$ /usr/local/cuda/bin/cuobjdump -sass struct_dumb.cubin
code for sm_20
Function : _Z8myKerneliPi
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x20105d034800c000*/ IADD R1, R1, -0x8;
/*0010*/ /*0x68009de218000001*/ MOV32I R2, 0x5a;
/*0018*/ /*0xb400dde218000000*/ MOV32I R3, 0x2d;
/*0020*/ /*0x83f1dc23190e4000*/ ISETP.EQ.AND P0, pt, RZ, c [0x0] [0x20], pt;
/*0028*/ /*0x00101c034800c000*/ IADD R0, R1, 0x0;
/*0030*/ /*0x00109ca5c8000000*/ STL.64 [R1], R2;
/*0038*/ /*0x000001e780000000*/ @P0 EXIT;
/*0040*/ /*0x10011c0348004000*/ IADD R4, R0, c [0x0] [0x4];
/*0048*/ /*0xc001000750000000*/ CAL 0x80;
/*0050*/ /*0x00009ca5c0000000*/ LDL.64 R2, [R0];
/*0058*/ /*0x84011c042c000000*/ S2R R4, SR_Tid_X;
/*0060*/ /*0x90411c4340004000*/ ISCADD R4, R4, c [0x0] [0x24], 0x2;
/*0068*/ /*0x0c201c0348000000*/ IADD R0, R2, R3;
/*0070*/ /*0x00401c8590000000*/ ST [R4], R0;
/*0078*/ /*0x00001de780000000*/ EXIT;
/*0080*/ /*0x8c00dde218000001*/ MOV32I R3, 0x63;
/*0088*/ /*0xec009de218000001*/ MOV32I R2, 0x7b;
/*0090*/ /*0x1040dc8590000000*/ ST [R4+0x4], R3;
/*0098*/ /*0x00409c8590000000*/ ST [R4], R2;
/*00a0*/ /*0x00001de790000000*/ RET;
...............................
we get actual code in the assembler output. You might have more luck in the debugger with that.
I am from the CUDA developer tools team. When compiled for device side debug (i.e. -G), the original code will not be optimized out. The issue looks like a memcheck bug. Thank you for finding this. We will look into it.