Nvidia CUDA - passing struct by pointer - c

I have a problem passing a pointer to a struct to a device function.
I want to create a struct in local memory (I know it's slow, it's just an example) and pass it to another function by pointer. The problem is that when I debug it with memcheck on, I get this error:
Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 7, warp 0, lane 0
0x0000000000977608 in foo (st=0x3fffc38) at test.cu:15
15 st->m_tx = 99;
If I debug it without memcheck on, it works fine and gives expected results.
My OS is 64-bit RedHat 6.3 with kernel 2.6.32-220.
I use a GTX680 and CUDA 5.0, and compile the program with sm=30.
Code I used for testing this is below:
typedef struct __align__(8) {
int m_x0;
int m_tx;
} myStruct;
__device__ void foo(myStruct *st) {
st->m_tx = 99;
st->m_x0 = 123;
}
__global__ void myKernel(){
myStruct m_struct ;
m_struct.m_tx = 45;
m_struct.m_x0 = 90;
foo(&m_struct);
}
int main(void) {
myKernel <<<1,1 >>>();
cudaThreadSynchronize();
return 0;
}
Any suggestions? Thanks for any help.

Your example code is completely optimised away by the compiler because none of the code contributes to a global memory write. This is easily proved by compiling the kernel to a cubin file and disassembling the result with cuobjdump:
$ nvcc -arch=sm_20 -Xptxas="-v" -cubin struct.cu
ptxas info : Compiling entry function '_Z8myKernelv' for 'sm_20'
ptxas info : Function properties for _Z8myKernelv
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 32 bytes cmem[0]
$ cuobjdump -sass struct_dumb.cubin
code for sm_20
Function : _Z8myKernelv
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x00001de780000000*/ EXIT;
.............................
i.e. the kernel is completely empty. The debugger can't debug the code you want to investigate because it does not exist in what the compiler/assembler emitted. If we take a few liberties with your code:
typedef struct __align__(8) {
int m_x0;
int m_tx;
} myStruct;
__device__ __noinline__ void foo(myStruct *st) {
st->m_tx = 99;
st->m_x0 = 123;
}
__global__ void myKernel(int dowrite, int *output){
myStruct m_struct ;
m_struct.m_tx = 45;
m_struct.m_x0 = 90;
if (dowrite) {
foo(&m_struct);
output[threadIdx.x] = m_struct.m_tx + m_struct.m_x0;
}
}
int main(void) {
int * output;
cudaMalloc((void **)(&output), sizeof(int));
myKernel <<<1,1 >>>(1, output);
cudaThreadSynchronize();
return 0;
}
and repeat the same compilation and disassembly steps, things look somewhat different:
$ nvcc -arch=sm_20 -Xptxas="-v" -cubin struct_dumb.cu
ptxas info : Compiling entry function '_Z8myKerneliPi' for 'sm_20'
ptxas info : Function properties for _Z8myKerneliPi
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z3fooP8myStruct
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 5 registers, 40 bytes cmem[0]
$ /usr/local/cuda/bin/cuobjdump -sass struct_dumb.cubin
code for sm_20
Function : _Z8myKerneliPi
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x20105d034800c000*/ IADD R1, R1, -0x8;
/*0010*/ /*0x68009de218000001*/ MOV32I R2, 0x5a;
/*0018*/ /*0xb400dde218000000*/ MOV32I R3, 0x2d;
/*0020*/ /*0x83f1dc23190e4000*/ ISETP.EQ.AND P0, pt, RZ, c [0x0] [0x20], pt;
/*0028*/ /*0x00101c034800c000*/ IADD R0, R1, 0x0;
/*0030*/ /*0x00109ca5c8000000*/ STL.64 [R1], R2;
/*0038*/ /*0x000001e780000000*/ @P0 EXIT;
/*0040*/ /*0x10011c0348004000*/ IADD R4, R0, c [0x0] [0x4];
/*0048*/ /*0xc001000750000000*/ CAL 0x80;
/*0050*/ /*0x00009ca5c0000000*/ LDL.64 R2, [R0];
/*0058*/ /*0x84011c042c000000*/ S2R R4, SR_Tid_X;
/*0060*/ /*0x90411c4340004000*/ ISCADD R4, R4, c [0x0] [0x24], 0x2;
/*0068*/ /*0x0c201c0348000000*/ IADD R0, R2, R3;
/*0070*/ /*0x00401c8590000000*/ ST [R4], R0;
/*0078*/ /*0x00001de780000000*/ EXIT;
/*0080*/ /*0x8c00dde218000001*/ MOV32I R3, 0x63;
/*0088*/ /*0xec009de218000001*/ MOV32I R2, 0x7b;
/*0090*/ /*0x1040dc8590000000*/ ST [R4+0x4], R3;
/*0098*/ /*0x00409c8590000000*/ ST [R4], R2;
/*00a0*/ /*0x00001de790000000*/ RET;
...............................
we get actual code in the assembler output. You might have more luck in the debugger with that.
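The same principle can be seen without a GPU. What follows is a plain C analogue of the modified kernel above, offered only as a sketch: `my_kernel_host` and its parameters are names made up for illustration and are not part of any CUDA API. The struct update survives optimization because its result is stored through `output`, which the caller can observe.

```c
/* Host-only C analogue of the modified kernel; no CUDA runtime involved. */
typedef struct {
    int m_x0;
    int m_tx;
} myStruct;

/* Writes through the pointer, like the __device__ function in the answer. */
static void foo(myStruct *st) {
    st->m_tx = 99;
    st->m_x0 = 123;
}

/* Hypothetical stand-in for the kernel body: the store to *output is the
   observable side effect that stops the compiler from discarding the work. */
static void my_kernel_host(int dowrite, int *output) {
    myStruct m_struct;
    m_struct.m_tx = 45;
    m_struct.m_x0 = 90;
    if (dowrite) {
        foo(&m_struct);
        *output = m_struct.m_tx + m_struct.m_x0;
    }
}
```

Calling `my_kernel_host(1, &out)` leaves 99 + 123 in `out`; with `dowrite` set to 0 nothing is stored, which mirrors the early EXIT in the disassembly.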

I am from the CUDA developer tools team. When compiled for device side debug (i.e. -G), the original code will not be optimized out. The issue looks like a memcheck bug. Thank you for finding this. We will look into it.

Related

Why stack behaved so strangely?

I noted a strange behavior if a function takes an argument as plain struct like this:
struct Foo
{
int a;
int b;
};
int foo(struct Foo d)
{
return d.a;
}
Compiling for an ARM Cortex-M3 with GCC 10.2 at -Os (or any other optimization level):
arm-none-eabi-gcc.exe -Os -mcpu=cortex-m3 -o test2.c.obj -c test2.c
generates code in which the argument struct's data is saved on the stack for no reason.
Disassembly of section .text:
00000000 <foo>:
0: b082 sub sp, #8
2: ab02 add r3, sp, #8
4: e903 0003 stmdb r3, {r0, r1}
8: b002 add sp, #8
a: 4770 bx lr
What is the reason for saving the struct's data on the stack? This data is never used.
If I compile this code for the RISC-V architecture, it gets more interesting:
Disassembly of section .text:
00000000 <foo>:
0: 1141 addi sp,sp,-16
2: 0141 addi sp,sp,16
4: 8082 ret
Here the stack pointer just moves forward and back again. Why? What is the reason?
The optimizer just doesn't "optimize it away", probably because it's relying on a later part of the optimizer to handle it.
Try changing the code to
struct Foo
{
int a;
int b;
};
extern int extern_bar(struct Foo d);
int bar(struct Foo d)
{
return d.a;
}
#include <stdlib.h>
#include <stdio.h>
int main()
{
struct Foo baz;
baz.a = rand();
baz.b = rand();
printf("%d",bar(baz));
printf("%d",extern_bar(baz));
return bar(baz);
}
and compile it at godbolt.org for the different architectures (make sure to set -Os).
You can see that in many cases it completely optimizes away the call to bar and just uses the value in the register. Although it isn't shown here, the linker could cull the function body of bar completely because it's unnecessary.
The call to extern_bar is still there because the compiler can't know what's going on inside of it, so it dutifully does what it needs to do to pass the struct by value according to the architecture ABI (most architectures push the struct on the stack). That means the function must copy it off the stack.
Apparently the RISC-V ABI is different and passes smaller structs by value in registers. I guess it just has a built-in prologue/epilogue to push and pop the stack, and the optimizer doesn't trim it away because it's something of an edge case.
Or, who knows.
But the short of it is: if size and cycles REALLY matter, don't trust the compiler. Also don't trust the compiler to keep doing what it's doing. Changing revisions of the toolchain is asking for slight differences in code generation. Even without changing the toolchain revision you could end up with different code, based on heuristics you just aren't privy to or don't realize are there.
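To make the by-value point concrete, here is a small sketch (the function names `set_by_value` and `set_by_pointer` are invented for this example). A callee that receives `struct Foo` by value works on its own copy, which is why the ABI has to hand it a complete copy in registers or on the stack, while a callee that receives a pointer modifies the caller's object.

```c
struct Foo {
    int a;
    int b;
};

/* Receives a private copy; the caller's struct is left untouched. */
static int set_by_value(struct Foo d) {
    d.a = 42;
    return d.a;
}

/* Receives the address of the caller's struct and modifies it in place. */
static int set_by_pointer(struct Foo *d) {
    d->a = 42;
    return d->a;
}
```

Whether the copy ends up in registers, on the stack, or is optimized away entirely is up to the ABI and the optimizer; only the copy semantics are guaranteed by the language.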
As per the AAPCS standard defined by Arm, composite types are passed to functions on the stack. Refer to section 6.4 of the document below for details.
https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#64parameter-passing

Why is gcc resolving the address of a string during compilation?

I'm trying to do some bare metal programming on a Raspberry Pi 3B. I still have not figured out the correct memory addresses for the UART controls so please just ignore those. I am experiencing a strange compilation issue though.
Here is the code I am trying to compile (not link):
void pstr(char* str) {
unsigned int* AUX_MU_IO_REG = (unsigned int*)0x7E215040;
unsigned int* AUX_MU_LSR_REG = (unsigned int*)0x7E215054;
while (*str != 0) {
while (!(*AUX_MU_LSR_REG & 0x00000020)) {
}
*AUX_MU_IO_REG = (unsigned int)((unsigned char)*str);
str++;
}
return;
}
signed int kmain(unsigned int argc, char* argv[], char* envp[]) {
char* text = "Test Output String\n";
unsigned int* AUXENB = (unsigned int*)0x7E215004;
*AUXENB = 0x00000001;
pstr(text);
return 0;
}
My addresses are not correct and invalid, but that is not the point. For some reason, the string "Test Output String\n" is being resolved to an address in the object file.
It is being compiled with the command:
aarch64-unknown-linux-gnu-gcc -Wall -Wextra -std=c99 -O2 -march=armv8-a -mtune=cortex-a53 -mlittle-endian -ffreestanding -nostdlib -nostartfiles -Wno-unused-parameter -fno-stack-check -fno-stack-protector src/kernel/base.c -c -o src/kernel/base.o
Interestingly, it doesn't happen if I compile with "-O0".
Here is what it looks like with "-O2" using "aarch64-unknown-linux-gnu-objdump -d ./src/kernel/base.o":
0000000000000040 <kmain>:
40: d28a0a80 mov x0, #0x5054 // #20564
44: f2afc420 movk x0, #0x7e21, lsl #16
48: d28a0084 mov x4, #0x5004 // #20484
4c: f2afc424 movk x4, #0x7e21, lsl #16
50: b9400000 ldr w0, [x0]
It crashes at "ldr w0, [x0]" because the address 0x7e215054 is not valid. I just don't know why the compiler would even be putting that there. It should be a symbol referring to the data in .rodata so that it can be placed in the correct location by my linker script.
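Separately from the question asked, one detail in the code above is worth a sketch: the register pointers are not `volatile`, so at -O2 the compiler is allowed to read the status register once and hoist that load out of the polling loop. The version below is an illustration only. `uart_regs_t`, `pstr_mmio`, `LSR_TX_EMPTY`, and the idea of passing the registers in as a parameter are all assumptions, made so that the function can be tried on a host with ordinary memory standing in for the hardware.

```c
#include <stdint.h>

/* Hypothetical register block: in real firmware these pointers would hold the
   UART's memory-mapped addresses; here the caller supplies them. */
typedef struct {
    volatile uint32_t *io;   /* data register */
    volatile uint32_t *lsr;  /* line status register */
} uart_regs_t;

#define LSR_TX_EMPTY 0x00000020u

/* volatile forces a fresh load of *lsr on every pass of the polling loop
   and keeps every store to *io. Returns the number of bytes written. */
static unsigned pstr_mmio(const uart_regs_t *u, const char *str) {
    unsigned n = 0;
    while (*str != 0) {
        while (!(*u->lsr & LSR_TX_EMPTY)) {
            /* busy-wait until the transmitter can accept a byte */
        }
        *u->io = (uint32_t)(unsigned char)*str;
        str++;
        n++;
    }
    return n;
}
```

On a host, two ordinary `uint32_t` variables can play the part of the registers, as long as the fake status word has bit 0x20 set so the loop terminates.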

doing a syscall without libc using ARM inline assembly [duplicate]

I want to write a tiny standalone executable without using libc. What I need for simulating some libc functions is a function that does syscalls using inline assembly:
int syscall(int a,...) {
return __asm__ volatile (/* DO STH HERE */);
}
I am using Linux and ARM processor.
EDIT: found the solution:
int syscall(int n,...) {
return __asm__ volatile ("mov r7,r0\nmov r0,r1\nmov r1,r2\nmov r2,r3\nswi #1\n");
}
First you need to be able to tell your toolchain (gcc?) not to include anything other than your code. Passing something like -nostartfiles -nodefaultlibs to gcc should work.
Then you need to play nicely with Linux: your ELF must be loaded properly by the OS, which means it needs a visible _start entry point. An example:
void _start() __attribute__ ((naked));
void _start() {
main();
asm volatile(
"mov r7, #1\n" /* exit */
"svc #0\n"
);
}
You can then create a main which contains what you want to do.
int main() {
linuxc('X');
return 42;
}
Then, going a step further with the write syscall...
void linuxc(int c) {
asm volatile(
"mov r0, #1\n" /* stdout */
"mov r1, %[buf]\n" /* write buffer */
"mov r2, #1\n" /* size */
"mov r7, #4\n" /* write syscall */
"svc #0\n"
: /* output */ : [buf] "r" (&c) : "r0", "r1", "r2", "r7", "memory"
);
}
I have a more complete example of that at my github. I like the teensy one most.
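As a variation on the write example above, the following sketch pins the arguments to registers with GCC register variables and hands the kernel's return value back to the caller. It is an illustration under assumptions, not the answer's code: `raw_write` is a made-up name, the ARM path follows the same EABI convention the answer uses (syscall number 4 in r7, `svc #0`), and the non-ARM branch falls back to libc's `syscall()` purely so the sketch can be built and tried on other hosts. When building for Thumb with a frame pointer, GCC may refuse to let r7 be used this way.

```c
#define _GNU_SOURCE       /* glibc: expose syscall() under strict -std modes */
#include <unistd.h>       /* syscall(), used only by the non-ARM fallback */
#include <sys/syscall.h>  /* SYS_write, used only by the non-ARM fallback */

/* Hypothetical helper: issue write(fd, buf, len) and return the result. */
static long raw_write(int fd, const void *buf, unsigned long len) {
#if defined(__arm__) && defined(__linux__)
    /* ARM EABI: syscall number in r7, arguments in r0-r2, result back in r0. */
    register long r0 __asm__("r0") = fd;
    register long r1 __asm__("r1") = (long)buf;
    register long r2 __asm__("r2") = (long)len;
    register long r7 __asm__("r7") = 4; /* write */
    __asm__ volatile("svc #0"
                     : "+r"(r0)
                     : "r"(r1), "r"(r2), "r"(r7)
                     : "memory");
    return r0;
#else
    /* Fallback so the sketch builds elsewhere; this path does use libc. */
    return syscall(SYS_write, fd, buf, len);
#endif
}
```

On success both paths return the number of bytes written. On failure they differ: the raw ARM path returns a negative error number, while the libc fallback returns -1 and sets errno.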

ARM ABI wants to return result in registers, using GCC

Similar to the question here:
Returning structs in registers - ARM ABI in GCC
I was hoping I could tell GCC that the result is in registers (leave them alone) and leave the stack untouched, but it only "mostly" works, which I suspect is just luck.
When compiling, I am left with an undefined reference to __aeabi_uldivmod(), which I am trying to supply. There is a nice _uldivmod.S from google, but I was looking for a C solution.
Currently, I am trying something like:
res = __udivdi3(u, v);
mod = __umoddi3(u, v);
{
register uint32_t r0 asm("r0") = (res&0xFFFFFFFF);
register uint32_t r1 asm("r1") = (res>>32);
register uint32_t r2 asm("r2") = (mod&0xFFFFFFFF);
register uint32_t r3 asm("r3") = (mod>>32);
printk("r0 %08X : %08X : %08X : %08X\n",r0, r1, r2, r3);
asm volatile(""
: "=r"(r0), "=r"(r1), "=r"(r2),"=r"(r3) // output
: "r"(r0), "r"(r1), "r"(r2), "r"(r3)); // input
return r0;
}
kernel: [ 3457.959207] r0 00000000 : 00000000 : 00000000 : 70000000
udivdi3: 7000000000000000/7000000080000000 != 000000000000003f rem dfffffe080000000
__udivdi3() and __umoddi3() are standard C functions.
Not only am I returning something (stack), but it does not always leave r1-r3 alone, since I think the "output field" of the asm statement only affects the asm statement itself, not the function declaration of my __aeabi_uldivmod.
gcc 4.4.3
Can it not be done?
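The arithmetic half of the problem can be written in portable C. The sketch below is a shift-and-subtract long division that produces quotient and remainder together; `udivmod64` and `udivmod64_t` are names invented for this example. It does not solve the register-return half: a 16-byte struct like this one is returned through memory under the ARM procedure call standard, so getting the quotient into r0:r1 and the remainder into r2:r3 would still need an assembly shim along the lines of the _uldivmod.S mentioned above.

```c
#include <stdint.h>

/* Hypothetical result type: quotient and remainder of a 64-bit division. */
typedef struct {
    uint64_t quot;
    uint64_t rem;
} udivmod64_t;

/* Bit-by-bit long division. Division by zero returns {0, 0}; a real
   implementation would have to decide how to report that case. */
static udivmod64_t udivmod64(uint64_t n, uint64_t d) {
    udivmod64_t r = {0, 0};
    if (d == 0) {
        return r;
    }
    for (int i = 63; i >= 0; --i) {
        /* Remember the bit that the shift below pushes out of rem. */
        uint64_t carry = r.rem >> 63;
        r.rem = (r.rem << 1) | ((n >> i) & 1u);
        /* With a carry the true remainder exceeds 64 bits, so it is >= d. */
        if (carry || r.rem >= d) {
            r.rem -= d;
            r.quot |= (uint64_t)1 << i;
        }
    }
    return r;
}
```

For the operands in the log above, 0x7000000000000000 divided by 0x7000000080000000, this yields a quotient of 0 and a remainder equal to the dividend.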

gcc optimization, const static object, and restrict

I'm working on an embedded project and I'm trying to add more structure to some of the code, which uses macros to optimize access to registers for USARTs. I'd like to organize preprocessor #define'd register addresses into const structures. If I define the structs as compound literals in a macro and pass them to inline'd functions, gcc has been smart enough to bypass the pointer in the generated assembly and hardcode the structure member values directly in the code. E.g.:
C1:
struct uart {
volatile uint8_t * ucsra, * ucsrb, *ucsrc, * udr;
volatile uint16_t * ubrr;
};
#define M_UARTX(X) \
( (struct uart) { \
.ucsra = &UCSR##X##A, \
.ucsrb = &UCSR##X##B, \
.ucsrc = &UCSR##X##C, \
.ubrr = &UBRR##X, \
.udr = &UDR##X, \
} )
void inlined_func(const struct uart * p, other_args...) {
...
(*p->ucsra) = 0;
(*p->ucsrb) = 0;
(*p->ucsrc) = 0;
}
...
int main(){
...
inlined_func(&M_UARTX(0), other_parms...);
...
}
Here UCSR0A, UCSR0B, &c, are defined as the uart registers as l-values, like
#define UCSR0A (*(uint8_t*)0xFFFF)
gcc was able to eliminate the structure literal entirely, and all assignments like that shown in inlined_func() write directly into the register address, w/o having to read the register's address into a machine register, and w/o indirect addressing:
A1:
movb $0, UCSR0A
movb $0, UCSR0B
movb $0, UCSR0C
This writes the values directly into the USART registers, w/o having to load the addresses into a machine register, and so never needs to generate the struct literal into the object file at all. The struct literal becomes a compile-time structure, with no cost in the generated code for the abstraction.
I wanted to get rid of the use of the macro, and tried using a static constant struct defined in the header:
C2:
#define M_UART0 M_UARTX(0)
#define M_UART1 M_UARTX(1)
static const struct uart * const uart[2] = { &M_UART0, &M_UART1 };
....
int main(){
...
inlined_func(uart[0], other_parms...);
...
}
However, gcc cannot remove the struct entirely here:
A2:
movl __compound_literal.0, %eax
movb $0, (%eax)
movl __compound_literal.0+4, %eax
movb $0, (%eax)
movl __compound_literal.0+8, %eax
movb $0, (%eax)
This loads the register addresses into a machine register, and uses indirect addressing to write to the register. Does anyone know any way I can convince gcc to generate A1 assembly code for C2 C code? I've tried various uses of the __restrict modifier, to no avail.
After many years of experience with UARTs and USARTs, I have come to these conclusions:
Don't use a struct for a 1:1 mapping with UART registers.
Compilers can add padding between struct members without your knowledge, thus messing up the 1:1 correspondence.
Writing to UART registers is best done directly or through a function.
Remember to use volatile modifier when defining pointers to the registers.
Very little performance gain with Assembly language
Assembly language should only be used if the UART is accessed through processor ports rather than memory-mapped. The C language has no support for ports. Accessing UART registers through pointers is very efficient (generate an assembly language listing and verify). Sometimes, it may take more time to code in assembly and verify.
Isolate UART functionality into a separate library
This is a good candidate. Besides, once the code has been tested, let it be. Libraries don't have to be (re)compiled all the time.
Using structs "across compile domains" is a cardinal sin in my book: that is, using a struct to point at something, anything, whether file data, memory, etc. The reason is that it will fail; it is not reliable, no matter the compiler. There are many compiler-specific flags and pragmas for this, but the better solution is to just not do it. If you want to point at address plus 8, point at address plus 8 using a pointer or an array. In this specific case I have had way too many compilers fail to do even that, so I write assembler PUT32/GET32 and PUT16/GET16 functions to guarantee that the compiler doesn't mess with my register accesses. As with structs, you will get burned one day and have a hell of a time figuring out why your 32-bit register only had 8 bits written to it. The overhead of the jump to the function is worth the peace of mind and the reliability of the code. It also makes your code extremely portable: you can put wrappers in for the put and get functions to cross networks, run your hardware in an HDL simulator and reach into the simulation to read/write registers, etc., with a single chunk of code that doesn't change from simulation to embedded to OS device driver to application-layer function.
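The shape of such accessor functions can be sketched in C. Note that this swaps in a different technique from the one described: the answer writes PUT32/GET32 in assembler precisely so that the compiler has no say, whereas the stand-ins below rely on `volatile` to get a single 32-bit access. The lowercase names `put32` and `get32` and the `uintptr_t` address parameter are assumptions made for this illustration.

```c
#include <stdint.h>

/* C stand-ins for the PUT32/GET32 idea. The answer implements these in
   assembler; this version relies on volatile instead. */
static void put32(uintptr_t addr, uint32_t value) {
    *(volatile uint32_t *)addr = value;
}

static uint32_t get32(uintptr_t addr) {
    return *(volatile uint32_t *)addr;
}
```

Because every register access goes through these two functions, they are also the one place where a wrapper for a simulator or a network transport could be substituted, which is the portability argument made above.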
Based on the register set, it looks like you are using an 8-bit Atmel AVR microcontroller (or something extremely similar). I'll show you some things I've used for Atmel's 32-bit ARM MCUs, which are a slightly modified version of what they ship in their device packs.
Code Notation
I'm using various macros that I'm not going to include here, but they are defined to do basic operations or paste types (like UL) onto numbers. They are hidden in macros for the cases where something is not allowed (like in assembly). Yes, these are easy to break - it's on the programmer not to shoot themselves in the foot:
#define _PPU(_V) (_V##U) /* guarded with #if defined(__ASSEMBLY__) */
#define _BV(_V) (_PPU(1) << _PPU(_V)) /* Variants for U, L, UL, etc */
There are also typedefs for specific-length registers. Example:
/* Variants for 8, 16, 32-bit, RO, WO, & RW */
typedef volatile uint32_t rw_reg32_t;
typedef volatile const uint32_t ro_reg32_t;
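For readers who want to try the notation on a host, here is a minimal self-contained sketch. `REG_U`, `REG_BIT`, and `set_pins` are hypothetical names: the first two play the role of the answer's `_PPU` and `_BV` (without the assembler guard, and renamed to stay clear of identifiers reserved for the implementation), and the last is just a helper for the example.

```c
#include <stdint.h>

/* Hypothetical equivalents of the answer's _PPU/_BV helpers (C only). */
#define REG_U(v)   (v##U)
#define REG_BIT(n) (REG_U(1) << REG_U(n))

typedef volatile uint32_t       rw_reg32_t;  /* read/write register */
typedef volatile const uint32_t ro_reg32_t;  /* read-only register */

/* Stores a pin mask into a direction-set style register. */
static void set_pins(rw_reg32_t *dirset, uint32_t pins) {
    *dirset = pins;
}
```

With these definitions, `REG_BIT(0) | REG_BIT(1) | REG_BIT(2)` evaluates to 7 and `REG_BIT(4) | REG_BIT(5) | REG_BIT(6)` to 0x70, the same immediates that show up in the disassembly listings in this answer.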
The classic #define method
You can define the peripheral address with any register offsets...
#define PORT_REG_ADDR _PPUL(0x41008000)
#define PORT_ADDR_DIR (PORT_REG_ADDR + _PPU(0x00))
#define PORT_ADDR_DIRCLR (PORT_REG_ADDR + _PPU(0x04))
#define PORT_ADDR_DIRSET (PORT_REG_ADDR + _PPU(0x08))
#define PORT_ADDR_DIRTGL (PORT_REG_ADDR + _PPU(0x0C))
And de-referenced pointers to the register addresses...
#define PORT_DIR (*(rw_reg32_t *)PORT_ADDR_DIR)
#define PORT_DIRCLR (*(rw_reg32_t *)PORT_ADDR_DIRCLR)
#define PORT_DIRSET (*(rw_reg32_t *)PORT_ADDR_DIRSET)
#define PORT_DIRTGL (*(rw_reg32_t *)PORT_ADDR_DIRTGL)
And then directly set values in the register:
PORT_DIRSET = _BV(0) | _BV(1) | _BV(2);
Compiling in GCC with some other startup code...
arm-none-eabi-gcc -c -x c -mthumb -mlong-calls -mcpu=cortex-m4 -pipe
-std=c17 -O2 -Wall -Wextra -Wpedantic main.c
[SIZE] : Calculating size from ELF file
text data bss dec hex
924 0 49184 50108 c3bc
With disassembly:
00000000 <main>:
#include "defs/hw-v1.0.h"
void main (void) {
PORT_DIRSET = _BV(0) | _BV(1) | _BV(2);
0: 4b01 ldr r3, [pc, #4] ; (8 <main+0x8>)
2: 2207 movs r2, #7
4: 601a str r2, [r3, #0]
}
6: 4770 bx lr
8: 41008008 .word 0x41008008
The "new" structured method
You still define a base address as before as well as some numerical constants (like some number of instances), but instead of defining individual register addresses, you create a structure that models the peripheral. Note, I manually include some reserved space at the end for alignment. For some peripherals, there will be reserved chunks between other registers - it all depends on that peripheral memory mapping.
typedef struct PortGroup {
rw_reg32_t DIR;
rw_reg32_t DIRCLR;
rw_reg32_t DIRSET;
rw_reg32_t DIRTGL;
rw_reg32_t OUT;
rw_reg32_t OUTCLR;
rw_reg32_t OUTSET;
rw_reg32_t OUTTGL;
ro_reg32_t IN;
rw_reg32_t CTRL;
wo_reg32_t WRCONFIG;
rw_reg32_t EVCTRL;
rw_reg8_t PMUX[PORT_NUM_PMUX];
rw_reg8_t PINCFG[PORT_NUM_PINFCG];
reserved8_t reserved[PORT_GROUP_RESERVED];
} PORT_group_t;
Since the PORT peripheral has four units, and the PortGroup structure is packed to exactly model the memory mapping, I can create a parent structure that contains all of them.
typedef struct Port {
PORT_group_t GROUP[PORT_NUM_GROUPS];
} PORT_t;
And the final step is to associate this structure with an address.
#define PORT ((PORT_t *)PORT_REG_ADDR)
Note, this can still be de-referenced as before - it's a matter of style choice.
#define PORT (*(PORT_t *)PORT_REG_ADDR)
And now to set the register value as before...
PORT->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
Compiling (and linking) with the same options, this produces identical size info and disassembly:
Disassembly of section .text.startup.main:
00000000 <main>:
#include "defs/hw-v1.0.h"
void main (void) {
PORT->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
0: 4b01 ldr r3, [pc, #4] ; (8 <main+0x8>)
2: 2207 movs r2, #7
4: 609a str r2, [r3, #8]
}
6: 4770 bx lr
8: 41008000 .word 0x41008000
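The `#8` offset in the store above comes straight from the struct layout. One way to guard that layout is to check it at compile time, as in the following sketch. `port_group_head_t` is a hypothetical cut-down version containing only the first four registers; the checks assume a C11 compiler for `_Static_assert`.

```c
#include <stddef.h>
#include <stdint.h>

typedef volatile uint32_t rw_reg32_t;

/* Cut-down, hypothetical version of the port group: first four registers. */
typedef struct {
    rw_reg32_t DIR;
    rw_reg32_t DIRCLR;
    rw_reg32_t DIRSET;
    rw_reg32_t DIRTGL;
} port_group_head_t;

/* Compile-time checks that the layout matches the register offsets
   used in the #define version (0x00, 0x04, 0x08, 0x0C). */
_Static_assert(offsetof(port_group_head_t, DIR)    == 0x00, "DIR offset");
_Static_assert(offsetof(port_group_head_t, DIRCLR) == 0x04, "DIRCLR offset");
_Static_assert(offsetof(port_group_head_t, DIRSET) == 0x08, "DIRSET offset");
_Static_assert(offsetof(port_group_head_t, DIRTGL) == 0x0C, "DIRTGL offset");
_Static_assert(sizeof(port_group_head_t) == 0x10, "unexpected padding");
```

If a compiler ever inserted padding or a member were reordered, the build would fail instead of silently writing to the wrong register.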
Reusable Code
The first method is straightforward, but requires a lot of manual definitions and some ugly macros if you have more than one peripheral. What if we had 2 different PORT peripherals at different addresses (similar to a device that has more than one USART)? We can just create multiple structured PORT pointers:
#define PORT0 ((PORT_t *)PORT0_REG_ADDR)
#define PORT1 ((PORT_t *)PORT1_REG_ADDR)
Calling them individually looks like what you'd expect:
PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
Compiling results in:
[SIZE] : Calculating size from ELF file
text data bss dec hex
936 0 49184 50120 c3c8
Disassembly of section .text.startup.main:
00000000 <main>:
#include "defs/hw-v1.0.h"
void main (void) {
PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
0: 4903 ldr r1, [pc, #12] ; (10 <main+0x10>)
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
2: 4b04 ldr r3, [pc, #16] ; (14 <main+0x14>)
PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
4: 2007 movs r0, #7
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
6: 2270 movs r2, #112 ; 0x70
PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
8: 6088 str r0, [r1, #8]
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
a: 609a str r2, [r3, #8]
}
c: 4770 bx lr
e: bf00 nop
10: 41008000 .word 0x41008000
14: 4100a000 .word 0x4100a000
And the final step to make it all reusable...
static PORT_t * const PORT[] = {PORT0, PORT1};
static inline void
PORT_setDir(const uint8_t unit, const uint8_t group, const uint32_t pins) {
PORT[unit]->GROUP[group].DIRSET = pins;
}
/* ... */
PORT_setDir(0, 0, _BV(0) | _BV(1) | _BV(2));
PORT_setDir(1, 0, _BV(4) | _BV(5) | _BV(6));
And compiling will give identical size and (basically) the same disassembly as before.
Disassembly of section .text.startup.main:
00000000 <main>:
static PORT_t * const PORT[] = {PORT0, PORT1};
static inline void
PORT_setDir(const uint8_t unit, const uint8_t group, const uint32_t pins) {
PORT[unit]->GROUP[group].DIRSET = pins;
0: 4903 ldr r1, [pc, #12] ; (10 <main+0x10>)
2: 4b04 ldr r3, [pc, #16] ; (14 <main+0x14>)
4: 2007 movs r0, #7
6: 2270 movs r2, #112 ; 0x70
8: 6088 str r0, [r1, #8]
a: 609a str r2, [r3, #8]
void main (void) {
PORT_setDir(0, 0, _BV(0) | _BV(1) | _BV(2));
PORT_setDir(1, 0, _BV(4) | _BV(5) | _BV(6));
}
c: 4770 bx lr
e: bf00 nop
10: 41008000 .word 0x41008000
14: 4100a000 .word 0x4100a000
I would clean it up a bit more with a module library header, enumerated constants, etc., but this should give someone a starting point. Note that in these examples I am always calling with a CONSTANT unit and group. I know exactly what I'm writing to; I just want reusable code. More instructions will probably be needed if the unit or group cannot be optimized to compile-time constants. Speaking of which, if optimizations are not used, all of this goes out the window. YMMV.
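The reusable pattern can be exercised on a host by letting ordinary memory stand in for the peripherals. The sketch below is a simulation under assumptions, not the device header: `group_t`, `port_t`, `NUM_GROUPS`, `fake_port0`, `fake_port1`, `PORTS`, and `port_set_dir` are all hypothetical names, and the struct is cut down to four registers per group.

```c
#include <stdint.h>

typedef volatile uint32_t rw_reg32_t;

#define NUM_GROUPS 2  /* hypothetical; the answer's PORT_NUM_GROUPS is not shown */

typedef struct {
    rw_reg32_t DIR, DIRCLR, DIRSET, DIRTGL;
} group_t;

typedef struct {
    group_t GROUP[NUM_GROUPS];
} port_t;

/* Ordinary memory standing in for two peripherals (host-side simulation). */
static port_t fake_port0, fake_port1;

/* Table of unit base pointers, mirroring the answer's PORT[] array. */
static port_t *const PORTS[] = { &fake_port0, &fake_port1 };

static inline void
port_set_dir(uint8_t unit, uint8_t group, uint32_t pins) {
    PORTS[unit]->GROUP[group].DIRSET = pins;
}
```

On real hardware the table entries would be the fixed peripheral addresses, and with constant `unit` and `group` arguments an optimizing compiler can fold the table lookup away, as the listing above shows. This host version only demonstrates that the indexing reaches the intended register.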
Side Note on Bit Fields
Atmel further breaks a peripheral structure into individual typedef'd registers that have named bitfields in a union with the size of the register. This is the ARM CMSIS way, but it's not great IMO. Contrary to the information in some of the other answers, I know exactly how a compiler will pack this structure; however, I do not know how it will arrange bit fields without using special compiler attributes and flags. I would rather explicitly set and mask defined register bit field constant values. It also violates MISRA (as does some of what I've done here...) if you are worried about that.
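As an illustration of setting and masking explicit field constants rather than relying on bit-field layout, here is a short sketch. The field position and width (three bits at bits 6:4) and the names `FIELD_POS`, `FIELD_MASK`, `field_set`, and `field_get` are made up for the example and do not come from any device header.

```c
#include <stdint.h>

/* Hypothetical field: bits [6:4] of a 32-bit register. */
#define FIELD_POS   4u
#define FIELD_MASK  (0x7u << FIELD_POS)

/* Returns reg with the field replaced by value; other bits are preserved.
   Bits of value that do not fit in the field are discarded. */
static uint32_t field_set(uint32_t reg, uint32_t value) {
    return (reg & ~FIELD_MASK) | ((value << FIELD_POS) & FIELD_MASK);
}

/* Extracts the field value from reg. */
static uint32_t field_get(uint32_t reg) {
    return (reg & FIELD_MASK) >> FIELD_POS;
}
```

A typical use is a read-modify-write: read the register, pass the value through `field_set`, and write the result back. The bit positions are spelled out in the source, so they do not depend on how a particular compiler orders bit fields.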
