USART1 not giving any Putty output for Nucleo F411RE


Vendor: STM32
MC: Nucleo F411RE
Relevant Links: Data Sheet, Reference Manual, Nucleo Manual
Issue: I'm learning embedded bare metal using STM32, ARM Cortex M4 processor. I have configured USART2 with Putty correctly. USART2's output works just fine, even if I change Baud Rates. However, I cannot get USART1 to transmit anything on Putty at all.
Port: GPIOB
Pin: 6
APB2 Clock: 84MHz
Baud Rate: 115200
USART1_BRR = 84MHz / 115200 = 729 [i.e. 0x02D9]
Below is a screenshot of my clock configuration:
Here's my code:
#include <stm32f4xx.h>
void USART1_Init(void);
void USART1_Write(int ch);
void delayMs(int delay);
int main(void)
{
USART1_Init();
while(1) {
USART1_Write('K');
delayMs(100);
}
}
void USART1_Init(void)
{
RCC->AHB1ENR |= 0x0002;
RCC->APB2ENR |= 0x0010;
GPIOB->MODER |= 0x2000;
GPIOB->AFR[0] |= 0x7000000;
USART1->BRR = 0x02D9; // 115200 @ 84MHz
USART1->CR1 = 0x0008;
USART1->CR1 |= 0x2000;
}
void USART1_Write(int ch)
{
while (!(USART1->SR & 0x0080)) {}
USART1->DR = (ch & 0xFF);
}
void delayMs(int delay)
{
int i;
while (delay > 0) {
for (i = 0; i < 3195; i++) {}
--delay;
}
}
What I did:
I have checked that all the relevant configuration bits are being set. Below are screenshots from the RCC, GPIOB and USART1 registers:
At first, I tried using the default pins (PA9 and PA10) for USART1. But then I read somewhere that they might be configured for USB output, so I switched to PB6 and PB7 for USART1 TX and RX respectively.
I tried changing the baud rate, turning on DMAT (USART1->CR3), and changing GPIOB->OSPEEDR to high speed, but still nothing. I'm using Manjaro Linux on an x86 laptop. If it helps, I can provide more context around my laptop's configuration.
My suspicion is still that I'm not configuring USART1->BRR correctly, or that enabling USART1 as an alternate function requires a bit more than I have done.
I'm still a beginner at embedded and I tried whatever I could infer from the block diagram and the reference manuals. But I can't seem to get this working at all. Is there something more I have to do with USART1s on STM32 in order for this to work?

A complete example; no other code needed.
flash.s
.cpu cortex-m0
.thumb
.thumb_func
.global _start
_start:
stacktop: .word 0x20001000 @ vector 0: initial stack pointer
.word reset @ vector 1: reset handler
.word hang
.word hang
.thumb_func
reset:
bl notmain
b hang
.thumb_func
hang: b .
.align
.thumb_func
.globl PUT32
PUT32:
str r1,[r0]
bx lr
.thumb_func
.globl GET32
GET32:
ldr r0,[r0]
bx lr
.thumb_func
.globl dummy
dummy:
bx lr
flash.ld
MEMORY
{
rom : ORIGIN = 0x08000000, LENGTH = 0x1000
ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.bss : { *(.bss*) } > ram
}
notmain.c
void PUT32 ( unsigned int, unsigned int );
unsigned int GET32 ( unsigned int );
#define RCCBASE 0x40023800
#define RCC_CR (RCCBASE+0x00)
#define RCC_CFGR (RCCBASE+0x08)
#define RCC_APB1RSTR (RCCBASE+0x20)
#define RCC_AHB1ENR (RCCBASE+0x30)
#define RCC_APB1ENR (RCCBASE+0x40)
#define GPIOABASE 0x40020000
#define GPIOA_MODER (GPIOABASE+0x00)
#define GPIOA_AFRL (GPIOABASE+0x20)
#define USART2BASE 0x40004400
#define USART2_SR (USART2BASE+0x00)
#define USART2_DR (USART2BASE+0x04)
#define USART2_BRR (USART2BASE+0x08)
#define USART2_CR1 (USART2BASE+0x0C)
//PA2 is USART2_TX alternate function 7
//PA3 is USART2_RX alternate function 7
static int clock_init ( void )
{
unsigned int ra;
//switch to external clock.
ra=GET32(RCC_CR);
ra|=1<<16;
PUT32(RCC_CR,ra);
while(1) if(GET32(RCC_CR)&(1<<17)) break;
ra=GET32(RCC_CFGR);
ra&=~3;
ra|=1;
PUT32(RCC_CFGR,ra);
while(1) if(((GET32(RCC_CFGR)>>2)&3)==1) break;
return(0);
}
int uart2_init ( void )
{
unsigned int ra;
ra=GET32(RCC_AHB1ENR);
ra|=1<<0; //enable port A
PUT32(RCC_AHB1ENR,ra);
ra=GET32(RCC_APB1ENR);
ra|=1<<17; //enable USART2
PUT32(RCC_APB1ENR,ra);
ra=GET32(GPIOA_MODER);
ra&=~(3<<4); //PA2
ra&=~(3<<6); //PA3
ra|=2<<4; //PA2
ra|=2<<6; //PA3
PUT32(GPIOA_MODER,ra);
ra=GET32(GPIOA_AFRL);
ra&=~(0xF<<8); //PA2
ra&=~(0xF<<12); //PA3
ra|=0x7<<8; //PA2
ra|=0x7<<12; //PA3
PUT32(GPIOA_AFRL,ra);
ra=GET32(RCC_APB1RSTR);
ra|=1<<17; //reset USART2
PUT32(RCC_APB1RSTR,ra);
ra&=~(1<<17);
PUT32(RCC_APB1RSTR,ra);
//8000000/(16*115200) = 4.34 4+5/16
PUT32(USART2_BRR,0x45);
PUT32(USART2_CR1,(1<<3)|(1<<2)|(1<<13));
return(0);
}
void uart2_send ( unsigned int x )
{
while(1) if(GET32(USART2_SR)&(1<<7)) break;
PUT32(USART2_DR,x);
}
int notmain ( void )
{
unsigned int rx;
clock_init();
uart2_init();
for(rx=0;;rx++)
{
uart2_send(0x30+(rx&7));
}
return(0);
}
build
arm-linux-gnueabi-as --warn --fatal-warnings -mcpu=cortex-m4 flash.s -o flash.o
arm-linux-gnueabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m4 -mthumb -c notmain.c -o notmain.o
arm-linux-gnueabi-ld -nostdlib -nostartfiles -T flash.ld flash.o notmain.o -o notmain.elf
arm-linux-gnueabi-objdump -D notmain.elf > notmain.list
arm-linux-gnueabi-objcopy -O binary notmain.elf notmain.bin
arm-whatever-whatever will work (arm-none-eabi-, for example).
check file
Disassembly of section .text:
08000000 <_start>:
8000000: 20001000 andcs r1, r0, r0
8000004: 08000011 stmdaeq r0, {r0, r4}
8000008: 08000017 stmdaeq r0, {r0, r1, r2, r4}
800000c: 08000017 stmdaeq r0, {r0, r1, r2, r4}
08000010 <reset>:
8000010: f000 f86e bl 80000f0 <notmain>
8000014: e7ff b.n 8000016 <hang>
08000016 <hang>:
8000016: e7fe b.n 8000016 <hang>
Vector table looks good, there is half a chance this will boot.
Copy notmain.bin to the virtual drive created when you plug the card in.
It will blast 0123456701234567 forever on the virtual com port created by the debugger end of the board (115200 8N1).
Not that I am using the rx as shown, but you only seemed to set one of the two.
I don't see that you zeroed the MODER register bits before setting them.
The math looks wrong on the baud rate register unless you are magically setting the clock elsewhere then running this code after the fact (rather than normal power on/reset).
Same goes for AFRL; I didn't look at the register today, but any time you are changing bits (not just setting one bit to a one) you need to zero the other bits in the field. In your case 7 might be all the bits so an or-equals might work, but check that.
I would recommend you do it in one register write rather than the magic volatile pointer &= followed by a separate |= step. Instead do x = register; x &= ...; x |= ...; then register = x. That way the feature doesn't change modes twice, it just changes once. Whether it is okay to change twice depends on the feature, the peripheral, and how it reacts to writes (probably fine in this case, but this is the general rule).
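A minimal sketch of that read, modify, write-once pattern, checked here against plain integers rather than real registers. The helper names set_field, moder_set_af and afrl_set are made up for illustration; the field widths (2 bits per pin in MODER, 4 bits per pin in AFRL) match the shifts used in the code in this answer.

```c
#include <stdint.h>

/* Return `reg` with the `width`-bit field at bit position `shift`
   replaced by `value`; every other bit is left untouched.
   (Hypothetical helper, not part of any vendor header.) */
uint32_t set_field(uint32_t reg, unsigned shift, unsigned width, uint32_t value)
{
    uint32_t ones = (width < 32u) ? ((1u << width) - 1u) : 0xFFFFFFFFu;
    uint32_t mask = ones << shift;

    reg &= ~mask;                    /* zero the whole field first */
    reg |= (value << shift) & mask;  /* then set the new bits */
    return reg;
}

/* MODER uses 2 bits per pin; mode 2 selects alternate function. */
uint32_t moder_set_af(uint32_t moder, unsigned pin)
{
    return set_field(moder, pin * 2u, 2u, 2u);
}

/* AFRL covers pins 0..7 with 4 bits per pin. */
uint32_t afrl_set(uint32_t afrl, unsigned pin, uint32_t af)
{
    return set_field(afrl, (pin & 7u) * 4u, 4u, af);
}
```

On the target this would be used as one read and one write per register, for example GPIOB->MODER = moder_set_af(GPIOB->MODER, 6); so the pin only changes mode once. For PB6 it produces the same 0x2000 and 0x7000000 constants as the question, but it also gives the right answer when the field did not start out as zero.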
If you are doing some other magic for the clock, that code may also be messing with the uart, so it is a good idea to just reset it; that is good practice in general for a peripheral like this (it might even have been in the docs, I have not looked in a while). Otherwise you are not at a known state (true for anything in your program if you are pre-running something else), and you can't simply touch a few fields in a few registers; you have to touch the whole peripheral.
I don't think the clock init above is required, I simply switched it to use a crystal based clock rather than the on chip RC clock.
Edit
Very sorry, I re-read your question. I have left the above as is even though it is not what you asked; the modification below makes the uart send on PA9 using USART1_TX.
notmain.c
void PUT32 ( unsigned int, unsigned int );
unsigned int GET32 ( unsigned int );
#define RCCBASE 0x40023800
#define RCC_CR (RCCBASE+0x00)
#define RCC_CFGR (RCCBASE+0x08)
//#define RCC_APB1RSTR (RCCBASE+0x20)
#define RCC_APB2RSTR (RCCBASE+0x24)
#define RCC_AHB1ENR (RCCBASE+0x30)
//#define RCC_APB1ENR (RCCBASE+0x40)
#define RCC_APB2ENR (RCCBASE+0x44)
#define GPIOABASE 0x40020000
#define GPIOA_MODER (GPIOABASE+0x00)
//#define GPIOA_AFRL (GPIOABASE+0x20)
#define GPIOA_AFRH (GPIOABASE+0x24)
#define USART1BASE 0x40011000
#define USART1_SR (USART1BASE+0x00)
#define USART1_DR (USART1BASE+0x04)
#define USART1_BRR (USART1BASE+0x08)
#define USART1_CR1 (USART1BASE+0x0C)
//PA9 is USART1_TX alternate function 7
static int clock_init ( void )
{
unsigned int ra;
//switch to external clock.
ra=GET32(RCC_CR);
ra|=1<<16;
PUT32(RCC_CR,ra);
while(1) if(GET32(RCC_CR)&(1<<17)) break;
ra=GET32(RCC_CFGR);
ra&=~3;
ra|=1;
PUT32(RCC_CFGR,ra);
while(1) if(((GET32(RCC_CFGR)>>2)&3)==1) break;
return(0);
}
int uart_init ( void )
{
unsigned int ra;
ra=GET32(RCC_AHB1ENR);
ra|=1<<0; //enable port A
PUT32(RCC_AHB1ENR,ra);
ra=GET32(RCC_APB2ENR);
ra|=1<<4; //enable USART1
PUT32(RCC_APB2ENR,ra);
ra=GET32(GPIOA_MODER);
ra&=~(3<<(9<<1)); //PA9
ra|= 2<<(9<<1) ; //PA9
PUT32(GPIOA_MODER,ra);
ra=GET32(GPIOA_AFRH);
ra&=~(0xF<<4); //PA9
ra|= 0x7<<4; //PA9
PUT32(GPIOA_AFRH,ra);
ra=GET32(RCC_APB2RSTR);
ra|=1<<4; //reset USART1
PUT32(RCC_APB2RSTR,ra);
ra&=~(1<<4);
PUT32(RCC_APB2RSTR,ra);
//8000000/(16*115200) = 4.34 4+5/16
PUT32(USART1_BRR,0x45);
PUT32(USART1_CR1,(1<<3)|(1<<2)|(1<<13));
return(0);
}
void uart_send ( unsigned int x )
{
while(1) if(GET32(USART1_SR)&(1<<7)) break;
PUT32(USART1_DR,x);
}
int notmain ( void )
{
unsigned int rx;
clock_init();
uart_init();
for(rx=0;;rx++)
{
uart_send(0x30+(rx&7));
}
return(0);
}
PA9 is tied to an external header pin, Arduino style data pin, very unlikely they would also use that for usb.
MODER resets to zeros for these pins so an or equal will work.
AFRL and AFRH reset to zero so an or equal will work.
To see the output you need to connect a uart device to PA9, the data does not go through the virtual com port if you want to see UART1 work.
I changed the clock from 16MHz to 8MHz, so for this chip's uart (ST has different peripherals in their library that they pick and choose from when they make a chip):
//8000000/(16*115200) = 4.34 4+5/16
PUT32(USART1_BRR,0x45);
If you think about it 8000000/115200 = 69.444 = 0x45. You don't need to do the fraction math separately.
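As a sketch of that arithmetic (assuming the default 16x oversampling, where the BRR value is simply the peripheral clock divided by the baud rate), the hypothetical helper below rounds to the nearest integer and reproduces the values quoted in this thread.

```c
#include <stdint.h>

/* BRR value for 16x oversampling: peripheral clock / baud rate, rounded
   to nearest. The low 4 bits of the result are the fraction in 1/16ths,
   the bits above them are the mantissa.
   (usart_brr is an illustrative name, not a vendor function.) */
uint32_t usart_brr(uint32_t pclk_hz, uint32_t baud)
{
    return (pclk_hz + baud / 2u) / baud;
}
```

With this, 8MHz gives 0x45 (mantissa 4, fraction 5/16), 16MHz gives 0x8B, and 84MHz gives 0x2D9, so the value in the question is only right if the peripheral clock really is 84MHz when the code runs.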
So, looking at your code, you are using PB6, which is fine for USART1_TX as alternate function 7. It all looks fine except for the BRR. Your delay function may also be dead code that gets optimized out, but since you wait for the tx empty status bit before adding a character, your code should still work.
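If you do want the delay to survive optimization, one common approach is a volatile loop counter. This is only a sketch: the 3195 iterations per millisecond is the figure from the question and would need calibrating for the actual core clock, and the return value exists only so the loop can be checked on a host.

```c
/* Busy-wait delay. The loop counter is volatile, so the compiler has to
   perform every read and write of it and cannot delete the loop at -O2.
   3195 iterations per millisecond is the question's figure; the real
   number depends on the core clock. */
unsigned long delay_ms(unsigned ms)
{
    unsigned long spins = 0;

    while (ms-- > 0) {
        for (volatile unsigned i = 0; i < 3195u; i++) {
            spins++;
        }
    }
    return spins;  /* returned only so the sketch can be checked */
}
```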
PB6 is one of the header pins, so you can hook a (3.3V) uart up to it and see if your data is coming out. If nothing else, I would recommend you simply try 16000000/115200 = 138.8 = 0x8A or 0x8B in the BRR; why not, it only takes a second.
Otherwise, if you have a scope, put a probe on there. Instead of the letter K, I recommend using U, which is 0x55; with 8N1 it comes out as a square wave when you transmit as fast as you can (no gaps between characters), which is really easy to measure on a scope. Then mess with your BRR register and see how that changes the frequency of the output on the scope.
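To see why U gives a square wave, here is a small host-side sketch that lays out the line levels of one 8N1 frame (start bit low, 8 data bits LSB first, stop bit high). The function names are made up for illustration.

```c
#include <stddef.h>

/* Write the 10 line levels of one 8N1 frame for `ch` into `out` as '0'/'1'
   characters plus a terminating NUL: start bit, 8 data bits LSB first,
   stop bit. */
void frame_8n1(unsigned char ch, char out[11])
{
    size_t n = 0;

    out[n++] = '0';                                  /* start bit is low */
    for (unsigned bit = 0; bit < 8; bit++)
        out[n++] = ((ch >> bit) & 1u) ? '1' : '0';   /* LSB goes out first */
    out[n++] = '1';                                  /* stop bit is high */
    out[n] = '\0';
}

/* Return 1 if every level in the frame differs from the one before it. */
int is_square_wave(unsigned char ch)
{
    char f[11];

    frame_8n1(ch, f);
    for (size_t i = 1; i < 10; i++)
        if (f[i] == f[i - 1])
            return 0;
    return 1;
}
```

For U the frame is 0101010101, and because it ends high and the next start bit is low, back-to-back characters keep alternating. For K the frame is 0110100101, which is much harder to read on a scope.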
This uses USART1_TX on PB6, and I removed the crystal clock init so it is using the 16MHz HSI clock.
Fundamentally the difference here is you have a different BRR setting.
void PUT32 ( unsigned int, unsigned int );
unsigned int GET32 ( unsigned int );
#define RCCBASE 0x40023800
#define RCC_CR (RCCBASE+0x00)
#define RCC_CFGR (RCCBASE+0x08)
//#define RCC_APB1RSTR (RCCBASE+0x20)
#define RCC_APB2RSTR (RCCBASE+0x24)
#define RCC_AHB1ENR (RCCBASE+0x30)
//#define RCC_APB1ENR (RCCBASE+0x40)
#define RCC_APB2ENR (RCCBASE+0x44)
#define GPIOBBASE 0x40020400
#define GPIOB_MODER (GPIOBBASE+0x00)
#define GPIOB_AFRL (GPIOBBASE+0x20)
#define USART1BASE 0x40011000
#define USART1_SR (USART1BASE+0x00)
#define USART1_DR (USART1BASE+0x04)
#define USART1_BRR (USART1BASE+0x08)
#define USART1_CR1 (USART1BASE+0x0C)
int uart_init ( void )
{
unsigned int ra;
ra=GET32(RCC_AHB1ENR);
ra|=1<<1; //enable port B
PUT32(RCC_AHB1ENR,ra);
ra=GET32(RCC_APB2ENR);
ra|=1<<4; //enable USART1
PUT32(RCC_APB2ENR,ra);
ra=GET32(GPIOB_MODER);
ra&=~(3<<(6<<1)); //PB6
ra|= 2<<(6<<1) ; //PB6
PUT32(GPIOB_MODER,ra);
ra=GET32(GPIOB_AFRL);
ra&=~(0xF<<24); //PB6
ra|= 0x7<<24; //PB6
PUT32(GPIOB_AFRL,ra);
ra=GET32(RCC_APB2RSTR);
ra|=1<<4; //reset USART1
PUT32(RCC_APB2RSTR,ra);
ra&=~(1<<4);
PUT32(RCC_APB2RSTR,ra);
//16000000/115200
PUT32(USART1_BRR,0x8B);
PUT32(USART1_CR1,(1<<3)|(1<<13));
return(0);
}
void uart_send ( unsigned int x )
{
while(1) if(GET32(USART1_SR)&(1<<7)) break;
PUT32(USART1_DR,x);
}
int notmain ( void )
{
unsigned int rx;
uart_init();
for(rx=0;;rx++)
{
uart_send(0x30+(rx&7));
}
return(0);
}
Also note that when blasting at this rate, depending on the data pattern, it is possible for the receiver to get out of sync such that the characters received are not the ones sent. You may need to press and hold the reset button on the board, then release it and see if the receiver then sees the pattern as desired. Perhaps that is why you are blasting K instead of U or something else.
The PB6 pin is two pins above the PA9 pin on the right side of the board (D10 instead of D8). Note that the male pins to the right are a half step below the female Arduino header pins; look at the documentation for the board to find out where to hook your uart up.

Related

Implementation timecounter_init (clocksource.h) on kernel space gives Invalid ISA State Error

I'm implementing an embedded Linux OS on my stm32f429 with ARM Cortex-M4. In my code I need to use the clocksource.h library to measure elapsed time. The code of my system call is:
#include <linux/syscalls.h>
#include <linux/gpio.h>
#include <linux/delay.h>
#include <linux/time.h>
#include <linux/clocksource.h>
SYSCALL_DEFINE2(mycall, int*, arg1, char *, arg2)
{
struct timecounter *tc;
struct cyclecounter *cc;
long int value;
local_irq_disable(); /*Disabling local interrupt*/
spin_lock(&my_lock);
timecounter_init(&tc,&cc,0);
printk(KERN_INFO "PRINT\n");
value = timecounter_read(&tc);
local_irq_enable(); /*Enabling local interrupt*/
return( value );
}
When I load my test app on my board and run it, the output is:
KERNEL: fault at 0xd00877e0 [pc=0xd00877e0, sp=0xd0019f58]
Escalated to Hard Fault
Invalid ISA state
Pid: 17, comm: test
CPU: 0 Not tainted (2.6.33-arm1 #594)
pc : [<d00877e0>] lr : [<d009ec19>] psr: 6000000b
sp : d0019f58 ip : 08060501 fp : 00000000
Code dump at pc [d00877e0]:
d0094a69 d009122f d0087dc5 d00b2231
r10: 01000000 r9 : d0018000 r8 : d0257ec4
r7 : 00000000 r6 : 00000000 r5 : ffffffff r4 : d0019f84
r3 : d00877e0 r2 : 00000000 r1 : d0019f80 r0 : d0019f80
Flags: nZCv IRQs on FIQs on Mode UK11_26 ISA ARM Segment kernel
Kernel panic - not syncing:
Rebooting in 10 seconds..
How can I resolve this? Must I initialize the cyclecounter cc? If yes, how?

If you use the SDK from Emcraft then you need to check CFLAGS and LDFLAGS
like this
CFLAGS="-Os -mcpu=cortex-m3 -mthumb -I /opt/linux-cortexm-1-12/A2F/root/usr/include"
and
LDFLAGS="-mcpu=cortex-m3 -mthumb -L/opt/linux-cortexm-1-12/A2F/root/usr/lib"

Decoding this assembly inline code snippet on PowerPc

I have this below code snippet from kernel source for PowerPc
#define SPRN_IVOR32 0x210 /* Interrupt Vector Offset Register 32 */
unsigned long ivor[3];
ivor[0] = mfspr(SPRN_IVOR32);
#define __stringify_1(x) #x
#define __stringify(x) __stringify_1(x)
#define mfspr(rn) ({unsigned long rval; \
asm volatile("mfspr %0," __stringify(rn) \
: "=r" (rval)); rval; })
Also, is the above exercise about emulating the MSR register's bits on PowerPC?
Can anyone help me on what exactly we are doing here?

The mfspr macro generates an asm instruction mfspr which reads the given special purpose register into a register chosen by the compiler, which then gets assigned to rval hence becomes the return value of the expression.
As the comment says, SPRN_IVOR32 is the Interrupt Vector Offset Register 32, whose contents are thus fetched into ivor[0].
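The two-level __stringify is what turns the register number into text inside the asm string. A host-side sketch of just that preprocessor step, using the hypothetical names STRINGIFY and STRINGIFY_1 in place of the kernel's macros so it can be compiled anywhere:

```c
#include <string.h>

/* Same two-level stringification trick as the kernel macros, under
   hypothetical names (STRINGIFY, STRINGIFY_1) chosen for this sketch. */
#define SPRN_IVOR32    0x210
#define STRINGIFY_1(x) #x
#define STRINGIFY(x)   STRINGIFY_1(x)

/* The instruction text the compiler sees after preprocessing: adjacent
   string literals are concatenated into one template string. */
const char *mfspr_asm_text(void)
{
    return "mfspr %0," STRINGIFY(SPRN_IVOR32);
}

/* One level only: the argument is stringified before it is expanded,
   so the macro name itself ends up in the string. */
const char *one_level_text(void)
{
    return "mfspr %0," STRINGIFY_1(SPRN_IVOR32);
}

/* Return 1 if the two-level form produced the numeric register. */
int expands_to_number(void)
{
    return strcmp(mfspr_asm_text(), "mfspr %0,0x210") == 0;
}
```

So the assembler is handed mfspr %0,0x210, with %0 replaced by whichever general purpose register the compiler picks for rval.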

__attribute__((OS_main)) results in strange behaviour in AVR

I don't know how to precisely describe the error I am seeing. If I set up my port register in main() everything works as intended. However, if I try to do it in a function, the program halts.
main.c:
__attribute__((OS_main)) int main(void);
int main(void) {
DDRD = 0xF0;
PORTD = 0xF0;
led( LED_GREEN, true );
while( true );
}
This turns on the green LED. However, if I move the port setup to a separate function nothing happens, like so:
__attribute__((OS_main)) int main(void);
int main(void) {
hwInit();
led( LED_GREEN, true );
while( true );
}
The culprit seems to be the attribute line because if I comment it out the second example works as expected. My problem is understanding why, since as I understand it the OS_main attribute should only tell the compiler that it should not store any registers upon entry or exit to the function. Is this not correct?

The following was compiled with avr-gcc 4.8.0 under ArchLinux. The distribution should be irrelevant to the situation, compiler and compiler version however, may produce different outputs. The code:
#include <avr/io.h>
#define LED_GREEN PD7
#define led(p, s) { if(s) PORTD |= _BV(p); \
else PORTD &= ~_BV(p); }
__attribute__((OS_main)) int main(void);
__attribute__((noinline)) void hwInit(void);
void hwInit(void){
DDRD = 0xF0;
}
int main(void){
hwInit();
led(LED_GREEN, 1);
while(1);
}
Generates:
000000a4 <hwInit>:
a4: 80 ef ldi r24, 0xF0 ; 240
a6: 8a b9 out 0x0a, r24 ; 10
a8: 08 95 ret
000000aa <main>:
aa: 0e 94 52 00 call 0xa4 ; 0xa4 <hwInit>
ae: 5f 9a sbi 0x0b, 7 ; 11
b0: ff cf rjmp .-2 ; 0xb0 <main+0x6>
when compiled with
avr-gcc -Wall -Os -fpack-struct -fshort-enums -std=gnu99 -funsigned-char -funsigned-bitfields -mmcu=atmega168 -DF_CPU=1000000UL -MMD -MP -MF"core.d" -MT"core.d" -c -o "core.o" "../core.c" and linked accordingly.
Commenting out __attribute__((OS_main)) int main(void); in the above source has no effect on the generated assembly. Curiously, however, removing the noinline directive from hwInit() has the effect of the compiler inlining the function into main, as expected, but the function itself still exists as part of the final binary, even when compiled with -Os.
This leads me to believe that your compiler version/arguments to the compiler are generating some assembly which is not correct. If possible, can you post the disassembly for the relevant areas for further examination?
Edited late to add two contributions, second of which resolves the problem at hand:
Hanno Binder states: "In order to remove those 'unused' functions from the binary you would also need -ffunction-sections -Wl,--gc-sections."
Asker adds [paraphrased]: "I followed a tutorial which neglected to mention the avr-objcopy step to create the hex file. I assumed that compiling and linking the project for the correct target was sufficient (which it for some reason was for basic functionality). After having added an avr-objcopy step to generate the file everything works."

Nvidia CUDA - passing struct by pointer

I have a problem with passing a pointer to the struct to the device function.
I want to create a struct in local memory (I know it's slow, it's just an example) and pass it to the other function by pointer. The problem is that when I debug it with memcheck on, I get this error:
Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 7, warp 0, lane 0
0x0000000000977608 in foo (st=0x3fffc38) at test.cu:15
15 st->m_tx = 99;
If I debug it without memcheck on, it works fine and gives expected results.
My OS is RedHat 6.3 64-bits with Kernel 2.6.32-220.
I use GTX680, CUDA 5.0 and compile the program with sm=30.
Code I used for testing this is below:
typedef struct __align__(8) {
int m_x0;
int m_tx;
} myStruct;
__device__ void foo(myStruct *st) {
st->m_tx = 99;
st->m_x0 = 123;
}
__global__ void myKernel(){
myStruct m_struct ;
m_struct.m_tx = 45;
m_struct.m_x0 = 90;
foo(&m_struct);
}
int main(void) {
myKernel <<<1,1 >>>();
cudaThreadSynchronize();
return 0;
}
Any suggestions? Thanks for any help.

Your example code is completely optimised away by the compiler because none of the code contributes to a global memory write. This is easily proved by compiling the kernel to a cubin file and disassembling the result with cuobjdump:
$ nvcc -arch=sm_20 -Xptxas="-v" -cubin struct.cu
ptxas info : Compiling entry function '_Z8myKernelv' for 'sm_20'
ptxas info : Function properties for _Z8myKernelv
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 32 bytes cmem[0]
$ cuobjdump -sass struct_dumb.cubin
code for sm_20
Function : _Z8myKernelv
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x00001de780000000*/ EXIT;
.............................
i.e. the kernel is completely empty. The debugger can't debug the code you want to investigate because it does not exist in what the compiler/assembler emitted. If we take a few liberties with your code:
typedef struct __align__(8) {
int m_x0;
int m_tx;
} myStruct;
__device__ __noinline__ void foo(myStruct *st) {
st->m_tx = 99;
st->m_x0 = 123;
}
__global__ void myKernel(int dowrite, int *output){
myStruct m_struct ;
m_struct.m_tx = 45;
m_struct.m_x0 = 90;
if (dowrite) {
foo(&m_struct);
output[threadIdx.x] = m_struct.m_tx + m_struct.m_x0;
}
}
int main(void) {
int * output;
cudaMalloc((void **)(&output), sizeof(int));
myKernel <<<1,1 >>>(1, output);
cudaThreadSynchronize();
return 0;
}
and repeat the same compilation and disassembly steps, things look somewhat different:
$ nvcc -arch=sm_20 -Xptxas="-v" -cubin struct_dumb.cu
ptxas info : Compiling entry function '_Z8myKerneliPi' for 'sm_20'
ptxas info : Function properties for _Z8myKerneliPi
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z3fooP8myStruct
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 5 registers, 40 bytes cmem[0]
$ /usr/local/cuda/bin/cuobjdump -sass struct_dumb.cubin
code for sm_20
Function : _Z8myKerneliPi
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x20105d034800c000*/ IADD R1, R1, -0x8;
/*0010*/ /*0x68009de218000001*/ MOV32I R2, 0x5a;
/*0018*/ /*0xb400dde218000000*/ MOV32I R3, 0x2d;
/*0020*/ /*0x83f1dc23190e4000*/ ISETP.EQ.AND P0, pt, RZ, c [0x0] [0x20], pt;
/*0028*/ /*0x00101c034800c000*/ IADD R0, R1, 0x0;
/*0030*/ /*0x00109ca5c8000000*/ STL.64 [R1], R2;
/*0038*/ /*0x000001e780000000*/ @P0 EXIT;
/*0040*/ /*0x10011c0348004000*/ IADD R4, R0, c [0x0] [0x4];
/*0048*/ /*0xc001000750000000*/ CAL 0x80;
/*0050*/ /*0x00009ca5c0000000*/ LDL.64 R2, [R0];
/*0058*/ /*0x84011c042c000000*/ S2R R4, SR_Tid_X;
/*0060*/ /*0x90411c4340004000*/ ISCADD R4, R4, c [0x0] [0x24], 0x2;
/*0068*/ /*0x0c201c0348000000*/ IADD R0, R2, R3;
/*0070*/ /*0x00401c8590000000*/ ST [R4], R0;
/*0078*/ /*0x00001de780000000*/ EXIT;
/*0080*/ /*0x8c00dde218000001*/ MOV32I R3, 0x63;
/*0088*/ /*0xec009de218000001*/ MOV32I R2, 0x7b;
/*0090*/ /*0x1040dc8590000000*/ ST [R4+0x4], R3;
/*0098*/ /*0x00409c8590000000*/ ST [R4], R2;
/*00a0*/ /*0x00001de790000000*/ RET;
...............................
we get actual code in the assembler output. You might have more luck in the debugger with that.

I am from the CUDA developer tools team. When compiled for device side debug (i.e. -G), the original code will not be optimized out. The issue looks like a memcheck bug. Thank you for finding this. We will look into it.

gcc optimization, const static object, and restrict

I'm working on an embedded project and I'm trying to add more structure to some of the code, which uses macros to optimize access to registers for USARTs. I'd like to organize preprocessor #define'd register addresses into const structures. If I define the structs as compound literals in a macro and pass them to inline'd functions, gcc has been smart enough to bypass the pointer in the generated assembly and hardcode the structure member values directly in the code. E.g.:
C1:
struct uart {
volatile uint8_t * ucsra, * ucsrb, *ucsrc, * udr;
volatile uint16_t * ubrr;
};
#define M_UARTX(X) \
( (struct uart) { \
.ucsra = &UCSR##X##A, \
.ucsrb = &UCSR##X##B, \
.ucsrc = &UCSR##X##C, \
.ubrr = &UBRR##X, \
.udr = &UDR##X, \
} )
void inlined_func(const struct uart * p, other_args...) {
...
(*p->ucsra) = 0;
(*p->ucsrb) = 0;
(*p->ucsrc) = 0;
}
...
int main(){
...
inlined_func(&M_UARTX(0), other_parms...);
...
}
Here UCSR0A, UCSR0B, &c, are defined as the uart registers as l-values, like
#define UCSR0A (*(uint8_t*)0xFFFF)
gcc was able to eliminate the structure literal entirely, and all assignments like that shown in inlined_func() write directly into the register address, w/o having to read the register's address into a machine register, and w/o indirect addressing:
A1:
movb $0, UCSR0A
movb $0, UCSR0B
movb $0, UCSR0C
This writes the values directly into the USART registers, w/o having to load the addresses into a machine register, and so never needs to generate the struct literal into the object file at all. The struct literal becomes a compile-time structure, with no cost in the generated code for the abstraction.
I wanted to get rid of the use of the macro, and tried using a static constant struct defined in the header:
C2:
#define M_UART0 M_UARTX(0)
#define M_UART1 M_UARTX(1)
static const struct uart * const uart[2] = { &M_UART0, &M_UART1 };
....
int main(){
...
inlined_func(uart[0], other_parms...);
...
}
However, gcc cannot remove the struct entirely here:
A2:
movl __compound_literal.0, %eax
movb $0, (%eax)
movl __compound_literal.0+4, %eax
movb $0, (%eax)
movl __compound_literal.0+8, %eax
movb $0, (%eax)
This loads the register addresses into a machine register, and uses indirect addressing to write to the register. Does anyone know any way I can convince gcc to generate A1 assembly code for C2 C code? I've tried various uses of the __restrict modifier, to no avail.

After many years of experience with UARTs and USARTs, I have come to these conclusions:
Don't use a struct for a 1:1 mapping with UART registers.
Compilers can add padding between struct members without your knowledge, thus messing up the 1:1 correspondence.
Writing to UART registers is best done directly or through a function.
Remember to use volatile modifier when defining pointers to the registers.
Very little performance gain with Assembly language
Assembly language should only be used if the UART is accessed through processor ports rather than memory-mapped. The C language has no support for ports. Accessing UART registers through pointers is very efficient (generate an assembly language listing and verify). Sometimes, it may take more time to code in assembly and verify.
Isolate UART functionality into a separate library
This is a good candidate. Besides, once the code has been tested, let it be. Libraries don't have to be (re)compiled all the time.
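A minimal sketch of the "through a function, with volatile pointers" advice above. The names reg8_write, reg8_read and uart_putc are invented for this example, and the status/data register layout is an assumption rather than any particular UART.

```c
#include <stdint.h>

/* Write one byte to a device register. The volatile qualifier tells the
   compiler that the store has a side effect and must not be removed,
   merged or reordered with other volatile accesses. */
void reg8_write(volatile uint8_t *reg, uint8_t value)
{
    *reg = value;
}

/* Read one byte from a device register; the load is always performed. */
uint8_t reg8_read(const volatile uint8_t *reg)
{
    return *reg;
}

/* Hypothetical transmit helper: poll a "ready" flag in a status register,
   then write the character to the data register. */
void uart_putc(volatile uint8_t *status, volatile uint8_t *data,
               uint8_t ready_mask, uint8_t ch)
{
    while ((reg8_read(status) & ready_mask) == 0) {
        /* wait for the transmitter to become ready */
    }
    reg8_write(data, ch);
}
```

Because the helpers only take pointers, the same code can be exercised on a host by pointing it at ordinary variables before it is pointed at real register addresses.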

Using structs "across compile domains" is a cardinal sin in my book, that is, using a struct to point at something, anything: file data, memory, etc. The reason is that it will fail; it is not reliable, no matter the compiler. There are many compiler-specific flags and pragmas for this, but the better solution is to just not do it. If you want to point at address plus 8, point at address plus 8: use a pointer or an array. In this specific case I have had way too many compilers fail to do even that, so I write assembler PUT32/GET32 and PUT16/GET16 functions to guarantee that the compiler doesn't mess with my register accesses. As with structs, you will get burned one day and have a hell of a time figuring out why your 32-bit register only had 8 bits written to it. The overhead of the jump to the function is worth the peace of mind and the reliability and portability of the code. It also makes your code extremely portable: you can put wrappers in for the put and get functions to cross networks, run your hardware in an HDL simulator and reach into the simulation to read/write registers, etc., with a single chunk of code that doesn't change from simulation to embedded to OS device driver to application-layer function.
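A sketch of the portability point, under stated assumptions: here PUT32/GET32 are backed by a plain array standing in for the hardware (the SIM_BASE window and sim_regs array are inventions for this example), while the driver function above them is written only in terms of PUT32/GET32 and would not change if they were swapped for the assembler versions.

```c
#include <stdint.h>

/* Simulated 4 KiB register window standing in for real hardware.
   SIM_BASE and sim_regs are assumptions made for this sketch. */
#define SIM_BASE  0x40011000u
#define SIM_WORDS 1024u
static uint32_t sim_regs[SIM_WORDS];

/* Backend: on a target these two would be the assembler str/ldr
   routines; here they index the simulated window instead. */
void PUT32(uint32_t addr, uint32_t value)
{
    sim_regs[((addr - SIM_BASE) / 4u) % SIM_WORDS] = value;
}

uint32_t GET32(uint32_t addr)
{
    return sim_regs[((addr - SIM_BASE) / 4u) % SIM_WORDS];
}

/* Driver code: touches registers only through PUT32/GET32. */
#define DEMO_BRR (SIM_BASE + 0x08u)
#define DEMO_CR1 (SIM_BASE + 0x0Cu)

void demo_uart_config(uint32_t brr)
{
    PUT32(DEMO_BRR, brr);
    PUT32(DEMO_CR1, (1u << 3) | (1u << 13));  /* same CR1 bits as the PB6 example in the first thread */
}
```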

Based on the register set, it looks like you are using an 8-bit Atmel AVR microcontroller (or something extremely similar). I'll show you some things I've used for Atmel's 32-bit ARM MCUs which is a slightly modified version of what they ship in their device packs.
Code Notation
I'm using various macros that I'm not going to include here, but they are defined to do basic operations or paste types (like UL) onto numbers. They are hidden in macros for the cases where something is not allowed (like in assembly). Yes, these are easy to break - it's on the programmer not to shoot themselves in the foot:
#define _PPU(_V) (_V##U) /* guarded with #if defined(__ASSEMBLY__) */
#define _BV(_V) (_PPU(1) << _PPU(_V)) /* Variants for U, L, UL, etc */
There are also typedefs for specific-length registers. Example:
/* Variants for 8, 16, 32-bit, RO, WO, & RW */
typedef volatile uint32_t rw_reg32_t;
typedef volatile const uint32_t ro_reg32_t;
The classic #define method
You can define the peripheral address with any register offsets...
#define PORT_REG_ADDR _PPUL(0x41008000)
#define PORT_ADDR_DIR (PORT_REG_ADDR + _PPU(0x00))
#define PORT_ADDR_DIRCLR (PORT_REG_ADDR + _PPU(0x04))
#define PORT_ADDR_DIRSET (PORT_REG_ADDR + _PPU(0x08))
#define PORT_ADDR_DIRTGL (PORT_REG_ADDR + _PPU(0x0C))
And de-referenced pointers to the register addresses...
#define PORT_DIR (*(rw_reg32_t *)PORT_ADDR_DIR)
#define PORT_DIRCLR (*(rw_reg32_t *)PORT_ADDR_DIRCLR)
#define PORT_DIRSET (*(rw_reg32_t *)PORT_ADDR_DIRSET)
#define PORT_DIRTGL (*(rw_reg32_t *)PORT_ADDR_DIRTGL)
And then directly set values in the register:
PORT_DIRSET = _BV(0) | _BV(1) | _BV(2);
Compiling in GCC with some other startup code...
arm-none-eabi-gcc -c -x c -mthumb -mlong-calls -mcpu=cortex-m4 -pipe
-std=c17 -O2 -Wall -Wextra -Wpedantic main.c
[SIZE] : Calculating size from ELF file
text data bss dec hex
924 0 49184 50108 c3bc
With disassembly:
00000000 <main>:
#include "defs/hw-v1.0.h"
void main (void) {
PORT_DIRSET = _BV(0) | _BV(1) | _BV(2);
0: 4b01 ldr r3, [pc, #4] ; (8 <main+0x8>)
2: 2207 movs r2, #7
4: 601a str r2, [r3, #0]
}
6: 4770 bx lr
8: 41008008 .word 0x41008008
The "new" structured method
You still define a base address as before as well as some numerical constants (like some number of instances), but instead of defining individual register addresses, you create a structure that models the peripheral. Note, I manually include some reserved space at the end for alignment. For some peripherals, there will be reserved chunks between other registers - it all depends on that peripheral memory mapping.
typedef struct PortGroup {
rw_reg32_t DIR;
rw_reg32_t DIRCLR;
rw_reg32_t DIRSET;
rw_reg32_t DIRTGL;
rw_reg32_t OUT;
rw_reg32_t OUTCLR;
rw_reg32_t OUTSET;
rw_reg32_t OUTTGL;
ro_reg32_t IN;
rw_reg32_t CTRL;
wo_reg32_t WRCONFIG;
rw_reg32_t EVCTRL;
rw_reg8_t PMUX[PORT_NUM_PMUX];
rw_reg8_t PINCFG[PORT_NUM_PINFCG];
reserved8_t reserved[PORT_GROUP_RESERVED];
} PORT_group_t;
Since the PORT peripheral has four units, and the PortGroup structure is packed to exactly model the memory mapping, I can create a parent structure that contains all of them.
typedef struct Port {
PORT_group_t GROUP[PORT_NUM_GROUPS];
} PORT_t;
And the final step is to associate this structure with an address.
#define PORT ((PORT_t *)PORT_REG_ADDR)
Note, this can still be de-referenced as before - it's a matter of style choice.
#define PORT (*(PORT_t *)PORT_REG_ADDR)
And now to set the register value as before...
PORT->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
Compiling (and linking) with the same options, this produces identical size info and disassembly:
Disassembly of section .text.startup.main:
00000000 <main>:
#include "defs/hw-v1.0.h"
void main (void) {
PORT->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
0: 4b01 ldr r3, [pc, #4] ; (8 <main+0x8>)
2: 2207 movs r2, #7
4: 609a str r2, [r3, #8]
}
6: 4770 bx lr
8: 41008000 .word 0x41008000
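Since the structured method depends on the struct matching the register map exactly, one way to guard against padding surprises is to assert the offsets at compile time. This is a sketch with a hypothetical four-register block (demo_port_t), not the real device header; it assumes a C11 compiler for _Static_assert.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical four-register block with the same leading layout as the
   DIR/DIRCLR/DIRSET/DIRTGL registers used in the examples here. */
typedef struct {
    volatile uint32_t DIR;
    volatile uint32_t DIRCLR;
    volatile uint32_t DIRSET;
    volatile uint32_t DIRTGL;
} demo_port_t;

/* Fail the build if the compiler lays the struct out differently from
   the register map (offsets 0x00, 0x04, 0x08, 0x0C, total 0x10). */
_Static_assert(offsetof(demo_port_t, DIR)    == 0x00, "DIR offset");
_Static_assert(offsetof(demo_port_t, DIRCLR) == 0x04, "DIRCLR offset");
_Static_assert(offsetof(demo_port_t, DIRSET) == 0x08, "DIRSET offset");
_Static_assert(offsetof(demo_port_t, DIRTGL) == 0x0C, "DIRTGL offset");
_Static_assert(sizeof(demo_port_t) == 0x10, "register block size");

/* Set direction bits through the struct view of a register block. */
void demo_set_dir(demo_port_t *port, uint32_t pins)
{
    port->DIRSET = pins;
}
```

The DIRSET offset of 8 is the same #8 that shows up in the str instruction of the disassembly above.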
Reusable Code
The first method is straightforward, but requires a lot of manual definitions and some ugly macros if you have more than one peripheral. What if we had 2 different PORT peripherals at different addresses (similar to a device that has more than one USART)? We can just create multiple structured PORT pointers:
#define PORT0 ((PORT_t *)PORT0_REG_ADDR)
#define PORT1 ((PORT_t *)PORT1_REG_ADDR)
Calling them individually looks like what you'd expect:
PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
Compiling results in:
[SIZE] : Calculating size from ELF file
text data bss dec hex
936 0 49184 50120 c3c8
Disassembly of section .text.startup.main:
00000000 <main>:
#include "defs/hw-v1.0.h"
void main (void) {
PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
0: 4903 ldr r1, [pc, #12] ; (10 <main+0x10>)
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
2: 4b04 ldr r3, [pc, #16] ; (14 <main+0x14>)
PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
4: 2007 movs r0, #7
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
6: 2270 movs r2, #112 ; 0x70
PORT0->GROUP[0].DIRSET = _BV(0) | _BV(1) | _BV(2);
8: 6088 str r0, [r1, #8]
PORT1->GROUP[0].DIRSET = _BV(4) | _BV(5) | _BV(6);
a: 609a str r2, [r3, #8]
}
c: 4770 bx lr
e: bf00 nop
10: 41008000 .word 0x41008000
14: 4100a000 .word 0x4100a000
And the final step to make it all reusable...
static PORT_t * const PORT[] = {PORT0, PORT1};
static inline void
PORT_setDir(const uint8_t unit, const uint8_t group, const uint32_t pins) {
PORT[unit]->GROUP[group].DIRSET = pins;
}
/* ... */
PORT_setDir(0, 0, _BV(0) | _BV(1) | _BV(2));
PORT_setDir(1, 0, _BV(4) | _BV(5) | _BV(6));
And compiling will give identical size and (basically) disassembly as before.
Disassembly of section .text.startup.main:
00000000 <main>:
static PORT_t * const PORT[] = {PORT0, PORT1};
static inline void
PORT_setDir(const uint8_t unit, const uint8_t group, const uint32_t pins) {
PORT[unit]->GROUP[group].DIRSET = pins;
0: 4903 ldr r1, [pc, #12] ; (10 <main+0x10>)
2: 4b04 ldr r3, [pc, #16] ; (14 <main+0x14>)
4: 2007 movs r0, #7
6: 2270 movs r2, #112 ; 0x70
8: 6088 str r0, [r1, #8]
a: 609a str r2, [r3, #8]
void main (void) {
PORT_setDir(0, 0, _BV(0) | _BV(1) | _BV(2));
PORT_setDir(1, 0, _BV(4) | _BV(5) | _BV(6));
}
c: 4770 bx lr
e: bf00 nop
10: 41008000 .word 0x41008000
14: 4100a000 .word 0x4100a000
I would clean it up a bit more with a module library header, enumerated constants, etc., but this should give someone a starting point. Note that in these examples I am always calling with a CONSTANT unit and group. I know exactly what I'm writing to; I just want reusable code. More instructions will (probably) be needed if the unit or group cannot be optimized to compile-time constants. Speaking of which, if optimizations are not used, all of this goes out the window. YMMV.
Side Note on Bit Fields
Atmel further breaks a peripheral structure into individual typedef'd registers that have named bitfields in a union with the size of the register. This is the ARM CMSIS way, but it's not great IMO. Contrary to the information in some of the other answers, I know exactly how a compiler will pack this structure; however, I do not know how it will arrange bit fields without using special compiler attributes and flags. I would rather explicitly set and mask defined register bit field constant values. It also violates MISRA (as does some of what I've done here...) if you are worried about that.