ARM Cortex M7 unaligned access and memcpy - c

I am compiling this code for a Cortex M7 using GCC:
// copy manually
void write_test_plain(uint8_t *ptr, uint32_t value)
{
    *ptr++ = (uint8_t)(value);
    *ptr++ = (uint8_t)(value >> 8);
    *ptr++ = (uint8_t)(value >> 16);
    *ptr++ = (uint8_t)(value >> 24);
}
// copy using memcpy
void write_test_memcpy(uint8_t *ptr, uint32_t value)
{
    void *px = (void *)&value;
    memcpy(ptr, px, 4);
}
int main(void)
{
    extern uint8_t data[];
    extern uint32_t value;
    // I added some offsets to data to
    // make sure the compiler cannot
    // assume it is aligned in memory
    write_test_plain(data + 2, value);
    __asm volatile("" : : : "memory"); // just to split the inlined calls
    write_test_memcpy(data + 5, value);
    ... do something with data ...
}
With -O2, I get the following Thumb-2 assembly:
// write_test_plain(data + 2, value);
800031c: 2478 movs r4, #120 ; 0x78
800031e: 2056 movs r0, #86 ; 0x56
8000320: 2134 movs r1, #52 ; 0x34
8000322: 2212 movs r2, #18 ; 0x12
8000324: 759c strb r4, [r3, #22]
8000326: 75d8 strb r0, [r3, #23]
8000328: 7619 strb r1, [r3, #24]
800032a: 765a strb r2, [r3, #25]
// write_test_memcpy(data + 5, value);
800032c: 4ac4 ldr r2, [pc, #784] ; (8000640 <main+0x3a0>)
800032e: 923b str r2, [sp, #236] ; 0xec
8000330: 983b ldr r0, [sp, #236] ; 0xec
8000332: f8c3 0019 str.w r0, [r3, #25]
Can someone explain how the memcpy version works? It looks like an inlined 32-bit store to the destination address, but isn't that a problem, since data + 5 is most certainly not aligned to a 4-byte boundary?
Or is this perhaps an optimization that happens due to some undefined behavior in my source?

For Cortex-M processors, unaligned loads and stores of bytes, half-words, and words are usually allowed, and most compilers take advantage of this when generating code unless they are instructed not to. If you want to prevent gcc from assuming that unaligned accesses are OK, you can use the -mno-unaligned-access compiler flag.
If you specify this flag, gcc will no longer inline the call to memcpy, and write_test_memcpy looks like:
write_test_memcpy(unsigned char*, unsigned long):
push {lr}
sub sp, sp, #12
movs r2, #4
add r3, sp, #8
str r1, [r3, #-4]!
mov r1, r3
bl memcpy
add sp, sp, #12
ldr pc, [sp], #4
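Either way, the memcpy idiom from the question is the portable way to express a possibly-unaligned access: the compiler lowers it to the best instruction sequence the target permits (a single str/ldr when unaligned access is allowed, byte accesses or a library call otherwise). A minimal host-side sketch of the same idiom, with illustrative names, for both directions:

```c
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value at an arbitrary (possibly unaligned) byte address. */
static void store_u32(uint8_t *dst, uint32_t value)
{
    memcpy(dst, &value, sizeof value); /* well-defined at any alignment */
}

/* Load a 32-bit value back from an arbitrary byte address the same way. */
static uint32_t load_u32(const uint8_t *src)
{
    uint32_t value;
    memcpy(&value, src, sizeof value);
    return value;
}
```

Because both sides use memcpy, the round trip works regardless of the destination's alignment and of whether the target tolerates unaligned word accesses.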

Cortex-M7, M4, M3, M33, and M23 do support unaligned access.
Cortex-M0 and M0+ do not support unaligned access.
However, you can disable support for unaligned access on the Cortex-M7 by setting the UNALIGN_TRP bit in the Configuration and Control Register; any unaligned access will then generate a UsageFault.
From the compiler's perspective, the default is to emit code that performs unaligned accesses, unless you disable this with the -mno-unaligned-access compile flag.
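As a configuration fragment (assuming the standard CMSIS-Core headers for the part, which define SCB and the mask name), enabling that trap is a one-line register write:

```c
/* CMSIS-Core: make every unaligned access raise a UsageFault from here on. */
SCB->CCR |= SCB_CCR_UNALIGN_TRP_Msk;
```

Note that if you set this bit, you must also build with -mno-unaligned-access, or compiler-generated unaligned accesses will fault at run time.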

Related

Data dependency detection bug with arm-none-eabi-gcc and -Ofast

I am writing a program for an Arm Cortex M7 based microcontroller, where I am facing an issue with (I guess) compiler optimization.
I compile using arm-none-eabi-gcc v10.2.1 with -mcpu=cortex-m7 -std=gnu11 -Ofast -mthumb.
I extracted the relevant parts, they are as below. I copy some data from a global variable into a local one, and from there the data is written to a hardware
CRC engine to compute the CRC over the data. Before writing it to the engine, the byte order is to be reversed.
#include <stddef.h>
#include <stdint.h>
#define MEMORY_BARRIER asm volatile("" : : : "memory")
#define __REV(x) __builtin_bswap32(x) // Reverse byte order
// Dummy struct, similar to what is used in our program
typedef struct {
    float n1, n2, n3, dn1, dn2, dn3;
} test_struct_t;
// CRC Engine Registers
typedef struct {
    volatile uint32_t DR; // CRC Data register Address offset: 0x00
} CRC_TypeDef;
#define CRC ((CRC_TypeDef*) (0x40000000UL))
// Write data to CRC engine
static inline void CRC_write(uint32_t* data, size_t size)
{
    for (int index = 0; index < (size / 4); index++) {
        CRC->DR = __REV(*(data + index));
    }
}
// Global variable holding some data, similar to what is used in our program
volatile test_struct_t global_var = { 0 };
int main()
{
    test_struct_t local_tmp;
    MEMORY_BARRIER;
    // Copy the current data into a local variable
    local_tmp = global_var;
    // Now, write the data to the CRC engine to compute the CRC over the data
    CRC_write((uint32_t*) &local_tmp, sizeof(local_tmp));
    MEMORY_BARRIER;
    return 0;
}
This compiles to:
main:
push {r4, r5, lr}
sub sp, sp, #28
ldr r4, .L4
mov ip, sp
ldr r3, [sp] ; load first float into r3
mov lr, #1073741824 ; load address of CRC engine DR register into lr
rev r5, r3 ; reverse byte order of first float and store in r5
ldmia r4!, {r0, r1, r2, r3} ; load first 4 floats into r0-r3
stmia ip!, {r0, r1, r2, r3} ; copy first 4 floats into local variable
ldm r4, {r0, r1} ; load float 5 and 6 into r0 and r1
stm ip, {r0, r1} ; copy float 5 and 6 into local variable
str r5, [lr] ; write reversed bytes from r5 to CRC engine
ldr r3, [sp, #4]
rev r3, r3
str r3, [lr]
ldr r3, [sp, #8]
rev r3, r3
str r3, [lr]
ldr r3, [sp, #12]
rev r3, r3
str r3, [lr]
ldr r3, [sp, #16]
rev r3, r3
str r3, [lr]
ldr r3, [sp, #20]
rev r3, r3
str r3, [lr]
movs r0, #0
add sp, sp, #28
pop {r4, r5, pc}
.L4:
.word .LANCHOR0
global_var:
The problem is that the first float is loaded and its byte order reversed (rev r5, r3) before the data has been copied from the global variable (which happens with the ldmia/stmia pair).
This is clearly not intended and leads to a wrong CRC, and I wonder why the compiler does not recognize this data dependency. Am I doing anything wrong, or is this a bug in GCC?
Code in compiler explorer: https://godbolt.org/z/ErWThc5dx
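For what it's worth, a frequently cited culprit for this pattern is strict aliasing: casting &local_tmp (an object of float members) to uint32_t* and dereferencing it violates C's effective-type rules, which GCC exploits at -O2 and above, so the load of the "first float" need not be ordered after the struct copy. Accessing the bytes through memcpy side-steps that. A host-runnable sketch of an aliasing-safe variant (crc_words/crc_count are hypothetical stand-ins for the real CRC->DR register, so the sketch runs off-target):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sink standing in for the memory-mapped CRC data register. */
static uint32_t crc_words[64];
static size_t crc_count;

/* Feed whole 32-bit words, byte-reversed, without aliasing the source type. */
static void crc_write_aliasing_safe(const void *data, size_t size)
{
    const unsigned char *p = data;
    for (size_t i = 0; i + 4 <= size; i += 4) {
        uint32_t w;
        memcpy(&w, p + i, sizeof w); /* reading object bytes is always allowed */
        crc_words[crc_count++] = __builtin_bswap32(w);
    }
}
```

The memcpy is folded to a plain word load by GCC, so this costs nothing over the cast, but the compiler now sees the true dependency on the copied bytes.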

Cortex M3, STM32, thumb2: My inc and dec operations are not atomic, but should be. What's wrong here?

I need a thread-safe idx++ and idx-- operation.
Disabling interrupts, i.e. using critical sections, is one option, but I want to understand why my operations are not atomic, as I expected.
Here is the C code with the inline assembler shown, using Segger Ozone:
(Also, please notice from the variables' addresses that the 32-bit variable is 32-bit aligned in memory, and the 8- and 16-bit variables are both 16-bit aligned.)
volatile static U8 dbgIdx8 = 1000U;
volatile static U16 dbgIdx16 = 1000U;
volatile static U32 dbgIdx32 = 1000U;
dbgIdx8 ++;
080058BE LDR R3, [PC, #48]
080058C0 LDRB R3, [R3]
080058C2 UXTB R3, R3
080058C4 ADDS R3, #1
080058C6 UXTB R2, R3
080058C8 LDR R3, [PC, #36]
080058CA STRB R2, [R3]
dbgIdx16 ++;
080058CC LDR R3, [PC, #36]
080058CE LDRH R3, [R3]
080058D0 UXTH R3, R3
080058D2 ADDS R3, #1
080058D4 UXTH R2, R3
080058D6 LDR R3, [PC, #28]
080058D8 STRH R2, [R3]
dbgIdx32 ++;
080058DA LDR R3, [PC, #28]
080058DC LDR R3, [R3]
080058DE ADDS R3, #1
080058E0 LDR R2, [PC, #20]
080058E2 STR R3, [R2]
There is no guarantee that ++ and -- are atomic. If you need guaranteed atomicity, you will have to find some other way.
As @StaceyGirl points out in a comment, you might be able to use the facilities of <stdatomic.h>. For example, there is an atomic_fetch_add function, which acts like the postfix ++ you're striving for, and an atomic_fetch_sub, too.
Alternatively, you might have compiler intrinsics available for performing an atomic increment in some processor-specific way.
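A sketch of the <stdatomic.h> route (C11), which GCC for ARMv7-M lowers to an LDREX/STREX retry loop for 32-bit objects; the function names are illustrative:

```c
#include <stdatomic.h>

static atomic_uint idx = 1000u; /* C11 atomic counter */

/* Atomic post-increment / post-decrement; each returns the previous value,
   exactly like the postfix ++ and -- the question is after. */
static unsigned idx_inc(void) { return atomic_fetch_add(&idx, 1u); }
static unsigned idx_dec(void) { return atomic_fetch_sub(&idx, 1u); }
```

On Cortex-M3 this compiles to a loop of LDREX, ADDS/SUBS, STREX, and a branch back on failure, so the read-modify-write either completes without interference or is retried.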
ARM Cortex cores cannot modify memory directly. All memory modifications are performed as RMW (read-modify-write) sequences, which are not atomic by default.
But the Cortex-M3 has special instructions for exclusive access to a memory location: LDREX and STREX. https://developer.arm.com/documentation/100235/0004/the-cortex-m33-instruction-set/memory-access-instructions/ldaex-and-stlex
You can use them directly from C code, without touching assembly, via intrinsics.
Do not use data types narrower than 32 bits in performance-sensitive code (you want to hold the exclusive reservation for as short a time as possible); operations on shorter data types add extra code.

What happens when a 64-bit value is passed as a parameter in a function on a 32-bit architecture?

I am facing a weird issue. I am passing a uint64_t offset to a function on a 32-bit architecture (Cortex-R52), where all the registers are 32-bit.
This is the function (sorry, I had messed up the function declaration originally):
void read_func(uint8_t* read_buffer_ptr, uint64_t offset, size_t size);
main(){
    // read_buffer : memory where something is read.
    // read_buffer_ptr : points to the read_buffer structure where the value should be stored after reading.
    read_func(read_buffer_ptr, 0, sizeof(read_buffer));
}
In this function, the value stored in offset is not zero but some random values, which I also see in the registers (r5, r6). When I use offset as a 32-bit value instead, it works perfectly fine: the value is copied from r2, r3 into r5, r6.
Can you please let me know why this could be happening? Are there not enough registers?
The prototype posted is invalid, it should be:
void read_func(uint8_t *read_buffer_ptr, uint64_t offset, size_t size);
Similarly, the definition main() is obsolete: the implicit int return type is not supported as of C99, and the function call has another syntax error with a missing )...
What happens when you pass a 64-bit argument on a 32-bit architecture is implementation defined:
either 8 bytes of stack space are used to pass the value
or 2 32-bit registers are used to pass the least significant part and the most significant part
or a combination of both depending on the number of arguments
or some other scheme appropriate for the target CPU
In your code you pass 0, which has type int and presumably only 32 bits. This is not a problem if the prototype for read_func is correct and parsed before the function call; otherwise the behavior is undefined, and a C99 compiler should not even compile the code, but many compilers will just issue a warning and generate bogus code.
In your case (Cortex R52), the 64-bit argument is passed to read_func in registers r2 and r3.
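To see the conversion work, it is enough that a correct prototype is visible before the call: the int argument 0 is then implicitly converted to uint64_t, and on AAPCS32 it travels in an even/odd register pair (r2:r3 here). A minimal sketch with a hypothetical body that just captures the argument:

```c
#include <stdint.h>
#include <stddef.h>

/* Prototype visible before any call site. */
void read_func(uint8_t *read_buffer_ptr, uint64_t offset, size_t size);

static uint64_t seen_offset; /* captured for illustration only */

void read_func(uint8_t *read_buffer_ptr, uint64_t offset, size_t size)
{
    (void)read_buffer_ptr;
    (void)size;
    seen_offset = offset; /* with the prototype in scope, this is exactly 0 */
}
```

Without the prototype, the caller would pass a 32-bit int in one register while the callee reads a 64-bit pair, which is exactly the "random values in r5, r6" symptom described.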
The Cortex-R52 has a 32-bit address bus, so an offset cannot really be 64 bits: in address calculations only the lower 32 bits are used, as the higher ones cannot have any effect.
example:
uint64_t foo(void *buff, uint64_t offset, uint64_t size)
{
    unsigned char *cbuff = buff;
    while (size--)
    {
        *(cbuff++ + offset) = size & 0xff;
    }
    return offset + (uint32_t)cbuff;
}
void *z1(void);
uint64_t z2(void);
uint64_t z3(void);
uint64_t bar(void)
{
    return foo(z1(), z2(), z3());
}
foo:
push {r4, lr}
ldr lr, [sp, #8] //size
ldr r1, [sp, #12] //size
mov ip, lr
add r4, r0, r2 // cbuff + offset calculation r3 is ignored as not needed - processor has only 32bits address space.
.L2:
subs ip, ip, #1 //size--
sbc r1, r1, #0 //size--
cmn r1, #1
cmneq ip, #1
bne .L3
add r0, r0, lr
adds r0, r0, r2
adc r1, r3, #0
pop {r4, pc}
.L3:
strb ip, [r4], #1
b .L2
bar:
push {r0, r1, r4, r5, r6, lr}
bl z1
mov r4, r0 // buff
bl z2
mov r6, r0 // offset
mov r5, r1 // offset
bl z3
mov r2, r6 // offset
strd r0, [sp] // size passed on the stack
mov r3, r5 // offset
mov r0, r4 // buff
bl foo
add sp, sp, #8
pop {r4, r5, r6, pc}
As you can see, registers r2 and r3 contain the offset, r0 contains buff, and size is passed on the stack.

ARM Cortex M0, BYTE or WORD access, which one is the best?

What is the real benefit of using byte access (instead of word access) on a Cortex-M0+?
The LPC845 has, for every package pin (GPIO), both a "BYTE" register and a "WORD" register.
If I want to set a GPIO pin using the WORD register, I would use something like this:
*((volatile uint32_t*)0xA0001088) = 1;
0000036e: movs r4, #137 ; 0x89
00000370: ldr r0, [pc, #80] ; (0x3c4 <AppInit+144>)
00000372: str r4, [r0, #0]
At the same time, I could use the BYTE register
(with it, the GPIO peripheral automatically ignores everything above 0xFF, writing 1 to the pin):
*((volatile uint8_t*)0xA000102C) = 1;
0000039c: ldr r3, [pc, #48] ; (0x3d0 <AppInit+156>)
0000039e: adds r1, #56 ; 0x38
000003a0: strb r1, [r3, #0]
Both techniques consume the same number of instructions.
Which one is best to use, and why?

Overeager struct packing warnings with `__attribute__((packed))`?

I'm implementing a binary logging system on a 32-bit ARM MCU (Atmel SAM4SD32C, a Cortex-M4/ARMv7E-M part), and I am in the process of designing my data structures. My goal is to describe the log format as a packed struct, and simply union the struct with a char array for writing to the log device (an SD card, via FatFS, in this case).
Basically, I have a very simple struct:
typedef struct adc_samples_t
{
    int32_t adc_samples[6];
    uint64_t acq_time;
    int8_t overrun;
    uint8_t padding_1;
    uint8_t padding_2;
    uint8_t padding_3;
} __attribute__((packed, aligned(4))) adc_sample_set;
Now, my architecture is 32 bits, so as far as I understand, access to any member /other/ than the overrun member should be 32-bit aligned and therefore not incur extra overhead. Furthermore, the aligned(4) attribute should force any instantiation of the struct onto a 32-bit-aligned boundary.
However, compiling the above struct definition produces a pile of warnings:
In file included from ../src/main.c:13:0:
<snip>\src\fs\fs-logger.h(10,10): warning: packed attribute causes inefficient alignment for 'adc_samples' [-Wattributes]
int32_t adc_samples[6];
^
<snip>\src\fs\fs-logger.h(12,11): warning: packed attribute causes inefficient alignment for 'acq_time' [-Wattributes]
uint64_t acq_time;
As far as I know (and I'm now realizing this is a big assumption), 32-bit alignment is all that is needed for optimal member positioning on 32-bit ARM. Oddly, the only members that do /not/ produce warnings are overrun and the padding_X members, which I don't understand the cause of. (Ok, the ARM docs say byte accesses are always aligned.)
What, exactly, is going on here? I assume (possibly incorrectly) that struct instantiations will be on 4-byte boundaries. Does the compiler require a broader alignment (on 8-byte boundaries)?
Edit: Ok, digging into the ARM docs, the magic words here were "Cortex-M4 alignment":
3.3.5. Address alignment
An aligned access is an operation where a word-aligned address is used for a word, dual word, or multiple word access, or where a halfword-aligned address is used for a halfword access. Byte accesses are always aligned.
The Cortex-M4 processor supports unaligned access only for the following instructions:
LDR, LDRT
LDRH, LDRHT
LDRSH, LDRSHT
STR, STRT
STRH, STRHT
All other load and store instructions generate a UsageFault exception if they perform an unaligned access, and therefore their accesses must be address aligned. For more information about UsageFaults see Fault handling.
Unaligned accesses are usually slower than aligned accesses. In addition, some memory regions might not support unaligned accesses. Therefore, ARM recommends that programmers ensure that accesses are aligned. To trap
accidental generation of unaligned accesses, use the UNALIGN_TRP bit in the Configuration and Control Register, see Configuration and Control Register.
How is my 32-bit aligned value not word-aligned? The user guide defines "Aligned" as the following:
Aligned
A data item stored at an address that is divisible by the
number of bytes that defines the data size is said to be aligned.
Aligned words and halfwords have addresses that are divisible by four
and two respectively. The terms word-aligned and halfword-aligned
therefore stipulate addresses that are divisible by four and two
respectively.
I assumed that 32-bit alignment was all that was needed for optimal component positioning on 32-bit ARM
It is.
But you don't have 32-bit alignment here [in the originally-asked question] because:
Specifying the packed attribute for struct and union types is equivalent to specifying the packed attribute on each of the structure or union members.
given that:
The packed attribute specifies that a variable or structure field should have the smallest possible alignment—one byte for a variable, and one bit for a field, unless you specify a larger value with the aligned attribute.
In other words, if you want a packed structure to still have some minimum alignment after you've forced the alignment of all members, and thus the type itself, down to nothing, you need to say so explicitly. Whether that actually makes -Wpacked shut up is a different matter: GCC may well emit the warning reflexively before it considers any further alignment modifiers.
Note that in terms of serialisation, you don't necessarily need to pack it anyway. The members fit in 9 words exactly, so the only compiler padding anywhere is an extra word at the end to round the total size up to 40 bytes, since acq_time forces the struct to a natural alignment of 8. Unless you want to operate on a whole array of these things at once, you can get away with simply ignoring that and still treating the members as one 36-byte chunk.
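The layout claims above can be checked on any GCC-compatible compiler with offsetof and sizeof; a sketch with both variants of the struct (the _u suffix marks the unpacked control version, a name added here for comparison):

```c
#include <stdint.h>
#include <stddef.h>

/* The question's packed, 4-byte-aligned layout. */
typedef struct {
    int32_t adc_samples[6];
    uint64_t acq_time;
    int8_t overrun;
    uint8_t padding_1, padding_2, padding_3;
} __attribute__((packed, aligned(4))) adc_sample_set;

/* Same members with the default layout, for comparison: acq_time has
   natural alignment 8, so the struct is padded out (to 40 bytes on ABIs
   where uint64_t is 8-byte aligned). */
typedef struct {
    int32_t adc_samples[6];
    uint64_t acq_time;
    int8_t overrun;
    uint8_t padding_1, padding_2, padding_3;
} adc_sample_set_u;
```

The packed variant is exactly the 9 words the answer describes: adc_samples at 0, acq_time at 24, overrun at 32, total size 36.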
Ok, at this point, I'm somewhat confident that the warning is being emitted in error.
I have a statically defined instance of the struct, and at one point I zero it:
adc_sample_set running_average;
int accumulated_samples;
inline void zero_average_buf(void)
{
    accumulated_samples = 0;
    running_average.adc_samples[0] = 0;
    running_average.adc_samples[1] = 0;
    running_average.adc_samples[2] = 0;
    running_average.adc_samples[3] = 0;
    running_average.adc_samples[4] = 0;
    running_average.adc_samples[5] = 0;
    running_average.overrun = 0;
    running_average.acq_time = 0;
}
The disassembly for the function is the follows:
{
004005F8 push {r3, lr}
accumulated_samples = 0;
004005FA movs r2, #0
004005FC ldr r3, [pc, #36]
004005FE str r2, [r3]
running_average.adc_samples[0] = 0;
00400600 ldr r3, [pc, #36]
00400602 str r2, [r3]
running_average.adc_samples[1] = 0;
00400604 str r2, [r3, #4]
running_average.adc_samples[2] = 0;
00400606 str r2, [r3, #8]
running_average.adc_samples[3] = 0;
00400608 str r2, [r3, #12]
running_average.adc_samples[4] = 0;
0040060A str r2, [r3, #16]
running_average.adc_samples[5] = 0;
0040060C str r2, [r3, #20]
running_average.overrun = 0;
0040060E strb.w r2, [r3, #32]
running_average.acq_time = 0;
00400612 movs r0, #0
00400614 movs r1, #0
00400616 strd r0, r1, [r3, #24]
Note that r3 in the above is 0x2001ef70, which is indeed 4-byte aligned; r2 holds the literal value 0.
The str opcode requires 4-byte alignment here. The strd opcode also only requires 4-byte alignment, since it appears to really be two sequential 4-byte operations, though I don't know how it actually works internally.
If I intentionally mis-align my struct, to force the slow-path copy operation:
typedef struct adc_samples_t
{
    int8_t overrun;
    int32_t adc_samples[6];
    uint64_t acq_time;
    uint8_t padding_1;
    uint8_t padding_2;
    uint8_t padding_3;
} __attribute__((packed, aligned(8))) adc_sample_set;
I get the following assembly:
{
00400658 push {r3, lr}
accumulated_samples = 0;
0040065A movs r3, #0
0040065C ldr r2, [pc, #84]
0040065E str r3, [r2]
running_average.adc_samples[0] = 0;
00400660 ldr r2, [pc, #84]
00400662 strb r3, [r2, #1]
00400664 strb r3, [r2, #2]
00400666 strb r3, [r2, #3]
00400668 strb r3, [r2, #4]
running_average.adc_samples[1] = 0;
0040066A strb r3, [r2, #5]
0040066C strb r3, [r2, #6]
0040066E strb r3, [r2, #7]
00400670 strb r3, [r2, #8]
running_average.adc_samples[2] = 0;
00400672 strb r3, [r2, #9]
00400674 strb r3, [r2, #10]
00400676 strb r3, [r2, #11]
00400678 strb r3, [r2, #12]
running_average.adc_samples[3] = 0;
0040067A strb r3, [r2, #13]
0040067C strb r3, [r2, #14]
0040067E strb r3, [r2, #15]
00400680 strb r3, [r2, #16]
running_average.adc_samples[4] = 0;
00400682 strb r3, [r2, #17]
00400684 strb r3, [r2, #18]
00400686 strb r3, [r2, #19]
00400688 strb r3, [r2, #20]
running_average.adc_samples[5] = 0;
0040068A strb r3, [r2, #21]
0040068C strb r3, [r2, #22]
0040068E strb r3, [r2, #23]
00400690 strb r3, [r2, #24]
running_average.overrun = 0;
00400692 mov r1, r2
00400694 strb r3, [r1], #25
running_average.acq_time = 0;
00400698 strb r3, [r2, #25]
0040069A strb r3, [r1, #1]
0040069C strb r3, [r1, #2]
0040069E strb r3, [r1, #3]
004006A0 strb r3, [r1, #4]
004006A2 strb r3, [r1, #5]
004006A4 strb r3, [r1, #6]
004006A6 strb r3, [r1, #7]
So, pretty clearly, I'm getting the proper aligned-copy behaviour with my original struct definition, despite the compiler apparently incorrectly warning that it will result in inefficient accesses.
