Overeager struct packing warnings with `__attribute__((packed))`?

I'm implementing a binary logging system on a 32-bit ARM MCU (Atmel SAM4SD32C, a Cortex-M4/ARMv7E-M part), and I'm in the process of designing my data structures. My goal is to describe the log format as a packed struct, and simply union the struct with a char array for writing to the log device (an SD card, via FatFS, in this case).
Basically, I have a very simple struct:
typedef struct adc_samples_t
{
    int32_t  adc_samples[6];
    uint64_t acq_time;
    int8_t   overrun;
    uint8_t  padding_1;
    uint8_t  padding_2;
    uint8_t  padding_3;
} __attribute__((packed, aligned(4))) adc_sample_set;
Now, my architecture is 32 bits, so as far as I understand, access to any member /other/ than the overrun member should be 32-bit aligned, and therefore not incur extra overhead. Furthermore, the aligned(4) attribute should force any instantiation of the struct onto a 32-bit-aligned boundary.
However, compiling the above struct definition produces a pile of warnings:
In file included from ../src/main.c:13:0:
<snip>\src\fs\fs-logger.h(10,10): warning: packed attribute causes inefficient alignment for 'adc_samples' [-Wattributes]
int32_t adc_samples[6];
^
<snip>\src\fs\fs-logger.h(12,11): warning: packed attribute causes inefficient alignment for 'acq_time' [-Wattributes]
uint64_t acq_time;
As far as I know (and I'm now realizing this is a big assumption), I assumed that 32-bit alignment was all that was needed for optimal component positioning on 32-bit ARM. Oddly, the only members that do /not/ produce warnings are the overrun and padding_X members, the cause of which I don't understand. (OK, the ARM docs say byte accesses are always aligned.)
What, exactly, is going on here? I assume (possibly incorrectly) that instantiations of the struct will be on 4-byte boundaries. Does the compiler expect a broader alignment (8-byte boundaries)?
Edit: OK, digging into the ARM docs (the magic words here were "Cortex-M4 alignment"):
3.3.5. Address alignment
An aligned access is an operation where a word-aligned address is used for a word, dual word, or multiple word access, or where a halfword-aligned address is used for a halfword access. Byte accesses are always aligned.
The Cortex-M4 processor supports unaligned access only for the following instructions:
LDR, LDRT
LDRH, LDRHT
LDRSH, LDRSHT
STR, STRT
STRH, STRHT
All other load and store instructions generate a UsageFault exception if they perform an unaligned access, and therefore their accesses must be address aligned. For more information about UsageFaults see Fault handling.
Unaligned accesses are usually slower than aligned accesses. In addition, some memory regions might not support unaligned accesses. Therefore, ARM recommends that programmers ensure that accesses are aligned. To trap
accidental generation of unaligned accesses, use the UNALIGN_TRP bit in the Configuration and Control Register, see Configuration and Control Register.
How is my 32-bit-aligned value not word-aligned? The user guide defines "Aligned" as follows:
Aligned
A data item stored at an address that is divisible by the
number of bytes that defines the data size is said to be aligned.
Aligned words and halfwords have addresses that are divisible by four
and two respectively. The terms word-aligned and halfword-aligned
therefore stipulate addresses that are divisible by four and two
respectively.

I assumed that 32-bit alignment was all that was needed for optimal component positioning on 32-bit ARM
It is.
But you don't have 32-bit alignment here [in the originally-asked question] because:
Specifying the packed attribute for struct and union types is equivalent to specifying the packed attribute on each of the structure or union members.
given that:
The packed attribute specifies that a variable or structure field should have the smallest possible alignment—one byte for a variable, and one bit for a field, unless you specify a larger value with the aligned attribute.
In other words, if you want a packed structure to still have some minimum alignment after you've forced the alignment of all its members, and thus of the type itself, down to nothing, you need to specify so explicitly. The fact that that might not actually make -Wpacked shut up is a different matter - GCC may well just spit the warning out reflexively before it actually considers any further alignment modifiers.
Note that in terms of serialisation, you don't necessarily need to pack it anyway. The members fit in 9 words exactly, so the only compiler padding anywhere is an extra word at the end to round the total size up to 40 bytes, since acq_time forces the struct to a natural alignment of 8. Unless you want to operate on a whole array of these things at once, you can get away with simply ignoring that and still treating the members as one 36-byte chunk.
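To make that concrete, here is a minimal sketch (assuming the AAPCS layout described above): drop the attributes entirely and let the compiler verify the layout at build time.

#include <stdint.h>
#include <stddef.h>

typedef struct adc_samples_t
{
    int32_t  adc_samples[6];  /* offsets 0..20        */
    uint64_t acq_time;        /* offset 24, 8-aligned */
    int8_t   overrun;         /* offset 32            */
    uint8_t  padding_1;
    uint8_t  padding_2;
    uint8_t  padding_3;
} adc_sample_set;

/* Compile-time proof that there is no interior padding, and that only
   the trailing pad word rounds the size up to 40. */
_Static_assert(offsetof(adc_sample_set, acq_time) == 24, "acq_time offset");
_Static_assert(offsetof(adc_sample_set, overrun)  == 32, "overrun offset");
_Static_assert(sizeof(adc_sample_set) == 40, "trailing pad only");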

Ok, at this point, I'm somewhat confident that the warning is being emitted in error.
I have a statically defined instance of the struct, and at one point I zero it:
adc_sample_set running_average;
int accumulated_samples;

inline void zero_average_buf(void)
{
    accumulated_samples = 0;
    running_average.adc_samples[0] = 0;
    running_average.adc_samples[1] = 0;
    running_average.adc_samples[2] = 0;
    running_average.adc_samples[3] = 0;
    running_average.adc_samples[4] = 0;
    running_average.adc_samples[5] = 0;
    running_average.overrun = 0;
    running_average.acq_time = 0;
}
The disassembly for the function is as follows:
{
004005F8 push {r3, lr}
accumulated_samples = 0;
004005FA movs r2, #0
004005FC ldr r3, [pc, #36]
004005FE str r2, [r3]
running_average.adc_samples[0] = 0;
00400600 ldr r3, [pc, #36]
00400602 str r2, [r3]
running_average.adc_samples[1] = 0;
00400604 str r2, [r3, #4]
running_average.adc_samples[2] = 0;
00400606 str r2, [r3, #8]
running_average.adc_samples[3] = 0;
00400608 str r2, [r3, #12]
running_average.adc_samples[4] = 0;
0040060A str r2, [r3, #16]
running_average.adc_samples[5] = 0;
0040060C str r2, [r3, #20]
running_average.overrun = 0;
0040060E strb.w r2, [r3, #32]
running_average.acq_time = 0;
00400612 movs r0, #0
00400614 movs r1, #0
00400616 strd r0, r1, [r3, #24]
Note that r3 in the above is 0x2001ef70, which is indeed 4-byte aligned, and r2 holds the literal value 0.
The str opcode requires 4-byte alignment. The strd opcode only requires 4-byte alignment as well, since it appears to really be two sequential 4-byte operations, though I don't know how it actually works internally.
If I intentionally mis-align my struct, to force the slow-path copy operation:
typedef struct adc_samples_t
{
    int8_t   overrun;
    int32_t  adc_samples[6];
    uint64_t acq_time;
    uint8_t  padding_1;
    uint8_t  padding_2;
    uint8_t  padding_3;
} __attribute__((packed, aligned(8))) adc_sample_set;
I get the following assembly:
{
00400658 push {r3, lr}
accumulated_samples = 0;
0040065A movs r3, #0
0040065C ldr r2, [pc, #84]
0040065E str r3, [r2]
running_average.adc_samples[0] = 0;
00400660 ldr r2, [pc, #84]
00400662 strb r3, [r2, #1]
00400664 strb r3, [r2, #2]
00400666 strb r3, [r2, #3]
00400668 strb r3, [r2, #4]
running_average.adc_samples[1] = 0;
0040066A strb r3, [r2, #5]
0040066C strb r3, [r2, #6]
0040066E strb r3, [r2, #7]
00400670 strb r3, [r2, #8]
running_average.adc_samples[2] = 0;
00400672 strb r3, [r2, #9]
00400674 strb r3, [r2, #10]
00400676 strb r3, [r2, #11]
00400678 strb r3, [r2, #12]
running_average.adc_samples[3] = 0;
0040067A strb r3, [r2, #13]
0040067C strb r3, [r2, #14]
0040067E strb r3, [r2, #15]
00400680 strb r3, [r2, #16]
running_average.adc_samples[4] = 0;
00400682 strb r3, [r2, #17]
00400684 strb r3, [r2, #18]
00400686 strb r3, [r2, #19]
00400688 strb r3, [r2, #20]
running_average.adc_samples[5] = 0;
0040068A strb r3, [r2, #21]
0040068C strb r3, [r2, #22]
0040068E strb r3, [r2, #23]
00400690 strb r3, [r2, #24]
running_average.overrun = 0;
00400692 mov r1, r2
00400694 strb r3, [r1], #25
running_average.acq_time = 0;
00400698 strb r3, [r2, #25]
0040069A strb r3, [r1, #1]
0040069C strb r3, [r1, #2]
0040069E strb r3, [r1, #3]
004006A0 strb r3, [r1, #4]
004006A2 strb r3, [r1, #5]
004006A4 strb r3, [r1, #6]
004006A6 strb r3, [r1, #7]
So, pretty clearly, I'm getting the proper aligned-copy behaviour with my original struct definition, despite the compiler apparently incorrectly warning that it will result in inefficient accesses.

Related

Cortex M3, STM32, thumb2: My inc and dec operations are not atomic, but should be. What's wrong here?

I need a thread-safe idx++ and idx-- operation.
Disabling interrupts, i.e. using critical sections, is one option, but I want to understand why my operations are not atomic, as I expected.
Here is the C code with the inline assembler shown, using Segger Ozone:
(Also please notice that the addresses of the variables show that the 32-bit variable is 32-bit-aligned in memory, and the 8- and 16-bit variables are both 16-bit-aligned.)
volatile static U8 dbgIdx8 = 1000U;
volatile static U16 dbgIdx16 = 1000U;
volatile static U32 dbgIdx32 = 1000U;
dbgIdx8 ++;
080058BE LDR R3, [PC, #48]
080058C0 LDRB R3, [R3]
080058C2 UXTB R3, R3
080058C4 ADDS R3, #1
080058C6 UXTB R2, R3
080058C8 LDR R3, [PC, #36]
080058CA STRB R2, [R3]
dbgIdx16 ++;
080058CC LDR R3, [PC, #36]
080058CE LDRH R3, [R3]
080058D0 UXTH R3, R3
080058D2 ADDS R3, #1
080058D4 UXTH R2, R3
080058D6 LDR R3, [PC, #28]
080058D8 STRH R2, [R3]
dbgIdx32 ++;
080058DA LDR R3, [PC, #28]
080058DC LDR R3, [R3]
080058DE ADDS R3, #1
080058E0 LDR R2, [PC, #20]
080058E2 STR R3, [R2]
There is no guarantee that ++ and -- are atomic. If you need guaranteed atomicity, you will have to find some other way.
As @StaceyGirl points out in a comment, you might be able to use the facilities of <stdatomic.h>. For example, I see there's an atomic_fetch_add function defined, which acts like the postfix ++ you're striving for. There's an atomic_fetch_sub, too.
Alternatively, you might have some compiler intrinsics available to you for performing an atomic increment in some processor-specific way.
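For illustration, a minimal sketch of the <stdatomic.h> approach (assuming a C11 toolchain with atomics support for this target; on Cortex-M3 these typically lower to LDREX/STREX loops):

#include <stdatomic.h>

static atomic_uint idx; /* hypothetical shared index */

void increment_idx(void)
{
    /* Atomic post-increment: returns the old value, like idx++. */
    (void)atomic_fetch_add(&idx, 1u);
}

void decrement_idx(void)
{
    /* Atomic post-decrement, like idx--. */
    (void)atomic_fetch_sub(&idx, 1u);
}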
ARM Cortex cores do not modify memory in place. All memory modifications are performed as RMW (read-modify-write) sequences, which are not atomic by default.
But the Cortex-M3 has special instructions to lock access to a memory location: LDREX & STREX. https://developer.arm.com/documentation/100235/0004/the-cortex-m33-instruction-set/memory-access-instructions/ldaex-and-stlex
You can use them directly in C code, without touching assembly, by using intrinsics.
Do not use data types narrower than 32 bits in performance-sensitive code (you want to hold the exclusive access for as short a time as possible); most operations on shorter data types add extra instructions.
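As a sketch of the intrinsics route (assuming a CMSIS-style header that provides __LDREXW/__STREXW; the exact header name varies by toolchain/vendor):

#include <stdint.h>
#include "cmsis_compiler.h" /* assumed CMSIS header with __LDREXW/__STREXW */

/* Classic exclusive-access retry loop: reload and retry whenever the
   exclusive monitor is lost between the LDREX and the STREX. */
static inline void atomic_inc_u32(volatile uint32_t *p)
{
    uint32_t v;
    do {
        v = __LDREXW(p) + 1u;        /* load-exclusive, then modify   */
    } while (__STREXW(v, p) != 0u);  /* store-exclusive; 0 on success */
}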

Does Clang misunderstand the 'const' pointer specifier?

In the code below I saw that Clang fails to perform better optimisation without an explicit restrict pointer qualifier:
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

typedef struct {
    uint32_t event_type;
    uintptr_t param;
} event_t;

typedef struct
{
    event_t *queue;
    size_t size;
    uint16_t num_of_items;
    uint8_t rd_idx;
    uint8_t wr_idx;
} queue_t;

static bool queue_is_full(const queue_t *const queue_ptr)
{
    return queue_ptr->num_of_items == queue_ptr->size;
}

static size_t queue_get_size_mask(const queue_t *const queue_ptr)
{
    return queue_ptr->size - 1;
}

int queue_enqueue(queue_t *const queue_ptr, const event_t *const event_ptr)
{
    if(queue_is_full(queue_ptr))
    {
        return 1;
    }

    queue_ptr->queue[queue_ptr->wr_idx++] = *event_ptr;
    queue_ptr->num_of_items++;
    queue_ptr->wr_idx &= queue_get_size_mask(queue_ptr);

    return 0;
}
I compiled this code with clang version 11.0.0 (clang-1100.0.32.5)
clang -O2 -arch armv7m -S test.c -o test.s
In the disassembled file I saw that the generated code re-reads the memory:
_queue_enqueue:
.cfi_startproc
# %bb.0:
ldrh r2, [r0, #8] ---> reads the queue_ptr->num_of_items
ldr r3, [r0, #4] ---> reads the queue_ptr->size
cmp r3, r2
itt eq
moveq r0, #1
bxeq lr
ldrb r2, [r0, #11] ---> reads the queue_ptr->wr_idx
adds r3, r2, #1
strb r3, [r0, #11] ---> stores the queue_ptr->wr_idx + 1
ldr.w r12, [r1]
ldr r3, [r0]
ldr r1, [r1, #4]
str.w r12, [r3, r2, lsl #3]
add.w r2, r3, r2, lsl #3
str r1, [r2, #4]
ldrh r1, [r0, #8] ---> !!! re-reads the queue_ptr->num_of_items
adds r1, #1
strh r1, [r0, #8]
ldrb r1, [r0, #4] ---> !!! re-reads the queue_ptr->size (only the first byte)
ldrb r2, [r0, #11] ---> !!! re-reads the queue_ptr->wr_idx
subs r1, #1
ands r1, r2
strb r1, [r0, #11] ---> !!! stores the updated queue_ptr->wr_idx once again after applying the mask
movs r0, #0
bx lr
.cfi_endproc
# -- End function
After adding the restrict keyword to the pointers, these unneeded re-reads just vanished:
int queue_enqueue(queue_t * restrict const queue_ptr, const event_t * restrict const event_ptr)
I know that in Clang, strict aliasing is disabled by default. But in this case, the event_ptr pointer is declared const, so the object's contents cannot be modified through this pointer; thus it cannot affect the contents that queue_ptr points to (even in the case where the objects overlap in memory), right?
So is this a compiler optimisation bug, or is there indeed some weird case where the object pointed to by queue_ptr can be affected via event_ptr, given this declaration:
int queue_enqueue(queue_t *const queue_ptr, const event_t *const event_ptr)
By the way, I tried to compile the same code for an x86 target and observed a similar optimisation issue.
The generated assembly with the restrict keyword doesn't contain the re-reads:
_queue_enqueue:
.cfi_startproc
# %bb.0:
ldr r3, [r0, #4]
ldrh r2, [r0, #8]
cmp r3, r2
itt eq
moveq r0, #1
bxeq lr
push {r4, r6, r7, lr}
.cfi_def_cfa_offset 16
.cfi_offset lr, -4
.cfi_offset r7, -8
.cfi_offset r6, -12
.cfi_offset r4, -16
add r7, sp, #8
.cfi_def_cfa r7, 8
ldr.w r12, [r1]
ldr.w lr, [r1, #4]
ldrb r1, [r0, #11]
ldr r4, [r0]
subs r3, #1
str.w r12, [r4, r1, lsl #3]
add.w r4, r4, r1, lsl #3
adds r1, #1
ands r1, r3
str.w lr, [r4, #4]
strb r1, [r0, #11]
adds r1, r2, #1
strh r1, [r0, #8]
movs r0, #0
pop {r4, r6, r7, pc}
.cfi_endproc
# -- End function
Addition:
After some discussion with Lundin in the comments to his answer, I got the impression that the re-reads could be caused by the compiler assuming that queue_ptr->queue might potentially point to *queue_ptr itself. So I changed the queue_t struct to contain an array instead of the pointer:
typedef struct
{
    event_t queue[256]; // changed from pointer to array with max size
    size_t size;
    uint16_t num_of_items;
    uint8_t rd_idx;
    uint8_t wr_idx;
} queue_t;
However, the re-reads remained as before. I still can't understand what could make the compiler think that queue_t's fields may be modified, thus requiring the re-reads... The following declaration eliminates the re-reads:
int queue_enqueue(queue_t * restrict const queue_ptr, const event_t *const event_ptr)
But why queue_ptr has to be declared as a restrict pointer to prevent the re-reads, I don't understand (unless it is a compiler optimisation "bug").
P.S.
I also couldn't find any link to file/report an issue on Clang that isn't about a compiler crash...
[talking about the original program]
This is caused by deficiency in the TBAA metadata, generated by Clang.
If you emit LLVM IR with -S -emit-llvm you'll see (snipped for brevity):
...
%9 = load i8, i8* %wr_idx, align 1, !tbaa !12
%10 = trunc i32 %8 to i8
%11 = add i8 %10, -1
%conv4 = and i8 %11, %9
store i8 %conv4, i8* %wr_idx, align 1, !tbaa !12
br label %return
...
!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 1, !"min_enum_size", i32 4}
!2 = !{!"clang version 10.0.0 (/home/chill/src/llvm-project 07da145039e1a6a688fb2ac2035b7c062cc9d47d)"}
!3 = !{!4, !9, i64 8}
!4 = !{!"queue", !5, i64 0, !8, i64 4, !9, i64 8, !6, i64 10, !6, i64 11}
!5 = !{!"any pointer", !6, i64 0}
!6 = !{!"omnipotent char", !7, i64 0}
!7 = !{!"Simple C/C++ TBAA"}
!8 = !{!"int", !6, i64 0}
!9 = !{!"short", !6, i64 0}
!10 = !{!4, !8, i64 4}
!11 = !{!4, !5, i64 0}
!12 = !{!4, !6, i64 11}
See the TBAA metadata !4: this is the type descriptor for queue_t (btw, I added names to the structs, e.g. typedef struct queue ...; you may see an empty string there instead). Each element in the description corresponds to a struct field. Look at !5, which is the event_t *queue field: it's "any pointer"! At this point we've lost all information about the actual type of the pointer, which tells me the compiler must assume writes through this pointer can modify any memory object.
That said, there's a new format for TBAA metadata, which is more precise (it still has deficiencies, but more on that later ...).
Compile the original program with -Xclang -new-struct-path-tbaa. My exact command line was (I've included stddef.h instead of stdlib.h since that's a development build with no libc):
./bin/clang -I lib/clang/10.0.0/include/ -target armv7m-eabi -O2 -Xclang -new-struct-path-tbaa -S queue.c
The resulting assembly is (again, some fluff snipped):
queue_enqueue:
push {r4, r5, r6, r7, lr}
add r7, sp, #12
str r11, [sp, #-4]!
ldrh r3, [r0, #8]
ldr.w r12, [r0, #4]
cmp r12, r3
bne .LBB0_2
movs r0, #1
ldr r11, [sp], #4
pop {r4, r5, r6, r7, pc}
.LBB0_2:
ldrb r2, [r0, #11] ; load `wr_idx`
ldr.w lr, [r0] ; load `queue` member
ldrd r6, r1, [r1] ; load data pointed to by `event_ptr`
add.w r5, lr, r2, lsl #3 ; compute address to store the event
str r1, [r5, #4] ; store `param`
adds r1, r3, #1 ; increment `num_of_items`
adds r4, r2, #1 ; increment `wr_idx`
str.w r6, [lr, r2, lsl #3] ; store `event_type`
strh r1, [r0, #8] ; store new value for `num_of_items`
sub.w r1, r12, #1 ; compute size mask
ands r1, r4 ; bitwise and size mask with `wr_idx`
strb r1, [r0, #11] ; store new value for `wr_idx`
movs r0, #0
ldr r11, [sp], #4
pop {r4, r5, r6, r7, pc}
Looks good, doesn't it! :D
I mentioned earlier that there are deficiencies with the "new struct path", but for that: see the mailing list.
PS. I'm afraid there's no general lesson to be learned in this case. In principle, the more information one is able to give the compiler, the better: things like restrict, strong typing (no gratuitous casts, type punning, etc.), relevant function and variable attributes ... but not in this case; the original program already contained all the necessary information. It is just a compiler deficiency, and the best way to tackle those is to raise awareness: ask on the mailing list and/or file bug reports.
The event_t *queue member of *queue_ptr could point at the same memory as event_ptr. Compilers tend to produce less efficient code when they can't rule out that two pointers point at the same memory. So there's nothing strange about restrict leading to better code.
Const qualifiers don't really matter, because these were added by the function and the original type could be modifiable elsewhere. In particular, the * const doesn't add anything because the pointer is already a local copy of the original pointer, so nobody including the caller cares if the function modifies that local copy or not.
"Strict aliasing" rather refers to the cases where the compiler can cut corners like when assuming that a uint16_t* can't point at a uint8_t* etc. But in your case you have two completely compatible types, one of them is merely wrapped in an outer struct.
As far as I can tell, yes, in your code the queue_ptr pointee's contents cannot be modified. Is it an optimization bug? It's a missed optimization opportunity, but I wouldn't call it a bug. The compiler doesn't "misunderstand" const; it just doesn't have/doesn't do the analysis necessary to determine that the pointee cannot be modified in this specific scenario.
As a side note: queue_is_full(queue_ptr) could modify the contents of *queue_ptr even though its parameter is const queue_t *const, because it could legally cast away the constness, since the original object is not const. That being said, the definition of queue_is_full is visible and available to the compiler, so it can ascertain that it indeed does not.
As you know, your code appears to modify the data, invalidating the const state:
queue_ptr->num_of_items++; // this stores data in the memory
Without the restrict keyword, the compiler must assume that the two types might share the same memory space.
This is required in the updated example because event_t is a member of queue_t, and strict aliasing applies to:
... an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or...
In the original example there are a number of reasons the types might be considered to alias, leading to the same result (e.g., the use of a char pointer, and the fact that the types might be considered compatible "enough" on some architectures, if not all).
Hence, the compiler is required to reload the memory after it was mutated to avoid possible conflicts.
The const keyword doesn't really enter into this, because a mutation happens through a pointer that might point to the same memory address.
(EDIT) For your convenience, here's the full rule regarding access to a variable:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types (88):
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the object,
— a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
(88) The intent of this list is to specify those circumstances in which an object may or may not be aliased.
P.S.
The _t suffix is reserved by POSIX. You might consider using a different suffix.
It's common practice to use _s for structs and _u for unions.

C struct where the elements cross the byte boundary and get placed in the next byte

I have the following C struct that represents a register in an external chip:
typedef union {
    // Individual Fields
    struct {
        uint8_t ELEM_1 : 4; // Bits 0-3
        uint8_t ELEM_2 : 3; // Bits 4-6
        uint8_t ELEM_3 : 2; // Bits 7-8
    } field;
    // Complete Value
    uint32_t value;
} ELEMENTS_t;
As you can see, ELEM_1 and ELEM_2 fit inside a byte without any issues, and when they are accessed, the assembly code looks like this:
ELEMENTS.field.ELEM_2 = 0x7;
101488: e55b3028 ldrb r3, [fp, #-40] ; 0xffffffd8
10148c: e3833070 orr r3, r3, #112 ; 0x70
101490: e54b3028 strb r3, [fp, #-40] ; 0xffffffd8
ELEMENTS.field.ELEM_1 = 0xf;
101494: e55b3028 ldrb r3, [fp, #-40] ; 0xffffffd8
101498: e383300f orr r3, r3, #15
10149c: e54b3028 strb r3, [fp, #-40] ; 0xffffffd8
They all get written to the same byte with the correct bit order.
The problem is when we get to ELEM_3. That element crosses the byte boundary, since it should be placed in bits [8:7], and (probably to avoid multiple memory accesses) the compiler places it in a completely separate byte, so when I try to access it, it looks like this:
ELEMENTS.field.ELEM_3 = 0x3;
10147c: e55b3027 ldrb r3, [fp, #-39] ; 0xffffffd9
101480: e3833003 orr r3, r3, #3
101484: e54b3027 strb r3, [fp, #-39] ; 0xffffffd9
This doesn't cause issues when accessing these elements field by field, but it does when trying to flush the data to the external chip.
Does anybody know how to tell the compiler to pack all the bits together? This is using the Xilinx SDK targeting the ARM Cortex-A9 processor embedded inside a Zynq SoC.
You can't pack 9 bits into a byte, so it crosses the byte boundary. If the register is a single byte, it should only use bits 0 through 7.
There could be other problems with this if you need to set multiple bits at the same time; see the sketch below.
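One hedged way around that (a sketch; set_all_elements and reg are illustrative names): compose all fields in a temporary, then flush them with a single store through the value member.

void set_all_elements(ELEMENTS_t *reg)
{
    ELEMENTS_t tmp = { .value = 0 };
    tmp.field.ELEM_1 = 0xF;
    tmp.field.ELEM_2 = 0x7;
    tmp.field.ELEM_3 = 0x3;
    reg->value = tmp.value; /* one 32-bit store instead of three RMWs */
}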
I found a solution. If I declare all the elements as uint32_t instead of uint8_t, it reads and writes the register 32 bits at a time, which makes the operations work.
ELEMENTS.field.ELEM_3 = 0x3;
10147c: e15b32b8 ldrh r3, [fp, #-40] ; 0xffffffd8
101480: e3833d06 orr r3, r3, #384 ; 0x180
101484: e14b32b8 strh r3, [fp, #-40] ; 0xffffffd8
ELEMENTS.field.ELEM_2 = 0x7;
101488: e55b3028 ldrb r3, [fp, #-40] ; 0xffffffd8
10148c: e3833070 orr r3, r3, #112 ; 0x70
101490: e54b3028 strb r3, [fp, #-40] ; 0xffffffd8
ELEMENTS.field.ELEM_1 = 0xf;
101494: e55b3028 ldrb r3, [fp, #-40] ; 0xffffffd8
101498: e383300f orr r3, r3, #15
10149c: e54b3028 strb r3, [fp, #-40] ; 0xffffffd8

ARM Cortex M7 unaligned access and memcpy

I am compiling this code for a Cortex M7 using GCC:
#include <stdint.h>
#include <string.h>

// copy manually
void write_test_plain(uint8_t *ptr, uint32_t value)
{
    *ptr++ = (uint8_t)(value);
    *ptr++ = (uint8_t)(value >> 8);
    *ptr++ = (uint8_t)(value >> 16);
    *ptr++ = (uint8_t)(value >> 24);
}

// copy using memcpy
void write_test_memcpy(uint8_t *ptr, uint32_t value)
{
    void *px = (void *)&value;
    memcpy(ptr, px, 4);
}

int main(void)
{
    extern uint8_t data[];
    extern uint32_t value;

    // i added some offsets to data to
    // make sure the compiler cannot
    // assume it's aligned in memory
    write_test_plain(data + 2, value);
    __asm volatile("": : :"memory"); // just to split inlined calls
    write_test_memcpy(data + 5, value);

    ... do something with data ...
}
And I get the following Thumb2 assembly with -O2:
// write_test_plain(data + 2, value);
800031c: 2478 movs r4, #120 ; 0x78
800031e: 2056 movs r0, #86 ; 0x56
8000320: 2134 movs r1, #52 ; 0x34
8000322: 2212 movs r2, #18 ; 0x12
8000324: 759c strb r4, [r3, #22]
8000326: 75d8 strb r0, [r3, #23]
8000328: 7619 strb r1, [r3, #24]
800032a: 765a strb r2, [r3, #25]
// write_test_memcpy(data + 5, value);
800032c: 4ac4 ldr r2, [pc, #784] ; (8000640 <main+0x3a0>)
800032e: 923b str r2, [sp, #236] ; 0xec
8000330: 983b ldr r0, [sp, #236] ; 0xec
8000332: f8c3 0019 str.w r0, [r3, #25]
Can someone explain how the memcpy version works? This looks like inlined 32-bit store to the destination address, but isn't this a problem since data + 5 is most certainly not aligned to a 4-byte boundary?
Is this perhaps some optimization which happens due to some undefined behavior in my source?
For Cortex-M processors unaligned loads and stores of bytes, half-words, and words are usually allowed and most compilers use this when generating code unless they are instructed not to. If you want to prevent gcc from assuming the unaligned accesses are OK, you can use the -mno-unaligned-access compiler flag.
If you specify this flag, gcc will no longer inline the call to memcpy, and write_test_memcpy looks like:
write_test_memcpy(unsigned char*, unsigned long):
push {lr}
sub sp, sp, #12
movs r2, #4
add r3, sp, #8
str r1, [r3, #-4]!
mov r1, r3
bl memcpy
add sp, sp, #12
ldr pc, [sp], #4
Cortex-M7, M4, M3, and M33 support unaligned access.
Cortex-M0, M0+, and M23 do not support unaligned access.
However, you can disable the support for unaligned access on the Cortex-M7 by setting the UNALIGN_TRP bit in the Configuration and Control Register; any unaligned access will then generate a UsageFault.
From the compiler's perspective, the default is to generate assembly that performs unaligned accesses, unless you disable this with the -mno-unaligned-access compile flag.
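For reference, a sketch of enabling that trap (assuming a CMSIS device header that exposes SCB and the usual mask macro):

#include "core_cm7.h" /* assumed CMSIS header exposing SCB */

void enable_unaligned_trap(void)
{
    /* After this, any unaligned load/store raises a UsageFault. */
    SCB->CCR |= SCB_CCR_UNALIGN_TRP_Msk;
    __DSB(); /* make sure the new setting is in effect */
}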

Optimize C or assembly code in size for Cortex-M0

I need to reduce the code bloat for the Cortex-M0 microprocessor.
At startup the ROM data has to be copied to the RAM data once. Therefore I have this piece of code:
// s, d and e are file-scope pointers in the original (their
// declarations are not shown); that they are globals matters below
unsigned int *s, *d, *e;

void __startup( void ){
    extern unsigned int __data_init_start;
    extern unsigned int __data_start;
    extern unsigned int __data_end;

    // copy .data section from flash to ram
    s = & __data_init_start;
    d = & __data_start;
    e = & __data_end;
    while( d != e ){
        *d++ = *s++;
    }
}
The assembly code that is generated by the compiler looks like this:
ldr r1, .L10+8
ldr r2, .L10+12
sub r0, r1, r2
lsr r3, r0, #2
add r3, r3, #1
lsl r1, r3, #2
mov r3, #0
.L4:
add r3, r3, #4
cmp r3, r1
beq .L9
.L5:
ldr r4, .L10+16
add r0, r2, r3
add r4, r3, r4
sub r4, r4, #4
ldr r4, [r4]
sub r0, r0, #4
str r4, [r0]
b .L4
How can I optimize this code so the code size is at minimum?
The compiler (or you!) does not realize that the amount to copy is simply end - start. There seems to be some unnecessary shuffling of data going on -- the two adds and the sub in the loop. Also, it seems to me the compiler makes sure that the number of bytes to copy is a multiple of 4. An obvious optimization, then, is to make sure it is in advance! Below I assume it is (if not, the bne will fail and happily keep on copying and trample all over your memory).
Using my decade-old ARM assembler knowledge (yes, that is a major disclaimer) and post-incrementing addressing, I think the following short snippet is what it can be condensed to. From 18 instructions down to 8, not too bad. If it works.
    ldr  r1, __data_init_start
    ldr  r2, __data_start
    ldr  r3, __data_end
    sub  r4, r3, r2
.L1:
    ldr  r3, [r1], #4    ; safe to re-use r3 here
    str  r3, [r2], #4
    subs r4, r4, #4
    bne  .L1
It may be that the platform guarantees that writing through an unsigned int * can change an unsigned int * value (i.e. the compiler doesn't take advantage of type-mismatch aliasing rules).
In that case the code is inefficient because e is a global variable, and the generated code must take into account that writing to *d may change the value of e.
Making at least e a local should solve this problem (most compilers know that aliasing a local whose address is never taken is not possible from a C point of view).
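A sketch of that fix: with local pointers whose addresses are never taken, the compiler can keep the loop bound in a register.

void __startup(void)
{
    extern unsigned int __data_init_start;
    extern unsigned int __data_start;
    extern unsigned int __data_end;

    // locals: stores through d can no longer alias the loop bound
    unsigned int *s = &__data_init_start;
    unsigned int *d = &__data_start;
    unsigned int *const e = &__data_end;

    while (d != e) {
        *d++ = *s++;
    }
}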
