ARM GCC produces unaligned STRD - c

I am using GCC to compile a program for an ARM Cortex M3.
My program results in a hardfault, and I am trying to troubleshoot it.
GCC version is 10.3.1 but I have confirmed this with older versions too (i.e. 9.2).
The hardfault occurs only when optimizations are enabled (-O3).
The problematic function is the following:
void XTEA_decrypt(XTEA_t * xtea, uint32_t data[2])
{
uint32_t d0 = data[0];
uint32_t d1 = data[1];
uint32_t sum = XTEA_DELTA * XTEA_NUMBER_OF_ROUNDS;
for (int i = XTEA_NUMBER_OF_ROUNDS; i != 0; i--)
{
d1 -= (((d0 << 4) ^ (d0 >> 5)) + d0) ^ (sum + xtea->key[(sum >> 11) & 3]);
sum -= XTEA_DELTA;
d0 -= (((d1 << 4) ^ (d1 >> 5)) + d1) ^ (sum + xtea->key[sum & 3]);
}
data[0] = d0;
data[1] = d1;
}
I noticed that the fault happens in line:
data[0] = d0;
Disassembling this, gives me:
49 data[0] = d0;
0000f696: lsrs r0, r3, #5
0000f698: eor.w r0, r0, r3, lsl #4
0000f69c: add r0, r3
0000f69e: ldr.w r12, [sp, #4]
0000f6a2: eors r5, r0
0000f6a4: subs r2, r2, r5
0000f6a6: strd r2, r3, [r12]
0000f6aa: add sp, #12
0000f6ac: ldmia.w sp!, {r4, r5, r6, r7, r8, r9, r10, r11, pc}
0000f6b0: ldr r3, [sp, #576] ; 0x240
0000f6b2: b.n 0xfda4 <parseNode+232>
And the offending line is specifically:
0000f6a6: strd r2, r3, [r12]
GCC generates code that uses an unaligned memory address with strd, which is not allowed in my architecture.
How can this issue be fixed?
Is this a compiler bug, or the code somehow confuses GCC?
Is there any flag to alter this behavior in GCC?
The aforementioned function belongs to an external library, so I cannot modify it.
However, I prefer a solution that makes GCC produce the correct instructions, instead of modifying the code, as I need to ensure that this bug will actually be fixed, and it is not lurking elsewhere in the code.
UPDATE
Following the recommendations in the comments, I was suspecting that the function itself is called with unaligned data.
I checked the whole stack frame, all previous function calls, and my code does not contain casts, unaligned indexes in buffers etc, in contrast to what I had in mind initially.
The problem is that the buffer itself is unaligned, as it is defined as:
typedef struct {
uint32_t var1;
uint32_t var2;
uint8_t var3;
uint8_t buffer[BUFFER_SIZE];
uint16_t var4;
// More variables here...
} HDLC_t;
(And later cast to uint32_t by the external library).
Swapping places between var3 and buffer solves the issue.
The thing is that again this struct is defined in a library that is not in my control.
So, can GCC detect this issue between the libraries, and either align the data, or warn me of the issue?

So, can GCC detect this issue between the libraries, and either align the data, or warn me of the issue?
Yes it can, it does and it must do in order to be C compliant. This is what happens if you run gcc at default settings and attempt to pass a uint8_t pointer (HDLC_t buffer member) to a function expecting a uint32_t [2]:
warning: passing argument 2 of 'XTEA_decrypt' from incompatible pointer type [-Wincompatible-pointer-types]
This is a constraint violation, meaning that the code is invalid C and the compiler already told you as much. See What must a C compiler do when it finds an error? You could turn on -pedantic-errors if you wish to block gcc C from generating a binary executable out of invalid C code.
As for how to fix the code if you are stuck with that struct: memcpy the buffer member into a temporary uint32_t [2] array and then pass that one to the function.
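The memcpy approach could look roughly like the sketch below. The wrapper name xtea_decrypt_bytes is hypothetical, and the XTEA_t layout, XTEA_DELTA and XTEA_NUMBER_OF_ROUNDS values are assumptions (the usual XTEA constants), since the question does not show them. XTEA_decrypt is reproduced here only as a stand-in for the library function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XTEA_DELTA            0x9E3779B9u  /* assumed: usual XTEA constant */
#define XTEA_NUMBER_OF_ROUNDS 32u          /* assumed: usual round count   */

typedef struct { uint32_t key[4]; } XTEA_t; /* assumed layout */

/* Stand-in for the external library function from the question. */
void XTEA_decrypt(XTEA_t *xtea, uint32_t data[2])
{
    uint32_t d0 = data[0];
    uint32_t d1 = data[1];
    uint32_t sum = XTEA_DELTA * XTEA_NUMBER_OF_ROUNDS;
    for (unsigned i = XTEA_NUMBER_OF_ROUNDS; i != 0; i--) {
        d1 -= (((d0 << 4) ^ (d0 >> 5)) + d0) ^ (sum + xtea->key[(sum >> 11) & 3]);
        sum -= XTEA_DELTA;
        d0 -= (((d1 << 4) ^ (d1 >> 5)) + d1) ^ (sum + xtea->key[sum & 3]);
    }
    data[0] = d0;
    data[1] = d1;
}

/* Hypothetical wrapper: decrypts 8 bytes in place at any address,
 * aligned or not, by going through an aligned temporary. */
void xtea_decrypt_bytes(XTEA_t *xtea, uint8_t *bytes)
{
    uint32_t block[2];                   /* naturally aligned temporary */
    assert(bytes != NULL);
    memcpy(block, bytes, sizeof block);  /* byte-wise copy in  */
    XTEA_decrypt(xtea, block);           /* STRD now hits an aligned address */
    memcpy(bytes, block, sizeof block);  /* byte-wise copy out */
}
```

The caller would pass the uint8_t buffer member directly to the wrapper, so no pointer cast is needed at the call site.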
You could also declare the struct member as _Alignas(uint32_t) uint8_t buffer[100]; but if you can modify the struct you might as well re-arrange it instead, since _Alignas will insert 3 wasteful padding bytes.
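As a rough illustration of the _Alignas variant, the sketch below copies the struct from the question under the hypothetical name HDLC_aligned_t, assumes BUFFER_SIZE is 100, and adds a compile-time check that the member offset is suitable for uint32_t access. Note that alignment alone does not remove the pointer type mismatch the compiler warns about.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

#define BUFFER_SIZE 100  /* assumed value, not given in the question */

typedef struct {
    uint32_t var1;
    uint32_t var2;
    uint8_t  var3;
    _Alignas(uint32_t) uint8_t buffer[BUFFER_SIZE]; /* padding inserted before this member */
    uint16_t var4;
} HDLC_aligned_t;

/* Fails the build if the buffer ever ends up misaligned for uint32_t. */
static_assert(offsetof(HDLC_aligned_t, buffer) % _Alignof(uint32_t) == 0,
              "buffer must be aligned for uint32_t access");

/* Small helper so the offset can also be inspected at run time. */
size_t hdlc_buffer_offset(void)
{
    return offsetof(HDLC_aligned_t, buffer);
}
```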

The easiest way is to align the data to 8 bytes.
You should declare the array like:
__attribute__((aligned(8))) uint32_t data[2];

Related

Conversion from uint64_t to double

For an STM32F7, which includes instructions for double floating points, I want to convert an uint64_t to double.
In order to test that, I used the following code:
volatile static uint64_t m_testU64 = 45uLL * 0xFFFFFFFFuLL;
volatile static double m_testD;
#ifndef DO_NOT_USE_UL2D
m_testD = (double)m_testU64;
#else
double t = (double)(uint32_t)(m_testU64 >> 32u);
t *= 4294967296.0;
t += (double)(uint32_t)(m_testU64 & 0xFFFFFFFFu);
m_testD = t;
#endif
By default (if DO_NOT_USE_UL2D is not defined) the compiler (gcc or clang) calls the function __aeabi_ul2d(), which is fairly complex in terms of the number of executed instructions. See the assembly code here: https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/arm/ieee754-df.S#L537
For my particular example, it takes 20 instructions without entering most of the branches.
And if DO_NOT_USE_UL2D is defined, the compiler generates the following assembly code:
movw r0, #1728 ; 0x6c0
vldr d2, [pc, #112] ; 0x303fa0
movt r0, #8192 ; 0x2000
vldr s0, [r0, #4]
ldr r1, [r0, #0]
vcvt.f64.u32 d0, s0
vldr s2, [r0]
vcvt.f64.u32 d1, s2
ldr r1, [r0, #4]
vfma.f64 d1, d0, d2
vstr d1, [r0, #8]
The code is simpler, and it is only 10 instructions.
So here are the questions (if DO_NOT_USE_UL2D is defined):
Is my code (in C) correct?
Is my code slower than the __aeabi_ul2d() function (not really important, but a bit curious)?
I have to do that, since I am not allowed to use function from libgcc (There are very good reasons for that...)
Be aware that the main purpose of this question is not performance; I am really curious about the implementation in libgcc, and I really want to know if there is something wrong in my code.
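For reference, the split conversion from the question can be written as a standalone function, as in the sketch below (the function names are hypothetical). The reasoning is that each 32-bit half converts to double exactly, the multiplication by 2^32 is exact, and only the final addition rounds, so under the default round-to-nearest mode the result should match the direct cast.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: converts a uint64_t to double using only
 * 32-bit-to-double conversions, as in the question. */
double u64_to_double_split(uint64_t v)
{
    double hi = (double)(uint32_t)(v >> 32);          /* exact: fits in 32 bits */
    double lo = (double)(uint32_t)(v & 0xFFFFFFFFu);  /* exact: fits in 32 bits */
    return hi * 4294967296.0 + lo;                    /* hi * 2^32 is exact; the add rounds once */
}

/* Compares the split form with the compiler's own conversion
 * for the test value used in the question. */
void u64_split_demo(void)
{
    uint64_t v = 45uLL * 0xFFFFFFFFuLL;
    assert(u64_to_double_split(v) == (double)v);
}
```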

Why does writing to a bitfield-uint union by reference yield wrong assembly instruction?

First, some background:
This issue popped up while writing a driver for a sensor in my embedded system (STM32 ARM Cortex-M4).
Compiler: ARM NONE EABI GCC 7.2.1
The best solution to representing the sensor's internal control register was to use a union with a bitfield, along these lines
enum FlagA {
kFlagA_OFF,
kFlagA_ON,
};
enum FlagB {
kFlagB_OFF,
kFlagB_ON,
};
enum OptsA {
kOptsA_A,
kOptsA_B,
.
.
.
kOptsA_G // = 7
};
union ControlReg {
struct {
uint16_t RESERVED1 : 1;
FlagA flag_a : 1;
uint16_t RESERVED2 : 7;
OptsA opts_a : 3;
FlagB flag_b : 1;
uint16_t RESERVED3 : 3;
} u;
uint16_t reg;
};
This allows me to address the register's bits individually (e.g. ctrl_reg.u.flag_a = kFlagA_OFF;), and it allows me to set the value of the whole register at once (e.g. ctrl_reg.reg = 0xbeef;).
The problem:
When attempting to populate the register with a value fetched from the sensor through a function call, passing the union in by pointer, and then update only the opts_a portion of the register before writing it back to the sensor (as shown below), the compiler generates an incorrect bitfield insert assembly instruction.
ControlReg ctrl_reg;
readRegister(&ctrl_reg.reg);
ctrl_reg.u.opts_a = kOptsA_B; // <-- line of interest
writeRegister(ctrl_reg.reg);
yields
ldrb.w r3, [sp, #13]
bfi r3, r8, #1, #3 ;incorrectly writes to bits 1, 2, 3
strb.w r3, [sp, #13]
However, when I use an intermediate variable:
uint16_t reg_val = 0;
readRegister(&reg_val);
ControlReg ctrl_reg;
ctrl_reg.reg = reg_val;
ctrl_reg.u.opts_a = kOptsA_B; // <-- line of interest
writeRegister(ctrl_reg.reg);
It yields the correct instruction:
bfi r7, r8, #9, #3 ;sets the proper bits 9, 10, 11
The readRegister function does nothing funky and simply writes to the memory at the pointer
void readRegister(uint16_t* out) {
uint8_t data_in[3];
...
*out = (data_in[0] << 8) | data_in[1];
}
Why does the compiler improperly set the starting bit of the bitfield insert instruction?
I am not a fan of bitfields, especially if you're aiming for portability. C leaves a lot more unspecified or implementation-defined about them than most people seem to appreciate, and there are some very common misconceptions about what the standard requires of them as opposed to what happens to be the behavior of some implementations. Nevertheless, that's mostly moot if you're writing code for a specific application only, targeting a single, specific C implementation for the target platform.
In any case, C allows no room for a conforming implementation to behave inconsistently for conforming code. In your case, it is equally valid to set ctrl_reg.reg through a pointer, in function readRegister(), as to set it via assignment. Having done so, it is valid to assign to ctrl_reg.u.opts_a, and the result should read back correctly from ctrl_reg.u. It is also permitted to afterward read ctrl_reg.reg, and that will reflect the result of the modification.
However, you are making assumptions about the layout of the bitfields that are not supported by the standard. Your compiler will be consistent, but you need to carefully verify that the layout is actually what you expect, else going back and forth between the two union members will not produce the result you want.
Nevertheless, the way you store a value in ctrl_reg.reg is immaterial with respect to the effect that assigning to the bitfield has. Your compiler is not required to generate identical assembly for the two cases, but if there are no other differences between the two programs and they exercise no undefined behavior, then they are required to produce the same observable behavior for the same inputs.
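One way to verify the layout assumption is a small check that runs once at start-up, sketched below. It assumes the compiler allocates bit-fields starting from the least significant bit, as GCC does on little-endian ARM, and it uses uint16_t for every field instead of the enum types from the question so that it compiles as plain C. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef union {
    struct {
        uint16_t reserved1 : 1;
        uint16_t flag_a    : 1;
        uint16_t reserved2 : 7;
        uint16_t opts_a    : 3;
        uint16_t flag_b    : 1;
        uint16_t reserved3 : 3;
    } u;
    uint16_t reg;
} ControlReg;

/* Hypothetical start-up check: aborts if the bit-fields do not land
 * where the register description says they should. */
void check_control_reg_layout(void)
{
    ControlReg c;

    c.reg = 0;
    c.u.opts_a = 7;                 /* all three opts_a bits set */
    assert(c.reg == (7u << 9));     /* expected at bits 9..11 */

    c.reg = 0;
    c.u.flag_a = 1;
    assert(c.reg == (1u << 1));     /* expected at bit 1 */
}
```

If the check fails on some other compiler or ABI, the union cannot be used to move values between the bit-field view and the raw register view there.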
It is 100% correct compiler-generated code:
void foo(ControlReg *reg)
{
reg -> opts_a = kOptsA_B;
}
void foo1(ControlReg *reg)
{
volatile ControlReg reg1;
reg1.opts_a = kOptsA_B;
}
foo:
movs r2, #1
ldrb r3, [r0, #1] # zero_extendqisi2
bfi r3, r2, #1, #3
strb r3, [r0, #1]
bx lr
foo1:
movs r2, #1
sub sp, sp, #8
ldrh r3, [sp, #4]
bfi r3, r2, #9, #3
strh r3, [sp, #4] # movhi
add sp, sp, #8
bx lr
As you can see, in function 'foo' it loads only one byte (the second byte of the union), and the field occupies bits 1 to 3 of that byte.
In function 'foo1' it loads a halfword (the whole union), and the field occupies bits 9 to 11 of that halfword. On a little-endian target these are the same three bits.
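The equivalence of the two encodings can be sketched in portable C, assuming a little-endian target such as the Cortex-M4 in the question. The function names below are hypothetical; one mimics the halfword bfi at bit 9, the other the byte bfi at bit 1 of the byte at offset 1.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mimics: bfi r3, r2, #9, #3 on the whole halfword. */
uint16_t insert_via_halfword(uint16_t reg, unsigned value)
{
    return (uint16_t)((reg & ~(7u << 9)) | ((value & 7u) << 9));
}

/* Mimics: ldrb / bfi r3, r2, #1, #3 / strb on the byte at offset 1. */
uint16_t insert_via_high_byte(uint16_t reg, unsigned value)
{
    uint8_t bytes[2];
    memcpy(bytes, &reg, sizeof bytes);           /* bytes[1] holds bits 8..15 on little-endian */
    bytes[1] = (uint8_t)((bytes[1] & ~(7u << 1)) | ((value & 7u) << 1));
    memcpy(&reg, bytes, sizeof bytes);
    return reg;
}

/* Both routes should produce the same register value. */
void bfi_equivalence_demo(void)
{
    assert(insert_via_halfword(0xbeef, 1) == insert_via_high_byte(0xbeef, 1));
}
```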
Do not try to find errors in the compilers because they are almost always in your code.
PS
You do not need to name struct and the padding bitfields
typedef union {
struct {
uint16_t : 1;
uint16_t flag_a : 1;
uint16_t : 7;
uint16_t opts_a : 3;
uint16_t flag_b : 1;
uint16_t : 3;
};
uint16_t reg;
}ControlReg ;
EDIT
But if you want to make sure that the whole structure (union) is modified, just make the function parameter volatile:
void foo(volatile ControlReg *reg)
{
reg -> opts_a = kOptsA_B;
}
foo:
movs r2, #1
ldrh r3, [r0]
bfi r3, r2, #9, #3
strh r3, [r0] # movhi
bx lr

Struct and bitfield strange behaviour

I am trying to modify bitfields in register. Here is my struct with bitfields defined:
struct GROUP_tag
{
...
union
{
uint32_t R;
struct
{
uint64_t bitfield1:10;
uint64_t bitfield2:10;
uint64_t bitfield3:3;
uint64_t bitfield4:1;
} __attribute__((packed)) B;
} __attribute__((aligned(4))) myRegister;
...
};
#define GROUP (*(volatile struct GROUP_tag *) 0x400FE000)
When I use the following line:
GROUP.myRegister.B.bitfield1 = 0x60;
it doesn't change only bitfield1, but bitfield2 as well. The register has value 0x00006060.
Code gets compiled to the following assembly code:
ldr r3,[pc,#005C]
add r3,r3,#00000160
ldrb r2,[r3,#00]
mov r2,#00
orr r2,#00000060
strb r2,[r3,#00]
ldrb r2,[r3,#01]
bic r2,r2,#00000003
strb r2,[r3,#01]
If I try with direct register manipulation:
int volatile * reg = (int *) 0x400FE160;
*reg = 0x60;
the value of register is 0x00000060.
I am using GCC compiler.
Why is the value duplicated when I use struct and bitfields?
EDIT
I found another strange behaviour:
GROUP.myRegister.R = 0x12345678; // value of register is 0x00021212
*reg = 0x12345678; // value of register is 0x0004567, this is correct (I am programming microcontroller and some bits in register can't be changed)
My approach to change register value (with struct and bitfield) gets compiled to:
ldr r3,[pc,#00B4]
ldrb r2,[r3,#0160]
mov r2,#00
orr r2,#00000078
strb r2,[r3,#0160]
ldrb r2,[r3,#0160]
mov r2,#00
orr r2,#00000056
strb r2,[r3,#0161]
ldrb r2,[r3,#0162]
mov r2,#00
orr r2,#00000034
strb r2,[r3,#0162]
ldrb r2,[r3,#0163]
mov r2,#00
orr r2,#00000012
strb r2,[r3,#0163]
Ah, I get it. The compiler is using strb twice to write the two least significant bytes to a Special Function Register. But the hardware is performing a word write (presumably 32 bits) each time, because byte writes to Special Function Registers are unsupported. No wonder it doesn't work!
As to how you can fix this, that depends on your compiler, and how much it knows about SFRs. As a quick and dirty fix, you can just use bit manipulation on R; instead of
GROUP.myRegister.B.bitfield1 = 0x60;
use e.g.
GROUP.myRegister.R = (GROUP.myRegister.R & ~0x3FF) | 0x60;
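The same read-modify-write can be wrapped in a small helper so that every field update is a single full-word read followed by a single full-word write. This is only a sketch: reg_write_field and BITFIELD1_MASK are hypothetical names, and the demo uses an ordinary variable in place of the real register.

```c
#include <assert.h>
#include <stdint.h>

#define BITFIELD1_MASK 0x3FFu  /* bits 0..9, matching bitfield1:10 */

/* Hypothetical helper: replaces the bits selected by mask with value,
 * using one 32-bit load and one 32-bit store. */
void reg_write_field(volatile uint32_t *reg, uint32_t mask, uint32_t value)
{
    uint32_t tmp = *reg;                  /* single word read  */
    tmp = (tmp & ~mask) | (value & mask); /* change only the selected bits */
    *reg = tmp;                           /* single word write */
}

/* Demonstrates the helper on a stand-in variable instead of real hardware. */
void reg_write_field_demo(void)
{
    volatile uint32_t fake_register = 0xFFFFFFFFu;
    reg_write_field(&fake_register, BITFIELD1_MASK, 0x60u);
    assert(fake_register == 0xFFFFFC60u);
}
```

On the target, the first argument would be the address of the R member, for example &GROUP.myRegister.R.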
PS Another possibility: it looks like you have turned off optimisation (I see a redundant ldrb r2,[r3,#00] instruction in there). Perhaps if you turn it on, the compiler will come to its senses? Worth a try...
PPS Please change uint64_t to uint32_t. It's making my teeth hurt!
PPPS Come to think of it, that packed may be throwing the compiler off, causing it to assume that the bitfield struct may not be word-aligned (and thus forcing byte-by-byte access). Have you tried removing it?

How can the volatile keyword affect a static const array?

This is a mindblower and anyone who can answer it deserves massive recognition! It is actually a couple of connected questions that I am asking to get better understanding.
The drivers for the STM32 ARM Cortex platform have the following code in them:
static __I uint8_t APBAHBPrescTable[16] = {0, 0, 0, 0, 1, 2, 3, 4, 1, 2, 3, 4, 6, 7, 8, 9};
__I is defined as:
#ifdef __cplusplus
#define __I volatile /*!< defines 'read only' permissions */
#else
#define __I volatile const /*!< defines 'read only' permissions */
#endif
My program is a C program compiled with a GCC cross-compiler. Thus the array declaration is effectively:
static volatile const uint8_t APBAHBPrescTable[16] = {0, 0, 0, 0, 1, 2, 3, 4, 1, 2, 3, 4, 6, 7, 8, 9};
Question 1:
Given that this is a constant array, why use the volatile keyword here?
My understanding is that the volatile keyword means that the contents of the array can change, but the const means that they cannot.
The only use of this array in the code is three uses like this:
tmp = RCC->CFGR & CFGR_PPRE1_Set_Mask;
tmp = tmp >> 8;
presc = APBAHBPrescTable[tmp];
When I dump the values of tmp and presc I find that tmp has a value of 4 and presc has a value of 0. Index 4 is the 5th element of the array which has a value of 1. There are no other accesses or uses of this value...At all...Anywhere.
Question 2:
How might the value have changed since it was declared?
When I dump the array I see it is filled with zeroes.
It happens reliably...until I remove the __I from the array declaration. This makes me think it is not a buffer overflow. Other than that I cannot think of anything.
I would think that the volatile keyword was there for a reason, except that I also saw code like the following in an interrupt handler where, as far as I understand, the volatile keyword is redundant:
volatile uint32_t status = USART2->SR;
This variable is local to the function and as such can never be changed by code elsewhere.
======== EXTRA DETAIL ========
Here is the annotated disassembly of the relevant piece of code. The value at (RCC_GetClocksFreq+128) is zero, but appears at some point to have had the address of the prescaler lookup table copied into it:
0x000001d0 <+56>: ldr r1, [pc, #68] ; (0x218 <RCC_GetClocksFreq+128>)
...
tmp = RCC->CFGR & CFGR_PPRE1_Set_Mask;
tmp = tmp >> 8;
0x000001de <+70>: ldr r4, [r2, #4]
0x000001e0 <+72>: ubfx r4, r4, #8, #3
presc = APBAHBPrescTable[tmp];
0x000001e4 <+76>: ldrb r4, [r1, r4]
RCC_Clocks->PCLK1_Frequency = RCC_Clocks->HCLK_Frequency >> presc;
0x000001e6 <+78>: lsr.w r4, r3, r4
0x000001ea <+82>: str r4, [r0, #8]
Here is the same, but with the volatile const macro replaced with const:
0x000001d0 <+56>: ldr r4, [pc, #68] ; (0x218 <RCC_GetClocksFreq+128>)
...
tmp = RCC->CFGR & CFGR_PPRE1_Set_Mask;
tmp = tmp >> 8;
0x000001de <+70>: ldr r1, [r2, #4]
0x000001e0 <+72>: ubfx r1, r1, #8, #3
presc = APBAHBPrescTable[tmp];
0x000001e4 <+76>: ldrb r1, [r4, r1]
RCC_Clocks->PCLK1_Frequency = RCC_Clocks->HCLK_Frequency >> presc;
0x000001e6 <+78>: lsr.w r1, r3, r1
0x000001ea <+82>: str r1, [r0, #8]
They are essentially identical. Yet somehow removing the volatile keyword solves the problem!
My understanding is that the volatile keyword means that the contents
of the array can change, but the const means that they cannot.
volatile means the program must read the value from memory every time it is used. const means the program may not change the value, but the environment (or "OS") may.
This explains the behavior you observed: Without volatile, the compiler assumes it is OK to read the value once and use it multiple times.
The volatile const construct may be used by a Real Time Clock to publish the current time:
volatile const struct tm TheTimeNow;
The clock cannot be changed by your program, so it should be const.
The clock ticks permanently and magically behind your and the compiler's back, so better use volatile to force the compiler to always fetch the current time instead of old timestamps.
The RTC might have its own section in the address space, where it exposes the current time.
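A minimal sketch of that idea, using a hypothetical tick counter instead of a struct tm: the const qualifier stops this code from writing through the pointer, while volatile obliges the compiler to perform both loads even though nothing in the program changes the value in between. In the demo an ordinary variable stands in for the hardware register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: reads a read-only hardware counter twice and
 * returns how far it advanced between the two reads. */
uint32_t ticks_between_reads(const volatile uint32_t *counter)
{
    uint32_t first  = *counter;   /* load #1 */
    uint32_t second = *counter;   /* load #2, not folded into load #1 */
    return second - first;
}

/* With a plain variable standing in for the hardware, nothing changes
 * between the two loads, so the difference is zero. */
void ticks_demo(void)
{
    uint32_t fake_counter = 1234u;
    assert(ticks_between_reads(&fake_counter) == 0u);
}
```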
First, thanks for all the comments and answers that led me to this answer.
When the variable is defined without the "volatile" keyword it is put into a readonly section of the binary file.
When the variable is defined with the "volatile" keyword it is put in the same section of the binary file as all other variables.
I have recently found 3 buffer overruns and I am sure there are others. A lot of the code is not very well written. It is likely that when the "volatile" keyword is specified the variable is so placed in memory as to make it vulnerable to a buffer overrun. There is no reason at all for this particular variable to be marked as volatile so the simple fix is to remove that keyword. The proper fix is to do that and also track down the buffer overrun and fix it.

Working of __asm__ __volatile__ ("" : : : "memory")

What does __asm__ __volatile__ () basically do, and what is the significance of "memory" for the ARM architecture?
asm volatile("" ::: "memory");
creates a compiler level memory barrier forcing optimizer to not re-order memory accesses across the barrier.
For example, if you need to access some addresses in a specific order (probably because that memory area is actually backed by a device rather than by memory), you need to be able to tell this to the compiler; otherwise it may just optimize your steps for the sake of efficiency.
Assume that in this scenario you must increment a value at an address, read something, and then increment another value at an adjacent address.
int c(int *d, int *e) {
int r;
d[0] += 1;
r = e[0];
d[1] += 1;
return r;
}
The problem is that the compiler (gcc in this case) can rearrange your memory accesses to get better performance if you ask for it (-O), probably leading to a sequence of instructions like the one below:
00000000 <c>:
0: 4603 mov r3, r0
2: c805 ldmia r0, {r0, r2}
4: 3001 adds r0, #1
6: 3201 adds r2, #1
8: 6018 str r0, [r3, #0]
a: 6808 ldr r0, [r1, #0]
c: 605a str r2, [r3, #4]
e: 4770 bx lr
Above, the values for d[0] and d[1] are loaded at the same time. Let's assume this is something you want to avoid; then you need to tell the compiler not to reorder memory accesses, and the way to do that is asm volatile("" ::: "memory").
int c(int *d, int *e) {
int r;
d[0] += 1;
r = e[0];
asm volatile("" ::: "memory");
d[1] += 1;
return r;
}
So you'll get your instruction sequence as you want it to be:
00000000 <c>:
0: 6802 ldr r2, [r0, #0]
2: 4603 mov r3, r0
4: 3201 adds r2, #1
6: 6002 str r2, [r0, #0]
8: 6808 ldr r0, [r1, #0]
a: 685a ldr r2, [r3, #4]
c: 3201 adds r2, #1
e: 605a str r2, [r3, #4]
10: 4770 bx lr
12: bf00 nop
It should be noted that this is only a compile-time memory barrier that stops the compiler from reordering memory accesses; it emits no extra hardware-level instructions to flush memory or wait for loads or stores to complete. CPUs can still reorder memory accesses if they have the architectural capability and the memory addresses are of Normal type rather than Strongly-ordered or Device (ref).
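To make the distinction concrete, the sketch below puts the compiler-only barrier next to a C11 fence. The macro and function names are hypothetical. atomic_thread_fence(memory_order_seq_cst) is standard C11 and, on ARMv7 with GCC, typically compiles to a DMB instruction; whether DMB is sufficient for a particular device access pattern is a separate question that this sketch does not settle.

```c
#include <assert.h>
#include <stdatomic.h>

/* Compiler-only barrier (GCC/Clang syntax): emits no instruction. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Same function as above, ordering enforced only at compile time. */
int c_compiler_barrier(int *d, int *e)
{
    int r;
    d[0] += 1;
    r = e[0];
    compiler_barrier();
    d[1] += 1;
    return r;
}

/* Variant with a C11 fence, which also constrains the hardware. */
int c_hardware_fence(int *d, int *e)
{
    int r;
    d[0] += 1;
    r = e[0];
    atomic_thread_fence(memory_order_seq_cst);
    d[1] += 1;
    return r;
}

/* Both variants compute the same values; they differ only in ordering guarantees. */
void barrier_demo(void)
{
    int d[2] = { 1, 2 };
    int e[1] = { 7 };
    assert(c_compiler_barrier(d, e) == 7);
    assert(c_hardware_fence(d, e) == 7);
}
```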
This sequence is a compiler memory access scheduling barrier, as noted in the article referenced by Udo. This one is GCC specific - other compilers have other ways of describing them, some of them with more explicit (and less esoteric) statements.
__asm__ is a GCC extension that permits assembly language statements to be nested within your C code; it is used here for its ability to specify side effects that prevent the compiler from performing certain types of optimisations (which in this case might end up generating incorrect code).
__volatile__ is required to ensure that the asm statement itself is not reordered with any other volatile accesses (a guarantee in the C language).
memory is an instruction to GCC that (sort of) says that the inline asm sequence has side effects on global memory, and hence not just effects on local variables need to be taken into account.
The meaning is explained here:
http://en.wikipedia.org/wiki/Memory_ordering
Basically it implies that the assembly code will be executed where you expect it. It tells the compiler not to reorder instructions around it. That is, what is coded before this piece of code will be executed before it, and what is coded after it will be executed after it.
static inline unsigned long arch_local_irq_save(void)
{
unsigned long flags;
asm volatile(
" mrs %0, cpsr # arch_local_irq_save\n"
" cpsid i" //disabled irq
: "=r" (flags) : : "memory", "cc");
return flags;
}

Resources