I am trying to modify bitfields in a register. Here is my struct with bitfields defined:
struct GROUP_tag
{
...
union
{
uint32_t R;
struct
{
uint64_t bitfield1:10;
uint64_t bitfield2:10;
uint64_t bitfield3:3;
uint64_t bitfield4:1;
} __attribute__((packed)) B;
} __attribute__((aligned(4))) myRegister;
...
}
#define GROUP (*(volatile struct GROUP_tag *) 0x400FE000)
When I use the following line:
GROUP.myRegister.B.bitfield1 = 0x60;
it doesn't change only bitfield1, but bitfield2 as well. The register has value 0x00006060.
Code gets compiled to the following assembly code:
ldr r3,[pc,#005C]
add r3,r3,#00000160
ldrb r2,[r3,#00]
mov r2,#00
orr r2,#00000060
strb r2,[r3,#00]
ldrb r2,[r3,#01]
bic r2,r2,#00000003
strb r2,[r3,#01]
If I try with direct register manipulation:
int volatile * reg = (int *) 0x400FE160;
*reg = 0x60;
the value of register is 0x00000060.
I am using GCC compiler.
Why is the value duplicated when I use struct and bitfields?
EDIT
I found another strange behaviour:
GROUP.myRegister.R = 0x12345678; // value of register is 0x00021212
*reg = 0x12345678; // value of register is 0x0004567, this is correct (I am programming microcontroller and some bits in register can't be changed)
My approach to change register value (with struct and bitfield) gets compiled to:
ldr r3,[pc,#00B4]
ldrb r2,[r3,#0160]
mov r2,#00
orr r2,#00000078
strb r2,[r3,#0160]
ldrb r2,[r3,#0160]
mov r2,#00
orr r2,#00000056
strb r2,[r3,#0161]
ldrb r2,[r3,#0162]
mov r2,#00
orr r2,#00000034
strb r2,[r3,#0162]
ldrb r2,[r3,#0163]
mov r2,#00
orr r2,#00000012
strb r2,[r3,#0163]
Ah, I get it. The compiler is using strb twice to write the two least significant bytes to a Special Function Register. But the hardware is performing a word write (presumably 32 bits) each time, because byte writes to Special Function Registers are unsupported. No wonder it doesn't work!
As to how you can fix this, that depends on your compiler, and how much it knows about SFRs. As a quick and dirty fix, you can just use bit manipulation on R; instead of
GROUP.myRegister.B.bitfield1 = 0x60;
use e.g.
GROUP.myRegister.R = (GROUP.myRegister.R & ~0x3FF) | 0x60;
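As a sketch of that whole-word read-modify-write (mask and shift assumed from the bitfield layout above, bitfield1 in bits 0-9; verify against your datasheet):

```c
#include <stdint.h>

/* Assumed layout from the struct above: bitfield1 occupies bits 0..9.
   Doing the update on the whole 32-bit word means the hardware only
   ever sees word-sized accesses. */
#define BITFIELD1_MASK  0x3FFu
#define BITFIELD1_SHIFT 0u

static inline uint32_t set_bitfield1(uint32_t reg, uint32_t val)
{
    /* clear the field, then OR in the new (masked) value */
    return (reg & ~(BITFIELD1_MASK << BITFIELD1_SHIFT))
         | ((val & BITFIELD1_MASK) << BITFIELD1_SHIFT);
}
```

Then GROUP.myRegister.R = set_bitfield1(GROUP.myRegister.R, 0x60); performs exactly one 32-bit read and one 32-bit write of the register.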
PS Another possibility: it looks like you have turned off optimisation (I see a redundant ldrb r2,[r3,#00] instruction in there). Perhaps if you turn it on, the compiler will come to its senses? Worth a try...
PPS Please change uint64_t to uint32_t. It's making my teeth hurt!
PPPS Come to think of it, that packed may be throwing the compiler off, causing it to assume that the bitfield struct may not be word-aligned (and thus forcing byte-by-byte access). Have you tried removing it?
Related
I am using GCC to compile a program for an ARM Cortex M3.
My program results in a hardfault, and I am trying to troubleshoot it.
GCC version is 10.3.1 but I have confirmed this with older versions too (i.e. 9.2).
The hardfault occurs only when optimizations are enabled (-O3).
The problematic function is the following:
void XTEA_decrypt(XTEA_t * xtea, uint32_t data[2])
{
uint32_t d0 = data[0];
uint32_t d1 = data[1];
uint32_t sum = XTEA_DELTA * XTEA_NUMBER_OF_ROUNDS;
for (int i = XTEA_NUMBER_OF_ROUNDS; i != 0; i--)
{
d1 -= (((d0 << 4) ^ (d0 >> 5)) + d0) ^ (sum + xtea->key[(sum >> 11) & 3]);
sum -= XTEA_DELTA;
d0 -= (((d1 << 4) ^ (d1 >> 5)) + d1) ^ (sum + xtea->key[sum & 3]);
}
data[0] = d0;
data[1] = d1;
}
I noticed that the fault happens in line:
data[0] = d0;
Disassembling this, gives me:
49 data[0] = d0;
0000f696: lsrs r0, r3, #5
0000f698: eor.w r0, r0, r3, lsl #4
0000f69c: add r0, r3
0000f69e: ldr.w r12, [sp, #4]
0000f6a2: eors r5, r0
0000f6a4: subs r2, r2, r5
0000f6a6: strd r2, r3, [r12]
0000f6aa: add sp, #12
0000f6ac: ldmia.w sp!, {r4, r5, r6, r7, r8, r9, r10, r11, pc}
0000f6b0: ldr r3, [sp, #576] ; 0x240
0000f6b2: b.n 0xfda4 <parseNode+232>
And the offending line is specifically:
0000f6a6: strd r2, r3, [r12]
GCC generates code that uses an unaligned memory address with strd, which is not allowed in my architecture.
How can this issue be fixed?
Is this a compiler bug, or the code somehow confuses GCC?
Is there any flag to alter this behavior in GCC?
The aforementioned function belongs to an external library, so I cannot modify it.
However, I prefer a solution that makes GCC produce the correct instructions, instead of modifying the code, as I need to ensure that this bug will actually be fixed, and it is not lurking elsewhere in the code.
UPDATE
Following the recommendations in the comments, I suspected that the function itself was called with unaligned data.
I checked the whole stack frame and all previous function calls, and my code does not contain casts, unaligned indexes into buffers, etc., in contrast to what I had in mind initially.
The problem is that the buffer itself is unaligned, as it is defined as:
typedef struct {
uint32_t var1;
uint32_t var2;
uint8_t var3;
uint8_t buffer[BUFFER_SIZE];
uint16_t var4;
// More variables here...
} HDLC_t;
(And later cast to uint32_t by the external library).
Swapping places between var3 and buffer solves the issue.
The thing is that again this struct is defined in a library that is not in my control.
So, can GCC detect this issue between the libraries, and either align the data, or warn me of the issue?
So, can GCC detect this issue between the libraries, and either align the data, or warn me of the issue?
Yes it can, it does, and it must in order to be C-compliant. This is what happens if you run gcc at default settings and attempt to pass a uint8_t pointer (the HDLC_t buffer member) to a function expecting a uint32_t [2]:
warning: passing argument 2 of 'XTEA_decrypt' from incompatible pointer type [-Wincompatible-pointer-types]
This is a constraint violation, meaning that the code is invalid C and the compiler already told you as much. See What must a C compiler do when it finds an error? You could turn on -pedantic-errors if you wish to block gcc from generating a binary executable out of invalid C code.
As for how to fix the code if you are stuck with that struct: memcpy the buffer member into a temporary uint32_t [2] array and then pass that one to the function.
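A sketch of that memcpy workaround; the XTEA_t layout and a trivial stub stand in for the real library cipher here:

```c
#include <stdint.h>
#include <string.h>

/* Stub standing in for the library's XTEA_decrypt (the real one does
   the actual cipher rounds); XTEA_t layout is assumed. */
typedef struct { uint32_t key[4]; } XTEA_t;
static void XTEA_decrypt(XTEA_t *xtea, uint32_t data[2])
{
    (void)xtea;
    data[0] ^= 0xDEADBEEFu;  /* placeholder transform */
    data[1] ^= 0xCAFEBABEu;
}

/* The workaround: copy the possibly-unaligned bytes into an aligned
   temporary, run the cipher, copy the result back. memcpy has no
   alignment requirement on its arguments. */
static void decrypt_unaligned(XTEA_t *xtea, uint8_t *buf)
{
    uint32_t tmp[2];                 /* properly aligned temporary */
    memcpy(tmp, buf, sizeof tmp);
    XTEA_decrypt(xtea, tmp);
    memcpy(buf, tmp, sizeof tmp);    /* copy the result back */
}
```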
You could also declare the struct member as _Alignas(uint32_t) uint8_t buffer[100]; but if you can modify the struct you might as well re-arrange it instead, since _Alignas will insert 3 wasteful padding bytes.
The easiest way is to align the data to 8 bytes.
You should declare the array like:
__attribute__((aligned(8))) uint32_t data[2];
First, some background:
This issue popped up while writing a driver for a sensor in my embedded system (STM32 ARM Cortex-M4).
Compiler: ARM NONE EABI GCC 7.2.1
The best solution to representing the sensor's internal control register was to use a union with a bitfield, along these lines
enum FlagA {
kFlagA_OFF,
kFlagA_ON,
};
enum FlagB {
kFlagB_OFF,
kFlagB_ON,
};
enum OptsA {
kOptsA_A,
kOptsA_B,
.
.
.
kOptsA_G // = 7
};
union ControlReg {
struct {
uint16_t RESERVED1 : 1;
FlagA flag_a : 1;
uint16_t RESERVED2 : 7;
OptsA opts_a : 3;
FlagB flag_b : 1;
uint16_t RESERVED3 : 3;
} u;
uint16_t reg;
};
This allows me to address the register's bits individually (e.g. ctrl_reg.u.flag_a = kFlagA_OFF;), and it allows me to set the value of the whole register at once (e.g. ctrl_reg.reg = 0xbeef;).
The problem:
When attempting to populate the register with a value fetched from the sensor through a function call (passing the union in by pointer), and then update only the opts_a portion of the register before writing it back to the sensor (as shown below), the compiler generates an incorrect bitfield insert assembly instruction.
ControlReg ctrl_reg;
readRegister(&ctrl_reg.reg);
ctrl_reg.u.opts_a = kOptsA_B; // <-- line of interest
writeRegister(ctrl_reg.reg);
yields
ldrb.w r3, [sp, #13]
bfi r3, r8, #1, #3 ;incorrectly writes to bits 1, 2, 3
strb.w r3, [sp, #13]
However, when I use an intermediate variable:
uint16_t reg_val = 0;
readRegister(&reg_val);
ControlReg ctrl_reg;
ctrl_reg.reg = reg_val;
ctrl_reg.u.opts_a = kOptsA_B; // <-- line of interest
writeRegister(ctrl_reg.reg);
It yields the correct instruction:
bfi r7, r8, #9, #3 ;sets the proper bits 9, 10, 11
The readRegister function does nothing funky and simply writes to the memory at the pointer:
void readRegister(uint16_t* out) {
uint8_t data_in[3];
...
*out = (data_in[0] << 8) | data_in[1];
}
Why does the compiler improperly set the starting bit of the bitfield insert instruction?
I am not a fan of bitfields, especially if you're aiming for portability. C leaves a lot more unspecified or implementation-defined about them than most people seem to appreciate, and there are some very common misconceptions about what the standard requires of them as opposed to what happens to be the behavior of some implementations. Nevertheless, that's mostly moot if you're writing code for a specific application only, targeting a single, specific C implementation for the target platform.
In any case, C allows no room for a conforming implementation to behave inconsistently for conforming code. In your case, it is equally valid to set ctrl_reg.reg through a pointer, in function readRegister(), as to set it via assignment. Having done so, it is valid to assign to ctrl_reg.u.opts_a, and the result should read back correctly from ctrl_reg.u. It is also permitted to afterward read ctrl_reg.reg, and that will reflect the result of the modification.
However, you are making assumptions about the layout of the bitfields that are not supported by the standard. Your compiler will be consistent, but you need to carefully verify that the layout is actually what you expect, else going back and forth between the two union members will not produce the result you want.
Nevertheless, the way you store a value in ctrl_reg.reg is immaterial with respect to the effect that assigning to the bitfield has. Your compiler is not required to generate identical assembly for the two cases, but if there are no other differences between the two programs and they exercise no undefined behavior, then they are required to produce the same observable behavior for the same inputs.
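One way to do that layout verification is a startup self-test: write a known value through a bitfield and check the aggregate. This sketch assumes LSB-first allocation within a single 16-bit unit (what GCC does on little-endian targets), with plain uint16_t members standing in for the enum-typed fields:

```c
#include <stdint.h>

/* Plain-uint16_t version of the union, for the layout check only. */
union ControlRegCheck {
    struct {
        uint16_t RESERVED1 : 1;
        uint16_t flag_a    : 1;
        uint16_t RESERVED2 : 7;
        uint16_t opts_a    : 3;
        uint16_t flag_b    : 1;
        uint16_t RESERVED3 : 3;
    } u;
    uint16_t reg;
};

static uint16_t opts_a_pattern(void)
{
    union ControlRegCheck r = { .reg = 0 };
    r.u.opts_a = 0x7;       /* set all three bits of the field */
    return r.reg;           /* 0x0E00 if opts_a sits at bits 9..11 */
}
```

If opts_a_pattern() does not return 0x0E00, the compiler's bitfield layout differs from the datasheet's, and the union approach needs rethinking before it touches hardware.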
The compiler-generated code is 100% correct:
void foo(ControlReg *reg)
{
reg->opts_a = kOptsA_B;
}
void foo1(ControlReg *reg)
{
volatile ControlReg reg1;
reg1.opts_a = kOptsA_B;
}
foo:
movs r2, #1
ldrb r3, [r0, #1] # zero_extendqisi2
bfi r3, r2, #1, #3
strb r3, [r0, #1]
bx lr
foo1:
movs r2, #1
sub sp, sp, #8
ldrh r3, [sp, #4]
bfi r3, r2, #9, #3
strh r3, [sp, #4] # movhi
add sp, sp, #8
bx lr
As you see, in the function 'foo' it loads only one byte (the second byte of the union) and the field is stored in bits 1 to 3 of this byte.
As you see, in the function 'foo1' it loads a halfword (the whole structure) and the field is stored in bits 9 to 11 of the halfword.
Do not try to find errors in the compilers because they are almost always in your code.
PS
You do not need to name the struct or the padding bitfields:
typedef union {
struct {
uint16_t : 1;
uint16_t flag_a : 1;
uint16_t : 7;
uint16_t opts_a : 3;
uint16_t flag_b : 1;
uint16_t : 3;
};
uint16_t reg;
} ControlReg;
EDIT
but if you want to make sure that the whole structure (union) is modified, just make the function parameter volatile:
void foo(volatile ControlReg *reg)
{
reg->opts_a = kOptsA_B;
}
foo:
movs r2, #1
ldrh r3, [r0]
bfi r3, r2, #9, #3
strh r3, [r0] # movhi
bx lr
I have the following issue: I have two 64-bit variables and they have to be compared as quickly as possible; my microcontroller is only 32-bit.
My thought is that it is necessary to divide each 64-bit variable into two 32-bit variables, like this
uint64_t var = 0xAAFFFFFFABCDELL;
hiPart = (uint32_t)((var & 0xFFFFFFFF00000000LL) >> 32);
loPart = (uint32_t)(var & 0xFFFFFFFFLL);
and then to compare the hiParts and loParts, but I am sure that this approach is slow and there is a much better solution
The first rule should be: write your program so that it is readable to a human.
When in doubt, don't assume anything, but measure it. Let's see what godbolt gives us.
#include <stdint.h>
#include <stdbool.h>
bool foo(uint64_t a, uint64_t b) {
return a == b;
}
bool foo2(uint64_t a, uint64_t b) {
uint32_t ahiPart = (uint32_t)((a & 0xFFFFFFFF00000000ULL) >> 32);
uint32_t aloPart = (uint32_t)(a & 0xFFFFFFFFULL);
uint32_t bhiPart = (uint32_t)((b & 0xFFFFFFFF00000000ULL) >> 32);
uint32_t bloPart = (uint32_t)(b & 0xFFFFFFFFULL);
return ahiPart == bhiPart && aloPart == bloPart;
}
foo:
eor r1, r1, r3
eor r0, r0, r2
orr r0, r0, r1
rsbs r1, r0, #0
adc r0, r0, r1
bx lr
foo2:
eor r1, r1, r3
eor r0, r0, r2
orr r0, r0, r1
rsbs r1, r0, #0
adc r0, r0, r1
bx lr
As you can see, they result in the exact same assembly code, but you decide which one is less error-prone and easier to read.
There was a time some years ago when you needed tricks to be smarter than the compiler. But in 99.999% of cases the compiler will be smarter than you.
And your variables are unsigned, so use ULL instead of LL.
The fastest way is to let the compiler do it. Most compilers are much better than humans at micro-optimization.
uint64_t var = …, other_var = …;
if (var == other_var) …
There aren't many ways to go about it. Under the hood, the compiler will arrange to load the upper 32 bits and the lower 32 bits of each variable into registers, and compare the two registers that contain the upper 32 bits and the two registers that contain the lower 32 bits. The assembly code might look something like this:
load 32 bits from &var into r0
load 32 bits from &other_var into r1
if r0 != r1: goto different
load 32 bits from &var + 4 into r2
load 32 bits from &other_var + 4 into r3
if r2 != r3: goto different
// code for if-equal
different:
// code for if-not-equal
Here are some things the compiler knows better than you:
Which registers to use, based on the needs of the surrounding code.
Whether to reuse the same registers to compare the upper and lower parts, or to use different registers.
Whether to process one part and then the other (as above), or to load one variable then the other. The best order depends on the pressure on registers and on the memory access times and pipelining of the particular processor model.
If you work with a union, you can compare the hi and lo parts without any extra calculations:
typedef union
{
struct
{
uint32_t loPart;
uint32_t hiPart;
};
uint64_t complete;
} uint64T;
uint64T var;
var.complete = 0xAAFFFFFFABCDEULL;
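A compilable sketch of that union-based comparison; the member order assumes a little-endian target where loPart is the low word:

```c
#include <stdint.h>
#include <stdbool.h>

typedef union
{
    struct
    {
        uint32_t loPart;   /* low word first: little-endian layout assumed */
        uint32_t hiPart;
    };
    uint64_t complete;
} uint64T;

/* Comparing via the 32-bit halves; on a 32-bit target this matches
   what the compiler emits for a plain 64-bit == anyway. */
static bool equal64(const uint64T *a, const uint64T *b)
{
    return a->hiPart == b->hiPart && a->loPart == b->loPart;
}
```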
I am wondering whether I need to use an atomic type or volatile (or nothing special) for an interrupt counter:
uint32_t uptime = 0;
// interrupt each 1 ms
ISR()
{
// this is the only location which writes to uptime
++uptime;
}
void some_func()
{
uint32_t now = uptime;
}
I myself would think that volatile should be enough and guarantee error-free operation and consistency (incremental value until overflow).
But it has come to my mind that maybe a mov instruction could be interrupted mid-operation when moving/setting individual bits, is that possible on x86_64 and/or armv7-m?
for example the mov instruction would begin to execute, set 16 bits, then would be pre-empted, the ISR would run increasing uptime by one (and maybe changing all bits) and then the mov instruction would be continued. I cannot find any material that could assure me of the working order.
Would this also be the same on armv7-m?
Would using sig_atomic_t be the correct solution to always have an error-free and consistent result or would it be "overkill"?
For example, the ARMv7-M architecture specifies:
In ARMv7-M, the single-copy atomic processor accesses are:
• All byte accesses.
• All halfword accesses to halfword-aligned locations.
• All word accesses to word-aligned locations.
would an assert with &uptime % 8 == 0 be sufficient to guarantee this?
Use volatile. Your compiler does not know about interrupts. It may assume that ISR() is never called (do you have a call to ISR anywhere in your code?). That means that uptime will never increment, which means that uptime will always be zero, which means that uint32_t now = uptime; may be safely optimized to uint32_t now = 0;. Use volatile uint32_t uptime. That way the optimizer will not optimize uptime away.
Word size. A uint32_t variable has 4 bytes. So on a 32-bit processor it takes 1 instruction to fetch its value, but on an 8-bit processor it takes at least 4 instructions (in general). So on a 32-bit processor you don't need to disable interrupts before loading the value of uptime, because the interrupt routine will start executing before or after the current instruction has finished executing. The processor can't branch to the interrupt routine mid-instruction; that's not possible. On an 8-bit processor we need to disable interrupts before reading from uptime, like:
DisableInterrupts();
uint32_t now = uptime;
EnableInterrupts();
C11 atomic types. I have never seen real embedded code which uses them; I am still waiting, and I see volatile everywhere. This is dependent on your compiler, because the compiler implements the atomic types and the atomic_* functions. Are you 100% sure that when reading from an atomic variable your compiler will disable the ISR() interrupt? Inspect the assembly output generated from the atomic_* calls and you will know for sure. This was a good read. I would expect the C11 atomic types to work for concurrency between multiple threads, which can switch execution context at any time. Using them between interrupt and normal context may block your CPU, because once you are in an IRQ you get back to normal execution only after servicing that IRQ; i.e. if some_func takes a mutex to read uptime, then the IRQ fires and checks in a loop whether the mutex is free, this results in an endless loop.
See for example the HAL_GetTick() implementation, from here; I removed the __weak macro and substituted the __IO macro with volatile (those macros are defined in a CMSIS file):
static volatile uint32_t uwTick;
void HAL_IncTick(void)
{
uwTick++;
}
uint32_t HAL_GetTick(void)
{
return uwTick;
}
Typically HAL_IncTick() is called from systick interrupt each 1ms.
You have to read the documentation for each separate core and/or chip. x86 is a completely separate thing from ARM, and within both families each instance may vary from any other instance; each can be, and should be expected to be, a completely new design. It might not be, but from time to time it is.
Things to watch out for as noted in the comments.
typedef unsigned int uint32_t;
uint32_t uptime = 0;
void ISR ( void )
{
++uptime;
}
void some_func ( void )
{
uint32_t now = uptime;
}
On my machine with the tool I am using today:
Disassembly of section .text:
00000000 <ISR>:
0: e59f200c ldr r2, [pc, #12] ; 14 <ISR+0x14>
4: e5923000 ldr r3, [r2]
8: e2833001 add r3, r3, #1
c: e5823000 str r3, [r2]
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
00000018 <some_func>:
18: e12fff1e bx lr
Disassembly of section .bss:
00000000 <uptime>:
0: 00000000 andeq r0, r0, r0
This could vary, but if you find a tool on one machine one day that produces a problem, then you can assume it is a problem. So far we are actually okay; because some_func is dead code, the read is optimized out.
typedef unsigned int uint32_t;
uint32_t uptime = 0;
void ISR ( void )
{
++uptime;
}
uint32_t some_func ( void )
{
uint32_t now = uptime;
return(now);
}
Fixed:
00000000 <ISR>:
0: e59f200c ldr r2, [pc, #12] ; 14 <ISR+0x14>
4: e5923000 ldr r3, [r2]
8: e2833001 add r3, r3, #1
c: e5823000 str r3, [r2]
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
00000018 <some_func>:
18: e59f3004 ldr r3, [pc, #4] ; 24 <some_func+0xc>
1c: e5930000 ldr r0, [r3]
20: e12fff1e bx lr
24: 00000000 andeq r0, r0, r0
Because cores like MIPS and ARM tend to raise data aborts by default for unaligned accesses, we might assume the tool will not generate an unaligned address for such a clean definition. But if we were to talk about packed structs, that is another story: you told the compiler to generate an unaligned access, and it will... If you want to feel safe, remember a "word" in ARM is 32 bits, so you can assert that the address of the variable ANDed with 3 is zero.
On x86 one would also assume a clean definition like that would result in an aligned variable, but x86 doesn't have the data-fault issue by default, and as a result compilers are a bit more free... I am focusing on ARM as I think that is your question.
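That alignment check can be written directly; a minimal sketch:

```c
#include <stdint.h>

/* An ARM "word" is 32 bits, so a word-aligned address has its two
   low address bits clear. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}
```

e.g. assert(is_word_aligned(&uptime)); a normally defined uint32_t will always pass, while a packed struct member may not.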
Now if I do this:
typedef unsigned int uint32_t;
uint32_t uptime = 0;
void ISR ( void )
{
if(uptime)
{
uptime=uptime+1;
}
else
{
uptime=uptime+5;
}
}
uint32_t some_func ( void )
{
uint32_t now = uptime;
return(now);
}
00000000 <ISR>:
0: e59f2014 ldr r2, [pc, #20] ; 1c <ISR+0x1c>
4: e5923000 ldr r3, [r2]
8: e3530000 cmp r3, #0
c: 03a03005 moveq r3, #5
10: 12833001 addne r3, r3, #1
14: e5823000 str r3, [r2]
18: e12fff1e bx lr
1c: 00000000 andeq r0, r0, r0
and adding volatile
00000000 <ISR>:
0: e59f3018 ldr r3, [pc, #24] ; 20 <ISR+0x20>
4: e5932000 ldr r2, [r3]
8: e3520000 cmp r2, #0
c: e5932000 ldr r2, [r3]
10: 12822001 addne r2, r2, #1
14: 02822005 addeq r2, r2, #5
18: e5832000 str r2, [r3]
1c: e12fff1e bx lr
20: 00000000 andeq r0, r0, r0
The two reads result in two loads. Now there is a problem here if the read-modify-write can get interrupted, but we assume that since this is an ISR it can't? If you read a 7, add 1, then write an 8, and you were interrupted after the read by something that is also modifying uptime, that modification has a limited life: it happens, say a 5 is written, then this ISR writes an 8 on top of it.
If a read-modify-write were in the interruptible code, then the ISR could get in there and it probably wouldn't work the way you wanted. That is two readers and two writers; you want one party responsible for writing a shared resource and the others read-only. Otherwise you need a lot more work not built into the language.
Note on an arm machine:
typedef int __sig_atomic_t;
...
typedef __sig_atomic_t sig_atomic_t;
so
typedef unsigned int uint32_t;
typedef int sig_atomic_t;
volatile sig_atomic_t uptime = 0;
void ISR ( void )
{
if(uptime)
{
uptime=uptime+1;
}
else
{
uptime=uptime+5;
}
}
uint32_t some_func ( void )
{
uint32_t now = uptime;
return(now);
}
Isn't going to change the result. At least not on that system with that typedef; you need to examine other C libraries and/or sandbox headers to see what they define. Or, if you are not careful (it happens often), the wrong headers are used: the x86_64 headers are used to build ARM programs with the cross compiler. I have seen gcc and llvm make host-vs-target mistakes.
Going back to a concern which, based on your comments, you appear to already understand:
typedef unsigned int uint32_t;
uint32_t uptime = 0;
void ISR ( void )
{
if(uptime)
{
uptime=uptime+1;
}
else
{
uptime=uptime+5;
}
}
void some_func ( void )
{
while(uptime&1) continue;
}
This was pointed out in the comments even though you have one writer and one reader
00000020 <some_func>:
20: e59f3018 ldr r3, [pc, #24] ; 40 <some_func+0x20>
24: e5933000 ldr r3, [r3]
28: e2033001 and r3, r3, #1
2c: e3530000 cmp r3, #0
30: 012fff1e bxeq lr
34: e3530000 cmp r3, #0
38: 012fff1e bxeq lr
3c: eafffffa b 2c <some_func+0xc>
40: 00000000 andeq r0, r0, r0
It never goes back to read the variable from memory, and unless someone corrupts the register in an event handler, this can be an infinite loop.
make uptime volatile:
00000024 <some_func>:
24: e59f200c ldr r2, [pc, #12] ; 38 <some_func+0x14>
28: e5923000 ldr r3, [r2]
2c: e3130001 tst r3, #1
30: 012fff1e bxeq lr
34: eafffffb b 28 <some_func+0x4>
38: 00000000 andeq r0, r0, r0
now the reader does a read every time.
Same issue here, not in a loop, no volatile:
00000020 <some_func>:
20: e59f302c ldr r3, [pc, #44] ; 54 <some_func+0x34>
24: e5930000 ldr r0, [r3]
28: e3500005 cmp r0, #5
2c: 0a000004 beq 44 <some_func+0x24>
30: e3500004 cmp r0, #4
34: 0a000004 beq 4c <some_func+0x2c>
38: e3500001 cmp r0, #1
3c: 03a00006 moveq r0, #6
40: e12fff1e bx lr
44: e3a00003 mov r0, #3
48: e12fff1e bx lr
4c: e3a00007 mov r0, #7
50: e12fff1e bx lr
54: 00000000 andeq r0, r0, r0
uptime can have changed between the tests; volatile fixes this.
So volatile is not a universal solution. Having the variable used for one-way communication is ideal; if you need to communicate the other way, use a separate variable: one writer and one or more readers per variable.
You have done the right thing and consulted the documentation for your chip/core:
So if the variable is aligned (in this case to a 32-bit word) AND the compiler chooses the right instruction, then the interrupt won't interrupt the transaction. If it is an LDM/STM though, you should read the documentation (push and pop are also LDM/STM pseudo-instructions): in some cores/architectures those can be interrupted and restarted, and as a result we are warned about those situations in the ARM documentation.
Short answer: add volatile, make it so there is only one writer per variable, and keep the variable aligned. (And read the docs each time you change chips/cores, and periodically disassemble to check the compiler is doing what you asked it to do.) It doesn't matter if it is the same core type (another cortex-m3) from the same vendor or different vendors, or if it is some completely different core/chip (avr, msp430, pic, x86, mips, etc); start from zero, get the docs and read them, check the compiler output.
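As a minimal sketch of that short answer (SysTick_Handler is just a hypothetical ISR name; use whatever your vector table expects):

```c
#include <stdint.h>

/* One writer (the ISR), any number of readers. volatile forces the
   reader to reload from memory on every access; an aligned uint32_t
   load/store is a single instruction on a 32-bit ARM. */
static volatile uint32_t uptime;

void SysTick_Handler(void)      /* hypothetical ISR name */
{
    ++uptime;                   /* the only writer of this variable */
}

uint32_t get_uptime(void)
{
    return uptime;              /* single aligned 32-bit load */
}
```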
TL:DR: Use volatile if an aligned uint32_t is naturally atomic (it is on x86 and ARM). Why is integer assignment on a naturally aligned variable atomic on x86?. Your code will technically have C11 undefined behaviour, but real implementations will do what you want with volatile.
Or use C11 stdatomic.h with memory_order_relaxed if you want to tell the compiler exactly what you mean. It will compile to the same asm as volatile on x86 and ARM if you use it correctly.
(But if you actually need it to run efficiently on single-core CPUs where load/store of an aligned uint32_t isn't atomic "for free", e.g. with only 8-bit registers, you might rather disable interrupts instead of having stdatomic fall back to using a lock to serialize reads and writes of your counter.)
Whole instructions are always atomic with respect to interrupts on the same core, on all CPU architectures. Partially-completed instructions are either completed or discarded (without committing their stores) before servicing an interrupt.
For a single core, CPUs always preserve the illusion of running instructions one at a time, in program order. This includes interrupts only happening on the boundaries between instructions. See #supercat's single-core answer on Can num++ be atomic for 'int num'?. If the machine has 32-bit registers, you can safely assume that a volatile uint32_t will be loaded or stored with a single instruction. As #old_timer points out, beware of unaligned packed-struct members on ARM, but unless you manually do that with __attribute__((packed)) or something, the normal ABIs on x86 and ARM ensure natural alignment.
Multiple bus transactions from a single instruction for unaligned operands or narrow busses only matters for concurrent read+write, either from another core or a non-CPU hardware device. (e.g. if you're storing to device memory).
Some long-running x86 instructions like rep movs or vpgatherdd have well-defined ways to partially complete on exceptions or interrupts: they update registers so re-running the instruction does the right thing. But other than that, an instruction has either run or it hasn't, even a "complex" instruction like a memory-destination add that does a read/modify/write. IDK if anyone's ever proposed a CPU that could suspend/resume multi-step instructions across interrupts instead of cancelling them, but x86 and ARM are definitely not like that. There are lots of weird ideas in computer-architecture research papers. But it seems unlikely that it would be worth keeping all the necessary microarchitectural state to resume in the middle of a partially-executed instruction instead of just re-decoding it after returning from an interrupt.
This is why AVX2 / AVX512 gathers always need a gather mask even when you want to gather all the elements, and why they destroy the mask (so you have to reset it to all-ones again before the next gather).
In your case, you only need the store (and load outside the ISR) to be atomic. You don't need the whole ++uptime to be atomic. You can express this with C11 stdatomic like this:
#include <stdint.h>
#include <stdatomic.h>
_Atomic uint32_t uptime = 0;
// interrupt each 1 ms
void ISR()
{
// this is the only location which writes to uptime
uint32_t tmp = atomic_load_explicit(&uptime, memory_order_relaxed);
// the load doesn't even need to be atomic, but relaxed atomic is as cheap as volatile on machines with wide-enough loads
atomic_store_explicit(&uptime, tmp+1, memory_order_relaxed);
// some x86 compilers may fail to optimize to add dword [uptime],1
// but uptime+=1 would compile to LOCK ADD (an atomic increment), which you don't want.
}
// MODIFIED: return the load result
uint32_t some_func()
{
// this does need to be an atomic load
// you typically get that by default with volatile, too
uint32_t now = atomic_load_explicit(&uptime, memory_order_relaxed);
return now;
}
volatile uint32_t compiles to the exact same asm on x86 and ARM. I put the code on the Godbolt compiler explorer. This is what clang6.0 -O3 does for x86-64. (With -mtune=bdver2, it uses inc instead of add, but it knows that memory-destination inc is one of the few cases where inc is still worse than add on Intel :)
ISR: # #ISR
add dword ptr [rip + uptime], 1
ret
some_func: # #some_func
mov eax, dword ptr [rip + uptime]
ret
inc_volatile: // void func(){ volatile_var++; }
add dword ptr [rip + volatile_var], 1
ret
gcc uses separate load/store instructions for both volatile and _Atomic, unfortunately.
# gcc8.1 -O3
mov eax, DWORD PTR uptime[rip]
add eax, 1
mov DWORD PTR uptime[rip], eax
At least that means there's no downside to using _Atomic or volatile _Atomic on either gcc or clang.
Plain uint32_t without either qualifier is not a real option, at least not for the read side. You probably don't want the compiler to hoist get_time() out of a loop and use the same time for every iteration. In cases where you do want that, you could copy it to a local. That could result in extra work for no benefit if the compiler doesn't keep it in a register, though (e.g. across function calls it's easiest for the compiler to just reload from static storage). On ARM, though, copying to a local may actually help because then it can reference it relative to the stack pointer instead of needing to keep a static address in another register, or regenerate the address. (x86 can load from static addresses with a single large instruction, thanks to its variable-length instruction set.)
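For instance, a sketch of copying to a local so one computation sees a single consistent value (elapsed_since is a hypothetical helper):

```c
#include <stdint.h>

static volatile uint32_t uptime;   /* incremented by the ISR, as above */

/* Take one snapshot so the whole computation sees the same tick value,
   instead of several volatile reads that could straddle an increment. */
uint32_t elapsed_since(uint32_t start)
{
    uint32_t now = uptime;   /* single volatile read */
    return now - start;      /* wraps correctly with unsigned arithmetic */
}
```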
If you want any stronger memory-ordering, you can use atomic_signal_fence(memory_order_release); or whatever (signal_fence not thread_fence) to tell the compiler you only care about ordering wrt. code running asynchronously on the same CPU ("in the same thread" like a signal handler), so it will only have to block compile-time reordering, not emit any memory-barrier instructions like ARM dmb.
e.g. in the ISR:
uint32_t tmp = atomic_load_explicit(&idx, memory_order_relaxed);
tmp++;
shared_buf[tmp] = 2; // non-atomic
// Then do a release-store of the index
atomic_signal_fence(memory_order_release);
atomic_store_explicit(&idx, tmp, memory_order_relaxed);
Then it's safe for a reader to load idx, run atomic_signal_fence(memory_order_acquire);, and read from shared_buf[tmp] even if shared_buf is not _Atomic. (Assuming you took care of wraparound issues and so on.)
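A sketch of the matching reader side, under the same assumptions (shared_buf, idx and the buffer size are illustrative, mirroring the writer above):

```c
#include <stdint.h>
#include <stdatomic.h>

#define BUF_SIZE 16u
static uint8_t shared_buf[BUF_SIZE];   /* not _Atomic */
static _Atomic uint32_t idx;

/* Acquire side: load the index, then fence. After the fence it is safe
   to read the slot the writer published before its release fence. */
static uint8_t read_latest(void)
{
    uint32_t i = atomic_load_explicit(&idx, memory_order_relaxed);
    atomic_signal_fence(memory_order_acquire);   /* compile-time barrier only */
    return shared_buf[i % BUF_SIZE];
}
```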
volatile is only a suggestion to the compiler about where the value should be stored. Typically, without this qualifier, the value may be kept in a CPU register; but if the compiler will not take that space because it is busy with another operation, the hint is ignored and the value is traditionally stored in memory. This is the main rule.
Then let's look at the architecture. All native CPU instructions on native types are atomic. But many operations can be split into two steps, when a value has to be copied from memory to memory; in that situation a CPU interrupt can occur in between. But don't worry, it is normal; as long as the value has not been stored into the prepared variable, you can treat the operation as not fully committed.
The problem is when you use words longer than those implemented in the CPU, for example a 32-bit value on a 16- or 8-bit processor. In that situation reading and writing the value is split into many steps; then it is possible that some part of the value has been stored and another part has not, and you will get a wrong, damaged value.
In this scenario disabling interrupts is not always a good approach, because it can take a long time. Of course you can use locking, but that can do the same.
But you can make a structure whose first field is the data and whose second field is a counter that fits the architecture. Then when you read the value, you first read the counter, then the value, and finally the counter a second time. If the two counter reads differ, you repeat the process.
Of course this doesn't guarantee that everything will be proper, but it typically saves a lot of CPU cycles. For example, if you use an additional 16-bit counter for verification, that is 65536 values; to fool the check, your main process would have to be frozen for a very long time, in this example 65536 missed interrupts, to produce a bug in the main counter or any other stored value.
Of course, if you are using a 32-bit value on a 32-bit architecture, it is not a problem and you don't need to specially secure the operation, except of course if the architecture does all its operations atomically :)
example code:
static volatile struct
{
    uint32_t value;    // the important value
    uint32_t watchdog; // guard counter; use the platform's native width
} SecuredCounter;

void ISR(void)
{
    // this is the only location which writes to the counter
    ++SecuredCounter.value;
    ++SecuredCounter.watchdog;
}

uint32_t Read_uptime(void);

void some_func(void)
{
    uint32_t now = Read_uptime();
}

uint32_t Read_uptime(void)
{
    while (1) {
        uint32_t secure1 = SecuredCounter.watchdog; // read guard first
        uint32_t value   = SecuredCounter.value;    // read the value
        uint32_t secure2 = SecuredCounter.watchdog; // read guard again; must match the first
        if (secure1 == secure2) return value;       // consistent snapshot, safe to return
    }
}
A different approach is to keep two identical counters and increment both in a single function. The read function copies both values to local variables and compares them: if they are identical, the value is consistent and either copy can be returned; if they differ, the read was interrupted mid-update, so repeat it. Don't worry, the chance of being interrupted again on the retry is very small, so there is no realistic way for the loop to stall.
Environment: GCC 4.7.3 (arm-none-eabi-gcc) for ARM Cortex m4f. Bare-metal (actually MQX RTOS, but here that's irrelevant). The CPU is in Thumb state.
Here's a disassembler listing of some code I'm looking at:
//.label flash_command
// ...
while(!(FTFE_FSTAT & FTFE_FSTAT_CCIF_MASK)) {}
// Compiles to:
12: bf00 nop
14: f04f 0300 mov.w r3, #0
18: f2c4 0302 movt r3, #16386 ; 0x4002
1c: 781b ldrb r3, [r3, #0]
1e: b2db uxtb r3, r3
20: b2db uxtb r3, r3
22: b25b sxtb r3, r3
24: 2b00 cmp r3, #0
26: daf5 bge.n 14 <flash_command+0x14>
The constants (after expanding macros, etc.) are:
address of FTFE_FSTAT is 0x40020000u
FTFE_FSTAT_CCIF_MASK is 0x80u
This is compiled with NO optimization (-O0), so GCC shouldn't be doing anything fancy... and yet, I don't get this code. Post-answer edit: Never assume this. My problem was getting a false sense of security from turning off optimization.
I've read that "uxtb r3,r3" is a common way of truncating a 32-bit value. Why would you want to truncate it twice and then sign-extend? And how in the world is this equivalent to the bit-masking operation in the C-code?
What am I missing here?
Edit: Types of the thing involved:
So the actual macro expansion of FTFE_FSTAT comes down to
((((FTFE_MemMapPtr)0x40020000u))->FSTAT)
where the struct is defined as
/** FTFE - Peripheral register structure */
typedef struct FTFE_MemMap {
uint8_t FSTAT; /**< Flash Status Register, offset: 0x0 */
uint8_t FCNFG; /**< Flash Configuration Register, offset: 0x1 */
//... a bunch of other uint_8
} volatile *FTFE_MemMapPtr;
The two uxtb instructions are the compiler being stupid, they should be optimized out if you turn on optimization. The sxtb is the compiler being brilliant, using a trick that you wouldn't expect in unoptimized code.
The first uxtb is due to the fact that you loaded a byte from memory. The compiler is zeroing the other 24 bits of register r3, so that the byte value fills the entire register.
The second uxtb is due to the fact that you're ANDing with an 8-bit value. The compiler realizes that the upper 24-bits of the result will always be zero, so it's using uxtb to clear the upper 24-bits.
Neither of the uxtb instructions does anything useful, because the sxtb instruction overwrites the upper 24 bits of r3 anyways. The optimizer should realize that and remove them when you compile with optimizations enabled.
The sxtb instruction takes the one bit you care about 0x80 and moves it into the sign bit of register r3. That way, if bit 0x80 is set, then r3 becomes a negative number. So now the compiler can compare with 0 to determine whether the bit was set. If the bit was not set then the bge instruction branches back to the top of the while loop.