I need a thread save idx++ and idx-- operation.
Disabling interrupts, i.e. use critical sections, is one thing, but I want
to understand why my operations are not atomic, as I expect ?
Here is the C-code with inline assembler code shown, using segger ozone:
(Also please notice, the address of the variables show that the 32 bit variable is 32-bit-aligned in memory, and the 8- and 16-bit variables are both 16 bit aligned)
volatile static U8 dbgIdx8 = 1000U;
volatile static U16 dbgIdx16 = 1000U;
volatile static U32 dbgIdx32 = 1000U;
dbgIdx8 ++;
080058BE LDR R3, [PC, #48]
080058C0 LDRB R3, [R3]
080058C2 UXTB R3, R3
080058C4 ADDS R3, #1
080058C6 UXTB R2, R3
080058C8 LDR R3, [PC, #36]
080058CA STRB R2, [R3]
dbgIdx16 ++;
080058CC LDR R3, [PC, #36]
080058CE LDRH R3, [R3]
080058D0 UXTH R3, R3
080058D2 ADDS R3, #1
080058D4 UXTH R2, R3
080058D6 LDR R3, [PC, #28]
080058D8 STRH R2, [R3]
dbgIdx32 ++;
080058DA LDR R3, [PC, #28]
080058DC LDR R3, [R3]
080058DE ADDS R3, #1
080058E0 LDR R2, [PC, #20]
080058E2 STR R3, [R2]
There is no guarantee that ++ and -- are atomic. If you need guaranteed atomicity, you will have to find some other way.
As #StaceyGirl points out in a comment, you might be able to use the facilities of <stdatomic.h>. For example, I see there's an atomic atomic_fetch_add function defined, which acts like the postfix ++ you're striving for. There's an atomic_fetch_sub, too.
Alternatively, you might have some compiler intrinsics available to you for performing an atomic increment in some processor-specific way.
ARM cortex cores do not modify memory. All memory modifications are performed as RMW (read-modify-write) operations which are not atomic by default.
But Cortex M3 has special instructions to lock access to the memory location. LDREX & STREX. https://developer.arm.com/documentation/100235/0004/the-cortex-m33-instruction-set/memory-access-instructions/ldaex-and-stlex
You can use them directly in the C code without touching the assembly by using intrinsic.
Do not use not 32 bits data types in any performance (you want to lock for as short as possible time) sensitive programs. Most shorter data types operations add some additional code.
In the code below I saw that clang fails to perform better optimisation without implicit restrict pointer specifier:
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
typedef struct {
uint32_t event_type;
uintptr_t param;
} event_t;
typedef struct
{
event_t *queue;
size_t size;
uint16_t num_of_items;
uint8_t rd_idx;
uint8_t wr_idx;
} queue_t;
static bool queue_is_full(const queue_t *const queue_ptr)
{
return queue_ptr->num_of_items == queue_ptr->size;
}
static size_t queue_get_size_mask(const queue_t *const queue_ptr)
{
return queue_ptr->size - 1;
}
int queue_enqueue(queue_t *const queue_ptr, const event_t *const event_ptr)
{
if(queue_is_full(queue_ptr))
{
return 1;
}
queue_ptr->queue[queue_ptr->wr_idx++] = *event_ptr;
queue_ptr->num_of_items++;
queue_ptr->wr_idx &= queue_get_size_mask(queue_ptr);
return 0;
}
I compiled this code with clang version 11.0.0 (clang-1100.0.32.5)
clang -O2 -arch armv7m -S test.c -o test.s
In the disassembled file I saw that the generated code re-reads the memory:
_queue_enqueue:
.cfi_startproc
# %bb.0:
ldrh r2, [r0, #8] ---> reads the queue_ptr->num_of_items
ldr r3, [r0, #4] ---> reads the queue_ptr->size
cmp r3, r2
itt eq
moveq r0, #1
bxeq lr
ldrb r2, [r0, #11] ---> reads the queue_ptr->wr_idx
adds r3, r2, #1
strb r3, [r0, #11] ---> stores the queue_ptr->wr_idx + 1
ldr.w r12, [r1]
ldr r3, [r0]
ldr r1, [r1, #4]
str.w r12, [r3, r2, lsl #3]
add.w r2, r3, r2, lsl #3
str r1, [r2, #4]
ldrh r1, [r0, #8] ---> !!! re-reads the queue_ptr->num_of_items
adds r1, #1
strh r1, [r0, #8]
ldrb r1, [r0, #4] ---> !!! re-reads the queue_ptr->size (only the first byte)
ldrb r2, [r0, #11] ---> !!! re-reads the queue_ptr->wr_idx
subs r1, #1
ands r1, r2
strb r1, [r0, #11] ---> !!! stores the updated queue_ptr->wr_idx once again after applying the mask
movs r0, #0
bx lr
.cfi_endproc
# -- End function
After adding the restrict keyword to the pointers, these unneeded re-reads just vanished:
int queue_enqueue(queue_t * restrict const queue_ptr, const event_t * restrict const event_ptr)
I know that in clang, by default strict aliasing is disabled. But in this case, event_ptr pointer is defined as const so its object's content cannot be modified by this pointer, thus it cannot affect the content to which queue_ptr points (assuming the case when the objects overlap in the memory), right?
So is this a compiler optimisation bug or there is indeed some weird case when the object pointed by queue_ptr can be affected by event_ptr assuming this declaration:
int queue_enqueue(queue_t *const queue_ptr, const event_t *const event_ptr)
By the way, I tried to compile the same code for x86 target and inspected similar optimisation issue.
The generated assembly with the restrict keyword, doesn't contain the re-reads:
_queue_enqueue:
.cfi_startproc
# %bb.0:
ldr r3, [r0, #4]
ldrh r2, [r0, #8]
cmp r3, r2
itt eq
moveq r0, #1
bxeq lr
push {r4, r6, r7, lr}
.cfi_def_cfa_offset 16
.cfi_offset lr, -4
.cfi_offset r7, -8
.cfi_offset r6, -12
.cfi_offset r4, -16
add r7, sp, #8
.cfi_def_cfa r7, 8
ldr.w r12, [r1]
ldr.w lr, [r1, #4]
ldrb r1, [r0, #11]
ldr r4, [r0]
subs r3, #1
str.w r12, [r4, r1, lsl #3]
add.w r4, r4, r1, lsl #3
adds r1, #1
ands r1, r3
str.w lr, [r4, #4]
strb r1, [r0, #11]
adds r1, r2, #1
strh r1, [r0, #8]
movs r0, #0
pop {r4, r6, r7, pc}
.cfi_endproc
# -- End function
Addition:
After some discussion with Lundin in the comments to his answer, I got the impression that the re-reads could be caused because the compiler would assume that queue_ptr->queue might potentially point to *queue_ptr itself. So I changed the queue_t struct to contain array instead of the pointer:
typedef struct
{
event_t queue[256]; // changed from pointer to array with max size
size_t size;
uint16_t num_of_items;
uint8_t rd_idx;
uint8_t wr_idx;
} queue_t;
However the re-reads remained as previously. I still can't understand what could make the compiler think that queue_t fields may be modified and thus require re-reads... The following declaration eliminates the re-reads:
int queue_enqueue(queue_t * restrict const queue_ptr, const event_t *const event_ptr)
But why queue_ptr has to be declared as restrict pointer to prevent the re-reads I don't understand (unless it is a compiler optimization "bug").
P.S.
I also couldn't find any link to file/report an issue on clang that doesn't cause the compiler to crash...
[talking about the original program]
This is caused by deficiency in the TBAA metadata, generated by Clang.
If you emit LLVM IR with -S -emit-llvm you'll see (snipped for brevity):
...
%9 = load i8, i8* %wr_idx, align 1, !tbaa !12
%10 = trunc i32 %8 to i8
%11 = add i8 %10, -1
%conv4 = and i8 %11, %9
store i8 %conv4, i8* %wr_idx, align 1, !tbaa !12
br label %return
...
!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 1, !"min_enum_size", i32 4}
!2 = !{!"clang version 10.0.0 (/home/chill/src/llvm-project 07da145039e1a6a688fb2ac2035b7c062cc9d47d)"}
!3 = !{!4, !9, i64 8}
!4 = !{!"queue", !5, i64 0, !8, i64 4, !9, i64 8, !6, i64 10, !6, i64 11}
!5 = !{!"any pointer", !6, i64 0}
!6 = !{!"omnipotent char", !7, i64 0}
!7 = !{!"Simple C/C++ TBAA"}
!8 = !{!"int", !6, i64 0}
!9 = !{!"short", !6, i64 0}
!10 = !{!4, !8, i64 4}
!11 = !{!4, !5, i64 0}
!12 = !{!4, !6, i64 11}
See the TBAA metadata !4: this is the type descriptor for queue_t (btw, I added names to structs, e.g. typedef struct queue ...) you may see empty string there). Each element in the description corresponds to struct fields and look at !5, which is the event_t *queue field: it's "any pointer"! At this point we've lost all the information about the actual type of the pointer, which tells me the compiler would assume writes through this pointer can modify any memory object.
That said, there's a new form for TBAA metadata, which is more precise (still has deficiencies, but for that later ...)
Compile the original program with -Xclang -new-struct-path-tbaa. My exact command line was (and I've included stddef.h instead of stdlib.h since that development build with no libc):
./bin/clang -I lib/clang/10.0.0/include/ -target armv7m-eabi -O2 -Xclang -new-struct-path-tbaa -S queue.c
The resulting assembly is (again, some fluff snipped):
queue_enqueue:
push {r4, r5, r6, r7, lr}
add r7, sp, #12
str r11, [sp, #-4]!
ldrh r3, [r0, #8]
ldr.w r12, [r0, #4]
cmp r12, r3
bne .LBB0_2
movs r0, #1
ldr r11, [sp], #4
pop {r4, r5, r6, r7, pc}
.LBB0_2:
ldrb r2, [r0, #11] ; load `wr_idx`
ldr.w lr, [r0] ; load `queue` member
ldrd r6, r1, [r1] ; load data pointed to by `event_ptr`
add.w r5, lr, r2, lsl #3 ; compute address to store the event
str r1, [r5, #4] ; store `param`
adds r1, r3, #1 ; increment `num_of_items`
adds r4, r2, #1 ; increment `wr_idx`
str.w r6, [lr, r2, lsl #3] ; store `event_type`
strh r1, [r0, #8] ; store new value for `num_of_items`
sub.w r1, r12, #1 ; compute size mask
ands r1, r4 ; bitwise and size mask with `wr_idx`
strb r1, [r0, #11] ; store new value for `wr_idx`
movs r0, #0
ldr r11, [sp], #4
pop {r4, r5, r6, r7, pc}
Looks good, isn't it! :D
I mentioned earlier there are deficiencies with the "new struct path", but for that: on the mailing list.
PS. I'm afraid there's no general lesson to be learned in this case. In principle, the more information one is able to give the compiler - the better: things like restrict, strong typing (not gratuitous casts, type punning, etc), relevant function and variable attributes ... but not in this case, the original program already contained all the necessary information. It is just a compiler deficiency and the best way to tackle those is to raise awareness: ask on the mailing list and/or file bug reports.
The event_t member of queue_ptr could point at the same memory as event_ptr. Compilers tend to produce less efficient code when they can't rule out that two pointers point at the same memory. So there's nothing strange with restrict leading to better code.
Const qualifiers don't really matter, because these were added by the function and the original type could be modifiable elsewhere. In particular, the * const doesn't add anything because the pointer is already a local copy of the original pointer, so nobody including the caller cares if the function modifies that local copy or not.
"Strict aliasing" rather refers to the cases where the compiler can cut corners like when assuming that a uint16_t* can't point at a uint8_t* etc. But in your case you have two completely compatible types, one of them is merely wrapped in an outer struct.
As far as I can tell, yes, in your code queue_ptr pointee's contents cannot be modified. Is is it an optimization bug? It's a missed optimization opportunity, but I wouldn't call it a bug. It doesn't "misunderstand" const, it just doesn't have/doesn't do the necessary analyses to determine it cannot be modified for this specific scenario.
As a side note: queue_is_full(queue_ptr) can modify the contents of *queue_ptr even if it has const queue_t *const param because it can legally cast away the constness since the original object is not const. That being said, the definition of quueue_is_full is visible and available to the compiler so it can ascertain it indeed does not.
As you know, your code appears to modify the data, invalidating the const state:
queue_ptr->num_of_items++; // this stores data in the memory
Without the restrict keyword, the compiler must assume that the two types might share the same memory space.
This is required in the updated example because event_t is a member of queue_t and strict aliasing applies on:
... an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or...
On the original example there are a number of reasons the types might be considered to alias, leading to the same result (i.e., the use of a char pointer and the fact that types might be considered compatible "enough" on some architectures if not all).
Hence, the compiler is required to reload the memory after it was mutated to avoid possible conflicts.
The const keyword doesn't really enter into this, because a mutation happens through a pointer that might point to the same memory address.
(EDIT) For your convenience, here's the full rule regarding access to a variable:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types (88):
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the object,
— a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
(88) The intent of this list is to specify those circumstances in which an object may or may not be aliased.
P.S.
The _t suffix is reserved by POSIX. You might consider using a different suffix.
It's common practice to use _s for structs and _u for unions.
I have the following C struct that represent a register in an external chip
typedef union {
// Individual Fields
struct {
uint8_t ELEM_1 : 4 ; // Bits 0-3
uint8_t ELEM_2 : 3 ; // Bits 4-6
uint8_t ELEM_3 : 2 ; // Bits 7-8
} field;
// Complete Value
uint32_t value;
} ELEMENTS_t;
As you can see, ELEM_1 and ELEM_2 can fit inside a byte without any issues and when accessed, the assembly code looks like this
ELEMENTS.field.ELEM_2 = 0x7;
101488: e55b3028 ldrb r3, [fp, #-40] ; 0xffffffd8
10148c: e3833070 orr r3, r3, #112 ; 0x70
101490: e54b3028 strb r3, [fp, #-40] ; 0xffffffd8
ELEMENTS.field.ELEM_1 = 0xf;
101494: e55b3028 ldrb r3, [fp, #-40] ; 0xffffffd8
101498: e383300f orr r3, r3, #15
10149c: e54b3028 strb r3, [fp, #-40] ; 0xffffffd8
They all get written in the same byte with the corret bit order.
The problem is when we get to ELEM_3. That element crosses the byte boundry, since it should be placed in bits[8:7] and to avoid having multiple memory accesses (probably) the compiler places it in a completely separate byte, so when I try to access it, it looks like this
ELEMENTS.field.ELEM_3 = 0x3;
10147c: e55b3027 ldrb r3, [fp, #-39] ; 0xffffffd9
101480: e3833003 orr r3, r3, #3
101484: e54b3027 strb r3, [fp, #-39] ; 0xffffffd9
This doesn't cause issues when accessing these elements field by field, but it does when trying to flush the data to the external chip.
Does anybody know how to tell the compiler to pack all the bits together? This is using the Xilinx SDK targeting the ARM Cortex-A9 processor embedded inside a Zynq SoC.
You can't pack 9 bits into a byte. So it crosses the byte boundary. If the register is a single byte it should only access bits 0 through 7.
There could be other problems with this if you need to set multiple bits at the same time.
I found a solution. If I set all the elements as uint32_t instead of uint8_t, it reads and writes the register 32-bits at a time, which makes the operations work.
ELEMENTS.field.ELEM_3 = 0x3;
10147c: e15b32b8 ldrh r3, [fp, #-40] ; 0xffffffd8
101480: e3833d06 orr r3, r3, #384 ; 0x180
101484: e14b32b8 strh r3, [fp, #-40] ; 0xffffffd8
ELEMENTS.field.ELEM_2 = 0x7;
101488: e55b3028 ldrb r3, [fp, #-40] ; 0xffffffd8
10148c: e3833070 orr r3, r3, #112 ; 0x70
101490: e54b3028 strb r3, [fp, #-40] ; 0xffffffd8
ELEMENTS.field.ELEM_1 = 0xf;
101494: e55b3028 ldrb r3, [fp, #-40] ; 0xffffffd8
101498: e383300f orr r3, r3, #15
10149c: e54b3028 strb r3, [fp, #-40] ; 0xffffffd8
I am compiling this code for a Cortex M7 using GCC:
// copy manually
void write_test_plain(uint8_t * ptr, uint32_t value)
{
*ptr++ = (u8)(value);
*ptr++ = (u8)(value >> 8);
*ptr++ = (u8)(value >> 16);
*ptr++ = (u8)(value >> 24);
}
// copy using memcpy
void write_test_memcpy(uint8_t * ptr, uint32_t value)
{
void *px = (void*)&value;
memcpy(ptr, px, 4);
}
int main(void)
{
extern uint8_t data[];
extern uint32_t value;
// i added some offsets to data to
// make sure the compiler cannot
// assume it's aligned in memory
write_test_plain(data + 2, value);
__asm volatile("": : :"memory"); // just to split inlined calls
write_test_memcpy(data + 5, value);
... do something with data ...
}
And I get the following Thumb2 assembly with -O2:
// write_test_plain(data + 2, value);
800031c: 2478 movs r4, #120 ; 0x78
800031e: 2056 movs r0, #86 ; 0x56
8000320: 2134 movs r1, #52 ; 0x34
8000322: 2212 movs r2, #18 ; 0x12
8000324: 759c strb r4, [r3, #22]
8000326: 75d8 strb r0, [r3, #23]
8000328: 7619 strb r1, [r3, #24]
800032a: 765a strb r2, [r3, #25]
// write_test_memcpy(data + 5, value);
800032c: 4ac4 ldr r2, [pc, #784] ; (8000640 <main+0x3a0>)
800032e: 923b str r2, [sp, #236] ; 0xec
8000330: 983b ldr r0, [sp, #236] ; 0xec
8000332: f8c3 0019 str.w r0, [r3, #25]
Can someone explain how the memcpy version works? This looks like inlined 32-bit store to the destination address, but isn't this a problem since data + 5 is most certainly not aligned to a 4-byte boundary?
Is this perhaps some optimization which happens due to some undefined behavior in my source?
For Cortex-M processors unaligned loads and stores of bytes, half-words, and words are usually allowed and most compilers use this when generating code unless they are instructed not to. If you want to prevent gcc from assuming the unaligned accesses are OK, you can use the -mno-unaligned-access compiler flag.
If you specify this flag gcc will no longer inline the call to memcpy and write_test_memcpy looks like
write_test_memcpy(unsigned char*, unsigned long):
push {lr}
sub sp, sp, #12
movs r2, #4
add r3, sp, #8
str r1, [r3, #-4]!
mov r1, r3
bl memcpy
add sp, sp, #12
ldr pc, [sp], #4
Cortex-M 7 , M4, M3 M33, M23 does support unaligned access
M0, M+ doesn't support unaligned access
however you can disable the support of unaligned access in cortexm7 by setting bit UNALIGN_TRP in configuration and control register and any unaligned access will generate usage fault.
From compiler perspective, default setting is that generated assembly code does unaligned access unless you disable this by using the compile flag -mno-unaligned-access
I need to reduce the code bloat for the Cortex-M0 microprocessor.
At startup the ROM data has to be copied to the RAM data once. Therefore I have this piece of code:
void __startup( void ){
extern unsigned int __data_init_start;
extern unsigned int __data_start;
extern unsigned int __data_end;
// copy .data section from flash to ram
s = & __data_init_start;
d = & __data_start;
e = & __data_end;
while( d != e ){
*d++ = *s++;
}
}
The assembly code that is generated by the compiler looks like this:
ldr r1, .L10+8
ldr r2, .L10+12
sub r0, r1, r2
lsr r3, r0, #2
add r3, r3, #1
lsl r1, r3, #2
mov r3, #0
.L4:
add r3, r3, #4
cmp r3, r1
beq .L9
.L5:
ldr r4, .L10+16
add r0, r2, r3
add r4, r3, r4
sub r4, r4, #4
ldr r4, [r4]
sub r0, r0, #4
str r4, [r0]
b .L4
How can I optimize this code so the code size is at minimum?
The compiler (or you!) does not realize that the range to copy is end - start. There seems to be some unnecessarily shuffling of data going on -- the 2 add and the sub in the loop. Also, it seems to me the compiler makes sure that the number of copies to make is a multiple of 4. An obvious optimization, then, is to make sure it is in advance! Below I assume it is (if not, the bne will fail and happily keep on copying and trample all over your memory).
Using my decade-old ARM assembler knowlegde (yes, that is a major disclaimer), and post-incrementing, I think the following short snippet is what it can be condensed to. From 18 instructions down to 8, not too bad. If it works.
ldr r1, __data_init_start
ldr r2, __data_start
ldr r3, __data_end
sub r4, r3, r2
.L1:
ldr r3, [r1], #4 ; safe to re-use r3 here
str r3, [r2], #4
subs r4, r4, #4
bne L1
May be that platform guarantees that writing to an unsigned int * you may change an unsigned int * value (i.e. it doesn't take advantage of type mismatch aliasing rules).
Then the code is inefficient because e is a global variable and the generated code logic must take in account that writing to *d may change the value of e.
Making at least e a local should solve this problem (most compilers know that aliasing a local that never had its address taken is not possible from a C point of view).