Optimization level breaks the C code order - c

I have the following 5 lines of code and want these lines to be executed in exactly this order with -O2 or -O3:
PORT->Group[GPIO_PORTB].OUTCLR.reg = (volatile uint32_t) 1 << 9;
TC3->COUNT16.COUNT.reg = (volatile uint16_t) 0;
TC3->COUNT16.CC[0].reg = (volatile uint16_t) vusb_driver->in_data->bitlength;
SERCOM0->SPI.DATA.reg = (volatile uint32_t) 0x54;
DMAC->Channel[USB_SEND_SD_DMA_CH].CHCTRLA.reg = (volatile uint8_t) DMAC_CHCTRLA_ENABLE;
If I optimize with -O2 or -O3 the code breaks at line 264, because this line must be executed before line 265:
261: PORT->Group[GPIO_PORTB].OUTCLR.reg = (volatile uint32_t) 1 << 9;
200001EE ldr r1, [pc, #84]
263: TC3->COUNT16.CC[0].reg = (volatile uint16_t) vusb_driver->in_data->bitlength;
200001F0 ldr r5, [pc, #84]
264: SERCOM0->SPI.DATA.reg = (volatile uint32_t) 0x54;
200001F2 ldr r4, [pc, #88]
265: DMAC->Channel[USB_SEND_SD_DMA_CH].CHCTRLA.reg = (volatile uint8_t) DMAC_CHCTRLA_ENABLE;
200001F4 ldr r0, [pc, #88]
261: PORT->Group[GPIO_PORTB].OUTCLR.reg = (volatile uint32_t) 1 << 9;
200001F6 mov.w r6, #512
200001FA str.w r6, [r1, #148]
262: TC3->COUNT16.COUNT.reg = (volatile uint16_t) 0;
200001FE strh r2, [r3, #20]
263: TC3->COUNT16.CC[0].reg = (volatile uint16_t) vusb_driver->in_data->bitlength;
20000200 ldr r2, [r5]
20000202 ldr r2, [r2, #20]
20000204 ldrh.w r2, [r2, #72]
20000208 strh r2, [r3, #28]
264: SERCOM0->SPI.DATA.reg = (volatile uint32_t) 0x54;
2000020A movs r5, #84
265: DMAC->Channel[USB_SEND_SD_DMA_CH].CHCTRLA.reg = (volatile uint8_t) DMAC_CHCTRLA_ENABLE;
2000020C movs r2, #2
264: SERCOM0->SPI.DATA.reg = (volatile uint32_t) 0x54;
2000020E str r5, [r4, #40]

Your use of volatile is incorrect; you should define the destination objects as volatile to ensure they are written to exactly in program order.
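For illustration, a minimal sketch of what that means, assuming the SAM device headers from the question (in real CMSIS/ASF headers the .reg fields are normally already declared volatile, in which case the casts on the values can simply be dropped): the volatile qualifier has to apply to the object being written, not to the value being assigned.
/* Sketch only: qualify the destination lvalue, not the assigned value. */
*(volatile uint32_t *)&PORT->Group[GPIO_PORTB].OUTCLR.reg = 1u << 9;
*(volatile uint16_t *)&TC3->COUNT16.COUNT.reg = 0;
*(volatile uint16_t *)&TC3->COUNT16.CC[0].reg = vusb_driver->in_data->bitlength;
*(volatile uint32_t *)&SERCOM0->SPI.DATA.reg = 0x54;
*(volatile uint8_t *)&DMAC->Channel[USB_SEND_SD_DMA_CH].CHCTRLA.reg = DMAC_CHCTRLA_ENABLE;
With all five destinations volatile, the compiler must emit the five stores in program order (though it may still interleave the address and value setup instructions, as in the listing above).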

The compiler is - in the regular case - allowed to reorder instructions if the effect of the execution is the same (this is the "as-if rule"). So you must do one of the following:
1. Indicate to it that the instructions will have a different effect (e.g. by making the relevant .reg variables volatile; or through aliasing of pointers etc.).
2. Use some sort of compiler-specific directives to control its behavior (a sketch of this follows below).
3. Not compile, i.e. generate your machine code in a different manner.
Specifically, if you choose the first option, you must explain - to the compiler, and perhaps to yourself - why it is that
line [264] must be executed before line 265
In what sense "must" it be executed before 265? Who would notice? It's likely that a concrete answer to this question is something you could use to force the desired order of execution.
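As a sketch of the second option (the exact placement is only illustrative): GCC's empty inline assembly statement with a "memory" clobber is a pure compile-time barrier - it emits no instructions, but the compiler may not move memory accesses across it.
SERCOM0->SPI.DATA.reg = 0x54;
__asm volatile ("" ::: "memory"); /* compiler barrier: keeps the two stores in this order */
DMAC->Channel[USB_SEND_SD_DMA_CH].CHCTRLA.reg = DMAC_CHCTRLA_ENABLE;
Note this only constrains the compiler; if ordering at the bus level between different peripherals also matters, an explicit data barrier such as CMSIS's __DMB() may be needed in addition.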

Related

Cortex M3, STM32, thumb2: My inc and dec operations are not atomic, but should be. What's wrong here?

I need thread-safe idx++ and idx-- operations.
Disabling interrupts, i.e. using critical sections, is one thing, but I want
to understand why my operations are not atomic, as I expected.
Here is the C code with the generated assembly, shown using Segger Ozone:
(Also note that the addresses of the variables show that the 32-bit variable is 32-bit aligned in memory, and the 8- and 16-bit variables are both 16-bit aligned.)
volatile static U8 dbgIdx8 = 1000U;
volatile static U16 dbgIdx16 = 1000U;
volatile static U32 dbgIdx32 = 1000U;
dbgIdx8 ++;
080058BE LDR R3, [PC, #48]
080058C0 LDRB R3, [R3]
080058C2 UXTB R3, R3
080058C4 ADDS R3, #1
080058C6 UXTB R2, R3
080058C8 LDR R3, [PC, #36]
080058CA STRB R2, [R3]
dbgIdx16 ++;
080058CC LDR R3, [PC, #36]
080058CE LDRH R3, [R3]
080058D0 UXTH R3, R3
080058D2 ADDS R3, #1
080058D4 UXTH R2, R3
080058D6 LDR R3, [PC, #28]
080058D8 STRH R2, [R3]
dbgIdx32 ++;
080058DA LDR R3, [PC, #28]
080058DC LDR R3, [R3]
080058DE ADDS R3, #1
080058E0 LDR R2, [PC, #20]
080058E2 STR R3, [R2]
There is no guarantee that ++ and -- are atomic. If you need guaranteed atomicity, you will have to find some other way.
As @StaceyGirl points out in a comment, you might be able to use the facilities of <stdatomic.h>. For example, there's an atomic_fetch_add function defined, which acts like the postfix ++ you're striving for. There's an atomic_fetch_sub, too.
Alternatively, you might have some compiler intrinsics available to you for performing an atomic increment in some processor-specific way.
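A minimal sketch of the <stdatomic.h> route (the counter names are hypothetical stand-ins for dbgIdx8/16/32; on ARMv7-M, GCC typically lowers these calls to LDREX/STREX retry loops):
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_least8_t  idx8  = 0;
static atomic_uint_least16_t idx16 = 0;
static atomic_uint_least32_t idx32 = 0;

void bump(void)
{
    atomic_fetch_add(&idx8, 1);   /* atomic equivalent of idx8++  */
    atomic_fetch_add(&idx16, 1);  /* atomic equivalent of idx16++ */
    atomic_fetch_sub(&idx32, 1);  /* atomic equivalent of idx32-- */
}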
ARM Cortex cores do not modify memory in place. All memory modifications are performed as RMW (read-modify-write) sequences, which are not atomic by default.
But the Cortex-M3 has special instructions to lock access to a memory location: LDREX and STREX. https://developer.arm.com/documentation/100235/0004/the-cortex-m33-instruction-set/memory-access-instructions/ldaex-and-stlex
You can use them directly in C code, without touching the assembly, by using intrinsics.
Do not use data types shorter than 32 bits in performance-sensitive code (you want to hold the exclusive access for as short a time as possible); operations on shorter data types usually add extra code.
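A sketch of the intrinsics approach (assuming the CMSIS core header is available, which provides __LDREXW/__STREXW; the helper name is made up, and U32 is assumed to be uint32_t):
#include <stdint.h>
#include "cmsis_gcc.h"  /* assumption: CMSIS core header providing __LDREXW/__STREXW */

/* Atomic increment via a load-exclusive / store-exclusive retry loop.
 * __STREXW returns 0 on success and 1 if exclusivity was lost, in which
 * case the read-modify-write is simply retried. */
static inline void atomic_inc_u32(volatile uint32_t *p)
{
    uint32_t v;
    do {
        v = __LDREXW(p) + 1u;
    } while (__STREXW(v, p) != 0u);
}
Usage would be e.g. atomic_inc_u32(&dbgIdx32); for the 32-bit counter above.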

What is the difference between initializing a struct vs anonymous struct?

Consider the following code:
typedef struct {
    uint32_t a : 1;
    uint32_t b : 1;
    uint32_t c : 30;
} Foo1;

void Fun1(void) {
    volatile Foo1 foo = (Foo1) {.a = 1, .b = 1, .c = 1};
}
This general pattern comes up quite a bit when using bit fields to punch registers in embedded applications. Using a recent ARM gcc compiler (e.g. gcc 8.2 or gcc 7.3) with -O3 and -std=c11, I get the following assembly:
sub sp, sp, #8
movs r3, #7
str r3, [sp, #4]
add sp, sp, #8
bx lr
This is pretty much exactly what you want and expect; the compound literal is not volatile, so the initialization of each bit can be combined into the constant 0x7 before it is finally stored to the volatile variable (register) foo.
However, it's convenient to be able to manipulate the raw contents of the entire register which gives rise to an anonymous implementation of the bit field:
typedef union {
    struct {
        uint32_t a : 1;
        uint32_t b : 1;
        uint32_t c : 30;
    };
    uint32_t raw;
} Foo2;

void Fun2(void) {
    volatile Foo2 foo = (Foo2) {.a = 1, .b = 1, .c = 1};
}
Unfortunately, the resulting assembly is not so optimized:
sub sp, sp, #8
ldr r3, [sp, #4]
orr r3, r3, #1
str r3, [sp, #4]
ldr r3, [sp, #4]
orr r3, r3, #2
str r3, [sp, #4]
ldr r3, [sp, #4]
and r3, r3, #7
orr r3, r3, #4
str r3, [sp, #4]
add sp, sp, #8
bx lr
For a densely packed register, a read-modify-write of each bit can get... expensive.
What's special about the union / anonymous struct that prevents gcc from optimizing the initialization like the pure struct?
I hope I can answer your question. The problem is that the GCC compiler has some predefined rules for C-to-assembly conversion, and when you make a union of a struct and a uint32_t it doesn't have a predefined pattern, which is why the resulting assembly is not optimized as well as in the first example.
I suggest using a cast to solve the problem.
typedef struct {
    uint32_t a : 1;
    uint32_t b : 1;
    uint32_t c : 30;
} Foo1;

void Fun1(void) {
    volatile Foo1 foo = (Foo1) {.a = 1, .b = 1, .c = 1};
    volatile uint32_t rawA = *((uint32_t *) &foo);
    volatile uint32_t rawB = *((uint32_t *) &foo + 1);
}
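As a further sketch (not part of the answer above, and the function name is made up): since Foo2 already has the raw member, you can also build the whole value at compile time and let the volatile object receive a single store, restoring the Fun1-style code. The bit positions assume ARM GCC's usual low-to-high bit-field allocation on a little-endian target:
void Fun2_raw(void) {
    /* a = bit 0, b = bit 1, c starts at bit 2, so {.a=1, .b=1, .c=1} == 0x7 */
    volatile Foo2 foo = (Foo2) { .raw = (1u << 0) | (1u << 1) | (1u << 2) };
    (void)foo;
}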

ARM Cortex M7 unaligned access and memcpy

I am compiling this code for a Cortex M7 using GCC:
// copy manually
void write_test_plain(uint8_t * ptr, uint32_t value)
{
    *ptr++ = (uint8_t)(value);
    *ptr++ = (uint8_t)(value >> 8);
    *ptr++ = (uint8_t)(value >> 16);
    *ptr++ = (uint8_t)(value >> 24);
}
// copy using memcpy
void write_test_memcpy(uint8_t * ptr, uint32_t value)
{
    void *px = (void*)&value;
    memcpy(ptr, px, 4);
}
int main(void)
{
    extern uint8_t data[];
    extern uint32_t value;
    // i added some offsets to data to
    // make sure the compiler cannot
    // assume it's aligned in memory
    write_test_plain(data + 2, value);
    __asm volatile("": : :"memory"); // just to split inlined calls
    write_test_memcpy(data + 5, value);
    ... do something with data ...
}
And I get the following Thumb2 assembly with -O2:
// write_test_plain(data + 2, value);
800031c: 2478 movs r4, #120 ; 0x78
800031e: 2056 movs r0, #86 ; 0x56
8000320: 2134 movs r1, #52 ; 0x34
8000322: 2212 movs r2, #18 ; 0x12
8000324: 759c strb r4, [r3, #22]
8000326: 75d8 strb r0, [r3, #23]
8000328: 7619 strb r1, [r3, #24]
800032a: 765a strb r2, [r3, #25]
// write_test_memcpy(data + 5, value);
800032c: 4ac4 ldr r2, [pc, #784] ; (8000640 <main+0x3a0>)
800032e: 923b str r2, [sp, #236] ; 0xec
8000330: 983b ldr r0, [sp, #236] ; 0xec
8000332: f8c3 0019 str.w r0, [r3, #25]
Can someone explain how the memcpy version works? This looks like an inlined 32-bit store to the destination address, but isn't this a problem, since data + 5 is most certainly not aligned to a 4-byte boundary?
Is this perhaps some optimization which happens due to some undefined behavior in my source?
For Cortex-M processors unaligned loads and stores of bytes, half-words, and words are usually allowed and most compilers use this when generating code unless they are instructed not to. If you want to prevent gcc from assuming the unaligned accesses are OK, you can use the -mno-unaligned-access compiler flag.
If you specify this flag, gcc will no longer inline the call to memcpy, and write_test_memcpy looks like this:
write_test_memcpy(unsigned char*, unsigned long):
push {lr}
sub sp, sp, #12
movs r2, #4
add r3, sp, #8
str r1, [r3, #-4]!
mov r1, r3
bl memcpy
add sp, sp, #12
ldr pc, [sp], #4
Cortex-M7, M4, M3, and M33 support unaligned access.
Cortex-M0, M0+, and M23 do not support unaligned access.
However, you can disable support for unaligned access on the Cortex-M7 by setting the UNALIGN_TRP bit in the Configuration and Control Register, after which any unaligned access will generate a UsageFault.
From the compiler's perspective, the default is to generate assembly code that performs unaligned accesses unless you disable this with the compile flag -mno-unaligned-access.
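If you want unaligned accesses to fault loudly instead of silently working, here is a minimal sketch of setting UNALIGN_TRP (assuming the CMSIS core header for the device is included, which defines SCB and SCB_CCR_UNALIGN_TRP_Msk; the function name is made up):
/* Any subsequent unaligned load/store will raise a UsageFault. */
void enable_unaligned_trap(void)
{
    SCB->CCR |= SCB_CCR_UNALIGN_TRP_Msk;
    __DSB();  /* make sure the CCR write has taken effect */
    __ISB();  /* flush the pipeline so following code sees the new setting */
}
Combine this with -mno-unaligned-access so the compiler does not generate unaligned accesses of its own.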

Overeager struct packing warnings with `__attribute__((packed))`?

I'm implementing a binary logging system on a 32-bit ARM MCU (Atmel SAM4SD32C, a Cortex-M4/ARMv7E-M part), and am in the process of designing my data structures. My goal is to describe the log format as a packed struct, and simply union the struct with a char array for writing to the log device (an SD card, via FatFS, in this case).
Basically, I have a very simple struct:
typedef struct adc_samples_t
{
    int32_t  adc_samples[6];
    uint64_t acq_time;
    int8_t   overrun;
    uint8_t  padding_1;
    uint8_t  padding_2;
    uint8_t  padding_3;
} __attribute__((packed, aligned(4))) adc_sample_set;
Now, my architecture is 32 bits, so as far as I understand, access to any member /other/ than the overrun member should be 32-bit aligned, and therefore not incur extra overhead. Furthermore, the aligned(4) attribute should force any instantiation of the struct to be on a 32-bit aligned boundary.
However, compiling the above struct definition produces a pile of warnings:
In file included from ../src/main.c:13:0:
<snip>\src\fs\fs-logger.h(10,10): warning: packed attribute causes inefficient alignment for 'adc_samples' [-Wattributes]
int32_t adc_samples[6];
^
<snip>\src\fs\fs-logger.h(12,11): warning: packed attribute causes inefficient alignment for 'acq_time' [-Wattributes]
uint64_t acq_time;
As far as I know (and I'm now realizing this is a big assumption), I assumed that 32-bit alignment was all that was needed for optimal member placement on 32-bit ARM. Oddly, the only members that do /not/ produce warnings are the overrun and padding_X members, the cause of which I don't understand. (Ok, the ARM docs say byte accesses are always aligned.)
What, exactly, is going on here? I assume (possibly incorrectly) that the struct instantiation will be on 4-byte boundaries. Does the compiler require a broader alignment (on 8-byte boundaries)?
Edit: Ok, digging into the ARM docs (the magic words here were "Cortex-M4 alignment"):
3.3.5. Address alignment
An aligned access is an operation where a word-aligned address is used for a word, dual word, or multiple word access, or where a halfword-aligned address is used for a halfword access. Byte accesses are always aligned.
The Cortex-M4 processor supports unaligned access only for the following instructions:
LDR, LDRT
LDRH, LDRHT
LDRSH, LDRSHT
STR, STRT
STRH, STRHT
All other load and store instructions generate a UsageFault exception if they perform an unaligned access, and therefore their accesses must be address aligned. For more information about UsageFaults see Fault handling.
Unaligned accesses are usually slower than aligned accesses. In addition, some memory regions might not support unaligned accesses. Therefore, ARM recommends that programmers ensure that accesses are aligned. To trap
accidental generation of unaligned accesses, use the UNALIGN_TRP bit in the Configuration and Control Register, see Configuration and Control Register.
How is my 32-bit aligned value not word-aligned? The user guide defines "Aligned" as the following:
Aligned
A data item stored at an address that is divisible by the
number of bytes that defines the data size is said to be aligned.
Aligned words and halfwords have addresses that are divisible by four
and two respectively. The terms word-aligned and halfword-aligned
therefore stipulate addresses that are divisible by four and two
respectively.
I assumed that 32-bit alignment was all that was needed for optimal component positioning on 32-bit ARM
It is.
But you don't have 32-bit alignment here [in the originally-asked question] because:
Specifying the packed attribute for struct and union types is equivalent to specifying the packed attribute on each of the structure or union members.
given that:
The packed attribute specifies that a variable or structure field should have the smallest possible alignment—one byte for a variable, and one bit for a field, unless you specify a larger value with the aligned attribute.
In other words, if you want a packed structure to still have some minimum alignment after you've forced the alignment of all members, and thus of the type itself, down to nothing, you need to say so explicitly. The fact that that might not actually make -Wpacked shut up is a different matter - GCC may well just spit that warning out reflexively before it actually considers any further alignment modifiers.
Note that in terms of serialisation, you don't necessarily need to pack it anyway. The members fit in 9 words exactly, so the only compiler padding anywhere is an extra word at the end to round the total size up to 40 bytes, since acq_time forces the struct to a natural alignment of 8. Unless you want to operate on a whole array of these things at once, you can get away with simply ignoring that and still treating the members as one 36-byte chunk.
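A sketch of that serialisation approach (the helper name is made up, and the layout checks just restate the numbers from the paragraph above): drop packed, assert the layout at compile time, and write only the 36 meaningful bytes of each record.
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct adc_samples_t
{
    int32_t  adc_samples[6];
    uint64_t acq_time;
    int8_t   overrun;
    uint8_t  padding_1;
    uint8_t  padding_2;
    uint8_t  padding_3;
} adc_sample_set;  /* no packed attribute */

/* Members occupy exactly 36 bytes; the total size rounds up to 40 because
 * acq_time gives the struct a natural alignment of 8. */
_Static_assert(offsetof(adc_sample_set, padding_3) + 1 == 36, "unexpected member layout");
_Static_assert(sizeof(adc_sample_set) == 40, "unexpected trailing padding");

/* Copy one record into a log buffer, ignoring the trailing pad word. */
static size_t serialize_sample(uint8_t *dst, const adc_sample_set *s)
{
    memcpy(dst, s, 36);
    return 36;
}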
Ok, at this point, I'm somewhat confident that the warning is being emitted in error.
I have a statically defined instance of the struct, and at one point I zero it:
adc_sample_set running_average;
int accumulated_samples;

inline void zero_average_buf(void)
{
    accumulated_samples = 0;
    running_average.adc_samples[0] = 0;
    running_average.adc_samples[1] = 0;
    running_average.adc_samples[2] = 0;
    running_average.adc_samples[3] = 0;
    running_average.adc_samples[4] = 0;
    running_average.adc_samples[5] = 0;
    running_average.overrun = 0;
    running_average.acq_time = 0;
}
The disassembly for the function is the follows:
{
004005F8 push {r3, lr}
accumulated_samples = 0;
004005FA movs r2, #0
004005FC ldr r3, [pc, #36]
004005FE str r2, [r3]
running_average.adc_samples[0] = 0;
00400600 ldr r3, [pc, #36]
00400602 str r2, [r3]
running_average.adc_samples[1] = 0;
00400604 str r2, [r3, #4]
running_average.adc_samples[2] = 0;
00400606 str r2, [r3, #8]
running_average.adc_samples[3] = 0;
00400608 str r2, [r3, #12]
running_average.adc_samples[4] = 0;
0040060A str r2, [r3, #16]
running_average.adc_samples[5] = 0;
0040060C str r2, [r3, #20]
running_average.overrun = 0;
0040060E strb.w r2, [r3, #32]
running_average.acq_time = 0;
00400612 movs r0, #0
00400614 movs r1, #0
00400616 strd r0, r1, [r3, #24]
Note that r3 in the above is 0x2001ef70, which is indeed 4-byte aligned. r2 is the literal value 0.
The str opcode requires 4-byte alignment. The strd opcode only requires 4-byte alignment as well, since it appears to really be two sequential 4-byte operations, though I don't know how it actually works internally.
If I intentionally mis-align my struct, to force the slow-path copy operation:
typedef struct adc_samples_t
{
    int8_t   overrun;
    int32_t  adc_samples[6];
    uint64_t acq_time;
    uint8_t  padding_1;
    uint8_t  padding_2;
    uint8_t  padding_3;
} __attribute__((packed, aligned(8))) adc_sample_set;
I get the following assembly:
{
00400658 push {r3, lr}
accumulated_samples = 0;
0040065A movs r3, #0
0040065C ldr r2, [pc, #84]
0040065E str r3, [r2]
running_average.adc_samples[0] = 0;
00400660 ldr r2, [pc, #84]
00400662 strb r3, [r2, #1]
00400664 strb r3, [r2, #2]
00400666 strb r3, [r2, #3]
00400668 strb r3, [r2, #4]
running_average.adc_samples[1] = 0;
0040066A strb r3, [r2, #5]
0040066C strb r3, [r2, #6]
0040066E strb r3, [r2, #7]
00400670 strb r3, [r2, #8]
running_average.adc_samples[2] = 0;
00400672 strb r3, [r2, #9]
00400674 strb r3, [r2, #10]
00400676 strb r3, [r2, #11]
00400678 strb r3, [r2, #12]
running_average.adc_samples[3] = 0;
0040067A strb r3, [r2, #13]
0040067C strb r3, [r2, #14]
0040067E strb r3, [r2, #15]
00400680 strb r3, [r2, #16]
running_average.adc_samples[4] = 0;
00400682 strb r3, [r2, #17]
00400684 strb r3, [r2, #18]
00400686 strb r3, [r2, #19]
00400688 strb r3, [r2, #20]
running_average.adc_samples[5] = 0;
0040068A strb r3, [r2, #21]
0040068C strb r3, [r2, #22]
0040068E strb r3, [r2, #23]
00400690 strb r3, [r2, #24]
running_average.overrun = 0;
00400692 mov r1, r2
00400694 strb r3, [r1], #25
running_average.acq_time = 0;
00400698 strb r3, [r2, #25]
0040069A strb r3, [r1, #1]
0040069C strb r3, [r1, #2]
0040069E strb r3, [r1, #3]
004006A0 strb r3, [r1, #4]
004006A2 strb r3, [r1, #5]
004006A4 strb r3, [r1, #6]
004006A6 strb r3, [r1, #7]
So, pretty clearly, I'm getting the proper aligned-copy behaviour with my original struct definition, despite the compiler apparently incorrectly warning that it will result in inefficient accesses.

What's faster on ARM? MUL or (SHIFT + SUB)?

Which is faster on ARM?
*p++ = (*p >> 7) * 255;
or
*p++ = ((*p >> 7) << 8) - 1
Essentially what I'm doing here is taking an 8-bit word and setting it to 255 if >= 128, and 0 otherwise.
If p is char, the statement below is just an assignment of 255.
*p++ = ((*p >> 7) << 8) - 1
If p is int, then of course it is a different story.
You can use GCC Explorer to see what the assembly output looks like. Below is apparently what you get from Linaro's arm-linux-gnueabi-g++ 4.6.3 with the -O2 -march=armv7-a flags:
void test(char *p) {
    *p++ = (*p >> 7) * 255;
}
void test2(char *p) {
    *p++ = ((*p >> 7) << 8) - 1;
}
void test2_i(int *p) {
    *p++ = ((*p >> 7) << 8) - 1;
}
void test3(char *p) {
    *p++ = *p >= 128 ? ~0 : 0;
}
void test4(char *p) {
    *p++ = *p & 0x80 ? ~0 : 0;
}
creates
test(char*):
ldrb r3, [r0, #0] # zero_extendqisi2
sbfx r3, r3, #7, #1
strb r3, [r0, #0]
bx lr
test2(char*):
movs r3, #255
strb r3, [r0, #0]
bx lr
test2_i(int*):
ldr r3, [r0, #0]
asrs r3, r3, #7
lsls r3, r3, #8
subs r3, r3, #1
str r3, [r0, #0]
bx lr
test3(char*):
ldrsb r3, [r0, #0]
cmp r3, #0
ite lt
movlt r3, #255
movge r3, #0
strb r3, [r0, #0]
bx lr
test4(char*):
ldrsb r3, [r0, #0]
cmp r3, #0
ite lt
movlt r3, #255
movge r3, #0
strb r3, [r0, #0]
bx lr
If you are not just nitpicking, the best approach is always to check the assembly of the generated code for such details. People tend to overestimate compilers; I agree that most of the time they do a great job, but I guess it is anyone's right to question the generated code.
You should also be careful when interpreting instructions, since they won't always map to a cycle-accurate listing, due to architectural features of the core such as out-of-order execution and superscalar, deep pipelines. So the shortest sequence of instructions might not always win.
Well, to answer the question in your title: on ARM, a SHIFT+SUB can be done in a single instruction with 1-cycle latency, while a MUL usually has multi-cycle latency. So the shift will usually be faster.
To answer the implied question of what C code to write for this, generally you are best off with the simplest code that expresses your intent:
*p++ = *p >= 128 ? ~0 : 0; // set byte to all ones iff >= 128
or
*p++ = *p & 0x80 ? ~0 : 0; // set byte to all ones based on the MSB
this will generally get converted by the compiler into the fastest way of doing it, whether that is a shift sequence or a conditional move.
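For instance, a minimal sketch of the recommended form applied to a whole buffer (a hypothetical function, not from the answer); the compiler is free to pick a shift trick, a conditional move, or vectorized code for the loop, depending on the target and optimization level:
#include <stddef.h>
#include <stdint.h>

/* Threshold each byte to 0x00 or 0xFF depending on whether it is >= 128. */
void threshold(uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = (p[i] >= 128) ? 0xFFu : 0x00u;
}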
Despite the fact that your compiler can optimize the line quite well (and reading the assembly will tell you what is really done), you can refer to this page to find out exactly how many cycles a MUL can take.

Resources